Note: Both R and Python code are shown below in one place for simplicity. You will be running the R code in R Markdown, and the Python code in Jupyter Notebooks. The individual keys (.Rmd and .ipynb) are in the session repo.
tidyverse, here, and janitor packagespandas with import pandas as pdWe’ll be switching back and forth between R Markdown and Jupyter Notebooks today to do some basic data wrangling in dplyr and pandas. Remember: Cmd + Tab to switch between applications.
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
library(here)
library(janitor)# Don't forget pandas
# Note: if ModuleNotFoundError: No module named 'pandas', install pandas! 
# Install in Terminal: pip install pandas
import pandas as pdThe data files we’ll use today are in the data subfolder of the project. They are:
wb_indicators.csv: a CSV containing data for select development indicators for countries in the World Bank database, from 2001 - 2020wb_indicators_metadata.csv: a CSV containing metadata information for the development indicatorswb_indicators <- read_csv(here("data", "wb_indicators.csv"), na = c("..", ""))
wb_metadata <- read_csv(here("data", "wb_indicators_metadata.csv"))wb_indicators = pd.read_csv('data/wb_indicators.csv', na_values = c("..", ""))
wb_metadata = pd.read_csv('data/wb_indicators_metadata.csv')ALWAYS ALWAYS ALWAYS look at what you’ve done.
We see that years - a single variable - is spread out over multiple columns. We’ll want to reshape this data frame from wide-to-long format so that year is only in a single column to make it Tidy.
tidyr::pivot_longer()wb_indicators_long <- wb_indicators %>% 
  pivot_longer(cols = '2001 [YR2001]':'2020 [YR2020]', # Which columns to squish
               names_to = "year", # The original column names are squished into this column
               values_to = "indicator_value") # The values are correctly aligned in this column
# Check it out (always):
# View(wb_indicators_long) # Why do I comment this out or run it in the Console? melt()wb_indicators_long = wb_indicators.melt(id_vars = ['Country Name', 'Country Code', 'Series Name', 'Series Code'],
                                       var_name = 'year',
                                       value_name = 'indicator_value')
# Check it out: 
wb_indicators_long.head
# Check the dimensions: ## <bound method NDFrame.head of                                            Country Name  ... indicator_value
## 0                                           Afghanistan  ...            9.51
## 1                                           Afghanistan  ...             NaN
## 2                                           Afghanistan  ...          810.00
## 3                                           Afghanistan  ...             NaN
## 4                                           Afghanistan  ...             NaN
## ...                                                 ...  ...             ...
## 26695                                               NaN  ...             NaN
## 26696                                               NaN  ...             NaN
## 26697                                               NaN  ...             NaN
## 26698  Data from database: World Development Indicators  ...             NaN
## 26699                          Last Updated: 07/21/2021  ...             NaN
## 
## [26700 rows x 6 columns]>wb_indicators_long.shape## (26700, 6)We can see that the year is stored in a weird format (e.g. 2018 [YR2018]) that’s going to make our life difficult if we want to, for example, use year as a number to plot changes in the indicator values over time.
Let’s separate the information in the year column so that we can just keep the nice 4-digit year as a number.
tidyr::separate()wb_data_clean <- wb_indicators_long %>% 
  tidyr::separate(col = year, into = c("year", "year_chr"), sep = " ") %>% 
  dplyr::select(-year_chr, -'Country Code', -'Series Code') # This drops a few redundant columns (caution here...best to leave things if you're not sure)
head(wb_data_clean)## # A tibble: 6 x 4
##   `Country Name` `Series Name`                             year  indicator_value
##   <chr>          <chr>                                     <chr>           <dbl>
## 1 Afghanistan    Access to clean fuels and technologies f… 2001             9.51
## 2 Afghanistan    Access to clean fuels and technologies f… 2002            10.4 
## 3 Afghanistan    Access to clean fuels and technologies f… 2003            11.5 
## 4 Afghanistan    Access to clean fuels and technologies f… 2004            12.4 
## 5 Afghanistan    Access to clean fuels and technologies f… 2005            13.5 
## 6 Afghanistan    Access to clean fuels and technologies f… 2006            14.8str.split()wb_indicators_long[['year','year_chr']] = wb_indicators_long.year.str.split(expand=True)
list(wb_indicators_long) # Cool, now there's year and year_chr
# Let's also drop some variables we won't use:## ['Country Name', 'Country Code', 'Series Name', 'Series Code', 'year', 'indicator_value', 'year_chr']wb_data_clean = wb_indicators_long.drop(['Country Code', 'Series Code', 'year_chr'], axis = 1)
list(wb_data_clean)## ['Country Name', 'Series Name', 'year', 'indicator_value']Our data still aren’t quite tidy! Why?
Notice that we have multiple variables that were measured (our different indicators) all in a single column. This is a scenario where there are multiple variables in a single column. To be Tidy, we want each variable to live in just one column.
tidyr::pivot_wider()That means we’re going to need to widen this data. We’ll do that using tidyr::pivot_wider().
wb_data_tidy <- wb_data_clean %>% 
  tidyr::drop_na('Series Name') %>% 
  tidyr::pivot_wider(names_from = 'Series Name', values_from = indicator_value) # Pivot to wide format
head(wb_data_tidy)## # A tibble: 6 x 7
##   `Country Name` year  `Access to clean fue… `Access to electr… `CO2 emissions …
##   <chr>          <chr>                 <dbl>              <dbl>            <dbl>
## 1 Afghanistan    2001                   9.51               NA                810
## 2 Afghanistan    2002                  10.4                NA               1100
## 3 Afghanistan    2003                  11.5                NA               1350
## 4 Afghanistan    2004                  12.4                NA               1130
## 5 Afghanistan    2005                  13.5                22.3             1640
## 6 Afghanistan    2006                  14.8                28.1             1940
## # … with 2 more variables: Fossil fuel energy consumption (% of total) <dbl>,
## #   Level of water stress: freshwater withdrawal as a proportion of available freshwater resources <dbl>wb_data_tidy = wb_data_clean.pivot_table(index = ['Country Name', 'year'],
                                        columns = 'Series Name',
                                        values = 'indicator_value')
                                        
# Indexes back to normal column entries:                                         
wb_data_tidy = wb_data_tidy.reset_index()
# Check the data frame now:
wb_data_tidy.head() # I feel better.
                                        ## Series Name Country Name  ... Level of water stress: freshwater withdrawal as a proportion of available freshwater resources
## 0            Afghanistan  ...                                                NaN                                            
## 1            Afghanistan  ...                                          54.757019                                            
## 2            Afghanistan  ...                                                NaN                                            
## 3            Afghanistan  ...                                                NaN                                            
## 4            Afghanistan  ...                                                NaN                                            
## 
## [5 rows x 7 columns]wb_data_tidy.head()## Series Name Country Name  ... Level of water stress: freshwater withdrawal as a proportion of available freshwater resources
## 0            Afghanistan  ...                                                NaN                                            
## 1            Afghanistan  ...                                          54.757019                                            
## 2            Afghanistan  ...                                                NaN                                            
## 3            Afghanistan  ...                                                NaN                                            
## 4            Afghanistan  ...                                                NaN                                            
## 
## [5 rows x 7 columns]Our column names are now a nightmare. We can reassign all names as follows (in this order):
names(wb_data_tidy) <- c("country", "year", "access_clean_fuels_pp", "access_electricity_pp", "co2_emissions_kt", "fossil_fuel_cons_pt", "water_stress")
head(wb_data_tidy)## # A tibble: 6 x 7
##   country     year  access_clean_fuels_pp access_electricity_pp co2_emissions_kt
##   <chr>       <chr>                 <dbl>                 <dbl>            <dbl>
## 1 Afghanistan 2001                   9.51                  NA                810
## 2 Afghanistan 2002                  10.4                   NA               1100
## 3 Afghanistan 2003                  11.5                   NA               1350
## 4 Afghanistan 2004                  12.4                   NA               1130
## 5 Afghanistan 2005                  13.5                   22.3             1640
## 6 Afghanistan 2006                  14.8                   28.1             1940
## # … with 2 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>Or, we’ll learn how to use dplyr::rename() soon…
wb_data_tidy = wb_data_tidy.rename(columns = {'Country Name': 'country', 'Access to clean fuels and technologies for cooking (% of population)': 'access_fuels_pp',
 'Access to electricity (% of population)': 'access_electricity_pp',
 'CO2 emissions (kt)': 'co2_emissions_kt',
 'Fossil fuel energy consumption (% of total)': 'fossil_fuel_consumption_pt',
 'Level of water stress: freshwater withdrawal as a proportion of available freshwater resources': 'water_stress'})
 
wb_data_tidy.head()## Series Name      country  year  ...  fossil_fuel_consumption_pt  water_stress
## 0            Afghanistan  2001  ...                         NaN           NaN
## 1            Afghanistan  2002  ...                         NaN     54.757019
## 2            Afghanistan  2003  ...                         NaN           NaN
## 3            Afghanistan  2004  ...                         NaN           NaN
## 4            Afghanistan  2005  ...                         NaN           NaN
## 
## [5 rows x 7 columns]dplyrdplyr::filter()Use dplyr::filter() to keep or exclude rows based on your conditions.
Some examples:
I only want to keep observations from the dataset above for “United States”:
us_wb <- wb_data_tidy %>% 
  filter(country == "United States")
head(us_wb)## # A tibble: 6 x 7
##   country       year  access_clean_fuels_… access_electricity_… co2_emissions_kt
##   <chr>         <chr>                <dbl>                <dbl>            <dbl>
## 1 United States 2001                   100                  100          5749250
## 2 United States 2002                   100                  100          5594160
## 3 United States 2003                   100                  100          5659630
## 4 United States 2004                   100                  100          5740030
## 5 United States 2005                   100                  100          5756080
## 6 United States 2006                   100                  100          5656580
## # … with 2 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>wb_data_tidy[(wb_data_tidy['country'] == "United States")]I want to keep observations if the country is “United States” OR “Mexico” OR “Brazil”:
us_mx_bz <- wb_data_tidy %>% 
  filter(country %in% c("United States", "Mexico", "Brazil"))
unique(us_mx_bz$country)## [1] "Brazil"        "Mexico"        "United States"wb_data_tidy[(wb_data_tidy['country'] == "United States") | (wb_data_tidy['country'] == "Mexico") | (wb_data_tidy['country'] == "Brazil")]I want to keep observations if the country is “Guatemala” OR the year is 2020:
guatemala_or_2020 <- wb_data_tidy %>% 
  filter(country == "Guatemala" | year == 2020)I want to keep observations in the year is 2010 and CO2 emissions is greater than 10,000kt:
co2_2010_over10k <- wb_data_tidy %>% 
  filter(year == 2010, co2_emissions_kt > 10000)dplyr::select()Select (or exclude) columns using dplyr::select(). Put a minus sign (-) in front of a column name or position to exclude it.
nicaragua_co2 <- wb_data_tidy %>% 
  filter(country == "Nicaragua") %>% 
  select(year, co2_emissions_kt)Exclude the water_stress and access_electricity_pp columns:
wb_subset <- wb_data_tidy %>% 
  select(-c(water_stress, access_electricity_pp))# Keep columns country, year, and co2_emissions_kt
wb_data_tidy[['country','year','co2_emissions_kt']]
# Exclude column access_fuels_pp
wb_data_tidy.drop('access_fuels_pp', axis = 1) # axis = 1 here indicates drop COLUMN (0 = rows)dplyr::rename()Use dplyr::rename() to rename one or more columns, in the order new_name = old_name.
wb_newnames <- wb_data_tidy %>% 
  rename(elec = access_electricity_pp, co2 = co2_emissions_kt)dplyr::mutate()Use dplyr::mutate() to add a new column, or transform an existing one.
Example: to change the class of a variable (careful - this overwrites the existing column!)
# Check the class of year:
class(wb_data_tidy$year) # Character! Let's change it. ## [1] "character"wb_data_tidy <- wb_data_tidy %>% 
  mutate(year = as.numeric(year))
# Check again: 
class(wb_data_tidy$year)## [1] "numeric"Example: Add a new column that has co2 in TONS (instead of kilotons):
wb_co2_tons <- wb_data_tidy %>% 
  mutate(co2_tons = co2_emissions_kt * 1000)
head(wb_co2_tons)## # A tibble: 6 x 8
##   country      year access_clean_fuels_pp access_electricity_pp co2_emissions_kt
##   <chr>       <dbl>                 <dbl>                 <dbl>            <dbl>
## 1 Afghanistan  2001                  9.51                  NA                810
## 2 Afghanistan  2002                 10.4                   NA               1100
## 3 Afghanistan  2003                 11.5                   NA               1350
## 4 Afghanistan  2004                 12.4                   NA               1130
## 5 Afghanistan  2005                 13.5                   22.3             1640
## 6 Afghanistan  2006                 14.8                   28.1             1940
## # … with 3 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>,
## #   co2_tons <dbl>co2_tons = wb_data_tidy.assign(co2_t = wb_data_tidy['co2_emissions_kt'] * 1000)dplyr::group_by() %>% summarize()To perform one or more functions on data by group, returning a nice summary table, use group_by + summarize().
Example: find the total reported co2 emissions (kt) for 2001 - 2020 from each country:
co2_total <- wb_data_tidy %>% 
  group_by(country) %>% 
  summarize(total_co2_kt = sum(co2_emissions_kt, na.rm = TRUE))co2_sum = wb_data_tidy.groupby('country')['co2_emissions_kt'].agg(['sum'])Example: find the total co2 emissions (kt) across all country for each year from 2001 - 2020:
co2_annual <- wb_data_tidy %>% 
  group_by(year) %>% 
  summarize(annual_total_co2_kt = sum(co2_emissions_kt, na.rm = TRUE))
# Let's plot this for fun: 
ggplot(data = co2_annual, aes(x = year, y = annual_total_co2_kt)) +
  geom_line()# Always look. What is happening here? Always always always look at your data. We’ve learned a bunch of different useful functions for data wrangling in the {tidyverse}. But this may still feel a bit tedious.
Often, for readability and efficiency, we may want to string together different steps into a sequence. We can do that using the pipe operator (%>% in the tidyverse, or |> is the sparkly new native pipe in R).
Let’s take our raw data that we initially read in:
head(wb_indicators)## # A tibble: 6 x 24
##   `Country Name` `Country Code` `Series Name`      `Series Code` `2001 [YR2001]`
##   <chr>          <chr>          <chr>              <chr>                   <dbl>
## 1 Afghanistan    AFG            Access to clean f… EG.CFT.ACCS.…            9.51
## 2 Afghanistan    AFG            Access to electri… EG.ELC.ACCS.…           NA   
## 3 Afghanistan    AFG            CO2 emissions (kt) EN.ATM.CO2E.…          810   
## 4 Afghanistan    AFG            Fossil fuel energ… EG.USE.COMM.…           NA   
## 5 Afghanistan    AFG            Level of water st… ER.H2O.FWST.…           NA   
## 6 Albania        ALB            Access to clean f… EG.CFT.ACCS.…           42.7 
## # … with 19 more variables: 2002 [YR2002] <dbl>, 2003 [YR2003] <dbl>,
## #   2004 [YR2004] <dbl>, 2005 [YR2005] <dbl>, 2006 [YR2006] <dbl>,
## #   2007 [YR2007] <dbl>, 2008 [YR2008] <dbl>, 2009 [YR2009] <dbl>,
## #   2010 [YR2010] <dbl>, 2011 [YR2011] <dbl>, 2012 [YR2012] <dbl>,
## #   2013 [YR2013] <dbl>, 2014 [YR2014] <dbl>, 2015 [YR2015] <dbl>,
## #   2016 [YR2016] <dbl>, 2017 [YR2017] <dbl>, 2018 [YR2018] <dbl>,
## #   2019 [YR2019] <dbl>, 2020 [YR2020] <lgl>Let’s tidy this up in a single sequence, with the pipe operator between.
REMEMBER: Look at what you’ve done after every step in a sequence.
wb_tidy <- wb_indicators %>% 
  tidyr::pivot_longer(cols = `2001 [YR2001]`:`2020 [YR2020]`,
               names_to = "year",
               values_to = "indicator_value") %>% 
  tidyr::separate(col = year, into = c("year", "year_chr"), sep = " ") %>% 
  dplyr::select(-'Country Code', -'Series Code', -year_chr) %>% 
  tidyr::drop_na('Series Name') %>% 
  tidyr::pivot_wider(names_from = 'Series Name', values_from = 'indicator_value') %>% 
  dplyr::rename(country = 'Country Name', 
                year = 'year', 
                clean_fuels = 'Access to clean fuels and technologies for cooking (% of population)',
                access_elec = 'Access to electricity (% of population)',
                co2 = 'CO2 emissions (kt)',
                fossil_fuels = 'Fossil fuel energy consumption (% of total)',
                water_stress = 'Level of water stress: freshwater withdrawal as a proportion of available freshwater resources') %>% 
  dplyr::mutate(year = as.numeric(year))
  
# Recall you can get the names of columns easily using names(df)Some questions: How long should a piped sequence be before I store the output and start a new piped sequence? Can a piped sequence be too long? Can I always pipe into everything?