Note: Both R and Python code are shown below in one place for simplicity. You will be running the R code in R Markdown, and the Python code in Jupyter Notebooks. The individual keys (.Rmd and .ipynb) are in the session repo.
tidyverse, here, and janitor packagespandas with import pandas as pdWe’ll be switching back and forth between R Markdown and Jupyter Notebooks today to do some basic data wrangling in dplyr and pandas. Remember: Cmd + Tab to switch between applications.
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse)
library(here)
library(janitor)
# Don't forget pandas
# Note: if ModuleNotFoundError: No module named 'pandas', install pandas!
# Install in Terminal: pip install pandas
import pandas as pd
The data files we’ll use today are in the data subfolder of the project. They are:
wb_indicators.csv: a CSV containing data for select development indicators for countries in the World Bank database, from 2001 - 2020wb_indicators_metadata.csv: a CSV containing metadata information for the development indicatorswb_indicators <- read_csv(here("data", "wb_indicators.csv"), na = c("..", ""))
wb_metadata <- read_csv(here("data", "wb_indicators_metadata.csv"))
wb_indicators = pd.read_csv('data/wb_indicators.csv', na_values = c("..", ""))
wb_metadata = pd.read_csv('data/wb_indicators_metadata.csv')
ALWAYS ALWAYS ALWAYS look at what you’ve done.
We see that years - a single variable - is spread out over multiple columns. We’ll want to reshape this data frame from wide-to-long format so that year is only in a single column to make it Tidy.
tidyr::pivot_longer()wb_indicators_long <- wb_indicators %>%
pivot_longer(cols = '2001 [YR2001]':'2020 [YR2020]', # Which columns to squish
names_to = "year", # The original column names are squished into this column
values_to = "indicator_value") # The values are correctly aligned in this column
# Check it out (always):
# View(wb_indicators_long) # Why do I comment this out or run it in the Console?
melt()wb_indicators_long = wb_indicators.melt(id_vars = ['Country Name', 'Country Code', 'Series Name', 'Series Code'],
var_name = 'year',
value_name = 'indicator_value')
# Check it out:
wb_indicators_long.head
# Check the dimensions:
## <bound method NDFrame.head of Country Name ... indicator_value
## 0 Afghanistan ... 9.51
## 1 Afghanistan ... NaN
## 2 Afghanistan ... 810.00
## 3 Afghanistan ... NaN
## 4 Afghanistan ... NaN
## ... ... ... ...
## 26695 NaN ... NaN
## 26696 NaN ... NaN
## 26697 NaN ... NaN
## 26698 Data from database: World Development Indicators ... NaN
## 26699 Last Updated: 07/21/2021 ... NaN
##
## [26700 rows x 6 columns]>
wb_indicators_long.shape
## (26700, 6)
We can see that the year is stored in a weird format (e.g. 2018 [YR2018]) that’s going to make our life difficult if we want to, for example, use year as a number to plot changes in the indicator values over time.
Let’s separate the information in the year column so that we can just keep the nice 4-digit year as a number.
tidyr::separate()wb_data_clean <- wb_indicators_long %>%
tidyr::separate(col = year, into = c("year", "year_chr"), sep = " ") %>%
dplyr::select(-year_chr, -'Country Code', -'Series Code') # This drops a few redundant columns (caution here...best to leave things if you're not sure)
head(wb_data_clean)
## # A tibble: 6 x 4
## `Country Name` `Series Name` year indicator_value
## <chr> <chr> <chr> <dbl>
## 1 Afghanistan Access to clean fuels and technologies f… 2001 9.51
## 2 Afghanistan Access to clean fuels and technologies f… 2002 10.4
## 3 Afghanistan Access to clean fuels and technologies f… 2003 11.5
## 4 Afghanistan Access to clean fuels and technologies f… 2004 12.4
## 5 Afghanistan Access to clean fuels and technologies f… 2005 13.5
## 6 Afghanistan Access to clean fuels and technologies f… 2006 14.8
str.split()wb_indicators_long[['year','year_chr']] = wb_indicators_long.year.str.split(expand=True)
list(wb_indicators_long) # Cool, now there's year and year_chr
# Let's also drop some variables we won't use:
## ['Country Name', 'Country Code', 'Series Name', 'Series Code', 'year', 'indicator_value', 'year_chr']
wb_data_clean = wb_indicators_long.drop(['Country Code', 'Series Code', 'year_chr'], axis = 1)
list(wb_data_clean)
## ['Country Name', 'Series Name', 'year', 'indicator_value']
Our data still aren’t quite tidy! Why?
Notice that we have multiple variables that were measured (our different indicators) all in a single column. This is a scenario where there are multiple variables in a single column. To be Tidy, we want each variable to live in just one column.
tidyr::pivot_wider()That means we’re going to need to widen this data. We’ll do that using tidyr::pivot_wider().
wb_data_tidy <- wb_data_clean %>%
tidyr::drop_na('Series Name') %>%
tidyr::pivot_wider(names_from = 'Series Name', values_from = indicator_value) # Pivot to wide format
head(wb_data_tidy)
## # A tibble: 6 x 7
## `Country Name` year `Access to clean fue… `Access to electr… `CO2 emissions …
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 2001 9.51 NA 810
## 2 Afghanistan 2002 10.4 NA 1100
## 3 Afghanistan 2003 11.5 NA 1350
## 4 Afghanistan 2004 12.4 NA 1130
## 5 Afghanistan 2005 13.5 22.3 1640
## 6 Afghanistan 2006 14.8 28.1 1940
## # … with 2 more variables: Fossil fuel energy consumption (% of total) <dbl>,
## # Level of water stress: freshwater withdrawal as a proportion of available freshwater resources <dbl>
wb_data_tidy = wb_data_clean.pivot_table(index = ['Country Name', 'year'],
columns = 'Series Name',
values = 'indicator_value')
# Indexes back to normal column entries:
wb_data_tidy = wb_data_tidy.reset_index()
# Check the data frame now:
wb_data_tidy.head() # I feel better.
## Series Name Country Name ... Level of water stress: freshwater withdrawal as a proportion of available freshwater resources
## 0 Afghanistan ... NaN
## 1 Afghanistan ... 54.757019
## 2 Afghanistan ... NaN
## 3 Afghanistan ... NaN
## 4 Afghanistan ... NaN
##
## [5 rows x 7 columns]
wb_data_tidy.head()
## Series Name Country Name ... Level of water stress: freshwater withdrawal as a proportion of available freshwater resources
## 0 Afghanistan ... NaN
## 1 Afghanistan ... 54.757019
## 2 Afghanistan ... NaN
## 3 Afghanistan ... NaN
## 4 Afghanistan ... NaN
##
## [5 rows x 7 columns]
Our column names are now a nightmare. We can reassign all names as follows (in this order):
names(wb_data_tidy) <- c("country", "year", "access_clean_fuels_pp", "access_electricity_pp", "co2_emissions_kt", "fossil_fuel_cons_pt", "water_stress")
head(wb_data_tidy)
## # A tibble: 6 x 7
## country year access_clean_fuels_pp access_electricity_pp co2_emissions_kt
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 2001 9.51 NA 810
## 2 Afghanistan 2002 10.4 NA 1100
## 3 Afghanistan 2003 11.5 NA 1350
## 4 Afghanistan 2004 12.4 NA 1130
## 5 Afghanistan 2005 13.5 22.3 1640
## 6 Afghanistan 2006 14.8 28.1 1940
## # … with 2 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>
Or, we’ll learn how to use dplyr::rename() soon…
wb_data_tidy = wb_data_tidy.rename(columns = {'Country Name': 'country', 'Access to clean fuels and technologies for cooking (% of population)': 'access_fuels_pp',
'Access to electricity (% of population)': 'access_electricity_pp',
'CO2 emissions (kt)': 'co2_emissions_kt',
'Fossil fuel energy consumption (% of total)': 'fossil_fuel_consumption_pt',
'Level of water stress: freshwater withdrawal as a proportion of available freshwater resources': 'water_stress'})
wb_data_tidy.head()
## Series Name country year ... fossil_fuel_consumption_pt water_stress
## 0 Afghanistan 2001 ... NaN NaN
## 1 Afghanistan 2002 ... NaN 54.757019
## 2 Afghanistan 2003 ... NaN NaN
## 3 Afghanistan 2004 ... NaN NaN
## 4 Afghanistan 2005 ... NaN NaN
##
## [5 rows x 7 columns]
dplyrdplyr::filter()Use dplyr::filter() to keep or exclude rows based on your conditions.
Some examples:
I only want to keep observations from the dataset above for “United States”:
us_wb <- wb_data_tidy %>%
filter(country == "United States")
head(us_wb)
## # A tibble: 6 x 7
## country year access_clean_fuels_… access_electricity_… co2_emissions_kt
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 United States 2001 100 100 5749250
## 2 United States 2002 100 100 5594160
## 3 United States 2003 100 100 5659630
## 4 United States 2004 100 100 5740030
## 5 United States 2005 100 100 5756080
## 6 United States 2006 100 100 5656580
## # … with 2 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>
wb_data_tidy[(wb_data_tidy['country'] == "United States")]
I want to keep observations if the country is “United States” OR “Mexico” OR “Brazil”:
us_mx_bz <- wb_data_tidy %>%
filter(country %in% c("United States", "Mexico", "Brazil"))
unique(us_mx_bz$country)
## [1] "Brazil" "Mexico" "United States"
wb_data_tidy[(wb_data_tidy['country'] == "United States") | (wb_data_tidy['country'] == "Mexico") | (wb_data_tidy['country'] == "Brazil")]
I want to keep observations if the country is “Guatemala” OR the year is 2020:
guatemala_or_2020 <- wb_data_tidy %>%
filter(country == "Guatemala" | year == 2020)
I want to keep observations in the year is 2010 and CO2 emissions is greater than 10,000kt:
co2_2010_over10k <- wb_data_tidy %>%
filter(year == 2010, co2_emissions_kt > 10000)
dplyr::select()Select (or exclude) columns using dplyr::select(). Put a minus sign (-) in front of a column name or position to exclude it.
nicaragua_co2 <- wb_data_tidy %>%
filter(country == "Nicaragua") %>%
select(year, co2_emissions_kt)
Exclude the water_stress and access_electricity_pp columns:
wb_subset <- wb_data_tidy %>%
select(-c(water_stress, access_electricity_pp))
# Keep columns country, year, and co2_emissions_kt
wb_data_tidy[['country','year','co2_emissions_kt']]
# Exclude column access_fuels_pp
wb_data_tidy.drop('access_fuels_pp', axis = 1) # axis = 1 here indicates drop COLUMN (0 = rows)
dplyr::rename()Use dplyr::rename() to rename one or more columns, in the order new_name = old_name.
wb_newnames <- wb_data_tidy %>%
rename(elec = access_electricity_pp, co2 = co2_emissions_kt)
dplyr::mutate()Use dplyr::mutate() to add a new column, or transform an existing one.
Example: to change the class of a variable (careful - this overwrites the existing column!)
# Check the class of year:
class(wb_data_tidy$year) # Character! Let's change it.
## [1] "character"
wb_data_tidy <- wb_data_tidy %>%
mutate(year = as.numeric(year))
# Check again:
class(wb_data_tidy$year)
## [1] "numeric"
Example: Add a new column that has co2 in TONS (instead of kilotons):
wb_co2_tons <- wb_data_tidy %>%
mutate(co2_tons = co2_emissions_kt * 1000)
head(wb_co2_tons)
## # A tibble: 6 x 8
## country year access_clean_fuels_pp access_electricity_pp co2_emissions_kt
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 2001 9.51 NA 810
## 2 Afghanistan 2002 10.4 NA 1100
## 3 Afghanistan 2003 11.5 NA 1350
## 4 Afghanistan 2004 12.4 NA 1130
## 5 Afghanistan 2005 13.5 22.3 1640
## 6 Afghanistan 2006 14.8 28.1 1940
## # … with 3 more variables: fossil_fuel_cons_pt <dbl>, water_stress <dbl>,
## # co2_tons <dbl>
co2_tons = wb_data_tidy.assign(co2_t = wb_data_tidy['co2_emissions_kt'] * 1000)
dplyr::group_by() %>% summarize()To perform one or more functions on data by group, returning a nice summary table, use group_by + summarize().
Example: find the total reported co2 emissions (kt) for 2001 - 2020 from each country:
co2_total <- wb_data_tidy %>%
group_by(country) %>%
summarize(total_co2_kt = sum(co2_emissions_kt, na.rm = TRUE))
co2_sum = wb_data_tidy.groupby('country')['co2_emissions_kt'].agg(['sum'])
Example: find the total co2 emissions (kt) across all country for each year from 2001 - 2020:
co2_annual <- wb_data_tidy %>%
group_by(year) %>%
summarize(annual_total_co2_kt = sum(co2_emissions_kt, na.rm = TRUE))
# Let's plot this for fun:
ggplot(data = co2_annual, aes(x = year, y = annual_total_co2_kt)) +
geom_line()
# Always look. What is happening here? Always always always look at your data.
We’ve learned a bunch of different useful functions for data wrangling in the {tidyverse}. But this may still feel a bit tedious.
Often, for readability and efficiency, we may want to string together different steps into a sequence. We can do that using the pipe operator (%>% in the tidyverse, or |> is the sparkly new native pipe in R).
Let’s take our raw data that we initially read in:
head(wb_indicators)
## # A tibble: 6 x 24
## `Country Name` `Country Code` `Series Name` `Series Code` `2001 [YR2001]`
## <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AFG Access to clean f… EG.CFT.ACCS.… 9.51
## 2 Afghanistan AFG Access to electri… EG.ELC.ACCS.… NA
## 3 Afghanistan AFG CO2 emissions (kt) EN.ATM.CO2E.… 810
## 4 Afghanistan AFG Fossil fuel energ… EG.USE.COMM.… NA
## 5 Afghanistan AFG Level of water st… ER.H2O.FWST.… NA
## 6 Albania ALB Access to clean f… EG.CFT.ACCS.… 42.7
## # … with 19 more variables: 2002 [YR2002] <dbl>, 2003 [YR2003] <dbl>,
## # 2004 [YR2004] <dbl>, 2005 [YR2005] <dbl>, 2006 [YR2006] <dbl>,
## # 2007 [YR2007] <dbl>, 2008 [YR2008] <dbl>, 2009 [YR2009] <dbl>,
## # 2010 [YR2010] <dbl>, 2011 [YR2011] <dbl>, 2012 [YR2012] <dbl>,
## # 2013 [YR2013] <dbl>, 2014 [YR2014] <dbl>, 2015 [YR2015] <dbl>,
## # 2016 [YR2016] <dbl>, 2017 [YR2017] <dbl>, 2018 [YR2018] <dbl>,
## # 2019 [YR2019] <dbl>, 2020 [YR2020] <lgl>
Let’s tidy this up in a single sequence, with the pipe operator between.
REMEMBER: Look at what you’ve done after every step in a sequence.
wb_tidy <- wb_indicators %>%
tidyr::pivot_longer(cols = `2001 [YR2001]`:`2020 [YR2020]`,
names_to = "year",
values_to = "indicator_value") %>%
tidyr::separate(col = year, into = c("year", "year_chr"), sep = " ") %>%
dplyr::select(-'Country Code', -'Series Code', -year_chr) %>%
tidyr::drop_na('Series Name') %>%
tidyr::pivot_wider(names_from = 'Series Name', values_from = 'indicator_value') %>%
dplyr::rename(country = 'Country Name',
year = 'year',
clean_fuels = 'Access to clean fuels and technologies for cooking (% of population)',
access_elec = 'Access to electricity (% of population)',
co2 = 'CO2 emissions (kt)',
fossil_fuels = 'Fossil fuel energy consumption (% of total)',
water_stress = 'Level of water stress: freshwater withdrawal as a proportion of available freshwater resources') %>%
dplyr::mutate(year = as.numeric(year))
# Recall you can get the names of columns easily using names(df)
Some questions: How long should a piped sequence be before I store the output and start a new piped sequence? Can a piped sequence be too long? Can I always pipe into everything?