library(tidyverse)
library(palmerpenguins)
library(lubridate)
library(kableExtra)

Part 0: Setup

  • Create a new version-controlled R Project called eds221-m2021-day7-interactive
  • Create a new R Markdown document in the project
  • Attach the following packages in the setup chunk:
    • tidyverse
    • palmerpenguins
    • lubridate

Part 1: Data wrangling refresher

Refresher 1: Starting with the penguins dataset in the palmerpenguins package, write a single piped sequence in which you:

  1. Only include penguins at Biscoe and Dream islands
  2. Remove the year and sex variables
  3. Add a new column called body_mass_kg, with penguin mass converted from grams to kilograms
  4. Rename the island variable to location

Refresher 2: Staring with the penguins dataset in the palmerpenguins package, write a single piped sequence in which you:

  1. Limit to only Adelie penguins
  2. Remove any observations where flipper_length_mm is NA (hint: !is.na())
  3. Group the data by sex
  4. Find the mean (mean()), standard deviation (sd) and sample size (n) of flipper_length_mm for male and female Adelie penguins, returned in a nice summary table

Part 2: Wrangling continued - joins of different flavors

Mutating joins

Let’s create some data to practice and clarify different types of joins.

  • Install the datapasta package in R.
  • Quit & restart RStudio.
  • Copy the content of the first table below.
  • With your cursor in a code chunk, go up to Addins > Datapasta > as tribble
  • Assign the code to object name animals
  • Similarly, copy and datapasta the second table, storing as sites
location species maturity
lagoon bobcat adult
bluff coyote juvenile
creek fox adult
oaks squirrel juvenile
bluff bobcat adult
location full_site_name jurisdiction
beach Goleta Beach SB City
lagoon UCSB Lagoon UCSB
bluff Ellwood Mesa SB City
oaks Fremont Campground USFS

dplyr::full_join()

The dplyr::full_join() function adds columns from the second df to the first df. It is the safest join - nothing is excluded. When in doubt, full join.

full_join_example <- full_join(animals, sites)
## Joining, by = "location"
full_join_example
## # A tibble: 6 × 5
##   location species  maturity full_site_name     jurisdiction
##   <chr>    <chr>    <chr>    <chr>              <chr>       
## 1 lagoon   bobcat   adult    UCSB Lagoon        UCSB        
## 2 bluff    coyote   juvenile Ellwood Mesa       SB City     
## 3 creek    fox      adult    <NA>               <NA>        
## 4 oaks     squirrel juvenile Fremont Campground USFS        
## 5 bluff    bobcat   adult    Ellwood Mesa       SB City     
## 6 beach    <NA>     <NA>     Goleta Beach       SB City

dplyr::left_join()

The dplyr::left_join(x,y) function keeps everything in x, and only joins from y (by matching key) if they have a match in x. Otherwise they’re dropped.

left_join_example <- left_join(animals, sites)
## Joining, by = "location"
left_join_example
## # A tibble: 5 × 5
##   location species  maturity full_site_name     jurisdiction
##   <chr>    <chr>    <chr>    <chr>              <chr>       
## 1 lagoon   bobcat   adult    UCSB Lagoon        UCSB        
## 2 bluff    coyote   juvenile Ellwood Mesa       SB City     
## 3 creek    fox      adult    <NA>               <NA>        
## 4 oaks     squirrel juvenile Fremont Campground USFS        
## 5 bluff    bobcat   adult    Ellwood Mesa       SB City

dplyr::right_join()

Opposite of a left_join().

right_join_example <- right_join(animals, sites)
## Joining, by = "location"
right_join_example
## # A tibble: 5 × 5
##   location species  maturity full_site_name     jurisdiction
##   <chr>    <chr>    <chr>    <chr>              <chr>       
## 1 lagoon   bobcat   adult    UCSB Lagoon        UCSB        
## 2 bluff    coyote   juvenile Ellwood Mesa       SB City     
## 3 oaks     squirrel juvenile Fremont Campground USFS        
## 4 bluff    bobcat   adult    Ellwood Mesa       SB City     
## 5 beach    <NA>     <NA>     Goleta Beach       SB City

dplyr::inner_join()

Rows are only kept if the key matches in both x and y (intersection).

inner_join_example <- inner_join(animals, sites)
## Joining, by = "location"
inner_join_example
## # A tibble: 4 × 5
##   location species  maturity full_site_name     jurisdiction
##   <chr>    <chr>    <chr>    <chr>              <chr>       
## 1 lagoon   bobcat   adult    UCSB Lagoon        UCSB        
## 2 bluff    coyote   juvenile Ellwood Mesa       SB City     
## 3 oaks     squirrel juvenile Fremont Campground USFS        
## 4 bluff    bobcat   adult    Ellwood Mesa       SB City

Filtering joins

We’ll just look at two filtering join functions (from dplyr documentation):

  • semi_join() “return[s] all rows from x with a match in y”
  • anti_join() “return[s] all rows from x without a match in y”
semi_join_example <- semi_join(animals, sites)
## Joining, by = "location"
semi_join_example
## # A tibble: 4 × 3
##   location species  maturity
##   <chr>    <chr>    <chr>   
## 1 lagoon   bobcat   adult   
## 2 bluff    coyote   juvenile
## 3 oaks     squirrel juvenile
## 4 bluff    bobcat   adult
anti_join_example <- anti_join(animals, sites)
## Joining, by = "location"
anti_join_example
## # A tibble: 1 × 3
##   location species maturity
##   <chr>    <chr>   <chr>   
## 1 creek    fox     adult

Part 3: easier dates and times with {lubridate}

Witness some lubridate magic:

my_date <- "03-15-1998"
lubridate::mdy(my_date)
## [1] "1998-03-15"
my_date <- "11/7/2003"
lubridate::mdy(my_date)
## [1] "2003-11-07"
my_date <- "08-Jun-1974"
lubridate::dmy(my_date)
## [1] "1974-06-08"
my_date <- "19610518"
lubridate::ymd(my_date)
## [1] "1961-05-18"

ISO 8601 for everyone!

Wait can it fail?

Yup. What happens if we give lubridate something that doesn’t make sense?

lubridate::mdy("1942-08-30")
## Warning: All formats failed to parse. No formats found.
## [1] NA
# Nope.
lubridate::dmy("09/12/84")
## [1] "1984-12-09"
# Wait...but couldn't that be correct? 
# Sure, you and your computer wouldn't know.
# SO KNOW YOUR DATE FORMAT.

In other words, parsing dates can fail in multiple ways:

  • The parsing can fail because the values don’t make sense based on the order you gave it (e.g. “wait you’re telling me that the month is 17? Nope.”)
  • The parsing can work, but you messed up with the order - so the code runs, but the date stored is wrong. This is more dangerous.

KNOW YOUR DATA.

It even makes it relatively easy to deal with times & time zones

See the Olson Names: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

Use ymd-hm to convert this to a date time that R will understand.

Question: Why am I using ymd_hm here?

time <- "2020-08-12 11:18"
time <- ymd_hm(time)
time # Note that the default is UTC
## [1] "2020-08-12 11:18:00 UTC"
class(time) # Class is POSIXct
## [1] "POSIXct" "POSIXt"
# Convert to PDT:
with_tz(time, "America/Los_Angeles")
## [1] "2020-08-12 04:18:00 PDT"
# Convert to AEST:
with_tz(time, "Australia/Sydney")
## [1] "2020-08-12 21:18:00 AEST"
# pull just the time
my_time <- lubridate::ymd_hms(time)
week(my_time)
## [1] 33
day(my_time)
## [1] 12
hour(my_time)
## [1] 11
minute(my_time)
## [1] 18
second(my_time)
## [1] 0

More time examples:

# Get your system time
comp_time <- Sys.time()
# Convert comp_time to Europe/Belgrade time: 
with_tz(comp_time, "Europe/Belgrade")
## [1] "2021-08-17 19:03:52 CEST"

Pulling pieces

You can also get information about your dates using nice built-in lubridate functions.

teddy_bday <- lubridate::ymd("20170615")

# Return the date 
teddy_bday
## [1] "2017-06-15"
# What day of the week? 
wday(teddy_bday, label = TRUE)
## [1] Thu
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
# What week of the year? 
week(teddy_bday)
## [1] 24
# Pull the year
year(teddy_bday)
## [1] 2017
# Month
month(teddy_bday)
## [1] 6

Date pieces as new columns

This can be useful if you want to group your day in different ways for analyses or exploration. Use lubridate functions, in combination with mutate(), to add new columns containing separate pieces of the date, e.g. year, month, day in three separate columns).

For example, let’s just make a little data frame to try this out:

urchin_counts <- tribble(
  ~date, ~species, ~size_mm,
  "10/3/2020", "purple", 55,
  "10/4/2020", "red", 48,
  "11/17/2020", "red", 67
)

urchin_counts_ymd <- urchin_counts %>% 
  mutate(date = lubridate::mdy(date)) %>% 
  mutate(year = year(date),
         month = month(date),
         day = day(date))

urchin_counts_ymd
## # A tibble: 3 × 6
##   date       species size_mm  year month   day
##   <date>     <chr>     <dbl> <dbl> <dbl> <int>
## 1 2020-10-03 purple       55  2020    10     3
## 2 2020-10-04 red          48  2020    10     4
## 3 2020-11-17 red          67  2020    11    17
# And then we could use group_by() to find different summary values by group, for example.

Find durations of times

day_1 <- lubridate::ymd("2020-01-06")
day_2 <- lubridate::ymd("2020-05-18")
day_3 <- lubridate::ymd("2020-05-19")

# Create a time interval
time_interval <- interval(day_1, day_2)

# Check the length in weeks
time_length(time_interval, "week")
## [1] 19
# Check the length in years
time_length(time_interval, "year")
## [1] 0.363388

Wrangling strings with stringr

Use str_detect() to detect a string pattern

Returns TRUE or FALSE based on whether the pattern is or is not detected.

my_string <- "Teddy loves eating salmon and socks."

# Does the pattern "love" exist within the string?
my_string %>% str_detect("love")
## [1] TRUE
# Does the pattern "pup" exist within the string?
my_string %>% str_detect("pup")
## [1] FALSE

This also works on vectors (…think ahead - data frame columns!). It is case sensitive (by default):

my_string <- c("burrito", "fish taco", "Taco salad")

# Does the vector element contain the pattern "fish"?
my_string %>% str_detect("fish")
## [1] FALSE  TRUE FALSE

It is most powerful when used in combination with other functions.

For example, let’s look at the starwars dataset (in dplyr):

head(starwars)
## # A tibble: 6 × 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk…    172    77 blond       fair       blue            19   male  mascu…
## 2 C-3PO       167    75 <NA>        gold       yellow         112   none  mascu…
## 3 R2-D2        96    32 <NA>        white, bl… red             33   none  mascu…
## 4 Darth V…    202   136 none        white      yellow          41.9 male  mascu…
## 5 Leia Or…    150    49 brown       light      brown           19   fema… femin…
## 6 Owen La…    178   120 brown, grey light      blue            52   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

I want to only keep rows where the name column contains the pattern “Skywalker.” Remember: what’s the function to keep or exclude rows based on our conditions? It’s dplyr::filter()! I can use that with str_detect() to get the Skywalker family characters for me:

skywalkers <- starwars %>% 
  filter(str_detect(name, "Skywalker"))

skywalkers
## # A tibble: 3 × 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
## 1 Luke Sk…    172    77 blond      fair       blue            19   male   mascu…
## 2 Anakin …    188    84 blond      fair       blue            41.9 male   mascu…
## 3 Shmi Sk…    163    NA black      fair       brown           72   female femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Use str_replace() to replace a string pattern with something else

firewalkers <- starwars %>% 
  mutate(name = str_replace(name, pattern = "Sky", replacement = "Fire"))

head(firewalkers)
## # A tibble: 6 × 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Fi…    172    77 blond       fair       blue            19   male  mascu…
## 2 C-3PO       167    75 <NA>        gold       yellow         112   none  mascu…
## 3 R2-D2        96    32 <NA>        white, bl… red             33   none  mascu…
## 4 Darth V…    202   136 none        white      yellow          41.9 male  mascu…
## 5 Leia Or…    150    49 brown       light      brown           19   fema… femin…
## 6 Owen La…    178   120 brown, grey light      blue            52   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

You can imagine this may be really helpful if there is a repeated spelling error, annoying syntax, or otherwise that you want to update throughout your data frame.

Note: this is very different from text mining and analysis, which involves analyzing textual information to gain insights about patterns, trends, and sentiments - look forward to that in EDS 242!

Use str_trim() or str_squish() to remove excess white space

feedback <- c(" I ate     some   nachos", "Wednesday morning   ")

# Removes leading, trailing & duplicate interior whitespaces
str_squish(feedback)
## [1] "I ate some nachos" "Wednesday morning"
# Removes leading & trailing whitespaces
str_trim(feedback)
## [1] "I ate     some   nachos" "Wednesday morning"

Convert cases

str_to_lower(feedback)
## [1] " i ate     some   nachos" "wednesday morning   "
str_to_upper(feedback)
## [1] " I ATE     SOME   NACHOS" "WEDNESDAY MORNING   "
str_to_title(feedback)
## [1] " I Ate     Some   Nachos" "Wednesday Morning   "

Count matches in a string

str_count(feedback, pattern = "nachos")
## [1] 1 0

End