dplyr::inner_join()
stringr
str_detect()
to detect a string patternstr_replace()
to replace a string pattern with something elsestr_trim()
or str_squish()
to remove excess white spacelibrary(tidyverse)
library(palmerpenguins)
library(lubridate)
library(kableExtra)
eds221-m2021-day7-interactive
tidyverse
palmerpenguins
lubridate
Refresher 1: Starting with the penguins
dataset in the palmerpenguins
package, write a single piped sequence in which you:
year
and sex
variablesbody_mass_kg
, with penguin mass converted from grams to kilogramsisland
variable to location
Refresher 2: Staring with the penguins
dataset in the palmerpenguins
package, write a single piped sequence in which you:
flipper_length_mm
is NA
(hint: !is.na()
)sex
mean()
), standard deviation (sd
) and sample size (n
) of flipper_length_mm for male and female Adelie penguins, returned in a nice summary tableLet’s create some data to practice and clarify different types of joins.
datapasta
package in R.animals
sites
location | species | maturity |
---|---|---|
lagoon | bobcat | adult |
bluff | coyote | juvenile |
creek | fox | adult |
oaks | squirrel | juvenile |
bluff | bobcat | adult |
location | full_site_name | jurisdiction |
---|---|---|
beach | Goleta Beach | SB City |
lagoon | UCSB Lagoon | UCSB |
bluff | Ellwood Mesa | SB City |
oaks | Fremont Campground | USFS |
The dplyr::full_join()
function adds columns from the second df to the first df. It is the safest join - nothing is excluded. When in doubt, full join.
full_join_example <- full_join(animals, sites)
## Joining, by = "location"
full_join_example
## # A tibble: 6 × 5
## location species maturity full_site_name jurisdiction
## <chr> <chr> <chr> <chr> <chr>
## 1 lagoon bobcat adult UCSB Lagoon UCSB
## 2 bluff coyote juvenile Ellwood Mesa SB City
## 3 creek fox adult <NA> <NA>
## 4 oaks squirrel juvenile Fremont Campground USFS
## 5 bluff bobcat adult Ellwood Mesa SB City
## 6 beach <NA> <NA> Goleta Beach SB City
The dplyr::left_join(x,y)
function keeps everything in x, and only joins from y (by matching key) if they have a match in x. Otherwise they’re dropped.
left_join_example <- left_join(animals, sites)
## Joining, by = "location"
left_join_example
## # A tibble: 5 × 5
## location species maturity full_site_name jurisdiction
## <chr> <chr> <chr> <chr> <chr>
## 1 lagoon bobcat adult UCSB Lagoon UCSB
## 2 bluff coyote juvenile Ellwood Mesa SB City
## 3 creek fox adult <NA> <NA>
## 4 oaks squirrel juvenile Fremont Campground USFS
## 5 bluff bobcat adult Ellwood Mesa SB City
Opposite of a left_join()
.
right_join_example <- right_join(animals, sites)
## Joining, by = "location"
right_join_example
## # A tibble: 5 × 5
## location species maturity full_site_name jurisdiction
## <chr> <chr> <chr> <chr> <chr>
## 1 lagoon bobcat adult UCSB Lagoon UCSB
## 2 bluff coyote juvenile Ellwood Mesa SB City
## 3 oaks squirrel juvenile Fremont Campground USFS
## 4 bluff bobcat adult Ellwood Mesa SB City
## 5 beach <NA> <NA> Goleta Beach SB City
dplyr::inner_join()
Rows are only kept if the key matches in both x and y (intersection).
inner_join_example <- inner_join(animals, sites)
## Joining, by = "location"
inner_join_example
## # A tibble: 4 × 5
## location species maturity full_site_name jurisdiction
## <chr> <chr> <chr> <chr> <chr>
## 1 lagoon bobcat adult UCSB Lagoon UCSB
## 2 bluff coyote juvenile Ellwood Mesa SB City
## 3 oaks squirrel juvenile Fremont Campground USFS
## 4 bluff bobcat adult Ellwood Mesa SB City
We’ll just look at two filtering join functions (from dplyr documentation):
semi_join()
“return[s] all rows from x with a match in y”anti_join()
“return[s] all rows from x without a match in y”semi_join_example <- semi_join(animals, sites)
## Joining, by = "location"
semi_join_example
## # A tibble: 4 × 3
## location species maturity
## <chr> <chr> <chr>
## 1 lagoon bobcat adult
## 2 bluff coyote juvenile
## 3 oaks squirrel juvenile
## 4 bluff bobcat adult
anti_join_example <- anti_join(animals, sites)
## Joining, by = "location"
anti_join_example
## # A tibble: 1 × 3
## location species maturity
## <chr> <chr> <chr>
## 1 creek fox adult
my_date <- "03-15-1998"
lubridate::mdy(my_date)
## [1] "1998-03-15"
my_date <- "11/7/2003"
lubridate::mdy(my_date)
## [1] "2003-11-07"
my_date <- "08-Jun-1974"
lubridate::dmy(my_date)
## [1] "1974-06-08"
my_date <- "19610518"
lubridate::ymd(my_date)
## [1] "1961-05-18"
ISO 8601 for everyone!
Yup. What happens if we give lubridate something that doesn’t make sense?
lubridate::mdy("1942-08-30")
## Warning: All formats failed to parse. No formats found.
## [1] NA
# Nope.
lubridate::dmy("09/12/84")
## [1] "1984-12-09"
# Wait...but couldn't that be correct?
# Sure, you and your computer wouldn't know.
# SO KNOW YOUR DATE FORMAT.
In other words, parsing dates can fail in multiple ways:
KNOW YOUR DATA.
See the Olson Names: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
Use ymd-hm
to convert this to a date time that R will understand.
Question: Why am I using ymd_hm
here?
time <- "2020-08-12 11:18"
time <- ymd_hm(time)
time # Note that the default is UTC
## [1] "2020-08-12 11:18:00 UTC"
class(time) # Class is POSIXct
## [1] "POSIXct" "POSIXt"
# Convert to PDT:
with_tz(time, "America/Los_Angeles")
## [1] "2020-08-12 04:18:00 PDT"
# Convert to AEST:
with_tz(time, "Australia/Sydney")
## [1] "2020-08-12 21:18:00 AEST"
# pull just the time
my_time <- lubridate::ymd_hms(time)
week(my_time)
## [1] 33
day(my_time)
## [1] 12
hour(my_time)
## [1] 11
minute(my_time)
## [1] 18
second(my_time)
## [1] 0
# Get your system time
comp_time <- Sys.time()
# Convert comp_time to Europe/Belgrade time:
with_tz(comp_time, "Europe/Belgrade")
## [1] "2021-08-17 19:03:52 CEST"
You can also get information about your dates using nice built-in lubridate functions.
teddy_bday <- lubridate::ymd("20170615")
# Return the date
teddy_bday
## [1] "2017-06-15"
# What day of the week?
wday(teddy_bday, label = TRUE)
## [1] Thu
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
# What week of the year?
week(teddy_bday)
## [1] 24
# Pull the year
year(teddy_bday)
## [1] 2017
# Month
month(teddy_bday)
## [1] 6
This can be useful if you want to group your day in different ways for analyses or exploration. Use lubridate functions, in combination with mutate(), to add new columns containing separate pieces of the date, e.g. year, month, day in three separate columns).
For example, let’s just make a little data frame to try this out:
urchin_counts <- tribble(
~date, ~species, ~size_mm,
"10/3/2020", "purple", 55,
"10/4/2020", "red", 48,
"11/17/2020", "red", 67
)
urchin_counts_ymd <- urchin_counts %>%
mutate(date = lubridate::mdy(date)) %>%
mutate(year = year(date),
month = month(date),
day = day(date))
urchin_counts_ymd
## # A tibble: 3 × 6
## date species size_mm year month day
## <date> <chr> <dbl> <dbl> <dbl> <int>
## 1 2020-10-03 purple 55 2020 10 3
## 2 2020-10-04 red 48 2020 10 4
## 3 2020-11-17 red 67 2020 11 17
# And then we could use group_by() to find different summary values by group, for example.
day_1 <- lubridate::ymd("2020-01-06")
day_2 <- lubridate::ymd("2020-05-18")
day_3 <- lubridate::ymd("2020-05-19")
# Create a time interval
time_interval <- interval(day_1, day_2)
# Check the length in weeks
time_length(time_interval, "week")
## [1] 19
# Check the length in years
time_length(time_interval, "year")
## [1] 0.363388
stringr
str_detect()
to detect a string patternReturns TRUE or FALSE based on whether the pattern is or is not detected.
my_string <- "Teddy loves eating salmon and socks."
# Does the pattern "love" exist within the string?
my_string %>% str_detect("love")
## [1] TRUE
# Does the pattern "pup" exist within the string?
my_string %>% str_detect("pup")
## [1] FALSE
This also works on vectors (…think ahead - data frame columns!). It is case sensitive (by default):
my_string <- c("burrito", "fish taco", "Taco salad")
# Does the vector element contain the pattern "fish"?
my_string %>% str_detect("fish")
## [1] FALSE TRUE FALSE
It is most powerful when used in combination with other functions.
For example, let’s look at the starwars
dataset (in dplyr
):
head(starwars)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, grey light blue 52 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
I want to only keep rows where the name column contains the pattern “Skywalker.” Remember: what’s the function to keep or exclude rows based on our conditions? It’s dplyr::filter()
! I can use that with str_detect()
to get the Skywalker family characters for me:
skywalkers <- starwars %>%
filter(str_detect(name, "Skywalker"))
skywalkers
## # A tibble: 3 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 Anakin … 188 84 blond fair blue 41.9 male mascu…
## 3 Shmi Sk… 163 NA black fair brown 72 female femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
str_replace()
to replace a string pattern with something elsefirewalkers <- starwars %>%
mutate(name = str_replace(name, pattern = "Sky", replacement = "Fire"))
head(firewalkers)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Fi… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, grey light blue 52 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
You can imagine this may be really helpful if there is a repeated spelling error, annoying syntax, or otherwise that you want to update throughout your data frame.
Note: this is very different from text mining and analysis, which involves analyzing textual information to gain insights about patterns, trends, and sentiments - look forward to that in EDS 242!
str_trim()
or str_squish()
to remove excess white spacefeedback <- c(" I ate some nachos", "Wednesday morning ")
# Removes leading, trailing & duplicate interior whitespaces
str_squish(feedback)
## [1] "I ate some nachos" "Wednesday morning"
# Removes leading & trailing whitespaces
str_trim(feedback)
## [1] "I ate some nachos" "Wednesday morning"
str_to_lower(feedback)
## [1] " i ate some nachos" "wednesday morning "
str_to_upper(feedback)
## [1] " I ATE SOME NACHOS" "WEDNESDAY MORNING "
str_to_title(feedback)
## [1] " I Ate Some Nachos" "Wednesday Morning "
str_count(feedback, pattern = "nachos")
## [1] 1 0