In EDS 212, we learned about Boolean operations and logic - the land of TRUE and FALSE. Boolean operations are ubiquitous in scientific programming. Whether we are looking for string pattern matches, comparing values, or filtering data frames, Boolean operators show up all the time - and even when we aren’t writing them explicitly, they are working behind the scenes.
Let’s refresh some basic Boolean (logical) operators for programming:
# Create some objects (let's say these are tree heights)
pinyon_pine <- 14
lodgepole_pine <- 46
# Some logical expressions:
pinyon_pine == 10 # Exact match?
## [1] FALSE
pinyon_pine < lodgepole_pine # Less than?
## [1] TRUE
lodgepole_pine >= 46 # Greater than or equal?
## [1] TRUE
pinyon_pine != 25 # Not equal to?
## [1] TRUE
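For comparison, here is a minimal sketch of the same logical expressions in Python, which uses the same comparison operators but spells its Boolean values True and False. The variable names on the left (exact_match, less_than, and so on) are invented for this illustration.

```python
# Tree heights, mirroring the R objects above
pinyon_pine = 14
lodgepole_pine = 46

# Each comparison evaluates to a Python bool (True or False)
exact_match = pinyon_pine == 10           # Exact match?
less_than = pinyon_pine < lodgepole_pine  # Less than?
at_least = lodgepole_pine >= 46           # Greater than or equal?
not_equal = pinyon_pine != 25             # Not equal to?

print(exact_match, less_than, at_least, not_equal)
```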
We write conditionals to tell a computer what to do based on whether or not some conditions that we set are satisfied. For example, we may want to use conditionals to:
Only keep observations from counties with median home value > $600,000
Re-categorize farms as “moderate implementation” if they implement between 4 and 8 best management practices for mitigating nutrient runoff
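The two use cases above can be sketched in a few lines of Python. All of the data here (county names, home values, farm names, practice counts) is invented for illustration, and treating "between 4 and 8" as inclusive of both endpoints is an assumption.

```python
# Hypothetical county records (values invented for illustration)
counties = [
    {"name": "County A", "median_home_value": 725000},
    {"name": "County B", "median_home_value": 410000},
    {"name": "County C", "median_home_value": 600000},
]

# Keep only counties with median home value strictly above $600,000
expensive = [c["name"] for c in counties if c["median_home_value"] > 600000]

# Hypothetical counts of best management practices per farm
farm_practices = {"Farm 1": 2, "Farm 2": 5, "Farm 3": 8, "Farm 4": 11}

# Flag farms with 4 to 8 practices (inclusive endpoints are an assumption)
moderate = [farm for farm, n in farm_practices.items() if 4 <= n <= 8]

print(expensive)
print(moderate)
```

Note that County C, at exactly $600,000, is dropped because the condition uses a strict greater-than comparison.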
if statement
The fundamental conditional statement is an if statement, which we can read as “If this condition is met, then do this.”
An example in R:
burrito <- 2.4 # Assign an object value
# Write a short 'if' statement:
if (burrito > 2) {
  print("I love burritos!")
}
## [1] "I love burritos!"
Try changing the value of burrito to 1.5. What is returned when the if statement condition isn’t met?
This is important: for a lone “if” statement, there is no “else” option, so nothing happens if the condition isn’t met.
In Python:
burrito = 2.4
if burrito > 2:
    print("I love burritos!")
## I love burritos!
Here, we’re using a new function from the stringr package: stringr::str_detect(). First, let’s learn something about how str_detect() works (see ?str_detect() for more information).
str_detect() “detects the presence or absence of a pattern in a string.” For example:
library(stringr) # Needed to call str_detect() without the stringr:: prefix
my_ships <- c("Millenium Falcon", "X-wing", "Tie-Fighter", "Death Star")
str_detect(my_ships, pattern = "r") # Asks: which elements in the vector contain "r"
## [1] FALSE FALSE TRUE TRUE
Notice that it returns a logical TRUE or FALSE for each element in the vector, based on whether or not they contain the string pattern “r”.
Now let’s try using it in a conditional statement. Here, if a phrase contains the word “love”, return the phrase “Big burrito fan!” Otherwise, return nothing.
phrase <- "I love burritos"
if (str_detect(phrase, "love")) {
  print("Big burrito fan!")
}
## [1] "Big burrito fan!"
Try updating the phrase or string pattern in the code above to test different variations.
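For a rough Python counterpart, the sketch below uses the `in` operator, which tests for a literal substring. This is a simpler stand-in rather than an exact equivalent: str_detect() interprets its pattern as a regular expression by default, while `in` does not. The names contains_r and message are invented for this illustration.

```python
my_ships = ["Millenium Falcon", "X-wing", "Tie-Fighter", "Death Star"]

# `in` tests whether a substring occurs in a string; the list
# comprehension applies that test to every element in turn
contains_r = ["r" in ship for ship in my_ships]
print(contains_r)

phrase = "I love burritos"
message = None  # Stays None if the condition below is not met

if "love" in phrase:
    message = "Big burrito fan!"
    print(message)
```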
if-else statement
In the examples above, there was a lonely “if” statement, which returned something if the condition was met, but otherwise returned nothing. Usually, we’ll want our code to return something else if our condition is not met. In that case, we can write an if-else statement. Here are a couple of examples:
In R:
pika <- 89.1
if (pika > 60) {
  print("mega pika")
} else {
  print("normal pika")
}
## [1] "mega pika"
In Python:
pika = 89.1
if pika > 60:
    print("mega pika")
else:
    print("normal pika")
## mega pika
food <- "I love enchiladas!"
if (str_detect(food, "burritos")) {
  print("yay burritos!")
} else {
  print("what about burritos?")
}
## [1] "what about burritos?"
if-else if-else statements
Sometimes there aren’t just two outcomes! In that case, you can specify different options using the if-else if-else structure. Several examples are shown below.
In R:
marmot <- 2.8
if (marmot < 0.5) {
  print("a small marmot!")
} else if (marmot >= 0.5 && marmot < 3) { # && compares single values, as if() expects
  print("a medium marmot!")
} else {
  print("a large marmot!")
}
## [1] "a medium marmot!"
In Python:
marmot = 2.8
if marmot < 0.5:
    print("a small marmot!")
elif 0.5 <= marmot < 3:
    print("a medium marmot!")
else:
    print("a large marmot!")
## a medium marmot!
switch statements
A more concise tool for choosing among several fixed options is the switch() function, which allows you to “select one of a list of alternatives.” See ?switch for more information. This can be particularly useful for switching between different character strings, based on a condition or user selection.
In R:
species <- "mouse"
switch(species,
       "cat" = print("Meow"),
       "dog" = print("Woof!"),
       "mouse" = print("Squeak"))
## [1] "Squeak"
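One common way to get switch-like behavior in Python is a dictionary lookup, sketched below. This is a swapped-in technique, not a direct translation of switch(). The fallback string "unknown species" is an addition for illustration and has no counterpart in the R example above.

```python
species = "mouse"

# Map each option to its result
sounds = {"cat": "Meow", "dog": "Woof!", "mouse": "Squeak"}

# dict.get() returns the fallback when the key is not in the dictionary
sound = sounds.get(species, "unknown species")
print(sound)
```

Keeping the options in a dictionary means a new alternative can be added by adding one entry, without touching the lookup itself.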
As we get into Week 2 of EDS 221, we’ll learn other functions that make conditionals a bit easier to read and write. For example, the dplyr::case_when() function is very helpful for writing vectorized if-else statements!
A general guideline from RStudio Chief Scientist Hadley Wickham is that if we copy something more than twice, we should write a function or a loop. Avoiding redundancy in code can make it more readable, reproducible and efficient.
We write a for loop to iterate through different elements of a data structure (e.g. values in a vector, or columns in a data frame), applying some operation to each, and returning the new output. In this section, we’ll learn some for loop basics.
Let’s start with a very basic for loop: for each element in a vector, do something to it and return the new thing. Here, we start with a vector of puppy names. Starting with the first name “Teddy”, we enter the loop and add “My dog’s name is” to the beginning of the string. Then we move on to the next name in the vector, continuing until we have applied the updated text to all puppy names.
In R:
dog_names <- c("Teddy", "Khora", "Banjo", "Waffle")
for (pupster in dog_names) {
  print(paste("My dog's name is", pupster))
}
## [1] "My dog's name is Teddy"
## [1] "My dog's name is Khora"
## [1] "My dog's name is Banjo"
## [1] "My dog's name is Waffle"
# Or similarly (\n is for new line):
for (pupster in dog_names) {
  cat("My dog's name is", pupster, "\n")
}
## My dog's name is Teddy
## My dog's name is Khora
## My dog's name is Banjo
## My dog's name is Waffle
In Python:
dog_names = ["Teddy", "Khora", "Banjo", "Waffle"]
for i in dog_names:
    print("My dog's name is " + i)
## My dog's name is Teddy
## My dog's name is Khora
## My dog's name is Banjo
## My dog's name is Waffle
Let’s check out another little example. Here, we create a sequence that ranges from 0 to 3 by increments of 0.5. Then, for each element in that vector, we add 2 to it, then move on to the next element.
In R:
mass <- seq(from = 0, to = 3, by = 0.5)
for (i in mass) {
  new_val <- i + 2
  print(new_val)
}
## [1] 2
## [1] 2.5
## [1] 3
## [1] 3.5
## [1] 4
## [1] 4.5
## [1] 5
Or, using seq_along():
mass <- seq(from = 0, to = 3, by = 0.5)
for (i in seq_along(mass)) {
  new_val <- mass[i] + 2
  print(new_val)
}
## [1] 2
## [1] 2.5
## [1] 3
## [1] 3.5
## [1] 4
## [1] 4.5
## [1] 5
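The two looping styles above (over values, and over positions with seq_along()) have close Python counterparts. The sketch below is an illustration only. The list names shifted_by_value and shifted_by_index are invented, and note that Python positions start at 0 rather than 1.

```python
# Build 0, 0.5, ..., 3: the same values as seq(from = 0, to = 3, by = 0.5)
mass = [i * 0.5 for i in range(7)]

# Style 1: loop over the values directly
shifted_by_value = []
for m in mass:
    shifted_by_value.append(m + 2)

# Style 2: loop over positions, as seq_along() does in R
shifted_by_index = []
for i in range(len(mass)):
    shifted_by_index.append(mass[i] + 2)

print(shifted_by_index)
```

Looping over positions is mostly useful when the position itself is needed, for example to look up the matching element of a second list.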
Let’s try another one with seq_along():
For each element in a vector, find the sum of that value plus the next value in the sequence:
tree_height <- c(1,2,6,10)
for (i in seq_along(tree_height)) {
  # On the last iteration, tree_height[i + 1] is past the end of the vector, so the sum is NA
  val <- tree_height[i] + tree_height[i + 1]
  print(val)
}
## [1] 3
## [1] 8
## [1] 16
## [1] NA
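The final NA appears because, on the last pass, tree_height[i + 1] asks for an element past the end of the vector. In Python the same out-of-range index would raise an IndexError instead of returning a missing value. A minimal sketch of one way to avoid it is to pair each value with its successor using zip(), which stops once the shorter input runs out. The name pair_sums is invented for this illustration.

```python
tree_height = [1, 2, 6, 10]

# tree_height[1:] is the same list without its first element, so zip()
# yields (1, 2), (2, 6), (6, 10) and never reaches past the end
pair_sums = [current + following
             for current, following in zip(tree_height, tree_height[1:])]

print(pair_sums)  # [3, 8, 16]
```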
Earlier in this session, we learned how to write a conditional if or if-else statement. Sometimes we’ll want to change what a for loop does based on a conditional - so we’ll have a conditional statement within a for loop. Let’s take a look and talk through an example:
A basic conditional for loop in R: Here, we have a vector of animals called animal. Running through each animal in the vector, if the species is “dog” we want to return “I love dogs!”. For all other animals, we’ll return “These are other animals!”
# Create the animals vector:
animal <- c("cat", "dog", "dog", "zebra", "dog")
# Create the for loop with conditional statement:
for (i in seq_along(animal)) {
  if (animal[i] == "dog") {
    print("I love dogs!")
  } else {
    print("These are other animals!")
  }
}
## [1] "These are other animals!"
## [1] "I love dogs!"
## [1] "I love dogs!"
## [1] "These are other animals!"
## [1] "I love dogs!"
Or, for a numerical example:
# Animal types:
species <- c("dog", "elephant", "goat", "dog", "dog", "elephant")
# And their respective ages in human years:
age_human <- c(3, 8, 4, 6, 12, 18)
# Convert ages to "animal years" using the following:
# 1 human year = 7 in dog years
# 1 human year = 0.88 in elephant years
# 1 human year = 4.7 in goat years
for (i in seq_along(species)) {
  if (species[i] == "dog") {
    animal_age <- age_human[i] * 7
  } else if (species[i] == "elephant") {
    animal_age <- age_human[i] * 0.88
  } else if (species[i] == "goat") {
    animal_age <- age_human[i] * 4.7
  }
  print(animal_age)
}
## [1] 21
## [1] 7.04
## [1] 18.8
## [1] 42
## [1] 84
## [1] 15.84
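One thing to watch in a chain like this: with no final else, a species that matches none of the branches leaves animal_age holding the value from the previous iteration. The Python sketch below mirrors the loop and adds an explicit final branch for that case. The function name to_animal_years and the choice to return None for unlisted species are assumptions made for illustration.

```python
def to_animal_years(kind, age):
    """Convert one age in human years, or return None for an unlisted species."""
    if kind == "dog":
        return age * 7
    elif kind == "elephant":
        return age * 0.88
    elif kind == "goat":
        return age * 4.7
    else:
        # Explicit fallback so an unlisted species never reuses a stale value
        return None

species = ["dog", "elephant", "goat", "dog", "dog", "elephant"]
age_human = [3, 8, 4, 6, 12, 18]

# zip() walks through both lists in step, one (species, age) pair at a time
for kind, age in zip(species, age_human):
    print(to_animal_years(kind, age))
```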
Reminder: Keep this idea in mind when we learn dplyr::case_when()!
So far, we’ve returned outputs of a for loop, but we haven’t stored the outputs of a for loop as a new object in our environment.
To store outputs of a for loop, we’ll create an empty vector, then populate it with the for loop elements as they’re created. It’s important to do this - it will make your loops quicker, which is critical once you start working with big data.
# Vectors with species and human age:
species <- c("dog", "elephant", "goat", "dog", "dog", "elephant")
age_human <- c(3, 8, 4, 6, 12, 18)
# Create the empty vector animal_ages (species must already exist, since its length is used here):
animal_ages <- vector(mode = "numeric", length = length(species))
# Same loop as above, with additional piece added
# To populate our empty vector
for (i in seq_along(species)) {
  if (species[i] == "dog") {
    animal_age <- age_human[i] * 7
  } else if (species[i] == "elephant") {
    animal_age <- age_human[i] * 0.88
  } else if (species[i] == "goat") {
    animal_age <- age_human[i] * 4.7
  }
  animal_ages[i] <- animal_age # Populate our empty vector
}
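The same store-as-you-go pattern can be sketched in Python by creating a list of the right length up front and filling it by position. To keep the sketch short, the if-else chain is swapped for a dictionary of multipliers. That substitution, and the name multipliers, are choices made for this illustration only.

```python
species = ["dog", "elephant", "goat", "dog", "dog", "elephant"]
age_human = [3, 8, 4, 6, 12, 18]

# Animal years per human year, taken from the R example
multipliers = {"dog": 7, "elephant": 0.88, "goat": 4.7}

# Preallocate one slot per observation, like vector(mode = "numeric", ...)
animal_ages = [0.0] * len(species)

for i in range(len(species)):
    animal_ages[i] = age_human[i] * multipliers[species[i]]  # Fill slot i

print(animal_ages)
```

In everyday Python it is also common to start from an empty list and call append() inside the loop; preallocating is shown here because it matches the R approach.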
Another example of storing an output:
tigers <- c(29, 34, 82)
lions <- c(2, 18, 6)
big_cats <- vector(mode = "numeric", length = length(tigers))
for (i in seq_along(tigers)) {
  total_cats <- tigers[i] + lions[i]
  big_cats[i] <- total_cats
}
Note: Don’t make your life harder for no reason. What’s the easiest way to find the big_cats values as calculated above?
Recall from lecture: df[[i]] calls the ith column from the df data frame with simplification (i.e., it is pulled out as a vector, not a 1-d data frame).
Write a loop that iteratively calculates the mean value of each column in mtcars.
# Create our storage vector
# Note: ncol() returns the number of columns in a data frame
mean_mtcars <- vector(mode = "numeric", length = ncol(mtcars))
# Write the loop
for (i in seq_along(mtcars)) { # seq_along() on a data frame counts its columns
  mean_val <- mean(mtcars[[i]], na.rm = TRUE)
  mean_mtcars[i] <- mean_val
}
# Tada.
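The column-by-column loop can be sketched in plain Python too, with a dictionary of lists standing in for a data frame. The table below is a small made-up example, not the real mtcars data, and the names cars and column_means are invented for illustration.

```python
from statistics import mean

# Made-up stand-in for a data frame: column name -> list of values
cars = {
    "mpg": [20.0, 25.0, 15.0],
    "cyl": [6, 4, 8],
    "hp": [110, 93, 175],
}

# Storage for the results, filled one column at a time
column_means = {}

for name, values in cars.items():
    column_means[name] = mean(values)

print(column_means)
```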
Sometimes you’ll want to iterate over some, but not all, columns in a data frame. Then, you may want to write a for loop with a condition.
For example, starting with the penguins data frame (from the palmerpenguins package), let’s find the median value of all numeric variables.
for (i in seq_along(penguins)) {
  if (is.numeric(penguins[[i]])) {
    penguin_med <- median(penguins[[i]], na.rm = TRUE)
    print(penguin_med)
  } else {
    print("non-numeric")
  }
}
## [1] "non-numeric"
## [1] "non-numeric"
## [1] 44.45
## [1] 17.3
## [1] 197
## [1] 4050
## [1] "non-numeric"
## [1] 2008
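Here is a hedged Python sketch of the same idea: check each column's type, summarize the numeric ones, and flag the rest. The table is made up (it is not the real penguins data), None stands in for a missing value, and the helper is_number() is invented for this illustration.

```python
from statistics import median

# Made-up table with mixed column types; None marks a missing value
birds = {
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.0, 47.5, None],
    "year": [2007, 2008, 2009],
}

def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

medians = {}
for name, values in birds.items():
    present = [v for v in values if v is not None]  # Similar in spirit to na.rm = TRUE
    if present and all(is_number(v) for v in present):
        medians[name] = median(present)
    else:
        medians[name] = "non-numeric"

print(medians)
```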
For loops are a critical skill for any data scientist to understand. However, you may not end up writing many of them from scratch. That’s because a number of useful tools (functions) exist to make iteration easier for us. Here, we’ll briefly learn about two:
The apply family of functions for iteration
{purrr}
apply
The apply functions simplify iteration over elements of a data structure (e.g., a data frame). To use apply() most simply, we need to tell it 3 things:
X: the data we are iterating over
MARGIN: whether to iterate over rows (1) or columns (2)
FUN: the function to apply
For example, let’s say we want to find the mean value of all columns in the mtcars data frame:
apply(X = mtcars, MARGIN = 2, FUN = mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
Take a moment to break down what each argument does in the apply() function above. Keep this in mind - we can also write our own function that we’d want to apply to each element!
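To illustrate the idea of applying our own function to each column, here is a small Python sketch. The function value_range() and the table are both invented for this example. A dictionary comprehension plays the role that apply() with MARGIN = 2 plays in R.

```python
def value_range(values):
    """Custom summary: the spread between the largest and smallest value."""
    return max(values) - min(values)

# Made-up table: column name -> list of values
table = {
    "height": [1, 2, 6, 10],
    "mass": [0.5, 3.0, 1.5, 2.0],
}

# Apply the custom function to every column
ranges = {name: value_range(values) for name, values in table.items()}

print(ranges)  # {'height': 9, 'mass': 2.5}
```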
There are also variations on apply, like lapply and sapply, that are worth looking into for specific use cases.
dplyr::across(), group_by() and summarize() in combo
If the takeaway is “Whoa, there are a lot of options…” - YES. There are many existing functions that can help you loop over elements of your data and apply a function.
Here, we’ll learn some of my favorites - to loop over columns and by group - and return a nice table of values.
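penguins %>%
  group_by(species) %>%
  # Passing extra arguments through across() is deprecated in dplyr 1.1.0+, so wrap mean() instead
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))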
## # A tibble: 3 × 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie 38.8 18.3 190. 3701. 2008.
## 2 Chinstrap 48.8 18.4 196. 3733. 2008.
## 3 Gentoo 47.5 15.0 217. 5076. 2008.
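What group_by() and summarize() do together can be sketched with the Python standard library: split the rows by group, then apply a summary to each group. The records below are made up (they are not the real penguins data), and this is a conceptual illustration rather than how dplyr is implemented.

```python
from collections import defaultdict
from statistics import mean

# Made-up records, one dictionary per observation
records = [
    {"species": "Adelie", "body_mass_g": 3700},
    {"species": "Gentoo", "body_mass_g": 5000},
    {"species": "Adelie", "body_mass_g": 3800},
    {"species": "Gentoo", "body_mass_g": 5200},
]

# Step 1 (group): collect the values belonging to each species
groups = defaultdict(list)
for row in records:
    groups[row["species"]].append(row["body_mass_g"])

# Step 2 (summarize): reduce each group to a single value
group_means = {species: mean(values) for species, values in groups.items()}

print(group_means)
```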
{purrr}
The documentation for ?purrr::map() is titled “Apply a function to each element of a list or atomic vector,” which should ring your for loop brain bells.
There are a number of functions within the purrr::map() family to best suit your specific needs.
The equivalent map() approach to the example above (finding the mean of each column in the mtcars data frame) is as follows:
map(.x = mtcars, .f = mean)
## $mpg
## [1] 20.09062
##
## $cyl
## [1] 6.1875
##
## $disp
## [1] 230.7219
##
## $hp
## [1] 146.6875
##
## $drat
## [1] 3.596563
##
## $wt
## [1] 3.21725
##
## $qsec
## [1] 17.84875
##
## $vs
## [1] 0.4375
##
## $am
## [1] 0.40625
##
## $gear
## [1] 3.6875
##
## $carb
## [1] 2.8125
# Or, to return the output in a data frame (instead of a list):
map_df(.x = mtcars, .f = mean)
## # A tibble: 1 × 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20.1 6.19 231. 147. 3.60 3.22 17.8 0.438 0.406 3.69 2.81
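Python has a built-in map() with the same core idea: apply a function to each element of a collection. The sketch below uses a made-up table, and the names means_list and means_by_name are invented for illustration. Unlike purrr::map(), Python's map() returns a lazy iterator, so it is wrapped in list() or dict() to see the results.

```python
from statistics import mean

# Made-up table: column name -> list of values
columns = {"a": [1, 2, 3], "b": [10, 20, 30]}

# Results only, in column order
means_list = list(map(mean, columns.values()))

# Results paired back up with their column names
means_by_name = dict(zip(columns, map(mean, columns.values())))

print(means_list)     # [2, 20]
print(means_by_name)  # {'a': 2, 'b': 20}
```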