tidyr and dplyr
Take ~15 minutes to read Broman & Woo’s evergreen paper Data organization in spreadsheets. As you read, think about data that you have created or had to work with that did not follow these guidelines. Make notes of several examples to share: how did you input data previously? How would you change the way you input data?
Questions:
What are major / most common ways you have seen these guidelines ignored?
What is your experience working with or creating data in spreadsheets that don’t follow these guidelines?
Data source: Santa Barbara Coastal LTER, D. Reed, and R. Miller. 2021. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012 ver 6. Environmental Data Initiative. https://doi.org/10.6073/pasta/0bcdc7e8b22b8f2c1801085e8ca24d59
Set up eds221-day6-activities with subfolders data and docs.
Save the data in the data subfolder.
In docs, create a new .Rmd or .qmd saved with file prefix lobster_exploration.
Read in the data/Lobster_Abundance_All_Years_20210412.csv file. Take note of values that can be considered NA (see metadata) and update your import line to convert those to NA values.
Use tidyr::uncount() on the existing count column. What did this do? Add annotation in your code explaining tidyr::uncount().
Here’s code to read in your data, just to get you started:
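library(tidyverse) # read_csv(), uncount(), %>%
library(here)      # here()
library(janitor)   # clean_names()
lobsters <- read_csv(here("data", "Lobster_Abundance_All_Years_20210412.csv"), na = c("-99999", "")) %>%
  clean_names() %>%
  uncount(count)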
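To see what uncount() does before running it on the full dataset, it can help to try it on a few made-up rows. The sketch below uses a hypothetical toy table (toy_freq, with an assumed size_mm column), not the real lobster data.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in frequency format: one row per observed size,
# with `count` giving how many lobsters were seen at that size.
toy_freq <- tribble(
  ~site,  ~size_mm, ~count,
  "AQUE", 65,       2,
  "AQUE", 80,       0,
  "NAPL", 72,       3
)

# uncount() repeats each row `count` times and drops the count column,
# so a row with a count of 0 disappears entirely.
toy_case <- uncount(toy_freq, count)

toy_case
```

After uncount(), each row stands for a single lobster (case format), which is what lets n() and count() return numbers of lobsters rather than numbers of records.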
Find the total counts (see n()) and mean carapace lengths of lobsters observed in the dataset by site and year.
The legal lobster size (carapace length) in California is 79.76 mm.
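For the site-by-year summary described above, one possible pattern is group_by() followed by summarize(). This is a sketch on made-up rows; the column names (site, year, size_mm) and the object names are assumptions for illustration, so check them against your own data after clean_names().

```r
library(dplyr)
library(tibble)

# Hypothetical case-format data: one row per lobster.
toy_lobsters <- tribble(
  ~site,  ~year, ~size_mm,
  "AQUE", 2019,  60,
  "AQUE", 2019,  80,
  "AQUE", 2020,  75,
  "NAPL", 2020,  90,
  "NAPL", 2020,  NA
)

toy_summary <- toy_lobsters %>%
  group_by(site, year) %>%
  summarize(
    total_count  = n(),                          # rows (lobsters) per group
    mean_size_mm = mean(size_mm, na.rm = TRUE),  # ignore missing sizes
    .groups = "drop"                             # return an ungrouped tibble
  )

toy_summary
```

Note that n() counts every row in the group, including rows where the size is missing, while mean() with na.rm = TRUE skips them.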
Create a subset that only contains lobster data from 2020 (note: this should be from the original data you read in, not the summary table you created above)
Write code (you can decide how to do this - there are a number of ways) to find the counts of lobsters observed at each site (only using site as the grouping factor) that are above and below the legal limit. Hint: You may want to add a new column legal that contains “yes” or “no” based on the size of the observed lobster (see dplyr::case_when() for a really nice way to do this), then use group_by() %>% summarize(n()) or dplyr::count() to get counts by group within variables.
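The hint above could be sketched like this on made-up data. The table toy_2020, the column size_mm, and the choice to treat a lobster of exactly 79.76 mm as legal are all assumptions for illustration.

```r
library(dplyr)
library(tibble)

legal_min_mm <- 79.76  # legal carapace length from the text

# Hypothetical 2020 subset: one row per lobster.
toy_2020 <- tribble(
  ~site,  ~size_mm,
  "AQUE", 60,
  "AQUE", 85,
  "NAPL", 90,
  "NAPL", 82,
  "NAPL", 70
)

toy_legal_counts <- toy_2020 %>%
  mutate(legal = case_when(
    size_mm >= legal_min_mm ~ "yes",  # assumption: exactly at the limit counts as legal
    size_mm <  legal_min_mm ~ "no"    # rows with a missing size get NA
  )) %>%
  count(site, legal)                  # one row per site and legal status, counts in `n`

toy_legal_counts
```

count(site, legal) is shorthand for group_by(site, legal) %>% summarize(n = n()), so either route gives the same table.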
Create a stacked column graph that shows the proportion of legal and non-legal lobsters at each site. Hint: create a stacked column graph with geom_col(), then add the argument position = "fill" to convert from a graph of absolute counts to proportions.
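Here is a minimal sketch of the geom_col() hint, again using a small made-up counts table (toy_legal_counts with columns site, legal, and n are hypothetical names). With position = "fill", each site's bar is rescaled so its segments sum to 1.

```r
library(ggplot2)

# Hypothetical counts of legal and non-legal lobsters per site.
toy_legal_counts <- data.frame(
  site  = c("AQUE", "AQUE", "NAPL", "NAPL"),
  legal = c("no", "yes", "no", "yes"),
  n     = c(1, 1, 1, 2)
)

p <- ggplot(toy_legal_counts, aes(x = site, y = n, fill = legal)) +
  geom_col(position = "fill") +  # stack, then rescale each bar to total 1
  labs(x = "Site", y = "Proportion of lobsters", fill = "Legal size?")

p
```

Removing position = "fill" gives the default stacked bars of absolute counts, which is a quick way to compare the two views.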
Which two sites had the largest proportion of legal lobsters in 2020? Explore the metadata to come up with a hypothesis about why that might be.
Starting with the original lobsters data that you read in as lobsters, complete the following (separately - these are not expected to be done in sequence or anything). You can store each of the outputs as ex_a, ex_b, etc. for the purposes of this task.
Create and store a subset that only contains lobsters from sites “IVEE”, “CARP” and “NAPL”. Check your output data frame to ensure that only those three sites exist.
Create a subset that only contains lobsters observed in August.
Create a subset with lobsters at Arroyo Quemado (AQUE) OR with a carapace length greater than 70 mm.
Create a subset that does NOT include observations from Naples Reef (NAPL)
Find the mean and standard deviation of lobster carapace length, grouped by site.
Find the maximum carapace length by site and month.
Add a new column that contains lobster carapace length converted to centimeters. Check output.
Update the site column to all lowercase. Check output.
Convert the area column to a character (not sure why you’d want to do this, but try it anyway). Check output.
Use case_when() to add a new column called size_bin that contains “small” if carapace size is <= 70 mm, or “large” if it is greater than 70 mm. Check output.
Use case_when() to add a new column called designation that contains “MPA” if the site is “IVEE” or “NAPL”, and “not MPA” for all other outcomes.
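The tasks above lean on a handful of recurring patterns in filter() and mutate(). The sketch below shows some of them on a hypothetical toy table; the rows, the month and size_mm columns, and all object names are made up for illustration and are not the expected answers.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: one row per lobster.
toy <- tribble(
  ~site,  ~month, ~size_mm,
  "IVEE", 8,      65,
  "CARP", 9,      82,
  "AQUE", 8,      58,
  "NAPL", 7,      74,
  "CARP", 8,      71
)

# Keep rows whose site matches any value in a set.
in_set <- toy %>% filter(site %in% c("IVEE", "CARP", "NAPL"))

# Keep rows that satisfy either condition (OR).
either <- toy %>% filter(site == "AQUE" | size_mm > 70)

# Drop rows that match a value.
not_napl <- toy %>% filter(site != "NAPL")

# Add new columns; the TRUE branch of case_when() catches everything else.
wrangled <- toy %>%
  mutate(
    size_cm     = size_mm / 10,
    site_lower  = tolower(site),
    designation = case_when(
      site %in% c("IVEE", "NAPL") ~ "MPA",
      TRUE                        ~ "not MPA"
    )
  )

unique(in_set$site)  # quick check that only the requested sites remain
```

Checking each output with unique(), count(), or a quick look at the data frame is a simple way to confirm a filter or mutate did what you intended.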