Day 6 Activities

Task 1: Read Broman & Woo’s Data organization in spreadsheets

Take ~15 minutes to read Broman & Woo’s evergreen paper Data organization in spreadsheets. As you read, think about data that you have created or had to work with that did not follow these guidelines. Make notes of examples to share from several - how did you input data previously? How would you change the way you input data?

Questions:

What are major / most common ways you have seen these guidelines ignored?
What is your experience working with or creating data in spreadsheets that don’t follow these guidelines?

Task 2: SBC Lobsters

Data source: Santa Barbara Coastal LTER, D. Reed, and R. Miller. 2021. SBC LTER: Reef: Abundance, size and fishing effort for California Spiny Lobster (Panulirus interruptus), ongoing since 2012 ver 6. Environmental Data Initiative. https://doi.org/10.6073/pasta/0bcdc7e8b22b8f2c1801085e8ca24d59

Getting started

Create a new GitHub repo called eds221-day6-activities
Clone to create a version controlled R project
Add subfolders data and docs
Download the California Spiny lobster abundance data from this SBC LTER data package. Familiarize yourself with the metadata. Save the CSV containing lobster abundance data in your data subfolder.
In docs, create a new .Rmd or .qmd saved with file prefix lobster_exploration
Within your notebook, write organized and well-annotated code to do the following:
- Read in and take a look at the data in the data/Lobster_Abundance_All_Years_20210412.csv file. Take note of values that can be considered NA (see metadata) and update your import line to convert those to NA values
- Convert column names to lower snake case
- Convert the data from frequency to case format using dplyr::uncount() on the existing count column. What did this do? Add annotation in your code explaining dplyr::uncount()

Here’s code to read in your data, just to get your started:

lobsters <- read_csv(here("data","Lobster_Abundance_All_Years_20210412.csv"), na = c("-99999", "")) %>% 
  clean_names() %>% 
  uncount(count)

Find counts and mean sizes by site & year

Create a summary table that finds the total counts (see: n()), and mean carapace lengths of lobsters observed in the dataset by site and year.
Create a ggplot graph of the number of total lobsters observed (y-axis) by year (x-axis) in the study, grouped (either aesthetically or by faceting) by site

Find the proportion of legal lobsters at each site for 2020

The legal lobster size (carapace length) in California is 79.76 mm.

Create a subset that only contains lobster data from 2020 (note: this should be from the original data you read in, not the summary table you created above)
Write code (you can decide how to do this - there are a number of ways) to find the counts of lobsters observed at each site (only using site as the grouping factor) that are above and below the legal limit. Hint: You may want to add a new column legal that contains “yes” or “no” based on the size of the observed lobster (see dplyr::case_when() for a really nice way to do this), then use group_by() %>% summarize(n()) or dplyr::count() to get counts by group within variables
Create a stacked column graph that shows the proportion of legal and non-legal lobsters at each site. **Hint: create a stacked column graph with geom_col(), then add the argument position = "fill" to convert from a graph of absolute counts to proportions.

Which two sites had the largest proportion of legal lobsters in 2020? Explore the metadata to come up with a hypothesis about why that might be.

Task 3: Random lobster wrangling

Starting with the original lobsters data that you read in as lobsters, complete the following (separately - these are not expected to be done in sequence or anything). You can store each of the outputs as ex_a, ex_b, etc. for the purposes of this task.

filter() practice

Create and store a subset that only contains lobsters from sites “IVEE”, “CARP” and “NAPL”. Check your output data frame to ensure that only those three sites exist.
Create a subset that only contains lobsters observed in August.
Create a subset with lobsters at Arroyo Quemado (AQUE) OR with a carapace length greater than 70 mm.
Create a subset that does NOT include observations from Naples Reef (NAPL)

group_by() %>% summarize() practice

Find the mean and standard deviation of lobster carapace length, grouped by site.
Find the maximum carapace length by site and month.

mutate() practice

Add a new column that contains lobster carapace length converted to centimeters. Check output.
Update the site column to all lowercase. Check output.
Convert the area column to a character (not sure why you’d want to do this, but try it anyway). Check output.

case_when() practice

Use case_when() to add a new column called size_bin that contains “small” if carapace size is <= 70 mm, or “large” if it is greater than 70 mm. Check output.
Use case_when() to add a new column called designation that contains “MPA” if the site is “IVEE” or “NAPL”, and “not MPA” for all other outcomes.