EDS 212 Computational Session 5A

Setup

Create a new repo on GitHub called eds212-day5-comp (with a ReadMe)
Clone to create a version-controlled R project
Add a new Quarto document
Remove everything below the first code chunk
In the setup chunk, attach the tidyverse package
Save & render

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

What does the output look like? A mess. We might not want all of that stuff that R is reporting to show up in my rendered document. Not a problem - we get to define exactly what shows up in our knitted html, either across the entire document or for each specific code chunk.

Customizing what shows up in our Quarto doc

We can update Quarto execution options to control what shows up (and doesn’t) in our rendered document.

For example:

echo: FALSE will hide source code in the rendered doc
warning: FALSE will hide warnings / messages

Setting these execute options in the YAML will make them the default for all code chunks. You can also override the default using the hashpipe #| within a code chunk, followed by the execution option (e.g. #| echo: false ).

Look at the starwars dataset (in dplyr)

Use several of the tools we learned yesterday (e.g. head(), tail(), summary(), dim(), etc.). Consider this “exploratory” work, that you would not want to show up at all in a final knitted document. Update the code chunk header with an option to not including anything from this code chunk (include = FALSE).

# Return the first 6 lines of `starwars`
head(starwars)

## # A tibble: 6 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

# Check the dimensions
dim(starwars)

## [1] 87 14

Update the setup code chunk options so the warnings and messages are hidden, but the library(tidyverse) code does still show up in your knitted document. Knit to check.
In a new code chunk, create a ggplot2 graph of character mass (y-axis) versus height (x-axis). Update so that the color of the points changes based on the value of mass (this is unnecessary, but just for customization practice). Update axis labels (with units). Remember, use ?starwars for more information. Check the warnings that show up when you run your code. What are they telling you? Then, update the code chunk option so that only the graph appears in your knitted document (no code or warnings / messages). Knit to check.
In a new code chunk, create a histogram of character heights. Update the fill color to purple, and the line color to red (this will look awful - do it anyway for practice). Update the x- and y-axis labels. Update your code chunk options so that only your code and the output graph shows up in the knitted document (no messages or warnings). Knit to check.

ggplot(data = starwars, aes(x = height)) +
  geom_histogram(color = "red", fill = "purple")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 6 rows containing non-finite values (`stat_bin()`).

Adding a figure caption and alt-text

You can add a figure caption or alt-text to a graphic using #| fig-cap: "caption text" and #| fig-alt: "figure alt text"

Finding summary statistics for individual columns

Here, we’ll learn how to find some summary statistics (mean, standard deviation, variance). You can refer to a single column in a data frame using df$colname. For example, if we have a data frame cats with a column mass, then I can refer to the mass column in cats using cats$mass.

Let’s take a look. In the Console, try calling a couple columns individually from starwars (you don’t need to store these). E.g. starwars$name, starwars$birth_year, etc.

We’ll learn a bunch of tools that help to automate finding summary statistics across multiple columns (e.g. dplyr::across()) or groups within the same column (e.g. dplyr::group_by() %>% summarize()), but for now let’s say we just want a single mean from a single column.

Use the mean() function applied to the column to return the value.

In a new code chunk, store the mean of the height column as sw_height_mean

sw_height_mean <- mean(starwars$height)

Call the value back to yourself (in the Console). What does it tell you the mean height is? Uh oh…
Check out the documentation for mean(). What is the behavior (default) for dealing with NA (missing) values?
Update your code so that NA values are removed, by adding the argument na.rm = TRUE within the mean() function. Does the value make sense given the histogram you created above?

sw_height_mean <- mean(starwars$height, na.rm = TRUE)

In your code chunk, add code to find the median (median()), variance (var()), and standard deviation (sd()) for Star Wars character heights. Store them using a consistent naming system as your sw_height_mean object above. Check all outputs.

In-line reporting from code outputs

Let’s say you wanted to report the mean and standard deviation of a variable in text (remember - summary statistics hide things! Always consider accompanying summary statistics with visualizations or tables that show more).

Would you want to manually type the value you found from your code into your Quarto document? Why or why not?

We want our outputs in text to be as reproducible and automatically updated as anything else in our work, so that if anything changes, we aren’t going to be manually (and treacherously) copying and pasting hoping we are updating everything correctly.

Reference stored objects in text by adding inline R code with single backticks, a lowercase r between them, and then whatever you want to have show up.

Warning: PAY ATTENTION TO SIGNIFICANT FIGURES. Are you presenting your outcomes at a reasonable level of resolution? Do you need to round your output to make it a responsible reflection of the measurements you have?

Safely store your work

Stage, commit, pull, push. Check to make sure your changes exist on GitHub.