Allison Horst, PhD

Assistant Teaching Professor, Bren School, UC Santa Barbara

Illustration by Allison Horst

1 Welcome

Welcome to the Advanced Data Visualization with ggplot2 workshop!

In this workshop, we’ll cover concepts and skills to create effective, elegant data visualizations with ggplot2 in R. We will start by reviewing, then building on, ggplot basics to make highly customized figures (including with scales, guides, themes, direct labeling and highlighting) while reinforcing data visualization principles by carefully considering why we update (or don’t update) each graph component. Next, we will learn how to make several elegant and modern (but less commonly seen) graph types before putting them together into a compound figure using the patchwork package. To wrap up, we’ll construct a beautiful map in ggplot to reinforce how tools and concepts we’ve learned are transferable across different data and visualization types.

Participants should be able to:

  • Install and attach R packages
  • Make basic ggplot2 graphs in R
  • Work comfortably in either R scripts or R Markdown (Allison will be in R Markdown)

1.1 Workshop outline

  • Conceptual hierarchy of data viz
  • ggplot2 basics review
    • Aesthetic mapping
    • Themes
    • Labels
    • Facets (& facet grids vs facet wraps)
    • Getting things in order (e.g. fct_reorder)
  • Advanced customization in ggplot2
    • scales for thoughtful breaks and labels
    • …and color schemes (+ paletteer!)
    • In the weeds of themes (gridlines, panel colors, margins, etc.)
    • Direct annotation (as an alternative to legends)
    • Repulsive labels (e.g. ggrepel)
    • Highlighting for clarity (e.g. with gghighlight)
  • Compound figures with patchwork
  • A few new graph types to consider
    • Marginal plot
    • Beeswarm plots with ggbeeswarm
    • Heatmaps with geom_tile()
    • A map! in ggplot2
  • Export & save your graphs
  • Keep learning

2 Citations and data

2.1 R packages

Thank you to developers, sharers, teachers, and the entire R community for building things and creating resources to help us all learn. I’d especially like to thank developers & maintainers for the following packages:

2.2 Lizard size measurement data

Our data are a curated subset from Jornada Basin Long Term Ecological Research site in New Mexico, part of the US Long Term Ecological Research (LTER) network:

From the data package: “This data package contains data on lizards sampled by pitfall traps located at 11 consumer plots at Jornada Basin LTER site from 1989-2006. The objective of this study is to observe how shifts in vegetation resulting from desertification processes in the Chihuahaun desert have changed the spatial and temporal availability of resources for consumers. Desertification changes in the Jornada Basin include changes from grass to shrub dominated communities and major soil changes. If grassland systems respond to rainfall without significant lags, but shrub systems do not, then consumer species should reflect these differences. In addition, shifts from grassland to shrubland results in greater structural heterogeneity of the habitats. We hypothesized that consumer populations, diversity, and densities of some consumers will be higher in grasslands than in shrublands and will be related to the NPP of the sites. Lizards were captured in pitfall traps at the 11 LTER II/III consumer plots (a subset of NPP plots) quarterly for 2 weeks per quarter. Variables measured include species, sex, recapture status, snout-vent length, total length, weight, and whether tail is broken or whole. This study is complete.”

There are 16 total variables in the lizards.csv data we’ll read in. The ones we’ll use in this workshop are:

  • date: data collection date
  • scientific_name: lizard scientific name
  • common_name: lizard common name
  • site: research site code
  • sex: lizard sex (m = male; f = female; j = juvenile)
  • sv_length: snout-vent length (millimeters)
  • total_length: body length (millimeters)
  • toe_num: toe mark number
  • weight: body weight (grams)
  • tail: tail condition (b = broken; w = whole)

2.3 Jornada vegetation spatial data

From Jornada Basin LTER Spatial Data: Dominant Vegetation of the JER and CDRRC in 1998 (Download KMZ 3972 KB) Dominant and subdominant vegetation on the Jornada Experimental Range and Chihuahuan Desert Rangeland Research Center in 1998. Published in Gibbens, R. P., McNeely, R. P., Havstad, K. M., Beck, R. F., & Nolen, B. (2005). Vegetation changes in the Jornada Basin from 1858 to 1998. Journal of Arid Environments, 61(4), 651-668.

3 Set-up

3.1 Get workshop materials

You can get the workshop materials in two ways:

  1. Clone the workshop repo from GitHub to work locally
  2. Create an RStudio Cloud account, and click HERE to get to the project. Make sure you click on ‘Make permanent copy’ so your updates & notes will be stored.

3.2 Create a new R Markdown document or R script

Allison will be working in R Markdown, but you can follow along in either an .Rmd or R script.

3.3 Attach R packages

# General use packages:
library(tidyverse)
library(here)
library(janitor)

# Specifically for plots:
library(patchwork)
library(ggrepel)
library(gghighlight)
library(paletteer)
library(ggExtra)
library(ggbeeswarm)

# Spatial data simplified:
library(sf)

# And for another dataset we'll explore:
library(gapminder)

3.4 Read in the lizard data

lizards <- read_csv(here("data_tidy", "lizards.csv"))

4 ggplot2 Basics Review

First, we’ll review some ggplot2 basics to lay the foundation for the customized data visualizations we’ll build later.

4.1 The essentials

When we start creating a ggplot graph, we need three basic building blocks:

  1. We’re using ggplot
  2. What data we want to use in our graph
  3. What type of graph we’re creating

For example:

# ggplot essential pieces, 3 ways (that do the same thing):

# Like this: 
ggplot(data = lizards, aes(x = total_length, y = weight)) + # That's 1 & 2
  geom_point() # That's 3

# Or, alternatively:
ggplot(data = lizards) +
  geom_point(aes(x = total_length, y = weight))

# Or another way:
ggplot() +
  geom_point(data = lizards, aes(x = total_length, y = weight))

All three produce the same graph.

Because the graph type is set by the geom_ layer, changing graph types is as simple as swapping the geom_:

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_line() # Bad idea, just demonstrating a geom switch.

Keep in mind that some graph types only require one variable - for example, geom_histogram:

ggplot(data = lizards, aes(x = total_length)) +
  geom_histogram()

And remember to carefully consider the type of data you’re trying to visualize, which will help to direct the graph type. For example, a jitterplot usually has one categorical variable and one continuous variable:

ggplot(data = lizards, aes(y = common_name, x = weight)) +
  geom_jitter()

Not sure which type of graph is appropriate for your data? My favorite resource is Yan Holtz’ From Data to Viz - check it out, it is fun and amazing, and links to code examples from the R Graph Gallery.

4.2 Aesthetic mapping

4.2.1 Updating based on a constant? NO aes()!

To change aesthetics of a graph based on a constant (e.g. “Make all the points BLUE”), we can add the information directly to the relevant geom_ layer. Some things to keep in mind:

  • fill: updates fill colors (e.g. column, density, violin, & boxplot interior fill color)
  • color: updates point & border line colors (generally)
  • shape: update point style
  • alpha: update transparency (0 = transparent, 1 = opaque)
  • size: point size or line width (in ggplot2 3.4.0 and later, line width is set with linewidth instead)
  • linetype: update the line type (e.g. “dotted”, “dashed”, “dotdash”, etc.)

If you are updating these by referring to a constant value, they should not be within an aes().

For example, let’s make some nightmares:

ggplot(data = lizards, aes(x = weight)) +
  geom_histogram(color = "orange", 
                 fill = "purple", 
                 linewidth = 2, # border line width; use size = 2 in ggplot2 versions before 3.4.0
                 linetype = "dotted")

Some shapes have both a fill and color aesthetic:

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point(color = "cyan4", 
             fill = "yellow",
             shape = 22, 
             size = 3, 
             alpha = 0.4)
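
To see why constants belong outside aes(), here is a minimal sketch that puts the same value in both places. The toy_lizards data frame is made up for illustration (it is not the workshop's lizards.csv), and layer_data() is used only to peek at the colors ggplot2 will actually draw.

```r
library(ggplot2)

# Made-up toy data (hypothetical values, not the workshop's lizards.csv)
toy_lizards <- data.frame(
  total_length = c(60, 85, 110, 140),
  weight = c(2.1, 4.8, 9.5, 15.2)
)

# Constant OUTSIDE aes(): every point is drawn in blue, with no legend
p_constant <- ggplot(toy_lizards, aes(x = total_length, y = weight)) +
  geom_point(color = "blue")

# Constant INSIDE aes(): "blue" is treated like a one-level variable, so
# ggplot2 picks a color from its default discrete palette and adds a legend
p_mapped <- ggplot(toy_lizards, aes(x = total_length, y = weight)) +
  geom_point(aes(color = "blue"))

# layer_data() returns the computed aesthetics for the first layer
unique(layer_data(p_constant)$colour)
unique(layer_data(p_mapped)$colour)
```

If your points come out a color you didn't ask for, with a surprise legend, a constant sitting inside aes() is a likely culprit.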

4.2.2 Updating an aesthetic based on a variable? YES aes()!

If you want to map a variable onto a graph aesthetic (e.g., point color should be based on lizard species), put it within aes().

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point(aes(color = common_name, size = total_length))

These can be used in combination. For example, if we want the color to be based on species, but the transparency for all points is 0.3:

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point(aes(color = common_name), alpha = 0.3)

4.3 Themes

Quick reminder: there are some built-in themes you can add with + theme_*().

A few useful baselines are:

  • theme_minimal(): a minimal theme with no background annotations
  • theme_bw(): white background with a dark panel border
  • theme_light(): light grey lines and axes, so the data stand out

ggplot(data = lizards, aes(x = site, y = weight)) +
  geom_jitter(aes(color = common_name)) +
  theme_minimal()

4.4 Axis labels

For basic axis labels, I recommend labs():

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point() +
  labs(x = "Total length (mm)",
       y = "Weight (grams)",
       title = "Lizard size")

We’ll learn a few more advanced label skills later on.

4.5 Faceting

Sometimes it’s useful to split up information in a graph into separate panels. For example, maybe we want a separate graph of total length versus weight for each lizard species. It would be really tedious to create them all manually from subsets. Instead, we’ll facet by distinct groups within a variable.

We’ll learn two ways to do this:

  • facet_wrap(): the one where you give it one faceting variable and the panels get wrapped into a grid
  • facet_grid(): the one where you make a grid based on row & column faceting variables

For example, let’s say we just want each species to have its own panel. Then we can use facet_wrap():

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point() +
  facet_wrap(~common_name, ncol = 3, scales = "free")

But what if we want a grid of panels split by lizard sex and by whether the tail is broken or whole? Since two variables define the grid, we’ll use facet_grid():

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point() +
  facet_grid(sex ~ tail)
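
One practical difference between the two is how they treat combinations that don't occur in the data. The sketch below uses made-up toy data (not the workshop's lizards.csv) with no broken-tailed males. The n_panels() helper is hypothetical and reads ggplot2's built layout, which is an internal structure, so treat that part as an assumption that may change between ggplot2 versions.

```r
library(ggplot2)

# Made-up toy data: there is no male ("m") lizard with a broken ("b") tail
toy_lizards <- data.frame(
  total_length = c(60, 85, 110, 140, 95),
  weight = c(2.1, 4.8, 9.5, 15.2, 6.0),
  sex = c("f", "f", "m", "m", "m"),
  tail = c("b", "w", "w", "w", "w")
)

base_plot <- ggplot(toy_lizards, aes(x = total_length, y = weight)) +
  geom_point()

# facet_grid() crosses every row value with every column value,
# so the empty m/b combination still gets a (blank) panel
p_grid <- base_plot + facet_grid(sex ~ tail)

# facet_wrap() only makes panels for combinations present in the data
p_wrap <- base_plot + facet_wrap(~ sex + tail)

# Hypothetical helper: count panels from the built plot's layout table
n_panels <- function(p) nrow(ggplot_build(p)$layout$layout)

n_panels(p_grid)
n_panels(p_wrap)
```

An empty panel in a facet_grid() can be useful information in itself (no observations for that group), whereas facet_wrap() keeps things compact.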

4.6 Getting things in order

ggplot loves putting things in alphabetical order - but that’s rarely the order you actually want things in if you have categorical groups. Let’s find some total counts of lizards in the dataset by common name, then make a column graph:

lizard_counts <- lizards %>% 
  count(common_name)

ggplot(data = lizard_counts, aes(y = fct_reorder(common_name, n), x = n)) +
  geom_col()
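
To see what fct_reorder() is doing behind the scenes, here is a small sketch with made-up counts (toy_counts is hypothetical, not computed from the workshop data). It only changes the order of the factor levels, which is what ggplot2 uses to arrange discrete axes.

```r
library(forcats)

# Made-up counts for three placeholder species
toy_counts <- data.frame(
  common_name = c("species a", "species b", "species c"),
  n = c(5, 20, 1)
)

# Default factor levels are alphabetical
levels(factor(toy_counts$common_name))

# Reorder levels by count, smallest first
by_count <- fct_reorder(toy_counts$common_name, toy_counts$n)
levels(by_count)

# Or largest first
by_count_desc <- fct_reorder(toy_counts$common_name, toy_counts$n, .desc = TRUE)
levels(by_count_desc)
```

On a y-axis the first level is drawn at the bottom, so the default (ascending) order puts the most common species at the top of a horizontal column graph.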

4.7 ggplot basics: synthesis examples

Example 1: A quick review of basics, including:

  • ggplot essentials
  • aesthetic mapping
  • themes
  • facet_wrap & facet_grid
  • labels with labs

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point(aes(color = common_name, shape = common_name), 
             fill = "black",
             size = 2) +
  theme_minimal() +
  labs(x = "Total length (mm)",
       y = "Weight (g)",
       color = "Lizard species") +
  facet_wrap(~common_name, scales = "free")

Example 2: Reminders of position, facet_grid, and factor reordering

Let’s make a stacked column graph of lizard species by site:

ggplot(data = lizards, aes(y = fct_infreq(common_name))) +
  geom_bar(aes(fill = site)) +
  theme_bw() +
  labs(x = "Lizard counts",
       y = "Species (common name)") +
  facet_grid(sex ~ tail)

# That annoying space below zero? Let's keep that in mind...

5 Advanced ggplot2 customization

5.1 An unsung hero: scales

The scales package in R is truly an unsung hero of finalizing ggplot graphs. To hear more, I strongly recommend watching Dana Seidel’s 20 minute talk on The little package that could: Taking visualizations to the next level with the scales package from rstudio::conf(2020).

Why does that matter to us? Because a whole lot of the subtle things that make a graph way better are updated using the scales suite of helpful functions.

For a complete list of scales functions & usage, see: https://scales.r-lib.org/index.html

5.1.1 Thoughtful breaks, limits & labels

Little things make a big difference in data visualization. Just like we should take great care to make axis labels useful and complete, we also need to think about how values are communicated for our different variables.

In 2-D data visualization, that means customizing your breaks, limits, & tick mark labels & formatting. From Hadley Wickham & Dana Seidel: “The most common use of the scales package is to control the appearance of axis and legend labels. Use a break_ function to control how breaks are generated from the limits, and a label_ function to control how breaks are turned in to labels.”
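
As a quick illustration of that break_ / label_ pattern, here is a minimal sketch using two scales helpers on their own, outside of any plot. The input numbers are arbitrary examples; the point is that each helper returns a function, which is why it can be handed to the breaks = or labels = argument of a scale.

```r
library(scales)

# breaks_width() returns a function that builds evenly spaced breaks
# from a pair of limits
ten_breaks <- breaks_width(10)
ten_breaks(c(0, 70))

# label_comma() returns a function that turns numbers into
# comma-separated labels
comma_labels <- label_comma()
comma_labels(c(1000, 2500000))
```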

Let’s explore some different ways to update breaks and labels.

5.1.1.1 Updating breaks & labels

The important thing: know what type of variable you have on each axis so that you know what scale_ version to call. For example:

  • For dates: scale_*_date()
  • For continuous variables: scale_*_continuous()
  • For discrete variables: scale_*_discrete()

Within those layers added to your plot, you can update the breaks =, limits =, labels = and expand = options.

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point()

ggplot(data = lizards, aes(x = total_length, y = weight)) +
  geom_point() +
  scale_x_continuous(breaks = c(0, 250, 500), 
                     limits = c(0, 500)) +
  scale_y_continuous(breaks = seq(from = 0, to = 70, by = 10), 
                     limits = c(0, 70)) +
  theme_light()
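
The labels = and expand = options work the same way. Here is a minimal sketch, assuming made-up toy data (not the workshop's lizards.csv) and a hypothetical mm_labels helper built with scales::label_number(); the specific breaks and limits are arbitrary choices for the toy values.

```r
library(ggplot2)
library(scales)

# Made-up toy data (hypothetical values)
toy_lizards <- data.frame(
  total_length = c(60, 85, 110, 140),
  weight = c(2.1, 4.8, 9.5, 15.2)
)

# Hypothetical label helper: append a unit to each tick label
mm_labels <- label_number(suffix = " mm")
mm_labels(c(0, 100, 200))

p <- ggplot(toy_lizards, aes(x = total_length, y = weight)) +
  geom_point() +
  scale_x_continuous(breaks = c(0, 100, 200),
                     limits = c(0, 200),
                     labels = mm_labels,
                     expand = c(0, 0)) + # no padding beyond the x limits
  scale_y_continuous(limits = c(0, 20),
                     # no padding below zero, 5% padding above the top limit
                     expand = expansion(mult = c(0, 0.05))) +
  theme_light()

p
```

Setting the lower expansion to zero is one way to remove the gap between zero and the axis line.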