Palmer Archipelago Penguins Data in the palmerpenguins R Package - An Alternative to Anderson’s Irises

In 1935, Edgar Anderson collected size measurements for 150 flowers from three species of Iris on the Gaspé Peninsula in Quebec, Canada. Since then, Anderson’s Iris observations have become a classic dataset in statistics, machine learning, and data science teaching materials. It is included in the base R datasets package as iris, making it easy for users to access without knowing much about it. However, the lack of data documentation, presence of non-intuitive variables (e.g. “sepal width”), and perfectly balanced groups with zero missing values make iris an inadequate and stale dataset for teaching and learning modern data science skills. Users would benefit from working with a more representative, real-world environmental dataset with a clear link to current scientific research. Importantly, Anderson’s Iris data appeared in a 1936 publication by R. A. Fisher in the Annals of Eugenics (which is often the first-listed citation for the dataset), inextricably linking iris to eugenics research. Thus, a modern alternative to iris is needed. In this paper, we introduce the palmerpenguins R package (Horst et al. 2020), which includes body size measurements collected from 2007 - 2009 for three species of Pygoscelis penguins that breed on islands throughout the Palmer Archipelago, Antarctica. The penguins dataset in palmerpenguins provides an approachable, charismatic, and near drop-in replacement for iris with topical relevance for polar climate change and environmental impacts on marine predators. Since the release on CRAN in July 2020, the palmerpenguins package has been downloaded over 462,000 times, highlighting the demand and widespread adoption of this viable iris alternative. 
We directly compare the iris and penguins datasets for selected analyses to demonstrate that R users, in particular teachers and learners currently using iris, can switch to the Palmer Archipelago penguins for many use cases including data wrangling, visualization, linear modeling, multivariate analysis (e.g., PCA), cluster analysis and classification (e.g., by k-means).

Allison M. Horst (University of California Santa Barbara, https://bren.ucsb.edu/), Alison Presmanes Hill (Voltron Data, https://voltrondata.com/), Kristen B. Gorman (University of Alaska Fairbanks, https://www.uaf.edu/cfos/)

Introduction

In 1935, American botanist Edgar Anderson measured petal and sepal structural dimensions (length and width) for 50 flowers from each of three Iris species: Iris setosa, Iris versicolor, and Iris virginica (Anderson 1935). The manageable but non-trivial size (5 variables and 150 total observations) and characteristics of Anderson’s Iris dataset, including linear relationships and multivariate normality, have made it amenable for introducing a wide range of statistical methods including data wrangling, visualization, linear modeling, multivariate analyses, and machine learning. The Iris dataset is built into a number of software packages including the auto-installed datasets package in R (as iris, R Core Team 2021), Python’s scikit-learn machine learning library (Pedregosa et al. 2011), and the SAS Sashelp library (SAS Institute, Cary NC), which has facilitated its widespread use. As a result, eighty-six years after the data were initially published, the Iris dataset remains ubiquitous in statistics, computational methods, software documentation, and data science courses and materials.

There are a number of reasons that modern data science practitioners and educators may want to move on from iris. First, the dataset lacks metadata (Anderson 1935), which does not reinforce best practices and limits meaningful interpretation and discussion of research methods, analyses, and outcomes. Of the five variables in iris, two (Sepal.Width and Sepal.Length) are not intuitive for most non-botanists. Even with explanation, the difference between petal and sepal dimensions is not obvious. Second, iris contains equal sample sizes for each of the three species (n = 50) with no missing values, which is cleaner than most real-world data that learners are likely to encounter. Third, the single factor (Species) in iris limits options for analyses. Finally, due to its publication in the Annals of Eugenics by statistician R.A. Fisher (Fisher 1936), iris is burdened by a history in eugenics research, which we are committed to addressing through the development of new data science education products as described below.

Given the growing need for fresh data science-ready datasets, we sought to identify an alternative dataset that could be made easily accessible for a broad audience. After evaluating the positive and negative features of iris in data science and statistics materials, we established the following criteria for a suitable alternative:

Here, we describe an alternative to iris that largely satisfies these criteria: a refreshing, approachable, and charismatic dataset containing real-world body size measurements for three Pygoscelis penguin species that breed throughout the Western Antarctic Peninsula region, made available through the United States Long-Term Ecological Research (US LTER) Network. By comparing data structure, size, and a range of analyses side-by-side for the two datasets, we demonstrate that the Palmer Archipelago penguin data are an ideal substitute for iris for many use cases in statistics and data science education.

Figure 1: The palmerpenguins package hex sticker designed by Allison Horst

Data source

Data were collected from 2007 - 2009 by Dr. Kristen Gorman in collaboration with the Palmer Station LTER, part of the US LTER Network, for adult male and female Adélie (P. adeliae), chinstrap (P. antarcticus), and gentoo (P. papua) penguins on three islands (Biscoe, Dream, and Torgersen) within the Palmer Archipelago. The data include body size measurements (bill length and depth, flipper length, and body mass; flippers are the modified “wings” of penguins used for maneuvering in water), clutch (i.e., egg laying) observations (e.g., date of first egg laid, and clutch completion), and carbon (\(^{13}\)C/\(^{12}\)C, \(\delta^{13}\)C) and nitrogen (\(^{15}\)N/\(^{14}\)N, \(\delta^{15}\)N) stable isotope values of red blood cells. For complete data collection methods and published analyses, see Gorman et al. (2014). Throughout this paper, penguin species are referred to as “Adélie”, “Chinstrap”, and “Gentoo”.

The data in the palmerpenguins R package are available under a CC0 license (“No Rights Reserved”) in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy, and were imported from the Environmental Data Initiative (EDI) Data Portal (Palmer Station Antarctica LTER and K. B. Gorman 2020a, 2020b, 2020c; links are given in the reference list).

The palmerpenguins R package

R users can install the palmerpenguins package from CRAN:

install.packages("palmerpenguins")

Information, examples, and links to community-contributed materials are available on the palmerpenguins package website: allisonhorst.github.io/palmerpenguins/. See the Appendix for how Python and Julia users can access the same data.

The palmerpenguins R package contains two data objects: penguins_raw and penguins. The penguins_raw data consist of all raw data for 17 variables, recorded completely or in part for 344 individual penguins, accessed directly from EDI (penguins_raw properties are summarized in Appendix B). We generally recommend using the curated data in penguins, which is a subset of penguins_raw retaining all 344 observations, minimally updated (Appendix A) and reduced to the following eight variables: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year.
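As a quick check of the two data objects described above, the following minimal sketch inspects their dimensions and column classes. It assumes palmerpenguins is installed; the object names (raw_dim, cur_dim, penguin_classes) are illustrative only and are not part of the package.

```r
library(palmerpenguins)

# dim() returns rows x columns (the text reports columns x rows)
raw_dim <- dim(penguins_raw)
cur_dim <- dim(penguins)

# First class of each column in the curated data
penguin_classes <- vapply(penguins, function(x) class(x)[1], character(1))

print(raw_dim)
print(cur_dim)
print(penguin_classes)
```

Note that class() reports double columns as "numeric", so the class counts are labelled slightly differently from the double/int/factor wording used elsewhere in this paper.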

The same data exist as comma-separated value (CSV) files in the package (“penguins_raw.csv” and “penguins.csv”), and can be read in using the built-in path_to_file() function in palmerpenguins. For example,

library(palmerpenguins)
# path_to_file() returns the path to a CSV file bundled with the package
df <- read.csv(path_to_file("penguins.csv"))

will read in “penguins.csv” as if from an external file, thus automatically parsing species, island, and sex variables as characters instead of factors. This option allows users opportunities to practice or demonstrate reading in data from a CSV, then updating variable class (e.g., characters to factors).
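The class-updating exercise mentioned above could look like the following sketch. It uses only base R; stringsAsFactors = FALSE is passed explicitly (an addition here, not in the example above) so that the behavior is the same on R versions before 4.0, and the names penguins_csv and chr_cols are illustrative.

```r
library(palmerpenguins)

# Read the bundled CSV as if it were an external file
penguins_csv <- read.csv(path_to_file("penguins.csv"), stringsAsFactors = FALSE)

# Columns that arrive as character vectors
chr_cols <- c("species", "island", "sex")
classes_before <- vapply(penguins_csv[chr_cols], class, character(1))

# Convert those columns to factors
penguins_csv[chr_cols] <- lapply(penguins_csv[chr_cols], factor)
classes_after <- vapply(penguins_csv[chr_cols], class, character(1))

print(classes_before)
print(classes_after)
```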

Comparing iris and penguins

The penguins data in palmerpenguins is useful and approachable for data science and statistics education, and is uniquely well-suited to replace the iris dataset. Comparisons presented are selected examples for common iris uses, and are not exhaustive.

Table 1: Overview comparison of penguins and iris dataset features and characteristics.

Feature | iris | penguins
Year(s) collected | 1935 | 2007 - 2009
Dimensions (col x row) | 5 x 150 | 8 x 344
Documentation | minimal | complete metadata
Variable classes | double (4), factor (1) | double (2), int (3), factor (3)
Missing values? | no (n = 0; 0.0%) | yes (n = 19; 0.7%)

Data structure and sample size

Both iris and penguins are in tidy format (Wickham 2014) with each column denoting a single variable and each row containing measurements for a single iris flower or penguin, respectively. The two datasets are comparable in size: dimensions (columns × rows) are 5 × 150 and 8 × 344 for iris and penguins, respectively, and sample sizes within species are similar (Tables 1 & 2).

Notably, while sample sizes in iris across species are all the same, sample sizes in penguins differ across the three species. The inclusion of three factor variables in penguins (species, island, and sex), along with year, creates additional opportunities for grouping, faceting, and analysis compared to the single factor (Species) in iris.

Unlike iris, which contains only complete cases, the penguins dataset contains a small number of missing values (\(n_{missing}\) = 19, out of 2,752 total values). Missing values and unequal sample sizes are common in real-world data, and add learning opportunities to the penguins dataset.
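The missing-value counts quoted above can be checked with a few lines of base R. This is a minimal sketch assuming palmerpenguins is installed; the variable names are illustrative.

```r
library(palmerpenguins)

# Missing values per column
na_by_column <- colSums(is.na(penguins))

# Total missing values and their share of all cells
n_missing <- sum(na_by_column)
n_cells <- nrow(penguins) * ncol(penguins)
pct_missing <- round(100 * n_missing / n_cells, 1)

# Rows with no missing values in any column
n_complete <- sum(complete.cases(penguins))

print(na_by_column)
print(c(n_missing = n_missing, n_cells = n_cells, pct_missing = pct_missing))
```

The same approach applied to iris returns zero for every column, which makes for a simple side-by-side contrast in a lesson on handling incomplete data.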

Table 2: Grouped sample size for iris (by species; n = 150 total) and penguins (by species and sex; n = 344 total). Data in penguins can be further grouped by island and study year.

iris sample size (by species)
Iris species | Sample size
setosa | 50
versicolor | 50
virginica | 50

penguins sample size (by species and sex)
Penguin species | Female | Male | NA
Adélie | 73 | 73 | 6
Chinstrap | 34 | 34 | 0
Gentoo | 58 | 61 | 5
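Grouped counts like those in Table 2 can be produced with base R's table(). The sketch below assumes palmerpenguins is installed; note that the species factor in the package is spelled "Adelie" without the accent, and useNA = "ifany" is needed to show penguins with unrecorded sex.

```r
library(palmerpenguins)

# iris: one grouping factor
iris_counts <- table(iris$Species)

# penguins: species by sex, keeping a column for missing sex
penguin_counts <- table(penguins$species, penguins$sex, useNA = "ifany")

# Further grouping by island and study year
island_year_counts <- table(penguins$island, penguins$year)

print(iris_counts)
print(penguin_counts)
print(island_year_counts)
```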

Continuous quantitative variables

Distributions, relationships between variables, and clustering can be visually explored between species for the four structural size measurements in penguins (flipper length, body mass, bill length and depth; Figure 2) and iris (sepal width and length, petal width and length; Figure 3).

Figure 2: Distributions and correlations for numeric variables in the penguins data (flipper length (mm), body mass (g), bill length (mm) and bill depth (mm)) for the three observed species: Gentoo (green, triangles); Chinstrap (blue, circles); and Adélie (orange, squares). Significance indicated for bivariate correlations: *p < 0.05; **p < 0.01; ***p < 0.001.

Figure 3: Distributions and correlations for numeric variables in iris (petal length (cm), petal width (cm), sepal length (cm) and sepal width (cm)) for the three included iris species: Iris setosa (light gray, circles); Iris versicolor (dark gray, triangles); and Iris virginica (black, squares). Significance indicated for bivariate correlations: *p < 0.05; **p < 0.01; ***p < 0.001.

Both penguins and iris offer numerous opportunities to explore linear relationships and correlations, within and across species (Figures 2 & 3). A bivariate scatterplot made with the iris dataset reveals a clear linear relationship between petal length and petal width. Using penguins (Figure 4), we can create a similar scatterplot with flipper length and body mass. The overall trend across all three species is approximately linear for both iris and penguins. Teachers may encourage students to explore how simple linear regression results and predictions differ when the species variable is omitted, compared to, for example, multiple linear regression with species included (Figure 4).

Figure 4: Representative linear relationships for (A): penguin flipper length (mm) and body mass (g) for Adélie (orange circles), Chinstrap (blue triangles), and Gentoo (green squares) penguins; (B): iris petal length (cm) and width (cm) for Iris setosa (light gray circles), Iris versicolor (dark gray triangles) and Iris virginica (black squares). Within-species linear model is visualized for each penguin or iris species.
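One way to set up the regression comparison suggested above is sketched here with base R's lm(). This is an illustration rather than the code behind Figure 4: the model formulas, the object names, and the 200 mm example flipper length are all assumptions chosen for the exercise.

```r
library(palmerpenguins)

# Simple linear regression: species omitted
fit_simple <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Multiple linear regression: species included as a predictor
fit_species <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins)

# Compare coefficients and R-squared between the two models
print(coef(fit_simple))
print(coef(fit_species))
r2 <- c(simple = summary(fit_simple)$r.squared,
        with_species = summary(fit_species)$r.squared)
print(r2)

# Predicted body mass for a hypothetical Adelie penguin with a 200 mm flipper
new_obs <- data.frame(
  flipper_length_mm = 200,
  species = factor("Adelie", levels = levels(penguins$species))
)
pred_simple <- predict(fit_simple, newdata = new_obs)
pred_species <- predict(fit_species, newdata = new_obs)
print(c(simple = unname(pred_simple), with_species = unname(pred_species)))
```

By default lm() drops rows with missing values, so both models are fit to the same penguins with complete flipper length and body mass records.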

Notably, distinctions between species are clearer for iris petals - particularly, the much smaller petals for Iris setosa - compared to penguins, in which Adélie and Chinstrap penguins are largely overlapping in body size (body mass and flipper length), and are both generally smaller than Gentoo penguins.

Simpson’s Paradox is a data phenomenon in which a trend observed between variables is reversed when data are pooled, omitting a meaningful variable. While often taught and discussed in statistics courses, finding a real-world and approachable example of Simpson’s Paradox can be a challenge. Here, we show one (of several possible - see Figure 2) Simpson’s Paradox example in penguins: exploring bill dimensions with and without species included (Figure 5). When penguin species is omitted (Figure 5A), bill length and depth appear negatively correlated overall. The trend is reversed when species is included, revealing an obviously positive correlation between bill length and bill depth within species (Figure 5B).

Figure 5: Trends for penguin bill dimensions (bill length and bill depth, millimeters) if the species variable is excluded (A) or included (B), illustrating Simpson’s Paradox. Note: linear regression for bill dimensions without including species in (A) is ill-advised; the linear trendline is only included to visualize trend reversal for Simpson’s Paradox when compared to (B).
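The trend reversal can also be seen numerically by comparing the pooled correlation with the within-species correlations. The following is a minimal base R sketch assuming palmerpenguins is installed; object names are illustrative.

```r
library(palmerpenguins)

# Work with a plain data frame and keep rows with both bill measurements
penguins_df <- as.data.frame(penguins)
has_bills <- complete.cases(penguins_df[, c("bill_length_mm", "bill_depth_mm")])
bills <- penguins_df[has_bills, ]

# Correlation with species ignored (pooled data)
cor_pooled <- cor(bills$bill_length_mm, bills$bill_depth_mm)

# Correlation within each species
cor_by_species <- sapply(split(bills, bills$species), function(d) {
  cor(d$bill_length_mm, d$bill_depth_mm)
})

print(cor_pooled)
print(cor_by_species)
```

The pooled coefficient should come out negative while each within-species coefficient is positive, matching the pattern in Figure 5.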

Principal component analysis

Principal component analysis (PCA) is a dimensionality reduction method commonly used to explore patterns in multivariate data. The iris dataset frequently appears in PCA tutorials due to multivariate normality and clear interpretation of variable loadings and clustering.

A comparison of PCA with the four variables of structural size measurements in penguins and iris (both normalized prior to PCA) reveals highly similar results (Figure 6). For both datasets, one species is distinct (Gentoo penguins, and setosa irises) while the other two species (Chinstrap/Adélie and versicolor/virginica) appear somewhat overlapping in the first two principal components (Figure 6 A,B). Screeplots reveal that the variance explained by each principal component (PC) is very similar across the two datasets: the first two PCs together capture 88.15% of total variance for penguins, compared to 95.81% for iris (Figure 6 C,D).

Figure 6: Principal component analysis biplots and screeplots for structural size measurements in penguins (A,C) and iris (B,D), revealing similarities in multivariate patterns, variable loadings, and variance explained by each component. For penguins, variables are flipper length (mm), body mass (g), bill length (mm) and bill depth (mm); groups are visualized by species (Adélie = orange circles, Chinstrap = blue triangles, Gentoo = green squares). For iris, variables are petal length (cm), petal width (cm), sepal length (cm) and sepal width (cm); groups are visualized by species (Iris setosa = light gray circles, Iris versicolor = dark gray triangles, Iris virginica = black squares). Values above screeplot columns (C,D) indicate percent of total variance explained by each of the four principal components.
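A comparable PCA can be sketched with base R's prcomp(), centering and scaling each variable first. This is not necessarily how Figure 6 was produced; prcomp() is used here as a stand-in, rows with missing measurements are dropped, and the helper var_explained() is hypothetical.

```r
library(palmerpenguins)

# Four structural size variables, complete rows only
size_vars <- c("flipper_length_mm", "body_mass_g", "bill_length_mm", "bill_depth_mm")
penguin_sizes <- na.omit(as.data.frame(penguins)[, size_vars])

# PCA on centered and scaled variables
pca_penguins <- prcomp(penguin_sizes, center = TRUE, scale. = TRUE)
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Percent of total variance explained by each component
var_explained <- function(pca) 100 * pca$sdev^2 / sum(pca$sdev^2)

pct_penguins <- var_explained(pca_penguins)
pct_iris <- var_explained(pca_iris)

print(round(pct_penguins, 2))
print(round(pct_iris, 2))
print(round(c(penguins = sum(pct_penguins[1:2]), iris = sum(pct_iris[1:2])), 2))
```

Under these assumptions, the cumulative percentages for the first two components should land close to the values reported in the text.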

K-means clustering

Unsupervised clustering by k-means is a common and popular entryway to machine learning and classification, and again, the iris dataset is frequently used in introductory examples. The penguins data provides similar opportunities for introducing k-means clustering. For simplicity, we compare k-means clustering using only two variables for each dataset: for iris, petal width and petal length, and for penguins, bill length and bill depth. All variables are scaled prior to k-means. Three clusters (k = 3) are specified for each, since there are three species of irises (Iris setosa, Iris versicolor, and Iris virginica) and penguins (Adélie, Chinstrap and Gentoo).

K-means clustering with penguin bill dimensions and iris petal dimensions yields largely distinct clusters, each dominated by one species (Figure 7). For iris petal dimensions, k-means yields a perfectly separated cluster (Cluster 3) containing all 50 Iris setosa observations and zero misclassified Iris virginica or Iris versicolor (Table 3). While clustering is not perfectly distinct for any penguin species, each species is largely contained within a single cluster, with little overlap from the other two species. For example, considering Adélie penguins (orange observations in Figure 7A): 147 (out of 151) Adélie penguins are assigned to Cluster 3, zero are assigned to Cluster 1, and 4 are assigned to the Chinstrap-dominated Cluster 2 (Table 3). Only 5 (of 68) Chinstrap penguins and 1 (of 123) Gentoo penguins are assigned to the Adélie-dominated Cluster 3 (Table 3).

Figure 7: K-means clustering outcomes for penguin bill dimensions (A) and iris petal dimensions (B). Numbers indicate the cluster to which an observation was assigned, revealing a high degree of separation between species for both penguins and iris. Penguin species (Adélie = orange, Chinstrap = blue, Gentoo = green) and iris species (setosa = light gray, versicolor = medium gray, virginica = dark gray), along with bill dimensions and cluster number, are included in the tooltip when hovering.

Table 3: K-means cluster assignments by species based on penguin bill length (mm) and depth (mm), and iris petal length (cm) and width (cm).

Penguins cluster assignments
Cluster | Adélie | Chinstrap | Gentoo
1 | 0 | 9 | 116
2 | 4 | 54 | 6
3 | 147 | 5 | 1

Iris cluster assignments
Cluster | setosa | versicolor | virginica
1 | 0 | 2 | 46
2 | 0 | 48 | 4
3 | 50 | 0 | 0
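A cross-tabulation like Table 3 can be sketched with base R's kmeans() on scaled variables. The seed and nstart values below are arbitrary choices made for this illustration, not values taken from the paper, and cluster numbers are assigned arbitrarily, so the row order may differ from Table 3 even when the groupings agree.

```r
library(palmerpenguins)

set.seed(2020)  # arbitrary seed, for reproducibility of this sketch only

# Penguins: scaled bill length and depth, complete rows only
bills <- na.omit(as.data.frame(penguins)[, c("species", "bill_length_mm", "bill_depth_mm")])
bills_scaled <- scale(bills[, c("bill_length_mm", "bill_depth_mm")])
km_penguins <- kmeans(bills_scaled, centers = 3, nstart = 25)
penguin_table <- table(cluster = km_penguins$cluster, species = bills$species)

# Iris: scaled petal length and width
petals_scaled <- scale(iris[, c("Petal.Length", "Petal.Width")])
km_iris <- kmeans(petals_scaled, centers = 3, nstart = 25)
iris_table <- table(cluster = km_iris$cluster, species = iris$Species)

print(penguin_table)
print(iris_table)
```

Using several random starts (nstart) reduces the chance of reporting a poor local solution, which is a useful discussion point when introducing k-means.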

Conclusion

Here, we have shown that structural size measurements for Palmer Archipelago Pygoscelis penguins, available as penguins in the palmerpenguins R package, offer a near drop-in replacement for iris in a number of common use cases for data science and statistics education including exploratory data visualization, linear correlation and regression, PCA, and clustering by k-means. In addition, teaching and learning opportunities in penguins are increased due to a greater number of variables, missing values, unequal sample sizes, and Simpson’s Paradox examples. Importantly, the penguins dataset encompasses real-world information derived from several charismatic marine predator species with regional breeding populations notably responding to environmental change occurring throughout the Western Antarctic Peninsula region of the Southern Ocean (see Bestelmeyer et al. (2011), Gorman et al. (2014), Gorman et al. (2017), Gorman et al. (2021)). Thus, the penguins dataset can facilitate discussions more broadly on biodiversity responses to global change - a contemporary and critical topic in ecology, evolution, and the environmental sciences.

Appendix A: Penguins data processing

Data in the penguins object have been minimally updated from penguins_raw as follows:

Appendix B: Summary of the penguins_raw dataset

Feature | penguins_raw
Year(s) collected | 2007 - 2009
Dimensions (col x row) | 17 x 344
Documentation | complete metadata
Variable classes | character (9), Date (1), numeric (7)
Missing values? | yes (n = 336; 5.7%)

Appendix C: palmerpenguins for other programming languages

Python: Python users can load the palmerpenguins datasets into their Python environment using the following code to install and access data in the palmerpenguins Python package:

# Run in a shell to install the package:
pip install palmerpenguins

# Then, in Python:
from palmerpenguins import load_penguins
penguins = load_penguins()

Julia: Julia users can access the penguins data in the PalmerPenguins.jl package by David Widmann; more information can be found in the package documentation. Example code to import the penguins data through PalmerPenguins.jl:

julia> using PalmerPenguins
julia> table = PalmerPenguins.load()

TensorFlow: TensorFlow users can access the penguins data in TensorFlow Datasets. Information and examples for penguins data in TensorFlow can be found in the TensorFlow Datasets documentation.

Acknowledgements

All analyses were performed in the R language environment using version 4.1.2 (R Core Team 2021). Complete code for this paper is shared in the Supplemental Material. We acknowledge the following R packages used in analyses, with gratitude to developers and contributors:

CRAN packages used

datasets, palmerpenguins, GGally, ggiraph, ggplot2, kableExtra, paletteer, colorblindr, patchwork, plotly, recipes, broom, shadowtext, tidyverse

CRAN Task Views implied by cited packages

Spatial, TeachingStatistics, WebTechnologies

References

E. Anderson. The irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59: 2–5, 1935.
B. T. Bestelmeyer, A. M. Ellison, W. R. Fraser, K. B. Gorman, S. J. Holbrook, C. M. Laney, M. D. Ohman, D. P. C. Peters, F. C. Pillsbury, A. Rassweiler, et al. Analysis of abrupt transitions in ecological systems. Ecosphere, 2(12): art129, 2011. URL http://doi.wiley.com/10.1890/ES11-00216.1 [online; last accessed March 27, 2021].
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2): 179–188, 1936. URL http://doi.wiley.com/10.1111/j.1469-1809.1936.tb02137.x [online; last accessed July 1, 2020].
D. Gohel and P. Skintzos. ggiraph: Make ’ggplot2’ graphics interactive. 2022. URL https://CRAN.R-project.org/package=ggiraph. R package version 0.8.2.
K. B. Gorman, K. E. Ruck, T. D. Williams and W. R. Fraser. Advancing the Sea Ice Hypothesis: Trophic Interactions Among Breeding Pygoscelis Penguins With Divergent Population Trends Throughout the Western Antarctic Peninsula. Frontiers in Marine Science, 8: 526092, 2021. URL https://www.frontiersin.org/articles/10.3389/fmars.2021.526092/full [online; last accessed September 25, 2021].
K. B. Gorman, S. L. Talbot, S. A. Sonsthagen, G. K. Sage, M. C. Gravely, W. R. Fraser and T. D. Williams. Population genetic structure and gene flow of Adélie penguins (Pygoscelis adeliae) breeding throughout the western Antarctic Peninsula. Antarctic Science, 29(6): 499–510, 2017. URL https://www.cambridge.org/core/product/identifier/S0954102017000293/type/journal_article [online; last accessed March 27, 2021].
K. B. Gorman, T. D. Williams and W. R. Fraser. Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE, 9(3): e90081, 2014. URL https://dx.plos.org/10.1371/journal.pone.0090081 [online; last accessed July 1, 2020].
A. Horst, A. Hill and K. Gorman. palmerpenguins: Palmer Archipelago (Antarctica) penguin data. 2020. URL https://CRAN.R-project.org/package=palmerpenguins. R package version 0.1.0.
E. Hvitfeldt. paletteer: Comprehensive collection of color palettes. 2021. URL https://github.com/EmilHvitfeldt/paletteer. R package version 1.3.0.
M. Kuhn and H. Wickham. recipes: Preprocessing and feature engineering steps for modeling. 2021. URL https://CRAN.R-project.org/package=recipes. R package version 0.1.17.
Palmer Station Antarctica LTER and K. B. Gorman. Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009. 2020a. URL https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-pal.219.5 [online; last accessed July 1, 2020].
Palmer Station Antarctica LTER and K. B. Gorman. Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguin (Pygoscelis antarctica) nesting along the Palmer Archipelago near Palmer Station, 2007-2009. 2020b. URL https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-pal.221.6 [online; last accessed July 1, 2020].
Palmer Station Antarctica LTER and K. B. Gorman. Structural size measurements and isotopic signatures of foraging among adult male and female Gentoo penguin (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009. 2020c. URL https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-pal.220.5 [online; last accessed July 1, 2020].
T. L. Pedersen. patchwork: The composer of plots. 2020. URL https://CRAN.R-project.org/package=patchwork. R package version 1.1.1.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830, 2011.
R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2021. URL https://www.R-project.org/.
D. Robinson, A. Hayes and S. Couch. broom: Convert statistical objects into tidy tibbles. 2022. URL https://CRAN.R-project.org/package=broom. R package version 0.7.11.
B. Schloerke, D. Cook, J. Larmarange, F. Briatte, M. Marbach, E. Thoen, A. Elberg and J. Crowley. GGally: Extension to ggplot2. 2021. URL https://CRAN.R-project.org/package=GGally. R package version 2.1.2.
C. Sievert, C. Parmer, T. Hocking, S. Chamberlain, K. Ram, M. Corvellec and P. Despouy. plotly: Create interactive web graphics via plotly.js. 2021. URL https://CRAN.R-project.org/package=plotly. R package version 4.10.0.
H. Wickham. Tidy Data. Journal of Statistical Software, 59(10), 2014. URL http://www.jstatsoft.org/v59/i10/ [online; last accessed July 1, 2020].
H. Wickham, M. Averick, J. Bryan, W. Chang, L. D. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, et al. Welcome to the tidyverse. Journal of Open Source Software, 4(43): 1686, 2019. DOI 10.21105/joss.01686.
H. Wickham, W. Chang, L. Henry, T. L. Pedersen, K. Takahashi, C. Wilke, K. Woo, H. Yutani and D. Dunnington. ggplot2: Create elegant data visualisations using the grammar of graphics. 2021. URL https://CRAN.R-project.org/package=ggplot2. R package version 3.3.5.
G. Yu. shadowtext: Shadow text grob and layer. 2022. URL https://github.com/GuangchuangYu/shadowtext/. R package version 0.1.1.
H. Zhu. kableExtra: Construct complex table with kable and pipe syntax. 2021. URL https://CRAN.R-project.org/package=kableExtra. R package version 1.3.4.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".