In 1935, Edgar Anderson collected size measurements for 150
flowers from three species of Iris on the Gaspé Peninsula in
Quebec, Canada. Since then, Anderson’s Iris observations have
become a classic dataset in statistics, machine learning, and data
science teaching materials. It is included in the base R datasets
package as iris, making it easy for users to access without
knowing much about it. However, the lack of data documentation, presence
of non-intuitive variables (e.g. “sepal width”), and perfectly balanced
groups with zero missing values make iris an inadequate and
stale dataset for teaching and learning modern data science skills.
Users would benefit from working with a more representative, real-world
environmental dataset with a clear link to current scientific research.
Importantly, Anderson’s Iris data appeared in a 1936
publication by R. A. Fisher in the Annals of Eugenics (which is
often the first-listed citation for the dataset), inextricably linking
iris to eugenics research. Thus, a modern alternative to iris is
needed. In this paper, we introduce the
palmerpenguins R package (Horst et al. 2020), which includes body
size measurements collected from 2007 - 2009 for three species of
Pygoscelis penguins that breed on islands throughout the Palmer
Archipelago, Antarctica. The penguins dataset in
palmerpenguins provides an approachable, charismatic, and near drop-in
replacement for iris with topical relevance for polar
climate change and environmental impacts on marine predators. Since the
release on CRAN in July 2020, the palmerpenguins package has been
downloaded over 462,000 times, highlighting the demand for and widespread
adoption of this viable iris alternative. We directly
compare the iris and penguins datasets for
selected analyses to demonstrate that R users, in particular teachers
and learners currently using iris, can switch to the Palmer
Archipelago penguins for many use cases including data wrangling,
visualization, linear modeling, multivariate analysis (e.g., PCA),
cluster analysis and classification (e.g., by k-means).
In 1935, American botanist Edgar Anderson measured petal and sepal
structural dimensions (length and width) for 50 flowers from each of three
Iris species: Iris setosa, Iris versicolor,
and Iris virginica (Anderson 1935). The manageable but
non-trivial size (5 variables and 150 total observations) and
characteristics of Anderson’s Iris dataset, including linear
relationships and multivariate normality, have made it amenable for
introducing a wide range of statistical methods including data
wrangling, visualization, linear modeling, multivariate analyses, and
machine learning. The Iris dataset is built into a number of
software packages including the auto-installed datasets package in R
(as iris, R Core Team 2021),
Python’s scikit-learn machine learning library (Pedregosa
et al. 2011), and the SAS Sashelp library (SAS Institute,
Cary NC), which has facilitated its widespread use. As a result,
eighty-six years after the data were initially published, the
Iris dataset remains ubiquitous in statistics, computational
methods, software documentation, and data science courses and
materials.
There are a number of reasons that modern data science practitioners
and educators may want to move on from iris. First, the
dataset lacks metadata (Anderson 1935), which does not reinforce
best practices and limits meaningful interpretation and discussion of
research methods, analyses, and outcomes. Of the five variables in
iris, two (Sepal.Width and Sepal.Length) are not intuitive for most
non-botanists. Even with explanation, the difference between petal and
sepal dimensions is not obvious. Second, iris
contains equal sample sizes for each of the three species (n =
50) with no missing values, which is cleaner than most real-world data
that learners are likely to encounter. Third, the single factor
(Species) in iris limits options for analyses.
Finally, due to its publication in the Annals of Eugenics by
statistician R.A. Fisher (Fisher 1936), iris is
burdened by a history in eugenics research, which we are committed to
addressing through the development of new data science education
products as described below.
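The structural properties described above (balanced groups, no missing values, a single factor) can be checked directly in base R. The following is a minimal sketch; the object names `species_counts`, `n_missing`, and `factor_columns` are illustrative and not part of the original analysis.

```r
# iris ships with the base R datasets package, so no installation is needed.
data(iris)

# Dimensions: rows (flowers) by columns (four measurements plus Species)
dim(iris)

# Number of observations in each species
species_counts <- table(iris$Species)
species_counts

# Total count of missing values across the whole data frame
n_missing <- sum(is.na(iris))
n_missing

# Names of the columns stored as factors (candidates for grouping)
factor_columns <- names(iris)[sapply(iris, is.factor)]
factor_columns
```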
Given the growing need for fresh data science-ready datasets, we
sought to identify an alternative dataset that could be made easily
accessible for a broad audience. After evaluating the positive and
negative features of iris in data science and statistics
materials, we established the following criteria for a suitable
alternative:
iris for most use cases
Here, we describe an alternative to iris that largely
satisfies these criteria: a refreshing, approachable, and charismatic
dataset containing real-world body size measurements for three
Pygoscelis penguin species that breed throughout the Western
Antarctic Peninsula region, made available through the United States
Long-Term Ecological Research (US LTER) Network. By comparing data
structure, size, and a range of analyses side-by-side for the two
datasets, we demonstrate that the Palmer Archipelago penguin data are an
ideal substitute for iris for many use cases in statistics
and data science education.
Body size measurements (bill length and depth, flipper length, and body mass; flippers are the modified “wings” that penguins use for maneuvering in water), clutch (i.e., egg laying) observations (e.g., date of first egg laid and clutch completion), and carbon (\(^{13}\)C/\(^{12}\)C, \(\delta^{13}\)C) and nitrogen (\(^{15}\)N/\(^{14}\)N, \(\delta^{15}\)N) stable isotope values of red blood cells were collected for adult male and female Adélie (P. adeliae), chinstrap (P. antarcticus), and gentoo (P. papua) penguins on three islands (Biscoe, Dream, and Torgersen) within the Palmer Archipelago from 2007 - 2009 by Dr. Kristen Gorman in collaboration with the Palmer Station LTER, part of the US LTER Network. For complete data collection methods and published analyses, see Gorman et al. (2014). Throughout this paper, penguin species are referred to as “Adélie”, “Chinstrap”, and “Gentoo”.
The data in the palmerpenguins R package are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy, and were imported from the Environmental Data Initiative (EDI) Data Portal at the links below:
R users can install the palmerpenguins package from CRAN:
install.packages("palmerpenguins")
Information, examples, and links to community-contributed materials are available on the palmerpenguins package website: allisonhorst.github.io/palmerpenguins/. See the Appendix for how Python and Julia users can access the same data.
The palmerpenguins R package contains two data objects: penguins_raw and
penguins. The penguins_raw data consists of
all raw data for 17 variables, recorded completely or in part for 344
individual penguins, accessed directly from EDI
(penguins_raw properties are summarized in Appendix B). We
generally recommend using the curated data in penguins,
which is a subset of penguins_raw retaining all 344
observations, minimally updated (Appendix A) and reduced to the
following eight variables:
- species: a factor denoting the penguin species (Adélie, Chinstrap, or Gentoo)
- island: a factor denoting the Palmer Archipelago island in Antarctica where each penguin was observed (Biscoe Point, Dream Island, or Torgersen Island)
- bill_length_mm: a number denoting length of the dorsal ridge of a penguin bill (millimeters)
- bill_depth_mm: a number denoting the depth of a penguin bill (millimeters)
- flipper_length_mm: an integer denoting the length of a penguin flipper (millimeters)
- body_mass_g: an integer denoting the weight of a penguin’s body (grams)
- sex: a factor denoting the sex of a penguin (male, female) based on molecular data
- year: an integer denoting the year of study (2007, 2008, or 2009)
The same data exist as comma-separated value (CSV) files in the
package (“penguins_raw.csv” and “penguins.csv”), and can be read in
using the built-in path_to_file() function in palmerpenguins.
For example,
library(palmerpenguins)
df <- read.csv(path_to_file("penguins.csv"))
will read in “penguins.csv” as if from an external file, thus
automatically parsing species, island, and sex
variables as characters instead of factors. This option
allows users opportunities to practice or demonstrate reading in data
from a CSV, then updating variable class (e.g., characters to
factors).
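The class-updating step mentioned above might look like the following sketch. It assumes the palmerpenguins package is installed and R version 4.0.0 or later, where read.csv() keeps text columns as character by default; the vector name `categorical` is illustrative.

```r
# Read the packaged CSV as if it were an external file
# (path_to_file() is provided by palmerpenguins).
df <- read.csv(palmerpenguins::path_to_file("penguins.csv"))

# Inspect how the three categorical columns were parsed
sapply(df[c("species", "island", "sex")], class)

# Convert the categorical columns from character to factor
categorical <- c("species", "island", "sex")
df[categorical] <- lapply(df[categorical], factor)

# Confirm the updated classes and factor levels
str(df[categorical])
```

Missing entries in sex remain NA after conversion and are not treated as a factor level.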
iris and penguins
The penguins data in palmerpenguins
is useful and approachable for data science and statistics education,
and is uniquely well-suited to replace the iris dataset.
Comparisons presented are selected examples for common iris
uses, and are not exhaustive.
Feature | iris | penguins |
---|---|---|
Year(s) collected | 1935 | 2007 - 2009 |
Dimensions (col x row) | 5 x 150 | 8 x 344 |
Documentation | minimal | complete metadata |
Variable classes | double (4), factor (1) | double (2), int (3), factor (3) |
Missing values? | no (n = 0; 0.0%) | yes (n = 19; 0.7%) |
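The structural features in Table 1 can be recomputed with a short helper. This is a sketch under the assumption that palmerpenguins is installed; `summarise_structure()` is a hypothetical helper written for illustration, not a function from any package. Note that class() reports "numeric" for the columns listed as double in the table.

```r
# Hypothetical helper: summarise dimensions, column classes, and missingness
summarise_structure <- function(df) {
  n_missing <- sum(is.na(df))
  list(
    dims_col_by_row = c(ncol(df), nrow(df)),
    classes = table(sapply(df, function(x) class(x)[1])),
    n_missing = n_missing,
    pct_missing = round(100 * n_missing / (nrow(df) * ncol(df)), 1)
  )
}

# as.data.frame() drops the tibble class so both inputs are plain data frames
iris_summary <- summarise_structure(iris)
penguins_summary <- summarise_structure(as.data.frame(palmerpenguins::penguins))

iris_summary
penguins_summary
```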
Both iris and penguins are in tidy format (Wickham 2014) with each column denoting a single variable and each row containing measurements for a single iris flower or penguin, respectively. The two datasets are comparable in size: dimensions (columns × rows) are 5 × 150 and 8 × 344 for iris and penguins, respectively, and sample sizes within species are similar (Tables 1 & 2).
Notably, while sample sizes in iris across species are all the same, sample sizes in penguins differ across the three species. The inclusion of three factor variables in penguins (species, island, and sex), along with year, creates additional opportunities for grouping, faceting, and analysis compared to the single factor (Species) in iris.
Unlike iris, which contains only complete cases, the penguins dataset contains a small number of missing values (\(n_{missing}\) = 19, out of 2,752 total values). Missing values and unequal sample sizes are common in real-world data, and provide added learning opportunities in the penguins dataset.
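One way to turn the missing values into a teaching moment is to show how they propagate through a summary statistic. The sketch below assumes palmerpenguins is installed; the object names are illustrative.

```r
# Use the package-qualified name so the intended dataset is unambiguous
penguins <- as.data.frame(palmerpenguins::penguins)

# Count missing values in each column
missing_by_column <- colSums(is.na(penguins))
missing_by_column

# A single NA makes the unadjusted mean NA
mass_mean_raw <- mean(penguins$body_mass_g)

# Dropping NAs explicitly gives a usable estimate
mass_mean_clean <- mean(penguins$body_mass_g, na.rm = TRUE)

# The formula interface of aggregate() omits incomplete rows by default
mass_by_species <- aggregate(body_mass_g ~ species, data = penguins, FUN = mean)
mass_by_species
```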
Iris species | Sample size | Penguin species | Female | Male | NA |
---|---|---|---|---|---|
setosa | 50 | Adélie | 73 | 73 | 6 |
versicolor | 50 | Chinstrap | 34 | 34 | 0 |
virginica | 50 | Gentoo | 58 | 61 | 5 |
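The counts in Table 2 can be reproduced with table(). This is a minimal sketch assuming palmerpenguins is installed; in the package, the Adélie level is spelled "Adelie" without the accent.

```r
penguins <- as.data.frame(palmerpenguins::penguins)

# Balanced design: identical sample size for each iris species
iris_counts <- table(iris$Species)
iris_counts

# Unbalanced design: penguin counts by species and sex,
# with a separate column for missing sex (useNA = "ifany")
sex_by_species <- table(penguins$species, penguins$sex, useNA = "ifany")
sex_by_species
```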
Distributions, relationships between variables, and clustering can be
visually explored between species for the four structural size
measurements in penguins (flipper length, body mass, bill
length and depth; Figure 2) and iris (sepal width and length,
petal width and length; Figure 3).
Both penguins and iris offer numerous opportunities to explore linear relationships and correlations, within and across species (Figures 2 & 3). A bivariate scatterplot made with the iris dataset reveals a clear linear relationship between petal length and petal width. Using penguins (Figure 4), we can create a similar scatterplot with flipper length and body mass. The overall trend across all three species is approximately linear for both iris and penguins. Teachers may encourage students to explore how simple linear regression results and predictions differ when the species variable is omitted, compared to, for example, multiple linear regression with species included (Figure 4).
Notably, distinctions between species are clearer for iris petals - particularly, the much smaller petals for Iris setosa - compared to penguins, in which Adélie and Chinstrap penguins are largely overlapping in body size (body mass and flipper length), and are both generally smaller than Gentoo penguins.
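The comparison between simple and multiple linear regression suggested above could be set up as follows. This is a sketch, assuming palmerpenguins is installed; the choice of body mass as the response and the model object names are illustrative rather than taken from the paper's supplemental code.

```r
penguins <- as.data.frame(palmerpenguins::penguins)

# Simple linear regression: species omitted
fit_simple <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Multiple linear regression: species included as a predictor
fit_species <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins)

# Compare coefficients and proportion of variance explained
coef(fit_simple)
coef(fit_species)
r_squared <- c(
  simple = summary(fit_simple)$r.squared,
  with_species = summary(fit_species)$r.squared
)
r_squared

# The analogous pair of models for iris petals
iris_simple <- lm(Petal.Width ~ Petal.Length, data = iris)
iris_species <- lm(Petal.Width ~ Petal.Length + Species, data = iris)
```

lm() drops rows with missing values in the model variables by default, so both penguin models are fit to the same complete observations.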
Simpson’s Paradox is a data phenomenon in which a trend observed
between variables is reversed when data are pooled, omitting a
meaningful variable. While often taught and discussed in statistics
courses, finding a real-world and approachable example of Simpson’s
Paradox can be a challenge. Here, we show one (of several possible - see
Figure 2) Simpson’s Paradox example in penguins: exploring bill
dimensions with and without species included (Figure 5). When penguin
species is omitted (Figure 5A), bill length and depth
appear negatively correlated overall. The trend is reversed when species
is included, revealing a clearly positive correlation between bill
length and bill depth within species (Figure 5B).
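The reversal can also be shown numerically by comparing the pooled correlation with the within-species correlations. The sketch below assumes palmerpenguins is installed and uses Pearson correlation as an illustrative choice; the object names are not from the original analysis.

```r
penguins <- as.data.frame(palmerpenguins::penguins)

# Keep only rows where both bill measurements are present
bill_vars <- c("bill_length_mm", "bill_depth_mm")
complete <- penguins[complete.cases(penguins[bill_vars]), ]

# Pooled correlation, ignoring species
overall_cor <- cor(complete$bill_length_mm, complete$bill_depth_mm)

# Correlation computed separately within each species
within_cor <- sapply(
  split(complete, complete$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)

overall_cor
within_cor
```

The pooled coefficient is negative while each within-species coefficient is positive, which is the sign flip that defines Simpson's Paradox.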
Principal component analysis (PCA) is a dimensionality reduction method
commonly used to explore patterns in multivariate data. The
iris dataset frequently appears in PCA tutorials due to
multivariate normality and clear interpretation of variable loadings and
clustering.
A comparison of PCA with the four variables of structural size measurements in penguins and iris (both normalized prior to PCA) reveals highly similar results (Figure 6). For both datasets, one species is distinct (Gentoo penguins, and setosa irises) while the other two species (Chinstrap/Adélie and versicolor/virginica) appear somewhat overlapping in the first two principal components (Figure 6 A,B). Screeplots reveal that the variance explained by each principal component (PC) is similar across the two datasets, particularly for PC1 and PC2: the first two PCs capture 88.15% of total variance for penguins, compared to 95.81% for iris (Figure 6 C,D).
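A comparable PCA can be sketched with base R's prcomp(). This is not necessarily the workflow used for Figure 6; it assumes palmerpenguins is installed, drops penguins with missing measurements, and uses centering plus unit-variance scaling as the normalization. `var_explained()` is a hypothetical helper.

```r
penguins <- as.data.frame(palmerpenguins::penguins)

# Four structural size measurements; rows with any NA are removed
size_vars <- c("bill_length_mm", "bill_depth_mm",
               "flipper_length_mm", "body_mass_g")
penguins_complete <- na.omit(penguins[size_vars])

# PCA on centered and scaled variables for both datasets
pca_penguins <- prcomp(penguins_complete, center = TRUE, scale. = TRUE)
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Hypothetical helper: proportion of total variance per component
var_explained <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

# Share of variance captured by the first two components
pc12_penguins <- sum(var_explained(pca_penguins)[1:2])
pc12_iris <- sum(var_explained(pca_iris)[1:2])

round(100 * c(penguins = pc12_penguins, iris = pc12_iris), 2)
```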
Unsupervised clustering by k-means is a common and popular entryway
to machine learning and classification, and again, the iris
dataset is frequently used in introductory examples. The
penguins data provides similar opportunities for
introducing k-means clustering. For simplicity, we compare k-means
clustering using only two variables for each dataset: for
iris, petal width and petal length, and for
penguins, bill length and bill depth. All variables are
scaled prior to k-means. Three clusters (k = 3) are specified
for each, since there are three species of irises (Iris setosa,
Iris versicolor, and Iris virginica) and penguins
(Adélie, Chinstrap and Gentoo).
K-means clustering with penguin bill dimensions and iris petal dimensions yields largely distinct clusters, each dominated by one species (Figure 7). For iris petal dimensions, k-means yields a perfectly separated cluster (Cluster 3) containing all 50 Iris setosa observations and zero misclassified Iris virginica or Iris versicolor (Table 3). While clustering is not perfectly distinct for any penguin species, each species is largely contained within a single cluster, with little overlap from the other two species. For example, considering Adélie penguins (orange observations in Figure 7A): 147 (out of 151) Adélie penguins are assigned to Cluster 3, zero are assigned to Cluster 1, and 4 are assigned to the Chinstrap-dominated Cluster 2 (Table 3). Only 5 (of 68) Chinstrap penguins and 1 (of 123) Gentoo penguins are assigned to the Adélie-dominated Cluster 3 (Table 3).
Cluster | Adélie | Chinstrap | Gentoo | Cluster | setosa | versicolor | virginica |
---|---|---|---|---|---|---|---|
1 | 0 | 9 | 116 | 1 | 0 | 2 | 46 |
2 | 4 | 54 | 6 | 2 | 0 | 48 | 4 |
3 | 147 | 5 | 1 | 3 | 50 | 0 | 0 |
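A cross-tabulation like Table 3 can be approximated with base R's kmeans(). The sketch below assumes palmerpenguins is installed; the seed and the nstart value are arbitrary choices made here for reproducibility, not settings reported in the paper. Cluster numbers are assigned arbitrarily, so the row order (and possibly a few counts) may differ from Table 3.

```r
set.seed(2020)  # arbitrary seed, an assumption of this sketch

penguins <- as.data.frame(palmerpenguins::penguins)

# Keep species plus the two bill variables; drop rows with missing values
bill <- na.omit(penguins[c("species", "bill_length_mm", "bill_depth_mm")])

# Scale the variables, then fit k-means with k = 3 and multiple random starts
km_penguins <- kmeans(scale(bill[c("bill_length_mm", "bill_depth_mm")]),
                      centers = 3, nstart = 25)
km_iris <- kmeans(scale(iris[c("Petal.Length", "Petal.Width")]),
                  centers = 3, nstart = 25)

# Cross-tabulate cluster assignment against the known species
penguin_tab <- table(cluster = km_penguins$cluster, species = bill$species)
iris_tab <- table(cluster = km_iris$cluster, species = iris$Species)

penguin_tab
iris_tab
```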
Here, we have shown that structural size measurements for Palmer Archipelago Pygoscelis penguins, available as penguins in the palmerpenguins R package, offer a near drop-in replacement for iris in a number of common use cases for data science and statistics education including exploratory data visualization, linear correlation and regression, PCA, and clustering by k-means. In addition, teaching and learning opportunities in penguins are increased due to a greater number of variables, missing values, unequal sample sizes, and Simpson’s Paradox examples. Importantly, the penguins dataset encompasses real-world information derived from several charismatic marine predator species with regional breeding populations notably responding to environmental change occurring throughout the Western Antarctic Peninsula region of the Southern Ocean (see Bestelmeyer et al. (2011), Gorman et al. (2014), Gorman et al. (2017), Gorman et al. (2021)). Thus, the penguins dataset can facilitate discussions more broadly on biodiversity responses to global change - a contemporary and critical topic in ecology, evolution, and the environmental sciences.
Data in the penguins object have been minimally updated from penguins_raw as follows:
- Variable names are converted (e.g., Flipper Length (mm) to flipper_length_mm)
- Entries in species are truncated to only include the common name (e.g. “Gentoo”, instead of “gentoo penguin (Pygoscelis papua)”)
- NA
- culmen_length_mm and culmen_depth_mm variable names are updated to bill_length_mm and bill_depth_mm, respectively
- Class for categorical variables (species, island, sex) is updated to factor
- year was pulled from clutch observations
penguins_raw dataset
Feature | penguins_raw |
---|---|
Year(s) collected | 2007 - 2009 |
Dimensions (col x row) | 17 x 344 |
Documentation | complete metadata |
Variable classes | character (9), Date (1), numeric (7) |
Missing values? | yes (n = 336; 5.7%) |
Python: Python users can load the palmerpenguins datasets into their Python environment using the following code to install and access data in the palmerpenguins Python package:
pip install palmerpenguins
from palmerpenguins import load_penguins
penguins = load_penguins()
Julia: Julia users can access the penguins data in the PalmerPenguins.jl package. Example code to import the penguins data through PalmerPenguins.jl (more information on PalmerPenguins.jl from David Widmann can be found here):
julia> using PalmerPenguins
julia> table = PalmerPenguins.load()
TensorFlow: TensorFlow users can access the penguins data in TensorFlow Datasets. Information and examples for penguins data in TensorFlow can be found here.
All analyses were performed in the R language environment using version 4.1.2 (R Core Team 2021). Complete code for this paper is shared in the Supplemental Material. We acknowledge the following R packages used in analyses, with gratitude to developers and contributors:
datasets, palmerpenguins, GGally, ggiraph, ggplot2, kableExtra, paletteer, colorblindr, patchwork, plotly, recipes, broom, shadowtext, tidyverse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".