Set up

  • Create a new repository on GitHub called eds212-comp-4b, with a README
  • Clone the repository to create a new version-controlled R Project
  • Add a new R Markdown document and save it as r-exploring
  • Open Anaconda Navigator, launch Jupyter Notebook, navigate to the Project folder you just created, and make a new Python 3 notebook there (when you save the notebook, make sure it is saved inside your R Project folder)
  • Rename your Jupyter Notebook py-exploring
  • Check your files pane in RStudio to ensure that your .ipynb is saved in the right place
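If you would rather confirm the last step from Python, here is a minimal sketch of the same check. The function name `check_project_files` is made up for illustration, and it assumes the notebook was saved as py-exploring.ipynb and that the project folder contains the .Rproj file RStudio creates for a project.

```python
from pathlib import Path

def check_project_files(project_dir, notebook_name="py-exploring.ipynb"):
    """Report whether a folder holds an .Rproj file and the notebook (hypothetical helper)."""
    folder = Path(project_dir)
    return {
        "rproj": any(folder.glob("*.Rproj")),            # RStudio project file present?
        "notebook": (folder / notebook_name).is_file(),  # notebook saved in this folder?
    }

# "." is the current working directory; swap in the path to your project folder
print(check_project_files("."))
```

Both values should be True when the notebook is saved in the right place.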

Exploring data in R:

In your R Markdown document, attach the following packages in the setup chunk (you’ll need to install the first two):

  • GGally
  • skimr
  • palmerpenguins

Rapid-fire low-level exploration of data:

# Always look at it
# View(penguins)

# Check the column names
names(penguins) # See df.columns in pandas
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
# Check the dimensions 
dim(penguins) # See df.shape in pandas
## [1] 344   8
# Get a summary
summary(penguins) # See df.describe() in pandas
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
# Print the first 6 rows
head(penguins) # See df.head() in pandas
## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>
# Print the last 6 rows
tail(penguins) # See df.tail() in pandas
## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Chinst… Dream            45.7          17                195        3650 fema…
## 2 Chinst… Dream            55.8          19.8              207        4000 male 
## 3 Chinst… Dream            43.5          18.1              202        3400 fema…
## 4 Chinst… Dream            49.6          18.2              193        3775 male 
## 5 Chinst… Dream            50.8          19                210        4100 male 
## 6 Chinst… Dream            50.2          18.7              198        3775 fema…
## # … with 1 more variable: year <int>
# Make a pairplot
GGally::ggpairs(penguins)

# Make a histogram of penguin flipper lengths
ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram()

Let’s try the same stuff, but in Python!

# Import Python packages
import seaborn as sns
import pandas as pd
import numpy as np
# Load the penguins dataset from the seaborn package
penguins = sns.load_dataset('penguins')
penguins.columns # See names(penguins) in R
## Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
##        'flipper_length_mm', 'body_mass_g', 'sex'],
##       dtype='object')
# Check the dimensions (the seaborn copy has no year column, so 7 columns, not R's 8)
penguins.shape # See dim(penguins) in R
## (344, 7)
penguins.head() # See head(penguins) in R (pandas prints 5 rows by default, R prints 6)
##   species     island  bill_length_mm  ...  flipper_length_mm  body_mass_g     sex
## 0  Adelie  Torgersen            39.1  ...              181.0       3750.0    Male
## 1  Adelie  Torgersen            39.5  ...              186.0       3800.0  Female
## 2  Adelie  Torgersen            40.3  ...              195.0       3250.0  Female
## 3  Adelie  Torgersen             NaN  ...                NaN          NaN     NaN
## 4  Adelie  Torgersen            36.7  ...              193.0       3450.0  Female
## 
## [5 rows x 7 columns]
penguins.tail() # See tail(penguins) in R (pandas prints 5 rows by default, R prints 6)
##     species  island  bill_length_mm  ...  flipper_length_mm  body_mass_g     sex
## 339  Gentoo  Biscoe             NaN  ...                NaN          NaN     NaN
## 340  Gentoo  Biscoe            46.8  ...              215.0       4850.0  Female
## 341  Gentoo  Biscoe            50.4  ...              222.0       5750.0    Male
## 342  Gentoo  Biscoe            45.2  ...              212.0       5200.0  Female
## 343  Gentoo  Biscoe            49.9  ...              213.0       5400.0    Male
## 
## [5 rows x 7 columns]
penguins.describe() # See summary(penguins) in R
##        bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
## count      342.000000     342.000000         342.000000   342.000000
## mean        43.921930      17.151170         200.915205  4201.754386
## std          5.459584       1.974793          14.061714   801.954536
## min         32.100000      13.100000         172.000000  2700.000000
## 25%         39.225000      15.600000         190.000000  3550.000000
## 50%         44.450000      17.300000         197.000000  4050.000000
## 75%         48.500000      18.700000         213.000000  4750.000000
## max         59.600000      21.500000         231.000000  6300.000000

# Make a pairs plot with seaborn pairplot
sns.pairplot(penguins) # See GGally::ggpairs() in R

# Make a histogram of flipper lengths with sns.histplot:

sns.histplot(data=penguins, x="flipper_length_mm") # See geom_histogram() in R
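One difference worth noticing: R's summary() reported counts for species, island and sex and listed the NA's, while the penguins.describe() output only covers the four numeric columns. Here is a rough sketch of how you might recover that information in pandas. To keep it runnable without a download, it uses a small hand-typed data frame called `sample` (a name made up here) holding a few columns from the five rows that penguins.head() printed; the same calls should work on the full penguins data frame.

```python
import numpy as np
import pandas as pd

# Five rows copied by hand from the penguins.head() output (a few columns only)
sample = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["Male", "Female", "Female", np.nan, "Female"],
})

# Count missing values per column (compare the NA's rows in R's summary())
missing = sample.isna().sum()
print(missing)

# Count each category, keeping missing values (compare the sex column in summary())
sex_counts = sample["sex"].value_counts(dropna=False)
print(sex_counts)

# Summarize every column, not just the numeric ones
print(sample.describe(include="all"))
```

Swap `sample` for `penguins` to run the same checks on all 344 rows.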

One more thing: vectors in Python

vec_a = np.array([1,2,3])
vec_b = np.array([10,11,12])

vec_a + vec_b
## array([11, 13, 15])
vec_b - vec_a
## array([9, 9, 9])
vec_a * vec_b
## array([10, 22, 36])
6 * vec_a
## array([ 6, 12, 18])
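Why wrap the numbers in np.array() at all? A short sketch, with the names `list_a`, `list_b` and friends invented for illustration: plain Python lists treat + and * differently from NumPy arrays, so the element-wise arithmetic above only works once the values are arrays.

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [10, 11, 12]

# With plain lists, + concatenates and * repeats
joined = list_a + list_b    # [1, 2, 3, 10, 11, 12]
repeated = 2 * list_a       # [1, 2, 3, 1, 2, 3]

# With NumPy arrays, the same operators work element by element
vec_a = np.array(list_a)
vec_b = np.array(list_b)
summed = vec_a + vec_b      # array([11, 13, 15])
doubled = 2 * vec_a         # array([2, 4, 6])

print(joined, repeated, summed, doubled)
```

The array behavior is the one that matches arithmetic on vectors made with c() in R.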