Activity: Palmer Penguins Anatomy

Artwork by @allison_horst (source)

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

Objective:

To understand and perform linear regression using R and visualize the relationship between two continuous variables in the Palmer Penguins dataset.

Dataset:

The Palmer Penguins dataset contains size measurements for three species of penguins from three islands in the Palmer Archipelago, Antarctica.

Variables:


Initial Setup

1. Open a new script and make a header: At the top of your R script, include a header with your name, the date, and a brief description of the analysis.

###########################
# Lab Activity Week 5: Palmer penguin linear regression
# Author: Lauren Olinger
# Date: September 18, 2023
###########################

2. Setting the Working Directory: Set the working directory to where your dataset is located, and where you want to save your scripts and outputs. It is helpful to save these scripts all together so you can come back to them.

setwd("set/your/working/directory/here")

3. Install and load libraries: Load the necessary packages (e.g., tidyverse) for the exercise.

Important

you will need to install the palmerpenguins library in order to view the data. You must un-comment and run the install.packages("palmerpenguins") below. Remove or re-comment the command after installing.

# Install and load the necessary libraries
# install.packages("palmerpenguins")

library(palmerpenguins)
library(tidyverse)

Load, describe, and clean the dataset

4. Load the dataset and look at it

data(penguins) # this calls the penguin dataset as a "promise" (dont worry about this too much, just run it)

load(penguins) # this loads the penguins dataset from palmer penguins library into your environment

head(penguins)

5. Describe the dataset

complete the sentence: “In this data, each row contains ________________”.


6. Clean the data and prepare for analyses

do we need to remove any NAs?

which(is.na(penguins),arr.ind = T)

remove using na.omit(). remember that we used this same command last week.

penguins <- penguins %>% na.omit()

did this work?

which(is.na(penguins),arr.ind = T)

looks better - there is nothing in the row/column indices of penguins when we ask which row and column of penguins contains NA. Good.


Exploratory Data Analysis (EDA) and Visualization

7. summarize the data

Lets do some summarizing of the data first, using our trusty tidyverse syntax

penguins %>% 
  group_by(species) %>%
  summarize(across(where(is.numeric), mean))

note the slightly different syntax and new use of across() to get means of all the numeric columns. This can be a useful tool just to get a feel for the data.

we see some differences between species in all of these variables, including bill length and body mass. If you are not a bird expert (I am not a bird expert), here is a helpful diagram showing what bill length and bill depth actually measured:


8. make a scatterplot of the data

Before getting into regression, it’s useful to visualize the data. Here we are looking at the relationship between bill length and body mass, and we are coloring each point by penguin species. Plot this and look at it in your console.

p <- ggplot(penguins, aes(x=body_mass_g, y=bill_length_mm, color=species)) +
  geom_point() +
  labs(x="Body Mass (g)", y="Bill Length (mm)")

p

note that we also store this ggplot as p because we will add to it later.


Conduct regression

9. Perform Regression

model <- lm(bill_length_mm ~ body_mass_g, data=penguins)

here we are using lm() to perform the linear regression. here is some more information about each argument:

  1. lm(): This stands for “linear model.” It is a function in R used to fit linear models, which can be either simple (one predictor) or multiple (more than one predictor).

  2. bill_length_mm ~ body_mass_g: This is the formula for the linear regression. The ~ symbol can be read as “is modeled as” or “is predicted by”. So, bill_length_mm is the response (or dependent) variable that we’re trying to predict, and body_mass_g is the predictor (or independent) variable we’re using for prediction.

  3. data=penguins: This specifies the dataset being used, which in this case is the penguins dataset.

  4. model <-: This is assigning the output of the lm() function (i.e., the linear model) to a new variable named model.


10. Summarize regression model

now get a summary of model using the summary() function.

summary(model)

11. Interpret outputs of regression model

Interpret the outputs of this linear regression, referring to the previous section to do so. Interpret the following:

  • Coefficients, their estimates, errors and statistics

  • Redidual standard error

  • R squared (multiple and adjusted)

  • F statistic and Pvalue.


12. Add regression line to plot

extract the coefficients (the slope and intercept of the line from model)

coefficients <- coef(model)
intercept <- coefficients[1]
slope <- coefficients[2]

now use these coefficients as a line to add to the scatter plot:

p + geom_abline(intercept=intercept, slope=slope, color="black")  # Regression line

Checking Assumptions

13. Homoscedasticity (Constant Variance)

Does the variance of the residuals remains constant across different levels of the independent variable(s)? This can be checked visually using a residuals vs. fitted values plot.

residuals_data <- residuals(model)
predicted_values <- fitted(model)

plot(predicted_values, residuals_data, 
     xlab="Predicted Values", ylab="Residuals",
     main="Checking Homoscedasticity")
abline(h=0, col="blue")

14. Normality of Residuals

For valid hypothesis testing (like testing if coefficients are significant), the residuals should be approximately normally distributed. This can be visually inspected using a histogram or a Q-Q plot.

  • Histogram: Outliers or skewness in the histogram can indicate potential violations.
hist(residuals_data, breaks=20, main="Histogram of Residuals", xlab="Residuals")
  • QQ plot: A Q-Q plot provides a graphical view of how the residuals compare to a normal distribution. If the residuals are normally distributed, the points on the Q-Q plot should lie roughly on a 45-degree straight line.
qqnorm(residuals_data, main="Q-Q Plot of Residuals", col="darkblue")
qqline(residuals_data, col="red")

15. Big picture questions

  • Based on what you see in the plot in number 12, do you think this regression does a good job of fitting bill length and body mass?

  • So far we only have one variable in our regression model. Could adding another grouping variable lead to a better fit regression line? From the plot in number 12, do you see any variables that you think could be useful for this? Explain your answer in 1-2 sentences.

  • Were the assumptions of homoscedasticity and normality of residuals met for this analysis? Explain your answer in 1-2 sentences