Activity 1: TCRMP Fish

Comparing Fish Abundance and Diversity at TCRMP Sites

Use the template R script copied at the bottom of this page to help complete this activity. For any ALL_CAPS terms, replace or rename them with appropriate terms or more descriptive variable names.

Caution

This is the only time I will provide a template. From now on, you will start with a blank script and some guiding instructions and example commands in the manual.


Overview

[Figure: taking a fish survey]

This exercise focuses on the analysis of fish count data from TCRMP fish surveys. Fish surveys have historically been conducted at 14 sites around St. Croix and 10 sites around St. Thomas and, starting in 2012, are conducted at 32 of the 33 monitoring sites. Ten replicate belt transects and three replicate roving dive surveys are conducted at each site. Belt transects are 25 x 4 m and are conducted in 15 minutes per replicate. Roving replicates are also 15 minutes. All fish encountered are recorded except blennies and most gobies.

In this dataset, we are looking at fish counts from surveys conducted in 2014 and 2015. You will compare:

  • fish abundance at Sprat Hole and College Shoal East from surveys conducted in 2014

  • fish diversity (defined as the number of unique species) at Jacks Bay and Hind Bank East FSA from surveys conducted in 2015.

Here is a flow chart showing what your script will be doing.

This week we are doing some exploratory data analysis (EDA). In many ways this approach mirrors data analysis for your thesis data, where you will spend many days/weeks on a single dataset to learn as much about it as you can.


Set Up the Script

  • i. Update the header: At the top of your R script, fill in your name, the date, and a brief description of the analysis.

  • ii. Set your working directory: Set the working directory to where you want to save your scripts and outputs.

  • iii. Load packages: Load the necessary packages (e.g., dplyr and ggplot2, or tidyverse, which includes both) for the exercise. Make sure you have these installed with install.packages(). Remove or comment out any install.packages() calls BEFORE turning in your script.

###########################
# Lab Activity Week 3-4: Summarizing TCRMP Fish Data
# Author: Lauren Olinger
# Date: August 10, 2023
###########################

# Example for Setting Working Directory
setwd("YOUR_WORKING_DIRECTORY")

# Example for Loading Packages
library(dplyr)
library(ggplot2)

Import the Data

  1. Import the dataset using read.csv() and give it a descriptive name. We use DATA_RAW in the instructions, but consider being more specific in your naming.
DATA_RAW <- read.csv("https://github.com/laurenkolinger/MES503data/raw/main/week3/s4pt4_fishbiodivCounts_23sites_2014_2015.csv")

Explore and Describe the Data

  2. View the first few rows of the dataset using head(). Pay attention to the column names. There are definitions for those columns in the associated metadata. Click here to download that metadata and save it in your working directory.

  3. Get a summary of the dataset using summary().

    What does summary() tell you?

  4. Look at the data and metadata, and complete the sentence: “In this data, each row contains ________________”.

    Describing the data is an important first step and will be the first step of all activities going forward!


Clean the data

  5. Remove missing values. Use this example code to check whether any data are NA and, if so, remove them.
# find out the indices of the NAs
which(is.na(DATA_RAW), arr.ind = TRUE)

# remove them using na.omit()
DATA_RAW <- 
  DATA_RAW |> na.omit()
  • Did na.omit() work? Run which(is.na(DATA_RAW), arr.ind = TRUE) again to find out.
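To see what these two commands do before running them on the real data, here is a minimal sketch on a made-up data frame. The object names (toy, na_positions, toy_complete) and all values are hypothetical and are not part of the TCRMP dataset.

```r
# Hypothetical toy data: 4 rows, with one NA in the counts column
toy <- data.frame(
  site   = c("A", "A", "B", "B"),
  counts = c(3, NA, 5, 2)
)

# Row and column position of every NA (here: row 2, column 2)
na_positions <- which(is.na(toy), arr.ind = TRUE)
print(na_positions)

# Drop every row that contains at least one NA
toy_complete <- toy |> na.omit()

# Check the result: no NAs should remain, and one row should be gone
anyNA(toy_complete)
nrow(toy_complete)
```

An empty result from which() after na.omit() means no missing values remain. Note that na.omit() drops the entire row, not just the missing cell.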
  6. Aggregate (condense) counts by summing across species. A common practice with this kind of data is to reduce the level of detail to what is needed to answer the question at hand. We have a lot of cool data on different species, but we don't need this much detail. So we are going to create a second variable, DATA_CLEAN, that reduces that complexity through some initial “tidying”. Here is example code to get you started, with the answer mostly filled in (see the first note at the end of this page). Remember to change DATA_CLEAN to a more descriptive variable name.
DATA_CLEAN <- DATA_RAW |>
  select(site, year, replicate, sppname, counts) |>
  group_by(site, year, replicate) |> 
  summarise(
    counts = sum(counts),
    diversity = length(sppname))
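As a check on what this aggregation produces, here is a minimal sketch on invented data. The column names match the ones used above, but the object names (toy, toy_clean), the extra n_unique column, and every value are hypothetical. The sketch also compares length(sppname), which counts rows, with dplyr's n_distinct(sppname), which counts distinct names. The two agree only if each species appears at most once per replicate, which is assumed here and worth confirming in the real data.

```r
library(dplyr)

# Hypothetical toy data: one site, one year, two replicates
toy <- data.frame(
  site      = "Site A",
  year      = 2014,
  replicate = c(1, 1, 1, 2, 2),
  sppname   = c("sp1", "sp2", "sp3", "sp1", "sp4"),
  counts    = c(4, 1, 2, 10, 3)
)

toy_clean <- toy |>
  select(site, year, replicate, sppname, counts) |>
  group_by(site, year, replicate) |>
  summarise(
    counts    = sum(counts),          # total fish per replicate
    diversity = length(sppname),      # number of rows per replicate
    n_unique  = n_distinct(sppname),  # number of distinct species names
    .groups   = "drop"                # return an ungrouped data frame
  )

print(toy_clean)
```

The five species-level rows collapse to two rows, one per replicate. The .groups = "drop" argument is optional; it only silences the grouping message and returns an ungrouped result.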

Compare fish abundance at Sprat Hole and College Shoal East in 2014

7. Filter Data to include 2014 surveys conducted at Sprat Hole and College Shoal East.

Here is more example code. You will have to change any terms in all caps (e.g., ALL_CAPS) to the correct terms. Starting with a straightforward example:

# Example Command
DATA_ABUND <- DATA_CLEAN |>
  filter(year == YEAR, site %in% c("SITE1", "SITE2"))

See the second note at the end of this page for a plain-language explanation of this code.
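The same filter can be tried on a small invented data frame to confirm how the two conditions combine. The site names, values, and object names (toy_clean, toy_subset, toy_typo) below are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with made-up site names and counts
toy_clean <- data.frame(
  site   = c("Site A", "Site A", "Site B", "Site C", "Site B"),
  year   = c(2014, 2015, 2014, 2014, 2015),
  counts = c(12, 9, 30, 7, 22)
)

# Keep rows where year is 2014 AND site is one of the two listed
toy_subset <- toy_clean |>
  filter(year == 2014, site %in% c("Site A", "Site B"))

print(toy_subset)

# A misspelled site name is not an error: it silently matches nothing
toy_typo <- toy_clean |>
  filter(year == 2014, site %in% c("site a"))

nrow(toy_typo)
```

Because a typo returns zero rows rather than an error, it is worth checking unique(DATA_CLEAN$site) for the exact spelling before filtering, and checking the number of rows afterward.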

8. Calculate some descriptive statistics for (i.e., summarize) Fish Abundance in 2014, including mean, standard deviation, and SEM.

Think about the goal of this part of the activity. How do you want to group your data so that you can compare the sites? If done correctly, DATA_ABUND_SUMMARY should have only two rows. Replace GROUPING_VAR with the name of the column corresponding to that grouping variable. Also, what is the measurement that we care about? Replace MEASUREDVAR with the name of the column corresponding to that measured variable.

Note: length() calculates the number of observations, useful for statistical calculations like SEM.

# Example Command
DATA_ABUND_SUMMARY <- DATA_ABUND |>
  group_by(GROUPING_VAR) |>
  summarize(
    MEAN_ABUNDANCE = mean(MEASUREDVAR),
    SD_ABUNDANCE = sd(MEASUREDVAR),
    SEM_ABUNDANCE = sd(MEASUREDVAR) / sqrt(length(MEASUREDVAR))
  )
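The SEM expression in the example command is the standard deviation divided by the square root of the number of observations. Here is a minimal worked sketch in base R with made-up numbers; the vector x and the other object names are hypothetical.

```r
# Hypothetical counts from four replicates at one site
x <- c(10, 12, 14, 16)

n      <- length(x)       # number of observations
mean_x <- mean(x)         # arithmetic mean
sd_x   <- sd(x)           # sample standard deviation (n - 1 denominator)
sem_x  <- sd_x / sqrt(n)  # standard error of the mean

c(n = n, mean = mean_x, sd = sd_x, sem = sem_x)
```

Note that length() counts every element, including any NA values, so this formula assumes missing values have already been removed, as in step 5.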

9. Visualize Fish Abundance in 2014 with Error Bars. Here is a basic command to start with. Don’t forget to label your axes and include a descriptive title. You can look up how to do this very easily; for example, see the ggplot2 cheat sheet.

Important

Check that your plot contains all required elements.

# Example Command
ggplot(DATA_ABUND_SUMMARY, aes(x=GROUPING_VAR, y=MEAN_ABUNDANCE)) +
  geom_bar(stat="identity") +
  geom_errorbar(aes(ymin=MEAN_ABUNDANCE-SEM_ABUNDANCE, ymax=MEAN_ABUNDANCE+SEM_ABUNDANCE), width=.2)

10. Interpret differences in fish abundance between sites.

  • Report the calculated means ± SEM as if you were reporting this in a publication, or your thesis:

    • E.g., “Surveys in YEAR resulted in an average (± SEM) of ### (± ###) fish at SITE1 and ### (± ###) fish at SITE2.”
  • Which site had higher fish abundance? Do you think a statistical test would indicate significant differences in their abundances?


Compare fish diversity at Jacks Bay and Hind Bank East FSA in 2015

  11. Filter the data by repeating step 7 above, but do not overwrite DATA_ABUND. Consider replacing _ABUND with _DIVERSITY in the variable name to represent diversity.

    Important

    Note the different YEAR, SITE1, and SITE2 that you are looking for in this part!

  12. Repeat step 8 for diversity.

    Important

    Replace GROUPING_VAR with the name of the column corresponding to that grouping variable. Also, what is the measurement that we care about? Replace MEASUREDVAR with the name of the column corresponding to that measured variable.

  13. Repeat step 9 to make a plot of fish diversity in 2015.

  14. Interpret differences in fish diversity between sites.

    • Report the calculated means ± SEM.

    • Which site had higher fish diversity? Do you think a statistical test would indicate significant differences in their diversity?

  15. Perform a t-test on diversity by site and answer these three questions:

    1. What are your null and alternative hypotheses?
    2. Do you reject your null hypothesis?
    3. Compare this result to what you guessed in step 14.

    # Example Command
    test <-
      t.test(
        MEASUREDVAR ~ GROUPING_VAR,
        data = DATA_DIVERSITY,
        var.equal = TRUE
      )
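The object returned by t.test() is a list, so individual results can be pulled out by name when reporting them. Here is a minimal sketch on invented data; the object names (toy, toy_test), the site names, and every value are hypothetical and are not TCRMP results.

```r
# Hypothetical toy data: a made-up measurement at two made-up sites
toy <- data.frame(
  site      = rep(c("Site A", "Site B"), each = 5),
  diversity = c(10, 12, 11, 13, 12, 18, 20, 19, 21, 17)
)

# Two-sample t-test assuming equal variances (pooled variance)
toy_test <- t.test(diversity ~ site, data = toy, var.equal = TRUE)

print(toy_test)

toy_test$statistic  # t value
toy_test$parameter  # degrees of freedom (n1 + n2 - 2 when var.equal = TRUE)
toy_test$p.value    # p-value to compare against your chosen alpha
toy_test$estimate   # the two group means
```

With var.equal = TRUE the test is the classic Student's t-test. Leaving that argument out gives Welch's t-test, which does not assume equal variances.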

Template R Script

###########################
# Lab Activity Week 3-4: Summarizing TCRMP Fish Data
# Author: [Your Name]
# Date: [Current Date]
###########################

# Setting Working Directory
setwd("YOUR_WORKING_DIRECTORY")

# Loading Packages
library(dplyr)
library(ggplot2)

# ------------------------------------------------------------------------
# Data Import

# 1. Import the dataset. Change DATA_RAW (and all other names in all caps) to something more descriptive.

DATA_RAW <- read.csv("https://github.com/laurenkolinger/MES503data/raw/main/week3/s4pt4_fishbiodivCounts_23sites_2014_2015.csv")

# ------------------------------------------------------------------------
# Preliminary Data Exploration and Summary

# 2. View the first few rows of the dataset

    head(DATA_RAW)

# 3. Get a summary of the dataset. 

    summary(DATA_RAW)

    # [Insert answer to question 3 here]

# 4. Complete the sentence: "In this data, each row contains _____________"

    # [Insert answer here]

# ------------------------------------------------------------------------
# Data Cleaning

# 5. Remove Missing Values
    
    # Find out the indices of the NAs
    which(is.na(DATA_RAW), arr.ind = TRUE)

    # Remove NAs
    DATA_RAW <- DATA_RAW |> na.omit()

    # [Your answer to question asked in 5 here]

# 6. Aggregate counts by summing across species
    
    DATA_CLEAN <- DATA_RAW |>
      select(site, year, replicate, sppname, counts) |>
      group_by(site, year, replicate) |> 
      summarise(
        counts = sum(counts),
        diversity = length(sppname))

# ------------------------------------------------------------------------
# Comparing fish abundance at Sprat Hole and College Shoal East in 2014

# 7. Filter Data for 2014 surveys at Sprat Hole and College Shoal East

    DATA_ABUND <- DATA_CLEAN |>
      filter(year == YEAR, site %in% c("SITE1", "SITE2"))

# 8. Calculate descriptive statistics for Fish Abundance in 2014

    DATA_ABUND_SUMMARY <- DATA_ABUND |>
      group_by(GROUPING_VAR) |>
      summarize(
        MEAN_ABUNDANCE = mean(MEASUREDVAR),
        SD_ABUNDANCE = sd(MEASUREDVAR),
        SEM_ABUNDANCE = sd(MEASUREDVAR) / sqrt(length(MEASUREDVAR))
      )

# 9. Visualize Fish Abundance in 2014 with Error Bars
    
    ggplot(DATA_ABUND_SUMMARY, aes(x=GROUPING_VAR, y=MEAN_ABUNDANCE)) +
      geom_bar(stat="identity") +
      geom_errorbar(aes(ymin=MEAN_ABUNDANCE-SEM_ABUNDANCE, ymax=MEAN_ABUNDANCE+SEM_ABUNDANCE), width=.2) +
      labs(x = "Site", y = "Mean Abundance", title = "Fish Abundance in 2014 at Sprat Hole and College Shoal East")

# 10. Interpret differences in fish abundance between sites

    # [Insert answer here]

# ------------------------------------------------------------------------
# Comparing fish diversity at Jacks Bay and Hind Bank East FSA in 2015

# 11. Filter Data for 2015 surveys at Jacks Bay and Hind Bank East FSA

    # [Insert code here]

# 12. Calculate descriptive statistics for Fish Diversity in 2015

    # [Insert code here]

# 13. Visualize Fish Diversity in 2015 with Error Bars

    # [Insert code here]

# 14. Interpret differences in fish diversity between sites

    # [Insert answer here]

# 15. Perform a t-test on diversity by site

    test <-
      t.test(
        MEASUREDVAR ~ GROUPING_VAR,
        data = DATA_DIVERSITY,
        # t.test() has no na.rm argument; it drops missing values on its own,
        # and you already removed NAs in step 5.
        var.equal = TRUE
      )
     # [Insert answer here]
     # [Insert answer here]
     # [Insert answer here]
     # [Insert answer here]

Notes

  1. Note:

      • This makes a new variable, called diversity, which counts the number of species per replicate. This will assist in part 2 of the activity.

      • This removes columns (program, month, replicatetype, trophicgroup) by not including them in select(). Even though this is not an essential step, a cleaner data frame can help you better understand your analysis and focus on the questions at hand.
  2. This piped command performs a specific data filtering operation using the dplyr package in R. Here’s what it does in plain language:

    • DATA_CLEAN: This is the dataset you’re starting with, after you’ve cleaned it to remove missing values and aggregated the counts.

    • |>: This is the “pipe” operator. It takes the output from the command on its left (in this case, DATA_CLEAN) and “pipes” it into the command on its right (in this case, filter()). Essentially, it’s a way to string multiple commands together in a readable and organized manner.

    • filter(): This function is used to filter rows of a dataset based on certain conditions.

    • year == YEAR: This condition says to keep only the rows where the year column is equal to YEAR (replace this with the year you want, as a number). Numeric values do not need to be enclosed in quotes. The == is an equality operator, which checks if the value on its left is the same as the value on its right.

    • site %in% c("SITE1", "SITE2"): This condition says to keep only the rows where the site is either SITE1 or SITE2. The %in% operator checks if a value is part of a set of values (in this case, the set is c("SITE1", "SITE2")). You must replace SITE1 and SITE2 with the names of the sites you want to keep (the spelling, capitalization, and spacing must match exactly, because computers are dumb). Here, "SITE1" and "SITE2" are string values and thus need to be in quotes.

    • filter(year == YEAR, site %in% c("SITE1", "SITE2")): Both conditions are placed within the filter() function, meaning that the function will keep only rows that satisfy both conditions.