Correlation Theory and Long Hand Calculation

Correlation is a fundamental concept in statistics that describes the linear relationship between two quantitative variables. The strength and direction of this relationship is quantified by a correlation coefficient, typically denoted as \(r\). The value of \(r\) lies between -1 and 1, where:

\(r = 1\) implies a perfect positive linear relationship.
\(r = -1\) implies a perfect negative linear relationship.
\(r = 0\) implies no linear relationship.

Calculating correlation coefficients

for this section, we will look at this example dataset, called fish_survey_data

Variables:

Sea Surface Temperature (SST): Temperature of the upper few meters of the ocean. It can influence marine life directly and indirectly.
Chlorophyll Concentration: Indicates the presence of phytoplankton, the base of the marine food chain.
Salinity: The salt concentration in the water. Can influence where certain species can thrive.
Fish Abundance: Population of a particular species of fish, which can be influenced by the other variables.
Seagrass Coverage: The area covered by seagrass, a crucial habitat and food source in marine ecosystems.

Objective: To understand how environmental factors (SST, Chlorophyll Concentration, and Salinity) influence the biological components (Fish Abundance and Seagrass Coverage) of a marine ecosystem.

library(tibble)
set.seed(123)  # For reproducibility

fish_survey_data <- tibble(
  SST = rnorm(100, mean = 25, sd = 2),  # Sea Surface Temperature in Celsius
  Chlorophyll = rnorm(100, mean = 1.5, sd = 0.3),  # Chlorophyll concentration in mg/m^3
  Salinity = rnorm(100, mean = 35, sd = 1),  # Salinity in PSU
  Fish_Abundance = 5000 + 20 * SST + 3000 * Chlorophyll - 50 * Salinity + rnorm(100, mean = 0, sd = 500),
  Seagrass_Coverage = rnorm(100, mean = 50, sd = 10)  # Seagrass coverage in percentage
)

head(fish_survey_data)  # Display the first few rows of the dataset

# A tibble: 6 × 5
    SST Chlorophyll Salinity Fish_Abundance Seagrass_Coverage
  <dbl>       <dbl>    <dbl>          <dbl>             <dbl>
1  23.9        1.29     37.2          7121.              49.3
2  24.5        1.58     36.3          8030.              38.3
3  28.1        1.43     34.7          7634.              43.7
4  25.1        1.40     35.5          7387.              49.7
5  25.3        1.21     34.6          7201.              56.7
6  28.4        1.49     34.5          8467.              33.5

This simulated dataset provides a foundation to explore various ecological questions related to marine ecosystems. For instance:

Is there a relationship between sea surface temperature and Fish Abundance?

library(ggplot2)

ggplot(fish_survey_data, aes(x = SST, y = Fish_Abundance)) +
  geom_point(color = "blue", alpha = 0.5) +
  # geom_smooth(method = "lm", col = "red") +  # Adds a linear regression line
  theme_minimal() +
  labs(title = "Relationship between SST and Fish Abundance",
       x = "Sea Surface Temperature (°C)",
       y = "Fish Abundance")

Is there a relationship between chlorophyll concentration (indicative of phytoplankton presence) and Fish Abundance?

library(ggplot2)

ggplot(fish_survey_data, aes(x = Chlorophyll, y = Fish_Abundance)) +
  geom_point(color = "blue", alpha = 0.5) +
  # geom_smooth(method = "lm", col = "red") +  # Adds a linear regression line
  theme_minimal() +
  labs(title = "Relationship between Chlorophyll and Fish Abundance",
       x = "Chlorophyll concentration in mg/m^3",
       y = "Fish Abundance")

Pearson’s Correlation Coefficient Formula

Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of their standard deviations:

\[ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} \]

Where:

\(X_i\) and \(Y_i\) are the individual data points.
\(\bar{X}\) and \(\bar{Y}\) are the means of \(X\) and \(Y\) respectively.

Long Hand Calculation

Dont believe me? Lets calculate the long way :

Calculate the Means:

We’ll start by calculating the mean of Chlorophyll (\(\bar{X}\)) and Fish_Abundance (\(\bar{Y}\)):

mean_X <- mean(fish_survey_data$Chlorophyll)
mean_Y <- mean(fish_survey_data$Fish_Abundance)

Calculate the Numerator:

For the numerator, we need to calculate the sum of the product of the deviations of Chlorophyll and Fish_Abundance from their respective means (aka the covariance):

numerator <- sum((fish_survey_data$Chlorophyll - mean_X) * (fish_survey_data$Fish_Abundance - mean_Y))

Calculate the Denominator:

For the denominator, we’ll first calculate the square root of the sum of the squared deviations of Chlorophyll and Fish_Abundance from their means, and then multiply the two results:

sqrt_X <- sqrt(sum((fish_survey_data$Chlorophyll - mean_X)^2))
sqrt_Y <- sqrt(sum((fish_survey_data$Fish_Abundance - mean_Y)^2))
denominator <- sqrt_X * sqrt_Y

Calculate Pearson’s Correlation Coefficient:

Finally, divide the numerator by the denominator to get \(r\):

r <- numerator / denominator
r

[1] 0.8619306

Shorthand version

r_shorthand <- cor(fish_survey_data$Chlorophyll, fish_survey_data$Fish_Abundance, method = "pearson")
r_shorthand

[1] 0.8619306

An \(r\) value of \(0.86\) is a Pearson correlation coefficient, and it provides insights into the strength and direction of the linear relationship between two variables.

In the context of our dataset, an \(r\) value of \(0.86\) between Chlorophyll and Fish_Abundance suggests that as Chlorophyll increases, the Fish Abundance also tends to increase, and this relationship is strong and linear.

This relationship is fairly evident in the plot:

library(ggplot2)

ggplot(fish_survey_data, aes(x = Chlorophyll, y = Fish_Abundance)) +
  geom_point(color = "blue", alpha = 0.5) +
  # geom_smooth(method = "lm", col = "red") +  # Adds a linear regression line
  theme_minimal() +
  labs(title = "Relationship between Chlorophyll and Fish Abundance",
       x = "Chlorophyll concentration in mg/m^3",
       y = "Fish Abundance")

Kumar, Sunil, and Ilyoung Chong. 2018. “Correlation Analysis to Identify the Effective Data in Machine Learning: Prediction of Depressive Disorder and Emotion States.” International Journal of Environmental Research and Public Health 15 (12): 2907. https://doi.org/10.3390/ijerph15122907.