Choosing the Right Correlation Measure

Relationships between Pearson, Spearman, and Kendall Correlation in a bivariate normal population (source)

Types of Correlation Measures

Pearson’s Correlation Coefficient

Pearson’s correlation quantifies the linear relationship between two continuous variables. It’s calculated as the covariance of the two variables divided by the product of their standard deviations:

\(r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}\)

Pearson’s correlation, and especially the significance tests and confidence intervals built on it, assumes:

  1. Bivariate normal distributions

  2. Variables are continuous

  3. Linear relationship between variables

  4. Homoscedasticity (constant variance)

  5. Absence of outliers
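To make the formula concrete, here is a minimal sketch that computes Pearson’s r term by term and compares it with R’s built-in `cor()`. The vectors `x` and `y` are hypothetical toy data chosen only for illustration.

```r
# Hypothetical toy data (not from the ecosystem example)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Numerator: sum of cross-products of deviations from the means
num <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: product of the root sums of squared deviations
den <- sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))

r_manual  <- num / den
r_builtin <- cor(x, y, method = "pearson")

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```

For this toy data the numerator is 6 and the denominator is the square root of 60, so r is about 0.775.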

Spearman’s Rank Correlation

Spearman’s correlation assesses monotonic relationships, whether linear or non-linear. It’s calculated using ranked values:

\(\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\)

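Where \(d_i\) is the difference between the ranks of corresponding values and \(n\) is the number of observations. This shortcut formula is exact only when there are no tied ranks; with ties, Spearman’s \(\rho\) is computed as Pearson’s correlation of the ranks.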

Spearman’s correlation:

  1. Is less sensitive to outliers than Pearson’s

  2. Can be used with ordinal data

  3. Doesn’t assume normality or linearity

  4. Measures the strength of monotonic relationships
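As a quick check of the rank formula, the sketch below computes Spearman’s ρ three ways on hypothetical toy data without ties: with the shortcut formula, as Pearson’s correlation of the ranks, and with `cor(..., method = "spearman")`.

```r
# Hypothetical toy data with no tied values
x <- c(10, 20, 30, 40, 50)
y <- c(1, 3, 2, 5, 4)

rx <- rank(x)
ry <- rank(y)
d  <- rx - ry          # rank differences d_i
n  <- length(x)

rho_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_ranks    <- cor(rx, ry)                      # Pearson on the ranks
rho_builtin  <- cor(x, y, method = "spearman")

c(rho_shortcut, rho_ranks, rho_builtin)          # all three are 0.8
```

Here the squared rank differences sum to 4, so ρ = 1 - 24/120 = 0.8.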

Kendall’s Tau

Kendall’s Tau evaluates the ordinal association between two variables. It’s based on the number of concordant and discordant pairs:

\(\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2}\)

Kendall’s Tau:

  1. Is more robust against outliers than Spearman’s

  2. Has a more intuitive interpretation in terms of probabilities

  3. Is preferred for small sample sizes with a large number of tied ranks
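The pair-counting definition can be sketched directly in base R. The example below enumerates all pairs of observations in hypothetical toy data (no ties), counts concordant and discordant pairs, and compares the result with `cor(..., method = "kendall")`.

```r
# Hypothetical toy data with no tied values
x <- c(1, 2, 3, 4, 5)
y <- c(1, 3, 2, 5, 4)

# All index pairs (i, j) with i < j: a 2 x 10 matrix
idx <- combn(length(x), 2)

# +1 if a pair is ordered the same way in x and y, -1 if ordered oppositely
s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])

n_conc <- sum(s > 0)   # concordant pairs
n_disc <- sum(s < 0)   # discordant pairs

tau_manual  <- (n_conc - n_disc) / choose(length(x), 2)
tau_builtin <- cor(x, y, method = "kendall")

c(n_conc, n_disc, tau_manual, tau_builtin)       # 8, 2, 0.6, 0.6
```

With 8 concordant and 2 discordant pairs out of 10, τ = (8 - 2) / 10 = 0.6.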

Practical Application

Let’s apply these correlation measures to some simulated ecosystem data:

library(tibble)
library(ggplot2)

set.seed(789)  # For reproducibility

ecosystem_data <- tibble(
  temperature = rnorm(100, mean = 25, sd = 2),  # Temperature in Celsius
  salinity = rnorm(100, mean = 35, sd = 1),    # Salinity 
  nutrient_levels = rnorm(100, mean = 10, sd = 2),  # Nutrient levels
  biodiversity = 50 + 0.3 * temperature - 0.5 * salinity + 0.4 * nutrient_levels + rnorm(100, mean = 0, sd = 5)
)

pearson_cor <- cor(ecosystem_data$temperature, ecosystem_data$biodiversity, method = "pearson")
spearman_cor <- cor(ecosystem_data$temperature, ecosystem_data$biodiversity, method = "spearman")
kendall_tau <- cor(ecosystem_data$temperature, ecosystem_data$biodiversity, method = "kendall")

print(paste("Pearson's correlation:", round(pearson_cor, 3)))
[1] "Pearson's correlation: 0.194"
print(paste("Spearman's correlation:", round(spearman_cor, 3)))
[1] "Spearman's correlation: 0.176"
print(paste("Kendall's Tau:", round(kendall_tau, 3)))
[1] "Kendall's Tau: 0.125"

Different values, but why?

  1. Pearson’s correlation (0.194):
  • measures the linear relationship between two variables and is sensitive to the actual values of the data. Here the relationship is linear by construction, so the weak coefficient reflects noise rather than curvature: the random term (sd = 5) is large compared with the temperature effect (0.3 per degree, with a temperature sd of 2).
  2. Spearman’s correlation (0.176):
  • is Pearson’s correlation computed on the ranks, so it measures how monotonic the relationship is and is less sensitive to outliers. For roughly normal data it is usually slightly smaller in magnitude than Pearson’s r, so the lower value here does not signal a departure from monotonicity.
  3. Kendall’s Tau (0.125):
  • is also rank-based but is measured on a different scale: it is the proportion of concordant pairs minus the proportion of discordant pairs. For the same data it is typically smaller in magnitude than Spearman’s (roughly two-thirds of it when the correlation is weak), so the lower value reflects the scale of the statistic, not weaker agreement in rank order.
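One way to check how much of the gap between the three values is simply a matter of scale is to use the standard relationships between the coefficients in a bivariate normal population, which the figure caption at the top alludes to. These formulas are brought in here as an assumption (the text above does not state them), and they describe population values, so sample estimates will only match approximately.

```r
# Sample Pearson correlation reported above
r <- 0.194

# Values implied for a bivariate normal population (assumed relationships):
#   Spearman: rho = (6 / pi) * asin(r / 2)
#   Kendall:  tau = (2 / pi) * asin(r)
rho_implied <- (6 / pi) * asin(r / 2)
tau_implied <- (2 / pi) * asin(r)

round(c(spearman = rho_implied, kendall = tau_implied), 3)
# roughly 0.186 and 0.124
```

The implied values sit close to the observed 0.176 and 0.125, which suggests the three coefficients are telling a consistent story on different scales.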

    Important

    A relationship between two variables is monotonic if, as one variable increases (or decreases), the other variable consistently moves in one direction—either always increasing or always decreasing.
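A small illustration of a relationship that is monotonic but not linear, using hypothetical values: `y` grows exponentially with `x`, so the rank-based measures are perfect while Pearson’s r falls short of 1.

```r
# Hypothetical example: strictly increasing but strongly curved
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")    # below 1: the trend is not a line
spearman <- cor(x, y, method = "spearman")   # exactly 1: ranks agree perfectly
kendall  <- cor(x, y, method = "kendall")    # exactly 1: every pair is concordant

c(pearson = pearson, spearman = spearman, kendall = kendall)
```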

Choosing the Right Measure

1. Visual Inspection

Start with a scatter plot:

ggplot(ecosystem_data, aes(x=temperature, y=biodiversity)) +
  geom_point() +
  # geom_smooth(method="lm", col="red") +
  theme_minimal() +
  labs(title="Scatter plot of temperature vs. biodiversity")

2. Check for Linearity

Assess if the relationship appears linear. Non-linear patterns may require Spearman’s or Kendall’s methods.

3. Test for Normality

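Use the Shapiro-Wilk test on each variable. A p-value above 0.05, as for both variables here, gives no evidence against normality; note that the test checks each variable separately, not bivariate normality: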

shapiro.test(ecosystem_data$temperature)

    Shapiro-Wilk normality test

data:  ecosystem_data$temperature
W = 0.98763, p-value = 0.4816
shapiro.test(ecosystem_data$biodiversity)

    Shapiro-Wilk normality test

data:  ecosystem_data$biodiversity
W = 0.99392, p-value = 0.9369

4. Consider Sample Size and Ties

For small samples (n < 30) or data with many ties, Kendall’s Tau may be more appropriate than Spearman’s.

In this context, ties are instances where two or more observations have the same value.
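To make ties concrete, the sketch below ranks a hypothetical vector containing a repeated value. By default, R’s `rank()` gives tied observations the average of the ranks they would otherwise occupy.

```r
# Hypothetical scores; the value 1 appears twice
scores <- c(3, 1, 4, 1, 5)

has_ties <- any(duplicated(scores))   # TRUE
ranks    <- rank(scores)              # default ties.method = "average"

ranks                                 # 3.0 1.5 4.0 1.5 5.0
```

The two tied observations would have taken ranks 1 and 2, so each receives 1.5.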

5. Interpretability

Pearson’s r^2 represents the proportion of variance in one variable explained by the other. Spearman’s ρ^2 represents the proportion of variance in ranks explained. Kendall’s τ represents the probability of concordance minus the probability of discordance.
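Plugging in the coefficients reported earlier shows what these interpretations mean in practice. The conversion from τ to a probability of concordance assumes there are no ties, so that every pair is either concordant or discordant.

```r
# Coefficients reported in the practical example above
r   <- 0.194
rho <- 0.176
tau <- 0.125

r_squared   <- r^2             # 0.037636: about 3.8% of variance explained
rho_squared <- rho^2           # 0.030976: about 3.1% of rank variance explained

# With no ties, P(concordant) + P(discordant) = 1 and their difference is tau
p_concordant <- (1 + tau) / 2  # 0.5625
p_discordant <- (1 - tau) / 2  # 0.4375
```

In other words, a randomly chosen pair of observations is ordered the same way on both variables about 56% of the time.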

Additional Considerations

Bootstrapping for Confidence Intervals

Bootstrapping is a resampling technique for estimating the sampling variability of a statistic such as a correlation coefficient. It repeatedly samples rows with replacement from the dataset, recomputes the statistic on each resample, and uses the resulting distribution to build a confidence interval:

library(boot)

# Function to compute correlation
cor_func <- function(data, indices) {
  d <- data[indices,]
  return(cor(d$temperature, d$biodiversity, method="pearson"))
}

# Bootstrapping
set.seed(123)
boot_results <- boot(ecosystem_data, cor_func, R=1000)

# Print results
print(boot.ci(boot_results, type="bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot_results, type = "bca")

Intervals : 
Level       BCa          
95%   ( 0.0003,  0.3971 )  
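Calculations and Intervals on Original Scale

The 95% BCa interval is roughly (0.0003, 0.397): it only just excludes zero, so the data are consistent with anything from a negligible to a moderate positive correlation.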
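The resampling idea can also be sketched in base R without the `boot` package. The example below builds a simple percentile interval (not the BCa interval used above) for Spearman’s correlation on hypothetical simulated data; the seed, sample size, and number of resamples are arbitrary choices for illustration.

```r
set.seed(42)

# Hypothetical data with a moderate positive association
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Resample row indices with replacement and recompute the statistic each time
boot_rho <- replicate(2000, {
  idx <- sample.int(n, replace = TRUE)
  cor(x[idx], y[idx], method = "spearman")
})

# Percentile interval: the middle 95% of the bootstrap distribution
ci <- quantile(boot_rho, probs = c(0.025, 0.975))
ci
```

The percentile interval is the simplest variant; BCa additionally corrects for bias and skewness in the bootstrap distribution.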

Partial Correlation

When you want to measure the relationship between two variables while controlling for the effect of one or more other variables, use partial correlation:

library(ppcor)

pcor_result <- pcor.test(
  ecosystem_data$temperature,
  ecosystem_data$biodiversity,
  ecosystem_data$salinity,
  method = "pearson"
)

print(pcor_result)
   estimate    p.value statistic   n gp  Method
1 0.1875848 0.06298709  1.880885 100  1 pearson

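This partial correlation shows the relationship between temperature and biodiversity, controlling for the effect of salinity. The estimate (0.188) is almost identical to the raw Pearson correlation (0.194), which is expected because salinity was simulated independently of temperature; with p = 0.063, the association is not significant at the 0.05 level.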
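To see what "controlling for" means mechanically, a Pearson partial correlation can be reproduced in base R by correlating the residuals left after regressing each variable on the control. The sketch below uses hypothetical simulated variables `x`, `y`, and `z` (not the ecosystem data) and checks the residual approach against the closed-form formula.

```r
set.seed(1)

# Hypothetical data: x and y are related only through the shared driver z
z <- rnorm(200)
x <- 0.6 * z + rnorm(200)
y <- 0.6 * z + rnorm(200)

# Residual approach: remove the linear effect of z from both variables
res_x <- resid(lm(x ~ z))
res_y <- resid(lm(y ~ z))
partial_resid <- cor(res_x, res_y)

# Closed-form partial correlation from the three pairwise correlations
r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)
partial_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

all.equal(partial_resid, partial_formula)   # TRUE up to floating-point error
```

Because `x` and `y` share `z` as a common driver, the raw correlation `r_xy` is noticeably larger than the partial correlation, which is the opposite of the ecosystem example where the control variable is unrelated to temperature.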

The choice of correlation measure should be guided by your data’s characteristics, research question, and the assumptions you can reasonably make. Each method offers unique insights, and sometimes using multiple methods can provide a more comprehensive understanding of your data.