Understanding Statistical Questions for Categorical Data

Goodness-of-Fit Analysis

When we want to know whether observed counts match theoretical expectations, we turn to goodness-of-fit analysis. The null hypothesis is that the observations were drawn from a population with the expected frequencies. Let’s look at a concrete example with species counts:

# Species counts
observed_counts <- c(25, 27, 25, 23)
expected_counts <- c(25, 25, 25, 25)

# Chi-square test
test_result <- chisq.test(x = observed_counts, 
                         p = expected_counts / sum(expected_counts))
test_result

    Chi-squared test for given probabilities

data:  observed_counts
X-squared = 0.32, df = 3, p-value = 0.9562

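Here we’re testing whether four species appear in equal numbers. With a chi-square value of 0.32 on 3 degrees of freedom (p = 0.9562), we fail to reject the null hypothesis: the small departures from perfect equality are well within what chance alone would produce. Note that this is an absence of evidence against equal frequencies, not proof that the frequencies are equal.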
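To see where these numbers come from, the statistic can be recomputed by hand as the sum of (observed - expected)^2 / expected. The following is a minimal sketch using the same counts; the names chi_sq, df, and p_value are introduced here for illustration only.

```r
# Same counts as in the example above
observed_counts <- c(25, 27, 25, 23)
expected_counts <- c(25, 25, 25, 25)

# Pearson statistic: squared deviations scaled by the expected counts
chi_sq <- sum((observed_counts - expected_counts)^2 / expected_counts)

# Degrees of freedom: number of categories minus one
df <- length(observed_counts) - 1

# Upper-tail probability of a statistic at least this large under the null
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

chi_sq   # 0.32
df       # 3
p_value  # about 0.956, in line with the chisq.test() output
```

Only the second and fourth species contribute to the statistic (4/25 each); the other two match their expected counts exactly.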

Testing Independence

Sometimes we need to know whether two categorical variables are associated. For instance, does habitat use depend on species? The independence test addresses this by examining whether the distribution of one variable changes across the levels of the other. It detects association only; it does not show that one variable causes the other.

# Create and analyze habitat preference data
observed_counts <- matrix(c(12, 38, 14, 36), nrow = 2, byrow = TRUE)
rownames(observed_counts) <- c("Species A", "Species B")
colnames(observed_counts) <- c("Shallow Water", "Deep Water")

test_result <- chisq.test(observed_counts)
test_result

    Pearson's Chi-squared test with Yates' continuity correction

data:  observed_counts
X-squared = 0.051975, df = 1, p-value = 0.8197

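With a chi-square value of 0.052 (p = 0.8197), we see no evidence that habitat use depends on species. Both species were found mostly in deep water (38 and 36 of 50 individuals, respectively), so each may well favor deep water; what the test shows is that the two species do not differ detectably in that respect.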
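Under independence, the expected count in each cell is its row total times its column total, divided by the grand total. The sketch below recomputes those expected counts and the Yates-corrected statistic by hand, then compares them with what chisq.test() returns. The names expected_counts, chi_sq_yates, and fit are introduced here for illustration.

```r
observed_counts <- matrix(c(12, 38, 14, 36), nrow = 2, byrow = TRUE,
                          dimnames = list(c("Species A", "Species B"),
                                          c("Shallow Water", "Deep Water")))

# Expected counts if species and habitat were independent
expected_counts <- outer(rowSums(observed_counts),
                         colSums(observed_counts)) / sum(observed_counts)
expected_counts

# Yates' correction shrinks each absolute deviation by 0.5
# (chisq.test() applies it to 2 x 2 tables by default)
chi_sq_yates <- sum((abs(observed_counts - expected_counts) - 0.5)^2 /
                      expected_counts)

# Compare with the components returned by chisq.test()
fit <- chisq.test(observed_counts)
all.equal(expected_counts, fit$expected, check.attributes = FALSE)
all.equal(chi_sq_yates, unname(fit$statistic))
```

Each observed count is only one individual away from its expected count (13 in shallow water and 37 in deep water for both species), which is why the statistic is so small.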

Homogeneity Testing

Homogeneity tests ask whether different samples come from the same population. Think of comparing species composition between regions to judge whether they share a common community. Here’s an example comparing four samples, two from each of two regions:

# Create regional comparison data
observed_counts <- matrix(c(50, 55, 20, 58, 36, 24, 37, 23),
                         nrow = 4,
                         byrow = TRUE)
rownames(observed_counts) <- c("Region 1 - Sample 1",
                              "Region 1 - Sample 2",
                              "Region 2 - Sample 1",
                              "Region 2 - Sample 2")
colnames(observed_counts) <- c("Species A", "Species B")

test_result <- chisq.test(observed_counts)
test_result

    Pearson's Chi-squared test

data:  observed_counts
X-squared = 23.538, df = 3, p-value = 3.119e-05

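The large chi-square value of 23.538 (df = 3, p < 0.001) indicates that the four samples do not share a common species composition. The test by itself does not say which samples differ. Let’s locate the test statistic on a chi-square distribution: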

# Density of the chi-square distribution with the test's 3 degrees of freedom
curve(dchisq(x, df = 3), 0, 50, ylab = "Density")

# Test statistic
abline(v = as.numeric(test_result$statistic), col = "firebrick4")
text(as.numeric(test_result$statistic), 0.1, "Test\n stat", pos = 4, col = "firebrick4")

# Critical value at the 5% significance level
abline(v = qchisq(0.95, df = 3), col = "steelblue")
text(qchisq(0.95, df = 3), 0.2, "Critical\n value", pos = 4, col = "steelblue")

The red line (our test statistic) falls far beyond the blue line marking the 5% critical value (about 7.81 for 3 degrees of freedom), so we reject the null hypothesis that the samples are homogeneous.
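A significant result says the samples are not all alike, but not where the differences lie. One way to look further, sketched here as a suggestion rather than a required step, is to compare the species proportions within each sample and to inspect the standardized residuals that chisq.test() returns in its stdres component.

```r
observed_counts <- matrix(c(50, 55, 20, 58, 36, 24, 37, 23),
                          nrow = 4, byrow = TRUE,
                          dimnames = list(c("Region 1 - Sample 1",
                                            "Region 1 - Sample 2",
                                            "Region 2 - Sample 1",
                                            "Region 2 - Sample 2"),
                                          c("Species A", "Species B")))
test_result <- chisq.test(observed_counts)

# Share of each species within each sample (each row sums to 1)
round(prop.table(observed_counts, margin = 1), 2)

# Standardized residuals: values beyond roughly +/-2 flag the cells that
# depart most from the counts expected under homogeneity
round(test_result$stdres, 2)

# Degrees of freedom: (rows - 1) * (columns - 1)
(nrow(observed_counts) - 1) * (ncol(observed_counts) - 1)
```

In these data, Species A makes up about 48% and 26% of the two Region 1 samples but 60% and 62% of the two Region 2 samples. The heterogeneity therefore reflects variation within Region 1 as well as between regions, which is worth keeping in mind before attributing the result to a regional difference alone.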

Choosing Your Analysis

Your research question guides your choice of test:

  • Testing against theoretical expectations? Use goodness-of-fit

  • Examining relationships between variables? Use independence tests

  • Comparing samples for fundamental differences? Use homogeneity tests
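In R, the choice shows up in how chisq.test() is called. The sketch below reuses the counts from the examples above; the names gof, tab, and two_way are introduced here for illustration. A vector of counts with hypothesised proportions gives a goodness-of-fit test, while a two-way table gives the computation shared by independence and homogeneity tests, which differ in how the data were sampled and how the result is interpreted rather than in the arithmetic.

```r
# Goodness-of-fit: a vector of counts plus hypothesised proportions
gof <- chisq.test(x = c(25, 27, 25, 23), p = rep(1 / 4, 4))

# Independence or homogeneity: a two-way table of counts
tab <- matrix(c(12, 38, 14, 36), nrow = 2, byrow = TRUE)
two_way <- chisq.test(tab)

# The method component records which test was run
gof$method
two_way$method
```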