Understanding Statistical Questions for Categorical Data
Goodness-of-Fit Analysis
When we want to know if our observed data matches theoretical expectations, we turn to goodness-of-fit analysis. The null hypothesis posits that our observations align with predicted frequencies. Let’s look at a concrete example with species distribution:
# Species countsobserved_counts<-c(25, 27, 25, 23)expected_counts<-c(25, 25, 25, 25)# Chi-square testtest_result<-chisq.test(x =observed_counts, p =expected_counts/sum(expected_counts))test_result
Chi-squared test for given probabilities
data: observed_counts
X-squared = 0.32, df = 3, p-value = 0.9562
Here we’re testing whether four species appear in equal numbers. Our chi-square value of 0.32 (p=0.9562) suggests they do - any small variations from perfect equality likely occurred by chance.
Testing Independence
Sometimes we need to know if two categorical variables influence each other. For instance, do organisms show preferences for certain habitats? The independence test helps answer this question by examining whether the distribution of one variable changes with another.
Pearson's Chi-squared test with Yates' continuity correction
data: observed_counts
X-squared = 0.051975, df = 1, p-value = 0.8197
With a chi-square value of 0.052 (p=0.8197), we see no evidence that species show habitat preferences - they appear to use shallow and deep water independently.
Homogeneity Testing
Homogeneity tests ask whether different samples come from the same population. Think of comparing species composition between regions to determine if they’re connected. Here’s an example comparing four samples across two regions: