Advanced data visualization with ggplot

Causes of Mortality in the Army of the East — Florence Nightingale

A. Main components of the grammar of graphics

Typically, any visualization with one or more dimensions can be built or described using the following components (source).

The Grammar of Graphics: Building Plots Layer by Layer

The grammar of graphics is a powerful framework for creating data visualizations. It breaks down the process of making a plot into distinct components, allowing you to build complex visualizations step by step. Let’s explore this concept using R’s ggplot2 package, which implements the grammar of graphics.

The Basic Components

At its core, the grammar of graphics consists of several key components:

Data: The dataset you want to visualize
Aesthetics: How your data maps to visual properties
Geometries: The shapes used to represent your data
Scales: How data values are translated into visual values such as axis positions, colors, and sizes
Facets: How to split your plot into subplots
Themes: The overall visual style of your plot

Let’s build a plot incrementally to see how these components work together.

Building a Plot Step by Step

We’ll use a simple dataset about car fuel efficiency for our examples.

library(ggplot2)
library(dplyr)

# Load the dataset
data(mpg)

# View the first few rows
head(mpg)

# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Step 1: Start with Data

The first step is to specify your data:

p <- ggplot(mpg)
p

This creates a blank canvas. Nothing exciting yet, but it’s the foundation for our plot.

Step 2: Add Aesthetics

Next, we map variables in our data to visual properties:

p <- ggplot(mpg, aes(x = displ, y = hwy))
p

We’ve defined our x and y axes, but still no points appear because we haven’t told ggplot how to represent the data.

Step 3: Add Geometry

Now let’s add a geometry to represent our data points:

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p

Now we see our data! Each point represents a car, with engine displacement on the x-axis and highway fuel efficiency on the y-axis.
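A minimal sketch of why the first two plots come out empty: a ggplot object stores its data, aesthetic mappings, and layers separately, and nothing is drawn until at least one layer exists. The object names `p_data`, `p_aes`, and `p_geom` are illustrative. Reading `$layers` and `$mapping` directly is a convenience for exploration, not something the text above relies on, and the details may differ between ggplot2 versions.

```r
library(ggplot2)

p_data <- ggplot(mpg)                           # Step 1: data only
p_aes  <- ggplot(mpg, aes(x = displ, y = hwy))  # Step 2: data + aesthetics
p_geom <- p_aes + geom_point()                  # Step 3: data + aesthetics + geometry

length(p_data$layers)  # number of layers after Step 1
names(p_aes$mapping)   # aesthetics defined in Step 2
length(p_geom$layers)  # number of layers after Step 3
```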

Step 4: Enhance with More Aesthetics

We can map more variables to other visual properties:

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
p

Now the color of each point represents the class of the car.
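Mapping an aesthetic inside `aes()` is different from setting it to a constant outside `aes()`. The sketch below contrasts the two; the object names and the color "steelblue" are illustrative choices, and `layer_data()` is used only to inspect the colors ggplot2 computed for the layer.

```r
library(ggplot2)

# Mapped: color depends on the class variable, and a legend is drawn
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))

# Set: every point gets the same fixed color, and no legend is drawn
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue")

length(unique(layer_data(p_mapped)$colour))  # one color per car class
unique(layer_data(p_set)$colour)             # a single constant color
```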

Step 5: Add Another Geometry

Let’s add a trend line to our plot:

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p

We’ve added a linear regression line for each class.
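The per-class lines appear because `color = class` sits in the global `aes()`, so `geom_smooth()` inherits the grouping. If a single overall trend line is wanted instead, one option is to map color only inside `geom_point()`. This is a sketch with illustrative object names; `formula = y ~ x` is spelled out only to silence the informational message.

```r
library(ggplot2)

# Color mapped globally: geom_smooth() inherits it and fits one line per class
p_per_class <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Color mapped only in geom_point(): geom_smooth() sees a single group
p_overall <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black")

# Count the fitted lines in each plot (layer 2 is the smooth layer)
length(unique(layer_data(p_per_class, 2)$group))
length(unique(layer_data(p_overall, 2)$group))
```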

Step 6: Modify Scales

We can adjust how our data is mapped to the plot:

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(limits = c(1, 7)) +
  scale_y_continuous(limits = c(10, 45))
p

We’ve adjusted the limits of our x and y axes for better focus on our data.
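One caveat worth knowing: limits set on a scale discard observations outside the range before any statistics are computed, whereas `coord_cartesian()` only zooms the view. With the limits used above every observation falls inside the range, so nothing is lost. The sketch below deliberately uses a narrower, illustrative range of 2 to 5 litres to make the difference visible.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Scale limits: out-of-range values are treated as missing
p_scale <- p_base + scale_x_continuous(limits = c(2, 5))

# Coordinate limits: all data are kept, only the visible window changes
p_zoom <- p_base + coord_cartesian(xlim = c(2, 5))

d_scale <- layer_data(p_scale)
d_zoom  <- layer_data(p_zoom)

sum(is.finite(d_scale$x))  # points that survive the scale limits
sum(is.finite(d_zoom$x))   # points kept when zooming
```

This matters most when a layer computes a statistic, such as `geom_smooth()`, because the fit changes when data are dropped.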

Step 7: Add Facets

To compare across categories, we can split our plot into subplots:

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(limits = c(1, 7)) +
  scale_y_continuous(limits = c(10, 45)) +
  facet_wrap(~year)
p

Now we have separate plots for each year in our dataset.
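`facet_wrap()` splits on one variable; when two categorical variables are of interest, `facet_grid()` lays the panels out as rows by columns. A small sketch, assuming drive type (`drv`) as the row variable, which is an illustrative choice not made in the text above:

```r
library(ggplot2)

p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ year) +  # rows: drive type, columns: model year
  labs(x = "Engine Displacement (L)", y = "Highway MPG")

# Number of panels that contain data
length(unique(layer_data(p_grid)$PANEL))
```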

Step 8: Refine with Themes

Finally, let’s adjust the overall look of our plot:

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(limits = c(1, 7)) +
  scale_y_continuous(limits = c(10, 45)) +
  facet_wrap(~year) +
  theme_minimal() +
  labs(title = "Car Efficiency vs Engine Size",
       x = "Engine Displacement (L)",
       y = "Highway MPG",
       color = "Car Class")
p

We’ve added a minimal theme and proper labels to our plot.

Let’s explore more plot types and other ggplot2 functions.

B. Boxplots with Significance Letters

Boxplots are great for visualizing the central tendency and dispersion of a continuous variable across different categories. Adding significance letters aids in quickly conveying statistical differences between groups.

Example: Comparing shell lengths across different mussel species.

# Load necessary library
library(ggplot2)

# Simulating data
set.seed(123)
data_box <- data.frame(
  species = rep(c("A", "B", "C"), each=30),
  shell_length = c(rnorm(30, mean=50, sd=10),
                   rnorm(30, mean=60, sd=10),
                   rnorm(30, mean=50, sd=10))
)

# Creating a separate data frame for labels
labels <- data.frame(
  species = c("A", "B", "C"),
  y = c(75, 90, 75),  # y position of labels, adjust as needed
  label = c("a", "b", "a")
)

# Creating a ggplot
p <- ggplot(data_box) +
  geom_boxplot(aes(x = species, y = shell_length, fill = species),
               show.legend = FALSE) +
  geom_text(data=labels, aes(x=species, y=y, label=label), color="black", size = 10) +
  labs(x="Species", y="Shell Length") +
  theme_minimal()

# Displaying the plot
p
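The letters and label heights in the example above are typed in by hand. As a hedged sketch, the pairwise comparisons behind such letters could be checked with a one-way ANOVA followed by Tukey's HSD (an assumed choice of test; the example does not say which test produced the letters), and the label heights could be derived from the data instead of hard-coded. The offset of 5 units is an arbitrary illustrative value.

```r
# Same simulated data as in the example above, with species stored as a factor
set.seed(123)
data_box <- data.frame(
  species = factor(rep(c("A", "B", "C"), each = 30)),
  shell_length = c(rnorm(30, mean = 50, sd = 10),
                   rnorm(30, mean = 60, sd = 10),
                   rnorm(30, mean = 50, sd = 10))
)

# Pairwise comparisons (assumed test: one-way ANOVA + Tukey HSD)
fit   <- aov(shell_length ~ species, data = data_box)
tukey <- TukeyHSD(fit)$species
tukey[, "p adj"]  # groups sharing a letter should not differ significantly

# Label heights: a fixed offset above each group's maximum
label_pos   <- aggregate(shell_length ~ species, data = data_box, FUN = max)
label_pos$y <- label_pos$shell_length + 5
label_pos
```

Translating the adjusted p-values into compact letters is usually delegated to an add-on package; that step is left out here.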

C. Heatmaps

Heatmaps display numerical values as colors and are often used to show how a response variable changes across two other variables, either categorical or discretized (such as month and depth).

Example: Visualizing phytoplankton abundance across various depths and months.

# Simulating data
set.seed(123)
data_heatmap <- expand.grid(month=1:12, depth=1:100)
data_heatmap$abundance <- rnorm(1200, mean=15, sd=5)

# Heatmap
p_heatmap <- ggplot(data_heatmap, aes(x=month, y=depth)) +
  geom_tile(aes(fill=abundance), color="white") +
  scale_fill_gradient(low="blue", high="yellow") +
  labs(x="Month", y="Depth", fill="Abundance") +
  theme_minimal()
p_heatmap

D. Dendrograms

Dendrograms provide a visual representation of the arrangement of clusters produced by hierarchical clustering.

Example: Understanding genetic similarities between different marine algae species.

# Simulating data
set.seed(123)
data_dendro <- matrix(rnorm(100), nrow=10, ncol=10)
rownames(data_dendro) <- paste("Species", 1:10)  # row names become the leaf labels

# Dendrogram
hclust_obj <- hclust(dist(data_dendro))
dendrogram <- as.dendrogram(hclust_obj)

# Plot
plot(dendrogram, main="Hierarchical Clustering of Marine Species", xlab="Species", sub="", ylab="Distance")
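A dendrogram is often followed by cutting the tree into a fixed number of groups. The sketch below uses base R's `cutree()`; the choice of three clusters and the labels `sp1` to `sp10` are illustrative assumptions.

```r
set.seed(123)
m <- matrix(rnorm(100), nrow = 10, ncol = 10)
rownames(m) <- paste0("sp", 1:10)  # hypothetical species labels

hc <- hclust(dist(m))              # same clustering as above (complete linkage by default)
groups <- cutree(hc, k = 3)        # assign each species to one of 3 clusters

groups         # named vector: species -> cluster id
table(groups)  # cluster sizes
```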

E. Pie Charts

Pie Charts visualize categorical data as slices of a pie, where each slice represents the proportion of each category.

Example: Displaying proportions of different substrates in a coral reef area.

# Simulating data
data_pie <- data.frame(substrate=c("Sponge", "Algae", "Coral"), count=c(10, 60, 30))

# Pie chart
p_pie <- ggplot(data_pie, aes(x="", y=count, fill=substrate)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y") +
  labs(x=NULL, y=NULL, title="Proportion of Benthic Cover") +
  theme_minimal()
p_pie
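Pie slices are hard to compare by eye, so it often helps to print the percentages on them. A minimal sketch, assuming percentage labels centered in each slice via `position_stack(vjust = 0.5)` and `theme_void()` to drop the polar axis text (both are stylistic choices not made in the example above):

```r
library(ggplot2)

data_pie <- data.frame(substrate = c("Sponge", "Algae", "Coral"),
                       count = c(10, 60, 30))
data_pie$pct <- 100 * data_pie$count / sum(data_pie$count)  # share of each category

p_pie_labeled <- ggplot(data_pie, aes(x = "", y = count, fill = substrate)) +
  geom_col(width = 1) +
  geom_text(aes(label = paste0(round(pct), "%")),
            position = position_stack(vjust = 0.5)) +  # center label in each slice
  coord_polar(theta = "y") +
  labs(title = "Proportion of Benthic Cover") +
  theme_void()
p_pie_labeled
```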

F. Temperature/Depth Profiles

Temperature/Depth Profiles demonstrate how a variable (e.g., temperature) changes across different depths, providing a clear visual depiction of trends or patterns.

Example: Observing ocean temperature variations at different depths.

# Load necessary library
library(ggplot2)
library(dplyr)

# Existing data
data <- data.frame(
  date = rep("9/28/2019", 10),
  station = rep("S4.5", 10),
  depth = c(0, 10, 20, 30, 40, 50, 75, 100, 125, 150),
  abs = c(0.049, 0.049, 0.054, 0.054, 0.056, 0.062, 0.065, 0.085, 0.094, 0.089),
  conc = c(36.69231, 36.69231, 40.53846, 40.53846, 42.07692, 46.69231, 49.00000, 64.38462, 71.30769, 67.46154)
)

# Adding a temperature column with a thermocline
data <- data %>%
  mutate(
    temperature = case_when(
      depth <= 20  ~ 25 - depth * 0.2,  # Warm, near-surface layer (cools 0.2 °C per m)
      depth <= 100 ~ 21 - (depth - 20) * 0.07,  # Thermocline (cools 0.07 °C per m in this toy example)
      TRUE         ~ 14.5 - (depth - 100) * 0.01  # Cold, deep layer (cools 0.01 °C per m)
    )
  )

# Displaying the first few rows of the updated data
head(data)

       date station depth   abs     conc temperature
1 9/28/2019    S4.5     0 0.049 36.69231        25.0
2 9/28/2019    S4.5    10 0.049 36.69231        23.0
3 9/28/2019    S4.5    20 0.054 40.53846        21.0
4 9/28/2019    S4.5    30 0.054 40.53846        20.3
5 9/28/2019    S4.5    40 0.056 42.07692        19.6
6 9/28/2019    S4.5    50 0.062 46.69231        18.9

# Plotting the data

# Creating the depth profile: depth runs down the vertical axis
p <- ggplot(data, aes(x=depth, y=temperature)) +
  geom_line(color="red", linewidth=1, linetype=2) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point() +
  scale_x_reverse() +
  # scale_y_continuous(position="right") +
  coord_flip() +
  labs(x="Depth (m)", y="Temperature (degrees C)") +
  theme_minimal()

# Displaying the plot
p
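An alternative way to draw the same kind of profile, sketched under the assumption that the rows are sorted by depth: put temperature on x and depth on y directly, reverse the y axis, and use `geom_path()`, which connects points in data order instead of sorting them by x. The data frame name `profile` is illustrative, and the temperatures are recomputed with the same piecewise formula as above.

```r
library(ggplot2)

profile <- data.frame(depth = c(0, 10, 20, 30, 40, 50, 75, 100, 125, 150))
profile$temperature <- ifelse(
  profile$depth <= 20, 25 - profile$depth * 0.2,
  ifelse(profile$depth <= 100,
         21 - (profile$depth - 20) * 0.07,
         14.5 - (profile$depth - 100) * 0.01)
)
profile <- profile[order(profile$depth), ]  # geom_path() follows row order

p_profile <- ggplot(profile, aes(x = temperature, y = depth)) +
  geom_path(color = "red", linetype = 2) +
  geom_point() +
  scale_y_reverse() +  # depth increases downward
  labs(x = "Temperature (degrees C)", y = "Depth (m)") +
  theme_minimal()
p_profile
```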

G. Violin Plots

Violin Plots combine boxplots and kernel density estimation, allowing you to visualize both the distribution and descriptive statistics of the data.

Example: Comparing light absorption levels across different coral species.

# Simulating data
set.seed(123)
data_violin <- data.frame(
  species=rep(c("X", "Y", "Z"), each=30),
  absorption=c(rnorm(30, 0.5, 0.1), rnorm(30, 0.6, 0.1), rnorm(30, 0.7, 0.1))
)

# Violin Plot
p_violin <- ggplot(data_violin, aes(x=species, y=absorption, fill=species)) +
  geom_violin(show.legend=FALSE) +
  geom_boxplot(width=0.1, show.legend=FALSE) +
  labs(x="Species", y="Light Absorption") +
  theme_minimal()
p_violin

H. Faceting

Faceting splits a plot into multiple panels that share the same variables and axes, which helps when analyzing interactions and patterns across groups.

Example: Observing fish abundance across different months and regions.

# Simulating data
set.seed(123)
data_facet <- expand.grid(month=1:12, region=c("A", "B"), species=c("tuna", "fairy basslet"))
data_facet$abundance <- rpois(48, lambda=20)

# Faceting
p_facet <- ggplot(data_facet, aes(x=month, y=abundance, color=species)) +
  geom_line() +
  facet_wrap(~region) +
  labs(x="Month", y="Fish Abundance") +
  theme_bw()
p_facet

I. 3D Plots

3D Plots enable visualization of three variables simultaneously, though they are often advised against due to potential misinterpretation.

Example: Visualizing x, y, and z coordinates that trace out a shape.

# 3D Plot using scatterplot3d
#install.packages("scatterplot3d")
library(scatterplot3d)

temp <- seq(-pi, 0, length = 50)
x <- c(rep(1, 50) %*% t(cos(temp)))
y <- c(cos(temp) %*% t(sin(temp)))
z <- c(sin(temp) %*% t(sin(temp)))

scatterplot3d(
  x,
  y,
  z,
  highlight.3d = TRUE,
  col.axis = "blue",
  col.grid = "lightblue",
  main = "scatterplot3d",
  pch = 20
)

Another cool example:

## example 6; by Martin Maechler
cubedraw <-
  function(res3d,
           min = 0,
           max = 255,
           cex = 2,
           text. = FALSE)
  {
    ## Purpose: Draw nice cube with corners
    cube01 <-
      rbind(c(0, 0, 1),
            0,
            c(1, 0, 0),
            c(1, 1, 0),
            1,
            c(0, 1, 1),
            # < 6 outer
            c(1, 0, 1),
            c(0, 1, 0)) # <- "inner": fore- & back-ground
    cub <- min + (max - min) * cube01
    ## visible corners + lines:
    res3d$points3d(cub[c(1:6, 1, 7, 3, 7, 5) , ],
                   cex = cex,
                   type = 'b',
                   lty = 1)
    ## hidden corner + lines
    res3d$points3d(cub[c(2, 8, 4, 8, 6),],
                   cex = cex,
                   type = 'b',
                   lty = 3)
    if (text.)
      ## debug
      text(
        res3d$xyz.convert(cub),
        labels = 1:nrow(cub),
        col = 'tomato',
        cex = 2
      )
  }
## 6 a) The named colors in R, i.e. colors()
cc <- colors()
crgb <- t(col2rgb(cc))
par(xpd = TRUE)
rr <- scatterplot3d(
  crgb,
  color = cc,
  box = FALSE,
  angle = 24,
  xlim = c(-50, 300),
  ylim = c(-50, 300),
  zlim = c(-50, 300)
)
cubedraw(rr)

## 6 b) The rainbow colors from rainbow(201)
rbc <- rainbow(201)
Rrb <- t(col2rgb(rbc))
rR <- scatterplot3d(
  Rrb,
  color = rbc,
  box = FALSE,
  angle = 24,
  xlim = c(-50, 300),
  ylim = c(-50, 300),
  zlim = c(-50, 300)
)
cubedraw(rR)
rR$points3d(Rrb, col = rbc, pch = 16)

J. Network Diagrams

Network Diagrams illustrate relationships between entities, useful for showing connections within a dataset.

Example: Visualizing a marine food web.

# Simulating data
set.seed(123)
data_network <- data.frame(from=c("algae", "algae", "small_fish", "small_fish", "large_fish"),
                           to=c("small_fish", "crab", "large_fish", "shrimp", "shark"))

# Network Diagram using igraph
library(igraph)
p_network <- graph_from_data_frame(data_network, directed=TRUE)
plot(p_network, edge.arrow.size=0.5, vertex.size=15, vertex.label.cex=0.8,
     main="Marine Food Web", vertex.color="skyblue")
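The same edge list can be summarized without drawing it. As a small base R sketch (the variable names are illustrative), species that are never eaten sit at the bottom of the web, and species that never eat anything else in the list sit at the top:

```r
data_network <- data.frame(
  from = c("algae", "algae", "small_fish", "small_fish", "large_fish"),
  to   = c("small_fish", "crab", "large_fish", "shrimp", "shark")
)

# Basal species: appear as a food source but never as a consumer
basal_species <- setdiff(data_network$from, data_network$to)

# Top species: appear as a consumer but never as a food source
top_species <- setdiff(data_network$to, data_network$from)

# Out-degree: how many consumers each food source has
out_degree <- table(data_network$from)

basal_species
top_species
out_degree
```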

K. Maps

Maps allow the visualization of spatial patterns in data across geographical locations.

Example: Displaying shark sightings along a coast.

# Simulating data
set.seed(123)
data_map <- data.frame(lon=rnorm(10, mean=-80, sd=0.6), lat=rnorm(10, mean=25, sd=0.8))

# Maps using ggplot (borders() requires the maps package to be installed)
p_map <- ggplot(data_map, aes(x=lon, y=lat)) +
  borders("world", colour="gray80") +
  geom_point(color="blue", size=3, alpha=0.6) +
  labs(x="Longitude", y="Latitude", title="Shark Sightings") +
  coord_cartesian(xlim=c(-85, -75), ylim=c(22, 28)) +
  theme_minimal()
p_map

L. Ridgeline Plots

Ridgeline Plots enable the visualization of the distribution of a numerical value across several groups.

Example: Visualizing sea surface temperatures across different years.

# Simulating data
set.seed(123)
data_ridge <- data.frame(year=factor(rep(2000:2004, each=100)), temp=rnorm(500, mean=25, sd=2))

# Ridgeline Plot using ggridges
library(ggridges)
p_ridge <- ggplot(data_ridge, aes(x=temp, y=year, fill=year)) +
  geom_density_ridges() +
  labs(x="Sea Surface Temperature (°C)", y="Year") +
  theme_minimal() +
  scale_fill_viridis_d()
p_ridge

M. Correlograms (Correlation Matrices)

Correlograms offer a visual representation of the correlation between different variables in a dataset.

Example: Exploring correlations between different oceanographic variables.

# Simulating data
set.seed(123)
data_corr <- data.frame(temp=rnorm(100, mean=25, sd=2), salinity=rnorm(100, mean=35, sd=0.5),
                        chlorophyll=rnorm(100, mean=2, sd=0.5), oxygen=rnorm(100, mean=5, sd=1))

# Correlogram using corrplot
library(corrplot)
corr_matrix <- cor(data_corr)
p_corr <- corrplot(corr_matrix, method="circle", type="upper", tl.col="black")
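`corrplot()` draws with base graphics, so its output is not a ggplot object and cannot be combined with the other plots in this section. One possible ggplot2-only alternative, sketched here with illustrative column names `var1`, `var2`, and `r`, is to reshape the correlation matrix to long format and draw it with `geom_tile()`:

```r
library(ggplot2)

set.seed(123)
data_corr <- data.frame(temp = rnorm(100, mean = 25, sd = 2),
                        salinity = rnorm(100, mean = 35, sd = 0.5),
                        chlorophyll = rnorm(100, mean = 2, sd = 0.5),
                        oxygen = rnorm(100, mean = 5, sd = 1))

# Long format: one row per pair of variables
corr_long <- as.data.frame(as.table(cor(data_corr)))
names(corr_long) <- c("var1", "var2", "r")

p_corr_tile <- ggplot(corr_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +  # diverging scale centered at 0
  labs(x = NULL, y = NULL, fill = "Correlation") +
  theme_minimal()
p_corr_tile
```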

N. Bubble Plots

Bubble Plots represent three variables: two through the x and y positions of the points and a third through point size.

Example: Representing a biodiversity index at different locations.

# Simulating dataset
set.seed(123)
data_bubble <- data.frame(lon=rnorm(20, mean=-80, sd=1), lat=rnorm(20, mean=25, sd=1), 
                          biodiversity=rnorm(20, mean=100, sd=15))

# Bubble Plot using ggplot
p_bubble <- ggplot(data_bubble, aes(x=lon, y=lat, size=biodiversity)) +
  geom_point(alpha=0.6, color="blue") +
  labs(x="Longitude", y="Latitude", size="Biodiversity Index") +
  coord_fixed(ratio=1.3) +
  theme_minimal()
p_bubble

O. Stacked Area Plots

Stacked Area Plots illustrate the values of different groups on top of each other, showcasing the evolution of the total value as well as the part each group represents.

Example: Observing the catch of different fish species over time.

# Simulating data
set.seed(123)
data_area <- data.frame(year=rep(2000:2010, each=3), 
                        species=rep(c("Sardine", "Tuna", "Salmon"), times=11),
                        catch=rnorm(33, mean=1000, sd=200))

# Stacked Area Plot using ggplot
p_area <- ggplot(data_area, aes(x=year, y=catch, fill=species)) +
  geom_area(position="stack") +
  labs(x="Year", y="Catch (tons)", fill="Species") +
  theme_minimal()
p_area
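When the composition matters more than the total, the same data can be stacked as proportions with `position = "fill"`. A minimal sketch; the `share` column is an illustrative addition computed only to show what the filled plot represents.

```r
library(ggplot2)

set.seed(123)
data_area <- data.frame(year = rep(2000:2010, each = 3),
                        species = rep(c("Sardine", "Tuna", "Salmon"), times = 11),
                        catch = rnorm(33, mean = 1000, sd = 200))

# Share of each species within its year (what position = "fill" displays)
data_area$share <- data_area$catch / ave(data_area$catch, data_area$year, FUN = sum)

p_area_fill <- ggplot(data_area, aes(x = year, y = catch, fill = species)) +
  geom_area(position = "fill") +
  scale_y_continuous(labels = function(x) paste0(100 * x, "%")) +
  labs(x = "Year", y = "Share of Catch", fill = "Species") +
  theme_minimal()
p_area_fill
```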

P. Patchwork: Combining ggplot2 Plots

The patchwork package allows for easy combination and layout manipulation of ggplot2 plots, offering an elegant and readable syntax. It’s like “patchworking” your plots together!

Installation:

You can install patchwork from CRAN:

#install.packages("patchwork")
library(patchwork)

The main idea behind patchwork is that you can add plots together using +, /, and other operators to specify different layouts.

Examples:

Combine Two Plots Side-by-Side:

p_area + p_bubble

Arrange Vertically:

p_map / p_area

Nested Layouts: If you want a more complex layout, you can nest combined plots:

(p_pie + p_heatmap) / p_facet

You can label your plots using plot_annotation:

(p_bubble + p_map) + plot_annotation(tag_levels = 'A')

Adding an Empty Area:

p_map + plot_spacer() + p_bubble + plot_spacer() + p + plot_spacer()

… and much more! For further reading, see https://patchwork.data-imaginist.com/index.html