Creating Boxplots with Points Highlighted for Each Diagnostic Group Using R and ggplot2/ggforce

Highlighting Points in Boxplots by Diagnostic Group with ggplot2 and ggforce

In this post, we discuss how to create boxplots in R with the individual points highlighted for each diagnostic group. We’ll explore two approaches: one that uses only ggplot2 and another that adds geom_sina() from the ggforce package.

Introduction

Boxplots are a useful visualization tool for understanding the distribution of data across different groups or categories. When working with boxplots, it’s often necessary to highlight specific points or outliers within each group. In this post, we’ll delve into creating boxplots with highlighted points by diagnostic group using R and ggplot2/ggforce.

Problem Description

The original question from Stack Overflow presents a scenario where we have a dataset with four possible diagnostics (among them AN and CRC) and want to highlight the points in the boxplot that belong to each diagnostic group. The author has successfully plotted the AN points first and then added the CRC points, but once the plot is grouped by diagnostic ID, the points collapse onto the middle boxplot of the AN group instead of sitting over their own boxes.
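One common cause of this symptom is a point layer that is not dodged the same way as the boxes. The sketch below illustrates that on hypothetical toy data (the toy data frame, the seed, and the 0.75 dodge width are assumptions made for illustration, not taken from the original question). It compares the x positions ggplot2 computes for the points with and without position_dodge().

```r
library(ggplot2)

set.seed(42)  # assumed seed, only so the toy data is reproducible

# Hypothetical toy data: 3 companies measured in each of 4 diagnostic groups
toy <- data.frame(
  diagnostic = factor(rep(1:4, each = 30)),
  Companies  = rep(c("company_a", "company_b", "company_c"), times = 40),
  FIT        = sample(1:200, 120, replace = TRUE)
)

base <- ggplot(toy, aes(x = diagnostic, y = FIT, fill = Companies)) +
  geom_boxplot(alpha = 0.1)

# No dodge on the points: all companies share the centre of the group
collapsed <- base + geom_point(aes(color = Companies))

# Dodge the points by the same total width as the boxes
dodged <- base +
  geom_point(aes(color = Companies), position = position_dodge(width = 0.75))

# Inspect the x positions computed for the point layer (layer 2)
x_collapsed <- sort(unique(as.numeric(layer_data(collapsed, 2)$x)))
x_dodged    <- sort(unique(as.numeric(layer_data(dodged, 2)$x)))
```

In the undodged plot, the points have one x position per diagnostic group, so they pile up in the middle of each group. With the dodge, each company gets its own column of points over its own box.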

Solution Approach 1: Using Base ggplot2 Functions

The first approach we’ll explore uses only functions from ggplot2 itself, namely geom_boxplot and geom_jitter. It relies on ggplot2’s built-in position adjustments to line the points up with the boxes, without requiring the ggforce package.

Code Implementation

library(tidyverse)

# Data generation (fix the seed so the random data is reproducible)
set.seed(123)
company_a <- sample(1:200, 100, replace = TRUE)
company_b <- sample(1:200, 100, replace = TRUE)
company_c <- sample(1:200, 100, replace = TRUE)
diagnostic <- sample(1:4, 100, replace = TRUE)

df <- data.frame(company_a, company_b, company_c, diagnostic)
# Reshape to long format: one row per company/measurement pair
df_r <- gather(df, "Companies", "FIT", 1:3)

# Define the diagnostic variable as a factor
df_r$diagnostic <- as.factor(df_r$diagnostic)

# Creating boxplot with points highlighted by diagnostic group
df_r %>%
  ggplot(aes(x = diagnostic, y = FIT, fill = Companies)) +
  # Hide the boxplot's own outlier markers; the point layer already draws them
  geom_boxplot(alpha = 0.1, position = "dodge2", outlier.shape = NA) +
  # Jitter within each box and dodge by company so points sit over their own box
  geom_jitter(
    aes(color = Companies),
    position = position_jitterdodge(jitter.width = 0.2, dodge.width = 0.75),
    size = 1
  ) +
  scale_y_continuous(breaks = seq(0, 200, 25)) +
  theme_classic() +
  theme(legend.position = "top")

Explanation

This code snippet first generates a sample dataset, reshapes it to long format, and converts the diagnostic variable to a factor with as.factor(). It then builds the plot from two main components:

  1. geom_boxplot() : This function draws the boxplots themselves.
  2. geom_jitter() : This function draws every observation as a point (not only the outliers). Its position argument determines how the points are offset horizontally.

Mapping the x-axis to “diagnostic” and using fill = Companies produces one box per company within each diagnostic group. The position argument keeps each company’s points over its own box rather than at the centre of the group.
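The reshaping step above uses gather(), which the tidyr documentation marks as superseded in favour of pivot_longer(). As a possible alternative, the sketch below performs the same reshape with pivot_longer(); the seed and the use of starts_with("company_") to select the columns are assumptions added for this example.

```r
library(tidyr)

set.seed(1)  # assumed seed, only for reproducibility

# Same wide layout as in the post: one column per company plus a diagnostic code
df <- data.frame(
  company_a  = sample(1:200, 100, replace = TRUE),
  company_b  = sample(1:200, 100, replace = TRUE),
  company_c  = sample(1:200, 100, replace = TRUE),
  diagnostic = sample(1:4, 100, replace = TRUE)
)

# Long format: the company name goes to "Companies", the measurement to "FIT"
df_r <- pivot_longer(
  df,
  cols      = starts_with("company_"),
  names_to  = "Companies",
  values_to = "FIT"
)

# Treat the diagnostic code as categorical so it can serve as a discrete x-axis
df_r$diagnostic <- factor(df_r$diagnostic)
```

The resulting df_r has the same three columns (diagnostic, Companies, FIT) that the plotting code expects, although the row order differs from what gather() returns.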

Solution Approach 2: Using ggplot2 with geom_sina()

The second approach combines ggplot2 with ggforce. It relies on geom_sina() from the ggforce package, which draws a sina plot: the points are jittered horizontally in proportion to the density of the data at each y value, so each point cloud takes on the outline of a violin plot.

Code Implementation

library(ggplot2)
library(ggforce)
library(tidyr)  # provides gather() and re-exports the %>% pipe

# Data generation (fix the seed so the random data is reproducible)
set.seed(123)
company_a <- sample(1:200, 100, replace = TRUE)
company_b <- sample(1:200, 100, replace = TRUE)
company_c <- sample(1:200, 100, replace = TRUE)
diagnostic <- sample(1:4, 100, replace = TRUE)

df <- data.frame(company_a, company_b, company_c, diagnostic)
# Reshape to long format: one row per company/measurement pair
df_r <- gather(df, "Companies", "FIT", 1:3)

# Define the diagnostic variable as a factor
df_r$diagnostic <- as.factor(df_r$diagnostic)

# Creating boxplot with points highlighted by diagnostic group
df_r %>%
  ggplot(aes(x = diagnostic, y = FIT, fill = Companies)) +
  # Hide the boxplot's own outlier markers; the sina layer already draws them
  geom_boxplot(alpha = 0.1, position = "dodge2", outlier.shape = NA) +
  # Density-scaled jitter, dodged by company so points sit over their own box
  ggforce::geom_sina(aes(color = Companies), size = 1, position = position_dodge(width = 0.75)) +
  scale_y_continuous(breaks = seq(0, 200, 25)) +
  theme_classic() +
  theme(legend.position = "top")

Explanation

This approach keeps the same boxplot layer as the first one but draws the observations with geom_sina() from the ggforce package. The result is a boxplot overlaid with points whose horizontal spread follows the density of the data, color-coded according to Companies. The use of position_dodge() keeps each company’s points aligned with its own box within every diagnostic group.
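To see what geom_sina() does in isolation, here is a minimal sketch on hypothetical data with a single grouping variable. The toy data frame, the seed, and the choice of two normal distributions are assumptions made for this example. It extracts the x positions that the sina layer assigns to the points.

```r
library(ggplot2)
library(ggforce)

set.seed(7)  # assumed seed, only for reproducibility

# Hypothetical data: group 1 is tightly clustered, group 2 is widely spread
toy <- data.frame(
  diagnostic = factor(rep(1:2, each = 150)),
  FIT        = c(rnorm(150, mean = 100, sd = 10),
                 rnorm(150, mean = 100, sd = 40))
)

p <- ggplot(toy, aes(x = diagnostic, y = FIT)) +
  geom_boxplot(alpha = 0.1, outlier.shape = NA) +
  geom_sina(size = 1)

# x positions computed for the sina layer (layer 2): the points are moved
# sideways from the group centre rather than staying on a single vertical line
sina_x <- as.numeric(layer_data(p, 2)$x)
```

Printing p should show each point cloud widest where the observations are densest and narrow in the tails, which is the information a plain jitter does not convey.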

Conclusion

In this post, we explored two approaches for creating boxplots with highlighted points by diagnostic group using R and ggplot2/ggforce. Using ggplot2 alone, geom_boxplot combined with geom_jitter gives a clean and informative visualization with no extra dependency. Alternatively, geom_sina() from ggforce additionally shows how densely the observations are packed at each value.

Which method fits best depends on whether the shape of each distribution matters for your plot and on how much you want to customize the visual elements of your boxplots.


Last modified on 2024-01-31