Pivoting Data for Bar and Column Plots with Multiple Columns in R

In this article, we will explore how to pivot data from a wide format to a long format, calculate proportions on the pivoted data, and then create bar and column plots using ggplot2. We’ll focus on stacked column plots in which each segment is labelled with its percentage of the column total.

Introduction

Data visualization is an essential part of data analysis. When working with datasets that have multiple columns, it’s often useful to transform the data into a long format for easier manipulation and plotting. This article will guide you through pivoting your data, calculating proportions, and creating bar and column plots using ggplot2.

Prerequisites

Familiarity with the R programming language
Working knowledge of the ggplot2, dplyr, and tidyr packages
Basic understanding of data visualization concepts

Section 1: Importing Necessary Libraries and Loading Sample Data

To get started, we load the necessary libraries and create the sample dataset.

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)  # provides pivot_longer()

# Create sample data
df <- data.frame(genre = c("Thriller", "Horror", "Action"), 
                 europe = c(195, 210, 300), 
                 asia = c(130, 90, 150), 
                 america = c(325, 300, 150))

Section 2: Understanding the Data Structure

Before we proceed with pivoting and plotting, it’s essential to understand the current structure of our data.

# View the original wide format dataset
print(df)

Output:

genre	europe	asia	america
Thriller	195	130	325
Horror	210	90	300
Action	300	150	150

Section 3: Pivoting Data for Long Format

To create a long format dataset where each row represents a unique combination of genre and continent, we’ll pivot the data using the pivot_longer function from the tidyr package.

# Pivot the data into long format
df_pivot <- df %>% 
  pivot_longer(cols = c(europe, asia, america), names_to = "continent", values_to = "value")

print(df_pivot)

Output:

genre	continent	value
Thriller	europe	195
Thriller	asia	130
Thriller	america	325
Horror	europe	210
Horror	asia	90
Horror	america	300
Action	europe	300
Action	asia	150
Action	america	150
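
As a quick sanity check on the reshaping step, the sketch below rebuilds the sample data and confirms that three genres across three continent columns produce nine rows. It selects the columns with cols = -genre (everything except genre), which is an alternative way of naming the same three columns; the name df_long is only illustrative.

```r
library(tidyr)

# Same sample data as in Section 1
df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))

# Pivot every column except genre into continent/value pairs
df_long <- pivot_longer(df, cols = -genre,
                        names_to = "continent", values_to = "value")

nrow(df_long)    # 3 genres x 3 continents = 9 rows
names(df_long)   # "genre" "continent" "value"

# Pivoting only rearranges the values, so the grand total is unchanged
sum(df_long$value) == sum(df$europe, df$asia, df$america)
```

Excluding the identifier column with a minus sign can be convenient when more continent columns are added later, since the call does not need to list them one by one.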


Section 4: Calculating Proportions

To calculate each genre's share within a continent, we group the rows by continent and divide each value by that continent's total. The result is assigned back to df_pivot so that the new p column is available when plotting.

# Calculate each genre's share within its continent
df_pivot <- df_pivot %>% 
  group_by(continent) %>% 
  mutate(p = value / sum(value)) %>% 
  ungroup()
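
To make the arithmetic concrete: Europe's values are 195, 210, and 300, which total 705, so Thriller's share of Europe is 195 / 705, or about 27.7%. The sketch below recomputes the proportions from scratch and checks that the shares within each continent add up to 1. The name df_prop is illustrative and not part of the original code.

```r
library(dplyr)
library(tidyr)

df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))

# Long format plus each genre's share within its continent
df_prop <- df %>%
  pivot_longer(cols = -genre, names_to = "continent", values_to = "value") %>%
  group_by(continent) %>%
  mutate(p = value / sum(value)) %>%
  ungroup()

# Per-continent totals; share_sum should be 1 for every continent
check <- df_prop %>%
  group_by(continent) %>%
  summarise(total = sum(value), share_sum = sum(p))
print(check)

# Thriller's share of Europe: 195 / 705
thriller_europe <- df_prop$p[df_prop$genre == "Thriller" &
                             df_prop$continent == "europe"]
round(thriller_europe, 4)  # 0.2766
```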

Section 5: Creating Stacked Bar Plots with ggplot2

Now that we have the pivoted data with calculated proportions, let’s create stacked bar plots using ggplot2.

# Create a stacked bar plot for europe and asia
ggplot(df_pivot %>% filter(continent %in% c("europe", "asia")), aes(x = continent, y = value, fill = genre)) + 
  geom_col() + 
  geom_text(aes(label = paste0(round(p * 100, 1), "% (", value, ")")), position = position_stack(vjust = .5))

# Create a stacked bar plot for america
ggplot(df_pivot %>% filter(continent == "america"), aes(x = continent, y = value, fill = genre)) + 
  geom_col() + 
  geom_text(aes(label = paste0(round(p * 100, 1), "% (", value, ")")), position = position_stack(vjust = .5))

Output:

Two separate plots, one covering europe and asia and one covering america, with each bar segment labelled with its percentage share and raw value.
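
If the goal is for every column to span the full 0 to 100% range, one option is to let ggplot2 normalise the stacks with position = "fill". The following is a minimal sketch under that assumption, not the approach used in the plots above; the names df_prop and p_fill are illustrative.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))

df_prop <- df %>%
  pivot_longer(cols = -genre, names_to = "continent", values_to = "value") %>%
  group_by(continent) %>%
  mutate(p = value / sum(value)) %>%
  ungroup()

# position = "fill" rescales each column so its segments sum to 1;
# position_fill(vjust = 0.5) centres each label inside its segment
p_fill <- ggplot(df_prop, aes(x = continent, y = value, fill = genre)) +
  geom_col(position = "fill") +
  geom_text(aes(label = sprintf("%.1f%%", p * 100)),
            position = position_fill(vjust = 0.5)) +
  scale_y_continuous(labels = scales::label_percent())

p_fill
```

With this variant all three continents fit in a single plot, because the differing totals (705, 370, and 775) no longer affect column height.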

Section 6: Customizing Plot Appearance

To further customize the plot appearance, we can adjust the theme, colors, and font sizes using ggplot2’s theme and scale functions together with the arguments of each geom.

# Customize plot appearance
ggplot(df_pivot %>% filter(continent == "europe"), aes(x = continent, y = value, fill = genre)) + 
  geom_col(color = "black") +  # black outline around each segment
  geom_text(aes(label = paste0(round(p * 100, 1), "% (", value, ")")), position = position_stack(vjust = .5), size = 2) +
  theme_minimal() +
  labs(x = NULL, y = NULL) +
  theme(legend.position = "bottom")

Output:

A customized plot with black segment outlines, small label text, a minimal theme, and the legend at the bottom.
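
Colors and font sizes can be set explicitly as well. The sketch below is one possible way to do it, using scale_fill_manual() for the fill colors and theme() for text sizes. The hex colors, the sizes, and the names df_long and p_custom are arbitrary choices made for illustration.

```r
library(ggplot2)
library(tidyr)

df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))

df_long <- pivot_longer(df, cols = -genre,
                        names_to = "continent", values_to = "value")

# Hypothetical palette: one named color per genre
genre_colors <- c(Thriller = "#1b9e77", Horror = "#d95f02", Action = "#7570b3")

p_custom <- ggplot(df_long, aes(x = continent, y = value, fill = genre)) +
  geom_col(color = "black") +
  scale_fill_manual(values = genre_colors) +
  theme_minimal(base_size = 12) +          # base font size for the whole plot
  theme(legend.position = "bottom",
        axis.text = element_text(size = 11))

p_custom
```

Naming the elements of the color vector ties each color to a genre regardless of the order in which the genres appear in the data.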

Conclusion

By following these steps, we’ve transformed our wide format dataset into a long format, calculated each genre's share within each continent, and created stacked bar plots using ggplot2. These plots provide an easy-to-understand visualization of how values are distributed across genres and continents in our dataset.

Last modified on 2024-01-08