Pivoting Data for Bar and Column Plots with Multiple Columns in R
In this article, we will explore how to pivot data from a wide format to a long format, calculate proportions on the pivoted data, and then create stacked column plots using ggplot2. We’ll focus on stacked bars in which each segment is labeled with its percentage of the column total.
Introduction
Data visualization is an essential part of data analysis. When working with datasets that have multiple columns, it’s often useful to transform the data into a long format for easier manipulation and plotting. This article will guide you through pivoting your data, calculating proportions, and creating bar and column plots using ggplot2.
Prerequisites
- Familiarity with the R programming language
- Knowledge of the ggplot2 package
- Basic understanding of data visualization concepts
Section 1: Importing Necessary Libraries and Loading Sample Data
To start working with the sample data, we need to load the necessary libraries and create the dataset.
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)  # provides pivot_longer(), used in Section 3
# Load sample data
df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))
Section 2: Understanding the Data Structure
Before we proceed with pivoting and plotting, it’s essential to understand the current structure of our data.
# View the original wide format dataset
print(df)
Output:
genre | europe | asia | america |
---|---|---|---|
Thriller | 195 | 130 | 325 |
Horror | 210 | 90 | 300 |
Action | 300 | 150 | 150 |
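A quick way to confirm the wide layout before pivoting is to look at the dimensions and the per-continent totals. The following is a minimal sketch in base R; the variable name `continent_totals` is introduced here only for illustration.

```r
# Recreate the sample data so this block runs on its own
df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))

# One row per genre and one column per continent: the "wide" layout
dim(df)
sapply(df, class)

# Column totals per continent (illustrative name, not from the article);
# these are the denominators for the within-continent proportions
continent_totals <- colSums(df[, c("europe", "asia", "america")])
continent_totals
```

These totals are worth keeping in mind, because every proportion computed later in the article is a value divided by one of them.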
Section 3: Pivoting Data for Long Format
To create a long format dataset where each row represents a unique combination of genre and continent, we’ll pivot the data using the pivot_longer function from the tidyr package.
# Pivot the data into long format
# names_to = "name" matches the column name used in the output and in later sections
df_pivot <- df %>%
  pivot_longer(cols = c(europe, asia, america), names_to = "name", values_to = "value")
print(df_pivot)
Output:
genre | name | value |
---|---|---|
Thriller | europe | 195 |
Thriller | asia | 130 |
Thriller | america | 325 |
Horror | europe | 210 |
Horror | asia | 90 |
Horror | america | 300 |
Action | europe | 300 |
Action | asia | 150 |
Action | america | 150 |
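One way to sanity-check a pivot is to count the rows and then pivot back to the wide layout. The sketch below assumes the tidyr package is installed; the names `long` and `wide_again` are hypothetical and used only in this example.

```r
library(tidyr)

# Recreate the sample data so this block runs on its own
df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))

long <- pivot_longer(df, cols = c(europe, asia, america),
                     names_to = "name", values_to = "value")

# 3 genres x 3 continents should give one row per combination
nrow(long)

# Pivot back to wide and compare each continent column with the original
wide_again <- pivot_wider(long, names_from = name, values_from = value)
sapply(c("europe", "asia", "america"),
       function(col) identical(wide_again[[col]], df[[col]]))
```

If the round trip reproduces the original columns, no values were lost or duplicated by the reshape.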
Section 4: Calculating Proportions
To calculate each genre's share within a continent, we'll group by the `name` column and divide each `value` by the sum of the values in that group.
# Calculate each genre's proportion within its continent
# Assign the result back so the plots below can use the p column
df_pivot <- df_pivot %>%
  group_by(name) %>%
  mutate(p = value / sum(value)) %>%
  ungroup()
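Because the data is grouped by `name`, the proportions should add up to 1 within each continent. The following sketch checks this, assuming dplyr and tidyr are installed; `shares` and `share_totals` are illustrative names that do not appear elsewhere in the article.

```r
library(dplyr)
library(tidyr)

# Recreate the sample data so this block runs on its own
df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))

# pivot_longer() names the new columns "name" and "value" by default
shares <- df %>%
  pivot_longer(cols = c(europe, asia, america)) %>%
  group_by(name) %>%
  mutate(p = value / sum(value)) %>%
  ungroup()

# Within each continent the shares should sum to 1
share_totals <- shares %>%
  group_by(name) %>%
  summarise(total_share = sum(p))
share_totals

# Worked example: Thriller in europe is 195 / (195 + 210 + 300)
195 / 705
```

Grouping by `genre` instead of `name` would answer a different question: how each genre's total is split across continents.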
Section 5: Creating Stacked Bar Plots with ggplot2
Now that we have the pivoted data with calculated proportions, let’s create stacked bar plots using ggplot2.
# Create a stacked bar plot for europe and asia
# fill = genre gives each stacked segment its own color
ggplot(df_pivot %>% filter(name %in% c("europe", "asia")), aes(x = name, y = value, fill = genre)) +
  geom_col() +
  geom_text(aes(label = paste0(round(p * 100, 1), "% (", value, ")")), position = position_stack(vjust = .5))
# Create a stacked bar plot for america
ggplot(df_pivot %>% filter(name == "america"), aes(x = name, y = value, fill = genre)) +
  geom_col() +
  geom_text(aes(label = paste0(round(p * 100, 1), "% (", value, ")")), position = position_stack(vjust = .5))
Output:
Two separate plots with stacked bars. Each segment is labeled with its percentage of the continent total, followed by the raw value in parentheses.
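As an alternative to precomputing `p`, ggplot2 can rescale every bar to the same height with `position = "fill"`. The sketch below is one possible variant, not the article's method. It assumes the scales package (installed alongside ggplot2) is available, and it maps `fill` to `genre` so that the segments are distinguishable. `long` and `p_fill` are illustrative names.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Recreate the sample data so this block runs on its own
df <- data.frame(genre = c("Thriller", "Horror", "Action"),
                 europe = c(195, 210, 300),
                 asia = c(130, 90, 150),
                 america = c(325, 300, 150))

long <- df %>%
  pivot_longer(cols = c(europe, asia, america))

# position = "fill" normalizes each bar to height 1, so the y axis
# shows shares rather than raw values
p_fill <- ggplot(long, aes(x = name, y = value, fill = genre)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "continent", y = "share of continent total")

p_fill
```

This variant makes continents of different sizes directly comparable, at the cost of hiding the raw totals.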
Section 6: Customizing Plot Appearance
To further customize the plot appearance, we can adjust the theme, colors, and font sizes using ggplot2’s various arguments.
# Customize plot appearance
ggplot(df_pivot %>% filter(name == "europe"), aes(x = name, y = value, fill = genre)) +
  geom_col(color = "black") +  # color sets the bar outline, not the text
  geom_text(aes(label = paste0(round(p * 100, 1), "% (", value, ")")), position = position_stack(vjust = .5), size = 2) +
  theme_minimal() +
  labs(x = "", y = "") +
  theme(legend.position = "bottom")
Output:
A customized plot with black bar outlines, smaller label text, a minimal theme, and the legend at the bottom.
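Colors and font sizes can be adjusted in the same way. The following sketch uses `scale_fill_manual()` and the `base_size` argument of `theme_minimal()`. The palette in `genre_colors` is a hypothetical choice made for this example, and the long-format data is typed in directly so the block runs on its own.

```r
library(ggplot2)

# Long-format version of the sample data, written out by hand
long <- data.frame(
  genre = rep(c("Thriller", "Horror", "Action"), each = 3),
  name  = rep(c("europe", "asia", "america"), times = 3),
  value = c(195, 130, 325, 210, 90, 300, 300, 150, 150)
)

# Hypothetical palette, chosen only for illustration
genre_colors <- c(Thriller = "#1b9e77", Horror = "#d95f02", Action = "#7570b3")

p_custom <- ggplot(long, aes(x = name, y = value, fill = genre)) +
  geom_col(color = "black") +                # black bar outlines
  scale_fill_manual(values = genre_colors) + # one fixed color per genre
  theme_minimal(base_size = 12) +            # base font size for the whole plot
  theme(legend.position = "bottom",
        axis.text = element_text(size = 10))

p_custom
```

Naming the palette entries by genre keeps the colors stable even if the rows are filtered or reordered.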
Conclusion
By following these steps, we’ve transformed our wide format dataset into a long format, calculated proportions for each continent, and created stacked bar plots using ggplot2. These plots provide an easy-to-understand visualization of the distribution of values across different continents in our dataset.
Last modified on 2024-01-08