Calculating Percentages in R Using dplyr

Introduction

In this article, we’ll explore how to calculate percentages in R for each value of a specific variable. This is particularly useful when working with reshaped data frames created using the dcast function from the reshape2 package.

We’ll look at how the dplyr package, in particular its mutate and across functions, can do this. dplyr has no dedicated percentage function; a percentage is computed directly as a count divided by a total, multiplied by 100.

Understanding the Problem

Let’s consider a simple example. Suppose we have a data frame called data with two variables: “centre” and “bmi”. The “centre” variable takes values from the letters A to J, while the “bmi” variable is a body mass index category coded 1, 2 or 3. Each row represents one person.

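library(reshape2)
# Simulate 100 people: a centre (A to J) and a BMI category (1 to 3).
# No seed is set, so the sample differs between runs.
data <- data.frame(centre = LETTERS[sample(1:10, size = 100, replace = TRUE)],
                   bmi = sample(1:3, 100, replace = TRUE))
head(data)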
  centre bmi
1      F   2
2      A   1
3      E   3
4      I   1
5      E   1
6      A   1

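We now reshape this data with dcast. The resulting data frame d_edu has one row per “bmi” category and one column per “centre”; each cell holds the number of people in that centre with that BMI category.

# Count the rows for every bmi/centre combination
d_edu <- dcast(data, bmi ~ centre, value.var = "bmi", fun.aggregate = length)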
d_edu
  bmi A B C D E F G H I J
1   1 5 1 2 6 3 5 3 2 4 0
2   2 3 0 1 2 4 8 2 6 6 3
3   3 2 2 2 3 4 6 3 5 5 2
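
As a sanity check on the reshape, the counts can be compared against base R's table(), which cross-tabulates the same two variables. This is a minimal sketch; the seed is an assumption added only to make the run reproducible, so the numbers will not match the output shown above.

```r
library(reshape2)

set.seed(1)  # hypothetical seed, not part of the original example
data <- data.frame(centre = LETTERS[sample(1:10, size = 100, replace = TRUE)],
                   bmi = sample(1:3, 100, replace = TRUE))

# Reshape: one row per bmi category, one column per centre, cells are counts
d_edu <- dcast(data, bmi ~ centre, value.var = "bmi", fun.aggregate = length)

# Base R cross-tabulation of the same two variables
counts <- table(data$bmi, data$centre)

# TRUE when both approaches agree cell by cell
all(as.vector(as.matrix(d_edu[, -1])) == as.vector(counts))
```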

Now, let’s say we want to calculate, for each centre, the percentage of people in each BMI category. In other words, every count must be divided by the total of its own column, so that the percentages within a centre add up to 100.
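
To make the arithmetic concrete, here is a small sketch for centre A, with the counts typed in by hand from the d_edu output above (the hand-built data frame is only for illustration).

```r
# Counts copied from the d_edu output above
d_edu <- data.frame(
  bmi = 1:3,
  A = c(5, 3, 2), B = c(1, 0, 2), C = c(2, 1, 2), D = c(6, 2, 3), E = c(3, 4, 4),
  F = c(5, 8, 6), G = c(3, 2, 3), H = c(2, 6, 5), I = c(4, 6, 5), J = c(0, 3, 2)
)

sum(d_edu$A)                  # 10 people at centre A
d_edu$A / sum(d_edu$A) * 100  # 50 30 20
```

So at centre A, 50% of people fall in BMI category 1, 30% in category 2 and 20% in category 3.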

Using a For Loop

One way to solve this problem is to loop over the centre columns of d_edu and divide each count by that centre’s total.

# Loop over the centre columns (every column except "bmi")
for (centre in setdiff(names(d_edu), "bmi")) {
  counts <- d_edu[[centre]]
  # Share of each BMI category within this centre
  percent <- counts / sum(counts) * 100
  print(paste("For centre", centre, "the percentages are:",
              paste(round(percent, 2), collapse = ", ")))
}

This loop visits each centre column once and prints the percentage of each BMI category within that centre. It works, but it is verbose, and it only prints the results rather than storing them.
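
For comparison, base R can do the same calculation without an explicit loop. The sketch below uses prop.table() on a matrix of counts; this is a different technique from the loop, shown on the first three centres of the output above to keep it short.

```r
# First three centres from the d_edu output above, typed in by hand
d_edu <- data.frame(bmi = 1:3, A = c(5, 3, 2), B = c(1, 0, 2), C = c(2, 1, 2))

# Drop the bmi column and use it as row names instead
counts <- as.matrix(d_edu[, -1])
rownames(counts) <- d_edu$bmi

# margin = 2 divides each cell by its column total
perc <- prop.table(counts, margin = 2) * 100
round(perc, 2)

colSums(perc)  # every centre adds up to 100
```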

Using dplyr

Fortunately, the dplyr package offers a more concise solution with its mutate function.

First, we add a new column called “Aperc” that holds, for centre A, the percentage of people in each BMI category.

library(dplyr)
# Divide each count in column A by the column total
d_edu %>%
  mutate(Aperc = A / sum(A) * 100)

Because d_edu already has one column per centre, no grouping is needed: sum(A) is the total number of people at centre A, and each count in column A is divided by it.

Calculating Percentages with Multiple Columns

Now, let’s say we want to calculate percentages for all the centre columns in d_edu at once. We can use the across function from dplyr (version 1.0.0 or later) to achieve this.

library(dplyr)
# Add one percentage column per centre: Aperc, Bperc, ..., Jperc
d_perc <- d_edu %>%
  mutate(across(-bmi, ~ .x / sum(.x) * 100, .names = "{.col}perc"))

Here across applies the same calculation to every column except bmi: each count is divided by the total of its own column. The .names argument appends “perc” to each centre name, so the original counts are kept alongside the new percentage columns.
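
If the wide table is not needed for anything else, the percentages can also be computed directly from the original long data frame, skipping dcast altogether. The following is a sketch under that assumption, using dplyr's count, group_by and mutate; the seed and the name perc_long are illustrative and not part of the original example.

```r
library(dplyr)

set.seed(1)  # hypothetical seed, only so the sketch is reproducible
data <- data.frame(centre = LETTERS[sample(1:10, size = 100, replace = TRUE)],
                   bmi = sample(1:3, 100, replace = TRUE))

perc_long <- data %>%
  count(centre, bmi) %>%              # n = people per centre/bmi combination
  group_by(centre) %>%
  mutate(perc = n / sum(n) * 100) %>% # share within each centre
  ungroup()

head(perc_long)
```

One design difference to keep in mind: count only returns combinations that actually occur, so a centre with nobody in a given BMI category has no row for it, whereas the wide table shows a 0.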

Conclusion

Calculating percentages in R can be done in several ways. In this article, we explored two approaches: a for loop over the centre columns, and dplyr’s mutate combined with across.

Both give the same numbers, but the dplyr version is shorter, keeps the results in a data frame, and is easier to maintain than the loop.

The same pattern, dividing each count by the total of its group, applies whether you are working with reshaped data frames or with the original long data.


Last modified on 2025-04-28