Calculating Percentages in R Using dplyr
Introduction
In this article, we’ll explore how to calculate percentages in R for each value of a specific variable. This is particularly useful when working with reshaped data frames created using the dcast function from the reshape2 package.
We’ll delve into the details of how to use the dplyr package and its functions to achieve this goal.
Understanding the Problem
Let’s consider a simple example. Suppose we have a data frame called data with two variables: “centre” and “bmi”. The “centre” variable takes on values from the letters A to J, while the “bmi” variable holds a BMI category coded as 1, 2, or 3. We then reshape this data using the dcast function.
library(reshape2)
data <- data.frame(centre = LETTERS[sample(1:10, size = 100, replace = TRUE)],
                   bmi = sample(1:3, 100, replace = TRUE))
head(data)
  centre bmi
1      F   2
2      A   1
3      E   3
4      I   1
5      E   1
6      A   1
The reshaped data frame d_edu has one row for each “bmi” value and one column for each “centre”. Each cell holds the number of people with that BMI value at that centre.
d_edu <- dcast(data, bmi ~ centre, value.var = "bmi", fun.aggregate = length)
d_edu
  bmi A B C D E F G H I J
1   1 5 1 2 6 3 5 3 2 4 0
2   2 3 0 1 2 4 8 2 6 6 3
3   3 2 2 2 3 4 6 3 5 5 2
Now, let’s say we want to calculate the percentage of people with a specific BMI value for each centre. This is where things can get a bit tricky.
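To make the target concrete, here is a minimal sketch that works out the percentages for centre A by hand. The counts are typed in from the d_edu table above rather than sampled, and the names a_counts and a_perc are purely illustrative.

```r
# Counts for centre A, copied from the d_edu table above (bmi = 1, 2, 3)
a_counts <- c(5, 3, 2)

# Share of each BMI value among all people at centre A
a_perc <- a_counts / sum(a_counts) * 100
a_perc
# [1] 50 30 20
```

So at centre A, 50% of people have bmi 1, 30% have bmi 2, and 20% have bmi 3. The goal is to get the same numbers for every centre at once.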
Using a For Loop
One way to solve this problem is by using a for loop to iterate through each centre column of d_edu and calculate the corresponding percentages.
# Loop over the centre columns of d_edu (every column except bmi)
for (centre in setdiff(names(d_edu), "bmi")) {
  counts <- d_edu[[centre]]                # counts of bmi 1, 2, 3 at this centre
  percent <- counts / sum(counts) * 100    # share of each BMI value within the centre
  print(paste("For centre", centre, "the percentages are:",
              paste(round(percent, 2), collapse = ", ")))
}
This code iterates through each centre column and prints the percentage of each BMI value within that centre. However, this approach is verbose, and it only prints the results instead of storing them in a data frame.
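If the loop is kept, one option is to store the percentages rather than print them. The sketch below is only an illustration on a hand-typed subset of the counts from the d_edu table; d_small, centres, and perc are hypothetical names, not part of the original example.

```r
# Counts typed in from the d_edu table (first three centres only)
d_small <- data.frame(bmi = 1:3, A = c(5, 3, 2), B = c(1, 0, 2), C = c(2, 1, 2))

centres <- setdiff(names(d_small), "bmi")

# Pre-allocate a matrix: one row per BMI value, one column per centre
perc <- matrix(NA_real_, nrow = nrow(d_small), ncol = length(centres),
               dimnames = list(as.character(d_small$bmi), centres))

for (centre in centres) {
  counts <- d_small[[centre]]
  perc[, centre] <- counts / sum(counts) * 100  # column percentages for this centre
}

round(perc, 2)
```

Each column of perc adds up to 100, which is a quick way to confirm that the percentages were computed within centres and not across them.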
Using dplyr
Fortunately, R has a more elegant solution using the dplyr package and its mutate function.
First, we add a new column called “Aperc” that holds the percentage of people with each BMI value at centre A. The same pattern can then be extended to the other centres.
library(dplyr)
d_edu <- d_edu %>%
  mutate(Aperc = A / sum(A) * 100)
This code divides each count in column A by the column total, giving the percentage of each BMI value within centre A.
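If reshaping is not required, per-centre percentages can also be computed directly from long data. The sketch below is one possible approach, not the only one. It uses a small hand-made data frame (toy, a hypothetical stand-in for data) so that the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data: two centres, BMI categories 1 to 3
toy <- data.frame(
  centre = c("A", "A", "A", "A", "B", "B"),
  bmi    = c(1, 1, 2, 3, 2, 2)
)

toy_perc <- toy %>%
  count(centre, bmi) %>%                 # one row per centre/bmi pair, count in n
  group_by(centre) %>%
  mutate(perc = n / sum(n) * 100) %>%    # share of each bmi value within its centre
  ungroup()

toy_perc
```

Here centre A has four people (two with bmi 1, one each with bmi 2 and 3), so its rows come out as 50, 25, and 25 percent, while centre B has only bmi 2 and gets 100 percent.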
Calculating Percentages with Multiple Columns
Now, let’s say we want to calculate percentages for all of the centre columns in d_edu. We can use the across function from dplyr to achieve this goal.
library(dplyr)
d_edu <- d_edu %>%
  mutate(across(A:J, ~ .x / sum(.x) * 100, .names = "{.col}perc"))
This code applies the same calculation to every column from A to J: each count is divided by its column total and multiplied by 100. The .names argument stores each result in a new column named after the centre with the suffix perc (Aperc, Bperc, and so on), so the original counts are kept.
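As a quick check on a column-wise across call, the sketch below runs it on a hand-typed subset of the counts from the d_edu table. The names d_small and d_small_perc are hypothetical, and this variant overwrites the counts instead of adding new columns.

```r
library(dplyr)

# Counts typed in from the d_edu table shown earlier (first three centres only)
d_small <- data.frame(
  bmi = 1:3,
  A = c(5, 3, 2),
  B = c(1, 0, 2),
  C = c(2, 1, 2)
)

d_small_perc <- d_small %>%
  mutate(across(-bmi, ~ .x / sum(.x) * 100))  # replace counts with column percentages

# Each centre column should now add up to 100
colSums(d_small_perc[, c("A", "B", "C")])
```

Selecting with -bmi applies the formula to every column except bmi, which is convenient when the set of centres is not known in advance.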
Conclusion
In conclusion, calculating percentages in R can be achieved using various methods depending on your specific needs. In this article, we explored two approaches: using a for loop and using the dplyr package.
While both methods work, the dplyr version is generally shorter and easier to read than the for loop. By taking advantage of R’s data manipulation libraries like dplyr, you can streamline your code and make it more maintainable.
Whether you’re working with reshaped data frames or performing more complex calculations, the techniques outlined in this article will help you calculate percentages with ease. So go ahead, give them a try!
Last modified on 2025-04-28