Using `mutate()` and `case_when()` to Simplify Complex Data Analysis in Tidy R

Using mutate() and case_when() to Add a New Column Based on Multiple Conditions in Tidy R

Introduction

Data analysts often need to add a new column whose value depends on several conditions. In this article, we explore how to do this with the mutate() and case_when() functions from dplyr, which is part of the tidyverse in R.

Background

The Stack Overflow question behind this article highlights a common challenge for data analysts: creating a new column that depends on the values of several other columns in a dataset. The original solution used ifelse() statements, which become hard to read and maintain once they are nested to handle more than a couple of conditions.

In this article, we will demonstrate how to use mutate() and case_when() to add a new column based on multiple conditions, making it easier to manage and extend the logic for more complex datasets.
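To make the contrast concrete, here is a minimal sketch on a hypothetical toy tibble. The code from the original question is not reproduced here, so the nested ifelse() call is only an illustration of the style, and the names toy, with_ifelse, and with_case_when are invented for this example.

```r
library(tidyverse)

# Hypothetical toy data, not taken from the original question
toy <- tibble(
  elephant_zoo = c(1, 0, 0),
  rhino_zoo    = c(0, 2, 0)
)

# Nested ifelse(): each additional condition adds another level of nesting
with_ifelse <- toy %>%
  mutate(ZOO = ifelse(elephant_zoo > 0, "zoo",
                      ifelse(rhino_zoo > 0, "zoo", "")))

# case_when(): one condition per line, evaluated from top to bottom
with_case_when <- toy %>%
  mutate(ZOO = case_when(
    elephant_zoo > 0 ~ "zoo",
    rhino_zoo > 0    ~ "zoo",
    TRUE             ~ ""
  ))

# Both approaches produce the same column
identical(with_ifelse$ZOO, with_case_when$ZOO)
```

With only two conditions the difference is small, but every extra condition deepens the ifelse() nesting, while case_when() just gains one more line.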

The Problem

Let’s consider an example dataset with columns such as “elephant_zoo”, “rhino_zoo”, “hippo_zoo”, etc. We want to create a new column called “ZOO” that contains the label “zoo” if, for a given row, the sum of the values in the zoo columns is greater than 0; otherwise, it contains an empty string.

We also want a second column called “WILD” that follows the same logic: it contains the label “wild” if the row-wise sum of the columns with “wild” in their names is greater than 0, and an empty string otherwise.

library(tidyverse)

# Example dataset
df <- tibble(
  elephant_zoo = c(1, 1, 1, 2, 0),
  rhino_zoo = c(1, 2, 3, 1, 0),
  hippo_zoo = c(1, 1, 0, 0, 0),
  elephant_wild_A = c(0, 0, 1, 1, 3),
  rhino_wild_A = c(0, 0, 4, 3, 1),
  elephant_wild_B = c(0, 0, 0, 0, 0),
  rhino_wild_C = c(0, 0, 0, 5, 7),
  hippo_wild_B = c(0, 0, 0, 0, 0)
)

# Print the original dataset
print(df)

Solution

To create a new column “ZOO” based on the row-wise sum of the zoo columns, we can use mutate() with case_when(), computing the sum with rowSums() over the columns selected by across().

# Names of the zoo columns (defined before they are used)
zoo_columns <- function() {
  c("elephant_zoo", "rhino_zoo", "hippo_zoo")
}

# Create a new column 'ZOO': "zoo" when the row-wise sum of the zoo columns is positive
df <- df %>% 
  mutate(ZOO = case_when(
    rowSums(across(all_of(zoo_columns()))) > 0 ~ "zoo",
    TRUE ~ ""
  ))

# Print the updated dataset with 'ZOO' column
print(df)
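As a possible variation, the hard-coded list of column names can be replaced by a tidyselect helper such as ends_with(). The sketch below assumes that every zoo column name ends in "_zoo"; df_small is a hypothetical, cut-down stand-in for the example data so that the block runs on its own.

```r
library(tidyverse)

# Hypothetical stand-in that follows the same column naming scheme
df_small <- tibble(
  elephant_zoo    = c(1, 2, 0),
  rhino_zoo       = c(1, 1, 0),
  elephant_wild_A = c(0, 1, 3)
)

# ends_with("_zoo") selects every column whose name ends in "_zoo",
# so a newly added zoo column is picked up without editing a name list
df_small <- df_small %>%
  mutate(ZOO = case_when(
    rowSums(across(ends_with("_zoo"))) > 0 ~ "zoo",
    TRUE ~ ""
  ))

print(df_small)
```

The trade-off is that selection now depends on the naming convention: a zoo column that does not end in "_zoo" would be silently ignored, whereas an explicit name list fails loudly when a listed column is missing.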

We also want to create another column called “WILD” using the same logic, this time based on the row-wise sum of the wild columns.

# Names of the wild columns (defined before they are used)
wild_columns <- function() {
  c("elephant_wild_A", "rhino_wild_A", "elephant_wild_B", "rhino_wild_C", "hippo_wild_B")
}

# Create a new column 'WILD': "wild" when the row-wise sum of the wild columns is positive
df <- df %>% 
  mutate(WILD = case_when(
    rowSums(across(all_of(wild_columns()))) > 0 ~ "wild",
    TRUE ~ ""
  ))

# Print the updated dataset with 'ZOO' and 'WILD' columns
print(df)
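The benefit of case_when() is clearer once a single column has to distinguish more than two outcomes. The following sketch is an illustration rather than part of the original solution: the HABITAT column, the zoo_total and wild_total helper columns, and the df_small data are all hypothetical names introduced here.

```r
library(tidyverse)

# Hypothetical stand-in that follows the same column naming scheme
df_small <- tibble(
  elephant_zoo    = c(1, 0, 2, 0),
  rhino_zoo       = c(1, 0, 0, 0),
  elephant_wild_A = c(0, 3, 1, 0),
  rhino_wild_C    = c(0, 7, 0, 0)
)

df_small <- df_small %>%
  mutate(
    # Row-wise totals for each group of columns
    zoo_total  = rowSums(across(ends_with("_zoo"))),
    wild_total = rowSums(across(contains("_wild_"))),
    # Conditions are checked in order; the first one that is TRUE wins
    HABITAT = case_when(
      zoo_total > 0 & wild_total > 0 ~ "both",
      zoo_total > 0                  ~ "zoo",
      wild_total > 0                 ~ "wild",
      TRUE                           ~ "none"
    )
  )

print(df_small)
```

Because the branches are evaluated in order, the most specific condition ("both") has to come first; placing it after the "zoo" branch would mean it is never reached.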

Conclusion

In this article, we demonstrated how to use mutate() and case_when() from dplyr, part of the tidyverse in R, to add new columns based on multiple conditions. Keeping the column names in small helper functions such as zoo_columns() and wild_columns() separates the choice of columns from the conditional logic, which makes both easier to manage and extend for more complex datasets.

This approach simplifies the process of adding new columns with conditional logic, making data analysis more efficient and effective.


Last modified on 2025-02-19