Generating Partial Dependence Data with XGBoost in MLR: A Step-by-Step Solution to Common Issues

This article looks at partial dependence plots, a tool for understanding how predictors relate to the response variable in machine learning models. It covers the error raised by the generatePartialDependenceData function from the mlr package when it is used with an XGBoost multiclass classification model, and two ways around it.

Introduction

Partial dependence plots show how the expected prediction of a machine learning model changes as one predictor varies, with the remaining predictors averaged over the data. They help us see which predictors the model responds to and what shape each relationship takes. This article focuses on generating partial dependence data for an XGBoost model, first with the mlr package and then with mlr3.
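To make the definition concrete, here is a minimal base-R sketch of the brute-force computation. The helper name partial_dependence, the linear model, and the mtcars data are illustrative assumptions and are not part of the original example; the packages discussed in this article do the equivalent work internally in their own way.

```r
# Brute-force partial dependence (illustrative helper, not an mlr function).
# For each grid value, overwrite the feature in every row and average the predictions.
partial_dependence <- function(predict_fun, data, feature, grid) {
  sapply(grid, function(v) {
    data_v <- data
    data_v[[feature]] <- v      # force the feature to the grid value
    mean(predict_fun(data_v))   # average prediction over all rows
  })
}

# Assumed toy model: a linear regression on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 5)

pd <- partial_dependence(function(d) predict(fit, newdata = d), mtcars, "wt", grid)
print(data.frame(wt = grid, mean_prediction = pd))
```

For a linear model the resulting curve is a straight line whose slope equals the coefficient of wt, which is a quick sanity check on the helper.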

The Problem

When we tried to use the generatePartialDependenceData function from the mlr package with an XGBoost multiclass classification model, we encountered an error. The error message indicated that there was a problem with the measure variables specified in the melt.data.table function.

Error in melt.data.table(as.data.table(out), measure.vars = target, variable.name = if (td$type ==  : One or more values in 'measure.vars' is invalid.
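The message itself comes from data.table: melt() raises it when measure.vars names a column that is not in the table. The sketch below reproduces that situation with a made-up table. It is only an illustration of the message, not a trace through the mlr internals; the column names and values are assumptions.

```r
library(data.table)

# Toy stand-in for a prediction table: one "response" column, no per-class columns.
preds <- data.table(
  bill_length_mm = c(39.1, 46.5),
  response = c("Adelie", "Gentoo")
)

# Ask melt() for columns named after the class levels, which do not exist here,
# and capture the error message instead of stopping.
msg <- tryCatch(
  {
    melt(preds, measure.vars = c("Adelie", "Gentoo"))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

One plausible reading of the original error, then, is that the prediction output lacked the per-class columns that the function tried to reshape.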

Setting Predict Type

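As mentioned by KacZdr, setting the predict.type argument to "prob" works fine. The argument belongs in the makeLearner call, before the model is trained; the default, "response", returns only the predicted class label rather than one probability per class.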

# build learners
xgb_class_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob"
)
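A hedged end-to-end sketch of how this learner could then be used is shown below. The iris data, the nrounds value, and the chosen feature are assumptions made for illustration; they do not come from the original question, and the retired mlr package may need compatible versions of its dependencies to run.

```r
library(mlr)

# Assumed toy task: the built-in iris data with a three-class target.
task <- makeClassifTask(data = iris, target = "Species")

# Learner with probability predictions, as described above.
xgb_class_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  nrounds = 10 # small value chosen only to keep the example fast
)

# Train, then compute partial dependence for one feature.
mod <- train(xgb_class_learner, task)
pd <- generatePartialDependenceData(mod, task, features = "Petal.Width")

# pd$data holds the grid values and the averaged class probabilities.
print(head(pd$data))
plotPartialDependence(pd)
```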

Alternative Solution using MLR3

Since the mlr package has been retired in favor of mlr3, an alternative is to build the model with mlr3 and compute the effects with the iml package. However, there seems to be an issue with ggplot in the $plot() function for FeatureEffects objects.

## Error in geom_rug()
! problem while computing position.
i 
Caused by error in `if (params$width > 0) ...`
! Missing value, where TRUE/FALSE is required

Plotting Partial Dependence Data Manually

To avoid the issues with ggplot, we will compute the partial dependence data with the FeatureEffects class from the iml package and plot the results manually.

# libraries
library(tidyverse)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines) # po() and %>>%
library(iml)           # Predictor and FeatureEffects

peng <- palmerpenguins::penguins

# build task
tsk_peng <- peng %>% select(-sex, -year) %>% 
  as_task_classif(target = "species")

# data partition
splits <- partition(tsk_peng)

# build learner
lrn_classif <- as_learner(po("encode", method = "one-hot") %>>% lrn("classif.xgboost"))

# train model
lrn_classif$train(tsk_peng, row_ids = splits$train)

# partial dependence
predictor <- Predictor$new(
  lrn_classif, 
  data = tsk_peng$data(rows = splits$train, cols = tsk_peng$feature_names),
  y = tsk_peng$data(rows = splits$train, cols = tsk_peng$target_names)
  )

effect <- FeatureEffects$new(predictor, method = "pdp")

# plot
## continuous
effect$results %>% 
  keep(names(.) %in% effect$features[1:4]) %>% 
  bind_rows() %>% 
  ggplot(aes(x = .borders, y = .value, col = .class))+
  geom_line()+
  facet_grid(~.feature, scales = "free")

## factor
effect$results$island %>% 
  ggplot(aes(x = .borders, y = .value, fill = .class))+
  geom_bar(stat = "identity", position = "dodge")

Conclusion

In this article, we explored the issues encountered when using the generatePartialDependenceData function from the mlr package with an XGBoost multiclass classification model. We worked around them in two ways: by setting predict.type to "prob" on the mlr learner, and by computing partial dependence data with the FeatureEffects class from the iml package on an mlr3 learner and plotting the results manually.

Technical Background

The mlr package is an R package for machine learning that provides a common interface to many learning algorithms. Its generatePartialDependenceData function computes partial dependence data for a trained model and the task or data it was trained on; plotPartialDependence then draws the result.

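# generatePartialDependenceData (mlr)
# model, task and features are placeholders for your own objects
pd <- generatePartialDependenceData(
  obj = model,         # WrappedModel returned by train()
  input = task,        # task (or data.frame) the model was trained on
  features = features  # character vector of feature names
)
plotPartialDependence(pd)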

The FeatureEffects class from the iml package computes feature effects, including partial dependence, for several features at once. It works on an iml Predictor object, which wraps the trained model and the data.

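# FeatureEffects (iml)
# predictor is the iml Predictor object built earlier
effect <- FeatureEffects$new(predictor, method = "pdp")

# effect$results is a named list with one data frame per feature;
# keep the numeric features so the rows can be stacked and plotted as lines
effect$results %>% 
  keep(~ is.numeric(.x$.borders)) %>% 
  bind_rows() %>% 
  ggplot(aes(x = .borders, y = .value, col = .class)) +
  geom_line() +
  facet_grid(~.feature, scales = "free")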

Future Work

In future articles, we will explore other machine learning algorithms and techniques for generating partial dependence plots using the mlr3 package.

1.  **Random Forest**
    *   Generate partial dependence data using random forest models.
2.  **Neural Networks**
    *   Generate partial dependence data using neural networks.
3.  **Gradient Boosting Machines**
    *   Generate partial dependence data using gradient boosting machines.

Stay tuned for future articles on machine learning with R!

Last modified on 2025-02-27