Generating Partial Dependence Data with XGBoost in MLR
In this article, we look at partial dependence plots, a tool for understanding how individual predictors relate to the response variable in a machine learning model. We examine an error raised by the generatePartialDependenceData function from the mlr package when it is used with an XGBoost multiclass classification model, and show how to work around it.
Introduction
A partial dependence plot shows how a model's average prediction changes as one predictor varies while the remaining predictors are averaged out. These plots help us see which predictors matter to the model and what shape their relationship with the response takes. In this article, we focus on generating partial dependence data for an XGBoost model, first with the mlr package and then with mlr3.
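For reference, the quantity being plotted can be written down explicitly. In the standard definition, with $x_S$ the feature of interest, $x_C^{(i)}$ the observed values of the other features in row $i$, and $\hat{f}$ the fitted model, the partial dependence is estimated by averaging predictions over the $n$ rows of the data:

```latex
\hat{f}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_S, x_C^{(i)}\right)
```

For a multiclass classifier that predicts probabilities, $\hat{f}$ is evaluated once per class, which gives one curve per class.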
The Problem
When we called the generatePartialDependenceData function from the mlr package on an XGBoost multiclass classification model, it failed. The error message points to invalid measure variables passed to the melt.data.table function:
Error in melt.data.table(as.data.table(out), measure.vars = target, variable.name = if (td$type == : One or more values in 'measure.vars' is invalid.
Setting Predict Type
As KacZdr pointed out, setting the predict.type argument to "prob" resolves the error. The argument belongs to the learner, so it has to be set when the learner is created with makeLearner:
library(mlr)

# build learner that predicts class probabilities
xgb_class_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob"
)
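To see the setting in context, here is a minimal sketch of the full mlr workflow. It assumes the built-in iris data as a stand-in multiclass problem, since the original dataset is not shown, and uses nrounds = 20 purely as an illustrative value. It also assumes installed versions of mlr and xgboost that still work together.

```r
library(mlr)

# Stand-in multiclass task (assumption: iris replaces the original data)
task <- makeClassifTask(data = iris, target = "Species")

# Learner that predicts class probabilities rather than class labels
xgb_class_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  nrounds = 20  # illustrative value, not tuned
)

# Train the model, then compute partial dependence for one feature
mod <- train(xgb_class_learner, task)
pd <- generatePartialDependenceData(mod, task, features = "Petal.Width")

# Draw the result: one curve per class
plotPartialDependence(pd)
```

The computed values are stored in pd$data, which is useful if you prefer to build the plot yourself.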
Alternative Solution using MLR3
Because the mlr package has been retired in favor of mlr3, we also show an alternative based on mlr3. However, the $plot() method of FeatureEffects objects appears to fail inside ggplot:
## Error in geom_rug()
! problem while computing position.
i
Caused by error in `if (params$width > 0) ...`
! Missing value, where TRUE/FALSE is required
Plotting Partial Dependence Data Manually
To sidestep the failing $plot() method, we compute the partial dependence data with the FeatureEffects class from the iml package and plot the stored results ourselves with ggplot2.
# libraries
library(tidyverse)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)  # po() and %>>%
library(iml)            # Predictor and FeatureEffects

peng <- palmerpenguins::penguins

# build task
tsk_peng <- peng %>%
  select(-sex, -year) %>%
  as_task_classif(target = "species")

# data partition
splits <- partition(tsk_peng)

# build learner: one-hot encode factors, then fit XGBoost
lrn_classif <- as_learner(po("encode", method = "one-hot") %>>% lrn("classif.xgboost"))

# train model
lrn_classif$train(tsk_peng, row_ids = splits$train)

# partial dependence
predictor <- Predictor$new(
  lrn_classif,
  data = tsk_peng$data(rows = splits$train, cols = tsk_peng$feature_names),
  y = tsk_peng$data(rows = splits$train, cols = tsk_peng$target_names)
)
effect <- FeatureEffects$new(predictor, method = "pdp")

# plot
## continuous features (assumes the first four features are the numeric ones)
effect$results %>%
  keep(names(.) %in% effect$features[1:4]) %>%
  bind_rows() %>%
  ggplot(aes(x = .borders, y = .value, col = .class)) +
  geom_line() +
  facet_grid(~.feature, scales = "free")

## factor feature
effect$results$island %>%
  ggplot(aes(x = .borders, y = .value, fill = .class)) +
  geom_bar(stat = "identity", position = "dodge")
Conclusion
In this article, we examined the error raised by the generatePartialDependenceData function from the mlr package with an XGBoost multiclass classification model. In mlr, setting the predict.type argument to "prob" on the learner resolves it. In mlr3, we computed the partial dependence data with the FeatureEffects class from the iml package and plotted the results manually.
Technical Background
The mlr package is a widely used R package for machine learning that provides a common interface to many learning algorithms. Its generatePartialDependenceData function computes partial dependence data from a trained model and a dataset; the companion function plotPartialDependence draws the result.
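# generatePartialDependenceData (abbreviated; see ?generatePartialDependenceData for all arguments)
generatePartialDependenceData(
  obj,              # trained model returned by train()
  input,            # task or data.frame used to train the model
  features = NULL,  # names of the features to compute partial dependence for
  ...
)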
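Whatever the package, the underlying computation is the same: fix one feature at a grid value in every row, predict, and average. A minimal base R sketch of that idea follows. It assumes a linear model on the built-in mtcars data as a stand-in for the XGBoost model, and partial_dependence is a hypothetical helper written for illustration, not a function from mlr or iml.

```r
# Manual partial dependence for one numeric feature (base R only).
# partial_dependence() is a hypothetical helper, not part of any package.
partial_dependence <- function(predict_fun, data, feature, grid) {
  vapply(grid, function(value) {
    modified <- data
    modified[[feature]] <- value  # force the feature to the grid value in every row
    mean(predict_fun(modified))   # average the predictions over all rows
  }, numeric(1))
}

# Stand-in model (assumption): linear regression on mtcars
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Evaluate the partial dependence of mpg on wt over a 10-point grid
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 10)
pd_values <- partial_dependence(
  function(d) predict(fit, newdata = d),
  mtcars, "wt", grid
)
pd_curve <- data.frame(wt = grid, mpg_hat = pd_values)
```

Because the stand-in model is linear, the resulting curve is a straight line whose slope equals the fitted coefficient of wt, which makes the sketch easy to sanity-check. For a classifier, predict_fun would return the probability of a single class, and the helper would be called once per class.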
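The FeatureEffects class from the iml package computes feature effects, including partial dependence, for the features of a model. The per-feature results are stored in its results field, which can be used to build custom plots.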
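# FeatureEffects: compute the effects, then plot from the stored results
effect <- FeatureEffects$new(predictor, method = "pdp")
effect$results %>%
  keep(names(.) %in% effect$features[1:4]) %>%  # numeric features only
  bind_rows() %>%
  ggplot(aes(x = .borders, y = .value, col = .class)) +
  geom_line() +
  facet_grid(~.feature, scales = "free")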
Future Work
In future articles, we will explore other machine learning algorithms and techniques for generating partial dependence plots using the mlr3 package.
1. **Random Forest**
* Generate partial dependence data using random forest models.
2. **Neural Networks**
* Generate partial dependence data using neural networks.
3. **Gradient Boosting Machines**
* Generate partial dependence data using gradient boosting machines.
Stay tuned for future articles on machine learning with R!
Last modified on 2025-02-27