Understanding K-Prototypes Clustering in tidyclust
Introduction
The tidyclust package brings the tidymodels interface to unsupervised clustering. It provides a consistent way to specify, fit, and tune clustering models such as k-means, and it supports the K-prototypes method for mixed-type data through the clustMixType engine. In this article, we’ll look at K-prototypes clustering in tidyclust and explore how to tune the number of clusters in a K-prototypes model.
Background
K-prototypes is a partitioning algorithm that extends k-means to data containing both numeric and categorical variables. Each cluster is summarised by a prototype: the mean of the numeric variables and the mode of the categorical ones. A point’s dissimilarity to a prototype combines the squared Euclidean distance on the numeric variables with a count of mismatches on the categorical variables, weighted by a parameter usually called lambda. The tidyclust package exposes K-prototypes through the clustMixType engine of k_means().
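To make the mixed-type dissimilarity concrete, here is a minimal base R sketch. The function name kproto_dissim and its arguments are hypothetical and chosen for illustration; this is not the code clustMixType runs internally.

```r
# Illustrative k-prototypes style dissimilarity between one point and one prototype.
# kproto_dissim is a hypothetical helper, not part of tidyclust or clustMixType.
kproto_dissim <- function(x_num, x_cat, proto_num, proto_cat, lambda = 1) {
  numeric_part <- sum((x_num - proto_num)^2)   # squared Euclidean distance
  categorical_part <- sum(x_cat != proto_cat)  # number of mismatching categories
  numeric_part + lambda * categorical_part
}

d <- kproto_dissim(
  x_num = c(1.0, 2.0), x_cat = c("Adelie", "male"),
  proto_num = c(1.5, 2.0), proto_cat = c("Adelie", "female"),
  lambda = 2
)
d  # 0.25 from the numeric part plus 2 * 1 mismatch = 2.25
```

A larger lambda makes categorical mismatches count for more relative to numeric differences.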
Setting up the Environment
To get started with K-prototypes clustering in tidyclust, we’ll need to install the necessary packages and load them in our R environment.
library(tidyclust)
library(tidyverse)
library(tidymodels)
We’ll also use the penguins dataset that ships with modeldata.
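data("penguins", package = "modeldata")
penguins <- penguins %>% drop_na()  # assign the result so the NA rows are actually dropped
# Resamples used later by tune_cluster(); 5 folds is an arbitrary choice
penguins_cv <- vfold_cv(penguins, v = 5)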
Creating a K-Prototypes Workflow
Next, we’ll create two different workflows for our K-prototypes model: one with a non-tunable number of clusters and another with a tunable number of clusters.
# spec1 is for a non-tunable model
kmeans_spec1 <- k_means(engine = 'clustMixType', num_clusters = 4)
# spec2 is for a tunable model
kmeans_spec2 <- k_means(engine = 'clustMixType', num_clusters = tune())
penguins_rec <- recipe(~ ., data = penguins)
kmeans_wflow1 <- workflow(penguins_rec, kmeans_spec1)
kmeans_wflow2 <- workflow(penguins_rec, kmeans_spec2)
Creating a Grid of Clusters
To perform tuning, we need a grid of candidate cluster numbers. We’ll use grid_regular() from dials (loaded with tidymodels) together with the num_clusters() parameter from tidyclust to generate this grid.
clust_num_grid <- grid_regular(num_clusters(), levels = 10)
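A single cluster is rarely informative, so it can help to narrow the range before building the grid. The sketch below assumes num_clusters() accepts a range argument like other dials parameters; the range of 2 to 6 is chosen purely for illustration.

```r
library(tidyclust)  # num_clusters()
library(dials)      # grid_regular()

# Restrict the search to 2 through 6 clusters (illustrative range)
small_grid <- grid_regular(num_clusters(range = c(2L, 6L)), levels = 5)
small_grid
```

With five levels over this range, the grid should contain one row per candidate value.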
Fitting the Model and Tuning
Now that we have our workflows and grid, we can fit the model and perform tuning using tune_cluster().
# non tunable clustering fit
kmeans_fit <- fit(kmeans_wflow1, data = penguins)
# this works without errors
sse_within_total(kmeans_fit)
# this also works
sse_within_total(kmeans_fit, dist_fun = cluster::daisy)
# However, this doesn't work
res <- tune_cluster(
  kmeans_wflow2,
  resamples = penguins_cv,
  grid = clust_num_grid,
  control = control_grid(save_pred = TRUE, extract = identity),
  metrics = cluster_metric_set(sse_within_total)
)
Error Message Analysis
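The error indicates that the distance calculation behind sse_within_total() expects numeric input but receives character data. Clustering has no target variable; the character values come from the categorical columns (species, island, and sex), which a numeric distance function cannot handle. The two earlier calls most likely succeed because, without new data, the metric can be read from the fitted model and no distances need to be computed.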
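The character type is easy to reproduce in base R: converting a data frame with a factor column to a matrix yields a character matrix. Whether tidyclust performs exactly this conversion internally is an assumption here; the sketch only shows how mixed columns can end up as characters.

```r
# A tiny mixed-type data frame (values made up for illustration)
mixed <- data.frame(
  bill_length_mm = c(39.1, 46.5),
  species = factor(c("Adelie", "Gentoo"))
)

# A matrix holds a single type, so the numbers are converted to strings
mixed_matrix <- as.matrix(mixed)
typeof(mixed_matrix)

# Keeping only the numeric columns preserves a numeric matrix
numeric_matrix <- as.matrix(mixed[sapply(mixed, is.numeric)])
typeof(numeric_matrix)
```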
Understanding Distance Functions
In K-prototypes clustering, the distance function determines how points are assigned to clusters. Separately, the dist_fun argument of sse_within_total() lets us choose the distance function used when the metric itself is computed.
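As a sketch of what such a function looks like, the base R helper below takes a matrix of points and a matrix of centroids and returns one distance per point and centroid pair. The name euclid_to_centroids and the assumed two-argument calling convention (points first, centroids second) are illustrative, not taken from the tidyclust documentation.

```r
# Hypothetical helper: Euclidean distances from each row of x to each row of y
euclid_to_centroids <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  # Uses ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b for all row pairs at once
  d2 <- outer(rowSums(x^2), rowSums(y^2), "+") - 2 * x %*% t(y)
  sqrt(pmax(d2, 0))  # clamp tiny negative values caused by rounding
}

pts <- rbind(c(0, 0), c(3, 4))
cents <- rbind(c(0, 0), c(0, 4))
d <- euclid_to_centroids(pts, cents)
d  # rows are points, columns are centroids
```

Here the first point is at distance 0 and 4 from the two centroids, and the second point is at distance 5 and 3.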
Analyzing the Error Message
Since the error concerns the type of data handed to the distance function, the problem lies in which distance function the metric uses during tuning and in how we specify it.
Using Custom Distance Functions
One way to resolve this issue is to supply a distance function explicitly. Such functions come from other packages rather than from tidyclust itself: for example, daisy() from cluster, dist() and mahalanobis() from stats, and dista() from Rfast. We can pass one of these as our custom distance function, provided it can handle the data it receives.
Rfast::dista() Example
Let’s take a closer look at Rfast::dista(). It computes distances between the rows of two numeric matrices, which matches the shape of the task here: measuring new points against cluster centroids.
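# Use Rfast::dista() as the custom distance function.
# cluster_metric_set() accepts only metric functions, so wrap the dist_fun choice in a custom metric.
# Assumption: the data reaching dist_fun is numeric; Rfast::dista() does not handle categorical columns.
sse_within_total_dista <- new_cluster_metric(
  function(object, new_data = NULL, ...)
    sse_within_total(object, new_data = new_data, dist_fun = Rfast::dista, ...),
  direction = "zero"
)
res <- tune_cluster(
  kmeans_wflow2,
  resamples = penguins_cv,
  grid = clust_num_grid,
  control = control_grid(save_pred = TRUE, extract = identity),
  metrics = cluster_metric_set(sse_within_total_dista)
)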
Why Does This Work?
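Rfast::dista() fits the calling pattern a distance function needs here: its first argument is the set of points to measure, its second is the set of points to measure against (the cluster centroids), and it returns the distances between their rows. Both inputs must be numeric matrices, so this approach applies only when the columns reaching the metric are numeric; character or factor columns need a distance function that handles categorical data.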
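For mixed-type data, the distance function also has to cope with categorical columns. The following base R sketch computes a Gower-style distance between the rows of two data frames. The name gower_cross is hypothetical, the code is not what tidyclust or clustMixType use internally, and it assumes the function receives data frames with matching columns rather than matrices.

```r
# Hypothetical Gower-style distance between rows of x and rows of y
gower_cross <- function(x, y) {
  stopifnot(identical(names(x), names(y)))
  total <- matrix(0, nrow(x), nrow(y))
  for (col in names(x)) {
    if (is.numeric(x[[col]])) {
      # Absolute difference scaled by the observed range of the column
      rng <- diff(range(c(x[[col]], y[[col]])))
      if (rng == 0) rng <- 1
      part <- abs(outer(x[[col]], y[[col]], "-")) / rng
    } else {
      # 1 when the categories differ, 0 when they match
      part <- outer(as.character(x[[col]]), as.character(y[[col]]), "!=") * 1
    }
    total <- total + part
  }
  total / ncol(x)  # average over columns
}

pts <- data.frame(len = c(10, 20), sp = c("a", "b"))
cents <- data.frame(len = 10, sp = "b")
g <- gower_cross(pts, cents)
g
```

In this toy example both points end up at distance 0.5 from the centroid: the first differs only in category, the second only in length.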
Conclusion
In this article, we explored how to tune a K-prototypes model in tidyclust using custom distance functions. By specifying the distance function explicitly and making sure it is compatible with the data types it receives, we can resolve errors caused by character or factor columns.
Additional Resources
Example Code
Here’s the complete example code for this article:
# Load necessary packages
library(tidyclust)
library(tidyverse)
library(tidymodels)
# Create a dataset
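data("penguins", package = "modeldata")
penguins <- penguins %>% drop_na()  # assign the result so the NA rows are actually dropped
# Resamples for tune_cluster(); 5 folds is an arbitrary choice
penguins_cv <- vfold_cv(penguins, v = 5)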
# Create workflows
spec1 <- k_means(engine = 'clustMixType', num_clusters = 4)
spec2 <- k_means(engine = 'clustMixType', num_clusters = tune())
rec <- recipe(~ ., data = penguins)
wflow1 <- workflow(rec, spec1)
wflow2 <- workflow(rec, spec2)
# Create grid of clusters
grid <- grid_regular(num_clusters(), levels = 10)
# Fit the model and perform tuning
kmeans_fit <- fit(wflow1, data = penguins)
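# cluster_metric_set() accepts only metric functions, so wrap the dist_fun choice in a custom metric
sse_within_total_dista <- new_cluster_metric(
  function(object, new_data = NULL, ...)
    sse_within_total(object, new_data = new_data, dist_fun = Rfast::dista, ...),
  direction = "zero"
)
res <- tune_cluster(
  wflow2,
  resamples = penguins_cv,
  grid = grid,
  control = control_grid(save_pred = TRUE, extract = identity),
  metrics = cluster_metric_set(sse_within_total_dista)
)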
# Print the result
print(res)
Last modified on 2025-01-20