How to Tune a K-Prototypes Model in tidyclust Using Custom Distance Functions

Understanding K-Prototypes Clustering in tidyclust

Introduction

The tidyclust package brings the tidymodels interface to unsupervised clustering. It offers a consistent way to specify, fit, and tune clustering models such as k-means and hierarchical clustering, and its k_means() specification supports K-prototypes through the clustMixType engine. In this article, we’ll look at K-prototypes clustering in tidyclust and at how to tune the number of clusters in a K-prototypes model.

Background

K-prototypes is a partitioning (centroid-based) clustering algorithm for data that mix numeric and categorical variables. It extends k-means by combining squared Euclidean distance on the numeric columns with a simple mismatch count on the categorical columns, weighted by a parameter usually called lambda. Like k-means, it needs the number of clusters up front and can be sensitive to outliers. In tidyclust, K-prototypes is available through k_means() with the clustMixType engine.
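As a rough illustration of that combined dissimilarity, the base-R sketch below scores one observation against one prototype. The function name kproto_dissimilarity and its arguments are made up for this example; it is not the clustMixType implementation, only the textbook form of the measure under the assumption of a single lambda weight.

```r
# Illustrative K-prototypes dissimilarity between one observation and one prototype.
# x_num / p_num: numeric parts; x_cat / p_cat: categorical parts (same order).
kproto_dissimilarity <- function(x_num, p_num, x_cat, p_cat, lambda = 1) {
  numeric_part <- sum((x_num - p_num)^2)      # squared Euclidean distance
  categorical_part <- sum(x_cat != p_cat)     # number of mismatching categories
  numeric_part + lambda * categorical_part
}

result <- kproto_dissimilarity(
  x_num = c(1, 2), p_num = c(1, 4),
  x_cat = c("a", "b"), p_cat = c("a", "c"),
  lambda = 0.5
)
# numeric part: 0^2 + 2^2 = 4; categorical part: 1 mismatch; total: 4 + 0.5 * 1 = 4.5
print(result)
```

A larger lambda gives the categorical columns more influence on cluster assignment relative to the numeric ones.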

Setting up the Environment

To get started with K-prototypes clustering in tidyclust, we first load the required packages. Install them beforehand if needed, along with clustMixType, which provides the engine.

library(tidyclust)
library(tidyverse)
library(tidymodels)

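We’ll also use the penguins dataset from modeldata, drop the rows with missing values, and create the cross-validation folds that tuning needs later.

data("penguins", package = "modeldata")
penguins <- penguins %>% drop_na()
# Resamples used later by tune_cluster()
penguins_cv <- vfold_cv(penguins, v = 5)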

Creating a K-Prototypes Workflow

Next, we’ll create two workflows for our K-prototypes model: one with a fixed number of clusters and another where the number of clusters is marked for tuning.

# spec1 is for a non-tunable model
kmeans_spec1 <- k_means(engine = 'clustMixType', num_clusters = 4)

# spec2 is for a tunable model
kmeans_spec2 <- k_means(engine = 'clustMixType', num_clusters = tune())

penguins_rec <- recipe(~ ., data = penguins)

kmeans_wflow1 <- workflow(penguins_rec, kmeans_spec1)
kmeans_wflow2 <- workflow(penguins_rec, kmeans_spec2)

Creating a Grid of Clusters

To perform tuning, we need a grid of candidate cluster counts. We’ll use grid_regular() from dials (loaded with tidymodels) together with tidyclust’s num_clusters() parameter to generate this grid.

clust_num_grid <- grid_regular(num_clusters(), levels = 10)

Fitting the Model and Tuning

Now that we have our workflows and grid, we can fit the model and perform tuning using tune_cluster.

# non tunable clustering fit
kmeans_fit <- fit(kmeans_wflow1, data = penguins)

# this works without errors
sse_within_total(kmeans_fit)

# this also works
sse_within_total(kmeans_fit, dist_fun = cluster::daisy)

# However, this doesn't work
res <- tune_cluster(
  kmeans_wflow2,
  resamples = penguins_cv,
  grid = clust_num_grid,
  control = control_grid(save_pred = TRUE, extract = identity),
  metrics = cluster_metric_set(sse_within_total)
)

Error Message Analysis

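The error raised by tune_cluster() is a type-conversion failure: the distance calculation behind sse_within_total expects numeric (double) input but receives character data. The likely cause is that penguins contains factor columns (species, island, and sex); when mixed-type data is coerced to a matrix, every value becomes a character string, which a numeric-only distance function cannot process.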
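The coercion behind this kind of error can be reproduced in base R. The tiny data frame below is made up for illustration and only mimics the mix of factor and numeric columns found in penguins.

```r
# A small mixed-type data frame (illustrative values only)
mixed <- data.frame(
  species = factor(c("Adelie", "Gentoo")),
  bill_length_mm = c(39.1, 47.5)
)

# Coercing mixed columns to a matrix turns every value into a string
m_mixed <- as.matrix(mixed)
print(typeof(m_mixed))

# Keeping only the numeric column preserves a numeric matrix
m_numeric <- as.matrix(mixed["bill_length_mm"])
print(typeof(m_numeric))
```

Any distance function that requires doubles will reject the first matrix, which matches the kind of character-versus-double complaint described above.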

Understanding Distance Functions

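In K-prototypes clustering, a mixed-type dissimilarity determines how observations are assigned to clusters, and that choice is made by the engine. The dist_fun argument of sse_within_total is separate: it controls only how distances between observations and cluster centroids are computed when the metric is evaluated, and it lets us supply a custom distance function.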
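To make the shape of such a function concrete, here is a minimal base-R sketch of a two-argument distance function. It takes two numeric matrices and returns the Euclidean distance from every row of the first to every row of the second. The name cross_euclidean and the orientation of the result are choices made for this example, not the API of tidyclust or Rfast.

```r
# Euclidean distances between rows of two numeric matrices (illustrative helper).
# Result has one row per row of xnew and one column per row of x.
cross_euclidean <- function(xnew, x) {
  xnew <- as.matrix(xnew)
  x <- as.matrix(x)
  stopifnot(is.numeric(xnew), is.numeric(x), ncol(xnew) == ncol(x))
  out <- matrix(0, nrow = nrow(xnew), ncol = nrow(x))
  for (i in seq_len(nrow(xnew))) {
    diffs <- sweep(x, 2, xnew[i, ])      # subtract xnew[i, ] from each row of x
    out[i, ] <- sqrt(rowSums(diffs^2))
  }
  out
}

centroids <- rbind(c(0, 0), c(3, 4))
points <- rbind(c(0, 0), c(3, 4), c(6, 8))
d <- cross_euclidean(centroids, points)
print(d)
```

The stopifnot() check makes the numeric-only requirement explicit, which is the same constraint that causes trouble once factor columns are involved.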

Analyzing the Error Message

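This also suggests why the direct calls on kmeans_fit succeed while tuning fails. When no new data is supplied, the metric can presumably be read from the within-cluster sums of squares stored in the fitted model, so no distances are computed. During tuning, the metric is evaluated on resampled data, and the distance function has to run on it. The problem therefore lies in which distance function the metric uses and what data it receives.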

Using Custom Distance Functions

One way to resolve this issue is to choose the distance function explicitly. tidyclust does not ship its own family of distance functions; dist_fun accepts a function from another package, such as Rfast::dista() or cluster::daisy() (used above), or one you write yourself, as long as it suits the data it receives.
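For mixed-type data, one option is a Gower dissimilarity. The sketch below wraps cluster::daisy(), which computes pairwise dissimilarities within a single data set, so that it returns distances between the rows of two data frames. The helper name gower_cross_dist and the example values are hypothetical, and whether a tidyclust metric hands the data to dist_fun in a form that preserves factor columns is an assumption you would need to verify for your version.

```r
library(cluster)

# Gower dissimilarity between rows of x and rows of y (illustrative helper).
# Both inputs must be data frames with the same columns.
gower_cross_dist <- function(x, y) {
  x <- as.data.frame(x)
  y <- as.data.frame(y)
  combined <- rbind(x, y)
  # daisy() gives all pairwise dissimilarities; keep only the x-versus-y block
  full <- as.matrix(daisy(combined, metric = "gower"))
  full[seq_len(nrow(x)), nrow(x) + seq_len(nrow(y)), drop = FALSE]
}

centroids <- data.frame(
  island = factor(c("Biscoe", "Dream")),
  body_mass_g = c(5000, 3700)
)
points <- data.frame(
  island = factor(c("Biscoe", "Dream", "Dream")),
  body_mass_g = c(5000, 3500, 3700)
)

d_gower <- gower_cross_dist(centroids, points)
print(dim(d_gower))
```

Note that Gower scaling here uses the ranges of the combined rows, so the values depend on which points are passed in together.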

Rfast::dista() Example

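Let’s take a closer look at Rfast::dista(). It computes distances between the rows of two numeric matrices, which is the two-argument form that dist_fun calls for. Because cluster_metric_set() takes metric functions rather than extra arguments such as dist_fun, the code below binds the distance function to the metric in a small wrapper. It assumes tidyclust’s new_cluster_metric() helper.

# Wrap sse_within_total so it always uses Rfast::dista() (assumed approach)
sse_within_total_dista <- new_cluster_metric(
  function(object, new_data = NULL, ...) {
    sse_within_total(object, new_data = new_data, dist_fun = Rfast::dista)
  },
  direction = "zero"
)

res <- tune_cluster(
  kmeans_wflow2,
  resamples = penguins_cv,
  grid = clust_num_grid,
  control = control_grid(save_pred = TRUE, extract = identity),
  metrics = cluster_metric_set(sse_within_total_dista)
)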

Why Does This Work?

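Rfast::dista() fits the dist_fun interface because it takes two numeric matrices, a set of new points and a set of reference points, and returns the distances between their rows. Both arguments must be numeric matrices rather than data frames, so this approach runs cleanly only when the data reaching the metric is numeric. If factor columns such as species or island are still present, the same type error can be expected, and a distance that handles mixed types (for example a Gower dissimilarity built on cluster::daisy()) is the better fit.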

Conclusion

In this article, we explored how to tune a K-prototypes model in tidyclust using custom distance functions. By supplying a distance function that is compatible with the data the metric receives, we can avoid the character-to-double type error that appears when factor columns reach a numeric-only distance function.

Additional Resources

Example Code

Here’s the complete example code for this article:

# Load necessary packages
library(tidyclust)
library(tidyverse)
library(tidymodels)

# Create a dataset and cross-validation folds
data("penguins", package = "modeldata")
penguins <- penguins %>% drop_na()
penguins_cv <- vfold_cv(penguins, v = 5)

# Create workflows
spec1 <- k_means(engine = 'clustMixType', num_clusters = 4)
spec2 <- k_means(engine = 'clustMixType', num_clusters = tune())

rec <- recipe(~ ., data = penguins)

wflow1 <- workflow(rec, spec1)
wflow2 <- workflow(rec, spec2)

# Create grid of clusters
grid <- grid_regular(num_clusters(), levels = 10)

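# Fit the model and perform tuning
kmeans_fit <- fit(wflow1, data = penguins)

# Bind dist_fun to the metric; cluster_metric_set() takes metric functions only
# (assumes tidyclust::new_cluster_metric() and numeric data reaching dist_fun)
sse_within_total_dista <- new_cluster_metric(
  function(object, new_data = NULL, ...) {
    sse_within_total(object, new_data = new_data, dist_fun = Rfast::dista)
  },
  direction = "zero"
)

res <- tune_cluster(
  wflow2,
  resamples = penguins_cv,
  grid = grid,
  control = control_grid(save_pred = TRUE, extract = identity),
  metrics = cluster_metric_set(sse_within_total_dista)
)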

# Print the result
print(res)

Last modified on 2025-01-20