Selecting the K Furthest-Apart Groups in Your R Dataset Using Two Distinct Methods

To find the K furthest-apart groups, where each row of the data frame df is one group, you can use the following R code:

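k <- 5 # number of furthest-apart groups to select (2 <= k <= nrow(df))

# Mean of each row; each row of df is treated as one group
group_means <- rowMeans(df)
indices     <- seq_len(nrow(df))

# Start with the rows that have the lowest and the highest mean
k_furthest <- c(which.min(group_means), which.max(group_means))
k_vals     <- c(min(group_means), max(group_means))

# Remove the selected rows from the pool of candidates
group_means <- group_means[-k_furthest]
indices     <- indices[-k_furthest]

while(length(k_furthest) < k)
{
  # For each candidate, sum the squared differences between its mean and the
  # means selected so far; outer() keeps this a matrix even with one candidate
  best <- which.max(colSums(outer(k_vals, group_means, "-")^2))
  k_vals <- c(k_vals, group_means[best])
  k_furthest <- c(k_furthest, indices[best])
  group_means <- group_means[-best]
  indices     <- indices[-best]
}

df[k_furthest, ]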

This code first calculates the mean of each row in the data frame df. It then selects the two rows with the smallest and largest means. After that, it repeatedly adds the remaining row whose mean has the largest sum of squared differences from the means already selected, until exactly K rows have been chosen.

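Please note that this algorithm always picks whichever remaining row has the highest or the lowest mean. The sum of squared differences from the already-selected means is a convex function of the candidate's mean, so it peaks at one of the two extremes, and in practice the picks tend to alternate between the high and low ends. This may not produce the desired result if you want a different “distance” between groups, because reducing each row to its mean ignores how the individual elements differ.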
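To see which rows this selection ends up with, here is a minimal sketch on toy data. The data frame and the helper name pick_by_row_means are hypothetical and only serve as illustration; the helper follows the same greedy idea (start from the lowest and highest row mean, then add the remaining row with the largest sum of squared differences from the means already chosen).

```r
# Hypothetical toy data: 6 rows ("groups"), 2 numeric columns
df <- data.frame(a = c(1, 5, 2, 9, 4, 7),
                 b = c(2, 6, 0, 8, 5, 7.5))

# Hypothetical helper: greedy selection of k rows based on row means
pick_by_row_means <- function(df, k) {
  m <- rowMeans(df)
  # Start with the rows that have the lowest and the highest mean
  chosen <- c(which.min(m), which.max(m))
  while (length(chosen) < k) {
    remaining <- setdiff(seq_along(m), chosen)
    # Sum of squared differences between each remaining mean and the chosen means
    score <- vapply(remaining,
                    function(r) sum((m[r] - m[chosen])^2),
                    numeric(1))
    chosen <- c(chosen, remaining[which.max(score)])
  }
  unname(chosen)
}

row_means <- rowMeans(df)   # 1.5, 5.5, 1, 8.5, 4.5, 7.25
chosen    <- pick_by_row_means(df, k = 4)
chosen                      # 3 4 1 6

# Ranks of the rows that were NOT chosen: they form one block in the middle
sort(rank(row_means)[-chosen])
```

For this toy data the two rows left out are the ones with the middle-ranked means, which is what you would expect if every pick comes from the low or the high end of the remaining candidates.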

Alternatively, if you want to maximize the sum of element-wise absolute differences between groups, you can use:

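# For every row i, find the row with the largest sum of absolute
# element-wise differences from it
distances <- as.data.frame(t(sapply(seq_len(nrow(df)), function(i) {
  a <- rowSums(apply(df, 2, function(x) abs(x[i] - x)))
  # unname() keeps the column name "most_distant" even when df has row names
  c(row = i, most_distant = unname(which.max(a)), difference = max(a))
})))

head(distances) # inspect the first few row / partner pairs

# Take the k pairs with the largest difference (k as defined earlier),
# then keep the first k unique row numbers
i <- unique(c(t(distances[order(-distances$difference)[seq(k)], 1:2])))[seq(k)]

df[i,]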

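This code calculates, for each row, the sum of absolute element-wise differences (the Manhattan distance) to every other row and records its most distant partner. It then takes the pairs with the largest differences and keeps the first K unique rows from them. This is a heuristic: it favours rows that belong to the most distant pairs and does not guarantee that the K selected rows maximize the total pairwise difference.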
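The same per-row distances can also be obtained from base R's dist() function with method = "manhattan", which may be easier to read than the nested apply calls. The sketch below is not the code above, only an equivalent way to build the distances table under the assumption that df holds numeric columns only; the toy data and the variable name picked are hypothetical.

```r
# Hypothetical toy data: 6 rows ("groups"), 2 numeric columns
df <- data.frame(a = c(1, 5, 2, 9, 4, 7),
                 b = c(2, 6, 0, 8, 5, 7.5))
k <- 3

# Full symmetric matrix of pairwise Manhattan distances between rows
d <- as.matrix(dist(df, method = "manhattan"))

# For each row: its most distant partner and the distance to it
distances <- data.frame(
  row          = seq_len(nrow(df)),
  most_distant = unname(apply(d, 1, which.max)),
  difference   = unname(apply(d, 1, max))
)

# Cross-check row 1 against the apply-based computation
a <- rowSums(apply(df, 2, function(x) abs(x[1] - x)))
stopifnot(isTRUE(all.equal(unname(a), unname(d[1, ]))))

# Same selection step: the k most distant pairs, then the first k unique rows
top    <- order(-distances$difference)[seq_len(k)]
picked <- unique(c(t(distances[top, 1:2])))[seq_len(k)]

df[picked, ]
```

Because dist() returns every pairwise distance at once, the matrix d could also be used to score a candidate set of rows directly, for example with sum(d[picked, picked]) / 2 for the total pairwise difference within the selection.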


Last modified on 2023-10-27