Using purrr to map over a range of k-means clusters

Mapping a k-means factory

David Harper https://dh-data.org/
2022-09-05

Since I’m (re-) learning k-means clustering in Matt Dancho’s brilliant course, I thought I’d apply map() in pursuit of a “ggplot factory.” Please note that my ellipses are not proper renders of the k-means cluster; e.g., their center is not the k-means centroid. I lazily here used ggplot’s built-in stat_ellipse() as a bit of visual flair. I’m getting more comfortable with the power of purrr. Yay!

library(tidyverse)
library(patchwork)

# bivariate normal 
set.seed(55)
rows <- 40
simulation <- tibble(x = c(rnorm(rows/4), 
                           rnorm(rows/4, mean = 1.0, sd = 0.8), 
                           rnorm(rows/4, mean = 2.0, sd = 0.6), 
                           rnorm(rows/4, mean = 3.0, sd = 0.4)),
                     y = c(rnorm(rows/4), 
                           rnorm(rows/4, mean = 1.0, sd = 0.8),
                           rnorm(rows/4, mean = 2.0, sd = 0.6), 
                           rnorm(rows/4, mean = 3.0, sd = 0.4)))

# accepts data = matrix and number of k-means centers
# and returns a vector of clusters. 
# Notice as_factor() returns a factored vector so ggplot's color will work
kmeans_vec <- function(data, centers= 3){
    w <- data %>% kmeans(centers = centers, nstart = 100)
    w$cluster %>% as_factor()
}

range <- 3:6 # for example, we want n = {3, 4, 5, 6} clusters
clusters_df <- range %>% map_dfc(~kmeans_vec(simulation, .)) %>% setNames(range)
# map_dfc returns a column-bound dataframe
# each column is a vector that contains the cluster assignment
# e.g., column named '3' has {1, 2, or 3}. 
clusters_df 
# A tibble: 40 × 4
   `3`   `4`   `5`   `6`  
   <fct> <fct> <fct> <fct>
 1 1     4     4     1    
 2 3     2     3     5    
 3 3     3     5     6    
 4 3     2     3     5    
 5 1     4     4     1    
 6 1     4     4     3    
 7 3     2     3     5    
 8 3     3     5     6    
 9 3     3     5     6    
10 3     3     5     6    
# … with 30 more rows
sim_bind_centers <- cbind(simulation, clusters_df) #bind the simulation matrix to the clusters df

# this function returns a ggplot. Notice:
# I had to use get()
# stat_ellipse() works out of the box!
gg_obj <- function(data, vector = 4) {
    data %>% ggplot(aes(x = x, y = y, color = get(as.character(vector)))) +
        geom_point() +
        stat_ellipse() +
        theme(legend.title = element_blank())
}

# p_output is a list of 4
p_output <- range %>% map(~gg_obj(sim_bind_centers, .))
p1 <- p_output[[1]]
p2 <- p_output[[2]]
p3 <- p_output[[3]]
p4 <- p_output[[4]]

(p1 + p2) / (p3 + p4)