Clustering

About


(This page is based primarily on material from Chapter 10 of Introduction to Statistical Learning)

In general, we can cluster observations on the basis of the features in order to identify subgroups among the observations, or we can cluster features on the basis of the observations in order to discover subgroups among the features. In what follows, for simplicity we will discuss clustering observations on the basis of the features, though the converse can be performed by simply transposing the data matrix.
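For example (a sketch with toy data, using scikit-learn's KMeans simply as a stand-in for any clustering method):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 8))  # n = 100 observations, p = 8 features

obs_clusters  = KMeans(n_clusters=3, n_init=10).fit_predict(X)    # subgroups among the observations
feat_clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X.T)  # subgroups among the features
```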

K-Means Clustering

Algorithm

  1. Randomly assign each observation to one of K clusters
  2. Until the cluster assignments stop changing:
    1. Compute the centroid (mean) of each cluster
    2. Assign each observation to the cluster whose centroid is closest

Note

  • We typically use Euclidean distance
  • We typically run the algorithm many times with different random initial assignments and keep the solution with the lowest total within-cluster variation
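A minimal NumPy sketch of this algorithm (the function name and toy structure are my own, and the empty-cluster edge case is ignored; in practice scikit-learn's KMeans is the standard choice, and its n_init parameter handles the multiple-restarts point above):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Plain K-means as described above: random initial assignment,
    then alternate centroid computation and reassignment."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # 1. Randomly assign each observation to one of K clusters
    labels = rng.integers(0, K, size=n)
    for _ in range(max_iter):
        # 2a. Compute the centroid (mean) of each cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 2b. Assign each observation to the cluster whose centroid is closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```

To follow the second note above, run this with several seeds and keep the labelling with the smallest total within-cluster sum of squares.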

Bottom-up / Agglomerative (Hierarchical) Clustering

Dendrograms

This is a diagram representing a hierarchical organisation of the data based on cluster proximity.
Clusters can be produced using a horizontal cut at a given height.

Interpreting a dendrogram

✅ The similarity between two observations is determined by the height (on the vertical axis) at which their branches first fuse
❎ It is not determined by their horizontal proximity in the dendrogram

Algorithm

  1. Compute all the pairwise dissimilarities (e.g. Euclidean distance) between observations
  2. Repeat until all observations belong to a single cluster:
    1. Using some linkage measure, calculate the dissimilarity between each pair of clusters
    2. Fuse the pair of clusters that are least dissimilar
    3. The height of the fusion in the dendrogram = the dissimilarity between the two fused clusters
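A minimal sketch of this pipeline using SciPy (assuming scipy is installed, plus matplotlib for the dendrogram plot; the toy data and cluster count are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data

d = pdist(X, metric="euclidean")   # step 1: pairwise dissimilarities
Z = linkage(d, method="complete")  # step 2: repeatedly fuse the least dissimilar pair
dendrogram(Z)                      # column Z[:, 2] holds the fusion heights

labels = fcluster(Z, t=3, criterion="maxclust")  # horizontal cut yielding 3 clusters
```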

Dissimilarity Measures

To compute the pairwise dissimilarities on which linkage is built, we need a measure of dissimilarity between observations across all features.
  • Typically we standardise the features and then use Euclidean distance
  • However, correlation-based measures can also be a useful alternative
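Both options map onto SciPy's pdist metrics (a sketch with toy data; the "correlation" metric computes one minus the Pearson correlation between two observations' feature profiles):

```python
import numpy as np
from scipy.stats import zscore
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 5))  # toy data

# Standardise each feature, then Euclidean distance between observations
d_euclidean = pdist(zscore(X, axis=0), metric="euclidean")

# Correlation-based dissimilarity: 1 - correlation between observation profiles
d_correlation = pdist(X, metric="correlation")
```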

Linkage Measures

Hierarchical clustering requires distance measures between clusters rather than between individual observations; this is known as linkage.
| Linkage | Definition | Cons |
| --- | --- | --- |
| Complete | Max dissimilarity between observations in the two clusters | |
| Single | Min dissimilarity between observations in the two clusters | Dendrograms can be very unbalanced |
| Average | Mean dissimilarity between observations in the two clusters | |
| Centroid | Dissimilarity between the cluster centroids | Can lead to inversions, where two clusters fuse at a height below either of the individual clusters |
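The four rows above correspond directly to SciPy's method argument. Inversions show up as non-monotonic fusion heights, which the sketch below checks for (whether they actually occur depends on the data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(1).normal(size=(20, 3))  # toy data

for method in ["complete", "single", "average", "centroid"]:
    Z = linkage(X, method=method)  # centroid linkage assumes Euclidean distance
    heights = Z[:, 2]              # dissimilarity at each successive fusion
    monotonic = bool(np.all(np.diff(heights) >= 0))
    print(f"{method:>8}: monotonic fusion heights = {monotonic}")
```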