In general, we can cluster observations on the basis of the features in order to identify subgroups among the observations, or we can cluster features on the basis of the observations in order to discover subgroups among the features. In what follows, for simplicity we will discuss clustering observations on the basis of the features, though the converse can be performed by simply transposing the data matrix.
K-Means Clustering
Algorithm
- Randomly assign each observation to one of K clusters
- Until the cluster assignments stop changing:
- Compute the centroid (mean) of each cluster
- Assign each observation to the cluster whose centroid is closest
Note
- We typically use Euclidean distance
- We typically run the algorithm several times with different random initial assignments and keep the solution with the lowest total within-cluster variation (see the sketch below)
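As a concrete illustration, here is a minimal NumPy sketch of the algorithm above. The function name `k_means`, the fixed seed, the toy data, and the rule for re-seeding an empty cluster from a random observation are illustrative choices, not part of the notes:

```python
import numpy as np

def k_means(X, k, rng):
    """One run of K-means: random initial assignment, then iterate until stable."""
    # Randomly assign each observation to one of k clusters
    labels = rng.integers(0, k, size=len(X))
    while True:
        # Compute the centroid (mean) of each cluster; re-seed a cluster if it became empty
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Assign each observation to the cluster whose centroid is closest (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # cluster assignments stopped changing
            return labels, centroids
        labels = new_labels

# Run several times with different initial assignments and keep the solution
# with the lowest total within-cluster sum of squares
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels, centroids = min(
    (k_means(X, k=3, rng=rng) for _ in range(20)),
    key=lambda res: sum(((X[res[0] == j] - c) ** 2).sum() for j, c in enumerate(res[1])),
)
```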
Bottom-up / Agglomerative (Hierarchical) Clustering
Dendrograms
This is a diagram representing a hierarchical organisation of the data based on cluster proximity.
Clusters can be produced using a horizontal cut at a given height.
Interpreting a dendrogram
✅ The similarity between two observations is determined by the height at which their branches fuse
❎ It is not determined by their proximity along the horizontal axis of the dendrogram
Algorithm
- Treat each observation as its own cluster and compute all the pairwise dissimilarities (e.g. Euclidean distance) between observations
- Repeat:
- Using a linkage measure, compute the dissimilarity between each pair of remaining clusters
- Fuse the pair of clusters that are least dissimilar
- The height of the fusion in the dendrogram = the dissimilarity between the two fused clusters (see the SciPy sketch below)
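A possible sketch of this procedure using SciPy's hierarchical clustering routines; the choice of complete linkage, Euclidean distance, the toy data, and the cut height of 3.0 are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt  # only needed to draw the dendrogram
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Pairwise Euclidean dissimilarities between observations
d = pdist(X, metric="euclidean")

# Agglomerative clustering: repeatedly fuse the least dissimilar pair of clusters.
# Z records which clusters fuse and the height (dissimilarity) at which they do.
Z = linkage(d, method="complete")

# Draw the dendrogram
dendrogram(Z)
plt.show()

# A horizontal cut at a given height yields flat cluster labels
labels = fcluster(Z, t=3.0, criterion="distance")
```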
Dissimilarity Measures
For the per-observation dissimilarities (which linkage builds on), we need a measure of how different two observations are, computed across all features.
- Typically we standardise the features and then use Euclidean distance
- However, correlation-based dissimilarity (one minus the correlation between two observations' feature profiles) can also be a useful alternative, as in the sketch below
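A brief sketch of both options using SciPy distance functions; the toy data and the standardisation to mean 0 and standard deviation 1 are illustrative assumptions (SciPy's "correlation" metric returns one minus the correlation between two observations' feature profiles):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

# Option 1: standardise each feature, then use Euclidean distance between observations
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_euclidean = pdist(X_std, metric="euclidean")

# Option 2: correlation-based dissimilarity between observations
d_correlation = pdist(X, metric="correlation")
```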
Linkage Measures
Hierarchical clustering also requires a measure of dissimilarity between clusters rather than between individual observations; this is known as linkage. The common options are compared in the table and sketch below.
| Name | Tags | Cons |
| --- | --- | --- |
| Complete | Maximal intercluster dissimilarity (largest pairwise dissimilarity between the two clusters) | |
| Average | Mean intercluster dissimilarity (average of all pairwise dissimilarities between the two clusters) | |
| Single | Minimal intercluster dissimilarity (smallest pairwise dissimilarity between the two clusters) | Can produce extended, "trailing" clusters in which observations fuse one at a time |
| Centroid | Dissimilarity between the centroids of the two clusters | Can result in inversions, where two clusters fuse at a height below either individual cluster in the dendrogram |
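A short sketch comparing these linkage options on the same toy data with SciPy; the data and the cut into at most four clusters are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Same data, different linkage measures: the fusions (and hence the clusters) can differ
for method in ["complete", "average", "single", "centroid"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut so that at most 4 clusters remain
    print(method, np.bincount(labels))
```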