In general, we can cluster observations on the basis of the features in order to identify subgroups among the observations, or we can cluster features on the basis of the observations in order to discover subgroups among the features. In what follows, for simplicity we will discuss clustering observations on the basis of the features, though the converse can be performed by simply transposing the data matrix.
K-Means Clustering
Algorithm
- Randomly assign each observation to one of K clusters
- Until the cluster assignments stop changing:
- Compute the centroid (mean) of each cluster
- Assign each observation to the cluster whose centroid is closest
Note
- We typically use Euclidean distance
- We typically run the algorithm several times with different random initial assignments and keep the solution with the lowest total within-cluster variation (see the sketch below)
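As a concrete illustration, here is a minimal NumPy sketch of the algorithm above. The function name `k_means`, the fixed seed, the toy data, and the rule for re-seeding an empty cluster from a random observation are illustrative choices, not part of the notes:

```python
import numpy as np

def k_means(X, k, rng):
    """One run of K-means: random initial assignment, then iterate until stable."""
    # Randomly assign each observation to one of k clusters
    labels = rng.integers(0, k, size=len(X))
    while True:
        # Compute the centroid (mean) of each cluster; re-seed a cluster if it became empty
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Assign each observation to the cluster whose centroid is closest (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # cluster assignments stopped changing
            return labels, centroids
        labels = new_labels

# Run several times with different initial assignments and keep the solution
# with the lowest total within-cluster sum of squares
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels, centroids = min(
    (k_means(X, k=3, rng=rng) for _ in range(20)),
    key=lambda res: sum(((X[res[0] == j] - c) ** 2).sum() for j, c in enumerate(res[1])),
)
```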
Bottom-up / Agglomerative (Hierarchical) Clustering
Dendrograms
This is a diagram representing a hierarchical organisation of the data based on cluster proximity.
Clusters can be produced using a horizontal cut at a given height.
Interpreting a dendrogram
✅ The similarity between two observations is determined by the height at which their branches fuse
❎ It is not determined by their proximity along the horizontal axis of the dendrogram
Algorithm
- Treat each observation as its own cluster and compute all the pairwise dissimilarities (e.g. Euclidean distance) between observations
- Repeat:
- Using a linkage measure, compute the dissimilarity between each pair of remaining clusters
- Fuse the pair of clusters that are least dissimilar
- The height of the fusion in the dendrogram = the dissimilarity between the two fused clusters (see the SciPy sketch below)
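A possible sketch of this procedure using SciPy's hierarchical clustering routines; the choice of complete linkage, Euclidean distance, the toy data, and the cut height of 3.0 are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt  # only needed to draw the dendrogram
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Pairwise Euclidean dissimilarities between observations
d = pdist(X, metric="euclidean")

# Agglomerative clustering: repeatedly fuse the least dissimilar pair of clusters.
# Z records which clusters fuse and the height (dissimilarity) at which they do.
Z = linkage(d, method="complete")

# Draw the dendrogram
dendrogram(Z)
plt.show()

# A horizontal cut at a given height yields flat cluster labels
labels = fcluster(Z, t=3.0, criterion="distance")
```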
Dissimilarity Measures
For the per-observation dissimilarities (which linkage builds on), we need a measure of how different two observations are, computed across all features.
- Typically we standardise the features and then use Euclidean distance
- However, correlation-based dissimilarity (one minus the correlation between two observations' feature profiles) can also be a useful alternative, as in the sketch below
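A brief sketch of both options using SciPy distance functions; the toy data and the standardisation to mean 0 and standard deviation 1 are illustrative assumptions (SciPy's "correlation" metric returns one minus the correlation between two observations' feature profiles):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

# Option 1: standardise each feature, then use Euclidean distance between observations
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_euclidean = pdist(X_std, metric="euclidean")

# Option 2: correlation-based dissimilarity between observations
d_correlation = pdist(X, metric="correlation")
```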
Linkage Measures
Hierarchical clustering also requires a measure of dissimilarity between clusters rather than between individual observations; this is known as linkage. The common options are compared in the table and sketch below.
| Name | Tags | Cons |
| --- | --- | --- |
| Complete | Maximal intercluster dissimilarity (largest pairwise dissimilarity between the two clusters) | |
| Average | Mean intercluster dissimilarity (average of all pairwise dissimilarities between the two clusters) | |
| Single | Minimal intercluster dissimilarity (smallest pairwise dissimilarity between the two clusters) | Can produce extended, "trailing" clusters in which observations fuse one at a time |
| Centroid | Dissimilarity between the centroids of the two clusters | Can result in inversions, where two clusters fuse at a height below either individual cluster in the dendrogram |
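A short sketch comparing these linkage options on the same toy data with SciPy; the data and the cut into at most four clusters are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Same data, different linkage measures: the fusions (and hence the clusters) can differ
for method in ["complete", "average", "single", "centroid"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut so that at most 4 clusters remain
    print(method, np.bincount(labels))
```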