What is clustering and why is it hard?

Contributed by Alex Williams

I’ve been working on some clustering techniques to identify cell types from DNA methylation data. When you dive into the literature on clustering, two things becomes immediately apparent: first, clustering is fundamental to many scientific questions, and second, there is “distressingly little general theory” on how it works or how to apply it to your particular data.

This was surprising to me. I imagine that most biologists and neuroscientists come across k-means clustering, hierarchical clustering, and similar techniques all the time in papers related to their work. Given how commonplace these techniques are, one would think that we have a solid handle on how they work and what can go wrong.

This will be the first post in a short series on clustering techniques. I will try to explain why clustering is hard from a high-level, intuitive perspective. The next post will cover some more technical theoretical results. I’ll focus on Jon Kleinberg’s paper which precisely defines an ideal clustering function, but then proves that no such function exists and that there are inevitable tradeoffs that must be made. The final few posts will cover other theoretical work and some current projects of mine.

What is clustering?

Clustering is difficult because it is an unsupervised learning problem: we are given a dataset and are asked to infer structure within it (in this case, the latent clusters/categories in the data). The problem is that there isn’t necessarily a “correct” or ground truth solution that we can refer to if we want to check our answers. This is in contrast to classification problems, where we do know the ground truth. Deep artificial neural networks are very good at classification (NYT article; Deng et al. 2009), but clustering is still a very open problem.

For example, it is a classification problem to predict whether or not a patient has a common disease based on a list of symptoms. In this case, we can draw upon past clinical records to make this judgment, and we can gather further data (e.g. a blood test) to confirm our prediction. In other words, we assume there is a self-evident ground truth (the patient either has or does not have disease X) that can be observed.

For clustering, we lack this critical information. For example, suppose you are given a large number of beetles and told to group them into clusters based on their appearance. Assuming that you aren’t an entomologist, this will involve some judgment calls and guesswork.[1] If you and a friend sort the same 100 beetles into 5 groups, you will likely come up with slightly different answers. And — here’s the important part — there isn’t really a way to determine which one of you is “right”.

Approaches for clustering

There is a lot of material written on this already, so rather than re-hash what’s out there I will just point you to the best resources.

The important thing to realize is that all of these approaches are very computationally difficult to solve exactly for large datasets (more on this in my next post). As a result, we often resort to optimization heuristics that may or may not produce reasonable results. And, as we will soon see, even if the results are “reasonable” from the algorithm’s perspective, they might not align with our intuition, prior knowledge, or desired outcome.

It is difficult to determine the number of clusters in a dataset

This has to be the most widely understood problem with clustering. In fact, there is an entire Wikipedia article devoted to it. If you think about the problem for long enough, you will come to the inescapable conclusion is that there is no “true” number of clusters (though some numbers feel better than others), and that the same dataset is appropriately viewed at various levels of granularity depending on analysis goals.

While this problem cannot be “solved” definitively, there are some nice ways of dealing with it. Hierarchical clustering approaches provide cluster assignments for all possible number of clusters, allowing the analyst or reader to view the data across different levels of granularity. There are also Bayesian approaches such as Dirichlet Process Mixture Models that adaptively estimate the number of clusters based on a hyperparameter which tunes dispersion. A number of recent papers have focused on convex clustering techniques that fuse cluster centroids together in a continuous manner along a regularization path; this exposes a hierarchical structure for a clustering approach (roughly) similar to k-means.[3] Of course, there are many other papers out there on this subject.[4]

It is difficult to cluster outliers (even if they form a common group)

I recommend you read David Robinson’s excellent post on the shortcomings of k-means clustering. The following example he provides is particularly compelling:

Three spherical clusters with variable numbers of elements/points.

Raw Data. Three spherical clusters with variable numbers of elements/points.

The human eye can pretty easily separate these data into three groups, but the k-means algorithm fails pretty hard:

Rather than assigning the points in the upper left corner to their own cluster, the algorithm breaks the largest cluster (in the upper right) into two clusters. In other words it tolerates a few large errors (upper left) in order to decrease the errors where data is particularly dense (upper right). This likely doesn’t align with our analysis, but it is completely reasonable from the perspective of the algorithm. And again, there isn’t a ground truth to show that the algorithm is “wrong” per se.

It is difficult to cluster non-spherical, overlapping data

A final, related problem arises from the shape of the data clusters. Every clustering algorithm makes structural assumptions about the dataset that need to be considered. For example, k-means works by minimizing the total sum-of-squared distance to the cluster centroids. This can produce undesirable results when the clusters are elongated in certain directions — particularly when the between-cluster distance is smaller than the maximum within-cluster distance. Single-linkage clustering, in contrast, can perform well in these cases, since points are clustered together based on their nearest neighbor, which facilitates clustering along ‘paths’ in the dataset.

If your dataset contains long paths, then single-linkage clustering (panels B and D) will typically perform better than k-means (panels A and C).

Datasets where single-linkage outperforms k-means. If your dataset contains long paths, then single-linkage clustering (panels B and D) will typically perform better than k-means (panels A and C).

However, there is no free lunch.[5] Single-linkage clustering is more sensitive to noise, because each clustering assignment is based on a single pair of datapoints (the pair with minimal distance). This can cause paths to form between overlapping clouds of points. In contrast, k-means uses a more global calculation — minimizing the distance to the nearest centroid summed over all points. As a result, k-means typically does a better job of identifying partially overlapping clusters.

Single-linkage clustering tends to erroneously fuse together overlapping groups of points (red dots); small groups of outliers (black dots) are clustered together based on their small pairwise distances.

A dataset where k-means outperforms single-linkage. Single-linkage clustering tends to erroneously fuse together overlapping groups of points (red dots); small groups of outliers (black dots) are clustered together based on their small pairwise distances.

The above figures were schematically reproduced from these lecture notes from a statistics course at Carnegie Mellon.

Are there solutions these problems?

This post was meant to highlight the inherent difficulty of clustering rather than propose solutions to these issues. It may therefore come off as a bit pessimistic. There are many heuristics that can help overcome the above issues, but I think it is important to emphasize that these are only heuristics, not guarantees. While many of biologists treat k-means as an “off the shelf” clustering algorithm, we need to be at least a little careful when we do this.

One of the more interesting heuristics worth reading up on is called ensemble clustering. The basic idea is to average the outcomes of several clustering techniques or from the same technique fit from different random initializations. Each clustering fit may suffer from instability, but the average behavior of the ensemble of models will tend to be more desirable. This general trick is called ensemble averaging and has been successfully applied to a number of machine learning problems.[6]

This post only provides a quick outline of the typical issues that arise for clustering problems. The details of the algorithms have been purposefully omitted, although a deep understanding of these issues likely requires a closer look at these specifics. Jain (2010) provides a more comprehensive review that is worth reading.


[1] It will probably involve guesswork even if you are.
[2] Highlighting the difficulty of clustering, Larry Wasserman has joked that “mixtures, like tequila, are inherently evil and should be avoided at all costs.” Andrew Gelman is slightly less pessimistic.
[3] Convex clustering performs continuous clustering, similar to how LASSO performs continuous variable selection.
[4] See Tibshirani et al. (2001), Dudoit & Fridlyand (2002), Figueiredo & Jain (2002), Yan & Ye (2007), Kulis & Jordan (2012)
[5] The “no free lunch” theorem roughly states that whenever an algorithm performs well on a certain class of problems it is because it makes good assumptions about those problems; however, you can always construct new problems that violate these assumptions, leading to worse performance. Interestingly, this basic idea pops up in other contexts. For example, certain feedback control systems can be engineered so that they are robust to particular perturbations, but such engineering renders them more sensitive to other forms of perturbations (see “waterbed effect.”)
[6] See bootstrap aggregation, random forests, and ensemble learning. Seminal work on this topic was done by Leo Breiman — his papers are lucid, fascinating, and accessible, and I particularly recommend his 1996 article “Bagging Predictors”. This is also covered in any modern textbook, such as The Elements of Statistical Learning.