Clustering Multiple Long-Term Conditions

What makes a good set of clusters?

Before running any clustering algorithm, it’s important to have an idea of what constitutes a ‘good’ set of clusters. Although the process of clustering itself is data-driven, there are several choices that need to be made, including the data inputs, the choice of algorithm, the optimal number of clusters and the metrics of success. I’ll write more on these choices (and the degree to which clustering is really ‘objective’) in the next blog, but making sense of those choices first requires an idea of what we’re trying to produce.

During my thesis, I found little guidance in published literature to answer the question of what features make a good set of clusters in a health context. But while researching the history of the ICD-10 classification system (I argue in my thesis that clustering diseases shares goals with classifying diseases into disease hierarchies), I came across a 1960 paper by the statistician Iwao Moriyama which sets out “general principles of classification”.

It struck me that these have relevance not only for a good classification system, but also for a good set of clusters. While Moriyama set out seven principles, I added two (simplicity and multi-resolution) and removed one, as it is a property of reporting rather than of creating clusters. The first two properties relate to choices about the type of clusters (hard versus soft, or hierarchical versus non-hierarchical), which depend on the intended use, while the subsequent six are general properties that are desirable in most cases.

  1. Hard versus soft: clusters can be ‘hard’, or mutually exclusive, where each data point is assigned to one and only one cluster. Hard clusters are simpler to visualise and interpret, as each data point appears only once. However, data items at the boundary between two clusters will be forced into one of them, which may not reflect the uncertainty of that assignment. In contrast, ‘soft’ (also called fuzzy) clusters allow data items to belong to multiple clusters, so those on a cluster boundary can belong to two (or more); a short sketch after this list illustrates the difference.
  2. Hierarchical versus non-hierarchical: hierarchical clustering algorithms produce tree-like structures which can be visualised at different levels, from small to large numbers of clusters. Two data points that cluster together at a granular scale (with many clusters) will always remain together at a coarser scale (with few clusters). This structure makes it easier to track how clusters combine as the scale changes and makes visualisation easier, but it may not provide a true representation where no underlying hierarchy exists (a sketch after this list shows one tree cut at several resolutions).
  3. Meaningful: generating clusters requires a measure of the similarity between data points, and this measure should reflect the aim of generating the clusters. For example, in my work investigating patterns of conditions in people with Multiple Long-Term Conditions, I chose similarity measures reflecting the joint co-occurrence of diseases; a toy example of a co-occurrence-based similarity follows this list.
  4. Homogeneous: the clusters should be homogeneous, that is, they should contain data items that are similar to each other (with respect to the meaningful similarity measure defined above) and, by extension, different from data items in other clusters. This property forms the basis of many metrics for evaluating clustering, one of which is sketched after this list.
  5. Simple: as a method for reducing complex data to a smaller number of interpretable clusters, the simplest solution with the fewest clusters should be preferred.
  6. Multi-resolution: while most clustering approaches seek to define a single ‘optimal’ set of clusters, there may be advantages in simultaneously considering a range of sets of clusters, from a small to a large number of clusters, which I call ‘multi-resolution’. Investigating how data points distribute among clusters at different resolutions can itself be informative. For example, in my research clustering diseases, cystic fibrosis joined different clusters at different resolutions, which may reflect its multi-system effects.
  7. Balanced: in general, clusters which are balanced in terms of the number of data items in each one are likely to be more desirable. In early work in my thesis clustering diseases, I identified two large clusters covering 90% of diseases plus a few highly specific small clusters, a solution which was uninformative. Changing my similarity metrics and clustering algorithm produced more balanced, and more informative, clusters.
  8. Exhaustive: ideally, all data points should be assigned to a cluster. However, this is not a strict rule – some soft clustering algorithms can leave ‘orphan’ data points that are not assigned to any cluster, and these may provide insights into items which are unique and don’t fit neatly with others.
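
To make the hard-versus-soft distinction concrete, here is a minimal sketch using scikit-learn on synthetic data (the data and parameters are purely illustrative, not those from my thesis): k-means assigns each point to exactly one cluster, while a Gaussian mixture model returns a probability of belonging to every cluster, so boundary points can be seen to sit between clusters.

```python
# Minimal illustration of hard vs soft assignment
# (synthetic data; not the disease data or methods from the thesis).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)

# Hard clustering: each point receives exactly one label.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point receives a probability for every cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)  # shape (300, 3); each row sums to 1

print(hard_labels[:5])                   # one label per point
print(soft_memberships[:5].round(2))     # boundary points have no dominant probability
```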
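
The hierarchical and multi-resolution properties can be explored together, because a single linkage tree can be cut at several levels and the memberships compared across resolutions. The following sketch uses SciPy on random made-up data and an ordinary Euclidean distance; the real analysis used disease data and different choices of distance and linkage.

```python
# Cutting one hierarchical clustering at several resolutions
# (random illustrative data; the thesis used disease co-occurrence data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 5))                     # 20 items, 5 features
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

# Average-linkage agglomerative clustering on the condensed distance matrix.
tree = linkage(squareform(dist, checks=False), method="average")

# One tree, several resolutions: items that share a cluster at a fine
# resolution always stay together at every coarser resolution.
for k in (2, 4, 8):
    labels = fcluster(tree, t=k, criterion="maxclust")
    print(f"{k} clusters:", labels)
```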
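
On the ‘meaningful’ point, the similarity measure has to encode what you actually care about. As a toy illustration only (assuming a binary patient-by-disease matrix; the measures in my thesis were chosen differently), one option is the Jaccard similarity between diseases, based on how often they co-occur in the same patients.

```python
# Toy co-occurrence similarity between diseases
# (random binary data; not the measure or data used in the thesis).
import numpy as np

rng = np.random.default_rng(0)
patients_by_disease = rng.integers(0, 2, size=(1000, 6))   # 1000 patients, 6 diseases

def jaccard_cooccurrence(matrix: np.ndarray) -> np.ndarray:
    """Jaccard similarity between columns: both present / either present."""
    both = matrix.T @ matrix                     # patients with both diseases
    counts = matrix.sum(axis=0)                  # patients with each disease
    either = counts[:, None] + counts[None, :] - both
    return np.where(either > 0, both / np.maximum(either, 1), 0.0)

similarity = jaccard_cooccurrence(patients_by_disease)
print(similarity.round(2))                       # 6 x 6 symmetric matrix
```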
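
Finally, homogeneity and balance can both be quantified. A rough sketch, with metric choices that are mine for illustration rather than taken from the thesis: the silhouette score measures how similar items are to their own cluster relative to other clusters, and the normalised entropy of cluster sizes measures how evenly items are spread across clusters.

```python
# Illustrative checks for homogeneity (silhouette) and balance (size entropy).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Homogeneity: mean silhouette ranges from -1 (poor) to +1 (compact, well separated).
print(f"silhouette: {silhouette_score(X, labels):.3f}")

# Balance: entropy of cluster sizes, normalised so 1.0 means perfectly even sizes.
sizes = np.bincount(labels)
p = sizes / sizes.sum()
balance = -(p * np.log(p)).sum() / np.log(len(sizes))
print(f"size balance: {balance:.3f}")
```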

While I’m sure there will be examples where not all of these properties are desirable, I found that setting out what a good set of clusters looks like at the start of the research helped me to judge success when comparing the outputs of different clustering algorithms. In the next blog, I will discuss the choices that should be considered during clustering.