Clustering diseases

Overview of approach for generating disease clusters

Disease clusters were created using a two-step process:

Vector representations of the diseases ("disease embeddings") were created using a) co-occurrence and b) sequence-based methods. From these representations, cosine similarity is calculated, giving a 212 by 212 matrix.
Unsupervised clustering of the similarity matrix to generate clusters of diseases.

Here, we aim to generate what we call "multi-resolution" clusters, that is, a set of clusters from more granular (with many clusters) to coarse (with few clusters). This allows us to visualise how clusters aggregate or split across the resolutions. Further details on our approach to generating disease clusters can be found in our paper in Communications Medicine.

Analysis pipeline

Pipeline for generating disease clusters Used with permission from Beaney et al (2024)

Disease embedding methods

As a co-occurrence-based method, we selected Multiple Correspondence Analysis (MCA), as this requires few distributional assumptions about the data. We used this to create a vector representation of each disease, trialling retention of different numbers of dimensions from two to 30.

View the interactive plot of the first two dimensions from MCA.

In the figure, more similar diseases appear closer together on each dimension. Also overlaid on the plot are patient age category and the number of long-term conditions (LTCs). As can be seen, age and number of LTCs are strongly represented on the first dimension. Conditions common in younger people tend to separate from those more common in older people along dimension 1. Chronic liver conditions separate strongly on dimension 2.

Of co-occurrence methods, we found MCA retaining 30 dimensions to perform best according to our evaluation criteria explained below.

For sequence-based methods, we compared three popular word embedding algorithms (GloVe, and the two word2vec algorithms of skipgram and continuous-bag-of-words). Here, we also tested different inputs of disease codes, using a code either once (at diagnosis, and not repeated) or allowing multiple codes to appear in the sequence. We found skip-gram using multiple codes (SGM) to perform best. Further information on the evaluation criteria is given below. CSV files of the MCA and SGM embeddings are available to download here:

Download: MCA embeddings

Download: SGM embeddings

Clustering methods

We adopted a graph-based clustering algorithm, Markov Multiscale Community Detection (MMCD), with a Python implementation available on GitHub. This method allows the identification of optimal clusters simultaneously at different resolutions, and we compare the resulting clusters at a fine, medium and coarse resolution. Our analysis pipeline is available in a Jupyter notebook below, which runs in Python version 3.8.10 and uses the MCA or SG-M embeddings.

Download jupyter notebook: MMCD pipeline

Disease cluster assignments

Finally, we created word cloud representations of the disease assignments to clusters, separately for the best co-occurrence method (MCA) and best sequence-based method (SGM). Here, the size of the text reflects the relative prevalence of the disease within the cluster.

MCA-derived clusters

MCA clusters Used with permission from Beaney et al (2024)

SGM-derived clusters

SGM clusters Used with permission from Beaney et al (2024)

CSV files of the assignment of diseases to each cluster can be downloaded below:

Download: MCA clusters

Download: SGM clusters

Evaluating disease embeddings and clusters

To evaluate the interpretability of disease clusters, we developed a set of 253 established disease association pairs, relevant to our 212 LTCs. The aim here is to provide a list of pairs of disease for which there is strong clinical evidence of an association, but is not an attempt to provide a comprehensive list of all possible associations. To create this list, we considered all 212 diseases, and used sources including the BMJ Best Practice guidelines, to create a set of known disease associations for each disease, incorporating both known risk factors and known complications for a given disease. The set of disease pairs were reviewed by three clinicians and any disagreements were discussed. A total of 144 conditions were included with at least one association to another disease (68% of the total).

Link to download the disease pairs: DISEASE_ASSOCIATION_PAIRS.csv

Note that each pair appears twice, and you may wish to de-duplicate each pair of associations depending on the purpose of your research.