Clustering Multiple Long-Term Conditions

Clustering people

Overview of approach for generating clusters of people

As with clustering diseases, clustering people used a two-step process:

  1. Vector representations of people were created, based on each person's sequence of diseases.
  2. The vector representations were clustered using unsupervised methods to generate clusters of people (a minimal sketch of this two-step process follows the list).
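
A minimal sketch of this two-step process, assuming each person's record is available as a list of disease codes. The `embedding_model`, its `embed` method, and the choice of k-means with ten clusters are illustrative placeholders rather than the methods used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans

# Step 1: create a fixed-length vector for each person from their sequence of
# disease codes. `embedding_model` is a placeholder for any of the methods
# compared below (topic models, transformer encoders, ...).
def embed_people(disease_sequences, embedding_model):
    return np.vstack([embedding_model.embed(seq) for seq in disease_sequences])

# Step 2: unsupervised clustering of the person-level vectors.
def cluster_people(person_vectors, n_clusters=10):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(person_vectors)
```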

Comparing representation methods

We compared different methods of generating vector representations of people (based on their sequences of diseases), inspired by methods from NLP, including topic modelling and transformer architectures. Each method generates unsupervised vector representations for 6.3 million people with MLTC. We also compared different inputs, using either the raw Medcodes (n=9,462) or the disease categories (n=212) as the sequence inputs.
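
As one concrete illustration from the topic-modelling family (a sketch only, not the exact configuration used in the study), each person's disease sequence can be treated as a "document" of codes and reduced to a topic-distribution vector; switching between Medcodes and the 212 disease categories only changes the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy example: each inner list is one person's sequence of disease codes
# (raw Medcodes or one of the 212 disease categories, depending on the input).
disease_sequences = [
    ["hypertension", "type2_diabetes", "ckd"],
    ["depression", "anxiety"],
    ["hypertension", "depression", "type2_diabetes"],
]

documents = [" ".join(seq) for seq in disease_sequences]
counts = CountVectorizer(token_pattern=r"\S+").fit_transform(documents)

# Two topics keeps the toy example small; a real model would use many more.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
person_vectors = lda.fit_transform(counts)  # one topic-distribution vector per person
```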

Given the unsupervised generation of these embeddings, we required a strategy to select which embedding to use for clustering. We rationalised that a good embedding will be one that captures a high degree of the inherent information in the disease sequences. Therefore, we developed an evaluation strategy to select the embedding that could best predict future patient outcomes (using data that was not seen in generating the embeddings). We used the patient vectors as inputs into a simple logistic classifier to predict the following outcomes in the 12 months after the embeddings were generated (a minimal sketch of this evaluation step follows the list below):

  1. Mortality in those aged ≥ 65 years
  2. Any emergency department attendance
  3. Any emergency hospital admission
  4. Any attendance with a diagnosis code for:
    1. Hypertension
    2. Diabetes
    3. Depression
  5. A new diagnosis of:
    1. Hypertension
    2. Diabetes
    3. Depression
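
A minimal sketch of the evaluation step, assuming each candidate embedding is a matrix of patient vectors and each outcome is a binary label observed over the following 12 months. The simple train/test split and metric choices below are illustrative of the approach rather than the exact protocol used.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

def evaluate_embedding(person_vectors, outcome_labels):
    """Fit a simple logistic classifier for one binary outcome and return
    AUC and average precision score (APS) on held-out patients."""
    X_train, X_test, y_train, y_test = train_test_split(
        person_vectors, outcome_labels,
        test_size=0.2, random_state=0, stratify=outcome_labels,
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores), average_precision_score(y_test, scores)

# One AUC/APS pair is computed per outcome (mortality, ED attendance, ...);
# the embedding that predicts best across outcomes is carried forward.
```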

Analysis pipeline

Pipeline for generating representations of people. Used with permission from Beaney et al (2024)

Of the various methods we tested, transformer-based methods performed best across outcomes. The figure below shows the relative improvement in both AUC (area under the receiver operating characteristic curve) and APS (average precision score), averaged across all outcomes, for each of the methods that we tested. EHR-BERT was the best-performing method across all outcomes. Use of the smaller number of 212 clinically categorised diseases performed slightly better than use of the larger vocabulary of Medcodes.
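
For clarity on what the figure reports, relative improvement for a method can be computed as the percentage change in AUC or APS over a reference method, averaged across the outcomes. This is a hedged illustration of that calculation, not the exact code behind the figure.

```python
import numpy as np

def relative_improvement(method_scores, baseline_scores):
    """Mean percentage improvement over a baseline, averaged across outcomes.

    Both arguments are arrays of per-outcome scores (e.g. AUC or APS),
    aligned so that index i refers to the same outcome in each array.
    """
    method_scores = np.asarray(method_scores, dtype=float)
    baseline_scores = np.asarray(baseline_scores, dtype=float)
    return float(np.mean((method_scores - baseline_scores) / baseline_scores) * 100)
```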

Further details of the comparisons and their performance for predicting outcomes are given in our publication in JAMIA.

Embedding performance for prediction

Patient embedding performance

EHR-BERT

We developed a transformer architecture inspired by BERT, extending the BEHRT model. This model incorporates information on patient age, gender, ethnicity and socioeconomic deprivation (measured by the Index of Multiple Deprivation), as well as the visit number and calendar year associated with each visit.
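
A minimal PyTorch sketch of how such inputs might be combined at the embedding layer of a BEHRT-style model, by summing per-token embeddings for each input before the transformer encoder. The layer names, vocabulary sizes and embedding dimension are illustrative assumptions, not the published EHR-BERT specification.

```python
import torch.nn as nn

class PatientInputEmbedding(nn.Module):
    """Sum disease-code embeddings with age, gender, ethnicity, deprivation (IMD),
    visit-number and calendar-year embeddings for each position in the sequence.
    Vocabulary sizes below are placeholders."""

    def __init__(self, n_codes, dim=128):
        super().__init__()
        self.code = nn.Embedding(n_codes, dim)
        self.age = nn.Embedding(120, dim)    # age in years at each visit
        self.gender = nn.Embedding(3, dim)
        self.ethnicity = nn.Embedding(20, dim)
        self.imd = nn.Embedding(11, dim)     # deprivation decile (+ unknown)
        self.visit = nn.Embedding(512, dim)  # visit number within the record
        self.year = nn.Embedding(50, dim)    # calendar year index
        self.norm = nn.LayerNorm(dim)

    def forward(self, codes, ages, genders, ethnicities, imds, visits, years):
        x = (self.code(codes) + self.age(ages) + self.gender(genders)
             + self.ethnicity(ethnicities) + self.imd(imds)
             + self.visit(visits) + self.year(years))
        return self.norm(x)  # fed into a BERT-style transformer encoder
```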

EHR-BERT architecture

EHR-BERT architecture. Used with permission from Beaney et al (2024)

Further information on the model is given in our publication. Code relating to EHR-BERT is available to download below. The folder includes a Python notebook, along with a dummy dataset (in .pkl format), which runs in Python version 3.10.9.

Download EHR-BERT files: EHR-BERT folder
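
For orientation, a hedged example of opening a pickled dummy dataset of the kind described above; the file name used here is hypothetical and should be replaced with the .pkl file shipped in the downloaded folder.

```python
import pickle

# Hypothetical file name; substitute the .pkl file from the EHR-BERT folder.
with open("dummy_data.pkl", "rb") as f:
    dummy_data = pickle.load(f)

print(type(dummy_data))  # inspect the structure before running the notebook cells
```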

Work in progress

We are working on generating clusters of people based on the embeddings of the best-performing EHR-BERT model. Further information will be added here with a link to the pre-print when available.