This research is applied to data from the Clinical Practice Research Datalink Aurum dataset, a large and nationally representative sample of electronic health records (EHRs) from General Practices (GPs) in England. Data are also linked to death registration data from the Office for National Statistics and to secondary care data from Hospital Episode Statistics provided by NHS England.
Most research on MLTC clusters has focused on relatively few diseases (typically fewer than 20), and has found relatively broad clusters, such as clusters of cardiometabolic conditions and clusters of mental health conditions (see Busija et al, 2019). An aim of my research is to use a larger number of diseases, which might help to uncover relationships between less common diseases.
When using EHR data, diseases can be identified by diagnostic codes (Medcodes) entered by clinicians during clinical encounters. However, there are tens of thousands of distinct codes, and so these are aggregated into disease code lists to reduce redundancy and aid interpretation. Generating such code lists is a very laborious task! Fortunately, many existing code lists are publicly available. Here, I use a set of diseases defined by other researchers. The original set of diseases and code lists was generated by Kuan and colleagues for the CALIBER study, and is available from the HDR UK Phenotype Library. Of the original 308 conditions, Head and colleagues selected 211 conditions relevant to multimorbidity for a study of multimorbidity incidence and prevalence, with the code lists available on GitHub. These code lists were developed specifically for use with CPRD Aurum.
I reviewed the codes in the original lists from Head and colleagues and made some edits to the codes. I also created a new category of 'Chronic Primary Pain', as this is a common condition in primary care and one that is frequently included in studies of MLTC, but it was not included in the original CALIBER code lists. Thus, a total of 212 LTCs were included. Where a condition was already included within another category, it was removed from that category. For example, Fibromyalgia was originally included in 'Chronic Fatigue Syndrome', and was removed from this category. There were also changes to the codes included for diabetes, with removal of codes indicating a specific Type 1 or Type 2 diagnosis from the 'Other/unspecified' diabetes category. The full list of codes is available via the link below. The disease, disease number, system and system number are retained as they were recorded in the code lists from Head et al. The 'medcodeid' variable represents the unique code identifier available in CPRD. The 'istest' category represents codes which have an assigned value. In these cases, whether a condition is included depends on the threshold value being met.
A CSV file with the mapping of Medcodes to diseases can be downloaded via the link below:
If importing the CSV into software such as Microsoft Excel, be sure to import as text to avoid rounding of the Medcode IDs!
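The same caution applies when reading the file in code. The sketch below is a minimal illustration in Python with pandas, assuming a column named 'medcodeid' as described above; the two identifiers and disease names are made up, and an in-memory string stands in for the downloaded file.

```python
import io

import pandas as pd

# Hypothetical two-row extract standing in for the downloaded CSV.
# The identifiers below are invented for illustration only.
sample_csv = io.StringIO(
    "medcodeid,disease\n"
    "100000000000000017,Example disease A\n"
    "100000000000000033,Example disease B\n"
)

# Reading the identifier column as text preserves every digit.
code_list = pd.read_csv(sample_csv, dtype={"medcodeid": str})

# For comparison: passing a long identifier through floating point
# (which is what a spreadsheet's numeric cell does) changes its value.
original = code_list["medcodeid"].iloc[0]
rounded = int(float(original))
print(original, "->", rounded)
```

With a real download, the `io.StringIO` object would be replaced by the path to the CSV file.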
When using a large number of diseases, it can be challenging to distinguish diseases that are acute and short-lived from those that are chronic or long-term. What counts as 'chronic' varies across the literature, ranging from 3 months in duration to life-long. For some diseases, such as atrial fibrillation or stroke, a diagnosis indicates life-long risk. However, other diseases, such as gastritis (inflammation of the stomach), may last a few days or many years. In some cases, it is possible to tell if a disease is active from a person's medication history. For example, asthma could be judged to be 'active' if a person has had a prescription of an inhaler in the last 12 months. However, when looking at many conditions simultaneously, it may not be possible to determine which condition a medication is prescribed for. In the case of gastritis, a proton-pump inhibitor is often prescribed, but there are several other reasons someone could be taking one.
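To make the asthma example concrete, here is a minimal sketch of a 12-month look-back on prescriptions. It is an illustration only and not a rule used in this research; the table, the column names ('patid', 'issuedate') and the reference date are all assumptions.

```python
import pandas as pd

# Hypothetical inhaler prescriptions: one row per prescription.
prescriptions = pd.DataFrame({
    "patid": [1, 1, 2],
    "issuedate": pd.to_datetime(["2019-03-01", "2021-11-15", "2018-06-30"]),
})

# Assumed reference date at which disease activity is assessed.
index_date = pd.Timestamp("2022-01-01")
window_start = index_date - pd.DateOffset(months=12)

# Keep prescriptions issued within the 12 months up to the index date.
recent = prescriptions[
    (prescriptions["issuedate"] > window_start)
    & (prescriptions["issuedate"] <= index_date)
]

# Patients with at least one recent prescription count as 'active'.
active_patients = set(recent["patid"].tolist())
print(active_patients)
```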
In existing literature, different approaches have been used to judge whether a condition is chronic, based on the number of codes appearing in the EHR over time. In a study published in BMJ Medicine, we compared different timeframes for defining chronic conditions and found that the choice has a significant impact on the prevalence of MLTC.
Given the exploratory nature of the work to generate clusters, we use the most inclusive definition of 'chronic': a disease is included if it is coded at least once in the EHR. This will increase the chance of including diseases which are no longer active, and so will lead to a higher estimate of the prevalence of MLTC than many other studies, but it ensures that we are not excluding patients from the analysis and enhances the power to detect less common disease associations. Furthermore, given our interest in disease sequences, even if a disease only appears once in the record (suggesting it is short-lived), it could still be relevant to future disease development.
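The 'coded at least once' rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in this research: the two small tables, the disease labels and the column names 'patid' and 'obsdate' are hypothetical, and only 'medcodeid' and 'disease' come from the code list described above.

```python
import pandas as pd

# Hypothetical code list mapping Medcodes to diseases.
code_list = pd.DataFrame({
    "medcodeid": ["101", "102", "201"],
    "disease": ["Disease A", "Disease A", "Disease B"],
})

# Hypothetical coded observations from the EHR.
observations = pd.DataFrame({
    "patid": [1, 1, 1, 2],
    "medcodeid": ["101", "102", "201", "999"],
    "obsdate": pd.to_datetime(
        ["2015-05-01", "2012-02-10", "2018-09-03", "2016-01-01"]
    ),
})

# Keep only observations whose code appears in a disease code list.
coded = observations.merge(code_list, on="medcodeid", how="inner")

# A disease counts if it is coded at least once; the earliest date
# per patient and disease gives the order needed for sequences.
first_diagnosis = (
    coded.groupby(["patid", "disease"], as_index=False)["obsdate"]
    .min()
    .sort_values(["patid", "obsdate"])
    .reset_index(drop=True)
)
print(first_diagnosis)
```

Taking the earliest date per disease is one possible way to order diseases for sequence-based analysis; other choices, such as keeping every coded occurrence, would fit the same definition.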
An aim of this research is to evaluate methods which incorporate the sequence in which a person's diseases develop over time, rather than only the co-occurrence of diseases. Diagnostic codes are usually entered by clinicians during a consultation, or following receipt of communication from secondary care. Although previous research has shown good agreement between the prevalence of conditions recorded in CPRD and that in population sources for many diseases, there may be many reasons why a code is not entered for a given consultation. At the outset of this work, it was unclear whether the sequence of diseases might be impacted by factors specific to a person, or external to them (such as the GP practice, or financial incentives).
In a study published in BMJ Open, we showed that the frequency of diagnostic codes recorded in GP EHR data is significantly impacted both by patient factors (age, gender, ethnicity and socioeconomic deprivation) and by factors external to the patient (including GP practice, coding incentives and the COVID-19 pandemic). Therefore, code frequency should not be assumed to be an objective marker of a person's health. As a result, we included some of these factors as variables in the sequence-based algorithm that we developed (EHR-BERT).
Ethnicity was defined using code lists developed by Davidson et al (2021). Their code lists are available via the LSHTM Data Compass. I made small edits to these codes, to remove four codes suggesting an examination finding (including the term "o/e") rather than self-reported ethnicity. A TXT file of my edited list of ethnicity codes can be downloaded via the link below. I used the most recently recorded ethnicity code in CPRD to define ethnicity. If missing in CPRD, then the documented ethnicity in HES was used instead.
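The rule of taking the most recent CPRD record and falling back to HES can be sketched as below. This is a minimal illustration and not the code used in this research; the tables, the column names and the placeholder labels ('Group X' and so on) are all hypothetical.

```python
import pandas as pd

# Hypothetical ethnicity records from CPRD (several per patient possible).
cprd_ethnicity = pd.DataFrame({
    "patid": [1, 1, 2],
    "obsdate": pd.to_datetime(["2010-01-01", "2019-06-01", "2015-03-01"]),
    "ethnicity": ["Group X", "Group Y", "Group Z"],
})

# Hypothetical ethnicity from HES (one row per patient).
hes_ethnicity = pd.DataFrame({
    "patid": [2, 3],
    "ethnicity": ["Group X", "Group Y"],
})

patients = pd.DataFrame({"patid": [1, 2, 3, 4]})

# Most recently recorded CPRD ethnicity per patient.
latest_cprd = (
    cprd_ethnicity.sort_values("obsdate")
    .groupby("patid")["ethnicity"]
    .last()
)
hes = hes_ethnicity.set_index("patid")["ethnicity"]

# Use CPRD where available; otherwise fall back to HES.
from_cprd = patients["patid"].map(latest_cprd)
from_hes = patients["patid"].map(hes)
patients["ethnicity"] = from_cprd.fillna(from_hes)
print(patients)
```

Patients with no record in either source keep a missing value, which can then be handled as a separate 'unknown' category if needed.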
Code lists have been created to categorise staff types in CPRD Aurum. These have been categorised into GP, nurse, clinical_other (e.g. healthcare assistants), admin or other.
A CSV file containing the 'jobcatid' assigned to each staff member can be downloaded via the link below:
Code lists were created to define the consultation type in CPRD Aurum. These have been categorised into face-to-face, remote, unknown and nurse appointments. I used codes developed by Foley et al (2021), with small modifications, including a new category of nurse appointments.
A CSV file of the consultation type codes can be downloaded via the link below: