Phenotypes identified using cluster analysis


1.2.3 Phenotypes identified using cluster analysis

Cluster analysis can reveal hidden structure among entities, in this case patients, by arranging those with similar attributes into groups and differentiating groups of patients with heterogeneous characteristics.(183,184) Patients are grouped on the basis of characteristics that make them similar to one another (high intra-class similarity) and distinct from patients in other groups (low inter-class similarity).(185) Geometrically, the patients within a cluster lie close together: the distance between patients in different clusters is greater than the distance between patients within the same cluster. In the context of health data, cluster analysis can be used to determine which patient belongs to which group and to identify the ideal number of clusters, thereby revealing a latent structure within a dataset or group of patients.(186)
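The geometric criterion described above can be illustrated with a small sketch. The patient vectors and cluster labels are hypothetical toy values chosen for illustration, not data from any of the cited studies, and Euclidean distance is an assumption. The code compares the mean pairwise distance within clusters with the mean pairwise distance between clusters.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: (age of onset, BMI) for six imaginary patients.
patients = [(5.0, 22.0), (7.0, 23.0), (6.0, 21.0),
            (40.0, 34.0), (42.0, 36.0), (38.0, 35.0)]
labels = [0, 0, 0, 1, 1, 1]  # assumed cluster membership


def mean_pairwise_distances(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])  # Euclidean distance (an assumption)
        (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)


within, between = mean_pairwise_distances(patients, labels)
print(f"within: {within:.2f}, between: {between:.2f}")
```

A grouping that satisfies the criterion gives a within-cluster mean well below the between-cluster mean. In practice, variables measured on different scales would usually be standardised before distances are computed.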

There are several methods of cluster analysis, including k-means, multivariate Gaussian mixture models, hierarchical clustering, spectral clustering and nearest-neighbour methods.(187,188)

One of the most influential studies to identify distinct phenotypic groups in asthma was conducted by Haldar et al., who applied cluster analysis to multiple clinical variables.(101) Among 184 patients managed in primary care, three clusters were found: one group with benign asthma, one with obese non-eosinophilic asthma, and one with early-onset atopic asthma. Further cluster analysis of two other asthma populations, which were managed in secondary care and were mostly refractory (N=255 in total), added an early symptom predominant cluster and an inflammation predominant cluster.

The study by Haldar et al. used the k-means clustering algorithm.(101) This algorithm has been used widely and requires the number of groups (k) and a distance metric as inputs.(189) The first step is to assign each data point to one of the k clusters according to its distance from the cluster centres (centroids). The next step is to calculate new centroids and reassign the data points to the nearest of these new centroids. This process is repeated until the centroid positions no longer change significantly from one step to the next.
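The two alternating steps can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used by Haldar et al.: Euclidean distance is assumed, and the function name `k_means`, its parameters (`init`, `max_iter`, `tol`, `seed`) and the toy data are all hypothetical.

```python
import random
from math import dist


def k_means(points, k, init=None, max_iter=100, tol=1e-9, seed=0):
    """Alternate assignment and centroid updates until centroids stop moving.

    points: list of equal-length numeric tuples.
    init:   optional list of k starting centroids; if omitted, k data points
            are drawn at random (the seed selection the result depends on).
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        # Stop when no centroid has moved by more than tol.
        shift = max(dist(old, new)
                    for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
        (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = k_means(data, k=2, init=[data[0], data[3]])
print(labels)
print(centroids)
```

Passing `init` explicitly makes the starting centroids visible. Leaving it out falls back to a random draw, which is where the sensitivity to seed selection enters.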

One of the main limitations of the k-means algorithm is that the number of clusters must be set a priori, and the final classification can depend strongly on the number of centroids chosen. K-means is also not recommended when the clusters differ greatly in size,(190,191) and it is sensitive to the initial seed selection, which determines the initial cluster centres. The advantages of k-means are its low computational cost (it is easy to implement and can be faster than alternatives such as hierarchical clustering) and its good results in practical applications, such as detecting anomalies within a dataset or segmenting data to group patients likely to benefit from a certain intervention.(192)
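The dependence on k and on the initial seeds can be made explicit by writing down the quantity that k-means reduces. The notation below is introduced here for illustration and is not taken from the cited studies; it assumes squared Euclidean distance, with patients x_i, clusters C_1, ..., C_k and centroids \mu_j.

```latex
J(C_1,\dots,C_k) \;=\; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j \;=\; \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
```

Each assignment step and each centroid update can only lower J or leave it unchanged, so the procedure converges, but only to a local minimum that depends on the starting centroids. The minimum attainable value of J also never increases as k grows, so J alone cannot select the number of clusters, which is why k has to be supplied in advance.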

A specific limitation of applying cluster analysis to health data is that disease and health form a continuous spectrum, so separating the population into discrete clusters may not be realistic. Haldar et al. further note that methods taking a more probabilistic approach to cluster grouping could be valuable.(193) In addition, both the choice of variables and the number of clusters chosen for the population remain subjective.
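One way to read a "more probabilistic approach" is soft assignment, in which each patient receives a probability of membership in every cluster instead of a single label. The sketch below is an illustration only and is not a method used in the cited study. It assumes a univariate Gaussian mixture whose weights, means and standard deviations are already known, whereas in practice they would be estimated from the data. All names and numbers are hypothetical.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def responsibilities(x, components):
    """Posterior probability that x belongs to each mixture component.

    components: list of (weight, mean, sd) tuples; all values are assumed.
    """
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]


# Hypothetical two-component mixture over one standardised variable.
components = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
for x in (-2.0, 0.0, 2.0):
    print(x, [round(p, 3) for p in responsibilities(x, components)])
```

A patient lying midway between the two component means is assigned equal probability of belonging to either, which reflects the continuous spectrum more faithfully than a forced discrete label.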

The authors aimed to choose variables that were measured and could contribute to the clinical evaluation and that were considered important in defining the phenotype, while avoiding variables that would in effect measure the same characteristic twice. The variables were categorised as symptoms, atopy/allergy, eosinophilic inflammation, psychological status or variable airflow obstruction.

Not all variables were recorded and not all etiologic factors could be explored. The number of clusters in Haldar’s study was estimated from the dendrogram plots obtained using Ward’s method. Further limitations reported by the study concern the stability of cluster membership over time and with changes in treatment. There was no significant difference in treatments between the clusters. Differences between clusters may have been due to differences in disease profile and in response to treatment.
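The text names Ward's method and dendrogram plots but gives no implementation details, so the following is only a sketch of the general technique under stated assumptions. It uses Euclidean distance and a naive search over all cluster pairs, and the function name and toy data are hypothetical. At each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares, and it records that increase as the merge height.

```python
from math import dist


def ward_merge_heights(points):
    """Agglomerate points with Ward's criterion; return the cost of each merge.

    Naive sketch for small inputs. The returned costs play the role of the
    merge heights read off a dendrogram (up to the scaling convention used).
    """
    # Each cluster is stored as (size, centroid).
    clusters = [(1, tuple(p)) for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (na, ca), (nb, cb) = clusters[a], clusters[b]
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * dist(ca, cb) ** 2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (na, ca), (nb, cb) = clusters[a], clusters[b]
        merged = (na + nb,
                  tuple((na * x + nb * y) / (na + nb)
                        for x, y in zip(ca, cb)))
        clusters.pop(b)  # remove b first (b > a) so index a stays valid
        clusters.pop(a)
        clusters.append(merged)
        heights.append(cost)
    return heights


# Hypothetical toy data: two well-separated groups of three points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
        (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
heights = ward_merge_heights(data)
print([round(h, 3) for h in heights])
```

A large jump between consecutive merge heights suggests where to cut the tree. In this toy example the final merge is far more costly than any earlier one, which points to two clusters. Reading such a jump off a dendrogram still involves judgement, consistent with the subjectivity noted above.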

Other phenotyping studies using similar clinical characteristics found comparable phenotypes.(103,106,194–196) The identified clusters can be found in the figure below. As this categorisation forms the basis for the study included in Chapter 6 of this thesis, these phenotypes are explained in further detail.

Figure 3: Clinical phenotypes of asthma by Haldar P. Reprinted with permission of the American Thoracic Society. Copyright © 2018 American Thoracic Society. Haldar P et al. Cluster analysis and clinical asthma phenotypes. Am J Respir Crit Care Med. 2008;178(3):218–24. The American Journal of Respiratory and Critical Care Medicine is an official journal of the American Thoracic Society

The early-onset atopic phenotype includes primary care patients with airway obstruction reversibility and eosinophilic inflammation and asthma onset in childhood. Obese non-eosinophilic asthma includes mostly female overweight primary care patients with less eosinophilic inflammation. The benign asthma phenotype is mostly composed of primary care patients with good control of symptoms and inflammation, and a favourable prognosis. The early symptom predominant asthma phenotype includes secondary care patients with less inflammation and reversibility, but strong symptom expression. Inflammation predominant asthma is a secondary care phenotype with clear eosinophilic inflammation, but few symptoms.

Another influential study on asthma phenotypes was undertaken by Moore et al. based on the Severe Asthma Research Program (SARP).(94) The phenotypes were defined chiefly by lung function, based on the maximum FEV1, and by age of onset; five clusters were found, which broadly corresponded to the clusters found in Haldar’s study. These clusters were mild atopic asthma, mild to moderate atopic asthma, late-onset non-atopic asthma, severe atopic asthma, and severe asthma with fixed airflow obstruction. Moore et al. used Ward's minimum-variance hierarchical clustering method as an unsupervised modelling approach to identify asthma phenotypes within the SARP cohort.(103)


In document Asthma in electronic health records: validity and phenotyping (Page 42-47)