Part II Statistical Methodology
5.2 Methods to identify subgroups with high or low outcome
Regression modelling
In many fields of research, a "subgroup analysis" refers to the identification of subgroups with high or low outcome, i.e. prognostic subgroups. Such analyses are performed in different kinds of study designs, such as RCTs, prospective cohort studies and retrospective studies. The identification of prognostic factors provides clinicians with valuable information to aid decision making and to help them better predict outcome. Depending on the type of outcome of interest in the field of research, an appropriate multivariable regression modelling approach is taken to identify and evaluate prognostic factors. For example, linear regression modelling is used when the outcome is continuous and normally distributed. There is no ideal approach to identifying and modelling prognostic factors. However, two standard strategies are either to fit a full model that includes all potential prognostic factors or to use backward elimination (111). Inferences can then be made from the parameter estimates of the final fitted model as to whether any of the included factors are predictive of outcome. For example, imagine the final model was of the form
$$E(y) = 1.3 + 3.4 \times \mathrm{Gender}(\mathrm{Female})$$
where $y$ is a continuous and normally distributed outcome, which here might, for example, be a pain score on a scale from 0 to 100, where a higher score indicates worse pain. Also, suppose gender was found to be statistically significant in the model output, suggesting that it is a prognostic factor. The regression coefficient for gender would then suggest that females, on average, score 3.4 points higher on the pain scale than males, i.e. the gender of an individual helps predict whether the response will be high or low.
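To make the interpretation of the coefficient concrete, the sketch below simulates pain scores from the example model and refits it by ordinary least squares. Only the coefficients 1.3 and 3.4 come from the example; the sample size, the noise standard deviation of 1 and the function name `fit_binary_predictor` are illustrative assumptions.

```python
import random


def fit_binary_predictor(y, female):
    """Least-squares fit of E(y) = b0 + b1 * female, where female is coded 0/1."""
    n = len(y)
    x_mean = sum(female) / n
    y_mean = sum(y) / n
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in zip(female, y))
    sxx = sum((x - x_mean) ** 2 for x in female)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Simulated data (assumption): the example model plus Gaussian noise with SD 1.
random.seed(1)
female = [random.randint(0, 1) for _ in range(2000)]
pain = [1.3 + 3.4 * f + random.gauss(0, 1) for f in female]

intercept, slope = fit_binary_predictor(pain, female)

# With a single 0/1 predictor, the slope equals the difference in group means.
n_female = sum(female)
mean_f = sum(p for p, f in zip(pain, female) if f == 1) / n_female
mean_m = sum(p for p, f in zip(pain, female) if f == 0) / (len(female) - n_female)
print(f"intercept={intercept:.2f}, slope={slope:.2f}, "
      f"mean difference={mean_f - mean_m:.2f}")
```

The fitted slope estimates how much higher the expected pain score is for females than for males, which is exactly the difference between the two group means in this one-predictor setting.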
Non-parametric methods
When the predictor is continuous, one may encounter situations where there is a non-linear association between the potential prognostic factor and the dependent variable, i.e. the linearity assumption does not hold. In such a situation, one might consider using generalized additive models (GAMs) to model the non-linearity (112). Typically, regression models such as linear and logistic regression work by modelling the $X_j$ ($j = 1, \ldots, p$) potential variables (including interaction variables) as a linear predictor of the form $E(Y) = \beta_0 + \sum_j \beta_j X_j$. GAMs work by replacing the linear predictor by an additive predictor of the form $E(Y) = \beta_0 + \sum_j f_j(X_j)$, with a monotonic link function to link $\beta_0 + \sum_j f_j(X_j)$ to the expectation of $Y$. Here, $f_j$ ($j = 1, \ldots, p$) are non-parametric smooth functions that are estimated from the data. This approach is relevant when a continuous covariate is of interest, e.g. age. A general linear model can thus be considered a special case of a GAM in which each $f_j$ is a linear function and the link function is simply the identity function. The flexible nature of GAMs allows the assumption of linearity to be relaxed; however, extra caution must be taken not to over-fit the data (112).
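A minimal sketch of how an additive predictor can be estimated, assuming an identity link and using backfitting with a Gaussian kernel smoother. This is one simple estimation scheme chosen for illustration and is not necessarily the approach described in (112); the simulated covariates, the bandwidth and the function names `kernel_smooth` and `fit_additive` are hypothetical.

```python
import math
import random


def kernel_smooth(x, r, bandwidth):
    """Gaussian-weighted local average of r, evaluated at each observed x."""
    fitted = []
    for xi in x:
        weights = [math.exp(-0.5 * ((xi - xk) / bandwidth) ** 2) for xk in x]
        fitted.append(sum(w * rk for w, rk in zip(weights, r)) / sum(weights))
    return fitted


def fit_additive(columns, y, bandwidth=0.3, n_iter=10):
    """Backfitting for E(Y) = beta0 + sum_j f_j(X_j), identity link assumed."""
    n = len(y)
    beta0 = sum(y) / n
    f = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        for j, xj in enumerate(columns):
            # Partial residuals: remove the intercept and all other smooth terms.
            partial = [
                y[i] - beta0 - sum(f[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            smooth = kernel_smooth(xj, partial, bandwidth)
            centre = sum(smooth) / n
            f[j] = [s - centre for s in smooth]  # centre f_j for identifiability
    return beta0, f


# Simulated data (assumption): two covariates with non-linear effects.
random.seed(0)
n = 150
x1 = [random.uniform(-3, 3) for _ in range(n)]
x2 = [random.uniform(-2, 2) for _ in range(n)]
y = [2 + math.sin(a) + b ** 2 + random.gauss(0, 0.2) for a, b in zip(x1, x2)]

beta0, f = fit_additive([x1, x2], y)
fitted = [beta0 + f[0][i] + f[1][i] for i in range(n)]

rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # additive fit
tss = sum((yi - beta0) ** 2 for yi in y)                # intercept-only fit
print(f"residual / total sum of squares: {rss / tss:.3f}")
```

A smaller bandwidth lets each $f_j$ follow the data more closely, which is where the over-fitting risk mentioned above arises; in practice the amount of smoothing would be chosen by cross-validation or a similar criterion.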
Cluster analysis
An alternative approach is cluster analysis, an approach used in the fields of data mining and machine learning for identifying subgroups or clusters of patients that are most similar in terms of outcome only, i.e. ignoring the predictors (113-115). Cluster analysis follows a distance-based algorithm, whereby each individual is assigned to exactly one subgroup based on the Euclidean distance from an initial starting value or centroid. The method works by taking a large heterogeneous population and breaking it down into subgroups that maximize the between-subgroup heterogeneity whilst minimizing the within-subgroup heterogeneity. The number of subgroups to be formed has to be specified before running the analysis; however, choosing this number is a difficult task. Moreover, a starting point or centroid for each of the pre-determined subgroups also has to be specified.
The following example illustrates how the cluster analysis algorithm works. Consider a sample of patients who have reported a pain score (0-100; a higher score is worse), and assume we pre-specify that we are interested in forming two subgroups. We therefore have to specify a starting point for each subgroup. Based on the pain score, let's give one of the subgroups a low starting point, say 10, and the other subgroup a high starting point, say 80. The method works by computing the distance of the outcome of each patient from the two subgroup starting points and then assigning each patient to the subgroup to which they are closest. A natural measure of distance for each patient in this example would be the squared error. Thus, in the first iteration, all patients will be assigned to one of the two subgroups. The mean pain score in each of the subgroups is then computed and used as the starting point for the next iteration. The iterations continue until the starting points no longer change.
Once the subgroups have been formed, the outcome measure can be summarized within each subgroup, and each subgroup labelled to best describe it, e.g. good response and poor response. The final clusters can then be compared in terms of the characteristics of the patients within each cluster or subgroup. Any differences found in the characteristics of the two groups can aid in predicting the outcome of new patients. For example, the cluster classified as having a good response may be on average younger than the cluster classified as having a poor response. Age would then be a predictor of outcome for a new patient. A common and well-documented problem associated with cluster analysis is that different solutions are often produced when different starting points or centroids are used, and there is no justification as to which solution is the correct or final one.
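The iterative procedure in the example corresponds to what is commonly called k-means clustering, applied here to a single outcome. The sketch below is a minimal illustration of it; the pain scores and the function name `cluster_1d` are made up for the example. The second data set shows how two different pairs of starting points can converge to different final subgroups.

```python
def cluster_1d(scores, starts, max_iter=100):
    """Assign each score to its nearest centre, then move centres to subgroup means."""
    centres = list(starts)
    groups = [[] for _ in centres]
    for _ in range(max_iter):
        groups = [[] for _ in centres]
        for s in scores:
            # Squared error to each centre; ties go to the lower-index centre.
            nearest = min(range(len(centres)), key=lambda k: (s - centres[k]) ** 2)
            groups[nearest].append(s)
        # An empty subgroup keeps its previous centre.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:  # centres no longer change: stop iterating
            break
        centres = updated
    return centres, groups


# Hypothetical pain scores (0-100), starting points 10 and 80 as in the example.
pain = [5, 12, 18, 22, 30, 65, 70, 78, 85, 90]
centres, groups = cluster_1d(pain, [10, 80])
print(centres)  # [17.4, 77.6]
print(groups)   # [[5, 12, 18, 22, 30], [65, 70, 78, 85, 90]]

# Sensitivity to starting points: same data, two different final solutions.
three_clumps = [0, 1, 2, 49, 50, 51, 98, 99, 100]
centres_a, _ = cluster_1d(three_clumps, [10, 80])
centres_b, _ = cluster_1d(three_clumps, [40, 95])
print(centres_a)  # [1.0, 74.5]
print(centres_b)  # [25.5, 99.0]
```

In the second data set the middle clump of scores is absorbed by whichever starting point happens to lie nearer, and both final solutions have the same within-subgroup sum of squares, so the data alone give no reason to prefer one over the other.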
Data mining methods
In the field of data mining, a number of data-driven approaches exist that are attractive alternatives for performing subgroup analyses with the aim of identifying main effects, i.e. subgroups that are predictors of outcome. Many of these are sophisticated methods, such as support vector machines (SVMs), neural networks, Bayesian networks and k-nearest neighbour classifiers, which specialize in algorithmically discovering patterns and relationships between covariates and outcome within large datasets. An initial concern with many of these complex methods is that numerous algorithms are available for each of them. The majority of the data mining methods identified from the literature search do not look for treatment effect heterogeneity or interaction effects. Instead, they simply look for subgroups or subsets of the entire dataset with heterogeneous outcome. However, some data mining methods do aim to identify interactions. These methods are described later in section 5.3.2.
As mentioned earlier, the approaches in this section of the chapter aim to identify subgroups that differ in terms of a final outcome, i.e. prognostic subgroups. Though these methods are a form of subgroup analysis, they do not identify subgroups that have high or low treatment effects, which is the focus of this thesis. The next section will therefore describe several methods that can be used to identify and evaluate subgroups with high or low treatment effects, i.e. differential subgroup effects.