% TIA with increasing score using primary care records
15.5 Methodologies for diagnostic model derivation – rationale for classification trees
Multivariable logistic regression models use only those variables that still significantly predict the presence of TIA after adjusting for other predictors found on univariate analysis. Thus the model assesses the strength of a predictor allowing for the presence of other variables that also have some predictive value.
However, it may well be the case that a variable has strong predictive power only in a group of patients defined by certain characteristics. In this case that predictor may not be included in a logistic regression model as its statistical effect will be diminished by the presence of other predictors, unless a specific interaction term is inserted into the logistic regression model to explicitly account for the fact that we are interested in the predictive power of this variable given the presence of other features.
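As an illustration of what such an interaction term looks like, the following is a sketch for two hypothetical binary predictors, $x_1$ and $x_2$ (these symbols and coefficients are assumptions introduced here for illustration and are not taken from the fitted models in this thesis):

```latex
\operatorname{logit} P(\mathrm{TIA}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12}\, x_1 x_2
```

Under this form, the effect of $x_1$ on the log odds is $\beta_1$ when $x_2 = 0$ and $\beta_1 + \beta_{12}$ when $x_2 = 1$. Without the $\beta_{12}$ term, a predictor that is informative only when $x_2$ is present would be represented by a single averaged coefficient.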
With a large number of potential predictors, specifying all the interactions a priori will generate a large and complex logistic regression model, as well as a decision rule that is even more complex, as a given predictor may have a different weighted multiplier in a score depending on the co-occurrence of other predictors. Furthermore, even without the complexity of specifying interaction terms, the diagnostic scores derived from logistic regression models are not straightforward to use, particularly if weighted models are used.
Rather than calculating a score, which is time-consuming and error-prone, an alternative strategy is to derive an algorithm in which the responses to certain key questions determine the decision to refer for a suspected TIA.
A classification tree model consists of a number of questions about the predictor variables which are asked in a fixed order. This is said to mimic the clinical decision-making process, with the most important predictive variables asked about first (276). A tree is constructed by producing partitions or splits in a derivation dataset depending on the value of a predictor: either a cut point for continuous predictors such as age or blood pressure, or the presence or absence of a binary categorical predictor such as weakness or confusion.
The splits in the data produce smaller subgroups, and are chosen such that there is maximum difference between the subgroups. For example, an age cut-off is chosen which produces two subgroups with the greatest difference in % TIA between them. The predictor which results in the largest separation in % outcome between the two subgroups is placed at the top of the tree as the most important predictor.
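The choice of a cut point for a continuous predictor can be sketched as follows. This is a minimal illustration of the criterion as described in the text (largest difference in outcome proportion between the two subgroups), written in Python purely for illustration; it is not the code used in this thesis. The function name `best_cut_point` and the toy data are hypothetical. Note also that rpart's default splitting criterion for classification is the Gini index, which additionally accounts for subgroup size, so this sketch should not be read as rpart's internal procedure.

```python
def best_cut_point(values, outcomes):
    """Return (cut, difference) for the cut point that gives the largest
    absolute difference in outcome proportion between the two subgroups.

    values   -- continuous predictor (e.g. age), one entry per patient
    outcomes -- 1 if the outcome (e.g. TIA) is present, otherwise 0
    """
    best_cut, best_diff = None, -1.0
    distinct = sorted(set(values))
    # Candidate cut points lie midway between neighbouring distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        below = [y for x, y in zip(values, outcomes) if x < cut]
        above = [y for x, y in zip(values, outcomes) if x >= cut]
        diff = abs(sum(above) / len(above) - sum(below) / len(below))
        if diff > best_diff:
            best_cut, best_diff = cut, diff
    return best_cut, best_diff


# Made-up toy data for illustration only (not from the GP dataset).
ages = [45, 52, 58, 63, 67, 71, 76, 82]
tia = [0, 0, 1, 0, 0, 1, 1, 1]
cut, diff = best_cut_point(ages, tia)
print(cut, round(diff, 2))
```

For a binary predictor there is only one possible split (present versus absent), so the same difference in proportions can be computed directly and compared across predictors.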
As the tree is ‘grown’ with increasing numbers of predictors used, splitting of the remaining data into further subgroups continues until there are too few cases to produce subgroups (this can be specified in the modelling) or until there is no more improvement in outcome prevalence in subgroups compared with the parent group used to produce the split.
Two advantages of tree models over logistic regression are simplicity of presentation (and usage) and the incorporation of multiple interaction effects. Given the number of branches in a tree, these interactions can be of high order (i.e. the effect of variable a, given the values of variables b, c and so on along the branches of a tree). In logistic regression, such interactions would need to be explicitly included and are rarely higher than second order, i.e. a term for the interaction of two predictors. The process of tree construction assumes that all predictors interact with each other.
However, although interactions are presumed in classification trees, the higher order interaction is only modelled in a specific branch of a tree and not modelled along other branches. This is because predictors are used to split groups where they produce the greatest subgroup differences, and this may be several nodes away from an initial large split in the dataset. As such, there may be patients channelled along distant parallel branches of a tree that could contribute to an interaction term but are not included in the generation of a subgroup split using that interaction, as they were split off at an earlier stage due to the presence or absence of another predictor. An interaction term in a logistic regression model would be applied across the dataset rather than in a restricted set of patients due to prior branching.
A further disadvantage of classification trees is the categorisation of continuous variables needed to produce a split, as this reduces information and would not occur in regression modelling. However, in the dataset under consideration in this thesis only one clinical predictor is not categorical (age), with the others being the dichotomised presence of predictors from clinical histories, and thus this is not a particular disadvantage.
Irrespective of theoretical advantages and disadvantages, a classification tree analysis of the GP dataset is warranted to explore the potential for a clearer set of decisions that could support referral from primary care to secondary care for suspected TIA, without needing to perform potentially complicated calculations in the real time setting of the consultation. If a classification tree has similar discriminating performance to a complicated score then it may have greater utility as it will be easier to follow and incorporate into history taking and clinical assessment.
15.6 ‘Pruning’ a classification tree
The ‘rpart’ package in R software (www.r-project.org) was used to generate a classification tree with TIA diagnosis as the outcome to be predicted, and all GP clinical variables were included alongside age and prior history of cerebrovascular disease as potential predictors.
This is similar to the initial analysis for the logistic regression method where all potential predictors are considered in the univariate analyses for statistical significance.
Trees can be ‘pruned’ such that a smaller number of branches are used while still effectively partitioning the dataset into patients with and without disease. The pruning process is determined by the effectiveness of splits in the data for the distal branches in a tree (277). In the basic tree models, the trees grow until there are no more data to split or until a split does not result in different prevalences of the outcome in the subgroups compared with the parent group at that split. Some splits could produce fairly trivial differences in subgroup prevalences and add complexity without adding much to the discriminating ability of the tree in the overall dataset.
A ‘complexity parameter’ can be defined which will only allow a split if there is a big enough improvement in model fit; if not, the tree will stop growing. (Rather than being ‘pruned back’, the tree is not allowed to grow, so the gardening metaphor behind the terminology does not reflect the mathematical process.)
The complexity parameter takes into account the number of patients at the ‘node’ or decision point to be split, the number of nodes in the tree as a whole and the change in the predictive ability of the whole tree (the model fit) as a result of the split. The complexity parameter informs the tree growing algorithm that a further split is allowed if it reduces the overall lack of fit (the residual mean square) by a certain factor (the value of the complexity parameter).
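The stopping rule can be illustrated with a simplified sketch. The assumptions here are loud ones: lack of fit is measured as the number of misclassified patients when every patient in a node is assigned the majority class, and the improvement from a split is expressed relative to the lack of fit at the root of the tree. This is an approximation for illustration, written in Python, and is not a reproduction of rpart's internal calculation; the function names and patient counts are hypothetical.

```python
def misclassified(n_cases, n_positive):
    """Errors made if everyone in a node is assigned the majority class."""
    return min(n_positive, n_cases - n_positive)


def split_allowed(root, parent, left, right, cp):
    """Decide whether a candidate split clears the complexity threshold.

    Each node is a (n_cases, n_positive) pair. The split is allowed only
    if the reduction in misclassified patients, relative to the number
    misclassified at the root, is at least cp.
    """
    gain = misclassified(*parent) - (misclassified(*left) + misclassified(*right))
    relative_gain = gain / misclassified(*root)
    return relative_gain >= cp


# Made-up counts for illustration only.
root = (200, 80)      # 200 patients, 80 with the outcome
parent = (60, 25)     # node being considered for a split
useful = ((30, 20), (30, 5))     # subgroups with clearly different prevalence
trivial = ((30, 13), (30, 12))   # subgroups almost identical to the parent

print(split_allowed(root, parent, *useful, cp=0.01))   # gain 10/80 = 0.125
print(split_allowed(root, parent, *trivial, cp=0.01))  # gain 0/80 = 0
```

In this sketch a larger cp demands a larger improvement before a split is permitted, so larger values produce smaller trees.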
Identifying the best value for the complexity parameter comes from a cross-validation exercise. The dataset is randomly split into a number of groups, and classification trees are ‘grown’ using the same initial variables but only using the data from outside a given group. Each tree is then tested on the data in the given group to assess how accurate its predictions are, by deriving a misclassification rate. This is done over a range of different complexity parameters, which will therefore result in a range of tree sizes for each attempt to predict the diagnosis in a given group. This whole process is repeated for each of the randomly constructed groups, one at a time, giving a misclassification rate each time a complexity parameter is tested in each of the groups. This results in an estimate across the groups of the misclassification error for each potential value of the complexity parameter.
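The cross-validation loop described above can be sketched as follows. This is an illustrative outline in Python rather than the procedure carried out by rpart. To keep the example short and self-contained, the full tree is replaced by a stand-in model, `fit_stump`, which makes at most one split on a single continuous predictor and only if the split clears the complexity threshold; this stand-in, the function names and the toy data are all assumptions made for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly allocate n row indices to k groups of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_stump(train, cp):
    """Stand-in for a tree: one split at most, allowed only if the relative
    reduction in misclassified cases is at least cp."""
    ys = [y for _, y in train]
    majority = int(2 * sum(ys) >= len(ys))
    root_errors = min(sum(ys), len(ys) - sum(ys))
    best = None  # (errors, cut, label_below, label_above)
    xs = sorted({x for x, _ in train})
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        below = [y for x, y in train if x < cut]
        above = [y for x, y in train if x >= cut]
        errors = (min(sum(below), len(below) - sum(below))
                  + min(sum(above), len(above) - sum(above)))
        if best is None or errors < best[0]:
            best = (errors, cut,
                    int(2 * sum(below) >= len(below)),
                    int(2 * sum(above) >= len(above)))
    if best is None or root_errors == 0 or (root_errors - best[0]) / root_errors < cp:
        return lambda x: majority            # split refused: predict majority
    _, cut, label_below, label_above = best
    return lambda x: label_below if x < cut else label_above


def cross_validated_error(data, cps, fit, k=10, seed=0):
    """Mean held-out misclassification rate for each candidate cp."""
    folds = k_fold_indices(len(data), k, seed)
    result = {}
    for cp in cps:
        rates = []
        for held_out in folds:
            held = set(held_out)
            train = [row for i, row in enumerate(data) if i not in held]
            test = [data[i] for i in held_out]
            predict = fit(train, cp)         # fitted without the held-out group
            rates.append(sum(predict(x) != y for x, y in test) / len(test))
        result[cp] = sum(rates) / len(rates)
    return result


# Made-up toy data: outcome present whenever the predictor is 20 or more.
data = [(x, int(x >= 20)) for x in range(40)]
cv_errors = cross_validated_error(data, cps=[0.01, 2.0], fit=fit_stump)
print(cv_errors)
```

With a cp above 1 no split can ever be accepted, so the stand-in model falls back to the majority class and its held-out error is much higher than with a small cp. With a real tree the same loop would be run over many candidate cp values, each corresponding to a different tree size.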
The ideal complexity parameter (cp) is the one that results in the least cross-validated error. As tree size increases, associated with lower and lower potential values of the cp, a minimum value of this error can be identified. However, as this could still be associated with a large tree, the ideal value is taken as one whose error is within one standard error of the minimum value. This allows the choice of a cp which is low enough to reduce misclassification but not so low that it will end up generating a tree that is as complex as the initial one.
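The 'one standard error' selection can be sketched as below. The table of candidate cp values, cross-validated errors and standard errors is entirely made up for illustration and is not output from the GP dataset. The function name `one_se_cp` is hypothetical, and the tie-breaking choice (take the largest eligible cp, that is, the smallest tree) is the usual convention for this rule and is assumed here rather than stated in the text.

```python
def one_se_cp(cp_table):
    """Pick a complexity parameter using the one-standard-error rule.

    cp_table -- list of (cp, cv_error, cv_se) tuples
    Returns the largest cp (smallest tree) whose cross-validated error is
    no more than one standard error above the minimum error.
    """
    best = min(cp_table, key=lambda row: row[1])   # row with the least error
    threshold = best[1] + best[2]                  # minimum error + its SE
    eligible = [row for row in cp_table if row[1] <= threshold]
    return max(eligible, key=lambda row: row[0])[0]


# Made-up values for illustration only: (cp, cv_error, cv_se).
table = [
    (0.100, 1.00, 0.05),
    (0.050, 0.80, 0.05),
    (0.020, 0.70, 0.04),
    (0.010, 0.66, 0.04),
    (0.005, 0.65, 0.04),
    (0.001, 0.68, 0.04),
]
chosen = one_se_cp(table)
print(chosen)
```

In this made-up table the minimum error (0.65) occurs at cp = 0.005, giving a threshold of 0.69; the largest cp with an error at or below that threshold is 0.010, so a simpler tree is chosen than the one with the absolute minimum error.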