optimistic impression of model performance that may be achieved in new subjects from the underlying population. Optimism can be defined as the difference between true performance and apparent performance (Steyerberg, 2009). Here the true performance refers to model behaviour in the underlying population and apparent performance refers to the estimated performance in the sample.
Validation of a statistical prediction model means evaluating its predictive ability. Following model development, we might be able to assess predictive performance in new, independent patients, i.e. external validation (Harrell, 2001; Hastie et al., 2009; Steyerberg, 2009). However, if further data are unavailable, we might instead use the original sample for both model development and validation. One approach is to randomly split the data into two portions: one for model development (training) and the other for model validation. But such data-splitting procedures are often considered suboptimal (Efron and Tibshirani, 1993): a random split can produce subsamples that are less representative for development, validation or both, so the results may fail to reflect how the model would perform under external validation. The bootstrap is considered the most efficient validation procedure (Harrell, 2001). As a validation technique, the bootstrap repeatedly analyses resamples of the data, where each resample is a random sample drawn with replacement from the original sample. Because each bootstrap sample differs somewhat from the original data set, every iteration mimics application of the model to a different sample, which makes the resulting performance estimates more reliable (Anderson, 2005). Bootstrap validation allows predicted probabilities from a model to be calculated and compared with the observed outcomes.
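The bootstrap procedure described above can be sketched in a few lines. The sketch is illustrative only and is not the implementation used in this thesis. It assumes the commonly described optimism-correction scheme: refit the model in each bootstrap sample, compare its performance in that sample with its performance in the original data, and subtract the average difference from the apparent performance. Optimism is computed here as apparent minus test performance, so a positive value indicates over-optimism. The toy model (observed event rate per group), the simulated data and all function names are hypothetical.

```python
import random
from collections import defaultdict


def fit(groups, y):
    """Toy 'model' (an assumption): predicted probability = event rate per group."""
    overall = sum(y) / len(y)
    counts = defaultdict(lambda: [0, 0])  # group -> [events, total]
    for g, yi in zip(groups, y):
        counts[g][0] += yi
        counts[g][1] += 1
    rates = {g: e / n for g, (e, n) in counts.items()}
    return lambda g: rates.get(g, overall)  # unseen group -> overall rate


def c_statistic(pred, y):
    """Share of (event, non-event) pairs ranked correctly; ties count one half."""
    pos = [p for p, yi in zip(pred, y) if yi == 1]
    neg = [p for p, yi in zip(pred, y) if yi == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are needed")
    score = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return score / (len(pos) * len(neg))


def bootstrap_validate(groups, y, n_boot=100, seed=1):
    """Return (apparent, optimism, optimism-corrected) c-statistic."""
    rng = random.Random(seed)
    n = len(y)
    model = fit(groups, y)
    apparent = c_statistic([model(g) for g in groups], y)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bg = [groups[i] for i in idx]
        by = [y[i] for i in idx]
        m = fit(bg, by)  # refit in the bootstrap sample
        boot_perf = c_statistic([m(g) for g in bg], by)     # apparent, bootstrap sample
        test_perf = c_statistic([m(g) for g in groups], y)  # same model, original data
        optimism += (boot_perf - test_perf) / n_boot
    return apparent, optimism, apparent - optimism


# Simulated data (hypothetical): 8 groups with event rates rising from 0.30 to 0.65.
data_rng = random.Random(0)
groups = [data_rng.randrange(8) for _ in range(200)]
y = [1 if data_rng.random() < 0.3 + 0.05 * g else 0 for g in groups]

apparent, optimism, corrected = bootstrap_validate(groups, y)
print(f"apparent c = {apparent:.3f}, optimism = {optimism:.3f}, corrected c = {corrected:.3f}")
```

Any performance measure for which larger values are better can be substituted for `c_statistic` without changing `bootstrap_validate`.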
Calibration and discrimination statistics are among the most effective and commonly used measures of validation performance (Harrell et al., 1996; Steyerberg et al., 2004). Calibration (also called reliability) refers to how well the model predictions agree with the observed outcomes; it is essentially a measure of bias in the predicted probabilities. For example, if the average predicted proportion of compliers with HRT tablet allocation among a group of similar women is 80% and the actual proportion complying is 80%, then the predictions are well calibrated. In measuring calibration, it is sufficient to focus on either the intercept or the slope of the linear predictor, and not both (Steyerberg et al., 2004). In practice, calibration is often quantified by the calibration slope, as originally proposed by Cox (1958b). The calibration slope can be obtained from the validation plot, which is a plot of observed probabilities against predicted probabilities. The line in the validation plot can be described by an intercept α and a slope β, where α = 0 and β = 1 correspond to perfect calibration. For an overfitted model the calibration slope typically lies between 0 and 1, and the closer it is to 1, the better calibrated the model under study.
There is a relationship between the calibration slope and penalty factor in penalized regression (Miller et al., 1993). For example, for a logistic model with the linear predictor as the only covariate, the calibration slope is the estimated regression coefficient β, i.e.
logit(treatment compliance) = α + β· linear predictor. (3.9)
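As a minimal sketch of Equation (3.9), the calibration intercept and slope can be estimated by regressing the observed binary outcome on the model's linear predictor with a logistic model. The example below is illustrative only: the Newton-Raphson fitting routine, the function names and the simulated data are assumptions, not the software used in this thesis. The data are simulated so that the predictions are too extreme by a factor of two, which means the true calibration slope is 0.5.

```python
import math
import random


def expit(z):
    """Inverse logit, with the argument clamped to avoid overflow in exp."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def calibration_fit(lp, y, max_iter=50, tol=1e-10):
    """Fit logit P(y = 1) = alpha + beta * lp by Newton-Raphson.

    Returns (alpha, beta): calibration intercept and calibration slope.
    """
    alpha, beta = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for z, yi in zip(lp, y):
            p = expit(alpha + beta * z)
            w = p * (1.0 - p)
            g0 += yi - p          # score for the intercept
            g1 += (yi - p) * z    # score for the slope
            h00 += w              # information matrix entries
            h01 += w * z
            h11 += w * z * z
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("information matrix is singular")
        d_alpha = (h11 * g0 - h01 * g1) / det
        d_beta = (h00 * g1 - h01 * g0) / det
        alpha += d_alpha
        beta += d_beta
        if max(abs(d_alpha), abs(d_beta)) < tol:
            break
    return alpha, beta


# Simulated validation data (hypothetical): the model's linear predictor is
# twice as extreme as the truth, so the true calibration slope is 0.5.
rng = random.Random(42)
lp = [rng.gauss(0.0, 1.5) for _ in range(1000)]
y = [1 if rng.random() < expit(0.5 * z) else 0 for z in lp]

alpha_hat, beta_hat = calibration_fit(lp, y)
print(f"calibration intercept = {alpha_hat:.3f}, calibration slope = {beta_hat:.3f}")
```

An estimated slope well below 1 signals predictions that are too extreme, which is the pattern expected from an overfitted model.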
Copas (1983) and van Houwelingen and le Cessie (1990) demonstrated that the slope β of the linear predictor is identical to the uniform shrinkage factor s given by Equation (3.8) above.
Discrimination refers to the ability of the model to distinguish between subjects with positive and negative outcomes (e.g. to distinguish compliers with treatment allocation from non-compliers). Discrimination is commonly measured using the concordance (c) statistic. For binary outcomes this statistic is identical to the area under the receiver operating characteristic curve (Harrell, 2001). Given a random pair of patients with different outcome values, the c-statistic can be interpreted as the probability that the patient with the outcome (e.g. a complier) has a higher predicted probability of that outcome than the patient without it (e.g. a non-complier). The c-statistic varies between 0.5 (random predictions) and 1.0 (perfect discrimination), and the higher the better (Harrell et al., 1996; Miller et al., 1993). The c-statistic can also be expressed in terms of the widely used Somers (1962) Dxy rank correlation, which is the difference between the concordance and discordance probabilities (Harrell, 2001):
Dxy = 2(c − 0.5), (3.10)
where Dxy = 0 and Dxy = 1 imply random predictions and perfect discrimination, respectively.
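To make Equation (3.10) concrete, the sketch below counts concordant, discordant and tied pairs among all (event, non-event) pairs, computes c and Dxy from those counts, and converts a reported Dxy back to c. It is an illustration under stated assumptions rather than the routine used by any particular package: the function names and the six example predictions are hypothetical, and tied pairs are assumed to count one half towards c.

```python
def concordance(pred, y):
    """Return (c, Dxy) for predicted probabilities and binary outcomes (0/1)."""
    conc = disc = tied = 0
    for p_event, y_event in zip(pred, y):
        if y_event != 1:
            continue
        for p_non, y_non in zip(pred, y):
            if y_non != 0:
                continue
            if p_event > p_non:
                conc += 1      # event subject ranked higher: concordant
            elif p_event < p_non:
                disc += 1      # event subject ranked lower: discordant
            else:
                tied += 1
    pairs = conc + disc + tied
    if pairs == 0:
        raise ValueError("both outcome classes are needed")
    c = (conc + 0.5 * tied) / pairs
    dxy = (conc - disc) / pairs   # concordance minus discordance probability
    return c, dxy


def c_from_dxy(dxy):
    """Invert Dxy = 2(c - 0.5) to recover c from a reported Dxy."""
    return dxy / 2.0 + 0.5


# Hypothetical example: three compliers (1) and three non-compliers (0).
pred = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y = [1, 1, 1, 0, 0, 0]

c, dxy = concordance(pred, y)
print(f"c = {c:.4f}, Dxy = {dxy:.4f}, c recovered from Dxy = {c_from_dxy(dxy):.4f}")
```

In this example 8 of the 9 pairs are concordant and 1 is discordant, so c = 8/9 and Dxy = 7/9, which agrees with Dxy = 2(c − 0.5).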
We will evaluate the predictive performance of our models using the calibration slope, the concordance c-statistic for discrimination (calculated from the reported Dxy value) and the percentage of optimism, as implemented in the Design package in R. However, it is worth noting that good (or even perfect) calibration and discriminative ability are not sufficient for a model to be declared clinically useful. A prediction model is worth applying only if it provides useful additional information for clinical decision making. In our case, the quality of compliance information for both treatment arms would be crucial to making relevant clinical decisions.
Following the review, the next five chapters provide applications of the methods discussed thus far. Chapters 4, 5 and 6 provide analyses of the Esprit data. Chapters 7 and 8 provide simulation studies that evaluate, respectively, the performance of specialist methods adjusting for noncompliance in one treatment arm and the principal stratification method of Roy et al. (2008) adjusting for noncompliance in two treatment arms.