1 Introduction

1.6 Validation of prognostic models

1.6.3 Validation statistics

Various statistics are used to quantify the predictive performance of prognostic models. Key validation statistics for prognostic model research assess how accurately the model predicts the observed risk of participants (calibration) and how well the model distinguishes between those who do and do not experience the event of interest (discrimination) (Steyerberg, 2008). Other measures of predictive performance exist; however, a comprehensive review of these methods is beyond the scope of this thesis. Instead, only those that are relevant to, and thus utilised in, this thesis are discussed below.

1.6.3.1 Measures of calibration

Calibration reflects the prediction accuracy of the model (Royston and Altman, 2013): it measures how accurately the expected risks from a prognostic model predict the observed risks of the participants. Calibration of prognostic models with time-to-event outcomes is only possible if the models contain an estimate of the baseline survival function, as this is required to calculate participants’ absolute risks (Royston and Altman, 2013). For time-to-event outcomes, the expected and observed risks of the event of interest are evaluated both over time and within specified time periods (Moons et al., 2012b). Calibration statistics that can be utilised for models with time-to-event outcomes include: overall calibration; the expected/observed ratio (at particular time points); and the calibration slope.

Overall calibration (also known as calibration-in-the-large) compares the expected probability of events predicted by the prognostic model to the observed probability of events in the study, over the study time period. As censoring is likely to be present in time-to-event data, estimates of the expected and observed risks should account for censoring. The expected risk is represented by the predicted cumulative incidence curve from the fitted model, and the observed risk is represented by one minus the Kaplan-Meier curve, which gives the study participants’ cumulative risk. Overall calibration is assessed by overlaying the curves on the same plot (Royston and Altman, 2013); if the curves are similar, the model is “well calibrated”. This can be assessed for the sample as a whole (calibration-in-the-large) or for risk groups, created by grouping participants with similar predicted risks, to assess calibration at different levels of risk. Risk groups are often created by splitting at deciles of predicted probabilities (Steyerberg, 2008); however, it is recommended that risk groups contain a minimum of 50 participants for stable Kaplan-Meier estimation (Harrell, 2015).

The expected/observed ratio (E/O) assesses overall calibration at specified time points. Again, E/O may be reported for the entire validation sample or over risk groups. The ratio of the expected and observed probabilities of an event should be close to one for a well calibrated model.
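As an illustration of the expected/observed ratio at a single time point, the following is a minimal sketch rather than the method used in this thesis. It assumes that the expected risk is taken as the mean of the participants' predicted risks at time t, and that the observed risk is one minus a Kaplan-Meier estimate computed directly from the data. The function names `km_survival_at` and `expected_observed_ratio` are hypothetical, and competing events are not handled.

```python
import numpy as np


def km_survival_at(times, events, t):
    """Kaplan-Meier estimate of the survival probability S(t).

    times  : follow-up time for each participant
    events : 1 if the event was observed, 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    # np.unique returns the distinct event times in ascending order.
    for u in np.unique(times[events]):
        if u > t:
            break
        at_risk = np.sum(times >= u)            # still under follow-up just before u
        n_events = np.sum((times == u) & events)
        surv *= 1.0 - n_events / at_risk
    return surv


def expected_observed_ratio(pred_risk, times, events, t):
    """E/O at time t (assumption: expected risk = mean predicted risk at t)."""
    expected = float(np.mean(pred_risk))
    observed = 1.0 - km_survival_at(times, events, t)
    return expected / observed
```

The same function could be applied within each risk group (for example, deciles of predicted risk) to obtain group-level E/O values; a value near one suggests good calibration at that time point.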

Calibration plots graphically depict the expected and observed probabilities of an event occurring prior to a specified time point for risk groups (Altman et al., 2009), as depicted in Box 1.6. The slope of a fitted line in a calibration plot is referred to as the calibration slope (Steyerberg, 2008). When assessing prognostic models with time-to-event outcomes, the calibration slope can be assessed as an average slope over time by estimating the regression coefficient of a model containing the linear predictor from the prognostic model as the only variable (Royston and Altman, 2013). A well calibrated model produces a calibration slope estimate close to one and E/O close to one at most time points. Methods of recalibration may be utilised to improve the fit of prognostic models with poor calibration (Royston, 2010).

When competing events are present, observed risks are calculated using cumulative incidence estimates (rather than survival estimates), as these account for the competing events.

Box 1.6: Example of a calibration plot

Calibration plot from the external validation of a prognostic model for 30-day risk of mortality following stroke, developed using Cox regression (Counsell et al., 2002).

Deciles of predicted probability were used to form the groups in the calibration plot.

The diagonal dashed line represents perfect calibration and the vertical lines represent 95% confidence intervals for observed risk in each group.

1.6.3.2 Measures of discrimination

Discrimination refers to the extent to which predicted risk estimates distinguish between different patient prognoses (Royston and Altman, 2013). For prognostic models with time-to-event outcomes, discrimination should not only distinguish between those who do and do not go on to experience the event, but also between the times at which the events occur. Groups of participants with higher predicted risks should have higher event rates and experience events sooner than those with lower predicted risks (Royston and Altman, 2013). Discrimination statistics utilised for models with time-to-event outcomes include Harrell’s C-index (Harrell et al., 1982), and Royston and Sauerbrei’s D-statistic and $R^2_D$ (Royston and Sauerbrei, 2004).

Harrell’s C-index (Harrell et al., 1982) is the probability that, of two randomly chosen participants, the one with the higher expected risk will experience the event first. Not all pairs of participants are evaluable: if neither participant experiences the event during study follow-up, it is not possible to determine which will experience the event first. Similarly, if one participant is censored before the other experiences the event, it is not possible to determine the order of the pairing. Thus, Harrell’s C-index is known to be biased in instances with heavy censoring (Royston and Altman, 2013).
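The pairwise logic described above can be sketched as follows. This is a simplified illustration and not a reference implementation: it assumes a pair is usable only when the participant with the shorter follow-up time experienced the event, it treats pairs with tied follow-up times as unusable, and it counts tied predicted risks as half concordant. The function name `harrell_c` is hypothetical.

```python
def harrell_c(risk, times, events):
    """Simplified Harrell's C-index for right-censored data.

    risk   : predicted risk (or linear predictor) per participant
    times  : follow-up time per participant
    events : 1 if the event was observed, 0 if censored
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored participant cannot be the "first event" in a pair
        for j in range(n):
            # Usable pair: j was still event-free and under follow-up
            # when i experienced the event.
            if times[j] > times[i]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs: the C-index is undefined")
    return concordant / usable
```

A value of 0.5 corresponds to discrimination no better than chance, and 1 to perfect ordering of the usable pairs. Pairs in which both participants are censored, or in which the earlier time is censored, never enter the calculation, which is why the statistic becomes less reliable under heavy censoring.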

Royston and Sauerbrei’s D-statistic is a measure of prognostic separation (Royston and Sauerbrei, 2004). It can be interpreted as the log hazard ratio comparing two equal-sized groups, created by splitting the sample at the median value of the estimated linear predictor from the prognostic model (Riley et al., 2016). To calculate it, the prognostic model is used to obtain each individual’s linear predictor value. These values are ordered, the corresponding standard normal order statistics (rankits) are calculated, and the rankits are then scaled by dividing by a factor $\kappa = \sqrt{8/\pi}$. The time-to-event outcome is then regressed on the scaled rankits, and the resulting estimated regression coefficient is the D-statistic (Royston and Sauerbrei, 2004). Higher values of the D-statistic represent more separation, and thus greater discrimination. Once calculated, the D-statistic can be incorporated into a generalisation of the multiple correlation coefficient, $R^2_D$, representing the proportion of explained variation on the log relative hazard scale (Royston and Altman, 2013), as follows:

$$R^2_D = \frac{D^2/\kappa^2}{\sigma^2 + D^2/\kappa^2}, \qquad \sigma^2 = \frac{\pi^2}{6} \qquad \text{(Equation 1.33)}$$
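The two computational pieces of this procedure that do not require a fitted survival model can be sketched as follows. This is an illustrative sketch under stated assumptions: the rankits are approximated using Blom's formula, (rank − 3/8)/(n + 1/4), tied linear predictor values are not averaged, and the function names `scaled_rankits` and `r2_d` are hypothetical. The regression step that produces D itself (fitting a survival model with the scaled rankits as the only covariate) is not shown.

```python
import math
from statistics import NormalDist

KAPPA = math.sqrt(8 / math.pi)   # scaling factor kappa = sqrt(8/pi)
SIGMA2 = math.pi ** 2 / 6        # sigma^2 = pi^2/6


def scaled_rankits(linear_predictor):
    """Scaled rankits for each participant, in the original order.

    Assumption: rankits are approximated with Blom's formula, and ties
    in the linear predictor are not averaged.
    """
    n = len(linear_predictor)
    order = sorted(range(n), key=lambda i: linear_predictor[i])
    z = [0.0] * n
    std_normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        p = (rank - 0.375) / (n + 0.25)
        z[i] = std_normal.inv_cdf(p) / KAPPA
    return z


def r2_d(d):
    """R^2_D as a function of the D-statistic (Equation 1.33)."""
    ratio = d ** 2 / KAPPA ** 2
    return ratio / (SIGMA2 + ratio)
```

For example, `r2_d(0.0)` returns 0, reflecting no explained variation when there is no prognostic separation, and the value approaches 1 as D increases.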

In document Investigating the presence and impact of competing events on prognostic model research (Page 73-76)