5.3 Application to child mortality data
5.3.1 Child mortality data
5.3.2.2 Model Validation
The model was then used to predict survival probabilities in the validation data, and its predictive performance was assessed using the validation measures described in Section 5.2. To calculate the validation measures, the frailties, the survival probabilities S(t|ω), and the associated prognostic indices were estimated in the validation data, using the parameter estimates from the development data. User-written Stata code was used to calculate the validation measures (Appendix B: Figure B.3).
Figure 5.3: (a) Distribution of the predicted prognostic index $\hat{\eta}_{re} = \hat{\beta}^{T} x_{ij} + \hat{\varpi}_{j}$; (b) Kaplan-Meier survival function at the tertiles of the predicted prognostic index.
Some of the validation measures rely on the assumption that the predicted prognostic index is normally distributed. The distribution of the predicted prognostic index is therefore presented in Figure 5.3(a); there appears to be some skewness towards the right, but it is perhaps not unreasonable to treat the distribution as approximately normal. Furthermore, to examine the spread of the survival predictions from the model in the validation data, the Kaplan-Meier survival functions at the tertiles of the predicted prognostic index are presented in Figure 5.3(b). This plot suggests that the model had good ability to separate (discriminate) children with a low risk of mortality from those with a high risk.
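The grouping behind a plot such as Figure 5.3(b) can be sketched as follows. This is a minimal Python illustration, not the Stata code referred to in Appendix B; the function names `kaplan_meier` and `km_by_tertile` are hypothetical, and the sketch assumes that a higher prognostic index means a higher risk and that the tertile cut points are the empirical 1/3 and 2/3 quantiles.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier estimate at each distinct event time.

    time  : follow-up times
    event : 1/True if the event (death) was observed, 0/False if censored
    Returns (event_times, survival_estimates).
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)              # still under observation just before t
        deaths = np.sum((time == t) & event)     # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def km_by_tertile(prognostic_index, time, event):
    """Split subjects into three groups at the tertiles of the
    prognostic index and compute a Kaplan-Meier curve in each group.
    Group 0 has the lowest index values, group 2 the highest."""
    pi = np.asarray(prognostic_index, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    cuts = np.quantile(pi, [1 / 3, 2 / 3])       # assumed tertile cut points
    group = np.digitize(pi, cuts)                # values 0, 1 or 2
    return {g: kaplan_meier(time[group == g], event[group == g])
            for g in range(3)}
```

Well-separated curves across the three groups correspond to the good discrimination described above; curves that overlap would indicate that the prognostic index distinguishes little between low and high risk children.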
The estimates of the validation measures are presented in Table 5.2. The concordance statistic Cre was estimated as 0.750 (95% CI: 0.723 to 0.785), which suggests that the model has reasonably good ability to discriminate between low and high risk children. However, Kre suggests a relatively lower ability to discriminate, with an estimate of 0.686 (95% CI: 0.669 to 0.701). A possible reason for this difference, as for the standard C-index, is that Cre may be affected by the high degree of censoring in the child mortality data. The estimate of the D statistic (Dre = 1.52) also suggests moderate discrimination between low and high risk children. As with the standard D and K statistics, Dre and Kre may not be affected by the censoring in this dataset. The calibration slope (CSre = 1.01) suggests that the model has good overall calibration. The extent of inaccuracy in the individual survival predictions (IBSre) was estimated to be 0.07, suggesting reasonably low inaccuracy in the survival predictions. Additionally, the Brier score calculated at each time point was plotted against the observed time points in Figure 5.4, to show the model’s predictive accuracy over the entire follow-up period. Two additional Brier score curves, for the null model and for the model with all fixed predictors only (frailties set to one), were obtained to examine the difference in predictive accuracy between the three models. The survival prediction error (the Brier score) was lower when predictions were made by the model with all fixed predictors and the frailties than when they were made by the model with fixed predictors only or by the null model.
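To make the role of censoring in the concordance statistic concrete, the following is a minimal Python sketch of Harrell's C-index computed from a predicted prognostic index. It is an illustration only and not the Stata code used for the results above; the function name `harrell_c` is hypothetical, and the sketch assumes that a higher prognostic index means a higher risk, that pairs with tied follow-up times are ignored, and that ties in the prognostic index count as half-concordant.

```python
import numpy as np

def harrell_c(prognostic_index, time, event):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is usable only when the subject with the shorter
    follow-up time had an observed event; otherwise it is unknown
    which of the two failed first.
    """
    pi = np.asarray(prognostic_index, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    concordant = 0.0
    usable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # censored subjects cannot be the earlier failure
        for j in range(n):
            if time[j] > time[i]:         # j is known to have survived longer than i
                usable += 1
                if pi[i] > pi[j]:
                    concordant += 1.0     # higher predicted risk failed first
                elif pi[i] == pi[j]:
                    concordant += 0.5     # tie in the prognostic index
    return concordant / usable if usable else float("nan")
```

Because every censored observation removes pairs from the comparison, the set of usable pairs shrinks as censoring increases, which is one way to see why a measure of this type may be sensitive to heavy censoring.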
The ‘pooled’ estimate of the cluster-specific C-indices, Cw, indicates good discrimination between low and high risk children belonging to the same cluster, whereas Kw indicates relatively less discrimination, and Dw suggests poor discrimination between these two groups (Table 5.2). These ‘pooled’ estimates are smaller than their corresponding ‘overall’ estimates, even though the frailty effects in these data were not that strong. Similarly, the ‘pooled’ estimate of the cluster-specific calibration slopes (CSw) suggests worse calibration than the ‘overall’ calibration slope. This difference may be caused by the cluster sizes in the child mortality data, where clusters are reasonably small and several clusters have no events. These clusters were dropped because the measure cannot be calculated without events, so the pooled estimate was based on a reduced number of clusters. The reason is similar to that described for the ‘pooled cluster-specific’ measures for clustered binary data on pages 98-99, Section 4.5.3.2, Chapter 4. However, the estimate of the ‘pooled’ Brier score (IBSw = 0.06) was close to that of the ‘overall’ Brier score.
Table 5.2: Estimates of the validation measures based on the validation data
Measure                     Notation   Estimate   95% CI
Overall measures
Harrell’s C-index           Cre        0.750      [0.723, 0.785]
Gonen and Heller’s K(β)     Kre        0.686      [0.669, 0.701]
D-statistic                 Dre        1.52       [1.29, 1.74]
Calibration slope           CSre       1.01       [0.91, 1.09]
Integrated Brier score      IBSre      0.07       [-]
Pooled cluster-specific measures
Harrell’s C-index           Cw         0.701      [0.671, 0.742]
Gonen and Heller’s K(β)     Kw         0.649      [0.626, 0.679]
D-statistic                 Dw         0.89       [0.65, 1.13]
Calibration slope           CSw        0.64       [0.48, 0.81]
Integrated Brier score      IBSw       0.06       [0.05, 0.07]
Figure 5.4: Brier scores over the entire follow-up period. The results are obtained for the predictions from the model with all fixed predictors and the frailties, the model with all fixed predictors only, and the null model.
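The effect of dropping clusters on a pooled estimate can be illustrated with a short Python sketch. This section does not state how the cluster-specific estimates are combined, so the inverse-variance weighting used here is purely an assumption made for illustration; the function name `pool_cluster_estimates` and the convention of marking clusters whose measure could not be calculated with NaN are also hypothetical.

```python
import numpy as np

def pool_cluster_estimates(estimates, std_errors):
    """Combine cluster-specific estimates into a single pooled value.

    Clusters whose estimate or standard error is missing (NaN), for
    example because no events were observed, are dropped before pooling.
    ASSUMPTION: inverse-variance (fixed-effect) weights; the source
    does not specify the pooling method in this section.
    Returns (pooled_estimate, pooled_standard_error, clusters_used).
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    keep = np.isfinite(est) & np.isfinite(se) & (se > 0)
    if not keep.any():
        return float("nan"), float("nan"), 0
    weights = 1.0 / se[keep] ** 2
    pooled = np.sum(weights * est[keep]) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))
    return float(pooled), float(pooled_se), int(keep.sum())
```

The third returned value makes explicit how many clusters actually contribute, which matters when, as here, several clusters are excluded for lack of events.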
Since the cluster-specific estimates are useful in detecting outlying clusters (regions), these estimates and their levels of uncertainty are plotted against the rank order of the clusters in Figure 5.5. Each horizontal solid line indicates the ‘pooled’ estimate of the respective measure. Figure 5.5 shows the estimates for 76 clusters only. The validation measures could not be estimated for the remaining clusters because the number of events required to calculate the measures was not observed in these clusters. Note that some of the measures, for example Cw, require at least one event, and other measures, for example Dw, require two events (see Section 5.2). This plot allows the clusters (regions) to be compared in terms of model performance. The results show that the predictive ability of the model for some clusters was significantly worse (or better) than the average performance. This heterogeneity in predictive performance between the clusters (regions) may be caused by unobserved cluster (region) level characteristics. Therefore, it may be important to identify the factors that explain this heterogeneity.
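The comparison of each cluster with the pooled estimate can be sketched in Python as below. The function name `flag_outlying_clusters` is hypothetical, and the use of a Wald-type interval (estimate ± 1.96 standard errors) is an assumption; the intervals shown in Figure 5.5 may have been obtained differently.

```python
import numpy as np

def flag_outlying_clusters(estimates, std_errors, pooled, z=1.96):
    """Rank clusters by their estimate and flag those whose interval
    excludes the pooled estimate.

    ASSUMPTION: Wald-type intervals, estimate +/- z * standard error.
    Returns a list of (cluster_index, status) in rank order, where
    status is 'below', 'above' or 'consistent'.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    lower = est - z * se
    upper = est + z * se
    flagged = []
    for idx in np.argsort(est):           # rank order, smallest estimate first
        if upper[idx] < pooled:
            status = "below"              # whole interval lies below the pooled value
        elif lower[idx] > pooled:
            status = "above"              # whole interval lies above the pooled value
        else:
            status = "consistent"
        flagged.append((int(idx), status))
    return flagged
```

Whether 'below' means worse or better depends on the measure: for a concordance measure a lower value indicates poorer discrimination, whereas for the Brier score a lower value indicates better accuracy.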
In summary, this illustration of the validation measures using the child mortality data showed that both the ‘overall’ and ‘pooled cluster-specific’ measures have a meaningful interpretation in a clustered survival data setting. However, the validation measures appeared to lead to different conclusions regarding the model’s predictive performance. For example, the C-index suggested strong discrimination, whereas K(β) suggested moderate discrimination. Furthermore, while the ‘overall’ estimates of both the D-statistic and the calibration slope indicated reasonably good predictive performance, their ‘pooled’ estimates indicated very poor performance. These dissimilarities may be caused by the high degree of censoring in the child mortality data and/or by the small clusters. Therefore, in the next section, a simulation study is conducted to evaluate the performance of the measures under a range of conditions in a clustered survival data setting.