6.2 Summary of the methods and results

6.2.2 Validation measures for clustered data

Chapter 4 extends some of the standard validation measures for use with models for clustered binary outcomes. These are the C-index [45] and D statistic [49] (both assess discrimination), the calibration slope [39, 42] (assesses calibration), and the Brier score [55] (assesses predictive accuracy). Two approaches, termed ‘overall’ and ‘pooled cluster-specific’, are proposed for calculating these measures for clustered data. Each approach can produce three different measures, depending on how the random effects estimates are used in predictions from the model. Conditional predictions can be obtained either by using the random effects estimates in the predictions or by setting the random effects to their mean value of zero. Marginal predictions can be obtained by integrating out the random effects.
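As an illustration of the three types of prediction described above, the following Python sketch assumes a random-intercept logistic model in which the risk for a patient is expit(eta + u), where eta is the fixed-effects linear predictor and u ~ N(0, sigma^2) is the random intercept of the patient's cluster. The function names, the normal random-effects distribution, and the simple trapezoidal integration are assumptions made for illustration; they are not taken from Chapter 4.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def conditional_prediction(eta, u_hat):
    """Risk given the linear predictor and the estimated random intercept."""
    return expit(eta + u_hat)

def conditional_prediction_at_zero(eta):
    """Risk with the random intercept set to its mean value of zero."""
    return expit(eta)

def marginal_prediction(eta, sigma, n_points=201, width=8.0):
    """Risk averaged over u ~ N(0, sigma^2) (assumed distribution),
    using the trapezoidal rule over +/- `width` standard deviations."""
    if sigma == 0:
        return expit(eta)
    lo = -width * sigma
    step = 2.0 * width * sigma / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        u = lo + i * step
        density = math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n_points - 1) else 1.0
        total += weight * expit(eta + u) * density * step
    return total

# Hypothetical patient: linear predictor -2, cluster intercept estimate 0.4,
# between-cluster standard deviation 1.5.
p_conditional = conditional_prediction(-2.0, 0.4)
p_at_zero = conditional_prediction_at_zero(-2.0)
p_marginal = marginal_prediction(-2.0, 1.5)
```

When sigma is non-zero, the marginal prediction is pulled towards 0.5 relative to the conditional prediction at u = 0, which is one reason why the three sets of predictions can lead to different values of the validation measures.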

The new validation measures are illustrated by developing a model that predicts in-hospital mortality following heart valve surgery in UK hospitals and validating its predictive performance. Both the ‘overall’ and ‘pooled cluster-specific’ measures are shown to have a meaningful interpretation in a clustered data setting. Additionally, the separate cluster-specific estimates can be used to identify clusters where model performance is either good or poor compared to the average performance. It would be of great interest to investigate the factors that explain this heterogeneity; possible explanations include unobserved cluster-level characteristics and misspecification of the model.

Simulation studies were conducted to evaluate the performance of the measures under a range of conditions related to clustered data. The ‘overall’ measures based on the conditional predictions using the estimates of the random effects showed reasonably good performance in a range of conditions, except for those where the clusters were small. This is because the empirical Bayes estimates of the random effects were poorly estimated for these clusters. These findings are similar to those obtained by Oirbeek and Lesaffre [131]. The validation measures based on the marginal predictions, and on the conditional predictions that set the random effects to zero, performed poorly in the presence of clustering because they ignore the effect of clustering. In general, the ‘pooled cluster-specific’ measures had reasonably good performance when the clusters were large. They showed bias for small clusters, since this approach ignores information from clusters that have very few events.
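The distinction between the two approaches can be sketched for the non-parametric C-index. In the sketch below, the ‘overall’ C-index is taken to compare all event/non-event pairs in the data, whereas the ‘pooled cluster-specific’ C-index compares pairs within each cluster only and then averages the cluster-specific values. Weighting each cluster by its number of event/non-event pairs is an assumption made here for simplicity; the weights used in Chapter 4 may differ. Clusters without both an event and a non-event contribute nothing to the pooled estimate.

```python
def c_index(risks, outcomes):
    """Non-parametric C-index: proportion of event/non-event pairs in which
    the event has the higher predicted risk (ties count one half).
    Returns (c, number_of_pairs); c is None when there are no pairs."""
    event_risks = [r for r, y in zip(risks, outcomes) if y == 1]
    nonevent_risks = [r for r, y in zip(risks, outcomes) if y == 0]
    n_pairs = len(event_risks) * len(nonevent_risks)
    if n_pairs == 0:
        return None, 0
    concordant = 0.0
    for r_event in event_risks:
        for r_nonevent in nonevent_risks:
            if r_event > r_nonevent:
                concordant += 1.0
            elif r_event == r_nonevent:
                concordant += 0.5
    return concordant / n_pairs, n_pairs

def overall_c(risks, outcomes):
    """C-index over all pairs, within and between clusters."""
    return c_index(risks, outcomes)[0]

def pooled_cluster_specific_c(risks, outcomes, clusters):
    """Pair-weighted average of within-cluster C-indices (assumed weighting)."""
    by_cluster = {}
    for r, y, g in zip(risks, outcomes, clusters):
        rs, ys = by_cluster.setdefault(g, ([], []))
        rs.append(r)
        ys.append(y)
    numerator = denominator = 0.0
    for rs, ys in by_cluster.values():
        c, n_pairs = c_index(rs, ys)
        if c is not None:
            numerator += n_pairs * c
            denominator += n_pairs
    return numerator / denominator if denominator else None

# Hypothetical data: two hospitals with different baseline risk.
risks = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0, 0, 1, 1, 0, 0, 1, 1]
clusters = ["A", "A", "A", "A", "B", "B", "B", "B"]
c_overall = overall_c(risks, outcomes)
c_pooled = pooled_cluster_specific_c(risks, outcomes, clusters)
```

In this hypothetical example the model ranks patients perfectly within each hospital, so the pooled value is 1, while the overall value is lower because some between-hospital pairs are discordant.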

The validation measures for clustered binary outcomes also differ in their assumptions and in the form of prognostic model they require, so these should be checked before choosing a measure. Both the parametric C-index and the D statistic require that the prognostic index derived from the model is normally distributed. In contrast, the non-parametric C-index only requires that the prognostic model is able to rank the patients. The calibration slope (CS) assumes that the model is correctly specified. The Brier score only requires that a predicted risk can be calculated for all patients. In practice, the non-parametric C-index, calibration slope, and Brier score are recommended since they make no distributional assumption about the prognostic index. The parametric C-index and D statistic can be used provided that the prognostic index is normally distributed.
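To make the requirements of two of the recommended measures concrete, the sketch below computes the Brier score from predicted risks and the calibration slope from the prognostic index. It uses the standard definitions for independent binary data (mean squared prediction error, and the slope from a logistic regression of the outcome on the prognostic index); how the random effects enter the predictions is left to the caller. The Newton-Raphson fitting routine and all names are illustrative assumptions.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def brier_score(risks, outcomes):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((y - p) ** 2 for p, y in zip(risks, outcomes)) / len(outcomes)

def calibration_slope(prognostic_index, outcomes, tol=1e-10, max_iter=50):
    """Fit logit P(y = 1) = a + b * PI by Newton-Raphson and return (a, b).
    The slope b is the calibration slope; b < 1 suggests predictions that
    are too extreme."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(prognostic_index, outcomes):
            p = expit(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if max(abs(da), abs(db)) < tol:
            break
    return a, b

# Hypothetical validation data.
pi = [-2, -1, -1, 0, 0, 1, 1, 2]
y = [0, 0, 1, 0, 1, 0, 1, 1]
_, slope = calibration_slope(pi, y)
# Doubling the prognostic index (predictions twice as extreme) halves the slope.
_, slope_extreme = calibration_slope([2 * v for v in pi], y)
score = brier_score([expit(v) for v in pi], y)
```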

In Chapter 5, the calibration slope, Harrell’s C-index, K statistic, D statistic, and the Integrated Brier score (IBS) are extended for use with proportional hazards frailty models for clustered survival outcomes, using the same approach as that discussed for clustered binary outcomes. This chapter discusses the use of these measures only for the model’s conditional predictions that use empirical Bayes estimates of the frailties. Using this approach, it is straightforward to extend the measures for use with marginal predictions and with conditional predictions that set the frailties to one (equivalently, the log-frailties to zero). An application of these validation measures is illustrated using child mortality data from Bangladesh. A simulation study was conducted to assess the effect of censoring on these measures under various clustered survival data scenarios. The validation measures behaved similarly to the corresponding standard measures for independent survival data, particularly in the presence of censoring. For example, the ‘overall’ K statistic (K_re) was robust to censoring in a range of conditions. The prognostic index was specified as normal throughout the simulations, and thus the effect of censoring on the D statistic (D_re) was negligible when the clusters were large. Similar results were observed for the calibration slope. However, the C-index (C_re) was affected by censoring; the bias was acceptable for censoring up to 30%. Similar to the standard measure, the IBS (IBS_re) performed poorly even when the data had only a small amount of censoring. In general, the measures were affected by non-zero intra-cluster correlation, particularly when the clusters were small, possibly due to poor estimation of the frailties. Similar to the analogous measures for clustered binary data, the ‘pooled’ measures performed poorly for small clusters, probably because they ignore the clusters that have few events.
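The sensitivity of the C-index to censoring can be seen from the way Harrell’s C-index selects pairs of subjects. The sketch below implements the standard Harrell’s C-index for independent survival data: a pair is usable only if the subject with the shorter follow-up time had an event, so censoring removes pairs from the comparison. For conditional predictions from a frailty model, the risk score could be taken as the prognostic index plus the estimated log-frailty; that choice, the handling of tied times (ignored here), and the function name are assumptions for illustration.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's C-index for right-censored data.
    times: follow-up times; events: 1 = event, 0 = censored;
    risk_scores: higher score means higher predicted hazard.
    Returns (c, number_of_usable_pairs); c is None if no pair is usable."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # the shorter time in a usable pair must be an event
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return (concordant / usable if usable else None), usable

# Hypothetical data: the same four subjects without and with censoring.
times = [1, 2, 3, 4]
scores = [4.0, 3.0, 2.0, 1.0]
c_full, pairs_full = harrell_c(times, [1, 1, 1, 1], scores)
c_cens, pairs_cens = harrell_c(times, [1, 0, 1, 0], scores)
```

With all events observed, six pairs are usable; when the second and fourth subjects are censored, only four pairs remain, so the estimate rests on a reduced and selected set of comparisons.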

Similar to the standard measures, the validation measures for clustered survival data differ in their assumptions and in the form of prognostic model they require. The C-index (C_re and C_w) only requires that the prognostic model is able to rank the patients. However, the K statistic (K_re and K_w) requires that the prognostic model was fitted using the proportional hazards (PH) frailty model. The D statistic (D_re and D_w) assumes that proportional hazards given the frailty holds and that the prognostic index is normally distributed. Similarly, the calibration slope (CS_re and CS_w) also assumes proportional hazards given the frailty. The predictive accuracy measure IBS (IBS_re and IBS_w) only requires that a survival function given the frailty can be calculated for all patients. One should be aware of these requirements before choosing a measure.

The recommendations regarding the practical use of these measures for censored data follow a similar pattern to those for the standard measures. For example, the K statistic (K_re and K_w) and calibration slope (CS_re and CS_w) can be recommended for validating a prognostic model developed with a PH frailty model. The D statistic (D_re and D_w) can be recommended provided that the distribution of the prognostic index is normal. The C-index (C_re and C_w) cannot be recommended when censoring exceeds 30%. The IBS (IBS_re and IBS_w) cannot be recommended.

In practice, both the ‘overall’ and ‘pooled cluster-specific’ measures are recommended for use when validating models for clustered data. However, before using the ‘pooled’ measures, one needs to check that the clusters in the validation data are sufficiently large (for example, more than 30 subjects each) and that each cluster contains at least two events.
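The check described above can be sketched as a simple screening step on the validation data. The thresholds (more than 30 subjects and at least two events per cluster) follow the example values in the text; the function name and data layout are hypothetical.

```python
from collections import defaultdict

def clusters_suitable_for_pooled(clusters, outcomes, min_size=30, min_events=2):
    """Return the sorted labels of clusters with more than `min_size` subjects
    and at least `min_events` events, plus the labels that fail the check."""
    size = defaultdict(int)
    n_events = defaultdict(int)
    for g, y in zip(clusters, outcomes):
        size[g] += 1
        n_events[g] += y
    suitable = sorted(g for g in size
                      if size[g] > min_size and n_events[g] >= min_events)
    unsuitable = sorted(g for g in size if g not in suitable)
    return suitable, unsuitable

# Hypothetical validation data with three hospitals.
clusters = ["A"] * 31 + ["B"] * 31 + ["C"] * 30
outcomes = [1, 1] + [0] * 29 + [1] + [0] * 30 + [1] * 5 + [0] * 25
suitable, unsuitable = clusters_suitable_for_pooled(clusters, outcomes)
```

Here hospital A passes both conditions, hospital B has too few events, and hospital C is too small, so the ‘pooled’ measures would rest on a single cluster and should be interpreted with caution.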

An important issue to consider when validating a model for clustered data is whether the validation data involve the same clusters as the development data or new clusters. If the clusters are the same, so that the random effects are known, conditional predictions using the random effects, and the validation measures based on them, are recommended for assessing the predictive ability of the model. It is not straightforward to use this approach when validating a model using subjects from new clusters, since the random effects are unknown. In such circumstances, one option would be to investigate the characteristics of the new clusters to see whether they match those of the clusters in the development data. For example, when predicting clinical outcomes in hospitals one could investigate the prevalence of the outcome, the geographical location, the experience of the clinicians, staff-to-patient ratios, and other relevant factors that could be obtained from routinely collected hospital data. If these important characteristics are similar for the development and validation hospitals, it may then be appropriate to assume that the development and validation hospitals come from the same population. The random effects could then be estimated from the validation data using the information from the development data, provided that the number of patients in each hospital is not small (for example, not less than 30). When the random effects are estimated from the validation data and used in the predictions, this may be considered a form of model re-calibration.

One could also compare the between-cluster variance in the development data with that in the validation data to judge whether it is reasonable to use predictions based on the random effects estimated from the validation data. The estimate of the between-cluster variance for the development data therefore needs to be published along with the risk algorithm by the model developers. If the numbers of clusters in both the validation and development data are reasonably large, one could use a more formal method of comparison, such as examining whether the confidence intervals for the between-cluster variances from the two datasets overlap, or using an F-test (for models with normally distributed random effects). However, equality in the level of clustering between the two datasets may be unlikely in practice.
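One way in which the random effect of a new cluster could be estimated from the validation data, using information from the development data, is sketched below for a random-intercept logistic model. The fixed-effects linear predictors and the between-cluster variance are taken from the development model, and the random intercept is estimated as the mode of its posterior distribution given the outcomes observed in the new cluster. This posterior-mode (empirical Bayes type) estimator, the normal random-effects assumption, and the function name are assumptions for illustration and are not necessarily the procedure intended in the text.

```python
import math

def estimate_random_intercept(linear_predictors, outcomes, sigma2,
                              tol=1e-10, max_iter=50):
    """Posterior mode of the random intercept u for one new cluster.
    Maximises sum[y*(eta+u) - log(1+exp(eta+u))] - u^2 / (2*sigma2)
    by Newton-Raphson, where eta and sigma2 come from the development model."""
    u = 0.0
    for _ in range(max_iter):
        gradient = -u / sigma2
        hessian = -1.0 / sigma2
        for eta, y in zip(linear_predictors, outcomes):
            p = 1.0 / (1.0 + math.exp(-(eta + u)))
            gradient += y - p
            hessian -= p * (1.0 - p)
        step = gradient / hessian
        u -= step
        if abs(step) < tol:
            break
    return u

# Hypothetical new hospital: 40 patients, each with linear predictor -2
# (predicted risk about 12%), but 12 observed deaths (30%).
etas = [-2.0] * 40
deaths = [1] * 12 + [0] * 28
u_large_variance = estimate_random_intercept(etas, deaths, sigma2=1.0)
u_small_variance = estimate_random_intercept(etas, deaths, sigma2=0.1)
```

The estimate is positive because the hospital has more deaths than the fixed effects predict, and it is shrunk more strongly towards zero when the between-cluster variance from the development data is small. This also illustrates why the published between-cluster variance is needed by anyone validating the model in new clusters.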

Alternatively, the marginal predictions, or the conditional predictions that set the random effects to their mean value, and the validation measures based on these approaches could be used. However, if the validation dataset involves several new clusters, and there is a moderate to high degree of variation between these clusters, then the validation measures based on these two approaches may not give an optimal assessment of the model’s predictive performance. Nevertheless, they are conservative with respect to the level of clustering, so that high (low for the Brier score) values would still imply a model with good predictive ability. It should be noted that any form of validation for clustered data would require expert statistical skills and thus may not be suitable to be done by clinicians