Method Three: Bayesian Model Averaging with an Additional Weighting

The next method adapts BMA to include an additional weighting, in an attempt to create an improved selective screening tool. This allows more evidence to be included when determining the weights for the models. The additional weighting is based upon a comparison of the discriminative ability of the models in the dataset. A model with greater discrimination, measured by the AUC, is assigned a stronger prior. However, unless the models' AUCs differ substantially, the change in the final weights will be minor. As with the original BMA method, BMA with an additional weighting is simple to apply once the AUC for each of the models has been determined.

The posterior of each model $M_j$ is calculated in exactly the same way as previously presented, using Equations 11.5 & 11.7. The additional weighting takes the form of the prior probability for each model and is determined as follows:

\[
\Pr(M_i) = \frac{AUC_i}{\sum_{j=1}^{n} AUC_j} \tag{11.9}
\]

where $n$ is the number of models being aggregated.

Then, for each model $j$, a new value of $m_j$ is determined by the likelihood ratio. The final model weights can then be calculated:

\[
w_j = \frac{\Pr(M_j) \times m_j}{\sum_{i=1}^{n} \Pr(M_i) \times m_i} \tag{11.10}
\]

This provides the final weights for each model, which for the lung cancer prediction models were determined as follows:

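| Model      | Weighting (with Bach Model) | Weighting (without Bach Model) |
|------------|-----------------------------|--------------------------------|
| Bach       | 2.71E-09                    | NA                             |
| PLCO       | 0.00206                     | 0.50742                        |
| Hoggart    | 0.99794                     | 0.49258                        |
| Pittsburgh | 7.24E-17                    | 6.66E-16                       |

Table 11.6: Model Weightings from BMA with an Informative Prior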
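The calculation in Equations 11.9 and 11.10 can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used for Table 11.6: the function names and the example inputs are hypothetical, the values of m_j are assumed to have been obtained already from Equations 11.5 & 11.7, and the last function assumes the meta-model prediction is the weighted average of the individual model risks, as is usual in BMA.

```python
from typing import Sequence


def auc_prior(aucs: Sequence[float]) -> list[float]:
    """Prior for each model, proportional to its AUC (Equation 11.9)."""
    total = sum(aucs)
    return [auc / total for auc in aucs]


def final_weights(priors: Sequence[float], m: Sequence[float]) -> list[float]:
    """Normalised product of each prior and its m_j term (Equation 11.10)."""
    products = [prior * m_j for prior, m_j in zip(priors, m)]
    total = sum(products)
    return [product / total for product in products]


def meta_model_risk(weights: Sequence[float], risks: Sequence[float]) -> float:
    """Weighted average of the individual model risks for one person.

    Assumption: the meta-model combines predictions linearly by weight.
    """
    return sum(weight * risk for weight, risk in zip(weights, risks))


# Hypothetical inputs for three models; these are NOT values from the thesis.
example_aucs = [0.60, 0.70, 0.70]
example_m = [1.0, 1.0, 1.0]  # equal evidence, so only the prior matters

priors = auc_prior(example_aucs)
weights = final_weights(priors, example_m)
```

With equal m_j values the final weights reduce to the AUC-based priors, which makes the modest size of the prior's contribution easy to see.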

As can be seen in Tables 11.4 & 11.6, the difference in weights between BMA and BMA with an additional weighting was negligible because the AUC results were similar and the external data did not provide enough evidence to significantly alter the model weights. However, the PLCOM2014 was slightly

the AUC results would be a more appropriate method if there was a clear leading model which should be favourably weighted in the final model.
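One way to see why the AUC-based prior has a limited effect is to take the ratio of the final weights of two models $j$ and $k$ from Equation 11.10, so that the shared denominator cancels. The bound stated afterwards assumes that every model discriminates at least as well as chance, which is an assumption made here for illustration.

```latex
\[
\frac{w_j}{w_k}
  = \frac{\Pr(M_j)}{\Pr(M_k)} \times \frac{m_j}{m_k}
  = \frac{AUC_j}{AUC_k} \times \frac{m_j}{m_k}
\]
% Assumption: 0.5 <= AUC <= 1 for every model, hence
\[
\frac{1}{2} \;\le\; \frac{AUC_j}{AUC_k} \;\le\; 2
\]
```

Under that assumption the prior can shift the odds between two models by at most a factor of two, whereas the ratio $m_j / m_k$ has no such bound. This is consistent with weights in Table 11.6 that differ by many orders of magnitude.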

| Bach Model | Validation | Internal/External | Hosmer-Lemeshow | Brier Score | AUC [95% CI]          | Threshold (%) | Sens. (%) | Spec. (%) | Youden | PLR    |
|------------|------------|-------------------|-----------------|-------------|-----------------------|---------------|-----------|-----------|--------|--------|
| With       | 1          | Internal          | 0 (491.05)      | 0.3726      | 0.5295 [0.508, 0.551] | 25.04         | 29.63     | 76.48     | 0.0611 | 1.2597 |
| With       | 1          | External          | 0.0024 (501.84) | 0.5149      | 0.6103 [0.591, 0.629] | 24.98         | 31.14     | 81.01     | 0.1215 | 1.6396 |
| With       | 2          | Internal          | 0 (491.05)      | 0.3726      | 0.5295 [0.508, 0.551] | 32.23         | 12.59     | 89.99     | 0.0259 | 1.2583 |
| With       | 2          | External          | 0.0024 (501.84) | 0.5149      | 0.6103 [0.591, 0.629] | 31.20         | 18.79     | 90.03     | 0.0882 | 1.8840 |
| Without    | 1          | Internal          | 0 (744.14)      | 0.4226      | 0.6115 [0.596, 0.627] | 1.50          | 82.70     | 36.44     | 0.1914 | 1.3011 |
| Without    | 1          | External          | 0 (1030.13)     | 0.5316      | 0.6924 [0.680, 0.705] | 1.50          | 83.91     | 42.38     | 0.2628 | 1.4562 |
| Without    | 2          | Internal          | 0 (716.55)      | 0.4158      | 0.5983 [0.582, 0.615] | 15.81         | 12.62     | 89.99     | 0.0261 | 1.2607 |
| Without    | 2          | External          | 0 (971.88)      | 0.5236      | 0.7003 [0.686, 0.714] | 13.87         | 25.97     | 89.98     | 0.1595 | 2.5919 |

Validation 1: Optimal Risk Threshold. Validation 2: UKLS Guidelines.

Table 11.7: BMA Validation Results with an Informative Prior

Since the meta-model had very similar weights to the original BMA meta-models, there were no major differences in the validation results. Neither version of the meta-model, including or excluding the Bach Model, reported a good calibration, although the version with the Bach Model came close to a good calibration in the external validation. This was an improvement upon the original models, further supported by the improved Brier Score. This suggests BMA can improve the accuracy of predictions.
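For reference, the Brier Score mentioned above can be computed as sketched below. The sketch assumes the standard definition, the mean squared difference between each predicted risk and the observed 0/1 outcome, and the function name is illustrative rather than taken from the analysis code. Lower values indicate more accurate predictions.

```python
from typing import Sequence


def brier_score(predicted: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted risk and observed outcome.

    predicted: risks in [0, 1]; outcomes: 1 for a lung cancer case, 0 otherwise.
    """
    if len(predicted) != len(outcomes):
        raise ValueError("predicted and outcomes must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted, outcomes)]
    return sum(squared_errors) / len(squared_errors)
```

A model that predicts a risk of 0.5 for everyone scores 0.25 regardless of the outcomes, which gives a simple point of comparison.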

The discriminative ability of the new models was similar to that of the initial models. The meta-model that excluded the Bach Model reported the strongest discriminative ability and prediction rules, a small improvement over the BMA version without the informative prior. This demonstrates how considering the discriminative ability as prior knowledge can improve the potential of developing a robust selective screening tool. Its performance was comparable to that of the PLCOM2014 Model. Indeed, in the external validation the meta-model had a higher Youden index and PLR than the leading original models. This meta-model, excluding the Bach Model, was optimal at the 1.5% risk threshold, where the sensitivity was approximately 83% across the internal and external validation datasets and the specificity was between 36% and 42%. Using the UKLS guidelines, the model should be applied at a risk threshold between 13.8% and 15.8%. This maintained a specificity of 90%, although the sensitivity was variable: 12.6% in the internal validation, rising to 26% in the external validation. The results have only been confirmed in one external dataset; the model should be validated in additional sample populations to assess whether it can consistently perform to a higher standard than the original models.
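The prediction-rule measures in Table 11.7 follow from the sensitivity and specificity at a chosen risk threshold. The sketch below assumes the standard definitions (Youden index = sensitivity + specificity - 1, PLR = sensitivity / (1 - specificity)), and the function names and the rule of flagging a person when their risk is at or above the threshold are illustrative assumptions. Applied to the internal validation row without the Bach Model at the 1.5% threshold (sensitivity 82.70%, specificity 36.44%), these definitions reproduce the reported Youden index of 0.1914 and PLR of 1.3011.

```python
from typing import Sequence, Tuple


def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden index, with sensitivity and specificity given as proportions."""
    return sensitivity + specificity - 1.0


def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """PLR: how much more often a case is flagged than a control."""
    return sensitivity / (1.0 - specificity)


def sens_spec_at_threshold(
    risks: Sequence[float], outcomes: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when screening everyone with risk >= threshold.

    Assumption: a person is flagged when their risk is at or above the threshold.
    """
    tp = fn = tn = fp = 0
    for risk, outcome in zip(risks, outcomes):
        flagged = risk >= threshold
        if outcome == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    return tp / (tp + fn), tn / (tn + fp)
```

Evaluating these functions over a grid of thresholds and keeping the one with the largest Youden index is one way to locate an optimal risk threshold, while fixing the specificity near 90% corresponds to the second validation setting.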

In summary, BMA with an additional weighting led to a slightly improved model. The model version excluding the Bach Model had a strong performance. In the external validation the calibration improved upon the original models, although the meta-model still did not report a good calibration because of the calibration deficiencies of the original models. The discrimination and prediction slightly improved upon the leading original models. However, the results were variable between the internal and external validation. The model should be applied in new populations to ascertain whether an improved prediction model has been created in comparison to the original models.

Overall, this method showed potential for creating a more robust model. Considering a prior for each model could be a better alternative to the original BMA method. The additional weighting, based on the models' discriminative ability, could allow a more robust selective screening tool to be devised. However, if there is not a large difference in the original models' AUCs, then the prior information does not significantly alter the model weights from the original BMA method.

11.6 Summary

The chapter applied Model Averaging and BMA methods to aggregate multiple lung cancer prediction models. Model Averaging dramatically improved the model calibration in both the internal and external validation. This was an advantage of recalibrating the original models before aggregating the models in the dataset. The models were recalibrated in a dataset with a high lung cancer incidence rate, so that when they were applied in the external validation dataset, which also reported a high lung cancer incidence rate, they remained well calibrated. Unfortunately, the discrimination and the prediction rules were inferior to the original models. Additionally, the final weights were evenly split across all models considered in the meta-model. This may be a result of the original models still being poorly calibrated in the dataset, so that the method failed to identify more robust models.

BMA did not improve the calibration and minimised the impact of some of the original models by effectively assigning them a weight of zero in the final model. This was observed when the Bach Model was included in the model aggregation; the final model was effectively the Hoggart Model. When the Bach Model was excluded, the final model was formed using an approximately equal weighting between the PLCOM2014 and Hoggart Models; however, the Pittsburgh Model was still minimised in the final meta-model. When the models' discriminative ability was incorporated as an additional weighting, the final weights differed only slightly because the AUC results were relatively similar. This method, however, created a meta-model with slightly improved discrimination and prediction rules compared to the original models.

A lung cancer prediction model was devised with a strong performance in comparison to the original models. This model combined the evidence of the PLCOM2014, Hoggart and Pittsburgh models. It can be applied to ever-smokers aged at least 35 years to predict six-year risk of lung cancer incidence. The model performed optimally at the 1.5% risk threshold, with a sensitivity of 83% and a specificity between 36% and 42%. To maintain a specificity of 90% the model should be applied at a risk threshold between 13.8% and 15.8%, which is excessive because the Hoggart Model estimated high risks. At this threshold the sensitivity varied between 12.6% and 26%. The model should be applied in new external validations to ascertain whether an improved prediction model has been created in comparison to the original models. The final model is the weightings of the PLCOM2014, Hoggart, and Pittsburgh models as presented in Table 11.6, and a Stata

Chapter 12: Discussion

12.1 Developing Project Objectives and Brief Summary

The primary objective of this project was to review how lung cancer diagnosis rates could be improved. We highlighted that the key to improving the diagnosis rate was to identify lung cancer early. This could be achieved by implementing a screening trial, where people who are at high risk of developing lung cancer are periodically reviewed and screened to identify early lung cancer developments. Therefore, we aimed to review how lung cancer screening could be implemented and improved.

Initially, an overview of lung cancer and current screening programmes was conducted. This highlighted that research was still needed into how to effectively select populations for lung cancer screening. Firstly, there was an indication that the currently proposed populations for screening trials were somewhat arbitrarily chosen, without evidence to support why they had been selected. The two major current screening proposals either selected the LLP Model, despite this model not demonstrating a leading prediction model performance, or selected older ever-smokers without consideration of other key factors that could explain lung cancer risk. Despite these limitations, the trials, when evaluated, demonstrated some promising results, including improved early-stage lung cancer diagnosis. This highlighted the potential benefit of a lung cancer screening programme. However, before a programme could be implemented there should be more justification as to why a specific criterion had been chosen, with evidence demonstrating why this would be the most beneficial method available. Since this evidence was not available for the current screening programmes, a systematic review was proposed to collate all information on the available prediction models, which could form the basis of a selective screening programme, and any evidence on their expected performance in clinical use.

We also discussed the key measures that would be expected of a criterion before it could be implemented, with a focus on the UK screening trial guidelines. Namely, a screening criterion would be required to demonstrate a high level of benefit in identifying lung cancer while reducing the potential for negative impacts caused by unnecessary screening. It also had to demonstrate the potential to be cost-effective, which could be challenging due to the expensive CT scanning commonly required to identify lung cancer. This was considered during all assessments of how prediction models could be implemented as a selective screening trial.

The next stage of the project aimed to identify any leading selective screening criteria formulated using either the existing lung cancer screening programmes or available prediction models. Therefore, a systematic review was conducted to identify any leading screening guidelines currently published. However, our research found that the current reporting on how lung cancer prediction models could be utilised as a selective screening tool was limited. The results reported were inconsistent and, with a lack of compelling evidence demonstrating their potential benefits, this contributed to lung cancer screening not currently being considered on a large scale.

This formed the next objective of this research: to identify a leading selective screening criterion. We requested datasets from ILCCO to provide a range of sample populations in which to review existing lung cancer models and currently considered selective screening criteria. The models and criteria were then analysed in the datasets, and recommendations on how to define a high-risk population for screening were presented. By the conclusion of the external validation we presented two alternatives, both using the successful PLCO2014 Model. One criterion would identify a large proportion of lung cancer cases but entail a substantial proportion of additional screening of controls; this would be beneficial if and when lung cancer screening becomes easier, cheaper, and less invasive. The second option, while still identifying a substantial proportion of lung cancer cases, aimed to reduce screening of people who were lung cancer free, to allow the screening programme to be more economically viable and reduce harms. By the conclusion of the external validation the objectives had been met and a leading selective screening tool had been presented.

Next, the project presented how models could be updated to combine the evidence and information available in the model building dataset with new information in an external dataset. This can be achieved by updating a single model or by combining multiple models based on their performance in the external dataset. The literature review presented the different methods that were available, which have differing merits depending on the success of the original model in the external dataset. Model aggregating techniques were also presented. These methods to combine models had not been utilised often, so we evaluated the methods and how they have previously been successful or can be successfully applied. We also reviewed whether the methods were appropriate and discussed any constraints when applying different methods, such as requiring the same model form.

Once the methods had been presented and reviewed, the final objective of the project assessed whether an improved lung cancer screening tool could be created. Unfortunately, a significantly improved model could not be developed. This was mainly a consequence of the datasets obtained, which reported a very high lung cancer incidence rate. As a result, the models were poorly calibrated, and because calibration is the main assessment in model updating, the methods were hindered. Additionally, the proposed models would be unrealistic to apply in the real world because they were recalibrated to predict exceptionally high levels of lung cancer risk based on the dataset.

Overall, the majority of the objectives were achieved. The systematic review synthesised the evidence on lung cancer prediction models and highlighted where they had been successful and where further research was required, namely how the models could be successfully applied as a selective screening tool. This became the project's objective and we conducted more extensive research into lung cancer prediction models. We provided clear evidence on the leading model and how it should be applied to maximise potential benefit. The different methods that are available for model updating were presented and analysed. These could then be used in subsequent research to update prediction models for lung cancer or other diseases. We also provided our own suggestion for model aggregation, which considered the models' discriminative ability rather than aggregating models solely on the basis of their calibration. This aimed to ensure that models with a good discriminative ability were not too harshly penalised and nullified in the final meta-model. In addition, the systematic review research has been published and there are plans to publish the external validation results.

Unfortunately, an improved model could not be created, and we decided not to publish a proposed model that did not improve on the existing literature. In addition, we could not provide any conclusive evidence on the success of different model updating methods based on our own analysis. This was a consequence of the datasets in which the model updating was conducted.

In summary, the work can provide a contribution to the field of lung cancer screening. Previously, multiple lung cancer models had been developed without further research into how the models should be applied. This has hindered models being implemented as a selective screening tool. In this research we demonstrated how all prediction models would benefit from thorough validations that allow constructive results. In addition, we have provided clear recommendations on how to apply the PLCO2014 Model as a selective screening tool, with recommendations for the next stage of research before, hopefully, this model is implemented as a screening tool at a regional, national or international level.

In document Validating and Updating Lung Cancer Prediction Models (Page 170-175)