

providers’ reimbursement (Holmes and Jain, 2012, Lewis et al., 2011).

Moreover, the identification of emergent risk can be categorised into modelling of three main aspects: stratification, clinical profiles and resource utilisation profiles. Also, in the modelling of the events, the time dimensions can be designed as time-to-event models (Appendix A.1.4) or as risk score models.

4.1.1 Predictive Modelling

Predictive modelling is directly associated with machine learning¹, pattern recognition and data mining. The practice of predictive modelling defines the process of developing models whose prediction accuracy can be understood and quantified (Kuhn and Johnson, 2013). Geisser (1993) defines predictive modelling as "the process by which a model is created or chosen to try to best predict the probability of an outcome."

Physicians are interested in evaluating and forecasting adverse events that may lead to mortality or a longer hospital stay for the patient, and in assigning a quantity to the patient risk profile (Cornalba, 2009). Regarding risk impact, healthcare risk analysis can be categorised into two categories: Operational Risks (ORs) and Clinical Risks (Kohn et al., 2000). The predictive modelling of ORs in healthcare, such as emergency readmission (Appendix A.1.1), Length-of-Stay (LoS) and End-of-Life (EoL) (Appendix A.1.3) modelling, varies across systems and often lacks robustness and generalisation. A popular OR analysis approach in financial modelling problems is loss-event risk modelling using Bayesian Networks (BNs) (Fenton and Neil, 2012). The BN approach is well suited to this task because it can identify common causes of failure that affect the whole trading process. This type of OR analysis tries to quantify the OR that affects the system as a whole, and to identify its routes and causes. Another advantage of this approach is that it enables stress testing of the system to determine the effects. Two examples of OR analysis in loss-event risk modelling are the identification of rogue trading and the stress testing of financial markets.

¹ Note that predictive modelling is a different notion from the Predictive or Supervised Learning approach in machine learning.


4.1.2 Risk Adjustment

Although medical advances have contributed to the improvement of life expectancy, their effect is arguably greater on quality of life than on life expectancy. Risk adjustment methods are either used directly by health insurers to select good (profitable) risks from an insurer pool, or indirectly through the design of insurance products. The models are often based on a linear utility function framework (Newhouse, 1996), and the objective is to minimise the outcome (i.e. risk) (Culyer and Newhouse, 2000, Culyer et al., 2012).

4.2 Modelling Techniques

Since the late 1980s, machine learning methods have been used to extend statistical analysis for making inferences from data, and there is still a lot to be done in the area of automated methods for learning and forecasting in healthcare.

Based on the knowledge of interest, BN, Artificial Neural Network (ANN), Decision Tree (DT) and kernel methods, like Support Vector Machine (SVM) and Gaussian Processes, are often used in healthcare data mining problems (Bardsley, 2012, Kansagara et al., 2011, Lewis et al., 2011, ACI, 2014, DH, 2011a, Paton et al., 2014). Other approaches in machine learning can be found in the work of Bishop and Nasrabadi (2006).

Predictive models vary in terms of prediction time-window (time-horizon), selected population, input variables, algorithm design and benchmarking methods. A list of good practices is proposed by Sinha et al. (2013), which covers a number of issues. However, not all studies (Lewis et al., 2011) clearly specify the details of analyses, including publicly and privately funded projects (Appendix A.1.1).

This chapter is divided into six sections. Firstly, a brief introduction to Transfer Learning is provided. Then, Ensemble learning is discussed in detail. Afterwards, regression modelling and Logistic Regression (LR) approaches are briefly summarised. Next, major Decision Tree modelling algorithms are outlined, including the Random Forest (RF). Moreover, a recap of the SVM models is presented. Furthermore, Bayesian approaches are reviewed. After that, an abstract introduction to the Deep Neural Networks (DNNs) is provided. Then the Bayes Point Machines (BPM) modelling approach is defined. Later, the Deep learning approach and the Wide and Deep Neural Network (WDNN) model are described extensively. Finally, outlines of some other major modelling techniques are given in Appendix A.1.2.

4.2.1 Transfer Learning

Transfer Learning (Woodworth and Thorndike, 1901) is a wide area of research in machine learning (Pratt et al., 1993) that focuses on improving learning through the transfer of knowledge from sub-models or inputs that have already been learnt. It refers to methods that harness and adapt existing models to a specific new predictive task at hand. Transfer Learning is also known as Multi-Task Learning (Caruana, 1998) or Learning to Learn (Thrun and Pratt, 1997), and it refers to fitting many related models to get better performance.

Transfer Learning methodologies can help forecasting and predictive modelling techniques to provide a systematic methodology of analysis for similar cases with a smaller number of visible parameters. This may also be extended to semi-supervised machine learning modelling, such as active learning or latent feature modelling (Ghahramani et al., 2007), for use in complex, real-world settings (Graham et al., 2011, Horvitz, 2010, Koller and Friedman, 2009).

The main application of Transfer Learning is in domain adaptation. Examples of domain adaptation problems are spam filtering, news analysis and many other personalised classifiers, or models that transfer the learnt features to another problem. In this chapter, Ensemble learning is discussed, which is a partial subset of Multi-Task Learning. In addition, DNNs are briefly overviewed. Recent breakthroughs in graphics hardware (Oh and Jung, 2004), accelerated computing (Weber et al., 2011) and backpropagation optimisation (Hinton, 2007) have allowed DNNs to become one of the most powerful tools in the Transfer Learning domain (Yosinski et al., 2014).

4.2.2 Ensemble Learning

The Ensemble learning approaches (Dasarathy and Sheela, 1979, Hansen and Salamon, 1990, Schapire, 1990) are used in statistics and machine learning techniques to combine multiple learning algorithms to achieve a better performance. Ensemble methods have been applied or integrated within a wide range of modelling techniques, including ANNs, Decision Trees, and unsupervised learning scenarios, like anomaly detection. Some of the common Ensemble algorithms are Bagging, Boosting and various Bayesian methods (Murphy, 2012, Rokach, 2010, Sammut and Webb, 2011, Sewell, 2008, Zhou, 2012).

The Bagging method (Breiman, 1996), which stands for bootstrap aggregating, combines the classifications from randomly generated training sets to decrease the error and improve the classification. Firstly, the algorithm uses the bootstrap distribution to generate different base learners (Efron and Tibshirani, 1994). Then, it applies a popular combination method, known as Voting, to aggregate the output of the learners. For example, Smedira et al. (2013) used a Bagging method to enhance the stability of the multivariate analysis of a non-proportional hazard hospital readmission model. The Bagging approach increased the stability of the model, which made it possible to analyse the association between readmission, resource use and mortality. However, the studied population was very small and isolated, and the presented performance benchmark was subjective.
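The two Bagging steps described above (bootstrap resampling, then Voting) can be illustrated with a minimal Python sketch. It is not the implementation used in any of the cited studies: the decision-stump base learner, the function names (`fit_stump`, `bagging_fit`, `bagging_predict`) and the toy data are assumptions chosen only to keep the example self-contained.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-split threshold classifier (decision stump) by exhaustive search."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(yi != l_lab for yi in left) + sum(yi != r_lab for yi in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    _, j, t, l_lab, r_lab = best
    return lambda row: l_lab if row[j] <= t else r_lab

def bagging_fit(X, y, n_learners=25, seed=0):
    """Train one base learner per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        learners.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return learners

def bagging_predict(learners, row):
    """Aggregate the base learners by majority Voting."""
    votes = Counter(h(row) for h in learners)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one feature, two classes.
X_demo = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]
learners_demo = bagging_fit(X_demo, y_demo)
```

Calling `bagging_predict(learners_demo, [0.05])` returns the majority label among the 25 stumps; any other base learner with the same call signature could be substituted for the stump.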

The Boosting method (Schapire and Freund, 2012) can reduce the variance of probability estimates by averaging together many estimates. In other words, the models in the Ensemble modelling space try to correct weaker ones by focusing on the mistaken cases. The AdaBoost method (Freund et al., 1996) is an extension of Boosting with many variations (e.g. the M1, M2 and R algorithms), which allow it to be applied to multi-class problems and regression problems. For instance, Turgeman and May (2016) applied a boosted Decision Tree in combination with an SVM algorithm to model hospital readmission. The model was tested on a dataset from veteran hospitals in a city in the USA. The model performed considerably better than other basic models, including LR, SVM and Decision Tree. However, the applied optimisation approach had a moderate performance.
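The idea of later models focusing on the mistaken cases can be made concrete with a minimal sketch of basic two-class AdaBoost in Python. The weighted decision stump, the function names and the toy data are assumptions for illustration; this is not the boosted Decision Tree and SVM combination of Turgeman and May (2016), nor a specific M1, M2 or R variant.

```python
import math

def fit_weighted_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump.

    The stump predicts +polarity when x[feature] > threshold, else -polarity.
    """
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (polarity if row[j] > t else -polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def adaboost_fit(X, y, n_rounds=10):
    """Labels must be -1 or +1. Returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform observation weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, t, pol = fit_weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this weak learner
        ensemble.append((alpha, j, t, pol))
        # Increase the weight of misclassified cases, decrease the others.
        w = [wi * math.exp(-alpha * yi * (pol if row[j] > t else -pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(alpha * (pol if row[j] > t else -pol)
                for alpha, j, t, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy problem that no single stump can solve.
X_toy = [[1], [2], [3], [4], [5], [6]]
y_toy = [1, 1, -1, -1, 1, 1]
ensemble_1 = adaboost_fit(X_toy, y_toy, n_rounds=1)
ensemble_3 = adaboost_fit(X_toy, y_toy, n_rounds=3)
```

On this toy problem a single stump misclassifies two observations, whereas the weighted vote of three stumps classifies all six training observations correctly.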

Bayesian methods, like Bayes Optimal Classifier (BOC), Bayes Model Averaging (BMA) and Bayesian Model Combination (BMC), can be used to include hypotheses from the hypothesis space and the associated prior probabilities. For instance, Monteith et al. (2011) demonstrated that BMC provides a theoretical basis for soft-selecting from a space of Ensemble models. The model was applied to a machine learning dataset, and it was shown that BMC could outperform BMA, Bagging, and Boosting, in terms of prediction accuracy.

However, two major disadvantages of Ensemble methods are moderately high computing resource usage and difficulty of interpretation. On the first point, computing resources have improved significantly in the past decade, and an Ensemble model with a moderate number of sub-models can run very quickly with comparable prediction performance. On the second point, there are various post-processing techniques that can be applied to interpret the models. The partial dependence plot (Goldstein et al., 2015) and importance rankings of features (Breiman et al., 1984, Chen and Lin, 2006) are two generic approaches that can be used to interpret a black-box method (Murphy, 2012).
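As an illustration of such post-processing, the values behind a partial dependence plot can be computed by fixing one feature at each grid value and averaging the model output over the observed data. The following Python sketch assumes the fitted model is available as a plain callable; the function name `partial_dependence` and the toy model are hypothetical.

```python
def partial_dependence(model, X, feature, grid):
    """Average model output over X with one feature forced to each grid value."""
    curve = []
    for v in grid:
        total = 0.0
        for row in X:
            modified = list(row)   # copy so the original data is not changed
            modified[feature] = v  # overwrite the feature of interest
            total += model(modified)
        curve.append(total / len(X))
    return curve

# Hypothetical black-box model and data, for illustration only.
toy_model = lambda r: 2 * r[0] + r[1]
X_pd = [[0, 1], [0, 3]]
pd_curve = partial_dependence(toy_model, X_pd, feature=0, grid=[0, 1])
```

Plotting `pd_curve` against the grid gives the partial dependence of the prediction on the chosen feature, regardless of how the underlying model was built.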

In the following subsection, a number of approaches for combining and selecting sub-models in Ensemble modelling are discussed.

4.2.2.1 Combination Methods

In the final level of Ensemble modelling, a combination method must be applied to include the estimated probabilities of all the sub-models in the Ensemble modelling space. The popular methods for combining a set of models are Voting, Stacking, Sum, Median, Mean, Product, Mixture of Experts (MxE) and, finally, weighting in combination with other methods (Murphy, 2012, Sammut and Webb, 2011, Sewell, 2008, Zhou, 2012).

Firstly, Voting algorithms use a selection approach, like majority, soft averaging and weighted combination of estimates to combine models. Voting algorithms are applied in Bagging and Boosting algorithms and many classification algorithms, such as the Random Forest (RF).

Moreover, the Stacking method (Wolpert, 1992) (a.k.a. Stacked Generalisation) uses the estimated probabilities produced by a combination of sub-models as an additional input to the main prediction model.

Furthermore, the weighting approach is used in Ensemble methods, like Weighted Average and Weighted Voting, and in Bayesian methods, like BOC. The weights are usually derived using an approximation technique, like Expectation Maximisation (EM), to optimise a performance indicator.
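The simpler combination rules named in this subsection (majority Voting, Mean, Median, Product and weighted averaging) can be sketched in a few lines of Python. The function names and the example probabilities are assumptions for illustration, and the weights are fixed by hand rather than derived with EM.

```python
import math
import statistics
from collections import Counter

def majority_vote(labels):
    """Hard Voting: the most frequent predicted label wins."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probs):
    """Soft averaging (Mean rule) of the estimated probabilities."""
    return sum(probs) / len(probs)

def median_rule(probs):
    """Median rule: robust to a single extreme sub-model."""
    return statistics.median(probs)

def product_rule(probs):
    """Product rule: multiplies the estimates (unnormalised)."""
    return math.prod(probs)

def weighted_vote(probs, weights):
    """Weighted average; the weights need not sum to one."""
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)

# Hypothetical estimates of P(y = 1) from three sub-models.
probs_demo = [0.9, 0.4, 0.45]
labels_demo = [1 if p >= 0.5 else 0 for p in probs_demo]
```

With these values the rules can disagree: majority Voting returns class 0 (two of three sub-models are below 0.5), soft averaging gives about 0.58 in favour of class 1, and weights of `[1, 3, 3]` pull the weighted average back below 0.5.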

Moreover, the MxE algorithm (Jacobs et al., 1991, Jordan and Jacobs, 1994) generates a group of sub-classifiers (i.e. Experts) whose outputs are combined and inputted into a Generalised Linear Model (GLM). The classifiers inputted to the GLM are weighted by a gating function using a method like EM. The MxE is particularly useful when the feature space is heterogeneous, and classifiers on different parts of the space provide more informative and synthetic estimates.
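The role of the gating function can be illustrated with a minimal forward pass in Python. This is a sketch under strong assumptions: the experts are one-feature logistic models, the gate is a softmax over linear scores, and all parameters are fixed by hand instead of being fitted with EM. All names and values are hypothetical.

```python
import math

def softmax(scores):
    """Convert gate scores into non-negative weights that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def mixture_of_experts(x, experts, gate_params):
    """Return the gated prediction and the gate weights for a scalar input x.

    experts: list of (a, b) pairs, each a logistic expert sigmoid(a + b * x).
    gate_params: list of (c, d) pairs, each a linear gate score c + d * x.
    """
    gates = softmax([c + d * x for c, d in gate_params])
    preds = [sigmoid(a + b * x) for a, b in experts]
    return sum(g * p for g, p in zip(gates, preds)), gates

# Hypothetical parameters: expert 0 is trusted for negative x, expert 1 for positive x.
experts_demo = [(1.0, 0.5), (-1.0, 2.0)]
gate_demo = [(0.0, -3.0), (0.0, 3.0)]
pred_neg, gates_neg = mixture_of_experts(-2.0, experts_demo, gate_demo)
```

At `x = -2.0` the gate places almost all of its weight on the first expert, so the combined prediction is close to that expert's output alone; this is how different experts can specialise on different parts of a heterogeneous feature space.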

For instance, Liu et al. (2014) proposed an Ensemble model based on MxE to predict risk scores for acute cardiac complications. The developed MxE predictive model incorporates multiple sources of features, and the weights of the experts are defined using a hybrid method. The model was developed using a small sample of cardiac patients in Singapore, and although its reported performance was fairly high, it was evaluated only on this small population.

4.2.3 Logistic Regression (LR)

Before 1980, almost all learning methods learnt linear decision surfaces. Linear Regression modelling methods, such as the Logistic Regression (LR) and mixed models, have been applied extensively in previous literature in social science and healthcare modelling. The LR (Cox, 1958, Walker and Duncan, 1967) method is similar to Linear Regression methods, but it has been developed for binary linear classification. For the LR, the observed variable has a Bernoulli distribution (Uspensky, 1937) instead of a Gaussian one (Feller, 2008), and the estimated response variable is passed through a Sigmoid function (i.e. the Logistic function, the inverse of the Logit) to squash the estimates between zero and one. Moreover, to fit the LR, there is a wide range of estimation and optimisation algorithms. One of the popular methods is the Maximum Likelihood Estimation (MLE), which is equivalent to minimising the cross-entropy. A method like the MLE can suffer from overfitting and is sensitive to sparse features. The algorithm of a basic logistic regression model can be represented as the following (Eq. 4.1):

$$\hat{f}(x) = \frac{1}{1 + e^{-(a + bx)}} \qquad (4.1)$$

where $\hat{f}$ is the prediction of the dependent variable for a vector of data points $x$, $\frac{1}{1 + e^{-t}}$ represents a Sigmoid function, $a$ and $b$ are the coefficients, and the error term is implicit.
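Eq. 4.1 and the link between the MLE and the cross-entropy can be made concrete with a short Python sketch for a single feature. Plain gradient descent is used here only as one simple optimiser; the learning rate, the iteration count, the function names and the toy data are assumptions, not the settings of any study cited in this section.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(a, b, x):
    """Eq. 4.1: estimated probability that the response equals one."""
    return sigmoid(a + b * x)

def cross_entropy(a, b, xs, ys):
    """Mean negative Bernoulli log-likelihood; minimising it is the MLE."""
    eps = 1e-12  # guards against log(0)
    return -sum(y * math.log(predict(a, b, x) + eps)
                + (1 - y) * math.log(1 - predict(a, b, x) + eps)
                for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, lr=0.5, n_iter=2000):
    """Estimate a and b by gradient descent on the cross-entropy."""
    a = b = 0.0
    n = len(xs)
    for _ in range(n_iter):
        residuals = [predict(a, b, x) - y for x, y in zip(xs, ys)]
        a -= lr * sum(residuals) / n
        b -= lr * sum(r * x for r, x in zip(residuals, xs)) / n
    return a, b

# Hypothetical, non-separable toy data.
xs_demo = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys_demo = [0, 0, 1, 0, 1, 1]
a_hat, b_hat = fit(xs_demo, ys_demo)
```

After fitting, `predict(a_hat, b_hat, x)` returns a value between zero and one, and the cross-entropy of the fitted coefficients is lower than that of the starting point `a = b = 0`.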

To overcome overfitting, L2 regularisation (a.k.a. weight decay) may be applied to sparse models with a large number of features, and L1 regularisation may be applied to sparse models with a small number of features. Regularisation in statistics is an effective approach to favour simpler models (Blumer et al., 1987), which can work very well with a large amount of data to reduce overfitting (Halevy et al., 2009). However, when the dataset is small or more personalised results are required, more complex approaches are needed. For instance, Multi-Task Learning or Ensemble modelling methods may be used to include multiple classifiers (Section 4.2.1) and create more specialised or personalised solutions.
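For reference, the two penalties can be written as additions to the cross-entropy objective of the LR in Eq. 4.1. This is the standard textbook form rather than a formulation taken from the cited works; the number of observations $n$, the observed responses $y_i$, the coefficient vector $b = (b_1, \dots, b_p)$ and the penalty strength $\lambda \ge 0$ are notation assumed here for illustration.

```latex
% Unpenalised objective: mean cross-entropy (negative Bernoulli log-likelihood)
J(a, b) = -\frac{1}{n} \sum_{i=1}^{n}
  \Big[ y_i \log \hat{f}(x_i) + (1 - y_i) \log\big(1 - \hat{f}(x_i)\big) \Big]

% L2 regularisation (weight decay): shrinks all coefficients towards zero
J_{L2}(a, b) = J(a, b) + \lambda \sum_{j=1}^{p} b_j^{2}

% L1 regularisation: can set some coefficients exactly to zero
J_{L1}(a, b) = J(a, b) + \lambda \sum_{j=1}^{p} \lvert b_j \rvert
```

Setting $\lambda = 0$ recovers the plain MLE, while larger values of $\lambda$ favour simpler models at the cost of a worse fit to the training data. The intercept $a$ is conventionally left unpenalised.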

The Linear Regression modelling is a well-understood approach with a very broad range of algorithms. A brief summary of other major regression algorithms is provided in Appendix A.1.2. For example, Demir et al. (2009) presented a predictive model for emergency readmission to hospital using the HES database. It evaluated the use of LR with a simple transition model to incorporate patients’ history of readmission and other covariates. The research focused on Chronic Obstructive Pulmonary Disease (COPD) patients who were admitted to England’s hospitals during a 7-year period. Factors including demographics, admission events and Length-of-Stay (LoS) were included, and ultimately the performance was compared using ROC. The research demonstrated that the use of only an administrative database and a simple phase-type distribution could effectively predict the risk; however, it was designed for a very specific cohort.

Wennberg et al. (2006) developed an LR model to predict hospital emergency readmission, similar to other NHS models (Lewis, 2011, DH, 2006, Nuffield-Trust, 2012). The developed model, the Combined Predictive Model (CPM), takes advantage of inpatient, outpatient, A&E and GP variables from five PCTs. However, very few performance statistics were reported for the model. The update of the CPM (Billings et al., 2013) was published in 2013 and reported a modestly high ROC for the model by including data from all four care sectors, but it had a very weak true positive rate (sensitivity).

Howell et al. (2009) used a multivariate LR to predict hospital readmission within 12 months for patients with chronic medical conditions, using Queensland Hospital data in Australia. The model includes demographics, socioeconomic status, geographic remoteness, comorbidities and previous care utilisation. The model has a modest performance and a very narrow focus.

Demir (2014) presented a comparison between Decision Trees, LR, GAM and MARS for predicting hospital readmission for a PCT in the UK. The benchmark shows that, for this particular population and very narrow specification, the LR came first and the others came very close.

In addition, one application of regression modelling is in the pathway modelling (i.e. the factors that arise from heterogeneity amongst patients) of the EoL (Appendix A.1.3) and frailty function modelling. For instance, a multinomial Logit model was developed by Adeyemi and Chaussalet (2009) for modelling COPD patients’ pathways. In the model, the patient frailties were regarded as mixed effect type, and the random effects distributions were modelled based on patient pathways. The model was successful in identifying the high probability pathways for survival and cost objective functions, but it must be tested on other cohorts of patients.

4.2.4 Decision Trees

In the 1980s, Decision Trees allowed efficient learning of nonlinear decision surfaces. Decision Trees in predictive modelling are defined by recursively partitioning the input space, and defining a sub-model in each resulting region of input space (Murphy, 2012, Zhou, 2012). There are many algorithms for Decision Trees, with specific criteria for building and training, including C4.5 by Quinlan (1986) and its commercial successor C5.0 (Quinlan, 2014), Classification And Regression Tree (CART) by Breiman et al. (1984) and Multivariate Adaptive Regression Splines (MARS) by Friedman (1991). Decision Trees are popular because of their ease of interpretation, ability to handle discrete and continuous features, insensitivity to monotone transformations of features, automated feature importance ranking, robustness to outliers and scalability in terms of the number of features and observations.
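Recursive partitioning can be illustrated with a minimal greedy tree builder in Python. The sketch uses the Gini impurity as the splitting criterion, in the spirit of CART, but it omits pruning, missing-value handling and every other refinement of the published algorithms; the function names, the tuple-based tree representation and the toy data are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature, threshold, weighted_gini) of the best axis-aligned split.

    Returns (None, None, gini(y)) when no split reduces the impurity.
    """
    best = (None, None, gini(y))
    n = len(y)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:  # keep both sides non-empty
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

def build_tree(X, y, max_depth=3):
    """Greedy recursion: a node is either a label (leaf) or (feature, threshold, left, right)."""
    j, t, _ = best_split(X, y)
    if j is None or max_depth == 0:
        return Counter(y).most_common(1)[0][0]  # leaf predicts the majority label
    left = [(r, yi) for r, yi in zip(X, y) if r[j] <= t]
    right = [(r, yi) for r, yi in zip(X, y) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in left], [yi for _, yi in left], max_depth - 1),
            build_tree([r for r, _ in right], [yi for _, yi in right], max_depth - 1))

def tree_predict(node, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] <= t else right
    return node

# Hypothetical toy data: only the first feature is informative.
X_tree = [[1, 5], [2, 7], [3, 5], [4, 7]]
y_tree = [0, 0, 1, 1]
tree_demo = build_tree(X_tree, y_tree)
```

Because each split compares one feature with a threshold, applying any monotone transformation to a feature leaves the resulting partition unchanged, which is the insensitivity property listed above.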

One of the main disadvantages of Decision Trees is that a small change in the distribution of the top features in the tree can have a large effect on the model. The algorithms that use a greedy search to find the optimal tree usually have lower accuracy than algorithms that use a more sophisticated optimisation approach, such as the hierarchical MxE (Murphy, 2012, Zhou, 2012).

For example, Austin (2007) compares CART, Generalised Additive Model (GAM), MARS and Logistic Regression (LR) for prediction of mortality after Acute Myocardial Infarction (AMI) hospitalisation, using Ontario hospital data. All the models demonstrated very close prediction performance, except the CART model, which suffered due to its algorithmic inability to incorporate complex or non-piecewise linear relationships.

In the following subsection, the Random Forest (RF) classifier is outlined, which is an ensemble of Decision Trees and provides a solution to the high variance in the estimations.

4.2.4.1 Random Forest (RF)

The Random Forest (RF) is an Ensemble Decision Tree, which was first introduced by Breiman (2001), and is based on the CART algorithm (Breiman et al., 1984) and the Bagging Ensemble method (Breiman, 1996). To reduce the correlation between the classifiers, the Breiman (2001) algorithm implements a technique to decorrelate the base learning trees based on a random feature selection.

Moreover, the Breiman RF is sensitive to highly correlated features (i.e. correlation bias), and to the scale or the number of categories of features. Although it can produce a very good fit for the data, the RF feature importance predictions must be treated with caution (Strobl et al., 2007, Toloşi and Lengauer, 2011). The cForest algorithm, proposed by Hothorn et al. (2006), is an alternative to the original RF. It is based on conditional inference trees and reduces selection bias, at the cost of a much higher computational burden. Therefore, if the original RF is going to be used, the input features must be pre-processed by a correlation analysis and a feature transformation approach, in order to obtain a less biased and more reliable feature importance ranking (Figure 4.1).

For a given training set $\{(X_i, Y_i)\}_{i=1}^{n}$ of input features and response variables, the Bagging part of the RF model can be represented as follows (Eq. 4.2):

$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x') \qquad (4.2)$$

where Bagging is carried out $B$ times, with random sampling with replacement at each iteration, and $f_b(x')$ represents the tree trained at iteration $b$, evaluated at the unseen sample $x'$. In addition, in the Random Forest, a random subset of the features is used at each candidate split in the trees.
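Eq. 4.2 and the random feature selection can be combined in a minimal Python sketch. To stay short, each "tree" here is a single-split regression stump, so drawing one feature subset per tree is the same as drawing it per split; a full RF would grow deep CART trees and redraw the subset at every split. The function names, the parameter `n_sub` and the toy data are assumptions for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(X, y, features):
    """Best single split over the given feature subset; leaves predict the mean of y."""
    best = None  # (score, feature, threshold, left_mean, right_mean)
    for j in features:
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t, mean(left), mean(right))
    if best is None:  # no valid split in this bootstrap sample
        m = mean(y)
        return lambda row: m
    _, j, t, lo, hi = best
    return lambda row: lo if row[j] <= t else hi

def random_forest_fit(X, y, B=50, n_sub=1, seed=0):
    """Train B stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        feats = rng.sample(range(p), n_sub)         # random feature subset
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return trees

def random_forest_predict(trees, row):
    """Eq. 4.2: average of the B tree predictions for an unseen sample."""
    return sum(f(row) for f in trees) / len(trees)

# Hypothetical toy data: feature 0 is informative, feature 1 is noise.
X_rf = [[0.1, 1], [0.2, 0], [0.3, 1], [0.4, 0],
        [0.6, 1], [0.7, 0], [0.8, 1], [0.9, 0]]
y_rf = [0, 0, 0, 0, 1, 1, 1, 1]
forest_demo = random_forest_fit(X_rf, y_rf)
```

Because the responses are coded as 0 and 1, `random_forest_predict` returns an averaged score between zero and one, and an unseen sample with a high value of the informative feature receives a higher score than one with a low value.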

For instance, Zheng et al. (2015) benchmarked the LACE score (van Walraven et al., 2010). The study used an RF, a particle swarm optimisation based SVM and a Radial basis function ANN for predicting hospital readmission using a small sample of heart failure (HF) patients. The presented statistics indicate that the SVM outperformed the rest, but with a very steep computational cost. Additionally, the RF was in second place with high accuracy and sensitivity.

Moreover, the RF is one of the most accurate learning algorithms, and it can handle missing observations efficiently. However, the main disadvantage of the RF, as with many other techniques, is its algorithmic weakness in dealing with noisy classification problems. Also, when the number of observations is lower than the number of features or the order of the problem’s convolutional structures, the RF can over-fit and under-perform.
