Bibliography: Evaluating Predictive Models

Last update: 29 July 2007

General

Alpaydin E. Introduction to Machine Learning. MIT Press, 2004.

An excellent introduction to the field of machine learning. As with most books on machine
learning, the emphasis is on classification. Chapter 14 describes the assessment and
comparison of classification algorithms.

Altman DG, Royston P. What do we mean by validating a prognostic model?
Statistics in Medicine 2000; 19: 453–73.

This paper examines (i) what is meant by validation of prognostic models, (ii) why it is
necessary, and (iii) how validations should be carried out. The emphasis is on conceptual
rather than technical issues. The idea of validating a prognostic model is generally taken
to mean that it works satisfactorily for patients other than those from whose data the
model was derived. The authors suggest that it is desirable to distinguish statistical from
clinical validity. Statistical validity means that the model is the best that can be found
with the available factors, while clinical validity means that the model predicts
accurately enough for its purpose; of course, this depends crucially on one's view of the
aims of the model. The paper pays considerable attention to the problem of overfitting. It
is known that analyses that are not prespecified but are data-dependent are liable to
overoptimistic conclusions. The data-dependent aspect of most prognostic models stems from
the variable selection and discretization procedures.

Duda RO, Hart PE, Stork DG. Pattern Classification. Wiley, 1997.

This classic textbook from 1973 was revised and updated in 1997. It covers a broad range

of pattern classification techniques. Chapter 2 discusses Bayesian decision theory.

Hand DJ. Construction and Assessment of Classification Rules. Wiley, 1997.

Chapter 6 of this book (Aspects of evaluation) presents a framework for understanding
model evaluation concepts. A distinction is made between measuring accuracy, precision,
separability, and resemblance. A large number of examples is provided, e.g. the
misclassification (i.e. error) rate and the Brier score are accuracy measures. The highly
popular misclassification rate is further investigated in Chapter 7, where several
cross-validation schemes (rotation, leave-one-out, bootstrap) for estimating the actual
error are discussed. Chapter 7 deals extensively with aspects of classification accuracy.

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data

Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.

Chapter 7 of this excellent textbook on statistical learning methods discusses model

assessment and selection methods based on loss (accuracy) functions.

Mitchell TM. Machine Learning. McGraw-Hill, 1997.

Another very good textbook on machine learning. Chapter 5 considers classification

accuracy, including the statistical estimation of performance.

Pepe MS. The Statistical Evaluation of Medical Tests for Classification and
Prediction. Oxford University Press, 2003.

Wasson JH, Sox HC, Neff RK, Goldman L. Clinical prediction rules. New England

Journal of Medicine 1985; 313: 793-9.

In this article, the following methodological standards for creating and validating clinical
prediction rules are proposed: 1. The event to be predicted (outcome) should be clearly
defined, preferably by biological rather than sociological or behavioral criteria; 2.
Predictive findings should be defined precisely and have a similar meaning to anyone who
may use them; 3. The list of predictors should not include any criteria that are used in
defining the outcome ("blind assessment"; most relevant for diagnostic rules); 4.
Characteristics of the patient population used to develop the rule should be clearly
described; 5. The study site and type of practice where the data were gathered should be
described; 6. An unbiased estimate of the rule's performance should be reported; 7. Effects
of using the rule should be prospectively measured; and 8. The statistical technique that
was used to derive the rule should be described.

Thirty-three publications of clinical prediction rules from the years 1981–1984 in four
leading medical journals were reviewed against these standards. Most of the criteria were
met by more than 80% of the studies. However, performance statistics were seldom reported
(11 publications), and the effects of clinical use were almost never prospectively measured
(2 publications).

Ch. 1 Introduction: Predictive Models and Evaluation

Abu-Hanna A, Lucas PJF. Prognostic models in medicine. Methods of Information in
Medicine 2001; 40: 1–5.

Wyatt JC, Altman DG. Commentary: Prognostic models: clinically useful or quickly
forgotten? British Medical Journal 1995; 311: 1539–41.

Few prognostic models are routinely used to inform difficult clinical decisions. Wyatt and

Altman believe that the main reasons why doctors reject published prognostic models are

lack of clinical credibility and lack of evidence that a prognostic model can support

decisions about patient care (that is, evidence of accuracy, generality, and effectiveness).

Ch. 3 Evaluating Probabilities – (i) ROC analysis

Bamber D. The area above the ordinal dominance graph and the area below the
receiver operating characteristic graph. Journal of Mathematical Psychology 1975;
12: 387–415.

In this paper it is shown that the area under the ROC curve (AUC) equals the probability
that a randomly chosen positive case was given a higher test value (or higher prediction by
a model) than a randomly chosen negative case. Furthermore, it is shown that the
estimated AUC is equivalent to the Mann-Whitney U statistic normalized by the number of
pairs of negative and positive cases.
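
The equivalence described above can be checked numerically. The following is a minimal
sketch, not code from the paper: the function names and the toy data are illustrative
assumptions, and ties between a positive and a negative case are counted as one half.

```python
from itertools import product


def auc_pairwise(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative case count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def mann_whitney_u(scores, labels):
    """Mann-Whitney U statistic of the positive cases, computed from midranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(labels)
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return rank_sum_pos - n_pos * (n_pos + 1) / 2


# Toy data (hypothetical): predicted probabilities and observed 0/1 outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

n_pairs = sum(labels) * (len(labels) - sum(labels))
print(auc_pairwise(scores, labels))              # pairwise definition
print(mann_whitney_u(scores, labels) / n_pairs)  # normalized U statistic
```

Both printed values agree, because U counts exactly the correctly ordered
positive-negative pairs, with ties counted as one half.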

DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or
more correlated receiver operating characteristic curves: a nonparametric approach.
Biometrics 1988; 44: 837–45.

A nonparametric method for comparing the areas under the ROC curves of two distinct

models on the same dataset. The method is based on the theory of generalized U statistics.

Hand DJ, Till RJ. A simple generalization of the area under the ROC curve to
multiple class classification problems. Machine Learning 2001; 45(2): 171–86.

Hanley JA, McNeil BJ. The meaning and use of the area under the receiver operating
characteristic (ROC) curve. Radiology 1982; 143: 29–36.

Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating
characteristic curves derived from the same cases. Radiology 1983; 148: 839–43.

Parametric method for comparing the areas under the ROC curves of two distinct models

on the same dataset, assuming a Normal (Gaussian) distribution of the AUC. Superseded

by the nonparametric method of DeLong et al. (1988).

Lasko TA, Bhagwat JG, Zou KH, Ohno-Machado L. The use of receiver operating
characteristic curves in biomedical informatics. Journal of Biomedical Informatics
2005; 38(5): 404–15.

An overview of different methods to estimate and compare areas on the ROC curve, and

software packages available for ROC analysis. The paper does not present new methods

but summarizes the existing literature. The approach is practical, and aims at readers who

need to choose between applying different methods.

Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978;
8(4): 283–98.

Charles Metz was one of the people who popularized the use of ROC analysis in medical

research.

Provost F, Fawcett T, Kohavi R. The case against accuracy estimation for comparing

induction algorithms. Proceedings of the 15th International Conference on Machine

Learning (ICML–98), pp. 445–53, 1998.

Machine Learning research has traditionally concentrated on designing algorithms for

building classifier functions, and the predominant evaluation methodology in this field is

classification accuracy (error rate) estimation. In this influential paper, the authors argue

that estimating classification accuracy is insufficient for comparing competing classifiers

and algorithms, because classification accuracy assumes equal misclassification costs and

a known marginal class distribution. However, both misclassification costs and marginal

class distribution can be unknown at the time of building the model and may even vary

from time to time, place to place, and situation to situation where the model is applied. For

these reasons, the authors argue that classifiers and induction algorithms should be

evaluated and compared using ROC analyses. Using ten datasets from the UCI repository

and several standard machine learning algorithms, it is shown that high classification

accuracy does not imply domination in ROC space. Therefore, comparing accuracies on

benchmark datasets says little, if anything, about classifier performance on real-world

tasks. (Note: the authors do not mention the fact that a model may be very imprecise (badly

calibrated) even though it dominates other models in ROC space. This is an imperfection of

ROC analysis.)

Somers RH. A new asymmetric measure of association for ordinal variables.
American Sociological Review 1962; 27: 799–811.

Somers' rank correlation $D_{xy}$ is a nonparametric measure of association between
ordinal variables, and is related to the concordance index C (= nonparametric
AUC) as follows: $D_{xy} = 2(C - 0.5)$.

Ch. 3 Evaluating Probabilities – (ii) Accuracy of probabilities

Ash A, Shwartz M. $R^2$: a useful measure of model performance when predicting a
dichotomous outcome. Statistics in Medicine 1999; 18: 375–84.

Brier GW. Verification of weather forecasts expressed in terms of probability.
Monthly Weather Review 1950; 78: 1–3.

The principal reference for the Brier inaccuracy score.
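
For orientation, the Brier score in the form commonly used for binary outcomes is the mean
squared difference between predicted probability and observed 0/1 outcome (lower is
better). The sketch below assumes that form; Brier's original formulation sums over all
outcome categories. The function name and the toy data are illustrative.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Toy data (hypothetical): four predictions and the outcomes that occurred.
probs = [0.9, 0.2, 0.6, 0.1]
outcomes = [1, 0, 0, 0]
print(brier_score(probs, outcomes))

# An uninformative forecast of 0.5 for every case always scores 0.25.
print(brier_score([0.5] * 4, outcomes))
```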

Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic Networks and

Expert Systems. Berlin: Springer-Verlag, 1999.

Chapter 10 of this textbook on probabilistic network models considers the problem of
checking models against data. Although some of the methods are specific to Bayesian
networks, others are general (Bayesian) statistical tools for evaluating predictive models.
Particular attention is paid to the logarithmic score (deviance).

Mittlböck M, Schemper M. Explained variation for logistic regression. Statistics in
Medicine 1996; 15: 1987–97.

Mittlböck and Schemper review 12 statistical measures that have been proposed to
quantify the explained variation of binary predictive models (in contrast to what the title
suggests, none of the measures is restricted to use in conjunction with logistic regression).
Six measures are based on the correlation of estimated probabilities and observed outcome
(e.g. Pearson correlation and Somers' D), four are based on reduction in dispersion of the
outcome (e.g. sum-of-squares $R^2$, Gini index, classification error), and two are based on
model likelihood (likelihood-ratio and Nagelkerke $R^2$).

Nagelkerke NJD. A note on a general definition of the coefficient of determination.
Biometrika 1991; 78(3): 691–2.

Nagelkerke proposes a measure of explained variation based on the likelihood ratio of a
binary predictive model (e.g. logistic regression) and the 'null' (intercept-only) model,
rescaled so that its maximum value is 1. This performance statistic is presented by some
statistical packages (e.g. SAS) as 'Nagelkerke $R^2$'. Although the statistic has some
attractive properties (e.g. consistency with classical $R^2$, consistency with maximum
likelihood estimation, independence of sample size), there are serious problems with its
interpretation (e.g. see Mittlböck and Schemper, 1996).
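
As an illustration, the statistic is usually written as
R2 = (1 - (L0/L1)^(2/n)) / (1 - L0^(2/n)), where L1 and L0 are the likelihoods of the
model and of the null model. The sketch below assumes a binary outcome, predicted
probabilities strictly between 0 and 1, and a null model that predicts the observed
prevalence for every case; the function names are illustrative, and in practice the
statistic is computed for a model fitted by maximum likelihood.

```python
import math


def log_likelihood(probs, outcomes):
    """Bernoulli log likelihood of 0/1 outcomes; probs must lie strictly in (0, 1)."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(probs, outcomes))


def nagelkerke_r2(probs, outcomes):
    """Nagelkerke R^2 of predicted probabilities against the intercept-only model."""
    n = len(outcomes)
    prevalence = sum(outcomes) / n  # intercept-only maximum likelihood prediction
    ll_model = log_likelihood(probs, outcomes)
    ll_null = log_likelihood([prevalence] * n, outcomes)
    # Generalized (Cox-Snell type) R^2: 1 - (L0 / L1)^(2/n)
    r2 = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Largest value r2 can attain for this outcome: 1 - L0^(2/n)
    r2_max = 1.0 - math.exp(2.0 * ll_null / n)
    return r2 / r2_max


outcomes = [1, 0, 1, 0]
print(nagelkerke_r2([0.5, 0.5, 0.5, 0.5], outcomes))      # no better than the null model
print(nagelkerke_r2([0.999, 0.001, 0.999, 0.001], outcomes))  # near-perfect predictions
```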

Redelmeier DA, Bloch DA, Hickam DH. Assessing predictive accuracy: How to
compare Brier scores. Journal of Clinical Epidemiology 1991; 44(11): 1141–6.

This paper presents a statistical method to compare the Brier scores from two different sets
of predictive assessments (predicted probabilities) on a single test set. The method is an
extension of Spiegelhalter's test of whether a given Brier score is incompatible with the
observed outcomes. A problem with the comparison method is that the test statistic depends
on the true, unknown probabilities. To solve this problem, the authors suggest using the
mean of both predictions. The paper contains a small example based on probability
judgements of five medical students who independently reviewed the symptoms and
electrocardiograms of 25 patients with recurrent chest pain.

Ch. 6 Assessing the fit of a Model

le Cessie S, van Houwelingen JC. A goodness-of-fit test for binary regression models,
based on smoothing methods. Biometrics 1991; 47: 1267–82.

Hosmer DW, Hosmer T, le Cessie S, Lemeshow S. A comparison of goodness-of-fit
tests for the logistic regression model. Statistics in Medicine 1997; 16: 965–80.

Experimental comparison of several goodness-of-fit tests for the logistic regression model.

Hosmer DW, Lemeshow S. Goodness-of-fit tests for the multiple logistic regression
model. Communications in Statistics 1980; A10: 1043–69.

This article presents the original Hosmer-Lemeshow goodness-of-fit test for logistic
regression models. The test is based on a statistic C that sums squared Pearson residuals
in g (usually 10) risk groups, where the grouping is either based on fixed values of the
estimated probabilities or on percentiles of the estimated probabilities (the latter approach
is often preferable). It was experimentally shown that the distribution of the statistic C is
well approximated by the $\chi^2$ distribution with $g - \lambda - 1$ degrees of freedom,
where $\lambda = 1$ if C is computed from the training data set, and $\lambda = 0$
otherwise. If C is large, this indicates that, at least in part of the feature space, the
estimated probabilities strongly deviate from the true probabilities. The Hosmer-Lemeshow
goodness-of-fit test as described here was implemented in many statistical packages and is
routinely applied in epidemiological research, even though it was later shown that the
statistic may be unstable and the test is therefore unreliable.
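
A minimal sketch of the percentile-based version of the statistic follows. It assumes
groups of (nearly) equal size formed after sorting by estimated probability and computes
only the statistic C, not the p-value. The function name is illustrative, and the
tie-handling conventions of statistical packages may differ.

```python
def hosmer_lemeshow_statistic(probs, outcomes, g=10):
    """Sum of squared Pearson residuals over g risk groups of (nearly) equal size.

    Cases are sorted by estimated probability and cut into g consecutive groups.
    A group whose mean estimated probability is exactly 0 or 1 would cause a
    division by zero; such degenerate input is not handled here.
    """
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    statistic = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]
        if not group:
            continue  # fewer cases than groups
        n_k = len(group)
        expected = sum(p for p, _ in group)  # expected number of events
        observed = sum(y for _, y in group)  # observed number of events
        mean_p = expected / n_k
        statistic += (observed - expected) ** 2 / (n_k * mean_p * (1.0 - mean_p))
    return statistic


# Toy data (hypothetical): two risk groups of four cases each.
probs = [0.25] * 4 + [0.75] * 4
well_calibrated = [1, 0, 0, 0, 1, 1, 1, 0]  # 1 of 4 and 3 of 4 events
too_moderate = [0, 0, 0, 0, 1, 1, 1, 1]     # outcomes more extreme than predicted
print(hosmer_lemeshow_statistic(probs, well_calibrated, g=2))
print(hosmer_lemeshow_statistic(probs, too_moderate, g=2))
```

The first call returns 0 because observed and expected event counts coincide in both
groups; the second returns a positive value.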

Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: John Wiley &

Sons, 2nd edition, 2000.

Chapter 5 of this textbook deals extensively with the evaluation of logistic regression
models. The evaluation methods described are partly specific to logistic regression
models (e.g. goodness-of-fit tests) and partly generic (e.g. ROC analysis).

Miller ME, Langefeld CD, Tierney WM, Hui SL, McDonald CJ. Validation of
probabilistic predictions. Medical Decision Making 1993; 13(1): 49–58.

Moons E, Aerts M, Wets G. A tree-based lack-of-fit for multiple logistic regression.
Statistics in Medicine 2004; 23: 1425–38.

Ch. 7 Model Validation

Bleeker SE, Moll HA, Steyerberg EW, Donders AR, Derksen-Lubsen G, Grobbee
DE, Moons KG. External validation is necessary in prediction research: a clinical
example. Journal of Clinical Epidemiology 2003; 56(9): 826–32.

Case study in pediatric diagnostic management (predicting bacterial infections) shows that

for relatively small data sets, internal validation of prediction models by bootstrap

techniques may not be sufficient and indicative for the model's performance in future

patients. External validation is essential before implementing prediction models in clinical

practice.

Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge:

Cambridge University Press, 1997.

Gant V, Rodway S, Wyatt JC. Artificial neural networks: practical considerations for
clinical applications. In Clinical Applications of Artificial Neural Networks
(Dybowski R, Gant V, eds.), Cambridge: Cambridge University Press, 2001, pp. 329–56.

Hadorn DC, Draper D, Rogers WH, Keeler EB, Brook RH. Cross-validation
performance of mortality prediction models. Statistics in Medicine 1992; 11(4): 475–89.

Early study on the performance of different modelling techniques (linear regression,

logistic regression, Cox regression, CART) in predicting death after acute myocardial

infarction. Similar, but more rigorous, studies were conducted by Steyerberg et al.

Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic
information. Annals of Internal Medicine 1999; 130: 515–24.

Miller ME, Hui SL, Tierney WM. Validation techniques for logistic regression
models. Statistics in Medicine 1991; 10(8): 1213–26.

This paper presents a comprehensive approach to the validation of logistic prediction

models. It reviews measures of overall goodness-of-fit, and indices of calibration and

refinement. Using a model-based approach developed by Cox, logistic regression

diagnostic techniques are adapted for use in model validation. This allows identification of

problematic predictor variables in the prediction model as well as influential observations

in the validation data that adversely affect the fit of the model. In appropriate situations,

recommendations are made for correction of models that provide poor fit.

Peek N, Arts DG, Bosman RJ, Van der Voort PH, De Keizer NF. External validation
of prognostic models for critically ill patients required substantial sample sizes.
Journal of Clinical Epidemiology 2007; 60(5): 491–501.

This study considers the behavior of predictive performance measures that are commonly

used in external validation of prognostic models. A resampling scheme was used to

investigate the effects of sample size; the domain of application was intensive care. The

AUC and Brier score showed large variation with small samples. It was found that

substantial sample sizes are required for performance assessment and model comparison

in external validation. Standard errors of AUC values were accurate but the power to

detect differences in performance was low. Calibration statistics and the associated

significance tests are extremely sensitive to sample size, and should not be used in these

settings. Instead, D. Cox’ customization method to repair lack-of-fit problems is

recommended. Direct comparison of performance, without statistical analysis, was

unreliable with either measure.

Schwarzer G, Vach W, Schumacher M. On the misuses of artificial neural networks
for prognostic and diagnostic classification in oncology. Statistics in Medicine 2000;
19: 541–61.

Schwarzer et al. present a critical review of applications of artificial neural networks

(ANNs) in biomedicine. The flexibility of ANNs is often cited as an advantage, but the

authors argue that it must be seen as a major concern. Several common pitfalls are

discussed (e.g. fitting implausible functions, incorrect modelling of survival data, and

biased estimation of network accuracy), and a review of the literature of ANN applications

in oncology is presented. Many of the 43 applications that are discussed show (severe)

methodological weaknesses.

Steyerberg EW, Harrell FE, Borsboom GJJM, Eijkemans MJC, Vergouwe Y,
Habbema JDF. Internal validation of predictive models: efficiency of some
procedures for logistic regression analysis. Journal of Clinical Epidemiology 2001;
54(8): 774–81.

The performance of a predictive model is overestimated when simply determined on the

sample of subjects that was used to construct the model. Several internal validation

methods are available that aim to provide a more accurate estimate of model performance

in new subjects. This study evaluated several variants of split-sample, cross-validation and

bootstrapping methods with a logistic regression model that included eight predictors for

30-day mortality after an acute myocardial infarction. Random samples of varying size

were drawn from a large data set. Split-sample analyses gave overly pessimistic estimates

of performance, with large variability. Cross-validation on 10% of the sample had low bias

and low variability, but was not suitable for all performance measures. Internal validity

could best be estimated with bootstrapping, which provided stable estimates with low bias.

Steyerberg EW, Bleeker SE, Moll HA, Grobbee DE, Moons KG. Internal and external
validation of predictive models: a simulation study of bias and precision in small
samples. Journal of Clinical Epidemiology 2003; 56(5): 441–7.

Simulation study to investigate the accuracy of bootstrap estimates of optimism (internal

validation) and the precision of performance estimates in independent validation samples

(external validation). Random samples were drawn from a data set on infectious diseases

in children, for the development (n=376) and validation (n=179) of logistic regression

models. Model development, including the selection of predictors, and validation were

repeated in a bootstrapping procedure. The average apparent ROC area was 0.74, which

was expected (based on bootstrapping) to decrease by 0.07 to 0.67, whereas the observed

decrease in the validation samples was 0.09 to 0.65. Omitting the selection of predictors

from the bootstrap procedure led to a severe underestimation of the optimism (decrease

0.006). The standard error of the observed ROC area in the independent validation

samples was large (0.05). So, for external validation, substantial sample sizes should be

used for sufficient power to detect clinically important changes in performance as

compared with the internally validated estimate.
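
The bootstrap estimate of optimism discussed above can be sketched as follows. This is an
illustration under strong simplifying assumptions, not the procedure from the paper: the
"model" is a toy that selects the single predictor column with the highest AUC (standing
in for logistic regression with predictor selection), and all function names are
hypothetical. What it shows is that the selection step is repeated inside every
bootstrap replicate.

```python
import random


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def develop(X, y):
    """Toy model development: select the column with the highest apparent AUC."""
    return max(range(len(X[0])), key=lambda j: auc([row[j] for row in X], y))


def performance(col, X, y):
    """AUC of the selected column on a given data set."""
    return auc([row[col] for row in X], y)


def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    """Return (apparent AUC, estimated optimism, optimism-corrected AUC)."""
    rng = random.Random(seed)
    n = len(y)
    apparent = performance(develop(X, y), X, y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # AUC is undefined without both outcome classes
        Xb = [X[i] for i in idx]
        col = develop(Xb, yb)  # selection is redone in every replicate
        # Optimism: performance in the bootstrap sample minus performance of the
        # same model in the original sample.
        optimism.append(performance(col, Xb, yb) - performance(col, X, y))
    mean_optimism = sum(optimism) / n_boot
    return apparent, mean_optimism, apparent - mean_optimism
```

Applied to pure-noise predictors, the apparent AUC of the selected column tends to lie
above 0.5, and subtracting the estimated optimism pulls it back towards chance level.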

Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD.
Validation and updating of predictive logistic regression models: a study on sample
size and shrinkage. Statistics in Medicine 2004; 23(16): 2567–86.

A logistic regression model may be used to provide predictions of outcome for individual

patients at another centre than where the model was developed. When empirical data are

available from this centre, the validity of predictions can be assessed by comparing

observed outcomes and predicted probabilities. Subsequently, the model may be updated to

improve predictions for future patients. In this study, a previously published model for

predicting 30-day mortality after acute myocardial infarction was validated and updated

with external validation samples that varied in size. Heuristic shrinkage approaches were

applied in the model revision methods, such that regression coefficients were shrunken

towards their re-calibrated values. Parsimonious updating methods were found preferable

to more extensive model revisions, which should only be attempted with relatively large

validation samples in combination with shrinkage.

Terrin N, Schmid CH, Griffith JL, D'Agostino RB, Selker HP. External validity of
predictive models: a comparison of logistic regression, classification trees, and neural
networks. Journal of Clinical Epidemiology 2003; 56(8): 721–9.

Simulation study that compared the external validity of standard logistic regression (LR1),

logistic regression with piecewise-linear and quadratic terms (LR2), classification trees,

and neural networks (NNETs). Predictive models were developed on data simulated from a

specified population and on data from perturbed forms of the population not representative

of the original distribution. All models were tested on new data generated from the

population. The performance of LR2 was superior to that of the other model types when the

models were developed on data sampled from the population and when they were

developed on nonrepresentative data. However, when the models developed using

nonrepresentative data were compared with models developed from data sampled from the

population, LR2 had the greatest loss in performance. These results highlight the necessity

of external validation to test the transportability of predictive models.

Vergouwe Y, Steyerberg EW, Eijkemans MJC, Habbema JDF. Validity of prognostic
models: When is a model clinically useful? Seminars in Urologic Oncology 2002;
20(2): 96–107.

Vergouwe et al. distinguish three aspects of validity of prognostic models: (1) agreement
between predicted probabilities and observed probabilities (calibration), (2) ability of the
model to distinguish subjects with different outcomes (discrimination), and (3) ability of the
model to improve the decision-making process (clinical usefulness). Several techniques for
visualizing and quantifying calibration and discrimination are discussed. Clinical
usefulness is inspected by considering classification accuracy, sensitivity, and specificity of
the model (after choosing a classification threshold), and by estimating the expected
decrease in disutility when the model is applied in practice. This is done by comparing the
model's classifications with conventional policy, weighing false-positive and
false-negative classified patients according to relative severity.

Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective
sample sizes were required for external validation studies of predictive logistic
regression models. Journal of Clinical Epidemiology 2005; 58(5): 475–83.

Simulation study in the field of oncology (predicting the probability that residual masses of

patients treated for metastatic testicular cancer contained only benign tissue) suggests that

a minimum of 100 events and 100 nonevents are required for external validation samples.

Zhu B-P, Lemeshow S, Hosmer DW, Klar J, Avrunin J, Teres D. Factors affecting the
performance of the models in the Mortality Probability Model II system and strategies
of customization: A simulation study. Critical Care Medicine 1996; 24: 57–63.