Bibliography: Evaluating Predictive Models
Last update: 29 July 2007
General
Alpaydin E. Introduction to Machine Learning. MIT Press, 2004.
An excellent introduction to the field of machine learning. As with most books on machine
learning, the emphasis is on classification. Chapter 14 describes the assessment and
comparison of classification algorithms.
Altman DG, Royston P. What do we mean by validating a prognostic model?
Statistics in Medicine 2000; 19: 453–73.
This paper examines (i) what is meant by validation of prognostic models, (ii) why
validation is necessary, and (iii) how validations should be carried out. The emphasis is
on conceptual rather than technical issues. Validating a prognostic model is generally
taken to mean showing that it works satisfactorily for patients other than those from
whose data the model was derived. The authors suggest that it is desirable to distinguish
statistical from clinical validity. Statistical validity means that the model is the best that
can be found with the available factors, while clinical validity means that the model
predicts accurately enough for its purpose; this depends crucially on one's view
of the aims of the model. The paper devotes considerable attention to the problem of
overfitting. Analyses that are not prespecified but data-dependent are known to be
liable to overoptimistic conclusions. The data-dependent aspect of most prognostic models
stems from the variable selection and discretization procedures.
Duda RO, Hart PE, Stork DG. Pattern Classification. Wiley, 1997.
This classic textbook from 1973 was revised and updated in 1997. It covers a broad range
of pattern classification techniques. Chapter 2 discusses Bayesian decision theory.
Hand DJ. Construction and Assessment of Classification Rules. Wiley, 1997.
Chapter 6 of this book (Aspects of evaluation) presents a framework for understanding
model evaluation concepts. A distinction is made between measuring accuracy, precision,
separability, and resemblance. A large number of examples is provided; e.g. the
misclassification (i.e. error) rate and the Brier score are accuracy measures. The highly
popular misclassification rate is further investigated in Chapter 7, where several
cross-validation schemes (rotation, leave-one-out, bootstrap) for estimating the actual error are
discussed. Chapter 7 deals extensively with aspects of classification accuracy.
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.
Chapter 7 of this excellent textbook on statistical learning methods discusses model
assessment and selection methods based on loss (accuracy) functions.
Mitchell TM. Machine Learning. McGraw-Hill, 1997.
Another very good textbook on machine learning. Chapter 5 considers classification
accuracy, including the statistical estimation of performance.
Pepe MS. The Statistical Evaluation of Medical Tests for Classification and
Prediction. Oxford University Press, 2003.
Wasson JH, Sox HC, Neff RK, Goldman L. Clinical prediction rules. New England
Journal of Medicine 1985; 313: 793-9.
In this article, the following methodological standards for creating and validating clinical
prediction rules are proposed: 1. The event to be predicted (outcome) should be clearly
defined, preferably by biological rather than sociological or behavioral criteria; 2.
Predictive findings should be defined precisely and have a similar meaning to anyone who
may use them; 3. The list of predictors should not include any criteria that are used in
defining the outcome ("blind assessment"; most relevant for diagnostic rules); 4.
Characteristics of the patient population used to develop the rule should be clearly
described; 5. The study site and type of practice where the data were gathered should be
described; 6. An unbiased estimate of the rule's performance should be reported; 7. Effects
of using the rule should be prospectively measured; and 8. The statistical technique that
was used to derive the rule should be described.
Thirty-three publications of clinical prediction rules during the years 1981–1984 in four
leading medical journals were reviewed by these standards. Most of the criteria were met
by more than 80% of the studies. However, performance statistics were seldom reported
(11 publications), and the effects of clinical use were almost never prospectively measured (2
publications).
Ch. 1 Introduction: Predictive Models and Evaluation
Abu-Hanna A, Lucas PJF. Prognostic models in medicine. Methods of Information in
Medicine 2001; 40: 1–5.
Wyatt JC, Altman DG. Commentary: Prognostic models: clinically useful or quickly
forgotten? British Medical Journal 1995; 311: 1539–41.
Few prognostic models are routinely used to inform difficult clinical decisions. Wyatt and
Altman believe that the main reasons why doctors reject published prognostic models are
lack of clinical credibility and lack of evidence that a prognostic model can support
decisions about patient care (that is, evidence of accuracy, generality, and effectiveness).
Ch. 3 Evaluating Probabilities – (i) ROC analysis
Bamber D. The area above the ordinal dominance graph and the area below the
receiver operating characteristic graph. Journal of Mathematical Psychology 1975;
12: 387–415.
In this paper it is shown that the area under the ROC curve (AUC) equals the probability
that a randomly chosen positive case was given a higher test value (or higher prediction by
a model) than a randomly chosen negative case. Furthermore, it is shown that the
estimated AUC is equivalent to the Mann-Whitney U statistic normalized by the number of
pairs of negative and positive cases.
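The equivalence can be checked numerically. The sketch below is only an illustration on made-up scores (the function names and the data are assumptions, not taken from Bamber's paper): it estimates the AUC by counting positive/negative pairs, with ties counted as half, and compares the result with the Mann-Whitney U statistic computed from midranks and divided by the number of pairs.

```python
from itertools import product

def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive case
    has the higher score; tied pairs count as half."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def mann_whitney_u(pos_scores, neg_scores):
    """U statistic of the positive group, computed from midranks."""
    pooled = sorted(pos_scores + neg_scores)
    midrank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # the value occupies 1-based positions i+1 .. j; store their average
        midrank[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    rank_sum = sum(midrank[s] for s in pos_scores)
    n_pos = len(pos_scores)
    return rank_sum - n_pos * (n_pos + 1) / 2.0

# Made-up test values for four positive and five negative cases.
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.5, 0.4, 0.55, 0.1]

auc = auc_pairwise(pos, neg)
u = mann_whitney_u(pos, neg)
print(auc, u / (len(pos) * len(neg)))  # both print 0.875
```

The pair-counting version is quadratic in the sample size; the rank-based version needs only a sort, which is why it is usually preferred for larger data sets.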
DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or
more correlated receiver operating characteristic curves: a nonparametric approach.
Biometrics 1988; 44: 837–45.
A nonparametric method for comparing the areas under the ROC curves of two distinct
models on the same dataset. The method is based on the theory of generalized U statistics.
Hand DJ, Till RJ. A simple generalization of the area under the ROC curve to
multiple class classification problems. Machine Learning 2001; 45(2): 171–86.
Hanley JA, McNeil BJ. The meaning and use of the area under the receiver operating
characteristic (ROC) curve. Radiology 1982; 143: 29–36.
Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating
characteristic curves derived from the same cases. Radiology 1983; 148: 839–43.
Parametric method for comparing the areas under the ROC curves of two distinct models
on the same dataset, assuming a Normal (Gaussian) distribution of the AUC. Superseded
by the nonparametric method of DeLong et al. (1988).
Lasko TA, Bhagwat JG, Zou KH, Ohno-Machado L. The use of receiver operating
characteristic curves in biomedical informatics. Journal of Biomedical Informatics
2005; 38(5): 404–15.
An overview of different methods to estimate and compare areas under ROC curves, and of
software packages available for ROC analysis. The paper does not present new methods
but summarizes the existing literature. The approach is practical and is aimed at readers who
need to choose between different methods.
Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978;
8(4): 283–98.
Charles Metz was one of the people who popularized the use of ROC analysis in medical
research.
Provost F, Fawcett T, Kohavi R. The case against accuracy estimation for comparing
induction algorithms. Proceedings of the 15th International Conference on Machine
Learning (ICML–98), pp. 445–53, 1998.
Machine Learning research has traditionally concentrated on designing algorithms for
building classifier functions, and the predominant evaluation methodology in this field is
classification accuracy (error rate) estimation. In this influential paper, the authors argue
that estimating classification accuracy is insufficient for comparing competing classifiers
and algorithms, because classification accuracy assumes equal misclassification costs and
a known marginal class distribution. However, both misclassification costs and marginal
class distribution can be unknown at the time of building the model and may even vary
from time to time, place to place, and situation to situation where the model is applied. For
these reasons, the authors argue that classifiers and induction algorithms should be
evaluated and compared using ROC analyses. Using ten datasets from the UCI repository
and several standard machine learning algorithms, it is shown that high classification
accuracy does not imply domination in ROC space. Therefore, comparing accuracies on
benchmark datasets says little, if anything, about classifier performance on real-world
tasks. (Note: the authors do not mention the fact that a model may be very imprecise (badly
calibrated) even though it dominates other models in ROC space. This is an imperfection of
ROC analysis.)
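The point made in the note can be illustrated with a small, made-up example (the data and the two "models" below are hypothetical). ROC analysis depends only on the rank order of the predictions, so a strictly increasing transformation of the predicted probabilities leaves the AUC unchanged while the accuracy of the probabilities themselves, measured here with the Brier score, can get much worse.

```python
def auc(probs, outcomes):
    """Nonparametric AUC: share of positive/negative pairs ranked correctly."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def brier(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
model_a = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
# Hypothetical model B: same ranking as A, but every probability is pushed
# towards zero by a strictly increasing transformation.
model_b = [p ** 4 for p in model_a]

print(auc(model_a, outcomes), auc(model_b, outcomes))      # identical AUCs
print(brier(model_a, outcomes), brier(model_b, outcomes))  # B is clearly worse
```

Both models trace exactly the same ROC curve, yet model B systematically underestimates the risk of the positive cases.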
Somers RH. A new asymmetric measure of association for ordinal variables.
American Sociological Review 1962; 27: 799–811.
Somers' rank correlation D_xy is a nonparametric measure of association between
ordinal variables, and is related to the concordance index C (= nonparametric
AUC) as follows: D_xy = 2(C - 0.5).
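For a binary outcome the relation can be verified by counting pairs. The sketch below uses made-up scores and assumes the usual pair-counting definitions: D_xy is the number of concordant minus discordant positive/negative pairs divided by the number of such pairs, and C counts tied pairs as half.

```python
def concordance_counts(scores, outcomes):
    """Count concordant, discordant and tied (positive, negative) pairs."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = sum(1 for a in pos for b in neg if a > b)
    discordant = sum(1 for a in pos for b in neg if a < b)
    tied = len(pos) * len(neg) - concordant - discordant
    return concordant, discordant, tied

scores = [0.2, 0.4, 0.4, 0.6, 0.7, 0.9]
outcomes = [0, 0, 1, 0, 1, 1]

conc, disc, tied = concordance_counts(scores, outcomes)
n_pairs = conc + disc + tied

c_index = (conc + 0.5 * tied) / n_pairs   # concordance index C
somers_d = (conc - disc) / n_pairs        # Somers' D_xy

print(c_index, somers_d, 2 * (c_index - 0.5))
```

D_xy simply rescales C from the interval [0, 1] to [-1, 1], so that 0 corresponds to no discrimination.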
Ch. 3 Evaluating Probabilities – (ii) Accuracy of probabilities
Ash A, Shwartz M. R²: a useful measure of model performance when predicting a
dichotomous outcome. Statistics in Medicine 1999; 18: 375–84.
Brier GW. Verification of weather forecasts expressed in terms of probability.
Monthly Weather Review 1950; 78: 1–3.
The principal reference for the Brier inaccuracy score.
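As a minimal sketch on made-up forecasts: for a binary outcome the score is commonly computed as the mean squared difference between the predicted probability and the 0/1 outcome. Brier's original formulation sums the squared error over all outcome classes, which for two classes gives exactly twice that value; the function names below are illustrative only.

```python
def brier_score(probs, outcomes):
    """Common binary form: mean of (p - y)^2, ranging from 0 to 1."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_score_two_class(probs, outcomes):
    """Form that sums over both classes (event and non-event), ranging
    from 0 to 2; for binary outcomes this is twice the common form."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += (p - y) ** 2 + ((1 - p) - (1 - y)) ** 2
    return total / len(probs)

# Made-up probability forecasts of rain and what actually happened.
probs = [0.9, 0.7, 0.2, 0.1]
outcomes = [1, 0, 0, 1]

print(brier_score(probs, outcomes))            # approx. 0.3375
print(brier_score_two_class(probs, outcomes))  # approx. 0.675
print(brier_score([0.5] * 4, outcomes))        # 0.25 for an uninformative forecast
```

Lower values are better; in this example the forecaster does worse than someone who always predicts 0.5, because two of the four forecasts were confidently wrong.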
Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic Networks and
Expert Systems. Berlin: Springer-Verlag, 1999.
Chapter 10 of this textbook on probabilistic network models considers the problem of
checking models against data. Although some of the methods are specific to Bayesian
networks, others are general (Bayesian) statistical tools for evaluating predictive models.
Particular attention is paid to the logarithmic score (deviance).
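A minimal sketch of the logarithmic score for binary predictions, on made-up data and not taken from the book: each prediction is penalized by the negative logarithm of the probability it assigned to the outcome that actually occurred, and the deviance is taken here as twice the sum of these penalties (i.e. minus twice the log-likelihood).

```python
import math

def log_score(probs, outcomes):
    """Mean negative log of the probability assigned to the observed outcome."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(probs)

def deviance(probs, outcomes):
    """Minus twice the log-likelihood of the observed outcomes."""
    return 2.0 * len(probs) * log_score(probs, outcomes)

probs = [0.9, 0.7, 0.2, 0.1]
outcomes = [1, 1, 0, 0]

print(log_score(probs, outcomes), deviance(probs, outcomes))

# A confident but wrong prediction is penalized far more heavily than a
# cautious one: the score is unbounded as the assigned probability goes to 0.
print(log_score([0.01], [1]), log_score([0.4], [1]))
```

Unlike the Brier score, the logarithmic score is unbounded, so a single prediction of exactly 0 or 1 that turns out wrong makes it infinite (the code above would raise an error in that case).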
Mittlböck M, Schemper M. Explained variation for logistic regression. Statistics in
Medicine 1996; 15: 1987–97.
Mittlböck and Schemper review 12 statistical measures that have been proposed to
quantify the explained variation of binary predictive models (in contrast to what the title
suggests, none of the measures is restricted to use in conjunction with logistic regression).
Six measures are based on the correlation of estimated probabilities and observed outcome
(e.g. Pearson correlation and Somers' D), four are based on reduction in dispersion of the
outcome (e.g. sum-of-squares R², Gini index, classification error), and two are based on
model likelihood (likelihood-ratio and Nagelkerke R²).
Nagelkerke NJD. A note on a general definition of the coefficient of determination.
Biometrika 1991; 78(3): 691–2.
Nagelkerke proposes a measure of explained variation for a binary predictive model (e.g.
logistic regression) that is based on the likelihood ratio between the model and the 'null'
(intercept-only) model, rescaled so that its maximum value is 1. This performance statistic
is presented by some statistical packages (e.g. SAS) as 'Nagelkerke R²'. Although the
statistic has some attractive properties (e.g. consistency with classical R², consistency
with maximum likelihood estimation, independence of sample size), there are serious
problems with its interpretation (e.g. see Mittlböck and Schemper, 1996).
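As the statistic is usually stated, it divides the likelihood-ratio based R² = 1 - (L0/L1)^(2/n) by the maximum value that this quantity can attain for the data at hand, where L0 and L1 are the likelihoods of the null model and the fitted model. The sketch below applies that formula to made-up predicted probabilities; in practice the probabilities would come from a model fitted by maximum likelihood, which is not done here.

```python
import math

def log_likelihood(probs, outcomes):
    """Bernoulli log-likelihood of the outcomes under the predictions."""
    return sum(math.log(p if y == 1 else 1.0 - p)
               for p, y in zip(probs, outcomes))

def nagelkerke_r2(probs, outcomes):
    n = len(outcomes)
    prevalence = sum(outcomes) / n
    ll_model = log_likelihood(probs, outcomes)
    ll_null = log_likelihood([prevalence] * n, outcomes)  # intercept-only model
    r2_lr = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)  # 1 - (L0/L1)^(2/n)
    r2_max = 1.0 - math.exp(2.0 * ll_null / n)              # 1 - L0^(2/n)
    return r2_lr / r2_max

outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]  # made-up predictions

r2 = nagelkerke_r2(probs, outcomes)
r2_null = nagelkerke_r2([0.5] * 8, outcomes)  # predicting the prevalence
print(r2, r2_null)
```

Predicting the prevalence for everyone gives a value of 0, and predictions that approach the observed outcomes drive the value towards 1.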
Redelmeier DA, Bloch DA, Hickam DH. Assessing predictive accuracy: How to
compare Brier scores. Journal of Clinical Epidemiology 1991; 44(11): 1141–6.
This paper presents a statistical method to compare the Brier scores from two different sets
of predictive assessments (predicted probabilities) on a single test set. The method is an
extension of Spiegelhalter's test of whether a given Brier score is incompatible with the
observed outcomes. A problem with the comparison method is that the test statistic depends
on the true, unknown probabilities. To solve this problem, the authors suggest using the
mean of both predictions. The paper contains a small example based on probability
judgements of five medical students who independently reviewed the symptoms and
electrocardiograms of 25 patients with recurrent chest pain.
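The single-model test that the method extends can be sketched as follows. This is the z statistic as it is commonly stated for Spiegelhalter's test, applied to made-up data; it is not the two-model comparison proposed in the paper. Under the null hypothesis that the predicted probabilities are the true ones, the Brier score has a known mean and variance, and the standardized difference is approximately standard Normal.

```python
import math

def spiegelhalter_z(probs, outcomes):
    """Standardized difference between the observed Brier score and its
    expectation when the predicted probabilities are the true ones."""
    numerator = sum((y - p) * (1.0 - 2.0 * p)
                    for p, y in zip(probs, outcomes))
    variance = sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p) for p in probs)
    if variance == 0.0:
        # happens when every prediction is exactly 0, 0.5 or 1
        raise ValueError("test statistic is undefined for these predictions")
    return numerator / math.sqrt(variance)

outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]  # made-up predictions

z = spiegelhalter_z(probs, outcomes)
print(z)  # |z| < 1.96: no evidence against the predictions at the 5% level
```

With only eight cases the Normal approximation is of course crude; the example is meant to show the arithmetic, not a realistic sample size.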
Ch. 6 Assessing the Fit of a Model
le Cessie S, van Houwelingen JC. A goodness-of-fit test for binary regression models,
based on smoothing methods. Biometrics 1991; 47: 1267–82.
Hosmer DW, Hosmer T, le Cessie S, Lemeshow S. A comparison of goodness-of-fit
tests for the logistic regression model. Statistics in Medicine 1997; 16: 965–80.
Experimental comparison of several goodness-of-fit tests for the logistic regression model.
Hosmer DW, Lemeshow S. Goodness-of-fit tests for the multiple logistic regression
model. Communications in Statistics 1980; A10: 1043–69.
This article presents the original Hosmer-Lemeshow goodness-of-fit test for logistic
regression models. The test is based on a statistic C that sums squared Pearson residuals in
g (usually 10) risk groups, where the grouping is either based on fixed values of the
estimated probabilities or on percentiles of the estimated probabilities (the latter approach
is often preferable). It was experimentally shown that the distribution of the statistic C is
well approximated by the χ² distribution with g - λ - 1 degrees of freedom, where λ = 1 if C
is computed from the training data set, and λ = 0 otherwise. If C is large, this indicates that,
at least in part of the feature space, the estimated probabilities strongly deviate from the true
probabilities. The Hosmer-Lemeshow goodness-of-fit test as described here was
implemented in many statistical packages and is routinely applied in epidemiological
research, even though it was later shown that the statistic may be unstable and the test is
therefore unreliable.
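The statistic itself is simple to compute. The sketch below is a minimal version of the percentile-based variant on made-up data: cases are sorted by estimated probability, split into g groups of (nearly) equal size, and each group contributes its squared Pearson residual. It returns only C; the p-value would follow from the χ² distribution with the degrees of freedom given above.

```python
def hosmer_lemeshow_c(probs, outcomes, g=10):
    """Sum over g risk groups of (observed - expected)^2 / (n_k * m * (1 - m)),
    where m is the mean estimated probability in the group."""
    pairs = sorted(zip(probs, outcomes))  # order cases by estimated probability
    n = len(pairs)
    c = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]
        if not group:
            continue
        n_k = len(group)
        observed = sum(y for _, y in group)
        expected = sum(p for p, _ in group)
        mean_p = expected / n_k
        # note: a group whose mean probability is exactly 0 or 1 would
        # cause a division by zero here
        c += (observed - expected) ** 2 / (n_k * mean_p * (1.0 - mean_p))
    return c

# Two made-up risk groups of ten cases each, with estimated risks 0.2 and 0.5.
probs = [0.2] * 10 + [0.5] * 10
well_calibrated = [1, 1] + [0] * 8 + [1] * 5 + [0] * 5   # 2 and 5 events
miscalibrated = [1] * 6 + [0] * 4 + [1] * 5 + [0] * 5    # 6 and 5 events

print(hosmer_lemeshow_c(probs, well_calibrated, g=2))  # approx. 0
print(hosmer_lemeshow_c(probs, miscalibrated, g=2))    # approx. 10
```

The instability mentioned above is easy to provoke with this kind of code: with tied probabilities or small samples, the value of C can change noticeably depending on where the group boundaries happen to fall.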
Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: John Wiley &
Sons, 2nd edition, 2000.
Chapter 5 of this textbook deals extensively with the evaluation of logistic regression
models. The evaluation methods described are partly specific to logistic regression
models (e.g. goodness-of-fit tests) and partly generic (e.g. ROC analysis).
Miller ME, Langefeld CD, Tierney WM, Hui SL, McDonald CJ. Validation of
probabilistic predictions. Medical Decision Making 1993; 13(1): 49–58.
Moons E, Aerts M, Wets G. A tree-based lack-of-fit test for multiple logistic regression.
Statistics in Medicine 2004; 23: 1425–38.
Ch. 7 Model Validation
Bleeker SE, Moll HA, Steyerberg EW, Donders AR, Derksen-Lubsen G, Grobbee
DE, Moons KG. External validation is necessary in prediction research: a clinical
example. Journal of Clinical Epidemiology 2003; 56(9): 826–32.
A case study in pediatric diagnostic management (predicting bacterial infections) shows that
for relatively small data sets, internal validation of prediction models by bootstrap
techniques may not be sufficient, and may not be indicative of the model's performance in
future patients. External validation is essential before implementing prediction models in
clinical practice.
Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge:
Cambridge University Press, 1997.
Gant V, Rodway S, Wyatt JC. Artificial neural networks: practical considerations for
clinical applications. In Clinical Applications of Artificial Neural Networks
(Dybowski R, Gant V, eds.), Cambridge: Cambridge University Press, 2001, pp. 329–56.
Hadorn DC, Draper D, Rogers WH, Keeler EB, Brook RH. Cross-validation
performance of mortality prediction models. Statistics in Medicine 1992; 11(4): 475–89.
Early study on the performance of different modelling techniques (linear regression,
logistic regression, Cox regression, CART) in predicting death after acute myocardial
infarction. Similar, but more rigorous, studies were conducted by Steyerberg et al.
Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic
information. Annals of Internal Medicine 1999; 130: 515–24.
Miller ME, Hui SL, Tierney WM. Validation techniques for logistic regression
models. Statistics in Medicine 1991; 10(8): 1213–26.
This paper presents a comprehensive approach to the validation of logistic prediction
models. It reviews measures of overall goodness-of-fit, and indices of calibration and
refinement. Using a model-based approach developed by Cox, logistic regression
diagnostic techniques are adapted for use in model validation. This allows identification of
problematic predictor variables in the prediction model as well as influential observations
in the validation data that adversely affect the fit of the model. In appropriate situations,
recommendations are made for correction of models that provide poor fit.
Peek N, Arts DG, Bosman RJ, Van der Voort PH, De Keizer NF. External validation
of prognostic models for critically ill patients required substantial sample sizes.
Journal of Clinical Epidemiology 2007; 60(5): 491–501.
This study considers the behavior of predictive performance measures that are commonly
used in external validation of prognostic models. A resampling scheme was used to
investigate the effects of sample size; the domain of application was intensive care. The
AUC and Brier score showed large variation with small samples. It was found that
substantial sample sizes are required for performance assessment and model comparison
in external validation. Standard errors of AUC values were accurate but the power to
detect differences in performance was low. Calibration statistics and the associated
significance tests are extremely sensitive to sample size, and should not be used in these
settings. Instead, Cox's customization method for repairing lack-of-fit problems is
recommended. Direct comparison of performance, without statistical analysis, was
unreliable with either measure.
Schwarzer G, Vach W, Schumacher M. On the misuses of artificial neural networks
for prognostic and diagnostic classification in oncology. Statistics in Medicine 2000;
19: 541–61.
Schwarzer et al. present a critical review of applications of artificial neural networks
(ANNs) in biomedicine. The flexibility of ANNs is often cited as an advantage, but the
authors argue that it must be seen as a major concern. Several common pitfalls are
discussed (e.g. fitting implausible functions, incorrect modelling of survival data, and
biased estimation of network accuracy), and a review of the literature of ANN applications
in oncology is presented. Many of the 43 applications that are discussed show (severe)
methodological weaknesses.
Steyerberg EW, Harrell FE, Borsboom GJJM, Eijkemans MJC, Vergouwe Y,
Habbema JDF. Internal validation of predictive models: efficiency of some
procedures for logistic regression analysis. Journal of Clinical Epidemiology 2001;
54(8): 774–81.
The performance of a predictive model is overestimated when simply determined on the
sample of subjects that was used to construct the model. Several internal validation
methods are available that aim to provide a more accurate estimate of model performance
in new subjects. This study evaluated several variants of split-sample, cross-validation and
bootstrapping methods with a logistic regression model that included eight predictors for
30-day mortality after an acute myocardial infarction. Random samples of varying size
were drawn from a large data set. Split-sample analyses gave overly pessimistic estimates
of performance, with large variability. Cross-validation on 10% of the sample had low bias
and low variability, but was not suitable for all performance measures. Internal validity
could best be estimated with bootstrapping, which provided stable estimates with low bias.
Steyerberg EW, Bleeker SE, Moll HA, Grobbee DE, Moons KG. Internal and external
validation of predictive models: a simulation study of bias and precision in small
samples. Journal of Clinical Epidemiology 2003; 56(5): 441–7.
Simulation study to investigate the accuracy of bootstrap estimates of optimism (internal
validation) and the precision of performance estimates in independent validation samples
(external validation). Random samples were drawn from a data set on infectious diseases
in children, for the development (n=376) and validation (n=179) of logistic regression
models. Model development, including the selection of predictors, and validation were
repeated in a bootstrapping procedure. The average apparent ROC area was 0.74, which
was expected (based on bootstrapping) to decrease by 0.07 to 0.67, whereas the observed
decrease in the validation samples was 0.09 to 0.65. Omitting the selection of predictors
from the bootstrap procedure led to a severe underestimation of the optimism (decrease
0.006). The standard error of the observed ROC area in the independent validation
samples was large (0.05). So, for external validation, substantial sample sizes should be
used for sufficient power to detect clinically important changes in performance as
compared with the internally validated estimate.
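The bootstrap estimate of optimism used in such studies can be sketched as follows. This is only an illustration on simulated noise, not the authors' procedure: "model development" is reduced to a hypothetical toy step that picks, out of five pure-noise predictors, the one with the highest apparent AUC. The key point is that this selection step is repeated inside every bootstrap sample; the optimism is the average difference between the AUC in the bootstrap sample and the AUC of the same selected predictor in the original sample.

```python
import random

def auc(scores, outcomes):
    """Nonparametric AUC: share of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def select_best_predictor(rows, outcomes):
    """Toy 'model development': pick the column with the highest AUC."""
    return max(range(len(rows[0])),
               key=lambda j: auc([r[j] for r in rows], outcomes))

def optimism_corrected_auc(rows, outcomes, n_boot=100, seed=1):
    rng = random.Random(seed)
    n = len(rows)
    best = select_best_predictor(rows, outcomes)
    apparent = auc([r[best] for r in rows], outcomes)
    optimism_sum, done = 0.0, 0
    while done < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        y_b = [outcomes[i] for i in idx]
        if len(set(y_b)) < 2:
            continue  # AUC is undefined unless both classes are present
        x_b = [rows[i] for i in idx]
        j = select_best_predictor(x_b, y_b)  # selection repeated in each sample
        auc_boot = auc([r[j] for r in x_b], y_b)
        auc_orig = auc([r[j] for r in rows], outcomes)
        optimism_sum += auc_boot - auc_orig
        done += 1
    optimism = optimism_sum / n_boot
    return apparent, optimism, apparent - optimism

# Simulated data: 60 cases, alternating outcomes, five pure-noise predictors.
data_rng = random.Random(0)
outcomes = [i % 2 for i in range(60)]
rows = [[data_rng.random() for _ in range(5)] for _ in range(60)]

apparent, optimism, corrected = optimism_corrected_auc(rows, outcomes)
print(apparent, optimism, corrected)
```

If `select_best_predictor` were called only once on the original data and the same column reused in every bootstrap sample, the selection step would be left out of the procedure, which is the situation in which the study found the optimism to be severely underestimated.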
Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD.
Validation and updating of predictive logistic regression models: a study on sample
size and shrinkage. Statistics in Medicine 2004; 23(16): 2567–86.
A logistic regression model may be used to provide predictions of outcome for individual
patients at a centre other than the one where the model was developed. When empirical data are
available from this centre, the validity of predictions can be assessed by comparing
observed outcomes and predicted probabilities. Subsequently, the model may be updated to
improve predictions for future patients. In this study, a previously published model for
predicting 30-day mortality after acute myocardial infarction was validated and updated
with external validation samples that varied in size. Heuristic shrinkage approaches were
applied in the model revision methods, such that regression coefficients were shrunken
towards their re-calibrated values. Parsimonious updating methods were found preferable
to more extensive model revisions, which should only be attempted with relatively large
validation samples in combination with shrinkage.
Terrin N, Schmid CH, Griffith JL, D'Agostino RB, Selker HP. External validity of
predictive models: a comparison of logistic regression, classification trees, and neural
networks. Journal of Clinical Epidemiology 2003; 56(8): 721–9.
Simulation study that compared the external validity of standard logistic regression (LR1),
logistic regression with piecewise-linear and quadratic terms (LR2), classification trees,
and neural networks (NNETs). Predictive models were developed on data simulated from a
specified population and on data from perturbed forms of the population not representative
of the original distribution. All models were tested on new data generated from the
population. The performance of LR2 was superior to that of the other model types when the
models were developed on data sampled from the population and when they were
developed on nonrepresentative data. However, when the models developed using
nonrepresentative data were compared with models developed from data sampled from the
population, LR2 had the greatest loss in performance. These results highlight the necessity
of external validation to test the transportability of predictive models.
Vergouwe Y, Steyerberg EW, Eijkemans MJC, Habbema JDF. Validity of prognostic
models: When is a model clinically useful? Seminars in Urologic Oncology 2002;
20(2): 96–107.
Vergouwe et al. distinguish three aspects of validity of prognostic models: (1) agreement
between predicted probabilities and observed probabilities (calibration), (2) ability of the
model to distinguish subjects with different outcomes (discrimination), and (3) ability of the
model to improve the decision-making process (clinical usefulness). Several techniques for
visualizing and quantifying calibration and discrimination are discussed. Clinical
usefulness is inspected by considering classification accuracy, sensitivity, and specificity of
the model (after choosing a classification threshold), and by estimating the expected
decrease in disutility when the model is applied in practice. This is done by comparing the
model's classifications with conventional policy, weighting false-positive and
false-negative classified patients according to relative severity.
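A minimal sketch of the threshold-based measures, on made-up predictions: after choosing a classification threshold, the confusion counts give accuracy, sensitivity and specificity, and a weighted sum of the two kinds of error gives a simple disutility. The weights used below (a false negative counted as four times as severe as a false positive) are purely hypothetical and are not taken from the paper.

```python
def confusion_counts(probs, outcomes, threshold):
    """Return (tp, fp, tn, fn) when cases with p >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, outcomes):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]  # made-up predictions

tp, fp, tn, fn = confusion_counts(probs, outcomes, threshold=0.5)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / len(outcomes)

# Hypothetical relative severities of the two kinds of misclassification.
WEIGHT_FALSE_NEGATIVE = 4.0
WEIGHT_FALSE_POSITIVE = 1.0
disutility = WEIGHT_FALSE_NEGATIVE * fn + WEIGHT_FALSE_POSITIVE * fp

print(sensitivity, specificity, accuracy, disutility)  # 0.75 0.75 0.75 5.0
```

Repeating the calculation for a conventional policy (for example, treating everyone) and taking the difference in disutility gives the kind of comparison described above.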
Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective
sample sizes were required for external validation studies of predictive logistic
regression models. Journal of Clinical Epidemiology 2005; 58(5): 475–83.
Simulation study in the field of oncology (predicting the probability that residual masses of
patients treated for metastatic testicular cancer contained only benign tissue) suggests that
a minimum of 100 events and 100 nonevents are required for external validation samples.
Zhu B-P, Lemeshow S, Hosmer DW, Klar J, Avrunin J, Teres D. Factors affecting the
performance of the models in the Mortality Probability Model II system and strategies
of customization: A simulation study. Critical Care Medicine 1996; 24: 57–63.