A Receiver Operating Characteristic (ROC) curve is a technique for visualizing, organizing and selecting classifiers based on their performance. In essence, it is another performance evaluation technique for classification models. ROC curves have long been used in signal detection theory to depict the tradeoff between hit rates and false alarm rates of classifiers.⁴ The use of ROC analysis has been extended to visualizing and analyzing the behavior of diagnostic systems.⁵ Recently, the medical decision making community has developed an extensive literature on the use of ROC curves as one of the primary methods for diagnostic testing.⁶
4 J.P. Egan (1975). Signal Detection Theory and ROC Analysis, Series in Cogniti-
Given a classifier and an instance, there are four possible prediction outcomes. If the instance is positive and it is classified as positive, it is counted as a true positive; if it is classified as negative, it is counted as a false negative. If the instance is negative and it is classified as negative, it is counted as a true negative; if it is classified as positive, it is counted as a false positive. Given a classifier and a set of instances (the test set), a two-by-two coincidence matrix (also called a contingency table) can be constructed representing the dispositions of the set of instances (see Fig. 10.1). This matrix forms the basis for many common metrics, including ROC curves.
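The four outcomes described above can be tallied directly from a test set. The following is a minimal sketch in Python, assuming binary labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample data are hypothetical and serve only as an illustration.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count the four prediction outcomes for a binary classifier.

    `actual` and `predicted` are equal-length sequences of class labels;
    `positive` is the label treated as the positive class (an assumption
    of this sketch, not something fixed by the text).
    """
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            # Positive instance: either a hit or a miss.
            counts["TP" if p == positive else "FN"] += 1
        else:
            # Negative instance: either a false alarm or a correct rejection.
            counts["FP" if p == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FN", "FP", "TN")}


# Hypothetical test set of ten instances: 1 = positive, 0 = negative.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

matrix = confusion_counts(actual, predicted)
print(matrix)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 5}
```

The four counts always sum to the size of the test set, which is a quick sanity check when building the matrix by hand.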
ROC graphs are two-dimensional graphs in which the true positive (TP) rate is plotted on the Y axis and the false positive (FP) rate is plotted on the X axis (see Fig. 9.6). In essence, an ROC graph depicts the relative trade-off between benefits (true positives) and costs (false positives). Several points in ROC space are important to note. The lower left point (0, 0) represents the strategy of never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives. The opposite strategy, of unconditionally issuing positive classifications, is represented by the upper right point (1, 1). The point (0, 1) represents perfect classification. Informally, one point in ROC space is better than another if it is to the northwest of the other (TP rate is higher, FP rate is lower, or both). Classifiers appearing on the left-hand side of an ROC graph, near the X axis, may be thought of as “conservative”: they make positive classifications only with strong evidence, so they make fewer false positive errors, but they often have low true positive rates as well. Classifiers on the upper right-hand side of an ROC graph may be thought of as “liberal”: they make positive classifications with weak evidence, so they classify nearly all positives correctly, but they often have high false positive rates. Many real-world domains are dominated by large numbers of negative instances, so performance in the far left-hand side of the ROC graph becomes more interesting.
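To make the special points in ROC space concrete, the sketch below computes the (FP rate, TP rate) pair for a few fixed prediction strategies. It assumes labels encoded as 1 (positive) and 0 (negative) and a test set containing at least one instance of each class; the names `roc_point` and `is_northwest` and all sample data are hypothetical.

```python
def roc_point(actual, predicted, positive=1):
    """Return (FP rate, TP rate) for one set of hard class predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    n_pos = sum(1 for a in actual if a == positive)
    n_neg = len(actual) - n_pos
    # Assumes both classes are present, so neither denominator is zero.
    return fp / n_neg, tp / n_pos


def is_northwest(p, q):
    """True if ROC point p is at least as good as q on both axes and differs from q."""
    return p[0] <= q[0] and p[1] >= q[1] and p != q


# Hypothetical test set: four positives followed by six negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

never_positive = [0] * len(actual)            # never issues a positive
always_positive = [1] * len(actual)           # always issues a positive
perfect = list(actual)                        # every prediction correct
conservative = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # positive only on strong evidence
liberal = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]       # positive on weak evidence

print(roc_point(actual, never_positive))   # (0.0, 0.0)
print(roc_point(actual, always_positive))  # (1.0, 1.0)
print(roc_point(actual, perfect))          # (0.0, 1.0)
print(roc_point(actual, conservative))     # (0.0, 0.25)
print(roc_point(actual, liberal))          # FP rate 4/6, TP rate 1.0
```

In this toy data the conservative and liberal classifiers are not comparable by the northwest rule: one has the lower FP rate and the other the higher TP rate, which is exactly the trade-off the ROC graph is meant to expose.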
5 J. Swets (1988). Measuring the accuracy of diagnostic systems, Science 240, 1285–1293.
6 K.H. Zou (2002). Receiver operating characteristic (ROC) literature research, Online bibliography at http://splweb.bwh.harvard.edu.
An ROC curve is basically a two-dimensional depiction of a classifier’s performance. To compare classifiers or to judge the fitness of a single classifier, one may want to reduce the ROC measures to a single scalar value representing the expected performance. A common method for this task is to calculate the area under the ROC curve, abbreviated as AUC. Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1.0. A perfect classifier has an AUC of 1.0. The diagonal line y = x represents the strategy of randomly guessing a class. For example, if a classifier randomly guesses the positive class half the time (much like flipping a coin), it can be expected to get half the positives and half the negatives correct; this yields the point (0.5, 0.5) in ROC space, which in turn translates into an area under the ROC curve of 0.5. No classifier that has any classification power should have an AUC less than 0.5.
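One common way to obtain an ROC curve and its AUC from a scoring classifier is to sweep a decision threshold over the scores and integrate the resulting piecewise-linear curve with the trapezoidal rule. The sketch below illustrates that approach; it is not necessarily the procedure used to produce the figures in this chapter. It assumes labels encoded as 1 and 0, higher scores meaning "more likely positive", and at least one instance of each class. The function names and data are hypothetical.

```python
def roc_curve_points(labels, scores):
    """Return ROC points as (FP rate, TP rate) pairs, from (0, 0) to (1, 1).

    The threshold is lowered one distinct score at a time, so instances with
    tied scores are admitted together and produce a single point.
    """
    n_pos = sum(labels)              # relies on labels being 0 or 1
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores for six test instances.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_curve_points(labels, scores)))  # 8/9, roughly 0.889

# A classifier that ranks every positive above every negative.
print(auc(roc_curve_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])))  # 1.0

# A classifier that gives every instance the same score (no ranking ability):
# its curve is the diagonal from (0, 0) to (1, 1).
print(auc(roc_curve_points([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])))  # 0.5
```

In the first example, 8 of the 9 possible (positive, negative) pairs are ranked correctly, which matches the computed area of 8/9; this pairwise reading is a convenient way to check an AUC value by hand on small data.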
[Figure: ROC graph with curves for classifiers A, B and C. X axis: False Positive Rate (1–Specificity); Y axis: True Positive Rate (Sensitivity); both axes run from 0 to 1.]
Fig. 9.6. A sample ROC curve
In Fig. 9.6, the classification performance of three classifiers (A, B and C) is shown in a single ROC graph. Since the AUC is a commonly used metric for comparing the performance of prediction models, one can easily tell that the best performing of the three classifiers is A, followed by B. Classifier C does not show any predictive power; it stays at the same level as random chance.
Summary
The estimation methods listed above illustrate a wide range of options for the practitioner. Some of them are more statistically sound (based on sound theories), while others have been shown empirically to produce better results. Additionally, some of these methods are computationally more costly than others. Increasing the computational cost is not always beneficial, especially if the relative accuracies are more important than the exact values. For example, leave-one-out is almost unbiased, but it has high variance, leading to unreliable estimates, not to mention the computational cost that it brings when dealing with relatively large datasets. For linear models, using leave-one-out cross-validation for model selection is asymptotically inconsistent in the sense that the probability of selecting the model with the best predictive power does not converge to one as the total number of observations approaches infinity.⁷ Based on recent trends in data mining, two of the most commonly used estimation techniques are the area under the ROC curve (AUC) and stratified k-fold cross-validation. That said, most commercial data mining tools still promote and use the simple split as the estimation technique.
7 P. Zhang (1992). On the distributional properties of model selection criteria,