2.5 Classification
2.6.1 Accuracy, AUC, TCV and SD
Figure 2.12: Confusion matrix.
There are many mechanisms for assessing the effectiveness of classifiers; as noted above, accuracy and AUC were used for the work presented in this thesis because these measures are frequently reported in the literature [112, 147]. Alternative measures include sensitivity and specificity. Accuracy is calculated as shown in Equation 2.8:
\[
\text{accuracy} = \frac{\text{number of records correctly classified}}{\text{total number of records}} \tag{2.8}
\]
However, although accuracy provides an easily understandable measure of the overall quality of a classifier (hence its usage in this thesis), it does not take into consideration the distribution of the classes. In this respect AUC is a more appropriate measure [23, 97]. Broadly, the ROC curve concept was originally used in signal detection theory to depict the trade-off between hit rates and false alarm rates. The “hit rate” is called the True Positive Rate (TPR), benefit or sensitivity, while the “false alarm rate” is called the False Positive Rate (FPR), or cost. Both are expressed as real numbers between 0.0 and 1.0. TPR and FPR are calculated using what is known as a confusion matrix, as shown in Figure 2.12. Confusion matrices of this form are used with respect to two-class problems. With reference to Figure 2.12: (i) the True Positives (TP) value is the number of instances that are correctly classified as belonging to the Positive class, (ii) the False Negatives (FN) value is the number of instances belonging to the Positive class that are erroneously predicted as belonging to the Negative class, (iii) the True Negatives (TN) value is the number of instances that are correctly classified as belonging to the Negative class and (iv) the False Positives (FP) value is the number of instances belonging to the Negative class that are erroneously predicted as belonging to the Positive class. Using a confusion matrix, TPR and FPR are calculated as shown in Equations 2.9 and 2.10 respectively. Note that accuracy can also be derived from a confusion matrix using Equation 2.11.
Figure 2.13: The ROC curve. The solid blue line indicates a good ROC curve that
reaches the upper left corner and the dotted line indicates a random classifier (guessing).
\[
TPR = \frac{TP}{TP + FN} = \text{sensitivity} \tag{2.9}
\]
\[
FPR = \frac{FP}{TN + FP} = 1 - \text{specificity} \tag{2.10}
\]
\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2.11}
\]
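As a minimal sketch, Equations 2.9 to 2.11 can be evaluated directly from the four confusion matrix counts. The function name `confusion_rates` and the example counts are illustrative assumptions, not values taken from the evaluation data sets.

```python
def confusion_rates(tp, fn, tn, fp):
    """Return (TPR, FPR, accuracy) from the four confusion matrix counts."""
    tpr = tp / (tp + fn)                        # Equation 2.9 (sensitivity)
    fpr = fp / (tn + fp)                        # Equation 2.10 (1 - specificity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # Equation 2.11
    return tpr, fpr, accuracy


# Hypothetical counts for a two-class problem with 50 positives and 50 negatives.
tpr, fpr, accuracy = confusion_rates(tp=40, fn=10, tn=45, fp=5)
print(tpr, fpr, accuracy)
```

The sketch assumes that both classes are present (so that neither denominator is zero).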
A ROC curve is generated by plotting the TPR (True Positive Rate) against the FPR (False Positive Rate), with the FPR plotted along the X-axis and the TPR along the Y-axis. Both TPR and FPR range from 0 to 1. In the ROC space, the best classification performance exists in the upper left corner (where FPR=0 and TPR=1) while the diagonal represents random classification (guessing), as shown in Figure 2.13. Therefore, a “good” ROC curve is one that reaches the upper left corner.
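One common way to obtain the points of a ROC curve, sketched below, is to sweep a decision threshold over the scores produced by a classifier and record the (FPR, TPR) pair at each threshold. The function name `roc_points`, the 0/1 label encoding and the example scores are assumptions made for illustration; this is not necessarily how the curves in this thesis were produced.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points found by sweeping a threshold over the scores.

    labels: 1 for the Positive class, 0 for the Negative class.
    A higher score means the instance is considered more likely to be Positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted Positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores for three Positive and three Negative instances.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1); the closer the intermediate points lie to the upper left corner, the better the classifier.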
The Area Under a ROC curve (AUC) is a single value frequently used to measure classifier performance (0 ≤ AUC ≤ 1). AUC can be interpreted as the probability that a classifier will rank a randomly chosen positive instance above a randomly chosen negative instance [9, 113, 148, 161]. Note that an AUC value of 0.5 indicates a random classifier (guessing). To illustrate the distinction between accuracy and AUC, consider a two-class problem where class 1 has 990 instances and class 2 has 10 instances; the accuracy of a model that simply guesses class 1 would be 990/(990+10) × 100 = 99%, “on the face of it” a good accuracy value. However, a classifier that does this is clearly not a good classifier, as indicated by the AUC of 0.5 that would describe this situation. Thus the main advantage of AUC is its ability to deal with unbalanced data sets since it considers the distribution of classes (TPR and FPR values) [73]. Therefore, AUC was chosen as the other performance measure for the proposed classifiers presented in this thesis because of the uneven vertex label distributions within the evaluation datasets. Further detail on AUC can be found in [147].
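The 990/10 illustration can be checked numerically with the short sketch below. It assumes that class 2 is treated as the Positive class and that a classifier which always guesses class 1 assigns the same score to every instance; the helper `pairwise_auc` is a hypothetical name that computes AUC as the fraction of (positive, negative) pairs ranked correctly, counting ties as one half.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# 990 instances of class 1 (Negative) and 10 instances of class 2 (Positive).
true_labels = [1] * 990 + [2] * 10
predicted = [1] * 1000  # always guess the majority class

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

# A constant guess gives every instance the same score for class 2.
constant_score = 0.0
auc = pairwise_auc([constant_score] * 10, [constant_score] * 990)

print(f"accuracy = {accuracy:.2%}, AUC = {auc}")
```

Accuracy rewards the majority-class guess, whereas AUC stays at the chance level because the constant score cannot separate the two classes.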
However, the evaluation data sets used featured more than two classes, hence the above confusion-matrix-based approach to AUC calculation was inappropriate. Instead the Mann-Whitney-Wilcoxon (MWW) statistical method, which employs a ranking concept based on the signal detection theory proposed by [96], was used in the work described in this thesis to calculate AUC values. A full example of how to calculate the AUC value, based on the MWW statistic, is presented in Appendix A.
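For orientation, the sketch below shows the standard two-class form of the MWW rank statistic: all instances are ranked by score, the ranks of the positive instances are summed, and the result is normalised by the number of positive-negative pairs. It is an illustration under stated assumptions only. The function names, the 0/1 label encoding and the example scores are hypothetical, and the multi-class procedure actually used in this thesis is the one given in Appendix A, which is not reproduced here.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mww_auc(scores, labels):
    """Two-class AUC from the Mann-Whitney-Wilcoxon rank-sum statistic."""
    ranks = average_ranks(scores)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2  # MWW U statistic for the positives
    return u / (n_pos * n_neg)


# Hypothetical scores: three Positive (1) and three Negative (0) instances.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(mww_auc(scores, labels))
```

In this example eight of the nine positive-negative pairs are ordered correctly, so the statistic equals 8/9. No threshold sweep or confusion matrix is required, only the ranking of the scores.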
For the presented evaluations Ten-fold Cross Validation (TCV) [192] was adopted, where appropriate, in order to reduce the likelihood of overfitting [79]. Overfitting mainly occurs when a generated classifier (model) is fitted to the training data so closely that the resulting classifier is not suited to classifying anything else (thus defeating the objective of generating the classifier in the first place). TCV is used in order to limit the effect of overfitting. TCV is a well established technique for evaluating the performance of supervised learners whereby the data is divided into ten parts so that class labels are distributed equally across the parts (stratified). Using the TCV technique the learner is applied ten times, each time trained on a different 9/10 of the data set and tested using the remaining 1/10. On completion, the recorded results of the ten iterations are used to compute an averaged set of results.
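The TCV procedure described above can be sketched as follows. The helpers `stratified_folds` and `cross_validate`, the majority-class learner and the 70/30 label split are hypothetical and serve only to show the mechanics; records are dealt to folds in their given order, whereas in practice they would normally be shuffled first.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k=10):
    """Deal record indices into k folds, class by class, in round-robin order."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0  # carried across classes so that fold sizes stay balanced
    for indices in by_class.values():
        for idx in indices:
            folds[position % k].append(idx)
            position += 1
    return folds


def cross_validate(records, labels, train_fn, score_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    scores = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        model = train_fn([records[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [records[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / len(scores), scores


# Hypothetical learner: always predict the most frequent training label.
def train_majority(train_records, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def accuracy_of(model, test_records, test_labels):
    return sum(1 for y in test_labels if y == model) / len(test_labels)


records = list(range(100))
labels = ["a"] * 70 + ["b"] * 30
mean_accuracy, fold_accuracies = cross_validate(
    records, labels, train_majority, accuracy_of)
print(mean_accuracy, fold_accuracies)
```

Each record is used for testing exactly once, and every fold keeps the 70/30 class proportions of the full data set.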
Note that in this thesis, when reporting average values, such as those generated using TCV, the associated Standard Deviation (SD) is also reported. SD is a measure of how much variation exists with respect to a given average value. A low SD indicates that the values are close to the average. A high SD indicates that the values are spread out over a large range of values.
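As a small illustration, the mean and SD of a set of per-fold results can be computed with Python's standard `statistics` module. The two lists of fold accuracies below are made-up values chosen to share the same mean. The text does not state whether the sample or the population form of SD is reported, so the sketch shows both.

```python
import statistics

# Hypothetical per-fold accuracies with the same mean but different spread.
tight = [0.80, 0.81, 0.79, 0.80, 0.80]
spread = [0.60, 0.95, 0.70, 0.90, 0.85]

for name, values in (("tight", tight), ("spread", spread)):
    mean = statistics.mean(values)
    sample_sd = statistics.stdev(values)       # divides by n - 1
    population_sd = statistics.pstdev(values)  # divides by n
    print(f"{name}: mean={mean:.3f}  sample SD={sample_sd:.3f}  "
          f"population SD={population_sd:.3f}")
```

Both lists average 0.80, but the second has a much larger SD, which signals that its individual fold results are far less consistent.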