This section describes several performance measures that are used to assess the quality of machine learning approaches and prediction models.
The confusion matrix is one of the most widely used performance measures. It is used to analyse how well the classifier can recognise different classes [71]. A general representation of the confusion matrix is shown in Table 2:1. TP and TN count the instances for which the classifier gave a true prediction, while FP and FN count the instances for which it gave a false prediction. The confusion matrix is used to derive other performance measurements such as accuracy, recall, precision, F1 and Kappa. These performance measures are discussed in detail in the following subsections.
                  Predicted class
Actual class      Yes     No      Total
Yes               TP      FN      P
No                FP      TN      N
Total             P'      N'      P + N
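Table 2:1 The confusion matrix and the evaluation measures: true positives (TP), true negatives (TN), false positives (FP), false negatives (FN), positive samples (P) and negative samples (N). The column totals are the numbers of samples predicted as positive and as negative.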
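To illustrate how the four cells of Table 2:1 are obtained, the following minimal Python sketch counts them from two lists of labels. The function name `confusion_counts`, the "Yes"/"No" label values and the toy data are assumptions made for this example only; they do not come from any particular library.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count (TP, FN, FP, TN) for a binary problem, following Table 2:1."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # actual Yes, predicted Yes
        elif a == positive:
            fn += 1          # actual Yes, predicted No
        elif p == positive:
            fp += 1          # actual No, predicted Yes
        else:
            tn += 1          # actual No, predicted No
    return tp, fn, fp, tn


# Hypothetical toy data: eight instances, four of each class.
actual    = ["Yes", "Yes", "Yes", "No", "No",  "No", "No", "Yes"]
predicted = ["Yes", "No",  "Yes", "No", "Yes", "No", "No", "Yes"]

tp, fn, fp, tn = confusion_counts(actual, predicted)
p_total, n_total = tp + fn, fp + tn   # row totals: actual Yes, actual No
p_pred, n_pred = tp + fp, fn + tn     # column totals: predicted Yes, predicted No
print(tp, fn, fp, tn, p_total, n_total, p_pred, n_pred)
```

The row totals depend only on the data, whereas the column totals depend on the classifier, which is why the measures in the following subsections divide by different totals.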
2.5.1 Accuracy and error rate
Accuracy is one of the best-known performance measures for prediction problems. The accuracy of a model is defined as the proportion of correctly classified instances. It can be calculated from the confusion matrix as follows:
$$\text{Accuracy} = \frac{TP + TN}{P + N}$$
Although most prediction algorithms use accuracy to measure their performance, accuracy can be a misleading measure. For example, suppose the output (class) attribute of a dataset is heavily skewed, with 80% of the instances belonging to class A and 20% to class B, and the two classes are equally important. An algorithm that assigns every instance to class A then achieves 80% accuracy without recognising a single instance of class B. In this case, we would prefer an algorithm with lower accuracy that can correctly predict some of the instances in class B.
The error rate, also called the misclassification rate, is the complement of the accuracy, that is, 1 - accuracy. It can also be computed directly from the confusion matrix as follows:
$$\text{Error rate} = \frac{FP + FN}{P + N}$$
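The skewed 80%/20% example can be checked numerically. The sketch below is a minimal illustration, assuming 100 instances with class B treated as the positive class. The second set of counts is hypothetical and was chosen only to show a classifier with lower accuracy that still recognises part of class B.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correctly classified instances: (TP + TN) / (P + N)."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(tp, tn, fp, fn):
    """Proportion of misclassified instances: (FP + FN) / (P + N)."""
    return (fp + fn) / (tp + tn + fp + fn)


# Assumption: 100 instances, 80 in class A (negative) and 20 in class B (positive).
# A classifier that assigns every instance to class A:
all_a = dict(tp=0, fn=20, fp=0, tn=80)

# Hypothetical alternative that finds 12 of the 20 class B instances
# at the cost of 14 false alarms:
finds_b = dict(tp=12, fn=8, fp=14, tn=66)

for name, counts in [("all A", all_a), ("finds B", finds_b)]:
    print(name, accuracy(**counts), error_rate(**counts))
```

The first classifier has the higher accuracy even though it never predicts class B, which is the situation in which accuracy alone is misleading.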
2.5.2 Recall
Recall is known as sensitivity in the medical field or as the true positive rate. Recall measures the proportion of the actual positives that are correctly classified [1]. For instance, recall may refer to the percentage of sick patients who were correctly classified as sick. Recall can be calculated from the confusion matrix as follows:
$$\text{Recall (Sensitivity)} = \frac{TP}{TP + FN}$$
We note that the recall of the negative class is called specificity, TN / (TN + FP). It is the counterpart of sensitivity when the focus shifts to the negative class.
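The symmetry between sensitivity and specificity can be seen in a short sketch. The counts below are hypothetical (50 sick and 50 healthy patients), and the function names are chosen for this illustration only.

```python
def recall(tp, fn):
    """Sensitivity / true positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Recall of the negative class: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical screening result: 50 sick and 50 healthy patients.
tp, fn, fp, tn = 40, 10, 5, 45

sens = recall(tp, fn)        # share of sick patients correctly flagged
spec = specificity(tn, fp)   # share of healthy patients correctly cleared

# Swapping the roles of the two classes turns specificity into recall:
# the true negatives play the part of TP and the false positives that of FN.
recall_of_negative_class = recall(tp=tn, fn=fp)

print(sens, spec, recall_of_negative_class)
```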
2.5.3 Precision
Precision is defined as the proportion of true positives among all positive predictions, including the false positives. For example, precision may refer to the percentage of patients who actually have a particular disease among all the patients that the classifier labelled as having it. Precision is calculated from the confusion matrix as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
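Precision and recall share the same numerator but use different denominators, which a small sketch makes explicit. The counts are hypothetical, and returning 0.0 when no instance is predicted positive is a convention assumed here rather than part of the definition.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    if tp + fp == 0:
        return 0.0   # assumed convention when nothing is predicted positive
    return tp / (tp + fp)


# Hypothetical counts: 45 patients are labelled as sick, 40 of them correctly,
# and 10 sick patients are missed.
tp, fp, fn = 40, 5, 10

prec = precision(tp, fp)   # 40 / 45: divides by everything predicted positive
rec = tp / (tp + fn)       # 40 / 50: divides by everything actually positive

print(prec, rec)
```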
2.5.4 F-measure
The F-measure is the harmonic mean of precision and recall and is also known as the F-score or F1 score. F1 is calculated from the precision and recall as follows:
$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F-measure is used to measure the effectiveness of a classifier. It does not depend on TN, so the number of true negatives can vary without affecting the statistic.
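The independence of F1 from TN can be checked with a minimal sketch. The counts are hypothetical, the accuracy function is included only for contrast, and returning 0.0 when TP is zero is an assumed convention.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; TN is not an input."""
    if tp == 0:
        return 0.0   # assumed convention: no true positives gives F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (P + N), shown only for comparison."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts; only the number of true negatives changes.
tp, fp, fn = 40, 5, 10
for tn in (45, 4500):
    print(tn, f1_score(tp, fp, fn), accuracy(tp, tn, fp, fn))
```

Adding true negatives raises the accuracy but leaves F1 unchanged, because neither precision nor recall involves TN.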
2.5.5 Area under the ROC curve
The receiver operating characteristic (ROC) curve graphically displays the trade-off between the true positive rate and the false positive rate of a classifier. The ROC curve is created by plotting the true positive rate (TP rate) along the y-axis and the false positive rate (FP rate) along the x-axis, as shown in Figure 2:3.
Figure 2:3 The ROC curve.
AUC is the area under the ROC curve and takes a value between 0 and 1 [72]. Because random guessing produces the diagonal dashed line between (0, 0) and (1, 1), which corresponds to an AUC of 0.5, no useful classifier should have an AUC of 0.5 or less. The AUC is equivalent to the statistic of the Wilcoxon test of ranks [73] and is usually used for model comparison. Note that some representations of the ROC curve display the sensitivity on the y-axis and (1 - specificity) on the x-axis. This is entirely equivalent to the representation in Figure 2:3, since these are alternative names for the same quantities.
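The rank interpretation of the AUC can be made concrete: it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The sketch below is a minimal illustration of this pairwise view, not an efficient implementation. The function name `auc_pairwise`, the 0/1 label encoding, the toy scores and the choice to count ties as one half are assumptions of this example.

```python
def auc_pairwise(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    labels: 1 for positive, 0 for negative; scores: higher means more positive.
    Tied scores count as half a correctly ranked pair (assumed convention).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for three positive and three negative instances.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_pairwise(labels, scores)
print(auc)
```

A classifier that gives every instance the same score obtains 0.5 under this definition, which matches the diagonal line of random guessing.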
2.5.6 Kappa
Kappa, also called Cohen's Kappa, is a statistical measure that assesses the inter-rater agreement for categorical items [2]. Kappa takes into account the accuracy that could occur by chance. The Kappa equation is as follows:
$$\text{Kappa} = \frac{O - E}{1 - E}$$
Above, O is the observed accuracy and E is the expected accuracy. Kappa values range between -1 and 1. A Kappa of 0 means that there is no agreement between the predicted and the actual classes beyond what chance would produce. In contrast, a Kappa of 1 shows complete concordance between the model predictions and the observed classes.
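The text does not spell out how the expected accuracy is obtained. The sketch below assumes the standard definition for Cohen's Kappa, in which the expected accuracy is the agreement implied by the row and column totals of the confusion matrix alone. The function name `cohen_kappa` and both sets of counts are hypothetical; the first reproduces a classifier that assigns all 100 instances of an 80%/20% dataset to the majority class.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Kappa = (O - E) / (1 - E) for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total   # O: observed accuracy
    # E: agreement expected by chance, from the marginal totals (assumed definition)
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    if expected == 1.0:
        raise ValueError("Kappa is undefined when the expected accuracy is 1")
    return (observed - expected) / (1 - expected)


# Majority-class predictor: 80% accuracy, but none of it beyond chance.
kappa_majority = cohen_kappa(tp=0, fn=20, fp=0, tn=80)

# Hypothetical classifier with 78% accuracy that recognises part of the minority class.
kappa_other = cohen_kappa(tp=12, fn=8, fp=14, tn=66)

print(kappa_majority, kappa_other)
```

Although the second classifier has the lower accuracy, it has the higher Kappa, because its expected accuracy (0.644) is well below its observed accuracy (0.78).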
When the class distributions are equivalent, the overall accuracy and Kappa are proportional. Depending on the context, Kappa values between 0.30 and 0.50 indicate reasonable agreement [2]. However, a Kappa below 0.30 indicates that the model's performance is mostly due to chance and that the accuracy does not reflect how good the model is.