Estimating the model performance - Predictive Modelling Approach to Data-driven Computational P

This section gives details of several performance measures. These performance measures are used to assess the quality of machine learning approaches and prediction models.

The confusion matrix is one of the most widely used performance measures. It shows how well the classifier recognises the different classes [71]. A general representation of the confusion matrix is shown in Table 2:1. TP and TN count the instances for which the classifier gave a correct prediction, while FP and FN count the instances for which it gave a false prediction. The confusion matrix is also used to derive other performance measurements such as accuracy, recall, precision, F1 and Kappa. These performance measures are discussed in detail in the following subsections.

              Predicted Yes   Predicted No   Total
Actual Yes    TP              FN             P
Actual No     FP              TN             N
Total         P̅               N̅              P + N

Table 2:1 The Confusion Matrix and the evaluation measures: true positive TP, true negative TN, false positive FP, false negative FN, positive P, and negative samples N.
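As an illustration, the four cells of Table 2:1 can be counted directly from a list of actual and predicted labels. The sketch below assumes that the rows of the table are the actual classes and the columns are the predicted classes, which is what the row totals P and N imply. The function name `confusion_counts` and the sample labels are hypothetical and do not come from the thesis.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (TP, FN, FP, TN) for a binary classification problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # actual Yes, predicted Yes
        elif a == positive:
            fn += 1          # actual Yes, predicted No
        elif p == positive:
            fp += 1          # actual No, predicted Yes
        else:
            tn += 1          # actual No, predicted No
    return tp, fn, fp, tn


# Hypothetical labels, used only to exercise the function.
actual = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
predicted = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]

tp, fn, fp, tn = confusion_counts(actual, predicted)
p_total = tp + fn            # P: number of actual positives
n_total = fp + tn            # N: number of actual negatives
print(tp, fn, fp, tn, p_total, n_total)
```

All measures in the following subsections can be derived from these four counts.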

2.5.1 Accuracy and error rate

Accuracy is one of the best-known performance measures for prediction problems. The accuracy of a model is defined as the proportion of correctly classified instances. It can be calculated from the confusion matrix as follows:

$$\text{Accuracy} = \frac{TP + TN}{P + N}$$

Although accuracy is the measure most commonly reported for prediction algorithms, it can be misleading. For example, suppose the output (class) attribute is heavily skewed, with 80% of the instances belonging to class A and 20% to class B, and the two classes are equally important. An algorithm that assigns every instance to class A then achieves 80% accuracy without identifying a single instance of class B. In this case, we would prefer an algorithm with lower accuracy that can predict some of the instances in class B.
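The skewed-class example can be checked numerically. The following minimal sketch builds a hypothetical dataset of 100 instances with the 80/20 split described above and a trivial classifier that always predicts class A. The variable names are illustrative only.

```python
# Hypothetical data: 80 instances of class A and 20 of class B.
actual = ["A"] * 80 + ["B"] * 20

# A trivial classifier that assigns every instance to the majority class.
predicted = ["A"] * len(actual)

# Accuracy: proportion of instances whose predicted class matches the actual class.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Proportion of class B instances that were recognised as class B.
b_found = sum(a == "B" and p == "B" for a, p in zip(actual, predicted))
recall_b = b_found / actual.count("B")

print(f"accuracy = {accuracy:.2f}, recall for class B = {recall_b:.2f}")
```

The accuracy is 0.80 even though no instance of class B is ever detected, which is why accuracy alone is not sufficient on skewed data.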

The error rate, also called the misclassification rate, is the complement of the accuracy, that is, 1 - accuracy. It can also be computed directly from the confusion matrix as follows:

$$\text{Error rate} = \frac{FP + FN}{P + N}$$
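That the two definitions of the error rate agree follows from the row totals of Table 2:1, where P = TP + FN and N = FP + TN. A short derivation, given here as a check rather than as part of the original text:

```latex
\[
\begin{aligned}
\text{Accuracy} + \text{Error rate}
  &= \frac{TP + TN}{P + N} + \frac{FP + FN}{P + N} \\
  &= \frac{(TP + FN) + (FP + TN)}{P + N} \\
  &= \frac{P + N}{P + N} = 1
\end{aligned}
\]
```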

2.5.2 Recall

Recall is known as sensitivity in the medical field and also as the true positive rate. Recall measures the proportion of actual positives that are correctly classified [1]. For instance, recall may refer to the percentage of sick patients who were correctly identified as sick. Recall can be calculated from the confusion matrix as follows:

$$\text{Recall (sensitivity)} = \frac{TP}{TP + FN}$$

The recall of the negative class is called specificity and equals TN / (TN + FP). It is the counterpart of sensitivity when the focus shifts to the negative class.
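Recall and specificity can be computed side by side to make the symmetry explicit. The sketch below uses hypothetical counts (TP = 40, FN = 10, FP = 5, TN = 45) that are not taken from the thesis, and assumes that each class contains at least one instance so that neither denominator is zero.

```python
def recall(tp, fn):
    """Sensitivity / true positive rate: share of actual positives found."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives found."""
    return tn / (tn + fp)


# Hypothetical confusion-matrix counts.
tp, fn, fp, tn = 40, 10, 5, 45

sens = recall(tp, fn)
spec = specificity(tn, fp)

# Specificity is the recall of the negative class: swapping the roles of
# the two classes turns TN into the "hits" and FP into the "misses".
spec_as_recall = recall(tn, fp)

print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```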

2.5.3 Precision

Precision is defined as the proportion of true positives among all positive predictions, that is, true positives plus false positives. For example, precision refers to the percentage of patients who actually have a particular disease among all patients whom the classifier labelled as having it. Precision is calculated from the confusion matrix as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

2.5.4 F-measure

F-measure is the harmonic mean of precision and recall and is known as F-score or F1 score. F1 is calculated from the precision and recall as follows:

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The F-measure is used to measure the effectiveness of a classifier. It does not depend on TN, so the number of true negatives can vary without affecting the statistic.
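A small sketch makes the independence from TN visible: the function below needs only TP, FP and FN. The counts are hypothetical, and the function assumes TP > 0 so that precision and recall are both defined and non-zero.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that are predicted positive."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; TN is not an input."""
    prec = precision(tp, fp)
    rec = recall(tp, fn)
    return 2 * prec * rec / (prec + rec)


# Hypothetical counts; any number of true negatives gives the same F1.
tp, fp, fn = 40, 5, 10
f1 = f1_score(tp, fp, fn)

# Substituting the definitions gives the equivalent count-based form
# F1 = 2TP / (2TP + FP + FN), which again contains no TN term.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"F1 = {f1:.4f}")
```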

2.5.5 Area under the ROC curve

The receiver operating characteristic (ROC) curve graphically displays the trade-off between the true positive rate and the false positive rate of a classifier. The ROC curve is created by plotting the true positive rate along the y-axis against the false positive rate along the x-axis, as shown in Figure 2:3.

Figure 2:3 The ROC curve.

AUC is the area under the ROC curve and takes a value between 0 and 1 [72]. Random guessing produces the diagonal dashed line between (0, 0) and (1, 1), which corresponds to an AUC of 0.5, so a useful classifier should have an AUC above 0.5. The AUC is equivalent to the Wilcoxon rank-sum statistic [73]: it equals the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one. AUC is usually used for model comparison. Note that some representations of the ROC curve display the sensitivity on the y-axis and (1 - specificity) on the x-axis. This is entirely equivalent to the representation in Figure 2:3; the two simply use alternative names for the same quantities.
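The rank interpretation of the AUC can be sketched without drawing the curve at all: the AUC equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The function name and the scores below are hypothetical, and the quadratic pairwise loop is chosen for clarity, not efficiency.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0      # positive ranked above negative
            elif sp == sn:
                wins += 0.5      # a tie counts as half a correct pair
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for four positive and four negative instances.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3, 0.2]

auc = auc_from_scores(pos, neg)

# A scorer that gives every instance the same score cannot rank at all,
# which corresponds to the random-guessing diagonal.
auc_constant = auc_from_scores([0.5, 0.5], [0.5, 0.5])

print(f"AUC = {auc}, constant scorer AUC = {auc_constant}")
```

Here 13 of the 16 pairs are ordered correctly, so the AUC is 13/16.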

2.5.6 Kappa

Kappa, which is also called Cohen’s Kappa, is a statistical measure that assesses the inter-rater agreement for categorical items [2]. Kappa takes into account the agreement that could occur by chance. The Kappa equation is as follows:

$$\text{Kappa} = \frac{O - E}{1 - E}$$

Here O is the observed accuracy and E is the expected accuracy, that is, the accuracy expected by chance. Kappa values range between -1 and 1. A Kappa of 0 means that the agreement between the predicted and the actual classes is no better than chance. In contrast, a Kappa of 1 indicates perfect agreement between the model predictions and the observed classes.

When the classes are equally distributed, the overall accuracy and Kappa are linearly related. Depending on the context, Kappa values between 0.30 and 0.50 indicate reasonable agreement [2]. However, a Kappa below 0.30 indicates that the model's performance is mostly due to chance and that the accuracy does not reflect how good the model is.
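The text does not spell out how the expected accuracy E is obtained. The sketch below assumes the standard definition for Cohen's Kappa, in which E is the sum over classes of the product of the actual and predicted class proportions. The function name and the counts are hypothetical.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's Kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total                  # O: observed accuracy

    # E: chance agreement, from the row (actual) and column (predicted) totals.
    actual_pos, actual_neg = (tp + fn) / total, (fp + tn) / total
    pred_pos, pred_neg = (tp + fp) / total, (fn + tn) / total
    expected = actual_pos * pred_pos + actual_neg * pred_neg

    return (observed - expected) / (1 - expected)


# The majority-class classifier from Section 2.5.1: 80 instances of class A
# (treated as positive), 20 of class B, and every instance predicted as A.
kappa_majority = cohen_kappa(tp=80, fn=0, fp=20, tn=0)

# A hypothetical classifier on balanced classes with 85% accuracy.
kappa_balanced = cohen_kappa(tp=40, fn=10, fp=5, tn=45)

print(f"majority classifier: {kappa_majority:.2f}, balanced case: {kappa_balanced:.2f}")
```

The majority-class classifier has 80% accuracy but a Kappa of 0, because its expected accuracy is also 80%. In the balanced case E = 0.5, so Kappa = 2 × 0.85 - 1 = 0.70.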

In document Predictive Modelling Approach to Data-driven Computational Psychiatry (Page 39-43)