
2.3 Conducting classification experiments

2.3.1 Evaluation measures

In this thesis, we focus on classification applications, where observations are labelled with a categorical outcome (their class). A classifier assigns a class label to a newly presented instance based on a previously learned prediction model. Its performance can be evaluated in several ways. In Section 2.3.1.1 we describe a number of traditional measures. Section 2.3.1.2 explains that some measures are unsuitable when the class distribution is skewed and presents some imbalance-resistant alternatives. Throughout this thesis, we take special care to select suitable evaluation measures for the studied problems. In particular, custom evaluation measures for multi-instance and multi-label classification are used in Chapters 6-7 and their definitions are recalled there.

2.3.1.1 General evaluation measures

The intuitively most relevant facet to evaluate is the correctness of the predictions made. Let Ts be a test set of elements for which class labels were predicted and corr(·) a function that counts the number of correct predictions for its set argument. Probably the most commonly used evaluation measure is still the traditional global classification accuracy. It is defined as

\[ \mathrm{acc}(\mathit{Ts}) = \frac{\mathrm{corr}(\mathit{Ts})}{|\mathit{Ts}|} \]

and measures the ratio of correctly classified instances in the test set. The complement of the accuracy is referred to as the error rate and is given by

\[ \mathrm{err}(\mathit{Ts}) = 1 - \mathrm{acc}(\mathit{Ts}) = \frac{|\mathit{Ts}| - \mathrm{corr}(\mathit{Ts})}{|\mathit{Ts}|}. \]

Table 2.1: Confusion matrix obtained after classification of a two-class dataset.

                     Predicted positive    Predicted negative
Actual positive      TP                    FN
Actual negative      FP                    TN
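As a minimal sketch of the two definitions above, the following Python snippet computes the accuracy and error rate from a list of actual and a list of predicted labels. The function names and the small label lists are illustrative assumptions and do not come from the thesis.

```python
def accuracy(actual, predicted):
    """Ratio of test instances whose predicted label equals the actual one."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("expected two non-empty label lists of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))  # corr(Ts)
    return correct / len(actual)                              # corr(Ts) / |Ts|


def error_rate(actual, predicted):
    """Complement of the accuracy."""
    return 1.0 - accuracy(actual, predicted)


# Hypothetical test set with five instances, three of them predicted correctly.
actual = ["a", "b", "a", "c", "b"]
predicted = ["a", "b", "c", "c", "a"]
print(accuracy(actual, predicted), error_rate(actual, predicted))
```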

In binary classification problems, only two classes are present and can often be interpreted as ‘positive’ and ‘negative’. A basic two-class confusion matrix can be constructed that summarizes the prediction behaviour as presented in Table 2.1. The rows and columns correspond to the actual and predicted classes respectively. The entries correspond to the number of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) predictions. The former two can be normalized to the true positive rate (TPR) and true negative rate (TNR) as

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \quad \text{and} \quad \mathrm{TNR} = \frac{TN}{TN + FP}. \]

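In some applications, the false positive rate (FPR) and false negative rate (FNR), defined as

\[ \mathrm{FPR} = \frac{FP}{TN + FP} \quad \text{and} \quad \mathrm{FNR} = \frac{FN}{TP + FN}, \]

may be of interest as well. Each rate is normalized by the size of the corresponding actual class, such that FPR = 1 − TNR and FNR = 1 − TPR.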
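The four rates can be sketched in a few lines of Python. The snippet assumes the usual convention that every rate is normalized by the number of instances in the actual class (positives for TPR and FNR, negatives for TNR and FPR). The function name and the example counts are hypothetical.

```python
def rates(tp, fn, fp, tn):
    """Normalize the confusion matrix entries by the actual class sizes."""
    positives = tp + fn  # number of actual positive instances
    negatives = tn + fp  # number of actual negative instances
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must occur in the test set")
    return {
        "TPR": tp / positives,
        "FNR": fn / positives,
        "TNR": tn / negatives,
        "FPR": fp / negatives,
    }


# Hypothetical counts: 8 of 10 positives and 5 of 10 negatives predicted positive.
result = rates(tp=8, fn=2, fp=5, tn=5)
print(result)
```

By construction, TPR and FNR sum to one, as do TNR and FPR, so reporting one rate of each pair is sufficient.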

The TPR is also called recall (r) or sensitivity. A complementary measure from the information retrieval domain is the precision (also called confidence or positive predictive value), defined as

\[ p = \frac{TP}{TP + FP}. \]

While the recall r measures the rate of correctly classified positive instances, the precision p represents the fraction of positive predictions that are correct. To evaluate the trade-off between these two aspects, the Fβ-measure, defined as

\[ F_\beta = (1 + \beta^2) \cdot \frac{p \cdot r}{\beta^2 \cdot p + r}, \]

can be applied. Parameter β can take on any positive real value, but is usually set to β = 1. In the latter case, the measure is simply referred to as the F-measure and is given by

\[ F = \frac{2 \cdot p \cdot r}{p + r}, \]

the harmonic mean of precision and recall. A confusion matrix similar to the one in Table 2.1 can be constructed for datasets with more than two classes as well. For a dataset with m classes, the entry on the ith row and jth column of the m × m matrix lists the number of class i elements that were assigned to class j in the prediction step.
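As a rough illustration, the Python sketch below builds the m × m confusion matrix from two label lists and derives the precision, recall and Fβ-measure from its binary special case. All function names, labels and counts are assumptions made for this example only.

```python
def confusion_matrix(actual, predicted, classes):
    """Entry [i][j] counts the class-i elements that were assigned to class j."""
    index = {label: k for k, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def f_beta(p, r, beta=1.0):
    """F-beta measure; beta = 1 yields the harmonic mean of p and r."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical binary test set: 10 positive and 10 negative instances.
actual = ["pos"] * 10 + ["neg"] * 10
predicted = ["pos"] * 6 + ["neg"] * 4 + ["pos"] * 2 + ["neg"] * 8

# With the positive class listed first, the matrix reads [[TP, FN], [FP, TN]].
(tp, fn), (fp, tn) = confusion_matrix(actual, predicted, ["pos", "neg"])
p = tp / (tp + fp)  # precision
r = tp / (tp + fn)  # recall
print(p, r, f_beta(p, r), f_beta(p, r, beta=2.0))

# Hypothetical three-class test set; rows are actual, columns predicted classes.
actual3 = ["a", "a", "b", "b", "c", "c"]
predicted3 = ["a", "b", "b", "b", "c", "a"]
matrix3 = confusion_matrix(actual3, predicted3, ["a", "b", "c"])
print(matrix3)
```

Values of β above 1 weigh recall more heavily than precision, while values below 1 favour precision.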

Another component commonly assessed when comparing machine learning methods is their complexity. Several aspects can be captured under this general term. A first one is the theoretical complexity related to the construction of the prediction model as well as to the prediction of the outcome of new instances. This theoretical complexity has a direct influence on the runtime of these two processes, an essential practical consideration. It is usually represented using so-called big-O notation, which indicates how the runtime of an algorithm increases with, for example, the size of the input (number of instances) or its dimensionality (number of features). Secondly, the actual complexity or understandability of the learned prediction model is often of interest as well. Even though the computer algorithm is typically applied as a black box in the prediction process, the interpretability of the intermediate model is still important. Easy-to-understand classification rules can for instance lead to new insights on the application at hand.

2.3.1.2 Evaluation measures in the presence of class imbalance

The specific properties of a problem may require custom evaluation measures. For instance, it has been established in the machine learning community that the regular classification accuracy is inappropriate to use in the presence of class imbalance (e.g. [391], Section 1.1.1, Chapter 4). As an example, consider a two-class dataset in which 100 instances belong to the first class and 900 to the second and a classifier that predicts the second label for all elements. The accuracy value of this evaluation would be 90%, since 900 out of 1000 instances are classified correctly. This measure does not take the imbalance between classes into account. It can provide deceptive results and lead to unreliable conclusions. In particular, a strong performance on a majority class can easily overshadow a poor result on the minority class in the accuracy calculation. In this toy example, the 90% accuracy rate does not adequately reflect the poor performance on the first class, which has been misclassified entirely.
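The toy example can be reproduced with a few lines of Python. The label names "first" and "second" and the helper function are placeholders introduced for this sketch.

```python
# Toy dataset: 100 instances of the first class, 900 of the second.
actual = ["first"] * 100 + ["second"] * 900
# A classifier that predicts the second (majority) label for every element.
predicted = ["second"] * 1000

overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def class_accuracy(label):
    """Accuracy restricted to the instances that actually belong to `label`."""
    hits = [a == p for a, p in zip(actual, predicted) if a == label]
    return sum(hits) / len(hits)


print(overall)                   # global accuracy
print(class_accuracy("first"))   # accuracy on the minority class
print(class_accuracy("second"))  # accuracy on the majority class
```

The global accuracy is dominated by the majority class, while the class-wise values expose that no minority instance is recognized.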

Alternatives such as the geometric mean (gmean) or balanced accuracy (also called average accuracy) aggregate the class-wise accuracies to counteract the dominance of the majority class. The former does so by taking the geometric mean of the class-wise accuracies

\[ \mathrm{gmean}(\mathit{Ts}) = \sqrt[m]{\prod_{i=1}^{m} \frac{\mathrm{corr}(\mathit{Ts}_i)}{|\mathit{Ts}_i|}}, \]

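with $\mathit{Ts}_i$ the subset of instances belonging to class i, while the latter uses their arithmetic mean

\[ \mathrm{balacc}(\mathit{Ts}) = \mathrm{avgacc}(\mathit{Ts}) = \frac{1}{m} \sum_{i=1}^{m} \left( \frac{\mathrm{corr}(\mathit{Ts}_i)}{|\mathit{Ts}_i|} \right). \]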
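Both aggregates can be sketched in Python as follows. The snippet assumes that the m classes are exactly those occurring among the actual labels of the test set, so a class without test instances is ignored. The function names and the three-class data are hypothetical.

```python
import math


def class_accuracies(actual, predicted):
    """Return corr(Ts_i) / |Ts_i| for every class i present in `actual`."""
    totals, correct = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        correct[a] = correct.get(a, 0) + (1 if a == p else 0)
    return [correct[label] / totals[label] for label in totals]


def gmean(actual, predicted):
    """m-th root of the product of the class-wise accuracies."""
    accs = class_accuracies(actual, predicted)
    return math.prod(accs) ** (1.0 / len(accs))


def balanced_accuracy(actual, predicted):
    """Arithmetic mean of the class-wise accuracies."""
    accs = class_accuracies(actual, predicted)
    return sum(accs) / len(accs)


# Hypothetical three-class test set with class-wise accuracies 1, 1/2 and 1/4.
actual = ["a"] * 8 + ["b"] * 4 + ["c"] * 4
predicted = ["a"] * 8 + ["b", "b", "a", "a"] + ["c", "a", "a", "a"]
print(gmean(actual, predicted), balanced_accuracy(actual, predicted))
```

Because the geometric mean multiplies the class-wise accuracies, a single class that is misclassified entirely drives gmean to zero. For the all-majority classifier of the toy example, the class-wise accuracies are 0 and 1, giving a gmean of 0 and a balanced accuracy of 0.5 against a global accuracy of 90%.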

Another popular evaluation measure to use in the presence of class imbalance is the Area under the ROC-Curve (AUC). For a binary classification problem, a Receiver Operator Characteristics (ROC)-curve is defined in the unit square and models the trade-off of a classifier between its TPR and FPR [141]. The area between the curve and the horizontal axis is used to capture the represented graphical information in one value. An algorithmic way to compute the AUC is by means of the trapezoid rule [141,319]. Instead of deriving an exact class assignment, the classifier calculates, for each instance x, the probability p+(x) that x belongs to the positive class. Afterwards, these values are sorted and each one is used as a threshold θ, such that only instances with p+(x) ≥ θ are finally classified as positive. This procedure leads to specific TPR and FPR values for each threshold, which can be plotted as points in a Euclidean coordinate system as represented in Figure 2.6. For example, at threshold θ = 0.7, eight of the ten positive instances and five of the ten negative instances are classified as positive, leading to a TPR of 0.8 and an FPR of 0.5 respectively, and together to the point with coordinates (0.5; 0.8) in the plot. The area underneath the curve connecting these points can be computed as the sum of the areas of a triangle and a sequence of trapezoids.

Figure 2.6: Example of how the calculation of p+(x) values can lead to a ROC curve.
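The threshold procedure and the trapezoid rule can be sketched in Python as follows. This is a simple quadratic-time illustration under stated assumptions, not the algorithm of [141,319]: every distinct p+(x) value serves as a threshold, the curve is assumed to start in the origin, and the scores and labels are invented for the example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per threshold, preceded by the origin.

    `scores` holds the p+(x) values, `labels` is True for positive instances.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Use each distinct score as threshold, from the strictest to the loosest.
    for theta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= theta and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= theta and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoid rule over consecutive (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for three positive and three negative instances.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [True, True, False, True, False, False]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

The lowest threshold classifies every instance as positive, so the last point is always (1; 1) and the curve spans the full width of the unit square.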

Several extensions of the AUC measure to datasets with more than two classes have been proposed in the literature. An often used example is the mean AUC (MAUC, [205]) that computes the overall AUC as the average of each of the binary AUC values between pairs of classes. In [151], the ROC-curve was extended to a surface and the AUC measure was replaced by the volume underneath it. The authors of [207] introduced the AUCarea measure. The binary AUC values between class pairs are plotted in a polar diagram and the area within the figure is used as metric. A normalized version of the AUCarea measure was proposed in [203].