
4. Evaluation Methods

4.1. Evaluation metrics

4.1.1. Confusion matrix and classification performance measures

Because our current research is focused on binary classification tasks, the formulas for metric estimation below are given mostly for binary classification. They can be easily extended to multi-class tasks.

The most commonly used evaluation metrics are estimated from the confusion matrix, which is a square ($C \times C$) matrix for $C$ categories. Its rows correspond to the labelled categories $y_L$ and its columns to the predicted categories $y_P$; the entry in row $j$ and column $k$ is equal to the number of classification instances labelled as belonging to category $j$ which were predicted as belonging to category $k$.
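The structure of the confusion matrix can be illustrated with a minimal Python sketch. It assumes that the categories are encoded as integers from 0 to C - 1; this encoding, the function name and the example labels are hypothetical and serve only as an illustration.

```python
def confusion_matrix(y_l, y_p, num_categories):
    """Count instances per (labelled category, predicted category) pair.

    Rows correspond to labelled categories, columns to predicted ones.
    Assumes categories are encoded as integers 0 .. num_categories - 1.
    """
    if len(y_l) != len(y_p):
        raise ValueError("label sequences must have equal length")
    matrix = [[0] * num_categories for _ in range(num_categories)]
    for labelled, predicted in zip(y_l, y_p):
        matrix[labelled][predicted] += 1
    return matrix


# Hypothetical labels for six classification windows and C = 3 categories.
y_l = [0, 0, 1, 2, 2, 2]
y_p = [0, 1, 1, 2, 0, 2]
matrix = confusion_matrix(y_l, y_p, num_categories=3)
```

The diagonal entries count the correctly predicted instances of each category, and the sum of all entries equals the number of classification instances.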

Let $T$ be the overall number of classification instances, or windows, and let $x_i$ denote the feature vector for a classification window $i \in \{1, ..., T\}$ from the processed feature matrix $X$. For binary classification, $y_L, y_P \in \{0; 1\}$, and the confusion matrix consists of the following four entries:

• The number of true positives corresponds to the number of positive instances predicted as positive:
$$TP = \sum_{i=1}^{T} y_L(x_i) \cdot y_P(x_i). \tag{4.1}$$

• The number of true negatives corresponds to the number of negative instances predicted as negative:
$$TN = \sum_{i=1}^{T} (1 - y_L(x_i)) \cdot (1 - y_P(x_i)). \tag{4.2}$$

• The number of false positives corresponds to the number of negative instances predicted as positive:
$$FP = \sum_{i=1}^{T} (1 - y_L(x_i)) \cdot y_P(x_i). \tag{4.3}$$

• The number of false negatives corresponds to the number of positive instances predicted as negative:
$$FN = \sum_{i=1}^{T} y_L(x_i) \cdot (1 - y_P(x_i)). \tag{4.4}$$
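Equations 4.1 to 4.4 translate directly into sums over products of the binary labels. The following Python sketch is given for illustration only; the function name and the example labels are hypothetical.

```python
def confusion_counts(y_l, y_p):
    """Return (TP, TN, FP, FN) for binary label sequences with values in {0, 1}."""
    if len(y_l) != len(y_p):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(y_l, y_p))
    tp = sum(l * p for l, p in pairs)              # Eq. 4.1
    tn = sum((1 - l) * (1 - p) for l, p in pairs)  # Eq. 4.2
    fp = sum((1 - l) * p for l, p in pairs)        # Eq. 4.3
    fn = sum(l * (1 - p) for l, p in pairs)        # Eq. 4.4
    return tp, tn, fp, fn


# Hypothetical labels for eight classification windows.
y_l = [1, 1, 1, 0, 0, 0, 0, 0]
y_p = [1, 1, 0, 0, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(y_l, y_p)  # TP=2, TN=3, FP=2, FN=1
```

Each window contributes to exactly one of the four counts, so their sum is always equal to the number of windows.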

Several metrics are derived from $TP$, $TN$, $FP$ and $FN$ (again, we provide the definitions for binary classification, which can be extended to the multi-class case):

• Accuracy corresponds to the average rate of correctly predicted instances:
$$m_{ACC} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{T}. \tag{4.5}$$
If a data set is highly imbalanced and a classifier performs well only on the more strongly represented class (in the worst case assigning every instance to the largest category), the accuracy may nevertheless be high. Therefore, it is reasonable to calculate other metrics as well. On the other hand, it depends on the classification scenario whether the classification performances on the larger and smaller classes are equally relevant.

• Precision describes the ratio of the correctly identified positive instances to the number of instances identified as belonging to this category:
$$m_{PREC} = \frac{TP}{TP + FP}. \tag{4.6}$$

• Recall, or sensitivity, is the ratio of the correctly identified positive instances to the number of positive instances:
$$m_{REC} = \frac{TP}{TP + FN}. \tag{4.7}$$

• Specificity measures the fraction of the negative instances which were predicted as negative:
$$m_{SPEC} = \frac{TN}{FP + TN}. \tag{4.8}$$
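The effect of class imbalance on accuracy can be reproduced with a few lines of Python. The sketch below implements Equations 4.5 to 4.8; returning NaN for an empty denominator is a convention of this sketch, not part of the definitions, and the counts describe a hypothetical data set with 5 positive and 95 negative windows.

```python
import math


def _ratio(numerator, denominator):
    # Convention of this sketch: an empty denominator yields NaN.
    return numerator / denominator if denominator else math.nan


def basic_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and specificity (Eq. 4.5 to 4.8)."""
    return {
        "acc": _ratio(tp + tn, tp + tn + fp + fn),
        "prec": _ratio(tp, tp + fp),
        "rec": _ratio(tp, tp + fn),
        "spec": _ratio(tn, fp + tn),
    }


# Hypothetical worst case: every window is predicted as negative.
trivial = basic_metrics(tp=0, tn=95, fp=0, fn=5)
```

In this example the accuracy is 0.95 although no positive window is found: the recall is 0, the specificity is 1, and the precision is undefined because no window is predicted as positive.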

Numeric prediction errors measure the number of misclassifications and can also be applied to binary classification (the corresponding formulas are marked with '$\overset{\text{bin}}{=}$'):

• Absolute error is equal to the number of misclassifications:
$$m_{AE} = \sum_{i=1}^{T} |y_L(x_i) - y_P(x_i)| \overset{\text{bin}}{=} FP + FN. \tag{4.9}$$

• Relative error corresponds to the average number of misclassifications:
$$m_{RE} = \frac{1}{T} \cdot \sum_{i=1}^{T} |y_L(x_i) - y_P(x_i)| \overset{\text{bin}}{=} \frac{FP + FN}{TP + TN + FP + FN}. \tag{4.10}$$

• Mean squared error can be estimated if the ground truth is not always binary, as defined in our earlier studies [205,223] (we used a slightly modified version of $m_{MSE}$ there):
$$m_{MSE} = \frac{1}{T} \sum_{i=1}^{T} (y_L(x_i) - y_P(x_i))^2. \tag{4.11}$$
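The three numeric prediction errors in Equations 4.9 to 4.11 can be sketched in Python as follows. The function names and all example values, including the graded ground truth, are hypothetical.

```python
def absolute_error(y_l, y_p):
    """Eq. 4.9: sum of absolute differences between labels and predictions."""
    return sum(abs(l - p) for l, p in zip(y_l, y_p))


def relative_error(y_l, y_p):
    """Eq. 4.10: absolute error divided by the number of windows T."""
    return absolute_error(y_l, y_p) / len(y_l)


def mean_squared_error(y_l, y_p):
    """Eq. 4.11: mean of squared differences; labels need not be binary."""
    return sum((l - p) ** 2 for l, p in zip(y_l, y_p)) / len(y_l)


# Hypothetical binary labels (TP=2, TN=3, FP=2, FN=1).
y_l = [1, 1, 1, 0, 0, 0, 0, 0]
y_p = [1, 1, 0, 0, 0, 0, 1, 1]
ae = absolute_error(y_l, y_p)
re = relative_error(y_l, y_p)

# Hypothetical graded ground truth in [0, 1] with binary predictions.
mse = mean_squared_error([1.0, 0.5, 0.0, 0.0], [1, 1, 0, 1])
```

For binary labels every difference is 0 or 1, so the absolute error equals FP + FN, and the mean squared error coincides with the relative error.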

Some metrics are designed especially for the measurement of classifier performance on imbalanced sets:

• Balanced relative error is the mean of the relative errors estimated separately for the instances of both classes:
$$m_{BRE} = \frac{1}{2} \left( \frac{FN}{TP + FN} + \frac{FP}{TN + FP} \right). \tag{4.12}$$


• F-measure is the weighted harmonic mean of precision and recall:
$$m_F = \frac{(\alpha_F + 1) \cdot m_{PREC} \cdot m_{REC}}{\alpha_F \cdot m_{PREC} + m_{REC}}, \tag{4.13}$$
where $\alpha_F$ adjusts the balance between $m_{PREC}$ and $m_{REC}$ and is often set to 1.
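A short Python sketch of Equations 4.12 and 4.13 shows how these two metrics react to imbalance. The function names and the counts are hypothetical, and the sketch assumes that both classes are present and that precision and recall are not both zero.

```python
def balanced_relative_error(tp, tn, fp, fn):
    """Eq. 4.12: mean of the relative errors of the two classes."""
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))


def f_measure(precision, recall, alpha_f=1.0):
    """Eq. 4.13; alpha_f = 1 yields the harmonic mean of precision and recall."""
    return (alpha_f + 1) * precision * recall / (alpha_f * precision + recall)


# Hypothetical imbalanced set: 5 positive and 95 negative windows,
# with a classifier that predicts every window as negative.
bre_trivial = balanced_relative_error(tp=0, tn=95, fp=0, fn=5)

# Hypothetical counts TP=2, TN=3, FP=2, FN=1.
bre = balanced_relative_error(tp=2, tn=3, fp=2, fn=1)
f1 = f_measure(precision=2 / 4, recall=2 / 3)
```

The trivial classifier reaches an accuracy of 0.95 on this set, but its balanced relative error is 0.5, the same value as for random guessing on a balanced set. In the notation of the widely used F-beta score, the weight alpha_f corresponds to beta squared.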

The following three metrics are combinations of recall (sensitivity) and specificity, which are also helpful for classifier evaluation on imbalanced sets. Their application for the evaluation of classification is motivated in [198].

• Youden's index is a simple combination of $m_{REC}$ and $m_{SPEC}$:
$$m_Y = m_{REC} + m_{SPEC} - 1. \tag{4.14}$$

• Positive and negative likelihoods measure the performance on positive and negative instances separately, but with respect to both the sensitivity and specificity values:
$$m_{L+} = \frac{m_{REC}}{1 - m_{SPEC}}; \quad m_{L-} = \frac{1 - m_{REC}}{m_{SPEC}}. \tag{4.15}$$

• Geometric mean is the square root of the product of $m_{REC}$ and $m_{SPEC}$:
$$m_{GEOM} = \sqrt{m_{REC} \cdot m_{SPEC}}. \tag{4.16}$$
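The three combinations of recall and specificity in Equations 4.14 to 4.16 can be sketched in Python as follows. The function names and the example values are hypothetical; the sketch does not handle the boundary cases in which the specificity is exactly 0 or 1, where one of the likelihoods is undefined.

```python
import math


def youden_index(rec, spec):
    """Eq. 4.14."""
    return rec + spec - 1


def likelihoods(rec, spec):
    """Eq. 4.15: returns (positive likelihood, negative likelihood).

    Raises ZeroDivisionError for spec == 1 or spec == 0.
    """
    return rec / (1 - spec), (1 - rec) / spec


def geometric_mean(rec, spec):
    """Eq. 4.16: square root of the product of recall and specificity."""
    return math.sqrt(rec * spec)


# Hypothetical classifier with recall 0.8 and specificity 0.9.
m_y = youden_index(0.8, 0.9)
m_l_pos, m_l_neg = likelihoods(0.8, 0.9)
m_geom = geometric_mean(0.8, 0.9)
```

A classifier that assigns every instance to one class has a recall of 1 and a specificity of 0, or vice versa, so that both Youden's index and the geometric mean are equal to 0.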

Another possibility to evaluate the classification quality is to measure the correlation between the sequence of labels for all classification windows $y_L$ and the sequence of predicted categories for all classification windows $y_P$:

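• Standard correlation coefficient is equal to 1 in the case of the strongest dependency between the input variables, -1 in the presence of the strongest anticorrelation (an increase of the first variable leads to a decrease of the second one), and 0 if the variables are not linearly dependent on each other. It is defined as follows:
$$m_c = \frac{\mathrm{Cov}(y_P, y_L)}{\sqrt{\mathrm{Var}(y_P) \cdot \mathrm{Var}(y_L)}}, \tag{4.17}$$
where the covariance is
$$\mathrm{Cov}(y_P, y_L) = \frac{\sum_{i=1}^{T} (y_P(x_i) - \bar{y}_P) \cdot (y_L(x_i) - \bar{y}_L)}{T - 1} \tag{4.18}$$
and the variances are
$$\mathrm{Var}(y_P) = \frac{\sum_{i=1}^{T} (y_P(x_i) - \bar{y}_P)^2}{T - 1}; \quad \mathrm{Var}(y_L) = \frac{\sum_{i=1}^{T} (y_L(x_i) - \bar{y}_L)^2}{T - 1}, \tag{4.19}$$
with $\bar{y}_P$ and $\bar{y}_L$ denoting the mean values of the predicted and labelled categories over all $T$ windows.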
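Equations 4.17 to 4.19 can be checked with a small Python sketch. The function names and the example labels are hypothetical; the sketch raises a ZeroDivisionError if one of the sequences is constant, because its variance is then 0.

```python
import math


def covariance(a, b):
    """Eq. 4.18 (and Eq. 4.19 for a == b): sample covariance with T - 1."""
    t = len(a)
    mean_a = sum(a) / t
    mean_b = sum(b) / t
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (t - 1)


def correlation(y_p, y_l):
    """Eq. 4.17: covariance normalised by the root of both variances."""
    variance_p = covariance(y_p, y_p)
    variance_l = covariance(y_l, y_l)
    return covariance(y_p, y_l) / math.sqrt(variance_p * variance_l)


# Hypothetical binary labels (TP=2, TN=3, FP=2, FN=1).
y_l = [1, 1, 1, 0, 0, 0, 0, 0]
y_p = [1, 1, 0, 0, 0, 0, 1, 1]
m_c = correlation(y_p, y_l)
```

For two binary sequences the result can also be written in terms of the confusion matrix entries as (TP · TN - FP · FN) divided by the square root of (TP + FP)(TP + FN)(TN + FP)(TN + FN), which is known as the phi coefficient or Matthews correlation coefficient.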

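• Spearman's rho rank coefficient is a special case of the Pearson product-moment correlation coefficient, where $R(\cdot)$ denotes the rank of the input variable obtained from a preceding sorting:
$$c_\rho = \frac{\sum_{i=1}^{T} R(y_P(x_i)) \cdot R(y_L(x_i)) - T \left( \frac{T+1}{2} \right)^2}{\sqrt{\left( \sum_{i=1}^{T} R^2(y_P(x_i)) - T \left( \frac{T+1}{2} \right)^2 \right) \cdot \left( \sum_{i=1}^{T} R^2(y_L(x_i)) - T \left( \frac{T+1}{2} \right)^2 \right)}}. \tag{4.20}$$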
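Equation 4.20 can be sketched in Python once a ranking rule is fixed. The text does not specify how tied values are ranked; the sketch below assumes that tied values share the mean of their rank positions, which is a common convention but an assumption here. All names and example values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with the first.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(y_p, y_l):
    """Eq. 4.20 evaluated on average ranks."""
    t = len(y_p)
    r_p, r_l = average_ranks(y_p), average_ranks(y_l)
    correction = t * ((t + 1) / 2) ** 2
    numerator = sum(a * b for a, b in zip(r_p, r_l)) - correction
    spread_p = sum(a * a for a in r_p) - correction
    spread_l = sum(b * b for b in r_l) - correction
    return numerator / math.sqrt(spread_p * spread_l)


# Hypothetical graded values without ties.
rho = spearman_rho([10, 20, 30, 40], [1, 3, 2, 4])
```

With average ranks, the ranks of a binary sequence are an affine transformation of its values, so for binary labels and predictions this coefficient coincides with the standard correlation coefficient.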

Because we build many classification windows from a single song for genre and style prediction (see Section 2.3.4), we distinguish between song-level and classification window-level evaluation metric estimation. The classification window-level evaluation calculates the performance for all classification windows. For binary song-level evaluation, which is based on a binary partition-level classification with $y_P(x_i) \in \{0; 1\}$, we estimate the predicted song category by majority voting across all predicted labels for the classification window feature vectors:
$$y_P(x_1, ..., x_{T'}) = \left\lceil \frac{\sum_{i=1}^{T'} y_P(x_i)}{T'} - 0.5 \right\rceil, \tag{4.21}$$
where $T'$ is the number of classification windows in a song. The $y_P(x_i)$ and $y_L(x_i)$ values in Equations 4.1 to 4.20 can then be replaced by the corresponding labels for songs. If a metric $m_i$ was estimated on the song level, we denote it by $m^s_i$.
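The majority voting of Equation 4.21 can be written as a short Python sketch. The function name and the example window predictions are hypothetical.

```python
import math


def song_prediction(window_predictions):
    """Eq. 4.21: majority vote over the binary window predictions of one song."""
    num_windows = len(window_predictions)
    positive_share = sum(window_predictions) / num_windows
    return math.ceil(positive_share - 0.5)


# Hypothetical songs with three, four and three classification windows.
mostly_positive = song_prediction([1, 1, 0])
tied = song_prediction([1, 1, 0, 0])
all_negative = song_prediction([0, 0, 0])
```

Because of the ceiling function, a song is assigned to the positive category only if strictly more than half of its windows are predicted as positive; a tie results in the negative category.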

The metrics estimated on the window level evaluate a classifier more precisely. On the other hand, for user-driven scenarios it is almost always acceptable or even desired that complete songs are assigned to categories. In the latter case, the classification performance is usually better than the classification window-level performance: for example, a ‘classic’ song is identified correctly whenever the share of its classification windows predicted as ‘classic’ lies between 50% and 100%. Therefore, we applied partition-level FS optimisation for the recognition of the high-level features, and song-level FS optimisation for the recognition of genres and styles.

Figure 4.1 illustrates this effect. Both subfigures plot the balanced relative error and the selected feature rate (defined later in Eq. 4.24) on the holdout set for the feature subsets which were generated during 2,000 evaluations of the 10 experiments for the recognition of the Classic category with RF. The runs in the left subfigure were evaluated and optimised using $m_{BRE}$ (partition-level) and $m_{SFR}$. For the right subfigure, the metrics were $m^s_{BRE}$ (song-level) and $m_{SFR}$. Song-level classification has clearly lower errors than window-level classification: even with larger feature sets, almost always $m^s_{BRE} < 0.04$, whereas for window-level classification in most cases $0.04 < m_{BRE} < 0.1$.