Statistical Comparison of Algorithms - Regression-based estimation of pain and facial expressio

This chapter explains how we evaluate the proposed recognition systems. First, Sec. 5.1 provides details on how to obtain a recognition score given predictions and ground-truth. Then the partitioning into training and testing data is given in Sec. 5.2. Finally, Sec. 5.3 explains how the performance of multiple algorithms is statistically compared based on scores from multiple databases.

5.1 Metrics

In order to evaluate a recognition system, it is necessary to compare its prediction output from testing data with the given ground-truth. The better the recognition system, the closer the predictions should match the ground-truth. The distance between ground-truth and predictions is measured quantitatively by the evaluation metric. This section provides an overview of the most common evaluation metrics for continuous targets, which are the Pearson product-moment correlation coefficient (CORR), the mean squared error (MSE) and the Intra-Class Correlation Coefficient (ICC). We include CORR and MSE, since they have been the most common metrics for evaluating affect analysis [69], and MSE is the most common metric for evaluating regression algorithms (e.g. [147,174]). Recently, the Intra-Class Correlation Coefficient ICC(3,1) [166] has been proposed for evaluating approaches related to automatic human behaviour analysis (e.g. [74,121]) and thus we include this measure as well. First, we define each of the metrics and then discuss their differences.

Given $N$ ground-truth targets $t = [t_1, \dots, t_N]^\top$, $t_n \in \mathbb{R}$, and $N$ regressor predictions $y = [y_1, \dots, y_N]^\top$, $y_n \in \mathbb{R}$, the CORR is defined as:

$$\mathrm{CORR}(t, y) = \frac{\mathrm{cov}[t, y]}{\sigma_t \sigma_y}, \tag{5.1}$$

where $\sigma_t$ and $\sigma_y$ are the standard deviations of $t$ and $y$, and $\mathrm{cov}[t, y]$ is the covariance between $t$ and $y$. The MSE is defined as the expected value of the squared error:

$$\mathrm{MSE}(t, y) = \frac{1}{N} \sum_{n=1}^{N} (t_n - y_n)^2 \tag{5.2}$$

Note that some authors additionally apply the square root to the MSE (RMSE) (e.g., [158]), but this does not change the performance ranking between algorithms, since the square root is strictly monotonically increasing. If we assume the error to be a zero-mean continuous random variable, then the MSE is its variance, while the RMSE is its standard deviation. Thus the RMSE has the same unit as the error, while the MSE has the original unit squared.
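The definitions in Eq. (5.1) and Eq. (5.2) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names and the toy vectors `t` and `y` are assumptions, and the covariance uses the 1/N normalisation.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def cov(t, y):
    """Covariance with 1/N normalisation (an assumption of this sketch)."""
    mt, my = mean(t), mean(y)
    return sum((a - mt) * (b - my) for a, b in zip(t, y)) / len(t)

def corr(t, y):
    """Pearson correlation, Eq. (5.1): cov[t, y] / (sigma_t * sigma_y)."""
    return cov(t, y) / math.sqrt(cov(t, t) * cov(y, y))

def mse(t, y):
    """Mean squared error, Eq. (5.2)."""
    return sum((a - b) ** 2 for a, b in zip(t, y)) / len(t)

def rmse(t, y):
    """Square root of the MSE; same unit as the error itself."""
    return math.sqrt(mse(t, y))

# Hypothetical toy data: ground-truth t and predictions y.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 2.5, 2.5, 4.5]

print("CORR:", corr(t, y))
print("MSE: ", mse(t, y))
print("RMSE:", rmse(t, y))
```

Since the square root is monotonic, ranking algorithms by `mse` or by `rmse` gives the same order.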

In order to compare the (R)MSE across different datasets, the ground-truth targets should have the same standard deviation. Otherwise, we could scale the targets and predictions of one dataset with an arbitrary constant $c$ and thus scale both the MSE and the RMSE arbitrarily (the only difference is that the RMSE would be scaled linearly in $c$, while the MSE would be scaled by $c^2$). A standardized scaling is implicitly performed for CORR and ICC, and thus CORR and ICC are better measures for comparison across datasets than MSE and RMSE.
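The scaling argument can be checked numerically. The following sketch assumes hypothetical toy data and an arbitrary constant `c = 3`; it only illustrates that the MSE grows with c², the RMSE with c, and that CORR is unaffected.

```python
import math

def cov(t, y):
    # Covariance with 1/N normalisation (assumption of this sketch).
    mt, my = sum(t) / len(t), sum(y) / len(y)
    return sum((a - mt) * (b - my) for a, b in zip(t, y)) / len(t)

def corr(t, y):
    return cov(t, y) / math.sqrt(cov(t, t) * cov(y, y))

def mse(t, y):
    return sum((a - b) ** 2 for a, b in zip(t, y)) / len(t)

# Hypothetical toy data and scaling constant.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 2.5, 2.5, 4.5]
c = 3.0
t_scaled = [c * v for v in t]
y_scaled = [c * v for v in y]

mse_ratio = mse(t_scaled, y_scaled) / mse(t, y)        # expected: c ** 2
rmse_ratio = math.sqrt(mse(t_scaled, y_scaled)) / math.sqrt(mse(t, y))  # expected: c
corr_diff = corr(t_scaled, y_scaled) - corr(t, y)      # expected: 0

print(mse_ratio, rmse_ratio, corr_diff)
```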

The ICC [166] originates from behavioural psychology and measures the agreement between two or more raters. It is based on quantities obtained by the Analysis of Variance (ANOVA) framework. Several types of ICC have been defined, each one differing by the data model, see [125,166]. The ICC of concern in this work is denoted ICC(3,1) according to [166] and ICC(C,1) Case 3 according to [125], since it is the commonly used ICC for evaluating AU intensity estimation (e.g., in [121,122,152]). All further mentions of ICC refer to this specific ICC type. The ICC is defined as $\mathrm{ICC} = \frac{\mathrm{BMS} - \mathrm{EMS}}{\mathrm{BMS} + (K-1)\,\mathrm{EMS}}$, where $K$ is the number of raters, BMS is the between-target mean square and EMS is the residual mean square, as defined by ANOVA. We use the ICC as an evaluation metric and thus $K = 2$, since $t$ and $y$ correspond to one rater each. In this case, the formula can be simplified to:

$$\mathrm{ICC}(t, y) = \frac{2\,\mathrm{cov}[t, y]}{\sigma_t^2 + \sigma_y^2} \tag{5.3}$$
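To illustrate that the simplified form in Eq. (5.3) agrees with the ANOVA-based definition for K = 2, the sketch below computes both. The mean-square formulas are the usual ones for a two-way layout (targets by raters) and are an assumption here, since the text only refers to them as "defined by ANOVA"; function names and data are hypothetical.

```python
import math

def cov(t, y):
    # Covariance with 1/N normalisation (assumption of this sketch).
    mt, my = sum(t) / len(t), sum(y) / len(y)
    return sum((a - mt) * (b - my) for a, b in zip(t, y)) / len(t)

def icc_anova(t, y):
    """ICC = (BMS - EMS) / (BMS + (K - 1) * EMS) with K = 2 raters."""
    n, k = len(t), 2
    rows = list(zip(t, y))                        # one row per target
    grand = (sum(t) + sum(y)) / (n * k)           # grand mean
    row_means = [(a + b) / k for a, b in rows]    # per-target means
    col_means = [sum(t) / n, sum(y) / n]          # per-rater means
    # Between-target mean square.
    bms = k * sum((r - grand) ** 2 for r in row_means) / (n - 1)
    # Residual mean square (target-by-rater interaction).
    ems = sum((x - r - c + grand) ** 2
              for row, r in zip(rows, row_means)
              for x, c in zip(row, col_means)) / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)

def icc_simplified(t, y):
    """Eq. (5.3): 2 cov[t, y] / (sigma_t^2 + sigma_y^2)."""
    return 2 * cov(t, y) / (cov(t, t) + cov(y, y))

# Hypothetical toy data.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 2.5, 2.5, 4.5]

print(icc_anova(t, y), icc_simplified(t, y))
```

The two values coincide because the normalisation constants of the mean squares cancel in the ratio.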

CORR is a linearity index, since it measures the degree to which y = at + b holds for arbitrary constants a and b [125]. In contrast, ICC measures the degree to which y = t + b holds, and thus is an additivity index [125]. MSE measures the degree to which the identity mapping y = t holds.

To better understand the differences between the metrics, we describe their respective invariances, i.e., how t and y can change without changing the value of the metric. Furthermore, we demonstrate an equivalence transform between the metrics, i.e., how t and y can be normalized in order to obtain equivalent metric values.

From the functional mapping between ground-truth and predictions above, it follows that (1) CORR is invariant to additive and multiplicative constants, (2) ICC is invariant to additive constants and (3) MSE is not invariant to any constants.
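The three invariance statements can be checked numerically. The sketch below uses hypothetical toy data and arbitrarily chosen constants (shift 1.0, scale 2.0); it is an illustration under those assumptions, not a proof.

```python
import math

def cov(t, y):
    # Covariance with 1/N normalisation (assumption of this sketch).
    mt, my = sum(t) / len(t), sum(y) / len(y)
    return sum((a - mt) * (b - my) for a, b in zip(t, y)) / len(t)

def corr(t, y):
    return cov(t, y) / math.sqrt(cov(t, t) * cov(y, y))

def icc(t, y):
    return 2 * cov(t, y) / (cov(t, t) + cov(y, y))

def mse(t, y):
    return sum((a - b) ** 2 for a, b in zip(t, y)) / len(t)

# Hypothetical toy data.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 2.5, 2.5, 4.5]

shifted = [v + 1.0 for v in y]        # additive constant only
scaled = [2.0 * v for v in y]         # multiplicative constant only
affine = [2.0 * v + 1.0 for v in y]   # both

# (1) CORR is unchanged by scaling and shifting the predictions.
corr_invariant = math.isclose(corr(t, y), corr(t, affine))
# (2) ICC is unchanged by shifting, but changes under scaling.
icc_shift_invariant = math.isclose(icc(t, y), icc(t, shifted))
icc_scale_invariant = math.isclose(icc(t, y), icc(t, scaled))
# (3) MSE changes already under a shift.
mse_shift_invariant = math.isclose(mse(t, y), mse(t, shifted))

print(corr_invariant, icc_shift_invariant, icc_scale_invariant, mse_shift_invariant)
```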

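The ICC is equivalent to CORR if we normalize $t$ and $y$ regarding their standard deviation, i.e., for $\hat{t} = \frac{t}{\sigma_t}$ and $\hat{y} = \frac{y}{\sigma_y}$, the following holds:

$$\mathrm{ICC}(\hat{t}, \hat{y}) = \frac{2\,\mathrm{cov}[\hat{t}, \hat{y}]}{\sigma_{\hat{t}}^2 + \sigma_{\hat{y}}^2} = \frac{2\,\mathrm{cov}\!\left[\frac{t}{\sigma_t}, \frac{y}{\sigma_y}\right]}{1 + 1} = \frac{\mathrm{cov}[t, y]}{\sigma_t \sigma_y} = \mathrm{CORR}(t, y) \tag{5.4}$$

Analogously, the MSE is equivalent to CORR if we normalize $t$ and $y$ regarding their mean and standard deviation, i.e., for $\hat{t} = \frac{t - \mu_t}{\sigma_t}$ and $\hat{y} = \frac{y - \mu_y}{\sigma_y}$, where $\mu_t$ is the mean of $t$ and $\mu_y$ is the mean of $y$, the following holds:

$$\mathrm{MSE}(\hat{t}, \hat{y}) = \frac{1}{N} \sum_{n} \left( \hat{t}_n^2 - 2\,\hat{t}_n \hat{y}_n + \hat{y}_n^2 \right) = 1 - 2\,\frac{\mathrm{cov}[t, y]}{\sigma_t \sigma_y} + 1 = -2\,\mathrm{CORR}(t, y) + 2 \tag{5.5}$$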
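Both equivalences, Eq. (5.4) and Eq. (5.5), can be verified numerically. The sketch below assumes hypothetical toy data and uses the 1/N normalisation for variance and covariance, which is what makes the normalized variances exactly 1.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def cov(t, y):
    # Covariance with 1/N normalisation (assumption of this sketch).
    mt, my = mean(t), mean(y)
    return sum((a - mt) * (b - my) for a, b in zip(t, y)) / len(t)

def std(x):
    return math.sqrt(cov(x, x))

def corr(t, y):
    return cov(t, y) / (std(t) * std(y))

def icc(t, y):
    return 2 * cov(t, y) / (cov(t, t) + cov(y, y))

def mse(t, y):
    return sum((a - b) ** 2 for a, b in zip(t, y)) / len(t)

# Hypothetical toy data.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.5, 1.0, 2.5, 2.5, 4.5]

# Eq. (5.4): divide by the standard deviation, then ICC equals CORR.
t_hat = [v / std(t) for v in t]
y_hat = [v / std(y) for v in y]
icc_normalized = icc(t_hat, y_hat)

# Eq. (5.5): subtract the mean and divide by the standard deviation,
# then MSE equals 2 - 2 * CORR.
t_z = [(v - mean(t)) / std(t) for v in t]
y_z = [(v - mean(y)) / std(y) for v in y]
mse_normalized = mse(t_z, y_z)

print(icc_normalized, corr(t, y))
print(mse_normalized, 2 - 2 * corr(t, y))
```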

CORR, MSE and ICC all measure different aspects of the distance between ground-truth and predictions. Which measure is preferred depends highly on the application domain and thus we usually show the results of all three measures.

5.2 Division of Training and Testing Data

When evaluating a recognition system, we are interested in a performance estimate for unseen data, i.e. data that has not been used during training. This makes it necessary to divide the available data into non-overlapping sets of training and testing data. When applying the same principle to data from human subjects, we can extend the requirement to performance estimates for unseen subjects, i.e. subjects that have not been used during training.

Thus for dividing the data into training and testing sets, we use the subject-independent setting, where the videos of selected subjects are left out for testing, and the videos of all other subjects in the dataset are used for training. This process is repeated with different subjects, until all subjects have been used for testing. The results are combined by calculating the weighted average across all subjects left out for testing. The weight of each subject corresponds to the number of frames each subject occurs in.
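The subject-independent protocol can be sketched as follows. This is a simplified illustration under several assumptions: exactly one subject is left out per fold (the text allows several), the data layout, the helper names `fit` and `predict`, and the placeholder "model" that simply predicts the mean training target are all hypothetical.

```python
def mse(t, y):
    return sum((a - b) ** 2 for a, b in zip(t, y)) / len(t)

def subject_independent_score(frames_by_subject, fit, predict, metric):
    """Leave one subject out per fold; combine per-subject scores by a
    frame-count-weighted average."""
    scores, weights = [], []
    for held_out in frames_by_subject:
        # Training frames come from all subjects except the held-out one.
        train = [frame
                 for subject, frames in frames_by_subject.items()
                 if subject != held_out
                 for frame in frames]
        test = frames_by_subject[held_out]
        model = fit(train)
        targets = [target for _, target in test]
        predictions = [predict(model, features) for features, _ in test]
        scores.append(metric(targets, predictions))
        weights.append(len(test))  # weight = number of frames of this subject
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Placeholder regressor (assumption): always predicts the mean training target.
def fit(train):
    return sum(target for _, target in train) / len(train)

def predict(model, features):
    return model

# Hypothetical data: subject id -> list of (features, target) frames.
data = {
    "s1": [(0.0, 1.0), (0.0, 3.0)],
    "s2": [(0.0, 2.0)],
    "s3": [(0.0, 4.0), (0.0, 4.0), (0.0, 4.0)],
}

score = subject_independent_score(data, fit, predict, mse)
print(score)
```

Weighting by frame count means that a subject with many frames influences the combined score more than a subject with few frames, which matches pooling the per-frame squared errors in the case of the MSE.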

We always use the subject-independent setting within this thesis for all datasets, except the results in Tab. 8.4, where we compare our work to previously published subject-dependent results.

5.3 Statistical Comparison of Algorithms

When evaluating different AU and pain recognition methods regarding the performance metrics explained in Sec. 5.1, we usually obtain one score per algorithm, per target and per database. Multiple scores make it difficult to directly compare and rank algorithms, since usually no single algorithm consistently performs significantly better than all others. Therefore we perform statistical comparison tests of algorithms over multiple data sets, as suggested by [38]. First, the Friedman test [58] is applied to obtain a score rank and to detect whether all algorithms are statistically the same. If the null hypothesis is rejected, then the Hommel [78] post-hoc procedure is applied to detect which pairs of algorithms are different. Both tests are performed at a significance level of 0.05. Since a larger set of databases (2 or 3 is not sufficient) is needed to produce a meaningful result, we assume each target to be a different database and thus obtain an overall ranking of algorithms across targets and databases. For example, when comparing algorithms on DISFA with 12 AUs and ShoulderPain with 10 AUs, we apply the Friedman test over a total of 12 + 10 = 22 databases.
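The ranking step of the Friedman test can be sketched as below. This is a minimal illustration under stated assumptions: higher scores are better (as for CORR and ICC), ties share the average rank, the statistic is the chi-square form without tie correction, and the score table is hypothetical. The Hommel post-hoc procedure is not covered by this sketch.

```python
def average_ranks(scores):
    """scores[i][j] is the score of algorithm j on database i (higher is
    better). Returns the mean rank per algorithm, where rank 1 is best."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])  # best first
        i = 0
        while i < k:
            # Find the block of tied scores starting at position i.
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
            for m in range(i, j + 1):
                totals[order[m]] += shared
            i = j + 1
    return [total / len(scores) for total in totals]

def friedman_statistic(scores):
    """Friedman chi-square statistic with k - 1 degrees of freedom."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)

# Hypothetical scores: 4 databases (rows) x 3 algorithms A, B, C (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.75, 0.50],
    [0.90, 0.85, 0.80],
]

ranks = average_ranks(scores)
statistic = friedman_statistic(scores)
print(ranks, statistic)
```

The statistic would then be compared against the chi-square distribution with k − 1 degrees of freedom (5.991 at the 0.05 level for k = 3). With as few databases as in this toy table the approximation is crude, which is in line with the remark above that a larger set of databases is needed.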

The Friedman results are reported as a ranked list of algorithms. Each algorithm subset which has equal score rank (according to the Hommel procedure) is indicated by a black bar on the right side which spans the rows of the included algorithms. An example result is shown in Tab. 5.1: in this case, the algorithm pairs (A,B), (B,C) and (C,D) are not statistically different, but there is a difference between the pairs (A,C), (A,D), and (B,D). Thus, no single algorithm is clearly the best, but we know that the best must be either A or B (including the option that A and B are equally good).


Table 5.1: Example representation of the Friedman test [58] rank results and equal-performance subsets obtained by Hommel’s dynamic procedure [78]. The different algorithms are ranked by their expected performance rate. The subsets of algorithms which have statistically equal performance are indicated by a black bar on the right side.

Rank  Method
1     Algorithm A
2     Algorithm B
3     Algorithm C

