Experiment 3: The effect of post-processing

Since the predictions from the machine learning step usually suffer from issues such as bias, scaling and noise, various post-processing techniques have been employed in recently developed affect recognition systems. For instance, Kächele et al. (2015) scaled the predictions using the minimum and maximum values from the training partitions. Chen and Jin (2015) used exponential smoothing to remove noise, and He et al. (2015) employed Gaussian smoothing with both fixed and variable window lengths. The aim of this experiment is to study how post-processing steps can be used to improve the recognition results.

Table 4.6: Results on the development partition for different block sizes with shifted annotations. The best result for each metric is highlighted in bold.

            Arousal                          Valence
Feature     Delay  RMSE   CC      CCC        Delay  RMSE   CC      CCC
LBP         1.2    0.189  0.189   0.061      2.4    0.127  0.183   0.130
LPQ         0.8    0.202  0.201   0.143      1.2    0.132  0.334   0.284
LGBP        3.6    0.223  0.123   0.054      2      0.124  0.204   0.105
LBP-TOP     1.6    0.177  0.406   0.246      2.4    0.125  0.249   0.183
LPQ-TOP     1.6    0.181  0.430   0.348      1.6    0.122  0.403   0.373
LGBP-TOP    2      0.193  0.205   0.065      7.2    0.130  0.169   0.134

(a) Block size: 1 × 1

            Arousal                          Valence
Feature     Delay  RMSE   CC      CCC        Delay  RMSE   CC      CCC
LBP         1.2    0.189  0.188   0.057      1.6    0.122  0.289   0.231
LPQ         0.8    0.200  0.136   0.090      2      0.124  0.362   0.327
LGBP        4.8    0.224  0.082   0.039      2.4    0.131  0.191   0.065
LBP-TOP     2      0.179  0.393   0.227      1.6    0.132  0.205   0.174
LPQ-TOP     2      0.188  0.284   0.176      1.6    0.121  0.378   0.356
LGBP-TOP    2      0.193  0.293   0.192      1.6    0.135  0.193   0.155

(b) Block size: 2 × 2

            Arousal                          Valence
Feature     Delay  RMSE   CC      CCC        Delay  RMSE   CC      CCC
LBP         1.6    0.200  0.279   0.144      2.4    0.113  0.397   0.304
LPQ         6.4    0.266  -0.024  -0.022     1.2    0.129  0.307   0.240
LGBP        3.2    0.232  0.032   0.018      2      0.129  0.311   0.231
LBP-TOP     2      0.184  0.435   0.225      2.4    0.116  0.412   0.243
LPQ-TOP     1.6    0.188  0.316   0.195      1.6    0.111  0.445   0.303
LGBP-TOP    2      0.189  0.305   0.097      2      0.121  0.269   0.135

(c) Block size: 4 × 4

Figure 4.3: Comparison of the performance of different block size configurations in terms of CC score on the (a) Arousal and (b) Valence dimensions

Figure 4.4: Plot of arousal delay for static features

Figure 4.6: Plot of valence delay for static features

4.4.1 Experimental Procedure

As in the preceding experiment, the AVEC 2015 dataset was used. The following post-processing steps were applied to the initial predictions: (i) median filtering and (ii) centering and scaling. The median filter is a nonlinear filter commonly used in digital signal and image processing for noise reduction. It goes through every entry in the signal and replaces each entry with the median of its neighbouring entries. The number of neighbouring entries is usually referred to as the window size. In this experiment, window sizes ranging from 0.4 s (10 frames) to 20 s (500 frames) were tested with a step size of 0.4 s. For centering and scaling, the following formula was used:

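$$y_{\text{final}} = \frac{y_{\text{pred}} - \mu_{\text{pred}}}{\sigma_{\text{pred}}} \cdot \sigma_{\text{train}} + \mu_{\text{train}} \tag{4.1}$$

where $y_{\text{final}}$ is the final prediction, $y_{\text{pred}}$ is the raw prediction, $\mu_{\text{pred}}$ is the mean of the raw prediction, $\sigma_{\text{pred}}$ is the standard deviation of the raw prediction, $\sigma_{\text{train}}$ is the standard deviation of the training labels, and $\mu_{\text{train}}$ is the mean of the training labels.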
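To make Equation 4.1 concrete, the following is a minimal Python sketch of the centering and scaling step. The function name `center_and_scale`, the use of the population standard deviation, and the example numbers are illustrative assumptions and are not taken from the original implementation.

```python
from statistics import mean, pstdev


def center_and_scale(y_pred, mu_train, sigma_train):
    """Apply Eq. (4.1): standardise the raw predictions, then map them
    onto the mean and standard deviation of the training labels."""
    mu_pred = mean(y_pred)
    sigma_pred = pstdev(y_pred)  # assumption: population standard deviation
    if sigma_pred == 0:
        raise ValueError("constant predictions cannot be rescaled")
    return [(y - mu_pred) / sigma_pred * sigma_train + mu_train for y in y_pred]


# Hypothetical raw predictions and training-label statistics.
raw = [0.10, 0.12, 0.08, 0.11]
final = center_and_scale(raw, mu_train=0.05, sigma_train=0.2)
```

By construction, the rescaled predictions have the same mean and standard deviation as the training labels, which is what removes the bias and scaling mismatch of the raw output.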

The prediction results from the second experiment were used in this experiment. Specifically, the 4 × 4 LBP-TOP feature was selected for arousal prediction and the 4 × 4 LPQ-TOP feature for valence prediction, since they achieved the highest CC scores on their respective dimensions after delay compensation (0.435 and 0.445; see Table 4.6).
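The median filtering step described in the procedure can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: the function names, the centred window, the truncation of the window at the sequence boundaries, and the rate of 25 frames per second (inferred from 0.4 s corresponding to 10 frames) are assumptions.

```python
from statistics import median

FPS = 25  # assumption: 0.4 s == 10 frames implies 25 predictions per second


def seconds_to_frames(seconds, fps=FPS):
    """Convert a window length in seconds to a whole number of frames."""
    return max(1, round(seconds * fps))


def median_filter(signal, window):
    """Replace each entry with the median of the `window` entries around it.

    The window is truncated at the start and end of the sequence
    (an assumption; other boundary treatments are possible).
    """
    n = len(signal)
    out = []
    for i in range(n):
        start = i - window // 2
        end = start + window
        out.append(median(signal[max(0, start):min(n, end)]))
    return out


# Candidate window sizes from 0.4 s to 20 s in steps of 0.4 s, in frames.
CANDIDATE_WINDOWS = [seconds_to_frames(0.4 * k) for k in range(1, 51)]

# Toy signal: an isolated spike is removed, a sustained step is kept.
noisy = [0.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.5, 0.5]
smoothed = median_filter(noisy, window=3)
```

In the toy signal the single-frame spike is suppressed while the sustained change at the end is preserved, which is the property that makes the median filter attractive for removing noise from frame-level predictions.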

4.4.2 Experimental Results and Analysis

Table 4.7 shows the original prediction results and the prediction results after post-processing. As can be seen, both the arousal and valence predictions improved in terms of CC score (from 0.435 to 0.590 and from 0.445 to 0.546, respectively) and improved substantially in terms of CCC score (from 0.225 to 0.553 and from 0.303 to 0.537). For the arousal dimension, a median filter with a window size of 4.8 s achieved the best results, while for the valence dimension a window size of 4 s achieved the best results. To visualise the effect of the post-processing steps, Figures 4.8 and 4.9 show the arousal and valence curves for a segment of video, including the ground-truth labels, the initial prediction, the prediction after median filtering, and the prediction after centering and scaling. The figures show that, after median filtering, the prediction curve becomes much smoother while still capturing the trend of the ground truth. After the centering and scaling step, the bias and scaling issues of the initial prediction are reduced.

Table 4.7: Original and post-processed results on the development partition of the AVEC 2015 dataset

                        Original                 Post-processed
Feature    Dimension    RMSE   CC     CCC        RMSE   CC     CCC
LBP-TOP    Arousal      0.184  0.435  0.225      0.181  0.590  0.553
LPQ-TOP    Valence      0.111  0.445  0.303      0.125  0.546  0.537
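The larger gain in CCC than in CC is consistent with how the two metrics are defined: CC is unaffected by a constant offset or a positive rescaling of the predictions, whereas CCC penalises both. The sketch below uses the standard definitions of RMSE, Pearson's correlation coefficient and Lin's concordance correlation coefficient. The function names and toy data are illustrative assumptions; the source does not state how the metrics were implemented.

```python
from math import sqrt
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def _moments(x, y):
    """Means, population variances and covariance of two sequences."""
    mx, my = mean(x), mean(y)
    vx = mean((a - mx) ** 2 for a in x)
    vy = mean((b - my) ** 2 for b in y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, vx, vy, cov


def pearson_cc(y_true, y_pred):
    """Pearson's correlation coefficient (undefined for constant input)."""
    _, _, vx, vy, cov = _moments(y_true, y_pred)
    return cov / sqrt(vx * vy)


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient."""
    mx, my, vx, vy, cov = _moments(y_true, y_pred)
    return 2 * cov / (vx + vy + (mx - my) ** 2)


# Toy data: predictions follow the trend exactly but with the wrong
# offset and scale, mimicking the bias and scaling issues in the text.
truth = [0.0, 0.1, 0.2, 0.3, 0.4]
biased = [0.5 + 0.5 * t for t in truth]

cc_biased = pearson_cc(truth, biased)   # 1 up to rounding
ccc_biased = ccc(truth, biased)         # about 0.11
```

On this toy example CC remains at its maximum while CCC drops sharply, so a step that only corrects offset and scale, such as centering and scaling, can raise CCC considerably while leaving CC unchanged.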

Figure 4.8: Plot of post-processed prediction on the arousal dimension using LBP-TOP features

Tables 4.8 and 4.9 compare the recognition results with the AVEC 2015 baseline system (Ringeval et al., 2015b) and the AVEC 2015 winning system (He et al., 2015), where the best results are highlighted in bold. Our system outperformed the AVEC 2015 baseline system in terms of CC on both the arousal and valence dimensions. Compared to the winning system, the proposed system achieved close performance on the arousal dimension (CC of 0.590 versus 0.665) and a higher CC on the valence dimension (0.546 versus 0.501), although its valence RMSE is higher (0.125 versus 0.105). However, it should be noted that the winning system used a deep Bidirectional Long Short-Term Memory (BLSTM) Recurrent Neural Network, which is known to model temporal information better than a linear SVR. It also used additional steps such as feature selection to select more correlated features. Since the aim of these experiments is to thoroughly investigate the performance of different histogram-based features for continuous affect recognition, only the linear SVR was used as the learning technique in order to benchmark the different features.

Figure 4.9: Plot of post-processed prediction on the valence dimension using LPQ-TOP features

Table 4.8: Comparison with selected baseline results on the development set on arousal dimension

Arousal
        Ringeval et al. (2015b)    He et al. (2015)    Our results
RMSE    0.214                      0.148               0.181
CC      0.183                      0.665               0.590
CCC     0.103                      0.587               0.553

Table 4.9: Comparison with selected baseline results on the development set on valence dimension

Valence
        Ringeval et al. (2015b)    He et al. (2015)    Our results
RMSE    0.117                      0.105               0.125
CC      0.358                      0.501               0.546
