Comparison of Existing Systems - Investigating multi-modal features for continuous affect recog

To select the most appropriate baseline system, a number of systems evaluated on the Audio-Visual Emotion Challenge (AVEC) data are compared. AVEC is an annual competition aimed at automatic affect analysis, and it provides a common benchmark dataset for multimodal affect recognition. In particular, systems evaluated on AVEC 2012 (Schuller et al., 2012) are compared, since this was the most recent edition of the challenge at the time of implementation.

Table I: AVEC 2012 audio low-level descriptors (LLD) (Schuller et al., 2012)

Energy & Spectral (25):
    loudness (auditory model based), zero crossing rate,
    energy in bands from 250-650 Hz and 1-4 kHz,
    25%, 50%, 75%, and 90% spectral roll-off points,
    spectral flux, entropy, variance, skewness, kurtosis,
    psychoacoustic sharpness, harmonicity, MFCC 1-10

Voicing Related (6):
    F0 (sub-harmonic summation, followed by Viterbi smoothing),
    probability of voicing, jitter, shimmer (local),
    jitter (delta: "jitter of jitter"),
    logarithmic Harmonics-to-Noise Ratio (logHNR)

The system proposed by Schuller et al. (2012) utilises Local Binary Pattern (LBP) features as the video features and statistical features of the low level descriptors for audio features (as shown in Table I and Table II). The features are then learned using Support Vector Machine regression (SVR) with Histogram Intersection Kernels and a Sequential Minimal Optimization (SMO) technique. This system was used as the baseline system for the AVEC 2012 challenge.
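To make the kernel choice concrete, the following is a minimal sketch of a histogram intersection kernel, which scores two histograms by summing their bin-wise minima. It is not the AVEC 2012 baseline code: the function names and the three toy histograms are hypothetical, and the SVR/SMO solver that would consume the resulting kernel matrix is omitted.

```python
def histogram_intersection(x, y):
    """Histogram intersection kernel: sum of the bin-wise minima."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(x, y))


def gram_matrix(histograms):
    """Pairwise kernel (Gram) matrix, as a kernel SVR solver would expect."""
    return [[histogram_intersection(p, q) for q in histograms]
            for p in histograms]


# Hypothetical L1-normalised histograms (illustrative values only,
# standing in for LBP histograms of three video frames).
frames = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.0, 0.0, 1.0],
]
K = gram_matrix(frames)
```

For L1-normalised histograms the kernel value lies between 0 and 1, with 1 reached only when the two histograms are identical, which is one reason this kernel is a natural fit for histogram features such as LBP.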

The system proposed by Nicolle et al. (2012) uses log-magnitude Fourier spectra to extract dynamic information from signals that describe shape deformation and local and global face appearance. The audio features are the same as those used in Schuller et al. (2012). A correlation-based feature selection process is then applied to select a relevant set of features, followed by a weighted K-Means and Nadaraya-Watson kernel regression. As the final step, the predictions from each feature set are fused using a local linear regression to produce the final prediction.
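The Nadaraya-Watson estimator mentioned above predicts a value as a kernel-weighted average of the training targets. The sketch below illustrates the idea for scalar inputs with a Gaussian kernel. The kernel choice, the bandwidth parameter and the function names are assumptions made for illustration, and the sketch does not reproduce the weighted K-Means step or the feature representation of Nicolle et al. (2012).

```python
import math


def gaussian_kernel(u):
    """Unnormalised Gaussian kernel; the constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)


def nadaraya_watson(x_query, xs, ys, bandwidth=1.0):
    """Predict y at x_query as a kernel-weighted average of the targets ys."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    weights = [gaussian_kernel((x_query - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy training data (illustrative only): targets lie on the line y = x.
train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 2.0]
prediction = nadaraya_watson(1.0, train_x, train_y, bandwidth=0.5)
```

A smaller bandwidth makes the prediction follow the nearest training samples more closely, while a larger one smooths over more of them.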

Table II: Set of all 42 functionals (Schuller et al., 2012). ¹ Not applied to delta coefficient contours. ² For delta coefficients the mean of only positive values is applied, otherwise the arithmetic mean is applied. ³ Not applied to voicing related LLD.

Statistical functionals (23):
    (positive²) arithmetic mean, root quadratic mean,
    standard deviation, flatness, skewness, kurtosis,
    quartiles, inter-quartile ranges,
    1% and 99% percentiles, percentile range 1%-99%,
    percentage of frames contour is above minimum + 25%, 50%, and 90% of the range,
    percentage of frames contour is rising,
    maximum, mean, and minimum segment length¹,³,
    standard deviation of segment length¹,³

Regression functionals¹ (4):
    linear regression slope, and corresponding approximation error (linear),
    quadratic regression coefficient a, and approximation error (linear)

Local minima/maxima related functionals¹ (9):
    mean and standard deviation of rising and falling slopes (minimum to maximum),
    mean and standard deviation of inter maxima distances,
    amplitude mean of maxima, amplitude mean of minima,
    amplitude range of maxima

Other¹,³ (6):
    LP gain, LPC 1-5

Another work proposed by Savran et al. (2012a) uses Bayesian filtering with particle filtering to combine the features extracted from the video, audio and lexical modalities. The video features are extracted using Local Binary Patterns (LBP) based on temporal statistics, while the audio features include a subset of features used in Schuller et al. (2012) plus class-level spectral features based on three distinct phoneme classes. The lexical features are calculated as the pointwise mutual information (PMI) between a word and a given affect dimension. A Support Vector Machine for Regression (SVR) is then used for each modality and the final results are fused using a Bayesian framework via particle filtering.
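As an illustration of the lexical feature, the sketch below computes pointwise mutual information, PMI(w, c) = log2(p(w, c) / (p(w) p(c))), from co-occurrence counts of words and affect labels. It assumes that the continuous affect dimension has already been discretised into labels such as "high" and "low"; this discretisation, the toy word list and the function name are assumptions for illustration and are not taken from Savran et al. (2012a).

```python
import math
from collections import Counter


def pmi_table(samples):
    """Return {(word, label): PMI in bits} for observed (word, label) pairs."""
    samples = list(samples)
    n = len(samples)
    if n == 0:
        raise ValueError("at least one (word, label) sample is required")
    joint = Counter(samples)
    word_counts = Counter(word for word, _ in samples)
    label_counts = Counter(label for _, label in samples)
    return {
        (word, label): math.log2(
            (count / n) / ((word_counts[word] / n) * (label_counts[label] / n))
        )
        for (word, label), count in joint.items()
    }


# Hypothetical word occurrences paired with a binarised valence label.
observations = [
    ("great", "high"), ("great", "high"),
    ("bad", "low"), ("bad", "low"),
]
pmi = pmi_table(observations)
```

A positive PMI indicates that a word occurs with a label more often than would be expected if the two were independent. Pairs that never co-occur are simply absent from the returned table.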

The work reported by Baltrusaitis et al. (2013) proposed a framework to utilise the combination of Continuous Conditional Random Fields (CCRF) and SVR for modelling continuous affective state in dimensional space. For video features, the system extracts the geometric features described by the expression parameter, along with appearance features described by Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) and motion features described by head movements. The prosodic features used in Ozkan et al. (2012) are adopted as audio features. SVR is then used to predict the affective state for each of the four feature sets (geometric, appearance, motion and audio) and the final results are fused using the CCRF.

The recent work by Wei et al. (2014) developed a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and multiple kernel learning (MKL) based multi-modal affect prediction framework (LSTM-MKL). Their motivation was to leverage the advantages of LSTM-RNN for modelling long range dependencies between observations and MKL for modelling non-linear correlations between input and output. The system uses visual features proposed in Savran et al. (2012a) and audio features detailed in Schuller et al. (2012).

The prediction results of the above systems, measured in terms of Pearson cross-correlation (see Section 2.4.4 for more details), are shown in Table III. An asterisk marks the highest score for a particular dimension. Compared to the benchmark system proposed by Schuller et al. (2012), all four systems showed a significant increase across all four dimensions. This could be a result of introducing information on previous affective states and across dimensions. Among the four systems, the system developed by Nicolle et al. (2012) achieved the best arousal, expectancy and average prediction results. It also achieved the second-best result on valence and came very close to the best result on power. In addition, part of the feature extraction code used in the system is publicly available. It is proposed here that this system is both the best performing and the most reproducible baseline system. For these reasons, the system proposed by Nicolle et al. (2012) was selected as the baseline system. Although this thesis focused on using visual features for affect recognition, the audio features are also used so that the results of the implementation can be compared consistently with those of the original paper.

Table III: Pearson's correlation scores for different systems tested on the AVEC 2012 development database

System                        Aro      Exp      Pow      Val      Mean
Schuller et al. (2012)        0.181    0.148    0.084    0.215    0.157
Nicolle et al. (2012)         0.644*   0.341*   0.511    0.350    0.461*
Savran et al. (2012a)         0.383    0.266    0.556*   0.473*   0.384
Baltrusaitis et al. (2013)    0.333    0.218    0.309    0.343    0.301
Wei et al. (2014)             0.453    0.298    0.339    0.327    0.354
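For reference, the Pearson correlation used as the evaluation score can be sketched as follows. This is a minimal illustration, not the official AVEC 2012 evaluation script: the function name and the toy prediction and ground-truth traces are hypothetical, and any per-session averaging applied by the challenge is not reproduced.

```python
import math


def pearson(pred, gold):
    """Pearson correlation coefficient between predictions and ground truth."""
    n = len(pred)
    if n != len(gold) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0.0 or var_g == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_g)


# Toy traces for one affect dimension (illustrative values only).
gold_trace = [0.1, 0.4, 0.3, 0.8, 0.6]
pred_trace = [0.2, 0.5, 0.2, 0.7, 0.7]
score = pearson(pred_trace, gold_trace)
```

The score ranges from -1 to 1 and is insensitive to the scale and offset of the predictions, so it rewards a system for following the shape of the affect trace rather than for matching its absolute values.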

In document Investigating multi-modal features for continuous affect recognition using visual sensing (Page 151-155)