4.3.2 Defining Optimal Parameters
An initial problem when using classifiers is the selection of optimal model parameters. For a GMM classifier, these are the number of mixtures and the number of iteration steps. For HMMs, an additional parameter, namely the number of hidden states, has to be defined. Furthermore, the choice of feature sets also affects the classification performance. Once these settings are fixed, the classifier can be trained to determine the values of the model parameters accordingly.
Optimal Parameters for HMMs
The number of hidden states for emotion recognition was investigated by [Böck et al. 2010], for instance. In a comparative experiment with three different databases, the number of states was changed step-wise from one to four. Three states were identified as the optimal number. For very short utterances consisting of only a few phonemes, even one state, which leads to a GMM classifier system, was identified as sufficient [Böck et al. 2010].
The second parameter, the number of iterations, was also investigated for HMMs in [Böck et al. 2010]. This number specifies the iterations of the BW algorithm and was varied between 1 and 30. The authors concluded that three iterations provide the best recognition performance for a three-state HMM on simulated material, whereas on naturalistic material five iterations provide the best performance with the same classifier. Using more iterations decreases the performance. Thus, it can be concluded that the models lose their capability to generalise, which is comparable to the over-fitting problem for ANNs (cf. [Böck 2013]).
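Such a parameter sweep can be sketched as a plain grid search. The following is only an illustration and not the procedure of [Böck et al. 2010]: `evaluate` is a hypothetical callback that is assumed to train an HMM with the given number of states and iterations and to return a score on held-out data, and `toy_score` is a made-up stand-in that merely imitates a performance curve that drops when too many iterations are used.

```python
from itertools import product


def select_hmm_parameters(evaluate, state_counts=range(1, 5), iteration_counts=range(1, 31)):
    """Return the (states, iterations) pair with the highest validation score.

    `evaluate(states, iterations)` is a hypothetical callback: it is assumed to
    train an HMM with that configuration and return a score on held-out data.
    """
    candidates = product(state_counts, iteration_counts)
    return max(candidates, key=lambda pair: evaluate(*pair))


def toy_score(states, iterations):
    """Illustrative stand-in for a real train-and-validate run (assumption).

    The score peaks at three states and five iterations and drops on either
    side, imitating the loss of generalisation with too many iterations.
    """
    return -abs(states - 3) - 0.1 * abs(iterations - 5)


best_states, best_iterations = select_hmm_parameters(toy_score)
```

The score should be measured on material that was not used for training, as otherwise additional iterations would always appear beneficial and the over-fitting effect would remain hidden.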
16 The corrected mixture weight is calculated by subtracting the number of splits already performed in the current step from the corresponding mixture component. This method ensures that repeated splitting of the same mixture component is discouraged (cf. [Young et al. 2006]).
4.3. Classifiers 85
Also, the influence of different spectral feature sets was analysed in [Böck et al. 2010] and [Böck 2013]. The difference between the zeroth cepstral coefficient (C0), which represents the mean of the logarithmic Mel spectrum and is thus closely related to the signal energy (cf. [Marti et al. 2008]), and the short-term energy (E) itself was investigated. To this end, two spectral feature sets, MFCC and PLP, each with their temporal information (∆ and ∆∆), are compared once utilising C0 and once using E. These investigations are pursued on both simulated and naturalistic material. Böck et al. stated that for simulated material the performance of the feature sets is quite similar for either additional term. For naturalistic material, the performance utilising short-term energy degrades [Böck 2013]. This is attributed to the fact that in naturalistic material this energy term is influenced by several factors (distance between speaker and microphone, different loudness of speakers). Comparing PLP and MFCC features, the author concluded that MFCCs should be preferred. This is supported by observations of the INTERSPEECH 2009 Emotion Challenge [Schuller et al. 2011c]. The importance of temporal information for HMMs using ∆ and ∆∆ coefficients is confirmed in, for instance, [Glüge et al. 2011], by comparing the classification results for emotion recognition of SRNs, which have temporal information by design.
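The temporal information carried by ∆ and ∆∆ coefficients can be illustrated with the commonly used regression formula over neighbouring frames. This is a minimal sketch; the regression window of two frames and the replication of edge frames are assumptions and are not taken from the cited studies.

```python
def delta(frames, window=2):
    """Regression-based delta coefficients for a list of feature frames.

    frames: list of equal-length lists (one feature vector per time step).
    Edge frames are replicated so that every frame gets a delta vector.
    """
    if not frames:
        return []
    denom = 2 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            num = sum(
                k * (frames[min(t + k, last)][d] - frames[max(t - k, 0)][d])
                for k in range(1, window + 1)
            )
            row.append(num / denom)
        deltas.append(row)
    return deltas


# A linearly rising coefficient, e.g. C0 or E over five frames (toy values).
static = [[0.0], [1.0], [2.0], [3.0], [4.0]]
first_order = delta(static)        # delta coefficients
second_order = delta(first_order)  # delta-delta coefficients
```

For the linearly rising toy sequence, the first-order coefficient of the middle frame equals the slope of the sequence, and the second-order coefficient of that frame is zero, since the slope does not change there.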
Another study by Cullen & Harte compared five different feature sets for classifying various dimensional affects on a naturalistic affect corpus using HMMs. The utilised feature sets are (1) energy, spectral, and pitch-related features, (2) pure spectral features (MFCC), (3) glottal features, (4) Teager Energy Operator (TEO) features, and (5) long-term static and dynamic modulation spectrum (SDMS) features. The authors compared the performance of these feature sets for different emotional dimensions, namely activation, valence, power, expectation, and overall emotional intensity.
Cullen & Harte concluded that different feature sets achieve optimal performance for different emotional dimensions. Feature set (1) achieves the best performance on activation and also captures power and valence. These findings are also confirmed by [Schuller et al. 2009a]. Feature set (2) provides the best results for power and valence. Using glottal features, the classifier performance decreases for all dimensions. An HMM trained with TEO features achieves high performance for expectation and valence. The long-term SDMS features perform well on expectation, and it is assumed that this affect may vary quite slowly (cf. [Cullen & Harte 2012]).
Optimal Parameters for GMMs
In contrast to HMMs, only two parameters have to be investigated for GMMs. Applying GMMs for emotion recognition gives better classification results than HMMs, as shown in [Vlasenko et al. 2014]. The optimal number of mixtures and iterations depends largely on the type of material. In particular, [Vlasenko et al. 2007b] and [Vlasenko et al. 2014] investigated the number of mixtures needed for GMMs utilising simulated material (emoDB with low and high arousal emotional clustering, cf. Section 5.1.1) and naturalistic material (VAM, cf. Section 5.2.2). To this end, the authors varied the number of mixtures in the range of 2 to 120 and concluded that the optimal number of mixtures to gain stable and robust results is 117 for the simulated material (emoDB) and in the range of 77 to 90 for the naturalistic affect database (VAM) when applying their phonetic pattern independent classifiers. As features, they used the first 12 MFCCs and the zeroth cepstral coefficient (C0) with ∆ and ∆∆ coefficients. The authors used five iteration steps for their experiments. The authors of [Vlasenko et al. 2014] and [Vlasenko et al. 2007b] could furthermore show that the results gained with GMMs are more stable and robust than those of HMMs with two to five states. The UAR achieved with HMMs was roughly 10 % lower than the UAR achieved with GMMs.
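Since the comparison is reported in terms of UAR, a short sketch of this measure may be helpful. The unweighted average recall is the mean of the per-class recalls, so that each class contributes equally regardless of its number of samples. The labels and predictions below are made up for illustration only.

```python
from collections import defaultdict


def unweighted_average_recall(true_labels, predicted_labels):
    """Mean of the per-class recalls, so each class counts equally."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    if not total:
        raise ValueError("at least one labelled sample is required")
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example with an imbalanced two-class task (labels are made up).
truth = ["low"] * 8 + ["high"] * 2
guess = ["low"] * 8 + ["low", "high"]
uar = unweighted_average_recall(truth, guess)
```

In this toy case nine of ten samples are classified correctly, but the recall of the rare class is only one half, so the UAR is noticeably lower than the plain accuracy. This is why the measure is useful for the imbalanced class distributions typical of naturalistic material.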
My own experiments on the influence of different features, on over-fitting in relation to the number of iterations, and on the optimal number of mixtures can be found in Section 6.2.1.
4.3.3 Incorporating Speaker Characteristics
As emotional expressions are very individual, it would be best to utilise individualised classifiers or to adapt the classifiers to the emotional reactions of a specific user. However, these methods are not always feasible, since material for each emotional reaction of a user has to be available. Nevertheless, the problem of speaker variability has already been addressed for ASR systems (cf. [Burkhardt et al. 2010; Bahari & Hamme 2012]).
In ASR, inter-speaker variability causes a performance degradation when recognising many different users [Emori & Shinoda 2001]. This is due to different speaker characteristics, of which gender has the most significant influence. This gender effect is caused by the different sizes of the vocal tract of male and female users. The vocal tract of male users is approx. 18 cm long and generates a lower frequency spectrum, whereas the vocal tract of female users is only approx. 13 cm long, resulting in higher frequencies [Lee & Rose 1996]. These differences affect the spectral formant positions by as much as 25 % (cf. [Lee & Rose 1998]). Apart from these anatomical reasons, different speaking habits, for instance the speaking rate or the intonation, also have an effect on speech production (cf. [Ho 2001]). The authors of [Dellwo et al. 2012] also argue that speech is a highly complex, brain-operated series of muscle movements that allows, to a certain degree, an individual operation. This is called an “idiosyncratic motion” and also affects the speech signal [Dellwo et al. 2012].
Therefore, two different approaches to deal with these inter-user variabilities have been used successfully in speech recognition: either the speaker variabilities are normalised, or speaker-group dependent models are used. Vocal Tract Length Normalisation (VTLN) normalises the speaker variabilities by estimating a warping factor that corrects for the different vocal tract lengths of the speakers [Emori & Shinoda 2001]; the frequency axis is either compressed (female users) or expanded (male users). To this end, a piecewise linear transformation of the frequency axis is pursued (cf. [Zhan & Waibel 1997]):
\[
f' = \begin{cases} \beta f, & f < f_0 \\ b f + c, & f \geq f_0 \end{cases}
\]
where f' is the normalised frequency, β is the user-specific warping factor, f_0 is a fixed frequency to handle the bandwidth mismatching problem during the transformation, and b and c can be calculated with a known f_0 [Wong & Sridharan 2002]. The warped MFCCs are then created for all files with warping factors in a range of 0.88-1.22.
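A piecewise linear warp of this kind can be sketched as follows: frequencies below the fixed frequency are scaled by the warping factor, and a second linear segment with slope b and offset c is applied above it. The way b and c are derived here (continuity at the fixed frequency and mapping the upper band edge onto itself) is an assumption of this sketch, as are the bandwidth and break-point values; the cited works may define these differently.

```python
def warp_frequency(f, beta, f0, f_max):
    """Piecewise linear VTLN warp of a frequency f (in Hz).

    Below the fixed frequency f0 the axis is scaled by beta. Above f0 a second
    linear segment b*f + c is used. Assumption of this sketch: b and c are
    chosen so that the warp is continuous at f0 and maps f_max onto itself.
    """
    if not 0.0 < f0 < f_max:
        raise ValueError("f0 must lie strictly between 0 and f_max")
    if f < f0:
        return beta * f
    b = (f_max - beta * f0) / (f_max - f0)
    c = f_max * (1.0 - b)
    return b * f + c


# Hypothetical values: 8 kHz bandwidth, break point at 6 kHz.
F_MAX, F0 = 8000.0, 6000.0
warped = [warp_frequency(f, 1.1, F0, F_MAX) for f in (1000.0, 6000.0, 8000.0)]
```

With these choices the warped axis stays within the original bandwidth, which is one way to address the bandwidth mismatch. The break point has to be low enough that the scaled break point still lies below the upper band edge for the largest warping factor in use, otherwise the second segment would no longer be increasing.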
An important condition for VTLN is the estimation of the warping factors. A rough rule for these factors can be deduced from the application of VTLN. As it is used to normalise anatomical differences of the vocal tract between speakers, the factor should, for example, “reduce” the length of the vocal tract for male speakers and “stretch” the vocal tract length for female speakers. Children’s vocal tracts should be stretched even more [Giuliani & Gerosa 2003]. That investigation shows that VTLN also has an age dependency for children: between the ages of 7 and 13, the characteristics of speech change drastically. It can be assumed that these age-dependent changes also apply to adults, but over much longer spans of years.
For speaker-group dependent modelling, different speaker groups are defined and group-specific models are trained for each utilised speaker group and emotion. For this, the corresponding speaker group has to be identified in advance. Unnormalised features are used with the group-specific models [Vergin et al. 1996]. To perform recognition, the speaker group of the current speaker has to be known, either a priori or, for instance, from an upstream gender recognition. The acoustics are then recognised by applying the selected speaker-group dependent model.
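The two-stage procedure, first determining the speaker group and then applying the matching model, can be sketched as follows. All components are hypothetical placeholders: the pitch threshold used as a "gender recogniser" and the rule-based "emotion models" are toy stand-ins for trained classifiers and carry no empirical meaning.

```python
def classify_with_group_models(features, group_models, detect_group, known_group=None):
    """Pick the model of the speaker's group, then classify the features.

    group_models: dict mapping a group name to a classifier callable.
    detect_group: callable used when the group is not known a priori.
    Both are hypothetical placeholders for trained components.
    """
    group = known_group if known_group is not None else detect_group(features)
    if group not in group_models:
        raise KeyError(f"no model trained for speaker group {group!r}")
    return group, group_models[group](features)


# Toy stand-ins (assumptions): a pitch threshold as "gender recogniser" and
# two rule-based "emotion models" operating on (mean F0 in Hz, energy).
def toy_gender_detector(features):
    return "female" if features[0] > 165.0 else "male"


toy_models = {
    "male": lambda x: "high arousal" if x[1] > 0.6 else "low arousal",
    "female": lambda x: "high arousal" if x[1] > 0.5 else "low arousal",
}

group, label = classify_with_group_models((210.0, 0.55), toy_models, toy_gender_detector)
```

Passing `known_group` bypasses the detector, which corresponds to the case where the speaker group is available a priori. The sketch also makes the main risk of the approach visible: an error of the upstream recogniser selects the wrong model for the subsequent classification.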
Recognising age and gender automatically is well established in ASR systems. This can be done at the very beginning of a dialogue, using just a few words of the subject. Typical architectures to distinguish age and gender use SVMs, MLPs or HMMs [Bocklet et al. 2008; Burkhardt et al. 2010]. An advanced method is to utilise a UBM together with a GMM to take advantage of the adjustable threshold [Bahari & Hamme 2012; Gajšek et al. 2009]. The authors of [Burkhardt et al. 2010; Li et al. 2010] utilise a decision-level fusion to combine several age and gender detectors. Typically, spectral and prosodic features are used, such as PLPs and MFCCs, F0, jitter and shimmer [Meinedo & Trancoso 2011]. They can be enriched by their first-order regression coefficients to incorporate contextual information. Functionals can be applied to generate long-term statistical information (cf. [Bocklet et al. 2008; Li et al. 2010]). Automatic approaches clustering the users regarding age and gender reach accuracies of approx. 96 % (cf. [Lee & Kwak 2012; Mengistu 2009]).
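One simple form of decision-level fusion is a weighted average of the class scores delivered by the individual detectors. The sketch below shows only this generic scheme; it is not claimed to be the fusion rule of the cited works, and the detector outputs are made-up values.

```python
def fuse_decisions(detector_scores, weights=None):
    """Weighted average of per-class scores from several detectors.

    detector_scores: list of dicts mapping class label -> score in [0, 1].
    weights: optional per-detector weights; equal weighting by default.
    Returns the winning label and the fused scores.
    """
    if not detector_scores:
        raise ValueError("at least one detector is required")
    if weights is None:
        weights = [1.0] * len(detector_scores)
    if len(weights) != len(detector_scores):
        raise ValueError("one weight per detector is required")
    total_weight = sum(weights)
    labels = detector_scores[0].keys()
    fused = {
        label: sum(w * s[label] for w, s in zip(weights, detector_scores)) / total_weight
        for label in labels
    }
    return max(fused, key=fused.get), fused


# Made-up posteriors from three hypothetical gender detectors.
scores = [
    {"female": 0.70, "male": 0.30},
    {"female": 0.40, "male": 0.60},
    {"female": 0.65, "male": 0.35},
]
winner, fused = fuse_decisions(scores)
```

The weights allow more reliable detectors to contribute more strongly; with equal weights the scheme reduces to a plain average of the scores.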