4.3.1 Hidden Markov Models

Human speech production can generate different variants of the same acoustics. These variants can be either stretched or shrunk. This results in an acoustic observation that basically has the same characteristics but varies in its temporal occurrence. In the stretched case, the observation can be seen as consisting of repeated sub-parts, whereas in the shrunk case some sub-parts are very short or even absent. This kind of observation causes trouble in recognition, since the same meaning can be produced with different (temporal) variants. To overcome this problem, HMMs are utilised. An HMM consists of a twofold production process: 1) a temporal evolution, to decode the temporal stretching and shrinking, and 2) an output production, to decode the observed acoustic sub-part. This architecture enables the HMM to decouple the temporal resolution of the speech signal from the observed features. Thus, the HMM first produces the most probable sequence of states, in which a repetition of one or more states is possible. Afterwards, for each selected state the most likely output is produced. The basic unit of sound represented by an HMM is either a word, a phoneme, or a short utterance [Young 2008].

HMMs have long been used successfully in speech recognition as well as emotion recognition; thus, I only describe the most important parts of this modelling technique. For further details, I refer to the corresponding literature (cf. mathematical description of HMMs [Eppinger & Herter 1993; Wendemuth 2004; Young 2008], HMMs and emotions [Schuller & Batliner 2013; Vlasenko et al. 2014], parameter optimization [Böck et al. 2010], fusion architecture [Glodek et al. 2012]).

An HMM is a finite state machine hmm = {S, K, π, a_ij, b_jk}, where S = {s_1, . . . , s_n} denotes the set of states, K = {v_1, . . . , v_n} denotes the output alphabet, π_s denotes the initial probability of a state s, {a_ij} = P(q_t = s_j | q_{t−1} = s_i) are the transition probabilities, and {b_jk} = P(O_t = v_k | q_t = s_j) are the production probabilities. Figure 4.11 shows the graphical representation of a commonly used left-to-right HMM.

Figure 4.11: Workflow of a four-state HMM; a_ij is the transition probability from state s_i to state s_j, and b_j(O_v) is the probability of emitting the symbol O_v in state s_j.
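The twofold production process can be illustrated with a minimal sketch that samples from a small left-to-right HMM. The three states, the toy alphabet, and all probability values are assumptions chosen for illustration; they are not taken from the text or from Figure 4.11. The self-loops on the diagonal of A are what allow a state, and thus an acoustic sub-part, to be repeated (stretching) or passed through quickly (shrinking).

```python
import random

# Hypothetical 3-state left-to-right HMM over a toy alphabet K = {"a", "b", "c"}.
# All numbers below are illustrative assumptions.
states = [0, 1, 2]
alphabet = ["a", "b", "c"]
pi = [1.0, 0.0, 0.0]          # initial probabilities pi_s
A = [[0.6, 0.4, 0.0],         # a_ij = P(q_t = s_j | q_{t-1} = s_i)
     [0.0, 0.7, 0.3],         # zeros below the diagonal: no way back to earlier states
     [0.0, 0.0, 1.0]]
B = [[0.8, 0.1, 0.1],         # b_jk = P(O_t = v_k | q_t = s_j)
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]


def sample(T, rng):
    """Draw a hidden state path and an observation sequence of length T."""
    path, obs = [], []
    state = rng.choices(states, weights=pi)[0]
    for _ in range(T):
        path.append(state)
        # Output production: emit a symbol according to the current state.
        obs.append(rng.choices(alphabet, weights=B[state])[0])
        # Temporal evolution: stay in the state (self-loop) or move on.
        state = rng.choices(states, weights=A[state])[0]
    return path, obs


path, obs = sample(10, random.Random(0))
print(path)
print(obs)
```

Only `obs` would be visible to a recogniser; `path` is the unobservable state sequence.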

The HMM produces for every time step t = 1, . . . , T one observable output O_t ∈ K and passes through an unobservable sequence of states. In speech and emotion recognition, it is common to use mixtures of Gaussians as output observation probabilities:

$$ b_{jk} = \sum_{m=1}^{M} c^{k}_{jm} \, \mathcal{N}(o, \mu^{k}_{jm}, \sigma^{k}_{jm}) \tag{4.46} $$

where N denotes a normal distribution with the parameters mean (µ_jm) and covariance (σ_jm), and c_jm denotes the mixture weights. The parameter M denotes the number of Gaussians; the dimensionality of each Gaussian is determined by the length of the feature vector, i.e. the number of used features. Diagonal covariance matrices are used to reduce the effort of variance estimation (cf. [Wendemuth 2004; Young et al. 2006]). This restriction has the consequence that the estimated distributions are oriented according to the pre-defined coordinate axes. This constraint can be circumvented by using mixture distributions (cf. [Wendemuth 2004]).
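As a minimal sketch of Eq. 4.46, the following evaluates the output probability of a single state as a weighted sum of Gaussians with diagonal covariance. The function names and the example mixture are hypothetical. With a diagonal covariance, the density factorises over the feature dimensions, which is why only one variance per feature has to be estimated.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log density of a Gaussian with diagonal covariance at feature vector o."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )


def emission_prob(o, weights, means, variances):
    """Weighted sum of M Gaussians for one state, as in Eq. 4.46."""
    return sum(
        c * math.exp(log_gaussian_diag(o, mu, var))
        for c, mu, var in zip(weights, means, variances)
    )


# Hypothetical mixture with M = 2 components over 2-dimensional feature vectors.
weights = [0.4, 0.6]                      # mixture weights, summing to 1
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]      # diagonal entries only
print(emission_prob([0.5, -0.2], weights, means, variances))
```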

Production modelling tries to find the sequence of words or emotional patterns W = {w_1, . . . , w_k} that has most likely generated the observed output sequence O:

$$ \hat{W} = \arg\max_{W} \left[ P(W \mid O) \right] \tag{4.47} $$

As P(W|O) is difficult to model directly, Bayes' rule is used to transform Eq. 4.47 into the equivalent problem:

$$ \hat{W} = \arg\max_{W} \left[ P(O \mid W) \, P(W) \right] \tag{4.48} $$

The likelihood P(O|W) is determined by acoustic modelling, namely the HMM. The prior probability P(W) is defined by language modelling. These terms show the strong connection to speech recognition. The language model indicates how likely it is that a particular word was spoken or a certain affect occurred given the current context. To this end, empirically obtained scaling factors are used, for instance n-gram modelling [Brown et al. 1992]. For emotion recognition, the use of a language model is briefly discussed in [Schuller et al. 2011c]. It is stated that, due to data sparseness, mostly uni-grams have been applied and that they serve as linguistic features (salient words) to define the amount of information a specific word contains about an emotion category [Steidl 2009]. There are no findings yet on proper “language models” for emotions.
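The step from Eq. 4.47 to Eq. 4.48 can be spelled out as follows. The evidence P(O) does not depend on W, so dropping it does not change the maximising argument:

```latex
\begin{align*}
\hat{W} &= \arg\max_{W} P(W \mid O) \\
        &= \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)} \\
        &= \arg\max_{W} P(O \mid W)\, P(W)
\end{align*}
```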

In acoustic modelling, two issues have to be solved: 1) calculate the probability for each model λ generating the observation sequence O, and 2) find the best state sequence matching the given observation. To solve the first issue, the probability of the produced observations given a state sequence is multiplied by the likelihood that this state sequence is generated by the model, and these products are summed over all possible state sequences (cf. Eq. 4.49). Therefore, all possible state sequences 1 . . . N and all possible output sequences 1 . . . T have to be considered. These calculations can be further simplified (cf. Eq. 4.51) by making use of the consideration that the likelihood of the current state depends only on the previous state¹⁴. The corresponding algorithm is called the forward-backward algorithm [Rabiner & Juang 1993]. To solve the second issue, finding the most likely path of states q_max through the model, the sequence of states with the highest likelihood has to be calculated:

$$ q_{\max} = \arg\max_{q} P(q \mid O, \lambda) \tag{4.51} $$

$$ P(O, q^{*} \mid \lambda) = \max_{q \in Q^{T}} P(O, q \mid \lambda) \tag{4.52} $$

Eq. 4.52 is evaluated efficiently with the Viterbi algorithm [Viterbi 1967] by taking advantage of the Markov property. The Viterbi algorithm iteratively calculates the maximum attainable probabilities for a sub-part of the observation under the additional condition of ending in a certain, gradually increasing state s_j, and at the same time stores the requested sequence in a backtracking matrix (cf. [Wendemuth 2004]).
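The recursion with its backtracking matrix can be sketched as follows for a discrete-output HMM. The function name, the toy model, and the observation sequence are assumptions for illustration. The sketch multiplies raw probabilities for readability; for long sequences one would typically work with log probabilities to avoid numerical underflow.

```python
def viterbi(obs, pi, A, B):
    """Return (best_path, P(O, q* | model)) for a discrete-output HMM.

    obs is a list of symbol indices; pi[i], A[i][j] and B[j][k] are the
    initial, transition and production probabilities.
    """
    n = len(pi)
    # delta[j]: maximum probability of any partial path ending in state j.
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    back = []  # backtracking matrix: best predecessor of each state per time step
    for o in obs[1:]:
        prev, delta, pointers = delta, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            pointers.append(best_i)
            delta.append(prev[best_i] * A[best_i][j] * B[j][o])
        back.append(pointers)
    # Follow the stored predecessors from the best final state backwards.
    last = max(range(n), key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical 3-state left-to-right model with three output symbols.
pi = [1.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
B = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
best_path, best_prob = viterbi([0, 0, 1, 1, 2], pi, A, B)
print(best_path, best_prob)
```

Thanks to the Markov property, each step only needs the scores of the previous time step, so the cost grows linearly with the sequence length instead of exponentially.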

But before the most likely path can be calculated, the HMM’s parameters {a_ij} and {b_jk} have to be estimated. To calculate these parameters, a training corpus with acoustic examples and pre-defined labels has to be utilised. For an efficient estimation, the Baum-Welch (BW) algorithm is used (cf. [Wendemuth 2004]). This algorithm uses the forward-backward algorithm and is an instance of the Expectation-Maximization (EM) algorithm [Dempster et al. 1977]. The iterative EM algorithm consists of an E-step to compute state occupation probabilities and an M-step to obtain updated parameter estimates utilising maximum-likelihood calculations (cf. [Young 2008]).
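The forward-backward computation that the E-step relies on can be sketched as follows for a discrete-output HMM. This is an illustrative sketch only: the function names and the toy model are assumptions, no scaling against underflow is applied, and the M-step (the actual parameter update) is omitted.

```python
def forward(obs, pi, A, B):
    """Return P(O | model) and the forward variables alpha[t][j]."""
    n = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ])
    return sum(alpha[-1]), alpha


def backward(obs, A, B):
    """Return the backward variables beta[t][i]."""
    n = len(A)
    beta = [[1.0] * n]                      # beta at the last time step
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [
            sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
            for i in range(n)
        ])
    return beta


def occupation(obs, pi, A, B):
    """State occupation probabilities gamma[t][j] = P(q_t = s_j | O, model)."""
    likelihood, alpha = forward(obs, pi, A, B)
    beta = backward(obs, A, B)
    return [[a * b / likelihood for a, b in zip(at, bt)]
            for at, bt in zip(alpha, beta)]


# Hypothetical 3-state left-to-right model with three output symbols.
pi = [1.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
B = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
obs = [0, 0, 1, 1, 2]
likelihood, _ = forward(obs, pi, A, B)
gamma = occupation(obs, pi, A, B)
print(likelihood)
print(gamma[0])
```

The forward pass alone already yields the model likelihood P(O|λ) by summing over all state sequences without enumerating them; the backward pass is only needed for the occupation probabilities.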

As a special case, GMMs are distinguished from HMMs by having only one emitting state¹⁵. GMMs are used to capture the observed features within one state without inferring transitions. It is assumed that these models will better capture the emotional content of a whole utterance without modelling the spoken content, which varies within the utterance. The same methods as for HMMs are used for training and testing. The only difference is that, due to the self-loop, all observations in a GMM are mapped to the same state.

When considering an HMM, the number of Gaussian mixture components is normally between 10 and 20 [Young 2008]; for GMMs, commonly many more mixture components are used: 70-140 for emotion recognition [Vlasenko et al. 2014] and up to 2 048 for speaker verification [Reynolds et al. 2000]. To increase the number of mixture components, a technique called mixture splitting is mostly applied (cf. [Young et al. 2006]). Here, the mixture component with the highest corrected mixture weight¹⁶ (“heaviest” mixture) is copied, the weights are divided by 2, and the means are perturbed by ±0.2 of the corresponding standard deviations (cf. [Young et al. 2006]). Afterwards, all parameters are re-estimated by applying the BW algorithm.

¹⁴ This is called the first-order Markov property. The current state only depends on the previous state and not on a sequence of states that preceded it.

¹⁵ In the literature (cf. [Vlasenko et al. 2007a]) these classifiers are also denoted as HMM/GMM, as for training and testing the GMM is seen as a one-state HMM with a self-loop. Thus, different lengths of utterances result in different numbers of self-loops.
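A single mixture splitting step, as described above, can be sketched as follows for a diagonal-covariance mixture. The function name and the example values are hypothetical, and the sketch selects the component by its plain weight, i.e. it does not apply the weight correction mentioned in the text. The subsequent BW re-estimation is not shown.

```python
import math


def split_heaviest(weights, means, variances, perturb=0.2):
    """Split the heaviest mixture component into two perturbed copies.

    Assumption: the "heaviest" component is chosen by its uncorrected weight.
    """
    k = max(range(len(weights)), key=lambda m: weights[m])
    std = [math.sqrt(v) for v in variances[k]]
    half = weights[k] / 2.0                                   # weights divided by 2
    up = [mu + perturb * s for mu, s in zip(means[k], std)]   # mean + 0.2 std
    down = [mu - perturb * s for mu, s in zip(means[k], std)] # mean - 0.2 std
    new_weights = weights[:k] + [half, half] + weights[k + 1:]
    new_means = means[:k] + [up, down] + means[k + 1:]
    new_variances = (variances[:k]
                     + [list(variances[k]), list(variances[k])]
                     + variances[k + 1:])
    return new_weights, new_means, new_variances


# Hypothetical mixture with two one-dimensional components.
w, mu, var = split_heaviest([0.3, 0.7], [[0.0], [10.0]], [[1.0], [4.0]])
print(w)
print(mu)
print(var)
```

Repeating this step and re-estimating after each split increases the number of components by one at a time, while the weights continue to sum to one.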

In document Emotional and user-specific cues for improved analysis of naturalistic interactions (Page 102-106)