Automatic Emotion Recognition from Speech

(1)

S. D´Mello et al. (Eds.): ACII 2011, Part II, LNCS 6975, pp. 191–199, 2011. © Springer-Verlag Berlin Heidelberg 2011

A PhD Research Proposal

Yazid Attabi and Pierre Dumouchel

École de technologie supérieure, Montréal, Canada Centre de recherche informatique de Montréal, Montréal, Canada

{Yazid.attabi,pierre.dumouchel}@crim.ca

Abstract. This paper contains a PhD research proposal related to the domain of automatic emotion recognition from speech signal. We started by identifying our research problem, namely the acute confusion problem between emotion classes and we have cited different sources of this ambiguity. In the methodol-ogy section, we presented a method based on simililarity concept between a class and an instance patterns. We dubbed this method as Weighted Ordered classes – Nearest Neighbors. The first result obtained exceeds in performance the best result of the state-of-the art. Finally, as future work, we have made a proposition to improve the performance of the proposed system.

Keywords: Emotion recognition, simililarity concept, speech signal.

1 Problem

The automatic emotion recognition (AER) from speech is subject of increasing inter-est in the recent years given the broad field of applications that can benefit from this technology. For example, a speaker emotional state recognition system can be used to develop natural and effective human-machine interaction system and more sensitive interface to the user behaviour. Used in a distance learning context, a tutoring system could detect bored users and allows a change of style and level of the material pro-vided, or provides a compensation and emotional encouragement [1].

AER may also be used to: (1) support the driving experience and encourage better driving, given that driver emotion and driving performance are often intrinsically linked [2]; (2) detect the presence of extreme emotions, especially fear, in the context of surveillance in public places [3]; (3) automatically prioritize messages accumulated in the mailbox with different criterions such as emotional urgency, valence (happy vs. sad) and arousal (calm vs. excited). This can be used to alert the voicemail account owner and enables him to listen first to the important messages [4]; (4) use the special features carried by emotions in the development of more robust and accurate systems for automatic speaker verification [5]; (5) assess the urgency of a call and therefore helps to take a decision in the context of a medical call center offering medical advice to patients [6]; (6) improve customer service when the AER system is integrated into interactive voice response system in commercial call centers [7].

(2)

Despite all the research efforts, the performances of AER systems designed remain relatively low compared to related fields such as speaker verification. The low recog-nition rate is reflected through the high level of confusion between emotion classes.

1.1 Confusion Sources

The confusion between classes can have several sources. The first raison can be re-lated to the uncertainty that characterize the definition of emotion in the psychology domain, namely which theory to use categorical or dimensional, and in the case of categorical theory how many categories exists and what are these classes? This uncer-tainty can lead to an overlap between defined classes at their acoustic information spaces because of the lack of clear boundaries between different categories of emo-tion that marks the transiemo-tion from one emoemo-tion to another. On the other hand, the overlap in the acoustic space is not limited to the neighboring classes but also be-tween some emotions in a symmetrical position with respect to the active / passive axis of the dimensional model shown in Figure 2. The joy and anger classes are an example of ambiguity case confirmed by several studies [8, 9].

Fig. 1. Emotions mapped in Activation-Evaluation space with FEELTRACE (Cowie et al.)

Another important factor which increases the ambiguity is the significant variabil-ity between different individuals in the expression of same emotion class. This wide range of variability can have several origins such as the culture, the age, the gender of the speaker and its spoken language.

Finally, the noise contained in the emotion corpus increases the confusion. This noise can be induced by annotators during the labelling operation. This is particularly true for blended emotions such as fear and anger felt, for instance by a parent to his son when the latter crosses the road running at a red light.

(3)

2 Objectives

In this research project, our goal is to improve the performance of AER systems by mitigating the problem of ambiguity between emotion classes. We are particularly interested to recognize emotions of the categorical model or of the dimensional model converted to binary labels (e.g. active vs. passive or positive vs. negative). These AER systems are primarily intended to be deployed in a call center context, so we will focus on the use of speech signal as input data for the system.

3 Emotions and Related Works

3.1 Theoretical Models of Emotion

According to Scherer in [10], there are three theories which influence the research:

Discrete theory, which states that a small number of basic emotions exist. These emotions are recognized through very specific response patterns in physiology, and in facial and vocal expression.

Dimensional Theory, where the emotional states are mapped in two or three-dimensional space such as valence, activity and control dimensions.

Componential models of emotion, often based on appraisal theory, has the advan-tage to be more open with respect to the kind of emotional states. These models em-phasize the variability of different emotional states as produced by different types of appraisal patterns [10].

3.2 Types of Emotional Speech

As reported in [10], there are three methods to constitute an emotional corpus:

Natural emotions are recordings of spontaneous emotional states naturally oc-curred. It is characterized by a high ecological validity but suffers from the limited number of available speakers and presents difficulties for the annotation.

Simulated emotions are emotional states portrayed by professional or lay actors according to emotion labels or typical scenarios. Although this method permits easily to constitute an emotion corpus however it has been criticized that this kind of emo-tion are more exaggerated than natural or induced emoemo-tion.

Induced emotions are specific emotional states experimentally induced in groups of speakers.

3.3 Related Work

In the literature, several systems have been experimented to address the issue of emo-tion recogniemo-tion from speech signal. In this secemo-tion we will present these systems with respect to four criteria: type of feature information, 2) the temporal scope of features, 3) speech unit used in the analysis of an utterance, 4) and the type of approaches cho-sen to design the classifier.

(4)

Unit of Analysis

The analysis unit is the smallest segment of speech used as input to the classifier to get a partial or complete decision. The tested units are:

Utterance: The utterance is most used unit in studies [4, 11, 12].

Phoneme: A study in [13] where a modeling is based on phoneme class shows that vowels are more emotionally salient than other classes.

Syllable: In [14], the authors show that performances based on whole turn outper-form those based on syllabic unit.

Word: According to the results of [15], the word unit is preferable to the utterance if an effective word segmentation system is available.

Chunk: The turn is automatically segmented into fragments, depending on the acoustic properties of the turn [14]. The results show that the system performance based on the chunk unit is better than syllable unit but lower than turn unit.

Pseudo-syllable: Unit which length is guided by the valleys of the energy contour [16].

Temporal Feature Scope

The temporal scope of features can be either short-term or long-term information.

Short-term information (STI): Short-term information represents the local in-formation which extends over an interval of time called frame. We distinguish two types of STI: the spectral information conveyed by the spectral coefficients such as Mel-Frequency Cepstral Coefficients (MFCC), or the prosodic information [11, 17].

Long-term information (LTI): It characterizes the utterance as a whole. The most commonly LTI features used are the prosody, as well as voice quality (e.g. shimmer and jitter) [4, 12, 14], or can be combined with spectral coefficients [15].

Note that some authors have introduced linguistic information, e.g. information on lexical and discourse levels, in combination with paralinguistic information [6, 18].

Models of Classifiers

Several models of classifier were experimented and most of the models fall into one of these three approaches:

Static approach: In the static approach, models such as GMM, support vector ma-chines (SVM) are used [1, 14].

Dynamic approach: In this approach, the hidden Markov model (HMM) is largely used as classifier. The advantage of HMM models is that they allow to model the temporal structure of utterances [4, 13, 17].

Fuzzy logic approach: This approach is based on the fuzzy logic inference sys-tem. This choice is motivated by the uncertainties that characterize emotions, particu-larly the problem of the lack of clear boundaries between different emotions [7, 19].

4 Methodology

In this section, we present the proposed methodology to address the problem of con-fusion between emotion classes.

(5)

4.1 Data Description

The designed AER systems will be experimented using three different emotion cor-pora, two spontaneous and one simulated:

Bell Canada (Bell) dataset

This corpus is composed of spontaneous emotions recorded from customer calls of Bell Canada automated call center. Annotated into two classes (positive vs. negative), it contains approximately 5000 dialogues for each language (French and English).

FAU AIBO (AIBO) dataset

This corpus consists of spontaneous recordings of German children interacting with a pet robot. The corpus is composed of 18216 chunks and labelled into five cate-gories: Anger, Emphatic, Neutral, Positive and Rest.

LDC Emotional Prosody dataset

This corpus contains English audio recordings of emotions simulated in 15 catego-ries selected according to the study of Banse & Scherer [8].

4.2 Modeling

Weighted Ordered classes – Nearest Neighbors (WOC-NN)

The methodology presented in this section, dubbed "Weighted Ordered classes – Nearest Neighbors" (WOC-NN), is basically inspired from the dimensional theory model of figure 2 where we observe the absence of any clear boundary between emo-tions except that each categorical emotion is characterized by neighbourhood of other emotion classes which are more or less closer to it. We will use this concept of close-ness or similarity / dissimilarity between classes of the dimensional model in order to attenuate the confusion problem of categorical emotions.

The concept of weighted ordered class pattern

The decision rule in WOC-NN is based on pattern matching between test data and target class pattern. Each emotion class Ei is characterized by a neighborhood pattern

which represents the set of all emotion classes ordered with respect to their closeness to the class Ei. The order of class reflects the degree of similarity with the class Ei

relatively to other classes. In the test phase, the neighborhood pattern of the test data is generated and compared to each neighborhood pattern of each class. The distance metric used in the comparison is the sum of differences between the rank classes of the two neighborhood patterns, known as Hamming distance. The target class of the test data corresponds to the class whose neighborhood pattern gets the minimum dis-tance, i.e., the class which has the higher number of similar neighbor classes with the test data. Since not all class ranks in the neighborhood pattern have the same power of discrimination, each rank class in the pattern is weighted by a coefficient value pro-portional to the degree of its discrimination.

Neighborhood Pattern Feature Space

In the front-end system, The Mel-Frequency Cepstral Coefficients (MFCC) are ex-tracted and used as input features used to learn the model of each emotion class. The Gaussian Mixture Model (GMM), a generative model, is used as function of probabil-ity densprobabil-ity estimation to compute likelihood scores of the data. These likelihood

(6)

scores are then used to generate the ranked neighborhood classes of each emotion class with respect to their proximity. The degree of proximity is based on the likelih-ood values of each model class. More the likelihlikelih-ood scores of two models are close to each other more the classes associated with the two models are considered as tightly adjacent classes and vice-versa.

Let E1,E2,...,EC be C emotion classes and λi is the GMM model of class Ei. Let

X represent a set of cepstral vectors of speech data and Xˆ represents the likelihood vector for Xover the C models.

Each emotion class is represented by a vector of emotion classes ordered from the nearest closed one to most far one. These ranks are obtained by sorting the likelihood scores li of the vectorXˆ , where li =P

(

Xλi

)

, in descending order:

C

r r

r l l

l1≥ 2 ≥ ≥ (1)

The rank vector r of equation (2) constitutes the neighborhood pattern.

[

]

T C r r r₁ ₂ = r ₍₂₎ CLASSIFICATION SYSTEM

Let rand r′denote the neighborhood patterns of a test utterance X and a class emo-tion Ei respectively. The distance between rand r′is computed using Hamming

dis-tance. We introduce a C-dimensional vector v, with indicator element vj = 1 if

j j r

r ≠ ′and vj = 0 otherwise. We will refer to v as a distance pattern.

The direct use of Hamming distance as metric for the classification supposes that each class rank of a neighborhood pattern has the same power of discrimination as other ranks. This is misleading particularly for the first rank. For this reason we intro-duce a vector of coefficients,

[

]

T

iC i i

i= w1w2 w

w , to weight each rank of the neighborhood pattern, where wij represents the weight coefficient of rank j of the ith

class neighborhood pattern. The new metric, is expressed as :

( )

rr wT v

i wH , ′ =

D ₍₃₎

The weights ranks are obtained by measuring the importance of each rank for each neighborhood pattern separately. Thus each neighborhood pattern of an emotion class

Ei has his distinct weight vector wi. In order to estimate these weights, the C-class

problem is transformed to a two-class problem: discrimination between correct and wrong distance patterns using logistic regression function.

For empirical raisons, e.g. when the data are sparse, some classes in the neighbor-hood pattern may not be enough representative of dissimilarity between neighborneighbor-hood patterns and should be discarded. For WOC-NN system, we selected ranks of classes associated with positive weight coefficients obtained with the logistic regression. This is feasible because of the type of the output class (which represents correct or incor-rect pattern) of the logistic regression problem.

(7)

EXPERIMENTAL RESULTS

WOC-NN is experimented using the FAU AIBO [20] emotional speech corpus and its performance is compared with GMM system and the state-of-the art.

Table 1. Experiments results on FAU AIBO data

Systems Unweighted _average

GMM-Bayes 41.05

WOC-NN 42.47

Lee et al. [21] 41.3

Kockmann et al. (fusion of 2 systems) [22] 41.7

The results of Table 1 show that WOC-NN has better performance than GMM-Bayes system with a relative gain of 3.46%. Also, when compared to the best single system performance [21] and to the best combined systems performance [22] of In-terspeech 2009 Emotion Challenge, WOC-NN method exceeds relatively their per-formances by 2.83% and 1.85% respectively. For more details about this method see [23].

Future Work

In the proposed framework, we computed the neighbourhood patterns in the likeli-hood score space. As future work, we will use the feature space level instead. An interesting feature vector space candidate is the i-vector space. The i-vector represen-tation, used first in speaker verification [24], is a low dimensional vector of a given high dimensional supervector (a vector obtained by a concatenation of mean vector of each component of a GMM model). The space dimensionality reduction can be per-formed using a Factor Analysis model (FA) or a Probabilistic Principal Component Analysis (PPCA). This feature space will provide us more flexibility to deal with the variability in the expression of emotions between individuals and therefore reducing the overlap between classes.

5 The Main Contribution

In this proposal, we started by identifying our research problem, namely the acute confusion problem between emotion classes that characterizes the AER Systems. We also identified the different sources of this ambiguity. As a main contribution, we presented our methodology to improve the AER systems performances, namely

WOC-NN, to make them closer for future deployment.

References

1. Li, W., Zhang, Y., Fu, Y.: Speech emotion recognition in E-learning system based on af-fective computing. In: Proceedings - Third International Conference on Natural Computa-tion, ICNC, Hainan, China, pp. 809–813 (2007)

(8)

2. Jones, C.M., Jonsson, I.-M.: Performance analysis of acoustic emotion recognition for in-car conversational interfaces. In: Stephanidis, C. (ed.) UAHCI 2007 (Part II). LNCS, vol. 4555, pp. 411–420. Springer, Heidelberg (2007)

3. Clavel, C., et al.: De la construction du corpus émotionnel au système de détection le point de vue applicatif de la surveillance dans les lieux publics. Revue d’Intelligence Artifi-cielle 20(4-5), 529–551 (2006)

4. Inanoglu, Z., Caneel, R.: Emotive alert: HMM-based emotion detection in voicemail mes-sages. In: International Conference on Intelligent User Interfaces, Proceedings IUI, San Diego, CA, United States, pp. 251–253 (2005)

5. Panat, A.R., Ingole, V.T.: Affective state analysis of speech for speaker verification: Ex-perimental study, design and development. In: Proceedings - International Conference on Computational Intelligence and Multimedia Applications, India, pp. 255–261 (2007) 6. Devillers, L., Vidrascu, L.: Real-life emotion recognition in speech. In: Müller, C. (ed.)

Speaker Classifcation II. LNCS (LNAI), vol. 4441, pp. 34–42. Springer, Heidelberg (2007)

7. Lee, C.M., Narayanan, S.: Emotion Recognition Using a Data-Driven Fuzzy Inference System. In: Eurospeech, Geneva (2003)

8. Banse, R., Scherer, K.R.: Acoustic Profiles in Vocal Emotion Expression. Journal of Per-sonality and Social Psychology, 614–636 (1996)

9. Ververidis, D., Kotropolos, C.: Automatic speech classification to five emotional states based on gender information. In: Proc. of Eusipco, pp. 341–344 (2004)

10. Scherer, K.R.: Vocal communication of emotion: A review of research paradigms. Speech Communication 40(1-2), 227–256 (2003)

11. Sethu, V., Ambikairajah, E., Epps, J.: Speaker normalisation for speech-based emotion de-tection. In: 15th International Conference on Digital Signal Processing, Cardiff, UK, pp. 611–614 (2007)

12. Grimm, M., Kroschel, K.: Rule-based emotion classification using acoustic features. In Proc. 3rd Internat. Conf. on Telemedicine and Multimedia Communication. Kajetany, Poland (2005)

13. Lee, C.M., et al.: Emotion Recognition based on Phoneme Classes. In: ICSLP, Korea (2004)

14. Schuller, B., et al.: Comparing one and two-stage acoustic modeling in the recognition of emotion in speech. In: IEEE Workshop on Automatic Speech Recognition and Under-standing, Japan (2007)

15. Schuller, B., et al.: Towards more reality in the recognition of emotional. In: IEEE Inter-national Conference on Acoustics, Speech, and Signal Processing, Honolulu, HI, USA, pp. 941–944 (2007)

16. Attabi, Y.: Reconnaissance automatique des émotions à partir du signal acoustique, Master Thesis, École de technologie supérieure (ÉTS), Montréal (2009)

17. Huang, R., Ma, C.: Toward a speaker-independent real-time affect detection system. In: 18th International Conference on Pattern Recognition, Hong Kong, China (2006)

18. Chen, C., You, M., Song, M., Bu, J., Liu, J.: An enhanced speech emotion recognition sys-tem based on discourse information. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3991, pp. 449–456. Springer, Heidel-berg (2006)

19. Lee, C.M., Narayanan, S.: Emotion Recognition Using a Data-Driven Fuzzy Inference System. In: Eurospeech, Geneva (2003)

20. Schuller, B., Steidl, S., Batliner, A.: The Interspeech 2009 Emotion Challenge. In: Inters-peech. ISCA, Brighton (2009)

(9)

21. Lee, C., Mower, E., Busso, C., Lee, S., Narayanan, S.: Emotion Recognition Using a Hie-rarchical Binary Decision Tree Approach. In: Interspeech, ISCA, Brighton (2009) 22. Kockmann, M., Burget, L., ernocký, J.: Brno University of Technology System for

In-terspeech 2009 Emotion Challenge. In: InIn-terspeech. ISCA, Brighton (2009)

23. Attabi, Y., Dumouchel, P.: Weighted Ordered Classes - Nearest Neighbors: A New Framework for Automatic Emotion Recognition From Speech. In: Interspeech. ISCA, Flo-rence (2011)

24. Dehak, N., Kenny, et al.: Front-End Factor Analysis for Speaker Verification Submitted to IEEE Transactions on Audio, Speech and Language Processing (2009)