Vocal Emotion Recognition


State-of-the-Art in Classification of Real-Life Emotions

October 26, 2010

Stefan Steidl

International Computer Science Institute (ICSI) at Berkeley, CA


Overview

1 Different Perspectives on Emotion Recognition

2 FAU Aibo Emotion Corpus

3 Own Results on Emotion Classification

4 INTERSPEECH 2009 Emotion Challenge

Overview

Psychology of Emotion

Computer Science

S. Steidl: Vocal Emotion Recognition


Universal Basic Emotions

Paul Ekman

postulates the existence of 6 basic emotions: anger, fear, disgust, surprise, joy, sadness

other emotions are mixed or blended emotions

universal facial expressions


Terminology

Different affective states [1]:

type of affective state   intensity   duration   synchro-   event   appraisal     rapidity    behavioral
                                                 nization   focus   elicitation   of change   impact
emotion                   ++-+++      +          +++        +++     +++           +++         +++
mood                      +-++        ++         +          +       +             ++          +
interpersonal stances     +-++        +-++       +          ++      +             +++         ++
attitudes                 ◦-++        ++-+++     ◦          ◦       +             ◦-+         +
personality traits        ◦-+         +++        ◦          ◦       ◦             ◦           +

◦: low, +: medium, ++: high, +++: very high, -: indicates a range

[1] K. R. Scherer: Vocal communication of emotion: A review of research paradigms, Speech Communication, Vol. 40, pp. 227-256, 2003


Terminology

(cont.)

Definition of Emotion

Emotion (Scherer)

episodes of coordinated changes in several components, including at least

neurophysiological activation, motor expression, and subjective feeling,

but possibly also action tendencies and cognitive processes,

in response to external or internal events of major significance to the organism


Vocal Expression of Emotion

Results from studies in Psychology of Emotion

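Emotions compared: anger/rage, fear/panic, sadness, joy/elation, boredom, stress (↑: increase, ↓: decrease)

Intensity: anger/rage ↑, fear/panic ↑, sadness ↓, joy/elation ↑, stress ↑

F0 floor/mean: anger/rage ↑, fear/panic ↑, sadness ↓, joy/elation ↑, stress ↑

F0 variability: anger/rage ↑, sadness ↓, joy/elation ↑, boredom ↓

F0 range: anger/rage ↑, fear/panic ↑ (↓)¹, sadness ↓, joy/elation ↑, boredom ↓

Sentence contour: anger/rage ↓, sadness ↓

High frequency energy: anger/rage ↑, fear/panic ↑, sadness ↓, joy/elation (↑)²

Speech and articulation rate: anger/rage ↑, fear/panic ↑, sadness ↓, joy/elation (↑)², boredom ↓

¹ Banse and Scherer found a decrease in F0 range

² inconclusive evidence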

Goal

Classification of the subject’s actual emotional state (some sort of lie detector for emotions)


Human-Computer Interaction (HCI)

Emotion-Related User States

naturally occurring states of users in human-machine communication

emotions in a broader sense

coordinated changes in several components NOT required

classification of the perceived emotional state, not necessarily the actual emotion of the speaker


Pattern Recognition

Pattern Recognition Point of View

classification task: choose 1 of n given classes

discrimination of classes rather than classification

definition of “good” features

machine classification

Actually not needed

definition of term emotion


Emotional Speech Corpora

Acted data

based on Basic Emotions theory

suited for studying prototypical emotions

corpora easy to create (inexpensive, no labeling process)

high audio quality

balanced classes

neutral linguistic content (focus on acoustics only)

high recognition results


Emotional Speech Corpora

(cont.)

Popular corpora

Emotional Prosody Speech and Transcript corpus (LDC): 15 classes

Berlin Emotional Speech Database (EmoDB): 7 classes

89.9 % accuracy (speaker independent LOSO evaluation, speaker adaptation, feature selection) [2]

Danish Emotional Speech Corpus: 5 classes

74.5 % accuracy (10-fold SCV, feature selection) [3]

[2] B. Vlasenko et al.: Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech, INTERSPEECH 2007

[3] Schuller et al.: Emotion Recognition in the Noise Applying Large Acoustic Feature Sets, Speech Prosody 2006


Emotional Speech Corpora

(cont.)

Naturally occurring emotions

states that actually appear in HCI (real applications)

difficult to create (appropriate scenario needed, ethical concerns, need to label data)

low emotional intensity

in general ≥ 80% neutral

low audio quality (reverberation, noise, far-distance microphones)

needed for machine classification (because conditions between training and test must not differ too much)

research on both acoustic and linguistic features possible

new research questions: optimal emotion unit

almost no corpora large enough for machine classification available (do not exist or are not available for research)


Overview

Scenario

Labeling of User States

Data-driven Dimensions of Emotion

Units of Analysis

Sparse Data Problem


The FAU Aibo Emotion Corpus

51 children (30 f, 21 m) at the age of 10 to 13

8.9 hours of spontaneous speech (mainly short commands)

48,401 words in 13,642 audio files


FAU Aibo Emotion Corpus

(cont.)

database for CEICES and the INTERSPEECH 2009 Emotion Challenge

available for scientific, non-commercial use

http://www5.cs.fau.de/FAUAiboEmotionCorpus

[4] S. Steidl: Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech, Logos Verlag, Berlin

available online:


Emotion-Related User States

11 categories: prior inspection of the data before labeling

joyful, surprised, motherese, neutral, bored, emphatic, helpless, touchy/irritated, reprimanding, angry, other

motherese

the way mothers/parents address their babies, either because Aibo is behaving well or because the child wants Aibo to obey; the positive equivalent of reprimanding

emphatic

pronounced, accentuated, sometimes hyper-articulated way of speaking, but without showing any emotion

reprimanding

the child is reproachful, reprimanding, ‘wags the finger’


Labeling of User States

Labeling:

5 students of linguistics

holistic labeling on the word level

majority vote

emotion category      words   share of words
angry (A)               134            0.3 %
touchy (T)              419            0.9 %
reprimanding (R)        463            1.0 %
emphatic (E)          2,807            5.8 %
neutral (N)          39,975           82.6 %
motherese (M)         1,311            2.7 %
joyful (J)              109            0.2 %
...
all                  48,401          100.0 %
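The majority vote over the five labelers can be sketched as follows. This is only an illustration: the function name and the rule for words without an absolute majority (returning `None`) are assumptions, not taken from the slides.

```python
from collections import Counter

def majority_vote(labels, min_votes=None):
    """Return the label chosen by an absolute majority of labelers, else None.

    min_votes defaults to more than half of the labelers (an assumption).
    """
    if min_votes is None:
        min_votes = len(labels) // 2 + 1
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_votes else None

# Five labelers judge one word (A angry, E emphatic, N neutral).
print(majority_vote(["N", "N", "E", "N", "A"]))  # N
print(majority_vote(["N", "E", "E", "A", "A"]))  # None
```

With five labelers, three agreeing votes are enough; the second word has no such majority.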


Labeling of User States

(cont.)

Confusion matrix (rows: majority vote; columns: emotion category; values in %)

                      A      T      R      E      N      M      J
angry (A)          43.3   13.0   12.9   12.1   18.1    0.1    0.0
touchy (T)          4.5   42.9   11.7   13.7   23.5    1.0    0.1
reprimanding (R)    3.8   15.7   45.8   14.0   18.2    1.3    0.1
emphatic (E)        1.3    5.8    6.7   53.6   29.9    1.2    0.5
neutral (N)         0.4    2.2    1.5   13.9   77.8    2.7    0.5
motherese (M)       0.0    0.8    1.4    4.9   30.4   61.1    0.9
joyful (J)          0.1    0.6    1.1    7.3   32.4    2.0   54.2


Data-driven Dimensions of Emotions

Non-metric multidimensional scaling:

arranging the emotion categories in a 2-dimensional space

states that are often confused are close to each other

[Figure: 2-dimensional arrangement of the categories angry, touchy, reprimanding, emphatic, neutral, motherese, and joyful; axes: valence (negative to positive) and interaction (−interaction to +interaction)]


Units of Analysis

Units of analysis

[Figure: example turns Ohm_18_342 and Ohm_18_343 (words: Aibo, g’radeaus, fein, machst, du, das, stopp, sitz, stopp) segmented at the word level, chunk level, and turn level]

Advantages/disadvantages of larger units

+ more information

− less emotional homogeneity


Sparse Data Problem

Super classes:

Anger: angry, touchy/irritated, reprimanding

Emphatic

Neutral

Motherese

[Figure: two scaling plots. Seven categories (angry, touchy, reprimanding, emphatic, neutral, motherese, joyful): S = 0.32, RSQ = 0.73. Four super classes (Anger, Emphatic, Neutral, Motherese): S = 0.19, RSQ = 0.90.]


Sparse Data Problem

(cont.)

Data subsets

[Figure: diagram of the data subsets Aibo word set, Aibo chunk set, and Aibo turn set within the Aibo corpus]

data set          number of words   taken from # chunks   taken from # turns
Aibo corpus                48,401                18,216               13,642
Aibo word set               6,070                 4,543                3,996
Aibo chunk set             13,217                 4,543                3,996
Aibo turn set              17,618                 6,413                3,996


Overview

Results for different Units of Analysis

Machine vs. Human

Feature Types and their Relevance


Most Appropriate Unit of Analysis

Classification

complete set of features

classification with Linear Discriminant Analysis (LDA)

51-fold speaker-independent cross-validation

unit of analysis   number of features   number of samples   average recall
word level                        265         6,070 words           67.2 %
chunk level                       700        4,543 chunks           68.9 %
turn level                        700         3,996 turns           63.2 %

Chunks: best compromise between

length of the segment

homogeneity of the emotional state within the segment
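Speaker-independent cross-validation can be sketched as below. With 51 speakers and 51 folds, one held-out speaker per fold seems likely, but that reading is an assumption; the function name and the toy speaker ids are hypothetical.

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speaker_ids)):
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        yield held_out, train, test

# Hypothetical toy data: one speaker id per chunk.
speakers = ["s01", "s01", "s02", "s03", "s03", "s03"]
folds = list(leave_one_speaker_out(speakers))
for held_out, train, test in folds:
    print(held_out, train, test)
```

No chunk of the held-out speaker ever appears in the training indices of its own fold, which is what makes the evaluation speaker-independent.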


Machine Classifier vs. Human Labeler

Entropy based measure:

[Figure: worked example over the classes A, E, N, M. The distribution of the labelers' decisions is combined with the decision of the decoder; the resulting distribution A 0.0, E 0.5, N 0.375, M 0.125 has the entropy H_dec = 1.41]

implicit weighting of classification ‘errors’ depending on the word that is classified
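The entropy value of the example can be checked with a few lines. A base-2 logarithm is assumed here because it reproduces 1.41. The `combine` helper, its equal weighting, and the example inputs are assumptions chosen to reproduce the combined distribution, not a definition taken from the slides.

```python
import math

def entropy_bits(distribution):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

def combine(labeler_dist, decoder_class, weight=0.5):
    """Mix the labelers' soft distribution with the decoder's hard decision."""
    return [
        (1 - weight) * p + (weight if i == decoder_class else 0.0)
        for i, p in enumerate(labeler_dist)
    ]

# Class order: A, E, N, M. Combined distribution as given in the example.
combined = [0.0, 0.5, 0.375, 0.125]
print(round(entropy_bits(combined), 2))  # 1.41

# Assumed inputs that reproduce it: labelers 75 % N and 25 % M, decoder says E.
print(combine([0.0, 0.0, 0.75, 0.25], decoder_class=1))  # [0.0, 0.5, 0.375, 0.125]
```

A decoder decision that agrees with most labelers keeps the entropy low, while a decision no labeler chose raises it, which is the implicit weighting of 'errors'.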


Machine Classifier vs. Human Labeler

(cont.)

Classification: Aibo word set

[Figure: histogram of the entropy values for the average human labeler and the machine classifier; x-axis: entropy, y-axis: relative frequency [%]]

[5] S. Steidl, M. Levit, A. Batliner, E. Nöth, H. Niemann:

“Of All Things the Measure is Man” – Classification of Emotions and Inter-Labeler Consistency, ICASSP 2005


Evaluation of Different Types of Features

Types of features

acoustic features: prosodic features, spectral features, voice quality features

linguistic features

Evaluation

Artificial Neural Networks (ANN)

51-fold speaker-independent cross-validation

combination by early or late fusion
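The difference between the two fusion strategies can be sketched as follows. The slides do not say how the classifier outputs are combined in late fusion, so averaging the class posteriors is an assumption (one common rule among several), and all numbers are hypothetical.

```python
def early_fusion(acoustic, linguistic):
    """Early fusion: concatenate both feature vectors for one classifier."""
    return list(acoustic) + list(linguistic)

def late_fusion(posteriors):
    """Late fusion: average the class posteriors of separate classifiers."""
    n_classes = len(posteriors[0])
    return [sum(p[c] for p in posteriors) / len(posteriors) for c in range(n_classes)]

# Hypothetical values for one chunk; class order A, E, N, M.
features = early_fusion([0.3, 1.2], [0.25, 0.0, 0.75])
fused = late_fusion([[0.6, 0.2, 0.1, 0.1],    # acoustic classifier
                     [0.2, 0.5, 0.2, 0.1]])   # linguistic classifier
best = max(range(len(fused)), key=fused.__getitem__)
print(len(features), "AENM"[best])  # 5 A
```

Early fusion needs a single classifier that copes with the larger feature vector; late fusion keeps one classifier per feature type and only merges their decisions.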


Acoustic Features: Prosody

Prosody

suprasegmental characteristics such as

pitch contour

energy contour

temporal shortening/lengthening of words

duration of pauses between words


Acoustic Features: Prosody

(cont.)

Classification results: Aibo chunk set

[Figure: bar chart of the average recall [%] for the prosodic feature groups F0 (29), duration (37), energy (25), pauses (16), and all of them together; recalls shown: 42.0, 50.6, 54.4, 58.5, 59.0]


Acoustic Features: Spectral Characteristics

(cont.)

[Figure: bar chart of the average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2; bars for jitter/shimmer (4), HNR (2), TEO (64), and the best combination not yet shown]


Acoustic Features: Voice Quality

[Figure: bar chart of the average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2, jitter/shimmer (4) 47.0, HNR (2) 32.5, TEO (64) 52.3; bar for the best combination not yet shown]


Acoustic Features: Combination

[Figure: bar chart of the average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2, jitter/shimmer (4) 47.0, HNR (2) 32.5, TEO (64) 52.3, best combination 65.4]


Linguistic Features

Types of linguistic features

word characteristics: average word length (number of letters, phonemes, syllables), proportion of word fragments, average number of repetitions

part-of-speech features

unigram models

bag-of-words


Linguistic Features

(cont.)

Part-of-Speech (POS) Features

only 6 coarse POS categories

can be annotated without considering context

[Figure: share of the POS categories (% of total) for the classes Anger, Emphatic, Neutral, Motherese, Joyful, and Other. Legend: nouns, proper names, inflected adjectives, particles, interjections, articles, pronouns, auxiliaries, present/past participles, not inflected adjectives, (other) verbs, infinitives]


Linguistic Features

(cont.)

Unigram Models

$$u(w, e) = \log_{10} \frac{P(e \mid w)}{P(e)}$$

class       word (translation)       P(class|w)
Anger       böser (bad)                  29.2 %
Anger       stehenbleiben (stop)         18.9 %
Anger       nein (no)                    17.0 %
Anger       aufstehen (get up)           12.3 %
Anger       Aibo (Aibo)                  10.1 %
Emphatic    stopp (stop)                 30.5 %
Emphatic    halt (halt)                  29.3 %
Emphatic    links (left)                 20.5 %
Emphatic    rechts (right)               18.9 %
Emphatic    nein (no)                    17.6 %
Neutral     okay (okay)                  98.6 %
Neutral     und (and)                    98.5 %
Neutral     Stück (bit)                  98.5 %
Neutral     in (in)                      98.2 %
Neutral     noch (still)                 96.2 %
Motherese   fein (fine)                  57.5 %
Motherese   ganz (very)                  41.9 %
Motherese   braver (good)                36.0 %
Motherese   sehr (very)                  23.5 %
Motherese   brav (good)                  21.7 %
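Reading the unigram score as the base-10 logarithm of the ratio between the class posterior P(e|w) and the class prior P(e), it can be computed as sketched below. The posterior for "fein" is from the table; the prior P(M) = 0.10 is a hypothetical value used only for illustration.

```python
import math

def unigram_score(p_e_given_w, p_e):
    """u(w, e) = log10(P(e|w) / P(e)); positive if word w favours class e."""
    return math.log10(p_e_given_w / p_e)

# P(M|w) = 57.5 % for "fein" (from the table); prior P(M) = 0.10 is hypothetical.
print(round(unigram_score(0.575, 0.10), 2))  # 0.76
```

A score of 0 means the word carries no information about the class; negative scores mark words that are less likely than average to occur with it.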


Linguistic Features

(cont.)

Bag-of-Words

utterance: Aibo, geh nach links! (Aibo, move to the left!)

[Figure: bag-of-words vector of the utterance: entry 1/4 for each of the words Aibo, geh, nach, links; entry 0 for all other vocabulary words such as Aibolein and allen]

representation of the linguistic content

word order getting lost

various dimensionality reduction techniques
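A bag-of-words vector for the example utterance can be built as sketched below. The entries of 1/4 suggest counts normalized by the utterance length; that normalization, the tokenization, and the small vocabulary are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(utterance, vocabulary):
    """Relative frequency of each vocabulary word in the utterance."""
    tokens = [t.strip(",.!?").lower() for t in utterance.split()]
    tokens = [t for t in tokens if t]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty utterances
    return [counts[w.lower()] / total for w in vocabulary]

vocabulary = ["Aibo", "Aibolein", "allen", "geh", "links", "nach"]  # toy vocabulary
vector = bag_of_words("Aibo, geh nach links!", vocabulary)
print(vector)  # [0.25, 0.0, 0.0, 0.25, 0.25, 0.25]
```

The vector is the same for any permutation of the four words, which is the loss of word order noted above.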


Linguistic Features

(cont.)

[Figure: bar chart of the average recall [%] for the linguistic feature types POS (6), unigram models (16), word statistics (6), BOW (254 → 50), and their best combination; recalls shown: 54.3, 56.1, 61.9, 61.9, 62.2]


Combination of Acoustic and Linguistic Features

[Figure: bar chart of the average recall [%]: acoustic features, best combination (late fusion, ANN) 65.4; linguistic features, best combination (late fusion, ANN) 62.2; combination (late fusion, ANN) 67.1; combination (early fusion, LDA) 68.9]


Similar Results within CEICES

CEICES: Combining Efforts for Improving Automatic Classification of Emotional User States

collaboration of various research groups within the European Network of Excellence HUMAINE (2004-2007)

state-of-the-art feature set with ≥ 4,000 features

SVM (linear kernel), 3-fold speaker-independent cross-validation

selection of 150 features (SFFS): surviving feature types?

only chunk based features, no information outside Aibo chunk set

[6] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, N. Amir:

Whodunnit – Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech,


Similar Results within CEICES

(cont.)

Selection of 150 features from the complete feature set:

feature type       # total   SFFS #   F-measure   share [%]   portion [%]
duration               391       10        49.6         6.7           2.6
energy                 265       32        56.3        21.3          12.1
F0                     333       16        46.8        10.7           4.8
spectrum               656       15        46.2        10.0           2.3
cepstrum             1,699       16        46.4        10.7           1.0
voice quality          153        7        38.7         4.7           4.6
wavelets               216        5        35.3         3.4           2.3
all acoustic         3,713      101           –        67.3           2.7
BOW                    476       25        37.4        16.7           5.3
POS                     31        7        48.1         4.7          22.6
higher semantics        12       17        56.0        11.3         141.7
varia                   12        0           –         0.0           0.0
all linguistic         531       49           –        32.7           9.6
all                  4,244      150        65.5       100.0           3.5

Separate selection of 150 acoustic and 150 linguistic features:

feature type       SFFS #   F-measure   share [%]   portion [%]
duration               28        54.9        18.7           7.2
energy                 33        56.9        22.0          12.5
F0                     23        46.7        15.3           6.9
spectrum               17        49.9        11.3           2.6
cepstrum               23        50.4        15.3           1.4
voice quality          11        41.5         7.3           7.2
wavelets               15        44.9        10.0           6.9
all acoustic          150        63.4       100.0           4.0
BOW                    94        53.2        62.7          19.7
POS                    27        54.9        18.0          87.1
higher semantics       27        57.9        18.0         225.0
varia                   2           –         0.1          16.7
all linguistic        150        62.6       100.0          28.2
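The SHARE and PORTION rows appear to follow directly from the feature counts: share as the percentage of the selected set that comes from a feature type, portion as the percentage of that type's features that survive the selection. These definitions are inferred from the numbers, not stated on the slide; the sketch below reproduces the energy and duration entries of the first selection.

```python
def share_and_portion(selected, total_in_group, total_selected):
    """share: % of the selected set; portion: % of the group that survived."""
    share = 100.0 * selected / total_selected
    portion = 100.0 * selected / total_in_group
    return round(share, 1), round(portion, 1)

# Energy: 32 of 265 features survive among the 150 selected features.
print(share_and_portion(32, 265, 150))  # (21.3, 12.1)
# Duration: 10 of 391 features survive.
print(share_and_portion(10, 391, 150))  # (6.7, 2.6)
```

A high portion with a small total, as for POS, indicates a compact feature type that survives selection well.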


Overview


INTERSPEECH 2009 Emotion Challenge

New goals:

challenge with standardized test conditions

open microphone: using the complete corpus

highly unbalanced classes

including all observed emotional categories

including chunks with low inter-labeler agreement


INTERSPEECH 2009 Emotion Challenge

(cont.)

Speaker independent training and test sets

2-class problem: NEGative vs. IDLe

#        NEG      IDL        Σ
train   3,358    6,601    9,959
test    2,465    5,792    8,257
Σ       5,823   12,393   18,216

5-class problem: Anger, Emphatic, Neutral, Positive, Rest

#         A       E        N      P       R        Σ
train    881   2,093    5,590    674     721    9,959
test     611   1,508    5,377    215     546    8,257
Σ      1,492   3,601   10,967    889   1,267   18,216
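With classes this unbalanced, it matters whether recall is averaged per class (unweighted) or per instance (weighted). The sketch below uses the 5-class test-set sizes from the table and a hypothetical classifier that always answers Neutral; the function name and the confusion-matrix layout are assumptions.

```python
def recalls(confusion):
    """confusion[i][j]: items of true class i classified as class j.

    Returns (unweighted average recall, weighted average recall).
    """
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    total = sum(sum(row) for row in confusion)
    uar = sum(per_class) / len(per_class)
    war = sum(row[i] for i, row in enumerate(confusion)) / total
    return uar, war

# Test-set sizes for A, E, N, P, R; hypothetical classifier that always
# answers "Neutral" (index 2).
test_counts = [611, 1508, 5377, 215, 546]
always_neutral = [[n if j == 2 else 0 for j in range(5)] for n in test_counts]
uar, war = recalls(always_neutral)
print(f"UAR = {uar:.1%}, WAR = {war:.1%}")  # UAR = 20.0%, WAR = 65.1%
```

The trivial classifier reaches a high weighted recall but only chance-level unweighted recall, so the unweighted figure is the more informative one for this corpus.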


INTERSPEECH 2009 Emotion Challenge

(cont.)

Sub-Challenges

1 Feature Sub-Challenge

optimisation of feature extraction/selection; classifier settings fixed

2 Classifier Sub-Challenge

optimisation of classification techniques; feature set given

3 Open Performance Sub-Challenge

optimisation of feature extraction/selection and classification techniques


INTERSPEECH 2009 Emotion Challenge

(cont.)

Participants

Open Performance        Classifier              Feature                 number of
Sub-Challenge           Sub-Challenge           Sub-Challenge           participants
2 classes   5 classes   2 classes   5 classes   2 classes   5 classes
✓           ✓           –           –           –           –           7
✓           –           –           –           –           –           2
–           –           ✓           ✓           –           –           2
–           –           –           ✓           –           –           1
–           –           –           ✓           ✓           ✓           1
–           –           –           –           ✓           ✓           1

[7] B. Schuller, A. Batliner, S. Steidl, D. Seppi:

Recognising Realistic Emotions and Affect in Speech: State of the Art and

Lessons Learnt from the First Challenge, Speech Communication, Special Issue


INTERSPEECH 2009 Emotion Challenge

(cont.)

2-class problem: NEGative vs. IDLe

[Figure: bar chart of the unweighted and weighted average recall [%] (axis from 60 to 74) for Barra-Chicote et al., Polzehl et al., Vogt et al., Bozkurt et al., Luengo et al., Kockmann et al., Vlasenko et al., Dumouchel et al., majority voting, and the baseline; recalls shown: 71.2, 70.3, 69.2, 68.3, 67.9, 67.6, 67.2, 67.1, 67.7, 66.4]


INTERSPEECH 2009 Emotion Challenge

(cont.)

5-class problem: Anger, Emphatic, Neutral, Positive, Rest

[Figure: bar chart of the unweighted and weighted average recall [%] (axis from 35 to 55) for Dumouchel et al., Planet et al., Luengo et al., Vlasenko et al., Lee et al., Kockmann et al., majority voting, Barra-Chicote et al., Vogt et al., the baseline, and Bozkurt et al.; recalls shown: 38.2, 38.2, 39.4, 39.4, 41.2, 41.4, 41.4, 41.6, 41.6, 41.7, 44.0]


State-of-the-Art: Summary

Berlin Emotional Speech Database (EmoDB)

7-class problem: hot anger, disgust, fear/panic, happiness, sadness/sorrow, boredom, neutral

balanced classes

+ 90 % accuracy

FAU Aibo Emotion Corpus

4-class problem: Anger, Emphatic, Neutral, Motherese

subset with roughly balanced classes (Aibo chunk set)

+ 69 % unweighted average recall

5-class problem: Anger, Emphatic, Neutral, Positive, Rest

highly unbalanced classes, complete corpus

+ 44 % unweighted average recall

2-class problem: NEGative vs. IDLe

highly unbalanced classes, complete corpus

+ 71 % unweighted average recall