State-of-the-Art in Classification of Real-Life Emotions
October 26, 2010
Stefan Steidl
International Computer Science Institute (ICSI) at Berkeley, CA
Overview
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
Overview
1 Different Perspectives on Emotion Recognition
Psychology of Emotion
Computer Science
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge
S. Steidl: Vocal Emotion Recognition
Universal Basic Emotions
Paul Ekman
postulates the existence of 6 basic emotions: anger, fear, disgust, surprise, joy, sadness
other emotions are mixed or blended emotions
universal facial expressions
Terminology
Different affective states [1]:
type of affective state   intensity   duration   synchronization   event focus   appraisal elicitation   rapidity of change   behavioral impact
emotion                   ++-+++      +          +++               +++           +++                     +++                  +++
mood                      +-++        ++         +                 +             +                       ++                   +
interpersonal stances     +-++        +-++       +                 ++            +                       +++                  ++
attitudes                 ◦-++        ++-+++     ◦                 ◦             +                       ◦-+                  +
personality traits        ◦-+         +++        ◦                 ◦             ◦                       ◦                    +
◦: low, +: medium, ++: high, +++: very high, -: indicates a range
[1] K. R. Scherer: Vocal communication of emotion: A review of research paradigms, Speech Communication, Vol. 40, pp. 227-256, 2003
Terminology
(cont.) Definition of Emotion
Emotion (Scherer)
episodes of coordinated changes in several components, including at least
neurophysiological activation, motor expression, and subjective feeling,
but possibly also action tendencies and cognitive processes,
in response to external or internal events of major significance to the organism
Vocal Expression of Emotion
Results from studies in Psychology of Emotion
Columns: anger/rage, fear/panic, sadness, joy/elation, boredom, stress
(↑ increase, ↓ decrease)
Intensity: ↑ ↑ ↓ ↑ ↑
F0 floor/mean: ↑ ↑ ↓ ↑ ↑
F0 variability: ↑ ↓ ↑ ↓
F0 range: ↑ ↑(↓)1 ↓ ↑ ↓
Sentence contour: ↓ ↓
High frequency energy: ↑ ↑ ↓ (↑)2
Speech and articulation rate: ↑ ↑ ↓ (↑)2 ↓
1 Banse and Scherer found a decrease in F0 range
2 inconclusive evidence
Goal
Classification of the subject’s actual emotional state (some sort of lie detector for emotions)
Human-Computer Interaction (HCI)
Emotion-Related User States
naturally occurring states of users in human-machine communication
emotions in a broader sense
coordinated changes in several components NOT required
classification of the perceived emotional state, not necessarily the actual emotion of the speaker
Pattern Recognition
Pattern Recognition Point of View
classification task: choose 1 of n given classes
discrimination of classes rather than classification
definition of “good” features
machine classification
Actually not needed: definition of the term emotion
Emotional Speech Corpora
Acted data
based on Basic Emotions theory
suited for studying prototypical emotions
corpora easy to create (inexpensive, no labeling process)
high audio quality
balanced classes
neutral linguistic content (focus on acoustics only)
high recognition results
Emotional Speech Corpora
(cont.) Popular corpora
Emotional Prosody Speech and Transcript corpus (LDC): 15 classes
Berlin Emotional Speech Database (EmoDB): 7 classes
89.9 % accuracy (speaker independent LOSO evaluation, speaker adaptation, feature selection) [2]
Danish Emotional Speech Corpus: 5 classes
74.5 % accuracy (10-fold SCV, feature selection) [3]
[2] B. Vlasenko et al.: Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech, INTERSPEECH 2007
[3] Schuller et al.: Emotion Recognition in the Noise Applying Large Acoustic Feature Sets, Speech Prosody 2006
Emotional Speech Corpora
(cont.) Naturally occurring emotions
states that actually appear in HCI (real applications)
difficult to create (appropriate scenario needed, ethical concerns, need to label data)
low emotional intensity
in general ≥ 80% neutral
low audio quality (reverberation, noise, far-distance microphones)
needed for machine classification (because conditions between training and test must not differ too much)
research on both acoustic and linguistic features possible
new research questions: optimal emotion unit
almost no corpora large enough for machine classification available (do not exist or are not available for research)
Overview
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
Scenario
Labeling of User States
Data-driven Dimensions of Emotion
Units of Analysis
Sparse Data Problem
3 Own Results on Emotion Classification
The FAU Aibo Emotion Corpus
51 children (30 f, 21 m) at the age of 10 to 13
8.9 hours of spontaneous speech (mainly short commands)
48,401 words in 13,642 audio files
FAU Aibo Emotion Corpus
(cont.) database for CEICES and the INTERSPEECH 2009 Emotion Challenge
available for scientific, non-commercial use
http://www5.cs.fau.de/FAUAiboEmotionCorpus
[4] S. Steidl: Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech, Logos Verlag, Berlin
available online:
Emotion-Related User States
11 categories, chosen by inspecting the data prior to labeling:
joyful, surprised, motherese, neutral, bored, emphatic, helpless, touchy/irritated, reprimanding, angry, other
motherese
the way mothers/parents address their babies, either because Aibo is well-behaved or because the child wants Aibo to obey; positive equivalent to reprimanding
emphatic
pronounced, accentuated, sometimes hyper-articulated way of speaking, but without showing any emotion
reprimanding
the child is reproachful, reprimanding, ‘wags the finger’
Labeling of User States
Labeling:
5 students of linguistics
holistic labeling on the word level
majority vote
emotion category     words
angry (A)              134     0.3 %
touchy (T)             419     0.9 %
reprimanding (R)       463     1.0 %
emphatic (E)         2,807     5.8 %
neutral (N)         39,975    82.6 %
motherese (M)        1,311     2.7 %
joyful (J)             109     0.2 %
...
all                 48,401   100.0 %
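The majority vote over the five labelers could be sketched as follows. This is only an illustration: the function name `majority_label`, the threshold of three votes, and the `None` result for words without an absolute majority are assumptions, not the procedure documented for the corpus.

```python
from collections import Counter

def majority_label(votes, min_votes=3):
    """Return the label given by at least `min_votes` labelers, else None.

    `min_votes=3` (absolute majority of five labelers) is an assumption.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None

# Hypothetical word-level decisions of five labelers
print(majority_label(["N", "N", "E", "N", "M"]))  # three votes for N
print(majority_label(["A", "T", "R", "E", "N"]))  # no majority
```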
Labeling of User States
(cont.) Confusion matrix (rows: majority vote)
emotion category      A      T      R      E      N      M      J
angry (A)          43.3   13.0   12.9   12.1   18.1    0.1    0.0
touchy (T)          4.5   42.9   11.7   13.7   23.5    1.0    0.1
reprimanding (R)    3.8   15.7   45.8   14.0   18.2    1.3    0.1
emphatic (E)        1.3    5.8    6.7   53.6   29.9    1.2    0.5
neutral (N)         0.4    2.2    1.5   13.9   77.8    2.7    0.5
motherese (M)       0.0    0.8    1.4    4.9   30.4   61.1    0.9
joyful (J)          0.1    0.6    1.1    7.3   32.4    2.0   54.2
Data-driven Dimensions of Emotions
Non-metric multidimensional scaling:
arranging the emotion categories in the 2-dimensional space
states that are often confused are close to each other
[Figure: the categories angry, touchy, reprimanding, emphatic, neutral, motherese, joyful arranged along two axes: valence (negative to positive) and interaction (−interaction to +interaction)]
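Before any scaling can be applied, the confusion matrix has to be turned into pairwise dissimilarities. The sketch below shows one simple way to do that, using the confusion values from the labeling slide. The symmetrisation by averaging and the mapping `100 - confusion` are assumptions chosen for illustration; the slides do not state how the dissimilarities were derived.

```python
labels = ["A", "T", "R", "E", "N", "M", "J"]
confusion = [
    [43.3, 13.0, 12.9, 12.1, 18.1, 0.1, 0.0],
    [4.5, 42.9, 11.7, 13.7, 23.5, 1.0, 0.1],
    [3.8, 15.7, 45.8, 14.0, 18.2, 1.3, 0.1],
    [1.3, 5.8, 6.7, 53.6, 29.9, 1.2, 0.5],
    [0.4, 2.2, 1.5, 13.9, 77.8, 2.7, 0.5],
    [0.0, 0.8, 1.4, 4.9, 30.4, 61.1, 0.9],
    [0.1, 0.6, 1.1, 7.3, 32.4, 2.0, 54.2],
]

def dissimilarities(conf):
    """Symmetric dissimilarity: frequently confused classes get small values."""
    n = len(conf)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # assumption: average both confusion directions, then invert
                d[i][j] = 100.0 - (conf[i][j] + conf[j][i]) / 2.0
    return d

d = dissimilarities(confusion)
pairs = sorted(
    (d[i][j], labels[i], labels[j])
    for i in range(len(labels))
    for j in range(i + 1, len(labels))
)
print(pairs[0][1:])  # the most frequently confused pair of categories
```

A matrix like `d` could then be handed to any non-metric multidimensional scaling routine to obtain the 2-dimensional arrangement.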
Units of Analysis
Units of analysis
[Figure: example utterances (“Aibo g’radeaus fein machst du das”, “stopp sitz stopp”; turns Ohm_18_342 and Ohm_18_343) segmented at the word level, chunk level, and turn level]
Advantages/disadvantages of larger units
+ more information
− less emotional homogeneity
Sparse Data Problem
Super classes:
Anger: angry, touchy/irritated, reprimanding
Emphatic
Neutral
Motherese
[Figure: two scaling plots. Seven categories (angry, touchy, reprimanding, emphatic, neutral, motherese, joyful): S = 0.32, RSQ = 0.73. Four super classes (Anger, Emphatic, Neutral, Motherese): S = 0.19, RSQ = 0.90]
Sparse Data Problem
(cont.) Data subsets
[Figure: Aibo word set, Aibo chunk set, Aibo turn set, Aibo corpus]
data set         number of words   taken from # chunks   # turns
Aibo corpus      48,401            18,216                13,642
Aibo word set     6,070             4,543                 3,996
Aibo chunk set   13,217             4,543                 3,996
Aibo turn set    17,618             6,413                 3,996
Overview
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
Results for different Units of Analysis
Machine vs. Human
Feature Types and their Relevance
Most Appropriate Unit of Analysis
Classification
complete set of features
classification with Linear Discriminant Analysis (LDA)
51-fold speaker-independent cross-validation
unit of analysis   number of features   number of samples   average recall
word level         265                  6,070 words         67.2 %
chunk level        700                  4,543 chunks        68.9 %
turn level         700                  3,996 turns         63.2 %
Chunks: best compromise between the length of the segment and the homogeneity of the emotional state within the segment
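With 51 children in the corpus, a 51-fold speaker-independent cross-validation presumably holds out one speaker per fold. The sketch below shows such a split under that assumption; the function name `leave_one_speaker_out` and the speaker IDs are hypothetical.

```python
from collections import defaultdict

def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)
    for spk in sorted(by_speaker):
        test = by_speaker[spk]
        # all samples of every other speaker form the training set
        train = [i for s, idxs in by_speaker.items() if s != spk for i in idxs]
        yield spk, sorted(train), test

# Hypothetical speaker ID for each sample (word, chunk, or turn)
speakers = ["s01", "s02", "s01", "s03", "s02"]
folds = list(leave_one_speaker_out(speakers))
for spk, train, test in folds:
    print(spk, train, test)
```

No speaker appears in both the training and the test part of a fold, which is what makes the evaluation speaker-independent.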
Machine Classifier vs. Human Labeler
Entropy based measure:
[Figure: example with the classes A, E, N, M: the class distribution of the labelers is combined with the decision of the decoder; the resulting distribution (0.0, 0.5, 0.375, 0.125) has an entropy of Hdec = 1.41]
implicit weighting of classification ‘errors’ depending on the word that is classified
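The entropy-based measure could be computed along the following lines. This is a sketch under assumptions: the labelers' averaged distribution and the decoder's hard decision are mixed with equal weights, and the entropy is taken in bits. With a labeler distribution of (0.0, 0.25, 0.75, 0.0) and a decoder decision for a class no labeler chose, this reproduces the value Hdec = 1.41 shown on the slide, but the slide itself does not spell out the formula.

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def entropy_with_decoder(labeler_dist, decoder_class, weight=0.5):
    """Mix the labelers' distribution with the decoder's hard decision.

    `weight` is the share given to the decoder (0.5 is an assumption).
    """
    mixed = [(1.0 - weight) * p for p in labeler_dist]
    mixed[decoder_class] += weight
    return entropy_bits(mixed), mixed

classes = ["A", "E", "N", "M"]
labelers = [0.0, 0.25, 0.75, 0.0]  # averaged labeler decisions
h_m, mixed_m = entropy_with_decoder(labelers, classes.index("M"))
h_n, mixed_n = entropy_with_decoder(labelers, classes.index("N"))
print(round(h_m, 2), round(h_n, 2))
```

A decision that agrees with most labelers (here N) yields a lower entropy than one that none of them made (here M), so how severely an "error" counts depends on how ambiguous the word was for the human labelers.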
Machine Classifier vs. Human Labeler
(cont.) Classification: Aibo word set
[Figure: histogram of the entropy values (x-axis: entropy, y-axis: relative frequency [%]) for the average human labeler and the machine classifier]
[5] S. Steidl, M. Levit, A. Batliner, E. Nöth, H. Niemann:
“Of All Things the Measure is Man” – Classification of Emotions and Inter-Labeler Consistency, ICASSP 2005
Evaluation of Different Types of Features
Types of features
acoustic features: prosodic features, spectral features, voice quality features
linguistic features
Evaluation
Artificial Neural Networks (ANN)
51-fold speaker-independent cross-validation
combination by early or late fusion
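The difference between the two fusion strategies can be sketched as follows: early fusion concatenates the feature vectors before a single classifier is trained, while late fusion combines the outputs of separate classifiers. The function names and the averaging rule for late fusion are assumptions for illustration; the slides do not say which combination rule was used.

```python
def early_fusion(*feature_vectors):
    """Concatenate the feature vectors of one sample into a single vector."""
    return [x for vec in feature_vectors for x in vec]

def late_fusion(*posteriors):
    """Average the class posteriors of separate classifiers, then decide.

    Averaging is one possible rule (assumption); others are products or votes.
    """
    n_classes = len(posteriors[0])
    mean = [sum(p[c] for p in posteriors) / len(posteriors) for c in range(n_classes)]
    best = max(range(n_classes), key=mean.__getitem__)
    return best, mean

# Hypothetical feature vectors of one chunk
prosodic, spectral = [0.2, 1.5], [0.7]
fused_features = early_fusion(prosodic, spectral)

# Hypothetical posteriors of two classifiers over three classes
decision, mean_posteriors = late_fusion([0.6, 0.3, 0.1], [0.1, 0.6, 0.3])
print(fused_features, decision)
```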
Acoustic Features: Prosody
Prosody
suprasegmental characteristics such as
pitch contour
energy contour
temporal shortening/lengthening of words
duration of pauses between words
Acoustic Features: Prosody
(cont.) Classification results: Aibo chunk set
[Figure: bar chart of the average recall [%] for the prosodic feature groups F0 (29), duration (37), energy (25), pauses (16), and all; bar values shown: 42.0, 50.6, 54.4, 58.5, 59.0]
Acoustic Features: Spectral Characteristics
(cont.) Classification results: Aibo chunk set
[Figure: bar chart of the average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2]
Acoustic Features: Voice Quality
Classification results: Aibo chunk set
[Figure: bar chart of the average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2, jitter/shimmer (4) 47.0, HNR (2) 32.5, TEO (64) 52.3]
Acoustic Features: Combination
Classification results: Aibo chunk set
[Figure: bar chart of the average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2, jitter/shimmer (4) 47.0, HNR (2) 32.5, TEO (64) 52.3, best combination 65.4]
Linguistic Features
Types of linguistic features
word characteristics
average word length (number of letters, phonemes, syllables)
proportion of word fragments
average number of repetitions
part-of-speech features
unigram models
bag-of-words
Linguistic Features
(cont.) Part-of-Speech (POS) Features
only 6 coarse POS categories
can be annotated without considering context
[Figure: share of each POS category (% of total) for the classes Anger, Emphatic, Neutral, Motherese, Joyful, Other]
nouns, proper names
inflected adjectives
particles, interjections
articles, pronouns, auxiliaries
present/past participles, not inflected adjectives
(other) verbs, infinitives
Linguistic Features
(cont.) Unigram Models
$u(w, e) = \log_{10} \frac{P(e \mid w)}{P(e)}$
Anger                  P(A|w)    Emphatic         P(E|w)
böser (bad)            29.2 %    stopp (stop)     30.5 %
stehenbleiben (stop)   18.9 %    halt (halt)      29.3 %
nein (no)              17.0 %    links (left)     20.5 %
aufstehen (get up)     12.3 %    rechts (right)   18.9 %
Aibo (Aibo)            10.1 %    nein (no)        17.6 %
Neutral                P(N|w)    Motherese        P(M|w)
okay (okay)            98.6 %    fein (fine)      57.5 %
und (and)              98.5 %    ganz (very)      41.9 %
Stück (bit)            98.5 %    braver (good)    36.0 %
in (in)                98.2 %    sehr (very)      23.5 %
noch (still)           96.2 %    brav (good)      21.7 %
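The unigram score u(w, e) compares the probability of an emotion given a word with the prior probability of that emotion. A minimal sketch of estimating it from labeled words by relative frequencies is given below; the function name `unigram_scores` and the toy data are hypothetical, and no smoothing is applied, so unseen (word, emotion) pairs simply get no score.

```python
import math
from collections import Counter

def unigram_scores(labeled_words):
    """Estimate u(w, e) = log10(P(e|w) / P(e)) from (word, emotion) pairs."""
    pairs = list(labeled_words)
    n = len(pairs)
    word_counts = Counter(w for w, _ in pairs)
    emotion_counts = Counter(e for _, e in pairs)
    scores = {}
    for (w, e), c in Counter(pairs).items():
        p_e_given_w = c / word_counts[w]  # relative frequency estimate
        p_e = emotion_counts[e] / n       # prior of the emotion class
        scores[(w, e)] = math.log10(p_e_given_w / p_e)
    return scores

# Toy data, not taken from the corpus
data = [
    ("stopp", "E"), ("stopp", "E"), ("stopp", "N"),
    ("okay", "N"), ("okay", "N"), ("fein", "M"),
    ("und", "N"), ("und", "N"),
]
scores = unigram_scores(data)
print(scores[("stopp", "E")] > 0, scores[("stopp", "N")] < 0)
```

A positive score means the word is observed with that emotion more often than the class prior would suggest, a negative score means less often.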
Linguistic Features
(cont.) Bag-of-Words
utterance: Aibo, geh nach links! (Aibo, move to the left!)
[Figure: bag-of-words vector of the utterance over the lexicon (Aibo, Aibolein, allen, ...): entries of 1/4 for the words Aibo, geh, nach, links and 0 for all other words]
representation of the linguistic content
word order gets lost
various dimensionality reduction techniques
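A bag-of-words vector for the example utterance could be built as sketched below. The tiny lexicon, the tokenisation by a regular expression, and the normalisation by the number of words in the utterance are assumptions for illustration; the dimensionality reduction step is not shown.

```python
import re

def bag_of_words(utterance, lexicon):
    """Relative frequency of each lexicon word in the utterance."""
    tokens = re.findall(r"\w+", utterance.lower())
    if not tokens:
        return [0.0] * len(lexicon)
    return [tokens.count(word) / len(tokens) for word in lexicon]

# Hypothetical excerpt of the lexicon (lower-cased)
lexicon = ["aibo", "aibolein", "allen", "geh", "links", "nach"]
vec = bag_of_words("Aibo, geh nach links!", lexicon)
shuffled = bag_of_words("links nach geh Aibo", lexicon)
print(vec)
print(vec == shuffled)  # the representation ignores word order
```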
Linguistic Features
(cont.) Classification results: Aibo chunk set
[Figure: bar chart of the average recall [%] for POS (6), unigram models (16), word statistics (6), BOW (254 → 50), and the best combination; bar values shown: 54.3, 56.1, 61.9, 61.9, 62.2]
Combination of Acoustic and Linguistic Features
Classification results: Aibo chunk set
[Figure: bar chart of the average recall [%]: acoustic features, best combination (late fusion, ANN) 65.4; linguistic features, best combination (late fusion, ANN) 62.2; combination (late fusion, ANN) 67.1; combination (early fusion, LDA) 68.9]
Similar Results within CEICES
CEICES: Combining Efforts for Improving Automatic Classification of Emotional User States
collaboration of various research groups within the European Network of Excellence HUMAINE (2004-2007)
state-of-the-art feature set with ≥ 4,000 features
SVM (linear kernel), 3-fold speaker-independent cross-validation
selection of 150 features (SFFS): surviving feature types?
only chunk based features, no information outside Aibo chunk set
[6] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, N. Amir:
Whodunnit – Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech,
Similar Results within CEICES
(cont.)
First SFFS block (150 features selected over all feature types):
feature type       # total   SFFS #   F-measure   share   portion
duration               391       10        49.6     6.7       2.6
energy                 265       32        56.3    21.3      12.1
F0                     333       16        46.8    10.7       4.8
spectrum               656       15        46.2    10.0       2.3
cepstrum             1,699       16        46.4    10.7       1.0
voice quality          153        7        38.7     4.7       4.6
wavelets               216        5        35.3     3.4       2.3
all acoustic         3,713      101           –    67.3       2.7
BOW                    476       25        37.4    16.7       5.3
POS                     31        7        48.1     4.7      22.6
higher semantics        12       17        56.0    11.3     141.7
varia                   12        0           –     0.0       0.0
all linguistic         531       49           –    32.7       9.6
all                  4,244      150        65.5   100.0       3.5
Second SFFS block (150 acoustic and 150 linguistic features):
feature type       SFFS #   F-measure   share   portion
duration               28        54.9    18.7       7.2
energy                 33        56.9    22.0      12.5
F0                     23        46.7    15.3       6.9
spectrum               17        49.9    11.3       2.6
cepstrum               23        50.4    15.3       1.4
voice quality          11        41.5     7.3       7.2
wavelets               15        44.9    10.0       6.9
all acoustic          150        63.4   100.0       4.0
BOW                    94        53.2    62.7      19.7
POS                    27        54.9    18.0      87.1
higher semantics       27        57.9    18.0     225.0
varia                   2           –     0.1      16.7
all linguistic        150        62.6   100.0      28.2
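The SHARE and PORTION rows are not defined on the slide, but the numbers are consistent with SHARE being the percentage of the 150 selected features that belong to a feature type, and PORTION being the percentage of a type's features that survived the selection. The sketch below checks this reading for two columns; the interpretation and the function name are assumptions.

```python
def share_and_portion(n_selected, n_total_of_type, n_selected_overall=150):
    """Return (SHARE, PORTION) in percent, rounded to one decimal.

    SHARE:   selected features of this type / all selected features
    PORTION: selected features of this type / all features of this type
    (both definitions are inferred from the table, not stated in the slides)
    """
    share = 100.0 * n_selected / n_selected_overall
    portion = 100.0 * n_selected / n_total_of_type
    return round(share, 1), round(portion, 1)

duration = share_and_portion(10, 391)  # 10 of 391 duration features selected
energy = share_and_portion(32, 265)    # 32 of 265 energy features selected
print(duration, energy)
```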
Overview
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
INTERSPEECH 2009 Emotion Challenge
New goals:
challenge with standardized test conditions
open microphone: using the complete corpus
highly unbalanced classes
including all observed emotional categories
including chunks with low inter-labeler agreement
INTERSPEECH 2009 Emotion Challenge
(cont.) Speaker independent training and test sets
2-class problem: NEGative vs. IDLe
#         NEG      IDL        Σ
train   3,358    6,601    9,959
test    2,465    5,792    8,257
Σ       5,823   12,393   18,216
5-class problem: Anger, Emphatic, Neutral, Positive, Rest
#         A       E       N     P     R       Σ
train   881   2,093   5,590   674   721   9,959
test    611   1,508   5,377   215   546   8,257
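With classes this unbalanced, the choice of evaluation measure matters. The sketch below computes the unweighted average recall (mean of the per-class recalls) and the weighted average recall (recalls weighted by class frequency, which equals accuracy) from a confusion matrix. The example classifier that always answers IDL is purely illustrative; only the test set counts (2,465 NEG, 5,792 IDL) are taken from the slide.

```python
def recalls(confusion):
    """confusion[i][j]: number of class-i samples classified as class j.

    Returns (unweighted average recall, weighted average recall).
    """
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    total = sum(sum(row) for row in confusion)
    correct = sum(row[i] for i, row in enumerate(confusion))
    uar = sum(per_class) / len(per_class)
    war = correct / total
    return uar, war

# Illustrative classifier that labels every test chunk as IDL
always_idl = [
    [0, 2465],  # NEG chunks: none recognised
    [0, 5792],  # IDL chunks: all recognised
]
uar, war = recalls(always_idl)
print(f"UAR = {uar:.1%}, WAR = {war:.1%}")
```

Such a trivial classifier reaches a weighted average recall of about 70 % but only 50 % unweighted average recall, which illustrates why both measures are reported for unbalanced data.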
INTERSPEECH 2009 Emotion Challenge
(cont.) Sub-Challenges
1 Feature Sub-Challenge
optimisation of feature extraction/selection; classifier settings fixed
2 Classifier Sub-Challenge
optimisation of classification techniques; feature set given
3 Open Performance Sub-Challenge
optimisation of feature extraction/selection and classification techniques
INTERSPEECH 2009 Emotion Challenge
(cont.) Participants
Open Performance        Classifier              Feature                 number of
Sub-Challenge           Sub-Challenge           Sub-Challenge           participants
2 classes   5 classes   2 classes   5 classes   2 classes   5 classes
✓           ✓           –           –           –           –           7
✓           –           –           –           –           –           2
–           –           ✓           ✓           –           –           2
–           –           –           ✓           –           –           1
–           –           –           ✓           ✓           ✓           1
–           –           –           –           ✓           ✓           1
[7] B. Schuller, A. Batliner, S. Steidl, D. Seppi:
Recognising Realistic Emotions and Affect in Speech: State of the Art and
Lessons Learnt from the First Challenge, Speech Communication, Special Issue
INTERSPEECH 2009 Emotion Challenge
(cont.) 2-class problem: NEGative vs. IDLe
[Figure: bar chart of the unweighted and weighted average recall [%] for Barra-Chicote et al., Polzehl et al., Vogt et al., Bozkurt et al., Luengo et al., Kockmann et al., Vlasenko et al., Dumouchel et al., majority voting, and the baseline; bar values shown: 71.2, 70.3, 69.2, 68.3, 67.9, 67.6, 67.2, 67.1, 67.7, 66.4]
INTERSPEECH 2009 Emotion Challenge
(cont.) 5-class problem: Anger, Emphatic, Neutral, Positive, Rest
[Figure: bar chart of the unweighted and weighted average recall [%] for Dumouchel et al., Planet et al., Luengo et al., Vlasenko et al., Lee et al., Kockmann et al., majority voting, Barra-Chicote et al., Vogt et al., the baseline, and Bozkurt et al.; bar values shown: 38.2, 38.2, 39.4, 39.4, 41.2, 41.4, 41.4, 41.6, 41.6, 41.7, 44.0]
State-of-the-Art: Summary
Berlin Emotional Speech Database
7-class problem: hot anger, disgust, fear/panic, happiness, sadness/sorrow, boredom, neutral
balanced classes
+ 90 % accuracy
FAU Aibo Emotion Corpus
4-class problem: Anger, Emphatic, Neutral, Motherese
subset with roughly balanced classes (Aibo chunk set)
+ 69 % unweighted average recall
5-class problem: Anger, Emphatic, Neutral, Positive, Rest
highly unbalanced classes, complete corpus
+ 44 % unweighted average recall
2-class problem: NEGative vs. IDLe
highly unbalanced classes, complete corpus
+ 71 % unweighted average recall