Vocal Emotion Recognition


State-of-the-Art in Classification of Real-Life Emotions

October 26, 2010

Stefan Steidl

International Computer Science Institute (ICSI) at Berkeley, CA


Overview

1 Different Perspectives on Emotion Recognition

Psychology of Emotion

Computer Science

2 FAU Aibo Emotion Corpus

3 Own Results on Emotion Classification

4 INTERSPEECH 2009 Emotion Challenge

S. Steidl: Vocal Emotion Recognition


Universal Basic Emotions

Paul Ekman

postulates the existence of

6 basic emotions:

anger, fear, disgust, surprise, joy, sadness

other emotions are mixed or blended emotions

universal facial expressions


Terminology

Different affective states [1]:

| type of affective state | intensity | duration | synchronization | event focus | appraisal elicitation | rapidity of change | behavioral impact |
|---|---|---|---|---|---|---|---|
| emotion | ++ to +++ | + | +++ | +++ | +++ | +++ | +++ |
| mood | + to ++ | ++ | + | + | + | ++ | + |
| interpersonal stances | + to ++ | + to ++ | + | ++ | + | +++ | ++ |
| attitudes | ◦ to ++ | ++ to +++ | ◦ | ◦ | + | ◦ to + | + |
| personality traits | ◦ to + | +++ | ◦ | ◦ | ◦ | ◦ | + |

◦ = low, + = medium, ++ = high, +++ = very high; "to" indicates a range

[1] K. R. Scherer: Vocal communication of emotion: A review of research paradigms, Speech Communication, Vol. 40, pp. 227-256, 2003


Terminology

(cont.)

Definition of Emotion

Emotion (Scherer)

episodes of coordinated changes in several components, including at least neurophysiological activation, motor expression, and subjective feeling, but possibly also action tendencies and cognitive processes, in response to external or internal events of major significance to the organism


Vocal Expression of Emotion

Results from studies in Psychology of Emotion

Acoustic changes reported for stress, anger/rage, fear/panic, sadness, joy/elation, and boredom (↑ increase, ↓ decrease):

Intensity ↑ ↑ ↓ ↑ ↑

F0 floor/mean ↑ ↑ ↓ ↑ ↑

F0 variability ↑ ↓ ↑ ↓

F0 range ↑ ↑(↓)¹ ↓ ↑ ↓

Sentence contour ↓ ↓

High frequency energy ↑ ↑ ↓ (↑)²

Speech and articulation rate ↑ ↑ ↓ (↑)² ↓

¹ Banse and Scherer found a decrease in F0 range

² inconclusive evidence

Goal

Classification of the subject’s actual emotional state (some sort of lie detector for emotions)


Human-Computer Interaction (HCI)

Emotion-Related User States

naturally occurring states of users in human-machine communication

emotions in a broader sense

coordinated changes in several components NOT required

classification of the perceived emotional state, not necessarily the actual emotion of the speaker


Pattern Recognition

Pattern Recognition Point of View

classification task: choose 1 of n given classes

discrimination of classes rather than classification

definition of “good” features

machine classification

Actually not needed: a definition of the term “emotion”


Emotional Speech Corpora

Acted data

based on Basic Emotions theory

suited for studying prototypical emotions

corpora easy to create (inexpensive, no labeling process)

high audio quality

balanced classes

neutral linguistic content (focus on acoustics only)

high recognition results


Emotional Speech Corpora

(cont.)

Popular corpora

Emotional Prosody Speech and Transcript corpus (LDC): 15 classes

Berlin Emotional Speech Database (EmoDB): 7 classes

89.9 % accuracy (speaker independent LOSO evaluation, speaker adaptation, feature selection) [2]

Danish Emotional Speech Corpus: 5 classes

74.5 % accuracy (10-fold SCV, feature selection) [3]

[2] B. Vlasenko et al.: Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech, INTERSPEECH 2007

[3] Schuller et al.: Emotion Recognition in the Noise Applying Large Acoustic Feature Sets, Speech Prosody 2006


Emotional Speech Corpora

(cont.)

Naturally occurring emotions

states that actually appear in HCI (real applications)

difficult to create (appropriate scenario needed, ethical concerns, need to label data)

low emotional intensity

in general ≥ 80% neutral

low audio quality (reverberation, noise, far-distance microphones)

needed for machine classification (because conditions between training and test must not differ too much)

research on both acoustic and linguistic features possible

new research questions: optimal emotion unit

almost no corpora large enough for machine classification available (do not exist or are not available for research)


Overview

1 Different Perspectives on Emotion Recognition

2 FAU Aibo Emotion Corpus

Scenario

Labeling of User States

Data-driven Dimensions of Emotion

Units of Analysis

Sparse Data Problem

3 Own Results on Emotion Classification


The FAU Aibo Emotion Corpus

51 children (30 f, 21 m) at the age of 10 to 13

8.9 hours of spontaneous speech (mainly short commands)

48,401 words in 13,642 audio files


FAU Aibo Emotion Corpus

(cont.)

database for CEICES and the INTERSPEECH 2009 Emotion Challenge

available for scientific, non-commercial use

http://www5.cs.fau.de/FAUAiboEmotionCorpus

[4] S. Steidl: Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech, Logos Verlag, Berlin

available online:


Emotion-Related User States

11 categories, chosen by inspecting the data prior to labeling:

joyful, surprised, motherese, neutral, bored, emphatic, helpless, touchy/irritated, reprimanding, angry, other

motherese: the way mothers/parents address their babies, either because Aibo is well-behaving or because the child wants Aibo to obey; positive equivalent to reprimanding

emphatic: pronounced, accentuated, sometimes hyper-articulated way of speaking, but without showing any emotion

reprimanding: the child is reproachful, reprimanding, ‘wags the finger’


Labeling of User States

Labeling:

5 students of linguistics

holistic labeling on the word level

majority vote

| emotion category | words | proportion |
|---|---|---|
| angry (A) | 134 | 0.3 % |
| touchy (T) | 419 | 0.9 % |
| reprimanding (R) | 463 | 1.0 % |
| emphatic (E) | 2,807 | 5.8 % |
| neutral (N) | 39,975 | 82.6 % |
| motherese (M) | 1,311 | 2.7 % |
| joyful (J) | 109 | 0.2 % |
| ... | | |
| all | 48,401 | 100.0 % |
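A minimal sketch of word-level majority voting over five labelers. The slides do not say how words without a clear majority are handled, so returning `None` when no label is chosen by more than half of the labelers is an assumption, and the labeler decisions shown are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label chosen by more than half of the labelers, else None."""
    label, count = Counter(labels).most_common(1)[0]
    # Assumption: an absolute majority is required (e.g. 3 of 5 labelers).
    return label if count > len(labels) / 2 else None

# Hypothetical decisions of five labelers for two words.
clear_case = majority_vote(["N", "N", "E", "N", "A"])    # three votes for N
unclear_case = majority_vote(["A", "T", "R", "E", "N"])  # no majority
print(clear_case, unclear_case)
```

Words for which the function returns `None` would need a separate rule, for example exclusion from the labeled subset.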


Labeling of User States

(cont.)

Confusion matrix

| majority vote | A | T | R | E | N | M | J |
|---|---|---|---|---|---|---|---|
| angry (A) | 43.3 | 13.0 | 12.9 | 12.1 | 18.1 | 0.1 | 0.0 |
| touchy (T) | 4.5 | 42.9 | 11.7 | 13.7 | 23.5 | 1.0 | 0.1 |
| reprimanding (R) | 3.8 | 15.7 | 45.8 | 14.0 | 18.2 | 1.3 | 0.1 |
| emphatic (E) | 1.3 | 5.8 | 6.7 | 53.6 | 29.9 | 1.2 | 0.5 |
| neutral (N) | 0.4 | 2.2 | 1.5 | 13.9 | 77.8 | 2.7 | 0.5 |
| motherese (M) | 0.0 | 0.8 | 1.4 | 4.9 | 30.4 | 61.1 | 0.9 |
| joyful (J) | 0.1 | 0.6 | 1.1 | 7.3 | 32.4 | 2.0 | 54.2 |


Data-driven Dimensions of Emotions

Non-metric dimensional scaling:

arranging the emotion categories in the 2-dimensional space

states that are often confused are close to each other

[Figure: 2-dimensional arrangement of angry, touchy, reprimanding, emphatic, neutral, motherese, and joyful; axes: valence (negative to positive) and interaction (−interaction to +interaction)]


Units of Analysis

Units of analysis

[Figure: two example turns, Ohm_18_342 and Ohm_18_343, with the words “Aibo g’radeaus fein machst du das” and “stopp sitz stopp”, segmented at word level, chunk level, and turn level]

Advantages/disadvantages of larger units

+ more information

− less emotional homogeneity


Sparse Data Problem

Super classes:

Anger: angry, touchy/irritated, reprimanding

Emphatic

Neutral

Motherese

[Figure: two non-metric scaling plots, one of the seven categories angry, touchy, reprimanding, emphatic, neutral, motherese, joyful (S = 0.32, RSQ = 0.73) and one of the four super classes Anger, Emphatic, Neutral, Motherese (S = 0.19, RSQ = 0.90)]


Sparse Data Problem

(cont.)

Data subsets

| data set | number of words | taken from # chunks | taken from # turns |
|---|---|---|---|
| Aibo corpus | 48,401 | 18,216 | 13,642 |
| Aibo word set | 6,070 | 4,543 | 3,996 |
| Aibo chunk set | 13,217 | 4,543 | 3,996 |
| Aibo turn set | 17,618 | 6,413 | 3,996 |


Overview

1 Different Perspectives on Emotion Recognition

2 FAU Aibo Emotion Corpus

3 Own Results on Emotion Classification

Results for different Units of Analysis

Machine vs. Human

Feature Types and their Relevance


Most Appropriate Unit of Analysis

Classification

complete set of features

classification with Linear Discriminant Analysis (LDA)

51-fold speaker-independent cross-validation

| unit of analysis | number of features | number of samples | average recall |
|---|---|---|---|
| word level | 265 | 6,070 words | 67.2 % |
| chunk level | 700 | 4,543 chunks | 68.9 % |
| turn level | 700 | 3,996 turns | 63.2 % |

Chunks: best compromise between the length of the segment and the homogeneity of the emotional state within the segment
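With 51 children in the corpus, a 51-fold speaker-independent cross-validation suggests one fold per speaker. The sketch below illustrates such a leave-one-speaker-out split under that assumption; the function name and the speaker IDs are hypothetical and not taken from the slides.

```python
def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speaker_ids)):
        # All samples of the held-out speaker form the test set, so no
        # speaker ever appears in both training and test data.
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical speaker IDs for six chunks from three children.
speakers = ["c01", "c01", "c02", "c03", "c03", "c03"]
folds = list(leave_one_speaker_out(speakers))
for held_out, train, test in folds:
    print(held_out, train, test)
```

The number of folds equals the number of distinct speakers, so a corpus with 51 speakers yields 51 folds.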


Machine Classifier vs. Human Labeler

Entropy based measure:

[Figure: worked example over the classes A, E, N, M that combines the labelers' decisions with the decoder's decision; distributions shown include (0.0, 0.25, 0.75, 0.0) and (0.0, 0.5, 0.375, 0.125), with H_dec = 1.41]

implicit weighting of classification ‘errors’ depending on the word that is classified
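The value H_dec = 1.41 in the example is consistent with the Shannon entropy, in bits, of the distribution (0.0, 0.5, 0.375, 0.125) over A, E, N, M. The sketch below only checks that arithmetic; the base-2 logarithm is inferred from the numbers, and how labeler and decoder decisions are merged into this distribution is not reproduced here.

```python
import math

def entropy_bits(distribution):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

# Distribution over (A, E, N, M) taken from the slide's example.
h_dec = entropy_bits([0.0, 0.5, 0.375, 0.125])
print(round(h_dec, 2))  # 1.41
```

A distribution concentrated on one class has entropy 0, so low values indicate agreement and high values indicate disagreement.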


Machine Classifier vs. Human Labeler

(cont.)

Classification: Aibo word set

[Figure: relative frequency [%] over entropy for the average human labeler and the machine classifier]

[5] S. Steidl, M. Levit, A. Batliner, E. Nöth, H. Niemann: “Of All Things the Measure is Man” – Classification of Emotions and Inter-Labeler Consistency, ICASSP 2005


Evaluation of Different Types of Features

Types of features

acoustic features: prosodic features, spectral features, voice quality features

linguistic features

Evaluation

Artificial Neural Networks (ANN)

51-fold speaker-independent cross-validation

combination by early or late fusion
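A minimal sketch of the difference between early and late fusion. Early fusion is shown as concatenation of the feature vectors before a single classifier. Late fusion is shown as averaging the class posteriors of separate classifiers, which is only one common combination rule; the slides do not state which rule was used, and all numbers are hypothetical.

```python
def early_fusion(acoustic, linguistic):
    """Concatenate both feature vectors so that one classifier sees all features."""
    return list(acoustic) + list(linguistic)

def late_fusion(posteriors_per_classifier):
    """Average class posteriors of separate classifiers; return (best class, mean)."""
    n = len(posteriors_per_classifier)
    n_classes = len(posteriors_per_classifier[0])
    mean = [sum(p[c] for p in posteriors_per_classifier) / n
            for c in range(n_classes)]
    best = max(range(n_classes), key=mean.__getitem__)
    return best, mean

# Hypothetical feature vectors and posteriors over four classes.
fused_features = early_fusion([0.1, 0.7], [1.0, 0.0, 2.0])
acoustic_posteriors = [0.6, 0.2, 0.1, 0.1]
linguistic_posteriors = [0.2, 0.5, 0.2, 0.1]
best_class, mean_posteriors = late_fusion([acoustic_posteriors, linguistic_posteriors])
print(fused_features, best_class)
```

Early fusion lets the classifier model interactions between feature types, while late fusion keeps the classifiers for each feature type independent.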


Acoustic Features: Prosody

Prosody

suprasegmental characteristics such as

pitch contour

energy contour

temporal shortening/lengthening of words

duration of pauses between words


Acoustic Features: Prosody

(cont.)

Classification results: Aibo chunk set

[Figure: bar chart of average recall [%]: F0 (29) 50.6, duration (37) 54.4, energy (25) 58.5, all 59.0, pauses (16) 42.0]


Acoustic Features: Spectral Characteristics

(cont.)

Classification results: Aibo chunk set

[Figure: bar chart of average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2; bars for jitter/shimmer (4), HNR (2), TEO (64), and best combination not yet shown]


Acoustic Features: Voice Quality

Classification results: Aibo chunk set

[Figure: bar chart of average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2, jitter/shimmer (4) 47.0, HNR (2) 32.5, TEO (64) 52.3; bar for best combination not yet shown]


Acoustic Features: Combination

Classification results: Aibo chunk set

[Figure: bar chart of average recall [%]: prosody (107) 59.0, MFCC (24) 58.9, formants (16) 48.2, jitter/shimmer (4) 47.0, HNR (2) 32.5, TEO (64) 52.3, best combination 65.4]


Linguistic Features

Types of linguistic features

word characteristics: average word length (number of letters, phonemes, syllables), proportion of word fragments, average number of repetitions

part-of-speech features

unigram models

bag-of-words


Linguistic Features

(cont.)

Part-of-Speech (POS) Features

only 6 coarse POS categories

can be annotated without considering context

[Figure: distribution of the POS categories (% of total) for Anger, Emphatic, Neutral, Motherese, Joyful, and Other; the legend lists nouns, proper names, inflected adjectives, particles, interjections, articles, pronouns, auxiliaries, present/past participles, not inflected adjectives, (other) verbs, infinitives]


Linguistic Features

(cont.)

Unigram Models

$$u(w, e) = \log_{10} \frac{P(e \mid w)}{P(e)}$$

Anger, P(A|w): böser (bad) 29.2 %, stehenbleiben (stop) 18.9 %, nein (no) 17.0 %, aufstehen (get up) 12.3 %, Aibo (Aibo) 10.1 %

Emphatic, P(E|w): stopp (stop) 30.5 %, halt (halt) 29.3 %, links (left) 20.5 %, rechts (right) 18.9 %, nein (no) 17.6 %

Neutral, P(N|w): okay (okay) 98.6 %, und (and) 98.5 %, Stück (bit) 98.5 %, in (in) 98.2 %, noch (still) 96.2 %

Motherese, P(M|w): fein (fine) 57.5 %, ganz (very) 41.9 %, braver (good) 36.0 %, sehr (very) 23.5 %, brav (good) 21.7 %
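A minimal sketch of how the unigram score u(w, e) = log10(P(e|w) / P(e)) could be estimated from labeled words by relative frequencies. The estimation by simple counting and the toy data are assumptions for illustration; the counts are not taken from the corpus.

```python
import math
from collections import Counter

def unigram_scores(labeled_words):
    """Compute u(w, e) = log10(P(e|w) / P(e)) from (word, emotion) pairs."""
    pair_counts = Counter(labeled_words)
    word_counts = Counter(w for w, _ in labeled_words)
    emotion_counts = Counter(e for _, e in labeled_words)
    total = len(labeled_words)
    scores = {}
    # Only observed (word, emotion) pairs get a score; log10(0) is undefined.
    for (word, emotion), n in pair_counts.items():
        p_e_given_w = n / word_counts[word]
        p_e = emotion_counts[emotion] / total
        scores[(word, emotion)] = math.log10(p_e_given_w / p_e)
    return scores

# Hypothetical toy data: (word, emotion label) pairs.
data = [("stopp", "E"), ("stopp", "E"), ("stopp", "N"),
        ("okay", "N"), ("okay", "N"), ("fein", "M")]
scores = unigram_scores(data)
print(round(scores[("stopp", "E")], 3))
```

A positive score means the word makes the emotion more likely than its prior, a negative score means less likely.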


Linguistic Features

(cont.)

Bag-of-Words

utterance: Aibo, geh nach links! (Aibo, move to the left!)

[Figure: the words Aibo, geh, nach, links mapped to a vector indexed by the vocabulary (Aibolein, allen, ...)]

representation of the linguistic content

word order is lost

various dimensionality reduction techniques
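A minimal sketch of a bag-of-words representation for the example utterance. The mini vocabulary, the lower-casing, and the regular-expression tokenizer are assumptions for illustration; the dimensionality reduction step is not shown.

```python
import re
from collections import Counter

def bag_of_words(utterance, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    tokens = re.findall(r"\w+", utterance.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical mini vocabulary; the real vocabulary is much larger.
vocabulary = ["aibo", "aibolein", "geh", "links", "nach", "stopp"]
vector = bag_of_words("Aibo, geh nach links!", vocabulary)
print(vector)  # [1, 0, 1, 1, 1, 0]
```

Reordering the words of the utterance gives the same vector, which is what the loss of word order means in practice.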


Linguistic Features

(cont.)

Classification results: Aibo chunk set

[Figure: bar chart of average recall [%] for POS (6), unigram models (16), word statistics (6), BOW (254 → 50), and their best combination; bar values shown: 54.3, 56.1, 61.9, 61.9, 62.2]


Combination of Acoustic and Linguistic Features

Classification results: Aibo chunk set

[Figure: bar chart of average recall [%]: acoustic features, best combination (late fusion, ANN) 65.4; linguistic features, best combination (late fusion, ANN) 62.2; combination (late fusion, ANN) 67.1; combination (early fusion, LDA) 68.9]


Similar Results within CEICES

CEICES: Combining Efforts for Improving Automatic Classification of Emotional User States

collaboration of various research groups within the European Network of Excellence HUMAINE (2004-2007)

state-of-the-art feature set with ≥ 4,000 features

SVM (linear kernel), 3-fold speaker-independent cross-validation

selection of 150 features (SFFS): surviving feature types?

only chunk based features, no information outside Aibo chunk set

[6] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, N. Amir: Whodunnit – Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech,


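Similar Results within CEICES

(cont.)

150 features selected from all 4,244 features:

| feature type | # total | SFFS # | F-measure | share | portion |
|---|---|---|---|---|---|
| duration | 391 | 10 | 49.6 | 6.7 | 2.6 |
| energy | 265 | 32 | 56.3 | 21.3 | 12.1 |
| F0 | 333 | 16 | 46.8 | 10.7 | 4.8 |
| spectrum | 656 | 15 | 46.2 | 10.0 | 2.3 |
| cepstrum | 1699 | 16 | 46.4 | 10.7 | 1.0 |
| voice quality | 153 | 7 | 38.7 | 4.7 | 4.6 |
| wavelets | 216 | 5 | 35.3 | 3.4 | 2.3 |
| all acoustic | 3713 | 101 | – | 67.3 | 2.7 |
| BOW | 476 | 25 | 37.4 | 16.7 | 5.3 |
| POS | 31 | 7 | 48.1 | 4.7 | 22.6 |
| higher semantics | 12 | 17 | 56.0 | 11.3 | 141.7 |
| varia | 12 | 0 | – | 0.0 | 0.0 |
| all linguistic | 531 | 49 | – | 32.7 | 9.6 |
| all | 4244 | 150 | 65.5 | 100.0 | 3.5 |

150 features selected separately from the acoustic and from the linguistic features:

| feature type | SFFS # | F-measure | share | portion |
|---|---|---|---|---|
| duration | 28 | 54.9 | 18.7 | 7.2 |
| energy | 33 | 56.9 | 22.0 | 12.5 |
| F0 | 23 | 46.7 | 15.3 | 6.9 |
| spectrum | 17 | 49.9 | 11.3 | 2.6 |
| cepstrum | 23 | 50.4 | 15.3 | 1.4 |
| voice quality | 11 | 41.5 | 7.3 | 7.2 |
| wavelets | 15 | 44.9 | 10.0 | 6.9 |
| all acoustic | 150 | 63.4 | 100.0 | 4.0 |
| BOW | 94 | 53.2 | 62.7 | 19.7 |
| POS | 27 | 54.9 | 18.0 | 87.1 |
| higher semantics | 27 | 57.9 | 18.0 | 225.0 |
| varia | 2 | – | 0.1 | 16.7 |
| all linguistic | 150 | 62.6 | 100.0 | 28.2 |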
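The SHARE and PORTION rows are not defined on the slide. The numbers are consistent with SHARE being the percentage of the 150 selected features that belong to a feature type, and PORTION being the percentage of a feature type's features that were selected. The sketch below checks this reading, which is an inference from the reported values rather than a stated definition.

```python
def share_and_portion(n_selected, n_total_in_type, n_selected_overall=150):
    """Return (share, portion) in percent, rounded to one decimal place."""
    share = 100.0 * n_selected / n_selected_overall   # part of the selected set
    portion = 100.0 * n_selected / n_total_in_type    # part of the feature type
    return round(share, 1), round(portion, 1)

# Counts from the slide: 10 of 391 duration and 32 of 265 energy features selected.
duration = share_and_portion(10, 391)
energy = share_and_portion(32, 265)
print(duration, energy)  # (6.7, 2.6) (21.3, 12.1)
```

Under this reading, a portion above 100 % (as for higher semantics) means that more features of a type were selected than the listed total of that type.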


Overview

1 Different Perspectives on Emotion Recognition

2 FAU Aibo Emotion Corpus

3 Own Results on Emotion Classification


INTERSPEECH 2009 Emotion Challenge

New goals:

challenge with standardized test conditions

open microphone: using the complete corpus

highly unbalanced classes

including all observed emotional categories

including chunks with low inter-labeler agreement


INTERSPEECH 2009 Emotion Challenge

(cont.)

Speaker independent training and test sets

2-class problem: NEGative vs. IDLe

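| # | NEG | IDL | Σ |
|---|---|---|---|
| train | 3,358 | 6,601 | 9,959 |
| test | 2,465 | 5,792 | 8,257 |
| Σ | 5,823 | 12,393 | 18,216 |

5-class problem: Anger, Emphatic, Neutral, Positive, Rest

| # | A | E | N | P | R | Σ |
|---|---|---|---|---|---|---|
| train | 881 | 2,093 | 5,590 | 674 | 721 | 9,959 |
| test | 611 | 1,508 | 5,377 | 215 | 546 | 8,257 |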
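With classes this unbalanced, weighted and unweighted average recall can differ considerably. The sketch below uses the standard definitions (unweighted: mean of the per-class recalls; weighted: overall fraction of correct decisions) and the 2-class test-set counts given above. The always-IDL classifier is a hypothetical illustration, not a challenge result.

```python
def recalls(confusion):
    """confusion[i][j]: number of class-i samples classified as class j.

    Returns (unweighted average recall, weighted average recall).
    """
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    unweighted = sum(per_class) / len(per_class)
    correct = sum(row[i] for i, row in enumerate(confusion))
    weighted = correct / sum(sum(row) for row in confusion)
    return unweighted, weighted

# Hypothetical classifier that labels every test chunk as IDL
# (test set: 2,465 NEG and 5,792 IDL chunks).
always_idl = [[0, 2465],
              [0, 5792]]
uar, war = recalls(always_idl)
print(round(100 * uar, 1), round(100 * war, 1))  # 50.0 70.1
```

The weighted recall rewards always choosing the frequent class, while the unweighted recall stays at chance level, which is why the unweighted measure is the more informative one here.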


INTERSPEECH 2009 Emotion Challenge

(cont.)

Sub-Challenges

1 Feature Sub-Challenge

optimisation of feature extraction/selection; classifier settings fixed

2 Classifier Sub-Challenge

optimisation of classification techniques; feature set given

3 Open Performance Sub-Challenge

optimisation of feature extraction/selection and classification techniques


INTERSPEECH 2009 Emotion Challenge

(cont.)

Participants

| Open Performance, 2 classes | Open Performance, 5 classes | Classifier, 2 classes | Classifier, 5 classes | Feature, 2 classes | Feature, 5 classes | number of participants |
|---|---|---|---|---|---|---|
| ✓ | ✓ | – | – | – | – | 7 |
| ✓ | – | – | – | – | – | 2 |
| – | – | ✓ | ✓ | – | – | 2 |
| – | – | – | ✓ | – | – | 1 |
| – | – | – | ✓ | ✓ | ✓ | 1 |
| – | – | – | – | ✓ | ✓ | 1 |

[7] B. Schuller, A. Batliner, S. Steidl, D. Seppi: Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge, Speech Communication, Special Issue


INTERSPEECH 2009 Emotion Challenge

(cont.)

2-class problem: NEGative vs. IDLe

[Figure: unweighted and weighted average recall [%] for Barra-Chicote et al., Polzehl et al., Vogt et al., Bozkurt et al., Luengo et al., Kockmann et al., Vlasenko et al., Dumouchel et al., majority voting, and the baseline; values shown: 71.2, 70.3, 69.2, 68.3, 67.9, 67.6, 67.2, 67.1, 67.7, 66.4]


INTERSPEECH 2009 Emotion Challenge

(cont.)

5-class problem: Anger, Emphatic, Neutral, Positive, Rest

[Figure: unweighted and weighted average recall [%] for Dumouchel et al., Planet et al., Luengo et al., Vlasenko et al., Lee et al., Kockmann et al., majority voting, Barra-Chicote et al., Vogt et al., the baseline, and Bozkurt et al.; values shown: 38.2, 39.4, 39.4, 41.2, 41.4, 41.4, 41.6, 41.6, 41.7, 44.0]


State-of-the-Art: Summary

Berlin Emotional Speech Database

7-class problem: hot anger, disgust, fear/panic, happiness, sadness/sorrow, boredom, neutral

balanced classes

→ 90 % accuracy

FAU Aibo Emotion Corpus

4-class problem: Anger, Emphatic, Neutral, Motherese

subset with roughly balanced classes (Aibo chunk set)

→ 69 % unweighted average recall

5-class problem: Anger, Emphatic, Neutral, Positive, Rest

highly unbalanced classes, complete corpus

→ 44 % unweighted average recall

2-class problem: NEGative vs. IDLe

highly unbalanced classes, complete corpus

→ 71 % unweighted average recall
