
We used the locations of the marker points in 3D as the basis of our analysis, and randomly chose 4,000 frames of each of six emotions (neutral, happiness, excitement, anger, frustration, and sadness) to form a training set of 24,000 frames, creating a separate set for each of the two actors. It should be noted that the marker points were

4. Datasets for Emotion Recognition and Analysis

Figure 4.5: The distribution of data for each emotion category. Neu: Neutral, Dis: Disgust, Hap: Happiness, Sur: Surprise, Sad: Sadness, Ang: Anger, Fea: Fear, Fru: Frustration, Exc: Excitement, Oth: Other [25].

already aligned so that the nose marker lies at the centre of each frame, which removes any translation effects. Rotational effects were compensated by multiplying each frame by a rotation matrix. For details of the marker alignment, refer to [25].
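The two alignment steps can be illustrated with a minimal sketch. It is not the procedure of [25]: the nose marker index, the choice of the vertical axis, and the 30-degree angle are all assumptions made for illustration, and the sketch takes the rotation angle as given rather than estimating it. Plain Python is used to keep the example dependency-free.

```python
import math

def center_on_nose(frame, nose_index):
    """Translate all markers so that the nose marker sits at the origin."""
    nx, ny, nz = frame[nose_index]
    return [(x - nx, y - ny, z - nz) for x, y, z in frame]

def rotation_about_y(angle_deg):
    """3x3 rotation matrix for a rotation about the y axis (assumed vertical)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def rotate(frame, R):
    """Multiply every marker (as a column vector) by the rotation matrix R."""
    return [tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))
            for p in frame]

# Toy frame with three markers; index 0 plays the role of the nose (assumption).
frame = [(1.0, 2.0, 3.0), (2.0, 2.0, 3.0), (1.0, 3.0, 4.0)]
centered = center_on_nose(frame, nose_index=0)

# Compensate a hypothetical 30-degree head turn about the vertical axis.
aligned = rotate(centered, rotation_about_y(-30.0))
```

Because the rotation is applied after centring, the nose marker stays at the origin and the distances between markers are unchanged.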

Each utterance was labelled by three expert human evaluators in terms of discrete categories as well as emotion attributes (valence and activation). For the training set, we took frames only from utterances where all three experts agreed. We used six emotions rather than the full nine because there was insufficient data for the remaining three (disgust, surprise, and fear), sometimes as little as 2,000 frames in total (as shown in Fig. 4.5). Of the six selected emotions, two (frustration and excitement) are candidate basic emotions [68,163]. We also selected seven continuous conversations comprising almost 152,000 frames in total to form a testing set. For the testing set, frames were chosen without requiring agreement among all three experts.
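The training-set selection described above can be sketched as follows. This is an illustrative sketch only: the data layout (a frame identifier paired with the three evaluators' labels for its utterance), the function name `build_training_set`, and the toy data are assumptions, and the toy example draws 4 frames per class rather than 4,000.

```python
import random

EMOTIONS = ["neutral", "happiness", "excitement",
            "anger", "frustration", "sadness"]

def build_training_set(frames, per_class, seed=0):
    """frames: list of (frame_id, [label_1, label_2, label_3]).

    Keeps frames whose three labels are identical and belong to one of the
    six selected emotions, then samples per_class frames from each emotion.
    """
    rng = random.Random(seed)
    pool = {emotion: [] for emotion in EMOTIONS}
    for frame_id, labels in frames:
        if len(set(labels)) == 1 and labels[0] in pool:
            pool[labels[0]].append(frame_id)
    return {emotion: rng.sample(ids, per_class) for emotion, ids in pool.items()}

# Toy data: 10 unanimously labelled frames per selected emotion.
frames = []
next_id = 0
for emotion in EMOTIONS:
    for _ in range(10):
        frames.append((next_id, [emotion] * 3))
        next_id += 1

# One frame with disagreement and one frame of an excluded emotion.
disputed_id = next_id
frames.append((disputed_id, ["neutral", "neutral", "sadness"]))
excluded_id = next_id + 1
frames.append((excluded_id, ["fear"] * 3))

training = build_training_set(frames, per_class=4)
total = sum(len(ids) for ids in training.values())  # 6 classes x 4 frames
```

With `per_class=4000` the same arithmetic gives the 6 × 4,000 = 24,000 training frames per actor.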


Each frame of the dataset contains the motion capture information of 61 markers in 3 dimensions, so the training data had size 24,000 × 183 (61 × 3 = 183 values per frame). We reduced the dimensionality of the data for each frame in three ways:

1. Markers not on the face (such as the head and hands) were excluded.

2. Markers that did not move significantly (such as eyelids and nose) were removed.

3. Sets of markers that moved together (such as points on the chin and forehead) were replaced by a single point at the centre of the set.

Figure 4.6: The 28 marker points in 3D used for emotion recognition and analysis, shown at rotations of (a) 0°, (b) 30°, (c) 60°, and (d) 90°.


As a result of these simplifications, each emotion frame is represented by 28 marker points covering the forehead, eyebrows, eyes, cheeks, lips, and chin (Fig. 4.6). Each marker point is located in 3D, so each frame is an 84-dimensional vector (28 × 3 = 84).
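The three reduction steps can be sketched on a toy frame. The marker names, the sets of dropped and grouped markers, and the helper names `reduce_markers` and `to_vector` are hypothetical; the actual marker selection is the one shown in Fig. 4.6.

```python
def reduce_markers(frame, drop, groups):
    """frame:  dict mapping marker name -> (x, y, z)
    drop:   names of markers to exclude (non-face or nearly static markers)
    groups: dict mapping a new name -> names of markers that move together
    """
    grouped = {name for names in groups.values() for name in names}
    reduced = {name: point for name, point in frame.items()
               if name not in drop and name not in grouped}
    for new_name, names in groups.items():
        points = [frame[name] for name in names]
        # Replace the set by a single point at its centre (the mean position).
        reduced[new_name] = tuple(sum(axis) / len(points) for axis in zip(*points))
    return reduced

def to_vector(reduced):
    """Flatten the markers (sorted by name for a stable order) into one vector."""
    return [value for name in sorted(reduced) for value in reduced[name]]

# Toy frame with hypothetical marker names.
frame = {
    "hand":   (9.0, 9.0, 9.0),   # not on the face
    "nose":   (0.0, 0.0, 0.0),   # does not move significantly
    "chin_l": (0.0, 0.0, 0.0),   # moves together with chin_r
    "chin_r": (2.0, 0.0, 0.0),
    "brow":   (1.0, 5.0, 1.0),
    "lip":    (1.0, 1.0, 1.0),
}
reduced = reduce_markers(frame,
                         drop={"hand", "nose"},
                         groups={"chin": ["chin_l", "chin_r"]})
vector = to_vector(reduced)  # 3 remaining markers x 3 coordinates
```

Applied to the real data, the same flattening turns 61 markers into 183 values per frame before reduction and 28 markers into 84 values afterwards.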

Of the three attribute-based labels (valence, activation, and dominance), we used only the valence and activation values, for two reasons. First, we chose to use the activation-evaluation space, which defines emotions in terms of these two dimensions. Second, there is considerable disagreement in the psychological literature about how to define the third dimension (dominance) for emotion representation.

Due to the confusion between neutral and sadness, the neutral state was mostly assigned an attribute value of 3 for valence and 1 (very passive) for activation. To correct this, based on the majority voting of categorical labels, we changed the activation value back to 3 in the case of the neutral state. This correction is also supported by the model of activation-evaluation space, where neutral lies at the centre. Following the assumption that the neutral state is a ‘no emotion’ state, its position in the space does not affect the positions of the other five emotions.
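A minimal sketch of this correction is given below. The function names and the handling of ties (no correction is applied when no label has a strict majority) are assumptions made for illustration and are not specified in the text.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by most evaluators, or None if there is a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # assumption: ties are left uncorrected
    return counts[0][0]

def correct_neutral(labels, valence, activation, centre=3):
    """Move activation to the centre of the scale when the majority
    categorical label is neutral; valence is left untouched."""
    if majority_label(labels) == "neutral":
        activation = centre
    return valence, activation

# Neutral by majority vote, but annotated as very passive (activation 1).
corrected = correct_neutral(["neutral", "neutral", "sadness"], 3, 1)

# Sadness by majority vote: the attribute values are kept as annotated.
unchanged = correct_neutral(["sadness", "sadness", "neutral"], 2, 1)
```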

The segmentation of a conversation into utterances is useful for discrete emotion recognition; however, it is a problem for continuous emotion recognition and analysis. During a conversation, there are several places where both actors are silent, which interrupts the continuous information. For this reason, we asked Dr. Carlos Busso (Assistant Professor in the Electrical Engineering Department of The University of Texas at Dallas (UTD) and one of the authors of the IEMOCAP dataset) to provide us with the unsegmented continuous data for uninterrupted analysis of emotions within full conversations. The continuous conversations include the frames where both actors were silent as well as the overlapping frames where both were speaking at the same time. The overlap usually appears at the end of the first actor’s utterance, where the second actor starts talking. This type of overlap is


Figure 4.7: An example of data layout in a continuous conversation between two actors (male and female). The term ‘overlap’ refers to the situation where both actors are talking at the same time.

quite common in natural communication. An example of the structure of continuous data in a conversation between two actors (one male and one female) is shown in Fig. 4.7.
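The layout of a continuous conversation can be illustrated by labelling each frame according to who is speaking. The representation of utterances as (start, end) frame ranges with an exclusive end, the function name `frame_states`, and the toy ranges are assumptions made for this sketch; they do not reflect the actual file format of the dataset.

```python
def frame_states(n_frames, male_utts, female_utts):
    """Label each frame as 'silence', 'male', 'female', or 'overlap'.

    Utterances are (start, end) frame ranges, with end exclusive.
    """
    male = [False] * n_frames
    female = [False] * n_frames
    for start, end in male_utts:
        for i in range(start, end):
            male[i] = True
    for start, end in female_utts:
        for i in range(start, end):
            female[i] = True

    states = []
    for m, f in zip(male, female):
        if m and f:
            states.append("overlap")
        elif m:
            states.append("male")
        elif f:
            states.append("female")
        else:
            states.append("silence")
    return states

# Toy conversation of 10 frames: the female actor starts talking one frame
# before the male actor finishes, and both are silent at the end.
states = frame_states(10, male_utts=[(0, 4)], female_utts=[(3, 7)])
```

In this toy example, frame 3 is an overlap frame and frames 7 to 9 are silent, the two cases that utterance-level segmentation would discard.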

4.8 Summary

This chapter has presented the basic criteria for selecting an appropriate dataset for the task of emotion recognition and analysis. The most commonly used video and audiovisual datasets were reviewed with respect to these criteria, along with existing tools for annotating emotion data. Based on a critical review of the five datasets fulfilling most of the given criteria, we selected the IEMOCAP dataset for emotion modelling and analysis. The marker layout, segmentation process, and annotation details of the selected dataset were also described. Finally, the necessary data preprocessing steps were discussed.

Chapter 5

Basic Emotion Recognition

5.1 Introduction

Following the psychological assumption that whenever we feel an emotion it appears on our face, we started with the classification of facial changes caused by expressions into the basic emotion categories. In this chapter, we show that building statistical shape models of different parts of the face and combining them can give more successful results than using only a model of the whole face. Statistical shape modelling based on Principal Component Analysis is described first, followed by the classification technique. A detailed analysis of the shape models is then presented, together with experiments showing the effectiveness of the proposed method for recognising discrete basic emotions.