4.3.4 Common Fusion Techniques
Multimodal approaches are an emerging topic in emotion recognition, especially for naturalistic interactions. They mimic the human way of understanding emotions by inferring information from several modalities simultaneously. In general, two types of fusion are distinguished (cf. [Wagner et al. 2011]): feature level fusion (cf. Figure 4.12(a)) and decision level fusion (cf. Figure 4.12(b)).
(a) Sketch of a feature level fusion architecture. Features of each modality are concatenated. The final decision is generated by a classifier on the concatenated features.
(b) Sketch of a decision level fusion architecture. The features of each modality are classified separately. The final decision is generated by any kind of combination rule.
Figure 4.12: Overview of feature and decision level fusion architectures.
In the first case, the features of the different modalities are concatenated directly into a single high-dimensional feature set (cf. [Busso et al. 2004]). The assumption is that the resulting feature set contains more information than any single modality and thus achieves a higher classification performance. One constraint is that the features of all involved modalities have to be extracted on the same time scale. It therefore has to be ensured that the emotional characteristics present in, for instance, the acoustics match the facial expressions shown at the same time. In other words, the multimodal response patterns involved have to be present simultaneously at the time of the investigation.
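The concatenation step and its time-scale constraint can be illustrated with a minimal sketch. The function name `fuse_features`, the modality names, and the toy feature values are assumptions made for illustration only; they are not taken from the cited works.

```python
from typing import Dict, List, Sequence


def fuse_features(per_modality: Dict[str, List[Sequence[float]]],
                  order: Sequence[str]) -> List[List[float]]:
    """Concatenate time-aligned feature vectors of several modalities.

    per_modality[name][t] is the feature vector of modality `name` for
    analysis window t. Every modality must supply the same number of
    windows, which is the "same time scale" constraint in its simplest form.
    """
    lengths = {len(per_modality[name]) for name in order}
    if len(lengths) != 1:
        raise ValueError("modalities are not extracted on the same time scale")
    n_windows = lengths.pop()

    fused = []
    for t in range(n_windows):
        vector: List[float] = []
        for name in order:
            # Append this modality's features for window t.
            vector.extend(per_modality[name][t])
        fused.append(vector)
    return fused


# Toy data (hypothetical): 3 windows, 2 acoustic and 3 facial features each.
acoustic = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
facial = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5], [1.6, 1.7, 1.8]]

fused = fuse_features({"acoustic": acoustic, "facial": facial},
                      order=["acoustic", "facial"])
```

Each fused vector would then be passed to a single classifier. If one modality delivers fewer windows than the others, the sketch refuses to fuse, which mirrors why differing time scales are problematic at feature level.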
In decision level fusion, the opposite approach is used. A separate classifier with its own specific feature set is applied to each modality. The final decision is obtained afterwards by combining the individual results using rules such as Bayes’ Rule or Dempster’s Rule of Combination (cf. [Paleari et al. 2010]). Decision level fusion has several benefits over feature level fusion. Different time scales of the single modalities can be handled within the individual classifiers. Besides the training efficiency gained by using several small feature vectors instead of one high-dimensional one, the robustness against fragmentary real-time data increases. In particular, when different classifiers are used for different modalities, the malfunction of one sensor device only results in a malfunction of the corresponding classifier and influences the final decision just marginally [Wagner et al. 2011]. Additionally, combinations of both approaches, called “hybrid fusion”, are investigated (cf. [Kim 2007; Hussain et al. 2011]). In this case, both feature level fusion and decision level fusion are pursued, and the final decision is obtained by combining all single decisions at a third fusion level.
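As an illustration of a combination rule, the following sketch fuses per-modality class posteriors with a product rule. This is a naive Bayes-style simplification that assumes conditionally independent modalities and a uniform class prior; it is not the specific rule of any cited work, and the function name `fuse_decisions` and the example posteriors are hypothetical. A modality whose sensor failed is passed as `None` and simply skipped.

```python
from typing import Dict, List, Optional


def fuse_decisions(posteriors: List[Optional[Dict[str, float]]]) -> Dict[str, float]:
    """Combine class posteriors of several modality-specific classifiers.

    Assumption: modalities are conditionally independent and the class
    prior is uniform, so the fused posterior is the normalised product.
    Entries that are None (no decision available) are ignored.
    """
    available = [p for p in posteriors if p is not None]
    if not available:
        raise ValueError("no modality delivered a decision")

    classes = list(available[0].keys())
    fused = {c: 1.0 for c in classes}
    for p in available:
        for c in classes:
            fused[c] *= p[c]

    total = sum(fused.values())
    if total == 0.0:
        raise ValueError("the classifiers contradict each other completely")
    return {c: v / total for c, v in fused.items()}


# Hypothetical outputs of an acoustic and a visual classifier.
audio = {"anger": 0.6, "neutral": 0.4}
video = {"anger": 0.7, "neutral": 0.3}

both = fuse_decisions([audio, video])        # both sensors working
audio_only = fuse_decisions([audio, None])   # camera failed
```

With both modalities agreeing on "anger", the fused posterior for that class is higher than either single posterior. With the visual decision missing, the result falls back to the acoustic classifier alone, which is the robustness property described above.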
Most works in emotion recognition use a bi-modal approach and focus on audiovisual information [Busso et al. 2004; Zeng et al. 2009]. Most of these fusion approaches utilise either feature level fusion or decision level fusion. Surprisingly, other modalities such as body gestures [Balomenos et al. 2005] or physiological information [Kim 2007; Walter et al. 2011] are only rarely utilised. The studies that do use them mostly rely on decision level fusion, as the time scales of the modalities are quite different and thus difficult to combine on feature level. Just a few studies try to integrate more than two modalities (cf. [Wagner et al. 2011]).
The Markov Fusion Network
To perform the fusion of several modalities under the constraint of fragmentary data, a late fusion approach used by colleagues at Ulm University (cf. [Glodek et al. 2012]) is briefly introduced. The Markov Fusion Network (MFN) (cf. Figure 4.13) reconstructs a non-fragmented stream of decisions y from an arbitrary number of fragmented streams of given decisions x_t^m, where m = 1, ..., M and t = 1, ..., T. Here, M is the number of different modalities and t indexes the time steps up to the final step T; a decision x_t^m exists only for those time steps at which modality m delivers one. In an MFN, the relationship of the reconstructed decisions over time is represented by a Markov chain, whereas the decisions of the modalities (input decisions) are connected to the Markov chain of final decisions whenever they are available (cf. [Glodek et al. 2012]). The model originates from the application of Markov random fields in image processing.
Figure 4.13: Graphical representation of an MFN. The estimates y_t are influenced by the available decisions x_t^m of the source m at time t and the adjacent estimates y_{t−1}, y_{t+1}.
Once the input decisions and parameters are determined, the most likely stream of final decisions needs to be estimated. The most important parameters of an MFN are k and w. The parameter vector k defines the strength of the influence of each single modality. Thus, in the presented approach, we distinguish between k_v for the visual modality, k_a for the acoustic modality, and k_g for the gesture modality. The parameter vector w weights the cost of a difference between two adjacent nodes of the MFN. Due to the limited number of dependencies, it is sufficient to perform a gradient descent optimization. More details about the training algorithm can be found in [Glodek et al. 2012].
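The interplay of k and w can be made concrete with a strongly simplified sketch. It assumes a scalar decision per time step (for example the score of one class in [0, 1]), a single scalar w, and a quadratic energy E(y) = sum_m k_m sum_t (x_t^m − y_t)^2 + w sum_t (y_t − y_{t+1})^2, where the first sum runs only over available decisions. This energy, the function name `mfn_reconstruct`, the step size, and the toy streams are assumptions for illustration; the exact formulation is given in [Glodek et al. 2012].

```python
from typing import List, Optional, Sequence


def mfn_reconstruct(streams: Sequence[Sequence[Optional[float]]],
                    k: Sequence[float],
                    w: float,
                    steps: int = 2000,
                    lr: float = 0.05) -> List[float]:
    """Estimate a gap-free decision stream y by gradient descent.

    streams[m][t] is the decision of modality m at time t, or None if
    no decision is available. k[m] weights modality m, w weights the
    smoothness between adjacent estimates (assumed quadratic energy).
    """
    T = len(streams[0])
    y = [0.5] * T  # neutral initialisation
    for _ in range(steps):
        grad = [0.0] * T
        for t in range(T):
            # Data term: pull y[t] towards every available input decision.
            for x, k_m in zip(streams, k):
                if x[t] is not None:
                    grad[t] += 2.0 * k_m * (y[t] - x[t])
            # Smoothness term: pull y[t] towards its temporal neighbours.
            if t > 0:
                grad[t] += 2.0 * w * (y[t] - y[t - 1])
            if t < T - 1:
                grad[t] += 2.0 * w * (y[t] - y[t + 1])
        # Gradient step, clamped so that every estimate stays in [0, 1].
        y = [min(1.0, max(0.0, y_t - lr * g)) for y_t, g in zip(y, grad)]
    return y


# Hypothetical fragmented streams over T = 6 time steps.
audio = [0.9, None, None, 0.8, None, 0.2]
video = [None, 0.7, None, None, None, 0.1]

y = mfn_reconstruct([audio, video], k=[1.0, 0.5], w=1.0)
```

In this sketch, time steps without any input decision are filled purely by the smoothness term, so their estimate settles at the mean of the two neighbouring estimates. A larger w yields a smoother stream, while a larger k_m makes the result follow modality m more closely.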
4.4 Evaluation
The main goal of evaluation is to assess the performance of the investigated method, for instance in affect recognition or prediction. This involves choosing between several feature sets, classifiers, and training algorithms. First, the data samples have to be prepared in such a way that data bias and overfitting are avoided.
Second, the classification performance or the prediction error has to be estimated and the classifier minimising this criterion has to be selected. A good survey on model selection procedures is given in [Arlot & Celisse 2010].
A validation set is utilised to estimate the classifier performance. For such a set, the assignment of classes to the data samples is known a priori. This allows the performance of a chosen model or classifier to be determined using common statistical quality criteria (cf. [Olson & Delen 2008; Powers 2011]). In speech recognition as well as in emotion recognition, several methods exist for arranging training and test sets and for calculating the performance of a classifier across different emotional classes and speakers. The most common types are briefly described in the following.