Summary - Affect Analysis in Video

Affective video content analysis has been proposed to help people better understand the semantics of video and to make applications more friendly and natural. In this chapter, we presented a survey of psychological emotion models and of affective video content analysis.

Chapter 3

Sparsity-based Affect Representation And Modeling

3.1 Introduction

Affective computing [Pic00] is currently an active research area, owing to users’ growing expectation of natural interaction with computers. Affective video content analysis is an important sub-area that uses both psychological theories and computational methods to recognize the high-level affective content present in videos. It is better aligned with human perceptual mechanisms, which enables more friendly and usable applications.

We first present some background information on “what is the affect of a video clip” and “how to represent the affect with the psychological models”. In modern psychology, the affective domain is one of three divisions: the cognitive, the conative, and the affective [McK76]. Affect is the experience of feeling or emotion. In this chapter, we define the affect (emotion) of a video clip as the type of emotion that is expected to arise in viewers while they watch that clip. The expected emotion is the one that is either intended by the video creator to be felt by the viewers, or felt by most viewers who watch the clip.

“Dimensional emotion space” and “categorical emotional states” are the two most widely used psychological models. The dimensional emotion space model considers the emotion space as a 3-dimensional space of valence, arousal and control. In the categorical emotional states model, emotional experiences are represented by a set of discrete and distinct words such as “happy”, “sad” and “angry”. We choose the categorical model because these states are natural and intuitive for people to relate to. How to model and represent these emotional categories in video is a challenge. Moreover, this model has an obvious drawback: it is not clear how to compute the “intensity” of an emotion, other than by using ill-defined adjectives such as “little” and “very”. We present a solution to this problem as well, taking a sparse representation based approach [Can06].

The main objective of this chapter is to map low-level video content features (such as color and motion) to the discrete emotional states, and to determine explicitly the extent of each emotional state. As stated in [Zet12], colors or particular color groups can influence our emotions, and the intelligent use of colors can produce a variety of specific overall emotional effects. Specifically, warm colors are perceived to possess high energy and excite us, whereas cold colors are of low energy and calm us down. The deviation around the main “hue” is what determines the warmth or coldness of a color. This was extensively studied by Rudolf Arnheim, a well-known perception psychologist and art theorist [Zet12]. He found that less saturated cold colors can dampen people’s mood, whereas highly saturated warm colors can excite them. Low-level features such as color content are therefore related to the emotions conveyed by the visual component of video.

However, the relationship between the low-level features of videos and the expected emotions elicited in humans is still not well understood. How a combination of low-level features contributes to affect is still an open problem in affective computing, and most existing approaches have not yielded good results. It is also very difficult to determine whether the number and the construction of the features are sufficient to recognize the affect within video. For sparse representation, as long as the number of features employed is large enough, even randomly chosen features are sufficient to recover the sparse representation (i.e., in our problem, to recover the important information related to affect) [Can06]. Sparse representation offers a new perspective on feature selection: it shows that the number of features is much more important than the details of how they are constructed [YWMS07]. Therefore, we use many features, resulting in a high-dimensional space from which we extract the right sparse representation.

Interestingly, [WMM+10] has argued that “the sparse representation to uncover semantic information derives in part from a simple but important property of the data: although the images (or their features) are naturally very high dimensional, in many applications images belonging to the same class exhibit degenerate structure”. The fact that humans agree on a small set of adjectives for emotional experiences across languages, cultures and ages points to the existence of some basic degenerate structure. Thus, sparse representation can be used to capture and recover the basic characteristics of each emotion. Our findings show that the sparsity-based approach is indeed able to represent the categorical emotion model effectively, which corroborates the utility of sparse representation for extracting semantic information [WMM+10]. It must be noted that many of these features have been used separately in the past, motivated by psychological considerations.
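To make the argument about random features and degenerate class structure concrete, the following is a minimal sketch under stated assumptions, not the method developed later in this chapter. It uses synthetic data in which each hypothetical emotion class lies in its own low-dimensional subspace, a Gaussian random projection as a stand-in for “randomly chosen features”, and greedy Orthogonal Matching Pursuit in place of the l1-minimization usually associated with [Can06]. All function names and parameter values (`omp`, `classify`, `make_demo`, the dimensions) are illustrative choices.

```python
import numpy as np


def omp(A, y, k, tol=1e-8):
    """Orthogonal Matching Pursuit: greedy k-sparse x with A @ x close to y."""
    x = np.zeros(A.shape[1])
    support = []
    residual = y.copy()
    for _ in range(k):
        if np.linalg.norm(residual) < tol:
            break  # y is already explained by the selected columns
        # Select the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Refit all selected columns jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
        x[:] = 0.0
        x[support] = coef
    return x


def classify(A, labels, y, k=5):
    """Assign y to the class whose columns alone best reconstruct it."""
    x = omp(A, y, k)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - A[:, labels == c] @ x[labels == c])
                 for c in classes]
    return int(classes[int(np.argmin(residuals))]), x


def make_demo(seed=0, n_raw=200, n_feat=60, n_classes=3, per_class=10, dim=2):
    """Synthetic stand-in data: each class lies in its own low-dim subspace."""
    rng = np.random.default_rng(seed)
    bases = [rng.standard_normal((n_raw, dim)) for _ in range(n_classes)]
    # The random projection plays the role of "randomly chosen features".
    R = rng.standard_normal((n_feat, n_raw)) / np.sqrt(n_feat)
    cols = [R @ (B @ rng.standard_normal((dim, per_class))) for B in bases]
    A = np.hstack(cols)
    A /= np.linalg.norm(A, axis=0)  # unit-norm columns of the sample matrix
    labels = np.repeat(np.arange(n_classes), per_class)
    true_class = 1
    # An unseen sample drawn from the subspace of class 1.
    y = R @ (bases[true_class] @ rng.standard_normal(dim))
    return A, labels, y / np.linalg.norm(y), true_class


A, labels, y, true_class = make_demo()
predicted, x = classify(A, labels, y)
print(predicted, np.count_nonzero(x))
```

In this toy setting the recovered vector is expected to concentrate on columns of the correct class, because the test sample lies in the span of that class's training samples even after random projection. Real affective features are noisier, so an actual system would need a solver and a sparsity level tuned to the data.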

In this chapter, we propose a computational framework to link the affective features with the emotional states considering the psychological model of “categorical emotional states”. We also try to address the lack of an intensity measure in this categorical psychological model. We develop a sparse vector representation in this computational framework, with a method to compute the “intensity” of the emotions. In addition, we show how to obtain the representative sparse vectors from the low-level features extracted from video. The approach is flexible: features extracted from any modality (audio, visual, dialog and even subtitles) can be used in this representation framework. The key contributions of this chapter are:

• A simple, fast and intuitive method is proposed to map the low-level features to the “categorical” emotional states.

• A computational measure is proposed to capture the “intensity” of “discrete” emotional states.
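As a purely illustrative example of what an intensity measure over a sparse vector could look like, the sketch below scores each emotion by its share of the vector's absolute coefficient mass. This is an assumption made for illustration only; the measure actually proposed in this chapter is defined in Section 3.3 and may differ. The function name `class_intensity`, the label layout and the example values are all hypothetical.

```python
import numpy as np


def class_intensity(x, labels, class_names):
    """Share of the sparse vector's absolute coefficient mass per class.

    Hypothetical measure for illustration, not the chapter's definition.
    """
    mass = np.abs(np.asarray(x, dtype=float))
    labels = np.asarray(labels)
    total = mass.sum()
    if total == 0.0:
        # An all-zero vector carries no evidence for any emotion.
        return {name: 0.0 for name in class_names}
    return {name: float(mass[labels == c].sum() / total)
            for c, name in enumerate(class_names)}


# Six training samples, two per emotion; x is a made-up sparse vector.
labels = np.array([0, 0, 1, 1, 2, 2])
x = np.array([0.0, 0.6, 0.0, 0.0, -0.2, 0.2])
scores = class_intensity(x, labels, ["happy", "sad", "angry"])
print(scores)
```

Here “happy” receives a score of about 0.6, “sad” 0.0 and “angry” about 0.4, so the scores form a graded profile rather than a single hard label, which is the kind of information that adjectives like “little” and “very” cannot provide.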

This chapter is organized as follows. Section 3.2 reviews the related work as a preamble. Section 3.3 elaborates on the sparse representation and modeling of affective content within videos, and discusses the construction of the sample matrix. Section 3.4 describes the relevant experimental results. Finally, conclusions are drawn in Section 3.5.
