The most expressive way humans display emotions is through facial expressions. Humans detect and interpret faces and facial expressions in a scene with little or no effort. Still, developing an automated system that accomplishes this task is rather difficult. There are several related problems: detection of an image segment as a face, extraction of the facial expression information, and classification of the expression (e.g., into emotion categories). A system that performs these operations accurately and in real time would be a major step forward in achieving human-like interaction between human and machine.
In this chapter, we compare the different approaches of the previous chapters for the design of a facial expression recognition system. Our experiments suggest that using the TAN classifiers and the stochastic structure search (SSS) algorithm described in Chapter 7 outperforms previous approaches using Bayesian network classifiers, and even neural networks. We also show experimentally that learning the structure with the SSS algorithm holds the most promise when learning to classify facial expressions with labeled and unlabeled data.
1. Introduction
In recent years there has been a growing interest in improving all aspects of the interaction between humans and computers. This emerging field has been a research interest for scientists from several different scholastic tracks, i.e., computer science, engineering, psychology, and neuroscience. These studies focus not only on improving computer interfaces, but also on improving the actions the computer takes based on feedback from the user. Feedback from the user has traditionally been given through the keyboard and mouse. Other devices have also been developed for more application-specific interfaces, such as joysticks, trackballs, datagloves, and touch screens. The rapid advance of
188 Application:Facial Expression Recognition
technology in recent years has made computers cheaper and more powerful, and has made the use of microphones and PC-cameras affordable and easily available. The microphones and cameras enable the computer to “see” and “hear,” and to use this information to act. A good example of this is the “Smart-Kiosk” [Garg et al., 2000a].
It is argued that to truly achieve effective human-computer intelligent interaction (HCII), there is a need for the computer to be able to interact naturally with the user, similar to the way human-human interaction takes place.
Human beings possess and express emotions in everyday interactions with others. Emotions are often reflected on the face, in hand and body gestures, and in the voice, to express our feelings or likings. While a precise, generally agreed upon definition of emotion does not exist, it is undeniable that emotions are an integral part of our existence. Facial expressions and vocal emotions are commonly used in everyday human-to-human communication, as one smiles to show greeting, frowns when confused, or raises one’s voice when enraged. People do a great deal of inference from perceived facial expressions: “You look tired,” or “You seem happy.” The fact that we understand emotions and know how to react to other people’s expressions greatly enriches the interaction. There is a growing amount of evidence showing that emotional skills are part of what is called “intelligence” [Salovey and Mayer, 1990; Goleman, 1995]. Computers today, on the other hand, are still quite “emotionally challenged.” They neither recognize the user’s emotions nor possess emotions of their own.
Psychologists and engineers alike have tried to analyze facial expressions in an attempt to understand and categorize them. This knowledge can be used, for example, to teach computers to recognize human emotions from video images acquired from built-in cameras. In some applications, it may not be necessary for computers to recognize emotions. For example, the computer inside an automatic teller machine or an airplane probably does not need to recognize emotions. However, in applications where computers take on a social role such as an “instructor,” “helper,” or even “companion,” being able to recognize users’ emotions may enhance their functionality. In her book, Picard [Picard, 1997] suggested several applications where it is beneficial for computers to recognize human emotions. For example, knowing the user’s emotions, the computer can become a more effective tutor. Synthetic speech with emotions in the voice would sound more pleasing than a monotonous voice. Computer “agents” could learn the user’s preferences through the user’s emotions. Another application is to help human users monitor their stress level. In clinical settings, recognizing a person’s inability to express certain facial expressions may help diagnose early psychological disorders.
This chapter focuses on learning how to classify facial expressions with video as the input, using Bayesian networks. We have developed a real-time facial expression recognition system [Cohen et al., 2002b; Sebe et al., 2002]. The system uses a model-based non-rigid face tracking algorithm to extract motion features that serve as input to a Bayesian network classifier used for recognizing the different facial expressions.
There are two main motivations for using Bayesian network classifiers in this problem. The first is the ability to learn with unlabeled data and to infer the class label even when some of the features are missing (e.g., due to failure in tracking because of occlusion). Being able to learn with unlabeled data is important for facial expression recognition because of the relatively small amount of available labeled data. Construction and labeling of a good database of images or videos of facial expressions requires expertise, time, and training of subjects, and only a few such databases are available. Collecting unlabeled data of humans displaying expressions, however, is not as difficult. The second motivation for using Bayesian networks is that it is possible to extend the system to fuse other modalities, such as audio, in a principled way by simply adding subnetworks representing the audio features.
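To make the missing-feature argument concrete, the following is a minimal sketch, not the system described in this chapter. It assumes the simplest Bayesian network classifier, a Naive Bayes model with Gaussian features, rather than the TAN or SSS structures discussed elsewhere in this book. Under that assumption, marginalizing over a missing feature simply drops its factor from the product. The function names, labels, and parameter values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def classify(features, priors, params):
    """Classify with a Gaussian Naive Bayes model (an assumed, simplified model).

    features: list of floats; None marks a missing feature
              (e.g., a tracker failure caused by occlusion).
    priors:   dict mapping label -> P(label).
    params:   dict mapping label -> list of (mean, variance), one per feature.
    Returns (most probable label, dict of posterior probabilities).
    """
    log_scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for x, (mean, var) in zip(features, params[label]):
            if x is None:
                # Integrating a missing feature out of the joint
                # distribution contributes a factor of 1 under the
                # Naive Bayes independence assumption.
                continue
            score += gaussian_log_pdf(x, mean, var)
        log_scores[label] = score

    # Normalize in a numerically stable way (log-sum-exp).
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    posteriors = {k: math.exp(s - top) / total for k, s in log_scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical toy model: two expressions, two motion features.
PRIORS = {"happy": 0.5, "surprise": 0.5}
PARAMS = {
    "happy": [(1.0, 0.25), (0.0, 0.25)],
    "surprise": [(0.0, 0.25), (1.0, 0.25)],
}

if __name__ == "__main__":
    # The second feature is missing; the first still drives the decision.
    label, posteriors = classify([1.0, None], PRIORS, PARAMS)
    print(label, posteriors)
```

With every feature missing, the posterior in this sketch falls back to the class prior. Under the same independence assumption, features from another modality such as audio would enter as additional factors in the same product.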
2. Human Emotion Research
There is a vast body of literature on emotions. Because its multifaceted nature prevents a comprehensive review, we review only what is essential to support this work. Recent discoveries suggest that emotions are intricately linked to other functions such as attention, perception, memory, decision making, and learning. This suggests that it may be beneficial for computers to recognize the human user’s emotions and other related cognitive states and expressions. In this chapter, we concentrate on the expressive nature of emotion, especially emotions expressed in the voice and on the face.
2.1 Affective Human-computer Interaction
In many important HCI applications such as computer aided tutoring and learning, it is highly desirable (even mandatory) that the response of the computer takes into account the emotional or cognitive state of the human user. Emotions are displayed by visual, vocal, and other physiological means. Computers today can recognize much of what is said, and to some extent, who said it. But they are almost completely in the dark when it comes to how things are said, the affective channel of information. This is true not only in speech, but also in visual communications, despite the fact that facial expressions, posture, and gesture communicate some of the most critical information: how people feel. Affective communication explicitly considers how emotions can be recognized and expressed during human-computer interaction. Addressing the problem of affective communication, Bianchi-Berthouze and Lisetti [Bianchi-Berthouze and Lisetti, 2002] identified three key points to be considered when
developing systems that capture affective information: embodiment (experiencing physical reality), dynamics (mapping experience and emotional state with its label), and adaptive interaction (conveying emotive response, responding to a recognized emotional state).
In most cases today, if you take a human-human interaction and replace one of the humans with a computer, the affective communication vanishes. This is not because people stop communicating affect; certainly we have all seen a person expressing anger at a machine. The problem arises because the computer has no ability to recognize whether the human is pleased, annoyed, interested, or bored. Note that if a human ignored this information and continued babbling long after we had yawned, we would not consider that person very intelligent. Recognition of emotion is a key component of intelligence [Picard, 1997]. Computers are presently affect-impaired. Furthermore, if you insert a computer (as a channel of communication) between two or more humans, the affective bandwidth may be greatly reduced. Email may be the most frequently used means of electronic communication, but typically all of the emotional information is lost when our thoughts are converted to digital media.
Research is therefore needed on new ways to communicate affect through computer-mediated environments. Computer-mediated communication today almost always has less affective bandwidth than “being there, face-to-face.” The advent of affective wearable computers, which could help amplify affective information as perceived from a person’s physiological state, is but one possibility for changing the nature of communication.
2.2 Theories of Emotion
There is little agreement about a definition of emotion. Many theories of emotion have been proposed. Some of these could not be verified until recently, when measurement of some physiological signals became available. In general, emotions are short-term, whereas moods are long-term, and temperaments or personalities are very long-term [Jenkins et al., 1998]. A particular mood may last for several days, and a temperament for months or years. Finally, emotional disorders can be so disabling that people affected are no longer able to lead normal lives.
Darwin [Darwin, 1890] held an ethological view of emotional expressions, arguing that the expressions from infancy and lower life forms exist in adult humans. Following the Origin of Species, he wrote The Expression of the Emotions in Man and Animals. According to him, emotional expressions are closely related to survival. Thus in human interactions, these nonverbal expressions are as important as the verbal interaction.
James [James, 1890] viewed emotions not as causes but as effects. Situations arise around us which cause changes in physiological signals. Ac-