
4 However, one of these persons doubted that Biron actually had fun in the experiment, despite the robot saying so in its farewell phrase.

5 Bielefeld Sensor and Actuator Interface

6 A graphical plugin shell optimized for image processing

3.3. Valence Recognition Task and Ground Truth Data

This section explains how the ground truth data is defined for the valence recognition task in the object-teaching scenario investigated in this thesis. The usual practice in visual recognition or classification tasks is to define the ground truth labels in terms of the visual appearance of the objects under investigation. This is perfectly fine for typical object recognition problems, for instance, because a human can usually determine the correct class of an object without difficulty and thus assign an accurate label. However, in the case of FCSs, the situation is different. As discussed in Sec. 2.2, FCSs and the way people use them in interactions are very complex and multifaceted, and there are different, competing views about their nature and the best way to consider them. Concretely, this means that for a given display of FCSs, it is very often difficult to determine not only the class to which it should be assigned, but also the set of classes that should be used as interpretation categories at all. The second problem is usually solved either by using a (sub-)set of established categories (for example, basic emotions [133] in the case of facial expressions), thus committing to a particular psychological model, or by pragmatically defining categories that seem to best match the concrete data and context at hand. To solve the first problem, the assignment of class labels to data instances, typically one of the following approaches is taken:

• The subjects are asked to display FCSs of given categories on request. This has been done in many studies; a prominent example is the DaFEx database [24]. While this method has appealing advantages, most notably a comparatively safe and well-defined acquisition of labeled data, and has already been very useful in computer vision research, it also has serious drawbacks: the FCSs are not authentic and spontaneous, but posed. This is most likely an issue, because the differences between posed and spontaneous FCSs are usually very prominent (see Sec. 2.2.3). The problem can be mitigated if the subjects are professional actors who are trained to pose FCSs “naturally”, as is the case for the DaFEx database, but the differences are unlikely to vanish completely. Furthermore, the display of FCSs depends heavily on the interaction context, which is very difficult to take into account appropriately when these signals are posed.

• Human raters judge video recordings of interactions in which the subjects show authentic, spontaneous FCSs. This avoids the problem of posed FCSs, but relies on the necessarily subjective and often ambiguous impression of the raters. The reliability can be increased by having several raters judge the videos in parallel and accepting only those FCS displays on which a majority agrees. However, this requires a very high expenditure of human labor and could lead to a significant number of data instances being rejected because of insufficient agreement.8

• The subjects are interviewed about the intended meanings of their facial displays. This is less common than the first two approaches, probably because of the practical problems: when the interview takes place after the experiment is over, it might be very difficult for the subjects to remember the intended meanings of several facial displays in particular situations. On the other hand, interrupting the interaction immediately after every interesting situation is likely to disturb the experiment or to influence the subjects in an undesired way (e.g., [325]).
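The majority-agreement filtering mentioned for the second approach can be illustrated with a minimal sketch. The function name `majority_label`, the threshold parameter `min_agreement`, and the label strings are hypothetical and serve only as an illustration; they are not taken from any of the cited studies.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(ratings: Sequence[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the label chosen by more than `min_agreement` of the raters, else None."""
    if not ratings:
        return None
    label, count = Counter(ratings).most_common(1)[0]
    # Strict majority: the share of the winning label must exceed the threshold.
    if count / len(ratings) > min_agreement:
        return label
    return None  # rejected because the raters agree too poorly


# Hypothetical ratings from three raters per video clip.
clips = {
    "clip_01": ["positive", "positive", "negative"],
    "clip_02": ["positive", "negative", "neutral"],
}
accepted = {}
for name, ratings in clips.items():
    label = majority_label(ratings)
    if label is not None:
        accepted[name] = label
```

In this toy example, only `clip_01` is kept; `clip_02` is discarded, which illustrates how data instances are lost when the raters disagree.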

To cope with these problems, we used a different approach in which the ground truth data is defined in terms of the objectively ascertainable interaction situation instead of the visual appearance of the face. In our object-teaching scenario, we focused on success and failure scenes, which are defined by the (either correct or wrong) answer of the robot when it classified an object. The FCSs displayed in these situations are treated as examples of the respective class: success or failure. In a sense, this is an inverse approach: instead of trying to find the correct ground truth labels for given facial displays, we look for FCSs in a given situation with implicitly given ground truth data.
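The situation-based labeling can be sketched in a few lines. This is only an illustration under assumptions: the `Scene` record and its field names are hypothetical and do not describe the actual data format used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    """Hypothetical record of one object-teaching scene."""
    presented_object: str  # object the subject actually showed to the robot
    robot_answer: str      # object name the robot gave as its answer


def scene_label(scene: Scene) -> str:
    """Derive the label from the interaction outcome only.

    The facial display is never inspected here; it merely inherits the
    label of the scene in which it occurred.
    """
    if scene.robot_answer == scene.presented_object:
        return "success"
    return "failure"


# Hypothetical scenes: one correct and one wrong answer of the robot.
scenes = [Scene("cup", "cup"), Scene("cup", "book")]
labels = [scene_label(s) for s in scenes]
```

The point of the sketch is that the label depends on nothing but the outcome of the classification, so no human judgment of the face is involved.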

While this approach yields reliable ground truth labels, it faces another problem: as the definition of these labels is based solely on the (outcome of the) interaction situation and is independent of the visual appearance of the face, there is no guarantee that a meaningful FCS is displayed at all. However, the studies of Barkhuysen et al. [18] and also our pre-study suggested that a meaningful display usually occurs. The evaluation in the following Sec. 3.4 shows that this actually is the case for this object-teaching study.

Thus, the research question investigated in this thesis is not the standalone interpretation of FCSs in themselves (as in most research on facial expression recognition), but their interpretation as feedback about the interaction in terms of valence, and the question of to which degree this feedback can be gained from FCSs at all. One can regard this as interpretation at the pragmatic level, whereas the former is at the semantic level. For all the following investigations, we did not use the complete scenes (present, wait, answer, and react phases), but extracted a subpart of the associated videos from the stationary face camera (see Fig. 3.2), starting near the end of the answer phase, exactly when Biron started to say the object name, and ending at the end of the react phase. This starting point was chosen because it is the first moment at which the subject could know whether the answer of the robot was correct or not. Hence, this part of the interaction scene appears to be the relevant one for analyzing FCSs as feedback in the sense discussed above.
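The temporal extent of the extracted video part can be made concrete with a small sketch. The record `SceneTiming`, its field names, and the frame rate of 25 frames per second are assumptions made for illustration only; the text does not state the frame rate of the face camera or how the phase boundaries were stored.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SceneTiming:
    """Hypothetical timestamps of one scene, in seconds from video start."""
    name_onset_s: float  # moment the robot starts to say the object name
    react_end_s: float   # end of the react phase


def clip_frame_range(timing: SceneTiming, fps: float = 25.0) -> Tuple[int, int]:
    """Return the half-open frame range [start, end) of the analyzed part."""
    if timing.react_end_s <= timing.name_onset_s:
        raise ValueError("react phase must end after the object name onset")
    # Round outwards so that the whole interval is covered by the frames.
    start = math.floor(timing.name_onset_s * fps)
    end = math.ceil(timing.react_end_s * fps)
    return start, end


# Hypothetical scene: name onset at 12.0 s, react phase ends at 15.5 s.
start, end = clip_frame_range(SceneTiming(name_onset_s=12.0, react_end_s=15.5))
```

Everything before the onset of the object name is discarded, since the subject cannot yet know whether the robot's answer is correct.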