The AffectButton - Measuring Affect - A Study of Non-Linguistic Utterances for Social Human-Rob

3.3 Measuring Affect

3.3.3 The AffectButton

The AffectButton (Broekens and Brinkman, 2009; Broekens et al., 2010; Broekens and Brinkman, 2013), figure 3.7, is an open-source facial gesture tool designed to facilitate the capture of affective interpretations from people, using explicit methods (i.e. people are directly asked to provide feedback).

To provide a high-level description, the AffectButton is a tool that displays a simplistic cartoon-like face on a laptop screen within a box with a mouse cursor. As the location of the mouse cursor changes, so too does the facial expression of the cartoon-like face. Furthermore, the co-ordinates of the mouse are also mapped to a single point within a three-dimensional affect space (PAD value), where the dimensions represent Pleasure, Arousal and Dominance. Pleasure relates to the positiveness versus negativeness of an affect, Arousal to the level of activation and Dominance to the degree that the environment is imposing influence. When a subject has selected their desired facial gesture, they can click the mouse and the PAD value is captured and stored.

for each dimension respectively. These affect space triplets are represented by the dynamically changing expression on the face, allowing the user to select from a wide variety of different facial expressions and affect space values by moving the mouse within the box. The facial gestures are rendered in real time and therefore the user does not need to interpret the underlying affective dimensions, but rather provides an affective rating by selecting a facial expression that they feel matches their interpretation of a given stimulus.

There are 9 prototype facial gestures located within this affect space (figure 3.7), each corresponding to an affective label: happy, excited, annoyed, angry, sad, scared, content, surprised and relaxed (these labels are exemplary). As the affect triplet changes, the facial expression displayed interpolates linearly between these nine prototype expressions. This is done via a mechanism that is comparable to that used in the robot Kismet (Breazeal, 2002; Broekens and Brinkman, 2013).

3.3.3.1 Mapping 2D Input to 3D Output

As outlined above, the AffectButton essentially provides a mapping between a two-dimensional input space (the laptop screen) and a three-dimensional affect space, which in turn is used to determine the facial expression that is displayed on the screen. The purpose of this section is to detail the internal mechanisms and rules that determine this 2D to 3D mapping. This matters because the mapping has an impact upon how the results in the aforementioned chapters have been analysed (i.e. the statistical tests that have been employed) and presented graphically.

The button consists of two parts: an outer border and an inner border, both of which are square in shape. The outer border defines the working range of the mouse cursor. The inner border spans from −0.55 to 0.55 along both the horizontal and vertical components of the input space. The horizontal (x-axis) and vertical (y-axis) components of the cursor location within the outer border are directly mapped to the Pleasure and Dominance dimensions of the affect space respectively, and thus are controlled independently of each other. These values are then scaled such that they fall within the range [−1, 1].

Figure 3.8: Example of how the Arousal value is calculated from the Pleasure and Dominance values in the AffectButton. Image adapted from Broekens and Brinkman (2013). Diagram labels: Pleasure (P), Dominance (D), Arousal (A), Outer Border, Inner Border, (P,A,D), A = +1.0, A = −1.0.

The Arousal value is derived from the Pleasure and Dominance values, and is in essence the radial distance from the centre of the input space to the location of the mouse cursor. While the cursor location lies within the inner border, the Arousal value is held constant at −1. When the cursor is located outside the inner border, the Arousal is linearly interpolated between −1 (the start of the inner border) and 1 (the start of the outer border), based upon the distance to the outer border (Broekens and Brinkman, 2013). More formally, it is characterised as the hypotenuse of the triangle that is formed from the edge of the inner border, the x co-ordinate of the cursor, and the y co-ordinate of the cursor. This is shown in figure 3.8, and the algorithms used to calculate the PAD values from the mouse co-ordinates and to check the PAD values are outlined in algorithms 8 and 9 in Appendix A.
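The mapping described above can be sketched in Python as follows. This is a minimal illustration rather than the AffectButton's own code: the function name `pad_from_cursor`, the assumption that the cursor position has already been normalised to [−1, 1] on both axes, and the use of the larger of |P| and |D| as the distance measure are all assumptions made for this sketch. The definitive procedure is the one given in algorithms 8 and 9 of Appendix A, which may measure the distance differently (for example via the hypotenuse construction described above).

```python
INNER_BORDER = 0.55  # half-width of the inner border, as stated in the text


def pad_from_cursor(x, y, inner=INNER_BORDER):
    """Map a normalised cursor position (x, y in [-1, 1]) to (P, A, D).

    Hypothetical helper for illustration only.
    """
    # Clamp to the working range defined by the outer border.
    pleasure = max(-1.0, min(1.0, x))    # horizontal axis -> Pleasure
    dominance = max(-1.0, min(1.0, y))   # vertical axis -> Dominance

    # ASSUMPTION: distance from the centre is taken as the larger of |P| and
    # |D|, so that it equals `inner` on the inner border and 1 on the outer
    # border. The thesis's own algorithm may use a different measure.
    distance = max(abs(pleasure), abs(dominance))

    if distance <= inner:
        # Inside the inner border the Arousal is held constant at -1.
        arousal = -1.0
    else:
        # Linear interpolation from -1 (inner border) to +1 (outer border).
        arousal = -1.0 + 2.0 * (distance - inner) / (1.0 - inner)

    return pleasure, arousal, dominance


# Example: a click at the centre of the button and one on the right-hand edge.
centre = pad_from_cursor(0.0, 0.0)
edge = pad_from_cursor(1.0, 0.0)
```

Under this reading, the Arousal is fully determined by the Pleasure and Dominance values, so a triplet such as (P, A, D) = (0, 1, 0) can never be produced by a click.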

Given this description, there needs to be a clarification regarding the exact nature of the affect space, and the different PAD values that can be located in this space. While the Pleasure and Dominance dimensions are independent of each other, the Arousal is not an independent dimension - it is derived from the other two dimensions. As a result, the PAD values cannot be located anywhere within the cubic space defined by the working range of the three dimensions. Rather, PAD values that can be obtained via this mapping fall onto a trapezoid surface within this space, as shown in figure 3.9.

Figure 3.9: Front and Side plots of the possible PAD values in the AffectButton PAD space. (a) 2D plot of the possible PAD values along the Pleasure (x) and Dominance (y) dimensions. (b) 3D plot of the possible PAD values. The mapping of these values to a trapezoid surface can be seen.

3.3.3.2 How the AffectButton has been used

Within the context of this work, the AffectButton has been used with both child and adult subjects to capture affective interpretations of NLUs, both in formal lab settings, and in more “wild” settings such as primary school classrooms, specifically in the experiments presented in chapters 5, 6 and 7. It has also been reimplemented from the source Python code into C++ such that it could be integrated into the various pieces of software that were developed to conduct the experiments in the aforementioned chapters.

The rationale for using this tool over the others is three-fold. Firstly, the tool provides an intuitive manner through which subjects are able to express their affective interpretations: via facial expressions. Secondly, as the Nao robot does not afford an expressive face and only acoustic stimuli have been used in this research, the use of facial gestures does not present a conflict between the modality through which the robot is expressing itself, and the modality through which the ratings of this expression are captured. Finally, from a practical perspective, given that the affective dimensions underlying the button are hidden from the subject, one does not need to be concerned with the cumbersome task of describing the nature of these dimensions, and thus the experimental process is made easier. This latter point is particularly appealing as describing how affective dimensions relate to the subjects’ affective interpretation of a stimulus is a tricky business and can be the source of considerable confusion, even for adults, let alone children.

However, while the AffectButton has been validated in a number of different ways, as described by Broekens and Brinkman (2013), this validation does not include young children. Thus, it cannot be assumed that they understand how to use the tool in the same way that teenagers and adults do. To address this, during experiments in which the AffectButton has been used in this thesis, care has been taken to allow subjects (both young and old) to become familiarised with how the tool works (i.e. how to move the mouse to produce different facial expressions) and the variety of different facial gestures that can be produced. This has been done via two methods (after subjects have been given time to explore the button): either by asking subjects to assign facial gestures for different affective labels (chapters 6 and 7), or by presenting subjects with each of the prototype facial expressions and asking them to move the mouse cursor to the location in the button such that the AffectButton matched the prototype that had been presented (chapter 7).

Finally, a note on how the results obtained from experiments using the AffectButton are presented. Given that the Pleasure and Dominance dimensions are the only two independent dimensions, with the Arousal values being derived from these, graphical plots of the PAD values are shown as two-dimensional plots with Pleasure being represented as the horizontal dimension, and Dominance as the vertical dimension (see figure 3.9a). This is the same representation as is used in the button, and thus makes it easy to identify exactly where on the screen a subject clicked to capture an affective rating, with the aim of making the graphs more intuitive to understand in this respect.

3.4 Summary

This chapter has served to provide details regarding methodological tools that have been employed during the work informing this thesis. Firstly, a custom method for describing and characterising NLUs was presented. This makes it possible to create NLUs that have a sentence-like structure and are characterised using parameters that are analogous to those that are used to describe the acoustic correlates of affective expression in both the human voice and in music. Next, the Nao humanoid robotic platform was described, as was its use in this work as the sole platform through which NLUs were embodied and studied. Finally, following a brief description and discussion regarding the two main schools of thought regarding representations of affect (categories and affective dimensions), issues surrounding the measurement of affect and tools developed to do this were presented. This chapter ended with a detailed description of the affective measuring tool of choice, the AffectButton, and how it has been used in this work.

Chapter 4 Alignment of NLUs with Agent Morphology

Summary of the key points:

• An online experiment is conducted to examine whether the morphology of a robot biases how people interpret the affective meaning, intentional meaning and appropriateness of utterances that the robot makes.

• People were shown an image of a Nao humanoid robot and an Aibo dog robot, and heard either a human-like utterance, animal-like sound or an NLU and were asked to rate these with respect to affect, intention and appropriateness.

• People are not coherent in their interpretations of the affective or intentional meaning of utterances, and the morphology does not matter in this regard.

• There does need to be an alignment between the type of vocalisation a robot makes and the physical morphology of the embodiment in order for people to deem the combination as appropriate.

• People deem it acceptable for the Nao robot to make NLUs. This serves as a justification for the use of the Nao as the robot in which NLUs are embodied and studied in this thesis.

As this work has proposed the use of a Nao humanoid robot as the platform in which NLUs are to be embodied, it is important to confirm that users deem this combination of the embodiment and NLUs as appropriate. This issue holds weight as research has revealed that a misalignment between a robot’s morphology and its behaviour can lead to adverse reactions to the robot (MacDorman and Ishiguro, 2006). This alignment is the focus of the Uncanny Valley hypothesis (figure 4.1) proposed by Mori (1970), which states that as the anthropomorphism of an agent converges to that of a human being, the reaction of users will tend toward an affinity with the agent (Moore, 2012), until a point where the physical resemblance of the agent is such that it begins to evoke an adverse response, due to aspects of the appearance and behaviour differing from the human norm provoking a sensation of strangeness (MacDorman and Ishiguro, 2006). As such, Mori proposed that robot designers use the hypothesis as a guideline when designing robots, encouraging the design of robots to reside on the left side of the valley, rather than striving to create robots with a high degree of human resemblance.

While this hypothesis, in its more famous manifestation, applies primarily to physical morphology and movement of an agent, research from the field of computer animation has shown that there are also cross-modal effects in which vocalisations also have influence (Tinwell et al., 2011). Similarly, Mitchell et al. (2011) created videos with combinations of visual/audio pairs with robot and human faces, and human and synthesised speech, finding that cross-modality mismatches resulted in a greater sense of eeriness, concluding that the physical and acoustic aspects of the robot should “match”. While the Uncanny Valley is not universally accepted in the current form (Tinwell et al., 2011; Bartneck et al., 2009), the basic premise that there is a relation between the degree of anthropomorphism, behaviour and user perception remains a relevant issue in HRI. Komatsu and Yamada (2011) provide a tangible example of this work, having found that subjects have significant differences in their interpretations of their artificial sounds depending on the physical appearance of an agent. The findings supporting the Uncanny Valley hypothesis strengthen the need to test the perception of the morphology/NLU alignment as applied to the Nao platform in order to validate the use of the platform. Furthermore, the findings that a different embodiment may evoke a different interpretation of the same utterance (Komatsu and Yamada, 2011), and that the morphology of the embodiment alone can evoke substantially different reactions from subjects (Hwang et al., 2013), highlight the need to retain the same robot throughout the body of this research.

Figure 4.1: Graph of relationship between human likeness and perceived familiarity, as proposed by Mori (1970): familiarity increases with human likeness until a point is reached where subtle deviations from the human appearance evoke an adverse response. This is known as the Uncanny Valley. Figure adapted from MacDorman and Ishiguro (2006).

To this end, this chapter presents an experiment aimed at probing how the morphology of a robot influences the perception of appropriateness and the affective interpretation of NLUs, in comparison to more characteristic forms of utterance that may be associated with the particular morphology.

4.1 Experiment Setup

This experiment set out to test the following hypotheses:

H1: Users are coherent in their affective interpretations of utterances made by a robot.

H3: The physical appearance of a robot has an influence upon the perception of appropriateness of utterances made by the robot.

These hypotheses were tested through an online experiment, where subjects were asked to rate image and utterance stimulus pairs in terms of their emotional and intentional interpretations of utterances as well as provide a rating of the appropriateness of each stimulus pairing. In total, 20 utterances were collected together - 5 utterances recorded from a human source, 6 utterances from an animal source, and 9 sounds recorded from technology sources (e.g. analogue computers, mobile phones, etc.). Each acoustic stimulus was presented to subjects twice, once paired with an image of a Nao robot and once with an image of a Sony Aibo robot (figure 4.2); thus, in total, 40 stimulus pairs were presented to subjects. For each stimulus pair, subjects were asked to select from a list the emotion that they felt the robot had conveyed through the utterance. They were also asked to guess the communicative intent, and to judge the appropriateness of the utterance and robot image pairing. All responses were forced choice. To conclude the experiment, subjects were asked whether they were pet-owners and if they came into contact with robots on a regular basis.

Two versions of the experiment were created with counterbalanced robot-utterance stimulus presentation, and subjects were directed to a URL that forwarded them to one of the two experiments.

4.1.1 Utterance Stimuli

The 20 utterances were grouped into three broad categories: human (6), animal (5) and technological (9). Audio samples were collected from a variety of sources including the FreeSound online database, and self recordings. No pre-testing of the sounds in order to ascertain rough affective interpretations was performed. Human utterances consisted of recordings of utterances such as “hmm”, “ahhh”, and other recordings of sounds that can be produced using the human vocal tract. Animal utterances consisted of sounds such as a cat’s purr, or a small dog growling, and were selected to cover a small range of sounds that one might expect to hear from such animals. Technological utterances (which are essentially NLUs) came from a broad range of sources, such as mobile phones and unusual daily sounds such as windows being wiped clean.

The motivation behind this selection of utterances was to capture a broad variety of stimuli with respect to sound source and acoustic parameters (intonation, pitch, speed, etc.), rather than providing systematic and controlled differences in acoustic profile. Understanding correlates in acoustic features in utterances and their interpretations was not the focus of this experiment. Rather, the focus was to query the impact of varying agent morphologies and validate that the Nao is indeed an appropriate platform on which to conduct NLU research. As such, the selected utterances were not intended to portray any particular affective states or have any particular meaning.

The 40 stimuli were presented in a pseudo-random order, with the constraint that the repetition of each utterance was to be separated by at least 14 others, all of which were to be different. Doing this avoided the sequential repetition of an utterance, with the aim to minimize the chance that subjects would not only recognize the acoustic stimulus, but also recall the response that they provided with the first presentation.
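The thesis does not state how this order was generated, so the following Python sketch shows only one hypothetical way of producing an order that satisfies the separation constraint. It assumes that every utterance is presented once in the first 20 trials and once in the last 20, that the robot image for the first presentation is chosen at random, and that the second experiment version simply swaps the robot image on every trial. The function names and the placeholder utterance labels are invented for illustration, and the sketch enforces only the minimum separation of 14 stimuli, not any further condition on the intervening stimuli.

```python
import random


def make_presentation_order(utterances, robots=("Nao", "Aibo"), min_gap=14, rng=None):
    """Return a list of (utterance, robot) trials with every utterance shown twice.

    Hypothetical scheme: each utterance appears once in the first half and once
    in the second half, with at least `min_gap` other trials in between.
    """
    if rng is None:
        rng = random.Random()
    n = len(utterances)
    slack = n - 1 - min_gap  # how far an utterance may move forward in the second half
    if slack < 0:
        raise ValueError("too few utterances for the requested separation")

    first_half = list(utterances)
    rng.shuffle(first_half)

    # Fill second-half slot j with the utterance from first-half slot i, where
    # i <= j + slack guarantees (n - 1 - i) + j >= min_gap trials in between.
    unused = list(range(n))
    second_half = []
    for j in range(n):
        eligible = [i for i in unused if i <= j + slack]
        i = rng.choice(eligible)
        unused.remove(i)
        second_half.append(first_half[i])

    # Randomly pick the robot image for the first presentation; the second
    # presentation of the same utterance uses the other robot.
    other = {robots[0]: robots[1], robots[1]: robots[0]}
    first_robot = {u: rng.choice(robots) for u in utterances}
    order = [(u, first_robot[u]) for u in first_half]
    order += [(u, other[first_robot[u]]) for u in second_half]
    return order


def counterbalance(order, robots=("Nao", "Aibo")):
    """Swap the robot image on every trial (assumed form of counterbalancing)."""
    other = {robots[0]: robots[1], robots[1]: robots[0]}
    return [(u, other[r]) for u, r in order]


def min_separation(order):
    """Smallest number of trials between the two presentations of any utterance."""
    last_seen = {}
    gaps = []
    for pos, (u, _robot) in enumerate(order):
        if u in last_seen:
            gaps.append(pos - last_seen[u] - 1)
        last_seen[u] = pos
    return min(gaps)


# Placeholder labels standing in for the 20 recorded utterances.
utterances = [f"utterance_{k:02d}" for k in range(20)]
version_a = make_presentation_order(utterances, rng=random.Random(1))
version_b = counterbalance(version_a)
```

Filling the second half slot by slot, rather than shuffling all 40 trials and rejecting invalid orders, is a design choice made here because a fully random shuffle would only rarely satisfy the separation constraint.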

4.1.2 Visual Stimuli

Figure 4.2: Images of the two robots used in the experiment. (a) Sony’s Aibo dog robot. (b) Aldebaran’s Nao humanoid robot.

The two robots to be presented were carefully considered due to the wide variety of robots currently available, both in the commercial and research domains. The primary concern was to make a comparison between two robots that have the same design theme, and avoid introducing unnecessary noise to the results by comparing two robots with differing underlying aesthetic design themes. Robots such as the Paro (Wada and Shibata, 2006) or MIT’s Leonardo (Coradeschi et al., 2006) are designed to resemble living creatures (evident through their soft, furry exteriors), while robots such as the Nao or Aibo have a more prominent industrial design theme and resemble technological artefacts (evident through their rigid, plastic exteriors). If comparison were made between the Nao and Paro platforms for example, this would not only be testing morphology, but life-like aesthetic
