Structure - A Study of Non-Linguistic Utterances for Social Human-Robot Interaction

The structure of this thesis is outlined below, with a brief description of the theme and context of each chapter. To accommodate the hasty reader, chapters 3 through 9 each begin with a list of the key points and findings.

• In this introductory chapter, chapter 1, the main themes relevant to this thesis have been introduced and their relation to each other highlighted, and the contributions and structure of the thesis outlined.

• Chapter 2 provides a deeper and more extensive background on Non-Linguistic Utterances (NLUs). The similarities and differences between NLUs and natural language are discussed, with the boundary between the two shown to be vague at best. Previous work on both NLUs and the related area of Gibberish Speech is covered in detail. The related literature on affective expression through the human voice and music is also reviewed, as this potentially holds useful insights into how NLUs can convey affective meaning.

• The methods used in this thesis are detailed in chapter 3. It begins with a detailed description of a new algorithm designed and developed to facilitate real-time generation and synthesis of NLUs, as well as a high degree of precision in specifying and manipulating their acoustic properties. This algorithm serves as the means of producing, characterising and controlling the acoustic properties and features of the utterances used in all the experiments presented in the subsequent chapters of this thesis (with the exception of chapter 4). A description of the Nao humanoid robot and the manner in which it has been used in this work follows. Finally, a discussion regarding the measurement of affect in humans is presented, and the affective measuring tool of choice and its use are detailed.

• Throughout the work described in this thesis, the Nao robot has been used as the platform through which the NLUs have been embodied in a physical, social agent. Chapter 4 presents an experiment that seeks to provide an experimental justification for this choice. It highlights the importance of the relationship between the physical appearance of an agent and the (audible) behaviour that it exhibits, and how this relationship can affect the holistic perception that people have of the agent, particularly in the case of NLUs.

• In chapter 5, the parameters of the NLU generation algorithm are explored in a systematic manner through a series of small experiments in order to test their impact upon the affective meaning of an utterance. These experiments are designed in such a way as to also accommodate the collection of training data to be used in chapter 7.

• Chapter 6 presents two experiments centred around investigating whether both adults and children exhibit categorical perception when affectively interpreting NLUs. It is well established that people categorise a wide variety of sensory stimuli, such as colours, facial expressions and emotional speech, and chapter 5 presents evidence suggesting that the same may be true for NLUs. Using methodologies that have been refined and matured in the domain of psychology, this chapter seeks to uncover whether subjects' affective interpretations of NLUs are indeed subject to a perceptual magnet effect and drawn to particular prototypes.

• Using the data collected from the experiments in chapters 5 and 6, chapter 7 details how this data has been used to train a collection of Artificial Neural Networks to learn a mapping between a dimensional representation of affect and the parameters of the generation algorithm outlined in chapter 3. These networks (and the learnt mappings) are then evaluated with young children.

• In chapter 8, the interaction between the situational context within which NLUs are used and the subsequent affective interpretation of these utterances is investigated. More specifically, this chapter asks whether the context has a biasing effect, whereby the nature of the context within which utterances are used directs how they are subsequently interpreted, or conversely, whether the use of NLUs can bias how the context is interpreted.

• Taking into consideration the finding of chapter 8, that situational context biases the affective interpretation of NLUs, chapter 9 explores through an online experiment the potential use of NLUs alongside natural spoken language rather than as an alternative to it, as natural language is another rich source of situational context.

• The concluding chapter draws together the work presented in this thesis, reflects upon the limitations of the thesis as well as the work in a broader sense, and ends with a discussion of a collection of topics that are considered potentially fruitful for future research.

Chapter 2 Non-Linguistic Utterances

This chapter serves to sketch a theoretical and practical background of Non-Linguistic Utterances (NLUs). It begins with a brief definition and formalisation of what NLUs are, and are not, and what distinguishes them, particularly with respect to natural language, as well as their relation to a similar strand of research surrounding the use of gibberish speech in social agents. This is followed by some examples of how NLUs have been used in real robots, as these provide more tangible and concrete examples of the type of utterances that are the focus of this research. The general motivations and potential applications of NLUs (and gibberish speech) to social HRI are then outlined.

A review of research on emotional expression through the human voice and music is then presented, drawing particularly from the fields of psychology and musicology, as facets of these fields have had great influence upon the previous work in NLUs and gibberish. Furthermore, this review highlights certain links between the methods developed to facilitate the study of emotional expression through sounds and the methods used to create NLUs and gibberish speech, as many of these links have been overlooked in previous works.

Following this, a review of the previous work on NLUs and gibberish speech is presented in tandem, charting the developments that have already been made. This work is then discussed and important gaps in the research are highlighted, as these have influenced the manner in which the work informing this thesis has been conducted.

Finally, a note on the properties of NLUs that ultimately distinguish them from language is presented, as this justifies why NLUs and gibberish are not considered to be an artificial language with respect to this thesis.

2.1 A working definition of Non-Linguistic Utterances

Non-Linguistic Utterances (NLUs) are sounds composed of beeps, squeaks and whirrs rather than sounds resembling a real spoken language. They are utterances that are specifically designed not to resemble the complex acoustic signals that can be made by the human vocal system, nor any real natural language, and thus they are inherently unable to convey complex, linguistic semantic information to humans who speak real natural languages. However, NLUs are still theoretically able to convey affective information, as this is not directly dependent upon a shared linguistic and semantic vocabulary between two people or agents, but rather can be encoded and decoded through more general features and characteristics in both simple and complex acoustic signals. This highlights an important distinguishing feature between natural language and NLUs: while the human voice affords the dual encoding of both affect and linguistic information via the same acoustic channel (Picard, 1997; Scherer, 2003), NLUs do not, as there is not (intended to be) a defined linguistic vocabulary or encoding/decoding protocol (at this point in time).

At this stage, a similar, related method of expression should also be introduced: gibberish speech. As we shall see later, gibberish speech has the same underlying motivations as NLUs, as well as the same utility with respect to its application to HRI. However, there is one fundamental difference: unlike NLUs, gibberish speech is indeed designed to resemble the timbre and voice quality of human speech, without containing any linguistic or semantic content. The reason for mentioning this seemingly different modality is that NLUs and gibberish speech represent “two sides of the same coin”, so to speak. As such, it is useful to draw upon work relating to both in order to outline the shared underlying qualities and applications, as well as the motivations for these.

Finally, a note on why this thesis is about affective vocal expression through NLUs. As we shall see later, NLUs currently do not constitute a language, and as such, their use in robotic systems is not to try to communicate high-level, complex meaning and information, as this is fundamentally not possible at this stage in time due to the lack of established cultural norms regarding the use of NLUs during social interaction. However, NLUs do have the capacity to provide rich paralinguistic cues and convey affect. Setting language aside, affective expression is well established as being a fundamental ingredient required for facilitating and regulating engaging and quality social interaction (Breazeal, 2001b,a, 2002, 2003b,a; Belpaeme et al., 2012), which is why the body of research presented in this thesis focuses upon affective expression through the use of NLUs.

2.1.1 Examples of NLUs in Popular Culture

NLUs have been used almost exclusively to great effect within the world of Animation as a means of bringing inanimate objects, particularly robots, to life and allowing them to be portrayed as social agents/individuals who can interact with social peers with ease, and without the need to use a real spoken language¹. Rather, in this respect, NLUs have been portrayed as a fictional language, where other characters within the films are able to understand what the robots are saying, while the audience in reality do not. The audience's understanding of what has been said by the robot is highly scaffolded by the events that occur within the rest of the scene, and by the script of the other characters. As such, it can be argued that the sounds themselves carry little meaning for the audience when used on their own, and that the meaning is deduced from the other salient cues provided (with many of these cues being specifically tailored toward helping decode the utterances made by the robot). However, this is something that needs clarification.

¹ Fictional robots such as R2D2 from the Star Wars films, and WALL-E and Eve from the Pixar film WALL-E, provide vivid examples of this.

As a result of the success of the Star Wars franchise in particular, NLUs have gained a certain iconic status in popular culture and media, in that they are now synonymous with the fictional robot R2D2 (and similar robots within the films), and how it expressed a rich variety of socially relevant cues (such as affect, humour, and logic) through a variety of beeps, squeaks and whirrs, which did not resemble vocalisations made by the human voice.

Given this status alone, using NLUs in real robotic systems can be seen as an appealing alternative to having a robot that speaks a real natural language, as the popularity and iconic status of robots such as R2D2 (at least in the developed world) means that people who see a real robot using NLUs are likely to perceive the utterances as expressive displays. By using the association between NLUs and socially capable robots such as R2D2, a roboticist can increase the likelihood that a real robot making similar sounds will also be perceived as socially capable, and that the utterances will be perceived as having a social meaning and utility, or so the theory goes.

2.1.2 Examples of NLUs in Real Robots

While Animation has been the main beneficiary of NLUs overall, there is also a growing number of examples of both research and commercial systems that employ(ed) NLUs as a means of expressive display. For example, Keepon (Kozima et al., 2009), and its commercial sister robot, My Keepon, have both used a small database of simple sounds to provide expressive vocalisations, both in response to sensory input and as a means of attracting attention. Similarly, WowWee’s RoboQuad (and various other robot toys in their robot product line) also used a small collection of NLUs for reactive behaviours in response to sensory input, as well as to commands input by the user through an Infra-Red remote control.

There are some stark differences, however, between the use of NLUs in Animation and in real-world robotic systems. Primarily, robots such as R2D2 are not subject to the same limitations that real robots are. For example, while R2D2 is a robot, it does not have a real computer with limited memory and processing resources, or a similar target production line cost for that matter. As such, R2D2 is not limited in the variety of different utterances that it can make, while robots such as Keepon are, given the computational resources within which a functional system had to be created. Furthermore, utterances have to be carefully designed by hand by a sound engineer, which is a time-consuming process in itself. This is why real robots have tended to have a limited repertoire of utterances, something that is easily spotted by the consumer when interacting with the robot.
