Voice in virtual worlds

Gregory Robert Wadley

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy. December, 2011

Department of Information Systems, Faculty of Science,

Abstract

Virtual worlds are simulated online spaces through which large numbers of people connect in order to work, play and socialize. Examples include massively multiplayer online games like World of Warcraft and open-ended worlds like Second Life. Virtual worlds are differentiated from other systems by their simulation of a persistent three-dimensional landscape, in which users are usually represented as avatars. Virtual worlds host large numbers of users and support a variety of recreational and instrumental uses.

A critical aspect of any collaborative system is communication. In virtual worlds this is especially complex. Most users encounter people who are unknown to them offline. Most maintain a presence in the world over many months or even years, yet may prefer to be pseudonymous or engage in identity-play. Users must simultaneously manage both physical and virtual contexts. Synchronous as well as asynchronous communication is required. Virtual worlds initially offered only text as a medium for user-to-user communication. More recently, vendors have introduced facilities for communicating by voice. This has made the experience of virtual worlds more convivial for some, and has enabled forms of collaboration that were previously only possible in small experimental systems. But the introduction of voice provoked controversy, with some protesting that it projects too clearly the personal characteristics of speakers, damaging pseudonymity. Some are more sensitive about speaking with strangers than they are typing, and may become less communicative when adopting voice, or more easily dominated by extroverted collaborators. Voice channels are more prone to abuse, and the abuse can be more impactful. Users encounter sound quality problems, and are often uncertain whether they are being heard. Voice is less suited to asynchronous communication, and is more prone to congestion.

It appears that voice works well when conditions suit, but can lead to failed implementations when deployed inappropriately. Yet little research has been conducted to help us understand to which situations voice is suited, whom it benefits and whom it does not, and how it can best be configured to support different activities.

I conducted four studies designed to fill this gap. Two examined the influence of voice on user experience, in the two major types of virtual world. The others examined the interaction between voice and spatiality, the defining feature of virtual worlds, at macro and micro scales. I studied use in naturalistic contexts, collecting data via interviews and diaries and triangulating these with observation, online ethnography, conversation analysis and quantitative measures.

I found that voice transforms the user experience of virtual worlds. It makes some forms of collaboration more efficient. However it interferes with identity-play and the ability of users to manage multiple tasks and conversations. When voice is propagated spatially, it increases immersion, reduces channel clutter, and affords new strategies for team coordination. However verbal references to places and objects often fail.

I discuss these results in the light of post-media-richness theories of communication, arguing that preferences for one modality or another reflect broader issues of managing social presence in virtual and physical contexts.

Declaration

This is to certify that:

(i) the thesis comprises only my original work towards the PhD except where indicated in the Preface,

(ii) due acknowledgement has been made in the text to all other material used,

(iii) the thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Signed,

__________________________________ Greg Wadley

Dated: __________________________

Preface

Studies 1 and 2 of this thesis were conducted in collaboration with Martin Gibbs and Peter Benda at the University of Melbourne. Martin and I collaborated in designing these studies, and all of us collaborated in collecting the data; however, the data analysis and write-up presented in this thesis were conducted by me.

Study 3 was supervised by Nic Ducheneaut and conducted by me. Study 4 was supervised by Martin Gibbs and conducted by me.

All four studies have been presented at conferences. The publications are listed in Appendix C. These were multi-authored; however the text in this thesis was written by me.

Acknowledgements

A thesis cannot come into being without the help and influence of many people, and I would like to thank them now. I make the usual author’s disclaimer that the fault for any inadequacies in the work lies with me alone.

My supervisors Martin Gibbs and Steve Howard of the University of Melbourne steered me deftly through a long and sometimes arduous process.

The Interaction Design Group at the University of Melbourne has offered unceasing challenge and stimulation while I worked on this thesis, and I thank all past and current members.

I have worked with the staff of the Department of Information Systems at Melbourne for a decade and I thank them for their support and collegiality.

I learned by working with experienced researchers: thanks Pete Benda, Nic Ducheneaut and Martin Gibbs.

I spent four months researching at the Palo Alto Research Center in California, funded by a University of Melbourne PORES scholarship. This was a transformative experience and I thank everyone who made it possible, especially staff at PARC, the Department of Information Systems, and the University of Melbourne’s School of Graduate Research. Most of all I thank my supervisor at PARC, Nic Ducheneaut, and my fellow researchers there Don Wen and Mike Robinson.

Many people discussed my work with me and offered useful advice and insights, including Richard Bartle, Tom Boellstorff, Marcus Carter, Paul Dourish, Connor Graham, Mitchell Harrop, Jeremy Hunsinger, Reeva Lederman, Tuck Leong, Bernd Ploderer, Ralph Schroeder, Dmitri Williams, and fellow PhD candidates in the Interaction Design Group.

I thank Paul Boustead for making the SpatialVoice system available for study 2.

I would like to thank the people who participated in my studies. Study participants receive little compensation for their efforts, and research cannot happen without them.

List of tables and figures

Tables

Table 1: Metaphors for propagating voice and text messages

Table 2: Overview of my four case studies

Table 3: Data collection methods used in each study

Figures

Figure 1: Using text-chat in the Second Life virtual world

Figure 2: Timeline of virtual environment technologies

Figure 3: Relationship between CVEs and other technologies

Figure 4: Taxonomy of virtual environments

Figure 5: Cases arrayed along dimensions of diversity

Figure 6: My cases situated within taxonomy of virtual environments

Figure 7: Design of groups for study 1

Figure 8: Data collection timeline for study 1

Figure 9: Text vs voice in MMOGs

Figure 11: In-avatar and in-camera views

Figure 12: Creating prims in Second Life

Figure 13: 'House' task screenshot provided to participants

Figure 14: 'Garden' screenshot provided to participants

Figure 15: Data collection timeline for individual trials in study 3

Figure 16: Split-screen video of a trial in study 3

Figure 17: Time spent with camera in in-avatar mode

Figure 20: Encountering voice users in SL (my avatar in foreground)

Figure 21: PLS 1 for study 1

Figure 22: PLS 2 for study 1

Figure 23: Interview questions for study 1

Figure 24: Diary instructions for study 1

Figure 25: First pass at code hierarchy for study 1

Figure 26: Code hierarchy for study 1

Figure 27: Participant instructions for study 2

Figure 28: Coding hierarchy from study 2

Figure 29: IRB application for study 3

Figure 30: Participant recruitment advertisement for study 3

Figure 31: Study 3 lab notes 1

Figure 34: Discussion about building in Second Life

Figure 35: Sample video analysis: one pair

Figure 36: PLS for study 4

Figure 37: Interview questions for study 4

Figure 38: Excerpt from my SL diary

Figure 39: Excerpt of a group discussion from study 4

Figure 40: Excerpt of an interview from study 4

Figure 41: Excerpt from Second Life user forum

Chapter 1

1. Introduction

1.1 Background

This thesis is concerned with how people communicate within virtual worlds: large, Internet-based systems through which geographically-dispersed people connect for recreation, social interaction and collaborative work. Well-known virtual worlds (the term is usually abbreviated as VWs) include Second Life and World of Warcraft. VWs are a subset of the genre of computer systems known as collaborative virtual environments (CVEs). CVEs simulate three-dimensional spaces and visualize them using real-time computer graphics. Virtual spaces are usually designed to look something like the real world of our everyday experience, though they are typically fictional.

By definition, a collaborative virtual environment is a multi-user system (Churchill et al., 2001). Users are typically represented to each other in the form of simulated bodies called ‘avatars’. Most virtual environments of interest today are multi-user, therefore in this thesis I sometimes omit the word ‘collaborative’ for the sake of brevity and simply say ‘virtual environment’.

The properties that make VWs ‘worlds’, and which are usually accepted as differentiating virtual worlds from the larger category of collaborative virtual environments, are:

1. VWs are persistent (the simulation continues to run on the host computer, regardless of which particular individuals happen to be logged in).

2. VWs are used by large numbers of people (World of Warcraft, a popular system, currently has about 11 million users).

3. The space of a VW is large (for example a broad expanse of virtual land, as opposed to, say, a single room).

Like most Internet-based social platforms, virtual worlds typically allow their users to communicate with each other by typing text messages. More recently however, many have added voice communication facilities, so that now both written and spoken forms of verbal communication are available. In a voice conversation, sound recorded by the sending user’s microphone is transmitted over the Internet and replayed through the receiving users’ speakers or headphones. The problem of how to design successful voice systems for virtual worlds is a theme of this thesis.

Virtual world users can communicate non-verbally as well. The use of avatars affords embodied forms of communication such as proxemics (how people position their bodies in relation to each other) and a simplified language of posture and gesture. Some virtual worlds allow users to construct objects, which are visible to others and can communicate meaning. Figure 1 shows a screenshot taken in the popular virtual world Second Life. My avatar is in the foreground, and the avatars of other users are scattered around the scene. Avatars are labelled with their user’s pseudonym, and optionally the name of a group to which the user belongs. In this scene some users have formed their avatars into conversation groups, as we would our bodies in offline conversation. Text messages entered by users whose avatars are near mine are displayed in the chat window on the bottom left. The ‘mini map’ on the top right is a top down view of the local area, with users represented as green dots. (This screenshot was taken in 2007: the SL ‘Viewer’ or client software has changed somewhat since then.)

Researchers have been examining social interactions in collaborative virtual environments for over twenty years (e.g. work surveyed in Churchill et al., 2001; Schroeder, 2002; and Schroeder and Axelsson, 2006). A significant concept in earlier work was presence, defined as the sensation of being somewhere other than one's physical surroundings, such as in a distant or fictitious place. The experimental systems examined in early research often utilized input-output (I/O) hardware beyond what is typically found on personal computers. For example Virtual Reality (VR) systems use head-mounted displays, data gloves and other specialized devices in order to maximize their users’ sense of being immersed in a virtual environment. VR did not achieve widespread commercial success, and vendor interest (and some but certainly not all researcher interest) shifted towards systems based on consumer hardware. Since the late 1990s, ‘desktop’ virtual worlds have allowed users to connect via PCs and the Internet to simulated spaces running on large servers. Some of these have been commercially successful and have inspired significant research interest.

Today’s VWs are often classified into two categories: those that are games and those that are not. The distinction is based not on the underlying technology (which is common to both) but on the kinds of activities that their users undertake. Use of a game world is constrained by rules and the necessity that users compete with each other, or collaborate to defeat simulated enemies. Activity in non-game worlds is less restricted, though there are still rules, such as the property laws in Second Life which support a market for user-created virtual goods. The taxonomy of Schultze and Rennecker (2007) classifies VWs along two dimensions: fantasy-realism and progression-emergence. These authors’ progression-emergence dimension describes the degree to which user behaviour is shaped by game rules, and thus corresponds to the distinction between games and non-games.

Game-based virtual worlds are usually called Massively Multiplayer Online Role-Playing Games, abbreviated as MMORPG, MMOG or just MMO. For brevity, and because the existence and nature of role-play in these worlds is contentious (e.g. MacCallum-Stewart and Parsler, 2008), I call them Massively Multiplayer Online Games (MMOGs) in this thesis. Non-game worlds are usually called either “open ended” (to emphasize that activity is not constrained by game rules), or “social”. The term “social world” is not a reference to Anselm Strauss’ concept of the same name, but simply emphasizes that a principal activity in this genre of VWs is socializing with other users. The term “social” is problematic because it distinguishes these platforms neither from game worlds nor from non-spatial systems such as social network websites, all of which also feature social activity. However I have chosen to use this term because it is briefer than the alternative “open-ended”, and is in more common usage.

Lehdonvirta (2010) has pointed out that the ‘MMOG vs Social’ categorization is problematic because it lacks a strict definition. However its popularity suggests that it captures a distinction which is important to users.

MMOGs offer large, persistent, usually fictional settings in which groups of users collaborate and compete in a long-term game. MMOGs inherit much of their culture from the earlier, non-graphical ‘multi-user dungeons’ or MUDs. Currently the most popular MMOG in the USA, and one of the most popular worldwide, is World of Warcraft, which has about 11 million users. Other prominent MMOGs include Everquest, EVE Online and Lineage. Communication in MMOGs is examined in study 1 (chapter 4).

Social worlds also simulate large, often fictional spaces, but are not considered by their users to be games (though mini-games can be set up within a social world). There is no formal competition and nothing to ‘win’. There are fewer rules constraining users, and these systems are used for a broader range of activities including social interaction, discussion, role-play and content creation. The most popular social world currently is Second Life, which has about a million users. Other prominent social worlds have included There, Habbo Hotel and Active Worlds. I examined social worlds in studies three and four.

Thus while early CVEs were imagined as systems for computer-supported collaborative work, they have succeeded in the form of recreational technologies. Yet this evolution has recently gone full circle, with interest growing in the appropriation of VWs for instrumental tasks such as online meetings and distance education. Linden Lab, the vendor of Second Life, actively promotes its system to business and education users, and other systems designed specifically to support collaborative work have been developed.

The timeline in figure 2 illustrates a brief history of virtual environments to date.

Figure 2: Timeline of virtual environment technologies

A fundamental problem in the design of desktop virtual worlds is how they can provide a usable interface to a simulated 3d space despite using 2d hardware such as a mouse (Bowman et al., 2001). Broadly speaking this lack of a third input dimension has been addressed in two ways. One has been to devise paradigms that superimpose a third dimension onto existing 2d interfaces. For example, a modifier key can define some mouse movement as being in the 'z' direction. The other approach is to invent new I/O devices that are explicitly three-dimensional, such as the 3d mouse and the stereoscopic display. But virtual worlds have achieved mass uptake by ‘making do’ with standard hardware. Only very recently have true 3d devices such as the Wii Remote and Xbox Kinect become available to average consumers. I am less concerned with how VW users communicate with their computer, and more with how they communicate with each other. (Note that my topic is distinct from, say, the use of speech for inputting system commands.) To use the terminology of Preece and Maloney-Krichmar (2003), I am concerned more with the sociability of virtual environments than with their usability. Despite this emphasis, part of the challenge of designing successful communication tools for use in VWs stems from their spatiality.

As the world grapples with economic and ecological problems, Internet-enabled virtual worlds offer the promise of an inexpensive platform for collaboration among geographically-dispersed users. For example in 2008 Imperial College London and Nature Publishing Group co-hosted a conference in Second Life with the explicit aim of reducing the carbon footprint associated with long distance travel to international events. Likewise many universities are investigating the use of VWs as teaching spaces for distance education (e.g. Gregory et al., 2011), with aims that include easing the travel burdens of students and staff.

High expectations have been held for CVEs since their earliest incarnations. At times it has been predicted that they would become a dominant platform for online interaction (e.g. Rheingold, 1991; Gartner, 2007). However it would be fair to say that no implementation has thus far lived up to these expectations (Salomon, 2009). Commercial success is mostly limited to entertainment applications, and CVEs are less often put to serious use outside the research lab (Schroeder, 2010). Even the large MMOGs which boast millions of users do not approach the broad popularity of, say, email or social networking sites, whose users number in the billions. It is an open question why more people have adopted web-based social networking than have adopted virtual worlds. Researchers have offered critiques of existing implementations, as well as of over-enthusiastic expectations for 3d (e.g. Harrison and Dourish, 1996). One thing is certain: merely including three-dimensionality within a communication system has not guaranteed the “natural articulation of collaboration” (Benford et al., 1994) that has been envisaged for virtual environments.

My experience studying and using virtual worlds has led me to conclude that, in order to estimate the aptness of VWs to a particular communication scenario, one needs to understand them: (a) as multi-media technologies, of which three-dimensional space is one of the media, and (b) as one, perhaps extreme, example of a broad project to build communication technologies of maximum ‘richness’ in the sense of Daft and Lengel (1986).

To describe CVEs as multimedia technologies is to emphasize that mechanisms for simulation of space, embodiment of users as avatars in the space, and linguistic communication between users, while they are frequently found together, are independent and need not all be implemented together within a given technology.

Some discourse about virtual worlds has implied that a VW’s purpose is to provide rich communication, even if the academic term ‘media richness’ is not explicitly used. For example Wasko et al. (2011) explain that:

According to theories of media richness, 3D environments are objectively rich because there is synchronous contact; the visual stimuli, objects, and environmental designs offer a variety of social cues; and communication occurs through multiple channels, including audio, visual, and text. (p. 648)

A medium’s richness has been understood as the extent to which it imitates face-to-face conversation by conveying not only linguistic messages but also information about the people conversing (Daft and Lengel, 1986). This view of communication sees face-to-face (f2f) conversation as the ‘gold standard’ to which mediated communication is to be compared. F2f communication includes not only speech but facial expressions and body language conveyed through the modality of vision. Thus an explicit goal of much VW design has been to simultaneously transmit, along with speech or text, simulated bodily orientations and gestures through the users’ avatars.

Implicit in the media-richness project is the belief that richer communication is better communication. This means that a CVE can fail in two ways. It can fail to be rich because of technical limitations, or it can fail because, in some situations, richer does not mean better. While the former is sometimes offered as an explanation for the limited success of CVEs, so that the solution must be to seek even greater fidelity of representation, in this thesis I argue that the latter is often the better explanation. In other words, VW users, like people in general, do not want maximally-rich communication media.

For example, VW users do not usually want their avatars to resemble them exactly (Ducheneaut et al., 2009), and my studies showed that some of them choose to communicate by text, even when voice is available, in order to prevent audio cues about their identity and physical context from being transmitted. Media richness theory has been much critiqued (see chapter 2), yet this critique appears not to have permeated discourse on virtual world design. Clarifying this situation is one of the aims of this thesis, and is discussed in detail in chapter 8.

1.2 Classification and terminology

1.2.1 Virtual space

This thesis is concerned with computer systems that ‘locate’ their users within a simulated three-dimensional space and allow them to ‘move’ within it. I place ‘locate’ and ‘move’ within quotes because, of course, the positions at which users are ‘located’ are points in a simulated space. The users of a virtual environment do not need to be at any particular physical location while moving within virtual space: a user’s physical and virtual locations are, in existing VWs, independent of each other. In this thesis I address both the physical and virtual contexts of use, so I need to be clear that these are distinct. The user’s physical context is the space immediately around the computer they are using, and may include other people who can see or hear what the user is doing. (Modern high-powered laptops and mobile Internet connections mean that a VW user’s physical context can be almost anywhere.) The user’s virtual context is the virtual space immediately around their avatar. It might contain the avatars of other users who because of avatar proximity can see the first user’s avatar, receive their text chat, and through a voice channel possibly hear what is happening in their physical context.

Virtual worlds are computer-generated Euclidean spaces with three dimensions (‘3-spaces’), because that is the dimensionality of the space that we experience in our everyday dealings with the physical world³. In three-dimensional space the position of any object can be described by three coordinates, which represent the distance from an agreed origin-point to the object along three orthogonal axes. The orientation of an object (the direction it faces) can be described by another ordered triple of angles relative to the agreed axes. Accordingly, translations (changes in location) and rotations (changes in orientation) can be expressed as three-dimensional vectors, and distances between points can be calculated straightforwardly by the Pythagorean formula. In a Euclidean space, the distance between two objects, along with their orientations, is part of what determines whether they can affect each other and thus whether they are relevant to each other (Benford et al., 1994).

³ In fact a VW can be described as a Cartesian space because it has a coordinate system. Coordinates are necessary for the software to function, so distinguishing between Euclidean and Cartesian is unimportant. The important distinction is between these and non-Euclidean spaces.
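The coordinate description above can be illustrated with a minimal sketch. It is not taken from any virtual world's software: the names Position, translate, distance and within_range, and the idea of a single fixed radius, are assumptions made purely for illustration. Orientation, which the text notes is also part of what determines relevance, is ignored here for simplicity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """A point in a simulated 3d space: distances from an agreed
    origin along three orthogonal axes (hypothetical type)."""
    x: float
    y: float
    z: float


def translate(p: Position, dx: float, dy: float, dz: float) -> Position:
    """Apply a translation, expressed as a three-dimensional vector."""
    return Position(p.x + dx, p.y + dy, p.z + dz)


def distance(a: Position, b: Position) -> float:
    """Straight-line distance between two points, using the
    Pythagorean formula extended to three dimensions."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


def within_range(a: Position, b: Position, radius: float) -> bool:
    """Assumed relevance rule: two objects can affect each other
    only if they are no further apart than `radius`."""
    return distance(a, b) <= radius


if __name__ == "__main__":
    origin = Position(0.0, 0.0, 0.0)
    other = translate(origin, 3.0, 4.0, 12.0)
    print(distance(origin, other))            # 13.0
    print(within_range(origin, other, 20.0))  # True
```

A real system would likely combine a check of this kind with orientation and with rules specific to each medium; the fixed radius is only a stand-in for whatever rule a given world applies.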

Alternative models have been proposed to describe the space of the physical universe. For example, relativity utilizes a non-Euclidean space, and string theories propose higher dimensionalities that only become significant at tiny distance scales. These do not affect physics at the scale of human experience, and no virtual environment has attempted to represent them. (Dix et al., 2000, discussed the non-Euclidean spatialities of modern physics and their potential for application within other types of virtual spaces, such as mobile-device networks and the World-Wide Web.) The fact that virtual environments are Euclidean is not discussed further in this thesis: I simply refer to them as ‘spatial’, ‘three-dimensional’ or ‘3d’, and speak of them as having the property of ‘spatiality’.

Numerical coordinates can be used explicitly by Second Life users in order to directly manipulate locations, such as when building and scripting objects or ‘teleporting’ around the virtual world. By contrast, in MMOGs and game-worlds, the 3d nature of the space is often not made numerically explicit to users, and MMOG users cannot directly manipulate either their own location or those of objects in the game-world. However from the software developers’ point of view, all these spaces are explicitly three-dimensional.

There are different expressions in use for the properties and phenomena of virtual environments. While there is not complete agreement on terms, in this section I establish a nomenclature to be used consistently throughout the thesis.

Since the defining feature of virtual environments is their simulation of 3d space, I should first differentiate them from:

• networked multimedia systems such as hypertextual spaces (e.g. the world-wide web),

• text-based ‘MUDs’, which describe a pseudo-spatial environment using words rather than graphics,

• systems that represent a space that is navigable but two-dimensional, and

• cinema-like systems that display 3d scenes which are not navigable (Nitsche, 2008).

Some videogames use a so-called “2.5d” projection. The environment is two-dimensional but is displayed as if looking from above, to give the impression of three dimensions. One 2.5d world, Ultima Online, is considered by many to be the first MMOG. Others such as Club Penguin and Habbo are popular among young people. While 2.5d systems may certainly be experienced as ‘worlds’ by their users, I have not included them among my cases. They do not afford the degree of spatial interaction possible in 3d systems.

The word ‘virtual’ has several usages. Informally it has been applied to a wide range of Internet-based technologies, so that websites offering e-commerce have been called 'virtual stores' and web-based learning management systems are sometimes called 'virtual classrooms'. That usage – by which the word 'virtual' means ‘web-based’ – has waned in favour of using the term only to refer to spatial simulations, and it is these with which I am concerned.

The relationship between virtual worlds and other Internet-based technologies is illustrated in figure 3.

Figure 3: Relationship between CVEs and other technologies

[Figure labels: communication technologies; Collaborative Virtual Environments (Virtual worlds, VR, Networked FPS games); non-spatial systems such as video-conferencing, email, websites, instant messaging etc.]

1.2.2 Types of collaborative virtual environments

There are several types of CVE, including:

• fully-immersive VR systems running on specialized hardware

• multi-player 3d videogames, of which the ‘first-person shooter’ (FPS) is a prominent genre

• virtual worlds.

Related technologies include video-conferencing systems, and immersive video projections such as the CAVE (Johnson and Leigh, 2001). Video-conferencing does not generate space: rather it projects an image of one real-world space into another (though Schroeder, 2010, predicts that non-PC-based CVEs and video-conferencing systems will converge somewhat). Immersive projection systems mix physical objects with virtual space, using specialized, expensive hardware that is not widely available. These are not within the scope of my thesis; nor are specialized hardware platforms such as location-based entertainment centres and theme park rides (Badiquet et al., 2002). I have limited my research to the more widely-used, if less visually immersive, PC-based virtual worlds.

When analysing virtual worlds it is necessary not only to classify technologies but to differentiate the kind of uses to which they are put. Broadly speaking, researchers differentiate two approaches to the use of VWs:

• Leisure, entertainment, socializing etc. (I call this “recreational”), and

• Utilitarian or work-related use (I follow Schroeder, 2010, and call this “instrumental”).

1.2.3 ‘First person shooter’ videogames

First-person shooter (FPS) videogames are not normally classified as VWs because their space is not persistent and they do not host sufficiently many users in one environment. However in my second study I used a voice system designed for team-based FPSs, as no voice system of this type had been built for true virtual worlds.

One of the most popular FPSs is Counterstrike. One of the earliest, Doom, displayed scenes as though the user were looking through the eyes of their character: hence the term ‘first-person’. Recently most FPSs represent the user as an avatar in the game-world, situated just in front of the retinal plane so that the user views the scene from behind their character’s head. Some call this a ‘third person’ view; however ‘first-person shooter’ has prevailed as the name of the genre.

When FPSs are multi-player it becomes important to represent users visually, because a significant subset of the environment that a player is attempting to negotiate consists of other players. The avatars of fellow players may be targets or foci of action. As with the pieces in chess, the positions of players in virtual space are a critical part of game state in an FPS.

FPSs are described in more detail in section 2.2.3.

1.2.4 MUDs: text-based worlds

Multi-user-dungeons or MUDs are a kind of online world in which the environment is described for the user in text rather than being graphically illustrated. Much of the thematic content of current MMOGs is inherited from MUDs, and some MUDs are still in use.

A MUD’s space is not truly 3d (they can be non-Euclidean networks of rooms – see Aarseth, 2008), and is not represented visually, so MUDs do not fit most definitions of ‘virtual environment’. I did not examine MUDs in this thesis; however they are of historical importance and research into use of MUDs is discussed in section 2.2.2.

Because MUDs are sometimes described as “text-based virtual worlds”, I need to clarify that when I discuss ‘text communication’ I am not referring to the method by which a MUD displays its environment to users, but to the typed messages that people use to communicate within graphical virtual worlds.

1.2.5 Using virtual worlds to research offline behaviour

While much research has sought to understand phenomena within virtual worlds (see section 2.2), some authors have proposed that research into virtual worlds can be used as a basis to study human behaviour in general, and that the VW context may be more convenient than the offline.

For example, Castronova (2008) tested economic theory within virtual worlds to support their use as models of offline economic activity. Bainbridge (2007, 2010) argued that the study of


A taxonomy of the systems relevant to my thesis is presented in figure 4. The red boxes represent genres that were not covered in my research.

Figure 4: Taxonomy of virtual environments

1.2.6 Virtual and physical space

It is common in discourse about virtual worlds to compare virtual and physical space. Several terms are used for the physical world and there is some debate over which is most suitable. Nitsche (2008) refers to a videogame player’s physical context as ‘play space’, and the collection of the play spaces of several collaborating or competing players as ‘social space’.

Second Life users typically refer to the physical world using the term ‘real life’ (‘RL’), or less commonly ‘first life’ (‘FL’) or the ‘full bodied’ world.

Boellstorff (2008) argues that the use of the term ‘real life’ is misleading, since all human culture acts as a kind of virtual world in which social life is enacted. He recommends the term ‘actual world’, though this has not been widely adopted (see Golub, 2010, for a defence of this terminology).

This thesis is concerned with the mechanics of communicating within virtual space. Therefore I chose the term ‘physical world’ to describe the everyday world. This term emphasizes that communication in the physical world is subject to physical forces, which must be simulated in virtual worlds in order for communication to take place. In the physical world, voice conversations are usually propagated by sound waves or electromagnetic signals. Virtual worlds must explicitly implement a mechanics of signal propagation in software. These mechanics can have arbitrary design and need not (and usually do not) mimic precisely the mechanics of the physical world.

To distinguish physical and virtual space is not to imply that virtual space is a separate universe untouched by real-world actions or social institutions. The idea that virtual worlds are walled off by a ‘magic circle’ has been dismissed by Lehdonvirta (2010), supporting earlier arguments by Taylor (2006) and others. One of my findings is that the modality of voice tends to breach whatever boundary does exist between virtual and physical worlds.

1.3 Theme: Communication in virtual worlds

Virtual worlds offer several means for user-to-user communication:

− the simulated space, which is visualized to each user and can support indexical utterances such as “this thing” and “that location”,

− user embodiment as avatars, which affords a simple body-language of proxemics, posture and gesture,

− linguistic communication via text and/or speech,

− sound, triggered by user actions on the environment,

− and in a few systems, haptic or other sense modalities.

Because VWs locate users in a space, designers have the option of taking the sender’s and receivers’ virtual locations into account when implementing mechanisms for the transmission of signals. For example, spoken or typed messages might be sent only to users whose avatars are ‘near’ (in virtual space) the sender’s avatar, simulating the transmission of sound in air and enabling “a natural intuition about mutual audibility” (Smith et al., 2001). Conventionally this is called proximity, spatial or local chat.

Alternatively the VW might ignore avatar locations and simply transmit messages between users, wherever they are. This simulates the use of telecommunication devices. Thus I have suggested that there are in effect two metaphors currently used to design voice transmission in a virtual world: sound in air, and radio (see Wadley et al., 2005 in appendix C). The configurations typically available in VWs are illustrated in table 1, along with the terms commonly used to describe them.


                        Text modality                                   Voice modality

Sound-in-air metaphor   ‘local chat’, ‘vicinity chat’                   ‘spatial voice’, ‘proximity voice’

Radio metaphor          ‘instant message’, ‘guild chat’, ‘raid chat’    ‘group voice’, ‘one-on-one’

Table 1: Metaphors for propagation of voice and text messages
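The two metaphors in table 1 can be read as two different rules for selecting the recipients of a message. The following is a minimal sketch of that idea only, not the implementation of any system discussed in this thesis: the names `User`, `sound_in_air_recipients` and `radio_recipients`, and the `hearing_radius` parameter, are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class User:
    name: str
    position: tuple                    # (x, y, z) of the avatar in virtual space
    channels: frozenset = frozenset()  # radio-style groups, e.g. a guild


def sound_in_air_recipients(sender, users, hearing_radius):
    """Sound-in-air metaphor: deliver only to avatars near the sender."""
    return [
        u for u in users
        if u != sender and dist(u.position, sender.position) <= hearing_radius
    ]


def radio_recipients(sender, users, channel):
    """Radio metaphor: deliver to all channel members, wherever they are."""
    if channel not in sender.channels:
        return []
    return [u for u in users if u != sender and channel in u.channels]


# Hypothetical users: bob stands near ann; cat is far away but shares ann's guild.
users = [
    User("ann", (0, 0, 0), frozenset({"guild"})),
    User("bob", (3, 4, 0)),
    User("cat", (500, 0, 0), frozenset({"guild"})),
]
ann = users[0]
nearby = [u.name for u in sound_in_air_recipients(ann, users, hearing_radius=10)]
guild = [u.name for u in radio_recipients(ann, users, "guild")]
```

In this sketch the same utterance by `ann` reaches only `bob` under the sound-in-air rule and only `cat` under the radio rule, which is the distinction the table draws.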

Whichever metaphor is used, the rules of message propagation must be explicitly designed by the system developers. These rules constitute a simulated medium through which messages are propagated. Therefore while networked virtual environments are themselves a kind of telecommunication medium, they are unique in that they simulate a 3d space, locate users within it, and then provide them with one or more virtual telecommunication media through which they can communicate within virtual space.

This complexity relative to other communication technologies raises a number of issues:

• How should virtual media be configured? Existing implementations are usually simulations of sound or radio, but they could take arbitrary forms.

• What are the relative efficacies of differently-configured virtual media when used by different kinds of users in different situations?

• How can VW users manage the multiple communicative contexts that they experience?

These questions are all addressed within this thesis.

In the next section I describe the problem my work has attempted to address, after which I offer a précis of relevant prior research. I then identify a gap in this literature, present my research questions, give an overview of my research program, and précis my results and the conclusions I have drawn.


1.4 Problem

As discussed above, many virtual worlds now include mechanisms that allow users to speak to each other. Yet as the literature reviewed in the next chapter shows, there has been little systematic investigation of the influence that the voice medium has had on the experience of virtual world users. A thorough study has not previously been made of what advantages voice brings, or conversely which conditions, if any, render voice unsuitable. It has not been established who finds voice unsuitable, nor how this might be addressed by designers.

The influence of voice in virtual worlds is of theoretical interest because these systems are considered by some to be arenas par excellence for the kind of identity-play that in the past was held by many to occur within the Internet more widely. If VW users are exploiting ‘lean’ text chat and customizable avatars to enact fictitious personas in imaginary worlds, then voice should either make their role-play more difficult, or force us to reconsider what role-play is. The study of voice in virtual worlds can therefore inform our understanding of mediated communication more broadly.

The study of voice in virtual worlds has practical importance as well, because without understanding the experience of users who communicate by voice, virtual world developers run the risk of failing to successfully implement these large, expensive systems.

Some voice implementations have already been rejected by users; thus it would appear that such failures of understanding have already occurred. For example, the voice system added to World of Warcraft by its developers has been rejected in favour of third-party voice systems (Street, 2011). The introduction of voice to Second Life, too, met with controversy (chapter 7). Many users reacted angrily to Linden Lab’s announcement of their intention to add voice. Some not only refused to use voice but even threatened to quit the VW altogether if voice was introduced (Boellstorff, 2008). This indicates that not only the voice sub-system but the virtual world itself is at risk if implementation goes awry.

Finally, VW communication mechanisms need to be usable by a large, heterogeneous user base, and the design space for voice is potentially broad and has only been superficially investigated. Therefore an understanding of how different configurations are received by different users conducting activities in different contexts will be of value.


1.5 Prior work

In reviewing prior research I have focused on two fields: the study of virtual worlds and the study of communication.

Research into virtual worlds has focused on a number of issues. Authors such as Turkle (1995) and Kendall (2002) were interested in the ability that MUDs afforded users to engage in presentation-management and role-play. Bartle conducted pioneering work on MUDs, proposing a framework of four user types: achievers, socializers, explorers and killers. Yee (2006) developed this for MMOG players, surveying large numbers in order to explore their demographics and motivations.

There has been research on how MMOG users behave while playing (Moore et al., 2007), interact with other users (e.g. Seay et al., 2004), and use, customize, and relate to their avatars (e.g. Ducheneaut et al., 2004, 2009). Ethnographers have studied the culture of MMOGs (Taylor, 2006; Nardi and Harris, 2006; Ducheneaut et al., 2006; Bainbridge, 2010; Nardi, 2010; Golub, 2010), and social worlds (Taylor, 1999; Boellstorff, 2008).

Communication research has compared media and examined how these influence people’s choice of medium and the communication they subsequently carry out. The ‘social presence’ (Short et al., 1976) and ‘media richness’ (Daft and Lengel, 1986) theories proposed that the richer a medium (that is, the more it projects the social presence of users), the more effectively it should substitute for face-to-face interaction. Subsequent research has proposed other influences on media choice and use. The social-influence (Fulk et al., 1990) and critical-mass (Markus, 1990) frameworks proposed that an individual’s choice of medium is influenced less by the medium’s properties than by the choices of the individual’s collaborators. Walther (1996) argued that the reduced social presence of text-based computer-mediated communication provided an increased opportunity for management of self-presentation, which enabled hyperpersonal interaction: a more intense relationship than would be expected to occur offline. Research has shown that people wishing to engage in deception or ‘impression management’ (Carlson et al., 2004), or who are shy (Stritzke et al., 2004), might prefer a medium with low social presence such as text.


1.6 Gap

Voice is a relative newcomer to virtual worlds, and only limited research has compared modalities in this context. For example, Sallnas (2002) compared decision-making by VW users equipped with text, voice, or a video link. Nilsson et al. (2002) supplemented Active Worlds with a shared-audio system and studied workplace meetings held in the virtual world.

Gibbs et al. (2004) and Halloran et al. (2004) studied the use of voice by FPS players, finding that there were benefits, particularly concerning sociability and freeing up of hands to use game controls, but that these were situation-dependent. Williams et al. (2007) found that World of Warcraft teams who used both voice and text to communicate over a period of months liked and trusted each other more, and became happier and less lonely, than those who communicated only by text.

When my research began in 2006 it was uncertain what influence voice had on the experience of VW users, why it tended to provoke extreme like and dislike, how existing implementations might be better configured, how voice could interact with the spatiality of a VW or what new forms of interaction it might enable.

1.7 Research Question

Given this gap in knowledge, I chose as my research question:

RQ: How does voice influence the user experience of virtual worlds?

To answer this question I conducted four case studies designed to address the range of virtual worlds, situations and users discussed in section 1.2. My studies are outlined in the next section, and described in detail in chapters 4 through 7.

1.8 Approach

To detect patterns across the diversity of technologies, situations and usages that exist, I conducted a series of case studies (Cavaye, 1996; Yin, 2003) of communication in virtual environments. I employed a range of methods to study the influence of voice on user experience in these various scenarios. My methods are described in detail in chapter 3.

Understanding subjective experiences requires “accessing the meanings participants assign to them” (Orlikowski and Baroudi, 1991). This implies an interpretive approach which acknowledges that users interpret their own experiences and that the researcher in turn interprets what users report (Neuman, 2011). I gathered subjective data via individual and group interviews and diaries, but triangulated these against the results of observation, participant research, and quantitative analysis. Use occurred in natural settings such as homes and workplaces, except in one study which took place in a lab but was designed to be as naturalistic as possible (chapter 6).

My approach to case selection, data gathering and analysis was informed by grounded theory (Glaser and Strauss, 1967). The principle of theoretical sampling guided my choice of cases, particularly when moving from the first to the second pair of studies. My first two studies examined game worlds: after analysing these it was apparent that the appeal of voice was partly due to its support for real-time coordination of teams during fast-paced action. To test this I chose non-game worlds for the second pair of studies.

Participants were recruited from diverse populations (within the constraint of their needing to be available for interviews), and had a range of experience with the technologies in question. The technologies used in all cases were commercially available, except for one voice system which was under development during the study and has since been commercially released.

Study 1: Voice in massively multiplayer online games

My first study examined voice communication among people playing MMOGs. I arranged for three groups to play under different circumstances over a period of two months. All participants had some experience with MMOGs, and some had already used voice products in online games.

At the time of the study, the first MMOG with integrated voice facilities had recently been released. Two groups played this, while a third group played older MMOGs using third-party voice products. They played in their own homes under normal playing conditions, and kept diaries in which they recorded experiences of the use of voice in the game. Halfway through the study the participants were interviewed individually. At the end of the study they participated in focus-group interviews.


Participants were asked questions about whether they preferred voice or text, whether either modality was better suited to particular scenarios, and whether there were aspects of the voice interface they would like to see changed. The focus groups considered how they might use existing or imagined voice systems to deal with fictitious gameplay scenarios.

Study 2: Spatial voice in a team game

My second study examined the influence of spatially-propagated voice on users’ experience of a team combat game. At the time of the study, existing voice systems utilized a radio metaphor which assigned a channel to each team and allowed all team members to hear each other equally. I examined use of an experimental system in which the ability of one user to hear another was based on the proximity of their avatars in virtual space. Thus communication was explicitly integrated with the spatiality of the system.
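The contrast between the two configurations can be illustrated by asking how loudly one user hears another under each rule. The sketch below is illustrative only and does not describe the system used in the study: the function names, the `max_range` parameter and the linear falloff with distance are all assumptions made for the example.

```python
from math import dist


def radio_gain(speaker_team, listener_team):
    """Team-radio rule: team-mates hear each other equally; others hear nothing."""
    return 1.0 if speaker_team == listener_team else 0.0


def spatial_gain(speaker_pos, listener_pos, max_range):
    """Proximity rule: audibility depends only on the distance between avatars.

    Linear falloff to silence at max_range is an assumption of this sketch.
    """
    d = dist(speaker_pos, listener_pos)
    return max(0.0, 1.0 - d / max_range)


# Hypothetical scene: a distant team-mate and a nearby opponent of the speaker.
speaker = {"team": "red", "pos": (0.0, 0.0)}
team_mate = {"team": "red", "pos": (100.0, 0.0)}
opponent = {"team": "blue", "pos": (25.0, 0.0)}

mate_radio = radio_gain(speaker["team"], team_mate["team"])
mate_spatial = spatial_gain(speaker["pos"], team_mate["pos"], max_range=50.0)
foe_radio = radio_gain(speaker["team"], opponent["team"])
foe_spatial = spatial_gain(speaker["pos"], opponent["pos"], max_range=50.0)
```

Under these assumptions the distant team-mate is fully audible by radio but inaudible spatially, while the nearby opponent is inaudible by radio but partly audible spatially.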

Spatial voice was provided to a group of co-workers to use in a weekly game session. Participants were observed and asked to keep a diary in which they recorded their experiences. They used spatial voice for several months before taking part in a focus group.

Study 3: Collaboration around objects

My third study again addressed spatiality and communication. But instead of the 'macro' spatiality involved in coordinating a moving team, study 3 focused on the 'micro' level of collaboration around objects. Here the key problem is how to achieve mutual understanding of reference to locations and things.

My design combined the quasi-experimental approach of Hindmarsh et al. (2001) with methods from Kraut et al. (2002). I observed and interviewed small groups using voice to coordinate a building task in Second Life. I recorded their screen output and conversation for analysis. I focused on their verbal references to objects, and use of Second Life’s ‘virtual camera’. I discussed with participants the problems they faced and how they solved them. I checked my observations by discussing themes that emerged with expert users discovered in-world and in the SL forum.


Study 4: Voice in the ‘social’ virtual world Second Life

My final study examined the overall influence of voice on the social world Second Life. I interviewed users, convened an in-world discussion, analysed forums and blogs, and conducted participant research into both recreational and instrumental uses of SL. The participants used Second Life for a diverse range of activities that included socializing, teaching, business and art. This very broad range of use scenarios allowed me to compare findings with the first three studies and draw general conclusions about voice in virtual worlds.

The four cases are compared in table 2:

          Type of VW                          Voice propagation   Research methods

Study 1   Massively multiplayer online game   Radio               Interviews, focus groups, diaries

Study 2   First-person shooter                Spatial + radio     Observation, focus groups, diaries

Study 3   Social world                        Spatial             Observation, interviews, online ethnography, interaction analysis

Study 4   Social world                        Spatial             Interviews, focus group, online ethnography, participant research

Table 2: Overview of my four case studies

1.9 Outcomes and contributions

Study 1 showed that voice provides significant advantages for team coordination, and can make the MMOG user experience more sociable. However it increases the emotional intensity of user-to-user communication and can intensify the effects of griefing. Voice makes it difficult for users to maintain a ‘fictional social presence’, and to simultaneously manage physical and virtual contexts.


Study 2 found that while spatial propagation is a restriction over radio, it reduces channel-clutter and acts as a ‘filter for relevance’ by conveying to users only the utterances of fellow users who are nearby in virtual space. Making members of competing teams audible to each other creates new opportunities for tactics.

Study 3 showed that achieving successful reference to objects and locations remains difficult in a modern virtual world. Users engaged in object-focused work mark locations not by pointing but by using their avatars as ‘cursors’. At times they simply ‘parked’ their avatars. This calls into question the necessity for embodiment in avatars, and suggests that virtual worlds could instead allow users to represent their focus-of-attention directly on objects.

Study 4 highlighted the diversity of purposes and contexts for virtual world use. Users have difficulty managing multiple identities and conversations in both physical and virtual contexts. Choice of communication medium is of critical importance to VW users. The criteria by which they judge media cannot be reduced to a simple, single concern.

1.10 Conclusion

My research has allowed me to challenge several ideas regarding mediated communication in virtual environments:

• Voice is not straightforwardly superior to text in virtual worlds, any more than it is in physical environments. In fact it is problematic in many situations.

• Virtual worlds are not cut off from physical reality. The fact that users are embedded in both physical and virtual contexts has an important influence on their choice and use of communication modality.

• Avatars may not be essential for conveying social presence or supporting deictic speech.

In this chapter I provided an overview of virtual worlds research, illustrating the gap I hoped to fill, and described my research approach and findings. In the next chapter I review in greater detail prior research into virtual worlds, mediated communication and user experience.


Chapter 2

2. Review of related research

2.1 Introduction

In the previous chapter I presented an overview of my thesis topic: the influence of voice communication on the user experience of virtual worlds. In this chapter I review prior research from fields relevant to the topic.

To identify relevant literature and organize this chapter, I have first isolated the components of my research topic. I am interested in the intersection of three fields:

1. virtual worlds,

2. mediated communication, and

3. user experience.

I first highlight important phenomena from each of the fields, in sections 2.2, 2.3 and 2.4 respectively. Then in section 2.5 I consider how these phenomena might interact at the intersection of the fields. This allows me to identify a gap in knowledge and deduce my research question.

2.2 Virtual worlds

In section 1.2 I identified virtual worlds as a subset of the category of technologies known as collaborative virtual environments (CVEs). These multi-user technologies simulate a three-dimensional space containing virtual objects, some of which typically are representations of the users themselves (Aarseth, 2008; Nitsche, 2008). The users of a CVE can interact and communicate with each other, and may cooperate or compete to achieve goals.

Data about the contents of a CVE are stored on a server, to which users attach via client software. At any given moment the client renders for the user a visualization of the space and its contents: this must be done from a point in the space, so that the user can be considered to be ‘situated’ at that point. ‘Moving’ in the virtual space essentially means changing the location of this point, or the direction in which one looks through it.
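The notion of being ‘situated’ at a point can be made concrete with a small sketch. This is a simplified two-dimensional illustration under my own assumptions, not the data model of any particular CVE: the class name `Viewpoint` and its fields are hypothetical.

```python
from dataclasses import dataclass
from math import cos, radians, sin


@dataclass
class Viewpoint:
    """The point from which a client renders the space for one user."""
    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0  # 0 means looking along the +x axis

    def turn(self, degrees):
        """Change the direction in which the user looks."""
        self.heading_deg = (self.heading_deg + degrees) % 360

    def step(self, distance):
        """'Move' by relocating the point along the current heading."""
        self.x += distance * cos(radians(self.heading_deg))
        self.y += distance * sin(radians(self.heading_deg))


v = Viewpoint()
v.step(2)    # two units along +x
v.turn(90)   # now looking along +y
v.step(3)    # three units along +y
```

After these calls the viewpoint sits at approximately (2, 3) with a heading of 90 degrees; nothing else about the user has ‘moved’.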

This location and orientation in virtual space is typically made visible to other users of the space by representing the user graphically as an avatar. Avatars may be human-shaped but this varies according to the theme of the VW. This mutually-visible, embodied representation is designed to allow users to simulate offline social interactions that involve the negotiation of space, such as conversation groups and team-based combat, to choose two common examples. Avatars afford their owners considerable plasticity of appearance, and therefore pseudonymity or anonymity (Bente et al., 2008). Part of the appeal of using an avatar is the control it gives a user over their social presence within the system. However this depends on the system and the type of use people make of it.

Avatars also exhibit behaviour. Users of PC-based virtual worlds control their avatars’ movements and gestures through deliberate control of the mouse and keyboard. On the other hand some advanced, immersive CVEs running on specialized hardware can detect user gestures and even facial expressions. Hence in these “an avatar is the model that is rendered on the fly to reflect the user’s behaviour” (Bailensen et al., 2006). Opinions differ as to whether fidelity to the user’s movements or plasticity is the more desirable. Schroeder (2010) felt that the desirability of fidelity depends on whether a CVE is to be used for instrumental or leisure use, and regarded fidelity and plasticity as suited to immersive and PC-based systems respectively (p. 22).

Many of the issues relevant to current virtual worlds were first identified in early CVE research, and I review this first.

2.2.1 Early collaborative virtual environments

CVEs evolved out of experiments with interactive 3d graphics (e.g. Fisher et al., 1986), vehicle simulations such as pilot trainers, and systems for the remote operation of machines in oceanic, Antarctic and extra-terrestrial exploration (Ellis, 1994).

The first virtual reality systems simulated an artificial environment for one user, and were characterized by the use of a head-mounted display (HMD), which displayed the environment in such a way that the user’s head movements caused their view to change, as happens when viewing the physical environment. Some VW systems supported the use of data-gloves for haptic interaction and hand gestures (Snowdon et al., 2001). Some users of head-mounted displays encountered problems with disorientation and nausea (Ellis, 1994). Other research issues included interactivity, visual realism, speed of performance, and immersion. Ellis (1994) suggested applications of VR for surgery, robot control, data visualization, and entertainment. VR systems were typically used by one person at a time. The collaborative system of Codella et al. (1992) supported two users who were represented to each other simply as hands: interaction was limited to joint manipulation of objects.

Subsequent CVEs have typically represented users as humanoid avatars. Researchers studying avatar use have been interested in whether these assist in providing “peripheral awareness, informal meetings and learning by watching” (Jaa-Aro and Snowdon, 2001).

A number of graphical multi-user environments were constructed for research purposes during the 1990s: prominent examples include MASSIVE and DIVE.

2.2.2 MUDs

Graphical virtual worlds were preceded by MUDs, and inherited themes and user culture from them. MUDs (multi-user dungeons) are networked games that simulate a space within which users move and communicate; however the space is non-Euclidean and is portrayed with words rather than with graphics (Bartle, 2008).
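A non-Euclidean space of this kind can be pictured as a network of rooms joined by named exits, where exits need not be geometrically consistent. The sketch below is an illustration under my own assumptions rather than the design of any actual MUD: the room names and the `walk` helper are hypothetical.

```python
# Each room maps exit names to destination rooms. Nothing forces the exits
# to agree with one another geometrically.
rooms = {
    "hall": {"north": "library"},
    "library": {"south": "cellar"},  # going back south does not reach the hall
    "cellar": {"up": "hall"},
}


def walk(start, directions):
    """Follow a sequence of exit names; stay put when an exit does not exist."""
    room = start
    for direction in directions:
        room = rooms[room].get(direction, room)
    return room


end_room = walk("hall", ["north", "south"])
```

Here going north and then south from the hall ends in the cellar rather than back in the hall, which could not happen in a Euclidean layout.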

The textual medium afforded pseudonymity and a significant capacity for presentation-management, allowing MUD users to present as fictitious characters and to play roles:

Users are not required to ever present their RL identities. [They] may therefore perform actions they might not in a face-to-face encounter. … interlocutors decide for themselves the degree to which they wish to reveal facets of their identities and are not limited to enacting their RL roles. [They] are more aware of the presentation of the self and have more control over whom others perceive them to be. … The anonymity of computer-mediated communication in [MUDs] supports playful experimentation with one's identity, and ultimately mastery over presenting the self to others in a virtual environment (Raybourne, 2001).

Bartle, a pioneering developer of MUDs, argued (1996) that the primary reason for playing a MUD was to role-play in such a fashion, in order to develop one’s own (real life) character. Bartle recognized different types of players with different styles of use: achievers, socializers, explorers and killers, and argued that individual players enacted a “career” during which they shifted from one style to another.

Schiano (1999) studied the large-scale MUD LambdaMOO, finding that users emphasized social interaction over navigation of the space.


Kendall’s (2002) ethnography of the BlueSky MUD used the metaphor of a ‘virtual pub’ whose pseudonymous patrons maintained an informal sociality:

As usual around lunchtime, the bar is crowded. A few people sit singly at tables, but most sit in small groups, often milling around from table to table to chat with others. As in many such local bars and pubs, most of the regulars here are male. Many of them work for a handful of computer companies in a nearby high-tech industry enclave. The atmosphere is loud, casual, and clubby, even raucous. Everybody knows each other too well here to expect privacy at any of the tables. (p. 2)

MUD users multi-tasked as they divided their attention between physical and (possibly several) virtual spaces. Many utilized more than one character or MUD, engaging in multiple ‘presentations of self’. However Kendall, citing Goffman (1959), pointed out that:

“people also engage in different presentations of self to different audiences in other arenas of everyday life and did so before online forums existed” (p. 9).

Kendall found that the MUD was a male space in which “people enact and negotiate masculine identities” (specifically, the ‘computer nerd’ style of masculinity). She cautioned against earlier utopian views of the Internet as a system offering infinite possibility for identity construction (e.g. Turkle, 1995), arguing instead that MUD users bring their offline gender, race and class identities into the virtual world.

Despite the MUDs’ text-only representation, users were able to achieve rich, long-term interaction. Churchill and Bly (1998) studied a group of co-workers who used a MUD to communicate within a workplace over a period of three years:

Our studies indicate that visually oriented media richness is not a prerequisite for the creation of sufficient social co-presence for maintaining collaborative relationships. Such social co-presence seems to reside in shared goals and understandings which derive from conversations around a common focus. (Churchill and Bly, 1998)

The low fidelity of textual representation gave MUDs an advantage over video- and audio-based media-space technologies with regard to privacy. Churchill and Bly found that co-workers valued being able to control the information about themselves and their surroundings that was transmitted into the shared medium:

The fact that one actually cannot see or hear what is really going on in others’ offices offers a significant advantage to the MUD. The impression a participant wants to share with others is determined by that participant and not by the physical or audible context. The MUD offers cognitive co-presence but not physical or visual co-presence. (Churchill and Bly, 1998)

The authors recognized that this conclusion “flies in the face of many theories of media richness”.
