
2.6 Designing Remote Gesture Tools for Collaborative Physical Tasks

2.6.3 Developing DOVE (Drawing Over Video Environment)

Initial investigations by Kraut, Miller and Siegel (1996) sought to explore how people engaged in tasks could be supported with the help of remote experts. By engaging study participants in a bicycle repair task (for which they were provided with an instruction manual), the effects of performing the task alone or with the support of a remote expert could be observed. The study manipulated both the presence of the remote expert and, if present, the means by which they communicated with the Worker performing the repair task. Contrary to their expectations, Kraut, Miller and Siegel (ibid) observed that providing a video link between the Helper and the Worker (such that the Helper could see the Worker’s task space) did not improve performance beyond the levels observed when the Helper and Worker communicated via an audio link alone. The presence of the visual link between the spaces did, however, have an effect on the pattern of communication between the collaborators, but this change in communication was not reflected in a change to performance times. Whether or not a Helper took part in the Worker’s task did, however, influence the performance times incurred, greatly improving success.

This original study was reprised by Fussell, Kraut and Siegel (2000), who sought to extend the earlier work by addressing its lack of an adequate control condition through the introduction of a side-by-side expert help condition. Further, an attempt was made to counter any bias that may have crept in from the experts previously using almost scripted language. An additional variable was also included, the expertise of the remote Helpers, to ascertain whether the level of expertise of the Helper made any significant difference to the interaction and to the effect of the various communications technologies. The results of the study demonstrated that, regardless of Helper expertise (which did not affect performance in the task), side-by-side collaboration was faster than the remote collaboration conditions; this was achieved without a reduction in the quality of the work. The assessment of work quality, along with the quality of Worker and Helper communication, was made by expert observers, which may have left the results open to experimental bias. Despite this, conclusions were drawn that side-by-side dialogues are significantly more efficient. On the basis of the experimental results four limitations of video-mediated visual spaces were suggested: 1) Workers’ queries suggested that they were uncertain of the field of view of the Helper; 2) Helpers’ views were in fact less than optimal, as important features of the work space often fell outside their normal view; 3) Helpers had no access to Workers’ faces, which may have hampered the Helpers’ understanding of the Workers’ comprehension of verbal instructions; and finally 4) Workers’ views of the Helpers were limited to upper body images, thus preventing the Helper from effectively gesturing at shared objects. These observations concerning the limitations of the existing visual space led to several suggestions for video system design: 1) the provision of better feedback to the Worker about what is perceptible (in terms of view) to the Helper; 2) the provision of a wider field of view for the Helpers; 3) the provision of feedback to Helpers about the Worker’s attentional focus; and finally 4) support for Helpers in gesturing within the shared visual space.

These first two seminal studies were re-presented and evaluated in a further paper by Kraut, Fussell and Siegel (2003). Again considering bicycle repair as an exemplar of a task that might require expert support, the paper decries the fact that most groupware systems support activities that can be performed without reference to external objects and the external spatial environment. The paper argues that the “Development of systems to support collaborative tasks involving physical objects has been much slower.” (p.15). On the basis of this the paper introduces the notion of collaborative physical tasks as:

“Tasks in which two or more individuals work together to perform actions on concrete objects in the three-dimensional world.” (p.15)

And specifically in this instance:

“Collaborative physical tasks can vary along a number of dimensions, including number of participants, temporal dynamics, and the like. The task on which we focus here, a bicycle repair task, falls within a general class of ‘mentoring’ collaborative physical tasks, in which one person directly manipulates objects with the guidance of one or more other people, who frequently have greater expertise of the task.” (p.15)

The research interest of the CMU group is defined as being primarily concerned with the provision of visual information to support tasks. They argue that this can be used to improve situational awareness of a task (Endsley, 1995) and to aid conversational grounding (Clark, 1996). An interesting argument is put forward that situational awareness and conversational grounding are developed in face-to-face settings using a variety of behavioural expressions and interpretations (the work of Robertson, 1997, is perhaps the most literal interpretation of how such physicality is construed in the structuring of collaborative environments). Kraut, Fussell and Siegel (2003) argue that, due to the constraints of bandwidth and the difficulties of representing all of this information coherently (as per Gaver et al 1993), such elaborate environments cannot be constructed. They therefore claim:

“Our approach is instead to try to identify the critical elements of visual space for collaborative physical tasks and to design video systems that support these critical elements.” (p.16)

Their analytical strategy was to take a decompositional approach and systematically evaluate the various elements that might influence communicative behaviour when engaged in collaborative physical tasks. However, given that the original interests of the group had been the exploitation of video-mediated communications, this early work shows the first hint of a locking in of video windows as a permanent fixture in their communication system infrastructure: the work becomes an effort to extend the functionality of video-mediated communication, rather than a direct exploration of techniques to link spaces. The work is resolutely situated within the theoretical framework provided by Clark and Brennan (1991), in which the affordances of communication media affect the ease of maintaining task awareness and establishing common ground. Given the argument that various media hold different costs for the grounding process, assumptions were made about the suitability of various elements of shared visual spaces to support collaboration. The work of Kraut, Fussell and Siegel (2003) highlights the importance of object-centred shared visual spaces in object-focussed tasks (as typified by collaborative physical tasks), referencing Karsenty (1999, with her study of shared computer screens for problem solving), Gaver et al (1993, with their analysis of the usability of multiple video windows demonstrating that object views were viewed most often) and Nardi et al (1993, whose study showed how nurses use monitors during surgery to view the current stage of the surgical procedure and find tools in advance).

In the type of interactions studied by Kraut et al the role of Helper was seen to be constructed of several phases of action. The first phase was the determination of what help was needed. Secondly, the help must then be provided, during which the Helper must coordinate their utterances with those of the Worker, the Worker’s actions and the current state of the task. From their observations of this process being enacted with the Helper either side-by-side with the Worker or at a distance (but linked through varied communications media), several general conclusions were drawn. The first was (as stated previously) that the provision of expert help is a positive enhancement to performance. However, despite differences in the articulation work that is performed when a video image is shared, a video representation fails to effectively improve performance over audio-only support. On the basis of this conclusion strong claims were made that side-by-side collaboration is superior because of the way in which it supports natural deictic communication, by actively supporting gestural behaviour. This, it was argued, must be supported in later systems so as to reduce the time required to achieve grounding. In more mediated conditions more time and resources were spent acknowledging (back-channelling) instructions. In the side-by-side collaboration, communication from the Helper was far more directive. No understanding was provided by Kraut et al, however, of the relative perceptions that participants have of this more directive approach and the impact that this might have on longer term patterns of collaboration.

The paper goes on to suggest that video-audio links may have failed to improve performance beyond that achievable by audio-only links because of the lack of a head shot of the Worker, meaning that Helpers found it harder to determine whether Workers had understood instructions. The paper counters this supposition by citing Whittaker and O’Conaill (1997), who had previously shown that such information was rarely useful to collaborators. Their final conclusion then rests with the limitations observed in the Helper using a camera mounted on the Worker’s head, which necessarily restricts their field of view to that which the Worker is looking at. Allied to this is a limited understanding on the part of the Worker of exactly what of their view the Helper can actually see. This reciprocal awareness of mutual perspectives is considered a sizeable problem to be overcome, which the authors argue may be answerable in part by providing enhanced access (as perhaps is provided in side-by-side collaboration) to collaborators’ gaze patterns. Some of these various issues were addressed in subsequent papers.

Fussell, Setlock, and Parker (2003) used eye tracking techniques to assess where Helpers look as they are providing assistance to a Worker during collaborative physical tasks. The results of the study suggested that Helpers did not look at the Workers’ faces but did look heavily at their hands, the pieces being manipulated and the developing assembled piece. Whilst the results provide value for those wishing to develop technologies to support remote collaboration, there are several problems with the study. Firstly, it is not made clear whether the pair are co-present or using some intervening technology. Multiple video windows rather than side-by-side collaboration may have led to more use of face views. Secondly, Worker responses to Helper instructions were also scripted, making for a highly unusual interaction which would not conform to most standards of free flowing, collaborative, task-focussed discourse. Finally, a large proportion of glances were reportedly made toward the instruction manual, but if the Helper were a true expert then this resource might not be used, and a significant amount of ‘gaze time’ would therefore need to be distributed elsewhere. This may end up being focussed, in the absence of other requirements, on the Worker’s face so as to more securely confirm understanding. Whilst not strictly critical to the task, it may be preferable.

In a further study to test the benefits of providing simplified remote gesturing behaviours, Fussell, Setlock, Parker and Yang (2003) compared performance in side-by-side, video-audio and video-audio plus cursor instruction conditions. The results demonstrated that performance is better in side-by-side collaboration, and that the addition of cursor information does not improve performance over video-only presentation. The self reports of participants suggested that the use of a cursor made the identification of objects easier, although side-by-side collaboration was still rated as the easiest format. The fidelity of the pointing achieved with a cursor on a video view may, however, be responsible for its lack of success. Pointing in the 3-D world is considerably more accurate and easily interpretable than pointing in three dimensions over a 2-D representation.

To further understand the visual requirements of the Helper in a collaborative physical task, Fussell, Setlock and Kraut (2003) compared collaborative performance when using scene oriented and head-mounted cameras. Five distinct collaboration conditions were compared: side-by-side, audio only, head camera, scene camera and finally scene camera plus head camera. The performance results illustrated that side-by-side collaboration is fastest (faster than all other conditions). Performance when using the scene camera was faster than audio only, but was not significantly faster than performance when using the head-mounted camera, despite the conclusions the authors attempt to draw. The proposed difference between the head camera and the scene camera is somewhat controversial, with the only real difference being that the head camera views a subset of what is available in the scene camera (which could presumably be rectified by changing the head-mounted camera for one with a more extreme wide-angled lens). It was interestingly noted, however, that the head camera may also provide some epiphenomenal information about current gaze awareness, as the centre point of the camera shot is clearly aligned with the Worker’s facing direction. This is potentially why the authors expected the head camera plus scene camera views to be superior, as they would provide context views with an indication of current attentional focus. But obviously this plays into the trap of dividing the attention of the Helper between multiple windows (as discussed by Gaver et al 1993). That the head-mounted camera appeared to offer no real advantage is perhaps not surprising given that its ability to provide orientational information and a tight focus on task artefacts was potentially rendered ineffective. The very set-up of the scene camera may have incidentally supported the implicit development of awareness of the Worker’s orientation, as part of their head was captured in the image, the angle of the head therefore providing some gross orientational information. Considering the large size of the pieces required for assembly, a fine-detailed close focus was not generally necessary. When situations did require a close focus, Helpers could negotiate this deficit by having Workers hold items up to the camera.

Reported subjective preferences were for the side-by-side condition and then, secondly, for the scene camera as the next best alternative, over a variety of measures. This is an interesting issue, however, as user preference is not necessarily the best indicator of performance; depending on the context of use, actual progress made in the tasks is potentially of considerably more value. There is a clear increase in communicative efficiency with use of the scene camera over the head camera. This has been discussed in previous research and is clearly predictable owing to the fact that with a head camera more work must be done to re-orient the visual image so it is suitable for the Helper.

The ultimate finding of the above studies was that, along with various considerations concerning the adequate establishment of a shared visual space connecting the Worker’s workspace to the Helper, the primary reason behind the inability of video connections to facilitate collaboration to the levels witnessed in side-by-side collaboration was the conspicuous suppression of naturally occurring gestural behaviours. Time and again observational analysis during these studies demonstrated that Helpers wanted to be able to point at objects on their video feed, but clearly what was required was more than simple deictic behaviours, as these had been shown to be of limited value during the interactions. As an answer to the problems highlighted in the above papers a system was built to perform the critical function of presenting remote gestural information whilst providing the wide angled, scene oriented views of the task space which were demonstrably required. The system was referred to as the Drawing Over Video Environment (DOVE) and was first presented and discussed in Ou et al (2003a, 2003b). Figure 2.18 below illustrates the DOVE system.

Figure 2.18 The Drawing Over Video Environment (DOVE) from Ou et al (2003a,b)

The DOVE system works by capturing a live video feed of the Worker’s task space via an IP camera. This video feed is then relayed to a remote Helper, who views the live images on a tablet PC. With the tablet PC the Helper is then able to write or draw, making marks over the live video feed with a digital pen. These resultant ‘gestural sketches’, along with the video feed, are then passed back to a monitor (VDU) located at the edge of the Worker’s task space. By looking up from their task artefacts and towards the monitor the Worker can then see the video image of their own task space with the gestural sketches overlaid. In some iterations of the system the sketches made by the Helper are normalised and corrected by software on the Helper’s tablet PC to conform to standard shapes (such as arrows and circles). The removal of the sketches from the video feed can only be effected by the Helper, and is either achieved manually by pressing a button on the tablet PC or, in later versions of the system, enacted automatically after a period of 3 seconds.
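The two erase behaviours described above (manual clearing and automatic removal after 3 seconds) can be illustrated with a minimal Python example. This is not the implementation reported by Ou et al; the names `Stroke`, `SketchOverlay`, `add_stroke`, `clear` and `visible_strokes` are hypothetical, and the example assumes that the 3 second timeout applies to each stroke individually, a detail the description of the system does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in video-frame pixel coordinates (assumed)


@dataclass
class Stroke:
    points: List[Point]
    created_at: float  # seconds, on whatever clock the caller supplies


@dataclass
class SketchOverlay:
    """Hypothetical store for the Helper's pen strokes (not DOVE's own code)."""

    # 3.0 mirrors the automatic erase described in the text;
    # None models the earlier, manual-erase-only behaviour.
    ttl_seconds: Optional[float] = 3.0
    strokes: List[Stroke] = field(default_factory=list)

    def add_stroke(self, points: List[Point], now: float) -> None:
        """Record a stroke drawn by the Helper at time `now`."""
        self.strokes.append(Stroke(list(points), now))

    def clear(self) -> None:
        """Manual erase, as if the Helper pressed the button on the tablet."""
        self.strokes.clear()

    def visible_strokes(self, now: float) -> List[Stroke]:
        """Drop expired strokes, then return those to draw over the current frame."""
        if self.ttl_seconds is not None:
            self.strokes = [
                s for s in self.strokes if now - s.created_at < self.ttl_seconds
            ]
        return list(self.strokes)


overlay = SketchOverlay()
overlay.add_stroke([(10.0, 10.0), (50.0, 40.0)], now=0.0)
overlay.add_stroke([(80.0, 80.0), (90.0, 95.0)], now=2.0)
print(len(overlay.visible_strokes(now=2.5)))  # both strokes are still within 3 s
print(len(overlay.visible_strokes(now=3.5)))  # the first stroke has expired
```

Passing the current time in explicitly, rather than reading a system clock inside the class, is a design choice made here only so that the expiry behaviour can be checked deterministically; a live system would presumably call `visible_strokes` once per video frame.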

Fussell et al (2004) provided the first full evaluations of the DOVE system. In a review article they presented two experiments: the first is a re-write and evaluation of the findings from Fussell, Setlock, Parker and Yang (2003) (see above) and the second is a comparison of performance in a collaborative physical task using DOVE versus video only communication. To help ground the studies in some research context Fussell et al (2004) cite studies such as Flor (1998), Goodwin (1996), Kuzuoka and Shoji (1994) and Tang (1991), which argue that speech and action are intricately related to various external elements (people, objects, activities) within the collaborative environment. On the basis of this and their own earlier work, the authors make a call for the inclusion of gestural information in technologies to support remote object-focussed collaborations, discussing how gestures can be used to enhance spoken messages (as observed in Bekker, Olson and Olson, 1995 and McNeill, 1992).

To understand further how gestural activity is situated within the context of a collaborative physical task, Fussell et al (2004) break down the structure of a common interaction in this class of task:

“First, collaborators come to mutual agreement upon or ‘ground’ the objects to be manipulated using one or more referential expressions. Next, they provide instructions for procedures to be performed on those objects. Finally, they check task status to ensure that the actions have had the desired effect.” (p.277)

Of course, this is a simplified structure which could be elaborated further: there are iterative cycles of interaction at each stage, and each stage has significant potential to suffer communicative breakdown should the interaction be insufficiently grounded, thus requiring a process of repair to be enacted (see discussion in chapter 1 for more detailed breakdown of a

In document Turn It This Way: Remote Gesturing in Video-Mediated Communication (Page 70-81)