
3.2 Multimodal Data Collection and Analysis

3.2.3 Multimodal Interaction Analysis

Two sequences of interaction, performed in EyeCVE and captured using the multimodal data collection method detailed above, are now presented. The sequences are extracted from data sets collected during the forthcoming experimental work. The first analysis presents a dyadic conversational scenario, and features data extracted from the truth and deception experiment detailed in Chapter 6. The second analysis examines an object-focused task featuring three users, and is extracted from the data collected from the experiment presented in Chapter 5. The two examples are typical of the type of collaborative interaction performed in EyeCVE. The segments are analysed similarly to traditional interaction analysis: by identifying significant moments of interaction, looking for causal relationships between observed behaviour, and, through verifiable observation, making judgements of how the interaction may have occurred [JH95]. For clarity, each analysis only considers metrics that are insightful with regard to describing the unfolding interaction. These metrics vary accordingly between the conversational and object-focused cases.

Conversational Scenario

Table 3.1 shows a sequence of interaction drawn from Chapter 6’s truth and deception experiment data. Six lines from the collated log file are included in the table, each one pertaining to a significant moment during the short interaction. Each line’s time of occurrence is stated in the leftmost column, which, for ease of reading, has been reset to 0 at the first line and ends 7.33 seconds later at the sixth. The remaining columns refer to: markup data (input by the experimenter as the action unfolds); head direction (the object in the VE directly in front of the participant’s head); gaze (the object in the VE at which the participant is looking); pupil size (ranging between 0 indicating full constriction, and 1 indicating full dilation); whether the participant is currently blinking or not (1 indicates blink, 0 indicates no blink); and whether the participant is currently speaking or not (1 indicates that they are talking, 0 indicates that they are silent).

Table 3.1: Selected data from an interaction sequence taken from the truth and deception experiment documented in Chapter 6. Particular lines taken from the log file have been chosen to highlight significant moments during the interaction. In this sequence, the participant is required to lie in response to questions issued by a partner. In this case, they are asked “What is your first name?”

Time   Markup            Head           Gaze            Pupil Size   Blink   Voice
0.00   General/Lie/Q01   Partner-Body   Grid-E2         0.08         0       0
1.81   Question issued   Partner-Face   Partner-Face    0.27         1       0
2.65   Answer start      Grid-D4        Grid-B4         0.39         0       1
5.17   Answer end        Partner-Body   Partner-Face    0.20         0       1
5.65   -                 Partner-Body   Partner-Arm-R   0.31         0       0
7.33   General/Lie/Q02   Partner-Face   Partner-Face    0.13         1       0

The sequence begins at T=0.00. The “General/Lie/Q01” markup data indicates the experimental stage, in which the participant is about to be asked the first of a series of general questions, to which they must respond deceptively, by lying. At this time, the participant is gazing a little away from their partner (the questioner), but their head direction indicates that general attention is focused on the questioner. The participant’s pupil size is close to normal given the environmental luminance levels, indicating that they are relaxed, and neither cognitively nor emotionally aroused. The participant is not talking at this time. At T=1.81, the markup data indicates that a question has been issued to the participant (in this case the question is “What is your first name?”). Both head and gaze direction indicate focus of attention on the questioner’s face, and a slight increase in pupil size may indicate arousal. At this time, the participant also begins to blink. At T=2.65, the participant begins to answer the question, directing both gaze and head direction away from the questioner. The participant’s pupils dilate significantly as the talk begins, suggesting cognitive load. At T=5.17, the participant finishes delivering the verbal answer, and returns head direction and gaze to the questioner. The participant’s pupils are now less dilated. Shortly after, at T=5.65, the participant again averts gaze from the questioner’s face, fixating downwards on the right arm. Tellingly, pupil size increases again, suggesting that the prior eye contact following the lie-telling evoked the participant’s arousal, possibly negatively, due to social discomfort. Finally, at T=7.33, the markup data indicates that the current question is over, and the next is soon to be issued. The participant’s pupil size has almost returned to normal, indicating a more relaxed state. A blink occurs, and gaze indicates that the participant’s attention is again focused on the questioner.
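The reading of the conversational segment given above relies on comparing pupil size between successive significant moments. As an illustration only, the following Python sketch encodes the six rows of Table 3.1 and computes the change in pupil size from one row to the next. The `Sample` record, its field names and the `pupil_changes` helper are assumptions introduced for this example; the on-disk format of the EyeCVE log file is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One line of the collated log, using the columns of Table 3.1."""
    time: float   # seconds since the start of the segment
    markup: str   # experimenter markup ("-" when none was entered)
    head: str     # object directly in front of the participant's head
    gaze: str     # object the participant is looking at
    pupil: float  # 0 = full constriction, 1 = full dilation
    blink: int    # 1 = blinking, 0 = not blinking
    voice: int    # 1 = speaking, 0 = silent


# Rows copied from Table 3.1.
TABLE_3_1 = [
    Sample(0.00, "General/Lie/Q01", "Partner-Body", "Grid-E2", 0.08, 0, 0),
    Sample(1.81, "Question issued", "Partner-Face", "Partner-Face", 0.27, 1, 0),
    Sample(2.65, "Answer start", "Grid-D4", "Grid-B4", 0.39, 0, 1),
    Sample(5.17, "Answer end", "Partner-Body", "Partner-Face", 0.20, 0, 1),
    Sample(5.65, "-", "Partner-Body", "Partner-Arm-R", 0.31, 0, 0),
    Sample(7.33, "General/Lie/Q02", "Partner-Face", "Partner-Face", 0.13, 1, 0),
]


def pupil_changes(samples):
    """Return (time, gaze, pupil change since the previous row) for each row."""
    return [
        (cur.time, cur.gaze, round(cur.pupil - prev.pupil, 2))
        for prev, cur in zip(samples, samples[1:])
    ]


for time, gaze, delta in pupil_changes(TABLE_3_1):
    print(f"T={time:.2f}  gaze={gaze:<13}  pupil change={delta:+.2f}")
```

The positive changes at T=1.81, T=2.65 and T=5.65 correspond to the dilations discussed in the analysis; interpreting them as arousal or cognitive load remains a judgement for the analyst rather than something the sketch decides.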

Object-Focused Scenario

Table 3.2 shows a sequence of three-party object-focused interaction drawn from the experiment documented in Chapter 5. The task revolves around constructing a cube from eight smaller cubes, so that each face of the finished cube consists of a single colour. In an experimental scenario, the collaborative interaction generally consists of working together to find a specifically-coloured cube, picking it up, and placing it in its correct position. Seven lines from the log file are presented, each one pertaining to a significant stage during the interaction. Each line’s time of occurrence is indicated in the leftmost column, reset to 0 at the first line and ending 9.32 seconds later at the seventh. The key difference between this and the above analysis segment is the inclusion of hand tracking data, used to signify the name of any currently-grabbed object, while pupil size and blink data are omitted.

Table 3.2: Selected data from an interaction sequence taken from the object-focused experiment documented in Chapter 5. In this sequence, the tracked participant is solving a simplified Rubik’s Cube puzzle with two partners.

Time   Markup           Head      Gaze           Hand      Voice
0.00   Search           Wall-01   Cube-04        -         0
2.17   Grab and Query   Cube-04   Partner-Body   Cube-04   1
4.81   Search           Wall-03   Cube-02        -         0
5.65   Search           Cube-02   Cube-05        -         0
6.17   Query            Cube-05   Partner-Face   -         1
7.48   Grab             Cube-05   Cube-05        Cube-05   0
9.32   Position         Cube-05   Cube-05        -         1

At T=0.00, the participant is searching for a particular cube, and appears to be examining “Cube-04”. Soon after, at T=2.17, the participant has grabbed “Cube-04”. The participant’s head is oriented toward the grabbed object, but gaze is directed at their partner’s body. Voice data indicates that the participant is speaking. In the context of the experiment, this is likely to be querying the correctness of the currently-grabbed cube for the intended placement position in the puzzle. This initial choice appears to have been incorrect as, at T=4.81, the participant has released “Cube-04”, and is again searching the VE, presumably for the correct cube. At T=5.65, the participant’s gaze falls on “Cube-05”, and this time, caution is exerted before grabbing as, at T=6.17, gaze is directed to their partner’s face, and another vocal query is uttered. At T=7.48, following what appears to have been an affirmative response, “Cube-05” is grabbed, and subsequently positioned at T=9.32 where the segment ends.
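The contrast between the two grabs in this segment (one made while querying, one made only after querying) can be expressed as a simple pass over the rows of Table 3.2. The sketch below is a hypothetical heuristic written for illustration, not a method used in the experiments: it treats a row as a "query" when the participant is speaking while gazing at a partner with nothing in hand, and reports for each grab whether such a query preceded it. The `Row` type and `grabs_with_prior_query` function are assumed names, and the voice flag says nothing about what was actually said.

```python
from typing import NamedTuple


class Row(NamedTuple):
    time: float
    markup: str
    head: str
    gaze: str
    hand: str   # name of the grabbed object, "-" when nothing is held
    voice: int  # 1 = speaking, 0 = silent


# Rows copied from Table 3.2.
TABLE_3_2 = [
    Row(0.00, "Search", "Wall-01", "Cube-04", "-", 0),
    Row(2.17, "Grab and Query", "Cube-04", "Partner-Body", "Cube-04", 1),
    Row(4.81, "Search", "Wall-03", "Cube-02", "-", 0),
    Row(5.65, "Search", "Cube-02", "Cube-05", "-", 0),
    Row(6.17, "Query", "Cube-05", "Partner-Face", "-", 1),
    Row(7.48, "Grab", "Cube-05", "Cube-05", "Cube-05", 0),
    Row(9.32, "Position", "Cube-05", "Cube-05", "-", 1),
]


def grabs_with_prior_query(rows):
    """Return (time, object, queried_first) for each newly grabbed object."""
    results = []
    queried = False      # a hands-free spoken query has occurred since the last grab
    previous_hand = "-"
    for row in rows:
        if row.hand != "-" and previous_hand == "-":
            results.append((row.time, row.hand, queried))
            queried = False
        elif row.hand == "-" and row.voice == 1 and row.gaze.startswith("Partner"):
            queried = True
        previous_hand = row.hand
    return results


print(grabs_with_prior_query(TABLE_3_2))
```

Under these assumptions the grab of "Cube-04" is flagged as made without a prior query and the grab of "Cube-05" as made after one, matching the reading of the segment given above.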


3.2.4 Summary

In summary, this section presented an approach to multimodal data collection, and presented two sample analyses from conversational and object-focused experimental scenarios. During analysis of the conversation segment, the participant’s state of arousal and cognitive load were inferred from behavioural data, particularly from pupil size in combination with gaze. The collation of several data streams in a single log file preserves the temporal interrelationships between components of recorded behaviour, and is critical in preserving the context of interaction. In this case, the behaviour of establishing mutual gaze, and the response of increased pupil dilation to the, perhaps uncomfortable, situation of lying to another, is observable in the data. Analysis of the object-focused task centred on hand action in combination with both verbal and gaze behaviour to uncover the logistic process of the task. In this case, the benefit of being able to reference tightly-synchronised elements of both the visual and aural telecommunication is evident when attempting to explicate sequences of interaction. For instance, the verbal and hand tracking data can elucidate how, following an initial grabbing error, the participant learns not to grab a cube until the correctness of that action is confirmed, thus deploying a successful repair strategy in order to remedy a prior mistake.

A limitation of the documented approach to data collation for interaction analysis is that a log file relates only to a single user. Thus, data collation must be generated on a per-user basis at each local site. There is certainly scope for remote users to be represented in a site’s log file, but the included data must be processed wisely, taking into account latency and subsequent synchronisation with the local user’s recorded data. For instance, a low-cost and high-benefit addition to the recorded interaction would be the binary state of mutual gaze between two interactants. In contrast, it would be unwise to stream full gaze data from a remote user for logging.
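To illustrate why the binary mutual gaze state is a low-cost addition, the following sketch assumes that the remote site transmits a single flag stating whether the remote user is looking at the local user's face, which the local site combines with its own gaze target. The function names and the use of the "Partner-Face" label (as it appears in Tables 3.1 and 3.2) in a dyadic setting are assumptions made for this example, and the sketch deliberately ignores the latency and synchronisation issues noted above.

```python
FACE_TARGET = "Partner-Face"  # gaze label as it appears in Tables 3.1 and 3.2


def gaze_on_partner_face(gaze_target: str) -> bool:
    """True when a user's logged gaze target is the partner's face."""
    return gaze_target == FACE_TARGET


def mutual_gaze(local_gaze_target: str, remote_looking_at_local: bool) -> int:
    """Combine the local gaze target with one remote bit into a 0/1 log field."""
    return int(gaze_on_partner_face(local_gaze_target) and remote_looking_at_local)


# The remote site only needs to send one bit per sample, not full gaze data.
print(mutual_gaze("Partner-Face", True))    # both users look at each other's face
print(mutual_gaze("Partner-Arm-R", True))   # local user has averted gaze
```

In a three-party session the flag would need to identify which partner is being looked at, which this dyadic sketch does not attempt.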

It must be noted that the logging process writes 60 lines per second to the output file. Hence, in order to distil an interaction as presented in the two examples above, a significant amount of processing must be performed prior to interaction analysis. However, due to the consistent format of the log file, the majority of this computation may be automated. Additionally, in order to be certain of the context in which a recorded interaction took place, it is often necessary for the log file to be referenced against an audio or visual replay of the performance. To this end, two solutions are available when using EyeCVE. Firstly, the aural component of interaction is recorded in the Ogg Vorbis [Mof01] bitstream format, which supports multiplexing of a number of separate codecs including audio and text. Alongside the multiparty audio communication, a textual time-stamp, matching that of the main collated log file, is embedded. Secondly, the suite of applications related to EyeCVE includes a log file player, which is able to reconstruct and replay the virtual action recorded in a log file. The log file player allows a session to be replayed, paused, and randomly accessed, and is also capable of visualising additional data, including gaze targets as the interaction proceeds. Finally, the player application operates on standard desktop displays, and also in immersive CAVE systems, the latter enabling a free first-person viewpoint and perspective rendering in the spatial VE in which the original interaction took place. In this way, an analyst can be a bystander to a pre-recorded interaction similarly to a video replay, but with the critical advantage of an adjustable camera viewpoint.
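The processing applied to the 60 lines-per-second log is not detailed here, but one plausible form of automation is to keep only the lines at which a categorical field changes. The sketch below shows this under stated assumptions: samples are represented as dictionaries with hypothetical keys mirroring the table columns, and continuous values such as pupil size are excluded from the comparison because they would differ on almost every line.

```python
def distil(samples, fields=("markup", "head", "gaze", "voice")):
    """Keep only samples whose state in `fields` differs from the previous sample."""
    kept = []
    previous_state = None
    for sample in samples:
        state = tuple(sample[name] for name in fields)
        if state != previous_state:
            kept.append(sample)
            previous_state = state
    return kept


# Two seconds of synthetic 60 Hz data: gaze moves to the partner's face at 1 s.
RATE_HZ = 60
log = [
    {
        "time": i / RATE_HZ,
        "markup": "-",
        "head": "Partner-Body",
        "gaze": "Grid-E2" if i < RATE_HZ else "Partner-Face",
        "voice": 0,
    }
    for i in range(2 * RATE_HZ)
]

events = distil(log)
print(len(log), "lines reduced to", len(events))
```

Here the 120 synthetic lines reduce to two: the first line and the line at which gaze changes. Selecting which of the surviving lines count as significant moments would still be the analyst's decision.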

Throughout the forthcoming experimental research, a variety of analytical methods, with varying aims, are employed. The multimodal approach as documented in this section provides precise timing of a range of data sources, preserving the holistic temporal characteristics and emphasising the causal interrelationships between the various tracking streams. More so perhaps than other methods of analysis, an objective understanding of users’ intent and state of arousal may be gained, giving a broad picture of how the interaction unfolded. More generally, through this kind of analysis, communicational failures related to the medium of AMC in ICVEs may emerge. Whether arising from technical or human-centred factors, critical bottlenecks restricting high quality virtual telecommunication may then be addressed.

3.3 Chapter Summary

The research presented in this thesis consists of three telecommunication experiments, and a collection of smaller-scale investigations relating to behavioural modelling and associated experiments. The telecommunication experiments investigate the impact of eye tracking, both to replicate and analyse oculesic behaviour in AMC, while the behavioural modelling work is focused on simulations of oculesics.

The first part of this chapter detailed the ICVE system, EyeCVE, which acts as the primary experimental platform for the forthcoming work. Alongside the standard head and hand tracking common to ICVE systems, EyeCVE integrates head-mounted eye tracking capable of capturing a wearer’s oculesic behaviour, including gaze, blinks, and pupil size. The evaluation of EyeCVE demonstrated its ability to support tracked gaze AMC while allowing physical movement within an immersive VR system such as the CAVE. The second part of this chapter documented an approach to multimodal data collection in a single log file. Using all available tracking devices in EyeCVE, the logging ability of the Viewpoint eye tracking system is employed in order to collate an array of input streams in a temporally accurate format that is both human-readable and conducive to post-processing and analysis. Demonstrating the technique, interaction analyses of two sequences of AMC recorded in EyeCVE were then presented. Analysis emphasised the ability of collated input streams to elucidate interrelationships between components of captured behaviour, providing insight into an individual’s state of attention and arousal, together with logistic elements of interaction and repair strategy.

In document Eye tracking and avatar-mediated communication in immersive collaborative virtual environments (Page 106-110)