Multimodal Interaction - User Interfaces for Cooperation

Interactions with mundane objects, our environment, and other people involve our entire body, not only hands and fingers. We walk towards places, we bend our bodies around obstacles, we look at points of interest, and we specify meaning with gestures and words. Perhaps the most complex combinations of actions are involved in our exchanges with each other. Our verbal expressions only reveal their full meaning in relation to the spatial and temporal context, our posture, our gaze, our facial expressions, our voice, and many other behavioral aspects. Computer workplaces, on the other hand, evolved from desk work, which is most often solitary and bound to the interaction capabilities of our hands. The combination of modalities with complementary capabilities promises more flexibility as well as higher interaction efficiency and fluency.

Martin distinguished five types of cooperation between modalities [61]. Transfer denotes that information provided through one modality is further processed through another one, e.g., to resolve inaccuracies of individual modalities. Equivalence means that a certain process can be realized through various modalities, which provides more flexibility for dynamically changing settings and situations. Specialization exploits the different capabilities of various modes. Interaction processes are thus performed with the most suitable available modality to increase efficiency. Cooperation can also be achieved through redundancy of multimodal information processing. Simultaneous input through multiple modalities can, for example, provide confirmation for irreversible actions without the common and often annoying confirmation dialogue. The cooperation pattern of multimodal complementarity signifies that information processed through different modalities is merged to achieve a more complex expression. It differs from transfer in that the pieces of information do not merely build on each other but are more tightly intertwined, as in the often mentioned combination of manual object selection with vocal commands.
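The redundancy pattern can be illustrated with a minimal sketch. It assumes that each modality has already been interpreted into a command label with a timestamp; the `InputEvent` type, the function name, and the one-second window are hypothetical choices for illustration and are not taken from [61].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    modality: str   # e.g. "speech" or "touch" (hypothetical labels)
    command: str    # already interpreted command, e.g. "delete"
    time_s: float   # timestamp in seconds


def redundantly_confirmed(a: InputEvent, b: InputEvent,
                          window_s: float = 1.0) -> bool:
    """Return True if two different modalities express the same
    command within the given time window (assumed: 1 second)."""
    return (
        a.modality != b.modality
        and a.command == b.command
        and abs(a.time_s - b.time_s) <= window_s
    )


spoken = InputEvent("speech", "delete", 10.0)
touched = InputEvent("touch", "delete", 10.4)
late_touch = InputEvent("touch", "delete", 12.0)
```

With these example events, `redundantly_confirmed(spoken, touched)` holds, so an irreversible action could proceed without a separate dialogue, whereas the pair `spoken` and `late_touch` falls outside the window and would still require explicit confirmation.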

5.2.1 Hand and Voice

The most popular example of multimodal interaction is the combination of pointing and speech input. Bolt demonstrated in 1980 how well this natural combination suits the common interaction pattern of command application to selected objects [41]. Since then, many research prototypes based on this concept have been realized (e.g. [42, 254, 313, 339, 341]). With improved robustness of speech recognition and a higher diversification of computer usage scenarios, speech input is also gaining acceptance in commercial systems.

Oviatt emphasized that speech input reveals its largest benefits only in combination with direct manipulation [255, 256]. Pointing and tracing, for example, are suitable for the specification of spatial parameters. Speech, on the other hand, facilitates the integration of semantic information. Semantic selection filters can resolve ambiguous or erroneous input [256, 339]. The verbal expression “Highlight that red car”, for example, provides two selection filters and specifies an operation that only requires coarse pointing towards the location of the desired object (Figure 5.2). Schnelle-Walka and Döweling identified successful integrations of speech input into touch-based interaction and developed a taxonomy of related interaction patterns [305]. Their taxonomy includes speech-based mode switching, verbal selection of distal objects, touch-based error correction for speech input, and the above-mentioned combination of pointing-based object selection with verbal commands.
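The way semantic filters compensate for coarse pointing can be sketched in a few lines. The following is only an illustration under stated assumptions: the scene representation, the `resolve_target` helper, and the pointing radius are hypothetical, and the spoken phrase is assumed to have been parsed into attribute filters already.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class SceneObject:
    name: str
    kind: str
    color: str
    x: float
    y: float


def resolve_target(objects, point, radius, **filters):
    """Return the object closest to the pointed location that lies
    within `radius` and matches all verbal attribute filters,
    or None if no object qualifies."""
    px, py = point
    candidates = [
        o for o in objects
        if hypot(o.x - px, o.y - py) <= radius
        and all(getattr(o, attr) == value for attr, value in filters.items())
    ]
    return min(candidates,
               key=lambda o: hypot(o.x - px, o.y - py),
               default=None)


scene = [
    SceneObject("car_1", "car", "red", 100, 100),
    SceneObject("car_2", "car", "blue", 90, 95),
    SceneObject("truck_1", "truck", "red", 95, 105),
]

# "Highlight that red car" with coarse pointing near (92, 98):
# the two filters (kind, color) single out car_1 although car_2 is closer.
target = resolve_target(scene, point=(92, 98), radius=30,
                        kind="car", color="red")
pointing_only = resolve_target(scene, point=(92, 98), radius=30)
```

In this made-up scene, pointing alone selects the nearest object (`car_2`), while the combination with the two verbal filters selects `car_1`, the only red car in the pointed region.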

Figure 5.2: Speech and deictic gestures complement each other. The combination of pointing and a verbal description, e.g., “blue book”, can clearly disambiguate a target in dense environments like a bookshelf. Moreover, verbal commands can describe meaningful actions with the indicated object, e.g., to read the book.

Combinations of hand and voice input have not been adopted in end-user interfaces, despite the fact that the potential benefits have been known for several decades and voice recognition is becoming ever more robust. Perhaps the reason for this is that the low complexity of most graphical user interfaces does not require more sophisticated input. For more complex applications in the field of computer-aided design, for example, the benefits may be more relevant [42, 313]. Another aspect is certainly the social compatibility of multimodal interfaces. The more of our communication modalities become engaged in the dialogue with the machine, the more they will interfere with interpersonal communication. Office colleagues generally talk with each other while working manually on different and often unrelated tasks. Incoming telephone calls interrupt these conversations, but only after a noticeable jingle that provides mutual awareness. Speech input to the computer would most likely inhibit such social exchange in office spaces.

5.2.2 Gaze and Hand

The combination of manual input with gaze tracking has also received considerable attention in research on human-computer interaction (e.g. [79, 323, 324, 380]). Gaze generally serves for the coarse identification of a target area and thereby provides a spatial frame of reference for subsequent location refinement with manual input. Stellmach and Dachselt expressed this pattern with the catchy phrase “gaze suggests, touch confirms” [323]. Our eyes continuously scan the environment and do not remain steadily on an object or area of interest. Efficient use of gaze input can thus only be realized with such combinations.

Earlier experiments by Zhai et al. [380] as well as Drewes and Schmidt [79] showed performance benefits of this input combination. Zhai et al. reported performance benefits of about 14% for large-distance cursor movements. Drewes and Schmidt even showed that gaze-supported target acquisition can almost eliminate the effect of target distance. Moreover, they highlighted the impact of visual distraction on target acquisition with relative motion input from the mouse. In a condition with a complex background texture they observed target acquisition benefits of almost 33%. Both studies combined absolute area selection based on eye tracking with position refinements from relative pointing devices. This seems to be the only feasible combination, since eye tracking is well suited only for absolute input. Relative input from the hand can add accurate adjustments in the final closed-loop phase of target acquisition. If eye tracking were used for relative input instead, one would immediately lose track of the cursor.
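The division of labor between absolute gaze input and relative manual input can be sketched as follows. This is a simplified illustration, not the implementation used in [79] or [380]: the class name, the warp threshold of 120 pixels, and the rule that the cursor is only relocated when manual motion occurs are assumptions made for this example.

```python
from math import hypot


class GazeSupportedCursor:
    """Cursor that jumps to the gaze area on manual input and is then
    refined with relative (mouse-like) displacements."""

    def __init__(self, warp_threshold: float = 120.0):
        self.x, self.y = 0.0, 0.0
        # Hypothetical distance (pixels) beyond which the cursor is
        # relocated to the gaze position instead of moved manually.
        self.warp_threshold = warp_threshold

    def on_manual_move(self, dx: float, dy: float, gaze: tuple) -> tuple:
        gx, gy = gaze
        # Coarse, absolute step: relocate only if gaze is far from the cursor.
        if hypot(gx - self.x, gy - self.y) > self.warp_threshold:
            self.x, self.y = gx, gy
        # Fine, relative step: apply the manual displacement.
        self.x += dx
        self.y += dy
        return self.x, self.y


cursor = GazeSupportedCursor()
first = cursor.on_manual_move(-5, 3, gaze=(800, 600))    # warp, then refine
second = cursor.on_manual_move(2, -1, gaze=(810, 590))   # refine only
```

Here the first manual movement relocates the cursor to the gaze position and applies the small displacement, giving (795, 603). The second movement happens while the gaze rests close to the cursor, so only the relative displacement is applied, giving (797, 602). Because relocation is tied to manual motion, the cursor does not follow every saccade.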

It should also be noted that in neither of these studies on long-distance target acquisition were users able to achieve throughput rates beyond approximately 3 bits/s (see footnote 4). In comparison to common 2D pointing performance, this is not very convincing (cf. Section 2.1). Zhai et al. discussed this issue and suggested that the isometric joystick employed in their studies could be the reason for the overall mediocre performance [380]. Another possible explanation is that such gaze-supported manual pointing requires specific training, since its behavior contradicts established perception-action couplings. We are used to following the actions of our hands with our eyes or to moving our hands to locations we are looking at, but we do not expect manually operated tools to follow our gaze automatically while we are not even moving our hands.
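For orientation, the throughput measure used here (average index of difficulty divided by average movement time) can be computed as in the following sketch. It assumes the Shannon formulation of the index of difficulty, ID = log2(D/W + 1), which is common in Fitts' law studies but is an assumption of this example; the trial values are invented and do not reproduce data from [79] or [380].

```python
from math import log2


def index_of_difficulty(distance: float, width: float) -> float:
    """Index of difficulty in bits (Shannon formulation, assumed here)."""
    return log2(distance / width + 1)


def throughput(trials) -> float:
    """Throughput in bits/s as mean index of difficulty divided by
    mean movement time. `trials` holds (distance, width, time_s) tuples."""
    mean_id = sum(index_of_difficulty(d, w) for d, w, _ in trials) / len(trials)
    mean_mt = sum(mt for _, _, mt in trials) / len(trials)
    return mean_id / mean_mt


# Two made-up conditions: 3 bits in 1.0 s and 4 bits in 1.5 s.
example_trials = [(700, 100, 1.0), (300, 20, 1.5)]
tp = throughput(example_trials)
```

For the two invented conditions the mean index of difficulty is 3.5 bits and the mean movement time is 1.25 s, which yields a throughput of 2.8 bits/s, i.e., a value in the range discussed above.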

Research results on the combination of gaze with manual input are far from conclusive. Potential interferences of gaze input with cognitive tasks and social interaction have not yet been explored. Also, the potential performance benefits are not yet fully proven. Improved target acquisition has been shown (e.g. [79, 323, 380]), but similar or better performance can be achieved with other input combinations or clever input mappings (e.g. [93, 97, 226, 244, 350]). Pfeuffer et al. recently suggested a combination of gaze input with multitouch and pen, which produced performance comparable to the combination of pen and touch [266]. Nevertheless, in a study by Nancel et al., users expressed their subjective preference for combinations of gaze and hand and achieved higher accuracy than with combinations of manual input modes only [244].

Footnote 4: The data from Drewes and Schmidt allow the computation of an average throughput of about 2.3 bits/s for both input conditions in case of a blank background [79]. In Zhai et al.’s study, the manual input condition using an isometric joystick resulted in an index of performance of 3.2 bits/s [380] or a throughput of about 2.7 bits/s (average index of difficulty divided by average movement time). In combination with gaze input, a maximum throughput of 3.13 bits/s could be achieved.
