A Critical Reflection

6.2 Conclusions and implications for interaction design

The goal of the research presented in this thesis is to analyze how speech and pen can best be combined in a multimodal interface in order to enhance access to information services using a small mobile device. To this end, three laboratory experiments were carried out, each investigating a specific aspect of multimodal human-computer interaction and its effect on the usability of the system. These experiments, described in chapters 2 through 4, concerned the effects of prolonged use, the interaction paradigm, and multimodal error handling. Several conclusions can be drawn from these three studies. These are presented here together with their implications for the design of future multimodal systems.

We showed that user behavior with and user attitudes towards a conversational multimodal interface may change as a result of increasing experience. Most users tend to adapt their behavior to make the interaction as efficient as possible, for example by adapting their speaking style, by choosing the most efficient modality, and by using the two modalities simultaneously rather than sequentially. Multimodal interfaces for mobile terminals facilitate learning, as these devices are typically used by the same user for a longer period of time. Therefore, these multimodal applications should be designed in such a way that they support the interaction styles of both novice and experienced users.

Conversational multimodal interaction, in which the system acts as an intelligent conversational partner by offering guidance and assistance in solving problems (for example, those caused by recognition errors), turned out not to contribute to the usability of a multimodal form-filling interface. Direct manipulation type interaction without spoken guidance, but with a screen showing the complete form to be filled in, appeared to provide sufficient help and feedback for the user. Therefore, multimodal interfaces for form-filling applications that are well understood by the prospective users are best modeled after the GUI style interface, where the user determines the pace of the interaction and the order in which the fields are filled.

We showed that in a multimodal system that offers graphical error correction facilities on top of speech input, speech recognition errors can be corrected in a more efficient and more effective way than when only speech can be used. Speech recognition performance turned out to be substantially lower for repair utterances than for first attempts. One of the reasons is that people tend to hyper-articulate if they have to repeat a value instead of resorting to changes in wording. Another reason is that the set of values to be repeated is obviously biased towards the most confusable words. Error handling methods that use speech only (such as repetition) should therefore be avoided in multimodal systems, whereas the use of graphical error correction should be encouraged.

Although multimodal systems may best be designed from scratch after an in-depth analysis of the requirements of the service for which the interface is being developed, our investigations demonstrate that for simple well-known applications, extending an existing direct manipulation system with speech input can lead to a usable system, where the user determines the pace of the interaction and the order in which the fields are filled (chapter 3).

A fourth experiment (chapter 5) evaluated whether a multimodal interface combining speech and pen would surpass unimodal interfaces (viz. a speech-only dialogue system and a GUI) for the same task.

It was found that user satisfaction was lowest for the speech-only system, as a result of the users’ inability to control the pace of the dialogue, their lack of control due to speech recognition errors, the limited possibilities to solve these, and the longer dialogue duration. Therefore, speech-only interfaces to information services should be avoided unless GUI and multimodal interaction are impossible, for example in situations where hands and eyes are busy or when a screen is not available.

It was also found that speech recognition latencies have a serious effect on users’ perception of a system. Although time-to-task completion of the multimodal system was lower than for the GUI, users experienced the GUI as being faster. We argued that this is probably due to the fact that users had to wait for the response of the speech recognition system after each spoken input. Therefore, it is important that speech recognition be as fast as possible to avoid latencies in the interaction that can seriously affect the perceived usability of the system.

Multimodal interaction and GUI interaction turned out to be similar in terms of user satisfaction. While the inability of our multimodal system to outperform the GUI is partly due to its perceived inefficiency, it is also related to familiarity. Most people by now are used to GUI interaction, whereas multimodal interaction is completely new to most people. In this respect, the importance of standards should be stressed; standardized interfaces will accelerate users' familiarization with multimodal interaction. Moreover, cognitive models and guidelines should be used during design and development of multimodal interfaces in order to maximize their usability. Fortunately, besides a growing body of empirical data, several initiatives have been taken to develop theories that explain when and how people combine multiple modalities. Standards are also needed for evaluation in order to guarantee the reliability and validity of the obtained results.

In document On the usability of multimodal interaction for mobile access to information services (Page 138-140)