A Critical Reflection

6.3 Future work

The results presented indicate that multimodal interaction is a promising avenue for enhancing mobile access to information services. As already mentioned, however, the results obtained in the laboratory may for several reasons differ from what would be found if our interfaces were evaluated under more realistic circumstances. Future research could address this issue by asking prospective end users to perform self-defined tasks with a multimodal system in the field.

As argued earlier, the gain of multimodality in our application was limited because the application is relatively well known and straightforward. A logical direction for future research would therefore be to extend our work to more complex information systems, such as applications of a problem-solving type rather than form-filling applications. In this respect, an obvious extension of our current interface would be to allow users to navigate through, and negotiate on, alternative travel advice options. This type of task offers ample opportunity to exploit multimodal interaction to improve the usability of the service compared with unimodal interaction by means of either speech or pen.

The existing unimodal architecture that we used as a starting point required relatively few adaptations to make it suitable for the type of multimodal use that occurs in our application. However, to fully reap the benefits of multimodal interaction for a problem-solving task, the system should be implemented on a platform that supports truly asynchronous and concurrent multimodal interaction. Such an application will require a thorough redesign of several modules.

Our work showed that one of the prevalent requirements for a multimodal system is that it be as fast and efficient as possible. One of the crucial components in this respect is the speech recognition engine: recognition errors and latencies may severely degrade the usability of a multimodal system. Future research should therefore aim at increasing the speed and robustness of the speech recognizer. Many state-of-the-art speech recognition engines are already much faster than the recognition engine we used. One opportunity to reduce latencies further is the ‘early decision’ technique. In most standard recognizers, the selection of the first-best recognition result is postponed until the end of the recognition process. Early decision systems instead output a recognition result as soon as the evidence for a hypothesis exceeds a specified threshold. Consequently, the recognized words can be shown on the screen sooner, sometimes even before the user has finished speaking. The early decision technique has been used successfully in unimodal speech recognition applications (Imai et al., 2003), and it would be interesting to explore its benefits and usability in the context of multimodal interaction.

Furthermore, future research should aim at taking better advantage of the possibilities to integrate information from multiple modalities. For instance, it would be interesting to explore how the confidence scores generated by the speech recognition engine could be used in a multimodal interface. In this thesis, confidence measures were used only in the conversational multimodal system (chapter 3), to decide whether recognized values needed to be verified in the spoken dialogue or only shown on the screen.
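
As a rough illustration of the early decision strategy described in this section, the following Python sketch commits to a hypothesis as soon as its evidence exceeds a threshold. It is a simplification under stated assumptions and not the method of Imai et al. (2003): the per-frame score dictionaries, the function name `early_decision`, and the threshold values are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def early_decision(
    frames: Iterable[Dict[str, float]], threshold: float
) -> Tuple[Optional[str], int, bool]:
    """Return (hypothesis, frame_index, decided_early).

    `frames` yields, for each processed stretch of audio, a mapping from
    hypothesis to accumulated evidence (assumed here to be a score in [0, 1]).
    The first hypothesis whose evidence exceeds `threshold` is output
    immediately; otherwise the best hypothesis at the end of the input is used,
    as in a conventional recognizer.
    """
    best: Optional[str] = None
    last_index = -1
    for index, scores in enumerate(frames):
        last_index = index
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] > threshold:
            # Enough evidence: show the result without waiting for the end.
            return best, index, True
    # Threshold never exceeded: fall back to the end-of-utterance decision.
    return best, last_index, False


# Hypothetical evidence for two station names while the user is speaking.
example_frames = [
    {"Amsterdam": 0.40, "Rotterdam": 0.35},
    {"Amsterdam": 0.70, "Rotterdam": 0.20},
    {"Amsterdam": 0.95, "Rotterdam": 0.03},
]
print(early_decision(example_frames, threshold=0.6))
```

With these made-up scores the decision is taken at the second frame, before the input ends. Raising the threshold delays the decision, which trades latency for a lower risk of showing a premature and incorrect result on the screen.
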
One question that arises is whether showing information about the reliability of the recognized items on the screen would make users more alert to possible mistakes and make the interface more transparent. Another interesting technique that exploits the complementarity of multiple modalities is ‘mutual disambiguation’. With this technique, recognition uncertainties in one modality are resolved by using complementary information from the other modality (Oviatt, 1999a). A relatively simple form of mutual disambiguation concerns deictic references. Whereas the utterance “please give me detailed information about this trip” is ambiguous in itself, the ambiguity may be resolved if pointing input from the screen is taken into account. Mutual disambiguation has been applied successfully in multimodal systems that integrate speech with pen input, including drawn graphics, symbols, gestures, and pointing (Oviatt, 1999a). It would be worthwhile to investigate to what extent mutual disambiguation can be applied in multimodal problem-solving applications where pen input is limited to pointing and two-dimensional gestures.
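
The idea behind mutual disambiguation can be sketched as a joint re-ranking of the n-best lists of both recognizers. The Python fragment below is a minimal sketch under stated assumptions, not the architecture of the systems cited above: the n-best lists, their scores, the type-based compatibility test, and the product used as joint score are all hypothetical choices made for illustration.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

SpeechHyp = Tuple[str, str]  # (requested action, type of object referred to)
PenHyp = Tuple[str, str]     # (identifier of screen object, type of that object)


def mutual_disambiguation(
    speech_nbest: List[Tuple[SpeechHyp, float]],
    pen_nbest: List[Tuple[PenHyp, float]],
    compatible: Callable[[SpeechHyp, PenHyp], bool],
) -> Tuple[Optional[Tuple[SpeechHyp, PenHyp]], float]:
    """Pick the best semantically compatible pair of hypotheses."""
    best_pair, best_score = None, 0.0
    for (speech, p_speech), (pen, p_pen) in product(speech_nbest, pen_nbest):
        if not compatible(speech, pen):
            continue  # e.g. "this trip" cannot refer to a station icon
        joint = p_speech * p_pen  # assumed joint score: product of both scores
        if joint > best_score:
            best_pair, best_score = (speech, pen), joint
    return best_pair, best_score


def same_type(speech: SpeechHyp, pen: PenHyp) -> bool:
    """Assumed compatibility test: the spoken referent type matches the object."""
    return speech[1] == pen[1]


# The recognizer slightly prefers "station", but the pointing gesture most
# likely hit a trip on the screen.
speech_hyps = [(("details", "station"), 0.55), (("details", "trip"), 0.45)]
pen_hyps = [(("trip_2", "trip"), 0.70), (("station_5", "station"), 0.30)]

pair, score = mutual_disambiguation(speech_hyps, pen_hyps, same_type)
print(pair)
```

In this made-up example the pen input overrules the first-best speech hypothesis: the pair that refers to a trip wins because its joint score is higher than that of the pair that refers to a station.
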

Multimodal interaction is a novel way of communicating with information services. Future research should address ways of familiarizing people with this type of interaction. User modeling and adaptation are important topics in this context. It would be interesting to investigate which information is appropriate for building a user model, addressing issues such as whether the proficiency level of a user can be inferred from the way he or she combines the two modalities. The next step should then be to investigate how this information can best be incorporated in the design so that the user is assisted in using the system as efficiently and satisfactorily as possible, for example by providing extra guidance and help for inexperienced users while encouraging concurrent use of the two modalities for skilled users.
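
To make the adaptation idea concrete, the following Python sketch derives a support strategy from a log of user turns. It is only an illustration under stated assumptions: whether proficiency can be inferred from modality use at all is the open question raised above, and the `Turn` record, the proxy for proficiency (the share of turns that combine speech and pen), the thresholds, and the strategy names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Turn:
    used_speech: bool
    used_pen: bool
    overlapped: bool = False  # True if speech and pen input overlapped in time


def choose_support(
    turns: Sequence[Turn],
    min_turns: int = 5,
    combine_ratio: float = 0.5,
    overlap_ratio: float = 0.5,
) -> str:
    """Map observed modality use to one of three assumed support strategies."""
    if len(turns) < min_turns:
        return "extra_guidance"  # too little evidence: treat as inexperienced
    combined = [t for t in turns if t.used_speech and t.used_pen]
    if not combined or len(combined) / len(turns) < combine_ratio:
        return "extra_guidance"  # rarely combines modalities
    overlapping = sum(1 for t in combined if t.overlapped)
    if overlapping / len(combined) < overlap_ratio:
        return "suggest_concurrent_use"  # combines them, but only sequentially
    return "no_intervention"  # already uses both modalities concurrently


novice = [Turn(True, False)] * 6
sequential = [Turn(True, True, overlapped=False)] * 6
concurrent = [Turn(True, True, overlapped=True)] * 6
print(choose_support(novice), choose_support(sequential), choose_support(concurrent))
```

Any real user model would need to be validated empirically, since the thresholds used here are arbitrary placeholders rather than findings of this thesis.
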

Finally, as has already been stressed, it is important that future research be aimed at the development of standards for multimodal interface design and evaluation to facilitate the familiarization process and ensure that effective interfaces are designed. To this end, interaction facilities that are offered in multimodal systems need to be standardized and frameworks have to be developed that are able to evaluate all aspects of multimodal interaction and that allow for comparison of different systems.

Bibliography

Ainsworth, W.A. & Pratt, S.R. (1992). Feedback strategies for error correction in speech recognition systems. In: International Journal of Man-Machine Studies, vol. 36(6), 833-842.

Almeida, L., Amdal, I., Beires, N., Boualem, M., Boves, L., Den Os, E., Filoche, P., Gomes, R., Knudsen, J.E., Kvale, K., Rugelbak, J., Tallec, C., & Warakagoda, N. (2002). The MUST guide to Paris: Implementation and expert evaluation of a multimodal tourist guide to Paris. In: Proceedings of ISCA tutorial and research workshop on Multi-modal dialogue in Mobile environments (IDS02), Kloster Irsee, Germany, pp. 49-51.

Atkins, R. (2002). Computerworld (e-mail newsletter for Mobile/Wireless). Retrieved April 5, 2005 from http://www.computerworld.com/mobiletopics/mobile/story/0,10801,76686,00.html

Beringer, N., Kartal, U., Louka, K., Schiel, F., & Türk, U. (2002). PROMISE - A procedure for multimodal interactive system evaluation. In: Proceedings of Multimodal Resources and Multimodal Systems Evaluation, Las Palmas de Gran Canaria, Spain, pp. 77-80.

Bernsen, N.O. (1997). Towards a tool for predicting speech functionality. In: Speech Communication 23(3), 181-210.

Bernsen, N.O., Dybkjaer, H., & Dybkjaer, L. (1998). Designing Interactive Speech Systems: From First Ideas to User Testing. UK: Springer-Verlag.

Bernsen, N.O. & Dybkjær, L. (2001). Exploring natural interaction in the car. In: Proceedings of the International Workshop on Information Presentation and Natural Multi-modal Dialogue, Verona, Italy, pp. 75-79.

Bernsen, N.O. (2002). Multimodality in language and speech systems: From theory to design support tool. In: B. Granström, D. House, & I. Karlsson (Eds.). Multimodality in Language and Speech Systems. Dordrecht: Kluwer Academic Publishers, pp. 93-149.

Bilici, V., Krahmer, E., Te Riele, S., & Veldhuis, R. (2000). Preferred modalities in dialogue systems. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP2000), Beijing, China.

Bolt, R. (1980). ‘Put-that-there’: Voice and gesture at the graphics interface. In: Proceedings of Computer Graphics, SIGGRAPH.

Boros, M., Eckert, W., Gallwitz, F., Görz, G., Hanrieder, G., & Niemann, H. (1996). Towards understanding spontaneous speech: Word accuracy vs. concept accuracy. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP96), Philadelphia, USA.

Bouwman, G. & Hulstijn, J. (1998). Dialogue strategy redesign with reliability measures. In: Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, Spain, pp. 191-198.

Bouwman, G., Sturm, J., & Boves, L. (1999). Incorporating confidence measures in the Dutch train timetable information system developed in the Arise project. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-99), 493-496.

Boves, L. (1984). The Phonetic Basis of Perceptual Ratings of Running Speech. Dordrecht: Foris Publications.

Boves, L. & Den Os, E.A. (1999). Applications of speech technology: Designing for usability. In: Proceedings IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’99), Keystone, CO, pp. 353-356.

Boves, L. & Den Os, E.A. (2002). MUST - Multimodal and multilingual services for small mobile terminals. Heidelberg, EURESCOM Brochure Series.

Brewster, S. (2002). Overcoming the lack of screen space on mobile computers. In: Personal and Ubiquitous Computing, 6(3), 188-205.

Cameron, H. (2000). Speech at the interface. In: Proceedings of the COST 249 workshop: Voice Operated Telecom Services - do they have a bright future?, Ghent, Belgium, pp. 1-8.

Cheyer, A. & Julia, L. (1995). Multimodal maps: An agent-based approach. In: Proceedings of the International Conference on Cooperative Multimodal Communication (CMC/95), Eindhoven, The Netherlands.

Choularton, S. (2004). Handling speech recognition errors in spoken language dialogue systems. Submitted to ACL-04 Students Workshop.

Churchill, E.F. & Erickson, T. (2003). Talking about things in mediated conversations. In: Human-Computer Interaction, 18, 1-12.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd edition). New York: Academic Press.

Cohen, P.R. (1992). The role of natural language in a multimodal interface. In: Proceedings of the User Interface Software Technology Conference (UIST92), Monterey, CA, pp. 143-149.

Cohen, P.R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L., & Clow, J. (1997). QuickSet: Multimodal interaction for distributed applications. In: Proceedings of the Fifth Annual International ACM Multimedia Conference, Seattle, WA, pp. 31-40.

Cohen, P.R., McGee, D.R., & Clow, J. (2000). The efficiency of multimodal interaction for a map-based task. In: Proceedings of the Applied Natural Language Processing Conference (ANLP-00), Seattle, WA, pp. 331-338.

Cox, E.P. (1980). The optimal number of response alternatives for a scale: A review. In: Journal of Marketing Research, 17(4), 407-422.

Den Os, E.A., De Koning, N., Jongebloed, H., & Boves, L. (2001). Usability of a speech centric multimodal directory assistance service. In: Proceedings of the International Workshop on Information Presentation and Natural Multimodal Dialogue, Verona, Italy, pp. 65-69.

Den Os, E.A. & Boves, L. (2004). Natural multimodal interaction for design applications. In: Proceedings of eChallenges 2004, Vienna, Austria.

Dix, A.J., Finlay, J., Abowd, G., & Beale, R. (1993). Human Computer Interaction (2nd Ed). New York: Prentice Hall.

Dybkjær, L., Bernsen, N.O., & Minker, W. (2004). Evaluation and usability of multimodal spoken language dialogue systems. In: Speech Communication 43, 33-54.

Findlater, L. & McGrenere, J. (2004). A comparison of static, adaptive, and adaptable menus. In: Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2004), Vienna, Austria, pp. 89-96.

Fineman, B. (2004). Computers as People: Human Interaction Metaphors in Human-Computer Interaction. Master's thesis for the School of Design, Carnegie Mellon University.

Freedman, R. (1999). Atlas: A plan manager for mixed-initiative, multimodal dialogue. In: Proceedings of the AAAI-99 Workshop on Mixed-Initiative Intelligence, Orlando, Florida.

Garg, S., Martinovski, B., Robinson, S., Stephan, J., Tetreault, J., & Traum, D.R. (2004). Evaluation of transcription and annotation tools for a multi-modal, multi-party dialogue corpus. In: Proceedings 4th International Conference on Language Resources and Evaluation (LREC2004), Lisbon, Portugal.

Garrod, S. & Pickering, M.J. (2004). Why is conversation so easy? In: Trends in Cognitive Sciences, 8(1), 8-11.

Gibbon, D., Mertins, I., & Moore, R.K. (Eds.) (2000). Handbook of multimodal and spoken dialogue systems: Resources, terminology and product evaluation. Norwell, MA: Kluwer Academic Publishers.

Glass, J. (1999). Challenges for spoken dialogue systems. In: Proceedings IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’99), Keystone, Colorado, USA.

Grasso, M., Ebert, D., & Finn, T. (1998). The integrality of speech in multimodal interfaces. In: ACM Transactions on Computer-human interaction, 5(4).

Gustafson, J., Bell, L., Boye, J., Edlund, J., & Wiren, M. (2002). Constraint manipulation and visualization in a multimodal dialogue system. In: Proceedings of ISCA workshop on Multi-Modal Dialogue in Mobile Environments (IDS02), Kloster Irsee, Germany.

Gustafson, J., Bell, L., Boye, J., Lindström, A., & Wiren, M. (2004). The NICE fairy-tale game system. In: Proceedings of 5th SIGdial Workshop on Discourse and Dialogue, Boston, USA.

Halverson, C.A., Horn, D.B., Karat, C.-M., & Karat, J. (1999). The beauty of errors: Patterns of error correction in desktop speech systems. In: Proceedings of INTERACT’99, 133-140.

Harris, S. & Biermann, A.W. (2002). Mouse selection versus voice selection of menu items. In: International Journal of Speech Technology, 5(4).

Hirschman, L. & Thompson, H.S. (1995). Overview of evaluation in speech and natural language processing. In: R. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, & V. Zue (Eds.). Survey of the state of the art in human language technology. Cambridge: Cambridge University Press.

Herzog, G., Kirchmann, H., Merten, S., Ndiaye, A., Poller, P., & Becker, T. (2003). MULTIPLATFORM testbed: An integration platform for multimodal dialog systems. In: Proceedings of HLT-NAACL 2003 Workshop: Software Engineering and Architecture of Language Technology Systems (SEALTS), Edmonton, Alberta, pp. 75-82.

Horvitz, E. (1999). Principles of mixed initiative user interfaces. In: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI 99), Pittsburgh, PA.

Höök, K. (2000). Steps to take before intelligent user interfaces become real. In: Interacting with Computers, 12(4), 409-426.

Huang, X. D., Acero, A., Chelba, C., Deng, L., Droppo, J., Duchene, D., Goodman, J., Hon, H., Jacoby, D., Jiang, L., Loynd, R., Mahajan, M., Mau, P., Meredith, S., Mughal, S., Neto, S., Plumpe, M., Steury, K., Venolia, G., Wang, K., & Wang, Y. (2001). Mipad: A multimodal interaction prototype. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-01), Salt Lake City, UT, pp. 9-12.

Hugunin, J. & Zue, V. (1997). On the design of effective speech-based interfaces for desktop applications. In: Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '97), 1335-1338.

Hura, S.L. (2003). The truth about multimodal interaction. Microsoft .NET Newsletter. Retrieved December 14, 2004 from http://developer.intervoice.com/docs/Truth-About-Multimodal.pdf.

Ibrahim, A. & Johansson, P. (2002). Multimodal dialogue systems for interactive TV applications. In: Proceedings of ICMI'02, Pittsburgh, USA.

Imai, T., Tanaka, H., Ando, A., & Isono, H. (2003). Progressive early decision of speech recognition results by comparing most likely word sequences. In: Systems and Computers in Japan, 34(14), 73-82.

ISO (1998). ISO 9241: Ergonomic requirements for office work with visual display terminals (VDTs) – Part 11: Guidance on usability. Retrieved April 5, 2005 from http://www.iso.org.

Johnston, M., Bangalore, S., Vasireddy, G., Stent, A., Ehlen, P., Walker, M., Whittaker, S., & Maloor, P. (2002). MATCH: An architecture for multimodal dialogue systems. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL2002), Philadelphia, PA.

Jones, M., Buchanan, G., & Thimbleby, H. (2002). Sorting out searching on small screen devices. In: Paternò, F. (Ed.). Human Computer Interaction with Mobile Devices: Mobile HCI2002, Vol. 2411 of Lecture Notes in Computer Science, Springer-Verlag, pp. 81-94.

Kaiser, E., Olwal, A., McGee, D., Benko, H., Corradini, A., Cohen, P. R., & Feiner, S. (2003). Mutual disambiguation of 3D multimodal interaction in augmented and virtual reality. In: Proceedings of the 5th International Conference on Multimodal Interfaces (ICMI'03), Vancouver, Canada, pp. 12-19.

Karat, C.M., Halverson, C., Horn, D., & Karat, J. (1999). Patterns of entry and correction in large vocabulary continuous speech recognition systems. In: Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 99), Pittsburgh, PA, pp. 568-575.

Karat, J., Horn, D.B., Halverson, C.A., & Karat, C.-M. (2000). Overcoming unusability: Developing efficient strategies in speech recognition systems. In: CHI '00 extended abstracts on Human factors in Computing Systems, 141-142.

Karat, J. & Karat, C.M. (2003). The evolution of user-centered focus in the human-computer interaction field. In: IBM Systems Journal, 42(4), 532-541.

Kieras, D. E. (2003). Model-based evaluation. In: J. Jacko, & A. Sears (Eds.). The Human-Computer Interaction Handbook. Mahwah, NJ: Lawrence Erlbaum Ass., pp. 1139-1151.

Kirakowski, J. (n.d.). Questionnaires in usability evaluation: a list of frequently asked questions. Retrieved April 5, 2005 from http://www.ucc.ie/hfrg/resources/qfaq1.html.

Kirste, T., Herfet, T., & Schnaider, M. (2001). EMBASSI: Multimodal assistance for universal access to infotainment and service infrastructures. In: Proceedings of the 2001 EC/NSF workshop on Universal Accessibility of Ubiquitous Computing: Providing for the Elderly, Alcácer do Sal, Portugal.

Kvale, K., Rugelbak, J. & Amdal, I. (2003). How do non-expert users exploit simultaneous inputs in multimodal interaction? In: Proceedings of the International Symposium on Human Factors in Telecommunication, Berlin, Germany, pp. 169-176.

Larsen, L.B. (2003a). Assessment of spoken dialogue system usability: what are we really measuring? In: Proceedings of the European Conference on Speech Communication and Technology (Eurospeech’03), Geneva, Switzerland, pp. 1945-1948.

Larsen, L.B. (2003b). On the Usability of Spoken Dialogue Systems. Ph.D. thesis, Aalborg University, Aalborg, Denmark.

Larson, K. & Mowatt, D. (2003). Speech error correction: the story of the alternates list. In: International Journal of Speech Technology, 6(2), 183-194.

Levow, G.-A. (1998). Characterizing and recognizing spoken corrections in human-computer dialogue. In: Proceedings of the 36th Annual Meeting of the Association of Computational Linguistics, 736-742.

Lewin, I., Rupp, C.J., Hieronymus, J., Milward, D., Larsson, S., & Berman, A. (2000). Siridus system architecture and interface report (Baseline). Technical Report D6.1, Siridus. Retrieved April 5, 2005 from http://www.ling.gu.se/projekt/siridus/.

Lewis, M. (1998). Designing for human-agent interaction. In: AI Magazine, 67-78.

Litman, D. & Pan, S. (1999). Empirically evaluating an adaptable spoken dialogue system. In: Proceedings of the 7th International Conference on User Modeling (UM), Banff, Canada, pp. 55-64.

Litman, D. & Pan, S. (2002). Designing and evaluating an adaptive spoken dialogue system. In: User Modeling and User-Adapted Interaction, 12(2-3), 111-137.

Love, S., Dutton, R.T., Foster, J.C., Jack, M.A., & Stentiford, F.W.M. (1994). Identifying salient usability attributes for automated telephone services. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP94), Yokohama, Japan, pp. 1307-1310.

MacKenzie, I.S. (2002). Text entry for mobile computing. In: Human-Computer Interaction, Vol. 17, 141-146.

Maes, P. (1994). Agents that reduce work and information overload. In: Communications of the ACM, 37(7), 31-40.

Maes, P. (1995). Intelligent Software: Programs that can act independently will ease the burdens that computers put on people. In: Scientific American, 273(3), 84-86.

Maes, S.H. & Saraswat, V. (Eds.) (2003). Multimodal Interaction Requirements. W3C Note, 8 January 2003. Retrieved April 5, 2005 from http://www.w3.org/TR/mmi-reqs/.

Maglio, P.P., Matlock, T., Campbell, C.S., Zhai, S., & Smith, B.A. (2000). Gaze and speech in attentive user interfaces. In: Proceedings of the Third International Conference on Multimodal Interfaces (ICMI2000), Beijing, China, pp. 1-7.

Mankoff, J., & Abowd, G.D. (1999). Error correction techniques for handwriting, speech and other ambiguous or error prone systems. GVU Technical Report Number: GIT-GVU-99-18.

Martin, D.L., Cheyer, A.J., & Moran, B. (1999). The Open Agent architecture: A framework for building distributed software systems. In: Applied Artificial Intelligence: An International Journal 13(1-2), 91-128.

Maybury, M. & Wahlster, W. (1998). Readings in Intelligent User Interfaces. San Francisco, CA: Morgan Kaufmann Publishers Inc.

McGee, D. & Cohen, P. (1998). Exploring handheld, agent-based multimodal collaboration. Presented at the Workshop on Handheld Collaboration at the Conference on Computer Supported Cooperative Work (CSCW’98), Seattle, WA.

McGlashan, S. (1995). Speech interfaces to virtual reality. In: Proceedings of 2nd International Workshop on Military Applications of Synthetic Environments and Virtual Reality, Stockholm, Sweden.

Middleton, S. (2002). Interface agents: A review of the field. Technical report number ECSRT-IAM01-001.

Negroponte, N. (1995). Being Digital. Alfred A. Knopf, Inc.

Negroponte, N. (1997). Agents: From direct manipulation to delegation. In: J. Bradshaw (Ed.). Software Agents. Cambridge, MA: AAAI Press / The MIT Press, pp. 57-66.

Nielsen, J. (1993). Usability Engineering. Boston, MA: Academic Press.

Nielsen, J. (1994). Heuristic evaluation. In: J. Nielsen & R.L. Mack (Eds.). Usability Inspection Methods. New York: John Wiley & Sons.

Nigay, L. & Coutaz, J. (1993). A design space for multimodal systems: Concurrent processing and data fusion. In: Proceedings of Interchi’93, Amsterdam, The Netherlands, pp. 172-178.

Niklfeld, G., Pucher, M., Finan, R., & Eckhart, W. (2002). Steps towards multi-modal data services in GPRS and in UMTS or WLAN networks. In: Proceedings of the ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments (IDS-02), Kloster Irsee, Germany.

Norman, D. (1997). How might people interact with agents? In: J. Bradshaw (Ed.). Software Agents. Menlo Park, CA: AAAI Press/The MIT Press.

Oviatt, S.L. & Olsen, D. (1994). Integration themes in multimodal human-computer interaction. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP94), Yokohama, Japan, pp. 551-554.

Oviatt, S.L., Cohen, P.R. & Wang, M.Q. (1994). Toward interface design for human language technology: Modality and structure as determinants of linguistic complexity. In: Speech Communication 15, 283-300.

Oviatt, S. (1996). Multimodal interfaces for dynamic interactive maps. In: Proceedings of the International Conference on Human Factors in Computing Systems (CHI ’96), Vancouver, Canada, pp. 95-102.

Oviatt, S. & VanGent, R. (1996). Error resolution during multimodal human-computer interaction. In: Proceedings International Conference on Spoken Language Processing (ICSLP-96), Philadelphia, USA, pp. 204-207.

Oviatt, S., Levow, G.-A., MacEachern, M., & Kuhn, K. (1996). Modeling hyperarticulate speech during human-computer error resolution. In: Proceedings International Conference on Spoken Language Processing (ICSLP-96), Philadelphia, USA, pp. 801-804.

Oviatt, S., De Angeli, A., & Kuhn, K. (1997). Integration and synchronization of input modes dur-
