
2. Methodologies

2.5. Multimodal Interaction

2.5.2. Output Modalities

Output modalities can be described from two main perspectives: the human and the computer. Human output modalities refer to the aural, touch, gaze and kinaesthetic⁷ channels. Computer output modalities refer to the ability of a device to deliver output to the user using a number of communication channels. These include graphics, language, film and music and are directly targeted at the main human input modalities: visual, auditory and haptic (Bernsen 1997).

7 Kinaesthesia refers to the physical position and movements of the body.

With output modalities, the original focus was on the presentation of multimedia content via multiple modalities such as graphics and audio. Formative work by Andre et al. (1993) describes the architecture of a knowledge-based presentation system (WIP) that demonstrated the use of synergistic graphical and textual output for instructional direction. The automatic presentation of media content based on the coordination of two modalities and a user’s context has been a very active area of research for multimodal output (Bouchet, Nigay, and Ganille 2004, Rousseau et al. 2006b, Sinha and Landay 2003). The ICARE (Mansoux, Nigay, and Troccaz 2007) and CrossWeaver (Sinha and Landay 2003) systems take a similar approach to output modalities, using multistage processes for the design and production of content and Open Agent Architectures (OAA) to facilitate cost-effective modification of modalities and their combinations.

More recently, the concept of multimodal fission has been used to coordinate multiple output modalities (Rousseau et al. 2006a, Jaimes and Sebe 2007, Perakakis and Potamianos 2008, Costa and Duarte 2011, Honold, Schüssel, and Weber 2012). Fission techniques enable a multimodal system to generate an adequate message in the correct form based on user profile and context. There are usually three stages to the process: content selection and structuring, output channel selection, and output coordination. The output channels are based on the affordances of the computing device and can include text, graphics, speech synthesis or embodied conversational agents (Foster 2002).
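The three fission stages named above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the `Context` and `Device` classes, the relevance scores and the channel-preference rules are all hypothetical, chosen only to show how the stages might hand results to one another.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical snapshot of the user's profile and situation."""
    driving: bool = False
    prefers_speech: bool = False


@dataclass
class Device:
    """Hypothetical description of the output channels a device affords."""
    channels: set = field(default_factory=lambda: {"text", "graphics", "speech"})


def select_content(items, max_items=3):
    """Stage 1: content selection and structuring (most relevant items first)."""
    ranked = sorted(items, key=lambda item: item["relevance"], reverse=True)
    return ranked[:max_items]


def select_channels(device, context):
    """Stage 2: output channel selection from device affordance and context."""
    if context.driving and "speech" in device.channels:
        return ["speech"]  # assumed rule: eyes-busy situations get audio only
    if context.prefers_speech:
        preferred = ["speech", "text", "graphics"]
    else:
        preferred = ["text", "graphics", "speech"]
    # Keep at most two channels that the device can actually render.
    return [channel for channel in preferred if channel in device.channels][:2]


def coordinate_output(content, channels):
    """Stage 3: output coordination (each item is paired with each channel, in order)."""
    return [(channel, item["title"]) for item in content for channel in channels]


def fission(items, device, context):
    """Run the three stages in sequence and return an ordered presentation plan."""
    content = select_content(items)
    channels = select_channels(device, context)
    return coordinate_output(content, channels)
```

For example, calling `fission` with a `Context(driving=True)` on a device that affords speech yields a speech-only plan, whereas the same content on a text-only device is planned entirely as text. A real fission component would also need to handle timing and synchronisation between channels, which this sketch leaves out.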

Mobile devices provide multiple output modalities such as visual, auditory and haptic. Interestingly, the output channel selection can be based on device affordance and can also be affected by changes in a user’s context (e.g. providing audio while driving). Visual interfaces are used effectively to provide text, image and video data on mobile devices (Chittaro 2008, Church, Smyth, and Oliver 2009). The coordination of these modes (audio, video and/or graphics) using the visual modality has been demonstrated effectively in many cases (Chittaro 2008, Jaimes and Sebe 2007, Sarter 2006, Sebe 2009). The auditory modality, on the other hand, has seen a number of applications, mainly related to synthesised speech or recorded speech⁸ (Nepper, Treu, and Küpper 2008a, Rajput and Nanavati 2012). Bartie and Mackaness (2006) describe the development of a (synthesised) speech-based AR system that delivers contextual geospatial information to assist a user in locating landmarks, demonstrating the benefits of focus-independent directional instructions. Research in (Doyle, Bertolotto, and Wilson 2009, Kurkovsky 2009) describes the CoMPASS mobile application for the elderly and identifies that natural speech is preferred over synthesised speech, the downside being the significant overhead of producing recorded speech assets. The haptic modality has gained significant attention, with a myriad of useful applications demonstrated in a multimodal context.

Related work (Robinson, Eslambolchilar, and Jones 2008, Williamson, Murray-Smith, and Hughes 2007, Jacob et al. 2012) investigated haptics (touch) for the purposes of navigation, where users are directed through a space using only vibrotactile feedback (the perception of vibration through touch) as a guide.

The benefits of combining output modalities have been identified in many cases (Mousavi, Low, and Sweller 1995, Akatsu et al. 2009, Cao, Theune, and Nijholt 2009) as a means of reducing cognitive load on the user. Research has shown that a mixed audio-visual modality is better than the audio modality alone (Hooten, Hayes, and Adams 2013). This indicates that the audio modality combined with the visual modality can be highly effective, particularly in a mobile context, where the visual modality alone can be distracting. This type of mixed modality is useful for LBS, where users are on the move and in most cases require hands-free and eyes-free information delivery (Kaasinen 2003, Dowell and Shmueli 2008).

8 Synthesised speech is artificially produced human speech; recorded speech is the use of natural speech recordings.

(Paper I) describes the research and development of a novel Geo-Services platform for the multimodal delivery of high-quality and task-relevant content to constrained mobile devices (e.g. spatially enabled smartphones). Importantly, it investigates the delivery of location-based content to the user with the phone’s aural modality as the primary delivery mechanism (Fitch and Kramer 1994, Nepper, Treu, and Küpper 2008b). The visual modality is also presented using a graphical user interface (GUI) that incorporates and builds on many of the ideas developed in previous work on location-based services. A primary objective of this research was to perform a live user trial to evaluate the effectiveness of multimodal content delivery in the form of media content and navigational directions. Results show that multimodal delivery is effective and that user choice of modality, combined with contextual instruction (informing users when something is wrong), improves the user experience considerably. Extending this, adaptive multimodal output was also considered for this application by using the light sensor to detect when the device was “in hand” or “in pocket” and to control the primary modality automatically; however, this was not employed in the user trial. Similar approaches are taken by the AdaptO (Teixeira et al. 2011) and ProFi (Honold, Schüssel, and Weber 2012) systems, where the output modality is determined from user preferences, environmental factors and device affordance. This work is extended in (Teixeira et al. 2014) with the development of a multimodal personal life assistant that adapts to real-life scenarios to provide multimodal output based on the user’s needs. This automatic adaptation of multimodal output based on environmental conditions, usage patterns and context has gained considerable attention in recent times, has significant potential and is the topic of future work in the area of adaptive multimodal output, which is outlined in Section 3.9.2.
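The light-sensor heuristic mentioned above might be sketched as follows. This is only an illustration under stated assumptions: the source does not report a threshold, a smoothing method or an override mechanism, so the 10 lux cut-off, the use of a median over recent readings and the `user_override` parameter are all hypothetical.

```python
from statistics import median

# Assumed threshold: readings below this many lux are treated as "in pocket".
# The source gives no value; 10 lux is purely illustrative.
POCKET_LUX_THRESHOLD = 10.0


def device_placement(lux_readings, threshold=POCKET_LUX_THRESHOLD):
    """Classify recent ambient-light readings as 'in pocket' or 'in hand'.

    The median is used so that a single spurious reading does not flip the result.
    """
    if not lux_readings:
        raise ValueError("at least one light reading is required")
    return "in pocket" if median(lux_readings) < threshold else "in hand"


def primary_modality(lux_readings, user_override=None):
    """Pick the primary output modality, letting an explicit user choice win."""
    if user_override in ("audio", "visual"):
        return user_override
    return "audio" if device_placement(lux_readings) == "in pocket" else "visual"
```

Keeping an explicit override reflects the trial finding reported above that user choice of modality improves the experience; automatic adaptation then acts only as a default.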

This chapter detailed the significant advancements made in the areas of location-based information retrieval, hybrid positioning, MSI, AUIs and multimodal interaction. The methodologies described play a significant role in the effective multimodal delivery of content in a mobile context. The main objective was to set the published work in the context of existing literature and to stress the coherence of the publications by linking them to the methodologies adopted. The following chapter critically describes the published work, establishing how it fits into the overarching theme of multimodal mobile services and evaluating the contributions made in related discipline areas.

In document Multimodal Content Delivery for Geo-services (Page 41-46)