2 Related work
2.1 Multimodal user interfaces
2.1.1 Multimodal interface frameworks and architectures
In recent years, a number of multimodal interface technologies have been researched. One can argue that multimodal interface research first focused on particular multimodal applications, then evolved towards multimodal architectures, and has recently produced a number of frameworks for the design of multimodal applications. Standardisation efforts have also been brought forward.
For example, one of the important reference models for multimodal interface frameworks has been defined by the W3C Multimodal Interaction Activity Group [40]. Although the group targets its standards at the Web, with the aim of letting users choose the “most appropriate mode of interaction”, it defines a very useful set of requirements and use cases, which have been incorporated into the well-defined Multimodal Interaction Framework [10]. The remainder of this section provides an overview of relevant architectures and frameworks targeted at multimodal interaction.
The EMBASSI project designed a multilayer architecture [9] for multimodal assistance. It can be classified as an agent-based system. Agents are grouped into layers and communicate using a subset of the agent communication language KQML [41]. The system intelligence is spread across multiple agents that interact cooperatively. Owing to its agent-based nature, this approach offers high flexibility: new features can be added by adding new agents at the respective layer. The layers cover user input related tasks, user output related tasks, a dialog manager, user task assisting agents, an abstract representation of physical devices, and a context manager.
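The layered agent organisation can be illustrated with a minimal sketch. It is not taken from the cited project: the class names `Message`, `Agent` and `AgentRegistry`, the layer labels and the agent names are hypothetical, and the message fields only imitate the general shape of a KQML message (performative, sender, receiver, content). The sketch shows how a new agent might be registered at a layer without changing the existing agents.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    """KQML-style message; the field names are illustrative assumptions."""
    performative: str  # e.g. "tell" or "ask-one"
    sender: str
    receiver: str
    content: str


class Agent:
    """An agent that belongs to one layer and stores received messages."""

    def __init__(self, name: str, layer: str) -> None:
        self.name = name
        self.layer = layer
        self.inbox: List[Message] = []

    def receive(self, message: Message) -> None:
        self.inbox.append(message)


class AgentRegistry:
    """Hypothetical registry that routes messages between agents by name."""

    def __init__(self) -> None:
        self._agents: Dict[str, Agent] = {}

    def register(self, agent: Agent) -> None:
        self._agents[agent.name] = agent

    def send(self, message: Message) -> None:
        # Deliver to the named receiver; an unknown receiver raises KeyError.
        self._agents[message.receiver].receive(message)

    def layer(self, layer: str) -> List[str]:
        # Sorted names of all agents registered at the given layer.
        return sorted(a.name for a in self._agents.values() if a.layer == layer)


registry = AgentRegistry()
dialog_manager = Agent("dialog_manager", layer="dialog")
registry.register(dialog_manager)
registry.register(Agent("speech_recogniser", layer="input"))
# Adding a feature means registering one more agent at the input layer.
registry.register(Agent("gesture_recogniser", layer="input"))
registry.send(Message("tell", "gesture_recogniser", "dialog_manager", "point(lamp)"))
```

In this sketch the dialog manager does not need to know which input agents exist; it only consumes the messages addressed to it, which is one plausible reading of the flexibility attributed to the agent-based design.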
Although the architecture seems very promising, it has been applied mainly to home environment scenarios. The agent approach seems very suitable for flexibly integrating several devices for modality input and user output into the user interaction. Nevertheless, it does not define a flexible approach for allowing access by several kinds of multimodal applications.
SmartKom is one of the most recent and most sophisticated research efforts. The interaction metaphor for the system is based on the situated delegation-oriented dialog paradigm (SDDP), in which “... the user delegates a task to a virtual communication assistant, visible on the graphical display.” [11]. The objective of this approach is to help users accomplish more complex tasks within certain applications.
[Figure 4 diagram; legible component labels: Gesture, Speech, Modality Fusion, Intention Analysis, Interaction Modeling, Discourse Modeling, System Watchdog, Lexicon Management, Shared Knowledge Services, Function Modeling, Context Modeling, Service A, Service B, Character Animation, Display, Presentation]
Figure 4: SmartKom architecture blueprint [42]
SmartKom provides an architecture which accommodates four main goals: natural interaction behaviour (facilitating timely, user-adapted interactions), handling of large numbers of modalities and applications, openness to different processing approaches, and support for distributed development, as described in [42]. Figure 4 shows the related software architecture with its main components. In the uppermost area of the illustration, the architecture defines functionalities for multimodal input processing, starting from specific input recognisers (audio and gesture in this picture) with information flowing towards the fusion components. Within the middle circle, the SmartKom architecture defines a number of information models as shared knowledge services. In the lower part, the components responsible for multimodal output generation can be seen, starting with a presentation planning component and ending with the components that deliver the intended output to the user. Owing to the SDDP paradigm, SmartKom components such as Intention Analysis, Dynamic Help, Action Planning and Function Modelling provide the means for applications to obtain support in fulfilling certain application tasks. Services or applications interface with the system through Function Modelling. For this purpose, SmartKom defines a multimodal mark-up language (M3L).
In contrast to architecture-based approaches, research has also been performed on Model-Based User Interface Development Environments (MB-UIDEs). In general, these aim “... to provide an environment where developers can design and implement user interfaces (UI) in a professional and systematic way, more easily than when using traditional UI development tools.” [43]. The idea is to give users the tools to develop their own application user interface without necessarily needing a developer background. For example, the interface design environment Mobi-D [44] provides a series of declarative models, such as the user-task, dialog and presentation models, supported by an architecture that allows the end user to participate in more aspects of the development. Models are usually expressed in an interface modelling language. However, these approaches are usually constrained to conventional user interfaces, such as desktop environments.
The concept of “plasticity of user interfaces” [45] addresses the challenge of changing user interfaces according to the context of use. Plasticity in this respect “... is the capacity of an interactive system to withstand variations of context of use while preserving usability.” [45]. The context of use is defined by two classes of attributes: first, the attributes of the physical and software platform, which are a direct part of the user interaction with the system; second, the attributes of the environment, which can affect the user interface indirectly. The concept thus links strongly to the research area of context-awareness, which is reviewed later in Section 2.3.2.
The authors provide an initial reference framework based on models known from model-based approaches and used in current practice: concepts, tasks and task-oriented specifications. Moreover, they extend it with the notions needed for the context of use, such as the platform, environment and evolution models. Figure 5 shows the models and their relationships in different contexts.
[Figure 5 diagram; legible labels: Context 1 and Context 2, each with Concepts, Tasks, Platform, Environment, Interactors and Evolution models, a Concepts and Task Model, an Abstract interface, a Concrete interface and a Final UI; Run Time: sensing the context, detecting context changes, identifying context changes, identifying candidate solutions, selecting a candidate solution, executing the prologue, executing the reaction, executing the epilogue]
Figure 5: Revised reference framework for supporting plastic interactive systems [45]
The following list indicates what the models include and, in addition, how the reference framework could be adapted for multimodal interfaces in mobile environments:
The concepts model covers the concepts relevant to the domain of discourse - this could be linked to the model of the application domain in the case of mobile applications
Task models describe how the user carries out a task - this links to the definition of the dialogue model for multimodal user input
The platform model defines the physical characteristics of the user interface platform in use - in the case of multimodal interfaces it should describe the physical characteristics of the surrounding devices
Environment models describe the actual context of use - relevant context information for the adaptation could be described here; in a wider sense this would include the user preferences for adaptation
Interactor models describe the interface elements (widgets) that allow presentation to and input from the user - in multimodal interfaces these elements could be defined as multimodal agents or services running on user interface devices, which define the modalities a user can use to interact with mobile multimodal applications
In addition to the reference model itself, the authors define a process for the run-time behaviour. The run-time process is structured into sensing the context of use (acquiring context values, such as physical information), detecting context changes (changes occurring over time, within a certain threshold) and identifying context changes (mapping the detected changes to the higher-level context changes used by applications).
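The three run-time steps can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the attribute names (`ambient_noise_db`, `light_lux`), the rule table and the label `noisy_environment` are all hypothetical, and the threshold is interpreted here as the minimum difference that counts as a change.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, float]


def sense_context(sensors: Dict[str, Callable[[], float]]) -> Context:
    """Sensing: acquire one value per context attribute."""
    return {name: read() for name, read in sensors.items()}


def detect_changes(previous: Context, current: Context, threshold: float) -> Context:
    """Detecting: keep attributes that are new or moved by more than threshold."""
    return {
        name: value
        for name, value in current.items()
        if name not in previous or abs(value - previous[name]) > threshold
    }


def identify_changes(changes: Context, rules: Dict[str, Tuple[float, str]]) -> List[str]:
    """Identifying: map low-level changes to higher-level context labels."""
    labels = []
    for name, value in changes.items():
        if name in rules:
            limit, label = rules[name]
            if value >= limit:
                labels.append(label)
    return labels


# Hypothetical previous reading, sensors and rule table.
previous = {"ambient_noise_db": 40.0, "light_lux": 300.0}
sensors = {"ambient_noise_db": lambda: 78.0, "light_lux": lambda: 305.0}
rules = {"ambient_noise_db": (70.0, "noisy_environment")}

current = sense_context(sensors)
changes = detect_changes(previous, current, threshold=10.0)
labels = identify_changes(changes, rules)
```

Here the small change in light level stays below the threshold and is ignored, while the jump in noise level is detected and mapped to a higher-level label that an application could react to.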
In conclusion, one can argue that the main restriction of this work is that the approach relates to single user interface devices, such as computers or mobile phones. It does not integrate more than one user interaction device at the same time. The model itself, though, gives a good indication of how user interfaces can be modelled depending on different contexts.
FAME, a Framework for Adaptive Multimodal Environments [8], can be seen as such a model-based framework. It defines a so-called behavioural matrix that captures the analysis of adaptive multimodal interfaces. In relation to this, it defines a general framework and guidelines for the development process of multimodal applications.
It defines an adaptive multimodal system whose core is the adaptation module. In the rather general description, the adaptation module takes into account different models capturing the user, the devices, the environment and the general interaction description. Supporting components, which feed these models, provide the events for device changes, environmental changes and user inputs. Based on the adaptation decision, the system produces an adaptive presentation layout for the application and decides which output devices to use.
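A minimal sketch may help to picture how such an adaptation module could combine its models. It does not reproduce FAME or its behavioural matrix: the class names `Models` and `AdaptationModule`, the event methods, the attribute values (`noise`, `prefers`) and the two hard-coded rules are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Models:
    """The models the adaptation module consults (illustrative only)."""
    user: Dict[str, str] = field(default_factory=dict)
    environment: Dict[str, str] = field(default_factory=dict)
    devices: Dict[str, Set[str]] = field(default_factory=dict)  # device -> output modalities


class AdaptationModule:
    """Updates the models from events and decides which output devices to use."""

    def __init__(self) -> None:
        self.models = Models()

    def on_device_event(self, device: str, modalities: Optional[Set[str]]) -> None:
        # None signals that the device has become unavailable.
        if modalities is None:
            self.models.devices.pop(device, None)
        else:
            self.models.devices[device] = set(modalities)

    def on_environment_event(self, key: str, value: str) -> None:
        self.models.environment[key] = value

    def on_user_event(self, key: str, value: str) -> None:
        self.models.user[key] = value

    def select_output_devices(self) -> List[str]:
        # Hypothetical rules: block modalities that do not suit the situation.
        blocked: Set[str] = set()
        if self.models.environment.get("noise") == "high":
            blocked.add("audio")
        if self.models.user.get("prefers") == "audio_only":
            blocked.add("visual")
        # Keep every device that still offers at least one usable modality.
        return sorted(
            name
            for name, modalities in self.models.devices.items()
            if modalities - blocked
        )


module = AdaptationModule()
module.on_device_event("headset", {"audio"})
module.on_device_event("wall_display", {"visual"})
outputs_quiet = module.select_output_devices()
module.on_environment_event("noise", "high")
outputs_noisy = module.select_output_devices()
```

In this sketch the environmental event changes the decision without any change to the application itself, which reflects the separation between event sources, models and the adaptation decision described above.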
The approach seems to be restricted to the definition and analysis of multimodal applications. Nevertheless, the behavioural matrix appears to provide a powerful tool for analysing potential multimodal applications. Developers can use it to analyse the settings in which the application resides and to define their specific parameters.
[Figure 6 diagram; legible component labels: System & Environment, Application Functions, Session Component, Input, Output, Interaction Manager]
Figure 6: The W3C Multimodal Interaction Framework [40]
Owing to the increasing interest, the W3C Multimodal Interaction Activity defined the W3C Multimodal Interaction Framework [10]. The framework provides well-described guidelines on how to design architectures for multimodal interface systems. Since frameworks are in general one abstraction level higher than architectures, it does not describe specific functionalities, but rather structures the relevant areas of concern for multimodal interaction. Figure 6 depicts the main parts of the framework and their relationships.
The modelling complies closely with the definitions of multimodal interfaces given earlier. A human user provides the system with input through different input channels and receives media representations that combine single modalities such as audio, speech, video and text. The coordination of inputs and outputs is handled by the interaction manager, shown in the centre of the illustration. An application can use the coordination functionalities provided to become multimodal. In addition, to complete the architecture and in accordance with the W3C requirements for multimodal interaction, the framework defines a session component, which allows state management of the multimodal interaction for temporary and persistent sessions. Finally, and importantly, the system and environment component makes it possible to include current information about available modalities and devices.
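The interplay of these components can be sketched as follows. The class names mirror the component names in Figure 6, but the methods, the string-based data and the `echo_application` function are hypothetical; the framework itself deliberately does not prescribe an API, so this is only one possible reading.

```python
from typing import Callable, Dict, List, Set


class SessionComponent:
    """Keeps per-session interaction state (here: a simple input history)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, List[str]] = {}

    def record(self, session_id: str, entry: str) -> None:
        self._sessions.setdefault(session_id, []).append(entry)

    def history(self, session_id: str) -> List[str]:
        return list(self._sessions.get(session_id, []))


class SystemAndEnvironment:
    """Holds current information about the available output modalities."""

    def __init__(self, output_modalities: Set[str]) -> None:
        self.output_modalities = set(output_modalities)


class InteractionManager:
    """Coordinates inputs, application functions and outputs."""

    def __init__(
        self,
        application: Callable[[str], str],
        session: SessionComponent,
        environment: SystemAndEnvironment,
    ) -> None:
        self.application = application
        self.session = session
        self.environment = environment

    def handle_input(self, session_id: str, modality: str, data: str) -> Dict[str, str]:
        # Store the input, call the application, render on all available outputs.
        self.session.record(session_id, f"{modality}:{data}")
        response = self.application(data)
        return {m: response for m in sorted(self.environment.output_modalities)}


def echo_application(request: str) -> str:
    """Stand-in for the application functions."""
    return f"result for '{request}'"


environment = SystemAndEnvironment({"speech", "text"})
session = SessionComponent()
manager = InteractionManager(echo_application, session, environment)
first = manager.handle_input("s1", "speech", "weather today")
# The speech output becomes unavailable, e.g. because a device was removed.
environment.output_modalities.discard("speech")
second = manager.handle_input("s1", "pen", "weather tomorrow")
```

The application function stays unaware of modalities; the interaction manager consults the system and environment component for each request, so the second response is delivered as text only.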
The framework targets different areas, such as mobile, automotive telematics, multimodal interfaces in the office and multimodal interfaces in the home. Although this probably covers most areas, the framework itself was motivated from an Internet perspective.
A number of further frameworks exist, but they are mostly older and do not reflect current research results. From a state-of-the-art point of view, the given examples of multimodal user interfaces and related user interface concepts should therefore provide a good overview. In essence, multimodal interfaces should be able to recognise interaction and fuse or integrate user input from different input sources (e.g. gestures, speech input). Based on the given application, multimodal interfaces should provide an output presentation that fits the currently available devices and media content. To reach this goal, a number of support models (e.g. environment, user intention) and functionalities need to be provided. The need for adaptation to particular situations has to be acknowledged and plays an important role especially in mobile environments, where devices are available in certain locations but not in others. Interaction recognition and intention recognition / action planning are not the main focus of this thesis. They are acknowledged as important features of multimodal interfaces, but are not discussed in further detail. The elements of multimodal integration are reviewed in the next section. Presentation adaptation, as the means of providing information to the user appropriately, is discussed as part of Multimedia in Section 2.2.