
2.2 Multimodal Human-Computer Interaction

2.2.1 Terminology

Modality

In the literature, e.g., Wahlster (2006b), the term modality often refers to the human senses employed to process incoming information: vision, audition, olfaction, touch, and taste. Other work on MUIs, such as Oviatt and Cohen (2015b), takes a more technical view and uses the term modality to describe the types of interface applied for user input (such as speech, pen, touch and multi-touch, gestures, gaze, and virtual keyboard) and for system output (such as speech synthesis, Graphical User Interface (GUI), or sound).

Wasinger (2006) points out that the term modality places the focus on the perception of the senses and on the process by which user input is captured by the system or system output is presented to the user. This view combines both of the above definitions and is adopted for this work.

Definition 1 (Modality)

Modality describes the perception of the senses and the process in which the user input is captured by the system or system output is presented to the user.

Code

With the term ‘code’, Maybury and Wahlster (1998a) refer to a system of symbols or information encoding (e.g., text, gestures, or sign language). To realise a code, a particular combination of user ability and device capability may be utilised. Thus, the modality actually used to present a code can differ; e.g., text can be entered via a keyboard for input or displayed on a monitor for output. The concrete modality involved is not crucial for the code: text can equally be presented as spoken language or in Braille.

Definition 2 (Code)

A system of symbols or information encoding like text, gestures, or sign language. One code can be presented by diverse modalities.

Device

Nowadays, a device is not restricted to supporting only one specific modality or code. Instead, modern devices can contain a wide range of different user interfaces. For example, current smartphones present information with sound, spoken language, GUIs, vibrations, and more. On the input side, the user interacts, e.g., through multi-touch, speech input, or motion, which is detected by accelerometers.

From the technical point of view of the Situation Adaptive Multimodal Dialogue Platform (SiAM-dp), presented in this thesis, a device is considered to be one unit in the environment that is connected to a multimodal dialogue application. Later, in Section 2.4, in the context of CPEs, this unit will also be called a cyber physical unit. The various interfaces supported by one unit are called device services in this thesis. This term also includes sensors and actuators and is examined in detail in the definitions that follow.

Definition 3 (Device)

One unit in the environment that is connected to a multimodal dialogue application. A device can combine several user interfaces for input and output but also sensors and actuators.


Device Service

Oviatt and Cohen (2015b) make a distinction between active input modes and passive input modes. They define active input modes as input modes “that are deployed by the user intentionally as explicit input to a computer system (e.g., speaking, writing, typing, gesturing, pointing)”. Analogously, in this thesis active output modes are directly deployed to the user by a user interface (e.g., graphical output on the screen or speech synthesis). Since the initiative originates from the computer system, it is difficult to speak of intentional and unintentional output. Thus, we consider output to be active output if a possibly active equivalent exists in the interaction between humans.

Passive input modes are defined as “naturally occurring user behaviour or actions that are recognised and processed by the system (e.g., facial expressions, gaze, physiological or brain wave patterns, sensor input such as location). They involve user or contextual input that is unobtrusively and passively monitored, without requiring any explicit user command to a computer” (Oviatt and Cohen, 2015b). Passive input is typically recognised by sensors such as cameras, eye trackers, time-of-flight cameras, and motion sensors. For system output, passive output modes are realised, e.g., with virtual characters and their facial expressions. Another set of devices that can be used for passive output are actuators. For example, lamps can be used to draw the user’s attention subconsciously to a specific region or object on a product shelf.

The generic term for an interaction mode that is provided by a device is device service. In this context, the interaction mode is the combination of modality and code that the device service can process. Chapter 5 will introduce a more exact classification of the above-mentioned types of device services.
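The two distinctions made above (input versus output, active versus passive) can be read as a simple two-axis classification. The following Python sketch is only an illustration of that reading: the names `Direction`, `Initiative`, `EXAMPLES`, and `classify` are hypothetical and not part of SiAM-dp, the example modes are the ones named in the text, and the sketch does not anticipate the more exact classification of Chapter 5.

```python
from enum import Enum
from typing import Optional, Tuple


class Direction(Enum):
    INPUT = "input"
    OUTPUT = "output"


class Initiative(Enum):
    ACTIVE = "active"
    PASSIVE = "passive"


# Example modes taken from the text above; the table layout itself is an
# assumption made for illustration, not a structure defined in the thesis.
EXAMPLES = {
    (Direction.INPUT, Initiative.ACTIVE):
        ["speaking", "writing", "typing", "gesturing", "pointing"],
    (Direction.INPUT, Initiative.PASSIVE):
        ["facial expressions", "gaze", "physiological patterns", "location"],
    (Direction.OUTPUT, Initiative.ACTIVE):
        ["graphical output", "speech synthesis"],
    (Direction.OUTPUT, Initiative.PASSIVE):
        ["virtual character facial expressions", "lamp highlighting"],
}


def classify(mode: str) -> Optional[Tuple[Direction, Initiative]]:
    """Return the (direction, initiative) cell a known example mode falls into."""
    for cell, modes in EXAMPLES.items():
        if mode in modes:
            return cell
    return None  # mode is not one of the examples listed above


direction, initiative = classify("gaze")
print(direction.value, initiative.value)
```

A lookup table is used here because the text classifies modes by example rather than by a decision rule; in particular, the criterion for active output (an active equivalent in human interaction) is a judgement that this sketch does not attempt to automate.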

Definition 4 (Device Service)

An interaction mode that is supported by a device. The interaction mode is the combination of modality and code that can be processed by the device service. Device services can be user interfaces for input and output, but also actuators and sensors. One device may contain more than one device service.

Definition 5 (Device Component)

The physical component of a device that is used to generate a system output or perceive a user input.

Figure 2.4 shows the relations between the above-mentioned terms using the example of a modern smartphone. Four typical device services are listed: speech synthesis, speech recognition, GUI, and hand gesture recognition. Every device service employs an individual device component. The modalities and codes also differ, with the exception of speech synthesis and speech recognition. Here the decisive difference is the direction of the communication: speech synthesis is used to present system output, whereas speech recognition recognises user input.

Figure 2.4 – Relations of the terms Device, Device Service, Device Component, Modality, and Code
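The relations between the five terms can also be sketched as a small data model. The Python code below is a minimal illustration under stated assumptions, not the SiAM-dp data model: all class and function names are hypothetical, and the concrete component, modality, and code values for the smartphone are illustrative guesses chosen to match the description in the text, not values read from the figure.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import combinations
from typing import List, Tuple


class Direction(Enum):
    INPUT = "input"
    OUTPUT = "output"


@dataclass(frozen=True)
class DeviceService:
    """Definition 4: an interaction mode supported by a device."""
    name: str
    component: str        # Definition 5: physical component used by the service
    modality: str         # Definition 1
    code: str             # Definition 2
    direction: Direction

    @property
    def interaction_mode(self) -> Tuple[str, str]:
        # The interaction mode is the combination of modality and code.
        return (self.modality, self.code)


@dataclass
class Device:
    """Definition 3: one unit that can combine several device services."""
    name: str
    services: List[DeviceService] = field(default_factory=list)


def services_sharing_mode(device: Device) -> List[Tuple[DeviceService, DeviceService]]:
    """Return all pairs of services with the same modality and code."""
    return [(a, b) for a, b in combinations(device.services, 2)
            if a.interaction_mode == b.interaction_mode]


# Illustrative values only (assumptions, not taken from Figure 2.4).
smartphone = Device("smartphone", [
    DeviceService("speech synthesis", "loudspeaker", "auditory",
                  "spoken language", Direction.OUTPUT),
    DeviceService("speech recognition", "microphone", "auditory",
                  "spoken language", Direction.INPUT),
    DeviceService("GUI", "display", "visual",
                  "graphics and text", Direction.OUTPUT),
    DeviceService("hand gesture recognition", "camera", "gestural",
                  "hand gestures", Direction.INPUT),
])

for a, b in services_sharing_mode(smartphone):
    print(a.name, a.direction.value, "|", b.name, b.direction.value)
```

Modelling the interaction mode as a derived pair of modality and code, rather than as a separate field, keeps Definition 4 explicit: two services with the same pair, such as speech synthesis and speech recognition here, are then distinguished only by their component and by the direction of communication.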