3.1.3 Components
Having presented the general architecture of dialog systems and some of the relevant issues, we now describe each component of a dialog system in more detail.
Speech recognition
The key factor for spoken dialog systems is the quality of the speech recognition module. Speech recognition is the task of translating a raw speech signal into one or more hypotheses of what was said, usually expressed as a string of words, which then serve as input to the natural language interpretation module. This task is usually conceptualized in terms of the noisy-channel model, which treats the original utterance as having been distorted by noise along the way. The goal is to build a model of how the noise affects the signal in order to recover the original utterance given only the distorted signal.
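In the usual textbook formulation of the noisy-channel model, recognition is a search for the most probable word sequence given the observed signal. The symbols below are introduced here for illustration and do not appear in the text: O stands for the observed acoustic signal and W for a candidate word sequence.

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid O)
        = \operatorname*{arg\,max}_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \operatorname*{arg\,max}_{W} P(O \mid W)\, P(W)
```

The denominator P(O) can be dropped because it does not depend on W. The term P(O | W) models how an utterance is realized as a (distorted) signal, and P(W) models which word sequences are likely in the first place.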
As a first step, speech recognition requires digitizing the speech signal recorded by one or more microphones. The digital signal is then segmented into frames of about 10 to 20 ms, and from each frame acoustic features are extracted with the help of signal processing methods. Based on these acoustic features, a number of statistical models are applied in order to estimate the most likely utterance. The models cover the probability that the given acoustic features realize certain phones (the acoustic model), the probability that a sequence of phones realizes a certain word, and the likelihood of word sequences in a particular language (the language model).
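The segmentation step can be illustrated with a minimal sketch. It assumes that the signal is already available as a list of numeric samples, and it uses overlapping frames with a 10 ms step, which is a common choice but an assumption here (the text only specifies frame lengths of about 10 to 20 ms). The function names and the log-energy feature are hypothetical stand-ins for the feature extraction of a real recognizer.

```python
import math


def split_into_frames(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Split a list of samples into fixed-length, overlapping frames.

    frame_ms is the frame length and hop_ms the step between frame starts
    (both in milliseconds); the overlap is an assumption of this sketch.
    """
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


def log_energy(frame):
    """A deliberately simple acoustic feature: log of the mean frame energy."""
    mean_energy = sum(s * s for s in frame) / len(frame)
    # The small constant avoids log(0) for completely silent frames.
    return math.log(mean_energy + 1e-10)


# Usage: one second of a constant dummy signal sampled at 16 kHz.
signal = [1.0] * 16000
features = [log_energy(f) for f in split_into_frames(signal, 16000)]
```

Real systems extract richer feature vectors per frame, but the overall shape is the same: one feature vector per short frame, passed on to the statistical models.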
In general, the performance of speech recognition depends on the size and variety of the set of utterances that should be recognized. If the expected input is small and constrained, the recognition task is simpler than if the expected input is fairly unconstrained. Based on this insight, it is a common strategy to use knowledge about the current state of the dialog to guide the speech recognition, as certain states make certain utterances more likely than others. Furthermore, the recognition of isolated words, as in certain phone command systems, is easier and more reliable than the recognition of continuous speech. Speech recognition in dialog systems usually deals with speech directed at the machine, which differs from the speech encountered in automatic transcription of human-human conversation. Another parameter is the level of ambient noise in the signal.
Another determining factor for the quality of the recognition is the training data and how similar it is to the actual data. This is particularly relevant for the recognition of non-native speech, since standard recognizers are usually trained on native speech. Tomokiyo (2001) reports word error rates (WER) between 33 and 75 percent for English spoken by native Japanese speakers, compared to between 13 and 21 percent for native speakers. She also shows that the WER is related to the proficiency level of the speaker. Although there are ICALL systems that try to employ a standard recognizer trained on native speech (Morton and Jack, 2005; Anderson et al., 2008), it is usually more promising to adapt to non-native speech. One way is to train the recognizer on non-native speech data. However, given that there are fewer potential sources, it is hard and expensive to collect sufficient amounts of such data. It is even harder if the system is supposed to work with a variety of first languages and proficiency levels, since accents might differ considerably. Given these problems in collecting non-native data, there have been approaches to adapt native-trained recognizers based on known regularities of specific accents (Goronzy, 2004), or, in a more general approach, based on the observed differences for a set of different accents (Raux, 2004). For a more detailed account of these attempts, see Eskenazi (2009). Apart from being integrated in spoken dialog systems, speech recognition for ICALL has also been used for pronunciation training and correction in various applications (Eskenazi, 2009). Another, if somewhat dated, overview of speech-based ICALL applications is given by Ehsani and Knodt (1998). A recent example of such efforts is the IFCASL³ project, which aims to provide automated individualized feedback for pronunciation errors. Part of this project is to build a bilingual corpus for French and German with the objective of predicting the particular learner errors for these two pairings of native and learner language (Fauth et al., 2014).
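The word error rates cited above follow the standard definition: the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The sketch below implements this definition with a word-level edit distance. It is meant as an illustration only and does not reproduce the exact scoring setup of the cited studies; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dist[-1][-1] / len(ref)


# One deleted word ("the") and one substituted word ("on" -> "off")
# against a four-word reference.
wer = word_error_rate("turn the light on", "turn light off")
```

Because insertions are counted, the WER can exceed 100 percent when the hypothesis is much longer than the reference.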
3.1. DIALOG 33
Speech synthesis
Text-to-speech (TTS) synthesis produces an auditory signal based on text input. The process is usually divided into two phases: first, the textual input is translated into a phonemic representation, which is then synthesized as a waveform. There are two types of approaches to synthesis: one is based on models of the vocal tract, the other on the concatenation of prerecorded units (Taylor, 2009). The former, first-generation approaches attempt to generate speech from scratch based on models of how the acoustic features of speech arise from the physiological conditions of the human speech organs. A major disadvantage of these approaches is that the voices they produce do not sound very natural. Compared to the data-driven techniques of the second generation, however, they are more economical in terms of memory and processing demands. With the increases in available memory and processing power, concatenative synthesis has become more feasible. For these approaches, prerecorded speech is chopped up into units of different sizes and then recombined. Their main advantage is naturalness, which makes them particularly suited for ICALL applications. For very constrained domains, a simpler approach is to use words as units and concatenate them; in this case, a phonemic representation may not be necessary.
For ICALL applications, speech synthesis is used not only as a part of dialog systems, but also in reading machines (including talking texts, talking dictionaries, and dictation systems) and as a pronunciation model for practicing individual or combined sounds (phonemes), prosody, and intonation (Handley and Hamel, 2005). Apart from naturalness, other criteria for the suitability of speech synthesis for ICALL are comprehensibility, intelligibility, choice of pronunciation, accuracy, expressiveness, and appropriateness of register of the synthesized speech (Handley, 2009).
Natural language interpretation
The interpretation of utterances as part of dialog systems serves two purposes. For one, it is the precondition for generating an appropriate response. Furthermore, the content of the interpreted utterance is integrated into the existing knowledge base (Poesio, 2000). In order to achieve this, the linguistic input needs to be related to non-linguistic knowledge of the world. This requires (a) a formal representation of meaning and (b) computational methods that assign a meaning representation to the linguistic user input – semantic analysis.
Interpretation is challenging due to various factors. First of all, for speech-based systems, the result of automatic speech recognition is still not perfect and can lead to incorrect hypotheses to start from. Furthermore, spoken language is characterized by disfluencies like filled pauses, repetitions and corrections. In addition, utterances may be non-sentential, i.e., fragments that are not complete according to traditional grammars but can be resolved in the context of the preceding dialog. Consider, for example, expressions such as “when?”, “at the post office”, or “exactly”, which can only be understood in relation to previous utterances. Similarly, referring expressions refer to entities in the context of the conversation and require a representation of the context for their interpretation. Consider deictic markers like “here”, “today”, “this”, or “you”, which refer to the particular spatial and temporal context of the conversation and to objects and persons that are present. Anaphoric expressions refer to entities mentioned previously in the dialog, for instance the personal pronoun “she”, which refers to some female person established previously. The resolution of deictic and anaphoric referring expressions, as well as of non-sentential utterances, increases the potential for ambiguity, which is a notorious challenge in NLP.
A very simple form of representation relies on extracting meaningful keywords or key phrases from the input and mapping them to system responses (Komatani et al., 2001; Zhang et al., 2007). This can be appropriate for very small and constrained domains, such as controlling devices. An application to control home appliances might spot the words “turn”, “light”, and “on” within the user input and translate this into a command to switch the light on. Simple keyword spotting may not be sufficient for systems that are supposed to handle more varied input. For such systems, the range of expected user inputs is described by a grammar augmented with information for semantic interpretation.
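The home appliance example can be sketched as follows. The command names and the keyword table are hypothetical; the sketch simply checks whether all keywords of a command occur somewhere in the input, regardless of order.

```python
import re

# Hypothetical keyword sets and the commands they trigger.
COMMANDS = [
    ({"turn", "light", "on"}, "LIGHT_ON"),
    ({"turn", "light", "off"}, "LIGHT_OFF"),
]


def spot_command(utterance):
    """Return the first command whose keywords all occur in the utterance."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    for keywords, command in COMMANDS:
        if keywords <= words:  # every keyword was spotted
            return command
    return None


# Word order and filler words are ignored.
command = spot_command("Could you turn the light on, please?")
```

The same sketch also shows why keyword spotting breaks down for more varied input: an utterance such as “do not turn the light on” contains the same keywords and would trigger the same command.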
One common way of integrating semantic interpretation is to design a context-free grammar in which non-terminals directly correspond to domain-specific semantic concepts. This approach is known as semantic grammar and goes back to Brown and Burton (1975). The result of a parse with such a grammar corresponds to a slot-and-frame (attribute-value matrix) semantic representation, in which the non-terminals correspond to slot names (attributes) and the terminals correspond to the slot fillers (values). A similar way of integrating semantic information is to add semantic tags to the rules of a context-free grammar. This approach has been realized in various grammar representations for speech recognition (see, for instance, the W3C specification Semantic Interpretation for Speech Recognition⁴ or the Java Speech Grammar Format (JSGF)⁵). Because it is quite efficient and relatively easy to implement, the approach has been widely used. However, its disadvantage is that an implementation is very domain-specific and therefore not easily adaptable to other domains.
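A minimal sketch of a semantic grammar is given below. The toy grammar, its concept names, and the parser are hypothetical and only illustrate the idea that non-terminals become slot names and the matched words become slot fillers. The parser commits to the first matching production and does not backtrack afterwards, which is a simplification that suffices for this grammar.

```python
# Hypothetical toy grammar: every non-terminal is a domain concept.
GRAMMAR = {
    "COMMAND": [["ACTION", "the", "DEVICE"], ["ACTION", "DEVICE"]],
    "ACTION": [["switch", "on"], ["switch", "off"]],
    "DEVICE": [["light"], ["radio"]],
}


def parse(symbol, words, pos):
    """Match `symbol` at words[pos:]; return (next_pos, slots) or None."""
    if symbol not in GRAMMAR:  # terminal: must equal the next word
        if pos < len(words) and words[pos] == symbol:
            return pos + 1, {}
        return None
    for production in GRAMMAR[symbol]:
        cur, slots = pos, {}
        for part in production:
            result = parse(part, words, cur)
            if result is None:
                break  # this production fails, try the next one
            cur, part_slots = result
            slots.update(part_slots)
        else:
            # A non-terminal that expands only to words becomes a filled slot.
            if all(part not in GRAMMAR for part in production):
                slots = {symbol: " ".join(words[pos:cur])}
            return cur, slots
    return None


def interpret(utterance):
    """Return a slot-and-frame representation, or None if parsing fails."""
    words = utterance.lower().split()
    result = parse("COMMAND", words, 0)
    if result is None or result[0] != len(words):
        return None
    return result[1]


frame = interpret("switch off the radio")
```

For the input above, the resulting frame has the slots ACTION and DEVICE. Adapting the sketch to a different domain means rewriting the grammar itself, which reflects the domain dependence of the approach.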
A more general approach is to enhance the syntactic grammar with semantic attachments that specify how to compute the meaning representation of a construction based on the meaning of its constituents, using first-order predicate logic and the λ-calculus. For grammar formalisms based on feature structures and unification, semantics can be represented within the feature structures and the composition of meaning as unification equations. An example of grammar-based interpretation is given in Van Noord et al. (1999).
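The idea of semantic attachments can be illustrated with a small sketch in which λ-terms are represented as Python functions and logical formulas as strings. The lexicon, the two rules, and the function name are hypothetical, and the sketch leaves out quantifiers and parsing altogether; it only shows how the meaning of a sentence is computed by applying the meaning of one constituent to that of another.

```python
# Hypothetical lexicon: proper names denote constants, verbs denote functions
# (the Python counterpart of terms such as  λx.sleep(x)  and  λy.λx.see(x, y)).
LEXICON = {
    "Mary": "mary",
    "John": "john",
    "sleeps": lambda subj: f"sleep({subj})",
    "sees": lambda obj: lambda subj: f"see({subj}, {obj})",
}


def sentence_meaning(subject, verb, obj=None):
    """Compose a meaning for the rules S -> NP VP and VP -> V | V NP."""
    vp = LEXICON[verb]
    if obj is not None:
        vp = vp(LEXICON[obj])      # attachment for VP -> V NP: V.sem(NP.sem)
    return vp(LEXICON[subject])    # attachment for S -> NP VP: VP.sem(NP.sem)


meaning = sentence_meaning("John", "sees", "Mary")
```

Each function application corresponds to one β-reduction step in the λ-calculus.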
While such a deep semantic analysis is arguably more general and thus less dependent on a particular domain, its development is relatively expensive. Approaches to cut down these costs while still aiming for domain independence comprise a shallower analysis of semantics and machine learning techniques to automatically arrive at an interpretation. In semantic role labeling (SRL), which is also referred to as shallow semantic parsing (Gildea and Jurafsky, 2002), semantic roles are assigned to phrases of a sentence relative to a target predicate that invokes the semantic frame (Fillmore, 1976). While the role assignment is an automated process based on statistical learning techniques, it depends on annotated resources such as the FrameNet database that require considerable effort for their construction. Coppola et al. (2009) present examples of successful SRL-based interpretation of spoken dialogs, which rests on the English FrameNet database and a smaller domain-dependent database constructed by labeling a corpus of Italian help-desk dialogs. He and Young (2005, 2006) present another statistical parsing approach, which further reduces the dependence on annotated databases by making do with annotations that contain no syntactic information and can be obtained easily from the associated SQL database queries or from the parse results of a semantic parser.
4 http://www.w3.org/TR/semantic-interpretation
A good overview and more details on semantic interpretation for dialog systems are given in De Mori et al. (2008) and Jurafsky and Martin (2009).
ICALL applications that attempt to interpret learner language need to take into account the nature of non-target-like language and may include any of the error diagnosis approaches described in Section 2.3. We will discuss some of those attempts in the context of our detailed discussion of systems below (Section 3.2). A very recent effort at parsing spoken learner language is described by Caines and Buttery (2014).
Natural language generation
Based on a communicative goal provided by the dialog manager, the generation module is responsible for finding the best realization of that goal. As in the interpretation step, a variety of methods is available that differ with regard to their flexibility, expressiveness and complexity. Simple approaches rely on canned utterances; slightly more advanced approaches make use of templates that contain slots which are filled with variable fillers. Such simple approaches lack generality and are usually very application-specific, but have the advantage of easy maintenance. More powerful generation methods rely on syntactic and semantic representations. The generation process can be divided into different steps (Rambow et al., 2001; Walker and Rambow, 2002). In the first step, content or text planning, the communicative goal is decomposed into atomic subgoals that correspond to single utterances. In the second step, sentence planning, sentences are planned based on atomic speech acts by selecting lexemes and syntactic structures. These then feed into the third step, surface realization. In this step, function words (e.g., determiners, auxiliaries) are added, word order is determined, and lexemes are inflected according to morphological rules. For systems with speech output, the final step is prosody assignment, during which the surface string is enriched with intonation and stress patterns.
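The two simple approaches mentioned above, canned utterances and templates with slots, can be sketched in a few lines. The goal names, template texts, and slot names are hypothetical and merely illustrate how a communicative goal selected by the dialog manager might be mapped to a surface string.

```python
# Hypothetical mapping from communicative goals to templates.
TEMPLATES = {
    "ask_day": "On which day would you like to meet {person}?",
    "confirm_appointment": "Your appointment with {person} is on {day} at {time}.",
    "goodbye": "Goodbye!",  # a canned utterance: a template without slots
}


def realize(goal, **slots):
    """Fill the template for `goal` with the given slot fillers.

    Raises KeyError if the goal is unknown or a required slot is missing.
    """
    return TEMPLATES[goal].format(**slots)


utterance = realize("confirm_appointment", person="Anna", day="Monday", time="10:00")
```

The sketch also makes the limitation visible: every new kind of utterance requires a new hand-written template, whereas the planning and realization steps described above derive the surface form from more abstract representations.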
For the particular purposes of ICALL applications, the generation module may need to consider the limited vocabulary and knowledge of syntactic structures that learners at different stages might have. Furthermore, it may also consider a preference for particular structures or words that the learner should be exposed to. With a view to corrective feedback given in response to learner errors, the generation module may consider different parameters of feedback and the availability of information about the error, as explained in more detail in Sections 5.3 and 5.4.
36 CHAPTER 3. DIALOG FOR LANGUAGE LEARNING
Figure 3.2 – A simplified example of a finite state automaton that models a dialog for making appointments. The labels of the nodes refer to system utterances; the labels of the edges are interpretations of the user response. The solid edges indicate transitions that are executed without conditions; the dashed edges indicate transitions that depend on the interpretation of a user response.
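An automaton of the kind described in the caption can be sketched as a table of states. The states, utterances, and edge labels below are hypothetical and are not taken from the figure: "next" stands for an unconditional (solid) transition, and "on" maps interpretations of the user response to successor states (dashed transitions).

```python
# Hypothetical appointment dialog; the actual states of Figure 3.2 may differ.
STATES = {
    "greet": {"say": "Hello!", "next": "ask_day"},
    "ask_day": {"say": "Which day suits you?",
                "on": {"day": "ask_time", "unknown": "ask_day"}},
    "ask_time": {"say": "At what time?",
                 "on": {"time": "confirm", "unknown": "ask_time"}},
    "confirm": {"say": "Shall I book it?",
                "on": {"yes": "done", "no": "ask_day", "unknown": "confirm"}},
    "done": {"say": "Your appointment is booked."},  # final state
}


def run_dialog(interpretations, start="greet"):
    """Run the automaton on a sequence of interpreted user responses.

    Returns the list of system utterances produced along the way.
    """
    state = start
    said = []
    responses = iter(interpretations)
    while True:
        node = STATES[state]
        said.append(node["say"])
        if "next" in node:            # unconditional transition
            state = node["next"]
        elif "on" in node:            # transition depends on the user response
            label = next(responses, None)
            if label is None:         # no more user input
                break
            state = node["on"].get(label, node["on"]["unknown"])
        else:                         # final state
            break
    return said


system_turns = run_dialog(["day", "time", "yes"])
```

In this sketch the dialog state is nothing more than the name of the current node, which is what makes finite state approaches easy to build but rigid with respect to the order in which information can be provided.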
Dialog manager
At the heart of a dialog system lies the dialog manager, which is responsible for updating and maintaining the current dialog state and for selecting communicative goals based on that state. Updates to the dialog state are usually triggered by results of the interpretation module but, depending on the architecture, can also be induced by information from other processing modules and from the external state and task manager. Similarly, the communicative goal selected by the dialog manager is passed to the generation component, but there can be other, non-linguistic actions that the dialog manager passes to the task manager or to modules for other modalities. A crucial part of the dialog manager is the dialog state representation. In the following section, we discuss in more detail the different approaches to modeling dialog state and dialog flow.