3.3 Self-repair and incrementality in dialogue frameworks and systems
3.3.1 Self-repair in the Incremental Unit framework
Incremental dialogue systems saw notable theoretical and implementational development with the proposal of an abstract incremental architecture, the Incremental Unit (IU) framework (Schlangen and Skantze, 2009, 2011). Several interactive systems using its incremental multi-modular specification have since been developed, including those with NLG and voice synthesis modules (Skantze and Hjalmarsson, 2010; Baumann, 2013). The generation of mid-utterance back-channels (Skantze and Schlangen, 2009) and interruptions (Buß et al., 2010), phenomena that require continual interaction between speech recognition, parsing, generation and voice synthesis, has been shown to be a more tractable problem within such an architecture.
The IU framework can be described as a network of modules, each comprising a left buffer for input incremental units (IUs), a processor and a right buffer for output IUs. Each IU has a payload, which determines the kind of data it carries: a word, a POS tag, a numerical value, or anything else determined by the system designer. System behaviour is determined by the edit actions (add, commit and revoke) applied to IUs in a module’s right buffer and by their effect on the left buffers of its downstream modules. Furthermore, IUs can have same-level link relations to one another if they should be inter-dependent within a module buffer, or grounded-in relations across different module buffers. The buffers are defined as graphs whose nodes represent IUs, allowing multiple hypotheses to be constructed from time-linear input and subsequently revised. These desirable incremental properties will be exploited in the proposed formal computational model in Chapters 6 and 7.
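The module structure described above can be illustrated with a minimal sketch. It is not the reference implementation of the IU framework: the class names, field names and the one-output-per-input processor are assumptions made purely for illustration, and buffers are simplified to lists rather than graphs of competing hypotheses.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass(eq=False)  # compare IUs by identity, not by payload
class IU:
    """An incremental unit; all field names here are illustrative assumptions."""
    payload: Any
    same_level_link: Optional["IU"] = None                 # e.g. previous IU in the same buffer
    grounded_in: List["IU"] = field(default_factory=list)  # IUs in upstream buffers
    committed: bool = False
    revoked: bool = False


class Module:
    """A module with a left (input) buffer, a processor and a right (output) buffer."""

    def __init__(self, name: str):
        self.name = name
        self.left_buffer: List[IU] = []
        self.right_buffer: List[IU] = []
        self.downstream: List["Module"] = []

    def process(self, iu: IU) -> Any:
        """Placeholder processor: derive an output payload from one input IU."""
        return (self.name, iu.payload)

    # Edit actions on the right buffer; add and revoke are mirrored downstream.
    def add(self, iu: IU) -> None:
        self.right_buffer.append(iu)
        for module in self.downstream:
            module.receive_add(iu)

    def commit(self, iu: IU) -> None:
        iu.committed = True  # the module promises not to revoke this IU

    def revoke(self, iu: IU) -> None:
        iu.revoked = True
        self.right_buffer.remove(iu)
        for module in self.downstream:
            module.receive_revoke(iu)

    # Effects of upstream edit actions on this module's left buffer.
    def receive_add(self, iu: IU) -> None:
        self.left_buffer.append(iu)
        previous = self.right_buffer[-1] if self.right_buffer else None
        self.add(IU(self.process(iu), same_level_link=previous, grounded_in=[iu]))

    def receive_revoke(self, iu: IU) -> None:
        self.left_buffer.remove(iu)
        # Revoke every output IU that was grounded in the revoked input.
        for out in [o for o in self.right_buffer if iu in o.grounded_in]:
            self.revoke(out)


# Usage: a hypothetical recogniser feeding a hypothetical tagger.
asr, tagger = Module("asr"), Module("tagger")
asr.downstream.append(tagger)
first, second = IU("four"), IU("forty")
asr.add(first)
asr.add(second)
asr.revoke(second)  # the tagger's IU grounded in "forty" is revoked as well
```

In this sketch a revoke in one module's right buffer cascades through the grounded-in links, which is the property the later sections rely on for self-repair.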
Speech plan generation in Jindigo
Skantze and Hjalmarsson (2010) implement an incremental canned-speech based vocalizer (generation and TTS) module in the incremental dialogue framework Jindigo, the Java-implemented dialogue system based on Schlangen and Skantze’s abstract specification. Their implementation does not rely on end-of-utterance silence thresholds from the ASR module before beginning the generation of a response, so the latency of response is greatly reduced compared to a non-incremental version. A chain of incremental updates occurs in its buffers to allow incremental generation: word hypotheses are made for incoming auditory input and sent in real time to the interpretation module’s input buffer; the interpretation module processes these hypotheses to add concepts to the dialogue manager’s input buffer; and the dialogue manager in turn processes these concepts to generate a SpeechPlan for the vocalizer. When its upstream modules have not yet committed to complete IUs, the vocalizer may start adding SpeechSegments such as “eh” and “well, let’s see” to allow an immediate response without having to wait for complete input.
The SpeechSegments, while sometimes spanning several words (e.g. “it is blue”), are semantically atomic; however, word-by-word generation is achieved by further dividing the segments into word-length SpeechUnits to be processed serially by the vocalizer. This incremental division gives Jindigo its mechanism for self-repair in the face of changing speech plans during generation: cross-checking the speech plan currently being vocalized against the new candidate speech plan gives the optimal word/unit position from which the repair can be integrated. Self-repair is therefore possible when input concepts are revised, and commitment to a plan is revoked, after its vocalisation has commenced. It can occur both covertly (before synthesis) and overtly (after synthesis), and at both the segment and unit levels (see Figure 3.6).
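The cross-checking of speech plans can be sketched as follows. This is only an illustration under stated assumptions: plans are flattened to lists of word-length units, the cross-check is approximated by a longest-common-prefix comparison, and the function names are hypothetical; the actual Jindigo procedure may differ.

```python
from typing import List, Tuple


def common_prefix_length(current: List[str], new: List[str]) -> int:
    """Number of leading units shared by the two plans."""
    shared = 0
    for old_unit, new_unit in zip(current, new):
        if old_unit != new_unit:
            break
        shared += 1
    return shared


def plan_repair(current: List[str], new: List[str], realised: int) -> Tuple[str, List[str]]:
    """Classify the repair and return the units still to be vocalized.

    `realised` is the number of units of `current` already synthesised.
    """
    if current == new:
        return "none", new[realised:]
    divergence = common_prefix_length(current, new)
    if realised <= divergence:
        # Nothing conflicting has been spoken: switch plans silently.
        return "covert", new[realised:]
    # Conflicting units are already public: resume from the divergence
    # point (an assumption; a real system may add an editing term).
    return "overt", new[divergence:]


old_plan = ["it", "is", "blue"]
new_plan = ["it", "is", "red"]
print(plan_repair(old_plan, new_plan, realised=1))  # ('covert', ['is', 'red'])
print(plan_repair(old_plan, new_plan, realised=3))  # ('overt', ['red'])
```

The same comparison could be run over whole segments rather than units to obtain the segment-level variant.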
While their model is not as clearly psychologically motivated as some of the generation work mentioned above, for instance not being syntactically oriented like Kempen and Hoenkamp, De Smedt and Neumann’s frameworks and lacking the fine-grained semantic input of Guhe’s model, Jindigo’s ability to allow a maximal amount of incremental information flow between all the modules in the dialogue system opens up the possibility of more interactive and responsive NLG. It not only allows parallelism within the generation process itself, but also allows generation outcomes to depend incrementally on other decision processes within the dialogue system and, in terms of a psychological analogue, provides a better interface with the rest of the cognitive model. The flexibility in the specification of different module behaviours allows the testing of different theories and implementations for individual components of speech production, and hence situates the dialogue system as a tool-for-understanding (see Schlangen, 2009, for an explanation of this approach).
Figure 3.6: Different types of speech repairs in the Jindigo vocalizer module. Shaded areas show which SpeechUnits have been realized at the point of revision. From Skantze and Hjalmarsson (2010).
It is also worth mentioning Skantze and Hjalmarsson’s evaluation, which used an innovative Wizard-of-Oz experiment. Consistent with evidence in Aist et al. (2007) that incremental dialogue systems seem more efficient and pleasant to use than their non-incremental counterparts, they found that users similarly preferred an incremental generation system over a non-incremental version in terms of ratings of politeness, efficiency and indication of when to speak. They found no difference in user response times between the two systems, which seems not to support Brennan and Schober (2001)’s claim of faster response times upon hearing corrections. However, this measurement was presumably taken from the end of the system’s utterances, to ensure comparison across all utterances, rather than from the onset of particular semantically salient words as in Brennan and Schober’s experiments, so comparison is difficult here. In any case, the presence of self-repair certainly does not hinder response time. Their contribution is valuable in terms of the evaluation challenge for incremental NLG, as the method isolates a capability of the system that can be controlled for, exhibiting notable interactional differences.
Incremental dialogue management with self-repair capability
The Jindigo implementation of self-repair was exploratory in terms of testing interactional effects; however, as it functioned within a Wizard-of-Oz setting rather than an end-to-end dialogue system, it is difficult to claim it is a generation implementation. Buß and Schlangen (2011) address the challenge of generating corrections and representing repair on a discourse level through their system DIUM, an incremental dialogue management module that functions in an implemented IU framework-based dialogue system.
DIUM addresses the need for a dialogue manager to self-repair when it must produce output that conflicts with system behaviour that has already been publicly realised. It simultaneously addresses the easier problem of covert repairs, where the conflicting information has not been realised: it uses the IU network’s edit message revoke to remove the information that conflicts with the current plan, and can also revoke the IUs in the dialogue manager that are grounded in (i.e. were triggered by) the revoked IUs. It achieves its revision capability by characterising its internal information state as an IU network, which allows the edit messages (add, commit, revoke) to operate on internal information rather than on simple string output as in Skantze and Hjalmarsson (2010)’s system. The revision of internal representations is made possible by incrementalising concept frames and characterising them as an IU network themselves, as in Figure 3.7. They also introduce SemIU, DiscourseIU and DialogueActIU incremental units to differentiate the factual content established, the issues (i.e. frames) that are required to be resolved, and the plan to form a dialogue act, respectively.
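The way a revoke can spread through grounded-in links in such an information state can be sketched as below. This is a hedged illustration, not DIUM's code: the `IU` class, the `revoke` function and the particular chain of grounding from input word to SemIU, DiscourseIU and DialogueActIU are assumptions chosen for the example.

```python
class IU:
    """Minimal IU for an information state; attribute names are assumptions."""

    def __init__(self, kind, payload, grounded_in=()):
        self.kind = kind                      # e.g. "SemIU", "DiscourseIU"
        self.payload = payload
        self.grounded_in = list(grounded_in)  # IUs that triggered this one
        self.revoked = False


def revoke(iu, network):
    """Revoke `iu` and, transitively, every IU in `network` grounded in it.

    Returns the list of IUs revoked by this call.
    """
    if iu.revoked:
        return []
    iu.revoked = True
    removed = [iu]
    for other in network:
        if iu in other.grounded_in:  # identity comparison (no __eq__ defined)
            removed.extend(revoke(other, network))
    return removed


# A hypothetical state: one input word gives rise to a chain of internal IUs.
word = IU("WordIU", "red")
sem = IU("SemIU", "colour=red", grounded_in=[word])
issue = IU("DiscourseIU", "frame:colour", grounded_in=[sem])
act = IU("DialogueActIU", "confirm(colour=red)", grounded_in=[issue])
other = IU("SemIU", "shape=cross")  # not grounded in the revoked word
state = [word, sem, issue, act, other]

revoked = revoke(word, state)
print([iu.kind for iu in revoked])
```

Because the edit operates on internal units rather than on output strings, unrelated content such as the second SemIU in the example is left untouched.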
The IU state graph in the DM is altered depending on the revision strategy required. For the more complex revision strategy, the following steps are taken:
1 Handle revoked input by computing a new state of the DM’s IU graph, removing DiscourseIUs that were grounded in the revoked input.
2 Check the DM’s own output to determine whether projected DialogueActIUs that are grounded in revoked input have been realised into observable output.
Figure 3.7: The DIUM dialogue management framework as an IU network (Buß and Schlangen, 2011)
In the final step, if it is reached, the fact that a repair has been initiated is recorded through the use of a novel type of DialogueActIU called UNDO. Creating a specific repair IU allows downstream realisation modules to interpret it and effect the appropriate repair behaviour. The authors only suggest a preliminary strategy: generating an apology (‘sorry about that’) whilst un-highlighting anything in the visual domain that was highlighted in the previous conflicting state.
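The revision strategy described above can be sketched in code. The sketch rests on several assumptions that are not taken from Buß and Schlangen (2011): the state is a flat list of IUs, all unrealised IUs grounded in the revoked input are dropped silently, and the final step is taken to be reached exactly when some affected DialogueActIU has already been realised. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based comparison
class IU:
    kind: str
    payload: str
    grounded_in: List["IU"] = field(default_factory=list)
    realised: bool = False  # True once the act has become observable output


def depends_on(iu: IU, source: IU) -> bool:
    """True if `iu` is directly or transitively grounded in `source`."""
    return any(g is source or depends_on(g, source) for g in iu.grounded_in)


def handle_revoke(state: List[IU], revoked_input: IU) -> List[IU]:
    """Compute a new DM state after `revoked_input` has been revoked."""
    kept: List[IU] = []
    realised_acts: List[IU] = []
    for iu in state:
        if not depends_on(iu, revoked_input):
            kept.append(iu)  # unaffected by the revision
        elif iu.kind == "DialogueActIU" and iu.realised:
            # Check of the DM's own output: this act is already public,
            # so it cannot simply be removed.
            kept.append(iu)
            realised_acts.append(iu)
        # Otherwise the IU (e.g. a DiscourseIU) is dropped: a covert repair.
    if realised_acts:
        # Final step: record that a repair has been initiated.
        kept.append(IU("DialogueActIU", "UNDO", grounded_in=realised_acts))
    return kept


word = IU("WordIU", "red")
issue = IU("DiscourseIU", "frame:colour", grounded_in=[word])
act = IU("DialogueActIU", "highlight(red)", grounded_in=[issue], realised=True)
new_state = handle_revoke([issue, act], word)
print([iu.payload for iu in new_state])  # ['highlight(red)', 'UNDO']
```

A downstream realiser could then map the UNDO unit onto whatever repair behaviour is chosen, such as the apology and un-highlighting strategy mentioned above.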
While the repair generation is limited, and not psychologically motivated, DIUM was a useful step towards incremental generation of self-repairs in an interactive system. While it does not have an NLU component capable of interpreting repairs from the user, it began to address the issue of how repair acts could be represented in a dialogue information state. A more thorough attempt at addressing this issue, albeit not one implemented in a working dialogue system, is described in the next section.
3.3.2 KoS: Dialogue semantics of disfluency in an Information State Update approach