3.1.3 Processing speech repairs with a noisy channel model

Johnson and Charniak (2004) present the first generative approach to processing self-repairs, in a noisy channel model of speech repair detection. They formulate the task of parsing utterances with repaired speech as finding the ‘cleaned’ utterance that is both most likely to have been generated by an underlying fluent source language model and most likely to generate the observed ‘noisy’ utterance (the ‘noise’ being the disfluencies); see Figure 3.2. For their channel model, they build an S-TAG (Synchronous Tree Adjoining Grammar; Shieber and Schabes, 1990) based transducer that yields complex sentences, which are strings of tuples pairing words from the ‘noisy’ sentence (the raw utterance with disfluencies) with the corresponding words from the source sentence (the clean underlying ‘intended’ utterance), using simple S-TAG rules.

The system is trained to yield string pairs which maximise the probability of the overall noisy channel model P(X | U), yielding cleaned utterances X from raw utterance strings U. Using the Bayesian noisy channel model formulation, this is achieved by maximising the combined probability of the S-TAG based channel model P(U | X), which generates the noisy strings, and the language model P(X), as in equation 3.3. The decoding task can therefore be viewed as a search for the most likely underlying clean sentence x ∈ X given the observed noisy utterance u.

$$\arg\max_{X} P(X \mid U) = \arg\max_{X} P(U \mid X)\,P(X) \qquad (3.3)$$
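The decoding rule in equation 3.3 can be illustrated with a minimal sketch. The probability tables below are invented toy values, not figures from Johnson and Charniak (2004), and the names `TOY_LM`, `TOY_CHANNEL` and `noisy_channel_decode` are hypothetical; the real system searches a chart of S-TAG analyses rather than a short list of candidates.

```python
import math

# Hypothetical toy probabilities, invented purely for illustration.
TOY_LM = {  # P(x): fluent source language model
    "a flight to denver": 0.6,
    "a flight to boston denver": 0.1,
}
TOY_CHANNEL = {  # P(u | x): channel model, keyed by (u, x)
    ("a flight to boston denver", "a flight to denver"): 0.2,
    ("a flight to boston denver", "a flight to boston denver"): 0.7,
}


def noisy_channel_decode(u, candidates, channel=TOY_CHANNEL, lm=TOY_LM):
    """Return the candidate x maximising log P(u | x) + log P(x)."""
    def score(x):
        return math.log(channel[(u, x)]) + math.log(lm[x])
    return max(candidates, key=score)


u = "a flight to boston denver"
best = noisy_channel_decode(
    u, ["a flight to denver", "a flight to boston denver"]
)
```

With these toy values the cleaned candidate wins, because its higher language model probability outweighs the lower channel probability (0.2 × 0.6 against 0.7 × 0.1).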

The S-TAG grammar in the channel model generates sentences Z, where each z ∈ Z consists of a string of word tuples. The first element of a tuple can be ⊘ (the null string) if the word is classified as part of a reparandum (i.e. removed from the output), so an underlying cleaned sentence x, consisting of the string of the first tuple elements, will always be a substring of the noisy sequence u, the string of the second tuple elements. If the second element of the tuple is not a reparandum word then both elements have the same lexical value.
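As a rough illustration of this tuple representation, the sketch below builds one hypothetical sentence z and reads off the cleaned and noisy word sequences. The tuple order (cleaned word first, raw word second) follows the paragraph above; the example words and the function names are assumptions made for illustration only.

```python
NULL = None  # stands in for the null string ⊘

# Hypothetical tuple sequence z; each tuple is (cleaned word, raw word).
z = [
    (NULL, "to"),          # reparandum word: removed from the cleaned output
    (NULL, "Boston"),      # reparandum word
    ("to", "to"),          # fluent word: both elements share a lexical value
    ("Denver", "Denver"),  # fluent word
]


def cleaned_sentence(z):
    """Collect the non-null first elements, giving the cleaned sentence x."""
    return [clean for clean, _ in z if clean is not NULL]


def noisy_sentence(z):
    """Collect the second elements, giving the noisy sentence u."""
    return [raw for _, raw in z]


def occurs_in_order(xs, us):
    """Check that the words of xs can be read off us in the same order."""
    remaining = iter(us)
    return all(x in remaining for x in xs)


x = cleaned_sentence(z)
u = noisy_sentence(z)
```

Here `x` is `["to", "Denver"]` and `u` is `["to", "Boston", "to", "Denver"]`, so the cleaned sentence is recoverable from the noisy one simply by dropping the words whose first tuple element is null.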

Note that the TAG rules do not assign grammatical structure to words (i.e. the TAG parser is not a syntactic parser); rather, they generate the strings of noisy utterances U from the underlying cleaned utterances X and yield a tree structure representing the repair-reparandum alignments, such as that in Figure 3.3. The model uses the context-sensitive properties of TAG (specifically the ability to deal with crossed serial dependencies) as a way of dealing with the ‘rough copy’ dependencies often present in speech repairs.

The auxiliary trees used in the derivations have the tuples ⟨raw word in u, cleaned word in x⟩ as their terminal nodes, i.e. the words that compose sentences Z as described above, and simple reparandum-repair alignment rule categories for their non-terminals (copy, delete, insert, substitute), indicating the correspondence between their left and right daughter terminals. They contain the repair category alone if they have a single daughter, i.e. in the case of nodes where interregnum trees are attached. More technically, as can be seen in Figure 3.3, the non-terminals divide into three categories: $N_{w_x}$ (the preceding word $w_x$ is not part of a repair); $R_{w_x:w_y}$ (the preceding word in a reparandum and its corresponding word in a repair phase: if these two words are identical it is a repetition, if they are different it is a substitution, if $w_x$ is ⊘ it is an insert and if $w_y$ is ⊘ it is a delete); and I, which dominates the interregnum word sequences. Note that the trees with $N_{w_x}$ and I mothers always branch rightward in a finite state fashion, which allows the probability of these rule applications to be obtained by normal n-gram language model estimation: the authors train a bigram model for the non-repair $N_{w_x}$ headed trees and a unigram interregnum model for I headed trees. See the derived tree for “..want a flight [ to Boston, +{ I mean } Denver ]” in Figure 3.3.
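The four alignment categories carried by the R non-terminals follow directly from the reparandum word and its corresponding repair word, which a short sketch makes concrete. The function name `repair_category` and the word pairs passed to it are hypothetical; only the mapping from word pairs to categories is taken from the description above.

```python
NULL = None  # stands in for the null string ⊘


def repair_category(w_x, w_y):
    """Label an R node from reparandum word w_x and repair word w_y."""
    if w_x is NULL and w_y is NULL:
        raise ValueError("at least one of the two words must be non-null")
    if w_x is NULL:
        return "insert"      # repair word with no reparandum counterpart
    if w_y is NULL:
        return "delete"      # reparandum word with no repair counterpart
    if w_x == w_y:
        return "copy"        # identical words: a repetition
    return "substitute"      # different words: a substitution


# Illustrative word pairs, not taken from the derivation in Figure 3.3.
labels = [
    repair_category("to", "to"),
    repair_category("Boston", "Denver"),
    repair_category(NULL, "Denver"),
    repair_category("to", NULL),
]
```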

The S-TAG parser runs in $O(n^5)$ time in the length of the input sequence, which they limit to word windows of length 12, as the system stochastically predicts a repair as beginning at every word. A chart is used to store all the possible repair sequences. The space and time complexity issues here will be discussed in Section 3.5 and also in Chapters 4 and 5.

3.1. Automated processing of self-repairs 64

Figure 3.2: Parsing model for disfluencies from Johnson (2011).

Figure 3.3: TAG-based derivation of a repaired utterance (Johnson and Charniak, 2004).

tion systems, first introduced in Charniak and Johnson (2001): the F-score of reparandum words rm retrieved. To calculate the precision and recall to give this result, if we take the total number of words hypothesised as being in a reparandum as $rm_{hyp}$, the number of correct hypotheses $rm_{correct}$ and the total number of gold standard reparandum words in the transcript as $rm_{gold}$, we have:

$$\text{precision} = \frac{rm_{correct}}{rm_{hyp}}, \qquad \text{recall} = \frac{rm_{correct}}{rm_{gold}}, \qquad \text{F-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (3.4)$$
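The quantities in equation 3.4 can be computed directly from the three counts, as in the sketch below. The function name and the example counts are invented for illustration, and returning 0.0 when a denominator is zero is an assumption rather than something the cited work specifies.

```python
def reparandum_f_score(rm_hyp, rm_correct, rm_gold):
    """Return (precision, recall, F-score) for reparandum word detection.

    rm_hyp:     number of words hypothesised as reparandum words
    rm_correct: number of those hypotheses that are correct
    rm_gold:    number of gold standard reparandum words
    """
    # Assumption: an undefined ratio (zero denominator) is scored as 0.0.
    precision = rm_correct / rm_hyp if rm_hyp else 0.0
    recall = rm_correct / rm_gold if rm_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented counts: 10 words hypothesised, 8 of them correct, 16 in the gold standard.
precision, recall, f_score = reparandum_f_score(10, 8, 16)
```

For these invented counts the precision is 0.8, the recall is 0.5 and the F-score is approximately 0.615.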

In testing their model they show that the system performs best when using a statistical-parser-based language model for P(X), with an F-score of 0.798, rather than bigram (F-score = 0.756) or trigram (F-score = 0.768) language models. It is worth mentioning that their model was not trained on overlapping repairs, which is surprising: given the embedded structure of such repairs (Shriberg, 1994), a grammar-based approach should be better suited to this problem than sequence labelling approaches.

Incrementalising the noisy channel model

Of the previous work, the model I consider most suitable for incremental dialogue systems is Zwarts et al. (2010)’s incremental version of Johnson and Charniak (2004)’s noisy channel repair detector, as it incrementally applies structural repair analyses and is evaluated for its incremental properties. Their system follows Johnson and Charniak (2004) but, instead of a parsing model, uses an n-gram language model trained on roughly 100K utterances of reparandum-excised (‘cleaned’) Switchboard data. As above, the channel model is a statistically-trained S-TAG parser whose grammar has simple reparandum-repair alignment rule categories for its non-terminals and words for its terminals. The parser hypothesises all possible repair structures for the string consumed so far in a chart, before pruning the unlikely ones; however, these are processed in a strictly left-to-right manner from the input string. It performs as well as the non-incremental model by the end of each utterance (F-score = 0.778), and can make detections early via the addition of a speculative next-word repair completion category to their S-TAG non-terminals.

In terms of incremental performance, they report the novel evaluation metric of time-to-detection for correctly identified repairs, achieving an average of 7.5 words from the start of the reparandum and 4.6 from the start of the repair phase. They also introduce delayed accuracy, a word-by-word evaluation against gold-standard disfluency tags up to the word before the current word being consumed (in their terms, the prefix boundary), giving a measure of the stability of the repair hypotheses. They report an F-score of 0.578 at one word back from the current prefix boundary, increasing word-by-word until 6 words back where it reaches 0.770. These results are the point of departure for the work in Chapter 5.
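One possible reading of delayed accuracy can be sketched as follows. This is an interpretation under stated assumptions, not Zwarts et al. (2010)’s implementation: it assumes that after each word is consumed the system outputs a reparandum tag for every word in the prefix, and that the score at k words back compares the tag of the word k positions before the prefix boundary with the gold standard. The function name and the example tags are hypothetical.

```python
def delayed_f_score(prefix_hyps, gold, k):
    """F-score of reparandum tags judged k words back from each prefix boundary.

    prefix_hyps[i] holds the hypothesised tags (True = reparandum word) for
    words 0..i after word i has been consumed; gold holds the gold tags.
    """
    hyp = correct = gold_count = 0
    for i, tags in enumerate(prefix_hyps):
        j = i - k  # index of the word k positions back from the boundary
        if j < 0:
            continue
        hyp += tags[j]
        gold_count += gold[j]
        correct += tags[j] and gold[j]
    precision = correct / hyp if hyp else 0.0
    recall = correct / gold_count if gold_count else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: the first two of five words are gold reparandum words.
gold = [True, True, False, False, False]
prefix_hyps = [
    [False],
    [False, False],
    [False, False, False],
    [False, True, False, False],
    [True, True, False, False, False],
]
one_back = delayed_f_score(prefix_hyps, gold, 1)
three_back = delayed_f_score(prefix_hyps, gold, 3)
```

In this invented example the score rises as k grows, since the hypothesised tags for a word only settle once later words have been consumed, which is the stability effect the metric is intended to expose.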