Characterizing retrace lengths with statistical modelling

2.3 Statistical corpus studies

2.3.5 Characterizing retrace lengths with statistical modelling

While work had been done on characterizing the forms of self-correction in terms of the permissible syntactic points of a sentence from which a speaker could retrace (Levelt, 1983, 1989), these accounts did not address the probability of a retrace occurring from each of these possible points, or how much of the utterance was likely to be retraced from a given position. To address this, Shriberg and Stolcke's (1998) corpus study developed a quantitative model of repetition-type self-repairs, based purely on word position, to attack the central question: when speakers retrace, what predicts how far back they go? (Shriberg and Stolcke, 1998, p.2183).

Table 2.3: Tag set for type of disfluent words (Shriberg, 1994, p. 57)


Table 2.4: Disfluency type annotation scheme for dialogue (Shriberg, 1994, p. 78)

for sentence boundaries and disfluencies, they harvested 30,524 disfluent utterances containing one or more retraced words. In gathering their data, they first observed that retracing neither crossed sentence boundaries nor went back to a mid-word point of a previously uttered word. They characterized retracing in terms of two measures: k, the number of retraced words in the sentence (i.e. the length of the reparandum, from the [ up to the +), and m, the number of words in the utterance before the repair was initiated (that is, the number of words from the beginning of the utterance up to the repair point +). Their word-based measures can be seen in Figure 2.5.

Figure 2.4: Distribution of disfluency types (Shriberg, 1994)

Figure 2.5: Word-based measures used in Shriberg and Stolcke (1998)’s study.
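The two measures can be illustrated with a minimal sketch. The token-list input format below, with "[" opening the reparandum and "+" marking the repair point, is an assumption made for illustration only, and the function name retrace_measures is hypothetical rather than anything taken from Shriberg and Stolcke (1998).

```python
def retrace_measures(tokens):
    """Return (k, m) for one utterance containing a single annotated retrace.

    Assumed input format (hypothetical): a list of tokens in which "[" opens
    the reparandum and "+" marks the repair point.
    """
    open_idx = tokens.index("[")
    plus_idx = tokens.index("+")
    if plus_idx < open_idx:
        raise ValueError("repair point '+' precedes reparandum start '['")
    # k: words in the reparandum, i.e. strictly between "[" and "+".
    k = plus_idx - open_idx - 1
    # m: words uttered before the repair point; the "[" marker is not a word.
    m = plus_idx - 1
    return k, m


# Invented example utterance, not drawn from the corpus.
example = ["show", "me", "[", "the", "flights", "+",
           "the", "flights", "]", "to", "boston"]
k, m = retrace_measures(example)
```

In this invented example the speaker retraces two words ("the flights") after uttering four, so k = 2 and m = 4. A retrace back to the start of the utterance is the case k = m.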

Their central finding is an apparent exponential decay in frequency with increasing values of k, for every value of m: in other words, speakers are much more likely to trace back one word than two words, and so on. They also identified an exception to this trend that holds for all values of m: a boosted frequency of retraces spanning the entire length of the utterance so far, i.e. when k = m. See the final “hooks” of the lines in Figure 2.6 for values of m ≤ 6.

There is also an exponential decay in this boosted probability of k = m retraces, in that the skip back to the start is exponentially less likely to happen when the utterance is one word further on; for example, when a speaker is only two words into a sentence the extra probability that they will skip back to the start is around 70%, whereas four words into a sentence it is only around 5% (Shriberg and Stolcke, 1998, p.2185).

Figure 2.6: Distribution of retrace lengths by position. From Shriberg and Stolcke (1998).
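The shape described above, exponential decay in k plus a "hook" at k = m whose size itself decays with m, can be sketched as a toy distribution. This is not the authors' fitted model. The per-word decay factor of 0.5 is a hypothetical value, and treating the "extra probability" as a mixture weight that is interpolated exponentially between the two quoted figures (around 70% at m = 2, around 5% at m = 4) is an assumption made here for illustration.

```python
import math


def boost(m, b2=0.70, b4=0.05):
    """Extra probability mass on k == m, assumed to decay exponentially in m.

    Anchored on the two approximate values quoted in the text (m = 2, m = 4);
    the interpolation between and beyond them is an assumption.
    """
    if m < 2:
        return 0.0  # with m == 1 the only possible retrace is k == 1
    ratio = math.sqrt(b4 / b2)  # per-word decay factor implied by the anchors
    return b2 * ratio ** (m - 2)


def retrace_length_probs(m, decay=0.5):
    """Toy P(k | m) for k = 1..m: geometric decay in k plus a boost at k == m.

    `decay` is a hypothetical per-word ratio, not a value from the study.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    base = [decay ** (k - 1) for k in range(1, m + 1)]
    total = sum(base)
    extra = boost(m)
    probs = [(1.0 - extra) * w / total for w in base]
    probs[-1] += extra  # the "hook": extra mass on retracing to the start
    return probs
```

Under these assumed parameters, retrace_length_probs(3) decreases from k = 1 to k = 2 and then rises again at k = 3, which mimics the hook at the end of each line in Figure 2.6.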

Additionally, their results show a uniform relationship between the probability of retracing N words back and that of retracing N+1 words back. This is shown by the straightness of the lines in Figure 2.6, which indicate a linear decrease in log frequency of occurrence for increasing values of k. The authors suggest that this uniformity is due not only to constituency boundaries, where it is more effortful to restart larger constituents,³ but also to processing and temporal factors: they claim that if speakers are optimizing speaker time rather than silence time, they retrace more when they need more time to reformulate.
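A straight line on a log-frequency plot corresponds to a constant ratio between the frequency of retracing k+1 words and that of retracing k words. As a minimal sketch, that ratio could be estimated by an ordinary least-squares fit of log counts against k. The counts below are invented for illustration and are not the corpus figures, and fit_log_linear is a hypothetical helper name.

```python
import math


def fit_log_linear(counts):
    """Least-squares fit of log(count) against retrace length k.

    counts: dict mapping k to a positive count.
    Returns (slope, intercept); exp(slope) is the implied per-word ratio.
    """
    ks = sorted(counts)
    if len(ks) < 2:
        raise ValueError("need counts for at least two values of k")
    ys = [math.log(counts[k]) for k in ks]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(ys) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    return slope, intercept


# Invented counts that halve with each extra retraced word.
toy_counts = {1: 800, 2: 400, 3: 200, 4: 100}
slope, intercept = fit_log_linear(toy_counts)
ratio = math.exp(slope)  # estimated frequency ratio between k + 1 and k
```

Because the invented counts halve exactly at each step, the fitted ratio comes out at one half. With real data the k = m points would need to be excluded before fitting, since they sit above the line.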

The other interesting aspect of their model is that, except for the boosted probability of retracing the entire length of the sentence, a retrace of a given length k is equally likely regardless of the number of words uttered in the sentence so far, m (the distance from the beginning of the utterance to the repair point). This can be seen graphically in the close proximity of the frequency points for each value of k across all the m-value lines, except when k = m. The authors do not offer a complete explanation for this, but it constitutes clear evidence that speakers tend to retrace as locally to the trouble source as possible. Eklund's (2004) thesis drew similar conclusions about verbatim retracing and its regularity, in that the maximum retrace length observed across several corpora was 4.

³ It is worth bearing in mind the observation by Levelt (1989) that in English, a predominantly syntactically right-branching language, under hierarchical phrase-structure analyses every word typically marks the onset of some constituent.

The authors concede that the model requires some syntactic constituency considerations to achieve greater coverage. They show this by conducting Monte Carlo experiments on part-of-speech (POS) tagged versions of all the retraces in the corpus used for their model. Taking the value of m for each sentence, they generate a simulated retrace by drawing a random value of k from the empirical distribution for that value of m, and taking the reparandum to begin k words back. In their comparison, they found that for the majority of POS types the simulation produced close to the frequency obtained from the real data. However, they mention that the word-position-based distribution under-predicted the frequency of prepositions being retraced: speakers retraced from preposition onsets more often than the model predicted. The predicted rate was itself high, probably because prepositions frequently begin phrases of only two words, but it still fell short of the observed rate. The authors also report that the model over-predicted for verbs, which they take to indicate that speakers prefer not to retrace from verb phrase onsets.
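The simulation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input format (a list of POS tags for the m words before the repair point, paired with the observed k), the function name simulate_onset_pos, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def simulate_onset_pos(utterances, seed=0):
    """Compare observed and simulated POS tags at the reparandum onset.

    utterances: list of (tags, k) pairs, where tags holds the POS tags of
    the m words before the repair point and k is the observed retrace length.
    Returns (observed, simulated) Counters keyed by POS tag.
    """
    rng = random.Random(seed)
    # Empirical distribution of k for each m, stored as lists of observed values.
    k_by_m = defaultdict(list)
    for tags, k in utterances:
        if not 1 <= k <= len(tags):
            raise ValueError("k must lie between 1 and m")
        k_by_m[len(tags)].append(k)

    observed, simulated = Counter(), Counter()
    for tags, k in utterances:
        m = len(tags)
        observed[tags[m - k]] += 1       # word the speaker actually retraced from
        k_sim = rng.choice(k_by_m[m])    # draw k from the empirical P(k | m)
        simulated[tags[m - k_sim]] += 1  # word the position-only model picks
    return observed, simulated


# Invented toy data with Penn-Treebank-style tags.
toy = [
    (["PRP", "VB", "IN", "DT"], 1),
    (["PRP", "VB", "IN", "DT"], 2),
    (["PRP"], 1),
]
observed, simulated = simulate_onset_pos(toy)
```

A tag whose observed count clearly exceeds its simulated count would be one the position-only model under-predicts, as reported for prepositions; the reverse pattern would correspond to the over-prediction reported for verbs.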

In summary, the purely quantitative model they proposed predicts retrace points with measurable success without any notion of syntactic constituency. The model, while simple, places certain constraints on any generative model of retracing, since an analysis based on word position alone reveals regularities that could not be gleaned by analysing syntactic category alone.
