Experimental set-up - Modelling Incremental Self-Repair Processing in Dialogue.

STIR is trained on the Switchboard training data described above and tested on the standard Switchboard test data (PTB III files 4[0-1]*), with partial words and punctuation removed from all files for fair comparison with other systems. To avoid over-fitting the classifiers to the basic language models, I use a cross-fold training approach: the corpus is divided into 10 folds, and language models trained on 9 folds are used to obtain feature values for the 10th fold, repeating for all 10. The Random Forest classifiers are then trained as standard on the resulting feature-annotated corpus. This cross-fold method resulted in better feature utility for n-grams and improved detection F-scores in all components on the order of 5-6%.⁸
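The cross-fold procedure can be sketched as follows. This is a minimal illustration, not the thesis implementation: the add-one-smoothed unigram model stands in for the actual n-gram language models, and the names `train_unigram_lm` and `cross_fold_features` are hypothetical. The point it shows is that every sentence receives feature values only from a model that never saw that sentence.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Toy stand-in for the language models: add-one smoothed unigram log-probabilities."""
    counts = Counter(word for sent in sentences for word in sent)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab_size))

    return logprob


def cross_fold_features(sentences, n_folds=10):
    """Score each sentence with a model trained on the other n_folds - 1 folds."""
    folds = [sentences[i::n_folds] for i in range(n_folds)]
    features = {}
    for held_out in range(n_folds):
        train = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        lm = train_unigram_lm(train)
        for j, sent in enumerate(folds[held_out]):
            original_index = held_out + j * n_folds  # undo the strided split
            features[original_index] = [lm(word) for word in sent]
    return [features[i] for i in range(len(sentences))]


# Hypothetical mini-corpus: "zebra" occurs only in the last sentence.
corpus = [["a", "b"]] * 19 + [["zebra"]]
feats = cross_fold_features(corpus)
```

In this toy corpus the word "zebra" is scored as unseen, because the model that scores it was trained without its fold. A model trained on the whole corpus would have given it a higher probability, which is the over-fitting effect the cross-fold approach avoids.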

Training the classifiers. Each Random Forest classifier was limited to 20 trees with a maximum depth of 4 nodes, putting a ceiling on decoding time. In making the classifiers cost-sensitive, MetaCost re-samples the data in accordance with the cost functions: I found that using 10 iterations over a re-sample of 25% of the training data gave the most effective trade-off between training time and accuracy. As Domingos (1999) demonstrated, there are only relatively small accuracy gains when using more than this, while training time increases on the order of the re-sample size. I use only one cost setting for ed, as changing this did not have a noticeable effect on results; however, I use 8 different cost functions in rp_start with differing costs for false negatives of the form below, where R is a repair onset and F is a fluent onset:

⁸ A similar approach was taken for Switchboard data in Zwarts and Johnson (2011) for training a re-ranker of repair analyses.

\[
\begin{array}{l|cc}
 & R_{hyp} & F_{hyp} \\ \hline
R_{gold} & 0 & 2 \\
F_{gold} & 1 & 0
\end{array}
\]
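To make the effect of such a cost matrix concrete, the sketch below applies the minimum-expected-cost rule that MetaCost (Domingos, 1999) uses when relabelling training examples. It shows only that single decision step, not the bagging and retraining around it. The function name `min_cost_label` and the probability values are illustrative assumptions.

```python
# Cost of each (gold, hypothesis) pair, taken from the matrix above:
# a missed repair onset (gold R, hypothesis F) costs 2, a false alarm costs 1.
COST = {("R", "R"): 0, ("R", "F"): 2, ("F", "R"): 1, ("F", "F"): 0}


def min_cost_label(p_repair, cost=COST):
    """Return the label with the lowest expected cost given P(repair onset)."""
    probs = {"R": p_repair, "F": 1.0 - p_repair}
    expected = {
        hyp: sum(probs[gold] * cost[(gold, hyp)] for gold in probs)
        for hyp in ("R", "F")
    }
    return min(expected, key=expected.get)


label_at_04 = min_cost_label(0.4)  # expected cost: R = 0.6, F = 0.8
label_at_03 = min_cost_label(0.3)  # expected cost: R = 0.7, F = 0.6
```

With a false-negative cost of 2 and a false-positive cost of 1, the decision threshold on P(R) moves from 1/2 down to 1/3, so the classifier trades precision for recall. Raising the false-negative cost further lowers the threshold again.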

I adopt a similar technique in rm_start, using 5 different cost functions, and in rp_end, using 8 different settings, which when combined with the 8 rp_start settings gives a total of 8 × 5 × 8 = 320 different cost function configurations. I hypothesise that the higher recall permitted in the pipeline's first components would result in better overall accuracy as these hypotheses become refined, though at the cost of less stable hypotheses over the sequence and extra downstream processing to prune false positives.
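The size of the configuration space can be checked by enumerating it. The sketch below identifies each cost function only by an index, since the individual cost values are not listed here; the ordering of the tuple is an assumption made for illustration.

```python
from itertools import product

N_RP_START = 8  # cost functions tried for the rp_start classifier
N_RM_START = 5  # cost functions tried for the rm_start classifier
N_RP_END = 8    # cost settings tried for the rp_end classifier

# Each configuration picks one setting per classifier, identified by index.
configurations = list(
    product(range(N_RP_START), range(N_RM_START), range(N_RP_END))
)
n_configurations = len(configurations)
```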

I also experiment with the number of repair hypotheses that can be added to the stack per word, with limits of 1-best, 2-best and 3-best hypotheses. I expect that allowing 2 or more hypotheses to be explored per rp_start should allow greater final accuracy, but at the expense of greater decoding and training complexity (theoretically this goes up from quadratic to cubic, as described above) and possible incremental instability in its output.

In addition to testing accuracy in the standard way, I wish to explore the trade-off between incremental performance and final accuracy that STIR can achieve, so I now describe the evaluation metrics I employ to measure this.

5.4.1 Incremental evaluation metrics

Following Baumann et al. (2011), I divide the evaluation metrics into similarity metrics (measures of equality with or similarity to a gold standard), timing metrics (measures of the timing of relevant phenomena detected from the gold standard) and diachronic metrics (evolution of incremental hypotheses over time).

Similarity metrics. For direct comparison to previous approaches I use the standard measure of overall accuracy, the F-score over reparandum words, which I abbreviate F_rm (see 5.8):

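\[
\text{precision} = \frac{\text{rm}_{\text{correct}}}{\text{rm}_{\text{hyp}}}, \qquad
\text{recall} = \frac{\text{rm}_{\text{correct}}}{\text{rm}_{\text{gold}}}, \qquad
F_{rm} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
\tag{5.8}
\]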
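As a worked instance of (5.8), the sketch below computes precision, recall and F_rm from two sets of word positions labelled as reparandum. Representing the annotations as sets of word indices is an assumption made for illustration, as is the function name `reparandum_f_score`.

```python
def reparandum_f_score(gold_rm, hyp_rm):
    """Precision, recall and F-score over reparandum words, following (5.8).

    gold_rm, hyp_rm: sets of word positions labelled rm in the gold standard
    and in the system output respectively.
    """
    correct = len(gold_rm & hyp_rm)
    precision = correct / len(hyp_rm) if hyp_rm else 0.0
    recall = correct / len(gold_rm) if gold_rm else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_rm = 2 * precision * recall / (precision + recall)
    return precision, recall, f_rm


# Hypothetical example: 3 gold reparandum words, 4 hypothesised, 2 in common.
precision, recall, f_rm = reparandum_f_score({1, 2, 7}, {1, 7, 9, 10})
```

Here precision is 2/4, recall is 2/3, and F_rm is 4/7 (about 0.571).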

I am also interested in repair structural classification, given the different functions possible in repair shown in the last chapter; I therefore also measure the F-score over all repair components (rm words, ed words as interregna and rp words), a metric I abbreviate F_s. This is not measured in standard repair detection on Switchboard. To investigate incremental accuracy I evaluate the delayed accuracy (DA) introduced by Zwarts et al. (2010), as described in Section 3.1.3, against the utterance-final gold-standard disfluency annotations of reparandum words, and use the mean of the 6 word F-scores.

Input                       Current repair labels    Edits
John
John likes                  rm rp                    (⊕rm) (⊕rp)
John likes uh               ed                       (⊖rm) (⊖rp) ⊕ed
John likes uh loves         rm ed rp                 ⊕rm ⊕rp
John likes uh loves Mary    rm ed rp

Figure 5.7: Edit Overhead - 4 unnecessary edits

Timing and resource metrics. Again for comparative purposes I use Zwarts et al.'s time-to-detection metrics, that is, the two average distances (in numbers of words) consumed before first detection of gold-standard repairs: one from rm_start, TD_rm, and one from rp_start, TD_rp. In STIR's 1-best stack setting, I know a priori, before evaluation, that TD_rp will be 1 token and that TD_rm will be 1 more than the average length of the rm_start-rp_start repair spans correctly detected. However, when I introduce a beam where multiple rm_starts are possible per rp_start, with the most likely hypothesis committed as the current output, the latency may begin to increase: the initially most probable hypothesis may not be the correct one. In addition to output timing metrics, I account for intrinsic processing complexity with the metric processing overhead (PO), which is the number of classifications made by all components per word of input.
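The two time-to-detection averages can be sketched as below. The record layout (word positions for `rm_start`, `rp_start` and `detected_at`) and the convention that the detecting word itself counts as one consumed token are assumptions chosen to match the 1-best case described above, where TD_rp is 1 and TD_rm is one more than the rm_start-rp_start span length.

```python
from statistics import mean


def time_to_detection(detected_repairs):
    """Average words consumed from rm_start and from rp_start until first detection.

    Each record holds word positions for one correctly detected gold repair:
    'rm_start', 'rp_start', and 'detected_at' (the word at which the repair
    was first output). The detecting word is counted, hence the + 1.
    """
    td_rm = mean(r["detected_at"] - r["rm_start"] + 1 for r in detected_repairs)
    td_rp = mean(r["detected_at"] - r["rp_start"] + 1 for r in detected_repairs)
    return td_rm, td_rp


# Hypothetical repairs: the first is detected at its rp_start word,
# the second one word later (as might happen with a beam).
repairs = [
    {"rm_start": 1, "rp_start": 3, "detected_at": 3},
    {"rm_start": 5, "rp_start": 6, "detected_at": 7},
]
td_rm, td_rp = time_to_detection(repairs)
```

For the first repair alone, TD_rp is 1 and TD_rm is 3, one more than the span of 2 words between rm_start and rp_start. The delayed second repair raises the average TD_rp to 1.5.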

Diachronic metrics. To measure stability of repair hypotheses over time I use Baumann et al. (2011)'s edit overhead (EO) metric. EO measures the proportion of edits (add, revoke, substitute) applied to a processor's output structure that are unnecessary. STIR's output is the repair label sequence shown in Figure 5.1; however, rather than evaluating its EO against the current gold
