Textual completion - Corpus annotation - Turn-Taking and Affirmative Cue Words in Task-Oriented


6.1.5 Textual completion

Several authors (Duncan, 1972; Sacks et al., 1974; Ford and Thompson, 1996; Wennerstrom and Siegel, 2003, inter alia) claim that some sort of completion independent of intonation and interactional import functions as a turn-yielding cue. Although some call this syntactic completion, all authors acknowledge the need for semantic and discourse information in judging utterance completion: “we judged an utterance to be syntactically complete if, in its discourse context, it could be interpreted as a complete clause” (Ford and Thompson, 1996, p. 143); “context could also influence coding decisions” (Wennerstrom and Siegel, 2003, p. 85). Therefore, we choose the more neutral term textual completion for this phenomenon.

In this section we describe how we manually annotated a portion of the corpus using a simple definition of textual completion. These data were subsequently used to train a machine learning (ML) classifier, with which we automatically labeled the whole Games Corpus. Finally, we present results relating both manual and automatic textual completion labels to turn-taking phenomena.

6.1.5.1 Manual labeling

In conversation, listeners judge textual completion incrementally and without access to future phrases. To simulate these conditions in the labeling task, annotators were asked to judge the textual completion of a turn up to a target pause. They had access only to the written transcript of the current turn up to the target pause and to the full previous turn by the other speaker (if any); nothing after the target pause was shown. These are a few sample tokens:

A: the lion’s left paw our front

B: yeah and it’s th- right so the

A: and then a tea kettle and then the wine

B: okay well I have the big shoe and the wine

A: —

B: okay there is a belt in the lower right a microphone in the lower left

A: so when you say directly above you really mean directly above the right arrow the the arrow the owl

B: the owl yeah

We selected 400 tokens at random from the Games Corpus. The target pauses were also chosen at random. To obtain a good coverage of the variation present in the corpus, tokens were selected in such a way that 100 of them were followed by speech from the same speaker (i.e., preceding a hold, or H), 100 by a backchannel from the other speaker (BC), 100 by a smooth switch to the other speaker (S), and 100 by a pause interruption by the other speaker (PI). Three annotators labeled each token independently as either complete or incomplete according to these guidelines:

Determine whether you believe what speaker B has said up to this point could constitute a complete response to what speaker A has said in the previous turn/segment.

Note: If there are no words by A, then B is beginning a new task, such as describing a card or the location of an object.

To avoid biasing the results, annotators were not given the turn-taking labels of the tokens. Inter-annotator reliability, measured by Fleiss’ κ, is 0.8144, which corresponds to the ‘almost perfect’ agreement category. The mean pairwise agreement between the three annotators is 90.8%. For the cases in which the three annotators disagree, we adopt the majority label as our gold standard; that is, the label chosen by two annotators.
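As an illustration of the agreement measures mentioned above, the following is a minimal sketch of Fleiss’ κ and of majority-label selection. The function names and the four toy tokens are hypothetical and are not taken from the Games Corpus annotations; the sketch assumes every token was labeled by the same number of annotators.

```python
from collections import Counter


def fleiss_kappa(label_rows):
    """Fleiss' kappa for rows of labels (one row per token, same number of annotators per row)."""
    n_items = len(label_rows)
    n_raters = len(label_rows[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for row in label_rows:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of annotator pairs that agree on this token.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))
    observed = agreement_sum / n_items  # equals the mean pairwise agreement
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    return (observed - expected) / (1 - expected)


def majority_label(row):
    """Label chosen by most annotators (with three annotators and two labels there are no ties)."""
    return Counter(row).most_common(1)[0][0]


# Hypothetical toy annotations: four tokens, three annotators each.
toy_rows = [
    ["complete", "complete", "complete"],
    ["complete", "complete", "incomplete"],
    ["incomplete", "incomplete", "incomplete"],
    ["incomplete", "incomplete", "incomplete"],
]
toy_kappa = fleiss_kappa(toy_rows)
toy_gold = [majority_label(row) for row in toy_rows]
print(round(toy_kappa, 3), toy_gold)
```

In this formulation the observed-agreement term is the mean proportion of agreeing annotator pairs per token, so the same loop yields both the mean pairwise agreement and the chance-corrected κ.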

6.1.5.2 Automatic classification

Next, we train a machine learning model using the 400 manually annotated tokens as training data, to automatically classify all IPUs in the corpus as either complete or incomplete. For each IPU we extract a number of lexical and syntactic features from the current turn up to the IPU itself:

• lexical identity of the IPU-final word (w);

• POS tag of w;

• POS tags of the IPU-final bigram;

• simplified POS tags of the IPU-final bigram;

• number of words in the IPU;

• a binary flag indicating if w is a word fragment;

• size and type of the biggest (bp) and smallest (sp) phrase that end in w;

• binary flags indicating if each of bp and sp is a major phrase (NP, VP, PP, ADJP, ADVP);

• binary flags indicating if w is the head of each of bp and sp.

We choose these features in order to capture as much lexical and syntactic information as possible from the transcripts. The motivation for lexical identity and part-of-speech features is that complete utterances are unlikely to end in expressions such as the or but there, and more likely to finish in nouns, for example. Since fragments indicate almost by definition that the utterance is incomplete, we also include a flag indicating if the final word is a fragment. As for the syntactic features, our intuition is that the boundaries of textually complete utterances tend to occur between large syntactic phrases — a similar approach is used by Koehn et al. (2000) for predicting intonational phrase boundaries in raw text. The syntactic features are computed using two different parsers: Collins (Collins, 2003), a high-performance statistical parser; and CASS (Abney, 1996), a partial parser especially designed for use with noisy text.
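The lexical part of the feature set can be sketched as follows. This is only an illustration under stated assumptions: the input is assumed to be a list of (word, POS tag) pairs for one IPU, word fragments are assumed to be transcribed with a trailing hyphen (as in "th-" in the sample tokens), and the "simplified" POS tag is assumed here to be the first two characters of the tag, which the source does not specify. The function names and the example tags are hypothetical, and the parser-based features are omitted.

```python
def simplify_tag(tag):
    """Hypothetical tag simplification: keep the first two characters (e.g. NNS -> NN)."""
    return tag[:2]


def lexical_features(ipu):
    """Lexical features of one IPU given as a non-empty list of (word, pos_tag) pairs."""
    final_word, final_tag = ipu[-1]
    # Use a boundary symbol when the IPU has a single word and no preceding tag.
    prev_tag = ipu[-2][1] if len(ipu) > 1 else "<s>"
    return {
        "final_word": final_word,
        "final_pos": final_tag,
        "pos_bigram": (prev_tag, final_tag),
        "simple_pos_bigram": (simplify_tag(prev_tag), simplify_tag(final_tag)),
        "n_words": len(ipu),
        # Assumption: fragments carry a trailing hyphen in the transcript.
        "final_is_fragment": final_word.endswith("-"),
    }


# Hypothetical tagged IPU based on one of the sample tokens.
example = [("and", "CC"), ("then", "RB"), ("the", "DT"), ("wine", "NN")]
features = lexical_features(example)
print(features)
```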

We experiment with several learners, including the propositional rule learner Ripper (Cohen, 1995), the decision tree learner C4.5 (Quinlan, 1993), Bayesian networks (Heckerman et al., 1995; Jensen, 1996) and support vector machines (SVM) (Vapnik, 1995; Cortes and Vapnik, 1995). We use the implementation of these algorithms provided in the Weka machine learning toolkit (Witten and Frank, 2000). Table 6.5 shows the accuracy of the majority-class baseline and of each classifier, using 10-fold cross validation on the 400 training data points, and the mean pairwise agreement by the three human labelers. The linear-kernel SVM classifier achieves the highest accuracy, significantly outperforming the majority-class baseline, and approaching the mean agreement of human labelers. However, there is still room for further improvement. New approaches could include features capturing information from the previous turn by the other speaker, which was available to the human labelers but not to the ML classifiers. Also, the sequential nature of this classification task might be better exploited by more advanced graphical learning algorithms, such as Hidden Markov Models (HMM; Rabiner, 1989) and Conditional Random Fields (CRF; Lafferty et al., 2001).

Classifier                          Accuracy
Majority-class (‘complete’)         55.2%
C4.5                                55.2%
Ripper                              68.2%
Bayesian networks                   75.7%
SVM, RBF kernel                     78.2%
SVM, linear kernel                  80.0%
Human labelers (mean agreement)     90.8%

Table 6.5: Mean accuracy of each classifier for the textual completion labeling task, using 10-fold cross validation on the training data.
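To make the evaluation protocol concrete, here is a minimal sketch of k-fold cross validation applied to a majority-class baseline. It is not the Weka procedure used in this work: the folds are contiguous and unshuffled for simplicity (a real run would typically shuffle or stratify), and the function names and toy labels are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_baseline_cv(labels, k=10):
    """Mean held-out accuracy of always predicting the training folds' majority label."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train = [lab for i, lab in enumerate(labels) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        correct = sum(labels[i] == majority for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Hypothetical toy labels: 12 'complete' and 8 'incomplete' tokens.
toy_labels = ["complete", "incomplete"] * 8 + ["complete"] * 4
baseline_accuracy = majority_baseline_cv(toy_labels, k=10)
print(baseline_accuracy)
```

Any trained classifier can be evaluated with the same loop by replacing the majority-label step with model fitting on the training folds and prediction on the held-out fold.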

6.1.5.3 Results

First we examine the 400 tokens that were manually labeled by three human annotators, considering the majority label as the gold standard. Of the 100 tokens followed by a smooth switch, 91 were labeled textually complete, an overwhelming proportion compared to those followed by a hold (42%). A chi-square test reports that this distribution departs significantly from random (χ2 = 51.7, d.f. = 1, p ≈ 0), suggesting that textual completion as defined earlier in this section constitutes a necessary, but not sufficient, turn-yielding cue.

The analysis of tokens automatically annotated for textual completion provides additional support for this hypothesis. We used the highest performing classifier, the linear-kernel SVM, to label all IPUs in the corpus. Of the 3246 IPUs preceding a smooth switch, 2649 (81.6%) were labeled textually complete; while just about half of all IPUs preceding a hold (4272/8123, or 52.6%) were labeled complete. These numbers depart significantly from a random distribution (χ2 = 818.7, d.f. = 1, p ≈ 0), confirming the predominance of textual completion before smooth switches.
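The two chi-square tests in this section compare the counts of complete and incomplete tokens before smooth switches and before holds, which form a 2×2 contingency table. The sketch below computes the statistic for such a table. The source does not say whether a continuity correction was applied; the sketch assumes Yates’ correction, under which the statistics come out close to the reported values. The function name is hypothetical, and the hold count of 42 is derived from the reported 42% of 100 tokens.

```python
def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square statistic (1 d.f.) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction (an assumption, not stated in the source).
        diff = max(0.0, diff - n / 2)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * diff ** 2 / denom


# Rows: smooth switch, hold. Columns: complete, incomplete.
manual = chi_square_2x2(91, 100 - 91, 42, 100 - 42)
automatic = chi_square_2x2(2649, 3246 - 2649, 4272, 8123 - 4272)
print(round(manual, 1), round(automatic, 1))  # 51.7 818.7
```

Without the correction (`yates=False`) the statistics are somewhat larger, so the choice affects the reported figure but not the conclusion at these sample sizes.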

Speaker variation: To investigate speaker variation for the textual completion cue, we compute the proportion of complete IPUs preceding smooth switches (S) and holds (H) for each speaker. Across speakers, the proportion before S ranges from 71.4% to 88.5%, and before H, from 46.5% to 60.9%, indicating that our general findings are valid across speakers. Detailed results for each speaker are provided in Appendix E.1.
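The per-speaker breakdown amounts to grouping IPUs by speaker and by the transition type that follows them, then taking the proportion labeled complete in each group. A minimal sketch, with a hypothetical record layout (speaker, transition, is_complete) and made-up toy records:

```python
from collections import defaultdict


def completion_rates(ipus):
    """Proportion of complete IPUs for each (speaker, transition) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [n_complete, n_total]
    for speaker, transition, is_complete in ipus:
        cell = counts[(speaker, transition)]
        cell[0] += int(is_complete)
        cell[1] += 1
    return {key: complete / total for key, (complete, total) in counts.items()}


# Hypothetical toy records, not corpus data.
toy_ipus = [
    ("spk1", "S", True), ("spk1", "S", True), ("spk1", "S", False),
    ("spk1", "H", True), ("spk1", "H", False),
    ("spk2", "S", True), ("spk2", "H", False),
]
rates = completion_rates(toy_ipus)
print(rates)
```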

Summary of findings: We provide a definition of textual completion, as well as a procedure for manual annotation that achieves a high inter-labeler agreement rate. Subsequently, we show how a relatively small manually labeled data set may be used to train an ML classifier that approaches human performance. When examining both manually and automatically labeled data, we find that textual completion seems to work almost as a necessary condition before smooth switches, but not before holds. A possible interpretation is that textual completion functions as a turn-yielding cue, with listeners more likely to take the speaking turn after completion points.

In document Turn-Taking and Affirmative Cue Words in Task-Oriented Dialogue (Page 59-64)