
Because our methods first reduce a sentence to an impoverished, uninflected form, we can both train the system and evaluate its performance by applying it to a corpus collected from a general population. We generated reduced sentences using the scheme described in the first step in §8.1. We then measured the system’s ability to recover the original articles, modals, and prepositions, and to produce correct inflectional forms for both nouns and verbs. This set-up provides an idealized situation: the error classes in the data are essentially restricted to the classes we model. Thus we are able to measure the effects of the reranking algorithms in a controlled fashion.

8.3.1 Data Source

The training set consists of 33,516 transcripts of utterances, produced by callers in spoken dialogues with the Mercury flight domain (Seneff, 2002). The development set consists of 317 sentences from the same domain, and is used for tuning word transition weights for the n-gram model. The test set consists of 1000 sentences, also from the same domain. These utterances are all at least four words long and have an average length of 7.4 words. None of them are present in the training set. They serve as our “gold-standard” in the automatic evaluation, although we recognize that some of the users may be non-native speakers.

8.3.2 Overgeneration Algorithm

Starting with a backbone lattice consisting of the reduced input sentence, the following operations on the lattice are allowed:

Free Insertions. Articles, prepositions, auxiliaries and modals are allowed to be inserted anywhere. It is possible, based on a shallow analysis of the sentence, to limit the possible points of insertion. For example, insertion of articles may be restricted to the beginning of noun phrases. However, at least in this restricted domain, these constraints were observed to have a negligible effect on the final performance.

Noun/Verb Inflections. The nouns and verbs, appearing in their uninflected forms in the reduced input, can be substituted by any of their inflected forms. Of the nouns specified in the Mercury grammar, 92 unique nouns appear in the test set, occurring 561 times; there are 79 unique verbs, occurring 716 times.
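The two lattice operations above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the word lists, the inflection table, and the names `build_lattice` and `expand` are hypothetical, and the sketch allows at most one inserted word per gap, whereas the text places no such limit.

```python
from itertools import product

# Hypothetical, tiny inventories for illustration only; the real word
# lists and inflection tables would come from the Mercury grammar.
FUNCTION_WORDS = ["the", "a", "to", "from", "does", "would"]
INFLECTIONS = {
    "flight": ["flight", "flights"],
    "leave": ["leave", "leaves", "left", "leaving"],
}

def build_lattice(reduced_words, function_words=FUNCTION_WORDS,
                  inflections=INFLECTIONS):
    """Return a list of slots; each slot lists alternative word tuples.

    Insertion slots (before every word and after the last one) hold the
    empty tuple plus every single function word. Word slots hold the
    inflected forms of the backbone word (or the word itself if unknown).
    """
    insertion_slot = [()] + [(w,) for w in function_words]
    lattice = [insertion_slot]
    for word in reduced_words:
        lattice.append([(form,) for form in inflections.get(word, [word])])
        lattice.append(insertion_slot)
    return lattice

def expand(lattice):
    """Enumerate every candidate sentence encoded by the lattice."""
    for choice in product(*lattice):
        yield " ".join(word for part in choice for word in part)

candidates = list(expand(build_lattice(["flight", "leave"])))
print(len(candidates))  # 7 * 2 * 7 * 4 * 7 = 2744 candidates
```

Even this two-word backbone yields thousands of candidates, which is why the candidates are kept in a lattice and scored by a language model rather than enumerated exhaustively.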

8.3.3 Reranking Strategies

The following four reranking strategies are contrasted. Details about the three parsing models can be found in §8.2.

Trigram is a word trigram language model, trained on the sentences in the training set.

Parsedomain is the flight-domain context-free grammar, trained on the same sentences. From the 10-best list produced by the trigram language model, the candidate that yields the highest-scoring parse from this grammar is selected. If no parse tree is obtained, we default to the highest-scoring trigram hypothesis.

Parsegeneric is a syntax-oriented, generic grammar. The re-ranking procedure is the same as for Parsedomain.

Reranker           Noun/Verb    Auxiliary/Preposition/Article
(# utterances)     Accuracy     Precision   Recall   F-score
Trigram            89.4         0.790       0.636    0.705
Parsegeneric       90.1         0.819       0.708    0.759
Parsegeneric+geo   89.3         0.823       0.723    0.770
Parsedomain        90.8         0.829       0.730    0.776

Table 8.2: Experimental results, broken down into the various error classes.
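As a consistency check on Table 8.2, the F-score column can be recomputed from the precision and recall columns, assuming the standard balanced F-score (the harmonic mean of precision and recall). The function name `f_score` and the dictionary layout are illustrative choices, not part of the original system.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from Table 8.2.
TABLE_8_2 = {
    "Trigram": (0.790, 0.636),
    "Parsegeneric": (0.819, 0.708),
    "Parsegeneric+geo": (0.823, 0.723),
    "Parsedomain": (0.829, 0.730),
}

for name, (p, r) in TABLE_8_2.items():
    print(f"{name:18s} F = {f_score(p, r):.3f}")
```

Under this assumption the recomputed values round to the F-scores listed in the table (0.705, 0.759, 0.770 and 0.776).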

Parsegeneric+geo is the same as Parsegeneric, except it is augmented with a set of proper noun classes encoding geographical knowledge, including the names of the same cities, states, countries, airports and airlines that are included in Parsedomain. Again, the re-ranking procedure is the same as for Parsedomain.
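The re-ranking procedure shared by the three Parse strategies can be sketched as follows. This is a minimal sketch under stated assumptions: `rerank` and `parse_score` are hypothetical names, the parser is modelled as any function that returns a score or `None` when no parse tree is obtained, and the toy scores are invented for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

def rerank(nbest: Sequence[Tuple[str, float]],
           parse_score: Callable[[str], Optional[float]]) -> str:
    """Pick one sentence from an n-best list of (sentence, trigram_score).

    The candidate with the highest parse score wins. If no candidate
    parses, fall back to the highest-scoring trigram hypothesis.
    """
    scored = [(parse_score(sentence), sentence) for sentence, _ in nbest]
    parsed = [(score, sentence) for score, sentence in scored
              if score is not None]
    if parsed:
        return max(parsed, key=lambda pair: pair[0])[1]
    return max(nbest, key=lambda pair: pair[1])[0]

# Toy example: invented log-scores, and a stand-in "parser" that only
# accepts one of the candidates.
nbest = [
    ("when does the next flight to moscow", -10.0),
    ("when does the next flight leave for moscow", -12.0),
    ("when the next flight leave moscow", -15.0),
]
toy_parser = {"when does the next flight leave for moscow": -3.0}.get
print(rerank(nbest, toy_parser))
```

Here the trigram model prefers the first candidate, but the stand-in parser rejects it, so the second candidate is selected.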

8.3.4 Results

The performances of the various re-ranking models are shown in Table 8.2. Compared with Trigram, reranking with all three Parse models resulted in improved precision and recall for the insertion of articles and prepositions. However, the accuracy in predicting nouns and verbs did not vary significantly. Overall, the quality of the output sentences of all three Parse models is statistically significantly superior to that of Trigram.² The differences in performance among the three Parse models are not statistically significant.³
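The significance comparison rests on per-sentence word error rate and McNemar's test. The sketch below shows one way to compute both. It assumes the exact (binomial) form of McNemar's test over the sentences on which exactly one model is better; the source does not say which variant was used. The function names are illustrative.

```python
from math import comb

def word_errors(reference, hypothesis):
    """Minimum number of word insertions, substitutions and deletions
    needed to turn the hypothesis into the reference (edit distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution/match
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate relative to the reference length."""
    return word_errors(reference, hypothesis) / len(reference.split())

def mcnemar_exact(a_better, b_better):
    """Two-sided exact McNemar p-value from the discordant counts:
    sentences where only model A, or only model B, has the lower WER."""
    n = a_better + b_better
    if n == 0:
        return 1.0
    k = min(a_better, b_better)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

reference = "when does the delta flight leave atlanta"
hypothesis = "when does the delta flight leave from atlanta"
print(word_errors(reference, hypothesis))  # one inserted word
```

Sentences on which both models obtain the same WER do not enter the test; only the discordant counts matter.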

The gains were due mostly to two reasons. The parser was able to reject candidate sentences that do not parse, such as “*When does the next flight to Moscow?”. Trigrams are unable to detect long-distance dependencies, in this case the absence of a verb following the noun phrase “the next flight”. Furthermore, given candidates that are syntactically valid, the parser was able to assign lower scores to those that are disfluent, such as “*I would like that 3:55pm flights” and “*What cities do you know about Louisiana?”. Parsedomain in fact rejects this last sentence based on constraints associated with trace movements.

Among the errors in the Parsedomain model, many contain parse tree patterns that are not well modelled in the probabilities in the grammar. Consider the ill-formed sentence “*I would like a flight 324.”, whose parse tree is partially shown in Figure 8-2. Well-formed sentences such as “I would like flight 324” or “I would like a flight” have similar parse trees. The only “unusual” combination of nodes is thus the co-occurrence of the indef and flight number nodes, which is missed by the spatial-temporal “node”-trigram model used in Tina. The grammar could likely be reconfigured to capture the distinction between a generic flight and a specific flight. Such combinations can perhaps also be expressed as features in a further reranking step similar to the one used in (Collins et al., 2005).

²The quality is measured by word error rate (WER), counting insertions, substitutions and deletions, with respect to the original transcript. For each test sentence, we consider one model to be better if the WER of its output is lower than that of the other model. All three Parse models significantly outperformed Trigram at p < 10⁻¹², using McNemar’s test.

³By McNemar’s test, Parsedomain compared with Parsegeneric+geo and Parsegeneric yield p = 0.495 ...

[Parse tree diagram not reproduced; the recoverable node labels include dir obj, indef, flight and flight number.]

Figure 8-2: Parse tree for the ill-formed sentence “*I would like a flight 324” in Parsedomain. The co-occurrence of indef and flight number nodes should be recognized to be a clue for ungrammaticality.

8.3.5 Multiple Correct Answers

In the evaluation above, the original transcript was considered as the sole “gold standard”. Just as in machine translation, however, there are often multiple valid corrections for one sentence, and so the performance level may be underestimated. To better gauge the performance, we had previously conducted a human evaluation on the output of the Parsedomain model, focusing on the subset that was fully parsed by Tina. Using a procedure similar to §8.3, we harvested all output sentences from Parsedomain that differed from the gold standard. Four native English speakers, not involved in this research, were given the ill-formed input, and were asked to compare the corresponding transcript (the “gold-standard”) and the Parsedomain output, without knowing their identities. They were asked to decide whether the two are of the same quality, or whether one of the two is better. An example is shown in Table 8.3.

Reduced input:   when delta flight leave atlanta
Correction 1:    when does the delta flight leave atlanta
Correction 2:    when does the delta flight leave from atlanta

Table 8.3: A sample entry in the human evaluation. Correction 1 is the transcript, and correction 2 is the Parsedomain output. Both evaluators judged these to be equally good. In the automatic evaluation, the Parsedomain output was penalized for the insertion of “from”.

To measure the extent to which the Parsedomain output is distinguishable from the transcript, we interpreted their evaluations in two categories: (1) category OK, when the Parsedomain output is as good as, or better than, the transcript; and (2) category Worse, when the transcript is better. As shown in Table 8.4, the human judges exhibited “substantial agreement” according to the kappa scale defined in (Landis & Koch, 1977). Of the 317 sentences, 218 were judged to be at least as good as the original transcript. Hence, the corrections in up to two-thirds of the sentences currently deemed incorrect may in fact be acceptable.

Test Set 1    OK     Worse        Test Set 2    OK     Worse
OK            101    6            OK            91     3
Worse         15     45           Worse         25     41

Table 8.4: Agreement in the human evaluation. The test set was randomly split into two halves, Test Set 1 and Test Set 2. Two human judges evaluated Test Set 1, with kappa = 0.72. Two others evaluated Test Set 2, with kappa = 0.63. Both kappa values correspond to “substantial agreement” as defined in Landis & Koch (1977).
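The agreement figures can be recomputed from the counts in Table 8.4, assuming the statistic is Cohen's kappa for two judges and reading each 2×2 block as one judge's labels in the rows and the other's in the columns. The row/column assignment is an assumption, but kappa does not depend on it. The function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table of counts.

    table[i][j] is the number of items that one judge put in category i
    and the other judge put in category j.
    """
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance, from the two judges' marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Counts copied from Table 8.4 (category order: OK, Worse).
TEST_SET_1 = [[101, 6], [15, 45]]
TEST_SET_2 = [[91, 3], [25, 41]]

print(f"Test Set 1: kappa = {cohen_kappa(TEST_SET_1):.2f}")
print(f"Test Set 2: kappa = {cohen_kappa(TEST_SET_2):.2f}")
```

On these counts the sketch gives about 0.72 for Test Set 1, matching the reported value, and about 0.62 for Test Set 2, close to the reported 0.63. Both fall in the 0.61 to 0.80 band that Landis & Koch label “substantial agreement”.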