
5.6 Parser Post-Processor

5.6.1 Parsing Evaluation

Finally, we can now put the rebracketed NPs back into the parser output and re-evaluate.

This requires the additional task of labelling the brackets. There are only two labels to distinguish between (NML and JJP), and they can be inferred directly from the POS tag of the head. If it is a verb or an adjective, we label the node as JJP, and otherwise it is an NML. A small number of errors (0.42% drop in matched bracket F-score) are introduced by this method, because of errors in the Penn Treebank POS tags and in our annotation, as well as errors in head finding.
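The labelling rule can be sketched as follows. This is a minimal illustration rather than the implementation used in our experiments: the function name `label_bracket` is hypothetical, and we assume that Penn Treebank verb tags are exactly those beginning with `VB` and adjective tags those beginning with `JJ`.

```python
def label_bracket(head_pos: str) -> str:
    """Return the label for a new bracket given the POS tag of its head.

    Assumption: verbs are the Penn Treebank tags starting with "VB"
    (VB, VBD, VBG, VBN, VBP, VBZ) and adjectives those starting with
    "JJ" (JJ, JJR, JJS). Any other head tag yields an NML node.
    """
    if head_pos.startswith("VB") or head_pos.startswith("JJ"):
        return "JJP"
    return "NML"


if __name__ == "__main__":
    for tag in ("NN", "NNP", "JJ", "VBN"):
        print(tag, "->", label_bracket(tag))
```

Because the label depends only on the head's POS tag, any tagging or head-finding error propagates directly to the label, which is the source of the small F-score drop noted above.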

Tables 5.16 and 5.17 show the final results. A suggestion result is not shown for all brackets, since the suggestions only apply to NML and JJP nodes and it is difficult to post-process the parser’s output with them. The post-processor outperforms the parser by 9.04% and 8.10% in NML and JJP bracket F-score on the development and test data respectively. The post-processor has also improved on the suggestion baseline established earlier. These results demonstrate the effectiveness of large-scale NP bracketing techniques, and show that internal NP structure can be recovered with better performance than was previously possible.
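As a quick consistency check, the F-scores on the NML and JJP rows of Table 5.17 can be recomputed from the reported precision and recall, assuming the usual balanced F-score (the harmonic mean of precision and recall). The function name `f_score` and the dictionary layout are ours, chosen for illustration only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for NML and JJP brackets on the test data, Table 5.17.
nml_jjp_test = {
    "Suggestions": (94.29, 56.81),
    "Parser": (80.06, 63.70),
    "Post-processor": (79.44, 78.67),
}

if __name__ == "__main__":
    scores = {name: f_score(p, r) for name, (p, r) in nml_jjp_test.items()}
    for name, f in scores.items():
        print(f"{name:15s} F = {f:.2f}")
    gain = scores["Post-processor"] - scores["Parser"]
    print(f"Post-processor gain over parser: {gain:.2f}")
```

The recomputed values agree with the table to two decimal places, and the gap between the post-processor and the parser matches the 8.10% test-set improvement reported above.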

We also measure statistical significance using a computer-intensive, randomised, stratified shuffling technique, as described in Noreen (1989) and Cohen (1995, §5.3). The null hypothesis, that the results are produced by the same model, is tested by swapping the scores on individual sentences between the two models. These swaps are performed repeatedly, with precision, recall and F-score recalculated for each model at each iteration. A count is kept of how many times the difference between these recalculated metrics is greater than or equal to the difference between the original figures. The null hypothesis is rejected if this number is sufficiently low.

                              P      R      F
NML JJP       Suggestions     94.29  56.81  70.90
              Parser          80.06  63.70  70.95
              Post-processor  79.44  78.67  79.05
All brackets  Parser          88.30  87.80  88.05
              Post-processor  88.23  88.24  88.23

Table 5.17: Test data performance

Ideally, all possible permutations would be performed, but this is infeasible, as it would require testing 2ⁿ permutations, where n is the number of sentences (2,416). Instead, an approximate randomised test can be used with a sufficiently large number of iterations, in this case 10,000. The p-value is then calculated as:

\[ p = \frac{c + 1}{n + 1} \tag{5.11} \]

where c is the number of random swaps that resulted in a difference greater than or equal to the original difference, and n is the number of iterations performed.

The p-value on the test data all-brackets F-score is 0.0001. This is the smallest p-value attainable for the number of iterations we performed. That is, there were no iterations whatsoever where randomly swapped figures resulted in an F-score difference that was as large as the original, unswapped permutation. The p-value for the recall measure gave the same result. For precision, where the parser is actually superior to the post-processor, we calculated a p-value of 0.0163. Thus, this difference is also statistically significant, although less so than the recall and F-score metrics.
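The procedure can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the code used for the reported figures: we assume each sentence contributes a triple of counts (matched, proposed and gold brackets), that each sentence's pair of results is swapped with probability 0.5, and that the absolute difference in F-score is compared. All names (`Counts`, `f_score`, `randomisation_p_value`) are hypothetical.

```python
import random
from typing import List, Tuple

# Per-sentence counts: (matched brackets, proposed brackets, gold brackets).
Counts = Tuple[int, int, int]


def f_score(counts: List[Counts]) -> float:
    """Corpus-level F-score from summed per-sentence bracket counts."""
    matched = sum(c[0] for c in counts)
    proposed = sum(c[1] for c in counts)
    gold = sum(c[2] for c in counts)
    if matched == 0:
        return 0.0
    precision = matched / proposed
    recall = matched / gold
    return 2 * precision * recall / (precision + recall)


def randomisation_p_value(a: List[Counts], b: List[Counts],
                          iterations: int = 10_000, seed: int = 0) -> float:
    """Approximate randomisation test on the F-score difference of two models."""
    rng = random.Random(seed)
    original = abs(f_score(a) - f_score(b))
    c = 0  # shuffles whose difference is at least as large as the original
    for _ in range(iterations):
        shuffled_a, shuffled_b = [], []
        for sent_a, sent_b in zip(a, b):
            # Swap the two models' results on this sentence half of the time.
            if rng.random() < 0.5:
                sent_a, sent_b = sent_b, sent_a
            shuffled_a.append(sent_a)
            shuffled_b.append(sent_b)
        if abs(f_score(shuffled_a) - f_score(shuffled_b)) >= original:
            c += 1
    # Equation 5.11, with n the number of iterations.
    return (c + 1) / (iterations + 1)
```

With c = 0 and 10,000 iterations, the formula gives 1/10,001, or approximately 0.0001, which is why that value is the smallest p-value attainable here. The same routine applies to precision or recall by substituting the metric function.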

5.7 Summary

We have created the first large-scale supervised NP bracketing models that achieve excellent results. These experiments are also the first to scale effectively to complex NPs, attaining similarly high levels of performance. We expect that the data and models described in this chapter will provide the impetus for much more work on NP bracketing in the future.


One particularly important contribution of this chapter is the data sets that we have created. These data sets are orders of magnitude larger than those used previously, and have made possible the wide range of experiments we carried out. The final result of this chapter, where our post-processor outperforms the Collins (2003) parser, is another of the major contributions of this thesis. We previously observed the tremendous difficulty in bracketing NPs, demonstrated by the below-baseline performance of the parser in Chapter 4. We have now overcome this difficulty and outperformed the suggestion baseline.

Chapter 6

Parsing with CCG

Although the NP bracketing system was successful, using a post-processor is hardly an elegant solution. A better solution would be to incorporate the NP model into a standard parsing model, as this would allow NP structure to be optimised together with the entire sentence. In this chapter, we will make such an addition to the C&C CCG parser (Clark and Curran, 2007b). There are a number of advantages to this approach:

• The C&C parser uses a maximum entropy model, which will make it relatively easy to add NP-based features, compared to the Collins (2003) models.

• It will allow us to correct the errors in the CCG corpus, CCGbank, which we described in Section 2.5.1.

• Utilising a second parser will demonstrate that NP structure is recoverable across multiple parsing architectures.

6.1 The C&C Parser

In Section 2.5.2, we described the C&C parser (Clark and Curran, 2007b). Here, we will describe the features used by the parser in its maximum entropy model. This will be relevant when we describe the novel features we have added in Section 6.5.

Firstly, the model uses a lexical feature that combines the word and its lexical category. Another feature generalises the word to its POS tag. There are also features that are only active for the root constituent of the sentence. These are the root category; the root category and its head word; and the root category and its head word’s POS tag. Another feature that applies to all non-terminal constituents is the rule that was applied to generate it. Once again, this feature is also expanded to the head word of the constituent, and generalised to the head word’s POS tag.

All of the above features are identical in both the normal-form and dependency models. However, for the following features, the former uses local rule applications, such as S[dcl] → NP S[dcl]\NP, while the latter uses the predicate-argument dependencies, like the one that will be described in Section 6.2.1. The dependency model can thus include information from long-range dependencies.

For any non-terminal constituent, the head words of the child nodes form a feature. The rule or dependency involved is also included, for the normal-form and dependency models respectively. This feature is also generalised to the first word’s POS tag; the second word’s POS tag; and both the first and second words’ POS tags.
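The head-word pair feature and its three generalisations can be illustrated with the sketch below. The string encoding of features, the `Head` type and the function name `head_pair_features` are hypothetical; they are not the C&C parser's internal representation.

```python
from typing import List, NamedTuple


class Head(NamedTuple):
    """Head word of a child constituent, with its POS tag."""
    word: str
    pos: str


def head_pair_features(first: Head, second: Head, rule: str) -> List[str]:
    """Return the word-word feature and its three POS-tag generalisations.

    `rule` stands for the rule (normal-form model) or the dependency
    (dependency model) involved; its string form here is illustrative.
    """
    return [
        f"{rule}|{first.word}|{second.word}",  # both head words
        f"{rule}|{first.pos}|{second.word}",   # first word -> POS tag
        f"{rule}|{first.word}|{second.pos}",   # second word -> POS tag
        f"{rule}|{first.pos}|{second.pos}",    # both words -> POS tags
    ]


if __name__ == "__main__":
    rule = "S[dcl] -> NP S[dcl]\\NP"
    for feature in head_pair_features(Head("shares", "NNS"),
                                      Head("rose", "VBD"), rule):
        print(feature)
```

Each generalisation trades lexical specificity for coverage, so a word pair unseen in training can still activate the POS-based variants.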

The last feature group adds distance information to the models. The number of words, verbs and punctuation marks between the head words of the child constituents are counted. All counts two or greater are subsumed into a single class. The feature also includes the parent constituent’s head word, and the rule or dependency that applies.
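A minimal sketch of the distance features follows. Several details are assumptions made for illustration: the punctuation tag set, the treatment of every intervening token as a "word", the identification of verbs by tags beginning with `VB`, and the feature string format. The names `PUNCT_TAGS`, `bin_count` and `distance_features` are hypothetical.

```python
from typing import List, Tuple

# Assumed set of Penn Treebank punctuation tags.
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}


def bin_count(n: int) -> str:
    """Map a count to one of three classes: 0, 1, or 2 and above."""
    return str(n) if n < 2 else "2+"


def distance_features(tagged: List[Tuple[str, str]], left: int, right: int,
                      parent_head: str, rule: str) -> List[str]:
    """Binned counts of words, verbs and punctuation between two head words.

    `tagged` is the sentence as (word, POS) pairs; `left` and `right` are
    the indices of the child constituents' head words.
    """
    lo, hi = sorted((left, right))
    between = tagged[lo + 1:hi]  # tokens strictly between the two heads
    words = len(between)
    verbs = sum(1 for _, pos in between if pos.startswith("VB"))
    punct = sum(1 for _, pos in between if pos in PUNCT_TAGS)
    return [
        f"{rule}|{parent_head}|words={bin_count(words)}",
        f"{rule}|{parent_head}|verbs={bin_count(verbs)}",
        f"{rule}|{parent_head}|punct={bin_count(punct)}",
    ]
```

For example, with four tokens between the heads, one of them a verb and two of them commas, the three features fall into the classes 2+, 1 and 2+ respectively.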

Most of these features will be described again in Section 6.5.1, where we give examples of the original features compared to their generalisations that use named entity tags.
