Full Integration - Null Element Restoration

2.2 Parsing

2.2.2 Full Integration

In this approach there is no trace tagger, and thus the parser is not informed of the location of null elements. The authors try both unlexicalized and lexicalized parsers. In the unlexicalized case, they use a parser of their own, while in the lexicalized case they extend Model 3 from Collins (1999) with the idea of a “gapcat” frame analogous to the subcategorization frames already used by the parser.

The gapcat frames work as follows.² If a node should have a gap associated with it, that gap is part of its gapcat frame set. This set is treated analogously to the subcategorization frame, with one difference: subcategorization frames have elements discharged whenever a complement modifier is generated, whereas gapcat frames have their elements discharged only when null element modifiers are generated (null elements generated as complements also discharge subcat frame items). Since non-terminals have gaps indicated on them, the threading of gaps from one level to another is effectively accomplished by including the gapcat frame in the conditioning of the modifier generation probabilities.

² In this paragraph we again assume knowledge of Collins’ Model 2 (Collins, 1999).
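The discharge behaviour described above can be illustrated with a small Python sketch. It is only a bookkeeping toy under stated assumptions, not the authors' parser: the class `HeadState`, the function `generate_modifier`, and the labels `"NP"` and `"WH-NP"` are hypothetical names chosen for illustration, and no probabilities are modelled.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class HeadState:
    """Frames still to be satisfied while generating modifiers of a head (hypothetical)."""
    subcat: Counter = field(default_factory=Counter)  # pending complements
    gapcat: Counter = field(default_factory=Counter)  # pending gaps


def discharge(frame, label):
    """Remove one occurrence of `label` from a frame, failing if it is absent."""
    if frame[label] <= 0:
        raise ValueError(f"frame has no pending {label!r}")
    frame[label] -= 1
    if frame[label] == 0:
        del frame[label]


def generate_modifier(state, category, *, is_complement, gap_type=None):
    """Update the frames after one modifier is generated.

    A null element (gap_type is not None) discharges a gapcat item.
    A complement discharges a subcat item, whether or not it is null.
    """
    if gap_type is not None:
        discharge(state.gapcat, gap_type)
    if is_complement:
        discharge(state.subcat, category)


def frames_empty(state):
    """True once every subcat and gapcat requirement has been discharged."""
    return not state.subcat and not state.gapcat


# A head that still needs one NP complement and carries one WH-NP gap.
state = HeadState(subcat=Counter({"NP": 1}), gapcat=Counter({"WH-NP": 1}))

# An overt adjunct touches neither frame.
generate_modifier(state, "ADVP", is_complement=False)
pending_after_adjunct = (dict(state.subcat), dict(state.gapcat))

# A null NP complement discharges both the gap and the subcat item.
generate_modifier(state, "NP", is_complement=True, gap_type="WH-NP")
done = frames_empty(state)
```

In the sketch the two frames are kept as separate multisets, so that an overt complement affects only `subcat` while a null complement affects both, mirroring the asymmetry described in the text.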

2.2.3 Evaluation

A parser-integrated approach must be evaluated in two respects: first, its performance on the null element task itself, and second, its effect on the overall performance of the parser (in both accuracy and computational resources), since the approach will not be useful if it impairs the overall parsing task. We will consider these two aspects of evaluation in reverse order.

The authors find the fully integrated approach to be entirely intractable for unlexicalized parsing (it cannot find any parse at all for 35% of the sentences in section 23), so we will focus on the lexicalized case. The core challenge here, of course, is the explosion (by a factor of 7) in the size of the non-terminal alphabet due to all the gap annotations. The authors claim that this results in the familiar sparse data problem (that is, probabilities involving non-terminal symbols can no longer be estimated as accurately because training instances which were formerly considered together are “shattered” into different classes) and that it has a significant negative impact on parsing performance in both the fully and partially integrated cases. The performance of the fully and partially integrated cases is almost identical (86.6 and 86.4 F-measure, respectively), but this is a 12-13% increase in error relative to the same parsing model without null elements (88.0).
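The relative error figure can be checked with a few lines of Python. This is a minimal sketch that assumes "error" means 100 minus the bracketing F-measure; the function name `relative_error_increase` is ours.

```python
def relative_error_increase(baseline_f, system_f):
    """Relative growth in error, taking error as 100 - F-measure (an assumption)."""
    baseline_err = 100.0 - baseline_f
    system_err = 100.0 - system_f
    return (system_err - baseline_err) / baseline_err


BASELINE_F = 88.0  # same parsing model without null elements
increases = {
    "fully integrated": relative_error_increase(BASELINE_F, 86.6),
    "partially integrated": relative_error_increase(BASELINE_F, 86.4),
}

for name, value in increases.items():
    print(f"{name}: {value:.1%}")
```

Under this definition the two increases come out at roughly 11.7% and 13.3%, which agrees with the rounded 12-13% range quoted above.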

Condition   Relative # of Nonterminals   Relative Parsing Time   Bracketing
NOTRACE     1.00                         1.00                    88.0%
WH–NP       1.63                         1.07                    87.4%
PRO&WH      7.15                         1.33                    86.6%
TAGGER      7.15                         0.95                    86.4%

Table 6: INSERT model lexicalized parsing results on Section 23.

            EE detection        Antecedent rec.
Type        parser    tagger    parser    tagger
NP–NP       80.4%     83.5%     70.3%     70.7%
WH–NP       81.5%     83.2%     80.2%     82.0%
PRO–NP      64.5%     69.5%     64.5%     69.5%
WH–S        92.0%     92.8%     82.2%     84.5%
WH–ADVP     57.9%     59.5%     53.0%     53.6%

Table 7: Comparison of pre-processing with lexicalized in-processing (F-scores).

missed parses precludes straightforward comparison of bracketing scores, therefore we report the percentage of sentences where the parser fails. In the case of the lexicalized parser, less than 1% of the parses are missed, hence the comparisons are reliable. Finally, we compare EE detection and antecedent recovery F-scores of the TAGGER and the PRO&WH models for the overlapping EE types (Table 7).

5.3 Discussion

As noted by Dienes and Dubey (2003), unlexicalized parsing with EEs does not seem to be viable without pre-processing. However, the lexicalized parser is competitive with the pre-processing approach.

As for the bracketing scores, there are two interesting results. First, lexicalized models which handle EEs have lower bracketing scores than the NOTRACE model. Indeed, as the number of EEs increases, so does the number of nonterminals, which results in an increasingly severe sparse data problem. Consequently, there is a trade-off between finding local phrase structure and long-distance dependencies.

Second, comparing the TAGGER and the PRO&WH models, we find that the bracketing results are nearly identical. Nonetheless, the PRO&WH model inserting EEs can match neither the accuracy for antecedent recovery nor the time efficiency of the pre-processing approach. Thus, the results show that treating EE-detection as a pre-processing step is beneficial both to antecedent recovery accuracy and to parsing efficiency.

Nevertheless, pre-processing is not necessarily the only useful strategy for trace detection. Indeed, by taking advantage of the insights that make the finite-state and lexicalized parsing models successful, it may be possible to generalize the results to other strategies as well. There are two key observations of importance here.

The first observation is that lexicalization is very important for detecting traces, not just for the lexicalized parser, but, as discussed in Section 3, for the trace-tagger as well. The two models may contain overlapping information: in many cases, the lexical cue corresponds to the immediate head-word the EE depends on. However, other surrounding words (which frequently correspond to the head-word of grandparent of the empty node) often carry important information, especially for distinguishing NP–NP and PRO–NP nodes.

Second, local information (i.e. a window of five words) proves to be informative for the task. This explains why the finite-state tagger is more accurate than the parser: this window always crosses a phrase boundary, and the parser cannot consider the whole window.
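To make the notion of a five-word window concrete, here is a minimal Python sketch. It assumes a symmetric window centred on the position being classified, which the text does not specify; the function `window_features`, the feature names, and the padding token are hypothetical and are not the feature set of the tagger discussed here.

```python
def window_features(tokens, i, size=5, pad="<pad>"):
    """Collect the words in a window of `size` tokens centred on position `i`.

    Positions that fall outside the sentence are filled with `pad`.
    The window ignores constituent structure, so it can span a phrase boundary.
    """
    half = size // 2
    feats = {}
    for offset in range(-half, half + 1):
        j = i + offset
        feats[f"w[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else pad
    return feats


tokens = "the man who I saw".split()
feats = window_features(tokens, 4)  # window around "saw"
```

A parser that conditions only on material inside the constituent under construction would not see every word in such a window, which is the contrast drawn above.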

These two observations give a set of features that seem to be useful for EE detection. We conjecture that a parser that takes advantage of these features might be more accurate in detecting EEs while parsing than the parsers presented here. Apart from the pre-processing approach presented here, there are a number of ways these features could be used:

1. in a pre-processing system that only detects EEs, as we have done here;

2. as part of a larger syntactic pre-processing system, such as supertagging (Joshi and Bangalore, 1994);

3. with a more informative beam search (Charniak et al., 1998);

Table 2.2: Comparison of null element performance for DD’s partially (tagger) and fully (parser) integrated systems. The format of the node types is antecedent-element. pro-np indicates an uncontrolled (NP *), while wh-s (confusingly) indicates a sentential trace. (Table from DD)

In terms of the relative performance of the partially and fully integrated approaches on the null element task itself, the partial approach is consistently superior (see table 2.2). The authors hypothesize that this is because the tagger’s five-word window gives it access to useful lexical information which crosses phrase boundaries. They provide a comparison only to Johnson (the only system available at the time), whom they generally outperform by a significant margin (see section 2.4.2 for a comparison to Levy and Manning).
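The claim that the partial approach is consistently superior can be verified directly from the F-scores in the table. The following Python sketch transcribes those scores and computes the tagger's margin over the parser for each type and task; the dictionary layout and the helper name `tagger_gain` are ours.

```python
# (parser, tagger) F-scores transcribed from the table, in percent.
SCORES = {
    "NP-NP":   {"detection": (80.4, 83.5), "antecedent": (70.3, 70.7)},
    "WH-NP":   {"detection": (81.5, 83.2), "antecedent": (80.2, 82.0)},
    "PRO-NP":  {"detection": (64.5, 69.5), "antecedent": (64.5, 69.5)},
    "WH-S":    {"detection": (92.0, 92.8), "antecedent": (82.2, 84.5)},
    "WH-ADVP": {"detection": (57.9, 59.5), "antecedent": (53.0, 53.6)},
}


def tagger_gain(pair):
    """Tagger F-score minus parser F-score, rounded to one decimal place."""
    parser_f, tagger_f = pair
    return round(tagger_f - parser_f, 1)


gains = {
    node_type: {task: tagger_gain(pair) for task, pair in row.items()}
    for node_type, row in SCORES.items()
}

# True if the tagger is ahead on every type for both tasks.
tagger_always_ahead = all(g > 0 for row in gains.values() for g in row.values())

for node_type, row in gains.items():
    print(f"{node_type:8s} detection +{row['detection']:.1f}  antecedent +{row['antecedent']:.1f}")
```

The margins are all positive but vary widely, from under half a point (NP-NP antecedent recovery) to five points (PRO-NP).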
