6.7 DepBank Evaluation
7.1.1 More Detailed NP Structure Categorisation
In our work, and in the Biomedical Project, there are only two possible structures: left-branching and right-branching. Other researchers have noted further possibilities, which we also came across during the annotation process.
128 Chapter 7: Future Work
Flat NPs
Firstly, there are NPs that are neither left nor right-branching, but exhibit only a flat, monolithic structure. Entities such as John A. Smith and International Business Machines are examples of this. In these cases, there is no real head-modifier relationship: John is not modifying Smith, but the tokens taken together still convey a meaning. McInnes, Pedersen, and Pakhomov (2007) recognise monolithic NPs in their annotation of medical terms, giving the example serous otitis media.
Perhaps the easiest way to annotate these flat NPs is to change their tokenisation, joining them together as a single token. This would avoid any structural problems and let a parser (or any NLP system) treat the entity as the single object that it is. However, this would not be a very practical approach to the problem, as much important lexical information would be lost. A better annotation scheme would be to add a marker to the relevant bracket in the corpus, in the same way that semantic markers (CLR, PRD, etc.) are used:
(NP-FLAT (NNP John) (NNP A.) (NNP Smith) )
(NP
  (NML-FLAT (NNP John) (NNP A.) (NNP Smith) )
  (NNS apples) )
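As an illustration of how such a marker could be consumed downstream, the following is a minimal sketch that reads a bracketed tree and collects the words under every bracket whose label carries the proposed FLAT marker. The function names (parse, leaves, flat_spans) and the nested-tuple representation are assumptions made for this example only; they are not part of any existing Treebank tool.

```python
import re

def parse(tree):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", tree)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested bracket
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()

def leaves(node):
    """Return the words under a node, in order."""
    words = []
    for child in node[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

def flat_spans(node):
    """Yield the word sequence of every bracket marked with -FLAT."""
    label, children = node
    # Markers follow the category, separated by hyphens (as with CLR, PRD).
    if "FLAT" in label.split("-")[1:]:
        yield leaves(node)
    for child in children:
        if isinstance(child, tuple):
            yield from flat_spans(child)

example = "(NP (NML-FLAT (NNP John) (NNP A.) (NNP Smith) ) (NNS apples) )"
print(list(flat_spans(parse(example))))
```

Because the marker is only a suffix on the bracket label, the individual tokens and their POS tags remain available, which is the advantage over retokenising the entity as a single token.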
Another possibility is to incorporate flat NPs into the parsing algorithm itself. The monolithic structure could be inserted at the appropriate level in the chart, rather than being formed as a constituent via a combination of lexical items. These two structures could then probabilistically compete, with the parser choosing the most likely option. However, this could introduce a problem similar to the bias in PCFGs, where smaller derivations are more likely because they involve multiplying fewer probabilities. The monolithic structure would likewise be made up of fewer probabilities.
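The bias can be seen with a small numerical sketch. The probabilities below are invented purely for illustration and do not come from any trained model; the point is only that a product of fewer factors, each below one, tends to be larger.

```python
import math

def derivation_probability(step_probs):
    """Probability of a derivation as the product of its step probabilities."""
    return math.prod(step_probs)

# Hypothetical values: a structure built from three combination steps
# versus a monolithic structure inserted into the chart in one step.
composed = derivation_probability([0.2, 0.2, 0.2])
monolithic = derivation_probability([0.2])

print(f"composed:   {composed:.4f}")
print(f"monolithic: {monolithic:.4f}")
print("monolithic preferred:", monolithic > composed)
```

Even when every individual step is equally likely, the monolithic analysis wins simply because it has fewer steps, so some correction would be needed before the two analyses could compete fairly.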
Other problems would be deciding how to apply the feature set to multiple words together, and how to determine which NPs are flat in the first place. If these issues could be resolved, then the resulting model would be able to statistically decide between left, right, and flat structures.
Indeterminate NPs
The second additional category is semantically indeterminate NPs, whose presence we noted in Section 3.1. These NPs can be thought of as both left and right-branching, i.e. a dependency should exist between all word pairs. Lauer (1995b) found that 35 out of the 279 non-error NPs in his data set fitted this category, for example city sewerage systems and government policy decisions. It is the government policy in question in the latter example, but also policy decisions and government decisions, resulting in all three possible dependencies. In the same way as flat NPs, a marker could be added to the bracket to denote indeterminate NPs:
(NP-IND (NN government) (NN policy) (NNS decisions) )
(NP
  (NML-IND (NN government) (NN policy) (NNS decisions) )
  (NN report) )
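The difference between the categories can be made concrete for a three-word NP by listing the modifier-head dependencies that each structure implies. The sketch below is an illustration only: the function name, the structure labels and the (modifier, head) pair representation are assumptions made for this example, and the flat case follows the earlier observation that a monolithic NP has no real head-modifier relationship.

```python
def dependencies(words, structure):
    """Return the (modifier, head) pairs implied by a three-word NP structure."""
    w1, w2, w3 = words
    left = {(w1, w2), (w2, w3)}    # ((w1 w2) w3)
    right = {(w1, w3), (w2, w3)}   # (w1 (w2 w3))
    if structure == "left":
        return left
    if structure == "right":
        return right
    if structure == "indeterminate":
        return left | right        # both readings hold at once
    if structure == "flat":
        return set()               # no head-modifier relationship
    raise ValueError(f"unknown structure: {structure}")

np = ("government", "policy", "decisions")
for structure in ("left", "right", "indeterminate", "flat"):
    print(structure, sorted(dependencies(np, structure)))
```

For government policy decisions, the indeterminate reading yields all three pairs, which is why neither a left nor a right-branching bracket alone captures it.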
Note that some NPs may appear to be indeterminate, but can actually be resolved. For example, in American President George Bush, George Bush is American, and the President, and the American President. However, the first meaning in this list is not intended by the utterance. Bush’s nationality is not relevant in the document, and so we argue that the right-branching dependency should not be created. This NP should be annotated as left-branching.
Marcus, Santorini, and Marcinkiewicz (1993) make some mention of indeterminate NPs, calling them permanent predictable ambiguities, a term they ascribe to Martin Kay. The example a boatload of warriors blown ashore is given, which is similar to those in Hindle and Rooth (1993).
In Section 3.1.1 we described how both meanings of the prepositional phrase attachment are true in cases like this: the boatload was blown ashore, and so were the warriors. Marcus et al. (1994) describe the *PPA* trace used in the Penn Treebank, which is applied to these permanent predictable ambiguities, or as we have called them, indeterminates. However, *PPA* is also applied to cases of general ambiguity (those described in the following paragraphs), whereas we would separate the two.
Ambiguous NPs
The final category that we suggest is for ambiguous NPs. These NPs do have a left or right-branching structure, but the annotator has no hope of determining which is correct. This may be because of technical jargon, e.g. senior subordinated debentures, or simply an ambiguity that cannot be resolved by the given context, as in the often cited PP-attachment example: I saw the man with the telescope. In these cases, there is a definite correct answer. The man either has a telescope, or a telescope is being used to do the seeing, but not both.¹ This differentiates these ambiguous cases from indeterminate NPs, where both readings are true.
¹ In theory, the telescope could be with the man and used to do the seeing, but we will ignore this rather pathological possibility.
The Penn Treebank’s X constituent exists for when the correct category is unknown or uncertain, demonstrating that this problem occurs in the Treebank. However, we expect that the consistent use of this label is difficult at best. In any annotation task there will be hard-to-bracket cases, but drawing a line between those that are unresolvable and those that are merely complex would be up to individual annotators, whose opinions could vary greatly. In our experience, it is better to simply make a decision between left and right-branching. Accordingly, Section A.1.2 of our guidelines instructs annotators to leave an NP flat when they are unsure. Having this default strategy is one way to manage this problem, similar to high PP attachment in the Penn Treebank (Bies et al., 1995, §5.2.1) and in the Redwoods Treebank (Oepen et al., 2002).
The Frequency of these Additional Categories
Annotating for each of these flat, indeterminate and ambiguous NPs would require a further pass through the corpus, which would be a significant amount of work. From a pragmatic point of view, it may be better to leave them as is, as they comprise such a small proportion of all NPs. We can present no gold-standard figures for the Penn Treebank, as the annotation of these additional NP structure categories has not yet been performed. However, considering that the annotator only marked 915 of the 60,959 inspected NPs as difficult (1.50%), we suggest that almost all NPs can be assigned to left or right-branching classes. From our experience annotating, we estimate that approximately 5% of NPs do not fit into one of these major categories. Indeterminate NPs would be the smallest part of these, and ambiguous NPs (relating to financial jargon) the largest.
McInnes, Pedersen, and Pakhomov (2007) found that flat NPs comprised 10.3% of their corpus; however, another category of NPs that they define, non-branching, appears to be equivalent to right-branching NPs. Also, one of the flat examples given, difficulty finding words, does not seem to be an NP. For these reasons, a comparison between our corpus and theirs may not be reliable.
In Lauer’s data set, 12.54% of NPs are indeterminate; however, we suspect that many of these cases could fit into a left or right-branching category. The same logic we applied to American President George Bush could be used for some of Lauer’s indeterminate NPs, such as college basketball players. The NP is unlikely to be stressing that the players are college students, but rather that they are playing in an official college basketball league. Although the lack of context adds some confusion, we suspect this is actually a left-branching NP.
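The two proportions quoted in this section can be recomputed directly from the counts given in the text: 915 of 60,959 inspected NPs marked as difficult, and 35 of Lauer's 279 non-error NPs classed as indeterminate. The helper name below is hypothetical and the snippet is only an arithmetic check.

```python
def percentage(part, whole):
    """Return part/whole as a percentage, rounded to two decimal places."""
    return round(100 * part / whole, 2)

difficult = percentage(915, 60959)   # NPs the annotator marked as difficult
indeterminate = percentage(35, 279)  # Lauer's indeterminate NPs

print(f"difficult: {difficult:.2f}%")
print(f"indeterminate: {indeterminate:.2f}%")
```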
Flat, indeterminate and ambiguous NPs are interesting problems, but they are only a small part of the larger NP parsing task.