While most of our experiments will be on English language treebanks, it is also important to see how performance transfers to other languages. We will evaluate the performance of some of our systems on a collection of treebanks from 9 other languages: Arabic, Basque, French,
CHAPTER 2. EXPERIMENTAL SETUP 10
German, Hebrew, Hungarian, Korean, Polish and Swedish. These treebanks were used in the 2013 Statistical Parsing of Morphologically Rich Languages (SPMRL) shared task (Seddah et al. 2013). Many (though not all) of these languages do in fact have rich morphologies.
Some even have relatively free word order.
The data is derived from the following treebanks:
• Arabic: The Penn Arabic Treebank (Maamouri et al. 2003) and the Columbia Arabic Treebank (Habash and Roth 2009)
• Basque: The Basque Constituency Treebank (Aldezabal et al. 2008)
• French: The French Treebank (Abeillé, Clément, and Toussenel 2003)
• German: TiGer Treebank release 2.2 (Brants et al. 2002)
• Hebrew: The Modern Hebrew Treebank V2 (Sima’an et al. 2001)
• Hungarian: The Szeged Treebank (Csendes et al. 2005)
• Korean: The KAIST Treebank (Choi et al. 1994)
• Polish: The Składnica Treebank (Woliński, Głowińska, and Marek 2011)
• Swedish: The Talbanken section of the Swedish Treebank (Nivre, Nilsson, and Hall 2006)
The SPMRL treebanks come with automatically induced POS tags and morphological analyses. Preliminary experiments indicated that these were not useful for our system, and so we do not use them. The shared task also provided a small training set condition, in which parsers are given only 5000 sentences of training data. We did not experiment with this condition.
Chapter 3
Parsing with Refinements
Many high-performance probabilistic constituency parsers take an initially simple base grammar over treebank labels like NP and enrich it with more complex syntactic features to improve accuracy. This broad characterization includes lexicalized parsers (Collins 1997), unlexicalized parsers (Klein and Manning 2003), and latent variable parsers (Matsuzaki, Miyao, and Tsujii 2005). Figures 3.1(a), 3.1(b), and 3.1(c) show small examples of context-free trees that have been annotated in these ways.
In this chapter, we will present a single canonical representation of constituency parsing that covers the grammars used by nearly all standard chart-based parsers—that is, leaving aside neural network parsers like Henderson (2003). In particular, we will frame these parsers as having refinements at their core. This abstraction will allow us to reason about the different ways these parsers work and to combine and manipulate those models to see how they interact.
3.1 Maximum Likelihood PCFGs
Before we think about building a refined grammar, let us be more specific about exactly which grammar we are refining. The most obvious grammar one could build from a treebank is the maximum likelihood estimate (MLE) grammar, like that used in Johnson (1998b). In this grammar, we read off exactly the symbols used in the treebank, estimating the probability of each rule as proportional to the number of times we have seen it (normalized over all rules with the same left-hand side).
As an example, if we saw a configuration like the following, we would create a rule NP → DT NNP CD NN:
CHAPTER 3. PARSING WITH REFINEMENTS 12
(a) (NP[agenda] (NP[’s] The president’s) (NN[agenda] agenda))
(b) (NP[^S] (NP[^NP-Poss-Det] The president’s) (NN[^NP] agenda))
(c) (NP[1] (NP[1] The president’s) (NN[0] agenda))
Figure 3.1: Parse trees using three different refinement schemes, shown here in bracketed form: (a) Lexicalized refinement like that in Collins (1997); (b) Structural refinement like that in Klein and Manning (2003); and (c) Latent refinement like that in Matsuzaki, Miyao, and Tsujii (2005).
(9) (NP (DT an) (NNP Oct) (CD 19) (NN review))
Because we have to use a grammar in Chomsky Normal Form, we binarize the rules with more than two children by introducing synthetic symbols.1 Our rule from above becomes several rules:2
(10) a. NP → DT NP[\DT]
b. NP[\DT] → NNP NP[\NNP\DT]
c. NP[\NNP\DT] → CD NN
1There are many ways to binarize a rule. One could start by peeling off children from the left, or from the right. We use “head outward binarization,” where we pick a head symbol from among the children, and then add all symbols to its left, then to its right. For English, we use the head rules of Collins (1997). See Section 3.5.
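To make the synthetic symbols concrete, here is a minimal Python sketch of a binarizer. It is a simplification and not the head-outward procedure used in this work: it has no head rules and simply peels children off from the left, which happens to reproduce the three rules in (10). The function name `binarize` and the rule encoding as `(lhs, rhs)` pairs are assumptions made for the example.

```python
def binarize(parent, children):
    """Binarize parent -> children by peeling children off from the left.

    Each synthetic symbol records the siblings already generated, most
    recent first, e.g. NP[\\NNP\\DT]. (Simplified: no head rules.)
    """
    rules = []
    lhs = parent
    history = []               # siblings generated so far, most recent first
    remaining = list(children)
    while len(remaining) > 2:
        first = remaining.pop(0)
        history.insert(0, first)
        synthetic = parent + "[" + "".join("\\" + s for s in history) + "]"
        rules.append((lhs, (first, synthetic)))
        lhs = synthetic
    rules.append((lhs, tuple(remaining)))  # final rule has at most two children
    return rules

binary_rules = binarize("NP", ["DT", "NNP", "CD", "NN"])
for lhs, rhs in binary_rules:
    print(lhs, "->", " ".join(rhs))
```

Rules that already have one or two children pass through unchanged, so the sketch can be applied uniformly to every rule read off the treebank.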
All in all, reading a grammar from the Penn Treebank training set produces a grammar with 38340 rules and 11681 symbols.3
Unfortunately, this grammar does not perform very well in practice, getting just 71.41 F1 on the development set of the Penn Treebank. The main reason this grammar does not work well is that the raw treebank symbols like NP do not encode enough information. For example, the maximum likelihood estimate grammar has no mechanism to differentiate the different rewrite statistics of subject NPs and object NPs (subject NPs are more likely to be pronominal than object NPs), nor can it differentiate between PPs that attach to VPs (e.g. those headed by “to”) and PPs that attach to NPs (e.g. those headed by “of”). These kinds of attachment problems are critical to parsing performance; having a grammar that can distinguish these “different kinds” of NPs and PPs is perhaps the most obvious way to deal with the problem.
However, the grammar actually encodes too much context in other ways. In particular, by using only configurations that have actually occurred in the treebank, the grammar cannot represent certain novel configurations. For example, no NP in the training set of the Penn Treebank has the rewrite “JJ NN VBG NNS” (e.g. “domestic printer manufacturing operations”), but that configuration does occur in the development set.
One way to correct for this shortcoming is to remove even more information from the grammar in a process called “horizontal Markovization.” Here, we shorten histories in the synthetic binarized symbols, removing a lot of the context that was encoded. For example, we might collapse the symbols NP[\NNP\DT] and NP[\NNP\JJ] into a single symbol NP[\NNP...], where the ellipsis means that the symbol may have some number of predecessors, but the number and identity of those symbols are discarded. This has the effect of reducing the size of the grammar considerably, as well as allowing the grammar to produce more tree structures than it could before.4 Removing all but 2 siblings from each history gives a much smaller grammar with 3011 symbols and 16657 rules, scoring a slightly better 71.78 F1.
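The history truncation can be illustrated with a short Python sketch. The function name `synthetic_symbol`, its `order` parameter, and the convention that a history lists the most recent sibling first are assumptions made for this example, not details taken from the actual implementation.

```python
def synthetic_symbol(parent, history, order=None):
    """Build a synthetic symbol name, keeping at most `order` siblings.

    `history` lists the siblings already generated, most recent first.
    With order=None the full history is kept (no Markovization). When
    siblings are dropped, "..." marks the discarded part of the history.
    """
    kept = history if order is None else history[:order]
    label = "".join("\\" + s for s in kept)
    if len(kept) < len(history):
        label += "..."
    return parent + "[" + label + "]"

# Two distinct full histories collapse to one symbol when order=1.
full_a = synthetic_symbol("NP", ["NNP", "DT"])
full_b = synthetic_symbol("NP", ["NNP", "JJ"])
short_a = synthetic_symbol("NP", ["NNP", "DT"], order=1)
short_b = synthetic_symbol("NP", ["NNP", "JJ"], order=1)
print(full_a, full_b, short_a, short_b)
```

Because several full histories map to the same truncated symbol, the rules attached to them are pooled, which is what lets the Markovized grammar generate sibling sequences that never occurred together in training.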
2The notation we use here is deliberately patterned on Categorial Grammar (Bar-Hillel 1953), where the categories of many words are functions that combine with other categories to form constituents. We use it merely to describe which neighbors a constituent has, but the binary rules produced by this process describe the order in which constituents combine, giving a similar interpretation, roughly speaking.
3Minor implementation differences will produce grammars with different numbers of rules.
4For the pedants: yes, the PCFG could already generate infinitely many trees. You know what I mean.