2.2 Dependency Parsing
2.2.1 Transition-based Parsing
Transition-based parsers build the syntactic structure of a sentence incrementally by starting from an initial configuration and then repeatedly transitioning into different configurations by performing one of a set of predefined operations. The parsing process is finished when the parser ends in a defined final configuration. The sequence of operations (the transition sequence) that the parser performs to transition from the initial configuration to the final configuration encodes the syntactic structure of the input sentence.
As an example, Figure 2.4 shows the Arc-Standard transition system as formalized by Nivre (Yamada and Matsumoto 2003, Nivre 2004). A configuration in this system consists of three components:
1. β, an input buffer that holds the words of the input sentence, indexed from 0 to n with 0 representing the artificial root node
2. σ, a stack that serves as a memory and stores unfinished substructures
3. A, the set of arcs that form the output structure
The parser starts with the root node on the stack, all other tokens stored in the buffer, and an empty arc set. It then applies one of the three operations shown in Figure 2.4 to transition into a new state and repeats the procedure until it ends in the final configuration. The final configuration is reached when the buffer is empty and the only symbol on the stack is the artificial root node. At this point, the arc set holds all arcs that the parser has introduced between the words of the sentence.
Left-Arc_l introduces an arc between the front of the buffer and the top of the stack, with the front of the buffer being the head. The token on top of the stack is then discarded. Right-Arc_l introduces an arc between the same items but makes the top of the stack the head. The token at the front of the buffer is discarded and the top of the stack is put back onto the buffer. The third operation, Shift, simply pushes the token at the front of the buffer onto the stack. Left-Arc_l and Right-Arc_l are additionally parameterized for the label l that they introduce on the arc.
Transition                                                                Precondition
Left-Arc_l    (σ|w_i, w_j|β, A) ⇒ (σ, w_j|β, A ∪ {⟨w_j, w_i, l⟩})        i ≠ 0
Right-Arc_l   (σ|w_i, w_j|β, A) ⇒ (σ, w_i|β, A ∪ {⟨w_i, w_j, l⟩})
Shift         (σ, w_i|β, A) ⇒ (σ|w_i, β, A)
Figure 2.4: The Arc-Standard transition system, adapted from Kübler et al. (2009: fig. 3.1).
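The three transitions in Figure 2.4 can be written down almost literally. The following is a minimal sketch, not a reference implementation: the class and function names are chosen here for illustration, and the three-word sentence and its labels in the usage lines ("She eats fish" with nsubj, obj, root) are an assumed toy example.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, int, str]  # (head, dependent, label)


class Configuration:
    """Parser state (sigma, beta, A) for a sentence of n words; 0 is the root."""

    def __init__(self, n: int) -> None:
        self.stack: List[int] = [0]                      # sigma, top is the last element
        self.buffer: List[int] = list(range(1, n + 1))   # beta, front is the first element
        self.arcs: Set[Arc] = set()                      # A

    def is_final(self) -> bool:
        # Buffer empty and only the artificial root left on the stack.
        return not self.buffer and self.stack == [0]


def left_arc(c: Configuration, label: str) -> None:
    """Buffer front becomes the head of the stack top; the stack top is discarded."""
    assert c.stack and c.buffer and c.stack[-1] != 0  # precondition i != 0
    dependent = c.stack.pop()
    c.arcs.add((c.buffer[0], dependent, label))


def right_arc(c: Configuration, label: str) -> None:
    """Stack top becomes the head of the buffer front; the head returns to the buffer."""
    assert c.stack and c.buffer
    head = c.stack.pop()
    c.arcs.add((head, c.buffer[0], label))
    c.buffer[0] = head  # the dependent is discarded, the head takes its place


def shift(c: Configuration) -> None:
    """Move the buffer front onto the stack."""
    assert c.buffer
    c.stack.append(c.buffer.pop(0))


# Toy usage (assumed example): 1 = She, 2 = eats, 3 = fish.
c = Configuration(3)
shift(c)
left_arc(c, "nsubj")
shift(c)
right_arc(c, "obj")
right_arc(c, "root")
shift(c)
```

Because Right-Arc_l puts the head back onto the buffer, the root ends up on the buffer after the last arc, which is why the sequence closes with a Shift.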
A statistical multiclass classifier is used to decide at each step, given the current configuration, which of the three operations the parser should apply. The classifier is trained on oracle transition sequences. Oracle transition sequences are derived from manually annotated treebanks by running the transition system on a sentence such that it derives the treebank tree for this sentence. It is possible that there is more than one oracle transition sequence for a given sentence. A canonical sequence can be defined by ranking the operations in the transition system and preferring the higher-ranked operation whenever several are permitted. Ranking the operations as in Figure 2.4 (Left-Arc > Right-Arc > Shift), the canonical oracle transition sequence for the example sentence in Figure 2.1 would be Shift, Right-Arc_name, Shift, Shift, Shift, Shift, Shift, Left-Arc_amod, Left-Arc_det, Left-Arc_advmod, Left-Arc_cop, Left-Arc_nsubj, Shift, Right-Arc_punct, Right-Arc_root, Shift.
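One way to derive such a canonical sequence is sketched below, assuming a projective gold tree given as a head map. The function name, the data layout, and the toy tree are assumptions made for this illustration. Right-Arc is only taken once the dependent has collected all of its own dependents, because the dependent is discarded afterwards.

```python
from typing import Dict, List, Tuple


def oracle_sequence(heads: Dict[int, int], labels: Dict[int, str]) -> List[Tuple[str, str]]:
    """Canonical transition sequence (Left-Arc > Right-Arc > Shift) for a gold tree.

    heads[d] is the gold head of word d (words are 1..n, 0 is the root);
    labels[d] is the label of the arc entering d.
    """
    stack, buffer = [0], list(range(1, len(heads) + 1))
    attached = set()  # words whose gold arc has already been built
    sequence: List[Tuple[str, str]] = []

    def complete(word: int) -> bool:
        # True if every gold dependent of `word` has been attached.
        return all(d in attached for d, h in heads.items() if h == word)

    while buffer or stack != [0]:
        top = stack[-1] if stack else None
        front = buffer[0] if buffer else None
        if top is not None and front is not None and top != 0 and heads[top] == front:
            sequence.append(("Left-Arc", labels[top]))
            attached.add(stack.pop())
        elif top is not None and front is not None and heads.get(front) == top and complete(front):
            sequence.append(("Right-Arc", labels[front]))
            attached.add(front)
            stack.pop()
            buffer[0] = top  # the head goes back to the front of the buffer
        elif buffer:
            sequence.append(("Shift", ""))
            stack.append(buffer.pop(0))
        else:
            raise ValueError("no transition applies; the tree may be non-projective")
    return sequence


# Toy gold tree (assumed example): 1 <- 2 -> 3, with 2 attached to the root.
toy_sequence = oracle_sequence({1: 2, 2: 0, 3: 2}, {1: "nsubj", 2: "root", 3: "obj"})
```

The sketch raises an error instead of returning a partial sequence when the gold tree cannot be derived by the system.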
In order to predict the next transition, the parser extracts features from its current configuration. These features include information about the next items in the buffer and the partially processed items on the stack. By accessing the stack, the feature model has access to the entire structure that has been built so far. As the parser advances, more structure is built and becomes available to the feature model.
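A feature extractor of this kind might look like the following sketch. The feature templates (word and part-of-speech tag of the stack top and the first two buffer items, plus labels of dependents already attached) and all names are assumptions for illustration; real feature models are typically much richer.

```python
from typing import Dict, List, Optional, Set, Tuple

Arc = Tuple[int, int, str]  # (head, dependent, label)
NONE = "<none>"


def outermost_label(head: Optional[int], arcs: Set[Arc], leftmost: bool) -> str:
    """Label of the lowest- or highest-indexed dependent attached to `head` so far."""
    deps = [(d, label) for h, d, label in arcs if h == head]
    if not deps:
        return NONE
    return (min(deps) if leftmost else max(deps))[1]


def extract_features(stack: List[int], buffer: List[int], arcs: Set[Arc],
                     words: List[str], tags: List[str]) -> Dict[str, str]:
    """Map a configuration to named string features for a classifier."""
    positions = {
        "s0": stack[-1] if stack else None,           # top of the stack
        "b0": buffer[0] if buffer else None,          # front of the buffer
        "b1": buffer[1] if len(buffer) > 1 else None,  # second buffer item
    }
    feats: Dict[str, str] = {}
    for name, tok in positions.items():
        feats[name + ".word"] = words[tok] if tok is not None else NONE
        feats[name + ".tag"] = tags[tok] if tok is not None else NONE
    # Structural features read the partial tree stored in the arc set.
    feats["s0.rdep"] = outermost_label(positions["s0"], arcs, leftmost=False)
    feats["b0.ldep"] = outermost_label(positions["b0"], arcs, leftmost=True)
    return feats


# Toy configuration (assumed example) after Shift and Left-Arc_nsubj.
toy_feats = extract_features(
    stack=[0], buffer=[2, 3], arcs={(2, 1, "nsubj")},
    words=["<root>", "She", "eats", "fish"], tags=["ROOT", "PRON", "VERB", "NOUN"],
)
```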
There are several different flavors of transition-based parsers, all of which build the output structure incrementally, at each step predicting the next transition based on the current configuration. Nivre (2008) makes a distinction between stack-based and list-based algorithms. Stack-based algorithms include the Arc-Standard algorithm shown above (Yamada and Matsumoto 2003, Nivre 2004) and the Arc-Eager algorithm (Nivre 2003). List-based algorithms are proposed in Covington (2001). Instead of a stack, they use one or more lists to store partially processed tokens. Non-directional parsers (Shen and Joshi 2008, Goldberg and Elhadad 2010b) abandon the strict left-to-right processing and instead allow introducing arcs between neighbouring tokens anywhere in the sentence. The statistical models guiding these parsers learn not only which tokens to connect but also which tokens should be processed before others.
Most of the transition-based algorithms derive projective trees only (with the exception of some of the list-based algorithms in Covington (2001) and the parser in Shen and Joshi (2008)). Modifications in different directions have been proposed to deal with non-projectivity. Nivre and Nilsson (2005) propose a preprocessing step that projectivizes trees prior to parsing and reintroduces non-projective edges afterwards. Nivre (2009) and Nivre et al. (2009) add a swap operation to the transition system that reorders tokens during parsing in order to create a projective parsing order. The swap operation was applied in non-directional parsing by Tratz and Hovy (2011). With the swap operation, transition-based algorithms are able to derive any possible non-projective structure for a given sentence, but this comes at the cost of increased time complexity. Other approaches avoid the increase in complexity by restricting the set of non-projective structures that can be derived (Attardi 2006, Gómez-Rodríguez and Nivre 2010, Pitler and McDonald 2015). These subsets can be parsed efficiently and are shown to include almost all of the non-projective arcs that can be found in the treebanks.
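Whether a given treebank tree is projective can be tested directly from its head map: an arc is projective if every word strictly between head and dependent is a descendant of the head. The sketch below assumes a well-formed tree (every word reaches the root 0, no cycles); the function name and the example trees are illustrative.

```python
from typing import Dict


def is_projective(heads: Dict[int, int]) -> bool:
    """True if every arc of the tree is projective.

    heads[d] is the head of word d (words are 1..n, 0 is the artificial root).
    Assumes the head map describes a well-formed tree without cycles.
    """
    for dependent, head in heads.items():
        low, high = min(dependent, head), max(dependent, head)
        for between in range(low + 1, high):
            # Climb from the intervening word towards the root and
            # check whether the arc's head is met on the way.
            ancestor = between
            while ancestor != head and ancestor != 0:
                ancestor = heads[ancestor]
            if ancestor != head:
                return False
    return True


projective_tree = {1: 2, 2: 0, 3: 2}
crossing_tree = {1: 3, 2: 4, 3: 0, 4: 3}  # the arc 4 -> 2 spans word 3, which 4 does not dominate
```

The check is quadratic in the worst case, which is acceptable for filtering or counting trees in a treebank.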
Transition-based parsers are fast due to their low time complexity. The stack-based variants find a tree in time linear in the length of the input sentence, and the list-based algorithms have quadratic complexity (for proofs, see e.g. Nivre 2008). The complexity of the non-directional parser by Goldberg and Elhadad (2010b) is O(n log n). Introducing the swap operation increases the complexity of the stack-based parsers to O(n^2).
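For the Arc-Standard system in Figure 2.4, the linear bound can be checked by counting transitions. The following is a sketch in which s (the number of Shift transitions) and a (the number of Left-Arc and Right-Arc transitions) are symbols introduced here for illustration. Shift moves one token from the buffer to the stack, while both arc transitions remove one token from the stack and leave the length of the buffer unchanged. Comparing the initial configuration (one token on the stack, n in the buffer) with the final one (one token on the stack, empty buffer) gives:

```latex
\begin{align*}
  |\sigma_{\text{final}}| &= |\sigma_{\text{init}}| + s - a &&\Rightarrow\quad 1 = 1 + s - a, \\
  |\beta_{\text{final}}|  &= |\beta_{\text{init}}| - s      &&\Rightarrow\quad 0 = n - s, \\
  s = a &= n, &&\phantom{\Rightarrow}\quad s + a = 2n .
\end{align*}
```

Every complete derivation for a sentence of n words therefore consists of exactly 2n transitions. The canonical sequence given for the example sentence is consistent with this count: it contains 8 Shift and 8 arc transitions.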
Another aspect that makes transition-based parsers fast and efficient is that they search for the best tree greedily, i.e., they always go with the locally best decision under the assumption that this will usually also lead to the globally best output. This is one of the fundamental differences from graph-based parsers, which perform global optimization to find the best tree. However, greedy search suffers from error propagation because once the parser has made an incorrect attachment, it can no longer correct it. For this reason, transition-based parsers used to perform worse than graph-based parsers in terms of parsing accuracy. Since then, techniques like beam search (Johansson and Nugues 2007b, Zhang and Clark 2008a) and dynamic programming (Huang and Sagae 2010), which allow the parser to pursue several derivations in parallel, have closed that gap while increasing runtime by a constant factor only. At the same time, error propagation in greedy parsers has been mitigated with the help of dynamic oracles, which allow the parser to learn how to recover from past mistakes (Goldberg and Nivre 2012, 2013).
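The relation between greedy search and beam search can be illustrated with a small sketch over the unlabeled Arc-Standard system: a beam of width 1 is greedy parsing, while a wider beam keeps several derivations alive. All names are assumptions for this illustration, and toy_score is a hand-written stand-in for a trained classifier, not a real model.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Arcs = FrozenSet[Tuple[int, int]]                      # unlabeled (head, dependent) pairs
State = Tuple[Tuple[int, ...], Tuple[int, ...], Arcs]  # (stack, buffer, arcs)


def is_final(state: State) -> bool:
    stack, buffer, _ = state
    return not buffer and stack == (0,)


def successors(state: State) -> List[Tuple[str, State]]:
    """All transitions permitted in `state`, each with its resulting state."""
    stack, buffer, arcs = state
    result: List[Tuple[str, State]] = []
    if stack and buffer:
        top, front = stack[-1], buffer[0]
        if top != 0:  # Left-Arc precondition
            result.append(("Left-Arc", (stack[:-1], buffer, arcs | {(front, top)})))
        result.append(("Right-Arc", (stack[:-1], (top,) + buffer[1:], arcs | {(top, front)})))
    if buffer:
        result.append(("Shift", (stack + (buffer[0],), buffer[1:], arcs)))
    return result


def beam_parse(n: int, score: Callable[[State, str], float], width: int = 1) -> Arcs:
    """Arc set of the best-scoring complete derivation found; width=1 is greedy search."""
    beam = [(0.0, ((0,), tuple(range(1, n + 1)), frozenset()))]
    while not all(is_final(state) for _, state in beam):
        candidates = []
        for total, state in beam:
            if is_final(state):
                candidates.append((total, state))
                continue
            for action, nxt in successors(state):  # dead ends contribute nothing
                candidates.append((total + score(state, action), nxt))
        if not candidates:
            raise RuntimeError("every derivation in the beam reached a dead end")
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:width]
    return beam[0][1][2]


GOLD_HEADS: Dict[int, int] = {1: 2, 2: 0, 3: 2}  # assumed toy tree for three words


def toy_score(state: State, action: str) -> float:
    """Stand-in for a classifier: rewards transitions that build gold arcs."""
    stack, buffer, arcs = state
    top = stack[-1] if stack else None
    front = buffer[0] if buffer else None
    if action == "Left-Arc":
        return 1.0 if GOLD_HEADS.get(top) == front else -1.0
    if action == "Right-Arc":
        complete = all((front, d) in arcs for d, h in GOLD_HEADS.items() if h == front)
        return 1.0 if GOLD_HEADS.get(front) == top and complete else -1.0
    return 0.0


greedy_arcs = beam_parse(3, toy_score, width=1)
beam_arcs = beam_parse(3, toy_score, width=4)
```

With this idealized scorer, both settings recover the same tree. The difference only shows with an imperfect scorer, where a wider beam can keep a derivation that looks locally worse but scores better overall.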