2.2 Dependency Parsing

2.2.2 Graph-based Parsing

In contrast to transition-based parsers, graph-based parsers do not build dependency trees incrementally. Given a statistical model that assigns scores to dependency trees, graph-based parsers search through the space of all possible dependency trees for a sentence and return the one to which the statistical model assigns the highest score. This approach guarantees that the returned tree is optimal with respect to the statistical model. However, since there are exponentially many trees for a given sentence, one cannot simply look at each possible tree individually and compare their scores. Instead, algorithms were developed that efficiently search through the space without having to create every single tree.

Eisner (1997, 2000) develops a dynamic programming algorithm that finds the optimal dependency tree for a given sentence in cubic time, but it is restricted to projective trees. The algorithm works like the chart parsers known from constituency parsing and uses the fact that subtrees of a dependency tree are also dependency trees. Alternatively, McDonald et al. (2005) propose to use spanning tree algorithms to find the optimal dependency tree for a given sentence. The particular algorithm used in McDonald et al. (2005) is the Chu-Liu-Edmonds algorithm. It is not restricted to projective trees, and with a clever implementation it runs in quadratic time with respect to the length of the input sentence.
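To make the cubic-time dynamic program concrete, the following is a minimal Python sketch of a first-order, score-only version of Eisner's chart algorithm. It is an illustration under assumptions, not the implementation used in this dissertation: the function name `eisner_best_score`, the matrix layout `scores[h][d]`, and the example weights (the arc weights of Figure 2.5, assigned to arcs as in the standard example) are chosen here for illustration. The sketch returns only the score of the best projective tree, and it does not enforce that the root has a single dependent.

```python
NEG_INF = float("-inf")


def eisner_best_score(scores):
    """Score of the best projective tree under an arc-factored model.

    scores[h][d] is the score of the arc h -> d; token 0 is the artificial root.
    """
    n = len(scores)
    # complete[s][t][d] and incomplete[s][t][d] hold the best score of span s..t.
    # d = 1: the head is the left end s; d = 0: the head is the right end t.
    complete = [[[0.0, 0.0] if s == t else [NEG_INF, NEG_INF]
                 for t in range(n)] for s in range(n)]
    incomplete = [[[NEG_INF, NEG_INF] for _ in range(n)] for _ in range(n)]

    for width in range(1, n):
        for s in range(n - width):
            t = s + width
            # Join two adjacent complete spans, then add an arc between s and t.
            joined = max(complete[s][r][1] + complete[r + 1][t][0]
                         for r in range(s, t))
            incomplete[s][t][0] = joined + scores[t][s]  # arc t -> s
            incomplete[s][t][1] = joined + scores[s][t]  # arc s -> t
            # Extend an incomplete span with a complete span on the far side.
            complete[s][t][0] = max(complete[s][r][0] + incomplete[r][t][0]
                                    for r in range(s, t))
            complete[s][t][1] = max(incomplete[s][r][1] + complete[r][t][1]
                                    for r in range(s + 1, t + 1))
    return complete[0][n - 1][1]


# Assumed example weights; 0 = root, 1 = John, 2 = saw, 3 = Mary.
# Arcs into the root and self-loops are disallowed via -inf.
scores = [
    [NEG_INF, 9, 10, 9],
    [NEG_INF, NEG_INF, 20, 3],
    [NEG_INF, 30, NEG_INF, 30],
    [NEG_INF, 11, 0, NEG_INF],
]
best = eisner_best_score(scores)
```

The three nested loops over span width, start position, and split point give the cubic runtime. Recovering the tree itself would additionally require storing back-pointers for each chart cell.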

The Chu-Liu-Edmonds Algorithm

The Chu-Liu-Edmonds algorithm (Chu and Liu 1965, Edmonds 1967) is a maximum spanning tree algorithm. Given a graph with arc weights, it searches for the spanning tree that connects all vertices in the graph and has the maximum sum of arc scores. We make use of this algorithm in several places in this dissertation, e.g., to enforce tree properties (Chapter 7). For this reason, we present the algorithm in this section, following the description in Kübler et al. (2009: 48). We demonstrate the algorithm by using an example. Proper pseudocode and a longer discussion can be found in McDonald et al. (2005) and Kübler et al. (2009: 47).

Figure 2.5 shows an example run of the algorithm on a small graph with four nodes. The nodes represent the sentence ⟨ROOT John saw Mary⟩. The algorithm starts from a fully connected graph (Figure 2.5a) whose arcs are weighted, e.g., by a statistical parsing model. The first step is to make a greedy selection by choosing the highest-scoring incoming arc for each word (Figure 2.5b). The algorithm would stop at this point if the resulting structure were a proper tree, since this tree would be the maximum spanning tree over the original graph. However, greedy selection can create cycles, as shown in the example between John and saw. In such a case, the algorithm contracts the cycle into a single node and recomputes the arc scores for each arc that enters or leaves the contracted cycle (Figure 2.5c). The new arc scores are computed depending on the score of the cycle and the scores of the incoming/outgoing arcs (see Kübler et al. 2009: 47 for the exact procedure).

Once the cycle is contracted and the new arc scores are computed, the algorithm calls itself recursively on the new smaller graph starting again with a greedy selection. If another cycle is found, the contraction procedure starts again and the recursion continues. Since each recursive call reduces the number of nodes in the graph due to the contraction, the algorithm eventually finds a tree structure without a cycle. This is shown in Figure 2.5d. From there, the contractions are resolved until an acyclic tree structure is obtained for the original graph (Figure 2.5e).
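The steps described above (greedy selection, cycle contraction, recursion, resolution) can be sketched in Python as follows. This is a simple recursive illustration, not the quadratic-time implementation mentioned earlier and not the code used in this dissertation. The names `find_cycle` and `chu_liu_edmonds` are hypothetical, and the assignment of the weights in Figure 2.5 to individual arcs is an assumption, chosen so that the contracted scores match those shown in Figure 2.5c (e.g., 10+30 for the arc from root into the contracted node).

```python
def find_cycle(head):
    """Return the nodes of one cycle in a {dependent: head} map, or None."""
    for start in head:
        path, node = [], start
        while node in head and node not in path:
            path.append(node)
            node = head[node]
        if node in path:
            return path[path.index(node):]
    return None


def chu_liu_edmonds(nodes, scores, root):
    """Return {dependent: head} of the maximum spanning tree rooted at `root`.

    scores maps (head, dependent) to an arc score; the graph is assumed to
    contain at least one incoming arc for every non-root node.
    """
    # Greedy step: highest-scoring incoming arc for every non-root node.
    head = {d: max((h for h in nodes if (h, d) in scores),
                   key=lambda h: scores[h, d])
            for d in nodes if d != root}
    cycle = find_cycle(head)
    if cycle is None:
        return head

    # Contract the cycle into a fresh node and recompute arc scores.
    c = ("contracted", tuple(cycle))
    cycle_score = sum(scores[head[v], v] for v in cycle)
    new_nodes = [v for v in nodes if v not in cycle] + [c]
    new_scores, origin = {}, {}
    for (h, d), s in scores.items():
        if h in cycle and d in cycle:
            continue
        if d in cycle:    # entering arc: it replaces the cycle arc into d
            key, s = (h, c), s + cycle_score - scores[head[d], d]
        elif h in cycle:  # leaving arc: score unchanged
            key = (c, d)
        else:
            key = (h, d)
        if key not in new_scores or s > new_scores[key]:
            new_scores[key], origin[key] = s, (h, d)

    # Recurse on the smaller graph, then resolve the contraction.
    contracted_tree = chu_liu_edmonds(new_nodes, new_scores, root)
    tree = {v: head[v] for v in cycle}
    for d, h in contracted_tree.items():
        orig_head, orig_dep = origin[h, d]
        tree[orig_dep] = orig_head  # overrides one cycle arc where the cycle is entered
    return tree


# Assumed arc weights for <ROOT John saw Mary>, keyed by (head, dependent).
scores = {
    ("root", "John"): 9, ("root", "saw"): 10, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30,
    ("John", "saw"): 20, ("John", "Mary"): 3,
    ("Mary", "John"): 11, ("Mary", "saw"): 0,
}
tree = chu_liu_edmonds(["root", "John", "saw", "Mary"], scores, "root")
```

On this input, the greedy step selects the cycle between John and saw, the contracted graph is solved without a further cycle, and resolving the contraction attaches saw to root while John and Mary remain dependents of saw.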

Arc-factored Model and Higher-order Factors

Eisner’s algorithm and the Chu-Liu-Edmonds algorithm both rest on the assumption that the individual arcs in the tree are independent of each other. The underlying statistical model assigns a score to each arc and the score of a tree is the sum of all arc scores. The score of an arc is thereby independent of any other arc in the tree, i.e., it does not change if the other arcs in the tree change. This model is called the arc-factored model, since the parameters of the model factor over single arcs.
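Written as a formula, the arc-factored assumption reads as follows. The notation is introduced here only for illustration and is an assumption, not the notation of the formal chapters: y denotes the set of arcs of a tree, and s(h, d) denotes the score that the statistical model assigns to the arc from head h to dependent d.

```latex
\[
\mathrm{score}(y) \;=\; \sum_{(h,d) \in y} s(h, d)
\]
```

Because every summand depends on a single arc only, replacing an arc (h, d) by (h', d) changes the total by exactly s(h', d) − s(h, d), whatever the remaining arcs of the tree are.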

Arc-factorization makes the search for the best dependency tree tractable, but it is an unrealistic model from a linguistic point of view. To see its limitations, consider the two German sentences in Figure 2.6. Both sentences contain a dependent clause: the upper one contains a relative clause, the lower one contains a subordinate clause expressing a causal relationship. The crucial difference here is that the relative clause depends on the object of the matrix clause whereas the subordinate clause depends on the verb (red arcs in both trees). The head of both dependent clauses is the same verb bellt. An arc-factored model now has to decide the attachment of this word without knowing any of the other arcs in the tree. In particular, it does not know the arcs marked in blue. In the upper sentence, the blue arc attaches the relative pronoun to bellt; in the lower sentence, it attaches the subordinating conjunction. Linguistically, this information makes the attachment decision trivial, but the arc-factored model cannot access it. In short sentences like the examples in Figure 2.6, this information can be approximated by looking at the surrounding words. The relative pronoun in the upper sentence is the immediate left neighbor of bellt. However, German syntax allows for any number of other words to occur between bellt and the relative pronoun/subordinating conjunction, which makes surrounding context a rather unreliable source of information for this purpose.

Figure 2.5 (graph drawings not reproduced; nodes: root, saw, John, Mary; arc weights shown: 9, 10, 9, 20, 3, 30, 30, 11, 0; recomputed scores for arcs entering the contracted cycle: 9+20, 10+30, 11+20, 0+30):

(a) Start from a fully connected graph.

(b) Select greedily the highest-scoring incoming arc for each node.

(c) Contract cycles and recompute scores.

(d) Run Chu-Liu-Edmonds on the contracted graph.

(e) Resolve contractions.

Sie streichelt den Hund , der bellt .
She pets the dog that barks .
Arc labels in the tree: nsubj, dobj, punct, det, acl:relcl, punct, nsubj

Sie streichelt den Hund , weil er bellt .
She pets the dog because it barks .
Arc labels in the tree: nsubj, dobj, advcl, punct, det, punct, mark, nsubj

Figure 2.6: Attachment of dependent clauses to illustrate second-order dependencies.

Figure 2.6 shows that modeling dependencies between arcs opens access to relevant information that is hidden from the arc-factored model. Unfortunately, higher-order models² come with increased complexity. Full non-projective parsing with more than first-order models was proven to be NP-hard (for the proof, see McDonald and Pereira 2006), but Eisner’s algorithm, which derives projective trees only, can be extended to higher-order models while keeping its polynomial complexity. McDonald and Pereira (2006) propose a variant of Eisner’s algorithm that uses factors of consecutive siblings (Figure 2.7a), to which Carreras (2007) adds factors over grandchildren (Figure 2.7b). While the variant with consecutive siblings retains the cubic time complexity, Carreras’s decoder already runs in O(n^4), with n being the length of the input sentence. Koo and Collins (2010) go one step further and introduce third-order features while keeping the runtime at O(n^4). Zhang and McDonald (2012) generalize Eisner’s decoder to any-order models, reporting a time complexity of O(n^{3+2x}), where x is the number of free variables in their parsing rules.

² Models that consider dependencies between two arcs are called second-order models. If they model dependencies between three arcs, they are called third-order models. The arc-factored model is hence called a first-order model.

Figure 2.7: Second-order factors, drawn over the nodes A B C D E (drawings not reproduced): (a) consecutive siblings, (b) grandchildren, (c) arbitrary siblings. Solid lines show one factor.

Factorization restricts the type of features that the feature model has access to. In the arc-factored case, features can only be extracted from single arcs; in higher-order models, sibling or grandchild features can be added. However, structural features can never go beyond the information retained in a factor. This is one of the fundamental differences between graph-based and transition-based parsers, as the latter have access to their entire parse history, can extract structural features of any complexity, and are restricted only by the fact that some structure may not have been built yet.

Eisner’s algorithm allows for efficient higher-order models, but (and because) it can derive projective trees only. In order to derive non-projective trees, different modifications have been proposed. McDonald and Pereira (2006) propose a hill-climbing algorithm for postprocessing that starts from the highest-scoring projective tree and reattaches edges until the overall score does not increase anymore. Pitler (2014) extends Eisner’s algorithm to directly derive 1-Endpoint-Crossing trees,³ a subset of all possible non-projective structures (see also Pitler et al. 2013).

While clever factorization keeps the parsing algorithms tractable, they are still rather slow, especially compared to greedy transition-based parsers. Many graph-based parsers therefore run a pruning step first that uses a simple method for cutting off arcs that are unlikely to be chosen by the parser. For example, a sentence with 100 words requires the parser to consider 99 heads for each word (in the non-projective case). Cutting this number down to 10 heads per word makes the combinatorial problem considerably smaller and leads to faster parsing. Of course, if the pruning step cuts off a correct arc, it cannot be recovered by the parser. A popular method to decide which arcs to keep is to use the marginals of a probabilistic arc-factored model (McDonald and Satta 2007, Smith and Smith 2007, Koo et al. 2007). Rush and Petrov (2012) use cascades of models with increasing complexity so that the simpler models narrow down the search space for the more complex models. Zhang and McDonald (2012) show how any kind of higher-order model can be kept tractable by using cube pruning to restrict the number of arcs that are considered between two words.
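The basic idea of head pruning can be sketched in a few lines of Python. Note the simplification: the sketch ranks candidate heads by a raw arc score, whereas the methods cited above rank them by marginal probabilities of a probabilistic arc-factored model, which would have to be computed first. The function name `prune_heads`, the parameter `k`, and the example weights (taken from Figure 2.5 under an assumed assignment to arcs) are illustrative.

```python
def prune_heads(scores, k):
    """Keep only the k best-scoring candidate heads for every dependent.

    scores maps (head, dependent) to a number. Here this is a raw arc score;
    in the cited work it would be an arc marginal (an assumption of this sketch).
    """
    candidates = {}
    for (head, dep), score in scores.items():
        candidates.setdefault(dep, []).append((score, head))
    return {
        dep: [head for _, head in
              sorted(cands, key=lambda pair: pair[0], reverse=True)[:k]]
        for dep, cands in candidates.items()
    }


# Assumed arc weights for <ROOT John saw Mary>, keyed by (head, dependent).
scores = {
    ("root", "John"): 9, ("root", "saw"): 10, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30,
    ("John", "saw"): 20, ("John", "Mary"): 3,
    ("Mary", "John"): 11, ("Mary", "saw"): 0,
}
kept = prune_heads(scores, k=2)
```

With k=2, every word keeps two of its three candidate heads. A subsequent decoder would only consider arcs (h, d) with h in `kept[d]`, so a correct head that does not survive this step can no longer be recovered.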

Parsing as an Integer Linear Program

Parallel to the effort of extending Eisner’s decoder, other methods of finding the optimal tree were investigated, for example casting dependency parsing as an integer linear program. Integer linear programming is a mathematical tool to describe constrained optimization problems. It consists of an objective function that is being optimized and a set of constraints over the variables in the objective function that need to hold in the optimal solution.

The general idea is the following: each possible arc between the tokens of a given sentence is represented by a binary variable that marks the presence or absence of this arc in the output tree (recall the representation of dependency trees as indicator vectors from above). Each of these binary variables is weighted by an arc score from a statistical model. The objective function of the integer linear program is then to optimize the overall score of the tree by finding the combination of binary variables whose weights sum up to the highest score.

Without any constraints, the solution to this optimization problem is to set all variables to 1 that have a positive arc weight. However, this will most likely not result in a well-formed dependency tree. Therefore, some constraints are added to the integer linear program that only allow solutions which are well-formed dependency trees. The conditions that need to be met are listed in the beginning of this section (root has no head, one head for each token, no cycles). Riedel and Clarke (2006), who proposed the first parser based on integer linear programming, defined constraints for the first two conditions, but had to resort to an iterative method to enforce acyclicity. They first compute a solution, and if the solution contains a cycle, they add a constraint that explicitly excludes this particular cycle. They then run the solving process again, possibly ending up with another cycle, until finally an acyclic solution is found. Obviously, this iterative process makes the approach very inefficient. Martins et al. (2009) find a concise formulation that directly enforces acyclicity without the need for an iterative process (see also Martins et al. 2010b). Their parser ensures cycle-freeness by employing a single-commodity flow formulation (Magnanti and Wolsey 1995). It models a flow from the root to each of the tokens in the sentence along the arcs of the tree. Together with the single-head constraint, only acyclic trees can fulfill this constraint.
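The iterative scheme of adding one cycle-excluding constraint per round can be illustrated with the following toy Python sketch. It is not the method of Riedel and Clarke (2006) as implemented: in place of an integer linear programming solver, the sketch enumerates all head assignments by brute force, which is only feasible for tiny sentences. The one-head-per-token condition is built into the enumeration, and each excluded cycle is stored as a set of arcs that may not all be selected together. All names and the example weights (Figure 2.5, with an assumed assignment to arcs) are illustrative.

```python
from itertools import product


def find_cycle(head):
    """Return the nodes of one cycle in a {dependent: head} map, or None."""
    for start in head:
        path, node = [], start
        while node in head and node not in path:
            path.append(node)
            node = head[node]
        if node in path:
            return path[path.index(node):]
    return None


def solve_with_cycle_cuts(words, scores, root="root"):
    """Re-solve with one more cycle-excluding constraint until acyclic."""
    forbidden = []  # each entry: a set of arcs that must not all be selected
    while True:
        best_arcs, best_score = None, float("-inf")
        # Brute-force stand-in for the solver: every word gets exactly one head.
        for heads in product([root] + words, repeat=len(words)):
            arcs = set(zip(heads, words))
            if any(h == d for h, d in arcs):
                continue  # no self-loops
            if any(cut <= arcs for cut in forbidden):
                continue  # violates an earlier cycle constraint
            score = sum(scores[arc] for arc in arcs)
            if score > best_score:
                best_arcs, best_score = arcs, score
        cycle = find_cycle({d: h for h, d in best_arcs})
        if cycle is None:
            return best_arcs, best_score, len(forbidden)
        # Exclude exactly this cycle and solve again.
        forbidden.append({(h, d) for h, d in best_arcs if d in cycle})


# Assumed arc weights for <ROOT John saw Mary>, keyed by (head, dependent).
scores = {
    ("root", "John"): 9, ("root", "saw"): 10, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30,
    ("John", "saw"): 20, ("John", "Mary"): 3,
    ("Mary", "John"): 11, ("Mary", "saw"): 0,
}
arcs, score, cuts = solve_with_cycle_cuts(["John", "saw", "Mary"], scores)
```

On this input, the first round returns the cyclic solution containing both John → saw and saw → John. After this cycle is excluded, the second round returns a well-formed tree. In the worst case many such rounds are needed, which is the inefficiency that the flow-based formulation avoids.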

Solving the integer linear program with the described constraints outputs a well-formed dependency tree that is optimal with respect to the scoring function. Any general-purpose constraint solver for (integer) linear programs can be used to find the optimal solution. This parser is attractive from a modeling point of view, because any kind of other constraints can be added to the formulation. We make use of this property in this dissertation to model dependencies between morphology and syntax. A formal definition of the parser is therefore deferred to Chapter 6. Higher-order features can be added through the definition of additional variables that are linked to the arc variables via constraints. The higher-order dependencies are scored by the statistical model and their corresponding variables are included in the objective function. Martins et al. (2009) propose several second-order features, e.g., the already mentioned consecutive siblings, grandchildren, or arbitrary siblings (Figure 2.7c).

As with all graph-based parsers, the disadvantage of integer linear programming parsers is their complexity. Dropping the integer constraint, i.e., allowing for the variables to take any real value, creates a linear program, for which solvers exist that run in polynomial time. However, relaxing the problem in this way forfeits the guarantee to get a well-formed dependency tree since some of the variables in the solution may end up with fractional values. Martins et al. (2009) postprocess such fractional solutions by projecting them to the nearest integer solution. This is done by running the Chu-Liu-Edmonds algorithm on the first-order output graph with the fractional assignments as arc weights. However, they find that in the vast majority of cases the projection is unnecessary since the original solution already only contains integers.

Lagrangian Relaxation and Dual Decomposition

Lagrangian relaxation is another method for solving constrained optimization problems that trades the guarantee for an exact solution for more efficient decoding. Rush et al. (2010) introduced this method to perform efficient inference in complex models for natural language processing. The idea of Lagrangian relaxation is to solve a hard constrained optimization problem by moving some or all of the constraints into the objective function and then searching for the solution that maximizes the original function while at the same time violating the constraints as little as possible. The following part is based on Rush and Collins (2012).

Assume that we have a problem where we wish to maximize a linear objective x · θ over a set of variables x, given a set of parameters θ. Additionally, we have a set of constraints on the values of x.

\[
\arg\max_{x} \; x \cdot \theta \quad \text{subject to} \quad Ax = b \tag{2.2}
\]

We assume now that we can solve the unconstrained problem efficiently, but with the constraints it is very difficult to do so. The idea of Lagrangian relaxation is to circumvent the constraints that make the problem difficult to solve by moving them into the objective function together with a set of Lagrangian multipliers (λ).

\[
L(\lambda, x) = x \cdot \theta + \lambda \cdot (Ax - b) \tag{2.3}
\]

The dual objective is obtained by maximizing the Lagrangian over x for fixed multipliers λ

\[
L(\lambda) = \max_{x} L(\lambda, x) \tag{2.4}
\]

and the dual problem is to find the Lagrangian multipliers that minimize the dual objective

\[
\arg\min_{\lambda} L(\lambda) \tag{2.5}
\]

For every x that satisfies the constraints, the term λ · (Ax − b) vanishes, so that L(λ) is an upper bound on the optimal value of the original problem for any λ. By minimizing the dual problem, the upper bound is moved closer to the optimal solution. Rush and Collins (2012) show how to use subgradient descent for the optimization. The solution that is output in the end is not guaranteed to be identical to the optimal solution of the original problem. However, if at any point in the optimization the constraints are not violated, i.e., Ax − b = 0, then the upper bound and the optimal solution coincide and the x at this point is guaranteed to be optimal.
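A minimal sketch of the subgradient method on a toy instance of (2.2) may help to see the mechanics. Several details are assumptions made for this illustration and are not taken from Rush and Collins (2012): x is restricted to binary vectors so that the inner maximization has a closed-form solution, the step size is 1/t in iteration t, and the function name and the example numbers are invented.

```python
def lagrangian_relaxation(theta, A, b, steps=50):
    """Subgradient descent on the dual of: max x.theta subject to Ax = b.

    Assumption of this sketch: x ranges over binary vectors, so maximizing
    L(lam, x) = x.theta + lam.(Ax - b) over x decomposes per coordinate.
    """
    n, m = len(theta), len(b)
    lam = [0.0] * m
    x = [0] * n
    for t in range(1, steps + 1):
        # Inner problem: x_i = 1 iff its coefficient in L(lam, x) is positive.
        reduced = [theta[i] + sum(lam[j] * A[j][i] for j in range(m))
                   for i in range(n)]
        x = [1 if r > 0 else 0 for r in reduced]
        # Ax - b is a subgradient of the dual objective L(lam).
        violation = [sum(A[j][i] * x[i] for i in range(n)) - b[j]
                     for j in range(m)]
        if all(v == 0 for v in violation):
            return x, lam, True   # constraints hold: x is provably optimal
        step = 1.0 / t            # assumed step-size schedule
        lam = [lam[j] - step * violation[j] for j in range(m)]
    return x, lam, False          # no certificate of optimality


# Toy instance: pick exactly one of three items with scores 3, 2, and 1.
x, lam, converged = lagrangian_relaxation(theta=[3, 2, 1], A=[[1, 1, 1]], b=[1])
```

In the first iteration, the unconstrained maximizer selects all three items and violates the constraint by 2. The update lowers λ to −2, after which only the first item has a positive coefficient, the constraint holds, and the method stops with a solution that is guaranteed to be optimal.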

Dual decomposition is a special case of Lagrangian relaxation in which the constrained optimization problem can be decomposed into two or more sub-problems and the constraints connect the sub-problems in some way. As before, it is assumed that the sub-problems can be solved efficiently when the constraints are ignored.

Let the original problem be

\[
\arg\max_{x,z} \; x \cdot \theta_1 + z \cdot \theta_2 \quad \text{subject to} \quad Ax + Bz = c \tag{2.6}
\]

where x and z are the variables for the two sub-problems and θ1 and θ2 are the corresponding sets of parameters. A, B, and c define the constraints over x and z.

The constraints are integrated into the objective function as before

\[
L(\lambda, x, z) = x \cdot \theta_1 + z \cdot \theta_2 + \lambda \cdot (Ax + Bz - c) \tag{2.7}
\]

and the dual objective is obtained by maximizing over x and z.

\[
L(\lambda) = \max_{x,z} L(\lambda, x, z) \tag{2.8}
\]

The dual problem is, as before, to minimize the dual objective over the Lagrangian multipliers.

\[
\arg\min_{\lambda} L(\lambda) \tag{2.9}
\]

Rush et al. (2010) illustrate the use of dual decomposition with two problems. In the first, they show a model for joint phrase-structure parsing and part-of-speech tagging; in the second, they combine a phrase-structure parser with a dependency parser. They derive simple subgradient algorithms to optimize the complex models. In the case of joint phrase-structure parsing and part-of-speech tagging, the complex problem is decomposed into two problems, namely phrase-structure parsing and part-of-speech tagging. Both tasks alone are well-studied and can be solved efficiently with known algorithms. The difficult part of the joint task is to enforce the equality constraints that postulate that the part-of-speech tags assigned by the parser should be the same as the ones assigned by the tagger. They can solve this problem efficiently with the described dual decomposition approach.
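The agreement constraint between two sub-problems can be illustrated with a toy Python sketch. It is a strong simplification and not the algorithm of Rush et al. (2010): both sub-problems are stand-ins that choose a tag for each token independently, whereas the real sub-problems are a phrase-structure parser and a tagger. In terms of (2.6), the sketch corresponds to the agreement constraint x − z = 0, i.e., A = I, B = −I, and c = 0. The function names, the step size 1/t, and all scores are invented for illustration.

```python
def argmax_tags(scores):
    """Stand-in sub-problem solver: best tag for every token, independently."""
    return [max(token, key=token.get) for token in scores]


def dual_decomposition(theta1, theta2, steps=50):
    """Subgradient descent on the multipliers of the constraint x - z = 0."""
    tags = list(theta1[0])
    lam = [{t: 0.0 for t in tags} for _ in theta1]
    for it in range(1, steps + 1):
        # Each sub-problem is solved separately on its adjusted scores.
        x = argmax_tags([{t: th[t] + l[t] for t in tags}
                         for th, l in zip(theta1, lam)])
        z = argmax_tags([{t: th[t] - l[t] for t in tags}
                         for th, l in zip(theta2, lam)])
        if x == z:
            return x, it          # agreement: the joint solution is optimal
        step = 1.0 / it           # assumed step-size schedule
        for i, (xi, zi) in enumerate(zip(x, z)):
            if xi != zi:
                lam[i][xi] -= step  # (x - z) is +1 at the tag chosen by x
                lam[i][zi] += step  # and -1 at the tag chosen by z
    return None, steps            # no agreement within the step budget


# Invented scores for two tokens; theta1 plays the parser, theta2 the tagger.
theta1 = [{"N": 2.0, "V": 1.0}, {"N": 1.0, "V": 1.5}]
theta2 = [{"N": 1.0, "V": 0.0}, {"N": 2.0, "V": 0.5}]
solution, iterations = dual_decomposition(theta1, theta2)
```

The two models initially disagree on the second token. The multipliers penalize each model's own choice and reward the other's until both select the same tag, which is also the tag with the highest combined score.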

Rush et al. (2010) show that their algorithms solve a linear programming relaxation of the