7.1.1 A Graph-based Lattice Dependency Parser
In this section, we present the construction of the lattice parser. We cast lattice parsing as a constrained optimization problem: we seek the highest-scoring dependency tree under the constraint that the tokens it spans form a consecutive path through the lattice. Since solving this constrained problem directly is difficult, we use a dual decomposition approach (see Chapter 2, Section 2.2 for a description of the underlying idea of dual decomposition), in which we decompose the task into two subtasks. The first subtask finds a consecutive path through the sentence lattice. The second subtask computes a spanning tree over the lattice. To these two subtasks, we add two constraints that ensure that the path found by the first subtask coincides with the tokens that the second subtask predicts to be part of the final solution. A dual decomposition algorithm then finds the optimal solution to this problem by solving the subproblems repeatedly until all constraints are fulfilled.
As stated before, we refer to the minimal unit of parsing as a token. In the input lattices for the parser, a token corresponds to a single transition between two states. For convenience, we define the set of tokens $T$ to hold all tokens represented in a (sentence) lattice. Tokens represent IGs in the Turkish treebank and morphemes in the Hebrew treebank.
We assume two different structures, lattices and dependency trees. Dependency trees are represented as directed acyclic trees with a special root node (ROOT); lattices are directed acyclic graphs as defined above. For dependency trees, we use the terms node and arc to refer to the vertices and the edges between the vertices, respectively. Tokens are represented as nodes in the dependency tree. For lattices, we use the terms state and transition to refer to the vertices and their edges in the lattice. In contrast to dependency trees, tokens are represented as transitions in the lattice.
7.1 Lattice Parsing 117
Finding The Path. A token bigram in a lattice $M(x)$ is a pair of two transitions $\langle t, t' \rangle$ such that the target state of $t$ in $M(x)$ coincides with the source state of $t'$ in $M(x)$. A chain of overlapping bigrams that starts from the initial state and ends in the final state forms a path through the lattice. We represent the ROOT token as the first transition, i.e., a single transition that leaves the initial state of the lattice.
Given a lattice $M(x)$, we define the index set of token bigrams in the lattice to be

$$S := \{\, \langle t, t' \rangle \mid t, t' \in T,\ \mathrm{target}(t) = \mathrm{source}(t') \,\} \tag{7.1}$$

For later use, we furthermore define the set of bigrams that have $t$ in second position:

$$S|_t := \{\, \langle k, t \rangle \mid \langle k, t \rangle \in S,\ k \in T \,\} \tag{7.2}$$
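The two index sets in Equations (7.1) and (7.2) can be illustrated with a minimal sketch. The `Token` type, the function names, and the toy lattice below are hypothetical and chosen only for illustration; the lattice is not the one shown in Figure 7.4.

```python
from collections import namedtuple

# A token is one lattice transition: a label plus its source and target state.
Token = namedtuple("Token", ["label", "source", "target"])

# Hypothetical toy lattice (not the one in Figure 7.4).
# State 0 is the initial state, state 4 the final state.
TOKENS = [
    Token("ROOT", 0, 1),
    Token("A", 1, 2),
    Token("B", 1, 3),
    Token("C", 2, 4),
    Token("E", 2, 3),
    Token("D", 3, 4),
]


def bigrams(tokens):
    """Equation (7.1): all pairs (t, t2) with target(t) == source(t2)."""
    return [(t, t2) for t in tokens for t2 in tokens if t.target == t2.source]


def bigrams_ending_in(all_bigrams, t):
    """Equation (7.2): the bigrams that have t in second position."""
    return [(k, t2) for (k, t2) in all_bigrams if t2 == t]


S = bigrams(TOKENS)
S_D = bigrams_ending_in(S, TOKENS[5])  # bigrams whose second token is D

print([(a.label, b.label) for a, b in S])
print([(a.label, b.label) for a, b in S_D])
```

In this toy lattice, D can be reached from either B or E, so its restricted bigram set contains two entries, while ROOT has none because no transition ends in the initial state.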
A consecutive path through the lattice is defined as an indicator vector

$$p := \langle p_s \rangle_{s \in S} \tag{7.3}$$

where $p_s = 1$ means that bigram $s$ is part of the path, otherwise $p_s = 0$. We define $P$ as the set of all well-formed paths, i.e., all paths that lead from the initial to the final state.
We use a linear model that factors over token bigrams. Given a scoring function $f_P$ that assigns scores to paths, the path with the highest score can be found by

$$\hat{p} = \arg\max_{p \in P} f_P(p) \quad \text{with} \quad f_P(p) = \sum_{s \in S} p_s \, w \cdot \phi_{\mathrm{SEG}}(s) \tag{7.4}$$

where $\phi_{\mathrm{SEG}}$ is the feature extraction function for token bigrams. Given a weight vector, the highest-scoring path through the lattice can be found with the Viterbi algorithm. In Section 7.3, we also use the bigram model as a standalone disambiguator for morphological lattices to find the highest-scoring path in a lattice.
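The search in Equation (7.4) can be sketched as a Viterbi pass over tokens. This is a minimal illustration under stated assumptions, not the parser's actual implementation: the `Token` type, `viterbi_path`, the toy lattice, and the `SCORES` table are hypothetical, and each table entry stands in for the model score of one bigram (the dot product of the weight vector and the bigram features). The sketch further assumes that state numbers are topologically ordered.

```python
from collections import namedtuple

Token = namedtuple("Token", ["label", "source", "target"])


def viterbi_path(tokens, score, initial, final):
    """Return (best_score, best_path) for a model that factors over bigrams.

    Assumes source < target for every token (topologically ordered states)
    and that ROOT is the only transition leaving the initial state.
    """
    best = {}  # token -> score of the best partial path ending in that token
    back = {}  # token -> predecessor token on that best partial path
    for t in sorted(tokens, key=lambda tok: tok.source):
        if t.source == initial:
            best[t], back[t] = 0.0, None
            continue
        # Candidate predecessors are the first elements of the bigrams ending in t.
        cands = [k for k in tokens if k.target == t.source and k in best]
        if not cands:
            continue  # t cannot be reached from the initial state
        k = max(cands, key=lambda c: best[c] + score(c, t))
        best[t], back[t] = best[k] + score(k, t), k
    # Pick the best token that ends in the final state, then follow backpointers.
    last = max((t for t in best if t.target == final), key=lambda t: best[t])
    path, t = [], last
    while t is not None:
        path.append(t)
        t = back[t]
    return best[last], path[::-1]


# Hypothetical toy lattice and bigram scores.
TOKENS = [
    Token("ROOT", 0, 1), Token("A", 1, 2), Token("B", 1, 3),
    Token("C", 2, 4), Token("E", 2, 3), Token("D", 3, 4),
]
SCORES = {
    ("ROOT", "A"): 1.0, ("ROOT", "B"): 2.0, ("A", "C"): 0.5,
    ("A", "E"): 2.0, ("E", "D"): 1.5, ("B", "D"): 1.0,
}

best_score, best_path = viterbi_path(
    TOKENS, lambda k, t: SCORES[(k.label, t.label)], initial=0, final=4
)
print(best_score, [t.label for t in best_path])
```

Because the model factors over bigrams, the dynamic program keeps one entry per token rather than one per state: the score of extending a path depends on which token was taken last.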
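118 7 Graph-based Lattice Dependency Parsing
Finding The Tree. We define the index set of arcs in a dependency tree as

$$A := \{\, \langle h, d, l \rangle \mid h \in T,\ d \in T \setminus \{\mathrm{ROOT}\},\ l \in L,\ h \neq d \,\} \tag{7.5}$$

with $L$ being a set of dependency relations. A dependency tree is defined as an indicator vector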
$$y := \langle y_a \rangle_{a \in A} \tag{7.6}$$

where $y_a = 1$ means that arc $a$ is in the parse, otherwise $y_a = 0$. We define $Y$ to be the set of all well-formed dependency trees (projective and non-projective).
We assume an arc-factored model as commonly done in dependency parsing (McDonald et al. 2005, Koo et al. 2010). Given a scoring function $f_T$ that assigns scores to trees, the problem of finding the highest-scoring tree is defined as

$$\hat{y} = \arg\max_{y \in Y} f_T(y) \quad \text{with} \quad f_T(y) = \sum_{a \in A} y_a \, w \cdot \phi_{\mathrm{ARC}}(a) \tag{7.7}$$

where $\phi_{\mathrm{ARC}}$ is the feature extraction function for single arcs and $w$ is the weight vector. We follow Koo et al. (2010) and use the Chu-Liu-Edmonds algorithm (CLE) to find the highest-scoring parse (Chu and Liu 1965, Edmonds 1967). CLE enforces the tree properties of the output, i.e., acyclicity and exactly one head per token. Note that the algorithm includes all tokens of the lattice in the spanning tree, not just the tokens on a single path.
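The arc-factored objective in Equation (7.7) and the tree properties that CLE enforces can be made concrete with a small sketch. Note that the code below does not implement CLE: it enumerates all head assignments by brute force, which is exponential and only feasible for toy inputs. All names, labels, and scores are hypothetical.

```python
from itertools import product


def tree_score(arcs, arc_score):
    """Arc-factored score: the sum of the scores of the active arcs."""
    return sum(arc_score[a] for a in arcs)


def is_tree(arcs, tokens, root="ROOT"):
    """Check exactly one head per non-root token and acyclicity."""
    heads = {}
    for head, dep, _label in arcs:
        if dep == root or dep in heads:
            return False  # ROOT has no head; no token has two heads
        heads[dep] = head
    if set(heads) != set(tokens) - {root}:
        return False  # every non-root token needs a head
    for start in heads:
        seen, tok = set(), start
        while tok != root:  # follow head links up to ROOT
            if tok in seen or tok not in heads:
                return False  # a cycle, or a head that is not a token
            seen.add(tok)
            tok = heads[tok]
    return True


def best_tree(tokens, arc_score, root="ROOT"):
    """Brute-force stand-in for CLE: argmax over all well-formed trees."""
    deps = [t for t in tokens if t != root]
    options = [[a for a in arc_score if a[1] == d] for d in deps]
    trees = (c for c in product(*options) if is_tree(c, tokens, root))
    return max(trees, key=lambda c: tree_score(c, arc_score))


# Hypothetical tokens and arc scores, keyed by (head, dependent, label).
TOKENS = ["ROOT", "A", "B"]
ARC_SCORE = {
    ("ROOT", "A", "pred"): 1.0,
    ("ROOT", "B", "pred"): 0.5,
    ("A", "B", "obj"): 2.0,
    ("B", "A", "subj"): 3.0,
}

best = best_tree(TOKENS, ARC_SCORE)
print(best, tree_score(best, ARC_SCORE))
```

In this example, picking the best head for each token independently would select both the arc from B to A and the arc from A to B, which forms a cycle. The tree requirement excludes that combination, which is the role CLE plays in the parser.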
Agreement Constraints. To make the path and the parse tree agree with each other, we introduce an additional dependency relation NOREL into $L$, the set of dependency relations. We define a token that is attached to ROOT with relation NOREL to be not on the path through the lattice. These arcs are not scored by the statistical model; they simply serve as a means for CLE to mark tokens as not being part of the solution by attaching them to ROOT with this relation.³
[Figure 7.4, image not reproduced: a lattice with states 0 to 5 and tokens R, A, B, C, D, E, F, G. In panel (b), the label NOREL appears four times.]
(a) The bigram model partitions the tokens into two sets: the tokens on the path, {R, A, F, G}, and the tokens not on the path, {B, C, D, E}.
(b) The tree model also partitions the tokens into two sets: the tokens in the output tree, {R, A, F, G}, and the tokens that are not part of the output tree, {B, C, D, E}. Due to the first agreement constraint, there can be internal structure among the tokens in the first set only.
Figure 7.4: The bigram model and the tree model both partition the set of tokens into two sets. The second agreement constraint ensures that the two partitionings coincide.

When the bigram model finds a path through the input lattice, it effectively partitions the set of tokens $T$ into two sets, namely the tokens that are on the path and the tokens that are not (see Figure 7.4a). By means of the NOREL label, the arc-factored model is able to mark a token as being part of the solution tree or not. The tokens that are part of the solution should form a dependency tree, whereas the other ones should simply be dependents of ROOT. So far, however, a tree token can be a dependent of a non-tree token. To prevent this, we introduce a constraint that forbids tokens attached to ROOT with NOREL from having dependents of their own. The constraint is implemented as an implication factor ($\implies$; Martins et al. 2015). It states that an active NOREL arc for a token $h$ implies an inactive arc for all arcs having $h$ as head. There is one such constraint for each possible arc in the parse.

$$y_{\langle \mathrm{ROOT}, h, \mathrm{NOREL} \rangle} \implies \neg\, y_{\langle h, d, l \rangle} \quad \text{for all } \langle h, d, l \rangle \in A,\ h \neq \mathrm{ROOT},\ l \neq \mathrm{NOREL} \tag{7.8}$$
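What Equation (7.8) rules out can be shown with a small checker. This is only an illustrative sketch: in the parser the constraint is a factor inside the dual decomposition, not a post-hoc check, and the function name, labels, and example arcs are hypothetical.

```python
def implication_violations(arcs, root="ROOT", norel="NOREL"):
    """Return the active arcs <h, d, l> that break Equation (7.8).

    An arc is a violation if its head h is itself attached to ROOT with
    NOREL, i.e., a token marked as off the path has a dependent.
    """
    off_path = {d for (h, d, l) in arcs if h == root and l == norel}
    return [
        (h, d, l) for (h, d, l) in arcs
        if h in off_path and h != root and l != norel
    ]


# B is marked as not part of the solution but still governs C.
mixed = [("ROOT", "A", "pred"), ("ROOT", "B", "NOREL"), ("B", "C", "obj")]
# Same tokens, but C is now attached to a tree token.
clean = [("ROOT", "A", "pred"), ("ROOT", "B", "NOREL"), ("A", "C", "obj")]

print(implication_violations(mixed))
print(implication_violations(clean))
```

The first arc set mixes tree tokens and non-tree tokens and is therefore reported; the second keeps all structure among the tree tokens.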
With the constraint in Equation (7.8), CLE cannot mix tree tokens and non-tree tokens and thus cleanly partitions $T$ into two sets, namely tree tokens and non-tree tokens (see Figure 7.4b). Simultaneously, it predicts a dependency tree over the tree tokens. The final step is now to ensure that the partitioning of $T$ by selecting a path through the lattice is identical to the partitioning of $T$ by computing the spanning tree. This can be achieved with a second constraint, which is defined over token bigrams and arcs. It states that for a token $t$, either one of its bigrams⁴ or its NOREL arc must be active. It is implemented as an XOR factor ($\oplus$; Martins et al. 2011b) and there is one such constraint for each token in the lattice.

$$\bigoplus_{s \in S|_t} p_s \;\oplus\; y_{\langle \mathrm{ROOT}, t, \mathrm{NOREL} \rangle} \quad \text{for all } t \in T \tag{7.9}$$
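The agreement that Equation (7.9) demands can likewise be sketched as a check: for each token, exactly one of its incoming bigrams or its NOREL arc is active. The function name and the example data are hypothetical. The sketch skips ROOT, which has no incoming bigram; how the source treats ROOT is not stated in this section, so that exemption is an assumption.

```python
def xor_violations(tokens, active_bigrams, arcs, root="ROOT", norel="NOREL"):
    """Return the tokens for which Equation (7.9) does not hold."""
    bad = []
    for t in tokens:
        if t == root:
            continue  # assumption: ROOT is exempt from the constraint
        # Number of active bigrams that have t in second position.
        incoming = sum(1 for (_k, t2) in active_bigrams if t2 == t)
        # 1 if the tree model marks t as off the path, else 0.
        off_path = int((root, t, norel) in arcs)
        if incoming + off_path != 1:
            bad.append(t)
    return bad


TOKENS = ["ROOT", "A", "B", "C", "E", "D"]
# Path ROOT, A, E, D expressed as active bigrams.
PATH = [("ROOT", "A"), ("A", "E"), ("E", "D")]

# Tree that agrees with the path: B and C are attached with NOREL.
agreeing = {
    ("ROOT", "A", "pred"), ("A", "E", "obj"), ("E", "D", "mod"),
    ("ROOT", "B", "NOREL"), ("ROOT", "C", "NOREL"),
}
# Tree that disagrees: E is marked NOREL although it is on the path,
# and B is in the tree although it is not on the path.
disagreeing = {
    ("ROOT", "A", "pred"), ("ROOT", "E", "NOREL"), ("A", "D", "mod"),
    ("A", "B", "obj"), ("ROOT", "C", "NOREL"),
}

print(xor_violations(TOKENS, PATH, agreeing))
print(xor_violations(TOKENS, PATH, disagreeing))
```

Both kinds of disagreement are caught: a token selected by both models' "off" and "on" signals at once (E), and a token selected by neither (B).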
By means of the constraint in Equation (7.9), both subtasks have to agree on the path through the lattice. The Viterbi algorithm ensures that the solution will be a coherent path while the CLE predicts a dependency tree over this path. All tokens that are not on the path are discarded before the parser returns the parse tree.