
We now discuss how weighted tree transducers and weighted tree-to-string transducers are used as probabilistic models.

4.6.1 Representation

As in the case of probabilistic FST models, tree and tree-to-string transducers can represent joint or conditional probability distributions. Suppose t is an input tree, s is an output tree or string, and r is a tree transducer rule of the form q(σ(t1, . . . , tk)) →

used, and root(r) = q otherwise. Suppose that a tree transducer represents a joint distribution P (t, s). Then the weight of r is

\[ \pi(r) = P(r \mid \mathrm{root}(r)). \tag{4.6.1} \]

In the case of a conditional distribution P(s|t), the weight of r is

\[ \pi(r) = P(r \mid \mathrm{LHS}(r)), \tag{4.6.2} \]

where LHS(r) denotes the entire left hand side of r.
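The difference between (4.6.1) and (4.6.2) lies in the set of rules over which the weights must sum to one. The following minimal sketch illustrates this with a hypothetical toy rule set; the rule strings, the tuple layout and the function name `group_totals` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical toy rules: (root state, left-hand side, right-hand side, weight).
RULES = [
    ("q", "q(A(x1, x2))", "B(q.x1, q.x2)", 0.5),
    ("q", "q(A(x1, x2))", "B(q.x2, q.x1)", 0.25),
    ("q", "q(a)", "b", 0.25),
]


def group_totals(rules, conditional):
    """Sum the rule weights within each normalization group.

    Joint model (4.6.1): rules are grouped by their root state.
    Conditional model (4.6.2): rules are grouped by their full left-hand side.
    """
    totals = defaultdict(float)
    for root, lhs, _rhs, weight in rules:
        key = lhs if conditional else root
        totals[key] += weight
    return dict(totals)


joint_totals = group_totals(RULES, conditional=False)
conditional_totals = group_totals(RULES, conditional=True)
```

In this toy example the weights sum to one per root state, so they are consistent with a joint model, but they do not sum to one per left-hand side, so they would have to be renormalized before being read as a conditional model.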

4.6.2 Inference

Inference with tree transducers is performed with application, as in the case of FSTs. However, tree transducer application is usually performed with custom algorithms for specific transducer classes. The generic approach used for FSTs cannot be used for most tree transducer classes, due to a lack of closure under composition. Application can only be performed with a limited number of transducer classes. The class of LNTs is closed under composition, but the class of xLNTs is not (Maletti et al., 2009). Almost none of the classes that perform copying or deletion are closed under composition.

A RTG G = (Σ, N, P, π, S) over a semiring W is embedded in a LNT E if E assigns the weight wt_G(t) to the tree pair (t, t) for every tree t ∈ T_Σ, and 0 to all other tree pairs. May (2010, chap. 4) gives a custom forward application algorithm for applying a RTG to a xLNT. Though the class of xLNTs is not closed under composition, embedded RTGs form a more restricted class, and the application RTG can be computed. For backward application of a RTG to a xLNT, the embed-compose-project approach can be used. If M is a xLNT and E is an embedded LNT, then the composition C = M ∘ E can be computed. C is then right projected to a RTG A. A modified bucket brigade approach for application to tree transducer cascades has also been proposed (May, 2010, chap. 4).

Tree-to-string transducers are typically applied with backward application of a string s to a xLNTS M. Transforming a string to a tree can be seen as a parsing problem, and therefore parsing algorithms are used to perform this application. An algorithm based on Earley parsing is proposed by May (2010, chap. 4). A CKY-based algorithm for this application has also been used (Galley et al., 2006).

When we work with tree transducers in NLP applications, exact inference is usually intractable, since the search space is too large to perform an exact search given time and memory constraints. Therefore the search space is usually pruned heuristically. We discuss this more concretely in Section 6.3.2.

4.6.3 Training

We now discuss the training problem for probabilistic extended tree and tree-to-string transducers, given a set of training pairs. We assume here that the transducer rules are given, and that the goal is to estimate the rule weights. We discuss methods to extract rules from the training data in Section 6.1.

Suppose we have a xLNT M = (Σ, ∆, Q, R, π, Qd) with Qd = {S} over the


t is an input tree and s is an output tree. Let us first consider the case where only one derivation from each training pair is considered as training data. Let f(r) denote the number of occurrences of rule r in the training derivations.

For a joint distribution P(t, s) the MLE of the rule weight associated with r is

\[ \pi(r) = P(r \mid \mathrm{root}(r)) = \frac{f(r)}{\sum_{r' : \mathrm{root}(r') = \mathrm{root}(r)} f(r')}. \tag{4.6.3} \]
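The relative frequency estimate in (4.6.3) can be sketched as follows. This is an illustration under simplifying assumptions: each training derivation is represented as a flat list of rule identifiers, and the mapping `root_of` from rules to root states, like the rule names themselves, is hypothetical.

```python
from collections import Counter, defaultdict


def mle_joint_weights(derivations, root_of):
    """Relative frequency estimate of rule weights, normalized per root state."""
    # f(r): number of occurrences of each rule in the training derivations.
    counts = Counter(rule for derivation in derivations for rule in derivation)
    # Denominator of (4.6.3): total count of all rules sharing a root state.
    totals = defaultdict(int)
    for rule, count in counts.items():
        totals[root_of[rule]] += count
    return {rule: count / totals[root_of[rule]] for rule, count in counts.items()}


# Hypothetical data: three derivations over rules r1, r2 (root q) and r3 (root p).
root_of = {"r1": "q", "r2": "q", "r3": "p"}
derivations = [["r1", "r3", "r3"], ["r2", "r3"], ["r1"]]
weights = mle_joint_weights(derivations, root_of)
```

Here r1 and r2 compete for the probability mass of state q and receive weights 2/3 and 1/3, while r3 is the only rule with root p and receives weight 1.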

Let us now consider the general case, where the EM algorithm is used to perform training. Every derivation of M can be represented with a derivation tree over R (the tree nodes are labeled with rules). For a tree pair (t_i, s_i), the set of derivation trees for M, and the associated derivation weights, can be generated by a derivation RTG G_i. Each production in G_i has the form q → r(q_1, . . . , q_k), where symbol r denotes a rule r ∈ R with k variables, and each state label q, q_1, . . . , q_k has the form p × pos(t_i) × pos(s_i), where p ∈ Q. The start state S of G_i corresponds to the start state of M.

An algorithm to construct a derivation grammar G_i is given by Graehl et al. (2008). In this algorithm, in the worst case each of the rules in R has to be considered for each of the tuples (q, p_t, p_s) ∈ Q × pos(t_i) × pos(s_i). The time and space complexity are then both O(|Q| · |t_i| · |s_i| · |R|), or O(Kn²), where n is the total size of the input and output trees, and K is the grammar constant, representing the size of the rules and the states in M.

If the transducer has chain rules, the corresponding derivation RTG may have cycles, leading to an infinite number of derivation trees. To avoid this additional complexity, chain rules are removed before the derivation grammars are constructed.

Similarly, derivation trees can also be constructed when M is a xLNTS and {(t_1, s_1), . . . , (t_N, s_N)} is a set of tree-string training pairs. However, computing xLNTS derivations is more complex than computing xLNT derivations. Input tree nodes are matched with arbitrary output string spans, instead of output subtrees. Suppose that no transducer rules have more than two variables. For an output string of length m there are O(m²) spans, and each binary production over a span has O(m) ways to divide the span in two. These spans and span divisions should be considered for each of the n input tree nodes, the transducer rules and the different states. Let K again be the grammar constant. Then the time and space complexity of constructing a xLNTS derivation grammar is O(Knm³).
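The span and split counts behind the m³ factor can be checked with a short enumeration. The sketch below counts only non-empty spans and splits into two non-empty parts, which is a simplifying assumption; the function name is illustrative.

```python
def count_spans_and_splits(m):
    """Count non-empty spans of a length-m string and their binary split points."""
    spans = 0
    splits = 0
    for start in range(m):
        for end in range(start + 1, m + 1):
            spans += 1
            # A span of length L can be divided into two non-empty parts in L - 1 ways.
            splits += end - start - 1
    return spans, splits


spans_10, splits_10 = count_spans_and_splits(10)
```

For m = 10 this gives 55 spans and 165 (span, split) combinations, matching the closed forms m(m + 1)/2 and (m − 1)m(m + 1)/6, which grow quadratically and cubically in m respectively.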

An instance of the EM algorithm has been defined to train tree transducers (Graehl et al., 2008). The model parameters θ are represented by the rule weight function π : R → W. For each pair in the training data D a derivation grammar is constructed only once. For each iteration j of the EM algorithm the weights of the derivation grammars are updated to the current parameter values θ_j. A derivation grammar production with right hand side tree root labeled r is assigned weight π(r). The expected complete data log likelihood is

\begin{align}
Q(\theta, \theta_{j-1}) &= E[\log P(D \mid \theta) \mid \theta_{j-1}] \tag{4.6.4} \\
&= \sum_{i=1}^{N} \sum_{r \in R} f_{G_i}(r) \log \pi(r) \tag{4.6.5}
\end{align}

where f_g(r) is the expected number of times that rule r is used in derivation trees generated by derivation RTG g, when g is parameterized by θ_{j−1}. To estimate f_g(r), the number of times that r is used in each derivation tree is weighted by the derivation tree weight, and the sum of these weights is normalized over the weights of all the derivation trees. Therefore, we have

\[ f_g(r) = \frac{\sum_{d \in T_R} n_d(r)\, \mathrm{wt}_g(d)}{\sum_{d \in T_R} \mathrm{wt}_g(d)}, \tag{4.6.7} \]

where n_d(r) is the number of times that r occurs in derivation tree d. We can compute this efficiently with the inside-outside weights of g:

\[ f_g(r) = \frac{\sum_{p : a \to u,\ \mathrm{root}(u) = r} \gamma_g(p)}{\beta_g(S)}. \tag{4.6.8} \]
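A minimal sketch of the computation in (4.6.8) is given below for an acyclic derivation RTG over the probability semiring. It assumes that γ_g(p) is the outside weight of the left-hand state of p, times the weight of p, times the inside weights of its child states, and that β_g(S) is the inside weight of the start state. The data layout, the function name and the toy grammar are hypothetical and are not taken from Tiburon or from Graehl et al. (2008).

```python
from collections import defaultdict
from math import prod


def expected_rule_counts(productions, pi, start):
    """Expected rule counts in an acyclic derivation RTG, following (4.6.8).

    productions: list of (state, rule, child_states) for q -> r(q1, ..., qk).
    pi: dict mapping each rule label to its current weight.
    """
    by_lhs = defaultdict(list)
    for lhs, rule, children in productions:
        by_lhs[lhs].append((rule, children))

    # Inside weights beta, computed bottom-up by memoized recursion.
    beta = {}

    def inside(state):
        if state not in beta:
            beta[state] = sum(
                pi[rule] * prod(inside(child) for child in children)
                for rule, children in by_lhs[state]
            )
        return beta[state]

    inside(start)

    # States enter beta children-first, so the reversed order visits parents first.
    alpha = defaultdict(float)  # outside weights
    alpha[start] = 1.0
    gamma = defaultdict(float)  # inside-outside weights, summed per rule label
    for state in reversed(list(beta)):
        for rule, children in by_lhs[state]:
            gamma[rule] += alpha[state] * pi[rule] * prod(beta[c] for c in children)
            for i, child in enumerate(children):
                siblings = prod(beta[c] for j, c in enumerate(children) if j != i)
                alpha[child] += alpha[state] * pi[rule] * siblings
    return {rule: weight / beta[start] for rule, weight in gamma.items()}


# Hypothetical toy derivation grammar with six derivation trees.
PRODUCTIONS = [
    ("S", "r1", ("A", "A")),
    ("S", "r2", ("A",)),
    ("A", "r3", ()),
    ("A", "r4", ()),
]
PI = {"r1": 0.5, "r2": 0.25, "r3": 0.5, "r4": 0.25}
counts = expected_rule_counts(PRODUCTIONS, PI, "S")
```

For this toy grammar the expected count of r1 is approximately 0.6, which agrees with enumerating the six derivation trees and applying (4.6.7) directly. The sketch relies on the grammar being acyclic, which holds once chain rules have been removed.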

In the E-step of the EM algorithm, the expected fractional counts are computed for each derivation grammar. The expected count of each rule r is then obtained by summing over all the training examples:

\[ f(r) = \sum_{i=1}^{N} f_{G_i}(r). \tag{4.6.9} \]

In the M-step the rule weights are re-estimated. For joint probability distributions, the MLE in equation (4.6.3) is used, with the expected counts f(r) in place of observed counts.

Tiburon (May and Knight, 2006) implements tree transducer algorithms, including EM training. We use Tiburon in performing experiments with our tree transducer models.

Recently, Bayesian methods for tree transducer training have also been developed (Jones et al., 2012). Another alternative is a training method based on large margin training for structural SVMs (Cohn and Lapata, 2007).

4.7 Conclusion

In this chapter methods and applications of probabilistic models were presented. Section 4.1 introduced probabilistic modelling. Language models were presented in Section 4.2, and parsing in Section 4.3. We discussed probabilistic modelling with FSTs in Section 4.4. Section 4.5 discussed probabilistic RTGs, and specifically the computation of inside-outside weights. Finally, probabilistic modelling with tree transducers, including the EM training algorithm, was discussed in Section 4.6.

Chapter 5

Experimental Setup

In this chapter we present the experimental setup used in developing and testing our probabilistic tree transducer models for grammatical error correction. We discuss the various steps in preprocessing and parsing the learner corpora training data. We make use of standard, publicly available NLP tools to perform many of the processing steps. The processing pipeline and the steps that we perform ourselves are implemented in Python. We briefly discuss additional training resources, as well as the n-gram language model used. Finally we present a baseline FST model which includes some of the model components used in the transducer models in Chapter 6.

5.1 Training and testing data

The two corpora used to train and test our models are NUCLE and FCE, described in Section 2.2.1. Details of the formats of these two corpora are given in Appendix B. The preprocessing steps we describe in the following sections are implemented separately for the data formats of the two corpora, though the same steps are followed.

Both corpora have separate training and test sets. We divide the training set of each corpus into 80% training data, 10% validation data and 10% development data. Splitting is performed by random selection at essay level. The training sets are used to train the transducer error correction models, while hyperparameters such as the weight of the language model are tuned on the validation sets to directly optimize system performance on the evaluation metric. The development set is used to compare the performance of different modelling choices, while the test sets are used to perform the final evaluation of our models.
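An essay-level 80/10/10 split of this kind can be sketched as follows. This is an illustration only and not the code used in our pipeline; the function name, the fixed seed and the use of integer essay identifiers are assumptions.

```python
import random


def split_essays(essay_ids, seed=0):
    """Randomly split essay identifiers into 80% / 10% / 10% subsets."""
    ids = sorted(essay_ids)  # fixed starting order, so the split is reproducible
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_valid = len(ids) // 10
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_valid],
        "development": ids[n_train + n_valid:],
    }


# Hypothetical corpus of 50 essays identified by integers.
splits = split_essays(range(50))
```

Because whole essays are assigned to a subset, sentences from the same essay never appear in more than one of the training, validation and development sets.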

For the NUCLE test data there are two sets of annotations. The first is the original version annotated by the official annotator. The second is a revised version, released after the CoNLL-2013 shared task. After the initial results for the shared task were released, participating teams were given the opportunity to suggest alternative correction annotations, based on the output of their systems. These alternative answers were then judged by the official annotator, and the revised version, which allows multiple possible corrections, was released. The idea is that the evaluation scores on the revised annotations are more accurate, as some of the corrections suggested by a system may be correct even though the annotator did not originally suggest them.