How do Transition-Based Dependency Parsers work?


6.1.1 How do Transition-Based Dependency Parsers work?

In the last decade, data-driven dependency parsing has come to the fore, with two main approaches dominating: graph-based and transition-based. Graph-based dependency parsing involves the construction of a parse tree by predicting the Maximum Spanning Tree (MST) in the digraph for the input sentence (McDonald et al., 2005). In this digraph, each word corresponds to a vertex, and these vertices are all connected by directed edges (arcs). Based on frequency counts in the training data (treebank), each arc in the graph is assigned a score at the learning or training stage. Making a common assumption of arc factorisation, the score of the graph is the sum of all the arc scores (weights). The challenge for the parser is to find the highest-scoring tree, the MST, when choosing from amongst the proposed candidates; that is, a subgraph that includes all vertices and only the minimum number of arcs needed for them to be connected (see Section 2.3.1). MSTParser (McDonald et al., 2005) is a graph-based parser. Mate Parser (Bohnet, 2010) also has a graph-based component.

On the other hand, in classic transition-based dependency parsing, the training phase consists of learning the correct parser action to take, given the input string and the parse history, and the parsing phase consists of the greedy application of parser actions as dictated by the learned model. These actions are based on those of a shift-reduce parser, which involves progressing through the input string and moving tokens onto a stack from a buffer (shift) or removing them once they have been fully processed (reduce). Depending on the type of algorithm used, there are other actions involved, which we describe in more detail in Figure 6.1. MaltParser (Nivre et al., 2006) is an example of a transition-based parser, and we describe how it works here in more detail than a graph-based parser, as it is the main parser we used in most of our parsing experiments in Section 6.
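The arc-factored scoring used in graph-based parsing can be illustrated with a minimal sketch. The arc scores below are invented for illustration (they are not taken from any treebank), the function names are hypothetical, and the exhaustive search is only a stand-in for the efficient maximum spanning tree algorithms that real graph-based parsers use.

```python
from itertools import product

def tree_score(heads, arc_scores):
    """Arc-factored score: the sum of the scores of the tree's arcs.

    heads[d] is the head index of token d; index 0 is ROOT and heads[0] is unused.
    """
    return sum(arc_scores[(heads[d], d)] for d in range(1, len(heads)))

def is_tree(heads):
    """True if every token reaches ROOT (index 0) without running into a cycle."""
    for d in range(1, len(heads)):
        seen, node = set(), d
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

def best_tree(n, arc_scores):
    """Enumerate all head assignments for n tokens and keep the best tree (toy sizes only)."""
    candidates = []
    for assignment in product(range(n + 1), repeat=n):
        heads = [None] + list(assignment)
        no_self_loops = all(heads[d] != d for d in range(1, n + 1))
        if no_self_loops and is_tree(heads):
            candidates.append(heads)
    return max(candidates, key=lambda h: tree_score(h, arc_scores))

# Tokens: 1 = Cheannaigh, 2 = an, 3 = cailín. All scores are made up.
toy_scores = {(h, d): 1.0 for h in range(4) for d in range(1, 4) if h != d}
toy_scores.update({(0, 1): 10.0, (3, 2): 8.0, (1, 3): 9.0})

best = best_tree(3, toy_scores)
print(best, tree_score(best, toy_scores))
```

With these invented scores, the highest-scoring tree attaches Cheannaigh to ROOT, an to cailín and cailín to Cheannaigh.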

In transition-based parsing, the parser moves from left to right through a sentence, making decisions as to which words will make up dependency pairs with the help of a classifier. The transition-based parsing algorithms use a buffer containing the sentence tokens in linear order, a stack onto which each token is pushed as part of the processing step, and an arc list that contains the proposed head-modifier relations (Nivre, 2003; Nivre and Nilsson, 2005; Kübler et al., 2009). In this approach, the parser looks at what is on the top of the stack and what appears next in the buffer. Because it does not look beyond the next item in the buffer, nor undo any decisions it has already made, it is referred to as a greedy algorithm. In the training stage, the parser uses a classifier when looking at features in order to predict a parsing action, given a particular configuration. These features also help it to decide what dependency label to apply to the relation pair. Jurafsky and Manning (2012) identify the four main contributors (listed below) that help to calculate the probabilities of a dependency pair.

• Lexical information (based on data previously observed in a treebank) can tell the parser if two words are likely to be a dependency pair (e.g. ‘small child’).

Initialise:

σ = [ROOT], β = w1, ..., wn, A = φ

repeat
    Choose one of the transition operations 1-4
until β = φ

Transitions:

1. Left-Arc_r:   (σ|wi, wj|β, A) → (σ, wj|β, A ∪ {r(wj, wi)})
   precondition: r′(wk, wi) ∉ A, wi ≠ ROOT

2. Right-Arc_r:  (σ|wi, wj|β, A) → (σ|wi|wj, β, A ∪ {r(wi, wj)})

3. Reduce:       (σ|wi, β, A) → (σ, β, A)
   precondition: r′(wk, wi) ∈ A

4. Shift:        (σ, wi|β, A) → (σ|wi, β, A)

Figure 6.1: The arc-eager shift-reduce algorithm. σ represents the stack, β is the buffer and A is the set of dependencies.

• The distance between two words can also indicate the probability that these two words will be connected. Dependency distance tends to be short, although long-distance dependencies do exist. For example, determiners and their nouns are usually in close proximity. In Irish, the subject and verb are usually close in proximity, whereas inserted elements such as adverbs or subject modifiers can increase the distance between a verb and its object.

• Intervening words: the parser can use the words occurring between the two words likely to attach to determine whether the attachment is possible. For example, punctuation and verbs are unlikely to occur between the two words of a dependency pair.

• The valency of a word is the number of arguments it is likely to have. Depending on its part of speech (e.g. noun, verb), the valency of words can differ, and vary as to whether the arguments will be to the left or right. For example, the verb tabhair ‘give’ has a valency of 3: subject, object and an oblique argument.
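As an illustration of how the four contributors listed above might be turned into classifier features, here is a minimal sketch. The function name `pair_features`, the feature names and the tag names Verb, Art and Noun are assumptions made for this example; they are not the feature model of MaltParser or of any other parser.

```python
def pair_features(tokens, tags, head, dep, arcs):
    """Describe a candidate arc head -> dep (0-based token positions).

    arcs is the set of (head, dependent) pairs proposed so far.
    """
    lo, hi = sorted((head, dep))
    head_deps = [d for h, d in arcs if h == head]
    return {
        # Lexical information: the two word forms involved.
        "lexical": (tokens[head], tokens[dep]),
        # Distance: number of positions separating the two words.
        "distance": hi - lo,
        # Intervening words, represented here by their part-of-speech tags.
        "intervening_tags": tuple(tags[lo + 1:hi]),
        # Valency so far: dependents already attached to the left and right of the head.
        "head_left_deps": sum(1 for d in head_deps if d < head),
        "head_right_deps": sum(1 for d in head_deps if d > head),
    }

tokens = ["Cheannaigh", "an", "cailín", "an", "leabhar"]
tags = ["Verb", "Art", "Noun", "Art", "Noun"]        # hypothetical tag names
arcs_so_far = {(0, 2), (2, 1), (4, 3)}               # (head, dependent) pairs

features = pair_features(tokens, tags, head=0, dep=4, arcs=arcs_so_far)
print(features)
```

A real classifier would combine many such feature values, learned from a treebank, to score each possible parser action.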

The arc-eager algorithm is outlined in Figure 6.1. At each transition step, the parser consults the classifier to determine a word’s dependencies.

In the start configuration, σ denotes the stack, onto which tokens will be pushed as they are encountered; the notation σ|wi indicates a stack with token wi on top. At initialisation there is just one element on the stack, that is, the ROOT. β represents a buffer holding the list of tokens from the input (the sentence) as word1, word2, and so on. A represents a set of dependency arcs r(wi, wj), where r is a dependency label, wi is the head and wj is the dependent. At initialisation, A is empty (φ). There are four possible operations as the parser iterates through each word on the buffer: Left-Arc, Right-Arc, Reduce and Shift. The precondition of Left-Arc is that the token on the top of the stack cannot already be a dependent of another word (thus preventing multiple heads), and it cannot be the root of the sentence. The precondition of the Reduce operation is that a token can only be removed from the stack if it has been made a dependent of another word. The process is finished when the buffer is empty.
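The four operations and their preconditions, as defined in Figure 6.1, can be sketched as follows. The class and function names (`Config`, `left_arc`, `reduce_top` and so on) are hypothetical, tokens are represented by their integer positions with 0 standing for ROOT, and the sketch makes no claim about how MaltParser implements the algorithm.

```python
from dataclasses import dataclass, field

ROOT = 0  # position of the artificial ROOT token

@dataclass
class Config:
    stack: list                               # σ, top of the stack is the last element
    buffer: list                              # β, next token is the first element
    arcs: set = field(default_factory=set)    # A, as (label, head, dependent) triples

def initial(n):
    """Start configuration for a sentence of n tokens."""
    return Config(stack=[ROOT], buffer=list(range(1, n + 1)))

def has_head(c, w):
    return any(dep == w for _, _, dep in c.arcs)

def can_left_arc(c):
    # The stack top must not be ROOT and must not already have a head.
    return (bool(c.stack) and bool(c.buffer)
            and c.stack[-1] != ROOT and not has_head(c, c.stack[-1]))

def left_arc(c, label):
    assert can_left_arc(c)
    wi = c.stack.pop()                        # wi leaves the stack
    c.arcs.add((label, c.buffer[0], wi))      # the next buffer token becomes its head

def can_right_arc(c):
    return bool(c.stack) and bool(c.buffer)

def right_arc(c, label):
    assert can_right_arc(c)
    wj = c.buffer.pop(0)
    c.arcs.add((label, c.stack[-1], wj))      # the stack top becomes the head of wj
    c.stack.append(wj)                        # wj is pushed onto the stack

def can_reduce(c):
    # The stack top may only be removed once it has a head.
    return bool(c.stack) and has_head(c, c.stack[-1])

def reduce_top(c):
    assert can_reduce(c)
    c.stack.pop()

def shift(c):
    assert c.buffer
    c.stack.append(c.buffer.pop(0))

def is_terminal(c):
    return not c.buffer
```

Note that, following Figure 6.1, `left_arc` itself removes the dependent from the stack, whereas `right_arc` pushes the new dependent so that it can still collect dependents of its own before being reduced.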

Let us take the Irish sentence in Example 37 and parse it with the arc-eager algorithm.

(37) Cheannaigh an cailín an leabhar
     Bought the girl the book
     ‘The girl bought the book’

The start configuration is:

σ = [ROOT]; β = Cheannaigh, an, cailín, an, leabhar; A = φ

1. Cheannaigh, a verb, is likely to be the root of the sentence, therefore apply a Right-Arc_root operation, which takes the word on the top of the stack (ROOT) and makes it the head of the next word in the buffer, Cheannaigh. Add to A:

A = {root(ROOT, Cheannaigh)}

Part of the Right-Arc operation is to push Cheannaigh onto the stack:

σ = [ROOT | Cheannaigh]; β = an, cailín, an, leabhar

2. Shift an onto the stack (an unlikely dependent of Cheannaigh):

σ = [ROOT, Cheannaigh | an]; β = cailín, an, leabhar

3. The next word in the buffer (cailín) is likely to be the head of an: apply the Left-Arc_det operation, which takes the word on the top of the stack (an) and makes it the dependent of the next word in the buffer, cailín. Add to A:

A = {root(ROOT, Cheannaigh), det(cailín, an)}

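4. Take an off the stack (it is already a dependent). In the formulation of Figure 6.1 this removal is part of the Left-Arc transition itself, so no separate Reduce is applied:

σ = [ROOT | Cheannaigh]; β = cailín, an, leabhar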

5. Apply the Right-Arc_subj operation. cailín is a good candidate as a subject dependent of Cheannaigh. Add to A:

A = {root(ROOT, Cheannaigh), det(cailín, an), subj(Cheannaigh, cailín)}

We add cailín to the stack (σ|wi|wj):

σ = [ROOT, Cheannaigh | cailín]; β = an, leabhar

6. Reduce: as cailín does not have any right dependents, and it is already identified as a dependent of another token, it can be taken off the stack:

σ = [ROOT | Cheannaigh]; β = an, leabhar

7. Shift an onto the stack (an is not likely to be the dependent of Cheannaigh):

σ = [ROOT, Cheannaigh | an]; β = leabhar

8. The next word in the buffer (leabhar) is likely to be the head of an: apply the Left-Arc_det operation and add to A:

A = {root(ROOT, Cheannaigh), det(cailín, an), subj(Cheannaigh, cailín), det(leabhar, an)}

9. Take an off the stack (it is already a dependent). As in step 4, this removal is part of the Left-Arc transition as defined in Figure 6.1:

σ = [ROOT | Cheannaigh]; β = leabhar

10. Apply the Right-Arc_obj operation. leabhar is a good candidate as an object dependent of Cheannaigh. Add to A:

A = {root(ROOT, Cheannaigh), det(cailín, an), subj(Cheannaigh, cailín), det(leabhar, an), obj(Cheannaigh, leabhar)}

11. Finish – as the buffer is empty.
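The walk-through above can be replayed in code. In this minimal sketch a gold-standard tree stands in for the classifier: at each configuration the correct transition is read off the known analysis of Example 37, which mirrors the assumption that every classifier decision is correct. The function name `oracle_parse` and the rule for when to Reduce (as soon as the stack top has its head and all of its dependents) are assumptions of this sketch, and it presumes a projective tree. Because Left-Arc in Figure 6.1 removes the stack top itself, the removals described in steps 4 and 9 are folded into the two Left-Arc transitions, giving eight transitions in total.

```python
def oracle_parse(words, gold_heads, gold_labels):
    """Apply the arc-eager transitions of Figure 6.1, guided by a gold tree.

    words[0] must be "ROOT"; gold_heads[d] and gold_labels[d] give the head
    position and the label of token d (d >= 1).
    """
    stack, buffer = [0], list(range(1, len(words)))
    arcs, actions = [], []
    attached = set()                      # tokens that already have a head

    def complete(w):
        # True once every gold dependent of w has been attached.
        return all(d in attached for d, h in gold_heads.items() if h == w)

    while buffer:                         # finish when the buffer is empty
        s, b = stack[-1], buffer[0]
        if gold_heads.get(s) == b:        # Left-Arc: b is the head of the stack top
            arcs.append((gold_labels[s], words[b], words[s]))
            attached.add(s)
            stack.pop()
            actions.append("Left-Arc_" + gold_labels[s])
        elif gold_heads[b] == s:          # Right-Arc: the stack top is the head of b
            arcs.append((gold_labels[b], words[s], words[b]))
            attached.add(b)
            stack.append(buffer.pop(0))
            actions.append("Right-Arc_" + gold_labels[b])
        elif s in attached and complete(s):
            stack.pop()                   # Reduce: s has a head and nothing left to collect
            actions.append("Reduce")
        else:
            stack.append(buffer.pop(0))   # Shift
            actions.append("Shift")
    return actions, arcs

words = ["ROOT", "Cheannaigh", "an", "cailín", "an", "leabhar"]
gold_heads = {1: 0, 2: 3, 3: 1, 4: 5, 5: 1}
gold_labels = {1: "root", 2: "det", 3: "subj", 4: "det", 5: "obj"}

actions, arcs = oracle_parse(words, gold_heads, gold_labels)
for action in actions:
    print(action)
print(arcs)
```

The resulting arc list contains the same five dependencies as the final set A in step 10.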

This example assumes that all the decisions made by the classifier are correct. Of course, the classifier is not always reliable and the parser cannot get all choices correct each time. Yet, we can see how the various features it uses assist it in making the correct decisions. For example, we can see in the steps above that at one stage Cheannaigh is on the top of the stack, and an is the next token on the buffer. We note that an is unlikely to be a dependent of Cheannaigh at this stage. Yet, if we were parsing the (present tense) interrogative form of the sentence (An gceannaíonn an cailín an leabhar? ‘Does the girl buy the book?’), the first word An1 is a present tense interrogative particle, and should in fact be attached to the verb gceannaíonn as a dependent. How, then, would the classifier know the difference between what is possible for An1 and an2 (of an cailín)? In this case, the length of the proposed dependency arc is equal for both, so the distance feature will not help to distinguish them. The part-of-speech tags will differ, however. The POS for An1 is Part (particle) and the POS for an2 is Art (article). In addition, An1 occurs to the left of the verb, requiring a Left-Arc operation, and is therefore more likely to be a dependent of the verb than an2, which occurs to the right of the verb (requiring a Right-Arc operation).

In document Irish dependency treebanking and parsing (Page 140-145)