
6.3 A Multitask Learning Architecture for Parsing and Tagging

6.3.3 Parsing Component

A second bi-LSTM layer uses the sequence $[d^{(1)}_1, d^{(1)}_2, \dots, d^{(1)}_n]$ as input and computes higher-level representations $[d^{(2)}_1, d^{(2)}_2, \dots, d^{(2)}_n]$. Then, to represent a configuration, we extract typed features from the stack and buffer. The features used in practice are illustrated in Figure 6.5. They consist of the left and right corners and the heads of constituents in the stack, as well as the labels of the first three constituents in the stack, and the first element in the buffer. Types are either nonterminal symbols or tokens. The nonterminal features are valued by an embedding. The token features are valued by the corresponding representations $d^{(2)}_i$. The concatenation of these valued features forms a vector $h^{(0)}$ that is input to a feed-forward neural network with a softmax output layer that computes a distribution over possible actions:

$$h^{(1)} = f(W_p^{(1)} \cdot h^{(0)} + b_p^{(1)}) \quad \text{(hidden layer)}$$

$$h^{(2)} = f(W_p^{(2)} \cdot h^{(1)} + b_p^{(2)}) \quad \text{(hidden layer)}$$

$$p(a_i = \cdot \mid a_1^{i-1}, w_1^n; \theta_p) = \mathrm{Softmax}(W_p^{(3)} \cdot h^{(2)} + b_p^{(3)}) \quad \text{(output layer)}$$

where $W_p^{(l)}$ and $b_p^{(l)}$ are parameters and $f$ is a non-linear activation function.

In practice, the number of hidden layers for the feed-forward network and the choice of an activation function are hyperparameters. In our experiments, we used a network with two hidden layers, as presented above, and a rectifier (ReLU: $x \mapsto \max\{0, x\}$) as the activation function.
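The forward pass of the feed-forward network described above can be sketched in plain Python. This is only an illustrative sketch: the layer sizes, the random initialisation, and the names `affine`, `action_distribution`, and `random_layer` are assumptions made for the example and do not come from the original implementation.

```python
import math
import random

def relu(v):
    """Rectifier applied element-wise: x -> max(0, x)."""
    return [max(0.0, x) for x in v]

def affine(W, b, h):
    """Compute W . h + b for a matrix W (list of rows) and vectors b, h."""
    return [sum(w * x for w, x in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def action_distribution(h0, params):
    """Two ReLU hidden layers followed by a softmax output layer."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(affine(W1, b1, h0))
    h2 = relu(affine(W2, b2, h1))
    return softmax(affine(W3, b3, h2))

def random_layer(n_out, n_in, rng):
    """Hypothetical initialisation, for illustration only."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

# Assumed toy sizes: 6-dimensional h0, 8 hidden units, 4 possible actions.
rng = random.Random(0)
params = [random_layer(8, 6, rng), random_layer(8, 8, rng), random_layer(4, 8, rng)]
h0 = [rng.uniform(-1, 1) for _ in range(6)]
probs = action_distribution(h0, params)
```

Here `probs` holds one probability per action. In a real parser the input vector would be the concatenation of the valued features rather than random numbers.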

Several variants of this multitask architecture are possible, for example by varying the number of layers of the sentence-level bi-LSTM. In particular, it is possible to share both layers of the sentence-level bi-LSTM between the tagger and the parser, instead of just the first layer. Supervising different tasks at different levels of a hierarchical neural network has shown benefits in previous work (Søgaard and Goldberg, 2016).

6.3.4 Objective Function and Training

To train the model, we minimize the negative log-likelihood of the data. The loss function for a single sentence $w_1^n$, with the corresponding gold sequence of actions $a_1^k$ and gold labels $M_1^n$, is defined as:

$$L(a_1^k, w_1^n, M_1^n; \theta) = - \sum_{i=1}^{k} \log p(a_i \mid a_1^{i-1}, w_1^n; \theta_p) - \sum_{i=1}^{n} \sum_{j=1}^{m} \log p(M_{i,j} \mid w_1^n; \theta_t)$$

where $\theta = \theta_t \cup \theta_p$. The first term is the objective function for the parser and the second term is the objective function for the tagger. The loss for the whole dataset is the sum of $L$ over every sentence in the dataset. Although we assume that each sentence has gold annotations for every task (parsing, morphological analysis, functional labelling), it is also possible to use a different dataset for each task.
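As a small worked example of this loss, the sketch below sums the parser term and the tagger term for one sentence, given the probabilities that a model assigns to the gold actions and gold labels. The function names and the toy probabilities are assumptions chosen for illustration. A real implementation would compute these probabilities from the network outputs.

```python
import math

def sentence_loss(action_probs, tag_probs):
    """Negative log-likelihood of one sentence.

    action_probs: k values, the probability of each gold action a_i
                  given the previous actions and the sentence.
    tag_probs:    n lists of m values, the probability of each gold
                  label M_{i,j} given the sentence.
    """
    parser_term = -sum(math.log(p) for p in action_probs)
    tagger_term = -sum(math.log(p) for token in tag_probs for p in token)
    return parser_term + tagger_term

def dataset_loss(sentences):
    """Sum of the sentence losses over (action_probs, tag_probs) pairs."""
    return sum(sentence_loss(a, t) for a, t in sentences)

# Toy values (assumed): k = 3 actions, n = 2 tokens, m = 2 labels per token.
action_probs = [0.5, 0.25, 0.5]
tag_probs = [[0.5, 0.5], [0.25, 1.0]]
loss = sentence_loss(action_probs, tag_probs)
```

With these toy values each term equals $4 \log 2$, so the sentence loss is $8 \log 2$.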

[Diagram: parser configuration with stack elements s2, s1, s0 and buffer element b0, annotated with CAT, LC, RC and head features, above the parsing input.]

Template set: s0.CAT, s0.LC, s0.RC, s0.head, s1.CAT, s1.LC, s1.RC, s1.head, s2.CAT, s2.RC, b0

Figure 6.5: Feature templates for bi-RNN parsing. s and b respectively address symbols in the stack and the buffer.
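The template set can be read as a small feature extractor over a parser configuration. The sketch below is one possible reading, under several assumptions that are not stated in the text: s0 is taken to be the top of the stack, constituents are stored in a hypothetical `Constituent` record holding token indices, and missing positions are padded with `None`.

```python
from collections import namedtuple

# Hypothetical constituent record: a label plus the token indices of its
# left corner, right corner and head.
Constituent = namedtuple("Constituent", "cat lc rc head")

TEMPLATES = ["s0.CAT", "s0.LC", "s0.RC", "s0.head",
             "s1.CAT", "s1.LC", "s1.RC", "s1.head",
             "s2.CAT", "s2.RC", "b0"]

def extract_features(stack, buffer):
    """Return one (type, value) pair per template.

    Nonterminal features carry a label; token features carry a token
    index, to be looked up in the bi-LSTM output. Missing positions
    yield None (an assumed padding convention).
    """
    feats = []
    for template in TEMPLATES:
        if template == "b0":
            feats.append(("token", buffer[0] if buffer else None))
            continue
        address, field = template.split(".")
        depth = int(address[1:])  # assumption: s0 is the top of the stack
        kind = "nonterminal" if field == "CAT" else "token"
        if depth >= len(stack):
            feats.append((kind, None))
            continue
        item = stack[-1 - depth]
        value = item.cat if field == "CAT" else getattr(item, field.lower())
        feats.append((kind, value))
    return feats

# Toy configuration: two constituents on the stack, two tokens in the buffer.
stack = [Constituent("NP", 0, 1, 1), Constituent("V", 2, 2, 2)]
buffer = [3, 4]
features = extract_features(stack, buffer)
```

Each token-typed feature would then be replaced by the corresponding bi-LSTM vector, and each nonterminal-typed feature by its embedding, before concatenation.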

Model            Input                          Auxiliary tasks
TOK+CLSTM        token, character bi-LSTM       none
TOK+CLSTM+M      token, character bi-LSTM       morphology
TOK+CLSTM+M+D    token, character bi-LSTM       morphology, functional labels
TOK              token                          none
TOK+MMT          token, predicted morphology    none
TOK+MMT+D        token, predicted morphology    functional labels

Table 6.2: Summary of models.

6.4 Experiments

The experiments we conduct have several objectives. First, we assess to what extent the tagging auxiliary tasks can improve constituency parsing. Secondly, we evaluate the accuracy of the output of the auxiliary tasks. Finally, we compare our multitask model to a pipeline approach, where predicted morphological attributes are given as the input to the parser at test time.

In a first set of experiments, we use the model we described with a character-level bi-LSTM, and either no auxiliary task (TOK+CLSTM), morphological analysis as an auxiliary task (TOK+CLSTM+M) or morphological analysis and functional labelling as auxiliary tasks (TOK+CLSTM+M+D).

In a second set of experiments, the input to the sentence-level bi-LSTM does not include a character-based embedding. Instead, it is either a standard word embedding (TOK), or the concatenation of a word embedding and embeddings for each available morphological tag (TOK+MMT). For example, the token étages from the sentence above will be represented as the concatenation $[w_{\text{étages}}; w_{g=m}; w_{number=p}; w_{tense=NA}; w_{mood=NA}]$. Finally, the last model uses the same input as TOK+MMT, but predicts functional labels as an auxiliary task (TOK+MMT+D). The different parameters of these models are summed up in Table 6.2.
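The TOK+MMT input representation can be sketched as a lookup and concatenation. The embedding size, the `EmbeddingTable` class, and the fixed attribute list below are assumptions for illustration. In the actual model the vectors are learned parameters, not random draws.

```python
import random

DIM = 4  # hypothetical embedding size
ATTRIBUTES = ["g", "number", "tense", "mood"]  # attribute set from the example

class EmbeddingTable:
    """Lazily creates a random vector per key (stand-in for learned parameters)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vectors = {}

    def __getitem__(self, key):
        if key not in self.vectors:
            self.vectors[key] = [self.rng.gauss(0, 1) for _ in range(self.dim)]
        return self.vectors[key]

def token_input(word, morph, table):
    """Concatenate the word embedding with one embedding per attribute.

    Attributes absent from `morph` receive the value "NA".
    """
    vec = list(table[("word", word)])
    for attr in ATTRIBUTES:
        value = morph.get(attr, "NA")
        vec.extend(table[("attr", attr, value)])
    return vec

table = EmbeddingTable(DIM)
x = token_input("étages", {"g": "m", "number": "p"}, table)
```

The resulting vector has one block of `DIM` values for the word and one per attribute, so its length is fixed regardless of which attributes are defined for a given token.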

6.4.1 Datasets

We evaluate our models on the SPMRL dataset (Seddah et al., 2013). This dataset contains constituency and dependency treebanks aligned at the word level for 9 morphologically rich languages. Each token is annotated with a part-of-speech tag and a number of language-specific morphological attributes (case, mood, tense, number).

In the first set of experiments, where morphology is predicted as an auxiliary task, we use the gold tags and morphological annotations at training time. At test time, the only input to the parser is a sequence of word forms.

In the second set of experiments, we use the POS and morphological tags predicted by MARMOT (Mueller et al., 2013) for training and parsing. MARMOT is a CRF tagger designed to output a structured morphological analysis for each token, and to use external morphological lexicons.

As the transition system is lexicalized, we need to know the head of each constituent in order to extract the gold derivation. The constituency trees were head-annotated using the method of Crabbé (2015). This method uses the alignment between constituency and dependency trees to determine the head of each constituent and uses heuristics to solve mismatch cases. Finally, we performed a head-outward binarization with an order-0 Markovization (see Section 3.2.3.1), and collapsed unary productions to single nodes, except those that produce preterminals.
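The unary-collapsing step can be illustrated with a short sketch. It assumes a tuple encoding of trees, `(label, child, ...)` with preterminals written as `(tag, word)`, and a `+` separator for merged labels. Both conventions are hypothetical, and head annotations and binarization are left out for brevity.

```python
def is_preterminal(tree):
    """A preterminal is a (tag, word) pair whose single child is a string."""
    return len(tree) == 2 and isinstance(tree[1], str)

def collapse_unaries(tree, joiner="+"):
    """Merge unary chains X -> Y into a single node labelled X+Y.

    Unary productions whose child is a preterminal are kept as they are.
    """
    if is_preterminal(tree):
        return tree
    label, children = tree[0], tree[1:]
    # Follow the chain while the only child is itself a nonterminal node.
    while len(children) == 1 and not is_preterminal(children[0]):
        child = children[0]
        label = label + joiner + child[0]
        children = child[1:]
    return (label,) + tuple(collapse_unaries(c, joiner) for c in children)

# Toy tree (assumed labels): ROOT -> SENT is a unary production to collapse,
# whereas NP -> N and VN -> V produce preterminals and are preserved.
tree = ("ROOT",
        ("SENT",
         ("NP", ("N", "Marie")),
         ("VN", ("V", "dort"))))
collapsed = collapse_unaries(tree)
```

In this toy tree, ROOT and SENT are merged into one node, while the NP and VN nodes above preterminals are left untouched.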
