In this dissertation, we use supervised learning to learn the statistical models. Supervised learning means that we train the model on data for which we already know the correct solution. In Chapter 5, we predict the morphological features of words. The training data is therefore a corpus for which the morphological features have been manually annotated. Analogously, when we train parsers, we train them on treebanks. In this case, the model is used to predict dependency trees given an input sentence.
Let X be the set of inputs and Y be the set of outputs. For example, x ∈ X could be a word in morphology prediction or a sentence in parsing, and y ∈ Y could be a
2.3 Training the Statistical Models 29
morphological feature label or a dependency tree. A training set for supervised learning then consists of elements from the input set labeled with elements from the output set, i.e., T = {⟨x, y⟩, …} is a set of input-output pairs. The y's in T are also called the gold standard and are assumed to be correct.
Throughout the dissertation, we use linear models to obtain a score for an input-output pair. In prediction, the score is used to find the optimal y for a given x. Each input-output pair is first mapped into a high-dimensional feature space using a feature function φ. Most of the features that we work with are binary, i.e., the value of a single feature is either 0 or 1. The statistical model is a function that takes the features as input and computes a score. It is called linear because the score depends linearly on the feature values. Training the model means learning a weight for each feature.
$$\mathrm{score}(x, y) = w \cdot \phi(x, y) = \sum_{i=1}^{d} w_i \, \phi_i(x, y) \tag{2.10}$$
Equation (2.10) shows the general scoring function. Given a pair ⟨x, y⟩ and a weight vector w, the score is computed by taking the dot product of w and the feature vector φ(x, y) produced by the feature function. The dot product is the sum of the products of each weight with its corresponding feature value. d is the number of different features, i.e., w ∈ ℝ^d and φ(x, y) ∈ ℝ^d.
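Because most features are binary, the dot product in Equation (2.10) reduces to summing the weights of the features that are active for a pair. The following is a minimal sketch under that assumption; the sparse dictionary representation, the function name `score`, and the feature strings are illustrative and not taken from the dissertation.

```python
from typing import Dict, Hashable, Iterable

Weights = Dict[Hashable, float]


def score(weights: Weights, active_features: Iterable[Hashable]) -> float:
    """Compute w · φ(x, y) for a binary, sparsely stored feature vector.

    `active_features` lists the features whose value is 1 (each listed once).
    All other features are 0 and contribute nothing to the sum.
    Features without a learned weight are treated as having weight 0.
    """
    return sum(weights.get(f, 0.0) for f in active_features)


# Hypothetical weights and features, for illustration only.
weights = {
    "suffix=en|label=PL": 1.5,
    "prev=die|label=PL": 0.5,
    "suffix=en|label=SG": -0.25,
}
active = ["suffix=en|label=PL", "prev=die|label=PL", "unseen-feature"]

print(score(weights, active))  # 1.5 + 0.5 + 0.0
```

Storing only the active features keeps scoring proportional to the number of features that fire rather than to d.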
The models in this dissertation are trained with a general online learning algorithm, shown in Algorithm 1. It takes a labeled training set and a predefined number of iterations (usually set to 10). The algorithm goes through the training data one instance at a time, each time making a prediction (Line 6). The weight vector is then updated with respect to the prediction (Line 7). We furthermore use averaging to prevent overfitting and to obtain models that generalize better to unseen data (Freund and Schapire 1999, Collins 2002).
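The training loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the dissertation's implementation: weights are stored in a dictionary, `predict` and `update` are passed in as callables, and the toy demo uses a plain perceptron update as a stand-in for the update rule. The names `train_averaged`, `phi`, `predict`, and `perceptron_update` are hypothetical.

```python
import random
from collections import defaultdict


def train_averaged(training_set, iterations, predict, update, seed=0):
    """Online learning with averaging, following the steps of Algorithm 1."""
    rng = random.Random(seed)
    data = list(training_set)
    if not data or iterations < 1:
        raise ValueError("need a non-empty training set and at least one iteration")
    w = defaultdict(float)      # current weight vector
    w_sum = defaultdict(float)  # running sum of weight vectors for averaging
    for _ in range(iterations):
        rng.shuffle(data)
        for x, y in data:
            y_hat = predict(w, x)      # make a prediction
            update(w, x, y_hat, y)     # adjust the weights in place
            for f, v in w.items():     # store the current weight vector
                w_sum[f] += v
    n = len(data) * iterations
    return {f: v / n for f, v in w_sum.items()}


# --- Toy demo (all names and data below are assumptions) ---
LABELS = ["A", "B"]


def phi(x, y):
    """Conjoin each input token with the candidate label."""
    return [(tok, y) for tok in x]


def predict(w, x):
    """Return the highest-scoring label; ties go to the first label."""
    return max(LABELS, key=lambda y: sum(w.get(f, 0.0) for f in phi(x, y)))


def perceptron_update(w, x, y_hat, y):
    """Stand-in update: reward gold features, penalize predicted ones."""
    if y_hat != y:
        for f in phi(x, y):
            w[f] += 1.0
        for f in phi(x, y_hat):
            w[f] -= 1.0


data = [(("a",), "A"), (("b",), "B")]
averaged = train_averaged(data, 3, predict, perceptron_update)
print(predict(averaged, ("b",)))
```

Adding the full weight vector to the running sum after every instance is the most direct reading of Line 8, but it costs time proportional to the number of non-zero weights at each step. Practical implementations usually compute the same average lazily.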
We train models for two purposes: multiclass classification and structured prediction (Collins 2002). The first is used, for example, in Chapter 5 to predict morphological feature values for words. The input is a word in its context and the output is a label from a set of morphological feature descriptions. For each word, the statistical model takes a feature representation of the word in its sentential context and outputs the feature
30 2 Background
Algorithm 1 Online Learning with Averaging
Require: T = {⟨x_0, y_0⟩, …, ⟨x_t, y_t⟩}    ▷ The labeled training set
Require: number of iterations I
 1: w = 0                        ▷ Initialize the weight vector
 2: w_a = 0                      ▷ Keep a second vector for averaging
 3: for i = 1 to I do
 4:     SHUFFLE(T)               ▷ Shuffle the training data
 5:     for all ⟨x, y⟩ ∈ T do
 6:         ŷ = PREDICT(w, x)    ▷ Make a prediction
 7:         UPDATE(w, x, ŷ, y)   ▷ Update the weights according to the prediction
 8:         w_a = w_a + w        ▷ Store the current weight vector
 9:     end for
10: end for
11: w̄ = w_a / (|T| · I)          ▷ Average the weights
12: return w̄
label with the highest score.
$$\hat{y} = \operatorname*{arg\,max}_{y \in Y} \; w \cdot \phi(x, y) \tag{2.11}$$
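For a small label set, the argmax in Equation (2.11) can be computed by scoring every candidate label and keeping the best one. The sketch below assumes binary features formed by conjoining input attributes with the label; the feature function `phi`, the label names, and the weights are made up for illustration.

```python
def phi(x, y):
    """Hypothetical feature function: pair each input attribute with label y."""
    return [f"{attr}|label={y}" for attr in x]


def predict(weights, x, labels):
    """Return the label y that maximizes w · φ(x, y) over all candidates."""
    def score(y):
        return sum(weights.get(f, 0.0) for f in phi(x, y))
    return max(labels, key=score)


# Illustrative labels, weights, and input attributes (assumptions).
labels = ["case=nom", "case=acc"]
weights = {
    "suffix=n|label=case=acc": 1.0,
    "prev=den|label=case=acc": 0.5,
    "suffix=n|label=case=nom": 0.25,
}
x = ["suffix=n", "prev=den"]

print(predict(weights, x, labels))  # scores: nom = 0.25, acc = 1.5
```

Enumerating all labels is only practical because the label set is small and unstructured.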
The parsers developed in this dissertation belong to the graph-based paradigm and are trained with structured prediction. Unlike in the multiclass prediction case, the output values Y have an internal structure, i.e., they are dependency trees. As there are exponentially many dependency trees for a given sentence, it is not feasible to simply do an argmax over all possible output values as in Equation (2.11). However, since we are only interested in the highest-scoring dependency tree, we can use one of the parsing algorithms from above, say Chu-Liu-Edmonds, to find the highest-scoring dependency tree efficiently without having to go through exponentially many trees one after the other. Recall that this is efficient because the amount of information to which the statistical model has access is limited.
$$\hat{y} = \text{Chu-Liu-Edmonds}(w, \phi, x) \tag{2.12}$$
Equations (2.11) and (2.12) are the instantiations of Line 6 in Algorithm 1 for multiclass prediction and structured prediction, respectively. For adjusting the weights of a model during training (Line 7 in Algorithm 1), we use the passive-aggressive update rule by Crammer et al. (2003, 2006) which is shown in Equations (2.13) and (2.14). The model is
updated only if the prediction is incorrect with respect to the gold standard. The update changes the weight vector just as much as it needs to make a correct prediction for the current input, but not more because the model should stay good on the examples where it made correct predictions. The name passive-aggressive describes this behaviour: it is passive when the model made a correct prediction, but in the other case it aggressively changes the weights to get a correct prediction next time.
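$$\delta = \frac{w \cdot \phi(x, \hat{y}) - w \cdot \phi(x, y) + \mathrm{LOSS}(\hat{y}, y)}{\lVert \phi(x, y) - \phi(x, \hat{y}) \rVert^{2}} \tag{2.13}$$
$$w = w + \delta \, (\phi(x, y) - \phi(x, \hat{y})) \tag{2.14}$$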
In Equations (2.13) and (2.14), φ(x, y) is the feature vector of the gold standard and φ(x, ŷ) is the feature vector of the best prediction. Note that in the multiclass case, ŷ is a single label, whereas in the structured prediction case, it is a complete dependency tree. Equation (2.13) computes the amount δ by which the weight of each feature in the prediction and the gold standard is changed. In Equation (2.14), the weights for features in the gold standard are increased whereas the feature weights of the best prediction are decreased. After the update, the gold standard should get a higher score than the best prediction. Additionally, the passive-aggressive update enforces a margin between the score of the best prediction and the score of the gold standard that must be at least as big as the loss between the two. We use a zero-one loss in multiclass prediction, meaning that the loss is one if the predicted label is incorrect and zero otherwise. For parsing, the loss is a function of the number of tokens that did not get the correct head. Note that in structured prediction, the feature vectors in Equations (2.13) and (2.14) are the sum of the feature vectors for each factor in the structure. In the arc-factored model, for example, this is the sum of the feature vectors for each arc in the tree.
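To make the update concrete, the following sketch applies Equations (2.13) and (2.14) to sparse feature vectors stored as dictionaries. It assumes the caller has already established that the prediction is incorrect and supplies the loss. The names `dot` and `pa_update` and the toy feature vectors are hypothetical.

```python
def dot(w, phi):
    """Dot product between a weight dictionary and a sparse feature vector."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())


def pa_update(w, phi_gold, phi_pred, loss):
    """Update w in place as in Equations (2.13) and (2.14); return δ.

    phi_gold and phi_pred map feature names to values. In structured
    prediction they would be the summed feature vectors of all factors.
    """
    # Difference vector φ(x, y) − φ(x, ŷ)
    diff = dict(phi_gold)
    for f, v in phi_pred.items():
        diff[f] = diff.get(f, 0.0) - v
    norm_sq = sum(v * v for v in diff.values())
    if loss == 0 or norm_sq == 0:
        return 0.0  # nothing to correct, leave the weights untouched
    # Equation (2.13)
    delta = (dot(w, phi_pred) - dot(w, phi_gold) + loss) / norm_sq
    # Equation (2.14)
    for f, v in diff.items():
        w[f] = w.get(f, 0.0) + delta * v
    return delta


# Toy example with made-up features and a zero-one loss of 1.
w = {}
phi_gold = {"a|G": 1.0, "b|G": 1.0}
phi_pred = {"a|P": 1.0, "b|P": 1.0}

delta = pa_update(w, phi_gold, phi_pred, loss=1.0)
margin = dot(w, phi_gold) - dot(w, phi_pred)
print(delta, margin)  # 0.25 1.0
```

In this example the margin after the update equals the loss exactly, which is the smallest change to the weights that satisfies the margin requirement.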
Chapter 3
Motivation
In this chapter, we develop the hypotheses and research questions of this dissertation. We first present morphological and syntactic phenomena of morphologically rich languages. We then examine the models that are commonly used in parsing and show that some of the assumptions that are built into these models do not hold for languages with rich morphology. This chapter sets the scene for the following chapters, in which we test the developed hypotheses empirically.
3.1 Morphology and Syntax
In English, the syntactic structure of a sentence is mainly expressed by the order in which the words appear in the sentence. Consider the example¹ by Bresnan (2001) in Figure 3.1.
The fact that children comes before are chasing in Figure 3.1 determines the subjecthood of the word. In the same way, dog is the direct object because it follows the verb. Switching the positions of these words would change the meaning: dog would then be the subject and children the direct object.
Consider now the example in Figure 3.2, also by Bresnan (2001). This sentence is from the
¹ The original examples come with a phrase structure analysis, which we changed to dependencies since the argument we want to make does not depend on the syntactic theory.
the two small children are chasing that dog
(dependency arc labels: det, nummod, amod, nsubj, aux, dobj, det)
Figure 3.1: English syntactic structure is mostly expressed through word order. Example taken from Bresnan (2001: 5).
Australian language Warlpiri and expresses the same semantic concept as the sentence in Figure 3.1. According to Bresnan, any other order of the words in this sentence is also acceptable to express the same meaning as long as the auxiliary occupies the second position in the sentence. In Warlpiri, word order therefore cannot serve to determine the syntactic relationships between the individual words. Instead, the morphology of the words overtly marks the roles that they play in the syntactic structure of this sentence. The subject of the sentence, witajarrarlu kurdujarrarlu, is in ergative case, whereas the direct object, yalumpu maliki, is in absolutive case. Furthermore, the words for small and children do not need to be adjacent since their identical inflection relates them to each other.
wita-jarra-rlu   ka-pala         wajili-pi-nyi   yalumpu    kurdu-jarra-rlu   maliki
small-DUAL-ERG   pres-3DU.SUBJ   chase-NPAST     that.ABS   child-DUAL-ERG    dog.ABS
(dependency arc labels: aux, nsubj, dobj, amod, det)
Figure 3.2: In Warlpiri, syntactic structure is expressed by morphology. Example taken from Bresnan (2001: 6).
These two sentences exemplify two opposite points on the scale of options that languages have to express syntactic structure. Bresnan uses them to illustrate a phenomenon commonly observed by language typologists: languages with rich morphology usually allow for free word order whereas languages with rather poor morphology often have very strict word order rules. Bresnan summarizes this observation with the slogan: Morphology competes with syntax (Bresnan 2001: 6).