2.4 Learning the Weight Vector
2.4.1 The Perceptron and the Passive-Aggressive Algorithm
The (structured) perceptron is an online algorithm that processes training data points (x, y) one by one. For each training instance, it makes a prediction. If the prediction is correct, the algorithm moves on to the next training instance. If the prediction is incorrect, the weight vector is updated to favor the correct solution. In our setting the situation is slightly more complex, since prediction is carried out in Z-space and not in Y-space. As discussed above, a latent structure can be found by running the search algorithm over the constrained space $Z_y$. For the time being we assume that the search problem in Z-space can be solved exactly by a search procedure called EXACT. This is, for instance, the situation when an arc-factored model is used for coreference resolution, as discussed in the introductory example in Chapter 1.
Pseudocode for the PA algorithm with latent structure is shown in Algorithm 2.3. It initializes the weights to the zero vector and then makes a number of passes (or epochs) over the training data, considering one training instance at a time. For each instance, it first derives the latent structure that we aim to learn by decoding in $Z_y$-space (line 5).² It then makes a prediction in Z-space (line 6). If the predicted structure $\hat{z}_i$ is identical to the latent one, $\tilde{z}_i$, no action is taken and the algorithm continues with the next training instance. If the prediction $\hat{z}_i$ differs from the latent structure $\tilde{z}_i$, an update is carried out (lines 7 to 10). The form of this update is the only thing that differentiates the PA algorithm from the vanilla perceptron. The PA algorithm introduces a custom loss function LOSS($\hat{z}$, $\tilde{z}$), which measures the discrepancy between a prediction and the correct structure, with the intuition that a better (but still wrong) prediction should receive a lower loss than a worse prediction. The loss function must be defined by the user for the specific task at hand. For instance, when comparing a predicted coreference tree to a latent gold coreference tree, the loss function might count the number of arcs in the prediction that do not occur in the gold tree.
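As an illustration of the arc-counting loss mentioned above, the following is a minimal sketch. The encoding of a coreference tree as a dictionary from each mention index to its antecedent index (with 0 as an artificial root) and the name `arc_loss` are assumptions made for this example only, not the representation used elsewhere in this work.

```python
def arc_loss(predicted, gold):
    """Count the arcs of the predicted tree that are not in the gold tree.

    Both trees are assumed to be dicts mapping a mention index to the index
    of its antecedent, where 0 denotes an artificial root node.
    """
    return sum(1 for mention, antecedent in predicted.items()
               if gold.get(mention) != antecedent)


# Hypothetical three-mention document: the prediction attaches mention 3
# to mention 1, while the gold tree attaches it to mention 2.
gold_tree = {1: 0, 2: 1, 3: 2}
predicted_tree = {1: 0, 2: 1, 3: 1}
print(arc_loss(predicted_tree, gold_tree))
```

A loss of this kind is zero exactly when the two trees coincide and grows by one with every wrongly attached mention, which gives the graded notion of error that the PA update relies on.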
Algorithm 2.3 Passive-Aggressive algorithm with latent structure
Input: Training data $D = \{(x_i, y_i)\}$, number of training epochs $T$, loss function LOSS
Output: Weight vector $w$
1: function PAPERCEPTRON(D, T, LOSS)
2:     $w = \vec{0}$    ▷ Initialize weights to the zero vector
3:     for $t \in 1..T$ do    ▷ Loop T times (epochs)
4:         for $(x_i, y_i) \in D$ do    ▷ For each instance
5:             $\tilde{z}_i = \mathrm{EXACT}(x_i, w, \phi, Z_y)$    ▷ Decode latent
6:             $\hat{z}_i = \mathrm{EXACT}(x_i, w, \phi, Z)$    ▷ Predicted structure
7:             if $\tilde{z}_i \neq \hat{z}_i$ then    ▷ If wrong, update
8:                 $\Delta = \Phi(\hat{z}_i) - \Phi(\tilde{z}_i)$    ▷ Distance vector
9:                 $\tau = \dfrac{\Delta \cdot w + \mathrm{LOSS}(\hat{z}_i, \tilde{z}_i)}{\lVert \Delta \rVert^2}$    ▷ Scaling factor
10:                $w = w - \tau \Delta$    ▷ Passive-aggressive update (toward $\tilde{z}_i$, away from $\hat{z}_i$)
11:    return $w$    ▷ Return learned weights
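The training loop can be sketched in a few lines of Python. This is a toy illustration under several assumptions that are not part of the original algorithm: the search spaces Z and $Z_y$ are given as explicit candidate lists, `exact` is a brute-force argmax standing in for EXACT, and each structure serves as its own feature vector. The update moves the weights toward the latent structure and away from the prediction, so that with the distance vector defined as prediction minus latent, the scaled vector is subtracted.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def exact(candidates, w, features):
    """Toy stand-in for EXACT: brute-force argmax over an explicit list."""
    return max(candidates, key=lambda z: dot(w, features(z)))


def pa_train(data, epochs, loss, features, dim):
    """Each item of `data` is a pair (Z, Z_y) of candidate lists, Z_y within Z."""
    w = [0.0] * dim
    for _ in range(epochs):
        for full_space, constrained_space in data:
            z_latent = exact(constrained_space, w, features)  # best correct structure
            z_pred = exact(full_space, w, features)           # unconstrained prediction
            if z_pred == z_latent:
                continue                                      # passive: no update
            delta = [p - g for p, g in zip(features(z_pred), features(z_latent))]
            sq_norm = dot(delta, delta)
            if sq_norm == 0.0:
                continue  # identical feature vectors: no direction to move in
            tau = (dot(delta, w) + loss(z_pred, z_latent)) / sq_norm
            # Move toward the latent structure and away from the prediction.
            w = [wi - tau * di for wi, di in zip(w, delta)]
    return w


# Toy data: structures are their own feature vectors (an assumption for brevity).
data = [
    ([(1, 0, 0), (0, 1, 0), (0, 0, 1)], [(0, 1, 0), (0, 0, 1)]),
    ([(1, 0, 1), (0, 1, 1)], [(0, 1, 1)]),
]
w = pa_train(data, epochs=2, loss=lambda p, g: 1.0, features=lambda z: z, dim=3)
print(w)
```

On this toy data a single update in the first epoch suffices: afterwards the unconstrained prediction for both instances lies inside the constrained space, so the second epoch is entirely passive. The guard against a zero squared norm is an addition of the sketch to avoid a division by zero when two distinct structures share a feature vector.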
² This should be contrasted with using a heuristic to derive the Z-space structure. If a heuristic is used, no latent structure is required. Moreover, the target Z-space structure can be computed once and for all before training begins. This is the situation in Chapter 4 when we compare with static oracles, and consistently throughout Chapter 5, where no latent structure is employed.
To explain the scaling factor τ we first need to consider the weight vector in a geometric sense. A common view is to regard the weight vector as defining a hyperplane, where training instances (data points in the vector space) fall on one side or the other of the hyperplane. The margin of a training instance is the distance between that point in vector space and the hyperplane corresponding to the weight vector. Margin-based methods of learning weight vectors aim not only at finding a weight vector (hyperplane) that separates the training data, but at finding one that does so with a certain minimum margin for training instances on either side of the hyperplane. The PA algorithm, which can be regarded as a margin-based extension of the perceptron algorithm, aims at maintaining a margin the size of the loss between the current training instance and the hyperplane. The aggressive part of the name comes from the fact that every update is aggressive enough to ensure this margin, while at the same time moving the hyperplane as little as possible.³ The computation of τ on line 9 ensures exactly this (Crammer et al., 2006). If τ is instead set to 1, the algorithm collapses into the regular structured perceptron and no margin constraints are enforced.
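That this choice of τ yields a margin equal to the loss can be checked in a few lines. The derivation below is a sketch that assumes the update moves the weights toward the latent structure, i.e., $w' = w - \tau\Delta$ with $\Delta = \Phi(\hat{z}_i) - \Phi(\tilde{z}_i)$, and that $\Delta \neq 0$; the symbol $w'$ for the updated weight vector is introduced here for illustration.

```latex
\begin{aligned}
w' \cdot \Phi(\tilde{z}_i) - w' \cdot \Phi(\hat{z}_i)
  &= -\, w' \cdot \Delta \\
  &= -\,(w - \tau \Delta) \cdot \Delta \\
  &= -\, w \cdot \Delta + \tau \lVert \Delta \rVert^2 \\
  &= -\, w \cdot \Delta + \bigl(\Delta \cdot w + \mathrm{LOSS}(\hat{z}_i, \tilde{z}_i)\bigr) \\
  &= \mathrm{LOSS}(\hat{z}_i, \tilde{z}_i).
\end{aligned}
```

After the update, the latent structure therefore outscores the previous prediction by exactly the loss. With τ = 1 the same calculation gives a score difference of $\lVert \Delta \rVert^2 - w \cdot \Delta$, which is unrelated to the loss, in line with the remark that the plain perceptron enforces no margin constraint.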
In practice, we deviate from the pseudocode in Algorithm 2.3 in two ways. First, the training instances are shuffled between every epoch so as to simulate a “random” stream of training examples. Second, after training we return the average of all weight vectors seen during training, known as parameter averaging or the averaged perceptron (Collins, 2002). Parameter averaging has been shown to approximate the more expensive voted perceptron, which uses multiple weight vectors for prediction (Freund and Schapire, 1999; Collins, 2002). The intuition behind the voted perceptron (and, consequently, also the averaged perceptron) is that, because of the online nature of the perceptron, it is very sensitive to the order of the instances in the training data. In particular, the very last (or most recent, during training) instance seen has a disproportionate influence on the final weight vector compared to instances seen much earlier. Letting the weight vectors vote, or just considering their average, softens this effect and makes the final weight vector less sensitive to the order of the training data. The motivation for using the averaged perceptron rather than the voted one is strictly based on efficiency: the voted perceptron relies on all weight vectors seen during training, so the computations at test time would involve a considerable number of scalar products as opposed to a single one.⁴
³ The passive part of the name comes from the simple fact that when a prediction is correct no action is taken, i.e., the algorithm is passive.
⁴ To make it perfectly clear, we emphasize that the average of the weight vectors is only computed after training has completed. That is, while training, the most recent weight vector is used consistently and the averaged weight vector is only computed after Algorithm 2.3 has finished.
Computing the average of all weight vectors may be expensive, since done naively it involves a linear pass over the weight vector after every instance in order to aggregate a sum. However, Daumé III (2006, pp. 9-10) provides an elegant solution to this problem by showing that the average weight vector can be unfolded as a telescoping sum over the distance vectors involved in the updates. This way, only the weights that change during an update need to be processed, and the average weight vector can be computed efficiently at the end of learning.
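The idea can be sketched as follows. This illustrates the general bookkeeping trick and is not necessarily the exact formulation given by Daumé III (2006). The function names and the representation of each instance's update as a sparse dictionary (empty when no update took place) are assumptions of the sketch. Alongside the weights, a second vector accumulates every update scaled by the number of instances already processed, and the average is recovered from the two at the end.

```python
import math


def naive_average(updates, dim):
    """Average the weight vector after every instance with a dense pass each time."""
    w = [0.0] * dim
    total = [0.0] * dim
    for delta in updates:                  # sparse dict, empty if no update
        for j, value in delta.items():
            w[j] += value
        for j in range(dim):               # O(dim) work for every instance
            total[j] += w[j]
    return [s / len(updates) for s in total]


def lazy_average(updates, dim):
    """Same average, but touching only the weights that actually change."""
    w = [0.0] * dim
    u = [0.0] * dim                        # updates weighted by their time stamp
    for seen, delta in enumerate(updates): # seen = instances processed so far
        for j, value in delta.items():
            w[j] += value
            u[j] += seen * value
    n = len(updates)
    return [wj - uj / n for wj, uj in zip(w, u)]


# Hypothetical stream of four instances; the second one triggers no update.
updates = [{0: 1.0, 2: -1.0}, {}, {1: 0.5}, {0: -2.0}]
print(naive_average(updates, 3))
print(lazy_average(updates, 3))
```

The two functions agree because an update made when `seen` instances have already been processed contributes to only the remaining `n - seen` of the `n` averaged weight vectors. Subtracting `seen / n` times the update from the final weights accounts for exactly that.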