
2.5 Bibliographical Notes

2.5.1 Related Models

Below we give a brief review of different machine learning methods that are closely related to structural SVMs in the area of discriminative structured output learning.

Conditional Random Fields

The influential paper by Lafferty, McCallum and Pereira [81] on Conditional Random Fields (CRFs) opened up the whole area of discriminative learning of structured output prediction models. Conditional random fields learn the conditional probability distribution P(y | x) with the following exponential form:

$$P(y \mid x) = \frac{\exp(w \cdot \Phi(x, y))}{\sum_{\hat{y} \in \mathcal{Y}} \exp(w \cdot \Phi(x, \hat{y}))}, \tag{2.15}$$

where Φ is the joint feature vector that serves the same purpose as in structural SVMs, and w is the parameter vector to be learned. By modeling the conditional distribution P(y | x) instead of the joint distribution P(x, y), the conditional random field is the first model that allows flexible feature construction in Φ, which greatly improves performance on many sequence labeling tasks. Notice that in Equation (2.15) the normalization factor only involves a summation over outputs ŷ and not over the input x. One can therefore construct arbitrarily complex features over the input x without having to solve a more difficult inference problem in training. CRFs solve the following regularized negative log-likelihood minimization problem on a training set consisting of {(x1, y1), . . . , (xn, yn)}.

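Optimization Problem 2.10. (CONDITIONAL RANDOM FIELDS (WITH REGULARIZATION))

$$\min_{w} \; \frac{1}{2}\|w\|^2 - C \sum_{i=1}^{n} \left[ w \cdot \Phi(x_i, y_i) - \log\left( \sum_{\hat{y} \in \mathcal{Y}} \exp(w \cdot \Phi(x_i, \hat{y})) \right) \right]. \tag{2.16}$$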

The regularization term (1/2)‖w‖² prevents the model from overfitting when the number of parameters is large. This training problem is a convex optimization problem and can be solved with methods such as iterative scaling [37] or limited-memory BFGS [85, 113]. Its form is very similar to the training problem of structural SVMs in Equation (2.9); the only difference is the way prediction errors on the training set are penalized. We can rewrite Equation (2.9) without the constraints for a direct comparison:

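Optimization Problem 2.11. (STRUCTURAL SVM (AFTER ELIMINATING SLACK VARIABLES))

$$\min_{w} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max_{\hat{y} \in \mathcal{Y}} \left[ w \cdot \Phi(x_i, \hat{y}) - w \cdot \Phi(x_i, y_i) + \Delta(y_i, \hat{y}) \right] \tag{2.17}$$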
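As a rough illustration of how the two per-example penalties in Equations (2.16) and (2.17) differ, the following sketch evaluates both by exhaustive enumeration on a toy labeling problem. The label set, the feature map `phi`, and the choice of Hamming distance for ∆ are assumptions made for this example only; they are not taken from the text, and real implementations replace the enumeration with dynamic programming.

```python
import itertools
import math

LABELS = ("A", "B")  # hypothetical two-label alphabet


def phi(x, y):
    """Toy joint feature vector (an assumption, not the thesis's feature map)."""
    feats = [0.0] * 5
    for token, label in zip(x, y):
        # indices 0..3: (token, label) co-occurrence counts for tokens 0 and 1
        feats[2 * token + LABELS.index(label)] += 1.0
    # index 4: number of adjacent positions that share a label
    feats[4] = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return feats


def score(w, x, y):
    return sum(wi * fi for wi, fi in zip(w, phi(x, y)))


def hamming(y, y_hat):
    """Loss Delta(y, y_hat), here assumed to be the Hamming distance."""
    return sum(1.0 for a, b in zip(y, y_hat) if a != b)


def crf_loss(w, x, y):
    """Per-example term of (2.16): log partition function minus the gold score."""
    scores = [score(w, x, yh) for yh in itertools.product(LABELS, repeat=len(x))]
    m = max(scores)  # shift by the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - score(w, x, y)


def ssvm_loss(w, x, y):
    """Per-example term of (2.17): margin-rescaled structured hinge loss."""
    return max(
        score(w, x, yh) - score(w, x, y) + hamming(y, yh)
        for yh in itertools.product(LABELS, repeat=len(x))
    )
```

Both quantities are non-negative. The CRF term sums over every output and so needs the full partition function, whereas the structural SVM term only needs the single highest-scoring (loss-augmented) output.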

Compared to CRFs, structural SVMs contain a loss function ∆ for specifying the margin requirements of different applications, and they do not require the computation of the normalization term $\sum_{\hat{y} \in \mathcal{Y}} \exp(w \cdot \Phi(x, \hat{y}))$ (called the partition function) during training. For structured output prediction problems with inference procedures based on dynamic programming, the partition function can be computed using the sum-product algorithm (e.g., the forward-backward algorithm in HMMs), with computational complexity similar to that of the argmax computation (max-product algorithm) used in the training of structural SVMs (e.g., the Viterbi algorithm in HMMs). For structured output prediction problems with NP-hard inference problems, there are sometimes approximation algorithms for computing the highest scoring structure, which can be applied to the training of structural SVMs. However, it is less clear how these algorithms can be modified to compute the partition function. Even for problems with polynomial-time inference algorithms, the computational complexity of computing the argmax can differ from that of computing the partition function. For example, computing the minimum/maximum spanning tree of a graph with n vertices and m edges takes O(m log n) time with Kruskal's or Prim's algorithm, but summing the scores of all possible spanning trees takes O(n³) time via the matrix-tree theorem [79]. On the other hand, CRFs have the advantage of being probabilistic models, so they can be composed with other probabilistic models in certain applications. They are also well-grounded theoretically and can be derived from the principle of maximum entropy [5].
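The close relationship between sum-product and max-product on chain-structured models can be sketched with a single recursion in which only the combining operation changes. The code below is a minimal sketch under assumed toy scores (`emit` and `trans` are hypothetical log-potentials, not values from the text); it computes the log partition function and the best path score, and a brute-force enumeration is included only to check the recursion.

```python
import itertools
import math


def logsumexp(values):
    m = max(values)  # shift by the maximum for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in values))


def chain_recursion(emit, trans, combine):
    """Dynamic program over a linear chain, O(T * K^2) for T positions and K labels.

    emit[t][k]  : score of label k at position t
    trans[j][k] : score of moving from label j to label k
    combine     : logsumexp gives the log partition function (sum-product);
                  max gives the best path score (max-product / Viterbi)
    """
    n_labels = len(emit[0])
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [
            emit[t][k] + combine([alpha[j] + trans[j][k] for j in range(n_labels)])
            for k in range(n_labels)
        ]
    return combine(alpha)


def brute_force(emit, trans, combine):
    """Enumerate every label sequence; exponential cost, for checking only."""
    n_labels = len(emit[0])
    totals = []
    for path in itertools.product(range(n_labels), repeat=len(emit)):
        s = sum(emit[t][k] for t, k in enumerate(path))
        s += sum(trans[a][b] for a, b in zip(path, path[1:]))
        totals.append(s)
    return combine(totals)


# Hypothetical scores for a chain of length 3 with 2 labels.
emit = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
trans = [[0.2, -0.1], [0.0, 0.4]]
log_z = chain_recursion(emit, trans, logsumexp)  # sum-product
best = chain_recursion(emit, trans, max)         # max-product
```

Because both calls run the same loop, their costs match on chains. The spanning tree example shows that this symmetry does not hold for every output structure.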

Margin-based Models

Collins [30] proposed using perceptron updates to learn the parameters of hidden Markov models, which allows the use of flexible feature functions like those in CRFs while being much easier to implement. Analogous to the move from perceptrons to SVMs for a more stable classification boundary, his work was extended in [6] to Hidden Markov Support Vector Machines (HM-SVM) using ideas of regularization. Compared with the perceptron training method, HM-SVM has improved accuracy on many sequence labeling tasks [6] due to regularization. HM-SVM was later generalized to structural SVMs [132] for general structured output learning.
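The perceptron-style update can be sketched as follows: decode with the current weights and, on a mistake, add the gold feature vector and subtract the predicted one. This is a minimal sketch, not the implementation from [30]. The feature map `phi`, the binary tag set, and the toy data are assumptions, exhaustive search stands in for Viterbi decoding, and refinements such as parameter averaging are omitted.

```python
import itertools

LABELS = (0, 1)  # hypothetical binary tag set


def phi(x, y):
    """Toy joint features: (token, tag) counts, then (tag, tag) transition counts."""
    feats = [0.0] * 8
    for token, tag in zip(x, y):
        feats[2 * token + tag] += 1.0
    for a, b in zip(y, y[1:]):
        feats[4 + 2 * a + b] += 1.0
    return feats


def predict(w, x):
    """Exhaustive argmax; a real tagger would use Viterbi decoding instead."""
    return max(
        itertools.product(LABELS, repeat=len(x)),
        key=lambda y: sum(wi * fi for wi, fi in zip(w, phi(x, y))),
    )


def perceptron_train(data, epochs=50):
    w = [0.0] * 8
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            y_hat = predict(w, x)
            if y_hat != y:
                # Additive update: toward the gold features, away from the prediction.
                gold, pred = phi(x, y), phi(x, y_hat)
                w = [wi + g - p for wi, g, p in zip(w, gold, pred)]
                mistakes += 1
        if mistakes == 0:  # every training example is decoded correctly
            break
    return w


# Toy separable data in which each tag simply equals its token.
data = [((0, 1, 0), (0, 1, 0)), ((1, 1, 0), (1, 1, 0)), ((0, 0, 1), (0, 0, 1))]
w = perceptron_train(data)
```

The update involves no regularization term or margin, which is the gap that HM-SVM addresses.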

Around the same time, Taskar et al. took the idea from Collins one step further and applied it to the training of general Markov random fields, which they called Max-Margin Markov Networks (M³N) [125]. The formulation of max-margin Markov networks is equivalent to the margin-rescaling loss penalization in structural SVMs. One major difference lies in the proposed training methods: M³N uses dual methods based on sequential minimal optimization (SMO) [102], while structural SVMs employ cutting plane algorithms. However, these training methods are still slow on large datasets due to the repeated use of inference algorithms such as Viterbi decoding to compute gradients.

To strike a balance between performance and training time, there has also been work on introducing the concept of margins into online learning of structured output models, notably in [91]. That work extends the MIRA algorithm for online learning [36] and improves performance on dependency parsing over the Collins perceptron algorithm.

Kernel Dependency Estimation

Kernel Dependency Estimation (KDE) [144] takes a completely different approach to structured output prediction. The use of general kernels such as tree or string alignment kernels allows us to apply the SVM framework to classify input structures such as trees and sequences. KDE extends this idea by including a kernel on the output space as well, which maps the output structure to a high-dimensional vector space. The learning task then becomes learning a mapping from the input kernel space to the output kernel space, which in their case was done using kernel PCA and regression. The major difficulty of this approach is that, since the output kernel maps an output structure to the output kernel space, a preimage problem needs to be solved when making a prediction. The advantage of this approach is that relatively simple methods such as kernel logistic regression can be used for learning the mapping from the input kernel space to the output kernel space, without the need to perform a large number of Viterbi decodings as in the case of CRFs or large-margin structured learning.

Search-based Structured Output Learning

Another distinct approach to structured output learning is to relate it to another main area of machine learning, namely reinforcement learning. In [38] Daumé et al. relate the search process during decoding in different structured output learning problems (Viterbi decoding, parsing, etc.) to exploration in a state space. They apply ideas from reinforcement learning to perform structured output learning, a method which they call SEARN. They learn a cost-sensitive classifier that decides what action to take at a particular state in the search process based on past decisions (for example, what tag to use at the (k + 1)th position after the first k tags are given in a left-to-right Viterbi decoding process). In contrast to models such as CRFs or structural SVMs, which perform discriminative parameter learning on graphical models [70], they model the search process directly and learn good parameters for the search function (policy) so that it terminates at outputs with low loss. The experimental results are competitive with models like CRFs and structural SVMs on many natural language processing tasks.