Structural SVMs with Latent Variables - Improved Learning Of Structural Support Vector Machines

How can we extend our structural SVM framework to handle such latent information? In particular we would like to have an extension that is general enough to include a diverse set of structured output prediction problems with latent information, and also efficient to train.

In the case of generative probabilistic models with latent variables, maximum likelihood estimation of a parameter vector $\theta$ involves maximizing the log-likelihood function

$$\log P_\theta(x, y, h),$$

which is usually non-concave. Here $x$ is the observed input features, $y$ is the observed output label, and $h$ is the latent variable. The optimization can be done via the Expectation-Maximization (EM) algorithm [40]. In the Expectation Step (E-step) we compute the expected log-likelihood under the marginal distribution $P_{\theta^{(t)}}(h \mid x, y)$ of the latent variable $h$, where $\theta^{(t)}$ is the current parameter vector:

$$Q(\theta \mid \theta^{(t)}) = E_{P_{\theta^{(t)}}(h \mid x, y)}\left[\log P_\theta(x, y, h)\right]. \qquad (5.1)$$

This fixes the marginal distribution of the latent variables and makes the maximization problem simpler. The Maximization Step (M-step) involves maximizing the function $Q$ to obtain a new parameter vector $\theta^{(t+1)}$:

$$\theta^{(t+1)} = \operatorname{argmax}_{\theta} Q(\theta \mid \theta^{(t)}). \qquad (5.2)$$

The EM algorithm alternates between the E-step and the M-step until convergence. Other methods for differentiable optimization such as gradient descent or conjugate gradient can also be applied with suitable initialization. Due to the non-concavity of the objective function these algorithms are only guaranteed to converge to a local maximum, and therefore having a good initialization strategy is very important.
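The alternation between the E-step and the M-step can be illustrated with a minimal sketch. The model below is an assumption chosen only for brevity and is not taken from the text: a two-component Gaussian mixture with unit variances and equal mixing weights, where the latent variable is the unobserved component of each point and the parameter vector consists of the two means. The function names `log_likelihood`, `em_step`, and `run_em` are hypothetical.

```python
import math

def log_likelihood(data, mu):
    """Log-likelihood of the data with the latent component summed out."""
    total = 0.0
    for v in data:
        p = sum(0.5 * math.exp(-0.5 * (v - m) ** 2) / math.sqrt(2 * math.pi)
                for m in mu)
        total += math.log(p)
    return total

def em_step(data, mu):
    """One EM iteration for the toy mixture (assumed model, see lead-in)."""
    # E-step: posterior probability of each component given each point,
    # computed under the current means mu (the role of theta^(t)).
    resp = []
    for v in data:
        weights = [math.exp(-0.5 * (v - m) ** 2) for m in mu]
        norm = sum(weights)
        resp.append([wk / norm for wk in weights])
    # M-step: maximize the expected log-likelihood Q; for this model the
    # maximizer is the posterior-weighted average of the points.
    new_mu = []
    for k in range(len(mu)):
        num = sum(r[k] * v for r, v in zip(resp, data))
        den = sum(r[k] for r in resp)
        new_mu.append(num / den)
    return new_mu

def run_em(data, mu, iters=20):
    """Alternate E-step and M-step, recording the log-likelihood each round."""
    history = [log_likelihood(data, mu)]
    for _ in range(iters):
        mu = em_step(data, mu)
        history.append(log_likelihood(data, mu))
    return mu, history

data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
mu, history = run_em(data, [-0.5, 0.5])
```

The recorded log-likelihood never decreases from one iteration to the next, but the means that EM settles on depend on the starting point, which is why initialization matters.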

How can we construct a general large-margin learning framework that can handle issues such as the non-convex objectives that come with the use of latent variables? To tackle this question let us start from the basic joint feature map and prediction. Recall that in structural SVM we are trying to learn a prediction function

$$f_w(x) = \operatorname{argmax}_{\hat{y} \in \mathcal{Y}} w \cdot \Phi(x, \hat{y})$$

parameterized by the weight vector $w$ (for simplicity let us restrict our discussion to linear structural SVMs for the moment). Just like the presence of the latent variable $h$ in the likelihood function in generative probabilistic models, we can extend our joint feature map to include the influence of latent variables in our prediction function:

$$f_w(x) = \operatorname{argmax}_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} w \cdot \Phi(x, \hat{y}, \hat{h}). \qquad (5.3)$$

By extending the joint feature map from $\Phi(x, y)$ to $\Phi(x, y, h)$, we make sure that latent variables such as word alignments in machine translation or object part positions in object recognition are taken into account when making predictions, even when they are not available in the training set. Notice that we are predicting a pair $(y, h)$ that jointly maximizes the linear scoring function $w \cdot \Phi(x, y, h)$. In a probabilistic model we have the option of integrating out the latent variable $h$ when predicting $y$ as

$$f_\theta(x) = \operatorname{argmax}_{\hat{y} \in \mathcal{Y}} \sum_{\hat{h} \in \mathcal{H}} P_\theta(x, \hat{y}, \hat{h}).$$

In our large-margin framework we replace the sum over latent variables with a point estimate $\hat{h}$ that maximizes the score of the input $x$ and the output $y$,

$$f_w(x) = \operatorname{argmax}_{\hat{y} \in \mathcal{Y}} \max_{\hat{h} \in \mathcal{H}} w \cdot \Phi(x, \hat{y}, \hat{h}), \qquad (5.4)$$

leading to the above joint prediction rule in Equation (5.3). This point-estimate approach fits much better with the large-margin framework and also avoids the computational issue of integrating out the latent variables.
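The joint prediction rule can be sketched by brute-force enumeration when the label set and the latent set are both small. Everything concrete here is an assumption made for illustration: the feature map `toy_feature_map` (which copies the entry of `x` selected by `h` into the slot for label `y`), the weights, and the input are hypothetical, and real applications would replace the enumeration with a problem-specific inference routine.

```python
from itertools import product

def dot(w, phi):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(w, phi))

def toy_feature_map(x, y, h, num_labels=2):
    """Hypothetical Phi(x, y, h): put the h-th entry of x in the slot for label y."""
    phi = [0.0] * num_labels
    phi[y] = x[h]
    return phi

def predict(w, x, labels, latents, feature_map=toy_feature_map):
    """Return the pair (y, h) that maximizes w . Phi(x, y, h), as in Equation (5.3)."""
    return max(product(labels, latents),
               key=lambda yh: dot(w, feature_map(x, yh[0], yh[1])))

x = [0.2, 0.9, -0.4]   # toy input
w = [1.0, -1.0]        # toy weight vector
y_hat, h_hat = predict(w, x, labels=[0, 1], latents=range(len(x)))
```

With these toy values the best pair is label 0 together with latent position 1, since that combination attains the highest score of 0.9. Keeping only `y_hat` gives the prediction of Equation (5.4).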

So far everything looks straightforward as we have done nothing apart from extending the joint feature map and modifying the output type to be a pair $(y, h)$. The interesting part comes when we consider how to formulate this as an optimization problem during training. Unlike the fully observed case where we know the label $y_i$ for each example $x_i$, in the latent variable case we only know the label part $y_i$ of the output $(y, h)$ for each example $x_i$, but have no information about the latent variable $h$ on the training data.

For a start, let us go back to look at the constraints in the standard structural SVM formulation in Optimization Problem 2.7:

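$$\forall \hat{y} \in \mathcal{Y}, \hat{y} \neq y_i: \quad w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, \hat{y}) \geq \Delta(y_i, \hat{y}) - \xi_i,$$

$$\xi_i \geq 0.$$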

The set of constraints essentially says that the score of the correct output $w \cdot \Phi(x_i, y_i)$ has to be greater than the score of every incorrect output $w \cdot \Phi(x_i, \hat{y})$, with a margin measured by the loss function $\Delta(y_i, \hat{y})$. We can also rewrite the above set of constraints in a more succinct way as:

$$\xi_i = \max_{\hat{y} \in \mathcal{Y}} \left[ w \cdot \Phi(x_i, \hat{y}) + \Delta(y_i, \hat{y}) \right] - w \cdot \Phi(x_i, y_i)$$

It is easy to see that $\xi_i$ is a convex function of $w$ in this form, since the first term is a maximum over linear functions and the second term is linear in $w$.
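The succinct form of the slack can be evaluated directly when the output set is small enough to enumerate. The sketch below assumes a multiclass setting with a hypothetical block feature map `multiclass_phi` and a 0/1 loss; neither is prescribed by the text. It also checks the convexity claim at one midpoint, which is a sanity check rather than a proof.

```python
def dot(w, phi):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(w, phi))

def multiclass_phi(x, y, num_labels):
    """Hypothetical Phi(x, y): place x in the block of coordinates for label y."""
    phi = [0.0] * (len(x) * num_labels)
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi

def zero_one_loss(y, y_hat):
    """Assumed loss Delta(y, y_hat): 0 if the labels agree, 1 otherwise."""
    return 0.0 if y == y_hat else 1.0

def slack(w, x, y, num_labels):
    """xi = max over y_hat of [w . Phi(x, y_hat) + Delta(y, y_hat)] - w . Phi(x, y)."""
    augmented = max(dot(w, multiclass_phi(x, y_hat, num_labels))
                    + zero_one_loss(y, y_hat)
                    for y_hat in range(num_labels))
    return augmented - dot(w, multiclass_phi(x, y, num_labels))

x, y = [1.0, 2.0], 0
w1 = [1.0, 0.0, 0.0, 0.0]
w2 = [0.0, 0.0, 1.0, 0.0]
w_mid = [0.5 * (a + b) for a, b in zip(w1, w2)]

s1, s2, s_mid = (slack(w, x, y, 2) for w in (w1, w2, w_mid))
```

For these toy weights the slack at the midpoint does not exceed the average of the slacks at the two endpoints, as convexity requires.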

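We can preserve the semantics of the above constraints from the standard structural SVM in the case with latent variables if we replace the linear scoring function $w \cdot \Phi(x, y)$ in the constraints with the function $\max_{h \in \mathcal{H}} w \cdot \Phi(x, y, h)$:

$$\forall \hat{y} \in \mathcal{Y}, \hat{y} \neq y_i: \quad \max_{h \in \mathcal{H}} w \cdot \Phi(x_i, y_i, h) - \max_{\hat{h} \in \mathcal{H}} w \cdot \Phi(x_i, \hat{y}, \hat{h}) \geq \Delta(y_i, \hat{y}) - \xi_i,$$

$$\xi_i \geq 0.$$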

There are two major points to note here. First, the new constraints with $\max_{h \in \mathcal{H}} w \cdot \Phi(x, y, h)$ fit well with the prediction rule in Equation (5.4), since we are using the latent variable $h$ that maximizes the score between the input $x$ and the output $y$ in prediction. Second, the optimization problem is now non-convex after we put in the standard structural SVM objective, due to the term $\max_{h \in \mathcal{H}} w \cdot \Phi(x_i, y_i, h)$; non-convexity is a common issue when working with latent variables. This is easy to see if we rewrite the constraints as follows:

$$\xi_i = \max_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \left[ w \cdot \Phi(x_i, \hat{y}, \hat{h}) + \Delta(y_i, \hat{y}) \right] - \max_{h \in \mathcal{H}} w \cdot \Phi(x_i, y_i, h)$$

The slack variable $\xi_i$ is the difference of two convex functions, which is in general not convex. We will discuss how to solve these issues in the next section on training algorithms.
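A small numerical example makes the loss of convexity concrete. The construction below is hypothetical and chosen only so that the arithmetic is easy to follow: a single training example with true label 0, a one-dimensional weight vector, two labels, two latent values, and a feature map `toy_phi` that returns +1 or -1 for the true label depending on the latent value and 0 for the other label. Under these assumptions the slack equals max(0, 1 - |w|).

```python
def dot(w, phi):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(w, phi))

def toy_phi(y, h):
    """Hypothetical one-dimensional Phi(x_i, y, h) for one fixed example x_i."""
    if y == 0:
        return [1.0] if h == 0 else [-1.0]
    return [0.0]

def loss(y, y_hat):
    """Assumed 0/1 loss Delta(y, y_hat)."""
    return 0.0 if y == y_hat else 1.0

def latent_slack(w, y_true=0, labels=(0, 1), latents=(0, 1)):
    """Difference of two convex functions of w, as in the rewritten constraint."""
    # First term: loss-augmented maximum over all (y_hat, h_hat) pairs.
    augmented = max(dot(w, toy_phi(y, h)) + loss(y_true, y)
                    for y in labels for h in latents)
    # Second term: best latent completion of the true label.
    completed = max(dot(w, toy_phi(y_true, h)) for h in latents)
    return augmented - completed

left = latent_slack([-1.0])
right = latent_slack([1.0])
middle = latent_slack([0.0])
```

The slack is 0 at w = -1 and at w = 1 but 1 at their midpoint w = 0, which violates the convexity inequality.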

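We can now write out our full optimization problem for structural SVM with latent variables, which we call Latent Structural SVM:

Optimization Problem 5.1. (LATENT STRUCTURAL SVM)

$$\min_{w, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$

$$\forall i, \quad \xi_i \geq \max_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \mathcal{H}} \left[ w \cdot \Phi(x_i, \hat{y}, \hat{h}) + \Delta(y_i, \hat{y}, \hat{h}) \right] - \max_{h \in \mathcal{H}} w \cdot \Phi(x_i, y_i, h)$$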
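For a fixed weight vector, the objective of Optimization Problem 5.1 can be evaluated by setting each slack to the smallest value its constraint allows. The sketch below does this by enumeration. The feature map `toy_phi`, the loss `toy_loss` (which happens to ignore the latent argument), the data, and the value of `C` are all hypothetical, and the code only evaluates the objective; it is not a training algorithm.

```python
def dot(w, phi):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(w, phi))

def toy_phi(x, y, h, num_labels=2):
    """Hypothetical Phi(x, y, h): put the h-th entry of x in the slot for label y."""
    phi = [0.0] * num_labels
    phi[y] = x[h]
    return phi

def toy_loss(y, y_hat, h_hat):
    """Assumed Delta(y, y_hat, h_hat): 0/1 loss on the label, ignoring h_hat."""
    return 0.0 if y == y_hat else 1.0

def latent_slack(w, x, y, labels, latents, phi, loss):
    """Smallest xi_i satisfying the constraint of Optimization Problem 5.1."""
    augmented = max(dot(w, phi(x, y_hat, h_hat)) + loss(y, y_hat, h_hat)
                    for y_hat in labels for h_hat in latents)
    completed = max(dot(w, phi(x, y, h)) for h in latents)
    return augmented - completed

def objective(w, examples, labels, latents, phi, loss, C):
    """0.5 * ||w||^2 + C * sum of slacks, with every slack at its lower bound."""
    regularizer = 0.5 * dot(w, w)
    slacks = sum(latent_slack(w, x, y, labels, latents, phi, loss)
                 for x, y in examples)
    return regularizer + C * slacks

examples = [([0.2, 0.9, -0.4], 0), ([-1.0, 0.5, 0.1], 1)]
w = [1.0, -1.0]
value = objective(w, examples, labels=[0, 1], latents=range(3),
                  phi=toy_phi, loss=toy_loss, C=1.0)
```

Because the toy loss is zero whenever the predicted label matches the true label, each slack computed this way is non-negative.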

Notice we have made the formulation more general by including the latent variable $\hat{h}$ of the alternative output $\hat{y}$ in the loss $\Delta$. This does not change the formulation or the training algorithm in any major manner, but will prove useful in two of the example applications below. It is also easy to see that this formulation reduces to the standard structural SVM if we remove the latent variables.
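The reduction to the standard structural SVM can be spelled out as follows. One way to model "removing the latent variables" (an assumption of this sketch) is to take $\mathcal{H} = \{h_0\}$ to be a single fixed value and to write $\Phi(x, y) := \Phi(x, y, h_0)$ and $\Delta(y_i, \hat{y}) := \Delta(y_i, \hat{y}, h_0)$.

```latex
\begin{align*}
\xi_i &\geq \max_{(\hat{y}, \hat{h}) \in \mathcal{Y} \times \{h_0\}}
        \left[ w \cdot \Phi(x_i, \hat{y}, \hat{h}) + \Delta(y_i, \hat{y}, \hat{h}) \right]
        - \max_{h \in \{h_0\}} w \cdot \Phi(x_i, y_i, h) \\
      &= \max_{\hat{y} \in \mathcal{Y}}
        \left[ w \cdot \Phi(x_i, \hat{y}) + \Delta(y_i, \hat{y}) \right]
        - w \cdot \Phi(x_i, y_i).
\end{align*}
```

Both maxima over the latent variable collapse to a single term, so the constraint coincides with the succinct form of the standard structural SVM constraints given earlier, and the subtracted term becomes linear in $w$, which restores convexity.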

In document Improved Learning Of Structural Support Vector Machines: Training With Latent Variables And Nonlinear Kernels (Page 104-109)