6.3 Spline Models
6.4.1 Support Vector Classification
For support vector classifiers, the key notion that we need to introduce is that of the maximum margin hyperplane for a linear classifier. Then by using the “kernel trick” this can be lifted into feature space. We consider first the separable case and then the non-separable case. We conclude this section with a comparison between GP classifiers and SVMs.
142 Relationships between GPs and Other Models
The Separable Case
Figure 6.2(a) illustrates the case where the data is linearly separable. For a linear classifier with weight vector $\mathbf{w}$ and offset $w_0$, let the decision boundary be defined by $\mathbf{w} \cdot \mathbf{x} + w_0 = 0$, and let $\tilde{\mathbf{w}} = (\mathbf{w}, w_0)$. Clearly, there is a whole version space of weight vectors that give rise to the same classification of the training points. The SVM algorithm chooses a particular weight vector, namely the one that gives rise to the “maximum margin” of separation.
Let the training set be pairs of the form $(\mathbf{x}_i, y_i)$ for $i = 1, \ldots, n$, where $y_i = \pm 1$. For a given weight vector we can compute the quantity $\tilde{\gamma}_i = y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0)$, which is known as the functional margin. Notice that $\tilde{\gamma}_i > 0$ if a training point is correctly classified.
If the equation $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + w_0$ defines a discriminant function (so that the output is $\mathrm{sgn}(f(\mathbf{x}))$), then the hyperplane $c\mathbf{w} \cdot \mathbf{x} + c w_0$ defines the same discriminant function for any $c > 0$. Thus we have the freedom to choose the scaling of $\tilde{\mathbf{w}}$ so that $\min_i \tilde{\gamma}_i = 1$, and in this case $\tilde{\mathbf{w}}$ is known as the canonical form of the hyperplane.
The geometrical margin is defined as $\gamma_i = \tilde{\gamma}_i / |\mathbf{w}|$. For a training point $\mathbf{x}_i$ that is correctly classified this is simply the distance from $\mathbf{x}_i$ to the hyperplane. To see this, let $c = 1/|\mathbf{w}|$ so that $\hat{\mathbf{w}} = \mathbf{w}/|\mathbf{w}|$ is a unit vector in the direction of $\mathbf{w}$, and $\hat{w}_0$ is the corresponding offset. Then $\hat{\mathbf{w}} \cdot \mathbf{x}$ computes the length of the projection of $\mathbf{x}$ onto the direction orthogonal to the hyperplane and $\hat{\mathbf{w}} \cdot \mathbf{x} + \hat{w}_0$ computes the distance to the hyperplane. For training points that are misclassified the geometrical margin is the negative distance to the hyperplane. The geometrical margin for a dataset $\mathcal{D}$ is defined as $\gamma_{\mathcal{D}} = \min_i \gamma_i$. Thus for a canonical separating hyperplane the margin is $1/|\mathbf{w}|$. We wish to find the maximum margin hyperplane, i.e. the one that maximizes $\gamma_{\mathcal{D}}$.
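The margin definitions above can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up four-point toy dataset; the function names (`functional_margins`, `to_canonical`, `geometrical_margins`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def functional_margins(w, w0, X, y):
    """Functional margins y_i (w . x_i + w0), one per training point."""
    return y * (X @ w + w0)

def to_canonical(w, w0, X, y):
    """Rescale (w, w0) by c > 0 so that the smallest functional margin is 1.

    Assumes the hyperplane separates the data (all functional margins > 0).
    """
    m = functional_margins(w, w0, X, y).min()
    if m <= 0:
        raise ValueError("hyperplane does not separate the data")
    return w / m, w0 / m

def geometrical_margins(w, w0, X, y):
    """Geometrical margins: functional margins divided by |w|."""
    return functional_margins(w, w0, X, y) / np.linalg.norm(w)

# Toy data (an assumption for illustration): two points per class on a line.
X = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, w0 = np.array([3.0, 0.0]), 0.0        # an arbitrary separating hyperplane
w_c, w0_c = to_canonical(w, w0, X, y)    # the same hyperplane in canonical form
margin_D = geometrical_margins(w_c, w0_c, X, y).min()
```

Rescaling changes the functional margins but leaves the geometrical margins untouched, and for the canonical form the dataset margin `margin_D` coincides with `1 / np.linalg.norm(w_c)`.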
By considering canonical hyperplanes, we are thus led to the following optimization problem to determine the maximum margin hyperplane:
$$\text{minimize } \tfrac{1}{2}|\mathbf{w}|^2 \text{ over } \mathbf{w}, w_0 \quad \text{subject to } y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) \geq 1 \text{ for all } i = 1, \ldots, n. \tag{6.33}$$
It is clear by considering the geometry that for the maximum margin solution there will be at least one data point in each class for which $y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) = 1$, see Figure 6.2(b). Let the hyperplanes that pass through these points be denoted $H_+$ and $H_-$ respectively.
This constrained optimization problem can be set up using Lagrange multipliers, and solved using numerical methods for quadratic programming⁴ (QP) problems. The form of the solution is
$$\mathbf{w} = \sum_i \lambda_i y_i \mathbf{x}_i, \tag{6.34}$$
⁴ A quadratic programming problem is an optimization problem where the objective function is quadratic and the constraints are linear in the unknowns.
6.4 Support Vector Machines 143
where the $\lambda_i$'s are non-negative Lagrange multipliers. Notice that the solution is a linear combination of the $\mathbf{x}_i$'s.
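For readers who want the intermediate steps, the route from problem (6.33) to the form (6.34) can be sketched with the standard Lagrangian argument. This derivation is not spelled out in the text; it is the usual textbook construction, written in the notation used here.

```latex
% Lagrangian of (6.33), with one multiplier \lambda_i \ge 0 per constraint
L(\mathbf{w}, w_0, \boldsymbol{\lambda})
  = \tfrac{1}{2}|\mathbf{w}|^2
  - \sum_{i=1}^n \lambda_i \bigl[ y_i(\mathbf{w}\cdot\mathbf{x}_i + w_0) - 1 \bigr]

% Stationarity with respect to \mathbf{w} and w_0
\frac{\partial L}{\partial \mathbf{w}} = 0
  \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^n \lambda_i y_i \mathbf{x}_i ,
\qquad
\frac{\partial L}{\partial w_0} = 0
  \;\Rightarrow\; \sum_{i=1}^n \lambda_i y_i = 0

% Substituting back gives the dual QP in the multipliers alone
\text{maximize } \sum_{i=1}^n \lambda_i
  - \tfrac{1}{2}\sum_{i,j=1}^n \lambda_i \lambda_j y_i y_j (\mathbf{x}_i\cdot\mathbf{x}_j)
\quad \text{subject to } \lambda_i \ge 0, \;\; \sum_{i=1}^n \lambda_i y_i = 0

% Complementary slackness at the optimum
\lambda_i \bigl[ y_i(\mathbf{w}\cdot\mathbf{x}_i + w_0) - 1 \bigr] = 0
  \quad \text{for all } i
```

The complementary slackness condition says that $\lambda_i$ can be non-zero only when the constraint for point $i$ holds with equality.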
The key feature of equation (6.34) is that $\lambda_i$ is zero for every $\mathbf{x}_i$ except those which lie on the hyperplanes $H_+$ or $H_-$; these points are called the support vectors. The fact that not all of the training points contribute to the final solution is referred to as the sparsity of the solution. The support vectors lie closest to the decision boundary. Note that if all of the other training points were removed (or moved around, but not crossing $H_+$ or $H_-$) the same maximum margin hyperplane would be found. The quadratic programming problem for finding the $\lambda_i$'s is convex, i.e. there are no local minima. Notice the similarity of this to the convexity of the optimization problem for Gaussian process classifiers, as described in section 3.4.
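To make predictions for a new input $\mathbf{x}_*$ we compute
$$\mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}_* + w_0) = \mathrm{sgn}\Bigl(\sum_{i=1}^n \lambda_i y_i (\mathbf{x}_i \cdot \mathbf{x}_*) + w_0\Bigr). \tag{6.35}$$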
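Eq. (6.35) can be evaluated directly from the multipliers. The sketch below assumes NumPy and a made-up toy dataset whose multipliers were worked out by hand (they satisfy the constraints of (6.33) with equality on two points); `svm_predict` and `linear_kernel` are hypothetical names, and the `kernel` argument is an illustrative hook showing where a different inner product could be substituted.

```python
import numpy as np

def linear_kernel(a, b):
    """Plain inner product between two input vectors."""
    return float(np.dot(a, b))

def svm_predict(x_star, X, y, lam, w0, kernel=linear_kernel):
    """Evaluate sgn(sum_i lam_i y_i k(x_i, x_star) + w0), cf. eq. (6.35).

    By convention of this sketch, an output of exactly zero maps to +1.
    """
    f = w0
    for lam_i, y_i, x_i in zip(lam, y, X):
        if lam_i > 0:                      # only support vectors contribute
            f += lam_i * y_i * kernel(x_i, x_star)
    return 1 if f >= 0 else -1

# Toy data (an assumption for illustration) and hand-derived multipliers.
X = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = np.array([0.5, 0.0, 0.5, 0.0])       # non-zero only for the two closest points
w0 = 0.0

w = (lam * y) @ X                          # eq. (6.34): w = sum_i lam_i y_i x_i
pred_pos = svm_predict(np.array([3.0, 1.0]), X, y, lam, w0)
pred_neg = svm_predict(np.array([-0.5, 2.0]), X, y, lam, w0)
```

Here `w` comes out as the unit vector along the first axis, and the two test points fall on opposite sides of the decision boundary.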
In the QP problem and in eq. (6.35) the training points $\{\mathbf{x}_i\}$ and the test point $\mathbf{x}_*$ enter the computations only in terms of inner products. Thus by using the kernel trick we can replace occurrences of the inner product by the kernel to obtain the equivalent result in feature space.
The Non-Separable Case
For linear classifiers in the original $\mathbf{x}$ space there will be some datasets that are not linearly separable. One way to generalize the SVM problem in this case is to allow violations of the constraint $y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) \geq 1$ but to impose a penalty when this occurs. This leads to the soft margin support vector machine problem, the minimization of
$$\tfrac{1}{2}|\mathbf{w}|^2 + C \sum_{i=1}^n (1 - y_i f_i)_+ \tag{6.36}$$
with respect to $\mathbf{w}$ and $w_0$, where $f_i = f(\mathbf{x}_i) = \mathbf{w} \cdot \mathbf{x}_i + w_0$ and $(z)_+ = z$ if $z > 0$ and 0 otherwise. Here $C > 0$ is a parameter that specifies the relative importance of the two terms. This convex optimization problem can again be solved using QP methods and yields a solution of the form given in eq. (6.34). In this case the support vectors (those with $\lambda_i \neq 0$) are not only those data points which lie on the separating hyperplanes, but also those that incur penalties. This can occur in two ways: (i) the data point falls in between $H_+$ and $H_-$ but on the correct side of the decision surface, or (ii) the data point falls on the wrong side of the decision surface.
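To make the penalty term concrete, the following sketch evaluates the objective (6.36) for a fixed hyperplane and flags the two kinds of penalized points. It assumes NumPy and a made-up six-point dataset; the hyperplane used is chosen by hand and is not claimed to be the minimizer, and `soft_margin_objective` is a hypothetical name.

```python
import numpy as np

def soft_margin_objective(w, w0, X, y, C):
    """Return the value of eq. (6.36) and the per-point penalties (1 - y_i f_i)_+."""
    f = X @ w + w0
    penalties = np.maximum(0.0, 1.0 - y * f)
    return 0.5 * np.dot(w, w) + C * penalties.sum(), penalties

# Toy data (an assumption for illustration): the last two points violate the margin.
X = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0],
              [0.5, 0.0], [-0.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0])

w, w0, C = np.array([1.0, 0.0]), 0.0, 2.0   # hand-picked, not optimized
value, penalties = soft_margin_objective(w, w0, X, y, C)

yf = y * (X @ w + w0)
inside_margin = (yf > 0) & (yf < 1)   # case (i): between H+ and H-, correct side
wrong_side = yf < 0                   # case (ii): wrong side of the decision surface
```

For this configuration the fifth point incurs a penalty of 0.5 under case (i) and the sixth a penalty of 1.5 under case (ii), giving an objective value of 0.5 + 2 × 2.0 = 4.5.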
In a feature space of dimension $N$, if $N > n$ then there will always be a separating hyperplane. However, this hyperplane may not give rise to good generalization performance, especially if some of the labels are incorrect, and thus the soft margin SVM formulation is often used in practice.
[Figure 6.3, panel (a): the curves $\log(1 + \exp(-z))$, $-\log \Phi(z)$ and $\max(1 - z, 0)$, with $z$ on the horizontal axis and $g(z)$ on the vertical axis.]
Figure 6.3: (a) A comparison of the hinge error, $g_\lambda$ and $g_\Phi$. (b) The $\epsilon$-insensitive error function used in SVR.
For both the hard and soft margin SVM QP problems a wide variety of algorithms have been developed for their solution; see Schölkopf and Smola [2002, ch. 10] for details. Basic interior point methods involve inversions of $n \times n$ matrices and thus scale as $O(n^3)$, as with Gaussian process prediction. However, there are other algorithms, such as the sequential minimal optimization (SMO) algorithm due to Platt [1999], which often have better scaling in practice.
Above we have described SVMs for the two-class (binary) classification problem. There are many ways of generalizing SVMs to the multi-class problem, see Schölkopf and Smola [2002, sec. 7.6] for further details.
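Comparing Support Vector and Gaussian Process Classifiers
For the soft margin classifier we obtain a solution of the form $\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i$ (with $\alpha_i = \lambda_i y_i$) and thus $|\mathbf{w}|^2 = \sum_{i,j} \alpha_i \alpha_j (\mathbf{x}_i \cdot \mathbf{x}_j)$. Kernelizing this we obtain $|\mathbf{w}|^2 = \boldsymbol{\alpha}^\top K \boldsymbol{\alpha} = \mathbf{f}^\top K^{-1} \mathbf{f}$, as⁵ $K\boldsymbol{\alpha} = \mathbf{f}$. Thus the soft margin objective function can be written as
$$\tfrac{1}{2}\mathbf{f}^\top K^{-1} \mathbf{f} + C \sum_{i=1}^n (1 - y_i f_i)_+. \tag{6.37}$$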
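The identity $\boldsymbol{\alpha}^\top K \boldsymbol{\alpha} = \mathbf{f}^\top K^{-1} \mathbf{f}$ used above follows from $K\boldsymbol{\alpha} = \mathbf{f}$ and can be confirmed numerically. This is a small sketch assuming NumPy, random inputs, and a squared exponential kernel with a little diagonal jitter; the kernel choice and the helper name `se_kernel_matrix` are assumptions made only for the check.

```python
import numpy as np

def se_kernel_matrix(X, lengthscale=1.0):
    """Squared exponential kernel matrix (an illustrative choice of kernel)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                  # five random 2-d inputs
K = se_kernel_matrix(X) + 1e-6 * np.eye(5)   # jitter keeps K well conditioned

alpha = rng.normal(size=5)                   # arbitrary coefficients
f = K @ alpha                                # K alpha = f

lhs = alpha @ K @ alpha                      # alpha^T K alpha
rhs = f @ np.linalg.solve(K, f)              # f^T K^{-1} f, without forming K^{-1}
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical convenience; the two sides agree up to floating point error.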
For the binary GP classifier, to obtain the MAP value $\hat{\mathbf{f}}$ of $p(\mathbf{f}|\mathbf{y})$ we minimize the quantity
$$\tfrac{1}{2}\mathbf{f}^\top K^{-1} \mathbf{f} - \sum_{i=1}^n \log p(y_i|f_i), \tag{6.38}$$
cf. eq. (3.12). (The final two terms in eq. (3.12) are constant if the kernel is fixed.)
⁵ Here the offset $w_0$ has been absorbed into the kernel so it is not an explicit extra parameter.
For log-concave likelihoods (such as those derived from the logistic or probit response functions) there is a strong similarity between the two optimization problems in that they are both convex. Let $g_\lambda(z) \triangleq \log(1 + e^{-z})$, $g_\Phi(z) = -\log \Phi(z)$, and $g_{\mathrm{hinge}}(z) \triangleq (1 - z)_+$ where $z = y_i f_i$. We refer to $g_{\mathrm{hinge}}$ as the hinge error function, due to its shape. As shown in Figure 6.3(a) all three data fit terms are monotonically decreasing functions of $z$. All three functions tend to infinity as $z \to -\infty$ and decay to zero as $z \to \infty$. The key difference is that the hinge function takes on the value 0 for $z \geq 1$, while the other two just decay slowly. It is this flat part of the hinge function that gives rise to the sparsity of the SVM solution.
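The three data fit terms are easy to tabulate side by side. The sketch below uses only the Python standard library; the function names are hypothetical, $\Phi$ is computed through the complementary error function, and the grid is kept to a modest range because the naive formulas underflow or overflow for extreme $z$.

```python
import math

def g_logistic(z):
    """log(1 + exp(-z)), the negative log of the logistic likelihood."""
    return math.log1p(math.exp(-z))

def g_probit(z):
    """-log Phi(z), with Phi(z) = 0.5 * erfc(-z / sqrt(2))."""
    return -math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))

def g_hinge(z):
    """(1 - z)_+, the hinge error."""
    return max(1.0 - z, 0.0)

# Evaluate on a grid from -4 to 4 in steps of 0.5.
zs = [-4.0 + 0.5 * k for k in range(17)]
table = [(z, g_logistic(z), g_probit(z), g_hinge(z)) for z in zs]

for z, gl, gp, gh in table:
    print(f"z={z:5.1f}  logistic={gl:8.5f}  probit={gp:8.5f}  hinge={gh:4.1f}")
```

On this grid the hinge column is exactly zero from $z = 1$ onwards, whereas the logistic and probit columns stay strictly positive.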
Thus there is a close correspondence between the MAP solution of the GP classifier and the SVM solution. Can this correspondence be made closer by considering the hinge function as a negative log likelihood? The answer to this is no [Seeger, 2000, Sollich, 2002]. If $C g_{\mathrm{hinge}}(z)$ defined a negative log likelihood, then $\exp(-C g_{\mathrm{hinge}}(f)) + \exp(-C g_{\mathrm{hinge}}(-f))$ should be a constant independent of $f$, but this is not the case. To see this, consider the quantity
$$\nu(f; C) = \kappa(C)\bigl[\exp(-C(1 - f)_+) + \exp(-C(1 + f)_+)\bigr]. \tag{6.39}$$
$\kappa(C)$ cannot be chosen so as to make $\nu(f; C) = 1$ independent of the value of $f$ for $C > 0$. By comparison, for the logistic and probit likelihoods the analogous expression is equal to 1. Sollich [2002] suggests choosing $\kappa(C) = 1/(1 + \exp(-2C))$ which ensures that $\nu(f; C) \leq 1$ (with equality only when $f = \pm 1$). He also gives an ingenious interpretation (involving a “don’t know” class to soak up the unassigned probability mass) that does yield the SVM solution as the MAP solution to a certain Bayesian problem, although we find this construction rather contrived. Exercise 6.7.2 invites you to plot $\nu(f; C)$ as a function of $f$ for various values of $C$.
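The behaviour of $\nu(f; C)$ can be inspected numerically before plotting it. This is a minimal sketch using only the standard library; `nu` and `pos` are hypothetical helper names, and the default normalizer is the $\kappa(C) = 1/(1 + \exp(-2C))$ quoted above.

```python
import math

def pos(z):
    """(z)_+ : z if z > 0, else 0."""
    return z if z > 0 else 0.0

def nu(f, C, kappa=None):
    """Eq. (6.39): kappa(C) [exp(-C (1 - f)_+) + exp(-C (1 + f)_+)]."""
    if kappa is None:
        kappa = 1.0 / (1.0 + math.exp(-2.0 * C))
    return kappa * (math.exp(-C * pos(1.0 - f)) + math.exp(-C * pos(1.0 + f)))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

C = 1.0
fs = [-3.0 + 0.25 * k for k in range(25)]            # grid from -3 to 3
nus = [nu(f, C) for f in fs]                         # varies with f
logistic_sums = [sigmoid(f) + sigmoid(-f) for f in fs]   # analogous sum, always 1
```

With this normalizer the values in `nus` reach 1 at $f = \pm 1$ and fall below 1 elsewhere, while the logistic sum is 1 at every grid point.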
One attraction of the GP classifier is that it produces an output with a clear probabilistic interpretation, a prediction for $p(y = +1|\mathbf{x})$. One can try to interpret the function value $f(\mathbf{x})$ output by the SVM probabilistically, and Platt [2000] suggested that probabilistic predictions can be generated from the SVM by computing $\sigma(a f(\mathbf{x}) + b)$ for some constants $a$, $b$ that are fitted using some “unbiased version” of the training set (e.g. using cross-validation). One disadvantage of this rather ad hoc procedure is that unlike the GP classifiers it does not take into account the predictive variance of $f(\mathbf{x})$ (cf. eq. (3.25)). Seeger [2003, sec. 4.7.2] shows that better error-reject curves can be obtained on an experiment using the MNIST digit classification problem when the effect of this uncertainty is taken into account.
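As a rough illustration of the $\sigma(a f(\mathbf{x}) + b)$ mapping, the sketch below fits $a$ and $b$ by plain gradient descent on the negative log likelihood of held-out SVM outputs. It assumes NumPy and made-up held-out values; it is a simplification for illustration and does not reproduce the specific optimizer or target adjustments of Platt's procedure. The name `fit_sigmoid` is hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_sigmoid(f, y, n_iter=2000, lr=0.1):
    """Fit a, b in sigma(a f + b) by gradient descent on the negative log likelihood.

    f: SVM outputs on held-out data; y: labels in {-1, +1}.
    """
    t = (y + 1.0) / 2.0                  # map labels to {0, 1}
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = sigmoid(a * f + b)
        a -= lr * np.mean((p - t) * f)   # gradient of the mean NLL w.r.t. a
        b -= lr * np.mean(p - t)         # gradient of the mean NLL w.r.t. b
    return a, b

# Made-up held-out SVM outputs and labels (not linearly separable in f).
f_held = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 0.3, -0.3])
y_held = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0, -1.0, 1.0])

a, b = fit_sigmoid(f_held, y_held)
p_high = sigmoid(a * 2.0 + b)    # estimated p(y = +1) for a confident positive output
p_low = sigmoid(a * -2.0 + b)    # and for a confident negative output
```

The fitted map depends only on the scalar $f(\mathbf{x})$, so two inputs with the same SVM output receive the same probability regardless of how uncertain the underlying function value is.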