Equation (15.4) and Equation (15.5) are equivalent

Support Vector Machines

Claim 15.5. Equation (15.4) and Equation (15.5) are equivalent

Proof. Fix some w,b and consider the minimization over ξ in Equation (15.4).

Fix some i . Since ξi must be nonnegative, the best assignment to ξi would be 0 if yi(w,xi + b) ≥ 1 and would be 1 − yi(w,xi + b) otherwise. In other words, ξi= ^hinge((w,b),(xi, yi)) for all i , and the claim follows.

We therefore see that Soft-SVM falls into the paradigm of regularized loss min-imization that we studied in the previous chapter. A Soft-SVM algorithm, that is, a solution for Equation (15.5), has a bias toward low norm separators. The objective function that we aim to minimize in Equation (15.5) penalizes not only for training errors but also for large norm.

It is often more convenient to consider Soft-SVM for learning a homogenous halfspace, where the bias term b is set to be zero, which yields the following optimization problem:

minw

λw²+ L^hinge_S (w)

, (15.6)

where

L^hinge_S (w)= 1 m

m i=1

max{0,1 − yw,xi}.

15.2.1 The Sample Complexity of Soft-SVM

We now analyze the sample complexity of Soft-SVM for the case of homogenous halfspaces (namely, the output of Equation (15.6)). In Corollary13.8 we derived a generalization bound for the regularized loss minimization framework assuming that the loss function is convex and Lipschitz. We have already shown that the hinge loss is convex so it is only left to analyze the Lipschitzness of the hinge loss.

Claim 15.6. Let f (w)= max{0,1 − yw,x}. Then, f is x-Lipschitz.

Proof. It is easy to verify that any subgradient of f at w is of the formαx where

|α| ≤ 1. The claim now follows from Lemma14.7.

Corollary13.8therefore yields the following:

Corollary 15.7. Let D be a distribution over X × {0,1}, where X = {x : x ≤ ρ}.

Consider running Soft-SVM (Equation (15.6)) on a training set S∼ D^mand let A(S) be the solution of Soft-SVM. Then, for every u,

S∼ED^m[L^hinge_D (A(S))] ≤ L^hinge_D (u)+ λu²+2ρ² λm.

15.2 Soft-SVM and Norm Regularization 173

Furthermore, since the hinge loss upper bounds the 0−1 loss we also have

S∼ED^m[L⁰⁻¹_D (A(S))] ≤ L^hinge_D (u)+ λu²+2ρ² λm. Last, for every B> 0, if we set λ =

2ρ² B²m then

S∼ED^m[L⁰_D⁻¹(A(S))] ≤ E

S∼D^m[L^hinge_D (A(S))] ≤ min

w:w≤BL^hinge_D (w)+

* 8ρ²B²

m . We therefore see that we can control the sample complexity of learning a half-space as a function of the norm of that halfhalf-space, independently of the Euclidean dimension of the space over which the halfspace is defined. This becomes highly significant when we learn via embeddings into high dimensional feature spaces, as we will consider in the next chapter.

Remark 15.2. The condition thatX will contain vectors with a bounded norm fol-lows from the requirement that the loss function will be Lipschitz. This is not just a technicality. As we discussed before, separation with large margin is meaningless without imposing a restriction on the scale of the instances. Indeed, without a con-straint on the scale, we can always enlarge the margin by multiplying all instances by a large scalar.

15.2.2 Margin and Norm-Based Bounds versus Dimension

The bounds we have derived for Hard-SVM and Soft-SVM do not depend on the dimension of the instance space. Instead, the bounds depend on the norm of the examples,ρ, the norm of the halfspace B (or equivalently the margin parameter γ ) and, in the nonseparable case, the bounds also depend on the minimum hinge loss of all halfspaces of norm≤ B. In contrast, the VC-dimension of the class of homogenous halfspaces is d, which implies that the error of an ERM hypothesis decreases as√

d/m does. We now give an example in which ρ²B²& d; hence the bound given in Corollary15.7is much better than the VC bound.

Consider the problem of learning to classify a short text document according to its topic, say, whether the document is about sports or not. We first need to represent documents as vectors. One simple yet effective way is to use a bag-of-words representation. That is, we define a dictionary of bag-of-words and set the dimension d to be the number of words in the dictionary. Given a document, we represent it as a vector x∈ {0,1}^d, where xi = 1 if the i’th word in the dictionary appears in the document and xi= 0 otherwise. Therefore, for this problem, the value of ρ²will be the maximal number of distinct words in a given document.

A halfspace for this problem assigns weights to words. It is natural to assume that by assigning positive and negative weights to a few dozen words we will be able to determine whether a given document is about sports or not with reasonable accuracy. Therefore, for this problem, the value of B²can be set to be less than 100.

Overall, it is reasonable to say that the value of B²ρ²is smaller than 10,000.

On the other hand, a typical size of a dictionary is much larger than 10,000. For example, there are more than 100,000 distinct words in English. We have therefore

174 Support Vector Machines

shown a problem in which there can be an order of magnitude difference between learning a halfspace with the SVM rule and learning a halfspace using the vanilla ERM rule.

Of course, it is possible to construct problems in which the SVM bound will be worse than the VC bound. When we use SVM, we in fact introduce another form of inductive bias – we prefer large margin halfspaces. While this inductive bias can significantly decrease our estimation error, it can also enlarge the approximation error.

15.2.3 The Ramp Loss*

The margin-based bounds we have derived in Corollary 15.7rely on the fact that we minimize the hinge loss. As we have shown in the previous subsection, the term

ρ²B²/m can be much smaller than the corresponding term in the VC bound,

√d/m. However, the approximation error in Corollary15.7is measured with respect to the hinge loss while the approximation error in VC bounds is measured with respect to the 0−1 loss. Since the hinge loss upper bounds the 0−1 loss, the approx-imation error with respect to the 0−1 loss will never exceed that of the hinge loss.

It is not possible to derive bounds that involve the estimation error term

ρ²B²/m for the 0−1 loss. This follows from the fact that the 0−1 loss is scale insensitive, and therefore there is no meaning to the norm of w or its margin when we measure error with the 0−1 loss. However, it is possible to define a loss function that on one hand it is scale sensitive and thus enjoys the estimation error

ρ²B²/m while on the other hand it is more similar to the 0−1 loss. One option is the ramp loss, defined as

^ramp(w,(x, y)) = min{1,^hinge(w,(x, y))} = min{1, max{0,1 − yw,x}}.

The ramp loss penalizes mistakes in the same way as the 0−1 loss and does not penalize examples that are separated with margin. The difference between the ramp loss and the 0−1 loss is only with respect to examples that are correctly classified but not with a significant margin. Generalization bounds for the ramp loss are given in the advanced part of this book.

y〈w, x〉

^{0 − 1} ^ramp

^hinge

1 1

15.4 Duality 175

The reason SVM relies on the hinge loss and not on the ramp loss is that the hinge loss is convex and, therefore, from the computational point of view, min-imizing the hinge loss can be performed efficiently. In contrast, the problem of minimizing the ramp loss is computationally intractable.

15.3 OPTIMALITY CONDITIONS AND “SUPPORT VECTORS”*

The name “Support Vector Machine” stems from the fact that the solution of hard-SVM, w0, is supported by (i.e., is in the linear span of) the examples that are exactly at distance 1/w0 from the separating hyperplane. These vectors are therefore called support vectors. To see this, we rely on Fritz John optimality conditions.

Theorem 15.8. Let w0be as defined in Equation (15.3) and let I = {i : |w0,xi| = 1}.

Then, there exist coefficientsα1,...,αm such that w0=

i∈I

αixi.

The examples{xi: i∈ I} are called support vectors.

The proof of this theorem follows by applying the following lemma to Equation (15.3).

Lemma 15.9 (Fritz John). Suppose that w∈ argmin

Historically, many of the properties of SVM have been obtained by considering the dual of Equation (15.3). Our presentation of SVM does not rely on duality. For completeness, we present in the following how to derive the dual of Equation (15.3).

We start by rewriting the problem in an equivalent form as follows. Consider the function

We can therefore rewrite Equation (15.3) as minw

w²+ g(w)

. (15.7)

Rearranging the preceding we obtain that Equation (15.3) can be rewritten as the problem

176 Sup p o rt Vector M a chines

Now suppose that we flip the order of min and max in the equation. This can only decrease the objective value (see Exercise 15.4), and we have

minw max

The preceding inequality is called weak duality. It turns out that in our case, strong duality also holds; namely, the inequality holds with equality. Therefore, the dual problem is

We can simplify the dual problem by noting that once α is fixed, the optimization problem with respect to w is unconstrained and the objective is differentiable; thus, at the optimum, the gradient equals zero:

w−

This shows us that the solution must be in the linear span of the examples, a fact we will use later to derive SVM with kernels. Plugging the preceding into Equation (15.9) we obtain that the dual problem can be rewritten as

α∈Rmax^m:α≥0

Note that the dual problem only involves inner products between instances and does not require direct access to specific elements within an instance. This prop-erty is important when implementing SVM with kernels, as we will discuss in the next chapter.

15.5 IMPLEMENTING SOFT-SVM USING SGD

In this section we describe a very simple algorithm for solving the optimization problem of Soft-SVM, namely,

We rely on the SGD framework for solving regularized loss minimization problems, as described in Section14.5.3.

15.7 Bibliographic Remarks 177

Recall that, on the basis of Equation (14.15), we can rewrite the update rule of SGD as

w^(t+1)= − 1 λt

t j=1

v_j,

where vj is a subgradient of the loss function at w^{( j )}on the random example chosen at iteration j . For the hinge loss, given an example (x, y), we can choose vj to be 0 if yw^{( j )},x ≥ 1 and vj= −y x otherwise (see Example14.2). Denotingθ^{(t )}= −

j<tvj

we obtain the following procedure.

SGD for Solving Soft-SVM goal: Solve Equation (15.12)

parameter: T initialize:θ⁽¹⁾= 0 for t= 1,...,T

Let w^{(t )}=_λt¹θ^{(t )}

Choose i uniformly at random from [m]

If (yiw^{(t )},xi < 1) Setθ^(t⁺¹⁾= θ^{(t )}+ yixi

Else

Setθ^(t⁺¹⁾= θ^{(t )} output: ¯w =_T¹T

t=1w^{(t )}

15.6 SUMMARY

SVM is an algorithm for learning halfspaces with a certain type of prior knowledge, namely, preference for large margin. Hard-SVM seeks the halfspace that separates the data perfectly with the largest margin, whereas soft-SVM does not assume sep-arability of the data and allows the constraints to be violated to some extent. The sample complexity for both types of SVM is different from the sample complexity of straightforward halfspace learning, as it does not depend on the dimension of the domain but rather on parameters such as the maximal norms of x and w.

The importance of dimension-independent sample complexity will be realized in the next chapter, where we will discuss the embedding of the given domain into some high dimensional feature space as means for enriching our hypothesis class.

Such a procedure raises computational and sample complexity problems. The lat-ter is solved by using SVM, whereas the former can be solved by using SVM with kernels, as we will see in the next chapter.

15.7 BIBLIOGRAPHIC REMARKS

SVMs have been introduced in (Cortes and Vapnik 1992, Boser, Guyor and Vapnik 1992). There are many good books on the theoretical and practical aspects of SVMs.

For example, (Vapnik 1995, Cristianini & Shawe-Taylor 2000, Schölkopf & Smola 2002, Hsu et al. 2003, Steinwart and Christmann 2008). Using SGD for solving soft-SVM has been proposed in Shalev-Shwartz et al. (2007).

178 Support Vector Machines

15.8 EXERCISES

15.1 Show that the hard-SVM rule, namely, argmax

(w,b):w=1

min

i∈[m]|w,xi + b| s.t. ∀i, yi(w,xi + b) > 0, is equivalent to the following formulation:

argmax

(w,b):w=1

min

i∈[m] yi(w,xi + b). (15.13) Hint: Define_G= {(w,b) : ∀i, yi(w,xi + b) > 0}.

1. Show that

argmax

(w,b):w=1

min

i∈[m] yi(w,xi + b) ∈G 2. Show that∀(w,b) ∈G,

min

i∈[m] yi(w,xi + b) = min

i∈[m]|w,xi + b|

15.2 Margin and the Perceptron Consider a training set that is linearly separable with a marginγ and such that all the instances are within a ball of radius ρ. Prove that the maximal number of updates the Batch Perceptron algorithm given in Section9.1.2 will make when running on this training set is (ρ/γ )².

15.3 Hard versus soft SVM: Prove or refute the following claim:

There existsλ > 0 such that for every sample S of m > 1 examples, which is separa-ble by the class of homogenous halfspaces, the hard-SVM and the soft-SVM (with parameterλ) learning rules return exactly the same weight vector.

15.4 Weak duality: Prove that for any function f of two vector variables x∈X,y ∈Y, it holds that

minx∈Xmax

y∈Y f(x,y) ≥ max

y∈Ymin

x∈X f(x,y).

16

In document Understanding Machine Learning (Page 190-197)