
2.4 Linear SVMs and other linear methods

2.4.1 Linear Support Vector Machines for binary classification

two-class problem.

We want to find a decision function

$$f(x) = \operatorname{sgn}(w^T x + b) \tag{2.16}$$

minimizing the classification error on the learning set,

$$\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq f(x_i)).$$

If the classes are perfectly separable, it is possible to find such a function that simultaneously fulfils the condition

$$y_i (w^T x_i + b) > 0 \tag{2.17}$$

for all $i$ from 1 to $n$.
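As an illustration, the decision function (2.16), the empirical classification error and the separability condition (2.17) can be sketched in a few lines of Python. The toy learning set, the hand-picked hyperplane and all function names below are assumptions made for the example; they do not come from the text.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def decision(w, b, x):
    """f(x) = sgn(w^T x + b); a value of exactly 0 is mapped to +1 (assumption)."""
    return 1 if dot(w, x) + b >= 0 else -1

def empirical_error(w, b, X, y):
    """Fraction of learning-set samples with y_i != f(x_i)."""
    return sum(decision(w, b, xi) != yi for xi, yi in zip(X, y)) / len(X)

def separates(w, b, X, y):
    """Condition (2.17): y_i (w^T x_i + b) > 0 for every sample."""
    return all(yi * (dot(w, xi) + b) > 0 for xi, yi in zip(X, y))

# Hypothetical toy learning set in the plane, labels in {-1, +1}.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]]
y = [1, 1, -1, -1]

# Hand-picked candidate hyperplane (not the result of any optimization).
w, b = [1.0, 0.0], 0.0

error = empirical_error(w, b, X, y)
separable_by_w = separates(w, b, X, y)
```

With this candidate, every sample lies on the correct side, so the error is zero and condition (2.17) holds. Flipping the sign of `w` would misclassify every sample.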

Moreover, the optimal hyperplane $h(x) = w^T x + b$ separating the two classes is defined in SVM as the one that maximizes the distance between this hyperplane and the nearest point of any class in $LS$. This distance is called the margin $M$, such a hyperplane is called the maximum margin hyperplane, and those nearest points are called support vectors. The hard-margin problem is illustrated in the left panel of Figure 2.5.

The decision boundary function can thus be obtained by solving the following optimization problem

CHAPTER 2. PRINCIPLES OF MACHINE LEARNING


$$\max_{w,\, b,\, \|w\| = 1} M \tag{2.18}$$

$$\text{subject to } y_i (w^T x_i + b) \geq M, \quad i = 1, \dots, n, \tag{2.19}$$

which looks for the largest margin $M$ such that every point of the dataset lies at a distance of at least $M$ from the hyperplane.

The condition $\|w\| = 1$ can be removed by injecting it into (2.19), which thus becomes

$$\frac{1}{\|w\|}\, y_i (w^T x_i + b) \geq M. \tag{2.20}$$

Finally, we can arbitrarily fix $\|w\| = 1/M$ because, if $w$ and $b$ fulfil these conditions, rescaling them by any positive constant does not modify the value of the margin $M$.
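The scale invariance used in this argument can be checked numerically. The sketch below computes the geometric margin of a hand-picked hyperplane on a hypothetical toy set, rescales the parameters, and then applies the canonical scaling in which the smallest value of $y_i(w^T x_i + b)$ equals 1. The data and helper names are assumptions for illustration only.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def functional_margin(w, b, X, y):
    """Smallest value of y_i (w^T x_i + b) over the learning set."""
    return min(yi * (dot(w, xi) + b) for xi, yi in zip(X, y))

def geometric_margin(w, b, X, y):
    """Smallest signed distance to the hyperplane, i.e. the left side of (2.20)."""
    return functional_margin(w, b, X, y) / math.sqrt(dot(w, w))

# Hypothetical separable toy set and a hand-picked separating hyperplane.
X = [[1.0, 0.0], [3.0, 1.0], [-1.0, 0.0]]
y = [1, 1, -1]
w, b = [2.0, 0.0], 0.0

M = geometric_margin(w, b, X, y)

# Rescaling (w, b) by a positive constant leaves the margin unchanged.
c = 5.0
M_rescaled = geometric_margin([c * wj for wj in w], c * b, X, y)

# Canonical scaling: divide by the functional margin so that min y_i h(x_i) = 1.
s = functional_margin(w, b, X, y)
w_can, b_can = [wj / s for wj in w], b / s
norm_can = math.sqrt(dot(w_can, w_can))  # equals 1 / M under this scaling
```

Under the canonical scaling the constraint reads $y_i(w^T x_i + b) \geq 1$ and maximizing $M = 1/\|w\|$ becomes equivalent to minimizing $\|w\|$.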

Hence, the optimization problem to solve can be formulated as

$$\min_{w,\, b} \frac{1}{2} \|w\|^2 \tag{2.21}$$

$$\text{subject to } y_i (w^T x_i + b) \geq 1, \quad i = 1, \dots, n. \tag{2.22}$$

The dual formulation is obtained by using the Lagrangian

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] \tag{2.23}$$

with $\alpha_i \geq 0 \ \forall i$. It has to be minimized with respect to $w$ and $b$ and maximized with respect to $\alpha$.

Optimality conditions can thus be obtained by differentiating the Lagrangian with respect to $w$ and $b$ and setting these derivatives to zero. In particular, these operations provide the following conditions

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i, \tag{2.24}$$

$$0 = \sum_{i=1}^{n} \alpha_i y_i, \tag{2.25}$$

with $\alpha_i \geq 0$. These results can be substituted in the Lagrangian to give the following dual optimization problem

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k y_i y_k x_i^T x_k \tag{2.26}$$

$$\text{subject to } \alpha_i \geq 0 \text{ and } \sum_{i=1}^{n} \alpha_i y_i = 0. \tag{2.27}$$
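For completeness, the substitution step can be written out as follows. This is a sketch of the standard derivation, using only the stationarity conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$; the shorthand $S$ is introduced here for the double sum and is not part of the original notation.

```latex
\begin{align*}
S &:= \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k y_i y_k \, x_i^T x_k, \\
\tfrac{1}{2} \|w\|^2
  &= \tfrac{1}{2} \Big( \sum_{i=1}^{n} \alpha_i y_i x_i \Big)^{T}
     \Big( \sum_{k=1}^{n} \alpha_k y_k x_k \Big) = \tfrac{1}{2} S, \\
\sum_{i=1}^{n} \alpha_i y_i \, w^T x_i
  &= \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k y_i y_k \, x_k^T x_i = S, \\
b \sum_{i=1}^{n} \alpha_i y_i &= 0, \\
L(w, b, \alpha)
  &= \tfrac{1}{2} S - S - 0 + \sum_{i=1}^{n} \alpha_i
   = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} S .
\end{align*}
```

The resulting expression depends on $\alpha$ only, and on the data only through the inner products $x_i^T x_k$.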

According to the Karush-Kuhn-Tucker conditions, the solution must also fulfil the condition

$$\alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0 \quad \forall i, \tag{2.28}$$


which gives $\alpha_i = 0$ if $y_i (w^T x_i + b) > 1$, that is, if $x_i$ is not on the boundary of the margin. Conversely, if $\alpha_i > 0$, then $y_i (w^T x_i + b) = 1$ and $x_i$ is on the boundary. In this case, $x_i$ is called a support vector. Any of these vectors can be used to obtain the value of the parameter $b$.
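To make these relations concrete, the sketch below recovers $w$ and $b$ from a dual solution on a three-point toy problem. The data are hypothetical, and the multipliers `alpha` were worked out by hand for this particular set (the two points nearest the boundary share the weight, the third gets zero); no solver is involved.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical separable toy set in the plane.
X = [[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]]
y = [1, -1, 1]

# Dual solution for this toy set, derived by hand (assumption of the example).
alpha = [0.5, 0.5, 0.0]

# Stationarity in w: w = sum_i alpha_i y_i x_i.
dim = len(X[0])
w = [sum(a * yi * xi[j] for a, yi, xi in zip(alpha, y, X)) for j in range(dim)]

# Support vectors are the samples with a strictly positive multiplier.
support = [i for i, a in enumerate(alpha) if a > 0]

# For a support vector, y_s (w^T x_s + b) = 1; since y_s is +1 or -1,
# this gives b = y_s - w^T x_s.
s = support[0]
b = y[s] - dot(w, X[s])

# Complementary slackness: alpha_i [y_i (w^T x_i + b) - 1] should vanish for all i.
slackness = [a * (yi * (dot(w, xi) + b) - 1) for a, yi, xi in zip(alpha, y, X)]
```

Here the third sample satisfies $y_i(w^T x_i + b) = 3 > 1$, so its multiplier is zero and it plays no role in $w$ or $b$.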

The hyperplane $h(x)$ can be reformulated in terms of the $\alpha_i$ parameters as follows:

$$h(x) = \sum_{i=1}^{n} \alpha_i y_i x_i^T x + b. \tag{2.29}$$

Unfortunately, data are in general not perfectly separable by a hyperplane and the optimization problem has to be adapted accordingly. We thus talk about the soft-margin SVM and we refer to Figure 2.5 for an illustration of this situation.

The soft-margin SVM introduces slack variables, which measure discrepancies between data points and the margin. These variables make it possible to find the best hyperplane separating the data while allowing some data points to lie on the wrong side of the hyperplane or inside the margin.

Mathematically, the formulation of the soft-margin SVM is as follows

$$\min_{w,\, b,\, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \tag{2.30}$$

$$\text{subject to } \xi_i \geq 0 \text{ and } y_i (w^T x_i + b) \geq 1 - \xi_i \quad \forall i, \tag{2.31}$$

where $C$ is now a parameter of the problem. A small value of $C$ is more permissive with respect to misclassified samples and corresponds to a larger margin, while a high value of $C$ restricts misclassification errors as much as possible at the expense of the margin.
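The role of $C$ can be illustrated by evaluating the soft-margin objective for two fixed candidate hyperplanes. In the sketch below, each slack is set to its smallest feasible value, $\xi_i = \max(0, 1 - y_i(w^T x_i + b))$. The one-dimensional data and both candidates are hand-picked assumptions; neither candidate is claimed to be the optimum.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def slacks(w, b, X, y):
    """Smallest feasible slack per sample: max(0, 1 - y_i (w^T x_i + b))."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """(1/2) ||w||^2 + C * sum_i xi_i, evaluated for a fixed (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Hypothetical 1-D toy set: the third point sits close to the other class.
X = [[3.0], [-3.0], [0.5]]
y = [1, -1, 1]

wide = ([1.0], 0.0)    # margin 1/||w|| = 1, one point inside the margin
narrow = ([2.0], 0.0)  # margin 1/||w|| = 0.5, no slack needed

for C in (1.0, 10.0):
    obj_wide = soft_margin_objective(*wide, X, y, C)
    obj_narrow = soft_margin_objective(*narrow, X, y, C)
    preferred = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C}: wide={obj_wide}, narrow={obj_narrow}, preferred={preferred}")
```

With $C = 1$ the wide-margin candidate has the lower objective despite its margin violation, whereas with $C = 10$ the narrow-margin candidate without violations is preferred.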

Solving this optimization problem leads to the same equation for the hyperplane $h(x)$ as in Equation (2.29), with the weight vector $w$

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i. \tag{2.32}$$

The samples $x_i$ that contribute to the predictive model are the ones for which $\alpha_i$ is non-zero, and these samples are also called the support vectors. Support vectors for which $0 < \alpha_i < C$ have $\xi_i = 0$ and lie on the boundary of the margin, while observations $i$ with $\alpha_i = C$ lie inside the margin and can thus be correctly or incorrectly classified depending on which side of the hyperplane they are on.
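This case distinction can be summarized as a small helper that labels a sample from its multiplier. The function name, the tolerance and the example values are assumptions for illustration; in practice the multipliers would come from a solver, and degenerate cases (for instance $\alpha_i = C$ with zero slack) are ignored here.

```python
def role_of_sample(alpha_i, C, tol=1e-9):
    """Label a sample from its multiplier in the soft-margin solution."""
    if alpha_i <= tol:
        return "not a support vector"
    if alpha_i >= C - tol:
        return "support vector inside the margin (possibly misclassified)"
    return "support vector on the margin boundary"

# Hypothetical multipliers for C = 1.0; they are not the output of a real fit.
C = 1.0
alphas = [0.0, 0.3, 1.0]
roles = [role_of_sample(a, C) for a in alphas]
```

Only the samples in the last two categories enter the sum defining $w$.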

In Equation (2.29), the inner product $x_i^T x$ is called a linear kernel and is denoted $K(x_i, x)$. This kernel function measures the similarity between $x_i$ and $x$ by a simple inner product but can be replaced by other non-linear kernel functions to adapt SVM to non-linear classification.
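The kernel form of the hyperplane, $h(x) = \sum_i \alpha_i y_i K(x_i, x) + b$, can be sketched with the kernel passed in as a function. The Gaussian (RBF) kernel below is one common non-linear choice and is an assumption of this example, as are the toy data and the hand-derived multipliers. With a different kernel the multipliers would have to be re-optimized, so the RBF call only demonstrates the interface.

```python
import math

def linear_kernel(u, v):
    """K(u, v) = u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel exp(-gamma * ||u - v||^2); gamma is a free parameter."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def h(x, X, y, alpha, b, kernel):
    """h(x) = sum_i alpha_i y_i K(x_i, x) + b."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

# Hypothetical toy set; alpha and b were derived by hand for the linear kernel.
X = [[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]]
y = [1, -1, 1]
alpha, b = [0.5, 0.5, 0.0], 0.0

x_new = [2.0, 5.0]
h_linear = h(x_new, X, y, alpha, b, linear_kernel)  # same value as w^T x + b with w = (1, 0)
h_rbf = h([1.0, 0.0], X, y, alpha, b, rbf_kernel)   # interface demo only
```

Because the data enter only through $K(x_i, x)$, swapping the kernel changes the shape of the decision boundary without changing the form of the prediction function.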