We expressed earlier, in the form (3.3), the hyperplanes which pass through the points lying closest to the separating hyperplane. If we assume that there is at least one example in each category, so that binary classification is meaningful, then we can find two such hyperplanes passing through at least two examples belonging to different classes. This implies that a separating hyperplane can be found with respect to which these examples have a functional margin given by
\[ \gamma(\mathbf{x}_i, y_i) = y_i(\mathbf{w}\cdot\mathbf{x}_i + b) = 1 . \]
The index $i$ here denotes only the examples from both categories which are closest to the hyperplane. Moreover, for all the examples $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, l$, in the dataset we can assert that the following inequality holds true
\[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \geq 1 . \]
Figure 3.1: The optimal hyperplane is the one bisecting the segment that connects the closest points of the convex hulls of the two classes. By requiring the scaling to be such that the point(s) closest to the hyperplane satisfy $|\mathbf{w}\cdot\mathbf{x}_i + b| = 1$ we obtain a hyperplane in canonical form. The maximum margin $\gamma$ equals $1/\|\mathbf{w}\|$.
In order to achieve a low generalisation error we have to determine a solution with maximum margin. This is equivalent to seeking a hyperplane with minimum $\|\mathbf{w}\|$ which at the same time manages to classify all the points of the training set with functional margin greater than or equal to 1 (Fig. 3.1).
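The equivalence between maximising the margin and minimising the norm of the weight vector can be made explicit in one step. The following is a sketch, assuming the usual Euclidean distance of a point from a hyperplane and writing $\gamma$ for the geometric margin of a hyperplane in canonical form:

```latex
% Distance of a closest example x_i from the hyperplane w.x + b = 0,
% using the canonical scaling |w.x_i + b| = 1
\gamma = \frac{|\mathbf{w}\cdot\mathbf{x}_i + b|}{\|\mathbf{w}\|}
       = \frac{1}{\|\mathbf{w}\|} ,
\qquad
\max_{\mathbf{w},b}\ \frac{1}{\|\mathbf{w}\|}
\;\Longleftrightarrow\;
\min_{\mathbf{w},b}\ \|\mathbf{w}\|
\;\Longleftrightarrow\;
\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} .
```

The last step only replaces the norm by a monotonically increasing function of it, which leaves the minimiser unchanged.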
The above way of stating the problem of finding separating hyperplanes with maximum margin prompts us to proceed to a formulation in terms of optimisation theory. Specifically, the optimisation problem can be stated formally as follows:
\[
\begin{aligned}
\text{minimise}_{\mathbf{w},b}\quad & \tfrac{1}{2}\|\mathbf{w}\|^{2} , \\
\text{subject to}\quad & y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \geq 1 \quad \text{for all } i = 1, \ldots, l .
\end{aligned}
\]
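To make the roles of the objective and of the constraints concrete, the following minimal sketch evaluates both for a few hand-picked hyperplanes. The toy data set, the candidate hyperplanes and all function names are assumptions introduced here for illustration; none of them come from the text.

```python
import math

# Hypothetical toy data (not from the text): two examples per class in the plane.
X = [(3.0, 3.0), (1.0, 1.0), (4.0, 5.0), (0.0, 1.0)]
Y = [+1, -1, +1, -1]


def dot(u, v):
    """Euclidean inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def functional_margins(w, b):
    """y_i (w . x_i + b) for every training example."""
    return [y * (dot(w, x) + b) for x, y in zip(X, Y)]


def is_feasible(w, b, tol=1e-12):
    """True if every primal constraint y_i (w . x_i + b) >= 1 holds."""
    return all(m >= 1.0 - tol for m in functional_margins(w, b))


def primal_objective(w):
    """The quantity to be minimised, (1/2) ||w||^2."""
    return 0.5 * dot(w, w)


# Two hand-picked canonical separating hyperplanes and one infeasible candidate.
candidates = {
    "diagonal": ((0.5, 0.5), -2.0),
    "vertical": ((1.0, 0.0), -2.0),
    "too_flat": ((0.25, 0.25), -1.0),
}

for name, (w, b) in candidates.items():
    # Columns: name, feasibility, objective value, 1 / ||w||
    print(name, is_feasible(w, b), primal_objective(w), 1.0 / math.sqrt(dot(w, w)))
```

Of the two feasible candidates, the one with the smaller objective value is also the one with the larger value of $1/\|\mathbf{w}\|$, which is the trade-off the primal problem encodes.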
We call this formulation the primal optimisation problem. We observe that the objective function $\tfrac{1}{2}\|\mathbf{w}\|^{2}$ is a strictly convex function of $\mathbf{w}$, hence the primal problem admits a unique solution. The qualification “primal” stems from the fact that the original quantities $(\mathbf{w}, b)$ defining the solution hyperplane enter the expression, as opposed to an equivalent expression which will be described shortly. Because the problem is hard to solve in its primal form we follow a procedure that transforms it into an equivalent form [41]. We begin by writing down the corresponding Lagrangian in its primal form
\[ L(\mathbf{w}, b, \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^{2} - \sum_{i=1}^{l} \alpha_i \left[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \right] , \tag{3.6} \]
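where $\alpha_i \geq 0$ are parameters known as the Lagrange multipliers, which in the following are considered as the components of a vector $\boldsymbol{\alpha}$. By differentiating $L(\mathbf{w}, b, \boldsymbol{\alpha})$ with respect to $\mathbf{w}$ and $b$ and setting the derivatives equal to zero we get
\[ \frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} y_i \alpha_i \mathbf{x}_i = \mathbf{0} , \tag{3.7} \]
\[ \frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = -\sum_{i=1}^{l} y_i \alpha_i = 0 . \tag{3.8} \]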
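Solving (3.7) with respect to $\mathbf{w}$, substituting back into (3.6) and using (3.8) to eliminate the term containing $b$, we obtain the dual form of the Lagrangian
\[
\begin{aligned}
L(\boldsymbol{\alpha}) &= \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \, \mathbf{x}_i\cdot\mathbf{x}_j - \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \, \mathbf{x}_i\cdot\mathbf{x}_j + \sum_{i=1}^{l} \alpha_i \\
&= \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \, \mathbf{x}_i\cdot\mathbf{x}_j .
\end{aligned}
\tag{3.9}
\]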
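The substitution leading to (3.9) can be checked numerically. The sketch below, which assumes a hypothetical toy data set and hypothetical function names, evaluates the primal Lagrangian (3.6) at the weight vector given by (3.7) and compares it with the dual expression (3.9) for multipliers that satisfy (3.8). The value of $b$ should make no difference.

```python
# Hypothetical toy data (not from the text).
X = [(3.0, 3.0), (1.0, 1.0), (4.0, 5.0), (0.0, 1.0)]
Y = [+1, -1, +1, -1]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primal_lagrangian(w, b, alpha):
    """Equation (3.6)."""
    penalty = sum(a * (y * (dot(w, x) + b) - 1.0) for a, x, y in zip(alpha, X, Y))
    return 0.5 * dot(w, w) - penalty


def w_from_alpha(alpha):
    """Equation (3.7) solved for w: w = sum_i alpha_i y_i x_i."""
    dim = len(X[0])
    return tuple(sum(a * y * x[k] for a, x, y in zip(alpha, X, Y)) for k in range(dim))


def dual_lagrangian(alpha):
    """Equation (3.9)."""
    n = len(X)
    quad = sum(Y[i] * Y[j] * alpha[i] * alpha[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


# Arbitrary multipliers chosen so that sum_i alpha_i y_i = 0, as (3.8) requires.
alpha = [0.5, 0.25, 0.25, 0.5]
w = w_from_alpha(alpha)

for b in (-3.0, 0.0, 7.5):
    # The two printed numbers agree for every b.
    print(b, primal_lagrangian(w, b, alpha), dual_lagrangian(alpha))
```

If the multipliers violated (3.8), the term in $b$ would not cancel and the two values would differ by $b \sum_i \alpha_i y_i$.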
Note here that the primal variables $\mathbf{w}$ and $b$ have been totally eliminated from the expression of the Lagrangian. Its dual form is written exclusively in terms of the variables $\alpha_i$, $i = 1, \ldots, l$, which for this reason are called dual variables. The new optimisation problem in terms of the dual variables is stated as follows:
\[
\begin{aligned}
\text{maximise}_{\boldsymbol{\alpha}}\quad & L(\boldsymbol{\alpha}) , \\
\text{subject to}\quad & \sum_{i=1}^{l} \alpha_i y_i = 0 , \\
& \alpha_i \geq 0 , \quad i = 1, \ldots, l .
\end{aligned}
\]
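The text does not prescribe a particular algorithm for the dual problem. As one possible illustration, the sketch below uses pairwise coordinate ascent in the spirit of sequential minimal optimisation: two multipliers are changed at a time so that the equality constraint is preserved, and the result is clipped to keep both non-negative. The toy data, the function names and the choice of algorithm are all assumptions made here, and the routine is not meant as a production solver.

```python
# Hypothetical toy data (not from the text), linearly separable.
X = [(3.0, 3.0), (1.0, 1.0), (4.0, 5.0), (0.0, 1.0)]
Y = [+1, -1, +1, -1]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha):
    """L(alpha) = sum_i alpha_i - (1/2) sum_ij y_i y_j alpha_i alpha_j x_i . x_j"""
    n = len(X)
    quad = sum(Y[i] * Y[j] * alpha[i] * alpha[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def solve_dual(sweeps=100, tol=1e-12):
    """Pairwise coordinate ascent on the hard-margin dual (illustrative sketch)."""
    n = len(X)
    alpha = [0.0] * n  # feasible start: sum_i alpha_i y_i = 0 and alpha_i >= 0
    for _ in range(sweeps):
        largest_change = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                # u_k = w . x_k with w = sum_m alpha_m y_m x_m
                u_i = sum(alpha[m] * Y[m] * dot(X[m], X[i]) for m in range(n))
                u_j = sum(alpha[m] * Y[m] * dot(X[m], X[j]) for m in range(n))
                eta = dot(X[i], X[i]) + dot(X[j], X[j]) - 2.0 * dot(X[i], X[j])
                if eta <= tol:
                    continue  # coincident points: no curvature along this pair
                # Unconstrained optimum for alpha_j along the line on which
                # y_i * alpha_i + y_j * alpha_j stays constant.
                new_j = alpha[j] + Y[j] * ((u_i - Y[i]) - (u_j - Y[j])) / eta
                # Clip so that both multipliers remain non-negative.
                if Y[i] != Y[j]:
                    new_j = max(new_j, 0.0, alpha[j] - alpha[i])
                else:
                    new_j = min(max(new_j, 0.0), alpha[i] + alpha[j])
                new_i = alpha[i] + Y[i] * Y[j] * (alpha[j] - new_j)
                largest_change = max(largest_change, abs(new_j - alpha[j]))
                alpha[i], alpha[j] = new_i, new_j
        if largest_change < tol:
            break  # a full sweep changed nothing
    return alpha


alpha_star = solve_dual()
print(alpha_star, dual_objective(alpha_star))
```

Starting from the zero vector keeps every iterate feasible, so the dual objective can only increase from one update to the next. For larger problems a dedicated quadratic programming routine would normally be preferred.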
After determining the solution $\boldsymbol{\alpha}^\star$ of the dual problem, with components $\alpha_i^\star$, the optimum weight vector $\mathbf{w}^\star$, which is the unknown quantity of the primal problem, can be derived from (3.7). Solving with respect to $\mathbf{w}^\star$ we obtain
\[ \mathbf{w}^\star = \sum_{i=1}^{l} \alpha_i^\star y_i \mathbf{x}_i , \]
from which it is clear that the solution weight vector can be represented as a linear combination of the training patterns. Recall that this was a sufficient condition in order for the decision rule to be evaluated by means of a kernel. The other parameter which remains to be specified is the bias $b^\star$ of the hyperplane. The bias is evaluated using the primal constraints, since $b$ does not enter the dual formulation. For this purpose we employ exclusively the active constraints, which are the ones satisfied by the solution as equalities, by solving each one of them with respect to $b^\star$; since $y_i = \pm 1$, the equality $y_i(\mathbf{w}^\star\cdot\mathbf{x}_i + b^\star) = 1$ gives $b^\star = y_i - \mathbf{w}^\star\cdot\mathbf{x}_i$. Since there will be more than one active constraint we average over all the values of $b^\star$ found,
\[ b^\star = \frac{1}{s} \sum_{i=1}^{s} \left( y_i - \mathbf{w}^\star\cdot\mathbf{x}_i \right) , \]
where $s$ denotes the number of active constraints.
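The recovery of the primal solution from the dual one can be sketched in a few lines. The toy data and the multipliers below are assumptions made for illustration (the multipliers are taken to be the dual solution for this particular data set). As a simplification, the active constraints are identified through strictly positive multipliers, although an active constraint may in principle carry a zero multiplier.

```python
# Hypothetical toy data (not from the text).
X = [(3.0, 3.0), (1.0, 1.0), (4.0, 5.0), (0.0, 1.0)]
Y = [+1, -1, +1, -1]

# Assumed dual solution for the toy data above.
alpha_star = [0.25, 0.25, 0.0, 0.0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# w* = sum_i alpha*_i y_i x_i
dim = len(X[0])
w_star = tuple(sum(a * y * x[k] for a, x, y in zip(alpha_star, X, Y))
               for k in range(dim))

# Examples with a strictly positive multiplier (simplifying assumption, see above).
active = [i for i, a in enumerate(alpha_star) if a > 1e-9]

# Each active constraint y_i (w* . x_i + b*) = 1 gives b* = y_i - w* . x_i,
# because y_i is either +1 or -1.
b_values = [Y[i] - dot(w_star, X[i]) for i in active]
b_star = sum(b_values) / len(b_values)

print(w_star, b_values, b_star)
```

With exact arithmetic all entries of `b_values` coincide; the averaging mainly guards against the numerical error of an approximate dual solution.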
We will argue in the following that the dual Lagrangian (3.9) is a concave function of $\boldsymbol{\alpha}$. The Hessian matrix formed by taking the second-order partial derivatives of $-L(\boldsymbol{\alpha})$ with respect to $\alpha_i$ and $\alpha_j$ consists of the entries $\left( y_i\mathbf{x}_i \cdot y_j\mathbf{x}_j \right)_{i,j=1}^{l}$. If the labels accompanying the points are ignored, this matrix corresponds to the simplest of the kernels, the linear one, which is positive semi-definite. We can incorporate the labels into the initial mapping $\phi(\mathbf{x}_i) = \mathbf{x}_i$ and obtain a new mapping $\phi'(\mathbf{x}_i) = y_i\mathbf{x}_i$ which yields a kernel matrix identical to the Hessian. Since the entries of the Hessian resulting from the Lagrangian can be expressed as inner products of the training points embedded in some feature space $H$ under the mapping $\phi' : \mathbf{x}_i \to H$, the Hessian is positive semi-definite. Therefore $L(\boldsymbol{\alpha})$ is concave and its maximum value is unique, although it may be attained at more than one value of $\boldsymbol{\alpha}$, since the dual objective is not strictly concave, in contrast to the strict convexity of the primal one.
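The positive semi-definiteness claimed above can also be verified directly. In the following sketch the symbol $M$ (an assumption introduced here, chosen to avoid a clash with the feature space $H$) denotes the matrix with entries $M_{ij} = y_i y_j\, \mathbf{x}_i\cdot\mathbf{x}_j$, and $\mathbf{z}$ is an arbitrary vector in $\mathbb{R}^l$; multiplying $M$ by a positive constant does not affect the conclusion.

```latex
% Quadratic form of M_{ij} = y_i y_j x_i . x_j for an arbitrary z in R^l
\mathbf{z}^{\top} M \mathbf{z}
  = \sum_{i,j=1}^{l} z_i z_j \,(y_i\mathbf{x}_i)\cdot(y_j\mathbf{x}_j)
  = \Bigl\| \sum_{i=1}^{l} z_i y_i \mathbf{x}_i \Bigr\|^{2} \;\geq\; 0 .
```

Equality is attained for some $\mathbf{z} \neq \mathbf{0}$ whenever the vectors $y_i\mathbf{x}_i$ are linearly dependent, for instance when $l$ exceeds the dimension of the input space, which is why strict concavity cannot be guaranteed.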
We stressed before that we aspire to solve the primal optimisation problem through the formulation of the dual one, which appears easier. It is not obvious, however, that the objective functions corresponding to the primal and dual problems have the same optimal value. In the sequel we address this issue and give the conditions under which equality of the objectives is possible. In general, if $(\mathbf{w}, b)$ is a pair satisfying the constraints of the primal problem with objective $f(\mathbf{w}, b)$ and $\boldsymbol{\alpha}$ is a set of $l$ values satisfying the constraints of the dual problem with objective $L(\boldsymbol{\alpha}) = \inf_{\mathbf{w},b} L(\mathbf{w}, b, \boldsymbol{\alpha})$, it holds that
\[ f(\mathbf{w}, b) \geq L(\boldsymbol{\alpha}) . \tag{3.10} \]
That is, as long as $(\mathbf{w}, b)$ and $\boldsymbol{\alpha}$ lie in the feasibility regions of the primal and the dual problem, respectively, the value of the dual objective is always bounded from above by the value of the primal. This relationship between the two formulations of the optimisation problem is known as the weak duality theorem [41]. The difference between $f(\mathbf{w}, b)$ and $L(\boldsymbol{\alpha})$ is known as the duality gap. If we come up with values $(\mathbf{w}^\star, b^\star)$ and $\boldsymbol{\alpha}^\star$ such that $f(\mathbf{w}^\star, b^\star) = L(\boldsymbol{\alpha}^\star)$ then the duality gap vanishes. Nevertheless, we still do not know if there are values of the primal and the dual variables for which equality in the objectives can be achieved. Let us denote by $g_i(\mathbf{w}, b) = 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \leq 0$ the inequality constraints of the primal optimisation problem, which in our case stem from the requirement for correct classification of the points by the canonical hyperplane. From the chain of relations
\[ L(\boldsymbol{\alpha}) = \inf_{\mathbf{w},b} L(\mathbf{w}, b, \boldsymbol{\alpha}) \leq L(\mathbf{w}, b, \boldsymbol{\alpha}) = f(\mathbf{w}, b) + \sum_{i=1}^{l} \alpha_i g_i(\mathbf{w}, b) \leq f(\mathbf{w}, b) , \tag{3.11} \]
where the last inequality holds because $\alpha_i \geq 0$ and $g_i(\mathbf{w}, b) \leq 0$, it follows that candidate values for the vanishing of the duality gap are surely the ones that minimise the primal Lagrangian, since in this case the first inequality in (3.11) becomes an equality.
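Weak duality is easy to observe numerically. The sketch below, built on a hypothetical toy data set with hand-picked feasible points (all names and values are assumptions made here), compares the primal objective at several primal-feasible hyperplanes with the dual objective at several dual-feasible multiplier vectors.

```python
# Hypothetical toy data (not from the text).
X = [(3.0, 3.0), (1.0, 1.0), (4.0, 5.0), (0.0, 1.0)]
Y = [+1, -1, +1, -1]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primal_objective(w):
    return 0.5 * dot(w, w)


def is_primal_feasible(w, b, tol=1e-12):
    return all(y * (dot(w, x) + b) >= 1.0 - tol for x, y in zip(X, Y))


def dual_objective(alpha):
    n = len(X)
    quad = sum(Y[i] * Y[j] * alpha[i] * alpha[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def is_dual_feasible(alpha, tol=1e-12):
    balanced = abs(sum(a * y for a, y in zip(alpha, Y))) <= tol
    return balanced and all(a >= -tol for a in alpha)


# Hand-picked feasible points of the two problems.
primal_points = [((0.5, 0.5), -2.0), ((1.0, 0.0), -2.0), ((1.0, 1.0), -4.0)]
dual_points = [
    [0.0, 0.0, 0.0, 0.0],
    [0.125, 0.125, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.0],
    [0.25, 0.5, 0.25, 0.0],
]

# Duality gap f(w, b) - L(alpha) for every feasible combination.
gaps = [primal_objective(w) - dual_objective(alpha)
        for (w, b) in primal_points for alpha in dual_points]
print(min(gaps), max(gaps))
```

Every gap is non-negative, as (3.10) requires, and for this data one combination of the listed points closes the gap completely.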
We will discuss very shortly the additional condition under which the second inequality becomes an equality. The gap can vanish only for $\boldsymbol{\alpha}^\star$ solving the dual problem, since this is the only value for which (3.10) could hold as an equality. By forcing $L(\boldsymbol{\alpha})$ to attain its maximum we are able to approach and possibly reach the minimum value $f(\mathbf{w}^\star, b^\star)$. The next theorem, known as the strong duality theorem [41], provides us with the guarantees needed for the dual optimisation problem to have the same objective value as the primal.
Theorem 3.2. Given an optimisation problem with convex domain $\Omega \subset \mathbb{R}^n$
\[
\begin{aligned}
\text{minimise}_{\mathbf{w},b}\quad & f(\mathbf{w}, b) , \\
\text{subject to}\quad & g_i(\mathbf{w}, b) \leq 0 \quad \text{for all } i = 1, \ldots, l ,
\end{aligned}
\]
where the $g_i$ are affine functions, the duality gap is zero.
Evidently these rather mild conditions are satisfied by our primal optimisation problem. Given that the primal and the dual optimisation problem can admit the same objective value at their solutions, (3.11) indicates that this can take place only if $\sum_{i=1}^{l} \alpha_i g_i(\mathbf{w}, b) = 0$. The Kuhn-Tucker theorem [36] states the conditions for a point $(\mathbf{w}^\star, b^\star)$ to be the optimum solution of the primal problem.
Theorem 3.3. Given an optimisation problem like the one appearing in Theorem 3.2, where $f(\mathbf{w}, b)$ is a convex function with continuous first-order partial derivatives and the $g_i$ are affine constraints, the necessary and sufficient conditions for a regular point $(\mathbf{w}^\star, b^\star)$ to be the optimum solution are the existence of dual variables $\boldsymbol{\alpha}^\star \geq 0$ such that $(\mathbf{w}^\star, b^\star)$, which belongs to the feasibility region of the primal problem ($g_i(\mathbf{w}^\star, b^\star) \leq 0$, $i = 1, \ldots, l$), satisfies
\[
\frac{\partial L(\mathbf{w}^\star, b^\star, \boldsymbol{\alpha}^\star)}{\partial \mathbf{w}} = \mathbf{0} , \qquad
\frac{\partial L(\mathbf{w}^\star, b^\star, \boldsymbol{\alpha}^\star)}{\partial b} = 0 , \qquad
\alpha_i^\star \, g_i(\mathbf{w}^\star, b^\star) = 0 , \quad i = 1, \ldots, l .
\]
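In our case, binary classification with maximum margin, $f(\mathbf{w}, b) = \tfrac{1}{2}\|\mathbf{w}\|^{2}$ satisfies the requirements of convexity with respect to $\mathbf{w}$ and continuity of the first-order derivatives. The $l$ additional constraints taking the form
\[ \alpha_i^\star \left[ 1 - y_i(\mathbf{w}^\star\cdot\mathbf{x}_i + b^\star) \right] = 0 , \quad i = 1, \ldots, l , \]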
constitute the Karush-Kuhn-Tucker (KKT) complementarity conditions [29]. During the optimisation procedure, an intermediate value of $(\mathbf{w}, b)$, as well as the optimum $(\mathbf{w}^\star, b^\star)$, may lie strictly inside the region defined by some of the constraints and on the boundary of the region defined by others. In the former case we say that the corresponding constraints are inactive, whereas in the latter the constraints are characterised as active. The dual variables linked to inactive constraints, i.e. those associated with points $(\mathbf{x}_i, y_i)$ which do not lie on the canonical hyperplanes, assume the value $\alpha_i = 0$. On the contrary, for points which satisfy the relationship $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) = 1$, corresponding to active constraints, the dual variables are allowed by the KKT conditions to satisfy $\alpha_i \geq 0$. As a consequence only these patterns participate eventually in the expansion of the solution weight vector $\mathbf{w}^\star$ in terms of the training points
\[ \mathbf{w}^\star = \sum_{i=1}^{s} \alpha_i^\star y_i \mathbf{x}_i , \]
where $s$ designates the number of examples for which $\alpha_i^\star > 0$. The larger the value of a parameter $\alpha_i^\star$, the higher the influence of the corresponding pattern on the solution. The examples connected to positive dual variables are called support vectors because the direction and the bias of the hyperplane are determined exclusively by them. The rest of the examples have no impact on the determination of the hyperplane; had they been identified in advance, they could have been discarded without affecting the optimal solution.
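The KKT conditions and the notion of support vectors can be illustrated with a small check. The sketch below assumes a hypothetical toy data set together with a candidate primal and dual solution for it; the function `kkt_report` and its field names are likewise introduced here purely for illustration.

```python
# Hypothetical toy data (not from the text).
X = [(3.0, 3.0), (1.0, 1.0), (4.0, 5.0), (0.0, 1.0)]
Y = [+1, -1, +1, -1]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kkt_report(w, b, alpha, tol=1e-9):
    """Check stationarity, feasibility and complementarity for (w, b, alpha)."""
    n, dim = len(X), len(w)
    # g_i(w, b) = 1 - y_i (w . x_i + b) must be non-positive.
    g = [1.0 - Y[i] * (dot(w, X[i]) + b) for i in range(n)]
    # Residual of w - sum_i alpha_i y_i x_i = 0 and of sum_i alpha_i y_i = 0.
    grad_w = [w[k] - sum(alpha[i] * Y[i] * X[i][k] for i in range(n))
              for k in range(dim)]
    sum_alpha_y = sum(alpha[i] * Y[i] for i in range(n))
    return {
        "stationary": all(abs(c) <= tol for c in grad_w) and abs(sum_alpha_y) <= tol,
        "primal_feasible": all(gi <= tol for gi in g),
        "dual_feasible": all(a >= -tol for a in alpha),
        "complementary": all(abs(a * gi) <= tol for a, gi in zip(alpha, g)),
        "active_constraints": [i for i, gi in enumerate(g) if abs(gi) <= tol],
        "support_vectors": [i for i, a in enumerate(alpha) if a > tol],
    }


# Candidate solution for the toy data (an assumption of this sketch).
w_star, b_star = (0.5, 0.5), -2.0
alpha_star = [0.25, 0.25, 0.0, 0.0]
report = kkt_report(w_star, b_star, alpha_star)
print(report)

# The weight vector rebuilt from the support vectors alone.
w_from_sv = tuple(sum(alpha_star[i] * Y[i] * X[i][k] for i in report["support_vectors"])
                  for k in range(len(w_star)))
print(w_from_sv)

# A feasible but suboptimal hyperplane fails the stationarity condition.
print(kkt_report((1.0, 0.0), -2.0, alpha_star)["stationary"])
```

In this example only the first two examples carry positive multipliers, and rebuilding the weight vector from them alone reproduces the candidate solution, in line with the remark that the remaining examples could have been discarded.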