3.2 Support vector machine at a glance
3.2.1 Basic ideas of SVM
We illustrate here the basic ideas of SVM as a classification method. The main advantage of SVM is that it can be described very intuitively in the context of linear classification and also extended in a natural way to the non-linear case. Let us define the training data set as a set of pairs of input/output points $(x_i, y_i)$, with $1 \le i \le n$. Here the input vector $x_i$ belongs to some space $\mathcal{X}$, whereas the output $y_i$ belongs to $\{-1, 1\}$ in the case of binary classification. The output $y_i$ identifies the two possible classes.

Hard margin classification
The simplest idea of linear classification is to look at the whole set of inputs $\{x_i\} \subset \mathcal{X}$ and search for a hyperplane that separates the data into two classes according to their labels $y_i = \pm 1$. It consists of constructing a linear discriminant function of the form:
$$h(x) = w^T x + b$$
where the vector w is the weight vector and b is called the bias. The hyperplane is defined by the following equation:
$$H = \{x : h(x) = w^T x + b = 0\}$$
This hyperplane divides the space $\mathcal{X}$ into two regions: the region where the discriminant function takes positive values and the region where it takes negative values. The hyperplane is also called the decision boundary. The classification is called linear because the discriminant function, and hence this boundary, depends linearly on the input $x$.
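The discriminant function and the two regions it defines can be sketched in a few lines of Python. This is only an illustration: the helper names `discriminant` and `side`, and the 2-D hyperplane used below, are assumptions made for the example and do not come from the text.

```python
def discriminant(w, b, x):
    """Linear discriminant h(x) = w^T x + b for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def side(w, b, x):
    """Return +1 or -1 according to the sign of h(x), and 0 on the hyperplane."""
    h = discriminant(w, b, x)
    return (h > 0) - (h < 0)


# Hypothetical 2-D example: the hyperplane is the line x1 + x2 - 1 = 0.
w, b = [1.0, 1.0], -1.0

h_pos = discriminant(w, b, [2.0, 2.0])   # point in the positive region
h_neg = discriminant(w, b, [0.0, 0.0])   # point in the negative region
on_boundary = side(w, b, [0.5, 0.5])     # point lying on the hyperplane
```

Points with `side(...) == 0` lie exactly on the decision boundary; every other point falls in one of the two regions described above.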
Figure 3.1: Geometric interpretation of the margin in a linear SVM.
We now define the notion of a margin. In Figure 3.1 (reprinted from Ben-Hur A. et al., 2010), we give a geometric interpretation of the margin in a linear SVM. Let $x_+$ and $x_-$ be the closest points to the hyperplane on the positive side and on the negative side. The circled data points are the support vectors, i.e. the points closest to the decision boundary (see Figure 3.1). The vector $w$ is the normal vector to the hyperplane; we denote its norm by $\|w\| = \sqrt{w^T w}$ and its direction by $\hat{w} = w/\|w\|$. We assume that $x_+$ and $x_-$ are equidistant from the decision boundary. They determine the margin by which the two classes of points of the data set $D$ are separated:
$$m_D(h) = \frac{1}{2}\,\hat{w}^T (x_+ - x_-)$$
Geometrically, this margin is just half of the distance between the two closest points on either side of the hyperplane $H$, projected onto the direction $\hat{w}$. The positions of these points relative to the hyperplane $H$ are given by:
$$h(x_+) = w^T x_+ + b = a$$
$$h(x_-) = w^T x_- + b = -a$$
where $a > 0$ is some constant. As the normal vector $w$ and the bias $b$ are determined only up to a common scale factor, we can simply divide them by $a$ and renormalize all these equations. This is equivalent to setting $a = 1$ in the above expressions. Since we then have $w^T (x_+ - x_-) = 2$, we finally get
$$m_D(h) = \frac{1}{2}\,\hat{w}^T (x_+ - x_-) = \frac{1}{\|w\|}$$
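As a quick numerical check of this identity, the sketch below evaluates the margin from its definition and compares it with the inverse norm of $w$. The function name `margin` and the toy hyperplane $w = (0.5, 0.5)$, $b = 0$ are assumptions chosen so that $h(x_+) = 1$ and $h(x_-) = -1$ hold exactly.

```python
import math


def margin(w, x_plus, x_minus):
    """Margin m_D(h) = (1/2) * w_hat^T (x_plus - x_minus), with w_hat = w / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 0.5 * sum(wi / norm * (p - m) for wi, p, m in zip(w, x_plus, x_minus))


# Hypothetical canonical hyperplane: w = (0.5, 0.5), b = 0, so that
# h(x_plus) = +1 and h(x_minus) = -1 for the two points below.
w = [0.5, 0.5]
x_plus, x_minus = [1.0, 1.0], [-1.0, -1.0]

m = margin(w, x_plus, x_minus)
inv_norm = 1.0 / math.sqrt(sum(wi * wi for wi in w))
```

Both quantities agree up to rounding, and they coincide with the Euclidean distance from $x_+$ to the line $x_1 + x_2 = 0$.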
The basic idea of the maximum margin classifier is to determine the hyperplane which maximizes the margin. Since maximizing the margin $1/\|w\|$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$, for a separable dataset we can define the hard margin SVM as the following optimization problem:
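$$
\begin{aligned}
\min_{w,b} \quad & \frac{1}{2}\|w\|^2 \\
\text{u.c.} \quad & y_i \left(w^T x_i + b\right) \ge 1, \quad i = 1, \dots, n
\end{aligned}
\tag{3.1}
$$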
Here, $y_i \left(w^T x_i + b\right) \ge 1$ is just a compact way to express the position of the two classes of data points relative to the hyperplane $H$. In fact, we have $w^T x_i + b \ge 1$ for the class $y_i = 1$ and $w^T x_i + b \le -1$ for the class $y_i = -1$.
The historical approach to solving this quadratic program is to map the primal problem to its dual problem. We give here the main result, while the detailed derivation can be found in Appendix C.1. Via the KKT theorem, this approach gives us the following optimal solution $(w^\star, b^\star)$:
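$$w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i x_i$$

where $\alpha^\star = (\alpha_1^\star, \dots, \alpha_n^\star)$ is the solution of the dual optimization problem with dual variable $\alpha = (\alpha_1, \dots, \alpha_n)$ of dimension $n$:

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \\
\text{u.c.} \quad & \alpha_i \ge 0, \quad i = 1, \dots, n
\end{aligned}
$$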
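To make the link between the dual variable and the weight vector concrete, the following sketch evaluates the dual objective and rebuilds $w$ from a given $\alpha$. It does not solve the quadratic program: the two-point data set and the candidate $\alpha = (0.25, 0.25)$ are hand-derived assumptions, and the helper names are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two sequences."""
    return sum(a * b for a, b in zip(u, v))


def weight_from_dual(alpha, X, y):
    """w = sum_i alpha_i * y_i * x_i."""
    d = len(X[0])
    return [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(d)]


def dual_objective(alpha, X, y):
    """sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


# Hypothetical two-point data set; alpha is a hand-derived candidate,
# not the output of a solver.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]

w = weight_from_dual(alpha, X, y)        # weight vector rebuilt from alpha
primal_value = 0.5 * dot(w, w)           # (1/2) ||w||^2
dual_value = dual_objective(alpha, X, y)
```

For this toy case the primal and dual values coincide, which is what one expects at the optimum of a convex quadratic program.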
We remark that the hard margin problem is a quadratic program in the variables $(w, b) \in \mathbb{R}^{d+1}$ with $n$ linear inequality constraints. It may become meaningless if it has no solution (the dataset is inseparable) or too many solutions (which affects the stability of the decision boundary with respect to the data). The questions of the existence of a solution to Problem 3.5 and of the sensitivity of the solution to the dataset are very difficult. A quantitative characterization can be found in the later discussion of the framework of Vapnik-Chervonenkis theory.

We present here an intuitive view of this problem, which depends on two main factors. The first one is the dimension of the space of functions $h(x)$ which determine the decision boundary. In the linear case, it is simply the dimension of the couple $(w, b)$. If the dimension of this function space is too small, as in the linear case, it is possible that no linear solution exists, i.e. the dataset cannot be separated by a simple linear classifier. The second factor is the number of data points, which enters the optimization program via $n$ inequality constraints. If the number of constraints is too large, the solution may not exist either.

In order to overcome this problem, we must increase the dimension of the optimization problem. There exist two possible ways to do this. The first one consists of relaxing the inequality constraints by introducing additional variables which loosen the strict separation. We allow the separation to hold with a certain error (some data points on the wrong side). This technique was first introduced by Cortes C. and Vapnik V. (1995) under the name “Soft margin SVM”. The second one consists of using a non-linear classifier, which directly extends the function space to a higher dimension. The use of a non-linear classifier can rapidly increase the dimension of the optimization problem, which raises a computational problem. An elegant way to get around this is to employ the notion of kernel. In the next discussions, we clarify these two approaches and then finish this section by introducing two general frameworks of this learning theory.
Soft margin classification
In fact, the inequality constraints described above, $y_i \left(w^T x_i + b\right) \ge 1$, ensure that all data points are correctly classified with respect to the optimal hyperplane. As the data may be inseparable, an intuitive remedy is to relax the strict constraints by introducing additional variables $\xi_i$, $i = 1, \dots, n$, called slack variables. They allow a certain error in the classification via the new constraints:

$$y_i \left(w^T x_i + b\right) \ge 1 - \xi_i, \quad i = 1, \dots, n \tag{3.2}$$
For $\xi_i > 1$, the data point $x_i$ is misclassified, whereas $0 < \xi_i \le 1$ can be interpreted as a margin error. By this definition of the slack variables, $\sum_{i=1}^{n} \xi_i$ is directly related to the number of misclassified points (it is an upper bound on it). In order to control the level of error in the classification problem, we introduce an additional term $C \sum_{i=1}^{n} \xi_i^p$ in the objective function and rewrite the optimization problem as follows:

$$
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i^p \\
\text{u.c.} \quad & y_i \left(w^T x_i + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}
\tag{3.3}
$$
Here, $C$ is the parameter used to fix our desired level of error, and taking $p \ge 1$ is a usual way to keep the additional term convex.¹ The soft-margin solution of the SVM problem can be interpreted as a regularization technique of the kind found in other optimization problems such as regression, filtering or matrix inversion. The same result is recovered later from a regularization viewpoint when we discuss the use of kernels.
Before moving on to the discussion of non-linear classification with the kernel approach, we remark that the soft margin SVM problem now has the higher dimension $d + 1 + n$. However, the computational cost does not increase. Thanks to the KKT theorem, we can turn this primal problem into a dual problem with simpler constraints. We can also work directly with the primal problem by performing a trivial optimization over $\xi$. The primal problem is then no longer a quadratic program; however, it can be solved by Newton optimization or conjugate gradient, as demonstrated in Chapelle O. (2007).
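The trivial optimization over $\xi$ can be made explicit: for fixed $(w, b)$, the smallest slack compatible with the constraints is $\xi_i = \max\left(0,\, 1 - y_i (w^T x_i + b)\right)$. The sketch below computes these slacks and the resulting unconstrained objective. It only evaluates the objective and does not implement the Newton or conjugate gradient solvers cited above. The function names, the 1-D data set and the hand-picked $(w, b)$ are assumptions for illustration.

```python
def discriminant(w, b, x):
    """Linear discriminant h(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def optimal_slacks(w, b, X, y):
    """Smallest slack >= 0 satisfying y_i (w^T x_i + b) >= 1 - slack_i."""
    return [max(0.0, 1.0 - y_i * discriminant(w, b, x_i)) for x_i, y_i in zip(X, y)]


def soft_margin_objective(w, b, X, y, C, p=1):
    """(1/2) ||w||^2 + C * sum_i slack_i^p, with the slacks eliminated."""
    slacks = optimal_slacks(w, b, X, y)
    return 0.5 * sum(wi * wi for wi in w) + C * sum(s ** p for s in slacks)


# Hypothetical 1-D data set and a hand-picked (w, b), not an optimized solution.
X = [[2.0], [0.5], [-0.5], [-2.0]]
y = [1, 1, 1, -1]
w, b = [1.0], 0.0

slacks = optimal_slacks(w, b, X, y)
value_p1 = soft_margin_objective(w, b, X, y, C=1.0, p=1)
value_p2 = soft_margin_objective(w, b, X, y, C=1.0, p=2)
```

In this example the second point has a slack between 0 and 1 (a margin error) and the third point has a slack above 1 (a misclassification), matching the interpretation of the slack variables given earlier.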
Non-linear classification, Kernel approach
The second approach to improving the classification is to employ a non-linear SVM. In the context of SVM, we emphasize that the construction of a non-linear discriminant function $h(x)$ consists of two steps. We first map the data space $\mathcal{X}$ of dimension $d$ to a feature space $\mathcal{F}$ of higher dimension $N$ via a non-linear transformation $\phi : \mathcal{X} \to \mathcal{F}$; a hyperplane is then constructed in the feature space $\mathcal{F}$ as presented before:
$$h(x) = w^T \phi(x) + b$$
Here, the resulting vector $z = (z_1, \dots, z_N) = \phi(x)$ is an $N$-component vector in $\mathcal{F}$, hence $w$ is also a vector of size $N$. The hyperplane $H = \{z : w^T z + b = 0\}$ defined in $\mathcal{F}$ is no longer a linear decision boundary in the initial space $\mathcal{X}$:

$$B = \{x : w^T \phi(x) + b = 0\}$$
At this stage, the generalization to the non-linear case helps us to avoid the problem of overfitting or underfitting. However, a computational problem emerges due to the high dimension of the feature space. For example, a quadratic transformation can lead to a feature space of dimension $N = d(d + 3)/2$. The main question is how to construct the separating hyperplane in the feature space. The answer is to pass to the dual problem. In this way, our $N$-dimensional problem turns again into the following $n$-dimensional optimization problem with dual variable $\alpha$:
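$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) \\
\text{u.c.} \quad & \alpha_i \ge 0, \quad i = 1, \dots, n
\end{aligned}
$$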
Indeed, the expansion of the optimal solution $w^\star$ has the following form:

$$w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i \phi(x_i)$$
In order to solve the quadratic program, we do not need the explicit form of the non-linear map $\phi$ but only the kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, which is usually assumed to be symmetric. Providing only the kernel $K(x_i, x_j)$ to the optimization problem is enough to construct the hyperplane $H$ in the feature space $\mathcal{F}$, or equivalently the decision boundary in the data space $\mathcal{X}$. Thanks to the expansion of the optimal $w^\star$ on the initial data $x_i$, $i = 1, \dots, n$, the discriminant function can be computed as follows:
$$h(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + b$$
From this expression, we can construct the decision function used to classify a given input $x$ as $f(x) = \operatorname{sign}(h(x))$.
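The kernel form of the decision function can be sketched as follows. The kernel is passed as an argument, so the same code applies to any choice of $K$. A linear kernel is used here only to keep the numbers easy to check by hand. The function names, the training set and the values of $\alpha$ and $b$ are assumptions for illustration, not the output of a solver.

```python
def linear_kernel(u, v):
    """Simplest kernel, K(u, v) = u^T v, used here to keep the example checkable."""
    return sum(a * b for a, b in zip(u, v))


def decision_function(x, alpha, X, y, b, kernel):
    """h(x) = sum_i alpha_i y_i K(x, x_i) + b."""
    return sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y, X)) + b


def classify(x, alpha, X, y, b, kernel):
    """f(x) = sign(h(x)), returned as +1 or -1 (0 exactly on the boundary)."""
    h = decision_function(x, alpha, X, y, b, kernel)
    return (h > 0) - (h < 0)


# Hypothetical training set and dual variables (hand-derived, not solver output).
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha, b = [0.25, 0.25], 0.0

h_new = decision_function([2.0, 0.0], alpha, X, y, b, linear_kernel)
label = classify([-3.0, 1.0], alpha, X, y, b, linear_kernel)
```

Only kernel evaluations between the new input and the training points are needed. The feature map $\phi$ never appears explicitly.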
For a given non-linear function $\phi(x)$, we can compute the kernel $K(x_i, x_j)$ via the scalar product of two vectors in the space $\mathcal{F}$. However, the converse does not hold unless the kernel satisfies the condition of Mercer's theorem (1909). Here, we list some standard kernels which are already widely used in the pattern recognition domain:
i. Polynomial kernel: $K(x, y) = \left(x^T y + 1\right)^p$
ii. Radial Basis kernel: $K(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right)$
iii. Neural Network kernel: $K(x, y) = \tanh\left(a\, x^T y - b\right)$
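The three kernels listed above can be written directly from their formulas. The sketch below also checks, for $p = 2$, that the polynomial kernel equals a scalar product in an explicit quadratic feature space made of one constant component plus $d(d + 3)/2$ non-constant ones. The default parameter values (`sigma`, `a`, `b`), the helper names and the test vectors are assumptions for illustration.

```python
import math
from itertools import combinations


def dot(u, v):
    """Euclidean inner product of two sequences."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, y, p=2):
    """K(x, y) = (x^T y + 1)^p."""
    return (dot(x, y) + 1.0) ** p


def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def tanh_kernel(x, y, a=1.0, b=0.0):
    """K(x, y) = tanh(a x^T y - b)."""
    return math.tanh(a * dot(x, y) - b)


def quadratic_features(x):
    """Explicit map phi such that phi(x)^T phi(y) = (x^T y + 1)^2."""
    r2 = math.sqrt(2.0)
    linear = [r2 * xi for xi in x]
    squares = [xi * xi for xi in x]
    cross = [r2 * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return [1.0] + linear + squares + cross


# Hypothetical test vectors in dimension d = 3.
u, v = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
phi_u, phi_v = quadratic_features(u), quadratic_features(v)
kernel_value = polynomial_kernel(u, v, p=2)
feature_value = dot(phi_u, phi_v)
```

Evaluating the kernel costs one scalar product in dimension $d$, whereas the explicit route requires building vectors whose length grows quadratically with $d$.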