3. Learning and classification models
3.5. Support vector machines (SVMs)
Originally proposed by Boser et al. in [12], the support vector machine approach is a supervised machine learning method devised specifically for pattern recognition. Grounded in statistical learning theory, SVMs learn from training examples to construct a set of hyperplanes that separate the data points into two different classes with the maximum margin in a high-dimensional feature space. An SVM provides a direct mapping from observed variables to target variables, which is the result of learning the parameters of a mapping function. The mapping function is usually selected and optimized according to a set of constraints, among them the minimization of training errors. Even though other types of SVMs have been proposed since the initial presentation, the original SVM is a strictly discriminative approach.
Being a general pattern recognition method, SVM has been used in a variety of applications. Its modular nature also adds to its popularity, since users need not worry about the implementation details of the algorithm. The publication of the SVM library [21] gives users easy access to pattern recognition and has greatly benefited other studies.
3.5.1. Problem definition
In mathematical terms, the primary objective of SVMs is to learn a mapping $x \mapsto y$, where $x \in \mathcal{X}$ represents the feature vector (observed variables) and $y \in \mathcal{Y}$ denotes the class label (target variable). In a binary classification problem, where $x \in \mathbb{R}^n$ (an $n$-dimensional feature space) and $y \in \{\pm 1\}$, the goal is to learn a classifier defined in equation (25), with function parameters $\theta$, given the training set $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$. Once the mapping function is learned, the algorithm can take in new unlabeled data points and render a label with a certain level of confidence.
$$y = f(x, \theta) \qquad (25)$$
Since there are many functions that can separate the training data with respect to their labels, as shown in Figure 3.4, the optimization process should be constrained with additional requirements so that the best hyperplane is selected. One requirement of SVM is that the selected hyperplane should have the largest margin, that is, the largest distance to the nearest data points of either class; the other is that the training error should be kept to a minimum.
Depending on whether the data points from different classes are linearly separable, SVMs can be linear or nonlinear. A nonlinear SVM is simply an extension of the linear SVM: it first maps linearly inseparable data points in the original space to a higher-dimensional feature space, in which the data points become linearly separable. Such a mapping is realized through the use of kernel functions.
Figure 3.4. An example of separating data points from two different classes. There exist many hyperplanes that can manage such data separation.
3.5.2. Linear SVM
In the case of linear SVM, a hyperplane can be expressed as a linear combination of the dimensions of the observed data point $x_i$: $w \cdot x_i + b = 0$, where $w$ is the normal vector, $b$ determines the offset of the hyperplane from the origin (the perpendicular distance is $|b| / \|w\|$), and $\cdot$ denotes the dot product. In association with class labels, the hyperplane should be defined such that the following conditions are satisfied:
$$x_i \cdot w + b \ge +1 \;\; \text{if } y_i = +1, \qquad x_i \cdot w + b \le -1 \;\; \text{if } y_i = -1 \qquad (26)$$
Equation (26) can be rewritten in a single form:
$$y_i (w \cdot x_i + b) \ge 1 \qquad (27)$$
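The constraint in equation (27) can be checked numerically for a candidate hyperplane. The following is a minimal sketch assuming NumPy; the four toy points, the values of `w` and `b`, and the helper name `satisfies_margin` are illustrative assumptions and do not come from the original experiments.

```python
import numpy as np

def satisfies_margin(X, y, w, b, tol=1e-12):
    """Return a boolean array that is True where y_i (w . x_i + b) >= 1."""
    functional_margins = y * (X @ w + b)
    return functional_margins >= 1.0 - tol

# Hypothetical toy data: two points per class in a 2-D feature space.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Candidate hyperplane w . x + b = 0 (values chosen by hand for this data).
w = np.array([0.5, 0.5])
b = -1.0

ok = satisfies_margin(X, y, w, b)
functional_margins = y * (X @ w + b)
```

For this data every entry of `ok` is true, and the points whose functional margin equals exactly 1 are the ones that lie on the two bounding hyperplanes.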
The data points that lie exactly on the hyperplanes $x_i \cdot w + b = 1$ or $x_i \cdot w + b = -1$ are called support vectors. Recall that in the definition of SVM, the hyperplanes are constrained to have the maximum margin. Since the margin between them is $2 / \|w\|$ by simple geometric deduction, maximizing the margin is equivalent to minimizing $\|w\|$. Combining the objective and the constraint, we arrive at the theoretical formulation of linear SVM, which is to minimize $\|w\|$ subject to equation (27).
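The geometric claim that the margin equals 2/||w|| can be verified on a small example. This is a sketch assuming NumPy; the hyperplane parameters and the two points are hypothetical values chosen so that each point lies on one of the two bounding hyperplanes.

```python
import numpy as np

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / np.linalg.norm(w)

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical separating hyperplane and one point on each bounding hyperplane.
w = np.array([0.5, 0.5])
b = -1.0
x_pos = np.array([2.0, 2.0])   # satisfies w.x + b = +1
x_neg = np.array([0.0, 0.0])   # satisfies w.x + b = -1

width = margin_width(w)
d_pos = distance_to_hyperplane(x_pos, w, b)
d_neg = distance_to_hyperplane(x_neg, w, b)
```

Here `d_pos + d_neg` equals `width`, so shrinking the norm of `w` while keeping the constraints satisfied is what widens the margin.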
Since the norm $\|w\|$ involves a square root, it is difficult to minimize directly. Without changing the solution, the problem is reformulated as $\min \frac{1}{2}\|w\|^2$ subject to equation (27); the factor $\frac{1}{2}$ is a convention that simplifies the derivative and does not change the minimizer. This formulation can be expressed with a Lagrange function of the following form:
$$L_p = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] \qquad (28)$$
where $\alpha_i \ge 0$ are the Lagrange multipliers. This problem can be solved with quadratic programming.
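The stationarity conditions of the Lagrange function in equation (28) are worth writing out, since they express the normal vector in terms of the training points. The derivation below is the standard one; the multiplier symbol $\alpha_i$ follows common notation and is an assumption about the original typesetting.

```latex
\begin{align*}
L_p &= \tfrac{1}{2}\|w\|^2
      - \sum_{i=1}^{m} \alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right],
      \qquad \alpha_i \ge 0 \\
\frac{\partial L_p}{\partial w} &= w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0
      \;\Longrightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i \\
\frac{\partial L_p}{\partial b} &= -\sum_{i=1}^{m} \alpha_i y_i = 0
      \;\Longrightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0
\end{align*}
```

Substituting both conditions back into $L_p$ eliminates $w$ and $b$ and leaves an expression in the multipliers alone, in which the training points appear only through dot products.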
3.5.3. Nonlinear SVM—the kernel trick
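Since the normal vector can be expressed as $w = \sum_{i=1}^{m} \alpha_i y_i x_i$, the linear SVM can alternatively be formulated as the following optimization problem, in which $L(\alpha)$ is maximized with respect to the multipliers:
$$L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \qquad (29)$$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$. In equation (29), the dot product is written as a linear kernel function $k(x_i, x_j) = x_i \cdot x_j$, which can be easily extended to nonlinear cases.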
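The dual objective in equation (29) and the recovery of the hyperplane from the multipliers can be sketched as follows. This assumes NumPy; the toy data and the multiplier values in `alpha` are hypothetical, worked out by hand for this example rather than produced by a quadratic programming solver, and the function names are illustrative.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of pairwise dot products between rows of A and rows of B."""
    return A @ B.T

def dual_objective(alpha, X, y, kernel):
    """Value of sum(alpha) - 0.5 * sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    K = kernel(X, X)
    Q = (y[:, None] * y[None, :]) * K
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# Hypothetical toy data and hand-derived multipliers (only two are nonzero).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])

dual_value = dual_objective(alpha, X, y, linear_kernel)

# Recover the primal solution: w from the stationarity condition,
# b from the support vectors, which satisfy y_i (w . x_i + b) = 1.
w = (alpha * y) @ X
support = alpha > 0
b = np.mean(y[support] - X[support] @ w)
primal_value = 0.5 * w @ w
```

For this data the dual and primal values coincide, and only the points with nonzero multipliers contribute to `w`.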
It is not always guaranteed that the data points are linearly separable in the original feature space. In such cases, the training data should first be transformed into a higher-dimensional feature space in which linear separability is satisfied, so that linear SVM can be directly applied to perform pattern recognition. Suppose the feature transformation function is denoted as $\phi(x_i)$; then the kernel function in equation (29) can be expressed as
$$k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \qquad (30)$$
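Equation (30) can be illustrated with a kernel whose feature map is small enough to write explicitly. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, which is a swapped-in example chosen for illustration and is not a kernel discussed in this section; the function names are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """Same quantity computed in the original space: (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))   # dot product in the 3-D feature space
implicit = poly2_kernel(x, z)       # kernel evaluated in the 2-D input space
```

Both values agree, which is the point of the kernel trick: the dot product in the transformed space is obtained without ever computing the transformation.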
Thus the nonlinear SVM problem is transformed into a linear SVM in a higher-dimensional feature space by replacing the linear kernel function with a nonlinear one. There are many types of nonlinear kernel functions; the most widely used in scene recognition, however, is the radial basis function (RBF):
$$k(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right), \quad \gamma > 0 \qquad (31)$$
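A minimal sketch of the RBF kernel in equation (31), assuming NumPy, is given here. The function name `rbf_kernel`, the sample points and the value of `gamma` are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs.
    sq_dists = (
        (A ** 2).sum(axis=1)[:, None]
        + (B ** 2).sum(axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.exp(-gamma * sq_dists)

# Hypothetical sample points in a 2-D feature space.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X, gamma=0.5)
```

The resulting matrix is symmetric with ones on the diagonal, and its entries decay toward zero as points move apart; a larger `gamma` makes that decay faster and the decision boundary more local.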
An example of using RBF as a kernel function for nonlinear class separation is shown in Figure 3.5.
Figure 3.5. Nonlinear separation of data points from two different classes. The nonlinearity of the boundary indicates that the observed variables are in the original feature space.