2.3 Supervised Machine Learning Techniques
2.3.4 Support Vector Machines
The Support Vector Machine (SVM) is one of the most popular algorithms in modern machine learning. It was introduced by Vapnik (1995) and in its simplest form is a linear classifier that attempts to place a hyperplane between two categories of data. The neuron update rule in Equation (2.29) iteratively updates the decision boundary until an acceptable number of data points are correctly classified. The SVM, on the other hand, attempts to find a decision boundary that is optimal in some way.
Figure 2.17 illustrates three hyperplanes, H1, H2, and H3. The first hyperplane, H1, is obviously a bad choice as it incorrectly classifies most of the black circles. Both H2 and H3, however, classify all the data points correctly. This invites the question of whether one plane is better than the other and, if so, by what criteria they are judged.
“However, if you had to pick one of the lines to act as the classifier for a set of test data, I’m guessing that most of you would pick the line shown in the middle picture. It’s probably hard to describe exactly why you would do this, but somehow we prefer a line that runs through the middle of the separation between the data points from the two classes, staying approximately equidistant from the data in both classes.” Marsland (2015)
This intuition stems from a sense that if the training data truly indicates some underlying process, then when new points are drawn some will be closer to the decision boundary. Without more information there is no way of knowing where the actual dividing line sits, but placing the separator down the middle of the gap minimises the probability that new points will fall on the wrong side of the line and therefore be incorrectly classified. Building in an equal-sized margin on each side also ensures that small variations in the features do not cause unintended misclassifications.
Requiring the margin to be maximal decreases the number of ways the decision boundary can be placed. As a result, the memory capacity of the model is decreased, which can lead to a decrease in over-fitting and an increase in generalisability due to the bias-variance trade-off (Manning et al. 2008). This optimal hyperplane is illustrated in Figure 2.18.
Figure 2.17: Separating Hyperplanes (Weinberg 2012) [axes: X1, X2; hyperplanes: H1, H2, H3]
The following derivation of the optimal hyperplane is based on those in Friedman et al. (2001), Manning et al. (2008), and Marsland (2015) and assumes that the data is linearly separable. Notation is similar to that of Manning et al. (2008).
Let the decision surface be the hyperplane, $\langle \mathbf{w}, b \rangle$, defined by
$$ \mathbf{w} \cdot \mathbf{x} + b = 0. \tag{2.30} $$
Let the linear classification, $g(\mathbf{x})$, for the SVM be
$$ g(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b), \tag{2.31} $$
where $\mathbf{w}$ contains the parameters of the decision boundary and the intercept, $b$, is some bias that is not folded into the weight parameters. $g$ yields values from $\{-1, 1\}$ that correspond to each of the two classes. This differs from conventional linear classifiers that use $\{0, 1\}$.
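The classification rule in Equation (2.31) can be sketched in a few lines of Python. This is only an illustration: the helper names `dot` and `classify`, the example hyperplane, and the choice to assign points lying exactly on the hyperplane to the class 1 are assumptions made here, not part of the derivation.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def classify(w, b, x):
    """Return g(x) = sign(w . x + b) as +1 or -1.

    Points exactly on the hyperplane (w . x + b == 0) are assigned +1;
    that tie-break is an arbitrary choice for this sketch.
    """
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions.
w, b = [1.0, 1.0], -3.0

label_above = classify(w, b, [3.0, 2.0])  # 3 + 2 - 3 = 2, positive side
label_below = classify(w, b, [0.5, 1.0])  # 0.5 + 1 - 3 = -1.5, negative side
print(label_above, label_below)
```

Keeping the intercept `b` as a separate argument mirrors the text, where the bias is not folded into the weight vector.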
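Let the set of data points be $D = \{(\mathbf{x}_i, y_i)\}$, where $\mathbf{x}_i \in \mathbb{R}^l$ is the input vector, $y_i \in \{-1, 1\}$ is the label, and $l$ is the length of the input features.
Define the functional margin of the $i$th data point, $\mathbf{x}_i$, with respect to $\langle \mathbf{w}, b \rangle$, to be
$$ m_i = y_i (\mathbf{w} \cdot \mathbf{x}_i + b). \tag{2.32} $$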
Figure 2.18: Optimal Separating Hyperplane [labels: H0 (optimal hyperplane), H+, H-, geometric margin, x, x', r]
The functional margin of the entire dataset is then
$$ m = 2 \min_i m_i, \tag{2.33} $$
which is twice the minimum functional margin over all the points. The factor of 2 comes from measuring the width of the margin on each side of the hyperplane. At this stage there is still an unconstrained degree of freedom in the hyperplane, and therefore in the functional margin, as a constant scaling factor can be applied to $\mathbf{w}$ and $b$ without changing the plane, i.e. $\langle \mathbf{w}, b \rangle$ and $\langle \alpha\mathbf{w}, \alpha b \rangle$ are equivalent hyperplanes, but result in different functional margins. Because of this, a constraint on the size of $\mathbf{w}$ is required.
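The scale dependence of the functional margin is easy to check numerically. The sketch below uses a hypothetical four-point dataset and a hand-picked hyperplane, both assumptions made purely for illustration, and compares the dataset margin before and after multiplying the hyperplane parameters by a constant.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def functional_margin(w, b, x, y):
    """Functional margin y (w . x + b) of one labelled point."""
    return y * (dot(w, x) + b)


def dataset_margin(w, b, data):
    """Twice the smallest functional margin over all (x, y) pairs."""
    return 2 * min(functional_margin(w, b, x, y) for x, y in data)


# Hypothetical toy data: (x, y) pairs with labels in {-1, +1}.
data = [([3.0, 3.0], 1), ([4.0, 2.0], 1), ([1.0, 1.0], -1), ([0.0, 2.0], -1)]
w, b = [1.0, 1.0], -4.0

# Scaling w and b by the same constant describes the same plane.
scale = 10.0
scaled_w, scaled_b = [scale * wi for wi in w], scale * b

m = dataset_margin(w, b, data)
m_scaled = dataset_margin(scaled_w, scaled_b, data)
print(m, m_scaled)
```

Both parameter sets describe the same plane, yet the second margin is ten times the first, which is why the functional margin alone cannot be used to compare hyperplanes.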
Let $r$ be the Euclidean distance from some point $\mathbf{x}$ to the decision boundary. The shortest path is the line that passes through $\mathbf{x}$ perpendicular to the plane. The unit vector in this direction is $\mathbf{w}/\|\mathbf{w}\|$. Let $\mathbf{x}'$ be the intersection of the decision boundary and this line. Therefore,
$$ \mathbf{x}' = \mathbf{x} - y r \frac{\mathbf{w}}{\|\mathbf{w}\|}, \tag{2.34} $$
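where multiplying by $y$ swaps the sign for when the point is above or below the decision boundary. Because $\mathbf{x}'$ is on the decision boundary,
$$ \mathbf{w} \cdot \mathbf{x}' + b = 0. \tag{2.35} $$
Substituting Equation (2.34) into Equation (2.35) gives the following formula for $r$:
\begin{align}
\mathbf{w} \cdot \left( \mathbf{x} - y r \frac{\mathbf{w}}{\|\mathbf{w}\|} \right) + b &= 0 \tag{2.36} \\
\mathbf{w} \cdot \mathbf{x} - y r \frac{\mathbf{w} \cdot \mathbf{w}}{\|\mathbf{w}\|} + b &= 0 \tag{2.37} \\
\mathbf{w} \cdot \mathbf{x} - y r \frac{\|\mathbf{w}\|^2}{\|\mathbf{w}\|} + b &= 0 \tag{2.38} \\
\mathbf{w} \cdot \mathbf{x} - y r \|\mathbf{w}\| + b &= 0 \tag{2.39} \\
y r \|\mathbf{w}\| &= \mathbf{w} \cdot \mathbf{x} + b \tag{2.40} \\
r &= \frac{\mathbf{w} \cdot \mathbf{x} + b}{y \|\mathbf{w}\|} \tag{2.41} \\
r &= y \, \frac{\mathbf{w} \cdot \mathbf{x} + b}{\|\mathbf{w}\|}. \tag{2.42}
\end{align}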
In the final line, $y$ can be moved from the denominator to the numerator because its value can only be $-1$ or $1$, so $1/y = y$.
The data points with a minimal distance between them and the decision surface are termed support vectors, and these vectors are ultimately responsible for the placement of the decision surface. The geometric margin is the maximum width of the band that can separate the support vectors in each class and is independent of the scaling mentioned earlier due to the normalising $\|\mathbf{w}\|$ term in $r$. Because the geometric margin is independent of scale, any scaling constraint can be imposed on $\mathbf{w}$. Therefore, we can choose to require that the functional margin of each point in the dataset is at least 1, and that it equals 1 for at least one datum in each category. In other words,
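$$ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \tag{2.43} $$
$$ \exists \text{ support vectors } S \subset D \text{ such that } (\mathbf{x}_i, y_i) \in S \Rightarrow y_i (\mathbf{w} \cdot \mathbf{x}_i + b) = 1. \tag{2.44} $$
To calculate $r$ for the support vectors, substitute Equation (2.44) into Equation (2.42):
\begin{align}
r_i &= \frac{y_i (\mathbf{w} \cdot \mathbf{x}_i + b)}{\|\mathbf{w}\|} \tag{2.45} \\
&= \frac{1}{\|\mathbf{w}\|}. \tag{2.46}
\end{align}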
Hence, the geometric margin, $\rho$, is
$$ \rho = \frac{2}{\|\mathbf{w}\|} = \frac{2}{\sqrt{\mathbf{w} \cdot \mathbf{w}}}. \tag{2.47} $$
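As a numerical check of Equations (2.42) to (2.47), the following sketch computes the distance of each point to a hyperplane, picks out the support vectors, and evaluates the geometric margin. The toy dataset and the hand-picked hyperplane, already scaled so that the closest points have a functional margin of 1, are assumptions made for illustration, as are the helper names.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def distance(w, b, x, y):
    """r = y (w . x + b) / ||w||; positive when x lies on its correct side."""
    return y * (dot(w, x) + b) / math.sqrt(dot(w, w))


def geometric_margin(w):
    """rho = 2 / ||w||; only meaningful under the canonical scaling."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical toy data: (x, y) pairs with labels in {-1, +1}.
data = [([3.0, 3.0], 1), ([5.0, 4.0], 1), ([1.0, 1.0], -1), ([0.0, 0.0], -1)]
# Hand-picked hyperplane in canonical form for this data.
w, b = [0.5, 0.5], -2.0

distances = [distance(w, b, x, y) for x, y in data]
# Support vectors are the points whose functional margin equals 1.
support = [(x, y) for x, y in data if math.isclose(y * (dot(w, x) + b), 1.0)]
rho = geometric_margin(w)

# Rescaling (w, b) by a constant leaves every distance unchanged.
scaled = [distance([10 * wi for wi in w], 10 * b, x, y) for x, y in data]

print(distances, support, rho)
```

The smallest distance equals half of `rho`, and the distances computed from the rescaled parameters match the originals, which is the scale independence used to justify the canonical constraint.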
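The aim of training the SVM is to find the decision surface that maximises this geometric margin. Therefore, training the SVM involves maximising $\rho$ subject to the condition that
$$ \forall (\mathbf{x}_i, y_i) \in D, \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1. \tag{2.48} $$
Since $\rho = 2/\sqrt{\mathbf{w} \cdot \mathbf{w}}$, maximising $\rho$ is equivalent to minimising $\mathbf{w} \cdot \mathbf{w}$, so the maximisation problem is solved by minimising the objective function $L(\mathbf{w})$, subject to the same condition.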
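The link between a wide margin and a small weight vector can be illustrated by comparing two candidate separating hyperplanes. The sketch below is not a training algorithm: it only rescales each hand-picked candidate so that its smallest functional margin is 1, then compares w . w and the resulting geometric margin. The dataset, the candidates, and the helper `canonical` are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def canonical(w, b, data):
    """Rescale (w, b) so that the smallest functional margin equals 1."""
    smallest = min(y * (dot(w, x) + b) for x, y in data)
    if smallest <= 0:
        raise ValueError("hyperplane does not separate the data")
    return [wi / smallest for wi in w], b / smallest


# Hypothetical toy data and two hand-picked separating hyperplanes.
data = [([3.0, 3.0], 1), ([5.0, 4.0], 1), ([1.0, 1.0], -1), ([0.0, 0.0], -1)]
candidates = {
    "diagonal": ([1.0, 1.0], -4.0),  # x1 + x2 = 4
    "vertical": ([1.0, 0.0], -2.0),  # x1 = 2
}

results = {}
for name, (w, b) in candidates.items():
    w_c, b_c = canonical(w, b, data)
    squared_norm = dot(w_c, w_c)
    results[name] = (squared_norm, 2.0 / math.sqrt(squared_norm))

# The candidate with the smallest w . w also has the widest margin.
best = min(results, key=lambda name: results[name][0])
print(results, best)
```

Both candidates satisfy the constraint that every functional margin is at least 1 after rescaling, so the one with the smaller squared norm is preferred.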
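The solution can be found through the dual problem using Lagrange multipliers. A Lagrange multiplier, $\alpha_i$, is associated with each constraint in Equation (2.48) and the problem becomes:
$$ \max_{\alpha_i} \left( \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j \right) \tag{2.50} $$
$$ \text{such that } \sum_i \alpha_i y_i = 0, \text{ and } \alpha_i \geq 0 \;\; \forall \alpha_i. \tag{2.51} $$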
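The SVM parameters are then:
$$ \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i \tag{2.52} $$
$$ b = y_k - \mathbf{w} \cdot \mathbf{x}_k, \text{ for any } k \text{ where } \alpha_k \neq 0, \tag{2.53} $$
and the support vectors, $\mathbf{x}_i$, are those that correspond to $\alpha_i \neq 0$. This shows how the placement of the decision boundary relies on the support vectors only.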
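To make Equations (2.50) to (2.53) concrete, the sketch below recovers the hyperplane from a set of multipliers for a three-point toy problem. The multipliers were worked out by hand for this hypothetical dataset rather than produced by a solver, and the function names are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_objective(alphas, data):
    """Value of the dual objective: sum of alphas minus half the double sum."""
    quadratic = sum(
        a_i * a_j * y_i * y_j * dot(x_i, x_j)
        for a_i, (x_i, y_i) in zip(alphas, data)
        for a_j, (x_j, y_j) in zip(alphas, data)
    )
    return sum(alphas) - 0.5 * quadratic


def recover_parameters(alphas, data, tol=1e-12):
    """Build w from the multipliers and b from any point with a non-zero one."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for alpha, (x, y) in zip(alphas, data):
        for j in range(dim):
            w[j] += alpha * y * x[j]
    support = [pair for alpha, pair in zip(alphas, data) if abs(alpha) > tol]
    x_k, y_k = support[0]
    b = y_k - dot(w, x_k)
    return w, b, support


# Hypothetical toy data; the third point lies well outside the margin.
data = [([1.0, 1.0], 1), ([-1.0, -1.0], -1), ([3.0, 2.0], 1)]
# Multipliers found by hand for this data (an assumption of this sketch).
alphas = [0.25, 0.25, 0.0]

constraint = sum(alpha * y for alpha, (_, y) in zip(alphas, data))
objective = dual_objective(alphas, data)
w, b, support = recover_parameters(alphas, data)
print(constraint, objective, w, b, support)
```

The third point has a zero multiplier and so contributes nothing to `w`, which is the sense in which only the support vectors determine the boundary.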
To handle cases where the SVM may have misclassified some data points because the data is not linearly separable, the objective function can be updated to:
$$ L(\mathbf{w}) = \mathbf{w} \cdot \mathbf{w} + \lambda \sum_{i=1}^{R} \epsilon_i, \tag{2.54} $$
where $\epsilon_i$, known as the slack variable, is the distance to the correct side of the margin and $R$ is the number of incorrect classifications (Marsland 2015; Ramanan 2008). $\lambda$ is then a parameter that controls the trade-off between a large geometric margin and misclassified data points. This is called a soft margin SVM, and $\alpha_i$ from Equation (2.51) is now bounded above by $\lambda$:
$$ 0 \leq \alpha_i \leq \lambda \quad \forall \alpha_i. \tag{2.55} $$
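The soft margin objective in Equation (2.54) can be evaluated directly for a given hyperplane. The sketch below assumes the slack of a point is max(0, 1 - y (w . x + b)), a common convention that the text does not spell out, so points that satisfy the margin contribute nothing to the sum. The dataset, hyperplane, and values of λ are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(w, b, x, y):
    """Assumed slack: how far the functional margin falls short of 1."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, data, lam):
    """w . w plus lam times the total slack over all points."""
    return dot(w, w) + lam * sum(slack(w, b, x, y) for x, y in data)


# Hypothetical toy data; the third point sits on the wrong side of the plane.
data = [([3.0, 3.0], 1), ([1.0, 1.0], -1), ([2.0, 1.0], 1), ([0.0, 0.0], -1)]
w, b = [0.5, 0.5], -2.0

slacks = [slack(w, b, x, y) for x, y in data]
loose = soft_margin_objective(w, b, data, lam=1.0)
strict = soft_margin_objective(w, b, data, lam=10.0)
print(slacks, loose, strict)
```

A larger `lam` makes the same violation more expensive, so an optimiser would be pushed towards hyperplanes that accept a narrower margin in exchange for fewer violations.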
Note that the only computation that ever involves $\mathbf{x}$ is the dot product seen in Equation (2.50). This means that while the method is fundamentally designed for data that is linearly separable, it lends itself to the types of kernel functions discussed in Section 2.3.2. By including polynomial, radial-basis, or sigmoid functions, data that is not linearly separable can be embedded in higher dimensional spaces. Constructing the embedded feature vectors explicitly results in an explosion in the number of dimensions and therefore in the storage and computational costs. However, the kernels provide a way to calculate the dot product of the extended feature vectors without ever having to construct them, and therefore retain the original computational complexity. Using the kernel trick in this way, SVMs can be used successfully in a number of cases where the data is otherwise not suited to linear classification.
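A small example shows why a kernel can stand in for the dot product of extended feature vectors. The sketch below uses a homogeneous degree-2 polynomial kernel on two-dimensional inputs; this particular kernel and its explicit feature map are chosen here for illustration and are not necessarily the ones defined in Section 2.3.2.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D input matching poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, 4.0]

# Kernel evaluated in the original 2-D space.
via_kernel = poly2_kernel(x, z)
# Same quantity computed by building the 3-D feature vectors first.
via_features = dot(poly2_features(x), poly2_features(z))
print(via_kernel, via_features)
```

Both routes give the same value up to floating-point rounding, but the kernel never constructs the longer vectors. For higher degrees or input dimensions the explicit map grows quickly while the kernel still costs one dot product in the original space.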