
2.3 Machine Learning

2.3.1 Supervised Learning

2.3.1.5 Support Vector Machines

Another approach to supervised machine learning is offered by Support Vector Machines (SVMs) [Vap13, Bur98, CST00, SS04]. Let us assume we have labeled training data $\{x_i, y_i\}$, $i = 1, \dots, l$, with $y_i \in \{-1, 1\}$ and $x_i \in \mathbb{R}^d$. SVMs are based on the notion of a “separating hyperplane”, a hyperplane in the space $\mathbb{R}^d$ that separates the positive from the negative examples. Let $d_+$ and $d_-$ be the shortest distances from the separating hyperplane to the closest positive and negative examples, respectively. The sum $d_+ + d_-$ is called the “margin” of the separating hyperplane. If the training examples are linearly separable, the support vector algorithm searches for the separating hyperplane with the largest margin (see Figure 2.8). Creating the largest possible distance between the separating hyperplane and the instances on either side of it has been proven to reduce an upper bound on the expected generalization error [VK82].

Figure 2.8: Example of a separating hyperplane. Source: http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html

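The separating hyperplane can be written as $x \cdot w + b = 0$, where $w$ is the normal to the hyperplane and $b$ is the offset. With $w$ and $b$ scaled so that the closest examples satisfy the constraints below with equality, the margin equals $2/\|w\|$, so finding the hyperplane with the maximal margin amounts to minimizing $\|w\|^2$ subject to:

$$x_i \cdot w + b \geq +1 \quad \text{for } y_i = +1 \tag{2.25}$$

$$x_i \cdot w + b \leq -1 \quad \text{for } y_i = -1 \tag{2.26}$$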

These constraints can be combined into a single set of inequalities:

$$y_i (x_i \cdot w + b) - 1 \geq 0 \quad \forall i \tag{2.27}$$
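As an illustration of Eq. 2.27, the following minimal sketch checks the constraint on a small two-dimensional data set. The data points and the hyperplane parameters `w` and `b` are hypothetical values chosen by hand, not the output of a training algorithm. The sketch also reports which points meet the constraint with equality and computes the margin width, which equals $2/\|w\|$ under the scaling of Eqs. 2.25 and 2.26.

```python
import math

# Hypothetical 2-D training set: pairs (x_i, y_i) with y_i in {-1, +1}.
data = [
    ((2.0, 2.0), +1),
    ((3.0, 3.0), +1),
    ((0.0, 0.0), -1),
    ((-1.0, 0.0), -1),
]

# Candidate hyperplane x . w + b = 0 (assumed values, chosen by hand).
w = (0.5, 0.5)
b = -1.0


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def constraint_values(data, w, b):
    """Return y_i (x_i . w + b) - 1 for every training point (Eq. 2.27)."""
    return [y * (dot(x, w) + b) - 1.0 for x, y in data]


values = constraint_values(data, w, b)

# Eq. 2.27 holds when every value is non-negative (small tolerance for rounding).
feasible = all(v >= -1e-12 for v in values)

# Points that satisfy Eq. 2.27 with equality lie exactly on the margin.
on_margin = [x for (x, _), v in zip(data, values) if abs(v) < 1e-12]

# With the scaling of Eqs. 2.25-2.26 the margin width is 2 / ||w||.
margin = 2.0 / math.sqrt(dot(w, w))

print("constraints satisfied:", feasible)
print("points on the margin:", on_margin)
print("margin width:", round(margin, 4))
```

For these values the constraint holds for all four points, the points (2, 2) and (0, 0) lie on the margin, and the margin width coincides with the distance between those two points.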

The solution in the two-dimensional case has the form depicted in Figure 2.8. The training points for which the equality in Eq. 2.27 holds are called the support vectors (solid shapes in Fig. 2.8); removing them would change the solution. The problem can be restated in a Lagrangian formulation (see [Bur98] for a detailed explanation): we introduce a set of Lagrange multipliers $\alpha_i$, $i = 1, \dots, l$, one for each inequality constraint in Eq. 2.27, and obtain the Lagrangian:

$$L_P \equiv \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i \tag{2.28}$$

This is a convex quadratic programming problem and can be solved with standard algorithms, for example through the dual representation of the problem [Fle87]. In the resulting solution, $w$ is a linear combination of the support vectors.
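For reference, the dual representation can be sketched as follows. This is the standard derivation (see [Bur98]) rather than a result specific to this work, and the symbol $L_D$ for the dual objective is a name introduced here. Setting the derivatives of the Lagrangian of Eq. 2.28 with respect to $w$ and $b$ to zero and substituting back gives:

```latex
\begin{align*}
\frac{\partial L_P}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i, \\
\frac{\partial L_P}{\partial b} = 0 \;&\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0, \\
L_D &= \sum_{i=1}^{l} \alpha_i
      - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j .
\end{align*}
```

$L_D$ is maximized subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$. Only the training points with $\alpha_i > 0$ contribute to $w$, and these are the support vectors. Note also that the training data enter $L_D$ only through the dot products $x_i \cdot x_j$.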

The mechanism described above cannot be applied directly to non-separable data, which may arise, for example, from misclassified instances. The problem can be addressed by using a soft margin that accepts some misclassifications of the training instances [VCC+99]. To this end, we can introduce a set of non-negative slack variables $\xi_i$, $i = 1, \dots, l$ [CV95]. The constraints then become:

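$$x_i \cdot w + b \geq +1 - \xi_i \quad \text{for } y_i = +1 \tag{2.29}$$

$$x_i \cdot w + b \leq -1 + \xi_i \quad \text{for } y_i = -1 \tag{2.30}$$

$$\xi_i \geq 0 \quad \forall i \tag{2.31}$$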

Consequently, the primal Lagrangian is:

$$L_P \equiv \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \{ y_i (x_i \cdot w + b) - 1 + \xi_i \} - \sum_{i=1}^{l} \mu_i \xi_i \tag{2.32}$$

where $\mu_i$ are the Lagrange multipliers introduced to enforce the non-negativity of the $\xi_i$. $C$ is a parameter chosen by the user, with larger values corresponding to a higher penalty on errors. The primal problem is then solved as in the separable case.
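To make the role of the slack variables and of $C$ more concrete, the sketch below computes, for a fixed hyperplane, the smallest slack that each training point needs in order to satisfy the soft-margin constraints, together with the penalized quantity $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ that appears in Eq. 2.32. The data, the hyperplane and the two values of $C$ are hypothetical and chosen by hand; the helper names `slacks` and `soft_margin_objective` are introduced here for illustration only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(data, w, b):
    """Smallest xi_i >= 0 such that y_i (x_i . w + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(x, w) + b)) for x, y in data]


def soft_margin_objective(data, w, b, C):
    """0.5 * ||w||^2 + C * sum_i xi_i for the given hyperplane."""
    return 0.5 * dot(w, w) + C * sum(slacks(data, w, b))


# Hypothetical data: the last point is a positive example that lies on the
# negative side of the hyperplane, so the set is not linearly separable by it.
data = [
    ((2.0, 2.0), +1),
    ((3.0, 3.0), +1),
    ((0.0, 0.0), -1),
    ((-1.0, 0.0), -1),
    ((0.0, 1.0), +1),
]
w, b = (0.5, 0.5), -1.0  # assumed hyperplane, not the result of training

xi = slacks(data, w, b)
print("slack variables:", xi)

# A slack larger than 1 means the point is on the wrong side of the hyperplane.
misclassified = [x for (x, _), s in zip(data, xi) if s > 1.0]
print("misclassified points:", misclassified)

# A larger C makes the same violation more expensive.
for C in (1.0, 10.0):
    print("C =", C, "objective =", soft_margin_objective(data, w, b, C))
```

Only the last point needs a non-zero slack, and the penalty it contributes grows linearly with $C$. An actual SVM solver would optimize $w$, $b$ and the $\xi_i$ jointly instead of evaluating a fixed hyperplane.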

Unfortunately, many real-world problems involve non-linear relationships between the input features and the output value. One solution is to map the data into a higher-dimensional space and define a separating hyperplane there [BGV92, ABR64]. This high-dimensional space is called the transformed feature space, as opposed to the input space defined by the training set. The input data are mapped to some other Euclidean space $\mathcal{H}$ using a mapping function $\Phi$:

$$\Phi : \mathbb{R}^d \mapsto \mathcal{H} \tag{2.33}$$
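A small example may help to see why such a mapping is useful. The sketch below uses XOR-style labels, a classic case in which no line in the two-dimensional input space separates the two classes. The feature map `phi`, which appends the product of the two coordinates, and the hyperplane in the feature space are hypothetical choices made by hand for this illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Hypothetical feature map from R^2 to R^3: append the product x1 * x2."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


# XOR-style data: points in opposite quadrants share a label.
data = [
    ((1.0, 1.0), +1),
    ((-1.0, -1.0), +1),
    ((1.0, -1.0), -1),
    ((-1.0, 1.0), -1),
]

# Hyperplane in the feature space, chosen by hand: it looks only at x1 * x2.
w, b = (0.0, 0.0, 1.0), 0.0

# y_i (phi(x_i) . w + b) for every point; values >= 1 satisfy the margin constraint.
functional_margins = [y * (dot(phi(x), w) + b) for x, y in data]
separated = all(m >= 1.0 for m in functional_margins)

print("functional margins in feature space:", functional_margins)
print("linearly separable after mapping:", separated)
```

After the mapping, all four points satisfy the margin constraint with respect to a single hyperplane, although the original two-dimensional points cannot be separated by any line.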

The SVM training algorithm depends on the training data only through dot products $x_i \cdot x_j$, so after the mapping it would depend only on products of the form $\Phi(x_i) \cdot \Phi(x_j)$. The next step is to introduce a “kernel” function $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$; this function is all the training algorithm needs, which avoids having to define $\Phi$ explicitly. Kernels are therefore special functions that allow inner products to be computed in the feature space without passing through the mapping [SB99]. Once the hyperplane is obtained, the kernel function is also used to classify new points. The classification function has the following expression:

$$f(x) = \sum_{i=1}^{N_s} \alpha_i y_i \, \Phi(s_i) \cdot \Phi(x) + b = \sum_{i=1}^{N_s} \alpha_i y_i \, K(s_i, x) + b \tag{2.34}$$

where $s_i$ are the support vectors and $N_s$ is their number. Selecting the right kernel function is extremely important for the accuracy and efficiency of the SVM. Different classes of problems require different kinds of kernels. Genton [Gen01] surveys several types of kernel functions but does not investigate which kernels are optimal for a given problem. It is common practice to test a range of candidate kernels and use cross-validation over the training set to find the best one. This has the negative side effect of increasing the time needed to train the SVM. Selecting a kernel (and its settings) is analogous to choosing the number of hidden layers in a neural network.
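Returning to Eq. 2.34, the following sketch illustrates the kernel idea with the homogeneous quadratic kernel $K(x, z) = (x \cdot z)^2$ on $\mathbb{R}^2$, for which an explicit mapping $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ is known. It evaluates the classification function once through the kernel and once through the explicit mapping. The support vectors, multipliers and offset are made-up values used only for illustration, not the result of an optimization.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Explicit feature map of the homogeneous quadratic kernel on R^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without forming phi."""
    return dot(x, z) ** 2


def explicit_kernel(x, z):
    """Same quantity computed as phi(x) . phi(z) in the feature space."""
    return dot(phi(x), phi(z))


def decision_function(x, support_vectors, alphas, labels, b, kernel):
    """f(x) = sum_i alpha_i y_i K(s_i, x) + b (Eq. 2.34)."""
    total = sum(a * y * kernel(s, x)
                for s, a, y in zip(support_vectors, alphas, labels))
    return total + b


# Hypothetical "trained" model (assumed values).
support_vectors = [(1.0, 0.0), (0.0, 2.0)]
alphas = [0.5, 0.25]
labels = [+1, -1]
b = -0.5

x_new = (2.0, 1.0)
f_kernel = decision_function(x_new, support_vectors, alphas, labels, b,
                             quadratic_kernel)
f_explicit = decision_function(x_new, support_vectors, alphas, labels, b,
                               explicit_kernel)

# The sign of f(x) gives the predicted class.
predicted = +1 if f_kernel >= 0 else -1
print("f(x) via kernel:", f_kernel)
print("f(x) via explicit mapping:", f_explicit)
print("predicted class:", predicted)
```

Both evaluations agree up to floating-point rounding, but the kernel version never constructs the feature-space vectors. This matters for kernels whose feature space is very high-dimensional or infinite-dimensional.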

An interesting aspect of SVMs is that the model complexity of an SVM does not depend on the number of features present in the training data; SVMs are therefore well suited to learning tasks where the number of features is large with respect to the number of training instances.