Kernels, feature map and feature space

2.2 Support vector machine

2.2.3 Kernels, feature map and feature space

The support vector machine is also referred to as a kernel machine. The main idea is that if the data in the input space are not linearly separable, the SVM algorithm can map the input space into a higher-dimensional one, where it is possible to separate the data. However, the resulting system may become too complex, and computing the Euclidean distance between each training point and the separating surface may become too expensive. In this case, the SVM can use kernel functions, which operate in a feature space and compute only the inner product between the images of points in the feature space, instead of the Euclidean distance. This 'trick' is computationally simple and therefore allows SVMs to be used in high-dimensional classification problems. In this section we explain the fundamental notions of kernel, feature map and feature space, with reference to [20] and [9].

Definition 2.3 (Kernel). Let $X$ be a non-empty finite dimensional set. A function $k : X \times X \to \mathbb{R}$ is called a kernel on $X$ if there exist an $\mathbb{R}$-Hilbert space $H = (H, \langle \cdot, \cdot \rangle)$ and a map $\varphi : X \to H$ such that for all $x, x' \in X$ we have

$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle,$$

where $\varphi$ is called a feature map and $H$ is called a feature space of $k$.

In other words, a kernel is a function that computes the inner product of the images, under the embedding $\varphi$ into the feature space, of two data points $x, x'$ in the input space $X$.
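As a concrete instance of the definition, the sketch below compares a kernel evaluated directly in the input space with the inner product of explicit feature vectors. The homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$ and the names `poly2_kernel` and `poly2_feature_map` are illustrative assumptions, not taken from the text.

```python
import numpy as np

def poly2_kernel(x, xp):
    """Homogeneous degree-2 polynomial kernel k(x, x') = (x . x')^2."""
    return float(np.dot(x, xp)) ** 2

def poly2_feature_map(x):
    """One explicit feature map phi: R^2 -> R^3 such that
    <phi(x), phi(x')> = (x . x')^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

# Two arbitrary points in the input space R^2 (illustrative values).
x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

# Kernel evaluated directly in the input space.
k_direct = poly2_kernel(x, xp)

# Same quantity computed as an inner product in the feature space R^3.
k_via_phi = float(np.dot(poly2_feature_map(x), poly2_feature_map(xp)))

print(k_direct, k_via_phi)
```

Both routes give the same number up to rounding. The direct evaluation never forms the three-dimensional feature vectors, and this is the saving that kernels provide when the feature space is large.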

Observation 3. φ and H are not uniquely determined.
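A small worked example may help to see why this holds. It assumes $X = \mathbb{R}$ and the kernel $k(x, x') = x x'$, which are chosen here only for illustration. Two different feature maps, into two different feature spaces, reproduce the same kernel:

```latex
% Assumed setting: X = \mathbb{R}, k(x, x') = x x'
\begin{align*}
\varphi_1 &: \mathbb{R} \to \mathbb{R}, &
\varphi_1(x) &= x, &
\langle \varphi_1(x), \varphi_1(x') \rangle_{\mathbb{R}} &= x x', \\
\varphi_2 &: \mathbb{R} \to \mathbb{R}^2, &
\varphi_2(x) &= \left( \tfrac{x}{\sqrt{2}}, \tfrac{x}{\sqrt{2}} \right), &
\langle \varphi_2(x), \varphi_2(x') \rangle_{\mathbb{R}^2} &= \tfrac{x x'}{2} + \tfrac{x x'}{2} = x x'.
\end{align*}
```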

Observation 4. Given points $x_1, \dots, x_N \in X$, we can consider $k(x_i, x_j)$ as the $ij$-th element of a symmetric $N \times N$ matrix $K$. The matrix $K$ is nonnegative definite and is called the kernel matrix.
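The two properties of the kernel matrix can be checked numerically. The following is a minimal sketch, assuming a linear kernel and five random points in $\mathbb{R}^3$; the helper names `kernel_matrix` and `linear_kernel`, the data and the tolerance are all illustrative choices.

```python
import numpy as np

def kernel_matrix(kernel, points):
    """Build the N x N matrix K with entries K[i, j] = kernel(x_i, x_j)."""
    n = len(points)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(points[i], points[j])
    return K

def linear_kernel(x, xp):
    """Linear kernel k(x, x') = <x, x'> (feature map is the identity)."""
    return float(np.dot(x, xp))

# Five random points in R^3 (assumed data, fixed seed for reproducibility).
rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))

K = kernel_matrix(linear_kernel, points)

# Symmetry: K equals its transpose.
is_symmetric = np.allclose(K, K.T)

# Nonnegative definiteness: all eigenvalues are >= 0 up to rounding error.
min_eig = np.linalg.eigvalsh(K).min()
is_psd = min_eig >= -1e-10

print(is_symmetric, is_psd)
```

The small negative tolerance only absorbs floating-point error. In exact arithmetic the eigenvalues of a kernel matrix are never negative.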

Whenever $H$ is separable, it has a countable orthonormal basis, and hence $H \cong \ell^2$. In this setting we have the following result.

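Proposition 2.2.2 (Series representation of kernel). Let $X$ be a non-empty set and consider $f_n : X \to \mathbb{C}$, $n \in \mathbb{N}$, such that $(f_n(x))_{n \in \mathbb{N}} \in \ell^2$ for all $x \in X$. Then

$$k(x, x') := \sum_{n=1}^{\infty} f_n(x)\, \overline{f_n(x')}, \quad x, x' \in X,$$

defines a kernel on $X$, where the bar denotes complex conjugation and can be dropped when the $f_n$ are real-valued.

Proof. Setting $f(x) := (f_n(x))_{n \in \mathbb{N}} \in \ell^2$, the claim follows from the definition of kernel, with feature map $f$ and feature space $\ell^2$, and from the fact that the scalar product in $\ell^2$ is defined as the sum of the series:

$$k(x, x') = \langle f(x), f(x') \rangle_{\ell^2}.$$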


Observation 5 (Gaussian RBF kernel). An example of a real-valued kernel, one of the most used in practice, is the Gaussian RBF (radial basis function) kernel with width $\gamma$, defined by

$$k_\gamma(x, x') := \exp\left( -\frac{\|x - x'\|_2^2}{\gamma^2} \right), \quad x, x' \in \mathbb{R}^d.$$

It can be derived as the restriction to $\mathbb{R}^d$ of the more general complex kernel defined on $\mathbb{C}^d$:

$$k_{\gamma, \mathbb{C}^d}(z, z') := \exp\left( -\gamma^{-2} \sum_{j=1}^{d} (z_j - \bar{z}'_j)^2 \right), \quad z, z' \in \mathbb{C}^d.$$
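The real-valued Gaussian RBF kernel is straightforward to evaluate. The sketch below follows the parametrization used in this section, where $\gamma$ is a width that divides the squared distance; the function name `gaussian_rbf_kernel` and the sample points are illustrative assumptions. Some libraries instead use a multiplicative parameter, for example scikit-learn computes $\exp(-\gamma \|x - x'\|^2)$, so values of $\gamma$ are not interchangeable between the two conventions.

```python
import numpy as np

def gaussian_rbf_kernel(x, xp, gamma):
    """Gaussian RBF kernel with width gamma:
    k(x, x') = exp(-||x - x'||_2^2 / gamma^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / gamma ** 2))

# Two points in R^2 at squared Euclidean distance 2 (illustrative values).
x = np.array([0.0, 0.0])
xp = np.array([1.0, 1.0])

# A point compared with itself always gives 1.
k_same = gaussian_rbf_kernel(x, x, gamma=1.0)

# A larger width makes distant points look more similar ...
k_near = gaussian_rbf_kernel(x, xp, gamma=2.0)

# ... and a smaller width makes the kernel decay faster.
k_far = gaussian_rbf_kernel(x, xp, gamma=0.5)

print(k_same, k_near, k_far)
```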

The introduction of these notions allows us to derive the equation of the optimal hyperplane using a kernel function. The basic motivation behind this approach is Cover's theorem, which can be formulated as follows [9].

Theorem 2.2.3 (Cover's Theorem). A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.

This result plays a central role when the data are not separable and we want to classify them with a support vector machine in a manner different from the one presented in Observation 2.

Consider an input vector $x$ in a finite-dimensional input space $X$. Let $\{\varphi_i\}_{i=1}^{\infty}$ be an infinite set of feature maps defined from $X$ into a Hilbert space $H$. We may define a separating hyperplane according to the formula

$$\sum_{i=1}^{\infty} w_i \varphi_i(x) = 0,$$

where $\{w_i\}_{i=1}^{\infty}$ is an infinite set of weights that transform the feature space into the output space. In a more compact way, this can be written as

$$w^T \varphi(x) = 0, \qquad (2.19)$$

where $\varphi(x)$ is the feature vector and $w$ is the corresponding weight vector. Our aim is now to find a separating hyperplane in the feature space $H$, as we have done in Sections 2.2.1 and 2.2.2.

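In this particular context, Eq. (2.15) becomes

$$w = \sum_{i=1}^{N_s} \alpha_i y_i \varphi(x_i),$$

with $N_s$ the number of support vectors. Substituting this expression into Eq. (2.19), we can write the separating hyperplane in the output space as follows:

$$\sum_{i=1}^{N_s} \alpha_i y_i \varphi^T(x_i) \varphi(x) = 0 \;\overset{k(x_i, x) = \varphi^T(x_i) \varphi(x)}{\Longrightarrow}\; \sum_{i=1}^{N_s} \alpha_i y_i k(x_i, x) = 0. \qquad (2.20)$$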

The last equation shows why the support vector machine is often referred to as a kernel machine. In fact, from Eq. (2.20) it follows that we never have to compute the weight vector $w$ explicitly, because specifying the kernel is sufficient. This is also why Eq. (2.20) is referred to as the kernel trick. An important observation, in particular for applications, is that even when the feature space is infinite-dimensional, Eq. (2.20), which defines the optimal hyperplane, consists of a finite linear sum whose number of terms equals the number of support vectors.
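To make Eq. (2.20) concrete, the following sketch evaluates its left-hand side on the classic XOR toy problem, which is not linearly separable in the input space. The setup is an assumption made for illustration: the inhomogeneous degree-2 polynomial kernel, the four points treated as support vectors, and the multipliers $\alpha_i = 1/8$ are chosen by hand rather than produced by a trained solver, and the names `poly_kernel` and `decision_function` are hypothetical.

```python
import numpy as np

def poly_kernel(x, xp):
    """Inhomogeneous degree-2 polynomial kernel k(x, x') = (1 + x . x')^2."""
    return (1.0 + float(np.dot(x, xp))) ** 2

def decision_function(x, support_vectors, labels, alphas, kernel):
    """Left-hand side of Eq. (2.20): sum_i alpha_i * y_i * k(x_i, x).
    The weight vector w is never formed explicitly."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas))

# XOR toy data: all four points act as support vectors (assumed setup).
support_vectors = np.array([[-1.0, -1.0],
                            [-1.0,  1.0],
                            [ 1.0, -1.0],
                            [ 1.0,  1.0]])
labels = np.array([-1.0, 1.0, 1.0, -1.0])

# Hand-picked multipliers for this toy problem, not solver output.
alphas = np.full(4, 1.0 / 8.0)

# Evaluate the finite kernel sum at each training point.
scores = [decision_function(x, support_vectors, labels, alphas, poly_kernel)
          for x in support_vectors]

# The sign of the score gives the predicted class.
predictions = np.sign(scores)
print(scores, predictions)
```

With these values the sum simplifies algebraically to $-x_1 x_2$ for an input $(x_1, x_2)$, so the sign of each score matches the corresponding label, and the separating surface is the set where $x_1 x_2 = 0$. Only four kernel evaluations are needed per input, one for each support vector.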

In document Pattern recognition methods for EMG prosthetic control (Page 45-48)