5.3 Basic Kernel Operations in Feature Space

Let us look at some of the basic data analysis tasks that can be performed solely via kernels, without instantiating φ(x).

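Norm of a Point. We can compute the norm of a point $\phi(\mathbf{x})$ in feature space as follows:

$$\|\phi(\mathbf{x})\|^2 = \phi(\mathbf{x})^T \phi(\mathbf{x}) = K(\mathbf{x}, \mathbf{x})$$

which implies that $\|\phi(\mathbf{x})\| = \sqrt{K(\mathbf{x}, \mathbf{x})}$.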

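Distance between Points. The distance between two points $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ can be computed as

$$\|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2 = \|\phi(\mathbf{x}_i)\|^2 + \|\phi(\mathbf{x}_j)\|^2 - 2\,\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_i) + K(\mathbf{x}_j, \mathbf{x}_j) - 2K(\mathbf{x}_i, \mathbf{x}_j) \tag{5.11}$$

which implies that

$$\delta\big(\phi(\mathbf{x}_i), \phi(\mathbf{x}_j)\big) = \|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\| = \sqrt{K(\mathbf{x}_i, \mathbf{x}_i) + K(\mathbf{x}_j, \mathbf{x}_j) - 2K(\mathbf{x}_i, \mathbf{x}_j)}$$

Rearranging (5.11), we can see that the kernel value can be considered as a measure of the similarity between two points, since

$$\frac{1}{2}\Big(\|\phi(\mathbf{x}_i)\|^2 + \|\phi(\mathbf{x}_j)\|^2 - \|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2\Big) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$$

Thus, the greater the distance $\|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|$ between the two points in feature space, the smaller the kernel value, that is, the lower the similarity.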

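Example 5.9: Consider the two points $\mathbf{x}_1$ and $\mathbf{x}_2$ in Figure 5.1:

$$\mathbf{x}_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \qquad \mathbf{x}_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix}$$

Assuming the homogeneous quadratic kernel, the norm of $\phi(\mathbf{x}_1)$ can be computed as

$$\|\phi(\mathbf{x}_1)\|^2 = K(\mathbf{x}_1, \mathbf{x}_1) = (\mathbf{x}_1^T \mathbf{x}_1)^2 = 43.81^2 = 1919.32$$

which implies that the norm of the transformed point is $\|\phi(\mathbf{x}_1)\| = \sqrt{43.81^2} = 43.81$.

The distance between $\phi(\mathbf{x}_1)$ and $\phi(\mathbf{x}_2)$ in feature space is given as

$$\delta\big(\phi(\mathbf{x}_1), \phi(\mathbf{x}_2)\big) = \sqrt{K(\mathbf{x}_1, \mathbf{x}_1) + K(\mathbf{x}_2, \mathbf{x}_2) - 2K(\mathbf{x}_1, \mathbf{x}_2)} = \sqrt{1919.32 + 3274.13 - 2 \cdot 2501} = \sqrt{191.45} = 13.84$$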
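The norm and distance computations above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the coordinates $\mathbf{x}_1 = (5.9, 3)^T$ and $\mathbf{x}_2 = (6.9, 3.1)^T$ are assumed to be the two points of Example 5.9 (they are consistent with the linear kernel values 43.81, 50.01, and 57.22 used later in the section).

```python
import math

def quadratic_kernel(x, y):
    """Homogeneous quadratic kernel K(x, y) = (x^T y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def feature_norm(kernel, x):
    """Norm of phi(x), computed as sqrt(K(x, x))."""
    return math.sqrt(kernel(x, x))

def feature_distance(kernel, x, y):
    """Distance between phi(x) and phi(y) using kernel values only (5.11)."""
    sq = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(sq, 0.0))

# Assumed coordinates of the two points in Example 5.9.
x1 = (5.9, 3.0)
x2 = (6.9, 3.1)

norm_x1 = feature_norm(quadratic_kernel, x1)
dist_12 = feature_distance(quadratic_kernel, x1, x2)
print(round(norm_x1, 2), round(dist_12, 2))
```

Note that neither function ever builds $\phi(\mathbf{x})$; passing a different kernel function changes the feature space without changing the code.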

Mean in Feature Space. The mean of the points in feature space is given as

$$\boldsymbol{\mu}_\phi = \frac{1}{n} \sum_{i=1}^{n} \phi(\mathbf{x}_i)$$

Since we do not, in general, have access to φ(x), we cannot explicitly compute the mean point in feature space.

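Nevertheless, we can compute the squared norm of the mean as follows:

$$
\begin{aligned}
\|\boldsymbol{\mu}_\phi\|^2 = \boldsymbol{\mu}_\phi^T \boldsymbol{\mu}_\phi
&= \left(\frac{1}{n} \sum_{i=1}^{n} \phi(\mathbf{x}_i)\right)^T \left(\frac{1}{n} \sum_{j=1}^{n} \phi(\mathbf{x}_j)\right) \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j)
\end{aligned}
\tag{5.12}
$$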

The above derivation implies that the squared norm of the mean in feature space is simply the average of the values in the kernel matrix K.

Example 5.10: Consider the five points from Example 5.3, also shown in Figure 5.1. Example 5.4 showed the norm of the mean for the linear kernel. Let us consider the Gaussian kernel with $\sigma = 1$. The Gaussian kernel matrix is given as

$$
\mathbf{K} = \begin{pmatrix}
1.00 & 0.60 & 0.78 & 0.42 & 0.72 \\
0.60 & 1.00 & 0.94 & 0.07 & 0.44 \\
0.78 & 0.94 & 1.00 & 0.13 & 0.65 \\
0.42 & 0.07 & 0.13 & 1.00 & 0.23 \\
0.72 & 0.44 & 0.65 & 0.23 & 1.00
\end{pmatrix}
$$

The squared norm of the mean in feature space is therefore

$$\|\boldsymbol{\mu}_\phi\|^2 = \frac{1}{25} \sum_{i=1}^{5} \sum_{j=1}^{5} K(\mathbf{x}_i, \mathbf{x}_j) = 0.599$$
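As a quick check of Example 5.10, the following sketch averages all entries of the Gaussian kernel matrix. The function name is hypothetical, and the matrix is typed in from the two-decimal values printed above, so the result (about 0.598) differs slightly from the 0.599 quoted in the text, which was presumably computed from unrounded kernel values.

```python
# Gaussian kernel matrix from Example 5.10 (entries rounded to two decimals).
K = [
    [1.00, 0.60, 0.78, 0.42, 0.72],
    [0.60, 1.00, 0.94, 0.07, 0.44],
    [0.78, 0.94, 1.00, 0.13, 0.65],
    [0.42, 0.07, 0.13, 1.00, 0.23],
    [0.72, 0.44, 0.65, 0.23, 1.00],
]

def mean_sq_norm(K):
    """Squared norm of the feature-space mean: the average of all kernel entries."""
    n = len(K)
    return sum(sum(row) for row in K) / (n * n)

mu_sq = mean_sq_norm(K)
print(round(mu_sq, 3))
```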

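Total Variance in Feature Space. Let us first derive a formula for the squared distance of a point $\phi(\mathbf{x}_i)$ to the mean $\boldsymbol{\mu}_\phi$ in feature space:

$$
\begin{aligned}
\|\phi(\mathbf{x}_i) - \boldsymbol{\mu}_\phi\|^2
&= \|\phi(\mathbf{x}_i)\|^2 - 2\,\phi(\mathbf{x}_i)^T \boldsymbol{\mu}_\phi + \|\boldsymbol{\mu}_\phi\|^2 \\
&= K(\mathbf{x}_i, \mathbf{x}_i) - \frac{2}{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(\mathbf{x}_a, \mathbf{x}_b)
\end{aligned}
$$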

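The total variance (1.4) in feature space is obtained by taking the average squared deviation of points from the mean in feature space:

$$
\begin{aligned}
\sigma_\phi^2 &= \frac{1}{n} \sum_{i=1}^{n} \|\phi(\mathbf{x}_i) - \boldsymbol{\mu}_\phi\|^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( K(\mathbf{x}_i, \mathbf{x}_i) - \frac{2}{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(\mathbf{x}_a, \mathbf{x}_b) \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_i) - \frac{2}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(\mathbf{x}_a, \mathbf{x}_b) \\
&= \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_i) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(\mathbf{x}_i, \mathbf{x}_j)
\end{aligned}
\tag{5.13}
$$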

In other words, the total variance in feature space is given as the difference between the average of the diagonal entries and the average of the entire kernel matrix K.

Also notice that by (5.12) the second term is simply kµφk².

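Example 5.11: Continuing Example 5.10, the total variance in feature space for the five points, for the Gaussian kernel, is given as

$$\sigma_\phi^2 = \left( \frac{1}{n} \sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_i) \right) - \|\boldsymbol{\mu}_\phi\|^2 = \frac{1}{5} \times 5 - 0.599 = 0.401$$

The squared distance of $\phi(\mathbf{x}_1)$ from the mean $\boldsymbol{\mu}_\phi$ in feature space is given as

$$
\begin{aligned}
\|\phi(\mathbf{x}_1) - \boldsymbol{\mu}_\phi\|^2
&= K(\mathbf{x}_1, \mathbf{x}_1) - \frac{2}{5} \sum_{j=1}^{5} K(\mathbf{x}_1, \mathbf{x}_j) + \|\boldsymbol{\mu}_\phi\|^2 \\
&= 1 - \frac{2}{5}(1 + 0.6 + 0.78 + 0.42 + 0.72) + 0.599 \\
&= 1 - 1.410 + 0.599 = 0.189
\end{aligned}
$$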
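The two quantities in Example 5.11 can be reproduced with the sketch below. The helper names are hypothetical, and the kernel matrix is the two-decimal Gaussian matrix from Example 5.10, so the outputs agree with the quoted 0.401 and 0.189 only to about two decimal places.

```python
# Gaussian kernel matrix from Example 5.10 (entries rounded to two decimals).
K = [
    [1.00, 0.60, 0.78, 0.42, 0.72],
    [0.60, 1.00, 0.94, 0.07, 0.44],
    [0.78, 0.94, 1.00, 0.13, 0.65],
    [0.42, 0.07, 0.13, 1.00, 0.23],
    [0.72, 0.44, 0.65, 0.23, 1.00],
]

def mean_sq_norm(K):
    """Average of all kernel entries, i.e. the squared norm of the mean (5.12)."""
    n = len(K)
    return sum(sum(row) for row in K) / (n * n)

def total_variance(K):
    """Average of the diagonal minus the average of the whole matrix (5.13)."""
    n = len(K)
    avg_diag = sum(K[i][i] for i in range(n)) / n
    return avg_diag - mean_sq_norm(K)

def sq_dist_to_mean(K, i):
    """Squared distance of phi(x_i) from the feature-space mean."""
    n = len(K)
    return K[i][i] - 2.0 * sum(K[i]) / n + mean_sq_norm(K)

var_phi = total_variance(K)
d1_sq = sq_dist_to_mean(K, 0)  # index 0 corresponds to x_1
print(round(var_phi, 3), round(d1_sq, 3))
```

Averaging `sq_dist_to_mean(K, i)` over all points gives back `total_variance(K)`, which is exactly the definition used to derive (5.13).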

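Centering in Feature Space. We can center each point in feature space by subtracting the mean from it, as follows:

$$\hat{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \boldsymbol{\mu}_\phi$$

Since we do not have an explicit representation of $\phi(\mathbf{x}_i)$ or $\boldsymbol{\mu}_\phi$, we cannot explicitly center the points. However, we can still compute the centered kernel matrix, that is, the kernel matrix over the centered points.

The centered kernel matrix is given as

$$\hat{\mathbf{K}} = \big\{ \hat{K}(\mathbf{x}_i, \mathbf{x}_j) \big\}_{i,j=1}^{n}$$

where each cell corresponds to the kernel between centered points, that is,

$$
\begin{aligned}
\hat{K}(\mathbf{x}_i, \mathbf{x}_j) &= \hat{\phi}(\mathbf{x}_i)^T \hat{\phi}(\mathbf{x}_j) \\
&= \big(\phi(\mathbf{x}_i) - \boldsymbol{\mu}_\phi\big)^T \big(\phi(\mathbf{x}_j) - \boldsymbol{\mu}_\phi\big) \\
&= \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) - \phi(\mathbf{x}_i)^T \boldsymbol{\mu}_\phi - \phi(\mathbf{x}_j)^T \boldsymbol{\mu}_\phi + \boldsymbol{\mu}_\phi^T \boldsymbol{\mu}_\phi \\
&= K(\mathbf{x}_i, \mathbf{x}_j) - \frac{1}{n} \sum_{k=1}^{n} K(\mathbf{x}_i, \mathbf{x}_k) - \frac{1}{n} \sum_{k=1}^{n} K(\mathbf{x}_j, \mathbf{x}_k) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(\mathbf{x}_a, \mathbf{x}_b)
\end{aligned}
$$

In other words, we can compute the centered kernel matrix using only the kernel function. Over all the pairs of points, the centered kernel matrix can be written compactly as

$$
\hat{\mathbf{K}} = \mathbf{K} - \frac{1}{n} \mathbf{1}_{n \times n} \mathbf{K} - \frac{1}{n} \mathbf{K} \mathbf{1}_{n \times n} + \frac{1}{n^2} \mathbf{1}_{n \times n} \mathbf{K} \mathbf{1}_{n \times n}
= \left( \mathbf{I} - \frac{1}{n} \mathbf{1}_{n \times n} \right) \mathbf{K} \left( \mathbf{I} - \frac{1}{n} \mathbf{1}_{n \times n} \right)
\tag{5.14}
$$

where $\mathbf{1}_{n \times n}$ is the $n \times n$ matrix all of whose entries equal one (a singular matrix).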

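Example 5.12: Consider the first five points from the two-dimensional Iris dataset shown in Figure 5.1a:

$$
\mathbf{x}_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \quad
\mathbf{x}_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix} \quad
\mathbf{x}_3 = \begin{pmatrix} 6.6 \\ 2.9 \end{pmatrix} \quad
\mathbf{x}_4 = \begin{pmatrix} 4.6 \\ 3.2 \end{pmatrix} \quad
\mathbf{x}_5 = \begin{pmatrix} 6 \\ 2.2 \end{pmatrix}
$$

Consider the linear kernel matrix shown in Figure 5.1b. We can center it by first computing

$$
\mathbf{I} - \frac{1}{5} \mathbf{1}_{5 \times 5} = \begin{pmatrix}
0.8 & -0.2 & -0.2 & -0.2 & -0.2 \\
-0.2 & 0.8 & -0.2 & -0.2 & -0.2 \\
-0.2 & -0.2 & 0.8 & -0.2 & -0.2 \\
-0.2 & -0.2 & -0.2 & 0.8 & -0.2 \\
-0.2 & -0.2 & -0.2 & -0.2 & 0.8
\end{pmatrix}
$$

The centered kernel matrix (5.14) is given as

$$
\hat{\mathbf{K}} = \left( \mathbf{I} - \frac{1}{5} \mathbf{1}_{5 \times 5} \right)
\begin{pmatrix}
43.81 & 50.01 & 47.64 & 36.74 & 42.00 \\
50.01 & 57.22 & 54.53 & 41.66 & 48.22 \\
47.64 & 54.53 & 51.97 & 39.64 & 45.98 \\
36.74 & 41.66 & 39.64 & 31.40 & 34.64 \\
42.00 & 48.22 & 45.98 & 34.64 & 40.84
\end{pmatrix}
\left( \mathbf{I} - \frac{1}{5} \mathbf{1}_{5 \times 5} \right)
$$

$$
= \begin{pmatrix}
0.02 & -0.06 & -0.06 & 0.18 & -0.08 \\
-0.06 & 0.86 & 0.54 & -1.19 & -0.15 \\
-0.06 & 0.54 & 0.36 & -0.83 & -0.01 \\
0.18 & -1.19 & -0.83 & 2.06 & -0.22 \\
-0.08 & -0.15 & -0.01 & -0.22 & 0.46
\end{pmatrix}
$$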

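To verify that $\hat{\mathbf{K}}$ is the same as the kernel matrix for the centered points, let us first center the points by subtracting the mean $\boldsymbol{\mu} = (6.0, 2.88)^T$. The centered points in feature space are given as

$$
\mathbf{z}_1 = \begin{pmatrix} -0.1 \\ 0.12 \end{pmatrix} \quad
\mathbf{z}_2 = \begin{pmatrix} 0.9 \\ 0.22 \end{pmatrix} \quad
\mathbf{z}_3 = \begin{pmatrix} 0.6 \\ 0.02 \end{pmatrix} \quad
\mathbf{z}_4 = \begin{pmatrix} -1.4 \\ 0.32 \end{pmatrix} \quad
\mathbf{z}_5 = \begin{pmatrix} 0.0 \\ -0.68 \end{pmatrix}
$$

For example, the kernel between $\phi(\mathbf{z}_1)$ and $\phi(\mathbf{z}_2)$ is

$$\phi(\mathbf{z}_1)^T \phi(\mathbf{z}_2) = \mathbf{z}_1^T \mathbf{z}_2 = -0.09 + 0.03 = -0.06$$

which matches $\hat{K}(\mathbf{x}_1, \mathbf{x}_2)$, as expected. The other entries can be verified in a similar manner. Thus, the kernel matrix obtained by centering the data and then computing the kernel is the same as that obtained via (5.14).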
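The equivalence claimed in Example 5.12 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the five points are assumed to be (5.9, 3), (6.9, 3.1), (6.6, 2.9), (4.6, 3.2), and (6, 2.2) (values consistent with the linear kernel matrix and the mean (6.0, 2.88) given in the example), and the centering uses the entrywise form of (5.14) rather than explicit matrix products.

```python
def linear_kernel(x, y):
    """Linear kernel K(x, y) = x^T y."""
    return sum(a * b for a, b in zip(x, y))

def center_kernel(K):
    """Center a symmetric kernel matrix using only its entries (5.14)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]   # equals column means since K is symmetric
    total_mean = sum(row_mean) / n           # average of all entries
    return [
        [K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]

# Assumed coordinates of the five points in Example 5.12.
points = [(5.9, 3.0), (6.9, 3.1), (6.6, 2.9), (4.6, 3.2), (6.0, 2.2)]
n = len(points)

# Route 1: build the kernel matrix, then center it in feature space.
K = [[linear_kernel(x, y) for y in points] for x in points]
K_hat = center_kernel(K)

# Route 2: center the points explicitly, then build the kernel matrix.
mu = [sum(p[d] for p in points) / n for d in range(2)]
Z = [[p[d] - mu[d] for d in range(2)] for p in points]
K_explicit = [[linear_kernel(a, b) for b in Z] for a in Z]

max_diff = max(
    abs(K_hat[i][j] - K_explicit[i][j]) for i in range(n) for j in range(n)
)
print(max_diff)  # difference between the two routes, up to rounding error
```

Route 2 is only possible here because the linear kernel has an explicit feature map; for kernels such as the Gaussian, only Route 1 is available.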

Normalizing in Feature Space. A common form of normalization is to ensure that points in feature space have unit length by replacing $\phi(\mathbf{x}_i)$ with the corresponding unit vector $\phi_n(\mathbf{x}_i) = \frac{\phi(\mathbf{x}_i)}{\|\phi(\mathbf{x}_i)\|}$. The dot product in feature space then corresponds to the cosine of the angle between the two mapped points, since

$$\phi_n(\mathbf{x}_i)^T \phi_n(\mathbf{x}_j) = \frac{\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)}{\|\phi(\mathbf{x}_i)\| \cdot \|\phi(\mathbf{x}_j)\|} = \cos\theta$$

If the mapped points are both centered and normalized, then a dot product corresponds to the correlation between the two points in feature space.

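The normalized kernel matrix, $\mathbf{K}_n$, can be computed using only the kernel function K, since

$$K_n(\mathbf{x}_i, \mathbf{x}_j) = \frac{\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)}{\|\phi(\mathbf{x}_i)\| \cdot \|\phi(\mathbf{x}_j)\|} = \frac{K(\mathbf{x}_i, \mathbf{x}_j)}{\sqrt{K(\mathbf{x}_i, \mathbf{x}_i) \cdot K(\mathbf{x}_j, \mathbf{x}_j)}}$$

$\mathbf{K}_n$ has all diagonal elements equal to one.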

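Let $\mathbf{W}$ denote the diagonal matrix comprising the diagonal elements of $\mathbf{K}$:

$$
\mathbf{W} = \operatorname{diag}(\mathbf{K}) = \begin{pmatrix}
K(\mathbf{x}_1, \mathbf{x}_1) & 0 & \cdots & 0 \\
0 & K(\mathbf{x}_2, \mathbf{x}_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & K(\mathbf{x}_n, \mathbf{x}_n)
\end{pmatrix}
$$

The normalized kernel matrix can then be expressed compactly as

$$\mathbf{K}_n = \mathbf{W}^{-1/2} \cdot \mathbf{K} \cdot \mathbf{W}^{-1/2}$$

where $\mathbf{W}^{-1/2}$ is the diagonal matrix defined as $\mathbf{W}^{-1/2}(\mathbf{x}_i, \mathbf{x}_i) = \frac{1}{\sqrt{K(\mathbf{x}_i, \mathbf{x}_i)}}$, with all other elements being zero.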

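Example 5.13: Consider the five points and the linear kernel matrix shown in Figure 5.1. We have

$$\mathbf{W} = \operatorname{diag}(43.81,\ 57.22,\ 51.97,\ 31.40,\ 40.84) \qquad \mathbf{W}^{-1/2} = \operatorname{diag}(0.1511,\ 0.1322,\ 0.1387,\ 0.1785,\ 0.1565)$$

The normalized kernel is given as

$$
\mathbf{K}_n = \mathbf{W}^{-1/2} \cdot \mathbf{K} \cdot \mathbf{W}^{-1/2} = \begin{pmatrix}
1.0000 & 0.9988 & 0.9984 & 0.9906 & 0.9929 \\
0.9988 & 1.0000 & 0.9999 & 0.9828 & 0.9975 \\
0.9984 & 0.9999 & 1.0000 & 0.9812 & 0.9980 \\
0.9906 & 0.9828 & 0.9812 & 1.0000 & 0.9673 \\
0.9929 & 0.9975 & 0.9980 & 0.9673 & 1.0000
\end{pmatrix}
$$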

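The same kernel would have been obtained if we had first normalized the feature vectors to have unit length and then taken the dot products. For example, with the linear kernel, the normalized point $\phi_n(\mathbf{x}_1)$ is given as

$$\phi_n(\mathbf{x}_1) = \frac{\phi(\mathbf{x}_1)}{\|\phi(\mathbf{x}_1)\|} = \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|} = \frac{1}{\sqrt{43.81}} \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.8914 \\ 0.4532 \end{pmatrix}$$

Likewise, we have

$$\phi_n(\mathbf{x}_2) = \frac{1}{\sqrt{57.22}} \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix} = \begin{pmatrix} 0.9122 \\ 0.4098 \end{pmatrix}$$

Their dot product is

$$\phi_n(\mathbf{x}_1)^T \phi_n(\mathbf{x}_2) = 0.8914 \cdot 0.9122 + 0.4532 \cdot 0.4098 = 0.9988$$

which matches $K_n(\mathbf{x}_1, \mathbf{x}_2)$.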
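A minimal sketch of the normalization in Example 5.13 follows. The function name is hypothetical, and the input is the linear kernel matrix as printed in Example 5.12. Instead of forming $\mathbf{W}^{-1/2}$ as a matrix, the code scales each entry by the two corresponding diagonal factors, which is equivalent for a diagonal $\mathbf{W}$.

```python
import math

# Linear kernel matrix for the five points (values as printed in Example 5.12).
K = [
    [43.81, 50.01, 47.64, 36.74, 42.00],
    [50.01, 57.22, 54.53, 41.66, 48.22],
    [47.64, 54.53, 51.97, 39.64, 45.98],
    [36.74, 41.66, 39.64, 31.40, 34.64],
    [42.00, 48.22, 45.98, 34.64, 40.84],
]

def normalize_kernel(K):
    """Return K_n with K_n[i][j] = K[i][j] / sqrt(K[i][i] * K[j][j]).

    Assumes every diagonal entry is strictly positive.
    """
    n = len(K)
    scale = [1.0 / math.sqrt(K[i][i]) for i in range(n)]  # diagonal of W^(-1/2)
    return [[scale[i] * K[i][j] * scale[j] for j in range(n)] for i in range(n)]

K_n = normalize_kernel(K)
print(round(K_n[0][1], 4), round(K_n[3][4], 4))
```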

If we start with the centered kernel matrix $\hat{\mathbf{K}}$ from Example 5.12, and then normalize it, we obtain the normalized and centered kernel matrix $\hat{\mathbf{K}}_n$:

$$
\hat{\mathbf{K}}_n = \begin{pmatrix}
1.00 & -0.44 & -0.61 & 0.80 & -0.77 \\
-0.44 & 1.00 & 0.98 & -0.89 & -0.24 \\
-0.61 & 0.98 & 1.00 & -0.97 & -0.03 \\
0.80 & -0.89 & -0.97 & 1.00 & -0.22 \\
-0.77 & -0.24 & -0.03 & -0.22 & 1.00
\end{pmatrix}
$$

As noted earlier, the kernel value $\hat{K}_n(\mathbf{x}_i, \mathbf{x}_j)$ denotes the correlation between $\mathbf{x}_i$ and $\mathbf{x}_j$ in feature space, that is, it is the cosine of the angle between the centered points $\hat{\phi}(\mathbf{x}_i)$ and $\hat{\phi}(\mathbf{x}_j)$.
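Centering followed by normalization can be chained using kernel values alone, as in the sketch below. As before, the function names are hypothetical and the coordinates of the five points are assumed (they reproduce the linear kernel matrix of Example 5.12); unrounded kernel values are used, so the results match the printed matrix to about two decimals.

```python
import math

def linear_kernel(x, y):
    """Linear kernel K(x, y) = x^T y."""
    return sum(a * b for a, b in zip(x, y))

def center_kernel(K):
    """Entrywise form of (5.14) for a symmetric kernel matrix."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]

def normalize_kernel(K):
    """Divide each entry by sqrt(K[i][i] * K[j][j]); needs a positive diagonal."""
    n = len(K)
    scale = [1.0 / math.sqrt(K[i][i]) for i in range(n)]
    return [[scale[i] * K[i][j] * scale[j] for j in range(n)] for i in range(n)]

# Assumed coordinates of the five points in Example 5.12.
points = [(5.9, 3.0), (6.9, 3.1), (6.6, 2.9), (4.6, 3.2), (6.0, 2.2)]
K = [[linear_kernel(x, y) for y in points] for x in points]

# Center first, then normalize: entries are cosines between centered points.
K_hat_n = normalize_kernel(center_kernel(K))
print([round(v, 2) for v in K_hat_n[0]])
```

The order matters: normalizing first and centering afterwards would generally give a different matrix, because centering changes the norms of the points.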

In document Data Mining and Analysis (Page 174-181)