Kernel Principal Component Analysis (Kernel PCA)


4.1.2 Kernel Principal Component Analysis (Kernel PCA)

As can be seen from the SVD formulation, PCA looks for a linear projection of the data, where each principal axis is a linear combination of the original feature dimensions. In some applications, such as image retrieval and bioinformatics, linear projections are too restrictive and it makes sense to consider non-linear projections. One straightforward way of extending linear PCA to non-linear PCA is to first map the original data points into a new feature space $\mathcal{F}$, i.e., define $\phi: \mathbf{x} \in \mathcal{X} \mapsto \phi(\mathbf{x}) \in \mathcal{F}$, and then apply linear PCA to the new data matrix $\phi(\mathbf{X}) := [\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_N)]^\top$ in the new space. One can, for example, define a set of basis functions on the original feature dimensions to obtain the mapping $\phi(\cdot)$, but there is a technique called kernel PCA in which we do not need to know the explicit mapping $\phi(\cdot)$. It is based on a dual formulation of the SVD solution to linear PCA [61].

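Let $\mathbf{K} := \mathbf{X}\mathbf{X}^\top$ be the $N \times N$ matrix with $(i,j)$-th entry $K_{ij} = \mathbf{x}_i^\top \mathbf{x}_j = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$, the inner product of $\mathbf{x}_i$ and $\mathbf{x}_j$ in $\mathcal{X}$. This matrix is sometimes called the Gram matrix. If the SVD of the data matrix $\mathbf{X}$ is $\mathbf{V}\mathbf{D}\mathbf{U}^\top$, then the eigen-decomposition of $\mathbf{K}$ is

$$\mathbf{K} = \mathbf{X}\mathbf{X}^\top = \mathbf{V}\mathbf{D}\mathbf{U}^\top\mathbf{U}\mathbf{D}\mathbf{V}^\top = \mathbf{V}\mathbf{D}^2\mathbf{V}^\top,$$

because $\mathbf{U}$ is an orthonormal matrix satisfying $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$. From this starting point, we can solve PCA in the dual form as follows: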

1. Construct the Gram matrix $\mathbf{K}$ using the inner product in $\mathcal{X}$;

2. Calculate the eigen-decomposition of $\mathbf{K}$ to get the eigenvectors in $\mathbf{V}$ and the square roots of the eigenvalues in $\mathbf{D}$, sorted in descending order;

3. The projections onto the principal subspace for $\mathbf{X}$ are $\mathbf{Z} = \mathbf{V}_K\mathbf{D}_K$, where $\mathbf{V}_K$ contains the first $K$ columns of $\mathbf{V}$, and $\mathbf{D}_K$ is the top-left $K \times K$ submatrix of $\mathbf{D}$.
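The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the author's implementation: the function name `dual_pca`, the random test data, and the assumption that the rows of `X` are already centered are all introduced here for the example.

```python
import numpy as np

def dual_pca(X, n_components):
    """Dual-form PCA sketch: uses only the Gram matrix X X^T.

    Assumes X (N x M, one data point per row) is already centered.
    """
    K = X @ X.T                                    # step 1: N x N Gram matrix
    eigvals, eigvecs = np.linalg.eigh(K)           # step 2: eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    V_k = eigvecs[:, order]                        # first K columns of V
    # D holds square roots of eigenvalues; clip tiny negatives caused by round-off.
    d_k = np.sqrt(np.clip(eigvals[order], 0.0, None))
    Z = V_k * d_k                                  # step 3: Z = V_K D_K (scales columns)
    return Z, V_k, d_k

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                        # hypothetical data: N=6, M=3
X = X - X.mean(axis=0)
Z, V_k, d_k = dual_pca(X, n_components=2)

# Primal reference: project onto the top right-singular vectors, Z = X U_K.
_, _, Ut = np.linalg.svd(X, full_matrices=False)
Z_primal = X @ Ut[:2].T
```

The dual and primal projections should agree up to the sign of each column, since eigenvectors are only determined up to sign.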

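To project a new data point $\mathbf{x}_*$, we can calculate

$$\mathbf{z}_* = \mathbf{U}^\top\mathbf{x}_* = \mathbf{D}^{-1}\mathbf{V}^\top(\mathbf{V}\mathbf{D}\mathbf{U}^\top)\mathbf{x}_* = \mathbf{D}^{-1}\mathbf{V}^\top\mathbf{X}\mathbf{x}_* = \mathbf{D}^{-1}\mathbf{V}^\top\mathbf{k}(\mathbf{X},\mathbf{x}_*),$$

where the vector $\mathbf{k}(\mathbf{X},\mathbf{x}_*) := [\langle\mathbf{x}_1,\mathbf{x}_*\rangle, \ldots, \langle\mathbf{x}_N,\mathbf{x}_*\rangle]^\top$ also depends only on the inner product. We then truncate $\mathbf{z}_*$ at length $K$ to obtain the projection onto the $K$-dimensional principal subspace.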
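The identity for projecting a new point can be checked numerically. The sketch below uses made-up random data and a hypothetical test point `x_new`; centering is omitted because the identity itself is purely algebraic.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))             # hypothetical data: N=5 points, M=3 features
x_new = rng.normal(size=3)              # hypothetical test point x*

# Thin SVD X = V D U^T (NumPy returns U^T as the third factor).
V, d, Ut = np.linalg.svd(X, full_matrices=False)

z_primal = Ut @ x_new                   # U^T x*
k = X @ x_new                           # k(X, x*): inner products with each training point
z_dual = (V.T @ k) / d                  # D^{-1} V^T k(X, x*)

n_components = 2
z_truncated = z_dual[:n_components]     # keep the first K coordinates
```

Only `k`, `V`, and `d` enter the dual expression, which is what later allows the inner product to be replaced by a kernel function.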

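The key observation from this dual formulation is that the whole PCA algorithm can be derived using only the inner product in the feature space $\mathcal{X}$. If we have a mapping $\phi(\cdot)$ which maps $\mathbf{x} \in \mathcal{X}$ into $\phi(\mathbf{x}) \in \mathcal{F}$, and define an inner product $\langle\cdot,\cdot\rangle_{\mathcal{F}}$ in the new space $\mathcal{F}$, then PCA can be performed with the Gram matrix $\mathbf{K}$ defined as $K_{ij} = \langle\phi(\mathbf{x}_i),\phi(\mathbf{x}_j)\rangle_{\mathcal{F}}$. Furthermore, reproducing kernel Hilbert space (RKHS) theory tells us that we can directly define this inner product using a kernel function $\kappa(\cdot,\cdot)$, i.e., $\langle\phi(\mathbf{x}_i),\phi(\mathbf{x}_j)\rangle_{\mathcal{F}} = \kappa(\mathbf{x}_i,\mathbf{x}_j)$, if $\kappa(\cdot,\cdot)$ satisfies the Mercer conditions [62]. Therefore, for non-linear PCA we just need to choose a kernel function (e.g., the popular RBF kernel $\kappa(\mathbf{x}_i,\mathbf{x}_j) = \exp(-\alpha\|\mathbf{x}_i-\mathbf{x}_j\|^2)$), calculate the kernel matrix $\mathbf{K}$ as $K_{ij} = \kappa(\mathbf{x}_i,\mathbf{x}_j)$, and perform the dual-form PCA. Since $\mathbf{K}$ is $N \times N$, we can in principle project the data onto up to $N$ dimensions.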
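As a small illustration of the kernel matrix construction, the following sketch computes an RBF kernel matrix with NumPy. The helper name `rbf_kernel_matrix` and the random input data are assumptions made for this example only.

```python
import numpy as np

def rbf_kernel_matrix(A, B, alpha):
    """Return K with K[i, j] = exp(-alpha * ||A[i] - B[j]||^2) for rows of A and B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Round-off can make squared distances slightly negative; clip them at zero.
    return np.exp(-alpha * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))             # hypothetical data: N=8 points in 2D
K = rbf_kernel_matrix(X, X, alpha=10.0)
eigvals = np.linalg.eigvalsh(K)         # eigenvalues of the symmetric kernel matrix
```

The resulting matrix is symmetric with ones on the diagonal, and its eigenvalues are non-negative up to round-off, as expected for a kernel satisfying the Mercer conditions.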

Since we are now working in the new space $\mathcal{F}$, we need to center all the data points in this space. We can also do this without knowing the explicit mapping $\phi(\cdot)$, by modifying the kernel matrix $\mathbf{K}$ as

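$$\hat{\mathbf{K}} = \mathbf{K} - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top\mathbf{K} - \frac{1}{N}\mathbf{K}\mathbf{1}_N\mathbf{1}_N^\top + \frac{1}{N^2}\mathbf{1}_N\mathbf{1}_N^\top\mathbf{K}\mathbf{1}_N\mathbf{1}_N^\top, \tag{4.3}$$

where $\mathbf{1}_N$ denotes the all-ones column vector $[1, \ldots, 1]^\top$ of length $N$. The kernel vector $\mathbf{k} = \mathbf{k}(\mathbf{X},\mathbf{x}_*)$ for a test point $\mathbf{x}_*$ can be centered similarly by

$$\hat{\mathbf{k}} = \mathbf{k} - \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^\top\mathbf{k} - \frac{1}{N}\mathbf{K}\mathbf{1}_N + \frac{1}{N^2}\mathbf{1}_N\mathbf{1}_N^\top\mathbf{K}\mathbf{1}_N. \tag{4.4}$$

Figure 4.2: Illustration of kernel PCA on a 2D toy data set. The RBF kernel function is used with $\alpha = 10$. The first 8 principal components are shown, with the eigenvalue shown on top of each panel. Blue lines are contours, with white and dark color indicating high and low values, respectively. This toy example for kernel PCA is provided by Bernhard Schölkopf, with the MATLAB code available at http://www.kernel-machines.org/code/kpca_toy.m.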
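The two centering formulas (4.3) and (4.4) can be sanity-checked with the linear kernel, where the mapping is simply the identity and explicit centering is available for comparison. The function names `center_gram` and `center_kernel_vector` and the random data are hypothetical, chosen only for this sketch.

```python
import numpy as np

def center_gram(K):
    """Eq. (4.3): center the training kernel matrix in feature space."""
    N = K.shape[0]
    J = np.full((N, N), 1.0 / N)        # (1/N) 1_N 1_N^T
    return K - J @ K - K @ J + J @ K @ J

def center_kernel_vector(k, K):
    """Eq. (4.4): center k(X, x*) using the uncentered training matrix K."""
    N = K.shape[0]
    ones = np.ones(N)
    # Terms in order: k, (1/N) 1 1^T k, (1/N) K 1, (1/N^2) 1 1^T K 1.
    return k - ones * k.mean() - K @ ones / N + ones * K.mean()

# Linear-kernel check: phi(x) = x, so we can also center the data explicitly.
rng = np.random.default_rng(3)
X = rng.normal(size=(7, 3)) + 5.0       # deliberately uncentered toy data
x_new = rng.normal(size=3) + 5.0        # hypothetical test point
K = X @ X.T
k = X @ x_new

mean = X.mean(axis=0)
Xc = X - mean                           # explicitly centered data
K_hat = center_gram(K)
k_hat = center_kernel_vector(k, K)
```

With the linear kernel, `K_hat` should match the Gram matrix of the explicitly centered data, and `k_hat` should match the inner products between the centered training points and the centered test point.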

Figure 4.2 illustrates the first eight kernel PCA components for a 2D toy data set, generated from three symmetric Gaussian distributions with standard deviation 0.1 and means $(-0.5, -0.2)$, $(0, 0.6)$ and $(0.5, 0)$. Since the projection is non-linear, we show the contours of each principal component on the 2D surface, with white and dark regions indicating high and low values, respectively. The RBF kernel function with $\alpha = 10$ is used here. It can be seen that the data points in different clusters can be detected using the first two components, and more detailed structures are shown in the other components. The data themselves have only two dimensions, but in kernel PCA we can obtain up to $N$ projection dimensions.
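A rough sketch of a comparable experiment is given below. It is not the referenced MATLAB code: the number of points per cluster (`n_per_cluster = 30`) and the random seed are assumptions, since the text does not state them, and no plotting is attempted.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[-0.5, -0.2], [0.0, 0.6], [0.5, 0.0]])
n_per_cluster = 30                      # assumption: sample size is not given in the text
X = np.vstack([m + 0.1 * rng.normal(size=(n_per_cluster, 2)) for m in means])
N = X.shape[0]

# RBF kernel matrix with alpha = 10.
alpha = 10.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-alpha * sq_dists)

# Center in feature space, Eq. (4.3).
J = np.full((N, N), 1.0 / N)
K_hat = K - J @ K - K @ J + J @ K @ J

# Dual-form PCA on the centered kernel matrix; keep the first 8 components.
eigvals, eigvecs = np.linalg.eigh(K_hat)
order = np.argsort(eigvals)[::-1][:8]
top_vals = eigvals[order]
Z = eigvecs[:, order] * np.sqrt(np.clip(top_vals, 0.0, None))
```

Each column of `Z` holds one non-linear component evaluated at the training points. Although the input has only two dimensions, eight components are extracted here, which is possible because the kernel matrix is $N \times N$.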

Remark 4.1.1. Non-linear PCA reduces to linear PCA if we choose the linear kernel function $\kappa(\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^\top\mathbf{x}_j$, i.e., the standard inner product in Euclidean space. In the case of $N > M$, we can still calculate the eigen-decomposition of $\mathbf{K}$, but the $N - M$ smallest eigenvalues will be zero, leaving $M$ effective projection dimensions. In the case of $M > N$, this dual form with the linear kernel function leads to a more efficient solution for linear PCA because we only need to solve an $N \times N$ eigenvalue problem.
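The rank statement in the remark can be illustrated numerically for the case $N > M$. The sizes and random data below are arbitrary choices for this sketch, and the relative threshold used to count non-zero eigenvalues is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 10, 3                            # more data points than feature dimensions
X = rng.normal(size=(N, M))
X = X - X.mean(axis=0)                  # center the data

# Linear kernel: the Gram matrix is X X^T (N x N); the primal problem uses X^T X (M x M).
gram_eigs = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
scatter_eigs = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

# Count eigenvalues that are non-zero relative to the largest one.
n_nonzero = int(np.sum(gram_eigs > 1e-10 * gram_eigs[0]))
```

The $M$ largest eigenvalues of the Gram matrix coincide with those of the $M \times M$ matrix from the primal problem, and the remaining $N - M$ are zero up to round-off. When $M > N$ the roles are reversed, which is why the dual form is the cheaper one in that case.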

In document Yu, Shipeng (2006): Advanced Probabilistic Models for Clustering and Projection. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 79-82)