Introduction: Overview of Kernel Methods

(1)

Statistical Data Analysis with Positive Definite Kernels

Kenji Fukumizu

Institute of Statistical Mathematics, ROIS

Department of Statistical Science, Graduate University for Advanced Studies

(2)

Outline

Basic idea of kernel methods

Linear and nonlinear Data Analysis Essence of kernel methodology

Some examples of kernel methods

Kernel PCA: Nonlinear extension of PCA Ridge regression and its kernelization

(3)

Linear and nonlinear Data Analysis

Essence of kernel methodology

(4)

Nonlinear Data Analysis I

• Classical linear methods

• Data is expressed by a matrix.

X =      X11 X12 · · · X1m X1 2 X22 · · · X2m .. . X1 N XN2 · · · XNm      (mdimensional, N data)

• Linear operations (matrix operations) are used for data analysis. e.g.

- Principal component analysis (PCA)

- Canonical correlation analysis (CCA)

- Linear regression analysis

- Fisher discriminant analysis (FDA)

(5)

Nonlinear Data Analysis II

Are linear methods sufficient? Nonlinear transform can help.

• Example 1: classification -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 0 5 10 15 200 5 10 15 20 -15 -10 -5 0 5 10 15 x1 x₂ z1 z₃ z2

linearly inseparable linearly separable

(x1, x2) 7→ (z1, z2, z3) = (x21, x 2 2,

√ 2x1x2)

(6)

• Example 2: dependence of two data Correlation

ρXY =

Cov[X, Y ]

pVar[X]Var[Y ]=

E[(X − E[X])(Y − E[Y ])]

pE[(X − E[X])2_{]E[(Y − E[Y ])}2_].

• Transforming data to incorporate high-order moments seems

(7)

Linear and nonlinear Data Analysis

Essence of kernel methodology

(8)

Feature space for transforming data

• Kernel methodology = a systematic way of analyzing data by

transforming them into a high-dimensionalfeature space.

Apply linear methods on the feature space.

• Which type of space serves as a feature space?

• The space should incorporate various nonlinear information of the original data.

• The inner product of the feature space is essential for data analysis (seen in the next subsection).

(9)

Computational problem of inner product

• For example, how about this?

(X, Y, Z) 7→ (X, Y, Z, X2, Y2, Z2, XY, Y Z, ZX, . . .).

• But, for high-dimensional data, the above expansion makes the

feature space very huge!

e.g. If X is 100 dimensional and the moments up to the third order are used, the dimensionality of feature space is

100C1+100C2+100C3= 166750.

• This causes a serious computational problem in working on the

inner product of the feature space.

(10)

Inner product by positive definite kernel

• A positive definite kernel gives efficient computation of the inner

product:

With special choice of the feature space, we have a function

k(x, y)such that

hΦ(Xi), Φ(Xj)i = k(Xi, Xj), positive definite kernel where

X 3 x 7→ Φ(x) ∈ H (feature space).

• Many linear methods use only the inner product without

(11)

Kernel PCA: Nonlinear extension of PCA

(12)

Review of PCA I

X1, . . . , XN : m-dimensional data.

Principal Component Analysis (PCA)

• Find d-directions to maximize the variance.

• Purpose: represent the structure of the data in a low dimensional

(13)

Review of PCA II

The first principal direction:

u1= arg max kuk=1 1 N PN i=1u T_(X i−_N1 P N j=1Xj) 2 = arg max kuk=1u T_{V u,}

where V is the variance-covariance matrix:

V = 1 N PN i=1(Xi−_N1 P N j=1Xj)(Xi−_N1 P N j=1Xj)T.

- Eigenvectors u1, . . . , umof V (in descending order).

- The p-th principal axis = up.

- The p-th principal component of Xi = uTpXi

Observation: PCA can be done if we can compute the inner product

• covariance matrix V,

(14)

Kernel PCA I

X1, . . . , XN : m-dimensional data.

Transform the data by a feature map Φ into a feature space H: X1, . . . , XN 7→ Φ(X1), . . . , Φ(XN)

Assume that the feature space has theinner producth , i.

Apply PCA to the transformed data:

• Maximize the variance of the projection onto the unit vector f .

max

kf k=1Var[hf, Φ(X)i] = maxkf k=1 1 N PN i=1 hf, Φ(Xi)−_N1P N j=1Φ(Xj)i 2

• Note: it suffices to use f =Pn

i=1aiΦ(X˜ i), where

˜

Φ(Xi) = Φ(Xi) −_N1P

N

j=1Φ(Xj).

The direction orthogonal to Span{ ˜Φ(X1), . . . , ˜Φ(XN)}does not

(15)

Kernel PCA II

• The PCA solution:

max aTK˜2a subject to aTKa = 1,˜

where ˜Kis N × N matrix with ˜Kij = h ˜Φ(Xi), ˜Φ(Xj)i.

Note: 1 N PN i=1hf, ˜Φ(Xi)i2= _N1P N i=1h PN j=1ajΦ(X˜ j), ˜Φ(Xi)i2=_N1aTK˜2a, kf k2_{= h}Pn i=1aiΦ(X˜ i), Pn i=1aiΦ(X˜ i)i = a T_Ka._˜

• The first principal component of the data Xiis

h ˜Φ(Xi), ˆf i =P N i=1

p λ1u1i,

(16)

Kernel PCA III

Observation:

• PCA in the feature space can be done if we can compute

h ˜Φ(Xi), ˜Φ(Xj)ior

hΦ(Xi), Φ(Xj)i = k(Xi, Xj).

• The principal direction is obtained in the form f =P

iaiΦ(X˜ i),

i.e., in the linear hull of the data. Note: ˜ Kij= h ˜Φ(Xi), ˜Φ(Xj)i = hΦ(Xi), Φ(Xj)i −_N1PNb=1hΦ(Xi), Φ(Xb)i − 1 N PN a=1hΦ(Xa), Φ(Xj)i + 1 N2 PN a=1hΦ(Xa), Φ(Xb)i = k(Xi, Xj) −_N1PNb=1k(Xi, Xb) −_N1PNa=1k(Xa, Xj) +_N12 PN a=1k(Xa, Xb)

(17)

Kernel PCA: Nonlinear extension of PCA

(18)

Review: Linear Regression I

Linear regression

• Data: (X1, Y1), . . . , (XN, YN): data

• Xi: explanatory variable, covariate (m-dimensional) • Yi: response variable, (1 dimensional)

• Regression model: find the best linear relation

(19)

Review: Linear Regression II

• Least square method:

min a PN i=1kYi− aTXik2. • Matrix expression X =      X1 1 X12 · · · X1m X1 2 X22 · · · X2m .. . X1 N XN2 · · · XNm      , Y =      Y1 Y2 .. . YN      . • Solution: ba = (X T_X)−1_XT_Y b y =_baTx = YTX(XTX)−1x.

(20)

Ridge Regression

Ridge regression:

• Find a linear relation by

min a PN i=1kYi− aTXik2+ λkak2. λ: regularization coefficient. • Solution b a = (XTX + λIN)−1XTY For a general x, b y(x) =_baTx = YTX(XTX + λIN)−1x.

• Ridge regression is useful when (XT_X)−1 _{does not exist, or}

(21)

Kernelization of Ridge Regression I

(X1, Y1) . . . , (XN, YN)(Yi: 1-dimensional)

Transform Xiby a feature map Φ into a feature space H:

X1, . . . , XN 7→ Φ(X1), . . . , Φ(XN)

Assume that the feature space has theinner producth , i.

Apply ridge regression to the transformed data:

• Find the vector f such that

min f ∈H PN i=1|Yi− hf, Φ(Xi)iH| 2_{+ λkf k}2 H.

• Similarly to kernel PCA, we can assume f =Pn

j=1cjΦ(Xj). min c PN i=1|Yi− hP N j=1cjΦ(Xj), Φ(Xi)iH|2+ λkP N j=1cjΦ(Xj)k2H

(22)

Kernelization of Ridge Regression II

• Solution: b c = (K+λIN)−1Y, where Kij = hΦ(Xi), Φ(Xj)iH = k(Xi, Xj). For a general x, b

y(x) = h bf , Φ(x)iH = hPjbcjΦ(Xj), Φ(x)iH

= YT(K + λIN)−1k, where k =    hΦ(X1), Φ(x)i .. . hΦ(XN), Φ(x)i   =    k(X1, x) .. . k(XN, x)   .

(23)

Kernelization of Ridge Regression III

Proof.

Matrix expression gives

PN i=1 |Yi− hP N j=1cjΦ(Xj), Φ(Xi)iH|2+ λkPNj=1cjΦ(Xj)k2H = (Y − Kc)T_{(Y − Kc) + λc}T_Kc = cT(K2+ λK)c − 2YTKc + YTY.

It follows that the optimal c is given by b

c = (K + λIN)−1Y.

Inserting this to_by(x) = hP

(24)

Kernelization of Ridge Regression IV

Observation:

• Ridge regression in the feature space can be done if we can

compute the inner product

hΦ(Xi), Φ(Xj)i = k(Xi, Xj).

• The resulting coefficient is of the form f =P

iciΦ(Xi), i.e., in the

linear hull of the data.

The orthogonal directions do not contribute to the objective function.

(25)

Kernel methodology

• A feature space H with inner product h , i.

• Mapping of the data into a feature space:

X1, . . . , XN 7→ Φ(X1), . . . , Φ(XN) ∈ H.

• If the computation of the inner product hΦ(Xi), Φ(Xi)iis

tractable, various linear methods can be extended to the feature space.

• Give Methods of nonlinear data analysis.

How can we prepare such a feature space? ⇒Positive definite kernel!