Reproducing kernel Hilbert spaces and regularization

(1)

Reproducing kernel Hilbert spaces and regularization

Nuno Vasconcelos

ECE Department, UCSD

(2)

Classification

a classification problem has two types of variables

• X - vector of observations (features) in the world

• Y - state (class) of the world

Perceptron: classifier implements the linear decision rule with

appropriate when the classes are linearly separable

to deal with non-linear separability we introduce a kernel

[ ] ^g(x)

sgn )

( x =

h g ( x ) = w

^T

x + b

_w

w b

x w

x g( )

(3)

Types of kernels

these three are equivalent

dot product kernel

positive definite kernel

Mercer kernel

⇔⇔

(4)

Dot-product kernels

Definition: a mapping

k: X x X → ℜ

(x,y) → ^k(x,y) is a dot-product kernel if and only if

k(x,y) = < Φ ^(x), Φ ^(y)>

where Φ ^: X → H,

H is a vector space,

<.,.> is a dot-product in H

xx x

x x

x x x x x

x x

oo o

o o o o o

o o

x2 x

x x

xx x x x

x x

oo o o

o o oo

o o

o x1 o

xn

Φ

X H

(5)

positive definite and Mercer kernels

Definition: k(x,y) is a positive definite kernel on X xX if

∀ l and ∀ {x

₁

, ..., x

_l

}, x

_i

∈ _{X, the} Gram matrix

is positive definite.

Definition: a symmetric mapping k: X x X → R such that

is a Mercer kernel

[ ] ^K

_ij

⁼ ^k ⁽ ^x

_i

^, ^x

_j

⁾

∞

<

∀

≥

∫

∫∫

dx x

f x

f

dxdy y

f x f y x k

)

2

( )

(

, 0 )

( ) ( ) , (

s.t.

(6)

Two different pictures

different definitions lead to different interpretations of what the kernel does

Reproducing kernel map

H

_K

= ( )

⎭⎬

⎫

⎩⎨

⎧ =

∑

= m

i ik xi

f f

1

., (.)

|

(.) α

( )

∑∑

= =

= ^m

i i j

m

j i j

k x x

g f

1 '

1

* , '

,

α β

) (., : x → k x Φ

Mercer kernel map

H

_M

=

⎭⎬

⎫

⎩⎨

⎧ < ∞

=

∑

= d

i i

d x x

l

1 2

2 |

g f g

f

= ^T

, *

( ^x ^x )

^T

x ( ), ( ), _L

: → λ

₁

φ

₁

λ

₂

φ

₂

Φ

where λ

_i

, φ

_i

are the eigenvalues and eigenfunctions of k(x,y)

λ

_i

> 0

(7)

The dot-product picture

when we use the Gaussian kernel

• the point x

_i

∈

X is mapped into the Gaussian G(x,x_i,

σ

I)

• H is the space of all functions that are linear combinations of Gaussians

• the kernel is a dot product in

H, and a non-linear similarity on X

• reproducing property on H: analogy to linear systems

σ

2

) ,

(

xi

x

i

e

x x K

− −

=

x x x

x

x x x x x

x x

o o

o

o o

o x 1 o

x x x

x

x x x x

x x

o o x 2

Φ

X * H

_K

*

(8)

The Mercer picture

this is a mapping from R

ⁿ

to R

^d

much more like to a multi-layer Perceptron than before

the kernelized Perceptron as a neural net

x x x

x

x x x x x

x x

o o

o

o o

o

o o

o x 1 o

x 3 x2 x d

x x x

x

x x x x x

x x

o o o

o

o o o

o

o o

x 1 x 2

Φ

X Φ

(x_i)

l

₂^d

x_i

x

Φ

₁

(.) Φ

₂

(.) Φ

_d

(.)

(9)

The reproducing property

with this definition of H

_K

and <.,.>

_*

∀ f ∈ _H

_K

, <k(.,x),f(.)>

_*

= f(x) this is called the reproducing property

an analogy is to think of linear time-invariant systems

• the dot product as a convolution

• k(.,x) as the Dirac delta

• f(.) as a system input

• the equation above is the basis of all linear time invariant systems theory

leads to reproducing Kernel Hilbert Spaces

(10)

reproducing kernel Hilbert spaces

Definition: a Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences)

Definition: Let H be a Hilbert space of functions

f: X → R . H is a RKHS endowed with dot-product <.,.>

_*

if

∃ ^k: X x X → R such that

1.k spans H, i.e., ∃ {x

_i

}, { α

_i

}, such that H =

2.<f(.),k(.,x)>

_*

= f(x), ∀ f ∈ _H,

{ }

⎭⎬

⎫

⎩⎨

⎧ =

=

∑

i i

i f f k x

x k

span (., ) (.)| (.) α (., )

(11)

Mercer kernels

how different are the spaces H

_K

and H

_M

?

Theorem: Let k: X x X → R be a Mercer kernel. Then, there exists an orthonormal set of functions

and a set of λ

_i

≥ 0, such that

∑

∑ ∫∫

∞

=

∞

=

1 2 1

2

) ( ) ( )

, ( )

) , ( )

i i i i

i i

y x

y x k ii

dxdy y

x k

i

φ φ

λ λ

ij j

i

x φ x dx δ

φ =

∫ ⁽ ⁾ ⁽ ⁾

(**)

(12)

RK vs Mercer maps

note that for H

_M

we are writing

but, since the φ

_i

(.) are orthonormal, there is a 1-1 map

and we can write

hence k(.,x) maps x into M = span{ φ

_k

(.)}

e

d

x e

x

x ) ( ) ^r _L ( ) ^r ( = λ

₁

φ

₁ ₁

+ + λ

₁

φ

₁

Φ

{ }

(.)

(.) :

₂

k k k

d k

e

span l

φ λ

φ

→

Γ r

( )

) (.,

(.) )

( (.)

) ( )

(

₁ ₁ ₁

x k

x x

x

_d _d _d

=

+ +

= Φ

Γ _o λ φ φ _L λ φ φ

from (**)

(***)

(13)

The Mercer picture

we have

x x x

x

x x x x x

x x

o o o

o

o o o

o

o o

x 1 x 2

Φ

X

x

x x

x

x x x x x

x x

o o

o

o o

o

o o

o e1 o

e3 e2 ed

l

₂^d

Φ

(x_i)

x_i

x x x

x

x x x x x

x x

o o

o

o o

o

o o

o φ1 o

M

Γ

Τ

^ο

Φ

(x_i)=k(.,x_i)

(14)

RK vs Mercer maps

define the dot product in M so that

then { φ

_k

(.)} is a basis, M is a vector space, any function in M can be written as

and

i.e., k is a reproducing kernel on M

ij j k

j j

k

j

x x dx δ

φ λ λ φ

φ

φ 1 ( ) ( ) ¹

,

**

⁼ ∫ ⁼

( ) x x

f

k k k

∑

= α φ

) (

) ( )

( ,

) (

) ( (.) ,

(.) )

(., (.),

1

*

x f x

x

x x

k f

i

i i ij

j i j

j i

j

j j

j i

i i

ij j

=

∑

φ α φ

φ φ

λ α

φ φ

λ φ

α

λ δ

4 3

42

1

(15)

RK vs Mercer maps

furthermore, since k(.,x) ∈ _M, any functions of the form

are in M and

( ) ( )

_j

j j

i i

k x

i

g k x

f (.) ⁼ ∑ ^α ., (.) ⁼ ∑ ^β .,

) , ( )

( ) (

(.) (.),

) ( ) (

) ( (.) ),

( (.)

) (., ,

*

j i j

i j

l i l l j

i

ij l i m j l m

lm l m

j i

ij m m j

m m

i l l l l

j i

j j j

i i i

x x k x

x

x x

x k

g f

∑

∑ ∑

∑ ∑ ∑

∑

=

β α φ

φ λ β

α

φ φ

λ λ β

α

φ φ

λ φ

φ λ β

α

β α

(16)

The Mercer picture

furthermore, note that for f in H

_M

and since

the dot product on H

_M

is

( ) x x

f

k k k

∑

= α φ

) (

ij j k

j j

k

j

x x dx δ

φ λ λ φ

φ

φ 1 ( ) ( ) ¹

,

**

⁼ ∫ ⁼

∑

=

k k

kl k l k l

g f

λ β α

φ φ β α

*

,

(17)

The Mercer picture

we have

x x x

x

x x x x x

x x

o o o

o

o o o

o

o o

x 1 x 2

Φ

X

x

x x

x

x x x x x

x x

o o

o

o o

o

o o

o e1 o

e3 e2 ed

l

₂^d

Φ

(x_i)

x_i

x x x

x

x x x x x

x x

o o

o

o o

o

o o

o φ1 o

M

^⊂

H

_K

Γ

Τ

^ο

Φ

(x_i)=k(.,x_i)

*

) , (

, _*_* _j

ij i jk xi x

g

f ⁼

∑

^α ^β

(18)

RK vs Mercer maps

H

_K

⊂ M and <.,.>

_*

in H

_K

is the same as <.,.>

_**

in M Question: is M ⊂ H

_K

?

• need to show that any H

_K

• from (***),

and, for any sequence {x

₁

, ..., x

_d

},

• if there is an invertible P, then H

_K

and M ⊂ H

_K

( ) ∈

= ∑ ^x

x f

k

α

k

φ

k

) (

⎥ ⎥

⎥

⎦

⎤

⎢ ⎢

⎢

⎣

⎡

⎥ ⎥

⎥

⎦

⎤

⎢ ⎢

⎢

⎣

⎡

=

⎥ ⎥

⎥

⎦

⎤

⎢ ⎢

⎢

⎣

⎡

(.) (.)

) (

) ( )

(

) (.,

)

(.,

₁

1 1

1 1 1

d P

d d

d

x x

x k

φ φ φ

λ φ

λ

φ λ φ

λ

M 4

4 4 4

4 3

4 4 4 4

4 2

1 L M

M

(.) )

( (.)

) ( )

(., x

₁ ₁

x

₁ _d _d

x

_d

k = λ φ φ + L + λ φ φ

( ) ^∈

= ∑

^k

k k

k

x α k .,x

φ ( )

(19)

RK vs Mercer maps

since λ

_i

> 0

is invertible when Π ^{is. If} Π is not invertible, then

if there is no sequence for which Π is invertible then

⎥ ⎥

⎥

⎦

⎤

⎢ ⎢

⎢

⎣

⎡

⎥ ⎥

⎥

⎦

⎤

⎢ ⎢

⎢

⎣

⎡

=

Π

d i

d d

d

x x

P

λ λ

λ φ

φ

φ φ

0

0 ) (

) (

) ( )

(

₁

1

1 1

1

M 4 4 4

4 3

4 4 4

4 2

1 L M

( ) ⁰

0 =

≠

∃

∀ ∑

ⁱ

k k k

x

i α s.t. α φ

( ) ⁰

0 =

≠

∃

∀ ^x ∑ ^x

k

α

k

φ

k

α s.t.

(20)

The Mercer picture

we have

x x x

x

x x x x x

x x

o o o

o

o o o

o

o o

x 1 x 2

Φ

_Μ

X

x

x x

x

x x x x x

x x

o o

o

o o

o

o o

o e1 o

e3 e2 ed

H

_M

=l

₂^d

Φ

(x_i)

x_i

x x x

x

x x x x x

x x

o o

o

o o

o

o o

o φ1 o

φ3 φ2 φd

M=H

_K

Γ

Τ

^ο

Φ

(x_i)=k(.,x_i)

*

∑

=

k k

k

g k

f λ

β α

*

, *

) , (

, _*_* _j

ij i jk xi x

g

f ⁼

∑

^α ^β

(21)

In summary

H

_K

= M and <.,.>

_*

in H

_K

is the same as <.,.>

_**

in M

the reproducing kernel map and the Mercer kernel map lead to the same RKHS, Mercer gives us an o.n. basis

Reproducing kernel map

H

_K

= ( )

⎭⎬

⎫

⎩⎨

⎧ =

∑

= m

i ik xi

f f

1

., (.)

|

(.) α

( )

∑∑

= =

= ^m

i i j

m

j i j

k x x

g f

1 '

1

* , '

,

α β

) (., : x k x

r

→

Φ

Mercer kernel map

H

_M

=

⎭⎬

⎫

⎩⎨

⎧ < ∞

=

∑

= d

i i

d x x

l

1 2

2 |

g f g

f

= ^T

, *

( )

^T

M

: x → λ

₁

φ

₁

( x ), λ

₂

φ

₂

( x ), _L Φ

{ }

(.)

(.) :

₂

k k k

d k

e

span l

φ λ

φ

→

Γ r

r

M

= Φ

Γ o Φ

(22)

Regularization

Q: RKHS: why do we care?

A: regularization

• we want to do well outside of the training set

• minimizing empirical risk is not enough

• need to penalize complexity

• example: regression

• give me n points, I will give you a model of zero error (polynomial of order n-1)

• incredibly “wiggly”

• poor generalization

(23)

Regularization

we need to “regularize” the risk

• f is the function we are trying to learn

• λ > 0 is the regularization parameter

• Ω is a complexity penalizing functional

• larger when f is more “wiggly”

the regularized risk can be justified in various ways

• constrained minimization

• Bayesian inference

[ ] ^f ^R [ ] ^f [ ] ^f

R

_reg

=

_emp

+ λ Ω

(24)

Constrained minimization

regularized risk as the solution to the problem

we will talk a lot more about this

• solution is the solution of the Lagrangian problem

• λ is a Lagrange multiplier

• changing λ is equivalent to changing t (the constraint on the complexity of the optimal solution)

[ ] ^f [ ] ^f ^t

R

f

_emp

f

≤ Ω

= arg min subject to

*

[ ] [ ]

{ ^R ^f ^f }

f

_emp

f

Ω +

= arg min λ

*

(25)

Bayesian inference

in Bayesian inference we have

• likelihood function P

_X|_θ

(x| θ )

• prior P

_θ

( θ )

• and search for the maximum a posteriori (MAP) estimate

{ }

[ ]

{ ^log ⁽ ^| ⁾ ^log ⁽ ⁾ }

max arg

) ( )

| ( max

arg

) (

) ( )

| max (

arg

)

| ( max

arg

*

|

θ θ

θ θ θ

θ θ

θ θ θ

P x

P

P x

P

x P

P x

P

x P

X X

X

+

=

(26)

Bayesian inference

if P

_X|_θ

(x|f) = e

^-Remp[f]

and P

_θ

(f) = e

^-^λΩ^[f]

• the MAP estimate is the minimum of the regularized risk

• the prior P

_θ

(f) assigns low probability to function f with large Ω [f]

• this reflects our prior belief that it is “unlikely” that the solution will be very complex

example:

• if Ω [f] = || f-f

₀

||

²

, the prior is a Gaussian centered at f

₀

with variance 1/ λ

^1/2

• we believe that the most likely solution is f = f

₀

• the larger the λ , the more we penalize solutions which are different from this

[ ] [ ]

{ ^R ^f ^f }

f

_emp

f

Ω +

= arg min λ

*

(27)

Structural risk minimization

• start from a nested collection of families of functions

where S

_i

= {h

_i

(x, α ), for all α }

• for each S

_i

, find the function (set of parameters) that minimizes the empirical risk

• select the function class such that

• Φ (h) is a function of VC dimension (complexity) of the family S

_i

• VC have shown that this is equivalent to minimizing a bound on the risk, and provides generalization guarantees

S

k

S

1

⊂ L ⊂

∑

=

ⁿ

k k i k

empi

L y h x

R n

1

)]

, ( , 1 [

min α

α

{

_empⁱ

( )

_i

}

i

R h

R

^*

= min + Φ

(28)

The regularizer

what is a good regularizer?

intuition: “wigglier” functions have a larger norm than smoother functions

for example, in H

_K

we have

∑

∑ ∑

∑

=

⎥ ⎦

⎢ ⎤

⎣

= ⎡

=

l

l l l

l i

i l i l

i

i l l

l l i

i

i i

x c

x x

x k

x f

) (

) ( )

(

) ( ) (

) (., )

(

φ

φ φ

α λ

φ φ

λ α

α

(29)

The regularizer

and

with hence,

• || f ||

²

grows with the number of c

_l

different than zero

• this is the case in which f is more complex, since it becomes a sum of more basis function φ

_l

(x)

• identical to what happens in Fourier type decompositions

• more coefficients means more “high-frequencies” or “less smoothness”

∑

∑ ⁼ ⁼

=

l l

l

lk l

lk k l lk

k l

c c c x

x c

c x

f λ λ

φ δ

φ

²

*

2

( ), ( )

) (

∑ ( )

=

i

i l i l

l

x

c λ α φ

(30)

Regularization

OK, regularization is a good idea from multiple points of view

problem: minimizing the regularized risk

over the set of all functions seems like a nightmare it turns out that it is not, under some reasonable conditions on the regularizer Ω

the key is the “representer theorem”

[ ] ^f ^R [ ] ^f [ ] ^f

R

_reg

=

_emp

+ λ Ω

(31)

Representer theorem

Theorem: Let

• Ω :[0, ∞ ) → ℜ be a strictly monotonically increasing function,

• H the RKHS associated with a kernel k(x,y)

• L[y,f(x)] a loss function

then, if

f admits a representation of the form*

[ ( ) ] ( ) _⎥

⎦

⎢ ⎤

⎣

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

∑ ( )

=

ⁿ _i

k x

_i

f

^*

α ., (i.e. f* ∈ _{H )}

(32)

Proof

• decompose any f into the part contained in the span of the k(.,xi) and the part in the orthogonal complement

where f ∈ w

₀

and f ∈ w

_⊥

with

• then

( ) ^., ⁽ ⁾

) ( )

( )

(

1

0

x f x k x f x

f x

f

ⁿ

i i i ⊥

⊥

=

+

= ∑ ^α

{ }

{

₀

}

0

, 0 ,

|

) (.,

|

w f

f g g

w

x k

span f

f

w

_i

∈

∀

=

∈

=

⊥

[ ] ^{( )}

( ) ( ) _⎟ ^⎟

⎠

⎞

⎜ ⎜

⎝ Ω ⎛

⎟ ≥

⎟

⎠

⎞

⎜ ⎜

⎝

⎛ +

Ω

=

⎟ ⎟

⎠

⎞

⎜ ⎜

⎝

⎛ +

Ω

= Ω

∑

=

⊥

=

⊥

=

2

1 2

2

1

2

1 2

., .,

) ( .,

n

i i i

n

i i i

n

i i i

x k f

x k

x f x

k f

α α

α

( Ω mon. increasing)

(33)

Proof

• this shows that the second term in

is minimized by a function of the stated form.

• the first, using the reproducing property

is always a function of the stated form.

• hence, the minimum must be of this form as well

[ ( ) ] ( ) _⎥

⎦

⎢ ⎤

⎣

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

( )

∑

=

⊥

=

+

=

n

j j i j

i i

x x k

x k

f x

k f

x k

f x

f

1

0 0

) ,

(

) (., (.),

α

4 4 8 4

4 7

6

(34)

The picture

due to reproducing property:

• f(x

_i

) ∈ _H

if f is the solution in* H

• ||f’|| > ||f||*

• is a more complex function

• which does not reduce the loss

hence, f* is the optimal solution

o o

o o o

o

o o

f(x_i)

H

e1

e3 ed

f*

f’

(35)

Relevance

the remarkable consequence of the theorem is that:

• we can reduce the minimization over the (infinite dimensional) space of functions

Reproducing kernel Hilbert spaces and regularization