• No results found

Reproducing kernel Hilbert spaces and regularization

N/A
N/A
Protected

Academic year: 2021

Share "Reproducing kernel Hilbert spaces and regularization"

Copied!
38
0
0

Loading.... (view fulltext now)

Full text

(1)

Reproducing kernel Hilbert spaces and regularization

Nuno Vasconcelos

ECE Department, UCSD

(2)

Classification

a classification problem has two types of variables

X - vector of observations (features) in the world

Y - state (class) of the world

Perceptron: classifier implements the linear decision rule with

appropriate when the classes are linearly separable

to deal with non-linear separability we introduce a kernel

[ ] g(x)

sgn )

( x =

h g ( x ) = w

T

x + b

w

w b

x w

x g( )

(3)

Types of kernels

these three are equivalent

dot product kernel

positive definite kernel

Mercer kernel

⇔⇔

(4)

Dot-product kernels

Definition: a mapping

k: X x X → ℜ

(x,y) k(x,y) is a dot-product kernel if and only if

k(x,y) = < Φ (x), Φ (y)>

where Φ : XH,

H is a vector space,

<.,.> is a dot-product in H

xx x

x x

x x x x x

x x

oo o

o o o o o

o o

o o

x2 x

x x

x x

xx x x x

x x

oo o o

o o oo

o o

o x1 o

xn

Φ

X H

(5)

positive definite and Mercer kernels

Definition: k(x,y) is a positive definite kernel on X xX if

l and {x

1

, ..., x

l

}, x

i

X, the Gram matrix

is positive definite.

Definition: a symmetric mapping k: X x XR such that

is a Mercer kernel

[ ] K

ij

= k ( x

i

, x

j

)

<

∫∫

dx x

f x

f

dxdy y

f x f y x k

)

2

( )

(

, 0 )

( ) ( ) , (

s.t.

(6)

Two different pictures

different definitions lead to different interpretations of what the kernel does

Reproducing kernel map

H

K

= ( )

⎭⎬

⎩⎨

⎧ =

= m

i ik xi

f f

1

., (.)

|

(.) α

( )

∑∑

= =

= m

i i j

m

j i j

k x x

g f

1 '

1

* , '

,

α β

) (., : x → k x Φ

Mercer kernel map

H

M

=

⎭⎬

⎩⎨

⎧ < ∞

=

= d

i i

d x x

l

1 2

2 |

g f g

f

= T

, *

( x x )

T

x ( ), ( ), L

: → λ

1

φ

1

λ

2

φ

2

Φ

where λ

i

, φ

i

are the eigenvalues and eigenfunctions of k(x,y)

λ

i

> 0

(7)

The dot-product picture

when we use the Gaussian kernel

• the point x

i

X is mapped into the Gaussian G(x,xi,

σ

I)

H is the space of all functions that are linear combinations of Gaussians

• the kernel is a dot product in

H, and a non-linear similarity on X

• reproducing property on H: analogy to linear systems

σ

2

) ,

(

xi

x

i

e

x x K

=

x x x

x

x

x x x x x

x x

o o

o

o

o o

o x 1 o

x x x

x

x x x x

x x

o o x 2

Φ

X * H

K

*

(8)

The Mercer picture

this is a mapping from R

n

to R

d

much more like to a multi-layer Perceptron than before

the kernelized Perceptron as a neural net

x x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o x 1 o

x 3 x2 x d

x x x

x

x

x x x x x

x x

o o o

o

o o o

o

o o

o o

x 1 x 2

Φ

X Φ

(xi)

l

2d

xi

x

Φ

1

(.) Φ

2

(.) Φ

d

(.)

(9)

The reproducing property

with this definition of H

K

and <.,.>

*

f H

K

, <k(.,x),f(.)>

*

= f(x) this is called the reproducing property

an analogy is to think of linear time-invariant systems

• the dot product as a convolution

• k(.,x) as the Dirac delta

• f(.) as a system input

• the equation above is the basis of all linear time invariant systems theory

leads to reproducing Kernel Hilbert Spaces

(10)

reproducing kernel Hilbert spaces

Definition: a Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences)

Definition: Let H be a Hilbert space of functions

f: XR . H is a RKHS endowed with dot-product <.,.>

*

if

k: X x XR such that

1.k spans H, i.e., {x

i

}, { α

i

}, such that H =

2.<f(.),k(.,x)>

*

= f(x), f H,

{ }

⎭⎬

⎩⎨

⎧ =

=

i i

i f f k x

x k

span (., ) (.)| (.) α (., )

(11)

Mercer kernels

how different are the spaces H

K

and H

M

?

Theorem: Let k: X x XR be a Mercer kernel. Then, there exists an orthonormal set of functions

and a set of λ

i

0, such that

∑ ∫∫

=

=

=

=

1 2 1

2

) ( ) ( )

, ( )

) , ( )

i i i i

i i

y x

y x k ii

dxdy y

x k

i

φ φ

λ λ

ij j

i

x φ x dx δ

φ =

( ) ( )

(**)

(12)

RK vs Mercer maps

note that for H

M

we are writing

but, since the φ

i

(.) are orthonormal, there is a 1-1 map

and we can write

hence k(.,x) maps x into M = span{ φ

k

(.)}

e

d

x e

x

x ) ( ) r L ( ) r ( = λ

1

φ

1 1

+ + λ

1

φ

1

Φ

{ }

(.)

(.) :

2

k k k

d k

e

span l

φ λ

φ

Γ r

( )

) (.,

(.) )

( (.)

) ( )

(

1 1 1

x k

x x

x

d d d

=

+ +

= Φ

Γ o λ φ φ L λ φ φ

from (**)

(***)

(13)

The Mercer picture

we have

x x x

x

x

x x x x x

x x

o o o

o

o o o

o

o o

o o

x 1 x 2

Φ

X

x

x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o e1 o

e3 e2 ed

l

2d

Φ

(xi)

xi

x x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o φ1 o

M

Γ

Τ

ο

Φ

(xi)=k(.,xi)

(14)

RK vs Mercer maps

define the dot product in M so that

then { φ

k

(.)} is a basis, M is a vector space, any function in M can be written as

and

i.e., k is a reproducing kernel on M

ij j k

j j

k

j

x x dx δ

φ λ λ φ

φ

φ 1 ( ) ( ) 1

,

**

==

( ) x x

f

k k k

= α φ

) (

) ( )

( ,

) (

) ( (.) ,

(.) )

(., (.),

1

*

*

*

*

*

*

x f x

x

x x

k f

i

i i ij

j i j

j i

j

j j

j i

i i

ij j

=

=

=

=

φ α φ

φ φ

λ α

φ φ

λ φ

α

λ δ

4 3

42

1

(15)

RK vs Mercer maps

furthermore, since k(.,x)M, any functions of the form

are in M and

( ) ( )

j

j j

i i

k x

i

g k x

f (.) =α ., (.) =β .,

) , ( )

( ) (

(.) (.),

) ( ) (

) ( (.) ),

( (.)

) (., ,

) (., ,

*

*

*

*

*

*

*

*

j i j

i j

l i l l j

i

ij l i m j l m

lm l m

j i

ij m m j

m m

i l l l l

j i

j j j

i i i

x x k x

x

x x

x x

x k

x k

g f

∑ ∑

∑ ∑

∑ ∑ ∑

=

=

=

=

=

β α φ

φ λ β

α

φ φ

φ φ

λ λ β

α

φ φ

λ φ

φ λ β

α

β α

(16)

The Mercer picture

furthermore, note that for f in H

M

and since

the dot product on H

M

is

( ) x x

f

k k k

= α φ

) (

ij j k

j j

k

j

x x dx δ

φ λ λ φ

φ

φ 1 ( ) ( ) 1

,

**

==

=

=

k k

k k

kl k l k l

g f

λ β α

φ φ β α

*

*

*

*

,

,

(17)

The Mercer picture

we have

x x x

x

x

x x x x x

x x

o o o

o

o o o

o

o o

o o

x 1 x 2

Φ

X

x

x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o e1 o

e3 e2 ed

l

2d

Φ

(xi)

xi

x x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o φ1 o

M

H

K

Γ

Τ

ο

Φ

(xi)=k(.,xi)

*

) , (

, ** j

ij i jk xi x

g

f =

α β

(18)

RK vs Mercer maps

H

K

⊂ M and <.,.>

*

in H

K

is the same as <.,.>

**

in M Question: is M ⊂ H

K

?

• need to show that any H

K

• from (***),

and, for any sequence {x

1

, ..., x

d

},

• if there is an invertible P, then H

K

and M ⊂ H

K

( ) ∈

= ∑ x

x f

k

α

k

φ

k

) (

⎥ ⎥

⎢ ⎢

⎥ ⎥

⎢ ⎢

=

⎥ ⎥

⎢ ⎢

(.) (.)

) (

) (

) ( )

(

) (.,

)

(.,

1

1 1

1 1

1 1 1

d P

d d

d d

d d

d

x x

x x

x k

x k

φ φ φ

λ φ

λ

φ λ φ

λ

M 4

4 4 4

4 3

4 4 4 4

4 2

1

L M

M

(.) )

( (.)

) ( )

(., x

1 1

x

1 d d

x

d

k = λ φ φ + L + λ φ φ

( )

= ∑

k

k k

k

x α k .,x

φ ( )

(19)

RK vs Mercer maps

since λ

i

> 0

is invertible when Π is. If Π is not invertible, then

if there is no sequence for which Π is invertible then

⎥ ⎥

⎢ ⎢

⎥ ⎥

⎢ ⎢

=

Π

d i

d d

d

d

x x

x x

P

λ λ

λ φ

φ

φ φ

0

0

) (

) (

) ( )

(

1

1

1 1

1

M 4 4 4

4 3

4 4 4

4 2

1

L M

( ) 0

0 =

∀ ∑

i

k k k

x

i α s.t. α φ

( ) 0

0 =

xx

k

α

k

φ

k

α s.t.

(20)

The Mercer picture

we have

x x x

x

x

x x x x x

x x

o o o

o

o o o

o

o o

o o

x 1 x 2

Φ

Μ

X

x

x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o e1 o

e3 e2 ed

H

M

=l

2d

Φ

(xi)

xi

x x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o φ1 o

φ3 φ2 φd

M=H

K

Γ

Τ

ο

Φ

(xi)=k(.,xi)

*

=

k k

k

g k

f λ

β α

*

, *

) , (

, ** j

ij i jk xi x

g

f =

α β

(21)

In summary

H

K

= M and <.,.>

*

in H

K

is the same as <.,.>

**

in M

the reproducing kernel map and the Mercer kernel map lead to the same RKHS, Mercer gives us an o.n. basis

Reproducing kernel map

H

K

= ( )

⎭⎬

⎩⎨

⎧ =

= m

i ik xi

f f

1

., (.)

|

(.) α

( )

∑∑

= =

= m

i i j

m

j i j

k x x

g f

1 '

1

* , '

,

α β

) (., : x k x

r

Φ

Mercer kernel map

H

M

=

⎭⎬

⎩⎨

⎧ < ∞

=

= d

i i

d x x

l

1 2

2 |

g f g

f

= T

, *

( )

T

M

: x → λ

1

φ

1

( x ), λ

2

φ

2

( x ), L Φ

{ }

(.)

(.) :

2

k k k

d k

e

span l

φ λ

φ

Γ r

r

M

= Φ

Γ o Φ

(22)

Regularization

Q: RKHS: why do we care?

A: regularization

• we want to do well outside of the training set

• minimizing empirical risk is not enough

• need to penalize complexity

• example: regression

• give me n points, I will give you a model of zero error (polynomial of order n-1)

• incredibly “wiggly”

• poor generalization

(23)

Regularization

we need to “regularize” the risk

f is the function we are trying to learn

• λ > 0 is the regularization parameter

• Ω is a complexity penalizing functional

larger when f is more “wiggly”

the regularized risk can be justified in various ways

• constrained minimization

• Bayesian inference

[ ] f R [ ] f [ ] f

R

reg

=

emp

+ λ Ω

(24)

Constrained minimization

regularized risk as the solution to the problem

we will talk a lot more about this

• solution is the solution of the Lagrangian problem

• λ is a Lagrange multiplier

• changing λ is equivalent to changing t (the constraint on the complexity of the optimal solution)

[ ] f [ ] f t

R

f

emp

f

≤ Ω

= arg min subject to

*

[ ] [ ]

{ R f f }

f

emp

f

Ω +

= arg min λ

*

(25)

Bayesian inference

in Bayesian inference we have

• likelihood function P

X|θ

(x| θ )

• prior P

θ

( θ )

• and search for the maximum a posteriori (MAP) estimate

{ }

[ ]

{ log ( | ) log ( ) }

max arg

) ( )

| ( max

arg

) (

) ( )

| max (

arg

)

| ( max

arg

*

|

|

|

|

θ θ

θ θ

θ θ

θ θ

θ θ θ

θ θ θ

θ θ

θ θ θ

P x

P

P x

P

x P

P x

P

x P

X X

X X

X

+

=

=

=

=

(26)

Bayesian inference

if P

X|θ

(x|f) = e

-Remp[f]

and P

θ

(f) = e

-λΩ[f]

• the MAP estimate is the minimum of the regularized risk

• the prior P

θ

(f) assigns low probability to function f with large [f]

• this reflects our prior belief that it is “unlikely” that the solution will be very complex

example:

• if Ω [f] = || f-f

0

||

2

, the prior is a Gaussian centered at f

0

with variance 1/ λ

1/2

we believe that the most likely solution is f = f

0

• the larger the λ , the more we penalize solutions which are different from this

[ ] [ ]

{ R f f }

f

emp

f

Ω +

= arg min λ

*

(27)

Structural risk minimization

• start from a nested collection of families of functions

where S

i

= {h

i

(x, α ), for all α }

for each S

i

, find the function (set of parameters) that minimizes the empirical risk

• select the function class such that

• Φ (h) is a function of VC dimension (complexity) of the family S

i

• VC have shown that this is equivalent to minimizing a bound on the risk, and provides generalization guarantees

S

k

S

1

⊂ L ⊂

=

=

n

k k i k

empi

L y h x

R n

1

)]

, ( , 1 [

min α

α

{

empi

( )

i

}

i

R h

R

*

= min + Φ

(28)

The regularizer

what is a good regularizer?

intuition: “wigglier” functions have a larger norm than smoother functions

for example, in H

K

we have

∑ ∑

∑ ∑

=

⎥ ⎦

⎢ ⎤

= ⎡

=

=

=

l

l l l

l i

i l i l

i

i l l

l l i

i

i i

x c

x x

x x

x k

x f

) (

) ( )

(

) ( ) (

) (., )

(

φ

φ φ

α λ

φ φ

λ α

α

(29)

The regularizer

and

with hence,

• || f ||

2

grows with the number of c

l

different than zero

this is the case in which f is more complex, since it becomes a sum of more basis function φ

l

(x)

• identical to what happens in Fourier type decompositions

• more coefficients means more “high-frequencies” or “less smoothness”

= =

=

l l

l

lk l

lk k l lk

k l

k l

c c c x

x c

c x

f λ λ

φ δ

φ

2

*

*

2

( ), ( )

) (

∑ ( )

=

i

i l i l

l

x

c λ α φ

(30)

Regularization

OK, regularization is a good idea from multiple points of view

problem: minimizing the regularized risk

over the set of all functions seems like a nightmare it turns out that it is not, under some reasonable conditions on the regularizer Ω

the key is the “representer theorem”

[ ] f R [ ] f [ ] f

R

reg

=

emp

+ λ Ω

(31)

Representer theorem

Theorem: Let

• Ω :[0, ) → ℜ be a strictly monotonically increasing function,

H the RKHS associated with a kernel k(x,y)

• L[y,f(x)] a loss function

then, if

f* admits a representation of the form

[ ( ) ] ( )

⎢ ⎤

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

∑ ( )

=

n i

k x

i

f

*

α ., (i.e. f* H )

(32)

Proof

• decompose any f into the part contained in the span of the k(.,xi) and the part in the orthogonal complement

where f ∈ w

0

and f w

with

• then

( ) ., ( )

) ( )

( )

(

1

0

x f x k x f x

f x

f

n

i i i

=

=

+

+

= ∑ α

{ }

{ }

{

0

}

0

, 0 ,

|

) (.,

|

w f

f g g

w

x k

span f

f

w

i

=

=

=

[ ] ( )

( ) ( )

⎜ ⎜

⎝ Ω ⎛

⎟ ≥

⎜ ⎜

⎛ +

=

⎟ ⎟

⎜ ⎜

⎛ +

= Ω

=

=

=

2

1 2

2

1

2

1 2

., .,

) ( .,

n

i i i

n

i i i

n

i i i

x k f

x k

x f x

k f

α α

α

( Ω mon. increasing)

(33)

Proof

• this shows that the second term in

is minimized by a function of the stated form.

• the first, using the reproducing property

is always a function of the stated form.

• hence, the minimum must be of this form as well

[ ( ) ] ( )

⎢ ⎤

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

( )

=

=

+

=

=

n

j j i j

i i

i i

x x k

x k

f x

k f

x k

f x

f

1

0 0

) ,

(

) (., (.),

) (., (.),

) (., (.),

α

4 4 8 4

4 7

6

(34)

The picture

due to reproducing property:

f(x

i

)H

if f* is the solution in H

||f’|| > ||f*||

• is a more complex function

• which does not reduce the loss

hence, f* is the optimal solution

o o

o o

o o

o o o

o

o o

f(xi)

H

e1

e3 ed

f*

f’

(35)

Relevance

the remarkable consequence of the theorem is that:

• we can reduce the minimization over the (infinite dimensional) space of functions

• to a minimization on a finite dimensional space!

to see this note that, because we have

( ) ., ,

1

*

=

=

n

i i

k x

i

f α

( ) ( )

( ) α α

α α

α α

K x

x k

x k

x k

f f f

ij

j T i

j i

ij i j i j

=

=

=

=

,

., ,

., ,

*

2 *

*

( ) α

α k x x K x

f

*

==

(36)

Regularization

this proves the following theorem Theorem:

• if Ω :[0, ) → ℜ is a strictly monotonically increasing function,

then for any dot-product kernel k(x,y) and loss function L[y,f(x)]

the solution of

is with

K the Gram matrix [k(x

i

,x

j

)] and Y =(...,y

i

,..

.

)

[ ( ) ] ( )

⎢ ⎤

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

[ ] ( )

⎢ ⎤

⎡ + Ω

= ∑

= n

i

T

K K

Y L

1

*

arg min , α λ α α

α

α

∑ ( )

=

=

n

i i

k x

i

f

1

*

*

α .,

(37)

Note

the result that f* ∈ H holds for any norm, not just L

2

however, the exact form of the problem will no longer be

the argument of the second term will have a different form

[ ] ( )

⎢ ⎤

⎡ + Ω

= ∑

= n

i

T

K K

Y L

1

*

arg min , α λ α α

α

α

(38)

References

Related documents

¾ The single most important element of technical analysis is that future exchange rates are based on the current exchange rate ¾ In practice most commonly used for short-term

displaying MoCA diagnostic cut points, where a cut point for both high sensitivity and specificity is chosen, which stratifies participants into high-, intermediate- and

Moreover, we have used a co-expression network ap- proach to elucidate potential regulatory interactions be- tween expressed miRNAs and differentially expressed (DE) mRNA genes as

Copper, ( As Taiwan Approaches the New Millennium (Lanham: University Press of America, 1999); Chun Chieh Huang, Postwar Taiwan in Historical Perspective (College

Splitting the percentage of women in the overall supervisory board into the percentage of women on the audit committee and the percentage of women on the residual

In the civil war (or internal strife) that ensued, Jephthah and the men of Gilead defeated the men of Ephraim who had come over from the west side to the east side of the Jordan

In order to study the influence of the adoption of fiscal rules on discretionary policy the model estimates the relationship between the structural primary balance ratio and the

Bart Van Looy has published on these topics in journals like Research Policy, Journal of Product and Innovation Management, Organization Studies, R&amp;D Management,