Reproducing kernel Hilbert spaces and regularization
Nuno Vasconcelos
ECE Department, UCSD
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
Perceptron: classifier implements the linear decision rule
$$h(x) = \mathrm{sgn}[g(x)] \quad \text{with} \quad g(x) = w^T x + b$$
appropriate when the classes are linearly separable
to deal with non-linear separability we introduce a kernel
[Figure: linear decision boundary, labeled with w, b, x, and g(x)]
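As a minimal sketch of the decision rule above, the following evaluates h(x) = sgn[g(x)] on a few points. The weights, offset, and data points are hypothetical values chosen for illustration, and ties g(x) = 0 are assigned to class +1 by assumption.

```python
import numpy as np

def g(x, w, b):
    """Linear discriminant g(x) = w^T x + b (x may hold one point per row)."""
    return x @ w + b

def h(x, w, b):
    """Decision rule h(x) = sgn[g(x)], returned as +1 / -1 (ties go to +1)."""
    return np.where(g(x, w, b) >= 0, 1, -1)

# Hypothetical 2-D example: the decision boundary is the line x1 + x2 = 1.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, 0.5]])
labels = h(X, w, b)
```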
Types of kernels
these three are equivalent
dot product kernel ⇔ positive definite kernel ⇔ Mercer kernel
Dot-product kernels
Definition: a mapping
k: X x X → ℜ
(x,y) → k(x,y) is a dot-product kernel if and only if
k(x,y) = < Φ (x), Φ (y)>
where Φ : X → H,
H is a vector space,
<.,.> is a dot-product in H
[Figure: Φ maps the two classes (x and o) from the input space X (axes x1, x2) to the space H (axes x1, ..., xn)]
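To make the definition concrete, here is a small sketch that checks k(x,y) = <Φ(x), Φ(y)> numerically. The choice of kernel, k(x,y) = (x^T y)^2 on R^2 with the explicit map Φ(x) = (x1^2, sqrt(2) x1 x2, x2^2), is an illustrative assumption and does not come from the slides.

```python
import numpy as np

def k(x, y):
    """Hypothetical example kernel: squared dot product on R^2."""
    return float(x @ y) ** 2

def phi(x):
    """Explicit feature map into H = R^3 for the kernel above."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = k(x, y)                # (1*3 + 2*(-1))^2 = 1
feature_dot = float(phi(x) @ phi(y))  # 9 - 12 + 4 = 1
```

Both quantities agree, so this k is a dot-product kernel with H = R^3.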
positive definite and Mercer kernels
Definition: k(x,y) is a positive definite kernel on X x X if ∀ l and ∀ {x_1, ..., x_l}, x_i ∈ X, the Gram matrix
$$K = [K_{ij}], \quad K_{ij} = k(x_i, x_j)$$
is positive definite.
Definition: a symmetric mapping k: X x X → R such that
$$\int\!\!\int k(x,y)\, f(x)\, f(y)\, dx\, dy \ge 0, \quad \forall f \ \text{s.t.} \ \int f(x)^2\, dx < \infty$$
is a Mercer kernel
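The positive definiteness condition can be checked numerically on a finite sample. The sketch below builds the Gram matrix for a Gaussian kernel, assumed here in the form exp(-||x - y||^2 / σ^2), on randomly drawn points and inspects its eigenvalues. The sample size, dimension, and σ are arbitrary choices.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel, assumed form exp(-||x - y||^2 / sigma^2)."""
    return np.exp(-np.sum((x - y) ** 2) / sigma**2)

def gram_matrix(points, kernel):
    """Gram matrix K with K[i, j] = kernel(x_i, x_j)."""
    l = len(points)
    K = np.empty((l, l))
    for i in range(l):
        for j in range(l):
            K[i, j] = kernel(points[i], points[j])
    return K

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))   # 20 hypothetical points in R^3
K = gram_matrix(points, gaussian_kernel)

# A symmetric matrix is positive (semi)definite iff its eigenvalues are >= 0.
eigenvalues = np.linalg.eigvalsh(K)
min_eigenvalue = eigenvalues.min()
```

A single sample cannot prove that a kernel is positive definite, but a clearly negative eigenvalue would disprove it.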
Two different pictures
different definitions lead to different interpretations of what the kernel does
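Reproducing kernel map
$$H_K = \left\{ f(\cdot) \,\middle|\, f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) \right\}$$
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j), \quad \text{for} \ g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x'_j)$$
$$\Phi: x \to k(\cdot, x)$$
Mercer kernel map
$$H_M = l_2^d = \left\{ x \,\middle|\, \sum_{i=1}^{d} x_i^2 < \infty \right\}$$
$$\langle f, g \rangle_* = f^T g$$
$$\Phi: x \to \left( \sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots \right)^T$$
where $\lambda_i$, $\phi_i$ are the eigenvalues and eigenfunctions of k(x,y), $\lambda_i > 0$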
The dot-product picture
when we use the Gaussian kernel
$$K(x, x_i) = e^{-\|x - x_i\|^2 / \sigma^2}$$
• the point $x_i \in X$ is mapped into the Gaussian $G(x, x_i, \sigma I)$
• H is the space of all functions that are linear combinations of Gaussians
• the kernel is a dot product in H, and a non-linear similarity on X
• reproducing property on H: analogy to linear systems
[Figure: Φ maps the points of X (axes x1, x2) into $H_K$, with dot product $\langle\cdot,\cdot\rangle_*$]
The Mercer picture
this is a mapping from $R^n$ to $R^d$
much more like a multi-layer Perceptron than before
the kernelized Perceptron as a neural net
[Figure: Φ maps $x_i \in X$ (axes x1, x2) to $\Phi(x_i) \in l_2^d$ (axes x1, x2, x3, ..., xd); network diagram with input x and units $\Phi_1(\cdot), \Phi_2(\cdot), \ldots, \Phi_d(\cdot)$]
The reproducing property
with this definition of $H_K$ and $\langle\cdot,\cdot\rangle_*$
$$\forall f \in H_K, \quad \langle k(\cdot,x), f(\cdot) \rangle_* = f(x)$$
this is called the reproducing property
an analogy is to think of linear time-invariant systems
• the dot product as a convolution
• k(.,x) as the Dirac delta
• f(.) as a system input
• the equation above is the basis of all linear time-invariant systems theory
leads to reproducing kernel Hilbert spaces
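The reproducing property follows directly from the definition of the dot product on the reproducing kernel map. A short derivation, assuming f(.) = Σ_i α_i k(., x_i) and treating k(., x) as the element of the space with a single coefficient equal to one:

```latex
\langle k(\cdot,x), f(\cdot) \rangle_*
  = \Big\langle k(\cdot,x), \sum_{i=1}^{m} \alpha_i k(\cdot,x_i) \Big\rangle_*
  = \sum_{i=1}^{m} \alpha_i k(x, x_i)
  = f(x)
```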
reproducing kernel Hilbert spaces
Definition: a Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences)
Definition: Let H be a Hilbert space of functions f: X → R. H is a RKHS endowed with dot-product $\langle\cdot,\cdot\rangle_*$ if ∃ k: X x X → R such that
1. k spans H, i.e.,
$$H = \mathrm{span}\{k(\cdot, x)\} = \left\{ f(\cdot) \,\middle|\, \exists \{x_i\}, \{\alpha_i\} \ \text{such that} \ f(\cdot) = \sum_i \alpha_i k(\cdot, x_i) \right\}$$
2. $\langle f(\cdot), k(\cdot,x) \rangle_* = f(x)$, ∀ f ∈ H
Mercer kernels
how different are the spaces $H_K$ and $H_M$?
Theorem: Let k: X x X → R be a Mercer kernel. Then, there exists an orthonormal set of functions
$$\int \phi_i(x)\, \phi_j(x)\, dx = \delta_{ij}$$
and a set of $\lambda_i \ge 0$, such that
$$\text{i)} \quad \sum_{i=1}^{\infty} \lambda_i^2 = \int\!\!\int k^2(x,y)\, dx\, dy$$
$$\text{ii)} \quad k(x,y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y) \qquad (**)$$
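A finite-dimensional analogue of the theorem can be checked with the eigendecomposition of a Gram matrix: the eigenvectors play the role of the orthonormal φ_i, and the matrix entries play the role of k(x,y). This is only an analogy under the stated assumptions (Gaussian kernel, random sample points), not the theorem itself.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(15, 2))   # hypothetical sample

# Gram matrix of a Gaussian kernel (the kernel choice is an assumption).
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)

# Symmetric eigendecomposition: columns of V are orthonormal eigenvectors.
lam, V = np.linalg.eigh(K)

# Analogue of the orthonormality condition on the phi_i.
orthonormal = np.allclose(V.T @ V, np.eye(len(points)))

# Analogue of (**): K = sum_i lam_i v_i v_i^T.
K_rebuilt = (V * lam) @ V.T

# Analogue of i): sum_i lam_i^2 equals the sum of squared entries of K.
sum_sq_eigs = np.sum(lam**2)
sum_sq_entries = np.sum(K**2)
```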
RK vs Mercer maps
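note that for $H_M$ we are writing
$$\Phi(x) = \sqrt{\lambda_1}\,\phi_1(x)\, e_1 + \cdots + \sqrt{\lambda_d}\,\phi_d(x)\, e_d$$
but, since the $\phi_i(\cdot)$ are orthonormal, there is a 1-1 map
$$\Gamma: l_2^d \to \mathrm{span}\{\phi_k(\cdot)\}, \quad e_k \to \sqrt{\lambda_k}\,\phi_k(\cdot)$$
and we can write, from (**),
$$(\Gamma \circ \Phi)(x) = \lambda_1 \phi_1(x)\, \phi_1(\cdot) + \cdots + \lambda_d \phi_d(x)\, \phi_d(\cdot) = k(\cdot, x) \qquad (***)$$
hence k(.,x) maps x into M = span{$\phi_k(\cdot)$}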
The Mercer picture
we have
[Figure: Φ maps $x_i \in X$ (axes x1, x2) to $\Phi(x_i) \in l_2^d$ (axes e1, e2, e3, ..., ed); Γ maps $l_2^d$ to M (axes φ1, ...); $(\Gamma \circ \Phi)(x_i) = k(\cdot, x_i)$]
RK vs Mercer maps
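define the dot product in M so that
$$\langle \phi_j, \phi_k \rangle_{**} = \frac{1}{\lambda_j} \int \phi_j(x)\, \phi_k(x)\, dx = \frac{1}{\lambda_j}\, \delta_{jk}$$
then {$\phi_k(\cdot)$} is a basis, M is a vector space, any function in M can be written as
$$f(x) = \sum_k \alpha_k \phi_k(x)$$
and
$$\langle f(\cdot), k(\cdot,x) \rangle_{**} = \Big\langle \sum_i \alpha_i \phi_i(\cdot), \sum_j \lambda_j \phi_j(x)\, \phi_j(\cdot) \Big\rangle_{**}$$
$$= \sum_{ij} \alpha_i \lambda_j \phi_j(x) \underbrace{\langle \phi_i, \phi_j \rangle_{**}}_{\frac{1}{\lambda_j}\delta_{ij}}$$
$$= \sum_i \alpha_i \phi_i(x) = f(x)$$
i.e., k is a reproducing kernel on M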
RK vs Mercer maps
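furthermore, since k(.,x) ∈ M, any functions of the form
$$f(\cdot) = \sum_i \alpha_i k(\cdot, x_i), \quad g(\cdot) = \sum_j \beta_j k(\cdot, x_j)$$
are in M and
$$\langle f, g \rangle_{**} = \Big\langle \sum_i \alpha_i k(\cdot, x_i), \sum_j \beta_j k(\cdot, x_j) \Big\rangle_{**}$$
$$= \sum_{ij} \alpha_i \beta_j \Big\langle \sum_l \lambda_l \phi_l(x_i)\, \phi_l(\cdot), \sum_m \lambda_m \phi_m(x_j)\, \phi_m(\cdot) \Big\rangle_{**}$$
$$= \sum_{ij} \alpha_i \beta_j \sum_{lm} \lambda_l \lambda_m \phi_l(x_i)\, \phi_m(x_j)\, \langle \phi_l, \phi_m \rangle_{**}$$
$$= \sum_{ij} \alpha_i \beta_j \sum_l \lambda_l \phi_l(x_i)\, \phi_l(x_j)$$
$$= \sum_{ij} \alpha_i \beta_j k(x_i, x_j)$$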
The Mercer picture
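furthermore, note that for f in $H_M$,
$$f(x) = \sum_k \alpha_k \phi_k(x)$$
and since
$$\langle \phi_j, \phi_k \rangle_{**} = \frac{1}{\lambda_j} \int \phi_j(x)\, \phi_k(x)\, dx = \frac{1}{\lambda_j}\, \delta_{jk}$$
the dot product on $H_M$ is
$$\langle f, g \rangle_{**} = \sum_{kl} \alpha_k \beta_l \langle \phi_k, \phi_l \rangle_{**} = \sum_k \frac{\alpha_k \beta_k}{\lambda_k}$$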
The Mercer picture
we have
[Figure: Φ maps $x_i \in X$ (axes x1, x2) to $\Phi(x_i) \in l_2^d$ (axes e1, e2, e3, ..., ed); Γ maps $l_2^d$ to M (axes φ1, ...), which contains $H_K$; $(\Gamma \circ \Phi)(x_i) = k(\cdot, x_i)$; on M, $\langle f, g \rangle_{**} = \sum_{ij} \alpha_i \beta_j k(x_i, x_j)$]
RK vs Mercer maps
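$H_K \subset M$ and $\langle\cdot,\cdot\rangle_*$ in $H_K$ is the same as $\langle\cdot,\cdot\rangle_{**}$ in M
Question: is $M \subset H_K$?
• need to show that any
$$f(x) = \sum_k \alpha_k \phi_k(x) \in H_K$$
• from (***),
$$k(\cdot, x) = \lambda_1 \phi_1(x)\, \phi_1(\cdot) + \cdots + \lambda_d \phi_d(x)\, \phi_d(\cdot)$$
and, for any sequence {x_1, ..., x_d},
$$\begin{bmatrix} k(\cdot, x_1) \\ \vdots \\ k(\cdot, x_d) \end{bmatrix} = \underbrace{\begin{bmatrix} \lambda_1 \phi_1(x_1) & \cdots & \lambda_d \phi_d(x_1) \\ \vdots & & \vdots \\ \lambda_1 \phi_1(x_d) & \cdots & \lambda_d \phi_d(x_d) \end{bmatrix}}_{P} \begin{bmatrix} \phi_1(\cdot) \\ \vdots \\ \phi_d(\cdot) \end{bmatrix}$$
• if there is an invertible P, then
$$\phi_i(\cdot) = \sum_k \alpha_k k(\cdot, x_k) \in H_K$$
(with the $\alpha_k$ taken from the i-th row of $P^{-1}$) and $M \subset H_K$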
RK vs Mercer maps
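since $\lambda_i > 0$,
$$P = \underbrace{\begin{bmatrix} \phi_1(x_1) & \cdots & \phi_d(x_1) \\ \vdots & & \vdots \\ \phi_1(x_d) & \cdots & \phi_d(x_d) \end{bmatrix}}_{\Pi} \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_d \end{bmatrix}$$
is invertible when Π is. If Π is not invertible, then
$$\exists\, \alpha \ne 0 \ \text{s.t.} \ \sum_k \alpha_k \phi_k(x_i) = 0, \quad \forall i$$
if there is no sequence for which Π is invertible then
$$\exists\, \alpha \ne 0 \ \text{s.t.} \ \sum_k \alpha_k \phi_k(x) = 0, \quad \forall x$$
this cannot happen, since the $\phi_k$ are orthonormal and hence linearly independent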
The Mercer picture
we have
[Figure: Φ_M maps $x_i \in X$ (axes x1, x2) to $\Phi(x_i) \in H_M = l_2^d$ (axes e1, e2, e3, ..., ed); Γ maps $l_2^d$ to $M = H_K$ (axes φ1, φ2, φ3, ..., φd); $(\Gamma \circ \Phi)(x_i) = k(\cdot, x_i)$; on $M = H_K$ the dot product is $\sum_k \alpha_k \beta_k / \lambda_k$ for f, g expanded in the basis {φ_k}, and $\sum_{ij} \alpha_i \beta_j k(x_i, x_j)$ for f, g expanded in the k(., x_i)]
In summary
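$H_K = M$ and $\langle\cdot,\cdot\rangle_*$ in $H_K$ is the same as $\langle\cdot,\cdot\rangle_{**}$ in M
the reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis
Reproducing kernel map
$$H_K = \left\{ f(\cdot) \,\middle|\, f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) \right\}$$
$$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$$
$$\Phi: x \to k(\cdot, x)$$
Mercer kernel map
$$H_M = l_2^d = \left\{ x \,\middle|\, \sum_{i=1}^{d} x_i^2 < \infty \right\}$$
$$\langle f, g \rangle_* = f^T g$$
$$\Phi_M: x \to \left( \sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots \right)^T$$
$$\Gamma: l_2^d \to \mathrm{span}\{\phi_k(\cdot)\}, \quad e_k \to \sqrt{\lambda_k}\,\phi_k(\cdot)$$
$$\Phi = \Gamma \circ \Phi_M$$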
Regularization
Q: RKHS: why do we care?
A: regularization
• we want to do well outside of the training set
• minimizing empirical risk is not enough
• need to penalize complexity
• example: regression
• give me n points, I will give you a model of zero error (polynomial of order n-1)
• incredibly “wiggly”
• poor generalization
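The regression example above can be reproduced in a few lines. The sketch fits a polynomial of order n-1 through n points and compares the error on the training points with the error between them. The target function 1/(1 + 25x^2), the equally spaced sample, and n = 11 are assumptions chosen to make the effect visible.

```python
import numpy as np

def target(x):
    """Hypothetical smooth target function (Runge's example)."""
    return 1.0 / (1.0 + 25.0 * x**2)

n = 11
x_train = np.linspace(-1.0, 1.0, n)
y_train = target(x_train)

# Polynomial of order n-1 through n points: solve the Vandermonde system.
# np.vander orders columns as x^(n-1), ..., x^0, matching np.polyval.
V = np.vander(x_train, n)
coeffs = np.linalg.solve(V, y_train)

# Error on the training set: zero up to round-off.
train_error = np.max(np.abs(np.polyval(coeffs, x_train) - y_train))

# Error between the training points: the interpolant is very "wiggly".
x_test = np.linspace(-1.0, 1.0, 401)
test_error = np.max(np.abs(np.polyval(coeffs, x_test) - target(x_test)))
```

The training error is at round-off level while the error between the sample points is large, which is the poor generalization the slide refers to.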
Regularization
we need to “regularize” the risk
$$R_{reg}[f] = R_{emp}[f] + \lambda\, \Omega[f]$$
• f is the function we are trying to learn
• λ > 0 is the regularization parameter
• Ω is a complexity penalizing functional
• larger when f is more “wiggly”
the regularized risk can be justified in various ways
• constrained minimization
• Bayesian inference
Constrained minimization
regularized risk as the solution to the problem
$$f^* = \arg\min_f R_{emp}[f] \quad \text{subject to} \quad \Omega[f] \le t$$
we will talk a lot more about this
• solution is the solution of the Lagrangian problem
$$f^* = \arg\min_f \left\{ R_{emp}[f] + \lambda\, \Omega[f] \right\}$$
• λ is a Lagrange multiplier
• changing λ is equivalent to changing t (the constraint on the complexity of the optimal solution)
Bayesian inference
in Bayesian inference we have
• likelihood function $P_{X|\theta}(x|\theta)$
• prior $P_\theta(\theta)$
• and search for the maximum a posteriori (MAP) estimate
$$\theta^* = \arg\max_\theta P_{\theta|X}(\theta|x)$$
$$= \arg\max_\theta \frac{P_{X|\theta}(x|\theta)\, P_\theta(\theta)}{P_X(x)}$$
$$= \arg\max_\theta P_{X|\theta}(x|\theta)\, P_\theta(\theta)$$
$$= \arg\max_\theta \left\{ \log P_{X|\theta}(x|\theta) + \log P_\theta(\theta) \right\}$$
Bayesian inference
if $P_{X|\theta}(x|f) = e^{-R_{emp}[f]}$ and $P_\theta(f) = e^{-\lambda \Omega[f]}$
• the MAP estimate is the minimum of the regularized risk
$$f^* = \arg\min_f \left\{ R_{emp}[f] + \lambda\, \Omega[f] \right\}$$
• the prior $P_\theta(f)$ assigns low probability to functions f with large Ω[f]
• this reflects our prior belief that it is “unlikely” that the solution will be very complex
example:
• if $\Omega[f] = \|f - f_0\|^2$, the prior is a Gaussian centered at $f_0$ with variance $1/(2\lambda)$, i.e. standard deviation proportional to $1/\lambda^{1/2}$
• we believe that the most likely solution is $f = f_0$
• the larger the λ, the more we penalize solutions which are different from this
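The step from the MAP estimate to the regularized risk can be spelled out by substituting the two exponential forms assumed above into the log-posterior:

```latex
f^* = \arg\max_f \left\{ \log P_{X|\theta}(x|f) + \log P_\theta(f) \right\}
    = \arg\max_f \left\{ -R_{emp}[f] - \lambda\,\Omega[f] \right\}
    = \arg\min_f \left\{ R_{emp}[f] + \lambda\,\Omega[f] \right\}
```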
Structural risk minimization
• start from a nested collection of families of functions
$$S_1 \subset \cdots \subset S_k$$
where $S_i = \{h_i(x, \alpha), \ \text{for all} \ \alpha\}$
• for each $S_i$, find the function (set of parameters) that minimizes the empirical risk
$$R_{emp}^i = \min_\alpha \frac{1}{n} \sum_{k=1}^{n} L[y_k, h_i(x_k, \alpha)]$$
• select the function class such that
$$R^* = \min_i \left\{ R_{emp}^i + \Phi(h_i) \right\}$$
• Φ(h) is a function of the VC dimension (complexity) of the family $S_i$
• Vapnik and Chervonenkis have shown that this is equivalent to minimizing a bound on the risk, and provides generalization guarantees
The regularizer
what is a good regularizer?
intuition: “wigglier” functions have a larger norm than smoother functions
for example, in $H_K$ we have
$$f(x) = \sum_i \alpha_i k(x, x_i)$$
$$= \sum_i \alpha_i \sum_l \lambda_l \phi_l(x_i)\, \phi_l(x)$$
$$= \sum_l \left[ \sum_i \alpha_i \lambda_l \phi_l(x_i) \right] \phi_l(x)$$
$$= \sum_l c_l\, \phi_l(x)$$
The regularizer
and
$$\|f(x)\|^2 = \sum_{lk} c_l c_k \langle \phi_l(x), \phi_k(x) \rangle_* = \sum_{lk} c_l c_k \frac{\delta_{lk}}{\lambda_l} = \sum_l \frac{c_l^2}{\lambda_l}$$
with
$$c_l = \lambda_l \sum_i \alpha_i \phi_l(x_i)$$
hence,
• $\|f\|^2$ grows with the number of $c_l$ different from zero
• this is the case in which f is more complex, since it becomes a sum of more basis functions $\phi_l(x)$
• identical to what happens in Fourier type decompositions
• more coefficients means more “high-frequencies” or “less smoothness”
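The norm formula can be checked on a finite-dimensional analogue, where the eigenvectors of a Gram matrix stand in for the φ_l evaluated at the sample points. The sketch compares Σ_l c_l^2 / λ_l with α^T K α. The Gaussian kernel, the grid of points, and the coefficients α are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical sample points and expansion coefficients alpha_i.
x = np.linspace(-2.0, 2.0, 6)
alpha = np.array([0.5, -1.0, 2.0, 0.0, 1.5, -0.5])

# Gram matrix of a Gaussian kernel (kernel choice is an assumption).
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# Discrete stand-in for Mercer: K = sum_l lam_l v_l v_l^T, V[i, l] ~ phi_l(x_i).
lam, V = np.linalg.eigh(K)

# c_l = lam_l * sum_i alpha_i phi_l(x_i)
c = lam * (V.T @ alpha)

norm_from_coeffs = np.sum(c**2 / lam)   # sum_l c_l^2 / lam_l
norm_from_gram = alpha @ K @ alpha      # alpha^T K alpha
```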
Regularization
OK, regularization is a good idea from multiple points of view
problem: minimizing the regularized risk
$$R_{reg}[f] = R_{emp}[f] + \lambda\, \Omega[f]$$
over the set of all functions seems like a nightmare
it turns out that it is not, under some reasonable conditions on the regularizer Ω
the key is the “representer theorem”
Representer theorem
Theorem: Let
• Ω :[0, ∞ ) → ℜ be a strictly monotonically increasing function,
• H the RKHS associated with a kernel k(x,y)
• L[y,f(x)] a loss function
then, if
$$f^* = \arg\min_f \left[ \sum_{i=1}^{n} L[y_i, f(x_i)] + \lambda\, \Omega\!\left( \|f\|^2 \right) \right]$$
f* admits a representation of the form
$$f^* = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i) \quad (\text{i.e. } f^* \in H)$$
Proof
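• decompose any f into the part contained in the span of the k(.,x_i) and the part in the orthogonal complement
$$f(x) = f_0(x) + f_\perp(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i) + f_\perp(x)$$
where $f_0 \in w_0$ and $f_\perp \in w_\perp$ with
$$w_0 = \left\{ f \mid f \in \mathrm{span}\{k(\cdot, x_i)\} \right\}$$
$$w_\perp = \left\{ f \mid \langle f, g \rangle = 0, \ \forall g \in w_0 \right\}$$
• then
$$\Omega\!\left( \|f\|^2 \right) = \Omega\!\left( \Big\| \sum_{i=1}^{n} \alpha_i k(\cdot, x_i) + f_\perp \Big\|^2 \right)$$
$$= \Omega\!\left( \Big\| \sum_{i=1}^{n} \alpha_i k(\cdot, x_i) \Big\|^2 + \|f_\perp\|^2 \right)$$
$$\ge \Omega\!\left( \Big\| \sum_{i=1}^{n} \alpha_i k(\cdot, x_i) \Big\|^2 \right) \qquad (\Omega \text{ mon. increasing})$$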
Proof
• this shows that the second term in
$$f^* = \arg\min_f \left[ \sum_{i=1}^{n} L[y_i, f(x_i)] + \lambda\, \Omega\!\left( \|f\|^2 \right) \right]$$
is minimized by a function of the stated form.
• the first, using the reproducing property
$$f(x_i) = \langle f(\cdot), k(\cdot, x_i) \rangle = \langle f_0(\cdot), k(\cdot, x_i) \rangle + \underbrace{\langle f_\perp(\cdot), k(\cdot, x_i) \rangle}_{0} = \sum_{j=1}^{n} \alpha_j k(x_i, x_j)$$
depends on f only through $f_0$, which is always a function of the stated form.
• hence, the minimum must be of this form as well
The picture
due to reproducing property:
• $f(x_i) \in H$
if f* is the solution in H
• ||f’|| > ||f*||
• f’ is a more complex function
• which does not reduce the loss
hence, f* is the optimal solution
[Figure: the subspace H (containing the $f(x_i)$ and f*) inside the space with axes e1, e3, ..., ed, and a function f’ outside H]
Relevance
the remarkable consequence of the theorem is that:
• we can reduce the minimization over the (infinite dimensional) space of functions
• to a minimization on a finite dimensional space!
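to see this note that, because
$$f^* = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i),$$
we have
$$\|f^*\|^2 = \langle f^*, f^* \rangle = \sum_{ij} \alpha_i \alpha_j \langle k(\cdot, x_i), k(\cdot, x_j) \rangle = \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K \alpha$$
$$f^*(x_i) = \sum_j \alpha_j k(x_i, x_j) = K_i\, \alpha$$
where $K_i$ is the i-th row of the Gram matrix K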
Regularization
this proves the following theorem
Theorem:
• if Ω: [0, ∞) → ℜ is a strictly monotonically increasing function,
• then for any dot-product kernel k(x,y) and loss function L[y,f(x)]
the solution of
$$f^* = \arg\min_f \left[ \sum_{i=1}^{n} L[y_i, f(x_i)] + \lambda\, \Omega\!\left( \|f\|^2 \right) \right]$$
is
$$f^* = \sum_{i=1}^{n} \alpha_i^* k(\cdot, x_i)$$
with
$$\alpha^* = \arg\min_\alpha \left[ L[Y, K\alpha] + \lambda\, \Omega\!\left( \alpha^T K \alpha \right) \right], \quad L[Y, K\alpha] = \sum_{i=1}^{n} L[y_i, (K\alpha)_i]$$
K the Gram matrix $[k(x_i, x_j)]$ and $Y = (\ldots, y_i, \ldots)$
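As one concrete instance of the finite-dimensional problem, the sketch below assumes the squared loss L[y, f(x)] = (y - f(x))^2 and Ω(t) = t. In that special case setting the gradient to zero gives K[(K + λI)α - Y] = 0, so α* = (K + λI)^{-1} Y is a minimizer (this is usually called kernel ridge regression). The Gaussian kernel, the synthetic data, and λ = 0.1 are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Matrix of k(a_i, b_j) for 1-D inputs, assumed Gaussian kernel."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / sigma**2)

# Hypothetical training data: noisy samples of a sine wave.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 3.0, 12)
y_train = np.sin(2.0 * x_train) + 0.1 * rng.normal(size=x_train.size)
lam = 0.1

K = gaussian_gram(x_train, x_train)

def objective(alpha):
    """Squared loss plus lam * alpha^T K alpha (i.e. Omega(t) = t)."""
    residual = y_train - K @ alpha
    return residual @ residual + lam * (alpha @ K @ alpha)

# Closed-form minimizer for this choice of loss and regularizer.
alpha_star = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def f_star(x_new):
    """f*(x) = sum_i alpha*_i k(x, x_i)."""
    return gaussian_gram(np.atleast_1d(x_new), x_train) @ alpha_star

# Gradient of the objective at alpha_star; it should vanish.
gradient = -2.0 * K @ (y_train - K @ alpha_star) + 2.0 * lam * (K @ alpha_star)
```

Other losses generally have no closed form, but the search is still over the n coefficients α rather than over the whole function space.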
Note
the result that f* ∈ H holds for any norm, not just $L_2$
however, the exact form of the problem will no longer be
$$\alpha^* = \arg\min_\alpha \left[ L[Y, K\alpha] + \lambda\, \Omega\!\left( \alpha^T K \alpha \right) \right]$$
the argument of the second term will have a different form