Optimization
Nuno Vasconcelos
ECE Department, UCSD
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
Perceptron: classifier implements the linear decision rule
$h(x) = \mathrm{sgn}[g(x)]$ with $g(x) = w^T x + b$
appropriate when the classes are linearly separable
to deal with non-linear separability we introduce a kernel
[figure: separating hyperplane with normal w, offset b, a point x and its value g(x)]
Types of kernels
these three are equivalent
dot product kernel ⇔ positive definite kernel ⇔ Mercer kernel
Dot-product kernels
Definition: a mapping
k: X x X → ℜ
(x,y) → k(x,y) is a dot-product kernel if and only if
k(x,y) = < Φ (x), Φ (y)>
where Φ : X → H,
H is a vector space,
<.,.> is a dot-product in H
[figure: Φ maps the two classes (x and o) from the input space X, with axes x1 and x2, to the space H]
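As an illustration of the definition, here is a minimal sketch of one dot-product kernel. The homogeneous quadratic kernel k(x,y) = (x·y)^2 on R^2 and its explicit map Φ into R^3 are a standard textbook example chosen for this note, not something taken from the slides; the function names are hypothetical.

```python
import math

def dot(u, v):
    """Standard dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Assumed feature map Phi: R^2 -> R^3 for the quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def k(x, y):
    """Kernel evaluated directly in the input space X: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = k(x, y)               # computed without ever leaving X
rhs = dot(phi(x), phi(y))   # computed as a dot product in H = R^3
print(lhs, rhs)
```

Both numbers agree, which is the point of the definition: the kernel computes a dot product in H while only touching the original features.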
positive definite and Mercer kernels
Definition: k(x,y) is a positive definite kernel on X x X if ∀ l and ∀ {x_1, ..., x_l}, x_i ∈ X, the Gram matrix
$[K]_{ij} = k(x_i, x_j)$
is positive definite.
Definition: a symmetric mapping k: X x X → ℜ such that
$\int\int k(x,y) f(x) f(y) \, dx \, dy \ge 0, \quad \forall f$ s.t. $\int f^2(x) \, dx < \infty$
is a Mercer kernel
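The Gram matrix condition can be probed numerically. The sketch below builds K for a Gaussian kernel (an assumed choice, as are the sample points and the bandwidth sigma) and evaluates the quadratic form a^T K a for a few vectors a. A handful of positive values does not prove positive definiteness; it only illustrates what the definition asks for.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Assumed example kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def gram_matrix(points, kernel):
    """Gram matrix [K]_ij = k(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]

def quadratic_form(K, a):
    """Evaluate a^T K a."""
    n = len(a)
    return sum(a[i] * K[i][j] * a[j] for i in range(n) for j in range(n))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.5, 1.5)]
K = gram_matrix(points, gaussian_kernel)
test_vectors = [(1.0, -1.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0), (0.5, -2.0, 1.0, -0.3)]
forms = [quadratic_form(K, a) for a in test_vectors]
print(forms)  # every entry should be positive
```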
In summary
different definitions produce different pictures 1-1 relationship:
• RKHS: directly dependent on the kernel and sample
• Mercer: orthonormal basis (eigenvalues) 1-1 mapping to Rd
Reproducing kernel map
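$H_K = \{ f(\cdot) \mid f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) \}$
$\langle f, g \rangle_* = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$
$\Phi_r: x \to k(\cdot, x)$
Mercer kernel map
$H_M = l_2^d = \{ x \mid \sum_{i=1}^{d} x_i^2 < \infty \}$
$\langle f, g \rangle_* = f^T g$
$\Phi_M: x \to (\sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x), \ldots)^T$
$\Gamma: l_2^d \to \mathrm{span}\{\phi_k(\cdot)\}, \quad e_k \to \sqrt{\lambda_k} \phi_k(\cdot)$
$\Phi_r = \Gamma \circ \Phi_M$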
The RKHS picture
we have
[figure: Φ_M maps the input space X (axes x_1, x_2) to H_M = l_2^d (axes e_1, ..., e_d); Γ maps H_M to H_K (axes φ_1, ..., φ_d), so that Γ∘Φ_M(x_i) = Φ_r(x_i) = k(., x_i); the two copies of H_K carry the dot products <f,g>_* and <f,g>_**]
$\langle f, g \rangle_{**} = \sum_{ij} \alpha_i \beta_j k(x_i, x_j)$
The Mercer picture
this is a mapping from R^n to R^d, a lot like a multi-layer Perceptron
the kernelized Perceptron as a neural net
[figure: the kernelized Perceptron drawn as a network: the input x feeds the units Φ_1(.), Φ_2(.), ..., Φ_d(.), so that Φ maps x_i in X (axes x_1, x_2) to Φ(x_i) in l_2^d (axes x_1, ..., x_d)]
Regularization
Q: RKHS: why do we care?
A: regularization
• need to penalize complexity
• minimize regularized risk
$R_{reg}[f] = R_{emp}[f] + \lambda \Omega[f]$
• this is a complicated problem since the minimization is over the set of all functions
as long as the empirical risk is a function on an RKHS this can be solved efficiently
the key is the representer theorem
Representer theorem
Theorem: Let
• Ω: [0,∞) → ℜ be a strictly monotonically increasing function,
• H the RKHS associated with a kernel k(x,y)
• L[y,f(x)] a loss function
then
$f^* = \arg\min_f \left[ \sum_{i=1}^{n} L[y_i, f(x_i)] + \lambda \Omega(\|f\|^2) \right]$
admits a representation of the form
$f^* = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i)$ (i.e. f* ∈ H)
Regularization
this makes the function minimization a minimization in R^n
Theorem:
• if Ω: [0,∞) → ℜ is a strictly monotonically increasing function,
• then for any dot-product kernel k(x,y) and loss function L[y,f(x)]
the solution of
$f^* = \arg\min_f \left[ \sum_{i=1}^{n} L[y_i, f(x_i)] + \lambda \Omega(\|f\|^2) \right]$
is
$f^* = \sum_{i=1}^{n} \alpha_i^* k(\cdot, x_i)$
with
$\alpha^* = \arg\min_\alpha \left[ \sum_{i=1}^{n} L[y_i, (K\alpha)_i] + \lambda \Omega(\alpha^T K \alpha) \right]$
K the Gram matrix [k(x_i,x_j)], Y = (..., y_i, ...) and (Kα)_i = f(x_i) the i-th entry of Kα
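To make the reduction to R^n concrete, here is a minimal sketch under assumptions that are not in the slides: squared loss L[y,f] = (y - f)^2, Ω(t) = t, a Gaussian kernel, and a made-up 1-D data set. With these choices the objective in α is a convex quadratic, and setting its gradient 2K[(K + λI)α - Y] to zero gives (K + λI)α = Y, which is solved below with a small hand-written Gaussian elimination.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Assumed kernel for 1-D inputs."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def objective(alpha, K, Y, lam):
    """Regularized risk in R^n: sum_i (y_i - (K alpha)_i)^2 + lam * alpha^T K alpha."""
    n = len(Y)
    f = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    loss = sum((Y[i] - f[i]) ** 2 for i in range(n))
    norm2 = sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))
    return loss + lam * norm2

# hypothetical training set and regularization weight
X = [0.0, 1.0, 2.0, 3.0]
Y = [0.0, 0.8, 0.9, 0.1]
lam = 0.1
n = len(X)
K = [[gaussian_kernel(xi, xj) for xj in X] for xi in X]
A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
alpha = solve(A, Y)

def f_star(x):
    """f*(x) = sum_i alpha_i k(x, x_i), the form given by the representer theorem."""
    return sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X))

print(alpha, f_star(1.5))
```

The search over functions has become a search over n = 4 numbers; other losses generally need an iterative solver instead of one linear system.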
Optimization
this leads us to the topic of optimization
goal: find maximum or minimum of a function
Definition: given functions f, g_i, i=1,...,k and h_i, i=1,...,m defined on some domain Ω ⊂ R^n
$\min_{w \in \Omega} f(w)$ subject to $g_i(w) \le 0, \forall i$ and $h_i(w) = 0, \forall i$
f(w): cost; h_i (equality), g_i (inequality): constraints
for compactness we write g(w) ≤ 0 instead of g_i(w) ≤ 0, ∀ i. Similarly h(w) = 0
note that f(w) ≥ 0 ⇔ –f(w) ≤ 0 (no need for ≥ 0)
Optimization
note: maximizing f(x) is the same as minimizing –f(x), so this definition also works for maximization
the feasible region is the region where f(.) is defined and all constraints hold
$ℜ = \{ w \in \Omega \mid g(w) \le 0, h(w) = 0 \}$
w* is a global minimum of f(w) if
$f(w) \ge f(w^*), \quad \forall w \in \Omega$
w* is a local minimum of f(w) if
$\exists \epsilon > 0$ s.t. $\|w - w^*\| < \epsilon \Rightarrow f(w) \ge f(w^*)$
[figure: a curve with a local and a global minimum]
The gradient
the gradient of a function f(w) at z is
$\nabla f(z) = \left( \frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z) \right)^T$
Theorem: the gradient points in the direction of maximum growth
proof:
• from Taylor series expansion
$f(w + \alpha d) = f(w) + \alpha d^T \nabla f(w) + O(\alpha^2)$
• derivative along d
$\lim_{\alpha \to 0} \frac{f(w + \alpha d) - f(w)}{\alpha} = d^T \nabla f(w) = \|d\| \cdot \|\nabla f(w)\| \cdot \cos(d, \nabla f(w))$ (*)
• is maximum when d is in the direction of the gradient
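The theorem can be checked numerically with a short sketch. The quadratic test function, the point w and the step alpha below are all hypothetical choices; the code estimates the derivative along unit directions on a one-degree grid by finite differences and compares the best direction with the direction of the gradient.

```python
import math

def f(w):
    """Hypothetical test function on R^2."""
    return w[0] ** 2 + 3.0 * w[0] * w[1] + 2.0 * w[1] ** 2

def grad_f(w):
    """Analytic gradient of the test function."""
    return (2.0 * w[0] + 3.0 * w[1], 3.0 * w[0] + 4.0 * w[1])

def directional_derivative(f, w, d, alpha=1e-6):
    """Finite-difference estimate of lim [f(w + alpha d) - f(w)] / alpha."""
    shifted = tuple(wi + alpha * di for wi, di in zip(w, d))
    return (f(shifted) - f(w)) / alpha

w = (1.0, -0.5)
g = grad_f(w)
g_norm = math.hypot(g[0], g[1])

# unit directions d on a one-degree grid of angles
angles = [2.0 * math.pi * t / 360 for t in range(360)]
rates = [directional_derivative(f, w, (math.cos(a), math.sin(a))) for a in angles]
best_angle = angles[max(range(360), key=lambda t: rates[t])]
grad_angle = math.atan2(g[1], g[0]) % (2.0 * math.pi)
print(best_angle, grad_angle, max(rates), g_norm)
```

Up to the grid resolution, the direction of largest growth coincides with the gradient direction and the largest rate equals the gradient norm, as (*) predicts for unit d.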
The gradient
note that if ∇f = 0
• there is no direction of growth
• also -∇f = 0, and there is no direction of decrease
• we are either at a local minimum or maximum or “saddle” point
conversely, at local min or max or saddle point
• no direction of growth or decrease
• ∇f = 0
this shows that we have a critical point if and only if ∇f = 0
to determine which type we need second order conditions
[figure: surfaces with a max, a min and a saddle point]
The Hessian
if ∇f = 0, by Taylor series
$f(w + \alpha d) = f(w) + \underbrace{\alpha d^T \nabla f(w)}_{0} + \frac{\alpha^2}{2} d^T \nabla^2 f(w) d + O(\alpha^3)$
and
$\frac{f(w + \alpha d) - f(w)}{\alpha^2} = \frac{1}{2} d^T \nabla^2 f(w) d + O(\alpha)$
pick α such that O(α) << |d^T ∇²f d|, ∀ d ≠ 0
• maximum at w if and only if d^T ∇²f d ≤ 0, ∀ d ≠ 0
• minimum at w if and only if d^T ∇²f d ≥ 0, ∀ d ≠ 0
• saddle otherwise
this proves the following theorems
[figure: max, min and saddle points]
Minima conditions (unconstrained)
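let f(w) be twice continuously differentiable
w* is a local minimum of f(w) if
• f has zero gradient at w*
$\nabla f(w^*) = 0$
• and the Hessian of f at w* is positive definite
$d^T \nabla^2 f(w^*) d > 0, \quad \forall d \in ℜ^n, d \ne 0$
(the weaker condition $d^T \nabla^2 f(w^*) d \ge 0, \forall d \in ℜ^n$, holds at any local minimum but is not sufficient by itself)
• where
$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \frac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$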
Maxima conditions (unconstrained)
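let f(w) be twice continuously differentiable
w* is a local maximum of f(w) if
• f has zero gradient at w*
$\nabla f(w^*) = 0$
• and the Hessian of f at w* is negative definite
$d^T \nabla^2 f(w^*) d < 0, \quad \forall d \in ℜ^n, d \ne 0$
(the weaker condition $d^T \nabla^2 f(w^*) d \le 0, \forall d \in ℜ^n$, holds at any local maximum but is not sufficient by itself)
• where
$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \frac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$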
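The second order test is easy to sketch for two variables. The function below is a hypothetical helper that classifies a critical point from a symmetric 2x2 Hessian [[a, b], [b, c]], using the standard fact that such a matrix is positive definite when a > 0 and ac - b² > 0, negative definite when a < 0 and ac - b² > 0, and indefinite when ac - b² < 0.

```python
def classify_critical_point(hessian):
    """Classify a critical point from a symmetric 2x2 Hessian [[a, b], [b, c]]."""
    (a, b), (_, c) = hessian
    det = a * c - b * b
    if det > 0 and a > 0:
        return "minimum"       # positive definite
    if det > 0 and a < 0:
        return "maximum"       # negative definite
    if det < 0:
        return "saddle"        # indefinite: growth along some d, decrease along others
    return "inconclusive"      # singular Hessian: the second order test is silent

# Hessians at the origin of a few simple functions
print(classify_critical_point([[2.0, 0.0], [0.0, 2.0]]))    # f = x^2 + y^2
print(classify_critical_point([[-2.0, 0.0], [0.0, -2.0]]))  # f = -x^2 - y^2
print(classify_critical_point([[2.0, 0.0], [0.0, -2.0]]))   # f = x^2 - y^2
print(classify_critical_point([[0.0, 0.0], [0.0, 0.0]]))    # f = x^4 + y^4
```

The last case is a genuine minimum that the Hessian cannot certify, which is why a semidefinite Hessian alone does not settle the type of a critical point.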
Convex functions
Definition: f(w) is convex if ∀ w,u ∈ Ω and λ ∈ [0,1]
$f(\lambda w + (1-\lambda) u) \le \lambda f(w) + (1-\lambda) f(u)$
Theorem: f(w) is convex if and only if its Hessian is positive semidefinite for all w
$d^T \nabla^2 f(w) d \ge 0, \quad \forall d, \forall w \in \Omega$
proof:
• requires some intermediate results that we will not cover
• we will skip it
[figure: between u and w the chord λf(w)+(1-λ)f(u) lies above the curve f(λw+(1-λ)u)]
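The definition can be tested directly on a grid, as in the sketch below. The helper name, the grid on [-3, 3] and the two test functions are assumptions made for illustration. Passing the test on a finite grid does not prove convexity; a single violation does disprove it.

```python
import math

def violates_convexity(f, points, lambdas, tol=1e-12):
    """Return True if f(lam*w + (1-lam)*u) > lam*f(w) + (1-lam)*f(u) somewhere on the grid."""
    for w in points:
        for u in points:
            for lam in lambdas:
                mid = lam * w + (1.0 - lam) * u
                if f(mid) > lam * f(w) + (1.0 - lam) * f(u) + tol:
                    return True
    return False

points = [i / 4.0 for i in range(-12, 13)]   # grid on [-3, 3]
lambdas = [j / 10.0 for j in range(11)]      # lambda in [0, 1]

square_ok = not violates_convexity(lambda w: w * w, points, lambdas)
sine_ok = not violates_convexity(math.sin, points, lambdas)
print(square_ok, sine_ok)
```

The square passes, while the sine fails because on [0, 3] its graph lies above its chords.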
Concave functions
Definition: f(w) is concave if ∀ w,u ∈ Ω and λ ∈ [0,1]
$f(\lambda w + (1-\lambda) u) \ge \lambda f(w) + (1-\lambda) f(u)$
Theorem: f(w) is concave if and only if its Hessian is negative semidefinite for all w
$d^T \nabla^2 f(w) d \le 0, \quad \forall d, \forall w \in \Omega$
proof:
• -f(w) is convex
• by previous theorem, Hessian of -f(w) is positive semidefinite
• Hessian of f(w) is negative semidefinite
Convex functions
Theorem: if f(w) is convex any local minimum w* is also a global minimum
Proof:
• we need to show that, for any u, f(w*) ≤ f(u)
• for any u: ||w* - [λw* + (1-λ)u]|| = (1-λ) ||w* - u||
• and, making λ arbitrarily close to 1, we can make ||w* - [λw* + (1-λ)u]|| ≤ ε, for any ε > 0
• since w* is local minimum, it follows that f(w*) ≤ f(λw* + (1-λ)u) and, by convexity, that f(w*) ≤ λf(w*) + (1-λ)f(u)
• or f(w*)(1-λ) ≤ f(u)(1-λ)
• and f(w*) ≤ f(u)
Constrained optimization
in summary:
• we know the conditions for unconstrained max and min
• we like convex functions (find a minimum, it will be the global minimum)
what about optimization with constraints?
a few definitions to start with
inequality g_i(w) ≤ 0:
• is active if g_i(w) = 0, otherwise inactive
inequalities can be expressed as equalities by introduction of slack variables
$g_i(w) \le 0 \Leftrightarrow g_i(w) + \xi_i = 0$ and $\xi_i \ge 0$
Convex optimization
Definition: a set Ω is convex if ∀ w,u ∈ Ω and λ ∈ [0,1]
then λ w+(1- λ )u ∈ Ω
“a line between any two points in Ω is also in Ω ”
Definition: an optimization problem where the set Ω, the cost f and all constraints g and h are convex is said to be convex
note: linear constraints g(x) = Ax+b are always convex (zero Hessian)
[figure: a convex set and a set that is not convex]
Constrained optimization
we will consider general (not only convex) constrained optimization problems, starting with the case with only equalities
Theorem: consider the problem
$x^* = \arg\min_x f(x)$ subject to $h(x) = 0$
where the constraint gradients ∇h_i(x*) are linearly independent. Then x* is a solution if and only if there exists a unique vector λ, such that
i) $\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla h_i(x^*) = 0$
ii) $y^T \left[ \nabla^2 f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla^2 h_i(x^*) \right] y \ge 0, \quad \forall y$ s.t. $\nabla h(x^*)^T y = 0$
Alternative formulation
state the conditions through the Lagrangian
$L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i h_i(x)$
the theorem can be compactly written as
i) $\nabla_x L(x^*, \lambda^*) = 0$
ii) $\nabla_\lambda L(x^*, \lambda^*) = 0$
iii) $y^T \nabla^2_{xx} L(x^*, \lambda^*) y \ge 0, \quad \forall y$ s.t. $\nabla h(x^*)^T y = 0$
the entries of λ are referred to as Lagrange multipliers
Gradient (revisited)
recall from (*) that derivative of f along d is
$\lim_{\alpha \to 0} \frac{f(w + \alpha d) - f(w)}{\alpha} = d^T \nabla f(w) = \|d\| \cdot \|\nabla f(w)\| \cdot \cos(d, \nabla f(w))$
this means that
• greatest increase when d || ∇f
• no increase when d ⊥ ∇f
• since there is no increase when d is tangent to iso-contour f(x) = k
• the gradient is perpendicular to the tangent of the iso-contour
this suggests a geometric proof
[figure: ∇f and a direction of no increase, tangent to the iso-contour]
Lagrangian optimization
[figure: tangent (tg) plane to the constraint surface at x* and span{∇h(x*)}]
geometric interpretation:
• since h(x)=0 is an iso-contour of h(x), ∇h(x*) is perpendicular to the iso-contour
• i) says that ∇f(x*) ∈ span{∇h_i(x*)}
• i.e. ∇f ⊥ to tangent space of the constraint surface
• intuitive
• direction of largest increase of f is ⊥ to constraint surface
• the gradient is zero along the constraint
• no way to give an infinitesimal gradient step, without ending up violating it
• it is impossible to increase f and
Example
consider the problem
$\min x_1 + x_2$ subject to $x_1^2 + x_2^2 = 2$
it leads to the following picture
$\nabla f(x) = (1, 1)^T, \quad \nabla h(x) = (2x_1, 2x_2)^T$
[figure: the circle h(x)=0 and the iso-contours of f(x): x1 + x2 = -1, x1 + x2 = 0, x1 + x2 = 1]
Example
consider the problem
$\min x_1 + x_2$ subject to $x_1^2 + x_2^2 = 2$
∇f ⊥ to the iso-contours of f (x_1 + x_2 = k)
$\nabla f(x) = (1, 1)^T, \quad \nabla h(x) = (2x_1, 2x_2)^T$
[figure: the circle h(x)=0 and the iso-contours x1 + x2 = -1, x1 + x2 = 0, x1 + x2 = 1, with ∇f drawn perpendicular to them]
Example
consider the problem
$\min x_1 + x_2$ subject to $x_1^2 + x_2^2 = 2$
∇h ⊥ to the iso-contour of h (x_1^2 + x_2^2 - 2 = 0)
$\nabla f(x) = (1, 1)^T, \quad \nabla h(x) = (2x_1, 2x_2)^T$
[figure: the circle h(x)=0 with ∇h drawn perpendicular to it, and the iso-contours x1 + x2 = -1, x1 + x2 = 0, x1 + x2 = 1]
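For reference, the example can also be worked by hand. The derivation below is a sketch that writes the constraint as h(x) = x_1^2 + x_2^2 - 2 and uses the Lagrangian L(x, λ) = f(x) + λ h(x) with a single multiplier.

```latex
\begin{align*}
L(x,\lambda) &= x_1 + x_2 + \lambda (x_1^2 + x_2^2 - 2) \\
\nabla_x L = 0 &\;\Rightarrow\; 1 + 2\lambda x_1 = 0,\quad 1 + 2\lambda x_2 = 0
  \;\Rightarrow\; x_1 = x_2 = -\frac{1}{2\lambda} \\
h(x) = 0 &\;\Rightarrow\; \frac{2}{4\lambda^2} = 2 \;\Rightarrow\; \lambda = \pm\frac{1}{2} \\
\lambda = \tfrac{1}{2} &: \; x = (-1,-1)^T,\; f(x) = -2 \qquad
\lambda = -\tfrac{1}{2} : \; x = (1,1)^T,\; f(x) = 2 \\
y^T \nabla^2_{xx} L \, y &= 2\lambda \|y\|^2 \ge 0 \;\Leftrightarrow\; \lambda \ge 0
\end{align*}
```

Both points satisfy the first order condition, with ∇f parallel to ∇h; the second order condition holds only for λ = 1/2, so the minimum is x* = (-1,-1)^T with f(x*) = -2, and (1,1)^T is the constrained maximum.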
Example
recall that derivative along d is
$\lim_{\alpha \to 0} \frac{f(w + \alpha d) - f(w)}{\alpha} = d^T \nabla f(w) = \|d\| \cdot \|\nabla f(w)\| \cdot \cos(d, \nabla f(w))$
- moving along the tangent is descent as long as
$\cos(tg, \nabla f) < 0$
- i.e. π/2 < angle(∇f, tg) < 3π/2
- can always find such d unless ∇f ⊥ tg
- critical point when ∇f || ∇h
- to find which type we need 2nd order (as before)
[figure: the circle h(x)=0 with the two critical points marked]
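The descent argument can be simulated on the same example, min x_1 + x_2 subject to x_1^2 + x_2^2 = 2. The sketch below is an illustration with hypothetical names, step size and starting angle: the circle is parametrized by an angle theta so that the constraint holds exactly at every step (rather than only to first order), and the point keeps moving against the component of ∇f along the tangent until ∇f is perpendicular to the tangent.

```python
import math

def f(x):
    """Cost of the example: f(x) = x1 + x2."""
    return x[0] + x[1]

def grad_f(x):
    """Gradient of the cost, constant for this example."""
    return (1.0, 1.0)

def point_on_circle(theta, radius=math.sqrt(2.0)):
    """Point of the constraint set x1^2 + x2^2 = 2 at angle theta."""
    return (radius * math.cos(theta), radius * math.sin(theta))

def tangent(theta):
    """Unit tangent to the circle at angle theta."""
    return (-math.sin(theta), math.cos(theta))

def descend_along_constraint(theta, step=0.1, tol=1e-10, max_iter=10000):
    """Move along the circle while the tangent still has a component along grad f."""
    for _ in range(max_iter):
        x = point_on_circle(theta)
        tg = tangent(theta)
        g = grad_f(x)
        slope = tg[0] * g[0] + tg[1] * g[1]   # tg^T grad f
        if abs(slope) < tol:                  # grad f is perpendicular to the tangent
            break
        theta -= step * slope                 # step in the direction that decreases f
    return point_on_circle(theta)

x_star = descend_along_constraint(theta=0.3)
print(x_star, f(x_star))
```

From this starting angle the iteration stops near (-1, -1), the critical point at which ∇f is parallel to ∇h and f is smallest on the circle.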
Alternative view
consider the tangent space to the iso-contour h(x) = 0
this is the subspace of first order feasible variations
$V(x^*) = \{ \Delta x \mid \nabla h_i(x^*)^T \Delta x = 0, \forall i \}$
space of ∆x for which x + ∆x satisfies the constraint up to first order approximation
[figure: x*, ∇h(x*) and the feasible variations V(x*), tangent to h(x)=0]
Feasible variations
multiplying our first Lagrangian condition by ∆x
$\nabla f(x^*)^T \Delta x + \sum_{i=1}^{m} \lambda_i \nabla h_i(x^*)^T \Delta x = 0$
it follows that
$\nabla f(x^*)^T \Delta x = 0, \quad \forall \Delta x \in V(x^*)$
this is a generalization of ∇f(x*)=0 in unconstrained case
implies that ∇f(x*) ⊥ V(x*) and therefore ∇f(x*) || ∇h(x*)
note:
• Hessian constraint only defined for y in V(x*)
• makes sense: we cannot move anywhere else, does not really matter what Hessian is outside V(x*)
In summary
for a constrained optimization problem, with equality constraints
Theorem: consider the problem
$x^* = \arg\min_x f(x)$ subject to $h(x) = 0$
where the constraint gradients ∇h_i(x*) are linearly independent. Then x* is a solution if and only if there exists a unique vector λ, such that
i) $\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla h_i(x^*) = 0$
ii) $y^T \left[ \nabla^2 f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla^2 h_i(x^*) \right] y \ge 0, \quad \forall y$ s.t. $\nabla h(x^*)^T y = 0$
Alternative formulation
state the conditions through the Lagrangian
$L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i h_i(x)$
the theorem can be compactly written as
i) $\nabla_x L(x^*, \lambda^*) = 0$
ii) $\nabla_\lambda L(x^*, \lambda^*) = 0$
iii) $y^T \nabla^2_{xx} L(x^*, \lambda^*) y \ge 0, \quad \forall y$ s.t. $\nabla h(x^*)^T y = 0$
the entries of λ are referred to as Lagrange multipliers
next time we will consider inequality constraints