a classification problem has two types of variables

(1)

Optimization

Nuno Vasconcelos

ECE Department, UCSD

(2)

Classification

a classification problem has two types of variables

• X - vector of observations (features) in the world

• Y - state (class) of the world

Perceptron: classifier implements the linear decision rule with

appropriate when the classes are linearly separable

to deal with non-linear separability we introduce a kernel

[ ] ^g(x)

sgn )

( x =

h g ( x ) = w

^T

x + b

_w

w b

x w

x g( )

(3)

Types of kernels

these three are equivalent

dot product kernel

positive definite kernel

Mercer kernel

⇔⇔

(4)

Dot-product kernels

Definition: a mapping

k: X x X → ℜ

(x,y) → ^k(x,y) is a dot-product kernel if and only if

k(x,y) = < Φ ^(x), Φ ^(y)>

where Φ ^: X → H,

H is a vector space,

<.,.> is a dot-product in H

xx x

x x

x x x x x

x x

oo o oo

o o

x2 x

x x

xx x x x

x x

oo o o

o o oo

o o

o x1 o

Φ

X H

(5)

positive definite and Mercer kernels

Definition: k(x,y) is a positive definite kernel on X xX if

∀ l and ∀ {x

₁

, ..., x

_l

}, x

_i

∈ _{X, the} Gram matrix

is positive definite.

Definition: a symmetric mapping k: X x X → R such that

is a Mercer kernel

[ ] ^K

_ij

⁼ ^k ⁽ ^x

_i

^, ^x

_j

⁾

∞

<

∀

≥

∫

∫∫

dx x

f x

f

dxdy y

f x f y x k

)

2

( )

(

, 0 )

( ) ( ) , (

s.t.

(6)

In summary

different definitions produce different pictures 1-1 relationship:

• RKHS: directly dependent on the kernel and sample

• Mercer: orthonormal basis (eigenvalues) 1-1 mapping to R^d

Reproducing kernel map

H_K =

( )

⎭⎬

⎫

⎩⎨

⎧ =

∑

= m

i ik xi

f f

1

., (.)

|

(.) α

( )

∑∑

= =

= ^m

i i j

m

j i jk x x

g f

1 '

1

* , '

, α β

) (., : x k x

r

→

Φ

Mercer kernel map H_M=

⎭⎬

⎫

⎩⎨

⎧ < ∞

=

∑

= d

i i

d x x

l

1 2

2 |

g f g

f = ^T

, *

( )

^T

M

: x → λ

₁

φ

₁

( x ), λ

₂

φ

₂

( x ), _L Φ

{ }

(.)

: ₂^d _k

e

span l

φ λ

φ

→

Γ r

Φ

=

Γ o Φ

(7)

The RKHS picture

we have

x x x

x

x x x x x

x x

o o o

o

o o o

o

o o

x 1 x 2

Φ

_Μ

X x

x x

x

x x x x x

x x

o o

o

o o

o

o o

o e1 o

e3 e2 ed

H_M=l₂^d

Φ

(x_i)

x_i

x x x

x

x x x x x

x x

o o

o

o o

o

o o

o φ1 o

φ φ2 φd

H_K

Γ

Τ

^ο

Φ

(x_i)=k(.,x_i)

*

∑

= ^k ^k

g

f λ

β , α

) , (

, _*_* _j

ij i jk xi x

g

f ⁼

∑

^α ^β

(8)

The Mercer picture

this is a mapping from R

ⁿ

to R

^d

a lot like a

multi-layer Perceptron

the kernelized Perceptron as a neural net

x x x

x

x x x x x

x x

o o

o

o o

o

o o

o x 1 o

x 3 x2 x d

x x x

x

x x x x x

x x

o o o

o

o o o

o

o o

x 1 x 2

Φ

X

Φ

(x_i) l₂^d

x_i

x

Φ

₁(.)

Φ

₂(.)

Φ

_d(.)

(9)

Regularization

Q: RKHS: why do we care?

A: regularization

• need to penalize complexity

• minimize regularized risk

• this is a complicated problem

since the minimization is over the set of all functions

as long as the empirical risk is a function on a RKHS this can be solved efficiently

the key is the representer theorem

[ ] ^f ^R [ ] ^f [ ] ^f

R

_reg

=

_emp

+ λ Ω

(10)

Representer theorem

Theorem: Let

•

Ω

:[0,

∞

)

→ ℜ

be a strictly monotonically increasing function,

• H the RKHS associated with a kernel k(x,y)

• L[y,f(x)] a loss function

then

f admits a representation of the form*

[ ( ) ] ( ) _⎥

⎦

⎢ ⎤

⎣

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

∑ ( )

=

ⁿ

i i

k x

i

f

1

*

α ., (i.e. f* ∈ _{H )}

(11)

Regularization

this makes the function minimization a minimization in R

ⁿ

Theorem:

• if

Ω

:[0,

∞

)

→ ℜ

is a strictly monotonically increasing function,

• then for any dot-product kernel k(x,y) and loss function L[y,f(x)]

the solution of

is with

K the Gram matrix [k(x

_i

,x

_j

)] and Y =(...,y

_i

,..

_.

)

[ ( ) ] ( ) _⎥

⎦

⎢ ⎤

⎣

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

[ ] ( ) _⎥

⎦

⎢ ⎤

⎣

⎡ + Ω

= ∑

= n i

T

K K

Y L

1

*

arg min , α λ α α

α

∑ ( )

=

ⁿ

i i

k x

i

f

1

*

α .,

(12)

Optimization

this leads us to the topic of optimization

goal: find maximum or minimum of a function

Definition: given functions f, g

_i

, i=1,...,k and h

_i

, i=1,...m defined on some domain Ω ∈ R

ⁿ

f(w): cost; h

_i

(equality), g

_i

(inequality): constraints

for compactness we write g(w) ≤ 0 instead of g

_i

(w) ≤ ^0, ∀ ^i.

Similarly h(w) = 0

note that f(w) ≥ 0 ⇔ –f(w) ≤ 0 (no need for ≥ 0) i

w h

i w

g

w w

f

i i

∀

=

∀

≤

Ω

∈

, 0 )

(

, 0 )

( subject to

),

(

min

(13)

Optimization

note: maximizing f(x) is the same as minimizing –f(x), this definition also works for maximization

the feasible region is the region where f(.) is defined and all constraints hold

w is a global minimum* of f(w) if

w is a local minimum of f(w)* if

{ ^∈ ^Ω ^| ⁽ ⁾ ^≤ ⁰ ^, ⁽ ⁾ ⁼ ⁰ }

=

ℜ w g w h x

local

global

Ω

∈

∀

≥ f w w

w

f ( ) ( *),

*) (

) (

* s.t.

0 w f w

f w w

≥

⇒

<

−

>

∃ ε ε

(14)

The gradient

the gradient of a function f(w) at z is

Theorem: the gradient points in the direction of maximum growth proof:

• from Taylor series expansion

• derivative along d

• is maximum when d is in the direction of the gradient

T

n

w z z f

w z f

f ⎟⎟

⎠

⎜⎜ ⎞

⎝

⎛

∂

= ∂

∇

−

) ( ,

), ( )

(

1 0

L

) ( )

( )

(

w

α

d f w

α

d f w O

α

²

f

+ = +

^T

∇ +

)) ( , cos(

. ) ( .

) ) (

( )

lim (

0 f w

+

d

−

f w

=

d_T

∇

f w

=

d

∇

f w d

∇

f w

→

α

∇f

(*)

(15)

The gradient

max

min

saddle

note that if ∇ ^{f = 0}

• there is no direction of growth

• also -

∇

f = 0, and there is no direction of decrease

• we are either at a local minimum or maximum or “saddle” point

conversely, at local min or max or saddle point

• no direction of growth or decrease

•

∇

f = 0

this shows that we have a critical point if and only if ∇ ^{f = 0}

to determine which type we need second

order conditions

(16)

The Hessian

^max

min

saddle

if ∇ f = 0, by Taylor series

and

pick α such that O( α ) << |d

^T

∇

²

f d|, ∀ d ≠ 0

• maximum at w if and only if d^T

∇

²f d

≤

0,

∀

d

≠

0

• minimum at w if and only if d^T

∇

²f d ≥ 0,

∀

d

≠

0

• saddle otherwise

this proves the following theorems

) ( )

2 ( )

(

) ( )

(

3 2

2

0

α α α

α

O d

w f d

w f

T

∇ + ∇ +

+

= +

3 2 1

) ( )

2 ( 1 )

( )

(

₂

2

α

d f w d f w d O w

f

+ − =

_T

∇ +

(17)

Minima conditions (unconstrained)

let f(w) be continuously differentiable

w is a local minimum of f(w) if and only if*

• f has zero gradient at w*

• and the Hessian of f at w* is positive definite

• where

0 *)

( =

∇ w f

⎥ ⎥

⎥ ⎤

⎢ ⎢

⎢ ⎡

∂

=

∇

⁻

) ( )

(

) ( )

( )

(

2 2

1 0

2 2

0 2

2

f x f x

x x x

x f x

f x

f

ⁿ

L M L

n

t

f w d d

d ∇

²

( *) ≥ 0 , ∀ ∈ ℜ

(18)

Maxima conditions (unconstrained)

let f(w) be continuously differentiable

w is a local maximum of f(w) if and only if*

• f has zero gradient at w*

• and the Hessian of f at w* is negative definite

• where

0 *)

( =

∇ w f

⎥ ⎥

⎥ ⎤

⎢ ⎢

⎢ ⎡

∂

=

∇

⁻

) ( )

(

) ( )

( )

(

2 2

1 0

2 2

0 2

2

f x f x

x x x

x f x

f x

f

ⁿ

L M L

n

t

f w d d

d ∇

²

( *) ≤ 0 , ∀ ∈ ℜ

(19)

Convex functions

Definition: f(w) is convex if ∀ ^w,u ∈ Ω ^and λ ∈ ^[0,1]

Theorem: f(w) is convex if and only if its Hessian is positive definite for all w

proof:

• requires some

intermediate results that we will not cover

• we will skip it

) ( ) 1

( ) ( )

) 1

(

( w u f w f u

f λ + − λ ≤ λ + − λ

f(w)

f(u) λf(w)+(1-λ)f(v)

u w

λw+(1-λ)v

f(λw+(1-λ)v)

Ω

∈

∀

≥

∇ f w w w

w

^t ²

( ) 0 ,

(20)

Concave functions

Definition: f(w) is concave if ∀ ^w,u ∈ Ω ^and λ ∈ ^[0,1]

Theorem: f(w) is concave if and only if its Hessian is negative definite for all w

proof:

• -f(w) is convex

• by previous theorem, Hessian is negative definite

• Hessian of f(w) is positive definite

) ( ) 1

( ) ( )

) 1

(

( w u f w f u

f λ + − λ ≥ λ + − λ

Ω

∈

∀

≤

∇ f w w w

w

^t ²

( ) 0 ,

(21)

Convex functions

**Theorem: if f(w) is convex any local minimum w* is also** a global minimum

Proof:

• we need to show that, for any u, f(w*)

≤

f(u)

• for any u: ||w*-[

λ

w*+(1-

λ

)u|| = (1-

λ

) ||w*-u||

• and, making

λ

arbitrarily close to 1, we can make

||w*-[

λ

w*+(1-

λ

)u||

≤ ε

, for any

ε

> 0

• since w* is local minimum, it follows that f(w*)

≤

f(

λ

w*+(1-

λ

)u) and, by convexity, that f(w*)

≤ λ

f(w*)+(1-

λ

)f(u)

• or f(w*)(1-

λ

)

≤

f(u)(1-

λ

)

• and f(w*)

≤

f(u)

(22)

Constrained optimization

in summary:

• we know what are conditions for unconstrained max and min

• we like convex functions (find a minima, it will be global minimum)

what about optimization with constraints?

a few definitions to start with inequality g

_i

(w) ≤ ^0:

• is active if g_i(w)

=

0, otherwise inactive

inequalities can be expressed as equalities by introduction of slack variables

0 and

, 0 )

(

0 )

( ≤ ⇔

_i

+

_i

=

_i

≥

i

w g w

g ξ ξ

(23)

Convex optimization

Definition: a set Ω is convex if ∀ ^w,u ∈ Ω ^and λ ∈ ^[0,1]

then λ ^w+(1- λ ^)u ∈ Ω

“a line between any two points in Ω is also in Ω ^”

Definition: an optimization problem where the set Ω, the cost f and all constraints g and h are convex is said to be convex

note: linear constraints g(x) = Ax+b are always convex (zero Hessian)

convex not convex

(24)

Constrained optimization

we will consider general (not only convex) constrained optimization problems, start by case with only equalities Theorem: consider the problem

where the constraint gradients h

_i

(x) are linearly*

independent. Then x is a solution if and only if there* exits a unique vector λ , such that

0 )

( subject to

) ( min arg

* = f x h x =

x

0 *) ( s.t.

, 0

*) (

*) ( )

0 *) (

*) (

)

2 1

2

1

=

∇

∀

⎥ ≥

⎦

⎢ ⎤

⎣

⎡ ∇ + ∇

=

∇ +

∇

∑

=

y x

h y

y x

h x

f y

ii

x h x

f i

T i

m

i

i T

i m

i

λ

(25)

Alternative formulation

state the conditions through the Lagrangian

the theorem can be compactly written as

the entries of λ are referred to as Lagrange multipliers )

( )

, (

1

x h x

f x

L

_i

m

i

∑

i

=

+

= λ

λ

0 *) ( s.t.

, 0 )

*, ( )

0 )

*, (

0 )

*, ( )

* 2

*

=

∇

∀

≥

∇

=

∇

=

∇

y x

h y

y x

L y

ii

x L ii)

x L i

T xx

T x

λ λ λ

λ

(26)

Gradient (revisited)

recall from (*) that derivative of f along d is

this means that

• greatest increase when d || f

• no increase when d ⊥ f

• since there is no increase when d is tangent to iso-contour f(x) = k

• the gradient is perpendicular to the tangent of the iso-contour

this suggests a geometric proof

)) ( , cos(

. ) ( .

) ) (

( )

lim (

0 f w

+

d

−

f w

=

d_T

∇

f w

=

d

∇

f w d

∇

f w

→

α

∇f

no increase

(27)

Lagrangian optimization

tg plane

span{∇h(x*)}

geometric interpretation:

• since h(x)=0 is a iso-contour of h(x),

∇

h(x*) is perpendicular to the iso-contour

• i) says that

∇

f (x*)

∈

span{

∇

h_i(x*)}

• i.e.

∇

f ⊥ to tangent space of the constraint surface

• intuitive

• direction of largest increase of f is ⊥ to constraint surface

• the gradient is zero along the constraint

• no way to give an infinitesimal gradient step, without ending up violating it

• it is impossible to increase f and

(28)

Example

consider the problem

min x

₁

+ x

₂

subject to x

₁²

+ x

₂²

= 2 it leads to the following picture

h(x)=0

iso-contours of f(x)

2

x₁ + x₂= -1

⎥⎦

⎢ ⎤

⎣

= ⎡

∇

2 1

2 ) 2

( x

x x h

⎥⎦

⎢ ⎤

⎣

= ⎡

∇ 1

) 1 (x f

x₁ + x₂= 1 x₁ + x₂= 0

(29)

Example

consider the problem

min x

₁

+ x

₂

subject to x

₁²

+ x

₂²

= 2

∇ ^f ⊥ to the iso-contours of f (x

₁

+ x

₂

= k)

h(x)=0

2

⎥⎦

⎢ ⎤

⎣

= ⎡

∇

2 1

2 ) 2

( x

x x h

⎥⎦

⎢ ⎤

⎣

= ⎡

∇ 1

) 1 (x f

x₁ + x₂= 1 x₁ + x₂= 0

x₁ + x₂= -1

(30)

Example

consider the problem

min x

₁

+ x

₂

subject to x

₁²

+ x

₂²

= 2

∇ ^h ⊥ to the iso-contour of h (x

₁²

+ x

₂²

- 2 = 0)

h(x)=0

2

⎥⎦

⎢ ⎤

⎣

= ⎡

∇

2 1

2 ) 2

( x

x x h

⎥⎦

⎢ ⎤

⎣

= ⎡

∇ 1

) 1 (x f

x₁ + x₂= 1 x₁ + x₂= 0

x₁ + x₂= -1

(31)

Example

recall that derivative along d is

- moving along the tangent is descent as long as

- i.e. π /2 < angle( ∇ f,tg) < 3 π ^/2 - can always find such d unless

∇ f ⊥ tg

- critical point when ∇ f || ∇ h

- to find which type we need 2

^nd

order (as before)

2

)) ( ,

cos(

. ) ( ) .

( )

lim (

0

f w + d − f w ₌ d _∇ f w d _∇ f w

→

α

0 )

,

cos( tg ∇ f <

critical point

(32)

Alternative view

consider the tangent space to the iso-contour h(x) = 0 this is the subspace of first order feasible variations

space of ∆ x for which x + ∆ x satisfies the constraint up to first order approximation

{ ^x ^h ^x ^x ⁱ }

x

V ( *) = ∆ | ∇

_i

( *)

^T

∆ = 0 , ∀

x* _∇

h(x*)

V(x*) feasible variations h(x)=0

(33)

Feasible variations

multiplying our first Lagrangian condition by ∆x

it follows that

this is a generalization of ∇ f(x)=0 in unconstrained case* implies that ∇ ^f(x)* ⊥ V(x)* and therefore ∇ ^{f(x) ||}* ∇ ^h(x)* note:

• Hessian constraint only defined for y in V(x*)

• makes sense: we cannot move anywhere else, does not really matter what Hessian is outside V(x*)

0 *) (

*) (

1

=

∆

∇ +

∆

∇ ∑

=

x x

h x

x

f

^m _i ^T

i i

T

λ

*) (

, 0

*)

( x x x V x

f

^T

∆ = ∀ ∆ ∈

∇

(34)

In summary

for a constrained optimization problem, with equality constraints

Theorem: consider the problem

where the constraint gradients h

_i

(x) are linearly*

independent. Then x is a solution if and only if there* exits a unique vector λ , such that

0 )

( subject to

) ( min arg

* = f x h x =

x

0 *) ( s.t.

, 0

*) (

*) ( )

0 *) (

*) (

)

2 1

2

1

=

∇

∀

⎥ ≥

⎦

⎢ ⎤

⎣

⎡ ∇ + ∇

=

∇ +

∇

∑

=

y x

h y

y x

h x

f y

ii

x h x

f i

T i

m

i

i T

i m

i

λ

(35)

Alternative formulation

state the conditions through the Lagrangian

the theorem can be compactly written as

the entries of λ are referred to as Lagrange multipliers next time we will consider inequality constraints

a classification problem has two types of variables

Optimization

Nuno Vasconcelos

ECE Department, UCSD

Classification