• No results found

a classification problem has two types of variables

N/A
N/A
Protected

Academic year: 2021

Share "a classification problem has two types of variables"

Copied!
36
0
0

Loading.... (view fulltext now)

Full text

(1)

Optimization

Nuno Vasconcelos

ECE Department, UCSD

(2)

Classification

a classification problem has two types of variables

X - vector of observations (features) in the world

Y - state (class) of the world

Perceptron: classifier implements the linear decision rule with

appropriate when the classes are linearly separable

to deal with non-linear separability we introduce a kernel

[ ] g(x)

sgn )

( x =

h g ( x ) = w

T

x + b

w

w b

x w

x g( )

(3)

Types of kernels

these three are equivalent

dot product kernel

positive definite kernel

Mercer kernel

⇔⇔

(4)

Dot-product kernels

Definition: a mapping

k: X x X → ℜ

(x,y) k(x,y) is a dot-product kernel if and only if

k(x,y) = < Φ (x), Φ (y)>

where Φ : XH,

H is a vector space,

<.,.> is a dot-product in H

xx x

x x

x x x x x

x x

oo o oo

o o

o o

x2 x

x x

x x

xx x x x

x x

oo o o

o o oo

o o

o x1 o

Φ

X H

(5)

positive definite and Mercer kernels

Definition: k(x,y) is a positive definite kernel on X xX if

l and {x

1

, ..., x

l

}, x

i

X, the Gram matrix

is positive definite.

Definition: a symmetric mapping k: X x XR such that

is a Mercer kernel

[ ] K

ij

= k ( x

i

, x

j

)

<

∫∫

dx x

f x

f

dxdy y

f x f y x k

)

2

( )

(

, 0 )

( ) ( ) , (

s.t.

(6)

In summary

different definitions produce different pictures 1-1 relationship:

• RKHS: directly dependent on the kernel and sample

• Mercer: orthonormal basis (eigenvalues) 1-1 mapping to Rd

Reproducing kernel map

HK =

( )

⎭⎬

⎩⎨

⎧ =

= m

i ik xi

f f

1

., (.)

|

(.) α

( )

∑∑

= =

= m

i i j

m

j i jk x x

g f

1 '

1

* , '

, α β

) (., : x k x

r

Φ

Mercer kernel map HM=

⎭⎬

⎩⎨

⎧ < ∞

=

= d

i i

d x x

l

1 2

2 |

g f g

f = T

, *

( )

T

M

: x → λ

1

φ

1

( x ), λ

2

φ

2

( x ), L Φ

{ }

(.)

(.)

: 2d k

e

span l

φ λ

φ

Γ r

Φ

=

Γ o Φ

(7)

The RKHS picture

we have

x x x

x

x

x x x x x

x x

o o o

o

o o o

o

o o

o o

x 1 x 2

Φ

Μ

X x

x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o e1 o

e3 e2 ed

HM=l2d

Φ

(xi)

xi

x x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o φ1 o

φ φ2 φd

HK

Γ

Τ

ο

Φ

(xi)=k(.,xi)

*

= k k

g

f λ

β , α

) , (

, ** j

ij i jk xi x

g

f =

α β

(8)

The Mercer picture

this is a mapping from R

n

to R

d

a lot like a

multi-layer Perceptron

the kernelized Perceptron as a neural net

x x x

x

x

x x x x x

x x

o o

o o

o

o o

o

o o

o x 1 o

x 3 x2 x d

x x x

x

x

x x x x x

x x

o o o

o

o o o

o

o o

o o

x 1 x 2

Φ

X

Φ

(xi) l2d

xi

x

Φ

1(.)

Φ

2(.)

Φ

d(.)

(9)

Regularization

Q: RKHS: why do we care?

A: regularization

• need to penalize complexity

• minimize regularized risk

• this is a complicated problem

since the minimization is over the set of all functions

as long as the empirical risk is a function on a RKHS this can be solved efficiently

the key is the representer theorem

[ ] f R [ ] f [ ] f

R

reg

=

emp

+ λ Ω

(10)

Representer theorem

Theorem: Let

:[0,

)

→ ℜ

be a strictly monotonically increasing function,

H the RKHS associated with a kernel k(x,y)

• L[y,f(x)] a loss function

then

f* admits a representation of the form

[ ( ) ] ( )

⎢ ⎤

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

∑ ( )

=

=

n

i i

k x

i

f

1

*

α ., (i.e. f* H )

(11)

Regularization

this makes the function minimization a minimization in R

n

Theorem:

• if

:[0,

)

→ ℜ

is a strictly monotonically increasing function,

then for any dot-product kernel k(x,y) and loss function L[y,f(x)]

the solution of

is with

K the Gram matrix [k(x

i

,x

j

)] and Y =(...,y

i

,..

.

)

[ ( ) ] ( )

⎢ ⎤

⎡ + Ω

= ∑

= n

i i i

f

L y f x f

f

1

* 2

, min

arg λ

[ ] ( )

⎢ ⎤

⎡ + Ω

= ∑

= n i

T

K K

Y L

1

*

arg min , α λ α α

α

α

∑ ( )

=

=

n

i i

k x

i

f

1

*

*

α .,

(12)

Optimization

this leads us to the topic of optimization

goal: find maximum or minimum of a function

Definition: given functions f, g

i

, i=1,...,k and h

i

, i=1,...m defined on some domain Ω ∈ R

n

f(w): cost; h

i

(equality), g

i

(inequality): constraints

for compactness we write g(w) 0 instead of g

i

(w) 0, i.

Similarly h(w) = 0

note that f(w) ≥ 0 –f(w) ≤ 0 (no need for ≥ 0) i

w h

i w

g

w w

f

i i

=

, 0 )

(

, 0 )

( subject to

),

(

min

(13)

Optimization

note: maximizing f(x) is the same as minimizing –f(x), this definition also works for maximization

the feasible region is the region where f(.) is defined and all constraints hold

w* is a global minimum of f(w) if

w* is a local minimum of f(w) if

{ | ( ) 0 , ( ) = 0 }

=

w g w h x

local

global

f w w

w

f ( ) ( *),

*) (

) (

* s.t.

0

w f w

f w w

<

>

∃ ε ε

(14)

The gradient

the gradient of a function f(w) at z is

Theorem: the gradient points in the direction of maximum growth proof:

• from Taylor series expansion

derivative along d

is maximum when d is in the direction of the gradient

T

n

w z z f

w z f

f ⎟⎟

⎜⎜ ⎞

= ∂

) ( ,

), ( )

(

1 0

L

) ( )

( )

( )

(

w

α

d f w

α

d f w O

α

2

f

+ = +

T

∇ +

)) ( , cos(

. ) ( .

) ) (

( )

lim (

0 f w

+

d

f w

=

dT

f w

=

d

f w d

f w

α

α

α

f

(*)

(15)

The gradient

max

min

saddle

note that if ∇ f = 0

• there is no direction of growth

• also -

f = 0, and there is no direction of decrease

• we are either at a local minimum or maximum or “saddle” point

conversely, at local min or max or saddle point

• no direction of growth or decrease

f = 0

this shows that we have a critical point if and only if ∇ f = 0

to determine which type we need second

order conditions

(16)

The Hessian

max

min

saddle

if ∇ f = 0, by Taylor series

and

pick α such that O( α ) << |d

T

2

f d|, d0

• maximum at w if and only if dT

2f d

0,

d

0

• minimum at w if and only if dT

2f d ≥ 0,

d

0

• saddle otherwise

this proves the following theorems

) ( )

2 ( )

(

) ( )

(

3 2

2

0

α α α

α

O d

w f d

w f d

w f d

w f

T

T

∇ + ∇ +

+

= +

3 2 1

) ( )

2 ( 1 )

( )

(

2

2

α

α

α

d f w d f w d O w

f

+ − =

T

∇ +

(17)

Minima conditions (unconstrained)

let f(w) be continuously differentiable

w* is a local minimum of f(w) if and only if

• f has zero gradient at w*

• and the Hessian of f at w* is positive definite

• where

0

*)

( =

∇ w f

⎥ ⎥

⎥ ⎥

⎥ ⎤

⎢ ⎢

⎢ ⎢

⎢ ⎡

=

) ( )

(

) ( )

( )

(

2 2

1 0

2 2

0 2

2

f x f x

x x x

x f x

f x

f

n

L M L

n

t

f w d d

d

2

( *) ≥ 0 , ∀ ∈ ℜ

(18)

Maxima conditions (unconstrained)

let f(w) be continuously differentiable

w* is a local maximum of f(w) if and only if

• f has zero gradient at w*

• and the Hessian of f at w* is negative definite

• where

0

*)

( =

∇ w f

⎥ ⎥

⎥ ⎥

⎥ ⎤

⎢ ⎢

⎢ ⎢

⎢ ⎡

=

) ( )

(

) ( )

( )

(

2 2

1 0

2 2

0 2

2

f x f x

x x x

x f x

f x

f

n

L M L

n

t

f w d d

d

2

( *) ≤ 0 , ∀ ∈ ℜ

(19)

Convex functions

Definition: f(w) is convex ifw,u ∈ Ω and λ ∈ [0,1]

Theorem: f(w) is convex if and only if its Hessian is positive definite for all w

proof:

• requires some

intermediate results that we will not cover

• we will skip it

) ( ) 1

( ) ( )

) 1

(

( w u f w f u

f λ + − λ ≤ λ + − λ

f(w)

f(u) λf(w)+(1-λ)f(v)

u w

λw+(1-λ)v

fw+(1-λ)v)

∇ f w w w

w

t 2

( ) 0 ,

(20)

Concave functions

Definition: f(w) is concave ifw,u ∈ Ω and λ ∈ [0,1]

Theorem: f(w) is concave if and only if its Hessian is negative definite for all w

proof:

-f(w) is convex

• by previous theorem, Hessian is negative definite

Hessian of f(w) is positive definite

) ( ) 1

( ) ( )

) 1

(

( w u f w f u

f λ + − λ ≥ λ + − λ

∇ f w w w

w

t 2

( ) 0 ,

(21)

Convex functions

Theorem: if f(w) is convex any local minimum w* is also a global minimum

Proof:

• we need to show that, for any u, f(w*)

f(u)

for any u: ||w*-[

λ

w*+(1-

λ

)u|| = (1-

λ

) ||w*-u||

• and, making

λ

arbitrarily close to 1, we can make

||w*-[

λ

w*+(1-

λ

)u||

≤ ε

, for any

ε

> 0

since w* is local minimum, it follows that f(w*)

f(

λ

w*+(1-

λ

)u) and, by convexity, that f(w*)

≤ λ

f(w*)+(1-

λ

)f(u)

or f(w*)(1-

λ

)

f(u)(1-

λ

)

and f(w*)

f(u)

(22)

Constrained optimization

in summary:

• we know what are conditions for unconstrained max and min

• we like convex functions (find a minima, it will be global minimum)

what about optimization with constraints?

a few definitions to start with inequality g

i

(w) 0:

• is active if gi(w)

=

0, otherwise inactive

inequalities can be expressed as equalities by introduction of slack variables

0

and

, 0 )

(

0 )

( ≤ ⇔

i

+

i

=

i

i

w g w

g ξ ξ

(23)

Convex optimization

Definition: a set Ω is convex if w,u ∈ Ω and λ ∈ [0,1]

then λ w+(1- λ )u ∈ Ω

“a line between any two points in Ω is also in Ω

Definition: an optimization problem where the set Ω, the cost f and all constraints g and h are convex is said to be convex

note: linear constraints g(x) = Ax+b are always convex (zero Hessian)

convex not convex

(24)

Constrained optimization

we will consider general (not only convex) constrained optimization problems, start by case with only equalities Theorem: consider the problem

where the constraint gradients h

i

(x*) are linearly

independent. Then x* is a solution if and only if there exits a unique vector λ , such that

0 )

( subject to

) ( min arg

* = f x h x =

x

x

0

*) ( s.t.

, 0

*) (

*) ( )

0

*) (

*) (

)

2 1

2

1

=

⎥ ≥

⎢ ⎤

⎡ ∇ + ∇

=

∇ +

=

=

y x

h y

y x

h x

f y

ii

x h x

f i

T i

m

i

i T

i m

i

i

λ

λ

(25)

Alternative formulation

state the conditions through the Lagrangian

the theorem can be compactly written as

the entries of λ are referred to as Lagrange multipliers )

( )

( )

, (

1

x h x

f x

L

i

m

i

i

=

+

= λ

λ

0

*) ( s.t.

, 0 )

*, ( )

0 )

*, (

0 )

*, ( )

* 2

*

*

=

=

=

y x

h y

y x

L y

ii

x L ii)

x L i

T xx

T x

λ λ λ

λ

(26)

Gradient (revisited)

recall from (*) that derivative of f along d is

this means that

greatest increase when d || f

no increase when d ⊥ f

• since there is no increase when d is tangent to iso-contour f(x) = k

• the gradient is perpendicular to the tangent of the iso-contour

this suggests a geometric proof

)) ( , cos(

. ) ( .

) ) (

( )

lim (

0 f w

+

d

f w

=

dT

f w

=

d

f w d

f w

α

α

α

f

no increase

(27)

Lagrangian optimization

tg plane

span{h(x*)}

geometric interpretation:

since h(x)=0 is a iso-contour of h(x),

h(x*) is perpendicular to the iso-contour

• i) says that

f (x*)

span{

hi(x*)}

• i.e.

f ⊥ to tangent space of the constraint surface

• intuitive

• direction of largest increase of f is ⊥ to constraint surface

• the gradient is zero along the constraint

• no way to give an infinitesimal gradient step, without ending up violating it

• it is impossible to increase f and

(28)

Example

consider the problem

min x

1

+ x

2

subject to x

12

+ x

22

= 2 it leads to the following picture

h(x)=0

iso-contours of f(x)

2

2

x1 + x2 = -1

⎥⎦

⎢ ⎤

= ⎡

2 1

2 ) 2

( x

x x h

⎥⎦

⎢ ⎤

= ⎡

∇ 1

) 1 (x f

x1 + x2 = 1 x1 + x2 = 0

(29)

Example

consider the problem

min x

1

+ x

2

subject to x

12

+ x

22

= 2

f ⊥ to the iso-contours of f (x

1

+ x

2

= k)

h(x)=0

2

2

⎥⎦

⎢ ⎤

= ⎡

2 1

2 ) 2

( x

x x h

⎥⎦

⎢ ⎤

= ⎡

∇ 1

) 1 (x f

x1 + x2 = 1 x1 + x2 = 0

x1 + x2 = -1

(30)

Example

consider the problem

min x

1

+ x

2

subject to x

12

+ x

22

= 2

h ⊥ to the iso-contour of h (x

12

+ x

22

- 2 = 0)

h(x)=0

2

2

⎥⎦

⎢ ⎤

= ⎡

2 1

2 ) 2

( x

x x h

⎥⎦

⎢ ⎤

= ⎡

∇ 1

) 1 (x f

x1 + x2 = 1 x1 + x2 = 0

x1 + x2 = -1

(31)

Example

recall that derivative along d is

- moving along the tangent is descent as long as

- i.e. π /2 < angle(f,tg) < 3 π /2 - can always find such d unless

ftg

- critical point when ∇ f || ∇ h

- to find which type we need 2

nd

order (as before)

2

2

)) ( ,

cos(

. ) ( ) .

( )

lim (

0

f w + d − f w = d f w d f w

α

α

α

0 )

,

cos( tg ∇ f <

critical point

critical point

(32)

Alternative view

consider the tangent space to the iso-contour h(x) = 0 this is the subspace of first order feasible variations

space of ∆ x for which x + x satisfies the constraint up to first order approximation

{ x h x x i }

x

V ( *) = ∆ | ∇

i

( *)

T

∆ = 0 , ∀

x*

h(x*)

V(x*) feasible variations h(x)=0

(33)

Feasible variations

multiplying our first Lagrangian condition by ∆x

it follows that

this is a generalization of ∇ f(x*)=0 in unconstrained case implies that ∇ f(x*) ⊥ V(x*) and therefore ∇ f(x*) || h(x*) note:

• Hessian constraint only defined for y in V(x*)

• makes sense: we cannot move anywhere else, does not really matter what Hessian is outside V(x*)

0

*) (

*) (

1

=

∇ +

∇ ∑

=

x x

h x

x

f

m i T

i i

T

λ

*) (

, 0

*)

( x x x V x

f

T

∆ = ∀ ∆ ∈

(34)

In summary

for a constrained optimization problem, with equality constraints

Theorem: consider the problem

where the constraint gradients h

i

(x*) are linearly

independent. Then x* is a solution if and only if there exits a unique vector λ , such that

0 )

( subject to

) ( min arg

* = f x h x =

x

x

0

*) ( s.t.

, 0

*) (

*) ( )

0

*) (

*) (

)

2 1

2

1

=

⎥ ≥

⎢ ⎤

⎡ ∇ + ∇

=

∇ +

=

=

y x

h y

y x

h x

f y

ii

x h x

f i

T i

m

i

i T

i m

i

i

λ

λ

(35)

Alternative formulation

state the conditions through the Lagrangian

the theorem can be compactly written as

the entries of λ are referred to as Lagrange multipliers next time we will consider inequality constraints

) ( )

( )

, (

1

x h x

f x

L

i

m

i

i

=

+

= λ

λ

0

*) ( s.t.

, 0 )

*, ( )

0 )

*, (

0 )

*, ( )

* 2

*

*

=

=

=

y x

h y

y x

L y

ii

x L ii)

x L i

T xx

T x

λ λ λ

λ

(36)

References

Related documents

This study in farmer-led field pilots demonstrated that a low dosage of 4 tons/ha biochar in combination with CF can have a strong effect on maize yield, in sandy, acidic soils

During the development of the business case for RDM support, the staff in the Research Data Facility put a lot of effort into making sure that the management and financial systems

For each scenario, we used our models of the existing residential and commercial building stock (see earlier sections) to estimate what the window-related fraction of

• Third, FUSEE informs policymakers and decisionmakers about fire ecology and management issues, and advocates for alternative eco- logical restoration policies that

■ Your vision when scanning ■ Infrastructure changes.. b) Following the course, I have a better understanding of why people make certain manoeuvres when cycling in traffic.

Statistical analysis consisted of separate mixed-design analyses of variance (ANOVA) for the depen- dent variables (cumulative fixation duration and mean fixa- tion duration) and

Attribute based encryption (ABE) is a promising cryptographic primitive, which has been generally connected to plan fine grained access control framework recently.. In any

As with the classical risk model, the general problem in using equation (4.1) to find joint distributions involving the aggregate claim amount at ruin is being able to find the