PCA LD

(1)

(2)

 Regression Analysis

 PCA

 LD

 K-Nearest Neighbor Classifier

 ANN as Classifier

(3)

 SVM became famous when, using images as input, it gave accuracy comparable to neural-network with

hand-designed features in a handwriting recognition task  Currently, SVM is widely used in object detection &

(4)



Large Margin Linear Classifier



Nonlinear SVM: The Kernel Trick

(5)

( )

( ) for all

i j

g

x



g

x

j



i



An example we’ve learned before:

 Minimum-Error-Rate Classifier

 For two-category case,

g

( )

x



g

₁

( )

x



g

₂

( )

x

1 2

Decide



if ( )

g

x



0; otherwise decide



( )

(

| )

(

| )

(6)

Nearest Neighbor

Decision Tree

Linear Functions

( )

T

g

x



w x



b

(7)

( )

T

g

x



w x



b

x₁ wT _{x + b < 0}

 A hyper-plane in the

feature space

 (Unit-length) normal vector

of the hyper-plane:



w

n

w

(8)

 How would you classify these points using a linear discriminant function in order to minimize the error rate?

x₁ x₂

(9)

x x₂

(10)

x₁ x₂

(11)

x x₂

 Infinite number of answers!

(12)

 The linear discriminant

function (classifier) with the maximum margin is the

best

“safe zone”

 Margin is defined as the

width that the boundary

could be increased by before hitting a data point

 Why it is the best?

 Robust to outliners and thus

strong generalization ability

Margin

(13)

 Given a set of data points:

 With a scale transformation

on both w and b, the above is equivalent to

x₁ x₂

For

1,

0 For

1,

0

T i i T i i

y

b

y

b

 

 

 

 

w x

{( ,

x

_i

y

_i

)},

i



1, 2,



,

n

, where

For

1,

1 For

1,

1

(14)

 We know that

 The margin width is:

x₁ x₂

1

T T

b

 

 

  

w x

Margin x+ x+ x

-(

)

2 (

)

M

 















x

n

w

x

w

n

(15)

 Formulation: x x₂ Margin x+ x+ x -n

such that

2 maximize

w

For

1,

1 For

1,

1

(16)

 Formulation: x₁ x₂ Margin x+ x+ x -n 2

1 minimize

2 w

such that

For

1,

1 For

1,

1

(17)

 Formulation:

x x₂

Margin

x+

x

-n

(

T

) 1

i i

y

w x



b



2

1 minimize

2 w

(18)

(

T

) 1

i i

y

w x



b



2 s.t.

with linear constraints





2 1

1 minimize

( , ,

)

(

) 1

2

n

T

p i i i i

i

L

b



y

b











w

w x

(19)

1

2

_i_

s.t.



_i



0

p

L

b







0

p

L







w

1

n

(20)

1

2

_i_

s.t.



_i



0

1 1 1

1 maximize

2

n n n

T i i j i j i j

i i j

y y



 

  







x x

s.t.



_i



0

(21)

 The solution has the form:



(

T

) 1



0

i

y

i i

b



w x







 Thus, only support vectors have

0

i





1 SV

n

i i i i i i

i i

y



 









w

x

get from (

) 1

0,

where is support vector

T i i

b

y

w x



b

 

(22)

SV

( )

T _i T_i

i

g

b



b





 





x

w x

x x

 Notice it relies on a dot product between the test point x

and the support vectors x_i

 Also keep in mind that solving the optimization problem

involved computing the dot products x_iT_x

j between all pairs

(23)

 What if data is not linear separable? (noisy data, outliers, etc.)

 Slack variables ξ_i can be

added to allow

mis-classification of difficult or noisy data points

x x₂

1



2

(24)

(

T

) 1

i i i

y

w x



b

 



2

1

1 minimize

2 w



C



_i_



i

such that

0

i





(25)

1 1 1

1 maximize

2

n n n

T i i j i j i j

i i j

y y



 

  







x x

such that

0 



_i



C

(26)

 Datasets that are linearly separable with noise work out great:

0 x

x2

0 x

 But what are we going to do if the dataset is just too hard?

 How about… mapping data to a higher-dimensional space:

(27)



General idea: the original input space can be mapped to

some higher-dimensional feature space where the

training set is separable:

(28)

 With this mapping, our discriminant function is now:

SV

( )

T

( )

_i

( )

_i T

( )

i

g



b

 



b





 





x

w

x

 No need to know this mapping explicitly, because we only use

the dot product of feature vectors in both the training and test.

 A kernel function is defined as a function that corresponds to

a dot product of two feature vectors in some expanded feature space:

( ,

_i _j

)

( )

_i T

(

_j

)

(29)

2-dimensional vectors x=[x₁x₂];

let K(x_i,x_j)=(1 + x_iT_x j)2,

Need to show that K(x_i,x_j) = φ(x_i)T_φ_(x j):

K(x_i,x_j)=(1 + x_iT_x j)2,

= 1+ x_i12x_j12 + 2 x_i1x_j1x_i2x_j2+ x_i22x_j22 + 2x_i1x_j1+ 2x_i2x_j2

= [1 x_i12 √2 x_i1x_i2 x_i22 √2x_i1√2x_i2]T [1 x_j12 √2 x_j1x_j2 x_j22 √2x_j1√2x_j2] = φ(x_i)T_φ_(x

j), where φ(x) = [1 x12 √2 x1x2 x22 √2x1 √2x2]

(30)

 Linear kernel:

2

( ,

)

exp(

)

2

i j i j

K









x

x x

( ,

_i _j

)

T_i _j

K

x x



x x

( ,

_i _j

)

(1

T_i _j

)

p

K

x x





x x

0 1

( ,

_i _j

)

tanh(

T_i _j

)

K

x x





x x





 Examples of commonly-used kernel functions:

 Polynomial kernel:

 Gaussian (Radial-Basis Function (RBF) ) kernel:

 Sigmoid:

 In general, functions that satisfy Mercer’s condition can be

(31)

1 1 1

1 maximize

( ,

)

2

i i j i j i j

i i j

y y K



 

  







x x

such that

0 



_i



C

1

0

n i i i

y







 The solution of the discriminant function is

SV

( )

_i

( , )

_i

i

g



K

b









x

x x

(32)

 2. Choose a value for C

 3. Solve the quadratic programming problem (many software packages available)

(33)

- if ineffective, more elaborate kernels are needed - domain experts can give assistance in formulating

appropriate similarity measures



Choice of kernel parameters

- e.g. σ in Gaussian kernel

- σ is the distance between closest points with different

classifications

- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.