• No results found

PCA LD

N/A
N/A
Protected

Academic year: 2020

Share "PCA LD"

Copied!
36
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)

 Regression Analysis

 PCA

 LD

 K-Nearest Neighbor Classifier

 ANN as Classifier

(3)

 SVM became famous when, using images as input, it gave accuracy comparable to neural-network with

hand-designed features in a handwriting recognition task  Currently, SVM is widely used in object detection &

(4)

Large Margin Linear Classifier

Nonlinear SVM: The Kernel Trick

(5)

( )

( ) for all

i j

g

x

g

x

j

i

An example we’ve learned before:

 Minimum-Error-Rate Classifier

 For two-category case,

g

( )

x

g

1

( )

x

g

2

( )

x

1 2

Decide

if ( )

g

x

0; otherwise decide

( )

(

| )

(

| )

(6)

Nearest Neighbor

Decision Tree

Linear Functions

( )

T

g

x

w x

b

(7)

( )

T

g

x

w x

b

x1 wT x + b < 0

 A hyper-plane in the

feature space

 (Unit-length) normal vector

of the hyper-plane:

w

n

w

(8)

 How would you classify these points using a linear discriminant function in order to minimize the error rate?

x1 x2

(9)

 How would you classify these points using a linear discriminant function in order to minimize the error rate?

x x2

(10)

 How would you classify these points using a linear discriminant function in order to minimize the error rate?

x1 x2

(11)

x x2

 How would you classify these points using a linear discriminant function in order to minimize the error rate?

 Infinite number of answers!

(12)

 The linear discriminant

function (classifier) with the maximum margin is the

best

“safe zone”

 Margin is defined as the

width that the boundary

could be increased by before hitting a data point

 Why it is the best?

 Robust to outliners and thus

strong generalization ability

Margin

(13)

 Given a set of data points:

 With a scale transformation

on both w and b, the above is equivalent to

x1 x2

For

1,

0

For

1,

0

T i i T i i

y

b

y

b

 

 

 

 

w x

w x

{( ,

x

i

y

i

)},

i

1, 2,

,

n

, where

For

1,

1

For

1,

1

(14)

 We know that

 The margin width is:

x1 x2

1

1

T T

b

b

 

 

  

w x

w x

Margin x+ x+ x

-(

)

2

(

)

M

 

 

x

x

n

w

x

x

w

w

n

(15)

 Formulation: x x2 Margin x+ x+ x -n

such that

2

maximize

w

For

1,

1

For

1,

1

(16)

 Formulation: x1 x2 Margin x+ x+ x -n 2

1

minimize

2

w

such that

For

1,

1

For

1,

1

(17)

 Formulation:

x x2

Margin

x+

x+

x

-n

(

T

) 1

i i

y

w x

b

2

1

minimize

2

w

(18)

(

T

) 1

i i

y

w x

b

2

s.t.

with linear constraints

2 1

1

minimize

( , ,

)

(

) 1

2

n

T

p i i i i

i

L

b

y

b

w

w

w x

(19)

1

2

i

s.t.

i

0

0

p

L

b

0

p

L

w

1

n

(20)

1

2

i

s.t.

i

0

1 1 1

1

maximize

2

n n n

T i i j i j i j

i i j

y y

 

  



x x

s.t.

i

0

(21)

 The solution has the form:

(

T

) 1

0

i

y

i i

b

w x

 Thus, only support vectors have

0

i

1 SV

n

i i i i i i

i i

y

y

 

w

x

x

get from (

) 1

0,

where is support vector

T i i

b

y

w x

b

 

(22)

SV

( )

T i Ti

i

g

b

b

 

x

w x

x x

 Notice it relies on a dot product between the test point x

and the support vectors xi

 Also keep in mind that solving the optimization problem

involved computing the dot products xiTx

j between all pairs

(23)

 What if data is not linear separable? (noisy data, outliers, etc.)

 Slack variables ξi can be

added to allow

mis-classification of difficult or noisy data points

x x2

1

2

(24)

(

T

) 1

i i i

y

w x

b

 

2

1

1

minimize

2

w

C

i

i

such that

0

i

(25)

1 1 1

1

maximize

2

n n n

T i i j i j i j

i i j

y y

 

  



x x

such that

0

i

C

(26)

 Datasets that are linearly separable with noise work out great:

0 x

0 x

x2

0 x

 But what are we going to do if the dataset is just too hard?

 How about… mapping data to a higher-dimensional space:

(27)

General idea: the original input space can be mapped to

some higher-dimensional feature space where the

training set is separable:

(28)

 With this mapping, our discriminant function is now:

SV

( )

T

( )

i

( )

i T

( )

i

g

b

 

b

 

x

w

x

x

x

 No need to know this mapping explicitly, because we only use

the dot product of feature vectors in both the training and test.

 A kernel function is defined as a function that corresponds to

a dot product of two feature vectors in some expanded feature space:

( ,

i j

)

( )

i T

(

j

)

(29)

2-dimensional vectors x=[x1 x2];

let K(xi,xj)=(1 + xiTx j)2,

Need to show that K(xi,xj) = φ(xi)Tφ(x j):

K(xi,xj)=(1 + xiTx j)2,

= 1+ xi12xj12 + 2 xi1xj1xi2xj2+ xi22xj22 + 2xi1xj1 + 2xi2xj2

= [1 xi12 2 xi1xi2 xi22 2xi1 2xi2]T [1 xj12 2 xj1xj2 xj22 2xj1 2xj2] = φ(xi)Tφ(x

j), where φ(x) = [1 x12 2 x1x2 x22 2x1 2x2]

(30)

 Linear kernel:

2

2

( ,

)

exp(

)

2

i j i j

K

x

x

x x

( ,

i j

)

Ti j

K

x x

x x

( ,

i j

)

(1

Ti j

)

p

K

x x

x x

0 1

( ,

i j

)

tanh(

Ti j

)

K

x x

x x

 Examples of commonly-used kernel functions:

 Polynomial kernel:

 Gaussian (Radial-Basis Function (RBF) ) kernel:

 Sigmoid:

 In general, functions that satisfy Mercer’s condition can be

(31)

1 1 1

1

maximize

( ,

)

2

i i j i j i j

i i j

y y K

 

  



x x

such that

0

i

C

1

0

n i i i

y

 The solution of the discriminant function is

SV

( )

i

( , )

i

i

g

K

b

x

x x

(32)

2. Choose a value for C

 3. Solve the quadratic programming problem (many software packages available)

(33)

- if ineffective, more elaborate kernels are needed - domain experts can give assistance in formulating

appropriate similarity measures

Choice of kernel parameters

- e.g. σ in Gaussian kernel

- σ is the distance between closest points with different

classifications

- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.

Optimization criterion

– Hard margin v.s. Soft margin

(34)

Better generalization ability & less over-fitting

2. The Kernel Trick

Map data points to higher dimensional

space in order to make them linearly

separable.

Since only dot product is used, we do not

(35)
(36)

References

Related documents

(AD): Alzheimer's disease; (FOXO): forkhead box tran- scription factor, class O; (LBD): Lewy body dementia; (PD): Parkinson's disease; (TBS): Tris-buffered saline..

I have made little videos for these songs too, please have a look at the home learning

Weiner has served on the scientific advisory boards for Lilly, Araclon and Institut Catala de Neurociencies Aplicades, Gulf War Veterans Illnesses Advisory Committee, VACO,

[6] European Language Label of Labels: “The European Label is an award that encourages new initiatives in the field of teaching and learning languages, rewarding new techniques

Ethnic differences in common carotid intimamedia thickness, and the relationship to cardiovascular risk factors and peripheral arterial disease:

The experiences of women with maternal near miss and their perception of quality of care in Kelantan, Malaysia a qualitative study RESEARCH ARTICLE Open Access The experiences of women

It is possible that specific context-dependent cytokines can simultaneously regulate the same or differ- ent signaling pathways in breast cancer to promote drug resistance; or

Extracts of Araceae species inhibit the in vitro growth of the human malaria parasite Plasmodium falciparum Welch and significant median inhibitory concentrations (IC 50 )