• No results found

Lecture 2: The SVM classifier

N/A
N/A
Protected

Academic year: 2022

Share "Lecture 2: The SVM classifier"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

Lecture 2: The SVM classifier

C19 Machine Learning Hilary 2015 A. Zisserman

• Review of linear classifiers

• Linear separability

• Perceptron

• Support Vector Machine (SVM) classifier

• Wide margin

• Cost function

• Slack variables

• Loss functions revisited

• Optimization

(2)

Binary Classification

Given training data ( x i , y i ) for i = 1 . . . N , with x i R d and y i ∈ {−1, 1}, learn a classifier f( x ) such that

f ( x i )

( ≥ 0 y i = +1

< 0 y i = −1

i.e. y i f ( x i ) > 0 for a correct classification.

(3)

Linear separability

linearly separable

not linearly separable

(4)

Linear classifiers

X2

X1 A linear classifier has the form

• in 2D the discriminant is a line

• is the normal to the line, and b the bias

• is known as the weight vector

f ( x ) = 0

f ( x ) = w > x + b

f (x) > 0 f (x) < 0

(5)

Linear classifiers

A linear classifier has the form

• in 3D the discriminant is a plane, and in nD it is a hyperplane

For a K-NN classifier it was necessary to `carry’ the training data

For a linear classifier, the training data is used to learn w and then discarded Only w is needed for classifying new data

f ( x ) = 0

f ( x ) = w > x + b

(6)

Given linearly separable data xi labelled into two categories yi = {-1,1} , find a weight vector w such that the discriminant function

separates the categories for i = 1, .., N

• how can we find this separating hyperplane ?

The Perceptron Classifier

f ( x i ) = w > x i + b

The Perceptron Algorithm

Write classifier as

• Initialize w = 0

• Cycle though the data points { xi, yi }

• if xi is misclassified then

• Until all the data is correctly classified

w w + α sign(f ( x

i

)) x

i

f ( x

i

) = ˜ w

>

˜ x

i

+ w

0

= w

>

x

i

where w = ( ˜ w , w

0

), x

i

= (˜ x

i

, 1)

(7)

For example in 2D

X2

X1 X2

X1 w

before update after update

w

NB after convergence w = P N i α i x i

• Initialize w = 0

• Cycle though the data points { xi, yi }

• if xi is misclassified then

• Until all the data is correctly classified

w w + α sign(f ( x

i

)) x

i

x i

(8)

• if the data is linearly separable, then the algorithm will converge

• convergence can be slow …

• separating line close to training data

• we would prefer a larger margin for generalization

-15 -10 -5 0 5 10

-10 -8 -6 -4 -2 0 2 4 6 8

Perceptron example

(9)

What is the best w?

• maximum margin solution: most stable under perturbations of the inputs

(10)

Support Vector Machine

w

Support Vector Support Vector

b

||w||

f (x) = X

i

α

i

y

i

(x

i>

x) + b

support vectors

wTx + b = 0 linearly separable data

(11)

SVM – sketch derivation

• Since w

>

x + b = 0 and c( w

>

x + b) = 0 define the same plane, we have the freedom to choose the normalization of w

• Choose normalization such that w

>

x

+

+b = +1 and w

>

x

+ b = −1 for the positive and negative support vectors re- spectively

• Then the margin is given by w

|| w || .

³

x

+

x

´

= w

> ³

x

+

x

´

|| w || =

2

|| w ||

(12)

Support Vector Machine

w

Support Vector Support Vector

wTx + b = 0 wTx + b = 1

wTx + b = -1

Margin = 2

||w||

linearly separable data

(13)

SVM – Optimization

• Learning the SVM can be formulated as an optimization:

max

w

2

|| w || subject to w

>

x

i

+b ≥ 1 if y

i

= +1

≤ −1 if y

i

= −1 for i = 1 . . . N

• Or equivalently

min

w

|| w ||

2

subject to y

i ³

w

>

x

i

+ b

´

≥ 1 for i = 1 . . . N

• This is a quadratic optimization problem subject to linear

constraints and there is a unique minimum

(14)

Linear separability again: What is the best w?

• the points can be linearly separated but there is a very narrow margin

• but possibly the large margin solution is better, even though one constraint is violated

In general there is a trade off between the margin and the number of mistakes on the training data

(15)

Introduce “slack” variables

w

Support Vector Support Vector

wTx + b = 0 wTx + b = 1

wTx + b = -1

Margin = 2

||w||

Misclassified point

ξi

||w|| >

2

||w||

 = 0

ξi

||w|| <

1

||w||

ξ i ≥ 0

• for 0 < ξ ≤ 1 point is between margin and correct side of hyper- plane. This is a margin violation

• for ξ > 1 point is misclassified

(16)

“Soft” margin solution

The optimization problem becomes min

wRdiR+ ||w||2+C

XN i

ξi

subject to

yi ³w>xi + b´ ≥ 1−ξi for i = 1 . . . N

• Every constraint can be satisfied if ξi is sufficiently large

• C is a regularization parameter:

— small C allows constraints to be easily ignored → large margin

— large C makes constraints hard to ignore → narrow margin

— C = ∞ enforces all constraints: hard margin

• This is still a quadratic optimization problem and there is a unique minimum. Note, there is only one parameter, C.

(17)

-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 -0.8

-0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8

feature x

feature y

• data is linearly separable

• but only with a narrow margin

(18)

C = Infinity hard margin

(19)

C = 10 soft margin

(20)

Application: Pedestrian detection in Computer Vision

Objective: detect (localize) standing humans in an image

• cf face detection with a sliding window classifier

reduces object detection to binary classification

• does an image window contain a person or not?

Method: the HOG detector

(21)

• Positive data – 1208 positive window examples

• Negative data – 1218 negative window examples (initially)

Training data and features

(22)

Feature: histogram of oriented gradients (HOG)

Feature vector dimension = 16 x 8 (for tiling) x 8 (orientations) = 1024 image

dominant

direction HOG

frequency

orientation

• tile window into 8 x 8 pixel cells

• each cell represented by HOG

(23)
(24)

Averaged positive examples

(25)

Training (Learning)

• Represent each example window by a HOG feature vector

• Train a SVM classifier

Testing (Detection)

• Sliding window classifier

Algorithm

f (x) = w

>

x + b

x

i

∈ R

d

, with d = 1024

(26)

Dalal and Triggs, CVPR 2005

(27)

Learned model

Slide from Deva Ramanan

f ( x ) = w > x + b

(28)

Slide from Deva Ramanan

(29)

Optimization

Learning an SVM has been formulated as a constrained optimization prob- lem over w and ξ

min

wRdiR+ ||w||2 + C

XN i

ξi subject to yi ³w>xi + b´ ≥ 1 − ξi for i = 1 . . . N

The constraint yi ³w>xi + b´ ≥ 1 − ξi, can be written more concisely as yif (xi) ≥ 1 − ξi

which, together with ξi ≥ 0, is equivalent to

ξi = max (0, 1 − yif (xi))

Hence the learning problem is equivalent to the unconstrained optimiza- tion problem over w

min

wRd||w||2 + C

XN i

max (0, 1 − yif (xi))

loss function regularization

(30)

Loss function

w

Support Vector Support Vector

wTx + b = 0 min

wRd ||w||2 + C

XN i

max (0, 1 − yif (xi))

Points are in three categories:

1. yif (xi) > 1

Point is outside margin.

No contribution to loss 2. yif (xi) = 1

Point is on margin.

No contribution to loss.

As in hard margin case.

3. yif (xi) < 1

Point violates margin constraint.

Contributes to loss

loss function

(31)

Loss functions

• SVM uses “hinge” loss

• an approximation to the 0-1 loss

max (0, 1 − y

i

f ( x

i

))

y i f ( x i )

(32)

min

w∈Rd

C

XN i

max (0, 1 − y

i

f ( x

i

)) + || w ||

2

• Does this cost function have a unique solution?

• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?

local minimum

global minimum

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case)

Optimization continued

(33)

Convex functions

(34)

Convex function examples

convex Not convex

A non-negative sum of convex functions is convex

(35)

SVM min

wRd C

XN i

max (0, 1 − yif (xi)) + ||w||2

+

convex

(36)

Gradient (or steepest) descent algorithm for SVM

First, rewrite the optimization problem as an average

minw C(

w

) = λ

2||

w

||2 + 1 N

XN

i

max (0, 1 − yif (

x

i))

= 1

N

XN

i

µλ

2||

w

||2 + max (0, 1 − yif (

x

i))

(with λ = 2/(N C) up to an overall scale of the problem) and f (

x

) =

w

>

x

+ b

Because the hinge loss is not differentiable, a sub-gradient is computed

To minimize a cost function C( w ) use the iterative update w

t+1

w

t

− η

t

w

C( w

t

)

where η is the learning rate.

(37)

Sub-gradient for hinge loss

L( x i , y i ; w ) = max (0, 1 − y i f ( x i )) f ( x i ) = w > x i + b

y i f ( x i )

∂ L

∂ w = −y

i

x

i

∂ L

∂ w = 0

(38)

Sub-gradient descent algorithm for SVM

C(w) = 1 N

XN i

µλ

2||w||2 + L(xi, yi;w)

The iterative update is

wt+1 wt − η∇wtC(wt)

← wt − η 1 N

XN i

(λwt + wL(xi, yi;wt)) where η is the learning rate.

Then each iteration t involves cycling through the training data with the updates:

wt+1 wt − η(λwt − yixi) if yif (xi) < 1

← wt − ηλwt otherwise In the Pegasos algorithm the learning rate is set at ηt = λt1

(39)

Pegasos – Stochastic Gradient Descent Algorithm

Randomly sample from the training data

0 50 100 150 200 250 300

10-2 10-1 100 101 102

energy

-6 -4 -2 0 2 4 6

-6 -4 -2 0 2 4 6

(40)

Background reading and more …

• Next lecture – see that the SVM can be expressed as a sum over the support vectors:

• On web page:

http://www.robots.ox.ac.uk/~az/lectures/ml

• links to SVM tutorials and video lectures

• MATLAB SVM demo

f (x) = X

i

α

i

y

i

(x

i>

x) + b

support vectors

References

Related documents

Size Press Moisture Production Rate Machine Information lb/3300sq ft % % fpm tph 3597 306 1256.. Total

With the guiding principles in hand, the next task was to create the change team that would lead the library OAD. The formation of the change team that would led the OAD

Contingent valuation study on Diyatha Uyana and surroundings are of interest to many people because it is one of the widely used urban water body recreational areas in Sri Lanka4.

Quantitative data about chemical composition, molar mass and concentration of all the segments, which are present in a polymer, can be obtained, after calibrating the HPLC

Root architecture traits were calculated in pixel values including total root length, average diameter, root area, root perimeter, convex area, solidity (network area di- vided by

Background: In the healthcare setting, effective teamwork is essential to achieve the quadruple aim of improving patient outcomes, improving population health, increasing

Ontogeny of glomerular territory patterning in the olfactory bulb of juvenile Chinook salmon ( Oncorhynchus tshawytscha) and exploring for potential effect of.. sensory experience

The model was estimated using co-integration and error correction models to analyze the short and long run equilibrium among the variables.Results of the study show