
Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification


Academic year: 2021


Lecture 4: More classifiers and classes

C4B Machine Learning Hilary 2011 A. Zisserman

• Logistic regression
• Loss functions revisited
• Adaboost
• Loss functions revisited
• Optimization
• Multiple class classification

Overview

Logistic regression is actually a classification method.

LR introduces an extra non-linearity over a linear classifier, $f(x) = w^\top x + b$, by using a logistic (or sigmoid) function, $\sigma()$.

The LR classifier is defined as

$$\sigma(f(x_i)) \begin{cases} \geq 0.5 & y_i = +1 \\ < 0.5 & y_i = -1 \end{cases}$$

where

$$\sigma(f(x)) = \frac{1}{1 + e^{-f(x)}}$$

The logistic function or sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

[Figure: $\sigma(z)$ plotted against $z$ for $z$ from -20 to 20, with values between 0 and 1]

As $z$ goes from $-\infty$ to $\infty$, $\sigma(z)$ goes from 0 to 1, a “squashing function”.

It has a “sigmoid” shape (i.e. S-like shape)
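The squashing behaviour can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `sigmoid` and the branch on the sign of `z` (used only to avoid overflow for very negative inputs) are choices made here for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)).

    The two branches are algebraically identical; splitting on the sign
    of z keeps the argument of exp() non-positive so it cannot overflow.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Evaluate across the range shown on the slide's horizontal axis.
for z in (-20.0, -5.0, 0.0, 5.0, 20.0):
    print(f"sigma({z:+.0f}) = {sigmoid(z):.6f}")
```

The value at `z = 0` is exactly 0.5, which is the threshold used by the LR classifier.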

Intuition – why use a sigmoid?

Here, choose binary classification to be represented by $y_i \in \{0, 1\}$, rather than $y_i \in \{-1, 1\}$.

[Figure: 1D data with labels 0 and 1, showing $wx + b$ fit to $y$ and $\sigma(wx + b)$ fit to $y$, with the 0.5 level marked]

Least squares fit

• fit of $wx + b$ dominated by more distant points
• causes misclassification
• instead LR regresses the sigmoid to the class data

Similarly in 2D

[Figure: LR and linear decision boundaries: $\sigma(w_1 x_1 + w_2 x_2 + b)$ fit, vs $w_1 x_1 + w_2 x_2 + b$]

Learning

In logistic regression fit a sigmoid function to the data $\{x_i, y_i\}$ by minimizing the classification errors $y_i - \sigma(w^\top x_i)$

[Figure: sigmoid fitted to 1D class data]

Margin property

A sigmoid favours a larger margin cf a step classifier

[Figure: sigmoid compared with a step classifier on 1D class data]

Probabilistic interpretation

Think of $\sigma(f(x))$ as the posterior probability that $y = 1$, i.e.

$$P(y = 1 \mid x) = \sigma(f(x))$$

Hence, if $\sigma(f(x)) > 0.5$ then class $y = 1$ is selected.

Then, after a rearrangement

$$f(x) = \log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}$$

which is the log odds ratio.
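The rearrangement above says that applying the log odds to the sigmoid output recovers the linear score. A small numerical sketch of that round trip, with hypothetical helper names `sigmoid` and `log_odds` and arbitrarily chosen scores:

```python
import math


def sigmoid(f: float) -> float:
    """Posterior P(y = 1 | x) for a linear score f = f(x)."""
    return 1.0 / (1.0 + math.exp(-f))


def log_odds(p: float) -> float:
    """log( p / (1 - p) ), the log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Arbitrary example scores (illustrative values, not from the slides).
for f in (-3.0, -0.5, 0.0, 1.2, 4.0):
    p = sigmoid(f)
    print(f"f = {f:+.1f}  P(y=1|x) = {p:.4f}  log odds = {log_odds(p):+.4f}")
```

A score of zero corresponds to even odds, which is why the decision threshold on the probability is 0.5.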

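Maximum Likelihood Estimation

Assume

$$p(y = 1 \mid x; w) = \sigma(w^\top x)$$

$$p(y = 0 \mid x; w) = 1 - \sigma(w^\top x)$$

write this more compactly as

$$p(y \mid x; w) = \left(\sigma(w^\top x)\right)^{y} \left(1 - \sigma(w^\top x)\right)^{(1 - y)}$$

Then the likelihood (assuming data independence) is

$$p(y \mid x; w) = \prod_{i}^{N} \left(\sigma(w^\top x_i)\right)^{y_i} \left(1 - \sigma(w^\top x_i)\right)^{(1 - y_i)}$$

and the negative log likelihood is

$$L(w) = -\sum_{i}^{N} \left[ y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\left(1 - \sigma(w^\top x_i)\right) \right]$$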

Logistic Regression Loss function

Use notation $y_i \in \{-1, 1\}$. Then

$$P(y = 1 \mid x) = \sigma(f(x)) = \frac{1}{1 + e^{-f(x)}}$$

$$P(y = -1 \mid x) = 1 - \sigma(f(x)) = \frac{1}{1 + e^{+f(x)}}$$

So in both cases

$$P(y_i \mid x_i) = \frac{1}{1 + e^{-y_i f(x_i)}}$$

Assuming independence, the likelihood is

$$\prod_{i}^{N} \frac{1}{1 + e^{-y_i f(x_i)}}$$

and the negative log likelihood is

$$\sum_{i}^{N} \log\left(1 + e^{-y_i f(x_i)}\right)$$

which defines the loss function.
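The negative log likelihood can be written with labels in {0, 1} or in {-1, 1}, and the two forms should give the same number for the same data. The sketch below checks this on a handful of made-up scores; the function names and the example values are assumptions for illustration only.

```python
import math


def sigmoid(f: float) -> float:
    return 1.0 / (1.0 + math.exp(-f))


def nll_labels_01(scores, labels01):
    """-sum_i [ y_i log s_i + (1 - y_i) log(1 - s_i) ] with y_i in {0, 1}."""
    total = 0.0
    for f, y in zip(scores, labels01):
        s = sigmoid(f)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total


def nll_labels_pm1(scores, labels_pm1):
    """sum_i log(1 + exp(-y_i f_i)) with y_i in {-1, +1}."""
    return sum(math.log(1.0 + math.exp(-y * f))
               for f, y in zip(scores, labels_pm1))


# Made-up linear scores f(x_i) and labels (two points are misclassified).
scores = [2.0, -1.5, 0.3, -0.2]
labels01 = [1, 0, 0, 1]
labels_pm1 = [2 * y - 1 for y in labels01]  # map {0, 1} -> {-1, +1}

print(nll_labels_01(scores, labels01))
print(nll_labels_pm1(scores, labels_pm1))
```

Both calls print the same value up to rounding, since only the label encoding differs.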

Logistic Regression Learning

Learning is formulated as the optimization problem

$$\min_{w \in \mathbb{R}^d} \underbrace{\sum_{i}^{N} \log\left(1 + e^{-y_i f(x_i)}\right)}_{\text{loss function}} + \underbrace{\lambda \|w\|^2}_{\text{regularization}}$$

• For correctly classified points $-y_i f(x_i)$ is negative, and $\log\left(1 + e^{-y_i f(x_i)}\right)$ is near zero
• For incorrectly classified points $-y_i f(x_i)$ is positive, and $\log\left(1 + e^{-y_i f(x_i)}\right)$ can be large.
• Hence the optimization penalizes parameters which lead to such misclassifications

Comparison of SVM and LR cost functions

SVM

$$\min_{w \in \mathbb{R}^d} C \sum_{i}^{N} \max\left(0, 1 - y_i f(x_i)\right) + \|w\|^2$$

Logistic regression:

$$\min_{w \in \mathbb{R}^d} \sum_{i}^{N} \log\left(1 + e^{-y_i f(x_i)}\right) + \lambda \|w\|^2$$

Note:

• both approximate 0-1 loss
• very similar asymptotic behaviour
• main difference is smoothness, and non-zero values outside margin
• SVM gives sparse solution for $\alpha_i$

[Figure: the two losses plotted against $y_i f(x_i)$]
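To make the comparison concrete, the per-point losses can be tabulated as a function of the margin $y_i f(x_i)$. This is an illustrative sketch only: the function names are hypothetical, and the convention that a margin of exactly zero counts as an error for the 0-1 loss is an assumption made here.

```python
import math


def hinge_loss(margin: float) -> float:
    """SVM hinge loss max(0, 1 - y f(x)) as a function of the margin y f(x)."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin: float) -> float:
    """LR loss log(1 + exp(-y f(x))), evaluated without overflow."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def zero_one_loss(margin: float) -> float:
    """0-1 loss; a margin of exactly 0 is counted as an error (assumption)."""
    return 1.0 if margin <= 0 else 0.0


print(" margin   0-1   hinge  logistic")
for m in (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(f"{m:+7.1f} {zero_one_loss(m):5.1f} {hinge_loss(m):7.3f} "
          f"{logistic_loss(m):9.3f}")
```

For margins of 1 or more the hinge loss is exactly zero while the logistic loss is small but positive, which is the "non-zero values outside margin" difference noted above. For large negative margins both grow roughly linearly.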

Overview

AdaBoost is an algorithm for constructing a strong classifier out of a linear combination

$$\sum_{t=1}^{T} \alpha_t h_t(x)$$

of simple weak classifiers $h_t(x)$. It provides a method of choosing the weak classifiers and setting the weights $\alpha_t$.

Terminology

• weak classifier $h_t(x) \in \{-1, 1\}$, for data vector $x$
• strong classifier $H(x) = \operatorname{sign} \sum_{t=1}^{T} \alpha_t h_t(x)$

Example: combination of linear classifiers $h_t(x) \in \{-1, 1\}$

[Figure: weak classifier 1 $h_1(x)$, weak classifier 2 $h_2(x)$, weak classifier 3 $h_3(x)$, and the strong classifier $H(x)$]

$$H(x) = \operatorname{sign}\left(\alpha_1 h_1(x) + \alpha_2 h_2(x) + \alpha_3 h_3(x)\right)$$

• Note, this linear combination is not a simple majority vote (it would be if all the $\alpha_t$ were equal)
• Need to compute $\alpha_t$ as well as selecting weak classifiers
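The difference between a weighted combination and a simple majority vote can be seen on a single point. In the sketch below the three linear weak classifiers and the weights are invented for illustration; only the form $H(x) = \operatorname{sign} \sum_t \alpha_t h_t(x)$ comes from the slides.

```python
def sign(value: float) -> int:
    """Return +1 or -1; a sum of exactly zero is mapped to +1 (assumption)."""
    return 1 if value >= 0 else -1


def strong_classifier(alphas, weak_classifiers):
    """Build H(x) = sign( sum_t alpha_t * h_t(x) )."""
    def H(x):
        return sign(sum(a * h(x) for a, h in zip(alphas, weak_classifiers)))
    return H


# Hypothetical linear weak classifiers on 2D points, each returning +1 or -1.
def h1(x):
    return 1 if x[0] > 0.0 else -1


def h2(x):
    return 1 if x[1] > 0.0 else -1


def h3(x):
    return 1 if x[0] + x[1] > 1.0 else -1


weighted = strong_classifier([1.5, 0.4, 0.4], [h1, h2, h3])
majority = strong_classifier([1.0, 1.0, 1.0], [h1, h2, h3])

x = (0.5, -2.0)          # h1 votes +1, h2 and h3 vote -1
print(weighted(x))       # h1 outweighs the other two
print(majority(x))       # equal weights reduce to a majority vote
```

With unequal weights a single confident weak classifier can override the other two, which a majority vote never allows.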

AdaBoost algorithm – building a strong classifier

Start with equal weights on each $x_i$, and a set of weak classifiers $h_t(x)$

For $t = 1, \ldots, T$

• Select weak classifier with minimum error

$$\epsilon_t = \sum_i \omega_i [h_t(x_i) \neq y_i] \quad \text{where } \omega_i \text{ are weights}$$

• Set

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$$

• Reweight examples (boosting) to give misclassified examples more weight

$$\omega_{t+1,i} = \omega_{t,i} e^{-\alpha_t y_i h_t(x_i)}$$

• Add weak classifier with weight $\alpha_t$

$$H(x) = \operatorname{sign} \sum_{t=1}^{T} \alpha_t h_t(x)$$

Example

[Figure: start with equal weights on each data point (i); weak classifier 1; weights increased; weak classifier 2; weak classifier 3; final classifier is linear combination of weak classifiers]

$$\epsilon_j = \sum_i \omega_i [h_j(x_i) \neq y_i]$$

The AdaBoost algorithm (Freund & Schapire 1995)

• Given example data $(x_1, y_1), \ldots, (x_n, y_n)$, where $y_i = -1, 1$ for negative and positive examples respectively.

• Initialize weights $\omega_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for $y_i = -1, 1$ respectively, where $m$ and $l$ are the number of negatives and positives respectively.

• For $t = 1, \ldots, T$

1. Normalize the weights,

$$\omega_{t,i} \leftarrow \frac{\omega_{t,i}}{\sum_{j=1}^{n} \omega_{t,j}}$$

so that $\omega_{t,i}$ is a probability distribution.

2. For each $j$, train a weak classifier $h_j$ with error evaluated with respect to $\omega_{t,i}$,

$$\epsilon_j = \sum_i \omega_{t,i} [h_j(x_i) \neq y_i]$$

3. Choose the classifier, $h_t$, with the lowest error $\epsilon_t$.

4. Set $\alpha_t$ as

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$$

5. Update the weights

$$\omega_{t+1,i} = \omega_{t,i} e^{-\alpha_t y_i h_t(x_i)}$$

• The final strong classifier is

$$H(x) = \operatorname{sign} \sum_{t=1}^{T} \alpha_t h_t(x)$$
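The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's implementation: the weak classifiers are taken from a fixed pool of 1D threshold "stumps" (`make_stumps`, a hypothetical helper) rather than trained per round, the error is clamped away from 0 and 1 to keep the logarithm finite, and the toy data set is invented.

```python
import math


def make_stumps(xs):
    """Hypothetical weak-classifier pool: 1D thresholds with both polarities."""
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    return [(thr, pol) for thr in thresholds for pol in (+1, -1)]


def stump_predict(stump, x):
    thr, pol = stump
    return pol if x > thr else -pol


def adaboost(xs, ys, T):
    m = sum(1 for y in ys if y == -1)          # number of negatives
    l = len(ys) - m                            # number of positives
    w = [1.0 / (2 * m) if y == -1 else 1.0 / (2 * l) for y in ys]
    stumps = make_stumps(xs)
    ensemble = []
    for _ in range(T):
        total = sum(w)                         # 1. normalize the weights
        w = [wi / total for wi in w]
        best, best_err = None, float("inf")    # 2-3. lowest weighted error
        for s in stumps:
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(s, x) != y)
            if err < best_err:
                best, best_err = s, err
        # Clamp so alpha stays finite (an assumption, not in the slides).
        eps = min(max(best_err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - eps) / eps)   # 4. set alpha_t
        ensemble.append((alpha, best))
        w = [wi * math.exp(-alpha * y * stump_predict(best, x))  # 5. reweight
             for wi, x, y in zip(w, xs, ys)]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Invented 1D data: no single threshold separates the classes.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [+1, +1, +1, -1, -1, -1, +1, +1, +1]
ensemble = adaboost(xs, ys, T=3)
train_errors = sum(1 for x, y in zip(xs, ys) if predict(ensemble, x) != y)
print(ensemble)
print("training errors:", train_errors)
```

On this toy set each individual stump misclassifies at least three points, yet the weighted combination of three stumps classifies every training point correctly.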

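Why does it work?

The AdaBoost algorithm carries out a greedy optimization of a loss function

$$\min_{\alpha_i, h_i} \sum_{i}^{N} e^{-y_i H(x_i)}$$

[Figure: AdaBoost, LR and SVM hinge loss plotted against $y_i f(x_i)$]

SVM loss function

$$\sum_{i}^{N} \max\left(0, 1 - y_i f(x_i)\right)$$

Logistic regression loss function

$$\sum_{i}^{N} \log\left(1 + e^{-y_i f(x_i)}\right)$$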

Sketch derivation – non-examinable

The objective function used by AdaBoost is

$$J(H) = \sum_i e^{-y_i H(x_i)}$$

For a correctly classified point the penalty is $\exp(-|H|)$ and for an incorrectly classified point the penalty is $\exp(+|H|)$. The AdaBoost algorithm incrementally decreases the cost by adding simple functions to

$$H(x) = \sum_t \alpha_t h_t(x)$$

Suppose that we have a function $B$ and we propose to add the function $\alpha h(x)$ where the scalar $\alpha$ is to be determined and $h(x)$ is some function that takes values in $+1$ or $-1$ only. The new function is $B(x) + \alpha h(x)$ and the new cost is

$$J(B + \alpha h) = \sum_i e^{-y_i B(x_i)} e^{-\alpha y_i h(x_i)}$$

Differentiating with respect to $\alpha$ and setting the result to zero gives

$$e^{-\alpha} \sum_{y_i = h(x_i)} e^{-y_i B(x_i)} - e^{+\alpha} \sum_{y_i \neq h(x_i)} e^{-y_i B(x_i)} = 0$$

Rearranging, the optimal value of $\alpha$ is therefore determined to be

$$\alpha = \frac{1}{2} \log \frac{\sum_{y_i = h(x_i)} e^{-y_i B(x_i)}}{\sum_{y_i \neq h(x_i)} e^{-y_i B(x_i)}}$$

The classification error is defined as

$$\epsilon = \sum_i \omega_i [h(x_i) \neq y_i] \quad \text{where} \quad \omega_i = \frac{e^{-y_i B(x_i)}}{\sum_j e^{-y_j B(x_j)}}$$

Then, it can be shown that,

$$\alpha = \frac{1}{2} \log \frac{1 - \epsilon}{\epsilon}$$

The update from $B$ to $H$ therefore involves evaluating the weighted performance $\epsilon$ (with the weights $\omega_i$ given above) of the “weak” classifier $h$.

If the current function $B$ is $B(x) = 0$ then the weights will be uniform. This is a common starting point for the minimization. As a numerical convenience, note that at the next round of boosting the required weights are obtained by multiplying the old weights with $\exp(-\alpha y_i h(x_i))$ and then normalizing. This gives the update formula

$$\omega_{t+1,i} = \frac{1}{Z_t} \omega_{t,i} e^{-\alpha_t y_i h_t(x_i)}$$

where $Z_t$ is a normalizing factor.

Choosing $h$: The function $h$ is not chosen arbitrarily but is chosen to give a good performance (low value of $\epsilon$) on the training data weighted by $\omega$.
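The step from the ratio of sums to the expression in the classification error is stated without proof in the sketch derivation. One way to fill it in, using only the definitions of the weights and of the weighted error given there, is the following:

```latex
\begin{align*}
\frac{\sum_{y_i = h(x_i)} e^{-y_i B(x_i)}}{\sum_{y_i \neq h(x_i)} e^{-y_i B(x_i)}}
  &= \frac{\sum_{y_i = h(x_i)} \omega_i}{\sum_{y_i \neq h(x_i)} \omega_i}
  && \text{divide top and bottom by } \textstyle\sum_j e^{-y_j B(x_j)} \\
  &= \frac{1 - \epsilon}{\epsilon}
  && \text{since } \textstyle\sum_i \omega_i = 1
     \text{ and } \epsilon = \sum_{y_i \neq h(x_i)} \omega_i
\end{align*}
```

Taking half the logarithm of both sides gives the expression for the optimal weight in terms of the weighted error.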

Optimization

We have seen many cost functions, e.g.

SVM

$$\min_{w \in \mathbb{R}^d} C \sum_{i}^{N} \max\left(0, 1 - y_i f(x_i)\right) + \|w\|^2$$

Logistic regression:

$$\min_{w \in \mathbb{R}^d} \sum_{i}^{N} \log\left(1 + e^{-y_i f(x_i)}\right) + \lambda \|w\|^2$$

• Do these have a unique solution?
• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?

[Figure: a cost function with a local minimum and a global minimum]

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case)

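Convex functions

Convex function examples

[Figure: examples of convex and not convex functions]

Logistic regression:

$$\min_{w \in \mathbb{R}^d} \underbrace{\sum_{i}^{N} \log\left(1 + e^{-y_i f(x_i)}\right)}_{\text{convex}} + \underbrace{\lambda \|w\|^2}_{\text{convex}}$$

SVM

$$\min_{w \in \mathbb{R}^d} \underbrace{C \sum_{i}^{N} \max\left(0, 1 - y_i f(x_i)\right)}_{\text{convex}} + \underbrace{\|w\|^2}_{\text{convex}}$$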
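Convexity of the two loss terms can be spot-checked numerically with the standard chord test: a function $g$ is convex if $g(ta + (1-t)b) \le t\,g(a) + (1-t)\,g(b)$ for all $a$, $b$ and $t \in [0, 1]$. The sketch below applies that test to the per-point losses as functions of the margin; because the margin is linear in $w$, convexity in the margin carries over to $w$. A random test can only find counterexamples, it cannot prove convexity, and all names here are hypothetical.

```python
import math
import random


def logistic_loss(m: float) -> float:
    """log(1 + exp(-m)), evaluated without overflow."""
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))


def hinge_loss(m: float) -> float:
    return max(0.0, 1.0 - m)


def zero_one_loss(m: float) -> float:
    """0-1 loss, included as a non-convex contrast."""
    return 1.0 if m <= 0 else 0.0


def violates_convexity(fn, a, b, t, tol=1e-9):
    """True if fn lies above its chord at the point t*a + (1-t)*b."""
    point = t * a + (1.0 - t) * b
    chord = t * fn(a) + (1.0 - t) * fn(b)
    return fn(point) > chord + tol


def count_violations(fn, trials):
    return sum(1 for a, b, t in trials if violates_convexity(fn, a, b, t))


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [(rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0), rng.random())
          for _ in range(1000)]

print("logistic violations:", count_violations(logistic_loss, trials))
print("hinge violations:   ", count_violations(hinge_loss, trials))
print("0-1 violations:     ", count_violations(zero_one_loss, trials))
```

The logistic and hinge losses never rise above a chord, whereas the 0-1 loss does, for example between margins -3 and 1.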

Gradient (or Steepest) descent algorithms

To minimize a cost function $C(w)$ use the iterative update

$$w_{t+1} \leftarrow w_t - \eta_t \nabla_w C(w_t)$$

where $\eta$ is the learning rate.

In our case the loss function is a sum over the training data. For example for LR

$$\min_{w \in \mathbb{R}^d} C(w) = \sum_{i}^{N} \log\left(1 + e^{-y_i f(x_i)}\right) + \lambda \|w\|^2 = \sum_{i}^{N} L(x_i, y_i; w) + \lambda \|w\|^2$$

This means that one iterative update consists of a pass through the training data with an update for each point

$$w_{t+1} \leftarrow w_t - \eta_t \left( \sum_{i}^{N} \nabla_w L(x_i, y_i; w_t) + 2 \lambda w_t \right)$$

The advantage is that for large amounts of data, this can be carried out point by point.

Gradient descent algorithm for LR

Minimizing $L(w)$ using gradient descent gives [exercise] the update rule

$$w \leftarrow w + \eta \left(y_i - \sigma(w^\top x_i)\right) x_i$$

where $y_i \in \{0, 1\}$

Note:

• this is similar, but not identical, to the perceptron update rule

$$w \leftarrow w - \eta \operatorname{sign}(w^\top x_i) x_i$$

• there is a unique solution for $w$
• in practice more efficient Newton methods are used to minimize $L$
• there can be problems with $w$ becoming infinite for linearly separable data
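The point-by-point update for logistic regression with labels in {0, 1} can be sketched as below. Several things are assumptions for illustration: the bias is folded into $w$ by appending a constant feature 1 to each input, the learning rate and number of passes are arbitrary, there is no regularization term, and the toy data is deliberately not linearly separable so that $w$ stays finite.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_logistic_regression(data, eta=0.1, epochs=200):
    """Cycle through the data applying w <- w + eta * (y - sigma(w.x)) * x."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            error = y - sigmoid(dot(w, x))
            w = [wj + eta * error * xj for wj, xj in zip(w, x)]
    return w


# Invented 1D data; each input is (x, 1.0) so w[1] plays the role of b.
# The points at x = -0.5 and x = 0.5 overlap, so the classes are not separable.
data = [((-2.0, 1.0), 0), ((-1.0, 1.0), 0), ((-0.5, 1.0), 1),
        ((0.5, 1.0), 0), ((1.0, 1.0), 1), ((2.0, 1.0), 1)]

w = train_logistic_regression(data)
print("w =", w)
for x, y in data:
    print(x[0], y, round(sigmoid(dot(w, x)), 3))
```

The size of each update is proportional to the prediction error, so confidently correct points barely move $w$. This is the main difference from the perceptron rule, which moves $w$ by a fixed amount and only for misclassified points.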

Gradient descent algorithm for SVM

First, rewrite the optimization problem as an average

$$\min_{w} C(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{N} \sum_{i}^{N} \max\left(0, 1 - y_i f(x_i)\right) = \frac{1}{N} \sum_{i}^{N} \left( \frac{\lambda}{2} \|w\|^2 + \max\left(0, 1 - y_i f(x_i)\right) \right)$$

(with $\lambda = 2/(NC)$ up to an overall scale of the problem) and $f(x) = w^\top x + b$

Because the hinge loss is not differentiable, a sub-gradient is computed

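Sub-gradient for hinge loss

$$L(x_i, y_i; w) = \max\left(0, 1 - y_i f(x_i)\right) \qquad f(x_i) = w^\top x_i + b$$

[Figure: hinge loss plotted against $y_i f(x_i)$, with the sub-gradient marked on each side of the hinge]

$$\frac{\partial L}{\partial w} = -y_i x_i \quad \text{if } y_i f(x_i) < 1 \qquad\qquad \frac{\partial L}{\partial w} = 0 \quad \text{otherwise}$$

Sub-gradient descent algorithm for SVM

$$C(w) = \frac{1}{N} \sum_{i}^{N} \left( \frac{\lambda}{2} \|w\|^2 + L(x_i, y_i; w) \right)$$

The iterative update is

$$w_{t+1} \leftarrow w_t - \eta \nabla_{w_t} C(w_t) \leftarrow w_t - \eta \frac{1}{N} \sum_{i}^{N} \left( \lambda w_t + \nabla_w L(x_i, y_i; w_t) \right)$$

where $\eta$ is the learning rate.

Then each iteration $t$ involves cycling through the training data with the updates:

$$w_{t+1} \leftarrow (1 - \eta \lambda) w_t + \eta y_i x_i \quad \text{if } y_i (w^\top x_i + b) < 1$$

$$w_{t+1} \leftarrow (1 - \eta \lambda) w_t \quad \text{otherwise}$$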
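The per-point sub-gradient updates can be sketched as follows. The slides give the update for $w$ only; the bias update `b += eta * y` used here is an assumption (the sub-gradient of the hinge loss with respect to an unregularized $b$), as are the values of `lam`, `eta`, the number of passes, and the toy data.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_svm(data, lam=0.01, eta=0.1, epochs=100):
    """Cycle through the data applying the hinge-loss sub-gradient update."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (dot(w, x) + b)          # computed with the current w
            w = [(1.0 - eta * lam) * wj for wj in w]   # shrink: (1 - eta*lam) w
            if margin < 1.0:
                # Point is inside the margin or misclassified: add eta * y * x.
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y   # bias step: an assumption, not from the slides
    return w, b


# Invented, linearly separable 2D data with labels in {-1, +1}.
data = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((2.5, 3.0), 1),
        ((-1.0, -1.0), -1), ((-2.0, 0.0), -1), ((0.0, -2.0), -1)]

w, b = train_svm(data)
print("w =", w, "b =", b)
for x, y in data:
    print(x, y, round(y * (dot(w, x) + b), 3))
```

Points whose margin is at least 1 contribute only the shrinkage step, so after the first few passes most updates come from the points closest to the decision boundary.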

Multi-class Classification

Multi-Class Classification – what we would like

Assign input vector to one of K classes

Goal: a decision rule that divides input space into K decision regions separated by decision boundaries

Reminder: K Nearest Neighbour (K-NN) Classifier

Algorithm

• For each test point, x, to be classified, find the K nearest samples in the training data

• Classify the point, x, according to the majority vote of their class labels

e.g. K = 3

• naturally applicable to multi-class case
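The K-NN algorithm above is short enough to sketch directly. The Euclidean distance, the function name `knn_classify`, and the toy three-class data are assumptions for illustration. Ties in the vote are resolved in favour of the label seen first among the nearest neighbours, which is a consequence of how `Counter.most_common` orders equal counts rather than anything specified in the slides.

```python
import math
from collections import Counter


def knn_classify(train, x, k=3):
    """Classify x by majority vote over its k nearest training samples.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Invented 2D training data with three classes.
train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
         ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
         ((0.0, 5.0), "c"), ((1.0, 5.0), "c"), ((0.0, 6.0), "c")]

for query in [(0.2, 0.3), (5.5, 5.2), (0.5, 5.5)]:
    print(query, "->", knn_classify(train, query, k=3))
```

Nothing in the function depends on the number of classes, which is why K-NN applies to the multi-class case without modification.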

Build from binary classifiers …

• Learn: K two-class 1 vs the rest classifiers $f_k(x)$
• Classification: choose class with most positive score

$$\max_k f_k(x)$$

[Figure: three classes C1, C2, C3 with the boundaries 1 vs 2 & 3, 2 vs 1 & 3, and 3 vs 1 & 2; without the max rule some regions are ambiguous (?)]
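The classification rule "choose the class with the most positive score" is an argmax over the K binary scorers. In the sketch below the three linear scorers and their weights are invented; in practice each $f_k$ would be a trained 1 vs the rest classifier.

```python
def linear_scorer(w, b):
    """Return f(x) = w.x + b as a function of x."""
    def f(x):
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return f


def one_vs_rest_predict(scorers, x):
    """Return the index k of the scorer f_k with the largest score on x."""
    scores = [f(x) for f in scorers]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical 1 vs the rest scorers for three classes in 2D.
scorers = [
    linear_scorer((1.0, 0.0), 0.0),     # class 0 vs the rest
    linear_scorer((-1.0, 1.0), 0.0),    # class 1 vs the rest
    linear_scorer((-1.0, -1.0), 0.0),   # class 2 vs the rest
]

for x in [(2.0, 0.0), (-1.0, 2.0), (-1.0, -2.0)]:
    print(x, "-> class", one_vs_rest_predict(scorers, x))
```

Taking the maximum score always yields exactly one class, including in regions where several binary classifiers fire or none does.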

Application: hand written digit recognition

• Feature vectors: each image is 28 x 28 pixels. Rearrange as a 784-vector $x$
• Training: learn k=10 two-class 1 vs the rest SVM classifiers $f_k(x)$
• Classification: choose class with most positive score

$$f(x) = \max_k f_k(x)$$

Example

[Figure: hand drawn digits and their classification]

Background reading and more

• Other multiple-class classifiers (not covered here):

• Neural networks

• Random forests

• Bishop, chapters 4.1 – 4.3 and 14.3

• Hastie et al, chapters 10.1 – 10.6

• More on web page:
