Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification

(1)

Lecture 4: More classifiers and classes

C4B Machine Learning Hilary 2011 A. Zisserman

•

Logistic regression

• Loss functions revisited

• Adaboost

• Loss functions revisited

•

Optimization

•

Multiple class classification

(2)

Overview

•

Logistic regression is actually a classi

fi

cation method

•

LR introduces an extra non-linearity over a linear

classi

fi

er,

f

(

x

) =

w

>

x

+

b

, by using a logistic (or

sigmoid) function,

σ

().

•

The LR classi

fi

er is de

fi

ned as

σ

(

f

(

x

_i

))

(

≥

0

.

5

y

_i

= +1

<

0

.

5

y

_i

=

−

1

where

σ

(

f

(

x

)) =

1 1+e−f(x)

The logistic function or sigmoid function

-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 z

σ

(

z

) =

1

1 +

e

−

z

•

As

z

goes from

−∞

to

∞

,

σ

(

z

) goes from 0 to 1, a

“squash-ing function”.

•

It has a “sigmoid” shape (i.e. S-like shape)

(3)

-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

σ

(wx

+

b)

fi

t to

y

-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Intuition – why use a sigmoid?

Here, choose binary classi

fi

cation to be

repre-sented by

y

_i

∈

{

0,

1

}

, rather than

y

_i

∈

{

1,

−

1

}

wx

+

b

fi

t to

y

0 1

• fit of wx + b dominated by more distant points

• causes misclassification

• instead LR regresses the sigmoid to the class data

Least squares fit

0.5 0.5 0 1

Similarly in 2D

LR linear LR linear

σ

(w

₁

x

₁

+

w

₂

x

₂

+

b)

fi

t, vs

w

₁

x

₁

+

w

₂

x

₂

+

b

(4)

In logistic regression fit a sigmoid function to the data {

x

_i

, y

_i

}

by minimizing the classification errors

-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

y

i

−

σ

(

w

>

x

i

)

Learning

-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

A sigmoid favours a larger margin cf a step classifier

Margin property

-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(5)

Probabilistic interpretation

•

Think of

σ

(f

(

x

)) as the

posterior probability

that

y

= 1, i.e.

P

(y

= 1

|

x

) =

σ

(f

(

x

))

•

Hence, if

σ

(f

(

x

))

>

0.5 then class

y

= 1 is selected

•

Then, after a rearrangement

f

(

x

) = log

P

(

y

= 1

|

x

)

1

−

P

(y

= 1

|

x

)

= log

P

(

y

= 1

|

x

)

P

(y

= 0

|

x

)

which is the

log odds ratio

Maximum Likelihood Estimation

Assume

p(y = 1_|

x

;

w

) = σ(

w

>

x

) p(y = 0_|

x

;

w

) = 1 ₋σ(

w

>

x

) write this more compactly as

p(y_|

x

;

w

) = ³σ(

w

>

x

)´y³1₋σ(

w

>

x

)´(1−y)

Then the likelihood (assuming data independence) is p(

y

_|

x

;

w

) _∼ N Y i ³ σ(

w

>

x

_i)´yi³1₋σ(

w

>

x

_i)´(1−yi) and the negative log likelihood is

L(

w

) = ₋

N X

i

(6)

Logistic Regression Loss function

Use notation y_i _∈_{₋1,1_}. Then

P(y = 1_|x) = σ(f(x)) = 1 1 +e−f(x) P(y =₋1_|x) = 1₋σ(f(x)) = 1 1 +e+f(x) So in both cases P(y_i_|x_i) = 1 1 +e−yif(xi)

Assuming independence, the likelihood is N

Y

i

1

1 +e−yif(xi)

and the negative log likelihood is =

N

X

i

log³1 +e−yif(xi)´

which defines the loss function.

Logistic Regression Learning

Learning is formulated as the optimization problem

min

w_∈Rd N X i

log

³

1 +

e

−yif(xi)´

+

λ

||

w

||

2

loss function regularization

• For correctly classified points ₋y_if(

x

_i) is negative, and log³1 + e−yif(xi)´ _{is near zero}

• For incorrectly classified points ₋y_if(

x

_i) is positive, and log³1 + e−yif(xi)´ _{can be large.}

• Hence the optimization penalizes parameters which lead to such misclassifications

(7)

Comparison of SVM and LR cost functions

SVM min w∈RdC N X i max (0,1₋y_if(x_i)) +_||w_||2 Logistic regression: min w∈Rd N X i log³1 + e−yif(xi)´₊_λ_||_w_||2 Note:

• both approximate 0-1 loss

• very similar asymptotic behaviour • main difference is smoothness, and non-zero values outside margin • SVM gives sparsesolution for

α

_i

y_if(x_i)

(8)

Overview

•

AdaBoost is an algorithm for constructing a

strong classifier

out of a linear combination

of simple

weak classifiers

. It provides a method of

choosing the weak classifiers and setting the weights

T

X

t

=1

α

t

h

t

(

x

)

h

t

(

x

)

Terminology

• weak classifier , for data vector

x

• strong classifier

H

(

x

) = sign

T X t=1

α

_t

h

_t

(

x

)

α

t

h

t

(

x

)

∈

{

−

1

,

1

}

Example: combination of linear classifiers

weak classifier 1

h

1

(

x

)

• Note, this linear combination is not a simple majority vote

(it would be if )

• Need to compute as well as selecting weak classifiers

h

t

(

x

)

∈

{

−

1

,

1

}

H

(

x) = sign (

α

₁

h

₁

(

x

) +

α

₂

h

₂

(

x

) +

α

₃

h

₃

(

x))

strong classifier

H

(

x

)

weak classifier 2

h

2

(

x

)

weak classifier 3

h

3

(

x

)

(9)

AdaBoost algorithm – building a strong classifier

For t = 1 …,T

• Select weak classifier with minimum error

• Set

• Reweight examples (

boosting

) to give misclassified

examples more weight

• Add weak classifier with weight

Start with equal weights on each x_i, and a set of weak classifiers

²

t

=

X

i

ω

i

[

h

t

(

x

i

)

6

=

y

i

] where

ω

i

are weights

α

t

=

1

2

ln

1

−

²

t

²

t

ω

t

+1

,i

=

ω

t,i

e

−

α

t

y

i

h

t

(

x

i

)

H

(

x

) = sign

T X t=1

α

_t

h

_t

(

x

)

α

t

h

_t

(

x

)

Example

Weak Classifier 1 Weights Increased Weak classifier 3 Final classifier is linear combination of weak classifiers Weak Classifier 2

start with equal weights on

each data point (i)

²

j

=

X

i

(10)

The AdaBoost algorithm

(Freund & Shapire 1995)

• Given example data (x1, y1), . . . ,(xn, yn), where yi = −1,1 for negative and positive examples

respectively.

• Initialize weights ω1,i = 21m,

1

2l for yi = −1,1 respectively, where m and l are the number of

negatives and positives respectively.

• For t= 1, . . . , T

1. Normalize the weights,

ωt,i← Pnωt,i j=1ωt,j

so that ωt,i is a probability distribution.

2. For each j, train a weak classifier hj with error evaluated with respect to ωt,i, ²j=X

i

ωt,i[hj(xi)6=yi]

3. Choose the classifier, ht, with the lowest error ²t. 4. Setαt as αt= 1 2ln 1₋²t ²t

5. Update the weights

ωt+1,i=ωt,ie−αtyiht(xi)

• The final strong classifier is

H(x) = sign

T

X

t=1

αtht(x)

Why does it work?

The AdaBoost algorithm carries out a greedy optimization

of a loss function

min

α

_i

,h

_i

N

X

i

e

−

y

i

H

(

x

i

)

AdaBoost LR SVM hinge loss y_if(x_i) SVM loss function N X i max (0,1₋y_if(x_i))

Logistic regression loss function

N

X

i

(11)

Sketch derivation –

non-examinable

The objective function used by AdaBoost is

J(H) =X i

e−yiH(xi)

For a correctly classified point the penalty is exp(₋_|H_|) and for an incorrectly classified point the penalty is exp(+_|H|). The AdaBoost algorithm incrementally decreases the cost by adding simple functions to

H(x) =X t

αtht(x)

Suppose that we have a function B and we propose to add the functionαh(x) where the scalar α is to be determined and h(x) is some function that takes values in +1 or ₋1 only. The new function is B(x) +αh(x) and the new cost is

J(B+αh) =X i

e−yiB(xi)_e−αyih(xi)

Diﬀerentiating with respect to α and setting the result to zero gives

e−α X

yi=h(xi)

e−yiB(xi)−_e+α X

yi6=h(xi)

e−yiB(xi)_{= 0}

Rearranging, the optimal value of α is therefore determined to be

α=1 2log P yi=h(xi)e −yiB(xi) P yi6=h(xi)e −yiB(xi)

The classification error is defined as

²=X i ωi[h(xi)6=yi] where ωi = e −yiB(xi) P je−yjB(xj)

Then, it can be shown that,

α= 1 2log

1₋² ²

The update from B to H therefore involves evaluating the weighted performance (with the weights ωi given above) ² of the “weak” classifier h.

If the current function B is B(x) = 0 then the weights will be uniform. This is a common starting point for the minimization. As a numerical convenience, note that at the next round of boosting the required weights are obtained by multiplying the old weights with exp(₋αyih(xi)) and then normalizing. This gives the update formula

ωt+1,i=

1

Zt

ωt,ie−αtyiht(xi)

where Zt is a normalizing factor.

Choosing h The function h is not chosen arbitrarily but is chosen to give a good performance (low value of ²) on the training data weighted by ω.

(12)

Optimization

SVM min w∈RdC N X i max (0,1₋y_if(x_i)) +||w_||2 Logistic regression: min w∈Rd N X i log³1 +e−yif(xi)´₊_λ_||_w_||2

We have seen many cost functions, e.g.

• Do these have a unique solution?

• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?

local minimum

global minimum

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case)

(13)

Convex functions

Convex function examples

convex Not convex

(14)

Logistic regression: min w∈Rd N X i log³1 +e−yif(xi)´₊_λ_||_w_||2 SVM min w∈RdC N X i max (0,1₋y_if(x_i)) +_||w_||2

+

convex convex

Gradient (or Steepest) descent algorithms

To minimize a cost function _C(w) use the iterative update

w_t₊₁ _← wt −ηt∇wC(wt) where η is the learning rate.

In our case the loss function is a sum over the training data. For example for LR min w∈RdC(w) = N X i log³1 +e−yif(xi)´₊_λ_||_w_||2 ₌ N X i L(x_i, y_i;w) +λ_||w_||2

This means that one iterative update consists of a pass through the training data with an update for each point

w_t₊₁ _← wt−( N

X

i

η_t_∇wL(xi, yi;wt) + 2λwt)

The advantage is that for large amounts of data, this can be carried out point by point.

(15)

Note:

• this is similar, but not identical, to the perceptron update rule.

• there is a unique solution for

• in practice more efficient Newton methods are used to minimize L

• there can be problems with becoming infinite for linearly separable data

w

[exercise]

Minimizing

L

(

w

) using gradient descent gives

the update rule

w

←

w

−

η

(

y

_i

−

σ

(

w

>

x

_i

))

x

_i

where

y

_i

∈

{

0

,

1

}

w

Gradient descent algorithm for LR

w

←

w

−

η

sign(

w

>

x

i

)

x

i

Gradient descent algorithm for SVM

First, rewrite the optimization problem as an average

min w C(w) = λ 2||w|| 2₊ 1 N N X i max (0,1₋y_if(x_i)) = 1 N N X i µ_λ 2||w|| 2_{+ max (0}_,₁₋_y if(xi)) ¶

(with λ = 2/(N C) up to an overall scale of the problem) and f(x) = w>x+b

Because the hinge loss is not diﬀerentiable, a sub-gradient is computed

(16)

Sub-gradient for hinge loss

L

(

x

_i

, y

_i

;

w

) = max (0

,

1

−

y

_i

f

(

x

_i

))

f

(

x

_i

) =

w

>

x

_i

+

b

y

i

f

(

x

i

)

∂_L ∂

w

= −yi

x

i ∂_L ∂

w

= 0

Sub-gradient descent algorithm for SVM

C(w) = 1 N N X i µ_λ 2||w|| 2₊ L(x_i, y_i;w) ¶

The iterative update is

w_t+1 ← wt−η∇wtC(wt) ← wt−η 1 N N X i (λwt+∇wL(xi, yi;wt))

where η is the learning rate.

Then each iteration t involves cycling through the training data with the updates:

w_t+1 ← (1−ηλ)wt+ηyixi if yi(w>xi+b)< 1

(17)

Multi-class Classification

Multi-Class Classification – what we would like

Assign input vector to one of K classes

Goal:

a decision rule that divides input space into K

decision regions

separated by

decision boundaries

(18)

Reminder: K Nearest Neighbour (K-NN) Classifier

Algorithm

• For each test point, x, to be classified, find the K nearest samples in the training data

• Classify the point, x, according to the majority vote of their class labels

e.g. K = 3

• naturally applicable to multi-class case

Build from binary classifiers …

• Learn:K two-class 1 vs the rest classifiers f_k(x)

C₁ C₂ C₃ 1 vs 2 & 3 3 vs 1 & 2 2 vs 1 & 3

?

(19)

Build from binary classifiers …

• Learn:K two-class 1 vs the rest classifiers f_k(x) • Classification: choose class with most positive score

C₁ C₂ C₃ 1 vs 2 & 3 3 vs 1 & 2 2 vs 1 & 3

max

k

f

k

(

x

)

Application: hand written digit recognition

• Feature vectors: each image is 28 x 28 pixels. Rearrange as a 784-vector x

• Training: learn k=10 two-class 1 vs the rest SVM classifiers f_k(x)

• Classification:choose class with most positive score

f

(

x

) = max

(20)

0 5 5 2 5 3 5 4 5 5

1

2

3

4

5

6

7

8

9

0

1

2

3

4

5

Example

hand drawn classification

Background reading and more

• Other multiple-class classifiers (not covered here):

• Neural networks

• Random forests

• Bishop, chapters 4.1 – 4.3 and 14.3

• Hastie et al, chapters 10.1 – 10.6

• More on web page: