Lecture 4: More classifiers and classes
C4B Machine Learning Hilary 2011 A. Zisserman
•
Logistic regression
• Loss functions revisited
• Adaboost
• Loss functions revisited
•
Optimization
•
Multiple class classification
Overview
•
Logistic regression is actually a classi
fi
cation method
•
LR introduces an extra non-linearity over a linear
classi
fi
er,
f
(
x
) =
w
>x
+
b
, by using a logistic (or
sigmoid) function,
σ
().
•
The LR classi
fi
er is de
fi
ned as
σ
(
f
(
x
i))
(≥
0
.
5
y
i= +1
<
0
.
5
y
i=
−
1
where
σ
(
f
(
x
)) =
1 1+e−f(x)The logistic function or sigmoid function
-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 z
σ
(
z
) =
1
1 +
e
−
z
•
As
z
goes from
−∞
to
∞
,
σ
(
z
) goes from 0 to 1, a
“squash-ing function”.
•
It has a “sigmoid” shape (i.e. S-like shape)
-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
σ
(wx
+
b)
fi
t to
y
-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Intuition – why use a sigmoid?
Here, choose binary classi
fi
cation to be
repre-sented by
y
i∈
{
0,
1
}
, rather than
y
i∈
{
1,
−
1
}
wx
+
b
fi
t to
y
0 1
• fit of wx + b dominated by more distant points
• causes misclassification
• instead LR regresses the sigmoid to the class data
Least squares fit
0.5 0.5 0 1
Similarly in 2D
LR linear LR linearσ
(w
1x
1+
w
2x
2+
b)
fi
t, vs
w
1x
1+
w
2x
2+
b
In logistic regression fit a sigmoid function to the data {
x
i, y
i}
by minimizing the classification errors
-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y
i
−
σ
(
w
>
x
i
)
Learning
-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1A sigmoid favours a larger margin cf a step classifier
Margin property
-20 -15 -10 -5 0 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Probabilistic interpretation
•
Think of
σ
(f
(
x
)) as the
posterior probability
that
y
= 1, i.e.
P
(y
= 1
|
x
) =
σ
(f
(
x
))
•
Hence, if
σ
(f
(
x
))
>
0.5 then class
y
= 1 is selected
•
Then, after a rearrangement
f
(
x
) = log
P
(
y
= 1
|
x
)
1
−
P
(y
= 1
|
x
)
= log
P
(
y
= 1
|
x
)
P
(y
= 0
|
x
)
which is the
log odds ratio
Maximum Likelihood Estimation
Assume
p(y = 1|
x
;w
) = σ(w
>x
) p(y = 0|x
;w
) = 1 −σ(w
>x
) write this more compactly asp(y|
x
;w
) = ³σ(w
>x
)´y³1−σ(w
>x
)´(1−y)Then the likelihood (assuming data independence) is p(
y
|x
;w
) ∼ N Y i ³ σ(w
>x
i)´yi³1−σ(w
>x
i)´(1−yi) and the negative log likelihood isL(
w
) = −N X
i
Logistic Regression Loss function
Use notation yi ∈{−1,1}. Then
P(y = 1|x) = σ(f(x)) = 1 1 +e−f(x) P(y =−1|x) = 1−σ(f(x)) = 1 1 +e+f(x) So in both cases P(yi|xi) = 1 1 +e−yif(xi)
Assuming independence, the likelihood is N
Y
i
1
1 +e−yif(xi)
and the negative log likelihood is =
N
X
i
log³1 +e−yif(xi)´
which defines the loss function.
Logistic Regression Learning
Learning is formulated as the optimization problem
min
w∈Rd N X ilog
³1 +
e
−yif(xi)´+
λ
||
w
||
2loss function regularization
• For correctly classified points −yif(
x
i) is negative, and log³1 + e−yif(xi)´ is near zero• For incorrectly classified points −yif(
x
i) is positive, and log³1 + e−yif(xi)´ can be large.• Hence the optimization penalizes parameters which lead to such misclassifications
Comparison of SVM and LR cost functions
SVM min w∈RdC N X i max (0,1−yif(xi)) +||w||2 Logistic regression: min w∈Rd N X i log³1 + e−yif(xi)´+λ||w||2 Note:• both approximate 0-1 loss
• very similar asymptotic behaviour • main difference is smoothness, and non-zero values outside margin • SVM gives sparsesolution for
α
iyif(xi)
Overview
•
AdaBoost is an algorithm for constructing a
strong classifier
out of a linear combination
of simple
weak classifiers
. It provides a method of
choosing the weak classifiers and setting the weights
T
X
t
=1
α
t
h
t
(
x
)
h
t(
x
)
Terminology
• weak classifier , for data vector
x
• strong classifier
H
(
x
) = sign
T X t=1
α
th
t(
x
)
α
t
h
t
(
x
)
∈
{
−
1
,
1
}
Example: combination of linear classifiers
weak classifier 1
h
1
(
x
)
• Note, this linear combination is not a simple majority vote
(it would be if )
• Need to compute as well as selecting weak classifiers
h
t
(
x
)
∈
{
−
1
,
1
}
H
(
x) = sign (
α
1h
1(
x
) +
α
2h
2(
x
) +
α
3h
3(
x))
strong classifierH
(
x
)
weak classifier 2h
2
(
x
)
weak classifier 3h
3
(
x
)
AdaBoost algorithm – building a strong classifier
For t = 1 …,T
• Select weak classifier with minimum error
• Set
• Reweight examples (
boosting
) to give misclassified
examples more weight
• Add weak classifier with weight
Start with equal weights on each xi, and a set of weak classifiers
²
t
=
X
i
ω
i
[
h
t
(
x
i
)
6
=
y
i
] where
ω
i
are weights
α
t=
1
2
ln
1
−
²
t²
tω
t
+1
,i
=
ω
t,i
e
−
α
ty
ih
t(
x
i)
H
(
x
) = sign
T X t=1α
th
t(
x
)
α
t
h
t(
x
)
Example
Weak Classifier 1 Weights Increased Weak classifier 3 Final classifier is linear combination of weak classifiers Weak Classifier 2start with equal weights on
each data point (i)
²
j
=
X
i
The AdaBoost algorithm
(Freund & Shapire 1995)
• Given example data (x1, y1), . . . ,(xn, yn), where yi = −1,1 for negative and positive examples
respectively.
• Initialize weights ω1,i = 21m,
1
2l for yi = −1,1 respectively, where m and l are the number of
negatives and positives respectively.
• For t= 1, . . . , T
1. Normalize the weights,
ωt,i← Pnωt,i j=1ωt,j
so that ωt,i is a probability distribution.
2. For each j, train a weak classifier hj with error evaluated with respect to ωt,i, ²j=X
i
ωt,i[hj(xi)6=yi]
3. Choose the classifier, ht, with the lowest error ²t. 4. Setαt as αt= 1 2ln 1−²t ²t
5. Update the weights
ωt+1,i=ωt,ie−αtyiht(xi)
• The final strong classifier is
H(x) = sign
T
X
t=1
αtht(x)
Why does it work?
The AdaBoost algorithm carries out a greedy optimization
of a loss function
min
α
i,h
iN
X
i
e
−
y
iH
(
x
i)
AdaBoost LR SVM hinge loss yif(xi) SVM loss function N X i max (0,1−yif(xi))Logistic regression loss function
N
X
i
Sketch derivation –
non-examinable
The objective function used by AdaBoost isJ(H) =X i
e−yiH(xi)
For a correctly classified point the penalty is exp(−|H|) and for an incorrectly classified point the penalty is exp(+|H|). The AdaBoost algorithm incrementally decreases the cost by adding simple functions to
H(x) =X t
αtht(x)
Suppose that we have a function B and we propose to add the functionαh(x) where the scalar α is to be determined and h(x) is some function that takes values in +1 or −1 only. The new function is B(x) +αh(x) and the new cost is
J(B+αh) =X i
e−yiB(xi)e−αyih(xi)
Differentiating with respect to α and setting the result to zero gives
e−α X
yi=h(xi)
e−yiB(xi)−e+α X
yi6=h(xi)
e−yiB(xi)= 0
Rearranging, the optimal value of α is therefore determined to be
α=1 2log P yi=h(xi)e −yiB(xi) P yi6=h(xi)e −yiB(xi)
The classification error is defined as
²=X i ωi[h(xi)6=yi] where ωi = e −yiB(xi) P je−yjB(xj)
Then, it can be shown that,
α= 1 2log
1−² ²
The update from B to H therefore involves evaluating the weighted performance (with the weights ωi given above) ² of the “weak” classifier h.
If the current function B is B(x) = 0 then the weights will be uniform. This is a common starting point for the minimization. As a numerical convenience, note that at the next round of boosting the required weights are obtained by multiplying the old weights with exp(−αyih(xi)) and then normalizing. This gives the update formula
ωt+1,i=
1
Zt
ωt,ie−αtyiht(xi)
where Zt is a normalizing factor.
Choosing h The function h is not chosen arbitrarily but is chosen to give a good performance (low value of ²) on the training data weighted by ω.
Optimization
SVM min w∈RdC N X i max (0,1−yif(xi)) +||w||2 Logistic regression: min w∈Rd N X i log³1 +e−yif(xi)´+λ||w||2We have seen many cost functions, e.g.
• Do these have a unique solution?
• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?
local minimum
global minimum
If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case)
Convex functions
Convex function examples
convex Not convex
Logistic regression: min w∈Rd N X i log³1 +e−yif(xi)´+λ||w||2 SVM min w∈RdC N X i max (0,1−yif(xi)) +||w||2
+
+
convex convexGradient (or Steepest) descent algorithms
To minimize a cost function C(w) use the iterative update
wt+1 ← wt −ηt∇wC(wt) where η is the learning rate.
In our case the loss function is a sum over the training data. For example for LR min w∈RdC(w) = N X i log³1 +e−yif(xi)´+λ||w||2 = N X i L(xi, yi;w) +λ||w||2
This means that one iterative update consists of a pass through the training data with an update for each point
wt+1 ← wt−( N
X
i
ηt∇wL(xi, yi;wt) + 2λwt)
The advantage is that for large amounts of data, this can be carried out point by point.
Note:
• this is similar, but not identical, to the perceptron update rule.
• there is a unique solution for
• in practice more efficient Newton methods are used to minimize L
• there can be problems with becoming infinite for linearly separable data
w
[exercise]
Minimizing
L
(
w
) using gradient descent gives
the update rule
w
←
w
−
η
(
y
i−
σ
(
w
>x
i))
x
iwhere
y
i∈
{
0
,
1
}
w
Gradient descent algorithm for LR
w
←
w
−
η
sign(
w
>
x
i
)
x
i
Gradient descent algorithm for SVM
First, rewrite the optimization problem as an average
min w C(w) = λ 2||w|| 2+ 1 N N X i max (0,1−yif(xi)) = 1 N N X i µλ 2||w|| 2+ max (0,1−y if(xi)) ¶
(with λ = 2/(N C) up to an overall scale of the problem) and f(x) = w>x+b
Because the hinge loss is not differentiable, a sub-gradient is computed
Sub-gradient for hinge loss
L
(
x
i, y
i;
w
) = max (0
,
1
−
y
if
(
x
i))
f
(
x
i) =
w
>x
i+
b
y
i
f
(
x
i
)
∂L ∂w
= −yix
i ∂L ∂w
= 0Sub-gradient descent algorithm for SVM
C(w) = 1 N N X i µλ 2||w|| 2+ L(xi, yi;w) ¶
The iterative update is
wt+1 ← wt−η∇wtC(wt) ← wt−η 1 N N X i (λwt+∇wL(xi, yi;wt))
where η is the learning rate.
Then each iteration t involves cycling through the training data with the updates:
wt+1 ← (1−ηλ)wt+ηyixi if yi(w>xi+b)< 1
Multi-class Classification
Multi-Class Classification – what we would like
Assign input vector to one of K classes
Goal:
a decision rule that divides input space into K
decision regions
separated by
decision boundaries
Reminder: K Nearest Neighbour (K-NN) Classifier
Algorithm
• For each test point, x, to be classified, find the K nearest samples in the training data
• Classify the point, x, according to the majority vote of their class labels
e.g. K = 3
• naturally applicable to multi-class case
Build from binary classifiers …
• Learn:K two-class 1 vs the rest classifiers fk(x)C1 C2 C3 1 vs 2 & 3 3 vs 1 & 2 2 vs 1 & 3
?
Build from binary classifiers …
• Learn:K two-class 1 vs the rest classifiers fk(x) • Classification: choose class with most positive scoreC1 C2 C3 1 vs 2 & 3 3 vs 1 & 2 2 vs 1 & 3
max
k
f
k
(
x
)
Application: hand written digit recognition
• Feature vectors: each image is 28 x 28 pixels. Rearrange as a 784-vector x
• Training: learn k=10 two-class 1 vs the rest SVM classifiers fk(x)
• Classification:choose class with most positive score
f
(
x
) = max
0 5 5 2 5 3 5 4 5 5
1
2
3
4
5
6
7
8
9
0
1
2
3
4
5
Example
hand drawn classificationBackground reading and more
• Other multiple-class classifiers (not covered here):
• Neural networks
• Random forests
• Bishop, chapters 4.1 – 4.3 and 14.3
• Hastie et al, chapters 10.1 – 10.6
• More on web page: