Probabilistic Linear Classifier:
Logistic Regression
Three Main Approaches
to learning a Classifier
• Learn a classifier: a function
f
ŷ
=
f
(
x
)
• Learn a classifier: a function
f,
ŷ
=
f
(
x
)
• Learn a probabilistic discriminative
model, i.e.,
the conditional distribution P(
y
|
x
)
the conditional distribution P(
y
|
x
)
• Learn a probabilistic generative
model, i.e., the
joint probability distribution: P(
x
,
y
)
• Examples:
– Learn a classifier: Perceptron, LDA (projection with threshold view)
threshold view)
– Learn a conditional distribution: Logistic regression
– Learn the joint distribution: a probabilistic view of j p Linear Discriminant Analysis (LDA)
Notation Shift
Notation Shift
• S={(
S {(
x
x
i,
,
y
y
i):
):
i 1,…, N
i=1,…, N
} --- superscript for example
}
superscript for example
index.
N
is the total number of examples
• Subscript for element index within a vector, i.e.,
p
,
,
x
ij
represents the
j
th element of the
i
th training
example
• Class labels are 0 and 1 (not +1 and -1)
Logistic Regression
• Given training set D logistic regression learns the
• Given training set D, logistic regression learns the
conditional distribution P(
y
|
x
)
• We will assume only two classes
y
= 0 and
y
= 1
• We will assume only two classes
y
= 0 and
y
= 1
and a parametric form for P(y = 1 | x, w) were w is
the parameter vector
)
(
1
)
;
|
0
(
1
1
)
(
)
;
|
1
(
1x
w
x
x
w
x
w xp
y
p
e
p
y
p
=
=
+
=
=
=
− ⋅• It is easy to show that this is equivalent to
)
(
1
)
;
|
0
(
y
x
w
p
1x
p
=
=
−
x
w
w
x
w
x
=
⋅
=
=
)
;
|
0
(
)
;
|
1
(
log
y
p
y
p
Why the Logistic (Sigmoid) Function
)
exp(
1
1
)
(
x
w
w
x,
⋅
−
+
=
g
)
p(
x
w
⋅
A linear function has a range from , the logistic function transforms the range to [0,1] to be a
b bilit ] , [−∞ ∞ 5 probability.
Logistic Regression Yields Linear Classifier
R ll th t i P( | ) di t ŷ 1 if th t d • Recall that given P(y | x) we predict ŷ = 1 if the expected
loss of predicting 0 is greater than predicting 1 (for now assume L(0,1) = L(1,0))
) , 1 ( ) | ( ) , 0 ( ) | ( )] , 1 ( [ )] , 0 ( [ | | ⇔ > ⇔ >
∑
∑
P y x L y P y x L y y L E y L E y y x y x y ) | 0 ( ) | 1 ( ) | 1 ( ) | 0 ( ) | 1 ( ) | 0 ( 00 01 10 11 ⇔ = > = ⇔ = + = > = + =∑
∑
x x x x x x y P y P L y P L y P L y P L y P y y 0 0 ) | 0 ( ) | 1 ( log 1 ) | 0 ( ) | 1 ( > ⇔ > = = ⇔ > = = x w x x x x y P y P y P y P • This assumed L(0,1)=L(1,0) 0 > ⋅x w 6• A similar derivation can be done for arbitrary L(0,1) and L(1,0).
Maximum Likelihood Learning
Maximum Likelihood Learning
• Recall that the likelihood function is the probability of the data D given the parameters – p(D|w)
It is a function of the parameters • It is a function of the parameters
• Maximum likelihood learning finds the parameters that maximize this likelihood function
• A common trick is to work with log-likelihood, i.e., take the logarithm of the likelihood function – log p(D|w)
Computing the Likelihood
• In our framework, we assume each training example (xi , yi) is drawn
independently from the same (but unknown) distribution P( x ,y ) (the
famous i i d assumption) hence we can write
famous i.i.d assumption), hence we can write
∑
= ∏ = i i i i i i y P y P D P( | ) log ( | ) log ( | ) log w x , w x , w• Joint distribution P(a,b) can be factored as P(a | b)P(b)
) | , ( log max arg ) | ( log max arg P D w =
∑
P xi yi w• Further P(x | w) = P(x) because it does not depend on w so:
) | ( ) , | ( log max arg x w x w w w i i i i i w P y P
∑
=• Further, P(x | w) = P(x) because it does not depend on w, so:
∑
= i i i y P DP( | ) argmax log ( | )
log max
arg w x ,w
w
Computing the Likelihood
Recall∑
= i i i y P DP( | ) arg max log ( | , ) log max arg w x w w w Recall y g y p y e g y p ˆ 1 ) ( 1 ) | 0 ( ˆ 1 1 ) , ( ) , | 1 ( − = − = = = + = = = − ⋅ w x w x w x w x x w
This can be compactly written as
y g
y
p( = 0| x,w) =1 (x,w) =1
We will take our learning objective function to be:
) 1 (
)
ˆ
1
(
ˆ
)
,
|
(
y
i iy
yiy
yip
x
w
=
−
−We will take our learning objective function to be:
∑
∑
= = i i i y P D P L(w) log ( | w) log ( | x ,w) 9∑
+ − − = i i i i i y y y y log ˆ (1 )log(1 ˆ )] [Fitting Logistic Regression by Gradient Ascent
∑
∑
P i i i i i i L( ) =∑
l ( | ) =∑
[ l ˆ +(1− )l (1− ˆ )] i i i i i i i i y y y y y PL(w) log ( | x ,w) [ log ˆ (1 )log(1 ˆ )]
i i i i i i y y y y w w y P w L − − + ∂ ∂ = ∂ ∂ = ∂ ∂ )] ˆ 1 log( ) 1 ( ˆ log [ ) , | ( log ) (w x w j i i i i i j i i i j i i i j j j w y y y y y w y y y w y y y w w w ∂ ∂ ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ − − − = ∂ ∂ − − − + ∂ ∂ = ∂ ∂ ∂ ˆ ˆ 1 1 ˆ ) ˆ ( ˆ 1 1 ) ˆ ( ˆ j i i i i i j i i i i i i i i i j j j w y y y y y w y y y y y y y y y ∂ ∂ ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ − − = ∂ ∂ ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ − + − − = ⎦ ⎣ ˆ ) ˆ 1 ( ˆ ˆ ˆ ) ˆ 1 ( ˆ ˆ ˆ ˆ Recall that ) exp( 1 1 ) , ( ˆ ˆ i i i y y x w w x ⋅ − + = = )) ( 1 )( ( ) exp( (t) have we ) exp( 1 1 ) ( for t t t g − ′ − + = )) ( 1 )( ( )) exp( 1 ( ) p( (t) g 2 g t g t -t = − + = ′ So i j i i j i i i j i x y y w y y w y ) ˆ 1 ( ˆ ) ( ) ˆ 1 ( ˆ ˆ − = ∂ ⋅ ∂ − = ∂ ∂ w x N 10
∑
= − = ∂ ∂ N i j i i y x y w L ) ˆ ( ) (w∑
= − = ∇L(w) N (yi yˆi)xiBatch Gradient Ascent for LR
w x ← = i i , y , i N 0) (0 0 0 Let ,..., 1 ) ( examples training : Given d w ← ← ...,0) (0,0,0, e convergenc until Repeat ...,0) (0,0,0, Let x w + ← = − · i e y N i i ) 1 1 do to 1 For d x d d = + ⋅ − =+ i i i error y y error e ) 1• Online gradient ascent algorithm can be easily constructed
d w
w ← +
η
Connection Between
L
i ti R
i
& P
t
Al
ith
Logistic Regression & Perceptron Algorithm
If we replace the logistic function with a step function: If we replace the logistic function with a step function:
x w w
x
=
+
− ⋅e
h
1
1
)
(
⎨
⎧
⋅
>
=
1
if
0
)
(
x
w
x
h
+
e
1
⎩
⎨
otherwise
0
)
(
x
wh
:
rule
updating
same
the
uses
algorithms
Both
i i ih
y
x
x
w
w
=
+
η
(
−
w(
))
Multi-Class Cases
• Choose class K to be the “reference class” and represent each of the other classes as a logistic function of the odds of class k versus class K:
of class k versus class K:
x x w x x = ⋅ = = = 1 ) | 2 ( l ) | ( ) | 1 ( log y P K y P y P x x w x x = ⋅ = = 2 ) | 1 ( ) | ( ) | 2 ( log K y P K y P y P M x w x x = ⋅ = − = −1 ) | ( ) | 1 ( log K K y P K y P
• Gradient ascent can be applied to
simultaneously train all weight vectors
w
kMulti-Class Cases
• Conditional probability for class k
≠
K can be
computed as
computed as
)
(
1
)
exp(
)
|
(
=
x
=
1w
⋅
x
∑
K− kk
y
P
F
l
K th
diti
l
b bilit i
)
exp(
1
1 1w
⋅
x
+
∑
K= l l• For class K, the conditional probability is
1
)
|
(
K
P
)
exp(
1
1
)
|
(
1 1w
x
x
⋅
+
=
=
∑
− = l K lK
y
P
Summary of Logistic Regression
y
g
g
• Learns conditional probability distribution
p
y
P(
y
|
x
)
• Local Search
Local Search
– begins with initial weight vector. Modifies it
iteratively to maximize the log likelihood of the
iteratively to maximize the log likelihood of the
data
• Online or Batch
Online or Batch
– both online and batch variants of the
algorithm exist
15