Probabilistic Linear Classifier: Logistic Regression. CS534-Machine Learning

(1)

Probabilistic Linear Classifier:

Logistic Regression

(2)

Three Main Approaches

to learning a Classifier

• Learn a classifier: a function

f

ŷ

=

f

(

x

)

• Learn a classifier: a function

f,

ŷ

=

f

(

x

)

• Learn a probabilistic discriminative

model, i.e.,

the conditional distribution P(

y

|

x

)

the conditional distribution P(

y

|

x

)

• Learn a probabilistic generative

model, i.e., the

joint probability distribution: P(

x

,

y

)

• Examples:

– Learn a classifier: Perceptron, LDA (projection with threshold view)

threshold view)

– Learn a conditional distribution: Logistic regression

– Learn the joint distribution: a probabilistic view of j p Linear Discriminant Analysis (LDA)

(3)

Notation Shift

• S={(

S {(

x

i

,

y

i

):

i 1,…, N

i=1,…, N

} --- superscript for example

}

superscript for example

index.

N

is the total number of examples

• Subscript for element index within a vector, i.e.,

p

,

x

i

j

represents the

j

th element of the

i

th training

example

• Class labels are 0 and 1 (not +1 and -1)

(4)

Logistic Regression

• Given training set D logistic regression learns the

• Given training set D, logistic regression learns the

conditional distribution P(

y

|

x

)

• We will assume only two classes

y

= 0 and

y

= 1

• We will assume only two classes

y

= 0 and

y

= 1

and a parametric form for P(y = 1 | x, w) were w is

the parameter vector

)

(

1 )

;

|

0 (

1

1 )

(

)

;

|

1 (

₁

x

w

x

w

x

_w _x

p

y

p

e

p

y

p

=

+

=

₋ _⋅

• It is easy to show that this is equivalent to

)

(

1 )

;

|

0 (

y

x

w

p

₁

x

p

=

−

x

w

x

w

x

₌

_⋅

=

)

;

|

0 (

)

;

|

1 (

log

y

p

y

p

(5)

Why the Logistic (Sigmoid) Function

)

exp(

1

1 )

(

x

w

x,

⋅

−

+

=

g

)

p(

x

w

⋅

A linear function has a range from , the logistic function transforms the range to [0,1] to be a

b bilit ] , [−∞ ∞ 5 probability.

(6)

Logistic Regression Yields Linear Classifier

R ll th t i P( | ) di t ŷ 1 if th t d • Recall that given P(y | x) we predict ŷ = 1 if the expected

loss of predicting 0 is greater than predicting 1 (for now assume L(0,1) = L(1,0))

) , 1 ( ) | ( ) , 0 ( ) | ( )] , 1 ( [ )] , 0 ( [ _| | ⇔ > ⇔ >

∑

P y x L y P y x L y y L E y L E y y x y x y ) | 0 ( ) | 1 ( ) | 1 ( ) | 0 ( ) | 1 ( ) | 0 ( ₀₀ ₀₁ ₁₀ ₁₁ ⇔ = > = ⇔ = + = > = + =

∑

x x x x x x y P y P L y P L y P L y P L y P y y 0 0 ) | 0 ( ) | 1 ( log 1 ) | 0 ( ) | 1 ( > ⇔ > = = ⇔ > = = x w x x x x y P y P y P y P • This assumed L(0,1)=L(1,0) 0 > ⋅x w 6

• A similar derivation can be done for arbitrary L(0,1) and L(1,0).

(7)

Maximum Likelihood Learning

• Recall that the likelihood function is the probability of the data D given the parameters – p(D|w)

It is a function of the parameters • It is a function of the parameters

• Maximum likelihood learning finds the parameters that maximize this likelihood function

• A common trick is to work with log-likelihood, i.e., take the logarithm of the likelihood function – log p(D|w)

(8)

Computing the Likelihood

• In our framework, we assume each training example (xi , yi) is drawn

independently from the same (but unknown) distribution P( x ,y ) (the

famous i i d assumption) hence we can write

famous i.i.d assumption), hence we can write

∑

= ∏ = i i i i i i y P y P D P( | ) log ( | ) log ( | ) log w x , w x , w

• Joint distribution P(a,b) can be factored as P(a | b)P(b)

) | , ( log max arg ) | ( log max arg P D w =

∑

P xi yi w

• Further P(x | w) = P(x) because it does not depend on w so:

) | ( ) , | ( log max arg x w x w w w i i i i i w P y P

∑

=

• Further, P(x | w) = P(x) because it does not depend on w, so:

∑

= i i i y P D

P( | ) argmax log ( | )

log max

arg w x ,w

w

(9)

Computing the Likelihood

Recall

∑

= i i i y P D

P( | ) arg max log ( | , ) log max arg w x w w w Recall y g y p y e g y p ˆ 1 ) ( 1 ) | 0 ( ˆ 1 1 ) , ( ) , | 1 ( − = − = = = + = = = ₋ _⋅ w x w x w x w x x w

This can be compactly written as

y g

y

p( = 0| x,w) =1 (x,w) =1

We will take our learning objective function to be:

) 1 (

)

ˆ

1 (

ˆ

)

,

|

(

_y

i i

_y

yi

_y

yi

p

x

w

=

−

We will take our learning objective function to be:

∑

= = i i i y P D P L(w) log ( | w) log ( | x ,w) 9

∑

+ − − = i i i i i _y _y _y y log ˆ (1 )log(1 ˆ )] [

(10)

Fitting Logistic Regression by Gradient Ascent

∑

_P i i i i i i L( ) =

∑

l ( | ) =

∑

[ l ˆ +(1− )l (1− ˆ )] i i i i i i i i _y _y _y _y y P

L(w) log ( | x ,w) [ log ˆ (1 )log(1 ˆ )]

i i i i i i y y y y w w y P w L − − + ∂ ∂ = ∂ ∂ = ∂ ∂ )] ˆ 1 log( ) 1 ( ˆ log [ ) , | ( log ) (w x w j i i i i i j i i i j i i i j j j w y y y y y w y y y w y y y w w w ∂ ∂ ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ − − − = ∂ ∂ − − − + ∂ ∂ = ∂ ∂ ∂ ˆ ˆ 1 1 ˆ ) ˆ ( ˆ 1 1 ) ˆ ( ˆ j i i i i i j i i i i i i i i i j j j w y y y y y w y y y y y y y y y ∂ ∂ ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ − − = ∂ ∂ ⎥ ⎦ ⎤ ⎢ ⎣ ⎡ − + − − = ⎦ ⎣ ˆ ) ˆ 1 ( ˆ ˆ ˆ ) ˆ 1 ( ˆ ˆ ˆ ˆ Recall that ) exp( 1 1 ) , ( ˆ ˆ i i i _y y x w w x ⋅ − + = = )) ( 1 )( ( ) exp( (t) have we ) exp( 1 1 ) ( for t t t g − ′ − + = )) ( 1 )( ( )) exp( 1 ( ) p( (t) g 2 g t g t -t = − + = ′ So i j i i j i i i j i x y y w y y w y ) ˆ 1 ( ˆ ) ( ) ˆ 1 ( ˆ ˆ − = ∂ ⋅ ∂ − = ∂ ∂ w x N 10

∑

= − = ∂ ∂ N i j i i _y _x y w L ) ˆ ( ) (w

∑

= − = ∇_L₍_w₎ N ₍_yi _y_ˆi₎_xi

(11)

Batch Gradient Ascent for LR

w x ← = i i_{, y} _{, i} _N 0) (0 0 0 Let ,..., 1 ) ( examples training : Given d w ← ← ...,0) (0,0,0, e convergenc until Repeat ...,0) (0,0,0, Let x w + ← = − · i e y N i i ) 1 1 do to 1 For d x d d = + ⋅ − =+ i i i error y y error e ) 1

• Online gradient ascent algorithm can be easily constructed

d w

w ← +

η

(12)

Connection Between

L

i ti R

i

& P

t

Al

ith

Logistic Regression & Perceptron Algorithm

If we replace the logistic function with a step function: If we replace the logistic function with a step function:

x w w

x

=

+

− ⋅

e

h

1

1 )

(

⎨

⎧

⋅

>

=

1 if

0 )

(

x

w

x

h

+

e

1 ⎩

⎨

otherwise

0 )

(

x

w

h

:

rule

updating

same

the

uses

algorithms

Both

i i i

_h

y

x

w

=

+

η

(

−

_w

(

))

(13)

Multi-Class Cases

• Choose class K to be the “reference class” and represent each of the other classes as a logistic function of the odds of class k versus class K:

of class k versus class K:

x x w x x = ⋅ = = = 1 ) | 2 ( l ) | ( ) | 1 ( log y P K y P y P x x w x x = ⋅ = = 2 ) | 1 ( ) | ( ) | 2 ( log K y P K y P y P M x w x x ₌ _⋅ = − = −1 ) | ( ) | 1 ( log _K K y P K y P