• No results found

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

N/A
N/A
Protected

Academic year: 2022

Share "CS 688 Pattern Recognition Lecture 4. Linear Models for Classification"

Copied!
21
0
0

Loading.... (view fulltext now)

Full text

(1)

CS 688 – Pattern Recognition Lecture 4

Linear Models for Classification

  Probabilistic generative models

  Probabilistic discriminative models

Probabilistic Generative Models

(2)

Generative Approach (

Ck

)

p x| Ck

( )

Ck

p

( ) ( ) ( )

( )

x

x x

p C p C C p

p k | k k

| =

( )

=

∑ ( ) ( )

k

k

k pC

C p

p x x|

Probabilistic Generative Models

Two class case:

logistic sigmoid function

(3)

Probabilistic Generative Models

K

>2 classes:

softmax function

Probabilistic Generative Models

Lets assume the class-conditional densities are Gaussian with the same covariance matrix:

Two class case first. We can show the following result:

(4)

Probabilistic Generative Models

Probabilistic Generative Models

(5)

Probabilistic Generative Models

We have shown:

Decision boundary:

Probabilistic Generative Models

(6)

Probabilistic Generative Models

K

>2 classes:

We can show the following result:

Probabilistic Generative Models

The cancellation of the quadratic terms is due to the assumption of shared covariances.

If we allow each class conditional density to have its own covariance matrix, the cancellations no longer occur, and we obtain a quadratic function of

x

, i.e. a quadratic discriminant.

(7)

Probabilistic Generative Models

Maximum likelihood solution

We have a parametric functional form for the class-conditional densities:

We can estimate the parameters and the prior class probabilities using maximum likelihood.

  Two class case with shared covariance matrix.

 Training data:

(8)

Maximum likelihood solution

For a data point from class we have and therefore

For a data point from class we have and therefore

Maximum likelihood solution

p t |

(

π,µ12

)

= p t

(

n |π,µ1,µ2

)

n=1 N

=

p x

(

n,C1

)

[ ]

tn

n=1 N

∏ [

p x

(

n,C2

) ]

1−tn =

πN x

(

n1

)

[ ]

tn

n=1 N

∏ [ (

1−π

)

N x

(

n |µ2

) ]

1−tn t = t

(

1,,tN

)

T

Assuming observations are drawn independently, we can write the likelihood function as follows:

(9)

Maximum likelihood solution

p t |

(

π,µ12

)

=

πN x

(

n1

)

[ ]

tn

n=1 N

∏ [ (

1−π

)

N x

(

n |µ2

) ]

1−tn

We want to find the values of the parameters that maximize the likelihood function, i.e., fit a model that best describes the observed data.

As usual, we consider the log of the likelihood:

Maximum likelihood solution

We first maximize the log likelihood with respect to . The terms that depend on are

(10)

Maximum likelihood solution

Thus, the maximum likelihood estimate of is the fraction of points in class

The result can be generalized to the multiclass case: the maximum likelihood estimate of is given by the fraction of points in the training set that belong to

Maximum likelihood solution

We now maximize the log likelihood with respect to . The terms that depend on are

(11)

Maximum likelihood solution

Thus, the maximum likelihood estimate of is the sample mean of all the input vectors assigned to class

By maximizing the log likelihood with respect to we obtain a similar result for

Maximum likelihood solution

Maximizing the log likelihood with respect to we obtain the maximum likelihood estimate

  Thus: the maximum likelihood estimate of the covariance is given by the weighted average of the sample covariance

(12)

Probabilistic Discriminative Models

Two-class case:

Multiclass case:

Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum

likelihood.

Probabilistic Discriminative Models

Advantages:

  Fewer parameters to be determined

  Improved predictive performance, especially when the class-conditional density assumptions give a poor

approximation of the true distributions.

(13)

Probabilistic Discriminative Models

Two-class case:

In the terminology of statistics, this model is known as logistic regression.

Assuming how many parameters do we need to estimate?

Probabilistic Discriminative Models

How many parameters did we estimate to fit Gaussian class-conditional densities (generative approach)?

(14)

Logistic Regression

We use maximum likelihood to determine the parameters of the logistic regression model.

xn,tn

{ }

n = 1,, N

tn= 1 denotes class C1 tn= 0 denotes class C2

We want to find the values of w that maximize the posterior probabilities associated to the observed data

Likelihood function : L w

( )

= P C

(

1| xn

)

tn

n=1 N

∏ (

1− P C

(

1| xn

) )

1−tn = y x

( )

n tn

n=1 N

∏ (

1− y x

( )

n

)

1−tn

Logistic Regression

We consider the negative logarithm of the likelihood:

(15)

Logistic Regression

We compute the derivative of the error function with respect to

w

(gradient):

We need to compute the derivative of the logistic sigmoid function:

∂aσ a

( )

=

∂a 1

1+ e−a = e−a 1+ e−a

( )

2 =

1 1+ e−a

e−a 1+ e−a = 1

1+ e−a 1− 1 1+ e−a

  

 = σ a

( ) (

1− σ a

( ) )

Logistic Regression

(16)

Logistic Regression

  The gradient of

E

at

w

gives the direction of the

steepest increase of

E

at

w

. We need to minimize

E

. Thus we need to update

w

so that we move along the opposite direction of the gradient:

This technique is called gradient descent

  It can be shown that

E

is a concave function of

w

. Thus, it has a unique minimum.

  An efficient iterative technique exists to find the optimal

w

parameters (Newton-Raphson optimization).

Batch vs. on-line learning

  The computation of the above gradient requires the processing of the entire training set (batch technique)

  If the data set is large, the above technique can be costly;

  For real time applications in which data become

available as continuous streams, we may want to update the parameters as data points are presented to us (on-line technique).

(17)

On-line learning

  After the presentation of each data point

n

, we compute the contribution of that data point to the gradient

(stochastic gradient):

 The on-line updating rule for the parameters becomes:

η > 0 is called learning rate.

It's value needs to be chosen carefully to ensure convergence

Multiclass Logistic Regression

Multiclass case:

We use maximum likelihood to determine the parameters of the logistic regression model.

xn,tn

{ }

n = 1,, N

tn = 0,

(

, 1,,0

)

denotes class Ck

We want to find the values of w1,,wK that maximize the posterior probabilities associated to the observed data

(18)

Multiclass Logistic Regression

We consider the negative logarithm of the likelihood:

Multiclass Logistic Regression

We compute the gradient of the error function with respect to one of the parameter vectors:

(19)

Multiclass Logistic Regression

Thus, we need to compute the derivatives of the softmax function:

∂ak yk = ∂

∂ak eak

eaj

j =

eak eaj

j − eakeak

eaj

j

 

 

2 = yk − yk2 = yk

(

1− yk

)

Multiclass Logistic Regression

Thus, we need to compute the derivatives of the softmax function:

for j ≠ k,

y = ∂ eak

= −eakeaj

= − y y

(20)

Multiclass Logistic Regression

Compact expression:

Multiclass Logistic Regression

(21)

Multiclass Logistic Regression

  It can be shown that

E

is a concave function of

w

. Thus, it has a unique minimum.

  For a batch solution, we can use the Newton-Raphson optimization technique.

  On-line solution (stochastic gradient descent):

References

Related documents

Online community: A group of people using social media tools and sites on the Internet OpenID: Is a single sign-on system that allows Internet users to log on to many different.

Namen tega poskusa je proučevanje vpliva različnih prekrivnih rastlin na zmanjšanje rasti plevelnih rastlin in na pridelek bučnic v naših klimatskih razmerah v posevku oljnih buč..

An analysis of the economic contribution of the software industry examined the effect of software activity on the Lebanese economy by measuring it in terms of output and value

Creating a national network of research and innovation, the Growth Centre will bring together Australian governments, businesses, start-ups and the research community to define

The challenges listed above would thus at first glance pose significant constraints in the provision of low-income housing by the private sector, but research has shown

With this labeled data at hand, we choose the preprocessing steps, estimate the model error and train different supervised machine learning models for predicting the need

The  next  stage  will  be  to  shortlist  tender  submissions  from  suppliers  and  start  the  process  of  kit  evaluation,  meetings 

Similar results are obtained for the other volume fractions: despite the influence of the surface anisotropy on the structure morphology, it affects only marginally the