08_Linear_Classification_2.pdf


Introduction to Statistical Machine Learning

© 2016 Ong & Walder, Data61 | CSIRO, The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Neural Networks 1
Neural Networks 2
Continuous Latent Variables
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Mixture Models and EM 1
Mixture Models and EM 2
Sequential Data 1
Sequential Data 2
Combining Models

Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group Data61 | CSIRO

and

College of Engineering and Computer Science The Australian National University

Canberra February – June 2017


Part VIII

Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models
Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression


Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function $f(x)$ which maps each input directly onto a class label.

Discriminative Models

1. Solve the inference problem of determining the posterior class probabilities $p(C_k | x)$.
2. Use decision theory to assign each new $x$ to one of the classes.

Generative Models

1. Solve the inference problem of determining the class-conditional probabilities $p(x | C_k)$.
2. Also, infer the prior class probabilities $p(C_k)$.
3. Use Bayes' theorem to find the posterior $p(C_k | x)$.
4. Alternatively, model the joint distribution $p(x, C_k)$ directly.


Probabilistic Generative Models

Generative approach: model class-conditional densities $p(x | C_k)$ and priors $p(C_k)$ to calculate the posterior probability for class $C_1$

$$p(C_1 | x) = \frac{p(x | C_1)\, p(C_1)}{p(x | C_1)\, p(C_1) + p(x | C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x))$$

where $a$ and the logistic sigmoid function $\sigma(a)$ are given by

$$a(x) = \ln \frac{p(x | C_1)\, p(C_1)}{p(x | C_2)\, p(C_2)} = \ln \frac{p(x, C_1)}{p(x, C_2)}$$

$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$
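A minimal numerical sketch of the identity above: the posterior computed from Bayes' theorem equals the sigmoid of the log odds. The two univariate Gaussian densities, the priors and the helper names are made up for illustration and are not part of the slides.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian N(x | mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical two-class problem (all numbers are illustrative assumptions).
prior_c1, prior_c2 = 0.3, 0.7
x = np.linspace(-3.0, 5.0, 9)
joint_c1 = gaussian_pdf(x, mean=2.0, std=1.0) * prior_c1  # p(x, C1)
joint_c2 = gaussian_pdf(x, mean=0.0, std=1.5) * prior_c2  # p(x, C2)

# Posterior directly from Bayes' theorem.
posterior_bayes = joint_c1 / (joint_c1 + joint_c2)

# Posterior as the sigmoid of the log odds a(x) = ln p(x, C1) - ln p(x, C2).
a = np.log(joint_c1) - np.log(joint_c2)
posterior_sigmoid = sigmoid(a)

print(np.allclose(posterior_bayes, posterior_sigmoid))
```

The identity holds for any pair of class-conditional densities, not only for Gaussians; the Gaussians here merely give concrete numbers.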


Logistic Sigmoid

The logistic sigmoid function

$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$

is a "squashing function" because it maps the real axis into the finite interval $(0, 1)$.

$$\sigma(-a) = 1 - \sigma(a)$$

Derivative

$$\frac{d}{da} \sigma(a) = \sigma(a)\, \sigma(-a) = \sigma(a)\, (1 - \sigma(a))$$

Inverse is called the logit function

$$a(\sigma) = \ln \frac{\sigma}{1 - \sigma}$$

[Figure: logistic sigmoid $\sigma(a)$ plotted against $a$, and logit $a(\sigma)$ plotted against $\sigma$.]
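The three properties of the sigmoid listed above can be checked numerically. This is a small sketch; the grid of test points and the finite-difference step `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    """Inverse of the sigmoid: ln(s / (1 - s))."""
    return np.log(s / (1.0 - s))

a = np.linspace(-5.0, 5.0, 11)
s = sigmoid(a)

# sigma(-a) = 1 - sigma(a)
symmetry_ok = np.allclose(sigmoid(-a), 1.0 - s)

# d sigma / da = sigma (1 - sigma), compared with a central finite difference.
h = 1e-5
numeric_derivative = (sigmoid(a + h) - sigmoid(a - h)) / (2.0 * h)
derivative_ok = np.allclose(numeric_derivative, s * (1.0 - s), atol=1e-8)

# The logit undoes the sigmoid.
inverse_ok = np.allclose(logit(s), a)

print(symmetry_ok, derivative_ok, inverse_ok)
```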


Probabilistic Generative Models - Multiclass

The normalised exponential is given by

$$p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

where

$$a_k = \ln\big(p(x | C_k)\, p(C_k)\big).$$

Also called the softmax function, as it is a smoothed version of the max function.
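A short sketch of the normalised exponential. Subtracting the largest activation before exponentiating is a common numerical safeguard added here; it is not mentioned on the slide and does not change the result. The activation values are illustrative.

```python
import numpy as np

def softmax(a):
    """Normalised exponential exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    shifted = a - np.max(a)  # guards against overflow, result is unchanged
    e = np.exp(shifted)
    return e / e.sum()

a = np.array([1.0, 2.0, 5.0])  # hypothetical activations a_k
p = softmax(a)

# Scaling the activations sharpens the output towards the max indicator,
# which is the sense in which softmax is a smoothed version of max.
p_sharp = softmax(10.0 * a)

print(p, p_sharp)
```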


Probabil. Generative Model - Continuous Input

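Assume class-conditional probabilities are Gaussian, and all classes share the same covariance. What can we say about the posterior probabilities?

$$p(x | C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}$$

$$= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} x^T \Sigma^{-1} x \right\} \times \exp\left\{ \mu_k^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\}$$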


Probabil. Generative Model - Continuous Input

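For two classes

$$p(C_1 | x) = \sigma(a(x))$$

and $a(x)$ is

$$a(x) = \ln \frac{p(x | C_1)\, p(C_1)}{p(x | C_2)\, p(C_2)} = \ln \frac{\exp\left\{ \mu_1^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \right\}}{\exp\left\{ \mu_2^T \Sigma^{-1} x - \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \right\}} + \ln \frac{p(C_1)}{p(C_2)}$$

Therefore

$$p(C_1 | x) = \sigma(w^T x + w_0)$$

where

$$w = \Sigma^{-1} (\mu_1 - \mu_2)$$

$$w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}$$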
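A small sketch of this result: build $w$ and $w_0$ from the means, the shared covariance and the priors, then compare $\sigma(w^T x + w_0)$ with the posterior obtained from Bayes' theorem. All numerical values and the helper `gaussian_pdf` are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian N(x | mean, cov)."""
    d = x - mean
    dim = mean.shape[0]
    norm = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical parameters of two Gaussian classes with a shared covariance.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
cov = np.array([[1.0, 0.3],
                [0.3, 0.5]])
prior1, prior2 = 0.4, 0.6

# Linear parameters from the formulas above.
cov_inv = np.linalg.inv(cov)
w = cov_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ cov_inv @ mu1
      + 0.5 * mu2 @ cov_inv @ mu2
      + np.log(prior1 / prior2))

x = np.array([0.2, -0.4])  # arbitrary query point
posterior_linear = sigmoid(w @ x + w0)

# Reference value from Bayes' theorem.
joint1 = gaussian_pdf(x, mu1, cov) * prior1
joint2 = gaussian_pdf(x, mu2, cov) * prior2
posterior_bayes = joint1 / (joint1 + joint2)

print(posterior_linear, posterior_bayes)
```

The two numbers agree because the quadratic term in $x$ and the normalisation constant are identical for both classes and cancel in the ratio.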


Probabil. Generative Model - Continuous Input

Figure: Class-conditional densities for two classes (left). Posterior probability $p(C_1 | x)$ (right). Note the logistic sigmoid of a linear function of $x$.


General Case - K Classes, Shared Covariance

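Use the normalised exponential

$$p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)}$$

where

$$a_k = \ln\big(p(x | C_k)\, p(C_k)\big)$$

to get a linear function of $x$

$$a_k(x) = w_k^T x + w_{k0}$$

where

$$w_k = \Sigma^{-1} \mu_k$$

$$w_{k0} = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)$$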


General Case - K Classes, Different Covariance

If each class-conditional probability has a different covariance $\Sigma_k$, the quadratic terms $-\frac{1}{2} x^T \Sigma_k^{-1} x$ no longer cancel each other out.

We get a quadratic discriminant.


Maximum Likelihood Solution

Given the functional form of the class-conditional densities $p(x | C_k)$, can we determine the parameters $\mu$ and $\Sigma$?

Not without data ;-)

Given also a data set $(x_n, t_n)$ for $n = 1, \dots, N$, using the coding scheme where $t_n = 1$ corresponds to class $C_1$ and $t_n = 0$ denotes class $C_2$.

Assume the class-conditional densities to be Gaussian with the same covariance, but different means.

Denote the prior probability $p(C_1) = \pi$, and therefore $p(C_2) = 1 - \pi$.

Then

$$p(x_n, C_1) = p(C_1)\, p(x_n | C_1) = \pi\, N(x_n | \mu_1, \Sigma)$$

$$p(x_n, C_2) = p(C_2)\, p(x_n | C_2) = (1 - \pi)\, N(x_n | \mu_2, \Sigma)$$


Maximum Likelihood Solution

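Thus the likelihood for the whole data set $X$ and $t$ is given by

$$p(t, X | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left[ \pi\, N(x_n | \mu_1, \Sigma) \right]^{t_n} \times \left[ (1 - \pi)\, N(x_n | \mu_2, \Sigma) \right]^{1 - t_n}$$

Maximise the log likelihood. The term depending on $\pi$ is

$$\sum_{n=1}^{N} \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big)$$

which is maximal for

$$\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$$

where $N_k$ is the number of data points in class $C_k$.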


Maximum Likelihood Solution

Similarly, we can maximise the log likelihood (and thereby the likelihood $p(t, X | \pi, \mu_1, \mu_2, \Sigma)$) with respect to the means $\mu_1$ and $\mu_2$, and get

$$\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n$$

$$\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n$$


Maximum Likelihood Solution

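Finally, the log likelihood $\ln p(t, X | \pi, \mu_1, \mu_2, \Sigma)$ can be maximised for the covariance $\Sigma$ resulting in

$$\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2$$

$$S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T$$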
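The maximum likelihood estimates for $\pi$, $\mu_1$, $\mu_2$ and $\Sigma$ can be sketched in a few lines. The synthetic data set, the random seed and the helper `class_covariance` are assumptions made for this illustration only.

```python
import numpy as np

# Synthetic two-class data (illustrative assumption): 300 points from C1
# (t = 1) and 700 points from C2 (t = 0), drawn with a shared covariance.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.2],
                     [0.2, 0.8]])
X1 = rng.multivariate_normal([2.0, 0.0], true_cov, size=300)
X2 = rng.multivariate_normal([-1.0, 1.0], true_cov, size=700)
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(300), np.zeros(700)])

N = len(t)
N1 = t.sum()
N2 = N - N1

# Prior: fraction of points in class C1.
pi = N1 / N

# Class means as weighted sums over the whole data set.
mu1 = (t[:, None] * X).sum(axis=0) / N1
mu2 = ((1.0 - t)[:, None] * X).sum(axis=0) / N2

def class_covariance(Xk, mu):
    """S_k: average outer product of deviations from the class mean."""
    d = Xk - mu
    return d.T @ d / len(Xk)

S1 = class_covariance(X[t == 1], mu1)
S2 = class_covariance(X[t == 0], mu2)

# Shared covariance: weighted average of the per-class covariances.
Sigma = (N1 / N) * S1 + (N2 / N) * S2

print(pi, mu1, mu2)
print(Sigma)
```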


Discrete Features - Naive Bayes

Assume the input space consists of discrete features, in the simplest case $x_i \in \{0, 1\}$.

For a $D$-dimensional input space, a general distribution would be represented by a table with $2^D$ entries.

Together with the normalisation constraint, these are $2^D - 1$ independent variables.

This grows exponentially with the number of features.

The Naive Bayes assumption is that all features conditioned on the class $C_k$ are independent of each other.

$$p(x | C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$


Discrete Features - Naive Bayes

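With the naive Bayes assumption

$$p(x | C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$

we can then again find the factors $a_k$ in the normalised exponential

$$p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)}$$

as a linear function of the $x_i$

$$a_k(x) = \sum_{i=1}^{D} \big\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \big\} + \ln p(C_k)$$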
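A sketch of naive Bayes with binary features: the activations $a_k(x) = \ln\big(p(x | C_k)\, p(C_k)\big)$ are computed as a linear function of $x$ and passed through the normalised exponential. The parameter table `mu`, the priors and the query vector are made-up values.

```python
import numpy as np

# Hypothetical parameters: K = 2 classes, D = 3 binary features.
# mu[k, i] = p(x_i = 1 | C_k)
mu = np.array([[0.9, 0.2, 0.5],
               [0.1, 0.7, 0.5]])
prior = np.array([0.5, 0.5])

def activations(x):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k)."""
    return x @ np.log(mu).T + (1.0 - x) @ np.log(1.0 - mu).T + np.log(prior)

def posterior(x):
    """Normalised exponential of the activations."""
    a = activations(x)
    e = np.exp(a - a.max())
    return e / e.sum()

x = np.array([1.0, 0.0, 1.0])
p = posterior(x)

# Reference: multiply the Bernoulli factors directly and apply Bayes' theorem.
likelihood = np.prod(mu ** x * (1.0 - mu) ** (1.0 - x), axis=1)
p_direct = likelihood * prior / np.sum(likelihood * prior)

print(p, p_direct)
```

Each class needs only $D$ parameters here, instead of the $2^D - 1$ entries of a general table.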




Probabilistic Discriminative Models

Discriminative training: maximise a likelihood function defined through the conditional distribution $p(C_k | x)$ directly.

Typically fewer parameters to be determined.

As we learn the posterior $p(C_k | x)$ directly, prediction may be better than with a generative model where the class-conditional density assumptions $p(x | C_k)$ poorly approximate the true distributions.


Original Input versus Feature Space

Used direct input $x$ until now.

All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions $\phi(x)$.

Example: Use two Gaussian basis functions centred at the green crosses in the input space.

[Figure: the data in the input space and in the feature space spanned by the two basis functions.]
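A minimal sketch of such a fixed transformation with two Gaussian basis functions. The centres, the width and the function name `gaussian_basis` are assumptions for illustration; the slide does not give the values used in its figure.

```python
import numpy as np

def gaussian_basis(X, centres, width):
    """Map inputs X of shape (N, D) to features of shape (N, M).

    Feature j of point n is exp(-||x_n - c_j||^2 / (2 width^2)).
    """
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))

# Hypothetical centres and inputs in a two-dimensional input space.
centres = np.array([[-1.0, -1.0],
                    [0.0, 0.0]])
X = np.array([[-1.0, -1.0],
              [0.0, 0.0],
              [1.0, 1.0]])

Phi = gaussian_basis(X, centres, width=0.7)
print(Phi)
```

Each row of `Phi` is the feature vector $\phi(x_n) = (\phi_1(x_n), \phi_2(x_n))$; a point lying on a centre has the corresponding feature equal to 1.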


Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.

Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.

BUT: If classes overlap in input space, they will also overlap in feature space.

Nonlinear features $\phi(x)$ cannot remove the overlap, but they may increase it!

[Figure: input space with axes $x_1$, $x_2$ and feature space with axes $\phi_1$, $\phi_2$.]


Original Input versus Feature Space

Fixed basis functions do not adapt to the data and therefore have important limitations (see discussion in Linear Regression).

Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.

Some applications use fixed features successfully by avoiding the limitations.


Logistic Regression is Classification

Two classes where the posterior of class $C_1$ is a logistic sigmoid $\sigma()$ acting on a linear function of the feature vector $\phi$

$$p(C_1 | \phi) = y(\phi) = \sigma(w^T \phi)$$

$$p(C_2 | \phi) = 1 - p(C_1 | \phi)$$

Model dimension is equal to the dimension of the feature space $M$.

Compare this to fitting two Gaussians

$$\underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}} = M(M+5)/2$$


Logistic Regression is Classification

Determine the parameter via maximum likelihood for data $(\phi_n, t_n)$, $n = 1, \dots, N$, where $\phi_n = \phi(x_n)$. The class membership is coded as $t_n \in \{0, 1\}$.

Likelihood function

$$p(t | w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$

where $y_n = p(C_1 | \phi_n)$.

Error function: negative log likelihood resulting in the cross-entropy error function

$$E(w) = -\ln p(t | w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}$$


Logistic Regression is Classification

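Error function (cross-entropy error)

$$E(w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}$$

$$y_n = p(C_1 | \phi_n) = \sigma(w^T \phi_n)$$

Gradient of the error function (using $\frac{d\sigma}{da} = \sigma (1 - \sigma)$)

$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n$$

The gradient contains no explicit derivative of the sigmoid: the factor $\sigma (1 - \sigma)$ cancels, and the sigmoid enters only through $y_n$.

For each data point, the contribution to the gradient is the product of the deviation $y_n - t_n$ and the basis function vector $\phi_n$.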
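A sketch that checks the gradient formula against finite differences and then uses it in plain gradient descent. Gradient descent, the step size, the synthetic data and the weight vector `w_true` are choices made for this illustration; the slides do not prescribe an optimiser at this point.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def gradient(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

# Synthetic data (illustrative assumption): a bias feature plus two inputs.
rng = np.random.default_rng(1)
N = 200
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)

# Compare the analytic gradient with central finite differences at w = 0.
w = np.zeros(3)
eps = 1e-6
numeric = np.array([
    (cross_entropy(w + eps * e, Phi, t) - cross_entropy(w - eps * e, Phi, t))
    / (2.0 * eps)
    for e in np.eye(3)
])
grad_ok = np.allclose(numeric, gradient(w, Phi, t), atol=1e-4)

# Plain gradient descent with a small fixed step size.
initial_error = cross_entropy(w, Phi, t)
for _ in range(500):
    w = w - 0.01 * gradient(w, Phi, t)
final_error = cross_entropy(w, Phi, t)

print(grad_ok, initial_error, final_error)
```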


Laplace Approximation

Given a continuous distribution $p(x)$ which is not Gaussian, can we approximate it by a Gaussian $q(x)$?

Need to find a mode of $p(x)$. Try to find a Gaussian with the same mode.

[Figure: Non-Gaussian (yellow) and Gaussian approximation (red).]

[Figure: Negative log of the non-Gaussian (yellow) and of the Gaussian approximation (red).]


Laplace Approximation

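Assume $p(z)$ can be written as

$$p(z) = \frac{1}{Z} f(z)$$

with normalisation $Z = \int f(z)\, dz$.

Furthermore, assume $Z$ is unknown!

A mode of $p(z)$ is at a point $z_0$ where $p'(z_0) = 0$.

Taylor expansion of $\ln f(z)$ at $z_0$

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$$

where

$$A = -\frac{d^2}{dz^2} \ln f(z) \Big|_{z = z_0}$$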


Laplace Approximation

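Exponentiating

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$$

we get

$$f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}.$$

And after normalisation we get the Laplace approximation

$$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}.$$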
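A one-dimensional sketch of the Laplace approximation. The unnormalised density $f(z) = \exp(-z^2/2)\, \sigma(20z + 4)$ is an example chosen for this illustration, and the mode is located by bisection on the derivative of $\ln f$; both are assumptions, not part of the slide.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_f(z):
    """ln f(z) for the example f(z) = exp(-z^2 / 2) * sigmoid(20 z + 4)."""
    return -0.5 * z ** 2 + np.log(sigmoid(20.0 * z + 4.0))

def dlog_f(z):
    """First derivative of ln f(z)."""
    return -z + 20.0 * (1.0 - sigmoid(20.0 * z + 4.0))

# Find the mode z0 by bisection: dlog_f is strictly decreasing,
# positive at z = -1 and negative at z = 1.
lo, hi = -1.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if dlog_f(mid) > 0.0:
        lo = mid
    else:
        hi = mid
z0 = 0.5 * (lo + hi)

# A = -d^2/dz^2 ln f(z) at z0, derived analytically for this f.
s = sigmoid(20.0 * z0 + 4.0)
A = 1.0 + 400.0 * s * (1.0 - s)

def q(z):
    """Laplace approximation: Gaussian with mean z0 and precision A."""
    return np.sqrt(A / (2.0 * np.pi)) * np.exp(-0.5 * A * (z - z0) ** 2)

# Compare the normalisation constant with a simple Riemann sum.
z = np.linspace(-5.0, 5.0, 10001)
dz = z[1] - z[0]
Z_numeric = np.sum(np.exp(log_f(z))) * dz
Z_laplace = np.exp(log_f(z0)) * np.sqrt(2.0 * np.pi / A)
q_mass = np.sum(q(z)) * dz

print(z0, A, Z_numeric, Z_laplace)
```

The approximation needs only the mode and the curvature there, so the unknown $Z$ never has to be computed; `Z_laplace` is a by-product estimate of it.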


Laplace Approximation - Vector Space

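Approximate $p(z)$ for $z \in \mathbb{R}^M$

$$p(z) = \frac{1}{Z} f(z).$$

We get the Taylor expansion

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} (z - z_0)^T A (z - z_0)$$

where the Hessian $A$ is defined as

$$A = -\nabla\nabla \ln f(z) \big|_{z = z_0}.$$

The Laplace approximation of $p(z)$ is then

$$q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\left\{ -\frac{1}{2} (z - z_0)^T A (z - z_0) \right\}$$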


Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable.

Why? We need to normalise the product of the prior and the likelihood, where the likelihood is itself a product of logistic sigmoid functions, one for each data point.


Bayesian Logistic Regression

Assume a Gaussian prior because we want a Gaussian posterior.

$$p(w) = N(w | m_0, S_0)$$

for fixed hyperparameters $m_0$ and $S_0$.

Hyperparameters are parameters of a prior distribution. In contrast to the model parameters $w$, they are not learned.

For a set of training data $(x_n, t_n)$, where $n = 1, \dots, N$, the posterior is given by

$$p(w | t) \propto p(w)\, p(t | w)$$


Bayesian Logistic Regression

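Using our previous result for the cross-entropy function

$$E(w) = -\ln p(t | w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}$$

we can now calculate the log of the posterior $p(w | t) \propto p(w)\, p(t | w)$ using the notation $y_n = \sigma(w^T \phi_n)$ as

$$\ln p(w | t) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const}$$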


Bayesian Logistic Regression

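To obtain a Gaussian approximation to

$$\ln p(w | t) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const}$$

1. Find $w_{MAP}$ which maximises $\ln p(w | t)$. This defines the mean of the Gaussian approximation. (Note: This is a nonlinear function in $w$ because $y_n = \sigma(w^T \phi_n)$.)
2. Calculate the second derivative of the negative log posterior to get the inverse covariance of the Laplace approximation

$$S_N^{-1} = -\nabla\nabla \ln p(w | t) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T$$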


Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now

$$q(w | \phi) = N(w | w_{MAP}, S_N)$$

where

$$S_N^{-1} = -\nabla\nabla \ln p(w | t) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T$$
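A sketch of the two steps for Bayesian logistic regression: find $w_{MAP}$, then evaluate the Hessian of the negative log posterior there to obtain $S_N$. Newton's method is used here as one possible optimiser; that choice, the synthetic data, the prior $m_0 = 0$, $S_0 = 4I$ and the weight vector `w_true` are assumptions of this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Synthetic data (illustrative assumption): a bias feature plus two inputs.
rng = np.random.default_rng(2)
N, M = 100, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M - 1))])
w_true = np.array([0.5, -1.5, 1.0])
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)

# Gaussian prior N(w | m0, S0) with hypothetical hyperparameters.
m0 = np.zeros(M)
S0_inv = np.eye(M) / 4.0  # corresponds to S0 = 4 I

def neg_log_posterior_grad(w):
    """Gradient of -ln p(w | t): S0^{-1} (w - m0) + sum_n (y_n - t_n) phi_n."""
    y = sigmoid(Phi @ w)
    return S0_inv @ (w - m0) + Phi.T @ (y - t)

def neg_log_posterior_hessian(w):
    """Hessian of -ln p(w | t): S0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T."""
    y = sigmoid(Phi @ w)
    return S0_inv + (Phi * (y * (1.0 - y))[:, None]).T @ Phi

# Step 1: Newton iterations towards w_MAP, starting from the prior mean.
w = m0.copy()
for _ in range(25):
    w = w - np.linalg.solve(neg_log_posterior_hessian(w),
                            neg_log_posterior_grad(w))
w_map = w

# Step 2: inverse covariance of the Laplace approximation at w_MAP.
SN_inv = neg_log_posterior_hessian(w_map)
SN = np.linalg.inv(SN_inv)

print(w_map)
print(SN)
```

The resulting Gaussian $N(w | w_{MAP}, S_N)$ is centred on the mode of the posterior, with covariance given by the inverse curvature of the negative log posterior at that point.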