07_Linear_Classification_1.pdf

(1)

Introduction to Statistical Machine Learning

c 2016 Ong & Walder Data61 | CSIRO The Australian National

University

Outlines

Overview Introduction Linear Algebra Probability Linear Regression 1 Linear Regression 2 Linear Classification 1 Linear Classification 2 Kernel Methods Sparse Kernel Methods Neural Networks 1 Neural Networks 2 Continuous Latent Variables Autoencoders Graphical Models 1 Graphical Models 2 Graphical Models 3 Sampling Mixture Models and EM 1 Mixture Models and EM 2 Sequential Data 1 Sequential Data 2 Combining Models

Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group Data61 | CSIRO

and

College of Engineering and Computer Science The Australian National University

Canberra February – June 2017

(2)

University

Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm

Part VII

(3)

University

Classification

Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm

Classification

Goal : Given input datax, assign it to one ofKdiscrete classesCkwherek=1, . . . ,K.

Divide the input space into different regions.

(4)

University

Classification

How to represent binary class labels?

Class labels are no longer real values as in regression, but a discrete set.

Two classes :t∈ {0,1}

(t=1represents classC1andt=0represents classC2)

Can interpret the value oftas the probability of classC1,

(5)

University

Classification

How to represent multi-class labels?

If there are more than two classes (K>2), we call it a multi-class setup.

Often used: 1-of-Kcoding scheme in whichtis a vector of lengthKwhich has all values0except fortj=1, wherej comes from the membership in classCjto encode. Example: Given5classes,{C1, . . . ,C5}. Membership in

classC2will be encoded as the target vector

t= (0,1,0,0,0)T

(6)

University

Classification

Generalised Linear Model

Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm

Linear Model

Idea: Use again aLinear Modelas in regression: y(x,w)is a linear function of the parametersw

y(xn,w) =wTφ(xn)

But generallyy(xn,w)∈R.

(7)

University

Classification

Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm

Generalised Linear Model

Apply a mappingf :R→Zto the linear model to get the discrete class labels.

y(xn,w) =f(wTφ(xn))

Activation function:f(·)

Link function : f−1(·)

-0.5 0.0 0.5 1.0z

-1.0

-0.5 0.5 1.0 signHzL

(8)

University

Classification Generalised Linear Model

Inference and Decision

Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm

Three Models for Decision Problems

Find adiscriminant functionf(x)which maps each input directly onto a class label.

Discriminative Models

1 Solve the inference problem of determining the posterior

class probabilitiesp(Ck|x).

2 Use decision theory to assign each newxto one of the

classes.

Generative Models

1 _{Solve the inference problem of determining the}

class-conditional probabilitiesp(x| Ck).

2 _{Also, infer the prior class probabilities}_p(C

k).

3 Use Bayes’ theorem to find the posterior_p(C_k|_x₎.

4 _{Alternatively, model the joint distribution}_p₍_x_,C

k)directly.

5 Use decision theory to assign each newxto one of the

(9)

University

Decision Theory - Key Ideas

Two classesC1 andC2

joint distributionp(x,Ck) using Bayes’ theorem

p(Ck|x) = p(x| Ck)p(Ck)

p(x)

Example: cancer treatment (k=2) datax: an X-ray image

C1: patient has cancer (C2: patient has no cancer) p(C1)is the prior probability of a person having cancer p(C1|x)is the posterior probability of a person having

(10)

University

Decision Theory - Key Ideas

Need a rule which assigns each value of the inputxto one of the available classes.

The input space is partitioned intodecision regionsRk. Leads todecision boundariesordecision surfaces

probability of a mistake

p(mistake) =p(x∈ R1,C2) +p(x∈ R2,C1)

=

Z

R1

p(x,C2)dx+ Z

R2

p(x,C1)dx

R2

(11)

University

Decision Theory - Key Ideas

probability of a mistake

p(mistake) =p(x∈ R1,C2) +p(x∈ R2,C1)

=

Z

R1

p(x,C2)dx+ Z

R2

p(x,C1)dx

goal: minimizep(mistake)

R1 R2

x0 _bx

p(x,C1)

p(x,C2)

(12)

University

Decision Theory - Key Ideas

multiple classes

instead of minimising the probability of mistakes, maximise the probability of correct classification

p(correct) =

K

X

k=1

p(x∈ Rk,Ck)

= K

X

k=1 Z

Rk

(13)

University

Minimising the Expected Loss

Not all mistakes are equally costly.

Weight each misclassification ofxto the wrong classCj instead of assigning it to the correct classCkby a factorLkj. The expected loss is now

E[L] =

X

k

X

j

Z

Rj

Lkjp(x,Ck)dx

(14)

University

The Reject Region

Avoid making automated decisions on difficult cases. Difficult cases:

posterior probabilitiesp(Ck|x)are very small

joint distributionsp(x,Ck)have comparable values

x

p(C1|x) p(C2|x)

0.0 1.0 θ

(15)

University

Classification Generalised Linear Model Inference and Decision

Discriminant Functions

Fisher’s Linear Discriminant The Perceptron Algorithm

Two Classes

Definition

Adiscriminantis a function that maps from an input vectorxto one ofKclasses, denoted byCk.

Consider first two classes (K=2). Construct a linear function of the inputsx

y(x) =wTx+w0

such thatxbeing assigned to classC1ify(x)≥0, and to

classC2otherwise. weight vectorw

(16)

University

Two Classes

Decision boundaryy(x) =0is a(D−1)-dimensional hyperplane in aD-dimensional input space (decision surface).

wis orthogonal to any vector lying in the decision surface. Proof: AssumexAandxBare two points lying in the decision surface. Then,

(17)

University

Two Classes

The normal distance from the origin to the decision surface is

wT_x kwk =−

w0

kwk

x2

x1

w

x

y(x)

kwk

x⊥

−w0

kwk

y= 0

y <0

y >0

R2

(18)

University

Two Classes

The value ofy(x)gives a signed measure of the

perpendicular distancerof the pointxfrom the decision surface,r=y(x)/kwk.

y(x) =wT

x

z }| {

x⊥+r

w

kwk

+w0=rw

T_w

kwk +

0

z }| {

wTx⊥+w0=rkwk

x2

x1 w

x

y(x) kwk x⊥

−w0 kwk

y= 0 y <0

y >0

(19)

University

Two Classes

More compact notation : Add an extra dimension to the input space and set the value tox0=1.

Also definew_e = (w0,w)and_ex= (1,x)

y(x) =w_eT_ex

Decision surface is now aD-dimensional hyperplane in a

(20)

University

Multi-Class

Number of classesK>2

Can we combine a number of two-class discriminant functions usingK−1one-versus-the-restclassifiers?

R1

R2

R3 ?

C1

notC1

C2

(21)

University

Multi-Class

Can we combine a number of two-class discriminant functions usingK(K−1)/2one-versus-oneclassifiers?

R1

R2 R3 ?

C1

C2 C1

C3

(22)

University

Multi-Class

Solution: UseKlinear functions

yk(x) =wT_kx+wk0

Assign inputxto classCkifyk(x)>yj(x)for allj6=k. Decision boundary between classCkandCjgiven by

yk(x) =yj(x)

Ri

Rj

Rk

xA

xB

b

(23)

University

Least Squares for Classification

Regression with a linear function of the model parameters and minimisation of sum-of-squares error function resulted in a closed-from solution for the parameter values.

Is this also possible for classification?

Given input dataxbelonging to one ofKclassesCk. Use1-of-Kbinary coding scheme.

Each class is described by its own linear model

(24)

University

Least Squares for Classification

With the conventions

e

wk=

wk0

wk

∈RD+1

e

x=

1 x

∈_RD+1

e

W=

e

w1 . . . weK

∈_R(D+1)×K

we get for the discriminant function (vector valued)

y(x) =WeT_ex ∈_RK.

(25)

University

Determine

W

e

Given a training set{xn,t}wheren=1, . . . ,N, andtis the class in the1-of-Kcoding scheme.

Define a matrixTwhere rowncorresponds totT n. The sum-of-squares error can now be written as

ED(We) = 1 2tr

n

(XeWe −T)T(XeWe −T) o

The minimum ofED(We)will be reached for

e

W= (XeTXe)−1XeTT=Xe†T

(26)

University

Discriminant Function for Multi-Class

The discriminant functiony(x)is therefore

y(x) =WeT_ex=TT(Xe†)T_ex,

whereXe is given by the training data, and_exis the new

input.

Interesting property: If for everytnthe same linear constraintaT_t

n+b=0holds, then the predictiony(x)will also obey the same constraint

aTy(x) +b=0.

(27)

University

Deficiencies of the Least Squares Approach

Magenta curve : Decision Boundary for the least squares approach ( Green curve : Decision boundary for the logistic regression model described later)

−4 −2 0 2 4 6 8

−8 −6 −4 −2 0 2 4

−4 −2 0 2 4 6 8

(28)

University

Deficiencies of the Least Squares Approach

Magenta curve : Decision Boundary for the least squares approach ( Green curve : Decision boundary for the logistic regression model described later)

−6 −4 −2 0 2 4 6

(29)

University

Classification Generalised Linear Model Inference and Decision Discriminant Functions

Fisher’s Linear Discriminant

The Perceptron Algorithm

Fisher’s Linear Discriminant

View linear classification as dimensionality reduction.

y(x) =wTx

Ify≥ −w0then classC1, otherwiseC2.

But there are many projections from aD-dimensional input space onto one dimension.

Projection always means loss of information.

For classification we want to preserve the class separation in one dimension.

(30)

University

Fisher’s Linear Discriminant

Samples from two classes in a two-dimensional input space and their histogram when projected to two different

one-dimensional spaces.

−2 2 6

−2 0 2 4

−2 2 6

(31)

University

Fisher’s Linear Discriminant - First Try

GivenN1input data of classC1, andN2input data of class

C2, calculate the centres of the two classes

m1=

1 N1

X

n∈C1

xn, m2=

1 N2

X

n∈C2 xn

Choosewso as to maximise the projection of the class means ontow

m1−m2=wT(m1−m2)

Problem with non-uniform covariance

−2 2 6

(32)

University

Fisher’s Linear Discriminant

Measure also the within-class variance for each class

s2_k=X n∈Ck

(yn−mk)2

whereyn=wTxn.

Maximise theFisher criterion

J(w) = (m2−m1)

2

s2 1+s22

(33)

University

Fisher’s Linear Discriminant

The Fisher criterion can be rewritten as

J(w) = w T_S

Bw

wT_S Ww

SB is thebetween-classcovariance

SB= (m2−m1)(m2−m1)T

SW is thewithin-classcovariance

SW =

X

n∈C1

(xn−m1)(xn−m1)T+ X

n∈C2

(34)

University

Fisher’s Linear Discriminant

The Fisher criterion

J(w) = w T_S

Bw

wT_S Ww

has a maximum forFisher’s linear discriminant

w∝S−_W1(m2−m1)

Fisher’s linear discriminant is NOT a discriminant, but can be used to construct one by choosing a thresholdy0 in the

(35)

University

Fisher’s Discriminant For Multi-Class

Assume that the dimensionality of the input spaceDis greater than the number of classesK.

UseD0>1linear ’features’yk=wTxand write everything in vector form (no bias involved!)

y=WTx.

The within-class covariance is then the sum of the covariances for allKclasses

SW=

K

X

k=1 Sk

where

Sk=

X

n∈Ck

(xn−mk)(xn−mk)T

mk=

1 Nk

X

n∈Ck

(36)

University

Fisher’s Discriminant For Multi-Class

Between-class covariance

SB = K

X

k=1

Nk(mk−m)(mk−m)T.

wheremis the total mean of the input data

m= 1

N

X

n=1 xn.

One possible way to define a function ofWwhich is large when the between-class covariance is large and the within-class covariance is small is given by

J(W) =tr(WT_S

WW)−1(WTSBW)

(37)

University

Fisher’s Discriminant For Multi-Class

How many linear ’features’ can one find with this method?

SB is of rank at mostK−1because of the sum ofKrank one matrices and the global constraint viam.

(38)

University

Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant

The Perceptron Algorithm

Frank Rosenblatt (1928 - 1969)

(39)

University

The Perceptron Algorithm

(40)

University

The Perceptron Algorithm

Two class model

Create feature vectorφ(x)by a fixed nonlinear transformation of the inputx.

Generalised linear model

y(x) =f(wTφ(x))

withφ(x)containing some bias elementφ0(x) =1.

nonlinearactivationfunction

f(a) =

(

+1, a≥0

−1, a<0

Target coding for perceptron

t=

(

(41)

University

The Perceptron Algorithm - Error Function

Idea : Minimise total number of misclassified patterns. Problem : As a function ofw, this is piecewise constant and therefore the gradient is zero almost everywhere. Better idea: Using the(−1,+1)target coding scheme, we want all patterns to satisfywT_φ₍_x

n)tn >0.

Perceptron Criterion: Add the errors for all patterns belonging to the set of misclassified patternsM

EP(w) =−

X

n∈M

(42)

University

Perceptron - Stochastic Gradient Descent

Perceptron Criterion (with notationφ_n =φ(xn))

EP(w) =−X n∈M

wTφ_ntn

One iteration at stepτ

1 _{Choose a training pair}₍_x

n,tn)

2 _{Update the weight vector}_w_by

w(τ+1)=w(τ)−η∇EP(w) =w(τ)+ηφntn

Asy(x,w)does not depend on the norm ofw, one can set η=1

(43)

University

The Perceptron Algorithm - Update 1

Update of the perceptron weights from a misclassified pattern (green)

w(τ+1)=w(τ)+φ_ntn

−1 −0.5 0 0.5 1

(44)

University

The Perceptron Algorithm - Update 2

Update of the perceptron weights from a misclassified pattern (green)

w(τ+1)=w(τ)+φ_ntn

−0.5 0 0.5 1

(45)

University

The Perceptron Algorithm - Convergence

Does the algorithm converge ? For a single update step

−w(τ+1)Tφ_ntn=−w(τ)Tφntn−(φntn)Tφntn <−w(τ)Tφntn

because(φ_ntn)Tφntn=kφntnk>0.

BUT: contributions to the error from the other misclassified patterns might have increased.

AND: some correctly classified patterns might now be misclassified.