Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Outlines
Overview Introduction Linear Algebra Probability Linear Regression 1 Linear Regression 2 Linear Classification 1 Linear Classification 2 Kernel Methods Sparse Kernel Methods Neural Networks 1 Neural Networks 2 Continuous Latent Variables Autoencoders Graphical Models 1 Graphical Models 2 Graphical Models 3 Sampling Mixture Models and EM 1 Mixture Models and EM 2 Sequential Data 1 Sequential Data 2 Combining Models
Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group Data61 | CSIRO
and
College of Engineering and Computer Science The Australian National University
Canberra February – June 2017
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Part VII
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification
Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Classification
Goal : Given input datax, assign it to one ofKdiscrete classesCkwherek=1, . . . ,K.
Divide the input space into different regions.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification
Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
How to represent binary class labels?
Class labels are no longer real values as in regression, but a discrete set.
Two classes :t∈ {0,1}
(t=1represents classC1andt=0represents classC2)
Can interpret the value oftas the probability of classC1,
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification
Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
How to represent multi-class labels?
If there are more than two classes (K>2), we call it a multi-class setup.
Often used: 1-of-Kcoding scheme in whichtis a vector of lengthKwhich has all values0except fortj=1, wherej comes from the membership in classCjto encode. Example: Given5classes,{C1, . . . ,C5}. Membership in
classC2will be encoded as the target vector
t= (0,1,0,0,0)T
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification
Generalised Linear Model
Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Linear Model
Idea: Use again aLinear Modelas in regression: y(x,w)is a linear function of the parametersw
y(xn,w) =wTφ(xn)
But generallyy(xn,w)∈R.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification
Generalised Linear Model
Inference and Decision Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Generalised Linear Model
Apply a mappingf :R→Zto the linear model to get the discrete class labels.
Generalised Linear Model
y(xn,w) =f(wTφ(xn))
Activation function:f(·)
Link function : f−1(·)
-0.5 0.0 0.5 1.0z
-1.0
-0.5 0.5 1.0 signHzL
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model
Inference and Decision
Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Three Models for Decision Problems
Find adiscriminant functionf(x)which maps each input directly onto a class label.
Discriminative Models
1 Solve the inference problem of determining the posterior
class probabilitiesp(Ck|x).
2 Use decision theory to assign each newxto one of the
classes.
Generative Models
1 Solve the inference problem of determining the
class-conditional probabilitiesp(x| Ck).
2 Also, infer the prior class probabilitiesp(C
k).
3 Use Bayes’ theorem to find the posteriorp(Ck|x).
4 Alternatively, model the joint distributionp(x,C
k)directly.
5 Use decision theory to assign each newxto one of the
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model
Inference and Decision
Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Decision Theory - Key Ideas
Two classesC1 andC2
joint distributionp(x,Ck) using Bayes’ theorem
p(Ck|x) = p(x| Ck)p(Ck)
p(x)
Example: cancer treatment (k=2) datax: an X-ray image
C1: patient has cancer (C2: patient has no cancer) p(C1)is the prior probability of a person having cancer p(C1|x)is the posterior probability of a person having
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model
Inference and Decision
Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Decision Theory - Key Ideas
Need a rule which assigns each value of the inputxto one of the available classes.
The input space is partitioned intodecision regionsRk. Leads todecision boundariesordecision surfaces
probability of a mistake
p(mistake) =p(x∈ R1,C2) +p(x∈ R2,C1)
=
Z
R1
p(x,C2)dx+ Z
R2
p(x,C1)dx
R2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model
Inference and Decision
Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Decision Theory - Key Ideas
probability of a mistake
p(mistake) =p(x∈ R1,C2) +p(x∈ R2,C1)
=
Z
R1
p(x,C2)dx+ Z
R2
p(x,C1)dx
goal: minimizep(mistake)
R1 R2
x0 bx
p(x,C1)
p(x,C2)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model
Inference and Decision
Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Decision Theory - Key Ideas
multiple classes
instead of minimising the probability of mistakes, maximise the probability of correct classification
p(correct) =
K
X
k=1
p(x∈ Rk,Ck)
= K
X
k=1 Z
Rk
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model
Inference and Decision
Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Minimising the Expected Loss
Not all mistakes are equally costly.
Weight each misclassification ofxto the wrong classCj instead of assigning it to the correct classCkby a factorLkj. The expected loss is now
E[L] =
X
k
X
j
Z
Rj
Lkjp(x,Ck)dx
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model
Inference and Decision
Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
The Reject Region
Avoid making automated decisions on difficult cases. Difficult cases:
posterior probabilitiesp(Ck|x)are very small
joint distributionsp(x,Ck)have comparable values
x
p(C1|x) p(C2|x)
0.0 1.0 θ
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Two Classes
Definition
Adiscriminantis a function that maps from an input vectorxto one ofKclasses, denoted byCk.
Consider first two classes (K=2). Construct a linear function of the inputsx
y(x) =wTx+w0
such thatxbeing assigned to classC1ify(x)≥0, and to
classC2otherwise. weight vectorw
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Two Classes
Decision boundaryy(x) =0is a(D−1)-dimensional hyperplane in aD-dimensional input space (decision surface).
wis orthogonal to any vector lying in the decision surface. Proof: AssumexAandxBare two points lying in the decision surface. Then,
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Two Classes
The normal distance from the origin to the decision surface is
wTx kwk =−
w0
kwk
x2
x1
w
x
y(x)
kwk
x⊥
−w0
kwk
y= 0
y <0
y >0
R2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions Fisher’s Linear Discriminant The Perceptron Algorithm
Two Classes
The value ofy(x)gives a signed measure of the
perpendicular distancerof the pointxfrom the decision surface,r=y(x)/kwk.
y(x) =wT
x
z }| {
x⊥+r
w
kwk
+w0=rw
Tw
kwk +
0
z }| {
wTx⊥+w0=rkwk
x2
x1 w
x
y(x) kwk x⊥
−w0 kwk
y= 0 y <0
y >0
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Two Classes
More compact notation : Add an extra dimension to the input space and set the value tox0=1.
Also definewe = (w0,w)andex= (1,x)
y(x) =weTex
Decision surface is now aD-dimensional hyperplane in a
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Multi-Class
Number of classesK>2
Can we combine a number of two-class discriminant functions usingK−1one-versus-the-restclassifiers?
R1
R2
R3 ?
C1
notC1
C2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Multi-Class
Number of classesK>2
Can we combine a number of two-class discriminant functions usingK(K−1)/2one-versus-oneclassifiers?
R1
R2 R3 ?
C1
C2 C1
C3
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Multi-Class
Number of classesK>2
Solution: UseKlinear functions
yk(x) =wTkx+wk0
Assign inputxto classCkifyk(x)>yj(x)for allj6=k. Decision boundary between classCkandCjgiven by
yk(x) =yj(x)
Ri
Rj
Rk
xA
xB
b
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Least Squares for Classification
Regression with a linear function of the model parameters and minimisation of sum-of-squares error function resulted in a closed-from solution for the parameter values.
Is this also possible for classification?
Given input dataxbelonging to one ofKclassesCk. Use1-of-Kbinary coding scheme.
Each class is described by its own linear model
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Least Squares for Classification
With the conventions
e
wk=
wk0
wk
∈RD+1
e
x=
1 x
∈RD+1
e
W=
e
w1 . . . weK
∈R(D+1)×K
we get for the discriminant function (vector valued)
y(x) =WeTex ∈RK.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Determine
W
e
Given a training set{xn,t}wheren=1, . . . ,N, andtis the class in the1-of-Kcoding scheme.
Define a matrixTwhere rowncorresponds totT n. The sum-of-squares error can now be written as
ED(We) = 1 2tr
n
(XeWe −T)T(XeWe −T) o
The minimum ofED(We)will be reached for
e
W= (XeTXe)−1XeTT=Xe†T
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Discriminant Function for Multi-Class
The discriminant functiony(x)is therefore
y(x) =WeTex=TT(Xe†)Tex,
whereXe is given by the training data, andexis the new
input.
Interesting property: If for everytnthe same linear constraintaTt
n+b=0holds, then the predictiony(x)will also obey the same constraint
aTy(x) +b=0.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Deficiencies of the Least Squares Approach
Magenta curve : Decision Boundary for the least squares approach ( Green curve : Decision boundary for the logistic regression model described later)
−4 −2 0 2 4 6 8
−8 −6 −4 −2 0 2 4
−4 −2 0 2 4 6 8
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision
Discriminant Functions
Fisher’s Linear Discriminant The Perceptron Algorithm
Deficiencies of the Least Squares Approach
Magenta curve : Decision Boundary for the least squares approach ( Green curve : Decision boundary for the logistic regression model described later)
−6 −4 −2 0 2 4 6
−6 −4 −2 0 2 4 6
−6 −4 −2 0 2 4 6
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Linear Discriminant
View linear classification as dimensionality reduction.
y(x) =wTx
Ify≥ −w0then classC1, otherwiseC2.
But there are many projections from aD-dimensional input space onto one dimension.
Projection always means loss of information.
For classification we want to preserve the class separation in one dimension.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Linear Discriminant
Samples from two classes in a two-dimensional input space and their histogram when projected to two different
one-dimensional spaces.
−2 2 6
−2 0 2 4
−2 2 6
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Linear Discriminant - First Try
GivenN1input data of classC1, andN2input data of class
C2, calculate the centres of the two classes
m1=
1 N1
X
n∈C1
xn, m2=
1 N2
X
n∈C2 xn
Choosewso as to maximise the projection of the class means ontow
m1−m2=wT(m1−m2)
Problem with non-uniform covariance
−2 2 6
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Linear Discriminant
Measure also the within-class variance for each class
s2k=X n∈Ck
(yn−mk)2
whereyn=wTxn.
Maximise theFisher criterion
J(w) = (m2−m1)
2
s2 1+s22
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Linear Discriminant
The Fisher criterion can be rewritten as
J(w) = w TS
Bw
wTS Ww
SB is thebetween-classcovariance
SB= (m2−m1)(m2−m1)T
SW is thewithin-classcovariance
SW =
X
n∈C1
(xn−m1)(xn−m1)T+ X
n∈C2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Linear Discriminant
The Fisher criterion
J(w) = w TS
Bw
wTS Ww
has a maximum forFisher’s linear discriminant
w∝S−W1(m2−m1)
Fisher’s linear discriminant is NOT a discriminant, but can be used to construct one by choosing a thresholdy0 in the
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Discriminant For Multi-Class
Assume that the dimensionality of the input spaceDis greater than the number of classesK.
UseD0>1linear ’features’yk=wTxand write everything in vector form (no bias involved!)
y=WTx.
The within-class covariance is then the sum of the covariances for allKclasses
SW=
K
X
k=1 Sk
where
Sk=
X
n∈Ck
(xn−mk)(xn−mk)T
mk=
1 Nk
X
n∈Ck
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Discriminant For Multi-Class
Between-class covariance
SB = K
X
k=1
Nk(mk−m)(mk−m)T.
wheremis the total mean of the input data
m= 1
N
N
X
n=1 xn.
One possible way to define a function ofWwhich is large when the between-class covariance is large and the within-class covariance is small is given by
J(W) =tr(WTS
WW)−1(WTSBW)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions
Fisher’s Linear Discriminant
The Perceptron Algorithm
Fisher’s Discriminant For Multi-Class
How many linear ’features’ can one find with this method?
SB is of rank at mostK−1because of the sum ofKrank one matrices and the global constraint viam.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant
The Perceptron Algorithm
The Perceptron Algorithm
Frank Rosenblatt (1928 - 1969)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant
The Perceptron Algorithm
The Perceptron Algorithm
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant
The Perceptron Algorithm
The Perceptron Algorithm
Two class model
Create feature vectorφ(x)by a fixed nonlinear transformation of the inputx.
Generalised linear model
y(x) =f(wTφ(x))
withφ(x)containing some bias elementφ0(x) =1.
nonlinearactivationfunction
f(a) =
(
+1, a≥0
−1, a<0
Target coding for perceptron
t=
(
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant
The Perceptron Algorithm
The Perceptron Algorithm - Error Function
Idea : Minimise total number of misclassified patterns. Problem : As a function ofw, this is piecewise constant and therefore the gradient is zero almost everywhere. Better idea: Using the(−1,+1)target coding scheme, we want all patterns to satisfywTφ(x
n)tn >0.
Perceptron Criterion: Add the errors for all patterns belonging to the set of misclassified patternsM
EP(w) =−
X
n∈M
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant
The Perceptron Algorithm
Perceptron - Stochastic Gradient Descent
Perceptron Criterion (with notationφn =φ(xn))
EP(w) =−X n∈M
wTφntn
One iteration at stepτ
1 Choose a training pair(x
n,tn)
2 Update the weight vectorwby
w(τ+1)=w(τ)−η∇EP(w) =w(τ)+ηφntn
Asy(x,w)does not depend on the norm ofw, one can set η=1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant
The Perceptron Algorithm
The Perceptron Algorithm - Update 1
Update of the perceptron weights from a misclassified pattern (green)
w(τ+1)=w(τ)+φntn
−1 −0.5 0 0.5 1
−1 −0.5 0 0.5 1
−1 −0.5 0 0.5 1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant
The Perceptron Algorithm
The Perceptron Algorithm - Update 2
Update of the perceptron weights from a misclassified pattern (green)
w(τ+1)=w(τ)+φntn
−0.5 0 0.5 1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Classification Generalised Linear Model Inference and Decision Discriminant Functions Fisher’s Linear Discriminant
The Perceptron Algorithm
The Perceptron Algorithm - Convergence
Does the algorithm converge ? For a single update step
−w(τ+1)Tφntn=−w(τ)Tφntn−(φntn)Tφntn <−w(τ)Tφntn
because(φntn)Tφntn=kφntnk>0.
BUT: contributions to the error from the other misclassified patterns might have increased.
AND: some correctly classified patterns might now be misclassified.