Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Outlines
Overview Introduction Linear Algebra Probability Linear Regression 1 Linear Regression 2 Linear Classification 1 Linear Classification 2 Kernel Methods Sparse Kernel Methods Neural Networks 1 Neural Networks 2 Continuous Latent Variables Autoencoders Graphical Models 1 Graphical Models 2 Graphical Models 3 Sampling Mixture Models and EM 1 Mixture Models and EM 2 Sequential Data 1 Sequential Data 2 Combining Models
Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group Data61 | CSIRO
and
College of Engineering and Computer Science The Australian National University
Canberra February – June 2017
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Part VIII
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Three Models for Decision Problems
In increasing order of complexity
Find a discriminant functionf(x)which maps each input directly onto a class label.
Discriminative Models
1 Solve the inference problem of determining the posterior
class probabilitiesp(Ck|x).
2 Use decision theory to assign each newxto one of the
classes. Generative Models
1 Solve the inference problem of determining the
class-conditional probabilitiesp(x| Ck).
2 Also, infer the prior class probabilitiesp(Ck). 3 Use Bayes’ theorem to find the posteriorp(C
k|x).
4 Alternatively, model the joint distributionp(x,C
k)directly.
5 Use decision theory to assign each newxto one of the
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Probabilistic Generative Models
Generative approach: model class-conditional densities p(x| Ck)and priorsp(Ck)to calculate the posterior probability for classC1
p(C1|x) =
p(x| C1)p(C1)
p(x| C1)p(C1) +p(x| C2)p(C2)
= 1
1+exp(−a(x)) =σ(a(x))
whereaand thelogistic sigmoidfunctionσ(a)are given by
a(x) =lnp(x| C1)p(C1)
p(x| C2)p(C2)
=lnp(x,C1)
p(x,C2)
σ(a) = 1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Logistic Sigmoid
Thelogistic sigmoidfunctionσ(a) = 1 1+exp(−a)
"squashing function’ because it maps the real axis into a finite interval(0,1)
σ(−a) =1−σ(a) Derivative d
daσ(a) =σ(a)σ(−a) =σ(a) (1−σ(a)) Inverse is calledlogitfunctiona(σ) =ln1−σσ
-10 -5 5 10 a
0.2 0.4 0.6 0.8 1.0
ΣHaL
Logistic Sigmoidσ(a)
0.2 0.4 0.6 0.8 1.0Σ
-6
-4
-2 2 4 aHΣL
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Probabilistic Generative Models - Multiclass
Thenormalised exponentialis given by
p(Ck|x) =
p(x| Ck)p(Ck) P
jp(x| Cj)p(Cj)
= Pexp(ak) jexp(aj)
where
ak=ln(p(x| Ck)p(Ck)).
Also calledsoftmax functionas it is a smoothed version of themaxfunction.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Probabil. Generative Model - Continuous Input
Assume class-conditional probabilities are Gaussian, all classes share thesame covariance. What can we say about the posterior probabilities?
p(x| Ck) =
1
(2π)D/2
1 |Σ|1/2exp
−1
2(x−µk)
TΣ−1(x−µ
k)
= 1
(2π)D/2
1 |Σ|1/2exp
−1 2x
TΣ−1x
×exp
µTkΣ−1x−1
2µ
T kΣ
−1
µk
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University Probabilistic Generative Models Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Probabil. Generative Model - Continuous Input
For two classes
p(C1|x) =σ(a(x))
anda(x)is
a(x) =lnp(x| C1)p(C1)
p(x| C2)p(C2)
=lnexp
µT
1Σ
−1x−1 2µ
T
1Σ
−1µ 1
exp
µT
2Σ
−1x−1 2µ
T
2Σ
−1µ 2
+lnp(C1)
p(C2)
Therefore
p(C1|x) =σ(wTx+w0)
where
w=Σ−1(µ1−µ2)
w0 =−
1 2µ
T
1Σ
−1µ 1+
1 2µ
T
2Σ
−1µ 2+ln
p(C1)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Probabil. Generative Model - Continuous Input
Class-conditional densities for two classes (left). Posterior probabilityp(C1|x)(right). Note the logistic sigmoid of a linear
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
General Case - K Classes, Shared Covariance
Use the normalised exponential
p(Ck|x) =
p(x| Ck)p(Ck) P
jp(x| Cj)p(Cj)
= Pexp(ak) jexp(aj)
where
ak=ln(p(x| Ck)p(Ck)).
to get a linear function ofx
ak(x) =wTkx+wk0.
where
wk=Σ−1µk
wk0=−
1 2µ
T kΣ
−1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
General Case - K Classes, Different Covariance
If each class-conditional probability has adifferent covariance, the quadratic terms−1
2x
TΣ−1x
do not longer cancel each other out.
We get aquadraticdiscriminant.
−2 −1 0 1 2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Maximum Likelihood Solution
Given the functional form of the class-conditional densities p(x| Ck), can we determine the parametersµandΣ?
Not without data ;-)
Given also a data set(xn,tn)forn=1, . . . ,N. (Using the coding scheme wheretn=1corresponds to classC1 and
tn=0denotes classC2.
Assume the class-conditional densities to be Gaussian with the same covariance, but different mean.
Denote the prior probabilityp(C1) =π, and therefore
p(C2) =1−π.
Then
p(xn,C1) =p(C1)p(xn| C1) =πN(xn|µ1,Σ)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Maximum Likelihood Solution
Given the functional form of the class-conditional densities p(x| Ck), can we determine the parametersµandΣ? Not without data ;-)
Given also a data set(xn,tn)forn=1, . . . ,N. (Using the coding scheme wheretn=1corresponds to classC1 and
tn=0denotes classC2.
Assume the class-conditional densities to be Gaussian with the same covariance, but different mean.
Denote the prior probabilityp(C1) =π, and therefore
p(C2) =1−π.
Then
p(xn,C1) =p(C1)p(xn| C1) =πN(xn|µ1,Σ)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Maximum Likelihood Solution
Thus the likelihood for the whole data setXandtis given by
p(t,X|π,µ1,µ2,Σ) = N Y
n=1
[πN(xn|µ1,Σ)]
tn×
[(1−π)N(xn|µ2,Σ)] 1−tn
Maximise the log likelihood The term depending onπis
N X
n=1
(tnlnπ+ (1−tn)ln(1−π)
which is maximal for
π= 1 N
N X
n=1
tn= N1
N = N1
N1+N2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Maximum Likelihood Solution
Similarly, we can maximise the log likelihood (and thereby the likelihoodp(t,X|π,µ1,µ2,Σ)) depending on the mean
µ1orµ2, and get
µ1= 1
N1
N X
n=1
tnxn
µ2= 1
N2
N X
n=1
(1−tn)xn
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Maximum Likelihood Solution
Finally, the log likelihoodlnp(t,X|π,µ1,µ2,Σ)can be maximised for the covarianceΣresulting in
Σ=N1
NS1+ N2
NS2
Sk=
1
Nk X
n∈Ck
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input
Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Discrete Features - Naive Bayes
Assume the input space consists of discrete features, in the simplest casexi∈ {0,1}.
For aD-dimensional input space, a general distribution would be represented by a table with2Dentries.
Together with the normalisation constraint, this are2D−1 independent variables.
Grows exponentially with the number of features. TheNaive Bayesassumption is that all features
conditioned on the classCkare independent of each other.
p(x| Ck) = D Y
i=1
µxi
ki(1−µki)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input
Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Discrete Features - Naive Bayes
With the naive Bayes
p(x| Ck) = D Y
i=1
µxi
ki(1−µki)
1−xi
we can then again find the factorsakin the normalised exponential
p(Ck|x) =
p(x| Ck)p(Ck) P
jp(x| Cj)p(Cj)
= Pexp(ak) jexp(aj)
as a linear function of thexi
ak(x) = D X
i=1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Three Models for Decision Problems
In increasing order of complexity
Find a discriminant functionf(x)which maps each input directly onto a class label.
Discriminative Models
1 Solve the inference problem of determining the posterior
class probabilitiesp(Ck|x).
2 Use decision theory to assign each newxto one of the
classes. Generative Models
1 Solve the inference problem of determining the
class-conditional probabilitiesp(x| Ck).
2 Also, infer the prior class probabilitiesp(Ck). 3 Use Bayes’ theorem to find the posteriorp(C
k|x).
4 Alternatively, model the joint distributionp(x,C
k)directly.
5 Use decision theory to assign each newxto one of the
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Probabilistic Discriminative Models
Maximise a likelihood function defined through the conditional distributionp(Ck|x)directly.
Discriminativetraining
Typically fewer parameters to be determined.
As we learn the posterirorp(Ck|x)directly, prediction may be better than with a generative model where the
class-conditional density assumptionsp(x| Ck)poorly approximate the true distributions.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Original Input versus Feature Space
Used direct inputxuntil now.
All classification algorithms work also if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functionsφ(x).
Example: Use two Gaussian basis functions centered at the green crosses in the input space.
x2
−1 0 1
φ2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Original Input versus Feature Space
Linear decision boundaries in the feature space
correspond to nonlinear decision boundaries in the input space.
Classes which are NOT linearly separable in the input space can become linearly separable in the feature space. BUT: If classes overlap in input space, they will also overlap in feature space.
Nonlinear featuresφ(x)can not remove the overlap; but they may increase it !
x1
x2
−1 0 1
−1 0 1
φ1
φ2
0 0.5 1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Original Input versus Feature Space
Fixed basis functions do not adapt to the data and therefore have important limitations (see discussion in Linear Regression).
Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.
Some applications use fixed features successfully by avoiding the limitations.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features
Probabilistic Discriminative Models
Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Logistic Regression is Classification
Two classes where the posterior of classC1is a logistic
sigmoidσ()acting on a linear function of the feature vector
φ
p(C1|φ) =y(φ) =σ(wTφ)
p(C2|φ) =1−p(C1|φ)
Model dimension is equal to dimension of the feature spaceM.
Compare this to fitting two Gaussians
2M |{z} means
+M(M+1)/2
| {z }
shared covariance
=M(M+5)/2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features
Probabilistic Discriminative Models
Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Logistic Regression is Classification
Determine the parameter via maximum likelihood for data (φn,tn),n=1, . . . ,N, whereφn=φ(xn). The class membership is coded astn ∈ {0,1}.
Likelihood function
p(t|w) = N Y
n=1
ytn
n(1−yn)1−tn
whereyn=p(C1|φn).
Error function : negative log likelihood resulting in the cross-entropyerror function
E(w) =−lnp(t|w) =−
N X
n=1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features
Probabilistic Discriminative Models
Logistic Regression Iterative Reweighted Least Squares Laplace Approximation Bayesian Logistic Regression
Logistic Regression is Classification
Error function (cross-entropyerror )
E(w) =−
N X
n=1
{tnlnyn+ (1−tn)ln(1−yn)}
yn=p(C1|φn) =σ(wTφn)
Gradient of the error function (using dσ
da =σ(1−σ))
∇E(w) = N X
n=1
(yn−tn)φn
gradient does not contain any sigmoid function
for each data point error is product of deviationyn−tn and basis functionφn.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation Bayesian Logistic Regression
Laplace Approximation
Given a continous distributionp(x)which is not Gaussian, can we approximate it by a Gaussianq(x)?
Need to find a mode ofp(x). Try to find a Gaussian with the same mode.
−2 −1 0 1 2 3 4
0 0.2 0.4 0.6 0.8
Non-Gaussian (yellow) and Gaussian approximation (red).
−2 −1 0 1 2 3 4
0 10 20 30 40
Negative log of the Non-Gaussian (yellow) and
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation Bayesian Logistic Regression
Laplace Approximation
Assumep(x)can be written as
p(z) = 1 Z f(z)
with normalisationZ=R
f(z)dz.
Furthermore, assumeZis unknown !
A mode ofp(z)is at a pointz0wherep0(z0) =0.
Taylor expansion oflnf(z)atz0
lnf(z)'lnf(z0)−
1
2A(z−z0)
2
where
A=−d
2
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation Bayesian Logistic Regression
Laplace Approximation
Exponentiating
lnf(z)'lnf(z0)−
1
2A(z−z0)
2
we get
f(z)'f(z0)exp{−
A
2(z−z0)
2}.
And after normalisation we get theLaplace approximation
q(z) =
A
2π 1/2
exp{−A 2(z−z0)
2}.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation Bayesian Logistic Regression
Laplace Approximation - Vector Space
Approximate p(z) forz∈RM
p(z) = 1 Z f(z).
we get the Taylor expansion
lnf(z)'lnf(z0)−
1
2(z−z0)
TA(z−z
0)
where the HessianAis defined as
A=−∇∇lnf(z)|z=z0.
The Laplace approximation ofp(z)is then
q(z) = |A|
1/2
(2π)M/2exp
−1
2(z−z0)
TA(z−z
0)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression
Bayesian Logistic Regression
Exact Bayesian inference for the logistic regression is intractable.
Why? Need to normalise a product of prior probabilities and likelihoods which itself are a product of logistic sigmoid functions, one for each data point.
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression
Bayesian Logistic Regression
Assume a Gaussian prior because we want a Gaussian posterior.
p(w) =N(w|m0,S0)
for fixedhyperparameterm0 andS0.
Hyperparametersare parameters of a prior distribution. In contrast to the model parametersw, they are not learned. For a set of training data(xn,tn), wheren=1, . . . ,N, the posterior is given by
p(w|t)∝p(w)p(t|w)
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression
Bayesian Logistic Regression
Using our previous result for the cross-entropy function
E(w) =−lnp(t|w) =−
N X
n=1
{tnlnyn+ (1−tn)ln(1−yn)}
we can now calculate the log of the posterior p(w|t)∝p(w)p(t|w)
using the notationyn=σ(wTφn)as
lnp(w|t) =−1
2(w−m0)
TS−1
0 (w−m0)
+ N X
n=1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression
Bayesian Logistic Regression
To obtain a Gaussian approximation to
lnp(w|t) =−1
2(w−m0)
TS−1
0 (w−m0)
+ N X
n=1
{tnlnyn+ (1−tn)ln(1−yn)}
1 FindwMAPwhich maximiseslnp(w|t). This defines the
mean of the Gaussian approximation. (Note: This is a nonlinear function inwbecauseyn=σ(wTφn).)
2 Calculate the second derivative of the negative log likelihood
to get the inverse covariance of the Laplace approximation
SN =−∇∇lnp(w|t) =S−01+
N X
n=1
Introduction to Statistical Machine Learning
c 2016 Ong & Walder Data61 | CSIRO The Australian National
University
Probabilistic Generative Models
Continuous Input Discrete Features Probabilistic Discriminative Models Logistic Regression Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression
Bayesian Logistic Regression
The approximated Gaussian (via Laplace approximation) of the posterior distribution is now
q(w|φ) =N(w|wMAP,SN)
where
SN =−∇∇lnp(w|t) =S−01+
N X
n=1