Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group, Data61 | CSIRO
and
College of Engineering and Computer Science, The Australian National University
Canberra, February – June 2017
© 2016 Ong & Walder, Data61 | CSIRO, The Australian National University
Part IV
Linear Curve Fitting - Least Squares
N = 10 data points
x ≡ (x_1, ..., x_N)^T,  t ≡ (t_1, ..., t_N)^T
x_i ∈ R,  t_i ∈ R,  i = 1, ..., N
Model: y(x, w) = w_1 x + w_0
Design matrix: X ≡ [x 1]
Least-squares solution: w^* = (X^T X)^{-1} X^T t
[Plot: target values t against inputs x for the N = 10 data points, with the fitted line.]
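A minimal NumPy sketch of this closed-form solution (the data here are hypothetical; the design matrix stacks x with a column of ones):

```python
import numpy as np

# Hypothetical data: N = 10 noisy observations of a line.
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 10, N)
t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)

# Design matrix X = [x 1], one row per data point.
X = np.column_stack([x, np.ones(N)])

# Normal equations: w* = (X^T X)^{-1} X^T t
w_star = np.linalg.solve(X.T @ X, X.T @ t)
print("w1 (slope), w0 (intercept):", w_star)
```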
Simple Experiment
1. Choose a box:
   red box with probability p(B = r) = 4/10,
   blue box with probability p(B = b) = 6/10.
2. Choose an item from the selected box, each item with equal probability.
What do we know?
p(F = o | B = b) = 1/4
p(F = a | B = b) = 3/4
p(F = o | B = r) = 3/4
p(F = a | B = r) = 1/4
Given that we have chosen an orange, what is the probability that the box we chose was the blue one?
Calculating the Posterior p(B = b | F = o)
Bayes’ Theorem
p(B = b | F = o) = p(F = o | B = b) p(B = b) / p(F = o)
Sum Rule for the denominator
p(F = o) = p(F = o, B = b) + p(F = o, B = r)
         = p(F = o | B = b) p(B = b) + p(F = o | B = r) p(B = r)
         = 1/4 × 6/10 + 3/4 × 4/10 = 9/20
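The same calculation as a short Python check (the numbers are exactly those given above):

```python
# Priors over boxes and likelihoods of picking an orange, as given above.
p_box = {"r": 4 / 10, "b": 6 / 10}
p_orange_given_box = {"r": 3 / 4, "b": 1 / 4}

# Sum rule: p(F=o) = sum_B p(F=o|B) p(B)
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)

# Bayes' theorem: p(B=b|F=o)
p_blue_given_orange = p_orange_given_box["b"] * p_box["b"] / p_orange

print(p_orange)             # 0.45   (= 9/20)
print(p_blue_given_orange)  # 0.333... (= 1/3)
```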
Bayes’ Theorem - revisited
Before choosing an item from a box, the most complete information we have is in p(B) (the prior).
Note that in our example p(B = b) = 6/10. Therefore, choosing a box from the prior alone, we would opt for the blue box.
Once we observe some data (e.g. we choose an orange), we can calculate p(B = b | F = o) (the posterior probability) via Bayes' theorem.
After observing an orange, the posterior probability is p(B = b | F = o) = 1/3 and therefore p(B = r | F = o) = 2/3.
Bayes’ Rule
p(Y | X) = p(X | Y) p(Y) / p(X)     (posterior = likelihood × prior / normalisation)

where the normalisation is p(X) = Σ_Y p(X | Y) p(Y).
Bayes’ Probabilities
Classical or frequentist interpretation of probabilities: probabilities as frequencies of random, repeatable events.
Bayesian view: probabilities represent uncertainty.
Example: Will the Arctic ice cap have disappeared by the end of the century?
  Fresh evidence can change the opinion on ice loss.
  Goal: quantify uncertainty and revise it in the light of new evidence.
Andrey Kolmogorov - Axiomatic Probability Theory (1933)
Let (Ω, F, P) be a measure space with P(Ω) = 1.
Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.
1. Axiom: P(E) ≥ 0 for all E ∈ F.
2. Axiom: P(Ω) = 1.
3. Axiom (countable additivity): P(⋃_i E_i) = Σ_i P(E_i) for any countable collection of pairwise disjoint events E_i ∈ F.
Richard Threlkeld Cox - (1946, 1961)
Postulates (according to Arnborg and Sjödin)
1. Divisibility and comparability: the plausibility of a statement is a real number and depends on the information we have related to the statement.
2. Common sense: plausibilities should vary sensibly with the assessment of plausibilities in the model.
3. Consistency: if the plausibility of a statement can be derived in many ways, all the results must be equal.
"Common sense" includes consistency with Aristotelian logic.
Result: the numerical quantities representing plausibility all behave according to the rules of probability (mapped either to p ∈ [0, 1] or to p ∈ [1, ∞)).
Curve fitting - revisited
Uncertainty about the parameter w is captured in the prior probability p(w).
Observed data D = {t_1, ..., t_N}.
Calculate the uncertainty in w after the data D have been observed:

p(w | D) = p(D | w) p(w) / p(D)

p(D | w) as a function of w: the likelihood function.
The likelihood expresses how probable the observed data are for different values of w.
Likelihood Function - Frequentist versus Bayesian
Likelihood function p(D | w)

Frequentist approach:
  w is considered a fixed parameter,
  its value is defined by some 'estimator',
  error bars on the estimated w are obtained from the distribution of possible data sets D.

Bayesian approach:
  there is only one single data set D (the one actually observed),
  the uncertainty in w is expressed as a probability distribution over w.
Frequentist Estimator - Maximum Likelihood
Choose w for which the likelihood p(D | w) is maximal, i.e. the w for which the probability of the observed data is maximal.
Machine learning convention: the error function is the negative log of the likelihood function.
The log is a monotonic function, so maximising the likelihood ⟺ minimising the error.
Example: a fair-looking coin is tossed three times and always lands on heads.
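A small sketch of the coin example under a Bernoulli model: the maximum-likelihood estimate of µ after three heads is the empirical mean, i.e. µ_ML = 1, even though the coin looked fair.

```python
import numpy as np

# Three tosses, all heads (1 = head, 0 = tail).
data = np.array([1, 1, 1])

# Negative log likelihood of a Bernoulli model as a function of mu.
def neg_log_likelihood(mu, x):
    return -np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

# Evaluate on a grid of mu values (avoiding the endpoints 0 and 1).
grid = np.linspace(0.01, 0.99, 99)
errors = [neg_log_likelihood(mu, data) for mu in grid]
mu_ml = grid[np.argmin(errors)]

print(mu_ml)        # 0.99 on this grid; the true maximiser is mu = 1
print(data.mean())  # 1.0 -- the closed-form ML estimate
```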
Bayesian Approach
Including prior knowledge is easy (via the prior p(w)).
BUT: a badly chosen prior can lead to bad results; the choice of prior is subjective.
Sometimes the choice of prior is motivated by a convenient mathematical form.
Need to sum/integrate over the whole parameter space; made practical by advances in sampling (Markov Chain Monte Carlo methods).
Expectations
Weighted average of a function f(x) under the probability distribution p(x):

E[f] = Σ_x p(x) f(x)        (discrete distribution p(x))

E[f] = ∫ p(x) f(x) dx       (continuous probability density p(x))
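As a sketch, the discrete expectation is just a weighted sum (hypothetical distribution and function):

```python
import numpy as np

# Hypothetical discrete distribution p(x) over x in {0, 1, 2, 3}.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])  # sums to 1

f = x ** 2                           # the function f(x) = x^2

expectation = np.sum(p * f)          # E[f] = sum_x p(x) f(x)
print(expectation)                   # 0.2*1 + 0.3*4 + 0.4*9 = 5.0
```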
Variance
For an arbitrary function f(x):

var[f] = E[(f(x) − E[f(x)])^2] = E[f(x)^2] − E[f(x)]^2

Special case f(x) = x:  var[x] = E[x^2] − E[x]^2
The Gaussian Distribution
x ∈ R

Gaussian distribution with mean µ and variance σ^2:

N(x | µ, σ^2) = 1/(2πσ^2)^{1/2} exp{ −(x − µ)^2 / (2σ^2) }

[Plot: the density N(x | µ, σ^2) against x, with the width 2σ indicated around the mean.]
The Gaussian Distribution
N(x | µ, σ^2) > 0

∫_{−∞}^{∞} N(x | µ, σ^2) dx = 1

Expectation of x:
E[x] = ∫_{−∞}^{∞} N(x | µ, σ^2) x dx = µ

Expectation of x^2:
E[x^2] = ∫_{−∞}^{∞} N(x | µ, σ^2) x^2 dx = µ^2 + σ^2

Variance of x:
var[x] = E[x^2] − E[x]^2 = σ^2
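A quick numerical check of these identities (assumed example values µ = 1, σ = 2; numerical integration via scipy.integrate.quad):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.0, 2.0  # assumed example values

def gauss(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

norm, _ = quad(gauss, -np.inf, np.inf)                          # should be 1
mean, _ = quad(lambda x: gauss(x) * x, -np.inf, np.inf)         # should be mu
second, _ = quad(lambda x: gauss(x) * x ** 2, -np.inf, np.inf)  # mu^2 + sigma^2

print(norm, mean, second - mean ** 2)  # approx. 1.0, 1.0, 4.0 (= sigma^2)
```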
Linear Curve Fitting - Least Squares
N = 10 data points
x ≡ (x_1, ..., x_N)^T,  t ≡ (t_1, ..., t_N)^T
x_i ∈ R,  t_i ∈ R,  i = 1, ..., N
Model: y(x, w) = w_1 x + w_0
Design matrix: X ≡ [x 1]
Least-squares solution: w^* = (X^T X)^{-1} X^T t

[Plot: target values t against inputs x for the N = 10 data points, with the fitted line.]

We assume each target is a deterministic part plus noise:

t = y(x, w) + ε
    (deterministic)  (noise)
The Mode of a Probability Distribution
Mode of a distribution: the value that occurs most frequently in a probability distribution.
For a probability density function: the value x at which the probability density attains its maximum.
The Gaussian distribution has one mode (it is unimodal); the mode is µ.
If there are multiple local maxima in the probability distribution, the distribution is called multimodal (example: a mixture of three Gaussians).
[Plot: Gaussian density N(x | µ, σ^2) against x, with its mode at x = µ.]
The Bernoulli Distribution
Two possible outcomes x ∈ {0, 1} (e.g. a coin which may be damaged).
p(x = 1 | µ) = µ  for  0 ≤ µ ≤ 1
p(x = 0 | µ) = 1 − µ

Bernoulli distribution:
Bern(x | µ) = µ^x (1 − µ)^{1−x}

Expectation of x:  E[x] = µ
Variance of x:  var[x] = µ(1 − µ)
The Binomial Distribution
Flip a coin N times. What is the probability of observing heads exactly m times?
This is a distribution over m ∈ {0, ..., N}.

Binomial distribution:
Bin(m | N, µ) = (N choose m) µ^m (1 − µ)^{N−m}

[Histogram: Bin(m | N, µ) over m = 0, ..., 10 for N = 10.]
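A sketch evaluating this pmf with scipy.stats (the values N = 10 and µ = 0.25 are assumed for illustration):

```python
import numpy as np
from scipy.stats import binom

N, mu = 10, 0.25     # assumed parameters for illustration
m = np.arange(N + 1)

# Bin(m | N, mu) for m = 0, ..., N
pmf = binom.pmf(m, N, mu)
print(pmf.round(3))  # probabilities over m
print(pmf.sum())     # they sum to 1
```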
The Beta Distribution
Beta Distribution
Beta(µ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) · µ^{a−1} (1 − µ)^{b−1}

[Plots: Beta(µ | a, b) over µ ∈ [0, 1] for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]
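A sketch evaluating the density with scipy.stats for the four (a, b) pairs shown above:

```python
import numpy as np
from scipy.stats import beta

mu = np.linspace(0.01, 0.99, 5)  # a few points in (0, 1)
for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
    # Beta(mu | a, b) evaluated on the grid
    print((a, b), beta.pdf(mu, a, b).round(2))
```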
The Gaussian Distribution over a Vector x
x ∈ R^D

Gaussian distribution with mean µ ∈ R^D and covariance matrix Σ ∈ R^{D×D}:

N(x | µ, Σ) = 1 / ((2π)^{D/2} |Σ|^{1/2}) exp{ −(1/2) (x − µ)^T Σ^{−1} (x − µ) }

where |Σ| is the determinant of Σ.
[Plot: contour of a 2-D Gaussian in the (x_1, x_2) plane; the principal axes point along the eigenvectors u_1, u_2 with lengths λ_1^{1/2}, λ_2^{1/2}, defining rotated coordinates y_1, y_2.]
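A direct NumPy implementation of this density, as a sketch with assumed 2-D example values:

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) for a D-dimensional vector x."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Assumed example with D = 2.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(gauss_pdf(np.array([0.5, 0.5]), mu, Sigma))
```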
The Gaussian Distribution over a Vector x
We can find a linear transformation to a new coordinate system y in which x becomes

y = U^T (x − µ),

where U is the eigenvector matrix of the covariance matrix Σ, with eigenvalue matrix E:

Σ U = U E = U diag(λ_1, λ_2, ..., λ_D)

U can be made an orthogonal matrix, therefore the columns u_i of U are unit vectors which are orthogonal to each other:

u_i^T u_j = 1 if i = j,  0 if i ≠ j
The Gaussian Distribution over a Vector x
Now use the linear transformation y = U^T (x − µ) and Σ^{−1} = U E^{−1} U^T to transform the exponent (x − µ)^T Σ^{−1} (x − µ) into

y^T E^{−1} y = Σ_{j=1}^{D} y_j^2 / λ_j

Exponentiating the sum (and taking care of the normalising factors) results in a product of scalar-valued Gaussian distributions in the orthogonal directions u_i:

p(y) = Π_{j=1}^{D} 1 / (2πλ_j)^{1/2} exp{ −y_j^2 / (2λ_j) }
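A numerical check of this factorisation, as a sketch with an assumed covariance matrix: after the transformation y = U^T(x − µ), the exponent in the original coordinates equals Σ_j y_j^2 / λ_j.

```python
import numpy as np

# Assumed example covariance, mean and point.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([0.0, 0.0])
x = np.array([1.0, -0.5])

# Eigendecomposition Sigma U = U E (eigh gives orthonormal U for symmetric Sigma).
lam, U = np.linalg.eigh(Sigma)

y = U.T @ (x - mu)
lhs = (x - mu) @ np.linalg.solve(Sigma, x - mu)  # (x-mu)^T Sigma^{-1} (x-mu)
rhs = np.sum(y ** 2 / lam)                       # sum_j y_j^2 / lambda_j
print(lhs, rhs)                                  # equal up to rounding
```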
Partitioned Gaussians
Given a joint Gaussian distribution N(x | µ, Σ), where µ is the mean vector, Σ the covariance matrix, and Λ ≡ Σ^{−1} the precision matrix.
Assume that the variables can be partitioned into two sets:

x = (x_a, x_b),   µ = (µ_a, µ_b)

Σ = [ Σ_aa  Σ_ab ]        Λ = [ Λ_aa  Λ_ab ]
    [ Σ_ba  Σ_bb ]            [ Λ_ba  Λ_bb ]

Λ_aa = (Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba)^{−1}
Λ_ab = −(Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba)^{−1} Σ_ab Σ_bb^{−1}

Conditional distribution:

p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^{−1}),   where µ_{a|b} = µ_a − Λ_aa^{−1} Λ_ab (x_b − µ_b)
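A sketch of this conditional for an assumed 2-D example, using the precision-matrix form above:

```python
import numpy as np

# Assumed 2-D joint Gaussian over (x_a, x_b).
mu = np.array([0.5, 0.5])
Sigma = np.array([[0.04, 0.03],
                  [0.03, 0.09]])
Lambda = np.linalg.inv(Sigma)   # precision matrix

xb = 0.7                        # observed value of x_b

Laa, Lab = Lambda[0, 0], Lambda[0, 1]
mu_a_given_b = mu[0] - (1.0 / Laa) * Lab * (xb - mu[1])  # conditional mean
var_a_given_b = 1.0 / Laa                                # Lambda_aa^{-1}

print(mu_a_given_b, var_a_given_b)
```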
Partitioned Gaussians
[Figure: contours of a Gaussian distribution p(x_a, x_b) over two variables x_a and x_b (left), and the marginal distribution p(x_a) together with the conditional distribution p(x_a | x_b = 0.7) (right).]
Decision Theory - Key Ideas
Two classes C_1 and C_2.
Joint distribution p(x, C_k); using Bayes' theorem

p(C_k | x) = p(x | C_k) p(C_k) / p(x)

Example: cancer treatment (K = 2 classes)
  data x: an X-ray image
  C_1: patient has cancer (C_2: patient has no cancer)
  p(C_1) is the prior probability of a person having cancer
  p(C_1 | x) is the posterior probability of a person having cancer after the X-ray image x has been observed
Decision Theory - Key Ideas
Need a rule which assigns each value of the input x to one of the available classes.
The input space is partitioned into decision regions R_k.
This leads to decision boundaries or decision surfaces.
Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
Decision Theory - Key Ideas
Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx

Goal: minimise p(mistake).

[Plot: joint density p(x, C_1) against x, with the boundary points x_0 and x̂ marked.]
Decision Theory - Key Ideas
Multiple classes: instead of minimising the probability of mistakes, maximise the probability of correct classification

p(correct) = Σ_{k=1}^{K} p(x ∈ R_k, C_k)
           = Σ_{k=1}^{K} ∫_{R_k} p(x, C_k) dx
Minimising the Expected Loss
Not all mistakes are equally costly.
Weight the misclassification of x into the wrong class C_j, instead of its correct class C_k, by a factor L_kj.
The expected loss is now

E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx
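For a single input x, minimising this expected loss amounts to assigning x to the class j that minimises Σ_k L_kj p(C_k | x). A small sketch with an assumed loss matrix and assumed posteriors:

```python
import numpy as np

# Assumed loss matrix L[k, j]: cost of deciding class j when the truth is class k.
# Here missing class 0 (e.g. "cancer") is 100x worse than the opposite error.
L = np.array([[0, 100],
              [1, 0]])

# Assumed posterior probabilities p(C_k | x) for one input x.
posterior = np.array([0.3, 0.7])

# Expected loss of each possible decision j: sum_k L[k, j] * p(C_k | x)
expected_loss = posterior @ L
decision = np.argmin(expected_loss)
print(expected_loss, "-> choose class", decision)  # [0.7, 30.0] -> class 0
```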
The Reject Region
Avoid making automated decisions on difficult cases.
Difficult cases:
  all posterior probabilities p(C_k | x) are considerably smaller than 1,
  equivalently, the joint distributions p(x, C_k) have comparable values.

[Plot: posteriors p(C_1 | x) and p(C_2 | x) against x, with a threshold θ between 0.0 and 1.0 defining the reject region.]
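A minimal sketch of the reject rule under an assumed threshold θ: classify only when the largest posterior exceeds θ, otherwise reject.

```python
import numpy as np

theta = 0.8  # assumed reject threshold

def decide(posterior, theta):
    """Return the class index, or None to reject the input."""
    k = int(np.argmax(posterior))
    return k if posterior[k] >= theta else None

print(decide(np.array([0.95, 0.05]), theta))  # confident -> class 0
print(decide(np.array([0.55, 0.45]), theta))  # ambiguous -> None (reject)
```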
S-fold Cross Validation
Given a set of N data items and targets.
Goal: find the best model (type of model, number of parameters such as the order p of the polynomial or the regularisation constant λ). Avoid overfitting.
Solution: train a machine learning algorithm with some of the data, and evaluate it with the rest.

If we have plenty of data:
1. Train a range of models, or a single model with a range of parameter settings.
2. Compare the performance on an independent data set (the validation set) and choose the model with the best predictive performance.
3. Still, overfitting to the validation set can occur. Therefore, keep aside a third, independent test set for the final evaluation of the selected model.
S-fold Cross Validation
With few data there is a dilemma: few training data or few test data.
The solution is cross-validation.
Use a portion of (S − 1)/S of the available data for training, but use all of the data to assess the performance.
S-fold Cross Validation
Partition the data into S groups.
Use S − 1 groups to train a set of models that are then evaluated on the remaining group.
Repeat for all S choices of the held-out group, and average the performance scores from the S runs.

[Diagram: S = 4 cross-validation runs (run 1 to run 4), each holding out a different group for evaluation.]
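A sketch of S-fold cross-validation with NumPy (assumed S = 4, with a mean-squared-error score for the least-squares line fit from earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 20, 4
x = np.linspace(0, 10, N)
t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)

indices = rng.permutation(N)
folds = np.array_split(indices, S)  # partition data into S groups

scores = []
for s in range(S):
    test = folds[s]
    train = np.concatenate([folds[i] for i in range(S) if i != s])

    # Fit the linear model on the S-1 training groups (normal equations, as before).
    X = np.column_stack([x[train], np.ones(len(train))])
    w = np.linalg.solve(X.T @ X, X.T @ t[train])

    # Evaluate on the held-out group.
    pred = w[0] * x[test] + w[1]
    scores.append(np.mean((pred - t[test]) ** 2))

print(np.mean(scores))  # average performance over the S runs
```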