Introduction to Statistical Machine Learning
© 2016 Ong & Walder, Data61 | CSIRO, The Australian National University
Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group
Data61 | CSIRO
and
College of Engineering and Computer Science
The Australian National University
Canberra
February – June 2017
Polynomial Curve Fitting
Probability Theory
Probability Densities
Expectations and Covariances
Part II
Flavour of this course
Formalise intuitions about problems
Use language of mathematics to express models
Geometry, vectors, linear algebra for reasoning
Probabilistic models to capture uncertainty
Design and analysis of algorithms
Numerical algorithms in python
What is Machine Learning?
Definition (Mitchell, 1998)
Polynomial Curve Fitting
Some artificial data created from the function $\sin(2\pi x)$ + random noise, for $x = 0, \dots, 1$
[Figure: training data, target $t$ against input $x$ on $[0, 1]$]
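A minimal sketch of how such data could be generated. The slide does not state the noise distribution or the spacing of the inputs, so the Gaussian noise, its standard deviation `noise_std`, the evenly spaced inputs and the seed are all assumptions made for illustration.

```python
import numpy as np

def make_data(n_points, noise_std=0.3, seed=0):
    """Sample t = sin(2*pi*x) + noise at evenly spaced x in [0, 1].

    Gaussian noise and noise_std=0.3 are assumptions, not taken from the slides.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, noise_std, size=n_points)
    return x, t

x, t = make_data(10)
for x_n, t_n in zip(x, t):
    print(f"x = {x_n:.3f}, t = {t_n:+.3f}")
```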
Polynomial Curve Fitting - Input Specification
$N = 10$
$\mathbf{x} \equiv (x_1, \dots, x_N)^T$
$\mathbf{t} \equiv (t_1, \dots, t_N)^T$
$x_i \in \mathbb{R}, \quad i = 1, \dots, N$
Polynomial Curve Fitting - Model Specification
$M$: order of polynomial
$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{m=0}^{M} w_m x^m$
nonlinear function of $x$
linear function of the unknown model parameter $\mathbf{w}$
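The two properties above can be checked numerically. The sketch below evaluates the polynomial through a matrix of powers of $x$ and verifies that the output is linear in $\mathbf{w}$. The helper names `design_matrix` and `y` and the example weights are hypothetical.

```python
import numpy as np

def design_matrix(x, M):
    """One row per input, columns x**0, x**1, ..., x**M."""
    return np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)

def y(x, w):
    """Evaluate y(x, w) = sum_m w[m] * x**m for every entry of x."""
    w = np.asarray(w, dtype=float)
    return design_matrix(x, len(w) - 1) @ w

x = np.array([0.0, 0.5, 1.0])
w_a = np.array([1.0, -2.0, 3.0])   # example weights, chosen arbitrarily
w_b = np.array([0.5, 0.0, -1.0])

# Linear in w: y(x, 2*w_a + w_b) equals 2*y(x, w_a) + y(x, w_b).
lhs = y(x, 2.0 * w_a + w_b)
rhs = 2.0 * y(x, w_a) + y(x, w_b)
print(np.allclose(lhs, rhs))
```

Because the model is linear in $\mathbf{w}$, fitting it reduces to a linear least-squares problem even though the curve itself is nonlinear in $x$.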
Learning is Improving Performance
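[Figure: data points $(x_n, t_n)$ and model predictions $y(x_n, \mathbf{w})$, plotted as $t$ against $x$]
Performance measure: error between target and prediction of the model for the training data
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2$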
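A minimal sketch of minimising this error for a polynomial model. The slides do not say how the minimiser is computed; using `np.linalg.lstsq` is one possible choice, and the data (Gaussian noise with standard deviation 0.3, fixed seed) are assumed for illustration.

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)**2 for a polynomial model."""
    phi = np.vander(x, len(w), increasing=True)
    residual = phi @ w - t
    return 0.5 * np.sum(residual ** 2)

def fit_polynomial(x, t, M):
    """Coefficients minimising E(w) for a polynomial of order M."""
    phi = np.vander(x, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(phi, t, rcond=None)
    return w_star

# Assumed training data: sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=10)

w_star = fit_polynomial(x, t, M=3)
print("E(w*) =", sum_of_squares_error(w_star, x, t))
print("E(0)  =", sum_of_squares_error(np.zeros(4), x, t))
```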
Model Comparison or Model Selection
$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \Big|_{M=0} = w_0$
[Figure: fit for $M = 0$, $t$ against $x$ on $[0, 1]$]
Model Comparison or Model Selection
$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \Big|_{M=1} = w_0 + w_1 x$
[Figure: fit for $M = 1$, $t$ against $x$ on $[0, 1]$]
Model Comparison or Model Selection
$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \Big|_{M=3} = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
[Figure: fit for $M = 3$, $t$ against $x$ on $[0, 1]$]
Model Comparison or Model Selection
$y(x, \mathbf{w}) = \sum_{m=0}^{M} w_m x^m \Big|_{M=9} = w_0 + w_1 x + \dots + w_8 x^8 + w_9 x^9$
overfitting
[Figure: fit for $M = 9$, $t$ against $x$ on $[0, 1]$]
Testing the Model
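Train the model and get $\mathbf{w}^\star$
Get 100 new data points
Root-mean-square (RMS) error
$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star) / N}$
[Figure: $E_{\mathrm{RMS}}$ against polynomial order $M$ (horizontal ticks 0, 3, 6, 9; vertical ticks 0, 0.5, 1)]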
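The procedure above can be sketched as follows. The noise level, the seed and the uniformly drawn test inputs are assumptions, so the printed numbers will not reproduce the figure on the slide. In `rms_error`, $N$ is the number of points in the set being evaluated.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares coefficients w* for a polynomial of order M."""
    phi = np.vander(x, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(phi, t, rcond=None)
    return w_star

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), the root of the mean squared residual."""
    phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((phi @ w - t) ** 2))

rng = np.random.default_rng(0)
noise_std = 0.3  # assumed noise level

x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2.0 * np.pi * x_train) + rng.normal(0.0, noise_std, size=10)
x_test = rng.uniform(0.0, 1.0, size=100)  # 100 new data points
t_test = np.sin(2.0 * np.pi * x_test) + rng.normal(0.0, noise_std, size=100)

train_rms, test_rms = [], []
for M in range(10):
    w_star = fit_polynomial(x_train, t_train, M)
    train_rms.append(rms_error(w_star, x_train, t_train))
    test_rms.append(rms_error(w_star, x_test, t_test))
    print(f"M = {M}: train {train_rms[-1]:.3f}, test {test_rms[-1]:.3f}")
```

With 10 training points, the order-9 polynomial can pass through every training point, so its training error is close to zero regardless of how it behaves on new data.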
Testing the Model
Coefficient      M = 0    M = 1    M = 3     M = 9
$w_0^\star$       0.19     0.82     0.31           0.35
$w_1^\star$               -1.27     7.99         232.37
$w_2^\star$                       -25.43       -5321.83
$w_3^\star$                        17.37       48568.31
$w_4^\star$                                  -231639.30
$w_5^\star$                                   640042.26
$w_6^\star$                                 -1061800.52
$w_7^\star$                                  1042400.18
$w_8^\star$                                  -557682.99
$w_9^\star$                                   125201.43
More Data
$N = 15$
[Figure: fit with $N = 15$ data points, $t$ against $x$ on $[0, 1]$]
More Data
$N = 100$
Heuristic: have no fewer than 5 to 10 times as many data points as parameters,
but the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: Bayesian approach
[Figure: fit with $N = 100$ data points, $t$ against $x$ on $[0, 1]$]
Regularisation
How to constrain the growth of the coefficients $\mathbf{w}$?
Add a regularisation term to the error function
$\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
Squared norm of the parameter vector $\mathbf{w}$
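A minimal sketch of minimising the regularised error. Setting the gradient of $\widetilde{E}(\mathbf{w})$ to zero gives the linear system $(\Phi^T \Phi + \lambda I)\,\mathbf{w} = \Phi^T \mathbf{t}$, where $\Phi$ is the matrix of powers of the inputs. The symbol $\Phi$, the function name and the data are assumptions introduced here, not taken from the slides.

```python
import numpy as np

def fit_regularised(x, t, M, lam):
    """Minimise 1/2 * sum_n (y(x_n, w) - t_n)**2 + lam/2 * ||w||**2."""
    phi = np.vander(x, M + 1, increasing=True)
    a = phi.T @ phi + lam * np.eye(M + 1)
    return np.linalg.solve(a, phi.T @ t)

# Assumed training data: sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, size=10)

w_weak = fit_regularised(x, t, M=9, lam=np.exp(-18.0))  # ln(lambda) = -18
w_strong = fit_regularised(x, t, M=9, lam=1.0)          # ln(lambda) = 0

print("||w|| at ln(lambda) = -18:", np.linalg.norm(w_weak))
print("||w|| at ln(lambda) =   0:", np.linalg.norm(w_strong))
```

A larger $\lambda$ penalises large coefficients more heavily, so the norm of the fitted $\mathbf{w}$ shrinks as $\lambda$ grows.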
Regularisation
$M = 9$
[Figure: regularised fit with $\ln \lambda = -18$, $t$ against $x$ on $[0, 1]$]
Regularisation
$M = 9$
[Figure: regularised fit with $\ln \lambda = 0$, $t$ against $x$ on $[0, 1]$]
Regularisation
$M = 9$
[Figure: $E_{\mathrm{RMS}}$ against $\ln \lambda$ (horizontal ticks from $-35$ to $-20$; vertical ticks 0, 0.5, 1)]
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
Task: regression
Experience: $\mathbf{x}$ input examples, $\mathbf{t}$ output labels
Performance: squared error
Model choice
Regularisation
Probability Theory
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60
[Figure: joint distribution $p(X, Y)$ over $X$, with the row $Y = 2$ marked]
Sum Rule
$p(X = d, Y = 1) = 8/60$
$p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60$
$p(X = d) = \sum_Y p(X = d, Y)$
$p(X) = \sum_Y p(X, Y)$
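The sum rule can be checked on the count table with a few lines of NumPy. This is a small sketch; the array layout (row 0 for $Y = 2$, row 1 for $Y = 1$, columns a to i) is a choice made here.

```python
import numpy as np

# Counts from the table: row 0 is Y = 2, row 1 is Y = 1, columns are X = a..i.
counts = np.array([
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
], dtype=float)

joint = counts / counts.sum()   # p(X, Y)
p_x = joint.sum(axis=0)         # sum rule: p(X) = sum_Y p(X, Y)
p_y = joint.sum(axis=1)         # sum rule: p(Y) = sum_X p(X, Y)

d = 3  # column index of X = d
print("p(X = d) =", p_x[d])     # 1/60 + 8/60
print("p(Y = 2), p(Y = 1) =", p_y)
```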
Sum Rule
$p(X) = \sum_Y p(X, Y)$
[Figure: marginal distribution $p(X)$ over $X$]
$p(Y) = \sum_X p(X, Y)$
Product Rule
Conditional Probability
$p(X = d \mid Y = 1) = 8/34$
Calculate $p(Y = 1)$:
$p(Y = 1) = \sum_X p(X, Y = 1) = 34/60$
$p(X = d, Y = 1) = p(X = d \mid Y = 1)\, p(Y = 1)$
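The same count table gives a quick numerical check of the product rule. As before, the array layout (row 0 for $Y = 2$, row 1 for $Y = 1$) is a choice made for this sketch.

```python
import numpy as np

# Counts from the table: row 0 is Y = 2, row 1 is Y = 1, columns are X = a..i.
counts = np.array([
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
], dtype=float)

joint = counts / counts.sum()                # p(X, Y)
p_y = joint.sum(axis=1)                      # p(Y)
p_x_given_y = counts / counts.sum(axis=1, keepdims=True)  # p(X | Y), one row per Y

d, y1 = 3, 1  # indices of X = d and Y = 1
print("p(X = d | Y = 1) =", p_x_given_y[y1, d])   # 8/34
print("p(Y = 1)         =", p_y[y1])              # 34/60
print("product          =", p_x_given_y[y1, d] * p_y[y1])
print("p(X = d, Y = 1)  =", joint[y1, d])         # 8/60
```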
Product Rule
$p(X) = \sum_Y p(X, Y)$
[Figure: marginal distribution $p(X)$ over $X$]
$p(X, Y) = p(X \mid Y)\, p(Y)$
Sum Rule and Product Rule
Sum Rule
$p(X) = \sum_Y p(X, Y)$
Product Rule
$p(X, Y) = p(X \mid Y)\, p(Y)$
Bayes Theorem
Use product rule
$p(X, Y) = p(X \mid Y)\, p(Y) = p(Y \mid X)\, p(X)$
Bayes Theorem
$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$
only defined for $p(X) > 0$, and
$p(X) = \sum_Y p(X, Y)$ (sum rule)
$= \sum_Y p(X \mid Y)\, p(Y)$ (product rule)
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Polynomial Curve Fitting
Probability Theory
Probability Densities
Expectations and Covariances
79of 90
Probability Densities
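Real valued variable $x \in \mathbb{R}$
The probability of $x$ falling in the interval $(x, x + \delta x)$ is given by $p(x)\, \delta x$ for infinitesimally small $\delta x$.
$p(x \in (a, b)) = \int_a^b p(x)\, \mathrm{d}x$.
[Figure: density $p(x)$ with an interval of width $\delta x$ marked at $x$]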
Constraints on p(x)
Nonnegative: $p(x) \ge 0$
Normalisation: $\int_{-\infty}^{\infty} p(x)\, \mathrm{d}x = 1$.
Cumulative distribution function P(x)
$P(x) = \int_{-\infty}^{x} p(z)\, \mathrm{d}z$
or
$\frac{\mathrm{d}}{\mathrm{d}x} P(x) = p(x)$
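These properties of a density can be checked numerically. The sketch below uses the standard normal density as an assumed example (the slides do not fix a particular $p(x)$), truncates the infinite range to $[-10, 10]$ and integrates with the trapezoidal rule.

```python
import numpy as np

def gaussian_density(x):
    """Standard normal density, used here only as an example p(x)."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def integrate(f, a, b, n=20001):
    """Trapezoidal approximation of the integral of f over (a, b)."""
    x = np.linspace(a, b, n)
    fx = f(x)
    return np.sum(0.5 * (fx[1:] + fx[:-1]) * np.diff(x))

total = integrate(gaussian_density, -10.0, 10.0)   # normalisation, close to 1
prob = integrate(gaussian_density, -1.0, 1.0)      # p(x in (-1, 1))

# Cumulative distribution P(x) on a grid, and its numerical derivative.
z = np.linspace(-10.0, 10.0, 20001)
pz = gaussian_density(z)
cdf = np.concatenate([[0.0], np.cumsum(0.5 * (pz[1:] + pz[:-1]) * np.diff(z))])
density_from_cdf = np.gradient(cdf, z)             # should recover p(x)

print("integral over the line:", total)
print("p(x in (-1, 1)):", prob)
print("max |dP/dx - p|:", np.max(np.abs(density_from_cdf - pz)))
```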
Multivariate Probability Density
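Vector $\mathbf{x} \equiv (x_1, \dots, x_D)^T = \begin{bmatrix} x_1 \\ \vdots \\ x_D \end{bmatrix}$
Nonnegative: $p(\mathbf{x}) \ge 0$
Normalisation: $\int_{-\infty}^{\infty} p(\mathbf{x})\, \mathrm{d}\mathbf{x} = 1$.
This means $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\mathbf{x})\, \mathrm{d}x_1 \cdots \mathrm{d}x_D = 1$.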
Sum and Product Rule for Probability Densities
Sum Rule
$p(x) = \int_{-\infty}^{\infty} p(x, y)\, \mathrm{d}y$
Product Rule
$p(x, y) = p(x \mid y)\, p(y)$
Expectations
Weighted average of a function $f(x)$ under the probability distribution $p(x)$
$\mathbb{E}[f] = \sum_x p(x)\, f(x)$ (discrete distribution $p(x)$)
$\mathbb{E}[f] = \int p(x)\, f(x)\, \mathrm{d}x$ (probability density $p(x)$)
How to approximate $\mathbb{E}[f]$
Given a finite number $N$ of points $x_n$ drawn from the probability distribution $p(x)$.
Approximate the expectation by a finite sum:
$\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$
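A minimal sketch of this approximation. The choice of $p(x)$ as a standard normal and $f(x) = x^2$ is an assumption made for illustration; for that choice the exact value is $\mathbb{E}[f] = 1$.

```python
import numpy as np

def f(x):
    """Example function; E[f] = 1 under a standard normal p(x)."""
    return x ** 2

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)  # x_n drawn from p(x)

estimate = np.mean(f(samples))                # (1/N) * sum_n f(x_n)
print("finite-sum estimate of E[f]:", estimate)
```

The estimate fluctuates around the exact value, and the size of the fluctuation shrinks as $N$ grows.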
Expectation of a function of several variables
arbitrary function $f(x, y)$
$\mathbb{E}_x[f(x, y)] = \sum_x p(x)\, f(x, y)$ (discrete distribution $p(x)$)
$\mathbb{E}_x[f(x, y)] = \int p(x)\, f(x, y)\, \mathrm{d}x$ (probability density $p(x)$)
Conditional Expectation
arbitrary function $f(x)$
$\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x)$ (discrete distribution $p(x)$)
$\mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\, \mathrm{d}x$ (probability density $p(x)$)
Note that $\mathbb{E}_x[f \mid y]$ is a function of $y$.
Other notation used in the literature: $\mathbb{E}_{x \mid y}[f]$.
What is $\mathbb{E}[\mathbb{E}[f(x) \mid y]]$? Can we simplify it?
This must mean $\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]]$. (Why?)
$\mathbb{E}_y[\mathbb{E}_x[f(x) \mid y]] = \sum_y p(y)\, \mathbb{E}_x[f \mid y] = \sum_y p(y) \sum_x p(x \mid y)\, f(x)$
$= \sum_{x, y} f(x)\, p(x, y) = \sum_x f(x)\, p(x)$
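The simplification above can be verified on the count table from the probability slides. The function `f` below, which maps the categories a to i to the numbers 0 to 8, is a hypothetical choice made only so that there is something to average.

```python
import numpy as np

# Counts from the table: row 0 is Y = 2, row 1 is Y = 1, columns are X = a..i.
counts = np.array([
    [0, 0, 0, 1, 4, 5, 8, 6, 2],
    [3, 6, 8, 8, 5, 3, 1, 0, 0],
], dtype=float)

joint = counts / counts.sum()          # p(x, y)
p_x = joint.sum(axis=0)                # p(x)
p_y = joint.sum(axis=1)                # p(y)
p_x_given_y = joint / p_y[:, None]     # p(x | y), one row per value of y

f = np.arange(9, dtype=float)          # hypothetical f(x): a -> 0, ..., i -> 8

inner = p_x_given_y @ f                # E_x[f | y], a function of y
outer = p_y @ inner                    # E_y[E_x[f | y]]
direct = p_x @ f                       # sum_x f(x) p(x)

print("E_x[f | y] for Y = 2 and Y = 1:", inner)
print("E_y[E_x[f | y]] =", outer)
print("sum_x f(x) p(x) =", direct)
```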
Variance
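arbitrary function $f(x)$
$\mathrm{var}[f] = \mathbb{E}\left[ (f(x) - \mathbb{E}[f(x)])^2 \right] = \mathbb{E}\left[ f(x)^2 \right] - \mathbb{E}[f(x)]^2$
Special case: $f(x) = x$
$\mathrm{var}[x] = \mathbb{E}\left[ x^2 \right] - \mathbb{E}[x]^2$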
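Both forms of the variance can be compared on a small discrete distribution. The fair six-sided die used here is an assumed example, not part of the slides; it corresponds to the special case $f(x) = x$.

```python
import numpy as np

values = np.arange(1, 7, dtype=float)   # outcomes of a fair die (assumed example)
probs = np.full(6, 1.0 / 6.0)           # p(x)

def expectation(f_values, probs):
    """E[f] = sum_x p(x) f(x) for a discrete distribution."""
    return np.sum(probs * f_values)

mean = expectation(values, probs)
var_definition = expectation((values - mean) ** 2, probs)   # E[(x - E[x])^2]
var_shortcut = expectation(values ** 2, probs) - mean ** 2  # E[x^2] - E[x]^2

print("E[x] =", mean)
print("var[x] from the definition:", var_definition)
print("var[x] from E[x^2] - E[x]^2:", var_shortcut)
```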
Covariance
Two random variables $x \in \mathbb{R}$ and $y \in \mathbb{R}$
$\mathrm{cov}[x, y] = \mathbb{E}_{x,y}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])] = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\, \mathbb{E}[y]$
With $\mathbb{E}[x] = a$ and $\mathbb{E}[y] = b$
$\mathrm{cov}[x, y] = \mathbb{E}_{x,y}[(x - a)(y - b)]$
$= \mathbb{E}_{x,y}[x y] - \mathbb{E}_{x,y}[x b] - \mathbb{E}_{x,y}[a y] + \mathbb{E}_{x,y}[a b]$
$= \mathbb{E}_{x,y}[x y] - b \underbrace{\mathbb{E}_{x,y}[x]}_{= \mathbb{E}_x[x]} - a \underbrace{\mathbb{E}_{x,y}[y]}_{= \mathbb{E}_y[y]} + a b \underbrace{\mathbb{E}_{x,y}[1]}_{= 1}$
$= \mathbb{E}_{x,y}[x y] - a b - a b + a b = \mathbb{E}_{x,y}[x y] - a b$
$= \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\, \mathbb{E}[y]$
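The identity derived above also holds when the expectations are replaced by sample averages. The sketch below uses an assumed pair of correlated variables, $y = 2x + \text{noise}$ with standard normal $x$ and noise, for which the true covariance is 2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)
y = 2.0 * x + rng.normal(0.0, 1.0, size=10_000)  # assumed example: cov[x, y] = 2

a, b = x.mean(), y.mean()                 # sample versions of E[x] and E[y]
cov_definition = np.mean((x - a) * (y - b))
cov_shortcut = np.mean(x * y) - a * b     # E[xy] - E[x] E[y]

print("cov from the definition:", cov_definition)
print("cov from E[xy] - E[x]E[y]:", cov_shortcut)
print("np.cov (bias=True):", np.cov(x, y, bias=True)[0, 1])
```

Passing `bias=True` makes `np.cov` normalise by $N$ rather than $N - 1$, which matches the plain sample averages used in the two hand-written forms.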
Covariance for Vector Valued Variables
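Two random variables $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \mathbb{R}^D$
$\mathrm{cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ (\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T]) \right]$
$= \mathbb{E}_{\mathbf{x},\mathbf{y}}\left[ \mathbf{x}\, \mathbf{y}^T \right] - \mathbb{E}[\mathbf{x}]\, \mathbb{E}[\mathbf{y}^T]$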