Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Outlines
Overview Introduction Linear Algebra Probability Linear Regression 1 Linear Regression 2 Linear Classification 1 Linear Classification 2 Kernel Methods Sparse Kernel Methods Neural Networks 1 Neural Networks 2 Continuous Latent Variables Autoencoders Graphical Models 1 Graphical Models 2 Graphical Models 3 Sampling Mixture Models and EM 1 Mixture Models and EM 2 Sequential Data 1 Sequential Data 2 Combining Models
1of 859
Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group
Data61 | CSIRO
and
College of Engineering and Computer Science
The Australian National University
Canberra
February – June 2017
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
179of 859
Part V
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
180of 859
Linear Regression
N
=
10
x
≡
(
x
1
, . . . ,
xN
)
T
t
≡
(
t
1
, . . . ,
t
N
)
T
xi
∈
R
i
=
1,. . . ,
N
t
i
∈
R
i
=
1,. . . ,
N
0 2 4 6 8 10 x0 1 2 3 4 5 6 7 8
t
Predictor
y
(
x
,
w
)
?
Performance measure?
Optimal solution
w
∗
?
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
181of 859
Probabilities, Losses
Gaussian Distribution
Bayes Rule
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
182of 859
Regression
Given a training data set of
N
observations
{
x
n
}
and target
values
tn
.
Goal : Learn to predict the value of one ore more target
values
t
given a new value of the input
x.
Example: Polynomial curve fitting (see Introduction).
x
t
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
183of 859
Supervised Learning
model with adjustable
parameter w
training
data x
training
targets
t
model with fixed
parameter w*
test
data x
test
target
t
Training Phase
Test Phase
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
184of 859
Linear Basis Function Models
Linear
combination of
fixed
nonlinear basis functions
y
(
x
,
w
) =
M
−
1
X
j
=
0
w
j
φ
j
(
x
) =
w
T
φ
(
x
)
parameter
w
= (
w
0
, . . . ,
w
M
−
1
)
T
basis functions
φ
(
x
) = (
φ
0
(
x
)
, . . . , φ
M
−
1
(
x
))
T
convention
φ
0
(
x
) =
1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
185of 859
Polynomial Basis Functions
Scalar input variable
x
φ
j
(
x
) =
x
j
Limitation : Polynomials are global functions of the input
variable
x
.
Extension: Split the input space into regions and fit a
different polynomial to each region (spline functions).
−1
0
1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
186of 859
’Gaussian’ Basis Functions
Scalar input variable
x
φ
j
(
x
) =
exp
−
(
x
−
µ
j
)
2
2
s
2
Not a probability distribution.
No normalisation required, taken care of by the model
parameters
w
.
−1
0
1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
187of 859
Sigmoidal Basis Functions
Scalar input variable
x
φ
j
(
x
) =
σ
x
−
µ
j
s
where
σ
(
a
)
is the logistic sigmoid function defined by
σ
(
a
) =
1
1
+
exp
(
−
a
)
σ
(
a
)
is related to the
hyperbolic tangent
tanh
(
a
)
by
tanh
(
a
) =
2
σ
(
a
)
−
1.
−1
0
1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
188of 859
Other Basis Functions
Fourier Basis : each basis function represents a specific
frequency and has infinite spatial extent.
Wavelets : localised in both space and frequency (also
mutually orthogonal to simplify appliciation).
Splines (piecewise polynomials restricted to regions of the
input space; additional constraints where pieces meet, e.g.
smoothness constraints
→
conditions on the derivatives).
0 1 2 3 4 5
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0
Linear
Splines
0 1 2 3 4 5
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0
Quadratic
Splines
0 1 2 3 4 5
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0
Cubic
Splines
0 1 2 3 4 5
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0
Quartic
Splines
Approximate the points
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
189of 859
Maximum Likelihood and Least Squares
No special assumption about the basis functions
φ
j
(
x
)
. In
the simplest case, one can think of
φ
j
(
x
) =
x
j
, or
φ
(
x
) =
x.
Assume target
t
is given by
t
=
y
(
x
,
w
)
|
{z
}
deterministic
+
|{z}
noise
where
is a
zero-mean Gaussian
random variable with
precision
(inverse variance)
β
.
Thus
p
(
t
|
x
,
w
, β
) =
N
(
t
|
y
(
x
,
w
)
, β
−
1
)
t
x x0
y(x0)
y(x)
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
190of 859
Maximum Likelihood and Least Squares
Likelihood of one target
t
given the data
x
was
p
(
t
|
x
,
w
, β
) =
N
(
t
|
y
(
x
,
w
)
, β
−
1
)
Now, a
set of inputs
X
with corresponding target values
t
.
Assume data are
independent and identically distributed
(i.i.d.) (means : data are drawn independent and from the
same distribution). The likelihood of the target
t
is then
p
(t
|
X
,
w
, β
) =
N
Y
n
=
1
N
(
t
n
|
y
(
x
n
,
w
)
, β
−
1
)
=
N
Y
n
=
1
N
(
tn
|
w
T
φ
(
x
n
)
, β
−
1
)
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
191of 859
Maximum Likelihood and Least Squares
Consider the
logarithm of the likelihood
p
(t
|
w
, β
)
(the
logarithm is a monoton function! )
ln
p
(t
|
w
, β
) =
N
X
n
=
1
ln
N
(
tn
|
w
T
φ
(
x
n
)
, β
−
1
)
=
N
X
n
=
1
ln
r
β
2
π
exp
−
β
2
(
tn
−
w
T
φ
(
x
n
))
2
!
=
N
2
ln
β
−
N
2
ln
(
2
π
)
−
β
E
D
(
w
)
where the
sum-of-squares error function
is
ED
(
w
) =
1
2
N
X
n
=
1
{
tn
−
w
T
φ
(
xn
)
}
2
.
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
192of 859
Maximum Likelihood and Least Squares
Goal: Find a more compact representation.
Rewrite the
error function
ED
(
w
) =
1
2
N
X
n
=
1
{
tn
−
w
T
φ
(
xn
)
}
2
=
1
2
(t
−
Φ
w
)
T
(t
−
Φ
w
)
where
t
= (
t
1
, . . . ,
tN
)
T
, and
Φ
=
φ
0
(
x
1
)
φ
1
(
x
1
)
. . .
φ
M
−
1
(
x
1
)
φ
0
(
x
2
)
φ
1
(
x
2
)
. . .
φ
M
−
1
(
x
2
)
..
.
..
.
. ..
..
.
φ
0
(
x
N
)
φ
1
(
x
N
)
. . .
φ
M
−
1
(
x
N
)
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
193of 859
Maximum Likelihood and Least Squares
The log likelihood is now
ln
p
(t
|
w
, β
) =
N
2
ln
β
−
N
2
ln
(
2
π
)
−
β
E
D
(
w
)
=
N
2
ln
β
−
N
2
ln
(
2
π
)
−
β
1
2
(t
−
Φ
w
)
T
(t
−
Φ
w
)
Find
critical points
of
ln
p
(t
|
w
, β
)
.
Directional derivative in direction
ξ
D
ln
p
(t
|
w
, β
)(
ξ
) =
β
ξ
T
(
Φ
T
t
−
Φ
T
Φ
w
)
shall be zero in all directions
ξ. Therefore
0
=
Φ
T
t
−
Φ
T
Φ
w
,
which results in
w
ML
= (
Φ
T
Φ
)
−
1
Φ
T
t
=
Φ
†
t
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
194of 859
Maximum Likelihood and Least Squares
Found a critical point
w
ML
for
ln
p
(t
|
w
, β
)
. Is it a maximum,
a minimum, or a saddle point?
Calculate the second directional derivative
D
2
ln
p
(t
|
w
, β
)(
ξ
,
ξ
) =
−
β
ξ
T
Φ
T
Φ
ξ
=
−
β
(
Φ
ξ
)
T
Φ
ξ
=
−
β
k
Φ
ξ
k
2
≤
0
We found a maximum.
Aside: Is it enough to check the second directional
derivative in two directions
ξ
which are the same? Yes,
because for any bilinear function
g
(
ξ
,
η
)
, symmetric in its
arguments,
1
2
[
g
(
ξ
,
ξ
) +
g
(
η
,
η
)
−
g
(
ξ
−
η
,
ξ
−
η
)]
=
1
2
[
g
(
ξ
,
ξ
) +
g
(
η
,
η
)
−
g
(
ξ
,
ξ
)
−
g
(
η
,
η
) +
g
(
ξ
,
η
) +
g
(
η
,
ξ
)]
=
1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review
Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
195of 859
Maximum Likelihood and Least Squares
The log likelihood with the optimal
w
ML
is now
ln
p
(t
|
w
ML
, β
)
=
N
2
ln
β
−
N
2
ln
(
2
π
)
−
β
1
2
(t
−
Φ
w
ML
)
T
(t
−
Φ
w
ML
)
Find critical points of
ln
p
(t
|
w
, β
)
wrt
β
,
∂
ln
p
(t
|
w
ML
, β
)
∂β
=
0
results in
1
β
ML
=
1
N
(t
−
Φ
w
ML
)
T
(t
−
Φ
w
ML
)
Note: We can first find the maximum likelihood for
w
as
this does
not depend
on
β
. Then we can use
w
ML
to find
the maximum likelihood solution for
β
.
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares
Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
196of 859
Geometry of Least Squares
Target vector
t
= (
t
1
, . . . ,
tN
)
T
∈
R
N
Basis vectors
ϕ
j
= (
Φ
j
1
, . . . ,
Φ
jN
)
T
= (
φ
j
(
x
1
)
, . . . , φ
j
(
x
N
))
T
∈
R
N
span a
subspace
S
Find
w
such that
y
= (
y
(
x
1
,
w
)
, . . . ,
y
(
x
N
,
w
))
T
∈ S
is closest
to
t
.
S
t
y
ϕ
1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares
Geometry of Least Squares
Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
197of 859
Sequential Learning - Stochastic Gradient
Descent
For
large data sets, calculating the maximum likelihood
parameters
w
ML
and
β
ML
may be costly.
For
online
applications, never all data in memory.
Use a
sequential
algorithms (online
algorithm).
If the error function is a sum over data points
E
=
P
n
En
,
then
1
initialise
w
(0)to some starting value
2
update the parameter vector at iteration
τ
+
1
by
w
(τ+1)=
w
(τ)−
η
∇
E
n,Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares
Geometry of Least Squares
Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
198of 859
Sequential Learning - Stochastic Gradient
Descent
For the sum-of-squares error function, stochastic gradient
descent results in
w
(
τ
+
1
)
=
w
(
τ
)
−
η
t
n
−
w
(
τ
)
T
φ
(
x
n
)
φ
(
x
n
)
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares
Sequential Learning
Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
199of 859
Regularized Least Squares
Add regularisation in order to prevent overfitting
ED
(
w
) +
λ
EW
(
w
)
with regularisation coefficient
λ
.
Simple quadratic regulariser
EW
(
w
) =
1
2
w
T
w
Maximum likelihood solution
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares
Sequential Learning
Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
200of 859
Regularized Least Squares
More general regulariser
EW
(
w
) =
1
2
M
X
j
=
1
|
wj
|
q
q
=
1
(lasso) leads to a sparse model if
λ
large enough.
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares
Sequential Learning
Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
201of 859
Comparison of Quadratic and Lasso Regulariser
Assume a sufficiently large regularisation coefficient
λ
.
Quadratic regulariser
1
2
M
X
j
=
1
w
2
j
w
1w
2w
?Lasso regulariser
1
2
M
X
j
=
1
|
wj
|
w
1w
2Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning
Regularized Least Squares
Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
202of 859
Multiple Outputs
More than
1
target variable per data point.
y
becomes a vector instead of a scalar. Each dimension
can be treated with a different set of basis functions (and
that may be necessary if the data in the different target
dimensions represent very different types of information.)
Here we restrict ourselves to the SAME basis functions
y
(
x
,
w
) =
W
T
φ
(
x
)
where
y
is a
K
-dimensional column vector,
W
is an
M
×
K
matrix of model parameters, and
φ
(
x
) = (
φ
0
(
x
)
, . . . , φ
M
−
1
(
x
)
,
φ
0
(
x
) =
1, as before.
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning
Regularized Least Squares
Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
203of 859
Multiple Outputs
Suppose the conditional distribution of the target vector is
an isotropic Gaussian of the form
p
(
t
|
x
,
W
, β
) =
N
(
t
|
W
T
φ
(
x
)
, β
−
1
I
)
.
The log likelihood is then
ln
p
(
T
|
X
,
W
, β
) =
N
X
n
=
1
ln
N
(
t
n
|
W
T
φ
(
x
n
)
, β
−
1
VI
)
=
NK
2
ln
β
2
π
−
β
2
N
X
n
=
1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning
Regularized Least Squares
Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition
204of 859
Multiple Outputs
Maximisation with respect to
W
results in
W
ML
= (
Φ
T
Φ
)
−
1
Φ
T
T
.
For each target variable
t
k
, we get
w
k
= (
Φ
T
Φ
)
−
1
Φ
T
t
k
=
Φ
†
t
k
.
The solution between the different target variables
decouples.
Holds also for a general Gaussian noise distribution with
arbitrary covariance matrix.
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares
Multiple Outputs
Loss Function for Regression The Bias-Variance Decomposition
205of 859
Loss Function for Regression
Over-fitting results from a large number of basis functions
and a relatively small training set.
Regularisation can prevent overfitting, but how to find the
correct value for the regularisation constant
λ
?
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares
Multiple Outputs
Loss Function for Regression The Bias-Variance Decomposition
206of 859
Loss Function for Regression
Choose an estimator
y
(
x
)
to estimate the target value
t
for
each input
x.
Choose a loss function
L
(
t
,
y
(
x
))
which measures the
difference between the target
t
and the estimate
y
(
x
)
.
The
expected loss
is then
E
[
L
] =
Z
Z
L
(
t
,
y
(
x
))
p
(
x
,
t
)
dx
d
t
Common choice:
Squared Loss
L
(
t
,
y
(
x
)) =
{
y
(
x
)
−
t
}
2
.
Expected loss for squared loss function
E
[
L
] =
Z
Z
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares
Multiple Outputs
Loss Function for Regression The Bias-Variance Decomposition
207of 859
Loss Function for Regression
Expected loss for squared loss function
E
[
L
] =
Z
Z
{
y
(
x
)
−
t
}
2
p
(
x
,
t
)
dx
d
t
.
Minimise
E
[
L
]
by choosing the regression function
y
(
x
) =
R
t p
(
x
,
t
)
d
t
p
(
x
)
=
Z
t p
(
t
|
x
)
d
t
=
E
t
[
t
|
x
]
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares
Multiple Outputs
Loss Function for Regression The Bias-Variance Decomposition
208of 859
Loss Function for Regression
The regression function which minimises the expected
squared loss, is given by the mean of the conditional
distribution
p
(
t
|
x
)
.
t
x
x
0y
(
x
0)
y
(
x
)
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares
Multiple Outputs
Loss Function for Regression The Bias-Variance Decomposition
209of 859
Loss Function for Regression
Analyse the expected loss
E
[
L
] =
Z
Z
{
y
(
x
)
−
t
}
2
p
(
x
,
t
)
dx
d
t
.
Rewrite the squared loss
{
y
(
x
)
−
t
}
2
=
{
y
(
x
)
−
E
[
t
|
x
] +
E
[
t
|
x
]
−
t
}
2
=
{
y
(
x
)
−
E
[
t
|
x
]
}
2
+
{
E
[
t
|
x
]
−
t
}
2
+
2
{
y
(
x
)
−
E
[
t
|
x
]
} {E
[
t
|
x
]
−
t
}
Claim
Z
Z
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares
Multiple Outputs
Loss Function for Regression The Bias-Variance Decomposition
210of 859
Loss Function for Regression
Claim
Z
Z
{
y
(
x
)
−
E
[
t
|
x
]
} {E
[
t
|
x
]
−
t
}
p
(
x
,
t
)
dx
d
t
=
0
.
Seperate functions depending on
t
from function
depending on
x
Z
{
y
(
x
)
−
E
[
t
|
x
]
}
Z
{
E
[
t
|
x
]
−
t
}
p
(
x
,
t
)
d
t
dx
Calculate the integral over
t
Z
{E
[
t
|
x
]
−
t
}
p
(
x
,
t
)
d
t
=
E
[
t
|
x
]
p
(
x
)
−
p
(
x
)
Z
t p
(
x
,
t
)
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares
Multiple Outputs
Loss Function for Regression The Bias-Variance Decomposition
211of 859
Loss Function for Regression
The expected loss is now
E
[
L
] =
Z
{
y
(
x
)
−
E
[
t
|
x
]
}
2
p
(
x
)
dx
+
Z
var
[
t
|
x
]
p
(
x
)
dx
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
212of 859
The Bias-Variance Decomposition
Consider now the dependency on the data set
D
.
Prediction function now
y
(
x
;
D
)
.
Consider again squared loss for which the optimal
prediction is given by the conditional expectation
h
(
x
)
h
(
x
) =
E
[
t
|
x
] =
Z
t p
(
t
|
x
)
d
t
.
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
213of 859
The Bias-Variance Decomposition
Taking the expectation over all data sets
D
E
D
[
E
[
L
]] =
Z
E
D
{
y
(
x
;
D
)
−
h
(
x
)
}
2
p
(
x
)
dx
+
Z
Z
{
h
(
x
)
−
t
}
2
p
(
x
,
t
)
dx
d
t
Again, add and subtract the expectation
E
D
[
y
(
x
;
D
)]
{
y
(
x
;
D
)
−
h
(
x
)
}
2
=
{
y
(
x
;
D
)
−
E
D
[
y
(
x
;
D
)]
+
E
D
[
y
(
x
;
D
)]
−
h
(
x
)
}
2
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
214of 859
The Bias-Variance Decomposition
Expected loss
E
D
[
L
]
over all data sets
D
expected loss
= (
bias
)
2
+
variance
+
noise
.
where
(
bias
)
2
=
Z
{E
D
[
y
(
x
;
D
)]
−
h
(
x
)
}
2
p
(
x
)
dx
variance
=
Z
E
D
h
{
y
(
x
;
D
)
−
E
D
[
y
(
x
;
D
)]
}
2
i
p
(
x
)
dx
noise
=
Z
Z
{
h
(
x
)
−
t
}
2
p
(
x
,
t
)
dx
d
t
.
variance
: How sensitive is the model to small changes in
the training set? (How much do solutions for individual
data sets vary around their average ?
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
215of 859
The Bias-Variance Decomposition
Simple
models have
low variance
and
high bias.
x
t lnλ= 2.6
0 1
−1 0 1
x t
0 1
−1 0 1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
216of 859
The Bias-Variance Decomposition
Dependence of bias and variance on the model complexity
x
t lnλ=−0.31
0 1
−1 0 1
x t
0 1
−1 0 1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
217of 859
The Bias-Variance Decomposition
Complex
models have
high variance
and
low bias.
x
t lnλ=−2.4
0 1
−1 0 1
x t
0 1
−1 0 1
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
218of 859
The Bias-Variance Decomposition
Squared bias, variance, their sum, and test data
The minimum for
(
bias
)
2
+
variance occurs close to the
value that gives the minimum error
ln
λ
−3
−2
−1
0
1
2
0
0.03
0.06
0.09
0.12
0.15
Introduction to Statistical Machine Learning
c
2016 Ong & Walder Data61 | CSIRO The Australian National
University
Review Linear Basis Function Models
Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs
Loss Function for Regression
The Bias-Variance Decomposition
219of 859
The Bias-Variance Decomposition
Tradeoff between bias and variance
simple models have low variance and high bias
complex models have high variance and low bias
The sum of bias and variance has a minimum at a certain
model complexity.
Expected loss
E
D
[
L
]
over all data sets
D
expected loss
= (
bias
)
2
+
variance
+
noise
.
The noise comes from the data, and can not be removed
from the expected loss.