
Introduction to Statistical Machine Learning

© 2016 Ong & Walder, Data61 | CSIRO, The Australian National University

Outline

Overview, Introduction, Linear Algebra, Probability, Linear Regression 1, Linear Regression 2, Linear Classification 1, Linear Classification 2, Kernel Methods, Sparse Kernel Methods, Neural Networks 1, Neural Networks 2, Continuous Latent Variables, Autoencoders, Graphical Models 1, Graphical Models 2, Graphical Models 3, Sampling, Mixture Models and EM 1, Mixture Models and EM 2, Sequential Data 1, Sequential Data 2, Combining Models

Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group

Data61 | CSIRO

and

College of Engineering and Computer Science

The Australian National University

Canberra

February – June 2017


Part V: Linear Regression 1

Contents: Review, Linear Basis Function Models, Maximum Likelihood and Least Squares, Geometry of Least Squares, Sequential Learning, Regularized Least Squares, Multiple Outputs, Loss Function for Regression, The Bias-Variance Decomposition


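Linear Regression

$N = 10$ observations with inputs $\mathbf{x} \equiv (x_1, \dots, x_N)^T$ and targets $\mathbf{t} \equiv (t_1, \dots, t_N)^T$, where $x_i \in \mathbb{R}$ and $t_i \in \mathbb{R}$ for $i = 1, \dots, N$.

[Figure: targets t (axis 0 to 8) plotted against inputs x (axis 0 to 10).]

Predictor $y(x, \mathbf{w})$?
Performance measure?
Optimal solution $\mathbf{w}^*$?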


Probabilities, Losses

Gaussian Distribution

Bayes Rule


Regression

Given a training data set of $N$ observations $\{\mathbf{x}_n\}$ and target values $t_n$.
Goal: Learn to predict the value of one or more target values $t$ given a new value of the input $\mathbf{x}$.
Example: Polynomial curve fitting (see Introduction).

[Figure: targets t plotted against inputs x.]


Supervised Learning

Training phase: training data x and training targets t are fed to a model with adjustable parameter w.
Test phase: test data x are fed to the model with fixed parameter w*, which outputs the test target t.


Linear Basis Function Models

Linear combination of fixed nonlinear basis functions
$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})$$
parameter $\mathbf{w} = (w_0, \dots, w_{M-1})^T$
basis functions $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x}))^T$
convention $\phi_0(\mathbf{x}) = 1$


Polynomial Basis Functions

Scalar input variable $x$
$$\phi_j(x) = x^j$$
Limitation: Polynomials are global functions of the input variable $x$.
Extension: Split the input space into regions and fit a different polynomial to each region (spline functions).

[Figure: polynomial basis functions plotted for x between -1 and 1.]
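As a minimal sketch of the polynomial basis, the snippet below builds the features $\phi_j(x) = x^j$ with NumPy and evaluates $y(x, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(x)$. The function names and the weight values are illustrative assumptions, not part of the slides.

```python
import numpy as np

def polynomial_features(x, M):
    """Return the N x M matrix with entries phi_j(x_n) = x_n ** j, j = 0..M-1."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=M, increasing=True)

def predict(x, w):
    """Evaluate y(x, w) = sum_j w_j * x ** j for each input in x."""
    w = np.asarray(w, dtype=float)
    return polynomial_features(x, len(w)) @ w

x_demo = np.array([-1.0, 0.0, 0.5, 1.0])
w_demo = np.array([1.0, -2.0, 3.0])   # hypothetical weights: y = 1 - 2x + 3x^2
y_demo = predict(x_demo, w_demo)
```

The first column of the feature matrix is all ones, which matches the convention $\phi_0(x) = 1$.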


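'Gaussian' Basis Functions

Scalar input variable $x$
$$\phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2 s^2} \right\}$$
Not a probability distribution.
No normalisation required, taken care of by the model parameters $\mathbf{w}$.

[Figure: 'Gaussian' basis functions plotted for x between -1 and 1.]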
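A small sketch of how such features could be computed, assuming centres $\mu_j$ on an evenly spaced grid and a shared width $s$ (both choices are illustrative, not prescribed by the slides):

```python
import numpy as np

def gaussian_features(x, centres, s):
    """N x M matrix with entries exp(-(x_n - mu_j)^2 / (2 s^2))."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (N, 1)
    mu = np.asarray(centres, dtype=float)[None, :]   # shape (1, M)
    return np.exp(-((x - mu) ** 2) / (2.0 * s ** 2))

centres_demo = np.linspace(-1.0, 1.0, 5)             # hypothetical centres mu_j
x_grid = np.linspace(-1.0, 1.0, 9)
Phi_gauss = gaussian_features(x_grid, centres_demo, s=0.3)
```

Each basis function peaks at 1 at its own centre and is not normalised; a constant column for $\phi_0(x) = 1$ would have to be added separately.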


Sigmoidal Basis Functions

Scalar input variable $x$
$$\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$$
where $\sigma(a)$ is the logistic sigmoid function defined by
$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$
$\sigma(a)$ is related to the hyperbolic tangent $\tanh(a)$ by $\tanh(a) = 2\sigma(2a) - 1$.

[Figure: sigmoidal basis functions plotted for x between -1 and 1.]
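The following sketch implements the logistic sigmoid and the sigmoidal features, and checks numerically the identity $\tanh(a) = 2\sigma(2a) - 1$ on a grid. Function names and the grid are assumptions made for illustration.

```python
import numpy as np

def logistic_sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoidal_features(x, centres, s):
    """N x M matrix with entries sigma((x_n - mu_j) / s)."""
    x = np.asarray(x, dtype=float)[:, None]
    mu = np.asarray(centres, dtype=float)[None, :]
    return logistic_sigmoid((x - mu) / s)

a_grid = np.linspace(-4.0, 4.0, 81)
# Largest deviation between tanh(a) and 2 * sigma(2a) - 1 on the grid.
tanh_gap = np.max(np.abs(np.tanh(a_grid) - (2.0 * logistic_sigmoid(2.0 * a_grid) - 1.0)))
```

Because of this identity, a linear combination of sigmoids spans the same functions as a linear combination of tanh units plus a constant.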


Other Basis Functions

Fourier basis: each basis function represents a specific frequency and has infinite spatial extent.
Wavelets: localised in both space and frequency (also mutually orthogonal to simplify application).
Splines: piecewise polynomials restricted to regions of the input space, with additional constraints where the pieces meet (e.g. smoothness constraints, i.e. conditions on the derivatives).

[Figure: four panels (linear, quadratic, cubic, and quartic splines) approximating the points; horizontal axis 0 to 5, vertical axis -2.0 to 1.0.]


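Maximum Likelihood and Least Squares

No special assumption about the basis functions $\phi_j(\mathbf{x})$. In the simplest case, one can think of $\phi_j(\mathbf{x}) = x_j$, or $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{x}$.
Assume target $t$ is given by
$$t = \underbrace{y(\mathbf{x}, \mathbf{w})}_{\text{deterministic}} + \underbrace{\epsilon}_{\text{noise}}$$
where $\epsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$.
Thus
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$$

[Figure: the curve y(x) plotted as t against x, with the value y(x0) marked at the input x0.]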
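To make the noise model concrete, the sketch below draws targets $t = y(x, \mathbf{w}) + \epsilon$ with Gaussian noise of precision $\beta$ (standard deviation $1/\sqrt{\beta}$) and checks that the empirical precision of the residuals is close to $\beta$. The linear $y$ and all numeric values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 25.0                         # precision; the noise std is 1 / sqrt(beta) = 0.2
w_true = np.array([0.5, 2.0])       # hypothetical parameters: y = 0.5 + 2x

x_obs = rng.uniform(0.0, 1.0, size=20000)
y_obs = w_true[0] + w_true[1] * x_obs                       # deterministic part
t_obs = y_obs + rng.normal(0.0, 1.0 / np.sqrt(beta), size=x_obs.shape)

residual = t_obs - y_obs                                    # the noise epsilon
empirical_precision = 1.0 / residual.var()
```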


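Maximum Likelihood and Least Squares

Likelihood of one target $t$ given the data $\mathbf{x}$ was
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$$
Now consider a set of inputs $\mathbf{X}$ with corresponding target values $\mathbf{t}$.
Assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the targets $\mathbf{t}$ is then
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$$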


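Maximum Likelihood and Least Squares

Consider the logarithm of the likelihood $p(\mathbf{t} \mid \mathbf{w}, \beta)$ (the logarithm is a monotonic function!)
$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$$
$$= \sum_{n=1}^{N} \ln \left( \sqrt{\frac{\beta}{2\pi}} \exp\left\{ -\frac{\beta}{2} (t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n))^2 \right\} \right)$$
$$= \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w})$$
where the sum-of-squares error function is
$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \}^2.$$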
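As a numerical check of this derivation, the sketch below evaluates the log likelihood in two ways: through the closed form $\frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$, and by summing the log Gaussian densities directly. The data, weights, and function names are made up for illustration.

```python
import numpy as np

def sum_of_squares_error(w, Phi, t):
    """E_D(w) = 1/2 * sum_n (t_n - w^T phi(x_n))^2, with phi(x_n)^T as rows of Phi."""
    r = t - Phi @ w
    return 0.5 * np.sum(r ** 2)

def log_likelihood(w, beta, Phi, t):
    """Closed form: N/2 ln(beta) - N/2 ln(2 pi) - beta * E_D(w)."""
    N = len(t)
    return 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi) - beta * sum_of_squares_error(w, Phi, t)

def log_likelihood_direct(w, beta, Phi, t):
    """Sum of log Gaussian densities with mean w^T phi(x_n) and variance 1/beta."""
    r = t - Phi @ w
    return np.sum(0.5 * np.log(beta / (2 * np.pi)) - 0.5 * beta * r ** 2)

rng = np.random.default_rng(1)
Phi_demo = np.column_stack([np.ones(6), rng.normal(size=6)])   # arbitrary features
t_demo = rng.normal(size=6)
w_demo = np.array([0.3, -1.2])
```

Since only the term $-\beta E_D(\mathbf{w})$ depends on $\mathbf{w}$, maximising the likelihood over $\mathbf{w}$ is the same as minimising the sum-of-squares error.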


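Maximum Likelihood and Least Squares

Goal: Find a more compact representation.
Rewrite the error function
$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \}^2 = \frac{1}{2} (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})^T (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})$$
where $\mathbf{t} = (t_1, \dots, t_N)^T$, and
$$\boldsymbol{\Phi} = \begin{pmatrix}
\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \dots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \dots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \dots & \phi_{M-1}(\mathbf{x}_N)
\end{pmatrix}$$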
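A brief sketch of the design matrix in code: row $n$ holds $\boldsymbol{\phi}(\mathbf{x}_n)^T$, and the sum form and the matrix form of $E_D(\mathbf{w})$ give the same number. The basis functions, data, and weights are hypothetical.

```python
import numpy as np

def design_matrix(x, basis_functions):
    """Stack phi_j(x_n) into an N x M matrix: row n is phi(x_n)^T."""
    return np.column_stack([phi(x) for phi in basis_functions])

# Hypothetical basis: constant, linear, and quadratic terms.
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
t_demo = np.array([1.0, 0.0, 2.0, 5.0])
w_demo = np.array([0.5, -1.0, 0.5])

Phi = design_matrix(x_demo, basis)

# Sum over data points ...
E_sum = 0.5 * sum((t_n - w_demo @ Phi[n]) ** 2 for n, t_n in enumerate(t_demo))
# ... equals the compact matrix form 1/2 (t - Phi w)^T (t - Phi w).
r = t_demo - Phi @ w_demo
E_matrix = 0.5 * r @ r
```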


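Maximum Likelihood and Least Squares

The log likelihood is now
$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w})$$
$$= \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta \frac{1}{2} (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})^T (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})$$
Find critical points of $\ln p(\mathbf{t} \mid \mathbf{w}, \beta)$.
The directional derivative in direction $\boldsymbol{\xi}$
$$D \ln p(\mathbf{t} \mid \mathbf{w}, \beta)(\boldsymbol{\xi}) = \beta \boldsymbol{\xi}^T (\boldsymbol{\Phi}^T \mathbf{t} - \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w})$$
shall be zero in all directions $\boldsymbol{\xi}$. Therefore
$$0 = \boldsymbol{\Phi}^T \mathbf{t} - \boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w},$$
which results in
$$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t} = \boldsymbol{\Phi}^{\dagger} \mathbf{t}$$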
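The sketch below computes $\mathbf{w}_{ML}$ in three equivalent ways with NumPy: solving the normal equations, applying the pseudo-inverse, and calling a least-squares solver. The synthetic data and the polynomial basis are assumptions for illustration; in practice a least-squares solver is usually preferred over forming $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ explicitly, for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=50)
Phi = np.vander(x, N=4, increasing=True)            # polynomial basis, M = 4
w_true = np.array([0.5, -1.0, 2.0, 0.3])            # hypothetical ground truth
t = Phi @ w_true + rng.normal(0.0, 0.1, size=50)

w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)  # normal equations
w_pinv = np.linalg.pinv(Phi) @ t                    # Moore-Penrose pseudo-inverse
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # least-squares solver

# The critical-point condition Phi^T t - Phi^T Phi w = 0 holds at the solution.
gradient_at_solution = Phi.T @ t - Phi.T @ Phi @ w_normal
```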


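Maximum Likelihood and Least Squares

Found a critical point $\mathbf{w}_{ML}$ for $\ln p(\mathbf{t} \mid \mathbf{w}, \beta)$. Is it a maximum, a minimum, or a saddle point?
Calculate the second directional derivative
$$D^2 \ln p(\mathbf{t} \mid \mathbf{w}, \beta)(\boldsymbol{\xi}, \boldsymbol{\xi}) = -\beta \boldsymbol{\xi}^T \boldsymbol{\Phi}^T \boldsymbol{\Phi} \boldsymbol{\xi} = -\beta (\boldsymbol{\Phi}\boldsymbol{\xi})^T \boldsymbol{\Phi}\boldsymbol{\xi} = -\beta \| \boldsymbol{\Phi}\boldsymbol{\xi} \|^2 \le 0$$
We found a maximum.
Aside: Is it enough to check the second directional derivative with both direction arguments equal to the same $\boldsymbol{\xi}$? Yes, because for any bilinear function $g(\boldsymbol{\xi}, \boldsymbol{\eta})$, symmetric in its arguments,
$$\frac{1}{2} [ g(\boldsymbol{\xi}, \boldsymbol{\xi}) + g(\boldsymbol{\eta}, \boldsymbol{\eta}) - g(\boldsymbol{\xi} - \boldsymbol{\eta}, \boldsymbol{\xi} - \boldsymbol{\eta}) ]$$
$$= \frac{1}{2} [ g(\boldsymbol{\xi}, \boldsymbol{\xi}) + g(\boldsymbol{\eta}, \boldsymbol{\eta}) - g(\boldsymbol{\xi}, \boldsymbol{\xi}) - g(\boldsymbol{\eta}, \boldsymbol{\eta}) + g(\boldsymbol{\xi}, \boldsymbol{\eta}) + g(\boldsymbol{\eta}, \boldsymbol{\xi}) ]$$
$$= \frac{1}{2} [ g(\boldsymbol{\xi}, \boldsymbol{\eta}) + g(\boldsymbol{\eta}, \boldsymbol{\xi}) ] = g(\boldsymbol{\xi}, \boldsymbol{\eta}).$$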
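Both statements can be checked numerically. In the sketch below, the Hessian of the log likelihood with respect to $\mathbf{w}$ is $-\beta \boldsymbol{\Phi}^T\boldsymbol{\Phi}$, its eigenvalues are non-positive, and the polarisation identity recovers $g(\boldsymbol{\xi}, \boldsymbol{\eta})$ from values of the form $g(\mathbf{v}, \mathbf{v})$ only. The random matrix and vectors are arbitrary illustrative inputs.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(8, 3))
beta = 2.0
H = -beta * Phi.T @ Phi               # Hessian of ln p(t | w, beta) w.r.t. w

def g(xi, eta):
    """Symmetric bilinear form g(xi, eta) = xi^T H eta."""
    return xi @ H @ eta

xi, eta = rng.normal(size=3), rng.normal(size=3)

# Polarisation: g(xi, eta) from "diagonal" evaluations g(v, v) only.
polarised = 0.5 * (g(xi, xi) + g(eta, eta) - g(xi - eta, xi - eta))
largest_eigenvalue = np.linalg.eigvalsh(H).max()
```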


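Maximum Likelihood and Least Squares

The log likelihood with the optimal $\mathbf{w}_{ML}$ is now
$$\ln p(\mathbf{t} \mid \mathbf{w}_{ML}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta \frac{1}{2} (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}_{ML})^T (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}_{ML})$$
Find critical points of $\ln p(\mathbf{t} \mid \mathbf{w}, \beta)$ with respect to $\beta$,
$$\frac{\partial \ln p(\mathbf{t} \mid \mathbf{w}_{ML}, \beta)}{\partial \beta} = 0$$
results in
$$\frac{1}{\beta_{ML}} = \frac{1}{N} (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}_{ML})^T (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}_{ML})$$
Note: We can first find the maximum likelihood solution for $\mathbf{w}$, as this does not depend on $\beta$. Then we can use $\mathbf{w}_{ML}$ to find the maximum likelihood solution for $\beta$.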
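The two-step procedure can be sketched as follows: fit $\mathbf{w}_{ML}$ first, then estimate $1/\beta_{ML}$ as the mean squared residual. The synthetic data and the true precision used to generate them are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, beta_true = 5000, 16.0                       # hypothetical precision; noise std = 0.25
x = rng.uniform(0.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])
t = Phi @ np.array([1.0, -2.0]) + rng.normal(0.0, beta_true ** -0.5, size=N)

# Step 1: w_ML does not depend on beta.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Step 2: 1 / beta_ML is the average squared residual at w_ML.
residual = t - Phi @ w_ml
inv_beta_ml = residual @ residual / N
beta_ml = 1.0 / inv_beta_ml
```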


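Geometry of Least Squares

Target vector $\mathbf{t} = (t_1, \dots, t_N)^T \in \mathbb{R}^N$
Basis vectors $\boldsymbol{\varphi}_j = (\Phi_{1j}, \dots, \Phi_{Nj})^T = (\phi_j(\mathbf{x}_1), \dots, \phi_j(\mathbf{x}_N))^T \in \mathbb{R}^N$ (the columns of $\boldsymbol{\Phi}$) span a subspace $\mathcal{S}$
Find $\mathbf{w}$ such that $\mathbf{y} = (y(\mathbf{x}_1, \mathbf{w}), \dots, y(\mathbf{x}_N, \mathbf{w}))^T \in \mathcal{S}$ is closest to $\mathbf{t}$.

[Figure: the subspace S with the target vector t, the vector y, and the basis vector φ_1.]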
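Geometrically, the closest point $\mathbf{y} \in \mathcal{S}$ is the orthogonal projection of $\mathbf{t}$ onto the span of the basis vectors, so the residual $\mathbf{t} - \mathbf{y}$ is orthogonal to every column of $\boldsymbol{\Phi}$. A small sketch with arbitrary random vectors (an illustrative setup, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
Phi = rng.normal(size=(6, 2))        # two basis vectors in R^6 (columns of Phi)
t = rng.normal(size=6)

w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w                          # point in S closest to t
residual = t - y                     # orthogonal to the subspace S

# Any other weight vector gives a point in S that is farther from t.
w_other = w + np.array([0.1, -0.2])
distance_other = np.linalg.norm(t - Phi @ w_other)
```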


Sequential Learning - Stochastic Gradient Descent

For large data sets, calculating the maximum likelihood parameters $\mathbf{w}_{ML}$ and $\beta_{ML}$ may be costly.
For online applications, the data are never all in memory at once.
Use a sequential (online) algorithm.
If the error function is a sum over data points $E = \sum_n E_n$, then
1. initialise $\mathbf{w}^{(0)}$ to some starting value
2. update the parameter vector at iteration $\tau + 1$ by
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n,$$
where $\eta$ is the learning rate.

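Sequential Learning - Stochastic Gradient Descent

For the sum-of-squares error function, stochastic gradient descent results in
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \left( t_n - \mathbf{w}^{(\tau)T} \boldsymbol{\phi}(\mathbf{x}_n) \right) \boldsymbol{\phi}(\mathbf{x}_n)$$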
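A minimal sketch of this update rule in code. The step size, the number of passes over the data, and the random visiting order are assumptions chosen for illustration; the targets are noise-free so that the iterates converge to the exact weights.

```python
import numpy as np

def lms_sgd(Phi, t, eta=0.05, epochs=200, seed=0):
    """Sequential updates w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])                 # step 1: starting value w^(0)
    for _ in range(epochs):
        for n in rng.permutation(len(t)):      # visit data points in random order
            error = t[n] - w @ Phi[n]
            w = w + eta * error * Phi[n]       # step 2: one update per data point
    return w

x = np.linspace(-1.0, 1.0, 40)
Phi = np.column_stack([np.ones_like(x), x])
t = 0.5 + 2.0 * x                              # noise-free targets (illustrative)
w_sgd = lms_sgd(Phi, t)
```

Each update touches a single row of the design matrix, so the full data set never has to be held in memory at once. With noisy targets and a fixed step size the iterates would fluctuate around the least-squares solution rather than settle on it exactly.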


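Regularized Least Squares

Add regularisation in order to prevent overfitting
$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
with regularisation coefficient $\lambda$.
Simple quadratic regulariser
$$E_W(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$$
Maximum likelihood solution
$$\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$$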
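Setting the gradient of $E_D(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$ to zero gives $\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$. The sketch below computes this solution and compares it with the unregularised fit; the sinusoidal data, the polynomial basis, and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """Minimiser of E_D(w) + (lam / 2) * w^T w."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(6)
x = rng.uniform(-1.0, 1.0, size=15)
Phi = np.vander(x, N=8, increasing=True)        # 8 polynomial features, 15 points
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=15)

w_unreg, *_ = np.linalg.lstsq(Phi, t, rcond=None)
w_ridge = ridge_solution(Phi, t, lam=1.0)

# Gradient of the regularised error at w_ridge (should vanish).
gradient = -Phi.T @ (t - Phi @ w_ridge) + 1.0 * w_ridge
```

The regulariser shrinks the weight vector: the norm of `w_ridge` is smaller than that of `w_unreg`.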


Regularized Least Squares

More general regulariser
$$E_W(\mathbf{w}) = \frac{1}{2} \sum_{j=1}^{M} |w_j|^q$$
$q = 1$ (lasso) leads to a sparse model if $\lambda$ is large enough.


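Comparison of Quadratic and Lasso Regulariser

Assume a sufficiently large regularisation coefficient $\lambda$.
Quadratic regulariser
$$\frac{1}{2} \sum_{j=1}^{M} w_j^2$$
Lasso regulariser
$$\frac{1}{2} \sum_{j=1}^{M} |w_j|$$

[Figure: two panels in the (w_1, w_2) plane, one per regulariser, with the solution w* marked.]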
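To see why the lasso produces sparse solutions, consider the special case (an assumption made here for illustration) in which $\boldsymbol{\Phi}^T\boldsymbol{\Phi} = \mathbf{I}$, so the problem decouples per weight: with $a_j$ the unregularised solution, each weight minimises $\frac{1}{2}(w - a_j)^2$ plus the penalty. The quadratic penalty $\frac{\lambda}{2} w^2$ rescales $a_j$ by $1/(1+\lambda)$, while the lasso penalty $\frac{\lambda}{2}|w|$ shifts it towards zero by $\lambda/2$ and clips it at exactly zero (soft thresholding).

```python
import numpy as np

def quadratic_shrink(a, lam):
    """argmin_w 1/2 (w - a)^2 + (lam / 2) w^2  =  a / (1 + lam)."""
    return a / (1.0 + lam)

def lasso_shrink(a, lam):
    """argmin_w 1/2 (w - a)^2 + (lam / 2) |w|  =  soft threshold at lam / 2."""
    return np.sign(a) * np.maximum(np.abs(a) - lam / 2.0, 0.0)

a = np.array([3.0, -0.4, 0.2, -2.0])     # unregularised weights (illustrative values)
lam = 1.0
w_quadratic = quadratic_shrink(a, lam)   # every weight shrinks, none becomes zero
w_lasso = lasso_shrink(a, lam)           # small weights are set exactly to zero
```

In this example the two weights with magnitude below $\lambda/2$ vanish under the lasso, whereas the quadratic regulariser only scales them down.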

Multiple Outputs

More than 1 target variable per data point.
$y$ becomes a vector instead of a scalar. Each dimension can be treated with a different set of basis functions (and that may be necessary if the data in the different target dimensions represent very different types of information).
Here we restrict ourselves to the SAME basis functions
$$\mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x})$$
where $\mathbf{y}$ is a $K$-dimensional column vector, $\mathbf{W}$ is an $M \times K$ matrix of model parameters, and $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x}))^T$, with $\phi_0(\mathbf{x}) = 1$, as before.


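Multiple Outputs

Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form
$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}), \beta^{-1} \mathbf{I}).$$
The log likelihood is then
$$\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{t}_n \mid \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1} \mathbf{I})$$
$$= \frac{NK}{2} \ln \frac{\beta}{2\pi} - \frac{\beta}{2} \sum_{n=1}^{N} \| \mathbf{t}_n - \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}_n) \|^2$$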


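Multiple Outputs

Maximisation with respect to $\mathbf{W}$ results in
$$\mathbf{W}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{T}.$$
For each target variable $\mathbf{t}_k$, we get
$$\mathbf{w}_k = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}_k = \boldsymbol{\Phi}^{\dagger} \mathbf{t}_k.$$
The solutions for the different target variables decouple.
This also holds for a general Gaussian noise distribution with arbitrary covariance matrix.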
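The decoupling can be verified directly: solving for all $K$ targets at once gives the same columns as solving $K$ separate single-output problems with the same pseudo-inverse. The random design matrix and targets below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, K = 30, 3, 2
Phi = rng.normal(size=(N, M))                 # shared design matrix
T = rng.normal(size=(N, K))                   # row n is t_n^T

Phi_pinv = np.linalg.pinv(Phi)                # computed once, reused for every target
W_ml = Phi_pinv @ T                           # M x K, all targets at once
w_columns = [Phi_pinv @ T[:, k] for k in range(K)]   # one target at a time
```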


Loss Function for Regression

Overfitting results from a large number of basis functions and a relatively small training set.
Regularisation can prevent overfitting, but how to find the correct value for the regularisation constant $\lambda$?


Loss Function for Regression

Choose an estimator $y(\mathbf{x})$ to estimate the target value $t$ for each input $\mathbf{x}$.
Choose a loss function $L(t, y(\mathbf{x}))$ which measures the difference between the target $t$ and the estimate $y(\mathbf{x})$.
The expected loss is then
$$\mathbb{E}[L] = \iint L(t, y(\mathbf{x})) \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$$
Common choice: Squared Loss
$$L(t, y(\mathbf{x})) = \{ y(\mathbf{x}) - t \}^2.$$
Expected loss for squared loss function
$$\mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt.$$


Loss Function for Regression

Expected loss for squared loss function
$$\mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt.$$
Minimise $\mathbb{E}[L]$ by choosing the regression function
$$y(\mathbf{x}) = \frac{\int t \, p(\mathbf{x}, t) \, dt}{p(\mathbf{x})} = \int t \, p(t \mid \mathbf{x}) \, dt = \mathbb{E}_t[t \mid \mathbf{x}]$$
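A small discrete illustration of this result: for one fixed input $\mathbf{x}$, with a made-up conditional distribution $p(t \mid \mathbf{x})$ over three target values, the expected squared loss as a function of the prediction $y$ is minimised at the conditional mean, and its minimum value is the conditional variance.

```python
import numpy as np

t_values = np.array([0.0, 1.0, 4.0])
p_t_given_x = np.array([0.5, 0.3, 0.2])          # hypothetical p(t | x) at one fixed x

conditional_mean = np.sum(t_values * p_t_given_x)    # E[t | x]

def expected_squared_loss(y):
    """sum_t (y - t)^2 p(t | x) for a candidate prediction y."""
    return np.sum((y - t_values) ** 2 * p_t_given_x)

# Brute-force search over candidate predictions.
candidates = np.linspace(-1.0, 5.0, 6001)
losses = np.array([expected_squared_loss(y) for y in candidates])
best_candidate = candidates[losses.argmin()]

# The loss at the conditional mean equals the conditional variance var[t | x].
conditional_variance = expected_squared_loss(conditional_mean)
```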


Loss Function for Regression

The regression function which minimises the expected squared loss is given by the mean of the conditional distribution $p(t \mid \mathbf{x})$.

[Figure: the curve y(x) plotted as t against x, with the value y(x0) marked at the input x0.]


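Loss Function for Regression

Analyse the expected loss
$$\mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt.$$
Rewrite the squared loss
$$\{ y(\mathbf{x}) - t \}^2 = \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] + \mathbb{E}[t \mid \mathbf{x}] - t \}^2$$
$$= \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \}^2 + \{ \mathbb{E}[t \mid \mathbf{x}] - t \}^2 + 2 \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \} \{ \mathbb{E}[t \mid \mathbf{x}] - t \}$$
Claim
$$\iint \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \} \{ \mathbb{E}[t \mid \mathbf{x}] - t \} \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt = 0.$$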


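Loss Function for Regression

Claim
$$\iint \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \} \{ \mathbb{E}[t \mid \mathbf{x}] - t \} \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt = 0.$$
Separate the functions depending on $t$ from the function depending on $\mathbf{x}$
$$\int \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \} \left( \int \{ \mathbb{E}[t \mid \mathbf{x}] - t \} \, p(\mathbf{x}, t) \, dt \right) d\mathbf{x}$$
Calculate the integral over $t$
$$\int \{ \mathbb{E}[t \mid \mathbf{x}] - t \} \, p(\mathbf{x}, t) \, dt = \mathbb{E}[t \mid \mathbf{x}] \, p(\mathbf{x}) - p(\mathbf{x}) \int t \, \frac{p(\mathbf{x}, t)}{p(\mathbf{x})} \, dt = \mathbb{E}[t \mid \mathbf{x}] \, p(\mathbf{x}) - p(\mathbf{x}) \, \mathbb{E}[t \mid \mathbf{x}] = 0.$$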


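Loss Function for Regression

The expected loss is now
$$\mathbb{E}[L] = \int \{ y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \}^2 \, p(\mathbf{x}) \, d\mathbf{x} + \int \mathrm{var}[t \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}$$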


The Bias-Variance Decomposition

Consider now the dependency on the data set $\mathcal{D}$.
The prediction function is now $y(\mathbf{x}; \mathcal{D})$.
Consider again squared loss, for which the optimal prediction is given by the conditional expectation $h(\mathbf{x})$
$$h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, dt.$$


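The Bias-Variance Decomposition

Taking the expectation over all data sets $\mathcal{D}$
$$\mathbb{E}_{\mathcal{D}}[\mathbb{E}[L]] = \int \mathbb{E}_{\mathcal{D}}\left[ \{ y(\mathbf{x}; \mathcal{D}) - h(\mathbf{x}) \}^2 \right] p(\mathbf{x}) \, d\mathbf{x} + \iint \{ h(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$$
Again, add and subtract the expectation $\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]$
$$\{ y(\mathbf{x}; \mathcal{D}) - h(\mathbf{x}) \}^2 = \{ y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] + \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x}) \}^2$$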


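The Bias-Variance Decomposition

Expected loss $\mathbb{E}_{\mathcal{D}}[L]$ over all data sets $\mathcal{D}$
$$\text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}.$$
where
$$(\text{bias})^2 = \int \{ \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x}) \}^2 \, p(\mathbf{x}) \, d\mathbf{x}$$
$$\text{variance} = \int \mathbb{E}_{\mathcal{D}}\left[ \{ y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] \}^2 \right] p(\mathbf{x}) \, d\mathbf{x}$$
$$\text{noise} = \iint \{ h(\mathbf{x}) - t \}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt.$$
variance: How sensitive is the model to small changes in the training set? (How much do solutions for individual data sets vary around their average?)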
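The decomposition can be estimated by simulation. The sketch below, under assumptions chosen for illustration (a sinusoidal $h(x)$, uniform inputs on $[0, 1]$, Gaussian basis functions, and the quadratic regulariser with coefficient $\lambda$), fits one model per sampled data set and then averages over data sets to estimate the squared bias and the variance.

```python
import numpy as np

def features(x, centres=np.linspace(0.0, 1.0, 9), s=0.1):
    """Constant feature plus 9 Gaussian basis functions (hypothetical choice)."""
    gauss = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), gauss])

def bias_variance(lam, n_sets=200, n_points=25, noise_std=0.3, seed=0):
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.0, 1.0, 100)
    h = np.sin(2 * np.pi * x_test)                 # assumed true regression function h(x)
    Phi_test = features(x_test)
    preds = np.empty((n_sets, x_test.size))
    for d in range(n_sets):                        # one regularised fit per data set D
        x = rng.uniform(0.0, 1.0, size=n_points)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n_points)
        Phi = features(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[d] = Phi_test @ w
    mean_pred = preds.mean(axis=0)                 # estimate of E_D[y(x; D)]
    bias_sq = np.mean((mean_pred - h) ** 2)        # average over x of squared bias
    variance = np.mean(preds.var(axis=0))          # average over x of variance over D
    return bias_sq, variance

bias_simple, var_simple = bias_variance(lam=np.exp(2.6))     # strong regularisation
bias_complex, var_complex = bias_variance(lam=np.exp(-2.4))  # weak regularisation
```

With strong regularisation the fits barely change between data sets but sit far from $h(x)$; with weak regularisation the average fit is close to $h(x)$ while individual fits scatter more widely.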


The Bias-Variance Decomposition

Simple models have low variance and high bias.

[Figure: two panels of t against x (x from 0 to 1, t from -1 to 1) for ln λ = 2.6.]


The Bias-Variance Decomposition

Dependence of bias and variance on the model complexity

[Figure: two panels of t against x (x from 0 to 1, t from -1 to 1) for ln λ = −0.31.]


The Bias-Variance Decomposition

Complex models have high variance and low bias.

[Figure: two panels of t against x (x from 0 to 1, t from -1 to 1) for ln λ = −2.4.]


The Bias-Variance Decomposition

Squared bias, variance, their sum, and test data
The minimum for $(\text{bias})^2 + \text{variance}$ occurs close to the value that gives the minimum error.

[Figure: the curves plotted against ln λ (from −3 to 2), vertical axis from 0 to 0.15.]


The Bias-Variance Decomposition

Tradeoff between bias and variance:
simple models have low variance and high bias;
complex models have high variance and low bias.
The sum of squared bias and variance has a minimum at a certain model complexity.
Expected loss $\mathbb{E}_{\mathcal{D}}[L]$ over all data sets $\mathcal{D}$
$$\text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}.$$
The noise comes from the data and cannot be removed from the expected loss.
