05_Linear_Regression_1.pdf

(1)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Outlines

Overview Introduction Linear Algebra Probability Linear Regression 1 Linear Regression 2 Linear Classification 1 Linear Classification 2 Kernel Methods Sparse Kernel Methods Neural Networks 1 Neural Networks 2 Continuous Latent Variables Autoencoders Graphical Models 1 Graphical Models 2 Graphical Models 3 Sampling Mixture Models and EM 1 Mixture Models and EM 2 Sequential Data 1 Sequential Data 2 Combining Models

1of 859

Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group

Data61 | CSIRO

and

College of Engineering and Computer Science

The Australian National University

Canberra

February – June 2017

(2)

c

University

Review Linear Basis Function Models

Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition

179of 859

Part V

(3)

c

University

180of 859

Linear Regression

N

=

10

x

≡

(

x

1

, . . . ,

xN

)

T

t

≡

(

t

1

, . . . ,

t

N

)

T

xi

∈

R

i

=

1,. . . ,

N

t

i

∈

R

i

=

1,. . . ,

N

0 2 4 6 8 10 x

0 1 2 3 4 5 6 7 8

t

Predictor

y

(

x

,

w

)

?

Performance measure?

Optimal solution

w

∗

?

(4)

c

University

181of 859

Probabilities, Losses

Gaussian Distribution

Bayes Rule

(5)

c

University

Review

Linear Basis Function Models

182of 859

Regression

Given a training data set of

N

observations

{

x

n

}

and target

values

tn

.

Goal : Learn to predict the value of one ore more target

values

t

given a new value of the input

x.

Example: Polynomial curve fitting (see Introduction).

x

t

(6)

c

University

Review

183of 859

Supervised Learning

model with adjustable

parameter w

training

data x

training

targets

t

model with fixed

parameter w*

test

data x

test

target

t

Training Phase

Test Phase

(7)

c

University

Review

184of 859

Linear Basis Function Models

Linear

combination of

fixed

nonlinear basis functions

y

(

x

,

w

) =

M

−

1

X

j

=

0

w

j

φ

j

(

x

) =

w

T

φ

(

x

)

parameter

w

= (

w

0

, . . . ,

w

M

−

1

)

T

basis functions

φ

(

x

) = (

φ

0

(

x

)

, . . . , φ

M

−

1

(

x

))

T

convention

φ

0

(

x

) =

1

(8)

c

University

Review

185of 859

Polynomial Basis Functions

Scalar input variable

x

φ

j

(

x

) =

x

j

Limitation : Polynomials are global functions of the input

variable

x

.

Extension: Split the input space into regions and fit a

different polynomial to each region (spline functions).

−1

0

1

(9)

c

University

Review

186of 859

’Gaussian’ Basis Functions

x

φ

j

(

x

) =

exp

−

(

x

−

µ

j

)

2

s

2

Not a probability distribution.

No normalisation required, taken care of by the model

parameters

w

.

−1

0

1

(10)

c

University

Review

187of 859

Sigmoidal Basis Functions

x

φ

j

(

x

) =

σ

x

−

µ

j

s

where

σ

(

a

)

is the logistic sigmoid function defined by

σ

(

a

) =

1

+

exp

(

−

a

)

σ

(

a

)

is related to the

hyperbolic tangent

tanh

(

a

)

by

tanh

(

a

) =

2

σ

(

a

)

−

1.

−1

0

1

(11)

c

University

Review

188of 859

Other Basis Functions

Fourier Basis : each basis function represents a specific

frequency and has infinite spatial extent.

Wavelets : localised in both space and frequency (also

mutually orthogonal to simplify appliciation).

Splines (piecewise polynomials restricted to regions of the

input space; additional constraints where pieces meet, e.g.

smoothness constraints

→

conditions on the derivatives).

0 1 2 3 4 5

-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0

Linear

Splines

0 1 2 3 4 5

-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0

Quadratic

Splines

0 1 2 3 4 5

-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0

Cubic

Splines

0 1 2 3 4 5

-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0

Quartic

Splines

Approximate the points

(12)

c

University

Review

189of 859

Maximum Likelihood and Least Squares

No special assumption about the basis functions

φ

j

(

x

)

. In

the simplest case, one can think of

φ

j

(

x

) =

x

j

, or

φ

(

x

) =

x.

Assume target

t

is given by

t

=

y

(

x

,

w

)

|

{z

}

deterministic

+

|{z}

noise

where

is a

zero-mean Gaussian

random variable with

precision

(inverse variance)

β

.

Thus

p

(

t

|

x

,

w

, β

) =

N

(

t

|

y

(

x

,

w

)

, β

−

1

)

t

x x0

y(x0)

y(x)

(13)

c

University

Review

190of 859

Maximum Likelihood and Least Squares

Likelihood of one target

t

given the data

x

was

p

(

t

|

x

,

w

, β

) =

N

(

t

|

y

(

x

,

w

)

, β

−

1

)

Now, a

set of inputs

X

with corresponding target values

t

.

Assume data are

independent and identically distributed

(i.i.d.) (means : data are drawn independent and from the

same distribution). The likelihood of the target

t

is then

p

(t

|

X

,

w

, β

) =

N

Y

n

=

1

N

(

t

n

|

y

(

x

n

,

w

)

, β

−

1

)

=

N

Y

n

=

1

N

(

tn

|

w

T

φ

(

x

n

)

, β

−

1

)

(14)

c

University

Review

191of 859

Maximum Likelihood and Least Squares

Consider the

logarithm of the likelihood

p

(t

|

w

, β

)

(the

logarithm is a monoton function! )

ln

p

(t

|

w

, β

) =

N

X

n

=

1

ln

N

(

tn

|

w

T

φ

(

x

n

)

, β

−

1

)

=

N

X

n

=

1

ln

r

β

2

π

exp

−

β

2

(

tn

−

w

T

φ

(

x

n

))

2

!

=

N

2

ln

β

−

N

2

ln

(

2

π

)

−

β

E

D

(

w

)

where the

sum-of-squares error function

is

ED

(

w

) =

1

2

N

X

n

=

1

{

tn

−

w

T

φ

(

xn

)

}

2

.

(15)

c

University

Review

192of 859

Maximum Likelihood and Least Squares

Goal: Find a more compact representation.

Rewrite the

error function

ED

(

w

) =

1

2

N

X

n

=

1

{

tn

−

w

T

φ

(

xn

)

}

2

=

1

2

(t

−

Φ

w

)

T

(t

−

Φ

w

)

where

t

= (

t

1

, . . . ,

tN

)

T

, and

Φ

=







φ

0

(

x

1

)

φ

1

(

x

1

)

. . .

φ

M

−

1

(

x

1

)

φ

0

(

x

2

)

φ

1

(

x

2

)

. . .

φ

M

−

1

(

x

2

)

..

.

..

.

. ..

..

.

φ

0

(

x

N

)

φ

1

(

x

N

)

. . .

φ

M

−

1

(

x

N

)



(16)

c

University

Review

193of 859

Maximum Likelihood and Least Squares

The log likelihood is now

ln

p

(t

|

w

, β

) =

N

2

ln

β

−

N

2

ln

(

2

π

)

−

β

E

D

(

w

)

=

N

2

ln

β

−

N

2

ln

(

2

π

)

−

β

1

2

(t

−

Φ

w

)

T

(t

−

Φ

w

)

Find

critical points

of

ln

p

(t

|

w

, β

)

.

Directional derivative in direction

ξ

D

ln

p

(t

|

w

, β

)(

ξ

) =

β

ξ

T

(

Φ

T

t

−

Φ

T

Φ

w

)

shall be zero in all directions

ξ. Therefore

0

=

Φ

T

t

−

Φ

T

Φ

w

,

which results in

w

ML

= (

Φ

T

Φ

)

−

1

Φ

T

t

=

Φ

†

t

(17)

c

University

Review

194of 859

Maximum Likelihood and Least Squares

Found a critical point

w

ML

for

ln

p

(t

|

w

, β

)

. Is it a maximum,

a minimum, or a saddle point?

Calculate the second directional derivative

D

2

ln

p

(t

|

w

, β

)(

ξ

,

ξ

) =

−

β

ξ

T

Φ

T

Φ

ξ

=

−

β

(

Φ

ξ

)

T

Φ

ξ

=

−

β

k

Φ

ξ

k

2

≤

0

We found a maximum.

Aside: Is it enough to check the second directional

derivative in two directions

ξ

which are the same? Yes,

because for any bilinear function

g

(

ξ

,

η

)

, symmetric in its

arguments,

1

2

[

g

(

ξ

,

ξ

) +

g

(

η

,

η

)

−

g

(

ξ

−

η

,

ξ

−

η

)]

=

1

2

[

g

(

ξ

,

ξ

) +

g

(

η

,

η

)

−

g

(

ξ

,

ξ

)

−

g

(

η

,

η

) +

g

(

ξ

,

η

) +

g

(

η

,

ξ

)]

=

1

(18)

c

University

Review

195of 859

Maximum Likelihood and Least Squares

The log likelihood with the optimal

w

ML

is now

ln

p

(t

|

w

ML

, β

)

=

N

2

ln

β

−

N

2

ln

(

2

π

)

−

β

1

2

(t

−

Φ

w

ML

)

T

(t

−

Φ

w

ML

)

Find critical points of

ln

p

(t

|

w

, β

)

wrt

β

,

∂

ln

p

(t

|

w

ML

, β

)

∂β

=

0

results in

1

β

ML

=

1

N

(t

−

Φ

w

ML

)

T

(t

−

Φ

w

ML

)

Note: We can first find the maximum likelihood for

w

as

this does

not depend

on

β

. Then we can use

w

ML

to find

the maximum likelihood solution for

β

.

(19)

c

University

Maximum Likelihood and Least Squares

Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition

196of 859

Geometry of Least Squares

Target vector

t

= (

t

1

, . . . ,

tN

)

T

∈

R

N

Basis vectors

ϕ

j

= (

Φ

j

1

, . . . ,

Φ

jN

)

T

= (

φ

j

(

x

1

)

, . . . , φ

j

(

x

N

))

T

∈

R

N

span a

subspace

S

Find

w

such that

y

= (

y

(

x

1

,

w

)

, . . . ,

y

(

x

N

,

w

))

T

∈ S

is closest

to

t

.

S

t

y

ϕ

1

(20)

c

University

Geometry of Least Squares

Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition

197of 859

Sequential Learning - Stochastic Gradient

Descent

For

large data sets, calculating the maximum likelihood

parameters

w

ML

and

β

ML

may be costly.

For

online

applications, never all data in memory.

Use a

sequential

algorithms (online

algorithm).

If the error function is a sum over data points

E

=

P

n

En

,

then

1

initialise

w

(0)

to some starting value

2

update the parameter vector at iteration

τ

+

1

by

w

(τ+1)

=

w

(τ)

−

η

∇

E

n,

(21)

c

University

Geometry of Least Squares

Sequential Learning Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition

198of 859

Sequential Learning - Stochastic Gradient

Descent

For the sum-of-squares error function, stochastic gradient

descent results in

w

(

τ

+

1

)

=

w

(

τ

)

−

η

t

n

−

w

(

τ

)

T

φ

(

x

n

)

φ

(

x

n

)

(22)

c

University

Maximum Likelihood and Least Squares Geometry of Least Squares

Sequential Learning

Regularized Least Squares Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition

199of 859

Regularized Least Squares

Add regularisation in order to prevent overfitting

ED

(

w

) +

λ

EW

(

w

)

with regularisation coefficient

λ

.

Simple quadratic regulariser

EW

(

w

) =

1

2

w

T

w

Maximum likelihood solution

(23)

c

University

Sequential Learning

200of 859

Regularized Least Squares

More general regulariser

EW

(

w

) =

1

2

M

X

j

=

1

|

wj

|

q

=

1

(lasso) leads to a sparse model if

λ

large enough.

(24)

c

University

Sequential Learning

201of 859

Comparison of Quadratic and Lasso Regulariser

Assume a sufficiently large regularisation coefficient

λ

.

Quadratic regulariser

1

2

M

X

j

=

1

w

2

j

w

1

w

2

w

?

Lasso regulariser

1

2

M

X

j

=

1

|

wj

|

w

1

w

2

(25)

c

University

Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning

Regularized Least Squares

Multiple Outputs Loss Function for Regression The Bias-Variance Decomposition

202of 859

Multiple Outputs

More than

1

target variable per data point.

y

becomes a vector instead of a scalar. Each dimension

can be treated with a different set of basis functions (and

that may be necessary if the data in the different target

dimensions represent very different types of information.)

Here we restrict ourselves to the SAME basis functions

y

(

x

,

w

) =

W

T

φ

(

x

)

where

y

is a

K

-dimensional column vector,

W

is an

M

×

K

matrix of model parameters, and

φ

(

x

) = (

φ

0

(

x

)

, . . . , φ

M

−

1

(

x

)

,

φ

0

(

x

) =

1, as before.

(26)

c

University

203of 859

Multiple Outputs

Suppose the conditional distribution of the target vector is

an isotropic Gaussian of the form

p

(

t

|

x

,

W

, β

) =

N

(

t

|

W

T

φ

(

x

)

, β

−

1

I

)

.

The log likelihood is then

ln

p

(

T

|

X

,

W

, β

) =

N

X

n

=

1

ln

N

(

t

n

|

W

T

φ

(

x

n

)

, β

−

1

VI

)

=

NK

2

ln

β

2

π

−

β

2

N

X

n

=

1

(27)

c

University

204of 859

Multiple Outputs

Maximisation with respect to

W

results in

W

ML

= (

Φ

T

Φ

)

−

1

Φ

T

.

For each target variable

t

k

, we get

w

k

= (

Φ

T

Φ

)

−

1

Φ

T

t

k

=

Φ

†

t

k

.

The solution between the different target variables

decouples.

Holds also for a general Gaussian noise distribution with

arbitrary covariance matrix.

(28)

c

University

Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares

Multiple Outputs

Loss Function for Regression The Bias-Variance Decomposition

205of 859

Loss Function for Regression

Over-fitting results from a large number of basis functions

and a relatively small training set.

Regularisation can prevent overfitting, but how to find the

correct value for the regularisation constant

λ

?

(29)

c

University

Multiple Outputs

206of 859

Loss Function for Regression

Choose an estimator

y

(

x

)

to estimate the target value

t

for

each input

x.

Choose a loss function

L

(

t

,

y

(

x

))

which measures the

difference between the target

t

and the estimate

y

(

x

)

.

The

expected loss

is then

E

[

L

] =

Z

L

(

t

,

y

(

x

))

p

(

x

,

t

)

dx

d

t

Common choice:

Squared Loss

L

(

t

,

y

(

x

)) =

{

y

(

x

)

−

t

}

2

.

Expected loss for squared loss function

E

[

L

] =

Z

(30)

c

University

Multiple Outputs

207of 859

Loss Function for Regression

Expected loss for squared loss function

E

[

L

] =

Z

{

y

(

x

)

−

t

}

2

p

(

x

,

t

)

dx

d

t

.

Minimise

E

[

L

]

by choosing the regression function

y

(

x

) =

R

t p

(

x

,

t

)

d

t

p

(

x

)

=

Z

t p

(

t

|

x

)

d

t

=

E

t

[

t

|

x

]

(31)

c

University

Multiple Outputs

208of 859

Loss Function for Regression

The regression function which minimises the expected

squared loss, is given by the mean of the conditional

distribution

p

(

t

|

x

)

.

t

x

0

y

(

x

0

)

y

(

x

)

(32)

c

University

Multiple Outputs

209of 859

Loss Function for Regression

Analyse the expected loss

E

[

L

] =

Z

{

y

(

x

)

−

t

}

2

p

(

x

,

t

)

dx

d

t

.

Rewrite the squared loss

{

y

(

x

)

−

t

}

2

=

{

y

(

x

)

−

E

[

t

|

x

] +

E

[

t

|

x

]

−

t

}

2

=

{

y

(

x

)

−

E

[

t

|

x

]

}

2

+

{

E

[

t

|

x

]

−

t

}

2

+

2

{

y

(

x

)

−

E

[

t

|

x

]

} {E

[

t

|

x

]

−

t

}

Claim

Z

(33)

c

University

Multiple Outputs

210of 859

Loss Function for Regression

Claim

Z

{

y

(

x

)

−

E

[

t

|

x

]

} {E

[

t

|

x

]

−

t

}

p

(

x

,

t

)

dx

d

t

=

0

.

Seperate functions depending on

t

from function

depending on

x

Z

{

y

(

x

)

−

E

[

t

|

x

]

}

Z

{

E

[

t

|

x

]

−

t

}

p

(

x

,

t

)

d

t

dx

Calculate the integral over

t

Z

{E

[

t

|

x

]

−

t

}

p

(

x

,

t

)

d

t

=

E

[

t

|

x

]

p

(

x

)

−

p

(

x

)

Z

t p

(

x

,

t

)

(34)

c

University

Multiple Outputs

211of 859

Loss Function for Regression

The expected loss is now

E

[

L

] =

Z

{

y

(

x

)

−

E

[

t

|

x

]

}

2

p

(

x

)

dx

+

Z

var

[

t

|

x

]

p

(

x

)

dx

(35)

c

University

Maximum Likelihood and Least Squares Geometry of Least Squares Sequential Learning Regularized Least Squares Multiple Outputs

Loss Function for Regression

The Bias-Variance Decomposition

212of 859

The Bias-Variance Decomposition

Consider now the dependency on the data set

D

.

Prediction function now

y

(

x

;

D

)

.

Consider again squared loss for which the optimal

prediction is given by the conditional expectation

h

(

x

)

h

(

x

) =

E

[

t

|

x

] =

Z

t p

(

t

|

x

)

d

t

.

(36)

c

University

213of 859

The Bias-Variance Decomposition

Taking the expectation over all data sets

D

E

D

[

E

[

L

]] =

Z

E

D

{

y

(

x

;

D

)

−

h

(

x

)

}

2

p

(

x

)

dx

+

Z

{

h

(

x

)

−

t

}

2

p

(

x

,

t

)

dx

d

t

Again, add and subtract the expectation

E

D

[

y

(

x

;

D

)]

{

y

(

x

;

D

)

−

h

(

x

)

}

2

=

{

y

(

x

;

D

)

−

E

D

[

y

(

x

;

D

)]

+

E

D

[

y

(

x

;

D

)]

−

h

(

x

)

}

2

(37)

c

University

214of 859

The Bias-Variance Decomposition

Expected loss

E

D

[

L

]

over all data sets

D

expected loss

= (

bias

)

2

+

variance

+

noise

.

where

(

bias

)

2

=

Z

{E

D

[

y

(

x

;

D

)]

−

h

(

x

)

}

2

p

(

x

)

dx

variance

=

Z

E

D

h

{

y

(

x

;

D

)

−

E

D

[

y

(

x

;

D

)]

}

2

i

p

(

x

)

dx

noise

=

Z

{

h

(

x

)

−

t

}

2

p

(

x

,

t

)

dx

d

t

.

variance

: How sensitive is the model to small changes in

the training set? (How much do solutions for individual

data sets vary around their average ?

(38)

c

University

215of 859

The Bias-Variance Decomposition

Simple

models have

low variance

and

high bias.

x

t lnλ= 2.6

0 1

−1 0 1

x t

0 1

−1 0 1

(39)

c

University

216of 859

The Bias-Variance Decomposition

Dependence of bias and variance on the model complexity

x

t lnλ=−0.31

0 1

−1 0 1

x t

0 1

−1 0 1

(40)

c

University

217of 859

The Bias-Variance Decomposition

Complex

models have

high variance

and

low bias.

x

t lnλ=−2.4

0 1

−1 0 1

x t

0 1

−1 0 1

(41)

c

University

218of 859

The Bias-Variance Decomposition

Squared bias, variance, their sum, and test data

The minimum for

(

bias

)

2

+

variance occurs close to the

value that gives the minimum error

ln

λ

−3

−2

−1

0

1

2

0

0.03

0.06

0.09

0.12

0.15

(42)

c

University

219of 859

The Bias-Variance Decomposition

Tradeoff between bias and variance

simple models have low variance and high bias

complex models have high variance and low bias

The sum of bias and variance has a minimum at a certain

model complexity.

Expected loss

E

D

[

L

]

over all data sets

D

expected loss

= (

bias

)

2

+

variance

+

noise

.

The noise comes from the data, and can not be removed

from the expected loss.