02_Introduction.pdf

(1)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group

Data61 | CSIRO

and

College of Engineering and Computer Science

The Australian National University

Canberra

February – June 2017

(2)

c

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

49of 90

Part II

(3)

c

University

Probability Theory

Flavour of this course

Formalise intuitions about problems

Use language of mathematics to express models

Geometry, vectors, linear algebra for reasoning

Probabilistic models to capture uncertainty

Design and analysis of algorithms

Numerical algorithms in python

(4)

c

University

Probability Theory

51of 90

What is Machine Learning?

Definition (Mitchell, 1998)

(5)

c

University

Probability Theory

some artificial data created from the function

sin(2

πx

) +

random noise

x

=

0

, . . . ,

1

x

t

0

1

(6)

c

University

Probability Theory

53of 90

Polynomial Curve Fitting - Input Specification

N

=

10

x

≡

(

x

1

, . . . ,

x

N

)

T

(7)

c

University

Probability Theory

Polynomial Curve Fitting - Input Specification

N

=

10

x

≡

(

x

1

, . . . ,

x

N

)

T

t

≡

(

t

1

, . . . ,

t

N

)

T

x

i

∈

R

i

=

1

,. . . ,

N

(8)

c

University

Probability Theory

55of 90

Polynomial Curve Fitting - Model Specification

M : order of polynomial

y

(

x,

w

) =

w

0

+

w

1

x

+

w

2

x

2

+

· · ·

+

w

M

x

M

=

M

X

m

=

0

w

m

x

m

nonlinear function of

x

linear

function of the unknown model parameter

w

(9)

c

University

Probability Theory

Learning is Improving Performance

t

x y(xn,w) tn

xn

Performance measure : Error between target and

prediction of the model for the training data

E

(w) =

1

2

N

X

n

=

1

(

y

(

x

n

,

w)

−

t

n

)

2

(10)

c

University

Probability Theory

57of 90

Learning is Improving Performance

t

x y(xn,w) tn

xn

Performance measure : Error between target and

prediction of the model for the training data

E

(w) =

1

2

N

X

n

=

1

(

y

(

x

n

,

w)

−

t

n

)

2

(11)

c

University

Probability Theory

Model Comparison or Model Selection

y

(

x,

w) =

M

X

m

=

0

w

m

x

m

M

=

0

=

w

0

x

t

M

= 0

0

1

(12)

c

University

Probability Theory

59of 90

Model Comparison or Model Selection

y

(

x,

w) =

M

X

m

=

0

w

m

x

m

M

=

1

=

w

0

+

w

1

x

t

M

= 1

0

1

(13)

c

University

Probability Theory

Model Comparison or Model Selection

y

(

x,

w) =

M

X

m

=

0

w

m

x

m

M

=

3

=

w

0

+

w

1

x

+

w

2

x

2

+

w

3

x

3

x

t

M

= 3

0

1

(14)

c

University

Probability Theory

61of 90

Model Comparison or Model Selection

y

(

x,

w) =

M

X

m

=

0

w

m

x

m

M

=

9

=

w

0

+

w

1

x

+

· · ·

+

w

8

x

8

+

w

9

x

9

overfitting

x

t

M

= 9

0

1

(15)

c

University

Probability Theory

Testing the Model

Train the model and get

w

?

Get

100

new data points

Root-mean-square (RMS) error

E

RMS

=

p

2

E

(w

?

)

/N

M

E

R

M

S

0

3

6

9

0

0.5

1

(16)

c

University

Probability Theory

63of 90

Testing the Model

M = 0

M = 1

M = 3

M = 9

w

?

₀

0.19

0.82

0.31

0.35

w

?

1

-1.27

7.99

232.37

w

?

2

-25.43

-5321.83

w

?

3

17.37

48568.31

w

?

4

-231639.30

w

?

5

640042.26

w

?

6

-1061800.52

w

?

7

1042400.18

w

?

8

-557682.99

w

?

9

125201.43

(17)

c

University

Probability Theory

More Data

N

=

15

x

t

N

= 15

0

1

(18)

c

University

Probability Theory

65of 90

More Data

N

=

100

heuristics : have no less than

5

to

10

times as many data

points than parameters

but number of parameters is not necessarily the most

appropriate measure of model complexity !

later: Bayesian approach

x

t

N

= 100

0

1

(19)

c

University

Probability Theory

Regularisation

How to constrain the growing of the coefficients

w

?

Add a

regularisation

term to the error function

e

E

(w) =

1

2

N

X

n

=

1

(

y

(

x

n

,

w)

−

t

n

)

2

+

λ

2

k

w

k

2

Squared norm of the parameter vector

w

(20)

c

University

Probability Theory

67of 90

Regularisation

M

=

9

x

t

ln

λ

=

−

18

0

1

(21)

c

University

Probability Theory

Regularisation

M

=

9

x

t

ln

λ

= 0

0

1

(22)

c

University

Probability Theory

69of 90

Regularisation

M

=

9

E

R

M

S

ln

λ

−35

−30

−25

−20

0

0.5

1

(23)

c

University

Probability Theory

What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience

E

with

respect to some class of tasks

T

and performance measure

P,

if its performance at tasks in

T

, as measured by

P, improves

with experience

E.

Task: regression

Experience:

x

input examples,

t

output labels

Performance: squared error

Model choice

Regularisation

(24)

c

University

Probability Theory

71of 90

Probability Theory

p

(

X, Y

)

X

Y

= 2

(25)

c

University

Probability Theory

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

1

4

5

8

6

2

26

1

3

6

8

5

3

1

0

34

sum

3

6

8

9

8

9

6

2

60

p

(

X, Y

)

X

Y

= 2

(26)

c

University

Probability Theory

73of 90

Sum Rule

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

1

4

5

8

6

2

26

1

3

6

8

5

3

1

0

34

sum

3

6

8

9

8

9

6

2

60

p

(

X

=

d,

Y

=

1) =

8

/

60

p

(

X

=

d

) =

p

(

X

=

d,

Y

=

2) +

p

(

X

=

d,

Y

=

1)

=

1

/

60

+

8

/

60

p

(

X

=

d

) =

X

Y

p

(

X

=

d,

Y

)

p

(

X

) =

X

Y

(27)

c

University

Probability Theory

Sum Rule

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

1

4

5

8

6

2

26

1

3

6

8

5

3

1

0

34

sum

3

6

8

9

8

9

6

2

60

p

(

X

) =

X

Y

p

(

X,

Y

)

p(X)

X

p

(

Y

) =

X

p

(

X,

Y

)

(28)

c

University

Probability Theory

75of 90

Product Rule

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

1

4

5

8

6

2

26

1

3

6

8

5

3

1

0

34

sum

3

6

8

9

8

9

6

2

60

Conditional Probability

p

(

X

=

d

|

Y

=

1) =

8

/

34

Calculate

p

(

Y

=

1)

:

p

(

Y

=

1) =

X

p

(

X,

Y

=

1) =

34

/

60

p

(

X

=

d,

Y

=

1) =

p

(

X

=

d

|

Y

=

1)

p

(

Y

=

1)

(29)

c

University

Probability Theory

Product Rule

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

1

4

5

8

6

2

26

1

3

6

8

5

3

1

0

34

sum

3

6

8

9

8

9

6

2

60

p

(

X

) =

X

Y

p

(

X,

Y

)

p(X)

X

p

(

X,

Y

) =

p

(

X

|

Y

)

p

(

Y

)

(30)

c

University

Probability Theory

77of 90

Sum Rule and Product Rule

Sum Rule

p

(

X

) =

X

Y

p

(

X,

Y

)

Product Rule

(31)

c

University

Probability Theory

Bayes Theorem

Use product rule

p

(

X,

Y

) =

p

(

X

|

Y

)

p

(

Y

) =

p

(

Y

|

X

)

p

(

X

)

Bayes Theorem

p

(

Y

|

X

) =

p

(

X

|

Y

)

p

(

Y

)

p

(

X

)

only defined for

p

(

X

)

>

0

and

p

(

X

) =

X

Y

p

(

X,

Y

)

(sum rule)

=

X

Y

(32)

c

University

Probability Theory

79of 90

Real valued variable

x

∈

R

Probability of

x

to fall in the interval

(

x,

x

+

δx

)

is given by

p

(

x

)

δx

for infinitesimal small

δx.

p

(

x

∈

(

a,

b

)) =

Z

b

a

p

(

x

)

d

x.

x δx

(33)

c

University

Probability Theory

Constraints on p(x)

Nonnegative

p

(

x

)

≥

0

Normalisation

Z

∞

−∞

p

(

x

)

d

x

=

1

.

x δx

(34)

c

University

Probability Theory

81of 90

Cumulative distribution function P(x)

P

(

x

) =

Z

x

−∞

p

(

z

)

d

z

or

d

dx

P

(

x

) =

p

(

x

)

x δx

(35)

c

University

Probability Theory

Multivariate Probability Density

Vector

x

≡

(

x

1

, . . . ,

x

D

)

T

=







x

1

..

.

x

D







Nonnegative

p

(x)

≥

0

Normalisation

Z

∞

−∞

p

(

x

)

dx

=

1

.

This means

Z

∞

−∞

· · ·

Z

∞

−∞

(36)

c

University

Probability Theory

83of 90

Sum and Product Rule for Probability Densities

Sum Rule

p

(

x

) =

Z

∞

−∞

p

(

x,

y

)

d

y

Product Rule

(37)

c

University

Probability Theory

Expectations

Weighted average of a function f(x) under the probability

distribution

p

(

x

)

E

[

f

] =

X

x

p

(

x

)

f

(

x

)

discrete distribution

p

(

x

)

E

[

f

] =

Z

(38)

c

University

Probability Theory

85of 90

How to approximate

E

[f

]

Given a finite number

N

of points

x

n

drawn from the

probability distribution

p

(

x

)

.

Approximate the expectation by a finite sum:

E

[

f

]

'

1

N

X

n

=

1

f

(

x

n

)

(39)

c

University

Probability Theory

Expection of a function of several variables

arbitrary function

f

(

x,

y

)

E

x

[

f

(

x,

y

)] =

X

x

p

(

x

)

f

(

x,

y

)

p

(

x

)

E

x

[

f

(

x,

y

)] =

Z

p

(

x

)

f

(

x,

y

)

d

x

probability density

p

(

x

)

(40)

c

University

Probability Theory Probability Densities Expectations and Covariances 87of 90

Conditional Expectation

arbitrary function

f

(

x

)

E

x

[

f

|

y

] =

X

x

p

(

x

|

y

)

f

(

x

)

p

(

x

)

E

x

[

f

|

y

] =

Z

p

(

x

|

y

)

f

(

x

)

d

x

probability density

p

(

x

)

Note that

E

x

[

f

|

y

]

is a function of

y.

Other notation used in the literature :

E

x

|

y

[

f

]

.

What is

E

[E

[

f

(

x

)

|

y

]]

? Can we simplify it?

This must mean

E

y

[E

x

[

f

(

x

)

|

y

]]

. (Why?)

E

y

[

E

x

[

f

(

x

)

|

y

]] =

X

y

p

(

y

)

E

x

[

f

|

y

] =

X

y

p

(

y

)

X

x

p

(

x

|

y

)

f

(

x

)

=

X

x

,

y

f

(

x

)

p

(

x,

y

) =

X

x

f

(

x

)

p

(

x

)

(41)

c

University

Probability Theory

Variance

arbitrary function

f

(

x

)

var[

f

] =

E

(

f

(

x

)

−

E

[

f

(

x

)])

2

=

E

f

(

x

)

2

−

E

[

f

(

x

)]

2

Special case:

f

(

x

) =

x

(42)

c

University

Probability Theory

89of 90

Covariance

Two random variables

x

∈

R

and

y

∈

R

cov[

x,

y

] =

E

x

,

y

[(

x

−

E

[

x

])(

y

−

E

[

y

])]

=

E

x

,

y

[

x y

]

−

E

[

x

]

E

[

y

]

With

E

[

x

] =

a

and

E

[

y

] =

b

cov[

x,

y

] =

E

x

,

y

[(

x

−

a

)(

y

−

b

)]

=

E

x

,

y

[

x y

]

−

E

x

,

y

[

x b

]

−

E

x

,

y

[

a y

] +

E

x

,

y

[

a b

]

=

E

x

,

y

[

x y

]

−

b

E

x

,

y

[

x

]

|

{z

}

=

E

x

[

x

]

−

a

E

x

,

y

[

y

]

|

{z

}

=

E

y

[

y

]

+

a b

E

x

,

y

[1]

|

{z

}

=

1

=

E

x

,

y

[

x y

]

−

a b

−

a b

+

a b

=

E

x

,

y

[

x y

]

−

a b

=

E

x

,

y

[

x y

]

−

E

[

x

]

E

[

y

]

(43)

c

University

Probability Theory

Covariance for Vector Valued Variables

Two random variables

x

∈

R

D

and

y

∈

R

D

cov[x

,

y] =

E

x

,

y

(x

−

E

[x])(y

T

−

E

y

T

)

=

E

x

,

y

x y

T