• No results found

02_Introduction.pdf

N/A
N/A
Protected

Academic year: 2020

Share "02_Introduction.pdf"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group

Data61 | CSIRO

and

College of Engineering and Computer Science

The Australian National University

Canberra

February – June 2017

(2)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

49of 90

Part II

(3)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Flavour of this course

Formalise intuitions about problems

Use language of mathematics to express models

Geometry, vectors, linear algebra for reasoning

Probabilistic models to capture uncertainty

Design and analysis of algorithms

Numerical algorithms in python

(4)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

51of 90

What is Machine Learning?

Definition (Mitchell, 1998)

(5)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Polynomial Curve Fitting

some artificial data created from the function

sin(2

πx

) +

random noise

x

=

0

, . . . ,

1

x

t

0

1

(6)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

53of 90

Polynomial Curve Fitting - Input Specification

N

=

10

x

(

x

1

, . . . ,

x

N

)

T

(7)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Polynomial Curve Fitting - Input Specification

N

=

10

x

(

x

1

, . . . ,

x

N

)

T

t

(

t

1

, . . . ,

t

N

)

T

x

i

R

i

=

1

,. . . ,

N

(8)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

55of 90

Polynomial Curve Fitting - Model Specification

M : order of polynomial

y

(

x,

w

) =

w

0

+

w

1

x

+

w

2

x

2

+

· · ·

+

w

M

x

M

=

M

X

m

=

0

w

m

x

m

nonlinear function of

x

linear

function of the unknown model parameter

w

(9)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Learning is Improving Performance

t

x y(xn,w) tn

xn

Performance measure : Error between target and

prediction of the model for the training data

E

(w) =

1

2

N

X

n

=

1

(

y

(

x

n

,

w)

t

n

)

2

(10)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

57of 90

Learning is Improving Performance

t

x y(xn,w) tn

xn

Performance measure : Error between target and

prediction of the model for the training data

E

(w) =

1

2

N

X

n

=

1

(

y

(

x

n

,

w)

t

n

)

2

(11)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Model Comparison or Model Selection

y

(

x,

w) =

M

X

m

=

0

w

m

x

m

M

=

0

=

w

0

x

t

M

= 0

0

1

(12)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

59of 90

Model Comparison or Model Selection

y

(

x,

w) =

M

X

m

=

0

w

m

x

m

M

=

1

=

w

0

+

w

1

x

x

t

M

= 1

0

1

(13)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Model Comparison or Model Selection

y

(

x,

w) =

M

X

m

=

0

w

m

x

m

M

=

3

=

w

0

+

w

1

x

+

w

2

x

2

+

w

3

x

3

x

t

M

= 3

0

1

(14)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

61of 90

Model Comparison or Model Selection

y

(

x,

w) =

M

X

m

=

0

w

m

x

m

M

=

9

=

w

0

+

w

1

x

+

· · ·

+

w

8

x

8

+

w

9

x

9

overfitting

x

t

M

= 9

0

1

(15)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Testing the Model

Train the model and get

w

?

Get

100

new data points

Root-mean-square (RMS) error

E

RMS

=

p

2

E

(w

?

)

/N

M

E

R

M

S

0

3

6

9

0

0.5

1

(16)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

63of 90

Testing the Model

M = 0

M = 1

M = 3

M = 9

w

?

0

0.19

0.82

0.31

0.35

w

?

1

-1.27

7.99

232.37

w

?

2

-25.43

-5321.83

w

?

3

17.37

48568.31

w

?

4

-231639.30

w

?

5

640042.26

w

?

6

-1061800.52

w

?

7

1042400.18

w

?

8

-557682.99

w

?

9

125201.43

(17)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

More Data

N

=

15

x

t

N

= 15

0

1

(18)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

65of 90

More Data

N

=

100

heuristics : have no less than

5

to

10

times as many data

points than parameters

but number of parameters is not necessarily the most

appropriate measure of model complexity !

later: Bayesian approach

x

t

N

= 100

0

1

(19)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Regularisation

How to constrain the growing of the coefficients

w

?

Add a

regularisation

term to the error function

e

E

(w) =

1

2

N

X

n

=

1

(

y

(

x

n

,

w)

t

n

)

2

+

λ

2

k

w

k

2

Squared norm of the parameter vector

w

(20)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

67of 90

Regularisation

M

=

9

x

t

ln

λ

=

18

0

1

(21)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Regularisation

M

=

9

x

t

ln

λ

= 0

0

1

(22)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

69of 90

Regularisation

M

=

9

E

R

M

S

ln

λ

−35

−30

−25

−20

0

0.5

1

(23)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

What is Machine Learning?

Definition (Mitchell, 1998)

A computer program is said to learn from experience

E

with

respect to some class of tasks

T

and performance measure

P,

if its performance at tasks in

T

, as measured by

P, improves

with experience

E.

Task: regression

Experience:

x

input examples,

t

output labels

Performance: squared error

Model choice

Regularisation

(24)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

71of 90

Probability Theory

p

(

X, Y

)

X

Y

= 2

(25)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Probability Theory

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

0

0

1

4

5

8

6

2

26

1

3

6

8

8

5

3

1

0

0

34

sum

3

6

8

9

9

8

9

6

2

60

p

(

X, Y

)

X

Y

= 2

(26)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

73of 90

Sum Rule

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

0

0

1

4

5

8

6

2

26

1

3

6

8

8

5

3

1

0

0

34

sum

3

6

8

9

9

8

9

6

2

60

p

(

X

=

d,

Y

=

1) =

8

/

60

p

(

X

=

d

) =

p

(

X

=

d,

Y

=

2) +

p

(

X

=

d,

Y

=

1)

=

1

/

60

+

8

/

60

p

(

X

=

d

) =

X

Y

p

(

X

=

d,

Y

)

p

(

X

) =

X

Y

(27)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Sum Rule

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

0

0

1

4

5

8

6

2

26

1

3

6

8

8

5

3

1

0

0

34

sum

3

6

8

9

9

8

9

6

2

60

p

(

X

) =

X

Y

p

(

X,

Y

)

p(X)

X

p

(

Y

) =

X

X

p

(

X,

Y

)

(28)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

75of 90

Product Rule

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

0

0

1

4

5

8

6

2

26

1

3

6

8

8

5

3

1

0

0

34

sum

3

6

8

9

9

8

9

6

2

60

Conditional Probability

p

(

X

=

d

|

Y

=

1) =

8

/

34

Calculate

p

(

Y

=

1)

:

p

(

Y

=

1) =

X

X

p

(

X,

Y

=

1) =

34

/

60

p

(

X

=

d,

Y

=

1) =

p

(

X

=

d

|

Y

=

1)

p

(

Y

=

1)

(29)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Product Rule

Y vs. X

a

b

c

d

e

f

g

h

i

sum

2

0

0

0

1

4

5

8

6

2

26

1

3

6

8

8

5

3

1

0

0

34

sum

3

6

8

9

9

8

9

6

2

60

p

(

X

) =

X

Y

p

(

X,

Y

)

p(X)

X

p

(

X,

Y

) =

p

(

X

|

Y

)

p

(

Y

)

(30)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

77of 90

Sum Rule and Product Rule

Sum Rule

p

(

X

) =

X

Y

p

(

X,

Y

)

Product Rule

(31)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Bayes Theorem

Use product rule

p

(

X,

Y

) =

p

(

X

|

Y

)

p

(

Y

) =

p

(

Y

|

X

)

p

(

X

)

Bayes Theorem

p

(

Y

|

X

) =

p

(

X

|

Y

)

p

(

Y

)

p

(

X

)

only defined for

p

(

X

)

>

0

and

p

(

X

) =

X

Y

p

(

X,

Y

)

(sum rule)

=

X

Y

(32)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

79of 90

Probability Densities

Real valued variable

x

R

Probability of

x

to fall in the interval

(

x,

x

+

δx

)

is given by

p

(

x

)

δx

for infinitesimal small

δx.

p

(

x

(

a,

b

)) =

Z

b

a

p

(

x

)

d

x.

x δx

(33)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Constraints on p(x)

Nonnegative

p

(

x

)

0

Normalisation

Z

−∞

p

(

x

)

d

x

=

1

.

x δx

(34)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

81of 90

Cumulative distribution function P(x)

P

(

x

) =

Z

x

−∞

p

(

z

)

d

z

or

d

dx

P

(

x

) =

p

(

x

)

x δx

(35)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Multivariate Probability Density

Vector

x

(

x

1

, . . . ,

x

D

)

T

=

x

1

..

.

x

D

Nonnegative

p

(x)

0

Normalisation

Z

−∞

p

(

x

)

dx

=

1

.

This means

Z

−∞

· · ·

Z

−∞

(36)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

83of 90

Sum and Product Rule for Probability Densities

Sum Rule

p

(

x

) =

Z

−∞

p

(

x,

y

)

d

y

Product Rule

(37)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Expectations

Weighted average of a function f(x) under the probability

distribution

p

(

x

)

E

[

f

] =

X

x

p

(

x

)

f

(

x

)

discrete distribution

p

(

x

)

E

[

f

] =

Z

(38)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

85of 90

How to approximate

E

[f

]

Given a finite number

N

of points

x

n

drawn from the

probability distribution

p

(

x

)

.

Approximate the expectation by a finite sum:

E

[

f

]

'

1

N

N

X

n

=

1

f

(

x

n

)

(39)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Expection of a function of several variables

arbitrary function

f

(

x,

y

)

E

x

[

f

(

x,

y

)] =

X

x

p

(

x

)

f

(

x,

y

)

discrete distribution

p

(

x

)

E

x

[

f

(

x,

y

)] =

Z

p

(

x

)

f

(

x,

y

)

d

x

probability density

p

(

x

)

(40)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory Probability Densities Expectations and Covariances 87of 90

Conditional Expectation

arbitrary function

f

(

x

)

E

x

[

f

|

y

] =

X

x

p

(

x

|

y

)

f

(

x

)

discrete distribution

p

(

x

)

E

x

[

f

|

y

] =

Z

p

(

x

|

y

)

f

(

x

)

d

x

probability density

p

(

x

)

Note that

E

x

[

f

|

y

]

is a function of

y.

Other notation used in the literature :

E

x

|

y

[

f

]

.

What is

E

[E

[

f

(

x

)

|

y

]]

? Can we simplify it?

This must mean

E

y

[E

x

[

f

(

x

)

|

y

]]

. (Why?)

E

y

[

E

x

[

f

(

x

)

|

y

]] =

X

y

p

(

y

)

E

x

[

f

|

y

] =

X

y

p

(

y

)

X

x

p

(

x

|

y

)

f

(

x

)

=

X

x

,

y

f

(

x

)

p

(

x,

y

) =

X

x

f

(

x

)

p

(

x

)

(41)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Variance

arbitrary function

f

(

x

)

var[

f

] =

E

(

f

(

x

)

E

[

f

(

x

)])

2

=

E

f

(

x

)

2

E

[

f

(

x

)]

2

Special case:

f

(

x

) =

x

(42)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

89of 90

Covariance

Two random variables

x

R

and

y

R

cov[

x,

y

] =

E

x

,

y

[(

x

E

[

x

])(

y

E

[

y

])]

=

E

x

,

y

[

x y

]

E

[

x

]

E

[

y

]

With

E

[

x

] =

a

and

E

[

y

] =

b

cov[

x,

y

] =

E

x

,

y

[(

x

a

)(

y

b

)]

=

E

x

,

y

[

x y

]

E

x

,

y

[

x b

]

E

x

,

y

[

a y

] +

E

x

,

y

[

a b

]

=

E

x

,

y

[

x y

]

b

E

x

,

y

[

x

]

|

{z

}

=

E

x

[

x

]

a

E

x

,

y

[

y

]

|

{z

}

=

E

y

[

y

]

+

a b

E

x

,

y

[1]

|

{z

}

=

1

=

E

x

,

y

[

x y

]

a b

a b

+

a b

=

E

x

,

y

[

x y

]

a b

=

E

x

,

y

[

x y

]

E

[

x

]

E

[

y

]

(43)

Introduction to Statistical Machine Learning

c

2016 Ong & Walder Data61 | CSIRO The Australian National

University

Polynomial Curve Fitting

Probability Theory

Probability Densities

Expectations and Covariances

Covariance for Vector Valued Variables

Two random variables

x

R

D

and

y

R

D

cov[x

,

y] =

E

x

,

y

(x

E

[x])(y

T

E

y

T

)

=

E

x

,

y

x y

T

References

Related documents

Respite is designed to serve Medical Assistance eligible adults and children who are identified through interagency planning teams, crisis intervention staff, Base Service Unit

When it comes to the energy of the future, the Institute is engaged in research on new materials and testing new generations of nuclear reactors, which may be built in

2 2 Ferindah, 2020 PENERAPAN MODEL PROBLEM BASED LEARNING UNTUK MEMPERBAIKI BERPIKIR KRITIS SISWA DI SEKOLAH DASAR Universitas Pendidikan Indonesia | repository.upi.edu |

– one system/model to learn multiple tasks simultaneously, with shared or separate Experience, with different performance measures.

Paishao Energy Taste Meridian Induction Functions & Indications Typical Dosage Contra- indications Paeonia lactiflora Mild, nontoxic Slightly bitter, sour Lung,

If key management resides with the customer, the cloud backup provider must have a process in place to ensure that the encryption key is stored locally with the customer and not

Husband’s “Pour Over” Will Husband’s Durable Power Of Attorney Wife’s Durable Power Of Attorney Wife’s “Pour Over” Will Joint Trust Wife’s Living Will

Thus, if the growth of the own capital of all households is due to labor income and savings, in accordance with Equation (6), then in the long run the total wealth and