
Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group, Data61 | CSIRO
and
College of Engineering and Computer Science, The Australian National University

Canberra, February – June 2017

© 2016 Ong & Walder, Data61 | CSIRO, The Australian National University

Part IV: Probability

Outline: Motivation, Boxes with Apples and Oranges, Bayes’ Theorem, Bayes’ Probabilities, Probability Distributions, Gaussian Distribution over a Vector, Decision Theory, Model Selection - Key Ideas


Linear Curve Fitting - Least Squares

N = 10
x = (x_1, ..., x_N)^T, x_i ∈ R, i = 1, ..., N
t = (t_1, ..., t_N)^T, t_i ∈ R, i = 1, ..., N

y(x, w) = w_1 x + w_0

X = [x 1]

w* = (X^T X)^{-1} X^T t

[Figure: least-squares straight-line fit to the N = 10 data points (x, t).]
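As an aside (not from the slides), the closed-form solution w* = (X^T X)^{-1} X^T t can be computed directly with NumPy; the data below are made up for illustration:

    # Minimal sketch (toy data): least squares for y(x, w) = w1*x + w0
    import numpy as np

    rng = np.random.default_rng(0)
    N = 10
    x = np.linspace(0, 10, N)
    t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)   # hypothetical targets

    X = np.column_stack([x, np.ones(N)])                # design matrix [x 1]
    w_star = np.linalg.solve(X.T @ X, X.T @ t)          # w* = (X^T X)^{-1} X^T t
    print(w_star)                                       # [w1, w0]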


Simple Experiment

1. Choose a box.
   Red box: p(B = r) = 4/10
   Blue box: p(B = b) = 6/10
2. Choose any item of the selected box with equal probability.


What do we know?

p(F = o | B = b) = 1/4
p(F = a | B = b) = 3/4
p(F = o | B = r) = 3/4
p(F = a | B = r) = 1/4

Given that we have chosen an orange, what is the probability that the box we chose was the blue one?


Calculating the Posterior p(B = b | F = o)

Bayes’ Theorem:

p(B = b | F = o) = p(F = o | B = b) p(B = b) / p(F = o)

Sum rule for the denominator:

p(F = o) = p(F = o, B = b) + p(F = o, B = r)
         = p(F = o | B = b) p(B = b) + p(F = o | B = r) p(B = r)
         = 1/4 × 6/10 + 3/4 × 4/10 = 9/20
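The same calculation can be sketched in a few lines of Python (numbers taken from the slides):

    # Posterior p(B=b | F=o) for the boxes example via Bayes' theorem
    p_B = {"r": 4/10, "b": 6/10}                  # priors over the boxes
    p_o_given_B = {"r": 3/4, "b": 1/4}            # likelihood of drawing an orange

    p_o = sum(p_o_given_B[box] * p_B[box] for box in p_B)    # sum rule: 9/20
    p_b_given_o = p_o_given_B["b"] * p_B["b"] / p_o          # = 1/3
    print(p_o, p_b_given_o)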


Bayes’ Theorem - revisited

Before choosing an item from a box, the most complete information we have is in p(B) (the prior).

Note that in our example p(B = b) = 6/10. Therefore, choosing a box from the prior alone, we would opt for the blue box.

Once we observe some data (e.g. choose an orange), we can calculate p(B = b | F = o) (the posterior probability) via Bayes’ theorem.

After observing an orange, the posterior probability is p(B = b | F = o) = 1/3, and therefore p(B = r | F = o) = 2/3.


Bayes’ Rule

p(Y | X) = p(X | Y) p(Y) / p(X)

where p(Y | X) is the posterior, p(X | Y) is the likelihood, p(Y) is the prior, and p(X) is the normalisation:

p(X) = Σ_Y p(X | Y) p(Y)


Bayes’ Probabilities

Classical or frequentist interpretation of probabilities.
Bayesian view: probabilities represent uncertainty.
Example: Will the Arctic ice cap have disappeared by the end of the century?
Fresh evidence can change the opinion on ice loss.
Goal: quantify uncertainty and revise it in the light of new evidence.


Andrey Kolmogorov - Axiomatic Probability Theory (1933)

Let (Ω, F, P) be a measure space with P(Ω) = 1.
Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.

Axiom 1: P(E) ≥ 0 for all E ∈ F.
Axiom 2: P(Ω) = 1.
Axiom 3: P(⋃_i E_i) = Σ_i P(E_i) for every countable sequence of pairwise disjoint events E_i ∈ F.


Richard Threlkeld Cox - (1946, 1961)

Postulates (according to Arnborg and Sjödin)

1. Divisibility and comparability: The plausibility of a statement is a real number and depends on the information we have related to the statement.

2. Common sense: Plausibilities should vary sensibly with the assessment of plausibilities in the model.

3. Consistency: If the plausibility of a statement can be derived in many ways, all the results must be equal.

“Common sense” includes consistency with Aristotelian logic.

Result: the numerical quantities representing plausibility all behave according to the rules of probability (either p ∈ [0, 1] or p ∈ [1, ∞)).


Curve fitting - revisited

Uncertainty about the parameter w is captured in the prior probability p(w).
Observed data D = {t_1, ..., t_N}.
Calculate the uncertainty in w after the data D have been observed:

p(w | D) = p(D | w) p(w) / p(D)

p(D | w) as a function of w is the likelihood function; the likelihood expresses how probable the observed data are for different values of w.


Likelihood Function - Frequentist versus Bayesian

Likelihood function p(D | w)

Frequentist approach: w is considered a fixed parameter whose value is defined by some ‘estimator’; error bars on the estimated w are obtained from the distribution of possible data sets D.

Bayesian approach: there is only one single data set D (the one actually observed), and the uncertainty in w is expressed through a probability distribution over w.


Frequentist Estimator - Maximum Likelihood

Choose w for which the likelihood p(D | w) is maximal, i.e. choose w for which the probability of the observed data is maximal.

In machine learning, the error function is the negative log of the likelihood function.
The log is a monotonic function, so maximising the likelihood ⇐⇒ minimising the error.

Example: a fair-looking coin is tossed three times, always landing on heads; maximum likelihood then estimates the probability of heads to be 1.
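A small illustrative sketch of this example (the grid search over µ is my own addition, not the slides'):

    import numpy as np

    tosses = np.array([1, 1, 1])           # three heads, as in the example
    mus = np.linspace(0.01, 0.99, 99)      # candidate values of mu

    # negative log-likelihood of i.i.d. Bernoulli data
    nll = [-np.sum(tosses * np.log(m) + (1 - tosses) * np.log(1 - m)) for m in mus]
    print(mus[int(np.argmin(nll))])        # close to 1: maximum likelihood overfits here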


Bayesian Approach

Including prior knowledge is easy (via the prior over w).
BUT: if the prior is badly chosen, it can lead to bad results.
The choice of prior is subjective.
Sometimes the choice of prior is motivated by a convenient mathematical form.
Need to sum/integrate over the whole parameter space.
Advances in sampling (Markov Chain Monte Carlo methods) help with this.


Expectations

Weighted average of a function f(x) under the probability distribution p(x):

E[f] = Σ_x p(x) f(x)            (discrete distribution p(x))

E[f] = ∫ p(x) f(x) dx           (continuous probability density p(x))


Variance

For an arbitrary function f(x):

var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]²

Special case f(x) = x:

var[x] = E[x²] − E[x]²
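A quick numerical check of both definitions, using a toy discrete distribution assumed purely for illustration:

    import numpy as np

    xs = np.array([0, 1, 2, 3])
    px = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical p(x), sums to 1
    f = lambda x: 2 * x + 1

    E_f   = np.sum(px * f(xs))               # E[f] = sum_x p(x) f(x)
    E_f2  = np.sum(px * f(xs) ** 2)
    var_f = E_f2 - E_f ** 2                  # var[f] = E[f^2] - E[f]^2
    print(E_f, var_f)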


The Gaussian Distribution

x ∈ R

Gaussian distribution with mean µ and variance σ²:

N(x | µ, σ²) = (1 / (2πσ²)^{1/2}) exp{ −(x − µ)² / (2σ²) }

[Figure: the density N(x | µ, σ²) as a function of x.]
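As a sketch (not from the slides), the density can be evaluated directly; µ and σ² below are arbitrary:

    import numpy as np

    def gauss_pdf(x, mu, sigma2):
        """N(x | mu, sigma^2) as defined on the slide."""
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    print(gauss_pdf(np.linspace(-3, 3, 7), mu=0.0, sigma2=1.0))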


The Gaussian Distribution

N(x | µ, σ²) > 0

∫_{−∞}^{∞} N(x | µ, σ²) dx = 1

Expectation over x:

E[x] = ∫_{−∞}^{∞} N(x | µ, σ²) x dx = µ

Expectation over x²:

E[x²] = ∫_{−∞}^{∞} N(x | µ, σ²) x² dx = µ² + σ²

Variance of x:

var[x] = E[x²] − E[x]² = σ²


Linear Curve Fitting - Least Squares

N = 10
x = (x_1, ..., x_N)^T, x_i ∈ R, i = 1, ..., N
t = (t_1, ..., t_N)^T, t_i ∈ R, i = 1, ..., N

y(x, w) = w_1 x + w_0

X = [x 1]

w* = (X^T X)^{-1} X^T t

[Figure: least-squares straight-line fit to the N = 10 data points (x, t).]

We assume

t = y(x, w) + ε,

where y(x, w) is the deterministic part and ε is random noise.


The Mode of a Probability Distribution

Mode of a distribution: the value that occurs most frequently in a probability distribution.
For a probability density function: the value x at which the probability density attains its maximum.
The Gaussian distribution has one mode (it is unimodal); the mode is µ.
If there are multiple local maxima in the probability distribution, the distribution is called multimodal (example: a mixture of three Gaussians).

[Figure: Gaussian density N(x | µ, σ²) over x, with the mode at µ and width 2σ indicated.]


The Bernoulli Distribution

Two possible outcomes x ∈ {0, 1} (e.g. a coin which may be damaged).

p(x = 1 | µ) = µ    for 0 ≤ µ ≤ 1
p(x = 0 | µ) = 1 − µ

Bernoulli distribution:

Bern(x | µ) = µ^x (1 − µ)^{1−x}

Expectation over x:  E[x] = µ
Variance of x:  var[x] = µ(1 − µ)
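A short sketch of the Bernoulli pmf and its moments (the value of µ is arbitrary):

    def bern_pmf(x, mu):
        """Bern(x | mu) = mu^x (1 - mu)^(1 - x) for x in {0, 1}."""
        return mu ** x * (1 - mu) ** (1 - x)

    mu = 0.3
    mean = sum(x * bern_pmf(x, mu) for x in (0, 1))                 # = mu
    var = sum((x - mean) ** 2 * bern_pmf(x, mu) for x in (0, 1))    # = mu * (1 - mu)
    print(mean, var)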


The Binomial Distribution

Flip a coin N times. What is the distribution of observing heads exactly m times?
This is a distribution over m ∈ {0, ..., N}.

Binomial distribution:

Bin(m | N, µ) = (N choose m) µ^m (1 − µ)^{N−m}

[Figure: the binomial distribution Bin(m | N = 10, µ) plotted over m = 0, ..., 10.]
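The pmf can be coded directly from the formula (N and µ below are arbitrary choices):

    from math import comb

    def binom_pmf(m, N, mu):
        """Bin(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)."""
        return comb(N, m) * mu ** m * (1 - mu) ** (N - m)

    N, mu = 10, 0.25
    probs = [binom_pmf(m, N, mu) for m in range(N + 1)]
    print(sum(probs))        # the pmf sums to 1 over m = 0..N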


The Beta Distribution

Beta distribution:

Beta(µ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) · µ^{a−1} (1 − µ)^{b−1}

[Figure: the Beta density over µ ∈ [0, 1] for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]
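Likewise for the Beta density, using the gamma function from the Python standard library (the (a, b) pairs mirror the figure panels):

    from math import gamma

    def beta_pdf(mu, a, b):
        """Beta(mu | a, b) = Gamma(a+b) / (Gamma(a) Gamma(b)) * mu^(a-1) (1-mu)^(b-1)."""
        return gamma(a + b) / (gamma(a) * gamma(b)) * mu ** (a - 1) * (1 - mu) ** (b - 1)

    for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
        print(a, b, beta_pdf(0.5, a, b))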


The Gaussian Distribution over a Vector x

x ∈ R^D

Gaussian distribution with mean µ ∈ R^D and covariance matrix Σ ∈ R^{D×D}:

N(x | µ, Σ) = 1 / ((2π)^{D/2} |Σ|^{1/2}) · exp{ −(1/2) (x − µ)^T Σ^{−1} (x − µ) }

where |Σ| is the determinant of Σ.

[Figure: contours of a two-dimensional Gaussian in coordinates (x_1, x_2); the principal axes u_1, u_2, with lengths λ_1^{1/2}, λ_2^{1/2}, define the rotated coordinates (y_1, y_2).]
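A sketch of evaluating this density; the mean and covariance below are assumed example values:

    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        """N(x | mu, Sigma) for a D-dimensional vector x."""
        D = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
    print(mvn_pdf(np.array([0.5, -0.5]), mu, Sigma))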


The Gaussian Distribution over a Vector x

We can find a linear transformation to a new coordinate system y in which x becomes

y = U^T (x − µ),

where U is the eigenvector matrix of the covariance matrix Σ, with eigenvalue matrix E:

Σ U = U E = U diag(λ_1, λ_2, ..., λ_D)

U can be made an orthogonal matrix, so the columns u_i of U are unit vectors which are orthogonal to each other:

u_i^T u_j = 1 if i = j,  and 0 if i ≠ j.
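The eigendecomposition of Σ and the change of coordinates can be checked numerically (the example Σ is my own choice, matching the sketch above):

    import numpy as np

    Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])       # example covariance matrix
    lam, U = np.linalg.eigh(Sigma)                   # eigenvalues lambda_j, eigenvectors u_j

    # U is orthogonal: U^T U = I, and Sigma U = U diag(lambda)
    print(np.allclose(U.T @ U, np.eye(2)))
    print(np.allclose(Sigma @ U, U @ np.diag(lam)))

    mu = np.array([0.0, 0.0])
    x = np.array([0.5, -0.5])
    y = U.T @ (x - mu)                               # new coordinates y = U^T (x - mu)
    print(y)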


The Gaussian Distribution over a Vector x

Now use the linear transformation y = U^T (x − µ) and Σ^{−1} = U E^{−1} U^T to transform the exponent (x − µ)^T Σ^{−1} (x − µ) into

y^T E^{−1} y = Σ_{j=1}^{D} y_j² / λ_j

Exponentiating the sum (and taking care of the normalising factors) results in a product of scalar-valued Gaussian distributions in the orthogonal directions u_i:

p(y) = Π_{j=1}^{D} 1 / (2πλ_j)^{1/2} · exp{ −y_j² / (2λ_j) }


Partitioned Gaussians

Given a joint Gaussian distribution N(x | µ, Σ), where µ is the mean vector, Σ the covariance matrix, and Λ ≡ Σ^{−1} the precision matrix.

Assume that the variables can be partitioned into two sets:

x = (x_a, x_b)^T,   µ = (µ_a, µ_b)^T,
Σ = [ Σ_aa  Σ_ab ; Σ_ba  Σ_bb ],   Λ = [ Λ_aa  Λ_ab ; Λ_ba  Λ_bb ]

Λ_aa = (Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba)^{−1}
Λ_ab = −(Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba)^{−1} Σ_ab Σ_bb^{−1}

Conditional distribution:

p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^{−1})
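For illustration only: the conditional mean µ_{a|b} is not spelled out on this slide, so the sketch below uses the standard Gaussian conditioning (Schur complement) formulas, whose conditional variance equals Λ_aa^{−1} above; the numbers are made up.

    import numpy as np

    # Example partitioned 2-D Gaussian: x = (x_a, x_b)
    mu = np.array([0.5, 0.5])
    Sigma = np.array([[0.02, 0.01],
                      [0.01, 0.02]])

    mu_a, mu_b = mu[0], mu[1]
    S_aa, S_ab, S_ba, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

    x_b = 0.7
    # Standard Gaussian conditioning (Schur complement)
    mu_a_given_b = mu_a + S_ab / S_bb * (x_b - mu_b)
    var_a_given_b = S_aa - S_ab / S_bb * S_ba        # equals Lambda_aa^{-1}
    print(mu_a_given_b, var_a_given_b)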


Partitioned Gaussians

[Figure: contours of the Gaussian distribution p(x_a, x_b) over two variables x_a and x_b (left), and the marginal distribution p(x_a) together with the conditional distribution p(x_a | x_b = 0.7) (right).]


Decision Theory - Key Ideas

Two classes C_1 and C_2, with joint distribution p(x, C_k). Using Bayes’ theorem:

p(C_k | x) = p(x | C_k) p(C_k) / p(x)

Example: cancer treatment (K = 2)
data x: an X-ray image
C_1: patient has cancer (C_2: patient has no cancer)
p(C_1) is the prior probability of a person having cancer
p(C_1 | x) is the posterior probability of a person having cancer, given the X-ray image x.


Decision Theory - Key Ideas

We need a rule which assigns each value of the input x to one of the available classes.
The input space is partitioned into decision regions R_k, leading to decision boundaries or decision surfaces.

Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx


Decision Theory - Key Ideas

Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx

Goal: minimise p(mistake).

[Figure: the joint density p(x, C_1) over x, with a decision boundary x̂ and the optimal boundary x_0.]


Decision Theory - Key Ideas

Multiple classes: instead of minimising the probability of mistakes, maximise the probability of correct classification:

p(correct) = Σ_{k=1}^{K} p(x ∈ R_k, C_k)
           = Σ_{k=1}^{K} ∫_{R_k} p(x, C_k) dx


Minimising the Expected Loss

Not all mistakes are equally costly.
Weight each misclassification of x to the wrong class C_j, instead of assigning it to the correct class C_k, by a factor L_kj.
The expected loss is now

E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx
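The slides stop at the expected loss; one standard way to use it, assigning x to the class j that minimises Σ_k L_kj p(C_k | x), can be sketched as follows (loss matrix and posterior are made up):

    import numpy as np

    # Rows k = true class, columns j = decision; e.g. missing cancer is costlier
    L = np.array([[0.0, 1000.0],     # true C1 (cancer): decide C1 / C2
                  [1.0,    0.0]])    # true C2 (healthy): decide C1 / C2

    posterior = np.array([0.3, 0.7])          # hypothetical p(C_k | x)
    expected_loss = posterior @ L             # expected loss of each decision j
    decision = int(np.argmin(expected_loss))
    print(expected_loss, decision)            # picks the decision with smallest expected loss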


The Reject Region

Avoid making automated decisions on difficult cases.
Difficult cases:
the posterior probabilities p(C_k | x) are very small
the joint distributions p(x, C_k) have comparable values

[Figure: posteriors p(C_1 | x) and p(C_2 | x) over x; inputs whose largest posterior falls below a threshold θ (between 0.0 and 1.0) are rejected.]
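A minimal sketch of such a reject rule, assuming we simply compare the largest posterior against the threshold θ (values made up):

    import numpy as np

    def decide(posterior, theta=0.8):
        """Return the class index, or None to reject when the max posterior < theta."""
        k = int(np.argmax(posterior))
        return k if posterior[k] >= theta else None

    print(decide(np.array([0.95, 0.05])))   # confident: class 0
    print(decide(np.array([0.55, 0.45])))   # ambiguous: reject (None)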


S-fold Cross Validation

Given a set of N data items and targets.
Goal: find the best model (type of model, number of parameters such as the order p of the polynomial, or the regularisation constant λ) while avoiding overfitting.
Solution: train a machine learning algorithm with some of the data, and evaluate it with the rest.

If we have plenty of data:

1. Train a range of models, or a single model with a range of parameter settings.
2. Compare the performance on an independent data set (the validation set) and choose the one with the best predictive performance.
3. Still, overfitting to the validation set can occur; therefore, a third held-out test set may be needed for the final evaluation.


S-fold Cross Validation

With few data there is a dilemma: few training data or few test data.
The solution is cross-validation.
Use a proportion (S − 1)/S of the available data for training, but use all of the data to assess the performance.


S-fold Cross Validation

Partition the data into S groups.
Use S − 1 groups to train a set of models that are then evaluated on the remaining group.
Repeat for all S choices of the held-out group, and average the performance scores from the S runs.

[Figure: the four runs of S = 4 cross-validation, each holding out a different group (runs 1–4).]
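A hand-written sketch of the S-fold procedure; the fit and score functions are placeholders, not part of the slides:

    import numpy as np

    def s_fold_cv(X, t, S, fit, score):
        """Average a score over S held-out folds."""
        idx = np.arange(len(X))
        folds = np.array_split(idx, S)
        scores = []
        for s in range(S):
            test = folds[s]
            train = np.concatenate([folds[j] for j in range(S) if j != s])
            model = fit(X[train], t[train])            # train on S-1 groups
            scores.append(score(model, X[test], t[test]))  # evaluate on the held-out group
        return float(np.mean(scores))

    # Hypothetical usage with the least-squares fit from the earlier slides:
    # s_fold_cv(X, t, S=4, fit=least_squares_fit, score=mean_squared_error)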
