Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group, Data61 | CSIRO
and
College of Engineering and Computer Science, The Australian National University
Canberra, February – June 2017
© 2016 Ong & Walder, Data61 | CSIRO, The Australian National University
Part IV
Linear Curve Fitting - Least Squares
N = 10 data points
x ≡ (x_1, ..., x_N)^T,  t ≡ (t_1, ..., t_N)^T
x_i ∈ R,  t_i ∈ R,  i = 1, ..., N
Model: y(x, w) = w_1 x + w_0
Design matrix: X ≡ [x 1]
Least-squares solution: w^* = (X^T X)^{-1} X^T t
[Plot: target values t against inputs x for the N = 10 data points, with the fitted line.]
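A minimal NumPy sketch of this closed-form solution (the data here are hypothetical; the design matrix stacks x with a column of ones):

```python
import numpy as np

# Hypothetical data: N = 10 noisy observations of a line.
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 10, N)
t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)

# Design matrix X = [x 1], one row per data point.
X = np.column_stack([x, np.ones(N)])

# Normal equations: w* = (X^T X)^{-1} X^T t
w_star = np.linalg.solve(X.T @ X, X.T @ t)
print("w1 (slope), w0 (intercept):", w_star)
```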
Simple Experiment
1. Choose a box:
   red box with probability p(B = r) = 4/10,
   blue box with probability p(B = b) = 6/10.
2. Choose an item from the selected box, each item with equal probability.
What do we know?
p(F = o | B = b) = 1/4
p(F = a | B = b) = 3/4
p(F = o | B = r) = 3/4
p(F = a | B = r) = 1/4
Given that we have chosen an orange, what is the probability that the box we chose was the blue one?
Calculating the Posterior p(B = b | F = o)
Bayes’ Theorem
p(B = b | F = o) = p(F = o | B = b) p(B = b) / p(F = o)
Sum Rule for the denominator
p(F = o) = p(F = o, B = b) + p(F = o, B = r)
         = p(F = o | B = b) p(B = b) + p(F = o | B = r) p(B = r)
         = 1/4 × 6/10 + 3/4 × 4/10 = 9/20
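The same calculation as a short Python check (the numbers are exactly those given above):

```python
# Priors over boxes and likelihoods of picking an orange, as given above.
p_box = {"r": 4 / 10, "b": 6 / 10}
p_orange_given_box = {"r": 3 / 4, "b": 1 / 4}

# Sum rule: p(F=o) = sum_B p(F=o|B) p(B)
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)

# Bayes' theorem: p(B=b|F=o)
p_blue_given_orange = p_orange_given_box["b"] * p_box["b"] / p_orange

print(p_orange)             # 0.45   (= 9/20)
print(p_blue_given_orange)  # 0.333... (= 1/3)
```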
Bayes’ Theorem - revisited
Before choosing an item from a box, the most complete information we have is in p(B) (the prior).
Note that in our example p(B = b) = 6/10. Therefore, choosing a box from the prior alone, we would opt for the blue box.
Once we observe some data (e.g. we choose an orange), we can calculate p(B = b | F = o) (the posterior probability) via Bayes' theorem.
After observing an orange, the posterior probability is p(B = b | F = o) = 1/3 and therefore p(B = r | F = o) = 2/3.
Bayes’ Rule
p(Y | X) = p(X | Y) p(Y) / p(X)     (posterior = likelihood × prior / normalisation)

where the normalisation is p(X) = Σ_Y p(X | Y) p(Y).
Bayes’ Probabilities
Classical or frequentist interpretation of probabilities: probabilities as frequencies of random, repeatable events.
Bayesian view: probabilities represent uncertainty.
Example: Will the Arctic ice cap have disappeared by the end of the century?
  Fresh evidence can change the opinion on ice loss.
  Goal: quantify uncertainty and revise it in the light of new evidence.
Andrey Kolmogorov - Axiomatic Probability Theory (1933)
Let (Ω, F, P) be a measure space with P(Ω) = 1.
Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.
1. Axiom: P(E) ≥ 0 for all E ∈ F.
2. Axiom: P(Ω) = 1.
3. Axiom (countable additivity): P(⋃_i E_i) = Σ_i P(E_i) for any countable collection of pairwise disjoint events E_i ∈ F.
Richard Threlkeld Cox - (1946, 1961)
Postulates (according to Arnborg and Sjödin)
1. Divisibility and comparability: the plausibility of a statement is a real number and depends on the information we have related to the statement.
2. Common sense: plausibilities should vary sensibly with the assessment of plausibilities in the model.
3. Consistency: if the plausibility of a statement can be derived in many ways, all the results must be equal.
"Common sense" includes consistency with Aristotelian logic.
Result: the numerical quantities representing plausibility all behave according to the rules of probability (mapped either to p ∈ [0, 1] or to p ∈ [1, ∞)).
Curve fitting - revisited
Uncertainty about the parameter w is captured in the prior probability p(w).
Observed data D = {t_1, ..., t_N}.
Calculate the uncertainty in w after the data D have been observed:

p(w | D) = p(D | w) p(w) / p(D)

p(D | w) as a function of w: the likelihood function.
The likelihood expresses how probable the observed data are for different values of w.
Likelihood Function - Frequentist versus Bayesian
Likelihood function p(D | w)

Frequentist approach:
  w is considered a fixed parameter,
  its value is defined by some 'estimator',
  error bars on the estimated w are obtained from the distribution of possible data sets D.

Bayesian approach:
  there is only one single data set D (the one actually observed),
  the uncertainty in w is expressed as a probability distribution over w.
Frequentist Estimator - Maximum Likelihood
Choose w for which the likelihood p(D | w) is maximal, i.e. the w for which the probability of the observed data is maximal.
Machine learning convention: the error function is the negative log of the likelihood function.
The log is a monotonic function, so maximising the likelihood ⟺ minimising the error.
Example: a fair-looking coin is tossed three times and always lands on heads.
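A small sketch of the coin example under a Bernoulli model: the maximum-likelihood estimate of µ after three heads is the empirical mean, i.e. µ_ML = 1, even though the coin looked fair.

```python
import numpy as np

# Three tosses, all heads (1 = head, 0 = tail).
data = np.array([1, 1, 1])

# Negative log likelihood of a Bernoulli model as a function of mu.
def neg_log_likelihood(mu, x):
    return -np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

# Evaluate on a grid of mu values (avoiding the endpoints 0 and 1).
grid = np.linspace(0.01, 0.99, 99)
errors = [neg_log_likelihood(mu, data) for mu in grid]
mu_ml = grid[np.argmin(errors)]

print(mu_ml)        # 0.99 on this grid; the true maximiser is mu = 1
print(data.mean())  # 1.0 -- the closed-form ML estimate
```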
Bayesian Approach
Including prior knowledge is easy (via the prior p(w)).
BUT: a badly chosen prior can lead to bad results; the choice of prior is subjective.
Sometimes the choice of prior is motivated by a convenient mathematical form.
Need to sum/integrate over the whole parameter space; made practical by advances in sampling (Markov Chain Monte Carlo methods).
Expectations
Weighted average of a function f(x) under the probability distribution p(x):

E[f] = Σ_x p(x) f(x)        (discrete distribution p(x))

E[f] = ∫ p(x) f(x) dx       (continuous probability density p(x))
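As a sketch, the discrete expectation is just a weighted sum (hypothetical distribution and function):

```python
import numpy as np

# Hypothetical discrete distribution p(x) over x in {0, 1, 2, 3}.
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])  # sums to 1

f = x ** 2                           # the function f(x) = x^2

expectation = np.sum(p * f)          # E[f] = sum_x p(x) f(x)
print(expectation)                   # 0.2*1 + 0.3*4 + 0.4*9 = 5.0
```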
Variance
For an arbitrary function f(x):

var[f] = E[(f(x) − E[f(x)])^2] = E[f(x)^2] − E[f(x)]^2

Special case f(x) = x:  var[x] = E[x^2] − E[x]^2
The Gaussian Distribution
x ∈ R

Gaussian distribution with mean µ and variance σ^2:

N(x | µ, σ^2) = 1/(2πσ^2)^{1/2} exp{ −(x − µ)^2 / (2σ^2) }

[Plot: the density N(x | µ, σ^2) against x, with the width 2σ indicated around the mean.]
The Gaussian Distribution
N(x | µ, σ^2) > 0

∫_{−∞}^{∞} N(x | µ, σ^2) dx = 1

Expectation of x:
E[x] = ∫_{−∞}^{∞} N(x | µ, σ^2) x dx = µ

Expectation of x^2:
E[x^2] = ∫_{−∞}^{∞} N(x | µ, σ^2) x^2 dx = µ^2 + σ^2

Variance of x:
var[x] = E[x^2] − E[x]^2 = σ^2
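A quick numerical check of these identities (assumed example values µ = 1, σ = 2; numerical integration via scipy.integrate.quad):

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.0, 2.0  # assumed example values

def gauss(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

norm, _ = quad(gauss, -np.inf, np.inf)                          # should be 1
mean, _ = quad(lambda x: gauss(x) * x, -np.inf, np.inf)         # should be mu
second, _ = quad(lambda x: gauss(x) * x ** 2, -np.inf, np.inf)  # mu^2 + sigma^2

print(norm, mean, second - mean ** 2)  # approx. 1.0, 1.0, 4.0 (= sigma^2)
```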
Linear Curve Fitting - Least Squares
N = 10 data points
x ≡ (x_1, ..., x_N)^T,  t ≡ (t_1, ..., t_N)^T
x_i ∈ R,  t_i ∈ R,  i = 1, ..., N
Model: y(x, w) = w_1 x + w_0
Design matrix: X ≡ [x 1]
Least-squares solution: w^* = (X^T X)^{-1} X^T t

[Plot: target values t against inputs x for the N = 10 data points, with the fitted line.]

We assume each target is a deterministic part plus noise:

t = y(x, w) + ε
    (deterministic)  (noise)
The Mode of a Probability Distribution
Mode of a distribution: the value that occurs most frequently in a probability distribution.
For a probability density function: the value x at which the probability density attains its maximum.
The Gaussian distribution has one mode (it is unimodal); the mode is µ.
If there are multiple local maxima in the probability distribution, the distribution is called multimodal (example: a mixture of three Gaussians).
[Plot: Gaussian density N(x | µ, σ^2) against x, with its mode at x = µ.]
The Bernoulli Distribution
Two possible outcomes x ∈ {0, 1} (e.g. a coin which may be damaged).
p(x = 1 | µ) = µ  for  0 ≤ µ ≤ 1
p(x = 0 | µ) = 1 − µ

Bernoulli distribution:
Bern(x | µ) = µ^x (1 − µ)^{1−x}

Expectation of x:  E[x] = µ
Variance of x:  var[x] = µ(1 − µ)
The Binomial Distribution
Flip a coin N times. What is the probability of observing heads exactly m times?
This is a distribution over m ∈ {0, ..., N}.

Binomial distribution:
Bin(m | N, µ) = (N choose m) µ^m (1 − µ)^{N−m}

[Histogram: Bin(m | N, µ) over m = 0, ..., 10 for N = 10.]
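A sketch evaluating this pmf with scipy.stats (the values N = 10 and µ = 0.25 are assumed for illustration):

```python
import numpy as np
from scipy.stats import binom

N, mu = 10, 0.25     # assumed parameters for illustration
m = np.arange(N + 1)

# Bin(m | N, mu) for m = 0, ..., N
pmf = binom.pmf(m, N, mu)
print(pmf.round(3))  # probabilities over m
print(pmf.sum())     # they sum to 1
```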
The Beta Distribution
Beta Distribution
Beta(µ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) · µ^{a−1} (1 − µ)^{b−1}

[Plots: Beta(µ | a, b) over µ ∈ [0, 1] for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]
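A sketch evaluating the density with scipy.stats for the four (a, b) pairs shown above:

```python
import numpy as np
from scipy.stats import beta

mu = np.linspace(0.01, 0.99, 5)  # a few points in (0, 1)
for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
    # Beta(mu | a, b) evaluated on the grid
    print((a, b), beta.pdf(mu, a, b).round(2))
```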
The Gaussian Distribution over a Vector x
x ∈ R^D

Gaussian distribution with mean µ ∈ R^D and covariance matrix Σ ∈ R^{D×D}:

N(x | µ, Σ) = 1 / ((2π)^{D/2} |Σ|^{1/2}) exp{ −(1/2) (x − µ)^T Σ^{−1} (x − µ) }

where |Σ| is the determinant of Σ.
[Plot: contour of a 2-D Gaussian in the (x_1, x_2) plane; the principal axes point along the eigenvectors u_1, u_2 with lengths λ_1^{1/2}, λ_2^{1/2}, defining rotated coordinates y_1, y_2.]
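A direct NumPy implementation of this density, as a sketch with assumed 2-D example values:

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) for a D-dimensional vector x."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Assumed example with D = 2.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(gauss_pdf(np.array([0.5, 0.5]), mu, Sigma))
```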
The Gaussian Distribution over a Vector x
We can find a linear transformation to a new coordinate system y in which x becomes

y = U^T (x − µ),

where U is the eigenvector matrix of the covariance matrix Σ, with eigenvalue matrix E:

Σ U = U E = U diag(λ_1, λ_2, ..., λ_D)

U can be made an orthogonal matrix, therefore the columns u_i of U are unit vectors which are orthogonal to each other:

u_i^T u_j = 1 if i = j,  0 if i ≠ j
The Gaussian Distribution over a Vector x
Now use the linear transformation y = U^T (x − µ) and Σ^{−1} = U E^{−1} U^T to transform the exponent (x − µ)^T Σ^{−1} (x − µ) into

y^T E^{−1} y = Σ_{j=1}^{D} y_j^2 / λ_j

Exponentiating the sum (and taking care of the normalising factors) results in a product of scalar-valued Gaussian distributions in the orthogonal directions u_i:

p(y) = Π_{j=1}^{D} 1 / (2πλ_j)^{1/2} exp{ −y_j^2 / (2λ_j) }
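A numerical check of this factorisation, as a sketch with an assumed covariance matrix: after the transformation y = U^T(x − µ), the exponent in the original coordinates equals Σ_j y_j^2 / λ_j.

```python
import numpy as np

# Assumed example covariance, mean and point.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([0.0, 0.0])
x = np.array([1.0, -0.5])

# Eigendecomposition Sigma U = U E (eigh gives orthonormal U for symmetric Sigma).
lam, U = np.linalg.eigh(Sigma)

y = U.T @ (x - mu)
lhs = (x - mu) @ np.linalg.solve(Sigma, x - mu)  # (x-mu)^T Sigma^{-1} (x-mu)
rhs = np.sum(y ** 2 / lam)                       # sum_j y_j^2 / lambda_j
print(lhs, rhs)                                  # equal up to rounding
```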
Partitioned Gaussians
Given a joint Gaussian distribution N(x | µ, Σ), where µ is the mean vector, Σ the covariance matrix, and Λ ≡ Σ^{−1} the precision matrix.
Assume that the variables can be partitioned into two sets:

x = (x_a, x_b),   µ = (µ_a, µ_b)

Σ = [ Σ_aa  Σ_ab ]        Λ = [ Λ_aa  Λ_ab ]
    [ Σ_ba  Σ_bb ]            [ Λ_ba  Λ_bb ]

Λ_aa = (Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba)^{−1}
Λ_ab = −(Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba)^{−1} Σ_ab Σ_bb^{−1}

Conditional distribution:

p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^{−1}),   where µ_{a|b} = µ_a − Λ_aa^{−1} Λ_ab (x_b − µ_b)
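A sketch of this conditional for an assumed 2-D example, using the precision-matrix form above:

```python
import numpy as np

# Assumed 2-D joint Gaussian over (x_a, x_b).
mu = np.array([0.5, 0.5])
Sigma = np.array([[0.04, 0.03],
                  [0.03, 0.09]])
Lambda = np.linalg.inv(Sigma)   # precision matrix

xb = 0.7                        # observed value of x_b

Laa, Lab = Lambda[0, 0], Lambda[0, 1]
mu_a_given_b = mu[0] - (1.0 / Laa) * Lab * (xb - mu[1])  # conditional mean
var_a_given_b = 1.0 / Laa                                # Lambda_aa^{-1}

print(mu_a_given_b, var_a_given_b)
```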
Partitioned Gaussians
[Figure: contours of a Gaussian distribution p(x_a, x_b) over two variables x_a and x_b (left), and the marginal distribution p(x_a) together with the conditional distribution p(x_a | x_b = 0.7) (right).]
Decision Theory - Key Ideas
Two classes C_1 and C_2.
Joint distribution p(x, C_k); using Bayes' theorem

p(C_k | x) = p(x | C_k) p(C_k) / p(x)

Example: cancer treatment (K = 2 classes)
  data x: an X-ray image
  C_1: patient has cancer (C_2: patient has no cancer)
  p(C_1) is the prior probability of a person having cancer
  p(C_1 | x) is the posterior probability of a person having cancer after the X-ray image x has been observed
Decision Theory - Key Ideas
Need a rule which assigns each value of the input x to one of the available classes.
The input space is partitioned into decision regions R_k.
This leads to decision boundaries or decision surfaces.
Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
Decision Theory - Key Ideas
Probability of a mistake:

p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx

Goal: minimise p(mistake).

[Plot: joint density p(x, C_1) against x, with the boundary points x_0 and x̂ marked.]
Decision Theory - Key Ideas
Multiple classes: instead of minimising the probability of mistakes, maximise the probability of correct classification

p(correct) = Σ_{k=1}^{K} p(x ∈ R_k, C_k)
           = Σ_{k=1}^{K} ∫_{R_k} p(x, C_k) dx
Minimising the Expected Loss
Not all mistakes are equally costly.
Weight the misclassification of x into the wrong class C_j, instead of its correct class C_k, by a factor L_kj.
The expected loss is now

E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx
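For a single input x, minimising this expected loss amounts to assigning x to the class j that minimises Σ_k L_kj p(C_k | x). A small sketch with an assumed loss matrix and assumed posteriors:

```python
import numpy as np

# Assumed loss matrix L[k, j]: cost of deciding class j when the truth is class k.
# Here missing class 0 (e.g. "cancer") is 100x worse than the opposite error.
L = np.array([[0, 100],
              [1, 0]])

# Assumed posterior probabilities p(C_k | x) for one input x.
posterior = np.array([0.3, 0.7])

# Expected loss of each possible decision j: sum_k L[k, j] * p(C_k | x)
expected_loss = posterior @ L
decision = np.argmin(expected_loss)
print(expected_loss, "-> choose class", decision)  # [0.7, 30.0] -> class 0
```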
The Reject Region
Avoid making automated decisions on difficult cases.
Difficult cases:
  all posterior probabilities p(C_k | x) are considerably smaller than 1,
  equivalently, the joint distributions p(x, C_k) have comparable values.

[Plot: posteriors p(C_1 | x) and p(C_2 | x) against x, with a threshold θ between 0.0 and 1.0 defining the reject region.]
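A minimal sketch of the reject rule under an assumed threshold θ: classify only when the largest posterior exceeds θ, otherwise reject.

```python
import numpy as np

theta = 0.8  # assumed reject threshold

def decide(posterior, theta):
    """Return the class index, or None to reject the input."""
    k = int(np.argmax(posterior))
    return k if posterior[k] >= theta else None

print(decide(np.array([0.95, 0.05]), theta))  # confident -> class 0
print(decide(np.array([0.55, 0.45]), theta))  # ambiguous -> None (reject)
```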
S-fold Cross Validation
Given a set of N data items and targets.
Goal: find the best model (type of model, number of parameters such as the order p of the polynomial or the regularisation constant λ). Avoid overfitting.
Solution: train a machine learning algorithm with some of the data, and evaluate it with the rest.

If we have plenty of data:
1. Train a range of models, or a single model with a range of parameter settings.
2. Compare the performance on an independent data set (the validation set) and choose the model with the best predictive performance.
3. Still, overfitting to the validation set can occur. Therefore, keep aside a third, independent test set for the final evaluation of the selected model.
S-fold Cross Validation
With few data there is a dilemma: few training data or few test data.
The solution is cross-validation.
Use a portion of (S − 1)/S of the available data for training, but use all of the data to assess the performance.
S-fold Cross Validation
Partition the data into S groups.
Use S − 1 groups to train a set of models that are then evaluated on the remaining group.
Repeat for all S choices of the held-out group, and average the performance scores from the S runs.

[Diagram: S = 4 cross-validation runs (run 1 to run 4), each holding out a different group for evaluation.]
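A sketch of S-fold cross-validation with NumPy (assumed S = 4, with a mean-squared-error score for the least-squares line fit from earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 20, 4
x = np.linspace(0, 10, N)
t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)

indices = rng.permutation(N)
folds = np.array_split(indices, S)  # partition data into S groups

scores = []
for s in range(S):
    test = folds[s]
    train = np.concatenate([folds[i] for i in range(S) if i != s])

    # Fit the linear model on the S-1 training groups (normal equations, as before).
    X = np.column_stack([x[train], np.ones(len(train))])
    w = np.linalg.solve(X.T @ X, X.T @ t[train])

    # Evaluate on the held-out group.
    pred = w[0] * x[test] + w[1]
    scores.append(np.mean((pred - t[test]) ** 2))

print(np.mean(scores))  # average performance over the S runs
```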