
Academic year: 2021


(1)

ECE 271A

Statistical Learning I: Bayesian parameter estimation

Nuno Vasconcelos, ECE Department, UCSD

(2)

Bayesian estimation

last class we considered the Gaussian problem

$$P_{X|\mu}(x|\mu) = G(x, \mu, \sigma^2), \quad \sigma^2 \text{ known}$$

$$P_{\mu}(\mu) = G(\mu, \mu_0, \sigma_0^2)$$

and showed that

$$P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$$

$$P_{X|T}(x|D) = G(x, \mu_n, \sigma^2 + \sigma_n^2)$$

with

$$\mu_n = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\,\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\,\mu_0$$

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$

this is a good example of various properties that are typical of Bayesian parameter estimates

(3)

Properties

regularization: if $\sigma_0^2 = \sigma^2$ then

$$\mu_n = \frac{n}{n+1}\,\mu_{ML} + \frac{1}{n+1}\,\mu_0 = \frac{1}{n+1}\sum_{i=1}^{n+1} X_i, \quad \text{with } X_{n+1} = \mu_0$$

Bayes is equal to ML on a virtual sample with extra points

in this case, one additional point equal to the mean of the prior

for large n, extra point is irrelevant

for small n, it regularizes the Bayes estimate by

• directing the posterior mean towards the prior mean

• reducing the variance of the posterior

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
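The "virtual sample" reading of the Bayes estimate can be checked numerically. The sketch below is only an illustration: the function names and the sample values are made up for this example. It computes the posterior mean and variance from the formulas above and compares the mean with the plain average of the sample extended by one extra point equal to the prior mean (the case σ0² = σ²).

```python
def posterior_params(xs, mu0, sigma0_sq, sigma_sq):
    """Posterior mean and variance of a Gaussian mean, likelihood variance known."""
    n = len(xs)
    mu_ml = sum(xs) / n  # ML estimate: the sample mean
    denom = sigma_sq + n * sigma0_sq
    mu_n = (n * sigma0_sq / denom) * mu_ml + (sigma_sq / denom) * mu0
    var_n = 1.0 / (n / sigma_sq + 1.0 / sigma0_sq)
    return mu_n, var_n


def virtual_sample_mean(xs, mu0):
    """ML estimate on the sample extended with one point equal to the prior mean."""
    extended = list(xs) + [mu0]
    return sum(extended) / len(extended)


# Illustrative data (assumed values, not from the lecture)
xs = [2.1, 1.7, 2.6]
mu_n, var_n = posterior_params(xs, mu0=0.0, sigma0_sq=1.0, sigma_sq=1.0)
mu_virtual = virtual_sample_mean(xs, mu0=0.0)
print(mu_n, mu_virtual, var_n)
```

With equal prior and likelihood variances the two means coincide, and the posterior variance is σ²/(n+1), smaller than the σ²/n obtained without the prior.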

(4)

Conjugate priors

note that

• the prior $P_{\mu}(\mu) = G(\mu, \mu_0, \sigma_0^2)$ is Gaussian

• the posterior $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$ is Gaussian

whenever this is the case (posterior in the same family as prior) we say that

• $P_{\mu}(\mu)$ is a conjugate prior for the likelihood $P_{X|\mu}(x|\mu)$

• the posterior $P_{\mu|T}(\mu|D)$ is the reproducing density

HW: a number of likelihoods have conjugate priors

Likelihood            Conjugate prior

Bernoulli             Beta

Poisson               Gamma

Exponential           Gamma

Normal (known σ²)     Normal (on the mean µ)
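As a concrete instance of the Bernoulli/Beta pair, the following sketch checks conjugacy numerically. It assumes the standard update rule Beta(a, b) → Beta(a + s, b + n − s) after observing s ones in n tosses. The helper names and the numbers are illustrative only. If the rule holds, prior times likelihood divided by the claimed posterior must be the same constant (the evidence) at every θ.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


def bernoulli_likelihood(theta, s, n):
    """Likelihood of s ones in n iid Bernoulli(theta) tosses."""
    return theta ** s * (1 - theta) ** (n - s)


def beta_posterior(a, b, s, n):
    """Conjugate update: the counts are added to the prior hyper-parameters."""
    return a + s, b + (n - s)


# Illustrative prior and data (assumed values)
a, b, s, n = 2.0, 3.0, 4, 10
a_post, b_post = beta_posterior(a, b, s, n)

# prior * likelihood / posterior should not depend on theta
ratios = [
    beta_pdf(t, a, b) * bernoulli_likelihood(t, s, n) / beta_pdf(t, a_post, b_post)
    for t in (0.2, 0.5, 0.8)
]
print(a_post, b_post, ratios)
```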

(5)

Priors

potential problem of the Bayesian framework

“I don’t really have a strong belief about what the most likely parameter configuration is”

in these cases it is usual to adopt a non-informative prior

the most obvious choice is the uniform distribution

$$P_{\Theta}(\theta) = \alpha$$

there are, however, problems with this choice

• if θ is unbounded this is an improper distribution

$$\int P_{\Theta}(\theta)\, d\theta = \infty \neq 1$$

• the prior is not invariant to all reparametrizations

(6)

Example

consider Θ and a new random variable η with $\eta = e^{\Theta}$

since this is a 1-to-1 transformation it should not affect the outcome of the inference process

we check this by using the change of variables theorem

if y = f(x) then

$$P_Y(y) = \frac{1}{\left|\frac{\partial f}{\partial x}\right|_{x = f^{-1}(y)}}\, P_X\!\left(f^{-1}(y)\right)$$

in this case

$$P_{\eta}(\eta) = \frac{1}{\left|e^{\theta}\right|_{\theta = \log \eta}}\, P_{\Theta}(\log \eta) = \frac{1}{\eta}\, P_{\Theta}(\log \eta)$$
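The change-of-variables result can be checked numerically. The sketch below assumes, purely for illustration, that Θ is uniform on the bounded interval [0, 2] so that it is a proper density. It compares the formula P_η(η) = P_Θ(log η)/η with a finite-difference derivative of the CDF of η. All names and values are hypothetical.

```python
import math

# Assumed bounded support for Theta, so the uniform prior is a proper density
A, B = 0.0, 2.0


def p_theta(theta):
    """Uniform density of Theta on [A, B]."""
    return 1.0 / (B - A) if A <= theta <= B else 0.0


def p_eta(eta):
    """Density of eta = exp(Theta) from the change of variables theorem."""
    return p_theta(math.log(eta)) / eta


def cdf_eta(eta):
    """P(exp(Theta) <= eta) = P(Theta <= log eta) for the uniform Theta."""
    theta = min(max(math.log(eta), A), B)
    return (theta - A) / (B - A)


def numeric_density(eta, h=1e-6):
    """Central finite difference of the CDF, an independent check of p_eta."""
    return (cdf_eta(eta + h) - cdf_eta(eta - h)) / (2 * h)


for eta in (1.5, 3.0, 6.0):
    print(eta, p_eta(eta), numeric_density(eta))
```

The two columns agree and decay as 1/η, so a prior that is flat in Θ is not flat in η.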

(7)

Invariant non-informative priors

for uniform Θ this means that $P_{\eta}(\eta) \propto \frac{1}{\eta}$, i.e. not constant

this means that

• there is no consistency between Θ and η

• a 1-to-1 transformation changes the non-informative prior into an informative one

to avoid this problem the non-informative prior has to be invariant

e.g. consider a location parameter:

• a parameter that simply shifts the density

• e.g. the mean of a Gaussian

a non-informative prior for a location parameter has to be invariant to shifts, i.e. the transformation Y = µ + c

(8)

Location parameters

in this case

$$P_Y(y) = \frac{1}{\left|\frac{\partial (\mu + c)}{\partial \mu}\right|_{\mu = y - c}}\, P_{\mu}(y - c) = P_{\mu}(y - c)$$

and, since this has to be valid for all c (invariance means that Y and µ have the same density),

$$P_Y(y) = P_{\mu}(y)$$

hence

$$P_{\mu}(y - c) = P_{\mu}(y)$$

which is valid for all c if and only if $P_{\mu}(\mu)$ is uniform

non-informative prior for location is $P_{\mu}(\mu) \propto 1$

(9)

Scale parameters

a scale parameter is one that controls the scale of the density

$$\sigma^{-1} f\!\left(\frac{x}{\sigma}\right)$$

e.g. the standard deviation σ of a Gaussian distribution

it can be shown that, in this case, the non-informative prior invariant to scale transformations is

$$P_{\sigma}(\sigma) = \frac{1}{\sigma}$$

note that, as for location, this is an improper prior

(10)

Selecting priors

non-informative priors are the end of the spectrum where we don’t know what parameter values to favor

at the other end, i.e. when we are absolutely sure, the prior becomes a delta function

$$P_{\Theta}(\theta) = \delta(\theta - \theta_0)$$

in this case

$$P_{\Theta|T}(\theta|D) \propto P_{T|\Theta}(D|\theta)\, \delta(\theta - \theta_0)$$

which, after normalization, is again $\delta(\theta - \theta_0)$, and the predictive distribution is

$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta = \int P_{X|\Theta}(x|\theta)\, \delta(\theta - \theta_0)\, d\theta = P_{X|\Theta}(x|\theta_0)$$

this is identical to ML if θ0 = θML

(11)

Selecting priors

hence,

• ML is a special case of the Bayesian formulation,

where we are absolutely confident that the ML estimate is the correct value for the parameter

but we could use other values for θ0. For example, the value that maximizes the posterior

$$\theta_{MAP} = \arg\max_{\theta} P_{\Theta|T}(\theta|D) = \arg\max_{\theta} P_{T|\Theta}(D|\theta)\, P_{\Theta}(\theta)$$

this is called the MAP estimate and makes the predictive distribution equal to

$$P_{X|T}(x|D) = P_{X|\Theta}(x|\theta_{MAP})$$

it can be useful when the true predictive distribution has no closed-form solution

(12)

Selecting priors

the natural question is then

• “what if I don’t get the prior right?”; “can I do terribly bad?”

“how robust is the Bayesian solution to the choice of prior?”

let’s see how much the solution changes between the two extremes

for the Gaussian problem

• absolute certainty priors: $P_{\mu}(\mu) = \delta(\mu - \mu_p)$

• MAP estimate: since we have $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$

$$\mu_p = \mu_n = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\,\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\,\mu_0$$

• ML estimate is $\mu_p = \mu_{ML}$

we have seen already that these are similar unless the sample is small (MAP = ML on sample with extra point)

(13)

Selecting priors

for the Gaussian problem

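• non-informative prior: in this case it is $P_{\mu}(\mu) \propto 1$ or

$$P_{\mu}(\mu) = \lim_{\sigma_0^2 \to \infty} G(\mu, \mu_0, \sigma_0^2)$$

from which

$$\mu_n = \lim_{\sigma_0^2 \to \infty} \left( \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\,\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\,\mu_0 \right) = \mu_{ML}$$

$$\frac{1}{\sigma_n^2} = \lim_{\sigma_0^2 \to \infty} \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) = \frac{n}{\sigma^2}, \quad \text{i.e. } \sigma_n^2 = \frac{\sigma^2}{n}$$

and

$$P_{X|T}(x|D) = G(x, \mu_n, \sigma^2 + \sigma_n^2) = G\!\left(x, \mu_{ML}, \sigma^2\left(1 + \frac{1}{n}\right)\right)$$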
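The limit can also be seen numerically by letting the prior variance grow. The sketch below uses the posterior formulas for the Gaussian problem. The function name and the numbers (a prior mean deliberately far from the ML estimate) are assumptions chosen only to make the effect visible.

```python
def posterior_params(mu_ml, n, mu0, sigma0_sq, sigma_sq):
    """Posterior mean and variance of the Gaussian mean (likelihood variance known)."""
    denom = sigma_sq + n * sigma0_sq
    mu_n = (n * sigma0_sq / denom) * mu_ml + (sigma_sq / denom) * mu0
    var_n = 1.0 / (n / sigma_sq + 1.0 / sigma0_sq)
    return mu_n, var_n


# Illustrative values (assumed): the prior mean is far from the ML estimate
mu_ml, n, mu0, sigma_sq = 1.5, 5, -10.0, 4.0

for sigma0_sq in (1.0, 1e2, 1e4, 1e8):
    mu_n, var_n = posterior_params(mu_ml, n, mu0, sigma0_sq, sigma_sq)
    predictive_var = sigma_sq + var_n  # variance of P(x | D)
    print(sigma0_sq, mu_n, var_n, predictive_var)

# A very flat prior: the result should match mu_ML and sigma^2 / n
mu_flat, var_flat = posterior_params(mu_ml, n, mu0, 1e12, sigma_sq)
```

As σ0² grows, the posterior mean moves to µ_ML and the predictive variance approaches σ²(1 + 1/n).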

(14)

Selecting priors

in summary, for the two prior extremes

• delta prior centered on MAP:

$$P_{X|T}(x|D) = G(x, \mu_{MAP}, \sigma^2), \quad \mu_{MAP} = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\,\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\,\mu_0$$

• delta prior centered on ML:

$$P_{X|T}(x|D) = G(x, \mu_{ML}, \sigma^2)$$

• non-informative prior

$$P_{X|T}(x|D) = G\!\left(x, \mu_{ML}, \sigma^2\left(1 + \frac{1}{n}\right)\right)$$

all Gaussian, “qualitatively the same”:

• somewhat different parameters for small n; equal for large n

this indicates robustness to “incorrect” priors!

(15)

Selecting priors

another example, problem 3.5.17 DHS (HW prob 3)

• multivariate Bernoulli (d independent Bernoulli variables)

since Bernoulli is

$$P_{X|\Theta}(x|\theta) = \theta^{x}(1-\theta)^{1-x} = \begin{cases} \theta, & x = 1 \\ 1 - \theta, & x = 0 \end{cases}$$

multivariate likelihood is:

$$P_{X|\Theta}(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i}(1-\theta_i)^{1-x_i}$$

in (a) you show that if D = {x(1), ..., x(n)} is a set of n iid samples, then

$$P_{T|\Theta}(D|\theta) = \prod_{i=1}^{d} \theta_i^{s_i}(1-\theta_i)^{n-s_i}, \quad s_i = \sum_{j=1}^{n} x_i^{(j)}$$

i

(16)

Selecting priors

another example, problem 3.5.17 DHS (HW prob 3)

in (b) you then show that if Θ is uniform (non-informative) the predictive distribution is

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i + 1}{n + 2}\right)^{x_i} \left(1 - \frac{s_i + 1}{n + 2}\right)^{1 - x_i}$$

in (d) you show that comparing with

$$P_{X|\Theta}(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i}(1-\theta_i)^{1-x_i}$$

this can be interpreted as:

under Bayes, with a uniform prior, the predicted distribution is the same as the likelihood, with the parameter estimate

$$\hat{\theta}_i = \frac{s_i + 1}{n + 2}$$

(17)

Selecting priors

let’s now consider the extreme of

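• ML: we know that

$$P_{\Theta}(\theta) = \delta(\theta - \hat{\theta}), \quad \hat{\theta}_i = \frac{s_i}{n}$$

and

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i}{n}\right)^{x_i} \left(1 - \frac{s_i}{n}\right)^{1 - x_i}$$

this can be interpreted as:

the predicted distribution is the same as the likelihood, with the parameter estimate

$$\hat{\theta}_i = \frac{s_i}{n}$$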

(18)

Selecting priors

• MAP: given prior

$$P_{\Theta}(\theta) = \prod_{i} P_{\Theta_i}(\theta_i)$$

the estimate is

$$\hat{\theta} = \arg\max_{\theta} \left\{ \log P_{T|\Theta}(D|\theta) + \log P_{\Theta}(\theta) \right\}$$

and since

$$P_{T|\Theta}(D|\theta) = \prod_{i=1}^{d} \theta_i^{s_i}(1-\theta_i)^{n-s_i}, \quad s_i = \sum_{j=1}^{n} x_i^{(j)}$$

this is

$$\hat{\theta}_i = \arg\max_{\theta_i} \left\{ s_i \log \theta_i + (n - s_i)\log(1 - \theta_i) + \log P_{\Theta_i}(\theta_i) \right\}$$

i.e. the solution of

$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1 - \theta_i} + \frac{1}{P_{\Theta_i}(\theta_i)} \frac{\partial P_{\Theta_i}(\theta_i)}{\partial \theta_i} = 0$$

let’s consider some specific priors

(19)

Selecting priors

• prior that favors “1”s

$$P_{\Theta_i}(\theta_i) = 2\theta_i$$

• MAP solution:

$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1 - \theta_i} + \frac{1}{\theta_i} = 0 \quad \Rightarrow \quad \hat{\theta}_i = \frac{s_i + 1}{n + 1}$$

and

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i + 1}{n + 1}\right)^{x_i} \left(1 - \frac{s_i + 1}{n + 1}\right)^{1 - x_i}$$

this can be interpreted as:

the predicted distribution is the same as the likelihood, with the parameter estimate

$$\hat{\theta}_i = \frac{s_i + 1}{n + 1}$$

(20)

Selecting priors

• prior that favors “0”s

$$P_{\Theta_i}(\theta_i) = 2(1 - \theta_i)$$

• MAP solution:

$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1 - \theta_i} - \frac{1}{1 - \theta_i} = 0 \quad \Rightarrow \quad \hat{\theta}_i = \frac{s_i}{n + 1}$$

and

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i}{n + 1}\right)^{x_i} \left(1 - \frac{s_i}{n + 1}\right)^{1 - x_i}$$

this can be interpreted as:

the predicted distribution is the same as the likelihood, with the parameter estimate

$$\hat{\theta}_i = \frac{s_i}{n + 1}$$

(21)

Selecting priors

in summary

• all cases are of the form

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \hat{\theta}_i^{x_i}\left(1 - \hat{\theta}_i\right)^{1 - x_i}$$

with

Estimator               θ̂i                # tosses   # “1”s   interpretation

ML                      si / n            n          si

MAP non-informative     si / n            n          si       “the same”

MAP favor “1”s          (si+1) / (n+1)    n+1        si+1     “add one 1”

MAP favor “0”s          si / (n+1)        n+1        si       “add one 0”

Bayes non-informative   (si+1) / (n+2)    n+2        si+1     “add one of each”

all cases qualitatively the same: “ML estimate on an extended sample with extra points that reflect the bias of the prior”.
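The closed-form estimates in the summary can be cross-checked against a brute-force maximization of the log-posterior. This is only a sketch: the grid search, the function names and the counts (s = 3 ones in n = 9 tosses) are illustrative choices, not part of the homework problem.

```python
import math


def closed_form(s, n):
    """Estimates of theta_i for s ones in n tosses, one per row of the summary."""
    return {
        "ml": s / n,
        "map_uniform": s / n,
        "map_favor_ones": (s + 1) / (n + 1),
        "map_favor_zeros": s / (n + 1),
        "bayes_uniform": (s + 1) / (n + 2),
    }


def map_by_grid(s, n, log_prior, steps=10000):
    """Maximize s*log(t) + (n-s)*log(1-t) + log_prior(t) over a grid on (0, 1)."""
    best_theta, best_value = None, -math.inf
    for k in range(1, steps):
        theta = k / steps
        value = s * math.log(theta) + (n - s) * math.log(1 - theta) + log_prior(theta)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta


s, n = 3, 9  # assumed counts for illustration
table = closed_form(s, n)
grid_uniform = map_by_grid(s, n, lambda t: 0.0)                 # P(theta) = 1
grid_ones = map_by_grid(s, n, lambda t: math.log(2 * t))         # P(theta) = 2*theta
grid_zeros = map_by_grid(s, n, lambda t: math.log(2 * (1 - t)))  # P(theta) = 2*(1-theta)
print(table, grid_uniform, grid_ones, grid_zeros)
```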

(22)

Regularization

these are all examples of regularization

Q: what is the point of “adding one of each” by Bayes non-informative?

the main problem of ML (si / n) is the “empty bin” problem

• for small n, si is likely to be zero independently of the value of θi

• this can lead to all sorts of problems, e.g. a likelihood ratio that goes to infinity

by adding “one of each” Bayes eliminates this problem

for richly populated bins it makes no difference, but it matters for empty bins

note that this is consistent with the non-informative prior

• empty bins are as likely as any other value

• if we see a lot of them, we need to correct this
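A small numerical illustration of the empty-bin problem follows. It assumes a hypothetical two-class setting in which the same binary feature is observed n times per class, and compares the likelihood ratio for observing a “1” under the ML estimate si / n and under the Bayes non-informative estimate (si + 1) / (n + 2). The counts are made up.

```python
import math


def ml_estimate(s, n):
    """ML estimate of the Bernoulli parameter."""
    return s / n


def bayes_uniform_estimate(s, n):
    """Bayes estimate under a uniform prior: add one "1" and one "0"."""
    return (s + 1) / (n + 2)


def likelihood_ratio(theta_a, theta_b):
    """Ratio of the probabilities of observing a "1" under classes a and b."""
    return math.inf if theta_b == 0 else theta_a / theta_b


# Assumed counts: class b never produced a "1" in its small sample
n = 5
s_a, s_b = 2, 0

ratio_ml = likelihood_ratio(ml_estimate(s_a, n), ml_estimate(s_b, n))
ratio_bayes = likelihood_ratio(bayes_uniform_estimate(s_a, n), bayes_uniform_estimate(s_b, n))
print(ratio_ml, ratio_bayes)
```

The ML ratio is infinite because of the empty bin, while the regularized ratio stays finite.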

(23)

Regularization

“empty bin” problem

“why should I care?” this is unlikely if I have a large sample

remember that “large” is always relative

10 bins in 1D transforms into 100 in 2D, 1000 in 3D, and 10^d in a d-dimensional space

• when d is large, we are always in the “small sample” regime

• regularization usually makes a tremendous difference

example:

• histogram estimates in high-dimensional spaces

e.g. histogram of English words for indexing web-pages

for each page, compute histogram C = (c1, ..., cw) where ci is the # of times the i-th word appeared in the page

(24)

Regularization

histogram similarity:

natural measure is the Kullback-Leibler divergence

$$d(C^i, C^j) = \sum_{k=1}^{w} p_k^i \log\!\left(\frac{p_k^i}{p_k^j}\right)$$

where the probabilities are the counts after normalization

$$p_k^i = \frac{c_k^i}{\sum_{k} c_k^i}$$

problem: log goes to infinity when $p_k^j = 0$!

for low-frequency words the noisy estimates are amplified by the ratio of probabilities

• the distance measure has a large variance
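The effect of an empty bin on the Kullback-Leibler divergence can be seen on a toy example. The sketch below is illustrative only: the two count vectors are invented, and the regularization shown (one pseudo-count per bin before normalizing) is an assumption borrowed from the “add one of each” idea of the earlier slides.

```python
import math


def normalize(counts, pseudo=0.0):
    """Turn counts into probabilities, optionally adding a pseudo-count per bin."""
    total = sum(counts) + pseudo * len(counts)
    return [(c + pseudo) / total for c in counts]


def kl_divergence(p, q):
    """KL divergence sum_k p_k log(p_k / q_k), with the convention 0 log 0 = 0."""
    d = 0.0
    for pk, qk in zip(p, q):
        if pk == 0.0:
            continue  # a zero-probability term contributes nothing
        if qk == 0.0:
            return math.inf  # p_k > 0 but q_k = 0: the log ratio blows up
        d += pk * math.log(pk / qk)
    return d


# Hypothetical word counts for two pages; the third word never appears in page j
c_i = [5, 3, 2, 0]
c_j = [6, 4, 0, 0]

raw = kl_divergence(normalize(c_i), normalize(c_j))
smoothed = kl_divergence(normalize(c_i, pseudo=1.0), normalize(c_j, pseudo=1.0))
print(raw, smoothed)
```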

(25)

Regularization

Prob 3 on HW

the count vector C is distributed according to a multinomial distribution

$$P_{C|\Pi}(c_1, \ldots, c_w \,|\, \pi_1, \ldots, \pi_w) = \frac{n!}{\prod_{k=1}^{w} c_k!} \prod_{j=1}^{w} \pi_j^{c_j}, \quad n = \sum_{k=1}^{w} c_k$$

where πj is the probability of word j.

since the πj are probabilities, we can’t use just any prior here.

distribution over vectors π = (π1, ..., πw) must satisfy the constraints of a probability mass function

$$\pi_j > 0, \quad \sum_{j=1}^{w} \pi_j = 1$$

(26)

Regularization

Prob 3 on HW

one such distribution is the Dirichlet distribution

$$P_{\Pi}(\pi_1, \ldots, \pi_w) = \frac{\Gamma\!\left(\sum_{j=1}^{w} u_j\right)}{\prod_{j=1}^{w} \Gamma(u_j)} \prod_{j=1}^{w} \pi_j^{u_j - 1}$$

• uj are hyper-parameters

• Γ(.) is the gamma function

(27)

Regularization

Prob 3 on HW

on HW you will show that the posterior is

$$P_{\Pi|C}(\pi|c) = \frac{\Gamma\!\left(\sum_{j=1}^{w} (c_j + u_j)\right)}{\prod_{j=1}^{w} \Gamma(c_j + u_j)} \prod_{j=1}^{w} \pi_j^{c_j + u_j - 1}$$

i.e. Dirichlet of hyper-parameters cj + uj

• the prior parameters can be seen as additional counts that regularize the predictive distribution!
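The “additional counts” reading can be made concrete with a short sketch. It assumes the standard fact that a Dirichlet with hyper-parameters αj has mean αj / Σk αk, and uses that posterior mean as the regularized estimate of the word probabilities. The counts and the choice uj = 1 for every word are illustrative.

```python
def dirichlet_posterior(u, c):
    """Posterior hyper-parameters: prior hyper-parameters plus observed counts."""
    return [uj + cj for uj, cj in zip(u, c)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with hyper-parameters alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Assumed prior (u_j = 1 for all words) and hypothetical word counts with an empty bin
u = [1.0, 1.0, 1.0, 1.0]
c = [5, 3, 2, 0]

alpha_post = dirichlet_posterior(u, c)
pi_hat = dirichlet_mean(alpha_post)          # regularized estimate
pi_ml = [cj / sum(c) for cj in c]            # raw normalized counts
print(alpha_post, pi_hat, pi_ml)
```

The word that was never observed gets probability 1/14 instead of 0, so log ratios involving it stay finite.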
