ECE 271A

Statistical Learning I:

Bayesian parameter estimation

Nuno Vasconcelos, ECE Department, UCSD

(2)

Bayesian estimation

last class we considered the Gaussian problem

$$P_{X|\mu}(x|\mu) = G(x, \mu, \sigma^2), \quad \sigma^2 \text{ known}$$

$$P_\mu(\mu) = G(\mu, \mu_0, \sigma_0^2)$$

and showed that

$$P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$$

$$P_{X|T}(x|D) = G(x, \mu_n, \sigma^2 + \sigma_n^2)$$

with

$$\mu_n = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\hat{\mu}_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\mu_0$$

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$

good example of various properties that are typical of Bayesian parameter estimates

(3)

Properties

regularization:

• if σ₀² = σ² then

$$\mu_n = \frac{n}{n+1}\hat{\mu}_{ML} + \frac{1}{n+1}\mu_0 = \frac{1}{n+1}\sum_{i=1}^{n+1} X_i, \quad \text{with } X_{n+1} = \mu_0$$

Bayes is equal to ML on a virtual sample with extra points

• in this case, one additional point equal to the mean of the prior

• for large n, extra point is irrelevant

• for small n, it regularizes the Bayes estimate by

• directing the posterior mean towards the prior mean

• reducing the variance of the posterior

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
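The "virtual sample" reading of the posterior mean can be checked numerically. The sketch below is only an illustration: the function name `posterior_params` and the sample values are made up, and it assumes the Gaussian setting with known variance described above.

```python
def posterior_params(xs, sigma2, mu0, sigma02):
    """Posterior mean and variance of a Gaussian mean (known variance sigma2,
    Gaussian prior with mean mu0 and variance sigma02)."""
    n = len(xs)
    mu_ml = sum(xs) / n
    mu_n = (n * sigma02 * mu_ml + sigma2 * mu0) / (sigma2 + n * sigma02)
    var_n = 1.0 / (n / sigma2 + 1.0 / sigma02)
    return mu_n, var_n

xs = [2.1, 1.7, 2.6]      # hypothetical sample
mu0, sigma2 = 0.0, 1.0    # hypothetical prior mean and known variance

# With sigma0^2 = sigma^2 the posterior mean is the ML estimate
# computed on the sample extended with one extra point equal to mu0.
mu_n, var_n = posterior_params(xs, sigma2, mu0, sigma02=sigma2)
virtual_ml = sum(xs + [mu0]) / (len(xs) + 1)

print(mu_n, virtual_ml, var_n)
```

The posterior variance comes out smaller than σ²/n, the value it takes when the prior carries no information, which is the variance reduction mentioned in the last bullet.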

(4)

Conjugate priors

note that

• the prior $P_\mu(\mu) = G(\mu, \mu_0, \sigma_0^2)$ is Gaussian

• the posterior $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$ is Gaussian

whenever this is the case (posterior in the same family as prior) we say that

• $P_\mu(\mu)$ is a conjugate prior for the likelihood $P_{X|\mu}(x|\mu)$

• the posterior $P_{\mu|T}(\mu|D)$ is the reproducing density

HW: a number of likelihoods have conjugate priors

| Likelihood | Conjugate prior |
| Bernoulli | Beta |
| Poisson | Gamma |
| Exponential | Gamma |
| Normal (known σ², unknown mean) | Normal |
| Normal (known mean, unknown precision 1/σ²) | Gamma |
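As a small illustration of conjugacy, the sketch below updates a Beta prior with Bernoulli observations. The helper names `beta_posterior` and `beta_mean` and the data are hypothetical; the only fact used is the standard Beta-Bernoulli update, in which the posterior is again a Beta whose parameters are the prior parameters plus the observed counts.

```python
def beta_posterior(a, b, data):
    """Update a Beta(a, b) prior with a list of Bernoulli observations (0 or 1)."""
    s = sum(data)                      # number of "1"s
    return a + s, b + len(data) - s    # posterior is Beta(a + s, b + n - s)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

data = [1, 0, 1, 1, 0]                 # hypothetical coin tosses

# Batch update starting from the uniform prior Beta(1, 1)
a_post, b_post = beta_posterior(1.0, 1.0, data)

# Updating one observation at a time stays in the same family
# and ends at the same posterior.
a_seq, b_seq = 1.0, 1.0
for x in data:
    a_seq, b_seq = beta_posterior(a_seq, b_seq, [x])

print((a_post, b_post), beta_mean(a_post, b_post))
```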

(5)

Priors

potential problem of the Bayesian framework

• “I don’t really have a strong belief about what the most likely parameter configuration is”

in these cases it is usual to adopt a non-informative prior

the most obvious choice is the uniform distribution

$$P_\Theta(\theta) = \alpha$$

there are, however, problems with this choice

• if θ is unbounded this is an improper distribution

$$\int_{-\infty}^{\infty} P_\Theta(\theta)\,d\theta = \infty \neq 1$$

• the prior is not invariant to all reparametrizations

(6)

Example

consider Θ and a new random variable η with η = e^Θ

since this is a 1-to-1 transformation it should not affect the outcome of the inference process

we check this by using the change of variables theorem

• if y = f(x) then

$$P_Y(y) = \frac{1}{\left|\frac{\partial f}{\partial x}\right|_{x = f^{-1}(y)}} P_X\left(f^{-1}(y)\right)$$

in this case

$$P_\eta(\eta) = \frac{1}{\left|\frac{\partial e^\theta}{\partial \theta}\right|_{\theta = \log\eta}} P_\Theta(\log\eta) = \frac{1}{\eta} P_\Theta(\log\eta)$$

(7)

Invariant non-informative priors

for uniform Θ this means that $P_\eta(\eta) \propto \frac{1}{\eta}$, i.e. not constant

this means that

• there is no consistency between Θ and η

• a 1-to-1 transformation changes the non-informative prior into an informative one

to avoid this problem the non-informative prior has to be invariant

e.g. consider a location parameter:

• a parameter that simply shifts the density

• e.g. the mean of a Gaussian

a non-informative prior for a location parameter has to be invariant to shifts, i.e. the transformation Y = µ + c
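The effect of the reparametrization η = e^Θ on a uniform prior can be checked by simulation. The sketch below is only an illustration under assumptions that are not in the slides: Θ is taken uniform on the bounded range [0, 2] so that the prior is proper, and the sample size and seed are arbitrary.

```python
import math
import random

random.seed(0)
N = 50000

# Theta uniform on [0, 2] (assumed range), eta = exp(Theta) lies in [1, e^2]
etas = [math.exp(random.uniform(0.0, 2.0)) for _ in range(N)]

t = math.e  # point at which the CDF of eta is evaluated
empirical = sum(e <= t for e in etas) / N

# CDF predicted by a density proportional to 1/eta on [1, e^2]
cdf_if_density_1_over_eta = math.log(t) / 2.0
# CDF that a uniform density on [1, e^2] would give instead
cdf_if_eta_uniform = (t - 1.0) / (math.exp(2.0) - 1.0)

print(empirical, cdf_if_density_1_over_eta, cdf_if_eta_uniform)
```

The empirical value should sit close to the 1/η prediction and far from the uniform one, which is the inconsistency between Θ and η described above.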

(8)

Location parameters

in this case

$$P_Y(y) = \frac{1}{\left|\frac{\partial(\mu + c)}{\partial\mu}\right|_{\mu = y - c}} P_\mu(y - c) = P_\mu(y - c)$$

and, since this has to be valid for all c,

$$P_Y(y) = P_\mu(y)$$

hence

$$P_\mu(y - c) = P_\mu(y)$$

which is valid for all c if and only if P_µ(µ) is uniform

non-informative prior for location is $P_\mu(\mu) \propto 1$

(9)

Scale parameters

a scale parameter is one that controls the scale of the density

$$\sigma^{-1} f\left(\frac{x}{\sigma}\right)$$

e.g. the standard deviation σ of a Gaussian distribution

it can be shown that, in this case, the non-informative prior invariant to scale transformations is

$$P_\sigma(\sigma) = \frac{1}{\sigma}$$

note that, as for location, this is an improper prior
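One way to see the scale invariance of the 1/σ prior is that it assigns the same mass to an interval [a, b] and to its rescaled version [ca, cb]. The sketch below checks this with a midpoint-rule integral; the helper `mass`, the intervals and the scale factor 10 are illustrative choices, not part of the slides.

```python
import math

def mass(prior, a, b, steps=100000):
    """Midpoint-rule approximation of the integral of prior over [a, b]."""
    h = (b - a) / steps
    return sum(prior(a + (k + 0.5) * h) for k in range(steps)) * h

def scale_prior(s):
    return 1.0 / s      # non-informative prior for a scale parameter

def flat_prior(s):
    return 1.0          # uniform prior, shown for comparison

# Mass on [1, 2] versus mass on the rescaled interval [10, 20]
m_scale = (mass(scale_prior, 1.0, 2.0), mass(scale_prior, 10.0, 20.0))
m_flat = (mass(flat_prior, 1.0, 2.0), mass(flat_prior, 10.0, 20.0))

print(m_scale, m_flat, math.log(2.0))
```

Both 1/σ masses equal log 2, while the flat prior gives ten times more mass to the rescaled interval.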

(10)

Selecting priors

non-informative priors are the end of the spectrum where we don’t know what parameter values to favor

at the other end, i.e. when we are absolutely sure, the prior becomes a delta function

$$P_\Theta(\theta) = \delta(\theta - \theta_0)$$

in this case

$$P_{\Theta|T}(\theta|D) \propto P_{T|\Theta}(D|\theta)\,\delta(\theta - \theta_0)$$

and the predictive distribution is

$$P_{X|T}(x|D) \propto \int P_{X|\Theta}(x|\theta)\,P_{T|\Theta}(D|\theta)\,\delta(\theta - \theta_0)\,d\theta \propto P_{X|\Theta}(x|\theta_0)$$

since both sides are normalized densities in x, $P_{X|T}(x|D) = P_{X|\Theta}(x|\theta_0)$

∫

this is identical to ML if θ₀ = θ_ML

(11)

Selecting priors

hence,

• ML is a special case of the Bayesian formulation,

• where we are absolutely confident that the ML estimate is the correct value for the parameter

but we could use other values for θ₀. For example the value that maximizes the posterior

$$\theta_{MAP} = \arg\max_\theta P_{\Theta|T}(\theta|D) = \arg\max_\theta P_{T|\Theta}(D|\theta)\,P_\Theta(\theta)$$

this is called the MAP estimate and makes the predictive distribution equal to

$$P_{X|T}(x|D) = P_{X|\Theta}(x|\theta_{MAP})$$

it can be useful when the true predictive distribution has no closed-form solution

(12)

Selecting priors

the natural question is then

• “what if I don’t get the prior right?”; “can I do terribly bad?”

• “how robust is the Bayesian solution to the choice of prior?”

• let’s see how much the solution changes between the two extremes

for the Gaussian problem

• absolute certainty priors: $P_\mu(\mu) = \delta(\mu - \mu_p)$

• MAP estimate: since we have $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$

$$\mu_p = \mu_n = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\mu_0$$

• ML estimate is µ_p = µ_ML

• we have seen already that these are similar unless the sample is small (MAP = ML on sample with extra point)

(13)

Selecting priors

for the Gaussian problem

• non-informative prior:

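• in this case it is $P_\mu(\mu) \propto 1$ or

$$P_\mu(\mu) = \lim_{\sigma_0^2 \to \infty} G(\mu, \mu_0, \sigma_0^2)$$

• from which

$$\mu_n = \lim_{\sigma_0^2 \to \infty}\left(\frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\mu_0\right) = \mu_{ML}$$

$$\frac{1}{\sigma_n^2} = \lim_{\sigma_0^2 \to \infty}\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right) = \frac{n}{\sigma^2} \;\Leftrightarrow\; \sigma_n^2 = \frac{\sigma^2}{n}$$

• and

$$P_{X|T}(x|D) = G(x, \mu_n, \sigma^2 + \sigma_n^2) = G\left(x, \mu_{ML}, \sigma^2\left(1 + \frac{1}{n}\right)\right)$$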

(14)

Selecting priors

in summary, for the two prior extremes

• delta prior centered on MAP:

$$P_{X|T}(x|D) = G(x, \mu_{MAP}, \sigma^2), \quad \mu_{MAP} = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\mu_0$$

• delta prior centered on ML:

$$P_{X|T}(x|D) = G(x, \mu_{ML}, \sigma^2)$$

• non-informative prior

$$P_{X|T}(x|D) = G\left(x, \mu_{ML}, \sigma^2\left(1 + \frac{1}{n}\right)\right)$$

all Gaussian, “qualitatively the same”:

• somewhat different parameters for small n; equal for large n

this indicates robustness to “incorrect” priors!
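The three predictive distributions in this summary can be compared numerically. The sketch below returns the (mean, variance) pair of each Gaussian predictive for a small and a large sample; the function name `predictive_params` and all numeric values are hypothetical.

```python
def predictive_params(mu_ml, n, sigma2, mu0, sigma02):
    """(mean, variance) of the Gaussian predictive under the three priors of the summary."""
    w = n * sigma02 / (sigma2 + n * sigma02)   # weight of the ML estimate in mu_MAP
    mu_map = w * mu_ml + (1.0 - w) * mu0
    return {
        "delta at MAP": (mu_map, sigma2),
        "delta at ML": (mu_ml, sigma2),
        "non-informative": (mu_ml, sigma2 * (1.0 + 1.0 / n)),
    }

# Same ML estimate, small versus large sample (made-up numbers)
small = predictive_params(mu_ml=2.0, n=2, sigma2=1.0, mu0=0.0, sigma02=1.0)
large = predictive_params(mu_ml=2.0, n=10000, sigma2=1.0, mu0=0.0, sigma02=1.0)

for name in small:
    print(name, small[name], large[name])
```

For n = 2 the means and variances differ noticeably; for n = 10000 the three pairs are nearly identical.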

(15)

Selecting priors

another example, problem 3.5.17 DHS (HW prob 3)

• multivariate Bernoulli (d independent Bernoulli variables)

• since Bernoulli is

$$P_{X|\Theta}(x|\theta) = \theta^x (1-\theta)^{1-x} = \begin{cases} \theta, & x = 1 \\ 1-\theta, & x = 0 \end{cases}$$

• multivariate likelihood is:

$$P_{X|\Theta}(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1-\theta_i)^{1-x_i}$$

• in (a) you show that if D = {x⁽¹⁾, ..., x⁽ⁿ⁾} is a set of n iid samples, then

$$P_{T|\Theta}(D|\theta) = \prod_{i=1}^{d} \theta_i^{s_i} (1-\theta_i)^{n-s_i}, \quad s_i = \sum_{j=1}^{n} x_i^{(j)}$$

(16)

Selecting priors

another example, problem 3.5.17 DHS (HW prob 3)

• in (b) you then show that if Θ is uniform (non-informative) the predictive distribution is

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i+1}{n+2}\right)^{x_i} \left(1 - \frac{s_i+1}{n+2}\right)^{1-x_i}$$

• in (d) you show that comparing with

$$P_{X|\Theta}(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1-\theta_i)^{1-x_i}$$

• this can be interpreted as:

• under Bayes, with a uniform prior, the predicted distribution is the same as the likelihood, with the parameter estimate

$$\hat{\theta}_i = \frac{s_i+1}{n+2}$$

(17)

Selecting priors

let’s now consider the extreme of

• ML: we know that

$$P_\Theta(\theta) = \delta(\theta - \hat{\theta}), \quad \hat{\theta}_i = \frac{s_i}{n}$$

• and

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i}{n}\right)^{x_i} \left(1 - \frac{s_i}{n}\right)^{1-x_i}$$

• the predicted distribution is the same as the likelihood, with the parameter estimate

$$\hat{\theta}_i = \frac{s_i}{n}$$

(18)

Selecting priors

• MAP: given prior $P_\Theta(\theta) = \prod_i P_{\Theta_i}(\theta_i)$

$$\hat{\theta} = \arg\max_\theta \left\{ \log P_{T|\Theta}(D|\theta) + \log P_\Theta(\theta) \right\}$$

• and since

$$P_{T|\Theta}(D|\theta) = \prod_{i=1}^{d} \theta_i^{s_i} (1-\theta_i)^{n-s_i}, \quad s_i = \sum_{j=1}^{n} x_i^{(j)}$$

• this is

$$\hat{\theta}_i = \arg\max_{\theta_i} \left\{ s_i \log\theta_i + (n - s_i)\log(1-\theta_i) + \log P_{\Theta_i}(\theta_i) \right\}$$

• i.e. the solution of

$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1-\theta_i} + \frac{1}{P_{\Theta_i}(\theta_i)} \frac{\partial P_{\Theta_i}(\theta_i)}{\partial\theta_i} = 0$$

• let’s consider some specific priors

(19)

Selecting priors

• prior that favors “1”s

$$P_{\Theta_i}(\theta_i) = 2\theta_i$$

• MAP solution:

$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1-\theta_i} + \frac{1}{\theta_i} = 0 \;\Leftrightarrow\; \hat{\theta}_i = \frac{s_i+1}{n+1}$$

• and

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i+1}{n+1}\right)^{x_i} \left(1 - \frac{s_i+1}{n+1}\right)^{1-x_i}$$

this can be interpreted as:

• the predicted distribution is the same as the likelihood, with the parameter estimate

$$\hat{\theta}_i = \frac{s_i+1}{n+1}$$

(20)

Selecting priors

• prior that favors “0”s

$$P_{\Theta_i}(\theta_i) = 2(1-\theta_i)$$

• MAP solution:

$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1-\theta_i} - \frac{1}{1-\theta_i} = 0 \;\Leftrightarrow\; \hat{\theta}_i = \frac{s_i}{n+1}$$

• and

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i}{n+1}\right)^{x_i} \left(1 - \frac{s_i}{n+1}\right)^{1-x_i}$$

this can be interpreted as:

• the predicted distribution is the same as the likelihood, with the parameter estimate

$$\hat{\theta}_i = \frac{s_i}{n+1}$$

(21)

Selecting priors

in summary

• all cases are of the form

$$P_{X|T}(x|D) = \prod_{i=1}^{d} \hat{\theta}_i^{x_i} (1 - \hat{\theta}_i)^{1-x_i}$$

• with

| Estimator | θ̂_i | # tosses | # “1”s | interpretation |
| ML | s_i / n | n | s_i | |
| MAP non-informative | s_i / n | n | s_i | “the same” |
| MAP favor “1”s | (s_i + 1) / (n + 1) | n+1 | s_i+1 | “add one 1” |
| MAP favor “0”s | s_i / (n + 1) | n+1 | s_i | “add one 0” |
| Bayes non-informative | (s_i + 1) / (n + 2) | n+2 | s_i+1 | “add one of each” |

• all cases qualitatively the same: “ML estimate on an extended sample with extra points that reflect the bias of the prior”.
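The estimators in this summary are easy to tabulate, and the two MAP closed forms can be cross-checked by maximizing the log-posterior on a grid. The sketch below is illustrative only: the function names, the grid resolution and the counts s = 3, n = 9 are assumptions made for the example.

```python
import math

def estimators(s, n):
    """Closed-form estimates of a Bernoulli parameter from s ones in n tosses."""
    return {
        "ML": s / n,
        "MAP non-informative": s / n,
        "MAP favor 1s": (s + 1) / (n + 1),
        "MAP favor 0s": s / (n + 1),
        "Bayes non-informative": (s + 1) / (n + 2),
    }

def map_by_grid(s, n, log_prior, steps=10000):
    """Maximize s*log(t) + (n-s)*log(1-t) + log_prior(t) over a grid in (0, 1)."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: s * math.log(t)
               + (n - s) * math.log(1.0 - t) + log_prior(t))

s, n = 3, 9   # hypothetical counts
closed = estimators(s, n)
numeric_ones = map_by_grid(s, n, lambda t: math.log(2.0 * t))           # prior 2*theta
numeric_zeros = map_by_grid(s, n, lambda t: math.log(2.0 * (1.0 - t)))  # prior 2*(1-theta)

print(closed)
print(numeric_ones, numeric_zeros)
```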

(22)

Regularization

these are all examples of regularization

Q: what is the point of “adding one of each” by Bayes non-informative?

• the main problem of ML (s_i / n) is the “empty bin” problem

• for small n, s_i is likely to be zero independently of the value of θ_i

• this can lead to all sorts of problems, e.g. a likelihood ratio that goes to infinity

• by adding “one of each” Bayes eliminates this problem

• for richly populated bins it makes no difference, but it matters for empty bins

note that this is consistent with the non-informative prior

• empty bins are as likely as any other value

• if we see a lot of them, we need to correct this
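The “empty bin” problem and its fix can be seen on a single Bernoulli variable. In the sketch below, two classes are compared through a log-likelihood ratio; the class counts are made up, and `log_likelihood_ratio` is a hypothetical helper, not something defined in the slides.

```python
import math

def log_likelihood_ratio(x, theta_a, theta_b):
    """log of P(x | theta_a) / P(x | theta_b) for one Bernoulli observation x."""
    def log_p(theta):
        p = theta if x == 1 else 1.0 - theta
        return math.log(p) if p > 0 else -math.inf
    return log_p(theta_a) - log_p(theta_b)

n = 5
s_a, s_b = 2, 0   # class b never produced a "1": an empty bin

# ML estimates s/n: the ratio for x = 1 is infinite
ml_ratio = log_likelihood_ratio(1, s_a / n, s_b / n)

# Bayes with uniform prior, (s+1)/(n+2): the ratio is finite
bayes_ratio = log_likelihood_ratio(1, (s_a + 1) / (n + 2), (s_b + 1) / (n + 2))

print(ml_ratio, bayes_ratio)
```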

(23)

Regularization

“empty bin” problem

• “why should I care?” this is unlikely if I have a large sample

• remember that “large” is always relative

• 10 bins in 1D transforms into 100 in 2D, 1000 in 3D, and 10^d in a d-dimensional space

• when d is large, we are always in the “small sample” regime

• regularization usually makes a tremendous difference

example:

• histogram estimates in high-dimensional spaces

• e.g. histogram of English words for indexing web-pages

• for each page, compute histogram C = (c₁, ..., c_w) where c_i is the number of times the i-th word appeared in the page

(24)

Regularization

histogram similarity:

• natural measure is the Kullback-Leibler divergence

$$d(C^i, C^j) = \sum_{k=1}^{w} p_k^i \log\left(\frac{p_k^i}{p_k^j}\right)$$

• where the probabilities are the counts after normalization

$$p_k^i = \frac{c_k^i}{\sum_{l} c_l^i}$$

• problem: log goes to infinity when p^j_k = 0!

• for low-frequency words the noisy estimates are amplified by the ratio of probabilities

• the distance measure has a large variance
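The divergence blow-up can be reproduced with two small count vectors. The sketch below computes the Kullback-Leibler divergence from raw normalized counts and again after adding one extra count per bin; the helper names and the count vectors are hypothetical, and adding exactly one count per bin is just one possible choice of regularizer.

```python
import math

def normalize(counts, extra=0.0):
    """Turn counts into probabilities, optionally adding `extra` to every bin."""
    total = sum(counts) + extra * len(counts)
    return [(c + extra) / total for c in counts]

def kl(p, q):
    """Kullback-Leibler divergence sum_k p_k log(p_k / q_k)."""
    d = 0.0
    for pk, qk in zip(p, q):
        if pk == 0:
            continue          # 0 log 0 is taken as 0
        if qk == 0:
            return math.inf   # empty bin in q: the log ratio diverges
        d += pk * math.log(pk / qk)
    return d

c_i = [3, 1, 0, 6]   # made-up word counts for page i
c_j = [4, 0, 2, 4]   # made-up word counts for page j (second bin is empty)

raw = kl(normalize(c_i), normalize(c_j))
smoothed = kl(normalize(c_i, 1.0), normalize(c_j, 1.0))

print(raw, smoothed)
```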

(25)

Regularization

Prob 3 on HW

• the count vector C is distributed according to a multinomial distribution

$$P_{C_1,\ldots,C_w}(c_1,\ldots,c_w) = \frac{n!}{\prod_{k=1}^{w} c_k!} \prod_{j=1}^{w} \pi_j^{c_j}$$

• where π_j is the probability of word j.

• since the π_j are probabilities, we can’t use just any prior here.

• distribution over vectors π = (π₁, ..., π_w) must satisfy the constraints of a probability mass function

$$\pi_j > 0, \quad \sum_{j} \pi_j = 1$$

(26)

Regularization

Prob 3 on HW

• one such distribution is the Dirichlet distribution

$$P_\Pi(\pi_1,\ldots,\pi_w) = \frac{\Gamma\left(\sum_{j=1}^{w} u_j\right)}{\prod_{j=1}^{w} \Gamma(u_j)} \prod_{j=1}^{w} \pi_j^{u_j - 1}$$

• u_j are hyper-parameters

• Γ(·) is the gamma function

(27)

Regularization

Prob 3 on HW

• on HW you will show that the posterior is

$$P_{\Pi|C}(\pi|c) = \frac{\Gamma\left(\sum_{j=1}^{w} (c_j + u_j)\right)}{\prod_{j=1}^{w} \Gamma(c_j + u_j)} \prod_{j=1}^{w} \pi_j^{c_j + u_j - 1}$$

• i.e. Dirichlet of hyper-parameters c_j + u_j

• the prior parameters can be seen as additional counts that regularize the predictive distribution!
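The “additional counts” reading can be made concrete with a few lines of code. The sketch below adds the hyper-parameters to the observed counts and uses the mean of the resulting Dirichlet (each parameter divided by their sum, a standard property) as the regularized word probabilities; the function names, the counts and the choice u_j = 1 are illustrative assumptions.

```python
def dirichlet_posterior(u, c):
    """Hyper-parameters of the posterior Dirichlet: prior parameters plus counts."""
    return [uj + cj for uj, cj in zip(u, c)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with parameters alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]

counts = [3, 0, 1]          # made-up word counts; the second word was never seen
prior = [1.0, 1.0, 1.0]     # assumed hyper-parameters u_j = 1

post = dirichlet_posterior(prior, counts)
probs = dirichlet_mean(post)

# Normalized raw counts give probability 0 to the unseen word;
# the regularized estimate does not.
raw = [c / sum(counts) for c in counts]
print(raw, probs)
```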

(28)