ECE 271A
Statistical Learning I:
Bayesian parameter estimation
Nuno Vasconcelos, ECE Department, UCSD
Bayesian estimation
last class we considered the Gaussian problem
$$P_{X|\mu}(x|\mu) = G(x, \mu, \sigma^2), \quad \sigma^2 \text{ known}$$
$$P_\mu(\mu) = G(\mu, \mu_0, \sigma_0^2)$$
and showed that
$$P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$$
$$P_{X|T}(x|D) = G(x, \mu_n, \sigma^2 + \sigma_n^2)$$
with
$$\mu_n = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\,\hat\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\,\mu_0$$
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
good example of various properties that are typical of Bayesian parameter estimates
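The update formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: the function name `gaussian_posterior` and the sample values are assumptions chosen for the example.

```python
def gaussian_posterior(samples, sigma2, mu0, sigma02):
    """Posterior and predictive parameters for a Gaussian mean (known variance sigma2).

    Hypothetical helper: returns (mu_n, sigma_n2, predictive_variance).
    """
    n = len(samples)
    mu_ml = sum(samples) / n                       # ML estimate of the mean
    denom = sigma2 + n * sigma02
    mu_n = (n * sigma02 / denom) * mu_ml + (sigma2 / denom) * mu0
    sigma_n2 = 1.0 / (n / sigma2 + 1.0 / sigma02)  # posterior variance
    return mu_n, sigma_n2, sigma2 + sigma_n2


# Illustrative values only (not from the lecture).
mu_n, sigma_n2, pred_var = gaussian_posterior([1.0, 2.0, 3.0], sigma2=1.0, mu0=0.0, sigma02=1.0)
print(mu_n, sigma_n2, pred_var)
```

With these numbers the posterior mean sits between the sample mean (2.0) and the prior mean (0.0), and the predictive variance is larger than the known variance by the posterior variance.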
Properties
regularization:
• if $\sigma_0^2 = \sigma^2$ then
$$\mu_n = \frac{n}{n+1}\,\hat\mu_{ML} + \frac{1}{n+1}\,\mu_0 = \frac{1}{n+1}\sum_{i=1}^{n+1} X_i, \quad \text{with } X_{n+1} = \mu_0$$
Bayes is equal to ML on a virtual sample with extra points
• in this case, one additional point equal to the mean of the prior
• for large n, extra point is irrelevant
• for small n, it regularizes the Bayes estimate by
• directing the posterior mean towards the prior mean
• reducing the variance of the posterior
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
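The "virtual sample" interpretation can be checked numerically. The sketch below assumes equal prior and likelihood variances, as in the bullet above; the function names and data values are hypothetical.

```python
def bayes_mean_equal_variances(samples, mu0):
    """Posterior mean mu_n when sigma0^2 == sigma^2."""
    n = len(samples)
    mu_ml = sum(samples) / n
    return n / (n + 1) * mu_ml + 1 / (n + 1) * mu0


def ml_on_virtual_sample(samples, mu0):
    """ML estimate (sample mean) after appending one virtual point equal to mu0."""
    virtual = list(samples) + [mu0]
    return sum(virtual) / len(virtual)


small = [4.0, 6.0]            # n = 2, sample mean 5.0
large = [4.0, 6.0] * 500      # n = 1000, sample mean 5.0
mu0 = 0.0                     # hypothetical prior mean

print(bayes_mean_equal_variances(small, mu0), ml_on_virtual_sample(small, mu0))
print(bayes_mean_equal_variances(large, mu0))
```

For the small sample the extra point pulls the estimate noticeably towards the prior mean; for the large sample the estimate is essentially the sample mean.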
Conjugate priors
note that
• the prior $P_\mu(\mu) = G(\mu, \mu_0, \sigma_0^2)$ is Gaussian
• the posterior $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$ is Gaussian
whenever this is the case (posterior in the same family as prior) we say that
• $P_\mu(\mu)$ is a conjugate prior for the likelihood $P_{X|\mu}(x|\mu)$
• the posterior $P_{\mu|T}(\mu|D)$ is the reproducing density
HW: a number of likelihoods have conjugate priors
Likelihood                                   Conjugate prior
Bernoulli                                    Beta
Poisson                                      Gamma
Exponential                                  Gamma
Normal (known σ², unknown mean)              Normal
Normal (known mean, unknown precision 1/σ²)  Gamma
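As an illustration of the Bernoulli/Beta row, the sketch below uses the standard result that a Beta(a, b) prior combined with s ones in n tosses gives a Beta(a + s, b + n - s) posterior. The function name and the toss sequence are assumptions made for the example.

```python
def beta_bernoulli_update(a, b, tosses):
    """Return the Beta(a', b') posterior after observing 0/1 tosses under a Beta(a, b) prior."""
    s = sum(tosses)          # number of 1s
    n = len(tosses)
    return a + s, b + (n - s)


# Beta(1, 1) is the uniform prior; the tosses are illustrative.
a_post, b_post = beta_bernoulli_update(1, 1, [1, 0, 1, 1])
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)
```

The posterior stays in the Beta family, which is what makes the prior conjugate: the update only changes the two hyper-parameters.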
Priors
potential problem of the Bayesian framework
• “I don’t really have a strong belief about what the most likely parameter configuration is”
in these cases it is usual to adopt a non-informative prior
the most obvious choice is the uniform distribution
$$P_\Theta(\theta) = \alpha$$
there are, however, problems with this choice
• if $\theta$ is unbounded this is an improper distribution
$$\int_{-\infty}^{\infty} P_\Theta(\theta)\, d\theta = \infty \neq 1$$
• the prior is not invariant to all reparametrizations
Example
consider Θ and a new random variable η with $\eta = e^\Theta$
since this is a 1-to-1 transformation it should not affect the outcome of the inference process
we check this by using the change of variables theorem
• if y = f(x) then
$$P_Y(y) = \frac{1}{\left|\frac{\partial f}{\partial x}\right|_{x = f^{-1}(y)}}\, P_X\!\left(f^{-1}(y)\right)$$
in this case
$$P_\eta(\eta) = \frac{1}{\left|\frac{\partial e^\theta}{\partial \theta}\right|_{\theta = \log\eta}}\, P_\Theta(\log\eta) = \frac{1}{\eta}\, P_\Theta(\log\eta)$$
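The change of variables can be checked numerically. The sketch below is an illustration under a simplifying assumption: Θ is taken to be uniform on the bounded interval [0, 1] (a proper stand-in for the flat prior), so that η = e^Θ lives on [1, e]. The helper names are hypothetical.

```python
import math


def p_theta(theta, a=0.0, b=1.0):
    """Uniform density on [a, b] (a proper stand-in for the flat prior)."""
    return 1.0 / (b - a) if a <= theta <= b else 0.0


def p_eta(eta, a=0.0, b=1.0):
    """Density of eta = exp(theta) from the change of variables theorem."""
    if eta <= 0.0:
        return 0.0
    return p_theta(math.log(eta), a, b) / eta


def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule numerical integration."""
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h


print(p_eta(1.0), p_eta(2.0))            # the density is not constant in eta
print(integrate(p_eta, 1.0, math.e))     # it still integrates to approximately 1
```

The transformed density decays as 1/η even though the density of Θ is flat, which is the inconsistency the slide points out.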
Invariant non-informative priors
for uniform Θ, $P_\eta(\eta) \propto 1/\eta$, i.e. not constant
this means that
• there is no consistency between Θ and η
• a 1-to-1 transformation changes the non-informative prior into an informative one
to avoid this problem the non-informative prior has to be invariant
e.g. consider a location parameter:
• a parameter that simply shifts the density
• e.g. the mean of a Gaussian
a non-informative prior for a location parameter has to be invariant to shifts, i.e. the transformation Y = µ + c
Location parameters
in this case
$$P_Y(y) = \frac{1}{\left|\frac{\partial(\mu + c)}{\partial\mu}\right|_{\mu = y - c}}\, P_\mu(y - c) = P_\mu(y - c)$$
and, since invariance requires Y to have the same density as µ for all c,
$$P_Y(y) = P_\mu(y)$$
hence
$$P_\mu(y - c) = P_\mu(y)$$
which is valid for all c if and only if $P_\mu(\mu)$ is uniform
non-informative prior for location is $P_\mu(\mu) \propto 1$
Scale parameters
a scale parameter is one that controls the scale of the density
$$\sigma^{-1} f\!\left(\frac{x}{\sigma}\right)$$
e.g. the standard deviation of a Gaussian distribution
it can be shown that, in this case, the non-informative prior invariant to scale transformations is
$$P_\sigma(\sigma) = \frac{1}{\sigma}$$
note that, as for location, this is an improper prior
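The invariance claim can be sanity-checked with the change of variables theorem applied to Y = cσ, c > 0. The sketch below is illustrative only; the function names and the value of c are assumptions, and the flat prior is included purely for comparison.

```python
def p_sigma(sigma):
    """Candidate scale prior, P(sigma) = 1 / sigma (improper, defined for sigma > 0)."""
    return 1.0 / sigma


def p_flat(sigma):
    """Flat prior, for comparison."""
    return 1.0


def density_after_scaling(prior, y, c):
    """Density of Y = c * sigma (c > 0) by the change of variables theorem."""
    return prior(y / c) / c


c = 4.0   # arbitrary scale factor
for y in (0.5, 2.0, 10.0):
    print(y, density_after_scaling(p_sigma, y, c), p_sigma(y), density_after_scaling(p_flat, y, c))
```

For the 1/σ prior the rescaled density has the same functional form as the original one, while the flat prior picks up the factor 1/c.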
Selecting priors
non-informative priors are the end of the spectrum where we don’t know what parameter values to favor
at the other end, i.e. when we are absolutely sure, the prior becomes a delta function
$$P_\Theta(\theta) = \delta(\theta - \theta_0)$$
in this case
$$P_{\Theta|T}(\theta|D) \propto P_{T|\Theta}(D|\theta)\, \delta(\theta - \theta_0)$$
and the predictive distribution is
$$P_{X|T}(x|D) \propto \int P_{X|\Theta}(x|\theta)\, P_{T|\Theta}(D|\theta)\, \delta(\theta - \theta_0)\, d\theta = P_{X|\Theta}(x|\theta_0)\, P_{T|\Theta}(D|\theta_0) \propto P_{X|\Theta}(x|\theta_0)$$
i.e., after normalization, $P_{X|T}(x|D) = P_{X|\Theta}(x|\theta_0)$
this is identical to ML if θ0 = θML
Selecting priors
hence,
• ML is a special case of the Bayesian formulation,
• where we are absolutely confident that the ML estimate is the correct value for the parameter
but we could use other values for $\theta_0$. For example the value that maximizes the posterior
$$\theta_{MAP} = \arg\max_\theta P_{\Theta|T}(\theta|D) = \arg\max_\theta P_{T|\Theta}(D|\theta)\, P_\Theta(\theta)$$
this is called the MAP estimate and makes the predictive distribution equal to
$$P_{X|T}(x|D) = P_{X|\Theta}(x|\theta_{MAP})$$
it can be useful when the true predictive distribution has no closed-form solution
Selecting priors
the natural question is then
• “what if I don’t get the prior right?”; “can I do terribly bad?”
• “how robust is the Bayesian solution to the choice of prior?”
• let’s see how much the solution changes between the two extremes
for the Gaussian problem
• absolute certainty priors: $P_\mu(\mu) = \delta(\mu - \mu_p)$
• MAP estimate: since we have $P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$
$$\mu_p = \mu_{MAP} = \mu_n = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\,\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\,\mu_0$$
• ML estimate is $\mu_p = \mu_{ML}$
• we have seen already that these are similar unless the sample is small (MAP = ML on sample with extra point)
Selecting priors
for the Gaussian problem
• non-informative prior:
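• in this case it is $P_\mu(\mu) \propto 1$ or
$$P_\mu(\mu) = \lim_{\sigma_0^2 \to \infty} G(\mu, \mu_0, \sigma_0^2)$$
• from which
$$\mu_n = \lim_{\sigma_0^2 \to \infty}\left(\frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\,\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\,\mu_0\right) = \mu_{ML}$$
$$\frac{1}{\sigma_n^2} = \lim_{\sigma_0^2 \to \infty}\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right) = \frac{n}{\sigma^2} \;\Leftrightarrow\; \sigma_n^2 = \frac{\sigma^2}{n}$$
• and
$$P_{X|T}(x|D) = G(x, \mu_n, \sigma^2 + \sigma_n^2) = G\!\left(x, \mu_{ML}, \sigma^2\left(1 + \frac{1}{n}\right)\right)$$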
Selecting priors
in summary, for the two prior extremes
• delta prior centered on MAP:
$$P_{X|T}(x|D) = G(x, \mu_{MAP}, \sigma^2), \quad \mu_{MAP} = \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\,\mu_{ML} + \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\,\mu_0$$
• delta prior centered on ML:
$$P_{X|T}(x|D) = G(x, \mu_{ML}, \sigma^2)$$
• non-informative prior
$$P_{X|T}(x|D) = G\!\left(x, \mu_{ML}, \sigma^2\left(1 + \frac{1}{n}\right)\right)$$
all Gaussian, “qualitatively the same”:
• somewhat different parameters for small n; equal for large n
this indicates robustness to “incorrect” priors!
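The three predictive distributions in this summary can be compared numerically. The sketch below is a hedged illustration: `predictive_params`, the data, and the prior settings (prior mean 0, unit variances) are assumptions chosen for the example.

```python
def predictive_params(samples, sigma2, mu0, sigma02):
    """(mean, variance) of the Gaussian predictive distribution under each prior."""
    n = len(samples)
    mu_ml = sum(samples) / n
    denom = sigma2 + n * sigma02
    mu_map = (n * sigma02 / denom) * mu_ml + (sigma2 / denom) * mu0
    return {
        "delta prior at MAP": (mu_map, sigma2),
        "delta prior at ML": (mu_ml, sigma2),
        "non-informative prior": (mu_ml, sigma2 * (1.0 + 1.0 / n)),
    }


# Illustrative values only: prior mean 0, unit variances.
small = predictive_params([2.0, 4.0], sigma2=1.0, mu0=0.0, sigma02=1.0)
large = predictive_params([2.0, 4.0] * 500, sigma2=1.0, mu0=0.0, sigma02=1.0)
for name in small:
    print(name, small[name], large[name])
```

With n = 2 the means and variances differ visibly across the three cases; with n = 1000 they nearly coincide.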
Selecting priors
another example, problem 3.5.17 DHS (HW prob 3)
• multivariate Bernoulli (d independent Bernoulli variables)
• since Bernoulli is
$$P_{X|\Theta}(x|\theta) = \theta^x (1-\theta)^{1-x} = \begin{cases} \theta, & x = 1 \\ 1 - \theta, & x = 0 \end{cases}$$
• multivariate likelihood is:
$$P_{X|\Theta}(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1-\theta_i)^{1-x_i}$$
• in (a) you show that if $D = \{x^{(1)}, \ldots, x^{(n)}\}$ is a set of n iid samples, then
$$P_{T|\Theta}(D|\theta) = \prod_{i=1}^{d} \theta_i^{s_i} (1-\theta_i)^{n-s_i}, \quad s_i = \sum_{j=1}^{n} x_i^{(j)}$$
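The role of the counts s_i can be illustrated by computing the likelihood in two ways: sample by sample, and from the counts alone. This is a minimal sketch with hypothetical function names and made-up data, not a solution to the homework problem.

```python
import math


def likelihood_per_sample(data, theta):
    """Product over samples and dimensions of theta_i^x_i (1 - theta_i)^(1 - x_i)."""
    value = 1.0
    for x in data:
        for xi, ti in zip(x, theta):
            value *= ti if xi == 1 else 1.0 - ti
    return value


def likelihood_from_counts(data, theta):
    """Same likelihood, computed from the counts s_i of 1s in each dimension."""
    n = len(data)
    s = [sum(column) for column in zip(*data)]
    return math.prod(ti ** si * (1.0 - ti) ** (n - si) for ti, si in zip(theta, s))


data = [(1, 0), (1, 1), (0, 0)]     # hypothetical sample: n = 3, d = 2
theta = (0.6, 0.3)                  # hypothetical parameter vector
print(likelihood_per_sample(data, theta), likelihood_from_counts(data, theta))
```

Both computations agree, so the data enter the likelihood only through n and the counts s_i.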
Selecting priors
another example, problem 3.5.17 DHS (HW prob 3)
• in (b) you then show that if Θ is uniform (non-informative) the predictive distribution is
$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i+1}{n+2}\right)^{x_i} \left(1 - \frac{s_i+1}{n+2}\right)^{1-x_i}$$
• in (d) you show that, comparing with
$$P_{X|\Theta}(x|\theta) = \prod_{i=1}^{d} \theta_i^{x_i} (1-\theta_i)^{1-x_i}$$
• this can be interpreted as:
• under Bayes, with a uniform prior, the predicted distribution is the same as the likelihood, with the parameter estimate
$$\hat\theta_i = \frac{s_i+1}{n+2}$$
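For a single dimension, the estimate (s + 1)/(n + 2) can be checked exactly. The sketch below assumes the uniform prior and uses the standard closed form of the integral of θ^a (1 - θ)^b over [0, 1], namely a! b! / (a + b + 1)!, for non-negative integers a and b. The helper names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Exact integral of theta^a (1 - theta)^b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def predictive_prob_one(s, n):
    """P(x = 1 | D) under a uniform prior, given s ones in n tosses."""
    # numerator: integral of theta * theta^s (1 - theta)^(n - s)
    # denominator: normalizing constant of the posterior
    return beta_integral(s + 1, n - s) / beta_integral(s, n - s)


print(predictive_prob_one(3, 10), Fraction(3 + 1, 10 + 2))
```

Exact rational arithmetic avoids any numerical integration error, so the two printed values match exactly.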
Selecting priors
let’s now consider the extreme of
• ML: $P_\Theta(\theta) = \delta(\theta - \hat\theta)$, where we know that
$$\hat\theta_i = \frac{s_i}{n}$$
• and
$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i}{n}\right)^{x_i} \left(1 - \frac{s_i}{n}\right)^{1-x_i}$$
• this can be interpreted as:
• the predicted distribution is the same as the likelihood, with the parameter estimate
$$\hat\theta_i = \frac{s_i}{n}$$
Selecting priors
• MAP: given prior
$$P_\Theta(\theta) = \prod_{i} P_{\Theta_i}(\theta_i)$$
the estimate is
$$\hat\theta = \arg\max_\theta \left\{\log P_{T|\Theta}(D|\theta) + \log P_\Theta(\theta)\right\}$$
• and since
$$P_{T|\Theta}(D|\theta) = \prod_{i=1}^{d} \theta_i^{s_i} (1-\theta_i)^{n-s_i}, \quad s_i = \sum_{j=1}^{n} x_i^{(j)}$$
• this is
$$\hat\theta_i = \arg\max_{\theta_i} \left\{ s_i \log\theta_i + (n - s_i)\log(1-\theta_i) + \log P_{\Theta_i}(\theta_i) \right\}$$
• i.e. the solution of
$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1-\theta_i} + \frac{1}{P_{\Theta_i}(\theta_i)}\frac{\partial P_{\Theta_i}(\theta_i)}{\partial\theta_i} = 0$$
• let’s consider some specific priors
Selecting priors
• prior that favors “1”s
$$P_{\Theta_i}(\theta_i) = 2\theta_i$$
• MAP solution:
$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1-\theta_i} + \frac{1}{\theta_i} = 0 \;\Leftrightarrow\; \hat\theta_i = \frac{s_i+1}{n+1}$$
• and
$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i+1}{n+1}\right)^{x_i} \left(1 - \frac{s_i+1}{n+1}\right)^{1-x_i}$$
• this can be interpreted as:
• the predicted distribution is the same as the likelihood, with the parameter estimate
$$\hat\theta_i = \frac{s_i+1}{n+1}$$
Selecting priors
• prior that favors “0”s
$$P_{\Theta_i}(\theta_i) = 2(1-\theta_i)$$
• MAP solution:
$$\frac{s_i}{\theta_i} - \frac{n - s_i}{1-\theta_i} - \frac{1}{1-\theta_i} = 0 \;\Leftrightarrow\; \hat\theta_i = \frac{s_i}{n+1}$$
• and
$$P_{X|T}(x|D) = \prod_{i=1}^{d} \left(\frac{s_i}{n+1}\right)^{x_i} \left(1 - \frac{s_i}{n+1}\right)^{1-x_i}$$
• this can be interpreted as:
• the predicted distribution is the same as the likelihood, with the parameter estimate
$$\hat\theta_i = \frac{s_i}{n+1}$$
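The closed-form MAP solutions for the priors 2θ and 2(1 - θ) can be cross-checked by maximizing the log-posterior directly. The sketch below uses a plain grid search, which is only one simple way to do this; the function names, grid size, and counts are assumptions made for the example.

```python
import math


def map_estimate(s, n, log_prior, grid=10000):
    """Grid-search maximizer of s*log(t) + (n - s)*log(1 - t) + log_prior(t) over 0 < t < 1."""
    best_theta, best_value = None, -math.inf
    for k in range(1, grid):                      # skip the endpoints 0 and 1
        theta = k / grid
        value = s * math.log(theta) + (n - s) * math.log(1.0 - theta) + log_prior(theta)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta


def favor_ones(theta):
    """Log of the prior 2 * theta."""
    return math.log(2.0 * theta)


def favor_zeros(theta):
    """Log of the prior 2 * (1 - theta)."""
    return math.log(2.0 * (1.0 - theta))


s, n = 3, 9                                       # illustrative counts
print(map_estimate(s, n, favor_ones), (s + 1) / (n + 1))
print(map_estimate(s, n, favor_zeros), s / (n + 1))
```

The grid maximizers agree with (s + 1)/(n + 1) and s/(n + 1) up to the grid resolution.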
Selecting priors
in summary
• all cases are of the form
$$P_{X|T}(x|D) = \prod_{i=1}^{d} \hat\theta_i^{x_i} (1-\hat\theta_i)^{1-x_i}$$
• with

Estimator               θ̂_i              # tosses   # “1”s    interpretation
ML                      s_i/n            n          s_i
MAP non-informative     s_i/n            n          s_i       “the same”
MAP favor “1”s          (s_i+1)/(n+1)    n+1        s_i+1     “add one 1”
MAP favor “0”s          s_i/(n+1)        n+1        s_i       “add one 0”
Bayes non-informative   (s_i+1)/(n+2)    n+2        s_i+1     “add one of each”

• all cases qualitatively the same: “ML estimate on an extended sample with extra points that reflect the bias of the prior”.
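The estimators in the summary can be tabulated for a small and a large sample to see how quickly they converge. This is an illustrative sketch; the function name and the counts are assumptions.

```python
from fractions import Fraction


def estimates(s, n):
    """Parameter estimates from the summary, for s ones in n tosses."""
    return {
        "ML": Fraction(s, n),
        "MAP non-informative": Fraction(s, n),
        "MAP favor 1s": Fraction(s + 1, n + 1),
        "MAP favor 0s": Fraction(s, n + 1),
        "Bayes non-informative": Fraction(s + 1, n + 2),
    }


for n, s in ((4, 1), (4000, 1000)):      # illustrative small and large samples
    print(n, {name: float(value) for name, value in estimates(s, n).items()})
```

For n = 4 the estimates range from 0.2 to 0.4; for n = 4000 they all lie within a fraction of a percent of 0.25.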
Regularization
these are all examples of regularization
Q: what is the point of “adding one of each” by Bayes non-informative?
• the main problem of ML (si / n) is the “empty bin” problem
• for small n, si is likely to be zero independently of the value of θi
• this can lead to all sorts of problems, e.g. a likelihood ratio that goes to infinity
• by adding “one of each” Bayes eliminates this problem
• for richly populated bins it makes no difference, but it matters for empty bins
note that this is consistent with the non-informative prior
• empty bins are as likely as any other value
• if we see a lot of them, we need to correct this
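The "empty bin" problem and its fix can be made concrete with a likelihood ratio between two classes. The two-class setup, the function names, and the counts below are assumptions made for illustration.

```python
import math


def ml_estimate(s, n):
    """ML estimate s / n."""
    return s / n


def bayes_estimate(s, n):
    """Bayes estimate under the uniform prior: one extra toss of each outcome."""
    return (s + 1) / (n + 2)


def likelihood_ratio(theta_a, theta_b):
    """Ratio P(x_i = 1 | class a) / P(x_i = 1 | class b); infinite if the denominator is 0."""
    return math.inf if theta_b == 0 else theta_a / theta_b


n = 5                       # small sample per class (illustrative)
s_a, s_b = 2, 0             # class b has an "empty bin" for this feature

print(likelihood_ratio(ml_estimate(s_a, n), ml_estimate(s_b, n)))
print(likelihood_ratio(bayes_estimate(s_a, n), bayes_estimate(s_b, n)))
```

With ML the ratio is infinite because the estimate for class b is exactly zero; with the "one of each" correction it stays finite.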
Regularization
“empty bin” problem
• “why should I care?” this is unlikely if I have a large sample
• remember that “large” is always relative
• 10 bins in 1D transforms into 100 in 2D, 1000 in 3D, and $10^d$ in a d-dimensional space
• when d is large, we are always in the “small sample” regime
• regularization usually makes a tremendous difference
example:
• histogram estimates in high-dimensional spaces
• e.g. histogram of English words for indexing web-pages
• for each page, compute histogram C = (c1, ..., cw) where ci is the # of times the i-th word appeared in the page
Regularization
histogram similarity:
• natural measure is the Kullback-Leibler divergence
$$d(C^i, C^j) = \sum_{k=1}^{w} p_k^i \log\left(\frac{p_k^i}{p_k^j}\right)$$
• where the probabilities are the counts after normalization
$$p_k^i = \frac{c_k^i}{\sum_{k} c_k^i}$$
• problem: log goes to infinity when $p_k^j = 0$!
• for low-frequency words the noisy estimates are amplified by the ratio of probabilities
• the distance measure has a large variance
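The divergence and its failure mode can be sketched as follows. The function names and the word counts are hypothetical, and the code adopts the usual convention that terms with p_k = 0 contribute zero.

```python
import math


def normalize(counts):
    """Turn raw counts into probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """sum_k p_k log(p_k / q_k), with the usual convention 0 * log(0 / q) = 0."""
    d = 0.0
    for pk, qk in zip(p, q):
        if pk == 0.0:
            continue
        if qk == 0.0:
            return math.inf
        d += pk * math.log(pk / qk)
    return d


page_i = normalize([3, 1, 0])     # hypothetical word counts for two pages
page_j = normalize([2, 0, 2])

print(kl_divergence(page_i, page_i))   # 0.0
print(kl_divergence(page_i, page_j))   # inf: the second word appears in page i but never in page j
```

A single word that is present in one page and absent from the other is enough to make the distance infinite.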
Regularization
Prob 3 on HW
• the count vector C is distributed according to a multinomial distribution
$$P_C(c_1, \ldots, c_w) = \frac{n!}{\prod_{k=1}^{w} c_k!} \prod_{j=1}^{w} \pi_j^{c_j}$$
• where πj is the probability of word j.
• since the πj are probabilities, we can’t use just any prior here.
• distribution over vectors π = (π1, ..., πw) must satisfy the constraints of a probability mass function
$$\pi_j > 0, \quad \sum_{j} \pi_j = 1$$
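The multinomial model and the constraints on π can be written down directly. The sketch below is illustrative: the function names and the probability vector are assumptions, and n is taken to be the total word count.

```python
import math


def multinomial_pmf(counts, probs):
    """P_C(c_1, ..., c_w) for word counts `counts` and word probabilities `probs`."""
    n = sum(counts)                       # total number of words (assumed meaning of n)
    coefficient = math.factorial(n)
    for c in counts:
        coefficient //= math.factorial(c)
    return coefficient * math.prod(p ** c for p, c in zip(probs, counts))


def is_valid_pmf(probs, tol=1e-12):
    """Check the constraints pi_j > 0 and sum_j pi_j = 1."""
    return all(p > 0 for p in probs) and abs(sum(probs) - 1.0) < tol


pi = [0.5, 0.3, 0.2]                      # hypothetical word probabilities
print(is_valid_pmf(pi), multinomial_pmf([2, 1, 0], pi))
```

Any prior over π has to put all of its mass on vectors for which `is_valid_pmf` holds.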
j π jRegularization
Prob 3 on HW
• one such distribution is the Dirichlet distribution
$$P_\Pi(\pi_1, \ldots, \pi_w) = \frac{\Gamma\left(\sum_{j=1}^{w} u_j\right)}{\prod_{j=1}^{w} \Gamma(u_j)} \prod_{j=1}^{w} \pi_j^{u_j - 1}$$
• uj are hyper-parameters
• Γ(.) is the gamma function
Regularization
Prob 3 on HW
• on HW you will show that the posterior is
$$P_{\Pi|C}(\pi|c) = \frac{\Gamma\left(\sum_{j=1}^{w} (c_j + u_j)\right)}{\prod_{j=1}^{w} \Gamma(c_j + u_j)} \prod_{j=1}^{w} \pi_j^{c_j + u_j - 1}$$
• i.e. Dirichlet of hyper-parameters cj + uj
• the prior parameters can be seen as additional counts that regularize the predictive distribution!
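The "additional counts" reading can be illustrated with the posterior hyper-parameters c_j + u_j. The sketch below summarizes the posterior by its mean, using the standard fact that a Dirichlet with hyper-parameters α has mean α_j / Σ_k α_k; the function names, counts, and prior values are hypothetical.

```python
def dirichlet_posterior(counts, u):
    """Hyper-parameters c_j + u_j of the Dirichlet posterior."""
    return [c + uj for c, uj in zip(counts, u)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution with hyper-parameters alpha."""
    total = sum(alpha)
    return [a / total for a in alpha]


counts = [3, 1, 0]            # hypothetical word counts with one empty bin
u = [1, 1, 1]                 # hypothetical prior hyper-parameters (one extra count per word)

posterior = dirichlet_posterior(counts, u)
print(posterior, dirichlet_mean(posterior))
```

The word that was never observed still receives a non-zero probability, so the empty-bin problem does not arise.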