Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Beta and Gamma Distributions
Contents
• Beta Distribution, Gamma Function, Normalization of the Beta Distribution, Beta as a Prior to Bernoulli, Posterior and Predictive Distributions
• A Frequentist View of Bayesian Learning, Variance Decomposition
• Gamma Distribution
• Exponential Distribution
• Chi-Squared Distribution
• Inverse Gamma Distribution
• The Pareto Distribution
• Following closely Chris Bishop’s PRML book, Chapter 2
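The Beta(a,b) distribution with x ∈ [0,1] and a, b > 0 is defined as follows:
$$\text{Beta}(x \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1}$$
The expected value, mode and variance of a Beta random variable x with (hyper-)parameters a and b are (the mode formula holds for a, b > 1):
$$\mathbb{E}[x] = \frac{a}{a+b}, \qquad \text{mode}[x] = \frac{a-1}{a+b-2}, \qquad \operatorname{var}[x] = \frac{ab}{(a+b)^2 (a+b+1)}$$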
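As a quick sanity check of the closed-form mean and variance of a Beta random variable (standard results), the sketch below compares them with a plain midpoint-rule integration of the density. The helper names `beta_pdf`, `beta_moments` and `midpoint_integral` are illustrative only, and the parameter choice a=2, b=3 is arbitrary.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x in (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_moments(a, b):
    """Closed-form mean, mode (valid for a, b > 1) and variance of Beta(a, b)."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, var

def midpoint_integral(f, n=20000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))

a, b = 2.0, 3.0  # arbitrary illustrative choice
mean, mode, var = beta_moments(a, b)
num_mean = midpoint_integral(lambda x: x * beta_pdf(x, a, b))
num_var = midpoint_integral(lambda x: (x - mean) ** 2 * beta_pdf(x, a, b))
print(f"mean: closed form {mean:.6f}, numerical {num_mean:.6f}")
print(f"var:  closed form {var:.6f}, numerical {num_var:.6f}")
print(f"mode: {mode:.6f}")
```

The midpoint rule is used only because it needs no external library; any quadrature routine would do.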
If a = b = 1, we obtain a uniform distribution.
If a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1.
If a and b are both greater than 1, the distribution is unimodal.
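The gamma function extends the factorial to real numbers:
$$\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du$$
With integration by parts:
$$\Gamma(x+1) = x\,\Gamma(x)$$
For integer n:
$$\Gamma(n+1) = n!$$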
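The three properties of the gamma function can be checked numerically. The sketch below uses Python's built-in `math.gamma`; the helper `gamma_numeric` and its truncation point `upper` are assumptions made here for illustration (the integrand decays like e^(-u), so truncating at u = 60 is harmless for moderate x ≥ 1).

```python
import math

def gamma_numeric(x, upper=60.0, n=200000):
    """Midpoint-rule approximation of the integral of u^(x-1) e^(-u) over (0, upper)."""
    h = upper / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        total += u ** (x - 1) * math.exp(-u)
    return h * total

x = 3.5
# Integral definition versus the library value
print(gamma_numeric(x), math.gamma(x))
# Recursion obtained by integration by parts: Gamma(x + 1) = x * Gamma(x)
print(math.gamma(x + 1), x * math.gamma(x))
# Factorial property for integers: Gamma(n + 1) = n!
print([(math.gamma(n + 1), math.factorial(n)) for n in range(6)])
```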
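Beta Distribution: Normalization
Showing that the Beta(a,b) distribution is correctly normalized is a bit tricky. We need to prove that:
$$\int_0^1 m^{a-1} (1-m)^{b-1}\, dm = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}$$
Starting from the product Γ(a)Γ(b) written as a double integral over x and y, we follow the steps: (a) change the variable y to t = y + x; (b) change the order of integration over the triangular region 0 ≤ x ≤ t; and (c) change x to m via x = tm:
$$\begin{aligned}
\Gamma(a)\,\Gamma(b) &= \int_0^\infty x^{a-1} e^{-x}\, dx \int_0^\infty y^{b-1} e^{-y}\, dy \\
&= \int_0^\infty x^{a-1} \left\{ \int_x^\infty (t-x)^{b-1} e^{-t}\, dt \right\} dx \\
&= \int_0^\infty e^{-t} \left\{ \int_0^t x^{a-1} (t-x)^{b-1}\, dx \right\} dt \\
&= \int_0^\infty e^{-t}\, t^{a+b-1}\, dt \int_0^1 m^{a-1} (1-m)^{b-1}\, dm \\
&= \Gamma(a+b) \int_0^1 m^{a-1} (1-m)^{b-1}\, dm
\end{aligned}$$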
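The normalization identity, that the integral of m^(a-1) (1-m)^(b-1) over (0, 1) equals Γ(a)Γ(b)/Γ(a+b), can also be checked numerically. This is a minimal sketch, not a proof; `beta_integral` and `beta_function` are names chosen here, and the test cases are restricted to a, b ≥ 1 so that the integrand stays bounded and a simple midpoint rule is adequate.

```python
import math

def beta_integral(a, b, n=200000):
    """Midpoint-rule approximation of the integral of m^(a-1) (1-m)^(b-1) over (0, 1)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        m = (i + 0.5) * h
        total += m ** (a - 1) * (1 - m) ** (b - 1)
    return h * total

def beta_function(a, b):
    """Right-hand side of the identity: Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

cases = [(2.0, 3.0), (2.5, 4.0), (8.0, 4.0)]  # arbitrary illustrative parameters
for a, b in cases:
    print(a, b, beta_integral(a, b), beta_function(a, b))
```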
Beta Distribution
[Figure: pdf versus x on [0, 1] for Beta(0.1,0.1), Beta(1,1), Beta(2,3) and Beta(8,4).]
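Assuming a Bernoulli likelihood and a Beta(a,b) prior on its parameter μ, we derive the posterior for a data set D of N observations, of which N_1 have x=1 and N_0 = N − N_1 have x=0:
$$p(\mu \mid D) \propto \mu^{N_1} (1-\mu)^{N_0}\, \mu^{a-1} (1-\mu)^{b-1} = \mu^{a+N_1-1} (1-\mu)^{b+N_0-1}$$
This is also a Beta distribution:
$$p(\mu \mid D) = \text{Beta}(\mu \mid a+N_1,\; b+N_0)$$
a and b are the effective number of observations of x=1 and x=0, respectively, introduced by the prior (they don’t have to be integers).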
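Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior update amounts to adding the observed counts of x=1 and x=0 to a and b. The sketch below illustrates this; `update_beta`, the prior (2, 2) and the data list are hypothetical choices made for the example. It also shows that processing the data one observation at a time gives the same posterior as a single batch update.

```python
def update_beta(a, b, observations):
    """Return the Beta posterior parameters after Bernoulli observations (0s and 1s)."""
    n1 = sum(observations)           # number of observations with x = 1
    n0 = len(observations) - n1      # number of observations with x = 0
    return a + n1, b + n0

prior = (2, 2)                # hypothetical prior Beta(2, 2)
data = [1, 0, 1, 1, 0, 1]     # hypothetical coin flips

batch = update_beta(*prior, data)

sequential = prior
for x in data:
    # Yesterday's posterior is today's prior
    sequential = update_beta(*sequential, [x])

print(batch, sequential)
```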
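Posterior Mean and Variance
From the properties of the Beta distribution, with N_1 observations of x=1 and N_0 of x=0 (N = N_1 + N_0) and Bernoulli parameter μ, we compute:
$$\mathbb{E}[\mu \mid D] = \frac{a+N_1}{a+b+N}, \qquad \operatorname{var}[\mu \mid D] = \frac{(a+N_1)(b+N_0)}{(a+b+N)^2 (a+b+N+1)}$$
The posterior mean always lies between the prior mean a/(a+b) and the maximum likelihood estimate N_1/N:
$$\min\left(\frac{a}{a+b}, \frac{N_1}{N}\right) \le \mathbb{E}[\mu \mid D] \le \max\left(\frac{a}{a+b}, \frac{N_1}{N}\right)$$
This can be shown easily by noticing that the posterior mean is a convex combination of the two:
$$\mathbb{E}[\mu \mid D] = \lambda\, \frac{a}{a+b} + (1-\lambda)\, \frac{N_1}{N}, \qquad \lambda = \frac{a+b}{a+b+N} \in (0, 1)$$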
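The claim that the posterior mean sits between the prior mean and the maximum likelihood estimate can be verified with exact rational arithmetic. The sketch below assumes a Beta(a, b) prior and counts n1 (observations of x=1) and n0 (observations of x=0); the weight `lam` = (a+b)/(a+b+N) and all numeric values are illustrative choices.

```python
from fractions import Fraction

a, b = 2, 2        # hypothetical prior Beta(2, 2)
n1, n0 = 7, 3      # hypothetical counts of x = 1 and x = 0
n = n1 + n0

prior_mean = Fraction(a, a + b)
mle = Fraction(n1, n)
post_mean = Fraction(a + n1, a + b + n)
post_var = Fraction((a + n1) * (b + n0), (a + b + n) ** 2 * (a + b + n + 1))

# Weight of the prior mean in the convex combination
lam = Fraction(a + b, a + b + n)
combo = lam * prior_mean + (1 - lam) * mle

print("prior mean:", prior_mean, " MLE:", mle, " posterior mean:", post_mean)
print("convex combination:", combo, " posterior variance:", post_var)
```

As N grows, `lam` shrinks toward zero, so the data increasingly dominate the prior.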
Posterior Distribution
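We can now compute the probability that the next coin flip is heads (x = 1), given a Beta(a,b) prior on the Bernoulli parameter μ and data D with N_1 heads among N flips:
$$p(x = 1 \mid D) = \int_0^1 p(x = 1 \mid \mu)\, p(\mu \mid D)\, d\mu = \mathbb{E}[\mu \mid D] = \frac{a+N_1}{a+b+N}$$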
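Properties of the Posterior Distribution
Consider the case of infinite data (N→∞), where the observed counts N_1 (of x=1) and N_0 (of x=0) dominate the prior parameters a and b:
$$a + N_1 \approx N_1, \qquad b + N_0 \approx N_0$$
and the posterior mean and variance of the Bernoulli parameter μ become:
$$\mathbb{E}[\mu \mid D] \to \frac{N_1}{N} = \mu_{ML}, \qquad \operatorname{var}[\mu \mid D] \approx \frac{N_1 N_0}{N^3} = \frac{\mu_{ML}(1-\mu_{ML})}{N} \to 0$$
For N→∞, the posterior, as expected, concentrates around the maximum likelihood estimate with vanishing variance (i.e. the uncertainty decreases as N→∞). Is this a general property?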
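The large-N behaviour can be illustrated with the closed-form Beta posterior moments. In this sketch the prior Beta(2, 2) and the assumption that exactly 70% of the observations are x=1 are hypothetical choices; `posterior_mean_var` is a name introduced here.

```python
def posterior_mean_var(a, b, n1, n0):
    """Mean and variance of the Beta(a + n1, b + n0) posterior."""
    n = n1 + n0
    mean = (a + n1) / (a + b + n)
    var = (a + n1) * (b + n0) / ((a + b + n) ** 2 * (a + b + n + 1))
    return mean, var

a, b = 2.0, 2.0   # hypothetical prior
results = []
for n in (10, 100, 1000, 10000, 100000):
    n1 = 7 * n // 10          # hypothetical data: 70% of observations are x = 1
    n0 = n - n1
    mean, var = posterior_mean_var(a, b, n1, n0)
    results.append((n, mean, var))
    print(f"N={n:>6}  posterior mean={mean:.6f}  posterior var={var:.3e}")
```

The posterior mean approaches the maximum likelihood estimate 0.7 and the variance shrinks roughly like 1/N.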
A Frequentist View of Bayesian Learning
Consider inference of parameter q using data D. We
expect that because the posterior p(q|D) incorporates the
information from the data D, it will imply less variability for q
than the prior p(q).
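We have the following identities:
$$\mathbb{E}[q] = \mathbb{E}_D\big[\mathbb{E}[q \mid D]\big]$$
$$\operatorname{var}[q] = \mathbb{E}_D\big[\operatorname{var}[q \mid D]\big] + \operatorname{var}_D\big[\mathbb{E}[q \mid D]\big]$$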
This means that on average over the realizations of the
data D, the conditional expectation E[q|D] is equal to E[q].
Also, the posterior variance on average is smaller than the
prior variance by an amount that depends on the variations in posterior means over the distribution of
possible data.
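$$\mathbb{E}[q] = \mathbb{E}_D\big[\mathbb{E}[q \mid D]\big]$$
$$\operatorname{var}[q] = \mathbb{E}_D\big[\operatorname{var}[q \mid D]\big] + \operatorname{var}_D\big[\mathbb{E}[q \mid D]\big] \ge \mathbb{E}_D\big[\operatorname{var}[q \mid D]\big]$$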
Posterior Mean
Note the not-surprising result regarding the posterior mean:
$$\underbrace{\mathbb{E}_D\big[\mathbb{E}[q \mid D]\big]}_{\text{posterior mean averaged over the data}} = \int \left( \int q\, p(q \mid D)\, dq \right) p(D)\, dD = \int\!\!\int q\, p(q, D)\, dq\, dD = \int q\, p(q)\, dq = \underbrace{\mathbb{E}[q]}_{\text{prior mean}}$$
Variance Decomposition Identity
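If (q, D) are two scalar random variables then we have:
$$\operatorname{var}[q] = \mathbb{E}_D\big[\operatorname{var}[q \mid D]\big] + \operatorname{var}_D\big[\mathbb{E}[q \mid D]\big]$$
Here is the proof:
$$\begin{aligned}
\mathbb{E}_D\big[\operatorname{var}[q \mid D]\big] &= \mathbb{E}_D\big[\mathbb{E}[q^2 \mid D] - (\mathbb{E}[q \mid D])^2\big] = \mathbb{E}[q^2] - \mathbb{E}_D\big[(\mathbb{E}[q \mid D])^2\big] \\
\operatorname{var}_D\big[\mathbb{E}[q \mid D]\big] &= \mathbb{E}_D\big[(\mathbb{E}[q \mid D])^2\big] - \big(\mathbb{E}_D[\mathbb{E}[q \mid D]]\big)^2 = \mathbb{E}_D\big[(\mathbb{E}[q \mid D])^2\big] - (\mathbb{E}[q])^2
\end{aligned}$$
Adding the two lines, the terms $\mathbb{E}_D[(\mathbb{E}[q \mid D])^2]$ cancel, leaving $\mathbb{E}[q^2] - (\mathbb{E}[q])^2 = \operatorname{var}[q]$.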
Posterior Variability
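We can derive a similar expression regarding the posterior variance:
$$\underbrace{\operatorname{var}[q]}_{\text{prior variance}} = \underbrace{\mathbb{E}_D\big[\operatorname{var}[q \mid D]\big]}_{\text{posterior variance averaged over all data}} + \operatorname{var}_D\big[\mathbb{E}[q \mid D]\big] \ge \mathbb{E}_D\big[\operatorname{var}[q \mid D]\big]$$
Thus on average (over the data), the variability in q decreases. For a particular observed data set D, it is however possible that
$$\operatorname{var}[q \mid D] > \operatorname{var}[q]$$
These results implicitly assume that the data follow the marginal distribution:
$$p(D) = \int p(D \mid q)\, p(q)\, dq$$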
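Both statements, the exact variance decomposition on average and the possibility of a larger posterior variance for one particular data set, can be checked exactly on a small Beta-Bernoulli example. The sketch below is an illustration under stated assumptions: the prior Beta(1, 20) and N = 5 trials are hypothetical choices, the helper names are introduced here, and the data are averaged under their marginal (Beta-binomial) distribution, written with rising factorials for integer a and b.

```python
from fractions import Fraction
from math import comb

def rising(x, n):
    """Rising factorial x (x+1) ... (x+n-1) for integers."""
    out = 1
    for i in range(n):
        out *= x + i
    return out

def beta_binomial_pmf(n1, n, a, b):
    """Marginal probability of n1 ones in n Bernoulli trials under a Beta(a, b) prior."""
    return Fraction(comb(n, n1) * rising(a, n1) * rising(b, n - n1), rising(a + b, n))

def beta_mean_var(a, b):
    """Exact mean and variance of Beta(a, b)."""
    return Fraction(a, a + b), Fraction(a * b, (a + b) ** 2 * (a + b + 1))

a, b, n = 1, 20, 5   # hypothetical prior and number of trials
prior_mean, prior_var = beta_mean_var(a, b)

avg_post_mean = Fraction(0)
avg_post_var = Fraction(0)
avg_sq_post_mean = Fraction(0)
larger_variance = []   # counts n1 whose posterior variance exceeds the prior variance
for n1 in range(n + 1):
    weight = beta_binomial_pmf(n1, n, a, b)
    post_mean, post_var = beta_mean_var(a + n1, b + n - n1)
    avg_post_mean += weight * post_mean
    avg_post_var += weight * post_var
    avg_sq_post_mean += weight * post_mean ** 2
    if post_var > prior_var:
        larger_variance.append(n1)

var_of_post_mean = avg_sq_post_mean - avg_post_mean ** 2

print("E_D[E[q|D]] == E[q]:", avg_post_mean == prior_mean)
print("E_D[var] + var_D[E] == var[q]:", avg_post_var + var_of_post_mean == prior_var)
print("data sets with larger posterior variance (n1 values):", larger_variance)
```

With this skewed prior, observing even a single x=1 is surprising enough to widen the posterior, although the average over all data sets is still smaller than the prior variance.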
The Gamma distribution is a two-parameter family of continuous distributions. It has a scale parameter θ > 0 and a shape parameter k > 0. If k is an integer, the distribution is that of the sum of k independent exponentially distributed random variables, each of which has a mean of θ (equivalently, a rate parameter of θ^{-1}).
More often, we use the rate parameter θ^{-1} in place of the scale θ.
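The sum-of-exponentials interpretation can be illustrated by simulation. The sketch below uses only the standard library; the seed, sample size and the values k = 3, θ = 2 are arbitrary choices, and the reference values kθ (mean) and kθ² (variance) are the standard Gamma moments in the scale parametrization.

```python
import random
import statistics

def sum_of_exponentials(k, theta, rng):
    """Sum of k independent exponential variables, each with mean theta."""
    # random.expovariate takes the rate, i.e. 1 / mean
    return sum(rng.expovariate(1.0 / theta) for _ in range(k))

rng = random.Random(0)     # fixed seed so the run is reproducible
k, theta = 3, 2.0          # arbitrary illustrative shape and scale
samples = [sum_of_exponentials(k, theta, rng) for _ in range(100000)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.pvariance(samples)
print(f"sample mean {sample_mean:.3f} vs k*theta = {k * theta}")
print(f"sample var  {sample_var:.3f} vs k*theta^2 = {k * theta ** 2}")
```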
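Gamma Distribution: Rate Parametrization
It is frequently used as a model for waiting times.
It is more often parameterized in terms of a shape parameter a = k and an inverse scale parameter b = 1/θ, called a rate parameter:
$$\text{Gamma}(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}, \quad x \ge 0, \quad a, b > 0, \qquad \Gamma(a) = \int_0^\infty u^{a-1} e^{-u}\, du$$
The mean, mode and variance with this parametrization are:
$$\mathbb{E}[x] = \frac{a}{b}, \qquad \text{mode}[x] = \frac{a-1}{b} \;\; (a \ge 1), \qquad \operatorname{var}[x] = \frac{a}{b^2}$$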
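To make the relation between the two parametrizations concrete, the sketch below evaluates the Gamma density in both forms and confirms that they agree when a = k and b = 1/θ. The function names are illustrative, and the scale-form density x^(k-1) e^(-x/θ) / (Γ(k) θ^k) is the standard expression, stated here as an assumption.

```python
import math

def gamma_pdf_rate(x, a, b):
    """Gamma density with shape a and rate b."""
    return b ** a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

def gamma_pdf_scale(x, k, theta):
    """Gamma density with shape k and scale theta (standard form)."""
    return x ** (k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)

def gamma_summary(a, b):
    """Mean, mode (for a >= 1) and variance in the rate parametrization."""
    return a / b, (a - 1) / b, a / b ** 2

a, b = 3.0, 2.0            # arbitrary illustrative shape and rate
k, theta = a, 1.0 / b      # equivalent shape and scale
for x in (0.25, 1.0, 2.5):
    print(x, gamma_pdf_rate(x, a, b), gamma_pdf_scale(x, k, theta))

print("mean, mode, variance:", gamma_summary(a, b))
```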
[Figure: plots of the Gamma pdf.]
As we increase the rate b, the distribution squeezes leftwards and upwards, since the mean a/b and the mode (a−1)/b both scale as 1/b.
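A small numeric check of how the rate b moves the peak of the Gamma density, using the rate-parametrized pdf b^a x^(a-1) e^(-bx) / Γ(a) and the mode (a−1)/b for a > 1 (both standard results). The shape a = 3 and the list of rates are arbitrary choices for illustration.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b."""
    return b ** a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

a = 3.0                      # arbitrary shape > 1 so the mode is interior
peaks = []
for b in (1.0, 2.0, 4.0):    # increasing rate
    mode = (a - 1) / b
    height = gamma_pdf(mode, a, b)
    peaks.append((b, mode, height))
    print(f"b={b}: peak at x={mode:.3f}, height={height:.4f}")
```

The peak location shrinks like 1/b while the peak height grows in proportion to b: a larger rate moves the density leftwards and upwards, and a smaller rate stretches it to the right and flattens it.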
An empirical PDF of rainfall data fitted with a Gamma
distribution.
Exponential Distribution
This is defined as
$$\text{Expon}(x \mid \lambda) = \text{Gamma}(x \mid 1, \lambda) = \lambda \exp(-\lambda x), \quad x \ge 0$$
Here λ is the rate parameter.
This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate λ.
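The link to the Poisson process can be illustrated with a crude discrete-time approximation: in each small time step of length dt an event occurs independently with probability λ·dt, and the gaps between events are then approximately exponential with mean 1/λ. This is a sketch under that approximation, not an exact simulation; the seed, λ = 2, dt = 0.001 and the function name `simulate_gaps` are assumptions made for the example.

```python
import math
import random

def simulate_gaps(rate, dt, steps, rng):
    """Times between events when each step of length dt has an event with probability rate * dt."""
    gaps = []
    last_event = 0
    for step in range(1, steps + 1):
        if rng.random() < rate * dt:
            gaps.append((step - last_event) * dt)
            last_event = step
    return gaps

rng = random.Random(1)    # fixed seed so the run is reproducible
rate = 2.0                # hypothetical event rate (lambda)
gaps = simulate_gaps(rate, dt=1e-3, steps=2_000_000, rng=rng)

mean_gap = sum(gaps) / len(gaps)
frac_long = sum(g > 0.5 for g in gaps) / len(gaps)
print(f"events: {len(gaps)}")
print(f"mean gap {mean_gap:.4f} vs 1/lambda = {1 / rate:.4f}")
print(f"P(gap > 0.5) {frac_long:.4f} vs exp(-lambda * 0.5) = {math.exp(-rate * 0.5):.4f}")
```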
Chi-Squared Distribution
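This is defined as
$$\chi^2(x \mid \nu) = \text{Gamma}\left(x \,\Big|\, \frac{\nu}{2}, \frac{1}{2}\right) = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, x^{\frac{\nu}{2} - 1} \exp\left(-\frac{x}{2}\right), \quad x \ge 0$$
This is the distribution of the sum of squared Gaussian random variables. More precisely, let $Z_i \sim \mathcal{N}(0, 1)$ independently and $S = \sum_{i=1}^{\nu} Z_i^2$. Then $S \sim \chi^2_\nu$.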
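The sum-of-squares characterization can be checked by simulation. Since the chi-squared distribution with ν degrees of freedom is Gamma(ν/2, 1/2) in the rate parametrization, its mean is (ν/2)/(1/2) = ν and its variance is (ν/2)/(1/2)² = 2ν. In the sketch below the seed, ν = 4 and the sample size are arbitrary choices.

```python
import random
import statistics

def chi2_sample(nu, rng):
    """One draw of S = Z_1^2 + ... + Z_nu^2 with independent standard normal Z_i."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))

rng = random.Random(0)    # fixed seed so the run is reproducible
nu = 4                    # arbitrary degrees of freedom
samples = [chi2_sample(nu, rng) for _ in range(50000)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.pvariance(samples)
print(f"sample mean {sample_mean:.3f} vs nu = {nu}")
print(f"sample var  {sample_var:.3f} vs 2*nu = {2 * nu}")
```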
Inverse Gamma Distribution
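This is defined as follows:
$$\text{If } X \sim \text{Gamma}(X \mid a, b) \text{ then } X^{-1} \sim \text{InvGamma}(X \mid a, b)$$
where:
$$\text{InvGamma}(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{-(a+1)} \exp(-b/x), \quad x > 0$$
a is the shape parameter and b the scale parameter.
It can be shown that:
$$\text{Mean} = \frac{b}{a-1} \;\; (\text{exists for } a > 1), \qquad \text{Mode} = \frac{b}{a+1}, \qquad \operatorname{var} = \frac{b^2}{(a-1)^2 (a-2)} \;\; (\text{exists for } a > 2)$$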
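The relation between the Gamma and Inverse Gamma distributions suggests a simple way to simulate the latter: draw X from Gamma(a, b) with rate b and take 1/X. The sketch below does this with the standard library and compares sample moments with b/(a−1) and b²/((a−1)²(a−2)). The seed and the values a = 8, b = 3 are arbitrary; note that `random.gammavariate` takes a shape and a scale, so the rate b enters as scale 1/b.

```python
import random
import statistics

rng = random.Random(0)    # fixed seed so the run is reproducible
a, b = 8.0, 3.0           # arbitrary shape and scale of the Inverse Gamma

# X ~ Gamma(shape a, rate b)  =>  1/X ~ InvGamma(a, b)
samples = [1.0 / rng.gammavariate(a, 1.0 / b) for _ in range(100000)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.pvariance(samples)
exact_mean = b / (a - 1)
exact_var = b ** 2 / ((a - 1) ** 2 * (a - 2))
print(f"mean: sample {sample_mean:.4f}, formula {exact_mean:.4f}")
print(f"var:  sample {sample_var:.5f}, formula {exact_var:.5f}")
```

A fairly large shape is used on purpose: for a close to 2 the variance exists but sample estimates converge very slowly because of the heavy tail.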
The Pareto Distribution
Used to model the distribution of quantities that exhibit long tails (heavy tails):
$$\text{Pareto}(x \mid k, m) = k\, m^k\, x^{-(k+1)}\, \mathbb{I}(x \ge m)$$
This density asserts that x must be greater than some constant m, but not too much greater; k controls what is “too much”.
As k → ∞, the distribution approaches δ(x − m).
On a log-log scale, the pdf forms a straight line of the form log p(x) = a log x + c for some constants a and c (power law, Zipf’s law).
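The straight-line behaviour on a log-log scale follows from taking logarithms of the density k m^k x^(-(k+1)) for x ≥ m: the slope is −(k+1) and the intercept is log(k m^k). The sketch below checks this numerically; the function names and the values k = 2, m = 1.5 are illustrative assumptions.

```python
import math

def pareto_pdf(x, k, m):
    """Pareto density k m^k x^-(k+1) for x >= m, and 0 below m."""
    if x < m:
        return 0.0
    return k * m ** k * x ** (-(k + 1))

def loglog_slope(x1, x2, k, m):
    """Slope of log p(x) versus log x between two points x1, x2 >= m."""
    dy = math.log(pareto_pdf(x2, k, m)) - math.log(pareto_pdf(x1, k, m))
    dx = math.log(x2) - math.log(x1)
    return dy / dx

k, m = 2.0, 1.5   # arbitrary illustrative parameters
slopes = [loglog_slope(x, 10 * x, k, m) for x in (2.0, 20.0, 200.0)]
intercept = math.log(pareto_pdf(2.0, k, m)) + (k + 1) * math.log(2.0)

print("slopes:", slopes, "expected:", -(k + 1))
print("intercept:", intercept, "expected:", math.log(k * m ** k))
```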
Applications: Modeling the frequency of words vs their rank, distribution of wealth (k=Pareto Index), etc.