Beta and Gamma Distributions


Bayesian Scientific Computing, Spring 2013 (N. Zabaras)


• Beta Distribution

• Gamma Function, Normalization of the Beta Distribution, Beta as a Prior to Bernoulli, Posterior and Predictive Distributions

• A Frequentist View of Bayesian Learning, Variance Decomposition

• Gamma Distribution

• Exponential Distribution

• Chi-Squared Distribution

• Inverse Gamma Distribution

• The Pareto Distribution





Beta Distribution: Normalization

Beta Distribution

[Figure: four panels showing the pdf of Beta(0.1,0.1), Beta(1,1), Beta(2,3) and Beta(8,4) against x on [0, 1]; the pdf axis runs from 0 to 3.]
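The four panels can be reproduced numerically. The following is a minimal sketch, assuming the standard density Beta(x|a,b) = Γ(a+b)/(Γ(a)Γ(b)) x^(a-1) (1-x)^(b-1) on (0, 1); the helper name `beta_pdf` and the evaluation grid are illustrative choices, not taken from the slides.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), evaluated in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

# Shape parameters of the four panels in the figure.
panels = [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]
grid = [0.1, 0.3, 0.5, 0.7, 0.9]  # illustrative evaluation points

for a, b in panels:
    values = ", ".join(f"{beta_pdf(x, a, b):.3f}" for x in grid)
    print(f"Beta({a},{b}): {values}")
```

Beta(1,1) is the uniform density, while parameters below 1 (as in Beta(0.1,0.1)) push the mass towards the endpoints 0 and 1.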

• Assuming a Bernoulli likelihood and a Beta prior, the posterior is also a Beta distribution.

• a and b are the effective numbers of observations of x=1 and x=0, respectively, introduced by the prior (they do not have to be integers).
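A minimal sketch of the conjugate update, assuming the usual result that a Beta(a, b) prior combined with N Bernoulli observations containing m ones gives a Beta(a+m, b+N-m) posterior. The function name, the symbols m and N, and the example data are illustrative and do not come from the slides.

```python
def beta_bernoulli_update(a, b, data):
    """Return the Beta posterior parameters after observing a list of 0/1 outcomes."""
    ones = sum(data)            # number of observations with x = 1
    zeros = len(data) - ones    # number of observations with x = 0
    # The observed counts are simply added to the prior pseudo-counts.
    return a + ones, b + zeros

prior = (2.0, 2.0)                 # illustrative prior pseudo-counts (a, b)
data = [1, 0, 1, 1, 0, 1, 1, 1]    # illustrative coin flips: six ones, two zeros
posterior = beta_bernoulli_update(*prior, data)
print(posterior)
```

With these illustrative numbers the posterior is Beta(8, 4): the prior acts like two earlier observations of each outcome.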

Posterior Mean and Variance

• From the properties of the Beta distribution, we can compute the posterior mean and variance.

• The posterior mean always lies between the prior mean and the MLE.

• This can be shown easily by noticing that the posterior mean is a weighted average of the prior mean and the MLE.
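The interpolation property can be checked numerically. The sketch below assumes the standard Beta moments, mean a/(a+b) and variance ab/((a+b)^2 (a+b+1)), and a posterior Beta(a+m, b+N-m) after m outcomes x=1 in N trials. The symbols m, N and the weight `lam` are notation introduced here for illustration.

```python
def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    s = a + b
    return a / s, a * b / (s * s * (s + 1))

a, b = 2.0, 2.0      # illustrative prior pseudo-counts
m, n = 6, 8          # illustrative data: m ones in n trials

prior_mean, prior_var = beta_mean_var(a, b)
post_mean, post_var = beta_mean_var(a + m, b + n - m)
mle = m / n

# Weight placed on the prior mean; it shrinks as the sample size n grows.
lam = (a + b) / (a + b + n)
combo = lam * prior_mean + (1 - lam) * mle

print(f"prior mean {prior_mean:.4f}, MLE {mle:.4f}, posterior mean {post_mean:.4f}")
print(f"weighted average {combo:.4f} with weight {lam:.4f} on the prior mean")
```

Because the weight lies between 0 and 1, the posterior mean cannot fall outside the interval spanned by the prior mean and the MLE.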

Posterior Distribution

• We can now compute the predictive probability for the next coin flip.
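Assuming the quantity of interest is the predictive probability of the outcome x=1, it is obtained by averaging the Bernoulli parameter over its Beta posterior, which gives the posterior mean. The sketch below compares a simple midpoint-rule integration with that closed form; the posterior parameters are illustrative and the function names are not from the slides.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))

def predictive_prob_one(a_post, b_post, steps=20000):
    """Midpoint-rule approximation of the integral of theta * Beta(theta | a_post, b_post)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        total += theta * beta_pdf(theta, a_post, b_post) * h
    return total

a_post, b_post = 8.0, 4.0                  # illustrative posterior parameters
numeric = predictive_prob_one(a_post, b_post)
closed_form = a_post / (a_post + b_post)   # posterior mean of the Beta distribution
print(f"numerical integral {numeric:.6f}, closed form {closed_form:.6f}")
```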

Properties of the Posterior Distribution

• Consider the case of infinite data (N→∞) and the resulting posterior mean and variance.

• As N→∞, the posterior, as expected, spikes around the MLE with vanishing variance (i.e. the uncertainty decreases as N grows). Is this a general property?
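The large-N behaviour can be illustrated with a short computation. This is a sketch under stated assumptions: a Beta(a, b) prior, a fixed fraction of outcomes x=1 as N grows, and the standard Beta mean and variance formulas; the specific numbers are illustrative.

```python
def posterior_mean_var(a, b, m, n):
    """Mean and variance of the Beta(a + m, b + n - m) posterior."""
    a_n, b_n = a + m, b + (n - m)
    s = a_n + b_n
    return a_n / s, a_n * b_n / (s * s * (s + 1))

a, b = 2.0, 2.0   # illustrative prior pseudo-counts
frac = 0.75       # illustrative fixed fraction of ones, so the MLE is always 0.75

results = []
for n in [8, 80, 800, 8000, 80000]:
    m = int(frac * n)
    mean, var = posterior_mean_var(a, b, m, n)
    results.append((n, mean, var))
    print(f"N={n:6d}  posterior mean={mean:.5f}  posterior variance={var:.2e}")
```

The posterior mean approaches the MLE and the variance shrinks roughly like 1/N, so the influence of the prior pseudo-counts fades.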

A Frequentist View of Bayesian Learning

• Consider inference of a parameter θ using data D. Because the posterior p(θ|D) incorporates the information from the data D, we expect it to imply less variability for θ than the prior p(θ).

• We have the following identities:

$$
\mathbb{E}[\theta] = \mathbb{E}_{D}\big[\mathbb{E}[\theta \mid D]\big]
$$

$$
\mathrm{var}[\theta] = \mathbb{E}_{D}\big[\mathrm{var}[\theta \mid D]\big] + \mathrm{var}_{D}\big[\mathbb{E}[\theta \mid D]\big] \;\ge\; \mathbb{E}_{D}\big[\mathrm{var}[\theta \mid D]\big]
$$

• This means that, on average over the realizations of the data D, the conditional expectation E[θ|D] is equal to E[θ].

• Also, the posterior variance is on average smaller than the prior variance, by an amount that depends on the variation of the posterior means over the distribution of possible data.
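Both identities can be verified exactly for the Beta-Bernoulli model. The sketch below assumes a Beta(a, b) prior and n Bernoulli trials, so that the number of ones m has the Beta-binomial marginal p(m) = C(n, m) B(a+m, b+n-m) / B(a, b) and the posterior is Beta(a+m, b+n-m). The parameter values and helper names are illustrative.

```python
import math

def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    s = a + b
    return a / s, a * b / (s * s * (s + 1))

a, b, n = 2.0, 3.0, 10   # illustrative prior and sample size
prior_mean, prior_var = beta_mean_var(a, b)

avg_post_mean = 0.0      # E_D[ E[theta | D] ]
avg_post_var = 0.0       # E_D[ var[theta | D] ]
avg_sq_post_mean = 0.0   # E_D[ E[theta | D]^2 ]
for m in range(n + 1):
    # Marginal probability of seeing m ones in n trials under the prior.
    p_m = math.comb(n, m) * math.exp(log_beta_fn(a + m, b + n - m) - log_beta_fn(a, b))
    mean, var = beta_mean_var(a + m, b + n - m)
    avg_post_mean += p_m * mean
    avg_post_var += p_m * var
    avg_sq_post_mean += p_m * mean ** 2

var_post_mean = avg_sq_post_mean - avg_post_mean ** 2   # var_D[ E[theta | D] ]
print(f"prior mean {prior_mean:.6f} vs averaged posterior mean {avg_post_mean:.6f}")
print(f"prior variance {prior_var:.6f} vs sum of terms {avg_post_var + var_post_mean:.6f}")
```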

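Posterior Mean

Note the unsurprising result regarding the posterior mean:

$$
\underbrace{\mathbb{E}[\theta \mid D]}_{\text{posterior mean}} = \int \theta\, p(\theta \mid D)\, d\theta
$$

$$
\underbrace{\mathbb{E}_{D}\big[\mathbb{E}[\theta \mid D]\big]}_{\text{posterior mean averaged over the data}} = \iint \theta\, p(\theta \mid D)\, p(D)\, d\theta\, dD = \iint \theta\, p(\theta, D)\, d\theta\, dD = \int \theta\, p(\theta)\, d\theta = \underbrace{\mathbb{E}[\theta]}_{\text{prior mean}}
$$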

Variance Decomposition Identity

If (θ, D) are two scalar random variables, then we have:

$$
\mathrm{var}[\theta] = \mathbb{E}\big[\mathrm{var}[\theta \mid D]\big] + \mathrm{var}\big[\mathbb{E}[\theta \mid D]\big]
$$
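One standard way to establish the variance decomposition identity, sketched here under the assumption that θ has a finite second moment, is to expand both terms on the right-hand side and use the tower property E[E[· | D]] = E[·]. This is a textbook argument and not necessarily the proof given on the original slide.

```latex
\begin{align*}
\mathbb{E}\big[\mathrm{var}[\theta \mid D]\big]
  &= \mathbb{E}\big[\mathbb{E}[\theta^{2} \mid D]\big]
     - \mathbb{E}\big[(\mathbb{E}[\theta \mid D])^{2}\big]
   = \mathbb{E}[\theta^{2}] - \mathbb{E}\big[(\mathbb{E}[\theta \mid D])^{2}\big], \\
\mathrm{var}\big[\mathbb{E}[\theta \mid D]\big]
  &= \mathbb{E}\big[(\mathbb{E}[\theta \mid D])^{2}\big]
     - \big(\mathbb{E}\big[\mathbb{E}[\theta \mid D]\big]\big)^{2}
   = \mathbb{E}\big[(\mathbb{E}[\theta \mid D])^{2}\big] - (\mathbb{E}[\theta])^{2}, \\
\mathbb{E}\big[\mathrm{var}[\theta \mid D]\big] + \mathrm{var}\big[\mathbb{E}[\theta \mid D]\big]
  &= \mathbb{E}[\theta^{2}] - (\mathbb{E}[\theta])^{2} = \mathrm{var}[\theta].
\end{align*}
```

The cross term E[(E[θ|D])²] cancels when the first two lines are added, which is what leaves the prior variance.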

Posterior Variability

• We can derive a similar expression regarding the posterior variance:

$$
\underbrace{\mathrm{var}[\theta]}_{\text{prior variance}} = \underbrace{\mathbb{E}_{D}\big[\mathrm{var}[\theta \mid D]\big]}_{\text{posterior variance averaged over all data}} + \mathrm{var}_{D}\big[\mathbb{E}[\theta \mid D]\big] \;\ge\; \mathbb{E}_{D}\big[\mathrm{var}[\theta \mid D]\big]
$$

• Thus, on average (over the data), the variability in θ decreases. For a particular observed data set D, it is however possible that var[θ|D] > var[θ].

• These results implicitly assume that the data follow the marginal distribution p(D) = ∫ p(D|θ) p(θ) dθ.
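A small example shows how a particular observation can increase the variance even though the average over the data decreases it. The sketch assumes a Beta(1, 9) prior on a Bernoulli parameter and a single observation; these numbers are illustrative and chosen so that the outcome x=1 is surprising under the prior.

```python
def beta_var(a, b):
    """Variance of a Beta(a, b) distribution."""
    s = a + b
    return a * b / (s * s * (s + 1))

a, b = 1.0, 9.0                       # illustrative prior: x = 1 is expected to be rare
prior_var = beta_var(a, b)

var_if_one = beta_var(a + 1, b)       # posterior variance after observing x = 1
var_if_zero = beta_var(a, b + 1)      # posterior variance after observing x = 0

p_one = a / (a + b)                   # marginal probability of x = 1 under the prior
avg_post_var = p_one * var_if_one + (1 - p_one) * var_if_zero

print(f"prior variance           {prior_var:.6f}")
print(f"posterior variance, x=1  {var_if_one:.6f}")
print(f"posterior variance, x=0  {var_if_zero:.6f}")
print(f"average over the data    {avg_post_var:.6f}")
```

Observing the surprising outcome x=1 widens the posterior, but that outcome has low marginal probability, so the data-averaged posterior variance still falls below the prior variance.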

• The Gamma distribution is a two-parameter family of continuous distributions. It has a scale parameter θ>0 and a shape parameter k>0. If k is an integer, then the distribution represents the sum of k independent exponentially distributed random variables, each of which has a mean of θ (equivalently, a rate parameter of 1/θ).
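The sum-of-exponentials interpretation can be checked by simulation. The sketch below assumes the standard moments of a Gamma distribution with shape k and scale θ, namely mean kθ and variance kθ²; the parameter values, seed and sample size are illustrative.

```python
import random

random.seed(0)            # fixed seed so the run is reproducible
k, theta = 3, 2.0         # illustrative shape (integer) and scale
n_samples = 100_000

# Each draw is the sum of k independent exponentials with mean theta.
# random.expovariate takes the rate, i.e. 1 / mean.
sums = [sum(random.expovariate(1.0 / theta) for _ in range(k)) for _ in range(n_samples)]

sample_mean = sum(sums) / n_samples
sample_var = sum((s - sample_mean) ** 2 for s in sums) / (n_samples - 1)

print(f"sample mean {sample_mean:.3f}  (Gamma mean k*theta = {k * theta})")
print(f"sample var  {sample_var:.3f}  (Gamma variance k*theta^2 = {k * theta ** 2})")
```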

Gamma Distribution: Rate Parametrization

• It is frequently a model for waiting times.

• It is more often parameterized in terms of a shape parameter a = k and an inverse scale parameter b = 1/θ, called a rate parameter:

$$
\mathrm{Gamma}(x \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{a-1} e^{-bx}, \quad x, a, b > 0, \qquad \Gamma(a) = \int_{0}^{\infty} u^{a-1} e^{-u}\, du
$$

• The mean, mode and variance with this parametrization are:

$$
\mathbb{E}[x] = \frac{a}{b}, \qquad \mathrm{mode}[x] = \frac{a-1}{b} \;\; (a \ge 1), \qquad \mathrm{var}[x] = \frac{a}{b^{2}}
$$
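The moments of the rate parametrization can be confirmed by numerical integration. This is a minimal sketch assuming the density b^a / Γ(a) · x^(a-1) e^(-bx) and the standard results mean a/b, mode (a-1)/b for a ≥ 1, and variance a/b²; the grid, the truncation at x = 40 and the parameter values are illustrative choices.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b for x > 0, evaluated in log space."""
    return math.exp(a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x)

a, b = 3.0, 2.0                                  # illustrative shape and rate
h = 1e-3
xs = [(i + 0.5) * h for i in range(40_000)]      # midpoints covering (0, 40)
pdf = [gamma_pdf(x, a, b) for x in xs]

mass = sum(pdf) * h                                              # should be close to 1
mean = sum(x * p for x, p in zip(xs, pdf)) * h                   # close to a / b
var = sum((x - mean) ** 2 * p for x, p in zip(xs, pdf)) * h      # close to a / b**2
mode = xs[max(range(len(xs)), key=lambda i: pdf[i])]             # grid point with largest density

print(f"mass {mass:.6f}, mean {mean:.6f}, variance {var:.6f}, mode {mode:.4f}")
```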

[Figure: plots of the Gamma pdf.]

As we increase the rate b, the distribution squeezes leftwards and upwards.
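The effect of the rate on the shape can be seen by tracking the peak of the density. The sketch assumes the rate parametrization with shape a and rate b, for which the mode is (a-1)/b when a ≥ 1; the values of a and b are illustrative.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b for x > 0."""
    return math.exp(a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x)

a = 2.0                          # illustrative fixed shape
peaks = []
for b in [0.5, 1.0, 2.0, 4.0]:   # increasing rate
    mode = (a - 1) / b           # location of the peak
    height = gamma_pdf(mode, a, b)
    peaks.append((b, mode, height))
    print(f"b={b:3.1f}  mode={mode:.3f}  pdf at mode={height:.3f}")
```

As b grows the peak moves towards zero and becomes taller, since the horizontal scale 1/b shrinks while the total mass stays equal to one.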

[Figure: an empirical pdf of rainfall data fitted with a Gamma distribution.]

Exponential Distribution

This is defined as

$$
\mathrm{Expon}(x \mid \lambda) = \mathrm{Gamma}(x \mid 1, \lambda) = \lambda \exp(-\lambda x), \quad x \ge 0
$$

• Here λ is the rate parameter.

This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate λ.
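Both statements can be illustrated in a few lines. The sketch first compares the exponential density with a Gamma density of shape 1 and rate λ, then simulates arrivals with exponential gaps and estimates the event rate. The value of λ, the time horizon and the seed are illustrative assumptions.

```python
import math
import random

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b for x > 0."""
    return math.exp(a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x)

def expon_pdf(x, lam):
    """Exponential density with rate lam for x >= 0."""
    return lam * math.exp(-lam * x)

lam = 1.5   # illustrative rate
max_gap = max(abs(expon_pdf(x, lam) - gamma_pdf(x, 1.0, lam)) for x in [0.1, 0.5, 1.0, 2.0, 5.0])

# Simulate a Poisson process by accumulating exponential inter-event times.
random.seed(1)
t, events, horizon = 0.0, 0, 10_000.0
while True:
    t += random.expovariate(lam)
    if t > horizon:
        break
    events += 1
rate_estimate = events / horizon

print(f"largest pdf difference {max_gap:.2e}")
print(f"estimated event rate {rate_estimate:.3f} (lambda = {lam})")
```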

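Chi-Squared Distribution

This is defined as

$$
\chi^{2}(x \mid \nu) = \mathrm{Gamma}\Big(x \,\Big|\, \frac{\nu}{2}, \frac{1}{2}\Big) = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp\Big(-\frac{x}{2}\Big), \quad x \ge 0
$$

• This is the distribution of the sum of squares of independent standard Gaussian random variables. More precisely,

$$
\text{let } Z_i \sim \mathcal{N}(0, 1) \text{ and } S = \sum_{i=1}^{\nu} Z_i^{2}. \text{ Then } S \sim \chi^{2}_{\nu}.
$$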
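The link between the two definitions can be checked numerically. The sketch below compares the chi-squared density with Gamma(ν/2, 1/2) pointwise and then simulates sums of ν squared standard normals, whose mean and variance should be close to ν and 2ν (the Gamma moments a/b and a/b² with a = ν/2, b = 1/2). The value of ν, the seed and the sample size are illustrative.

```python
import math
import random

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b for x > 0."""
    return math.exp(a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x)

def chi2_pdf(x, nu):
    """Chi-squared density with nu degrees of freedom, written out directly."""
    return x ** (nu / 2 - 1) * math.exp(-x / 2) / (2 ** (nu / 2) * math.gamma(nu / 2))

nu = 4   # illustrative degrees of freedom
pdf_gap = max(abs(chi2_pdf(x, nu) - gamma_pdf(x, nu / 2, 0.5)) for x in [0.5, 1.0, 3.0, 8.0])

random.seed(0)
n_samples = 100_000
draws = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(nu)) for _ in range(n_samples)]
sample_mean = sum(draws) / n_samples
sample_var = sum((s - sample_mean) ** 2 for s in draws) / (n_samples - 1)

print(f"largest pdf difference {pdf_gap:.2e}")
print(f"sample mean {sample_mean:.3f} (nu = {nu}), sample variance {sample_var:.3f} (2*nu = {2 * nu})")
```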

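Inverse Gamma Distribution

This is defined as follows:

$$
\text{If } X \sim \mathrm{Gamma}(X \mid a, b) \;\Rightarrow\; X^{-1} \sim \mathrm{InvGamma}(X \mid a, b)
$$

where:

$$
\mathrm{InvGamma}(x \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{-(a+1)} \exp(-b/x), \quad x > 0
$$

• a is the shape parameter and b is the scale parameter.

It can be shown that:

$$
\text{mean} = \frac{b}{a-1} \;\; (\text{exists for } a > 1), \qquad \text{mode} = \frac{b}{a+1}, \qquad \mathrm{var} = \frac{b^{2}}{(a-1)^{2}(a-2)} \;\; (\text{exists for } a > 2)
$$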
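The relation between the two densities follows from a change of variables: if Y = 1/X, then p_Y(y) = p_X(1/y) / y². The sketch below checks this pointwise and locates the mode on a grid. It assumes the Gamma density in the rate parametrization; the parameter values and grid are illustrative.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and rate b for x > 0."""
    return math.exp(a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x)

def inv_gamma_pdf(x, a, b):
    """Inverse Gamma density with shape a and scale b for x > 0."""
    return math.exp(a * math.log(b) - math.lgamma(a) - (a + 1) * math.log(x) - b / x)

a, b = 3.0, 2.0   # illustrative shape and scale

# Density of Y = 1/X obtained from the Gamma density by change of variables.
points = [0.2, 0.5, 1.0, 2.0, 5.0]
gap = max(abs(gamma_pdf(1.0 / y, a, b) / y ** 2 - inv_gamma_pdf(y, a, b)) for y in points)

# Grid search for the mode, to compare with b / (a + 1).
grid = [i * 0.001 for i in range(1, 5001)]
mode = max(grid, key=lambda y: inv_gamma_pdf(y, a, b))

print(f"largest difference between the two densities {gap:.2e}")
print(f"grid mode {mode:.3f}, formula b/(a+1) = {b / (a + 1):.3f}")
```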

The Pareto Distribution

Used to model the distribution of quantities that exhibit long tails (heavy tails):

$$
\mathrm{Pareto}(x \mid k, m) = k\, m^{k}\, x^{-(k+1)}\, \mathbb{I}(x \ge m)
$$

This density asserts that x must be greater than some constant m, but not too much greater, where k controls what is “too much”.

As k → ∞, the distribution approaches δ(x − m).

On a log-log scale, the pdf forms a straight line of the form log p(x) = a log x + c for some constants a and c (power law, Zipf’s law).
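The straight line on a log-log scale can be made explicit. Taking logarithms of the density k m^k x^(-(k+1)) for x ≥ m gives slope a = -(k+1) and intercept c = log(k m^k). The sketch below recovers both from two points on the curve; the parameter values and evaluation points are illustrative.

```python
import math

def pareto_pdf(x, k, m):
    """Pareto density with index k and lower bound m; zero below m."""
    return k * m ** k * x ** (-(k + 1)) if x >= m else 0.0

k, m = 2.5, 1.0          # illustrative Pareto index and lower bound
x1, x2 = 2.0, 20.0       # two illustrative points above m

# Slope and intercept of log p(x) against log x.
slope = (math.log(pareto_pdf(x2, k, m)) - math.log(pareto_pdf(x1, k, m))) / (math.log(x2) - math.log(x1))
intercept = math.log(pareto_pdf(x1, k, m)) - slope * math.log(x1)

print(f"slope {slope:.4f}      (expected -(k+1) = {-(k + 1)})")
print(f"intercept {intercept:.4f}  (expected log(k m^k) = {math.log(k * m ** k):.4f})")
```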


Applications: Modeling the frequency of words vs their rank, distribution of wealth (k=Pareto Index), etc.
