Estimating the evidence for statistical models

Nial Friel

University College Dublin

[email protected]

Introduction

Bayesian model choice

Given data $y$ and competing models $m_1, \dots, m_l$, each with parameters $\theta_1, \dots, \theta_l$, respectively.

Bayesian inference:

Model evidence

This is often called the marginal likelihood, integrated likelihood or

Posterior model probabilities

Suppose we could compute $\pi(y \mid m_k)$. Then, using Bayes' theorem, we get

$$\pi(m_k \mid y) = \frac{\pi(y \mid m_k)\,\pi(m_k)}{\sum_{j=1}^{l} \pi(y \mid m_j)\,\pi(m_j)}.$$
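As a rough illustration of this formula, the Python sketch below turns log evidences into posterior model probabilities. The function name `posterior_model_probs`, the uniform default prior and the numerical log evidences are assumptions made for the example and do not come from the talk; working on the log scale is simply a common way to avoid underflow.

```python
import math

def posterior_model_probs(log_evidences, prior_probs=None):
    """Return pi(m_k | y) given log pi(y | m_k) and prior probabilities pi(m_k)."""
    l = len(log_evidences)
    if prior_probs is None:
        prior_probs = [1.0 / l] * l  # assumption: uniform prior over models
    log_terms = [le + math.log(p) for le, p in zip(log_evidences, prior_probs)]
    shift = max(log_terms)  # subtract the maximum before exponentiating
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log evidences for three competing models
log_evidences = [-104.2, -101.7, -110.3]
probs = posterior_model_probs(log_evidences)

# Log Bayes factor of model 1 against model 2
log_bf_12 = log_evidences[0] - log_evidences[1]
```

With equal prior model probabilities the posterior odds `probs[0] / probs[1]` coincide with the Bayes factor `math.exp(log_bf_12)`.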

Bayes factors

If we have two competing models:

$$\frac{\pi(m_1 \mid y)}{\pi(m_2 \mid y)} = \frac{\pi(y \mid m_1)}{\pi(y \mid m_2)} \times \frac{\pi(m_1)}{\pi(m_2)}$$

posterior odds = Bayes factor × prior odds

The Bayes factor is $B_{12} = \pi(y \mid m_1) / \pi(y \mid m_2)$.

The larger $B_{12}$ is, the greater the evidence in favour of $m_1$.

Bayesian model averaging

Predictions can be made by averaging over all models, each weighted by its posterior model probability, thereby incorporating model uncertainty.

$$\pi(y^* \mid y) = \sum_{k=1}^{l} \pi(y^* \mid m_k, y)\,\pi(m_k \mid y)$$

This is the average of the posterior distribution for $y^*$ under each

Why estimating the model evidence is a challenge

- $\pi(y \mid m_k)$ is an integral of a (usually) highly variable function over a high-dimensional parameter space.

- An analytic solution is sometimes available, typically when conjugate priors are used, but this is quite rare.

Within model search or across model search?

Within model search:

Inference for $\pi(\theta_k \mid y)$ is carried out separately for every $m_k$. This is used to estimate $\pi(y \mid m_k)$, for all $k$.

There are many approaches under this heading.

Across model search:

Here inference is carried out over the joint model and parameter space, $\pi(\theta_k, m_k \mid y)$. In an MCMC setting, only one chain is needed!

Reversible jump Markov chain Monte Carlo developed by Green

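Review of evidence estimation

Laplace's method

(e.g. Tierney and Kadane, 1986)

Assume that $\pi(\theta_k \mid y)$ is highly peaked around the posterior mode $\tilde\theta_k$, e.g. if the sample size is large enough.

Define

$$l(\theta_k) = \log\{\pi(y \mid \theta_k)\,\pi(\theta_k)\}.$$

- Expand $l(\theta_k)$ as a quadratic about $\tilde\theta_k$ and then exponentiate.

- The result approximates $\pi(y \mid \theta_k)\,\pi(\theta_k)$ by an unnormalised Gaussian with mean $\tilde\theta_k$ and covariance $\tilde\Sigma = (-D^2 l(\tilde\theta_k))^{-1}$, where $D^2 l(\tilde\theta_k)$ is the Hessian matrix of second derivatives.

- Integrating this approximation yields

$$\pi(y \mid m_k) \approx (2\pi)^{d/2}\,|\tilde\Sigma|^{1/2}\,\pi(y \mid \tilde\theta_k)\,\pi(\tilde\theta_k),$$

where $d$ is the dimension of $\theta_k$.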
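A minimal Python sketch of Laplace's method for a one-parameter model follows. It assumes the usual form of the result on the log scale, $l(\tilde\theta) + (d/2)\log 2\pi + \tfrac{1}{2}\log|\tilde\Sigma|$. The toy model ($y_i \sim N(\theta, 1)$ with a $N(m, v)$ prior), the finite-difference Newton search and all function names are assumptions made for illustration; in this Gaussian model the approximation happens to be exact, which makes it easy to check.

```python
import math

def log_joint(theta, y, m, v):
    """l(theta) = log{pi(y|theta) pi(theta)} for y_i ~ N(theta, 1), theta ~ N(m, v)."""
    n = len(y)
    loglik = -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((yi - theta) ** 2 for yi in y)
    logprior = -0.5 * math.log(2 * math.pi * v) - 0.5 * (theta - m) ** 2 / v
    return loglik + logprior

def laplace_log_evidence(y, m, v, h=1e-3, iters=20):
    theta = sum(y) / len(y)  # start Newton's method at the sample mean
    for _ in range(iters):
        up, mid, down = (log_joint(theta + h, y, m, v),
                         log_joint(theta, y, m, v),
                         log_joint(theta - h, y, m, v))
        grad = (up - down) / (2 * h)            # central-difference gradient
        hess = (up - 2 * mid + down) / h ** 2   # central-difference Hessian
        theta -= grad / hess
    sigma2 = -1.0 / hess  # (-D^2 l)^{-1} near the mode
    # d = 1: l(mode) + (d/2) log(2 pi) + (1/2) log|Sigma|
    return log_joint(theta, y, m, v) + 0.5 * math.log(2 * math.pi) + 0.5 * math.log(sigma2)

def exact_log_evidence(y, m, v):
    """Closed-form log pi(y) for the same conjugate model, used as a check."""
    n = len(y)
    s1 = sum(yi - m for yi in y)
    s2 = sum((yi - m) ** 2 for yi in y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * v)
            - 0.5 * (s2 - v * s1 ** 2 / (1 + n * v)))

y = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
laplace = laplace_log_evidence(y, m=0.0, v=10.0)
exact = exact_log_evidence(y, m=0.0, v=10.0)
```

For non-Gaussian posteriors the two quantities would differ, with the gap typically shrinking as the sample size grows.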

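Harmonic mean estimator

(Newton and Raftery, 1994)

$$\pi(y) \approx 1 \Big/ \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\pi(y \mid \theta_i)} \right), \qquad \theta_i \sim \pi(\theta \mid y).$$

Why does this hold?

$$E_{\theta \mid y}\left[\frac{1}{\pi(y \mid \theta)}\right] = \int \frac{1}{\pi(y \mid \theta)}\,\pi(\theta \mid y)\,d\theta = \int \frac{1}{\pi(y \mid \theta)}\,\frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(y)}\,d\theta = \frac{1}{\pi(y)} \int \pi(\theta)\,d\theta = \frac{1}{\pi(y)}.$$

The bad news?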

This estimator is based solely on draws from the posterior. But the posterior is typically much more peaked than the prior and is often insensitive to it. In such situations, the harmonic mean estimator will not change much as the prior changes.

But $\pi(y)$ is very sensitive to changes in the prior.

This drawback is very well documented. See Radford Neal's blog, for example.
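The following Python sketch shows one way the harmonic mean estimator might be computed on the log scale, and sets up the comparison described above by repeating it for increasingly diffuse priors. The toy model ($y_i \sim N(\theta, 1)$, $\theta \sim N(0, v)$, for which the posterior can be sampled exactly), the data, the seed and the function names are all assumptions made for illustration.

```python
import math
import random

def log_harmonic_mean(loglik_draws):
    """log of 1 / mean(1 / pi(y|theta_i)), computed stably in log space."""
    neg = [-ll for ll in loglik_draws]
    shift = max(neg)
    log_mean_inv = shift + math.log(sum(math.exp(x - shift) for x in neg) / len(neg))
    return -log_mean_inv

def loglik(theta, y):
    return -0.5 * len(y) * math.log(2 * math.pi) - 0.5 * sum((yi - theta) ** 2 for yi in y)

def posterior_draws(y, m, v, size, rng):
    """Exact posterior draws for the conjugate normal model."""
    n = len(y)
    var = 1.0 / (n + 1.0 / v)
    mean = var * (sum(y) + m / v)
    return [rng.gauss(mean, math.sqrt(var)) for _ in range(size)]

rng = random.Random(1)
y = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
estimates = {}
for v in (1.0, 100.0, 10000.0):  # increasingly diffuse priors
    draws = posterior_draws(y, 0.0, v, 5000, rng)
    estimates[v] = log_harmonic_mean([loglik(t, y) for t in draws])
```

Comparing the entries of `estimates` with the exact log evidence for each `v` is a simple way to see how little the estimator responds to the prior, whereas the exact evidence keeps decreasing as `v` grows.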

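Chib's method

(Chib 1995)

Chib (1995) presented a generic method which can be applied to output from the Gibbs sampler.

$$\pi(\theta \mid y) = \frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(y)}$$

Re-writing this,

$$\pi(y) = \frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(\theta \mid y)}.$$

So we could estimate $\log \pi(y)$ as

$$\log \hat\pi(y) = \log \pi(y \mid \theta^*) + \log \pi(\theta^*) - \log \hat\pi(\theta^* \mid y),$$

where $\hat\pi(\theta^* \mid y)$ is an estimate of the posterior density at a point $\theta^*$.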
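The identity behind Chib's method can be checked numerically in a model where the posterior density is known in closed form. The sketch below assumes the conjugate toy model $y_i \sim N(\theta, 1)$, $\theta \sim N(m, v)$; the function names and data are illustrative. In a real application the last term would be an estimate obtained from Gibbs output rather than an exact density.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def chib_log_evidence(theta_star, y, m, v):
    """log pi(y|theta*) + log pi(theta*) - log pi(theta*|y) for the conjugate normal model."""
    n = len(y)
    post_var = 1.0 / (n + 1.0 / v)
    post_mean = post_var * (sum(y) + m / v)
    loglik = sum(log_normal_pdf(yi, theta_star, 1.0) for yi in y)
    logprior = log_normal_pdf(theta_star, m, v)
    logpost = log_normal_pdf(theta_star, post_mean, post_var)  # exact here, estimated in general
    return loglik + logprior - logpost

y = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
near_mode = chib_log_evidence(0.37, y, 0.0, 10.0)
in_tail = chib_log_evidence(2.5, y, 0.0, 10.0)
```

Both calls return the same value because the identity holds for every $\theta^*$. When the posterior density has to be estimated, a point of high posterior density is usually preferred since the estimate tends to be more accurate there.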

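Chib's method relies on estimating $\pi(\theta^* \mid y)$.

Suppose the vector $\theta$ can be partitioned as $(\theta_1, \theta_2, \theta_3)$, where the full-conditional distribution of each $\theta_i$ is standard.

$$\pi(\theta^* \mid y) = \pi(\theta_1^* \mid \theta_2^*, \theta_3^*, y)\,\pi(\theta_2^* \mid \theta_3^*, y)\,\pi(\theta_3^* \mid y)$$

The first factor is a full conditional and is available directly. Gibbs sampling can be used to estimate the other two factors on the RHS:

$$\pi(\theta_2^* \mid \theta_3^*, y) \approx \frac{1}{N} \sum_{j} \pi(\theta_2^* \mid \theta_1^{(j)}, \theta_3^*, y), \qquad \pi(\theta_3^* \mid y) \approx \frac{1}{N} \sum_{j} \pi(\theta_3^* \mid \theta_1^{(j)}, \theta_2^{(j)}, y).$$

In the first sum the $\theta_1^{(j)}$ come from a Gibbs run with $\theta_3$ fixed at $\theta_3^*$; in the second the draws come from the full Gibbs sampler.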

In general, Chib's method can be applied when $\theta$ is partitioned into an arbitrary number of blocks.

The only requirement is that sampling from the full conditional of each block is possible.

Annealed importance sampling

(Neal 2001)

AIS is a very clever algorithm which shows how tempering can be used to define an importance sampling function to sample from complex distributions.

Aside: Importance sampling to sample from a target $f(x)$ using an importance function $g(x)$:

$$x^{(1)}, \dots, x^{(N)} \sim g(x), \qquad E_f\,a(x) \approx \frac{\sum_i w^{(i)} a(x^{(i)})}{\sum_i w^{(i)}}, \quad \text{where } w^{(i)} = \frac{f(x^{(i)})}{g(x^{(i)})}.$$

Further,

$$\frac{1}{N} \sum_i w^{(i)} \to \frac{z_f}{z_g} \quad \text{as } N \to \infty,$$

where $z_f = \int f(x)\,dx$ and $z_g = \int g(x)\,dx$.
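A small Python sketch of the importance sampling aside is given below. The target (a standard normal kernel, so that $z_f = \sqrt{2\pi}$), the importance density ($N(0, 2^2)$, normalised so that $z_g = 1$), the sample size and the seed are all assumptions chosen so that the answers are known.

```python
import math
import random

def f_unnorm(x):
    """Unnormalised target: standard normal kernel, so z_f = sqrt(2 pi)."""
    return math.exp(-0.5 * x * x)

def g_density(x, sd=2.0):
    """Normalised importance density N(0, sd^2), so z_g = 1."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

rng = random.Random(0)
N = 20000
xs = [rng.gauss(0.0, 2.0) for _ in range(N)]
ws = [f_unnorm(x) / g_density(x) for x in xs]

# Self-normalised estimate of E_f[a(x)] with a(x) = x^2 (true value 1)
second_moment = sum(w * x * x for w, x in zip(ws, xs)) / sum(ws)

# Mean weight estimates z_f / z_g (true value sqrt(2 pi))
ratio_of_constants = sum(ws) / N
```

The importance density is deliberately wider than the target so that the weights stay bounded, which keeps both estimates stable.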

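Define

$$\pi_{t_i}(\theta \mid y) = \pi(\theta)^{1 - t_i}\,\pi(\theta \mid y)^{t_i}, \quad \text{where } 1 = t_0 > \dots > t_n = 0.$$

Thus $\pi_{t_0}$ and $\pi_{t_n}$ correspond to the posterior and prior, respectively. Let $T_i$ denote a Markov transition kernel with invariant distribution $\pi_{t_i}$. For $j = 1, \dots, N$:

- Sample $\theta_{n-1}$ from $\pi_{t_n}$.

- Sample $\theta_{n-2}$ from $\theta_{n-1}$ using $T_{n-1}$.

- $\cdots$

- Sample $\theta_0$ from $\theta_1$ using $T_1$.

- Set $\theta^{(j)} = \theta_0$ and

$$w^{(j)} = \frac{\pi_{t_{n-1}}(\theta_{n-1})}{\pi_{t_n}(\theta_{n-1})} \cdot \frac{\pi_{t_{n-2}}(\theta_{n-2})}{\pi_{t_{n-1}}(\theta_{n-2})} \cdots \frac{\pi_{t_0}(\theta_0)}{\pi_{t_1}(\theta_0)}.$$

The weights are evaluated with the posterior in its unnormalised form $\pi(y \mid \theta)\,\pi(\theta)$.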

AIS yields:

1. An independent sample $\{\theta^{(j)}\}$, weighted by $\{w^{(j)}\}$, targeting $\pi(\theta \mid y)$.

2. An estimator of the evidence

$$\pi(y) \approx \frac{1}{N} \sum_{j=1}^{N} w^{(j)}.$$
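A rough Python sketch of AIS for evidence estimation is given below, under several assumptions that are not from the talk: the toy model $y_i \sim N(\theta, 1)$ with a $N(m, v)$ prior, intermediate targets proportional to $\pi(\theta)\,\pi(y \mid \theta)^t$, temperatures indexed in increasing order from prior ($t = 0$) to posterior ($t = 1$), a schedule $t_k = (k/100)^5$, and a single random-walk Metropolis step per temperature with a step size tuned to this model.

```python
import math
import random

def loglik(theta, y):
    return -0.5 * len(y) * math.log(2 * math.pi) - 0.5 * sum((yi - theta) ** 2 for yi in y)

def log_prior(theta, m, v):
    return -0.5 * math.log(2 * math.pi * v) - 0.5 * (theta - m) ** 2 / v

def ais_log_evidence(y, m, v, temps, runs, rng):
    n = len(y)
    log_w = []
    for _ in range(runs):
        theta = rng.gauss(m, math.sqrt(v))  # exact draw from the prior (t = 0)
        ll = loglik(theta, y)
        lw = 0.0
        for t_prev, t in zip(temps[:-1], temps[1:]):
            lw += (t - t_prev) * ll  # log of the ratio of successive unnormalised targets
            # one random-walk Metropolis step leaving prior * likelihood^t invariant
            step = 2.4 / math.sqrt(n * t + 1.0 / v)
            prop = theta + rng.gauss(0.0, step)
            ll_prop = loglik(prop, y)
            log_ratio = (t * ll_prop + log_prior(prop, m, v)) - (t * ll + log_prior(theta, m, v))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                theta, ll = prop, ll_prop
        log_w.append(lw)
    shift = max(log_w)  # average the weights on the log scale
    return shift + math.log(sum(math.exp(x - shift) for x in log_w) / runs)

y = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
temps = [(k / 100) ** 5 for k in range(101)]
log_evidence_hat = ais_log_evidence(y, 0.0, 10.0, temps, 1000, random.Random(0))
```

Because the prior is normalised, the average weight estimates the ratio of normalising constants between posterior and prior targets, which is the evidence.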

Power posteriors

Evidence estimation via power posteriors

(Friel and Pettitt, 2008)

Consider the power posterior:

$$\pi(\theta \mid y, t) \propto \{\pi(y \mid \theta)\}^{T(t)}\,\pi(\theta),$$

where $T : [0,1] \to [0,1]$ is defined such that $T(0) = 0$ and $T(1) = 1$.

Taking $T(t) = t$, its normalising constant is

$$z(y \mid t) = \int_\theta \{\pi(y \mid \theta)\}^{t}\,\pi(\theta)\,d\theta.$$

$z(y \mid t = 1)$: the model evidence.

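Evidence via power posteriors

The evidence follows from the identity:

$$\log \pi(y) = \log \frac{z(y \mid t = 1)}{z(y \mid t = 0)} = \int_0^1 E_{\theta \mid y, t} \log \pi(y \mid \theta)\,dt.$$

Proof:

$$\frac{d}{dt} \log z(y \mid t) = \frac{1}{z(y \mid t)} \frac{d}{dt} z(y \mid t) = \frac{1}{z(y \mid t)} \int \frac{d}{dt} \{\pi(y \mid \theta)\}^{t}\,\pi(\theta)\,d\theta = \int \log \pi(y \mid \theta)\,\frac{\{\pi(y \mid \theta)\}^{t}\,\pi(\theta)}{z(y \mid t)}\,d\theta = E_{\theta \mid y, t} \log \pi(y \mid \theta).$$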

$$\frac{d}{dt} \log z(y \mid t) = E_{\theta \mid y, t} \log \pi(y \mid \theta)$$

This is the mean deviance (up to the factor $-2$) with respect to $(\theta \mid y, t)$, the power posterior.

Integrating with respect to $t$ yields

$$\log \pi(y) = \log \frac{z(y \mid t = 1)}{z(y \mid t = 0)} = \int_0^1 E_{\theta \mid y, t} \log \pi(y \mid \theta)\,dt.$$

This is essentially an application of thermodynamic integration, which was first developed in the statistical physics community, and outlined in Gelman and Meng (1998).

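In practice: discretise $t \in [0,1]$ as $0 = t_0 < t_1 < \dots < t_n = 1$.

For each $t_i$: sample $\theta \sim \pi(\theta \mid y, t_i)$ and estimate $E_i = E_{\theta \mid y, t_i} \log \pi(y \mid \theta)$. Then, by the trapezoidal rule,

$$\log \pi(y) \approx \sum_{i=1}^{n} (t_i - t_{i-1})\,\frac{E_{i-1} + E_i}{2}.$$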

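Sensitivity of $\pi(y)$ to the prior - toy example

How does sensitivity to the prior impact on this method?

Suppose $y = \{y_i\}$ are iid $N(\theta, 1)$, $i = 1, \dots, n$. A priori, $\theta \sim N(m, v)$. Then the power posterior is $\theta \mid y, t \sim N(m_t, v_t)$, where

$$m_t = \frac{n t \bar y + m/v}{n t + 1/v} \quad \text{and} \quad v_t = \frac{1}{n t + 1/v},$$

and

$$E_{\theta \mid y, t} \log \pi(y \mid \theta) = -\frac{n}{2} \log 2\pi - \frac{1}{2} \sum_{i=1}^{n} (y_i - \bar y)^2 - \frac{n}{2}\,\frac{(m - \bar y)^2}{(v n t + 1)^2} - \frac{n}{2}\,\frac{1}{n t + 1/v}.$$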
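For this toy example the expected log-likelihood is available in closed form, so the trapezoidal rule can be tried without any sampling. The Python sketch below does this. The data, the prior settings and the temperature schedule $t_i = (i/100)^5$ (which places most points near $t = 0$, where the integrand changes fastest) are assumptions made for illustration.

```python
import math

def expected_loglik(t, y, m, v):
    """E_{theta|y,t} log pi(y|theta) for y_i ~ N(theta, 1), theta ~ N(m, v)."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((yi - ybar) ** 2 for yi in y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * ss
            - 0.5 * n * (m - ybar) ** 2 / (v * n * t + 1) ** 2
            - 0.5 * n / (n * t + 1.0 / v))

def power_posterior_log_evidence(y, m, v, temps):
    """Trapezoidal rule over the temperature ladder."""
    e = [expected_loglik(t, y, m, v) for t in temps]
    return sum((temps[i] - temps[i - 1]) * (e[i - 1] + e[i]) / 2
               for i in range(1, len(temps)))

def exact_log_evidence(y, m, v):
    n = len(y)
    s1 = sum(yi - m for yi in y)
    s2 = sum((yi - m) ** 2 for yi in y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * v)
            - 0.5 * (s2 - v * s1 ** 2 / (1 + n * v)))

y = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
powered = [(i / 100) ** 5 for i in range(101)]
uniform = [i / 100 for i in range(101)]
est_powered = power_posterior_log_evidence(y, 0.0, 10.0, powered)
est_uniform = power_posterior_log_evidence(y, 0.0, 10.0, uniform)
exact = exact_log_evidence(y, 0.0, 10.0)
```

Comparing `est_powered` and `est_uniform` with `exact` for larger prior variances gives a feel for how the discretisation error depends on the schedule.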

[Figure: expected deviance, under the distribution $\theta \mid y, t$, plotted against $t$ for prior variance equal to 10, 5, 1.]

As $v$ increases, so too does the rate at which the mean deviance

Connection to Fractional Bayes estimator

The fraction $z(y \mid t = 1) / z(y \mid t = a)$, where $a$ is close to 0, is precisely the estimate of the marginal likelihood used in the 'Fractional Bayes' estimate of the Bayes factor (O'Hagan, 1995).

$$\pi(y) \approx \frac{z(y \mid t = 1)}{z(y \mid t = a)} = \frac{\int_\theta \pi(y \mid \theta)\,\pi(\theta)\,d\theta}{\int_\theta \{\pi(y \mid \theta)\}^{a}\,\pi(\theta)\,d\theta}, \qquad \log \frac{z(y \mid t = 1)}{z(y \mid t = a)} = \int_a^1 E_{\theta \mid y, t} \log \pi(y \mid \theta)\,dt.$$

This method was proposed to compute Bayes factors with uninformative priors. Impropriety in $\pi(\theta)$ cancels above and below.

Power posterior approach

- It is relatively straightforward to code and implement.

- It is a generic method. In some cases it can be implemented in WinBUGS.

- Choosing the temperature schedule is vital; this is the weakness of this approach. Behrens, Friel and Hurn (2011) offer some guidance in this direction.

Nested sampling

(Skilling, 2006)

(For the moment, for ease of notation, let $L(\theta) = \pi(y \mid \theta)$.)

$$\pi(y) = \int L(\theta)\,\pi(\theta)\,d\theta = \int L(\theta)\,dX,$$

where $dX = \pi(\theta)\,d\theta$ is an element of prior mass.

Define

$$X(\lambda) = \int_{L(\theta) > \lambda} \pi(\theta)\,d\theta$$

as a cumulative prior mass.

Write the inverse function as $L(X)$, i.e. $L(X(\lambda)) = \lambda$. This then allows us to express the evidence as a 1-dimensional integral:

$$\pi(y) = \int_0^1 L(X)\,dX.$$

The main computational burden is the requirement to sample $\theta$ from the prior subject to the constraint that $L(\theta) > \lambda$.

This is roughly similar to the computational effort of slice sampling (Neal, 2003).

The evidence is estimated by sorting draws from the prior according to their likelihood.

π(y) =Z = I−1

X

i=1

Sketch of algorithm

Sample $\theta_1, \dots, \theta_N$ from the prior. Repeat for $i = 1, \dots, I$:

- Find the point $\theta_k$ with the smallest likelihood, $L_i$, among the $N$ current $\theta$'s.

- Set $X_i = \exp(-i/N)$ and $w_i = X_{i-1} - X_i$ (with $X_0 = 1$). Increment $Z$ by $L_i w_i$.

- Replace $\theta_k$ with a point sampled from the prior subject to the constraint $L(\theta) > L_i$.
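The steps of the sketch can be written out in Python for a one-parameter toy problem. Everything specific here is an assumption made for illustration: the model ($y_i \sim N(\theta, 1)$ with a Uniform$(-5, 5)$ prior), the fact that the constraint $L(\theta) > L_i$ reduces to an interval around $\bar y$ so that constrained prior draws are exact, and the final correction that adds the mean likelihood of the remaining live points times the remaining prior mass.

```python
import math
import random

def likelihood(theta, y):
    return math.exp(-0.5 * len(y) * math.log(2 * math.pi)
                    - 0.5 * sum((yi - theta) ** 2 for yi in y))

def nested_sampling(y, lo, hi, n_live, n_iter, rng):
    ybar = sum(y) / len(y)
    live = [rng.uniform(lo, hi) for _ in range(n_live)]  # draws from the prior
    lik = [likelihood(t, y) for t in live]
    z, x_prev = 0.0, 1.0
    for i in range(1, n_iter + 1):
        k = min(range(n_live), key=lik.__getitem__)  # point with the smallest likelihood
        x_i = math.exp(-i / n_live)                  # expected remaining prior mass
        z += lik[k] * (x_prev - x_i)                 # increment Z by L_i * w_i
        x_prev = x_i
        # In this toy model L(theta) > L_i is the interval |theta - ybar| < r
        r = abs(live[k] - ybar)
        live[k] = rng.uniform(max(lo, ybar - r), min(hi, ybar + r))
        lik[k] = likelihood(live[k], y)
    # Assumed final correction: remaining prior mass times mean live likelihood
    return z + x_prev * sum(lik) / n_live

y = [0.3, -0.5, 1.2, 0.8, 0.1]  # made-up data
z_hat = nested_sampling(y, -5.0, 5.0, n_live=400, n_iter=4000, rng=random.Random(0))
```

In realistic problems the constrained draw is the hard part and is usually done with an MCMC move started from one of the surviving live points.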

Evidence estimation: doubly intractable distributions

Doubly intractable distributions

$$\pi(\theta \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)$$

Here we assume that the likelihood, $\pi(y \mid \theta)$, is impossible to

Ising model

Gibbs random fields, which find use in spatial statistics and statistical network analysis, involve intractable likelihood models.

For the Ising model:

- It is defined on a lattice $y = \{y_1, \dots, y_n\}$.

- Lattice points $y_i$ take values in $\{-1, 1\}$.

- Full conditional: $\pi(y_i \mid y_{-i}, \theta) = \pi(y_i \mid \text{neighbours of } i, \theta)$.

$$\pi(y \mid \theta) \propto q(y \mid \theta) = \exp\left\{ \frac{1}{2}\,\theta_1 \sum_{i \sim j} y_i y_j \right\}.$$

1st order and 2nd order Ising models.

$$\pi(y \mid \theta) = \frac{\exp\{\theta^T s(y)\}}{z(\theta)}$$

$s(y)$ is a sufficient statistic and counts the number of 'like' neighbours.

$$z(\theta) = \sum_{y_1} \cdots \sum_{y_n} q(y \mid \theta).$$
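To make the normalising constant concrete, the Python sketch below evaluates $z(\theta)$ by brute-force enumeration on a very small lattice. It assumes a first-order model with free boundaries, a scalar $\theta$, and $s(y)$ defined as the sum of $y_i y_j$ over unordered neighbouring pairs; the function names are illustrative. Enumeration visits $2^n$ configurations, so it is only feasible for tiny lattices.

```python
import itertools
import math

def neighbour_pairs(rows, cols):
    """Unordered first-order neighbour pairs on a rows x cols lattice (free boundary)."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                pairs.append((i, i + 1))      # right-hand neighbour
            if r + 1 < rows:
                pairs.append((i, i + cols))   # neighbour below
    return pairs

def suff_stat(y, pairs):
    """s(y): sum of y_i * y_j over neighbouring pairs."""
    return sum(y[i] * y[j] for i, j in pairs)

def normalising_constant(theta, rows, cols):
    """z(theta): sum of q(y | theta) over all 2^(rows * cols) configurations."""
    pairs = neighbour_pairs(rows, cols)
    return sum(math.exp(theta * suff_stat(y, pairs))
               for y in itertools.product((-1, 1), repeat=rows * cols))

z_small = normalising_constant(0.4, 3, 3)  # 512 configurations
```

At $\theta = 0$ every configuration has weight one, so $z(0)$ simply counts configurations, which is a convenient sanity check.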

Model evidence for MRFs – our approach

$$\pi(y) = \frac{q(y \mid \theta)\,\pi(\theta)}{z(\theta)\,\pi(\theta \mid y)} \quad \forall\,\theta.$$

- Draw from the posterior, and estimate $\pi(\theta^* \mid y)$ for a high probability $\theta^*$.

- Estimate $z(\theta)$ using thermodynamic integration.

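Simulating from the posterior

Auxiliary variable method

(Møller et al., 2006)

Introduce an auxiliary variable $y'$ on the same space as the data $y$ and extend the target distribution

$$\pi(\theta, y' \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)\,\pi(y' \mid \theta_0),$$

for some fixed $\theta_0$.

Joint update $(\theta^*, y'^*)$ with proposal:

$$h(\theta^*, y'^* \mid \theta, y') = h_1(y'^* \mid \theta^*)\,h_2(\theta^* \mid \theta, y'^*),$$

where

$$h_1(y'^* \mid \theta^*) = \pi(y'^* \mid \theta^*) = \frac{q(y'^* \mid \theta^*)}{z(\theta^*)}.$$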

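$$\alpha(\theta^*, y'^* \mid \theta, y') = \frac{\pi(y \mid \theta^*)\,\pi(\theta^*)\,\pi(y'^* \mid \theta_0)\,\pi(y' \mid \theta)\,h_2(\theta \mid \theta^*)}{\pi(y \mid \theta)\,\pi(\theta)\,\pi(y' \mid \theta_0)\,\pi(y'^* \mid \theta^*)\,h_2(\theta^* \mid \theta)}$$

$z(\theta^*)$ appears in $\pi(y \mid \theta^*)$ above and in $\pi(y'^* \mid \theta^*)$ below, and therefore cancels. Similarly $z(\theta)$ cancels above and below.

The choice of $\theta_0$ is important, e.g. the maximum pseudolikelihood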

Exchange algorithm

(Murray, Ghahramani & MacKay 2006)

Sample from an augmented distribution

$$\pi(\theta', y', \theta \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)\,h(\theta' \mid \theta)\,\pi(y' \mid \theta'),$$

whose marginal distribution for $\theta$ is the posterior of interest.

- $\pi(y' \mid \theta')$ is the same likelihood model on which $y$ is defined.

- $h(\theta' \mid \theta)$ is an arbitrary distribution for the augmented variable $\theta'$ which might depend on $\theta$ (e.g. a random walk distribution

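Exchange algorithm – How it works

1. Gibbs update of $(\theta', y')$:

   i. Draw $\theta' \sim h(\cdot \mid \theta)$.

   ii. Draw $y' \sim \pi(\cdot \mid \theta')$.

2. Exchange move from $(\theta, y), (\theta', y')$ to $(\theta', y), (\theta, y')$ with probability

$$\alpha = \min\left(1,\; \underbrace{\frac{q(y' \mid \theta)}{q(y \mid \theta)}}_{(*)}\,\frac{\pi(\theta')}{\pi(\theta)}\,\frac{h(\theta \mid \theta')}{h(\theta' \mid \theta)}\,\underbrace{\frac{q(y \mid \theta')}{q(y' \mid \theta')}}_{(**)} \times \underbrace{\frac{z(\theta)\,z(\theta')}{z(\theta)\,z(\theta')}}_{1}\right).$$

- The exchange move proposes to "offer" the data $y$ the auxiliary parameter $\theta'$ and similarly to "offer" the auxiliary data $y'$ the parameter $\theta$.

- The affinity between $\theta'$ and $y$ is measured by (**) and the affinity between $\theta$ and $y'$ by (*).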

Exchange algorithm for the Ising model

With a symmetric proposal $h$,

$$\alpha = \min\left(1,\; \frac{\pi(\theta')}{\pi(\theta)} \exp\left\{ (\theta - \theta')^T \left( s(y') - s(y) \right) \right\} \right).$$

The term $\exp\{(\theta - \theta')^T (s(y') - s(y))\}$ can be viewed as a measure of distance between the observed data $y$ and the auxiliary data $y'$.

It is somewhat similar to the accept/reject step in ABC (approximate Bayesian computation).
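The acceptance probability for the Ising case is cheap to evaluate once the auxiliary data have been simulated, as the short Python sketch below suggests. It assumes a scalar $\theta$, a symmetric proposal $h$ (so that the proposal ratio drops out) and a user-supplied log prior; the function names and the numbers are hypothetical.

```python
import math

def exchange_accept_prob(theta, theta_new, s_y, s_y_aux, log_prior):
    """min(1, pi(theta')/pi(theta) * exp{(theta - theta') * (s(y') - s(y))})."""
    log_alpha = (log_prior(theta_new) - log_prior(theta)
                 + (theta - theta_new) * (s_y_aux - s_y))
    return math.exp(min(0.0, log_alpha))

def flat_log_prior(theta):
    """Assumed flat prior, used only for this example."""
    return 0.0

# Hypothetical values: s(y) = 10 for the data, s(y') = 12 for the auxiliary draw
p_move = exchange_accept_prob(0.2, 0.3, 10, 12, flat_log_prior)
p_same_stat = exchange_accept_prob(0.2, 0.3, 10, 10, flat_log_prior)
```

When the auxiliary data reproduce the observed statistic exactly, as in the second call, only the prior ratio remains.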

Note: If $\theta \approx \theta'$, then $\alpha \approx 1$. This does not necessarily happen with

- The main difficulty is the need to draw an exact sample $y' \sim \pi(\cdot \mid \theta')$.

- Perfect sampling is an obvious approach.

- A pragmatic alternative is to take a realisation from a long MCMC run with stationary distribution $\pi(y' \mid \theta')$ as an

Simulation study: Ising model

Data $y$ simulated from an Ising model defined on a $16 \times 16$ lattice, with a single interaction parameter $\theta$.

Two competing models: 4 and 8 nearest neighbours.

Here the lattices are sufficiently small to allow a very accurate estimate of the Bayes factor:

The normalising constant $z(\theta)$ can be calculated exactly for a grid of $\{\theta_i\}$ values, which can then be plugged into the right hand side of:

$$\pi(\theta_i \mid y) \propto \frac{q(y \mid \theta_i)}{z(\theta_i)}\,\pi(\theta_i), \quad i = 1, \dots, n.$$

Summing up the right hand side yields an estimate of $\pi(y)$. This serves as a ground truth to compare with the corresponding MCMC-based estimate of the model evidence.

Results: Ising model

θ      $\widehat{BF}$    $BF$
0.1    2.51              1.88
0.2    13.48             13.57
0.3    9.135             6.95

Exponential random graph models

The exponential random graph (or $p^*$) model

First proposed by Frank and Strauss (JASA, 1986).

Let $y_{ij} = 1$ denote an edge connecting nodes $i$ and $j$, and 0 otherwise.

Data $y$ is an adjacency matrix indicating nodes which are connected by an edge.

1. Edges $y_{ij}$ and $y_{kl}$ are neighbours of one another if they share a common node.

2. If $y_{ij}$ and $y_{kl}$ are not neighbours, then $y_{ij}$ and $y_{kl}$ are conditionally independent, given the rest of the graph.

The $p^*$ model

$$\pi(y \mid \theta) = \frac{\exp\{\theta^T s(y)\}}{z(\theta)} = \frac{q(y \mid \theta)}{z(\theta)}$$

- $y$: observed graph

- $s(y)$: known vector of sufficient statistics

- $\theta$: vector of parameters

- $z(\theta)$: normalizing constant

$$z(\theta) = \sum_{\text{all possible graphs}} \exp\{\theta^T s(y)\}$$

- $2^{\binom{n}{2}}$ possible undirected graphs of $n$ nodes
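The growth in the number of graphs is what makes $z(\theta)$ intractable. The Python sketch below counts the graphs and, for very small $n$ only, evaluates $z(\theta)$ by enumeration. It assumes a model with edge and 2-star statistics; the function names are illustrative.

```python
import itertools
import math

def num_undirected_graphs(n):
    """2^(n choose 2): one binary choice per dyad."""
    return 2 ** math.comb(n, 2)

def ergm_normalising_constant(theta, n):
    """Brute-force z(theta) for an edges + 2-star model on n nodes."""
    dyads = list(itertools.combinations(range(n), 2))
    z = 0.0
    for y in itertools.product((0, 1), repeat=len(dyads)):
        degree = [0] * n
        for (i, j), y_ij in zip(dyads, y):
            degree[i] += y_ij
            degree[j] += y_ij
        edges = sum(y)
        two_stars = sum(math.comb(d, 2) for d in degree)  # pairs of edges sharing a node
        z += math.exp(theta[0] * edges + theta[1] * two_stars)
    return z

z_four_nodes = ergm_normalising_constant((-1.0, 0.2), 4)  # 64 graphs
```

Already for a few dozen nodes `num_undirected_graphs` returns an astronomically large integer, which is why enumeration is not an option in practice.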

Model Specification: Network Statistics

[Figure: network statistics. Panel (a): edge, mutual edge, 2-in-star, 2-out-star, 2-mixed-star, transitive triad, cyclic triad. Second panel: edge, 2-star, 3-star, triangle.]

ERGM: Florentine network

Model 1: y ∼ edges + 3-star

Model 2: y ∼ edges + 2-star

Here it is difficult to establish a ground truth. For this purpose, we ran an 'independence' RJMCMC sampler:

1. Sample from each model, separately, using the exchange algorithm. (Here we used the Bergm package of Caimo and Friel (2011).)

2. RJMCMC: use the posterior mean and variance for model $k$ as proposal parameters when proposing to jump to model $k$.

This works well, since the model space is small, but also because the posterior under each model is unimodal.

Acceptance rates for the jump proposals were around 40%, suggesting that the proposal distributions were a good fit to the posterior under each model.

This is essentially the AutoRJ approach outlined in Chapter 6 of Green (2003).

Here estimates of posterior model probabilities based on AutoRJ are compared to those based on estimates of the model evidence for each model.

          π(m1|y)   π(m2|y)   π(m3|y)
AutoRJ    0.29      0.69      0.02

Summary

Concluding remarks

- Model evidence is difficult to compute!

- Often complex Monte Carlo methods are needed. There are plenty of methods in the Bayesian toolbox.

References

- Chib, S. (1995) Marginal likelihood using Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.

- Friel, N. and Pettitt, A. N. (2008) Marginal likelihood via power posteriors. Journal of the Royal Statistical Society, Series B, 70, 589–607.

- Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference by the weighted likelihood bootstrap (with Discussion). Journal of the Royal Statistical Society, Series B, 56, 3–48.

- Neal, R. (2001) Annealed importance sampling. Statistics and Computing, 11, 125–139.

- Murray, I., Ghahramani, Z. and MacKay, D. (2006) MCMC for doubly-intractable distributions. In Proceedings of the 22nd annual conference on uncertainty in artificial intelligence.

- Caimo, A. and Friel, N. (2011) Bayesian inference for the exponential random graph model. Social Networks, 33, 41–55.