Estimating the evidence for statistical models
Nial Friel, University College Dublin
Introduction
Bayesian model choice
Given data $y$ and competing models $m_1, \dots, m_l$, each with parameters $\theta_1, \dots, \theta_l$, respectively.
Bayesian inference:
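Model evidence
Within model $m_k$:
$$\pi(\theta_k \mid y, m_k) \propto \pi(y \mid \theta_k, m_k)\,\pi(\theta_k \mid m_k).$$
The constant of proportionality is
$$\pi(y \mid m_k) = \int_{\theta_k} \pi(y \mid \theta_k, m_k)\,\pi(\theta_k \mid m_k)\,d\theta_k.$$
This is often called the marginal likelihood, integrated likelihood or evidence.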
Posterior model probabilities
Suppose we could compute $\pi(y \mid m_k)$. Then, using Bayes' theorem, we get
$$\pi(m_k \mid y) = \frac{\pi(y \mid m_k)\,\pi(m_k)}{\sum_{j=1}^{l} \pi(y \mid m_j)\,\pi(m_j)}.$$
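As a minimal sketch of this formula, the snippet below turns log evidences into posterior model probabilities. The function name `posterior_model_probs`, the uniform default prior over models and the numerical log evidences are illustrative assumptions, not values from the talk. Working on the log scale avoids underflow, since evidences are often extremely small numbers.

```python
import math

def posterior_model_probs(log_evidences, priors=None):
    """Return pi(m_k | y) from log pi(y | m_k) and prior model probabilities."""
    l = len(log_evidences)
    if priors is None:
        priors = [1.0 / l] * l  # assumption: uniform prior over the l models
    # log pi(y | m_k) + log pi(m_k) for each model
    log_terms = [le + math.log(p) for le, p in zip(log_evidences, priors)]
    # Subtract the maximum before exponentiating (log-sum-exp trick).
    shift = max(log_terms)
    unnorm = [math.exp(t - shift) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical log evidences for three models (illustrative numbers only).
probs = posterior_model_probs([-1050.2, -1048.9, -1055.0])
print(probs)
```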
Bayes factors
If we have two competing models:
$$\frac{\pi(m_1 \mid y)}{\pi(m_2 \mid y)} = \frac{\pi(y \mid m_1)}{\pi(y \mid m_2)} \times \frac{\pi(m_1)}{\pi(m_2)}$$
posterior odds = Bayes factor × prior odds
The Bayes factor is $B_{12} = \pi(y \mid m_1) / \pi(y \mid m_2)$.
The larger $B_{12}$ is, the greater the evidence in favour of $m_1$.
Bayesian model averaging
Predictions can be made by averaging over all models, weighted by the posterior model probabilities, thereby incorporating model uncertainty:
$$\pi(y^* \mid y) = \sum_{k=1}^{l} \pi(y^* \mid m_k, y)\,\pi(m_k \mid y).$$
This is the average of the posterior distribution for $y^*$ under each model, weighted by that model's posterior probability.
Why estimating the model evidence is a challenge
- $\pi(y \mid m_k)$ is an integral of a (usually) highly variable function over a high-dimensional parameter space.
- Analytic tractability is sometimes possible, most often when conjugate priors are used, but this is quite rare.
Within model search or across model search?
Within model search:
Inference for $\pi(\theta_k \mid y)$ is carried out separately for every $m_k$. The output is used to estimate $\pi(y \mid m_k)$ for all $k$.
There are many approaches under this heading.
Across model search:
Here inference is carried out over the joint model and parameter space, $\pi(\theta_k, m_k \mid y)$. In an MCMC setting, only one chain is needed!
Reversible jump Markov chain Monte Carlo developed by Green
Review of evidence estimation
Laplace’s method
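(e.g. Tierney and Kadane 1986)
Assume that $\pi(\theta_k \mid y)$ is highly peaked around the posterior mode $\tilde\theta_k$, e.g. if the sample size is large enough.
Define
$$l(\theta_k) = \log\{\pi(y \mid \theta_k)\,\pi(\theta_k)\}.$$
- Expand $l(\theta_k)$ as a quadratic about $\tilde\theta_k$ and then exponentiate.
- The result approximates $\pi(y \mid \theta_k)\,\pi(\theta_k)$ by an unnormalised Gaussian with mean $\tilde\theta_k$ and covariance $\tilde\Sigma = (-D^2 l(\tilde\theta_k))^{-1}$, where $D^2 l(\tilde\theta_k)$ is the Hessian matrix of second derivatives.
- Integrating this approximation yields
$$\pi(y \mid m_k) \approx (2\pi)^{d/2}\,|\tilde\Sigma|^{1/2}\,\pi(y \mid \tilde\theta_k)\,\pi(\tilde\theta_k), \qquad d = \dim(\theta_k).$$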
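The steps of Laplace's method can be sketched on a one-parameter toy model. The model ($y_i \sim N(\theta, 1)$ with prior $\theta \sim N(m, v)$), the data and all function names are assumptions chosen for illustration. The standard Laplace formula $\log \pi(y) \approx l(\tilde\theta) + \tfrac{d}{2}\log 2\pi + \tfrac12 \log|\tilde\Sigma|$ is used with $d = 1$. Because $l(\theta)$ is exactly quadratic in this model, the approximation should agree with the closed-form evidence, which makes it a convenient check.

```python
import math

def log_joint(theta, y, m, v):
    """l(theta) = log{pi(y|theta) pi(theta)} for y_i ~ N(theta, 1), theta ~ N(m, v)."""
    loglik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)
    logprior = -0.5 * math.log(2 * math.pi * v) - 0.5 * (theta - m) ** 2 / v
    return loglik + logprior

def derivatives(theta, y, m, v, h=1e-3):
    """Central finite-difference gradient and Hessian of l at theta."""
    up = log_joint(theta + h, y, m, v)
    mid = log_joint(theta, y, m, v)
    down = log_joint(theta - h, y, m, v)
    return (up - down) / (2 * h), (up - 2 * mid + down) / h ** 2

def laplace_log_evidence(y, m, v, n_newton=20):
    """Laplace approximation to log pi(y) for a scalar parameter (d = 1)."""
    theta = m
    for _ in range(n_newton):           # Newton iterations towards the posterior mode
        grad, hess = derivatives(theta, y, m, v)
        theta -= grad / hess
    _, hess = derivatives(theta, y, m, v)
    sigma2 = -1.0 / hess                # (-D^2 l)^{-1}, a scalar because d = 1
    return log_joint(theta, y, m, v) + 0.5 * math.log(2 * math.pi) + 0.5 * math.log(sigma2)

def exact_log_evidence(y, m, v):
    """Closed form: marginally y ~ N(m 1, I + v 1 1^T)."""
    n = len(y)
    s = sum(yi - m for yi in y)
    ss = sum((yi - m) ** 2 for yi in y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * v)
            - 0.5 * (ss - v * s * s / (1 + n * v)))

y = [1.2, -0.3, 0.8, 0.1, 1.5, 0.4, -0.6, 0.9]   # illustrative data
approx = laplace_log_evidence(y, m=0.0, v=4.0)
exact = exact_log_evidence(y, m=0.0, v=4.0)
print(approx, exact)
```

For non-Gaussian posteriors the two numbers would differ, and the gap is the Laplace approximation error.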
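Harmonic mean estimator
(Newton and Raftery 1994)
$$\hat\pi(y) = \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\pi(y \mid \theta_i)} \right)^{-1}, \qquad \theta_i \sim \pi(\theta \mid y).$$
Why does this hold?
$$E_{\theta \mid y}\left[\frac{1}{\pi(y \mid \theta)}\right] = \int \frac{1}{\pi(y \mid \theta)}\,\frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(y)}\,d\theta = \frac{1}{\pi(y)} \int \pi(\theta)\,d\theta = \frac{1}{\pi(y)}.$$
The bad news?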
This estimator is based solely on draws from the posterior. But the posterior is typically much more peaked than the prior and, in that case, is insensitive to the prior. Hence, in such situations, the harmonic mean estimator will not change much as the prior changes.
But $\pi(y)$ is very sensitive to changes in the prior.
This drawback is very well documented. See Radford Neal's blog, for example.
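This insensitivity can be seen in a small experiment. The sketch below assumes a toy model ($y_i \sim N(\theta, 1)$, prior $\theta \sim N(m, v)$) where the posterior can be sampled exactly and the evidence is known in closed form; the data, seed and function names are illustrative assumptions. The same standard normal draws are reused for both priors so that any difference between the two harmonic mean estimates comes from the prior alone.

```python
import math
import random

# Illustrative data (n = 20).
Y = [1.2, -0.3, 0.8, 0.1, 1.5, 0.4, -0.6, 0.9, 0.7, 0.2,
     1.1, -0.1, 0.6, 0.3, 1.0, 0.5, -0.4, 0.8, 0.9, 0.4]

def loglik(theta, y):
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y)

def exact_log_evidence(y, m, v):
    n = len(y)
    s = sum(yi - m for yi in y)
    ss = sum((yi - m) ** 2 for yi in y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * v)
            - 0.5 * (ss - v * s * s / (1 + n * v)))

def harmonic_mean_log_evidence(y, m, v, z):
    """Harmonic mean estimate of log pi(y) from exact posterior draws."""
    n = len(y)
    post_var = 1.0 / (n + 1.0 / v)
    post_mean = post_var * (sum(y) + m / v)
    # Posterior draws theta_i = mean + sd * z_i, built from shared standard normals z.
    neg_ll = [-loglik(post_mean + math.sqrt(post_var) * zi, y) for zi in z]
    # log of (1/N) sum_i 1/pi(y|theta_i), computed stably on the log scale
    top = max(neg_ll)
    log_mean_inv = top + math.log(sum(math.exp(a - top) for a in neg_ll) / len(neg_ll))
    return -log_mean_inv

rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(5000)]
for v in (1.0, 1000.0):
    print(v, harmonic_mean_log_evidence(Y, 0.0, v, z), exact_log_evidence(Y, 0.0, v))
```

In this setup the exact log evidence moves by more than 3 units between $v = 1$ and $v = 1000$, while the harmonic mean estimates move far less.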
Chib's method
(Chib 1995)
Chib (1995) presented a generic method which can be applied to output from the Gibbs sampler. By Bayes' theorem,
$$\pi(\theta \mid y) = \frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(y)}.$$
Re-writing this,
$$\pi(y) = \frac{\pi(y \mid \theta)\,\pi(\theta)}{\pi(\theta \mid y)}.$$
So we could estimate $\log \pi(y)$ as
$$\log \hat\pi(y) = \log \pi(y \mid \theta^*) + \log \pi(\theta^*) - \log \hat\pi(\theta^* \mid y),$$
where $\hat\pi(\theta^* \mid y)$ is an estimate of the posterior density at a point $\theta^*$.
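The identity behind Chib's method can be checked on a conjugate toy model where the posterior density is available exactly. This is only a sketch of the identity, not of the Gibbs-based estimator: the model ($y_i \sim N(\theta, 1)$, $\theta \sim N(m, v)$), the data and the function names are assumptions, and the exact posterior ordinate stands in for the estimate $\hat\pi(\theta^* \mid y)$ that Chib's method would obtain from Gibbs output.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def chib_log_evidence(theta_star, y, m, v):
    """log pi(y) = log pi(y|theta*) + log pi(theta*) - log pi(theta*|y)."""
    n = len(y)
    post_var = 1.0 / (n + 1.0 / v)
    post_mean = post_var * (sum(y) + m / v)
    loglik = sum(log_normal_pdf(yi, theta_star, 1.0) for yi in y)
    logprior = log_normal_pdf(theta_star, m, v)
    # Exact posterior ordinate (conjugate model); in general this is estimated.
    logpost = log_normal_pdf(theta_star, post_mean, post_var)
    return loglik + logprior - logpost

y = [1.2, -0.3, 0.8, 0.1, 1.5, 0.4, -0.6, 0.9]   # illustrative data
for theta_star in (0.0, 0.5, 2.0):
    print(theta_star, chib_log_evidence(theta_star, y, m=0.0, v=4.0))
```

The printed values coincide for every $\theta^*$, as the identity requires. In practice a high-density point is preferred because the posterior ordinate is estimated most accurately there.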
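Chib's method relies on estimating $\pi(\theta^* \mid y)$.
Suppose the vector $\theta$ can be partitioned as $(\theta_1, \theta_2, \theta_3)$, where the full-conditional distribution of each $\theta_i$ is standard. Then
$$\pi(\theta^* \mid y) = \pi(\theta_1^* \mid \theta_2^*, \theta_3^*, y)\,\pi(\theta_2^* \mid \theta_3^*, y)\,\pi(\theta_3^* \mid y).$$
The first factor is a full conditional and is available exactly. Gibbs sampling can be used to estimate the other two factors on the right-hand side:
$$\hat\pi(\theta_2^* \mid \theta_3^*, y) = \frac{1}{N} \sum_{j} \pi(\theta_2^* \mid \theta_1^{(j)}, \theta_3^*, y), \qquad \hat\pi(\theta_3^* \mid y) = \frac{1}{N} \sum_{j} \pi(\theta_3^* \mid \theta_1^{(j)}, \theta_2^{(j)}, y).$$
In the first sum the draws $\theta_1^{(j)}$ come from a Gibbs run with $\theta_3$ held fixed at $\theta_3^*$.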
In general, Chib's method can be applied when $\theta$ is partitioned into an arbitrary number of blocks.
The only requirement is that sampling from the full conditional of each block is possible.
Annealed importance sampling
(Neal 2001)
AIS is a very clever algorithm which shows how tempering can be used to define an importance sampling function for complex distributions.
Aside: importance sampling from a target $f(x)$ using an importance function $g(x)$. Draw $x^{(1)}, \dots, x^{(N)} \sim g(x)$; then
$$E_f\,a(x) \approx \frac{\sum_i w^{(i)} a(x^{(i)})}{\sum_i w^{(i)}}, \qquad \text{where } w^{(i)} = \frac{f(x^{(i)})}{g(x^{(i)})}.$$
Further,
$$\frac{1}{N} \sum_i w^{(i)} \to \frac{z_f}{z_g} \quad \text{as } N \to \infty,$$
where $z_f = \int f(x)\,dx$ and $z_g = \int g(x)\,dx$.
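Both facts in the aside can be illustrated numerically. In this sketch the target $f(x) = \exp(-x^2/2)$ is left unnormalised (so $z_f = \sqrt{2\pi}$) and $g$ is a normalised $N(0, 2^2)$ density (so $z_g = 1$); these choices, the seed and the function names are assumptions made for the example.

```python
import math
import random

SIGMA_G = 2.0

def f(x):
    """Unnormalised target: z_f = sqrt(2 pi)."""
    return math.exp(-0.5 * x * x)

def g(x):
    """Normalised N(0, SIGMA_G^2) importance density: z_g = 1."""
    return math.exp(-0.5 * (x / SIGMA_G) ** 2) / (SIGMA_G * math.sqrt(2 * math.pi))

def importance_sampling(n_draws, seed=0):
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, SIGMA_G) for _ in range(n_draws)]
    ws = [f(x) / g(x) for x in xs]
    ratio = sum(ws) / n_draws                                        # estimates z_f / z_g
    second_moment = sum(w * x * x for w, x in zip(ws, xs)) / sum(ws)  # estimates E_f[x^2]
    return ratio, second_moment

ratio, second_moment = importance_sampling(20000)
print(ratio, math.sqrt(2 * math.pi))   # the two should be close
print(second_moment)                   # should be close to 1
```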
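Define
$$\pi_{t_i}(\theta \mid y) = \pi(\theta)^{1 - t_i}\,\pi(\theta \mid y)^{t_i}, \qquad \text{where } 1 = t_0 > \dots > t_n = 0.$$
Thus $\pi_{t_0}$ and $\pi_{t_n}$ correspond to the posterior and the prior, respectively. Let $T_i$ denote a Markov transition kernel with invariant distribution $\pi_{t_i}$. For $j = 1, \dots, N$:
- Sample $\theta_{n-1}$ from $\pi_{t_n}$.
- Sample $\theta_{n-2}$ from $\theta_{n-1}$ using $T_{n-1}$.
- $\cdots$
- Sample $\theta_0$ from $\theta_1$ using $T_1$.
- Set $\theta^{(j)} = \theta_0$ and
$$w^{(j)} = \frac{\pi_{t_{n-1}}(\theta_{n-1})}{\pi_{t_n}(\theta_{n-1})}\,\frac{\pi_{t_{n-2}}(\theta_{n-2})}{\pi_{t_{n-1}}(\theta_{n-2})} \cdots \frac{\pi_{t_0}(\theta_0)}{\pi_{t_1}(\theta_0)}.$$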
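AIS yields:
1. An independent sample $\{\theta^{(j)}\}$ which, together with the weights $\{w^{(j)}\}$, targets $\pi(\theta \mid y)$.
2. An estimator of the evidence (with each $\pi_{t_i}$ evaluated in the unnormalised form $\pi(\theta)\,\pi(y \mid \theta)^{t_i}$):
$$\pi(y) \approx \frac{1}{N} \sum_{j=1}^{N} w^{(j)}.$$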
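A minimal AIS sketch on a toy model follows. Everything specific here is an assumption for illustration: the model ($y_i \sim N(\theta, 1)$, prior $\theta \sim N(0, 1)$), the data, the schedule $t_i = (i/n)^4$, and the use of a single random-walk Metropolis step as each transition kernel. The temperatures are indexed in ascending order from prior ($t = 0$) to posterior ($t = 1$), the reverse of the indexing above, and the tempered densities are used in unnormalised form $\pi(\theta)\,\pi(y \mid \theta)^t$.

```python
import math
import random

Y = [1.2, -0.3, 0.8, 0.1, 1.5, 0.4, -0.6, 0.9, 0.7, 0.2,
     1.1, -0.1, 0.6, 0.3, 1.0, 0.5, -0.4, 0.8, 0.9, 0.4]
N_OBS = len(Y)
YBAR = sum(Y) / N_OBS
SS = sum((yi - YBAR) ** 2 for yi in Y)
M, V = 0.0, 1.0   # assumed prior: theta ~ N(M, V)

def loglik(theta):
    return -0.5 * N_OBS * math.log(2 * math.pi) - 0.5 * SS - 0.5 * N_OBS * (YBAR - theta) ** 2

def logprior(theta):
    return -0.5 * math.log(2 * math.pi * V) - 0.5 * (theta - M) ** 2 / V

def exact_log_evidence():
    s = sum(yi - M for yi in Y)
    ss = sum((yi - M) ** 2 for yi in Y)
    return (-0.5 * N_OBS * math.log(2 * math.pi) - 0.5 * math.log(1 + N_OBS * V)
            - 0.5 * (ss - V * s * s / (1 + N_OBS * V)))

def ais_log_evidence(n_runs=300, n_temps=100, step=0.5, seed=0):
    rng = random.Random(seed)
    temps = [(i / n_temps) ** 4 for i in range(n_temps + 1)]   # 0 = prior, 1 = posterior
    log_ws = []
    for _ in range(n_runs):
        theta = rng.gauss(M, math.sqrt(V))        # initial draw from the prior
        log_w = 0.0
        for t_prev, t in zip(temps, temps[1:]):
            # Weight update: ratio of unnormalised tempered densities at the current point.
            log_w += (t - t_prev) * loglik(theta)
            # One random-walk Metropolis step leaving pi(theta) L(theta)^t invariant.
            prop = theta + rng.gauss(0.0, step)
            log_acc = (logprior(prop) + t * loglik(prop)) - (logprior(theta) + t * loglik(theta))
            if rng.random() < math.exp(min(0.0, log_acc)):
                theta = prop
        log_ws.append(log_w)
    # log of the average weight, computed stably
    top = max(log_ws)
    return top + math.log(sum(math.exp(a - top) for a in log_ws) / n_runs)

print(ais_log_evidence(), exact_log_evidence())
```

The average weight is unbiased for $\pi(y)$ whatever kernel is used; poorer mixing only inflates the variance of the weights.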
Power posteriors
Evidence estimation via power posteriors
(Friel and Pettitt 2008)
Consider the power posterior:
$$\pi(\theta \mid y, t) \propto \{\pi(y \mid \theta)\}^{T(t)}\,\pi(\theta),$$
where $T : [0,1] \to [0,1]$ is defined such that $T(0) = 0$ and $T(1) = 1$.
Taking $T(t) = t$, its normalising constant is
$$z(y \mid t) = \int_{\theta} \{\pi(y \mid \theta)\}^{t}\,\pi(\theta)\,d\theta.$$
$z(y \mid t = 1)$ is the model evidence.
Evidence via power posteriors
The evidence satisfies the identity
$$\log \pi(y) = \log\left\{\frac{z(y \mid t = 1)}{z(y \mid t = 0)}\right\} = \int_0^1 E_{\theta \mid y, t} \log \pi(y \mid \theta)\,dt.$$
Proof:
$$\frac{d}{dt} \log z(y \mid t) = \frac{1}{z(y \mid t)}\,\frac{d}{dt} z(y \mid t) = \frac{1}{z(y \mid t)} \int \frac{d}{dt} \{\pi(y \mid \theta)\}^{t}\,\pi(\theta)\,d\theta = \int \log \pi(y \mid \theta)\,\frac{\{\pi(y \mid \theta)\}^{t}\,\pi(\theta)}{z(y \mid t)}\,d\theta = E_{\theta \mid y, t} \log \pi(y \mid \theta).$$
This is the mean deviance with respect to $(\theta \mid y, t)$, the power posterior. Integrating with respect to $t$ yields the identity above.
This is essentially an application of thermodynamic integration, which was first developed in the statistical physics community, and outlined in Gelman and Meng (1998).
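In practice: discretise $t \in [0,1]$ as $0 = t_0 < t_1 < \dots < t_n = 1$.
For each $t_i$: sample $\theta \sim \pi(\theta \mid y, t_i)$ and estimate $E_i = E_{\theta \mid y, t_i} \log \pi(y \mid \theta)$. Then apply the trapezoidal rule:
$$\log \pi(y) \approx \sum_{i=1}^{n} (t_i - t_{i-1})\,\frac{E_{i-1} + E_i}{2}.$$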
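Sensitivity of $\pi(y)$ to the prior - toy example
How does sensitivity to the prior impact on this method?
Suppose $y = \{y_i\}$, $i = 1, \dots, n$, are iid $N(\theta, 1)$. A priori, $\theta \sim N(m, v)$. Then the power posterior is $\theta \mid y, t \sim N(m_t, v_t)$, where
$$m_t = \frac{n t \bar y + m/v}{n t + 1/v} \qquad \text{and} \qquad v_t = \frac{1}{n t + 1/v},$$
and
$$E_{\theta \mid y, t} \log \pi(y \mid \theta) = -\frac{n}{2} \log 2\pi - \frac{1}{2} \sum_{i=1}^{n} (y_i - \bar y)^2 - \frac{n}{2}\,\frac{(m - \bar y)^2}{(v n t + 1)^2} - \frac{n}{2}\,\frac{1}{n t + 1/v}.$$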
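For this toy model the trapezoidal power posterior estimate can be checked against the closed-form evidence. The sketch below plugs the analytic expected log-likelihood into the trapezoidal rule in place of Monte Carlo estimates $E_i$; the data, the schedule $t_i = (i/n)^5$ and the function names are illustrative assumptions.

```python
import math

def expected_loglik(t, y, m, v):
    """E_{theta|y,t} log pi(y|theta) for y_i ~ N(theta, 1), theta ~ N(m, v)."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((yi - ybar) ** 2 for yi in y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * ss
            - 0.5 * n * (m - ybar) ** 2 / (v * n * t + 1) ** 2
            - 0.5 * n / (n * t + 1.0 / v))

def power_posterior_log_evidence(y, m, v, n_temps=100, power=5):
    """Trapezoidal rule over the temperature ladder t_i = (i / n_temps)^power."""
    temps = [(i / n_temps) ** power for i in range(n_temps + 1)]
    e = [expected_loglik(t, y, m, v) for t in temps]
    return sum((temps[i] - temps[i - 1]) * (e[i - 1] + e[i]) / 2
               for i in range(1, n_temps + 1))

def exact_log_evidence(y, m, v):
    n = len(y)
    s = sum(yi - m for yi in y)
    ss = sum((yi - m) ** 2 for yi in y)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * v)
            - 0.5 * (ss - v * s * s / (1 + n * v)))

y = [1.2, -0.3, 0.8, 0.1, 1.5, 0.4, -0.6, 0.9, 0.7, 0.2,
     1.1, -0.1, 0.6, 0.3, 1.0, 0.5, -0.4, 0.8, 0.9, 0.4]   # illustrative data
for v in (1.0, 5.0, 10.0):
    print(v, power_posterior_log_evidence(y, 0.0, v), exact_log_evidence(y, 0.0, v))
```

Placing most temperatures near $t = 0$ matters because the integrand changes fastest there, and more sharply so as $v$ grows.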
[Figure: expected deviance under the distribution $\theta \mid y, t$, plotted against $t \in [0, 1]$, for prior variance $v$ equal to 10, 5, 1.]
As $v$ increases, so too does the rate at which the mean deviance
Connection to Fractional Bayes estimator
The fraction $z(y \mid t = 1) / z(y \mid t = a)$, where $a$ is close to 0, is precisely the estimate of the marginal likelihood used in the 'Fractional Bayes' estimate of the Bayes factor (O'Hagan 1995):
$$\pi(y) \approx \frac{z(y \mid t = 1)}{z(y \mid t = a)} = \frac{\int_{\theta} \pi(y \mid \theta)\,\pi(\theta)\,d\theta}{\int_{\theta} \{\pi(y \mid \theta)\}^{a}\,\pi(\theta)\,d\theta}, \qquad \log \frac{z(y \mid t = 1)}{z(y \mid t = a)} = \int_a^1 E_{\theta \mid y, t} \log \pi(y \mid \theta)\,dt.$$
This method was proposed to compute Bayes factors with uninformative priors. Impropriety in $\pi(\theta)$ cancels above and below.
Power posterior approach
- It is relatively straightforward to code and implement.
- It is a generic method. In some cases it can be implemented in WinBUGS.
- Choosing the temperature schedule is vital; this is the weakness of this approach. Behrens, Friel and Hurn (2011) offer some possibilities in this direction.
Nested sampling
(Skilling 2006)
(For the moment, for ease of notation, let $L(\theta) = \pi(y \mid \theta)$.)
$$\pi(y) = \int L(\theta)\,\pi(\theta)\,d\theta = \int L(\theta)\,dX,$$
where $dX = \pi(\theta)\,d\theta$ is an element of prior mass.
Define
$$X(\lambda) = \int_{L(\theta) > \lambda} \pi(\theta)\,d\theta$$
as a cumulant prior mass.
Write the inverse function as $L(X)$, i.e. $L(X(\lambda)) = \lambda$. This then allows us to express the evidence as a 1-dimensional integral:
$$\pi(y) = \int_0^1 L(X)\,dX.$$
The main computational burden is the requirement to sample $\theta$ from the prior subject to the constraint that $L(\theta) > l$, for a given likelihood threshold $l$.
This is roughly similar to the computational effort of slice sampling (Neal 2003).
The evidence is estimated by sorting draws from the prior according to their likelihood:
$$\pi(y) = Z \approx \sum_{i=1}^{I} (X_{i-1} - X_i)\,L_i.$$
Sketch of algorithm
Sample $\theta_1, \dots, \theta_N$ from the prior. Repeat for $i = 1, \dots, I$:
- Find the point $\theta_k$ with the smallest likelihood, $L_i$, among the $N$ current points.
- Set $X_i = \exp(-i/N)$ and $w_i = X_{i-1} - X_i$ (with $X_0 = 1$). Increment $Z$ by $L_i w_i$.
- Replace $\theta_k$ with a point sampled from the prior subject to $L(\theta) > L_i$.
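The sketch of the algorithm can be turned into a short runnable example under strong simplifying assumptions. Here the model is $y_i \sim N(\theta, 1)$ with a uniform prior on $[-5, 5]$, chosen so that the constrained prior draw is exact: $L(\theta) > L_i$ holds exactly on an interval centred at $\bar y$. The data, prior, seed and function names are all illustrative. One addition not in the sketch above is the final term, which adds the contribution of the remaining live points.

```python
import math
import random

Y = [1.2, -0.3, 0.8, 0.1, 1.5, 0.4, -0.6, 0.9, 0.7, 0.2,
     1.1, -0.1, 0.6, 0.3, 1.0, 0.5, -0.4, 0.8, 0.9, 0.4]
N_OBS = len(Y)
YBAR = sum(Y) / N_OBS
SS = sum((yi - YBAR) ** 2 for yi in Y)
LOW, HIGH = -5.0, 5.0   # assumed uniform prior on [LOW, HIGH]

def likelihood(theta):
    return math.exp(-0.5 * N_OBS * math.log(2 * math.pi) - 0.5 * SS
                    - 0.5 * N_OBS * (theta - YBAR) ** 2)

def sample_constrained(rng, theta_k):
    """Prior draw subject to L(theta) > L(theta_k): an interval around YBAR."""
    r = abs(theta_k - YBAR)
    return rng.uniform(max(LOW, YBAR - r), min(HIGH, YBAR + r))

def nested_sampling_log_evidence(n_live=400, n_iter=4800, seed=0):
    rng = random.Random(seed)
    live = [rng.uniform(LOW, HIGH) for _ in range(n_live)]
    lik = [likelihood(th) for th in live]
    z, x_prev = 0.0, 1.0
    for i in range(1, n_iter + 1):
        k = min(range(n_live), key=lik.__getitem__)   # live point with smallest likelihood
        x_i = math.exp(-i / n_live)                   # X_i = exp(-i/N)
        z += lik[k] * (x_prev - x_i)                  # increment Z by L_i w_i
        x_prev = x_i
        live[k] = sample_constrained(rng, live[k])
        lik[k] = likelihood(live[k])
    z += x_prev * sum(lik) / n_live                   # remaining prior mass (an addition)
    return math.log(z)

def exact_log_evidence():
    def phi(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    root_n = math.sqrt(N_OBS)
    mass = phi((HIGH - YBAR) * root_n) - phi((LOW - YBAR) * root_n)
    return (-0.5 * N_OBS * math.log(2 * math.pi) - 0.5 * SS
            + 0.5 * math.log(2 * math.pi / N_OBS) + math.log(mass) - math.log(HIGH - LOW))

print(nested_sampling_log_evidence(), exact_log_evidence())
```

In realistic problems the constrained draw is the hard part and usually needs an MCMC or slice-sampling step, as noted above.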
Evidence estimation: doubly intractable distributions
Doubly intractable distributions
$$\pi(\theta \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)$$
Here we assume that the likelihood, $\pi(y \mid \theta)$, is impossible to evaluate.
Ising model
Gibbs random fields, which find use in spatial statistics and statistical network analysis, involve intractable likelihood models.
- Defined on a lattice $y = \{y_1, \dots, y_n\}$.
- Lattice points $y_i$ take values in $\{-1, 1\}$.
- Full conditional: $\pi(y_i \mid y_{-i}, \theta) = \pi(y_i \mid \text{neighbours of } i, \theta)$.
$$\pi(y \mid \theta) \propto q(y \mid \theta) = \exp\left\{\frac{1}{2}\,\theta_1 \sum_{i \sim j} y_i y_j\right\}.$$
1st order and 2nd order Ising models.
$$\pi(y \mid \theta) = \frac{\exp\{\theta^T s(y)\}}{z(\theta)}$$
$s(y)$ is a sufficient statistic and counts the number of 'like' neighbours.
$$z(\theta) = \sum_{y_1} \cdots \sum_{y_n} q(y \mid \theta).$$
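To make the intractability concrete, $z(\theta)$ can be computed by brute force only for very small lattices. The sketch below enumerates all $2^9$ configurations of a $3 \times 3$ first-order lattice; the lattice size, free boundary conditions, and the choice to count each neighbour pair once in $s(y) = \sum y_i y_j$ are assumptions made for illustration. The sum has $2^n$ terms, so it is out of reach for a $16 \times 16$ lattice by direct enumeration.

```python
import itertools
import math

def neighbour_pairs(rows, cols):
    """First-order (4 nearest neighbour) pairs on a rows x cols lattice, each counted once."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                pairs.append((i, i + 1))       # horizontal neighbour
            if r + 1 < rows:
                pairs.append((i, i + cols))    # vertical neighbour
    return pairs

def suff_stat(y, pairs):
    """s(y) = sum over neighbour pairs of y_i * y_j, with y_i in {-1, +1}."""
    return sum(y[i] * y[j] for i, j in pairs)

def normalising_constant(theta, rows, cols):
    """z(theta) by summing exp(theta * s(y)) over all 2^(rows*cols) configurations."""
    pairs = neighbour_pairs(rows, cols)
    return sum(math.exp(theta * suff_stat(y, pairs))
               for y in itertools.product((-1, 1), repeat=rows * cols))

print(normalising_constant(0.0, 3, 3))   # every term is exp(0) = 1
print(normalising_constant(0.4, 3, 3))
```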
Model evidence for MRFs – our approach
$$\pi(y) = \frac{q(y \mid \theta)\,\pi(\theta)}{z(\theta)\,\pi(\theta \mid y)} \qquad \forall\,\theta.$$
- Draw from the posterior, and estimate $\pi(\theta^* \mid y)$ for a high-probability $\theta^*$.
- Estimate $z(\theta^*)$ using thermodynamic integration.
Simulating from the posterior
Auxiliary variable method
(Møller et al. 2006)
Introduce an auxiliary variable $y'$ on the same space as the data $y$ and extend the target distribution to
$$\pi(\theta, y' \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)\,\pi(y' \mid \theta_0),$$
for some fixed $\theta_0$.
Jointly update $(\theta^*, y'^*)$ with proposal
$$h(\theta^*, y'^* \mid \theta, y') = h_1(y'^* \mid \theta^*)\,h_2(\theta^* \mid \theta, y'^*),$$
where
$$h_1(y'^* \mid \theta^*) = \pi(y'^* \mid \theta^*) = \frac{q(y'^* \mid \theta^*)}{z(\theta^*)}.$$
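The acceptance ratio is
$$\alpha(\theta^*, y'^* \mid \theta, y') = \frac{\pi(y \mid \theta^*)\,\pi(\theta^*)\,\pi(y'^* \mid \theta_0)\,\pi(y' \mid \theta)\,h_2(\theta \mid \theta^*)}{\pi(y \mid \theta)\,\pi(\theta)\,\pi(y' \mid \theta_0)\,\pi(y'^* \mid \theta^*)\,h_2(\theta^* \mid \theta)}.$$
$z(\theta^*)$ appears in $\pi(y \mid \theta^*)$ above and in $\pi(y'^* \mid \theta^*)$ below, and therefore cancels. Similarly $z(\theta)$ cancels above and below.
The choice of $\theta_0$ is important, e.g. the maximum pseudolikelihood estimate.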
Exchange algorithm
(Murray, Ghahramani and MacKay 2006)
Sample from an augmented distribution
$$\pi(\theta', y', \theta \mid y) \propto \pi(y \mid \theta)\,\pi(\theta)\,h(\theta' \mid \theta)\,\pi(y' \mid \theta'),$$
whose marginal distribution for $\theta$ is the posterior of interest.
- $\pi(y' \mid \theta')$ is the same likelihood model on which $y$ is defined.
- $h(\theta' \mid \theta)$ is an arbitrary distribution for the augmented variable $\theta'$, which might depend on $\theta$ (e.g. a random walk distribution).
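Exchange algorithm – How it works
1. Gibbs update of $(\theta', y')$:
   i. Draw $\theta' \sim h(\cdot \mid \theta)$.
   ii. Draw $y' \sim \pi(\cdot \mid \theta')$.
2. Exchange move from $(\theta, y), (\theta', y')$ to $(\theta', y), (\theta, y')$ with probability
$$\alpha = \min\left(1,\; \underbrace{\frac{q(y' \mid \theta)}{q(y \mid \theta)}}_{(*)}\;\frac{\pi(\theta')}{\pi(\theta)}\;\frac{h(\theta \mid \theta')}{h(\theta' \mid \theta)}\;\underbrace{\frac{q(y \mid \theta')}{q(y' \mid \theta')}}_{(**)} \times \underbrace{\frac{z(\theta)\,z(\theta')}{z(\theta)\,z(\theta')}}_{1}\right).$$
- The exchange move proposes to "offer" the data $y$ the auxiliary parameter $\theta'$, and similarly to "offer" the auxiliary data $y'$ the parameter $\theta$.
- The affinity between $\theta'$ and $y$ is measured by (**) and the affinity between $\theta$ and $y'$ by (*).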
Exchange algorithm for the Ising model
$$\alpha = \min\left(1,\; \frac{\pi(\theta')}{\pi(\theta)} \exp\left\{(\theta - \theta')^T (s(y') - s(y))\right\}\right)$$
The term $\exp\{(\theta - \theta')^T (s(y') - s(y))\}$ can be viewed as a measure of distance between the observed data $y$ and the auxiliary data $y'$.
It is somewhat similar to the accept/reject step in ABC (approximate Bayesian computation).
Note: If $\theta \approx \theta'$, then $\alpha \approx 1$. This does not necessarily happen with
- The main difficulty is the need to draw an exact sample $y' \sim \pi(\cdot \mid \theta')$.
- Perfect sampling is an obvious approach.
- A pragmatic alternative is to take a realisation from a long MCMC run with stationary distribution $\pi(y' \mid \theta')$ as an approximate draw.
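On a lattice small enough to enumerate, the exact draw is trivial, which allows a complete sketch of the exchange algorithm. The $3 \times 3$ lattice, the observed configuration, the $N(0, 1)$ prior, the symmetric random-walk proposal and all names below are assumptions for illustration. Since the acceptance probability depends on $y'$ only through $s(y')$, the code draws $s(y')$ directly from its exact distribution under $\pi(\cdot \mid \theta')$.

```python
import itertools
import math
import random
from collections import Counter

ROWS, COLS = 3, 3
PAIRS = [(r * COLS + c, r * COLS + c + 1) for r in range(ROWS) for c in range(COLS - 1)] + \
        [(r * COLS + c, (r + 1) * COLS + c) for r in range(ROWS - 1) for c in range(COLS)]

def suff_stat(y):
    """s(y): sum of y_i * y_j over nearest-neighbour pairs, each counted once."""
    return sum(y[i] * y[j] for i, j in PAIRS)

# Number of configurations attaining each value of s(y), by full enumeration.
S_COUNTS = Counter(suff_stat(y) for y in itertools.product((-1, 1), repeat=ROWS * COLS))
S_VALUES = sorted(S_COUNTS)

def draw_stat(theta, rng):
    """Exact draw of s(y') where y' ~ pi(.|theta)."""
    weights = [S_COUNTS[s] * math.exp(theta * s) for s in S_VALUES]
    return rng.choices(S_VALUES, weights=weights)[0]

def log_prior(theta):
    return -0.5 * theta * theta   # assumed N(0, 1) prior, up to a constant

def exchange_sampler(s_obs, n_iter=2000, step=0.3, seed=0):
    rng = random.Random(seed)
    theta, chain, accepted = 0.0, [], 0
    for _ in range(n_iter):
        theta_new = theta + rng.gauss(0.0, step)   # symmetric h, so it cancels in alpha
        s_aux = draw_stat(theta_new, rng)          # auxiliary data from pi(.|theta')
        log_alpha = (log_prior(theta_new) - log_prior(theta)
                     + (theta - theta_new) * (s_aux - s_obs))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            theta, accepted = theta_new, accepted + 1
        chain.append(theta)
    return chain, accepted / n_iter

y_obs = (1, 1, 1, 1, 1, -1, 1, -1, -1)   # illustrative observed lattice
chain, rate = exchange_sampler(suff_stat(y_obs))
print(sum(chain) / len(chain), rate)
```

Note that $z(\theta)$ is never evaluated inside the sampler; enumeration is used only to produce the exact auxiliary draw.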
Simulation study: Ising model
Data $y$ simulated from an Ising model defined on a $16 \times 16$ lattice, with a single interaction parameter $\theta$.
Two competing models: 4 and 8 nearest neighbours.
Here the lattices are sufficiently small to allow a very accurate estimate of the Bayes factor:
The normalising constant $z(\theta)$ can be calculated exactly for a grid of $\{\theta_i\}$ values, which can then be plugged into the right-hand side of
$$\pi(\theta_i \mid y) \propto \frac{q(y \mid \theta_i)}{z(\theta_i)}\,\pi(\theta_i), \qquad i = 1, \dots, n.$$
Summing up the right-hand side yields an estimate of $\pi(y)$. This serves as a ground truth to compare with the corresponding MCMC-based estimate of the model evidence.
Results: Ising model
$\theta$    $\widehat{BF}$    $BF$
0.1         2.51              1.88
0.2         13.48             13.57
0.3         9.135             6.95
Exponential random graph models
The exponential random graph (or $p^*$) model
First proposed by Frank and Strauss (JASA, 1986).
Let $y_{ij} = 1$ denote an edge connecting nodes $i$ and $j$, and $y_{ij} = 0$ otherwise.
Data $y$ is an adjacency matrix indicating nodes which are connected by an edge.
1. Edges $y_{ij}$ and $y_{kl}$ are neighbours of one another if they share a common node.
2. If $y_{ij}$ and $y_{kl}$ are not neighbours, then $y_{ij}$ and $y_{kl}$ are conditionally independent, given the rest of the graph.
The $p^*$ model
$$\pi(y \mid \theta) = \frac{\exp\{\theta^T s(y)\}}{z(\theta)} = \frac{q(y \mid \theta)}{z(\theta)}$$
- $y$: observed graph
- $s(y)$: known vector of sufficient statistics
- $\theta$: vector of parameters
- $z(\theta)$: normalizing constant,
$$z(\theta) = \sum_{\text{all possible graphs}} \exp\{\theta^T s(y)\}$$
- $2^{\binom{n}{2}}$ possible undirected graphs on $n$ nodes
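A short sketch shows what the sufficient statistics and the size of the sum in $z(\theta)$ look like in code. The function names and the particular statistics (edges, 2-stars, 3-stars, triangles, for an undirected graph stored as a symmetric 0/1 adjacency matrix) are illustrative assumptions.

```python
import math

def ergm_statistics(adj):
    """Edge, k-star and triangle counts of an undirected graph given as a 0/1 matrix."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    return {
        "edges": sum(degrees) // 2,                        # each edge counted at both ends
        "two_stars": sum(math.comb(d, 2) for d in degrees),
        "three_stars": sum(math.comb(d, 3) for d in degrees),
        "triangles": sum(adj[i][j] * adj[j][k] * adj[i][k]
                         for i in range(n)
                         for j in range(i + 1, n)
                         for k in range(j + 1, n)),
    }

def n_undirected_graphs(n):
    """Number of terms in z(theta): 2^(n choose 2)."""
    return 2 ** math.comb(n, 2)

triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
print(ergm_statistics(triangle))
print(n_undirected_graphs(16))   # 2^120 graphs on 16 nodes
```

Even at 16 nodes the sum defining $z(\theta)$ has $2^{120}$ terms, which is why the likelihood cannot be normalised by enumeration.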
Model Specification: Network Statistics
[Figure: network statistics. Directed graphs: edge, mutual edge, 2-in-star, 2-out-star, 2-mixed-star, transitive triad, cyclic triad. Undirected graphs: edge, 2-star, 3-star, triangle.]
ERGM: Florentine network
Model 1: y ∼edges + 3-star
Model 2: y ∼edges + 2-star
Here it is difficult to establish a ground truth. For this purpose, we ran an 'independence' RJMCMC sampler:
1. Sample from each model, separately, using the exchange algorithm. (Here we used the Bergm package of Caimo and Friel (2011).)
2. RJMCMC: use the posterior mean and variance for model $k$ as proposal parameters when proposing to jump to model $k$.
This works well, since the model space is small, but also because each within-model posterior is unimodal.
Acceptance rates for the jump proposals were around 40%, suggesting that the proposal distributions were a good fit to each within-model posterior.
This is essentially the AutoRJ approach outlined in Chapter 6 of Green (2003).
Here estimates of posterior model probabilities based on AutoRJ are compared to those based on estimates of the model evidence for each model.
            $\pi(m_1 \mid y)$    $\pi(m_2 \mid y)$    $\pi(m_3 \mid y)$
AutoRJ      0.29                 0.69                 0.02
Summary
Concluding remarks
- Model evidence is difficult to compute!
- Often complex Monte Carlo methods are needed. There are plenty of methods in the Bayesian toolbox.
References
- Chib, S. (1995) Marginal likelihood using Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.
- Friel, N. and Pettitt, A. N. (2008) Marginal likelihood via power posteriors. Journal of the Royal Statistical Society, Series B, 70, 589–607.
- Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference by the weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical Society, Series B, 56, 3–48.
- Neal, R. (2001) Annealed importance sampling. Statistics and Computing, 11, 125–139.
- Murray, I., Ghahramani, Z. and MacKay, D. (2006) MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence.
- Caimo, A. and Friel, N. (2011) Bayesian inference for the exponential random graph model. Social Networks, 33, 41–55.