Chapter 4 An efficient Gibbs sampler for structural inference
4.3 Preliminaries
To introduce the Gibbs sampler, we first recall the standard MC3 sampler and an analogous naïve Gibbs sampler. Convergence of Gibbs samplers usually follows from the Hammersley-Clifford theorem (Besag, 1974), but the theorem does not apply in this context, so an alternative argument is outlined.
4.3.1 MC3 sampler
The standard sampler for structural inference for Bayesian networks is MC3 (Madigan and York, 1995), a Metropolis-Hastings sampler that explores the space of graphs $\mathcal{G}$ by proposing to add or remove a single edge from the current graph $G$, subject to acyclicity. Each proposal $G'$ is drawn uniformly at random from the neighbourhood $\nu(G)$ of the current graph, defined as the set of DAGs that differ from $G$ by the addition or removal of a single edge. The proposal $G'$ is accepted with probability $\alpha(G', G)$, where

$$\alpha(G', G) = \min\left(1, \frac{P(G' \mid X)\,|\nu(G')|^{-1}}{P(G \mid X)\,|\nu(G)|^{-1}}\right).$$
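The MC3 update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: graphs are stored as nested-list adjacency matrices, `log_post` is a hypothetical user-supplied function returning the unnormalised log posterior $\log P(G \mid X)$, and the names `is_acyclic`, `neighbourhood` and `mc3_step` are introduced here for illustration only. The neighbourhood is enumerated by brute force, which is only practical for small $p$.

```python
import math
import random


def is_acyclic(adj):
    """Return True if the directed graph given by adj has no cycle (Kahn's algorithm)."""
    p = len(adj)
    indeg = [sum(adj[u][v] for u in range(p)) for v in range(p)]
    stack = [v for v in range(p) if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in range(p):
            if adj[u][v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    stack.append(v)
    # Every node can be removed in topological order only if there is no cycle.
    return removed == p


def neighbourhood(adj):
    """All DAGs that differ from adj by the addition or removal of a single edge."""
    p = len(adj)
    result = []
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            candidate = [row[:] for row in adj]
            candidate[i][j] = 1 - candidate[i][j]  # toggle edge i -> j
            if is_acyclic(candidate):
                result.append(candidate)
    return result


def mc3_step(adj, log_post, rng):
    """One Metropolis-Hastings update; assumes p >= 2 so the neighbourhood is non-empty."""
    nbrs = neighbourhood(adj)
    proposal = rng.choice(nbrs)  # uniform draw from nu(G)
    # log of P(G'|X) |nu(G')|^{-1} / (P(G|X) |nu(G)|^{-1})
    log_ratio = (log_post(proposal) - math.log(len(neighbourhood(proposal)))
                 - log_post(adj) + math.log(len(nbrs)))
    accept_prob = math.exp(min(0.0, log_ratio))
    return proposal if rng.random() < accept_prob else adj
```

Working on the log scale avoids underflow when posterior probabilities are very small; the neighbourhood sizes enter the ratio because the uniform proposal is not symmetric when $|\nu(G')| \neq |\nu(G)|$.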
4.3.2 A naïve Gibbs sampler
Constructing a Gibbs sampler that is analogous to MC3 is straightforward. To do this, we consider the posterior distribution on Bayesian networks to be a joint distribution for the off-diagonal entries in the adjacency matrix, which is a $p \times p$ matrix whose elements $G_{ij}$ are indicator variables for whether $G$ includes an edge from $i$ to $j$, and whose diagonal elements $G_{ii} = 0$ for all $i$. We thus have $p(p-1)$ random variables $G_{ij}$, each of which takes the value 1 or 0. The proposal distribution of MC3 can be viewed as proposing to toggle the value of $G_{ij}$ of the adjacency matrix for some $i \neq j$, subject to the restriction that the proposal must be acyclic. A simple Gibbs sampler works in a similar way. At each step of the Gibbs sampler a sample from the conditional distribution of $G_{ij}$ is drawn, for some $i, j \in \{1, \ldots, p\}$, $i \neq j$, given the rest of the graph $G^{C}_{ij} = \{G_{uv} : 1 \le u \le p,\ 1 \le v \le p\} \setminus \{G_{ij}\}$. Define $G^{+}_{ij}$ as the graph $G$ with an edge from $i$ to $j$, and $G^{-}_{ij}$ as the graph $G$ with no edge from $i$ to $j$. If $G^{+}_{ij}$ is cyclic, $G^{-}_{ij}$ is sampled with probability 1. If $G^{+}_{ij}$ is acyclic, the conditional distribution of $G_{ij}$ is Bernoulli:

$$
P(G'_{ij} = g \mid G^{C}_{ij}) =
\begin{cases}
1 & g = 0,\ G^{+}_{ij} \text{ cyclic} \\
0 & g = 1,\ G^{+}_{ij} \text{ cyclic} \\
\dfrac{P(G^{-}_{ij} \mid X)}{P(G^{-}_{ij} \mid X) + P(G^{+}_{ij} \mid X)} & g = 0,\ G^{+}_{ij} \text{ acyclic} \\
\dfrac{P(G^{+}_{ij} \mid X)}{P(G^{-}_{ij} \mid X) + P(G^{+}_{ij} \mid X)} & g = 1,\ G^{+}_{ij} \text{ acyclic}
\end{cases}
\tag{4.1}
$$
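A single random-scan update using the conditional distribution (4.1) might be sketched as follows. This is an illustrative sketch rather than the implementation used here: `log_post` is a hypothetical function returning the unnormalised log posterior of a graph, the graph is a nested-list adjacency matrix, and the helper names are assumptions. The cycle check relies on the current graph being acyclic, in which case adding the edge $i \to j$ creates a cycle exactly when a directed path from $j$ to $i$ already exists.

```python
import math
import random


def has_path(adj, src, dst):
    """Depth-first search for a directed path from src to dst."""
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        for v in range(len(adj)):
            if adj[u][v] and v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def edge_probability(adj, i, j, log_post):
    """P(G'_ij = 1 | rest of graph) from (4.1); adj is assumed to be acyclic."""
    if has_path(adj, j, i):
        return 0.0  # G+_ij would be cyclic, so the edge is absent with probability 1
    plus = [row[:] for row in adj]
    plus[i][j] = 1
    minus = [row[:] for row in adj]
    minus[i][j] = 0
    lp, lm = log_post(plus), log_post(minus)
    shift = max(lp, lm)  # subtract the maximum before exponentiating, for stability
    wp, wm = math.exp(lp - shift), math.exp(lm - shift)
    return wp / (wp + wm)


def gibbs_step(adj, log_post, rng):
    """Random-scan update: pick an ordered pair i != j uniformly and resample G_ij."""
    i, j = rng.sample(range(len(adj)), 2)
    new = [row[:] for row in adj]
    new[i][j] = 1 if rng.random() < edge_probability(adj, i, j, log_post) else 0
    return new
```

Unlike the MC3 update, there is no accept/reject step: the new value of $G_{ij}$ is always drawn from its conditional distribution, and it may coincide with the current value.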
The choice of $i$ and $j$ can be made either sequentially (systematic scan) or randomly (random scan). There are few theoretical results to guide the choice between random- and systematic-scan Gibbs samplers (Roberts and Sahu, 1997); here, random-scan Gibbs samplers are used throughout.
This naïve Gibbs sampler offers no advantages over MC3. However, thinking of structural inference from a Gibbs sampling perspective opens up the possibility of drawing on ideas from the Gibbs sampling literature to improve the mixing rate of the MCMC algorithm, which we discuss in Section 4.4.
4.3.3 Convergence conditions for Gibbs samplers
Convergence of a Gibbs sampler for Bayesian networks does not follow from the usual justification of Gibbs sampling that relies on the Hammersley-Clifford theorem (Besag, 1974). The theorem gives a positivity condition that is sufficient to prove that the univariate conditional distributions used by the Gibbs sampler uniquely define the joint distribution. The required condition is that the support of the joint distribution is given by the Cartesian product of the supports of the marginal distributions. An example of when this condition does not hold is a density $p(x, y)$ with support only on $[0,1] \times [0,1]$ and $[2,3] \times [2,3]$. Clearly $p(x)$ and $p(y)$ are both positive on $[0,1]$ and $[2,3]$, but neither $[0,1] \times [2,3]$ nor $[2,3] \times [0,1]$ is in the support of the joint distribution (Hobert et al., 1997; O’Hagan and Forster, 2004).
The acyclicity requirement of Bayesian networks means that this positivity condition is not satisfied. Consider a Bayesian network consisting of two correlated random variables $X_1$ and $X_2$. The correlation means that both the graph with a single edge $1 \to 2$ and the graph with a single edge $2 \to 1$ have positive probability. Thus $P(G_{12} = 1) > 0$ and $P(G_{21} = 1) > 0$ in the marginal distributions. However, the joint probability $P(G_{12} = 1 \text{ and } G_{21} = 1) = 0$ because the corresponding graph (the complete graph) is cyclic. The complete graph is thus not in the support of the joint distribution but is in the Cartesian product of the supports of the marginal distributions.
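The two-node counterexample can be checked by direct enumeration. In the sketch below the posterior weights are hypothetical numbers chosen only so that all three DAGs on two nodes have positive probability; the argument depends on which graphs have positive probability, not on the values themselves.

```python
from itertools import product

# Hypothetical posterior over the three DAGs on two nodes, keyed by (G12, G21).
# The cyclic configuration (1, 1) is not a DAG and so has no entry.
posterior = {(0, 0): 0.2, (1, 0): 0.4, (0, 1): 0.4}

joint_support = {g for g, prob in posterior.items() if prob > 0}
support_g12 = {g[0] for g in joint_support}  # values with P(G12 = value) > 0
support_g21 = {g[1] for g in joint_support}  # values with P(G21 = value) > 0

# Positivity would require the joint support to equal this Cartesian product.
product_support = set(product(support_g12, support_g21))
violations = product_support - joint_support
print(violations)
```

The only element of `violations` is the configuration with both edges present, which is the cyclic complete graph.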
An alternative sufficient condition for uniqueness of the joint distribution and convergence of the Gibbs sampler when positivity is not satisfied is given by Besag (1994) in a discussion of Tierney (1994), and was expanded upon in continuous settings by Hobert et al. (1997). The condition requires that for every $G^{(0)} \in \mathcal{G}$ and $G \in \mathcal{G}$ there exists a finite sequence $G^{(1)}, \ldots, G^{(d)}$, with $G^{(d)} = G$ and $d \in \mathbb{N}$, such that $G^{(i)}$ and $G^{(i-1)}$ differ in only a single component, and that the joint probability $P(G^{(i)}) > 0$ for all $i = 1, \ldots, d$. When the graph prior $\pi(G) > 0$ for all $G$, this condition is clearly satisfied: one such finite sequence removes every edge of $G^{(0)}$, one at a time, and then adds every edge of $G$, one at a time. Each graph in the sequence is acyclic, since it is a subgraph of either $G^{(0)}$ or $G$, both of which are acyclic, and so has positive probability in the joint distribution when the graph prior is positive everywhere in $\mathcal{G}$. A similar proof follows if the support of the graph prior is closed under taking subgraphs, that is, if every subgraph of a graph with positive prior probability also has positive prior probability, as is true for most widely used priors.
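The connecting sequence used in this argument can be constructed explicitly. The following sketch, with illustrative function names and a nested-list adjacency-matrix representation that are assumptions rather than part of the original text, builds the sequence and checks that every intermediate graph is acyclic and that consecutive graphs differ in a single entry.

```python
def is_acyclic(adj):
    """Return True if the directed graph given by adj has no cycle (Kahn's algorithm)."""
    p = len(adj)
    indeg = [sum(adj[u][v] for u in range(p)) for v in range(p)]
    stack = [v for v in range(p) if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in range(p):
            if adj[u][v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    stack.append(v)
    return removed == p


def connecting_sequence(start, target):
    """Graphs G(1), ..., G(d): remove every edge of start, then add every edge of target."""
    p = len(start)
    current = [row[:] for row in start]
    sequence = []
    for value, source in ((0, start), (1, target)):
        for i in range(p):
            for j in range(p):
                if source[i][j]:
                    current[i][j] = value  # change exactly one component
                    sequence.append([row[:] for row in current])
    return sequence


def hamming(a, b):
    """Number of adjacency-matrix entries in which graphs a and b differ."""
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


start = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]   # 1 -> 2 -> 3
target = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]  # 3 -> 2 -> 1
path = [start] + connecting_sequence(start, target)
assert path[-1] == target
assert all(is_acyclic(g) for g in path)
assert all(hamming(a, b) == 1 for a, b in zip(path, path[1:]))
```

The example reverses a three-node chain, a move that cannot be made by toggling the two reversed edges directly without passing through a cyclic graph, but which the remove-then-add sequence achieves in four single-edge steps.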