When learning structure, we must consider the following issues:
• What is the hypothesis space? We can imagine searching the space of DAGs, equivalence classes of DAGs (PDAGs), undirected graphs, trees, node orderings, etc.
• What is the evaluation (scoring) function? i.e., how do we decide which model we prefer?
• What is the search algorithm? We can imagine local search (e.g., greedy hill climbing, possibly with multiple restarts) or global search (e.g., simulated annealing or genetic algorithms).
We discuss these issues in more detail below.
C.6.1 Search space
Trees
If we are willing to restrict our hypothesis space to trees, we can find the optimal ML tree in $O(MN^2)$ time (where $N$ is the number of nodes and $M$ is the number of data cases) using the famous Chow-Liu algorithm [CL68, Pea88]. We can use EM to extend this to mixtures of trees [MJ00]. Interestingly, learning the optimal ML path (a spanning tree in which no vertex has degree higher than two) is NP-hard [Mee01].
DAGs
The most common approach is to search through the space of DAGs. Unfortunately, the number of DAGs on $N$ variables is $2^{O(N^2 \log N)}$ [Rob73, FK00]. (A more obvious upper bound is $O(2^{N^2})$, which is the number of $N \times N$ adjacency matrices.) For example, there are 543 DAGs on 4 nodes, and $O(10^{18})$ DAGs on 10 nodes. This means that attempts to find the “true” DAG, or even just to explore the posterior modes, are doomed to failure, although one might be able to find a good local maximum, which should be sufficient for density estimation purposes.
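The counts quoted above can be reproduced with the standard inclusion-exclusion recurrence for labeled DAGs, usually attributed to Robinson. The following is a minimal sketch; the function name `count_dags` is ours and does not come from the cited papers.

```python
from math import comb

def count_dags(n):
    """Number of labeled DAGs on n nodes, via the inclusion-exclusion recurrence."""
    a = [1]  # a[0] = 1: the graph with no nodes
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            # k is the size of a chosen set of nodes with no incoming arcs;
            # each of them may point to any of the remaining m - k nodes.
            total += (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]

print(count_dags(4))   # 543
print(count_dags(10))  # on the order of 10**18
```

Even for 4 nodes the count (543) is far below the adjacency-matrix bound $2^{16}$, but the growth is still super-exponential.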
PDAGs
A PDAG (partially directed acyclic graph), also called a pattern or essential graph, represents a whole class of Markov equivalent DAGs. Two DAGs are Markov equivalent if they imply the same set of (conditional) independencies. For example, $X \to Y \to Z$, $X \leftarrow Y \to Z$ and $X \leftarrow Y \leftarrow Z$ are Markov equivalent, since they all represent $X \perp Z \mid Y$. In general, two graphs are Markov equivalent iff they have the same structure ignoring arc directions, and the same v-structures [VP90]. (A v-structure consists of converging directed edges into the same node, such as $X \to Y \leftarrow Z$, where $X$ and $Z$ are not themselves adjacent.) Since we cannot distinguish members of the same Markov equivalence class if we only have observational data [CY99, Pea00], it makes sense to search in the space of PDAGs, which is smaller than the space of all DAGs (about 3.7–14 times smaller [GP01, Ste00]). We discuss methods to do this in Section C.6.2. Of course, using PDAGs is not appropriate if we have experimental (interventional) as well as observational data. (An intervention means setting or forcing a node to a specific value, as opposed to observing that it has some value [Pea00].)
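The equivalence criterion (same skeleton, same v-structures) is easy to check mechanically. The sketch below assumes a DAG is given as a list of `(parent, child)` pairs and that a v-structure requires the two parents to be non-adjacent; the helper names are illustrative only.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: the set of adjacent node pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All triples (a, child, b) with a -> child <- b and a, b not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found = set()
    for child, pa in parents.items():
        for a, b in combinations(sorted(pa), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(edges1, edges2):
    """True iff the two DAGs share a skeleton and a set of v-structures."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))

chain = [("X", "Y"), ("Y", "Z")]      # X -> Y -> Z
fork = [("Y", "X"), ("Y", "Z")]       # X <- Y -> Z
collider = [("X", "Y"), ("Z", "Y")]   # X -> Y <- Z
print(markov_equivalent(chain, fork), markov_equivalent(chain, collider))
```

The chain and the fork come out equivalent, while the collider does not, matching the three-node example in the text.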
Variable orderings
Given a total ordering $\prec$, the likelihood decomposes into a product of terms, one per family, since the parents for each node can be chosen independently (there is no global acyclicity constraint). The following equation was first noted in [Bun91]:

$$P(D \mid \prec) = \sum_{G \in \mathcal{G}_\prec} \prod_{i=1}^{n} \mathrm{score}(X_i, \mathrm{Pa}_G(X_i) \mid D) = \prod_{i} \sum_{U \in \mathcal{U}_{\prec,i}} \mathrm{score}(X_i, U \mid D) \qquad \text{(C.11)}$$

$\mathcal{G}_\prec$ is the set of graphs consistent with the ordering $\prec$, and $\mathcal{U}_{\prec,i}$ is the set of legal parents for node $i$ consistent with $\prec$. If we bound the fan-in (number of parents) by $k$, each summation in Equation C.11 takes $\binom{n}{k} \le n^k$ time to compute, so the whole equation takes $O(n^{k+1})$ time.
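The exchange of sum and product in Equation C.11 can be checked numerically on a small problem. The sketch below uses a made-up positive score table (`make_toy_score` is purely hypothetical, standing in for a real family score) and compares the explicit sum over all graphs consistent with an ordering against the product of per-node sums.

```python
from itertools import combinations, product
import random

def parent_sets(order, i, k):
    """Legal parent sets for order[i]: subsets of its predecessors of size <= k."""
    preds = order[:i]
    return [frozenset(c)
            for r in range(min(k, len(preds)) + 1)
            for c in combinations(preds, r)]

def sum_over_graphs(order, k, score):
    """Left side of C.11: enumerate every graph consistent with the ordering."""
    choices = [parent_sets(order, i, k) for i in range(len(order))]
    total = 0.0
    for graph in product(*choices):  # one parent set per node
        term = 1.0
        for x, pa in zip(order, graph):
            term *= score(x, pa)
        total += term
    return total

def product_of_sums(order, k, score):
    """Right side of C.11: one independent sum per node."""
    result = 1.0
    for i, x in enumerate(order):
        result *= sum(score(x, pa) for pa in parent_sets(order, i, k))
    return result

def make_toy_score(seed):
    """Hypothetical family score: a fixed random positive number per family."""
    rng, table = random.Random(seed), {}
    def score(x, pa):
        if (x, pa) not in table:
            table[(x, pa)] = rng.uniform(0.1, 1.0)
        return table[(x, pa)]
    return score

score = make_toy_score(0)
order = ["A", "B", "C", "D"]
print(sum_over_graphs(order, 2, score), product_of_sums(order, 2, score))
```

The two printed values agree up to rounding, but the right-hand form touches only the per-node parent sets rather than every graph.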
Given an ordering, we can then find the best DAG consistent with that ordering using greedy selection, as in the K2 algorithm [CH92], or more sophisticated variable selection methods (see Section 6.1.2).
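A K2-style greedy selection can be sketched as follows. This is an illustration only: `score(x, parents)` is a user-supplied family score (K2 itself uses a Bayesian score), and the toy score in the usage lines is invented so that the example is self-contained.

```python
def k2_parents(order, score, max_parents):
    """Greedily pick parents for each node from its predecessors in the ordering."""
    parents = {}
    for i, x in enumerate(order):
        chosen = set()
        best = score(x, frozenset(chosen))
        candidates = set(order[:i])  # only predecessors are legal parents
        while candidates and len(chosen) < max_parents:
            # Score every single-parent extension of the current parent set.
            trial = {c: score(x, frozenset(chosen | {c})) for c in candidates}
            c_best = max(trial, key=trial.get)
            if trial[c_best] <= best:
                break  # no single addition improves the family score
            chosen.add(c_best)
            candidates.remove(c_best)
            best = trial[c_best]
        parents[x] = chosen
    return parents

# Hypothetical score: penalize disagreement with a known "true" structure.
truth = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
toy_score = lambda x, pa: -len(set(pa) ^ truth[x])
print(k2_parents(["A", "B", "C"], toy_score, max_parents=2))
```

Because each node's parents are chosen independently given the ordering, the loop over nodes could be run in any order or in parallel.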
If the ordering is unknown, we can search for it, e.g., using MCMC [FK00] (this is an example of Rao-Blackwellisation). Not surprisingly, [FK00] report that this mixes much faster than MCMC over DAGs (see Section C.7). However, the space of orderings has size $N!$, which is still huge.
Interestingly, interventions give us some hints about the ordering. For example, in a biological setting, if we knock out gene $X_1$, and notice that genes $X_2$ and $X_3$ change from their “wildtype” state, but genes $X_4$ and $X_5$ do not, it suggests that $X_1$ is an ancestor of $X_2$ and $X_3$. This heuristic, together with a set covering algorithm, was used to learn acyclic boolean networks (i.e., binary, deterministic Bayes nets) from interventional data [ITK00].
Undirected graphs
When searching for undirected graphs, it is common to restrict attention to decomposable undirected graphs, so that the parameters of the resulting model can be estimated efficiently. See [DGJ01] for a stepwise selection approach, and [GGT00] for an MCMC approach.
Choose G somehow
While not converged
    For each G' in nbd(G)
        Compute score(G')
    G* := arg max_{G'} score(G')
    If score(G*) > score(G)
        then G := G*
        else converged := true
Figure C.3: Pseudo-code for hill-climbing. nbd(G) is the neighborhood of G, i.e., the models that can be reached by applying a single local change operator.
C.6.2 Search algorithm
For local search (whether deterministic or stochastic), the operators that move through space are usually adding, deleting or reversing a single arc; this defines the neighborhood of a graph, nbd(G). ([KC01] consider a richer set of local DAG transformations that gives better results.) In addition, we must specify a starting point (initial graph); this could be chosen using the PC algorithm (see below). As an example of a local search algorithm, the code for hill climbing is shown in Figure C.3.
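The hill-climbing loop of Figure C.3, with the add/delete/reverse neighborhood, might look as follows in Python. This is a sketch under simple assumptions: a graph is a `frozenset` of `(parent, child)` arcs, `score` is any user-supplied function of the arc set, and acyclicity is checked by a plain depth-first search rather than anything more efficient.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Depth-first search for a directed cycle."""
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on the DFS stack, 2 = finished
    def visit(n):
        state[n] = 1
        for c in children[n]:
            if state[c] == 1 or (state[c] == 0 and not visit(c)):
                return False  # reached a node on the stack: back edge, hence a cycle
        state[n] = 2
        return True
    return all(state[n] != 0 or visit(n) for n in nodes)

def neighbors(nodes, edges):
    """Graphs reachable by adding, deleting or reversing a single arc."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                 # delete u -> v
            yield (edges - {(u, v)}) | {(v, u)}    # reverse u -> v
        elif (v, u) not in edges:
            yield edges | {(u, v)}                 # add u -> v

def hill_climb(nodes, score, start=frozenset()):
    """Greedy ascent: move to the best acyclic neighbor until none improves."""
    g = frozenset(start)
    while True:
        cands = [h for h in neighbors(nodes, g) if is_acyclic(nodes, h)]
        best = max(cands, key=score, default=g)
        if score(best) > score(g):
            g = best
        else:
            return g  # converged: local maximum of the score

# Hypothetical score: negative distance to a fixed target structure.
target = frozenset({("A", "B"), ("B", "C")})
print(sorted(hill_climb(["A", "B", "C"], lambda g: -len(g ^ target))))
```

With a real scoring function the loop is the same; only `score` changes, and in practice it would be computed incrementally rather than from scratch for every neighbor.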
When we make a local change to a model, we would like the change in its score to be local (see Section C.6.3); that way, evaluating the cost of many neighbors is fast. Similarly, if the current graph has a certain required property (e.g., acyclicity), we would like the cost of checking if each of the neighbors has this property to be constant time (i.e., independent of the model size). [GC01] present a method for checking acyclicity in constant time by using an auxiliary data structure called the ancestor matrix, and [GGT00] give efficient ways to check for decomposability of undirected graphs.
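To illustrate the idea of an ancestor matrix, here is a minimal sketch. It is not necessarily the data structure of [GC01], whose details are not given here: we simply assume that each node stores its current set of ancestors, so that testing whether a new arc would close a cycle is a single lookup. Only arc additions are handled; deletions would need extra bookkeeping.

```python
class AncestorMatrix:
    """Tracks, for every node, the set of its ancestors (illustrative only)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.anc = {n: set() for n in self.nodes}  # anc[v] = ancestors of v

    def creates_cycle(self, u, v):
        """Would adding u -> v create a directed cycle? One membership test."""
        return u == v or v in self.anc[u]

    def add_edge(self, u, v):
        """Add u -> v and propagate u's ancestry to v and v's descendants."""
        if self.creates_cycle(u, v):
            raise ValueError("arc would create a cycle")
        gained = self.anc[u] | {u}
        for w in self.nodes:
            if w == v or v in self.anc[w]:  # w is v or a descendant of v
                self.anc[w] |= gained

am = AncestorMatrix("ABC")
am.add_edge("A", "B")
am.add_edge("B", "C")
print(am.creates_cycle("C", "A"), am.creates_cycle("A", "C"))
```

The check itself does not depend on the size of the graph; the cost is moved into the update, which here is linear in the number of nodes per added arc.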
Global search comes in two flavors: stochastic local search, where we allow “downhill” moves (e.g., MCMC, which includes simulated annealing as a special case), and search algorithms that make non-local changes such as genetic algorithms.
The PC algorithm
We can find the globally optimal PDAG in $O(N^{k+1} N_{train})$ time, where there are $N$ nodes, $N_{train}$ data cases, and each node has at most $k$ neighbors, using the PC algorithm [SGS00, p84]. (This is an extension of the IC algorithm [PV91, Pea00], which takes $O(N^N N_{train})$ time.) This algorithm, an instance of the “constraint based approach”, works as follows: start with a fully connected undirected graph, and remove an arc between $X$ and $Y$ if there is some set of nodes $S$ such that $X \perp Y \mid S$ (we search for such separating subsets in increasing order of size); at the end, we can orient some of the undirected edges, so that we recover all the v-structures in the PDAG.
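The two phases just described (edge removal, then v-structure orientation) can be sketched as below. This is a simplified illustration, not the full PC algorithm of [SGS00]: `indep(x, y, S)` is a hypothetical conditional-independence oracle supplied by the caller, and only v-structures are oriented.

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """Remove X - Y whenever some subset S of X's other neighbors separates them."""
    adj = {x: set(nodes) - {x} for x in nodes}  # start fully connected
    sepset = {}
    for size in range(len(nodes) - 1):  # separating sets of increasing size
        for x in nodes:
            for y in list(adj[x]):
                for s in combinations(sorted(adj[x] - {y}), size):
                    if indep(x, y, set(s)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(s)
                        break
    return adj, sepset

def orient_v_structures(adj, sepset):
    """Orient X -> Y <- Z when X, Z are non-adjacent and Y did not separate them."""
    arrows = set()
    for y in adj:
        for x, z in combinations(sorted(adj[y]), 2):
            if z not in adj[x] and y not in sepset[frozenset((x, z))]:
                arrows.update({(x, y), (z, y)})
    return arrows

# Oracle for the collider X -> Y <- Z: X and Z are independent only if Y is not given.
collider_oracle = lambda x, y, s: {x, y} == {"X", "Z"} and "Y" not in s
adj, sep = pc_skeleton(["X", "Y", "Z"], collider_oracle)
print(adj, orient_v_structures(adj, sep))
```

With real data the oracle would be replaced by a statistical test, and its errors would propagate into both the skeleton and the orientations.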
The PC algorithm will provably recover the generating PDAG if the conditional independency (CI) tests are all correct. For continuous data, we can implement the CI test using Fisher’s $z$ test; for discrete data, we can use a $\chi^2$ test [SGS00, p95]. Testing whether $X \perp Y \mid S$ for discrete random variables with $K$ values each requires creating a table with $O(K^{|S|+2})$ entries, which requires a lot of time and samples. This is one of the main drawbacks of the PC algorithm. [CGK+02] contains a more efficient algorithm. The other difficulty with the PC algorithm is how to implement CI tests on non-Gaussian data (e.g., mixed discrete-continuous). (The analogous problem for the search and score methods is how to define the scoring function for complex CPDs.) One approach is discussed in [MT01]. A way of converting a Bayesian scoring metric into a CI test is given in the appendix of [Coo97].
A more sophisticated version of the PC algorithm, called FCI (fast causal inference), can handle the case where there is confounding due to latent common causes. However, FCI cannot handle models such as the one in Figure C.7, where there is a non-root latent variable. (The ability to handle arbitrary latent-variable models is one of the main strengths of the search and score techniques.)
C.6.3 Scoring function
If the search space is restricted (e.g., to trees), maximum likelihood is an adequate criterion. However, if the search space is any possible graph, then maximum likelihood would choose the fully connected (complete) graph, since this has the greatest number of parameters, and hence can achieve the highest likelihood. In such a case we will need to use other scoring metrics, which we discuss below.
A well-principled way to avoid this kind of over-fitting is to put a prior on models. By Bayes’ rule, the MAP model is the one that maximizes
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$
where $P(D)$ is a constant independent of the model. If the prior probability is higher for simpler models (e.g., ones which are “sparser” in some sense), the $P(G)$ term has the effect of penalizing complex models.
Interestingly, it is not necessary to explicitly penalize complex structures through the structural prior. The marginal likelihood (sometimes called the evidence),
$$P(D \mid G) = \int_{\theta} P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta$$
automatically penalizes more complex structures, because they have more parameters, and hence cannot give as much probability mass to the region of space where the data actually lies, because of the sum-to-one constraint. In other words, a complex model is more likely to be “right” by chance, and is therefore less believable. This phenomenon is called Ockham’s razor (see e.g., [Mac95]). Of course, we can combine the marginal likelihood with a structural prior to get
$$\mathrm{score}(G) \stackrel{\mathrm{def}}{=} P(D \mid G)\, P(G)$$
If we assume all the parameters are independent, the marginal likelihood decomposes into a product of local terms, one per node:
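$$P(D \mid G) = \prod_{i=1}^{n} \int_{\theta_i} P(X_i \mid \mathrm{Pa}(X_i), \theta_i)\, P(\theta_i)\, d\theta_i \stackrel{\mathrm{def}}{=} \prod_{i=1}^{n} \mathrm{score}(\mathrm{Pa}(X_i), X_i)$$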
Under certain assumptions (global and local parameter independence (see Section C.3.1) plus conjugate priors), each of these integrals can be performed in closed form, so the marginal likelihood can be computed very efficiently. For example, in Section C.6.3, we discuss the case of multinomial CPDs with Dirichlet priors. See [GH94] for the case of linear-Gaussian CPDs with Normal-Wishart priors, and [BS94, Bun94] for a discussion of the general case.
If the priors are not conjugate, one can try to approximate the marginal likelihood. For example, [Hec98] shows that a Laplace approximation to the parameter posterior has the form
$$\log P(D \mid G) \approx \log P(D \mid G, \hat{\theta}_G) - \frac{d}{2} \log M$$
where $M$ is the number of samples, $\hat{\theta}_G$ is the ML estimate of the parameters and $d$ is the dimension (number of free parameters) of the model. This is called the Bayesian Information Criterion (BIC), and is equivalent to the Minimum Description Length (MDL) approach. The first term is just the log-likelihood and the second term is a penalty for model complexity. (Note that the BIC score is independent of the parameter prior.) Being a log score, the BIC score decomposes into a sum of local terms, one per node. For example, for multinomials, we have (following Equation C.1)
$$\text{BIC-score}(G) = \sum_{i} \left[ \sum_{m} \log P(X_i \mid \mathrm{Pa}(X_i), \hat{\theta}_i, D_m) - \frac{d_i}{2} \log M \right] = \sum_{i} \left[ \sum_{jk} N_{ijk} \log \hat{\theta}_{ijk} - \frac{d_i}{2} \log M \right]$$
where $N_{ijk}$ is the number of cases in which $X_i = k$ and $\Pi_i = j$, and $d_i = q_i (r_i - 1)$ is the number of parameters in $X_i$’s CPT.
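As a concrete illustration of the per-node BIC term for multinomial CPDs, the sketch below computes the log-likelihood at the ML estimate (relative frequencies) minus the $\frac{d_i}{2} \log M$ penalty from a table of counts. The data layout (`counts[j][k]` for parent configuration $j$ and child value $k$) and the function names are assumptions made for this example.

```python
from math import log

def bic_family(counts, M):
    """BIC contribution of one node; counts[j][k] = N_ijk, M = number of cases."""
    q, r = len(counts), len(counts[0])  # parent configurations, child values
    loglik = 0.0
    for row in counts:
        n_j = sum(row)
        for n_jk in row:
            if n_jk > 0:  # 0 * log 0 is taken as 0
                loglik += n_jk * log(n_jk / n_j)  # ML estimate: N_ijk / N_ij
    d = q * (r - 1)  # free parameters in the CPT
    return loglik - 0.5 * d * log(M)

def bic_score(all_counts, M):
    """BIC of the whole graph: a sum of independent family terms."""
    return sum(bic_family(c, M) for c in all_counts)

# One binary node without parents, observed as 0, 0, 1, 1.
print(bic_family([[2, 2]], M=4))
```

For this tiny example the log-likelihood is $4 \log \frac{1}{2}$ and the penalty is $\frac{1}{2} \log 4$, so the score is $-5 \log 2$.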
A major advantage of the fact that the score decomposes is that graphs that only differ by a single link have marginal likelihoods that differ by at most two terms, since all the others cancel. For example, let $G_1$ be the chain $X_1 \to X_2 \to X_3 \to X_4$, and $G_2$ be the same but with the middle arc reversed: $X_1 \to X_2 \leftarrow X_3 \to X_4$. Then
$$\frac{P(D \mid G_2)}{P(D \mid G_1)} = \frac{\mathrm{score}(X_1)\, \mathrm{score}(X_1, X_2, X_3)\, \mathrm{score}(X_3)\, \mathrm{score}(X_3, X_4)}{\mathrm{score}(X_1)\, \mathrm{score}(X_1, X_2)\, \mathrm{score}(X_2, X_3)\, \mathrm{score}(X_3, X_4)} = \frac{\mathrm{score}(X_1, X_2, X_3)\, \mathrm{score}(X_3)}{\mathrm{score}(X_1, X_2)\, \mathrm{score}(X_2, X_3)}$$
In general, if an arc to $X_i$ is added or deleted, only $\mathrm{score}(X_i \mid \Pi_i)$ needs to be evaluated; if an arc between $X_i$ and $X_j$ is reversed, only $\mathrm{score}(X_i \mid \Pi_i)$ and $\mathrm{score}(X_j \mid \Pi_j)$ need to be evaluated. This is important since, in greedy search, we need to know the score of $O(n^2)$ neighbors at each step, but only $O(n)$ of these scores change if the steps change one edge at a time.
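In code, the locality of the score means that the effect of a reversal can be computed from two family terms alone. The sketch below works in the log domain (so products of scores become sums); `family_score(x, parents)` is a hypothetical user-supplied log family score, and graphs are stored as a dict from each node to its parent set.

```python
def total_score(parents, family_score):
    """Log score of the whole graph: one term per family."""
    return sum(family_score(x, frozenset(pa)) for x, pa in parents.items())

def delta_reverse(parents, u, v, family_score):
    """Change in log score from reversing the arc u -> v into v -> u.

    Only the families of u and v are re-scored; all other terms cancel.
    """
    old = (family_score(u, frozenset(parents[u]))
           + family_score(v, frozenset(parents[v])))
    new = (family_score(u, frozenset(parents[u] | {v}))   # u gains parent v
           + family_score(v, frozenset(parents[v] - {u})))  # v loses parent u
    return new - old

# The chain X1 -> X2 -> X3 -> X4 from the text, with an invented family score.
g1 = {"X1": set(), "X2": {"X1"}, "X3": {"X2"}, "X4": {"X3"}}
toy = lambda x, pa: -len(pa) ** 2 - 0.1 * int(x[1:]) * sum(int(p[1:]) for p in pa)
print(delta_reverse(g1, "X2", "X3", toy))
```

A search procedure can cache the family scores and apply such deltas instead of re-scoring every neighbor from scratch.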
The traditional limitation to local search in PDAG space has been the fact that the scoring function does not decompose into a set of local terms. This problem has recently been solved [Chi02].
Computing the marginal likelihood for the multinomial-Dirichlet
For the case of multinomial CPDs with Dirichlet priors, we discussed how to compute the marginal likelihood sequentially in Section C.3.1. To compute this in batch form, we simply compute the posterior means of the parameters (Equation C.3), and plug these expected values into the sample likelihood equation:
$$P(D \mid G) = P(D \mid \theta, G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}} \qquad \text{(C.13)}$$
Alternatively, this can be written as follows [CH92]:
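$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{B(\alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ijr_i} + N_{ijr_i})}{B(\alpha_{ij1}, \ldots, \alpha_{ijr_i})} = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \cdot \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})} \qquad \text{(C.14)}$$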
([HGC95] call this the Bayesian Dirichlet (BD) score.)
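Equation C.14 is usually evaluated in log space to avoid overflow in the Gamma functions. The sketch below does this for a single family using `math.lgamma`; the layout `counts[j][k]` $= N_{ijk}$ and `alpha[j][k]` $= \alpha_{ijk}$, and the function name, are assumptions of this example.

```python
from math import lgamma, exp

def log_bd_family(counts, alpha):
    """Log of one node's factor in Equation C.14.

    counts[j][k] = N_ijk and alpha[j][k] = alpha_ijk, for parent
    configuration j and child value k.
    """
    total = 0.0
    for n_row, a_row in zip(counts, alpha):
        a_ij, n_ij = sum(a_row), sum(n_row)
        total += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for n, a in zip(n_row, a_row):
            total += lgamma(a + n) - lgamma(a)
    return total

# One binary node, uniform Dirichlet(1, 1) prior, observed once as 0 and once as 1.
print(exp(log_bd_family([[1, 1]], [[1, 1]])))
```

The example evaluates to 1/6, which agrees with the sequential computation: the first case has predictive probability 1/2, and the second (a different value) has predictive probability 1/3. The log BD score of a whole graph is the sum of these terms over nodes.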
For interventional data, Equation C.14 is modified by defining $N_{ijk}$ to be the number of times $X_i = k$ is passively observed in the context $\Pi_i = j$, as shown by [CY99]. (The intuition is that setting $X_i = k$ does not tell us anything about how likely this event is to occur “by chance”, and hence should not be counted.) Hence, in addition to $D$, we need to keep a record of which variables were clamped (if any) in each data case.
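The modified counting rule can be sketched as follows, assuming each data case is a dict from variable names to values and that a parallel list records the set of clamped variables per case (both representations are choices made for this example).

```python
from collections import Counter

def family_counts(data, clamped, child, parents):
    """Counts N_ijk for one family, skipping cases where the child was clamped.

    data:    list of dicts mapping variable name -> observed or forced value
    clamped: list of sets, the variables set by intervention in each case
    Returns a Counter keyed by (parent configuration j, child value k).
    """
    counts = Counter()
    for case, forced in zip(data, clamped):
        if child in forced:
            continue  # the child's value was set, not passively observed
        j = tuple(case[p] for p in parents)
        counts[(j, case[child])] += 1
    return counts

data = [{"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 1, "Y": 0}]
clamped = [set(), {"Y"}, {"X"}]
print(family_counts(data, clamped, child="Y", parents=["X"]))
```

Note that a case in which only a parent was clamped still contributes to the child's counts, since the child's value was passively observed in that parent context.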