There are at least three kinds of independencies we can exploit to speed up exact inference. The first is conditional independence, which is reflected in the graph structure; we have already discussed how to exploit this. The second is causal independence (e.g., noisy-OR), and the third is context specific independence (CSI). We discuss these, and other tricks, below.
B.6.1
Exploiting causal independence
CPDs with causal independence were introduced in Section A.3.2. The canonical example is noisy-OR. Pearl [Pea88] showed how to exploit the structure of the noisy-OR CPD to compute the $\lambda$ and $\pi$ messages in time linear in the number of parents; [ZP96, RD98] showed how to exploit it in variable elimination; and [Hec89] showed how to exploit it in BN20 (binary node 2-level noisy-OR) networks such as QMR-DT using the Quickscore algorithm.
To exploit causal independence in the jtree algorithm, we have to make graphically explicit the local conditional independencies which are “hidden” inside the CPD. The two standard techniques for this are the temporal transformation [Hec93, HB94] and parent divorcing [OKJ+89]: see Figure B.18. We introduce extra hidden nodes $Y_i$ which accumulate a partial OR of the previous parents. In general, to benefit from such a transformation, we must eliminate parents before children; otherwise we end up creating a clique out of the original family. However, sometimes this is not the best ordering. See [RD98] for a discussion of this point; see also [ZY97, MD99].
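The effect of accumulating a partial OR can be illustrated with a small sketch. It is not taken from the cited papers: it assumes a noisy-OR with no leak term, parents that are marginally independent with known priors (playing the role of incoming $\pi$ messages), and hypothetical names `prior`, `q`, `marginal_brute_force` and `marginal_chain`. Under those assumptions, tracking only the probability that each partial OR $Y_i$ is off gives the same answer as summing over all parent configurations, at linear rather than exponential cost.

```python
import itertools

def noisy_or_p_on(x, q):
    """P(Y=1 | parents x) for a noisy-OR with no leak term.

    q[i] is the assumed probability that parent i, when on, fails to turn Y on.
    """
    p_off = 1.0
    for xi, qi in zip(x, q):
        if xi:
            p_off *= qi
    return 1.0 - p_off

def marginal_brute_force(prior, q):
    """P(Y=1) by summing over all 2^n parent configurations."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=len(prior)):
        p_x = 1.0
        for xi, pi in zip(x, prior):
            p_x *= pi if xi else 1.0 - pi
        total += p_x * noisy_or_p_on(x, q)
    return total

def marginal_chain(prior, q):
    """P(Y=1) via hidden partial-OR nodes, one per parent.

    Only P(Y_i = 0) is tracked, so the cost is linear in the number of parents.
    """
    p_yi_off = 1.0  # the empty partial OR is off
    for pi, qi in zip(prior, q):
        # Parent i leaves the partial OR off if it is off, or on but inhibited.
        p_yi_off *= (1.0 - pi) + pi * qi
    return 1.0 - p_yi_off

# Illustrative numbers only.
prior = [0.3, 0.6, 0.1, 0.8]   # P(X_i = 1)
q = [0.2, 0.5, 0.9, 0.4]       # inhibition probabilities
print(marginal_brute_force(prior, q), marginal_chain(prior, q))
```

The two functions agree up to floating-point rounding; the chain version corresponds to eliminating each parent before the partial-OR node it feeds.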
Interestingly, it is not always possible to exploit causal independence if we are doing max-product (Viterbi), as opposed to sum-product, because we must always sum out the dummy hidden nodes. The max and sum operators do not commute, so this restricts the possible elimination orderings (cf. Section B.3.6), which can eliminate any potential gains.
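That max and sum do not commute can be checked on a tiny example. The table `f` below is made up purely for illustration; it is indexed as `f[x][y]`.

```python
# A hypothetical 2x2 non-negative table f[x][y].
f = [[0.1, 0.6],
     [0.5, 0.2]]

# max over x of (sum over y): sum each row, then take the largest.
max_then_sum = max(sum(row) for row in f)

# sum over y of (max over x): take the largest entry in each column, then add.
sum_then_max = sum(max(f[x][y] for x in range(2)) for y in range(2))

print(max_then_sum, sum_then_max)
```

Here the first quantity is about 0.7 and the second about 1.1, so swapping the order of the two operators changes the result; this is why the variables to be summed out must be eliminated before those to be maximized.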
B.6.2
Exploiting context specific independence (CSI)
CPDs with CSI were introduced in Section A.3.2. The canonical example is a tree-CPD. Such CPDs can be exploited for inference by the jtree algorithm using a network transformation [BFGK96], as illustrated in Figure B.19. This is analogous to the transformations introduced to exploit causal independence (see Section B.6.1). More aggressive optimizations are possible if we use the variable elimination algorithm [Zha98b, ZP99]. A way of exploiting CSI in the jtree algorithm is discussed in [Pfe01].
Tree-CPDs have been widely used for many years and are especially popular in the decision making community; e.g., see [BDG01] for a review of the work on factored Markov decision processes (MDPs). [Kim01, ch5] discusses how to use trees to do “structured linear algebra”. Recently the MDP community has started to investigate algebraic decision diagrams (ADDs) [BFG+93], which are very computationally efficient ways of representing and manipulating real-valued functions of binary variables. These are more efficient than trees because they permit sharing of substructure; just as important, there is a fast, freely available program for manipulating them called CUDD. ADDs have also been used for speeding up exact inference in discrete Bayes nets [NWJK00].

[Figure B.19 graph: $X_1, X_2 \to Y_1$; $X_3, X_4 \to Y_2$; $Y_1, X_5, Y_2 \to Y$.]
Figure B.19: A tree-structured CPD for $P(Y|X_{1:5})$ represented in a more computationally useful form. Let all nodes be binary for simplicity. We assume $Y$ is conditionally independent of $X_{3:4}$ given that $X_5 = 1$, i.e., $P(Y|X_{1:4}, X_5 = 1) = P(Y_1|X_1, X_2) = f(Y, X_1, X_2)$. Similarly, assume $P(Y|X_{1:4}, X_5 = 2) = P(Y_2|X_3, X_4) = g(Y, X_3, X_4)$. Hence $X_5$ is a switching parent: $P(Y|Y_1, Y_2, X_5 = i) = \delta(Y, Y_i)$. This tree requires $2^3$ parameters, whereas a table would require $2^5$; furthermore, the clique size is reduced from 6 to 4. Obviously the benefits are (potentially) greater for nodes with more parents and for nodes with bigger domains.
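The parameter counts for the switching-parent CPD of Figure B.19 can be checked with a short sketch. The leaf tables `f` and `g` hold made-up values of $P(Y=1|\cdot)$, node values are coded as 1 and 2 to match the caption, and `expand_tree_cpd` is a hypothetical helper, not a routine from any of the cited systems.

```python
import itertools

def expand_tree_cpd(f, g):
    """Build the full table P(Y=1 | X1..X5) from the two leaf tables.

    f[(x1, x2)] applies when X5 == 1; g[(x3, x4)] applies when X5 == 2.
    """
    table = {}
    for x1, x2, x3, x4 in itertools.product([1, 2], repeat=4):
        for x5 in (1, 2):
            table[(x1, x2, x3, x4, x5)] = f[(x1, x2)] if x5 == 1 else g[(x3, x4)]
    return table

# Illustrative numbers only.
f = {(1, 1): 0.1, (1, 2): 0.4, (2, 1): 0.7, (2, 2): 0.9}
g = {(1, 1): 0.2, (1, 2): 0.3, (2, 1): 0.6, (2, 2): 0.8}

table = expand_tree_cpd(f, g)
n_tree_params = len(f) + len(g)   # 4 + 4 = 2^3
n_table_params = len(table)       # 2^5
print(n_tree_params, n_table_params)
```

The expanded table has 32 rows but only 8 distinct free parameters, and every row with $X_5 = 1$ ignores $X_3$ and $X_4$, which is exactly the context specific independence the tree encodes.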
B.6.3
Exploiting deterministic CPDs
Deterministic discrete CPDs are quite common, as we saw in the HHMM models in Chapter 2. (Multiplexer nodes are also deterministic.) This means the resulting CPD has lots of zeros. The technique of “zero compression” removes 0s from the potentials so that multiplication and marginalization can be performed more efficiently, without any loss in accuracy [JA90, HD96].
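One simple way to picture zero compression is to store each potential as a dictionary holding only its nonzero entries. The sketch below makes that assumption; it is not the data structure used in [JA90] or [HD96], and `multiply` and `marginalize` are hypothetical helpers. The example multiplies a prior on two binary nodes by a deterministic AND-gate CPD.

```python
def multiply(vars_a, pot_a, vars_b, pot_b):
    """Multiply two sparse potentials, touching only nonzero entries."""
    shared = [v for v in vars_a if v in vars_b]
    extra_b = [v for v in vars_b if v not in vars_a]
    ia = [vars_a.index(v) for v in shared]
    ib = [vars_b.index(v) for v in shared]
    ie = [vars_b.index(v) for v in extra_b]
    # Index the entries of pot_b by their values on the shared variables.
    index = {}
    for kb, vb in pot_b.items():
        index.setdefault(tuple(kb[i] for i in ib), []).append((kb, vb))
    out = {}
    for ka, va in pot_a.items():
        for kb, vb in index.get(tuple(ka[i] for i in ia), []):
            out[ka + tuple(kb[i] for i in ie)] = va * vb
    return tuple(vars_a) + tuple(extra_b), out

def marginalize(vars_, pot, keep):
    """Sum a sparse potential down to the variables in `keep`."""
    idx = [vars_.index(v) for v in keep]
    out = {}
    for k, v in pot.items():
        kk = tuple(k[i] for i in idx)
        out[kk] = out.get(kk, 0.0) + v
    return tuple(keep), out

# Prior over (A, B) with P(A=1)=0.3, P(B=1)=0.4, assumed independent.
prior_vars = ("A", "B")
prior = {(0, 0): 0.42, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.12}

# Deterministic CPD C = A AND B: only 4 of the 8 entries are nonzero.
and_vars = ("A", "B", "C")
and_cpd = {(0, 0, 0): 1.0, (0, 1, 0): 1.0, (1, 0, 0): 1.0, (1, 1, 1): 1.0}

joint_vars, joint = multiply(prior_vars, prior, and_vars, and_cpd)
c_vars, p_c = marginalize(joint_vars, joint, ["C"])
print(len(joint), p_c)
```

The joint potential over (A, B, C) has 4 stored entries instead of 8, and the marginal on C matches what a dense table would give, so nothing is lost by dropping the zeros.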
[Zwe98] develops a technique for exploiting deterministic CPDs in Pearl’s directed message passing algorithm on a jtree. Specifically, he requires that the jtree satisfy what he calls the immediate resolution property (IRP), which says that, in a preorder traversal of the tree (root to leaves), the first time a deterministic node appears in a clique, its parents must also be present. This can be ensured by eliminating parents before children, i.e., in some total ordering of the DAG: see Section B.3.6. The advantage of this is that it is possible to detect which elements of $P(C_c|C_p)$ will be zero, where $C_c$ is the child clique and $C_p$ is its parent (one or the other of these can be separators). It is not clear if this technique is relevant to the undirected form of message passing, which does not require computation of terms like $P(C_c|C_p)$. In particular, it seems that using sparse potentials should achieve the same effect.
B.6.4
Exploiting the evidence
Sometimes, with deterministic CPDs, evidence on the child or one or more parents can render the remaining family members “effectively observed”, e.g., in an OR-gate, if the child is off, all the parents must be off, or if any of the parents is on, the child must be on. Such constraints can be exploited by running arc consistency before probabilistic message passing, to reduce the effective domain size of each node. This technique is widely used in genetic linkage analysis [FGL00].
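The OR-gate example can be sketched as a small domain-pruning routine. This is a brute-force illustration that enumerates each gate's family, not the procedure used in [FGL00]; `prune_or_gates` and the node names are hypothetical.

```python
import itertools

def prune_or_gates(domains, gates):
    """Remove values with no support under the constraints child == OR(parents).

    domains: dict name -> set of allowed values in {0, 1}
             (observed nodes have singleton domains).
    gates:   list of (child, [parents]) pairs.
    Repeats until no domain changes.
    """
    domains = {v: set(d) for v, d in domains.items()}
    changed = True
    while changed:
        changed = False
        for child, parents in gates:
            scope = [child] + list(parents)
            # Joint assignments allowed by the current domains and the gate.
            support = [
                combo
                for combo in itertools.product(*(sorted(domains[v]) for v in scope))
                if combo[0] == int(any(combo[1:]))
            ]
            for i, v in enumerate(scope):
                allowed = {combo[i] for combo in support}
                if allowed != domains[v]:
                    domains[v] = allowed
                    changed = True
    return domains

gates = [("Y", ["A", "B"])]

# Child observed off: both parents must be off.
child_off = prune_or_gates({"Y": {0}, "A": {0, 1}, "B": {0, 1}}, gates)

# One parent observed on: the child must be on; the other parent is unconstrained.
parent_on = prune_or_gates({"Y": {0, 1}, "A": {1}, "B": {0, 1}}, gates)
print(child_off, parent_on)
```

After pruning, any node left with a singleton domain can be treated as observed during probabilistic message passing.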
B.6.5
Being lazy
In general there is a tradeoff between being lazy, i.e., waiting until the query, evidence, structure and parameters have all been specified, and being eager, i.e., precomputing as much as possible as early as possible, so the cost can be amortized across many future operations (e.g., we would not want to create a jtree from scratch every time the evidence or query changed). Usually the jtree is constructed based only on the graph structure. However, by constructing the jtree later in the pipeline, we can take advantage of the following kinds of information:
• If we know how many values each node can take on, we know how “heavy” the cliques will be; this will affect the search for the elimination ordering.
• If we know what kinds of CPD each node has, we can perform network transformations if necessary. Also, we know if we need to construct a strong jtree or not.
• If we know which nodes will be observed, it will affect how many values each node can have, the kinds of CPDs it can support, whether we need to build a strong jtree, etc.
• If we know the parameter values, we might be able to perform optimizations such as zero compression.
• If we know which nodes will be queried and which nodes will be observed, we can figure out which nodes are relevant for answering the query.
• If we know the values of the evidence nodes, we might be able to do some pre-processing (e.g., constraint satisfaction) before running probabilistic inference.
Most inference routines choose an elimination ordering based on the first two or three pieces of information. Variable elimination can easily exploit the remaining information, as can lazy Hugin [MJ99]. Query-DAGs [DP97] are precompiled structures for answering specific queries, and can be highly optimized.
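As one concrete instance of using the query and evidence sets, the sketch below keeps only the query and evidence nodes and their ancestors. Every other node is barren (it sums to one once its own descendants are removed) and can be dropped before inference. This is only the simplest relevance test; it does not use d-separation to prune further. The function `relevant_nodes` and the example graph are hypothetical.

```python
def relevant_nodes(parents, query, evidence):
    """Return the query and evidence nodes together with all their ancestors.

    parents: dict node -> list of parent nodes (roots may be omitted).
    """
    keep = set()
    stack = list(query) + list(evidence)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, []))
    return keep

# Example DAG: A -> B -> C, B -> D, E -> D.
parents = {"B": ["A"], "C": ["B"], "D": ["B", "E"]}

kept = relevant_nodes(parents, query=["B"], evidence=["C"])
print(sorted(kept))
```

For the query B with evidence on C, nodes D and E are discarded, so a lazy inference routine would only need to build its elimination ordering over A, B and C.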