1.3 Large Instances / Bayesian Networks
1.3.2 The Markov Condition
First let’s review some graph theory. Recall that a directed graphis a pair
(V,E), whereV is a finite, nonempty set whose elements are callednodes (or vertices), andEis a set of ordered pairs of distinct elements ofV. Elements of E are called edges (or arcs), and if(X, Y)∈E, we say there is an edge from
X to Y and thatX and Y are each incidentto the edge. If there is an edge fromX toY or fromY toX, we sayXandY areadjacent. Suppose we have a set of nodes [X1, X2, . . . Xk], wherek≥2, such(Xi−1, Xi)∈Efor2≤i≤k. We call the set of edges connecting the k nodes a path from X1 to Xk. The nodes X2, . . . Xk−1 are called interior nodes on path [X1, X2, . . . Xk]. The subpath of path [X1, X2, . . . Xk] fromXi to Xj is the path [Xi, Xi+1, . . . Xj] where 1≤ i < j ≤ k. Adirected cycle is a path from a node to itself. A simple path is a path containing no subpaths which are directed cycles. A directed graph G is called a directed acyclic graph (DAG) if it contains no directed cycles. Given a DAGG= (V,E)and nodesX andY inV,Y is called aparent ofX if there is an edge fromY to X, Y is called a descendentof
X andX is called anancestorofY if there is a path fromX toY, and Y is called anondescendentofXifY is not a descendent ofX. Note that in this text X is not considered a descendent of X because we require k ≥ 2in the definition of a path. Some texts say there is an empty path fromX toX.
We can now state the following definition:
Definition 1.9 Suppose we have a joint probability distribution P of the ran- dom variables in some setVand a DAGG= (V,E). We say that(G, P)satisfies
the Markov condition if for each variable X ∈ V, {X} is conditionally in-
dependent of the set of all its nondescendents given the set of all its parents. Using the notation established in Section 1.1.4, this means if we denote the sets of parents and nondescendents of X byPAX andNDX respectively, then
IP({X},NDX|PAX).
When (G, P) satisfies the Markov condition, we say G and P satisfy the Markov condition with each other.
IfXis a root, then its parent setPAX is empty. So in this case the Markov condition means {X}is independent of NDX. That is, IP({X},NDX). It is not hard to show that IP({X},NDX|PAX) implies IP({X},B|PAX) for any B⊆NDX. It is left as an exercise to do this. Notice that PAX⊆NDX. So we could define the Markov condition by saying thatX must be conditionally
(a) (c) (d) V S C V S C V C S C V S (b)
Figure 1.3: The probability distribution in Example 1.25 satisfies the Markov condition only for the DAGs in (a), (b), and (c).
independent ofNDX−PAX givenPAX. However, it is standard to define it as above. When discussing the Markov condition relative to a particular distri- bution and DAG (as in the following examples), we just show the conditional independence ofXandNDX−PAX.
Example 1.25 Let Ω be the set of objects in Figure 1.2, and let P assign a probability of 1/13 to each object. Let random variables V, S, and C be as defined as in Example 1.19. That is, they are defined as follows:
Variable Value Outcomes Mapped to this Value
V v1 All objects containing a ‘1’
v2 All objects containing a ‘2’
S s1 All square objects
s2 All round objects
C c1 All black objects
1.3. LARGE INSTANCES / BAYESIAN NETWORKS 33 H B F L C
Figure 1.4: A DAG illustrating the Markov condition
Then, as shown in Example 1.19,IP({V},{S}|{C}). Therefore,(G, P)satisfies
the Markov condition ifGis the DAG in Figure 1.3 (a), (b), or (c). However,
(G, P)does not satisfy the Markov condition ifGis the DAG in Figure 1.3 (d) because IP({V},{S}) is not the case.
Example 1.26 Consider the DAG G in Figure 1.4. If (G, P) satisfied the Markov condition for some probability distributionP, we would have the follow- ing conditional independencies:
Node PA Conditional Independency
C {L} IP({C},{H, B, F}|{L})
B {H} IP({B},{L, C}|{H})
F {B, L} IP({F},{H, C}|{B, L})
L {H} IP({L},{B}|{H})
Recall from Section 1.3.1 that the number of terms in a joint probability distribution is exponential in terms of the number of variables. So, in the case of a large instance, we could not fully describe the joint distribution by determining each of its values directly. Herein lies one of the powers of the Markov condition. Theorem 1.4, which follows shortly, shows if(G, P)satisfies the Markov condition, then P equals the product of its conditional probability distributions of all nodes given values of their parents in G, whenever these conditional distributions exist. After proving this theorem, we discuss how this means we often need ascertain far fewer values than if we had to determine all values in the joint distribution directly. Before proving it, we illustrate what it means for a joint distribution to equal the product of its conditional distributions of all nodes given values of their parents in a DAGG. This would be the case
for a joint probability distributionP of the variables in the DAG in Figure 1.4 if, for all values off,c, b,l, andh,
P(f, c, b, l, h) =P(f|b, l)P(c|l)P(b|h)P(l|h)P(h), (1.6) whenever the conditional probabilities on the right exist. Notice that if one of them does not exist for some combination of the values of the variables, then
P(b, l) = 0or P(l) = 0or P(h) = 0, which implies P(f, c, b, l, h) = 0 for that combination of values.However, there are cases in whichP(f, c, b, l, h) = 0 and the conditional probabilities still exist. For example, this would be the case if all the conditional probabilities on the right existed andP(f|b, l) = 0for some combination of values off, b, andl. So Equality 1.6 must hold for all nonzero values of the joint probability distribution plus some zero values.
We now give the theorem.
Theorem 1.4 If (G, P) satisfies the Markov condition, then P is equal to the product of its conditional distributions of all nodes given values of their parents, whenever these conditional distributions exist.
Proof. We prove the case whereP is discrete. Order the nodes so that if Y is
a descendent ofZ, thenY followsZ in the ordering. Such an ordering is called
anancestral ordering. Examples of such an ordering for the DAG in Figure
1.4 are [H, L, B, C, F] and [H, B, L, F, C]. Let X1, X2, . . . Xn be the resultant
ordering. For a given set of values of x1, x2, . . . xn, let pai be the subset of
these values containing the values ofXi’s parents. We need show that whenever
P(pai)6= 0 for1≤i≤n,
P(xn, xn−1, . . . x1) =P(xn|pan)P(xn−1|pan−1)· · ·P(x1|pa1).
We show this using induction on the number of variables in the network. As- sume, for some combination of values of thexi’s,thatP(pai)6= 0for1≤i≤n.
induction base: SincePA1 is empty,
P(x1) =P(x1|pa1).
induction hypothesis: Suppose for this combination of values of thexi’s that
P(xi, xi−1, . . . x1) =P(xi|pai)P(xi−1|pai−1)· · ·P(x1|pa1).
induction step: We need show for this combination of values of thexi’s that
P(xi+1, xi, . . . x1) =P(xi+1|pai+1)P(xi|pai)· · ·P(x1|pa1). (1.7)
There are two cases:
Case 1: For this combination of values
1.3. LARGE INSTANCES / BAYESIAN NETWORKS 35
Clearly, Equality 1.8 implies
P(xi+1, xi, . . . x1) = 0.
Furthermore, due to Equality 1.8 and the induction hypothesis, there is somek, where1≤k≤i, such that P(xk|pak) = 0. So Equality 1.7 holds.
Case 2: For this combination of values
P(xi, xi−1, . . . x1)6= 0.
In this case,
P(xi+1, xi, . . . x1) = P(xi+1|xi, . . . x1)P(xi, . . . x1)
= P(xi+1|pai+1)P(xi, . . . x1)
= P(xi+1|pai+1)P(xi|pai)· · ·P(x1|pa1).
Thefirst equality is due to the rule for conditional probability, the second is due to the Markov condition and the fact that X1, . . . Xi are all nondescendents of
Xi+1, and the last is due to the induction hypothesis.
Example 1.27 Recall that the joint probability distribution in Example 1.25 satisfies the Markov condition with the DAG in Figure 1.3 (a). Therefore, owing to Theorem 1.4,
P(v, s, c) =P(v|c)P(s|c)p(c), (1.9)
and we need only determine the conditional distributions on the right in Equality 1.9 to uniquely determine the values in the joint distribution. We illustrate that this is the case for v1,s1, andc1:
P(v1, s1, c1) =P(One∩Square∩Black) = 2 13
P(v1|c1)P(s1|c1)P(c1) = P(One|Black)×P(Square|Black)×P(Black) = 1 3× 2 3× 9 13 = 2 13.
Figure 1.5 shows the DAG along with the conditional distributions.
The joint probability distribution in Example 1.25 also satisfies the Markov condition with the DAGs in Figures 1.3 (b) and (c). Therefore, the probability distribution in that example equals the product of the conditional distributions for each of them. You should verify this directly.
If the DAG in Figure 1.3 (d) and some probability distributionP satisfied the Markov condition, Theorem 1.4 would imply
P(v, s, c) =P(c|v, s)P(v)p(s).
P(c1) = 9/13 P(c2) = 4/13 P(v1|c1) = 1/3 P(v2|c1) = 2/3 P(v1|c2) = 1/2 P(v2|c2) = 1/2 P(s1|c1) = 2/3 P(s2|c1) = 1/3 P(s1|c2) = 1/2 P(s2|c2) = 1/2 V C S
Figure 1.5: The probability distribution discussed in Example 1.27 is equal to the product of these conditional distributions.
Theorem 1.4 often enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few. The num- ber of values in the joint distribution is exponential in terms of the number of variables. However, each of these values is uniquely determined by the condi- tional distributions (due to the theorem), and, if each node in the DAG does not have too many children, there are not many values in these distributions. For example, if each variable has two possible values and each node has at most one parent, we would need to ascertain less than2n probability values to de- termine the conditional distributions when the DAG containsn nodes. On the other hand, we would need to ascertain 2n−1 values to determine the joint probability distribution directly. In general, if each variable has two possible values and each node has at mostkparents, we need to ascertain less than2kn values to determine the conditional distributions. So if kis not large, we have a manageable number of values.
Something may seem amiss to you. Namely, in Example 1.25, we started with an underlying sample space and probability function, specified some ran- dom variables, and showed that if P is the probability distribution of these variables andGis the DAG in Figure 1.3 (a), then(P,G) satisfies the Markov condition. We can therefore apply Theorem 1.4 to conclude we need only de- termine the conditional distributions of the variables for that DAG tofind any value in the joint distribution. We illustrated this in Example 1.27. How- ever, as discussed in Section 1.2, in application we do not ordinarily specify an underlying sample space and probability function from which we can com- pute conditional distributions. Rather we identify random variables and values in conditional distributions directly. For example, in an application involv- ing the diagnosis of lung cancer, we identify variables like SmokingHistory,
1.3. LARGE INSTANCES / BAYESIAN NETWORKS 37
yes),P(LungCancer=present|SmokingHistory=yes), andP(ChestXray=
positive| LungCancer=present). How do we know the product of these con- ditional distributions is a joint distribution at all, much less one satisfying the Markov condition with some DAG? Theorem 1.4 tells us only that if we start with a joint distribution satisfying the Markov condition with some DAG, the values in that joint distribution will be given by the product of the condi- tional distributions. However, we must work in reverse. We must start with the conditional distributions and then be able to conclude the product of these distributions is a joint distribution satisfying the Markov condition with some DAG. The theorem that follows enables us to do just that.
Theorem 1.5 Let a DAGGbe given in which each node is a random variable, and let a discrete conditional probability distribution of each node given values of its parents inGbe specified. Then the product of these conditional distributions yields a joint probability distributionP of the variables, and(G, P)satisfies the Markov condition.
Proof. Order the nodes according to an ancestral ordering. Let X1, X2, . . . Xn
be the resultant ordering. Next define
P(x1, x2, . . . xn) =P(xn|pan)P(xn−1|pan−1)· · ·P(x2|pa2)P(x1|pa1),
where PAi is the set of parents of Xi of in G and P(xi|pai) is the specified
conditional probability distribution. First we show this does indeed yield a joint probability distribution. Clearly, 0≤P(x1, x2, . . .xn)≤1 for all values of the
variables. Therefore, to show we have a joint distribution, Definition 1.8 and Theorem 1.3 imply we need only show that the sum of P(x1, x2, . . . xn), as the
variables range through all their possible values, is equal to one. To that end, X x1 X x2 . . .X xn−1 X xn P(x1, x2, . . .xn) = X x1 X x2 · · ·X xn−1 X xn P(xn|pan)P(xn−1|pan−1)· · ·P(x2|pa2)P(x1|pa1) = X x1 X x2 · · ·X xn−1 " X xn P(xn|pan) # P(xn−1|pan−1)· · · P(x2|pa2) P(x1|pa1) = X x1 X x2 · · ·X xn−1 [1]P(xn−1|pan−1)· · · P(x2|pa2) P(x1|pa1) = X x1 " X x2 [· · ·1· · ·]P(x2|pa2) # P(x1|pa1) = X x1 [1]P(x1|pa1) = 1.
It is left as an exercise to show that the specified conditional distributions are the conditional distributions they notationally represent in the joint distribution.
Finally, we show the Markov condition is satisfied. To do this, we need show for1≤k≤nthat wheneverP(pak)6= 0, if P(ndk|pak)6= 0 andP(xk|pak)6= 0
then P(xk|ndk,pak) = P(xk|pak), where NDk is the set of nondescendents of
Xk of inG. SincePAk ⊆NDk, we need only show P(xk|ndk) =P(xk|pak).
First for a given k, order the nodes so that all and only nondescendents of Xk precede Xk in the ordering. Note that this ordering depends onk, whereas
the ordering in thefirst part of the proof does not. Clearly then
NDk ={X1, X2, . . . Xk−1}.
Let
Dk ={Xk+1, Xk+2, . . . Xn}.
In what follows, X
dk
means the sum as the variables in dk go through all their possible values. Furthermore, notation such asˆxk means the variable has
a particular value; notation such as ndˆ k means all variables in the set have
particular values; and notation such as pan means some variables in the set may not have particular values. We have that
P(ˆxk|ˆndk) = P(ˆxk,ˆndk) P(ˆndk) = X dk P(ˆx1,xˆ2, . . .xˆk, xk+1, . . .xn) X dk∪{xk} P(ˆx1,xˆ2, . . .xˆk−1, xk, . . .xn) = X dk P(xn|pan)· · ·P(xk+1|pak+1)P(ˆxk|ˆpak)· · ·P(ˆx1|paˆ 1) X dk∪{xk} P(xn|pan)· · ·P(xk|pak)P(ˆxk−1|ˆpak−1)· · ·P(ˆx1|ˆpa1) = P(ˆxk|ˆpak)· · ·P(ˆx1|ˆpa1) X dk P(xn|pan)· · ·P(xk+1|pak+1) P(ˆxk−1|ˆpak−1)· · ·P(ˆx1|ˆpa1) X dk∪{xk} P(xn|pan)· · ·P(xk|pak) = P(ˆxk|ˆpak) [1] [1] =P(ˆxk|ˆpak).
In the second to last step, the sums are each equal to one for the following reason. Each is a sum of a product of conditional probability distributions specified for a DAG. In the case of the numerator, that DAG is the subgraph, of our original DAGG, consisting of the variables in Dk, and in the case of the denominator, it is the subgraph consisting of the variables in Dk∪{Xk}. Therefore, the fact
that each sum equals one follows from thefirst part of this proof.
Notice that the theorem requires that specified conditional distributions be discrete. Often in the case of continuous distributions it still holds. For example,
1.3. LARGE INSTANCES / BAYESIAN NETWORKS 39
X
Y
Z
P(x1) = .3 P(x2) = .7 P(y1|x1) = .6 P(y2|x1) = .4 P(y1|x2) = 0 P(y2|x2) = 1 P(z1|y1) = .2 P(z2|y1) = .8 P(z1|y2) = .5 P(z2|x2) = .5Figure 1.6: A DAG containing random variables, along with specified condi- tional distributions.
it holds for the Gaussian distributions introduced in Section 4.1.3. However, in general, it does not hold for all continuous conditional distributions. See [Dawid and Studeny, 1999] for an example in which no joint distribution having the specified distributions as conditionals even exists.
Example 1.28 Suppose we specify the DAGGshown in Figure 1.6, along with the conditional distributions shown in that figure. According to Theorem 1.5,
P(x, y, z) =P(z|y)P(y|x)P(x)
satisfies the Markov condition withG.
Note that the proof of Theorem 1.5 does not require that values in the specified conditional distributions be nonzero. The next example shows what can happen when we specify some zero values.
Example 1.29 Consider first the DAG and specified conditional distributions in Figure 1.6. Because we have specified a zero conditional probability, namely P(y1|x2), there are events in the joint distribution with zero probability. For example,
P(x2, y1, z1) =P(z1|y1)P(y1|x2)P(x2) = (.2)(0)(.7) = 0.
However, there is no event with zero probability that is a conditioning event in one of the specified conditional distributions. That is,P(x1), P(x2), P(y1),and P(y2)are all nonzero. So the specified conditional distributions all exist.
Consider next the DAG and specified conditional distributions in Figure 1.7. We have
P(x1, y1) = P(x1, y1|w1)P(w1) +P(x1, y1|w2)P(w2)
= P(x1|w1)P(y1|w1)P(w1) +P(x1|w2)P(y1|w2)P(w2) = (0)(.8)(.1) + (.6)(0)(.9) = 0.
The event x1, y1 is a conditioning event in one of the specified distributions, namelyP(zi|x1, y1), but it has zero probability, which means we can’t condition
W X Z Y P(y1|w1) = .8 P(y2|w1) = .2 P(y1|w2) = 0 P(y2|w2) = 1 P(x1|w1) = 0 P(x2|w1) = 1 P(x1|w2) = .6 P(x2|w2) = .4 P(w1) = .1 P(w2) = .9 P(z1|x1,y1) = .3 P(z2|x1,y1) = .7 P(z1|x2,y1) = .1 P(z2|x2,y1) = .9 P(z1|x1,y2) = .4 P(z2|x1,y2) = .6 P(z1|x2,y2) = .5 P(z2|x2,y2) = .5
Figure 1.7: The eventx1, y1has 0 probability.
on it. This poses no problem; it simply means we have specified some meaning- less values, namelyP(zi|x1, y1). The Markov condition is still satisfied because P(z|w, x, y) =P(z|x, y)wheneverP(x, y)6= 0 (See the definition of conditional independence for sets of random variables in Section 1.1.4.).