The Markov Condition - Large Instances / Bayesian Networks

1.3 Large Instances / Bayesian Networks

1.3.2 The Markov Condition

First let’s review some graph theory. Recall that a directed graphis a pair

(V,E), whereV is a finite, nonempty set whose elements are callednodes (or vertices), andEis a set of ordered pairs of distinct elements ofV. Elements of E are called edges (or arcs), and if(X, Y)_∈E, we say there is an edge from

X to Y and thatX and Y are each incidentto the edge. If there is an edge fromX toY or fromY toX, we sayXandY areadjacent. Suppose we have a set of nodes [X1, X2, . . . Xk], wherek≥2, such(Xi−1, Xi)∈Efor2≤i≤k. We call the set of edges connecting the k nodes a path from X1 to Xk. The nodes X2, . . . Xk−1 are called interior nodes on path [X1, X2, . . . Xk]. The subpath of path [X1, X2, . . . Xk] fromXi to Xj is the path [Xi, Xi+1, . . . Xj] where 1_≤ i < j _≤ k. Adirected cycle is a path from a node to itself. A simple path is a path containing no subpaths which are directed cycles. A directed graph _G is called a directed acyclic graph (DAG) if it contains no directed cycles. Given a DAG_G= (V,E)and nodesX andY inV,Y is called aparent ofX if there is an edge fromY to X, Y is called a descendentof

X andX is called anancestorofY if there is a path fromX toY, and Y is called anondescendentofXifY is not a descendent ofX. Note that in this text X is not considered a descendent of X because we require k _≥ 2in the definition of a path. Some texts say there is an empty path fromX toX.

We can now state the following definition:

Definition 1.9 Suppose we have a joint probability distribution P of the random variables in some setVand a DAG_G= (V,E). We say that(_G, P)satisfies

the Markov condition if for each variable X _∈ V, _{X_} is conditionally in-

dependent of the set of all its nondescendents given the set of all its parents. Using the notation established in Section 1.1.4, this means if we denote the sets of parents and nondescendents of X byPAX andNDX respectively, then

IP({X},NDX|PAX).

When (_G, P) satisfies the Markov condition, we say _G and P satisfy the Markov condition with each other.

IfXis a root, then its parent setPAX is empty. So in this case the Markov condition means _{X_}is independent of NDX. That is, IP({X},NDX). It is not hard to show that IP({X},NDX|PAX) implies IP({X},B|PAX) for any B_⊆NDX. It is left as an exercise to do this. Notice that PAX⊆NDX. So we could define the Markov condition by saying thatX must be conditionally

(a) (c) (d) V S C V S C V C S C V S (b)

Figure 1.3: The probability distribution in Example 1.25 satisfies the Markov condition only for the DAGs in (a), (b), and (c).

independent ofNDX−PAX givenPAX. However, it is standard to define it as above. When discussing the Markov condition relative to a particular distribution and DAG (as in the following examples), we just show the conditional independence ofXandNDX₋PAX.

Example 1.25 Let Ω be the set of objects in Figure 1.2, and let P assign a probability of 1/13 to each object. Let random variables V, S, and C be as defined as in Example 1.19. That is, they are defined as follows:

Variable Value Outcomes Mapped to this Value

V v1 All objects containing a ‘1’

v2 All objects containing a ‘2’

S s1 All square objects

s2 All round objects

C c1 All black objects

1.3. LARGE INSTANCES / BAYESIAN NETWORKS 33 H B F L C

Figure 1.4: A DAG illustrating the Markov condition

Then, as shown in Example 1.19,IP({V},{S}|{C}). Therefore,(G, P)satisfies

the Markov condition if_Gis the DAG in Figure 1.3 (a), (b), or (c). However,

(_G, P)does not satisfy the Markov condition if_Gis the DAG in Figure 1.3 (d) because IP({V},{S}) is not the case.

Example 1.26 Consider the DAG _G in Figure 1.4. If (_G, P) satisfied the Markov condition for some probability distributionP, we would have the following conditional independencies:

Node PA Conditional Independency

C _{L_} IP({C},{H, B, F}|{L})

B _{H_} IP({B},{L, C}|{H})

F _{B, L_} IP({F},{H, C}|{B, L})

L _{H_} IP({L},{B}|{H})

Recall from Section 1.3.1 that the number of terms in a joint probability distribution is exponential in terms of the number of variables. So, in the case of a large instance, we could not fully describe the joint distribution by determining each of its values directly. Herein lies one of the powers of the Markov condition. Theorem 1.4, which follows shortly, shows if(_G, P)satisfies the Markov condition, then P equals the product of its conditional probability distributions of all nodes given values of their parents in _G, whenever these conditional distributions exist. After proving this theorem, we discuss how this means we often need ascertain far fewer values than if we had to determine all values in the joint distribution directly. Before proving it, we illustrate what it means for a joint distribution to equal the product of its conditional distributions of all nodes given values of their parents in a DAG_G. This would be the case

for a joint probability distributionP of the variables in the DAG in Figure 1.4 if, for all values off,c, b,l, andh,

P(f, c, b, l, h) =P(f_|b, l)P(c_|l)P(b_|h)P(l_|h)P(h), (1.6) whenever the conditional probabilities on the right exist. Notice that if one of them does not exist for some combination of the values of the variables, then

P(b, l) = 0or P(l) = 0or P(h) = 0, which implies P(f, c, b, l, h) = 0 for that combination of values.However, there are cases in whichP(f, c, b, l, h) = 0 and the conditional probabilities still exist. For example, this would be the case if all the conditional probabilities on the right existed andP(f_|b, l) = 0for some combination of values off, b, andl. So Equality 1.6 must hold for all nonzero values of the joint probability distribution plus some zero values.

We now give the theorem.

Theorem 1.4 If (_G, P) satisfies the Markov condition, then P is equal to the product of its conditional distributions of all nodes given values of their parents, whenever these conditional distributions exist.

Proof. We prove the case whereP is discrete. Order the nodes so that if Y is

a descendent ofZ, thenY followsZ in the ordering. Such an ordering is called

anancestral ordering. Examples of such an ordering for the DAG in Figure

1.4 are [H, L, B, C, F] and [H, B, L, F, C]. Let X1, X2, . . . Xn be the resultant

ordering. For a given set of values of x1, x2, . . . xn, let pai be the subset of

these values containing the values ofXi’s parents. We need show that whenever

P(pa_i)₆= 0 for1_≤i_≤n,

P(xn, xn−1, . . . x1) =P(xn|pan)P(xn−1|pan−1)· · ·P(x1|pa1).

We show this using induction on the number of variables in the network. As- sume, for some combination of values of thexi’s,thatP(pai)6= 0for1≤i≤n.

induction base: SincePA1 is empty,

P(x1) =P(x1|pa1).

induction hypothesis: Suppose for this combination of values of thexi’s that

P(xi, xi−1, . . . x1) =P(xi|pai)P(xi−1|pai−1)· · ·P(x1|pa1).

induction step: We need show for this combination of values of thexi’s that

P(xi+1, xi, . . . x1) =P(xi+1|pai+1)P(xi|pai)· · ·P(x1|pa1). (1.7)

There are two cases:

Case 1: For this combination of values

1.3. LARGE INSTANCES / BAYESIAN NETWORKS 35

Clearly, Equality 1.8 implies

P(xi+1, xi, . . . x1) = 0.

Furthermore, due to Equality 1.8 and the induction hypothesis, there is somek, where1_≤k_≤i, such that P(xk|pak) = 0. So Equality 1.7 holds.

Case 2: For this combination of values

P(xi, xi−1, . . . x1)6= 0.

In this case,

P(xi+1, xi, . . . x1) = P(xi+1|xi, . . . x1)P(xi, . . . x1)

= P(xi+1|pai+1)P(xi, . . . x1)

= P(xi+1|pai+1)P(xi|pai)· · ·P(x1|pa1).

Thefirst equality is due to the rule for conditional probability, the second is due to the Markov condition and the fact that X1, . . . Xi are all nondescendents of

Xi+1, and the last is due to the induction hypothesis.

Example 1.27 Recall that the joint probability distribution in Example 1.25 satisfies the Markov condition with the DAG in Figure 1.3 (a). Therefore, owing to Theorem 1.4,

P(v, s, c) =P(v_|c)P(s_|c)p(c), (1.9)

and we need only determine the conditional distributions on the right in Equality 1.9 to uniquely determine the values in the joint distribution. We illustrate that this is the case for v1,s1, andc1:

P(v1, s1, c1) =P(One_∩Square_∩Black) = 2 13

P(v1_|c1)P(s1_|c1)P(c1) = P(One_|Black)_×P(Square_|Black)_×P(Black) = 1 3× 2 3× 9 13 = 2 13.

Figure 1.5 shows the DAG along with the conditional distributions.

The joint probability distribution in Example 1.25 also satisfies the Markov condition with the DAGs in Figures 1.3 (b) and (c). Therefore, the probability distribution in that example equals the product of the conditional distributions for each of them. You should verify this directly.

If the DAG in Figure 1.3 (d) and some probability distributionP satisfied the Markov condition, Theorem 1.4 would imply

P(v, s, c) =P(c_|v, s)P(v)p(s).

P(c1) = 9/13 P(c2) = 4/13 P(v1|c1) = 1/3 P(v2|c1) = 2/3 P(v1|c2) = 1/2 P(v2|c2) = 1/2 P(s1|c1) = 2/3 P(s2|c1) = 1/3 P(s1|c2) = 1/2 P(s2|c2) = 1/2 V C S

Figure 1.5: The probability distribution discussed in Example 1.27 is equal to the product of these conditional distributions.

Theorem 1.4 often enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few. The number of values in the joint distribution is exponential in terms of the number of variables. However, each of these values is uniquely determined by the conditional distributions (due to the theorem), and, if each node in the DAG does not have too many children, there are not many values in these distributions. For example, if each variable has two possible values and each node has at most one parent, we would need to ascertain less than2n probability values to determine the conditional distributions when the DAG containsn nodes. On the other hand, we would need to ascertain 2n₋₁ _{values to determine the joint} probability distribution directly. In general, if each variable has two possible values and each node has at mostkparents, we need to ascertain less than2k_n values to determine the conditional distributions. So if kis not large, we have a manageable number of values.

Something may seem amiss to you. Namely, in Example 1.25, we started with an underlying sample space and probability function, specified some random variables, and showed that if P is the probability distribution of these variables and_Gis the DAG in Figure 1.3 (a), then(P,_G) satisfies the Markov condition. We can therefore apply Theorem 1.4 to conclude we need only determine the conditional distributions of the variables for that DAG tofind any value in the joint distribution. We illustrated this in Example 1.27. How- ever, as discussed in Section 1.2, in application we do not ordinarily specify an underlying sample space and probability function from which we can com- pute conditional distributions. Rather we identify random variables and values in conditional distributions directly. For example, in an application involv- ing the diagnosis of lung cancer, we identify variables like SmokingHistory,

1.3. LARGE INSTANCES / BAYESIAN NETWORKS 37

yes),P(LungCancer=present_|SmokingHistory=yes), andP(ChestXray=

positive_| LungCancer=present). How do we know the product of these conditional distributions is a joint distribution at all, much less one satisfying the Markov condition with some DAG? Theorem 1.4 tells us only that if we start with a joint distribution satisfying the Markov condition with some DAG, the values in that joint distribution will be given by the product of the conditional distributions. However, we must work in reverse. We must start with the conditional distributions and then be able to conclude the product of these distributions is a joint distribution satisfying the Markov condition with some DAG. The theorem that follows enables us to do just that.

Theorem 1.5 Let a DAG_Gbe given in which each node is a random variable, and let a discrete conditional probability distribution of each node given values of its parents in_Gbe specified. Then the product of these conditional distributions yields a joint probability distributionP of the variables, and(_G, P)satisfies the Markov condition.

Proof. Order the nodes according to an ancestral ordering. Let X1, X2, . . . Xn

be the resultant ordering. Next define

P(x1, x2, . . . xn) =P(xn|pan)P(xn−1|pan−1)· · ·P(x2|pa2)P(x1|pa1),

where PAi is the set of parents of Xi of in G and P(xi|pai) is the specified

conditional probability distribution. First we show this does indeed yield a joint probability distribution. Clearly, 0_≤P(x1, x2, . . .xn)≤1 for all values of the

variables. Therefore, to show we have a joint distribution, Definition 1.8 and Theorem 1.3 imply we need only show that the sum of P(x1, x2, . . . xn), as the

variables range through all their possible values, is equal to one. To that end, X x1 X x2 . . .X xn−1 X xn P(x1, x2, . . .xn) = X x1 X x2 · · ·X xn−1 X xn P(xn|pan)P(xn−1|pan−1)· · ·P(x2|pa2)P(x1|pa1) = X x1  X x2  _{· · ·}X xn−1 " X xn P(xn|pan) # P(xn−1|pan−1)· · ·  P(x2|pa2)  P(x1|pa1) = X x1  X x2  _{· · ·}X xn−1 [1]P(xn−1|pan−1)· · ·  P(x2|pa2)  P(x1|pa1) = X x1 " X x2 [_{· · ·}1_{· · ·}]P(x2|pa2) # P(x1|pa1) = X x1 [1]P(x1|pa1) = 1.

It is left as an exercise to show that the specified conditional distributions are the conditional distributions they notationally represent in the joint distribution.

Finally, we show the Markov condition is satisfied. To do this, we need show for1_≤k_≤nthat wheneverP(pa_k)₆= 0, if P(ndk|pak)6= 0 andP(xk|pak)6= 0

then P(xk|ndk,pak) = P(xk|pak), where NDk is the set of nondescendents of

Xk of inG. SincePAk ⊆NDk, we need only show P(xk|ndk) =P(xk|pak).

First for a given k, order the nodes so that all and only nondescendents of Xk precede Xk in the ordering. Note that this ordering depends onk, whereas

the ordering in thefirst part of the proof does not. Clearly then

NDk =_{X1, X2, . . . Xk−1}.

Let

Dk ={Xk+1, Xk+2, . . . Xn}.

In what follows, X

means the sum as the variables in dk go through all their possible values. Furthermore, notation such asˆxk means the variable has

a particular value; notation such as ndˆ k means all variables in the set have

particular values; and notation such as pa_n means some variables in the set may not have particular values. We have that

P(ˆxk|ˆndk) = P(ˆxk,ˆndk) P(ˆndk) = X dk P(ˆx1,xˆ2, . . .xˆk, xk+1, . . .xn) X dk∪{xk} P(ˆx1,xˆ2, . . .xˆk−1, xk, . . .xn) = X dk P(xn|pan)· · ·P(xk+1|pak+1)P(ˆxk|ˆpak)· · ·P(ˆx1|paˆ 1) X dk∪{xk} P(xn|pan)· · ·P(xk|pak)P(ˆxk−1|ˆpak−1)· · ·P(ˆx1|ˆpa1) = P(ˆxk|ˆpak)· · ·P(ˆx1|ˆpa1) X dk P(xn|pan)· · ·P(xk+1|pak+1) P(ˆxk−1|ˆpak−1)· · ·P(ˆx1|ˆpa1) X dk∪{xk} P(xn|pan)· · ·P(xk|pak) = P(ˆxk|ˆpak) [1] [1] =P(ˆxk|ˆpak).

In the second to last step, the sums are each equal to one for the following reason. Each is a sum of a product of conditional probability distributions specified for a DAG. In the case of the numerator, that DAG is the subgraph, of our original DAG_G, consisting of the variables in Dk, and in the case of the denominator, it is the subgraph consisting of the variables in Dk_∪_{Xk}. Therefore, the fact

that each sum equals one follows from thefirst part of this proof.

Notice that the theorem requires that specified conditional distributions be discrete. Often in the case of continuous distributions it still holds. For example,

1.3. LARGE INSTANCES / BAYESIAN NETWORKS 39

X

Y

Z

P(x1) = .3 P(x2) = .7 P(y1|x1) = .6 P(y2|x1) = .4 P(y1|x2) = 0 P(y2|x2) = 1 P(z1|y1) = .2 P(z2|y1) = .8 P(z1|y2) = .5 P(z2|x2) = .5

Figure 1.6: A DAG containing random variables, along with specified conditional distributions.

it holds for the Gaussian distributions introduced in Section 4.1.3. However, in general, it does not hold for all continuous conditional distributions. See [Dawid and Studeny, 1999] for an example in which no joint distribution having the specified distributions as conditionals even exists.

Example 1.28 Suppose we specify the DAG_Gshown in Figure 1.6, along with the conditional distributions shown in that figure. According to Theorem 1.5,

P(x, y, z) =P(z_|y)P(y_|x)P(x)

satisfies the Markov condition with_G.

Note that the proof of Theorem 1.5 does not require that values in the specified conditional distributions be nonzero. The next example shows what can happen when we specify some zero values.

Example 1.29 Consider first the DAG and specified conditional distributions in Figure 1.6. Because we have specified a zero conditional probability, namely P(y1_|x2), there are events in the joint distribution with zero probability. For example,

P(x2, y1, z1) =P(z1_|y1)P(y1_|x2)P(x2) = (.2)(0)(.7) = 0.

However, there is no event with zero probability that is a conditioning event in one of the specified conditional distributions. That is,P(x1), P(x2), P(y1),and P(y2)are all nonzero. So the specified conditional distributions all exist.

Consider next the DAG and specified conditional distributions in Figure 1.7. We have

P(x1, y1) = P(x1, y1_|w1)P(w1) +P(x1, y1_|w2)P(w2)

= P(x1_|w1)P(y1_|w1)P(w1) +P(x1_|w2)P(y1_|w2)P(w2) = (0)(.8)(.1) + (.6)(0)(.9) = 0.

The event x1, y1 is a conditioning event in one of the specified distributions, namelyP(zi_|x1, y1), but it has zero probability, which means we can’t condition

W X Z Y P(y1|w1) = .8 P(y2|w1) = .2 P(y1|w2) = 0 P(y2|w2) = 1 P(x1|w1) = 0 P(x2|w1) = 1 P(x1|w2) = .6 P(x2|w2) = .4 P(w1) = .1 P(w2) = .9 P(z1|x1,y1) = .3 P(z2|x1,y1) = .7 P(z1|x2,y1) = .1 P(z2|x2,y1) = .9 P(z1|x1,y2) = .4 P(z2|x1,y2) = .6 P(z1|x2,y2) = .5 P(z2|x2,y2) = .5

Figure 1.7: The eventx1, y1has 0 probability.

on it. This poses no problem; it simply means we have specified some meaning- less values, namelyP(zi_|x1, y1). The Markov condition is still satisfied because P(z_|w, x, y) =P(z_|x, y)wheneverP(x, y)₆= 0 (See the definition of conditional independence for sets of random variables in Section 1.1.4.).

In document Learning Bayesian Networks Neapolitan R E pdf (Page 41-50)