Topics in Probabilistic Graphical Models
Section 10.4: Directed Acyclic Graphs (DAGs) for Inference on Bayesian Networks
The use of DAGs for Bayesian inference illustrates the prior-to-posterior inference process, as demonstrated in the previous section. Bayesian networks, however, are generally more complicated than the examples presented above. Robert Cowell, in his article Introduction to Inference for Bayesian Networks, defines Bayesian networks as a “model representation for the joint distribution of a set of variables in terms of conditional and prior probabilities, in which the orientations of the arrows represent influence, usually though not always of a causal nature, such that these conditional probabilities for these particular orientations are relatively straightforward to specify (from data or eliciting from an expert)”.
[Figure: two-node networks on A and B, labeled P(A,B), P(A)P(B|A), and P(B)P(A|B)]
Hence, in this section, the focus falls on the inferential procedure, which involves the calculation of marginal probabilities conditional on the observed data using Bayes’ theorem. Diagrammatically, this is equivalent to reversing one or more of the Bayesian network arrows. This section defines a Bayesian network and shows how one can be constructed from prior knowledge.
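As a minimal sketch of this arrow reversal, the Python snippet below takes a prior P(A) and a conditional P(B|A) for the network A → B and applies Bayes’ theorem to obtain P(B) and P(A|B), the quantities needed for the reversed network B → A. The function name `reverse_arc`, the state labels, and all numbers are hypothetical and chosen only for illustration; the sketch assumes every state of B has nonzero probability.

```python
def reverse_arc(p_a, p_b_given_a):
    """Given P(A) and P(B|A) for A -> B, return P(B) and P(A|B) for B -> A."""
    states_a = list(p_a)
    states_b = list(next(iter(p_b_given_a.values())))
    # Joint distribution P(A, B) = P(A) P(B|A).
    joint = {(a, b): p_a[a] * p_b_given_a[a][b]
             for a in states_a for b in states_b}
    # Marginal P(B), obtained by summing A out of the joint.
    p_b = {b: sum(joint[(a, b)] for a in states_a) for b in states_b}
    # Bayes' theorem: P(A|B) = P(A, B) / P(B); assumes P(B = b) > 0.
    p_a_given_b = {b: {a: joint[(a, b)] / p_b[b] for a in states_a}
                   for b in states_b}
    return p_b, p_a_given_b


# Hypothetical numbers, for illustration only.
p_a = {"a0": 0.7, "a1": 0.3}
p_b_given_a = {
    "a0": {"b0": 0.9, "b1": 0.1},
    "a1": {"b0": 0.2, "b1": 0.8},
}
p_b, p_a_given_b = reverse_arc(p_a, p_b_given_a)
```

Both parameterizations describe the same joint distribution, since P(A)P(B|A) = P(B)P(A|B) for every pair of states.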
Let’s first consider a simple causal network where A is a parent of B. Then it is simple to calculate Pθ(B|A). However, if C is also a parent of B, then the individual conditional probabilities Pθ(B|A) and Pθ(B|C) do not provide sufficient information on how the influences of A and C on B interact. Hence, a specification of Pθ(B|A,C) is required. This idea can be extended to n random variables. Typically, the interest lies in looking for relationships among a large number of variables, a task for which the Bayesian network is well suited.
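To see why the pairwise conditionals are insufficient, the following sketch compares two hypothetical specifications of P(B=1|A,C) for binary variables, assuming A and C are independent fair coins. In the first, B is the exclusive-or of A and C; in the second, B ignores its parents. The names `pairwise_conditionals`, `xor_table`, and `coin_table` are illustrative assumptions, not part of the original text.

```python
from itertools import product


def pairwise_conditionals(p_b1_given_ac, p_a1=0.5, p_c1=0.5):
    """Return P(B=1|A=a) and P(B=1|C=c), assuming A and C are independent."""
    p_a = {0: 1.0 - p_a1, 1: p_a1}
    p_c = {0: 1.0 - p_c1, 1: p_c1}
    # Average the full table over the other parent.
    b_given_a = {a: sum(p_c[c] * p_b1_given_ac[(a, c)] for c in (0, 1))
                 for a in (0, 1)}
    b_given_c = {c: sum(p_a[a] * p_b1_given_ac[(a, c)] for a in (0, 1))
                 for c in (0, 1)}
    return b_given_a, b_given_c


# Two different full specifications of P(B=1 | A, C).
xor_table = {(a, c): float(a != c) for a, c in product((0, 1), repeat=2)}
coin_table = {(a, c): 0.5 for a, c in product((0, 1), repeat=2)}

same_pairwise = pairwise_conditionals(xor_table) == pairwise_conditionals(coin_table)
same_full_table = xor_table == coin_table
```

Under these assumptions both tables give P(B=1|A) = P(B=1|C) = 0.5, yet the full tables differ, so the interaction between the parents can only be captured by specifying P(B|A,C) directly.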
A Bayesian network for a set of variables X = {X1, X2, ..., Xn} consists of a set of directed arcs among the variables, which in turn provide (a) a network structure S that encodes a set of conditional independence assertions about the variables in X, and (b) a set P of local probability distributions associated with each variable (Heckerman, 1998; Jensen, 2001). Thus, the variables together with the directed arcs form a directed acyclic graph. Together these components:
i. define the joint probability distribution for X; and
ii. provide a one-to-one correspondence between the nodes in S and the variables in X.
Thus, a Bayesian network is a DAG whose structure defines a set of conditional independence properties. These properties can be found through graphical manipulations such as those presented in section 10.3 (Pearl, 1988). It has been contended that any uncertainty must obey the definition of d-separation (Jensen, 2001). That is to say, a conditional probability distribution P(Xi | par(Xi)) is associated with each node, where the conditioning is on the parents of the node. Further, d-separation can be used to read off conditional independencies from the DAG representation of a Bayesian network.
Given a network structure S, the joint probability distribution over the set of all variables U is given by
$$P(U) = \prod_{i} P(X_i \mid \mathrm{par}(X_i)),$$
where par(Xi) is the parent set of Xi. Thus, the local probability distributions P are simply the distributions corresponding to the terms in the product of the equation given above (Heckerman, 1999). Jensen (2001) defines this as the chain rule.
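The factorization can be sketched in a few lines of Python. The example below assumes a hypothetical chain X1 → X2 → X3 of binary variables; the dictionaries `parents` and `cpt`, the function `joint_probability`, and all numbers are illustrative assumptions rather than anything taken from the cited sources.

```python
from itertools import product

# Hypothetical network structure: each node maps to its tuple of parents.
parents = {"X1": (), "X2": ("X1",), "X3": ("X2",)}

# Local distributions: cpt[node][parent_values] = P(node = 1 | parents).
cpt = {
    "X1": {(): 0.2},
    "X2": {(0,): 0.3, (1,): 0.8},
    "X3": {(0,): 0.1, (1,): 0.6},
}


def joint_probability(assignment):
    """P(U) as the product over nodes of P(X_i | par(X_i))."""
    prob = 1.0
    for node, pars in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pars)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


# Summing the product over every joint state should give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
example = joint_probability({"X1": 1, "X2": 1, "X3": 0})  # 0.2 * 0.8 * (1 - 0.6)
```

Only five numbers are specified here, whereas an unrestricted joint table over three binary variables would need seven; the saving comes from the conditional independencies encoded by the structure.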
Next, let’s consider the process of building a Bayesian network. Heckerman (1999) illustrates this process through an example. Here, we generalize his “learning by example” approach. The first task is much like the one described for developing a decision tree or influence diagram: the logical structuring of the parameters and/or variables of interest. This task includes: (1) the correct identification of the goals of modeling (i.e., prediction versus explanation versus exploration); (2) the identification of all possible observations that may be relevant to the problem; (3) the identification of a subset of relevant observations to model; and (4) the classification of observations into variables having mutually exclusive and collectively exhaustive states. As seen earlier, this task is embedded in decision analysis and hence is not exclusive to Bayesian modeling. This concludes the logical framework development in the construction of a Bayesian network.
The next step in the construction of a Bayesian network entails the more mathematical or statistical framework. It is at this stage that a DAG encoding the “assertions of conditional independence” is built. This approach is based on the chain rule of probability, which is equivalent to the factorization of P(U) shown above. In general, the chain rule of probability is defined as
$$P(X) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1}) = \prod_{i=1}^{n} P(x_i \mid \pi_i).$$
Then for every Xi there exists some subset Πi ⊆ {X1, ..., Xi−1} such that Xi and {X1, ..., Xi−1} \ Πi are conditionally independent given Πi. Thus, it is clear that the variable sets (Π1, ..., Πn) correspond to the parent nodes of a Bayesian network and in turn specify the arcs in the network structure S.
It follows that determining the structure of the Bayesian network requires ordering the variables and determining the most appropriate subsets of variables. The nodes of a DAG can always be linearly ordered so that, for each node X, all of its parents precede it in the ordering. Such an ordering is referred to as a topological ordering (Cowell, 1999).
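One common way to produce such an ordering is Kahn's algorithm, sketched below in Python. This is a generic technique, not a procedure attributed to Cowell. The four-node graph `dag` is hypothetical, and the sketch assumes every node, including those without parents, appears as a key of the `parents` mapping.

```python
from collections import deque


def topological_order(parents):
    """Kahn's algorithm; `parents` maps every node to an iterable of its parents."""
    remaining = {node: set(pars) for node, pars in parents.items()}
    children = {node: [] for node in parents}
    for node, pars in remaining.items():
        for p in pars:
            children[p].append(node)
    # Start with the nodes that have no parents.
    ready = deque(sorted(n for n, pars in remaining.items() if not pars))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            remaining[child].discard(node)
            if not remaining[child]:  # all parents of `child` already placed
                ready.append(child)
    if len(order) != len(parents):
        raise ValueError("graph contains a directed cycle")
    return order


def is_topological(order, parents):
    """Check that every parent precedes its child in `order`."""
    position = {node: i for i, node in enumerate(order)}
    return all(position[p] < position[node]
               for node, pars in parents.items() for p in pars)


# Hypothetical DAG: A -> C <- B, C -> D.
dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
order = topological_order(dag)
```

A DAG generally admits several valid orderings; in this hypothetical graph both (A, B, C, D) and (B, A, C, D) pass `is_topological`.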
Consider the graph shown below with nine nodes. This has been taken from Cowell (1999).
Figure 10-4: [Directed acyclic graph on the nine nodes A, B, C, D, E, F, G, H, I]
This example shows that (A, B, C, D, E, F, G, H, I) and (B, A, E, D, G, C, F, I, H) are two possible topological orderings. Thus, this task (of ordering the variables) may not be the most feasible or reasonable process to use unless it is done under a more logical framework. For instance, if the variable ordering is chosen carelessly, the resulting network structure may fail to reveal some of the important conditional independencies, or in the worst case n! variable orderings may need to be explored. In contrast, applying the semantics of causal relationships (as observed naturally) readily asserts the corresponding conditional dependencies.
Now, the task at hand is to compute the joint density over the set of all variables U, as defined earlier. This is referred to as recursive factorization according to the DAG, or “the distribution being graphical over the DAG” (Cowell, 1999). It follows a procedure similar to the one illustrated above, simplifying the individual terms (or nodes) with respect to their structural parents. The final step in the construction of a Bayesian network then “simply” entails the assessment of the local probabilities, defined as P(Xi | par(Xi)) for each i. This concludes the systematic approach to the construction of Bayesian networks.
It has been illustrated that each of these models determines a set of conditional constraints, represented implicitly in the DAG. However, the underlying assumptions have not been discussed explicitly. One of them, which Pearl (1988) refers to as stability, is mentioned here. The stability condition requires that all of the probabilistic independence relations implied by the model be invariant to (small) perturbations of the parameters of the model. That is, the parameters of the model should not be functionally related to each other.
Furthermore, Verma & Pearl (1990) proved that two DAGs are observationally equivalent if they have (i) the same skeleton; and (ii) the same sets of v-structures, i.e., two converging arcs whose tails are not connected by an arc. Using this criterion, it can be seen that the DAGs in figure 10-5(a) and (b) are probabilistically indistinguishable, while those in figure 10-5(c) and (d) are not. Thus, a variety of belief nets can represent the same conditional independencies.
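The two conditions of this criterion can be checked mechanically. The sketch below is an illustrative Python implementation; the helper names are hypothetical, and the three-node example graphs are not the graphs of figure 10-5, whose arcs are not reproduced here. Each DAG is given as a mapping from a node to its parents.

```python
from itertools import combinations


def skeleton(parents):
    """Undirected version of the graph: the set of adjacent node pairs."""
    return {frozenset((p, child))
            for child, pars in parents.items() for p in pars}


def v_structures(parents):
    """Triples (p, child, q) with p -> child <- q and p, q not adjacent."""
    edges = skeleton(parents)
    found = set()
    for child, pars in parents.items():
        for p, q in combinations(sorted(set(pars)), 2):
            if frozenset((p, q)) not in edges:
                found.add((p, child, q))
    return found


def observationally_equivalent(dag1, dag2):
    """Same nodes, same skeleton, and same v-structures."""
    return (set(dag1) == set(dag2)
            and skeleton(dag1) == skeleton(dag2)
            and v_structures(dag1) == v_structures(dag2))


# Hypothetical three-node examples.
chain = {"A": [], "B": ["A"], "C": ["B"]}        # A -> B -> C
fork = {"B": [], "A": ["B"], "C": ["B"]}         # A <- B -> C
collider = {"A": [], "C": [], "B": ["A", "C"]}   # A -> B <- C
```

In these examples the chain and the fork share a skeleton and have no v-structures, so they pass the check, while the collider has the same skeleton but one v-structure and therefore does not.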
Figure 10-5: [DAGs on the nodes A, B, C, D, E; panels (a) and (b)]
So far the emphasis has been on the construction of a Bayesian network (from prior knowledge, data, or a combination of the two). In practice, the need is usually to determine various probabilities of interest from the model. Thus, we now focus on probabilistic inference in Bayesian networks. Because a Bayesian network for U (the set of all variables) determines a joint probability distribution for U, in principle the Bayesian network can be used to compute any probability of interest, as described above. However, the task of refining the structure and local probability distributions of a Bayesian network given data results in a set of techniques for data analysis that combines prior knowledge with data to produce improved knowledge (Heckerman, 1999). This process is referred to as “belief updating in Bayesian networks”. Jensen (2001) summarizes the mathematical process involved:
1. Use the chain rule to calculate P(U).
2. Determine the
∏
=
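As a rough illustration of belief updating, the sketch below computes a posterior by brute-force enumeration: it builds each joint probability with the chain rule, keeps the states consistent with the evidence, and normalizes. This is a generic enumeration approach under stated assumptions and is not claimed to reproduce the steps listed by Jensen (2001). The network A → B ← C, its numbers, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical binary network A -> B <- C.
parents = {"A": (), "C": (), "B": ("A", "C")}
# cpt[node][parent_values] = P(node = 1 | parents)
cpt = {
    "A": {(): 0.3},
    "C": {(): 0.6},
    "B": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
}


def joint(assignment):
    """Chain rule: product of P(X_i | par(X_i)) over all nodes."""
    prob = 1.0
    for node, pars in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pars)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


def posterior(query, evidence):
    """P(query = 1 | evidence) by enumeration; assumes P(evidence) > 0."""
    nodes = list(parents)
    numerator = 0.0
    denominator = 0.0
    for values in product((0, 1), repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # state contradicts the evidence
        p = joint(assignment)
        denominator += p  # accumulates P(evidence)
        if assignment[query] == 1:
            numerator += p  # accumulates P(query = 1, evidence)
    return numerator / denominator


prior_a = posterior("A", {})
updated_a = posterior("A", {"B": 1})
```

Enumeration scales exponentially in the number of variables, so it is suitable only for very small networks such as this one.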
This illustrates the basic ideas for learning probabilities (and structure) within the Bayesian or belief net framework. This section has described the construction steps sequentially; however, it is important to note that many of these “steps” are intermingled in practice. Both the problem of structuring and the assessment of probabilities can lead to changes in the network structure. The following section presents an extension of both probability modeling and belief nets within a hierarchical form.