Probabilistic graphical networks

2.2 Bayesian networks

2.2.2 Probabilistic graphical networks

2.2.2.1 Graph theory

Section 2.2.1 introduced the basic concepts of probability theory. Before graphical networks can be described, some basic principles of graph theory have to be defined. A graph is an abstract structure K that is built of a set of nodes and a set of edges, where the set of nodes is X = {X₁, …, X_n} in most cases throughout this thesis.

The set of edges E consists of connections between two nodes X_i and X_j that can be either directed (X_i → X_j or X_j → X_i) or undirected (X_i − X_j, also indicated by X_i ↔ X_j) for X_i, X_j ∈ X and i ≠ j. A directed graph G is a graph K in which all edges in E are directed.

In contrast, a graph H that contains only undirected edges is called an undirected graph.

When considering a directed edge X_i → X_j ∈ E, X_j is called the child of X_i and X_i is called the parent of X_j. Pa(X) denotes the parents of a node X, while the children of X are given by Ch(X). A node X with Pa(X) = ∅ is called an orphan. When considering an undirected edge X_i − X_j instead, X_j is called the neighbour of X_i and vice versa. The set of neighbours of a node X is given by Nb(X). An example of a graph can be seen in Figure 2.8.

Figure 2.8: An example of a graph containing directed and undirected edges, also called a partially directed graph.

In this example, graph K = (X, E) consists of nodes X = {A, B, C, D, E, F} and edges E = {A → B, B − C, B → D, C → E, C → F, D − E, E − F}. Node B, for example, has one parent Pa(B) = {A}, one child Ch(B) = {D} and one neighbour Nb(B) = {C}.
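The sets Pa, Ch and Nb of the example graph can be reproduced with a few lines of Python. This is a minimal sketch and not part of the thesis: the edges are copied from the example above, while the function names `pa`, `ch` and `nb` and the tuple/frozenset encoding of edges are illustrative assumptions.

```python
# Edges of the example graph in Figure 2.8.
# A directed edge u -> v is stored as the tuple (u, v);
# an undirected edge u - v is stored as a frozenset {u, v}.
directed = {("A", "B"), ("B", "D"), ("C", "E"), ("C", "F")}
undirected = {frozenset("BC"), frozenset("DE"), frozenset("EF")}


def pa(x):
    """Parents of x: all u with a directed edge u -> x."""
    return {u for (u, v) in directed if v == x}


def ch(x):
    """Children of x: all v with a directed edge x -> v."""
    return {v for (u, v) in directed if u == x}


def nb(x):
    """Neighbours of x: all nodes sharing an undirected edge with x."""
    return {n for edge in undirected if x in edge for n in edge if n != x}


print(pa("B"), ch("B"), nb("B"))  # {'A'} {'D'} {'C'}
print(pa("A"))                    # set(), so A is an orphan
```

Storing undirected edges as frozensets makes the neighbour relation symmetric by construction, which mirrors the "and vice versa" in the definition.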

A connection in a graph K = (X, E) over nodes X₁, …, X_k is called a path if for every i = 1, …, k − 1 either X_i → X_{i+1} or X_i − X_{i+1}. A path is called directed if at least one edge of the path is directed. Further, a directed path X₁, …, X_k where X₁ = X_k is called a cycle. A graph containing no cycles is called an acyclic graph. The example graph shown in Figure 2.8 is acyclic.
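The acyclicity of the example graph can also be checked mechanically. The sketch below is an illustration under the definitions above, not an algorithm taken from the thesis; the helper name `has_cycle` is hypothetical. Undirected edges may be traversed in both directions and directed edges only forwards, and a cycle is reported whenever the tail of some directed edge can be reached again from its head, so that the closed path contains at least one directed edge.

```python
def has_cycle(directed, undirected):
    """Return True if the partially directed graph contains a cycle."""
    # Successor map: directed edges forwards, undirected edges both ways.
    succ = {}
    for u, v in directed:
        succ.setdefault(u, set()).add(v)
    for u, v in undirected:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set()).add(u)

    # A cycle exists iff, for some directed edge u -> v,
    # node u is reachable again when starting from v.
    for u, v in directed:
        seen, stack = set(), [v]
        while stack:
            node = stack.pop()
            if node == u:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(succ.get(node, ()))
    return False


# Example graph of Figure 2.8.
directed = [("A", "B"), ("B", "D"), ("C", "E"), ("C", "F")]
undirected = [("B", "C"), ("D", "E"), ("E", "F")]
print(has_cycle(directed, undirected))  # False

# A hypothetical counterexample: A -> B - C -> A closes a cycle.
print(has_cycle([("A", "B"), ("C", "A")], [("B", "C")]))  # True
```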

2.2.2.2 Introduction to Bayesian networks

A Bayesian network B [159, 205, 85] is a probabilistic graphical model represented by a directed acyclic graph (DAG) G whose nodes represent the random variables of the domain. Further, for categorical data, a Bayesian network also holds a conditional probability table (CPT) for each node. The conditional probability distribution (CPD) is defined by the chain rule (Equation 2.18), which factorizes the joint distribution into conditional probabilities. Let X and Y be two random variables; then the joint distribution P(X, Y) is factorized as P(X, Y) = P(X) P(Y|X) according to the chain rule. Instead of specifying the joint entries P(X, Y), only the prior P(X) and the conditional probability distribution P(Y|X) of Y given X have to be defined. Representing a node X by its conditional probability distribution has two important advantages: first, it is much more compact than the raw joint distribution as the number of nodes grows, and second, it is modular.

If, for example, a new node Z were added, only the CPD of Z and the CPDs of the nodes Ch(Z) would have to be updated, whereas otherwise all entries in the joint distribution would have to be redefined. Factorizing the joint distribution into conditional probabilities of nodes given their parents, and into prior distributions for orphan nodes, is a key concept of Bayesian networks.
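The two-variable factorization P(X, Y) = P(X) P(Y|X) can be illustrated numerically. The following sketch uses made-up probabilities for two binary variables; the numbers and the variable names `p_x`, `p_y_given_x`, `joint` and `marginal_x` are assumptions chosen purely for illustration.

```python
# Hypothetical prior P(X) and conditional distribution P(Y | X).
p_x = {0: 0.7, 1: 0.3}
p_y_given_x = {
    0: {0: 0.9, 1: 0.1},  # P(Y | X = 0)
    1: {0: 0.4, 1: 0.6},  # P(Y | X = 1)
}

# Chain rule: every joint entry is the product of a prior and a conditional.
joint = {
    (x, y): p_x[x] * p_y_given_x[x][y]
    for x in p_x
    for y in p_y_given_x[x]
}

# Summing Y out of the joint recovers the prior P(X).
marginal_x = {
    x: sum(p for (xv, _), p in joint.items() if xv == x)
    for x in p_x
}

for entry, prob in sorted(joint.items()):
    print(entry, round(prob, 4))
```

The joint table is fully determined by the prior and the conditional, so only those two small tables need to be specified and maintained.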

Further, a Bayesian network can also be seen as a representation of a set of conditional independence assumptions about a distribution [87, 114]. Consider the Bayesian network B^example represented by a DAG G with nodes X = {A, B, C, D, E} illustrated in Figure 2.9.

Figure 2.9: An example Bayesian network B^example with five nodes X = {A, B, C, D, E} and corresponding CPTs. Nodes A, C, D and E have binary values {0, 1} while node B has three values {0, 1, 2}. Each node is connected to its CPT by a dashed line.

As can be seen, nodes A, C, D and E have binary values {0, 1} while node B has three values {0, 1, 2}. Dashed lines indicate the correspondence of the CPTs to the nodes. Connections in the network as well as entries in the CPTs indicate the conditional dependencies. It can be seen, for example, that node D only depends on its parent node B, while node C is dependent on nodes A and B. Changing the point of view to independences, it can be seen that node E is conditionally independent of all other nodes given its parent C:

(E ⊥ A, B, D|C). (2.23)

This means that once the value of C is known, no observation of nodes A, B or D changes the belief about node E. For node C, however, the corresponding assumption that C depends only on its parents no longer holds: observing a value of E (a child of C) can update the belief about node C. Thus, a node cannot be expected to be conditionally independent of all other nodes given its parents, as it can still depend on its children and even on further descendants. Node C is therefore only independent of node D given nodes A and B:

(C ⊥ D|A, B). (2.24)

Following these statements, it can be further concluded that node B is independent of node A, as A is neither a parent nor a descendant of B:

(B ⊥ A). (2.25)

Conversely, node A is independent of node B, and also of node D:

(A ⊥ B, D). (2.26)

Generalizing the conclusions drawn from the example network discussed above, a formal definition of a Bayesian network structure with respect to independence assumptions is given by Koller and Friedman [114] as follows:

Definition 5 A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X₁, …, X_n. Let Pa^G(X_i) denote the parents of X_i in G and NonDescendants_{X_i} denote the variables in the graph that are not descendants of X_i. Then G encodes the following set of conditional independence assumptions, called the local independencies, and denoted by I_l(G):

For each variable X_i: (X_i ⊥ NonDescendants_{X_i} | Pa^G(X_i))    (2.27)

In other words, Definition 5 states that each node X_i is conditionally independent of its non-descendants given its parents.
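The local independencies of Definition 5 can be enumerated directly from a parent map. The sketch below is an illustration only: it assumes the structure of B^example to be A → C, B → C, B → D and C → E, as implied by the dependencies described for Figure 2.9, and the helper names `descendants` and `local_independencies` are hypothetical. Parents are removed from the left-hand side because conditioning on them makes their independence statement trivial.

```python
# Assumed parent map of the example network (A -> C, B -> C, B -> D, C -> E).
parents = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"B"},
    "E": {"C"},
}


def descendants(node):
    """All nodes reachable from `node` by following edges parent -> child."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child, ps in parents.items():
            if current in ps and child not in found:
                found.add(child)
                stack.append(child)
    return found


def local_independencies():
    """Map each node X to (non-descendants without parents, parents)."""
    result = {}
    for x, pa_x in parents.items():
        non_desc = set(parents) - descendants(x) - {x}
        result[x] = (non_desc - pa_x, pa_x)
    return result


for x, (independent_of, given) in sorted(local_independencies().items()):
    print(f"({x} ⊥ {sorted(independent_of)} | {sorted(given)})")
```

Under the assumed structure, the statements for E, C, B and A coincide with Equations 2.23 to 2.26, and node D additionally yields (D ⊥ A, C, E | B).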

Finally, before formally defining a Bayesian network, the association between conditional independences and conditional probability distributions has to be clarified. Using the chain rule for probabilities from Equation 2.18, the joint distribution P(A, B, C, D, E) of the Bayesian network B^example can be decomposed as

P(A, B, C, D, E) = P(A) P(B|A) P(C|A, B) P(D|A, B, C) P(E|A, B, C, D)    (2.28)

without relying on any assumptions. By itself, the decomposition in Equation 2.28 does not bring any advantage over the joint distribution. However, the decomposed form on the right-hand side allows independence assumptions, such as those given in Equations 2.23 to 2.26, to be incorporated. For example, from (B ⊥ A) it immediately follows that P(B|A) = P(B). Hence, the second term on the right-hand side of Equation 2.28 can be simplified. Following this concept, the simplified decomposition becomes

P(A, B, C, D, E) = P(A) P(B) P(C|A, B) P(D|B) P(E|C)    (2.29)

which is exactly in line with the defined conditional probability tables. Thus, for each variable, a factor can be computed that represents its conditional probability, and each entry in the joint distribution can be calculated as a product of these factors [114]. The chain rule for Bayesian networks follows as:

P(X₁, …, X_n) = ∏_{i=1}^{n} P(X_i | Pa^G(X_i))    (2.30)

where G is a Bayesian network graph over variables X₁, …, X_n and the factors P(X_i | Pa^G(X_i)) are the individual CPDs. If a distribution P can be expressed as in Equation 2.30, P is said to factorize according to G [101, 183].
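As a worked instance of Equation 2.30, the simplification from Equation 2.28 to Equation 2.29 can be spelled out factor by factor. The third line relies on the local independency of node D, (D ⊥ A, C, E | B), which is not listed among Equations 2.23 to 2.26 but follows from Definition 5 when B is taken to be the only parent of D; dropping E from that statement is valid because independence of a set of variables implies independence of any subset.

```latex
\begin{align*}
P(B \mid A) &= P(B)
  && \text{since } (B \perp A), \\
P(C \mid A, B) &= P(C \mid A, B)
  && \text{unchanged, as $A$ and $B$ are the parents of $C$}, \\
P(D \mid A, B, C) &= P(D \mid B)
  && \text{since } (D \perp A, C, E \mid B), \\
P(E \mid A, B, C, D) &= P(E \mid C)
  && \text{since } (E \perp A, B, D \mid C).
\end{align*}
```

Multiplying the four right-hand sides with P(A) gives exactly the product in Equation 2.29.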

Finally, the formal definition of a Bayesian network follows from the chain rule for Bayesian networks also presented by Koller and Friedman [114]:

Definition 6 A Bayesian network is a pair B = (G, P) where P factorizes over G, and where P is specified as a set of CPDs associated with G's nodes. The distribution P is often annotated P_B.
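Definition 6 can be made concrete by storing a network as a parent map together with one CPT per node and multiplying the matching CPT entries to obtain joint probabilities. The sketch below is an assumption-laden illustration: the structure and the value ranges follow the description of B^example, but all probabilities are invented, since the CPT entries of Figure 2.9 are not reproduced in the text, and the names `card`, `parents`, `cpt` and `joint_probability` are hypothetical.

```python
from itertools import product

# Number of values per node: A, C, D, E are binary, B has three values.
card = {"A": 2, "B": 3, "C": 2, "D": 2, "E": 2}
# Assumed structure G: A -> C, B -> C, B -> D, C -> E.
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("B",), "E": ("C",)}

# CPTs: tuple of parent values -> distribution over the node's values.
# All numbers are hypothetical placeholders.
cpt = {
    "A": {(): [0.6, 0.4]},
    "B": {(): [0.5, 0.3, 0.2]},
    "C": {
        (0, 0): [0.9, 0.1], (0, 1): [0.7, 0.3], (0, 2): [0.5, 0.5],
        (1, 0): [0.4, 0.6], (1, 1): [0.2, 0.8], (1, 2): [0.1, 0.9],
    },
    "D": {(0,): [0.8, 0.2], (1,): [0.5, 0.5], (2,): [0.3, 0.7]},
    "E": {(0,): [0.95, 0.05], (1,): [0.25, 0.75]},
}


def joint_probability(assignment):
    """Chain rule for Bayesian networks: product of P(X_i | Pa(X_i))."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[node][pa_values][assignment[node]]
    return prob


nodes = list(card)
joint = {
    values: joint_probability(dict(zip(nodes, values)))
    for values in product(*(range(card[n]) for n in nodes))
}

n_joint_entries = len(joint)
n_cpt_entries = sum(len(rows) * card[n] for n, rows in cpt.items())
print(n_joint_entries, n_cpt_entries)  # 48 27
print(round(sum(joint.values()), 10))  # 1.0
```

For this small network the CPTs hold 27 entries compared with 48 entries of the full joint table, which illustrates the compactness argument made earlier; the gap widens quickly as more nodes are added.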

D-separation The concept of d-separation [158, 159, 86, 203] describes the relationship between the graph structure of a Bayesian network and the probabilistic independences.

Two variables X and Y in a Bayesian network B are d-separated given a variable Z if, for every path between X and Y, one of the following holds:

• Z is a node of a diverging (X ← Z → Y) or a serial path (X ← Z ← Y or X → Z → Y) between X and Y and Z is observed, or

• Z is a node of a v-structure (converging connection X → Z ← Y) and neither Z nor any of its descendants is observed.
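The criterion above can be tested programmatically. The sketch below does not enumerate paths; it swaps in the equivalent and well-known moralized ancestral graph test: restrict the DAG to the two query nodes, the observed nodes and all their ancestors, connect ("marry") parents that share a child, drop edge directions, remove the observed nodes, and check whether the query nodes are still connected. The function name `d_separated`, the generalization to a set of observed variables, and the assumed structure of the example network (A → C, B → C, B → D, C → E) are illustrative assumptions.

```python
from itertools import combinations


def d_separated(parents, x, y, given):
    """True if x and y are d-separated given the set `given` in the DAG."""
    # 1. Ancestral set of x, y and the observed nodes.
    relevant, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents[node])

    # 2. Moralize: undirected parent-child links plus links between co-parents.
    adj = {node: set() for node in relevant}
    for node in relevant:
        ps = sorted(parents[node])
        for p in ps:
            adj[node].add(p)
            adj[p].add(node)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)

    # 3. Remove observed nodes and search for a connection from x to y.
    blocked = set(given)
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj[node] - blocked)
    return True


# Assumed structure of the example network.
parents = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"B"}, "E": {"C"}}

print(d_separated(parents, "E", "A", {"C"}))  # True: serial path blocked by C
print(d_separated(parents, "A", "B", set()))  # True: collider C unobserved
print(d_separated(parents, "A", "B", {"E"}))  # False: descendant of C observed
```

The last call illustrates the second bullet: observing E, a descendant of the collider C, opens the path between A and B.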

In the case of a v-structure X → Z ← Y, node Z is also called a collider. The DAG of a Bayesian network thus describes the conditional dependence and independence relations in the probability distribution over a set of random variables. Verma and Pearl [203, 204] as well as Chickering [51] showed that a set of d-separations does not determine a unique DAG: several DAGs encode the same d-separations if and only if they share the same skeleton and the same set of colliders. A set of DAGs with equal skeleton and colliders is thus called an equivalence class, and its members are said to be structure equivalent. In other words, the same probability distribution, and therefore also the same set of d-separations, can be expressed by equivalent DAGs even if some edges are directed differently.
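The skeleton-and-collider criterion translates into a short equivalence check. The following sketch is illustrative and rests on two assumptions: the function names are hypothetical, and "collider" is read in the usual sense of a v-structure whose two parents are not themselves adjacent. Both DAGs are given as lists of directed edges over the same node set.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the DAG: each edge without its direction."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """All triples (p, child, q) with p -> child <- q and p, q non-adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found = set()
    for child, ps in parents.items():
        for p, q in combinations(sorted(ps), 2):
            if frozenset((p, q)) not in skel:
                found.add((p, child, q))
    return found


def structure_equivalent(edges_1, edges_2):
    """Same skeleton and same v-structures."""
    return (skeleton(edges_1) == skeleton(edges_2)
            and v_structures(edges_1) == v_structures(edges_2))


serial = [("X", "Z"), ("Z", "Y")]      # X -> Z -> Y
reversed_ = [("Y", "Z"), ("Z", "X")]   # X <- Z <- Y
diverging = [("Z", "X"), ("Z", "Y")]   # X <- Z -> Y
collider = [("X", "Z"), ("Y", "Z")]    # X -> Z <- Y

print(structure_equivalent(serial, reversed_))  # True
print(structure_equivalent(serial, diverging))  # True
print(structure_equivalent(serial, collider))   # False
```

The three connections without a collider fall into one equivalence class, while the v-structure forms a class of its own, in line with the d-separation rules stated above.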

In document Physarum Learner: a novel structure learning algorithm for Bayesian Networks inspired by Physarum polycephalum (Page 29-33)