
2.2 Bayesian networks

2.2.2 Probabilistic graphical networks

2.2.2.1 Graph theory

Section 2.2.1 introduced the basic concepts of probability theory. Before graphical networks can be described, some basic principles of graph theory have to be defined. A graph is an abstract structure K = (X, E) built from a set of nodes X and a set of edges E, where the set of nodes is X = {X1, . . . , Xn} in most cases throughout this thesis.

The set of edges E consists of connections between two nodes Xi and Xj that are either directed (Xi → Xj or Xj → Xi) or undirected (Xi − Xj, also indicated by Xi ↔ Xj), for Xi, Xj ∈ X and i ≠ j. A directed graph G is a graph K in which all edges in E are directed.

In contrast, a graph H that contains only undirected edges is called an undirected graph.

For a directed edge Xi → Xj ∈ E, Xj is called the child of Xi and Xi is called the parent of Xj. The notation Pa(X) denotes the parents of a node X, while the children of X are given by Ch(X). A node X with Pa(X) = ∅ is called an orphan. For an undirected edge Xi − Xj, Xj is called a neighbour of Xi and vice versa. The set of neighbours of a node X is given by Nb(X). An example of a graph can be seen in Figure 2.8.

Figure 2.8: An example of a graph containing directed and undirected edges, also called a partially directed graph.

In this example, graph K = (X, E) consists of nodes X = {A, B, C, D, E, F} and edges E = {A → B, B − C, B → D, C → E, C → F, D − E, E − F}. Node B, for example, has one parent Pa(B) = {A}, one child Ch(B) = {D} and one neighbour Nb(B) = {C}.
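The parent, child and neighbour sets of the example graph can be reproduced with a few lines of Python. The following is a minimal sketch that stores the edges listed above in two sets; the names `DIRECTED`, `UNDIRECTED`, `parents`, `children` and `neighbours` are chosen here for illustration only.

```python
# Edges of the example graph in Figure 2.8, copied from the text.
DIRECTED = {("A", "B"), ("B", "D"), ("C", "E"), ("C", "F")}   # (parent, child)
UNDIRECTED = {("B", "C"), ("D", "E"), ("E", "F")}             # unordered pairs


def parents(x):
    """Pa(x): all nodes with a directed edge into x."""
    return {u for (u, v) in DIRECTED if v == x}


def children(x):
    """Ch(x): all nodes that x has a directed edge into."""
    return {v for (u, v) in DIRECTED if u == x}


def neighbours(x):
    """Nb(x): all nodes joined to x by an undirected edge."""
    left = {v for (u, v) in UNDIRECTED if u == x}
    right = {u for (u, v) in UNDIRECTED if v == x}
    return left | right


print(parents("B"), children("B"), neighbours("B"))  # {'A'} {'D'} {'C'}
```

Node A has an empty parent set in this representation, so it is an orphan in the sense defined above.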

A sequence of nodes X1, . . . , Xk in a graph K = (X, E) is called a path if for every i = 1, . . . , k − 1 either Xi → Xi+1 or Xi − Xi+1. A path is called directed if at least one edge of the path is directed. Further, a directed path X1, . . . , Xk where X1 = Xk is called a cycle. A graph containing no cycles is called an acyclic graph. By this definition, the example graph shown in Figure 2.8 is acyclic.
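The acyclicity of the example graph can also be checked mechanically. Under the definitions above, a cycle exists exactly when some directed edge u → v can be closed by a path that leads from v back to u using only steps of the form Xi → Xi+1 or Xi − Xi+1. The following sketch implements this check with a breadth-first search; the function names are illustrative and not part of the original text.

```python
from collections import deque

# Edges of the example graph in Figure 2.8.
DIRECTED = {("A", "B"), ("B", "D"), ("C", "E"), ("C", "F")}
UNDIRECTED = {("B", "C"), ("D", "E"), ("E", "F")}


def successors(x, directed, undirected):
    """Nodes y that can follow x on a path: x -> y or x - y."""
    out = {v for (u, v) in directed if u == x}
    out |= {v for (u, v) in undirected if u == x}
    out |= {u for (u, v) in undirected if v == x}
    return out


def reachable(start, directed, undirected):
    """All nodes that can be reached from start along some path."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors(node, directed, undirected):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def has_cycle(directed, undirected):
    """True if some directed edge u -> v is closed by a path from v back to u."""
    return any(u in reachable(v, directed, undirected) for (u, v) in directed)


print(has_cycle(DIRECTED, UNDIRECTED))                  # False
print(has_cycle(DIRECTED | {("D", "A")}, UNDIRECTED))   # True: A -> B -> D -> A
```

The second call adds a hypothetical edge D → A, which is not part of Figure 2.8, purely to show a graph that the check rejects.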

2.2.2.2 Introduction to Bayesian networks

A Bayesian network B [159, 205, 85] is a probabilistic graphical model represented by a directed acyclic graph (DAG) G whose nodes represent the random variables of the domain. Further, for categorical data, a Bayesian network also holds a conditional probability table (CPT) for each node. The conditional probability distributions (CPDs) follow from the chain rule (Equation 2.18), which factorizes the joint distribution into conditional probabilities. Let there be two random variables X and Y; then the joint distribution P(X, Y) is factorized as P(X, Y) = P(X)P(Y|X) according to the chain rule. Instead of specifying the joint entries P(X, Y), only the prior P(X) and the conditional probability distribution P(Y|X) of Y given X have to be defined. Representing the distribution by the CPDs of the individual nodes has two important advantages: first, it is much more compact than the raw joint distribution as the number of nodes grows, and second, it is modular.

If, for example, a new node Z were added, only the CPD of Z and the CPDs of its children Ch(Z) would have to be updated, whereas otherwise all entries in the joint distribution would have to be redefined. Factorizing the joint distribution into conditional probabilities of nodes given their parents, and into prior distributions for orphan nodes, is a key concept of Bayesian networks.

Further, a Bayesian network can also be seen as a representation of a set of conditional independence assumptions about a distribution [87, 114]. Consider the Bayesian network Bexample represented by a DAG G with nodes X = {A, B, C, D, E} illustrated in Figure 2.9. As can be seen, nodes A, C, D and E have binary values {0, 1} while node B has three values {0, 1, 2}. Dashed lines indicate the correspondence of the CPTs to the nodes.

Figure 2.9: An example Bayesian network Bexample with five nodes X = {A, B, C, D, E} and corresponding CPTs. Nodes A, C, D and E have binary values {0, 1} while node B has three values {0, 1, 2}. Each node is connected to its CPT by a dashed line.

Connections in the network as well as entries in the CPTs indicate the conditional dependencies. It can be seen, for example, that node D only depends on its parent node B while node C is dependent on nodes A and B. Changing the point of view to independences, it can be seen that node E is conditionally independent of all other nodes given its parent C:

(E ⊥ A, B, D|C). (2.23)

This means that once the value of C is known, no observation of nodes A, B or D changes the belief about node E. When node C is investigated again under independence properties, however, the assumption that C depends only on its parents no longer holds: observing a value of E (a child of C) can update the belief about node C. Thus, a node cannot be expected to be conditionally independent of all other nodes given its parents, as it can still depend on its children and even on further descendants. Among the remaining nodes, node C is therefore only independent of node D given nodes A and B:

(C ⊥ D|A, B). (2.24)

Following these statements, it can be further concluded that node B is independent of node A as A is neither a parent of B nor a descendant:

(B ⊥ A). (2.25)

Conversely, node A is independent of node B, and also of node D:

(A ⊥ B, D). (2.26)
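The independence statements in Equations 2.23 to 2.26 can be verified numerically on a small joint table. The sketch below is only an illustration: the CPT entries of Figure 2.9 are not reproduced in the text, so all probability values used here are hypothetical, and the joint table is assumed to be the product of the CPT entries of the five nodes. The helper names `prob` and `independent` are likewise introduced for this example.

```python
from itertools import product

NAMES = ("A", "B", "C", "D", "E")
VALUES = {"A": (0, 1), "B": (0, 1, 2), "C": (0, 1), "D": (0, 1), "E": (0, 1)}

# Hypothetical CPT entries (assumed for illustration, not taken from Figure 2.9).
P_A = {0: 0.6, 1: 0.4}
P_B = {0: 0.5, 1: 0.3, 2: 0.2}
P_C1 = {(0, 0): 0.1, (0, 1): 0.4, (0, 2): 0.7,
        (1, 0): 0.3, (1, 1): 0.6, (1, 2): 0.9}   # P(C=1 | A, B)
P_D1 = {0: 0.2, 1: 0.5, 2: 0.8}                   # P(D=1 | B)
P_E1 = {0: 0.25, 1: 0.75}                         # P(E=1 | C)


def binary(p_one, value):
    """Probability of a binary value, given the probability of value 1."""
    return p_one if value == 1 else 1.0 - p_one


# Joint table built as the product of the CPT entries of all five nodes.
JOINT = {
    (a, b, c, d, e): P_A[a] * P_B[b] * binary(P_C1[(a, b)], c)
    * binary(P_D1[b], d) * binary(P_E1[c], e)
    for a, b, c, d, e in product(*(VALUES[n] for n in NAMES))
}


def prob(**fixed):
    """Marginal probability that the named variables take the given values."""
    return sum(
        p for assignment, p in JOINT.items()
        if all(assignment[NAMES.index(n)] == v for n, v in fixed.items())
    )


def independent(xs, ys, zs=()):
    """Test (xs ⊥ ys | zs) via P(x, y, z) P(z) = P(x, z) P(y, z) for all values."""
    names = tuple(xs) + tuple(ys) + tuple(zs)
    for vals in product(*(VALUES[n] for n in names)):
        full = dict(zip(names, vals))
        x = {n: full[n] for n in xs}
        y = {n: full[n] for n in ys}
        z = {n: full[n] for n in zs}
        gap = prob(**full) * prob(**z) - prob(**x, **z) * prob(**y, **z)
        if abs(gap) > 1e-12:
            return False
    return True


print(independent(["E"], ["A", "B", "D"], ["C"]))  # Equation 2.23
print(independent(["C"], ["D"], ["A", "B"]))       # Equation 2.24
print(independent(["B"], ["A"]))                   # Equation 2.25
print(independent(["A"], ["B", "D"]))              # Equation 2.26
print(independent(["C"], ["E"]))                   # C and its child E
```

With these assumed values the first four checks hold, while the last one fails, which matches the observation that a node can still depend on its children.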

Generalizing the conclusions drawn from the example network, a formal definition of a Bayesian network structure with respect to independence assumptions is given by Koller and Friedman [114] as follows:

Definition 5 A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1, . . . , Xn. Let PaG(Xi) denote the parents of Xi in G and NonDescendantsXi denote the variables in the graph that are not descendants of Xi. Then G encodes the following set of conditional independence assumptions, called the local independencies, and denoted by Il(G):

For each variable Xi : (Xi ⊥ NonDescendantsXi | PaG(Xi)) (2.27)

In other words, Definition 5 states that each node Xi is conditionally independent of its non-descendants given its parents.
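The local independencies of Definition 5 can be enumerated directly from the parent sets of a DAG. The following sketch does this for the example network, with the parent sets read off from the dependencies described for Figure 2.9. As a simplifying assumption, the parents themselves are removed from the non-descendant set, since independence from a variable that is already conditioned on is trivial. The names `PARENTS`, `descendants` and `local_independencies` are illustrative.

```python
# Parent sets of the example network: A -> C, B -> C, B -> D, C -> E.
PARENTS = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"B"}, "E": {"C"}}


def descendants(node):
    """All nodes reachable from node along directed edges."""
    found = set()
    stack = [child for child, pa in PARENTS.items() if node in pa]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(child for child, pa in PARENTS.items() if current in pa)
    return found


def local_independencies():
    """Map each node Xi to (non-descendants without parents, parents)."""
    result = {}
    for node, pa in PARENTS.items():
        non_desc = set(PARENTS) - {node} - descendants(node) - pa
        result[node] = (non_desc, pa)
    return result


for node, (non_desc, pa) in local_independencies().items():
    given = ", ".join(sorted(pa)) or "∅"
    print(f"({node} ⊥ {', '.join(sorted(non_desc))} | {given})")
```

For nodes A, B, C and E this reproduces Equations 2.23 to 2.26; for node D it additionally yields (D ⊥ A, C, E | B).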

Finally, before formally defining a Bayesian network, the association between conditional independences and conditional probability distributions has to be clarified. Considering the chain rule for probabilities from Equation 2.18, the joint distribution P(A, B, C, D, E) of the Bayesian network Bexample can be decomposed as

P(A, B, C, D, E) = P(A)P(B|A)P(C|A, B)P(D|A, B, C)P(E|A, B, C, D) (2.28)

without relying on any assumptions. On its own, the decomposition in Equation 2.28 brings no advantage over the joint distribution itself. However, the decomposed form on the right-hand side makes it possible to incorporate independence assumptions such as those given in Equations 2.23 - 2.26. For example, from (B ⊥ A) it immediately follows that P(B|A) = P(B). Hence, the second term on the right-hand side of Equation 2.28 can be simplified. Likewise, Equation 2.23 gives P(E|A, B, C, D) = P(E|C), and the local independence of D from its non-descendants given its parent B gives P(D|A, B, C) = P(D|B). Following this concept, the simplified decomposition becomes

P(A, B, C, D, E) = P(A)P(B)P(C|A, B)P(D|B)P(E|C) (2.29)

which is exactly in line with the defined conditional probability tables. Thus, for each variable, a factor can be computed that represents its conditional probability, and each entry in the joint distribution can be calculated as a product of these factors [114]. The chain rule for Bayesian networks follows as:

P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi | PaG(Xi)) (2.30)

where G is a Bayesian network graph over variables X1, . . . , Xn and the factors P(Xi | PaG(Xi)) are the individual CPDs. If a distribution P can be expressed as demonstrated in Equation 2.30, P factorizes according to G [101, 183].
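The compactness gained by the factorization can be made concrete by counting free parameters. The sketch below compares the full joint table of the example network with the CPDs of Equation 2.29, using the cardinalities stated for Figure 2.9 (B has three values, all other nodes are binary). It assumes that each distribution over k values needs k − 1 free parameters because its entries sum to one; the function names are illustrative.

```python
from math import prod

# Number of values per node and parent tuples of the example network.
CARD = {"A": 2, "B": 3, "C": 2, "D": 2, "E": 2}
PARENTS = {"A": (), "B": (), "C": ("A", "B"), "D": ("B",), "E": ("C",)}


def joint_free_parameters():
    """Free parameters of the full joint table: all entries minus one."""
    return prod(CARD.values()) - 1


def cpd_free_parameters():
    """Free parameters of all CPDs: (k - 1) per parent configuration."""
    return sum(
        (CARD[node] - 1) * prod(CARD[p] for p in PARENTS[node])
        for node in CARD
    )


print(joint_free_parameters())  # 47
print(cpd_free_parameters())    # 14
```

Even for five nodes, the factorized form needs fewer than a third of the parameters of the joint table, and the gap widens as further nodes with few parents are added.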

Finally, the formal definition of a Bayesian network follows from the chain rule for Bayesian networks, as also presented by Koller and Friedman [114]:

Definition 6 A Bayesian network is a pair B = (G, P) where P factorizes over G, and where P is specified as a set of CPDs associated with G’s nodes. The distribution P is often denoted by PB.

D-separation

The concept of d-separation [158, 159, 86, 203] describes the relationship between the graph structure of a Bayesian network and the probabilistic independences.

Two variables X and Y in a Bayesian network B are d-separated given a set of observed variables if every path between X and Y contains a node Z for which one of the following holds:

• Z is the middle node of a diverging (X ← Z → Y) or a serial connection (X ← Z ← Y or X → Z → Y) on the path and Z is observed, or

• Z is the middle node of a v-structure (converging connection X → Z ← Y) on the path and neither Z nor any of its descendants is observed.
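The two conditions above translate into a direct, if inefficient, test: enumerate every path between X and Y in the skeleton and check whether at least one middle node blocks it. The following sketch applies this to the example network Bexample. It is meant for small graphs only, and the names `simple_paths`, `blocked` and `d_separated` are introduced here for illustration.

```python
# Parent sets of the example network: A -> C, B -> C, B -> D, C -> E.
PARENTS = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"B"}, "E": {"C"}}


def children(node):
    return {child for child, pa in PARENTS.items() if node in pa}


def descendants(node):
    found, stack = set(), list(children(node))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children(current))
    return found


def simple_paths(x, y, path=None):
    """Yield all paths from x to y in the skeleton that visit no node twice."""
    path = [x] if path is None else path
    last = path[-1]
    if last == y:
        yield path
        return
    for nxt in PARENTS[last] | children(last):
        if nxt not in path:
            yield from simple_paths(x, y, path + [nxt])


def blocked(path, observed):
    """True if some middle node of the path satisfies one of the two conditions."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = prev in PARENTS[mid] and nxt in PARENTS[mid]
        if collider:
            # v-structure: blocks unless mid or one of its descendants is observed
            if mid not in observed and not (descendants(mid) & observed):
                return True
        elif mid in observed:
            # diverging or serial connection: blocks if mid is observed
            return True
    return False


def d_separated(x, y, observed=()):
    observed = set(observed)
    return all(blocked(path, observed) for path in simple_paths(x, y))


print(d_separated("A", "B"))              # True: collider C is unobserved
print(d_separated("A", "B", {"E"}))       # False: descendant E of C is observed
print(d_separated("C", "D", {"A", "B"}))  # True: matches Equation 2.24
```

Enumerating all paths grows exponentially with the size of the graph, so this sketch serves only to mirror the definition; it is not intended as an efficient procedure.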

In case of a v-structure X → Z ← Y, node Z is also called a collider. In this way, the conditional dependence and independence relations in the probability distribution over a set of random variables are described by the DAG of a Bayesian network. Verma and Pearl [203, 204] as well as Chickering [51] showed that a set of d-separations does not determine a unique DAG: several DAGs encode the same d-separations if and only if they share the same skeleton and the same set of colliders. A set of DAGs with equal skeleton and colliders is thus called an equivalence class, and its members are said to be structure equivalent. In other words, the same probability distribution, and therefore also the same set of d-separations, can be expressed by equivalent DAGs even if some edges are directed differently.
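The equivalence criterion can be sketched as a comparison of skeletons and colliders. The code below represents a DAG as a list of (parent, child) pairs. As an assumption that goes beyond the wording above, only colliders whose two parents are not adjacent are compared, which is the usual reading of the criterion; the function names are illustrative.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Triples (x, z, y) with x -> z <- y where x and y are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    found = set()
    for z, pa in parents.items():
        for x, y in combinations(sorted(pa), 2):
            if frozenset((x, y)) not in skel:
                found.add((x, z, y))
    return found


def structure_equivalent(edges_1, edges_2):
    same_skeleton = skeleton(edges_1) == skeleton(edges_2)
    return same_skeleton and v_structures(edges_1) == v_structures(edges_2)


# Example network and two variants with a single reversed edge.
example = [("A", "C"), ("B", "C"), ("B", "D"), ("C", "E")]
flip_bd = [("A", "C"), ("B", "C"), ("D", "B"), ("C", "E")]
flip_ce = [("A", "C"), ("B", "C"), ("B", "D"), ("E", "C")]

print(structure_equivalent(example, flip_bd))  # True
print(structure_equivalent(example, flip_ce))  # False
```

Reversing B → D leaves the collider at C untouched, so the two DAGs fall into the same equivalence class, whereas reversing C → E turns E into an additional parent of C and creates new colliders.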