Parallel Algorithms for Bayesian Networks Structure Learning with Applications in Systems Biology

(1)

2012

Parallel Algorithms for Bayesian Networks

Structure Learning with Applications in Systems

Biology

Olga Nikolova

Iowa State University

Follow this and additional works at:

https://lib.dr.iastate.edu/etd

Part of the

Bioinformatics Commons

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please [email protected].

Recommended Citation

Nikolova, Olga, "Parallel Algorithms for Bayesian Networks Structure Learning with Applications in Systems Biology" (2012).

Graduate Theses and Dissertations. 12564. https://lib.dr.iastate.edu/etd/12564

(2)

with applications in systems biology

by

Olga Nikolova

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Bioinformatics and Computational Biology

Program of Study Committee: Srinivas Aluru, Co-major Professor Patrick Schnable, Co-major Professor

Dan Nettleton Julie Dickerson

Guang Song

Iowa State University Ames, Iowa

2012

(3)

DEDICATION

(4)

TABLE OF CONTENTS LIST OF FIGURES . . . vi LIST OF TABLES . . . ix LIST OF ALGORITHMS . . . xi ABSTRACT . . . xiii CHAPTER 1. Introduction . . . 1

1.1 Background and Motivation . . . 1

1.1.1 Gene Regulatory Networks . . . 1

1.1.2 Bayesian Networks as a Model for Gene Regulatory Networks . . . 3

1.1.3 Exact Bayesian Network Structure Learning . . . 4

1.1.4 Heuristic Bayesian Network Structure Learning . . . 6

1.2 Concepts of Bayesian Networks and Structure Learning . . . 7

1.2.1 Markov Assumption and Bayesian Networks . . . 7

1.2.2 Optimal Structure Learning . . . 8

1.2.3 Scoring Functions . . . 9

1.2.4 Faithfulness . . . 12

1.2.5 D-separation . . . 13

(5)

1.2.7 Markov Boundary . . . 14

1.2.8 Markov Equivalence . . . 14

CHAPTER 2. Parallel Globally Optimal Structure Learning of Bayesian Networks . . . 16

2.0.9 Previous Work . . . 16

2.0.10 Contributions . . . 17

2.1 Exact Structure Learning of Bayesian Networks . . . 18

2.2 Parallel Algorithm for Exact Structure Learning . . . 21

2.2.1 Mapping to ann-dimensional Hypercube . . . 22

2.2.2 Partitioning intok-dimensional Hypercubes . . . 23

2.2.3 Pipelining Hypercubes . . . 24

2.3 Correctness and Complexity . . . 25

2.3.1 Proof of Correctness . . . 25

2.3.2 Run-Time Complexity . . . 26

2.3.3 Space Complexity . . . 28

2.4 Experimental Results . . . 29

2.4.1 Performance Analysis for Varying Number of Observations . . . 30

2.4.2 Performance Analysis for Varying Number of Genes . . . 31

2.5 Biological Validation and Application to Arabidopsis Thaliana. . . 33

2.5.1 Synthetic Validation using SynTReN . . . 33

2.5.2 Application to the Carotenoid Biosynthesis Pathway inArabidopsis Thaliana 41 CHAPTER 3. Parallel Globally Optimal Structure Learning of Bayesian Networks with Bounded Node In-degree . . . 48

(6)

3.1 Structure Learning with Bounded Node In-degree . . . 48

3.1.1 Algorithm for Bounded In-degree . . . 49

3.1.2 Characterizing Run-Time Complexity as a Function ofd. . . 50

3.2 Experimental Analysis for Bounded Node In-degree . . . 52

3.3 Contributions . . . 54

CHAPTER 4. Parallel Algorithmic Framework for Large-scale Bayesian Network Structure Learning . . . 55

4.1 Introduction . . . 55

4.2 Phase 1: Parallel Discovery of Direct Causal Relations and Markov Boundaries with Applications to Gene Networks . . . 59

4.2.1 Direct Causal Relations (P C) and Markov Boundaries (M B) Learning . 60 4.2.2 Parallel Direct Causal Relations (P C) Discovery . . . 64

4.2.3 Parallel Markov Boundaries (M B) Discovery . . . 66

4.2.4 Experimental Results . . . 70

4.3 Phase 2: Parallel Structure Learning for Large Scale Bayesian Networks . . . . 75

4.3.1 Problem Statement and the Proposed Approach . . . 76

4.3.2 Overview of the Parallel Approach . . . 78

4.3.3 Parallel Discovery of Locally Optimal Parents . . . 80

4.3.4 Resolving Highly Connected Nodes . . . 85

4.3.5 Parallel Cycle Removal . . . 87

4.3.6 Repair of Optimal Parents Sets . . . 88

4.3.7 Experimental Results . . . 88

(7)

CHAPTER 5. Summary and Future Directions . . . 93

BIBLIOGRAPHY . . . 97

(8)

LIST OF FIGURES

1.1 Schematic representation of a gene regulatory network. Solid arrows indicate direct interactions. The dashed lines indicate indirect associ-ations between genes that result from projecting the interactions be-tween elements from the three different spaces onto the gene space. Arrows represent activation and bars represent inhibition. (Drawn af-ter Volkhard Helms, Principles of Computational Cell Biology, 2008.) . 2

2.1 A permutation tree and its corresponding subset lattice. . . 17 2.2 A lattice for a domain of size 3 and its partitioning into 4 levels, where

the empty set is at level 0. The binary string labels on the right-hand side of each node show the correspondence with a 3-dimensional hypercube. . . 21 2.3 Relative speedup as compared to a 32-core run for test data withn= 24

on(a) IBM Blue Gene/P and(b)AMD Opteron InfiniBand cluster. . 31 2.4 Relative speedup as compared to a 32-core run for the test data with

m= 1,000 on(a)IBM Blue Gene/P and(b)AMD Opteron InfiniBand

cluster. . . 32 2.5 Diagram of synthetic data generation procedure usingSynT ReN. . . . 33

(9)

2.6 Network 1 of the five synthetically generated gene networks of 24 genes. Three self-loops are present at nodes 4, 9, and 21, respectively. . . 34 2.7 Comparison of the averageF-scoresh2·_precisionprecision₊·recall_recaliover 10 samples

for networks learned with the exact (solid lines) and heuristic (dashed lines) algorithms. The connection between the average score values are gives solely to illustrate the trend. Results for each network are color-coded in blue(network 1), red (network 2), yellow (network 3), green (network 4), and orange (network 5). . . 39 2.8 Diagram of the Key Steps in the Carotenoid Biosynthesis inArabidopsis. 42 2.9 Gene networks generated using the 2.9(a) BN model and 2.9(b)

MI-based model under light experimental conditions. . . 44 2.10 Gene networks generated using the2.10(a) BN model and2.10(b)

MI-based model under light experimental conditions. . . 47

3.1 Run time as a function of the node in-degree boundd of the modified

parallel algorithm for BN structure learning. Test data for (a) n= 24

genes executed on 32 cores, and (b) n= 30 genes executed on 2,048

cores, for varying d= 1,2, . . . , n−1. . . 53

4.1 Relative speedup when compared to a 16-core run for the dataset with

m= 500 observations and varying number of genes denoted byn. . . . 73

4.2 Relative speedup when compared to a 16-core run for the dataset with

n= 3,000 genes and varying number of observations denoted bym. . 73

4.3 Node degree distribution for CPs for datasets of sizesn= 250,m= 200,

(10)

4.4 Relative speedup when compared to a 16-core run for the data set with n = 500 genes and m = 100 observations, for the computation

of optimal parents sets from CP sets of size no more than 25 (OP1) and the remaining ones (OP2). . . 91

(11)

LIST OF TABLES

2.1 Run-times in seconds on the IBM Blue Gene/P (left) and the AMD Opteron InfiniBand cluster (right) for test data with n = 24 genes.

The number of observations is denoted bym. . . 30

2.2 Run-time results in seconds on the IBM Blue Gene/P (left) and the AMD Opteron InfiniBand cluster (right) for test data withm= 1,000

observations. The number of genes is denoted byn. . . 32

2.3 Exact structure learning with MDL score. Sample size is denoted by

m, and for each m, lines 1 through 5 report for networks 1 through 5,

respectively; −−indicate empty networks. . . 35 2.4 Exact structure learning with BDe score. Sample size is denoted by

respectively. . . 36 2.5 Heuristic structure learning with BDe score (Banjo). Sample size is

denoted bym, and for eachm, lines 1 through 5 report for networks 1

through 5, respectively. . . 38 2.6 Information theoretic approach (TINGe). Sample size is denoted by

(12)

4.1 Total run-time for the computation ofP C +M B for the dataset with

m= 500 observations and varying number of genes denoted by n. . . 72

4.2 Total run-time for the computation ofP C +M B for the dataset with

n= 3,000 genes and varying number of observations denoted by m. . 74

4.3 Run-times for the parallel inference ofP Cs andM Bs on a genome scale

Arabidopsis thaliana data set. . . 75 4.4 Candidate parents sets statistics for the test data set of size n= 500,

m= 100. . . 89

4.5 Total run-time in seconds for the computation of optimal parents sets of size no more than 25 (OP1), node degree reduction (DR), computation of remaining optimal parents sets (OP2), and cycle removal (CYC), for the data set withn= 500 genes and m= 100 observations. . . 90

(13)

List of Algorithms

1 M M P C(T, D) . . . 60

2 M M P C(T, D) . . . 61

3 P CM B∗(T, D) . . . 63

(14)

ABSTRACT

The expression levels of thousands to tens of thousands of genes in a living cell are con-trolled by internal and external cues which act in a combinatorial manner that can be modeled as a network. High-throughput technologies, such as DNA-microarrays and next generation sequencing, allow for the measurement of gene expression levels on a whole-genome scale. In recent years, a wealth of microarray data probing gene expression under various biological conditions has been accumulated in public repositories, which facilitates uncovering the un-derlying transcriptional networks (gene networks). Due to the high data dimensionality and inherent complexity of gene interactions, this task inevitably requires automated computational approaches.

Various models have been proposed for learning gene networks, with Bayesian networks (BNs) showing promise for the task. However, BN structure learning is an NP-hard problem and both exact and heuristic methods are computationally intensive with limited ability to produce large networks. To address these issues, we developed a set of parallel algorithms. First, we present a communication efficient parallel algorithm for exact BN structure learning, which is work-optimal provided that 2n > p·logp, where n is the total number of variables,

and p is the number of processors. This algorithm has space complexity within 1.41 of the

optimal. Our empirical results demonstrate near perfect scaling on up to 2,048 processors.

(15)

number of parents per variable is imposed. We characterize the algorithm’s run-time behavior as a function of d, establishing the range 1₃n−logmn,dn₂e

of values for d where it affects

performance. Consequently, two plateaus regions are identified: for d < 1₃n−logmn, where

the run-time complexity remains the same as for d= 1, and for d≥ dn

2e, where the run-time

complexity remains the same as ford=n−1. Finally, we present a parallel heuristic approach

for large-scale BN learning. This approach aims to combine the precision of exact learning with the scalability of heuristic methods. Our empirical results demonstrate good scaling on various high performance platforms. The quality of the learned networks for both exact and heuristic methods are evaluated using synthetically generated expression data. The biological relevance of the networks learned by the exact algorithm is assessed by applying it to the carotenoid biosynthesis pathway in Arabidopsis thaliana.

(16)

CHAPTER 1. Introduction

1.1 Background and Motivation

1.1.1 Gene Regulatory Networks

The expression levels of thousands to tens of thousands of genes in a living cell are controlled by internal and external cues which act in a combinatorial manner (with multiple regulatory inputs) in the form of a network. In response to signals, transcription factor proteins (TFs) adjust the transcription rates of target genes which they regulate to ensure that the appropriate amounts of RNA are transcribed and that the necessary proteins are produced by the cell at all times. In many cases TFs’ activity levels can be closely approximated by their expression levels. In the cases when TFs’ activity levels are determined by other biochemical events, they can often be derived from the expression levels of indirect regulators, e.g. signaling proteins, which in turn control TFs’ activity [Pe’er et al.,2001,Segal et al.,2003]. Therefore, when gene expression levels are modeled as random variables, it is expected that interactions between regulators and target genes result in statistical dependencies [Pe’er et al.,2006]. While transcription regulation is a much more complex process involving various epigenetic processes (methylation, acetylation, histone modification, etc.), in this work we termgene network the schematic representation of gene interactions as defined by their relative expression levels changes. In reality, genes do not directly interact with other genes. Instead, gene expression

(17)

Gene 1 Gene 2 Gene 4 Gene 3 Protein 1 Protein 2 Protein 3 Protein 4 Complex 3-4 Metabolite 1 Metabolite 2 Direct Indirect Activation Inhibition

Metabolites

Proteins

Genes

Epigenetic

Effect

Figure 1.1 Schematic representation of a gene regulatory network. Solid arrows indicate direct interactions. The dashed lines indicate indirect associations between genes that result from project-ing the interactions between elements from the three different spaces onto the gene space. Arrows represent activation and bars represent inhibition. (Drawn after Volkhard Helms, Prin-ciples of Computational Cell Biology, 2008.)

levels are regulated by the actions of gene products, proteins and metabolites (see Figure1.1). What is observed in terms of gene expression changes are rather the indirect interactions, which could be visualized as the projection of gene interactions with proteins and metabolites onto the gene space. Gene networks are commonly represented as directed or undirected graphs, where nodes represent genes and edges - interactions (see level A of Figure1.1).

High-throughput technologies, such as DNA-microarrays and Next Generation Sequencing Technologies, allow for the measurement of gene expression levels on a whole-genome scale [ Ko-toulas and Siebes, 1999, Schena et al., 1995]. In the recent years a wealth of microarray expression data has been accumulated in public repositories, such as, NASCArrays [Craigon et al., 2004] and ArrayExpress [Brazma et al., 2003], resulting from both small- and

(18)

large-scale studies, probing gene expression under various biological conditions. Uncovering the underlying gene networks from observed microarray gene expression data (also referred to as reverse engineering gene networks) is an important and difficult problem. Due to the high data dimensionality and inherent complexity of gene interactions, this task inevitably requires automated computational approaches. Various models have been proposed for reverse en-gineering gene networks including correlation-based methods [Rice et al., 2005], information theory-based methods [Basso et al., 2005], dynamical models [Fuente et al.,2002], as well as, Bayesian networks [Friedman,2004]. In their community-wide effort for assessing the quality of computational techniques for network reconstruction, Marbach et al. concluded that this problem remains open and very challenging [Marbach et al.,2010]. Their findings reveal that more than 1/3rd _{of the methods perform as good as random guessing, and that the majority}

of current inference methods fail to correctly predict multiple regulatory inputs (combinatorial regulation).

1.1.2 Bayesian Networks as a Model for Gene Regulatory Networks

Bayesian networks (BNs) are graphical models which represent probabilistic relationships between interacting variables in a given domain. In BNs, relationships between variables are depicted in the manner of conditional independencies (presence or absence of edges) and quantitatively assessed by conditional probability distributions [Grzegorczyk and Husmeier, 2008]. BNs provide a holistic view of the interactions within the domain and can be learned from data. BNs have been successfully applied in a variety of different research areas, including decision support systems, language processing, medicine, and bioinformatics [Basilio et al., 2000, Friedman, 2004, Heckerman et al., 1995b]. Certain properties make them especially attractive for gene network induction. One of them is the ability to depict combinatorial

(19)

interactions, highlighted as one of the most challenging tasks by [Marbach et al.,2010]. In the search for a globally optimal BN, all possible subsets of variables are evaluated as potential regulators for each gene, in the context of a complete structure, unlike co-expression networks, where individual pair-wise or partial correlations are considered. Another one, is the ability to learn causal edges in presence of perturbation data, which in the context of a biological system can be achieved for example via a gene knockout or over-expression of a given gene [Pe’er et al., 2001]. Further, the scoring functions used in BN structure learning approaches are robust to overfitting. Most scoring functions for BN structure learning employ a penalty term to balance the goodness of fit with the complexity of the given model. Examples of such scoring functions are the information theoretic Bayesian Information Criterion (BIC) [Schwarz,1978] and Akaike Information Criterion (AIC) [Akaike, 1974], the information theoretic scoring function based on the Minimum Description Length principle (MDL) [Lam and Bacchus,1994], and Bayesian scores, which estimate the posterior probability of the considered model given the data, such as Bayesian Dirichlet (BD) [Cooper and Herskovits,1992]) and its extension Bayesian Dirichlet equivalence (BDe) [Heckerman et al.,1995a]. Major difficulties in inferring BNs from data are due to its computational and space complexity. In the following subsections we outline the state-of-the-art approaches and the remaining open problems for both exact and heuristic BN structure learning.

1.1.3 Exact Bayesian Network Structure Learning

In order to learn an optimal BN from data, every possible network configuration over the variables in the given domain is scored, given the observed data. The search space of BN structure learning fornvariables corresponds to the space of all possible directed acyclic graphs

(20)

an effort to reduce the search space, marginalization over node orderings was proposed, which successfully reduces the search space to exponential [Koivisto, 2006, Koivisto et al., 2004, Ott et al., Silander and Myllymaki, 2006]. Adopting a canonical representation of the DAGs in conjunction with a decomposable scoring function (where the score of the network can be computed as a sum of score contributions of each variable and its parents) allows for a dynamic programming (DP) approach, described in detail in Chapter 2. However, even with these improvements, and after further limiting the number of possible regulators for each gene, the BN structure learning problem is shown to be NP-hard [Chickering et al.,1994].

State-of-the-art sequential algorithms exploit the improvements described above [Ott et al., Silander and Myllymaki, 2006] but are still unable to scale to domains of even 30 variables. The reason for this is not only in the run time, but also in the memory requirements, which grow exponentially in n as well. In addition, with the large amounts of publicly available

microarray data, large samples (thousands of microarray chips) can be compiled after appro-priate statistical pre-processing as shown in Aluruet al.[Aluru et al.,Under Review], and used for network induction. Computational complexity of the utilized scoring functions increases linearly in the number of observations (as the data matrix is scanned), which introduces addi-tional computaaddi-tional cost. Thus, both the computaaddi-tional and memory requirements create a bottleneck for BN structure learning, leaving the problem of computing optimal BNs for both large number of variables and large number of observations open. Even-though heuristic-based algorithms for BN structure learning exist, they trade off optimality for the ability to learn larger networks, and structure learning without additional assumptions remains valuable. Ott et al. report that inferring a BN for 20 variables under 173 conditions took approximately 50 hours.

(21)

1.1.4 Heuristic Bayesian Network Structure Learning

Due to the exponential complexity of exact learning, exact approaches do not scale to large networks, prompting the development of heuristic approaches [Chen et al.,2006,Cheng,2002, Chickering,2002,Friedman et al.,Heckerman et al.,1995a,Moore and Wong,2003,Pe’er et al., 2006,Spirtes et al.,2001,Spirtes and Meek,1995,Tsamardinos et al.,2006,Wang et al.,2007]. In principle the heuristic approaches can be categorized as either i) score optimization methods, ii) constraint-based methods, or iii) hybrid approaches, which include elements of the first two. A commonly used approach from the first category is a greedy search. It starts with an empty network (all variables are independent), a fully-connected network, or a random network, and proceed to score every possible change, which can be implemented by adding an edge, removing an edge, or reversing an edge of this structure, utilizing a Bayesian scoring criteria. The newly generated structure which results from making the highest-scoring change is accepted and the algorithm repeats this step in a greedy search manner until a structure is selected which optimizes the score [Chickering,2002,Heckerman et al.,1995a,Moore and Wong,2003]. In the second category, direct and entailed conditional independencies are learned from the observed data using statistical or information theoretic measures and a graphical criterion called d

-separation. A partially oriented network is returned which satisfies these constraints [Elidan and Friedman,2005,Spirtes et al.,2001]. These methods are shown to do better in discovering the sets of parents and children of each variable (undirected network), than identifying the parents of each node (the final DAG). Methods in the third category utilize constraint-based approaches to learn the sets of parents and children for each variable, and then proceed with heuristic score-optimization to learn the final network [Chen et al., 2006, Spirtes and Meek, 1995,Tsamardinos et al.,2006,Wang et al.,2007]. A more in depth literature review is given

(22)

in Chapter 4. An extensive evaluation of the performance of the state-of-the-art algorithms, including the Sparse Candidate [Friedman et al.], Three Phase Dependency Analysis [Cheng, 2002], Optimal Reinsertion [Moore and Wong,2003], Greedy Equivalence Search [Chickering, 2002], PC [Spirtes et al., 2001], and Max-Min Hill Climbing (MMHC) [Tsamardinos et al., 2006] algorithms, shows that the M M HC performs best in terms of both correctness of the

inferred BNs and and ability to infer large networks [Tsamardinos et al.,2006]. Tsamardinos et al. report that building a BN for 5,000 variables took approximately 14 days, of which 19

hours were spent to infer the skeleton network and the rest 13 days to orient it. This indicates that the state-of-the-art heuristic-based methods for BN structure learning do not scale to large genome domain sizes, leaving this as an open problem.

In this thesis we address the open problems outlined above by devising and implementing run-time and memory efficient parallel algorithms. We also investigate the biological relevance of the learned networks from large mixtures of microarray experiments.

1.2 Concepts of Bayesian Networks and Structure Learning

1.2.1 Markov Assumption and Bayesian Networks

Throughout this thesis, we denote random variables with upper-case letters (e.g. X, Xi)

and their values with lower-case letters (e.g. x). We denote sets of random variables with

upper-case bold-face letters (e.g. Z) and the corresponding assignment of values for each variable in a given set with a lower-case bold-face letter (e.g. z). Conditional independence of two variables X and Y given set Z is specified as ⊥⊥ (e.g. X ⊥⊥ Y|Z), and conditional

dependence as6⊥⊥(e.g. X 6⊥⊥Y|Z). LetX ={X1, X2, . . . , Xn}be a set of random variables. In

(23)

exists fromXi toXj, and anondescendent otherwise. Further, a nodeXk is called aparentof

Xi, andXi is called a child ofXk, if there is an edge from Xk toXi. The set of all parents of

Xi is denoted byP a(Xi). The following is known as the Markov assumption:

Definition 1.1 Let P be a joint probability distribution over the variables in set X and N =

(X, E) be a DAG. Then N and P are said to satisfy the Markov assumption if each variable

Xi ∈ X is conditionally independent of the set of its nondescendants, given the set of its

parents [Neapolitan, 2003].

The pair (N, P) is called a BN if N and P satisfy the Markov assumption [Pearl, 1988].

This allows for a compact representation of the joint probability overX, as the product of the conditional probabilities of each variable given its parents:

P(X1, X2, . . . , Xn) = n Y i=1 P(Xi |P a(Xi)). Formally:

Theorem 1.1 If (N, P) satisfies the Markov assumption, thenP is equal to the product of its

conditional distributions of all nodes given values of their parents, whenever these conditional distributions exist [Neapolitan, 2003].

1.2.2 Optimal Structure Learning

When learning the structure of a BN from data, the Bayesian approach utilizes a statisti-cally motivated scoring function to evaluate how well a given graph depicts the dependencies in the given data [Cooper and Herskovits,1992,Heckerman et al.,1995a,Schwarz,1978,Lam and Bacchus, 1994, Akaike, 1974]. One example of such a scoring function is the Bayesian

(24)

score, where the posterior probability of the proposed graphN is evaluated given the observed

data. Suppose we are given m observations of the system, each observation consisting of the

value of each of then random variables inX. These observations can be viewed as an n×m

matrixDn×m, with rows representing variables and columns representing observations. Then,

the Bayesian score for a network N given Dn×m is:

Score(N) = logP(N |D)

= logP(D|N) + logP(N) +C,

where C is a constant. In this score the two non-constant terms logP(D|N) and logP(N)

refer respectively to the log-likelihood of the data Dgiven DAG N and the prior probability

of N, usually taken to be uniform.

Optimal structure learning is formalized as an optimization problem. LetScore(N :Dn×m) be a scoring function which evaluates how well a given DAG N predicts the dependencies in

Dn×m. Finding an optimal BN structureNopt is equivalent to finding a DAG which optimizes

the scoring function:

Nopt= argmax N

Score(N :Dn×m).

1.2.3 Scoring Functions

To find the optimal network efficiently, it is crucial to choose a scoring function which decomposes into individual score contributions s(Xi, P a(Xi)) of each of the variables in X

given its parents:

Score(N) = X Xi∈X

(25)

Examples of such information theoretic and Bayesian scoring functions include the Bayesian Information Criterion (BIC) [Schwarz,1978] and Akaike Information Criterion (AIC) [Akaike, 1974], the information theoretic scoring function based on the Minimum Description Length principle (MDL) [Lam and Bacchus,1994], and Bayesian scores, which estimate the posterior probability of the considered model given the data, such as Bayesian Dirichlet (BD) [Cooper and Herskovits, 1992]) and its extension Bayesian Dirichlet equivalence (BDe) [Heckerman et al.,1995a]. In this work we utilize two of them, namely the MDL and BDe scoring functions, which are described in more detailed below.

Minimum Description length principle-based scoring function

This function is based on the minimum description length principle, which states that for given data one should choose such a model to encode the data, which minimizes the description length of both the model itself and the data [Cover and Thomas,2006]. In the case of Bayesian networks the model is a network (i.e. DAG with corresponding conditional probabilities table). Its description length is decomposed into the sum of description lengths of the component nodes. Because the network describes probability distribution with respect to the input data, it can be used to find optimal encoding of this data, which induces its description length.

Specifically, to describe the graphN, the parents sets of each variable need to be described.

This can be achieved by storing the number of parents |P a(Xi)| and an index of the parent

set P a(Xi) in an agreed upon enumeration of all _|_{P a}n₍_X

i)|

sets of cardinality|P a(Xi)|. Thus,

the description length of the graphN is

n X i=1 logn+ log n |P a(Xi)| .

(26)

proba-bility tables (CPT) needs to be saved. Suppose that for variableXi,ri denotes the number of

discrete states andqidenotes the number of all possible parent configurations. Thenqi·(ri−1)

parameters describe the CPT for variable Xi, and given that 1₂ ·m bits are used to represent

each parameter, the set of all CPTs can be encoded in:

n

X

i=1

1

2·qi(ri−1) logm

bits. Finally, the data can be described using conditional entropyH(Xi|P a(Xi)), the

descrip-tion length of which is given by:

n

X

i=1

m·H(Xi|P a(Xi)).

1.2.3.1 Bayesian Dirichlet Equivalence scoring function

The Bayesian Dirichlet Equivalence (BDe) scoring function evaluates the posterior prob-ability of a graph N given the observed data Dn×m. The best network is the one which maximizes the posterior probability, as opposed to the MDL score, where the optimal network has minimum description length.

Letri denote the number of discrete states of variableXi, andqi be the number of possible

configurations of its parent set P a(Xi) inN. Let wj forj = 1, . . . qi be one of these

configu-ration. Let also Nijk be the number of instances in the data set Dn×m where the variableXi is in state k with parents in configuration j. Thus, Nij is the number of instances where the

(27)

ScoreBDe(N :DN×m) = log(p(N))+ n X i=1 qi X j=1 log Γ (ηij) Γ(Nij +ηij) + ri X k=1 log Γ (Nijk+ηijk) Γ (ηijk) ,

whereηijk are the hyperparameter for the Dirichlet prior distribution of the parameters given

the network structure N, and nij =Pr_ki₌₁ηijk. The Gamma function Γ(c) =

R∞ 0 e

−u_.uc−1_d_u

is given by Γ(c) = (c−1)! whencis an integer. The hyperparameters are computed as follows:

ηijk =η×p(xik, wik|G0),

wherep(.|G0) represents a probability distribution associated with aprior Bayesian networkG0

andηis a parameter representing the equivalent sample size. In our implementation of the BDe

score we use the variation BDeu, where a uniform probability is assigned to each configuration of variable state and parent set configuration: p(xik, wik|G0) = _r_i1_q_i. We also assign uniform

distribution to all possible structures, and therefore ignore the log(p(N)) term. The BDe

assigns equal scores to graphs from the same Markov equivalence class (see Section 1.2.8).

1.2.4 Faithfulness

Variables X and Y are said to be conditionally independent given set Z with respect to

a probability distribution P if P(X, Y|Z) = P(X|Z)P(Y|Z). The following condition is an

essential assumption we will be making throughout this thesis:

Definition 1.2 A Bayesian network (N, P) is said to satisfy the faithfulness condition if P

(28)

1.2.5 D-separation

WhileN encodes some independencies ofP directly (e.g. the absence of an edges indicates

lack of direct dependence), others are entailed. A graphical criterion which allows for entailed independencies to be read from the BN structureN is called d-separation.

Definition 1.3 LetX,YandZbe three mutually exclusive subsets ofX. Thend-sep(X,Y|Z)

denotes thatXandYinN ared-separated byZ, andd-sep(X,Y|Z)is said to hold [Neapolitan,

2003] when on every undirected path between a node in X and a node in Y, a node Z exists

such that either:

1. Z does not have two incoming edges and Z ∈Z, or

2. Z does have two incoming edges but neitherZ nor any of its descendants in N are inZ.

In a faithful Bayesian network (N, P), d-sep(X,Y|Z) is equivalent to X ⊥⊥ Y|Z [Pearl,

1988]. Due to the faithfulness assumption we made in 1.2.4, d-separation and independence

are used interchangeably.

1.2.6 Constraints-based Learning

In BN structure learning statistical tests or information theoretic criteria are often used to discover conditional independencies. The main idea behind constraint-based BN structure learning consists of determining direct and entailed conditional independencies via statistical tests and d-separation relations. The following theorem [Spirtes et al., 2001] is instrumental

to the edge discovery procedures in the algorithms discussed in Chapter4:

(29)

1. For each pair of nodesX andY inX,X andY are adjacent inN if and only ifX6⊥⊥Y|Z

for allZ⊆ X \{X, Y}, and

2. For each triplet of nodes X, Y, and Z in N such that X and Z are adjacent, Y and Z

are adjacent, butX andY are not adjacent,X →Z←Y is a subgraph ofN if and only

if X6⊥⊥Y|Zfor all Zsuch that X, Y /∈Z and Z ∈Z.

Let P C(T) denote the set of parents and children in N of some variable T ∈ X. The set is

unique for all BNs that are faithful to the same probability distribution P [Verma and Pearl,

1988,1991].

1.2.7 Markov Boundary

For any variable T ∈ X, its Markov boundary, denotedM B(T), is defined as any minimal

subset of X which shields T from all other variables inX excludingT and M B(T):

Definition 1.4 Markov boundary M B(T) of a variable T ∈ X is the minimal set of variables

in X such that T ⊥⊥(X \(M B(T)∪ {T}))|M B(T).

For any variableT ∈ X the setM B(T) is the union of the parents ofT, children ofT, and all

other parents of children ofT (also calledspouses of T). The set M B(T) is unique for all BNs

that are faithful to the same probability distributionP [Pearl,1988].

1.2.8 Markov Equivalence

Multiple DAGs can represent the same set of conditional independecies, i.e. they have the same d-separations. Such graphs are called Markov equivalent and are formally defined as

(30)

Definition 1.5 Let N1 = (X, E1) and N2 = (X, E2) be two DAGs containing the same set

of nodes X. Then N1 and N2 are called Markov equivalent is for every three mutually

disjoint subsetsA,B,C⊆ X,A andB ared-separated byCin N1 if and only ifA andB are

d-separated by Cin N2 [Neapolitan,2003].

The application of the above definition over graph properties is in probability due to the following theorem:

Theorem 1.3 Two DAGs are Markov equivalent if and only if, based on the Markov

assump-tion, they entail the same conditional independencies [Neapolitan, 2003].

Since Markov equivalent graphs represent the same set of independencies, when using Bayesian scoring functions, we will desire that the same score is assigned to all DAGs within the same Markov equivalence class. One such Bayesian score is the Bayesian Dirichlet equivalence (BDe) score.

(31)

CHAPTER 2. Parallel Globally Optimal Structure Learning of Bayesian Networks

2.0.9 Previous Work

A major difficulty in BN structure learning is the super exponential search space in the number of random variables. As reported by Robinson [Robinson,1973], for a set ofnvariables

there exist n!2 n 2(n−1)

r·zn possible DAGs, wherer ≈0.57436 andz ≈1.4881. To reduce the search space, marginalization over node orders has been proposed [Koivisto et al., 2004, Ott et al., Silander and Myllymaki,2006]. Here we briefly review the main idea.

Consider the graph N = (X, E) of a BN (N, P). Due to its DAG structure, the random

variables inX can be ordered in such a way that for each variableXi∈ X its parentsP a(Xi)

precede it in the specified ordering. Optimizing a scoring function for each variable Xi ∈ X

and all of its respective subsets of preceding elements allows us to discover the optimal network that respects the particular ordering. Further, to investigate all possible graphs we need to consider a total of n! orderings of X. This yields the search space of ordered subsets ofX of

size n!2n (see Figure 2.1). However, as long as P a(Xi) precedes Xi, the order among the

nodes in P a(Xi) is irrelevant. Thus, the search space can be reduced to the 2n unordered

subsets ofX.

Even reduced, the search space remains exponential and in practice, exact structure learn-ing without additional constraints has been achieved only for moderate size domains and at

(32)

() (1) (2) (3) (1,2) (1,3) (2,1) (2,3) (3,1) (3,2) (1,2,3) (1,3,2) (2,1,3) (2,3,1) (3,1,2) (3,2,1) {} {1} {2} {3} {1,2} {1,3} {2,3} {1,2,3}

Permutation Tree, O(n!2n₎ _{Subset Lattice, O(2}n₎

Figure 2.1 A permutation tree and its corresponding subset lattice.

high execution cost. While exact network structure learning, or globally optimal structure learning, has been shown to be NP-hard even with bounded node in-degree [Chickering et al., 1994], constructing exact Bayesian networks without additional assumptions remains valuable nevertheless.

2.0.10 Contributions

In this chapter, we present a parallel algorithm for exact Bayesian network structure learn-ing with optimal parallel efficiency on up toO _n1 ·2n processors regardless of the complexity

of the scoring function used. To assess the performance of our algorithms experimentally, we report results on two parallel systems – IBM Blue Gene/P and AMD Opteron cluster with InfiniBand interconnect. Experimental results demonstrate the validity of the theoretical predictions. To our knowledge, the presented work is the first parallel algorithm for exact learning.

The remainder of this chapter is organized as follows: In Section 2.1, we provide a brief overview of the Ott et al. sequential method for exact Bayesian network structure learning, upon which our parallel algorithm is based. Sections2.2and 2.3contain our parallel algorithm, and proofs of correctness and run-time and space complexity analysis, respectively. In

(33)

Sec-tion2.4we present scaling experiments on an IBM Blue Gene/P system and an AMD Opteron InfiniBand cluster by applying our algorithm to the problem of identifying gene regulatory networks, a problem that arises in systems biology.

2.1 Exact Structure Learning of Bayesian Networks

Given a set of observationsDn×m and a decomposable scoring function, the Bayesian net-work structure learning problem is to find a DAGN on thenrandom variables that optimizes

Score(N). Learning the DAG is equivalent to finding the parents of each node in the domain

under the acyclicity constraint.

While an exhaustive enumeration of all possible structures is prohibitive in space and time, the notion of ordering the variables (nodes) has been shown to be advantageous [Friedman and Koller, 2003]. Adopting a canonical representation of the DAGs in conjunction with a decomposable scoring function permits a dynamic programming approach, greatly reducing the search space. Exact structure learning algorithms exploit this [Ott et al., Silander and Myllymaki,2006]. Our parallel algorithm is based on the sequential method of Ottet al. [Ott et al.], which we briefly describe here.

Let Xi ∈ X and A⊆ X − {Xi}. Using Dn×m, a scoring function s(Xi, A) determines the score of choosingA to be the set of parents forXi; or equivalently,s(Xi, A) specifies how well

A predictsXi given Dn×m. For the moment, we do not go into details of how sis computed,

except to note that the cumulative sfunction computations could be the dominant factor in

run-time complexity.

(34)

that can be obtained by choosing parents of Xi from A; i.e.,

F(Xi, A) = max

B⊆As(Xi, B).

A set B which maximizes the above equation is an optimal parents set for Xi from the

candidate parents set A. While F(Xi, A) can be computed by directly evaluating all 2|A| subsets of A, it is advantageous to compute it based on the following recursive formulation.

For any non-empty set A,

F(Xi, A) = max            s(Xi, A) max Xj∈A F(Xi, A− {Xj}).

An ordering of a set A⊆ X is a permutation π(A) of the elements in A. A network on A

is said to be consistent with a given orderingπ(A) if and only if∀Xj ∈A, the parentsP a(Xj)

precedeXj inπ.

Definition 2.2 An optimal ordering π∗(A) is an ordering with which an optimal network on

A is consistent.

Let Xj be the last element in π∗(A). Then, the permutation π(A− {Xj}) obtained by

leaving out the last elementXj is consistent with an optimal network onA− {Xj}. We write

this as:

π∗(A) =π∗(A− {Xj})Xj.

(35)

of a network on A, which is consistent with π(A):

Q(A, π(A)) = X Xj∈A

F(Xj,{Xq|Xq precedes Xj in π(A)}).

Optimal score of a network on A⊆ X is then given by Q∗(A) =Q(A, π∗(A)). Our goal is to

find π∗(X) and the optimal network scoreQ∗(X), and learn the corresponding DAG.

Like the F function, functions π∗ and Q can be computed using a recursive formulation.

Consider a subsetA. To find π∗(A), we consider all possible choices for its last element:

X_j∗= argmax Xj∈A Q(A, π∗(A− {Xj})Xj) = argmax Xj∈A (F(Xj, A− {Xj}) +Q∗(A− {Xj})). Then, Q∗(A) =F X_j∗, A− X_j∗ +Q∗ A− X_j∗ , π∗(A) =π∗ A−X_j∗X_j∗.

By keeping track of the optimal parents sets for each of the variables in X, an optimal network is easily reconstructed.

Using these recursive formulations, a dynamic programming algorithm to compute a glob-ally optimal BN structure can be derived. The algorithm considers all subsets ofX in increas-ing size order, startincreas-ing from the empty subset. When considerincreas-ing A, the goal is to compute

F(Xi, A) for each Xi ∈/ A and Q(A, π∗(A− {Xj})Xj) for each Xj ∈ A. All the F and Q∗

values required for computing these have already been computed when considering subsets

(36)

{ } {2} {1,3} {1,2,3} {1} {3} {1,2} {2,3} 000 001 010 100 011 101 110 111

Figure 2.2 A lattice for a domain of size 3 and its partitioning into 4 levels, where the empty set is at level 0. The binary string labels on the right-hand side of each node show the correspondence with a 3-dimensional hypercube.

For a subset A, the F function computations (not including the sfunction computations

required within) take O((n− |A|)· |A|) time and the Q computations take O(|A|) time. The

time taken by s function computations depends upon the particular scoring function used.

Ignoring them for the moment, the rest of the computations take

O Σn_i₌₀(n−i+ 1)·i· n_i

=O(n2·2n) time.

2.2 Parallel Algorithm for Exact Structure Learning

To develop the parallel algorithm, it is helpful to visualize the dynamic programming algorithm as operating on the lattice L formed by the partial order “set inclusion” on the

power set ofX. Koivistoet al. [Koivisto et al.,2004] proved that the permutation tree over all possible orders collapses to the lattice. The latticeLis a directed graph (V, E), whereV = 2X and (B, A) ∈E if B ⊂ A and |A| =|B|+ 1. The lattice is naturally partitioned into levels,

(37)

be derived by mapping the nodes to processors and having edges represent communication if the incident nodes are assigned to different processors. A node A at level l has l incoming

edges from nodes A− {Xj}for eachXj ∈A, andn−l outgoing edges to nodes A∪ {Xi} for

eachXi∈/ A. FunctionsQ∗(A),π∗(A), and a total of (n−l)F functions are computed at node

A. All of these values need to be sent along each of the outgoing edges. On an outgoing edge

to node A∪ {Xi}, the F(Xi, A) value is used in computing Q∗(A∪ {Xi}), and the remaining

F(Xk, A) (Xk∈/AandXk6=Xi) values are used in computingF(Xk, A∪ {Xi}) values at node

A∪ {Xi}. Note that each of the (n−l) F values at A are used in computing the Q∗ value at one of the n−l nodes connected to A by outgoing edges. Each level in the lattice can be

computed concurrently, with data flowing from one level to the next.

2.2.1 Mapping to an n-dimensional Hypercube

Observe that the undirected version ofLis equivalent to ann-dimensional (n-D) hypercube.

Let (X1, X2, . . . , Xn) be an arbitrary ordering of the variables in the domain. A subsetA can

be represented by an n-bit string ω, where ω[j] = 1 if Xj ∈ A, and ω[j] = 0 otherwise. As

lattice edges connect pairs of nodes that differ in the presence of one element, they naturally correspond to hypercube edges (see Figure 2.2). This suggests an obvious parallelization on an n-D hypercube. While we expect the number of processors p 2n, we describe this

parallelization in slightly greater detail due to its use as a module in the development of our parallel algorithm.

The n-D hypercube algorithm runs in n+ 1 steps. Let ω denote the id of a processor

and let µ(ω) denote the number of 1’s in ω. Each processor is active in only one time step –

processor ω is active in time step µ(ω). It receives (n−µ(ω) + 1) F values and one Q∗ value

(38)

to zero. It then computes its own F and Q∗ values, and π∗, and sends them to each of its

n−µ(ω) neighbors obtained by inverting one of its zero bits to 1.

The run-time of stepiisO((n−i+1)·i), in addition to the cost of computing thesfunction

on a set of sizei, which depends on the application and type of scoring function used. Ignoring

the latter, the parallel run-time is O(Pn

i=0(n−i+ 1)·i) = O(n3), using 2n processors. As

the sequential run-time isO(n2·2n), the parallel efficiency is Θ(1/n).

2.2.2 Partitioning into k-dimensional Hypercubes

Let p= 2k be the number of processors, where k < n. We assume that the processors can

communicate as in a hypercube. This is true of parallel computers connected as a hypercube, hypercubic networks such as the butterfly, multistage interconnection networks such as the Omega, and point-to-point communication models such as the permutation network model or the MPI programming model. Our strategy is to decompose then-Dhypercube into 2n−kk-D

hypercubes and map each k-D hypercube to the p = 2k processors as described previously.

Although the efficiency of such a mapping by itself is suboptimal (Θ(1/k)), note that each

processor is active in only one time step and p 2n. Therefore, we pipeline the execution

of the k-D hypercubes to complete the parallel execution in 2n−k +k time steps such that

all processors are active except for the first k and last k time steps during the build up and

draining of the pipeline.

For convenience of presentation, we use the lattice Land then-Dhypercube

interchange-ably and useA to denote the lattice node for subsetA, and use ωAto denote then-bit binary

string denoting the corresponding hypercube node. We number the positions of then-bit

bi-nary string using integers 1 through n, with 1 denoting the leftmost position, and n denoting

(39)

i and j. We partition the n-Dhypercube into 2n−k k-D hypercubes based on the first n−k

bits of node ids. For a lattice node ωA, ωA[1, n−k] specifies the k-D hypercube it is part of

and ωA[n−k+ 1, n] specifies the processor it is assigned to. Conversely, a processor with id r

computes for all lattice nodes Asuch that r=ωA[n−k+ 1, n].

2.2.3 Pipelining Hypercubes

Eachk-Dhypercube is specified by an (n−k) bit string, which is the common prefix to the

2k lattice/k-Dhypercube nodes that are part of thisk-Dsub-hypercube. Thek-Dhypercubes

are processed in the increasing order of the number of 1’s in their bit string specifications, and in lexicographic order within the group of hypercubes with the same number of 1’s. Formally, we have the following:

Definition 2.4 Let Hi and Hj be two k-D hypercubes and let A and B be two nodes in

the lattice that map to Hi and Hj, respectively. Then, computation of Hi is initiated before

computation ofHj if and only if:

1. µ(ωA[1, n−k])< µ(ωB[1, n−k]), or

2. µ(ωA[1, n−k]) = µ(ωB[1, n−k]) and ωA[1, n−k] is lexicographically smaller than

ωB[1, n−k].

The above total order is used to initiate the 2n−k k-D hypercube evaluations, starting from time step 0. IfT denotes the time step in whichHi is initiated andAis a lattice node mapped

to Hi, then processor with id ωA[n−k+ 1, n] computes for the lattice node A in time step

(40)

2.3 Correctness and Complexity

2.3.1 Proof of Correctness

We now prove that when a lattice node is computed, nodes incident to incoming edges have already been computed in previous steps.

Lemma 2.1 Let(B, A)denote a directed lattice edge. Then, nodeB is computed at an earlier

time step than when nodeA is computed, and is either 1) local toA’s processor, or 2) resides

on a different processor but was mapped to the same k-dimensional hypercube as A.

Proof 1 (B, A) is a directed lattice edge, and hence B ⊂ A and |A| = |B|+ 1. Therefore,

ωA and ωB differ in one bit position, in which ωB has ‘0’ and ωA has ‘1’. It follows that

µ(ωA) =µ(ωB) + 1. Let cbe the position of the differing bit.

Case 1: 1≤c≤n−k

In this case, ωA[n−k+ 1, n] =ωB[n−k+ 1, n]. Thus, lattice nodes B andA are mapped

to the same processor. Asµ(ωB[1, n−k])< µ(ωA[1, n−k]),B is previously computed on the

same processor and is available, without requiring any communication, for computing A.

Case 2: n−k+ 1≤c≤n

In this case, ωA[1, n−k] =ωB[1, n−k], meaning both A and B are mapped to the same

(41)

Then,A is computed in time stepT+µ(ωA[n−k+ 1, n]) =T+µ(ωB[n−k+ 1, n]) + 1, one

time step after B is computed.

This proves that the algorithm correctly computes all of the lattice nodes, and hence derives the optimal Bayesian network.

2.3.2 Run-Time Complexity

The parallel algorithm initiates 2n−k k-D hypercubes, one per each time step starting

from 0. Each k-D hypercube takesk+ 1 time steps but successive hypercube evaluations are

pipelined. Thus, the total number of parallel time steps is given by 2n−k+k. Ignoring the

application-specific scoring function s, the computation of the F functions, Q∗, and the last

element of π∗ for a lattice node at level ltakes O(l·(n−l+ 1)), which is clearlyO(n2). Even with this relaxed analysis, the parallel run-time is O (2n−k+k)·n2

. Given the serial run-time is O(n2·2n), optimal parallel run-time isO(n2·2n−k), which is achieved by our parallel algorithm when 2n−k _{> k}_{. This can be restated as} _{n > k}_{+ log}_k_{, or} _p _{= 2}k ₌ _O 1

n·2n

. Thus, the parallel algorithm can utilize processors up to exponential in n without sacrificing

work-optimality.

The mapping strategy adopted by the parallel algorithm also reduces the number of com-munications fromntokfor each lattice node. Recall that a lattice node Aat levell receivesl

communications in computing itsF and Q∗ values, and subsequently communicates these val-ues ton−lneighbors, for a total of ncommunications. Each of these correspond to inverting

one of the bits in ωA. In case of p = 2k processors, inverting of one of the first (n−k) bits

results in a lattice node assigned to the same processor. Only bit inversions in one of the last

k bits requires communication with a neighbor. Thus, communication complexity reduces to

(42)

for the computation of the Q∗ and F values. In addition, it is highly likely that the scoring

function computation itself dominates theO(n2) compute time for the rest. Thus, this

dispar-ity in computation and communication complexdispar-ity in favor of relatively lower communication complexity should aid in achieving good scaling in practice.

We now consider the cost of computings(Xi, A) for all subsetsA, and for eachXi∈/ A. The

cost of computing s is particular to the application domain and the specific scoring function

used. However, s(Xi, A) is computed from the m observations of the random variables in A

and the random variable Xi; thus, the run-time for computing s(Xi, A) is Ω(m|A|). In the

analysis that follows, we consider the scoring function based on the MDL principle, which was utilized in this work, for which computings(Xi, A) takes O(m|A|) time. Serially, for a subset

A, theF computations takeO((n− |A|)·(|A|+m|A|)) time and theQcomputations still take

O(|A|) time. The cost of computingF functions,Q∗, and the last element ofπ∗ for all subsets

A is given by O n X i=0 n i ·((n−i)·(i+m·i) +i) ! =O m·n2·2n , where|A|=i.

In parallel, as described above, the number of parallel time steps is given by 2n−k+k, and

the computation ofQ∗,F, and the last element ofπ∗for a subsetAnow takes (n− |A|)·(|A|+m|A|) +|A|. Thus, the total cost is given by

O   2n−k+k X t=0 ((n− |A|)·(|A|+m|A|) +|A|)  =O m·n2·2n−k .

Note that no communication is involved in the computation ofsfunctions, which further widens

(43)

of the algorithm.

2.3.3 Space Complexity

At level l ∈[0, n] of the subset lattice, for each of the n_l subsets of sizel, the sequential

algorithm stores (n−l)F scores and an optimal orderingπ∗ of sizel. Thus a total of

n X l=0 n l ·n

elements are stored. As values computed at one level of the lattice are used only at the next level, storing only two levels at a time is sufficient. The maximum number of elements stored at a level occurs whenl=dn₂e, which using Stirling’s formula approximates to

q 2 π · 1 √ n ·2 n_·_n_.

In the parallel algorithm, suppose that processor with id r is computing for a lattice node

corresponding to a subset A. Let i=µ(ωA[1, n−k]). At this node, n−i−µ(r) F functions

and the corresponding π∗(A) of size|A|=i+µ(r) are computed and stored. Each processor stores computed results for n−_ik

subsets of sizei+µ(r), for a total of

n−k X i=0 n−k i ·n

elements. As before, the maximum storage needed at the same time is given by the mum value of the summation of two consecutive terms in the above summation. The maxi-mum value of a term occurs when i=n−₂k, which using Stirling’s formula is approximately

q 2 π · 1 √ n−k ·2

n−k_·_n_{. Thus, the ratio serial/parallel of the maximum number of stored elements}

yields √ n−k √ n ·2 k_≥ _√1 2 ·2 k_, _for_k_≤ n 2.

(44)

Therefore, the storage per processor used by the parallel algorithm is within a factor of √2≈ 1.41 of the optimal.

2.4 Experimental Results

We developed an implementation of our parallel algorithm in C++ and MPI. To assess its scalability, we used the resulting code to run a series of tests on an IBM Blue Gene/P system and an AMD Opteron InfiniBand cluster. The Blue Gene/P system is comprised of four 850 MHz PowerPC 450 cores and 2 GB main memory per node (512 MB per PowerPC core). All experiments were run with one MPI process per core. For the AMD cluster, we used the TACC Ranger system on the TeraGrid with each node consisting of four AMD Opteron 2.3 GHz quad-core processors and 32 GB memory per node (2 GB per core). All experiments were run with one MPI process per core, with all 16 cores per node sharing the communication channel.

We used the SynTReN package [Van den Bulcke et al., 2006] that synthetically generates observations for the specified number of genes. This easily allows us to change the number of variables, and the number of observations to test the performance with various data sets. As an optimization criterion, we implemented the Minimum Description Length (MDL) score [Lam and Bacchus, 1994], which can be evaluated in linear time with respect to the number of observations in the input data set (reviewed in Section1.2.3). In terms of the earlier discussion on run-time complexity, the run-time of M DL(Xi, A) is O(m|A|), meeting the lower bound

for any scoring function. As scalability improves with higher run-time complexity of the scoring function, choosing a scoring function with the lowest possible run-time complexity also provides the best way to demonstrate the scalability of the algorithm. The quality of the

(45)

Table 2.1 Run-times in seconds on the IBM Blue Gene/P (left) and the AMD Opteron InfiniBand cluster (right) for test data with

n= 24 genes. The number of observations is denoted by m.

No. Cores IBM Blue Gene/P AMD InfiniBand Cluster

m= 200 m= 1,000 m= 200 m= 1,000 32 1127 4587 209 876 64 578 2590 112 446 128 274 1280 69 229 256 133 641 38.54 119 512 65 319 24.1 65 1024 33 159 14.83 36.39 2048 16 79 8.69 18.43

learned networks is evaluated extensively in Section 2.5.1.

2.4.1 Performance Analysis for Varying Number of Observations

In the first set of experiments we measured how the number of observations influences the scalability of our solution, by fixing the number of genes. A smaller number of observations lowers the computational complexity, but leaves the communication complexity unaffected. This creates the possibility that the cost of the communication dominates, which can limit scalability. However, the observations need to grow as a function of the number of variables, to be able to derive a meaningful network. To run the experiments we used two data sets consisting ofn= 24 genes withm= 200 andm= 1,000 observations, respectively. Run-times

measured as a function of the number of processors and corresponding relative speedups on the IBM Blue Gene/P and the AMD Opteron InfiniBand cluster are presented in Table 2.1 and Figure2.3, respectively.

The execution time increases linearly with the number of observations, and at the same time our implementation on the Blue Gene/P maintains close to linear scalability up to 2,048

(46)

0 10 20 30 40 50 60 32 256 512 1024 2048 Relative Speedup Number of Cores m=200 m=1000 Linear Speedup

(a) Blue Gene/P

0 10 20 30 40 50 60 32 256 512 1024 2048 Relative Speedup Number of Cores m=200 m=1000 Linear Speedup

(b) AMD Opteron InfiniBand Cluster

Figure 2.3 Relative speedup as compared to a 32-core run for test data with n= 24 on (a)IBM Blue Gene/P and (b) AMD Opteron InfiniBand cluster.

with cache effects. These results indicate that computation is the dominant factor, even for small number of observations. The results on the AMD cluster are not as good due to the differences in the performance of the interconnection network. The Blue Gene/P network has low latency. In addition, 16 MPI processes running on one node of the AMD cluster share the same InfiniBand port, diluting the bandwidth available per core. The results for m = 1,000

on the AMD cluster exhibit much better scaling than for m = 200. This is due to the linear

increase in computation as a function ofm, while the communication remains unaffected.

2.4.2 Performance Analysis for Varying Number of Genes

In the second set of experiments we studied the performance of the parallel algorithm with respect to the number of input variables. We processed a collection of data sets with fixed number of observationsm= 1,000, and varying number of genesn, and recorded the resulting

run-times. Obtained results are summarized in Table2.2 and Figure2.4, respectively.

The run-times are reflective of the exponential dependence on n. Note that memory

(47)

Table 2.2 Run-time results in seconds on the IBM Blue Gene/P (left) and the AMD Opteron InfiniBand cluster (right) for test data with

m= 1,000 observations. The number of genes is denoted byn.

No. Cores IBM Blue Gene/P AMD InfiniBand Cluster

n= 16 n= 24 n= 16 n= 24 32 13.5 4587 2.24 876 64 8.32 2590 1.16 446 128 4.21 1280 0.63 229 256 2.16 641 0.33 119 512 1.13 319 0.18 65 1024 0.61 159 0.12 36.39 2048 0.36 79 0.08 18.43 0 10 20 30 40 50 60 32 256 512 1024 2048 Relative Speedup Number of Cores n=16 n=24 Linear Speedup

(a) Blue Gene/P

0 10 20 30 40 50 60 32 256 512 1024 2048 Relative Speedup Number of Cores n=16 n=24 Linear Speedup

(b) AMD Opteron InfiniBand Cluster

Figure 2.4 Relative speedup as compared to a 32-core run for the test data withm = 1,000 on (a) IBM Blue Gene/P and (b) AMD Opteron InfiniBand cluster.

(48)

throughout. Thus, going parallel is advantageous both from the perspective of run-time and the ability to solve larger problems. Using 2,048 cores, our implementation was able to learn the structure of a BN for 33 genes in 1 hour and 14 minutes. This task would require days if executed on a sequential computer [Ott et al.].

2.5 Biological Validation and Application to Arabidopsis Thaliana

Further we proceed to assess the quality of the learned networks using both synthetic and real biological gene expression data.

2.5.1 Synthetic Validation using SynTReN

Figure 2.5 Diagram of synthetic data generation procedure using

SynT ReN.

SynT ReN is a software package for synthetic gene expression data generation, which

al-lows us to control the number of genes (n) as well as observations (m) in the generated

(49)

Figure 2.6 Network 1 of the five synthetically generated gene networks of 24 genes. Three self-loops are present at nodes 4, 9, and 21, respectively.

(biological noise: 0.25, experimental noise: 0.01, input noise: 0.01, percent activators: 0.5,

cor-relation noise: 0.1, subnetwork selection: cluster addition). Self-loops were introduced in some

of the networks to test how they affect the quality of the learned networks (see Figure2.6). To assess performance over a range of parameters, we consider three different m values: m= 32,

128, and 512 (Figure 2.5). For each of the five networks, 10 datasets of sizes n = 24 and

respectively m = 32, 128, and 512 were generated, corresponding to data matrices: D24×32,

D24×128, andD24×512 (total of 5×3×10 = 150 datasets). Each data-matrix was subjected to

standardization and discretization to three levels corresponding to under-expressed (<−0.75),

over-expressed (≥ 0.75), and no change (remaining values). For each resulting discretized

data matrix a network was induced. Experiments were performed to compare correctness of networks learned using the MDL vs BDe scoring functions, the exact vs a heuristic Bayesian network structure learning method as implemented in the software packageBanjo[Hartemink], as well as the exact vs an information theoretic approach as implemented by the software tool TINGe [Zola et al., 2010]. The quality of the networks is also assessed as a function of the number of observations m.

(50)

Table 2.3 Exact structure learning with MDL score. Sample size is denoted by m, and for each m, lines 1 through 5 report for networks 1 through 5, respectively; −− indicate empty networks.

m= 32

T P F P T N F N SEN SP EC P REC F-score

– – – – – – – – 2.00 0.00 251.00 23.00 0.08 1.00 1.00 0.15 1.60 1.00 252.00 21.40 0.07 1.00 0.62 0.12 1.00 0.10 247.90 27.00 0.04 1.00 0.95 0.07 1.13 0.50 252.50 21.88 0.05 1.00 0.77 0.09 m= 128

1.00 1.67 251.33 22.00 0.04 0.99 0.39 0.08 3.50 2.30 248.70 21.50 0.14 0.99 0.62 0.23 2.80 1.70 251.30 20.20 0.12 0.99 0.64 0.20 2.00 3.30 244.70 26.00 0.07 0.99 0.38 0.12 3.70 0.80 252.20 19.30 0.16 1.00 0.82 0.27 m= 512

10.60 2.90 250.10 12.40 0.46 0.99 0.78 0.58

9.60 10.00 241.00 15.40 0.38 0.96 0.49 0.43

9.10 9.80 243.20 13.90 0.40 0.96 0.48 0.43

10.60 6.50 241.50 17.40 0.38 0.97 0.62 0.47

10.80 5.20 247.80 12.20 0.47 0.98 0.68 0.55

truth” networks generated by SynT ReN. Directionality can accurately be inferred only in

presence of intervention data (e.g. knocking out a gene, over-expressing a gene). Because

SynT ReN does not allow for the generation of such data, to avoid penalizing networks which

belong to the same Markov equivalent class, and therefore are indistinguishable by Bayesian scoring functions (reviewed in Sections1.2.3and 1.2.8), we ignore directions and compare the networks as undirected. We report true positives (TP), false positives (FP), true negatives (TN), false negatives (TN), and compute recall (sensitivity)

h T P T P+F N i , specificity h T N T N+F P i , precisionh_{T P}T P₊_{F P}i, and F-scoreh2·_precisionprecision₊·recall_recali. The results for each network (average over

(51)

Table 2.4 Exact structure learning with BDe score. Sample size is denoted by m, and for each m, lines 1 through 5 report for networks 1 through 5, respectively.

m= 32 Exact (BDe) Exact (MDL)

T P F P T N F N SEN SP EC P REC F-score F-score

7.5 37.2 215.8 15.5 0.33 0.85 0.17 0.22 – 9.70 34.1 216.9 15.3 0.39 0.86 0.22 0.28 0.15 6.50 38.5 214.5 16.5 0.28 0.85 0.14 0.19 0.12 8.30 38.9 209.1 19.7 0.3 0.84 0.18 0.22 0.07 9.10 31.6 221.4 13.9 0.4 0.88 0.23 0.29 0.09 m= 128

12.30 11 242 10.7 0.53 0.96 0.53 0.53 0.08 11.40 21.4 229.6 13.6 0.46 0.91 0.35 0.4 0.23 11.60 25.9 227.1 11.4 0.5 0.9 0.31 0.38 0.2 8.80 23.5 224.5 19.2 0.31 0.91 0.27 0.29 0.12 12.30 17.1 235.9 10.7 0.53 0.93 0.42 0.47 0.27 m= 512

16.40 10.9 242.1 6.6 0.71 0.96 0.6 0.65 0.58

19.40 20.1 230.9 5.6 0.78 0.92 0.49 0.6 0.43

18.40 23.7 229.3 4.6 0.8 0.91 0.44 0.57 0.43

16.40 25.6 222.4 11.6 0.59 0.9 0.39 0.47 0.47

19.10 16.8 236.2 3.9 0.83 0.93 0.53 0.65 0.55

2.5.1.1 Exact Learning: MDL vs BDe Scores

We compare the quality of networks learned using the MDL and BDe scores, and report the summary statistics in Tables 2.3 and 2.4, respectively. Our results indicate that sample size of m = 32 is insufficient for the MDL scoring function to make meaningful predictions.

In particular, for network 1 whenm= 32 the algorithm learned empty networks (all variables

were independent). In the case of m = 128, while the T P rates are low as compared to the

BDe results, theF P rates are also much lower than those of BDe networks. Overall the MDL

scoring function is more conservative than the BDe scoring function, in that it prefers graphs with fewer edges, as is expected in theory [Bouckaert,1995].

(52)

2.5.1.2 Heuristic (Banjo) vs Exact with BDe Score

We compare the quality of the networks learned using the exact algorithm to those learned using a heuristic algorithm. To generate the latter, we use a well-established software Banjo developed by Hartemink [Hartemink], which implements simulated annealing with BDe scoring function (equivalent sample size: 1, and maximum parents: 7). Without bounding the node in-degree our algorithm learns a network of 24 variables and 200 and 1000 observations in 8.69s and 18.43s, respectively (Table 2.1). We let the heuristic algorithm run for 30s in all

cases. Note that limiting the number of parents effectively reduces the search space of the heuristic algorithm (imposed by implementation). Since in the true networks, no variable has more than 3 parents, this limitation is favorable to the performance of the heuristic algorithm as it decreases the possible DAGs in the search space. The summary statistics computed by the exact algorithm and Banjo are given in Tables2.4 and 2.5, respectively.

A comparison of the average F-scores between the two algorithms for all networks and

sample sizes is given in Figure 2.7. It can be seen that on average, although the results are close, the exact algorithm generally performs better than the heuristic algorithm, with the exception of the case form= 32 and network 1. We find this result interesting, as the average

BDe scores over the 10 samples computed by the exact algorithm andBanjo in this case were −620.25 and−624.30, respectively. This indicates that the exact algorithm does find a “more

optimal” than the heuristic algorithm network, w.r.t. to the BDe score. However, in this case, this does not depict the biological structure more closely. Recall that network 1 has three self-loops. We believe that these self-loops (Figure 2.6) are a possible cause for this outcome. To allow for modeling of loops different models (e.g. dynamic Bayesian networks), or a combination of models, could be considered.

(53)

Table 2.5 Heuristic structure learning with BDe score (Banjo). Sample size is denoted by m, and for each m, lines 1 through 5 report for networks 1 through 5, respectively.

m= 32 Banjo Exact (BDe)

7.56 34 219 15.44 0.33 0.87 0.18 0.23 0.22 8.44 35.44 215.56 16.56 0.34 0.86 0.19 0.24 0.28 7 42.89 210.11 16 0.3 0.83 0.14 0.19 0.19 8.1 40.9 207.1 19.9 0.29 0.84 0.16 0.21 0.22 9.2 30.8 222.2 13.8 0.4 0.88 0.23 0.29 0.29 m= 128

12 15 238 11 0.52 0.94 0.45 0.48 0.53 11.1 24.7 226.3 13.9 0.44 0.9 0.31 0.37 0.4 10.8 29.3 223.7 12.2 0.47 0.88 0.27 0.34 0.38 8.75 25.75 222.25 19.25 0.31 0.9 0.25 0.28 0.29 12.6 19.8 233.2 10.4 0.55 0.92 0.39 0.45 0.47 m= 512

16.5 14.7 238.3 6.5 0.72 0.94 0.53 0.61 0.65

19 24.1 226.9 6 0.76 0.9 0.44 0.56 0.6

18.8 27.1 225.9 4.2 0.82 0.89 0.41 0.55 0.57

15.8 30.1 217.9 12.2 0.56 0.88 0.34 0.43 0.47

(54)

0.1

0.2

0.3

0.4

0.5

0.6

0.7

32

128

512 F-Score

Sample Size m

Net 1 Exact

Net 1 Heuristic

Net 2 Exact

Net 2 Heuristic

Net 3 Exact

Net 3 Heuristic

Net 4 Exact

Net 4 Heuristic

Net 5 Exact

Net 5 Heuristic

Figure 2.7 Comparison of the average F-scores h2·_precisionprecision₊·recall_recaliover 10 samples for networks learned with the exact (solid lines) and heuristic (dashed lines) algorithms. The connection between the average score values are gives solely to illustrate the trend. Results for each network are color-coded in blue(network 1), red (network 2), yellow (network 3), green (network 4), and orange (network 5).