R E S E A R C H
Open Access
Simultaneous learning of instantaneous and
time-delayed genetic interactions using novel
information theoretic scoring technique
Nizamul Morshed
*, Madhu Chetty and Nguyen Xuan Vinh
Abstract
Background: Understanding gene interactions is a fundamental question in systems biology. Currently, modeling of gene regulations using the Bayesian Network (BN) formalism assumes that genes interact either instantaneously or with a certain amount of time delay. However in reality, biological regulations, both instantaneous and time-delayed, occur simultaneously. A framework that can detect and model both these two types of interactions simultaneously would represent gene regulatory networks more accurately.
Results: In this paper, we introduce a framework based on the Bayesian Network (BN) formalism that can represent both instantaneous and time-delayed interactions between genes simultaneously. A novel scoring metric having firm mathematical underpinnings is also proposed that, unlike other recent methods, can score both interactions
concurrently and takes into account the reality that multiple regulators can regulate a gene jointly, rather than in an isolated pair-wise manner. Further, a gene regulatory network (GRN) inference method employing an evolutionary search that makes use of the framework and the scoring metric is also presented.
Conclusion: By taking into consideration the biological fact that both instantaneous and time-delayed regulations can occur among genes, our approach models gene interactions with greater accuracy. The proposed framework is efficient and can be used to infer gene networks having multiple orders of instantaneous and time-delayed
regulations simultaneously. Experiments are carried out using three different synthetic networks (with three different mechanisms for generating synthetic data) as well as real life networks of Saccharomyces cerevisiae, E. coli and cyanobacteria gene expression data. The results show the effectiveness of our approach.
Background
In any biological system, various genetic interactions occur concurrently amongst different genes. While some genes interact almost instantaneously, other genes could have time delayed interactions (see Figure 1). From a bio-logical perspective, instantaneous regulations represent the scenarios where the effect of a change in the expres-sion level of a regulator gene is carried on to the regulated gene (almost) instantaneously. In such cases, the effect is reflected almost immediately in the regulated gene’s expression levela. On the other hand, in cases where regu-latory interactions are time-delayed, its effect will be seen on the regulated gene after a finite time delay.
*Correspondence: [email protected]
Gippsland School of Information Technology, Faculty of Information Technology, Monash University, Northways Road, Churchill, VIC 3842, Australia
Bayesian network and its extension, dynamic Bayesian network (DBN), has found significant applications in the modeling of genetic interactions [1,2]. To the best of our knowledge, prior works on inter and intra-slice con-nections in the dynamic probabilistic network formalism [3,4] have modelled a DBN using an initial network and a transition network employing the 1st-order Markov assumption, where the initial network exists only during the initial period of time and subsequently the dynamics is expressed using only the transition network. Realising that a d-th order DBN has variables replicated d times, a 1st-order DBN for this taskb is therefore usually
lim-ited to around 10 variables. Alternately, if a 2nd-order DBN model is chosen, it can mostly deal with 6-7 vari-ables [5]. Thus, prior works on DBNs were either unable to discover these two interactions simultaneously or were
© 2012 Morshed et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
A B D Time tn C A A D C D C B B Time tn-1 Time tn-2
Figure 1 Example of network structure with both instantaneous and time-delayed interactions.
unable to fully exploit its potential, thereby restricting studies to simpler network configurations. However, since our proposed approach does not replicate variables, we can study any complex network configuration without limitations on the number of nodes. Zou et al. [2], while highlighting the existence of both instantaneous and time-delayed interactions among genes while considering the parent-child relationships of a particular order, did not account for the regulatory effects of other parents (hav-ing different order of regulation than the current one) on that particular child. This is in violation of the biological reality that parents with various orders of regulation can jointly regulate a child. Our proposed method supports multiple parents to regulate a child simultaneously, with different orders of regulation. Moreover, the limitation of detecting genetic interactions like A ↔ B, which are prevalent in genetic networks [6], is also overcome with the proposed method. Experiments conducted using both synthetic and real-life GRNs show the effectiveness of our approach.
Results and discussion
We evaluate our proposed method by studying both syn-thetic networks and real-life biological networks of
Sac-charomyces cerevisiae (yeast), E. coli and cyanobacteria.
The overall accuracy of the inference method and correct-ness of the modeling approach is evaluated by four widely accepted performance measures given below. The terms, TP, FP, TN and FN, used in the following expressions respectively mean the number of true positives, number of false positives, number of true negatives and number of false negatives.
1. Sensitivity(Se): It measures the proportion of true connections which are correctly inferred. It is defined as follows:
Se= TP
TP+ FN (1)
2. Specificity (Sp): Specificity is defined by the following equation:
Sp= TN
TN+ FP (2)
3. Precision (Pr): Precision is proportional to the inferred connections which are correct. It is defined as follows:
Pr= TP
TP+ FP (3)
4. F-score (F): Biologically, a good reconstruction algorithm should infer as many correct arcs as possible, in addition to the criteria that most of the inferred arcs should be correct. The F-score measure is the harmonic mean ofSe and Pr [7] and represents a compromise between these two objectives:
F= 2 Pr Se
Pr+ Se (4)
Since our method uses discrete data for the statistical significance tests embedded in the scoring function, we applied the Persist [8] algorithm to discretize the data into 3 levels. The confidence level (α) is set to 0.9. We will use a local search in the DAG space with the classical oper-ators of arc addition, arc deletion and arc reversal. The starting point of the search is always an empty graph. The parameters for all the other methods that are used for comparison are set to their default values mentioned in their user manuals.
Synthetic network
Synthetic network using differential equation based models
For performing studies using synthetic networks, we gen-erated 3 random networks of size 10, 25 and 50 using the
genenetweaver tool [9]. This tool has been used to gener-ate in silico benchmarks in the DREAM (both DREAM3 [10] and DREAM4 [11]) challenge initiative. The tool is able to obtain biologically plausible network topologies (and also biologically plausible network dynamics) of a given size by extracting random sub-networks of
Saccha-romyces cerevisiae and E. Coli [9,12]. We used the tool
to generate time series data as in the DREAM4 challenge with ten different perturbations for each experiment. Ini-tial and final timestamps for the simulations were 0 and 1000, respectively, and the time step was 50. One of the objectives of this experiment was to test the usefulness of the proposed approach in the presence of noise in mRNA expression levels. Since microarray experiments can incur a wide range of noise levels depending on the technology, environment and the subject under study, we experi-mented under various noise levels that are likely to be
present in the expression data. To mimic a real-life noisy environment, as in [13,14], we added 5 different noise lev-els to the data samples (random Gaussian noise with zero mean and variance, σ2 = 0.0, 0.01, 0.02, 0.05, 0.10). The
performance, measured by the four performance mea-sures, corresponding to the three different sized networks is reported in Figure 2. Figure 2(A) shows the performance variation as a function of network size and noise level. The X-axes represent the noise levels while the Y-axes represent the corresponding performance measures (Se,
Sp, Pr, F). In Figure 2(B)-(D), we compare our approach
with three other methods, namely TDARACNE, BANJO and BNFinder (BDe and MDL) using the F-Score (results corresponding to other measures are available in Addi-tional file 1). It is evident from the results that there is no clear winner in all the cases. Some methods perform good in some cases, while others outperform it in other cases.
(A)
0 0.02 0.04 0.06 0.08 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Se 0 0.02 0.04 0.06 0.08 0.1 0.88 0.9 0.92 0.94 0.96 0.98 Sp 0 0.02 0.04 0.06 0.08 0.1 0.1 0.2 0.3 0.4 0.5 Pr 0 0.02 0.04 0.06 0.08 0.1 0.1 0.2 0.3 0.4 0.5 0.6 F Noise (variance) Noise (variance) Noise (variance) Noise (variance)(B)
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 10 gene F-Score 10 Noise (variance)(C)
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 25 gene Noise (variance) F-Score(D)
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 50 gene F-Score Noise (variance)Figure 2 Reconstruction of synthetic networks generated using differential equation based methods. (A) How performance of our method
varies with network size and noise (Red(*)-10 gene; Green(o)-25 gene; Blue(square)-50 gene). The X-axes represent the 5 levels of noise used, whereas the Y axes represent the corresponding performance measures (see text). (B)-(D) Comparison of performance with 3 other methods for the 10, 25 and 50-gene network. Red(+)-Proposed, Green(o)-BANJO, Blue(x)-BNFinder+BDe, Cyan(square)-BNFinder+MDL,
Figure 3 Yeast cell cycle sub-network [15].
However, it is clear that our proposed approach, albeit not always the best, it is always among the top performers and has consistently superior performance.
Probabilistic network of yeast
We use a sub-network from the yeast cell cycle, shown in Figure 3, taken from Husmeier et al. [15]. The net-work consists of 12 genes and 11 interactions. For each interaction, we randomly assigned a regulation order of 0, 1, 2 or 3. We used two different conditional proba-bilities for the interactions between the genes, namely, the noisy regulation according to a binomial distribution and the noisy XOR-style co-regulation. For the binomial distribution dependent noisy regulation, the parameters were set as follows: excitation: P(on|on) = 0.9, P(on|off) = 0.1; inhibition: P(on|on) = 0.1, P(on|off) = 0.9. For the noisy XOR-style co-regulation the parameters were set as: P(on|on, on) = P(on|off, off) = 0.1, P(on|on, off) = P(on|off, on) = 0.9 [15]. Eight confounder nodes were also added, resulting in the total number of nodes to be 20.
We used 30, 50 and 100 samples, generated 5 datasets in each case and compared our approach with two other
DBN based methods, namely BANJO [16] and BNFinder [17]. Since these methods detect only regulations of order 1, while calculating performance measures for these methods, we ignored the exact orders for the time-delayed interactions in the target network. We could not apply TDARACNE [7] to this network since the generated data has two levels of discrete values and TDARACNE returns error when applied to such discrete datasets. We show the results for this network in Table 1, where we observe that our method, coupled with a high precision, outperforms the other two in terms of both sensitivity and specificity. The F-score is also the best in all the cases. This points to the strength of our method in discovering complex inter-action scenarios where multiple regulators may jointly regulate target genes with varying time-delays.
Synthetic network of glucose homeostasis
In higher eukaryotes, glucose homeostasis is maintained via a complex system involving many organs and signaling mechanisms. The liver plays a crucial role in this system by storing glucose as glycogen when blood glucose levels are high, and releasing glucose into the bloodstream when
Table 1 Comparison of proposed method with BANJO and BNFinder on the yeast sub-network
N=30 N=50 N=100 Se Sp Pr F Se Sp Pr F Se Sp Pr F Proposed 0.62± 0.992± 0.57± 0.59± 0.80± 1.0± 0.79± 0.79± 0.82± 1.0± 0.76± 0.79± Method 0.12 0.0045 0.11 0.11 0.04 0.0 0.07 0.05 0.06 0.0 0.03 0.04 BNFinder 0.53± 0.996± 0.68± 0.59± 0.62± 0.997± 0.74± 0.67± 0.69± 0.997± 0.74± 0.72± +BDe 0.04 0.0006 0.02 0.02 0.04 0.0019 0.13 0.06 0.08 0.0007 0.06 0.07 BNFinder 0.51± 0.996± 0.63± 0.56± 0.60± 0.996± 0.68± 0.63± 0.65± 0.996± 0.69± 0.67± +MDL 0.08 0.0006 0.07 0.08 0.05 0.0022 0.15 0.09 0.0 0.0 0.04 0.02 BANJO 0.51± 0.987± 0.49± 0.46± 0.55± 0.993± 0.57± 0.55± 0.60± 0.995± 0.61± 0.61± 0.08 0.01 0.2 0.15 0.09 0.0049 0.23 0.16 0.08 0.0014 0.09 0.08
IPA CPA GPA CBA CBB FA0 FA3 HN1 CRP SB1 PG1 EBH ACC ALB ALK AA1 AC3 AAT FAS FB1 GDH GLK G6D G6P G6T GLS GL4 HK2 IP1 LCP PEP PF2 PYK TAT TFN
Figure 4 Synthetic network of glucose homeostasis.
blood glucose levels are low. To accomplish its task, the liver responds to circulating levels of hormones, mainly insulin, epinephrine, glucagon, and glucocorticoids [18].
Le et al. [18] conducted an extensive review of the literature regarding the biological components affect-ing perinatal glucose metabolism. Based on the study, a Bayesian Network model of glucose homeostasis contain-ing 35 nodes and 52 interactions (shown in Figure 4) was constructed. We used the model for generating datasets of varying size (50, 75 and 100 samples), having first and second-order regulations using the Bayes Net Tool-box [19]. The random multinomial CPDs used by this approach of data generation were obtained by sampling from a Dirichlet distribution with hyper-parameters cho-sen by the methodcdescribed in [20] with a corresponding Equivalent Sample Size (ESS) value of 10. The choice of this prior distribution for the conditional parame-ters ensures a reasonable level of dependence between d-connected variables in the generative structure [20].
We compare our method with the three other methods that were used previously for comparison, namely BANJO [16] and BNFinder [17](using BDe and MDL). While cal-culating performance measures for these methods, we ignored the exact orders for the time-delayed interactions in the target network. Similar to the probabilistic network of yeast, we could not apply TDARACNE for this network due to error occurring because TDARACNE is unable to cope with the discrete data. The results are shown in Table 2. We observe that, both in terms of specificity and
precision, our method outperforms others. The F-score is the highest in all the cases, indicating a good balance between sensitivity and precision.
Real-life biological data of saccharomyces cerevisiae (IRMA)
To validate our method with a real-life biological gene reg-ulatory network, we investigate a recent network reported in [21]. In that significant work, the authors built a net-work, called IRMA, of the yeast Saccharomyces cerevisiae [21]. They tested the transcription of network genes by culturing the cells in presence of galactose and glucose. The network is composed of five genes regulating each other; it is also negligibly affected by endogenous genes. It is one of the first attempts at building a reference data set having an accurately known target network [7]. There are two sets of gene profiles called Switch ON and Switch OFF for this network, each containing 16 and 21 time series data points, respectively. A ’simplified’ network, ignoring some internal protein level interactions, is also reported in [21]. To compare our reconstruction method, we con-sider 3 other methods, namely, TDARACNE [7], BANJO [16] and BNFinder [17].
IRMA ON dataset
The performance comparison amongst various method based on the ON dataset is shown in Table 3. We observe that our method clearly outperforms the others. There are no false predictions and precision is highest. The sensitivity and F-score measures are also very high.
Table 2 Comparison of proposed method with BANJO and BNFinder on the glucose homeostasis network
N=50 N=75 N=100 Se Sp Pr F Se Sp Pr F Se Sp Pr F Proposed 0.50 0.9812 0.54 0.52 0.46 0.9914 0.71 0.56 0.54 0.9906 0.72 0.62 Method BNFinder 0.48 0.9488 0.29 0.37 0.52 0.9506 0.32 0.39 0.56 0.9557 0.36 0.44 +BDe BNFinder 0.54 0.948 0.31 0.40 0.56 0.9395 0.29 0.38 0.54 0.9369 0.27 0.37 +MDL BANJO 0.52 0.97 0.44 0.47 0.48 0.9838 0.57 0.52 0.54 0.9881 0.67 0.60
Table 3 Performance comparison based on IRMA ON dataset
Original Network Simplified Network
Se Sp Pr F Se Sp Pr F Proposed Method 0.63 1.0 1.0 0.77 0.67 1.0 1.0 0.80 TDARACNE 0.63 0.88 0.71 0.67 0.67 0.90 0.80 0.73 BNFinder+BDe 0.13 0.82 0.25 0.17 0.17 0.80 0.33 0.22 BNFinder+MDL 0.13 0.82 0.25 0.17 0.17 0.80 0.33 0.22 BANJO 0.25 0.76 0.33 0.27 0.50 0.70 0.50 0.50
IRMA OFF dataset
Due to the lack of ’stimulus’, it is relatively difficult to reconstruct the exact network from the OFF dataset [7]. As a result, the overall performances of all the algorithms suffer to some extent. The comparison is shown in Table 4. Again, we observe that our method reconstructs the gene network with high precision. Specificity is also quite high, implying that the inference of false positives is low.
Yeast KEGG pathway reconstruction
In order to test the proposed method’s performance on yeast S. cerevisiae cell cycle, we selected a eleven gene net-work of the G1-phase: Cln3, Cdc28, Mbp1, Swi4, Clb6, Cdc6, Sic1, Swi6, Cln1, Cln2, Clb5. The data used was obtained from the cdc28 experiment of Spellman et al. [22]. In the later stage of the G1-phase, the Cln3-Cdc28 protein kinase complex activates two transcription fac-tors, MBF and SBF, and these promote the transcription of some genes important for budding and DNA synthesis [7,23]. Entry into the S-phase requires the activation of the protein kinase Cdc28p through binding with Clb5 or Clb6, and also the destruction of Sic1 [24]. Also, Swi4 becomes associated with Swi6 to form the SCB complex that acti-vates CLN1 and CLN2 in late G1. Mbp1 forms the MCB-binding factor complex with Swi6, which activates DNA synthesis genes and S-phase cyclin genes CLB5 and CLB6 in late G1 [7]. In budding yeast, commitment to DNA replication during the normal cell cycle requires degrada-tion of the cyclin-dependent kinase (CDK) inhibitor Sic1. The G1 cyclin-CDK complexes Cln1-Cdk1 and Cln2-Cdk1 initiate the process of Sic1 removal by directly catalyzing Sic1 phosphorylation at multiple sites [7,25].
In Figure 5(B)-(F), we report network graphs recon-structed by our proposed approach, TDARACNE, BNFinder(BDe and MDL) and BANJO. We also report the KEGG pathway [26] of the cell-cycle in yeast in 5(A). Since the ground truth for this network is not known, instead of applying performance measures as a means of determining network accuracy, we refer to the available correct interactions obtained from the KEGG pathway [26] and identify which of the predicted interactions are correct or otherwise. We observe from the results that our approach correctly identifies the regulation of SWI4-SWI6 and MBP1-SWI4-SWI6 complex by the CLN3-CDC28 complex. Also, the proposed approach infers that the SWI4-SWI6 complex regulates the CLN1-CLN2-CDC28 complex, which is correct. Two more interactions inferred by our approach (CLN1→CLN2 and CLB5-CLB6-CDC28→CDC6) are also correct based on the KEGG pathway. Overall we observe that none of the methods perform particularly well on this network. However, the number of correct predictions by our method (5) is higher than the other methods.
SOS DNA repair network of E. coli
We analyze the well-known SOS DNA repair network in
E. colias shown in Figure 6(A). This GRN is well known
for its responsibility of repairing the DNA if it gets dam-aged. It is the largest, most complex, and best understood DNA damage-inducible network to be characterized to date.
The expression of the genes in the SOS regulatory net-work is controlled by a complex circuitry which involves the RecA and LexA proteins [27]. Normally LexA acts
Table 4 Comparison based on IRMA OFF dataset
Original Network Simplified Network
Se Sp Pr F Se Sp Pr F Proposed Method 0.50 0.94 0.80 0.62 0.50 0.90 0.75 0.60 TDARACNE 0.60 - 0.37 0.46 0.75 - 0.50 0.60 BNFinder+BDe 0.13 0.82 0.25 0.17 0.33 0.80 0.50 0.40 BNFinder+MDL 0.13 0.82 0.25 0.17 0.33 0.80 0.50 0.40 BANJO 0.38 0.88 0.60 0.46 0.33 0.90 0.67 0.44
(A)
SIC1 CDC6(B)
SIC1 CDC6(C)
SIC1 CDC6(D)
SIC1 CDC6(E)
SIC1 CDC6(F)
SIC1 CDC6Figure 5 Reconstruction of Yeast KEGG Pathway [26]. (A) Target Network. (B) Network Inferred by proposed approach. (C) Network Inferred by
TDARACNE. (D) Network Inferred by BANJO. (E)Network Inferred by BNFinder+BDe. (F) Network Inferred by BNFinder+MDL.
as the master repressor of more than 20 genes, includ-ing lexA and recA genes. This repression is done by its binding to the interaction sites in the promoter regions of these genes. When DNA damage occurs, one of the SOS proteins, RecA, acts as a sensor. By binding to
single-stranded DNA, it becomes activated, senses the damage and mediates LexA autocleavage [27]. The drop in LexA levels in turn stops the repression of the SOS genes and activates them. When the damage has been repaired, the level of activated RecA drops and it stops mediating
(DNA Damage) recA recA* (Cleavage) lexA umuD ruvA uvrA recA ... uvrD lexA
umuD recA uvrA
uvrY ruvA polB uvrD lexA umuD recA uvrA uvrY ruvA polB uvrD lexA umuD recA uvrA uvrY ruvA polB uvrD lexA umuD recA uvrA uvrY ruvA polB uvrD lexA umuD recA uvrA uvrY ruvA polB
(A)
(B)
(C)
(D)
(E)
(F)
Figure 6 Reconstruction of SOS DNA Repair Network (A) Target Network. (B) Network Inferred by proposed approach. (C) Network Inferred by
LexA autocleavage. LexA level in turn increases, starting repression of the SOS genes, and the cell then returns to its normal state.
The expression data sets of the SOS DNA repair sys-tem were obtained from Uri Alon Lab [28]. These data are expression kinetics of 8 genes namely uvrD, lexA, umuD, recA, uvrA, uvrY, ruvA and polB. Four experiments were done for various UV light intensities (Exp. 1 and 2:5Jm−2, Exp. 3 and 4:20Jm−2). In each experiment, the above 8 genes were monitored at 50 instants which are evenly spaced by 6 minutes intervals.
The results corresponding to Experiment 1 is pre-sented in Figure 6(B). Along with our result, we include the results from BANJO, TDARACNE and BNFinder in Figure 6(C)-(F) and the target network in 6(A). The results corresponding to the other experiments are available in Additional file 2, Additional file 3, Additional file 4, Addi-tional file 5 and AddiAddi-tional file 6. From the results, we observe that our method correctly identifies lexA and recA as the ’hub’ genes for this network. Again, the exact ground truth for this network is not precisely known, and hence it is not possible to calculate the well-known per-formance measures. Instead, using the known interactions obtained from literature [13,14], an analysis of correct and incorrect predictions by our method is obtained and shown in Table 5. We observe that most of the inter-actions inferred by our proposed method are correct. It successfully infers lexA as the regulator of uvrA, uvrD, umuD and recA. Also, considering the indirect regulation of RecA through LexA, two more interactions, namely recA→uvrY and recA→polB can also be considered cor-rect. In contrast, 3 of the 5 identified interactions by TDARACNE [7] are correct. Most of the interactions identified by BANJO and BNFinder+MDL are incorrect. BNFinder+BDe successfully identifies regulation of ruvA, polB and uvrA by lexA. In addition, the regulation of umuD by recA can also be considered correct. However, compared to these methods, our proposed method infers
Table 5 Analysis of individual interactions inferred by proposed method
correct/
Regulator Target incorrect
LexA
uvrD correct
umuD correct
recA correct
uvrA correct
RecA uvrY correct
a
polB correcta
uvrD ruvA incorrect
acorrect considering indirect regulation of RecA through LexA
the highest number of correct predictions. Number of incorrect predictions is also very low for our method.
Network analysis of strongly cycling genes in cyanobacteria, Cyanothece sp. ATCC 51142
To study our approach on a large scale network, we use a network of a strain of cyanobacteria, namely Cyanothece sp. strain ATCC 51142 [29]. The microarray data corre-sponding to the genes were collected from two publicly available genome-wide microarray data sets of
Cyanoth-ece, performed in alternating light-dark (LD) cycles with samples collected every 4h over a 48h period: the first one starting with 1h into dark period followed by two DL cycles (DLDL), and the second one starting with two hours into light period, followed by one LD and one continuous LL cycle (LDLL) [30]. In total, there were 24 samples. Using a threshold filter with a 2-fold change cutoff, 730 genes were selected for the analy-sis. The genes are responsible for performing the major tasks of energy metabolism and respiration, nitrogen fix-ation, protein translation and folding, and photosynthe-sis, along with several other tasks. Result obtained using our method is shown in Figure 7. The degree distribu-tion is shown in Figure 8. To compare our result with the other methods, we applied BANJO, BNFinder(BDe and MDL) and TDARACNE. The results of all the three except BNFinder(BDe) was not satisfactory. As a result, we compare our method only with BNFinder+BDe.
Similar to other large scale datasets (e.g. the Human HeLa cell data [31], the Arabidopsis L. Heynth dataset [32]), the microarray data set for cyanobacteria also has very few samples. Moreover, being not a well-studied organism, it requires caution in the interpretation of results. We note that GRN reconstruction studies of cyanobacteria reported earlier (e.g. [29,33,34]) commonly emphasise an evaluation criteria, namely “functional enrichment” analysis of sub-networks. Further, another common feature noted for genetic networks [35-37] is that transcriptional regulatory networks possess the scale free nature of the network topologyd. Since we have limited samples and also because the ground truth is unknown, we have carried out the evaluation of the inferred network using both: (i) statistical means i.e. GO functional enrich-ment analysis (using both p = 0.05 and p = 0.10), and (ii) the R2measure of the power-law fit of the network to establish its scale-free nature.
The enrichment analysis was done by using gene ontol-ogy (GO) database (compiled using two sources: one from the Cyanobase database [38], and another from genome-wide amino sequence matching using the Blast2GO soft-ware suite [39]; the the compiled database is available in Additional file 7), where every GO terms appearing in each sub-network is assessed to find out whether a cer-tain functional category is significantly over-represented
cce_0686 cce_0669 cce_3268 cce_3963 cce_3274 cce_1854 cce_3678
cce_2224cce_1463 cce_4022
cce_4307 cce_2424 cce_2295 cce_0537 cce_3685 cce_3835 cce_1372 cce_1891 cce_0304 cce_3377 cce_3518 cce_3127 cce_3271 cce_4154 cce_3272 cce_0477 cce_1473 cce_0872 cce_0871 cce_4032 cce_0867 cce_3261 cce_2982 cce_0093 cce_2839 cce_2729 cce_4436 cce_4486 cce_2900 cce_0579 cce_2603 cce_0649 cce_0685 cce_0247 cce_0482 cce_4757 cce_1433 cce_1064 cce_2902 cce_1961 cce_0605 cce_1282 cce_3676 cce_1232 cce_1747 cce_3084 cce_1590 cce_3543 cce_4758 cce_3654 cce_3594 cce_0333 cce_0032 cce_2325 cce_2329cce_0116 cce_3626cce_3092cce_0138cce_0636
cce_2536 cce_0999 cce_3741 cce_0791 cce_0580 cce_3025 cce_0804 cce_3251 cce_0565 cce_2544 cce_1129 cce_2324 cce_0168 cce_1403 cce_0632 cce_4658 cce_2614 cce_2079 cce_4400 cce_2052 cce_3398 cce_2611 cce_0668 cce_2344 cce_3436 cce_1538 cce_4194 cce_3673 cce_3936 cce_2214 cce_1611 cce_3522 cce_0319 cce_3584 cce_2812 cce_4283 cce_2084 cce_2288 cce_2658 cce_1005 cce_4023 cce_3638 cce_0660 cce_0887 cce_1559 cce_2639 cce_1482 cce_4018 cce_4195 cce_0262 cce_1513 cce_1581 cce_4129 cce_0302cce_3586 cce_4096 cce_4325 cce_4273 cce_2371 cce_4324 cce_1748 cce_5302 cce_3814 cce_1615 cce_5288 cce_2966 cce_3503 cce_2840 cce_3149 cce_0351 cce_0446 cce_2467 cce_5224 cce_0223 cce_0403 cce_1308 cce_1259 cce_3319 cce_1480 cce_0445 cce_4627 cce_0970 cce_4432 cce_2062 cce_3941 cce_4340 cce_3445 cce_0835 cce_1746 cce_2072 cce_0258 cce_1539 cce_4489 cce_1234 cce_4299 cce_3222cce_2446 cce_3179 cce_3672 cce_3760 cce_2258 cce_1540 cce_1931cce_0997
cce_0233 cce_0584 cce_2610 cce_4205 cce_1532 cce_0604 cce_0172 cce_0303 cce_2776 cce_3756 cce_4450 cce_3288 cce_0420 cce_4282 cce_2909 cce_1315 cce_3606 cce_1114 cce_1978 cce_1277 cce_1400 cce_0103 cce_0259 cce_0994 cce_3636 cce_1589 cce_2939 cce_2589 cce_2157
cce_1857 cce_4343 cce_0750 cce_4484 cce_3358 cce_4826 cce_3935 cce_1856 cce_3209 cce_2071 cce_1775 cce_3501 cce_2134 cce_3870 cce_4227 cce_1270 cce_1828 cce_3950 cce_3517cce_1982 cce_0375 cce_2000 cce_0536 cce_5039 cce_4228 cce_4666 cce_0379 cce_3607 cce_2903 cce_2588 cce_1800 cce_0306 cce_0788 cce_3527 cce_1523 cce_0078 cce_1472 cce_0155 cce_1432 cce_1880 cce_0313 cce_1383cce_3028 cce_1116 cce_3423 cce_2306 cce_0528 cce_2602 cce_3734 cce_0736 cce_0720 cce_3233 cce_3176 cce_3583 cce_2731 cce_4020 cce_3633 cce_1606 cce_0043 cce_2223 cce_1837 cce_3378 cce_2268 cce_4104 cce_1365 cce_3446 cce_0892 cce_1474 cce_2937 cce_3934 cce_3299 cce_1305 cce_0525 cce_2307 cce_0367cce_3083 cce_0291 cce_2566 cce_0990 cce_2825 cce_3944 cce_2426 cce_1464 cce_2080 cce_0875 cce_3961 cce_4142 cce_4518 cce_1977 cce_1063 cce_4030 cce_2038 cce_4599 cce_1569 cce_4540 cce_4039 cce_3507 cce_1290 cce_0073 cce_0843 cce_3972 cce_0628 cce_3746cce_0798 cce_3612 cce_2771 cce_4024 cce_4029 cce_4031 cce_0561 cce_1942 cce_0033 cce_1364 cce_0432 cce_0119 cce_1303 cce_0901 cce_3422 cce_3107 cce_4040 cce_1309 cce_1117 cce_2655 cce_4015 cce_1983 cce_0709 cce_0601 cce_0707 cce_0841 cce_2226 cce_4391 cce_0568 cce_2911 cce_3591 cce_1689 cce_4663 cce_2654 cce_0554 cce_3693 cce_2176 cce_1524 cce_0029 cce_0320 cce_1898 cce_3927 cce_0558 cce_0135 cce_4141 cce_0563 cce_0555 cce_4353 cce_0854 cce_1112 cce_4370 cce_3822 cce_1801 cce_0566 cce_0995 cce_4713 cce_0567 cce_2028 cce_0557 cce_1357 cce_1805 cce_3724 cce_1900 cce_3210 cce_1077 cce_1578 cce_3424 cce_0598 cce_4219 cce_2237 cce_3742 cce_3677 cce_4488 cce_4383 cce_1943 cce_3592 cce_1289 cce_0495 cce_2942 cce_4744 cce_2500 cce_1844 cce_3273 cce_3951 cce_0435 cce_1128 cce_4664 cce_3411 cce_1477 cce_1313 cce_3962 cce_4019 cce_4390 cce_2804 cce_2139 cce_2888 cce_4735 cce_0072 cce_3585 cce_0470 cce_0818 cce_2575 cce_2388 cce_1707 cce_4306 cce_1892 cce_0409 cce_4155 cce_2708 cce_2750 cce_2081 cce_1620 cce_2754 cce_1810 cce_1736 cce_0704 cce_1294 cce_0317 cce_0734 cce_0989 cce_2156 cce_0755 cce_1918 cce_1579 cce_3572 cce_3187 cce_2703 cce_3230 cce_0545 cce_2880 cce_3234 cce_3833 cce_4437 cce_3165 cce_1580 cce_1059 cce_4202 cce_3337 cce_2887 cce_4897 cce_1855cce_2604 cce_0248 cce_4667 cce_4697 cce_2423 cce_1053 cce_4413 cce_3101 cce_1182 cce_0844 cce_0527 cce_0516 cce_0564 cce_2907 cce_3161 cce_3834 cce_0608 cce_1975 cce_1976 cce_3141 cce_4451 cce_0232 cce_0546 cce_1307 cce_2955 cce_1947 cce_2111 cce_4671 cce_1649 cce_1347 cce_3280 cce_4355 cce_1690 cce_0810 cce_0339 cce_0549 cce_2563 cce_2330 cce_1607 cce_0260 cce_0364 cce_2494 cce_0569 cce_3702 cce_0560 cce_3910 cce_0550 cce_1231 cce_1724 cce_0559 cce_3029 cce_4516 cce_4329 cce_1466 cce_0553 cce_0836 cce_2552 cce_0556 cce_3100 cce_0969 cce_0570 cce_4379 cce_0205 cce_4571 cce_1827 cce_0252 cce_0526 cce_1672 cce_3687 cce_1169 cce_1462 cce_4153 cce_0094 cce_0578 cce_2313 cce_2717 cce_1749 cce_4748 cce_0987 cce_4003 cce_2408 cce_0576 cce_3653 cce_0459 cce_3652 cce_1644 cce_1123 cce_3188 cce_1659 cce_2663 cce_2943 cce_2124 cce_3263 cce_3225 cce_0099 cce_4025 cce_1344 cce_0586 cce_2657 cce_4027 cce_0224 cce_0552 cce_1789 cce_3164 cce_2514 cce_2225 cce_1944 cce_1678 cce_3571 cce_4354 cce_4388 cce_0953 cce_0809 cce_0574 cce_4304 cce_1629 cce_2653 cce_0562 cce_0572 cce_2637 cce_4314 cce_0577 cce_0575 cce_1461 cce_4510 cce_4049 cce_3194 cce_4150 cce_0659 cce_0811 cce_2496 cce_1337 cce_4222 cce_4487 cce_3394 cce_0603 cce_5143 cce_0225 cce_0589 cce_1122 cce_5166 cce_0571 cce_0161 cce_1808 cce_1998 cce_2573 cce_1692 cce_3366 cce_2051 cce_4117 cce_3974 cce_4281cce_3253 cce_0992 cce_2312 cce_4280 cce_3252 cce_2652 cce_3094 cce_2449 cce_3208 cce_2122 cce_0705 cce_4595 cce_3701 cce_2651 cce_2908 cce_2112 cce_5203 cce_1901 cce_1155 cce_2879 cce_4078 cce_0437 cce_3755 cce_0921 cce_1946 cce_2373 cce_1999 cce_3437 cce_4301 cce_3639 cce_2820 cce_3330 cce_3617 cce_4118 cce_1822 cce_2120 cce_3627 cce_0269 cce_4555 cce_2741 cce_1343 cce_4139 cce_1563 cce_3615cce_2958 cce_4490 cce_3964 cce_3146 cce_4017 cce_2959 cce_3396 cce_2472 cce_5172 cce_3104 cce_0643 cce_3945 cce_0860 cce_2257 cce_3477 cce_4730 cce_4122 cce_2702 cce_2248 cce_2572 cce_1809 cce_3357 cce_2813 cce_2535 cce_3897 cce_3898 cce_4095 cce_2070 cce_1034 cce_3464 cce_4358 cce_3246 cce_4547 cce_4687 cce_2195 cce_2699 cce_2236 cce_3635 cce_0902 cce_0436 cce_4412 cce_4452 cce_4453 cce_3383 cce_4454 cce_4103 cce_0666 cce_1843 cce_4680 cce_4021 cce_5153 cce_2453 cce_0842 cce_1341 cce_0547 cce_2370 cce_4006 cce_4446 cce_1081 cce_3103 cce_0305 cce_4698 cce_0183 cce_3166 cce_4580 cce_1547 cce_4482 cce_0590 cce_4013 cce_4483 cce_0298 cce_3625 cce_4610 cce_4395 cce_0548 cce_4485 cce_3256 cce_0861 cce_0044cce_3269 cce_0920 cce_0019 cce_0710 cce_4028 cce_2967 cce_2625 cce_0299
Figure 7 Network inferred by proposed approach.
in a certain sub-network/cluster, more than what would be expected by chance. The Cytoscape [40] plugin BiNGO [41] was used for GO functional category enrichment analysis. For BiNGO, we use the combined and filtered gene set as the reference set, the hypergeometric test as the test for functional over-representation, and False Discovery Rate (FDR) as the multiple hypothesis testing correction scheme.
First, we present the results corresponding to p= 0.05. The network obtained by BNFinder+BDe has 16 sub-networks each containing at least 3 genes. Of these, 6 sub-networks have significantly enriched functionalities (as determined by the GO functional enrichment test). Of
the other 10, we compute the 3 most densely connected hubs for each network, and in 2 of 10 such sub-networks, the hubs have defined significantly enriched functionalities. On the other hand, in our result, there are 14 sub-networks in total having at least 3 genes. Of these, 3 sub-networks have defined enriched functions (the largest sub-network has the role of nitrogen fixation according to the enrichment test). Of the other 11, we compute the 3 most densely connected hubs for each sub-network, and in 5 of the 11 such sub-networks, the hubs have defined significantly enriched functionalities.
The results corresponding to p= 0.10 show that for BNFinder+BDe, 7 sub-networks have enriched
Figure 8 Degree distribution analysis of the resultant network of cyanothece. We used a power law fit which yields R2= 0.93. The result
confirms that the inferred network is scale-free.
functionalities (as determined by the test). Of the other 9, we compute the 3 most densely connected hubs for each sub-network, and in 2 of the 9 such sub-networks, the hubs have defined enriched functionalities. On the con-trary, the result using our approach has 5 sub-networks with defined significantly enriched functions (the largest sub-network has the role of nitrogen fixation, similar to the p = 0.05 case). Of the other 9, we compute the 3 most densely connected hubs for each sub-network, and in 6 of the 9 such sub-networks, the hubs have defined significantly enriched functionalities.
We also test the networks to assess whether they are scale free, using a power law fit. The R2 value of the fit corresponding to our network is 0.93, which is a better fit compared to BNFinder+BDe (0.62).
Conclusion
In this paper, we propose a framework that can simulta-neously represent instantaneous and time-delayed genetic
interactions. The proposed scoring metric uses informa-tion theoretic quantities having not only relevant prop-erties but also implicitly includes the biological truth that some genes may jointly regulate other genes. Incor-porating these novel features, we have implemented a score+search based GRN reconstruction algorithm. Experiments have been performed on different synthetic networks of varying complexities and also on real-life biological networks. Our method shows improved perfor-mance compared to other recent methods, both in terms of reconstruction accuracy and number of false predic-tions and at the same time maintaining comparable or bet-ter true predictions. A natural extension of the described method can be incorporation of a-priori knowledge from sources like protein-protein interactions databases and fusing the knowledge with existing regulatory networks to make the inferred networks much more reliable, and we are pursuing this objective. Along with these exten-sions, the proposed approach would improve the accuracy
of gene regulatory network reconstruction and enhance research in systems biology.
Methods
The representational framework
Let us model a gene network containing n genes (denoted
by X1, X2. . ., Xn) with a corresponding microarray dataset
having N time points. A basic DBN-based GRN recon-struction method would try to find associations between genes Xi and Xj by taking into consideration the data xi1, . . . , xi(N−δ) and xj(1+δ), . . . , xjN or vice versa (small
case letters mean data values in the microarray), where 1≤ δ ≤ d. That is, it will take into consideration the d-th order Markov rule, for a gene having a maximum order of regulation d with its parents. This will effectively enable this model to capture at most d-step time delayed interac-tions. Conversely, a basic BN-based strategy would use the entire set of N time points and it will capture regulations that are effective instantaneously.
Now, to represent both instantaneous and multiple step time-delayed interactions, we consider an adjacency matrix based structure as shown in Figure 9. The zero entries in the figure denote no regulation. For the first n columns, the entries marked by 1 correspond to instanta-neous regulations whereas for the last n columns non-zero entries denote the order of regulation. As an example, the entry 1 in the cell (X1, X2)means X1has (almost)
instan-taneous regulatory effect on X2. Similarly, the entry d in
the cell (Xn, X
2)means Xnregulates X2with a d-step time
delay. Using this representation, we do not need to repli-cate layers of interactions for each increment in the order of regulations, making it efficient and particularly suitable for representing GRNs, where higher-order regulations is a common phenomenon.
Complications in the alignment of data samples can arise if the parents have different orders of regulation with the child node. To make this notion clear, we describe an example where we have already assessed the degree of interest in adding two parents (gene B and C, hav-ing third and first order regulations, respectively) to the gene under consideration, X. Now, we want to assess the degree of interest in adding gene A as a parent of X with a second order regulatory relationship, that is we want to computeeMI(X, A2|{B3, C1}), where superscripts on the
Figure 9 The adjacency matrix based approach for the representation.
parent variables denote the order of regulation it has with the child node.
There are two possibilities to consider. The first one cor-responds to the scenario where the time-series data is not periodic. In this case, we cannot use all the N samples for MI computation, rather we have to use (N− δ) samples where δ is the maximum order of regulation that the gene under consideration has, with its parent nodes (3 in this example). Figure 10 shows how the alignment of the sam-ples can be done for the current example. In the figure, we have N samples and since δ = 3, we can effectively use
(N− 3) samples.
The√symbol inside a cell denotes that this data sam-ple will be used for MI computation, whereas empty cells denote that these data samples will not be considered for computing the MI. Similar alignments will need to be done for the other case, where the data is considered to be periodic (e.g., datasets of yeast compiled by [42] show such cyclic behavior [43]). However, we can use all the N data samples in this case, where the data is shifted in a circular manner.
The interpretation of the results obtained from an algo-rithm that uses this framework can be done in a straight-forward manner. Using this framework and the aligned data samples, if we construct a network where we observe, for example, arc X1 → X
n having order δ, we conclude
that the inter-slice arc between X1 and Xn is inferred
and X1regulates Xn with a δ-step time-delay. Similarly,
if we find an arc X2 → Xn, we say that the intra-slice
arc between X2 and Xn is inferred and a change in the
expression level of X2will almost immediately effect the
expression level of Xn. To ensure consistency in the
result-ing Bayesian networks, the followresult-ing 3 conditions must also be satisfied:
1. The network must be a directed acyclic graph. 2. The inter-slice arcs must go in the correct direction
(no backward arc).
3. Interactions remain existent independent of time (stationarity assumption).
Our proposed scoring metric, CCIT
We share the same idea with MIT (Mutual Information Tests) [44] and MDL (the Minimum Description Length principle) for developing a scoring metric that can score
Figure 10 Sample points used for the calculation of the Mutual Information (MI).
both instantaneous and time-delayed interactions simul-taneously: to use the MI/log-likelihood measure between a node X, and its parents, Pa(X) for measuring the degree of association between them, and penalizing struc-tural complexity. The first part aims at minimizing the Kullback-Leibler (KL) divergence between the joint distri-bution corresponding to the original network (pD) and the
graph under consideration (pG), according to the
follow-ing equation: argmin G∈Gn KL(pD, pG)= argmax G∈Gn n i=1 PaG(Xi)=φ MI(Xi, PaG(Xi)) (5) which is equivalent to maximizing the log-likelihood (i.e., the higher the MI/log-likelihood score, the better is the network) [44]. In our approach, calculation of the MI/log-likelihood score is done in a manner which is similar to the approaches in MIT/MDL, with a major differ-ence: calculation of score (using MI/log-likelihood) in case of joint regulation. To make the notion clear consider Figure 1. Using MIT, the MI part for scoring for gene
BisfMI(B,{A0, D0}) + MI(B, C1)(similar calculations of log-likelihood will be used for MDL). As we can see, the calculation of MI/log-likelihood for the zero-order inter-actions do not take into account the parents who regulate it with time-delay. Unlike the approach in basic MIT and other approaches where zero and higher-order interac-tions are scored separately and then combined, in our approach, we also condition (during computation) on par-ents which have different orders of regulation with the target gene. The marginal probability for each node of this model thus becomes:
P(X[ t]|X[ t−1] , . . . , X[ t−d] )= n i=1 P(Xi[ t]|Pa(Xi[ t])) (6) The term Pa(Xi[ t] ) in the above equation represents
the parents of gene Xi at time t, which can be in the
same time-slice or in one of the d previous time-slices (d is the maximum order of regulation) of gene Xi at
time t. Thus, using our approach, the scoring function for B will calculate MI(B,{A0, D0} ∪ {C1}). Scoring in
this manner enables us to score both intra and inter-slice interactions simultaneously, rather than considering these two types of interactions in an isolated manner, mak-ing it specially suitable for problems like reconstructmak-ing GRNs, where occurrence of joint regulation is a common phenomenon.
The idea of penalizing complex structures is ubiqui-tous, finding its place in most of the scores like BIC, MIT and MDL. The penalization component for BIC and
MDL are global, whereas for MIT it is specific for each variable and its parents. Being local in nature, the MIT scheme usually outperforms the other two [44]. In this scheme, the localised penalization is based on a theorem of Kullback [45], which says that for a particular confi-dence level α, the quantity 2N.MI(Xi, Xj|Pa(Xi))− χα,lij
represents a statistical test of conditional independence, where lijis the degrees of freedom of a chi-squared
dis-tribution, and χα,lijis the statistical significance threshold.
The more positive the value is, the more likely is that Xi
and Xk are related (given the current parent set, Pa(Xi))
and vice-versa. Thus, adding up the MI quantities for all the genes (multiplied by 2*number of samples) and subtracting the corresponding local penalization mea-sures effectively constitute a series of conditional inde-pendence (CI) tests, and this scheme is used for scoring using MIT.
However, porting this idea of local penalization directly to a gene regulatory network which is cursed with dimen-sionality (there are a large number of variables (genes), but only a few samples are available), has the problem of over-penalization. This can be exemplified using Figure 1. The penalization component for gene B according to MIT, will be: χα,4+ χα,12+ χα,36, assuming the special case where
we have 3 levels of discrete data (the details of how these penalization components can be computed will be shown later). For a Bayesian network design having thousands of samples available, this penalization is not a problem. How-ever, but for GRN reconstruction with samples ranging between 20-50, this penalization is too high. To rem-edy this situation, we propose to apply the penalization only on a per-order of regulation basis. Using this modi-fied scheme, the penalization will be 2χα,4+ χα,12, which
constitutes considerable savings, thereby increasing better prediction ratio (in terms of sensitivity and specificity).
The approaches described above are summarised as a scoring metric, named CCIT (Combined Conditional Independence Tests) in Equation 7. The score, when applied to a graph G containing n genes (denoted by
X1, X2. . ., Xn), with a corresponding microarray dataset
D, can be expressed as:
SCCIT(G: D)= n i=1 Pa(Xi)=φ ⎧ ⎪ ⎨ ⎪ ⎩2Nδi.MI(Xi, Pa(Xi)) − δi k=0 (max σik sk i j=1 χα,liσk i(j)) ⎫ ⎪ ⎬ ⎪ ⎭ (7) Here ski denotes the number of parents of gene Xihaving
a k step time-delayed regulation and δi is the maximum
set of gene Xi, Pa(Xi)is the union of the parent sets of Xi
having zero time-delay (denoted by Pa0(X
i)), single-step
time-delay (denoted by Pa1(X
i)) and up to parents having
the maximum time-delay (δi). This is defined as follows: Pa(Xi)= Pa0(Xi)∪ Pa1(Xi)· · · ∪ Paδi(Xi) (8)
The number of effective data points, Nδi, depends on
whether the data can be considered to be showing peri-odic behavior or not (e.g., datasets from [42] can be considered as showing periodic behavior [43]). In the case of aperiodicity, Nδiis determined by subtracting, from the
total length of the time profile (N), the maximum order of the time-delay that the gene under consideration has with its parents (δi).
Nδi=
N if data is periodic
N− δi otherwise (9)
Finally, σik = (σik(1), . . . , σik(ski))denote any permuta-tion of the index set (1, . . . , ski)of the variables Pak(Xi)and liσk
i(j), the degrees of freedom, is defined as follows:
liσk i(j)= (ri− 1)(rσk i(j)− 1) j−1 m=1rσk i(m), for 2≤j≤s k i (ri− 1)(rσik(1)− 1), for j = 1 (10) where rpdenotes the number of possible values that gene Xpcan take (after discretization, if the data is continuous).
If the number of possible values that the genes can take is not the same for all the genes, the quantity σik denotes the permutation of the parent set Pak(Xi)where the first
parent gene has the highest number of possible values, the second gene has the second highest number of possible values and so on.
Some properties of CCIT Score
In this section we study several useful properties of the proposed scoring metric. The first among these is the decomposability property, which is especially useful for local search algorithms:
Proposition 1.CCIT is a decomposable scoring metric.
Proof.This result is evident as the scoring function is,
by definition, a sum of local scores.
Next, we show in Theorem 1 that CCIT takes joint regulation into account while scoring and it is different than three related approaches, namely MIT [44] applied to: a Bayesian Network (which we call MIT0); a dynamic
Bayesian Network (called MIT1); and also a naive
com-bination of these two, where the intra and inter-slice networks are scored independently (called MIT0+1). For
this, we make use of the decomposition property of MI, defined next:
Property 1.(Decomposition Property of MI) In a BN, if
Pa(Xi)is the parent set of a node Xi, and the cardinality of
the set is si, the following identity holds [44]: MI(Xi, Pa (Xi))= MI (Xi, Xi1) + si j=2 MIXi, Xij| Xi1, . . . , Xi(j−1) (11)
Theorem 1.CCIT scores intra and inter-slice arcs
con-currently, and is different from MIT0, MIT1and MIT0+1
since it takes into account the fact that multiple regula-tors may regulate a gene simultaneously, rather than in an isolated manner.
Proof.We prove by showing a counter example, using
the network in Figure 11. We apply our metric along with the three other techniques on the network, describe the
working procedure in all these cases to show that the pro-posed metric indeed scores them concurrently, and finally show the difference with the other three approaches. The network in Figure 11 has 4 interactions, 2 of these are instantaneous and 2 are time-delayed (with δ = 1). We assume a non-trivial case where the data is supposed to be periodic (the proof is trivial otherwise). Also, we assume that all the gene expressions were discretized to 3 quantization levels.
1. Application of MIT in a BN based framework:
sMIT0 = 2N.MI(B, {A
0, D0})−χ
α,4+ χα,12
(12) 2. Application of MIT in a DBN based framework:
sMIT1 = 2N{MI(B, C
1)+MI(A, D1)}−2χ α,4 (13)
3. A naive application of MIT in a combined BN and DBN based framework:
sMIT0+1= 2N{MI(B, {A0, D0}) + MI(B, C1)
+ MI(A, D1)} −3χ
α,4+ χα,12
(14) 4. Our proposed scoring metric:
sCCIT = 2N{MI(B, {A0, D0} ∪ {C1})
+ MI(A, D1)} −3χ
α,4+ χα,12 (15)
The concurrent scoring behavior of CCIT is evident from the first term in RHS of (15). Also, the inclusion of C in the parent set in the first term of the RHS of the equation exhibits the manner by which it achieves the objective of taking into account the biological fact that multiple regulators may regulate a gene jointly (the cal-culation, however, needs to be carried out in accordance with the process we described in the Methods Section).
Considering (12) and (13), it is also obvious that CCIT is different from both MIT0and MIT1. To show that CCIT
is different from MIT0+1, we consider (14) and (15). It suffices to consider whether MI(B,{A0, D0}) + MI(B, C1)
is different from MI(B,{A0, D0} ∪ {C1}). Using (11), this becomes equivalent to considering whether
MI(B,{A0, D0}|C1)is the same as MI(B,{A0, D0}), which are clearly inequal. This completes the proof.
Endnotes
aThe time-delay will always be greater than zero. However, if the delay is small
enough so that the regulated gene is effected before the next data sample is taken, it can be considered as an instantaneous interaction
ba tutorial can be found in http://www.cs.ubc.ca/∼murphyk/Software/BDAGL/
dbnDemo hus.htm
cThe method works as follows: for a variable Xiwith k states, a basis vector is
constructed for P(Xi|Pa(Xi))by normalizing the vector11,12,· · · ,1k. For the j-th instantiation pa(Xi) of Pa(Xi), samples are obtained for the probability corresponding to this instantiation by using θij∼ Dirichlet(sαij)where s is the equivalent sample size and the αij’s are obtained by shifting the basis vector to the right j places where j modulo k is not one.
dWe clarify that different processes including genetic networks will generate
scale free networks. However, if a network obtained using microarray data is
scale free, it indicates that it is modelling the underlying biological process more accurately
ein this paper, we use Mutual Information (MI)/log-likelihood based
Conditional Independence tests for analysis of regulatory interactions
fit should be noted here that MIT/MDL are basic scoring metric for BNs, which
can be extended to score both Static and Dynamic BNs separately. Here, we are discussing MIT/MDL applied to a network having both zero and higher-order interactions
Additional files
Additional file 1: Comparison of performance with 3 other methods for the 10, 25 and 50-gene differential equation based synthetic network.
Additional file 2: Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using BANJO.
Additional file 3: Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using BNFinder+BDe. Additional file 4: Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using BNFinder+MDL. Additional file 5: Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using the proposed approach. Additional file 6: Reconstruction of SOS DNA Repair Network in E. coli-Experiment 2, 3, 4; results obtained using TDARACNE. Additional file 7: The compiled gene ontology annotation database used for the functional enrichment analysis.
Competing interests
The authors declare that they have no competing interests. Acknowledgments
This research is a part of the larger project on genetic network modeling supported by Monash University and Australia-India Strategic Research Fund. Comments and suggestions from Assoc. Prof. Manzur Murshed from Gippsland School of IT is gratefully acknowledged.
Authors’ contributions
NM developed the algorithms and carried out the experiments. NM and NXV drafted the manuscript. MC and NXV suggested the biological data, experiments and provided biological insights on the results. MC provided overall supervision, direction and leadership to the research. All authors read and approved the final manuscript.
Received: December 17 2011 Accepted: June 6 2012 Published: June 12 2012
References
1. Ram R, Chetty M: A markov-blanket-based model for gene regulatory network inference. Comput Biol Bioinf, IEEE/ACM Trans 2011,
8(2):353–367.
2. Zou M, Conzen S: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21:71.
3. de Campos, C, Ji Q: Efficient structure learning of Bayesian networks using constraints. J Machine Learning Res 2011, 12:663–689. 4. Friedman N, Murphy K, Russell S: Learning the structure of dynamic
probabilistic networks: Morgan Kaufmann Publishers Inc.; 1998. 5. Eaton D, Murphy K: Bayesian structure learning using dynamic
programming and MCMC. Proceedings of the Twenty-Third Confererence on Uncertainty in Artificial Intelligence (UAI 2007) 2007.
6. Chaitankar V, Ghosh P, Perkins E, Gong P, Deng Y, Zhang C: A novel gene network inference algorithm using predictive minimum description length approach. BMC Syst Biol 2010, 4(Suppl 1):S7.
7. Zoppoli P, Morganella S, Ceccarelli M: TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinf 2010, 11:154. 8. M ¨orchen F, Ultsch A: Optimizing time series discretization for
international conference on Knowledge discovery and data mining. Chicago, IL, USA: ACM; 2005:660–665.
9. Schaffter T, Marbach D, Floreano D: GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics 2011, 27(16):2263–2270. Oxford Univ Press.
10. Prill R, Marbach D, Saez-Rodriguez J, Alexopoulos L, Sorger P, Xue X, Clarke N, Altan-Bonnet G, Stolovitzky G: Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PloS one 2010, 5(2):e9202.
11. Prill R, Saez-Rodriguez J, Alexopoulos L, Sorger P, Stolovitzky G: Crowdsourcing network inference: the DREAM predictive signaling network challenge. Science’s STKE 2011, 4(189):mr7.
12. Marbach D, Schaffter T, Mattiussi C, Floreano D: Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol 2009, 16(2):229–239. 13. Noman N, Iba H: Inferring gene regulatory networks using
differential evolution with local search heuristics. IEEE/ACM Trans Comput Biol Bioinf (TCBB) 2007, 4(4):634–647.
14. Kimura S, Ide K, Kashihara A, Kano M, Hatakeyama M, Masui R, Nakagawa N, Yokoyama S, Kuramitsu S, Konagaya A: Inference of S-system models of genetic networks using a cooperative coevolutionary algorithm. Bioinformatics 2005, 21(7):1154–1163. IEEE Computer Society Press. 15. Husmeier D: Sensitivity and specificity of inferring genetic regulatory
interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 2003, 19(17):2271.
16. Yu J, Smith V, Wang P, Hartemink A, Jarvis E: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 2004, 20(18):3594. 17. Wilczy ´nski B, Dojer N: BNFinder: exact and efficient method for
learning Bayesian networks. Bioinformatics 2009, 25(2):286. 18. Le P, Bahl A, Ungar L: Using prior knowledge to improve genetic
network reconstruction from microarray data. Silico Biol 2004, 4(3):335–353.
19. Murphy K, et al.: The bayes net toolbox for matlab. Comput Sci Stat 2001, 33(2):1024–1034.
20. Chickering D, Meek C: Finding optimal Bayesian networks. In Proc. UAI; 2002.
21. Cantone I, Marucci L, Iorio F, Ricci M, Belcastro V, Bansal M, Santini S, Di Bernardo M, Di Bernardo D, Cosma M: A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell 2009, 137:172–181.
22. Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9(12):3273–3297. 23. Cross F: Starting the cell cycle: what’s the point? Curr Opin Cell Biol
1995, 7(6):790–797.
24. Chun K, Goebl M: Mutational analysis of Cak1p, an essential protein kinase that regulates cell cycle progression. Mol Gen Genet MGG 1997, 256(4):365–375.
25. Sawarynski K, Kaplun A, Tzivion G, Brush G: Distinct activities of the related protein kinases Cdk1 and Ime2. Biochimica Et Biophysica Acta 2007, 1773(3):450–456.
26. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Res 2002, 30:42–46.
27. Kabir M, Noman N, Iba H: Reverse engineering gene regulatory network from microarray data using linear time-variant model. BMC Bioinf 2010, 11(Suppl 1):S56. BioMed Central Ltd.
28. Uri Alon’s SOS Dataset webpage. [http://www.weizmann.ac.il/mcb/ UriAlon/Papers/SOSData/].
29. St ¨ockel J, Welsh E, Liberton M, Kunnvakkam R, Aurora R, Pakrasi H: Global transcriptomic analysis of Cyanothece 51142 reveals robust diurnal oscillation of central metabolic processes. Proc Nat Acad Sci 2008, 105(16):6156.
30. Wang W, Ghosh B, Pakrasi H: Identification and modeling of genes with Diurnal Oscillations from microarray time series data. Comput Biol Bioinf, IEEE/ACM Trans 2011, 8:108–121.
31. Whitfield M, Sherlock G, Saldanha A, Murray J, Ball C, Alexander K, Matese J, Perou C, Hurt M, Brown P, et al.: Identification of genes periodically
expressed in the human cell cycle and their expression in tumors. Mol Biol Cell 2002, 13(6):1977–2000.
32. Yuan Y, Li C, Windram O: Directed partial correlation: inferring large-scale gene regulatory network through induced topology disruptions. PLoS One 2011, 6(4):e16835.
33. McDermott J, Oehmen C, McCue L, Hill E, Choi D, St ¨ockel J, Liberton M, Pakrasi H, Sherman L: A model of cyclic transcriptomic behavior in the cyanobacterium Cyanothece sp. ATCC 51142. Mol BioSyst 2011, 7(8):2407–2418.
34. Toepel J, Welsh E, Summerfield T, Pakrasi H, Sherman L: Differential transcriptional analysis of the cyanobacterium Cyanothece sp. strain ATCC 51142 during light-dark and continuous-light growth. J Bacteriol 2008, 190(11):3904–3913.
35. Jeong H, Tombor B, Albert R, Oltvai Z, Barab´asi A: The large-scale organization of metabolic networks. Nature 2000, 407(6804):651–654. 36. Jeong H, Mason S, Barabasi A, Oltvai Z: Lethality and centrality in
protein networks. Arxiv preprint cond-mat/0105306 2001. 37. Guelzim N, Bottani S, Bourgine P, K´ep`es F: Topological and causal
structure of the yeast transcriptional regulatory network. Nat Genet 2002, 31:60–63.
38. Kazusa DNA Research Institute: The cyanobacteria database. [http:// genome.kazusa.or.jp/cyanobase].
39. G ¨otz S, Garc´ıa-G ´omez J, Terol J, Williams T, Nagaraj S, Nueda M, Robles M, Tal ´on M, Dopazo J, Conesa A: High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res 2008, 36(10):3420–3435.
40. Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498–2504.
41. Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 2005, 21(16):3448–3449.
42. Cho R, Campbell M, Winzeler E, Steinmetz L, Conway A, Wodicka L, Wolfsberg T, Gabrielian A, Landsman D, et al.: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998, 2:65–73. 43. Xing Z, Wu D: Modeling multiple time units delayed gene regulatory
network using dynamic Bayesian network. In Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops: IEEE; 2006:190–195.
44. de Campos L: A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. J Machine Learning Res 2006, 7:2149–2187.
45. Kullback S: Information Theory and Statistics. Volume. 2nd edition, Vol. 1. New York: Dover Publications; 1968.
doi:10.1186/1752-0509-6-62
Cite this article as: Morshed et al.: Simultaneous learning of instantaneous
and time-delayed genetic interactions using novel information theoretic scoring technique. BMC Systems Biology 2012 6:62.
Submit your next manuscript to BioMed Central and take full advantage of:
• Convenient online submission
• Thorough peer review
• No space constraints or color figure charges
• Immediate publication on acceptance
• Inclusion in PubMed, CAS, Scopus and Google Scholar
• Research which is freely available for redistribution
Submit your manuscript at www.biomedcentral.com/submit