
1.3 Transcription regulation and gene regulatory networks

1.3.3 Searching for gene regulatory network models

1.3.3.3 Mining large-scale transcriptomic datasets for gene

Whilst it is possible to perform de novo inference of a regulatory network from a time course dataset featuring many genes, this procedure is likely to require a degree of pre-processing to be effective and computationally tractable. Common practice includes the identification of differentially expressed genes, reducing the gene list to transcription factors only in order to elucidate the core of the regulatory network, or performing clustering to group together genes with similar expression profiles (Penfold and Wild, 2011). Another limitation of the modelling is the sole reliance on mRNA levels, which are easily experimentally captured via microarrays (Schulze and Downward, 2001) or RNA-Seq (Wang et al., 2009), with the ability to experimentally quantify protein levels being limited in comparison (Miranda et al., 2007).

One class of models that can be used to infer regulatory networks from large-scale transcriptomic data is dynamic Bayesian networks, an example algorithm being Variational Bayesian State Space Modelling (VBSSM) (Beal et al., 2005). The underlying model structure of state space modelling (SSM) assumes the existence of a set of hidden states that drive the values of observed variables. For the purpose of this application, the observed variables are the expression profiles of the measured genes, whilst the hidden states capture all the effects that are missing from the experimental quantification, such as unmeasured genes or mRNA degradation. The basic SSM is expanded to include an additional driving input that affects the hidden state change between time points, and this input is set to be the gene expression at the previous time point. A simple transformation of the resulting equations allows for the identification of a single matrix capturing the interactions between every possible gene pair in the data, where interaction can be defined as transcriptional activation or repression based on transcript levels in adjacent time points of the parent (regulating) and child (regulated) gene. This matrix is iteratively estimated from the data in a process named the Variational Bayesian Expectation Maximisation (EM) Algorithm, which is a modification of the EM algorithm that treats the parameters as distributions rather than point estimates. Successive iterations aim to maximise a lower bound on the marginal likelihood, and the final model can be converted into a binary network by applying a z-score threshold to the final parameter matrix (Beal et al., 2005).
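The transformation that yields the single gene-gene matrix can be illustrated with a minimal sketch. It assumes the input-driven SSM is written as x_t = A x_{t-1} + B y_{t-1} + noise and y_t = C x_t + D y_{t-1} + noise, in which case substituting the first equation into the second gives a combined matrix CB + D acting on the previous expression values. The matrix names, the function names and the threshold of 1.96 are illustrative assumptions, not details taken from the text above, and no variational inference is performed here.

```python
import numpy as np

def interaction_matrix(C, B, D):
    """Gene-to-gene effects: via hidden states (C @ B) plus direct (D).

    Entry [i, j] is the effect of gene j at time t-1 on gene i at time t.
    """
    return C @ B + D

def threshold_network(post_mean, post_std, z=1.96):
    """Binary network: keep an edge when |mean / std| exceeds z (assumed cut-off)."""
    return np.abs(post_mean / post_std) > z

# Noise-free check of the substitution on random (hypothetical) parameters.
rng = np.random.default_rng(0)
n_genes, n_hidden = 4, 2
A = rng.normal(size=(n_hidden, n_hidden))
B = rng.normal(size=(n_hidden, n_genes))
C = rng.normal(size=(n_genes, n_hidden))
D = rng.normal(size=(n_genes, n_genes))
x_prev = rng.normal(size=n_hidden)
y_prev = rng.normal(size=n_genes)

x_now = A @ x_prev + B @ y_prev          # hidden state update with driving input
y_now = C @ x_now + D @ y_prev           # observed expression
M = interaction_matrix(C, B, D)
y_direct = C @ A @ x_prev + M @ y_prev   # same quantity after substitution

# Toy posterior summaries for a 2-gene network (made-up numbers).
edges = threshold_network(np.array([[0.0, 3.0], [-2.5, 0.1]]), np.ones((2, 2)))
```

In this toy case `edges` keeps the two off-diagonal entries, since only they exceed the assumed threshold in absolute value; negative entries of the matrix would correspond to repression and positive ones to activation.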

Another possible modelling approach is Causal Structure Inference (CSI) (Klemm, 2008; Penfold and Wild, 2011). Similar in nature to VBSSM, it infers connections as a relation between the expression of the parent at a given time point and the expression of the child at the succeeding time point. The algorithm relies on Gaussian processes to accomplish this task: given n possible parents in a given evaluated fit, an (n+1)-dimensional space is created, with n of the axes occupied by the time-shifted parents and one by the child, and a zero-mean Gaussian process prior is fitted to the data. The hyperparameters of the Gaussian process fits are tuned with the EM algorithm to maximise the marginal likelihood, defined as the sum of the likelihoods of all of the individual fits performed as part of CSI. Because the number of fits to evaluate grows rapidly with the number of candidate parents, a maximum indegree is introduced to preserve computational tractability, capping the number of parents evaluated for the child in any one fit. Upon completing the fits, each individual fit can be compared with the others by its computed marginal likelihood. The likelihoods are pooled together and scaled to produce a marginal distribution, which allows the extraction of the frequency with which any particular parent appears across the models (Penfold and Wild, 2011).
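The parent-set enumeration described above can be sketched as follows. This is a simplified illustration rather than the published CSI implementation: the Gaussian process hyperparameters are fixed instead of being tuned by EM, a squared-exponential kernel and a uniform prior over parent sets are assumed, the empty parent set is omitted, and all function names and the synthetic data are hypothetical.

```python
import itertools
import numpy as np

def gp_log_marginal_likelihood(X, y, length=1.0, signal_var=1.0, noise_var=0.01):
    """Log marginal likelihood of y under a zero-mean GP with an RBF kernel."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = signal_var * np.exp(-0.5 * sq_dist / length**2) + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def parent_probabilities(expr, child, max_indegree=2):
    """Marginal probability of each candidate parent across all scored parent sets."""
    n_genes = expr.shape[1]
    inputs, target = expr[:-1], child[1:]        # parents at t, child at t+1
    parent_sets = [s for k in range(1, max_indegree + 1)
                   for s in itertools.combinations(range(n_genes), k)]
    log_liks = np.array([gp_log_marginal_likelihood(inputs[:, list(s)], target)
                         for s in parent_sets])
    weights = np.exp(log_liks - log_liks.max())  # pool and scale the likelihoods
    weights /= weights.sum()
    probs = np.zeros(n_genes)
    for s, w in zip(parent_sets, weights):
        probs[list(s)] += w                      # add weight to every member of the set
    return probs

# Synthetic example: the child follows gene 0 with a one-step delay.
rng = np.random.default_rng(1)
expr = rng.normal(size=(25, 3))
child = np.zeros(25)
child[1:] = np.sin(expr[:-1, 0]) + 0.05 * rng.normal(size=24)
probs = parent_probabilities(expr, child, max_indegree=2)
```

With three candidate genes and a maximum indegree of two, only six parent sets are scored; without the cap the number of sets would grow exponentially with the number of candidates, which is the tractability issue the indegree limit addresses.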

Given the existence of numerous network inference algorithms (other options include methods using Granger causality), a comparison of the network reconstruction accuracy of the individual algorithms was performed (Penfold and Wild, 2011). The comparison used DREAM4 data, a collection of synthetic data reflecting the topology of known regulatory networks in Escherichia coli and Saccharomyces cerevisiae. The data collection features ten different networks, five with 10 genes and five with 100 genes, with expression data of 21 time points modelled using parameterised stochastic differential equations (Greenfield et al., 2010). The performance of the algorithms was assessed using the area under the ROC curve, which captures the changes of true and false positive rates across varying stringency thresholds, and the area under the precision-recall curve, which captures the changes of true positive rate and positive predictive value in a similar fashion. For the 10-gene networks, VBSSM outperformed all other options, including dynamic Bayesian networks with no hidden states, for three of the five networks, with CSI achieving the best results for the other two. For the 100-gene networks, CSI had the best scores for every network model tested (Penfold and Wild, 2011).
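The two evaluation measures can be illustrated with a short sketch that ranks predicted edges by confidence and sweeps the threshold down the ranking. It is not the scoring code used in the cited comparison: the function name is hypothetical, the scores are assumed to contain no ties, and the area under the precision-recall curve is approximated by average precision.

```python
import numpy as np

def roc_pr_areas(scores, truth):
    """Return (AUROC, AUPR) for edge scores against a 0/1 gold standard."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(truth)[order].astype(bool)
    tp = np.cumsum(hits)                     # true positives at each threshold
    fp = np.cumsum(~hits)                    # false positives at each threshold
    tpr = np.concatenate(([0.0], tp / hits.sum()))
    fpr = np.concatenate(([0.0], fp / (~hits).sum()))
    # Trapezoidal area under the ROC curve, starting from the origin.
    auroc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
    # Average precision: mean positive predictive value at each true edge.
    precision = tp / (tp + fp)
    aupr = precision[hits].mean()
    return auroc, aupr

# Six candidate edges, three of which are in the (made-up) true network.
auroc, aupr = roc_pr_areas([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 0, 0])
```

For this toy ranking the AUROC is 8/9 and the average precision is 11/12. For a full network, the predicted score matrix and the true adjacency matrix would be flattened to one entry per candidate edge before being passed to the function.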

An example of the application of such methodology to large-scale transcriptomic data can be found in the analysis performed on a high resolution (24 time point) time course of A. thaliana response to B. cinerea infection (Windram et al., 2012). The full data set features 30,336 microarray probes, with this number being far beyond the computational tractability scope of any network inference approach. The first step in decreasing the dimensionality of the data was the manually curated identification of differentially expressed genes with the aid of GP2S (Stegle et al., 2010), which reduced the number of genes to 9,838. Whilst this is a step in the right direction, the number of profiles was still too high for network inference algorithms to handle. As such, SplineCluster (Heard et al., 2005) was applied to the set of differentially expressed genes, resulting in 44 co-expressed gene clusters, which was sufficient to perform network inference. The clusters, along with a tubulin expression profile representing B. cinerea growth, were mined for an underlying network model with CSI (Klemm, 2008; Penfold and Wild, 2011). The resulting network structure was used to formulate a number of specific regulatory hypotheses by matching transcription factors from specific families placed directly upstream of clusters with known binding sites for that transcription factor family overrepresented in the genes’ promoters. The model also helped cement the novel role of TGA3 in necrotrophic pathogen response, subsequently experimentally validated with the aid of mutant lines, by placing its cluster (where it was one of only two transcription factors) as the only node upstream of the B. cinerea growth profile in the network model (Windram et al., 2012).

In document Finding network modules and motifs regulating plant stress responses: integration and modelling across multiple data sets (Page 52-54)