1.6 Analysis of Gene Expression Data
1.6.3 Identifying Network Topology Using Variational Bayesian State Space Modelling
Reverse engineering the underlying topology of the regulatory network of a set of genes is not a trivial task. Many published methods make the assumption that all the possible interacting elements have been observed and that they have been included in the gene set (Beal et al., 2005). In reality, microarrays do not provide complete information about the regulatory network because they:
• may be missing probes for some genes,
• provide noisy data for some probes,
• do not make observations of metabolites and hormones which may form part of the network; and,
• do not make observations of RNA and protein degradation which can result in altered response times.
This is compounded by the gene set selected for modelling often being identified by clustering methods, which are unlikely to associate every element of a given network. A proposed solution to this problem might be to include all observed genes to identify the overall network, but this approach has its own problems: the number of ordered gene pairs grows quadratically with the number of genes being modelled, and the number of candidate network topologies grows exponentially with the number of pairs. This leads to an unidentifiable model because the information about each gene is finite and restricted to the number of time points and replicates. As the number of genes to be modelled increases, so does the number of parameters to be estimated by the model. Since the number of data points about each gene remains fixed, a number of alternative models become equally likely and it becomes impossible to identify which represents the most likely relationship between genes.
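The scaling argument above can be made concrete with a rough count. This is only a sketch: it assumes a directed network without self-loops, and the function name and the numbers of time points and replicates are hypothetical values chosen for illustration.

```python
def network_search_space(n_genes, n_timepoints, n_replicates):
    """Count candidate edges, candidate topologies and observations per gene."""
    # Each ordered pair (regulator, target) with regulator != target is a possible edge.
    n_edges = n_genes * (n_genes - 1)
    # Each edge is either present or absent, so every extra edge doubles the count.
    n_topologies = 2 ** n_edges
    # Observations per gene do not depend on how many genes are modelled.
    n_obs_per_gene = n_timepoints * n_replicates
    return n_edges, n_topologies, n_obs_per_gene


# Hypothetical experiment: 10 time points, 4 replicates.
for n in (5, 10, 20):
    edges, topologies, obs = network_search_space(n, n_timepoints=10, n_replicates=4)
    print(f"{n} genes: {edges} candidate edges, 2^{edges} topologies, "
          f"{obs} observations per gene")
```

The observations per gene stay fixed while the number of candidate edges, and with it the number of parameters, keeps growing.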
Beal et al. (2005) developed VBSSM, a state space modelling approach to reverse engineering the regulatory network of a small set of genes. Linear Gaussian state space models (SSMs) have been known by several other names, including linear dynamical systems (Roweis & Ghahramani, 1999) and Kalman filter models (Brown & Hwang, 1997). All are a subclass of dynamic Bayesian networks, which are suited to the modelling of time series data. SSMs are particularly suited to modelling data collected from gene expression microarrays since they assume the existence of a number of hidden states which evolve with Markovian dynamics and can be used to model the effects of unmeasured variables such as missing gene expression data or protein degradation rates.
A variational Bayesian treatment of SSMs provides a novel way to identify the dimensionality of the state space without holding out data from that used to train the model, as in Beal (2003). This approach therefore leaves more data available for training the model and, as a consequence, can infer the gene interactions more accurately.
Beal et al. (2005) state that a sequence of $p$-dimensional real-valued observation vectors $(y_1, \ldots, y_T)$ is modelled by assuming that, at each time step $t$, $y_t$ was generated from a $k$-dimensional real-valued hidden state vector $x_t$.
By focussing on models in which the dynamics and the output functions are linear and time-invariant whilst the distributions of the state evolution and noise variables are Gaussian, the following linear-Gaussian SSM equations can be used:
$$x_t = A x_{t-1} + w_t, \qquad w_t \sim \mathcal{N}(0, Q) \tag{1.7}$$
$$y_t = C x_t + v_t, \qquad v_t \sim \mathcal{N}(0, R) \tag{1.8}$$
where $A$ is the $(k \times k)$ state dynamics matrix and $C$ is the $(p \times k)$ observation matrix. These matrices effectively capture the interaction between hidden states at adjacent time steps and the influence of hidden states upon observations at the same time step, respectively. $Q$ and $R$ are the covariance matrices of the state noise $w_t$ and the output noise $v_t$.
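As an illustration of how Equations 1.7 and 1.8 generate data, the following minimal sketch simulates a linear-Gaussian SSM with NumPy. It is not the VBSSM software. The function name, the zero initial state and all parameter values are assumptions made for this example only.

```python
import numpy as np


def simulate_ssm(A, C, Q, R, T, rng):
    """Draw hidden states and observations from Equations 1.7 and 1.8."""
    k, p = A.shape[0], C.shape[0]
    x = np.zeros((T, k))
    y = np.zeros((T, p))
    x_prev = np.zeros(k)  # assumption: the chain starts from a zero state
    for t in range(T):
        w_t = rng.multivariate_normal(np.zeros(k), Q)  # state noise
        v_t = rng.multivariate_normal(np.zeros(p), R)  # output noise
        x[t] = A @ x_prev + w_t   # Equation 1.7
        y[t] = C @ x[t] + v_t     # Equation 1.8
        x_prev = x[t]
    return x, y


rng = np.random.default_rng(0)
k, p, T = 2, 4, 50                        # hypothetical sizes
A = np.array([[0.9, 0.1], [-0.2, 0.7]])   # eigenvalues inside the unit circle
C = rng.normal(size=(p, k))
Q = 0.1 * np.eye(k)
R = 0.05 * np.eye(p)
x, y = simulate_ssm(A, C, Q, R, T, rng)
print(x.shape, y.shape)  # (50, 2) (50, 4)
```

Only `y` would be available from a microarray experiment; `x` plays the role of the unmeasured factors that the model has to infer.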
The model can be extended with driving inputs $u_{1:D}$, which allow an external influence to control the model, much as a remote-controlled car is directed by commands to move forward, move backward or steer. The model equations then become:
$$x_t = A x_{t-1} + B u_t + w_t \tag{1.9}$$
$$y_t = C x_t + D u_t + v_t \tag{1.10}$$
where $B$ is a $(k \times d)$ input-to-state matrix and $D$ is a $(p \times d)$ input-to-observation matrix. The driving inputs can be replaced by feedback from gene expression measurements at the previous time step in an attempt to discover the gene–gene interactions across time steps. This allows Equations 1.9 and 1.10 to be rewritten as:
$$x_t = A x_{t-1} + B y_{t-1} + w_t \tag{1.11}$$
$$y_t = C x_t + D y_{t-1} + v_t \tag{1.12}$$
and in turn, since the driving input vector is now $p$-dimensional, the dimensions of matrix $B$ become $(k \times p)$ and the dimensions of matrix $D$ become $(p \times p)$. The graphical representation of this model can be seen in Figure 1.8, which illustrates these state space vectors as circles connected by arrows which correspond to the four matrices in the model, $A$, $B$, $C$ and $D$.
Figure 1.8 – Probabilistic graphical Bayesian network model representation of VBSSM
The VBSSM model can be summarised by plotting a network of matrices (edges) which define the transition between states (nodes). $x_t$ and $y_t$ represent the hidden state and observed genes, respectively, at time $t$. U is an input to the model and can be used to define known interactions as priors of the model. Green arrows ($A$) are the state dynamics matrix, which captures the transition of the hidden states between time points. Yellow arrows ($B$) are the matrix which models the influence of observed genes on hidden states of the next time point. Blue arrows ($C$) are the matrix modelling the influence of the hidden states on the observed genes at each time point. Red arrows ($D$) are the matrix which captures the influence of observed gene expression levels on other observed genes at adjacent time points. A combination of the yellow, blue and red matrices can be used to directly describe the expression of each observed gene as a function of only the observed gene expressions at the previous time point ($CB + D$), therefore inferring the influence of each gene on each other gene in the network.
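The change in matrix dimensions can be checked with a small simulation of the feedback model. The sketch below uses Equation 1.11 together with the observation equation obtained from Equation 1.10 by substituting $y_{t-1}$ for $u_t$. The function name, the zero initial hidden state and the parameter values are assumptions for illustration, not part of VBSSM.

```python
import numpy as np


def simulate_feedback_ssm(A, B, C, D, Q, R, y0, T, rng):
    """Simulate an SSM whose driving input is the previous observation."""
    k, p = A.shape[0], C.shape[0]
    # With p-dimensional feedback, B must be (k x p) and D must be (p x p).
    assert B.shape == (k, p) and D.shape == (p, p)
    x_prev, y_prev = np.zeros(k), y0  # assumption: zero initial hidden state
    ys = []
    for _ in range(T):
        w_t = rng.multivariate_normal(np.zeros(k), Q)
        v_t = rng.multivariate_normal(np.zeros(p), R)
        x_t = A @ x_prev + B @ y_prev + w_t   # state update with feedback
        y_t = C @ x_t + D @ y_prev + v_t      # observation with feedback
        ys.append(y_t)
        x_prev, y_prev = x_t, y_t
    return np.array(ys)


rng = np.random.default_rng(0)
k, p, T = 2, 3, 20                       # hypothetical sizes
A = 0.3 * rng.normal(size=(k, k))
B = 0.3 * rng.normal(size=(k, p))
C = 0.3 * rng.normal(size=(p, k))
D = 0.3 * rng.normal(size=(p, p))
Q, R = 0.1 * np.eye(k), 0.05 * np.eye(p)
ys = simulate_feedback_ssm(A, B, C, D, Q, R, y0=np.ones(p), T=T, rng=rng)
print(ys.shape)  # (20, 3)
```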
Under this model, $y_t$ denotes the gene expression levels at time step $t$ whilst $x_t$ represents the unobserved hidden factors. $D$ is the matrix which captures gene–gene influences at adjacent time points, $C$ is the matrix which captures the influence of hidden factors on gene expression at the same time point, $B$ is the matrix capturing the influence of gene expression on hidden factors of the following time point and $A$ is the matrix which captures the state dynamics between hidden states. In order to identify the level of influence which exists between genes, the two equations can be rewritten so that $y_t$ is a function of only gene expression at the previous time step, $y_{t-1}$:
$$y_t = (CB + D)\, y_{t-1} + r_t \tag{1.13}$$
where $r_t = v_t + C w_t + C A x_{t-1}$ includes all the contributions from noise and previous hidden states. The influence of gene $j$ on gene $i$ can therefore be characterised by element $ij$ of the matrix $[CB + D]$, which describes the influence of gene expression observations at the previous time step upon gene expression observations at the following time step whilst also accounting for all the hidden factors. Once a model has been inferred, its ability to explain the experimental data is returned as a log marginal likelihood. A model that fits its training data well receives a higher score, so the score can be used to compare two models and identify which better explains the data provided.
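The rewriting that leads to Equation 1.13 follows from substituting the state update into the observation equation: $y_t = C(A x_{t-1} + B y_{t-1} + w_t) + D y_{t-1} + v_t$. The sketch below checks this identity numerically and reads one entry from the interaction matrix. All matrices and vectors are arbitrary random values assumed for illustration. They are not inferred from data, and the noise terms are simply drawn from a standard normal rather than from $Q$ and $R$.

```python
import numpy as np

rng = np.random.default_rng(1)
k, p = 2, 3  # hypothetical numbers of hidden states and genes
A = 0.5 * rng.normal(size=(k, k))
B = 0.5 * rng.normal(size=(k, p))
C = rng.normal(size=(p, k))
D = 0.5 * rng.normal(size=(p, p))

x_prev, y_prev = rng.normal(size=k), rng.normal(size=p)
w_t, v_t = rng.normal(size=k), rng.normal(size=p)

# Two-step form: update the hidden state, then emit the observation.
x_t = A @ x_prev + B @ y_prev + w_t
y_t = C @ x_t + D @ y_prev + v_t

# Collapsed form: y_t as a function of y_{t-1} plus a residual term.
M = C @ B + D                         # gene-gene interaction matrix
r_t = v_t + C @ w_t + C @ A @ x_prev  # noise and previous hidden state
y_t_collapsed = M @ y_prev + r_t

print(np.allclose(y_t, y_t_collapsed))  # True

# M[i, j] weights gene j at time t-1 in the expression of gene i at time t.
i, j = 0, 2
print(f"influence of gene {j} on gene {i}: {M[i, j]:.3f}")
```

Rows of `M` index the target gene and columns index the regulating gene, which is the convention implied by multiplying `M` with the previous observation vector.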
1.6.3.1 Limitations
Whilst VBSSM represents a huge step forward in the elucidation of regulatory networks and has been shown to work successfully on artificially simulated data and on data collected in a longitudinal manner, it was unfortunately not possible to gather the senescence experiment samples in the same way. Obtaining RNA from the leaves of the Arabidopsis plants being studied required destructive sampling, so the biological replicates, despite carrying the same labels at different time points, were in fact collected from separate individuals in a cross-sectional manner.
The models produced by VBSSM use the biological replicates in a longitudinal manner, treating the observations carrying each replicate label as a distinct observation of the entire time series. Efforts were made to determine the importance of this misinterpretation by randomly reassigning the biological replicate labels within time points to see what effect this had on the resulting model. It was concluded that, although the resulting models demonstrated some differences, the variance between the replicates remains constant and, as such, the underlying network should still be identifiable.
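The label reassignment described above can be sketched as an independent permutation of the replicate axis at every time point. The array layout (replicates × time points × genes), the function name and the toy data are assumptions made for this example. Refitting VBSSM to each shuffled data set is not shown.

```python
import numpy as np


def shuffle_replicates_within_timepoints(data, rng):
    """Permute replicate labels independently at each time point.

    data has shape (n_replicates, n_timepoints, n_genes); a copy is returned.
    """
    n_rep, n_time, _ = data.shape
    shuffled = np.empty_like(data)
    for t in range(n_time):
        perm = rng.permutation(n_rep)          # new label order for this time point
        shuffled[:, t, :] = data[perm, t, :]   # samples never move between time points
    return shuffled


rng = np.random.default_rng(0)
# Toy data: 4 replicates, 5 time points, 3 genes (hypothetical sizes).
data = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)
shuffled = shuffle_replicates_within_timepoints(data, rng)
print(shuffled.shape)  # (4, 5, 3)
```

Each time point keeps exactly the same set of samples, so the between-replicate variance at that time point is unchanged. Only the pseudo-longitudinal trajectories formed by following one label across time are altered.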
Another limitation of VBSSM is that its models are linear and assume time-invariant interactions between the genes. It is known that not all gene interactions are linear and that, on some occasions, unmeasured variables such as phosphorylation can alter the downstream regulatory effects of some genes. However, addressing these limitations in VBSSM would be computationally intensive and could not feasibly be done at this time. No better-suited alternative modelling method could be identified and, as such, VBSSM was the sole software used for the reverse engineering of regulatory networks. A recent review provides benchmarking comparisons of VBSSM and other methods (Penfold & Wild, 2011, in print). In this review, the following alternative algorithms were compared:
• A time-series networks identification (TSNI) algorithm (Bansal et al., 2006).
• A Granger causality analysis (GCA) method (Seth, 2010).
• The G1DBN dynamic Bayesian network (DBN) package (Lèbre, 2009).
• A non-parametric non-linear dynamical system (NDS) found in the Matlab package GP4GRN (Äijö & Lähdesmäki, 2009).
• An implementation of algorithms proposed by Klemm (2008) to provide a causal structure identification (CSI) method.
Of the comparisons made, VBSSM was found to be slightly better than the TSNI and GCA methods at identifying the underlying network topology and similarly accurate to the other DBN method tested, but not as accurate as the NDS and CSI methods. However, the latter methods are far more computationally intensive, taking 48 times as long as VBSSM to produce a result, and were not available when they were required for this PhD project.