

General discussion

6.3 Practical problems in application of PGMs to systems biology

Systems biology is an interdisciplinary field that integrates biology, computer science and engineering to decipher complex biological systems using holistic approaches (Calvert and Fujimura 2009). It is based on the understanding that the functions of a whole living organism are more than the sum of its parts (Hurlbut 2006). Accordingly, it requires the ability to obtain, integrate and analyze complex data sets from multiple sources.

The rapid evolution of high-throughput technologies, such as nucleotide sequencing, DNA chips and protein mass spectrometry, has enabled extensive generation of multi-omics data. However, these data are typically heterogeneous and distributed across various databases, since they come from studies driven by different objectives and conducted on different platforms. This raises challenges in data access and integration. Although the current data explosion has made publicly available sources much easier to acquire, discovering the appropriate data is often not straightforward because of the diversity of data types and formats. In addition, experimental biologists still struggle to provide complete and non-redundant information collected from varied data sources. For instance, by 2013 Pathguide listed 547 resources related to biological pathways and molecular interactions. These resources are not simply complementary, but often define similar signaling and metabolic pathways with different boundaries and components (Gomez-Cabrero et al. 2014).

Challenges beyond data access and integration lie in the integrative analysis of multi-omics data. A number of factors, including data quality, the complexity of the target system and the characteristics of the technology employed, combine to make integrative analysis difficult. Notably, most of the systematic approaches developed so far are analysis pipelines that apply several methods to carry out a sequence of tasks (Bersanelli et al. 2016). An example is the genotype-phenotype modeling scheme we proposed in Chapter 3. Encouragingly, a pipeline designed for one particular problem can also be used, with minor modifications, to solve another problem, possibly with other types of omics data (Bersanelli et al. 2016). For instance, although in Chapter 3 we only demonstrated that the proposed scheme is effective in inferring directed associations among metabolites and sensory traits given relevant QTLs, this scheme should also be applicable to the modeling of general hierarchical networks that represent multilevel phenotypic responses to DNA variations.

Identifying associations among entities within and across heterogeneous data sources is of great importance in most studies, as it is a straightforward and effective way to “glue” together pieces of information so as to provide a coherent view of the whole system. To establish multilevel associations, earlier studies often employed distance-based or correlation-based metrics, while recent studies tend to adopt more sophisticated modeling techniques such as PGMs. Whichever method is used, conflicts between data measures must be handled beforehand. This includes missing-data imputation, scaling, discretization, normalization and standardization, among other steps. Nonetheless, even when data pre-processing has been done properly, it is noteworthy, though often overlooked, that usually very few associations are revealed across heterogeneous data sets compared with the large number of associations identified within the same data set (see, for example, Fig. 4 in Chapter 3).
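The pre-processing and correlation-based screening described above can be sketched in a few lines. This is a minimal illustration only, not the procedure used in Chapter 3: the function names, the median imputation rule, the threshold of 0.8 and the toy values are all assumptions introduced here, and the made-up data are not meant to reproduce the within-versus-across pattern reported in the text.

```python
import math
from itertools import combinations
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)


def associations(variables, threshold=0.8):
    """Return variable pairs with |r| >= threshold.

    `variables` maps a name to (data_source, values); each pair is labelled
    "within" if both variables share a data source and "across" otherwise.
    """
    prepared = {
        name: (source, standardize(impute_median(values)))
        for name, (source, values) in variables.items()
    }
    found = []
    for (n1, (s1, v1)), (n2, (s2, v2)) in combinations(prepared.items(), 2):
        r = pearson(v1, v2)
        if abs(r) >= threshold:
            found.append((n1, n2, "within" if s1 == s2 else "across", r))
    return found


# Hypothetical toy data: two metabolites and one sensory trait on six samples.
toy = {
    "metabolite_a": ("metabolomics", [1.0, 2.0, None, 4.0, 5.0, 6.0]),
    "metabolite_b": ("metabolomics", [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]),
    "trait_c": ("sensory", [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]),
}
pairs = associations(toy)
```

Pearson correlation is invariant to standardization, so the scaling step matters here only as a stand-in for the measures that do depend on it, such as distance-based metrics.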

A rich body of literature supports the idea that associations in observational data can provide insights into causal relations among the measured variables (Blair et al. 2012; Pearl 2009; Shipley 2016). Nonetheless, inferred causal relations are sensitive to subtle association patterns, which may be driven by other factors (e.g. environmental and experimental design factors) that do not reflect the underlying biology (Blair et al. 2012). In addition, graphical methods for causal inference from observational data, especially from observed associations, are subject to known theoretical constraints. As elaborated in Chapter 5 and Section 6.1, constructing BNs on the basis of the BDe metric or conditional independence facts can identify only a Markov equivalence class rather than a unique BN. Also, Chapter 2 has demonstrated that inferring causal relations from observed associations requires adding known causal factors to the measured variables. For instance, causal inference in correlated traits (or equivalently, the construction of directed phenotype networks) is based upon logic that involves the underlying QTLs. Existing related algorithms require at least one unique QTL for each trait studied, although this prerequisite is rarely met in practice. In comparison, the QPSO algorithm presented in Chapter 2 is of more practical significance since it has a more realistic prerequisite: some traits may have no QTL. More encouragingly, as indicated in Chapters 2 and 3, the QPSO algorithm can be embedded into a bottom-up strategy to systematically model multilevel phenotypic responses to DNA variations.
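The Markov equivalence constraint mentioned above can be made concrete with a small check. The sketch below relies on the standard graphical criterion that two DAGs are Markov equivalent exactly when they share the same skeleton and the same v-structures (colliders whose parents are not adjacent). The function names and the three-variable example graphs are illustrative assumptions and do not come from the algorithms of Chapters 2 or 5.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of a DAG given as (parent, child) pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Colliders a -> c <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def markov_equivalent(edges_1, edges_2):
    """True if both DAGs have the same skeleton and the same v-structures."""
    return (skeleton(edges_1) == skeleton(edges_2)
            and v_structures(edges_1) == v_structures(edges_2))


# Four hypothetical DAGs over the same three variables.
chain = [("X", "Y"), ("Y", "Z")]          # X -> Y -> Z
reverse_chain = [("Z", "Y"), ("Y", "X")]  # X <- Y <- Z
fork = [("Y", "X"), ("Y", "Z")]           # X <- Y -> Z
collider = [("X", "Y"), ("Z", "Y")]       # X -> Y <- Z

same_class = markov_equivalent(chain, reverse_chain) and markov_equivalent(chain, fork)
different_class = not markov_equivalent(chain, collider)
```

The chain, the reversed chain and the fork all encode the single independence of X and Z given Y, so observational data alone cannot orient their edges. Only the collider is distinguishable, which is why extra causal anchors such as QTLs are needed to fix edge directions.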

Agreement between a mathematical or statistical model and the true underlying biology is vital to any practical study. It should be recognized that the extent to which a PGM derived from observational data can recapitulate the architecture of an underlying biological process is not yet well understood (Blair et al. 2012). On the one hand, observational data are often collected at single time points; on the other hand, biological processes typically display time-varying dynamics. This conflict makes the interpretation of the reconstructed model challenging. Dynamic Bayesian networks (DBNs) generalize BNs to the time domain and relate variables to each other over discrete time points. Their major advantages lie in the ability to handle multivariate time series data and to represent cyclic causal relationships. Here, however, we will not discuss practical issues related to the use of DBNs, since analyzing time series data is beyond the scope of this thesis. Readers interested in the computational power of DBNs are referred to Ghahramani (1998), Murphy (2002) and Brulé (2016) for details.
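The way a DBN accommodates cyclic relationships can be illustrated structurally. In the sketch below, a feedback loop between two hypothetical variables X and Y is not a valid static BN because it contains a cycle, but it becomes a DAG once each edge is read as pointing from time slice t to slice t+1 and the template is unrolled. The helper names and the two-variable template are assumptions made for illustration, and the sketch covers graph structure only, not parameter learning or inference.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """True if the directed graph given as (parent, child) pairs has no cycle."""
    predecessors = {}
    for parent, child in edges:
        predecessors.setdefault(child, set()).add(parent)
        predecessors.setdefault(parent, set())
    try:
        # static_order() raises CycleError when the graph contains a cycle.
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False


def unroll(template, n_slices):
    """Unroll a transition template over discrete time slices.

    Each template edge (parent, child) is read as
    parent at slice t -> child at slice t + 1.
    """
    return [
        ((parent, t), (child, t + 1))
        for t in range(n_slices - 1)
        for parent, child in template
    ]


# Hypothetical feedback loop: X regulates Y and Y regulates X.
feedback = [("X", "Y"), ("Y", "X")]

static_is_valid = is_acyclic(feedback)            # cyclic as a static graph
unrolled = unroll(feedback, n_slices=4)
unrolled_is_valid = is_acyclic(unrolled)          # acyclic once indexed by time
```

Because every unrolled edge points forward in time, no directed cycle can form, whatever the template looks like.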