1. Introduction
1.2 Omics data integration
1.2.3 Omics data integration methods
1.2.3.1 Network-based integration
Network-based methods use graph theory and statistics to portray relationships between elements in polyomics datasets. In this way, they offer an intuitive, versatile, and powerful approach to representing and analysing complex systems (Nardini et al., 2015). A network (G) consists of nodes or vertices (V), which represent system components such as genes, proteins, and metabolites, and edges (E), which represent the interactions among them; it is usually denoted as G = (V, E). Depending on the statistical measure used and the type of data they represent, network edges can be weighted or unweighted, and directed or undirected. The connectivity pattern in a network is generally represented by an adjacency matrix (A). For an undirected and unweighted network G = (V, E), the adjacency matrix A is a square matrix of size |V| × |V|, in which each row and column denotes a node and each entry is either Aij = 1, if nodes i and j are connected, or Aij = 0, if they are not. In a weighted network, the adjacency matrix instead contains real numbers representing the strengths of association between the nodes, rather than the binary 0 or 1 of the unweighted network (Gligorijevic and Przulj, 2015).
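As a minimal illustration of the adjacency matrix defined above, the following Python sketch builds A for a small undirected network in both its unweighted and weighted forms. The node names, the edge weights and the helper `adjacency_matrix` are hypothetical and serve only to make the definition concrete.

```python
import numpy as np

def adjacency_matrix(nodes, edges, weighted=False):
    """Build a symmetric |V| x |V| adjacency matrix for an undirected network.

    `edges` is a list of (node_u, node_v, weight) tuples (hypothetical format).
    """
    index = {node: i for i, node in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for u, v, w in edges:
        i, j = index[u], index[v]
        # Undirected network: set both A[i, j] and A[j, i].
        A[i, j] = A[j, i] = w if weighted else 1.0
    return A

# Hypothetical toy network with three components and two interactions.
nodes = ["geneA", "proteinB", "metaboliteC"]
edges = [("geneA", "proteinB", 0.8), ("proteinB", "metaboliteC", 0.3)]

A_unweighted = adjacency_matrix(nodes, edges)
A_weighted = adjacency_matrix(nodes, edges, weighted=True)
```

Because the network is undirected, both matrices are symmetric; a directed network would set only one of the two entries per edge.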
Generally, network-based methods start with the construction of a similarity matrix using a measure of similarity or relatedness between the elements in the omics datasets. Several measures can be used to determine the similarity between pairs of elements, and each has its own strengths and weaknesses. The Pearson product–moment correlation coefficient or Spearman’s rank correlation is most commonly used, and comparative studies have shown that these simple measures perform well compared with more sophisticated methods such as mutual information (MI), both in finding relationships and in computational performance on very large omics datasets (Song et al., 2012, Ballouz et al., 2015, Serin et al., 2016). Pearson correlation is the most popular measure, even though it assumes that transcript, protein or metabolite expression is normally distributed. In contrast, Spearman’s rank correlation is more robust, but less powerful than Pearson correlation (Serin et al., 2016). The Pearson correlation coefficient measures the linear relation between two variables, and its value (r) ranges between −1 and 1, where r = −1 indicates a perfectly negative linear relation, r = 1 indicates a perfectly positive linear relation, and r = 0 indicates the absence of any linear relation. Networks constructed using Pearson correlation have undirected edges, and causality cannot be inferred from the relationships.
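The construction of a correlation-based similarity matrix can be sketched as follows. This is a simplified illustration rather than the procedure of any cited study: the data values, the cut-off of 0.8 and the helper names `rank_rows` and `correlation_network` are assumptions, and the ranking helper assumes there are no tied values.

```python
import numpy as np

def rank_rows(X):
    """Rank the values within each row (assumes no ties, for simplicity)."""
    return np.argsort(np.argsort(X, axis=1), axis=1).astype(float)

def correlation_network(X, method="pearson", threshold=0.8):
    """Return (similarity matrix R, unweighted adjacency matrix A).

    X holds one feature (e.g. transcript or metabolite) per row and one
    sample per column. The threshold is an illustrative assumption.
    """
    X = np.asarray(X, dtype=float)
    data = rank_rows(X) if method == "spearman" else X
    R = np.corrcoef(data)                      # pairwise correlations between rows
    A = (np.abs(R) >= threshold).astype(int)   # keep strong associations only
    np.fill_diagonal(A, 0)                     # no self-loops
    return R, A

# Hypothetical toy data: the second feature is a monotonic but non-linear
# function (the cube) of the first; the third is unrelated.
X = [[1, 2, 3, 4, 5, 6],
     [1, 8, 27, 64, 125, 216],
     [3, 1, 4, 2, 6, 5]]

R_pearson, A_pearson = correlation_network(X, method="pearson")
R_spearman, A_spearman = correlation_network(X, method="spearman")
```

In this toy example the Spearman coefficient between the first two features is 1 because their ranks agree exactly, whereas the Pearson coefficient is lower because the relation is not linear. Either matrix is symmetric, so the resulting network is undirected.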
An excellent example of the network-based integration of polyomics datasets is ‘the integrated disease network’, which was constructed from different types of biological data including genomics, clinical, disease–metabolite associations, genome-wide associations, biological pathways, and Gene Ontology5 annotation data (Sun et al., 2014). A similar network-based integration approach was used to study the systemic impact of adverse therapeutic events in rheumatoid arthritis; this study integrated polyomics datasets, including genomics, transcriptomics, epigenetics and microbiome datasets, together with clinical datasets (Tieri et al., 2014). Gibbs et al. used a slightly different network-based approach to study polyomics datasets (Gibbs et al., 2014). Their approach involved mapping the polyomics data to a common identifier (Entrez ID), generating co-expression networks from the individual omics datasets, identifying co-expression modules in them, and comparing the co-expression modules between the omics layers using multiple measures such as module member overlap and module summary correlation. However, mapping polyomics data to a common identifier is challenging, and may not be possible in some cases, such as mapping metabolites to genes. 3Omics, a web-based tool for integrating transcriptomics, proteomics and metabolomics data, also uses correlation-based networks to visualize relationships in the datasets (Kuo et al., 2013).
5 http://www.geneontology.org/; The Gene Ontology (GO) is a framework that provides a set of
Partial correlation is another linear method; it is related to multiple linear regression, but its interpretation is similar to that of the Pearson correlation coefficient (Lipsitz et al., 2001). Partial correlation can distinguish between direct and indirect relationships, and is useful when covariates are measured on different scales. Kayano et al. used a partial correlation approach to construct metabolic networks from metabolome, proteome, and transcriptome data, and demonstrated that it was superior to a Pearson correlation-based approach (Kayano et al., 2013).
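How partial correlation separates direct from indirect relationships can be illustrated with a small simulation. The sketch below uses one standard way of obtaining partial correlations, namely from the inverse of the covariance matrix; it is not the procedure of Kayano et al. The simulated chain x → y → z, the sample size and the helper name `partial_correlation` are assumptions, and the approach requires more samples than variables so that the covariance matrix can be inverted.

```python
import numpy as np

def partial_correlation(X):
    """Partial correlation between each pair of rows of X, given all other rows.

    Uses pcor_ij = -P_ij / sqrt(P_ii * P_jj), where P is the inverse of the
    covariance matrix. X holds one variable per row, one sample per column.
    """
    precision = np.linalg.inv(np.cov(X))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Simulated chain x -> y -> z: x influences z only indirectly, through y.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
X = np.vstack([x, y, z])

corr = np.corrcoef(X)           # marginal (Pearson) correlations
pcor = partial_correlation(X)   # correlations conditioned on the third variable
```

In this simulation the Pearson correlation between x and z is clearly non-zero, while their partial correlation given y is close to zero, so only the direct edges x–y and y–z would be retained. For omics data with far more variables than samples, a regularized estimate of the inverse covariance matrix would be needed instead.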
Mutual information is a non-linear measure of dependency, and provides a natural generalization of correlation (Song et al., 2012). However, MI did not perform better than Pearson correlation in comparative studies (Song et al., 2012). Nevertheless, MI is the basis of several improved information-theoretic methods such as relevance networks (Butte and Kohane, 2000), the context likelihood of relatedness (CLR) algorithm (Faith et al., 2007), the minimum redundancy networks (MRNET) algorithm (Meyer et al., 2007) and the Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) (Margolin et al., 2006). Of these, ARACNE is particularly notable for its effectiveness in the reconstruction of regulatory networks, and therefore remains a popular choice (Lachmann et al., 2016). ARACNE can distinguish between direct and indirect relationships by pruning the lowest-weight edge in each triplet of connected nodes. Regression methods can also be used to construct networks, and are very useful when directed graphs are needed.
5 (continued) includes 3 top-level categories: (1) Biological process; (2) Molecular function; and (3) Cellular component
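The triplet-pruning idea behind ARACNE can be sketched in a few lines. This is a simplified illustration of the principle (the data processing inequality), not the ARACNE implementation: it assumes that a symmetric matrix of pairwise MI values has already been estimated, and the MI values, the helper name `dpi_prune` and its `tolerance` parameter are hypothetical.

```python
import numpy as np
from itertools import combinations

def dpi_prune(mi, tolerance=0.0):
    """Remove the weakest edge of every fully connected triplet.

    `mi` is a symmetric matrix of non-negative mutual information values,
    where 0 means "no edge". Every triplet is checked against the ORIGINAL
    matrix, so the result does not depend on the order of the triplets.
    """
    mi = np.asarray(mi, dtype=float)
    n = mi.shape[0]
    keep = mi > 0
    np.fill_diagonal(keep, False)
    for i, j, k in combinations(range(n), 3):
        triplet = [((i, j), mi[i, j]), ((i, k), mi[i, k]), ((j, k), mi[j, k])]
        if any(weight <= 0 for _, weight in triplet):
            continue  # not a fully connected triplet
        triplet.sort(key=lambda edge: edge[1])
        (a, b), weakest = triplet[0]
        # Prune only if the weakest edge is strictly below the next weakest
        # (scaled by the tolerance); ties are left untouched.
        if weakest < triplet[1][1] * (1.0 - tolerance):
            keep[a, b] = keep[b, a] = False
    return np.where(keep, mi, 0.0)

# Hypothetical MI values: the 0-2 association is weaker than both 0-1 and
# 1-2, so it is treated as an indirect relationship passing through node 1.
mi = np.array([[0.0, 0.9, 0.3],
               [0.9, 0.0, 0.8],
               [0.3, 0.8, 0.0]])
pruned = dpi_prune(mi)
```

Here the edge between nodes 0 and 2 is removed, while the two stronger edges are kept with their original weights.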
Similarly, Bayesian methods are used in constructing omics networks, and they allow the inclusion of prior knowledge. A Bayesian network is a directed graph in which nodes represent random variables, such as transcript or protein levels, and directed edges represent the causal relationships and conditional probabilities between pairs of variables (Gligorijevic and Przulj, 2015). Bayesian networks are effective in representing the structure of the data, and their sparsity provides a compact representation. These properties address one of the biggest challenges in the integration of polyomics datasets, namely network inference from disparate data sources, by constructing sparse networks in which only the important associations are present (Gligorijevic and Przulj, 2015). Although applying Bayesian network methods to large polyomics datasets is computationally challenging (Serin et al., 2016), several studies have successfully used Bayesian networks to derive knowledge from polyomics datasets. For example, Jansen et al. used Bayesian networks to predict protein–protein interactions in yeast by integrating different types of omics data including transcriptomics and proteomics (Jansen et al., 2003). Similarly, using Bayesian networks, Zhu et al. reconstructed causal gene networks in yeast by integrating polyomics data including genomics, transcriptomics (gene expression and expression quantitative trait loci (eQTL)), proteomics, transcription factor binding site, and protein–protein interaction data (Zhu et al., 2008). Using a similar integrated Bayesian network approach, Zhang et al. recently reconstructed causal regulatory networks in late-onset Alzheimer’s disease from 1,647 post-mortem human brain tissues (Zhang et al., 2013a).
There are several tools available for network construction and analysis. Weighted correlation network analysis (WGCNA), an R package developed by Langfelder and Horvath, is a popular tool for co-expression analysis (Langfelder and Horvath, 2008). Similarly, GraphViz (Gansner and North, 2000) and Cytoscape (Shannon et al., 2003) are very popular for the visualization and analysis of networks. In addition, CFinder (Adamcsek et al., 2006), NAViGaTOR (Brown et al., 2009), Gephi (Cherven, 2015) and Pajek (Mrvar and Batagelj, 2016) are notable for network visualization and analysis. NetworkAnalyst is a web-based tool for network analysis and visualization that provides many options for analysing omics datasets (Xia et al., 2015).
Cytoscape is a Java-based open-source software platform for integrating and visualizing biological networks. In Cytoscape, biological entities such as proteins or genes are represented as nodes, and their interactions are represented as edges connecting the nodes to form networks. Attributes of nodes and edges can be overlaid on the networks depicting these interactions. While the Cytoscape core provides basic visualization, annotation and query functionalities, the available plug-ins provide several additional capabilities that enhance the utility of Cytoscape as a systems biology tool. One of these plug-ins, Molecular Complex Detection (MCODE), finds highly connected regions in large networks that may represent molecular complexes (Bader and Hogue, 2003). The MCODE plug-in functions in three recursive stages: node weighting, cluster formation, and optional addition of nodes to the cluster using certain criteria.