
Multivariate-based integration

1.2.3.3 Multivariate-based integration

In multivariate-based integration, each omics data type is first analysed using multivariate methods. The datasets are then linked either by finding covariance associations between their elements, or by applying the multivariate model built on one omics type to the other omics types to make predictions (Cavill et al., 2016). Several multivariate methods can be used for integration. For example, Forshed et al. used partial least squares (PLS) and PCA to integrate LC-MS-based and NMR-based metabolomics datasets (Forshed et al., 2007a). Although PCA is an unsupervised technique and PLS is a supervised technique, both are useful for identifying collinearity between the elements (genes, transcripts, proteins or metabolites) in polyomics datasets. Using PLS, Griffin et al. integrated microarray-based transcriptomic data and NMR-based metabolomic data generated from the liver tissue of rats in which fatty liver had been induced by feeding orotic acid (Griffin et al., 2004). They associated changes in transcripts with changes in metabolites by modelling the transcriptomic data (Y) as a function of the metabolomic data (X) using PLS regression. This PLS-based integration of microarray and NMR data helped them to define the transcriptomic and metabolomic regulatory responses to orotic acid in the liver, and to identify the specific pathways and cellular responses involved in the pathogenesis of fatty liver. However, the PLS method is asymmetric: one dataset is treated as the predictor and the other as the response, and hence the model does not represent the true biological relationships (Bouhaddani et al., 2016). When the response variable in PLS is discrete rather than continuous, the method is commonly referred to as partial least squares discriminant analysis (PLS-DA).
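The asymmetric set-up described above, with one omics block (Y) regressed on another (X), can be illustrated with a minimal sketch. It assumes simulated data and a plain NumPy implementation of PLS regression; the function name `pls_fit`, the matrix sizes and the noise level are hypothetical choices for illustration and are not taken from Griffin et al. (2004).

```python
import numpy as np

def pls_fit(X, Y, n_components):
    """Fit a PLS2 regression; return B such that centred Y ~ centred X @ B."""
    Xk = X - X.mean(axis=0)
    Yk = Y - Y.mean(axis=0)
    p, q = Xk.shape[1], Yk.shape[1]
    W = np.zeros((p, n_components))  # X weights
    P = np.zeros((p, n_components))  # X loadings
    Q = np.zeros((q, n_components))  # Y loadings
    for a in range(n_components):
        # Weight vector: direction in X with maximal covariance with Y
        u, _, _ = np.linalg.svd(Xk.T @ Yk, full_matrices=False)
        w = u[:, 0]
        t = Xk @ w                   # X scores for this component
        tt = t @ t
        p_a = Xk.T @ t / tt
        q_a = Yk.T @ t / tt
        # Deflate both blocks before extracting the next component
        Xk = Xk - np.outer(t, p_a)
        Yk = Yk - np.outer(t, q_a)
        W[:, a], P[:, a], Q[:, a] = w, p_a, q_a
    # Regression coefficients expressed in the original X variables
    return W @ np.linalg.solve(P.T @ W, Q.T)

# Simulated example (assumption): 40 samples, 30 "metabolites" (X) and
# 50 "transcripts" (Y) driven by two shared latent factors plus noise.
rng = np.random.default_rng(0)
n, p, q = 40, 30, 50
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
Y = latent @ rng.normal(size=(2, q)) + 0.1 * rng.normal(size=(n, q))

B = pls_fit(X, Y, n_components=2)
Y_hat = (X - X.mean(axis=0)) @ B + Y.mean(axis=0)
r2 = 1 - ((Y - Y_hat) ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()
print(f"In-sample R2 of Y predicted from X: {r2:.3f}")
```

Swapping the roles of X and Y in this sketch would generally give a different model, which is the asymmetry referred to above. The R2 reported here is computed on the training data only; a real analysis would use cross-validation to choose the number of components.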

To overcome the asymmetric nature of PLS, Rantalainen et al. used a two-way orthogonal partial least squares (O2PLS) model to integrate NMR-based metabolomics and 2D-DIGE-based proteomics data generated from human prostate cancer xenografts in mice (Rantalainen et al., 2006). In the same study, orthogonal projections to latent structures (OPLS), a supervised multivariate projection method similar to PLS but with an integrated orthogonal signal correction (OSC) filter, was also used to integrate the proteomics and metabolomics data. Although OPLS is also asymmetric, it attempts to correct for systematic variation in the design matrix before presenting the data to PLS, which makes the model easier to interpret (Bouhaddani et al., 2016). O2PLS, in contrast, is symmetric and models both the joint (predictive) variation and the systematic variation unique to each dataset. The O2PLS model decomposes the variation present in two matrices X and Y, for example two omics datasets such as proteomics and metabolomics datasets, into three parts: (1) the joint part, in which the underlying latent variables in X and Y are assumed to provide the relationship between X and Y, so that this part can be taken as a representation of the integration of the two datasets; (2) the orthogonal part, in which the underlying latent variables, independent of those in the joint part, are assumed to be responsible for the unique systematic variation in X (Y) that does not contribute to the prediction of Y (X); and (3) the noise, which captures the unsystematic variation in the datasets (Bouhaddani et al., 2016). From the joint part, it is possible to obtain the percentage of variance of each omics dataset (X and Y) that can be modelled by the other dataset, which gives a measure of similarity between the two datasets. Recently, Bouhaddani et al. conducted a simulation study to assess the performance of O2PLS models in integrating transcriptomic and metabolomic data, and the results showed that the estimates obtained from the O2PLS model were close to the true parameters in both low and high dimensions (Bouhaddani et al., 2016). However, when the noise in the datasets was high (> 50%), there was no clear distinction between the orthogonal and joint parts, suggesting that the method lacks robustness to noise.
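The three-part decomposition described above is commonly written in matrix form as follows. The symbols are assumed notation introduced here for illustration: T and U are the joint scores with loadings W and C, T_o and U_o are the orthogonal (dataset-specific) scores with loadings P_o and Q_o, E and F are the noise terms, and B links the joint scores with residual H. The last line is one simple way of summarising the share of variation captured by the joint part and is given as an assumption, not as the exact statistic used in the cited studies.

```latex
\begin{align}
X &= \underbrace{T W^{\top}}_{\text{joint}}
   + \underbrace{T_{o} P_{o}^{\top}}_{\text{orthogonal}}
   + \underbrace{E}_{\text{noise}}, \\
Y &= \underbrace{U C^{\top}}_{\text{joint}}
   + \underbrace{U_{o} Q_{o}^{\top}}_{\text{orthogonal}}
   + \underbrace{F}_{\text{noise}}, \\
U &= T B + H, \\
R^{2}_{X,\text{joint}} &= \frac{\lVert T W^{\top} \rVert_{F}^{2}}{\lVert X \rVert_{F}^{2}},
\qquad
R^{2}_{Y,\text{joint}} = \frac{\lVert U C^{\top} \rVert_{F}^{2}}{\lVert Y \rVert_{F}^{2}}.
\end{align}
```

In this form the symmetry of O2PLS is visible: X and Y are decomposed in the same way, and neither is singled out as the response.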

Boccard and Rutledge recently introduced a consensus OPLS-DA multiblock data modelling strategy that combines the kernel implementation of the OPLS method with a data fusion procedure for simultaneous evaluation of multiple data blocks in the OPLS-DA modelling framework (Boccard and Rutledge, 2013). This strategy can integrate more than two omics types, and hence is an improvement over the O2PLS method. However, it regresses all the data blocks against a class variable and provides no information about the features that are interrelated between the datasets. To extend the O2PLS method to more than two polyomics datasets, Löfstedt and Trygg developed a new method called OnPLS, which was used to study the oxidative stress response in Populus plants by integrating transcriptomic, proteomic and metabolomic data (Löfstedt and Trygg, 2011, Löfstedt et al., 2013, Srivastava et al., 2013).

Many other multivariate methods have been used to integrate multiple omics datasets. These include regression-based methods such as random forest regression (Acharjee et al., 2016, Acharjee et al., 2011), multiple co-inertia analysis (MCIA) (Meng et al., 2014), parallel factor analysis (PARAFAC) (Forshed et al., 2007b), canonical correlation analysis (CCA) (Jozefczuk et al., 2010), ComDim-OPLS (Boccard and Rutledge, 2014), least absolute shrinkage and selection operator (LASSO) (Cai et al., 2013, Omranian et al., 2016) and kernel-based methods such as support vector machine recursive feature elimination (SVM-RFE) (Smolinska et al., 2012). Recently, Pineda et al. used LASSO and Elastic Net–based penalized regression methods to identify relationships between genetic variants, gene expression and DNA methylation data obtained from bladder tumour samples, and proposed a permutation-based method to correct for multiple testing (Pineda et al., 2015).
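To illustrate how a penalized regression such as LASSO selects a small number of features from one omics block to explain a variable in another, a minimal sketch is given below. It assumes simulated data and a plain NumPy coordinate-descent solver; the function names `soft_threshold` and `lasso_cd`, the penalty value and the data dimensions are hypothetical, and the sketch does not reproduce the analysis or the permutation-based correction of Pineda et al. (2015).

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic
    coordinate descent (X and y are assumed to be centred)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # per-feature mean squared value
    r = y - X @ b                       # current residual
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the residual that excludes
            # feature j's own current contribution
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            b_new = soft_threshold(rho, lam) / col_sq[j]
            r += X[:, j] * (b[j] - b_new)   # keep the residual in sync
            b[j] = b_new
    return b

# Simulated example (assumption): 60 samples, 100 "methylation probes",
# of which only the first three influence the "expression" response.
rng = np.random.default_rng(1)
n, p = 60, 100
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.normal(size=n)
y = y - y.mean()

b = lasso_cd(X, y, lam=0.2)
selected = np.flatnonzero(b)
print("Indices of selected features:", selected)
```

The L1 penalty sets most coefficients exactly to zero, which is what makes the method usable when the number of features exceeds the number of samples. In practice the penalty `lam` would be chosen by cross-validation, not fixed in advance as it is here.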