Network-based group-penalized prediction - Statistical methods for the analysis of complex omic

Another approach is to first apply dimension reduction to the omics datasets. For example, Acharjee et al. (2016) first apply random forests to each omic source to select the most relevant features. These features are then combined in a single prediction model. To understand possible interactions between omic sources, the selected features are also combined in a network. Potentially, such a network could be included in a prediction model using the methods developed by Tissier et al. (2018). Alternatively, latent structures within and between datasets can be identified using O2PLS (Trygg and Wold, 2003; Bouhaddani et al., 2016). These latent structures can be summarized in a few independent components (dimension reduction), which can be included in a prediction model. This approach is interesting because it has been developed to deal with heterogeneous datasets. However, a disadvantage of first applying a dimension reduction step is the possible loss of relevant information, caused by ignoring either the joint distribution of features from different omic sources or the relationship between the omics datasets and the outcome.

In this paper, we do not perform a priori dimension reduction; instead, we propose group penalization. As part of the model building procedure, group inference is performed using network analysis followed by clustering. To this end, we extend our approach for a single omic source (Tissier et al., 2018) to multiple omic sources. Specifically, we explore how to include groups containing features from different omic sources and whether this is beneficial in terms of predictive performance. Under this general framework, we investigate several alternatives for the inference of groups and for the incorporation of this information in regularized regression. It is indeed unclear whether, given the heterogeneity between omic sources such as metabolomics and transcriptomics, combining the datasets before performing group inference leads to poor results. An alternative is to restrict group inference to each of the omic sources, which, however, might miss across-omic relations due to shared biological pathways.

The rest of the paper is organized as follows. In Section 2, we present the general methodological framework based on grouped regression and several variants regarding group inference and group lasso regression. Section 3 contains technical details about the specific method used for group inference. Section 4 focuses on outcome prediction. An extensive simulation study is presented in Section 5 to empirically evaluate the performance of the methods studied in terms of predictive ability and variable selection properties. The results of the integrated approach are compared with the predictive ability of a single omic source. In Section 6, the methods are applied to two different studies. Main conclusions and a final discussion follow in Section 7.

6.2 Network-based group-penalized prediction

Let the observed data be given by (z, Y, X), where z = (z_1, ..., z_n)^T is the continuous outcome measured in n independent individuals, and Y and X are matrices of dimension n × p and n × q respectively, representing two omic predictor sources with p and q features. Let M = (Y, X) be the stacked dataset of Y and X. Our main goal is to build a predictive model for z based on Y and X with good predictive performance.

The matrices X and Y might be high-dimensional (n < p, q) and present complex dependence structures, potentially shared due to existing biological pathways or coordinated functions of groups of features.

We propose a general framework of grouped regularized regression methods that includes group inference as part of the model building procedure. Group inference relies on first estimating the existing relations among features using network analysis techniques and then deriving groups of features using hierarchical clustering. Based on this general framework, three approaches are proposed, with varying levels of complexity in group inference.

The first algorithm, named GLasso0, consists of constructing a separate network for each omic source and performing subsequent hierarchical clustering on each of the resulting adjacency matrices. Finally, group lasso regression is performed. This approach only allows for omic-specific groups of features, so correlation across omic sources cannot be captured. The second proposed method, GLasso, starts by building a single network from the stacked dataset M. Subsequent hierarchical clustering is performed on the resulting adjacency matrix, and group regression is also based on group lasso. This approach is a direct application of the method proposed by Tissier et al. (2018) to the stacked dataset M, and it potentially allows for groups including features from different omic datasets. However, when the noise structures of the omic datasets differ, stacking the datasets might be problematic for network construction. Finally, the third proposed approach, OverlapLasso, is an extension of GLasso0 that allows for overlapping groups of features. Namely, after obtaining the omic-specific groups of features, an extra network analysis and hierarchical clustering step is conducted at the group level to try to capture additional information shared by the two omic sources.
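For orientation, the group lasso step shared by GLasso0 and GLasso can be written in its standard textbook form. This is a sketch under stated assumptions rather than the exact objective optimized by the implementation: the symbols G (number of inferred groups), β_g (the coefficients of group g), |g| (the size of group g) and λ (the tuning parameter) are introduced here for illustration, the intercept is omitted, and the scaling of the loss and of the group weights may differ between software packages.

```latex
\hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^{p+q}}
    \; \frac{1}{2n} \left\lVert z - M\beta \right\rVert_2^2
    \; + \; \lambda \sum_{g=1}^{G} \sqrt{|g|}\, \left\lVert \beta_g \right\rVert_2
```

Because the penalty acts on the Euclidean norm of each β_g, coefficients are set to zero group by group, which is why the way groups are inferred (per omic source or on the stacked dataset M) matters for the resulting model.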

In all three approaches, weighted gene co-expression network analysis (WGCNA) and the dynamic tree cut algorithm for hierarchical clustering were used. Outcome prediction relies on group lasso in the first two procedures (GLasso0 and GLasso) and, in the case of OverlapLasso, on an extension that allows features to belong to multiple groups. Specific components used in each step are described in more detail in the next section. For each approach, double cross-validation (Mertens et al., 2006, 2011) of the whole process (including group inference) was applied to obtain proper tuning parameters and summary performance measures in the absence of an external validation set. The three procedures have been implemented in the R function MultiPredNet, which is available on GitHub (https://github.com/RenTissier/MultiPredNet). The function calls the packages WGCNA (for network construction and hierarchical clustering), grpreg (group lasso) and grpregOverlap (overlapping group lasso).
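The double cross-validation of the whole process can be pictured as nested index splits. The following is a minimal sketch in Python, not the MultiPredNet code: the function names, fold counts and seed are hypothetical, and the model-fitting steps appear only as comments.

```python
import random

def k_fold_indices(indices, k, rng):
    """Split sample indices into k disjoint folds of near-equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def double_cv_splits(n, outer_k=5, inner_k=5, seed=0):
    """Yield (outer_test, inner_splits) for nested cross-validation.

    inner_splits is a list of (inner_train, inner_validation) pairs that
    use only samples from the outer training part.
    """
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [s for j, fold in enumerate(outer_folds) if j != i for s in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, rng)
        inner_splits = []
        for a, inner_val in enumerate(inner_folds):
            inner_train = [s for b, fold in enumerate(inner_folds) if b != a for s in fold]
            # In a full pipeline, group inference (network + clustering) and the
            # penalized fit would be rerun here on inner_train for each candidate
            # tuning parameter, and scored on inner_val.
            inner_splits.append((inner_train, inner_val))
        # The selected tuning parameter would then be used to refit on
        # outer_train and to predict the untouched outer_test samples.
        yield outer_test, inner_splits

splits = list(double_cv_splits(n=20, outer_k=4, inner_k=3))
```

The point of the nesting is that the outer test samples never influence group inference or tuning, so the resulting performance summary is not optimistically biased by those steps.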


GLasso0: Network-based group-penalized prediction model based on omic-specific group inference

Step 1. Network construction
Input: Y, X
Output: A_Y, A_X (two adjacency matrices)

Step 2. Hierarchical clustering
Input: A_Y, A_X
Output: P_Y, P_X (omic-specific clusters)

Step 3. Outcome prediction: Group lasso
Input: (M, P_Y, P_X)
Output: β (p+q regression coefficients)

Figure 6.1: GLasso0
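In Step 3 of GLasso0, the two omic-specific partitions have to be turned into one group assignment for the columns of M. A minimal sketch of that bookkeeping follows; the function name and the toy labels are hypothetical, and cluster labels are assumed to be positive integers.

```python
def stack_group_labels(p_y, p_x):
    """Combine omic-specific cluster labels into one group vector for M = (Y, X).

    p_y and p_x hold one cluster label per feature of Y and X. Labels from X
    are shifted so that no X cluster shares an id with a Y cluster, which
    keeps every group omic-specific.
    """
    offset = max(p_y) if p_y else 0
    return list(p_y) + [label + offset for label in p_x]

p_y = [1, 1, 2, 2, 2]   # toy clusters for p = 5 features of Y
p_x = [1, 2, 2]         # toy clusters for q = 3 features of X
groups = stack_group_labels(p_y, p_x)
```

The resulting vector has length p+q and could be passed as the grouping argument of a group lasso routine. No group mixes features of Y and X, which reflects the limitation of GLasso0 noted in the text.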

GLasso: Network-based group-penalized prediction model on stacked datasets Y and X

Step 1. Network construction
Input: M = (Y, X)
Output: A_M (adjacency matrix)

Step 2. Hierarchical clustering
Input: A_M
Output: P_M (clusters)

Step 3. Outcome prediction: Group lasso
Input: (M, P_M)
Output: β (p+q regression coefficients)

Figure 6.2: GLasso
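Step 1 of GLasso can be illustrated with the soft-thresholded absolute correlation commonly used for unsigned networks in WGCNA. The sketch below is plain Python on toy data and does not call the WGCNA package. The function names, the data and the value of `power` are assumptions made for illustration, and constant columns are assumed to be absent.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def soft_threshold_adjacency(columns, power=6):
    """Unsigned adjacency a_ij = |cor(column_i, column_j)| ** power."""
    k = len(columns)
    return [[abs(pearson(columns[i], columns[j])) ** power for j in range(k)]
            for i in range(k)]

# Toy data: n = 4 individuals, Y has p = 2 features, X has q = 1 feature.
y_cols = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.0]]
x_cols = [[4.0, 1.0, 3.0, 2.0]]
m_cols = y_cols + x_cols   # stacked dataset M = (Y, X), stored column by column
a_m = soft_threshold_adjacency(m_cols)
```

Here `a_m` is a (p+q) × (p+q) matrix in which the two strongly correlated Y features are tightly connected, while the X feature is almost disconnected from them. Because every entry depends on raw correlations across the stacked columns, differing noise structures between the omic sources can distort the network, as noted in the text.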

OverlapLasso: Network-based overlapping group-penalized prediction model based on omic-specific group inference

Step 1. Network construction
Input: Y, X
Output: A_Y, A_X (two adjacency matrices)

Step 2.a. Hierarchical clustering
Input: A_Y, A_X
Output: P_Y, P_X (omic-specific clusters)

Step 2.b. Principal component analysis
Input: P_Y, P_X
Output: U = (PC_{P_Y}, PC_{P_X}), the set of the first two principal components of each cluster in P_Y and in P_X

Step 2.c. Network construction + hierarchical clustering on U
Input: U
Output: P_U (clusters)

Step 2.d. Identification of (Y, X)-shared clusters in P_U
Input: P_U
Output: P_UM (clusters)
i. Identify the m clusters obtained in Step 2.c. which contain elements from both P_X and P_Y.
ii. For each of the i = 1, ..., m clusters identified in i., identify the corresponding variables from X and Y.
iii. Denote by P_UM the corresponding set of m new clusters.

Step 3. Outcome prediction: Overlapping group lasso
Input: (M, P_Y, P_X, P_UM)
Output: β (p+q regression coefficients)
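Step 2.d. is essentially a bookkeeping operation on cluster memberships, which can be sketched as follows. This is an illustration under assumptions, not the MultiPredNet code: the function name, the dictionary layout and the toy labels are hypothetical, and "corresponding variables" is interpreted here as all features of every omic-specific cluster that contributes a principal component to a shared cluster.

```python
def shared_clusters(p_u, component_source, cluster_members):
    """Build P_UM from the clusters of principal components P_U.

    p_u              : dict, component id -> cluster label in P_U
    component_source : dict, component id -> (omic, omic-specific cluster id)
    cluster_members  : dict, (omic, cluster id) -> list of feature names
    """
    by_label = {}
    for comp, label in p_u.items():
        by_label.setdefault(label, []).append(comp)

    p_um = []
    for label in sorted(by_label):
        sources = {component_source[c] for c in by_label[label]}
        omics = {omic for omic, _ in sources}
        if omics == {"Y", "X"}:            # i. cluster mixes both omic sources
            features = []
            for src in sorted(sources):    # ii. map back to original variables
                features.extend(cluster_members[src])
            p_um.append(features)          # iii. one new cluster per shared label
    return p_um

# Toy example: two Y clusters and one X cluster, two components each.
cluster_members = {("Y", 1): ["y1", "y2"], ("Y", 2): ["y3"], ("X", 1): ["x1", "x2"]}
component_source = {
    "Y1_pc1": ("Y", 1), "Y1_pc2": ("Y", 1),
    "Y2_pc1": ("Y", 2), "Y2_pc2": ("Y", 2),
    "X1_pc1": ("X", 1), "X1_pc2": ("X", 1),
}
p_u = {"Y1_pc1": 1, "X1_pc1": 1, "Y1_pc2": 2, "Y2_pc1": 2, "Y2_pc2": 2, "X1_pc2": 3}
p_um = shared_clusters(p_u, component_source, cluster_members)
```

In this toy example only the first cluster of P_U mixes components from both omic sources, so P_UM contains a single new group made of x1, x2, y1 and y2. These features also remain in their omic-specific groups in P_Y and P_X, which is why Step 3 needs a penalty that tolerates overlapping groups.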

In document Statistical methods for the analysis of complex omics data (Page 102-106)