Another approach is to first apply dimension reduction to the omics datasets. For example, Acharjee et al. (2016) first apply random forests to each omic source to select the most relevant features. These features are then combined in a single prediction model. To understand possible interactions between omic sources, the selected features are also combined in a network. Potentially, such a network could be included in a prediction model using the methods developed by Tissier et al. (2018). Alternatively, latent structures within and between datasets can be identified using O2PLS (Trygg and Wold, 2003; Bouhaddani et al., 2016). These latent structures can be summarized in a few independent components (dimension reduction), which can then be included in a prediction model. This approach is attractive because it was developed to deal with heterogeneous datasets. However, a disadvantage of applying a dimension reduction step first is the possible loss of relevant information, either by ignoring the joint distribution of features from different omics or by ignoring the relationship between the omics datasets and the outcome.
In this paper, we do not perform a priori dimension reduction; instead, we propose group penalization. As part of the model building procedure, group inference is performed using network analysis followed by clustering. To this end, we extend our approach for a single omic source (Tissier et al., 2018) to multiple omic sources. Specifically, we explore how to include groups containing features from different omic sources and whether this is beneficial in terms of predictive performance. Under this general framework, we investigate several alternatives regarding the inference of groups and the incorporation of this information in regularized regression. Given the heterogeneity between omic sources (for example, between metabolomics and transcriptomics), it is unclear whether combining the datasets before group inference leads to poor results. An alternative is to restrict group inference to each of the omic sources, which, however, might miss across-omic relations due to shared biological pathways.
The rest of the paper is organized as follows. In Section 2, we present the general methodological framework based on grouped regression and several variants regarding group inference and grouped lasso regression. Section 3 contains technical details about the specific method used for group inference. Section 4 focuses on outcome prediction. An intensive simulation study is presented in Section 5 to empirically evaluate the performance of the different methods in terms of predictive ability and variable selection properties. The results of the integrated approach are compared with the predictive ability of a single omic source. In Section 6, the methods are applied to two different studies. Main conclusions and a final discussion follow in Section 7.
6.2 Network-based group-penalized prediction
Let the observed data be given by $(z, Y, X)$, where $z = (z_1, \ldots, z_n)^T$ is the continuous outcome measured in $n$ independent individuals, and $Y$ and $X$ are matrices of dimension $n \times p$ and $n \times q$, respectively, representing two omic predictor sources with $p$ and $q$ features. Let $M$ be the stacked dataset of $Y$ and $X$. Our main goal is to build a predictive model for $z$ based on $Y$ and $X$ with good predictive performance.
The matrices $X$ and $Y$ might be high-dimensional ($n < p, q$) and present complex dependence structures, potentially shared due to existing biological pathways or coordinated functions of groups of features.
We propose a general framework of grouped regularized regression methods that includes group inference as part of the model building procedure. Group inference relies on first estimating the relations among features using network analysis techniques and then deriving groups of features using hierarchical clustering. Based on this general framework, three approaches are proposed, with varying levels of complexity in group inference.
The first algorithm, GLasso0, constructs a separate network for each omic source and performs hierarchical clustering on each of the resulting adjacency matrices. Finally, group lasso regression is performed. This approach only allows for omic-specific groups of features, so correlation across omic sources cannot be captured. The second proposed method, GLasso, starts by building a single network from the stacked dataset $M$. Hierarchical clustering is then performed on the resulting adjacency matrix, and group regression is again based on group lasso. This approach is a direct application of the method proposed by Tissier et al. (2018) to the stacked dataset $M$, and it potentially allows for groups including features from different omic datasets. However, when the noise structures of the omic datasets differ, stacking the datasets might be problematic for network construction. Finally, the third proposed approach, OverlapLasso, is an extension of GLasso0 that allows for overlapping groups of features. Namely, after obtaining the omic-specific groups of features, an extra network analysis and hierarchical clustering are conducted at the group level to incorporate additional information shared by the two omic sources.
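All three approaches rely on a penalty that keeps or discards a group of features as a whole. A minimal sketch of that mechanism is the group soft-thresholding operator, the proximal operator of the unsquared Euclidean norm. It is not taken from the text or from the grpreg code; the function name and the example values are hypothetical.

```python
import math

def group_soft_threshold(v, t):
    """Shrink the coefficient vector v of one group towards zero by t.

    If the Euclidean norm of v is at most t, the whole group is set to
    zero; otherwise every coefficient is scaled by the same factor.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        return [0.0] * len(v)
    scale = 1.0 - t / norm
    return [scale * x for x in v]

# A weak group (norm 5, threshold 5) is removed entirely.
dropped = group_soft_threshold([3.0, 4.0], 5.0)

# A stronger group relative to the threshold is kept and shrunk jointly.
kept = group_soft_threshold([3.0, 4.0], 2.5)
```

The all-or-nothing behaviour per group is why the way groups are inferred (per omic source, on the stacked data, or with overlap) matters for which features end up in the model.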
In all three approaches, weighted gene co-expression network analysis (WGCNA) and the dynamic tree cut algorithm for hierarchical clustering were used. Outcome prediction relies on group lasso in the first two procedures (GLasso0 and GLasso) and on an extension that allows features to belong to multiple groups in the case of OverlapLasso. The specific components used in each step are described in more detail in the next section. For each approach, double cross-validation (Mertens et al., 2006, 2011) of the whole process (including group inference) was applied to obtain proper tuning parameters and summary performance measures in the absence of an external validation set. The three procedures have been implemented in the R function MultiPredNet, which is available on GitHub (https://github.com/RenTissier/MultiPredNet). The function calls the packages WGCNA (for network construction and hierarchical clustering), grpreg (group lasso) and grpregOverlap (overlapping group lasso).
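The double cross-validation of the whole process can be sketched as nested sample splits. The code below is a minimal illustration, assuming double cross-validation is read as nested k-fold cross-validation; the fold counts, the seed and the function names are hypothetical, and the exact scheme of Mertens et al. may differ.

```python
import random

def k_fold_indices(indices, k, rng):
    """Shuffle the sample indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def double_cv_splits(n, outer_k=5, inner_k=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for nested CV.

    inner_splits holds (inner_train, inner_valid) pairs drawn only from
    outer_train, so the outer test samples never influence tuning.
    """
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [s for j, fold in enumerate(outer_folds)
                       if j != i for s in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, rng)
        inner_splits = []
        for a, inner_valid in enumerate(inner_folds):
            inner_train = [s for b, fold in enumerate(inner_folds)
                           if b != a for s in fold]
            inner_splits.append((inner_train, inner_valid))
        yield outer_train, outer_test, inner_splits

for outer_train, outer_test, inner_splits in double_cv_splits(20, 4, 3):
    # Inner loop: on each inner_train, rerun group inference and fit the
    # model over a grid of penalties; score on inner_valid to tune.
    # Outer loop: refit on outer_train (group inference included) with the
    # chosen penalty and record the prediction error on outer_test.
    assert not set(outer_train) & set(outer_test)
```

Repeating group inference inside every training portion is what makes the resulting performance estimate cover the whole procedure and not only the final regression step.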
GLasso0: Network-based group-penalized prediction model based on omic-specific group inference
Step 1 Network construction
Input: $Y$, $X$
Output: $A_Y$, $A_X$ (two adjacency matrices)
Step 2 Hierarchical clustering
Input: $A_Y$, $A_X$
Output: $P_Y$, $P_X$ (omic-specific clusters)
Step 3 Outcome prediction: Group lasso
Input: $(M, P_Y, P_X)$
Output: $p + q$ regression coefficients $\beta$
Figure 6.1: GLasso0
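Step 1 produces one adjacency matrix per omic source. A minimal sketch follows, assuming an unsigned WGCNA-style soft-threshold adjacency, a_ij = |cor(x_i, x_j)|^power. The text does not state the network type or the power, so power=6 and the toy data are illustrative assumptions, and the helper names are hypothetical. The paper's implementation uses the R package WGCNA, not this code.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def soft_threshold_adjacency(columns, power=6):
    """Return the matrix of |correlation|^power between feature columns.

    columns is a list of feature vectors, one per feature, each holding
    the n sample values of that feature.
    """
    p = len(columns)
    adj = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            a = abs(pearson(columns[i], columns[j])) ** power
            adj[i][j] = adj[j][i] = a
    return adj

# Toy omic source with three features measured on five samples.
Y_columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],   # nearly collinear with the first
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
A_Y = soft_threshold_adjacency(Y_columns)
```

Raising the absolute correlation to a power pushes weak correlations towards zero while keeping strong ones, so the subsequent hierarchical clustering is driven by the strongest relations. In GLasso0 this would be done once for Y and once for X.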
GLasso: Network-based group-penalized prediction model on stacked datasets $Y$ and $X$
Step 1 Network construction
Input: $M = (Y, X)$
Output: $A_M$ (adjacency matrix)
Step 2 Hierarchical clustering
Input: $A_M$
Output: $P_M$ (clusters)
Step 3 Outcome prediction: Group lasso
Input: $(M, P_M)$
Output: $p + q$ regression coefficients $\beta$
Figure 6.2: GLasso
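Step 3 of GLasso0 and GLasso fits a group lasso. As a hedged reference, one standard way to write the group lasso objective is given below. It is a textbook formulation, not a formula stated in this section: the number of groups $G$, the group sizes $p_g$, the sub-vectors $\beta_{(g)}$, the tuning parameter $\lambda$, the $1/(2n)$ scaling and the $\sqrt{p_g}$ weights are all assumptions, and the outcome and features are assumed to be centered so that no intercept is needed.

```latex
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p+q}}
  \frac{1}{2n} \lVert z - M\beta \rVert_2^2
  + \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \lVert \beta_{(g)} \rVert_2
```

Because the penalty applies an unsquared Euclidean norm to each group, the coefficients of a group are either all set to zero or all retained. In GLasso0 the groups are the clusters in $P_Y$ and $P_X$; in GLasso they are the clusters in $P_M$.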
OverlapLasso: Network-based overlapping group-penalized prediction model based on omic-specific group inference
Step 1 Network construction
Input: $Y$, $X$
Output: $A_Y$, $A_X$ (two adjacency matrices)
Step 2.a. Hierarchical clustering
Input: $A_Y$, $A_X$
Output: $P_Y$, $P_X$ (omic-specific clusters)
Step 2.b. Principal component analysis
Input: $P_Y$, $P_X$
Output: $U = (PC_{P_Y}, PC_{P_X})$, the set of the first two principal components of each cluster in $P_Y$ and in $P_X$
Step 2.c. Network construction + hierarchical clustering on $U$
Input: $U$
Output: $P_U$ (clusters)
Step 2.d. Identification of $(Y, X)$-shared clusters in $P_U$
Input: $P_U$
Output: $P_{UM}$ (clusters)
i. Identify the $m$ clusters obtained in Step 2.c. that contain elements from both $P_X$ and $P_Y$.
ii. For each $i = 1, \ldots, m$ of the clusters identified in i., identify the corresponding variables from $X$ and $Y$.
iii. Denote by $P_{UM}$ the corresponding set of $m$ new clusters.
Step 3 Outcome prediction: Overlapping group lasso
Input: $(M, P_Y, P_X, P_{UM})$
Output: $p + q$ regression coefficients $\beta$
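Step 2.d. can be sketched as a small bookkeeping routine. The code below is an illustration under stated assumptions: the "corresponding variables" of a shared cluster are read as the union of the features of every omic-specific cluster that has a principal component in it, and all names, labels and toy data are hypothetical.

```python
def identify_shared_clusters(pc_to_ucluster, pc_origin, members):
    """Return the (Y, X)-shared clusters as lists of original features.

    pc_to_ucluster: principal component name -> cluster label in P_U
    pc_origin:      principal component name -> (omic, omic-specific cluster)
    members:        (omic, omic-specific cluster) -> list of feature names
    """
    by_ucluster = {}
    for pc, label in pc_to_ucluster.items():
        by_ucluster.setdefault(label, []).append(pc)

    shared = {}
    for label in sorted(by_ucluster):
        origins = {pc_origin[pc] for pc in by_ucluster[label]}
        omics = {omic for omic, _ in origins}
        # i. keep only clusters with components from both omic sources
        if {"Y", "X"} <= omics:
            # ii. map the components back to the original features
            features = []
            for origin in sorted(origins):
                features.extend(members[origin])
            shared[label] = features
    # iii. the values of this dict play the role of P_UM
    return shared

# Toy example: two Y clusters, one X cluster, two components per cluster.
pc_origin = {
    "Y1_pc1": ("Y", 1), "Y1_pc2": ("Y", 1),
    "Y2_pc1": ("Y", 2), "Y2_pc2": ("Y", 2),
    "X1_pc1": ("X", 1), "X1_pc2": ("X", 1),
}
pc_to_ucluster = {
    "Y1_pc1": "u1", "X1_pc1": "u1",   # mixed: shared cluster
    "Y1_pc2": "u2", "Y2_pc1": "u2",   # Y only: not shared
    "Y2_pc2": "u3", "X1_pc2": "u3",   # mixed: shared cluster
}
members = {
    ("Y", 1): ["y1", "y2"],
    ("Y", 2): ["y3", "y4"],
    ("X", 1): ["x1", "x2", "x3"],
}
P_UM = identify_shared_clusters(pc_to_ucluster, pc_origin, members)
```

In this toy example the features x1, x2 and x3 belong to their omic-specific cluster and to both shared clusters, which is the kind of overlap that Step 3 handles with the overlapping group lasso.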