Conclusion - Statistical Methods for the Analysis of Contextual Gene Expression Data

3.7 Discussion

3.7.3 Conclusion

There is a growing appreciation of the role of the spatial distribution of proteins, RNA transcripts and other molecules in determining tissue function and its deregulation in disease, with potential value as predictors of clinical outcomes (Bodenmiller, 2016). This is largely driven by the vigorous development of novel technologies that enable us to capture such data (Bodenmiller, 2016; Aichler and Walch, 2015; Lin et al., 2017; Schulz et al., 2018). We believe that the SVCA framework and extensions thereof could be of broad use for analysing these burgeoning spatially resolved molecular data and advancing our understanding of the pathophysiology of multiple diseases.

Chapter 4 Biofam: a flexible framework for Factor Analysis models in biology

In this Chapter, we present biofam, a flexible Factor Analysis framework which can be applied to gene expression data and other biological layers, while accounting for structured data context such as the existence of multiple sample groups or the combined analysis of multiple omics.

The development of biofam was motivated by the prior implementation of the MOFA package (Argelaguet et al., 2018), a more limited factor analysis model for the integration of multiple data types (see Section 5.2). Biofam provides a more efficient inference scheme and additional sparsity-inducing priors, which allow the user to address new biological questions, and is based on a more modular implementation that enables the selection of optional model features in any combination.

This is joint work with Oliver Stegle, Ricard Argelaguet, Danila Bredikin and Yonatan Deloro. I led the development of the package. I designed the model and software in collaboration with Ricard Argelaguet. Ricard Argelaguet and I derived all variational updates and implemented most of the core software, which includes the probabilistic model, the inference routine and the Python user interface. Danila Bredikin and Yonatan Deloro joined the project later, contributed to the implementation of some core software components and implemented most of the R package for downstream analysis presented in Section 5.1, based on previous work from Britta Velten and Ricard Argelaguet. I designed and implemented the stochastic inference extension, and tested it with the help of Yonatan Deloro, an intern I supervised.

All analyses presented in this chapter are the result of my own work, except the simulations from the non-Gaussian data, which were performed by Ricard Argelaguet (Fig. 4.14 and 4.15). In Chapter 5, we illustrate use cases of biofam on real data by others and in collaboration with me. The software is open source and freely available here: https://github.com/biofam/biofam.

4.1 Introduction

Unsupervised models such as clustering or dimensionality reduction are widely used approaches for the exploratory first-pass analysis of high-dimensional data. For example, Principal Component Analysis (Hotelling, 1933) (see Section 2.2) computes a low dimensional representation of the data which can be easily visualised and highlights the main dependencies between samples, such as subpopulations (Novembre et al., 2008; Pollen et al., 2014; Islam et al., 2011; Shalek et al., 2014; Tang et al., 2010), technical batch effects (Luo et al., 2010; Holmes et al., 2011; Yang et al., 2008; Lazar et al., 2013) or continuous relationships (Wang et al., 2013; Haghverdi et al., 2015; Guo et al., 2010).
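As an illustration of the low dimensional representation mentioned above, the following is a minimal sketch of Principal Component Analysis computed through a singular value decomposition with NumPy. The function name `pca`, the toy data and the two simulated subpopulations are assumptions made for this example only and are not taken from the thesis or from the biofam software.

```python
import numpy as np

def pca(Y, n_components):
    """Project the samples (rows of Y) onto the top principal components."""
    Yc = Y - Y.mean(axis=0)                              # centre each feature
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]      # low dimensional representation
    loadings = Vt[:n_components].T                       # feature weights per component
    explained = s[:n_components] ** 2 / np.sum(s ** 2)   # fraction of variance explained
    return scores, loadings, explained

rng = np.random.default_rng(0)
# Toy data (hypothetical): 100 samples, 20 features, two subpopulations of 50 samples
Y = rng.normal(size=(100, 20))
Y[50:, :5] += 4.0   # second subpopulation is shifted in the first five features

scores, loadings, explained = pca(Y, n_components=2)
# Average position of each subpopulation along the first component
print(scores[:50, 0].mean(), scores[50:, 0].mean())
```

In this toy setting the first component is expected to separate the two simulated subpopulations, which is the kind of sample structure that a first-pass PCA plot is typically used to reveal.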

With recent advances in experimental techniques, gene expression datasets are measured in an increasing number of different contexts, including multiple tissues (GTEx Consortium, 2013; Nica et al., 2011), and in combination with other biological layers such as methylation (Hasin et al., 2017). This gives rise to structured datasets, where samples belong to distinct groups, and features to distinct views, with sometimes different data modalities. These new datasets call for dimensionality reduction methods that explicitly account for this contextual information.

We will here build on the probabilistic Factor Analysis framework introduced in Chapter 2. This formulation allows for the definition of hierarchical priors which reflect prior knowledge about the data context and structure, or prior assumptions about the distribution of the model parameters. In addition, Factor Analysis has other advantages such as the ability to handle datasets with missing values.
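One way to see why a probabilistic formulation accommodates missing values is that the data likelihood can be evaluated over the observed entries only. The sketch below assumes the standard linear factor model Y = Z W^T + noise with a single Gaussian noise precision `tau`; the function name, the NaN encoding of missing values and the scalar precision are assumptions for this example and do not describe the biofam implementation.

```python
import numpy as np

def observed_gaussian_loglik(Y, Z, W, tau):
    """Gaussian log-likelihood of Y under Y = Z W^T + noise, over observed entries only.

    Missing values are encoded as NaN and skipped.
    tau is the noise precision (hypothetical scalar; a model may use one per feature).
    """
    mask = ~np.isnan(Y)                 # True where an entry was observed
    resid = (Y - Z @ W.T)[mask]         # residuals of the observed entries
    n_obs = mask.sum()
    return 0.5 * n_obs * np.log(tau / (2 * np.pi)) - 0.5 * tau * np.sum(resid ** 2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 3))            # factors: samples x factors
W = rng.normal(size=(10, 3))            # weights: features x factors
Y = Z @ W.T + rng.normal(scale=0.5, size=(30, 10))

Y_missing = Y.copy()
Y_missing[0, 0] = np.nan                # a single missing entry
Y_missing[5:8, 2] = np.nan              # one feature missing in three samples

ll_full = observed_gaussian_loglik(Y, Z, W, tau=4.0)
ll_missing = observed_gaussian_loglik(Y_missing, Z, W, tau=4.0)
```

Because each entry contributes an independent term given the factors and weights, dropping the missing entries leaves a valid likelihood for the remaining data, with no imputation step required.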

Probabilistic methods for linear dimensionality reduction

Multiple variants of Factor Analysis have been developed and applied to the analysis of gene expression datasets. They differ in the choice of the prior distributions used on the model parameters, which are tailored to a given context or application. For example, while most models typically assume Normally distributed weights and factors (see Section 2.2), Independent Component Analysis assumes a non-Normal distribution of the weights (Lawrence and Bishop, 2000), which induces statistical independence across latent factors.

In addition, Factor Analysis extensions differ in their choice of hierarchical priors. Group Factor Analysis (Virtanen et al., 2012; Klami et al., 2015) models prior knowledge about existing groups of features, which allows for the combined analysis of data from multiple sources. Conventional (Group) Factor Analysis tends to infer dense weight matrices, so that the latent factors capture as much variability in the data as possible (minimising the reconstruction error). This makes it challenging to relate the latent factors to known biological processes, which involve a small subset of genes. To address this issue, sparse extensions of Factor Analysis (Zhao et al., 2016; Khan et al., 2014; Gao et al., 2013) model the assumption that meaningful drivers of variation affect a limited number of features, resulting in the inference of sparse weight matrices which are more easily interpretable as biological processes. Bi-clustering models extend this sparsity assumption to the factors (Hochreiter et al., 2010; Suvitaival et al., 2014; Bunte et al., 2016; Leppäaho et al., 2017).

While Group Factor Analysis models are useful for integrating multiple types of data sources for matching samples, existing models do not account for group structures on the sample axis. Thus, they do not offer a principled framework for the unsupervised analysis of multidimensional data across biological contexts, a problem which remains less studied. A notable exception is the GFA mixture model of Remes et al. (2015), which however offers only limited options and does not provide runnable software (see Section 4.7.1).

In addition, current GFA implementations each have a number of specific limitations. Commonly employed inference using Gibbs sampling does not scale to large datasets (Khan et al., 2014; Bunte et al., 2016; Leppäaho et al., 2017). Other implementations do not handle missing values (Khan et al., 2014; Zhao et al., 2016; Remes et al., 2015; Klami et al., 2015; Virtanen et al., 2012). These models rely solely on a Gaussian likelihood, which is not well suited to the analysis of data with different modalities such as binary mutation or methylation data. Finally, most of the tools available for dimensionality reduction provide only limited functionality for downstream analysis of the inferred latent factors and weights. At the end of this Chapter (Section 4.7.1), Table 4.3 gives a summary of the characteristics of published Group Factor Analysis models.

Biofam: a unified framework for Factor Analysis in biology

Fig. 4.1 Biofam enables joint Factor Analysis across multiple omics and multiple sample groups. Weight matrices are represented in red, and factor matrices have the colour of the sample groups they correspond to. Latent factors may be relevant to specific omics and specific sample groups only, as illustrated by the null column in the weight matrices and the null rows in the factor matrices (white cells). Biofam also supports datasets with missing values, including samples missing an entire assay, as represented by empty columns in the data matrix (and likewise, features missing across an entire sample group).

Here, we propose biofam (Bio - Factor Analysis models), a unified Factor Analysis modelling framework which brings together the individual strengths of the multiple separate GFA models and implementations mentioned before. Biofam is implemented in a modular manner, so that grouping structures and sparsity-inducing priors can be encoded flexibly on the feature axis, on the sample axis or on both. Thus, it is able to fit existing FA, GFA, ICA and bi-clustering models, with or without sparsity-inducing priors.
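The data structure illustrated in Fig. 4.1 can be sketched with a small simulation: one weight matrix per view, one factor matrix per sample group, and one data matrix for each (view, group) pair. This is only an illustrative sketch under a linear Gaussian assumption; the group names, view names, dimensions and matrix orientation (samples in rows) are hypothetical and do not reflect the biofam API.

```python
import numpy as np

rng = np.random.default_rng(1)
n_factors = 3
group_sizes = {"group_A": 40, "group_B": 60}      # hypothetical sample groups
view_sizes = {"rna": 200, "methylation": 150}     # hypothetical views (omics)

# One factor matrix per sample group (samples x factors)
Z = {g: rng.normal(size=(n, n_factors)) for g, n in group_sizes.items()}
# One weight matrix per view (features x factors)
W = {m: rng.normal(size=(d, n_factors)) for m, d in view_sizes.items()}

# Factors can be inactive in a given view or a given sample group
W["methylation"][:, 2] = 0.0    # third factor has no effect on the methylation view
Z["group_B"][:, 1] = 0.0        # second factor is absent from group_B

# One data matrix per (view, group) pair: Y = Z W^T + Gaussian noise
Y = {
    m: {
        g: Z[g] @ W[m].T + rng.normal(scale=0.5, size=(group_sizes[g], view_sizes[m]))
        for g in group_sizes
    }
    for m in view_sizes
}

# Five samples of group_A are missing the entire methylation assay
Y["methylation"]["group_A"][:5, :] = np.nan
```

Sharing the weights across groups and the factors across views is what ties the individual matrices into one joint model; the zeroed column of weights and the zeroed factor values play the role of the white cells in Fig. 4.1.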

In addition to unifying existing GFA modelling approaches, biofam offers a number of unique features. Firstly, it models grouping structures on the sample axis, facilitating new types of gene expression analysis such as the combined analysis of datasets across different biological contexts (Fig. 4.1). Biofam also extends GFA to support Poisson and Bernoulli data likelihoods, which are particularly suited to the analysis of multi-omics data where different data sources exhibit different modalities, such as binary data in the case of somatic mutations. Biofam is implemented with an efficient inference scheme based on variational methods and GPU optimisation, which makes it scalable to large datasets. We also present ongoing work regarding an extension to stochastic variational inference, which holds
