Future applications outlook - Statistical Methods for the Analysis of Contextual Gene Expressio

[Figure 5.18 plot. Axes: rank position (x), loading (y). Gene labels, first panel: CER1, KRT18, RBM47, NDRG1, FLRT3, KITL, SLC39A8, BMP7, FGF5, UPP1. Gene labels, second panel: NDRG1, APLNR, EFNB2, CALCA, RFX3, AMOT, SEMA6D, KITL, DKK1, CYP26A1, TRH, CER1.]

Fig. 5.18 Ordination of the weights associated with Factor 8 (left) and Factor 9 (right). Shown are the names of the genes with the largest weights in absolute value.

[Figure 5.19 plot. Axes: Factor 2, Factor 8. Legend: Epiblast (E5.5), Epiblast (E6.5), PS, Ectoderm, Endoderm, Mesoderm.]

Fig. 5.19 Representation of the cells in factor space using Factor 2 (mesoderm formation) and Factor 8 (endoderm formation).

5.4 Future applications outlook

A key strength of the biofam framework is that it provides a novel perspective on the analysis of molecular differences between distinct sample groups. This is made possible by the explicit modelling of group structure on the sample axis, combined with a user-friendly visualisation package to summarise molecular differences between those groups.

To further explore this direction, we are currently analysing a dataset of 768 cells from the Hematopoietic Stem Cell Compartment in mice, across 2 age groups and 2 mutant groups (Kirschner et al., 2017). Here, biofam can provide a global map of the molecular differences between young and old mice, and help investigate how these are affected by differences in genotype. We are also analysing the Tabula Muris dataset (The Tabula Muris Consortium et al., 2017), which consists of more than 100,000 cells from 20 mouse organs.

Here, the scalability of the biofam software is key to integrating so many cells in a single analysis, which allows the biological determinants of multiple organ functions to be compared. Other applications could include Factor Analysis of standard case-control studies, or the comparative study of molecular phenotypes across multiple environments.

Chapter 6 Concluding remarks

Recent technological advances in gene expression profiling have resulted in a variety of contextual gene expression datasets. This thesis aimed at developing statistical approaches to explicitly account for contextual information when modelling gene expression. The presented methods build on two distinct fields of Machine Learning research, Gaussian Processes and Factor Analysis. We gave a theoretical perspective on these approaches in Chapter 2.

In Chapter 3, we presented Spatial Variance Component Analysis (SVCA), a probabilistic model based on Gaussian Processes for the analysis of spatial gene expression data. Most prominently, SVCA assesses the effect of cell-cell interactions on gene expression profiles.

Using simulated data, we showed that SVCA yielded more accurate estimates of cell-cell interactions than alternative regression models and was more robust to different simulation settings. This was enabled by the flexibility of the Gaussian Process framework, which allows modelling of non-linear positional effects with little prior knowledge about their functional form. We applied SVCA to a protein expression dataset assayed in human breast cancer biopsies, and an RNA expression dataset assayed in the mouse hippocampus. In these applications, we showed that cell-cell interactions are a major driver of gene expression variation at the single-cell level, underlining the importance of developing models of gene expression variation which account for the spatial relationship between cells. We also discussed the biological interpretation of the SVCA cell-cell interaction estimates and found that they were largely in accordance with the function of the genes concerned, even if precise mechanistic interpretation remains difficult.

In Chapter 4, we presented biofam, a software package for the unsupervised analysis of gene expression data in the context of multi-omics experiments and for the combined analysis of multiple sample batches, such as samples from different biological contexts, experimental conditions or tissues.

Biofam extends the framework of Group Factor Analysis. It combines the strengths of published implementations of Group Factor Analysis methods, and adds novel extensions to these models, such as the implementation of non-Gaussian likelihoods and the modelling of a sample group structure. It is implemented as modular software which allows different types of sparsity-inducing priors to be selected in any combination, to best reflect assumptions about the data and to enable the comparison of different models implemented within the same framework. We showed that the biofam variational inference scheme performed well, and we presented ongoing work on a stochastic extension to this inference scheme. In applications to simulated data, we validated the impact of sparsity-inducing priors, and investigated their effect on model identifiability in different simulation settings. This showed that element-wise sparsity-inducing priors helped identify the true latent structure of the data.

In Chapter 5, we illustrated two use cases of the biofam software, with the help of the biofamtools R package for the visualisation and downstream analysis of biofam results.

In an application to a multi-omics dataset of chronic lymphocytic leukaemia, we showed that biofam was able to identify major drivers of variation in a clinically and biologically heterogeneous disease. Most notably, biofam identified previously known clinical markers as well as novel putative molecular drivers of heterogeneity, some of which were predictive of clinical outcome. In a second application, we analysed single-cell RNA expression data assayed across multiple stages of mouse embryonic development. We illustrated how biofam can be used in the context of defined sample groups to provide a compact map of their molecular differences. In the gastrulation context, we showed that the biofam approach provided a coherent way to dissect the major processes involved in germ layer commitment. Biofam is still an ongoing project, with more real data applications in preparation in the context of ageing and cross-species comparisons.

Although the SVCA and biofam modelling approaches extend the state of the art in their respective fields, we have shown in Section 3.7.1 and Section 4.7.3 that both models have several limitations and still leave considerable room for improvement and extension. Most importantly, SVCA would benefit from a finer understanding of the biological meaning of the measured cell-cell interactions, which could be pursued through hypothesis-driven research in simpler biological systems with clear positive and negative controls. For biofam, a pressing direction for improvement is the extension of the current stochastic inference framework with lazy I/O functions, in order to permit the analysis of datasets which do not fit in memory.

SVCA and biofam were developed as disconnected models and are tailored for different types of data contexts. However, future work may try to bring these approaches together in a unified framework where spatial context is modelled jointly with latent factors of variation. A first approach could be to encode the spatial relatedness of cells with a squared exponential covariance prior on the latent variables of the biofam model, whose length scale could be jointly optimised with the model in an Expectation-Maximisation scheme. Although we originally implemented such a spatial covariance feature in the biofam package, it was subsequently dropped for two reasons. First, experiments on simulated data, as well as on spatial transcriptomics (Ståhl et al., 2016) and seqFISH (Shah et al., 2017) data, yielded results indistinguishable from those of standard biofam. Second, the dependency across samples introduced in the prior distribution of the latent variables increased computational complexity, as also demonstrated in Hore (2015). In the case of a univariate prior on the latent variables, and an approximation to the posterior distribution which is factorised over samples, updates of a given latent variable are independent across samples (see Appendix D). This enables a fast vectorised implementation which is no longer possible with a multivariate prior. Future work should address these problems or explore other alternatives.
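To make the squared exponential covariance concrete, the following is a minimal sketch in plain Python. It is not the implementation that was dropped from biofam: the function name `squared_exponential`, the `variance` parameter and the cell coordinates are assumptions introduced for illustration only. It builds the matrix with entries K[i][j] = variance * exp(-||x_i - x_j||^2 / (2 * length_scale^2)) from cell positions.

```python
import math


def squared_exponential(positions, length_scale, variance=1.0):
    """Build a squared exponential covariance matrix from cell positions.

    Hypothetical helper for illustration; not part of the biofam package.
    K[i][j] = variance * exp(-||x_i - x_j||^2 / (2 * length_scale^2))
    """
    if length_scale <= 0:
        raise ValueError("length_scale must be positive")
    n = len(positions)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Squared Euclidean distance between cells i and j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            value = variance * math.exp(-sq_dist / (2.0 * length_scale ** 2))
            # The matrix is symmetric, so fill both triangles at once.
            cov[i][j] = value
            cov[j][i] = value
    return cov


# Made-up 2D coordinates: two neighbouring cells and one distant cell.
cells = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
K = squared_exponential(cells, length_scale=2.0)
```

In this toy example the two neighbouring cells receive a covariance close to the variance, while the distant cell is almost uncorrelated with them. The matrix has one entry per pair of cells, which gives a sense of why a prior coupling all samples is more expensive to handle than a univariate prior.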

Building generative models for bioinformatics requires a thought process which iterates between three components: the model, the data and the biological question or purpose. New perspectives on the studied question or data may emerge from this thought process, which go beyond technical developments alone. For example, Group Factor Analysis did not introduce any technical novelty beyond the already widely used ARD prior. It did, however, provide a new perspective on the analysis of data from different sources and, with it, a principled framework to explore novel questions. Likewise, the models presented in this thesis combine existing models and methods to approach specific datasets and biological questions from a new angle. For example, biofam offers a new perspective on the analysis of global molecular differences between experimental conditions, and provides a principled statistical framework to study them. We hope that this work will inspire users to address new biological questions in this direction.

Appendix A

Supplementary materials for SVCA

A.1 Methodological notes

In document Statistical Methods for the Analysis of Contextual Gene Expression Data (Page 155-161)