1.3 Tackling heterogeneity of the DNA methylomes
1.3.4 Computational inference and deconvolution
The advantages of cell separation-based and single-cell methods for decreasing and delineat- ing heterogeneity of the methylomes are doubtless. They are, however, outweighed by in- herent drawbacks limiting their utility in many application scenarios. In the same time, an average methylome of a cell sample with a decent depth of support carries information about all cell populations which contributed to it. As a convenient, cost-efficient and time-saving alternative, it was suggested that computational methods can be used to detect and measure tissue heterogeneity of complex samples [Baron et al.,2006;Sehouli et al.,2011]. Earlier this has been successfully demonstrated for the gene expression data [Kuhn et al.,2011;Shen-Orr
et al.,2010].
The later developed computational methods for methylome heterogeneity analysis are usu- ally classified in two large groups depending on whether or not they are using any prior information. Reference-based methods utilize the existing, usually genome-scale profiles of purified cell populations. Reference-free methods attempt to infer the heterogeneity effects without any prior information. Existing reference-free and reference-based approaches are summarized in Table1.1.
Methods for reference-based estimation and adjustment
Houseman et al. suggested a regression calibration-like approach to estimate the contribution of each of the two effects for every tested covariate [Houseman et al.,2012]. Their model relies upon reference methylome measurements which were directly used to find cell type-specific quantitative markers and to model the target data using the values for the markers observed in the reference data set. In addition, they devised a systematic quadratic optimization-based procedure, under the name of constrained projection, that allowed estimation of leukocyte proportions in genome-scale DNA methylation profiles of whole blood samples. The esti- mated proportions could then be considered as covariates in a subsequent association analy- sis [Houseman et al.,2012]. The approach by Houseman et al. was validated in several follow- up studies [Accomando et al.,2014;Koestler et al.,2013].
Similar reference-based approaches were developed focusing at heterogeneity in the brain tissue. Cell epigenotype-specific (CETS) model is using DNA methylation profiles of neuronal-
enriched and depleted fractions for quantifying the proportion of neuronal and non-neuronal (glial) cells in bulk samples of various regions in brain cortex [Guintivano et al.,2013]. CETS mixes measured profiles of NeuN+ and NeuN− enrichment fractions in silico with an incre-
mental range of possible mixing proportions to simulate brain tissue methylomes for each neurons to glia ratio. The observed brain profile is then compared to all simulated mixtures, and the mixing proportions of the best fitting one are considered to represent the proportions of the observed profile. Another study extended the blood-based constrained projection to support the brain tissue data and showed how accurate estimates can be obtained using data from different brain regions instead of the NeuN-sorted references [Montaño et al.,2013].
An important aspect of the reference-based methods is the selection of cell type-specific quantitative marker CpGs. The pilot study in blood applied per-CpG linear modeling of refer- ence data with cell type as an independent variable and selected the top 500 CpGs showing the best F-statistic [Houseman et al.,2012]. Later implementation of the constrained projection in the minfi R-package used a slightly different approach, selecting 100 best markers for each cell type [Aryee et al.,2014]. Finally, a recent paper presents an in-depth systematic method IDOL for addressing this problem, and demonstrates its superiority [Koestler et al.,2016].
At the moment, most of the available reference-based methods use the reference data also in their estimation steps. This hinges upon two significant assumptions. First, it is expected that technical differences between the target and the reference data sets are not affecting the analysis substantially. Second, they assume that the cell type methylomes of the reference in- dividuals are, on average, equivalent to those of the target cohort. While the first requirement can, at least in part, be solved by joint preprocessing of both data sets, the second assump- tion is very hard to verify in practice. This is why it appears more reliable to use reference data set to select ctDMRs, and use the methylation values observed at these loci or CpGs as a basis for estimation or adjustment. An ad hoc implementation of such approach for the cor- rection of cell type heterogeneity in saliva samples [Souren et al.,2013] is presented as a part of Chapter2.
Reference-free correction methods
Reference-based methods showed good performance in multiple data sets [Koestler et al.,2013;
Liu et al.,2013]. However, they largely rely upon the availability of reference DNA methylation
profiles of purified cell types which are not always at hand. In addition, the complete set of contributing epigenome varieties is not known a priori. This stimulated the search for methods that would be independent of such reference measurements.
First two such methods appeared simultaneously [Houseman et al.,2014;Zou et al.,2014]. RefFreeEWAS [Houseman et al.,2014], from the authors of the constrained projection method
[Houseman et al.,2012], is essentially an adaptation of the more general Surrogate Variable
Analysis (SVA) [Leek and Storey, 2007]. The SVA-inspired assumption behind RefFreeEWAS is that the cell type methylomes and proportions are convoluted into both coefficients and residuals of the phenotype-based model of the observed data. Although not obtainable explic- itly, due to orthogonality with other covariates, one can recover the cell type methylomes and proportions in a convoluted form using joint Singular Value Decomposition of the coefficients and residuals. This also allows for respective correction of the phenotype model coefficients to capture only the “direct” methylation effects.
The second method, FaST-LMM-EWASHER [Zou et al.,2014] is a descendant of the Fast- LMM algorithm for the significance analysis in genomic association studies. In a thorough mathematical analysis the authors expose limitations of the standard linear mixed models (LMMs) accounting for covariance structure in capturing the confounding caused by the cell
type variability, as well as of a simple principal component-based correction. The EWASHER model is based on augmenting the full covariance-aware LMM with correction for top princi- pal components of the CpG marker matrix. The fitting algorithm proceeds in iterative fash- ion re-selecting the features for the covariance matrix after addition of each new principal component. The authors demonstrate that EWASHER successfully eliminates the significance inflation in the statistical analysis of DNA methylation data severely affected by the cell type heterogeneity.
Finally, the most recent reference-free approach ReFACTor [Rahmani et al.,2016] is apply- ing sparse PCA for the same purpose of correcting an association analysis for the phenotype of interest. The idea behind ReFACTor is to find a small number t of cell type-specific markers as CpGs which have the lowest distance between their actual observed data and a low-rank approximation of the observed data based on k principal components, where k is the assumed number of underlying cell types. The PCA of the data subset on t CpGs is then supposed to yield scores correlated to the cell type proportions. The PCA scores can subsequently be used as covariates for the adjustment. In the original publication the authors demonstrate the slight superiority of ReFACTor as compared to the earlier published methods RefFreeEWAS and EWASHER.
The available reference-free methods are provably helpful for the analysis of EWAS data affected by strong cell type heterogeneity effects. Purely from the end user perspective the methods return heterogeneity-corrected P -values for significance of association with the tar- get phenotypic variable of interest at each CpG position. Although, this is often a desired analysis outcome, DNA methylation data sets tend to be increasingly more complex and, as a rule, do not meet all prior expectations about the tested hypothesis and the confounding factors. The internal structure of the sampled methylomes often provides for more interesting findings than the starting hypothesis itself. Routine application of such correction methods may impede an unconstrained data exploration and important insights about the analysed data set might be missed. Furthermore, from the technical perspective most of the reference-free methods are explicitly or implicitly based on PCA. PCA is a general-purpose method and the variability components it captures are not guaranteed to exclusively reflect the variability cell type composition. As a result the above methods are at risk of over-correcting the data for other unknown, yet potentially interesting factors.
Methylome deconvolution
The reference-based and reference-free methods above, apart from their specific advantages and disadvantages, were developed with the primary goal of correcting the statistical analysis in DNA methylation association studies. In general, computational methods for heterogeneity analysis significantly improve tractability of the data obtained on multi-cellular specimens. They require no additional costs and minimal labor overhead while providing for very useful inference results. However, if one abstracts oneself from the specific issue of confounding, and looks at methylome heterogeneity from a more general perspective, one can define a related deconvolution problem. The latter can be formulated as follows: given a set of cell sample average methylomes find all unique underlying methylation signatures as well as their mixing proportions.
This deconvolution problem is closely related to the goals pursued by the methods for DNA methylation profile analysis, i.e. the clustering and dimensionality reduction methods briefly reviewed at the end of Section1.2. An ideal method for solving it should be able to delineate the observed variability into “biological”, i.e. corresponding to the real DNA methylation changes, and “technical”, i.e. caused by other effects of non-biological nature, such as batch effects in
sample handling and calling procedure, measurement noise etc.
Full deconvolution was earlier attempted on gene expression data [Gaujoux and Seoighe,
2012,2013; Repsilber et al.,2010]. The promising results obtained with linear methods, such
as standard non-negative matrix factorization (NMF), in gene expression are, at the very least, surprising. Unlike DNA methylation, the expression level is a quantitative signal already at the level of a single cell meaning that the mixing of the expression signals in a multi-cellular sample does not have to obey a linear (proportional) mode. An analogous problem in DNA methylation should be much easier to solve due to its discreteness at the single-cell level and consecutive mixing linearity.
In Chapter 6 we present a computational method which attempts to solve this problem in an unsupervised, reference-free manner. We show that, although this problem is compu- tational hard and resides on the verge of mathematical tractability, the main deconvolution goals can nevertheless be achieved.