
5.2 Analyzing Hi-C Contact Maps

5.2.3 Correlation-based Data Analysis

Hi-C contact information can be analyzed alongside other genomic data, such as the expression levels of genes or data on proteins bound to the genome. This has the advantage of revealing correlations between the two types of information that would be missed if each were analyzed separately. An example of such a combined analysis found that the mouse genome is organized into domains of coordinately regulated enhancers and promoters that coincide with TADs [37]. First, ChIP-seq data of RNA polymerase II, indicative of active promoters, and of H3K4me1, a mark for enhancers, were compared across different tissues and cell types. As these signals were concordantly enriched within clusters in the genome, much like TADs in contact maps of mammalian genomes, the authors compared both types of domains and determined that they indeed overlap. Another correlation-based analysis of ChIP-seq binding profiles and Hi-C data showed that the proteins CTCF and cohesin associate with loops detected within contact maps [36]. Similarly, the previously discussed domain boundaries in the contact map of the wild-type C. crescentus chromosome have been found to correlate with the positions of highly expressed genes identified through DNA microarray experiments.

We have mentioned only a few exemplary studies in which features of Hi-C contact maps were found to correlate with other genomic data. Although such correlations do not imply causality, they are valuable starting points for more specialized studies and hypothesis-driven modeling approaches, since they hint at possible mechanisms of 3D genome organization.
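A correlation of this kind can be illustrated with a minimal sketch. The example below is not taken from any of the cited studies: it assumes a toy contact matrix with two domains, a hypothetical expression track binned at the same resolution, and a simple insulation-like score (the function names `insulation_track` and `pearson` are illustrative choices). It then computes the Pearson correlation between the score and the expression track.

```python
import numpy as np

def insulation_track(contacts, window=2):
    """Mean contact count between the `window` bins upstream and the
    `window` bins downstream of each position. Low values suggest a
    domain boundary. Positions too close to the matrix edge are skipped."""
    c = np.asarray(contacts, dtype=float)
    n = c.shape[0]
    positions = np.arange(window, n - window + 1)
    scores = np.array([c[i - window:i, i:i + window].mean() for i in positions])
    return positions, scores

def pearson(x, y):
    """Pearson correlation coefficient of two equally binned tracks."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Toy contact map (assumption): two domains of six bins each,
# with many contacts inside a domain and few between domains.
n = 12
contacts = np.ones((n, n))
contacts[:6, :6] = 10.0
contacts[6:, 6:] = 10.0

# Hypothetical expression track with a highly expressed gene at the boundary.
expression = np.array([1, 1, 1, 1, 1, 2, 8, 2, 1, 1, 1, 1], dtype=float)

positions, insulation = insulation_track(contacts, window=2)
r = pearson(insulation, expression[positions])
print(f"boundary at bin {positions[np.argmin(insulation)]}, Pearson r = {r:.2f}")
```

In this toy setting the correlation is strongly negative, because the insulation-like score drops exactly where the expression track peaks. With real data, both tracks would first have to be binned to the same resolution and normalized.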

5.3 3D Modeling

Hi-C experiments yield information that can be interpreted using computational models of chromosome organization. There are two key strategies for building such models. The first, data-driven strategy, referred to as 3D reconstruction, uses the contact probabilities summarized in the contact map to determine a structural model that optimally fits the data. The second strategy aims at establishing general principles of chromosome folding and organization on the basis of physical principles in the framework of polymer simulations. In contrast to the first strategy, Hi-C data is not used as input for these polymer models, but rather for their validation. Here, we review several methods employing either of the two strategies. For a more complete overview of 3D reconstruction methods, we refer to the review by Serra et al. [88].

5.3.1 3D Reconstruction

The goal of 3D reconstruction algorithms is to use the contact map as input to recapitulate the underlying 3D structure of a genome. Because 3C-based data is used in this approach to obtain spatial restraints for modeling the genome, 3D reconstruction is also known as restraint-based structure modeling. The basic concept is simple: the closer two genomic loci are in 3D space, the higher the probability that they interact. In technical terms, the assumption is that the Euclidean distance between two loci is inversely proportional to their contact probability. Following this basic notion, there are two strategies for translating the contact probabilities within the contact map into a set of 3D coordinates of loci representing the genome. In the first, optimization-based approach, the total difference between the pairwise distances in the hypothesized set of 3D coordinates and the corresponding distances inferred from the observed contact probabilities is minimized. In the second, model-based strategy, the observed contact probabilities are assumed to follow a probability distribution from which 3D structures can be inferred.
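The optimization-based idea can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the algorithm of any published tool: target distances are taken as the plain inverse of the contact probabilities, the objective is the sum of squared differences between model and target distances, and the minimization uses simple gradient descent. The names `target_distances`, `stress` and `minimize_stress`, as well as the step size and iteration count, are hypothetical choices.

```python
import numpy as np

def target_distances(contact_prob, eps=1e-6):
    """Convert contact probabilities to target distances, assuming d = 1 / p."""
    p = np.asarray(contact_prob, dtype=float)
    d = 1.0 / np.maximum(p, eps)   # eps guards against division by zero
    np.fill_diagonal(d, 0.0)
    return d

def stress(coords, target):
    """Sum over all pairs of (model distance - target distance)^2."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(coords), k=1)
    return float(np.sum((dist[iu] - target[iu]) ** 2))

def minimize_stress(target, n_steps=2000, lr=0.01, seed=0):
    """Gradient descent on the stress, starting from random 3D coordinates."""
    rng = np.random.default_rng(seed)
    n = target.shape[0]
    coords = rng.normal(size=(n, 3))
    for _ in range(n_steps):
        diff = coords[:, None, :] - coords[None, :, :]
        dist = np.linalg.norm(diff, axis=-1)
        np.fill_diagonal(dist, 1.0)               # avoid 0/0 on the diagonal
        coeff = 2.0 * (dist - target) / dist      # d(stress)/d(dist), per pair
        np.fill_diagonal(coeff, 0.0)
        grad = np.sum(coeff[:, :, None] * diff, axis=1)
        coords -= lr * grad
    return coords

# Toy input (assumption): contact probabilities derived from a short helix.
t = np.arange(8, dtype=float)
true_coords = np.column_stack([np.cos(t), np.sin(t), 0.5 * t])
true_dist = np.linalg.norm(true_coords[:, None] - true_coords[None, :], axis=-1)
contact_prob = 1.0 / (true_dist + np.eye(8))

target = target_distances(contact_prob)
initial = minimize_stress(target, n_steps=0)      # random starting structure
fitted = minimize_stress(target, n_steps=2000)
print(stress(initial, target), stress(fitted, target))
```

The fitted coordinates are only defined up to rotation, translation and reflection, since the objective depends on pairwise distances alone.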

Reconstruction methods further differ in whether they infer a single consensus 3D structure or an ensemble of 3D structures. Both consensus and ensemble methods have advantages and disadvantages. Ensemble methods are biologically more plausible, because they reflect the fact that Hi-C data is obtained from an ensemble of conformations. However, the analysis of an inferred ensemble of 3D structures is not straightforward: one option is to characterize the ensemble average [89]; another is to select a few structures that are representative of the diversity of the ensemble [90]. Consensus methods, in contrast, generate a single structure, which can be thought of as a visualization of the contact map and is easy to analyze. Computationally, ensemble methods are more demanding than consensus methods, because they need to sample from a very high-dimensional space of candidate 3D structures.
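As a minimal illustration of characterizing an ensemble by its average, the sketch below computes the mean pairwise distance matrix and the fraction of structures in which two loci are in contact. The function names and the distance cutoff that defines a contact are assumptions made for this example and are not taken from the cited methods.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance matrix of one structure with shape (n_loci, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def ensemble_summary(ensemble, contact_cutoff=1.5):
    """Return the ensemble-averaged distance matrix and, for every pair of
    loci, the fraction of structures in which the pair lies closer than
    `contact_cutoff` (a hypothetical contact definition)."""
    dists = np.stack([pairwise_distances(s) for s in ensemble])
    mean_dist = dists.mean(axis=0)
    contact_freq = (dists < contact_cutoff).mean(axis=0)
    return mean_dist, contact_freq

# Toy ensemble (assumption): two conformations of a three-locus chain.
ensemble = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],   # extended
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]],   # bent
])
mean_dist, contact_freq = ensemble_summary(ensemble)
print(mean_dist[0, 2], contact_freq[0, 2])
```

Here the first and last locus are in contact in only one of the two conformations, so no single structure reproduces the resulting contact frequency. This is the situation that ensemble methods are designed to capture.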

Optimization-based Methods

“ShRec3D” [91] is a method that aims to reconstruct a consensus 3D structure analytically. It builds upon the fact that the contact matrix can be interpreted as the adjacency matrix of a weighted graph, so that the problem can be reformulated as embedding a graph into Euclidean space. This problem, in turn, is well known in the literature and can be solved using classical Multidimensional Scaling (MDS) [92]. Given a set of distances between the vertices of a graph, this method returns a set of Euclidean coordinates. The definition of distances between the vertices of the graph representing the contact matrix is therefore crucial within this framework. The authors chose the shortest-path distance for this purpose, but did not show how other distance definitions, such as the resistance distance or connectivity-based distances, perform in comparison. “ChromSDE” [93] is a numerical method that jointly optimizes the 3D structure and a parameter that maps contact frequencies to spatial distances. The main difference from “ShRec3D” is that the translation of contact frequencies into spatial distances is itself obtained by numerical optimization. Both methods reconstruct a consensus 3D structure. In contrast, Kalhor et al. have proposed an optimization framework that generates an ensemble of structures [89]. The idea behind this approach is to convert contact probabilities into a set of contact restraints for the 3D structures in the ensemble. Any given contact, however, is enforced only in a fraction of the inferred structures, and this fraction corresponds to its contact probability.
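The graph-embedding idea behind a ShRec3D-like reconstruction can be sketched as follows. This is a simplified illustration and not the published implementation: it assumes that the length of an edge is the inverse of the contact frequency, computes shortest-path distances with the Floyd-Warshall algorithm, and embeds them with classical MDS. The function names are hypothetical.

```python
import numpy as np

def shortest_path_distances(contacts):
    """Shortest-path distances on the contact graph (Floyd-Warshall).
    Edge length is assumed to be 1 / contact frequency; pairs without
    contacts have no direct edge."""
    c = np.asarray(contacts, dtype=float)
    n = c.shape[0]
    with np.errstate(divide="ignore"):
        d = np.where(c > 0, 1.0 / c, np.inf)
    np.fill_diagonal(d, 0.0)
    for k in range(n):
        # Allow paths that pass through vertex k.
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

def classical_mds(dist, dim=3):
    """Classical MDS: double-center the squared distances and keep the
    eigenvectors belonging to the `dim` largest eigenvalues."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:dim]
    vals = np.clip(vals[order], 0.0, None)   # discard negative eigenvalues
    return vecs[:, order] * np.sqrt(vals)

def pairwise(coords):
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy input (assumption): contact frequencies that are exactly the inverse
# of the distances in a random six-locus structure.
rng = np.random.default_rng(0)
true_coords = rng.normal(size=(6, 3))
true_dist = pairwise(true_coords)
contacts = 1.0 / (true_dist + np.eye(6))

graph_dist = shortest_path_distances(contacts)
coords = classical_mds(graph_dist, dim=3)
print(np.abs(pairwise(coords) - true_dist).max())
```

In this idealized case the pairwise distances are recovered up to numerical precision, while the coordinates themselves are only determined up to rotation, translation and reflection. With noisy or sparse contact matrices the shortest-path distances are no longer exactly Euclidean, and the embedding becomes an approximation.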

Probabilistic Modeling Methods

In contrast to the optimization-based approaches, probabilistic modeling methods assign an uncertainty to the spatial distances between genomic loci. The observed contact frequency of two loci is typically assumed to follow a Poisson distribution [94, 95]. This accounts for the fact that 3C-based experiments detect contact frequencies among restriction fragments and hence yield count data. This assumption is valid for non-genome-wide input data. However, it is not valid for Hi-C input data, which consist of contact probabilities rather than contact frequencies among genomic loci. The Markov chain Monte Carlo (MCMC)-based method “MCMC5C” [90] is an exception in this respect, since it assumes a Gaussian distribution for the input contact data; it can therefore model both Hi-C contact probabilities and other 3C-based contact frequencies. In this approach, DNA is modeled as a chain of beads representing the 3D structure. The structure is changed iteratively by random moves, each of which is accepted or rejected depending on whether the new 3D structure is more probable given the data. After a sufficient number of iterations, this MCMC scheme samples 3D structures that fit the experimental contact data. By running many such simulations in parallel, a large ensemble of structures is generated. Hu et al. proposed a probabilistic method called “BACH” [94] that models the contact data using a Poisson distribution. In contrast to MCMC5C, BACH uses Monte Carlo methods to gradually refine an initial conformation and to generate a consensus 3D structure. “PASTIS” [95] also models the contact data using a Poisson distribution. It uses maximum likelihood estimation of the model parameters to reconstruct the 3D structure with the highest likelihood given the observed contact data.
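The accept/reject scheme described above can be illustrated with a minimal Metropolis sampler. The sketch below does not reproduce MCMC5C, BACH or PASTIS. It assumes a Poisson model in which the expected count of a pair of loci is `beta * d**alpha`, with fixed, hypothetical values for `alpha` and `beta`, and it moves one bead at a time by a small Gaussian displacement.

```python
import numpy as np

def poisson_log_likelihood(coords, counts, alpha=-1.0, beta=10.0):
    """Log-likelihood, up to an additive constant, of the contact counts
    under counts_ij ~ Poisson(beta * d_ij**alpha)."""
    iu = np.triu_indices(len(coords), k=1)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)[iu]
    rate = beta * np.maximum(dist, 1e-9) ** alpha
    return float(np.sum(counts[iu] * np.log(rate) - rate))

def metropolis(counts, n_steps=2000, step=0.1, seed=0):
    """Metropolis sampling of bead coordinates given a count matrix."""
    rng = np.random.default_rng(seed)
    n = counts.shape[0]
    coords = rng.normal(size=(n, 3))
    logp = poisson_log_likelihood(coords, counts)
    trace = [logp]
    for _ in range(n_steps):
        proposal = coords.copy()
        i = rng.integers(n)                           # pick one bead
        proposal[i] += rng.normal(scale=step, size=3)
        new_logp = poisson_log_likelihood(proposal, counts)
        # Accept with probability min(1, exp(new_logp - logp)).
        if np.log(rng.random()) < new_logp - logp:
            coords, logp = proposal, new_logp
        trace.append(logp)
    return coords, np.array(trace)

# Toy input (assumption): Poisson counts simulated from a short helix.
rng = np.random.default_rng(1)
t = np.arange(8, dtype=float)
true_coords = np.column_stack([np.cos(t), np.sin(t), 0.5 * t])
true_dist = np.linalg.norm(true_coords[:, None] - true_coords[None, :], axis=-1)
counts = rng.poisson(10.0 / (true_dist + np.eye(8)))

coords, trace = metropolis(counts, n_steps=2000)
print(trace[0], trace[-1])
```

Collecting the structures visited after an initial burn-in phase, or running several chains with different seeds, would yield an ensemble. Keeping only the most probable structure, or maximizing the likelihood directly, would yield a consensus structure instead.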