Conclusions - Machine Learning Models for Epigenomics

The vast amount of data created by the ENCODE Consortium, Roadmap Epigenomics Consortium, and others (Martens & Stunnenberg 2013; The GTEx Consortium 2013; The Fantom Consortium et al. 2014) has allowed for an unprecedented look into the mechanisms of cellular and transcriptional regulation. The uniformity and depth of data are an invaluable resource to better understand the basis of human disease. In particular, it becomes possible to look at the interplay between different genomic elements, such as chromatin interactions, chromatin accessibility, histone mark occupancy and transcription factor binding, using statistical and machine learning models.

However, great care must be taken not to overinterpret results from more complex machine learning models. Overinterpretation of opaque models has caused much contention in the field, for example over whether enhancer-promoter interactions are predictable (Whalen et al. 2016; Xi & Beer 2018). Overinterpretation also exists at the data level: although kilobase-resolution Hi-C maps have been reported (Rao et al. 2014), they have not been shown to be usable at that resolution. Additionally, because so many features are strongly correlated (e.g., DNA sequence, TF binding, histone marks, chromatin accessibility, mappability, and k-mer frequency), it is important not to draw overly strong conclusions from complicated models. Because the space of possible hypotheses is so vast, it is difficult to show that an association is causal without directly performing experiments in vivo. In our analyses, we have aimed to eliminate spurious associations by accounting for potential sources of bias and creating models that are as interpretable as possible.

We showed that we can use high-resolution, one-dimensional genomic data (ChIP-seq of histone marks and transcription factors, and DNase-seq) to fine-map coarse, two-dimensional Hi-C data. We suspect that c-SCNN is not only useful for its designed application, but can also be extended to different types of data or modified to address different problems altogether. In particular, training a model and then applying feature attribution to its predictions allows a deeper look into what the model is learning. Although feature attribution methods are a rapidly evolving area of machine learning, they prove to be quite useful and robust in contexts such as fine-mapping.
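As a rough illustration of feature attribution, the sketch below applies integrated gradients (Sundararajan et al. 2017) to a toy logistic model. It is not the c-SCNN model: the function names, weights, and four-feature input are hypothetical stand-ins chosen only to show the mechanics, and the path integral is approximated with a midpoint Riemann sum.

```python
import numpy as np

def predict(x, w, b):
    """Toy stand-in model (an assumption, not c-SCNN): logistic regression."""
    return 1.0 / (1.0 + np.exp(-(np.dot(x, w) + b)))

def predict_grad(x, w, b):
    """Analytic gradient of the toy model's output with respect to the input x."""
    p = predict(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients along the straight path baseline -> x."""
    # Midpoints of `steps` equal sub-intervals of [0, 1].
    alphas = (np.arange(steps) + 0.5) / steps
    # Interpolated inputs along the path, shape (steps, n_features).
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([predict_grad(point, w, b) for point in path])
    # Average gradient along the path, scaled by the input difference.
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical weights and input; the third feature has zero weight.
w = np.array([1.5, -2.0, 0.0, 0.5])
b = -0.25
x = np.array([1.0, 0.5, 2.0, 1.0])
baseline = np.zeros_like(x)

attributions = integrated_gradients(x, baseline, w, b)
# Completeness: attributions should sum to predict(x) - predict(baseline).
completeness_gap = attributions.sum() - (predict(x, w, b) - predict(baseline, w, b))
```

In this toy setting, a feature the model ignores receives zero attribution, and the attributions sum (up to numerical error) to the change in model output relative to the baseline, which is the property that makes such scores convenient for ranking candidate positions.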

We have shown that we can use a mixture model of HMMs (HMX) to create region-specific annotations for protein-coding genes. These annotations can be used either instead of or in addition to expression data. They could be applied to studies such as GWAS to select a gene set more likely to be relevant to a given cell type, and then combined with LDSC (Finucane et al. 2018) to help prioritize variants or cell types that are more likely to be causal for a disease. They could also be used for cell type deconvolution, which is applicable to problems involving single-cell data (Macosko et al. 2015; Klein et al. 2015). For example, cell-free DNA (cfDNA) has been proposed as a non-invasive technology to screen for cancer (Kang et al. 2017; Salvi et al. 2016; Heitzer et al. 2015). Briefly, cfDNA is released by apoptotic cells and has non-random fragmentation patterns that inform the nucleosome positioning of the cell type of origin (Ulz et al. 2016). Because nucleosome positioning is informative of chromatin state (Snyder et al. 2016), cfDNA can in turn inform the chromatin state of the cells it comes from. Finally, if we consider bulk cfDNA as arising from a mixture of cell-type-specific chromatin states, we can infer the proportions of cell types most likely to compose the cfDNA. Any “surprising” cell type, e.g., a significant breast or colon cell-type fraction, can be indicative of cancer.
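The deconvolution idea can be sketched under strong simplifying assumptions. Here each cell type is reduced to a fixed categorical distribution over chromatin states (a stand-in for the HMX annotations, not the HMX model itself), the bulk sample is a vector of state counts, and the mixing proportions are estimated by expectation-maximization. The cell-type names, state profiles, noise-free synthetic counts, and the 5% flagging threshold are all hypothetical.

```python
import numpy as np

def estimate_mixture_proportions(profiles, observed_counts, n_iter=1000):
    """EM estimate of mixing proportions for known component distributions.

    profiles: array (n_cell_types, n_states); each row sums to 1.
    observed_counts: array (n_states,) of state counts in the bulk sample.
    """
    profiles = np.asarray(profiles, dtype=float)
    counts = np.asarray(observed_counts, dtype=float)
    n_types = profiles.shape[0]
    pi = np.full(n_types, 1.0 / n_types)  # start from uniform proportions
    for _ in range(n_iter):
        # E-step: probability that an observation in each state came from each type.
        weighted = pi[:, None] * profiles
        resp = weighted / np.maximum(weighted.sum(axis=0), 1e-300)
        # M-step: proportions are the count-weighted average responsibilities.
        pi = (resp * counts).sum(axis=1) / counts.sum()
    return pi

# Hypothetical cell types and chromatin-state profiles (4 states).
cell_types = ["blood", "liver", "colon"]
profiles = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
])
true_pi = np.array([0.7, 0.2, 0.1])
# Synthetic, noise-free bulk counts generated from the assumed mixture.
observed_counts = 10000 * (true_pi @ profiles)

pi_hat = estimate_mixture_proportions(profiles, observed_counts)

# Flag cell types that are not expected in the sample but exceed a threshold.
unexpected = {"colon"}
threshold = 0.05
flagged = [name for name, frac in zip(cell_types, pi_hat)
           if name in unexpected and frac > threshold]
```

With real cfDNA data the counts would be noisy and the profiles themselves uncertain, so the estimated fractions would need confidence intervals before any cell type is called surprising.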

We have shown multiple suggestive patterns and correlations between combinatorial histone marks and exon inclusion levels. However, to develop a more general model, one would need to control for gene length, length of introns, position along a transcript (exon number), local splicing patterns, etc. While these factors can be interesting in the context of splicing, gene structure can become a very difficult covariate to account for when studying the epigenetic determinants of splicing. We can modify HMX to model genes on a per-element basis such as exons and introns, facilitating comparison across different exons. Ideally, one would have deep data across multiple cell types, making it possible to compare a single exon across multiple cell types. This would account for many covariates and provide greater certainty in results.

Finally, we showed that we can predict TF peaks that contain motifs and re-prioritize borderline-scoring TF motifs using a supervised learning framework. Although an unsupervised framework is ideal for TFs without motifs or for other ChIP-seq targets, finding a robust learning methodology for cases where TFs bind to motifs is more difficult because TFs can bind in different contexts. For example, TFs can bind in the presence of different histone marks, and in either open or closed chromatin. Moreover, validation of “direct” binding predictions, either for TFs or other ChIP-seq targets, is difficult in the general sense, particularly because binding to DNA need not be sequence driven.

Although there are many difficulties in applying machine learning to such complex data, great advances have been made in better understanding the functional genome. As the cost of producing data decreases and the availability of deep data increases, so does the need for machine learning models that can learn from and help interpret this data.

References

Abadi, M. et al., 2016. TensorFlow: A System for Large-Scale Machine Learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, pp.265–283.

Andersson, R. et al., 2009. Nucleosomes are well positioned in exons and carry characteristic histone modifications. Genome Research, 19, pp.1732–1741.

Ay, F., Bailey, T.L. & Noble, W.S., 2014. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Research, 24, pp.999–1011.

Ballard, D.H., 1987. Modular Learning in Neural Networks. AAAI Proceedings, pp.279–284.

Barash, Y. et al., 2010. Deciphering the splicing code. Nature, 465, pp.53–59.

Barski, A. et al., 2007. High-Resolution Profiling of Histone Methylations in the Human Genome. Cell, 129, pp.823–837.

Bergstra, J. & Bengio, Y., 2012. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, pp.281–305.

Bernstein, B.E. et al., 2006. A Bivalent Chromatin Structure Marks Key Developmental Genes in Embryonic Stem Cells. Cell, 125, pp.315–326.

Boyle, A.P. et al., 2011. High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome Research, 21, pp.456–464.

Bromley, J., Guyon, I. & Lecun, Y., 1994. Signature Verification using a “Siamese” Time Delay Neural Network. American Telephone and Telegraph Company, pp.737–744.

Buratti, E., Baralle, M. & Baralle, F.E., 2006. Defective splicing, disease and therapy: searching for master checkpoints in exon definition. Nucleic Acids Research, 34(12), pp.3494–3510.

Cameron, C.J., Dostie, J. & Blanchette, M., 2018. Estimating DNA-DNA interaction frequency from Hi-C data at restriction-fragment resolution. bioRxiv, (5), pp.1–20.

Cao, Q. et al., 2017. Reconstruction of enhancer–target networks in 935 samples of human primary cells, tissues and cell lines. Nature Genetics, 49(10), pp.1428–1436.

Carron, L. et al., 2019. Boost-HiC: Computational enhancement of long-range contacts in chromosomal contact maps. Bioinformatics, pp.1–6.

Crick, F., 1970. Central Dogma of Molecular Biology. Nature, 227, pp.561–563.

Das, M.K. & Dai, H.-K., 2007. A survey of DNA motif finding algorithms. BMC Bioinformatics, 8(21), pp.1–13.

Davydov, E.V. et al., 2010. Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++. PLoS Computational Biology, 6(12), pp.1–13.

Dekker, J. et al., 2017. The 4D nucleome project. Nature, 549, pp.219–226.

ENCODE Consortium, 2012. An integrated encyclopedia of DNA elements in the human genome. Nature, 489, pp.57–74.

Ernst, J. et al., 2011. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473, pp.43–49.

Ernst, J. & Bar-Joseph, Z., 2006. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics, 7(191), pp.1–11.

Ernst, J. & Kellis, M., 2012. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods, 9(3), pp.215–216.

Ernst, J. & Kellis, M., 2013. Interplay between chromatin state, regulator binding, and regulatory motifs in six human cell types. Genome Research, 23, pp.1142–1154.

Ernst, J. & Kellis, M., 2015. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nature Biotechnology, 33(4), pp.364–376.

Farré, P. et al., 2018. Dense neural networks for predicting chromatin conformation. BMC Bioinformatics, 19(372), pp.1–12.

Faustino, N.A. & Cooper, T.A., 2003. Pre-mRNA splicing and human disease. Genes & Development, 17, pp.419–437.

Finucane, H.K. et al., 2018. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature Genetics, 50, pp.621–629.

Finucane, H.K. et al., 2015. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics, 47(11), pp.1228–1235.

Fortin, J. & Hansen, K.D., 2015. Reconstructing A/B compartments as revealed by Hi-C using long-range correlations in epigenetic data. Genome Biology, 16(180), pp.1–23.

Fullwood, M.J. et al., 2009. An oestrogen-receptor-alpha-bound human chromatin interactome. Nature, 462(7269), pp.58–64.

Fullwood, M.J. & Ruan, Y., 2009. ChIP-Based Methods for the Identification of Long-Range Chromatin Interactions. Journal of Cellular Biochemistry, 107, pp.30–39.

Ge, X. et al., 2019. EpiAlign: an alignment-based bioinformatic tool for comparing chromatin state sequences. Nucleic Acids Research, pp.1–12.

Glajch, K.E. & Sadri-Vakili, G., 2015. Epigenetic Mechanisms Involved in Huntington’s Disease Pathogenesis. Journal of Huntington’s Disease, 4, pp.1–15.

Guo, Y. & Gifford, D.K., 2017. Modular combinatorial binding among human trans-acting factors reveals direct and indirect factor binding. BMC Genomics, 18(45), pp.1–16.

He, B. et al., 2014. Global view of enhancer–promoter interactome in human cells. PNAS, 111(21), pp.2191–2199.

He, H.H. et al., 2014. Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. Nature Methods, 11(1), pp.73–78.

He, Y. & Wang, T., 2017. EpiCompare: an online tool to define and explore genomic regions with tissue or cell type-specific epigenomic features. Bioinformatics, 33(20), pp.3268–3275.

Heitzer, E., Ulz, P. & Geigl, J.B., 2015. Circulating Tumor DNA as a Liquid Biopsy for Cancer.

Hoffman, M.M. et al., 2013. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Research, 41(2), pp.827–841.

Hoffman, M.M. et al., 2012. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods, 9(5), pp.473–476.

Huang, J. et al., 2015. Predicting chromatin organization using histone marks. Genome Biology, 16(162), pp.1–11.

Javierre, B.M. et al., 2016. Lineage-Specific Genome Architecture Links Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell, 167, pp.1369–1384.

Kang, S. et al., 2017. CancerLocator: non-invasive cancer diagnosis and tissue-of-origin prediction using methylation profiles of cell-free DNA. Genome Biology, 18(53), pp.1–12.

Kharchenko, P.V. et al., 2011. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature, 471, pp.480–485.

Kheradpour, P. & Kellis, M., 2014. Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Research, 42(5), pp.2976–2987.

Klein, A.M. et al., 2015. Droplet Barcoding for Single-Cell Transcriptomics Applied to Embryonic Stem Cells. Cell, 161(5), pp.1187–1201.

Knight, P.A. & Ruiz, D., 2012. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33, pp.1029–1047.

Koch, G., Zemel, R. & Salakhutdinov, R., 2015. Siamese Neural Networks for One-shot Image Recognition. Proceedings of the 32nd International Conference on Machine Learning, 37, pp.1–8.

Kolasinska-Zwierz, P. et al., 2009. Differential chromatin marking of introns and expressed exons by H3K36me3. Nature Genetics, 41(3), pp.376–381.

Kooistra, S.M. & Helin, K., 2012. Molecular mechanisms and potential functions of histone demethylases. Nature Reviews, 13(5), pp.297–311.

Landt, S.G. et al., 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research, 22, pp.1813–1831.

de Leeuw, C.A. et al., 2015. MAGMA: Generalized Gene-Set Analysis of GWAS Data. PLoS Computational Biology, 11(4), pp.1–19.

Lek, M. et al., 2016. Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536, pp.285–291.

Leung, M.K.K. et al., 2014. Deep learning of the tissue-regulated splicing code. Bioinformatics, 30, pp.121–129.

Li, G. et al., 2012. Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell, 148(1–2), pp.84–98.

Li, H., Qiu, J. & Fu, X.-D., 2012. RASL-seq for Massively Parallel and Quantitative Analysis of Gene Expression. Current Protocols in Molecular Biology, pp.1–9.

Li, T. et al., 2018. Integrative analysis reveals functional and regulatory roles of H3K79me2 in mediating alternative splicing. Genome Medicine, 10(30), pp.1–11.

Liang, J. et al., 2014. Chromatin Immunoprecipitation Indirect Peaks Highlight Long-Range Interactions of Insulator Proteins and Pol II Pausing. Molecular Cell, 53, pp.672–681.

Lieberman-Aiden, E. et al., 2009. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science, 326, pp.289–293.

Luco, R.F. et al., 2011. Epigenetics in Alternative Pre-mRNA Splicing. Cell, 144, pp.16–26.

Luco, R.F. et al., 2010. Regulation of Alternative Splicing by Histone Modifications. Science, 327, pp.996–1000.

Lupiáñez, D.G. et al., 2015. Disruptions of Topological Chromatin Domains Cause Pathogenic Rewiring of Gene-Enhancer Interactions. Cell, 161, pp.1012–1025.

Macosko, E.Z. et al., 2015. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell, 161(5), pp.1202–1214.

Manolio, T.A. et al., 2009. Finding the missing heritability of complex diseases. Nature, 461, pp.747–753.

Martens, J.H.A. & Stunnenberg, H.G., 2013. BLUEPRINT: mapping human blood cell epigenomes. Haematologica, 98(10), pp.1487–1489.

Mifsud, B. et al., 2015. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nature Genetics, 47(6), pp.598–606.

Mortazavi, A. et al., 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7), pp.621–628.

Mumbach, M.R. et al., 2016. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nature Methods, 13(11), pp.919–922.

Naumova, N. et al., 2012. Analysis of long-range chromatin interactions using Chromosome Conformation Capture. Methods, 58(3), pp.192–203.

Niimura, Y. & Nei, M., 2007. Extensive Gains and Losses of Olfactory Receptor Genes in Mammalian Evolution. PLoS ONE, (8), pp.1–8.

Nissim-Rafinia, M. & Kerem, B., 2002. Splicing regulation as a potential genetic modifier. TRENDS in Genetics, 18(3), pp.123–127.

Orphanides, G., Lagrange, T. & Reinberg, D., 1996. The general transcription factors of RNA polymerase II. Genes & Development, 10, pp.2657–2683.

Park, P.J., 2009. ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews, 10, pp.669–80.

Pope, B.D. et al., 2014. Topologically associating domains are stable units of replication-timing regulation. Nature, 515, pp.402–405.

Poptsova, M.S. et al., 2014. Non-random DNA fragmentation in next-generation sequencing. Scientific Reports, 4(4532), pp.1–6.

Pulecio, J. et al., 2017. CRISPR/Cas9-Based Engineering of the Epigenome. Cell Stem Cell, 21, pp.431–447.

removal from ChIP-seq data clarifies true binding signal and its functional correlates. Epigenetics & Chromatin, 8(33), pp.1–16.

Rao, S.S.P. et al., 2014. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159(7), pp.1665–1680.

Roadmap Epigenomics Consortium et al., 2015. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539), pp.317–329.

Roeder, R.G., 1996. The role of general initiation factors in transcription by RNA polymerase II. Trends in Biochemical Sciences, 4(96), pp.327–335.

Roy, S. et al., 2015. A predictive modeling approach for cell line-specific long-range regulatory interactions. Nucleic Acids Research, 43(18), pp.8694–8712.

Salvi, S. et al., 2016. Cell-free DNA as a diagnostic marker for cancer: current insights. OncoTargets and Therapy, 9, pp.6549–6559.

Sanborn, A.L. et al., 2015. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. PNAS, 112(47), pp.6456–6465.

Schor, I.E. et al., 2009. Neuronal cell depolarization induces intragenic chromatin modifications affecting NCAM alternative splicing. PNAS, 106(11), pp.4325–4330.

Shen, S. et al., 2014. rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data. PNAS, 111(51), pp.5593–5601.

Snyder, M.W. et al., 2016. Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin. Cell, 164, pp.57–68.

Song, L. & Crawford, G.E., 2010. DNase-seq: A High-Resolution Technique for Mapping Active Gene Regulatory Elements across the Genome from Mammalian Cells. Cold Spring Harbor Laboratory Press, 2010(2), pp.1–12.

Soufi, A. et al., 2015. Pioneer Transcription Factors Target Partial DNA Motifs on Nucleosomes to Initiate Reprogramming. Cell, 161(3), pp.555–568.

Srivastava, N. et al., 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, pp.1929–1958.

Sundararajan, M., Taly, A. & Yan, Q., 2017. Axiomatic Attribution for Deep Networks. Proceedings of the 34th International Conference on Machine Learning, 70, pp.1–11.

Sung, M.-H. et al., 2014. DNase Footprint Signatures Are Dictated by Factor Dynamics and DNA Sequence. Molecular Cell, 56, pp.275–285.

Teif, V.B. et al., 2012. Genome-wide nucleosome positioning during embryonic stem cell development. Nature Structural & Molecular Biology, 19(11), pp.1185–1192.

Tennyson, C.N., Klamut, H.J. & Worton, R.G., 1995. The human dystrophin gene requires 16 hours to be transcribed and is cotranscriptionally spliced. Nature, 9, pp.184–190.

The Fantom Consortium, RIKEN PMI & CLST (DGT), 2014. A promoter-level mammalian expression atlas. Nature, 507(7493), pp.462–70.

The Gene Ontology Consortium, 2000. Gene Ontology: tool for the unification of biology.

The GTEx Consortium, 2017. Genetic effects on gene expression across human tissues. Nature, 550, pp.204–213.

The GTEx Consortium, 2013. The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 45(6), pp.580–585.

Ulz, P. et al., 2016. Inferring expressed genes by whole-genome sequencing of plasma DNA. Nature Genetics, 48(10), pp.1273–1278.

Valouev, A. et al., 2011. Determinants of nucleosome organization in primary human cells. Nature, 474, pp.516–520.

Visscher, P.M. et al., 2012. Five Years of GWAS Discovery. The American Journal of Human Genetics, 90, pp.7–24.

Wang, E.T. et al., 2008. Alternative isoform regulation in human tissue transcriptomes. Nature, 456, pp.470–476.

Wang, J.R., Quach, B. & Furey, T.S., 2017. Correcting nucleotide-specific biases in high-throughput sequencing data. BMC Bioinformatics, 18(357), pp.1–10.

Wang, Z., Gerstein, M. & Snyder, M., 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews, 10, pp.57–63.

Whalen, S., Truty, R.M. & Pollard, K.S., 2016. Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nature Genetics, pp.1–10.

Won, H. et al., 2016. Chromosome conformation elucidates regulatory relationships in developing human brain. Nature, 538(7626), pp.523–527.

Wuarin, J. & Schibler, U., 1994. Physical Isolation of Nascent RNA Chains Transcribed by RNA Polymerase II: Evidence for Cotranscriptional Splicing. Molecular and Cellular Biology, 14(11), pp.7219–7225.

Xi, W. & Beer, M., 2018. Local epigenomic state cannot discriminate interacting and non-interacting enhancer–promoter pairs with high accuracy. bioRxiv, pp.1–12.

Xi, W. & Beer, M.A., 2018. Local epigenomic state cannot discriminate interacting and non-interacting enhancer–promoter pairs with high accuracy. PLoS Computational Biology, 14(12), pp.1–7.

Yardımcı, G.G. et al., 2014. Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection. Nucleic Acids Research, 42(19), pp.11865–11878.

Yen, A. & Kellis, M., 2015. Systematic chromatin state comparison of epigenomes associated with diverse properties including sex and tissue type. Nature Communications, 6(7973), pp.1–13.

Zaret, K.S. & Carroll, J.S., 2011. Pioneer transcription factors: establishing competence for gene expression. Genes & Development, 25, pp.2227–2241.

Zeiler, M.D., 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv, pp.1–6.

Zhang, S. et al., 2018. In silico prediction of high-resolution Hi-C interaction matrices. bioRxiv,
