COMPUTATIONAL EPIGENOMICS

Mapping human methylomes

The European Nucleotide Archive Team


In collaboration with Vardhman Rakyan, Barts and the London; Stephan Beck, University College London; Thomas Down, Wellcome Trust Sanger Institute; Natalie Thorne and Simon Tavaré, Cancer Research UK

DNA methylation is required for genome function: it is a key regulator of gene expression in normal tissue, and aberrant DNA methylation is a hallmark of certain cancers. Genome-wide reference DNA methylation profiles in multiple tissues, and the means to compare these profiles, are critical to understanding the role of DNA methylation, as is the identification of tissue-specific differentially methylated regions (tDMRs), which are thought to play a role in cellular identity. Using a custom-designed microarray and the MeDIP (methylated DNA immunoprecipitation) technique, the team and our collaborators presented the most comprehensive set of DNA methylation profiles, consisting of 13 normal human somatic tissues in addition to human placenta, sperm and the GM06990 lymphoblastoid cell line (Rakyan et al., 2008). The results suggested that promoters across a wide range of CpG densities are regulated by tissue-specific DNA methylation and demonstrated that exon methylation is a common feature in mammalian genomes. These profiles are currently available in Ensembl (www.ensembl.org), the infrastructure of which can be used by the community to present similar data.

Analysis of DNA–protein interactions

In collaboration with Thomas Down, Ian Dunham and David Vetrie while all were members of the Wellcome Trust Sanger Institute

The genome-wide variability of transcription factor binding in individual cell types, and the relationship of this binding to cellular identity, is largely unknown. An investigation to map the binding of REST (repressor element 1-silencing transcription factor) across eight human cell lines leveraged analysis methods recently developed in the group and further led to the development of methods to compare positive regions in multiple cell types. The study exposed several interesting characteristics of transcription factor binding site usage across a single species (Bruce et al., 2009). The experiment was conducted on a PCR tiling array platform across the ENCODE regions and included seven cell lines expressing REST (see figure 1a) and the KELLY cell line, which does not express REST. After analysis, a total of 591 positive regions were identified across the expressing cell lines, while the non-expressing KELLY cells were found to have no positive regions, providing confidence in a low false positive rate for the analysis. The positive regions were further categorised by whether they appeared in a single cell line, all seven cell lines, or multiple but not all cell lines (see figure 1a). These groupings identified a core set of approximately 30 positive binding sites across the ENCODE regions and larger sets with restricted or unique binding patterns, which corresponded to the enrichment values observed on the array and the strength of the motif with respect to the consensus. Unexpectedly, the DNA sequence in the restricted and unique binding sites shows increased evolutionary constraint (at every conservation threshold) compared to the common sites (figure 1b). The genes closest to the binding sites with restricted binding profiles were enriched in tissue-specific genes, and we hypothesise that the higher conservation in these cases is analogous to the observed higher conservation of alternatively spliced exons with tissue- or condition-specific expression patterns.

ENSEMBL

Ensembl (Hubbard et al., 2009), a joint project of EMBL-EBI and the Wellcome Trust Sanger Institute, provides an integrated set of tools for genome annotation, data mining and visualisation. Ensembl’s mission is to enable genomic science by providing high-quality, integrated annotation on chordate genomes within a consistent and accessible infrastructure. At EMBL-EBI, Ensembl includes members of the Vertebrate Genomics team and components of the PANDA Nucleotides group (see page 17).

The Ensembl genome browser at www.ensembl.org is the primary entry point for most users. In addition to the website, we also provide data access through a number of other routes, including an extensively supported Perl API, the Ensembl BioMart (Smedley et al., 2009), direct queries of our publicly available MySQL databases, full download of all resources and the provision of Ensembl data as one of the Public Data Sets available on the Amazon Web Services cloud computing platform (http://aws.amazon.com/publicdatasets/). Ensembl places no restriction on the use of the data and provides all of the code through an open source licence that allows it to be used without cost by any interested organisation.

This year, the two most significant project achievements were the launch of the new Ensembl web interface in November 2008 and the release of the annotation set for the updated GRCh37 version of the human genome assembly in July 2009. The new website was the result of approximately one year of development and was designed to enable greater discovery of the numerous data resources provided by Ensembl with easier and more intuitive navigation. These features were implemented such that the overall speed of the website also saw significant improvements. Since the launch of the new web interface, we have concentrated on increasing overall performance and consolidating previously existing features within the new interface. Ensembl’s support for the updated GRCh37 human assembly included a new gene set incorporating both automatic Ensembl gene predictions and manually annotated genes from the Havana project. This combined gene set is created within the context of the GENCODE project. In addition to genome annotation, support for the new assembly included the update of all of the pairwise and multi-species whole genome alignments as well as the mapping of genome variation and Ensembl regulatory features.

Services in 2009 – Vertebrate Genomics

Figure 1. Figures from Bruce et al. (2009). (a) A pinwheel diagram showing the seven REST-expressing cell lines and the pattern of overlap of the nessie-identified REST binding sites in each of the cell lines. The numbers in the centre of the circle represent the binding sites common to all cell lines and are not equal due to cases in which two regions in one cell line overlapped one region in a second cell line. The outer numbers represent binding sites unique to the given cell line. (b) The amount of evolutionary constraint for GERP score level thresholds and categories of REST binding sites.
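The grouping of REST-positive regions summarised in figure 1a can be illustrated with a minimal sketch. It assumes closed genomic intervals on a single sequence and three hypothetical cell lines (A, B and C) rather than the seven used in the study; the function names and coordinates are invented for illustration and are not taken from Bruce et al. (2009).

```python
def merge_regions(sites):
    """Merge overlapping (start, end, cell_line) intervals into regions.

    Coordinates are treated as closed intervals on one sequence (an
    assumption of this sketch). Each merged region records the set of
    cell lines that contributed at least one site to it.
    """
    merged = []
    for start, end, cell_line in sorted(sites):
        if merged and start <= merged[-1]["end"]:
            # Overlaps the current region: extend it and note the cell line.
            merged[-1]["end"] = max(merged[-1]["end"], end)
            merged[-1]["cell_lines"].add(cell_line)
        else:
            merged.append({"start": start, "end": end, "cell_lines": {cell_line}})
    return merged


def categorise(region, n_cell_lines):
    """Label a merged region by how many cell lines it was found in."""
    n = len(region["cell_lines"])
    if n == n_cell_lines:
        return "common"      # present in every cell line
    if n == 1:
        return "unique"      # present in a single cell line
    return "restricted"      # present in several, but not all, cell lines


# Hypothetical positive regions: (start, end, cell line).
sites = [
    (100, 200, "A"), (150, 250, "B"), (180, 260, "C"),       # all three lines
    (500, 560, "A"),                                          # one line only
    (800, 900, "B"), (850, 950, "C"),                         # two of three lines
    (1000, 1050, "A"), (1080, 1150, "A"), (1040, 1100, "B"),  # two A sites, one B site
]

regions = merge_regions(sites)
categories = [categorise(region, n_cell_lines=3) for region in regions]
```

In the last group, two separate sites in cell line A overlap a single site in cell line B, so the per-cell-line counts for the same merged region differ. This is the situation the caption gives as the reason the numbers in the centre of the pinwheel are not equal.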


Beyond the major efforts detailed above, there were five full Ensembl releases during the period of this report. From the September 2009 release onwards, Ensembl fully supports a total of 24 high coverage chordate genomes and 23 low coverage chordate genomes, including the seven new species introduced this year: the anole lizard (Anolis carolinensis), the first reptile in Ensembl; the two-toed sloth (Choloepus hoffmanni); the white-tufted-ear marmoset (Callithrix jacchus); the pig (Sus scrofa); the tammar wallaby (Macropus eugenii); the zebra finch (Taeniopygia guttata); and the Western lowland gorilla (Gorilla gorilla). Of these, the anole lizard, zebra finch, marmoset and pig were high coverage genome assemblies based on approximately 4–6x coverage from Sanger-style sequencing reads, and gorilla was the first example of an assembly that combined traditional Sanger-style sequencing at low coverage with high-throughput short read sequencing at high coverage. The lamprey (Petromyzon marinus), another high coverage chordate genome, is currently provided with preliminary support only. An additional three non-chordate species (Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster) are included to facilitate comparative analysis.

In order to increase consistency of the Ensembl resources, we have been steadily increasing our contacts and collaborative activities with similar resources at the University of California Santa Cruz (UCSC) and the NCBI. This year, the first Joint NCBI-EBI Coordination meeting in Washington was attended by all of the Ensembl and Genome Variation project leaders. We also have connections to many model organism-specific database resources such as the Rat Genome Database (RGD). The goal of these connections is to provide the wider research community with data resources that are maximally consistent and interconnected.

Ensembl maintains a significant commitment to user support and training. During the past year, our training team presented nearly 100 training events in over 20 countries. These events range from relatively short presentations as part of larger EMBL-EBI or Wellcome Trust workshops to intensive multi-day courses dedicated to the Ensembl API and to developers maintaining full Ensembl mirror sites. We have also developed a library of video tutorials for users not able to attend a course in person, and these are now provided through the Ensembl YouTube channel.

The Ensembl infrastructure is being leveraged by the Ensembl Genomes project, and this has resulted in the generalisation of key aspects of the Ensembl toolset to support the requirements of the Ensembl Genomes project in novel areas. Additionally, the Ensembl core software team, which is part of the PANDA Nucleotides group (see page 17), has reengineered several key components of the Ensembl infrastructure, especially those supporting the mapping of external database identifiers to the Ensembl identifiers and the management of Gene Ontology (GO) information within the Ensembl databases.

Ensembl comparative genomics

Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Albert Vilella

Ensembl’s comparative genomics resources include pairwise and multi-species whole genome alignments as well as the calculation of homology relationships through Ensembl families and gene trees. As the number of supported species within Ensembl increases, the value of the comparative genomics resources to connect all aspects of the project increases. These resources also provide valuable information about the regions of the well-annotated human and mouse genomes which are subject to evolutionary constraint.

The Ensembl multi-species alignments are produced by the recently published Enredo-Pecan-Ortheus (EPO) pipeline (Paten et al., 2008a; Paten et al., 2008b; Paten et al., 2009), which can be summarised as follows. In the first step, Enredo, a graph-based method that is robust to duplicated regions within the genome, is used to identify orthologous and paralogous collinear genomic regions. Pecan, a consistency-based multiple aligner, is then used to create alignment blocks from the Enredo-identified collinear segments. These alignment blocks are used by Ortheus to infer ancestral sequences using a branch transducer model of sequence evolution that includes insertions and deletions. We extend Ensembl’s multiple alignments to low coverage genomes by first constructing the core multiple alignment and then mapping each low coverage genome using pairwise alignments to the human genome. This procedure allows us to better determine sequence constraints through mammalian evolution.
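The idea of mapping a low coverage genome onto a core multiple alignment through its pairwise alignment to human can be illustrated with a toy sketch. This is not the Ensembl implementation: the function name, the sequences and the decision to drop bases inserted relative to human are assumptions made purely for illustration.

```python
def project_low_coverage(core_alignment, reference, pairwise_ref, pairwise_query):
    """Place a query sequence into the columns of a core multiple alignment.

    core_alignment : dict of species name -> gapped row (equal lengths)
    reference      : name of the row that the pairwise alignment shares
    pairwise_ref   : gapped reference row of the pairwise alignment
    pairwise_query : gapped low coverage row of the pairwise alignment

    Simplification of this sketch: query bases inserted relative to the
    reference have no column in the core alignment and are dropped.
    """
    if len(pairwise_ref) != len(pairwise_query):
        raise ValueError("pairwise alignment rows must have equal length")

    # Query character aligned to each successive reference base.
    aligned_to_ref = [q for r, q in zip(pairwise_ref, pairwise_query) if r != "-"]

    ref_row = core_alignment[reference]
    if len(aligned_to_ref) != sum(1 for base in ref_row if base != "-"):
        raise ValueError("reference differs between the two alignments")

    projected = []
    ref_index = 0
    for column_base in ref_row:
        if column_base == "-":
            # The reference has a gap here, so the query is given a gap too.
            projected.append("-")
        else:
            projected.append(aligned_to_ref[ref_index])
            ref_index += 1
    return "".join(projected)


# Hypothetical core alignment of high coverage genomes.
core = {
    "human": "ACG-TAC",
    "mouse": "ACGGTAC",
    "dog":   "AC--TAC",
}

# Hypothetical pairwise alignment of human to a low coverage genome.
sloth_row = project_low_coverage(
    core, "human",
    pairwise_ref="ACGT-AC",
    pairwise_query="AC-TTAC",
)
core_with_sloth = dict(core, sloth=sloth_row)
```

Here the projected row has the same length as the core alignment, so it can be added as an extra row without disturbing the existing columns.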

In addition to the alignment resources, Ensembl comparative genomics also provides comprehensive predictions of vertebrate gene phylogeny which result in gene trees that are presented graphically on the Ensembl genome browser (see figure 2) and have recently been described in detail (Vilella et al., 2009). Over the course of this year, we have implemented a number of improvements in the GeneTree pipeline to reflect actual or artefactual gene-split events and improved the GeneTree visualisation to aid in the interpretation of the trees. Phylogenetic predictions are complemented by Ensembl Families which feature alignments of homologous UniProt entries to the Ensembl proteins.

Ensembl functional genomics

Ian Dunham, Stefan Gräf, Nathan Johnson, Damian Keefe, Steven Wilder

The Ensembl functional genomics resources include the Ensembl regulatory build, an integrated analysis of experimental assays designed to create an automatic, evidence-based annotation of genome function. The regulatory build uses several data types including genome-wide chromatin state maps, experimentally determined locations of


Figure 2. The new Ensembl website showing data from the human GRCh37 assembly including (from top to bottom) the Ensembl gene tree view for the human FOXP2 gene; a promoter-associated regulatory feature on human chromosome 7 created by integrative analysis of the displayed histone modifications and other functional data; a SNP on human chromosome 8 found by genome-wide association to be associated with both prostate and colorectal cancer.

In document Annual Scientific Report European Bioinformatics Institute (Page 39-43)