
2. Chapter 2: Identification of potential functional elements in the intergenic spacer

2.1. Introduction

2.1.1. Strategy to characterise potential functional elements in the IGS

deletion of the region of interest. The multiple copies and locations of the rDNA units mean that it is not straightforward to make genetic changes in the regions of interest in the human IGS to determine their function. Furthermore, there is no certainty that changes made in a single rDNA repeat unit will have a detectable effect. The protocols and molecular techniques required to alter the rDNA units at a global level and observe the consequent phenotypic changes are well established only in yeast, and are currently unavailable for human. Thus, it is challenging to characterise functional elements in the human rDNA using experimental approaches. I therefore decided to take a comparative genomics approach, phylogenetic footprinting, to identify potential functional elements in the human IGS (Figure 2.1).


Figure 2.1: Schematic overview of the identification of potential functional elements in th

The flow diagram shows the progression of the project and the different analyses performed

regions. The major outcome of the study is shown on the right side of the figure.


The principle behind phylogenetic footprinting is that mutations in functional elements are deleterious; changes in the sequences of functional elements are therefore selected against, and these sequences change at a slower rate than non-functional sequence over evolutionary time (Tagle et al. 1988). Comparison of orthologous sequences from related species results in the functional elements appearing as “phylogenetic footprints”, i.e. highly conserved regions in the multiple sequence alignment against a background of non-functional, poorly conserved sequences (Tagle et al. 1988). The success of phylogenetic footprinting depends on the evolutionary relatedness of the species selected for comparison. Previous studies have shown that including closely related species along with more distantly related species makes it possible to identify conserved regions with high confidence (McCue et al. 2002; Stone et al. 2005).
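The idea of a footprint as a conserved stretch in a multiple sequence alignment can be illustrated with a minimal sketch. It assumes an already aligned set of sequences of equal length, scores each column by the fraction of sequences sharing the most common base, and reports sliding windows whose mean score reaches a cutoff. The function names, window size, threshold and toy sequences are all hypothetical choices for illustration; they are not the scoring scheme or parameters used in this study.

```python
from collections import Counter


def column_identity(column):
    """Fraction of sequences that share the most common base in one column.

    Gap characters ('-') are never counted as the shared base, so gapped
    columns score lower.
    """
    bases = [b for b in column if b != "-"]
    if not bases:
        return 0.0
    most_common_count = Counter(bases).most_common(1)[0][1]
    return most_common_count / len(column)


def conserved_windows(alignment, window=5, threshold=0.9):
    """Return merged (start, end) half-open column ranges whose mean
    column identity over `window` columns is at least `threshold`."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = [
        column_identity([seq[i] for seq in alignment]) for i in range(length)
    ]
    hits = []
    for start in range(length - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean >= threshold:
            if hits and start <= hits[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                hits[-1] = (hits[-1][0], start + window)
            else:
                hits.append((start, start + window))
    return hits


# Toy alignment (hypothetical): columns 4-9 are identical in all sequences.
toy_alignment = [
    "AAGTCGATCGTTAC",
    "CTGACGATCGAGTA",
    "GCTCCGATCGCAGT",
]
print(conserved_windows(toy_alignment, window=5, threshold=0.9))
```

In this toy case the only reported range is the block of identical columns in the middle of the alignment, while the variable flanks fall below the cutoff.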

Phylogenetic footprinting has been successfully applied to identify known and novel functional elements in the IGS of Saccharomyces cerevisiae (Ganley et al. 2005). The rDNA of five species related to Saccharomyces cerevisiae was compared. Known functional elements, i.e. the bi-directional promoter E-pro, the cohesin-associated region and the replication fork barrier, as well as other potential gene-independent functional elements (NOCs), correspond to the conserved regions identified in the IGS. Further, this technique has also been applied to other regions of the human genome to identify promoters and other gene regulators for a variety of genes (Tagle et al. 1988; Bachman et al. 1996; Berezikov et al. 2005).

To identify potential functional elements in the human IGS, I searched for phylogenetic footprints by comparing the human IGS with IGS sequences from different primates. However, no primate rDNA IGS sequences were available except for human. Therefore, to perform phylogenetic footprinting I first needed to obtain the rDNA sequences for the primate species selected for this study.

The whole genome shotgun sequencing (WGS) data of an organism contain the nucleotide information of its entire genome. However, current sequencing technologies can only determine the sequences of relatively small DNA fragments. Thus, to obtain the nucleotide information of the entire genome, the genomic sample is sheared into small fragments that are sequenced, and the resulting sequence reads are merged to construct the sequence of the genome. The process of merging the reads is known as de novo whole genome sequence assembly (WGA). To construct the rDNA sequences of the selected primates for the human rDNA phylogenetic analysis, I used publicly available whole genome sequencing data from different primate genome projects to perform WGA.
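
The merging of overlapping reads into a longer sequence can be sketched with a toy greedy overlap procedure. This is only an illustration of the principle: the text does not state which assembler or algorithm was used, and real assemblers additionally deal with sequencing errors, reverse-complement reads and repeats. The function names, the minimum overlap and the example reads are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap
    until no pair overlaps by at least `min_len` bases."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]
    return contigs


# Three error-free toy reads (hypothetical) sampled from one short sequence.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads, min_len=3))
```

With these reads the procedure first joins the two reads sharing the longest overlap and then appends the third, giving a single contig.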
Further, to verify that the rDNA sequences obtained by WGA were not misassembled, I decided to identify and sequence BAC (bacterial artificial chromosome) clones containing the rDNA. A BAC is an engineered DNA molecule constructed by inserting a large DNA fragment (usually 100-200 kb) into a bacterial plasmid. Since BACs can contain large fragments of the genome, they are used to construct whole genome libraries. The Children's Hospital Oakland Research Institute, USA (CHORI) provides BAC genome libraries for primates in the form of BAC filters (BAC clones that have been gridded onto membranes). I decided to screen filters from CHORI to identify rDNA-containing BAC clones and then compare them to the rDNA sequences obtained by WGA.

The IGS is known to be transcribed to produce different regulatory long noncoding RNAs. Therefore, one potential function of the conserved regions identified using phylogenetic footprinting may be to produce RNA transcripts that have regulatory roles. To identify the full range of potential transcripts from the IGS, I decided to map RNA-seq data from different cell types. The data used for the analysis were publicly available from the ENCODE project, and sequencing was performed at Cold Spring Harbor Laboratory (CSHL). To identify potential IGS transcripts, I selected long (>200 bp in length) polyadenylated, denoted poly(A)+, and long non-polyadenylated, denoted poly(A)-, RNA-seq data for the analysis. Further, to identify potential micro RNA, small nucleolar RNA, tRNA and small nuclear RNA from the IGS, poly(A)+ RNA-seq data were selected to be mapped to the IGS. In the ENCODE project, RNAs were fractionated according to their location in the cell, i.e. cytosol and nucleus, before sequencing (Djebali et al. 2012). The noncoding transcripts from the IGS are known to be located in the nucleolus (Audas et al. 2012; Jacob et al. 2012). Moreover, it has recently been shown that a larger fraction of noncoding and intergenic RNAs is located in the nucleus than in the cytosol (Djebali et al. 2012). Therefore, I decided to search for transcripts from the IGS using RNA-seq data from the nucleus.

Previous studies have shown that several transcripts originate from the human IGS. This implies that transcriptional regulators (promoters, enhancers and insulators) of these IGS transcripts are present in the IGS, and these may lie within the conserved regions. The histone modifications and transcription factors (TFs) associated with a genomic region determine and regulate its transcriptional activity (Caparros et al. 2009). Therefore, to identify potential transcriptional regulators in the human IGS, I decided to map ChIP-seq data from the ENCODE project for various histone modifications and TFs. The factors I mapped are listed in Table 2.1 and Table 2.2.


Table 2.1: List of histone modifications mapped to the human rDNA sequences

Histone modification    Description
H2A variant Z           Plays a role in several functions including Polycomb silencing, transcription activation and nucleosome assembly
H3K9ac                  Associated with regions that have open chromatin structure (less nucleosome occupancy)
H3K9me1                 Associated with regions that have open chromatin structure (less nucleosome occupancy)
H3K27ac                 Associated with transcriptional initiation and open chromatin structure
H3K4me1                 Enriched at enhancers and downstream of transcription start sites
H3K4me2                 Enriched at enhancers and downstream of transcription start sites
H3K4me3                 Enriched at promoters
H3K79me2                Marks the transcriptional transition region - the region between the initiation marks and the elongation marks
H4K20me1                Associated with active promoters and/or transcribing regions
H3K9me3                 Promotes a repressive heterochromatic state
H3K27me3                Promotes a repressive heterochromatic state

Table 2.2: List of transcription factors mapped to the human rDNA sequence

Transcription factor    Description
Bdp1                    Cofactor of RNA Pol III
Brf1                    Cofactor of RNA Pol III
Brf2                    Cofactor of RNA Pol III
c-Myc                   Associated with activation of rDNA (Pol I) and Pol II transcribed genes
CTCF                    Zinc finger protein enriched at insulators/promoters
Pol-II                  POLR2A subunit of RNA Pol II
Pol-III                 POLR3G subunit of RNA Pol III
TBP                     Associated with all three polymerases (Pol I, Pol II and Pol III)
UBF                     Involved in recruiting Pol I to the rDNA


The ENCODE project provides RNA-seq and ChIP-seq data for 147 different cell types, of which 18 were given higher priority by the ENCODE project based on the physiological conditions they represent (http://www.genome.gov/26524238; Dunham et al. 2012). These 18 cell types were further divided into two tiers, tier-1 (three cell types) and tier-2 (15 cell types), based on their level of priority in the ENCODE project. As described previously, the rDNA is thought to play a key role in several processes including cellular proliferation (Section 1.7). Cancerous cells are well-established examples of rapidly proliferating cells. Therefore, to identify the transcripts and transcriptional regulators associated with the rDNA that may play roles in cellular proliferation, I decided to select both noncancerous and cancerous cell types for comparative ChIP-seq and RNA-seq analysis. For this study, I selected all cell types from tier-1, viz. the noncancerous cell types GM12878 and H1-hESC, and the cancerous cell type K562. Since tier-1 contains only one cancerous cell type, I selected A549 and HeLa-S3 from tier-2 to include other cancerous cell types. Further, to have equal representation of cancerous and noncancerous cell types in the analysis, I included one additional noncancerous cell type, HUVEC, from tier-2. The cell types A549, HeLa-S3 and HUVEC were selected over the other 12 cell types in tier-2 because ChIP-seq data were available for most of the epigenetic factors included in the study. The details of the selected cell types are given in Table 2.3.

Table 2.3: The cell types included in this study

Cell type    Description
GM12878      Lymphoblastoid
HUVEC        Human umbilical vein endothelial cells
H1-hESC      Human embryonic stem cell line H1
K562         Leukaemia
HeLa-S3      Cervical carcinoma
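
The cell-type selection described above can be written down as a small data structure, which makes the balance between cancerous and noncancerous cell types easy to check. The tier and cancer status of each cell type are taken from the text; the class and function names are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellType:
    name: str
    tier: int        # ENCODE priority tier as stated in the text
    cancerous: bool  # cancer status as stated in the text


SELECTED_CELL_TYPES = [
    CellType("GM12878", 1, False),
    CellType("H1-hESC", 1, False),
    CellType("K562", 1, True),
    CellType("A549", 2, True),
    CellType("HeLa-S3", 2, True),
    CellType("HUVEC", 2, False),
]


def split_by_status(cell_types):
    """Return (cancerous names, noncancerous names) for the given cell types."""
    cancerous = [c.name for c in cell_types if c.cancerous]
    noncancerous = [c.name for c in cell_types if not c.cancerous]
    return cancerous, noncancerous


cancerous, noncancerous = split_by_status(SELECTED_CELL_TYPES)
print("cancerous:", cancerous)
print("noncancerous:", noncancerous)
```

Splitting the list this way shows three cell types in each group, which is the equal representation the selection was designed to achieve.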
