2. Chapter 2: Identification of potential functional elements in the intergenic spacer
2.3. Results
2.3.7. Phylogenetic footprinting to identify potential noncoding functional elements
To identify potential functional elements in the human IGS, I next compared the human rDNA with primate rDNA sequences to identify phylogenetic footprints (regions that are highly conserved over evolutionary time) in the human rDNA IGS. To this end, I initially made a multiple sequence alignment (MSA) of the human rDNA with the rDNA sequences of chimpanzee, gorilla, orangutan, gibbon, macaque and marmoset. However, this MSA was poorly aligned because of the low sequence identity between human and marmoset rDNA. Therefore, marmoset rDNA was removed from the analysis and the remaining rDNA sequences were realigned to obtain MSAhuman-macaque, which was used for further study. Next, to observe the level of sequence conservation in the human IGS, a similarity plot was generated from MSAhuman-macaque (Figure 2.31) using Synplot (Section 2.2.1.5). The human rDNA sequence in MSAhuman-macaque has long runs of gaps that are predominantly the result of the satellite blocks in the macaque rDNA. Because the goal of this study is to search for conserved regions in the human IGS, all columns in the MSA with gaps in the human rDNA were removed before generating the similarity plot. This facilitated visualization of the positions of the conserved regions (obtained in the next step) relative to the human rDNA. A 75 bp sliding window with a 1 bp increment was used to generate the similarity plot (Section 2.2.1.5). The plot represents the sequence conservation between the human and the other primate rDNA sequences. To demarcate the conserved regions in the human IGS, a cutoff of 80% identity with a minimum length of 10 bp was used (Section 2.2.1.5). The average rDNA sequence identity among the selected primates is 61.1%, and this decreases to 51.6% when just the IGS is considered. The 80% cutoff, although arbitrary, was chosen because it is much higher than the average sequence identity (61.1%), and therefore represents a conservative cutoff value.
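The gap-column removal and sliding-window steps described above can be illustrated with a minimal Python sketch. It is not Synplot and does not reproduce Synplot's scoring: it assumes, for illustration only, that the per-column score is the fraction of non-human sequences that match the human base, and that the window score is the mean of these column scores. The function names and the toy sequences are hypothetical.

```python
from typing import Dict, List


def drop_reference_gap_columns(msa: Dict[str, str], ref: str = "human") -> Dict[str, str]:
    """Remove every alignment column in which the reference sequence has a gap."""
    keep = [i for i, base in enumerate(msa[ref]) if base != "-"]
    return {name: "".join(seq[i] for i in keep) for name, seq in msa.items()}


def column_identity(msa: Dict[str, str], ref: str = "human") -> List[float]:
    """Per-column fraction of non-reference sequences matching the reference base.

    Assumed scoring (not Synplot's): a gap in another sequence counts as a mismatch.
    """
    ref_seq = msa[ref]
    others = [seq for name, seq in msa.items() if name != ref]
    return [
        sum(seq[i] == ref_seq[i] for seq in others) / len(others)
        for i in range(len(ref_seq))
    ]


def sliding_window_mean(values: List[float], window: int = 75, step: int = 1) -> List[float]:
    """Mean of `values` over windows of `window` columns, moved `step` columns at a time."""
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    return [
        (prefix[s + window] - prefix[s]) / window
        for s in range(0, len(values) - window + 1, step)
    ]


# Toy alignment (made-up sequences); a 3-column window stands in for the 75 bp window.
toy_msa = {"human": "ACG-TAC", "chimpanzee": "ACGGTAC", "macaque": "ACTGTTC"}
ungapped = drop_reference_gap_columns(toy_msa)
scores = column_identity(ungapped)
smoothed = sliding_window_mean(scores, window=3, step=1)
```

Because the columns with gaps in the human sequence are dropped first, every index in `scores` corresponds directly to a position in the human rDNA, which is what allows the plot to be read in human coordinates.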
Further, a comparison between different database studies showed that the average minimum sequence length of transcription factor binding sites is ~16 bp (Kulakovskiy et al. 2013). Therefore, the smaller cutoff of 10 bp was used to ensure that most of the potential protein binding sites in the IGS could be identified. Conserved regions less than 10 bp apart were merged to obtain fifty-three conserved regions that represent the potential noncoding functional elements in the human IGS (Figure 2.31; Appendix Table 2). The conserved regions are referred to as ConR-1 to ConR-53 in order of distance from the last base of the 5’ ETS. The conserved regions can be grouped into three clusters: the first between the coding region and the long tract of [TC]n satellites, the second between the [TC]n tracts and the cdc27 pseudogene, and the third between the cdc27 pseudogene and the end of the IGS. The conserved patterns appearing in the rDNA are outlined in the following subsections.
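The region-calling rules stated above (identity of at least 0.8, minimum length of 10 bp, merging of regions less than 10 bp apart) can be sketched in Python as follows. This is an illustrative sketch rather than the procedure of Section 2.2.1.5: it assumes a per-position score track in human coordinates is already available, that the length filter is applied before merging, and that regions are numbered by start coordinate. The function names and the toy score track are hypothetical.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # half-open interval [start, end) in human coordinates


def call_runs(scores: List[float], cutoff: float = 0.8, min_len: int = 10) -> List[Region]:
    """Return runs of consecutive positions scoring >= cutoff that are >= min_len long."""
    regions: List[Region] = []
    start = None
    for i, s in enumerate(scores):
        if s >= cutoff:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the track.
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores)))
    return regions


def merge_close(regions: List[Region], max_gap: int = 10) -> List[Region]:
    """Merge regions separated by fewer than max_gap positions."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy score track (made-up values): two nearby runs, one run that is too short, one distant run.
toy_scores = [0.9] * 12 + [0.5] * 4 + [0.85] * 10 + [0.5] * 20 + [0.9] * 5 + [0.4] * 3 + [0.95] * 15
runs = call_runs(toy_scores)      # the 5-position run is discarded by the length filter
conserved = merge_close(runs)     # the first two runs are 4 positions apart and are merged
labels: Dict[str, Region] = {f"ConR-{i}": r for i, r in enumerate(conserved, start=1)}
```

Applying the length filter before merging means that short high-scoring fragments cannot bridge two otherwise separate regions; reversing the order would be a different, more permissive, design choice.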
Figure 2.31: Sequence similarity plot for human rDNA with five primate species, viz. chimpanzee, gorilla, orangutan, gibbon and macaque.
The horizontal axis represents the position in the human rDNA. The vertical axis represents the level of sequence similarity, from 0 (no identity) to 1 (all bases in the column are the same). The conserved regions (green shaded regions) were identified using a 10 bp minimum length and a sequence identity score >= 0.8. The name of each conserved region is indicated in its green box. Annotations of the human rDNA, representing functional elements and repeat elements, are mapped to the similarity plot and shown above it. Orange boxes represent Alu elements, yellow boxes represent microsatellites and the blue box represents the cdc27 pseudogene. The purple vertical bar represents the promoter and the red vertical bars represent terminator elements.
2.3.7.1. Conservation of previously known features in the human IGS
To verify that phylogenetic footprinting is capable of identifying functional elements in the human rDNA, I first checked whether known functional elements are present in the conserved regions.
2.3.7.1.1. rRNA coding regions
As anticipated, the 18S and 5.8S rDNA are highly conserved across the primates (Hillis and Dixon 1991) (Figure 2.32). For the 28S rDNA, the conserved regions appear as strong peaks and the variable regions as regions of relatively low conservation. Similarly, as reported previously (Netchvolodov et al. 2006), the core promoter element (-45 bp to +18 bp) and the upstream element (-156 bp to -107 bp) for rDNA transcription are highly conserved. Eleven rDNA transcription terminators are present 390 bp downstream of the 28S rRNA coding region (Pfleiderer et al. 1990). Of these eleven putative terminators, the first three are conserved while the others are not. It has been reported previously that the termination efficiency of the first three terminators is higher than that of the remaining terminators (Pfleiderer et al. 1990), which is consistent with the higher conservation of the first three terminators.
Figure 2.32: Sequence conservation plot for the rRNA coding regions. Notations are the same as in Figure 2.31.
2.3.7.1.2. c-Myc and p53 binding sites
c-Myc is an oncogene associated with rDNA transcriptional upregulation in many cancerous cells (Grandori et al. 2005). The c-Myc binding site in the human rDNA is present proximal to the rDNA promoter (Grandori et al. 2005) and is conserved among the primates. Another crucial protein known to be associated with the rDNA is p53. p53 is a tumour suppressor (Oren 2003) and is involved in the suppression of rDNA transcription (Budde and Grummt 1999). A putative binding site of p53 has been reported in the IGS (Kern et al. 1991) and is conserved among the primates. Overall, this suggests that phylogenetic footprinting is capable of identifying protein binding sites in the IGS.
2.3.7.1.3. Noncoding transcripts
Two noncoding transcripts from the rDNA IGS have been previously characterized: the pRNA, which plays a role in rDNA silencing (Mayer et al. 2006), and the IGS28RNA, which is involved in nucleolar protein sequestration during acidosis (Audas et al. 2012). Interestingly, the regions encoding both of these transcripts correspond to conserved regions in the IGS. The pRNA is transcribed from ConR-53, while IGS28RNA is transcribed from a region of the IGS that overlaps with conserved regions ConR-23 to ConR-25, demonstrating that phylogenetic footprinting is capable of identifying noncoding transcripts.
2.3.7.1.4. Alu elements conservation
Alu elements constitute 13.3% of the human IGS. Several studies have shown the conservation of the Alu elements in the IGS. Therefore, I next looked for conservation of the Alu elements. Several Alu elements present in the IGS are highly conserved and correspond to various conserved regions. The high conservation of the Alu elements in the human IGS suggests that they may have some biological role. Further, this also demonstrates that phylogenetic footprinting is able to identify conservation in the IGS.
2.3.7.1.5. Conservation of cdc27 pseudogene in apes
Phylogenetic footprinting demonstrates that the cdc27 pseudogene is conserved in human and apes but is absent in the monkeys (Figure 2.33). This agrees with the previous report by Gonzalez et al. (1993). The average identity of the cdc27 pseudogene is 89.2%. Recent studies have shown that pseudogenes are not inert and have several biological roles (Pink et al. 2011; Poliseno 2012; Johnsson et al. 2013). The high conservation of the cdc27 pseudogene in human and apes suggests that it may have some biological function.
Figure 2.33: Sequence conservation plot for the cdc27 pseudogene.
Together, these functional elements account for four of the 53 conserved regions, suggesting that the remaining conserved regions represent novel functional elements.
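The bookkeeping behind such a tally (which conserved regions are accounted for by annotated features, and which are not) amounts to an interval-overlap check. The Python sketch below shows one way to do it; the function names, region labels and coordinates are hypothetical placeholders and are not values from this study.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end) in human rDNA coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def regions_hit_by_features(
    conserved: Dict[str, Interval], features: Dict[str, Interval]
) -> Dict[str, List[str]]:
    """For each annotated feature, list the conserved regions it overlaps."""
    return {
        fname: [cname for cname, c in conserved.items() if overlaps(f, c)]
        for fname, f in features.items()
    }


# Placeholder coordinates for illustration only (not from Appendix Table 2).
toy_conserved = {"ConR-A": (100, 140), "ConR-B": (150, 180), "ConR-C": (400, 450)}
toy_features = {"feature_1": (130, 160), "feature_2": (500, 520)}

hits = regions_hit_by_features(toy_conserved, toy_features)
explained = {name for names in hits.values() for name in names}
unexplained = [name for name in toy_conserved if name not in explained]
```

In this toy case `unexplained` holds the conserved regions that no annotated feature overlaps, which is the set that would be carried forward as candidate novel elements.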