2.2 Results
2.2.3 Major transcripts from coding genes do not always code for proteins
Functional classification of major transcripts revealed that, for 17% of protein coding genes expressed in primary tissues (15.26% to 20.64%, SD = 1.60), the major transcript lacks an annotated CDS in GENCODE. Taking expression levels into account, and focusing on the cell line data, major non-coding transcripts were more abundant in the nucleus, where they represent approximately 15% of the studied mRNA pool (12.99% to 16.66%, SD = 1.10, Figure 2.6a). Genes with major non-coding transcripts are expressed at higher levels in the nucleus than those with major coding transcripts, while this trend is inverted in the cytosol (Figure 2.6b). In addition, non-coding major transcripts are less dominant than coding ones in both compartments (Figure 2.6b). Finally, analysis of the annotation revealed that these major non-coding transcripts correspond to retained introns and processed transcripts, which lack an open reading frame (see Methods). The latter are more prevalent in the cytosol, while the proportion of retained introns is higher in the nucleus (Figure 2.6c).
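The classification described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the record layout, the function names and the toy values are assumptions made for illustration, and the `has_cds` flag stands in for the presence of an annotated CDS in GENCODE.

```python
def major_transcripts(records):
    """Pick the most abundant transcript of each expressed gene.

    records: iterable of (gene_id, transcript_id, has_cds, fpkm) tuples.
    This layout is hypothetical; has_cds marks an annotated CDS.
    Returns {gene_id: (transcript_id, has_cds, fpkm)}.
    """
    best = {}
    for gene_id, transcript_id, has_cds, fpkm in records:
        if fpkm <= 0:
            continue  # transcripts with no expression cannot be major
        current = best.get(gene_id)
        if current is None or fpkm > current[2]:
            best[gene_id] = (transcript_id, has_cds, fpkm)
    return best


def fraction_noncoding_major(records):
    """Fraction of expressed genes whose major transcript lacks a CDS."""
    majors = major_transcripts(records)
    if not majors:
        return 0.0
    noncoding = sum(1 for _, has_cds, _ in majors.values() if not has_cds)
    return noncoding / len(majors)


# Toy data (invented): G1 is dominated by a non-coding transcript,
# G2 by a coding one, and G3 is not expressed.
toy = [
    ("G1", "G1.t1", True, 10.0),
    ("G1", "G1.t2", False, 30.0),
    ("G2", "G2.t1", True, 5.0),
    ("G2", "G2.t2", False, 1.0),
    ("G3", "G3.t1", True, 0.0),
]
print(fraction_noncoding_major(toy))
```

Applying such a function to each sample separately would give one proportion per sample, from which a range and standard deviation like those quoted above could be derived.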
To evaluate the hypothesis that incomplete splicing could explain the higher proportion of major retained introns in the nucleus, I compared intron expression levels across cellular compartments (see Methods for details on the calculation of intron expression). As expected, intron expression was higher in the nucleus than in the cytosol (Figure 2.7a). In addition, this analysis revealed a general trend for major retained introns to be located towards the 3′ end of the transcript (Figure 2.7b), which has previously been linked to the nonsense-mediated decay pathway (see Discussion). Interestingly, this trend is more pronounced in the cytosol than in the nucleus, where it could be masked by the higher intronic expression levels. Alternatively, the prevalence of retained introns as major transcripts could point to a functional mechanism, since genes with a retained intron as the major transcript in both nucleus and cytosol were expressed at lower levels in the latter (Figure 2.7c; see Discussion). These genes are associated with ribosomal components, consistent with previous findings indicating that introns regulate the expression of ribosomal proteins in yeast (Table B.5, see Discussion).
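As an illustration of the compartment comparison, the sketch below normalises intron FPKMs by the expression of their host gene and compares the medians of two compartments. The exact calculation is described in the Methods; here the simple ratio, the function name, the dictionary layout and the toy values are all assumptions.

```python
from statistics import median


def normalise_intron_expression(intron_fpkm, gene_fpkm):
    """Divide each intron FPKM by the FPKM of its host gene.

    intron_fpkm: {(gene_id, intron_id): fpkm}  (hypothetical layout)
    gene_fpkm:   {gene_id: fpkm}
    Introns of genes with no detectable expression are skipped,
    since the ratio is undefined for them.
    """
    normalised = {}
    for (gene_id, intron_id), fpkm in intron_fpkm.items():
        gene_level = gene_fpkm.get(gene_id, 0.0)
        if gene_level > 0:
            normalised[(gene_id, intron_id)] = fpkm / gene_level
    return normalised


# Toy values (invented) for one gene measured in two compartments.
nucleus_genes = {"G1": 10.0, "G2": 0.0}
nucleus_introns = {("G1", "i1"): 2.0, ("G1", "i2"): 4.0, ("G2", "i1"): 1.0}
cytosol_genes = {"G1": 10.0}
cytosol_introns = {("G1", "i1"): 0.5, ("G1", "i2"): 1.0}

nucleus_norm = normalise_intron_expression(nucleus_introns, nucleus_genes)
cytosol_norm = normalise_intron_expression(cytosol_introns, cytosol_genes)
print(median(nucleus_norm.values()) > median(cytosol_norm.values()))
```

On real data the two distributions would be compared with a rank-based test, such as the Wilcoxon test reported in Figure 2.7a, rather than by their medians alone.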
[Figure 2.6, plot not reproduced. Datasets on the x-axis: BM, PE, ENCODE cell, ENCODE cytosol, ENCODE nucleus. Panel (a): % mRNA pool (0 to 100) split into major non-coding, minor non-coding, minor coding and major coding transcripts. Panel (b): gene expression (FPKM, 0 to 35) and major transcript dominance (0 to 1). Panel (c): % genes with a major non-coding transcript, split into retained intron and processed transcript. Value labels printed on the plot: 17.11, 23.10, 17.06, 31.34.]
Figure 2.6 | Major non-coding transcripts in protein coding genes.
(a) Proportion of the mRNA studied represented by different categories of transcripts. Average proportions were calculated including all the samples from each dataset. Major non-coding transcripts are more abundant in the nucleus compared to the cytosol.
(b) Expression patterns across cellular compartments for major non-coding transcripts. Protein coding genes for which the most abundant transcript is non-coding are expressed at higher levels in the nucleus, whilst this trend becomes inverted in the cytosol (left). Major transcript dominance becomes reduced both when the major transcript is non-coding and in the nucleus (right).
(c) Transcript biotype categories for the major non-coding transcripts. Average proportions were calculated including all the samples from each dataset. Processed transcripts are more abundant in the cytosol, while retained introns represent the major fraction in the nucleus. Other minor categories that represented less than 1% of the transcripts were also identified, but are not visible in the plots.
[Figure 2.7, plot not reproduced. Panel (a): normalised intron expression (0 to 1) in cell, cytosol and nucleus. Panel (b): position of major and minor retained introns (RI) along the overlapping protein coding transcript (% transcript, 0 to 100) in nucleus, cytosol and cell. Panel (c): gene counts for nucleus and cytosol (425, 81, 42) and gene expression (FPKM, 0 to 40) in cytosol and nucleus.]
Figure 2.7 | Focus on retained introns.
(a) Normalised intron expression in different cellular compartments. FPKMs were calculated for all the introns and normalised by gene expression levels (see Methods). Intron expression is higher in the nucleus than in the cytosol (Wilcoxon test p-value < 2.2·10⁻¹⁶).
(b) Location of the dominant retained introns within the context of protein coding transcripts. Genes for which the major transcript is a retained intron (RI) were initially considered in the analysis, and cases where the second most abundant transcript is protein coding and overlaps with the RI were further selected. Similar criteria were applied to analyse minor RIs. The location of each RI is obtained by measuring the distance from its centre to the transcriptional start of the overlapping coding transcript, as illustrated in the panel below the figure (red dots). Major RIs are preferentially located towards the transcriptional end of protein coding transcripts.
(c) Expression levels for genes with major retained introns. The number of genes for which the most abundant transcript is an RI is shown on the left. Amongst the genes with major RIs in both cellular compartments (n = 81), gene expression is higher in the nucleus (Wilcoxon test p-value < 2.2·10⁻¹⁶).
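The positional measure described for Figure 2.7b can be sketched as follows. This is a simplified illustration rather than the code used for the figure: it assumes genomic coordinates for the full span of each feature, expresses the position as a percentage of the coding transcript (matching the "% transcript" axis), and the function name and arguments are hypothetical.

```python
def relative_ri_position(ri_start, ri_end, tx_start, tx_end, strand):
    """Position of a retained intron centre along a coding transcript.

    Returns the distance from the RI centre to the transcriptional
    start of the overlapping coding transcript, as a percentage of the
    transcript span (0 = start, 100 = end). Coordinates are genomic,
    with start < end; on the minus strand the transcriptional start is
    the larger coordinate. Assumption: the unspliced span is used.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    length = tx_end - tx_start
    if length <= 0:
        raise ValueError("tx_end must be greater than tx_start")
    centre = (ri_start + ri_end) / 2
    if strand == "+":
        offset = centre - tx_start
    else:
        offset = tx_end - centre
    return 100.0 * offset / length


# Toy example (invented coordinates): the same RI lies near the end of
# a plus-strand transcript but near the start of a minus-strand one.
print(relative_ri_position(850, 950, 100, 1100, "+"))
print(relative_ri_position(850, 950, 100, 1100, "-"))
```

Values close to 100 correspond to RIs located towards the transcriptional end, which is the pattern reported for major RIs.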
On the other hand, the term processed transcript constitutes an ambiguous category. Manual inspection of a subset of processed transcripts that were consistently identified across all samples as the major transcript indicated that they could potentially be re-annotated to protein coding, nonsense-mediated decay or retained intron (Table B.6). Together, these observations suggest that the true proportion of non-coding major transcripts for protein coding genes may be lower than the current annotation suggests, in line with recent evidence pointing to the existence of peptides from non-coding RNAs [Hemberg et al., 2014].