Introns are common genomic elements in most eukaryotic genes. Intron removal in
eukaryotic genes is a complex process involving the interaction of several factors,
including the precise identification of splice sites by the associated splicing factors.
However, when introns are extremely long, correctly selecting the true splice
sites becomes a challenging task. Several models have been proposed to explain how long
introns are spliced. The first hypothesis is recursive splicing, which removes
sub-fragments of an intron stepwise from 5’ to 3’. The most striking feature of recursive
splicing is the juxtaposition of a 3’ acceptor site and a 5’ donor site, which results in a
zero-length exon (Figure 3.1(A)). The 5’ site is regenerated after removal of the upstream
sub-fragment, and this process may be repeated recursively. Recursive splicing has been
confirmed in Drosophila melanogaster [3]. That study used a simple Position-Specific
Scoring Matrix (PSSM) based scoring model [104] to predict 165 recursive sites, 5 of
which were validated experimentally. However, this prediction model may not be very accurate,
since splice site composition is not static and is strongly related to the size of the intron
and of the flanking exons [105-107]. The prediction method was refined in a follow-up study
[108], which proposed 376 predicted recursive sites, 10 of which were supported by RT-PCR
analysis. The improved model relies heavily on a special feature, the upstream
polypyrimidine tract around the recursive sites. However, our results show that this
feature may not be observed in other species, which limits this ab initio prediction model
to Drosophila or to invertebrates. Indeed, most recursive splicing studies focus on the
Drosophila family only, and whether this mechanism is ubiquitous in other species is still
analysis or mutation tests. The reliability of these prediction models cannot be tested
genome-wide.
Another, less restrictive type of stepwise intron removal, called intrasplicing, was
introduced by Ott et al. [109] in 2004. In this model, a long intron comprises a set of
intraintrons, which are removed until the remaining intron is short enough to be spliced in
a single step (Figure 3.1(C)). However, this model is based entirely on bioinformatics
analysis, and no experimental confirmation was provided. Parra et al. demonstrated an
example of intrasplicing in the first exon of the protein 4.1R gene, which may be coordinated
with downstream alternative splicing.
In addition to the recursive splicing and intrasplicing models, Shepard et al. [110] used
computational methods to predict recursive splice sites in insects and vertebrates. They
found that in insects predicted recursive splice sites are more abundant than on the
complementary strand, but their results did not show a significant difference in vertebrates,
even though most vertebrates have longer introns. They also showed that the large introns of
vertebrates tend to contain many repeat elements such as SINEs and LINEs. They postulated
that the large introns of vertebrates may form stem structures, which may facilitate splicing
by bringing the donor and acceptor splice junctions closer together. Although this study
offers no experimental evidence for its hypotheses, it raises an interesting observation:
the stepwise removal mechanism may not be able to handle the extremely long introns of
mammals.
Although several models have been proposed in the past decade, most of them are
based on computational prediction. Some studies used RT-PCR to test the existence of
we take advantage of the sequencing depth of the RNA-Seq protocol. Because RNA-Seq
data are deep enough that many reads map to introns [111], some studies have used
RNA-Seq data to analyze splicing patterns [112, 113]. In our study, we developed a tool
called RSSFinder, which identifies recursive splicing sites and intrasplicing sites using
RNA-Seq data. We used RSSFinder to confirm the existence of recursive splicing and
intrasplicing by identifying reads that map to the junctions of splicing intermediates.
Theoretically, recursive splicing may also proceed from 3’ to 5’. We call this type II
recursive splicing, and the 5’ to 3’ process type I (Figure 3.1). We search for evidence of
both types of recursive splicing as well as of intrasplicing, which we refer to here as
type III recursive splicing. Recursive splice sites are denoted as RSSs. We used RSSFinder
to investigate four species: Drosophila, mouse, rice and Arabidopsis. We found that recursive
splicing does not seem to be uncommon, even in plant species, whose introns are known to be
very short.
Figure 3.1 Long intron splicing models.
(A) Type I recursive splicing. The left part of the intron is removed by recognizing an acceptor site in the middle of the intron, and a new 5’ splice site is then regenerated. To prove the existence of recursive splicing, we need to find reads that map to the junction of exon1 and the right part of the intron. (B) Type II recursive splicing. In this case, the right part of the intron is removed first, and a new acceptor site is regenerated. (C) Type III recursive splicing, or intrasplicing. The long intron is shortened by removing subintrons without using the splice sites of the long intron itself. (D) A stem structure with loops may facilitate the splicing of long introns in vertebrates.
Methods
To prove the existence of recursive splicing, we have to find reads that map to the
junction between the exon and the partially spliced intron. We would also like to explore
whether recursive splicing is the dominant way of dealing with long introns across species.
For this purpose, we developed a pipeline called RSSFinder, which consists of several Perl
scripts and a C++ program.
Intron retrieval
The intron datasets for Drosophila, mouse, Arabidopsis and rice are constructed first.
The genome and annotation versions are shown in Table 2.1. Detailed statistics of the
retrieved introns are given in Table 3.4. Mouse clearly has both a larger number of introns
and longer introns. The plant species, on the other hand, have shorter
introns. Because some introns are extremely long, the median intron size is far
shorter than the mean. A Perl script named IntronRetriever.pl parses the gff3 or gtf format
Table 3.1 - Datasets
Species                  Project/Institute  Version  Number of introns  RNA-Seq runs (1)
Drosophila melanogaster  FlyBase            5.45     72,306             SRR352499~SRR352506 (2), SRR043397, SRR040044,
                                                                        SRR061686, SRR070259, SRR074421, SRR168834,
                                                                        SRR029112, SRR038616, SRR042297, SRR364724,
                                                                        SRR414921
Mus musculus             GRC                Build38  532,819            SRR001365, SRR006492, SRR037945, SRR037946,
                                                                        SRR037497, SRR037950, SRR037951, SRR037952,
                                                                        SRR099239
Arabidopsis thaliana     TAIR               10       116,481            SRR013417, SRR013418, SRR071240, SRR089777,
                                                                        SRR360152, SRR391051, SRR394082
Oryza sativa             RGAP               7        184,635            SRR037717, SRR037720, SRR037737, SRR037738,
                                                                        SRR037739, SRR504369, SRR504371
(1) All datasets can be downloaded from NCBI SRA: http://www.ncbi.nlm.nih.gov/sra/
(2) Dataset can be downloaded from the DNA Data Bank of Japan (http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=SRA047035).
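As an illustration of the intron retrieval step, the following Python sketch derives intron intervals from the exon coordinates of a single transcript, as they would be listed in a GFF3 or GTF annotation, and shows why a few very long introns pull the mean far above the median. It is not the IntronRetriever.pl script; the function name and the toy coordinates are hypothetical.

```python
from statistics import mean, median


def introns_from_exons(exons):
    """Return intron intervals (1-based, inclusive) between consecutive exons.

    `exons` is a list of (start, end) pairs for one transcript, using the
    1-based inclusive coordinates of GFF3/GTF. Hypothetical helper.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:  # skip abutting or overlapping exons
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Toy transcript with one very long intron (coordinates are made up).
exons = [(100, 200), (301, 400), (501, 600), (50601, 50700)]
introns = introns_from_exons(exons)
lengths = [end - start + 1 for start, end in introns]

print("introns:", introns)
print("median length:", median(lengths))
print("mean length:", round(mean(lengths), 1))
```

With two short introns and one long intron, the median stays at the short length while the mean is dominated by the long one, which mirrors the pattern described for the retrieved intron sets.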
Finding RSSs
Next, all potential recursive sites must be predicted for each intron. Each
RSS is treated as a pair of regular acceptor and donor splice sites. We implemented a C++
program named RSSPredictor, based on the latest version of SplicePredictor [69,
114], which employs a Bayesian Markov model to predict splice sites, including
non-canonical sites. For type I and II RSSs, we scan all intron sequences and identify every
qualified acceptor site that is immediately followed by a donor site. For type III RSSs, we
search for all qualified donor sites and pair them with acceptor sites within a specified
range. Sites whose posterior probability and Bayes factor [115] exceed the specified cutoffs
are considered candidate RSSs. For each hypothetical RSS, we construct a pseudo
intermediate by ligating the upstream exons to the regenerated 5’ donor site (Figure 3.2(A))
for type I RSSs, or the regenerated 3’ RSS to the downstream exon for type II RSSs (Figure
The Bowtie [15] indexes of these pseudo intermediates are then built with the
bowtie-build program, and the RNA-Seq data are aligned to these sequences. RSSFinder can
take multiple RNA-Seq libraries as input. Here we used 19 Illumina runs for Drosophila, 9
runs for mouse, 7 runs for Arabidopsis and 7 runs for rice (Table 3.1). We first used Bowtie
to map all RNA-Seq reads to the reference genomes. Reads that cannot be mapped to the
reference genome at this stage are the reads that potentially map to the junctions of
recursive sites. We therefore used Bowtie again to align these unmapped reads to the pseudo
intermediates according to the following rules, which are adjustable parameters in
RSSFinder: (1) a read must span the junction by at least 12 bp on each side; (2) if the
shorter part of the read is less than 18 bp, at most 1 mismatch is allowed in that part. In
addition, at most 2 mismatches are allowed over the whole read (Figure 3.2(D)). All confirmed
RSSs from each RNA-Seq library are then combined and duplicates are removed. We also filter
out RSSs associated with known alternative splicing events. In other words, only non-exonic
RSSs are considered in our study.
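The junction rules above can be expressed as a small predicate. The Python sketch below is only an illustration of those rules, not RSSFinder's code: the function name, its argument layout, and the handling of a read split into two equal halves (the stricter mismatch count is used) are assumptions.

```python
def read_confirms_rss(left_len, right_len, mm_left, mm_right,
                      min_overhang=12, short_part_limit=18,
                      max_short_mismatches=1, max_total_mismatches=2):
    """Check one read aligned across a pseudo-intermediate junction.

    left_len / right_len: aligned bases on each side of the junction.
    mm_left / mm_right:   mismatches on each side.
    Defaults follow the values quoted in the text; all names are hypothetical.
    """
    short_len = min(left_len, right_len)
    if left_len < right_len:
        short_mm = mm_left
    elif right_len < left_len:
        short_mm = mm_right
    else:
        # Assumption: for equal halves, apply the rule to the worse half.
        short_mm = max(mm_left, mm_right)

    if short_len < min_overhang:                      # rule (1)
        return False
    if mm_left + mm_right > max_total_mismatches:     # whole-read limit
        return False
    if short_len < short_part_limit and short_mm > max_short_mismatches:
        return False                                  # rule (2)
    return True


# A 36 bp read split 12/24 with one mismatch on each side passes.
print(read_confirms_rss(12, 24, 1, 1))
# The same read with only 11 bp on the short side fails rule (1).
print(read_confirms_rss(11, 25, 0, 0))
```

Keeping the thresholds as keyword arguments reflects the statement that they are adjustable parameters.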
Figure 3.2 Confirmation of RSSs.
To confirm the three stepwise intron removal models, we first create the pseudo
intermediates by (A) ligating upstream exons to the regenerated 5’ donor site for type I RSSs, (B) ligating the regenerated 3’ RSS to the downstream exon for type II RSSs, or (C) joining the sequences upstream and downstream of the subintron for type III RSSs. We then align the unmapped reads to these pseudo transcripts. (D) An RSS is confirmed by an RNA-Seq read if (1) the shorter part of the read mapped on the pseudo intermediate is at least 12 bp long, and (2) when the shorter part of the read is less than 18 bp, it contains at most 1 mismatch.
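The construction of the three kinds of pseudo intermediates can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the RSSFinder implementation: the function names, the `flank` parameter (how many bases are kept on each side of the junction) and the toy sequences are hypothetical, and positions are 0-based offsets into the intron.

```python
def type1_intermediate(upstream_exon, intron, rss_pos, flank=50):
    """Upstream exon ligated to the intron starting at the regenerated 5' donor site."""
    left = upstream_exon[-flank:]
    right = intron[rss_pos:rss_pos + flank]
    return left + right, len(left)  # sequence and junction offset


def type2_intermediate(intron, downstream_exon, rss_pos, flank=50):
    """Intron up to the regenerated 3' site ligated to the downstream exon."""
    left = intron[max(0, rss_pos - flank):rss_pos]
    right = downstream_exon[:flank]
    return left + right, len(left)


def type3_intermediate(intron, sub_start, sub_end, flank=50):
    """Sequences upstream and downstream of a subintron [sub_start, sub_end) joined."""
    left = intron[max(0, sub_start - flank):sub_start]
    right = intron[sub_end:sub_end + flank]
    return left + right, len(left)


# Toy sequences (made up): the RSS sits where an acceptor AG meets a donor GT.
upstream_exon = "ATGGCCAAG"
first_half = "GTAAGTTTTTTCAG"       # ends with the acceptor AG
second_half = "GTAAGACCCCCTTTCAG"   # starts with the regenerated donor GT
intron = first_half + second_half
downstream_exon = "GGCTGA"
rss_pos = len(first_half)

print(type1_intermediate(upstream_exon, intron, rss_pos, flank=6))
print(type2_intermediate(intron, downstream_exon, rss_pos, flank=6))

# Type III: remove an internal subintron from a longer intron.
head, sub, tail = "GTAAGTCC", "GTAAGAAAATTTCAG", "CCTTTCAG"
long_intron = head + sub + tail
print(type3_intermediate(long_intron, len(head), len(head) + len(sub), flank=4))
```

Returning the junction offset alongside the sequence makes it straightforward to work out later how many bases of an aligned read fall on each side of the junction.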
Results and discussion
We tested four model species (Drosophila, mouse, Arabidopsis and rice) with
simulated and real intron sequences. We then used RNA-Seq data to find the confirmed RSSs
according to the rules described in the Methods section.
Simulation
To understand whether the RSSs result from evolutionary pressure or simply
arise by chance, we simulated the same number of introns and flanking exons while
conserving the dinucleotide composition. Because a similar study has been done by Ott et al.
[109] for intrasplicing, here we focus on the case where an acceptor site is immediately
followed by a donor site. We used relatively stringent criteria to search for RSSs, with
cutoffs of 0.85 for the p-value (posterior probability) and 3 for the c-value (Bayes factor).
The number of predicted RSSs is significantly higher than the number of sites found in random
sequences for all four species. We also noticed that in Drosophila there is an obvious bias
for recursive sites to occur in longer introns, but we did not find this bias in the other
three species. The
Figure 3.3 Comparison of simulated and real intron sequences.
Distribution of predicted recursive sites, normalized per million nucleotides. We compare the number of sites in real introns with the number in sequences generated by a first-order Markov model. In all four species, the number of sites is significantly larger than in random sequences, indicating that recursive splicing may be a common mechanism in both animals and plants.
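The kind of null model used in this comparison can be sketched as follows: a first-order Markov chain is trained on a real sequence, so that simulated sequences approximately preserve its dinucleotide composition, and sites are then counted per million nucleotides. This is an illustrative Python sketch only. The function names are hypothetical, and the plain `AGGT` motif (acceptor `AG` immediately followed by donor `GT`) is a crude stand-in for the Bayesian site scoring described in the text.

```python
import random
from collections import Counter, defaultdict


def train_first_order(seq):
    """Count how often each base is followed by each other base."""
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return counts


def simulate(seq, length, rng):
    """Generate a random sequence from a first-order Markov model of `seq`."""
    counts = train_first_order(seq)
    out = [rng.choice(seq)]  # first base drawn from mononucleotide frequencies
    while len(out) < length:
        followers = counts.get(out[-1])
        if not followers:  # base never seen with a successor: restart
            out.append(rng.choice(seq))
            continue
        bases, weights = zip(*followers.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


def sites_per_million(seq, motif="AGGT"):
    """Occurrences of `motif` (overlaps allowed) per million nucleotides."""
    n = sum(1 for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i))
    return n * 1_000_000 / len(seq)


rng = random.Random(0)
real = "ACGTAGGTACCGTTAGGACT" * 50  # toy stand-in for real intron sequence
fake = simulate(real, len(real), rng)
print("real:", sites_per_million(real))
print("simulated:", sites_per_million(fake))
```

In practice the comparison would be repeated over many simulated sets so that the significance of the difference can be assessed.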
Finding RSSs
Although the simulation results suggest that recursive splicing may be a dominant way
to process intron removal, we want to see whether we can find evidence for recursive splicing from
collect as many candidate sites as possible, thereby obtaining better sensitivity. We used a
very low p-value cutoff of 0.5 and a c-value cutoff of 1. On the other hand, because the
number of possible pairings between donor and acceptor sites for type III RSSs grows
explosively, we have to use very stringent criteria; otherwise the number of predicted sites
becomes extremely high. Here we used p-value 0.97 and c-value 7 for mouse and Drosophila, and
p-value 0.9 and c-value 5 for Arabidopsis and rice. For mouse and Drosophila, the subintron
length is restricted to the range from 1,000 bp to 10,000 bp; for Arabidopsis and rice we
used a smaller range, from 300 bp to 2,000 bp. The detailed results for the candidate sites
are shown in Table 3.2.
Table 3.2 - The candidate sites for type I, type II and type III RSSs.
                      Type I                     Type II                    Type III
Species               p-val  c-val  # of sites   p-val  c-val  # of sites   p-val  c-val  # of sites
Drosophila            0.5    1      5,124        0.5    1      9,397        0.97   7      161,643
Mus musculus          0.5    1      397,835      0.5    1      242,935      0.97   7      8,324,153
Arabidopsis thaliana  0.5    1      1,193        0.5    1      1,249        0.9    5      7,205
Oryza sativa          0.5    1      8,912        0.5    1      10,285       0.9    5      55,245
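The pairing of donor and acceptor sites for type III candidates, restricted to a subintron length range, can be sketched in Python as below. This is an assumption-laden illustration rather than RSSPredictor's algorithm: the function name and coordinate convention (0-based positions of the first and last base of the subintron) are hypothetical, and the positions are made up.

```python
from bisect import bisect_left, bisect_right


def pair_subintrons(donors, acceptors, min_len, max_len):
    """Pair donor and acceptor positions whose subintron length is in range.

    donors:    0-based positions of the first base of a candidate subintron.
    acceptors: 0-based positions of the last base of a candidate subintron.
    The subintron length is acceptor - donor + 1.
    """
    acceptors = sorted(acceptors)
    pairs = []
    for d in sorted(donors):
        lo = bisect_left(acceptors, d + min_len - 1)
        hi = bisect_right(acceptors, d + max_len - 1)
        pairs.extend((d, a) for a in acceptors[lo:hi])
    return pairs


# Toy positions with the 300 to 2,000 bp range quoted for Arabidopsis and rice.
donors = [0, 500]
acceptors = [299, 999, 1499, 2600]
print(pair_subintrons(donors, acceptors, min_len=300, max_len=2000))
```

Because every donor can pair with every acceptor inside the window, the number of candidates grows roughly with the product of the two site counts, which is why stricter cutoffs and a bounded length range keep the candidate set manageable.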
We then mapped the RNA-Seq reads to these candidate sites, and the RSSs are confirmed by
these reads. For type I and type II RSSs, an RSS is defined by its location together with
its upstream (or downstream) exon. For type III RSSs, an RSS is defined by the
absolute position of the subintron. We do not count duplicate RSSs that are shared
by many genes. Our results indicate that type II RSSs are barely observed, which is
consistent with the results of Burnette’s study [3]. This observation may result from the
order of co-transcriptional pre-mRNA splicing, for which it has been shown that the introns
close to the 5’ end
stringent condition. We expect to find more type III RSSs if we lower the cutoffs and relax
the range constraint. Except for Arabidopsis, we obtained hundreds of type I RSSs. The
results are shown in Table 3.3. Note that the number of RSSs depends strongly on the number
of mapped reads required. For example, if at least 10 reads are required, the
number of type I RSSs in mouse drops from 140 to 6. This observation shows that most
RNA-Seq experiments are not designed to capture splicing intermediates. Our datasets
are all public, and they were mainly generated to study mature mRNA. Special experiments,
such as sequencing chromatin-associated mRNAs, could be designed if the goal is to
search for recursive splicing sites.
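The dependence of the number of confirmed RSSs on the read threshold can be illustrated with a short Python sketch. The data are made up, and the names are hypothetical. Summing supporting reads across libraries before applying the threshold is an assumption, since the text only states that confirmed RSSs from all libraries are combined and deduplicated.

```python
from collections import Counter


def merge_libraries(per_library_counts):
    """Combine per-library read counts; each RSS key appears once in the result."""
    total = Counter()
    for counts in per_library_counts:
        total.update(counts)  # adds counts for RSSs seen in several libraries
    return total


def count_confirmed(total, min_reads):
    """Number of RSSs supported by at least `min_reads` junction reads."""
    return sum(1 for n in total.values() if n >= min_reads)


# Toy input: RSS key -> number of junction-spanning reads in each library.
library_a = {("chr1", 1000): 2, ("chr1", 5000): 1}
library_b = {("chr1", 1000): 9, ("chr2", 300): 3}
total = merge_libraries([library_a, library_b])

for threshold in (1, 3, 10):
    print(threshold, count_confirmed(total, threshold))
```

The three thresholds correspond to the columns of Table 3.3.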
Table 3.3 - Confirmed sites.
Species    Minimum number of mapped reads >= 1    Minimum number of mapped reads >= 3    Minimum number of mapped reads >= 10