Background - Local assembly and pre-mRNA splicing analyses by high-throughput sequencing data

Introns are common genomic elements in most eukaryotic genes. Intron removal is a complex process involving the interaction of several factors, and it requires the precise identification of splice sites by the associated splicing factors. However, when an intron is extremely long, correctly selecting the true splice sites becomes a challenging task. Several models have been proposed to explain how long introns are spliced. The first is recursive splicing, which removes sub-fragments of an intron stepwise from 5’ to 3’. The most striking feature of recursive splicing is the juxtaposition of a 3’ acceptor site and a 5’ donor site, which results in a zero-length exon (Figure 3.1(A)). The 5’ site is regenerated after the removal of the upstream sub-fragment, and this process may be repeated recursively. Recursive splicing has been confirmed in Drosophila melanogaster [3]. That study used a simple Position-Specific Scoring Matrix (PSSM) based scoring model [104] to predict 165 recursive sites, 5 of which were validated experimentally. However, the prediction model may not be very accurate, since splice site composition is not static and is strongly related to the size of the intron and of the flanking exons [105-107]. The prediction method was refined in a subsequent study [108], which proposed 376 predicted recursive sites, 10 of which were supported by RT-PCR analysis. This improved model relies heavily on a special feature, the upstream polypyrimidine tract around the recursive sites. However, our results show that this feature may not be observed in other species, which limits this ab initio prediction model to Drosophila or to invertebrates. Indeed, most recursive splicing studies focus on the Drosophila family only, and whether this mechanism is ubiquitous in other species is still

analysis or mutation tests. The reliability of these prediction models cannot be tested genome-wide.

Another, less restrictive type of stepwise intron removal, called intrasplicing, was introduced in 2004 [109]. In this model, a long intron comprises a set of intraintrons, which are removed until the remaining intron is short enough to be spliced in a single step (Figure 3.1(C)). However, this model is based entirely on bioinformatics analysis, and no experimental confirmation was provided. Parra et al. demonstrated an example of intrasplicing in the first exon of the protein 4.1R gene, which may be coordinated with downstream alternative splicing.

In addition to the recursive splicing and intrasplicing models, Shepard et al. [110] used computational methods to predict recursive splice sites in insects and vertebrates. They found that in insects recursive splice sites are more abundant on the transcribed strand than on the complementary strand, but they found no significant difference in vertebrates, even though most vertebrates have longer introns. They also showed that large vertebrate introns tend to contain many repeat elements such as SINEs and LINEs. They postulated that the large introns of vertebrates may form stem structures that facilitate splicing by bringing the donor and acceptor splice junctions closer together. Although the study offers no experimental evidence for these hypotheses, it raises the interesting possibility that the stepwise removal mechanism may not be able to handle the extremely long introns of mammals.

Although several models have been proposed in the past decade, most of them are based on computational prediction. Some studies used RT-PCR to test the existence of

we take advantage of the sequencing depth of the RNA-Seq protocol. Because RNA-Seq data are so deep that many reads map to introns [111], some studies have used RNA-Seq data to analyze splicing patterns [112, 113]. In our study, we developed a tool called RSSFinder, which identifies recursive splice sites and intrasplicing sites using RNA-Seq data. We used RSSFinder to confirm the existence of recursive splicing and intrasplicing by identifying reads that map to the junctions of splicing intermediates. Theoretically, recursive splicing may also proceed from 3’ to 5’. We call this type II recursive splicing (Figure 3.1(B)). We searched for evidence of both types of recursive splicing as well as of intrasplicing. Here we use the term type III recursive splicing to refer to intrasplicing, and we denote recursive splice sites as RSSs. We used RSSFinder to investigate four species: Drosophila, mouse, rice and Arabidopsis. We found that recursive splicing does not seem to be uncommon even in plant species, whose introns are known to be very short.

Figure 3.1 Long intron splicing models.

(A) Type I recursive splicing. The left part of the intron is removed by recognizing an acceptor site in the middle of the intron, and a new 5’ splice site is regenerated. To prove the existence of recursive splicing, we need to find reads that map to the junction of exon 1 and the right part of the intron. (B) Type II recursive splicing. In this case, the right part of the intron is removed first, and a new acceptor site is regenerated. (C) Type III recursive splicing, or intrasplicing. The long intron is shortened by the removal of subintrons without using the splice sites of the long intron itself. (D) A stem structure with loops may facilitate the splicing of long introns in vertebrates.

Methods

To prove the existence of recursive splicing, we have to find reads that map to the junction between an exon and a partially spliced intron. We would also like to explore whether recursive splicing is the dominant way of handling long introns across species. For this purpose, we developed a pipeline called RSSFinder, which consists of several Perl scripts and a C++ program.

Intron retrieval

The intron datasets for Drosophila, mouse, Arabidopsis and rice are constructed first. The genome and annotation versions are shown in Table 2.1. Detailed statistics of the retrieved introns are given in Table 3.4. Mouse clearly has both more introns and longer introns, whereas the plant species have shorter introns. Because some introns are extremely long, the median intron size is far smaller than the mean intron size. A Perl script named IntronRetriever.pl parses the gff3 or gtf format

Table 3.1 - Datasets

| Species | Project/Institute | Version | Number of introns | RNA-Seq runs (1) |
| --- | --- | --- | --- | --- |
| Drosophila melanogaster | FlyBase | 5.45 | 72,306 | SRR352499~SRR352506 (2), SRR043397, SRR040044, SRR061686, SRR070259, SRR074421, SRR168834, SRR029112, SRR038616, SRR042297, SRR364724, SRR414921 |
| Mus musculus | GRC | Build38 | 532,819 | SRR001365, SRR006492, SRR037945, SRR037946, SRR037497, SRR037950, SRR037951, SRR037952, SRR099239 |
| Arabidopsis thaliana | TAIR | 10 | 116,481 | SRR013417, SRR013418, SRR071240, SRR089777, SRR360152, SRR391051, SRR394082 |
| Oryza sativa | RGAP | 7 | 184,635 | SRR037717, SRR037720, SRR037737, SRR037738, SRR037739, SRR504369, SRR504371 |

(1) All datasets can be downloaded from NCBI SRA: http://www.ncbi.nlm.nih.gov/sra/

(2) These datasets can be downloaded from the DNA Data Bank of Japan (http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=SRA047035).
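The intron retrieval step can be illustrated with a minimal Python sketch. It is not the IntronRetriever.pl script, whose details are not given here; the function names and the input shape (a list of 1-based inclusive exon coordinates for one transcript, as found in GFF3/GTF exon features) are assumptions made for illustration. The toy example also shows why a single very long intron pulls the mean far above the median.

```python
from statistics import mean, median

def introns_from_exons(exons):
    """Return (start, end) intron intervals between consecutive exons.

    `exons` is a list of 1-based inclusive (start, end) pairs for one
    transcript (hypothetical input shape, modelled on GFF3/GTF exons).
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:  # skip exons that abut each other
            introns.append((prev_end + 1, next_start - 1))
    return introns

def intron_lengths(introns):
    """Length in bp of each inclusive (start, end) interval."""
    return [end - start + 1 for start, end in introns]

# Toy transcript: three short introns and one very long intron.
exons = [(1, 100), (201, 300), (451, 500), (601, 700), (50701, 50800)]
introns = introns_from_exons(exons)
lengths = intron_lengths(introns)
print(introns)
print("mean:", mean(lengths), "median:", median(lengths))
```

In this toy case the mean intron length is two orders of magnitude larger than the median, which is the pattern described above for real intron sets.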

Finding RSSs

Next, all potential recursive sites must be predicted for each intron. Each RSS is treated as a pair of regular acceptor and donor splice sites. We implemented a C++ program named RSSPredictor, based on the latest version of SplicePredictor [69, 114], which employs a Bayesian Markov model to predict splice sites, including non-canonical sites. For type I and II RSSs, we scan all intron sequences and identify every qualified acceptor site that is immediately followed by a donor site. For type III RSSs, we search for all qualified donor sites and pair them with acceptor sites within a specified range. Sites whose posterior probability and Bayes factor [115] exceed the specified cutoffs are considered candidate RSSs. For each hypothetical RSS, we construct pseudo intermediates by ligating the upstream exon to the regenerated 5’ donor site for type I RSSs (Figure 3.2(A)), and the regenerated 3’ RSS to the downstream exon for type II RSSs (Figure

The Bowtie [15] indexes of these pseudo intermediates are then built with the bowtie-build program, and the RNA-Seq data are aligned to these sequences. RSSFinder can take multiple RNA-Seq libraries as input. Here we used 19 Illumina runs for Drosophila, 9 runs for mouse, 7 runs for Arabidopsis and 7 runs for rice (Table 3.1). We first used Bowtie to map all RNA-Seq reads to the reference genome. Reads that fail to map to the reference genome are potential reads spanning the junctions of recursive sites. We therefore used Bowtie again to align these unmapped reads to the pseudo intermediates according to the following rules, which are adjustable parameters in RSSFinder: (1) a read must extend at least 12 bp on each side of the junction; (2) if the shorter part of the read is less than 18 bp long, at most 1 mismatch is allowed in that part. In total, at most 2 mismatches are allowed in the whole read (Figure 3.2(D)). The confirmed RSSs from all RNA-Seq libraries are then combined, and duplicates are removed. We also filter out RSSs associated with known alternative splicing events; in other words, only non-exonic RSSs are considered in our study.
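The junction rules above can be written as a small predicate. The following Python sketch is an illustration only, not the RSSFinder code: it assumes an ungapped placement of the read on a pseudo intermediate, and the function name, argument names and default values (taken from the rules above) are hypothetical.

```python
def confirms_junction(ref, junction, read, start,
                      min_anchor=12, short_len=18,
                      max_short_mm=1, max_total_mm=2):
    """Check an ungapped read placement against the junction rules.

    ref      -- pseudo intermediate sequence
    junction -- index in ref of the first base after the junction
    read     -- read sequence, aligned without gaps
    start    -- index in ref where the read starts
    """
    end = start + len(read)
    if start < 0 or end > len(ref):
        return False
    left_len = junction - start    # read bases before the junction
    right_len = end - junction     # read bases after the junction
    if min(left_len, right_len) < min_anchor:
        return False               # rule 1: anchor shorter than 12 bp
    mismatches = [a != b for a, b in zip(read, ref[start:end])]
    left_mm = sum(mismatches[:left_len])
    right_mm = sum(mismatches[left_len:])
    if left_mm + right_mm > max_total_mm:
        return False               # whole-read mismatch limit
    if left_len <= right_len:
        short_side, short_mm = left_len, left_mm
    else:
        short_side, short_mm = right_len, right_mm
    if short_side < short_len and short_mm > max_short_mm:
        return False               # rule 2: too many mismatches in short part
    return True

# Toy pseudo intermediate: 30 bp of "exon" followed by 30 bp of "intron".
ref = "A" * 30 + "C" * 30
print(confirms_junction(ref, 30, ref[18:48], 18))  # 12 bp anchor on the left
print(confirms_junction(ref, 30, ref[20:50], 20))  # only 10 bp on the left
```

Treating the whole-read limit of 2 mismatches as independent of the short-anchor rule is one reading of the rules; the thresholds are exposed as parameters because they are described as adjustable.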

Figure 3.2 Confirmation of RSSs.

To confirm the three stepwise intron removal models, we first create pseudo intermediates by (A) ligating the upstream exon to the regenerated 5’ donor site for type I RSSs, (B) ligating the regenerated 3’ RSS to the downstream exon for type II RSSs, or (C) ligating the sequences upstream and downstream of the subintron for type III RSSs. We then align the unmapped reads to these pseudo transcripts. (D) An RSS is confirmed by an RNA-Seq read if (1) the shorter part of the read mapped on the pseudo intermediate is at least 12 bp long, and (2) when the shorter part is less than 18 bp long, it contains at most 1 mismatch.
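The three constructions in panels (A) to (C) amount to simple string ligations. The sketch below is a hedged illustration in Python rather than the pipeline's Perl/C++ code; the function names, the 0-based intron offsets and the `flank` parameter (how many bases to keep on each side of the junction) are assumptions. Each function returns the pseudo intermediate together with the junction index, which is what a read must span.

```python
def type1_intermediate(upstream_exon, intron, rss, flank=50):
    """(A) Upstream exon ligated to the intron remainder that starts at
    the regenerated 5' donor site (0-based intron offset `rss`)."""
    left = upstream_exon[-flank:]
    right = intron[rss:rss + flank]
    return left + right, len(left)

def type2_intermediate(intron, rss, downstream_exon, flank=50):
    """(B) Intron part ending at the regenerated 3' site (offset `rss`,
    exclusive) ligated to the downstream exon."""
    left = intron[:rss][-flank:]
    right = downstream_exon[:flank]
    return left + right, len(left)

def type3_intermediate(intron, sub_start, sub_end, flank=50):
    """(C) Sequence upstream of the subintron ligated to the sequence
    downstream of it; the subintron is intron[sub_start:sub_end]."""
    left = intron[:sub_start][-flank:]
    right = intron[sub_end:sub_end + flank]
    return left + right, len(left)

# Toy sequences; real introns and exons are of course much longer.
exon1, exon2 = "AAAAAA", "CCCCCC"
intron = "GTCCCCAGGTTTTTAG"          # acceptor AG at 6-7, donor GT at 8-9
print(type1_intermediate(exon1, intron, 8, flank=4))
print(type2_intermediate(intron, 8, exon2, flank=4))
print(type3_intermediate("GTAAGTCCCAGTTTAG", 4, 11, flank=4))
```

Keeping only a bounded flank on each side keeps the index small; a flank at least as long as the reads would be a natural choice, but the value used by RSSFinder is not stated here.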

Results and discussion

We tested four model species, Drosophila, mouse, Arabidopsis and rice, with simulated and real intron sequences. We then used RNA-Seq data to find confirmed RSSs according to the rules described in the Methods section.

Simulation

To understand whether RSSs are the result of evolutionary pressure or are simply formed by chance, we simulated the same number of introns and flanking exons while conserving dinucleotide composition. Since a similar study has been done by Ott et al. [109] for intrasplicing, here we focus on the case in which the acceptor site is immediately followed by a donor site. We used relatively stringent criteria to search for RSSs, with cutoffs of 0.85 for the p-value (posterior probability) and 3 for the c-value (Bayes factor). We found that the number of predicted RSSs is significantly higher than the number of sites found in random sequences for all four species. We also noticed that in Drosophila there is an obvious bias for recursive sites to occur in longer introns, but we did not find this bias in the other three species. The

Figure 3.3 Comparison of simulated and real intron sequences.

The distribution of predicted recursive sites, normalized per million nucleotides. The number of sites in real introns is compared with the number in sequences generated by a first-order Markov model. In all four species, the number of sites is significantly larger than in random sequences, indicating that recursive splicing may be a common mechanism in both animals and plants.
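The simulation idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used to produce Figure 3.3: a first-order Markov chain is fitted to the real sequences so that the simulated sequences have a similar dinucleotide composition, and the literal motif AGGT stands in for "an acceptor site immediately followed by a donor site", whereas the study scores sites with a Bayesian model. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_first_order(seqs):
    """Count first bases and base-to-base transitions in `seqs`."""
    starts, trans = Counter(), defaultdict(Counter)
    for s in seqs:
        if s:
            starts[s[0]] += 1
        for a, b in zip(s, s[1:]):
            trans[a][b] += 1
    return starts, trans

def _draw(counter, rng):
    """Draw one key of `counter` with probability proportional to its count."""
    keys = list(counter)
    return rng.choices(keys, weights=[counter[k] for k in keys])[0]

def simulate(length, starts, trans, rng):
    """Generate one sequence from the fitted first-order Markov chain."""
    if length <= 0:
        return ""
    seq = [_draw(starts, rng)]
    while len(seq) < length:
        nxt = trans.get(seq[-1])
        # Fall back to the start distribution if a base has no successor.
        seq.append(_draw(nxt if nxt else starts, rng))
    return "".join(seq)

def sites_per_million(seqs, motif="AGGT"):
    """Motif occurrences (overlaps counted) per million nucleotides."""
    total = sum(len(s) for s in seqs)
    hits = sum(1 for s in seqs
               for i in range(len(s) - len(motif) + 1)
               if s.startswith(motif, i))
    return 1e6 * hits / total if total else 0.0

rng = random.Random(0)  # fixed seed so the toy run is repeatable
real = ["GTAAGTCCAGGTAAGTTTTCAG", "GTATGTTTCAGGTGAGTCTTAG"]
starts, trans = fit_first_order(real)
simulated = [simulate(len(s), starts, trans, rng) for s in real]
print(sites_per_million(real), sites_per_million(simulated))
```

With real data, the comparison would be repeated over many simulated sets to judge whether the excess of sites in real introns is significant.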

Finding RSSs

Although the simulation results suggest that recursive splicing may be a dominant mechanism of intron removal, we want to see whether we can find evidence for recursive splicing from

collect as many candidate sites as possible, thereby obtaining better sensitivity. We used a very low p-value cutoff of 0.5 and a c-value cutoff of 1. On the other hand, because the number of possible pairings between donor and acceptor sites for type III RSSs grows explosively, we had to use very stringent criteria; otherwise the number of predicted sites would be extremely high. Here we used p-value 0.97 and c-value 7 for mouse and Drosophila, and p-value 0.9 and c-value 5 for Arabidopsis and rice. For mouse and Drosophila, the subintron length was restricted to the range from 1,000 bp to 10,000 bp; for Arabidopsis and rice we used a smaller range, from 300 bp to 2,000 bp. The detailed results for the candidate sites are shown in Table 3.2.
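The effect of the cutoffs and of the length range on type III candidates can be illustrated with a short Python sketch. It is not RSSPredictor: the site tuples `(position, posterior, bayes_factor)`, the function names and the coordinate convention are assumptions, and the scores in the example are invented for illustration.

```python
from bisect import bisect_left, bisect_right

def passing(sites, p_cut, c_cut):
    """Positions of sites whose posterior probability and Bayes factor
    both exceed the cutoffs. `sites` holds (position, p, c) tuples."""
    return [pos for pos, p, c in sites if p > p_cut and c > c_cut]

def pair_subintrons(donors, acceptors, min_len, max_len):
    """Pair each donor with every acceptor that yields a subintron of
    min_len..max_len bp. Positions are 0-based; a donor position is the
    first subintron base and an acceptor position is the last."""
    acc = sorted(acceptors)
    pairs = []
    for d in sorted(donors):
        lo = bisect_left(acc, d + min_len - 1)
        hi = bisect_right(acc, d + max_len - 1)
        pairs.extend((d, a) for a in acc[lo:hi])
    return pairs

# Invented scores; cutoffs and range as used for mouse and Drosophila.
donor_sites = [(100, 0.99, 8.0), (500, 0.95, 9.0)]
acceptor_sites = [(1200, 0.98, 7.5), (1500, 0.99, 9.0), (20000, 0.99, 10.0)]
donors = passing(donor_sites, 0.97, 7)
acceptors = passing(acceptor_sites, 0.97, 7)
print(pair_subintrons(donors, acceptors, 1000, 10000))
```

Because every qualifying donor is paired with every qualifying acceptor in range, the number of candidates grows roughly with the product of the two site counts, which is why the stricter cutoffs are needed for type III.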

Table 3.2 - The candidate sites for type I, type II and type III RSSs.

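| Species | Type I p-val cutoff | Type I c-val cutoff | Type I # of sites | Type II p-val cutoff | Type II c-val cutoff | Type II # of sites | Type III p-val cutoff | Type III c-val cutoff | Type III # of sites |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Drosophila | 0.5 | 1 | 5,124 | 0.5 | 1 | 9,397 | 0.97 | 7 | 161,643 |
| Mus musculus | 0.5 | 1 | 397,835 | 0.5 | 1 | 242,935 | 0.97 | 7 | 8,324,153 |
| Arabidopsis thaliana | 0.5 | 1 | 1,193 | 0.5 | 1 | 1,249 | 0.9 | 5 | 7,205 |
| Oryza sativa | 0.5 | 1 | 8,912 | 0.5 | 1 | 10,285 | 0.9 | 5 | 55,245 |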

We then mapped RNA-Seq reads to these candidate sites, and RSSs are confirmed by these reads. For type I and type II RSSs, an RSS is defined by its location and by its upstream (type I) or downstream (type II) exon. For type III RSSs, an RSS is defined by the absolute position of the subintron. An RSS shared by several genes is counted only once. Our results indicate that type II RSSs are barely observed, which is consistent with the results of Burnette’s study [3]. This observation may result from the order of co-transcriptional pre-mRNA splicing, in which it has been shown that the introns close to the 5’ end

stringent condition. We expect that we could find more type III RSSs if we lowered the cutoffs and relaxed the range constraint. Except for Arabidopsis, we obtained hundreds of type I RSSs. The results are shown in Table 3.3. Note that the number of RSSs depends strongly on the number of mapped reads required. For example, if at least 10 mapped reads are required, the number of type I RSSs in mouse drops from 140 to 6. This reflects the fact that most RNA-Seq experiments are not designed to capture splicing intermediates. Our datasets are all public and were generated mainly to study mature mRNA. To search for recursive splice sites, one could design dedicated experiments, such as sequencing chromatin-associated mRNAs.
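The dependence of the confirmed-site counts on the read threshold can be illustrated with a few lines of Python. This is a hypothetical sketch: the RSS key `(chromosome, position, type)`, the function names and the read counts are invented, and summing read counts across libraries is an assumption about how support is pooled.

```python
from collections import Counter

def merge_support(libraries):
    """Sum junction read counts per RSS key across libraries.

    Each library is a dict mapping an RSS key (hypothetical shape:
    (chromosome, position, rss_type)) to its junction read count, so an
    RSS seen in several libraries is kept once with pooled support.
    """
    total = Counter()
    for lib in libraries:
        total.update(lib)
    return total

def confirmed_counts(support, thresholds=(1, 3, 10)):
    """Number of distinct RSSs with at least t supporting reads."""
    return {t: sum(1 for n in support.values() if n >= t)
            for t in thresholds}

# Invented counts for two libraries.
lib_a = {("chr2L", 1050, "I"): 2, ("chr3R", 88000, "I"): 1}
lib_b = {("chr2L", 1050, "I"): 9, ("chrX", 4200, "III"): 3}
support = merge_support([lib_a, lib_b])
print(confirmed_counts(support))
```

The thresholds 1, 3 and 10 mirror the columns of Table 3.3.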

Table 3.3 - Confirmed sites.

| Species | Minimum number of mapped reads >= 1 | Minimum number of mapped reads >= 3 | Minimum number of mapped reads >= 10 |
| --- | --- | --- | --- |
