• No results found

Conclusions

We propose a new RNA-seq gapped alignment algorithm named as TrueSight, incorporating both DNA sequence and RNA-seq features into a logistic regression model which assigns reliability scores to SJs inferred by RNA-seq reads. To our knowledge, TrueSight is the first method with the ability to combine RNA-seq mapping with genome wide splicing signal and coding potential computation from the DNA sequence. Testing on both real and simulation datasets, TrueSight has shown an overall better performance than existing tools in terms of sensitivity and specificity, demonstrating that DNA information is helpful in RNA-seq alignment, especially in SJs with low coverage. Since many AS isoforms are of low coverage, we expect TrueSight useful in AS detection. Mapping RNA-seq reads to SJs is the pivotal point in an algorithm of isoform construction utilizing a reference genome. For example, in IsoLasso (Li et al.,2011a), a recent isoform construction algorithm using TopHat’s output, inferred SJs are explicitly used to significantly reduce the total number of possible isoforms subjected to the LASSO procedure. We expect that TrueSight will improve isoform construction and consequently will make a strong impact on improving the accuracy of estimating isoform expression levels.

There are several other features that we could incorporate in order to further improve the algorithm. First, we could add a module to explicitly model SJs in untranslated region (UTR). Second, we could use the three-periodic model to trace exon reading frames locally; this will enhance the modeling of SJs in coding regions and will reduce the number of pairs of candidate splice sites for a SJ to those that do not disrupt the reading frame. These models could be made GC content dependent and thus more accurate.

With the aid of TrueSight which is designed to discover SJs, we perform AS analysis to honey bee, a social insect without intensive gene annotation, using deep RNA-seq data. We have identified 22,150 instances of AS on 8,852 genes, indicating that 94.6% multi-exon genes in honey bee can code for multiple transcripts. RI and AB are the most dominant AS

subtypes in honey bee. Also, we quantitatively analyzed various factors correlated with AS based on the high coverage sequencing data. The exon-intron structure, strengths of splice sites and the methylation pattern are shown to be involved in cells’ AS decision. Specifically, from observation in CE, AB and ATE, rarely used exon sequences are found presumably to be hypo-methylated, while normal honey bee exons are hyper-methylated. The motif strength of splice site is essential in splice site recognition, and weak sites may disrupt the recognition if introns, exons, 3’/5’ alternative exon regions and terminal exons, leading to RIs, CEs, ABs and ATEs respectively. RIs are relatively shorter than non-retaining introns, however, longer RIs are more likely to be retained than shorter RIs. This observation supports both sides of intron splicing model: (i) splice site mis-recognition in short introns will lead to RI, and CE would emerge if splice site mis-recognition happened on long introns. (ii) short introns are relatively easier to be spliced out than longer ones. The size of CEs and ABs have a significant 3-periodic enrichment, suggesting (i) coding frame preserving AS might have selective advantage in evolution; (ii) nonsense-mediated decay (NMD) might turned over frame shifting AS transcripts which have premature termination codons.

AS often contributes to major developmental decisions and understanding the structure of splice-forms of a gene is essential for uncovering its AS behavior. We have examined the AS of fruitless, a well-established sex-specific AS gene in Drosophila. With the aid of refined honey bee gene model, we identified three first exons, five last exons, a cassette exon and two alternative 5’ splice sites for fruitless. The refined fruitless model would be useful in sex-determination study for honey bee.

Using the gapped alignment module within TrueSight, we designed and implemented an open-source software tool, called FusionHunter, which reliably identifies fusion transcripts from transcriptional analysis of paired-end RNA-seq. We show that FusionHunter can accurately detect fusions that were previously confirmed by RT-PCR in a publicly available dataset. The purpose of FusionHunter is to identify potential fusions with high sensitivity and specificity and to guide further functional validation in the laboratory.

As the cost of next-generating sequencing keeps going down, more and more whole transcriptome sequencing datasets for different organisms will be available. Bioinformatics tools are increasingly important in deciphering the rich biological information behind these high-throughput digital readouts, and broadening our understanding of the transcriptomic behavior of cells.

References

Amini, M. R. and Gallinari, P., 2002. Semi-supervised logistic regression. In fifteenth European Conference on Artificial Intelligence, pages 390–394.

Au, K. F., Jiang, H., Lin, L., Xing, Y., and Wong, W. H., 2010. Detection of splice junctions from paired-end rna-seq data by splicemap. Nucleic Acids Res, 38(14):4570–8.

Berger, M. F., Levin, J. Z., Vijayendran, K., Sivachenko, A., Adiconis, X., Maguire, J., Johnson, L. A., Robinson, J., Verhaak, R. G., Sougnez, C., et al., 2010. Integrative analysis of the melanoma transcriptome. Genome Res, 20(4):413–27.

Berget, S. M., 1995. Exon recognition in vertebrate splicing. J Biol Chem, 270(6):2411–4. Bertossa, R. C., van de Zande, L., and Beukeboom, L. W., 2009. The fruitless gene in nasonia displays complex sex-specific splicing and contains new zinc finger domains. Mol Biol Evol, 26(7):1557–69.

Birney, E., Stamatoyannopoulos, J. A., Dutta, A., Guigo, R., Gingeras, T. R., Margulies, E. H., Weng, Z., Snyder, M., Dermitzakis, E. T., Thurman, R. E., et al., 2007. Identifica- tion and analysis of functional elements in 1% of the human genome by the encode pilot project. Nature, 447(7146):799–816.

Bradley, R. K., Merkin, J., Lambert, N. J., and Burge, C. B., 2012. Alternative splic- ing of rna triplets is often regulated and accelerates proteome evolution. PLoS Biol, 10(1):e1001229.

Bryant, D. W., Shen, R. K., Priest, H. D., Wong, W. K., and Mockler, T. C., 2010. Supersplat-spliced rna-seq alignment. Bioinformatics, 26(12):1500–1505.

Burge, C. and Karlin, S., 1997. Prediction of complete gene structures in human genomic dna. J Mol Biol, 268(1):78–94.

Burset, M., Seledtsov, I. A., and Solovyev, V. V., 2000. Analysis of canonical and non- canonical splice sites in mammalian genomes. Nucleic Acids Res, 28(21):4364–75. Celeux, G. and Govaert, G., 1992. A classification em algorithm for clustering and 2 stochas-

tic versions. Computational Statistics & Data Analysis, 14(3):315–332.

Clark, M. B., Amaral, P. P., Schlesinger, F. J., Dinger, M. E., Taft, R. J., Rinn, J. L., Ponting, C. P., Stadler, P. F., Morris, K. V., Morillon, A., et al., 2011. The reality of pervasive transcription. PLoS Biol, 9(7):e1000625; discussion e1001102.

Cloonan, N., Forrest, A. R. R., Kolle, G., Gardiner, B. B. A., Faulkner, G. J., Brown, M. K., Taylor, D. F., Steptoe, A. L., Wani, S., Bethel, G., et al., 2008. Stem cell transcriptome profiling via massive-scale mrna sequencing. Nature Methods, 5(7):613–619.

Clynen, E., Ciudad, L., Belles, X., and Piulachs, M. D., 2011. Conservation of fruitless’ role as master regulator of male courtship behaviour from cockroaches to flies. Development Genes and Evolution, 221(1):43–48.

Daines, B., Wang, H., Wang, L., Li, Y., Han, Y., Emmert, D., Gelbart, W., Wang, X., Li, W., Gibbs, R., et al., 2011. The drosophila melanogaster transcriptome by paired-end rna sequencing. Genome Res, 21(2):315–24.

Demir, E. and Dickson, B. J., 2005. fruitless splicing specifies male courtship behavior in drosophila. Cell, 121(5):785–94.

Dimon, M. T., Sorber, K., and DeRisi, J. L., 2010. Hmmsplicer: a tool for efficient and sensitive discovery of known and novel splice junctions in rna-seq data. PLoS One, 5(11):e13875.

Elango, N., Hunt, B. G., Goodisman, M. A. D., and Yi, S. V., 2009. Dna methylation is widespread and associated with differential gene expression in castes of the honeybee, apis mellifera. Proceedings of the National Academy of Sciences of the United States of America, 106(27):11206–11211.

Elsik, C. G., Mackey, A. J., Reese, J. T., Milshina, N. V., Roos, D. S., and Weinstock, G. M., 2007. Creating a honey bee consensus gene set. Genome Biol, 8(1):R13.

Gardiner-Garden, M. and Frommer, M., 1987. Cpg islands in vertebrate genomes. J Mol Biol, 196(2):261–82.

Gonzalez-Porta, M., Calvo, M., Sammeth, M., and Guigo, R., 2012. Estimation of alterna- tive splicing variability in human populations. Genome Res, 22(3):528–38.

Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. D., et al., 2011. Full-length transcriptome assembly from rna-seq data without a reference genome. Nature Biotechnology, 29(7):644– U130.

Guttman, M., Garber, M., Levin, J. Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M. J., Gnirke, A., Nusbaum, C., et al., 2010. Ab initio reconstruction of cell type- specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincrnas. Nat Biotechnol, 28(5):503–10.

Hertel, K. J., 2008. Combinatorial control of exon recognition. J Biol Chem, 283(3):1211–5. Hiller, M., Huse, K., Szafranski, K., Jahn, N., Hampe, J., Schreiber, S., Backofen, R., and Platzer, M., 2004. Widespread occurrence of alternative splicing at nagnag acceptors contributes to proteome plasticity. Nat Genet, 36(12):1255–7.

Hinzpeter, A., Aissat, A., Sondo, E., Costa, C., Arous, N., Gameiro, C., Martin, N., Tarze, A., Weiss, L., de Becdelievre, A., et al., 2010. Alternative splicing at a nagnag acceptor site as a novel phenotype modifier. PLoS Genet, 6(10).

Hu, Y., Wang, K., He, X., Chiang, D. Y., Prins, J. F., and Liu, J., 2010. A probabilistic framework for aligning paired-end rna-seq data. Bioinformatics, 26(16):1950–7.

Kent, W. J., 2002. Blat - the blast-like alignment tool. Genome Research, 12(4):656–664. Komarek, P. and Moore, A., 2003. Fast robust logistic regression for large sparse datasets

with binary outputs. In Artificial Intelligence and Statistics.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L., 2009. Ultrafast and memory- efficient alignment of short dna sequences to the human genome. Genome Biol, 10(3):R25. Li, H., Ruan, J., and Durbin, R., 2008. Mapping short dna sequencing reads and calling

Li, W., Feng, J., and Jiang, T., 2011a. Isolasso: a lasso regression approach to rna-seq based transcriptome assembly. J Comput Biol, 18(11):1693–707.

Li, Y., Chien, J., Smith, D. I., and Ma, J., 2011b. Fusionhunter: identifying fusion tran- scripts in cancer using paired-end rna-seq. Bioinformatics, 27(12):1708–10.

Li, Y., Li, H., Burns, P., Borodovsky, M., Robinson, G. E., and Ma, J., 2012. Truesight: Self-training algorithm for splice junction detection using rna-seq. In Proc. 16th Annual International Conference on Research in Computational Molecular Biology (Barcelona, Spain, April 2012).

Luco, R. F., Allo, M., Schor, I. E., Kornblihtt, A. R., and Misteli, T., 2011. Epigenetics in alternative pre-mrna splicing. Cell, 144(1):16–26.

Luco, R. F., Pan, Q., Tominaga, K., Blencowe, B. J., Pereira-Smith, O. M., and Misteli, T., 2010. Regulation of alternative splicing by histone modifications. Science, 327(5968):996– 1000.

Lyko, F., Foret, S., Kucharski, R., Wolf, S., Falckenhayn, C., and Maleszka, R., 2010. The honey bee epigenomes: differential methylation of brain dna in queens and workers. PLoS Biol, 8(11):e1000506.

Maher, C. A., Kumar-Sinha, C., Cao, X., Kalyana-Sundaram, S., Han, B., Jing, X., Sam, L., Barrette, T., Palanisamy, N., and Chinnaiyan, A. M., et al., 2009a. Transcriptome sequencing to detect gene fusions in cancer. Nature, 458(7234):97–101.

Maher, C. A., Palanisamy, N., Brenner, J. C., Cao, X., Kalyana-Sundaram, S., Luo, S., Khrebtukova, I., Barrette, T. R., Grasso, C., Yu, J., et al., 2009b. Chimeric tran- script discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A, 106(30):12353–8.

Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y., 2008. Rna-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9):1509–1517.

Marquez, Y., Brown, J. W., Simpson, C. G., Barta, A., and Kalyna, M., 2012. Transcriptome survey reveals increased complexity of the alternative splicing landscape in arabidopsis. Genome Res, 22(3):528–38.

Mitelman, F., Johansson, B., and Mertens, F., 2007. The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer, 7(4):233–45.

Mortazavi, A., Williams, B. A., Mccue, K., Schaeffer, L., and Wold, B., 2008. Mapping and quantifying mammalian transcriptomes by rna-seq. Nature Methods, 5(7):621–628. Nilsen, T. W. and Graveley, B. R., 2010. Expansion of the eukaryotic proteome by alterna-

tive splicing. Nature, 463(7280):457–63.

Okoniewski, M. J. and Miller, C. J., 2006. Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics, 7:276. Pan, Q., Shai, O., Lee, L. J., Frey, B. J., and Blencowe, B. J., 2008. Deep surveying of alter-

native splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet, 40(12):1413–5.

Pertea, M., Lin, X., and Salzberg, S. L., 2001. Genesplicer: a new computational method for splice site prediction. Nucleic Acids Res, 29(5):1185–90.

Pickrell, J. K., Pai, A. A., Gilad, Y., and Pritchard, J. K., 2010. Noisy splicing drives mrna isoform diversity in human cells. PLoS Genet, 6(12):e1001236.

Reese, M. G., Eeckman, F. H., Kulp, D., and Haussler, D., 1997. Improved splice site detection in genie. J Comput Biol, 4(3):311–23.

Robertson, G., Schein, J., Chiu, R., Corbett, R., Field, M., Jackman, S. D., Mungall, K., Lee, S., Okada, H. M., Qian, J. Q., et al., 2010. De novo assembly and analysis of rna-seq data. Nat Methods, 7(11):909–12.

Royce, T. E., Rozowsky, J. S., and Gerstein, M. B., 2007. Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification. Nu- cleic Acids Res, 35(15):e99.

Sakabe, N. J. and de Souza, S. J., 2007. Sequence features responsible for intron retention in human. BMC Genomics, 8:59.

Sboner, A., Habegger, L., Pflueger, D., Terry, S., Chen, D. Z., Rozowsky, J. S., Tewari, A. K., Kitabayashi, N., Moss, B. J., Chee, M. S., et al., 2010. Fusionseq: a modular framework for finding gene fusions by analyzing paired-end rna-sequencing data. Genome Biol, 11(10):R104.

Sorek, R., Shamir, R., and Ast, G., 2004a. How prevalent is functional alternative splicing in the human genome? Trends in Genetics, 20(2):68–71.

Sorek, R., Shemesh, R., Cohen, Y., Basechess, O., Ast, G., and Shamir, R., 2004b. A non-est-based method for exon-skipping prediction. Genome Res, 14(8):1617–23. Staden, R., 1988. Methods to define and locate patterns of motifs in sequences. Comput

Appl Biosci, 4(1):53–60.

Stamm, S., Zhu, J., Nakai, K., Stoilov, P., Stoss, O., and Zhang, M. Q., 2000. An alternative- exon database and its statistical analysis. DNA Cell Biol, 19(12):739–56.

Sultan, M., Schulz, M. H., Richard, H., Magen, A., Klingenhoff, A., Scherf, M., Seifert, M., Borodina, T., Soldatov, A., Parkhomchuk, D., et al., 2008. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321(5891):956–960.

Thanaraj, T. A., 2000. Positional characterisation of false positives from computational prediction of human splice sites. Nucleic Acids Res, 28(3):744–54.

Trapnell, C., Pachter, L., and Salzberg, S. L., 2009. Tophat: discovering splice junctions with rna-seq. Bioinformatics, 25(9):1105–11.

Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L., Wold, B. J., and Pachter, L., 2010. Transcript assembly and quan- tification by rna-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):511–5.

Villella, A. and Hall, J. C., 2008. Neurogenetics of courtship and mating in drosophila. Adv Genet, 62:67–184.

Wang, K., Singh, D., Zeng, Z., Coleman, S. J., Huang, Y., Savich, G. L., He, X., Mieczkowski, P., Grimm, S. A., Perou, C. M., et al., 2010. Mapsplice: accurate map- ping of rna-seq reads for splice junction discovery. Nucleic Acids Res, 38(18):e178. Wang, Z., Gerstein, M., and Snyder, M., 2009. Rna-seq: a revolutionary tool for transcrip-

tomics. Nat Rev Genet, 10(1):57–63.

Xia, H., Bi, J., and Li, Y., 2006. Identification of alternative 5’/3’ splice sites based on the mechanism of splice site competition. Nucleic Acids Res, 34(21):6305–13.

Yeo, G. and Burge, C. B., 2004. Maximum entropy modeling of short sequence motifs with applications to rna splicing signals. J Comput Biol, 11(2-3):377–94.

Yu, Y., Maroney, P. A., Denker, J. A., Zhang, X. H. F., Dybkov, O., Luhrmann, R., Jankowsky, E., Chasin, L. A., and Nilsen, T. W., 2008. Dynamic regulation of alternative splicing by silencers that modulate 5 ’ splice site competition. Cell, 135(7):1224–1236. Zhang, G. J., Guo, G. W., Hu, X. D., Zhang, Y., Li, Q. Y., Li, R. Q., Zhuang, R. H.,

Lu, Z. K., He, Z. Q., Fang, X. D., et al., 2010. Deep rna sequencing at single base- pair resolution reveals high complexity of the rice transcriptome. Genome Research, 20(5):646–654.

Zhang, Y., Lameijer, E. W., t Hoen, P. A., Ning, Z., Slagboom, P. E., and Ye, K., 2012. Passion: a pattern growth algorithm-based pipeline for splice junction detection in paired- end rna-seq data. Bioinformatics, 28(4):479–86.

Zhu, X. and Goldberg, A. B., 2009. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1.

Related documents