High-throughput sequencing and small non-coding RNAs

Dissertation

accepted by the Faculty of Mathematics and Computer Science of Leipzig University

for the attainment of the academic degree

Doctor rerum naturalium

(Dr. rer. nat.)

in the field of

Computer Science

submitted

by Diplom-Ingenieur (FH) David Langenberger, born on 28 November 1981 in Forchheim

Acceptance of the dissertation was recommended by:

1. Professor Dr. Peter F. Stadler, Institute of Computer Science, Leipzig University

2. Akad. Oberrätin Dr. Kay Nieselt, Department of Computer Science, University of Tübingen

The academic degree is conferred upon passing

Abstract

Before the discovery of microRNAs in 2001 it was widely believed that most of the human genome, namely everything not coding for proteins, is just ’junk DNA’. Since then, these tiny RNAs have been shown to be functional, regulating the expression of thousands of genes. RNAs are no longer seen as mere carriers of genetic information, but as performers of important regulatory tasks within the cell.

A typical microRNA is processed from a primary pol II transcript, which is cut by the Drosha enzyme, resulting in a characteristic hairpin 60-120 nucleotides in length. This precursor microRNA is then transported by the protein Exportin-5 to the cytoplasm, where the hairpin is processed by the enzyme Dicer into a double-stranded RNA about 22nt in length with a 2nt 3’-overhang. This microRNA duplex is then incorporated into a protein complex named RISC, where one strand is selected (miR), while the other one is degraded (miR*). The microRNA performs post-transcriptional gene regulation by binding perfectly or imperfectly to cis-regulatory target sites in the 3’ UTR of messenger RNAs. It is predicted that around one third of all human genes are regulated by one or more microRNAs. But microRNAs are not alone: when measuring the amount of RNA molecules within a cell, only 1-5% are protein-coding RNAs, while the rest are non-protein-coding RNAs (ncRNAs), of which less than 10% are microRNAs. A recently published article of the ’ENCODE project’ highlights the functionality of these RNA molecules: its authors assigned biochemical functions to ∼80% of the human genome, while only 1.5% codes for proteins.

In this thesis I investigate the processing mechanisms of these short ncRNAs by using data generated by the current method of high-throughput sequencing (HTS). The recently adapted short RNA-seq protocol allows the sequencing of RNA fragments of microRNA-like length (∼18-28nt). Thus, after mapping the data back to a reference genome, it is possible to not only measure, but also visualize the expression of all ncRNAs that are processed to fragments of this specific length. For microRNAs a typical pattern of two distinct stacks of reads, representing the miR and miR* sequences, can be observed.

produced from human microRNA precursors. These additional RNAs are generated from sequences immediately adjacent to mature miR and miR* loci. Like mature miRNAs, they are ∼22nt long, developmentally regulated, and appear to be produced by RNase III-like processing from the precursor miRNA hairpin. This observation prompted me to specifically search for analogous patterns in human small RNA sequencing libraries. To simplify the search, we developed the blockbuster tool, which automatically recognizes blocks of reads in order to detect specific expression patterns. Using blockbuster, blocks from moRNAs were detected directly next to the miR or miR* blocks and could thus easily be registered in an automated way. Further analysis showed that the expression levels of moRNAs are unrelated to those of the associated microRNAs. We could also show that their microRNA precursors are typically evolutionarily old.

When further investigating the short RNA-seq data I realized that not only microRNAs give rise to short, ∼22nt long RNA pieces, but also almost all other classes of ncRNAs, like tRNAs, snoRNAs, snRNAs, rRNAs, Y-RNAs, or vault RNAs. Only for some types, like snoRNAs or microRNAs, was it already known that they undergo specific maturation processes that lead to the production of shorter RNAs. The read patterns that arise after mapping these RNAs back to a reference genome seem to reflect the processing of each class and are thus specific for the RNA transcripts from which they are derived. I explored the potential of these patterns for the classification and identification of non-coding RNAs. Using a random forest classifier trained on a set of characteristic features of the individual ncRNA classes, it was possible to distinguish three types of ncRNAs, namely microRNAs, tRNAs, and snoRNAs. With positive predictive values (PPV) and recall rates of ∼0.8 for all three classes, the classifier performed well, and I used it to predict new ncRNA candidates. Another finding of this analysis is the direct connection between the read patterns and the predicted secondary structure of the RNAs. The pairing probabilities of bases covered by HTS reads are significantly increased, indicating that properly paired nucleotides are necessary for processing.

To make the classification available to the research community, we developed a free web service for studying short read data from small RNA-seq experiments. This web server is called DARIO and it provides a wide range of analysis features, including quality control, read normalization, ncRNA quantification, and prediction of putative ncRNA candidates using the random forest classifier. The web site supports six species: human, rhesus monkey, mouse, fruit fly, worm, and zebrafish. After file upload, a single job typically takes between 5 and 30 minutes, and the results are summarized on a single web page containing job details, quality control measures and figures, ncRNA quantification and classification.

The classification has shown that read patterns are specific for different classes of ncRNAs. To make use of this feature, we developed the tool deepBlockAlign. deepBlockAlign introduces a two-step approach to align read patterns with the aim of quickly identifying RNAs that share similar processing footprints. Overlapping mapped reads are first merged into blocks using the previously developed tool blockbuster, and closely spaced blocks are then combined into block groups, each representing a locus of expression. In order to compare block groups, the constituent blocks are first compared using a modified sequence alignment algorithm to determine similarity scores for pairs of blocks. In the second stage, block patterns are compared by means of a modified Sankoff algorithm that takes into account both block similarities and similarities of the patterns of distances within the block groups. Hierarchical clustering of block groups clearly separates most miRNAs and tRNAs, and also identifies about a dozen tRNAs that cluster together with miRNAs. Most of these putative Dicer-processed tRNAs, including eight cases reported in the literature to generate products with miRNA-like features, exhibit read blocks distinguished by precise read start positions.

It has already been shown that Dicer is not only involved in microRNA biogenesis. It also appears to be involved in the processing of other small RNA species beyond canonical microRNAs. In order to find possible exceptions to the well-known microRNA maturation by Dicer and to identify additional substrates for Dicer processing, I re-evaluated the small RNA sequencing data of a Dicer knockdown experiment in MCF-7 cells. While the prominent non-Dicer mir-451 was not sufficiently expressed in these experiments, there were several additional Dicer-independent microRNAs, among them the important tumor suppressor mir-663a. I recovered previously described examples of non-miRNA Dicer substrates such as tRNA-Gln and several snoRNAs. Interestingly, snoRNA-derived RNAs from box C/D snoRNAs are Dicer-independent, while those from box H/ACA snoRNAs are often Dicer-dependent. Several pol-III transcripts, in particular the vault RNAs and the great-ape-specific snaRs, are processed by Dicer, while the small RNAs originating from Y RNAs seemed to be Dicer-independent.

It is known that many aspects of the RNA maturation leave traces in RNA sequencing data in the form of mismatches from the reference genome. I was able to recover many well-known modified sites in tRNAs, providing evidence that modified nucleotides are a pervasive phenomenon in these data sets. Furthermore, I checked if non-encoded CCA tails, which are post-transcriptionally added to tRNAs, can be seen in short RNA-seq data. Surprisingly, they can be found in a diverse collection of transcripts, including sub-populations of mature microRNAs.

Acknowledgments

First of all, I thank my supervisors Peter F. Stadler and Steve Hoffmann for their continuous support and extremely helpful discussions.

Furthermore, I thank all current and previous members of the bioinformatics group at the University of Leipzig for their scientific advice and support. Special thanks go to Anke Busch, Stephanie Keller-Schmidt, Mario Fasold, Lydia Hopp, Alexander Donath, Jörg Lehmann, Markus Riester, Dominic Rose, and Gero Doose.

Additionally, I would like to say thank you to all those who mostly remain unmentioned in science: family and friends who indirectly contributed to this work by, for example, reading and reviewing manuscripts, discussing different topics related to my work or putting their trust in me. Here I want to emphasize Elisabeth, who supported me in every step and believed in me.

This PhD thesis was conducted within the framework of the Leipzig Research Centre for Civilization Diseases (LIFE). In this context, I would like to thank the European Social Fund (ESF) of the European Union and the Free State of Saxony for funding three full years of my PhD studies. Further, I want to thank my former university ’Fachhochschule Weihenstephan’ (University of Applied Sciences Weihenstephan) and especially Prof. Dr. Frank Leßke for supporting me and my goal of doing the PhD.

This thesis is based on the following publications:

Hertel J, Langenberger D, Stadler PF (2013).

Computational prediction of microRNA genes. RNA sequence, structure and function: computational and bioinformatic methods, in press.

Langenberger D, Çakir MV, Hoffmann S, Stadler PF (2012).

Dicer-Processed Small RNAs: Rules and Exceptions. Journal of Experimental Zoology Part B: Molecular and Developmental Evolution 16:1-12.

Richter J*, Schlesner M*, Hoffmann S*, Kreuz M*, Leich E, Burkhardt B*, Rosolowski M, Ammerpohl O, Wagener R, Bernhart SH, Lenze D, Szczepanowski M, Paulsen M, Lipinski S, Russell RB, Adam-Klages S, Apic G, Claviez A, Hasenclever D, Hovestadt V, Hornig N, Korbel JO, Kube D, Langenberger D, Lawerenz C, Lisfeld J, Meyer K, Picelli S, Pischimarov J, Radlwimmer B, Rausch T, Rohde M, Schilhabel M, Scholtysik R, Spang R, Trautmann H, Zenz T, Borkhardt A, Drexler HG, Möller P, Macleod RA, Pott C, Schreiber S, Trümper L, Löffler M, Stadler PF, Lichter P, Eils R, Küppers R, Hummel M, Klapper W, Rosenstiel P, Rosenwald A, Brors B, Siebert R. (2012).

Recurrent mutation of the ID3 gene in Burkitt lymphoma identified by integrated genome, exome and transcriptome sequencing. Nature genetics 11;44(12):1316-1320. * authors contributed equally

Langenberger D*, Pundhir S*, Ekstrom CT, Stadler PF, Hoffmann S, Gorodkin J (2012). deepBlockAlign: A tool for aligning RNA-seq profiles of read block patterns. Bioinformatics 1;28(1):17-24. * authors contributed equally

Langenberger D, Bartschat S, Hertel J, Hoffmann S, Tafer H, and Stadler PF (2011). MicroRNA or not microRNA? Lecture Notes in Computer Science. Springer pp. 1-9.

Fasold M*, Langenberger D*, Binder H, Stadler PF, Hoffmann S (2011).

DARIO: A ncRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Res. 39(Web Server issue):W112-7. * authors contributed equally

Findeiss S, Langenberger D, Stadler PF, Hoffmann S (2011).

Traces of Post-Transcriptional RNA Modifications in Deep Sequencing Data. Biol Chem. 392(4):305-313.

JJ, Evans C, Flicek P, Florea L, Folkerts O, Groenen MAM, Harkins TT, Herrero J, Hoffmann S, Megens HJ, Jiang A, de Jong P, Kaiser P, Kim H, Kim KW, Kim A, Langenberger D, Lee MK, Lee T, Mane S, Marcais S, Marz M, McElroy AP, Modise T, Nefedov M, Notredame C, Paton IR, Payne WS, Pertea G, Prickett D, Puiu D, Qioa D, Raineri E, Salzberg SL, Schatz MC, Scheuring C, Schmidt CJ, Schroeder S, Smith EJ, Smith J, Sonstegard TS, Stadler PF, Tafer H, Tu Z, Van Tassell CP, Vilella AJ, Williams K, Yorke JA, Zhang L, Zhang H, Zhang X, Zhang Y, and Reed KM (2010).

Multi-platform next generation sequencing of the domestic turkey (Meleagris gallopavo): Genome assembly and analysis. PLoS Biology. 8,9: e1000475.

Langenberger D, Bermudez-Santana C, Stadler PF, Hoffmann S (2010).

Identification and Classification of Small RNAs in Transcriptome Sequence Data. Pac Symp Biocomput. 2010:80-7.

Langenberger D, Bermudez-Santana C, Hertel J, Hoffmann S, Khaitovitch P, Stadler PF (2009).

Evidence for Human microRNA-Offset RNAs in Small RNA Sequencing Data. Bioinformatics. 25(18):2298-230.

Hackenberg M, Sturm M, Falcon J, Langenberger D, Aransay A (2009).

miRanalyzer: A microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Res. 37(Web Server issue):W68-76.

CHAPTER

1 Introduction

Contents

1.1 About this work

1.2 Genome, transcriptome and proteome

1.3 Non-coding RNAs

1.3.1 Different types of non-coding RNAs

1.3.2 RNA interference and the microRNA pathway

1.4 Sequencing methods

1.4.1 A short history about sequencing

1.4.2 454 pyrosequencing

1.4.3 Illumina / Solexa

1.4.4 Short RNA-seq

1.5 Sequencing data

1.5.1 Data format

1.5.2 Short read mapping

1.5.3 Error sources

1.6 Short RNA-seq and microRNAs

1.6.1 The microRNA pattern

1.6.2 MicroRNA gene prediction using structure and read patterns

1.6.3 MicroRNA-like processing products from other ncRNAs

1.7 Computational methods

1.7.1 Short read mapping: segemehl

For almost a decade now, non-protein-coding RNAs (ncRNAs) have been known not only to act as important adapter molecules within the cell, like the amino-acid-transporting transfer RNAs (tRNAs), but also to interact directly with the protein construction (expression) apparatus in order to regulate the production of a large number of proteins. Thus, these regulatory interactions add a new layer of complexity. It was quickly realized that errors in this new network can lead to major mis-regulation and thus disease. The race to find unknown ncRNAs in the human genome started, and the number of new predictions of known classes and even new ncRNA classes increased monthly. Nowadays, the new technique of high-throughput sequencing (HTS) allows the measurement of huge amounts of RNA and provides a way to validate and quantify known and newly predicted ncRNAs. HTS has also successfully been used to predict microRNA genes by developing a new sequencing protocol that specifically measures molecules of microRNA-like length (short RNA-seq).

1.1 About this work

In this work, I take a deeper look into data from short RNA-seq experiments. These datasets only contain RNAs that were processed within the cell into smaller pieces, which are thought to be functional. I realized that these short RNAs form specific patterns when mapped back to a reference genome and that these patterns can be used to classify different types of ncRNAs. I designed algorithms to cluster these piles of mapped molecules, assign them to ncRNA classes, align them and classify unknown ones. The classification algorithm was also made available to the research community through an easy-to-use web server. Furthermore, I took a deeper look into a widely used database of known microRNAs and discovered that it contains several false annotations. Follow-up studies showed that some ncRNA classes are hard to distinguish, since they seem to be processed by the same mechanism and therefore end up with similar patterns. This well-known behavior was then double-checked using another analysis in which this mechanism was switched off in vivo. Overall, in this work, I used the new method of HTS to study and understand the processing of ncRNAs, and I used these data to predict new ncRNA candidates.
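To make the idea of "piles" of mapped reads concrete, the following minimal sketch merges overlapping read intervals into blocks and then joins closely spaced blocks into groups. It is a naive interval-merging illustration only and not the algorithm of the tools developed in this thesis; the function names, the half-open coordinate convention and the `max_gap` value of 30 are assumptions chosen for the example.

```python
from typing import List, Tuple

Read = Tuple[int, int]          # (start, end), 0-based, end exclusive (assumed convention)
Block = Tuple[int, int, int]    # (start, end, number of reads in the block)


def merge_into_blocks(reads: List[Read]) -> List[Block]:
    """Merge overlapping reads on one strand of one chromosome into blocks."""
    blocks: List[Block] = []
    for start, end in sorted(reads):
        if blocks and start < blocks[-1][1]:
            # The read overlaps the current block: extend it and count the read.
            b_start, b_end, count = blocks[-1]
            blocks[-1] = (b_start, max(b_end, end), count + 1)
        else:
            blocks.append((start, end, 1))
    return blocks


def group_blocks(blocks: List[Block], max_gap: int = 30) -> List[List[Block]]:
    """Join consecutive blocks separated by at most max_gap nucleotides (hypothetical threshold)."""
    groups: List[List[Block]] = []
    for block in blocks:
        if groups and block[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(block)
        else:
            groups.append([block])
    return groups


# Toy data: two nearby read stacks (as expected for miR and miR*) and one distant read.
reads = [(100, 122), (101, 123), (100, 122), (140, 162), (141, 162), (500, 520)]
blocks = merge_into_blocks(reads)
groups = group_blocks(blocks)
print(blocks)
print(groups)
```

In this toy input the two neighbouring stacks end up in one group, which mirrors the two-stack pattern described for microRNAs, while the isolated read forms its own group.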

1.2 Genome, transcriptome and proteome

The human body is built up of around 10^13 cells. Every single cell contains a genome storing the identical genetic information, which is specific for each human being. This information is stored using DNA (deoxyribonucleic acid). It is encoded by chaining monomeric molecules called nucleotides into polymeric macromolecules. The alphabet of DNA has four characters: adenine (A), guanine (G), thymine (T) and cytosine (C). By putting these four nucleotides together, like characters in a book, nature found a way to write down the blueprints for all the tools that cells need to survive. When thinking of the genome as an encyclopedia, every single entry stores the information on how one specific tool has to be built. These entries are called genes, and the tools are molecules with important functions like enzymes, the workhorses within the cells.

[Figure 1.1, panels: a) human being, human cell, nuclear genome, mitochondrial genome; b) central dogma of molecular biology: DNA (genome), transcription, RNA (transcriptome), translation, protein (proteome); c) pre-mRNA (5’UTR, exons, introns, 3’UTR), spliceosome, mRNA (5’UTR, CDS, 3’UTR)]

Figure 1.1: The human genome and the production of proteins. a) The human genome is situated in each cell of a human being. The nuclear genome is divided into 24 chromosomes, while the mitochondrial genome is stored as a circular molecule. b) The central dogma of molecular biology describes how genetic information is expected to be processed. DNA is transcribed to RNA molecules, which are translated to proteins. c) After transcription of DNA to pre-mRNA, the spliceosome cuts out the non-protein-coding introns. Pictures redrawn from (Brown, 2006).

The human genome is divided into two distinct parts, the nuclear genome and the mitochondrial genome. The nuclear genome consists of 3,137,144,693 nucleotides (GRCh37), separated into 24 chromosomes, which hold information for 20,110 protein-coding genes (Gencode V12 May 2012 freeze). The mitochondrial genome is a circular molecule, 15,000-17,000 nucleotides in length, and contains just 13 protein-coding genes. Overall, the genetic code stored in two human beings is highly similar (∼99.9% identical, Clinton (2000)), and only very small differences in the genetic code of the proteins lead to differences like the color of the hair, the eyes, or the skin.

The genome itself is not able to release the stored information to the cell. For the so-called expression of genes, several enzymes and proteins are needed.

One of the first main actors is the RNA polymerase enzyme, which precisely finds the entry for a needed tool within the huge genome and generates a copy of only this entry. The copied information consists of RNA (ribonucleic acid) molecules. In contrast to DNA, the alphabet of RNA has a uracil (U) instead of a thymine (T). The copying process is known as transcription, and the copy of a protein-coding gene is called messenger RNA (mRNA). In the human genome, the encoded and transcribed genes consist of three main parts: protein-coding exons, non-protein-coding introns and untranslated regions (UTRs) at the 5’ and the 3’ ends (see Figure 1.1c). The UTRs are important, since they contain functional sequences for further processing and regulatory motifs. The introns are not needed for protein production and are therefore recognized and removed by a complex within the nucleus, the spliceosome (see Figure 1.1c). This process is called splicing and results in an intron-free mature mRNA sequence. The sum of all transcribed RNA polymers is known as the transcriptome of a cell.
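The two steps described above, transcription and splicing, can be illustrated with a small sketch. It assumes that the gene is given as the coding (sense) strand, so that the RNA copy is obtained simply by replacing T with U, and that the intron coordinates are already known; the toy gene, the function names and the 0-based half-open intervals are assumptions for illustration only.

```python
from typing import List, Tuple


def transcribe(coding_strand: str) -> str:
    """Return the RNA copy of a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna: str, introns: List[Tuple[int, int]]) -> str:
    """Remove the given intron intervals (0-based, end exclusive) and join the remaining parts."""
    kept = []
    position = 0
    for start, end in sorted(introns):
        kept.append(pre_mrna[position:start])  # exon or UTR sequence before this intron
        position = end                         # continue right after the intron
    kept.append(pre_mrna[position:])
    return "".join(kept)


# Toy gene: exon 1, one intron, exon 2 (sequences are invented for the example).
gene = "ATGGCC" + "GTAAGTTTTCAG" + "TTTTAA"
pre_mrna = transcribe(gene)
mrna = splice(pre_mrna, [(6, 18)])
print(pre_mrna)
print(mrna)
```

The sketch ignores everything that makes real splicing hard, in particular the recognition of splice sites, which is taken as given here.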

In the next step, the information encoded in an mRNA has to be used to build proteins. The synthesis of proteins from single RNA molecules is called translation. The copies of the genes, the mRNAs, are found by ribosomes, which read the text and assemble the proteins from amino acids, the building blocks of proteins. The entirety of all proteins is called the proteome. The proteins are used to build up the cell and perform important tasks for its function.
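As a rough illustration of how a ribosome "reads the text", the sketch below translates an mRNA with the standard genetic code. It is a simplification under stated assumptions: translation starts at the first AUG found in the sequence and stops at the first in-frame stop codon, and the function name and the example sequence are invented for this purpose.

```python
from itertools import product

# Standard genetic code in the conventional U, C, A, G codon order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon (one-letter amino acids)."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy mRNA: 5' UTR (GG), coding sequence (AUG GCC UUU UAA), 3' UTR (GG).
print(translate("GGAUGGCCUUUUAAGG"))
```

The untranslated regions before the start codon and after the stop codon do not contribute to the protein, which matches the role of the UTRs described in the text.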

The explained process of the expression of a gene with the transcription of DNA to RNA and the translation to proteins is known as the ’central dogma of molecular biology’ (Crick, 1958; Crick et al., 1970) (see Figure 1.1b).

It is quite obvious that cells from different tissues, like cells from the heart and cells from the skin, must behave differently and be built differently. These adapted functions are achieved by adjusting the expression of proteins. Skin cells, for example, have to be more robust, and thus produce more proteins that stabilize the cell structure. To regulate the expression of genes and thus end up with the set of proteins needed for a specific function, the transcription of genes can be turned on and off (Latchman, 2005). It was long thought that this regulation is mainly handled by two regulatory layers: 1) Epigenetic modifications control the readability of genes by preventing the polymerase from binding the DNA (Khavari et al., 2010) (see Figure 1.3a). 2) Transcription factors, a special kind of protein, bind next to the start of a gene and activate or inhibit the transcription of this locus (Latchman, 1997; Karin et al., 1990) (see Figure 1.3b).

Even though different compositions of proteins can build up hundreds of different types of cells (Levine and Tjian, 2003; Buchler et al., 2003), only ∼1.5 % of the human genome codes for them. It was long believed that the rest of the genome is just ’junk DNA’ (Ohno, 1972; Comings, 1972) without any functionality.

There are several widely discussed hypotheses as to why there should be such an amount of useless DNA in our genome, such as protection against mutations (Yunis et al., 1971) or the evolutionary accumulation of dysfunctional genes (Brosius and Gould, 1992). But some striking observations conflict with the assumption that these parts are really useless and thus ’junk’. First of all, there are some well-known molecules that are functional in their RNA state, like transfer RNAs (tRNAs) or ribosomal RNAs (rRNAs). The genes that code for these RNAs are transcribed, but not translated to proteins (Ladner et al., 1975; Kim et al., 1973; Yusupov et al., 2001). Furthermore, when measuring the amount of RNA molecules within a cell, only 1-5% are mRNAs and thus protein-coding RNAs (Maniatis, 1989). The rest consists of 80-85% rRNAs and 10-15% other small RNAs (tRNAs, microRNAs, etc.), which do not result in proteins. A recently published article of the ’ENCODE project’ highlights the functionality of these RNA molecules. The researchers systematically analyzed transcribed regions and assigned biochemical functions to ∼80 % of the genome (Khatun, 2012). This discovery of the ’ENCODE project’ is a logical consequence of several observations made years before. It was realized that a huge number of transcribed RNA pieces show regulatory functionality, and a new subgroup of RNAs, the non-(protein)-coding RNAs (ncRNAs), was born. Since then, the transcriptome of a cell has been divided into two parts, the protein-coding RNAs and the ncRNAs (see Figure 1.2). While the protein-coding fraction follows the ’central dogma of molecular biology’, the ncRNAs are functional in their RNA state. These functional RNA molecules have several essential roles within the cell.

[Figure 1.2: the transcriptome splits into protein-coding RNA (yielding protein) and non-coding RNA (miRNA, snoRNA, snRNA, rRNA, tRNA)]

Figure 1.2: The transcriptome of the human cell can be divided into two fractions, protein-coding RNAs and ncRNAs. Picture redrawn from (Brown, 2006).

[Figure 1.3, panels: a) histone, histone tail, epigenetic factor; b) transcription factor, polymerase, mRNA; c) microRNA, ribosome, protein]

Figure 1.3: Different layers of expression regulation in human. There are three important layers of regulation: a) Different methylation states impact the DNA accessibility for transcription. b) Transcription factors bind the DNA and influence polymerase activity. c) microRNAs bind to the 3’ UTR of the mRNA and regulate its translation to proteins.

1.3 Non-coding RNAs

Non-coding RNAs have been known for quite some time. The first known ncRNAs were tRNAs (transfer RNAs) and rRNAs (ribosomal RNAs). These types of RNAs have well-known and important functions within human cells. Since the beginning of this century, however, several new ncRNAs have been found and analyzed. Short RNA molecules, like miRNAs (microRNAs), piRNAs (PIWI-interacting RNAs), siRNAs (short interfering RNAs), snoRNAs (small nucleolar RNAs), or snRNAs (small nuclear RNAs), show regulatory functionality. This army of tiny regulators changed the prevailing picture of the regulation of gene expression by transcription factors and added a new layer of complexity (see Figure 1.3c). To date, the regulatory pathways are not fully understood, and the underlying networks of reciprocal influence seem to be almost unpredictable.

1.3.1 Different types of non-coding RNAs

Different classes of non-coding RNAs are distinguished by their functions, which directly depend on molecular properties like the length of the molecule, the composition of their sequences, and their secondary structures. ncRNAs tend to fold into completely different secondary structures, partly because they need this structure to be functional, and partly because of downstream processing into shorter RNA pieces. Probably the best-known example of a class that needs structure to be functional is the transfer RNAs (tRNAs). tRNAs fold into their typical cloverleaf structure, which extrudes and thus presents the anticodon needed to bind to the correct position on the mRNA. Another task of the secondary structure is to direct the processing of longer RNA molecules into shorter, functional RNAs. Here, the class of microRNAs is probably the most famous. Their typical hairpin structure is recognized by enzymes, which cut and mature the microRNAs. In recent years, the family of short and long non-coding RNAs has grown fast, and researchers are still working on fully understanding their exact functionality. Some of the most important ncRNAs are the following (as described in Brown (2006)):

Ribosomal RNAs (rRNAs) are the RNA components of ribosomes. The ribosomes construct proteins, using the mRNA as a template. rRNAs directly interact with tRNAs during the translation process. rRNAs are the most abundant ncRNAs in a cell: around 80% of a cell’s RNA consists of them.

Transfer RNAs (tRNAs) are involved in protein synthesis. They carry the amino acids to the ribosomes, which assemble them to polymeric molecules, the proteins.

Small nuclear RNAs (snRNAs) are small RNAs found in the nucleus of a cell. They are involved in the splicing process, in which the introns of a primary transcript are removed, resulting in the mRNA.

Small nucleolar RNAs (snoRNAs) direct enzymes that perform modifications to specific nucleotides of other RNAs. There are two classes of snoRNAs. C/D box snoRNAs, which are linked to methylation and H/ACA box snoRNAs, which are linked to pseudouridylation. This class of small RNAs is also known as guide RNAs, since it guides other enzymes to specific locations.

MicroRNAs (miRNAs) are tiny RNAs that regulate gene expression by binding to mRNAs and repressing their translation. MicroRNAs attract a great deal of attention, since they fine-tune the expression of thousands of genes. Furthermore, by using their endogenous pathway, researchers found a way to turn off (knock down) specific genes in living organisms. The discovery of this method, called RNA interference, was awarded the Nobel Prize in 2006.

1.3.2 RNA interference and the microRNA pathway

RNA interference (RNAi) was first observed in 1990, when Jorgensen and his group tried to enrich flower pigmentation by overexpressing chalcone synthase and ended up with reduced pigmentation (Napoli et al., 1990; Liu and Paroo, 2010). The effect was poorly understood at the time, and it was not known that antisense RNA was the cause. Years later, the groups of Andrew Fire and Craig Mello systematically highlighted the involvement of double-stranded RNA (dsRNA) (Fire et al., 1998). By injecting short dsRNA, which was homologous to the mRNA of a gene called unc-22, they were able to significantly repress its expression, resulting in a changed phenotype. The repression using the dsRNA was much stronger than with the sense or antisense molecules alone. They named this dsRNA-induced silencing method RNA interference (RNAi) and the exogenous dsRNAs small interfering RNAs (siRNAs). Since then, this methodology has been constantly improved and refined. Nowadays, it is used to perform high-throughput RNAi screens and to assign phenotypes to specific genes by ’knocking them down’, and it has also found its way into therapeutic applications. Finally, in 2006, Fire and Mello received the Nobel Prize for their work on RNAi.

Until Fire and Mello highlighted the potential of these short RNAs in 1998, only long mRNA molecules were in focus. Researchers used gel electrophoresis to select these longer transcripts, so that the smaller RNA fragments were completely overlooked. Once the functionality of the short RNA molecules was known, the search for short, expression-regulating RNAs started, and recent years have brought a profound change in our understanding of the regulation of gene expression. Small non-coding RNAs especially came into focus as it became clear that they are key players in many cellular processes, post-transcriptionally regulating gene expression via degradation, translational repression, or both (Kim and Nam, 2006; Lagos-Quintana et al., 2001).

The most prominent class of small ncRNAs is the microRNAs (miRNAs). They are endogenously encoded in many animal and plant genomes (Bartel, 2004; Griffiths-Jones, 2006) and are now recognized as one of the major regulatory gene families in eukaryotic cells. They are believed to regulate the expression of around one third of all genes in the human genome (Lewis et al., 2005) and are involved in many fundamental processes like metabolism, development and regulation of the nervous and immune systems (Ouellet et al., 2006; Bagasra and Prilliman, 2004). Furthermore, it has been reported that some microRNAs are actively involved in the development of pathologies like cancer (Lu et al., 2005).

The microRNA pathway (see Figure 1.4) is probably one of the most recently discovered and best understood processing pathways. MicroRNAs were first discovered in 1993 by the groups of Ruvkun (Wightman et al., 1993) and Ambros (Lee et al., 1993). They found a small RNA (lin-4) that, when expressed, negatively regulated the production of the LIN-14 protein in C. elegans. The lin-14 gene encodes a protein whose activity is required for specifying the division timing of specific cells during postembryonic development (Ruvkun and Giusto, 1989). Since lin-4 is only produced in the first larval stage, it temporally decreases the production of LIN-14 and thus controls the developmental-stage timing of this worm. No homologs of this first microRNA were found in other species in further studies. Only in the year 2000 was another microRNA observed. The discovery of let-7 (Reinhart et al., 2000) changed the picture of microRNAs, since it is highly conserved between species. Even in human, several homologs were found, showing the immense importance of this small piece of RNA. A new class of small RNAs was born, regulating protein production by complementary RNA-RNA binding to mRNA molecules. At first these short RNAs were named small temporal RNAs (stRNAs) (Pasquinelli et al., 2000), but after several other candidates with similar functions had been found, they were grouped together and named microRNAs (Lagos-Quintana et al., 2001). Nowadays, several thousand of these small regulatory microRNAs have been identified, building up a huge regulatory network that not only controls developmental timing but influences almost all cellular processes.

MicroRNAs can be encoded in the genome as independent units, being transcribed by RNA polymerase II, or they can occur in introns, being transcribed together with their host genes and then spliced out by the spliceosome (see Figure 1.1c). The latter are called mirtrons, and it is thought that ∼40 % of all known microRNAs lie in the introns of protein- or non-protein-coding genes (Rodriguez et al., 2004). The transcribed RNA sequence is called the primary microRNA (pri-miRNA) and directly folds into a stem-loop (hairpin) structure, which is typical for microRNAs.

There are also microRNA clusters in the genome, containing up to six microRNA genes, which are regulated and transcribed together using a common promoter (Altuvia et al., 2005; Lee et al., 2004; Cullen, 2004). Pri-miRNAs encoding microRNA clusters can be several hundred nucleotides long and fold into several stem-loop structures, with each microRNA hairpin being flanked by a region long enough for efficient downstream processing.

The secondary stem-loop structure of each microRNA gene is then recognized by a protein named ‘DiGeorge Syndrome Critical Region 8’ (DGCR8). DGCR8 is bound to Drosha, an RNase III enzyme that cuts RNA, forming the ‘Microprocessor complex’ (Gregory et al., 2006). Drosha cuts out the hairpins, yielding precursor microRNAs (pre-miRNAs). One exception are the mirtrons, which bypass the processing by Drosha, since the spliced-out intron automatically folds into a valid pre-miRNA. The pre-miRNAs are around 70nt in length and have a two-nucleotide overhang at their 3’ end. They are then exported to the cytoplasm by a protein called Exportin-5, which uses the two-nucleotide overhang left by Drosha as a docking station (Murchison and Hannon, 2004).

In the cytoplasm, the hairpins are further processed by an RNase III enzyme named Dicer, which interacts with the 3’ end of the hairpin and cuts off the loop structure (Lund and Dahlberg, 2006). Like Drosha, Dicer cuts with a two-nucleotide 3’ overhang, resulting in an imperfect double-stranded RNA molecule of around 22-24nt in length.

This double-stranded RNA is then recognized by the RNA induced silencing complex (RISC) (McManus et al., 2002), which takes one strand and incorporates it (guide microRNA, or miR), while the other strand is degraded (passenger microRNA, or miR*). The loaded RISC uses the miR sequence to find and bind to complementary regions in the mRNA sequence. In human, Argonaute proteins within the RISC can then, depending on how perfect the binding is, either cleave the transcript or recruit additional proteins to repress its translation (Pratt and MacRae, 2009). If the binding is perfect, the mRNA is directly cleaved by the Argonaute protein Ago2 and degraded (Kawasaki and Taira, 2004). Imperfect bindings result in prevention of translation (Lim et al., 2005) and occur mostly in the 3’UTR of the mRNAs. One mRNA can be targeted by several microRNAs at a time, and the downregulation of protein production seems to correlate with the number of microRNA target sites (Rajewsky, 2006; Krek et al., 2005).

[Figure 1.4, diagram labels: nucleus (RNA Pol II/III, pri-microRNA, Drosha, pre-microRNA, Exportin-5); cytoplasm (DICER, mature microRNA, exogenous siRNA, RISC, mRNA cleavage, translational repression at the 3’UTR)]

Figure 1.4: The RNA interference pathway in human. Endogenous microRNAs are encoded in the genome and transcribed as individual units by polymerase II, or they occur in introns, transcribed together with host genes and spliced out. These primary miRNAs fold into typical hairpin structures, which are recognized by the RNase enzyme Drosha and cut out. The resulting precursor miRNA uses Exportin-5 to be transported to the cytoplasm, where it is recognized by another RNase (Dicer), which cuts the loop, releasing a double-stranded mature miR-miR* miRNA. The latter is loaded into the RNA induced silencing complex and the miR sequence is used to bind to complementary mRNA regions, while the miR* sequence is degraded. mRNAs with miRNA target sites can be, depending on the binding, cleaved and degraded, or post-transcriptionally regulated. mRNAs targeted by several miRNAs show stronger regulatory effects.

Since the sequence composition of these tiny RNAs is of great importance for their binding, small differences (e.g. mutations) in the functional molecule can change the expression of hundreds of targeted genes. Thus, the individual sequences of the mature microRNAs have to be deciphered. Until some years ago, this task was achieved by performing a size fractionation followed by Sanger sequencing (see Figure 1.5). Since this size range contains not only microRNAs but also degradation products of all kinds of different, longer RNAs, it was like fishing in muddy waters and thus a very expensive challenge. The newly developed method of high-throughput sequencing (HTS) allows the measurement of millions of these RNA snippets in a short time and at low cost. Morin and colleagues (Morin et al., 2008) have shown that experimentally measured miRNA molecules show variations with respect to their genomically encoded sequences. They called these variants isomiRs and defined four different types: the 5’ end or the 3’ end of the microRNA is elongated or shortened (5’ trimming and 3’ trimming), there are additional nucleotides at the 3’ end (3’ nucleotide addition), or nucleotides of the precursor are post-transcriptionally changed (nucleotide substitution). In the following I will describe the idea of HTS to show how it can be used to make such findings. I will explain the two most widely used sequencing technologies, then briefly explain a protocol which ensures that only molecules of microRNA-like length are sequenced, and finally go into more detail and show how these sequences are used for microRNA prediction.

1.4 Sequencing methods

Determining the order of the nucleotide bases A, C, G, and T is known as sequencing. New methods, known as high-throughput sequencing, have made it feasible to contemplate sequencing the genomes of hundreds, if not thousands, of species of agronomic, evolutionary, and ecological importance, as well as biomedical interest (Haussler et al., 2009; Dalloul et al., 2010). The main idea behind these methods is to shear long DNA sequences into short pieces and read out the nucleotides in a parallelized manner. This trick speeds up the sequencing process and thus makes it feasible.

In the following section, I will summarize two different approaches to sequencing DNA. I will try to draw the reader's attention to some important characteristics of the different methods, highlighting error sources that need to be handled in downstream analysis. The high-throughput sequencing methods and preparation protocols are the intellectual property of the respective companies, so wordings similar to their explanations are unavoidable.


1.4.1 A short history about sequencing

Around one hundred years ago the DNA molecule was discovered, and it was soon realized that it is the molecule of heredity. In 1953 Watson and Crick announced the double helical structure of the DNA and thereby set the stage for almost everything that has taken place in biomedical research since then. This structure showed researchers that the DNA molecule is a prerequisite for complex biological life. Right after that, a series of studies tried to answer basic questions of how this information is used to create the building blocks of cells, leading to the central dogma of molecular biology: biological information is transferred from DNA to RNA, and then to proteins. We now know that it is much more complicated than that, but at that time this was a fundamentally new finding.

In the late 1970s Frederick Sanger came up with the idea of using dideoxy nucleotides to sequence DNA (see Figure 1.5). This revolutionary method made it possible to commercialize DNA sequencing. From the early 1980s to the late 1990s, the method was improved considerably. The ‘Human Genome Project’, launched in 1990, drove further improvements. In the end, sequencing the first human genome took around 13 years and cost around $300 million. Interestingly, only one percent of the genome had been sequenced after 5 years, highlighting the progress in optimizing the sequencing methodology. Around the year 2000, several companies invented machines that sequence DNA fully automatically. Several such machines stood in a few institutes around the world and produced large amounts of DNA sequence data, improving the first reference genome. But the introduction of the so-called Next Generation Sequencing machines in 2005 changed the world of sequencing. These machines are able to sequence millions of sequences in parallel. The two primary devices here are the ‘Roche-454 pyrosequencing machine’ and the ‘Illumina Genome Analyzer’. Both companies frequently improve their machines, leading to more sequenced nucleotides, shorter sequencing times, and thus lower costs.

[Figure 1.5, diagram labels: template DNA and DNA polymerase with normal nucleotides (dATP, dCTP, dGTP, dTTP) and labeled dideoxy nucleotides (ddATP, ddCTP, ddGTP, ddTTP); terminated fragments of different length (e.g. AGGG, AGGGT, AGGGTT, AGGGTTA, AGGGTTAAC) sorted from short to long fragments in a capillary tube]

Figure 1.5: Sanger sequencing with fluorescent markers. A DNA polymerase copies a template DNA by adding a mixture of normal nucleotides and fluorescently labeled dideoxy nucleotides to the growing chain. When a dideoxy nucleotide is added, the synthesis stops, since there is no 3’-OH, which is needed for further incorporation of nucleotides. This method results in fragments of different length. In a final step, the fragments are separated by their length using a gel-filled capillary, and the different light signals at each position are read out, resulting in the template's sequence. Picture redrawn from (Scott, 2004).

1.4.2 454 pyrosequencing

The 454 pyrosequencing method was the first high-throughput sequencing method ready for the market (Margulies et al., 2005). Sequencing starts with the preparation of a library. This step is shared by almost all methods and is highly similar in its design. In the beginning the DNA of interest is randomly fragmented into shorter pieces, specific adapters are ligated to both ends, and the double strand is opened. These fragments are then immobilized on a solid surface and amplified. The amplification step is very important since, in the downstream sequencing process, nucleotides are incorporated into a growing strand, emitting light signals. In order to intensify these light signals and measure them correctly, hundreds of duplicates are needed.

In the 454 pyrosequencing method (see Figure 1.6), the DNA fragments are bound to solid beads covered with sequences complementary to the adapters on the fragments. By using significantly fewer DNA molecules than beads, it is statistically ensured that no more than one fragment binds to a single bead. The beads are then dispersed in a water-in-oil emulsion. This way, each bead is covered by an oil bubble, creating a sealed environment for DNA amplification. This method ensures that only clones from a unique fragment will be amplified and attached to the bead. The oil bubble is filled with all reagents needed for the polymerase chain reaction (PCR) (Bartlett and Stirling, 2003) cycling steps, ending up with hundreds of identical copies of the original fragment. Some cleaning steps are performed, freeing the beads from the oil. For the sequencing, the beads are placed on a picotiter plate. This glass structure has tiny holes (wells), just big enough for one single bead. The plate is put into the sequencing machine. The top of the picotiter plate allows enzymes to be loaded into the wells by simply flowing the reagents over it. The bottom of the plate is made of optically clear glass that sits right on top of a high-density CCD camera, which records the flashes of almost a million sequencing reactions as they occur. The four different nucleotides (A, C, G, and T) are sequentially washed over the plate. A DNA polymerase incorporates matching nucleotides, releasing a pyrophosphate moiety, which goes through a series of downstream reactions that are catalyzed by the enzymes on the beads; the output is light, recorded by the CCD camera. After each cycle, the unused nucleotides are washed away, ensuring that the signal of the next cycle is triggered by the correct nucleotide. These steps are repeated hundreds of times.

[Figure 1.6, diagram labels: fragmentation, adapter ligation, immobilization, amplification, sequencing (primer, polymerase, PPi, APS, sulfurylase, ATP, luciferin, luciferase, light)]

The first four nucleotides of each fragment on the beads form the string ’TCAG’, which is called the ’key-sequence’. The sequencing of this string is important, since it returns the signal of a single nucleotide incorporation and is used for calibration. The occurrence of several nucleotides of the same type in a row, known as homopolymers, is a major problem, since the incorporation does not stop after each nucleotide. Thus, homopolymers result in more pyrophosphate and thus a stronger light signal. The intensity of the light is the only way to get information about the length of the homopolymer and rather complicated signal processing steps, using the information of the key-sequence, are necessary. The main advantage of the pyrosequencing method is the length that can be sequenced. With several hundreds of nucleotides, it is very useful for e.g. whole genome assemblies. The sequenced fragments are called reads and stored in a machine readable manner.
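The role of the key sequence and of the signal intensities can be illustrated with a minimal sketch. It assumes a cyclic flow order `TACG` and invented raw intensities; both are illustrative assumptions rather than details taken from the instrument, and real flowgram processing is considerably more involved than rounding.

```python
# Minimal sketch: calibrate a flowgram with the key sequence and decode it.
# FLOW_ORDER and the raw intensities below are assumptions for illustration.
FLOW_ORDER = "TACG"
KEY = "TCAG"


def key_flow_indices(key, flow_order):
    """Return the flow indices at which the key bases are incorporated."""
    indices, flow = [], 0
    for base in key:
        # Skip flows whose nucleotide does not match the next key base.
        while flow_order[flow % len(flow_order)] != base:
            flow += 1
        indices.append(flow)
        flow += 1
    return indices


def decode_flowgram(raw, key=KEY, flow_order=FLOW_ORDER):
    """Normalize raw intensities by the key signal and call homopolymer lengths."""
    key_flows = key_flow_indices(key, flow_order)
    # The key contains no homopolymers, so each key flow is a single incorporation.
    unit = sum(raw[i] for i in key_flows) / len(key_flows)
    bases = []
    for i, intensity in enumerate(raw):
        length = int(round(intensity / unit))  # estimated homopolymer length
        bases.append(flow_order[i % len(flow_order)] * length)
    return "".join(bases)


# Hypothetical raw intensities: eight flows for the key, then TTT and GG.
raw_signal = [100, 4, 98, 3, 5, 104, 2, 98, 295, 6, 0, 201]
decoded = decode_flowgram(raw_signal)
```

The longer a homopolymer gets, the closer neighbouring length estimates lie relative to the noise, which is why this simple rounding step becomes unreliable for long runs of identical nucleotides.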

The latest machine using this method is the 454 FLX+ machine. While the read length of around 700nt is advantageous, the main drawback of the 454 machines is the relatively small number of parallelized processes, resulting in a small throughput. With a throughput of 900 megabases, several runs are needed to get sufficient coverage for sequencing a complete human genome.

[Figure 1.7, diagram labels: fragmentation, adapter ligation, immobilization, bridge amplification, cluster generation, sequencing (laser readout of the incorporated bases, e.g. ACGA...)]

1.4.3 Illumina / Solexa

Like in the 454 pyrosequencing method, the input DNA for the Illumina method (see Figure 1.7) is fragmented and adapter sequences are ligated to the ends of the fragments. The adapter-ligated DNA is size fractionated on a gel, selecting fragments between 150 and 200 bases in length. This size fractionation step is important, since the length sequenced in parallel is fixed and shorter fragments would decrease the number of sequenced nucleotides. Thus, size fractionation ensures an optimal throughput.

The adapter-ligated, size-fractionated fraction of DNA sequences is called a library and is used for sequencing. In the next step, the library is washed over a glass flow cell, which is decorated with adapter sequences. DNA oligonucleotides whose adapters are reverse complementary to the adapter sequences on the flow cell are in this way immobilized on the surface. A low concentration of fragments in the solution washed over the flow cell ensures that the fragments bind scattered across the flow cell at a distance from each other. In a process called bridge amplification, the DNA molecules bend over and encounter a complementary second-end primer on the surface. A DNA polymerase creates multiple copies in one place, which results in a collection of millions of copies of the same fragment, called a cluster. The reverse strands are washed away, leaving a cluster in which all fragments are bound at the same end. These clusters explain why the initial fragments have to bind at a distance from each other. If the solution contains too many fragments, the cluster density gets too high and it is hard to distinguish the signals; if the density is too low, the throughput of the run decreases. Like in the pyrosequencing method, the amplification step is needed to multiply the fragments in order to get a stronger signal. The sequencing chemistry of Illumina sequencing is fundamentally different from that of 454. All four nucleotides are supplied at each sequencing step. Each nucleotide has its own unique fluorophore attached, which reports a specific wavelength when scanned by a laser. This way it is possible to obtain the identity of the nucleotide from its specific color. It is possible to add all four nucleotides at once because the bases carry a chemical block at their 3’ ends, where normally a hydroxyl group is available for the next base incorporation.

This block does not allow further incorporation until the detection and deblocking steps of the sequencing have been completed. At the detection step, a laser scans the flow cell, stimulating the fluorophore on the incorporated bases, resulting in the release of light, which is recorded by a sensitive camera. This way, all incorporated nucleotides of all clusters in a specific round are measured and stored. Then the fluorescent group is cleaved off and the chemical block is removed, getting the flow cell ready for the next round. This process is repeated several times, resulting in the complete sequence of all clusters fixed on the flow cell. This method is called dye-terminator technology (Erlich and Higuchi, 1994) and was patented in 2004.

Just to name some numbers: One Illumina flow cell consists of eight lanes, storing around six billion read clusters and is thus able to sequence 600 billion bases in one run. The whole run can be done in around eleven days. Thus, using this technique, it is possible to sequence six complete human genomes with a 30x coverage.
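The numbers above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text plus a genome size of 3 · 10^9 base pairs, as used later in this chapter; the implied read length is derived, not stated in the text.

```python
# Back-of-the-envelope check of the quoted Illumina throughput figures.
clusters_per_run = 6_000_000_000      # around six billion read clusters
bases_per_run = 600_000_000_000       # around 600 billion bases per run
genome_size = 3_000_000_000           # human genome, around 3 * 10^9 bp
target_coverage = 30                  # desired 30x coverage

# Average number of bases read per cluster implied by the two figures.
implied_read_length = bases_per_run // clusters_per_run

# Number of complete genomes that fit into one run at the target coverage.
genomes_per_run = bases_per_run // (genome_size * target_coverage)
```

Six genomes at 30x coverage require 540 billion bases, which leaves a margin of roughly ten percent of the run.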

1.4.4 Short RNA-seq

The machines explained above need DNA as input, which technically restricts the sequencing to the genome. But by using cDNA, which is DNA synthesized by reverse transcription using the input RNA as template, researchers found a way to also sequence RNA. The first application of this method was the sequencing of mRNAs in a cell. But, because microRNAs are substantial regulators, a special protocol to sequence the mature microRNAs was developed. The ∼24nt long RNA pieces regulate mRNAs by binding to their 3’ UTRs, and thus their specific sequence is of high interest. Since these short regulators are smaller than the sequenced length, no fragmentation is needed. In the short RNA sequencing (short RNA-seq) protocol, the RNA of a cell is isolated and size fractionated using a gel. Only those bands of the gel that include short RNAs (18-30 nt) are cut out and used as library for the sequencing (see Figure 1.8). In this way, all precursor microRNAs, tRNAs, rRNAs, etc. are discarded, since they are too long. The short mature microRNA molecules pass this gel filter and are sequenced using a special Illumina protocol that sequences pieces up to 35 nt in length, speeding up the sequencing process and lowering the sequencing costs. One important note here is that the main fraction of the short RNA molecules is shorter than the 35 nt sequenced by the machine. Thus, parts of the ligated adapter sequences at the 3’ end of the fragments are also sequenced (see Figure 1.9), necessitating the subsequent clipping of these adapters. This clipping step recovers the sequence of the original molecule and is performed computationally in the downstream bioinformatics analysis of the data.
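The clipping step can be sketched as follows. This is a minimal illustration that only handles error-free adapters; the adapter string, the example insert, and the `min_overlap` parameter are assumptions chosen for the example, and production clipping tools additionally tolerate sequencing errors within the adapter.

```python
# Minimal sketch of 3' adapter clipping for reads of fixed length.
# ADAPTER is an illustrative string, not necessarily the adapter of a real kit.
ADAPTER = "TCGTATGCCGTCTTCTGCTTGT"


def clip_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Remove a complete or partial 3' adapter from a read (exact matches only)."""
    # Case 1: the complete adapter is contained in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends within the adapter; try the longest overlap first.
    for overlap in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read


# Hypothetical 22 nt insert followed by the first 13 nt of the adapter (35 nt read).
insert = "TGAGGTAGTAGGTTGTATAGTT"
read = insert + ADAPTER[:13]
clipped = clip_adapter(read)
```

The `min_overlap` threshold is a trade-off: a small value clips short adapter remnants but also removes genuine read ends that happen to match the first adapter bases by chance.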

[Figure 1.8, diagram labels: gel lanes 1 to 5 with a size scale of 10, 20, 30, 100, and 1,000 bp; labeled regions: rRNAs, lnRNAs, etc.; tRNAs, snoRNAs, pre-miRNAs, etc.; mature miRNAs, other processing and degradation products, etc.]

Figure 1.8: Size fractionation using a gel. By cutting out the respective length region, it is possible to select specific kinds of ncRNA classes of interest. For a short RNA-seq experiment, the region for the mature microRNAs is extracted (∼18-30nt in length).

1.5 Sequencing data

This section is based on a book chapter written by Steve Hoffmann which explains the basic output formats of high-throughput sequencing and the different approaches to map the identified molecule sequences back to a reference genome (Hoffmann et al., 2011). I modified and shortened some parts and added new ones to emphasize the application to short RNA-seq data, instead of the much longer DNA data originally used in the book. Since short RNA fragments add new and different problems to the bioinformatics analysis, it is of high importance to go into detail about the format and the mapping.

The bioinformatics tasks start with the process of converting the electromagnetic signals into the correct nucleotides, named base calling. Base calling software is shipped together with the sequencing machine, but several other tools with optimized results are also available. Here I will not go into detail about the different base callers, since only the company-provided base callers were used. The customized file formats, the complicated mapping procedure, and the different sources of errors, nevertheless, have to be explained.


[Figure 1.9, diagram labels: Ligate adapters; Attach molecule to flow cell; Perform amplification; Generate clusters; Sequencing (35 nt sequenced, reaching into the adapter); sequenced reads]

Figure 1.9: When performing short RNA-seq experiments using the Illumina technique, most of the sequenced reads contain parts of the adapter at their 3’ end. After size fractionation and cDNA synthesis, 5’ and 3’ adapters are ligated to the ∼18-30nt long double-stranded short RNA pieces. These sequences are then immobilized on the flow cell's surface and the Illumina machine performs 35 sequencing cycles. For all sequences that are shorter than 35nt in length, the last sequenced fraction consists of a non-cellular adapter sequence (highlighted in red), which has to be clipped in order to recover the original read.

1.5.1 Data format

The base calling methods assign a quality value to each nucleotide they determine. These numbers reflect the estimated probability of a base being wrong. The output format of the 454 pyrosequencing machine differs from the format of the Illumina sequencer. The 454 machine returns a binary SFF file (Standard Flowgram Format), which can be exported to two multiple FASTA files. One contains the obtained sequences and the other stores the nucleotide-wise quality values. The data coming out of the Illumina sequencing machines are stored in a FASTQ file (Quality FASTA). This modified FASTA file contains not only the sequence information, but also the quality values. The exact calculation of the quality values is still a trade secret of the developers.

In the multiple FASTA format each sequence has two entries. The first one is the header line, which starts with the symbol ">", followed by the identifier and further sequence information. All following lines, without the header label at the first position, hold the sequence. In an output file of a 454 run, the header contains four columns: a unique identifier, the read length, the coordinates of the bead on the picotiter plate, and the date of the run.


>FW8YT1Q01B9VMY length=24 xy=0815_2008 region=1 run=R_XXXX
TGAGCCAGTGACACAGTGACACAG

The header lines of the second file are identical, allowing an identification of associated sequence-quality value pairs. The number of quality values below the header line has to be identical to the length of the sequence in the first file, since every single nucleotide got one quality value assigned.

>FW8YT1Q01B9VMY length=24 xy=0815_2008 region=1 run=R_XXXX
37 39 39 39 39 39 37 38 38 38 39 39 39 39 39 39 39 39 39 39 35 30 28 29
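A minimal sketch of how the two 454 files can be read and paired by their identical header lines is shown below. The function names are hypothetical, and the two example records are the ones shown above; real files contain many records and may wrap sequences over several lines, which the parser also accepts.

```python
# Minimal sketch: pair sequences and quality values from the two 454 FASTA files.
FASTA_TEXT = """>FW8YT1Q01B9VMY length=24 xy=0815_2008 region=1 run=R_XXXX
TGAGCCAGTGACACAGTGACACAG
"""
QUAL_TEXT = """>FW8YT1Q01B9VMY length=24 xy=0815_2008 region=1 run=R_XXXX
37 39 39 39 39 39 37 38 38 38 39 39 39 39 39 39 39 39 39 39 35 30 28 29
"""


def parse_multi_fasta(text, joiner=""):
    """Return a dict mapping each header (without '>') to its joined body lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: joiner.join(parts) for h, parts in records.items()}


def pair_reads(fasta_text, qual_text):
    """Combine sequence and quality values; keys are the unique read identifiers."""
    sequences = parse_multi_fasta(fasta_text, joiner="")
    qualities = parse_multi_fasta(qual_text, joiner=" ")
    paired = {}
    for header, sequence in sequences.items():
        values = [int(v) for v in qualities[header].split()]
        # Every nucleotide must have exactly one quality value.
        if len(values) != len(sequence):
            raise ValueError(f"length mismatch for read {header.split()[0]}")
        paired[header.split()[0]] = (sequence, values)
    return paired


reads = pair_reads(FASTA_TEXT, QUAL_TEXT)
```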

The FASTQ format used by Illumina is in FASTA style, but consists of four, instead of two, entries for each sequence.

@HWI-EAS244:6:1:5:927
TTTGGTGCCAGTGATCTTGATCTG
+HWI-EAS244:6:1:5:927
CCCCCCCCCCCC@=ACCB@@C

The header of the FASTQ format starts with an "@", followed by several pieces of information delimited by colons. The stored information is the unique instrument name (HWI-EAS244), the flow cell lane (6), the tile number within the flow cell (1), and the x/y-coordinate of the cluster (5:927). The second entry holds the sequence itself. The third entry starts with the marker "+" and provides space for further information. In most experiments this entry is empty (no text after the "+", but the marker itself has to be present), or it contains the same information as the header field. In the last entry, the quality values are stored. Note that the values are encoded as ASCII characters, allowing the usage of one character per nucleotide. Compared to the 454 way of storing these numbers, where two-digit numbers (e.g. 39) have to be delimited by a space, the ASCII encoding saves a great amount of storage space.
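The four-entry structure can be parsed with a short sketch like the following. The function name and the dictionary keys are hypothetical, and the quality string of the example record is invented so that it has one character per nucleotide.

```python
# Minimal sketch of a FASTQ parser for four-line records with the
# colon-delimited header layout described above.
EXAMPLE_FASTQ = """@HWI-EAS244:6:1:5:927
TTTGGTGCCAGTGATCTTGATCTG
+HWI-EAS244:6:1:5:927
CCCCCCCCCCCCCCCCCCCCCCCC
"""  # the quality line is hypothetical (24 characters for 24 nucleotides)


def parse_fastq(text):
    """Parse FASTQ text into a list of dictionaries, one per read."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must consist of four entries each")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("one quality character per nucleotide is required")
        instrument, lane, tile, x, y = header[1:].split(":")
        records.append({
            "instrument": instrument,
            "lane": int(lane),
            "tile": int(tile),
            "x": int(x),
            "y": int(y),
            "sequence": sequence,
            "quality": quality,
        })
    return records


record = parse_fastq(EXAMPLE_FASTQ)[0]
```

Reading records in fixed groups of four lines avoids misinterpreting a quality line that happens to start with "@", which is a valid quality character.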

The stored quality values are in Phred format. It was developed during the Human Genome Project and is given by

Q = −10 · log₁₀p

where p is the probability that the given nucleotide was called incorrectly. It has to be mentioned that the ranges of the quality values have been changed and are still subject to changes. For the Phred score range from 0-62, the ASCII characters from 64 to 126 are used, while for the range from 0-93, the ASCII characters from 33 to 126 are used.
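The relation between the ASCII characters, the Phred scores, and the error probabilities can be made concrete with a small sketch. The function names are hypothetical; the offsets 33 and 64 correspond to the two ASCII ranges mentioned above.

```python
# Minimal sketch: decode ASCII-encoded Phred scores and convert them
# to error probabilities via p = 10^(-Q/10).

def decode_quality(quality_string, offset=33):
    """Translate one ASCII character per nucleotide into a Phred score."""
    return [ord(char) - offset for char in quality_string]


def phred_to_error_probability(q):
    """Invert Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


# The same score of 40 is written as 'I' with offset 33 and as 'h' with offset 64.
score_offset_33 = decode_quality("I", offset=33)[0]
score_offset_64 = decode_quality("h", offset=64)[0]
error_probability = phred_to_error_probability(score_offset_33)
```

Because the two encodings overlap in their character ranges, the offset has to be known (or inferred from the smallest character observed in a file) before the quality values can be interpreted.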

1.5.2 Short read mapping

The short length of the reads obtained by short RNA-seq experiments complicates the search for their locus of origin. To illustrate the problem, a very simple calculation can be made.


The human genome consists of around $3 \cdot 10^{9}$ base pairs with an alphabet of four letters (A, C, G, and T), one for each of the nucleotides adenine, cytosine, guanine, and thymine. Assuming that each nucleotide occurs at each position with probability $p = 0.25$ and that the nucleotides are distributed uniformly at random (which is of course not correct for the human genome, but is assumed here for simplicity), one can easily calculate the minimum length of a DNA sequence expected to occur just once in the human genome. Using the formula

$E = p^{k} \cdot n$

one can calculate the expected number of occurrences $E$ of a sequence of length $k$ within a genome of length $n$. Solving for the sequence length $k$ yields the following equation:

$k = \log_{p}(E/n)$

When setting $p = 0.25$, $n = 3 \cdot 10^{9}$, and the expected number of occurrences $E = 1$, we obtain a length $k \approx 16$. That means that a sequence of at least 16 nucleotides is needed to statistically ensure that there will not be a second hit in the human genome just by chance. But it has to be clarified that the human genome is not random and the nucleotides are not uniformly distributed. Based on experience, we are able to uniquely locate sequences as short as 15 nucleotides, but a length of at least ∼ 20 base pairs (bp) is preferred. Another issue is that the sequencing protocols also introduce errors into the sequences, complicating the discovery of their correct position within the reference genome.
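The calculation can be reproduced with a few lines of code. This is a direct transcription of the two formulas under the same simplifying assumption of a uniform random genome; the function names are hypothetical.

```python
# Minimal sketch of the uniqueness calculation for a random genome.
import math


def expected_occurrences(k, n, p=0.25):
    """E = p^k * n: expected number of chance hits of a length-k sequence."""
    return p ** k * n


def sequence_length_for_expectation(n, expectation=1.0, p=0.25):
    """k = log_p(E / n): sequence length at which E chance hits are expected."""
    return math.log(expectation / n, p)


genome_length = 3e9
k_exact = sequence_length_for_expectation(genome_length)
k_min = math.ceil(k_exact)  # smallest integer length with fewer than one expected hit
```

With these values `k_exact` lies between 15 and 16, so a 15-mer is still expected to occur more than once by chance, while a 16-mer is expected to occur less than once.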

Locating the locus of origin is called mapping and, based on the problems explained above, is one of the major challenges when working with high-throughput sequencing data. The large size of the genomes, the huge number of short sequences (millions), and the relatively high rate of errors (see 1.5.3) result in the need for sophisticated mapping algorithms. In a standard short RNA-seq experiment, several million molecules are measured. Assuming that an algorithm needs one second to find the mapping position of one sequence (i.e. to find a ∼ 20bp long sequence in a reference sequence of length $3 \cdot 10^{9}$) and we have one million of these, we would end up with a running time of around 11.5 days, which is very inefficient.

As explained above, the errors in the sequences make it difficult to find the correct position of origin. There can be three different types of errors in a sequenced read: 1) The sequencing machine called a wrong nucleotide, e.g. it measured an adenine instead of a guanine. The resulting mapping at this position is called a mismatch, since the nucleotide in the reference does not match the sequenced one. 2) The sequencing machine called one extra nucleotide, e.g. two adenines instead of one, resulting in an insertion. 3) The machine skipped one nucleotide, resulting in a missing base and thus a deletion.
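The three error types correspond to the three operations of the classical edit distance, which can be sketched as follows. This is a generic textbook dynamic program shown for illustration only, not the scoring scheme of any particular mapping tool.

```python
# Minimal sketch: edit distance between a read and a reference segment,
# counting mismatches, insertions, and deletions with unit cost.

def edit_distance(read, reference):
    """Return the minimal number of mismatches, insertions, and deletions."""
    # previous[j]: distance between the read prefix processed so far
    # and the first j reference characters.
    previous = list(range(len(reference) + 1))
    for i, read_base in enumerate(read, start=1):
        current = [i]
        for j, ref_base in enumerate(reference, start=1):
            current.append(min(
                previous[j - 1] + (read_base != ref_base),  # match or mismatch
                previous[j] + 1,      # extra base in the read (insertion)
                current[j - 1] + 1,   # base missing in the read (deletion)
            ))
        previous = current
    return previous[-1]
```

For example, the read `AGGT` against the reference `ACGT` contains one mismatch, `ACAGT` contains one insertion, and `ACT` contains one deletion; each has an edit distance of one.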


Another problem of the mapping procedure is the fact that some reads may map to multiple regions within the genome. Especially for ncRNAs this behavior is well known, since e.g. tRNAs or microRNAs occur in multiple copies in one genome, resulting in equally good alignments at all these loci. Thus, algorithms that reliably return all possible sites are mandatory when working with short RNA-seq data.
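The multiple mapping problem can be illustrated with a naive exact search that reports every locus instead of stopping at the first hit. The toy reference below is hypothetical and contains the same read sequence twice.

```python
# Minimal sketch: report all exact occurrences of a read in a reference.

def find_all_loci(reference, read):
    """Return the 0-based start positions of all exact matches of the read."""
    loci = []
    position = reference.find(read)
    while position != -1:
        loci.append(position)
        # Continue one position further to also catch overlapping matches.
        position = reference.find(read, position + 1)
    return loci


read = "TGAGGTAG"
# Hypothetical toy reference containing two copies of the read.
reference = "GG" + read + "CCTT" + read + "AA"
loci = find_all_loci(reference, read)
```

A mapper that returns only one of the two loci would silently assign all reads of this sequence to a single gene copy.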

There are three different types of modern mapping algorithms: one uses hash tables (HT), while the others use enhanced suffix arrays (ESA) or the Burrows-Wheeler transform (BWT). Table 1.1 summarizes the most popular mapping tools.

One of the first mapping algorithms was MAQ (Li et al., 2008a). MAQ starts by creating an indexed hash table holding only the first 28bp of each read (the seed). Each seed is stored in a way that all reads with up to 2 mismatches can be found. Then, in the mapping procedure, every time MAQ finds a hit of the seed, it extends the locus within the reference genome to discover and score the complete locus. Using the MAQ algorithm, it is not possible to find hits with insertions and deletions (indels), and multiple, equally good hits are not reported.
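The seed-and-extend idea behind hash-based mapping can be sketched as follows. This is a strongly simplified illustration and not MAQ's actual implementation: the seed has to match exactly (MAQ tolerates up to 2 mismatches in the seed), the seed length of 8 and all names are assumptions made for the toy example, and no alignment scoring is performed.

```python
# Minimal sketch of seed-and-extend mapping with a hash table of read seeds.
from collections import defaultdict


def build_seed_index(reads, seed_length):
    """Hash the first seed_length bases of every read to the read's index."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= seed_length:
            index[read[:seed_length]].append(read_id)
    return index


def map_reads(reference, reads, seed_length=8, max_mismatches=2):
    """Scan the reference, look up seed hits, and extend them without indels."""
    index = build_seed_index(reads, seed_length)
    hits = []
    for position in range(len(reference) - seed_length + 1):
        seed = reference[position:position + seed_length]
        for read_id in index.get(seed, ()):
            read = reads[read_id]
            window = reference[position:position + len(read)]
            if len(window) < len(read):
                continue  # the read would run past the end of the reference
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((read_id, position, mismatches))
    return hits


# Hypothetical toy data: the second read differs from the reference at one position.
toy_reference = "TTACGTACGGATCCATGCAAGG"
toy_reads = ["ACGTACGGATCC", "ACGTACGGTTCC"]
toy_hits = map_reads(toy_reference, toy_reads)
```

Because the extension compares the read and the reference position by position, an insertion or deletion shifts all following bases and shows up as a long run of mismatches, which is why this scheme cannot report indels.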

BWA (Li and Durbin, 2009), Bowtie2 (Langmead and Salzberg, 2012), and SOAP2 (Li et al., 2009b) are based on the Burrows-Wheeler transform (Burrows and Wheeler, 1994). In a first indexing step, the reference genome is transformed using the BWT. The transformation permutes the characters in such a way that substrings occurring multiple times in the input lead to runs of identical characters in the transformed string. This way, the sequence can be compressed better. As a simple example, the text “^ANNA$” can be taken. First, all rotations of the text are sorted in alphabetical order, then the last column is stored, ending up with the transformed text “^NNA$A” (for a more detailed description see Burrows and Wheeler, 1994). In the mapping step, the backward search algorithm (Ferragina and Manzini, 2000) is used. Using two arrays, it can directly access the compressed BWT and simulate a fast traversal of the prefixes of the reference for the sequence of interest. Since it does not need to load the complete transformation into memory, it is fast and has a low memory footprint. Nevertheless, to retrieve inexact matches, a time-consuming enumeration of all possible mismatches is needed. Even though tools using the BWT method are very fast in finding exact matches, the speed decreases significantly when searching for loci with >2 errors, or when finding all multiple, equally good hits of one read.
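The transformation and the backward search can be illustrated with a short Python sketch. It is a naive teaching version, not the implementation used by BWA, Bowtie2 or SOAP2. To reproduce the transformed text of the ANNA example, the sketch assumes that the two marker characters sort after all letters, with the start marker before the end marker; this ordering, the plain ASCII markers `^` and `$`, and all function names are assumptions of this example.

```python
def sort_key(s):
    """Assumed ordering: letters < '^' < '$' (markers sort last)."""
    rank = {"^": 1, "$": 2}
    return [(rank.get(c, 0), c) for c in s]


def bwt(text):
    """Sort all rotations of the text and return the last column."""
    rotations = sorted((text[i:] + text[:i] for i in range(len(text))),
                       key=sort_key)
    return "".join(rotation[-1] for rotation in rotations)


def backward_search(bwt_text, pattern):
    """Count exact occurrences of pattern using only the transformed text.

    C[c] is the row of the first rotation starting with c, and occ(c, i)
    counts c in bwt_text[:i]. Real tools precompute and sample these two
    tables; here occ is recomputed naively on every call.
    """
    first_column = sorted(bwt_text, key=sort_key)
    C = {}
    for row, c in enumerate(first_column):
        C.setdefault(c, row)

    def occ(c, i):
        return bwt_text.count(c, 0, i)

    lo, hi = 0, len(bwt_text)  # half-open interval of matching rows
    for c in reversed(pattern):  # the pattern is processed back to front
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("^ANNA$")
print(transformed)
print(backward_search(transformed, "NA"))
print(backward_search(transformed, "N"))
```

The search narrows an interval of rows in the sorted rotation matrix by one pattern character per step, so the cost of an exact search depends on the pattern length rather than on the length of the reference. The sketch only counts occurrences; reporting their positions in the reference would additionally require (a sample of) the suffix array.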

The tool segemehl (Hoffmann et al., 2011) is based on enhanced suffix arrays (Abouelhoda et al., 2004). A suffix array (Manber and Myers, 1993; Frakes and Baeza-Yates, 1992) is a simple data structure that stores the information of a suffix tree as a sorted list of all suffixes. A suffix tree implicitly represents all substrings of a given string. A string S consisting of n characters results in a suffix tree with ≤ n paths from the root to its leaves (edge labels can be compressed). Each leaf holds one suffix. ESAs combine the benefits of suffix trees with those of suffix arrays. While, just like in the suffix tree, an exact search of a string requires