7 Microarray projects - Heterogeneous data analysis for annotation of microRNAs and novel genom

Microarrays were the main technology for measuring transcriptome composition a few years ago. We also participated in two microarray analysis projects which have not been described in this thesis. The first project concerned the interpretation of microarray time series data of the Streptomyces coelicolor ssgC mutant. In this project, the transcriptomes of the wildtype and the ssgC mutant were measured at 9 different time points over their life span, and the main goal was to look for genes that are expressed differently in the mutant compared to the wildtype. The second project was the interpretation of zebrafish embryogenesis microarray data using functional and anatomical annotation. The aim was to study the spatiotemporal patterns of developmentally regulated genes during zebrafish embryogenesis. In both research projects, the bottleneck we experienced was data normalization, a sophisticated process that removes bias and noise within and between arrays. Choosing different statistical models or methods for normalization led to different candidates, which had a great impact on the downstream analysis. The lesson we learned from these two microarray projects is that bioinformaticians and biostatisticians should be involved beyond the data analysis. They should be involved at the stage of experimental design as well, since the experimental design directly determines how the data should be analyzed later on. A weak and messy experimental design will not lead to very significant results.

Currently, a new technology for transcriptome analysis is RNA-Seq. It has been applied in the carp genome project described in Chapter 5. Compared to microarrays, RNA-Seq can measure the dynamic transcriptome without prior knowledge of the genome sequence, and it offers a much wider detection range, base-level resolution and the ability to detect previously unknown transcripts. In addition, the advantages for data analysis are that the data are digital and do not require sophisticated normalization.

The price of RNA-Seq has dropped dramatically recently. Although currently it is still more expensive than microarrays, in a few years, it will be possible to have 1000 dollar

Annotation is a broad topic and will be one of the main research themes for biology in the future. In this thesis, we demonstrated how we use bioinformatics, and integration in particular, to annotate miRNAs and a novel genome. Although this is just a small part of annotation, we have shown that bioinformatics can guide wet experiments by providing the candidates for validation. By incorporating integration in the workflow, the efficiency and accuracy of bioinformatics predictions can be further improved. Currently, high-throughput experiments, multiple platforms and different species as model systems are very commonly used in life science studies. Heterogeneous data integration is therefore without doubt a trend in the analysis of biological data.

the use of bioinformatics is indispensable as the large volumes of data only obtain added value through thorough computational analysis.

Genes are the building blocks for the machinery of cells. Genes are regions of DNA that can be transcribed to messenger RNA and subsequently translated to proteins. The proteins are the chief actors within the cell. Some of the RNA molecules are controlled by small RNA molecules called microRNAs. These recently discovered microRNAs are very short RNAs that are also transcribed from DNA sequences. However, instead of being translated to protein, these short RNAs bind to messenger RNAs and in this manner inhibit the expression of their targets.

The complete set of hereditary material of an organism is referred to as the genome. In order to understand the genome we need to be able to label all functional parts. This labeling is referred to as annotation and this is typically the domain of bioinformatics. In this thesis annotation is achieved, in particular, through the use of heterogeneous data integration. The analysis focuses on annotation of genes and molecular structures that control the expression of genes, the microRNAs. The heterogeneous aspect refers to the integration of multiple resources within the analysis so that one can reason efficiently about the data.

The main goal of this thesis is efficient and accurate annotation of microRNAs (miRNAs) and functionally unknown DNA sequences. Gene annotation is the process of detecting the structure and biological function of raw DNA sequences. It is the most time-consuming analysis in a genome project. As for miRNA annotation, the major task currently is to identify miRNA targets, since miRNAs modify gene expression by binding to their target genes. To achieve these goals, we developed several complex workflows which integrate the currently most relevant data sources and tools.

In Chapter 2, we explained an integrative method which investigates several aspects of the relationships between miRNAs and their targets, with the final purpose of extracting high-confidence targets from the target pool. The applied techniques include statistical tests, clustering and association rules. The research comprised a case study for two miRNAs, i.e. dre-miR-10 and dre-miR-196, in which seven high-confidence target candidates were predicted, all of which belong to the hox gene family and have characteristics similar to those of already validated target genes.

In Chapter 3, we presented an approach for analyzing miRNA-miRNA relationships and subsequently utilizing these relations for target prediction in human. In support of this, a machine learning pipeline was developed to reveal the feature patterns between known miRNAs. The observed patterns were then applied to miRNAs whose targets are not yet known, to see if new targets could be predicted. Our method contributes to the improvement of target identification by predicting targets with high specificity and without constraints on evolutionary conservation.

In Chapter 4, we evaluated the performance of different target prediction algorithms and used integration methods to improve prediction accuracy. To this end, high-level integration approaches, i.e. algorithm combinations and ranking aggregation, as well as low-level integration approaches, e.g. Bayesian Network classification, were applied. All of the methods were tested on experimentally validated miRNA-target interactions and on several compiled negative control data sets. The results showed that each individual prediction algorithm has its own advantages. Moreover, among the different integration strategies, applying the Bayesian Network classifier to the features calculated from multiple prediction methods significantly improved target prediction accuracy.

In Chapter 5, we focused on the assembly and functional annotation of the carp genome. The common carp is a candidate model system that can be used for high-throughput screens of pharmaceutical compound libraries. In this chapter, we developed a genome assembly and annotation pipeline with the final aim of identifying innate immune response genes, especially Toll/Interleukin-1 receptor (TIR) domain-containing genes, using next generation sequencing data. The genome assembly pipeline consists of data cleaning, pre-assembly and assembly using CLCBio, ABySS and SOAP-denovo. A basic gene annotation pipeline was developed that combines protein-based gene model prediction with comparative annotation. The latter focuses on the prediction of orthologues with respect to the zebrafish genome.
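Contig assemblies such as the one produced by this pipeline are commonly summarized by their N50 contig length, the statistic this summary also reports for the carp assembly. The sketch below uses the standard definition (the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length). It is an illustration only, not the code used in the thesis; the function name `n50` and the toy lengths are assumptions.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Standard definition: sort contigs from longest to shortest and
    return the length of the contig at which the running sum first
    reaches at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (made up for illustration, not thesis data).
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

For the toy lengths the total is 1500 bp, so the running sum passes 750 bp at the 400 bp contig.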

In Chapter 5, in the genome annotation section, different data such as genomic DNA reads, RNA-Seq reads and motifs are integrated in a sequential fashion. Each step in the workflow adds one extra type of data that serves as a filter to screen the TIR domain-containing candidate sequences.
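The sequential integration described above can be pictured as a chain of filters, each backed by one data type. The following is a minimal sketch under stated assumptions: the record fields (`genomic_hit`, `rnaseq_reads`, `tir_motif`), the predicate names and the toy candidates are all hypothetical and are not taken from the thesis pipeline.

```python
def has_genomic_support(candidate):
    # Hypothetical field: candidate was found in the genomic assembly.
    return candidate["genomic_hit"]


def has_rnaseq_support(candidate):
    # Hypothetical field: number of RNA-Seq reads mapped to the candidate.
    return candidate["rnaseq_reads"] > 0


def has_tir_motif(candidate):
    # Hypothetical field: a TIR domain motif was detected in the sequence.
    return candidate["tir_motif"]


def sequential_filter(candidates, filters):
    """Apply each filter in order; only candidates passing all steps remain."""
    remaining = list(candidates)
    for keep in filters:
        remaining = [c for c in remaining if keep(c)]
    return remaining


# Toy candidates (made up for illustration).
toy_candidates = [
    {"id": "c1", "genomic_hit": True, "rnaseq_reads": 12, "tir_motif": True},
    {"id": "c2", "genomic_hit": True, "rnaseq_reads": 0, "tir_motif": True},
    {"id": "c3", "genomic_hit": True, "rnaseq_reads": 5, "tir_motif": False},
]
survivors = sequential_filter(
    toy_candidates,
    [has_genomic_support, has_rnaseq_support, has_tir_motif],
)
print([c["id"] for c in survivors])
```

Keeping each data type behind its own predicate makes it easy to reorder or drop a step and see how the candidate list changes.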

The purpose of using integration is to improve the sensitivity and/or specificity of the system. These two measures characterize the system performance. Sensitivity is defined as the proportion of actual positives that are correctly identified. Specificity is the proportion of actual negatives that are correctly identified. For each algorithm, it is desirable to achieve both high sensitivity and high specificity. There is, however, a trade-off between the two measures: raising sensitivity typically costs specificity, because the false positive rate increases, and vice versa.

In Chapter 2, by including a feature for the genomic distance between miRNAs and their targets, together with other enrichment information, the number of targets for dre-miR-10 and dre-miR-196 was reduced to fewer than 10 each. In Chapter 3, by using functionally similar miRNAs to predict targets for functionally unknown miRNAs, 6 new targets were predicted as target candidates for 5 of the miRNAs. Using heterogeneous data, we greatly reduced the number of candidates to a scale at which biologists can easily validate the results. In these two chapters, our aim was to improve the specificity, and the cost of our integration strategy was a slight reduction of sensitivity. Tools with a high specificity will speed up the process of finding the real targets. In Chapter 4, we integrated three target prediction methods using three integration strategies, with the aim of achieving the best performance. Performance is defined by a criterion that considers both sensitivity and specificity. In the end, we substantiated the concept that proper integration can outperform any single method. In Chapter 5, by considering both genomic and RNA sequencing data, our purpose was to maximize the probability of finding TIR-containing genes in the common carp; sensitivity was therefore the main focus in this chapter.
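The two measures and the trade-off between them can be made concrete with a small sketch. It assumes a predictor that assigns a score to each candidate miRNA-target pair and calls a pair a target when the score reaches a threshold; the function names, scores and labels below are made up for illustration and are not thesis data.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when calling 'target' for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, is_target in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_target:
            tp += 1
        elif predicted and not is_target:
            fp += 1
        elif not predicted and is_target:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    # Proportion of actual positives that are correctly identified.
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    # Proportion of actual negatives that are correctly identified.
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Toy prediction scores and true labels (True = validated target).
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_labels = [True, True, False, True, False, False]

for threshold in (0.35, 0.7):
    tp, fp, tn, fn = confusion_counts(toy_scores, toy_labels, threshold)
    print(threshold, sensitivity(tp, fn), specificity(tn, fp))
```

With these toy values, lowering the threshold from 0.7 to 0.35 raises sensitivity from 2/3 to 1 while specificity drops from 1 to 2/3, which is the trade-off described above.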

The application of the aforementioned methods promotes our understanding of miRNA regulation as well as the structure and function of the novel genes. New biological insights were gained during these studies.

Currently, the mechanism of miRNA regulation in animals is acknowledged as being sophisticated but not yet fully understood; as a result, many targets are left unidentified and many false positive targets remain. In our study, we found several interesting new features. In Chapter 2, we discovered a correlation between the genomic location of predicted target genes and miRNAs by showing that many targeted genes are physically located close to their miRNAs. Knowing that genomic distance is a relevant feature, we further found in Chapter 3 that many functionally similar miRNAs are also located in clusters. From these findings, we conclude that genomic distance plays a role in miRNA-target interaction. If two miRNAs, or one miRNA and its targets, are genomically close, the probability of co-transcription is high. The co-occurrence implies that they might have similar functions or interact with each other. By studying the features of the validated miRNA-target relationships in human, we found in Chapter 4 that some miRNAs tend to bind their targets at either the beginning or the end of 3’ UTR sequences.

Gene annotation is a time- and labor-intensive task. For non-model species without an available sequence assembly, the genome sequence needs to be established before the genome can be fully annotated. In Chapter 5, we demonstrated how we annotated a non-model species, i.e. the common carp. We generated a huge amount of genomic reads together with RNA sequencing data. In the end, a preliminary carp genome assembly was achieved with an N50 contig length of 2260 bp, and the carp genome is estimated to be about 1.23 Gbp. By comparison with zebrafish innate immune genes, we estimated that there are 39 TIR domain-containing genes and transcripts in the common carp.

To sum up, annotation is a broad topic and will be one of the main research themes for biology, and thus bioinformatics, in the future.
In this thesis, we demonstrated how we use bioinformatics, and integration in particular, to annotate miRNAs and a novel genome. Although this is just a small part of annotation, we have shown that bioinformatics can guide wet experiments by providing the candidates for validation. By incorporating integration in the workflow, the efficiency and accuracy of bioinformatics predictions can be further improved. Currently, in life sciences high-throughput studies are being incorporated in the experimental workflow, multiple platforms and different model species are

generated only deepen our insight through thorough computational analysis. Genes are the building blocks for the engine room of the cell. In fact, genes are regions in the DNA that can be transcribed into messenger RNAs (mRNA), which can subsequently be translated into proteins. The proteins are the chief actors in the cell. Some RNA molecules are controlled by small RNA-like structures called microRNAs. These recently discovered microRNAs (miRNAs) are in fact very small RNAs that, like mRNA, are transcribed from the DNA. However, instead of the normal translation into protein, these short RNA fragments bind to mRNA and in this way can prevent the protein (the target) from being read off.

The complete hereditary material of an organism is also called the genome. In order to understand the genome, we need to label all functional parts. This labeling process is called annotation, and annotation is achieved through bioinformatics. In this thesis, annotation is realized by using and integrating different, i.e. heterogeneous, sources. The emphasis lies on obtaining annotations for genes and for the molecular structures that control genes, the microRNAs. The use of heterogeneous sources increases the possibilities for reasoning about the data.

The main goal of this thesis is the efficient and accurate annotation of miRNAs and of DNA sequences for which no function is known so far. Gene annotation is the process by which structure and function are derived from "raw" DNA sequences. In a genome project, this is the most time-consuming part of the analysis. As for miRNA annotation, the most important task in current research is to determine the target RNAs of a miRNA, because a miRNA controls the expression of a gene by binding to a specific target mRNA. To carry out these different annotation tasks, several complex workflows were set up in which the currently relevant data sources as well as the analysis techniques are integrated.

Chapter 2 discusses an integration method that examines a number of aspects of the relationships between miRNAs and their target mRNAs, with the intention of selecting from the pool of possible target mRNAs those that are correct with a high degree of probability. The techniques applied include statistical tests, clustering and so-called association rules. The research described in this chapter also includes a case study for two specific miRNAs, namely dre-miR-10 and dre-miR-196. For these two miRNAs, seven candidate mRNAs were predicted with a high probability score; all predicted candidates belong to the hox gene family. The predicted candidates share characteristics with genes that have already been validated.

Chapter 3 presents a strategy for analyzing relationships between miRNAs and then indicates how these relationships can be used in the prediction of target genes in human. To support this, a pipeline was developed to reveal patterns of features that exist between known miRNAs. The patterns found were then applied to miRNAs whose target genes are not yet known, in order to make predictions about possible target genes of these miRNAs. This method contributes to the improvement of target gene identification: specificity is increased, and the constraint of evolutionary conservation does not need to be applied.

In Chapter 4, the different algorithms used to predict target genes were evaluated, and integration methods were used to increase prediction accuracy. Both high-level integration methods, i.e. combinations of algorithms and ranking aggregation techniques, and low-level methods, i.e. Bayesian Network classification, were applied. All of these methods were tested on experimentally validated miRNA-target interactions as well as on data sets compiled as negative controls. The results show how each of the prediction algorithms has its own advantages. Moreover, it turns out that the use of the Bayesian Network classification, as applied to the features

how the genome assembly was realized and how we build a pipeline for annotation of the genome, in which we focus mainly on identifying the genes responsible for the innate immune response; these are in particular the genes containing a Toll/Interleukin-1 receptor (TIR) domain. The analysis is based on so-called next generation sequencing data. The genome assembly pipeline consists of data cleaning, pre-assembly and assembly using the CLCBio, ABySS and SOAP-denovo software. A straightforward gene prediction based on protein-based gene model prediction, together with comparative annotation, forms a workable pipeline for annotating genes in this new genome. The comparative annotation makes use of the orthologous genes in the zebrafish genome.

As mentioned earlier, heterogeneous data integration is the central theme of this thesis. In Chapters 2 and 3, different features such as genomic distance, sequence similarity, free energy and concepts from the Gene Ontology are carefully combined to reach a final judgment on whether a prediction of a target RNA is right or wrong. Integration is realized through a combination of data mining techniques such as decision trees, relative subgroup discovery, a linear classifier and a quadratic classifier. In Chapter 4, the intermediate features generated by the three selected prediction methods are recorded and integrated with a Bayesian Network classifier. In Chapter 5, in the section on genome annotation, different types of data such as DNA reads, RNA-Seq reads and RNA motifs are integrated in a sequential fashion. Each step in the workflow adds a new type of data that acts as a filter in the search for TIR domain-containing candidate sequences in the carp genome.

system. These two measures characterize the performance of the system. Sensitivity is defined as the ratio of the number of positives and the number of correctly identified

In document Heterogeneous data analysis for annotation of microRNAs and novel genome assembly (Page 120-140)