Classification of raw count quantification results

3.2 Results

3.2.2 Classification of raw count quantification results

The differences between reads matched to annotated regions and reads with no match in annotated regions, computed for each of the two stranded library types, were used as classification features. Figure 5 shows that, as before, three distinct clusters form when the two variables are plotted against each other. For evaluation, 10-fold cross-validation was used on all samples, as in the previous section.

Figure 5: Strand-specific normalized differences of matched and mismatched reads on annotation regions by library types

Table 3 shows that the recall and precision of each library type versus all others are again near perfect. Only a single data point was misclassified, between the two strand-specific types. This agrees with what the visualization in Figure 5 suggests.
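The evaluation described above can be illustrated with a minimal sketch of K-nearest-neighbour classification, k-fold cross-validation and one-versus-rest recall and precision. The toy points, the choice of k = 3 and the function names are assumptions made for illustration; they are not the data or the settings behind Table 3.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(samples, folds=10, k=3):
    """Return (true, predicted) label pairs collected over all folds."""
    pairs = []
    for fold in range(folds):
        # Every folds-th sample is held out; the rest form the training set.
        test = samples[fold::folds]
        train = [s for i, s in enumerate(samples) if i % folds != fold]
        for features, label in test:
            pairs.append((label, knn_predict(train, features, k)))
    return pairs


def recall_precision(pairs, positive):
    """One-versus-rest recall and precision for the label `positive`."""
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Toy, well-separated clusters (invented values, not measured features).
centers = {
    "unstranded": (0.0, 0.0),
    "first strand": (1.0, -1.0),
    "second strand": (-1.0, 1.0),
}
samples = []
for i in range(10):
    offset = 0.01 * i
    for label, (x, y) in centers.items():
        samples.append(((x + offset, y - offset), label))

pairs = cross_validate(samples, folds=10, k=3)
for label in centers:
    print(label, recall_precision(pairs, label))
```

Because each library type is scored against all others, a single confusion between the two strand-specific types lowers the recall of one class and the precision of the other, which is the pattern visible in Table 3.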

Table 3: Recall and precision of multi-labelled KNN-classification of normalized differences between stranded library types with matched and mismatched reads on annotation data as features

Recall Precision

Unstranded 1 1

First strand 0.99 1

Second strand 1 0.99

3.2.3 Protocol classifier

In protocol-based library type assessment, no sample-specific discrimination can be made. A Bayesian classifier was trained and evaluated with leave-one-out cross-validation. The protocol sample size was increased from 23 experiments to 57. The naive Bayes text classifier separated stranded protocols from unstranded protocols with an accuracy of 0.77. Table 4 shows that the classifier assigned the stranded label conservatively and therefore misclassified many stranded experiments as unstranded. Experiments classified as stranded were correct in 94% of cases.

Table 4: Naive Bayes text classifier’s recall and precision in separating stranded and unstranded library types

Recall Precision

Stranded 0.59 0.94

Unstranded 0.96 0.69
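As a rough illustration of how such a protocol classifier can be built and evaluated, the following is a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing and leave-one-out cross-validation. The protocol snippets are invented toy strings and the function names are hypothetical; the actual tokenisation and features used for Table 4 are not specified here.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Returns the counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out(docs):
    """Train on all documents but one, predict the held-out one, repeat."""
    pairs = []
    for i, (tokens, label) in enumerate(docs):
        model = train_nb(docs[:i] + docs[i + 1:])
        pairs.append((label, predict_nb(model, tokens)))
    return pairs


# Invented protocol snippets for illustration only.
docs = [
    ("dutp second strand marking library".split(), "stranded"),
    ("dutp strand specific library kit".split(), "stranded"),
    ("strand specific dutp protocol".split(), "stranded"),
    ("poly a selection standard library".split(), "unstranded"),
    ("standard poly a library kit".split(), "unstranded"),
    ("standard library poly a protocol".split(), "unstranded"),
]
pairs = leave_one_out(docs)
accuracy = sum(t == p for t, p in pairs) / len(pairs)
print(accuracy)
```

Leave-one-out is a natural choice when only a few dozen protocols are available, since every protocol is used for testing exactly once while the remaining ones are used for training.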

Table 5: Naive Bayes text classifier’s recall and precision in distinguishing library types from one another in a multi-labelling problem

Recall Precision

Unstranded 0.96 0.69

First strand 0 0

Second strand 0.39 0.64

Distinguishing all three library types from one another based on protocols does not work as well. Total accuracy fell to 0.63 because the classifier was completely unable to distinguish between the two strand-specific protocols. These results are shown in Table 5.

4 Pipeline

4.1 Database development

Only experiments that have been selected as eligible for analysis are processed by the pipeline. This requires a systematic database that records eligibility for every experiment. The construction of this database is described in this section.

First, raw metadata files for all experiments are retrieved from the ArrayExpress FTP site. IDF and SDRF metadata files are downloaded if they are missing locally and updated when a version newer than the local copy is available. This process is tracked in a separate database table, which enables fast and efficient updates of the system.
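The download-if-missing, update-if-newer rule can be sketched with a small tracking table. This is only an illustration: the table name `metadata_files`, its columns, the function names and the placeholder file name are hypothetical, and modification times are assumed to be available as ISO 8601 strings, which compare correctly as plain text.

```python
import sqlite3


def open_tracking_db(path=":memory:"):
    """Open the database and create the (hypothetical) tracking table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata_files ("
        "file_name TEXT PRIMARY KEY, remote_modified TEXT NOT NULL)"
    )
    return conn


def needs_download(conn, file_name, remote_modified):
    """True if the file is unknown locally or the remote copy is newer."""
    row = conn.execute(
        "SELECT remote_modified FROM metadata_files WHERE file_name = ?",
        (file_name,),
    ).fetchone()
    return row is None or row[0] < remote_modified


def record_download(conn, file_name, remote_modified):
    """Store the timestamp of the version that was just downloaded."""
    conn.execute(
        "INSERT OR REPLACE INTO metadata_files (file_name, remote_modified) "
        "VALUES (?, ?)",
        (file_name, remote_modified),
    )
    conn.commit()


conn = open_tracking_db()
name = "E-XXXX-1.sdrf.txt"  # placeholder, not a real accession
if needs_download(conn, name, "2015-01-01T00:00:00"):
    # The actual FTP transfer would happen here.
    record_download(conn, name, "2015-01-01T00:00:00")
```

Keeping the timestamps in a table means an update run only has to compare the remote listing with the stored values instead of re-reading every local file.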

Locally downloaded metadata files are the basis for the subsequent identification of RNA-Seq experiments that are eligible for processing. SDRF files contain data about the individual samples of an experiment. An example of an SDRF file is given in appendix 1. In the following, all flowcharts describe the retrieval or analysis of a single experiment unless looping over experiments is explicitly noted.

As new data is released continuously, the databases need to be updated regularly to reflect the actual status of submissions to the online ArrayExpress database.

The first step in database construction is distinguishing NGS experiments from other experimental settings, as shown in Figure 6. The separation may be based on metadata fields that are present only in certain assay types, or it may be inferred directly from values within the parsed file contents. All available experiments are separated into four broad categories and classified as array assays (microarrays), hybrid assays, sequencing assays, or other experiments that have too little data for accurate separation. The experiment tables are updated with the corresponding decisions.

Figure 6: Flowchart of experiment assay classification in public experiment database construction
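A decision of the kind outlined in Figure 6 could look roughly like the sketch below, which classifies an experiment from its SDRF column headers. The marker columns (`Array Design REF` for arrays, `Comment[ENA_RUN]` and `Comment[LIBRARY_LAYOUT]` for sequencing) and the reading of "hybrid" as an experiment carrying both kinds of markers are assumptions made for illustration; the actual rules are those of the flowchart.

```python
# Assumed marker columns; the real pipeline may use other fields or values.
ARRAY_MARKERS = ("array design ref",)
SEQUENCING_MARKERS = ("comment[ena_run]", "comment[library_layout]")


def classify_assay(sdrf_headers):
    """Return 'array', 'hybrid', 'sequencing' or 'other' for one experiment."""
    headers = [h.strip().lower() for h in sdrf_headers]
    has_array = any(h.startswith(ARRAY_MARKERS) for h in headers)
    has_sequencing = any(h.startswith(SEQUENCING_MARKERS) for h in headers)
    if has_array and has_sequencing:
        return "hybrid"
    if has_sequencing:
        return "sequencing"
    if has_array:
        return "array"
    # Too little information for an accurate separation.
    return "other"


print(classify_assay(["Source Name", "Comment[ENA_RUN]", "Comment[FASTQ_URI]"]))
```

Checking only for the presence of columns keeps the first pass cheap; inspecting cell values is then needed only for experiments that fall into the "other" category.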

RNA-Seq is one technique among many NGS-based methodologies. Sequencing experiments therefore also need to be filtered for those that were conducted using RNA-Seq library construction techniques, since some NGS methods explore genomic features other than gene expression. This information can be inferred from fields in the IDF metadata, which give a general overview of the experiment, as outlined in Figure 7.

Figure 7: Flowchart of RNA-Seq experiment retrieval amongst NGS experiments in public experiment database construction
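The filtering step in Figure 7 can be sketched as parsing the tab-delimited IDF file into a mapping from field name to values and then inspecting an experiment-type field. The field name `Comment[AEExperimentType]` and the substring `rna-seq` are assumptions chosen for illustration; the fields actually consulted are those shown in the flowchart.

```python
def parse_idf(text):
    """Parse tab-delimited IDF text into {field name: [values]}."""
    fields = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("\t")]
        if cells and cells[0]:
            # The first cell is the field name; empty value cells are dropped.
            fields[cells[0]] = [c for c in cells[1:] if c]
    return fields


def is_rna_seq(fields, type_field="Comment[AEExperimentType]"):
    """True if any experiment-type value mentions RNA-seq (assumed rule)."""
    return any("rna-seq" in value.lower() for value in fields.get(type_field, []))


example_idf = (
    "Investigation Title\tToy experiment\n"
    "Comment[AEExperimentType]\tRNA-seq of coding RNA\n"
)
print(is_rna_seq(parse_idf(example_idf)))
```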

Experiments are further filtered by the availability of raw data and by some other conditions of the download process, as outlined in Figure 8. ENA experiment and run accessions have to exist for every individual sample. They are needed to match the metadata with the resulting expression matrices. A valid ENA accession code is also required for every experiment. Links for raw data download are parsed from the ENA metadata for these accessions. The metadata structures and file specifications published by ENA and in SDRF/IDF files do not always match, so a practical solution needs mechanisms that detect such differences early. Processing superseries experiments would replicate existing results. Such duplicates need to be eliminated to avoid over-representing some experimental conditions.

Figure 8: Flowchart of data availability classification for RNA-Seq experiments in public experiment database construction
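The availability checks in Figure 8 can be approximated by a function that collects the reasons why an experiment is not eligible. The dictionary keys `ena_experiment` and `ena_run`, the `is_superseries` flag and the accessions in the usage lines are hypothetical; the regular expressions follow the usual shape of ENA experiment (ERX, SRX, DRX) and run (ERR, SRR, DRR) accessions.

```python
import re

# Accession shapes: a three-letter prefix followed by digits.
EXPERIMENT_RE = re.compile(r"^[EDS]RX\d+$")
RUN_RE = re.compile(r"^[EDS]RR\d+$")


def ineligibility_reasons(samples, is_superseries=False):
    """Return a list of reasons; an empty list means the experiment is eligible."""
    reasons = []
    if is_superseries:
        # Superseries repeat data that is already present in their sub-series.
        reasons.append("superseries")
    if not samples:
        reasons.append("no samples")
    if any(not EXPERIMENT_RE.match(s.get("ena_experiment", "")) for s in samples):
        reasons.append("missing or invalid ENA experiment accession")
    if any(not RUN_RE.match(s.get("ena_run", "")) for s in samples):
        reasons.append("missing or invalid ENA run accession")
    return reasons


# Made-up accessions, used only to exercise the checks.
good = [{"ena_experiment": "ERX000001", "ena_run": "ERR000001"}]
bad = [{"ena_experiment": "ERX000001", "ena_run": ""}]
print(ineligibility_reasons(good))
print(ineligibility_reasons(bad))
```

Returning the reasons rather than a single boolean makes it possible to store in the database why an experiment was excluded.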

The analysis of RNA-Seq experiments is time-critical, and experiments are analysed by species in the order of their raw data sizes. This size is obtained by counting all individual samples and summing their sizes, grouped by species within an experiment, as shown in Figure 9. This completes the content of the database, which can then be used as a basis for querying, for maintaining the automatic analysis, and for the structural composition of the data.

Figure 9: Flowchart of grouping RNA-Seq experiments with available raw data by species and sizes of raw files in public experiment database construction
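The grouping in Figure 9 amounts to counting samples and summing file sizes per experiment and species, then ordering the groups by total size. In the sketch below the record keys, the example values and the ascending order (smallest first) are assumptions; the text does not state the direction of the ordering.

```python
from collections import defaultdict


def size_by_species(samples):
    """Count samples and sum raw file sizes per (experiment, species) pair."""
    totals = defaultdict(lambda: {"samples": 0, "bytes": 0})
    for s in samples:
        key = (s["experiment"], s["species"])
        totals[key]["samples"] += 1
        totals[key]["bytes"] += s["bytes"]
    return dict(totals)


def processing_order(totals):
    """Order groups by total raw data size, smallest first (assumed direction)."""
    return sorted(totals, key=lambda key: totals[key]["bytes"])


# Invented records for illustration.
samples = [
    {"experiment": "EXP-A", "species": "Homo sapiens", "bytes": 500},
    {"experiment": "EXP-A", "species": "Homo sapiens", "bytes": 700},
    {"experiment": "EXP-A", "species": "Mus musculus", "bytes": 300},
    {"experiment": "EXP-B", "species": "Mus musculus", "bytes": 2000},
]
totals = size_by_species(samples)
print(processing_order(totals))
```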
