Normalization

6.1.1 Normalization

The first data processing step following an RNA-seq experiment is read mapping. The TSS detection method described here does not operate on the mapping data directly but on coverage graphs at single-nucleotide resolution that are derived from these data. These graphs (also called wiggle graphs) consist of one value per genomic position, indicating the number of mapped reads covering that position. As the mapping is strand-specific, this results in two graphs per library, one for the forward and one for the reverse strand.
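The derivation of strand-specific coverage graphs from mapped reads can be sketched as follows. This is only an illustration: the function name `coverage_graphs` and the read representation (0-based start, exclusive end, strand symbol) are assumptions made for this example and are not taken from the pipeline itself.

```python
import numpy as np

def coverage_graphs(reads, genome_length):
    """Return one coverage array per strand.

    reads: iterable of (start, end, strand) with 0-based start,
    exclusive end and strand '+' or '-' (hypothetical input format).
    """
    # Difference arrays: +1 where a read starts, -1 just past its end.
    diffs = {
        "+": np.zeros(genome_length + 1, dtype=np.int64),
        "-": np.zeros(genome_length + 1, dtype=np.int64),
    }
    for start, end, strand in reads:
        diffs[strand][start] += 1
        diffs[strand][end] -= 1
    # The cumulative sum turns start/end marks into per-position read counts.
    return {strand: np.cumsum(d)[:-1] for strand, d in diffs.items()}

example_reads = [(0, 3, "+"), (2, 5, "+"), (1, 4, "-")]
example_graphs = coverage_graphs(example_reads, genome_length=6)
```

Each library then yields two arrays, one per strand, which corresponds to the two wiggle graphs mentioned above.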

6.1. The TSS prediction pipeline

[Figure 6.1, diagram labels: RNAs in the cell; Primary transcripts; Processed RNA; 5'PPP; 5'P; Exonuclease degrading 5'P RNAs; standard library; treated library; no treatment; 5'PPP enriched; (+)-Strand; TSS positions]

Figure 6.1.: Illustration of the differential RNA-seq (dRNA-seq) protocol. Two RNA sequencing libraries are produced, one that is untreated and one that is treated with a terminator exonuclease that degrades RNAs with a 5’ monophosphate in order to enrich the 5’ ends of primary transcripts that carry a 5’ triphosphate. Both libraries are sequenced and the sequencing data are compared to distinguish TSS from RNA processing sites.

[Figure 6.2, diagram labels: dRNA-seq data input; Detection of TSS candidates; Replicate comparison; SuperGenome-based comparative analysis; Classification; Gene annotations; Output; GFF; TSV; WIG]

Figure 6.2.: Basic steps of the TSS prediction pipeline. The dRNA-seq input data are read and normalized. TSS candidates are detected in the replicates of the different data sets and the results are compared in order to eliminate irreproducible sites. The SuperGenome approach is employed to associate TSS candidates from different genomes with each other. After classification with respect to annotated genes, all results such as the TSS MasterTable and supplemental data are generated.

The graphs are usually normalized by the total number of reads that could be mapped from the respective library. However, this number is often biased by only a few strongly expressed transcripts [43, 130]. These can be ribosomal RNAs, as the efficiency of the rRNA depletion protocol might vary between libraries. For this reason, an additional normalization of the dRNA-seq graphs is conducted prior to TSS detection. This is done by performing a percentile normalization, which is more robust against the variation of very strongly expressed genes than using the total number of mapped reads as a normalization factor. For this, the 90th percentile of all expression values is calculated and used as the normalization factor. The factor is calculated from the treated library, but it is applied to both the treated and the untreated library. Thus, the enrichment factors are not changed during this normalization step. After the dRNA-seq graphs of all libraries have been normalized, all expression values are multiplied by the minimal normalization factor in order to restore the original data range.
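A minimal sketch of the percentile normalization described above is given here. The function name `percentile_normalize` and the input layout (a list of treated/untreated array pairs) are hypothetical. The text does not state whether positions without coverage enter the percentile; this sketch assumes they are excluded.

```python
import numpy as np

def percentile_normalize(pairs, percentile=90.0):
    """Normalize (treated, untreated) coverage pairs by a percentile factor.

    pairs: list of (treated, untreated) arrays (hypothetical layout).
    Returns the normalized pairs and the per-pair normalization factors.
    """
    factors = []
    for treated, _ in pairs:
        treated = np.asarray(treated, dtype=float)
        # Assumption: only positions with coverage > 0 count as
        # expression values; a library without coverage is not handled.
        expressed = treated[treated > 0]
        factors.append(float(np.percentile(expressed, percentile)))

    # Multiplying by the smallest factor restores the original data range.
    rescale = min(factors)

    normalized = []
    for (treated, untreated), factor in zip(pairs, factors):
        scale = rescale / factor
        # The same scale is applied to both libraries of a pair,
        # so the enrichment (treated / untreated) is unchanged.
        normalized.append((np.asarray(treated, dtype=float) * scale,
                           np.asarray(untreated, dtype=float) * scale))
    return normalized, factors

pair_a = (np.array([0, 10, 20, 30, 40]), np.array([0, 5, 10, 15, 20]))
pair_b = (2 * pair_a[0], 2 * pair_a[1])  # same library, sequenced twice as deep
norm_pairs, norm_factors = percentile_normalize([pair_a, pair_b])
```

In this toy input the second pair is the first one scaled by two, so both pairs coincide after normalization while the treated/untreated ratio within each pair stays the same.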

This normalization procedure makes no assumptions about the normalization state of the input data, as the result is independent of any factor that may have been applied as a normalization factor earlier. However, the normalization is still linear, which might not be sufficient if non-linear effects occur. A comparison of TSS expression height distributions after normalization between 4 RNA-seq libraries of 4 different Campylobacter jejuni strains (see chapter 7 for details) is shown as a Q-Q plot matrix in figure 6.3. For several libraries non-linear effects are evident. However, reasonable expression height thresholds for the annotation of TSS are between 5 and 10 reads. In this interval the normalization strategy presented here seems to be sufficient, as pronounced non-linear effects are only observed for much higher expression levels.
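The independence from earlier normalization factors can be made explicit with a short derivation. The notation is introduced here for illustration only: x_i(p) is the expression value at position p in library pair i, q_i is the 90th percentile of the treated library of pair i, and c_i > 0 is an earlier factor that is assumed to have been applied to both libraries of the pair.

```latex
\begin{align*}
\hat{x}_i(p) &= \frac{x_i(p)}{q_i}\,\min_j q_j
  && \text{(normalized value)} \\
x'_i(p) = c_i\,x_i(p) \;&\Rightarrow\; q'_i = c_i\,q_i
  && \text{(percentiles scale with the data)} \\
\frac{x'_i(p)}{q'_i} &= \frac{c_i\,x_i(p)}{c_i\,q_i} = \frac{x_i(p)}{q_i}
  && \text{(the earlier factor cancels)}
\end{align*}
```

Under this assumption the earlier factor cancels within each pair. The final multiplication by the minimal factor is a single constant shared by all libraries, so it affects only the common data range and not the comparison between libraries.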

Another important property for TSS prediction is the enrichment factor, i.e., the factor by which the expression value in the treated library is higher than in the untreated library. It has to be considered that the efficiency of the enrichment procedure directly influences the number of detectable TSS. Variations between the enrichment rates of different libraries therefore bias the comparative analysis. In figure 6.4 the distributions of enrichment factors of predicted TSS are compared between 4 different C. jejuni strains. Here, only the normalization method described above was applied. The enrichment rates differ significantly between the strains. For example, in strain NC 009312 the enrichment strength was about twice as high as in strain NC 009839.

To account for this effect, an additional normalization method was integrated. For this, a preliminary prediction of TSS is performed for each pair of treated and untreated library, which uses fixed thresholds of 0.1 for the minimal step height and 1.5 for the minimal step factor (see 6.1.2). Other properties are not evaluated. The resulting TSS set is then used to determine the median enrichment factor for the respective library pair. Taking the library pair with the strongest enrichment as reference, these values are used to determine for each pair the normalization factor that is necessary to achieve the same rate as the reference. This factor is then applied to the dRNA-seq graphs of the respective untreated library. In figure 6.5 the same comparison as described above is shown, but with this additional normalization applied to the data. As for the expression heights, there seem to be additional non-linear differences between the libraries. However, a reasonable threshold for the minimal enrichment factor will presumably be smaller than 10 in any case, and the normalization compensates for all significant effects in an interval between 0 and 20.
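The enrichment normalization can be sketched as follows. Several details are assumptions made only for this example, as the exact definitions are given in section 6.1.2 and not here: the step height is taken as the coverage increase from the preceding position in the treated library, the step factor as the corresponding ratio, and the enrichment factor as treated divided by untreated coverage at the TSS position. All function names are hypothetical.

```python
import numpy as np

def preliminary_tss(treated, min_step_height=0.1, min_step_factor=1.5):
    """Positions whose coverage step passes both fixed thresholds."""
    treated = np.asarray(treated, dtype=float)
    prev, curr = treated[:-1], treated[1:]
    # Assumed definitions: height = curr - prev, factor = curr / prev.
    # The factor test is written without division to allow prev == 0.
    mask = (curr - prev >= min_step_height) & (curr >= min_step_factor * prev)
    return np.nonzero(mask)[0] + 1

def median_enrichment(treated, untreated, positions):
    """Median of treated / untreated over the given TSS positions."""
    t = np.asarray(treated, dtype=float)[positions]
    u = np.asarray(untreated, dtype=float)[positions]
    covered = u > 0  # assumption: skip TSS without untreated coverage
    if not covered.any():
        raise ValueError("no TSS with untreated coverage")
    return float(np.median(t[covered] / u[covered]))

def enrichment_normalize(pairs):
    """Scale each untreated graph so all pairs reach the reference rate."""
    medians = [median_enrichment(t, u, preliminary_tss(t)) for t, u in pairs]
    reference = max(medians)  # pair with the strongest enrichment
    normalized = []
    for (t, u), m in zip(pairs, medians):
        # Factor <= 1: lowering the untreated graph raises the enrichment.
        factor = m / reference
        normalized.append((np.asarray(t, dtype=float),
                           np.asarray(u, dtype=float) * factor))
    return normalized, medians

weak = (np.array([0, 0, 10, 10, 0, 20, 20]),
        np.array([0, 0, 5, 5, 0, 10, 10]))
strong = (np.array([0, 0, 10, 10, 0, 20, 20]),
          np.array([0, 0, 2.5, 2.5, 0, 5, 5]))
enr_pairs, enr_medians = enrichment_normalize([weak, strong])
```

In this toy input the weaker pair has a median enrichment of 2 and the reference pair one of 4, so the untreated graph of the weaker pair is multiplied by 0.5 and the treated graphs are left unchanged.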

Figure 6.3.: Q-Q plot matrix comparing the distributions of TSS expression heights after normalization between 4 Campylobacter jejuni strains. Only the interval between 0 and 20 reads, which is relevant for the threshold of the TSS prediction method, is shown.

Figure 6.4.: Q-Q plot matrix comparing the distributions of TSS enrichment factors without additional normalization between 4 Campylobacter jejuni strains. Only the interval between 0 and 20, which is relevant for the threshold of the TSS prediction method, is shown.

Figure 6.5.: Q-Q plot matrix comparing the distributions of TSS enrichment factors with additional normalization between 4 Campylobacter jejuni strains. Only the interval between 0 and 20, which is relevant for the threshold of the TSS prediction method, is shown.

In document Computational Methods for the Identification and Characterization of Non-Coding RNAs in Bacteria (Page 104-110)