RNA-seq data quality control, mapping and normalization

3.2 Methods

3.2.1 RNA-seq data quality control, mapping and normalization

3.2.1.1 Data quality control and mapping

In the original study, the raw reads from the RNA-seq data were mapped to the NCBI Sus scrofa genome build Sscrofa9.2²⁵ (Gunawan et al., 2013). In the experiments discussed here, however, the raw reads (in .fastq files) were remapped to the newer NCBI Sus scrofa genome build released at the time, Sscrofa10.2²⁶.

http://tophat.cbcb.umd.edu/index.shtml last accessed October 9, 2013

²⁵ http://www.ncbi.nlm.nih.gov/assembly/111518/ last accessed October 14, 2013

The first step in this remapping process was quality control. In this step, the quality of the raw read sets (testis and liver) was independently assessed using the FastQC quality control tool. Overrepresented PCR primers, low-quality bases (Phred score <20) and bases with fluctuating GC content were identified in this step (as reported by FastQC) and removed from the raw sequencing data using a combination of the Cutadapt and Seqtk tools. Cutadapt, as the name suggests, was mainly used for pruning PCR primers from the raw reads. A well-known issue with Illumina next-generation sequencing systems is the low quality of the bases at the 3’ end of the reads. The recommended procedure in this case is to exclude these bases from further processing (Minoche et al., 2011), for which Seqtk was primarily used. The choice of cut-off threshold (Phred score of 20) was arbitrary, yet it ensured that only bases with a base-call accuracy of 99% or more (an error probability of at most 1 in 100) were retained for further analysis.

The pruned datasets obtained from this quality control step were aligned to the Sus scrofa genome build Sscrofa10.2 using the “splice aware” mapping algorithm TopHat (see section 3.1.2). According to Trapnell et al. (2009), the major objectives of mapping raw reads to a genome in RNA-seq experiments are (i) identification of novel transcripts and (ii) abundance estimation of transcripts. In this thesis, mapping raw reads to the genome was primarily used for the abundance estimation of transcripts.

As explained in the section Algorithms and softwares (section 3.1.2), to compute the read depth coverage for each gene and to generate gene read coverage (gene expression) matrices, sequence alignments in BAM format and gene feature annotations in GFF format were given as inputs to the BEDTools coverageBed utility. The next step was to filter genes with low read counts from the testis and liver expression matrices. In both cases (testis and liver), genes with a mean read count <25 in the HA and LA phenotypes were removed from the raw read count expression matrix before further processing.
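As a worked check of the quality threshold used in the quality control step above, the following minimal Python sketch converts a Phred score into a base-call error probability and trims low-quality bases from the 3’ end of a read. The function names are illustrative, and the trimming rule is a deliberately simple assumption (drop trailing bases while they fall below the threshold); it is not the algorithm implemented in Seqtk or Cutadapt.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_3prime(seq, quals, min_q=20):
    """Remove trailing (3' end) bases whose Phred score is below min_q.

    Simplified illustration only: real trimmers use more robust rules.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


# A Phred score of 20 corresponds to a 1% error probability (99% accuracy).
accuracy_at_q20 = 1 - phred_to_error_prob(20)

# Hypothetical read whose last two bases fall below the threshold.
trimmed_seq, trimmed_quals = trim_3prime("ACGTAC", [30, 32, 31, 25, 12, 8])
```

With the hypothetical read above, the two trailing bases with scores 12 and 8 are removed and the first four bases are kept.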
Table 3.1 shows the number of genes in each gene expression matrix generated from RNA-seq sequence alignment files.

Table 3.1: RNA-seq expression data statistics

Sample tissue type   Number of genes before pruning   Number of genes after pruning   HA samples   LA samples
Testis samples       21,340                           16,760                          5            5
Liver samples        18,427                           11,736                          5            5
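The low-count filter described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the thesis: the function name, the data layout and the toy counts are hypothetical, and the text does not say whether a gene had to fall below the threshold in both phenotypes or in either one. The sketch assumes the former, so a gene is dropped only when its mean read count is below 25 in both the HA and the LA group.

```python
from statistics import mean


def filter_low_count_genes(counts, groups, min_mean=25):
    """Keep genes whose mean read count reaches min_mean in at least one group.

    counts: dict mapping gene id -> list of raw read counts, one per sample
    groups: list of group labels (e.g. "HA"/"LA") in the same sample order
    """
    labels = sorted(set(groups))
    kept = {}
    for gene, row in counts.items():
        group_means = [
            mean(c for c, g in zip(row, groups) if g == label)
            for label in labels
        ]
        # Assumption: a gene is removed only if it is low in every group.
        if any(m >= min_mean for m in group_means):
            kept[gene] = row
    return kept


# Toy example with two samples per phenotype (hypothetical values).
groups = ["HA", "HA", "LA", "LA"]
counts = {
    "geneA": [100, 120, 90, 80],  # high in both groups: kept
    "geneB": [3, 5, 0, 10],       # low in both groups: removed
    "geneC": [40, 30, 10, 12],    # high in HA only: kept under this assumption
}
filtered = filter_low_count_genes(counts, groups)
```

Changing `any` to `all` gives the stricter reading, in which a gene must reach the threshold in both phenotypes to be retained.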

3.2.1.2 Expression data normalization

In RNA-seq experiments, the expression of a gene is measured as the number of reads mapping to a particular genomic interval, unlike the probe intensity values measured in microarray experiments. The measured RNA-seq gene expression values follow a negative binomial distribution (Robinson et al., 2010), in contrast to the normally distributed gene expression values in microarray experiments. A major challenge raised by this difference in data distribution is that the classical linear modeling procedures developed for microarray data mining and analysis assume the data to be normally distributed and hence cannot be applied directly to RNA-seq expression data. Although various nonparametric (distribution-free) procedures can be used in this context, initial “trial and error” experiments showed that the results given by such procedures were statistically non-significant, owing both to the small sample size of the data set used in the experiments performed in this thesis (number of samples per phenotype: 5) and to the limited power of nonparametric methods to draw significant conclusions from data sets with small sample sizes. Additionally, in the second experiment (see section 3.2.2.2), combining RNA-seq meta data with microarray meta data requires that the expression data from all the experiments follow the same distribution.
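The contrast between the two distributions can be made explicit through the mean-variance relationship. The parameterization below is the one commonly used for read counts and is an assumption here, since the text does not state one: for gene g with mean count \mu_g and dispersion \phi_g,

```latex
\operatorname{Var}(Y_{g}) = \mu_{g} + \phi_{g}\,\mu_{g}^{2}, \qquad \phi_{g} \ge 0
```

For \phi_g = 0 this reduces to the Poisson case, where the variance equals the mean. In either case the variance changes with the mean, whereas classical linear models for microarray data assume a variance that does not depend on the expression level.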

Recently, Law et al. (2013) proposed applying normal distribution based, microarray-like statistical analysis methods to RNA-seq read count data. The proposed model is built on the principle that accurate modeling of the mean-variance relationship intrinsic to the data generating process is essential for designing statistically powerful methods (Law et al., 2013). Mean-variance modeling at the observational level (voom) estimates the mean-variance relationship in the read count data and computes a weight for each observation based on this relationship (Law et al., 2013). To overcome, to an extent, the limitations of small sample sizes and nonparametric methods, and following the idea proposed by Law et al. (2013), the RNA-seq gene expression matrix was normalized and log2 transformed using the voom function implemented in the limma R package (Smyth, 2005). Comparisons of various normalization and differential expression analysis methods for RNA-seq data have shown voom normalization combined with the limma package to be relatively unaffected by outliers and to perform well under many conditions (Soneson and Delorenzi, 2013). An additional study (Rapaport et al., 2013) concluded that modeling RNA-seq gene count data as log-normally distributed with appropriate pseudo counts (limma voom modeling) is a reasonable approximation of the data.
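The log2 transformation step of voom can be illustrated with a short Python sketch. It covers only the log-counts-per-million transform, log2((r + 0.5) / (R + 1) × 10^6), where r is a read count and R the library size of the sample; the estimation of the mean-variance trend and the resulting precision weights are omitted. The function name is illustrative, the toy counts are hypothetical, and library sizes are taken as plain column sums, which assumes no additional scaling normalization. The thesis itself used the voom function in the limma R package.

```python
import math


def log2_cpm(counts):
    """Log2 counts per million with the offsets used in the voom transform.

    counts: list of rows (genes), each a list of raw read counts per sample.
    Returns a matrix of the same shape.
    """
    n_samples = len(counts[0])
    # Library size of each sample = total read count in that column.
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [
            math.log2((row[j] + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
            for j in range(n_samples)
        ]
        for row in counts
    ]


# Toy matrix: 2 genes x 2 samples, each sample with 999 reads in total.
toy_counts = [[10, 0], [989, 999]]
toy_logcpm = log2_cpm(toy_counts)
```

The offset of 0.5 keeps the logarithm finite for zero counts, which is why the gene with no reads in the second sample still receives a defined value.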
