CHAPTER 3. SCISSOR: SHAPE CHANGES IN SELECTING SAMPLE OUT-
3.4 Pre-processing data
3.4.1 Filtering out degraded samples
Degradation of RNA transcripts can confound subsequent analyses, especially if data subjected to different amounts of degradation are naively compared against each other (Romero et al., 2014). Figure 3.3 (a) shows three RNA-seq observations suspected of degradation, each indicated by a different color, with the other non-degraded observations in gray. It is well known that sequencing degraded RNA samples often leads to lower read coverage at the 5’ end of the gene and negatively affects subsequent analyses such as transcript quantification, gene expression profiling, and fusion detection (Opitz et al., 2010; Romero et al., 2014; Davila et al., 2016). In particular, this leaves degraded RNA-seq samples susceptible to being flagged as shape outliers, which could confound the detection of real biological aberrations in the high-quality samples. In this subsection, we propose a method to quantify the level of degradation in each case at each gene and to identify a group of cases that were globally degraded at a considerable number of genes.
A recent study (Davila et al., 2016) reports that the transcript coverage of degraded samples decreases exponentially as a function of the distance from the 3’ end of the mRNA, and that more highly degraded samples show a faster rate of decrease. This motivates us to measure the extent of degradation, also called the decay rate, by the mean-corrected slope of log-transformed RNA-seq data. To assess the decay rates accurately, we first adjusted for the different sequencing depths at each locus by using the first step of the scale normalization method. This procedure helps to remove the intrinsic slopes, so that high-quality RNA-seq samples are free of a decreasing trend from the 3’ end and any remaining trend is observed only in the degraded samples. It therefore enables a more accurate comparison of decay rates across genes by adjusting for other sources that may affect the slopes. After the adjustment, we fitted a linear model to the mean-corrected coverage for each sample, with the ordered base positions as a covariate. Let $q_{ij} = q_j(i)$ be the mean-corrected coverage for the $j$th observation ($1 \le j \le n = 522$), where $i$ indexes the base position at a given locus of total length $d$ ($1 \le i \le d$). The linear model

$$q_{ij} = q_j(i) = \alpha_j + \beta_j \times \left(\frac{i}{d}\right) + \varepsilon_{ij}$$

was fitted and the least squares estimates $\hat{\alpha}_j$ and $\hat{\beta}_j$ were obtained. Note that we divided the covariate $i$ by $d$ to correct for the effect of gene length. Then $\hat{\beta}_j$ is the decay rate of the $j$th observation, with a higher value of $\hat{\beta}_j$ indicating more severe degradation.

Figure 3.4: Heatmap of decay rates for all genes and all tumor cases, in which the brown color indicates the top 5% of decay rates whereas the beige color indicates the lower values. The vertical color bar on the right of the heatmap encodes the gene lengths, and a darker blue denotes a longer gene. The horizontal color bar at the bottom of the heatmap encodes the decay ranks from the 3’/5’ bias method of Abeshouse et al. (2015), and a darker red denotes a higher decay rank. Based on the two groups from the hierarchical clustering at the top of the heatmap, the 70 cases identified as degraded are shown in brown and the remaining intact cases in green.
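The slope fit described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the analysis: the function name `decay_rate` and the synthetic coverage profiles are hypothetical, and the input is assumed to be coverage that has already been normalized, log-transformed, and mean-corrected.

```python
def decay_rate(coverage):
    """Ordinary least squares fit of q(i) = alpha + beta * (i / d) + error.

    `coverage` is assumed to hold the mean-corrected, log-transformed
    coverage of one sample at base positions i = 1, ..., d of one gene.
    Returns (alpha_hat, beta_hat); beta_hat plays the role of the decay rate.
    """
    d = len(coverage)
    if d < 2:
        raise ValueError("need at least two base positions to fit a slope")
    # Covariate i / d: scaling by the gene length d makes slopes comparable
    # across genes of different lengths.
    x = [(i + 1) / d for i in range(d)]
    x_bar = sum(x) / d
    y_bar = sum(coverage) / d
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, coverage))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def synthetic_profile(d, alpha=2.0, beta=3.0):
    """Hypothetical noise-free profile with a known intercept and slope."""
    return [alpha + beta * (i / d) for i in range(1, d + 1)]


# The same relative trend on a short and a long gene gives the same slope,
# because the covariate is i / d rather than i.
alpha_short, beta_short = decay_rate(synthetic_profile(100))
alpha_long, beta_long = decay_rate(synthetic_profile(5000))
print(round(beta_short, 6), round(beta_long, 6))
```

Applying `decay_rate` to every sample at every gene would fill the genes-by-samples matrix of decay rates that is used in the cluster analysis.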
We obtained n=522 decay rates at each gene and collected those values across genes into a large matrix whose columns are samples and whose rows are genes. Unsupervised hierarchical cluster analysis was performed on this matrix using hclust in R/Bioconductor with the complete linkage method. Figure 3.4 displays a heatmap of the decay rate matrix, with both rows and columns ordered by the unsupervised hierarchical clustering. The samples were classified into two groups, as shown in the dendrogram above the heatmap, and the samples in the second group, indicated by the brown color, clearly show higher decay rates at more than 50% of the genes. Based on this cluster analysis, we identified 70 RNA-seq samples with strong evidence of degradation and excluded them from the downstream analysis. The vertical color code on the right of the heatmap shows the gene lengths, with darker blue indicating longer genes. This color code shows that longer genes tend to undergo more degradation, which is expected since long genes are fully affected by degradation whereas short genes are less affected (Davila et al., 2016). As an illustration, intact/degraded RNA-seq overlays for a short gene, LGALS1, and a long gene, FAT1, are displayed in Figure 3.5. The gene LGALS1 shows no clear change in expression between the two groups, whereas the gene FAT1 shows a severe impact of degradation in the second group. We also found a strong association between our decay rates and the 3’/5’ biases from Abeshouse et al. (2015). The latter are shown using a color code at the bottom of the heatmap in Figure 3.4. A darker red represents more severe degradation based on the 3’/5’ bias, and the samples with a dark red color tend to have high decay rates.
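The two-group split of samples can be illustrated with a small, self-contained sketch of complete-linkage agglomerative clustering. It is not the hclust call used for the analysis: the function names and the toy decay-rate matrix are hypothetical, and Euclidean distance between samples is an assumption, since the text does not state which distance was used.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage_groups(samples, n_groups=2):
    """Agglomerative clustering with complete linkage.

    `samples` is a list of per-sample vectors (one decay rate per gene).
    Clusters are merged until `n_groups` remain, which corresponds to
    cutting the dendrogram into `n_groups` groups. Returns one integer
    label per sample.
    """
    n = len(samples)
    dist = [[euclidean(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > n_groups:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d_pq = linkage(clusters[p], clusters[q])
                if best is None or d_pq < best[0]:
                    best = (d_pq, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * n
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Hypothetical toy matrix: rows are genes, columns are samples.
# Samples 2 and 4 have clearly higher decay rates than samples 0, 1 and 3.
decay = [
    [0.1, 0.2, 2.0, 0.0, 2.2],
    [0.0, 0.1, 1.8, 0.2, 2.1],
    [0.2, 0.0, 2.5, 0.1, 1.9],
]
samples = [list(column) for column in zip(*decay)]
labels = complete_linkage_groups(samples, n_groups=2)
print(labels)
```

The brute-force search over cluster pairs is chosen for readability only; for a matrix with 522 samples and many genes, an optimized library routine would be the practical choice.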
The three suspected degraded samples in Figure 3.3 (a) are all identified as degraded, and the remaining intact samples are shown in Figure 3.3 (b), with two suspected shape outliers highlighted in red and blue.