2.2 NGS data analysis
2.2.3 RNA sequencing analysis
•Quality control for RNA-seq data
Quality control was performed using RNA-SeQC [82], which reports the alignment rate,
duplication rate, strand specificity, GC bias, rRNA content, the distribution of aligned reads
across regions (exonic, intronic and intragenic), and the number of detectable transcripts. An
example of strand-specific RNA-seq from a breast cancer sample is given in Figure 2.1.
•Gene expression
Paired-end RNA-seq data were aligned to the human reference genome hg19 with decoy sequences
(hs37d5) using STAR-2.3.0e [83]. Reads mapped to each gene were counted with HTSeq-count [84]
using GENCODE annotation version 19. HTSeq-count is designed to produce gene-level counts
based on exon models for the analysis of differentially expressed genes. Ambiguous reads that
mapped to more than one gene were excluded from downstream analysis. Gene-level counts were
then converted to reads per kilobase per million mapped reads (RPKM). Gene length was
calculated as the total non-redundant cumulative exon length of each gene in GENCODE
annotation version 19.
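The RPKM calculation described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the half-open exon coordinates, and the use of a single total-mapped-reads figure as the normalization denominator are assumptions, since the text does not specify how the total was defined.

```python
from typing import Dict, List, Tuple

# Assumption: exons are half-open [start, end) intervals on one chromosome.
# GTF files use 1-based inclusive coordinates and would need converting first.
Interval = Tuple[int, int]


def non_redundant_exon_length(exons: List[Interval]) -> int:
    """Length of the union of a gene's exon intervals (overlaps counted once)."""
    total = 0
    current_start, current_end = None, None
    for start, end in sorted(exons):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def rpkm(counts: Dict[str, int], lengths: Dict[str, int],
         total_mapped: int) -> Dict[str, float]:
    """RPKM = read count * 1e9 / (gene length in bp * total mapped reads)."""
    return {
        gene: count * 1e9 / (lengths[gene] * total_mapped)
        for gene, count in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical toy values, for illustration only.
    exons = {"GENE_A": [(100, 300), (250, 400), (1000, 1200)],
             "GENE_B": [(0, 1000)]}
    lengths = {g: non_redundant_exon_length(e) for g, e in exons.items()}
    counts = {"GENE_A": 50, "GENE_B": 200}
    print(rpkm(counts, lengths, total_mapped=1_000_000))
    # → {'GENE_A': 100.0, 'GENE_B': 200.0}
```

Merging overlapping exons before summing keeps exons shared between transcripts of one gene from being counted more than once, which is what "non-redundant cumulative exon length" requires.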
•Integration of SNVs and RNA-seq data
For breast tumor samples with both WES and RNA-seq data available, DNA variant positions
(somatic mutation positions) were annotated with the RNA reads mapped to them (RNA BAM files).
A DNA tumor allele was called expressed only if RNA reads aligned at the position carried the
same variant (tumor allele). Expressed tumor alleles with fewer than ten RNA-seq reads mapped
at the position were excluded from further analysis.

(Flowchart labels: Paired-end RNA-seq; STAR; quality control (RNA-SeQC); gene-level count
(HTSeq); RPKM; mutant allele expression; FusionMap, SOAPfuse, deFuse; confFuse; BLAT
validation; experimental validation.)
Figure 2.2: Overview of data flow in RNA-seq analysis. High-confidence fusion candidates
were selected by confFuse and then validated in silico by BLAT, followed by experimental
validation.
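The check for expressed tumor alleles described in this subsection can be sketched as below. This is an illustrative sketch only. It assumes single-nucleotide variants and per-position base counts that have already been extracted from the RNA BAM file by a pileup step (not shown). The names `SomaticVariant`, `classify_expression` and `MIN_ALT_READS` are hypothetical, and the threshold of one supporting read is an assumption, because the text does not state a minimum number of reads carrying the tumor allele.

```python
from dataclasses import dataclass
from typing import Dict

MIN_RNA_DEPTH = 10  # from the text: fewer than ten RNA reads at the position -> excluded
MIN_ALT_READS = 1   # assumption: minimum supporting reads is not stated in the text


@dataclass(frozen=True)
class SomaticVariant:
    chrom: str
    pos: int
    ref: str
    alt: str  # tumor allele called from WES


def classify_expression(variant: SomaticVariant,
                        rna_base_counts: Dict[str, int]) -> str:
    """Classify one somatic SNV from the RNA base counts at its position.

    rna_base_counts maps each observed base to the number of RNA reads
    showing it at the variant position.
    """
    depth = sum(rna_base_counts.values())
    if depth < MIN_RNA_DEPTH:
        # Too few RNA reads to use this position in further analysis.
        return "low_coverage"
    if rna_base_counts.get(variant.alt, 0) >= MIN_ALT_READS:
        # RNA reads carry the same tumor allele as the DNA call.
        return "expressed"
    return "not_expressed"


if __name__ == "__main__":
    # Hypothetical variant and counts, for illustration only.
    snv = SomaticVariant("chr1", 12345, "C", "T")
    print(classify_expression(snv, {"C": 20, "T": 5}))
```

Returning a separate `low_coverage` label keeps positions that could not be evaluated apart from positions where the tumor allele was covered but not observed in RNA.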
•Fusion gene detection and validation
Three tools (FusionMap [65], SOAPfuse [66] and deFuse [50]) were used to detect fusion genes
from paired-end RNA-seq data. The versions used were SOAPfuse-v1.26, deFuse-0.6.1 and
FusionMap-2015-03-31. The human reference genome hg19/GRCh37 was used for deFuse and
SOAPfuse; the genome reference Human.B37.3 and gene model Ensembl.R75 were used for
FusionMap. In addition, I developed a new scoring algorithm, confFuse, to reliably select
high-confidence fusion genes (details in the next section). After in silico validation with
BLAT, fusion candidates were validated experimentally by RT-PCR followed by Sanger
sequencing. Primers for RT-PCR validation were designed using Primer3 [85].
•Pathway analysis
Gene lists for the different pathways were taken from KEGG (Kyoto Encyclopedia of Genes and
Genomes). For fusion genes identified by confFuse, pathway figures were generated with
Ingenuity Pathway Analysis.
•Overview of data flow
An overview of data flow in RNA-seq analysis is given in Figure 2.2.