
2.2 NGS data analysis

2.2.3 RNA sequencing analysis

Quality control for RNA-seq data

Quality control was performed using RNA-SeQC [82], which reports alignment rate, duplication rate, strand specificity, GC bias, rRNA content, regions of alignment (exon, intron and intragenic), and the count of detectable transcripts. An example of strand-specific RNA-seq from a breast cancer sample is given in Figure 2.1.

Gene expression

Paired-end RNA-seq data were aligned to the human reference genome hg19 with decoy sequence (Hs37D5) using STAR-2.3.0e [83]. Sequence reads mapped to each gene were counted by HTSeq-count [84] with GENCODE annotation version 19. HTSeq-count was specifically designed for gene-level counting based on exon models in order to analyze differentially expressed genes. Reads mapping ambiguously to more than one gene were not considered for downstream analysis. Gene-level counts were then converted to reads per kilobase per million mapped reads (RPKM). Gene length was calculated as the total non-redundant cumulative exon length of a gene in GENCODE annotation version 19.
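The RPKM normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names and the exon coordinates in the usage note are hypothetical, exons are assumed to be 1-based inclusive intervals on one chromosome, and the choice of denominator (all mapped reads versus reads assigned to genes) is left to the caller because the text does not specify it.

```python
def merged_exon_length(exons):
    """Total non-redundant exon length of a gene.

    `exons` is an iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping exons from different transcripts are counted only once.
    """
    total = 0
    covered_until = None  # last base already counted
    for start, end in sorted(exons):
        if covered_until is None or start > covered_until:
            # Exon starts beyond everything counted so far.
            total += end - start + 1
            covered_until = end
        elif end > covered_until:
            # Exon overlaps the counted region; add only the new bases.
            total += end - covered_until
            covered_until = end
    return total


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)
```

For instance, `merged_exon_length([(100, 199), (150, 299), (400, 499)])` counts the overlap between the first two exons once, and the resulting length can be passed to `rpkm` together with the HTSeq-count value for that gene.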

Figure 2.2: Overview of data flow in RNA-seq analysis. High-confidence fusion candidates were selected by confFuse and then validated in silico by BLAT, followed by experimental validation. [Flowchart labels: Paired-end RNA-seq; STAR; FusionMap; SOAPfuse; deFuse; Quality control (RNA-SeQC); Gene-level count (HTSeq); RPKM; Mutant allele expression; confFuse; BLAT validation; Experimental validation.]

Integration of SNVs and RNA-seq data

For breast tumor samples where both WES and RNA-seq data were available, DNA variant positions (somatic mutation positions) were annotated with RNA mapped reads (RNA BAM files). RNA aligned reads containing the same variant (tumor allele) were required to call a candidate DNA tumor allele expressed. Expressed tumor alleles with fewer than ten RNA-sequencing reads mapped at the position were not used in further analysis.
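The two rules above (RNA support for the tumor allele, and a minimum of ten RNA reads at the position) can be sketched as a small classifier. This is a hedged illustration only: the function name and return labels are hypothetical, the read counts are assumed to have been extracted from the RNA BAM file beforehand, and "ten reads mapped at the position" is interpreted here as total RNA depth rather than tumor-allele depth.

```python
MIN_RNA_DEPTH = 10  # threshold stated in the text


def classify_tumor_allele(alt_reads, total_reads, min_depth=MIN_RNA_DEPTH):
    """Classify one somatic variant position using RNA-seq read counts.

    alt_reads   -- RNA reads carrying the tumor allele at the position
    total_reads -- all RNA reads covering the position
    Returns 'not_expressed', 'excluded_low_depth' or 'expressed'.
    """
    if alt_reads < 0 or total_reads < alt_reads:
        raise ValueError("inconsistent read counts")
    if alt_reads == 0:
        # No RNA read supports the tumor allele.
        return "not_expressed"
    if total_reads < min_depth:
        # Tumor allele is seen, but coverage is too low to be used further.
        return "excluded_low_depth"
    return "expressed"
```

A position with 4 tumor-allele reads out of 25 would be labeled `expressed`, whereas 2 out of 6 would be set aside as `excluded_low_depth`.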

Fusion gene detection and validation

Three tools (FusionMap [65], SOAPfuse [66] and deFuse [50]) were used to detect fusion genes from paired-end RNA-seq data. The versions used were SOAPfuse-v1.26, deFuse-0.6.1 and FusionMap-2015-03-31. The human genome reference hg19/GRCh37 was used in deFuse and SOAPfuse. Genome reference Human.B37.3 and gene model Ensembl.R75 were used in FusionMap. In addition, I developed a new scoring algorithm, confFuse, to reliably select high-confidence fusion genes (details in the next section). After validation in silico by BLAT, experimental validation of fusion candidates was carried out using RT-PCR followed by Sanger sequencing (footnote 5). Primers for RT-PCR validation were designed using Primer3 [85].

Pathway analysis

Gene lists in different pathways were based on KEGG (Kyoto Encyclopedia of Genes and Genomes). For fusion genes identified by confFuse, pathway figures were generated with Ingenuity Pathway Analysis.

Overview of data flow

An overview of data flow in RNA-seq analysis is given in Figure 2.2.

Footnote 5: RT-PCR and Sanger sequencing done by Yonghe Wu and Achim Stefan for three breast tumor samples,

2.3 “confFuse” algorithm

A great number of fusion gene detection tools/pipelines (Chapter 1.1) have been developed to

interrogate the NGS data. Those tools/pipelines consist of three major parts: mapping based

on existing alignment tools such as Bowtie and BWA; individual methods for generating fusion

candidates; and filtering algorithms to remove false positive candidates. The sensitivity of fusion

gene detection mainly depends on the mapping step and the specificity depends on the methods

of generating fusion candidates and filtering methods.

Most of those tools/pipelines generate a large number of putative fusion transcripts even

after filtering, of which most may be false positives or of low biological interest (e.g. precursor

read-through transcripts), making it hard to prioritize candidates for experimental validation.

Additional filtering methods were developed based on individual datasets in order to select

reliable fusion candidates [52, 57]. Those individual filters, however, may have a bias towards

cancer or cell type-specific artefacts. Furthermore, stringent filtering can decrease sensitivity of

true fusion detection [71]. Therefore, I developed confFuse, a new scoring algorithm that can be applied to paired-end RNA-seq data across tumor entities with both a high true positive rate and high detection accuracy.

ConfFuse was designed to rank fusion candidates based on deFuse output by assigning each

fusion candidate a confidence score, with the aim of markedly reducing the total number of

fusion candidates while retaining a high recall rate for true positives. It takes multiple features

into account, including some from the standard deFuse output and also newly generated features,

with each given a specific score weight. Those features are related to number and quality of reads

supporting a fusion, fusion structural features and sequence motif such as gene homology. The

final confidence score is the sum of the score weights of different single/combined features (initial

baseline score is 10). These parameter weightings were iteratively optimized in comparison to a

known validated fusion list, in order to achieve a balance between eliminating false positives

whilst retaining true fusions. Fusion candidates scoring between 8 and 10 are considered as

being high-confidence. The main features used to calculate the score are described below.

Split reads and spanning reads. Highly expressed fusions are easier to detect with RNA-seq, because more sequence reads are identified by fusion detection tools. Sequencing depth is another important factor affecting the number of sequence reads from fusions. High fusion expression and/or high-coverage sequencing is likely to result in a high number of split reads and spanning reads in fusion detection. Lowly expressed fusions sequenced at low depth are hard to detect, and only a few reads may support the predicted fusion. Setting a read-number threshold to filter fusions may increase specificity but can reduce sensitivity. The more uniquely mapped the spanning reads are, the stronger the evidence supporting a fusion. Considering the complexity of some genome regions, multiple mapped reads (ambiguous alignments) can still be evidence supporting true fusions. However, spanning reads mapped to repeat regions may result in false putative fusions. ConfFuse therefore assigns a positive score to fusions with a high number of split and uniquely mapped spanning reads (or a negative score otherwise).

Figure 2.3: 171 paired-end RNA-seq samples from 15 different entities. Samples per entity: Osteosarcoma (4), Glioblastoma multiforme (5), Brain control (5), Low-grade glioma (5), Ewing sarcoma (5), ATRT (6), Medulloblastoma (7), Pilocytic astrocytoma (7), ETANTR (7), Nephroblastoma (8), Pancreatic cancer (16), Sarcoma (18), Healthy blood (23), CLL (27), Breast tumor (28).

Artefact list. Fusions identified in multiple tumor types are mostly considered to be false positives or of low biological interest (e.g. read-through fusions). In total, 171 samples from 15 different entities were used to generate an artefact list of fusions identified in at least three different entities (Figure 2.3). To increase the detection accuracy, previously verified fusions were manually removed from the artefact list. Some fusions remaining in the artefact list, however, could still be true and play an important role in tumor formation in different tumor types (false negative fusions). ConfFuse assigns a negative score to fusions in the artefact list.
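The construction of the artefact list can be sketched as follows. This is an illustrative sketch rather than the code used to build the list: the function name, the representation of a fusion as a (5′ gene, 3′ gene) tuple, and the input layout are assumptions, and the gene names in the usage note are placeholders.

```python
from collections import defaultdict


def build_artefact_list(fusions_by_entity, verified_fusions=(), min_entities=3):
    """Collect fusions seen in at least `min_entities` different entities.

    fusions_by_entity -- dict mapping entity name to an iterable of
                         (gene_5p, gene_3p) tuples pooled over its samples
    verified_fusions  -- previously verified fusions to keep off the list
    """
    entities_per_fusion = defaultdict(set)
    for entity, fusions in fusions_by_entity.items():
        for fusion in fusions:
            # A set counts each entity once, however many samples carry it.
            entities_per_fusion[fusion].add(entity)

    recurrent = {
        fusion
        for fusion, entities in entities_per_fusion.items()
        if len(entities) >= min_entities
    }
    # Verified fusions are removed even if they recur across entities.
    return recurrent - set(verified_fusions)
```

With placeholder input such as `{"A": [("G1", "G2")], "B": [("G1", "G2")], "C": [("G1", "G2"), ("G3", "G4")]}`, only `("G1", "G2")` reaches three entities and is returned, unless it is passed in `verified_fusions`.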

Occurrence of fusion genes. Fusion genes with several different fusion transcripts (i.e. splice variants) in the same sample are more likely to be true positives, especially when those fusion transcripts are supported by high counts of split reads and spanning reads. ConfFuse gives a positive score to these candidates.

Read through. Two adjacent genes in the same orientation may give rise to an apparent

fusion due to read-through transcription and aberrant splicing rather than genome rearrangement.

Although some may acquire novel functions, the vast majority are expected to be false positives.

Fusions with read-through or alternative splicing are assigned a negative score.

Open reading frame. True oncogenic fusions typically preserve the open reading frame in

order to form a functional fusion protein. Fusions without an open reading frame are therefore

given a negative score.

Breakpoint of fusions. Breakpoint homology is the number of nucleotides at the fusion splice region that map equally well to the first or the second fusion partner. The location of the fusion breakpoint plays a critical role as evidence supporting true positive fusions. When the fusion splice sites are at exon boundaries, the fusion is more likely to be a true positive. A fusion may be of low biological interest when its breakpoint is located downstream of the 3′ fusion partner. Fusions with high breakpoint homology are given a negative score by confFuse, as are fusions with breakpoint locations not at exon boundaries or located downstream of the 3′ fusion partner.

The initial confidence score is 10 and the final confidence score is the sum of the score weights

of different features. The weights of single/combined features are shown in Table 2.1.

Table 2.1: The weights of single and combined features in the confFuse scoring algorithm.

Score weight  Feature

Single features
-6            Fusion in artefact list
-4            Fusion with alternative splicing between adjacent genes
-4            Fusion with read through
-4            Fusion occurs in the same gene
-4            Breakpoint locates in 3′ fusion partner downstream or UTR3p
-1.5          Fusion splice not at exon boundaries
-1            Breakpoint homology ≥ 10
-1            Without open reading frame
-1            Max of proportion of the spanning reads in 5′ gene or 3′ gene
              that span a repeat region > 0.9
-0.5          Fusion from adjacent genes
-0.5          Fusion produced by intrachromosomal rearrangement
-0.5          Max of proportion of the spanning reads in 5′ gene or 3′ gene
              that span a repeat region between 0.8 and 0.9

Combined features
-2            If (5′ gene or 3′ gene with zero detected reads) and
              (occurrences ≥ 2) and
              (span count - num multi map + split count < 30)
-1.5          If span count = num multi map
-1            If num multi map / span count > 0.8 and span count > 5
-1            If (span count - num multi map < 5) and
              (number of split reads < 5)
-0.5          If span count - num multi map < 5
+0.5          If occurrences ≥ 2 and (max proportion < 0.5) and
              non-read-through and non-alternative-splicing and
              (100 ≥ span count - num multi map + split count > 40) and
              (num multi map / span count < 0.2)
+1            If occurrences ≥ 2 and (max proportion < 0.5) and
              non-read-through and non-alternative-splicing and
              (100 ≥ span count - num multi map + split count > 40) and
              (num multi map / span count < 0.2) and
              (location of breakpoint in 5′ gene in coding region)
+1.5          If occurrences ≥ 2 and (max proportion < 0.5) and
              non-read-through and non-alternative-splicing and
              (span count - num multi map + split count > 100) and
              (num multi map / span count < 0.9)
+1.5          If occurrences ≥ 2 and (max proportion < 0.5) and
              non-read-through and non-alternative-splicing and
              (100 ≥ span count - num multi map + split count > 40) and
              (num multi map / span count < 0.2) and
              (location of breakpoint in 5′ gene in coding region) and
              (location of breakpoint in 3′ gene in coding or upstream)
+2            If occurrences ≥ 2 and (max proportion < 0.5) and
              non-read-through and non-alternative-splicing and
              (span count - num multi map + split count > 100) and
              (num multi map / span count < 0.9) and
              (location of breakpoint in 5′ gene in coding region)
+2.5          If occurrences ≥ 2 and (max proportion < 0.5) and
              non-read-through and non-alternative-splicing and
              (span count - num multi map + split count > 100) and
              (num multi map / span count < 0.9) and
              (location of breakpoint in 5′ gene in coding region) and
              (location of breakpoint in 3′ gene in coding or upstream)

num multi map: number of multiple mapped spanning reads
span count: number of spanning reads supporting the fusion
split count: number of split reads supporting the fusion prediction
occurrences: occurrences of fusion gene pairs
max proportion: max of proportion of the spanning reads in 5′ gene or 3′ gene that span a repeat region
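To make the additive scoring concrete, the sketch below applies the baseline of 10, the single-feature weights, and four of the read-support penalties from Table 2.1. It is an illustration under stated assumptions, not the confFuse implementation: the flag names and function names are hypothetical, every matching rule is assumed to contribute to the sum, the boundaries of "between 0.8 and 0.9" are assumed inclusive of 0.8, and the zero-detected-reads penalty and all positive combined rules are left out because the text does not say how the nested rules interact.

```python
BASELINE_SCORE = 10.0
HIGH_CONFIDENCE_MIN = 8.0

# Single-feature weights from Table 2.1; the key names are hypothetical.
SINGLE_WEIGHTS = {
    "in_artefact_list": -6.0,
    "alt_splicing_adjacent_genes": -4.0,
    "read_through": -4.0,
    "same_gene": -4.0,
    "breakpoint_3p_downstream_or_utr3p": -4.0,
    "splice_not_at_exon_boundary": -1.5,
    "breakpoint_homology_ge_10": -1.0,
    "no_open_reading_frame": -1.0,
    "adjacent_genes": -0.5,
    "intrachromosomal": -0.5,
}


def repeat_penalty(max_proportion):
    """Penalty for spanning reads that fall in repeat regions."""
    if max_proportion > 0.9:
        return -1.0
    if max_proportion >= 0.8:  # assumption: 0.8 itself is included
        return -0.5
    return 0.0


def read_support_penalty(span_count, num_multi_map, split_count):
    """Sum of four negative read-support rules (assumed to be additive)."""
    unique_span = span_count - num_multi_map
    penalty = 0.0
    if span_count == num_multi_map:
        penalty -= 1.5
    if span_count > 5 and num_multi_map / span_count > 0.8:
        penalty -= 1.0
    if unique_span < 5 and split_count < 5:
        penalty -= 1.0
    if unique_span < 5:
        penalty -= 0.5
    return penalty


def confidence_score(flags, max_proportion, span_count, num_multi_map, split_count):
    """Baseline plus the weights of all matching features in this sketch."""
    score = BASELINE_SCORE
    score += sum(w for name, w in SINGLE_WEIGHTS.items() if flags.get(name, False))
    score += repeat_penalty(max_proportion)
    score += read_support_penalty(span_count, num_multi_map, split_count)
    return score


def is_high_confidence(score):
    # Bonuses are omitted here, so the score never exceeds the baseline of 10.
    return score >= HIGH_CONFIDENCE_MIN
```

A candidate with no flagged features, 30 spanning reads (2 of them multi-mapped) and 20 split reads keeps the baseline score, while a read-through candidate between adjacent genes on the same chromosome drops below the high-confidence range.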

Results

3.1 CNVs and SNVs/Indels in breast cancer
