Introduction to NGS data analysis

(1)

Introduction to NGS data analysis

Jeroen F. J. Laros

Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics

(2)

Sequencing Illumina platforms

Figure 1: HiSeq 2500.

Characteristics:

•

High throughput (3 genomes).

•

Paired end.

•

High accuracy.

•

Read length 2 × 125bp.

•

Relatively long run time (6 days).

•

Relatively expensive.

(3)

Sequencing

Illumina platforms

Characteristics:

•

Moderate throughput (3 exomes).

•

Paired end.

•

High accuracy.

•

Read length 2 × 300bp.

•

Relatively short run time (3 days).

•

Relatively expensive.

(4)

Sequencing Illumina platforms

Figure 3: Flowcell.

Figure 4: Amplification.

(5)

Sequencing Illumina platforms

Figure 3: Flowcell.

Figure 4: Amplification.

Optical system.

•

DNA fragments are loaded on a flowcell.

•

Fragments are amplified.

•

Clusters of fragments

give the same signal.

(6)

Sequencing Illumina platforms

Figure 5: Image of tile (part of the flowcell).

(7)

Sequencing

Illumina platforms

(8)

Sequencing Illumina platforms

Figure 7: Paired end.

(9)

Sequencing

Illumina platforms

Advantages:

•

Mapping to non-unique regions.

•

Structural variation.

•

De novo assembly

(scaffolding).

(10)

Sequencing Life-Tech

Figure 8: Ion torrent.

Characteristics:

•

Low throughput.

•

Single end (for now).

•

High accuracy.

•

Read length ±400bp.

•

Short run time (1 day).

•

Cheap runs.

(11)

Sequencing Life-Tech

Figure 9: Ion proton.

Characteristics:

•

Moderate throughput (1 exome).

•

Single end (for now).

•

High accuracy.

•

Read length ±175bp.

•

Short run time (1 day).

•

Cheap runs.

(12)

Sequencing Life-Tech

Figure 10: Ion Torrent chip. Figure 11: Ion Proton chip.

(13)

Sequencing

Life-Tech

(14)

Sequencing Life-Tech

Figure 13: Loading a sample.

Figure 14: Base calling.

Problems with mono nu-

cleotide stretches.

(15)

Sequencing

Pacific Biosciences

(16)

Sequencing

Pacific Biosciences Characteristics:

•

Long reads (50% is longer than 10,000 bp).

•

High error rate (15-20%).

•

Low throughput (comparable with the Roche 454).

(17)

Sequencing

Pacific Biosciences Characteristics:

•

Long reads (50% is longer than 10,000 bp).

•

High error rate (15-20%).

•

Low throughput (comparable with the Roche 454).

Circular consensus sequencing.

•

Sequence the same molecule several times.

•

Extremely high accuracy.

•

Acceptable read length (±250bp).

(18)

Sequencing Pacific Biosciences

Figure 16: Real time polymerase monitoring.

(19)

Sequencing Pacific Biosciences

Figure 17: Base calling on the PacBio.

(20)

Data analysis

Next generation sequencing data

1 @SGGPP: 4 : 1 0 1

2 TTCGGGGGCTGGCAAATCCACTTCCGTGACACGCTACCATTCGCTGGTGGT

3 +

4 − ’+4589 ,53330 −0&07+03:54/2362−+.488587>@/25440++0(+

5 @SGGPP: 4 : 1 0 2

6 CGGTAAACCACCCTGCTGACGGAACCCTAATGCGCCTGAAAGACAGCGTTC

7 +

8 34/ − −0 ’+.000(.55:;:99(0(+2(22(0316;185;;0;: < < >=AA59

9 @SGGPP: 4 : 1 0 6

10 TCGTTAACGACTTTGTTCGCCACCGCAACCGCCTGTTTCGGGTCACAGGCA

11 +

12 09875;5? <;?@A4?B :BBB<AA>CCC>C>BB0.−>=0488+3444:@5@<

13 @SGGPP: 4 : 1 1 2

14 TTGATGAATATATTATTTCAGGGAATAATTATGACACCTTTAGAACGCATT

15 +

16 70<<@: : 5 : < ; = = 7 ; > > / 7 9 < : . : 4 9 4 . 8 ( , , 8 : 7 5 3 / 5@5? ?C>B? ? ?B7 Listing 1:A FastQ file.

(21)

Data analysis

Common data analysis techniques Reference based techniques:

•

Resequencing.

• Exome sequencing.

• Transcriptome sequencing (RNA).

• Full genome sequencing.

•

Structural variation.

•

Copy number variation.

Not reference based:

•

De novo assembly.

•

k-mer profiling.

(22)

Resequencing Data analysis

Resequencing pipelines can roughly be divided in five steps.

(23)

Resequencing

Data analysis

Resequencing pipelines can roughly be divided in five steps.

1. Pre-alignment.

• Quality control.

• Data cleaning.

(24)

Resequencing

Data analysis

Resequencing pipelines can roughly be divided in five steps.

1. Pre-alignment.

• Data cleaning.

2. Alignment.

• Post-alignment quality control.

(25)

Resequencing

Data analysis

Resequencing pipelines can roughly be divided in five steps.

1. Pre-alignment.

• Data cleaning.

2. Alignment.

3. Variant calling.

(26)

Resequencing

Data analysis

Resequencing pipelines can roughly be divided in five steps.

1. Pre-alignment.

• Data cleaning.

2. Alignment.

3. Variant calling.

4. Filtering.

• Post-variant calling quality control.

(27)

Resequencing

Data analysis

Resequencing pipelines can roughly be divided in five steps.

1. Pre-alignment.

• Data cleaning.

2. Alignment.

3. Variant calling.

4. Filtering.

• Post-variant calling quality control.

5. Annotation.

(28)

Pre-alignment

Data cleaning

Depending on the sequencing platform, parts of the reads need to be removed.

•

Remove linker sequences (Cutadapt, FASTX toolkit).

•

Clip low quality reads at the end of the read (Sickle, Trimmomatic, FASTX toolkit).

•

Length filtering (Fastools).

http://code.google.com/p/cutadapt/

http://hannonlab.cshl.edu/fastx toolkit/

https://github.com/najoshi/sickle

http://www.usadellab.org/cms/index.php?page=trimmomatic https://pypi.python.org/pypi/fastools

(29)

Pre-alignment

Trimming

(30)

Pre-alignment Clipping

Figure 19: Sequencing linkers.

(31)

Pre-alignment

Quality control

The FastQC toolkit can be used for quality control (both be- fore and after the data cleaning step).

•

GC content.

•

GC distribution.

•

Quality scores distribution.

•

. . .

(32)

Pre-alignment Example QC output

Figure 20: Per base sequence content.

Figure 21: Per sequence quality.

(33)

Alignment

The best match to the reference genome

Figure 22: Visualisation of

Very efficient.

•

The reference genome needs to be indexed.

•

Finding an alignment

is as easy as looking

up a word in a

dictionary.

(34)

Alignment

Choose an aligner

Not all aligners can deal with indels.

•

Only a couple of years ago, only SNPs were considered.

• Bowtie.

http://bowtie-bio.sourceforge.net/index.shtml http://research-pub.gene.com/gmap/

http://tophat.cbcb.umd.edu/

http://bio-bwa.sourceforge.net/

(35)

Alignment

Choose an aligner

Not all aligners can deal with indels.

•

Only a couple of years ago, only SNPs were considered.

• Bowtie.

Few aligners can work with large deletions.

•

Spliced RNA.

• GMAP/ GSNAP.

• Tophat.

•

BWA-MEM.

http://bowtie-bio.sourceforge.net/index.shtml http://research-pub.gene.com/gmap/

http://tophat.cbcb.umd.edu/

(36)

Alignment

Choose an aligner

The choice of aligner may be restricted by the sequencer.

•

For the Ion Torrent: Tmap.

• Combination of three different aligners.

• Deals with errors in homopolymer stretches.

•

For the PacBio: BLASR.

https://github.com/iontorrent/TS/tree/master/Analysis/TMAP https://github.com/PacificBiosciences/blasr

(37)

Variant calling

Consistent deviations from the reference

(38)

Variant calling

Some considerations

Things a variant caller might take into account:

•

Strand balance.

•

Base quality.

•

Mapping quality.

• Distribution within the reads.

•

Ploidity of the organism in question.

(39)

Variant calling

Some considerations

Things a variant caller might take into account:

•

Strand balance.

•

Base quality.

•

Mapping quality.

•

Ploidity of the organism in question.

Complicating factors:

•

Pooled samples.

(40)

Variant calling

Some considerations

Things a variant caller might take into account:

•

Strand balance.

•

Base quality.

•

Mapping quality.

•

Ploidity of the organism in question.

Complicating factors:

•

Pooled samples.

•

RNA.

• Allele specific expression.

• RNA editing.

(41)

Variant calling

Some considerations

Things a variant caller might take into account:

•

Strand balance.

•

Base quality.

•

Mapping quality.

•

Ploidity of the organism in question.

Complicating factors:

•

Pooled samples.

•

RNA.

• Allele specific expression.

• RNA editing.

(42)

Variant calling

Choice of variant caller Rules of thumb:

•

Well known organism and experiment: Statistical model.

•

Use a simpler variant caller otherwise.

http://samtools.sourceforge.net/

https://www.broadinstitute.org/gatk/

http://varscan.sourceforge.net/

(43)

Variant calling

Choice of variant caller Rules of thumb:

•

Well known organism and experiment: Statistical model.

•

Use a simpler variant caller otherwise.

Popular variant callers:

•

Samtools.

•

GATK.

•

VarScan.

http://samtools.sourceforge.net/

https://www.broadinstitute.org/gatk/

(44)

Variant filtering

Filtering on coverage We can set some thresholds:

•

Minimum.

•

Maximum.

(45)

Variant filtering

Filtering on coverage We can set some thresholds:

•

Minimum.

•

Maximum.

We filter for a maximum coverage because of copy number

variation.

(46)

Variant filtering

Filtering on coverage We can set some thresholds:

•

Minimum.

•

Maximum.

We filter for a maximum coverage because of copy number variation.

A good way to calculate the maximum:

•

Calculate the mean coverage.

• Only of the covered (targeted) regions.

•

Multiply this number with a reasonable factor e.g., 2.5.

(47)

Annotation

What is already known about a variant A selection of SeattleSeq annotation:

•

Is the variant known?

•

Does it hit a gene?

(48)

Annotation

What is already known about a variant A selection of SeattleSeq annotation:

•

Is the variant known?

•

Does it hit a gene?

• Is it in an intron?

• Does it hit a splice site?

(49)

Annotation

What is already known about a variant A selection of SeattleSeq annotation:

•

Is the variant known?

•

Does it hit a gene?

• Is it in the coding region?

• Is there a gain/loss of a stop codon?

• Does the variant result in a frameshift?

• . . .

(50)

Annotation

What is already known about a variant A selection of SeattleSeq annotation:

•

Is the variant known?

•

Does it hit a gene?

• . . .

• Is it in the 5’/3’ UTR of a gene?

• . . .

(51)

Annotation

What is already known about a variant A selection of SeattleSeq annotation:

•

Is the variant known?

•

Does it hit a gene?

• . . .

• Is it in the 5’/3’ UTR of a gene?

• . . .

•

Is it in a regulatory region?

•

. . .

(52)

Full genome sequencing Copy number variation

Figure 24: Coverage patterns over a whole chromosome.

(53)

Full genome sequencing

Copy number variation Per sample:

•

The reference needs to be very good.

•

Sequencability biases.

•

Mapping biases.

(54)

Full genome sequencing

Copy number variation Per sample:

•

The reference needs to be very good.

•

Sequencability biases.

•

Mapping biases.

Within a population:

•

Mixture of distributions.

•

Not sensitive to aforementioned biases.

•

Needs a lot of controls.

(55)

Full genome sequencing

Structural variation

(56)

Full genome sequencing Structural variation

Figure 26: Discordant and split reads.

http://breakdancer.sourceforge.net/

http://sourceforge.net/projects/pindel/

(57)

Full genome sequencing

Structural variation

(58)

De Novo assembly Assesmbly

Figure 28: General overview of de novo assembly.

(59)

De Novo assembly Assesmbly

Figure 29: Overlaps are used to reconstruct a genome.

(60)

De Novo assembly Scaffolding

Figure 30: Paired end or mate pair reads can be used.

(61)

De Novo assembly

Easier assembly with PacBio

(62)

Pipelines Pipelines

Figure 32: A real-life pipeline.

(63)

Pipelines

(64)

Pipelines

Pipelines Combining tools:

•

The output of one tool can serve as the input for another.

•

Not necessarily linear.

•

. . .

(65)

Pipelines

Pipelines Combining tools:

•

The output of one tool can serve as the input for another.

•

Not necessarily linear.

•

. . .

Running various different tools:

•

Two or three different aligners.

•

A couple of variant callers.

•

. . .

(66)

Pipelines Combining tools

1 bwa aln −t 8 $reference $i > $i . s a i

2 bwa samse $ r e f e r e n c e $ i . s a i $ i > $ i . sam

3 samtools view −bt $reference −o $i .bam $i . sam Listing 2:Shell script

(67)

Pipelines Combining tools

1 bwa aln −t 8 $reference $i > $i . s a i

2 bwa samse $ r e f e r e n c e $ i . s a i $ i > $ i . sam

3 samtools view −bt $reference −o $i .bam $i . sam Listing 2:Shell script

1 %. s a i : %. f q

2 $ (BWA) aln −t $ (THREADS) $ ( c a l l MKREF, $@) $< > $@

3

4 %.sam : %. s a i %. f q

5 $ (BWA) samse $ ( c a l l MKREF, $@) $ ˆ > $@

6

7 %.bam : %.sam

8 $ (SAMTOOLS) view −bt $ ( c a l l MKREF, $@) −o $@ $<

(68)

Graphical interfaces

Galaxy

Galaxy: a graphical user interface:

•

Wrapper for command line utilities.

•

User friendly.

•

Point and click.

http://galaxy.psu.edu/

(69)

Graphical interfaces

Galaxy

Galaxy: a graphical user interface:

•

Wrapper for command line utilities.

•

User friendly.

•

Point and click.

•

Workflows.

• Save all the steps you did in your analysis.

• Rerun the entire analysis on a new dataset.

• Share your workflow with other people.

• . . .

(70)

Graphical interfaces Galaxy

Figure 34: Galaxy main user interface

(71)

Graphical interfaces

Galaxy

(72)

Graphical interfaces Workflow of a parallel pipeline

Makefile blueprint

000.MC0184ACXX_s_1_1_sequence.part

MC0184ACXX_s_1_1_sequence.part000 all

MC0184ACXX_s_1.xysome.targeted.avg-cov

MC0184ACXX_s_1.asome_cutoff.vcf MC0184ACXX_s_1.sorted.dedup.bam.flagstat

MC0184ACXX_s_1.bcf

MC0184ACXX_s_1_1_sequence.txt.recaldata MC0184ACXX_s_1.h_cov

MC0184ACXX_s_1.xysome.vcf

MC0184ACXX_s_1.covPerTarget

MC0184ACXX_s_1.asomes_cnvhigh.bed MC0184ACXX_s_1.final.vcf

MC0184ACXX_s_1.loose-ontarget.bam.flagstat MC0184ACXX_s_1.xysomes_avgcov.bed

MC0184ACXX_s_1.pileup

MC0184ACXX_s_1.bases.strong-ontarget MC0184ACXX_s_1.strong-ontarget.bam.flagstat MC0184ACXX_s_1.asome.vcf

MC0184ACXX_s_1_1_sequence.filtered_fastqc MC0184ACXX_s_1.asome.targeted.avg-cov

MC0184ACXX_s_1.readsPerChr MC0184ACXX_s_1.final.vcf.vdb_annotation

MC0184ACXX_s_1.xysomes_cnvhigh.bed

MC0184ACXX_s_1.hsMetrics

MC0184ACXX_s_1.insertSizes

MC0184ACXX_s_1.bases.aligned

MC0184ACXX_s_1_1_sequence_fastqc

MC0184ACXX_s_1.wig MC0184ACXX_s_1.mtdna_cutoff.vcf

MC0184ACXX_s_1.final.vcf.ti-tv

MC0184ACXX_s_1.sorted.dedup.bam.bai

MC0184ACXX_s_1_2_sequence.filtered_fastqc MC0184ACXX_s_1.xysome_cutoff.vcf

MC0184ACXX_s_1.sorted.dedup.bam MC0184ACXX_s_1.insertSizesHist.pdf MC0184ACXX_s_1.final.indelonly.vcf.annotation MC0184ACXX_s_1.final.snponly.vcf.annotation

MC0184ACXX_s_1.bases.loose-ontarget MC0184ACXX_s_1.tar.gz

MC0184ACXX_s_1.mtdna_mincov.bed

MC0184ACXX_s_1_2_sequence_fastqc MC0184ACXX_s_1.asomes_avgcov.bed

MC0184ACXX_s_1.sorted.dedup.bam.pileup

MC0184ACXX_s_1.sorted.dedup.bam.sorted.dedup.bam

create_report.SECONDARY

/sureselect_whole_exome_v2.hg19.CBR.bed H.Sapiens/hg19/reference.fa

MC0184ACXX_s_1.asomes_cutoff_steepcov.bed

MC0184ACXX_s_1_1_sequence.filtered

MC0184ACXX_s_1_1_sequence.trimmed MC0184ACXX_s_1.final.snponly.vcf

Makefile

MC0184ACXX_s_1_1_sequence.txt MC0184ACXX_s_1_2_sequence.txt

/sureselect_whole_exome_v2.hg19.CBR.bed.interval

MC0184ACXX_s_1.mincov.bed MC0184ACXX_s_1.mtdna.vcf

MC0184ACXX_s_1.parts

000.MC0184ACXX_s_1_2_sequence.part MC0184ACXX_s_1.sorted.dedup.bam.sorted.bam

MC0184ACXX_s_1.sorted.dedup.bam.parts MC0184ACXX_s_1.xysomes_mincov.bed

MC0184ACXX_s_1_1_sequence.sync .SUFFIXES.DEFAULT

MC0184ACXX_s_1.final.indelonly.vcf

MC0184ACXX_s_1_2_sequence.part000

MC0184ACXX_s_1_2_sequence.sync

MC0184ACXX_s_1_2_sequence.filtered MC0184ACXX_s_1.xysomes_cutoff_steepcov.bed

MC0184ACXX_s_1_2_sequence.trimmed MC0184ACXX_s_1.asomes_mincov.bed

MC0184ACXX_s_1.sorted.bam recursiveclean

MC0184ACXX_s_1_2_sequence.txt.recaldata

Figure 36: Dependency diagram.

(73)

Graphical interfaces Workflow of a parallel pipeline

000.MC0184ACXX_s_1_1_sequence.part

MC0184ACXX_s_1_1_sequence.txt.recaldata MC0184ACXX_s_1_1_sequence.filtered_fastqc MC0184ACXX_s_1.sorted.dedup.bam

MC0184ACXX_s_1_1_sequence.filtered

MC0184ACXX_s_1_1_sequence.trimmed

MC0184ACXX_s_1_1_sequence.txt MC0184ACXX_s_1.parts

000.MC0184ACXX_s_1_2_sequence.part MC0184ACXX_s_1.sorted.dedup.bam.parts

MC0184ACXX_s_1_2_sequence.filtered MC0184ACXX_s_1.sorted.bam

MC0184ACXX_s_1_2_sequence.txt.recaldata

(74)