Introduction to NGS data analysis
Jeroen F. J. Laros
Leiden Genome Technology Center, Department of Human Genetics, Center for Human and Clinical Genetics
Sequencing Illumina platforms
Figure 1: HiSeq 2500.
Characteristics:
• High throughput (3 genomes).
• Paired end.
• High accuracy.
• Read length 2 × 125bp.
• Relatively long run time (6 days).
• Relatively expensive.
Sequencing
Illumina platforms
Characteristics:
• Moderate throughput (3 exomes).
• Paired end.
• High accuracy.
• Read length 2 × 300bp.
• Relatively short run time (3 days).
• Relatively expensive.
Sequencing Illumina platforms
Figure 3: Flowcell.
Figure 4: Amplification.
Optical system.
• DNA fragments are loaded on a flowcell.
• Fragments are amplified.
• Clusters of fragments give the same signal.
Sequencing Illumina platforms
Figure 5: Image of tile (part of the flowcell).
Sequencing Illumina platforms
Figure 7: Paired end.
Sequencing
Illumina platforms
Advantages of paired end sequencing:
• Mapping to non-unique regions.
• Structural variation.
• De novo assembly (scaffolding).
Sequencing Life-Tech
Figure 8: Ion Torrent.
Characteristics:
• Low throughput.
• Single end (for now).
• High accuracy.
• Read length ±400bp.
• Short run time (1 day).
• Cheap runs.
Sequencing Life-Tech
Figure 9: Ion Proton.
Characteristics:
• Moderate throughput (1 exome).
• Single end (for now).
• High accuracy.
• Read length ±175bp.
• Short run time (1 day).
• Cheap runs.
Sequencing Life-Tech
Figure 10: Ion Torrent chip.
Figure 11: Ion Proton chip.
Sequencing Life-Tech
Figure 13: Loading a sample.
Figure 14: Base calling.
Problems with mononucleotide stretches (homopolymers).
Sequencing Pacific Biosciences
Characteristics:
• Long reads (50% are longer than 10,000 bp).
• High error rate (15-20%).
• Low throughput (comparable with the Roche 454).
Circular consensus sequencing.
• Sequence the same molecule several times.
• Extremely high accuracy.
• Acceptable read length (±250bp).
Sequencing Pacific Biosciences
Figure 16: Real time polymerase monitoring.
Sequencing Pacific Biosciences
Figure 17: Base calling on the PacBio.
Data analysis
Next generation sequencing data
@SGGPP:4:101
TTCGGGGGCTGGCAAATCCACTTCCGTGACACGCTACCATTCGCTGGTGGT
+
-'+4589,53330-0&07+03:54/2362-+.488587>@/25440++0(+
@SGGPP:4:102
CGGTAAACCACCCTGCTGACGGAACCCTAATGCGCCTGAAAGACAGCGTTC
+
34/--0'+.000(.55:;:99(0(+2(22(0316;185;;0;:<<>=AA59
@SGGPP:4:106
TCGTTAACGACTTTGTTCGCCACCGCAACCGCCTGTTTCGGGTCACAGGCA
+
09875;5?<;?@A4?B:BBB<AA>CCC>C>BB0.->=0488+3444:@5@<
@SGGPP:4:112
TTGATGAATATATTATTTCAGGGAATAATTATGACACCTTTAGAACGCATT
+
70<<@::5:<;==7;>>/79<:.:494.8(,,8:753/5@5??C>B???B7
Listing 1: A FastQ file.
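A FastQ record such as those in Listing 1 consists of four lines: an identifier starting with @, the sequence, a separator line starting with +, and one quality character per base. The sketch below reads such records and decodes the quality string. It assumes the Phred+33 (Sanger) encoding, which is not stated on the slide, and the function names are illustrative only.

```python
import io
from typing import Iterator, List, Tuple


def read_fastq(handle) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each four-line FastQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FastQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality differ in length: " + header)
        yield header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (assumed offset: 33)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# A small made-up record, not taken from Listing 1.
example = io.StringIO("@read1\nACGT\n+\n+5?I\n")
records = list(read_fastq(example))
name, sequence, quality = records[0]
scores = phred_scores(quality)
```

Under this assumption a Phred score of 20 corresponds to an error probability of 1 in 100, and a score of 30 to 1 in 1000.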
Data analysis
Common data analysis techniques
Reference based techniques:
• Resequencing.
  • Exome sequencing.
  • Transcriptome sequencing (RNA).
  • Full genome sequencing.
• Structural variation.
• Copy number variation.
Not reference based:
• De novo assembly.
• k-mer profiling.
Resequencing
Data analysis
Resequencing pipelines can roughly be divided into five steps.
1. Pre-alignment.
  • Quality control.
  • Data cleaning.
2. Alignment.
  • Post-alignment quality control.
3. Variant calling.
4. Filtering.
  • Post-variant calling quality control.
5. Annotation.
Pre-alignment
Data cleaning
Depending on the sequencing platform, parts of the reads need to be removed.
• Remove linker sequences (Cutadapt, FASTX toolkit).
• Clip low quality bases at the end of the read (Sickle, Trimmomatic, FASTX toolkit).
• Length filtering (Fastools).
http://code.google.com/p/cutadapt/
http://hannonlab.cshl.edu/fastx_toolkit/
https://github.com/najoshi/sickle
http://www.usadellab.org/cms/index.php?page=trimmomatic
https://pypi.python.org/pypi/fastools
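The clipping and length filtering steps can be illustrated with a minimal sketch. It is a simplified stand-in, not the algorithm of Sickle, Trimmomatic or the FASTX toolkit (which use sliding windows and other refinements). The Phred+33 offset, the thresholds and the function names are assumptions made for this example.

```python
from typing import Optional, Tuple


def trim_3prime(sequence: str, quality: str, min_score: int = 20,
                offset: int = 33) -> Tuple[str, str]:
    """Remove bases from the 3' end while their Phred score is below min_score."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]


def clean_read(sequence: str, quality: str, min_score: int = 20,
               min_length: int = 3) -> Optional[Tuple[str, str]]:
    """Trim the read, then drop it (return None) if it became too short."""
    sequence, quality = trim_3prime(sequence, quality, min_score)
    if len(sequence) < min_length:
        return None
    return sequence, quality


# Made-up reads: 'I' is Q40, '5' is Q20, '+' is Q10 and '#' is Q2.
kept = clean_read("ACGTACGT", "IIIII5+#")
dropped = clean_read("ACGT", "I+##")
```

For paired end data, a read that is dropped here would also require its mate to be handled, which this sketch leaves out.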
Pre-alignment
Trimming
Pre-alignment Clipping
Figure 19: Sequencing linkers.
Pre-alignment
Quality control
The FastQC toolkit can be used for quality control (both before and after the data cleaning step).
• GC content.
• GC distribution.
• Quality scores distribution.
• . . .
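To make the first two metrics concrete: GC content is the fraction of G and C bases in a read, and the GC distribution is a histogram of that fraction over all reads. The following is a rough sketch of the idea, not FastQC's implementation; the bin count and function names are chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases among the called (non-N) bases of one read."""
    counts = Counter(sequence.upper())
    called = sum(counts[base] for base in "ACGT")
    if called == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / called


def gc_distribution(reads: Iterable[str], bins: int = 10) -> List[int]:
    """Histogram of per-read GC content with equally wide bins."""
    histogram = [0] * bins
    for read in reads:
        # A GC content of exactly 1.0 is counted in the last bin.
        index = min(int(gc_content(read) * bins), bins - 1)
        histogram[index] += 1
    return histogram


reads = ["GGCC", "ATAT", "ACGT", "GCNA"]
fractions = [gc_content(read) for read in reads]
histogram = gc_distribution(reads, bins=4)
```

A distribution with more than one peak can hint at contamination or a mixed sample, which is why it is worth checking before and after cleaning.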
Pre-alignment Example QC output
Figure 20: Per base sequence content.
Figure 21: Per sequence quality.
Alignment
The best match to the reference genome
Figure 22: Visualisation of
Very efficient.
• The reference genome needs to be indexed.
• Finding an alignment is as easy as looking up a word in a dictionary.
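The dictionary analogy can be sketched with a plain k-mer index: every k-mer of the reference is stored with its positions, and a read is located by looking up its first k-mer. This is a deliberately simple substitute. Aligners such as BWA and Bowtie use an index based on the Burrows-Wheeler transform and tolerate mismatches, which this exact-match sketch does not. All names and the toy reference are made up.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position in range(len(reference) - k + 1):
        index[reference[position:position + k]].append(position)
    return index


def find_exact(read: str, reference: str, index: Dict[str, List[int]],
               k: int) -> List[int]:
    """Look up the first k-mer of the read, then verify the rest of the read."""
    hits = []
    for position in index.get(read[:k], []):
        if reference[position:position + len(read)] == read:
            hits.append(position)
    return hits


reference = "ACGTACGTTAGC"
index = build_index(reference, k=4)
unique_hit = find_exact("CGTTAG", reference, index, k=4)
repeat_hits = find_exact("ACGT", reference, index, k=4)
no_hit = find_exact("GGGGGG", reference, index, k=4)
```

The index is built once per reference and reused for every read. The read ACGT matches at two positions, a small instance of the non-unique mapping problem.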
Alignment
Choose an aligner
Not all aligners can deal with indels.
• Only a couple of years ago, only SNPs were considered.
  • Bowtie.
Few aligners can work with large deletions.
• Spliced RNA.
  • GMAP/GSNAP.
  • Tophat.
• BWA-MEM.
http://bowtie-bio.sourceforge.net/index.shtml
http://research-pub.gene.com/gmap/
http://tophat.cbcb.umd.edu/
http://bio-bwa.sourceforge.net/
Alignment
Choose an aligner
The choice of aligner may be restricted by the sequencer.
• For the Ion Torrent: TMAP.
  • Combination of three different aligners.
  • Deals with errors in homopolymer stretches.
• For the PacBio: BLASR.
https://github.com/iontorrent/TS/tree/master/Analysis/TMAP
https://github.com/PacificBiosciences/blasr
Variant calling
Consistent deviations from the reference
Variant calling
Some considerations
Things a variant caller might take into account:
• Strand balance.
• Base quality.
• Mapping quality.
  • Distribution within the reads.
• Ploidy of the organism in question.
Complicating factors:
• Pooled samples.
• RNA.
  • Allele specific expression.
  • RNA editing.
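How some of these considerations combine can be shown with a toy decision rule for a single position. It filters observations on base and mapping quality, then requires a minimum allele fraction and support from both strands. This is only a sketch: it ignores ploidy and the position of the variant within the reads, it is not how Samtools, GATK or VarScan work, and every threshold and name is an assumption.

```python
from typing import List, NamedTuple, Optional


class Observation(NamedTuple):
    base: str             # base seen in the read at this position
    forward: bool         # True if the read maps to the forward strand
    base_quality: int     # Phred score of the base
    mapping_quality: int  # Phred-scaled mapping quality of the read


def call_position(reference_base: str, pileup: List[Observation],
                  min_base_quality: int = 20, min_mapping_quality: int = 30,
                  min_fraction: float = 0.2) -> Optional[str]:
    """Return the alternative base if it is well supported, otherwise None."""
    usable = [obs for obs in pileup
              if obs.base_quality >= min_base_quality
              and obs.mapping_quality >= min_mapping_quality]
    alternatives = [obs for obs in usable if obs.base != reference_base]
    if not alternatives:
        return None
    # Most frequent non-reference base among the usable observations.
    bases = [obs.base for obs in alternatives]
    candidate = max(sorted(set(bases)), key=bases.count)
    support = [obs for obs in alternatives if obs.base == candidate]
    fraction = len(support) / len(usable)
    forward = sum(obs.forward for obs in support)
    # Strand balance: require support from both strands.
    if fraction < min_fraction or forward == 0 or forward == len(support):
        return None
    return candidate


reference_reads = [Observation("A", True, 35, 60)] * 3
both_strands = reference_reads + [
    Observation("G", True, 35, 60),
    Observation("G", False, 30, 60),
    Observation("G", False, 10, 60),  # discarded: base quality too low
]
one_strand = reference_reads + [Observation("G", True, 35, 60)] * 2
called = call_position("A", both_strands)
rejected = call_position("A", one_strand)
```

A fixed allele fraction threshold like this one breaks down for pooled samples and RNA, where the expected fraction is not known in advance.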
Variant calling
Choice of variant caller
Rules of thumb:
• Well known organism and experiment: Statistical model.
• Use a simpler variant caller otherwise.
Popular variant callers:
• Samtools.
• GATK.
• VarScan.
http://samtools.sourceforge.net/
https://www.broadinstitute.org/gatk/
http://varscan.sourceforge.net/
Variant filtering
Filtering on coverage
We can set some thresholds:
• Minimum.
• Maximum.
We filter for a maximum coverage because of copy number variation.
A good way to calculate the maximum:
• Calculate the mean coverage.
  • Only of the covered (targeted) regions.
• Multiply this number by a reasonable factor, e.g. 2.5.
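The calculation above can be written out as a short sketch. It assumes that per-position depths for the targeted regions are already available as a list, and it reads "covered" as a depth greater than zero. The minimum threshold of 8 and the variant records are made up for the example.

```python
from typing import Dict, Iterable, List


def maximum_coverage(target_coverage: Iterable[int], factor: float = 2.5) -> float:
    """Mean coverage over the covered target positions, times a factor."""
    covered = [depth for depth in target_coverage if depth > 0]
    if not covered:
        raise ValueError("no covered positions")
    return factor * sum(covered) / len(covered)


def filter_on_coverage(variants: List[Dict], minimum: int,
                       maximum: float) -> List[Dict]:
    """Keep variants whose depth lies within [minimum, maximum]."""
    return [v for v in variants if minimum <= v["depth"] <= maximum]


coverage = [0, 0, 20, 40, 60, 40]
maximum = maximum_coverage(coverage)
variants = [
    {"pos": 3, "depth": 5},
    {"pos": 4, "depth": 45},
    {"pos": 5, "depth": 300},
]
kept = filter_on_coverage(variants, minimum=8, maximum=maximum)
```

Here the mean over the four covered positions is 40, so the maximum becomes 100. Including the uncovered positions would lower the mean and make the maximum too strict.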
Annotation
What is already known about a variant
A selection of SeattleSeq annotation:
• Is the variant known?
• Does it hit a gene?
  • Is it in an intron?
  • Does it hit a splice site?
  • Is it in the coding region?
    • Is there a gain/loss of a stop codon?
    • Does the variant result in a frameshift?
    • . . .
  • Is it in the 5'/3' UTR of a gene?
  • . . .
• Is it in a regulatory region?
• . . .
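Two of the coding region questions can be answered with very little logic, as the sketch below illustrates. It assumes the variant is already known to lie in a coding region, that alleles are given as plain strings and that the affected codon has been extracted. It uses the standard genetic code and is not SeattleSeq's implementation; the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def is_frameshift(reference_allele: str, alternative_allele: str) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return (len(alternative_allele) - len(reference_allele)) % 3 != 0


def stop_effect(reference_codon: str, alternative_codon: str) -> str:
    """Classify a codon change as 'stop gained', 'stop lost' or 'none'."""
    was_stop = reference_codon.upper() in STOP_CODONS
    is_stop = alternative_codon.upper() in STOP_CODONS
    if is_stop and not was_stop:
        return "stop gained"
    if was_stop and not is_stop:
        return "stop lost"
    return "none"


insertion_of_one = is_frameshift("A", "AT")
deletion_of_three = is_frameshift("ACGT", "A")
gained = stop_effect("TGG", "TGA")
lost = stop_effect("TAA", "CAA")
```

The hard part of annotation is not this logic but obtaining the right transcript, strand and codon position, which is what annotation services provide.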
Full genome sequencing Copy number variation
Figure 24: Coverage patterns over a whole chromosome.
Full genome sequencing
Copy number variation
Per sample:
• The reference needs to be very good.
• Sequencability biases.
• Mapping biases.
Within a population:
• Mixture of distributions.
• Not sensitive to aforementioned biases.
• Needs a lot of controls.
Full genome sequencing Structural variation
Figure 26: Discordant and split reads.
http://breakdancer.sourceforge.net/
http://sourceforge.net/projects/pindel/
De Novo assembly Assembly
Figure 28: General overview of de novo assembly.
De Novo assembly Assembly
Figure 29: Overlaps are used to reconstruct a genome.
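The idea of reconstructing a sequence from overlaps can be sketched as follows: find the longest suffix of one read that equals a prefix of the next, and join the two on that overlap. Real assemblers build overlap or de Bruijn graphs, handle sequencing errors and repeats, and do not know the read order in advance. Here the reads are error free and already ordered, and all names and sequences are made up.

```python
def overlap(left: str, right: str, minimum: int = 3) -> int:
    """Length of the longest suffix of left that equals a prefix of right."""
    for length in range(min(len(left), len(right)), minimum - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge(left: str, right: str, minimum: int = 3) -> str:
    """Join two reads on their overlap; raise if they do not overlap enough."""
    length = overlap(left, right, minimum)
    if length == 0:
        raise ValueError("reads do not overlap")
    return left + right[length:]


reads = ["ACGTTGCA", "TGCAGGTC", "GGTCAATG"]
contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)
```

The minimum overlap length guards against joining reads on a chance match of a few bases.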
De Novo assembly Scaffolding
Figure 30: Paired end or mate pair reads can be used.
De Novo assembly
Easier assembly with PacBio
Pipelines Pipelines
Figure 32: A real-life pipeline.
Pipelines
Pipelines
Combining tools:
• The output of one tool can serve as the input for another.
• Not necessarily linear.
• . . .
Running various different tools:
• Two or three different aligners.
• A couple of variant callers.
• . . .
Pipelines Combining tools
# $reference: reference indexed with BWA, $i: FastQ file with single end reads.
# Align the reads, convert the result to SAM, then convert SAM to BAM.
bwa aln -t 8 $reference $i > $i.sai
bwa samse $reference $i.sai $i > $i.sam
samtools view -bt $reference -o $i.bam $i.sam
Listing 2: Shell script.
# The same three steps written as Makefile pattern rules.
%.sai: %.fq
	$(BWA) aln -t $(THREADS) $(call MKREF, $@) $< > $@

%.sam: %.sai %.fq
	$(BWA) samse $(call MKREF, $@) $^ > $@

%.bam: %.sam
	$(SAMTOOLS) view -bt $(call MKREF, $@) -o $@ $<
Graphical interfaces
Galaxy
Galaxy: a graphical user interface:
• Wrapper for command line utilities.
• User friendly.
• Point and click.
• Workflows.
  • Save all the steps you did in your analysis.
  • Rerun the entire analysis on a new dataset.
  • Share your workflow with other people.
  • . . .
http://galaxy.psu.edu/
Graphical interfaces Galaxy
Figure 34: Galaxy main user interface
Graphical interfaces Workflow of a parallel pipeline
Makefile blueprint
Figure 36: Dependency diagram (nodes are Makefile targets, e.g. MC0184ACXX_s_1.sorted.dedup.bam and MC0184ACXX_s_1.final.vcf).