Introduction to next-generation sequencing data
Centre for Experimental Medicine, Queen's University Belfast
http://www.qub.ac.uk/research-centres/CEM/
David Simpson
Outline
• History of DNA sequencing
• NGS or ‘massively parallel’ sequencing
– How it works: Illumina ‘sequencing by synthesis’
– Library preparation
– Clonal amplification – future single molecule
• Characteristics of the data: Quality control
– Base calling and quality (FastQ format)
• Phasing and homopolymers
– Trimming
– Implications of PCR
• Duplicates and bias
– Contamination
Sequencing time-line
Andy Vierstraete
2014 : Illumina HiSeq X10 - $1,000 Genome?
Conventional DNA sequencing
• Dideoxy terminator
– Sanger method
• Fluorescent dyes
• Gel electrophoresis
• 1 lane = 1 sequence
• Capillary electrophoresis
http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader/sanger_method_page.htm
"G" tube:
All four dNTPs, ddGTP and DNA polymerase
"A" tube:
All four dNTPs, ddATP and DNA polymerase
"T" tube:
All four dNTPs, ddTTP and DNA polymerase
"C" tube:
All four dNTPs, ddCTP and DNA polymerase
Electropherogram
Primer
Next Generation Sequencing (NGS)
• Process millions of sequencing reads in parallel
• Common concept is the analysis of millions of sequences associated with a solid surface (or in wells)
– Contrast with traditional gel electrophoresis
• Range of platforms available
– Illustrate with Illumina
– Ion Torrent (Life Technologies/Thermo Fisher)
NGS workflow
• Library preparation (RNA or DNA)
– Fragmentation/size selection
– Addition of adaptors
• Template preparation: ‘clonal’ amplification or single molecule
– Bridge PCR on a slide (cluster generation)
– Emulsion PCR
• Sequencing
– Reversible terminator (Illumina)
– Semiconductor (Ion Torrent)
– Single molecule (Nanopore)
Overview of DNA-Seq and RNA-Seq
[Figure: genomic DNA is fragmented into a library, or RNA (with poly-A tail) is extracted and converted to a cDNA library; massively parallel sequencing gives >10 million reads, e.g. TACATTTGGGAAAAGTAAATTTGCTGAAAATAATCCCGGT ..., which are aligned to a reference sequence (Exon 1, Exon 2)]
Library preparation
http://res.illumina.com/documents/products/research_reviews/sequencing-methods-review.pdf
Illumina: Cluster generation
Clonal amplification achieved by generating clusters on the surface of a flow cell (slide)
See SBS technology video at www.illumina.com/
Massively parallel sequencing
Glowing dots on a glass slide mark cloned DNA being sequenced
Reading the sequence
• Wash over all 4 nucleotides, each labelled with a fluorescent dye
• Only the one complementary nucleotide is incorporated
Illumina: Sequencing by synthesis:
• Prepare libraries with different index sequences
• Pool and sequence together – ‘multiplexing’
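After a multiplexed run, the pooled reads have to be assigned back to their samples by index sequence. The sketch below shows the idea only; the index sequences, sample names and the `demultiplex` function are hypothetical, and real demultiplexing software usually also tolerates mismatches in the index read.

```python
from collections import defaultdict


def demultiplex(reads, sample_by_index):
    """Group reads by sample using an exact match on the index sequence.

    reads: iterable of (index_sequence, read_sequence) pairs.
    sample_by_index: dict mapping index sequence -> sample name.
    Reads with an unknown index are collected under "undetermined".
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = sample_by_index.get(index_seq, "undetermined")
        bins[sample].append(read_seq)
    return dict(bins)


# Hypothetical indexes and reads, for illustration only.
samples = {"ATCACG": "sample_1", "CGATGT": "sample_2"}
pooled = [
    ("ATCACG", "GAGCAAAATTGT"),
    ("CGATGT", "TACATTTGGGAA"),
    ("ATCACG", "AAGAAAGAAACA"),
    ("NNNNNN", "CAGAACCACATG"),
]
for sample, seqs in demultiplex(pooled, samples).items():
    print(sample, len(seqs))
```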
Platforms
• Illumina has several instruments
– Desktop-sized MiSeq that can complete smaller runs in under a day
– NextSeq 500
– High throughput HiSeq 2500
• Ion Torrent ‘semi-conductor’ sequencing (Life Technologies)
– Fast, cheap entry level, output increasing rapidly
– Personal Genome Machine
– Proton
                    HiSeq 2500        PGM 314 chip    Proton P1 chip
Total output        600/120 Gb        up to 100 Mb    10 Gb
Run time            11 days/27 hrs    2-4 hrs         2-4 hrs
Output/day          55 Gb             up to 200 Mb    ~20 Gb
Read length         2 x 100/150 bp    up to 400b      up to 200 bp
# of single reads   3/0.6 Billion     up to 0.6M      up to 82 Million
Ion torrent ‘Semiconductor sequencing’
• No optics required!
Beads with template attached (prepared by emulsion PCR)
Incorporation of a nucleotide changes pH
Detected on a semiconductor sequencing chip
Signal processing to optimise base calling
• Signal Decay
• Phase correction
– Phasing is the rate at which single molecules within a cluster lose sync with each other.
– Incomplete Extension
• Limit read length
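Why phasing limits read length can be seen with a deliberately simplified model: assume a fixed fraction of molecules in a cluster falls out of sync at every cycle, so the in-phase fraction decays geometrically. The per-cycle rate and the 50% threshold below are hypothetical values chosen for illustration, and the model ignores pre-phasing and signal decay.

```python
import math


def in_phase_fraction(cycle, phasing_rate):
    """Fraction of molecules still in sync after `cycle` cycles,
    assuming a constant per-cycle phasing rate (simplified model)."""
    return (1.0 - phasing_rate) ** cycle


def cycles_until(threshold, phasing_rate):
    """Largest number of cycles for which the in-phase fraction
    is still at or above `threshold`."""
    return math.floor(math.log(threshold) / math.log(1.0 - phasing_rate))


rate = 0.005  # hypothetical: 0.5% of molecules lose sync per cycle
for cycle in (50, 100, 150):
    print(cycle, round(in_phase_fraction(cycle, rate), 3))
print("cycles with >= 50% in phase:", cycles_until(0.5, rate))
```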
Further discussion
Ion torrent: http://biolektures.wordpress.com/2011/08/10/fundamentals-of-base-calling-part-1/
Illumina: http://pathogenomics.bham.ac.uk/blog/2013/11/diagnosing-problems-with-phasing-and-pre-phasing-on-illumina-platforms/
Read length and quality
• Per base sequence quality
• Phred quality score: Q is an integer mapping of p, the probability that the corresponding base call is incorrect (Q = -10 log10 p)
Damien Gregory: http://www.somewhereville.com/?p=1508
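The Phred mapping between Q and p is easy to check numerically. A minimal sketch using the standard definition Q = -10 log10 p; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability p to a Phred score, rounded to an integer."""
    return round(-10 * math.log10(p))


for q in (10, 20, 30, 40):
    print(f"Q{q}: p = {phred_to_error_prob(q):g}")
```

So Q20 corresponds to a 1 in 100 chance that the base call is wrong, and Q30 to 1 in 1,000.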
FASTQ format
Nucleotide sequence and associated quality scores (represented by ASCII characters)
@PSI179204_0007:4:1:1025:10482#0/1
GAGCAAAATTGTAGAAGAATTCAGGATCTCGTATGCCGTC
+PSI179204_0007:4:1:1025:10482#0/1
C-:AC:?5:C-AAA-5>-,A5A>5:A?-DD?5A::>;><B
Flowcell lane & tile
‘X’ and ‘Y’ co-ordinates of the cluster
Index of multiplex sample
Illumina:
P. J. A. Cock, C. J. Fields, N. Goto, M. L. Heuer and P. M. Rice, “The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.” Nucleic Acids Research, 2010, Vol. 38, No. 6, 1767–1771 doi:10.1093/nar/gkp1137
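The four-line record above can be pulled apart with a few lines of code. This is a sketch, not a full parser: the read-name layout (instrument, lane, tile, x, y, index, read number) follows the labels on the slide, and the quality offset of 33 (Sanger encoding) is an assumption, made because the quality string contains characters below those valid in the offset-64 variants.

```python
import re

RECORD = """@PSI179204_0007:4:1:1025:10482#0/1
GAGCAAAATTGTAGAAGAATTCAGGATCTCGTATGCCGTC
+PSI179204_0007:4:1:1025:10482#0/1
C-:AC:?5:C-AAA-5>-,A5A>5:A?-DD?5A::>;><B"""

# Assumed read-name layout: @instrument:lane:tile:x:y#index/read
HEADER = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<read>\d+)$"
)


def parse_record(text, offset=33):
    """Split one FASTQ record into name fields, sequence and Phred scores."""
    header, seq, plus, qual = text.strip().split("\n")
    if not plus.startswith("+") or len(seq) != len(qual):
        raise ValueError("malformed FASTQ record")
    match = HEADER.match(header)
    fields = match.groupdict() if match else {}
    # Each quality character encodes Q as chr(Q + offset).
    scores = [ord(char) - offset for char in qual]
    return fields, seq, scores


fields, seq, scores = parse_record(RECORD)
print(fields)
print(len(seq), min(scores), max(scores))
```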
Homopolymers (runs of the same nucleotide)
• Illumina: Flow all 4 nucleotides, incorporate single one
• Ion torrent: Sequential flows of individual unmodified nucleotides
Ionogram (Ion torrent)
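Because Ion Torrent flows one unmodified nucleotide at a time, the signal in each flow scales with the number of bases incorporated, and a homopolymer is read as one larger signal. The sketch below turns idealised flow signals into a sequence by rounding; the flow order and the signal values are hypothetical, and real base callers also correct for signal decay and phasing. Rounding errors grow with homopolymer length, which is why long homopolymers are error-prone.

```python
def flows_to_sequence(signals, flow_order="TACG"):
    """Convert per-flow signals into bases.

    Each flow offers one nucleotide (cycling through `flow_order`);
    the signal is rounded to the nearest whole number of incorporated bases.
    """
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        count = max(0, int(signal + 0.5))  # round half up, never negative
        bases.append(base * count)
    return "".join(bases)


# Hypothetical normalised signals for eight flows: T, A, C, G, T, A, C, G
signals = [1.04, 0.02, 2.10, 0.90, 0.00, 1.00, 0.10, 2.96]
print(flows_to_sequence(signals))
```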
EBI
Trimming
• Quality
– Ends
• Adaptors
– Clip adaptors (fastx clipper)
Insert Adaptor B
Adaptor A Adaptor A Adaptor B
FASTX-toolkit by Assaf Gordon
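The two trimming steps can be sketched in a few lines. This is not the FASTX-toolkit algorithm, just an illustration of the idea: the quality threshold, the minimum overlap and the adaptor sequence used in the example are assumptions, and only exact adaptor matches are handled.

```python
def trim_quality_ends(seq, scores, min_q=20):
    """Remove bases with quality below `min_q` from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], scores[start:end]


def clip_adaptor(seq, adaptor, min_overlap=5):
    """Cut the read where the adaptor starts.

    Handles a full adaptor anywhere in the read, or an adaptor prefix of at
    least `min_overlap` bases running off the 3' end. Exact matches only.
    """
    pos = seq.find(adaptor)
    if pos != -1:
        return seq[:pos]
    for k in range(min(len(adaptor), len(seq)) - 1, min_overlap - 1, -1):
        if seq.endswith(adaptor[:k]):
            return seq[:len(seq) - k]
    return seq


ADAPTOR = "AGATCGGAAGAGC"  # example adaptor sequence, for illustration only
print(clip_adaptor("ACGTACGTTTGG" + ADAPTOR, ADAPTOR))
print(clip_adaptor("ACGTACGTTTGGAGATCGG", ADAPTOR))
print(trim_quality_ends("ACGTAC", [5, 30, 30, 30, 10, 8]))
```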
Implications of PCR
• Duplicate reads
– Erroneous quantification or variant detection
• Uneven coverage
– Additional sequencing required to achieve minimal coverage
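A rough feel for the level of duplication can be had by counting identical read sequences. This is a simplification: duplicate-marking tools normally work on aligned positions rather than raw sequence, and identical reads are not always PCR duplicates (e.g. highly expressed transcripts in RNA-Seq). The reads below are made up for illustration.

```python
from collections import Counter


def duplicate_rate(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


def collapse_duplicates(reads):
    """Keep the first occurrence of each sequence, preserving order."""
    return list(dict.fromkeys(reads))


reads = ["GAGCAAAATT", "GAGCAAAATT", "TACATTTGGG", "GAGCAAAATT"]
print(duplicate_rate(reads))
print(collapse_duplicates(reads))
```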
Single nucleotide resolution
• High specificity
• Show ZEB1 mutation
Mutation:
c.1920G>T p.Gln640His
ZEB1 exon 7
CAG = Gln
CAT = His
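The mutation notation can be checked by hand or in code: coding position 1920 is the third base of codon 640 (1920 / 3 = 640), so G>T changes CAG to CAT. A small sketch; the function names are illustrative and the codon table holds only the two codons needed here.

```python
# Only the two codons relevant to this example.
CODON_TO_AA = {"CAG": "Gln", "CAT": "His"}


def codon_position(cds_pos):
    """Map a 1-based coding-sequence position to (codon number, position in codon)."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


def apply_substitution(codon, pos_in_codon, ref, alt):
    """Replace the base at `pos_in_codon` (1-based), checking the reference base."""
    if codon[pos_in_codon - 1] != ref:
        raise ValueError("reference base does not match codon")
    return codon[:pos_in_codon - 1] + alt + codon[pos_in_codon:]


codon_number, pos_in_codon = codon_position(1920)
mutant = apply_substitution("CAG", pos_in_codon, "G", "T")
print(f"codon {codon_number}, base {pos_in_codon}: "
      f"CAG ({CODON_TO_AA['CAG']}) > {mutant} ({CODON_TO_AA[mutant]})")
```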
Contamination
• Sample mix ups (!) - indexing
• Carry-over from previous run
• FastQ screen
Single molecule sequencing: Nanopore
https://www.nanoporetech.com/news/movies#movie-24-nanopore-dna-sequencing
• Single-stranded DNA polymer is passed through a protein nanopore
• Individual DNA bases on the strand are identified in sequence as the DNA molecule passes through
Oxford Nanopore
Summary
• NGS works by sequencing millions of reads in parallel
• Library preparation
– Add adaptors to DNA of interest
– Requires clonal amplification (template preparation)
• Sequence data presented in FastQ format
– Quality control critical
• Errors inherent in the technology, e.g. phasing and homopolymers, PCR
• Trimming