Experimental Biology, Academy of Sciences of Uzbekistan, 1 USA
2. Genomics tools and technologies and their applications
2.3 Next generation sequencing
2.3.1 NGS technologies
The advent of NGS technologies has changed the dynamics and pace of genomic research in humans, plants, animals and microorganisms because of their rapid, inexpensive and highly accurate sequencing capabilities. Unlike the Sanger sequencing method, which depends upon capillary electrophoresis, NGS technologies rely on massively parallel sequencing, high-resolution imaging, and complex algorithms that deconvolute the signal data to generate sequence data. NGS technologies support a wide variety of applications such as whole genome de novo sequencing and re-sequencing, transcriptome sequencing (RNA-seq), microRNA sequencing, amplicon sequencing, targeted sequencing, chromatin immunoprecipitated DNA sequencing (ChIP-seq), methylome sequencing and many others. Before delving into the use of these NGS applications for crop improvement, the various NGS technologies and their capabilities are briefly reviewed.
Genomics-Assisted Plant Breeding in the 21st Century: Technological Advances and Progress
Current NGS technologies can be broadly grouped into long and short read length technologies based on the number of bases they can sequence in a single sequencing reaction. Long read length technologies are preferred for applications involving de novo sequencing, while short read length technologies are relatively inexpensive and are mostly used for re-sequencing applications. Most NGS technologies monitor millions of sequencing reactions in parallel and thus produce a massive amount of sequencing data. The output capacities of these instruments have outpaced the development of computational tools and hardware for data processing. Sophisticated computer programs have been created to handle and process large amounts of sequencing data before final data analysis, and several bioinformatics tools have been designed for purposes such as de novo sequence assembly, mapping sequences to an existing reference genome sequence, mutation detection and annotation.
Long read technologies include the Roche/454 GS FLX and Pacific Biosciences RS systems, while short read technologies include the Illumina Genome Analyzer IIx, HiSeq 2000 and MiSeq, the Life Technologies SOLiD™ system, the Helicos Genetic Analysis system and the Life Technologies/Ion Torrent Personal Genome Machine (PGM). Mardis (2008b) and Metzker (2009) provided detailed reviews of these NGS technologies. The NGS technologies that are widely used at present are briefly reviewed below, and the sequencing capabilities of the instruments are summarized in Table 1.
2.3.1.1 Roche/454 GS FLX – pyrosequencing
This was the first NGS technology to be introduced commercially and is based on the pyrosequencing method (Margulies et al., 2005). The technology is relatively rapid and inexpensive because it omits the expensive in vivo sub-cloning of sheared fragments for template amplification. Instead of being cloned, sheared fragments are attached to microbeads and amplified in an emulsion-based PCR. The microbeads are then distributed on a fiber optic slide (PicoTiterPlate™), where the four dNTPs are added in turn. In pyrosequencing, the DNA sequence is determined by analyzing the light emitted through the activity of luciferase as the template is extended by nucleotide addition. The light emitted is captured by a high-resolution CCD camera for each type of nucleotide passed in a flow cycle, and its intensity is proportional to the number of nucleotides incorporated in each step. The first commercial 454 instrument was able to generate >25 million bases in short reads of 100 bp or more per 4 hr run. With improvements in the sequencing chemistry, the PicoTiterPlate (PTP), reagent volumes and the number of nucleotide flow cycles, the current GS FLX Plus instrument achieves an average read length of ~750 bp across 1–1.5 million sequences in a ~20 hr run. The long read lengths of this instrument make de novo sequencing of genomes and transcriptomes easier than with short read technologies. However, the technology is prone to sequencing errors in homopolymer regions. Since the advent of 454 sequencing technology, ~1331 peer reviewed publications across a wide range of topics have appeared as of July 2011 (http://454.com/publications/all-publications.asp).
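The relationship between light intensity and the number of incorporated nucleotides can be illustrated with a small decoding sketch. It is not the instrument's base caller: the function name `decode_flowgram`, the cyclic flow order `TACG` and the example intensities are assumptions made for illustration, and the signals are assumed to be already normalized so that one incorporation gives a value close to 1.0.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert normalized per-flow light intensities into a base sequence.

    Each flow delivers one nucleotide type (cycling through flow_order).
    The signal is rounded to the nearest whole number, which is taken as
    the number of identical bases incorporated in that flow.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up; negative noise values are treated as no incorporation.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical intensities for two cycles of T, A, C, G flows.
flows = [1.02, 0.05, 2.10, 0.96, 0.03, 3.08, 0.00, 1.10]
print(decode_flowgram(flows))
```

Because the homopolymer length is inferred by rounding a continuous signal, a small amount of noise on a long run (for example 5.4 versus 5.6) changes the called length, which is one way to see why homopolymer regions are error-prone in this chemistry.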
2.3.1.2 Illumina Genome Analyzer/HiSeq/MiSeq – sequencing-by-synthesis
The Illumina sequencing method uses clonal array formation and proprietary reversible terminator chemistry for rapid, accurate and large scale sequencing. DNA template fragments are immobilized in an 8-channel microfabricated flow cell, where they are amplified up to 1000 copies in close proximity by bridge amplification. Sequencing-by-synthesis uses all four fluorescently labeled nucleotides to sequence millions of clusters on the flow cell surface. The fluorescent label on each nucleotide blocks the 3′-OH group and thus acts as a terminator of polymerase extension. After each nucleotide is incorporated, the fluorescent dye is imaged to identify the base, and the label is then cleaved to allow incorporation of the next base (Bentley et al., 2008; Ju et al., 2006). Because each base incorporation is a separate event, the error rate in homopolymer regions is minimal compared with the 454 pyrosequencing method (http://www.illumina.com/technology/sequencing_technology.ilmn). Illumina offers a range of sequencing instruments whose output in a single sequencing run ranges from ~1 gigabase (Gb) from ~3–6 million sequences (MiSeq) up to 600 Gb from 6 billion paired end reads per two flow cells (HiSeq 2000). Although the output of Illumina instruments is large, run times are also longer, from 3 to 11 days depending on the machine, the single end or paired end protocol and the number of flow cycles. This technology has accelerated re-sequencing efforts in human and other genomes and has brought the per-base cost down to a minimum. As of July 2011, ~1746 peer reviewed publications had used this technology.
Table 1. Comparison of NGS technologies and capabilities*
* Based on the information provided by Metzker (2009) and other company web resources. 1 Single end read chemistry; 2 paired end read chemistry; ф sequence capacity changes with the type of chip used for sequencing.
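The output figures quoted for the Illumina instruments can be cross-checked with simple arithmetic, since total yield is the number of reads multiplied by the read length. The sketch below assumes that 1 Gb means 10^9 bases and that every read has the same length; the helper name `mean_read_length` is hypothetical.

```python
def mean_read_length(total_bases, n_reads):
    """Average bases per read implied by a run's total yield and read count."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return total_bases / n_reads


GB = 10**9  # assumption: 1 gigabase = 1e9 bases

# HiSeq 2000: 600 Gb from 6 billion paired end reads.
hiseq = mean_read_length(600 * GB, 6 * 10**9)

# MiSeq: ~1 Gb from ~3-6 million sequences.
miseq_low = mean_read_length(1 * GB, 6 * 10**6)
miseq_high = mean_read_length(1 * GB, 3 * 10**6)

print(f"HiSeq 2000: {hiseq:.0f} bases per read")
print(f"MiSeq: {miseq_low:.0f} to {miseq_high:.0f} bases per sequence")
```

Under these assumptions the HiSeq 2000 figures imply about 100 bases per read, while the MiSeq figures imply a few hundred bases per sequence, which is plausible only if each "sequence" counts both reads of a pair.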
2.3.1.3 Life Technologies SOLiD™ – sequencing-by-ligation
Life Technologies, previously Applied Biosystems, developed another short read sequencing technology, which uses the sequencing-by-ligation method. Template DNA fragments are clonally amplified in an emulsion PCR similar to that of 454 sequencing, and the clonal bead populations are covalently bound to a slide through a 3′ modification of the beads. During the sequencing reaction, a fluorescently labeled di-base probe hybridizes to the complementary sequence adjacent to the primed template, and DNA ligase joins the dye-labeled probe to the primer. After the non-ligated probes are washed off, fluorescence is imaged to identify the nucleotides at the first and second base positions (http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-nextgeneration-sequencing/next-generation-systems/solid-sequencing-chemistry.html). The cycle can be repeated either by using cleavable probes to remove the fluorescent dye and regenerate a 5′-PO4 group for subsequent ligation cycles, or by removing the extended primer and hybridizing a new primer to the template (Metzker, 2009; Valouev et al., 2008). SOLiD 5500, the most recent version of this technology, can generate up to 90 Gb of sequence data from ~1.4 billion reads of 35–75 bases in length over ~7 days. Because of its massive output and short read lengths, this system is heavily used for re-sequencing and RNA-Seq applications.
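Because each di-base probe reports on a pair of adjacent bases rather than a single base, SOLiD reads are naturally expressed as a series of colors. The text above does not spell out the color code, so the sketch below relies on an assumption: the commonly documented two-base encoding, in which the four colors correspond to the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). The function names are illustrative only.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(sequence):
    """Return (first_base, colors): one color per pair of adjacent bases."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    codes = [BASE_TO_CODE[base] for base in sequence]
    # Each color is the XOR of two neighboring base codes (values 0..3).
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return sequence[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    current = BASE_TO_CODE[first_base]
    bases = [first_base]
    for color in colors:
        current ^= color  # the next base is fixed by the previous base and the color
        bases.append(CODE_TO_BASE[current])
    return "".join(bases)


first, colors = encode_colors("ATGGC")
print(first, colors)
print(decode_colors(first, colors))
```

Decoding needs a known first base, and with this naive decoder a single wrong color changes every base downstream of it, which is why color reads are usually compared with a reference in color space rather than translated read by read.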
2.3.1.4 Pacific Biosciences RS – SMRT™ (single molecule real time) sequencing
Pacific Biosciences developed SMRT technology, which detects fluorescently labeled nucleotides in real time as they are incorporated along a single DNA molecule. A single molecule of Φ29 DNA polymerase, a highly processive, strand-displacing enzyme, is immobilized in a small hole called a zero-mode waveguide (ZMW), where it extends a single molecule of primed DNA template (Eid et al., 2009). Four-color phospholinked dye-labeled nucleotides are used in this process, and their fluorescence is quenched until they are incorporated during the sequencing reaction (Korlach et al., 2008). In the ZMW, as each nucleotide is incorporated by the anchored DNA polymerase, the phospholinked dye label is cleaved and its fluorescent light pulses are captured by four single-photon-sensitive cameras in the sequencing instrument (Lundquist et al., 2008). The real time light pulse information coming from 75000 ZMWs in a SMRT cell is converted to A, C, G, or T calls based on quality metrics to provide the sequencing information. The biggest advantage of this technology is its longer read lengths of ~1000–10000 bases, which facilitate sequence assembly, especially for de novo sequencing applications. Because the sequencing reaction in a SMRT cell is monitored in real time, a typical sequencing run requires as little as 30 minutes, compared with up to 11 days for other technologies. Strobe sequencing has been used to achieve longer read lengths with higher accuracy (Lo et al., 2011). Although the cost of sequencing is relatively low, observed sequencing error rates are higher than those of other NGS technologies.
2.3.1.5 Life Technologies/Ion Torrent PGM – semiconductor sequencing
The Ion Torrent PGM uses semiconductor technology with simple, non-fluorescent sequencing chemistry to generate the sequencing information. It is based on the detection of H+ ions released (a pH change) during a natural polymerase reaction, using an ion sensor underneath the micromachined wells of a semiconductor chip, each of which contains a different DNA template. As the nucleotides flow in one at a time during the sequencing reaction, a pH change is observed in all wells where the complementary nucleotide is incorporated (Pennisi, 2010). The change in pH is proportional to the number of bases added to the template strand, so homopolymer regions can also be sequenced. Because no fluorescently labeled nucleotides or imaging are involved, the incorporation of each nucleotide is recorded in seconds and the cost of sequencing is low compared with other NGS technologies (http://www.iontorrent.com/technology-scalability-simplicity-speed/). Current read lengths are ~200 bp and each run takes about 2 hrs.
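The flow-by-flow signal described above can be sketched as a forward simulation for a single well: given the strand being synthesized, each nucleotide flow yields an ideal signal equal to the number of bases incorporated. This is a noise-free toy model under stated assumptions; the function name `simulate_flows`, the cyclic flow order `TACG` and the example sequence are hypothetical and do not reproduce the instrument's actual flow order or signal processing.

```python
def simulate_flows(synthesized, flow_order="TACG", n_flows=8):
    """Ideal per-flow signal for one well.

    synthesized: bases added to the growing strand, in order.
    Returns one integer per flow: the number of bases incorporated, which
    stands in for the size of the pH change in that flow.
    """
    signals = []
    pos = 0
    for i in range(n_flows):
        nucleotide = flow_order[i % len(flow_order)]
        run = 0
        # A whole homopolymer run is incorporated within a single flow.
        while pos < len(synthesized) and synthesized[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append(run)
    return signals


print(simulate_flows("TTAGGGC"))
```

In this model a flow with no matching base gives a signal of zero, and a run of three G bases gives a signal of three in one flow, so reading homopolymers correctly depends on how precisely the magnitude of the pH change can be measured.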
Existing and emerging NGS technologies are helping to bring down sequencing costs, which should make personalized genome services, personalized medicine and other applications possible in the near future. Third generation sequencing technologies such as Oxford's nanopore sequencing and VisiGen's nano sequencing technologies are currently being developed and should help make genome research more affordable than ever before.