Omics and Arrays
3.2 Structural Genomics
3.2.3 Genome sequencing
The sequencing of DNA in laboratories first began in 1978. The first genome of a multicellular eukaryote, Caenorhabditis elegans, was published in 1998. The rationale behind genome sequencing includes identification of all the genes in the sequenced genome, elucidation of the functions and the interactions of genes in the genome, functional analysis of orthologues in related complex genomes, evolutionary analysis of genes or genomes and product development and commercial application. As next-generation sequencing technologies continue to facilitate genome sequencing, new applications and new assay concepts (e.g. Huang et al., 2009) have emerged that are vastly increasing our ability to understand genome function, including sequence census methods for functional genomics (Wold and Myers, 2008; Varshney et al., 2009).
Fig. 3.6. Example of physical mapping and integration of genetic, cytological and physical maps, shown for human chromosome 16. The figure aligns a cytogenetic map, a somatic cell hybridization map (from cultured human–mouse hybrid cells), a genetic linkage map, YAC clone inserts and BAC and/or PAC contigs containing the region of interest. Each sequence-tagged site (STS) is a unique DNA sequence that can be amplified by PCR; presence of an STS in a clone indicates where the insert originated from in the chromosome. The region of interest can be localized either on the physical map (somatic cell hybrid map, between breakpoints CY8 and CY7) or on the genetic map (between genetic markers 16AC6.5 and D16S150).
Technical developments in DNA sequencing
There are three major milestones in DNA sequencing: (i) the invention of sequencing reactions; (ii) automated fluorescent DNA sequencers; and (iii) PCR. Until the late 1970s, obtaining the DNA sequence of even five to ten nucleotides was difficult and very laborious. The development of two new methods in 1977, that of Maxam and Gilbert (chemical sequencing method) and that of Sanger and Coulson (enzymatic sequencing), made it possible to sequence large DNA molecules. Later refinements of Sanger’s chain termination method made it the preferred procedure because it proved to be technically simpler.
The modified Sanger sequencing method, or chain terminator procedure, capitalizes on two properties of DNA polymerases: (i) their ability to synthesize faithfully a complementary copy of a single-stranded DNA template; and (ii) their ability to use 2',3'-dideoxynucleotides as substrates. Once the analogue is incorporated at the growing point of the DNA chain, the 3' end lacks a hydroxyl group and is no longer a substrate for chain elongation. Thus, the dideoxynucleotides act as chain terminators.
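The chain-termination principle described above can be illustrated with a minimal sketch. It assumes an idealized reaction in which every possible terminated fragment is produced and fragments are resolved perfectly by length; the primer is ignored, and the template string and function names are hypothetical, chosen only for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def synthesized_strand(template):
    """Complementary copy made by the polymerase, written 5'->3'.

    The template is given 3'->5' so that position i of the template pairs
    with position i of the new strand (a simplification for illustration).
    """
    return "".join(COMPLEMENT[base] for base in template)

def terminated_lengths(new_strand, dideoxy_base):
    """Lengths of fragments that stop at each occurrence of the ddNTP."""
    return [i + 1 for i, base in enumerate(new_strand) if base == dideoxy_base]

def read_sequence(template):
    """Run four idealized reactions, then read bases by fragment length."""
    new_strand = synthesized_strand(template)
    ladder = {}  # fragment length -> terminating base
    for dideoxy_base in "ACGT":
        for length in terminated_lengths(new_strand, dideoxy_base):
            ladder[length] = dideoxy_base
    # Shortest fragment first, as on a gel read from the bottom upwards.
    return "".join(ladder[length] for length in sorted(ladder))

template = "TACGGTCA"  # hypothetical template, written 3'->5'
print(terminated_lengths(synthesized_strand(template), "G"))  # [3, 7]
print(read_sequence(template))  # ATGCCAGT
```

Each of the four reactions yields fragments ending only at its own dideoxy base, so ordering all fragments by length recovers the sequence of the newly synthesized strand.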
The development of labelling and detection techniques has contributed to an acceleration of sequencing procedures. These techniques include 32P-labelled primers (1970s); 33P- or 35S-labelled primers with a sharper image and lower radiation (early 1980s); and fluorescently labelled primers and dyes in four different reactions (1986).
DNA sequencing became automated in the late 1980s when the primer used for each reaction was labelled with a differently coloured fluorescent tag. This technology allowed thousands of nucleotides to be sequenced in a few hours, and the sequencing of large genomes then became a reality.
With ABI PRISM® technology, up to four different dyes can be used to label DNA, each of which can be differentiated when run together in the same lane of a gel or injected into a capillary. For DNA sequencing, this means that the four different dyes representing each of the DNA bases (A, C, G and T) can be electrophoresed together.
The improvement of polyacrylamide gel electrophoresis (in the late 1980s and early 1990s) led to higher resolution, thinner gels and a sharper image. Capillary electrophoresis (CE) (1998) offers a number of performance advantages such as faster runs, small sample volumes and the elimination of manual gel pouring and sample loading. Walk-away automation reduces instrument-associated labour time by more than 80% compared with slab-gel systems. The introduction of CE resulted in the availability of automated electrophoresis instruments with a much lower cost per sample (Amersham’s MegaBACE and Applied Biosystems ABI3700, 3730, etc.). High-throughput sequencing can also incorporate full automation in colony picking, 96-well plasmid isolation and purification, PCR reactions, sample loading and sequence data analysis.
The new generation of high-throughput sequencing technologies promises to transform the scientific enterprise, potentially supplanting array-based technologies and opening up many new possibilities (Kahvejian et al., 2008; Shendure and Ji, 2008). There are three commercial next-generation DNA sequencing systems available (Schuster, 2008) which promise vastly more sequencing capability (> 1 Gb of sequence per run) than standard capillary-based technology can produce. A high-throughput DNA sequencing technique using a novel massively parallel sequencing-by-synthesis approach called pyrosequencing was developed more recently by 454 Life Sciences (Margulies et al., 2005; www.454.com). 454 Sequencing employs clonal DNA fragment amplification on beads in droplets of an aqueous–oil emulsion, followed by loading the beads into picolitre-scale (∼44 µm) wells of a PicoTiterPlate, which is a fibre optic chip. In each reaction cycle, one of the four deoxynucleotide triphosphates (dNTPs) is delivered to the reactor along with DNA polymerase, ATP sulfurylase and luciferase. Incorporation, which is accompanied by a chemiluminescent signal, is detected by a high-resolution charge-coupled device (CCD) sensor. 454 Sequencing is capable of sequencing roughly 100 Mb of raw DNA sequence per 7-h run with the 2007 sequencing machine, the Genome Sequencer FLX (GS FLX).
454 Sequencing allows large amounts of DNA to be sequenced at low cost compared with the Sanger chain-termination methods; G-C rich content is not as much of a problem, and the lack of reliance on cloning means that unclonable segments are not skipped; it is also capable of detecting mutations present at low frequency in an amplicon pool. However, each read of the 2005 sequencing machine, the GS20, is only 100 bp long, which causes problems with highly repetitive genomes: repetitive regions of over 100 bp cannot be ‘bridged’ and thus must be left as separate contigs. In addition, because all identical bases in a run are incorporated within a single nucleotide flow, the technology has difficulty with long homopolymer runs.
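The flow cycle and the homopolymer limitation can be illustrated with a minimal sketch of an idealized, noise-free flowgram. The flow order "TACG", the example read and the function names are assumptions made for illustration and are not taken from the instrument's software.

```python
def flowgram(read, flow_order="TACG", n_cycles=2):
    """Simulate ideal light signals, one value per nucleotide flow.

    `read` is the strand being synthesized. Each signal equals the number
    of identical bases incorporated during that flow, so a homopolymer run
    of length n produces a single signal of intensity n.
    """
    signals = []
    pos = 0
    for _ in range(n_cycles):
        for base in flow_order:
            run = 0
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
    return signals

def call_bases(signals):
    """Convert signal intensities back into a base sequence."""
    return "".join(base * round(intensity) for base, intensity in signals)

def relative_signal_step(run_length):
    """Relative signal increase when a run of n bases grows to n + 1."""
    return 1.0 / run_length

read = "TTAGGGC"  # hypothetical read
signals = flowgram(read)
print(signals)
print(call_bases(signals))  # TTAGGGC
print(relative_signal_step(1), relative_signal_step(8))  # 1.0 0.125
```

Because the signal is proportional to run length, the step from eight to nine identical bases changes the signal by only 12.5%, whereas the step from one to two doubles it; with real, noisy signals the longer runs are therefore harder to call.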
As one of the projects using 454 sequencing, Project ‘Jim’ determined the first sequence of an individual, the complete genome sequence of James Dewey Watson, in May 2007.
The second high-throughput sequencing technique is Solexa™ (Illumina, Inc.; http://www.illumina.com), which depends on sequencing by synthesis. Diluted DNA templates are attached to a solid planar surface and then amplified clonally. Sequencing is performed by delivering a mixture of four differentially labelled reversible chain terminators along with DNA polymerase. The resulting signal is detected at each cycle and a new cycle can be initiated after terminator removal (Bennet et al., 2005). Current average read lengths are about 30–40 bases with 1 Gb per run.
The third high-throughput sequencing technique is the SOLiD™ System, which enables massively parallel sequencing of clonally amplified DNA fragments linked to beads. The SOLiD™ sequencing methodology is based on sequential ligation with dye-labelled oligonucleotides. The system is reported to deliver throughput approaching 20 Gb per run with greater than 99.9% overall accuracy. The flexibility of two independent flow cells, each capable of running 1, 4 or 8 samples, allows multiple experiments to be conducted in a single run, so that large-scale sequencing and tag-based experiments can be completed more cost effectively than previously possible.
There are several emerging sequencing methods: sequencing by hybridization; mass spectrometric techniques; direct visualization of single DNA molecules by atomic force microscopy; and single-molecule sequencing strategies. The intense drive towards developing technology that can sequence a complete human genome for under US$1000 will ensure that the speed and cost of sequencing will continue to improve rapidly (Schuster, 2008). For example, a nanopore-based device provides single-molecule detection and analytical capabilities that are achieved by electrophoretically driving molecules in solution through a nanoscale pore. Further research and development to overcome current challenges to nanopore identification of each successive nucleotide in a DNA strand offers the prospect of ‘third generation’ instruments that will sequence a diploid mammalian genome for ∼US$1000 in ∼24 h (Branton et al., 2008).
Sequencing strategies
There are two general genome sequencing strategies: (i) clone-by-clone or hierarchical sequencing (International Human Genome Sequencing Consortium, 2001); and (ii) whole-genome shotgun sequencing (Venter et al., 2001).
Once the complete physical map has been constructed, clone-by-clone sequencing can be started in any specific region. The clone-by-clone or hierarchical sequencing strategy has the following advantages: (i) the ability to fill gaps and re-sequence the uncertain regions; (ii) the ability to distribute the clones to other laboratories; and (iii) the ability to check the produced sequence by restriction enzymes. The main disadvantages are that constructing a physical map is expensive and time consuming and that experienced personnel are required.
The shotgun sequencing strategy consists of making small-insert libraries (1–10 kb) from the genomic DNA of an organism, sequencing a large number of clones (six- to eightfold redundancy) and assembling contigs using bioinformatics software. It requires no physical map construction and carries less risk of recombinant clones. It is cost effective and fast and ideal for small genome sequencing. However, it is difficult to fill gaps and re-track all the sequenced plasmids, and the resulting data are less useful for positional cloning. Figure 3.7 compares the two sequencing methods.
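The effect of six- to eightfold redundancy can be estimated with a minimal sketch based on the standard Poisson (Lander–Waterman) approximation, which is brought in here for illustration and is not part of the text above. It assumes reads land uniformly at random, which ignores cloning bias and repeats; the 5 Mb genome size and 500 bp read length are hypothetical values.

```python
import math

def reads_needed(genome_size_bp, read_length_bp, redundancy):
    """Number of reads required to reach a given average redundancy."""
    return math.ceil(redundancy * genome_size_bp / read_length_bp)

def expected_fraction_covered(redundancy):
    """Poisson approximation: probability that a base is hit at least once."""
    return 1.0 - math.exp(-redundancy)

def expected_uncovered_bp(genome_size_bp, redundancy):
    """Expected number of bases left unsequenced (the source of gaps)."""
    return genome_size_bp * math.exp(-redundancy)

genome_size = 5_000_000  # hypothetical small genome, in bp
read_length = 500        # assumed read length, in bp

for redundancy in (6, 8):
    print(
        redundancy,
        reads_needed(genome_size, read_length, redundancy),
        round(expected_fraction_covered(redundancy), 4),
        round(expected_uncovered_bp(genome_size, redundancy)),
    )
```

Under these assumptions even eightfold redundancy leaves some bases unsequenced, which is consistent with the need for a separate gap-filling step in shotgun projects.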
COMBINING CLONE-BY-CLONE AND SHOTGUN SEQUENCING STRATEGIES. In 1997 The Institute for Genomic Research (TIGR) launched the initiative of a whole-genome shotgun approach for the human genome. However, BACs, BAC end sequences and STS markers were used extensively in assembling the sequencing data from shotgun clones. The first draft of the human genome was completed within 3 years, compared with the 12 years taken by the Human Genome Project, which was funded by government agencies.
Fig. 3.7. Comparison of two sequencing strategies: assembly of a mapped scaffold. In hierarchical sequencing, chromosomal DNA is used to (1) construct large BAC or P1 clones, (2) align the clones and (3) take a subset of clones, fragment and sequence them; in shotgun sequencing, the whole genome is fragmented and sequenced directly. Both strategies end with contig assembly and bioinformatics analysis. U-unitigs are assembled into scaffolds using mate-pair information (e.g. 50 kb mates) to bridge gaps between two U-unitigs, and by linking unitigs to ‘rocks’, which are less well supported unitigs that nevertheless fit in place according to at least two independent large-insert mate pairs. ‘Stones’ are single short contigs whose position is supported by only a single read. Gaps are filled in the finishing stage by further site-directed sequencing. Scaffolds are linked to the existing genetic and physical maps by sequence-tagged site (STS) matches and to the cytological map by fluorescent in situ hybridization (FISH).
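The use of mate pairs to bridge a gap between two unitigs can be sketched as simple arithmetic: the insert size is known, so whatever part of the insert is not accounted for inside the two unitigs must lie in the gap. The sketch below is a toy illustration, not the method of any particular assembler; it ignores read orientation and insert-size variance, and all coordinates and names are hypothetical.

```python
from statistics import mean

def gap_estimate(insert_size, a_length, insert_start_in_a, insert_end_in_b):
    """Estimate the gap between unitig A (left) and unitig B (right).

    insert_start_in_a: 0-based position in A where the cloned insert begins.
    insert_end_in_b: number of bases of B spanned before the insert ends.
    """
    inside_a = a_length - insert_start_in_a
    inside_b = insert_end_in_b
    return insert_size - inside_a - inside_b

def link_unitigs(a_length, mate_pairs, min_pairs=2):
    """Return the mean gap if enough mate pairs support the link, else None.

    Requiring at least two independent mate pairs mirrors the support
    threshold mentioned in the figure caption.
    """
    if len(mate_pairs) < min_pairs:
        return None
    return mean(
        gap_estimate(size, a_length, start, end)
        for size, start, end in mate_pairs
    )

# Two hypothetical 50 kb mate pairs linking a 30 kb unitig A to unitig B:
# (insert size, insert start in A, insert end in B)
pairs = [(50_000, 10_000, 28_000), (50_000, 20_000, 38_500)]
print(link_unitigs(30_000, pairs))      # 1750
print(link_unitigs(30_000, pairs[:1]))  # None
```

A link supported by a single mate pair is rejected here, and the remaining gap of roughly 1.75 kb would be a target for site-directed sequencing in the finishing stage.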
Genome filtering strategies
The extremely large size of many crop genomes makes it difficult to decode them using the standard methods of genome sequencing such as clone-by-clone and whole-genome shotgun. Determining their complete sequences is daunting and costly. In recent years two genome filtration strategies, methylation filtration (MF) (Rabinowicz et al., 1999) and C0t-based cloning and sequencing (CBCS; Peterson et al., 2002) or high C0t (HC; Yuan et al., 2003), have been suggested for selectively sequencing the gene space of large genomes. MF is based on the characteristic of plant genomes that genes are largely hypomethylated whereas repeated sequences are highly methylated. Methylated DNA is cleaved when transferred into an Mcr+ E. coli strain and only hypomethylated DNA is recovered. CBCS/HC separates single- and low-copy sequences, including most genes, from the repeated sequences on the basis of their differential renaturation characteristics. Using the MF strategy, Bedell et al. (2005) sequenced 96% of the genes in sorghum with an average coverage of 65% across their length. This strategy filtered out repetitive elements during the sequencing of the sorghum genome, which reduced the amount of sorghum DNA to be sequenced by two-thirds, from 735 Mb to approximately 250 Mb. Both MF and HC have been used for efficient characterization of maize gene space (Palmer et al., 2003; Whitelaw et al., 2003). Using high C0t and MF, Martienssen et al. (2004) generated up to twofold coverage of the gene space with less than one million sequencing reads. Simulations using sequenced BAC clones predicted that 5× coverage of gene-rich regions, accompanied by less than 1× coverage of subclones from BAC contigs, will generate a high quality mapped sequence that meets the needs of geneticists while accommodating unusually high levels of structural polymorphism. Haberer et al. (2005) selected 100 random regions averaging 144 kb in size, representing about 0.6% of the genome, to define their content of genes and repeats for characterizing the structure and architecture of the maize genome. Combining CBCS with genome filtration can greatly reduce the cost while retaining the high coverage of genic regions. An alternative approach is the identification of gene-rich regions on a detailed physical map and sequencing large-insert clones from these regions.
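The figures quoted above can be checked with a short calculation. The sketch uses only numbers from the text (735 Mb and 250 Mb for sorghum; 100 regions of 144 kb representing about 0.6% of the maize genome); the function names are hypothetical, and the implied maize genome size is only as precise as the rounded 0.6% figure.

```python
def fraction_removed(full_mb, filtered_mb):
    """Fraction of DNA that no longer needs to be sequenced after filtering."""
    return 1.0 - filtered_mb / full_mb

def implied_genome_size_mb(n_regions, mean_region_kb, fraction_of_genome):
    """Genome size implied by a sample covering a known fraction of it."""
    sampled_mb = n_regions * mean_region_kb / 1000.0
    return sampled_mb / fraction_of_genome

# Sorghum: methylation filtration reduces 735 Mb to about 250 Mb.
print(round(fraction_removed(735, 250), 2))  # 0.66, i.e. about two-thirds

# Maize: 100 regions x 144 kb = 14.4 Mb, stated to be about 0.6% of the genome.
print(round(implied_genome_size_mb(100, 144, 0.006)))  # 2400
```

The first result agrees with the stated two-thirds reduction; the second suggests that the 0.6% figure corresponds to a maize genome of roughly 2400 Mb.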
Plant genomic sequences
The first complete plant genome to be sequenced was that of Arabidopsis. The sequenced regions cover 115.4 Mb of the 125-Mb genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole genome duplication followed by subsequent gene loss and extensive local gene duplications. The genome contains 25,498 genes encoding proteins from 11,000 families (The Arabidopsis Genome Initiative, 2000). Arabidopsis contains many families of new proteins but also lacks several common protein families. The proportion of predicted Arabidopsis genes in different functional categories is provided in Fig. 3.8. The complete genome sequence provides the foundation for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic methods of identifying genes for crop improvement (Varshney et al., 2009).
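The Arabidopsis figures above imply a few summary statistics that are easy to derive. The sketch below uses only the numbers quoted in the text; the average spacing is a crude estimate that assumes genes are spread evenly over the sequenced regions, which is a simplification made for illustration.

```python
sequenced_mb = 115.4  # sequenced regions, from the text
genome_mb = 125.0     # total genome size, from the text
genes = 25498         # predicted protein-coding genes
families = 11000      # protein families

fraction_sequenced = sequenced_mb / genome_mb
bp_per_gene = sequenced_mb * 1_000_000 / genes  # assumes even spacing
genes_per_family = genes / families

print(f"{fraction_sequenced:.1%} of the genome sequenced")
print(f"about one gene per {bp_per_gene / 1000:.1f} kb")
print(f"about {genes_per_family:.1f} genes per family")
```

On these numbers roughly 92% of the genome was sequenced, with on average about one gene every 4.5 kb and slightly more than two genes per protein family.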
Fig. 3.8. Proportion of predicted Arabidopsis genes in different functional categories: metabolism, 11%; energy, 7%; cell growth, 2%; transcription, 3%; protein synthesis, 27%; protein destination, 12%; transport facilitators, 4%; intracellular traffic, 3%; cellular organization, 5%; signal transduction, 4%; elicitors, 4%; cell defence, 3%; not yet clear-cut, 5%; unclassified, 10%.
Rice was the first crop to be fully sequenced because of its importance as one of the major cereals and also because of its small genome size, small number of chromosomes (n = 12), well characterized genetic and genomic resources and availability of a large number of DNA markers and a high density genetic linkage map. Two draft sequences were completed in 2002 (Goff et al., 2002; Yu et al., 2002) and a complete sequence was published in 2005 (IRGSP, 2005), which is available in the National Center for Biotechnology Information (NCBI) database.
Many sequencing projects for important crop species are currently ongoing. The US Department of Energy’s Joint Genome Institute (JGI) is providing funding and technical assistance to decode the genomes of several major plants, including cassava (Manihot esculenta), cotton (Gossypium), foxtail millet (Setaria italica), sorghum, soybean and sweet orange (Citrus sinensis L.) (http://www.jgi.doe.gov/sequencing/).
Other plants for which there are ongoing genome sequencing projects include Medicago truncatula (http://www.medicago.org/genome), Lotus japonicus (http://www.kazusa.or.jp), poplar, tomato (http://www.sgn.cornell.edu) and grapevine.
The International Wheat Genome Sequencing Consortium (IWGSC) has been formed to advance agricultural research for wheat production and utilization by developing DNA-based tools and resources that result from the complete sequencing of the expressed genome of common (hexaploid) bread wheat and to ensure that these tools and the sequences are available for all to use without restriction and without cost (Gill et al., 2004; http://www.wheatgenome.org/). A Global Musa Genomics Consortium (GMGC) is decoding the Musa genome (http://www.newscientist.com/article.ns?id-dn1037). A Global Cassava Partnership, an alliance of the world’s leading cassava researchers and developers, has proposed that sequencing the cassava genome should be a priority (Fauquet and Tohme, 2004).
To sequence the maize genome, two consortia in the USA began a pilot study: one with Jo Messing (Rutgers University), Rod Wing (Arizona University), Ed Coe (University of Missouri), Mark Vaudin (Monsanto) and Steve Rousley (Cereon); the other included Jeff Bennetzen (Purdue University), Karel Schubert and Roger Beachy (Danforth Center), Cathy Whitelaw and John Quackenbush (TIGR) and Nathan Lakey (Orion). These two pioneer programmes have been extended by a massive US programme from the National Science Foundation (NSF), USDA and the Department of Energy (DOE) led by Rick Wilson (Washington University). The sequencing strategy is a hybrid between a BAC-by-BAC approach and a whole-genome shotgun.