3.4 Conclusions and future work
4.1.3 Next-generation sequencing
DNA sequencing characterises the sequential order of the nucleotide bases in a length of DNA. The first techniques to achieve this were the chemical modification method developed by Maxam and Gilbert (1977) and the chain-termination method developed by Sanger et al. (1977). The latter method, termed ‘Sanger sequencing’, was the first to be automated, which allowed researchers to sequence large quantities of DNA faster and more cheaply. The technology was developed further over the next two decades and was the first method used to sequence the full human genome, in 2001 (Lander et al., 2001). Sanger sequencing is similar to natural DNA replication in that it uses DNA polymerase to extend short primers into strands complementary to the template. Each of the four dideoxynucleotides carries a different label, so that the incorporation of one, which terminates the chain, can be detected (Mutz et al., 2013). This method is associated with a low sequencing error rate of ∼2% and read lengths of up to ∼2000 bp (Nagarajan and Pop, 2013).
Pyrosequencing, a sequencing-by-synthesis method, is a DNA sequencing technique developed in the 1990s by Ronaghi et al. (1996). This technique offered an attractive alternative to Sanger sequencing through its ability to perform real-time sequencing that was simple, automated and faster than previous methods. As DNA polymerase moves along an immobilised single-stranded template of DNA, the four different nucleotides are added sequentially in solution, and if one is incorporated a flash of light is detected. The pyrophosphate released from the DNA polymerase-catalysed reaction forms ATP, which is then used in the ATP-dependent conversion of luciferin to oxyluciferin. The production of oxyluciferin causes a pulse of light, whose amplitude is directly related to the number of nucleotides incorporated (Ronaghi et al., 1996; Petrosino et al., 2009).
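The relationship between light amplitude and the number of incorporated nucleotides can be illustrated with a minimal sketch. It assumes that the signals have already been normalised so that a value of 1.0 corresponds to a single incorporation, and that the nucleotides are flowed in the repeating order T, A, C, G; the function name, the flow order and the example signal values are all illustrative assumptions rather than part of any published pipeline.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert normalised light intensities into a base sequence.

    Assumption: a signal of 1.0 corresponds to one incorporated
    nucleotide, so rounding each signal gives the homopolymer length.
    The result is the sequence of the newly synthesised strand.
    """
    bases = []
    for i, signal in enumerate(signals):
        # The nucleotide flowed in this cycle (hypothetical flow order).
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporations, never below 0.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Made-up signals for two rounds of the T, A, C, G flow.
signals = [1.02, 0.03, 2.10, 0.97, 0.05, 3.40, 0.00, 1.05]
print(decode_flowgram(signals))  # TCCGAAAG
```

In this example the sixth signal (3.40) lies almost midway between three and four incorporations, which shows why long runs of the same base are difficult to call from the light amplitude alone.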
DNA pyrosequencing is generally only able to sequence DNA fragments of up to 100-200 bases. In 2005, however, 454 Life Sciences released a high-throughput pyrosequencing technique, which can now sequence fragments of up to approximately 750 bases (Margulies et al., 2005; Glenn, 2011). This technique is known as 454 sequencing and was the first next-generation sequencing method. While it produces longer reads with fast run times, the reagent costs are high and there are high error rates associated with homopolymer repeats and duplicate reads (Glenn, 2011). Nagarajan and Pop (2013) reported a sequence error rate of ∼4% for this technology.
CHAPTER 4. FUNCTIONAL ANALYSIS OF THE HUMAN ORAL METAGENOME
Illumina next-generation sequencing (NGS) (initially developed by Solexa, which Illumina acquired in 2007) also uses a sequencing-by-synthesis method and was the first next-generation short-read sequencer (Bentley et al., 2008). The DNA of an amplified library of fragments is sequenced using reversible dye terminators. In this method, all four nucleotides can be added at the same time in each cycle, as each carries a different fluorescent label. A nucleotide is added by DNA polymerase, then the unincorporated nucleotides are washed away and an image is taken to identify the fluorescent signal. The fluorescent group is then cleaved and the 3’-hydroxyl group is chemically de-blocked so that the next nucleotide can be incorporated. Up to 150 nucleotides can be added in this way. Most Illumina reads are reported to have an error rate of 0.5% (i.e. 1 error in 200 bases) (Mardis, 2013). Nagarajan and Pop (2013) also reported a low sequencing error rate of <2%. These errors can be a result of phasing, which occurs where the de-blocking process is incomplete or where a blocking group is missing. Errors can also result from fluorescence interference noise, which can occur when a fluorescent group from a previous cycle has not been cleaved (Mardis, 2013).
Most recently, instruments have been developed that are able to sequence individual strands of metagenomic DNA in real time. The Pacific Biosciences (PacBio) platform was first described in 2009 (Eid et al., 2009) and made commercially available in 2010. Starlight is another single-molecule sequencing technique; however, it is still under development and is not commercially available. In PacBio sequencing, each nucleotide has a different fluorescent label that is detected as soon as it is cleaved during synthesis (Glenn, 2011). While the reads produced are 964 bases long on average, the technique has the highest error rate of the NGS techniques, at ∼18% (Metzker, 2010; Nagarajan and Pop, 2013).
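To put the quoted error rates into perspective, the following sketch converts each one into an expected number of errors per read and an average spacing between errors. The rates and read lengths are those cited above; the calculation assumes, as a simplification, that errors are spread uniformly along a read, and it mixes maximum read lengths (Sanger, 454, Illumina) with an average read length (PacBio), so the output is indicative only. The function names are illustrative.

```python
# (error rate, read length in bases) as quoted in the text above.
platforms = {
    "Sanger": (0.02, 2000),
    "454": (0.04, 750),
    "Illumina": (0.005, 150),
    "PacBio": (0.18, 964),
}


def expected_errors(error_rate, read_length):
    """Expected number of erroneous bases in one read (uniform-error assumption)."""
    return error_rate * read_length


def bases_per_error(error_rate):
    """Average number of bases sequenced per error."""
    return 1.0 / error_rate


for name, (rate, length) in platforms.items():
    print(
        f"{name}: ~{expected_errors(rate, length):.1f} errors per read, "
        f"1 error every ~{bases_per_error(rate):.0f} bases"
    )
```

For the Illumina figures, for instance, a rate of 0.5% gives one error every 200 bases, or fewer than one expected error in a 150-base read.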
A number of metagenomic studies have used 454 pyrosequencing since its release in 2005. A single run of 454 pyrosequencing allowed the analysis of 13 Mb of sequence from a 28,000-year-old mammoth in 2006 (Poinar et al., 2006). Since then, projects using 454 technology have investigated the metagenomes of soils (Leininger et al., 2006), a coral holobiont (Wegley et al., 2007), and nine biomes (Dinsdale et al., 2008), which include stromatolites, fish gut, fish ponds, mosquito virome, chicken gut, bovine gut and marine virome (Hugenholtz and Tyson, 2008). Recent years have seen a focus on sequencing the human microbiome using next-generation technologies, with projects such as the HMP, as mentioned in Section 4.1.9. Another recent example used Illumina sequence reads to establish a human gut microbial gene catalogue (Qin et al., 2010).
As there are error rates associated with all methods of DNA sequencing, a method was developed to express the reliability of each base-call as a quality score. The program Phred was developed to estimate the probability of error for each base-call (Ewing and Green, 1998). Log-transformed error probabilities are used to calculate a quality value (q) (see Equation 4.1).
\[ q = -10 \times \log_{10}(p) \qquad (4.1) \]
Here, p represents the estimated error probability for a given base-call. A high quality value corresponds to a low error probability. For example, a Phred quality score of 30 corresponds to a 1 in 1000 chance of an incorrect base-call (Ewing and Green, 1998).
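Equation 4.1 and its inverse can be written as a short sketch. The function names are illustrative and are not taken from the Phred program itself; the code only reproduces the arithmetic of the equation.

```python
import math


def phred_quality(p):
    """Quality value q = -10 * log10(p) for an error probability p (Equation 4.1)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


def error_probability(q):
    """Inverse of Equation 4.1: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


# A 1 in 1000 chance of an incorrect base-call gives a quality value of about 30.
print(round(phred_quality(0.001), 2))
# A quality value of 20 corresponds to a 1 in 100 chance of error.
print(round(error_probability(20), 4))
```

Because the scale is logarithmic, every increase of 10 in the quality value corresponds to a tenfold decrease in the estimated error probability.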