characterisation of the cattle and African buffalo (Syncerus caffer) IGH gene segments
3.1.1 Comparison of short and long read sequence technologies for IGH assembly
Knowledge of DNA sequence is indispensable for understanding biological systems. The development of Sanger sequencing, via selective incorporation of chain-terminating dideoxynucleotides, permitted the first high-throughput method of determining DNA sequence. Sanger sequencing produces reads of approximately 800 bp, which can then be assembled into larger contigs. The generation of the first cattle genome in 2004 (NHGRI, 2004; 163), a 3.3-fold coverage of the Hereford cow L1 Dominette 01449, was achieved by shotgun assembly of Sanger reads. This method is still widely used for sequencing smaller regions of the genome and resolving difficult regions; however, it is considered an expensive method for whole genome assembly that does not provide the depth of coverage achieved by newer sequencing technologies.
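To give a sense of what fold coverage implies, the following is a minimal sketch that assumes reads are placed uniformly at random, so that per-base depth is approximately Poisson distributed (the Lander-Waterman idealisation). The function names and the coverage values other than 3.3 are illustrative assumptions, not figures from any assembly discussed here.

```python
import math


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases hit by no read, assuming Poisson-distributed depth."""
    return math.exp(-coverage)


def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Number of reads required for a target fold coverage (coverage = N * L / G)."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# Illustrative coverage depths; only 3.3 corresponds to a value quoted in the text.
for c in (1.0, 3.3, 10.0, 30.0):
    print(f"{c:>5.1f}-fold: {fraction_uncovered(c):.2%} of bases expected uncovered")
```

Under this idealised model, roughly 3.7% of bases would be expected to receive no read at 3.3-fold coverage. Real assemblies deviate from the model because of cloning and sequence-composition biases.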
Second generation sequencing (SGS) technologies, especially Illumina sequencing, transformed the field of genomics by providing comparatively high coverage of a whole genome at low cost. Illumina technology incorporates fluorescently labelled nucleotides to detect each base added by the polymerase to the growing DNA chain. The DNA library is captured on a flow cell of surface-bound oligos and each individual fragment is amplified into a distinct clonal cluster. Millions of clusters are then extended in parallel and the fluorescent signal from each is detected. The resulting reads are highly accurate, with a very low per-base error rate. The maximum read length, however, is short, with the Illumina HiSeq 2500 generating paired-end reads of only up to 250 bp. Illumina technology has been adopted for sequencing and assembly of whole genomes, including the UMD2 assembly of the Hereford L1 Dominette 01449, which covers most of the genome at 9.5-fold coverage (Zimin et al., 2009; 142), and has since been used for the sequencing and assembly of the most recently published cattle genome, UMD3.1 (Elsik et al., 2009; 159). The uniform short read length of SGS and amplification biases can lead to fragmented genome assemblies, particularly across highly repetitive, GC-rich and GC-poor regions. Short reads are unable to span repetitive regions of the genome. If a read does not contain a unique sequence, its origin cannot be precisely determined and so multi-mapping occurs. The consequent multiple alignments and misalignments lead to sequence gaps, assembly errors and incorrect abundance estimation. The antibody loci are highly repetitive, GC-rich regions and are consequently heavily disrupted by large sequence gaps in the available reference assemblies. Resolution of the antibody loci therefore requires longer reads, as these are more likely to contain unique sequences and to span repetitive regions.
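The multi-mapping problem described above can be illustrated with a toy example. This is a minimal sketch using exact string matching on an invented reference. The sequences, the `find_all` helper and the read coordinates are all hypothetical, and real aligners tolerate mismatches and use indexed search rather than a linear scan.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every start position at which `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


# Toy reference: unique left flank, three tandem copies of a repeat, unique right flank.
LEFT = "ACGTTGCA"
REPEAT = "GGGCTGAGCT"
RIGHT = "TTAGCCAT"
reference = LEFT + REPEAT * 3 + RIGHT

# A short read drawn entirely from within one repeat copy.
short_read = reference[18:26]
# A long read anchored in both flanks and spanning the whole repeat array.
long_read = reference[4:42]

print("short read maps to:", find_all(reference, short_read))
print("long read maps to: ", find_all(reference, long_read))
```

In this toy case the short read matches once per repeat copy, so its true origin is ambiguous, whereas the long read carries unique flanking sequence and matches at a single position.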
Third generation sequencing technology, in the form of Pacific Biosciences (PacBio) Single Molecule Real-Time (SMRT) sequencing, overcomes many of the limitations of Illumina sequencing, especially short read length and amplification biases. This sequencing-by-synthesis technology uses a zero-mode waveguide (ZMW) containing an affixed DNA polymerase and a single template molecule. As fluorescently tagged nucleotides are incorporated along the chain, the ZMW is illuminated and the fluorescent signal is imaged in real time, allowing observation at the single molecule level, unlike Illumina, which detects the fluorescent signal from a cluster of amplified fragments. The use of DNA polymerase and the imaging of single molecules mean there is no degradation of signal over time, so the sequencing reaction only ends when the template and polymerase dissociate. The resulting reads are much longer, averaging 10 kb, with over half of the sequence data in reads >20 kb using the latest chemistry. SMRT sequencing is not, however, without its own limitations. Its throughput is lower than that of Illumina technology, typically 0.5–1 billion bases per SMRT cell compared to the 8 billion paired-end 125 bp reads that can be produced on the Illumina HiSeq 2500 (Rhoads and Au, 2015; 164). Individual reads have a random error rate of 11-14% (Korlach, 2015; 165) but, with sufficient coverage, the consensus eliminates most of the errors in the sequence, as it is highly unlikely that the same error will be randomly observed multiple times at the same position. Accuracy of >99% requires a coverage of 15 sequencing passes, but there is a trade-off between the number of sequencing passes and the read length, as the total length sequenced is limited by the lifetime of the polymerase: longer sequences are read fewer times and so have lower accuracy, whereas shorter sequences are read more times and yield higher accuracy.
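The reasoning behind consensus error correction, and its trade-off with read length, can be sketched numerically. The following is a deliberately simplified model and not PacBio's consensus algorithm: it assumes errors are independent between passes and treats the consensus at a base as wrong only when a strict majority of passes are wrong. The function names and the polymerase read length and insert sizes used below are hypothetical values chosen for illustration.

```python
from math import comb


def consensus_error_bound(per_pass_error: float, passes: int) -> float:
    """Probability that a strict majority of independent passes are wrong at one base.

    This is pessimistic, since it counts all erroneous passes as agreeing
    on the same wrong base.
    """
    p = per_pass_error
    k_min = passes // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(k_min, passes + 1)
    )


def expected_passes(polymerase_read_bp: int, insert_bp: int) -> int:
    """Whole passes over an insert before the polymerase read ends."""
    return polymerase_read_bp // insert_bp


# Hypothetical 20 kb polymerase read; 14% is the upper per-read error rate quoted above.
for insert_bp in (1300, 4000, 20000):
    n = expected_passes(20000, insert_bp)
    bound = consensus_error_bound(0.14, n)
    print(f"insert {insert_bp:>6} bp: {n:>2} passes, majority-error bound {bound:.2e}")
```

The sketch shows the qualitative behaviour described in the text: for a fixed polymerase read length, shorter inserts are traversed more times and the chance of a majority of passes agreeing on an error falls sharply, while an insert as long as the polymerase read is seen once and retains the raw error rate.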
Nevertheless, PacBio overcomes issues introduced by sequencing with Illumina, as longer reads are able to span highly repetitive sequences for assembly of complex genomic regions, such as the antibody loci. Oxford Nanopore technology is another third generation sequencing technology which generates long sequence reads. The technology involves passing an ionic current through a nanopore and measuring changes in the current as biological molecules pass through. The changes in current differ between the four nucleotide bases and so the DNA sequence can be determined. Repetitive regions, such as the antibody loci, can be sequenced without
exists to the length of the DNA molecules being sequenced and so the technology has applications in sequencing entire chromosomes. However, where PacBio reads a molecule multiple times to error correct and generate a high quality consensus, Oxford Nanopore can only sequence a molecule twice. Oxford Nanopore therefore has an estimated 38.2% error after base calling (Laver et al., 2015; 166). The principal advantage of Oxford Nanopore over other sequencing technologies is its affordability and portability, which make it useful in field studies.
For highly contiguous and accurate genome assemblies, then, a combination of PacBio and Illumina sequencing would be advantageous. PacBio can close gaps in reference assemblies, and its long reads are able to overcome the limitations of genome assembly using Illumina sequencing. Illumina, however, provides depth of coverage and higher accuracy. A cattle genome utilising both Illumina and SMRT reads for assembly is, however, currently not available. Here we are able to compare separate genome assemblies from PacBio and Illumina sequencing for post-assembly analysis of their structure and sequence.