Multialignment and generation of consensus sequence

CHAPTER 5: OVINE BACTERIAL ARTIFICIAL CHROMOSOME (BAC) SEQUENCING

5.3.4 Multialignment and generation of consensus sequence

Approximately 110 kb of sheep BAC sequence was generated from the 85 contaminant-free BAC contigs, with adjacent contigs aligned in multiple GeneDoc files. These alignments were merged to produce a sequence assembly, hereafter referred to as the ‘initial sequence assembly’. This initial sequence assembly was annotated manually using results from BLAST analysis against multiple reference sequences: the International Sheep Genomics Consortium (ISGC) genomic sequence versions 1 and 2, the published CLN6 sheep mRNA sequence (GenBank GeneID: 678673), the unpublished sheep sequences obtained at Lincoln University and the CNCS sequence generated earlier (Chapter 4). The initial sequence assembly was then used as a template to produce a final assembly, referred to as the ‘consensus sequence assembly’, which represents the region of interest for the two mutation screening approaches in Chapters 6 and 7.

The consensus sequence assembly was used as the query in RepeatMasker to identify known repeats. A total of 40,211 bp (36.35%) of the original 110,618 bp sequence was masked, leaving a final sequence of 70,407 bases. Among the repeats identified, 15,624 bp were short interspersed nuclear elements (SINE), 16,317 bp were long interspersed nuclear elements (LINE), 4,127 bp were long terminal repeats (LTR) and 3,020 bp were DNA elements.
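The RepeatMasker figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the variable names are illustrative, and the remainder it computes is simply the masked sequence not itemised above, whose repeat classes are not stated here.

```python
# Figures quoted in the text for the consensus sequence assembly.
total_bp = 110_618
masked_bp = 40_211
repeat_classes = {
    "SINE": 15_624,
    "LINE": 16_317,
    "LTR": 4_127,
    "DNA elements": 3_020,
}

# Percentage of the assembly masked, and the sequence left after masking.
masked_pct = 100 * masked_bp / total_bp
unmasked_bp = total_bp - masked_bp

# Masked bases accounted for by the four itemised classes, and the
# remainder belonging to classes that the text does not itemise.
listed_bp = sum(repeat_classes.values())
other_bp = masked_bp - listed_bp

print(f"masked: {masked_pct:.2f}%  unmasked: {unmasked_bp} bp")
print(f"itemised repeats: {listed_bp} bp  not itemised: {other_bp} bp")
```

The percentage and the unmasked length agree with the values reported, and the four itemised classes account for all but 1,123 bp of the masked sequence.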

5.4 Discussion

The two ovine BAC clones used for analysis in this study, 270H8 and 35C9, were screened for the CNCS region and the CALML4 gene downstream of ovine CLN6 prior to sequencing. This span of sequence was of interest because of the hypothesis that the disease-causing mutation for ovine NCL in South Hampshire sheep lies in sequences flanking ovine CLN6 (Chapter 1.3). Compared with the laborious Sanger sequencing method used to sequence the CNCS (Chapter 4), the Roche 454 pyrosequencing method produced longer sequence reads, averaging 400 bp, than those generated by Sanger sequencing (Mardis, 2008; Zhou et al., 2010), thus allowing generation of a normal sheep genomic reference sequence. This reference sequence was used for the mutation screening approaches (Chapters 6 and 7).

Purified DNA from ovine BAC clone 270H8 was chosen as the sequencing template instead of BAC clone 35C9. Clone 35C9 had an OD260/280 reading of 1.35, which suggested a high level of protein contamination. DNA purity is a critical factor for consideration (Liu et al., 2013), as contamination is likely to impede sequencing.

Agarose gel electrophoresis of both BAC DNAs revealed different forms of DNA, including supercoiled BAC DNA, the purified BAC vector and an insert, which theoretically should have been removed during purification. The relaxed DNA mid band was the purified contaminant-free genomic DNA required for the sequencing project, whereas the slowest travelling DNA in the agarose gel (upper band) was a combination of nicked or sheared damaged DNA and a smear of BAC DNA that does not hybridize with probes on the gDNA or migrate easily into the gel. The band closest to the well was likely to be an analysis artifact (H. Zhou pers. comm.). A better method for analysing and estimating the size of large DNA constructs such as BAC clones is to use standard agarose with a supercoiled DNA ladder or pulsed-field gel electrophoresis (PFGE; Herschleb et al., 2007), neither of which was available in our laboratory at that time.

The DNA from BAC clone 270H8 was sent for sequencing after the high throughput DNA sequencing unit confirmed that the quantity and quality of the DNA were sufficient. In this case, sheared DNA and minor contamination were not of concern, because the DNA was destined to be fragmented and ‘over sequenced’, giving excess sequence coverage of the region. Further information on the 454 sequencing chemistry is given in Chapter 1.6.2.1.

Homopolymer-length errors are very common in 454 sequencing reads, accounting for 39% of errors, as stated by Huse et al. (2007). These errors arise from the sequencing technique unique to the 454 platform: nucleotide bases are not called directly as in Sanger sequencing, but are inferred from the intensity of the luminescence emitted each time a nucleotide is added to the DNA strand (Mardis, 2008). Manual removal of these homopolymer errors prevents possible problems with sequence assembly, as variation in the called length of long stretches of a single nucleotide can generate ambiguity.
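Candidate homopolymer stretches can be flagged for manual inspection with a short script. The following is a minimal sketch rather than the procedure used in this study: the function name `find_homopolymers` and the default minimum run length of four bases are assumptions made for illustration.

```python
import re


def find_homopolymers(seq, min_len=4):
    """Return (start, base, length) for each run of a single base
    that is at least min_len long (start is a 0-based position)."""
    # ([ACGT]) captures one base; \1{n,} requires at least n more copies.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


# Hypothetical read fragment containing a run of five T and a run of four A.
runs = find_homopolymers("ACGTTTTTACAAAAGG")
print(runs)  # [(3, 'T', 5), (10, 'A', 4)]
```

Comparing the run lengths reported for overlapping reads at the same position is one way to identify the length discrepancies that need manual resolution.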

In finalising this assembly, there were positions at which nucleotide bases varied between the aligned sequences. To call the correct base, a set of conditions was followed to reach consensus between the ISGC sequence, the published CLN6 and CALML4 mRNA sequences, the unpublished genomic DNA sequence and the BAC sequence, and a base was called only when these conditions were met. These conditions ensured that base calling was standardised throughout the sequence assembly. In regions where the sequence may not have been fully accurate, the best available sequence was called. There may therefore be errors or miscalled bases in the final 110 kb consensus sequence, as base-calling decisions were based on the resources available at that time. The conditions were as follows:

i. If, within a region of the alignment, only BAC sequence was present with no other supporting sequence, then the BAC sequence was called.

ii. If a single nucleotide varied between all reference sequences, then the sheep BAC nucleotide was called.

iii. If the BAC sequence contained an ambiguous ‘N’ but a specific nucleotide was present in another reference sequence (even from only one source), then the nucleotide from the other source was called.
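The three conditions can be expressed as a small decision function applied to each alignment column. This is a sketch under stated assumptions, not the manual procedure itself: the function name `call_base`, the returned rule labels, the use of `None` or `-` for a sequence absent at a position, and the majority vote among references when the BAC base is ‘N’ are all illustrative choices. Cases the conditions do not cover, such as references that agree with each other but differ from the BAC, are resolved here in favour of the BAC, which is also an assumption.

```python
from collections import Counter

SPECIFIC = set("ACGT")


def call_base(bac, refs):
    """Call one alignment column. Returns (base, rule applied).

    bac  : BAC base at this position ('A', 'C', 'G', 'T', 'N', '-' or None)
    refs : bases from the other aligned sequences (same symbols)
    """
    bac = (bac or "N").upper()
    # Keep only references that carry a specific nucleotide here.
    ref_bases = [r.upper() for r in refs if r and r.upper() in SPECIFIC]

    if bac not in SPECIFIC:
        if ref_bases:
            # Condition iii: BAC is ambiguous, another source is specific.
            # Assumption: take the most frequent reference base.
            return Counter(ref_bases).most_common(1)[0][0], "iii"
        return "N", "unresolved"
    if not ref_bases:
        # Condition i: only BAC sequence is present.
        return bac, "i"
    if set(ref_bases) != {bac}:
        # Condition ii: the nucleotide varies between sequences.
        return bac, "ii"
    return bac, "agree"


print(call_base("A", [None, "-"]))      # ('A', 'i')
print(call_base("A", ["G", "A", "A"]))  # ('A', 'ii')
print(call_base("N", [None, "C"]))      # ('C', 'iii')
```

Returning the rule label alongside the base keeps a record of why each position was called, which makes later review of uncertain positions easier.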

The sheep sequence in the publicly available ISGC and GenBank databases was incomplete when the study began in 2010. The Roche 454 sequences generated in this study therefore bridged gaps and enriched the sequence information available for the genome, specifically in the CLN6 region of interest, which was crucial in providing a reference sequence for mutation screening (Chapters 6 and 7).

In conclusion, Roche 454 pyrosequencing of the ovine BAC was cost-effective and efficient, and provided approximately 120 kb of normal sheep genomic reference sequence at ~14X coverage.

CHAPTER 6: MUTATION SCREENING

APPROACH 1: SEQUENCE CAPTURE FOR TARGETED SEQUENCING

6.1 Introduction

The generation of new ovine sequences (Chapters 4 and 5), supplemented with known ovine sequences from published and unpublished sources, greatly enriched the sequence information within and flanking ovine CLN6. A consensus sequence formed from a combination of these sequences was subjected to two mutation screening approaches, described in this and the following chapter (Chapter 7). The approach described in the present chapter is based on next-generation sequencing (NGS) of genomic DNA enriched by sequence capture.

6.2 Materials and methods

In document Identification of a novel mutation in the CLN6 gene (CLN6) in South Hampshire sheep affected with Neuronal Ceroid Lipofuscinosis (Page 134-137)