Alignment of the OARv3.1 to the ovine reference sequence

CHAPTER 7: MUTATION SCREENING APPROACH 2: NGS SOLiD SEQUENCING OF

7.3.5 Alignment of the OARv3.1 to the ovine reference sequence

When the ovine genome assembly OARv3.1 became available in 2013, its sequence was uploaded as an additional track on the GBrowse overview and details panels (‘Oar3’ alignment; Figure 7.6). In general, there was good similarity between OARv3.1 and the ovine reference sequence, especially considering that the former is derived from a single Texel sheep whereas the ovine reference sequence is a combination of an ovine BAC and Merino, Coopworth and

There were large gaps in the alignment between the ‘Oar3’ sequence and LR-PCR regions ‘C2ar’ and ‘5fr’. The ‘Oar3’ sequence aligned to LR-PCR region ‘4ir’ contained the longest contiguous repeat sequence (see ‘repeatmasker’ track). This sequence was analysed using the RepeatMasker program (Smit et al., 1996) and the repeat was found to match repeat class/family long interspersed nuclear element LINE/RTE-BovB (137) at position 3710 - 2708.

There was significant sequence variation in LR-PCR region ‘2ir’, where sequence was not determined in ‘Oar3’ at position 39,980 - 40,330 of the ovine reference sequence (see Oar3 alignment track). Interestingly, there were eight flanking ‘GGGGCGGA’ octonucleotide repeats immediately upstream of this region. Exon 1 lies within this unsequenced gap region of genomic DNA and thus its exact position could not be ascertained.
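As an illustration of how a tandem repeat of this kind could be located computationally, the following minimal Python sketch counts back-to-back copies of the ‘GGGGCGGA’ motif in a sequence string. The function name and the toy sequence are hypothetical and are not taken from the ovine reference sequence.

```python
import re

def longest_tandem_run(sequence, motif):
    """Return (start, copies) for the longest run of consecutive motif copies.

    start is 0-based; (-1, 0) is returned when the motif is absent.
    """
    best_start, best_copies = -1, 0
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    for match in pattern.finditer(sequence.upper()):
        copies = (match.end() - match.start()) // len(motif)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies

# Toy sequence (assumption): eight motif copies flanked by arbitrary bases.
toy = "ATCG" + "GGGGCGGA" * 8 + "TTAC"
print(longest_tandem_run(toy, "GGGGCGGA"))  # (4, 8)
```

Applied to real data, the same function could be run on the sequence immediately upstream of the gap region, although the exact coordinates to extract would need to be taken from the reference sequence itself.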

There was also significant sequence variation in LR-PCR region ‘2jr’ due to the presence of an insertion in the ovine reference sequence at position 35,820 - 36,079.

Alignment between the OARv3.1 sequence and the ovine reference sequence for PCR region ‘C4dr’ showed high sequence similarity, with the only mismatches due to the presence of ambiguous or ‘N’ nucleotides in the ovine reference sequence. Further analysis revealed that the ovine reference sequence in this region originated from the merging of 19 overlapping ovine BAC contigs (see Chapter 5).

7.4 Discussion

A large deletion of approximately 415 bp, which appears to cover the whole of exon 1 of CLN6, was observed only in the three affected South Hampshire sheep and was considered the most likely causative mutation for NCL in South Hampshire sheep. No other DNA variants appeared to segregate with the disease in South Hampshire sheep, but the known Merino disease-causing mutation as well as a SNP used as an indirect DNA test for the South Hampshire were correctly identified. Confirmation of the identified deletion and definition of its breakpoints in additional sheep with this genetic variant are described in Chapter 8.

Identification of a disease-causing mutation in a specific genomic region can be a challenging task. The recent reduction in the cost of whole genome sequencing (WGS) and advances in bioinformatics have made WGS a time- and cost-effective approach, even in species with preliminary draft genome assemblies such as horses (Towers et al., 2013) and dogs (Drögemüller et al., 2014). However, when this study was conducted in August 2010, the sheep genome assembly was incomplete and the costs of whole genome sequencing were prohibitive.

The combination of long-range PCR (LR-PCR) with NGS offered the possibility of performing mutation analysis in a relatively large region of interest in a time-efficient and economical way. LR-PCR amplifies products much larger (up to 12 kb for genomic DNA and 20 kb for phage/plasmid DNA) than those achieved with conventional Taq polymerases (up to 3 kb) (Mullis et al., 1986).

At the start of this study, 60 primers were designed for amplification of the whole CLN6 genomic region and flanking sequences, including CALML4. Extended PCR optimisation of the various primer combinations resulted in 28 primers that generated 14 partially overlapping LR-PCR products of the expected sizes (Table 7.2), covering an estimated 49 kb region of interest. The remaining 32 primers generated either no product or multiple bands. It was discovered later using RepeatMasker (Smit et al., 1996 - 2010) that several of the remaining primers were located in repeat regions, which would explain some of the instances of multiple PCR products.

Initially there was debate about which sequencing platform to use (Illumina, 454 pyrosequencing or SOLiD sequencing-by-ligation) and which labelling approach to apply to the primers (biotinylation or amine modification). The decision was based on efficiencies of cost and resources (sheep DNA, overwhelming raw data and bioinformatic analysis for non-targeted regions of interest) when compared to sequencing the entire genomic DNA using other platforms. Other considerations included the relative ease of sequencing through challenging templates using the SOLiD platform, and compatibility between the amine labelling approach and the chosen NGS platform. A comparison of these NGS platforms was described earlier (Chapter 2).

able to alter the original properties of the oligonucleotide; however, a re-optimisation step in the form of an increase in annealing temperature was required for efficient PCR amplification of most of the LR-PCR products in this study. End modification positioned at the 3’- or 5’-end of the primers greatly reduced or prevented over-representation of amplicon ends (overlapping intervals of the amplicon ends, which result in extremely high coverage in these areas compared to other areas) in the sequencing libraries, thus improving the overall uniformity of sequence coverage (Petermann, pers. comm.).

Analysis of SOLiD sequence data provided by the service provider identified SNPs relative to the ovine reference sequence. Identification of two known SNPs in CLN6 (Tammen et al., 2006) using the SOLiD sequence generated from the LR-PCR products is evidence that this approach worked successfully for mutation screening.

SNPs found within a gene coding sequence are often given the highest priority for further analysis because they are likely to affect the amino acid sequence of a protein and could be disease associated (Cargill et al., 1999). The majority of the SNPs identified in this study were in non-coding regions immediately adjacent to the CLN6 and CALML4 gene coding regions. SNPs found in these regions can affect biological functions such as splicing and gene regulation (Jaenisch and Bird, 2003; Cargill et al., 1999). However, those identified in this study were of low significance, as they did not segregate with the NCL disease phenotype in sheep.
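The segregation criterion used to prioritise variants can be sketched in a few lines of Python. This is a minimal illustration that assumes a fully penetrant autosomal recessive model; the function name, animal labels and genotypes are hypothetical and do not reproduce the data of this study.

```python
def segregates_recessive(genotypes, phenotypes, allele):
    """Return True if `allele` fits a fully penetrant recessive model.

    genotypes:  dict of animal -> two-letter genotype string, e.g. "CT"
    phenotypes: dict of animal -> "affected", "carrier" or "normal"
    """
    for animal, genotype in genotypes.items():
        copies = genotype.count(allele)
        status = phenotypes[animal]
        if status == "affected" and copies != 2:
            return False  # affected animals must be homozygous
        if status == "carrier" and copies != 1:
            return False  # obligate carriers must be heterozygous
        if status == "normal" and copies == 2:
            return False  # normal animals cannot be homozygous
    return True

# Invented example animals (assumption, not study data).
phenotypes = {"s1": "affected", "s2": "affected", "s3": "carrier", "s4": "normal"}
variant_a = {"s1": "TT", "s2": "TT", "s3": "CT", "s4": "CC"}
variant_b = {"s1": "TT", "s2": "CT", "s3": "CT", "s4": "CC"}

print(segregates_recessive(variant_a, phenotypes, "T"))  # True
print(segregates_recessive(variant_b, phenotypes, "T"))  # False
```

A variant such as the hypothetical `variant_b`, where one affected animal is heterozygous, would be deprioritised under this model.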

The known SNP identified in exon 7 (c.822C>T in reverse complement) was found to segregate as expected in the normal Coopworth and carrier South Hampshire sheep. However, the reference sequence contained the T nucleotide, which is the allele associated with the disease in the South Hampshire NCL research flock. This SNP has been identified in unrelated sheep from different breeds and in the ovine reference sequence, which comprised a consensus of the early ovine genome draft (OARv2.0), ISGC, CLN6 and CALML4 published mRNA sequence, unpublished in-house genomic DNA and BAC sequencing (Chapter 5).

As indels play an important role in biological processes and human disease (Ley et al., 2003; Strausberg et al., 2003; Pao et al., 2004; Cox et al., 2005), their accurate detection, annotation and characterisation are critical for high-throughput human resequencing studies. Although indels were detected in the coding sequence of the CLN6 and CALML4 genes, these did not segregate with the NCL phenotype; thus no further analysis was undertaken at this stage.

The LR-PCR and SOLiD sequencing approach worked successfully because possible issues were rigorously considered in the design stage of the study. Prior to sequencing, the purified amplicons were pooled in equimolar ratios using molar information obtained from the Bioanalyzer analysis. Accurate equimolar pooling is known to be important for equal distribution of reads, sufficient coverage and successful variant detection (Harakalova et al., 2011). Our experimental design of sequencing two equimolar pools of 7 LR-PCR products each per animal resulted in successful sequencing for all 8 sheep, with overall good uniformity in coverage. However, some inconsistencies in read alignment (read depth and position) were observed.
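The arithmetic behind equimolar pooling can be sketched as follows. This is a minimal illustration, assuming an average mass of about 660 g/mol per base pair of double-stranded DNA and concentrations expressed in ng/µL; in this study the molar values were taken from the Bioanalyzer output, so the function names and the amplicon values below are hypothetical.

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA

def molarity_nM(conc_ng_per_ul, length_bp):
    """Convert a mass concentration (ng/µL) to a molar concentration (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)

def equimolar_volumes(amplicons, target_fmol):
    """Volume (µL) of each amplicon that contributes `target_fmol` to a pool.

    amplicons: dict of name -> (concentration in ng/µL, length in bp)
    1 nM equals 1 fmol/µL, so volume = target amount / molarity.
    """
    return {
        name: target_fmol / molarity_nM(conc, length)
        for name, (conc, length) in amplicons.items()
    }

# Invented amplicons (assumption): name -> (ng/µL, bp).
amplicons = {"ampA": (21.78, 3300), "ampB": (13.2, 4000)}
for name, volume in equimolar_volumes(amplicons, target_fmol=50).items():
    print(f"{name}: {volume:.1f} µL")
```

In this toy example the longer, more dilute product requires twice the volume of the shorter one to contribute the same number of molecules, which is why pooling by mass or by volume alone would bias the read distribution.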

Poor quality samples usually produce lower sequencing depths (Ulahannan et al., 2013), which might have contributed to the low coverage for PCR region ‘C4dr’ across all sheep. However, this is unlikely to be the reason why sheep 4 in PCR region ‘2ir’, sheep 2 in PCR region ‘3dr’, sheep 6 in PCR region ‘4ar’, sheep 7 in PCR region ‘3ir’ and sheep 1 - 4 in PCR region ‘5fr’ had relatively low read depths (Figure 7.6). The PCR products for those regions in these animals did not have the lowest amount of DNA when compared to other animals in the same PCR region (Table 7.5).

The low sequence read depth in PCR region ‘C1cr’ is unlikely to be due to DNA quantity issues (Table 7.5). Where long-range PCR products are pooled in equimolar amounts, sequence coverage drops drastically in fragments smaller than the average length (Knierim et al., 2011), which in our case is 4 kb. However, ‘C1cr’ is 4.5 kb longer than average length, and the reasons for the poor read depth in all animals for this region, as well as for the lower read depth for specific regions in some animals described above, remain unclear. Considering that the read depth in these “lower coverage” regions was at least 1,000 reads, it is still sufficient for mutation screening.

Unexpected read alignments were observed in both tracks of the same sheep in the following PCR regions: sheep 1 - 6 in PCR region ‘5fr’, sheep 2 in PCR region ‘C2ar’ and sheep 1, 2 and 4 in PCR region ‘C3cr’. These most likely occurred because of human error. This could have taken place during the pooling of multiple PCR products from different reactions to achieve the minimum amount of DNA for SOLiD sequencing. Ideally, this should have been identified in the Bioanalyzer analysis, but the three PCR regions are of similar sizes (between 3.4 and 3.7 kb), so it might have gone undetected. Review of the Bioanalyzer data did not show any unusual peaks to indicate possible mixing of samples from these different regions. Alternatively, human error could have occurred when amplicons from individual wells in the 96-well plates were combined in equimolar amounts into two pools of seven non-overlapping LR-PCR products.

The various SNPs and indels identified by the service provider, when compared to the ovine reference sequence, did not correspond to the disease phenotype in the South Hampshire sheep. Visualisation of sequence reads in GBrowse suggested that a large deletion occurred in all three affected South Hampshire sheep in the region of PCR product ‘2ir’, positioned approximately at 39,920 - 40,335 bp in the ovine reference sequence and 14,836,464 - 14,838,151 bp in the OARv3.1 ovine genome sequence. Considering that LR-PCR region ‘2ir’ contains exon 1 of CLN6, the large deletion is highly likely to include the whole of exon 1 as well. Alignment of this region to OARv3.1 revealed a large gap in sequence; thus the sequence could not yet be confirmed.
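The relationship between the approximate deletion coordinates and the region not determined in ‘Oar3’ (39,980 - 40,330; Section 7.3.5) can be checked with simple interval arithmetic. The sketch below is illustrative only: it assumes that a span is calculated as end minus start, which reproduces the approximate 415 bp figure, and the helper names are hypothetical.

```python
def span(start, end):
    """Length of an interval, taken here as end minus start (assumption)."""
    return end - start

def overlap(a, b):
    """Length of the overlap between two (start, end) intervals; 0 if disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

deletion = (39_920, 40_335)  # approximate deletion in the ovine reference sequence
gap = (39_980, 40_330)       # region not determined in 'Oar3'

print(span(*deletion))                      # 415
print(span(*gap))                           # 350
print(overlap(deletion, gap) == span(*gap)) # True: the gap lies within the deletion
```

Under these assumptions the undetermined region falls entirely inside the approximate deletion interval, which is consistent with the difficulty of confirming the sequence against OARv3.1.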

However, the identified variant was positioned in a region where all sheep appeared to have relatively low sequence coverage (fewer than 1,000 reads), located within an unsequenced gap region in OAR7v3.1 (39,980 - 40,330 bp), which in addition has been reported to be difficult to sequence in other species (Tammen et al., 2006). Sequencing of challenging templates has been shown to decrease sequencing coverage irrespective of the method utilised (Sanger or NGS) (Bachmann et al., 2003; Kieleczawa, 2006; Yu et al., 2013). Bioanalyzer analysis of all 112 LR-PCR products showed that the lengths of the products were not significantly different from the expected sizes and that sizes did not differ significantly among animals of different phenotypes within a specific region. Statistical analysis of the data (Table 7.6) further supports this observation, providing no significant evidence of a large insertion or deletion associated with the disease.

content and Repeatmasker tracks in Figure 7.6) which required the addition of the PCR additive DMSO (Winship, 1989) for effective amplification of the PCR product ‘2ir’. Approaches to sequence through these templates have been suggested; however, they all seem to be specific to particular types of difficult templates and are not broadly applicable. Examples include a method termed ‘Slow down PCR’ (Bachmann et al., 2003) and a combination of DMSO and betaine additives in the PCR for sequencing through GC-rich regions (Jensen et al., 2010).

Further work arising from these studies could include the verification of the genetic variants that do not segregate with the disease phenotypes. The SNPs and indels identified in this study were not confirmed by independent methods, and could represent sequencing errors or could be due to errors in the reference sequence. After comparison with information about known genetic variants in the ovine CLN6 and CALML4 genes (e.g. Ensembl variation tables for these genes), any new polymorphisms might be of interest for future research in relation to protein function and/or genetic marker development.

CHAPTER 8: VERIFICATION OF A LARGE
