Comparative enrichment study (exome array versus gene capture array)

Chapter 3. Validation of a wheat gene capture array

3.4 Comparative enrichment study (exome array versus gene capture array)

The exome capture array bait probes were tiled across 198,056 design-space contigs, which ranged in length from 51 to 2,467 bp and were used as reference sequences in all relevant mapping analyses. Similarly, the gene capture array bait probes were tiled across 169,345 design-space contigs, which ranged in length from 100 to 13,168 bp and were used as reference sequences in all mapping analyses that involved the gene capture array.
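Summary figures of this kind (contig count, minimum, maximum and mean length) can be derived directly from a design-space FASTA file. The following is a minimal sketch only; the function name and the toy sequences are hypothetical and are not taken from the analysis described here.

```python
from statistics import mean


def contig_length_summary(fasta_text):
    """Return contig count, min, max and mean length for FASTA-formatted text."""
    lengths = []
    current = None  # length of the contig being read; None before the first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        raise ValueError("no FASTA records found")
    return {
        "contigs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


# Toy input (hypothetical): two contigs of 12 bp and 3 bp.
example = ">contig_1\nACGTACGT\nACGT\n>contig_2\nACG\n"
summary = contig_length_summary(example)
```

For a real design-space file the text would be read from disk first; the sketch keeps the input in memory so that it stays self-contained.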

The hexaploid wheat variety Rialto was enriched using the gene capture array and sequenced externally using Illumina sequencing technology (Genome Analyzer IIx). Because of the shorter length of the reference sequence contigs, the resultant paired-end sequencing dataset was mapped to the gene capture array design-space as a fragment library using BWA (v 0.6.2) short read mapping (prior to the 2013 update to BWA-backtrack). The reference sequence was indexed using the ‘IS’ algorithm and, during alignment of reads to the reference, 4 mismatches were allowed per sequencing read. All unmapped, non-uniquely mapped and duplicate reads were subsequently removed using SAMtools. The steps involved in this analysis are shown in figure 1.6 and example commands are outlined in the command line appendix sections 1 and 4. Rialto was also enriched using the exome capture array in solution and sequenced using SOLiD sequencing technology. This dataset was mapped to the exome capture array design-space using the same steps, filtering and parameters, with additional parameters to allow processing of colour-space SOLiD reads.
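The read filtering step described above (removal of unmapped, non-uniquely mapped and duplicate reads) can be illustrated with a short sketch. This is not the SAMtools procedure used in the analysis. It assumes that duplicates have already been flagged with the SAM flag bit 0x400 and that unique hits carry the `XT:A:U` tag, as written by BWA's short read aligner. The function names and example records are hypothetical.

```python
UNMAPPED = 0x4      # SAM flag bit: read is unmapped
DUPLICATE = 0x400   # SAM flag bit: read is marked as a PCR/optical duplicate


def keep_read(sam_line):
    """Return True for a mapped, non-duplicate read tagged as a unique hit."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & UNMAPPED or flag & DUPLICATE:
        return False
    # Optional tags start at column 12; XT:A:U is assumed to mark unique hits.
    return "XT:A:U" in fields[11:]


def filter_sam(lines):
    """Keep header lines and reads that pass keep_read."""
    return [line for line in lines if line.startswith("@") or keep_read(line)]


def make_record(name, flag, tags):
    """Build a minimal SAM record (hypothetical values) for the example."""
    core = [name, str(flag), "contig_1", "10", "37", "8M", "*", "0", "0",
            "ACGTACGT", "IIIIIIII"]
    return "\t".join(core + tags)


records = [
    "@SQ\tSN:contig_1\tLN:100",
    make_record("r1", 0, ["XT:A:U"]),              # unique, kept
    make_record("r2", 4, []),                      # unmapped, dropped
    make_record("r3", 0, ["XT:A:R"]),              # repeat hit, dropped
    make_record("r4", 1024, ["XT:A:U"]),           # duplicate, dropped
    make_record("r5", 16, ["XT:A:U", "NM:i:1"]),   # unique reverse strand, kept
]
kept = filter_sam(records)
kept_names = [line.split("\t")[0] for line in kept if not line.startswith("@")]
```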

Non-enriched DNA (whole genome sequencing) from the wheat variety Rialto, sequenced externally using SOLiD sequencing technology, was also mapped separately to both the gene capture array design-space and the exome capture array design-space. The same steps and parameters were used as detailed for the enriched Rialto above, including the parameters that allow processing of colour-space SOLiD reads. The program csfasta2fastq was used to convert the SOLiD csfasta/qual files into a fastq file for use as input to BWA.

SAMtools mpileup (v 0.1.16) was run on all mapped datasets and SNP calls were then filtered using VarScan (VarScan.v2.2.3.jar) with the following parameters: discard SNPs covered by 20 or fewer reads, discard sequencing reads with a quality less than 20, and discard an alternate allele if it has fewer than 2 supporting reads passing the quality filter. For this SNP analysis, awk was used to remove indels from the VarScan output.

This analysis enabled a comprehensive comparison of enrichment quality between the original exome capture array and the gene capture array. The results are shown below in table 3.1. The non-enriched data maps with a deeper average coverage to the gene capture array than to the exome capture array; it also maps to more of the gene capture array, i.e. ~42% compared to ~31%. Although the gene capture array is double the size of the exome capture array, it has fewer design-space contigs, since its average contig length is ~654 bp compared with ~205 bp for the exome capture array. It is therefore not surprising that the significantly longer reference contigs of the gene capture array gave a higher depth and more extensive coverage. This could also account for the ability to confidently call a higher number of SNPs using the gene capture array.
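The SNP filtering criteria listed above can be sketched as follows. This is a simplified stand-in for illustration and is not VarScan's own algorithm. It assumes that the quality threshold is applied per base observation, and it treats pileup entries beginning with `+` or `-` as indels to be skipped. All names and example values are hypothetical.

```python
from collections import Counter

MIN_BASE_QUALITY = 20     # observations below this quality are ignored
MAX_REJECTED_DEPTH = 20   # positions covered by 20 or fewer reads are discarded
MIN_ALT_READS = 2         # minimum supporting reads per alternate allele


def candidate_snp_alleles(ref_base, observations):
    """Return alternate alleles at one position that pass all three filters.

    observations is a list of (base, phred_quality) pairs for that position.
    """
    usable = [
        base.upper()
        for base, qual in observations
        if qual >= MIN_BASE_QUALITY and not base.startswith(("+", "-"))
    ]
    if len(usable) <= MAX_REJECTED_DEPTH:
        return []
    counts = Counter(b for b in usable if b != ref_base.upper())
    return sorted(allele for allele, n in counts.items() if n >= MIN_ALT_READS)


# Hypothetical position with 29 usable observations: G is supported by 3 reads,
# T by only 1, C only by low-quality bases, and one entry is an insertion.
well_covered = (
    [("A", 30)] * 25 + [("G", 30)] * 3 + [("T", 30)]
    + [("C", 10)] * 5 + [("+2AG", 30)]
)
# Hypothetical position with exactly 20 usable observations (discarded).
shallow = [("A", 30)] * 18 + [("G", 30)] * 2

alleles_well_covered = candidate_snp_alleles("A", well_covered)
alleles_shallow = candidate_snp_alleles("A", shallow)
```

Returning a list rather than a single base allows more than one alternate allele to be reported for a position.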

Table 3.1. Exome capture array versus gene capture array. Mapping statistics for enriched and non-enriched Rialto to the exome capture array design-space and also to the gene capture array design-space.

| Array | Wheat variety | Mean depth | % of array probes mapped | Mean % coverage of array probe | StdDev of coverage depth | % of reads mapped | Total reads | SNP No. |
|---|---|---|---|---|---|---|---|---|
| cDNA array | Rialto (enriched) | 36.5 | 86.4 | 86.1 | 18.8 | 29.4 | 106,435,597 | 146,527 |
| cDNA array | Rialto (non-enriched) | 11.5 | 58.8 | 53.4 | 6.5 | 0.38 | 1,725,138,247 | 57,222 |
| Genomic DNA array | Rialto (enriched) | 268.3 | 98.6 | 96.1 | 183.0 | 49.9 | 642,311,196 | 517,022 |
| Genomic DNA array | Rialto (non-enriched) | 17.1 | 88 | 48.1 | 14.7 | 1.25 | 1,725,138,247 | 124,021 |

The enriched Rialto data, as anticipated, had a much deeper average coverage than the non-enriched Rialto across both arrays. Approximately 6x more sequence data was generated for the gene capture array than for the exome capture array and, as the gene capture array is double the size of the exome capture array, we would expect ~3x more coverage overall; yet the enriched Rialto maps with over 7x deeper coverage to the gene capture array. This is likely because the gene capture array generates less off-target sequence data: almost 50% of its sequence reads can be mapped, compared to ~30% of the exome capture array reads. Data also maps to ~95% of the gene capture array design-space but to only ~65% of the exome capture array design-space, suggesting that a greater proportion of the gene capture array bait probes enrich effectively. Of the unmapped sequencing data, typically ~63% of sequencing reads include repetitive sequence.

For the exome capture array, 21,077 contigs out of 198,056 were not mapped to (11% of reference contigs). For the gene capture array, however, only 579 contigs out of 169,345 were not mapped to; this amounts to less than 1% of the reference contigs and shows a great improvement. 2,399 contigs (~1%) in the exome capture array had a high depth of coverage (over 3 SD from the mean), whilst only 2,415 contigs had a high depth of coverage in the gene capture array, amounting to less than 1% of the reference contigs. The gene capture array performed enrichment more efficiently.
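The per-contig summary above (contigs not mapped to, and contigs with a depth more than 3 SD above the mean) can be sketched as follows. This is a minimal sketch under stated assumptions: the mean and standard deviation are taken over all contigs, including those with zero depth, and the population standard deviation is used. The source does not state either choice. The function names and toy depths are hypothetical.

```python
from statistics import mean, pstdev


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def summarise_contig_depths(depths):
    """Classify contigs from a mapping of contig name to mean depth."""
    values = list(depths.values())
    threshold = mean(values) + 3 * pstdev(values)  # assumed definition of 'high'
    unmapped = [name for name, d in depths.items() if d == 0]
    high = [name for name, d in depths.items() if d > threshold]
    return {
        "unmapped": unmapped,
        "high_depth": high,
        "percent_unmapped": percent(len(unmapped), len(depths)),
    }


# Toy data (hypothetical): 18 typical contigs, one unmapped, one very deep.
toy_depths = {f"contig_{i}": 10.0 for i in range(18)}
toy_depths["contig_unmapped"] = 0.0
toy_depths["contig_deep"] = 1000.0
toy_summary = summarise_contig_depths(toy_depths)

# Unmapped contig fractions recomputed from the counts given in the text.
exome_unmapped_pct = percent(21077, 198056)
gene_unmapped_pct = percent(579, 169345)
```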

3.4.2 SNP analysis

A comparison of the homeologous SNPs that could be identified in enriched and non-enriched Rialto data for each array was carried out to ensure that enrichment did not affect SNP calling, i.e. that the arrays enrich all three wheat genome copies effectively. The low average coverage of the non-enriched data was a potential issue. It was possible that SNP alleles found in the enriched data would not be picked up at all in the non-enriched data, or not at a depth high enough to confidently call SNPs, due to its comparatively low coverage. Conversely, it was possible that low frequency SNP alleles in the non-enriched data could be shown to be false positive SNP calls by the higher coverage of the enriched data. The following approach was therefore adopted. SNPs were only considered for comparison between the non-enriched and enriched datasets if they were in regions that were mapped to in both datasets with a depth greater than or equal to 20. A SNP was then defined as conserved if the alternate allele from one dataset was found in the raw reads of the other (or, in the case of an ambiguous alternate allele, if both alleles represented were seen). As detailed in section 3.6, VarScan outputs multiple alternate alleles for one position if they pass quality filters; therefore, even if multiple homeologous SNP alleles are seen for one position, they can all be validated.
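The conservation rule described above can be expressed as a short sketch. It assumes that ambiguous alternate alleles are written as two-base IUPAC codes, which the source does not state explicitly. The function name, the return convention (`None` for positions that are not comparable) and the example values are hypothetical.

```python
MIN_DEPTH = 20  # both datasets must cover the position at this depth or more

# Single bases and two-base IUPAC ambiguity codes; three-base codes are omitted.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def is_conserved(alt_code, depth_called, depth_other, bases_other):
    """Check a SNP called in one dataset against raw reads of the other.

    Returns None if the position is not comparable (depth below 20 in either
    dataset), otherwise True if every allele represented by alt_code is seen
    among bases_other, the bases observed in the other dataset's raw reads.
    """
    if depth_called < MIN_DEPTH or depth_other < MIN_DEPTH:
        return None
    required = IUPAC[alt_code.upper()]
    observed = {base.upper() for base in bases_other}
    return required <= observed


# Hypothetical positions for illustration.
simple_hit = is_conserved("G", 40, 25, "A" * 24 + "G")      # G seen once
ambiguous_miss = is_conserved("R", 40, 25, "A" * 25)        # G never seen
ambiguous_hit = is_conserved("R", 40, 25, "A" * 20 + "G" * 5)
too_shallow = is_conserved("G", 40, 15, "A" * 14 + "G")     # not comparable
```

A single supporting read in the other dataset is enough here, which reflects the wording "found in the raw reads" rather than a second round of SNP calling.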

For the exome capture array, 86% of SNPs found in the enriched data could be identified in the non-enriched data. The remaining 14% of SNPs from the enriched dataset tended to be low frequency alternate alleles, making them difficult to detect in the low coverage of the non-enriched data. 96% of SNPs found in the non-enriched data could be identified in the enriched dataset. The remaining 4% of non-enriched data SNPs also showed evidence of low frequency alternate alleles or low quality, and as such may have been shown to be false positives by the high coverage of the enriched data.

For the gene capture array, 97% of SNPs found in the enriched data were found in the non-enriched data, with only 3% that could not be seen at all. 96% of SNPs found in the non-enriched data were found in the enriched data, with only 4% that could not be seen at all. The same reasoning as for the exome capture array could account for the 3% and 4% of SNPs that could not be seen. Importantly, the proportion of unseen SNP alleles was consistently low in both of the enriched datasets, although an improvement was noted with use of the gene capture array.
