Mutant identification in a hexaploid using an artificial dataset

Chapter 2. Mutant identification in the model plant Arabidopsis thaliana

2.7 Mutant identification in a hexaploid using an artificial dataset

This novel mutant identification pipeline was finally trialed on an artificial hexaploid sequencing dataset. An additional diploid mutant was developed and added to the existing tetraploid mutant created in section 2.6.1 to create this hexaploid mutant. Creation of this diploid was based on the evolution of the D genome of wheat. Evolver was run with the Arabidopsis Col-0 reference sequence as input and a ‘branch length’ of 0.0137 to create an Arabidopsis D genome that was as different from Col-0 as the wheat D genome is from its ancestor Ae. tauschii. This branch length was the value for genome D in wheat as calculated by Gu, Y. Q. et al. and creates a substitution approximately every 73 bases. It was not ideal that the A and B genomes’ branch lengths were taken from their evolution from T. turgidum while the D genome’s branch length was taken from its evolution from Ae. tauschii; however, these plants are hexaploid wheat’s genome donors and so provide a guideline for diversity between the three genomes, since little is known about their long-term divergence. Here, as for the Arabidopsis B genome, a minimally evolved Arabidopsis D’ genome was developed to complement the Arabidopsis D genome and to create heterozygosity within it. To create this, the difference between wheat genomes A and D was calculated and divided by 10; this figure (0.00044) was used as the branch length, with the Arabidopsis D genome as input for Evolver.
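The relationship between the branch lengths quoted above and the expected spacing of substitutions can be checked with a short calculation. The sketch below assumes that a branch length is read as the expected number of substitutions per site, which is consistent with the stated figure of one substitution approximately every 73 bases; the function name is hypothetical and is not part of Evolver.

```python
def mean_substitution_spacing(branch_length: float) -> float:
    """Expected number of bases between substitutions.

    Assumption: the branch length is interpreted as the expected
    number of substitutions per site.
    """
    if branch_length <= 0:
        raise ValueError("branch length must be positive")
    return 1.0 / branch_length


# Branch lengths quoted in the text.
d_spacing = mean_substitution_spacing(0.0137)         # Arabidopsis D genome
d_prime_spacing = mean_substitution_spacing(0.00044)  # Arabidopsis D' genome

print(f"D genome: one substitution roughly every {d_spacing:.0f} bases")
print(f"D' genome: one substitution roughly every {d_prime_spacing:.0f} bases")
```

Under this reading, the D’ genome would differ from the D genome at roughly one site in every 2,273, which illustrates how sparse the introduced heterozygosity is compared with the divergence of the D genome from Col-0.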

A phenotype-inducing SNP in a hexaploid is likely to be homozygous and to lie in a homozygous hotspot within one genome; as such, the Illumina data generated in section 2.6.1 for genomes A and B could be re-used and the D genome data simply added to it. With no mutation in the Arabidopsis D genome, data could be generated for it as follows: the Arabidopsis D genome and the Arabidopsis D’ genome sequences were each used once as input for SAMtools wgsim, with the same parameters, to create another 100,000,000 read pairs. This Illumina data was merged with the section 2.6.1 Arabidopsis genome A and B sequencing data (200,000,000 read pairs in total), resulting in 100,000,000 read pairs for each genome and ensuring no bias between genomes.

The same features as detailed in section 2.6.1 were expected for this hexaploid; however, the addition of the Arabidopsis D genome altered the proportions of reads affected. Because three genomes are present, heterozygotes in one of the three genomes along chromosomes 2-5, and in chromosome 1 beyond 3,500,000bp, were in general found in approximately 16.6% of sequencing reads (unless identical heterozygotes were present in 2 or even 3 genomes, or more than 2 alternate alleles were present; both were found to be rare occurrences which would not affect results). In the ‘phased’ regions of chromosome 1 approaching the mutation (0-1,600,000bp and 1,900,000-3,500,000bp), alternate alleles were expected in ~25% of reads. In the homozygous region of chromosome 1 (1,600,000-1,900,000bp), the high frequency of alternate alleles that are homozygous relative to the Arabidopsis Col-0 reference, generated by the Arabidopsis genome A, results in an alternate allele in ~33% of sequencing reads.
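The ~16.6% and ~33% figures follow from counting allele copies. The sketch below is a minimal illustration, assuming that each of the three diploid genomes contributes an equal share of reads (as ensured by the equal read-pair counts) and that a site is variant in only one genome; the function name and parameters are hypothetical. The ~25% expectation in the ‘phased’ regions is an intermediate case and is not modeled here.

```python
from fractions import Fraction


def expected_alt_fraction(alt_copies: int,
                          n_subgenomes: int = 3,
                          copies_per_subgenome: int = 2) -> Fraction:
    """Expected fraction of reads carrying the alternate allele.

    Assumption: every chromosome copy is sequenced to the same depth,
    so the fraction is simply alt copies / total copies.
    """
    total_copies = n_subgenomes * copies_per_subgenome
    if not 0 <= alt_copies <= total_copies:
        raise ValueError("alt_copies must lie between 0 and the total copy number")
    return Fraction(alt_copies, total_copies)


# Heterozygote in one genome: 1 of 6 copies carries the alternate allele.
het_fraction = expected_alt_fraction(1)
# Homozygote in one genome: 2 of 6 copies carry the alternate allele.
hom_fraction = expected_alt_fraction(2)

print(f"heterozygote: {float(het_fraction):.3f}, homozygote: {float(hom_fraction):.3f}")
```

Setting `n_subgenomes=2` gives the corresponding tetraploid expectations of one quarter and one half under the same assumptions.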

2.7.2 Mutant identification in the artificial hexaploid dataset

After the Illumina data was generated for the hexaploid, it was analyzed identically to the data in both of the previous sections using the SHORE/SHOREmap pipeline and later the bespoke mutant identification pipeline, using Arabidopsis Col-0 (TAIR10) as the reference sequence. SHOREmap ‘denovo’ was run on the output files with a window size of 100,000bp. The minor allele frequency text file was filtered using awk to remove any minor alleles with a frequency in the sequencing reads lower than 10% or greater than 20%, since heterozygotes in this dataset are expected in ~16.6% of sequencing reads. The output from ‘denovo’ using this filtered minor allele frequency file was inconclusive. The minor allele frequency text file was therefore re-filtered using awk to remove any minor alleles with a frequency in the sequencing reads lower than 16% or greater than 17%, and the output from ‘denovo’ using the re-filtered file is shown in figure 2.9a.
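The two rounds of frequency filtering can be illustrated with a small sketch. This is not the awk command used in the analysis; the record layout (chromosome, position, minor allele frequency as a fraction) is an assumption made for illustration and does not reproduce the SHOREmap file format, and whether the band boundaries were inclusive is also assumed.

```python
from typing import Iterable, List, Tuple

# Hypothetical record: (chromosome, position, minor allele frequency in reads)
Marker = Tuple[str, int, float]


def filter_by_frequency(markers: Iterable[Marker],
                        low: float,
                        high: float) -> List[Marker]:
    """Keep only markers whose frequency lies within [low, high]."""
    return [m for m in markers if low <= m[2] <= high]


# Made-up example records, not taken from the dataset.
example_markers = [
    ("1", 1200, 0.08),
    ("1", 5300, 0.165),
    ("2", 900, 0.19),
    ("3", 40, 0.34),
]

broad = filter_by_frequency(example_markers, 0.10, 0.20)   # first filter
narrow = filter_by_frequency(example_markers, 0.16, 0.17)  # re-filter

print(len(broad), len(narrow))
```

The narrower band retains only markers very close to the expected one-in-six frequency, trading marker number for a cleaner signal.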

This dataset was also taken through the bespoke mutant identification pipeline. The VarScan output was filtered for SNPs with more than 28% and less than 38% of the sequencing reads containing the alternate allele, and these SNPs were added to a homozygotes file. A heterozygote SNP was defined as having greater than or equal to 10% and less than or equal to 20% of sequencing reads containing the alternate allele, and these SNPs were also filtered from the VarScan output into a heterozygotes file. Both files were then run through the Perl script (Allele-frequency-interval-determination.pl), which calculated the numbers of heterozygotes and homozygotes per 100,000bp interval along each chromosome, along with the homozygote to heterozygote ratio. These ratios were plotted to produce the frequency plot shown in figure 2.9b. In both analyses (figure 2.9a and 2.9b) the average depth of coverage was calculated to be ~300.
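The classification and windowing steps described above can be sketched as follows. This is an illustrative reimplementation in Python, not the Perl script itself: the input layout (chromosome, position, percentage of reads with the alternate allele), the zero-based window numbering, and the handling of windows without heterozygotes (the denominator is floored at 1) are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

WINDOW_SIZE = 100_000  # interval size in bp, as stated in the text

# Hypothetical record: (chromosome, position, % of reads with alternate allele)
Snp = Tuple[str, int, float]


def classify(alt_pct: float) -> Optional[str]:
    """Apply the thresholds from the text to one SNP."""
    if 28 < alt_pct < 38:
        return "hom"
    if 10 <= alt_pct <= 20:
        return "het"
    return None  # outside both bands, so not counted


def window_ratios(snps: Iterable[Snp],
                  window: int = WINDOW_SIZE) -> Dict[Tuple[str, int], float]:
    """Homozygote to heterozygote ratio per (chromosome, window index)."""
    counts = defaultdict(lambda: {"hom": 0, "het": 0})
    for chrom, pos, alt_pct in snps:
        label = classify(alt_pct)
        if label is not None:
            counts[(chrom, pos // window)][label] += 1
    # Assumption: floor the denominator at 1 to avoid division by zero.
    return {key: c["hom"] / max(c["het"], 1) for key, c in counts.items()}


# Made-up example records, not taken from the dataset.
example_snps = [
    ("1", 1_650_000, 36.67),
    ("1", 1_655_000, 33.0),
    ("1", 1_660_000, 15.0),
    ("2", 50_000, 16.6),
    ("2", 60_000, 45.0),
]
ratios = window_ratios(example_snps)
print(ratios)
```

With zero-based numbering, position 1,650,000 on chromosome 1 falls in window 16, so a homozygous hotspot spanning 1,600,000-1,900,000bp would appear as elevated ratios in windows 16 to 18.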

[Figure 2.9: panels (a) and (b), each with one plot per chromosome for Chromosomes 1-5; panel (a) plots are labeled winstep:10000, winsize:100000.]

Figure 2.9. Allele frequency analysis of a simulated Arabidopsis hexaploid mutant. (a) ‘SHOREmap denovo’ output pdf file for an artificial Arabidopsis mutant mapped to the TAIR10 Arabidopsis reference genome (window size of 100,000bp used). (b) Output from bespoke allele frequency analysis of artificial Arabidopsis mutant. Plots describe the homozygote to heterozygote ratio (y axis) per 100,000bp window along each chromosome (x axis).

In figure 2.9a a peak appears in the target region on chromosome 1, and on closer inspection this maps to ~1,360,000-1,920,000bp. SHOREmap ‘annotate’ was run over this peak and a list of 9228 homozygous SNPs in ~33% of sequencing reads was identified, of which the mutant SNP was 2368th (see Appendix 1, table 8). It was identified correctly as a G→A SNP and found in ~37% of sequencing reads with a depth of coverage of 105. Figure 2.9b supports these findings, since a peak appears in the target region in chromosome 1 windows 16, 17 and 18, corresponding to 1,600,000-1,900,000bp. 6197 homozygous SNPs were found within this peak in the VarScan output (see Appendix 1, table 9). Within this list the mutant SNP was identified correctly as a G→A SNP and found in 36.67% of sequencing reads, with a depth of coverage of the alternate allele of 117. Either the bespoke pipeline or SHORE/SHOREmap can therefore be used effectively to determine the correct interval containing a phenotype-inducing SNP position within a hexaploid mutant.

In document New technologies for high throughput genetic analysis of complex genomes (Page 86-90)