I conducted several experiments that demonstrate the versatility of the MegaMUGA array. I first assessed the genotyping error rate of MegaMUGA and found it to be remarkably low (0.04%) when comparing biological replicates of 11 inbred lines. I also compared those samples to the Sanger genotypes and found an inconsistency rate of 2.1%. I hypothesized that most inconsistencies were systematic, due either to incorrect Sanger genotypes or to poorly performing MegaMUGA probes. When I eliminated the 4,052 poorly performing markers (5.2% of markers), the rate of inconsistency fell to 0.005%. I developed a set of QC metrics for determining the quality of array data (described in Appendix A). I implemented these methods in an R package, megamugaQC, that will be released along with the publication on the MegaMUGA array.
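The replicate and reference comparisons above reduce to counting discordant genotype calls. The following is a minimal sketch in Python, not the megamugaQC implementation; the function names, the call encoding ("A", "B", "H") and the no-call symbol "N" are assumptions made for illustration.

```python
def discordance_rate(calls_a, calls_b, no_call="N"):
    """Fraction of markers called in both vectors whose genotypes differ."""
    if len(calls_a) != len(calls_b):
        raise ValueError("call vectors must cover the same markers")
    compared = 0
    mismatched = 0
    for a, b in zip(calls_a, calls_b):
        if a == no_call or b == no_call:
            continue  # a marker with a missing call cannot be compared
        compared += 1
        if a != b:
            mismatched += 1
    return mismatched / compared if compared else float("nan")


def per_marker_discordance(array_calls, reference_calls, no_call="N"):
    """Discordance of each marker across samples.

    Both arguments map sample name -> list of calls in the same marker order
    (a hypothetical layout). Markers with a high rate are candidates for
    systematic inconsistency rather than random genotyping error.
    """
    samples = list(array_calls)
    n_markers = len(array_calls[samples[0]])
    rates = []
    for m in range(n_markers):
        array_column = [array_calls[s][m] for s in samples]
        reference_column = [reference_calls[s][m] for s in samples]
        rates.append(discordance_rate(array_column, reference_column, no_call))
    return rates
```

Computing the rate per marker rather than per sample is what separates systematic inconsistencies (a few markers that disagree in nearly every sample) from random errors spread thinly across all markers.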
I assigned subspecific origin to the autosomes and Chr X of all samples. First, I identified 36,822 diagnostic SNPs using the methods described above for MDA. I developed a Hidden Markov Model (HMM) with seven states corresponding to pure M. m. domesticus, M. m. musculus and M. m. castaneus, the three pairwise mixtures, and an indeterminate state in which all three subspecies are equally likely. I assigned initial state probabilities based on the predicted subspecies of each sample (0.94 for the state corresponding to the predicted subspecies and 0.01 for each other state). I used a transition probability matrix in which the diagonal values were 0.94 and all other values were 0.01. I estimated the mean and covariance matrix parameters of the multivariate normal distribution by averaging the diagnostic values for each subspecies in a 5 Mb sliding window. The background mean and variance were based on the number of misclassifications for each diagnostic allele. For each sample, chromosome and subspecies in each 5 Mb sliding window, I summed the diagnostic values for each matching allele and divided by the total number of diagnostic alleles to derive the three-variable observation matrix. I then used the HMM to assign the subspecific state to each window based on this matrix. MegaMUGA did a reasonably good job of recapitulating previous results based on MDA data. All wild-caught animals were classified as completely pure representatives of their predicted subspecies. For inbred strains, most small introgressions (<0.5 Mb) were not identified. Large introgressions were identified as a deviation from the predicted subspecies, but the subspecific origin was often assigned incorrectly (usually as a mixture of two subspecies). Those results were expected due to the relatively low density of M. m. musculus and M. m. castaneus diagnostic alleles.
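The seven-state model described above can be sketched with a standard Viterbi decoder. This is an illustrative sketch only: the initial and transition probabilities (0.94 and 0.01) come from the text, but the emission means and covariances below are placeholder values chosen for the example, not the parameters estimated from the data, and the function names are assumptions.

```python
import numpy as np

STATES = ["dom", "mus", "cas", "dom/mus", "dom/cas", "mus/cas", "indeterminate"]

# Placeholder emission parameters (assumed for illustration): each observation
# is the fraction of diagnostic alleles matching (domesticus, musculus, castaneus).
MEANS = np.array([
    [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5],
    [1 / 3, 1 / 3, 1 / 3],
])
COVS = [0.01 * np.eye(3) for _ in STATES]


def initial_probs(predicted_state, n_states=7, p_self=0.94, p_other=0.01):
    """0.94 for the predicted subspecies state, 0.01 for each other state."""
    pi = np.full(n_states, p_other)
    pi[predicted_state] = p_self
    return pi


def transition_matrix(n_states=7, p_stay=0.94, p_switch=0.01):
    """0.94 on the diagonal and 0.01 elsewhere; each row sums to 1 for 7 states."""
    T = np.full((n_states, n_states), p_switch)
    np.fill_diagonal(T, p_stay)
    return T


def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal, written out with NumPy only."""
    diff = np.asarray(x, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet + maha)


def viterbi(obs, means, covs, pi, T):
    """Most likely state index for each window, decoded in log space."""
    n_windows, n_states = len(obs), len(pi)
    log_e = np.array([[gaussian_logpdf(o, means[k], covs[k])
                       for k in range(n_states)] for o in obs])
    log_T = np.log(T)
    score = np.log(pi) + log_e[0]
    back = np.zeros((n_windows, n_states), dtype=int)
    for t in range(1, n_windows):
        cand = score[:, None] + log_T      # cand[i, j]: leave state i, enter state j
        back[t] = cand.argmax(axis=0)      # best predecessor of each state j
        score = cand.max(axis=0) + log_e[t]
    path = [int(score.argmax())]
    for t in range(n_windows - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With these placeholder parameters, a run of windows dominated by domesticus alleles followed by a run dominated by musculus alleles decodes as a switch from state 0 to state 1. Sparse diagnostic alleles would widen the emission covariance for a subspecies, which is one way a true introgression could be decoded as a pairwise mixture.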
While the genotypes for Chr Y and mitochondrial markers were robust, the relatively small number of markers made phylogenetic analysis problematic. I examined the intensity data and found a large number of additional sample clusters (i.e., VINOs) that were important to producing correct phylogenies. Since the VINO identification method has not yet been extended to MegaMUGA, I performed supervised clustering of 44 mitochondrial and 38 Chr Y probes that were both unique and had multiple distinct clusters. I randomly assigned non-allelic genotypes to the clusters beyond the two expected alleles. I created parsimony trees using the DNAPARS program in the PHYLIP package [84]. The Chr Y phylogeny yielded a single best tree, while the mitochondrial phylogeny yielded multiple best trees. I analyzed each SNP independently using a test for leaf node proximity [85] and found that 26 of the mitochondrial markers (59%) showed evidence of homoplasy; only 9 Chr Y markers were homoplastic (24%). This high level of homoplasy in the mitochondrial tree was expected because the majority of the mitochondrial SNPs on MegaMUGA are located in the D-loop region, which has an extremely high mutation rate [86].
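One simple way to see what homoplasy means for a single marker is to count the minimum number of state changes it requires on a given tree. The sketch below uses Fitch parsimony for that count; it is a stand-in chosen for illustration and is not the leaf node proximity test [85] used in the analysis above. The nested-tuple tree encoding and the function names are assumptions.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one marker on a rooted binary tree.

    tree   : nested 2-tuples whose leaves are sample names (hypothetical encoding)
    states : dict mapping sample name -> observed state (allele or cluster label)
    Returns (changes, set of states observed at the leaves of the tree).
    """
    observed = set()

    def visit(node):
        if isinstance(node, str):
            observed.add(states[node])
            return {states[node]}, 0
        left, right = node
        left_set, left_count = visit(left)
        right_set, right_count = visit(right)
        common = left_set & right_set
        if common:
            return common, left_count + right_count
        # Disjoint child sets force one change on a branch below this node.
        return left_set | right_set, left_count + right_count + 1

    _, changes = visit(tree)
    return changes, observed


def is_homoplastic(tree, states):
    """A marker with k observed states needs at least k - 1 changes;
    any additional change implies a repeated or reversed mutation."""
    changes, observed = fitch_changes(tree, states)
    return changes > len(observed) - 1
```

For example, on the tree ((("a", "b"), "c"), ("d", "e")), a marker that separates {a, b, c} from {d, e} needs one change and is consistent with the tree, whereas a marker shared only by a and d needs two changes and is flagged as homoplastic.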
Significant increases or decreases in intensity across consecutive probes are indicative of copy number variation (CNV). MUGA and MegaMUGA have been important tools for us and others in studying multiple types of CNV. In creating the Chr Y phylogeny, I uncovered intraspecific variation in the pseudo-autosomal region (PAR) of Chr Y. The PAR is a ∼700 kb region of homology between Chrs X and Y where the two chromosomes pair and undergo recombination during male meiosis. It was previously reported that the region of homology in the CAST/EiJ strain (derived from M. m. castaneus mice from Thailand) is 430 kb longer than in other mouse strains. However, intensity data showed that, in M. m. castaneus mice from Taiwan, the region of homology extended less than 100 kb beyond the ancestral boundary. Furthermore, recombination data from the CC showed that 90% of recombinations involving CAST/EiJ PARs were within the ancestral PAR, 10% were within the 100 kb proximal to the ancestral PAR, and no recombinations occurred in the other 330 kb of the CAST/EiJ region of homology. This finding suggests that the CAST/EiJ PAR evolved through two separate events: first a duplication of ∼100 kb of Chr X sequence in the ancestral M. m. castaneus lineage, followed by a second duplication event exclusive to some subset of southeast Asian mice that happened after they diverged from the Taiwanese population. This lends further support to the finding that M. m. castaneus is polytypic [87].
In another study, I surveyed 100 mouse cell lines using the GAP algorithm [88] and found that a large fraction of lines (15%) had evidence of whole-chromosome loss or gain for at least one chromosome.
I also developed a method to distinguish male and female samples based on their X- and Y-chromosome intensity profiles, and simultaneously detect sub-chromosomal CNV. Briefly, I used a supervised method based on predicted sex to identify sex-specific intensity distributions for each marker. I then determined the probability that a sample belongs to each distribution within a moving SNP window, and I identified intervals of consistent copy number prediction. I predicted the baseline chromosome copy number from the relative local copy-number rates. I used this method to predict the sex of approximately 5,000 MegaMUGA arrays in our database. I identified 27 samples with obviously incorrect reported sexes. I also identified 33 females having a single Chr X (XO, Figure 2.11). Interestingly, the frequency of XOs is much higher in the DO population than in other laboratory strains or wild mice. This finding is the subject of ongoing investigation.
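The windowed classification step can be illustrated as follows. This is a minimal sketch under stated assumptions, not the method as implemented: it models each marker's intensity in each sex as a normal distribution, assumes equal prior probability for the two sexes, and uses a fixed window of consecutive markers. All function names and the window size are hypothetical.

```python
import numpy as np


def fit_marker_distributions(intensity, is_male):
    """Per-marker mean and standard deviation of intensity for each sex.

    intensity : array of shape (samples, markers) for Chr X (or Chr Y) markers
    is_male   : boolean array giving the predicted sex of each training sample
    """
    intensity = np.asarray(intensity, dtype=float)
    is_male = np.asarray(is_male, dtype=bool)
    params = {}
    for label, mask in (("male", is_male), ("female", ~is_male)):
        group = intensity[mask]
        # Small constant guards against zero variance at a marker.
        params[label] = (group.mean(axis=0), group.std(axis=0, ddof=1) + 1e-6)
    return params


def normal_logpdf(x, mean, sd):
    return -0.5 * np.log(2 * np.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)


def window_male_probability(sample, params, window=5):
    """Probability that each moving window of markers matches the male distribution."""
    sample = np.asarray(sample, dtype=float)
    llr = (normal_logpdf(sample, *params["male"])
           - normal_logpdf(sample, *params["female"]))
    # Sum per-marker log-likelihood ratios over each run of `window` markers.
    window_llr = np.convolve(llr, np.ones(window), mode="valid")
    window_llr = np.clip(window_llr, -50.0, 50.0)  # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-window_llr))
```

A sample whose windows all agree gives a whole-chromosome call, while a run of windows that disagrees with the rest of the chromosome marks a candidate sub-chromosomal copy-number change.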
Finally, we recently used MegaMUGA to prove the existence of a segregating ∼250 kb duplication on Chr 12 in the CAST/EiJ inbred line, which was initially predicted from allele-specific analysis of RNA sequencing data (Crowley et al., submitted).
Figure 2.11: MegaMUGA can identify chromosome loss. Intensity profiles of A) a normal male, B) an XO female and C) a normal female. Each panel shows B-allele frequency (BAF, top) and Log-R ratio (LRR, bottom). BAF is a measure of the ratio of the A and B alleles for a SNP; points near 0.0 indicate the A allele, points near 1.0 indicate the B allele, and points near 0.5 indicate a heterozygous genotype. LRR is a measure of the sum intensity of a SNP relative to a reference distribution; values above zero indicate greater intensity than the reference, and values below zero indicate lower intensity than the reference. Values below zero on Chr X are expected for the male, who has only one Chr X, and thus half as much Chr X DNA to hybridize to the array. Similarly, values below zero on the Chr Y are expected for the female. An XO female is detected by values below zero on both Chrs X and Y.
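The decision rule implied by the caption of Figure 2.11 can be written as a small function. This is a sketch only: the use of the median LRR per chromosome and the cutoff of -0.3 are assumptions chosen for illustration, and a real cutoff would need to be calibrated on samples of known karyotype.

```python
from statistics import median


def call_sex_karyotype(lrr_x, lrr_y, cutoff=-0.3):
    """Classify a sample from its Log-R ratios on Chr X and Chr Y.

    Following the figure caption: low LRR on Chr X only suggests a male (XY),
    low LRR on Chr Y only suggests a female (XX), and low LRR on both
    suggests an XO female. `cutoff` is a hypothetical threshold.
    """
    x_low = median(lrr_x) < cutoff
    y_low = median(lrr_y) < cutoff
    if x_low and y_low:
        return "XO"
    if x_low:
        return "XY"
    if y_low:
        return "XX"
    return "other"  # neither chromosome is reduced; not covered by the caption
```

Using the median rather than the mean keeps a few poorly performing probes from shifting the call.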