Variable intensity oligonucleotides (VINOs)

In document Didion_unc_0153D_14419.pdf (Page 36-45)

Genotype calling programs use a variety of methods to infer discrete genotypes from continuous intensity data. Many methods, including the standard Affymetrix algorithm (BRLMM-P 2D [63]), employ clustering of multiple samples based on the contrast between allelic probe intensities. Samples belonging to the two clusters with a large absolute contrast are called as homozygous genotypes and samples with low contrast are called heterozygous. Samples that do not fall within any of the three clusters in the contrast dimension remain uncalled (Figure 2.3).
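The contrast-based calling described above can be illustrated with a minimal sketch. It uses the contrast and mean-intensity definitions given in the Figure 2.3 caption; the function names and the two cutoffs are hypothetical, and fixed thresholds stand in for the multi-sample clustering that BRLMM-P 2D and similar programs actually perform.

```python
import math


def contrast_and_intensity(a, b):
    """Return (contrast, log2 mean intensity) for A and B allele intensities."""
    contrast = (a - b) / (a + b)
    log2_mean = math.log2((a + b) / 2)
    return contrast, log2_mean


def naive_call(a, b, hom_cutoff=0.5, het_cutoff=0.2):
    """Toy genotype call from contrast alone.

    The cutoffs are illustrative assumptions, not parameters of any
    published algorithm. Samples between the heterozygous and
    homozygous zones are left uncalled ("N").
    """
    contrast, _ = contrast_and_intensity(a, b)
    if contrast >= hom_cutoff:
        return "AA"
    if contrast <= -hom_cutoff:
        return "BB"
    if abs(contrast) <= het_cutoff:
        return "AB"
    return "N"
```

Because this toy caller ignores overall intensity, a sample whose A and B signals are both weak is treated exactly like a well-hybridized sample with the same contrast.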

We previously genotyped 162 laboratory mouse strains using the MDA [40]. Contrary to our expectation of homozygosity at all SNPs in inbred mouse strains, we observed a substantial number of heterozygous genotype calls. Furthermore, the rates of both no-calls and unexpected heterozygous calls were positively correlated with divergence from the reference genome. The highest rates were observed in strains derived from species of the genus Mus other than Mus musculus, such as M. spretus and M. spicilegus, followed by strains derived from the M. m. musculus and M. m. castaneus subspecies. These findings were indicative of problems affecting all hybridization arrays, genotype calling software and studies that use those genotype data for a variety of goals. Our studies of well-characterized inbred strains provided an opportunity for investigating the underlying causes of genotyping errors.

Figure 2.3: VINOs are identified as a cluster of low-intensity samples. Contrast plots of a SNP called by A) BRLMM-P 2D and B) MouseDivGeno. Probe intensities from 351 samples are shown in MA-transformed space. The sample contrast is the normalized difference between A and B allele intensities [(A-B)/(A+B)]. The y-axis shows the log2 mean of A and B allele intensities. Dark blue: AA call; light blue: BB call; purple: AB call; red: V call; gray: N call. Circles represent strains with a homozygous haplotype in the region of the SNP, while squares represent strains with a heterozygous haplotype. F1 animals with parental alleles of AA and BB are true heterozygotes and are highlighted along with their parental strains. The MouseDivGeno software is able to identify samples in the low-intensity cluster as containing an OTV and assigns a VINO (V) call, whereas BRLMM-P 2D assigns several different genotype calls (AB, N) to samples in this cluster.

Essentially, a no-call or incorrect genotype call is the result of abnormal hybridization intensity for a sample at a given SNP and may be due to technical or biological causes. Technical errors are generally either very obvious, such as a high no-call rate due to poor DNA quality, or slight enough that they do not affect genotype calling. On the other hand, genotype calling errors that are biological in origin can be attributed to previously uncharacterized variation in genomic DNA, either in the sequence targeted by a probe set or in the proximal or distal restriction sites used for genome-wide amplification. These variants can reduce hybridization intensity sufficiently to eliminate or reverse the contrast between allelic probes such that an incorrect genotype call (or no-call) is made. We call such variants “off-target variants” (OTVs) to distinguish them from the expected variant targeted by the SNP probe set. We call probe sets that are affected by OTVs “variable intensity oligonucleotides” (VINOs) due to the dynamic effect of OTVs on hybridization intensity [41].

We hypothesized that OTVs were the primary cause of miscalls and no-calls. Hyuna Yang developed a novel genotype calling algorithm (MouseDivGeno [40, 41]) that also recognizes clusters of samples apart from those with the standard homozygous or heterozygous genotypes. Probes with such clusters are considered putative VINOs, and the samples in those clusters are given a genotype call of “V” (Figure 2.3). To confirm that VINOs do indeed represent previously unidentified genetic variation, we selected 15 SNP probes with VINO calls. For each probe, I selected at least four mouse strains of each genotype (homozygous for allele A, homozygous for B or VINO) for targeted sequencing. Strains for resequencing were selected to maximally sample across subspecies and strain type (classical or wild-derived). I designed sequencing primers approximately 200 bp proximal and distal to each probe using PrimerQuest (Integrated DNA Technologies). I amplified probe regions by PCR and submitted them for automated Sanger sequencing at UNC. I aligned the resulting sequences using Sequencher 4.9 (Gene Codes). Supplementary Table 4 of [40] lists all probes, strains and primer sequences used. I confirmed that all homozygous SNP genotype calls were concordant with the sequencing data. In addition, for 14 of the 15 probes the VINO calls were associated with the presence of one or more additional variants near the target SNP. The final case was explained by polymorphisms outside of the sequenced region that altered the cut sites for the enzymes used for genome-wide amplification.
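The idea of assigning a separate “V” call to a low-intensity cluster can be illustrated with a deliberately simple sketch. This is not the MouseDivGeno algorithm, whose details are given in [41]; the function name, the use of the median of homozygous samples as a baseline, and the one-log2-unit drop are all assumptions made for illustration.

```python
import statistics


def flag_low_intensity(log2_intensity, calls, min_drop=1.0):
    """Relabel samples as "V" when their log2 mean intensity is far below
    the median intensity of samples called homozygous at the same SNP.

    `min_drop` (in log2 units) is a hypothetical threshold.
    """
    reference = [x for x, c in zip(log2_intensity, calls) if c in ("AA", "BB")]
    if not reference:
        # No homozygous samples to compare against: leave calls unchanged.
        return list(calls)
    baseline = statistics.median(reference)
    return [
        "V" if x < baseline - min_drop else c
        for x, c in zip(log2_intensity, calls)
    ]
```

For example, with intensities `[11.0, 11.2, 10.9, 8.5, 9.0]` and calls `["AA", "BB", "AA", "AB", "N"]`, the last two samples fall more than one log2 unit below the homozygous median and are relabeled as V.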

We followed up on this work with a more thorough characterization of the effects of OTVs on hybridization intensities, and a formal description of the MouseDivGeno software [41]. We first hybridized 351 mouse DNA samples on the MDA. Those data are now public (http://cgd.jax.org/datasets/diversityarray/CELfiles.shtml) and include classical inbred strains, wild-derived strains, consomic strains, recombinant inbred strains, samples from early generations of the CC, F1 hybrids and wild mice; they are among the largest mouse genotype datasets available. Among the 143 inbred strains in that sample (116 classical and 27 wild-derived), we observed a significant increase in both heterozygous calls and no-calls as a function of genetic distance from the reference genome (Figure 2.4). All of those strains were expected to be fully homozygous (for at least 99% of their genomes) based on previous studies; we therefore assumed that most of the heterozygous calls were errors (miscalls). We called genotypes for our sample set using three different algorithms: BRLMM-P 2D [63], Alchemy [64] and MouseDivGeno. We found that genotype calls for the set of 351 samples were highly concordant in homozygous and heterozygous classes (97.4-97.8% agreement). The majority of discordant genotypes were homozygous calls by one method that were called heterozygous by another. Conflicts between opposite homozygous genotypes were very rare (less than 0.05% in all comparisons). The overall rate of AB genotypes was slightly lower for MouseDivGeno (10.26%) than for Alchemy (11.45%) and BRLMM-P 2D (11.62%). Of the VINO calls from MouseDivGeno, 9.76% and 46.04% were called AB by Alchemy and BRLMM-P 2D, respectively, while 65.32% and 34.04% were called as N.

Of the 18 strains resequenced by the Sanger Institute [56, 59], 15 are M. musculus inbred strains that were genotyped with the MDA. I obtained and filtered SNPs and small insertions/deletions (indels) for those strains at autosomal typed loci (Appendix A). I re-annotated all MDA probes by aligning them to the latest version of the mouse genome (Build 37) using BWA [65]. Probes on the MDA were 25 bp long, and the target SNP was typically located in the center of the probe. For each probe, I identified the number, type and position of OTVs, as well as the presence of OTVs in either proximal or distal restriction sites. I used dbSNP and Ensembl to link each probe to functional classifications in public databases. I also noted whether each probe was in a region of low or missing sequence coverage for any of the Sanger strains.

[Figure 2.4 plot not reproduced. Strain group labels: C57BL/6, Classical Laboratory Strains, M. m. domesticus, Wild-derived Laboratory Strains, M. m. musculus, M. m. castaneus. X-axis: Genetic Distance from Reference (Fraction of Non-reference Genotype Calls), 0.1 to 0.5. Y-axis: Number of SNP Probe Sets, log scale from 10^0 to 10^6. Legend: Genotype Call A, H, B, V, N.]

Figure 2.4: Non-homozygous VINO call rates increase with divergence from the reference genome. A) Genetic distance from the mouse reference genome for 143 laboratory inbred strains. Each strain is shown as a vertical tick mark. Strains are grouped according to their origin and arranged left-to-right in increasing order of genetic distance from the reference. Genetic distance is computed as the fraction of non-reference (non-A allele) genotype calls. B) VINO calls for each strain. For each strain, the number of SNP probe sets assigned each of the five possible calls (A, B, H, V or N) is shown as five points of different colors that sum to 526,363 SNP probe sets.
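The genetic distance used in Figure 2.4 (the fraction of non-reference genotype calls) can be sketched as follows. The function name is hypothetical, and counting every call other than A (B, H, V and N) in the numerator is a literal reading of the caption, so it should be treated as an assumption; the original analysis may have handled V and N calls differently.

```python
from collections import Counter


def genetic_distance(calls, reference_call="A"):
    """Fraction of genotype calls for one strain that are not the reference call.

    `calls` is an iterable of per-probe-set calls drawn from
    {"A", "B", "H", "V", "N"}. All non-reference calls, including
    V and N, are counted as non-reference (an assumption).
    """
    counts = Counter(calls)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotype calls supplied")
    return (total - counts[reference_call]) / total
```

A strain with eight A calls, one B call and one H call would have a distance of 0.2 under this definition.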

As expected given the inbred status of the strains present in both the Sanger and MDA data sets, there were no heterozygous calls in the filtered Sanger genotypes. The MDA genotypes for those samples had heterozygous call rates of 1-2%; the homozygous calls were highly concordant between the two data sets (99.8%). MouseDivGeno made 35,604 VINO calls (0.48% of total calls), a proportion similar to the one observed in the larger set of 351 samples. Among VINOs, 81.4% corresponded to AA or BB homozygous genotype calls in the Sanger data. Because Sanger SNPs were identified by alignment to the reference sequence, regions that could not be aligned were inaccessible to SNP discovery and thus not comparable with array genotypes. The size of the inaccessible fraction of the genome increased with a strain’s divergence from the reference. I observed an enrichment of VINO calls in inaccessible regions of the Sanger data (2,221 VINO calls compared to an expectation of 54) [56], in probes with a deleted target base (24 vs. 2 expected) and in unaligned or non-uniquely aligned probes (4,361 vs. 82 expected).
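The two comparisons in this paragraph, concordance of homozygous calls and enrichment of observed over expected counts, can be sketched as below. The function names are hypothetical, and restricting the concordance to loci that are homozygous in both data sets is an assumption about how the 99.8% figure was obtained.

```python
def homozygous_concordance(array_calls, sequence_calls):
    """Fraction of matching calls among loci called AA or BB in both data sets."""
    pairs = [
        (a, s)
        for a, s in zip(array_calls, sequence_calls)
        if a in ("AA", "BB") and s in ("AA", "BB")
    ]
    if not pairs:
        raise ValueError("no loci are homozygous in both data sets")
    return sum(a == s for a, s in pairs) / len(pairs)


def fold_enrichment(observed, expected):
    """Ratio of an observed count to its expected count."""
    return observed / expected
```

Applied to the counts quoted above, `fold_enrichment(2221, 54)` is roughly 41, `fold_enrichment(24, 2)` is 12 and `fold_enrichment(4361, 82)` is roughly 53.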

I examined the correlation between hybridization intensity and OTV position relative to the target SNP for the probes that had OTVs in at least one of the strains (Figure 2.5). I found that OTVs located within the first 3 bp of either the 5' or 3' end of a target sequence (edge OTVs) had a relatively minor effect on hybridization intensity. In contrast, OTVs within the central region of the probe (central OTVs) had a pronounced effect on hybridization intensity, with mean intensity differing by more than one standard deviation from that of probes having no OTVs. I also found that OTVs that disrupted a restriction site and increased the minimum fragment length to greater than 1500 bp significantly reduced hybridization intensities. I predicted from these results that MouseDivGeno was undercalling VINOs by at least 1/3, since VINOs could not be recognized when the OTV was located in 6 of the 24 off-target positions. I determined the false-negative and false-positive rates for VINO calling by comparing predicted VINOs with the Sanger genotypes. Using the Sanger data as the “truth” was problematic due to miscalled or uncalled SNPs in that data set as well as known problems with the mouse genome assembly [66], but it was the best available metric. The measured false-negative rate for sequences with central OTVs was 55%. In most cases, false negatives were due to samples failing to meet the stringent requirements for VINO calling that were used to minimize the false-positive rate. The false-positive rate was 19.8%. I examined the performance of Alchemy and BRLMM-P and found a more than 30-fold increase in no-call rates for unexplained VINOs.
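The edge versus central distinction can be made concrete with a short sketch. It assumes a 25-bp probe with the target SNP at the central (0-based) offset 12, as stated for typical MDA probes, and takes “within the first 3 bp of either end” to mean a distance of 0, 1 or 2 from the nearest edge. The function and constant names are hypothetical.

```python
PROBE_LENGTH = 25  # probe length in bp, from the text
EDGE_WIDTH = 3     # OTVs within this many bp of either end count as edge OTVs


def distance_to_edge(offset, probe_length=PROBE_LENGTH):
    """Distance (in bp) from a 0-based probe offset to the nearest probe end."""
    if not 0 <= offset < probe_length:
        raise ValueError("offset lies outside the probe")
    return min(offset, probe_length - 1 - offset)


def classify_otv(offset, target_offset=PROBE_LENGTH // 2):
    """Label a probe offset as the target SNP, an edge OTV or a central OTV."""
    if offset == target_offset:
        return "target"
    return "edge" if distance_to_edge(offset) < EDGE_WIDTH else "central"


# Offsets where an OTV would fall in each class for a centered target SNP.
edge_positions = [i for i in range(PROBE_LENGTH) if classify_otv(i) == "edge"]
off_target_positions = [i for i in range(PROBE_LENGTH) if classify_otv(i) != "target"]
```

Under these assumptions, 6 of the 24 off-target offsets are edge positions. If OTVs were spread evenly across offsets, the 6 undetectable positions would amount to one third of the 18 detectable ones, which is one possible reading of the “at least 1/3” estimate; the text does not spell out the calculation.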

[Figure 2.5 plots not reproduced. Panels A-F. Y-axes: Log2(Mean Intensity) for A and B, Percent of Calls for C and D, Probe Sets (Log10) for E and F. X-axes: OTV Position (None, 0-12) for the left panels and Restriction Fragment Length (1000-4000 bp) for the right panels. Legend: call V, N.]

Figure 2.5: OTV position in the probe and RFLP have significant effects on hybridization intensity and VINO detection. Left panels: probe sets are grouped by the distance from the OTV to the nearest edge of the probe sequence for each possible OTV position (either none or between 0-12). Right panels: probe sets having no evidence of an OTV within the probe sequence are grouped by the size of their smallest restriction fragment (NspI or StyI) in bins of 250 bp. Top panels show the mean intensity across each subset using the four probes for the best-hybridizing allele in each probe set for A) OTV position in the probe and B) minimum restriction fragment length. Middle panels show the number of VINO and N calls (as a percentage of all genotype calls) for probe sets grouped by C) OTV position in the probe and D) minimum restriction fragment length. Bottom panels show the number of probes in each bin for E) OTV position in the probe and F) minimum restriction fragment length.

An additional complication in calling VINOs in wild mice, and in inbred mice with known regions of residual heterozygosity, was heterozygous OTVs. By definition, heterozygous OTVs only alter one allele. Therefore, heterozygous genotypes with a nearby heterozygous OTV appeared as homozygous for the allele lacking the OTV. We called those “cryptic VINOs” (Figure 2.6). F1 hybrid mice were used to determine the extent of miscalls due to cryptic VINOs since the phase (i.e., parental origin) of their haplotypes is known. We used a (C57BL/6JxCAST/EiJ)F1 with the expectation that all OTVs would be present only in the CAST/EiJ sequence. We found that 62% of SNPs with OTVs in heterozygosity were called as homozygous, leading to a low concordance rate (83.35%) between the genotypes predicted from the parental strains and the actual genotype calls for the F1 hybrid. Cryptic VINOs represent a substantial source of genotyping error, particularly since they may only be recognized if the parental genotypes are known (and heterozygous parent genotypes will also be affected by cryptic VINOs).
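The F1 comparison described above, predicting the hybrid genotype from the parental calls and checking it against the observed call, can be sketched as follows. The function names are hypothetical, and skipping loci where either parent lacks a homozygous call (for example a V or N call) is an assumption rather than a documented step of the original analysis.

```python
HOMOZYGOUS = ("AA", "BB")


def predict_f1(parent1, parent2):
    """Expected F1 genotype from two homozygous parental calls, else None."""
    if parent1 not in HOMOZYGOUS or parent2 not in HOMOZYGOUS:
        return None
    return parent1 if parent1 == parent2 else "AB"


def f1_concordance(parent1_calls, parent2_calls, f1_calls):
    """Fraction of predictable loci where the observed F1 call matches the prediction."""
    compared = 0
    matches = 0
    for p1, p2, observed in zip(parent1_calls, parent2_calls, f1_calls):
        expected = predict_f1(p1, p2)
        if expected is None:
            continue  # prediction not possible at this locus
        compared += 1
        if observed == expected:
            matches += 1
    if compared == 0:
        raise ValueError("no loci with homozygous calls in both parents")
    return matches / compared


def cryptic_vino_candidates(parent1_calls, parent2_calls, f1_calls):
    """Indices of loci predicted heterozygous but called homozygous in the F1."""
    return [
        i
        for i, (p1, p2, observed) in enumerate(zip(parent1_calls, parent2_calls, f1_calls))
        if predict_f1(p1, p2) == "AB" and observed in HOMOZYGOUS
    ]
```

Loci returned by `cryptic_vino_candidates` are only candidates: a homozygous call at a predicted heterozygous locus could also reflect an ordinary genotyping error.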

[Figure 2.6 plots not reproduced. Panels: A) JAX00000018 (Chr1 3218581), B) JAX00009448 (Chr1 131687853), C) JAX00072427 (Chr16 91048133). X-axis: contrast (A-B)/(A+B), -1.0 to 1.0. Y-axis: Log2 (Mean Intensity).]

Figure 2.6: Detected and undetected VINOs in homozygosity may lead to inaccurate genotyping in heterozygosity. Circles represent parental strains: C57BL/6J (dark blue), which has the AA genotype; CAST/EiJ, which has the BB genotype at its target position and also an OTV within the probe and is called either BB (light blue) or V (red) by MouseDivGeno. Squares represent (C57BL/6JxCAST/EiJ)F1 samples, which have an OTV in heterozygosity and are called AA (dark blue) or N (black) by MouseDivGeno. A) MouseDivGeno calls CAST/EiJ as V; the F1 samples are called AA due to stronger hybridization intensity for the A allele and thus the OTV goes unrecognized. B) MouseDivGeno calls CAST/EiJ as BB due to the absence of a true BB cluster; the F1 samples are again called AA. C) MouseDivGeno calls CAST/EiJ as V but calls the F1 samples as N due to poor discrimination between genotype clusters.

Distances between consecutive SNPs are expected to follow a geometric distribution (Figure 2.7), with a significant proportion in the 0-12 bp range in species with high levels of variation and large population size, such as the house mouse. In a significant fraction of probes with OTVs, we were able to detect the reduction in hybridization intensity and discriminate the samples harboring previously undetected variation from those that do not. VINOs are biased in favor of more divergent samples, in inverse proportion to the degree to which the genetic variants in a given sample were known and represented on the array at the time of design. Thus VINOs could be used to counteract SNP selection bias (discussed further below).

[Figure 2.7 histogram not reproduced. X-axis: Distance Between Consecutive SNPs (bp), 50 to 300. Y-axis: Fraction of SNPs, 0.00 to 0.25. The right-most bin is annotated ">300 = 0.027".]

Figure 2.7: The distance between consecutive SNPs follows a geometric distribution. Histogram of distance between consecutive SNPs in 14 Sanger strains using a bin size of 12 bp. Distances greater than 300 bp are combined in the right-most bin.
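The geometric expectation can be made concrete with a small sketch. It assumes a single per-base probability p that the next position is a SNP, so that distances d = 1, 2, 3, ... have probability (1 - p)^(d - 1) p. The function names are hypothetical, and backing p out of the ">300 = 0.027" annotation assumes that value is the fraction of distances above 300 bp; real inter-SNP distances need not fit a single geometric rate.

```python
def geometric_cdf(k, p):
    """P(distance <= k) for a geometric distribution on 1, 2, 3, ..."""
    return 1 - (1 - p) ** k


def p_from_tail(tail_fraction, max_distance):
    """Per-base SNP probability implied by the fraction of distances > max_distance."""
    return 1 - tail_fraction ** (1 / max_distance)


def bin_fractions(p, bin_size=12, max_distance=300):
    """Expected fraction of distances in each bin, plus the tail above max_distance."""
    fractions = [
        geometric_cdf(low + bin_size, p) - geometric_cdf(low, p)
        for low in range(0, max_distance, bin_size)
    ]
    tail = 1 - geometric_cdf(max_distance, p)
    return fractions, tail
```

For instance, `bin_fractions(p_from_tail(0.027, 300))` returns 25 bin fractions that decrease from left to right, with the first entry giving the expected share of consecutive SNPs that lie within 12 bp of each other under this simple model.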

The method for identifying VINOs is generalizable, and we expect that new genotyping algorithms will take the next logical step of recognizing arbitrary numbers of clusters. We tested MouseDivGeno on a randomly chosen subset of human HapMap data [67]. In 70% of cases, MouseDivGeno either correctly called a VINO or called the correct homozygous allele of the target variant. The remaining 30% were miscalls, all due to cryptic VINOs. We identified a 2:1 bias of VINOs in human YRI (Yoruban African) samples compared with the other three HapMap populations. That was consistent with the greater number of genetic variants in African populations that were unknown at the time of the design of the human SNP array.
