Array data can be used to infer SNP genotypes and copy number

A genotyping array provides data about a sample of genomic DNA in the form of raw fluorescence intensity for each feature, which is contained in a CEL file (in the file format .CEL).175 Each array type has a corresponding chip definition file (.CDF), which provides the annotation for the features on the array using the (x,y) coordinates of the 2-dimensional array surface. Raw fluorescence intensity from the array is converted into a CEL file using a specially designed array scanner coupled with image processing methods.176 It is this CEL file that is used to infer SNP genotypes and CNVs after correction and normalization procedures have been applied.

Normalization procedures are usually employed to account for differences between arrays that are non-biological, including differences in sample labelling and array production.177 For example, batch effects are technical artefacts that can arise when comparing arrays that were processed at different times, or in different “batches”, for which mathematical adjustments can often be made.178 Within-array variation can also stem from the GC content of the probes. Peaks in fluorescence intensity acquired from genotyping arrays correlate with increased GC content.179 This “GC waviness” can be minimized by applying mathematical corrections to array data to account for potential differences in hybridization based on probe sequence.180 The normalized and corrected fluorescence data can then be used to call genotypes and infer copy number.
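As a rough illustration of a probe-sequence correction of this kind, the sketch below removes a linear GC trend from log-scale probe intensities. The linear model, the function names (`gc_fraction`, `gc_correct`) and the toy data are assumptions made for illustration only; published GC-waviness corrections are generally more elaborate, and this is not the procedure of any cited method.

```python
from statistics import mean


def gc_fraction(seq):
    """Return the fraction of G and C bases in a probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_correct(log_intensities, probe_seqs):
    """Remove a linear GC trend from log-scale probe intensities.

    Assumption: intensity depends linearly on probe GC fraction.
    The slope is estimated by ordinary least squares and subtracted,
    so the mean intensity is left unchanged.
    """
    gc = [gc_fraction(s) for s in probe_seqs]
    gc_mean = mean(gc)
    y_mean = mean(log_intensities)
    sxx = sum((g - gc_mean) ** 2 for g in gc)
    if sxx == 0:
        # All probes share the same GC fraction: nothing to correct.
        return list(log_intensities)
    sxy = sum((g - gc_mean) * (y - y_mean)
              for g, y in zip(gc, log_intensities))
    slope = sxy / sxx
    return [y - slope * (g - gc_mean)
            for g, y in zip(gc, log_intensities)]


# Toy example: intensity rises with GC content before correction.
probes = ["AAAA", "AAGC", "GGCC"]
raw = [10.0, 11.0, 12.0]
corrected = gc_correct(raw, probes)
```

In this toy case the three probes differ only through their GC content, so the corrected values collapse onto the common mean.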

Genotyping arrays such as the MDGA require an algorithm to determine the genotype of each SNP locus interrogated by the microarray. These algorithms take into account the fluorescence intensity of all probes used to interrogate each locus (in the case of the MDGA this is eight probes per locus), and calculate the most probable genotype, returned by the algorithm as AA (homozygous A), BB (homozygous B), or AB (heterozygous), where A represents the allele present in the C57BL/6J (reference) genome, and B represents an alternative allele. The algorithm recommended for use with the MDGA is a variation on BRLMM (Bayesian Robust Linear Modeling using Mahalanobis Distance) called BRLMM-P (the P stands for perfect-match, as the MDGA has only perfect-match probes).181 This algorithm looks at the clustering properties of each genotype for each probeset and calls the most likely genotype based on the summarization of each allele’s fluorescence intensities for a particular SNP. Most probesets yield fluorescence intensities that can be graphed as three distinct clusters, with one cluster for each genotype call (AA, AB, and BB). Although the shape and position of each cluster are predicted before the algorithm is run (the prior cluster), the algorithm will re-adjust the clusters as it runs with a given set of CEL files (creating posterior clusters). From these posterior clusters, silhouette scores are calculated for each data point (which represents a single sample’s summarized fluorescence intensity value for the SNP locus in question) to assess the clustering.182 A data point that has a silhouette score below a predetermined threshold is considered to have failed genotyping, as it was not placed near enough to one of the genotyping clusters to confidently call a genotype. When a silhouette score falls below this threshold, the algorithm will return a “NoCall”, indicating that the algorithm was unable to genotype the SNP in question for a particular sample.
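The silhouette-based “NoCall” step can be sketched for a single SNP as follows. This is not BRLMM-P itself: it assumes each sample has already been summarized to one number, uses fixed cluster centres instead of fitting posterior clusters, and the centre values, the 0.5 threshold and the function names are hypothetical. The silhouette score is the standard definition, (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster.

```python
from statistics import mean


def silhouette(i, values, labels):
    """Standard silhouette score for point i in one dimension."""
    v, lab = values[i], labels[i]
    own = [abs(v - w) for j, (w, l) in enumerate(zip(values, labels))
           if l == lab and j != i]
    other = {}
    for w, l in zip(values, labels):
        if l != lab:
            other.setdefault(l, []).append(abs(v - w))
    if not own or not other:
        # Singleton clusters score 0 by convention; the same value is
        # used here when only one cluster is populated (an assumption).
        return 0.0
    a = mean(own)
    b = min(mean(d) for d in other.values())
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


def call_genotypes(values, centers, threshold=0.5):
    """Assign each sample to the nearest centre, then replace calls
    whose silhouette score is below the threshold with "NoCall"."""
    labels = [min(centers, key=lambda g: abs(v - centers[g]))
              for v in values]
    return [lab if silhouette(i, values, labels) >= threshold else "NoCall"
            for i, lab in enumerate(labels)]


# Hypothetical summarized values for eight samples at one SNP.
centers = {"AA": 0.8, "AB": 0.0, "BB": -0.8}
values = [-0.9, -0.8, -0.85, 0.0, 0.05, 0.9, 0.85, 0.45]
calls = call_genotypes(values, centers)
```

The last sample (0.45) lies about midway between the AA and AB clusters, so its silhouette score is close to zero and it is reported as a “NoCall”, while the other seven samples keep their cluster’s genotype.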

1.10.1 Runs of consecutive “NoCalls” may indicate deletions

In some instances, the SNP-calling algorithm is unable to return a genotype, and in these cases the result “NoCall” is returned instead. There are several reasons that a “NoCall” may be returned. The genotyping algorithm itself will have some error rate associated with it; this also means that some of the SNPs that appear to have genotyped well (assigned either AA, AB, or BB) may in reality be a different genotype or a “NoCall”. Typically, a “NoCall” results from low fluorescence intensities from both allele probes (private correspondence with Affymetrix®). A “NoCall” may also indicate hemizygosity at that locus, as the genotyping algorithms are not capable of returning a hemizygous call.183 Another possible cause of a “NoCall” is a two-copy deletion, which in a diploid genome would indicate a copy number of zero. These calls are returned for a single base pair position in the genome (the SNP of interest that is interrogated by a given probeset), but it is possible that two or more “NoCalls” that occur consecutively along a chromosome may indicate a larger deletion, based on the principle that genetic markers that are physically close together on a chromosome are likely to have the same underlying copy number.161,184 Similarly, a recent effort by Standfuß et al. used the MDGA for copy number analysis of tumour tissue, but did so in the absence of genome-wide algorithms designed for the array.85 First, the authors identified normalized SNP probe intensities that differed from a reference intensity calculated from normal tissue samples. Next, they identified groups of consecutive SNPs that reported similar normalized SNP probe intensities, which were interpreted as relative copy number for each SNP. The advantage of using consecutive genotype calls, as opposed to raw fluorescence intensities, is that it allows for the detection of putative deletions in a genome-wide discovery study that does not have the benefit of a case-control study design, as would be necessary with the Standfuß et al. approach.
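The idea of scanning for consecutive “NoCalls” can be sketched as below. The input layout (tuples of chromosome, position and call, already sorted by chromosome and position), the function name `nocall_runs` and the `min_run` parameter are assumptions for illustration. The sketch only flags candidate regions and does not distinguish a deletion from the other causes of a “NoCall” described above.

```python
from itertools import groupby


def nocall_runs(calls, min_run=2):
    """Find runs of consecutive "NoCall" SNPs on each chromosome.

    calls: iterable of (chrom, position, genotype) tuples, assumed to
    be sorted by chromosome and then by position.
    Returns a list of (chrom, first_pos, last_pos, n_snps) tuples for
    every run of at least min_run consecutive "NoCall" results.
    """
    runs = []
    # Group by chromosome first so runs never span two chromosomes.
    for chrom, chrom_calls in groupby(calls, key=lambda c: c[0]):
        # Within a chromosome, group adjacent SNPs by NoCall status.
        for is_nocall, run in groupby(chrom_calls,
                                      key=lambda c: c[2] == "NoCall"):
            run = list(run)
            if is_nocall and len(run) >= min_run:
                runs.append((chrom, run[0][1], run[-1][1], len(run)))
    return runs


# Hypothetical genotype calls for a handful of SNPs.
example_calls = [
    ("chr1", 100, "AA"),
    ("chr1", 200, "NoCall"),
    ("chr1", 300, "NoCall"),
    ("chr1", 400, "AB"),
    ("chr1", 500, "NoCall"),
    ("chr2", 50, "NoCall"),
    ("chr2", 150, "NoCall"),
    ("chr2", 250, "NoCall"),
]
candidate_deletions = nocall_runs(example_calls)
```

Here the isolated “NoCall” at chr1:500 is ignored, and it is not merged with the run that starts chr2, because runs are evaluated separately per chromosome.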
