Evaluation of Partek and PennCNV as Hidden Markov Model-based methods of calling copy number

Of the three methods used in this study to detect copy number, Partek and PennCNV are the most similar to each other: both employ Hidden Markov Model-based algorithms that use fluorescence data obtained from all SNP probes on the filtered SNP list, whereas ConsecN only counts SNP loci for which genotyping failed. Additionally, Partek and PennCNV were run with similar user-set parameters, including normalization and background correction procedures as well as the minimum number of markers used to call a copy number event. Partek and PennCNV had the highest concordance as measured by the number of overlapping loci called by the two methods. Despite these similarities, the events called by Partek and PennCNV were not equivalent, as evidenced by the many more events called by PennCNV than by Partek. Further inspection of the copy number events called by each method revealed that Partek called fewer but longer events than PennCNV. This pattern is consistent with a previous study that used Partek and PennCNV, in addition to three other algorithms, to call copy number from Illumina arrays and also found that Partek called the fewest but largest events.195
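The overlap-based concordance described above can be illustrated with a minimal sketch. The call layout (chromosome, start, inclusive end), the function names, and the example coordinates are assumptions made for illustration; they are not taken from the Partek or PennCNV output formats or from the data in this study.

```python
from typing import List, Tuple

# A call is (chromosome, start, end) with an inclusive end; this layout is an assumption.
Call = Tuple[str, int, int]

def overlaps(a: Call, b: Call) -> bool:
    """True when two calls lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def count_overlapping(calls_a: List[Call], calls_b: List[Call]) -> int:
    """Number of calls in calls_a that overlap at least one call in calls_b."""
    return sum(any(overlaps(a, b) for b in calls_b) for a in calls_a)

# Hypothetical calls: one long event from one method, several short events from another.
long_calls = [("chr1", 1_000, 9_000)]
short_calls = [("chr1", 1_500, 2_000), ("chr1", 4_000, 4_800), ("chr2", 100, 900)]

print(count_overlapping(long_calls, short_calls))   # calls in the first set with a partner
print(count_overlapping(short_calls, long_calls))   # calls in the second set with a partner
```

With these hypothetical inputs the two counts differ (1 and 2), which shows how a method that calls fewer but longer events can overlap several shorter events from another method without the call sets being equivalent.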

the matrix and tree calculated from the pedigree allows the assessment of a given method’s ability to detect copy number events that reflect the expected genetic distance between samples. The difference matrix calculated from Partek’s shared and unshared CNVs was the only difference matrix found to be statistically similar to the Pedigree tree. Additionally, the resulting Partek tree (as well as the PennCNV tree) was found to be similar to the Pedigree tree. These results suggest that Partek may be the most likely of the three methods to be calling CNVs that reflect the known genetic relationships between samples. It is important to remember here that the Pedigree tree is forced to cluster tissues from the same mouse together, as tissue samples from the same mouse were imputed to have a coefficient of relatedness value of zero. As a result, if a calling method is less likely to pick up on CNCs between tissues, it may be more apt to appear topologically similar to the Pedigree tree. This appears to be the case with Partek, which called more events shared between tissues within individual mice than events found in only one of the two tissues. The suggestion that Partek may be less likely to pick up de novo CNCs is strengthened by the fact that Partek called the highest percentage of copy number events that overlap previously discovered CNVs documented in the database (95.98% of Partek events). Conversely, even though PennCNV called more events, a lower percentage of these events were found to overlap the database (63.52% of PennCNV events). Taking this result together with the finding that the PennCNV tree is topologically similar to the Pedigree tree suggests that PennCNV may be calling smaller events that are being missed by Partek.
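As a rough illustration of how a difference matrix can be built from shared and unshared calls, the sketch below uses the Jaccard distance between per-sample sets of event identifiers. The choice of Jaccard distance, the sample labels, and the event identifiers are all assumptions; the distance measure actually used to construct the matrices in this study is not restated here.

```python
from itertools import combinations
from typing import Dict, Set

def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |shared| / |shared + unshared|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def difference_matrix(calls: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Symmetric matrix of pairwise distances between samples, zero on the diagonal."""
    names = sorted(calls)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(calls[x], calls[y])
        matrix[x][y] = matrix[y][x] = d
    return matrix

# Hypothetical event identifiers for two tissues of one mouse and one tissue of another.
calls = {
    "mouse1_tissueA": {"cnv1", "cnv2", "cnv3"},
    "mouse1_tissueB": {"cnv1", "cnv2", "cnv3"},
    "mouse2_tissueA": {"cnv1", "cnv4"},
}
m = difference_matrix(calls)
print(m["mouse1_tissueA"]["mouse1_tissueB"])  # identical call sets within one mouse
print(m["mouse1_tissueA"]["mouse2_tissueA"])  # one shared event out of four in total
```

In this hypothetical case the two tissues of the same mouse have a distance of 0.0, so a method that rarely calls between-tissue differences would tend to cluster tissues by mouse, in line with the argument above.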

The Partek-Pedigree and PennCNV-Pedigree tree similarities in the absence of a Partek-PennCNV tree agreement, in addition to a low degree of concordance in calls between methods, suggest that there are inherent biases in the calling methods. Pinto et al. suggested that the different results obtained by using various methods indicate that the methods themselves may not be better or worse overall in the detection of CNVs, but instead offer “different strengths”.51 The “different strengths” of Partek and PennCNV may become evident on closer inspection of the characteristics of the copy number events called by each method. Despite calling longer events overall, Partek was the only method to find a difference in lengths between singletons and merge-associated events. Although both Partek and PennCNV events had strong correlations between marker density and event length, splitting events into singletons and merge-associated events revealed that while marker density was higher in merge-associated events than in singletons for Partek, the opposite was true for PennCNV. This suggests that Partek may be less likely to call copy number events in genomic regions that are more poorly tagged with SNPs. Additionally, the narrow range of GC content in Partek events may indicate a bias in the genomic regions in which the algorithm is capable of detecting events. The smaller events called by PennCNV and the shift away from regions previously found to harbour CNVs may indicate that PennCNV is capable of detecting previously unknown or de novo variants.
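The singleton versus merge-associated comparison of marker density can be sketched as follows. Marker density is assumed here to mean markers per kilobase of event length, and the event lengths and marker counts are invented for illustration; neither is taken from the results of this study.

```python
from statistics import mean
from typing import List, Tuple

# An event is (length in base pairs, number of markers); this layout is an assumption.
Event = Tuple[int, int]

def marker_density(length_bp: int, n_markers: int) -> float:
    """Markers per kilobase of event length."""
    return n_markers / (length_bp / 1000)

def mean_density(events: List[Event]) -> float:
    """Average marker density over a group of events."""
    return mean(marker_density(length, n) for length, n in events)

# Hypothetical events for one calling method, split by event class.
singletons = [(10_000, 5), (20_000, 10)]
merge_associated = [(10_000, 10), (40_000, 40)]

print(mean_density(singletons))        # 0.5 markers per kb
print(mean_density(merge_associated))  # 1.0 markers per kb
```

A higher mean density in the merge-associated group, as in this invented example, corresponds to the pattern described for Partek; the reverse ordering would correspond to the pattern described for PennCNV.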

4.8.1 The effect of reference selection on HMM-driven copy number detection

With all other factors being equal (sample data, HMM parameters) or as similar as possible (quantile normalization with Partek and sketch quantile normalization with PennCNV), the difference in the CNV calls made by these two software packages is most likely due to the choice of reference used in each approach. Since CNVs are called as departures from the constructed reference, the composition of the data used to construct the reference may influence whether an event is called in any given genomic region in an experimental sample. With mice, the reference genome is constructed from the C57BL/6J inbred strain.43 The MDGA was designed with this in mind, with the A allele for nearly all SNPs based on the genotype carried by C57BL/6J.168 Constructing a copy number reference from CEL files of pure C57BL/6J mice, as was done in Partek in this study, followed this C57BL/6J-based approach in order to call CNVs that differed from the C57BL/6J reference. As noted by Marioni et al., the choice of reference sample influences whether a segment of DNA is regarded as a gain, loss, or “normal”,227 and this is evident when comparing the resulting Partek calls to the PennCNV calls. Although PennCNV called CNVs from the same sample set as Partek and was run with similar HMM parameters, the present application of PennCNV used a very different composition of CEL files as the reference. PennCNV was run with the 335 CGD CEL files, while Partek was run with the seven pure C57BL/6J CGD CEL files. The two subsets of CGD CEL files differ in both the number of CEL files and the genetic diversity of the mice represented by those files.
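The dependence of a call on the reference can be illustrated with a deliberately simplified sketch in which the signal at one probe is expressed as a log2 ratio against the median intensity of a reference set. This is not a description of how Partek or PennCNV compute their signals; the function name and all intensity values are hypothetical.

```python
import math
from statistics import median
from typing import List

def log2_ratio(sample_intensity: float, reference_intensities: List[float]) -> float:
    """Log2 ratio of a sample's probe intensity to the median reference intensity."""
    return math.log2(sample_intensity / median(reference_intensities))

# Hypothetical intensities at a single probe.
sample = 1500.0
uniform_reference = [1000.0, 1010.0, 990.0]                   # small, single-strain reference
diverse_reference = [1000.0, 1500.0, 1600.0, 950.0, 1550.0]   # larger, mixed-background reference

print(round(log2_ratio(sample, uniform_reference), 3))  # 0.585, would look like a gain
print(round(log2_ratio(sample, diverse_reference), 3))  # 0.0, would look "normal"
```

Under these assumptions the same sample intensity appears as a gain against the uniform reference and as no change against the diverse reference, which is the kind of reference-dependent shift discussed above.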

When tackling the issue of reference diversity, it is useful to compare mouse genomic research to human genomic research. As a species, humans are genetically diverse, even within a single population, and individual genomes have a high degree of heterozygosity. This is in contrast to the controlled genetic background of inbred strains of laboratory mice, in which individual mice from one strain are nearly identical to one another, and the genomes of individuals are highly homozygous. Genomic research in humans often makes use of the HapMap population as a reference set. The first study to use the HapMap population to aid in the detection of CNVs was performed by Komura et al. and used a novel algorithm to detect CNVs within the HapMap population.162 The introduction of the MDGA as a copy number detection platform for the laboratory mouse has permitted the generation of CEL files from samples taken from hundreds of mice of different genetic backgrounds. These are the samples that make up the CGD set of CEL files used in this study. By using the 335 CGD CEL files that passed the genotyping step, we can best approximate, for a diverse set of mice, the diversity seen in the HapMap population. The drastic differences between Partek and PennCNV results, given the same array data, underscore the necessity of selecting an appropriate number and composition of reference samples.

4.9 Future directions

Scherer et al. (2007) stress that genome-wide discovery methods, including array-based methods, are best interpreted as screening assays to find regions of the genome that have an “increased probability” of being variable in copy number.7 The independent validation of CNVs discovered with oligonucleotide arrays presents a new set of challenges, as validation methods all have their own drawbacks in inherent biases, limitations, and cost. However, genotyping arrays continue to be valuable tools for CNV, and now CNC, discovery experiments. The identification of putative tissue-specific regions of de novo CNC formation in mice may have strong implications for our understanding of how the genome may change over the lifetime of an individual, as well as for how we diagnose genetic diseases.

The dynamic nature of the fields of genomics and bioinformatics means that methods and interpretations must constantly be re-evaluated and revised as the associated technologies are advanced and optimized. The exclusion of poorly performing probe sets, updates to the sequence and annotation of the reference genome, and better algorithms for detection and background correction not only lead to improved research methods and more accurate results, but also call into question the accuracy of previous entries in the databases to which new data are compared. Updating probe set annotations and filtering out poorly performing probe sets can have a significant impact on array performance and the interpretation of array results.174,202