5.2 ChromoPainter data analysis

5.2.1 Algorithm selection for a comparison with DBAncestry

Beyond presenting an integrated overview of ancestry inference techniques in Section 5.1, I aim to select an approach that represents the current state of the art for a comparison with the DBAncestry technique.

In particular, the comparison technique should satisfy the following criteria:

5.2.1.1 Criteria to select algorithm

(I) Number of markers: the algorithm can handle small marker panels, as outlined in Section 1.5.1, which are representative of my data. Furthermore, the algorithm should scale well to larger marker sets, which may become available in future work.

(II) Number of populations: many ancestry inference techniques have been developed for the admixture of two source populations. However, due to the large number of breeds (Section 1.5.3), a technique is required that performs inference for datasets composed of K populations.

(III) Scalability in the number of populations: although some inference techniques can be used for a large number of populations, a few of them scale impractically, e.g. exponentially in the number of populations.

(IV) Correlation structure: canine markers show strong LD (Section 1.4.3), which favours a model that captures correlation either between pairs of markers or, preferably, across whole haplotype segments.

(V) Experienced research collaborator: to make optimal use of an ancestry inference implementation, a collaboration with either its developer or an experienced user is beneficial.

5.2.1.2 Comparing ancestry groups against criteria

Now, I check these criteria against the most sophisticated approaches from the different ancestry inference groups outlined in Section 5.1 to select a suitable algorithm:

• Regression: one of the most involved regression techniques is MMLR with inverse covariance information which satisfies criteria I, II, III and partially IV. However, modeling of correlation information (IV) is limited to pairwise SNPs.

• Global model-based clustering: Admixture is a fast version of Structure which satisfies I, II and III. However, it does not account for LD.

• Local window-based approaches: Lamp/WinPop satisfies I, II and III. However, LD is not formally modelled. Furthermore, admixture proportions have to be supplied to the algorithm and are not inferred. Given these global admixture proportions, the algorithm finds the best population pair assignment for the current individual.

• Local HMM-based approaches: MultiMix satisfies I, II, III and IV. However, for criterion IV the algorithm only models covariance information between two SNPs and does not consider long-range LD. ChromoPainter satisfies I, II, III, IV and V (Garrett Hellenthal). In particular with respect to IV, ChromoPainter considers haplotype segments instead of correlation between pairs of SNPs and represents test dogs as a linear combination of haplotype segments in the training data.

Furthermore, both MultiMix and ChromoPainter have implementations available which are free for academic use.

• Non-Parametric Bayesian approaches: the iHMM satisfies I, II, III and IV. In particular, LD is accounted for by modeling ancestry as a composite process of two layers: in the first layer, individuals inherit from ancestral populations, which in turn depend on a subset of founder haplotypes. However, at the moment there is no publicly released code available for the iHMM technique.

114 Chapter 5. Alternative ancestry inference models

• PCA-based approaches: the iPCA technique satisfies I, II and III. However, the technique is aimed less at ancestry inference than at unsupervised learning, i.e. at finding the optimal number of populations or detecting population substructure. Furthermore, PCA decorrelates SNPs in eigenspace to maximize retained variance rather than to account for correlation structure. The hybrid technique PCAdmix likewise satisfies criteria I, II and III but does not account for correlation in the genotype data.

• Machine Learning based approaches: SVMs with string kernels satisfy I and II. Criterion III is computationally expensive to meet, because the multi-class problem has to be implemented either as '1-against-1' for each pair of classes or as '1-against-the-rest'. Furthermore, criterion IV is not explicitly accounted for; instead, two haplotype segments are compared using a similarity measure, such as an alignment score.
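To make the cost of the two multi-class schemes for SVMs concrete, the following minimal sketch counts the binary classifiers each scheme needs for K classes (breeds). The function name and the example values of K are hypothetical and chosen for illustration only; the counts follow from the standard definitions of the two schemes, not from any particular SVM package.

```python
def n_binary_classifiers(k: int) -> dict:
    """Count the binary classifiers needed to separate k classes."""
    if k < 2:
        raise ValueError("need at least two classes")
    return {
        # one classifier for every unordered pair of classes
        "one_vs_one": k * (k - 1) // 2,
        # one classifier per class, trained against all remaining classes
        "one_vs_rest": k,
    }


# Hypothetical numbers of breeds, for illustration only.
for k in (2, 10, 100):
    print(k, n_binary_classifiers(k))
```

The pairwise scheme grows quadratically in K, whereas '1-against-the-rest' grows linearly in the number of classifiers but trains each of them on the full training set.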

Based on this comparison the strongest competitors are MultiMix, ChromoPainter and iHMM.

However, MultiMix does not model long-range dependencies, while iHMM does not offer a publicly available implementation. I selected ChromoPainter because it models LD by considering haplotype segments and is freely available for academic use. Furthermore, Garrett Hellenthal collaborated with me, which helped me to obtain results smoothly. Finally, I conclude this part by comparing the DBAncestry algorithm with ChromoPainter.
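The selection argument can be summarised as a small lookup table. The sketch below is only an illustrative encoding of the comparison in this section: the field names `met`, `ld` and `public_code` are my own assumptions, `ld` collapses criterion IV into three levels, and `public_code` is left as `None` wherever the text makes no statement about code availability.

```python
# Illustrative encoding of the comparison above (field names are hypothetical).
#   ld: None = LD not explicitly modelled, "pairwise" = pairs of SNPs,
#       "haplotype" = haplotype segments / founder haplotypes
#   public_code: True or False where stated in the text, None where not stated
CANDIDATES = {
    "MMLR":          {"met": {"I", "II", "III"},      "ld": "pairwise",  "public_code": None},
    "Admixture":     {"met": {"I", "II", "III"},      "ld": None,        "public_code": None},
    "Lamp/WinPop":   {"met": {"I", "II", "III"},      "ld": None,        "public_code": None},
    "MultiMix":      {"met": {"I", "II", "III"},      "ld": "pairwise",  "public_code": True},
    "ChromoPainter": {"met": {"I", "II", "III", "V"}, "ld": "haplotype", "public_code": True},
    "iHMM":          {"met": {"I", "II", "III"},      "ld": "haplotype", "public_code": False},
    "iPCA":          {"met": {"I", "II", "III"},      "ld": None,        "public_code": None},
    "PCAdmix":       {"met": {"I", "II", "III"},      "ld": None,        "public_code": None},
    "SVM (string kernel)": {"met": {"I", "II"},       "ld": None,        "public_code": None},
}

REQUIRED = {"I", "II", "III"}


def select(candidates: dict) -> list:
    """Keep techniques that meet I-III, model LD on haplotype segments,
    and are stated to have a publicly available implementation."""
    return sorted(
        name
        for name, c in candidates.items()
        if REQUIRED <= c["met"]
        and c["ld"] == "haplotype"
        and c["public_code"] is True
    )


print(select(CANDIDATES))
```

Under this encoding the filter leaves ChromoPainter as the only candidate, which mirrors the reasoning given in the text.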

5.2.1.3 ChromoPainter vs. DBAncestry assumptions

• Time complexity for markers and training individuals: both the prediction run-time of ChromoPainter and the computation of population frequency estimates with PHASE for DBAncestry scale linearly in the number of SNP markers (Hellenthal, 2012; Scheet, 2013). However, ChromoPainter is also linear in the number of purebred training individuals, while PHASE is quadratic in the number of samples in the reference database (Li and Stephens, 2003; Scheet, 2013).

• Constant ancestry on chromosome: DBAncestry assumes constant ancestry within a chromosome, with one breed assigned to the maternal side and one to the paternal side. Recombination can only occur between adjacent chromosomes. For a small number of markers on each chromosome this assumption may be appropriate. However, for denser sets of markers I would like to unravel the different breeds that contribute to a given chromosome. In particular, ChromoPainter chooses the length of haplotype segments with constant ancestry adaptively, i.e. according to the recombination rate and whether a test dog still copies from the same training individual.

• Marker correlation: in the age of GWAS data, a panel of 320 SNP markers is very sparse and far from a typical dense GWAS dataset. Furthermore, SNPs are not evenly spaced, and the distance between adjacent SNPs may range from a few hundred to millions of base pairs, as indicated in Figure 1.1. Therefore, it is hard to exploit the correlation pattern of markers that are close in distance. According to Section 1.4.3, LD information given by a recombination map can still be utilized to some extent.

• Amount of information extracted from training data: to discriminate breeds at the ggp level, the algorithm uses on average 320/8 = 40 SNPs to make this prediction. DBAncestry uses more information from the genotype because, for each chromosome, I compute a large number of haplotype frequencies typical for a given breed, while for ChromoPainter I only use the most likely phasing, i.e. haplotype pair, for each training individual.

• Disk storage and working memory requirements: the storage requirements do not change much for ChromoPainter because each additional marker only adds another column to the file of the most likely haplotype pair phasings of the training data. For DBAncestry, more markers lead to a combinatorial increase in possible haplotypes, which all need to be enumerated, stored and queried with their haplotype frequencies from file and loaded into memory. Due to strong LD in many dense SNP datasets, however, the number of likely haplotypes is much smaller than the number of all possible enumerations, 2^S, where S is the number of markers (Gattepaille and Jakobsson, 2012). As outlined in Section 1.4.3, in windows of 10-500 kbp there are around 5 haplotypes covering 80% of the observed haplotypes (Lindblad-Toh et al., 2005; Parker, 2012). Nevertheless, to achieve good coverage of rare haplotypes I expect to retrieve a substantial number of haplotype frequencies, leading to very heavy use of disk read operations and a large amount of working memory. Alternatively, I could store fewer haplotype frequencies, but haplotypes in the test dog would then be missed more often, which would require a measure of deviation to account for imperfect copying due to mutations.
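The gap between the number of possible haplotypes and the number of haplotypes that actually matter can be illustrated with a minimal sketch. The function names and the toy frequency vector below are hypothetical; the frequencies are invented so that five haplotypes cover roughly 80% of the observations, mirroring the figure quoted above, and are not taken from any dataset.

```python
def possible_haplotypes(n_markers: int) -> int:
    """Upper bound on distinct haplotypes for biallelic markers: 2^S."""
    return 2 ** n_markers


def haplotypes_for_coverage(freqs, target=0.8):
    """Smallest number of most-frequent haplotypes whose summed
    frequency reaches the target coverage."""
    total = 0.0
    for count, f in enumerate(sorted(freqs, reverse=True), start=1):
        total += f
        if total >= target:
            return count
    raise ValueError("frequencies do not reach the target coverage")


# Toy frequencies (assumed): five common haplotypes plus 19 rare ones.
toy_freqs = [0.40, 0.20, 0.10, 0.06, 0.05] + [0.01] * 19

print(possible_haplotypes(10))             # all enumerations for S = 10
print(haplotypes_for_coverage(toy_freqs))  # haplotypes needed for 80%
print(haplotypes_for_coverage(toy_freqs, target=0.95))
```

In this toy example, raising the target coverage from 80% to 95% roughly quadruples the number of haplotype frequencies that have to be stored, which is the trade-off between storage and missed haplotypes described above.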