Discarding correlation structure in genotypes

2.5 Breed similarity results using genotypes and haplotypes

2.5.2 Discarding correlation structure in genotypes


Figure 2.1: This figure shows a heatmap visualization of the Manhattan breed distance matrix using genotype data of 125 breeds. The columns are re-ordered according to the dendrogram of the hierarchical clustering of the distance matrix with complete linkage. Breed pairs which tend towards red have small distance (high similarity) while breed pairs going to the yellow-white spectrum are more distant breeds. Breeds have considerable distance among each other which can be seen from mostly yellow-colored matrix entries and mostly long dendrogram leaf branches. Furthermore, the dendrogram shows a flat cluster structure which implies limited subpopulation structure.

54 Chapter 2. Breed similarity in dogs

Figure 2.2: This figure shows the MDS plot for the ChromoPainter coancestry similarity matrix which has been converted to a distance matrix. This measure visually discriminates Retrievers and Scent hounds while Small Terriers are partially separated out.

This section discusses the use of genotype data, i.e. sequences over the ternary alphabet {0, 1, 2}, in similarity computations which do not account for LD. Typically, studies utilizing SNP data are based on allele sharing, i.e. the Hamming distance (Vonholdt et al., 2010). However, the Hamming distance weighs all mismatches equally, although an allele difference of 0-2 is evolutionarily more distant than the mismatches 0-1 and 1-2. Therefore, I focus on Minkowski distances as proximity measure. Furthermore, I weigh all three difference computations equally, which leads to the Manhattan distance.
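As a minimal illustration of this argument, the following Python sketch compares the two measures on toy data. The function names and the three genotype vectors are hypothetical and not taken from the analysis.

```python
def hamming(x, y):
    """Number of SNPs at which two genotype vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def minkowski(x, y, p=1):
    """Minkowski distance of order p; p = 1 gives the Manhattan distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical genotypes of three dogs at four SNPs (allele counts 0, 1, 2).
dog_a = [0, 0, 1, 2]
dog_b = [1, 1, 1, 2]  # differs from dog_a by one allele at two SNPs
dog_c = [2, 2, 1, 2]  # differs from dog_a by two alleles at two SNPs

hamming_ab = hamming(dog_a, dog_b)
hamming_ac = hamming(dog_a, dog_c)
manhattan_ab = minkowski(dog_a, dog_b, p=1)
manhattan_ac = minkowski(dog_a, dog_c, p=1)
```

In this toy case the Hamming distance cannot tell dog_b and dog_c apart relative to dog_a, while the Manhattan distance places dog_c twice as far away.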

In this analysis I am not interested in the pairwise computation of the proximity measure between any two dogs in the training dataset but rather between any two breeds. However, the naive computation of the Manhattan breed distance matrix is computationally very expensive. There is a median of 55 training dogs per breed. So, to compute the pairwise distance between two breeds I need on average $\frac{55^2}{2} \approx 1500$ symmetry-adjusted computations, each evaluating the Manhattan distance between the SNP markers of two given training dogs. This step needs to be computed pairwise for all breeds (symmetry-adjusted), i.e. $\frac{1}{2}\binom{125}{2} = 3875$ times. Therefore, in total there are $\frac{55^2}{2} \cdot \frac{1}{2}\binom{125}{2} \approx 5.8$ million evaluations of the Manhattan distance. A more efficient way to compute this Manhattan breed distance matrix is to compute the allele frequencies for all SNPs breedwise and then to compare these breed frequencies between any two breeds, again (symmetry-adjusted) $\frac{1}{2}\binom{125}{2} = 3875$ times. The breed frequency comparison between two breeds takes 6 additions and 18 multiplications.
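The frequency-based shortcut can be sketched as follows. This is only one possible reading and rests on two assumptions that the text does not state: that the breed distance is the mean Manhattan distance over all pairs of dogs from the two breeds, and that the frequencies are the per-SNP frequencies of the genotypes 0, 1 and 2. All names and the toy breeds are hypothetical.

```python
from itertools import product


def genotype_frequencies(dogs, snp):
    """Relative frequencies of genotypes 0, 1, 2 at one SNP within a breed."""
    counts = [0, 0, 0]
    for dog in dogs:
        counts[dog[snp]] += 1
    return [c / len(dogs) for c in counts]


def breed_distance_naive(breed_x, breed_y):
    """Mean Manhattan distance over all pairs (dog from x, dog from y)."""
    total = sum(
        sum(abs(a - b) for a, b in zip(dog_x, dog_y))
        for dog_x, dog_y in product(breed_x, breed_y)
    )
    return total / (len(breed_x) * len(breed_y))


def breed_distance_from_frequencies(breed_x, breed_y):
    """Same quantity, computed per SNP from the breed frequencies only."""
    total = 0.0
    for snp in range(len(breed_x[0])):
        p = genotype_frequencies(breed_x, snp)
        q = genotype_frequencies(breed_y, snp)
        # Expected |i - j| for independently drawn genotypes i ~ p and j ~ q.
        total += sum(p[i] * q[j] * abs(i - j) for i in range(3) for j in range(3))
    return total


# Hypothetical toy breeds: three and two dogs, three SNPs each.
breed_x = [[0, 1, 2], [0, 2, 2], [1, 1, 2]]
breed_y = [[2, 1, 0], [2, 0, 0]]

naive = breed_distance_naive(breed_x, breed_y)
fast = breed_distance_from_frequencies(breed_x, breed_y)
```

Under these assumptions both functions return the same value up to floating-point error, but the second one no longer iterates over pairs of dogs once the frequencies are available.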

I visualize the $b_M \times b_M$ Manhattan breed distance matrix using a heatmap. Figure 2.1 shows the heatmap for the Manhattan distance matrix based on genotype data, and Figure 2.3 (i) shows the colour continuum with an integrated density plot of the distribution of the distance values. The same heatmap is shown again in Figure 2.4 (a), while Figure 2.4 (b) shows the heatmap for the same data after the distances have been converted to similarities. Therefore, more closely related breeds have smaller values in the distance heatmap and higher values in the similarity heatmap. The heatmap in Figure 2.1 also shows the result of a hierarchical clustering algorithm applied to the columns of the breed distance matrix.
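To make the re-ordering of the heatmap columns concrete, here is a deliberately naive pure-Python sketch of agglomerative clustering with complete linkage. It is not the implementation used for the figures (a library routine would normally be used); the function names and the 4 × 4 toy matrix are hypothetical.

```python
def complete_linkage(dist):
    """Naive agglomerative clustering with complete linkage.

    dist is a symmetric distance matrix (list of lists). Returns the merges
    as tuples (members of cluster a, members of cluster b, merge height).
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest distance between any two of their members.
                height = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


def leaf_order(merges):
    """Left-to-right order of the leaves implied by the final merge."""
    return merges[-1][0] + merges[-1][1]


# Hypothetical distance matrix for four breeds.
D = [
    [0, 6, 1, 7],
    [6, 0, 5, 2],
    [1, 5, 0, 8],
    [7, 2, 8, 0],
]

merges = complete_linkage(D)
order = leaf_order(merges)
# Rows and columns re-ordered as they would appear in the heatmap.
reordered = [[D[i][j] for j in order] for i in order]
```

Long leaf branches in the dendrogram correspond to large merge heights at the first merge that involves a breed, and a flat tree corresponds to merge heights that lie close together.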


Figure 2.3 panel titles: (a) Breed similarity for f = exp(0); (b) Transition probability for f = exp(0); (e) Breed similarity for f = exp(4); (f) Transition probability for f = exp(4); (g) Breed similarity for f = exp(5); (h) Transition probability for f = exp(5); (i) Distribution of genotype distances; (j) Rank exponential decay function g for s = 0.05.

Figure 2.3: In Figures (a,c,e,g) each line corresponds to a breed. For each of these breeds I show the decreasingly ordered breed genotype similarities which have been exponentially transformed. In Figures (b,d,f,h) I show the corresponding breed transition probabilities from closest (most similar, left) to furthest breed (right). Figure (i) shows the distribution of distance values according to heatmap in Figure 2.1. Finally, Figure (j) shows the proposed transition probability based on rank in distance sorted breeds.


A visual inspection of the integrated density plot in Figure 2.3 (i) and of the heatmap in Figure 2.1 suggests that the distances are almost normally distributed with a positive skew towards higher distance values. Furthermore, the dendrogram shows that the tree is wide and not very deep. These plots show a global picture of the distance distribution. However, first I need to examine how steep the decay of the ordered breed distances is. Given that the decay of the original ordered distances (by rank) is very flat for each breed, I apply an exponential transformation with a scalar scaling factor $f$ which tunes the steepness of the decay. For that purpose, for each breed $b$ I define the list of ordered distances to the $i$-th next breed, $o' = (d_{\sigma(1)} = 0, d_{\sigma(2)}, \ldots, d_{\sigma(b_M)}) \in \mathbb{R}^{b_M}$, where $\sigma(1) = b$ corresponds to the breed itself, $\sigma(2)$ refers to the closest breed with non-zero distance and in general $\sigma(i)$ denotes the $i$-th closest breed in distance. Then, I remove the first element $d_{\sigma(1)}$ from the list and centre by $d_{\sigma(2)}$ to obtain the list $o = (0, d_{\sigma(3)} - d_{\sigma(2)}, d_{\sigma(4)} - d_{\sigma(2)}, \ldots, d_{\sigma(125)} - d_{\sigma(2)}) \in \mathbb{R}^{b_M - 1}$. The centred distances in list $o$ are then converted to exponentially decaying breed similarities, which are given by $s = \exp(-f \cdot o) \in \mathbb{R}^{b_M - 1}$. Figure 2.3 shows the ordered, exponentially transformed functions for the four scaling factors $f = \exp(0) = 1$ (a), $f = \exp(3) \approx 20.1$ (c), $f = \exp(4) \approx 54.6$ (e) and $f = \exp(5) \approx 148.4$ (g), with one line for each of the $b_M$ breeds. These plots show that the decay is not very homogeneous across breeds, and that for smaller values of $f$ the decay is still very flat. Finally, these exponentially transformed breed functions can be converted into breedwise transition probabilities using $t = \frac{s}{\sum s} \in \mathbb{R}^{b_M - 1}$, which are shown in Figures 2.3 (b,d,f,h) for the same four scaling factors.

These figures show a very quick decay for roughly the five breeds closest to breed $b$, then a very flat decay for most breeds, except for the last 10 breeds which again show a steeper negative slope.
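The transformation from ordered distances to transition probabilities can be sketched for a single breed as follows. The function name and the toy row of distances are hypothetical; only the three steps (drop the breed itself and centre by the closest breed, apply the exponential with scaling factor f, normalise) follow the definitions above.

```python
import math


def transition_probabilities(distances, breed, f):
    """Turn one row of a breed distance matrix into transition probabilities.

    distances: distances from `breed` to every breed, including itself.
    Returns the centred ordered distances o, the similarities s and the
    transition probabilities t, each of length len(distances) - 1.
    """
    # Sort the other breeds by distance; leaving out the breed itself
    # removes the leading zero of the ordered list.
    others = sorted(d for k, d in enumerate(distances) if k != breed)
    # Centre by the distance to the closest breed.
    o = [d - others[0] for d in others]
    # Exponentially decaying similarities with scaling factor f.
    s = [math.exp(-f * x) for x in o]
    # Normalise so that the values sum to one.
    total = sum(s)
    t = [x / total for x in s]
    return o, s, t


# Hypothetical distances from breed 0 to five breeds (itself included).
row = [0.0, 0.30, 0.32, 0.35, 0.50]
o, s, t = transition_probabilities(row, breed=0, f=math.exp(3))
```

Because the list is centred by the closest breed, the first similarity is always exp(0) = 1 regardless of f, so f only controls how quickly the remaining breeds fall away from it.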

I would like to use the transition probabilities to propose breeds within a simulation-based framework discussed in Chapter 4. To ensure the same amount of exploration across breeds I form a breed-independent decay function for the transition probabilities. However, note that for a fixed amount of exploration the breed proposed depends on the current breed. Furthermore, the transition probabilities should decay quickly until about half of the breeds and then saturate at a low level. The exponential decay transition probability function I have in mind with these two characteristics is shown in Figure 2.3 (j) and is defined as $g'(r) = \exp(-f \cdot r) \in \mathbb{R}$, where the scaling factor is set to $f = 0.05$ and the rank is defined for $r = 1, \ldots, b_M$. Once $g'$ is normalised over the ranks to give $g$, the closest breed has a transition probability of about 5 percent, the first six breeds have a cumulative transition probability of about 25 percent, the first 14 closest breeds have a cumulative transition probability of 50 percent and the first 62 breeds cumulatively cover 96 percent of the transition probability. Then, to apply function $g$ to the original Manhattan distance matrix I form the ordered list of distances for each breed and replace these values by the values of $g$. In other words, for each breed I replace the list $(d_{\sigma(1)} = 0, d_{\sigma(2)}, \ldots, d_{\sigma(b_M)}) \in \mathbb{R}^{b_M}$ by $[0, g(1), \ldots, g(b_M - 1)] \in \mathbb{R}^{b_M}$. The heatmap for the rank genotype matrix using function $g$ can be seen in Figure 2.4 (c) while the corresponding rank similarity matrix is shown in Figure 2.4 (d). These figures show clearly that, for any given breed, the majority of breeds have high distance, i.e. small similarity.
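The quoted percentages can be checked with a short sketch. It assumes that the exponential term exp(−0.05 · r) is normalised to sum to one over the ranks and that the ranks run over the 124 other breeds; neither assumption is spelled out in the text, and the function names are hypothetical.

```python
import math


def rank_decay(n_ranks, f=0.05):
    """Normalised exponential decay over the ranks 1, ..., n_ranks."""
    raw = [math.exp(-f * r) for r in range(1, n_ranks + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def rank_transform(distances, breed, g):
    """Replace each distance in a row by the decay value of its rank.

    The breed itself keeps the value 0; the closest other breed receives
    g[0], the second closest g[1], and so on. Positions are kept in the
    original breed order rather than in sorted order.
    """
    others = [k for k in range(len(distances)) if k != breed]
    others.sort(key=lambda k: distances[k])
    out = [0.0] * len(distances)
    for rank, k in enumerate(others):
        out[k] = g[rank]
    return out


b_M = 125
g = rank_decay(b_M - 1)

closest = g[0]               # probability of the closest breed
first_six = sum(g[:6])       # cumulative probability of the six closest breeds
first_14 = sum(g[:14])
first_62 = sum(g[:62])

# Hypothetical three-breed row, transformed from the viewpoint of breed 0.
toy_row = rank_transform([0.0, 0.4, 0.2], breed=0, g=rank_decay(2))
```

Under these assumptions the four values come out close to the 5, 25, 50 and 96 percent stated above, which suggests that the normalised form is the one intended.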

Figure 2.7 (a) shows the MDS plot for the original genotype distance. Four breed groups, such as Ancient, Spitz dog, Toy breeds, Mastiff-like and Retrievers, are well separated while other breed groups show more overlap. Figure 2.7 (b) shows the MDS plot for the rank-adjusted genotype distance from Figure 2.4 (c), which pulls the four previously well-separated groups towards the centre of gravity and thus leads to a lower signal-to-noise ratio.


(a) Original genotype distance (b) Original genotype similarity (c) Rank genotype distance (d) Rank genotype similarity

Figure 2.4: These figures show a heatmap based on either the distance matrix (left column) or the similarity matrix (right column). In the left column red denotes low distance (high similarity) and white refers to high distance (small similarity), while in the right column it is the reverse case: red denotes little similarity and white high similarity. The first row shows the heatmap based on the original distance matrix, while in the second row the genotype distances have been transformed by function g from Figure 2.3 (j), such that most breeds are very far away from a given breed (whitish colour) and only few breeds are very close (red).


(a) Manhattan distance

(b) Pearson correlation

Figure 2.5: These figures show the hierarchical clustering results using complete linkage for Manhattan distance and Pearson correlation based on the SmallHap haplotype data. Breeds from the same group have their branches shown in the same colour. Although breeds from the same breed group tend to be adjacent I notice that breeds from a given group are not distributed homogeneously, i.e. not all breeds from the same group are in the same cluster. As before there is a flat cluster structure confirming limited population substructure. Furthermore, most breeds are associated with long leaf branches suggesting strong differences in breed. Compared with the genotype data dendrogram there are fewer short branches which suggests less strong breed discrimination.


In document Identification of breed contributions in crossbred dogs (Page 52-59)