CHAPTER 2: REAL AND SIMULATED DATA
2.4 Exploratory Data Analysis
In Sections 2.1, 2.2, and 2.3, a variety of example data sets that will be used in later chapters were introduced. These are expected to exhibit different levels of similarity. We now focus on five specific cases, in order of decreasing similarity:
• WF2 (Wright-Fisher data with width parameter = 2) has a high level of similarity.
• WF10, from Section 2.3, has a lower level of similarity than WF2.
• WF40, also from Section 2.3, has even less similarity.
• Brain artery data will be seen in Sections 2.4.1 and 2.4.2 to have less similarity than WF40.
• Uniformly random data will also be seen in Sections 2.4.1 and 2.4.2 to have the least similarity among all five cases.
2.4.1 Angle-based Data Summaries
One way to measure the similarity of tree data topologies is to study the distribution of angles, with vertex at the origin, between each pair of trees (called the pairwise angle). Given two trees T1 and T2, denote the pairwise angle between them by θ. We define θ by the cosine law:
\[
\cos\theta = \frac{\|T_1\|^2 + \|T_2\|^2 - L(\Gamma(T_1, T_2))^2}{2\,\|T_1\|\,\|T_2\|}. \tag{2.4}
\]
It can be shown that θ does not depend on either ‖T1‖ or ‖T2‖. Under this definition, if Γ(T1, T2) is a cone path, then θ = 180°; otherwise θ < 180°. A good general definition of angle in any metric space is the Alexandrov angle [Alexandrov, 1951]. In the special case of phylogenetic tree space, the definition of angle by the cosine law in (2.4) coincides with the Alexandrov angle.
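As a minimal sketch of how (2.4) can be evaluated numerically, the following Python function takes the two norms and the geodesic length as inputs and returns the angle in degrees. The function name and the numerical inputs are hypothetical and chosen only for illustration. Computing the geodesic length itself is not shown here.

```python
import math

def pairwise_angle_deg(norm1, norm2, geodesic_length):
    """Angle at the origin between two trees, in degrees, via the cosine law (2.4)."""
    if norm1 <= 0 or norm2 <= 0:
        raise ValueError("both trees must have positive norm")
    cos_theta = (norm1**2 + norm2**2 - geodesic_length**2) / (2 * norm1 * norm2)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

# A cone path passes through the origin, so its length is norm1 + norm2.
cone = pairwise_angle_deg(3.0, 4.0, 3.0 + 4.0)

# Illustrative values only: norms 3 and 4 with geodesic length 5
# satisfy 3^2 + 4^2 = 5^2, so the cosine in (2.4) is zero.
right = pairwise_angle_deg(3.0, 4.0, 5.0)
```

With a cone path the numerator of (2.4) reduces to −2‖T1‖‖T2‖, so the cosine is −1 and the angle is 180°, in agreement with the statement above.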
The distributions of pairwise angles for the five data sets are visualized using kernel density estimation (KDE), as in Section 2.2. Figure 2.12(a) shows the overlay of KDEs for the five example data sets, which allows direct comparison of these populations. The red curve corresponds to WF2 and shows that all of its pairwise angles are smaller than 10°. The magenta curve represents WF10, indicating that almost all of the pairwise angles are between 30° and 50°. These much larger angles are consistent with the greater spread of the WF10 data across tree space. The green curve corresponds to WF40, showing that most of the pairwise angles are between 80° and 120°, again consistent with more spread for WF40. The blue curve represents the brain artery data, with all the pairwise angles between 120° and 170°. This shows that the brain artery data are spread even more widely than the diverse WF40 distribution. The black curve on the far right corresponds to the uniformly random data, showing that most of the pairwise angles are greater than 160° and that a large proportion of the angles are exactly 180°. The separation of the blue and black curves shows that the brain artery data set is not purely random. The overall comparison of these five distributions is consistent with the similarity ordering given in the bullet points just before Section 2.4.1. Very often the spread of a data set is proportional to its mean. To investigate this issue, Figure 2.12(b) presents the overlay of the KDEs of the logarithms of the pairwise angles. Except for the uniformly random data, the data sets have similar spread on the log scale, which indicates that the variability in pairwise angles is proportional to the magnitude of the angles.
Figure 2.12: (a) Overlay of pairwise angle KDE plots shows the decreasing similarity ordering of the five data sets: WF2, WF10, WF40, brain artery data, and uniformly random data. (b) Overlay of logarithm of pairwise angle KDE plots indicates that the variability in pairwise angles is proportional to the magnitude of angles for WF2, WF10, WF40, and brain artery data, but not for the uniformly random data.
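The reasoning behind the log-scale comparison can be illustrated with a short sketch. If one sample is a rescaled copy of another, its raw spread scales by the same factor, while the spread of its logarithms is unchanged, since log(c·x) = log(c) + log(x). The angle values below are invented for illustration and are not taken from any of the five data sets. The helper name log_spread is likewise hypothetical.

```python
import math
import statistics

def log_spread(values):
    """Sample standard deviation of the natural logarithms of positive values."""
    return statistics.stdev([math.log(v) for v in values])

# Hypothetical angles in degrees, made up for illustration only.
small_angles = [4.0, 5.0, 6.0, 7.0, 8.0]
# Same shape at ten times the magnitude.
large_angles = [10.0 * a for a in small_angles]

# On the raw scale the spread grows with the magnitude.
raw_ratio = statistics.stdev(large_angles) / statistics.stdev(small_angles)

# On the log scale the two spreads agree up to round-off.
log_difference = log_spread(large_angles) - log_spread(small_angles)
```

Here raw_ratio is about 10 while log_difference is essentially zero, which is the pattern that similar-width curves on the log scale are taken to indicate.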
2.4.2 Distance-based Data Summaries
Another way to examine the similarity of a set of trees is to study the distances between each pair of trees (called the pairwise distance). This is defined as the length of the geodesic between the pair of trees. Intuitively, larger pairwise distances correspond to less similar data sets. Figure 2.13(a) shows the overlay of pairwise distance KDE plots for the same five data sets, which provides another useful comparison. Again, the red curve corresponds to WF2 and shows that all the pairwise distances lie within a narrow range below 50. The magenta curve represents WF10, indicating that almost all the pairwise distances are between 150 and 250, which shows that WF10 has more spread than WF2. The green curve is consistent with even larger spread for WF40, showing that most of the pairwise distances are between 400 and 600. The blue curve representing the brain artery data and the black curve representing the uniformly random data, on the far right, overlap heavily. Both show pairwise distances ranging from 450 to 750, again consistent with the fact that these two data sets have the largest spread. The comparison of the pairwise distance distributions for these five data sets is quite similar to that for pairwise angles, except that the blue and black curves are relatively well separated for pairwise angles but overlap heavily for pairwise distances. This indicates that the distances between each tree and the origin differ for the brain artery data and the uniformly random data, probably because the distribution of edge lengths used in the uniform generation is different from the true distribution of the edge lengths in the brain artery data. This shows that it is worth looking at both pairwise angles and pairwise distances. Figure 2.13(b) presents the overlay of the KDEs of the logarithms of the pairwise distances, which again gives a good indication of proportionality between the spread and magnitude of pairwise distances.
Figure 2.13: (a) Overlay of pairwise distance KDE plots shows the decreasing similarity ordering of the five data sets: WF2, WF10, WF40, brain artery data, and uniformly random data. (b) Overlay of the logarithm of pairwise distance KDE plots implies the proportionality between the spread and magnitude of pairwise distances.
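To see how two data sets can differ in pairwise angle yet overlap in pairwise distance, the cosine law (2.4) can be rearranged to give the geodesic length in terms of the two norms and the angle. The sketch below uses hypothetical norms and angles, not values measured from the brain artery or uniformly random data. It shows that a pair at 180° with smaller norms and a pair at 140° with somewhat larger norms can be the same distance apart.

```python
import math

def geodesic_length(norm1, norm2, angle_deg):
    """Geodesic length implied by the cosine law (2.4) for given norms and angle."""
    cos_theta = math.cos(math.radians(angle_deg))
    return math.sqrt(norm1**2 + norm2**2 - 2 * norm1 * norm2 * cos_theta)

# Hypothetical pair joined by a cone path (angle 180 degrees), norms 300 each.
random_like = geodesic_length(300.0, 300.0, 180.0)

# Hypothetical pair at 140 degrees. For equal norms a, the rearranged
# cosine law gives length 2 * a * sin(70 degrees), so this choice of a
# yields the same length as the cone-path pair above.
a = 300.0 / math.sin(math.radians(70.0))
artery_like = geodesic_length(a, a, 140.0)
```

Both lengths come out at about 600, with a roughly 319 rather than 300. Under these assumed numbers, a modest difference in distance from the origin is enough to offset a 40° difference in angle.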