A program was written in Matlab to implement the MDS-Procrustes method described above. The program takes a query file and a data file of phylogenetic trees in Newick format and outputs the MDS-Procrustes similarity score between each tree in the query file and each tree in the data file. To test the effectiveness of our method, we first compared a set of data trees with a set of query trees whose nearest matches in the data set are known.
To accomplish this, we generated a set of 40 unweighted, unrooted data trees D = {d1, …, d40}, each with nodes Ndata = {n1, …, n20}. The trees were generated using the rtree function from the ape package in R. For each d ∈ D, we then selected a random node set Nquery = {nm, …, nm+4} with m ≤ 16 to form our set of query trees Q = {q1, …, q40}. Thus, each query tree is always a random five-node subtree of a data tree. We then compared each q ∈ Q to each d ∈ D to test the ability of our method to detect similarity between a tree and a subtree of it used as a query structure.
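The construction of the data and query sets can be sketched as follows. This is an illustrative Python stand-in, not the R code used in the study: `random_unrooted_tree` and `leaf_distances` are hypothetical helpers, the trees are grown by random edge splitting rather than by the rtree function, and we assume that the 20 labelled nodes are leaves and that distances are edge counts.

```python
import random
from collections import deque

def random_unrooted_tree(n_leaves, rng):
    """Return an adjacency dict for a random unrooted binary tree (n_leaves >= 3).

    Leaves are labelled 1..n_leaves; internal nodes are labelled "i0", "i1", ...
    """
    adj = {"i0": {1, 2, 3}, 1: {"i0"}, 2: {"i0"}, 3: {"i0"}}
    edges = [("i0", 1), ("i0", 2), ("i0", 3)]
    for leaf in range(4, n_leaves + 1):
        # Split a random edge (u, v) with a new internal node w and hang the leaf on w.
        u, v = edges.pop(rng.randrange(len(edges)))
        w = f"i{leaf - 3}"
        adj[u].remove(v)
        adj[v].remove(u)
        adj[u].add(w)
        adj[v].add(w)
        adj[w] = {u, v, leaf}
        adj[leaf] = {w}
        edges += [(u, w), (w, v), (w, leaf)]
    return adj

def leaf_distances(adj, leaves):
    """Pairwise path lengths (edge counts) between the given leaves, as a nested list."""
    rows = []
    for src in leaves:
        seen = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        rows.append([seen[dst] for dst in leaves])
    return rows

rng = random.Random(0)  # fixed seed so the sketch is reproducible
data_trees = [random_unrooted_tree(20, rng) for _ in range(40)]

queries = []
for tree in data_trees:
    m = rng.randint(1, 16)                 # m <= 16 keeps n_m .. n_{m+4} within 1..20
    query_leaves = list(range(m, m + 5))   # five consecutive labels
    queries.append((query_leaves, leaf_distances(tree, query_leaves)))
```

Each entry of `queries` holds the five selected labels and their pairwise distances in the originating data tree, which is the information a distance-based comparison needs from the query.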
In this scenario, MDS-Procrustes correctly identified the most similar data tree as the structure from which the query tree was derived every time (with a distance equal to 0 in each instance). This is expected, since we reduce the input distance matrices to include only distances between common nodes prior to MDS and Procrustes analysis, but it is important to show that the method works for the trivial case and that the fitting does not yield any unforeseen results. We show the distribution of results of this random test in Fig 14. As seen in the figure, there were 40 queries that resulted in a distance of approximately 0 from the query tree to the data tree (each of these scores occurred in a comparison between a data tree and its own subtree). The remaining similarity scores were fairly evenly distributed, which allowed for an effective ranking in this particular test scenario.
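The reason the trivial case scores 0 can be illustrated with a minimal sketch. It is not the Matlab program itself: we assume classical (Torgerson) MDS into two dimensions and the standardized Procrustes statistic (optimal translation, scaling, and rotation or reflection), and all function names and the example matrices are hypothetical.

```python
import numpy as np

def restrict_to_common(labels_a, dist_a, labels_b, dist_b):
    """Keep only rows/columns for labels present in both trees, in a shared order."""
    common = sorted(set(labels_a) & set(labels_b))
    idx_a = [labels_a.index(name) for name in common]
    idx_b = [labels_b.index(name) for name in common]
    sub_a = np.asarray(dist_a, dtype=float)[np.ix_(idx_a, idx_a)]
    sub_b = np.asarray(dist_b, dtype=float)[np.ix_(idx_b, idx_b)]
    return common, sub_a, sub_b

def classical_mds(dist, dims=2):
    """Embed a distance matrix via eigendecomposition of the double-centered Gram matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dims]          # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def procrustes_distance(x, y):
    """Standardized Procrustes dissimilarity in [0, 1]; 0 means a perfect fit."""
    x0 = x - x.mean(axis=0)
    y0 = y - y.mean(axis=0)
    x0 = x0 / np.linalg.norm(x0)   # assumes the points are not all identical
    y0 = y0 / np.linalg.norm(y0)
    singular = np.linalg.svd(x0.T @ y0, compute_uv=False)
    return max(0.0, 1.0 - singular.sum() ** 2)

def mds_procrustes_score(labels_a, dist_a, labels_b, dist_b):
    _, sub_a, sub_b = restrict_to_common(labels_a, dist_a, labels_b, dist_b)
    return procrustes_distance(classical_mds(sub_a), classical_mds(sub_b))

# Hypothetical data tree with six leaves (edge-count path lengths).
data_labels = ["a", "b", "c", "d", "e", "f"]
data_dist = np.array([[0, 2, 3, 5, 5, 4],
                      [2, 0, 3, 5, 5, 4],
                      [3, 3, 0, 4, 4, 3],
                      [5, 5, 4, 0, 2, 3],
                      [5, 5, 4, 2, 0, 3],
                      [4, 4, 3, 3, 3, 0]])

# Query: five of its leaves, listed in a different order.
query_labels = ["e", "d", "c", "b", "a"]
order = [data_labels.index(name) for name in query_labels]
query_dist = data_dist[np.ix_(order, order)]

# An unrelated five-leaf tree with the same labels but different pairings.
other_labels = ["a", "b", "c", "d", "e"]
other_dist = np.array([[0, 4, 2, 3, 4],
                       [4, 0, 4, 3, 2],
                       [2, 4, 0, 3, 4],
                       [3, 3, 3, 0, 3],
                       [4, 2, 4, 3, 0]])

score_same = mds_procrustes_score(query_labels, query_dist, data_labels, data_dist)
score_other = mds_procrustes_score(query_labels, query_dist, other_labels, other_dist)
```

After restriction to the common labels, the query and its source tree supply identical distance matrices, so `score_same` is zero up to rounding, while `score_other` is strictly positive.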
Figure 14 - MDS-Procrustes distribution for validation dataset
Similarity score distribution for the random data set of 40 weighted trees with 20 nodes. The query tree used was a random five-node subtree of each given data tree in the comparison. The 40 matches at 100% similarity correspond to the cases in which the query tree was compared to the data tree from which it originated.
We further analyze the method by looking at a specific iteration of the program. Figures 15-19 show the results of one particular query in this test. Figure 15 shows the original five-node query tree that was compared to all of the trees in the data set. Next, we show the two data trees ranked most similar to this query tree. Figure 16 shows the full top-ranked tree, and Figure 17 shows the subtree of common leaf nodes that was used as the basis for comparison between this tree and the query tree. As previously noted, the top-ranked tree was the tree from which the five-node subtree originated, and a visual comparison of the subtrees in Figures 15 and 17 confirms that they match exactly for this query. Figures 18 and 19 show the second-ranked full tree and its subtree used in the comparison with the query tree from Figure 15. The similarity score for this comparison was 0.1796; because the score is a distance, this small value indicates a relatively high degree of similarity. As illustrated in the figures, while not identical in structure, these two trees are closely related in terms of the distances between their common nodes. This highlights a fundamental difference between topological searches and distance-based searches: a topological search may not recognize that these two trees are similar, since the nodes occupy different positions in the two trees. Here, however, we measure similarity as a function of the distances between the nodes, so the position of a node in the original tree is not important in itself; only its position relative to the other nodes is relevant. Being overly sensitive to the original location of particular taxa in a given phylogenetic tree can cause many similar trees to be considered dissimilar, which is a limitation of many topology-based tree comparison methods.
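The contrast between the two kinds of comparison can be made concrete with a small sketch. It uses shared leaf bipartitions (splits) as a stand-in for a topological measure, which is our assumption rather than a method named in this section, and two hypothetical five-leaf trees that differ only in the placement of leaves b and c.

```python
from collections import deque

def build(edges):
    """Adjacency dict from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def path_lengths(adj, leaves):
    """Edge-count path length for every unordered pair of leaves."""
    out = {}
    for src in leaves:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        for dst in leaves:
            if src < dst:
                out[(src, dst)] = seen[dst]
    return out

def splits(edges, leaves):
    """Non-trivial leaf bipartitions obtained by cutting each edge in turn."""
    adj = build(edges)
    found = set()
    for u, v in edges:
        # Collect everything reachable from u without crossing the edge (u, v).
        seen = {u}
        queue = deque([u])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen and (node, nxt) != (u, v):
                    seen.add(nxt)
                    queue.append(nxt)
        side = frozenset(seen & set(leaves))
        other = frozenset(set(leaves) - side)
        if min(len(side), len(other)) >= 2:
            found.add(frozenset([side, other]))
    return found

leaves = ["a", "b", "c", "d", "e"]
# Tree A pairs a with b; tree B pairs a with c. Internal nodes are x, y, z.
edges_a = [("x", "a"), ("x", "b"), ("x", "y"), ("y", "c"), ("y", "z"), ("z", "d"), ("z", "e")]
edges_b = [("x", "a"), ("x", "c"), ("x", "y"), ("y", "b"), ("y", "z"), ("z", "d"), ("z", "e")]

splits_a = splits(edges_a, leaves)
splits_b = splits(edges_b, leaves)
shared_splits = splits_a & splits_b

dist_a = path_lengths(build(edges_a), leaves)
dist_b = path_lengths(build(edges_b), leaves)
largest_change = max(abs(dist_a[pair] - dist_b[pair]) for pair in dist_a)
```

In this toy case the two trees share only some of their splits, so a split-based measure reports a substantial difference, whereas no pairwise leaf distance changes by more than a single edge. A distance-based score built on `dist_a` and `dist_b` would therefore still treat the trees as close.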
In the following sections, we extend our analysis by showing how MDS-Procrustes compares to existing topological tree comparison methods. To do this, we use two separate synthetic data sets and run the different methods on them to compare the similarity score distributions.
Figure 15 – Random Search Query
Query tree used in the random tree search and compared to 40 different randomly generated trees.
Figure 16 – Random Search Top Ranker
Data tree most similar to the query tree from Fig 15 according to MDS-Procrustes, out of the 40 trees compared.
Figure 17 – Random Search Top Rank Subtree
Subtree used in the MDS-Procrustes comparison between the trees from Fig 15 and Fig 16. This subtree matches the query tree in Fig 15 exactly, hence the 100% similarity.
Figure 18 – Random Search Second Ranker
Data tree second most similar to the query tree from Fig 15 according to MDS-Procrustes, out of the 40 random trees compared.
Figure 19 – Random Search Second Ranker Subtree
Subtree used in the MDS-Procrustes comparison between the trees from Fig 15 and Fig 18. The MDS-Procrustes distance between these trees was found to be 0.176, and this similarity can be observed in the figures by comparing the distances between the corresponding sets of nodes.