The aim of this section is to show that our new method is also fast in practice, even on large datasets. We implemented the algorithm from Sect. 6.3 in C++ (available at www.bio.ifi.lmu.de/~fischer/) so that it finds frequent substrings (Problem 1) and emerging substrings (Problem 2), respectively. For the construction of the suffix array we used the method due to Manzini and Ferragina (2004), and for the RMQ-preprocessing we used the “engineered” implementation from Sect. 3.8, because space is a primary issue in large-scale data mining.
Unfortunately, because an implementation of the emerging substring miner (Chan et al., 2003) is not publicly available, we could only run comparative tests for mining frequent substrings. We compared our method to the algorithms called VST (De Raedt et al., 2002) and FAVST (Lee and De Raedt, 2005).4 The “VST” in both names is an abbreviation of Version Space Tree; version space trees are simply suffix tries with some additional satellite information. All tests were performed on an Athlon XP 3300 with 2GB of RAM under Linux. All programs were compiled with g++, using the options “-O3 -fomit-frame-pointer -funroll-loops”. Further, instead of writing the output to disk, all programs were adapted to redirect their output to the virtual null device (/dev/null under Linux), in order to eliminate the influence of the access time to secondary storage.
We used the Jan’03 release of the nucleotide database from the ARB project (Ludwig et al., 2004), containing rRNA of about 25,000 species. Because rRNA is highly conserved in evolution, the data bears a high sequential similarity, and the running time of all programs should thus be highly influenced by the chosen input parameters. A phylogenetic tree partitions the species into different groups, of which we selected some for evaluation. The subsets used are shown in Tbl. 6.1.

4 We wish to thank Sau Dan Lee for providing the source code of his methods FAVST and VST.

[Figure 6.2: Comparison of the three methods (VST, FAVST, our method) for a single minimum frequency query. x-axis: minimum frequency threshold (%); y-axis: time (seconds). Note the logarithmic y-scale.]
For the comparison with FAVST and VST, we were forced to pick a very small subset of the ARB database. This is because FAVST builds a non-compacted suffix trie on the whole database, which has size $O(n^2)$ in the worst case. As this is extremely space consuming, we could only use FAVST for subsets of size less than 100kB; already for datasets of this size, the space consumption of FAVST is more than 1.5GB. VST cannot be applied to larger datasets because it is very slow: we will see presently that for instances where our method takes only about 2 minutes, VST already needs more than one hour.
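The quadratic worst case of a non-compacted suffix trie can be observed on small inputs. The following is a minimal sketch, not the data structure used inside FAVST: the function name count_suffix_trie_nodes and the map-based node layout are assumptions made purely for illustration. It inserts every suffix character by character and counts the resulting nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not taken from FAVST): builds a non-compacted suffix
// trie of `text` and returns its number of nodes, including the root.
std::size_t count_suffix_trie_nodes(const std::string& text) {
    // nodes[v] maps a character to the index of the child of node v.
    std::vector<std::map<char, std::size_t>> nodes(1);  // node 0 is the root
    for (std::size_t start = 0; start < text.size(); ++start) {
        std::size_t current = 0;  // walk down from the root for each suffix
        for (std::size_t i = start; i < text.size(); ++i) {
            auto it = nodes[current].find(text[i]);
            if (it == nodes[current].end()) {
                nodes.emplace_back();  // create the missing child
                const std::size_t child = nodes.size() - 1;
                nodes[current][text[i]] = child;
                current = child;
            } else {
                current = it->second;
            }
        }
    }
    // One node per distinct substring plus the root, hence at most
    // 1 + n(n+1)/2 nodes for a text of length n.
    const std::size_t n = text.size();
    assert(nodes.size() <= 1 + n * (n + 1) / 2);
    return nodes.size();
}
```

For a text of the form a^k b^k (length n = 2k), every string a^i b^j with 0 ≤ i, j ≤ k is a distinct substring, so the trie has (k+1)^2 nodes; for example, count_suffix_trie_nodes("aabb") returns 9. The node count thus grows quadratically in n for such inputs.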
The first test was a single minimum frequency query on 60 random entries of the database. The results, for varying values of min, can be seen in Fig. 6.2. It is striking that our method is faster than both FAVST and VST, sometimes by several orders of magnitude. Further, it is interesting to see that the running time of FAVST does not drop as much with increasing values of min as it does for VST and our method. This is because constructing the suffix trie for the whole database is the most time-consuming part of FAVST and is independent of min. Further tests with other random subsets of the complete ARB database revealed similar results.
In a second test we evaluated the performance of a combined minimum and maximum frequency query. For this, we selected two disjoint subsets with a higher biological relevance. The dataset chosen for the minimum frequency criterion was xanthom, and for the maximum frequency criterion we chose a subset of the β-dataset, called β59 (again, the space consumption of FAVST forced us to pick such small datasets). The results for varying values of min can be seen in Fig. 6.3(a), where the maximum frequency threshold was held fixed at 50%. They resemble those from the previous experiments, except that for small values of min our method and FAVST perform about equally well (but still with a factor-2 advantage for our method). Profiling showed that the sheer amount of frequent patterns to be written out was the most time-consuming part in these cases, which cannot be avoided by any method.

[Figure 6.3: Comparison of the three methods (VST, FAVST, our method) for a combined minimum- and maximum-frequency query. (a) Dependency on the minimum frequency threshold (%). (b) Dependency on the maximum frequency threshold (%). y-axes: time (seconds). Note the logarithmic y-scales.]

size (MB)   number of proteins   time (min:sec)
    25             96,939             1:10
    50            186,913             2:58
    75            269,590             5:03
   100            359,504             7:17
   125            458,088             9:20
   150            529,585            11:32

Table 6.2: Results for a minimum frequency query on files containing amino acid sequences from the Swissprot database.
Figure 6.3(b) shows the results for the same test, but now for different values of max. The minimum frequency threshold was held fixed at 3%, in order not to filter out too many patterns already by the minimum frequency constraint. However, even with this very lax choice of min, the running times of all three methods do not depend as much on max as they depend on min in Fig. 6.3(a). Still, our method is always the fastest, by a factor of about 4 compared to FAVST, and by a factor of more than 25 compared to VST.
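To make the combined query concrete, here is a brute-force reference sketch. It is not the algorithm from Sect. 6.3, and it rests on two assumptions not stated in this section: the frequency of a pattern is taken to be the number of database strings that contain it as a substring, and thresholds are given as absolute counts rather than percentages. The names frequency and mine_naive are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Assumed definition: number of strings in `db` containing `pattern`.
std::size_t frequency(const std::vector<std::string>& db,
                      const std::string& pattern) {
    std::size_t count = 0;
    for (const std::string& s : db)
        if (s.find(pattern) != std::string::npos) ++count;
    return count;
}

// Brute-force reference (hypothetical): all non-empty substrings that occur
// in at least `min_count` strings of `pos` and at most `max_count` strings
// of `neg`.
std::set<std::string> mine_naive(const std::vector<std::string>& pos,
                                 const std::vector<std::string>& neg,
                                 std::size_t min_count,
                                 std::size_t max_count) {
    assert(min_count >= 1);  // candidates are taken from `pos` only
    std::set<std::string> result;
    for (const std::string& s : pos) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            for (std::size_t len = 1; i + len <= s.size(); ++len) {
                const std::string cand = s.substr(i, len);
                // Extending a pattern can only lower its frequency, so
                // longer candidates from this start position cannot qualify.
                if (frequency(pos, cand) < min_count) break;
                if (frequency(neg, cand) <= max_count) result.insert(cand);
            }
        }
    }
    return result;
}
```

With pos = {"abc", "abd", "xab"}, neg = {"b", "zz"}, min_count = 2 and max_count = 0, the call returns {"a", "ab"}: the pattern "b" is frequent in pos but is excluded because it also occurs in neg. Such a quadratic enumeration is only useful for checking a faster implementation on tiny inputs.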
In a last test we evaluated the scalability of our method. We created several files containing the first 25, 50, ..., 150MB of the file “proteins” from the Pizza & Chili site (Ferragina and Navarro, 2005), which contains amino acid sequences of various species obtained from the Swissprot database. Our implementation of the pure minimum frequency miner uses 3 integer arrays of size n: one each for the suffix array and the LCP array, and one for the C′-counters. This is why 150MB was the largest file we could process on a machine with 2GB of RAM (150MB·4·3 = 1,800MB).
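The space estimate above can be reproduced with a few lines of code. This is a minimal sketch under the assumptions implied by the calculation in the text: one byte per input character and 4-byte integers. The helper name array_space_mb is hypothetical, and it counts the three integer arrays only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: space in MB for the integer arrays only, following
// the calculation in the text (text size x bytes per integer x arrays).
// Assumes one byte per character; the defaults mirror the factors 4 and 3.
std::uint64_t array_space_mb(std::uint64_t text_mb,
                             std::uint64_t bytes_per_int = 4,
                             std::uint64_t num_arrays = 3) {
    assert(bytes_per_int > 0 && num_arrays > 0);
    return text_mb * bytes_per_int * num_arrays;
}
```

Here array_space_mb(150) evaluates to 1800, matching the figure given above, whereas the next step of 175MB would already require 2,100MB for the arrays alone.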
We then performed a simple minimum frequency query, with a minimum frequency threshold of 1% of the number of sequences in the database. The results of this test are shown in Tbl. 6.2. Despite the very low minimum frequency threshold of 1%, the running times are quite fast. This is because proteins do not exhibit such a high sequential similarity as rRNA does, so there are fewer patterns in the solution space. Hence, the running time of our algorithm is not dominated as much by the time needed to write the output as in the previous experiments.