Fast Algorithms for Approximate Order-Statistics Computation

6.2 Mining Streaming and Sensor Network Environmental Datasets

6.2.2 Fast Algorithms for Approximate Order-Statistics Computation

Summary

I presented fast algorithms for computing approximate quantiles and biased quantiles over data streams.

The algorithms for approximate quantile computation are based on simple block-wise merge and sort operations, which significantly reduce the update cost incurred for each incoming stream element. To handle streams of unknown size, the incoming stream is divided into sub-streams of exponentially increasing sizes, and a summary is constructed efficiently in limited space for each sub-stream. For both fixed-size and arbitrary-size streams, the algorithm has an average update time complexity of $O(\log(\frac{1}{\epsilon}\log \epsilon N))$, where $\epsilon$ is the error bound and $N$ is the stream size. I also analyzed the performance of prior algorithms. I evaluated the algorithms on different data sizes and compared them with optimal implementations of prior algorithms. In practice, the algorithm can achieve up to a 300× improvement in performance. Moreover, its running time grows almost linearly with stream size, so it performs well on large data streams.
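To make the block-wise idea concrete, the following is a minimal Python sketch of a classic merge-and-compress summary together with a splitter that cuts a stream into sub-streams of doubling size. It is an illustration under stated assumptions, not the exact multi-level structure analyzed in this chapter: the names `split_exponential`, `BlockSummary`, and `block_size` are hypothetical, and the compress step (keep every second element after a merge) is a simplification chosen for clarity.

```python
import heapq
from itertools import islice


def split_exponential(stream, first_size=1):
    """Yield consecutive sub-streams of sizes first_size, 2*first_size, ...

    The last sub-stream may be shorter if the stream ends early.
    """
    it = iter(stream)
    size = first_size
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
        size *= 2


class BlockSummary:
    """Hypothetical multi-level summary (illustrative only).

    levels[i] holds at most one sorted block of block_size elements,
    each standing for 2**i original elements.
    """

    def __init__(self, block_size):
        self.block_size = block_size
        self.buffer = []   # incoming elements not yet forming a full block
        self.levels = []   # levels[i] is a sorted list or None
        self.count = 0

    def insert(self, x):
        self.buffer.append(x)
        self.count += 1
        if len(self.buffer) == self.block_size:
            self._push(sorted(self.buffer), 0)
            self.buffer = []

    def _push(self, block, level):
        # Carry upward like binary addition: two blocks on the same level
        # are merged, compressed to one block, and moved up a level.
        while True:
            if level == len(self.levels):
                self.levels.append(None)
            if self.levels[level] is None:
                self.levels[level] = block
                return
            merged = list(heapq.merge(self.levels[level], block))
            self.levels[level] = None
            block = merged[1::2]  # compress: keep every second element
            level += 1

    def query(self, phi):
        """Return an approximate phi-quantile (0 <= phi <= 1)."""
        if self.count == 0:
            raise ValueError("empty summary")
        weighted = [(x, 1) for x in self.buffer]
        for i, block in enumerate(self.levels):
            if block:
                weighted.extend((x, 2 ** i) for x in block)
        weighted.sort()
        target = phi * self.count
        seen = 0
        for x, w in weighted:
            seen += w  # estimated rank of x
            if seen >= target:
                return x
        return weighted[-1][0]
```

Most insertions only append to a buffer; sorting and merging happen once per full block, which is the sense in which block-wise processing keeps the per-element update cost low. In a setting with unknown stream size, one such summary could be built for each sub-stream produced by `split_exponential`, with the block size chosen per sub-stream to meet the error target; that parameter choice is left open here.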

In addition, I presented a novel approximate biased quantile algorithm for handling large, high-speed data streams. The algorithm maintains a decomposable summary structure to deterministically answer approximate biased quantile queries. Efficient sampling and merge operations are used to maintain the summary structure. In practice, the algorithm requires polylogarithmic space to maintain the summary and has an update cost of $O(\log\log n)$, where $n$ is the current stream size. The algorithm is also applicable to sensor networks and achieves significant performance improvements over prior algorithms.
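The difference between a uniform and a biased error guarantee can be illustrated with a small check. The sketch below assumes the usual relative-error definition of a (low-)biased quantile, which this summary does not restate: an answer to a $\phi$-quantile query is acceptable if its rank is within $\epsilon\phi n$ of $\phi n$, whereas the uniform guarantee allows a deviation of $\epsilon n$. The function names are hypothetical, and the candidate value is assumed to occur in the data.

```python
import bisect


def rank_bounds(sorted_data, value):
    """Return (lo, hi), the 1-based rank range occupied by value."""
    lo = bisect.bisect_left(sorted_data, value) + 1
    hi = bisect.bisect_right(sorted_data, value)
    return lo, hi


def within_uniform_error(sorted_data, value, phi, eps):
    """Uniform guarantee: rank within eps * n of phi * n."""
    n = len(sorted_data)
    lo, hi = rank_bounds(sorted_data, value)
    tol = eps * n
    return lo - tol <= phi * n <= hi + tol


def within_biased_error(sorted_data, value, phi, eps):
    """Biased guarantee (assumed definition): rank within eps * phi * n."""
    n = len(sorted_data)
    lo, hi = rank_bounds(sorted_data, value)
    tol = eps * phi * n
    return lo - tol <= phi * n <= hi + tol
```

For example, on the values 1 to 1000 with $\phi = 0.01$ and $\epsilon = 0.1$, the value 50 satisfies the uniform check (tolerance of 100 ranks) but fails the biased check (tolerance of 1 rank), which is why biased summaries must keep finer detail near the low end of the distribution.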

Future Work

There are many interesting problems for future investigation. The algorithms can be extended to support fast computation of quantiles and biased quantiles over sliding windows. It would also be interesting to design a generalized framework for incremental streaming computation based on block-wise merge and compression techniques, for either a single data stream or distributed data streams.


In document Mining emerging massive scientific sequence data using block-wise decomposition methods (Page 136-143)