
5.3 Quantitative Evaluation

5.3.6 Scalability

Error regions typically contain a small fraction of the records from the entire dataset.

Because the run times of many error detection algorithms depend heavily on the input data size, running them within error regions results in significantly reduced run times.

Figure 5.8 compares the run time of SCODED with that of SCODED + TreeDetect as the retrieval target k varies. For small k, TreeDetect is slower than SCODED because of the extra cost of building the Localization Tree. However, the run time of SCODED increases linearly, while the run time of TreeDetect plateaus at around 2.5 seconds. This shows the potential of error localization to reduce run time through a divide-and-conquer strategy. We include the Localization Tree building cost for each k; tree building takes approximately 1.9 seconds and therefore accounts for the majority of the TreeDetect run time. In practice, the tree is built once and reused for several retrieval targets k, so its construction cost is amortized over different top-k queries and Figure 5.8 overestimates the per-query cost.
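The amortization argument can be made concrete with a small calculation. The sketch below uses the approximate figures reported above (about 1.9 seconds for tree construction, about 2.5 seconds in total at the plateau). The function name and the assumption that the per-query detection cost is the difference between the two (about 0.6 seconds) are illustrative and are not measured values.

```python
def amortized_seconds_per_query(build_s: float, detect_s: float, num_queries: int) -> float:
    """Average cost per top-k query when one tree serves num_queries queries."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    # The tree is built once; only detection is repeated for every query.
    return build_s / num_queries + detect_s


BUILD_S = 1.9         # approximate tree construction time from the text
DETECT_S = 2.5 - 1.9  # assumption: plateau total minus construction time

for n in (1, 5, 20):
    print(n, round(amortized_seconds_per_query(BUILD_S, DETECT_S, n), 3))
```

Under these assumptions, the average cost per query approaches the detection cost alone as the number of queries sharing the tree grows.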

Figure 5.9: Scalability of tree construction (TreeDetect) with increased data size

While a localization tree helps to reduce error detection costs, constructing the tree is fairly expensive. Figure 5.9 shows the increase in tree construction time as we increase the data size and keep k fixed. As the figure shows, our system does not scale well with increased data size.

The main reason is the cost of splitting on continuous variables, since finding a threshold for each internal node involves a pass over many possible continuous values. Addressing this scalability challenge is a valuable direction for future work.
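To illustrate why continuous splits are costly, here is a minimal sketch of an exhaustive threshold search for a single continuous attribute. It is not the TreeDetect implementation: the function names, the choice of midpoints as candidates, and the `error_rate_gap` score (a stand-in for a violation metric, computed from made-up flags) are all assumptions for illustration. The point is that every candidate threshold triggers another pass over the records at the node.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Score = Callable[[List[int], List[int]], float]


def best_threshold(values: Sequence[float], score: Score) -> Tuple[Optional[float], float, int]:
    """Exhaustively search split thresholds for one continuous attribute.

    Returns (best threshold, its score, number of candidates evaluated).
    """
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_t, best_s = None, float("-inf")
    for t in candidates:
        # Each candidate requires a full pass over the records at this node.
        left = [i for i, v in enumerate(values) if v < t]
        right = [i for i, v in enumerate(values) if v >= t]
        s = score(left, right)
        if s > best_s:
            best_t, best_s = t, s
    return best_t, best_s, len(candidates)


def error_rate_gap(flags: Sequence[int]) -> Score:
    """Hypothetical score: absolute difference in flagged-record rate."""
    def score(left: List[int], right: List[int]) -> float:
        def rate(idx: List[int]) -> float:
            return sum(flags[i] for i in idx) / len(idx)
        return abs(rate(left) - rate(right))
    return score


years = [1995, 1998, 1999, 2001, 2005, 2010]
flags = [1, 1, 1, 0, 0, 0]  # made-up violation flags, for illustration only
print(best_threshold(years, error_rate_gap(flags)))
```

With d distinct values and n records at a node, this search performs d - 1 passes of length n, which is the kind of cost that grows quickly with data size.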

Chapter 6

Conclusions and Future Work

One of the most powerful approaches to error detection leverages constraints that represent domain knowledge. Recent work has developed methods for leveraging approximate constraints that are not required to hold exactly, but only to a degree. Approximate constraints are sensitive to context: they may fail or be satisfied in the whole dataset but not in subsets.

In this thesis we proposed TreeDetect, a novel error detection system that enhances the power of approximate constraints by applying them in a context-aware manner. TreeDetect employs a tree partition of the data space, which constructs user-interpretable predicates that define relevant contexts for error detection. The tree partition is constructed for a dataset using methods inspired by tree learning in machine learning. Our experiments show that, when combined with error localization, error detection algorithms show significant improvements in their ability to distinguish dirty from clean records.

Limitations. While we have provided evidence that error localization helps with leveraging approximate constraints, it appears less useful for exact constraints (e.g., denial constraints), as error detection performance with exact constraints is often context-independent. However, the tree will still help with illustrating error regions, assuming that the majority of errors appear in clusters which can be localized using predicates. The main limitation of the tree representation arises when the recorded data is not powerful enough to capture the error source. For example, if all errors arise in the time period before year 2000, and the data contain time stamps, the TRA can learn a region defined by year < 2000. But if the data is missing time stamps, the TRA will prevent our system from finding the correct error regions. A promising approach to this scenario is to employ a method for imputing missing values, such as the EM algorithm, with violation metrics as objective functions.
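The time stamp example can be sketched as follows. This is only an illustration under assumed names: records are plain dictionaries, and `split_by_predicate` and the attribute name `year` are hypothetical. It shows how a predicate such as year < 2000 assigns records to a region, and why records without a time stamp cannot be placed on either side.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]


def split_by_predicate(
    records: List[Record], attribute: str, threshold: float
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Partition records by the predicate `attribute < threshold`.

    Records where the attribute is missing go into a separate list,
    because the predicate cannot be evaluated for them.
    """
    inside, outside, unknown = [], [], []
    for record in records:
        value = record.get(attribute)
        if value is None:
            unknown.append(record)
        elif value < threshold:
            inside.append(record)
        else:
            outside.append(record)
    return inside, outside, unknown


records = [
    {"id": 1, "year": 1997},
    {"id": 2, "year": 2004},
    {"id": 3, "year": None},  # missing time stamp
    {"id": 4},                # attribute absent altogether
]
inside, outside, unknown = split_by_predicate(records, "year", 2000)
print(len(inside), len(outside), len(unknown))
```

If most records end up in the unknown group, the predicate no longer isolates the erroneous region, which is the situation where imputing the missing values becomes attractive.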

For large datasets, our tree construction method shares the scalability limitations of tree learning methods. An avenue for improving scalability is to adapt scalable methods for tree learning [13, 45], especially new heuristics for splitting and pruning tree nodes. Another direction is offline learning, where the tree is re-constructed in the background in repeating time intervals.
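One example of a splitting heuristic used in scalable tree learning, which we did not evaluate in this thesis, is to cap the number of candidate thresholds per continuous attribute by using quantiles instead of every distinct value. The sketch below is an assumption-laden illustration: the function name and the default of 16 candidates are hypothetical, and it relies only on `statistics.quantiles` from the Python standard library (Python 3.8 or later).

```python
import statistics
from typing import List, Sequence


def quantile_candidates(values: Sequence[float], max_candidates: int = 16) -> List[float]:
    """Return at most max_candidates split thresholds for a continuous attribute."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return []  # nothing to split on
    if len(distinct) - 1 <= max_candidates:
        # Few distinct values: fall back to the exhaustive midpoints.
        return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    # Many distinct values: use quantile cut points as a coarse candidate set.
    cuts = statistics.quantiles(values, n=max_candidates + 1)
    return sorted(set(cuts))


small = quantile_candidates([1.0, 2.0, 4.0])
large = quantile_candidates(list(range(1000)), max_candidates=16)
print(small, len(large))
```

The trade-off is that a coarse candidate set may miss the best threshold, so whether such a heuristic preserves localization quality would need to be checked experimentally.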

Leveraging approximate constraints, such as statistical constraints, is a powerful recent trend in error detection. Error localization facilitates the deployment of approximate constraints in a context-aware manner. It enhances both the quality and the transparency of constraint-based error detection results.

Bibliography

[1] Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock. Data Profiling, volume 10. Morgan & Claypool Publishers, 2018.

[2] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In 2007 IEEE 23rd International Conference on Data Engineering, pages 746–755, 2007.

[3] Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific independence in bayesian networks. Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence, 03 2002.

[4] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC press, 1984.

[5] Bojan Cestnik and Ivan Bratko. On estimating probabilities in tree pruning. In Yves Kodratoff, editor, Machine Learning — EWSL-91, pages 138–150, Berlin, Heidelberg, 1991. Springer Berlin Heidelberg.

[6] Xu Chu, Ihab F Ilyas, and Paolo Papotti. Discovering denial constraints. Proceedings of the VLDB Endowment, 6(13):1498–1509, 2013.

[7] Xu Chu, Ihab F Ilyas, and Paolo Papotti. Holistic data cleaning: Putting violations into context. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 458–469. IEEE, 2013.

[8] Xu Chu, John Morcos, Ihab F Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1247–1261, 2015.

[9] Robert G Cowell, Philip Dawid, Steffen L Lauritzen, and David J Spiegelhalter. Probabilistic networks and expert systems: Exact computational methods for Bayesian networks. Springer Science & Business Media, 2006.

[10] Michele Dallachiesa, Amr Ebaid, Ahmed Eldawy, Ahmed Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, and Nan Tang. Nadeef: A commodity data cleaning system. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, page 541–552, New York, NY, USA, 2013. Association for Computing Machinery.

[11] Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, and Vladislav Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD ’02, page 240–251, New York, NY, USA, 2002. Association for Computing Machinery.

[12] A Philip Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–31, 1979.

[13] Johannes Gehrke. Decision tree construction, 2003.

[14] Dan Geiger and Judea Pearl. Logical and algorithmic properties of conditional independence and graphical models. The Annals of Statistics, pages 2001–2021, 1993.

[15] David Harrison and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81–102, 1978.

[16] Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, and Theodoros Rekatsinas. Holodetect: Few-shot learning for error detection. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD ’19, page 829–846, New York, NY, USA, 2019. Association for Computing Machinery.

[17] Joseph M. Hellerstein. Quantitative data cleaning for large databases. 2008.

[18] IBM. The Four V’s of Big Data. Accessed: 2020-01-15.

[19] Ihab F. Ilyas and Xu Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281–393, 2015.

[20] Ihab F Ilyas, Volker Markl, Peter Haas, Paul Brown, and Ashraf Aboulnaga. Cords: automatic discovery of correlations and soft functional dependencies. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 647–658. ACM, 2004.

[21] Liangxiao Jiang and Chaoqun Li. An empirical study on attribute selection measures in decision tree learning. Journal of Computational Information Systems, 6:105–112, 01 2010.

[22] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, page 3363–3372, New York, NY, USA, 2011. Association for Computing Machinery.

[23] Batya Kenig and Dan Suciu. Integrity constraints revisited: From exact to approximate implication. arXiv preprint arXiv:1812.09987, 2019.

[24] P. Kontschieder, Samuel Rota Bulò, A. Criminisi, P. Kohli, Marcello Pelillo, and Horst Bischof. Context-sensitive decision forests for object detection. Advances in Neural Information Processing Systems, 1:431–439, 01 2012.

[25] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. Activeclean: Interactive data cleaning for statistical modeling. Proc. VLDB Endow., 9(12):948–959, August 2016.

[26] Kwok-Wa Lam and V. C. S. Lee. Building decision trees using functional dependencies. In International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004., volume 2, pages 470–473 Vol.2, 2004.

[27] Brian Macdonald. A regression-based adjusted plus-minus statistic for nhl players. Journal of Quantitative Analysis in Sports, 7(3):29, 2011.

[28] Samuel R Madden, Michael J Franklin, Joseph M Hellerstein, and Wei Hong. Tinydb: an acquisitional query processing system for sensor networks. ACM Transactions on database systems (TODS), 30(1):122–173, 2005.

[29] Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. Raha: A configuration-free error detection system. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD), pages 865–882. ACM, 2019.

[30] Panagiotis Mandros, Mario Boley, and Jilles Vreeken. Discovering reliable approximate functional dependencies. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 355–363, New York, NY, USA, 2017. Association for Computing Machinery.

[31] Zelda Mariet, Rachael Harding, Sam Madden, et al. Outlier detection in heterogeneous datasets using automatic tuple expansion. 2016.

[32] Chris Mayfield, Jennifer Neville, and Sunil Prabhakar. Eracer: A database approach for statistical inference and data cleaning. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, page 75–86, New York, NY, USA, 2010. Association for Computing Machinery.

[33] Mathias Niepert, Marc Gyssens, Bassem Sayrafi, and Dirk Van Gucht. On the conditional independence implication problem: A lattice-theoretic approach. Artificial Intelligence, 202:29–51, 2013.

[34] J Osborne and Amy Overbay. Best practices in data cleaning. Best practices in quantitative methods, 1(1):205–213, 2008.

[35] Nikita Patel and Saurabh Upadhyay. Study of various decision tree pruning methods with their empirical comparison in weka. Int. J. Comput. Appl., 60:20–25, 12 2012.

[36] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge university press, 2000.

[37] J. R. Quinlan. Induction of decision trees. Mach. Learn., 1(1):81–106, March 1986.

[38] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

[39] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. Holoclean: Holistic data repairs with probabilistic inference. Proc. VLDB Endow., 10(11):1190–1201, August 2017.

[40] Saharon Rosset, Claudia Perlich, Grzegorz Świrszcz, Prem Melville, and Yan Liu. Medical data mining: insights from winning two competitions. Data Mining and Knowledge Discovery, 20(3):439–468, 2010.

[41] Babak Salimi, Johannes Gehrke, and Dan Suciu. Bias in OLAP queries: Detection, explanation, and removal. In ACM SIGMOD, pages 1021–1035, 2018.

[42] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. Interventional fairness: Causal database repair for algorithmic fairness. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019., pages 793–810, 2019.

[43] Oliver Schulte, Yejia Liu, and Chao Li. Model trees for identifying exceptional players in the nhl draft. arXiv preprint arXiv:1802.08765, 2018.

[44] Jiang Su and Harry Zhang. Representing conditional independence using decision trees. pages 874–879, 01 2005.

[45] Xiangyu Sun, Jack Davis, Oliver Schulte, and Guiliang Liu. Cracking the black box: Distilling deep sports analytics. CoRR, abs/2006.04551, 2020.

[46] Pei Wang and Yeye He. Uni-detect: A unified approach to automated error detection in tables. In International Conference on Management of Data (SIGMOD), July 2019.

[47] Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.

[48] Eugene Wu and Samuel Madden. Scorpion: Explaining away outliers in aggregate queries. Proc. VLDB Endow., 6(8):553–564, June 2013.

[49] Jing Nathan Yan, Oliver Schulte, MoHan Zhang, Jiannan Wang, and Reynold Cheng. Scoded: Statistical constraint oriented data error detection. In Proceedings of the 2020 ACM SIGMOD international conference on Management of data. ACM, 2020.

[50] Nevin L. Zhang and David Poole. On the role of context-specific independence in probabilistic inference. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, page 1288–1293, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

Appendix A

Code

The code for this project is available on GitHub: https://github.com/mhzhang/LocalizationTree

In document Localizing Violations of Approximate Constraints for Data Error Detection (Page 46-54)