Localizing Violations of Approximate Constraints for Data Error Detection


by

MoHan Zhang

B.Sc., University of British Columbia, 2018

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science in the

School of Computing Science Faculty of Applied Sciences

© MoHan Zhang 2020

SIMON FRASER UNIVERSITY

Summer 2020

Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation.


Declaration of Committee

Name: MoHan Zhang

Degree: Master of Science

Thesis title: Localizing Violations of Approximate Constraints for Data Error Detection

Committee:

Chair: Ke Wang, Professor, Computing Science

Oliver Schulte, Supervisor, Professor, Computing Science

Jiannan Wang, Committee Member, Associate Professor, Computing Science

Uwe Glässer, Examiner, Professor, Computing Science


Abstract

Error detection is key for data quality management [16, 34, 19]. Leveraging domain knowledge in the form of user-specified constraints is one of the major approaches to error detection [1]. A recent trend in error detection has been to utilize approximate constraints (ACs), which a relation is expected to satisfy only to a certain degree rather than completely. An example is the recently introduced class of statistical constraints [49], which allow the user to specify which correlations among attributes she expects to be present or absent in the data. Statistical constraints allow the user to express a broad range of statistical and causal domain knowledge. Extensive empirical investigations indicate that even traditional integrity constraints such as functional dependencies hold only approximately in real-world datasets [46]. Approximate functional dependencies (AFDs) have been a data cleaning tool for some time [17]. This thesis introduces a new technique for enhancing error detection with approximate constraints.

Our starting observation is that approximate constraints are context-sensitive: the degree to which they are satisfied depends on the sub-population being considered. An error region is a subset of the data that violates an AC to a higher degree than the data as a whole, and is therefore more likely to contain erroneous records. For example, an error region may contain the set of records from before a certain year, or from a certain location. We describe an efficient algorithm for identifying distinct data regions that violate given ACs to different degrees, based on a recursive tree partitioning scheme. The learned trees describe different error regions in terms of data attributes that are easily interpreted by users (e.g. all records before 2003). This helps to explain to the user why some records were identified as likely errors.

After identifying error regions, we can apply error detection methods to each error region separately, rather than to the dataset as a whole. Our empirical evaluation, done using four datasets containing both real world and synthetic errors, shows that identifying error regions increases both precision and recall of error detection based on ACs. Error regions can be combined not only with constraint-based error detection, but also with other approaches such as those based on machine learning. Our experiments provide evidence that the error regions boost the performance of machine learning methods.


Keywords: data cleaning; approximate constraints; tree learning; error regions


Dedication

This thesis is dedicated to my family, for their love and support along every step of my journey.


Acknowledgements

My deep gratitude goes to my supervisor Dr. Oliver Schulte, who continuously guided and supported my research for the past two years. This thesis would not have been possible without him, and I could not have asked for a better supervisor during my time at Simon Fraser University.

To my co-authors Dr. Jiannan Wang and Nathan Yan, it was a pleasure working with you. I learned so much about data cleaning from you, and I appreciate your guidance and patience.

To the rest of my thesis committee, Dr. Uwe Glässer and Dr. Ke Wang, thank you for your insightful comments and feedback. They have helped improve this project tremendously.

To my fellow lab mates Yudong Luo, Guiliang Liu, and Xiangyu Sun, your enthusiasm towards research inspired me. Thank you for the great time that we had together; I will cherish our camaraderie for a long time to come.

Finally, I am indebted to my parents, who encouraged me to pursue an MSc two years ago and showed great love throughout my journey.


List of Tables

Table 5.1 Dataset Information

Table 5.2 Constraints used by TreeDetect and other approaches

Table 5.3 Performance of error detection methods compared, using Independence Constraints

Table 5.4 Performance of error detection methods compared, using Dependent Constraints


List of Figures

Figure 1.1 TreeDetect Pipeline

Figure 1.2 Car Database. Note that without considering Year, attributes Model and Color are independent.

Figure 1.3 Car Database, broken down by each year.

Figure 3.1 Probability Density Function for a hypothetical dataset. The x-axis represents the test statistic.

Figure 4.1 Localization Tree produced for data set in Figure 1.3

Figure 4.2 Figure 4.1 after pruning

Figure 5.1 Tree generated from the Hockey Dataset, using uniqueness ratio on attribute PlayerName

Figure 5.2 Tree generated from the Hockey Dataset, using AFD PlayerName → DraftYear

Figure 5.3 Tree generated from the Hockey Dataset, using SC GP ⊥⊥ GPM

Figure 5.4 Tree generated from the Hockey Dataset, further pruned

Figure 5.5 Errors detected from the Hockey Dataset

Figure 5.6 Tree generated from the SC SES ⊥̸⊥ TaxRate | Crimes

Figure 5.7 Top-k Comparison on (a) Car Dataset, (b) Housing Dataset, (c) Sensor Dataset

Figure 5.8 Comparison of scalability between TreeDetect and SCODED with different K

Figure 5.9 Scalability of tree construction TreeDetect with increased data size


Chapter 1

Introduction and Overview

Error detection is the process of detecting erroneous entries in a database. With the emergence of large scale datasets and the growing popularity of Big Data technologies, the ability to maintain a clean database has become critical. In 2016, IBM estimated that poor data quality costs the U.S. economy around $3 trillion per year [18]. As a result of the growing impact of data errors, much recent work in academia and industry has focused on developing efficient and accurate error detection models.

Data errors arise from different factors and therefore come in different types, such as duplication errors, missing values, and conflicts with domain knowledge. A potential side effect of all these errors is a violation of domain knowledge, as represented by user-specified constraints. Many error detection methods have been proposed in the past. These include constraint-based methods, outlier detection methods, and duplication detection methods [16]. Constraint-based methods are one of the most widely used classes of error detection methods. A user specifies constraints that describe what a dataset should look like; the model then detects the specific data entries that violate the constraints. As discussed by Yan et al., constraints can be elicited from the user through a mixed-initiative approach, where data analysis suggests constraints which the user can accept, reject, or refine [49]. We illustrate this human-in-the-loop approach in one of our case studies (Section 5.2).

Constraint-based methods can be further separated into methods that use exact constraints and those that use approximate constraints.

As reported in several studies [46, 49], common constraints hold only approximately, not exactly, in real-world datasets. It is therefore important to leverage approximate constraints (ACs), such as approximate functional dependencies [46] and statistical constraints [49]. After detecting the violation of an (approximate) constraint, the second step is error analysis: identifying records that are the likely cause of the violation [46, 49]. This thesis describes an error localization method that supports both detecting the violation of an AC and subsequent error analysis. Figure 1.1 illustrates the TreeDetect pipeline.


Figure 1.1: TreeDetect Pipeline

Approximate Constraints and Context Sensitivity. An AC is defined by a constraint violation metric that quantifies how much a given relation violates/satisfies the constraint. We observe that ACs are context-sensitive in that the degree to which an AC is violated can vary depending on the data region considered. For example, a statistical constraint may show a statistically significant violation for the year 2017 but not in the dataset as a whole. Constraint violation is context-sensitive because data errors often have multiple sources and therefore occur in clusters corresponding to different data regions. For example, sensor readings may be wrong due to configuration errors that occur in a certain time period and/or a certain location. We refer to data regions that violate an approximate constraint to a high degree as error regions. TreeDetect is our method for identifying error regions given a user-specified AC. We refer to this process as error localization. Error localization has several advantages for error detection.

• Interpretability: Highlighting common features of dirty records helps the user understand the output of the error detection system and identify the causes of the errors.

• Scalability: Identifying error regions and then searching for likely errors within each region is a divide-and-conquer strategy that can speed up error detection compared to searching for likely errors across the whole dataset.

• Statistical Accuracy: The size of an error region is generally small compared to the size of the entire dataset, so violation metrics can be more sensitive when applied within an error region. For example, a set of violating records may constitute a small percentage of the entire dataset, but a significant fraction of these records may reside in a single region.

Example Application Scenario. Our method can be applied in any error detection scenario where approximate constraints are available for supporting error detection. In many machine learning applications, a key class of approximate constraints is statistical, involving the presence or absence of correlations [49]. Statistical constraints are important because machine learning models are based on which input attributes are relevant for predicting a target attribute. They are naturally approximate, as we do not expect attribute correlations to be perfect (0 or 1). We illustrate these points in an application scenario based on the one given by Yan et al. [49].

Consider a machine learning application based on a car database with attributes Model, Color, and Year, as shown in Figure 1.2. Previous model construction has found that Model and Color should be independent; that is, knowing the model should not give any information about the color. Suppose that considering the entire dataset, without a specific context, the constraint is satisfied, as illustrated in Figure 1.2.

Figure 1.2: Car Database. Note that without considering Year, attributes Model and Color are independent.

However, error localization shows that the constraint is strongly violated for records from the year 2020, as illustrated in Figure 1.3. This suggests to the user that the most recent data are erroneous and need to be checked. The error analysis model of TreeDetect applies an error detection method to the sub data table with records from year 2020, which highlights the records of White BMW X1s as likely to cause the violation. A further step (not part of TreeDetect) is to clean the data manually or with an automatic data repair tool.
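The context sensitivity in this scenario can be reproduced on a toy table. The following is a minimal sketch under stated assumptions: the record counts, the second model name (Honda Civic), and all function names are hypothetical, and the dependence gap |P(White | BMW X1) − P(White | other model)| is only a simple stand-in for a violation metric, not the statistic used by TreeDetect.

```python
def make_rows(year, counts):
    """Expand {(model, color): count} into a list of (model, color, year) records."""
    return [(m, c, year) for (m, c), k in counts.items() for _ in range(k)]


def dependence_gap(rows):
    """|P(White | BMW X1) - P(White | other model)|.

    A value of 0 means that knowing the model says nothing about the color.
    This is an illustrative stand-in for a violation metric.
    """
    def white_share(keep):
        colors = [c for m, c, _ in rows if keep(m)]
        return sum(c == "White" for c in colors) / len(colors)

    return abs(white_share(lambda m: m == "BMW X1")
               - white_share(lambda m: m != "BMW X1"))


# Hypothetical counts: 2018 and 2019 are balanced, 2020 has too many white X1s.
balanced = {("BMW X1", "White"): 25, ("BMW X1", "Black"): 25,
            ("Honda Civic", "White"): 25, ("Honda Civic", "Black"): 25}
skewed = {("BMW X1", "White"): 16, ("BMW X1", "Black"): 4,
          ("Honda Civic", "White"): 4, ("Honda Civic", "Black"): 6}
cars = make_rows(2018, balanced) + make_rows(2019, balanced) + make_rows(2020, skewed)

gap_global = dependence_gap(cars)
gap_by_year = {year: dependence_gap([r for r in cars if r[2] == year])
               for year in (2018, 2019, 2020)}

print(f"whole table: {gap_global:.3f}")
for year, gap in gap_by_year.items():
    print(f"year {year}: {gap:.3f}")
```

On these counts the gap is small for the table as a whole but large for the 2020 records, because the many clean records from earlier years dilute the violation.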

Figure 1.3: Car Database, broken down by each year.

Tree-based Approach to Error Localization. The basic assumption of TreeDetect is the Tree Representation Assumption (TRA): An error region can be defined in terms of a predicate [48], which is a conjunction of selection conditions applied to the attributes (columns) of a relation. We assume that the predicates can be organized into a Localization Tree structure, where each node is labelled with an attribute, and each child satisfies a different value, or range of values, of the attribute. The conjunction of conditions along a complete path from root to leaf defines a predicate, and the collection of predicates for each leaf defines a partition of the data relation into error regions. To construct a Localization Tree, we adapt methods from machine learning, where tree prediction models are first grown into deep branches, then pruned to balance complexity with predictive accuracy. For AC violation, our objective function is not predictive accuracy, but the violation metric associated with the AC.
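The relationship between root-to-leaf paths, predicates, and the resulting partition can be sketched as follows. This is an illustrative data structure only, not the TreeDetect implementation: the class `Node`, the helpers `leaf_predicates` and `partition`, and the example tree are all hypothetical names and values.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
# A selection condition: a readable label plus a test applied to one row.
Condition = Tuple[str, Callable[[Row], bool]]


@dataclass
class Node:
    attribute: Optional[str] = None  # split attribute; None for a leaf
    children: List[Tuple[Condition, "Node"]] = field(default_factory=list)


def leaf_predicates(node, path=()):
    """Return one predicate (tuple of conditions) per leaf of the tree."""
    if not node.children:
        return [path]
    predicates = []
    for condition, child in node.children:
        predicates.extend(leaf_predicates(child, path + (condition,)))
    return predicates


def partition(rows, root):
    """Map each leaf predicate, written as a conjunction, to its data region."""
    regions = {}
    for path in leaf_predicates(root):
        label = " AND ".join(text for text, _ in path) or "TRUE"
        regions[label] = [r for r in rows if all(test(r) for _, test in path)]
    return regions


# Hypothetical tree: split on Year, then split the recent records on Model.
root = Node(attribute="Year", children=[
    (("Year < 2020", lambda r: r["Year"] < 2020), Node()),
    (("Year >= 2020", lambda r: r["Year"] >= 2020),
     Node(attribute="Model", children=[
         (("Model = BMW X1", lambda r: r["Model"] == "BMW X1"), Node()),
         (("Model != BMW X1", lambda r: r["Model"] != "BMW X1"), Node()),
     ])),
])

rows = [{"Year": 2019, "Model": "BMW X1"},
        {"Year": 2020, "Model": "BMW X1"},
        {"Year": 2020, "Model": "Honda Civic"}]
regions = partition(rows, root)
for label, region in regions.items():
    print(label, len(region))
```

Because the conditions at each node are mutually exclusive and exhaustive, every row falls into exactly one region, and each region carries a label that a user can read directly.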

We evaluated the performance of TreeDetect on four real-world datasets with a mix of real-world and synthetic errors. For each dataset, TreeDetect identified tree-based error regions given ACs specified in previous research. Our evaluation focuses on statistical constraints as they are especially recent, complex, and powerful types of ACs. We take advantage of the fact that the region-defining predicates are interpretable to highlight the plausibility of the error regions in case studies. For quantitative evaluation, we show that error localization enhances previous error analysis methods by applying these methods to distinct error regions, in two settings:

1. Using the drill-down method of the SCODED [49] system. SCODED leverages statistical constraints to rank data records according to how likely they are to be dirty.

2. Using Raha [29], a state-of-the-art machine learning error detection system that is trained on user-supplied dirty/clean labels of data records. The machine learning setting shows how our approach offers a way to combine user domain knowledge specified through the AC with user knowledge specified through labelling records as dirty/clean.


The precision and recall of both SCODED and Raha increased when they were applied locally to error regions, rather than globally to the entire dataset. This improvement is not simply due to considering a subset of the data: if our error localization method had indicated data regions that were in fact mostly clean, we would expect the error detection methods to be misled by error localization. In general, an improvement in error detection performance is the expected behaviour if the dataset contains context-sensitive errors and if the Localization Tree identifies accurate error regions.

Thesis Outline. The thesis is structured as follows. Chapter 1 provides an introduction and the motivation for TreeDetect. Chapter 2 describes related work in the areas of statistical constraints, data cleaning, and decision tree learning. Chapter 3 provides the background knowledge necessary to understand our work. Chapter 4 dives deeper into our method and highlights our contributions. Chapter 5 discusses our experimental results. Finally, Chapter 6 summarizes our work, discusses some limitations of TreeDetect, and provides directions for future research.


Chapter 2

Related Work

2.1 Data Cleaning

Data cleaning involves error detection and error repairing. Many methods have been proposed in both areas. In this section we give an overview of both.

2.1.1 Error Detection

Error detection methods focus on identifying erroneous data records in a dataset. Error detection is a typical first step in every data analysis pipeline [16, 34, 19].

Error detection is as difficult as it is important, and user input is invaluable for solving it, because user input represents specific domain knowledge. Therefore error detection methods have been developed for leveraging as many types of user input as possible. Existing works can be categorized into four classes: outlier detection, pattern-based, knowledge-base based, and constraint-based error detection methods. Outlier detection methods detect data points that do not follow the general distribution of the majority of data records. Outlier detection is data-driven in that it does not explicitly leverage user-specified constraints. DBoost [31] is a state-of-the-art outlier detection method for finding errors that leverages histograms and Gaussian models to detect outliers. We include DBoost as a benchmark method in our experiments. Scorpion [48] is a data analysis tool that explains outlier groups by finding predicates that lead to the outliers. Pattern-based methods [22] detect data points that do not conform to a predefined data pattern. A recent trend has been to seek user input in the form of binary "clean/dirty" labels. Based on user labelling, error detection can be approached using machine learning as a semi-supervised classification problem (e.g. Raha [29] and HoloDetect [16]). We describe a possible approach for combining ML-based methods with localization trees: We leverage the user-specified constraint to identify regions of (likely) error, then apply Raha separately to each region. This hybrid approach leverages both user knowledge in the form of ACs and user-provided labels. In our experiments, this combination led to substantive improvements in error detection accuracy, compared to using Raha as a stand-alone method. Knowledge-base based methods determine the correctness of data records by cross-checking with external knowledge bases (e.g. Katara [8]) or web tables (e.g. Uni-Detect [46]). Finally, in constraint-based error detection, the user input consists of constraints capturing domain knowledge, usually in the form of Integrity Constraints (e.g. Functional Dependency [30] and Denial Constraints [6]).

Error Detection with Approximate Constraints.

Integrity constraints are traditionally viewed as binary, in that a dataset either does or does not satisfy them. However, there is considerable evidence that in many datasets they do not hold exactly, and therefore the scope of constraint-based error detection can be increased by developing methods that work with approximate versions of integrity constraints. A recent state-of-the-art example is the Uni-Detect system [46], which develops approximate versions of a variety of integrity constraints. Statistical constraints are another class of approximate constraints that were introduced very recently in the SCODED system. They express a large class of statistical knowledge regarding the presence or absence of statistical associations and causal relationships among attributes in a data relation [49]. Both Uni-Detect and SCODED utilize statistical hypothesis testing to define the degree to which an AC holds on a given dataset. This use of hypothesis testing contrasts with traditional statistics, where the objective is to verify how well an empirical scientific hypothesis accords with the data (e.g., "this treatment has a significant effect on the disease").

Previous approaches differ in the source of approximate constraints for error detection. While our error localization method is neutral on this issue and can leverage ACs regardless of their provenance, we briefly review some of the key ideas. 1) A traditional view is that ACs are elicited from the user as part of the data profiling process [1, 17], much like integrity constraints. For statistical constraints in particular, several elicitation methods have been developed in machine learning and AI, such as the use of graphical models [42, 14, 9, 33]. 2) ACs can be discovered from the data. For instance, Uni-Detect examines a large corpus of tables from the internet to find useful (approximate) regularities (e.g. spelling mistakes). A large table corpus can reveal general patterns, but not domain-specific constraints. For example, in our ice hockey case study, we leverage hockey-specific knowledge about the relationship among different player metrics. Deriving ACs from a single domain-specific table faces two challenges: i) If the data is dirty, patterns in the data may be due to errors rather than true reflections of the domain. In this case constraints inferred from data may point to correct records rather than dirty ones. ii) Constraints are logically more powerful than single relation instances: we cannot infer general constraints (e.g. functional dependencies) from a single instance. 3) Yan et al. proposed a hybrid human-in-the-loop approach: Data analysis suggests constraints, which are then applied to highlight potentially dirty records to human domain experts. If the human expert agrees that the records are in error, this confirms the constraint. Otherwise the constraint can be removed or refined by the expert. We illustrate this discover-and-detect methodology in our case study on uniqueness constraints (Sec. 5.2.1). We follow the SCODED approach and evaluate error localization with user-specified ACs. In principle, error localization can also be applied with data-derived ACs; we leave this as a direction for future work.

2.1.2 Error Repairing

Error repairing focuses on correcting erroneous data records. State-of-the-art error repairing methods include ERACER [32], Holoclean [39], NADEEF [10], and ActiveClean [25]. While fully automated algorithms have been developed, their performance is underwhelming [39]. To achieve optimal results, error repairing usually involves human interaction. TreeDetect can be viewed as complementary to the error repairing process: errors detected by TreeDetect can later be used as inputs to these algorithms.

2.2 Statistical Constraints

Statistical Constraints (SCs) have been studied in the statistics and AI literature [12, 36]. Existing studies assume that data is clean and explore how to infer SCs from the data. The derived SCs can be used for statistical modeling and causal inference. Ilyas et al. show that SCs (involving pairs of columns) are effective in improving query optimization [20]. Salimi et al. leverage (conditional) independence relationships among attributes to resolve bias in OLAP queries [41]. Unlike traditional constraints, such as integrity constraints or Denial Constraints [6], SCs do not hold deterministically, but only to a certain degree. Therefore, SCs belong to the category of approximate constraints. Recently SCs were introduced as a new constraint-based method for detecting errors in data [49]. The SC-based method, SCODED, leverages statistical dependence and independence among data attributes to detect violations of these predefined constraints. SCODED has been shown to outperform many state-of-the-art error detection algorithms.

The idea of context-sensitive (in)dependence has been extensively studied in the context of Bayesian Networks [50, 3], including using tree representations of contexts as in our work. To our knowledge, we are the first to employ a tree representation for detecting context-sensitive errors.

2.3 Tree Learning

Tree learning methods have been extensively researched in machine learning for supervised prediction tasks [35]. Decision trees are employed for predicting discrete class labels, and regression trees for continuous output variables. The main similarity to our work is that we also employ a tree to represent a partition of the input data space. The main difference is in the use of the tree partition: In prediction, the goal is to find a set of regions such that prediction in each region is easier than global prediction. For error detection, the goal is to find a set of regions such that error detection in each region is easier than global error detection. More specifically, in prediction models, we seek regions that minimize a measure of predictive uncertainty. In error detection, we seek regions that maximize a measure of constraint violation. While the objective functions are different between predictive modelling and error detection, the algorithmic strategies for finding optimal tree partitions are similar. In particular, the approach of using a growth phase followed by a pruning phase is similar.

Below we discuss previous research in attribute selection and pruning.

2.3.1 Attribute Selection

Attribute selection is the process of selecting the best split attribute. Measures for attribute selection mostly revolve around the idea of reducing impurity from parent to child nodes [21]. These include Information Gain, which measures the decrease in entropy after the dataset is split on an attribute. Entropy is defined as

\[ \mathrm{Entropy}(D) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{2.1} \]

where p_i is the relative frequency of class i in dataset D and c is the number of classes. Information Gain is used in the classical ID3 [37] and C4.5 [38] decision tree learning algorithms. Another such measure, the Gini Index, measures the probability that a randomly chosen record is wrongly classified when it is labelled at random according to the class distribution of the node. The Gini Index is used in the highly popular CART algorithm [4]. Variations of these measures have also been proposed to suit particular applications of decision trees. For example, the work in [24] introduces a novel split criterion for learning context-sensitive decision trees with the aim of applying context-sensitive decision forests to object detection.
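Equation 2.1 and the resulting Information Gain can be computed in a few lines. The sketch below is a minimal illustration; the function names `entropy` and `information_gain` and the toy records are hypothetical and do not come from any particular library.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, as in Equation 2.1."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Decrease in entropy of `target` after splitting `rows` on `attribute`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Entropy of the children, weighted by the share of rows in each child.
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


# Hypothetical records: Year predicts the label perfectly, Color not at all.
rows = [
    {"Year": 2019, "Color": "White", "Label": "clean"},
    {"Year": 2019, "Color": "Black", "Label": "clean"},
    {"Year": 2020, "Color": "White", "Label": "dirty"},
    {"Year": 2020, "Color": "Black", "Label": "dirty"},
]
print(information_gain(rows, "Year", "Label"))   # 1.0
print(information_gain(rows, "Color", "Label"))  # 0.0
```

A split attribute with a higher gain produces purer child nodes, which is why ID3-style learners choose the attribute that maximizes this quantity.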

2.3.2 Pruning

Pruning refers to the process of reducing the number of leaves within a decision tree, mostly to avoid over-fitting. There are two techniques for pruning: post-pruning and pre-pruning. In post-pruning, a decision tree is first generated before the leaves are pruned. [5] proposes minimum error pruning, where every leaf of the tree is replaced with the most popular class. [4] introduces cost complexity pruning, a form of post-pruning in which the algorithm attempts to optimize the cost complexity function. Note that post-pruning can be as simple as pruning all nodes at levels beyond the maximum depth of a tree. Pre-pruning, on the other hand, prevents the tree from generating non-significant branches; in other words, this form of pruning reduces the number of branches by stopping the tree from splitting. Chi-squared pruning is a form of pre-pruning that uses statistical testing to determine whether a split on some attribute is statistically significant. The chi-squared test is defined as

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \tag{2.2} \]


where O is the observed value and E is the expected value. The idea is that if a split does not result in a partition that is significantly different from its parent, we reject the split. Chi-squared pruning is used in Quinlan’s ID3 [37]. Note that Chi-squared pruning can also be used as a form of post-pruning, where we detect useless splits after the tree has been constructed.
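The chi-squared test of Equation 2.2 can be applied to a candidate split as sketched below. This is a minimal illustration under stated assumptions: the rows of the table are taken to be the child nodes of the split and the columns the class counts, the function names are hypothetical, and the critical value is assumed to be supplied by the caller (3.841 is the standard 5% critical value for one degree of freedom).

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows).

    Expected counts assume that rows and columns are independent.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


def keep_split(table, critical_value):
    """Keep the split only if the children differ significantly from the parent."""
    return chi_squared(table) > critical_value


# Rows are the two children of a candidate split, columns are class counts.
informative = [[30, 10], [10, 30]]
uninformative = [[20, 20], [20, 20]]
print(chi_squared(informative), keep_split(informative, 3.841))
print(chi_squared(uninformative), keep_split(uninformative, 3.841))
```

In the first table every expected count is 20 and every observed count deviates by 10, so the statistic is 4 × 100 / 20 = 20 and the split is kept; in the second table the children have the same class distribution as the parent and the split is rejected.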

In our work we use post-pruning to remove irrelevant leaves from our tree. We do not use pre-pruning because pre-pruning can stop the tree from being fully grown, and an incomplete tree may fail to encompass error regions that are defined by multiple predicates.

The predictive tree learning methods work under similar assumptions as our error region construction method: that complex conditions can be built up sequentially from relevant subconditions, and that the observed features given by the data are sufficient for defining useful data regions: useful for prediction in the case of machine learning, and useful for error detection in the case of our error localization.

2.3.3 Decision Tree and Constraints

Previous works in decision tree research have utilized conditional and functional dependencies to build better classifiers. However, most of this work focuses on improving the classification performance of decision trees. CITree [44] leverages conditional independence to build more compact decision trees and to improve classification accuracy. Lam and Lee [26] propose a new type of FD called approximate class functional dependency (ACFD), which defines the relation between attribute values and class labels. The algorithm then selects the determinants of the least violated ACFDs as the split attribute. Decision trees built using ACFDs show improved classification performance.


Chapter 3

Background

We review relevant background on previous error detection work with approximate constraints. We begin with exact integrity constraints that are generalized by ACs.

We write D |= C if the data relation D satisfies an integrity constraint C. If a relation violates a constraint, error analysis often searches for a minimum-size data subset such that removing the subset removes the violation [1].

As a minimum-size solution to this problem divides the data into a clean and a (potentially) dirty subset, we refer to it as the partition problem for constraint C.

3.1 Error Detection With Approximate Constraints

Approximate constraints are quantified by violation metrics [23]. A violation metric φ for constraint C is an aggregate function that, given a relation, returns a real value: φ_C(D) ∈ [0, m], where 0 indicates no violation (i.e., D |= C), and m is the maximum possible violation for a fixed data size n = |D|. (Given a fixed data size n, the maximum m_n = sup_{D:|D|=n} φ(D) exists for all violation metrics in common use.) When the constraint C is irrelevant or fixed by context, we omit it. For each violation metric there is a dual satisfaction metric σ: σ(D) ≡ m − φ(D). An exact constraint C is represented by the violation metric φ(D) = 0 if D |= C, and φ(D) = 1 otherwise. The dual satisfaction metric is 1 if D |= C, and 0 otherwise.
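As a concrete instance of these definitions, the sketch below uses the number of violating record pairs as a violation metric for a functional dependency. The choice of metric, the function names, and the example records are illustrative assumptions, not the metrics used later in this thesis. For n records this metric has maximum m = n(n − 1)/2, reached when all records agree on the left-hand side and all disagree on the right-hand side.

```python
from itertools import combinations
from math import comb


def fd_violating_pairs(rows, lhs, rhs):
    """Violation metric: pairs that agree on `lhs` but disagree on `rhs`."""
    return sum(1 for r, s in combinations(rows, 2)
               if r[lhs] == s[lhs] and r[rhs] != s[rhs])


def exact_violation(rows, lhs, rhs):
    """Exact constraint as a 0/1 metric: 0 if the FD holds, 1 otherwise."""
    return 0 if fd_violating_pairs(rows, lhs, rhs) == 0 else 1


def satisfaction(violation, maximum):
    """Dual satisfaction metric: sigma(D) = m - phi(D)."""
    return maximum - violation


# Hypothetical records for the dependency PlayerName -> DraftYear.
players = [
    {"PlayerName": "A", "DraftYear": 2001},
    {"PlayerName": "A", "DraftYear": 2001},
    {"PlayerName": "A", "DraftYear": 2003},
    {"PlayerName": "B", "DraftYear": 2005},
]
phi = fd_violating_pairs(players, "PlayerName", "DraftYear")
m = comb(len(players), 2)
print(phi, satisfaction(phi, m))  # 2 4
print(exact_violation(players, "PlayerName", "DraftYear"))  # 1
```

The exact metric only reports that the dependency fails, whereas the approximate metric distinguishes a table with two violating pairs from one in which every pair violates.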

Two different types of error detection output are useful for different applications: First, a subset of likely dirty data, corresponding to a "clean/dirty" labelling. Second, a "dirtiness" ranking that supports returning the top-k data records most likely to be dirty. This is similar to outlier detection, where methods output either a set of potential outliers or an "outlierness" metric for data points. Each type of desired output leads to an optimization problem for approximate constraints. To output binary labels, we can solve a minimum repair problem as follows.

Definition 1 (Dataset Partition for Approximate Constraints). Given a dataset D, a violation metric φ, and a user-defined threshold α for an acceptable violation, the data partition problem is to find a minimum-cardinality subset ∆D of records such that removing the subset reduces the violation below the threshold: φ(D − ∆D) ≤ α.
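The structure of Definition 1 can be illustrated with a simple greedy heuristic that repeatedly removes the record whose removal lowers the violation the most. This sketch is an assumption made for illustration: it is not the algorithm of SCODED or TreeDetect, it does not guarantee a minimum-cardinality ∆D, and the names `greedy_partition` and `duplicate_count` are hypothetical.

```python
def greedy_partition(rows, phi, alpha):
    """Greedily remove records until phi(remaining) <= alpha.

    Returns (removed, remaining). The removed set is a feasible Delta D,
    but it is not guaranteed to be of minimum size.
    """
    remaining = list(rows)
    removed = []
    while remaining and phi(remaining) > alpha:
        # Pick the record whose removal leaves the smallest violation.
        best = min(range(len(remaining)),
                   key=lambda i: phi(remaining[:i] + remaining[i + 1:]))
        removed.append(remaining.pop(best))
    return removed, remaining


def duplicate_count(values):
    """Toy violation metric for a uniqueness constraint: surplus duplicates."""
    return len(values) - len(set(values))


names = ["a", "a", "a", "b", "c"]
dirty, clean = greedy_partition(names, duplicate_count, alpha=0)
print(dirty, clean)  # ['a', 'a'] ['a', 'b', 'c']
```

Each iteration evaluates the metric once per remaining record, so the cost grows quickly with the size of the dataset, which is one motivation for restricting the search to smaller regions.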

A top-k query can be supported by finding the set of k records with the biggest impact on the violation metric, as follows.

Definition 2 (Top-k Contribution). Given a dataset D, a violation metric φ, and a user-defined number k, the top-k contribution problem identifies a set of k records ∆D that minimize the constraint violation: argmin_{∆D:|∆D|=k}φ(D − ∆D).
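Definition 2 can be made concrete by exhaustive search over all subsets of size k. The sketch below is a direct, exponential-time reading of the argmin and is meant only to illustrate the definition on tiny inputs; the function names and the toy uniqueness metric are hypothetical, and practical systems rely on more efficient strategies.

```python
from itertools import combinations


def top_k_contribution(rows, phi, k):
    """Return k records whose removal minimizes phi, by brute force.

    Ties are broken in favour of the first subset in enumeration order.
    """
    def violation_without(drop):
        dropped = set(drop)
        return phi([r for i, r in enumerate(rows) if i not in dropped])

    best = min(combinations(range(len(rows)), k), key=violation_without)
    return [rows[i] for i in best]


def surplus_duplicates(values):
    """Toy violation metric for a uniqueness constraint."""
    return len(values) - len(set(values))


values = ["a", "a", "b", "b", "c"]
top2 = top_k_contribution(values, surplus_duplicates, k=2)
print(top2)  # ['a', 'b']
```

Removing one copy of each duplicated value brings the violation to zero, whereas removing both copies of the same value would leave one duplicate in place.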

The dataset partition and top-k problems can be expressed in dual terms by replacing the minimization of violation metrics with the maximization of satisfaction metrics. Yan et al. show that the Partition and Top-k problems are algorithmically equivalent in the sense that a polynomial-time algorithm for one can be used to obtain a polynomial-time algorithm for the other [49, Th.1]. In our experiments we evaluate error localization for both ranking and binary labelling methods, both in terms of accuracy and scalability.

3.2 Statistical Tests

Our null hypothesis is that the two variables are independent. We work with two types of data, discrete (categorical) and continuous.

To measure the correlation between two discrete variables, we use G-test, which deter- mines whether two variables are independent or dependent. The formula of G test is as follows:

$$G = 2 \sum_i O_i \ln\left(\frac{O_i}{E_i}\right)$$

where O_i is the observed frequency of a cell, and E_i is the expected frequency of a cell. A large G-test statistic suggests that the two variables may be dependent, and vice versa. We further measure the p-value to determine whether we should reject the null hypothesis at the predefined significance level.
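As a minimal sketch of the formula above, the following Python function computes G for a two-way table of counts. The function name `g_statistic` and both example tables are illustrative assumptions, not taken from the thesis; the expected frequency of a cell is computed under independence as row total times column total divided by the table total.

```python
import math

def g_statistic(table):
    """Return G = 2 * sum_i O_i * ln(O_i / E_i) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # an empty cell contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / n
            g += observed * math.log(observed / expected)
    return 2.0 * g

# Hypothetical tables: proportional rows (independent) vs. a diagonal table (dependent).
independent = [[10, 20], [20, 40]]
dependent = [[30, 0], [0, 30]]
print(g_statistic(independent), g_statistic(dependent))
```

In the first table every observed count equals its expected count, so the statistic vanishes; in the second table each row value determines the column value, which yields a large statistic.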

To measure the correlation between two continuous variables, we use Kendall's Tau, which measures the ordinal correlation between two variables. The idea is that a pair of observations of two variables X and Y is either concordant (the observation with the larger X value also has the larger Y value, which indicates positive correlation), or discordant (the observation with the larger X value has the smaller Y value, which indicates negative correlation). Pairs that are neither concordant nor discordant are considered to be tied. We denote the number of concordant pairs by C, and the number of discordant pairs by D. Kendall's Tau is calculated as

$$\tau = \frac{C - D}{\binom{n}{2}}$$

where n is the number of observations.
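The pair counting behind Kendall's Tau can be sketched directly in Python. The function name `kendall_tau` is an assumption for illustration; the quadratic loop over all pairs is chosen for clarity, not efficiency.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """tau = (C - D) / (n choose 2); tied pairs count toward neither C nor D."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables move in the same direction
        elif sign < 0:
            discordant += 1   # the variables move in opposite directions
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))
```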

3.3 Examples of Approximate Constraints

We provide examples of common violation and satisfaction metrics, following previous research [46, 49]. We explain approximate statistical constraints in more detail than approximate versions of traditional constraints, as they have been introduced to data management only quite recently.

In practice, violation metrics are scaled by a term that depends on the size of the dataset. The dependence on sample size accounts for the fact that a violation in a small number of records may be due to statistical noise rather than a genuine problem with the data.

3.3.1 Approximate Uniqueness Constraints

Uniqueness Constraints specify that values within one or more columns in a data table are to be unique. In [46] the authors show that in some cases detecting errors using assumed Uniqueness Constraints can yield many false positive errors, since in real-world datasets different records share values for the same column.

To solve this problem, we can define almost uniqueness constraints (AUCs), where columns with a uniqueness ratio below a threshold are considered to be non-unique by nature [17]. For example, the conforming-row ratio computes the number of row pairs with the same value in the target column, divided by the total number $\binom{N_{dset}}{2}$ of possible violating pairs, where N_dset is the number of rows in the dataset.

As an example, suppose we set the threshold to 99% and enforce an AUC on a column C. If 99% of the values in C are unique, then the non-unique values in the column are considered to be errors, while if only 95% of the values in C are unique, we invalidate the uniqueness constraint on the column.
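The thresholding logic of this example can be sketched as follows. The function `apply_auc` and the two synthetic columns are hypothetical, and the uniqueness ratio is assumed to be the number of distinct values divided by the number of entries.

```python
from collections import Counter

def apply_auc(values, threshold=0.99):
    """Return (uniqueness ratio, indices flagged as errors) for one column."""
    ratio = len(set(values)) / len(values)
    if ratio < threshold:
        return ratio, []          # constraint invalidated: nothing is flagged
    counts = Counter(values)
    return ratio, [i for i, v in enumerate(values) if counts[v] > 1]

# Hypothetical columns: one duplicate among 200 entries vs. many repeats among 100.
mostly_unique = list(range(199)) + [0]
often_repeated = list(range(95)) + [0] * 5

print(apply_auc(mostly_unique))
print(apply_auc(often_repeated))
```

For the first column the constraint is accepted and both copies of the duplicated value are flagged; for the second column the ratio falls below the threshold, so the constraint is dropped and no record is reported.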

3.3.2 Approximate Statistical Constraints: Probabilistic (In)dependence

Recently Yan et al. [49] introduced a new type of constraint for error detection: asserting the probabilistic (in)dependence of two attributes/columns, possibly conditional on the values of a third. Conditional independence is defined as follows: given three attributes X, Y, Z, we say that X and Y are conditionally independent given Z if knowing Y gives no information about X, beyond what can be inferred from the values of Z.

X ⊥⊥ Y | Z ≡ P(X = x, Y = y | Z = z) = P(X = x | Z = z) × P(Y = y | Z = z) for all x, y, z.

This definition can be extended to three disjoint sets of attributes [49]. Throughout this thesis, we will be working extensively with the idea of probabilistic independence, termed Independent Statistical Constraints (ISCs), and probabilistic dependence, termed Dependent Statistical Constraints (DSCs).

Probabilistic independence becomes a constraint on data when we apply it to the empirical distribution P_D associated with a relation D, which is defined as follows [23].

Let r[X] denote the tuple of values in the X columns for record r. Given an assignment X = x, a record satisfies the assignment if r[X] = x. The empirical count is the number N_D(X = x) of records that satisfy it. The empirical frequency of an assignment is the number of satisfying records, divided by the total number of records:

P_D(X = x) ≡ N_D(X = x)/N_D where N_D is the cardinality of relation D.

In the running example of Figure 1.2, we have N_D = 9 and the count of white cars is N_D(Color = white) = 3, so the empirical frequency is P_D(Color = white) = 3/9 = 1/3.

Approximate Probabilistic Independence: Discrete Attributes

A common metric for the strength of a dependence is mutual information I(X; Y ) [47, 23]:

$$I(X; Y) \equiv \sum_{x,y} P(X = x, Y = y) \log_2 \frac{P(X = x, Y = y)}{P(X = x) \times P(Y = y)}.$$

Mutual information is a violation metric for independence: it can be shown that two categorical random variables have minimal mutual information 0 if and only if they are independent [23]. Also, Yan et al. show that if an FD X → Y holds exactly in a relation D, then the mutual information between X and Y is the maximum possible value [49, Prop.2]. In other words, a functional dependency represents a maximally strong dependence as measured by mutual information.

As we discussed above, it is important to make violation metrics sensitive to the sample size. In practical hypothesis testing, mutual information is therefore replaced by the G-test:

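$$G_{X \perp\!\!\!\perp Y}(D) = 2 \times N_D \times I_D(X; Y) \qquad (3.1)$$

Here I_D(X; Y) is the mutual information of the empirical distribution P_D, measured with the natural logarithm; if log₂ is used instead, the right-hand side gains a factor of ln 2. Figures 1.2 and 1.3 give G-test values in our running example.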
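The relation between mutual information and the G-test can be checked numerically. The sketch below computes the empirical mutual information in nats and scales it by 2 × N_D. The two nine-record datasets are hypothetical and are not the records of Figure 1.2 or Figure 1.3; the function names are likewise assumptions.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (natural logarithm) of a list of (x, y) records."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in joint.items()
    )

def g_statistic(pairs):
    """G = 2 * N_D * I_D(X; Y), with the mutual information measured in nats."""
    return 2.0 * len(pairs) * mutual_information(pairs)

models = ["X1", "X3", "X5"]
colors = ["white", "black", "red"]
independent = [(m, c) for m in models for c in colors]       # every combination once
dependent = list(zip(models, colors)) * 3                    # model determines color

print(g_statistic(independent), g_statistic(dependent))
```

In the dependent dataset the model determines the color, so the mutual information reaches its maximum ln 3 for three equally frequent values, in line with the functional-dependency remark above.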

3.3.3 p-values

A widely used approach is to transform a violation metric φ_C into a satisfaction metric known as a p-value. For a given relation D, the p-value measures the probability of a violation at least as great as the observed value:

$$p_{\phi_C}(D) = P(t \geq \phi_C(D) \mid C \text{ holds exactly}) \qquad (3.2)$$

where t is a random variable ranging over possible observed violation values.

In statistical test terminology, the condition that the constraint holds exactly is called the null hypothesis. The p-value is a satisfaction metric, because the probability of observing a value of at least 0 for the violation metric is 1, and if the constraint holds exactly, the probability of observing a maximal violation value is 0. Note that we do not use the p-value to make empirical claims about the world. Rather, we use it as a measure of the degree of violation.

The p-satisfaction metric is widely used for two reasons. 1) Values are normalized to the [0,1] range. 2) A minimum-satisfaction threshold α ≥ p_φ_C(D) provides an upper bound on the probability of a false positive, that is, the probability of rejecting a constraint when it is true (but only apparently violated due to data noise). Therefore p-values are easier for users to interpret than violation metrics. This facilitates eliciting a threshold from the user for producing binary labels by solving the partition problem (Def. 1).

Figure 3.1 illustrates the probability density function (3.2) for the test statistic for a hypothetical dataset. Suppose we measure a p-value of φ. If we choose α₁ as our threshold, we would reject our null hypothesis, and vice versa.

Figure 3.1: Probability Density Function for a hypothetical dataset. The x-axis represents the test statistic.

In statistical test terminology, the p-value threshold is the significance level, and observed violations with p-values below this level are statistically significant. Conventional significance levels are 5% and 1%. In terms of p-values, the partition problem (Def. 1) is reformulated to raise the p-value above the threshold, and the top-k problem (Def. 2) is reformulated to maximize the p-value.

Computational statistics provides efficient libraries to compute p-values for many common violation metrics. For a brief review in the context of error detection, please see [49, Sec.4.3]. Most p-value computations are based on a closed-form approximation of the density (3.2). For the G-test the approximation is as follows.

$$p_{X \perp\!\!\!\perp Y}(D) \approx \int_{G_{X \perp\!\!\!\perp Y}(D)}^{+\infty} \chi^2_k(t)\, dt$$

where k = (r_X − 1) × (r_Y − 1), r_X resp. r_Y is the number of possible assignments for X resp. Y, and χ²_k(t) is the density of the Chi-square distribution with k degrees of freedom.
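For integer degrees of freedom the tail integral above has a closed form, so the approximation can be sketched with the Python standard library alone. The names `chi2_sf` and `g_test_p_value` are assumptions made for this illustration; a statistics library would normally be used instead.

```python
import math

def chi2_sf(x, k):
    """Upper tail P(T >= x) of a chi-square variable with integer k >= 1 degrees of freedom."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if k % 2 == 0:
        # Even k: exp(-y) * sum_{i < k/2} y^i / i!
        term, total = 1.0, 1.0
        for i in range(1, k // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd k: erfc(sqrt(y)) + exp(-y) * sum_{j=1}^{(k-1)/2} y^(j - 1/2) / Gamma(j + 1/2)
    total = math.erfc(math.sqrt(y))
    for j in range(1, (k - 1) // 2 + 1):
        total += math.exp(-y) * y ** (j - 0.5) / math.gamma(j + 0.5)
    return total

def g_test_p_value(g, r_x, r_y):
    """p-value of a G statistic for attributes with r_x and r_y possible values."""
    return chi2_sf(g, (r_x - 1) * (r_y - 1))

print(g_test_p_value(3.841458820694124, 2, 2))
```

The example evaluates the tail at the conventional 5% critical value for one degree of freedom, which corresponds to two binary attributes.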


Chapter 4

Localization Tree

Decision trees are traditionally used for classification. One of our contributions is that we use decision trees to solve an optimization problem rather than to build a predictive model. In this chapter we formally describe error regions and demonstrate how we use decision tree learning to solve a partitioning problem.

4.1 Error Regions

In this section we formally define an error region and provide the methodology for detecting errors within error regions.

4.1.1 Optimal Error Partitions

The concept of an error partition generalizes the binary partition to more complex error regions.

Definition 3 (Error Partition for Approximate Constraints). Given a dataset D, a violation metric φ, and a user-defined threshold α for an acceptable violation, an error partition divides D into disjoint regions E₁, . . . , E_k. A region E_i is a violation or error region if φ(E_i) > α. A region E_i is a satisfaction region if φ(E_i) ≤ α.

Computationally, this allows us to formulate error detection as an optimization problem, as presented in [49].

To support error detection, an error partition should identify regions that are likely to contain dirty records, i.e. exhibit a high degree of constraint violation. Our partition violation metric is therefore the sum of violation degrees over error regions:


Definition 4 (Optimal Error Partition). Given a dataset D, a violation metric φ, and a user-defined acceptability threshold α, an optimal error partition solves the optimization problem

$$\operatorname*{argmax}_{\{E_i\}} \sum_{i : \phi(E_i) > \alpha} \phi(E_i). \qquad (4.1)$$
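To make the objective concrete, the sketch below evaluates the sum in (4.1) for one candidate partition. The toy violation metric `duplicate_fraction`, the regions, and the threshold are all hypothetical and only serve to show which regions contribute to the sum.

```python
def partition_violation(regions, phi, alpha):
    """Sum of phi(E_i) over the error regions, i.e. the regions with phi(E_i) > alpha."""
    scores = [phi(region) for region in regions]
    return sum(score for score in scores if score > alpha)

def duplicate_fraction(region):
    """Toy violation metric: share of entries that repeat an earlier value."""
    return 1 - len(set(region)) / len(region)

# Hypothetical partition of one column into three regions.
partition = [[1, 2, 3, 4], [5, 5, 6, 6], [7, 8, 8, 9]]
print(partition_violation(partition, duplicate_fraction, alpha=0.3))
```

Only the second region exceeds the threshold here, so it alone determines the objective value; the third region has some duplication but counts as a satisfaction region.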

For satisfaction metrics, we seek to minimize objective (4.1). Our error region construction method (Section 4.3) is a heuristic for finding error partitions that optimize the partition violation metric.

Discussion. Identifying errors by optimizing a violation metric is a common strategy for approximate constraints [1]. For example, previous work has addressed the binary partition problem: find a minimum-cardinality subset ∆D of records such that removing the subset reduces the violation below the threshold: φ(D − ∆D) ≤ α [49, 46]. Our objective (4.1) is more general in that it allows non-binary partitions. Another difference is that our definition does not consider the cardinality of the error region(s). This is not necessary because for typical violation metrics, adding clean records reduces the violation value. For example, if the conforming-row ratio is the violation metric for uniqueness constraints, adding clean data records reduces the ratio of violating row pairs to conforming row pairs. Therefore we expect that the objective (4.1) discourages error regions that are larger than necessary. Our experiments support this intuition.

4.1.2 From Error Regions to Error Detection

Two different types of error detection output are useful for different applications. First, error classification outputs a subset of likely dirty data, corresponding to a "clean/dirty" labelling. Second, an error ranking returns the top-k records most likely to be dirty. This is similar to outlier detection, where methods output either a set of potential outliers or an "outlierness" metric for data points. Each type of desired output leads to an optimization problem for approximate constraints.

We describe how error regions can be used to support both error rankings and error classifications.

Error Classification A binary labelling system L for a dataset can be extended region-wise to an error partition: for each region E_i, apply L(E_i) to obtain a set E_i′ of dirty records. Then return the union ∪_i E_i′ of the dirty records.

In the example of Figure 1.3, an error classification algorithm can identify a subset of the records from the year 2020 that are likely to be dirty.

Error Ranking An error ranking method T for a dataset can be extended to an error partition region-by-region: retrieve records from each region, in the order of degree of violation. If it is necessary to retrieve further records from a single region, we apply the top-k method T within that region. More formally, the steps are as follows.

1. Order the regions as E₁, . . . , E_m in descending order of violation, such that φ(E_i) ≥ φ(E_{i+1}).

2. Let R_i ≡ Σ_{l ≤ i} |E_l| be the number of records covered by the first i regions, and let j be the last index such that the regions up to j cover at most k records: j = max {i : R_i ≤ k}.

3. If R_j = k, return ∪_{i≤j} E_i. Otherwise let k′ := k − R_j, and apply a top-k′ algorithm to E_{j+1} to find a set of records E′_{j+1} (e.g. [49]). Return ∪_{i≤j} E_i ∪ E′_{j+1}.
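The three steps above can be sketched as a short Python function. The within-region method `top_k_in_region` is passed in as a callback, since the thesis delegates it to an existing top-k algorithm; the records, scores, and helper functions in the example are hypothetical.

```python
def rank_top_k(regions, phi, k, top_k_in_region):
    """Return k records: whole regions in descending phi order, then part of one region."""
    ordered = sorted(regions, key=phi, reverse=True)      # step 1
    selected = []
    for region in ordered:
        remaining = k - len(selected)
        if remaining == 0:
            break
        if len(region) <= remaining:                      # step 2: the region fits entirely
            selected.extend(region)
        else:                                             # step 3: take only the missing records
            selected.extend(top_k_in_region(region, remaining))
            break
    return selected

def mean_score(region):
    """Toy region-level violation: average of hypothetical per-record scores."""
    return sum(score for _, score in region) / len(region)

def highest_scores(region, k):
    """Toy within-region top-k method: the k records with the largest scores."""
    return sorted(region, key=lambda record: record[1], reverse=True)[:k]

regions = [
    [("c1", 0.1)],
    [("a1", 0.9), ("a2", 0.8)],
    [("b1", 0.7), ("b2", 0.2), ("b3", 0.6)],
]
top3 = [name for name, _ in rank_top_k(regions, mean_score, 3, highest_scores)]
print(top3)
```

With k = 3 the most violating region is returned whole and the within-region method is called once, on the second region, to supply the single missing record.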

A hybrid approach would be to use a labelling method L to find a dirty subset within each region, until k records have been identified. Refining each region can potentially return more complex top-k error sets, but increases the computational cost of the top-k search. In our top-k experiments below we use the region-by-region method.

4.2 Error Detection With Localization Trees

In this section we discuss how error regions can be represented using a Localization Tree.

4.2.1 Tree Representation of Partitions

An error partition can be represented by a Localization Tree (or tree for short), which is defined as follows.

• Every non-leaf node is labelled with a column name X.

• A node with discrete column label X has k children, one for each possible value x_1, . . . , x_k in the domain of X.

• A node with numeric column label X is assigned a threshold t, and has two children, one for X ≤ t, the second for X > t.

• Each leaf node ℓ contains the violation value φ(D_ℓ), the violation value for the subset of records assigned to the leaf node.

For each node v in the Localization Tree, the path from the root to v defines a selection condition with conjuncts X = x for a discrete column X with a legal value x, and X op t for continuous columns X, where t is a constant and op is one of {≤, >}. The records that satisfy the node's selection condition are the node records. Since each record is assigned to exactly one leaf in the Localization Tree, the leaves ℓ_1, . . . , ℓ_k define an error partition E₁, . . . , E_k.
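One possible in-memory representation of such a tree is sketched below, together with a traversal that turns each root-to-leaf path into a selection condition. The class `Node`, the column `Price`, and all violation values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    column: Optional[str] = None          # split column; None for a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)  # edge condition -> child
    violation: Optional[float] = None     # phi of the leaf's records, stored on leaves only

def leaf_regions(node: Node, path: Tuple[str, ...] = ()) -> List[Tuple[str, Optional[float]]]:
    """Return one (selection condition, violation value) pair per leaf."""
    if not node.children:
        return [(" AND ".join(path) or "TRUE", node.violation)]
    regions = []
    for condition, child in node.children.items():
        regions.extend(leaf_regions(child, path + (f"{node.column} {condition}",)))
    return regions

# Hypothetical tree: a discrete split on Year, then a numeric split on Price.
tree = Node(column="Year", children={
    "= 2018": Node(violation=0.2),
    "= 2019": Node(violation=0.1),
    "= 2020": Node(column="Price", children={
        "<= 30000": Node(violation=0.3),
        "> 30000": Node(violation=6.5),
    }),
})

for condition, violation in leaf_regions(tree):
    print(condition, violation)
```

Each printed condition is a conjunction of the tests on the path to a leaf, so the leaves partition the records into disjoint regions.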

To constrain the space of possible partitions, we adopt the following key assumption through the remainder of the thesis.


Definition 5. Tree Representation Assumption (TRA): Error regions can be represented by a Localization Tree.

The TRA guarantees the interpretability of the discovered error regions, and facilitates the search for them because it constrains them to be rectangles in the space of the observed features. The main limitation of the TRA arises when the recorded data is not powerful enough to capture the error source. For example, if all errors arise in the time period before 2000, and the data contain time stamps, a Localization Tree can learn a region defined by time < 2000. However, in the real world it is common that the data are missing key attributes, such as time stamps. In this case the TRA will prevent our system from finding the correct error regions.

In the future work section we discuss approaches for overcoming this limitation, trading off interpretability and scalability.

4.2.2 Example

Figure 4.1: Localization Tree produced for data set in Figure 1.3

Figure 4.1 shows a Localization Tree for the data set in Figure 1.3 built using the independence constraint that Model and Color should be independent. In the complete dataset, the independence constraint holds exactly: the p-value is the maximal value 1 for the root records; see Figure 1.2. The tree leaves define three regions defined by the conditions (1) Year = 2018, (2) Year = 2019, (3) Year = 2020. However, if we look closely at Figure 1.3, we can see that every white BMW X1 was produced in year 2020. Thus given that the year is 2020 and that the color is white, it is more likely that we have an instance of BMW X1, rather than X3 or X5. Thus Model and Color become conditionally dependent, given Year = 2020. This is exactly what is shown in Figure 4.2, which is the tree produced after we run our pruning algorithm on Figure 4.1.

Figure 4.2: Figure 4.1 after pruning

Algorithm 1: Tree Building Algorithm

Input: Dataset D; a set of attributes X; a violation metric φ; maximum tree depth max_tree_depth; minimum number of records in leaf nodes min_records
Output: A Localization Tree

Function Build_Tree(X, D, φ, max_tree_depth, min_records):
    // Tree_Depth is the depth of the node that is about to be created
    if Tree_Depth > max_tree_depth then
        return Null    // depth bound exceeded: do not grow this branch
    Tree ← a new tree node containing dataset D
    // Run Algorithm 2 to find the partitions D_l and D_r
    D_l, D_r ← Split(X, D, φ)
    if size(D_l) < min_records or size(D_r) < min_records then
        return Tree    // split too small: keep this node as a leaf
    else
        left_sub_tree ← Build_Tree(X, D_l, φ, max_tree_depth, min_records)
        right_sub_tree ← Build_Tree(X, D_r, φ, max_tree_depth, min_records)
        Tree.left_child ← left_sub_tree
        Tree.right_child ← right_sub_tree
    return Tree

4.3 Constructing Localization Trees

We present a recursive partitioning algorithm that constructs a Localization Tree for a dataset. The algorithm is inspired by machine learning methods for classification and regression trees [35]. However, our goal is not to build a predictive model, but to identify likely error regions. The tree construction has two phases, growth and pruning. Algorithm 2 shows the pseudo-code for choosing the next feature to split on. Algorithm 3 shows the pseudo-code for the pruning procedure.

Growth For each leaf node, we select the split on a discrete feature resp. continuous feature/threshold combination that maximizes the violation sum of the resulting children. Growth continues until a specified resource bound is reached (e.g., maximum tree depth or minimum number of records assigned to each leaf [35]).

Algorithm 2: Tree Splitting Algorithm

Input: A set of attributes X; a dataset D; a violation metric φ
Output: An optimal attribute and the resulting partitions

Function Split(X, D, φ):
    current_best_value := 0
    for X ∈ X do
        // generate a set of candidate classes if X is discrete and candidate thresholds if X is continuous
        if X is Discrete then
            T_X := {r_X : r ∈ D}    // the values of X that occur in D
        else if X is Continuous then
            T_X := generate_threshold(X, D)
        for t ∈ T_X do
            if X is Discrete then
                D_l := {r ∈ D : r_X = t}
                D_r := {r ∈ D : r_X ≠ t}
            else if X is Continuous then
                D_l := {r ∈ D : r_X ≤ t}
                D_r := {r ∈ D : r_X > t}
            split_value := φ(D_l) + φ(D_r)
            if split_value > current_best_value then
                current_best_value := split_value
                best_partitions := {D_l, D_r}
                best_feature := X
    return best_partitions and best_feature
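The growth phase of Algorithms 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the thesis implementation: records are dictionaries, the violation metric is the G statistic between two fixed columns, candidate thresholds are simply the observed values (standing in for `generate_threshold`), splits with an empty side are skipped, and the twelve records are hypothetical.

```python
import math
from collections import Counter

def g_violation(records, x="Model", y="Color"):
    """G statistic between columns x and y; 0.0 for an empty subset."""
    n = len(records)
    joint = Counter((r[x], r[y]) for r in records)
    cx = Counter(r[x] for r in records)
    cy = Counter(r[y] for r in records)
    return 2.0 * sum(o * math.log(o * n / (cx[a] * cy[b])) for (a, b), o in joint.items())

def split(attributes, data, phi, discrete):
    """Return (attribute, value, D_l, D_r) with the largest phi(D_l) + phi(D_r), or None."""
    best_value, best = 0.0, None
    for attr in attributes:
        for t in sorted({r[attr] for r in data}):     # candidate classes or thresholds
            if attr in discrete:
                left = [r for r in data if r[attr] == t]
                right = [r for r in data if r[attr] != t]
            else:
                left = [r for r in data if r[attr] <= t]
                right = [r for r in data if r[attr] > t]
            if not left or not right:
                continue                              # not a real split
            value = phi(left) + phi(right)
            if value > best_value:
                best_value, best = value, (attr, t, left, right)
    return best

def build_tree(attributes, data, phi, discrete, max_depth, min_records, depth=0):
    """Grow the tree recursively; nodes are plain dictionaries."""
    node = {"violation": phi(data), "size": len(data), "split": None, "left": None, "right": None}
    if depth >= max_depth:
        return node
    best = split(attributes, data, phi, discrete)
    if best is None:
        return node
    attr, t, left, right = best
    if len(left) < min_records or len(right) < min_records:
        return node
    node["split"] = (attr, t)
    node["left"] = build_tree(attributes, left, phi, discrete, max_depth, min_records, depth + 1)
    node["right"] = build_tree(attributes, right, phi, discrete, max_depth, min_records, depth + 1)
    return node

def make(year, model, color, copies=1):
    return [{"Year": year, "Model": model, "Color": color}] * copies

# Hypothetical data: Model and Color are independent in 2018 and 2019, dependent in 2020.
data = []
for year in (2018, 2019):
    for model in ("X1", "X3"):
        for color in ("white", "black"):
            data += make(year, model, color)
data += make(2020, "X1", "white", 2) + make(2020, "X3", "black", 2)

tree = build_tree(["Year"], data, g_violation, {"Year"}, max_depth=2, min_records=2)
print(tree["split"], tree["left"]["violation"], tree["right"]["violation"])
```

On this data the first split isolates Year = 2020, because that subset carries all of the dependence between Model and Color, while the remaining records satisfy the independence constraint exactly.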

Pruning We iteratively remove leaves whose degree of violation is i) below that of an ancestor node or ii) below the user-specified AC threshold.

Growing the tree even for "clean" nodes whose violation metric is below the threshold helps the system to find longer conjunctive conditions. Pruning clean nodes highlights the error regions to the user and produces a smaller, more comprehensible final tree.

Example In Figure 4.1, the Localization Tree finds the splits that minimize the sum of p-values for the G-test. For the first split, the tree finds that splitting on Year = 2020 produces the smallest p-value sum (0.0088 + 0.011). The tree then splits on the two remaining years. Next, in pruning, the Localization Tree prunes away nodes whose degree of violation is below the user-defined threshold. Since the p-value is a satisfaction metric, these are the nodes whose p-value is above the threshold (0.01 in this case). Therefore all nodes are pruned away except for the right-most node, which contains the data points from the year 2020.

Algorithm 3: Tree Pruning Algorithm

Input: Tree T and significance level α
Output: Pruned tree

Function Prune(T, α):
    root ← T.root
    if T is Null or (root is a leaf and (φ(D_root) > α or φ(D_root) < φ(D_root.parent))) then
        return Null
    else
        // Recurse down the tree
        T′.root ← T.root
        T′.left_subtree ← Prune(T.left_subtree, α)
        T′.right_subtree ← Prune(T.right_subtree, α)
        if (T′.root is a leaf and φ(D_T′.root) > α) or φ(D_T′.root) < φ(D_T′.root.parent) then
            return Null
        else
            return T′
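The pruning rule stated in the text (remove leaves whose degree of violation is below that of the parent or below the threshold, and repeat) can be sketched as a post-order traversal. The sketch assumes a violation metric where larger values mean stronger violation, so the comparisons point the other way than for a p-value; the dictionary layout, the threshold, and all violation values are hypothetical.

```python
def prune(node, alpha, parent_violation=None):
    """Return a pruned copy of the subtree, or None if the node itself is removed."""
    if node is None:
        return None
    pruned = dict(node)   # shallow copy so the input tree is left untouched
    pruned["left"] = prune(node.get("left"), alpha, node["violation"])
    pruned["right"] = prune(node.get("right"), alpha, node["violation"])
    if pruned["left"] is None and pruned["right"] is None:
        # The node is a leaf, possibly because both of its children were just removed.
        below_threshold = node["violation"] < alpha
        below_parent = parent_violation is not None and node["violation"] < parent_violation
        if below_threshold or below_parent:
            return None
    return pruned

def leaf(violation):
    return {"violation": violation, "left": None, "right": None}

# Hypothetical tree: the left child is a strong error region, the right child is clean.
tree = {"violation": 1.36, "left": leaf(5.55), "right": leaf(0.0)}
pruned = prune(tree, alpha=3.84)
print(pruned)
```

Because children are pruned before their parent is examined, a node whose children are all removed becomes a leaf and is then tested itself, which gives the iterative behaviour described above.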

Conditional Constraints Some constraints can be specified to hold only under certain conditions, such as conditional (in)dependencies [49], and conditional functional dependencies [2]. Tree construction can be made conditional as well by starting the tree with splits on the conditioning attribute. For example, consider a conditional independence constraint X ⊥⊥ Y |Z. We assign Z as the tree root, then apply the tree construction algorithms to the unconditional constraint X ⊥⊥ Y in subtrees.


Chapter 5

Experiments

In this chapter, we describe our experimental evaluations. We evaluate TreeDetect using datasets containing both synthetic and real-world errors. We compare the performance of TreeDetect against several state-of-the-art data cleaning algorithms.

5.1 Experimental Setup

5.1.1 Datasets

Table 5.1 summarizes the number of rows, attributes, and the types of errors present within each dataset. For synthetic errors, we investigate sorting errors and imputation errors. Sorting errors occur when one or more columns have been sorted in ascending or descending order, while imputation errors occur when a group of values in the original dataset are replaced with misleading values. These two types of errors are frequent in practice [40] and have been investigated in previous error detection studies.

Table 5.2 summarizes the attributes, statistical constraints, and denial constraints that we use for each dataset. Note that for ISCs, we cannot find the DC representations and thus DCs were excluded for experiments involving ISCs.

Name      Rows   Attributes   Error Types
Housing   506    13           Sorting, Imputation
Car       1728   7            Sorting
Hockey    2217   27           Imputation
Sensor    793    55           Outlier

Table 5.1: Dataset Information

Attributes                                 SCs                 Dataset   Denial Constraints
Temperatures (T) of Neighboring Sensors    T_a ⊥̸⊥ T_b          Sensor    ∀r₁, r₂ ∈ D : ¬(r₁[T_a] > r₂[T_b] ∧ r₁[T_b] ≤ r₂[T_b])
Tax rate, SEC, Crime (C)                   TX ⊥̸⊥ SEC | C       Housing   ∀r₁, r₂ ∈ D : ¬(r₁[C] = r₂[C] ∧ r₁[T] > r₂[T] ∧ r₁[SEC] ≤ r₂[SEC])
N_oxide, SEC, Tax rate (T)                 N ⊥⊥ SEC | TX       Housing   ×
Games (G), Goal Plus-Minus (GPM)           G ⊥⊥ GPM            Hockey    ×
Safety (SA), Doors (DR)                    SA ⊥⊥ DR            Car       ×

Table 5.2: Constraints used by TreeDetect and other approaches

Hockey. The Hockey dataset collected the records of players with a potential to be drafted from a junior hockey league into the National Hockey League (NHL), known as prospects [43]. For each prospect, the dataset lists 26 attributes that summarize their performance in a junior league, and how many games they played in the NHL (Games Played GP). GP = 0 for prospects who never appeared in the NHL.

Statistical Constraint. In the dataset, attributes Games Played and Goal Plus-Minus(GPM) should be independent [27].

Sensor. The Sensor dataset collected the sensor reports from the Berkeley/Intel Lab. The dataset has more than 2 million records, containing the humidity and temperature reports from 54 different sensors. To compress it, we replaced sensor readings by their hourly average collected in [28].

Statistical Constraint. Nearby sensors should report similar readings. This leads to a constraint T_a ⊥̸⊥ T_b for neighboring sensors T_a and T_b.

Data Errors. The Sensor dataset contains outlier errors.

Car. The Car Evaluation dataset is from the UCI Machine Learning repository. This dataset contains seven attributes. We used 4 attributes: Buying price (BP), Car Class (CL), Doors (DR), and Safety level (SA).

Statistical Constraint. For this dataset, we are given the constraint that Safety is independent of the number of doors, SA ⊥⊥ Doors [49].

Data Errors. We inject sorting errors into this dataset: when the class of a car is "unclassified", we sort both SA and Doors in ascending order, making them conditionally dependent.

Housing. The Boston dataset was taken from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. This dataset was first used to study the relationship between clean air quality and households' willingness to pay [15]. There are 506 instances, and each instance has 14 attributes. We used 6 attributes: Distance to CBD area-Distance (D), Nitric Oxides Concentration-Noxide (N), Crime Rate-Crime (C), Socioeconomic Status of population (SEC), Rooms (R) and Tax Rate (T).

Statistical Constraint. We are given two SCs: Tax Rate ⊥̸⊥ SEC | Crimes and N_oxide ⊥⊥ SEC [49].

Data Errors. We explore both sorting and imputation errors for this dataset. The original dataset violates the constraint Tax Rate ⊥̸⊥ SEC | Crimes for Crimes > 3.53. In addition, to simulate imputation errors, for all records with Crimes between 0.5 and 2.5, we replace SEC with the average of the entire column. This results in the new error region 0.5 ≤ Crimes ≤ 2.5. The two error regions test the ability of TreeDetect to detect disconnected regions of errors. For sorting errors, we sort N_Oxides and SEC in ascending order for records where Tax Rate is greater than 600, thus resulting in an error region Tax Rate > 600.
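The two injection procedures can be sketched as follows. The function names, the column keys, and the four records are hypothetical (they are not rows of the Boston dataset), and the sketch assumes that each column is sorted independently within the selected rows.

```python
def inject_imputation(records, cond_col, low, high, target_col):
    """Replace target_col by the column mean wherever low <= cond_col <= high."""
    mean = sum(r[target_col] for r in records) / len(records)
    return [
        {**r, target_col: mean} if low <= r[cond_col] <= high else dict(r)
        for r in records
    ]

def inject_sorting(records, condition, columns):
    """Within the rows selected by condition, sort each listed column in ascending order."""
    selected = [i for i, r in enumerate(records) if condition(r)]
    out = [dict(r) for r in records]
    for col in columns:
        ordered = sorted(records[j][col] for j in selected)
        for i, value in zip(selected, ordered):
            out[i][col] = value
    return out

# Hypothetical records with the attribute abbreviations used in the text.
rows = [
    {"Crime": 0.1, "SEC": 10.0, "Tax": 300, "Noxide": 0.5},
    {"Crime": 1.0, "SEC": 30.0, "Tax": 650, "Noxide": 0.7},
    {"Crime": 2.0, "SEC": 20.0, "Tax": 700, "Noxide": 0.6},
    {"Crime": 4.0, "SEC": 40.0, "Tax": 660, "Noxide": 0.4},
]

imputed = inject_imputation(rows, "Crime", 0.5, 2.5, "SEC")
sorted_rows = inject_sorting(rows, lambda r: r["Tax"] > 600, ["Noxide", "SEC"])
print([r["SEC"] for r in imputed])
print([(r["Noxide"], r["SEC"]) for r in sorted_rows])
```

Both functions return new lists and leave the input untouched, so the clean and the dirty version of a dataset can be kept side by side as ground truth for an evaluation.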

5.2 Qualitative Evaluation

Compared with existing methods that can leverage ACs, our decision tree approach not only outperforms the others in detecting conditionally (in)dependent constraint violations, but offers excellent interpretability. Our method splits the dataset into a collection of subsets that describe the error intuitively, offering users more insights as to why the errors might have occurred. We illustrate these advantages in case studies for different constraint types.

Our case studies also show how error localization can be combined with constraint discovery.

We carry out our case studies in the Hockey dataset.

5.2.1 Uniqueness Constraints

A uniqueness constraint for a column X asserts that no two rows share the same value in column X. We denote approximate uniqueness constraints by AUC. A common satisfaction metric for an AUC is the uniqueness ratio (Number of Unique Values) / (Total Number of Values) [46, 17]. Previous work has combined AC discovery with AC application in a discover-and-detect approach [11, 17]:

1. Consider each column X in a relation. If the uniqueness ratio for X is above a threshold α, consider the uniqueness constraint valid for X.

2. Apply each valid uniqueness constraint to detect and repair violations.

As noted by Wang and Yeye [46], discover-and-detect tends to cause false positives, because in real-world clean datasets, many integrity constraints hold only approximately. A reasonable intuition is that player names should be unique. Suppose that we want to apply the discover-and-detect approach with a threshold of 99% for accepting the constraint.

A data scientist can run Localization Tree with the uniqueness ratio as the satisfaction metric. She obtains the tree in Figure 5.1. The tree shows that name duplication exists in three countries: Slovakia, Finland, and USA. The uniqueness ratios of Slovakia and Finland are both below the acceptance threshold, so discover-and-detect does not apply the AUC to these countries. Further investigations show that in Slovakia and Finland, the same player (and hence the same name) appeared in two different draft years, so in these countries the duplication is not a genuine data error. Combining discover-and-detect with a Localization Tree avoids this false positive.

In contrast, for some of the U.S. players there really are two players with the same name (Brian Lee and Nick Larson). Arguably this is not a genuine data error either, but for a different reason than the year duplication in Slovakia and Finland. Although in this case the tree does not avoid false positives, it illustrates how constraint violations in different error regions occur for different reasons.
