We present a recursive partitioning algorithm that constructs a Localization Tree for a dataset. The algorithm is inspired by machine learning methods for classification and regression trees [35]. However, our goal is not to build a predictive model but to identify likely error regions. The tree construction has two phases, growth and pruning. Algorithm 2 shows the pseudo-code for choosing the next feature to split on. Algorithm 3 shows the pseudo-code for the pruning procedure.
Growth. For each leaf node, we select the split, on either a discrete feature or a continuous feature/threshold combination, that maximizes the violation sum of the resulting children. Growth continues until a specified resource bound is reached (e.g., maximum tree depth or minimum number of records assigned to each leaf [35]).
Algorithm 2: Tree Splitting Algorithm
Input: A set of attributes 𝒳; a dataset D; a violation metric φ
Output: An optimal attribute and the resulting partitions
current_best_value := 0
Function Split(𝒳, D, φ):
    for X ∈ 𝒳 do
        // generate a set of candidate classes if X is discrete and candidate thresholds if X is continuous
        if X is Discrete then
            T_X := {x_k : ∃ r ∈ D, r_X = x_k}
        else if X is Continuous then
            T_X := generatethreshold(X, D)
        for t ∈ T_X do
            if X is Discrete then
                D_l := {r ∈ D : r_X = t}
                D_r := {r ∈ D : r_X ≠ t}
            else if X is Continuous then
                D_l := {r ∈ D : r_X ≤ t}
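The split selection described in the Growth step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for TreeDetect: records are plain dictionaries, `phi` is any function that returns a violation score where larger means a stronger violation, a split is scored by the sum of `phi` over the two children, and `candidate_thresholds` is a hypothetical stand-in for `generatethreshold` that uses midpoints between sorted distinct values. The toy metric `flagged_fraction` and the `flag` field are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]


def candidate_thresholds(values: List[float]) -> List[float]:
    # Assumed stand-in for generatethreshold: midpoints of sorted distinct values.
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_split(
    attributes: Dict[str, str],            # attribute name -> "discrete" or "continuous"
    data: List[Record],
    phi: Callable[[List[Record]], float],  # violation score; larger means more violation
) -> Optional[Tuple[str, object, List[Record], List[Record]]]:
    best = None
    best_value = float("-inf")
    for name, kind in attributes.items():
        column = [r[name] for r in data]
        if kind == "discrete":
            candidates = sorted(set(column), key=str)
        else:
            candidates = candidate_thresholds(column)
        for t in candidates:
            if kind == "discrete":
                left = [r for r in data if r[name] == t]
                right = [r for r in data if r[name] != t]
            else:
                left = [r for r in data if r[name] <= t]
                right = [r for r in data if r[name] > t]
            if not left or not right:
                continue  # skip degenerate splits with an empty child
            value = phi(left) + phi(right)  # violation sum of the two children
            if value > best_value:
                best_value, best = value, (name, t, left, right)
    return best


# Toy violation metric (hypothetical): fraction of records flagged as erroneous.
def flagged_fraction(rows: List[Record]) -> float:
    return sum(r["flag"] for r in rows) / len(rows)


records = [
    {"year": 2018, "flag": 0}, {"year": 2018, "flag": 0},
    {"year": 2019, "flag": 0}, {"year": 2019, "flag": 0},
    {"year": 2020, "flag": 1}, {"year": 2020, "flag": 1},
]
split = best_split({"year": "discrete"}, records, flagged_fraction)
```

On this toy data the chosen split isolates the records with `year == 2020`. If the violation metric is a p-value, where smaller values indicate stronger violations, one option is to pass a score such as 1 − p so that maximizing the sum remains appropriate.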
Pruning. We iteratively remove leaves whose degree of violation is (i) below that of an ancestor node or (ii) below the user-specified AC threshold.
Growing the tree even for "clean" nodes whose violation metric is below the threshold helps the system find longer conjunctive conditions. Pruning clean nodes highlights the error regions to the user and produces a smaller, more comprehensible final tree.
Example. In Figure 4.1, the Localization Tree finds the splits that minimize the sum of p-values for the G-test. For the first split, the tree finds that splitting on Year = 2020 produces the smallest p-value sum (0.0088 + 0.011). The tree then splits on the two remaining years. Next, in pruning, the Localization Tree prunes away nodes whose degree of violation is below a user-defined threshold; for p-values, this means a p-value above the threshold (0.01 in this case). Therefore all nodes are pruned away except for the right-most node, which contains the data points from the year 2020.
Algorithm 3: Tree Pruning Algorithm
Input: Tree T and significance level α
Output: Pruned tree
Function Prune(T, α):
    root ← T.root
    if T is Null or (root is a leaf and (φ(D_root) > α or φ(D_root) < φ(D_root.parent))) then
        return NULL
    else
        // Recurse down the tree
        T′.root ← T.root
        T′.left_subtree ← Prune(T.left_subtree, α)
        T′.right_subtree ← Prune(T.right_subtree, α)
        if (T′.root is a leaf and φ(D_T′.root) > α) or φ(D_T′.root) < φ(D_T′.root.parent) then
            return NULL
        else
            return T′
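The bottom-up pruning idea can be illustrated with a short Python sketch. It follows the prose description of pruning rather than reproducing Algorithm 3 line by line, and it rests on assumptions: each node stores a precomputed `violation` score where larger means a stronger violation (with p-values the comparisons would be reversed), and the `Node` class, the labels, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str
    violation: float                  # degree of violation; larger means stronger
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def prune(node: Optional[Node], threshold: float,
          parent_violation: Optional[float] = None) -> Optional[Node]:
    """Return a pruned copy of the subtree, or None if it is removed entirely."""
    if node is None:
        return None
    # Prune the children first, so a node whose children all vanish becomes a leaf.
    kept = Node(
        node.label,
        node.violation,
        prune(node.left, threshold, node.violation),
        prune(node.right, threshold, node.violation),
    )
    if kept.is_leaf():
        below_threshold = kept.violation < threshold
        below_parent = (parent_violation is not None
                        and kept.violation < parent_violation)
        if below_threshold or below_parent:
            return None
    return kept


# Hypothetical tree: only the "Year = 2020" branch violates the constraint strongly.
tree = Node(
    "all records", 0.5,
    left=Node("Year != 2020", 0.2,
              left=Node("Year = 2018", 0.10),
              right=Node("Year = 2019", 0.15)),
    right=Node("Year = 2020", 0.9),
)
pruned = prune(tree, threshold=0.8)
```

In this toy tree the two clean leaves are removed first, which turns their parent into a leaf that is then removed as well, so only the path to the `Year = 2020` node survives.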
Conditional Constraints. Some constraints can be specified to hold only under certain conditions, such as conditional (in)dependencies [49] and conditional functional dependencies [2]. Tree construction can be made conditional as well by starting the tree with splits on the conditioning attribute. For example, consider a conditional independence constraint X ⊥⊥ Y | Z. We assign Z as the tree root, then apply the tree construction algorithms to the unconditional constraint X ⊥⊥ Y in the subtrees.
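As a rough illustration of the conditional case, the sketch below first partitions the records on a discrete conditioning attribute Z and then evaluates the unconditional constraint X ⊥⊥ Y within each part. The assumptions are that Z is discrete, that the G statistic (the statistic behind the G-test) serves as the per-partition measure, and that the attribute names and toy records are invented for the example. A full implementation would continue growing a subtree inside each partition.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

Record = Dict[str, object]


def g_statistic(rows: List[Record], x: str, y: str) -> float:
    """G statistic for independence of x and y: 2 * sum(O * ln(O / E))."""
    n = len(rows)
    joint = Counter((r[x], r[y]) for r in rows)
    x_counts = Counter(r[x] for r in rows)
    y_counts = Counter(r[y] for r in rows)
    g = 0.0
    for (xv, yv), observed in joint.items():
        expected = x_counts[xv] * y_counts[yv] / n
        g += observed * math.log(observed / expected)
    return 2.0 * g


def split_on_condition(data: List[Record], z: str) -> Dict[object, List[Record]]:
    """Root-level split: one partition per value of the conditioning attribute."""
    groups = defaultdict(list)
    for r in data:
        groups[r[z]].append(r)
    return dict(groups)


rows = [
    {"Z": "a", "X": 0, "Y": 0}, {"Z": "a", "X": 0, "Y": 1},
    {"Z": "a", "X": 1, "Y": 0}, {"Z": "a", "X": 1, "Y": 1},
    {"Z": "b", "X": 0, "Y": 0}, {"Z": "b", "X": 0, "Y": 0},
    {"Z": "b", "X": 1, "Y": 1}, {"Z": "b", "X": 1, "Y": 1},
]
scores = {z: g_statistic(part, "X", "Y")
          for z, part in split_on_condition(rows, "Z").items()}
```

Here X and Y are independent within Z = "a" (G = 0) and perfectly dependent within Z = "b" (G = 8 ln 2), so only the second partition would be reported as a candidate error region for the constraint X ⊥⊥ Y | Z.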
Chapter 5
Experiments
In this chapter, we describe our experimental evaluation. We evaluate TreeDetect using datasets containing both synthetic and real-world errors. We compare the performance of TreeDetect against several state-of-the-art data cleaning algorithms.
5.1 Experimental Setup
5.1.1 Datasets
Table 5.1 summarizes the number of rows, attributes, and the types of errors present within each dataset. For synthetic errors, we investigate sorting errors and imputation errors.
Sorting errors occur when one or more columns have been sorted in ascending or descending order, while imputation errors occur when a group of values in the original dataset is replaced with misleading values. These two types of errors are frequent in practice [40] and have been investigated in previous error detection studies.
Table 5.2 summarizes the attributes, statistical constraints, and denial constraints that we use for each dataset. Note that for ISCs, we could not find equivalent DC representations, so DCs were excluded from the experiments involving ISCs.
Name      Rows   Attributes   Error Types
Housing    506   13           Sorting, Imputation
Car       1728   7            Sorting
Hockey    2217   27           Imputation
Sensor     793   55           Outlier

Table 5.1: Dataset Information

Attributes                                SCs               Dataset   Denial Constraints
Temperatures (T) of Neighboring Sensors   Ta ⊥̸⊥ Tb          Sensor    ∀r1, r2 ∈ D : ¬(r1[Ta] > r2[Tb] ∧ r1[Tb] ≤ r2[Tb])
Tax rate, SEC, Crime (C)                  TX ⊥̸⊥ SEC | C     Housing   ∀r1, r2 ∈ D : ¬(r1[C] = r2[C] ∧ r1[T] > r2[T] ∧ r1[SEC] ≤ r2[SEC])
N_oxide, SEC, Tax rate (T)                N ⊥⊥ SEC | TX     Housing   ×
Games (G), Goal Plus-Minus (GPM)          G ⊥⊥ GPM          Hockey    ×
Safety (SA), Doors (DR)                   SA ⊥⊥ DR          Car       ×

Table 5.2: Constraints used by TreeDetect and other approaches

Hockey. The Hockey dataset contains the records of players with the potential to be drafted from a junior hockey league into the National Hockey League (NHL), known as prospects [43]. For each prospect, the dataset lists 26 attributes that summarize their performance in a junior league, and how many games they played in the NHL (Games Played, GP). GP = 0 for prospects who never appeared in the NHL.
Statistical Constraint. In this dataset, the attributes Games Played and Goal Plus-Minus (GPM) should be independent [27].
Sensor. The Sensor dataset contains the sensor reports from the Berkeley/Intel Lab. The dataset has more than 2 million records, containing the humidity and temperature reports from 54 different sensors. To compress it, we replaced the sensor readings with their hourly averages collected in [28].
Statistical Constraint. Nearby sensors should report similar readings. This leads to the constraint Ta ⊥̸⊥ Tb for neighboring sensors Ta and Tb.
Data Errors. The Sensor dataset contains outlier errors.
Car. The Car Evaluation dataset is from the UCI Machine Learning Repository. This dataset contains seven attributes. We used four attributes: Buying price (BP), Car Class (CL), Doors (DR), and Safety level (SA).
Statistical Constraint. For this dataset, we are given the constraint that Safety is independent of the number of doors, SA ⊥⊥ Doors [49].
Data Errors. We inject sorting errors into this dataset: when the class of a car is "unclassified", we sort both SA and Doors in ascending order, making them conditionally dependent.
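A sorting error of this kind could be injected roughly as in the Python sketch below. It is an illustration under assumptions, not the script used in the experiments: records are dictionaries, the attribute values are hypothetical integer codes, and `inject_sorting_error` is an invented helper that sorts each listed column independently within the rows selected by a condition.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def inject_sorting_error(
    data: List[Record],
    columns: List[str],
    condition: Callable[[Record], bool],
) -> List[Record]:
    """Sort each column in ascending order within the rows matching the condition."""
    corrupted = [dict(r) for r in data]  # copy so the clean data stays untouched
    selected = [i for i, r in enumerate(corrupted) if condition(r)]
    for col in columns:
        ordered = sorted(corrupted[i][col] for i in selected)
        for i, value in zip(selected, ordered):
            corrupted[i][col] = value
    return corrupted


# Hypothetical integer codes for Safety (SA) and Doors (DR).
cars = [
    {"CL": "unclassified", "SA": 3, "DR": 2},
    {"CL": "good",         "SA": 1, "DR": 5},
    {"CL": "unclassified", "SA": 1, "DR": 4},
    {"CL": "unclassified", "SA": 2, "DR": 3},
]
dirty = inject_sorting_error(cars, ["SA", "DR"],
                             lambda r: r["CL"] == "unclassified")
```

After the injection, the selected rows have SA and DR values that increase together, which creates a dependence between the two attributes inside the region CL = "unclassified" while leaving the other rows unchanged.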
Housing. The Boston dataset was taken from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. This dataset was first used to study the relationship between clean air and households' willingness to pay [15]. There are 506 instances, and each instance has 14 attributes. We used 6 attributes: Distance to CBD area (Distance, D), Nitric Oxides Concentration (N_oxide, N), Crime Rate (Crime, C), Socioeconomic Status of population (SEC), Rooms (R), and Tax Rate (T).
Statistical Constraint. We are given two SCs: Tax Rate ⊥̸⊥ SEC | Crimes and N_oxide ⊥⊥ SEC [49].
Data Errors. We explore both sorting and imputation errors for this dataset. The original dataset violates the constraint Tax Rate ⊥̸⊥ SEC | Crimes for Crimes > 3.53. In addition, to simulate imputation errors, for all records with Crimes between 0.5 and 2.5, we replace SEC with the average of the entire column. This results in the new error region 0.5 ≤ Crimes ≤ 2.5. The two error regions test the ability of TreeDetect to detect disconnected regions of errors. For sorting errors, we sort N_oxide and SEC in ascending order for records where Tax Rate is greater than 600, thus resulting in the error region Tax Rate > 600.
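The imputation error described above can be sketched in a few lines of Python. This is a hedged illustration with invented toy values, assuming records are dictionaries and that the replacement value is the mean of the target column over the whole dataset; `inject_mean_imputation` is a hypothetical helper name.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


def inject_mean_imputation(
    data: List[Record],
    target: str,
    condition: Callable[[Record], bool],
) -> List[Record]:
    """Replace the target value with the column mean in rows matching the condition."""
    column_mean = sum(r[target] for r in data) / len(data)
    return [
        {**r, target: column_mean} if condition(r) else dict(r)
        for r in data
    ]


# Toy records (values are made up for illustration).
housing = [
    {"Crimes": 0.1, "SEC": 10.0},
    {"Crimes": 1.0, "SEC": 20.0},
    {"Crimes": 2.0, "SEC": 30.0},
    {"Crimes": 4.0, "SEC": 40.0},
]
imputed = inject_mean_imputation(
    housing, "SEC", lambda r: 0.5 <= r["Crimes"] <= 2.5
)
```

In the toy data the column mean of SEC is 25.0, so the two records with Crimes in [0.5, 2.5] receive that value, which flattens the relationship between SEC and the other attributes inside the error region.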