Constructing Localization Trees - Localizing Violations of Approximate Constraints for Data Error Detection

We present a recursive partitioning algorithm that constructs a Localization Tree for a dataset. The algorithm is inspired by machine learning methods for classification and regression trees [35]. However, our goal is not to build a predictive model but to identify likely error regions. The tree construction has two phases, growth and pruning. Algorithm 2 shows the pseudo-code for choosing the next feature to split on, and Algorithm 3 shows the pseudo-code for the pruning procedure.

Growth. For each leaf node, we select the split, on either a discrete feature or a continuous feature/threshold combination, that maximizes the violation sum of the resulting children. Growth continues until a specified resource bound is reached (e.g., maximum tree depth or minimum number of records assigned to each leaf [35]).

Algorithm 2: Tree Splitting Algorithm

Input: A set of attributes X; a dataset D; a violation metric φ
Output: An optimal attribute and the resulting partitions

current_best_value := 0
Function Split(X, D, φ):
    for X ∈ X do
        // generate a set of candidate classes if X is discrete and candidate thresholds if X is continuous
        if X is Discrete then
            T_X := {r ∈ D : r_X = x_k}
        else if X is Continuous then
            T_X := generate_threshold(X, D)
        for t ∈ T_X do
            if X is Discrete then
                D_l := {r ∈ D : r_X == t}
                D_r := {r ∈ D : r_X ≠ t}
            else if X is Continuous then
                D_l := {r ∈ D : r_X ≤ t}
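The split search described in the Growth paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `best_split`, `candidate_thresholds`, and `mean_error` are hypothetical, thresholds are assumed to be midpoints between consecutive distinct values, and the toy metric (the mean of a made-up `err` flag) merely stands in for a real violation metric φ. The score of a split is the sum of the children's violation values, as stated above.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

Record = Dict[str, float]
Metric = Callable[[List[Record]], float]
Split = Tuple[float, str, float, List[Record], List[Record]]


def candidate_thresholds(values: Sequence[float]) -> List[float]:
    """Midpoints between consecutive distinct values (an assumed choice)."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def best_split(attributes: Sequence[str], data: List[Record], phi: Metric,
               discrete: Set[str]) -> Optional[Split]:
    """Return (score, attribute, test value, left, right) with the largest violation sum."""
    best: Optional[Split] = None
    for attr in attributes:
        column = [r[attr] for r in data]
        is_discrete = attr in discrete
        candidates = sorted(set(column)) if is_discrete else candidate_thresholds(column)
        for t in candidates:
            if is_discrete:
                left = [r for r in data if r[attr] == t]
                right = [r for r in data if r[attr] != t]
            else:
                left = [r for r in data if r[attr] <= t]
                right = [r for r in data if r[attr] > t]
            if not left or not right:
                continue  # skip degenerate splits with an empty child
            score = phi(left) + phi(right)  # violation sum of the two children
            if best is None or score > best[0]:
                best = (score, attr, t, left, right)
    return best


def mean_error(rows: List[Record]) -> float:
    """Toy violation metric: fraction of records flagged as erroneous (hypothetical)."""
    return sum(r["err"] for r in rows) / len(rows)


data = [
    {"year": 2018, "err": 0}, {"year": 2018, "err": 0},
    {"year": 2019, "err": 0}, {"year": 2019, "err": 0},
    {"year": 2020, "err": 1}, {"year": 2020, "err": 1},
]
result = best_split(["year"], data, mean_error, discrete={"year"})
```

In this toy data only the 2020 records are flagged, so the sketch selects the split on year 2020, which isolates them in one child.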

Pruning. We iteratively remove leaves whose degree of violation is i) below that of an ancestor node or ii) below the user-specified AC threshold.

Growing the tree even for "clean" nodes, whose violation metric is below the threshold, helps the system find longer conjunctive conditions. Pruning clean nodes highlights the error regions to the user and produces a smaller, more comprehensible final tree.

Example. In Figure 4.1, the Localization Tree finds the splits that minimize the sum of p-values for the G-test. For the first split, the tree finds that splitting on Year = 2020 produces the smallest p-value sum (0.0088 + 0.011). The tree then splits on the two remaining years. Next, in pruning, the Localization Tree prunes away nodes whose degree of violation is below the user-defined threshold; with p-values, a lower degree of violation corresponds to a p-value above the threshold (0.01 in this case). Therefore, all nodes are pruned away except for the right-most node, which contains the data points from the year 2020.
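For readers unfamiliar with the G-test used in this example, the standard statistic for a contingency table is G = 2 Σ O ln(O / E), where O are the observed counts and E the counts expected under independence. Under independence, G is approximately chi-square distributed with (rows − 1)(columns − 1) degrees of freedom, which is where the p-value comes from. The sketch below computes only the statistic with the standard library; the function name `g_statistic` and the two tables are illustrative assumptions, not data from Figure 4.1.

```python
import math
from typing import List


def g_statistic(table: List[List[float]]) -> float:
    """G = 2 * sum(O * ln(O / E)) over the cells of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # empty cells contribute 0 in the limit
                g += observed * math.log(observed / expected)
    return 2.0 * g


# Perfectly proportional rows: no evidence against independence.
independent = g_statistic([[10, 20], [20, 40]])
# All mass on the diagonal: strong evidence against independence.
dependent = g_statistic([[30, 0], [0, 60]])
```

A large G (small p-value) indicates a strong violation of an independence constraint on that partition, which is why the tree in the example prefers splits with small p-value sums.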

Algorithm 3: Tree Pruning Algorithm

Input: Tree T and significance level α
Output: Pruned tree

Function Prune(T, α):
    root ← T.root
    if T is Null or root is a leaf and (φ(D_root) > α or φ(D_root) < φ(D_{root.parent})) then
        return NULL
    else
        // Recurse down the tree
        T′.root ← T.root
        T′.left_subtree ← Prune(T.left_subtree, α)
        T′.right_subtree ← Prune(T.right_subtree, α)
        if (T′.root is a leaf and φ(D_{T′.root}) > α) or φ(D_{T′.root}) < φ(D_{T′.root.parent}) then
            return NULL
        else
            return T′
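The pruning step can be sketched in Python as a post-order traversal. This is an illustrative sketch under assumptions, not the implementation used in this work: the `Node` class and its fields are hypothetical, `violation` follows the convention that larger values mean a stronger violation (for a p-value metric the comparisons would be reversed), and each leaf is compared with its direct parent as in Algorithm 3. Children are pruned first so that an inner node that loses both children is re-examined as a leaf.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str
    violation: float  # assumed convention: higher means a stronger violation
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def prune(node: Optional[Node], threshold: float,
          parent_violation: Optional[float] = None) -> Optional[Node]:
    """Remove leaves that are clean or that violate less than their parent."""
    if node is None:
        return None
    # Post-order: prune the subtrees first (this modifies the tree in place).
    node.left = prune(node.left, threshold, node.violation)
    node.right = prune(node.right, threshold, node.violation)
    if node.is_leaf():
        weaker_than_parent = (parent_violation is not None
                              and node.violation < parent_violation)
        if node.violation < threshold or weaker_than_parent:
            return None
    return node


# Hypothetical tree: only the "year = 2020" region violates the constraint strongly.
tree = Node("all", 0.5,
            left=Node("year != 2020", 0.2,
                      left=Node("year = 2018", 0.1),
                      right=Node("year = 2019", 0.1)),
            right=Node("year = 2020", 0.9))
pruned = prune(tree, threshold=0.6)
```

After pruning, the hypothetical tree keeps only the root and the leaf for year 2020; the clean leaves for 2018 and 2019 are removed, which turns their parent into a clean leaf that is removed in turn.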

Conditional Constraints. Some constraints are specified to hold only under certain conditions, such as conditional (in)dependencies [49] and conditional functional dependencies [2]. Tree construction can be made conditional as well by starting the tree with splits on the conditioning attribute. For example, consider a conditional independence constraint X ⊥⊥ Y | Z. We assign Z as the tree root, then apply the tree construction algorithms to the unconditional constraint X ⊥⊥ Y in the subtrees.
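One way to picture the conditional case is to partition the records by the value of Z and build one subtree per partition for the unconditional constraint. The sketch below assumes a discrete conditioning attribute; the function `conditional_subtrees` and the `build_tree` callback are hypothetical names, and `len` is passed in the usage lines only as a stand-in for a real tree builder.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, TypeVar

Record = Dict[str, Hashable]
Tree = TypeVar("Tree")


def conditional_subtrees(data: List[Record], z: str,
                         build_tree: Callable[[List[Record]], Tree]) -> Dict[Hashable, Tree]:
    """Split on the conditioning attribute z first, then build one subtree per value."""
    groups: Dict[Hashable, List[Record]] = defaultdict(list)
    for record in data:
        groups[record[z]].append(record)
    # Each subtree only sees records that share one value of z.
    return {value: build_tree(rows) for value, rows in groups.items()}


data = [
    {"z": "a", "x": 1, "y": 0},
    {"z": "a", "x": 0, "y": 1},
    {"z": "b", "x": 1, "y": 1},
]
sizes = conditional_subtrees(data, "z", build_tree=len)
```

For a continuous conditioning attribute, the root split would instead use thresholds on Z, as in the growth phase.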

Chapter 5

Experiments

In this chapter, we describe our experimental evaluation. We evaluate TreeDetect on datasets containing both synthetic and real-world errors, and compare its performance against several state-of-the-art data cleaning algorithms.

5.1 Experimental Setup

5.1.1 Datasets

Table 5.1 summarizes the number of rows, attributes, and the types of errors present within each dataset. For synthetic errors, we investigate sorting errors and imputation errors.

Sorting errors occur when one or more columns have been sorted in ascending or descending order, while imputation errors occur when a group of values in the original dataset are replaced with misleading values. These two types of errors are frequent in practice [40] and have been investigated in previous error detection studies.

Table 5.2 summarizes the attributes, statistical constraints, and denial constraints that we use for each dataset. Note that for ISCs, we could not find DC representations, and thus DCs were excluded from experiments involving ISCs.

Name     Rows  Attributes  Error Types
Housing  506   13          Sorting, Imputation
Car      1728  7           Sorting
Hockey   2217  27          Imputation
Sensor   793   55          Outlier

Table 5.1: Dataset Information

Attributes                               SCs             Dataset  Denial Constraints
Temperatures (T) of Neighboring Sensors  T_a ⊥̸⊥ T_b      Sensor   ∀r_1, r_2 ∈ D : ¬(r_1[T_a] > r_2[T_b] ∧ r_1[T_b] ≤ r_2[T_b])
Tax rate, SEC, Crime (C)                 TX ⊥̸⊥ SEC | C   Housing  ∀r_1, r_2 ∈ D : ¬(r_1[C] = r_2[C] ∧ r_1[T] > r_2[T] ∧ r_1[SEC] ≤ r_2[SEC])
N_oxide, SEC, Tax rate (T)               N ⊥⊥ SEC | TX   Housing  ×
Games (G), Goal Plus-Minus (GPM)         G ⊥⊥ GPM        Hockey   ×
Safety (SA), Doors (DR)                  SA ⊥⊥ DR        Car      ×

Table 5.2: Constraints used by TreeDetect and other approaches

Hockey. The Hockey dataset contains records of players with the potential to be drafted from a junior hockey league into the National Hockey League (NHL), known as prospects [43]. For each prospect, the dataset lists 26 attributes that summarize their performance in a junior league, and how many games they played in the NHL (Games Played, GP). GP = 0 for prospects who never appeared in the NHL.

Statistical Constraint. In this dataset, the attributes Games Played and Goal Plus-Minus (GPM) should be independent [27].

Sensor. The Sensor dataset contains sensor reports from the Berkeley/Intel Lab. The dataset has more than 2 million records, containing humidity and temperature reports from 54 different sensors. To compress it, we replaced the sensor readings with the hourly averages collected in [28].

Statistical Constraint. Nearby sensors should report similar readings. This leads to the constraint T_a ⊥̸⊥ T_b for the temperatures T_a and T_b of neighboring sensors.

Data Errors. The Sensor dataset contains outlier errors.

Car. The Car Evaluation dataset is from the UCI Machine Learning Repository. This dataset contains seven attributes. We used 4 attributes: Buying price (BP), Car Class (CL), Doors (DR), and Safety level (SA).

Statistical Constraint. For this dataset, we are given the constraint that Safety is independent of the number of doors, SA ⊥⊥ DR [49].

Data Errors. We inject sorting errors into this dataset: when the class of a car is "unclassified", we sort both SA and Doors in ascending order, making them conditionally dependent.
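The sorting-error injection described above can be sketched as follows. This is a hedged illustration, not the script used in the experiments: `inject_sorting_error` is a hypothetical helper, the attribute values are toy numeric codes rather than the actual Car Evaluation categories, and the region label `"a"` is made up. Each column is sorted independently among the records in the region, which aligns the columns and makes them dependent there.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, object]


def inject_sorting_error(data: List[Record], columns: Sequence[str],
                         in_region: Callable[[Record], bool]) -> List[Record]:
    """Sort each column in ascending order among the records inside the region."""
    corrupted = [dict(r) for r in data]  # copy so the clean data stays intact
    positions = [i for i, r in enumerate(corrupted) if in_region(r)]
    for col in columns:
        sorted_values = sorted(corrupted[i][col] for i in positions)
        for i, value in zip(positions, sorted_values):
            corrupted[i][col] = value
    return corrupted


cars = [
    {"cl": "a", "sa": 3, "dr": 2},
    {"cl": "b", "sa": 9, "dr": 9},
    {"cl": "a", "sa": 1, "dr": 4},
    {"cl": "a", "sa": 2, "dr": 3},
]
dirty_cars = inject_sorting_error(cars, ["sa", "dr"], lambda r: r["cl"] == "a")
```

Records outside the region are left untouched, so the injected dependence is confined to the error region that the Localization Tree is expected to find.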

Housing. The Boston dataset was taken from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. This dataset was first used to study the relationship between clean air quality and households' willingness to pay [15]. There are 506 instances, and each instance has 14 attributes. We used 6 attributes: Distance to CBD area-Distance (D), Nitric Oxides Concentration-Noxide (N), Crime Rate-Crime (C), Socioeconomic Status of population (SEC), Rooms (R), and Tax Rate (T).

Statistical Constraint. We are given two SCs: Tax Rate ⊥̸⊥ SEC | Crimes and N_oxide ⊥⊥ SEC [49].

Data Errors. We explore both sorting and imputation errors for this dataset. The original dataset violates the constraint Tax Rate ⊥̸⊥ SEC | Crimes for Crimes > 3.53. In addition, to simulate imputation errors, for all records with Crimes between 0.5 and 2.5, we replace SEC with the average of the entire column. This results in the new error region 0.5 ≤ Crimes ≤ 2.5. The two error regions test the ability of TreeDetect to detect disconnected regions of errors. For sorting errors, we sort N_oxide and SEC in ascending order for records where Tax Rate is greater than 600, thus resulting in the error region Tax Rate > 600.
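The imputation error described for the Housing dataset can be sketched in the same spirit. The helper `inject_imputation_error` and the four toy records are assumptions made for illustration; only the rule itself (replace SEC with the mean of the entire column for records with Crimes between 0.5 and 2.5) comes from the text.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]


def inject_imputation_error(data: List[Record], column: str,
                            in_region: Callable[[Record], bool]) -> List[Record]:
    """Replace `column` inside the region with the mean of the entire column."""
    column_mean = sum(r[column] for r in data) / len(data)
    corrupted = [dict(r) for r in data]  # copy so the clean data stays intact
    for record in corrupted:
        if in_region(record):
            record[column] = column_mean
    return corrupted


housing = [
    {"crime": 0.1, "sec": 10.0},
    {"crime": 1.0, "sec": 20.0},
    {"crime": 2.0, "sec": 30.0},
    {"crime": 5.0, "sec": 40.0},
]
dirty_housing = inject_imputation_error(
    housing, "sec", lambda r: 0.5 <= r["crime"] <= 2.5)
```

Because every record in the region receives the same value, SEC carries no information about the other attributes there, which is what makes the region detectable as a constraint violation.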
