Qualitative Evaluation - Localizing Violations of Approximate Constraints for Data Error Detection

Compared with existing methods that can leverage ACs, our decision tree approach not only outperforms the others in detecting conditionally (in)dependent constraint violations, but also offers excellent interpretability. Our method splits the dataset into a collection of subsets that describe the error intuitively, giving users more insight into why the errors might have occurred. We illustrate these advantages in case studies for different constraint types.

Our case studies also show how error localization can be combined with constraint discovery.

We carry out our case studies on the Hockey dataset.

5.2.1 Uniqueness Constraints

A uniqueness constraint for a column X asserts that no two rows share the same value in X. We denote approximate uniqueness constraints by AUC. A common satisfaction metric for an AUC is the uniqueness ratio (Number of Unique Values) / (Total Number of Values) [46, 17]. Previous work has combined AC discovery with AC application in a discover-and-detect approach [11, 17]:

1. Consider each column X in a relation. If the uniqueness ratio for X is above a threshold α, consider the uniqueness constraint valid for X.

2. Apply each valid uniqueness constraint to detect and repair violations.
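The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in [11, 17]: the function names, the column-to-list table layout, and the toy values are assumptions, and "repair" is reduced to reporting the duplicated values.

```python
from collections import Counter


def uniqueness_ratio(values):
    """Number of distinct values divided by the total number of values."""
    values = list(values)
    if not values:
        raise ValueError("column is empty")
    return len(set(values)) / len(values)


def discover_and_detect(table, alpha):
    """Return {column: duplicated values} for every accepted constraint.

    `table` maps column names to lists of values (hypothetical layout).
    """
    violations = {}
    for column, values in table.items():
        # Step 1: accept the constraint only if the ratio is above alpha.
        if uniqueness_ratio(values) <= alpha:
            continue
        # Step 2: any value occurring more than once violates the constraint.
        counts = Counter(values)
        violations[column] = sorted(v for v, c in counts.items() if c > 1)
    return violations


# Toy data (made up for illustration).
table = {
    "PlayerName": ["A", "B", "C", "D", "D"],          # ratio 4/5 = 0.8
    "Country": ["USA", "USA", "FIN", "FIN", "SVK"],   # ratio 3/5 = 0.6
}
result = discover_and_detect(table, alpha=0.75)
# result == {"PlayerName": ["D"]}
```

With a threshold of 0.75, only PlayerName is accepted as a uniqueness constraint, and its single duplicate is flagged regardless of where in the data it occurs.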

As noted by Wang and Yeye [46], discover-and-detect tends to cause false positives, because in real-world clean datasets, many integrity constraints hold only approximately. A reasonable intuition is that player names should be unique. Suppose that we want to apply the discover-and-detect approach with a threshold of 99% for accepting the constraint.

A data scientist can run Localization Tree with the uniqueness ratio as the satisfaction metric. She obtains the tree in Figure 5.1. The tree shows that name duplication exists in three countries: Slovakia, Finland, and the USA. The uniqueness ratios for Slovakia and Finland are both below the acceptance threshold, so discover-and-detect does not apply the AUC to these countries. Further investigation shows that in Slovakia and Finland, the same player (and hence the same name) appeared in two different draft years, so in these countries the duplication is not a genuine data error. Combining discover-and-detect with a Localization Tree avoids this false positive.

In contrast, for some of the U.S. players there really are two players with the same name (Brian Lee and Nick Larson). Arguably this is not a genuine data error either, but for a different reason than the draft-year duplication in Slovakia and Finland. Although the tree does not avoid the false positive in this case, it illustrates how constraint violations in different error regions occur for different reasons.
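The idea of checking the acceptance threshold per region rather than globally can be illustrated with a simple group-wise computation. This sketch is not the Localization Tree algorithm: it fixes the split attribute (Country) in advance instead of learning splits, and the helper name and the rows are hypothetical.

```python
from collections import defaultdict


def uniqueness_ratio_by_group(rows, group_key, column):
    """Uniqueness ratio of `column`, computed separately for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[column])
    return {g: len(set(vals)) / len(vals) for g, vals in groups.items()}


# Made-up rows: the same name appears twice in one country only.
rows = [
    {"Country": "Slovakia", "PlayerName": "P1"},
    {"Country": "Slovakia", "PlayerName": "P1"},
    {"Country": "Canada", "PlayerName": "P2"},
    {"Country": "Canada", "PlayerName": "P3"},
    {"Country": "Canada", "PlayerName": "P4"},
]
ratios = uniqueness_ratio_by_group(rows, "Country", "PlayerName")

# Regions whose ratio falls below the 99% acceptance threshold;
# in these regions the constraint would not be applied.
below_threshold = sorted(g for g, r in ratios.items() if r < 0.99)
```

Regions listed in `below_threshold` are those where the constraint would be rejected, so duplicates there are not reported as errors.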

Figure 5.1: Tree generated from the Hockey Dataset, using uniqueness ratio on attribute PlayerName

5.2.2 Functional Dependencies

Functional dependencies (FDs) are similar to uniqueness constraints, except that FDs are defined over two columns whereas uniqueness constraints are defined over one column.

Therefore, Localization Tree can also be applied to approximate functional dependencies (AFDs). AFDs allow a preset degree of violation of FDs. We use the FD compliance ratio as the satisfaction metric [46]. Continuing with the Hockey dataset example from before, suppose we now define an AFD PlayerName → DraftYear with a compliance threshold of 99%. This means that at least 99% of the player names must be associated with a unique draft year.
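One plausible reading of the compliance ratio described above is the fraction of distinct left-hand-side values that map to exactly one right-hand-side value. The sketch below implements that reading; the exact definition in [46] may differ, and the function name and rows are assumptions made for illustration.

```python
from collections import defaultdict


def fd_compliance_ratio(rows, lhs, rhs):
    """Fraction of distinct `lhs` values associated with exactly one `rhs` value."""
    rhs_values = defaultdict(set)
    for row in rows:
        rhs_values[row[lhs]].add(row[rhs])
    if not rhs_values:
        raise ValueError("no rows")
    compliant = sum(1 for vals in rhs_values.values() if len(vals) == 1)
    return compliant / len(rhs_values)


# Made-up rows: P1 appears with two draft years, P3 is repeated consistently.
rows = [
    {"PlayerName": "P1", "DraftYear": 2000},
    {"PlayerName": "P1", "DraftYear": 2001},
    {"PlayerName": "P2", "DraftYear": 2000},
    {"PlayerName": "P3", "DraftYear": 2002},
    {"PlayerName": "P3", "DraftYear": 2002},
    {"PlayerName": "P4", "DraftYear": 2003},
]
ratio = fd_compliance_ratio(rows, "PlayerName", "DraftYear")  # 3 of 4 names comply
accepted = ratio >= 0.99
```

Under this reading, a repeated name with a consistent draft year (P3) does not violate the AFD, whereas it would violate the uniqueness constraint on PlayerName.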

Running TreeDetect with the above configuration results in the tree in Figure 5.2. The tree is indeed very similar to Figure 5.1, and similar conclusions about the dataset can be drawn: the AFD constraint is accepted for the whole dataset, rejected for Slovakia and Finland, and accepted for the U.S.

When the data conflicts with a given constraint, the data manager has two options: repair the data or revise the constraint. Error localization assists in repairing the data by highlighting likely errors (e.g., Nick Larson). The error localization trees can also assist in revising the constraint. For example, the data manager could introduce a conditional constraint such that the FD is applied only to countries other than Finland, Slovakia, and the U.S. Alternatively, he could choose different thresholds for different tree regions. Thus error localization can assist with both error detection and the discovery of conditional approximate constraints.

Figure 5.2: Tree generated from the Hockey Dataset, using AFD PlayerName → DraftYear

5.2.3 Statistical Constraints

Statistical constraints are the most complex AC class, so we provide several case studies illustrating how taking context-sensitivity into account through error localization strengthens them in different domains.

Ice Hockey

Previous literature on hockey analytics has established that Goal Plus-minus (GPM) from minor leagues does not predict future games played (GP) in the NHL. The data scientist translates this knowledge into an ISC GP ⊥⊥ GPM. Building a prediction model on the Hockey dataset, he finds that GP is an excellent predictor for GPM, so the data violate the given ISC.

To understand this violation, he runs our Localization Tree algorithm, which generates the tree shown in Figure 5.3. Seeing that splits on the attribute DraftYear result in partitions with particularly low p-values (high violation), the data scientist hypothesizes that the draft year is strongly related to the ISC violation.

To simplify the tree, the data scientist applies the pruning algorithm and obtains the tree shown in Figure 5.4.

The tree highlights that most violating records come from the years before 2002. Thus, the data satisfies a conditional DSC, GP ⊥̸⊥ GPM | Year < 2002. The scientist then runs the SCODED top-k algorithm on the data partition and receives the results shown in Figure 5.5.

The results show that for the years between 1998 and 2001, every record with Games > 0 has a PlusMinus value of 0. This suggests an imputation error; the data provider did not have access to actual PlusMinus values and therefore entered 0 for all players who later appeared in the NHL.

Figure 5.3: Tree generated from the Hockey Dataset, using SC GP ⊥⊥ GPM

Figure 5.4: Tree generated from the Hockey Dataset, further pruned

Car Data

When testing for the correlation between Safety and NumDoors, the node with the lowest p-value results from the splits Class = Unclassified and Maintenance Level ≤ 2. This violation of independence is expected, since we specifically sorted Safety and NumDoors for all car instances where Class = Unclassified, thus making the two attributes statistically dependent.
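The sorting step that injects the dependence can be sketched as follows. This is one plausible reading of that step, not the exact procedure used in the experiments: the helper name, the numeric encoding of the attributes, and the rows are assumptions.

```python
def sort_columns_within(rows, predicate, col_a, col_b):
    """Return a copy of `rows` in which `col_a` and `col_b` are each sorted
    among the rows matching `predicate`; other rows are left unchanged."""
    out = [dict(r) for r in rows]
    idx = [i for i, r in enumerate(out) if predicate(r)]
    a_sorted = sorted(out[i][col_a] for i in idx)
    b_sorted = sorted(out[i][col_b] for i in idx)
    # Pair the k-th smallest value of col_a with the k-th smallest of col_b,
    # which makes the two columns perfectly rank-correlated in the region.
    for i, a, b in zip(idx, a_sorted, b_sorted):
        out[i][col_a] = a
        out[i][col_b] = b
    return out


# Made-up rows with numerically encoded attributes.
rows = [
    {"Class": "Unclassified", "Safety": 3, "NumDoors": 2},
    {"Class": "Good", "Safety": 1, "NumDoors": 5},
    {"Class": "Unclassified", "Safety": 1, "NumDoors": 4},
    {"Class": "Unclassified", "Safety": 2, "NumDoors": 3},
]
injected = sort_columns_within(
    rows, lambda r: r["Class"] == "Unclassified", "Safety", "NumDoors"
)
pairs = [(r["Safety"], r["NumDoors"]) for r in injected]
```

After the call, Safety and NumDoors increase together within the Unclassified rows, while the row with Class = Good keeps its original values.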

Housing Data

A p-value is a satisfaction metric for an independence constraint and therefore a violation metric for a dependence constraint. High p-values thus indicate a likely violation of the dependence constraint SES ⊥̸⊥ TaxRate. Figure 5.6 shows the tree generated by Localization Tree when maximizing p-values. The p-values for the two continuous variables are computed using the τ statistic [49].
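To make the role of the p-value concrete, the sketch below computes a two-sided p-value from a rank-correlation statistic. It assumes that the τ statistic refers to Kendall's τ, which the text does not state explicitly, and it uses the tau-a variant with the large-sample normal approximation, which is only appropriate when there are no ties. The function name and the data are made up.

```python
import math


def kendall_tau_p_value(xs, ys):
    """Kendall's tau-a and its two-sided p-value (normal approximation).

    Assumes no tied values in either sample.
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    score = 0  # concordant pairs minus discordant pairs
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    tau = score / (n * (n - 1) / 2)
    # Variance of tau under the null hypothesis of independence (no ties).
    variance = 2 * (2 * n + 5) / (9 * n * (n - 1))
    z = tau / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return tau, p_value


xs = list(range(20))
ys_monotone = [2 * x + 1 for x in xs]                        # strictly increasing in xs
ys_tent = list(range(0, 20, 2)) + list(range(19, 0, -2))     # rises, then falls

tau_mono, p_mono = kendall_tau_p_value(xs, ys_monotone)
tau_tent, p_tent = kendall_tau_p_value(xs, ys_tent)
```

For a dependence constraint, the monotone pair yields a very small p-value (the constraint is satisfied), whereas the second pair has no overall monotone trend and yields a large p-value, which would mark the region as a likely violation.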

The leaves with the highest p-values all result from the split Crimes > 0.39. These leaves are then generated from the splits Crimes < 2.79, Crimes < 0.83, and Crimes > 3.50.

Figure 5.5: Errors detected from the Hockey Dataset

The tree regions thus match our two ground-truth error regions: 0.5 ≤ Crimes ≤ 2.5 and Crimes > 3.53.

For the independence constraint between N_Oxide and SES, the node with the lowest p-value results from the split TaxRate ≥ 600, which is exactly the ground-truth starting point for sorting the two columns.
