Tree learning methods have been extensively researched in machine learning for supervised prediction tasks [35]. Decision trees are employed for predicting discrete class labels, and regression trees for continuous output variables. The main similarity to our work is that we also employ a tree to represent a partition of the input data space. The main difference is in the use of the tree partition: In prediction, the goal is to find a set of regions such that prediction in each region is easier than global prediction. For error detection, the goal is
to find a set of regions such that error detection in each region is easier than global error detection. More specifically, in prediction models, we seek regions that minimize a measure of predictive uncertainty. In error detection, we seek regions that maximize a measure of constraint violation. While the objective functions differ between predictive modelling and error detection, the algorithmic strategies for finding optimal tree partitions are similar.
In particular, the approach of using a growth phase followed by a pruning phase is similar.
Below we discuss previous research in attribute selection and pruning.
2.3.1 Attribute Selection
Attribute selection is the process of selecting the best split attribute. Measures for attribute selection mostly revolve around the idea of reducing impurity from parent to child nodes [21]. These include Information Gain, which measures the decrease in entropy after the dataset is split on an attribute. Entropy is defined as
\[
\mathrm{Entropy}(D) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{2.1}
\]
where p_i is the relative frequency of class i in dataset D and c is the number of classes. Information Gain is used in the classical ID3 [37] and C4.5 [38] decision tree learning algorithms. Another such measure, the Gini Index, measures the probability that a randomly chosen record is misclassified when it is labelled at random according to the class distribution of its node. The Gini Index is used in the highly popular CART algorithm [4]. Variations of these measures have also been proposed to suit particular applications of decision trees. For example, the work in [24] introduces a novel split criterion for learning context-sensitive decision trees with the aim of applying context-sensitive decision forests to object detection.
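The impurity measures above can be illustrated with a minimal sketch. The function names, the toy records, and the attribute names `outlook` and `windy` are assumptions made for illustration; they are not taken from ID3, C4.5, or CART implementations.

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels, following Eq. (2.1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: one minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain", "windy": False},
    {"outlook": "rain", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

best_attr = max(["outlook", "windy"],
                key=lambda a: information_gain(rows, labels, a))
```

On this toy data the split on `outlook` yields pure children, so it is the attribute an information-gain criterion would select.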
2.3.2 Pruning
Pruning refers to the process of reducing the number of leaves within a decision tree, mostly to avoid over-fitting. There are two techniques for pruning: post-pruning and pre-pruning.
In post-pruning, a decision tree is first fully generated and then pruned. [5] proposes minimum error pruning, where a subtree is replaced with a leaf labelled with its most popular class whenever doing so does not increase the expected error. [4] introduces cost complexity pruning, a form of post-pruning in which the algorithm attempts to optimize the cost complexity function. Note that post-pruning can be as simple as pruning all nodes at levels beyond the maximum depth of a tree. Pre-pruning, on the other hand, prevents the tree from generating non-significant branches; in other words, this form of pruning reduces the number of branches by stopping the tree from splitting. Chi-squared pruning is a form of pre-pruning that uses statistical testing to determine whether a split on some attribute is statistically significant. The chi-squared test statistic is defined as
\[
\chi^2 = \sum \frac{(O - E)^2}{E} \tag{2.2}
\]
where O is the observed value and E is the expected value. The idea is that if a split does not result in a partition that is significantly different from its parent, we reject the split.
Chi-squared pruning is used in Quinlan’s ID3 [37]. Note that chi-squared pruning can also be used as a form of post-pruning, where we detect useless splits after the tree has been constructed.
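As a minimal sketch of the chi-squared split test, the code below compares the class counts observed in each child node with the counts expected if every child had the parent's class distribution. The function names and the example data are hypothetical. The critical value 3.841 (one degree of freedom, significance level 0.05) is an assumption that fits only a binary split with two classes; other splits need a different critical value.

```python
from collections import Counter

def chi_squared_statistic(parent_labels, child_label_lists):
    """Sum of (O - E)^2 / E over all (child, class) cells, as in Eq. (2.2)."""
    n = len(parent_labels)
    parent_counts = Counter(parent_labels)
    stat = 0.0
    for child in child_label_lists:
        if not child:
            continue  # an empty child contributes no cells
        observed = Counter(child)
        for cls, total in parent_counts.items():
            # Expected count if the child mirrored the parent's distribution.
            expected = total * len(child) / n
            stat += (observed.get(cls, 0) - expected) ** 2 / expected
    return stat

# Assumed critical value: chi-squared, 1 degree of freedom, alpha = 0.05.
CRITICAL_1DF_05 = 3.841

def keep_split(parent_labels, child_label_lists, critical=CRITICAL_1DF_05):
    """Accept the split only if the children differ significantly from the parent."""
    return chi_squared_statistic(parent_labels, child_label_lists) > critical

parent = ["yes"] * 10 + ["no"] * 10
informative = [["yes"] * 9 + ["no"], ["yes"] + ["no"] * 9]
uninformative = [["yes"] * 5 + ["no"] * 5, ["yes"] * 5 + ["no"] * 5]
```

Here `keep_split(parent, informative)` accepts the split, while `keep_split(parent, uninformative)` rejects it because both children have exactly the parent's class distribution.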
In our work we use post-pruning to remove irrelevant leaves from our tree. We do not use pre-pruning because it often stops the tree from being fully grown, and an incomplete tree may fail to capture error regions that are defined by multiple predicates.
Predictive tree learning methods rest on assumptions similar to those of our error region construction method: that complex conditions can be built up sequentially from relevant subconditions, and that the observed features given by the data are sufficient for defining useful data regions (useful for prediction in the case of machine learning, and useful for error detection in the case of our error localization).
2.3.3 Decision Tree and Constraints
Previous work in decision tree research has used conditional and functional dependencies to build better classifiers. However, most of this work focuses on improving the classification performance of decision trees. CITree [44] leverages conditional independence to build more compact decision trees and to improve classification accuracy. Lam and Lee [26] propose a new type of functional dependency (FD) called approximate class functional dependency (ACFD), which defines the relation between attribute values and class labels. Their algorithm then selects the determinants of the least violated ACFDs as the split attribute. Decision trees built using ACFDs show improved classification performance.
Chapter 3
Background
We review relevant background on previous error detection work with approximate constraints (ACs). We begin with the exact integrity constraints that ACs generalize.
We write D |= C if the data relation D satisfies an integrity constraint C. If a relation violates a constraint, error analysis often searches for a minimum-size data subset such that removing the subset removes the violation [1].
As a minimum-size solution to this problem divides the data into a clean and a (potentially) dirty subset, we refer to it as the partition problem for constraint C.
3.1 Error Detection With Approximate Constraints
Approximate constraints are quantified by violation metrics [23]. A violation metric φ for constraint C is an aggregate function that, given a relation, returns a real value: φ_C(D) ∈ [0, m], where 0 indicates no violation (i.e., D |= C), and m is the maximum possible violation for a fixed data size n = |D|. (For a fixed data size n, the maximum m_n = sup_{D : |D| = n} φ(D) exists for all violation metrics in common use.) When the constraint C is irrelevant or fixed by context, we omit it. For each violation metric there is a dual satisfaction metric σ:
σ(D) ≡ m − φ(D). An exact constraint C is represented by the violation metric φ(D) = 0 if D |= C, and φ(D) = 1 otherwise. The dual satisfaction metric is 1 if D |= C, and 0 otherwise.
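To make these definitions concrete, the sketch below uses one possible violation metric for a functional dependency: the number of record pairs that agree on the left-hand side but differ on the right-hand side. This choice of metric, the function names, and the `zip`/`city` records are assumptions made for illustration, not the metrics of [23].

```python
from itertools import combinations

def fd_violation(records, lhs, rhs):
    """Violation metric: pairs that agree on `lhs` but differ on `rhs`."""
    return sum(1 for a, b in combinations(records, 2)
               if a[lhs] == b[lhs] and a[rhs] != b[rhs])

def max_violation(n):
    """Maximum m for data size n: every one of the n(n-1)/2 pairs violates."""
    return n * (n - 1) // 2

def fd_satisfaction(records, lhs, rhs):
    """Dual satisfaction metric: sigma(D) = m - phi(D)."""
    return max_violation(len(records)) - fd_violation(records, lhs, rhs)

def exact_violation(records, lhs, rhs):
    """0/1 metric of the exact constraint: 0 if the FD holds, 1 otherwise."""
    return 0 if fd_violation(records, lhs, rhs) == 0 else 1

# Hypothetical records; the third one conflicts with the first two.
records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "94105", "city": "San Francisco"},
]
```

For these four records the pair-counting metric distinguishes degrees of violation, whereas the exact 0/1 metric only reports that the dependency fails.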
Two different types of error detection output are useful for different applications: First, a subset of likely dirty data, corresponding to a "clean/dirty" labelling. Second, a "dirtiness"
ranking that supports returning the top-k data records most likely to be dirty. This is similar to outlier detection, where methods output either a set of potential outliers or an
"outlierness" metric for data points. Each type of desired output leads to an optimization problem for approximate constraints. To output binary labels, we can solve a minimum repair problem as follows.
Definition 1 (Dataset Partition for Approximate Constraints). Given a dataset D, a violation metric φ, and a user-defined threshold α for an acceptable violation, the data partition
problem is to find a minimum-cardinality subset ∆D of records such that removing the subset reduces the violation below the threshold: φ(D − ∆D) ≤ α.
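Definition 1 can be illustrated with an exhaustive search that tries removal sets in order of increasing size. This is a sketch for very small datasets only, since it enumerates exponentially many subsets. The pair-counting metric `fd_violation`, the function `minimum_partition`, and the example records are hypothetical.

```python
from itertools import combinations

def fd_violation(records, lhs="zip", rhs="city"):
    """Assumed violation metric: pairs agreeing on `lhs` but differing on `rhs`."""
    return sum(1 for a, b in combinations(records, 2)
               if a[lhs] == b[lhs] and a[rhs] != b[rhs])

def minimum_partition(records, phi, alpha):
    """Indices of a smallest subset whose removal gives phi <= alpha.

    Brute force: checks all subsets of size 0, then size 1, and so on.
    Returns None if even the empty dataset exceeds the threshold.
    """
    n = len(records)
    for size in range(n + 1):
        for removed in combinations(range(n), size):
            kept = [r for i, r in enumerate(records) if i not in removed]
            if phi(kept) <= alpha:
                return set(removed)
    return None

records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "94105", "city": "San Francisco"},
]

dirty = minimum_partition(records, fd_violation, alpha=0)
```

With α = 0 the search labels only the record at index 2 as dirty, because removing it alone eliminates every violating pair.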
A top-k query can be supported by finding the set of k records with the biggest impact on the violation metric, as follows.
Definition 2 (Top-k Contribution). Given a dataset D, a violation metric φ, and a user-defined number k, the top-k contribution problem is to identify a set ∆D of k records whose removal minimizes the constraint violation: argmin_{∆D : |∆D| = k} φ(D − ∆D).
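A brute-force reading of Definition 2 evaluates every size-k removal set and keeps one with the smallest remaining violation. The sketch assumes k is at most the number of records and is practical only for small inputs. The metric `fd_violation`, the function `top_k_contribution`, and the records are hypothetical.

```python
from itertools import combinations

def fd_violation(records, lhs="zip", rhs="city"):
    """Assumed violation metric: pairs agreeing on `lhs` but differing on `rhs`."""
    return sum(1 for a, b in combinations(records, 2)
               if a[lhs] == b[lhs] and a[rhs] != b[rhs])

def without(records, removed):
    """Records that remain after dropping the given indices."""
    return [r for i, r in enumerate(records) if i not in removed]

def top_k_contribution(records, phi, k):
    """Indices of k records whose removal minimizes phi (ties: first found)."""
    candidates = combinations(range(len(records)), k)
    return set(min(candidates, key=lambda idx: phi(without(records, idx))))

records = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "94105", "city": "San Francisco"},
]

top_1 = top_k_contribution(records, fd_violation, k=1)
```

Several removal sets can achieve the same minimum, so the returned set is one optimal answer rather than a unique one.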
The dataset partition and top-k problems can be expressed in dual terms by replacing the minimization of violation metrics with the maximization of satisfaction metrics. Yan et al. show that the partition and top-k problems are algorithmically equivalent in the sense that a polynomial-time algorithm for one can be used to obtain a polynomial-time algorithm for the other [49, Th. 1]. In our experiments we evaluate error localization for both ranking and binary labelling methods, in terms of both accuracy and scalability.
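One direction of such an equivalence can be sketched as follows: given any top-k solver, the partition problem is solved by calling it for k = 0, 1, 2, ... and stopping at the first k whose result meets the threshold. This is an illustration of the idea and not necessarily the construction used in [49]. The brute-force `top_k` stands in for an arbitrary solver, and the metric `count_negative` (for the constraint "value ≥ 0") is a hypothetical example.

```python
from itertools import combinations

def count_negative(values):
    """Assumed violation metric for the constraint 'value >= 0'."""
    return sum(1 for v in values if v < 0)

def remove(values, indices):
    """Values that remain after dropping the given indices."""
    return [v for i, v in enumerate(values) if i not in indices]

def top_k(values, phi, k):
    """Brute-force stand-in for any top-k solver (assumes k <= len(values))."""
    return set(min(combinations(range(len(values)), k),
                   key=lambda idx: phi(remove(values, idx))))

def partition_via_top_k(values, phi, alpha):
    """Minimum-cardinality removal set, using at most n + 1 top-k calls."""
    for k in range(len(values) + 1):
        removed = top_k(values, phi, k)
        if phi(remove(values, removed)) <= alpha:
            return removed
    return None  # even the empty dataset exceeds the threshold

values = [3, -1, 4, -5, 9]
dirty = partition_via_top_k(values, count_negative, alpha=0)
```

Because the top-k answer attains the minimum violation among all removal sets of size k, the first k that meets the threshold is also the minimum cardinality required by Definition 1.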