Examples of Approximate Constraints - Localizing Violations of Approximate Constraints for Data

We provide examples of common violation and satisfaction metrics, following previous research [46, 49]. We explain approximate statistical constraints in more detail than approximate versions of traditional constraints, as they have been introduced to data management only quite recently.

In practice, violation metrics are scaled by a term that depends on the size of the dataset.

The dependence on sample size accounts for the fact that a violation in a small number of records may be due to statistical noise rather than a genuine problem with the data.

3.3.1 Approximate Uniqueness Constraints

Uniqueness constraints specify that values within one or more columns in a data table are to be unique. In [46] the authors show that in some cases detecting errors using assumed uniqueness constraints can yield many false positives, since in real-world datasets different records often share a value in the same column.

To solve this problem, we can define almost uniqueness constraints (AUCs), where columns with a uniqueness ratio below a threshold are considered to be non-unique by nature [17]. For example, the conforming-row ratio computes the number of row pairs with the same value in the target column, divided by the total number $\binom{N_{dset}}{2}$ of possible violating pairs.

As an example, suppose we set the threshold to 99% and enforce an AUC on a column C. If at least 99% of the values in C are unique, then the remaining non-unique values in the column are considered errors, whereas if only 95% of the values in C are unique, we invalidate the uniqueness constraint on the column.
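The threshold rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of [46] or [17]: the function names `uniqueness_ratio` and `check_auc` are hypothetical, and the ratio is assumed here to be the fraction of rows whose value occurs exactly once, which is only one of several possible ratios.

```python
from collections import Counter

def uniqueness_ratio(values):
    """Fraction of rows whose value occurs exactly once (assumed definition)."""
    if not values:
        raise ValueError("column must contain at least one value")
    counts = Counter(values)
    unique_rows = sum(1 for v in values if counts[v] == 1)
    return unique_rows / len(values)

def check_auc(values, threshold=0.99):
    """Return (constraint_kept, flagged_row_indices) for one column.

    If the ratio is below the threshold, the column is treated as
    non-unique by nature and nothing is flagged.
    """
    if uniqueness_ratio(values) < threshold:
        return False, []
    counts = Counter(values)
    flagged = [i for i, v in enumerate(values) if counts[v] > 1]
    return True, flagged

# Hypothetical columns: 198 of 200 rows unique, and 190 of 200 rows unique.
mostly_unique = list(range(198)) + [500, 500]
often_repeated = list(range(190)) + [1000] * 10

kept_a, flagged_a = check_auc(mostly_unique)
kept_b, flagged_b = check_auc(often_repeated)
```

In the first column the constraint is kept and the two duplicated rows are flagged; in the second the constraint is dropped, so no row is reported.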

3.3.2 Approximate Statistical Constraints: Probabilistic (In)dependence

Recently, Yan et al. [49] introduced a new type of constraint for error detection: asserting the probabilistic (in)dependence of two attributes/columns, possibly conditional on the values of a third. Conditional independence is defined as follows: given three attributes X, Y, Z, we say that X and Y are conditionally independent given Z if knowing Y gives no information about X beyond what can be inferred from the values of Z.

$$X \perp\!\!\!\perp Y \mid Z \;\equiv\; P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z) \times P(Y = y \mid Z = z) \quad \text{for all } x, y, z.$$

This definition can be extended to three disjoint sets of attributes [49]. Throughout this thesis, we will be working extensively with the idea of probabilistic independence, termed Independent Statistical Constraints (ISCs), and probabilistic dependence, termed Dependent Statistical Constraints (DSCs).

Probabilistic independence becomes a constraint on data when we apply it to the empirical distribution P_D associated with a relation D, which is defined as follows [23].

Let r[X] denote the tuple of values in the X columns for record r. Given an assignment X = x, a record satisfies the assignment if r[X] = x. The empirical count is the number N_D(X = x) of records that satisfy it. The empirical frequency of an assignment is the number of satisfying records, divided by the total number of records:

P_D(X = x) ≡ N_D(X = x)/N_D where N_D is the cardinality of relation D.
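The definitions of empirical count and empirical frequency translate directly into code. The following is a small sketch under stated assumptions: records are represented as Python dictionaries, the helper names `empirical_count` and `empirical_frequency` are hypothetical, and the toy relation is made up for illustration and is not taken from any figure in the text.

```python
from fractions import Fraction

def empirical_count(relation, assignment):
    """N_D(X = x): number of records r with r[X] = x."""
    return sum(
        1 for r in relation
        if all(r[attr] == val for attr, val in assignment.items())
    )

def empirical_frequency(relation, assignment):
    """P_D(X = x) = N_D(X = x) / N_D, returned as an exact fraction."""
    return Fraction(empirical_count(relation, assignment), len(relation))

# Made-up toy relation (an assumption for illustration only).
toy = [
    {"Color": "white", "Make": "A"},
    {"Color": "white", "Make": "B"},
    {"Color": "red", "Make": "A"},
    {"Color": "blue", "Make": "B"},
]

# Two of the four records satisfy Color = white.
freq_white = empirical_frequency(toy, {"Color": "white"})
# One of the four records satisfies Color = white and Make = A.
freq_white_a = empirical_frequency(toy, {"Color": "white", "Make": "A"})
```

Using `Fraction` keeps the frequencies exact, which avoids rounding questions when frequencies are later multiplied or compared.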

In the running example of Figure 1.2, we have N_D = 9 and the count of white cars is N_D(Color = white) = 3, so the empirical frequency is P_D(Color = white) = 3/9 = 1/3.

Approximate Probabilistic Independence: Discrete Attributes

A common metric for the strength of a dependence is mutual information I(X; Y ) [47, 23]:

$$I(X; Y) \equiv \sum_{x,y} P(X = x, Y = y) \log_2 \frac{P(X = x, Y = y)}{P(X = x) \times P(Y = y)}.$$

Mutual information is a violation metric for independence: it can be shown that two categorical random variables have minimal mutual information 0 if and only if they are independent [23]. Also, Yan et al. show that if an FD X → Y holds exactly in a relation D, then the mutual information between X and Y is the maximum possible value [49, Prop.2]. In other words, a functional dependency represents a maximally strong dependence as measured by mutual information.
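As an illustration of the two extremes just described, the sketch below computes mutual information in bits from the empirical distribution of a list of (x, y) pairs. The function name `mutual_information` and both toy datasets are assumptions made for this example only.

```python
import math
from collections import Counter

def mutual_information(pairs, base=2.0):
    """Empirical mutual information I(X; Y) of a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # P(x, y) / (P(x) * P(y)) simplifies to c * n / (count_x * count_y).
        ratio = c * n / (count_x[x] * count_y[y])
        mi += (c / n) * math.log(ratio, base)
    return mi

# Independent attributes: every (x, y) combination occurs equally often.
independent = [(x, y) for x in "ab" for y in "cd"]

# Functional dependency X -> Y: each x value determines a single y value.
functional = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]

mi_independent = mutual_information(independent)
mi_functional = mutual_information(functional)
```

For the independent data the result is 0, while for the functional dependency it is 1 bit, which equals the entropy of Y in this toy dataset and is therefore the largest value I(X; Y) can take for it.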

As we discussed above, it is important to make violation metrics sensitive to the sample size. In practical hypothesis testing, mutual information is therefore replaced by the G-test:

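$$G_{X \perp\!\!\!\perp Y}(D) = 2 \times N_D \times I_D(X; Y) \tag{3.1}$$

Here I_D(X; Y) denotes the mutual information computed from the empirical distribution P_D. The standard G statistic is defined with the natural logarithm, so if I_D is measured in bits (log₂), the right-hand side carries an additional factor of ln 2. Figures 1.2 and 1.3 give G-test values in our running example.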
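To make the role of the sample size concrete, the following sketch computes the G statistic from observed and expected counts, which is equivalent to 2 × N_D × I_D(X; Y) with the natural logarithm. The function name `g_statistic` and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def g_statistic(pairs):
    """G = 2 * sum over cells of observed * ln(observed / expected)."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    total = 0.0
    for (x, y), observed in joint.items():
        # Expected count of the cell if X and Y were independent.
        expected = count_x[x] * count_y[y] / n
        total += observed * math.log(observed / expected)
    return 2.0 * total

# Toy data in which X fully determines Y (made up for this example).
small = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]
large = small * 10  # same empirical distribution, ten times the records

g_small = g_statistic(small)
g_large = g_statistic(large)
```

Both datasets have the same empirical distribution and hence the same mutual information, but `g_large` is ten times `g_small`: the same pattern counts as stronger evidence of dependence when it is supported by more records.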

3.3.3 p-values

A widely used approach is to transform a violation metric φ_C into a satisfaction metric known as a p-value. For a given relation D, the p-value measures the probability of a violation at least as great as the observed value:

$$p_{\phi_C}(D) = P(t \geq \phi_C(D) \mid C \text{ holds exactly}) \tag{3.2}$$

where t is a random variable ranging over possible observed violation values.

In statistical test terminology, the condition that the constraint holds exactly is called the null hypothesis. The p-value is a satisfaction metric, because the probability of observing a value of at least 0 for the violation metric is 1, and if the constraint holds exactly, the probability of observing a maximal violation value is 0. Note that we do not use p-values to make empirical claims about the world. Rather, we use them as a measure of the degree of violation.

The p-satisfaction metric is widely used for two reasons. 1) Values are normalized to the [0,1] range. 2) A minimum-satisfaction threshold α ≥ p_φ_C(D) provides an upper bound on the probability of a false positive, that is, the probability of rejecting a constraint when it is true (and only apparently violated due to data noise). Therefore p-values are easier for users to interpret than raw violation metrics. This facilitates eliciting a threshold from the user for producing binary labels by solving the partition problem (Def. 3).

Figure 3.1 illustrates the probability density function (3.2) for the test statistic for a hypothetical dataset. Suppose we measure a p-value of φ. If we choose α₁ as our threshold, we would reject our null hypothesis, and vice versa.

Figure 3.1: Probability Density Function for a hypothetical dataset. The x-axis represents the test statistic.

In statistical test terminology, the p-value threshold is the significance level, and observed violations whose p-value falls below this level are statistically significant. Conventional significance levels are 5% and 1%. In terms of p-values, the partition problem (Def. 3) is reformulated to raise the p-value above the threshold, and the top-k problem (Def. 2) is reformulated to maximize the p-value.

Computational statistics provides efficient libraries to compute p-values for many common violation metrics. For a brief review in the context of error detection, please see [49, Sec.4.3]. Most p-value computations are based on a closed-form approximation of the density (3.2). For the G-test the approximation is as follows.

$$p_{X \perp\!\!\!\perp Y}(D) \approx \int_{G_{X \perp\!\!\!\perp Y}(D)}^{+\infty} \chi^2_k(t)\, dt$$

where k = (r_X − 1) × (r_Y − 1) is the number of degrees of freedom, r_X and r_Y are the numbers of possible assignments for X and Y respectively, and χ²_k(t) is the probability density of the chi-square distribution with k degrees of freedom.
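The tail integral above can be evaluated without external libraries when k is a positive integer. The sketch below is one possible implementation, not the one used in [49]: it relies on the standard recurrence for the regularized upper incomplete gamma function, and the names `chi2_sf`, `g_test_p_value` and `is_violated` are hypothetical. In practice a statistics library routine such as SciPy's `scipy.stats.chi2.sf` would typically be used instead.

```python
import math

def chi2_sf(x, k):
    """Upper tail probability of the chi-square distribution, integer k >= 1."""
    if k < 1:
        raise ValueError("degrees of freedom must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Start from Q(1, z) = exp(-z) for even k, or Q(1/2, z) = erfc(sqrt(z))
    # for odd k, then apply Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    # until a reaches k / 2.
    if k % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < k / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)

def g_test_p_value(g, r_x, r_y):
    """Approximate p-value of a G statistic for attributes with r_x and r_y values."""
    k = (r_x - 1) * (r_y - 1)
    return chi2_sf(g, k)

def is_violated(p_value, alpha=0.05):
    """Assumed decision rule: flag the constraint when the p-value is below alpha."""
    return p_value < alpha

# Hypothetical G value for two binary attributes (k = 1 degree of freedom).
p = g_test_p_value(8 * math.log(2), r_x=2, r_y=2)
flag_5_percent = is_violated(p, alpha=0.05)
flag_1_percent = is_violated(p, alpha=0.01)
```

With this G value the independence constraint counts as violated at the 5% significance level but not at the 1% level, which shows how the choice of α changes the binary label.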

Chapter 4

Localization Tree

Decision trees are traditionally used to solve classification problems. One of our contributions is to use decision trees to solve an optimization problem rather than to build a predictive model. In this chapter we formally describe error regions and demonstrate how we use decision tree learning to solve a partitioning problem.
