Descriptive and Prescriptive Data Cleaning
Anup Chalamalla
1∗Ihab F. Ilyas
1∗Mourad Ouzzani
2Paolo Papotti
2 1University of Waterloo, Canada 2 Qatar Computing Research Institute (QCRI), Qatar
ABSTRACT
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations us-ing some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some tar-get report generated by transformations over multiple data sources. This creates a situation where the violations de-tected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possi-ble and affordapossi-ble, this would be of little help towards iden-tifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we pro-pose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level todescribe these errors and at the source level toprescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and ef-ficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
1.
INTRODUCTION
A common approach to address the problem of dirty data [8] is to apply a set of data quality rules or constraints over a target database, to “detect” and to eventually “repair” erroneous data [1, 7, 10, 3, 15]. Tuples or cells (attribute-value of a tuple) in a databaseDthat are inconsistentw.r.t. a set of rules Σ are considered to be in violation, thus pos-sibly “dirty”. A repairing step tries to “clean” these viola-tions by producing a set of updates overDleading to a new databaseD0that satisfies Σ. Unfortunately, in many real life scenarios [13], the picture is different, and data and rules are decoupled in space and time; constraints are often declared
∗
Work partially done while at QCRI.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGMOD’14,June 22–27, 2014, Snowbird, UT, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2376-5/14/06 ...$15.00.
http://dx.doi.org/10.1145/2588555.2610520.
not on the original data but rather on reports or views, and at a later stage in the data processing life cycle.
Example 1. Consider the report T (see Figure 1) about shops for an international franchise. The HR department enforces a set of policies in the franchise workforce and iden-tifies two problems in T. The first violation (ta andtb, in
bold) comes from a rule stating that, in the same shop, the average salary of the managers (Grd=2) should be higher than that of the staff (Grd=1). The second violation (tb
and td, in italic) comes from a rule stating that a bigger
shop cannot have a smaller staff.
T Shop Size Grd AvgSal #Emps Region
ta NY1 46 ft2 2 99$ 1 US tb NY1 46ft2 1 100$ 3 US tc NY2 62 ft2 2 96 $ 2 US td NY2 62ft2 1 90 $ 2 US te LA1 35 ft2 2 105 $ 2 US tf LND 38 ft2 1 65£ 2 EU
Emps EId Name Dept Sal Grd SId JoinYr
t1 e4 John S 91 1 NY1 2012
t2 e5 Anne D 99 2 NY1 2012
t3 e7 Mark S 93 1 NY1 2012
t4 e8 Claire S 116 1 NY1 2012
t5 e11 Ian R 89 1 NY2 2012
t6 e13 Laure R 94 2 NY2 2012
t7 e14 Mary E 91 1 NY2 2012
t8 e18 Bill D 98 2 NY2 2012
t9 e14 Mike R 94 2 LA1 2011
t10 e18 Claire E 116 2 LA1 2011
Shops SId City State Size Started t11 NY1 New York NY 46 2011 t12 NY2 New York NY 62 2012 t13 LA1 Los Angeles CA 35 2011 Figure 1: A report T on data sources Emps & Shops. To explain these errors, we adopt an approach that sum-marizes the violations in terms of predicates on the database. In the example, since the problematic tuples have Attribute Region set to U S, we describe (explain) the violations in the example as [T.Region=U S]. Note that this explana-tion summarizes all tuples that are involved in a violaexplana-tion, and not necessarily the erroneous tuples; in many cases, up-dating only one tuple in a violation (set of tuple) is enough to bring the database into a consistent state. For example, a repairing algorithm would identifytb.Grd as a possible error in the report. Hence, by updatingtb.Grd, the two violations would be removed. Limiting the erroneous tuples can guide us to a more precise explanation. In the example, the expla-nation [T.Region=U S∧T.Shop=N Y1] is more specific, if we believe thattb.Grd is the erroneous cell. The process of explaining data errors is two-fold: identifying a set of poten-tial erroneous tuples (cells); and finding concise descriptions
Figure 2: System Architecture. that summarize these errors and that can be consumed by
users or by other analytics layers.
We highlight the problem of explaining errors when er-rors are identified in a different space and at a later stage than when errors were digitally born. Consider the following query that generated TableT in Example 1. Since violations detected in the report are actually caused by errors that crept in at an earlier stage, i.e., from the sources, propagat-ing these errors from a higher level in the transformation to the underlying sources can help in identifying the source of the errors and in prescribing actions to correct them.
Example 2. Let us further assume that the previous re-portT is the result of a union of queries over multiple shops of the same franchise. We focus on the query over source relations Emps and Shops for the US region (Figure 1).
Q:SELECT SId as Shop, Size, Grd, AVG(Sal) as
AvgSal, COUNT(EId) as #Emps,‘US’ as Region FROM US.Emps JOIN US.Shops ON SId
GROUP BY SId, Size, Grd
We want to trace back the tuples that contributed to the problems in the target. Tuples ta−td are in violation in T and their lineage is{t1−t8}and{t11−t12}over Tables EmpsandShops. By removing these tuples from any of the sources, the violation is removed. Two possible explanations of the problems are therefore [Emps.J oinY r = 2012] On TableEmps, and [Shops.State=N Y] on TableShops.
As we mentioned earlier,tbis the erroneous tuple that was identified by the repairing algorithm. Its lineage is{t1, t3, t4} and {t11} over Tables Emps and Shops, respectively. By focusing on this tuple, we can compute more precise expla-nations on the sources, such as [Emps.Dept=S]. Drilling even further, an analysis on the lineage of tb may identify t4 as the most likely source of error since by removingt4, the average salary goes down enough to clear the violation. Therefore, the most precise explanation is [Emps.EId=e8]. The example shows that computing likely errors enables the discovery of better explanations. At the source level, this leads to the identification of actionsto solve the problem. In the example, the employee with id e8 seems to be the cause of the problem.
We propose Database Prescription (DBRx for short)
(Figure 2), a system to support descriptive and prescrip-tive data cleaning. DBRx takes quality rules defined over the output of a transformation and computes explanations of the errors. Given a transformation scenario (sourcesSi, 1< i < n, and queryQ) and a set of quality rules Σ,DBRx
computes a violation tableV T of tuples not complying with Σ. V T is mined to discover a descriptive explanation 1in Figure 2 such as [T.Region= U S]. The lineage of the vi-olation table over the sources enables the computation of a
prescriptive explanation 4such as [Emps.J oinY r= 2012] and [Shops.State =N Y] on the source tables. When ap-plicable, a repair is computed over the target, thus allow-ing the possibility of a more precise description 2such as [T.Region=U S∧T.Shop=N Y1], and a more precise pre-scriptive explanation 3based on propagating errors to the sources such as [Emps.Dept=S] and [Emps.EId=e8].
BuildingDBRxraises several challenges: First, propagat-ing the evidence about violatpropagat-ing tuples from the target to the sources can lead to a lineage that covers a large number of source tuples. For example, an aggregate query would clump together several source tuples, with only few contain-ing actual errors. Simply partitioncontain-ing the source tuples as dirty and clean is insufficient; tuples do not contribute to vi-olations in equal measure. Second, we need a mechanism to accumulate evidences on tuples across multiple constraints and violations and hence identify the most likely tuples with errors. For the target side, there are several repairing
algo-rithms that we can use. But for the source side, a new
algorithm, which relaxes the requirements of repair seman-tics, is needed. Third, after identifying the likely errors, mining the explanations involves two issues that we need to deal with: (1) what are the explanations that accurately cover all and only the identified erroneous tuples?; and (2) how to generate explanations concise enough in order to be consumable by humans?
We summarize ourcontributionsin this paper as follows: a. We introduce the problem of descriptive and prescrip-tive data cleaning (Section 2). We define the notion of explanation, and formulate the problem of discovering explanations over the annotated evidence of errors at the sources (Section 3).
b. We develop a novel weight-based approach to anno-tate the lineage of target violations in source tuples (Section 4).
c. We present an algorithm to compute the most likely errors in presence of violations that involve large num-ber of tuples with multiple errors (Section 5). d. We combine multi-dimensional mining techniques with
approximation algorithms to efficiently solve the expla-nation mining problem (Section 6).
We perform an extensive experimental analysis using the TPC-H Benchmark and real-world datasets (Section 7). We conclude the paper with a discussion of related work (Sec-tion 8) and of future direc(Sec-tion of research (Sec(Sec-tion 9).
2.
PROBLEM STATEMENT
LetS={S1, S2, . . . , Sn}be the set of schemas ofnsource relations, where each source schema Si has di attributes ASi
1 , . . . , A Si
di with domainsdom(A Si
1 ), . . . ,dom(A Si di). LetR
be the schema of a target view generated fromS. Without loss of generality, we assume that every schema has a special attribute representing the tuple id. Atransformation is a union of SPJA queries on an instanceI ofS that produces a unique instanceT ofRwithtattributesAT1, . . . , A
T t. Any instanceT of a target view is required to comply with a set of data quality rules Σ. We clarify the rules supported in our system in the next Section. For now, we characterize them with the two following functions:
• Detect(T) identifies cells inT that do not satisfy a ruler∈Σ, and store them in a violation tableV(T).
• Error (V(T)) returns the most likely erroneous cells for the violations in V(T) and store them in an error tableE(T).
WhileDetecthas a clear semantics,Errorneeds some
clarifications. At the target side, we consider the most likely erroneous cells as simply those cells that a given repair al-gorithm decides to update in order to produce a clean data instance, i.e., an instance that is consistentw.r.t. the input rules. Our approach can use any of the available alternative repair algorithms, e.g., [15], (Section 3.3). At the source, we need to deal with the lineage of problematic cells instead of the problematic cells themselves, to produce the most likely erroneous cells. Existing repair algorithms were not meant to handle such a scenario; we show in Section 5 our own approach to produce these cells.
Our goal is to describe problematic data with concise ex-planations. Explanations are composed of queries over the relations in the database as follows.
Definition 1. An explanation is a setE of conjunctive queries where q ∈ E is a query of k selection predicates (ASi
l1 = vl1) ∧ · · · ∧ (A Si
lk = vlk) over a table Si with di attributes, 1 ≤k ≤ di, and vlj (1 ≤j ≤k) are constant values from the domain of the corresponding attributes. We denote with size(E) the number of queries inE.
There are three requirements for an explanation: (i) cover-age - covers error tuples, (ii) conciseness - has small number of queries, and (iii) accuracy - covers mostly error tuples.
Example 3. Consider again relation Emps. Let us as-sume that t1, t3, t4, and t7 are error tuples. There are
alternative explanations that cover them. The most concise is exp7:(Emps.Grd=1), but one clean tuple is also covered
(t5). Explanation exp8:(Emps.eid=e4) ∨ (Emps.eid=e7)
∨(Emps.eid=e8)∨(Emps.eid=e14) has a larger size, but it
is more accurate since no clean tuples are covered.
We define cover of a query q as the set of tuples re-trieved by q. The cover of an explanation E is the union of cover(q1), . . . , cover(qn), qi ∈ E. For a relation R with a violation tableV(R) computed withDetect, we denote withCthe clean tuplesR\Error(V(R)). We now state the exact Descriptive and Prescriptive Data Cleaning(DPDC) problem:
Definition2 (Exact DPDC). Given a relation R, a corresponding violation tableV(R), and anErrorfunction for V(R), a solution for the exact DPDC problem is an explanationEopts.t.
Eopt= argmin size(E)
(E|(cover(E) =E(R)))
If function Error over the target is not available ( 1), the problem is defined on V(R) instead of E(R).
Unfortunately, since all errors must be covered and no clean tuples are allowed in the cover, the exact solution in the worst case does not exist. In other cases, it may be a set of queries s.t. each query covers exactly one tuple. The number of queries in the explanation equals the number of errors (as forexp8 in Example 3), making the explanation hard to consume.
To allow more flexibility, we drop the strict requirement over the precision of the solution and allow it to cover some clean tuples. We argue that explanations such as exp7 can better highlight problems over the sources and are easier to consume. More specifically, we introduce a weight function for a queryq, namelyw(q), that depends on the number of clean and erroneous tuples that it covers:
w(q) =|E(R)\cover(q)|+λ∗ |cover(q)∩ C|
whereC is the set of clean tuples, w(E) is the sumw(q1) + . . .+w(qn),qi∈ Ethat we want to minimize, and the con-stantλhas a value in [0,1]. The weight has two roles. First, it favors queries that cover many errors (first part of the weight function) to minimize the number of queries to ob-tain full coverage inE. Second, it favors queries that cover few clean tuples (second part). Constantλweighs the rela-tive importance of clean tuplesw.r.t. errors. In fact, if clean and erroneous tuples are weighted equally, selective queries with |cover(q)∩ C| = ∅ are favored, since they are more precise, but they lead to larger size forE. On the contrary, obtaining a smaller explanation justifies the compromise of covering some clean tuples. We set parameterλto the error rate for the scenario, we shall describe in Section 6 how it is computed. We now state the relaxed version of the problem.
Definition3 (Relaxed DPDC). Given a relationR, a corresponding violation table V(R), an Error function for V(R), and a weight function w(E), a solution for the relaxed DPDC problemis an explanation Eopts.t.
Eopt= argmin w(E)
(cover(E)⊇E(R))
When the DPDC problem is solved over the target (resp. sources), it computes descriptive (resp. prescriptive) ex-planations. We can map this problem to the well-known weighted set cover problem, which is proven to be an NP-Complete problem [4], where the universe are the errors in E(R) and the sets are all the possible queries overR.
3.
VIOLATIONS AND ERRORS
While many solutions are available for the standard data cleaning setting, i.e., a database with a set of constraints, we show in this section how the two levels in our framework, namely target and source, make the problem much harder.
3.1
Data Quality Rules
Quality rules can be usually expressed either using known formalisms or more generally through arbitrary code. We thus distinguish between two classes of quality rules over relational databases.
Examples for the first class are conditional functional de-pendencies (CFDs) and check constraints (CCs). Since rules in these formalisms can be expressed with the more general class of denial constraints (DCs), we will refer to this lan-guage in the following and denote such rules with ΣD.1 1Our repair model focuses on detecting problems on the ex-isting data with the big portion of business rules supported
Consider a set of finite built-in operators
B={=, <, >,6=,≤,≥}. A DC has the general form
ϕ:∀tα, tβ, tγ, . . .∈R,q(P1∧. . .∧Pm)
wherePiis of the formv1φv2orv1φ constwithv1, v2of the formtx.A, x∈ {α, β, γ, . . .}, A∈R, andconstis a constant. For simplicity, we use DCs with only one relationSinS.
Example 4. The rules in the running example corre-spond to the following DCs (for simplicity, we omit the uni-versal quantifiers):
c1:q(tα.Shop=tβ.Shop∧tα.Avgsal > tβ.Avgsal∧tα.Grd < tβ.Grd)
c2:q(tα.Size > tβ.Size∧tα.#Emps < tβ.#Emps)
The second class includes rules expressed with arbitrary declarative languages (such as SQL) and procedural code (such as Java programs) [7]. These are alternatives to the traditional rules in ΣD. We denote these more general rules with ΣP. Thus, Σ = ΣD∪ΣP.
Example 5. A rule expressed in Java could use an ex-ternal web service to validate if the ratio of the size of the staff and the size of the shop comply with a local legal policy.
3.2
Target Violation Detection
Given a set of rules Σ, we require that any ruler∈Σ has a functiondetectthat identifies groups of cells (or tuples) that together do not satisfy r. We call such set of cells a violation. We collect all such violations overT w.r.t. Σ in aviolation tablewith the schema (vid, r, tid, att, val), where vidrepresents the violation id,ris the rule,tidis the tuple id,attis the attribute name of the cell, andvalis the value tid.attof that cell. We denote the violation table of a target viewT asV(T). We mineV(T) for explanations in case 1.
For DCs in ΣD, detect can be easily obtained. A DC
states that all the predicates cannot be true at the same time, otherwise, we have a violation. Given a database in-stanceI of schemaS and a DCϕ, ifIsatisfiesϕ, we write I|=ϕ, and we say thatϕis avalid DC. If we have a set of DC Σ,I|= Σ if and only if∀ϕ∈Σ, I|=ϕ.
For rules in ΣP, the output emitted by the code when applied on the data can be used to extract the output re-quired bydetect. In Example 5, in case of non compliance with the policy, the cellsSize, #Emps and Region will be considered as one violation.
3.3
Target Errors Detection
As we mentioned in the introduction (Example 2), the ability to identify actual errors can improve the performance of the system (case 2). We can rely on the literature on data repairing as a tool to identify the errors in a database. If a cell needs to be changed to make the instance consistent, then that cell is considered as an error.
Repair refers to the process of correcting detected viola-tions. Several algorithms have been proposed for repairing inconsistent data, mainly based on declarative data quality rules (such as in ΣD) [1, 10, 3]. These rules naturally have a static semantics for violation detection (as described above) and a dynamic semantics to remove them. This can be mod-eled with a repair function. Given a violation for a certain by DCs. However, more complex repair models for missing tuples, such as [11], can be supported with extensions.
rule, this function outputs an update to the database to sat-isfy the violations identified by the correspondingdetect.
For rules in ΣP, the repair function must be provided [7]. If such a function is not available (as in many cases), our explanations will be limited to violations and their lineage, cases 1and 4, respectively. For a DC in ΣD, computing its repair function is straightforward: the repair function is the union of the inverse for each predicate in it.
Example 6. Given the rules in the running example, a repair function would compute the following updates: repair(c1): (tα.Shop6=tβ.Shop)∨(tα.Avgsal≤tβ.Avgsal)∨ (tα.Grd≥tβ.Grd)
repair(c2) : (tα.Size≤tβ.Size)∨(tα.#Emps≥tβ.#Emps) The repair problem (even with FDs only) is known to have NP complexity [15]. Heuristic algorithms to compute repairs identify the minimal number of cells to change to obtain an instance that conforms to the rules. More precisely, for a violation tableV(T) and the repair functionsF=f1, . . . , fn for n rules in Σ, Repair(V(T),F) computes a set of cell updates on the database s.t. it satisfies Σ. While we are not interested in the actual updates to get a repair, we consider the cells to be updated by the repair algorithm to be the likely errors, thereforeErrorcoincides withrepair.
3.4
From Target to Sources
We have introduced how violations and errors can be de-tected over the target. Unfortunately, a target rule can be rewritten at the sources only in limited cases. This is not possible for the rules expressed as Java code in ΣP as we treat them as black-boxes. For rules in ΣD, the rewriting depends on the SQL script in the transformation. Rules may involve target attributes whose lineage is spread across mul-tiple relations (as in Example 1), thus the transformation is needed in order to apply them. An alternative approach is to propagate the violations from the target to source at the instance level. However, going from the target to the sources introduces new challenges.
T Shop avgHrs
ta NY1 23
tb NY2 25
Shif ts Sid Hours Week Clerk
t1 NY1 20 11 John t2 NY1 20 11 Anne t3 NY1 30 12 Anne t4 NY1 30 12 John t5 NY1 22 13 John t6 NY1 22 13 John t7 NY1 17 14 John t8 NY2 20 11 Laure t9 NY2 30 11 Bill
Figure 3: Average Hours by Shop.
Example 7. Consider a source relationShiftsand a tar-get relationT (Figure 3) obtained with the following query: SELECT SId as Shop, AVG(Hours) as avgHrs
FROM Shifts where SID like ‘NY%’ GROUP BY SId
Given the check constraint¬(avgHrs <25)overT, tupleta
is a violation. By removing its lineage (t1−t7), the violation
is removed. However, we are interested in identifying most likely errors and considering the entire lineage may not be necessary. In fact, it is possible to remove the violation by just removing a subgroup of the lineage. In particular, all the subsets of size 1 to 4 involvingt1, t2, t5, t6, t7are possible
alternatives, whose removal removes the violation onta. It is easy to see that the lineage of the violation leads to the problematic tuples over the source. Computing a repair
on the source requires a new repair algorithm such that by updating some source tuples, the results of the query change and satisfy the constraints. This is always possible, for ex-ample by removing the entire lineage. However, similarly to the target level, the traditional concept of minimality can still guide the process of identifying the source tuples that need to change. There are two motivations for this choice. First, treating the entire lineage as errors is far from the re-ality for a query involving a large number of tuples. Second, considering the entire lineage for explanation discovery will not help in finding meaningful explanations. Unfortunately, it is known that computing all the possible subsets of such lineage is a NP problem even in simpler settings with one SPJU query [5]. We can easily see from the example how the number of subsets can explode.
The above problem shows the impossibility of computing a minimal repair for the target violations over the sources. However, we are interested in identifying the source error tuples in order to discover explanations, not in computing a target repair. Thus, the sourceError module will use the minimality principle, without the need to compute a target repair. In Section 4, we introduce scoring functions to quantify the importance of source tuples w.r.t. target violations. We then use these scores in two algorithms that return the most likely error source tuples (Section 5).
4.
EVIDENCE PROPAGATION
The evidence propagation module involves two tasks. The first task is to trace the lineage of tuples in violations at the target to source tuples. To this end, we implemented inverse query transformation techniques proposed by Cui et al. [6]. The second task is to determine how to propagate violations as evidence over the source.
For each tuple tin a violationv ∈V(T), we denote the cells intthat are involved invas problematic cells. These cells are in turn computed from some source cells, also la-beled as problematic. To solve a violation, we consider the delete operation over the sources. However, as discussed above, we do not want to identify the minimal groups of problematic source tuples that need to be removed. On the contrary, we take a practical approach. We look at tuples individually by using two scores that quantify the effect of source tuples and source cells in the lineage of each violation.
Cells Contribution. Given a violation v, we want to measure how much the value in each problematic source cell contributes tov. In fact, not all problematic source cells contribute equally tov.
Example 8. The first violation in Example 1 covers problematic tuples ta and tb and the problematic cells over
attributes Shop, Grd, AvgSal. The cells are in turn computed fromt11.Sid, t1-t4.Grd,t1-t4.SId, andt1-t4.Sal. One of the
predicates that trigger the violation istb.AvgSal>ta.AvgSal.
Tupletb.AvgSal is computed from t1.Sal, t3.Sal andt4.Sal.
Among them, the high value oft4.Sal is a more likely cause
for the violation thant1.Sal ort3.Sal.
Tuples Removal. Wrongly joined tuples can trigger an extra tuple in the result of a query, thus causing a violation in the target. We want to measure how much removing a problematic source tuple removesv.
Example 9. Let us assume that the correct value for t1.SId is a shop different from NY1, say NY2. Erasingt1
removes the violation for the second rule in Example 1 (the two stores would have the same number of employees), even though NY1 as a value is not involved in the violation.
We derive from sensitivity analysis [14] our definitions of contribution and removal scores. The intuition is that we want to compute the sensitivity of a model to its input. In general, given a function, the influence is defined by how much the output changes given a change in one of the input variables. In our context, the models are the operators in the SQL query, which take a set of source tuples as input and output the problematic tuples in the view.
Definition 4. Acontribution scorecsv(c) of a
problem-atic source cellcw.r.t. a target violationvis defined as the difference between the original result and the updated output after removing cdivided by the number of cells that satisfy the SQL operator.
A removal score rsv(t) of a problematic source tuple t w.r.t. a target violation v is 1 if by removing c, v is re-moved, 0 otherwise.
A score vectorCSV of a cell for contribution scores (RSV of a tuple for removal scores) is a vector [cs1, . . . , csm] ([rs1, . . . , rsm]), where m is the number of violations and cs1, . . . , csm ∈ R(rs1, . . . , rsm ∈B). If a problematic cell
or tuple does not contribute to a certain violation, we put an empty field in the vector. We will omit the subscript if there is no confusion.
We assume that the transformation is a SPJAU query. We compute CSVs and RSVs using the query tree. For an SPJAU query, every node in the tree is one of the following five operators: (1) selection (S), (2) projection (P), (3) join (J), (4) aggregation (A), and (5) union (U).
Example 10. Figure 5 shows the query tree for our run-ning example. It has three operators: (1) the ./operator, (2) the aggregationoperator with the group by, and (3) the projection on columns Sid, Size, Grd, Region, Sal, and Eid(not shown in the figure for the sake of space).
4.1
Computing CSVs
We compute CSVs for cells in a top-down fashion over the operator tree. Each leaf of the tree is a source tuple, with its problematic cells annotated with a CSV. Letvbe a violation inV(T) on a ruler∈P
. We initialize,csof each problematic cell in targetT to 1. LetIlbe anintermediate resultrelation computed by an operatorOl∈ {S, P, J, A, U} at level l of the tree, whose input is a non-empty set of intermediate source relations Inp(Ol) =I1l−1, I
l−1 2 , . . .. In our rewriting, we compute the scores for problematic cells ofInp(Ol) from the cell scores ofIl.
Letcl be a problematic cell inIl,cs(cl) its contribution score,val(cl) its value, andLin(cl, Il−1) its lineage. Proce-dure 1 computescsfor intermediate cells.
Procedure 1. (Intermediate Cell CS): LetIl−1
k be an
in-termediate relation contributing to cell cl. We have two cases for computingcs(cl−1), cl−1∈Lin(cl, Il−1
k ):
(a) If Ol = A (cl is an aggregate cell) and r ∈ ΣD, then cs(cl−1) depends on the aggregate operator op
and on the constraint predicate P ∈ r being violated, P:val(cil)φval(cl0) withφ∈ {<, >,≤,≥}:
• if op ∈ {avg, sum}, then cs(cl−1) is val(cl−1)
P
val(gi),gi∈Lin(cl,Il−1) if φ ∈ {<,≤}, and cs(cl)·(1− val(cl−1)
Pval(g
I1
1 Sid [CSV] Size [CSV] Grd [CSV] Sal [CSV] Eid [CSV] i1 1 NY1[ 1 3,‘’] 46ft 2 [‘’,1 3 ] 1 [ 1 3, 1 3] 91[ 91 300, ‘’] e4 [‘’, 1 3] i12 NY1[1,‘’] 46 ft2 2[1,‘’] 99[0, ‘’] e5 i1 3 NY1[13, ] 46 ft2 [‘’, 1 3] 1 [ 1 3, 1 3] 93[ 93 300, ‘’] e7 [‘’, 1 3] i1 4 NY1[ 1 3, ‘’] 46ft 2 [‘’, 1 3] 1 [ 1 3, 1 3] 116[ 116 300,‘’] e8 [ ‘’, 1 3] i1 5 NY2 62 ft2 [‘’, 1 2] 1 [‘’, 1 2] 89 e11 [‘’, 1 2] i1 6 NY2 62 ft2 2 94 e13 i1 7 NY2 62 ft2 [‘’,12] 1 [‘’, 1 2] 91 e14 [‘’, 1 2] i1 8 NY2 62 ft2 2 98 e18 i1 9 LA1 35 ft2 2 94 $ e19 i1 10 LA1 35 ft2 2 116 $ e20
Figure 4: Procedure 1 Applied on Intermediate SourceI11.
T
Group By(Sid, Size, Grd, Region) ComputeAvg(Sal), Count(eid)
./ Emps.Sid = Shops.Sid Emps I10=Emps Shops I20=Shops I1 1 I2 1 =T
Figure 5: Query Tree.
Emps EId [CSV] Sal [CSV] Grd [CSV] SId[CSV] [RSV] t1 e4 [‘’,13] 91 [30091,‘’] 1 [13,13] NY1 [13,‘’] [0,1] t2 e5 99 [0,‘’] 2 [1,‘’] NY1 [1,‘’] [1,‘’] t3 e7 [‘’,13] 93 [30093,‘’] 1 [13,13] NY1 [13,‘’] [0,1] t4 e8 [‘’,13] 116 [116300,‘’] 1 [13,13] NY1 [13,‘’] [1,1] t5 e11 [‘’,1 2] 89 1 [‘’, 1 2] NY2 [‘’,0] t6 e13 94 2 NY2 [] t7 e14 [‘’,12] 91 1 [‘’,12] NY2 [‘’,0] t8 e18 98 2 NY2 [] t9 e14 94 2 LA1 [] t10 e18 116 2 LA1 []
Figure 6: Procedures 1 and 2 Applied on Emps.
Shops SId [CSV] Size [CSV] [RSV] t12 NY1 [2,‘’] 46 [‘’,1] [1,1] t13 NY2 62 [‘’,1] [‘’,1]
t14 LA1 35 []
Figure 7: Procedures 1 and 2 Applied on Shops.
• if op ∈ {max, min}, let Lin¬P(cl, Ikl−1) ⊆
Lin(cl, Il−1
k ) be the subset of cells that
vio-late P, cs(cl−1) is 1
|Lin¬P(cl,Il−1 k )|
for cl−1 ∈
Lin¬P(cl, Ikl−1), and 0 for all other cells. (b) else,cs(cl−1)iscs(cl)· 1
|Lin(cl,Il−1 k )|
Example 11. Figure 4 reports the CSVs of problematic cells in the intermediate relationI11. These are computed by
rewritingI12, which isT, as shown in Figure 5. For example, tb.Grd is computed from cellsi11.Grd, i13.Grd, and i14.Grd.
By case (b) these cells get a score of 1 3.
Similarly,tb.AvgSalis aggregated fromi11.Sal,i13.Sal, and i1
4.Sal, and ta.AvgSal from i12.Sal. By case (a) the scores
ofi11.Sal,i12.Sal,i13.Sal, andi14.Sal are based on the values
of the cells, as shown in Fig. 4. Score ofi12.Sal is computed
as 0 using the first part of case (a).
Procedure 1 has two cases depending on the query oper-ators and Σ. In case (a), where an aggregate is involved in a violation because of the operator of a rule, we have addi-tional information with regards to the role of source cells in a violation. In case (b), which involves only SPJU opera-tors where the source values are not changed in the target, we uniformly distribute the scores of the problematic cells across the contributing cells. Notice that case (a) applies
for PD
only, since the actual test in PP
is not known. However, case (b) applies for both types of rules.
An intermediate source cell may be in the lineage of sev-eral intermediate problematic cells. In this case, their cell scores areaccumulatedby summation following Procedure 2.
Procedure 2. (Intermediate Cell Accumulation): Let
Ol =O(cl−1, Il) denote the set of all cells computed from
cellcl−1∈Ikl−1 in the intermediate result relationIlby op-eratorO, cs(cl−1)=P
cl∈Olcs(cl
−1, cl).
Algorithm 1:ComputeCSV(T,V(T),S) 1: OT ←Operator that generatedT 2: h←Highest level of query tree 3: Inp(OT)←Ih−1
1 , . . . , I h−1 rh 4: rstate←(T, OT, Inp(OT)) 5: stateStack←newStack() 6: stateStack.push(rstate) 7: foreach violationv∈V(T)do
8: while!stateStack.empty()do
9: nextState←stateStack.pop() 10: if nextState[1] isT then
11: pcells←vc(T){Problematic cells at T}
12: else
13: pcells←Lin(vc(T), nextState(1)){Problematic cells at an intermediate relation}
14: foreach cellc∈pcellsdo
15: computeScores(c, v, l, nextState)
16: foreach intermediate relationIl−1∈nextState[3] do
17: Apply Procedure 2 on problematic cells ofIl−1
18: Ol−1←operator that generatedIl−1 19: newState←(Il−1, Ol−1, Inp(Ol−1)) 20: stateStack.push(newState)
21:
22: function computeScores(c,v,l,nstate)
23: foreach intermediate relationIl−1∈nstate[3]do
24: Apply Procedure 1 onc, nstate[2], Il−1
Example 12. In Fig. 7, CSVs of t12.SId for the
viola-tion betweentaandtbare computed from 4 cells in the
inter-mediate relationI1
1 in Figure 4. Cellsi11.SId,i13.SId,i14.Sid
have a score of 1 3 and i
1
2.Sid has a score 1. Procedure 2
Given a target relation T, its violation table V(T)
and source relations S, Algorithm 1 computes CSVs of
the problematic cells. The algorithm defines a state as a triple (Il, Ol, Inp(Ol)). It initializes the root state (T, OT, Inp(OT)) (line 4), whereOT is the top operator in the tree that computedT. For each violationvand for each problematic cell c, we compute the scores of problematic cells (lineage ofc) in all relations inInp(OT) (Lines 10-13) using Procedure 1 (Line 24). For each intermediate relation inInp(OT), we use Procedure 2 to accumulate thecsscores of each problematic cell and compute its finalcsscorew.r.t. the violationv(Lines 16-17). We then add new states to the stack for each relation inInp(OT). The algorithm computes scores all the way up to source relations until the stack is empty, terminating when all the generated states have been visited. Examples of CSVs are shown in Figures 6 and 7.
Once CSVs are computed for cells, we compute them for tuples by summing up the cell scores along the same viola-tion while ignoring non contributing cells.
4.2
Computing RSVs
In contrast to contribution scores, removal scores are di-rectly computed on tuples and are Boolean. If a violation can be eliminated by removing a source tuple, independently of the other tuples, then such a tuple is important. This heuristics allow us to identify minimal subsets of tuples in the lineage of a violation that can solve it through removal. Instead of computing all subsets, checking for each source tuple allows fast computation.
We use a simple bottom-up algorithm to compute RSVs. It starts with the source tuples in the lineage of a violation. For each source relationS and for each problematic tuple s∈ S, it removes both s and the tuples computed from it in the intermediate relations in the path fromStoT in the query tree. If the violation is removed, we assign a score 1 tosi, 0 otherwise. RSVs for the source relations in the running example are shown in Figures 6 and 7.
5.
LIKELY ERRORS DISCOVERY
Given the target violations, we use our scoring methods to identify the most likely errors at the source (scenarios 3
and 4). Since the goal is to correctly separate the potential error tuples from non-error tuples, we use the intuition that most likely errors are expected to have higher scores. A top-kanalysis of the tuples’ scores for each violation can identify potential errors. However, there is nok that works for all scenarios.
We present two approaches to solve this problem. In the first approach, we design an outlier function to separate high and low scoring tuples for each violation. In the second approach, we show a reduction from the facility location problem and apply a polynomial time logn-approximation algorithm to compute the likely source errors [12].
5.1
Distance Based Local Error Separation
In several cases (such as queries with aggregates), source tuples in the lineage of a violation consist of a subset of tuples that have high scores based on our scoring model, while the remaining have low scores. To precisely measure the distance between tuples, we define it as follows.
Definition 5. Given two source tupless1ands2inv, we
define theirdistanceas:
D(s1, s2) =|(csv(s1) +rsv(s1))−(csv(s2) +rsv(s2))| Two tuples with high scores are expected to have a smaller distance between them than the distance between a high-scoring tuple and a low-high-scoring one. Our goal is to obtain an optimal separation between high- and low-scoring tuples. LetHv be the set of high-scoring tuples andLv the set of low-scoring ones. Intuitively, a separation is preferable to another one if by adding a tuples∈LvtoHv, the difference between the sum of pair-wise distances among all tuples of Hv∪ {s}and the sum of their scores becomes smaller. The intuition is clarified in the following gain function.
Definition 6. Let the score of a tuplesi for violation v
becv(si) = (csv(si) +rsv(si)). LetLin(v, S)consists of the
lineage tuples ofvinS andLvbe a subset ofLin(v, S). We
define theseparation gainof Lv as:
SG(Lv) = X s∈Lv (cv(s))− X 1≤j<|Lv| X j<k≤|Lv| D(sj, sk)
We define an optimal separation as the one that maximizes this function forHv.
Example 13. Consider six source tuples for a viola-tion v having scores {s1:0.67, s2:0.54, s3:0.47, s4:0.08, s5:0.06,s6:0.05}. The sum of pair-wise distances forHv=
{s1, s2, s3} is 0.24, while the sum of scores is 1.68, thus SG(Hv)=1.44. If we adds4 toHv, the pair-wise distances
of Hv0 :{s1, s2, s3, s4} raises to 1.67 and the sum of scores
to 1.76. Clearly, this is not a good separation, and this is reflected by the low gainSG(Hv0)=0.08. Similarly, if we
re-moves3 fromHvthe new SGalso decreases to 1.14.
As it is exponential in the number of subsets to obtain an optimal separation, we provide a greedy heuristic to com-pute its approximation using ideas from the nearest neigh-bor chain algorithm for agglomerative clustering [18]. We first order all the tuples inLin(v, S) in the descending order of their scores, and designate each tuple as its own cluster. We start with the highest scoring tuple’s cluster, and keep adding to it the next tuple in the order, while computing the separation gain at each step. We terminate after reaching a separation where the gain attains a local maximum. We generate two clusters, the cluster that is being extended and the subset of tuples that are not in this cluster. In Exam-ple 13, the gain after addings1,s2, ands3 is 0.6, 1.14, and 1.44, respectively. After addings4, the gain becomes 0.08 and therefore we stop ats3. From each violationv, itsHvis added to the set of most likely source error tuples. The algo-rithm requires a linear space and quadratic time (pair-wise distances) in the number of tuples.
5.2
Global Error Separation
Since we have multiple violations, instead of looking at scores locally per violation, we introduce an alternative ap-proach that looks for the most likely error tuples globally. We accumulate evidences coming from multiple violations as in the following example.
Example 14. Consider two violations v1 and v2, and
four source tupless1–s4. Let the scores of the tuples bev1:
the most likely error tuple forv1 ands3 is the one forv2 as
it is the one that contributes most over the two violations. The goal is to select a subset of tuples that globally con-tribute the most to the violations. We can formulate this problem using the known NP-Hard uncapacitated facility location problem (FLP) [17]. The uncapacitated facility lo-cation problem is described as follows.
a. a setQ={1, . . . , n}of potential sites for locating fa-cilities,
b. a setD ={1, . . . , m}of clients whose demands need to be served by the facilities,
c. a profitcqdfor eachq∈ Qandd∈ Dmade by serving the demand of clientdfrom the facility atq,
d. a non-negative costfq for eachq∈ Qassociated with opening the facility at siteq.
The objective is to select a subsetQ⊆ Qof sites to open facilities and to assign each client to exactly one facility s.t. the difference of the sum of maximum profit for serving each client and the sum of facility costs is maximized, i.e.,
argmax Q⊆Q (X d∈D max q∈Q(cqd)− X q∈Q fq)
We obtain a reduction from the FLP to the problem of computing most likely errors in PTIME in the number of violations and in the number of source tuples. For each client d, we associate a violationvj∈V(T). LetLin(V(T), S) =
∪vj∈V(T)Lin(vj, S), n =|Lin(V(T), S)|, and m =|V(T)|. For each siteq, we associate a source tuple inLin(V(T), S). For each tuple s in Lin(vj, S), we associate the cost cqd between siteq(s) and clientd(vj) with the score (csj(s) + rsj(s)). We assume the fixed cost fq of covering a source tuple to be 1. A solution to our problem is optimal if and only if a solution to the facility location problem is optimal. We present a greedy heuristic [17] for this problem as follows. We start with an empty setQof tuples, and at each step we add toQ a tuples ∈ Lin(V(T), S)\Q that yields the maximum improvement in the objective function:
f(Q) =X d∈D max q∈Q(cqd)− X q∈Q fq
For a tuple s ∈ Lin(V(T), S))\Q, let ∆s(Q) = f(Q∪
{s})−f(Q) denote the change in the function value. For a violationvj, letuj(Q) be max
s∈Q(csj(s) +rsj(s)), anduj(∅) = 0. Letδjs(Q) =csj(s) +rsj(s)−uj(Q). Then, we write ∆s(Q) as follows: ∆s(Q) =f(Q∪ {s})−f(Q) = X vj∈V(T) ( δjs(Q) ifδjs(Q)>0 0 otherwise −1 (1)
The −1 corresponds to the cost of covering a tuple. In each iteration, of the heuristic, ∆s(Q) is computed for each s ∈Lin(V(T), S))\Q. We add a tuple s whose marginal cost ∆s(Q) is maximum. The algorithm terminates if either there are no more tuples to add or if there is no suchswith ∆s(Q)>0.
The algorithm identifies tuples whose global (cumulative) contributions (to all violations) is significantly higher than others. This global information leads to higher precision compared to the distance based error separation, but to a lower recall if more than one tuple is involved in a violation.
Favor precisionw.r.t. to recall is desirable, as it is easier to discover explanations from fewer errors than discover them
from a mix of error and clean tuples. This will become
evident in the experiments.
6.
EXPLANATION DISCOVERY
The problem of explanation discovery pertains to select-ing an optimal explanation of the problematic tuples from a large set of candidate queries. An optimal explanation cov-ers the most likely error tuples, while minimizing the number of clean tuples being covered and the size of the explanation. We compute optimal explanations in two steps. We first determine candidate queries. We then use a greedy algo-rithm for the weighted set cover, with weights based on the function over queryq defined in Section 2.
Candidate Queries GenerationThe goal is to generate the candidate queries for a sourceSwithddimensions. The algorithm first generates all queries with a single predicate for each attributeAlofR, s.t. the queries cover at least one tuple inE(R). A data structureP[1..d] is used to store the queries of the respective attributes. The algorithm then has a recursive step in which queries of each attribute (Al0) are expanded in a depth-first manner by doing a conjunction with queries of attributes Al. . . Ad where l= l0+ 1. The results of the queries are added to a temporary storageP0 and are expanded in the next recursive step.
Computing Optimal Explanations In the second stage, we compute the optimal explanation from the gener-ated candidate queries. In Section 2, we defined the weight associated with each query as follows.
w(q) =|E(R)\cover(q)|+λ∗ |cover(q)∩ C|
Our goal is to cover inE the tuples inE(R), while mini-mizing the sum of weights of the queries inE. An explana-tion is optimal if and only if a soluexplana-tion to the weighted set cover is optimal. By using the greedy algorithm for weighted set cover [4], we can compute a log(|E(R)|)-approximation to the optimal solution. The explanation is constructed incrementally by selecting one query at a time. Let the marginal cover of a new queryq w.r.t. E be defined as the number of tuples from (R) that are inq and that are not already present inE:
mcover(q) = (q∩E(R))\(E ∩E(R))
Algorithm 2:GreedyPDC(candidate queriesPoverR) 1: Eopt← {}
2: bcover(Eopt)← {}
3: whilebcover(Eopt)<|E(R)|do 4: minCost← ∞
5: min q←null
6: foreach queryq∈ Pdo
7: cost(q)← mcover(q)w(q)
8: if cost(q)≤minCostthen
9: if cost(q)=minCostand
bcover(q)< bcover(min q)then
10: continue to next query
11: min q←q
12: minCost=cost(q) 13: Addmin qtoEopt
At each step, Algorithm 2 adds toE the query that min-imizes the weight and maxmin-imizes the marginal cover. Let bcover(q) =E(R)∩cover(q), similarly for bcover(Eopt).
Parameterλ weighs the relative importance of the clean tuplesw.r.t. errors. In practice, the number of errors in a database is a small percentage of the data. If clean and er-roneous tuples are weighted equally in the weight function, selective queries that do not cover clean tuples are favored. This can lead to a large explanation size. We set the pa-rameterλto be the error rate, as it reflects the proportion between errors and clean tuples. If the error rate is very low, it is harder to get explanation with few clean tuples, thus we give them a lower weight in the function. If there are many errors, clean tuples should be considered more important in taking a decision. For mining at the source level ( 3 and
4
in Figure 2), we estimate the error rate by dividing the number of likely errors by the number of tuples in the lin-eage of the transformation (either violations or errors from the target).
7.
EXPERIMENTS
An end-to-end evaluation of our system requires a setup with one or more source relations and a set of target schemas on which business rules can be defined. In Sec. 7.1, we test the quality of error and explanation discovery modules with datasets from the TPC-H benchmark. In Sec. 7.2, we com-pare the explanations computed byDBRxagainst two alter-native systems on five real-world datasets.
7.1
Synthetic Dataset
The TPC-H Benchmark data generator defines a general schema typical of many businesses. We picked two represen-tative queries as target schemas, namely Q3 and Q102, and defined three scenarios with the following rules in ΣD: Scenario S1.cQ10:q(tα.revenue > δ1).
Scenario S2.c0Q10:q(tα.name=tβ.name∧
tα.c phone[1,2]6=tβ.c phone[1,2]). Scenario S3.cQ3:q(tα.revenue > δ2),
c0Q3:q(tα.o orderdate=tβ.o orderdate∧ tα.o shippriority6=tβ.o shippriority). RulescQ10 andcQ3 are check constraints over one tuple, while c0Q10 and c
0
Q3 are FDs over pairs of tuples. In these scenarios, we focus on ΣD to show the impact of (i) the repair computation over the target and of (ii) the role of
Errorin the source. We use ΣP in the real-data study.
Error Induction on TPC-H.We generate instances and queries usingdbgenandqgen tools, respectively. We assign values to parameters in the rules s.t. the reports have no violations. We then add errors in the source relations s.t. the reports have violations when recomputed.
Each experiment has a source instanceD, a transforma-tionQ, and target rules Σ. We identify candidate source attributes A based on the lineage of the attributes in the rule. Since our goal is to explain errors, we induce errors s.t. they happen on tuples covered by some pre-set expla-nations, orground explanations. This allows us to test how good we are at recovering these explanations. We induce errors for a given ground explanation over the source, such asEg ={q1 : (lineitem.l ship=R), q2 : (lineitem.l ship= S)}, while enforcing that attributes inEg are not inA. 2
The TPC-H documentation [20] contains the SQL code.
We consider two parameters for inducing errors w.r.t. ground explanation Eg: source error rate e (ratio of num-ber of error tuples to|Lin(V(T))|), and explanation raten (a fixed percentage of e· |Lin(V(T))|). Error rate e cor-responds to the total number of errors to induce, while n corresponds to the number of such error tuples that should satisfy the ground explanation. The remaining errors, i.e., (1−n)·e· |Lin(V(T))|, are induced on random tuples which do not match the ground explanation (Lin(V(T))\ Eg).
For eachr∈ΣDand for each predicateP ∈r, we identify the corresponding attributes AP and tuples in cover(Eg), and modify their values up to the budget of errors. We make sure that errors are detectable over the target schema. In scenarios S1 and S2, one error tuple at the source is sufficient to detect a target violation, while for S3 we need to introduce two or three errors to induce a target violation.
Metrics. We introduce two metrics to testDBRx. For each metric, we show how to compute precision(P) and recall (R). Error Discovery Quality – evaluates the quality of the likely errors discovery and the scoring. We compare the er-rors computed byErrorover the lineage versus the changes introduced in the errors induction step (B).
PErr=
E(T)∩ B
E(T) RErr=
E(T)∩ B B
Explanation Quality– evaluates the quality of the discov-ered explanations. We measure the quality of an explana-tion computed by DBRx by testing the tuples overlap with the ground explanations.
PE=
cover(Eopt)∩cover(Eg) cover(Eopt)
RE=
cover(Eopt)∩cover(Eg) cover(Eg) Algorithms. We implemented the algorithms introduced in the paper and baseline techniques to compare the results. For scoring, we use the technique based on outliers detec-tion (Local Outliers) and the one based on the facility location problem (Global-FLP). As baselines, we consider all the tuples in the lineage with the same base score 1 (
No-Let), and the tuple(s) with the highest score for each vio-lation (Top-1). For explanation discovery, we implemented Algorithm 2.
Results. We discuss three experiments designed to mea-sure the effectiveness and efficiency of the modules inDBRx. Since we can compute target repairs for these scenarios, we discuss mining on propagated target errors ( 3) and mining on propagated target violations ( 4). All measures refer to the relations where errors have been introduced.
Experiment A: Quality of Error Discovery.We start testing the quality of the alternative Error implementa-tions. For space reason we report the error F-measure.
In ExpA-1, we fix the queries in the ground explana-tion and increase the error rate without any random errors (n = 1). We start by discussing the results for the case of the rewriting of the target violations ( 4). Figure 8a shows that all the methods perform well for S1, with the exception ofNo-LET. This shows that computing errors is easy when there is only a simple check constraint over an
aggregate value. Figure 8b shows that only Global-FLP
obtains high F-measure with S2. This is because it uses the global information given by the many pair-wise violations. Figure 8c shows that for S3, which has multiple constraints and multiple errors, the methods obtain comparable results. However, a close analysis shows that, despite that having
(a) Errors F,S14 (b) Errors F,S24 (c) Errors F,S34 (d) Errors F,S23
(e) Expl. F,S14 (f) Expl. F,S24 (g) Expl. F,S34 (h) Expl. F,S23
(i) Err. F, Random,S34 (j) Expl. F, Random,S14 (k) Expl. F, Random,S24 (l) Expl. F, Random,S34
Figure 8: Experimental results for the prescriptive data cleaning problem. multiple errors violate the hypothesis of Global-FLP, it
still achieves the best precision, while the best recall is
ob-tained by No-LET. Figure 8d shows again S2, but
com-puted on the rewriting of the target repair ( 3). Compared to Figure 8b,Global-FLPdoes slightly worse, while all the others improve. This reflect the effect of the target repair at the source level: it gives less, but more precise, informa-tion to the error discovery. This provides less context to
Global-FLP, which takes advantage of the larger amount of evidence in 4. Similar behavior is observed in S3.
InExpA-2, we study how having 50% random errors af-fects the quality of error discovery. Figure 8i reports the error F-measure for S3. Despite the random errors, the re-sults do not differ much from the simpler scenario in Figure 8c. We observed similar results for S1 and S2.
Experiment B: Quality of Explanations. We test the quality of the explanations with differentErrormodules.
InExpB-1, we study how increasing the error rate affects the results. Figure 8e shows that all the errors detection methods have similar results for S1, with the exception of
No-LET. This is not surprising, as in this scenario errors are easy to identify and the size of the aggregate is large. Figure 8f reflects the quality of the error detection ofGlobal-FLP (as seen in Figure 8b) on the quality of the explanation for S2 8g shows that the precision in error detection has higher impact than the recall for the discovery of explanation. As discussed for Figure 8c,Global-FLPhas the highest preci-sion for S3, but the lowest recall. This shows that is better to identify fewer errors with higher confidence. Figure 8h
shows the explanation F-measure for S2 on the rewriting of the target repair ( 3). Compared to Figure 8f, most of the methods improved their quality, while Global-FLP’s quality decreased. This is a direct consequence of the detect error quality shown in Figure 8d. Examples of ground and discovered explanations are reported in Figure 9.
InExpB-2, we study how having 50% random errors af-fects the quality of the discovered explanations. Figures 8j and 8k show the explanation’s F-measure for scenarios S1 and S2, respectively. Despite the error discovery did not change much with the random noise, the discovery of ex-planation is affected in these two scenarios, as it is clear from the comparison with Figures 8e and 8f. This behaviour shows that the quality of the explanations is only partially related to the error detection and that a large amount of er-rors that cannot be explained can make hard the discovery of existing explanations. Fortunately, results on real data show that useful explanations can still be discovered. More-over, Figure 8l shows consistent result for S3w.r.t. the case without random errors (Figure 8g). This shows that the ac-cumulation of errors from two different quality rules has a strong effect even in noisy scenarios.
Experiment C: Running Time. We measured the av-erage running time for TPC-H data of size 10 and 100 MB. For the 100 MB dataset and S1, the average running time across different error rates is 100.29 seconds for rewriting the violations and computing their score. The average running time for the Error function is less than 2 seconds. The pattern mining including the candidate pattern generation
Exp. Ground Explanation No-LET Top-1 Local Outliers Global FLP S1 4 • l shipmode= RAIL∧ l shipinstruct= TBR • l shipmode=SHIP∧ l shipinstruct=DIP • l returnf lag=R • l shipmode=RAIL∧ l shipinstruct=TBR • l shipmode=SHIP∧ l shipinstruct=DIP • l shipmode=RAIL∧ l shipinstruct=TBR • l shipmode=SHIP∧ l shipinstruct=DIP • l shipmode=RAIL∧ l shipinstruct=TBR • l shipmode=SHIP∧ l shipinstruct=DIP S2 3 • c mktsegment= HOU SE∧c author=a1 • c mktsegment= AU T O∧c author=a2 • c nationkey= 3 • c nationkey= 20 • c nationkey= 16 • c nationkey= 3 • c nationkey= 20 • c nationkey= 16 • c nationkey= 3 • c nationkey= 20 • c nationkey= 16 • c mktsegment= HOU SE∧c author=a1 • c mktsegment= AU T O∧c author=a2 Figure 9: Explanation output for scenarios S1 and S2.
took 52 seconds. The results for S2 and 100 MB vary only in the rewriting module, as it took 430 seconds because of the large number of pair-wise violations. The execution times for 10 MB are at least 10 times smaller with all modules.
7.2
Real Data
We run DBRx using the Global-FLP technique on five
real-world scenarios with different types of data quality rules. In some of the scenarios, we also compareDBRx with Scorpion [21] and the technique on tracing data errors [16]. Since ground explanations are not available, we measure the output quality only in terms of precision of the explanations. We manually mark an explanation as correct based on our own knowledge of the data. The precision is the number of correct queries in the optimal explanation over the total number of queries in the explanation.
Datasets and Data Quality Rules on Target. The five scenarios we evaluate are described as follows:
t sensors[21] The source consists of sensor data with 2.3M tuples over 7 attributes. The target is a transformation that averages temperatures grouped by hour over a selec-tion of dates. The rule over the target is a check constraint (avg(temperature)<23).
elections[21] The source contains 18 months campaign ex-penses from the 2012 US Presidential Election in a 14 at-tributes and 116K tuples table.The target reports total ex-penses of Barack Obama on each date, and the rule con-strains this amount to be less than $10M on any date. p sensors[16] In this scenario, measurements of nine mobile phone sensors are recorded and classifiers are defined to de-termine five Boolean variables. The classifiers act as trans-formations and the measurements as source data. Classifiers that do not determine the variables correctly due to input errors are identified by comparing their output against a ground truth with a rule expressed as a Java code in ΣP.
stocksWe constructed two tables about stocks. The first, namely Q, contains daily stock quotes of S&P500 compa-nies for a two-month period from a trusted source with 30K tuples. The second table is built by extracting mentions of companies and their stock quotes from 45k Bloomberg arti-cles during the same time period. On every page, we ran two regular expression based extractors (E1 and E2) and an in-duction based one (E3). We mapped the two sources to the target schema [company, date, stock price, src], where the attributesrc was eitherextracted or master. We cleaned the output of E2 to ensure that its tuples are correct. A target rule in Java code (ΣP) states that two tuples are in violation if, given the same company and the same date, the difference among the price values is higher than 10% of the
one coming fromQ.
playersIn this scenario, we obtained data with information about soccer players from 6 web sites. The data has 7 at-tributes. The transformation is the union of the 6 sources
with an attributesourcethat keeps track of the source rela-tion. The target rule isname→birthdate.
Figure 10: Precision of explanations with real datasets.
Algorithms and Results. Figure 10 shows the results we obtained by running DBRxon the five scenarios and for dif-ferent cases. For t sensorsand elections, we compare the output of our mining at the source (case 3) with Scor-pion [21]; we both achieve the same results. Fort sensors, the explanation responsible for high temperatures is sen-sorid=15. For elections, the explanation for high expenses is recipient st=‘DC’ ∧ recipient nm=‘GMMB INC.’. For p sensors, we could only obtain a fragment of the data used in [16]. This data induces errors in one of the observations from the sensor “gps”. We were able to retrieve this pattern by applying our techniques with the same performance of the system in [16].
In scenariosstocksandplayers, we mine both the target and the source and present their results in Fig 10, thus show-ing all of cases targeted byDBRx. Alternative techniques [22, 21, 16] do not apply here, since we have arbitrary SQL in the transformation, and declarative constraints as target rules, thus the ground truth is not available.
For stocks, the system is able to compute explanations with perfect precision when mining on errors E(T), both for cases 2(src=extracted) and 3(extractor=E1∧ extrac-tor=E3), while some mistakes are made when mining the violation (V(T)), in both cases 1and 4. This reflects the impact of a repair of very high quality at the target, which is possible because a reliable source is available.
Things change slightly for the more complicated case of players. Here there are six sources that often disagree and may fail to form clear majorities over correct values for the birthdate of a player. This is reflected in a low precision in all cases, but again cases 2and 3show the positive impact of the error computation at the target. Examples of correct explanations at the source level aresrc=espn∧birthdate=0 andsrc=footymania∧birthdate=0/-1/2000.
8.
RELATED WORK
Provenance. Several proposals tackled the problem of verifying the semantic correctness of data transformations by pointing at anomalies in the results. These have been mainly termed as “Why questions” [2] and “Why-Not
ques-tions” [11, 19]. In the first case, the system finds the origin of some tuples or cells in the results. Provenance can be useful to extendDBRxto transformations expressed as black boxes. In particular, a transformation system (e.g., a Java program) supporting the eager approach (aka bookkeeping) carries extra annotations that can be used to produce ev-idence tables. Our rewriting technique is a lazy approach to provenance that can be readily deployed on an existing system supporting SPJUA queries.
In the second case, we look for explanations about tuples that were expected but are missing from the results. Two models have been proposed for this case. One adjusts the query to provide the desired output [19]. Such model does not apply toDBRxbecause we trust the transformations. The other model explains a missing tupletin terms of insertions to the database s.t. tappears in the result. We can extend
DBRxwith this model by allowing inclusion dependencies in ΣD and implementing existing algorithms [11] in the evi-dence propagation module.
Causality. There have been proposals to discover expla-nations to problematic values in the results of an aggregate query [22, 21] or of a transformation process [16]. These share the same goals as our proposal. However, they have limitations that limit their applicability. Scorpion [22, 21] works on aggregate queries only, and target constraints are limited to one tuple check constraint with a variable and a constant. It lacks the support for arbitrary Σ and arbitrary SQL, which are contributions of our work. CARE [16] re-quires the availability of the ground truth in order to detect errors. This is not realistic in a data cleaning setting. More-over, it requires lineage information and it does not tackle the problem of propagating the evidence to the sources.
Similar attempts over probabilistic databases (e.g., [14]) also rank “sensitive” individual tuples by interest. However, they do not construct explanations based on predicates.
Dependency Propagation. The problem of propagat-ing dependencies is to determine, given a target view over the sources and their dependencies, if another dependency holds on the view. We address the inverse process with SP-JAU views; this is not supported by existing approaches and leads to undecidability [9]. Moreover, our instance-driven rewriting allows extension for scenarios where the transfor-mation is not a query, but a black box with provenance.
View Updates. In the view update problem, the goal is to minimize the number of tuples to delete in the sources, such that the desired change in the target is obtained and no other target tuples are modified. The side-effect free variant of this problem is related to our rewriting from target to source. Unfortunately, the problem is intractable even for views defined in terms of simple SPJU queries [5]. This intractability motivated our scoring scheme; we drop the requirement to solve the view update in an exact fashion and opt for scores that can be computed efficiently.
Data Cleaning. Data cleaning focuses on detecting and repairing errors on a database by using declarative con-straints [3, 15, 7, 10, 1]. In our targetError module, we can use any of these algorithms. However, they rely on prop-erties of the violations that do not apply when the violations are rewritten over the sources. Thus, they cannot be used
at the sourceError module. Extending them to compute
a target repair through updates on the sources requires to solve the view update problem and is thus not tractable.
9.
CONCLUSIONS
Given a view over sources and a set of quality rules over it to identify violations, we introduced explanations both at the target and at the source levels for the problematic data. To make these explanations easy to consume for users, we formulated a problem that minimise their size while guaran-teeing coverage of the violations. The main intuitions behind this work are that (i) violations at the target level can be expressed as evidence of problems over the sources and (ii) summarising such evidence leads to meaningful explanations of the problems. We plan to extend this work by considering multi-level transformations, such as ETL processes, where at each step rules for data cleaning can be enforced.
10.
REFERENCES
[1] G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1):197–207, 2010.
[2] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where.Foundations and Trends in Databases, 1(4):379–474, 2009.
[3] X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. InICDE, 2013.
[4] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of operations research, 4(3):233–235, 1979. [5] G. Cong, W. Fan, F. Geerts, J. Li, and J. Luo. On the
complexity of view update analysis and its application to annotation propagation.IEEE TKDE, 24(3):506–519, 2012. [6] Y. Cui and J. Widom. Practical lineage tracing in data
warehouses. InICDE, pages 367–378, 2000.
[7] M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. Ilyas, M. Ouzzani, and N. Tang. Towards a commodity data cleaning system. InSIGMOD, 2013.
[8] W. Fan and F. Geerts.Foundations of Data Quality Management. Morgan & Claypool Publishers, 2012. [9] W. Fan, S. Ma, Y. Hu, J. Liu, and Y. Wu. Propagating
functional dependencies with conditions.PVLDB, 1(1):391–407, 2008.
[10] F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LlunaticData-Cleaning Framework.PVLDB, 6(9):625–636, 2013.
[11] M. Herschel and M. A. Hern´andez. Explaining missing answers to spjua queries.PVLDB, 3(1):185–196, 2010. [12] D. S. Hochbaum. Heuristics for the fixed cost median
problem.Mathematical programming, 22(1):148–162, 1982. [13] W. H. Inmon.Building the Data Warehouse. John Wiley
Publishers, 2005.
[14] B. Kanagal, J. Li, and A. Deshpande. Sensitivity analysis and explanations for robust query evaluation in
probabilistic databases. InSIGMOD, pages 841–852, 2011. [15] S. Kolahi and L. V. S. Lakshmanan. On approximating
optimum repairs for functional dependency violations. In ICDT, 2009.
[16] A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. InSIGMOD, pages 505–516, 2011.
[17] P. B. Mirchandani and R. L. Francis.Discrete location theory. 1990.
[18] F. Murtagh. Clustering in massive data sets. InHandbook of massive data sets, pages 501–543. Springer, 2002. [19] Q. T. Tran and C.-Y. Chan. How to conquer why-not
questions. InSIGMOD, pages 15–26, 2010.
[20] Transaction Processing Performance Council. The TPC Benchmark H 2.16.0.http://www.tpc.org/tpch, 2013. [21] E. Wu and S. Madden. Scorpion: Explaining away outliers
in aggregate queries.PVLDB, 6(8):553–564, 2013. [22] E. Wu, S. Madden, and M. Stonebraker. A demonstration
of dbwipes: Clean as you query.PVLDB, 5(12):1894–1897, 2012.