Magic Sets and their
Application to Data Integration
Wolfgang Faber, Gianluigi Greco, Nicola Leone
Department of Mathematics, University of Calabria, Italy
{faber,greco,leone}@mat.unical.it
Roadmap
Motivation: Data Integration
Datalog¬
Modularity Results
Magic Sets
Some Experiments
Conclusions
Research Context
EU-funded project: INFOMIX Data Integration
Advanced System Dealing with Incomplete and Inconsistent Information
Builds on Datalog system DLV http://www.dlvsystem.com
Univ. Calabria (Leone, Faber et al.), Univ. Rome (Lenzerini, Rosati et al.), TU Vienna (Eiter, Gottlob et al.),
Rodan (Staniszkis et al.)
Context: Data Integration
Data integration system I = ⟨G, S, M⟩:
G = ⟨Ψ, Σ⟩: global (relational) scheme
– Ψ: relation schemes, Σ: integrity constraints
S = ⟨Ψ′, ∅⟩: (relational) schema of the sources
M: mapping between G and S.
Context: Data Integration
Users issue queries on the global schema, and the system automatically retrieves data from the sources. But:
Data stored in sources may violate global constraints
Retrieved data might be inconsistent.
Techniques for database repairing are needed.
In many settings: co-NP
Datalog¬ for Repairing Data
Idea: Given a data integration system I, construct a Datalog¬ program Π(I) whose stable models are in one-to-one
correspondence with repairs of I.
The cautious consequences of Π(I) coincide with the consistent query answers.
Datalog¬: Current Situation
Competitive Systems: Bottom-Up
Focus on Models, not Query-Answering
Query Optimization Methods?
Datalog¬ Syntax
Rules:
a :- b1, . . . , bk, not bk+1, . . . , not bm.
where a, b1, . . . , bm are atoms
and not denotes default negation. Intuitive reading:
If b1, . . . , bk are true, and bk+1, . . . , bm are not true, then a is true.
Datalog¬ Syntax
Program P: finite set of safe rules.
Base BP: set of all ground atoms constructible from constants and predicates in P.
Ground Program Ground(P): set of rules
obtained by applying all possible substitutions (from variables in P to constants in P) to P.
Stable Model Semantics
An interpretation I ⊆ BP is a model of a program P if it satisfies all rules in Ground(P).
The reduct P^I of a ground program P (w.r.t. I) is the positive ground program obtained by
1. deleting all rules whose negative body is false w.r.t. I,
2. deleting the negative body from the remaining rules.
An interpretation I is a stable model of P iff it is the least model of Ground(P)^I.
Example
The program P1
p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).
has exactly two stable models:
S1 = {p(1), e(1)} and S2 = {q(1), e(1)}
Ground(P1)^S1 = { p(1) :- e(1). e(1). }
Ground(P1)^S2 = { q(1) :- e(1). e(1). }
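The reduct construction and the P1 example can be checked mechanically. The following is a minimal brute-force sketch, not how DLV evaluates programs: it assumes the program is already ground, represents each rule as a hypothetical (head, positive body, negative body) triple of atom strings, and simply tests every subset of atoms.

```python
from itertools import combinations

# Ground(P1): each rule is (head, positive_body, negative_body).
# This triple encoding is an assumption made for illustration.
P1 = [
    ("p(1)", ["e(1)"], ["q(1)"]),
    ("q(1)", ["e(1)"], ["p(1)"]),
    ("e(1)", [], []),
]

def reduct(rules, interp):
    """Drop rules whose negative body is false w.r.t. interp,
    then strip the negative body from the remaining rules."""
    return [(h, pos) for (h, pos, neg) in rules if not (set(neg) & interp)]

def least_model(positive_rules):
    """Least model of a positive ground program by naive fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in positive_rules:
            if head not in model and all(b in model for b in pos):
                model.add(head)
                changed = True
    return model

def stable_models(rules):
    """Test every subset of the atoms occurring in the program (exponential)."""
    atoms = sorted({a for (h, pos, neg) in rules for a in [h, *pos, *neg]})
    found = []
    for k in range(len(atoms) + 1):
        for cand in combinations(atoms, k):
            interp = set(cand)
            if least_model(reduct(rules, interp)) == interp:
                found.append(interp)
    return found

models = stable_models(P1)
print([sorted(m) for m in models])
```

The exhaustive search is only meant to mirror the definition; it is feasible here because Ground(P1) mentions three atoms.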
Example
The program P2
z :- t(1), not z. t(X) :- q(X).
p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).
e(1).
has exactly one stable model: S1 = {p(1), e(1)}
S2 = {z, q(1), t(1), e(1)} is not a stable model, as Ground(P2)^S2 does not contain a rule with z in the head.
Note: z :- t(1), not z. acts like an integrity constraint t(1) ⇒ ⊥, inhibiting any stable model containing t(1).
Brave/Cautious Consequences
A ground atom a is a
brave consequence for P (P |=b a) if a is true in some stable model of P.
cautious consequence for P (P |=c a) if a is true in all stable models.
Note: If no stable model exists, all atoms in BP are cautious consequences, and no atom is a brave
consequence.
Example
p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).
Stable Models: {p(1), e(1)} and {q(1), e(1)}
Brave consequences: p(1), q(1), e(1), cautious consequences: e(1).
z :- t(1), not z. t(X) :- q(X).
p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).
e(1).
Stable Model: {p(1), e(1)}
Brave and cautious consequences: {p(1), e(1)}.
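The two sets of consequences listed above can be reproduced with a small script. This is a sketch under assumptions: both programs are grounded by hand over the single constant 1, rules use a hypothetical (head, positive body, negative body) encoding, and the atoms occurring in the program stand in for the base BP.

```python
from itertools import chain, combinations

def stable_models(rules):
    """Return (atoms, stable models) of a ground program, by brute force."""
    atoms = sorted({a for h, pos, neg in rules for a in [h, *pos, *neg]})
    subsets = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    result = []
    for cand in subsets:
        interp = set(cand)
        # Reduct: keep rules whose negative body is disjoint from interp.
        positive = [(h, pos) for h, pos, neg in rules if not (set(neg) & interp)]
        model, changed = set(), True
        while changed:
            changed = False
            for h, pos in positive:
                if h not in model and set(pos) <= model:
                    model.add(h)
                    changed = True
        if model == interp:
            result.append(interp)
    return atoms, result

def brave(rules):
    """Atoms true in some stable model."""
    _, models = stable_models(rules)
    return set().union(*models) if models else set()

def cautious(rules):
    """Atoms true in all stable models; every atom if there is none."""
    atoms, models = stable_models(rules)
    return set.intersection(*models) if models else set(atoms)

P1 = [
    ("p(1)", ["e(1)"], ["q(1)"]),
    ("q(1)", ["e(1)"], ["p(1)"]),
    ("e(1)", [], []),
]
# P2 adds the constraint-like rule on z and the rule for t.
P2 = P1 + [
    ("z", ["t(1)"], ["z"]),
    ("t(1)", ["q(1)"], []),
]

print(sorted(brave(P1)), sorted(cautious(P1)))
print(sorted(brave(P2)), sorted(cautious(P2)))
```

Dropping the two extra rules of P2 changes the cautious consequences, which is the effect that makes naive subprogram selection unsound for cautious reasoning.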
Queries
Syntax: Query q:
c?
c: atom (with variables)
Brave answers: Substitutions θ s.t. P |=b qθ
Cautious answers: Substitutions θ s.t. P |=c qθ
Query Evaluation
Desideratum: Evaluate only a subprogram relevant to the query
Implicit in top-down methods.
Problem: Not straightforward for query answering using stable models.
Generating subprograms only along head → body is not sufficient.
Example
z :- t(1), not z. t(X) :- q(X).
p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).
e(1).
Generating a subprogram for evaluation of query p(X)?, moving only along “head to body”, we would produce P:
p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).
But then 1 is not a cautious answer for P, while it is for the original program.
Example
z :- t(1), not z. t(X) :- q(X).
p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).
e(1).
z :- t(1), not z. is a rule which should not be dropped
t(1) should be treated like being reached from the query, hence both rules
t(X) :- q(X). and z :- t(1), not z.
should be included in the relevant subprogram.
Dangerous Predicates and Rules
A predicate d is dangerous if
d occurs in a cycle with an odd number of negations, or
d occurs in the body of a rule with a dangerous head predicate.
A rule r is dangerous if its head predicate is dangerous.
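The definition can be turned into a small check on the predicate dependency graph. The sketch below makes two assumptions that are not in the slides: rules are abstracted to hypothetical (head predicate, positive body predicates, negative body predicates) triples, and a "cycle" is read as any closed walk through the predicate, detected by tracking the parity of negations.

```python
def dangerous_predicates(rules):
    """rules: list of (head_pred, positive_body_preds, negative_body_preds)."""
    # Dependency edges head -> body predicate; label 1 = negated, 0 = positive.
    edges = {}
    for head, pos, neg in rules:
        for b in pos:
            edges.setdefault(head, set()).add((b, 0))
        for b in neg:
            edges.setdefault(head, set()).add((b, 1))
    preds = {p for h, pos, neg in rules for p in [h, *pos, *neg]}

    def on_odd_cycle(start):
        # Search over states (predicate, parity of negations seen so far).
        seen, stack = set(), [(start, 0)]
        while stack:
            node, parity = stack.pop()
            for succ, sign in edges.get(node, ()):
                state = (succ, parity ^ sign)
                if state == (start, 1):
                    return True  # back at start after an odd number of negations
                if state not in seen:
                    seen.add(state)
                    stack.append(state)
        return False

    dangerous = {p for p in preds if on_odd_cycle(p)}
    # Close under: body predicates of rules with a dangerous head are dangerous.
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            if head in dangerous:
                body = set(pos) | set(neg)
                if not body <= dangerous:
                    dangerous |= body
                    changed = True
    return dangerous

# Predicate-level view of the running example, plus an even negative cycle a/b.
RULES = [
    ("z", ["t"], ["z"]),
    ("t", ["q"], []),
    ("p", ["e"], ["q"]),
    ("q", ["e"], ["p"]),
    ("e", [], []),
    ("a", [], ["b"]),
    ("b", [], ["a"]),
]
print(sorted(dangerous_predicates(RULES)))
```

In this reading, z is dangerous because of its self-negating rule, the danger propagates through the bodies to t, q, p and e, and the even cycle between a and b stays harmless.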
Independent Sets
An independent set for a ground program is a set S ⊆ BP such that for each a ∈ S:
if a is the head of rule r then all atoms of r are in S, and
if a appears in the body of a dangerous rule r then all atoms of r are in S.
A subprogram T of a program P is a module if T consists of exactly the rules with head atoms from S for an independent set S.
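The two conditions can be checked directly on a ground program. The sketch below is illustrative only: rules are hypothetical (head, positive body, negative body) triples, and the set of atoms heading dangerous rules is supplied by hand (an assumption, derived from the dangerous predicates z, t, q, p, e of the running example).

```python
def is_independent(S, rules, dangerous_heads):
    """Check both conditions of the independent-set definition for S."""
    for head, pos, neg in rules:
        body = set(pos) | set(neg)
        atoms = body | {head}
        # Condition 1: a in S heads rule r => all atoms of r are in S.
        if head in S and not atoms <= S:
            return False
        # Condition 2: a in S occurs in the body of a dangerous rule r
        # => all atoms of r are in S.
        if head in dangerous_heads and (body & S) and not atoms <= S:
            return False
    return True

def module(S, rules):
    """The subprogram consisting of exactly the rules with head atoms from S."""
    return [r for r in rules if r[0] in S]

# Running example grounded over the constant 1, plus the a/b rules.
GROUND = [
    ("z", ["t(1)"], ["z"]),
    ("t(1)", ["q(1)"], []),
    ("p(1)", ["e(1)"], ["q(1)"]),
    ("q(1)", ["e(1)"], ["p(1)"]),
    ("e(1)", [], []),
    ("a(1)", [], ["b(1)"]),
    ("b(1)", [], ["a(1)"]),
]
DANGEROUS_HEADS = {"z", "t(1)", "q(1)", "p(1)", "e(1)"}  # supplied by hand

small = {"p(1)", "q(1)", "e(1)"}
full = small | {"t(1)", "z"}
print(is_independent(small, GROUND, DANGEROUS_HEADS))
print(is_independent(full, GROUND, DANGEROUS_HEADS))
print(len(module(full, GROUND)))
```

The set reached by following head to body only fails the second condition because q(1) occurs in the body of the dangerous rule for t(1); adding t(1) and z repairs it, which matches the earlier observation that both of those rules must be kept.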
Theorems
Let T be a module of P, and let q occur in T.
SM(P)/T ⊆ SM(T).
(T |=c q) ⇒ (P |=c q), and (T |=b q) ⇐ (P |=b q).
Moreover, if P is consistent, then SM(T) = SM(P)/T.
(T |=c q) ⇔ (P |=c q), and (T |=b q) ⇔ (P |=b q).
Evaluation
Optimal: For a query c? use the smallest module containing c.
⇒ infeasible
⇒ use an approximating technique: an adaptation of Magic Sets
Magic-Set Method
Given a query q, and a program P
Focuses on the subset of P which is relevant for q
“Pushes-down” the query constants, to eliminate rule-instances which cannot contribute to the derivation of q
Simulates the top-down evaluation of q
Magic-Set Method
Rewrite P into a query-equivalent program P’:
1. Adorn P (simulate the binding passing).
2. Generate Magic (magic rules identify the relevant atoms).
3. Modify P (limit P to the Magic Set).
Modification for Datalog¬
Rule-by-rule processing
Process also dangerous rules
. . . but only for generating magic rules
. . . by swapping head and body, and applying standard magic generation
Enhanced Magic-Set Algorithm
Input: A Datalog¬ program P, and a query Q = g(t).
Output: The optimized program MS¬(Q, P).
var S: stack of adorned predicates; modifiedRules, magicRules: set of rules;
modifiedRules := ∅; magicRules := BuildQuerySeeds(Q, S);
while S ≠ ∅ do
  pα := S.pop();
  for each rule r ∈ P with H(r) = p(tp) do
    ra := Adorn(r, pα, S);
    magicRules := magicRules ∪ Generate(ra);
    modifiedRules := modifiedRules ∪ {Modify(ra)};
  for each dangerous rule d ∈ P where d is h(th) :- q1(t1), . . . , qm(tm) and qi = p do
    let ds be the rule qi(ti) :- h(th), q1(t1), . . . , qi−1(ti−1), qi+1(ti+1), . . . , qm(tm);
    . . .
Magic Sets: Example
e(1). z :- t(1), not z. t(X) :- q(X).
p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).
a(X) :- not b(X). b(X) :- not a(X).
with query p(1)? yields the following
e(1). z :- t(1), not z. t(X) :- magic_tb(X), q(X).
p(X) :- magic_pb(X), e(X), not q(X). q(X) :- magic_qb(X), e(X), not p(X).
magic_pb(1). magic_tb(X) :- magic_qb(X).
magic_qb(X) :- e(X), magic_pb(X). magic_pb(X) :- e(X), magic_qb(X).
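As a sanity check, the original and the rewritten program can be compared on the query p(1)?. The sketch below does not perform the magic-set rewriting itself; it assumes both programs are grounded by hand over the single constant 1, uses a hypothetical (head, positive body, negative body) rule encoding, and enumerates stable models by brute force.

```python
from itertools import combinations

def stable_models(rules):
    """Brute-force stable models of a ground program."""
    atoms = sorted({a for h, pos, neg in rules for a in [h, *pos, *neg]})
    models = []
    for k in range(len(atoms) + 1):
        for cand in combinations(atoms, k):
            interp = set(cand)
            # Reduct w.r.t. interp, then least model by fixpoint iteration.
            positive = [(h, pos) for h, pos, neg in rules
                        if not (set(neg) & interp)]
            least, changed = set(), True
            while changed:
                changed = False
                for h, pos in positive:
                    if h not in least and set(pos) <= least:
                        least.add(h)
                        changed = True
            if least == interp:
                models.append(interp)
    return models

ORIGINAL = [
    ("e(1)", [], []),
    ("z", ["t(1)"], ["z"]),
    ("t(1)", ["q(1)"], []),
    ("p(1)", ["e(1)"], ["q(1)"]),
    ("q(1)", ["e(1)"], ["p(1)"]),
    ("a(1)", [], ["b(1)"]),
    ("b(1)", [], ["a(1)"]),
]
REWRITTEN = [
    ("e(1)", [], []),
    ("z", ["t(1)"], ["z"]),
    ("t(1)", ["magic_tb(1)", "q(1)"], []),
    ("p(1)", ["magic_pb(1)", "e(1)"], ["q(1)"]),
    ("q(1)", ["magic_qb(1)", "e(1)"], ["p(1)"]),
    ("magic_pb(1)", [], []),
    ("magic_tb(1)", ["magic_qb(1)"], []),
    ("magic_qb(1)", ["e(1)", "magic_pb(1)"], []),
    ("magic_pb(1)", ["e(1)", "magic_qb(1)"], []),
]

def cautious_holds(rules, atom):
    return all(atom in m for m in stable_models(rules))

def brave_holds(rules, atom):
    return any(atom in m for m in stable_models(rules))

for name, prog in (("original", ORIGINAL), ("rewritten", REWRITTEN)):
    print(name, len(stable_models(prog)),
          cautious_holds(prog, "p(1)"), brave_holds(prog, "p(1)"))
```

The rewritten program no longer mentions a and b, so it has fewer stable models, while the brave and cautious status of p(1) is the same in both programs.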
Theorem
Let P be a Datalog¬ program, let Q be a query.
Then, it holds that
MS¬(⟨Q, P⟩) ⊆^c_Q P and MS¬(⟨Q, P⟩) ⊇^b_Q P, and if SM(P) ≠ ∅,
MS¬(⟨Q, P⟩) ≡^b_Q P and MS¬(⟨Q, P⟩) ≡^c_Q P.
Remark: Data Integration Programs Π(I) always have stable models, so we obtain query
equivalence for these!
Demo Scenario
EU Project INFOMIX (IST-2001-33570)
Information system of University “La Sapienza” in Rome.
14 global relations,
29 integrity constraints,
29 relations (in 3 legacy databases) and 12 web wrappers,
More than 24MB of data regarding students,
Experiments
Relative Gain
Conclusion
Optimization for Datalog¬ with stable models
Important for Data Integration
Modularity results for Datalog¬
Magic Sets for Datalog¬
Positive impact on Data Integration Application