Magic Sets and their Application to Data Integration

(1)

Wolfgang Faber, Gianluigi Greco, Nicola Leone

Department of Mathematics, University of Calabria, Italy

{faber,greco,leone}@mat.unical.it

(2)

Roadmap

Motivation: Data Integration

Datalog^¬

Modularity Results

Magic Sets

Some Experiments

Conclusions

(3)

Research Context

EU-funded project: INFOMIX Data Integration

Advanced System Dealing with Incomplete and Inconsistent Information

Builds on the Datalog system DLV (http://www.dlvsystem.com)

Univ. Calabria (Leone, Faber et al.), Univ. Rome (Lenzerini, Rosati et al.), TU Vienna (Eiter, Gottlob et al.), Rodan (Staniszkis et al.)

(4)

Context: Data Integration

Data integration system I = ⟨G, S, M⟩:

G = ⟨Ψ, Σ⟩ global (relational) schema

– Ψ relation schemas, Σ integrity constraints

S = ⟨Ψ′, ∅⟩ (relational) schema of the sources

M mapping between G and S

(5)

Context: Data Integration

Users issue queries on the global schema, and the system automatically retrieves data from the sources. But:

Data stored in sources may violate global constraints

Retrieved data might be inconsistent.

Techniques for database repairing are needed.

In many settings, the complexity of query answering is co-NP.

(6)

Datalog^¬ for Repairing Data

Idea: Given a data integration system I, construct a Datalog^¬ program Π(I) whose stable models are in one-to-one correspondence with the repairs of I.

The cautious consequences of Π(I) coincide with the consistent query answers.

(7)

Datalog^¬: Current Situation

Competitive Systems: Bottom-Up

Focus on Models, not Query-Answering

Query Optimization Methods?

(8)

Datalog^¬ Syntax

Rules:

a :- b_1, ..., b_k, not b_{k+1}, ..., not b_m.

where a, b_1, ..., b_m are atoms and not denotes default negation.

Intuitive reading: If b_1, ..., b_k are true and b_{k+1}, ..., b_m are not true, then a is true.

(9)

Datalog^¬ Syntax

Program P: finite set of safe rules.

Base B_P: set of all ground atoms constructible from constants and predicates in P.

Ground program Ground(P): set of rules obtained by applying all possible substitutions (from variables in P to constants in P) to the rules of P.
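
The construction of Ground(P) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the rule encoding (a list of literals, head first, each literal a triple of negation flag, predicate name and argument tuple), the helper names `is_variable`, `constants_of`, `ground` and `show`, and the extra fact e(2) are all hypothetical and not part of the slides.

```python
from itertools import product

def is_variable(term):
    # Datalog convention: variables start with an uppercase letter.
    return term[:1].isupper()

def constants_of(rules):
    # Collect all constants occurring anywhere in the program.
    return sorted({t for rule in rules for (_, _, args) in rule
                   for t in args if not is_variable(t)})

def ground(rules):
    """Apply every substitution (variables -> constants of the program) to every rule."""
    consts = constants_of(rules)
    grounded = set()
    for rule in rules:
        variables = sorted({t for (_, _, args) in rule for t in args if is_variable(t)})
        for values in product(consts, repeat=len(variables)):
            theta = dict(zip(variables, values))
            grounded.add(tuple((neg, pred, tuple(theta.get(t, t) for t in args))
                               for (neg, pred, args) in rule))
    return grounded

def show(rule):
    # Pretty-print a ground rule in Datalog syntax.
    lit = lambda neg, pred, args: ("not " if neg else "") + pred + "(" + ",".join(args) + ")"
    head, body = lit(*rule[0]), [lit(*l) for l in rule[1:]]
    return head + (" :- " + ", ".join(body) if body else "") + "."

# p(X) :- e(X), not q(X).  q(X) :- e(X), not p(X).  e(1).  e(2).
P1 = [
    [(False, "p", ("X",)), (False, "e", ("X",)), (True, "q", ("X",))],
    [(False, "q", ("X",)), (False, "e", ("X",)), (True, "p", ("X",))],
    [(False, "e", ("1",))],
    [(False, "e", ("2",))],  # hypothetical extra fact, so that two constants exist
]
ground_p1 = ground(P1)
for line in sorted(show(r) for r in ground_p1):
    print(line)
```

Each of the two rules with a variable is instantiated once per constant, while the facts are left unchanged.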

(10)

Stable Model Semantics

An interpretation I ⊆ B_P is a model of a program P if it satisfies all rules in Ground(P).

The reduct P^I of a ground program P (w.r.t. I) is the positive ground program obtained by

1. deleting all rules with a false negative body, and

2. deleting the negative body of the other rules.

An interpretation I is a stable model of P iff it is the least model of Ground(P)^I.
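
The definition can be checked directly on small ground programs. The following Python sketch enumerates all interpretations and keeps those that equal the least model of their reduct. It is a brute-force illustration, not how a system such as DLV computes stable models; the encoding of a ground rule as a triple (head, positive body, negative body) and all function names are assumptions made for this example.

```python
from itertools import chain, combinations

def reduct(rules, interp):
    """Reduct of a ground program w.r.t. interp: a positive program."""
    # Step 1: drop rules whose negative body is false (some negated atom is in interp).
    # Step 2: drop the negative body of the remaining rules.
    return [(head, pos) for (head, pos, neg) in rules if not (set(neg) & interp)]

def least_model(positive_rules):
    """Least model of a positive ground program, by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in positive_rules:
            if head not in model and set(pos) <= model:
                model.add(head)
                changed = True
    return model

def stable_models(rules):
    """All interpretations I with I == least_model(reduct(rules, I))."""
    atoms = sorted({a for (h, pos, neg) in rules for a in (h, *pos, *neg)})
    candidates = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    return [set(c) for c in candidates
            if least_model(reduct(rules, set(c))) == set(c)]

# Ground version of: p(X) :- e(X), not q(X).  q(X) :- e(X), not p(X).  e(1).
GROUND_P1 = [
    ("p(1)", ["e(1)"], ["q(1)"]),
    ("q(1)", ["e(1)"], ["p(1)"]),
    ("e(1)", [], []),
]
# The same program extended with: z :- t(1), not z.  t(1) :- q(1).
GROUND_P2 = GROUND_P1 + [
    ("z", ["t(1)"], ["z"]),
    ("t(1)", ["q(1)"], []),
]
print(stable_models(GROUND_P1))
print(stable_models(GROUND_P2))
```

The enumeration is exponential in the number of atoms, so it is only suitable for examples of this size.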

(11)

Example

The program P₁

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).

has exactly two stable models:

S₁ = {p(1), e(1)} and S₂ = {q(1), e(1)}

Ground(P₁)^S₁ = { p(1) :- e(1). e(1). }

Ground(P₁)^S₂ = { q(1) :- e(1). e(1). }

(12)

Example

The program P₂

z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

e(1).

has exactly one stable model: S₁ = {p(1), e(1)}

S₂ = {z, q(1), t(1), e(1)} is not a stable model, as Ground(P₂)^S₂ does not contain a rule with z in the head.

Note: z :- t(1), not z. acts like an integrity constraint t(1) ⇒ ⊥, inhibiting any stable model containing t(1).

(14)

Brave/Cautious Consequences

A ground atom a is a

brave consequence for P (P |=_b a) if a is true in some stable model of P.

cautious consequence for P (P |=_c a) if a is true in all stable models.

Note: If no stable model exists, all atoms in B_P are cautious consequences, and no atom is a brave consequence.
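
Given the stable models of a program, both kinds of consequences reduce to set operations: brave consequences are the union of the stable models, cautious consequences their intersection (taken over B_P, so that the empty case yields the whole base). A minimal Python sketch, with hypothetical function names and the stable models passed in as plain sets of atom strings:

```python
def brave_consequences(models):
    """Atoms true in some stable model (empty if there is no stable model)."""
    return set().union(*models)

def cautious_consequences(models, base):
    """Atoms of the base true in all stable models (the whole base if there is none)."""
    result = set(base)
    for model in models:
        result &= model
    return result

# Stable models of: p(X) :- e(X), not q(X).  q(X) :- e(X), not p(X).  e(1).
models = [{"p(1)", "e(1)"}, {"q(1)", "e(1)"}]
base = {"p(1)", "q(1)", "e(1)"}
print(brave_consequences(models))
print(cautious_consequences(models, base))
```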

(15)

Example

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).

Stable models: {p(1), e(1)} and {q(1), e(1)}

Brave consequences: p(1), q(1), e(1); cautious consequences: e(1).

z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

e(1).

Stable model: {p(1), e(1)}

Brave and cautious consequences: p(1), e(1).

(17)

Queries

Syntax: a query q has the form

c?

where c is an atom (possibly with variables).

Brave answers: substitutions θ s.t. P |=_b qθ

Cautious answers: substitutions θ s.t. P |=_c qθ

(18)

Query Evaluation

Desideratum: Evaluate only a subprogram relevant to the query

Implicit in top-down methods.

Problem: Not straightforward for query answering using stable models.

Generating subprograms along head → body is not sufficient.

(19)

Example

z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

e(1).

Generating a subprogram for the evaluation of the query p(X)? by moving only along “head to body”, we would produce P′:

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).

But then 1 is not a cautious answer for P′, while it is for the original program.

(21)

Example

z :- t(1), not z. t(X) :- q(X).

e(1).

The rule z :- t(1), not z. should not be dropped.

t(1) should be treated as if it were reached from the query, hence both rules

t(X) :- q(X). and z :- t(1), not z.

should be included in the relevant subprogram.

(24)

Dangerous Predicates and Rules

A predicate d is dangerous if

d occurs in a cycle with an odd number of negations, or

d occurs in the body of a rule with a dangerous head predicate.

A rule r is dangerous if its head predicate is dangerous.
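
The set of dangerous predicates can be computed on the predicate dependency graph. The sketch below is one possible reading of the definition, under these assumptions: edges go from a head predicate to each body predicate and are labelled 1 for negative occurrences and 0 otherwise; "cycle" is interpreted as a closed walk with an odd number of negative edges; rules are given at predicate level as (head, positive body, negative body). The function name and encoding are hypothetical.

```python
from collections import deque

def dangerous_predicates(rules):
    """Predicates on an odd cycle, closed under 'occurs in the body of a dangerous head'."""
    edges = {}
    for head, pos, neg in rules:
        for b in pos:
            edges.setdefault(head, set()).add((b, 0))
        for b in neg:
            edges.setdefault(head, set()).add((b, 1))

    def on_odd_cycle(start):
        # Breadth-first search over (predicate, parity of negations seen so far).
        seen = {(start, 0)}
        queue = deque([(start, 0)])
        while queue:
            node, parity = queue.popleft()
            for succ, label in edges.get(node, ()):
                state = (succ, parity ^ label)
                if state == (start, 1):
                    return True
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
        return False

    preds = {h for h, _, _ in rules} | {b for _, pos, neg in rules for b in (*pos, *neg)}
    dangerous = {p for p in preds if on_odd_cycle(p)}
    # Second condition: body predicates of rules with a dangerous head are dangerous.
    frontier = list(dangerous)
    while frontier:
        head = frontier.pop()
        for succ, _ in edges.get(head, ()):
            if succ not in dangerous:
                dangerous.add(succ)
                frontier.append(succ)
    return dangerous

# Predicate-level view of:
#   z :- t(1), not z.  t(X) :- q(X).  p(X) :- e(X), not q(X).  q(X) :- e(X), not p(X).
#   a(X) :- not b(X).  b(X) :- not a(X).  e(1).
RULES = [
    ("z", ["t"], ["z"]), ("t", ["q"], []),
    ("p", ["e"], ["q"]), ("q", ["e"], ["p"]),
    ("a", [], ["b"]), ("b", [], ["a"]), ("e", [], []),
]
print(sorted(dangerous_predicates(RULES)))
```

Here z lies on an odd cycle, and t, q, p and e become dangerous through the second condition, while a and b only lie on an even cycle and are not dangerous.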

(25)

Independent Sets

An independent set for a ground program is a set S ⊆ B_P such that for each a ∈ S:

if a is the head of rule r then all atoms of r are in S, and

if a appears in the body of a dangerous rule r then all atoms of r are in S.

A subprogram T of a program P is a module if T consists of exactly the rules with head atoms from S for an independent set S.
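
As an illustration of these definitions, the smallest independent set containing a given atom can be obtained by closing the atom under the two conditions above, and the module then consists of the rules whose heads lie in that set. The following Python sketch assumes ground rules encoded as (head, positive body, negative body), a precomputed set of dangerous predicate names, and hypothetical function names; the rules for a and b are a propositional stand-in for an unrelated part of the program.

```python
def smallest_independent_set(rules, dangerous_preds, seed):
    """Close {seed} under the two conditions defining an independent set."""
    def pred(atom):
        return atom.split("(")[0]

    independent = {seed}
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            rule_atoms = {head, *pos, *neg}
            is_head = head in independent
            # A rule is dangerous if its head predicate is dangerous.
            in_dangerous_body = (pred(head) in dangerous_preds
                                 and bool(independent & (set(pos) | set(neg))))
            if (is_head or in_dangerous_body) and not rule_atoms <= independent:
                independent |= rule_atoms
                changed = True
    return independent

def module(rules, independent):
    """Exactly the rules whose head atom is in the independent set."""
    return [r for r in rules if r[0] in independent]

GROUND = [
    ("z", ["t(1)"], ["z"]), ("t(1)", ["q(1)"], []),
    ("p(1)", ["e(1)"], ["q(1)"]), ("q(1)", ["e(1)"], ["p(1)"]),
    ("e(1)", [], []),
    ("a", [], ["b"]), ("b", [], ["a"]),  # unrelated, non-dangerous part
]
DANGEROUS = {"z", "t", "q", "p", "e"}  # assumed to be computed beforehand

S = smallest_independent_set(GROUND, DANGEROUS, "p(1)")
T = module(GROUND, S)
print(sorted(S))
print(T)
```

Starting from p(1), the closure pulls in t(1) and z through the dangerous rules, so the constraint-like rule for z stays in the module, while the rules for a and b are left out.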

(26)

Theorems

Let T be a module of P, and let q occur in T. Then SM(P)/_T ⊆ SM(T), and hence

(T |=_c q) ⇒ (P |=_c q), and (T |=_b q) ⇐ (P |=_b q).

Moreover, if P is consistent, then SM(T) = SM(P)/_T, and hence

(T |=_c q) ⇔ (P |=_c q), and (T |=_b q) ⇔ (P |=_b q).

(28)

Evaluation

Optimal: For a query c? use the smallest module containing c.

⇒ infeasible

⇒ use an approximating technique: an adaptation of Magic Sets

(29)

Magic-Set Method

Given a query q, and a program P

Focuses on the subset of P which is relevant for q

“Pushes-down” the query constants, to eliminate rule-instances which cannot contribute to the derivation of q

Simulates the top-down evaluation of q

(30)

Magic-Set Method

Rewrite P into a query-equivalent program P′:

1. Adorn P (simulate the binding passing).

2. Generate Magic (magic rules identify the relevant atoms).

3. Modify P (limit P to the Magic Set).

(31)

Modification for Datalog^¬

Rule-by-rule processing

Process also dangerous rules

. . . but only for generating magic rules

. . . by swapping head and body, and applying standard magic generation

(32)

Enhanced Magic-Set Algorithm

Input: A Datalog^¬ program P, and a query Q = g(t).

Output: The optimized program MS^¬(Q, P).

var S: stack of adorned predicates; modifiedRules, magicRules: set of rules;

modifiedRules := ∅; magicRules := BuildQuerySeeds(Q, S);

while S ≠ ∅ do

  p^α := S.pop();

  for each rule r ∈ P with H(r) = p(t_p) do

    r_a := Adorn(r, p^α, S);

    magicRules := magicRules ∪ Generate(r_a);

    modifiedRules := modifiedRules ∪ {Modify(r_a)};

  for each dangerous rule d ∈ P where d is h(t_h) :- q_1(t_1), ..., q_m(t_m) and q_i = p do

    let d_s be the rule q_i(t_i) :- h(t_h), q_1(t_1), ..., q_{i-1}(t_{i-1}), q_{i+1}(t_{i+1}), ..., q_m(t_m);

    …

(33)

Magic Sets: Example

e(1). z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

a(X) :- not b(X). b(X) :- not a(X).

with query p(1)? yields the following

e(1). z :- t(1), not z. t(X) :- magic_t^b(X), q(X).

p(X) :- magic_p^b(X), e(X), not q(X). q(X) :- magic_q^b(X), e(X), not p(X).

magic_p^b(1). magic_t^b(X) :- magic_q^b(X).

magic_q^b(X) :- e(X), magic_p^b(X). magic_p^b(X) :- e(X), magic_q^b(X).

(34)

Theorem

Let P be a Datalog^¬ program, and let Q be a query.

Then it holds that

MS^¬(⟨Q, P⟩) ⊆^c_Q P and MS^¬(⟨Q, P⟩) ⊇^b_Q P,

and if SM(P) ≠ ∅, then

MS^¬(⟨Q, P⟩) ≡^b_Q P and MS^¬(⟨Q, P⟩) ≡^c_Q P.

Remark: Data integration programs Π(I) always have stable models, so we obtain query equivalence for these!

(36)

Demo Scenario

EU Project INFOMIX (IST-2001-33570)

Information system of University “La Sapienza” in Rome.

14 global relations,

29 integrity constraints,

29 relations (in 3 legacy databases) and 12 web wrappers,

More than 24MB of data regarding students,

(37)

Experiments

Relative Gain

(38)

Conclusion

Optimization for Datalog^¬ with stable models

Important for Data Integration

Modularity results for Datalog^¬

Magic Sets for Datalog^¬

Positive impact on Data Integration Application