Magic Sets and their Application to Data Integration

(1)

Magic Sets and their Application to Data Integration

Wolfgang Faber, Gianluigi Greco, Nicola Leone

Department of Mathematics, University of Calabria, Italy

{faber,greco,leone}@mat.unical.it

(2)

Roadmap

Motivation: Data Integration

Datalog¬

Modularity Results

Magic Sets

Some Experiments

Conclusions

(3)

Research Context

EU-funded project: INFOMIX Data Integration

Advanced System Dealing with Incomplete and Inconsistent Information

Builds on Datalog system DLV http://www.dlvsystem.com

Univ. Calabria (Leone, Faber et al.), Univ. Rome (Lenzerini, Rosati et al.), TU Vienna (Eiter, Gottlob et al.), Rodan (Staniszkis et al.)

(4)

Context: Data Integration

Data integration system I = ⟨G, S, M⟩:

G = ⟨Ψ, Σ⟩: global (relational) schema, with Ψ the relation schemes and Σ the integrity constraints

S = ⟨Ψ′, ∅⟩: (relational) schema of the sources

M: mapping between G and S

(5)

Context: Data Integration

Users issue queries on the global schema, and the system automatically retrieves data from the sources. But:

Data stored in sources may violate global constraints

Retrieved data might be inconsistent.

Techniques for database repairing are needed.

In many settings: co-NP

(6)

Datalog¬ for Repairing Data

Idea: Given a data integration system I, construct a Datalog¬ program Π(I) whose stable models are in one-to-one correspondence with the repairs of I.

The cautious consequences of Π(I) coincide with the consistent query answers.

(7)

Datalog¬: Current Situation

Competitive Systems: Bottom-Up

Focus on Models, not Query-Answering

Query Optimization Methods?

(8)

Datalog¬ Syntax

Rules:

a :- b1, ..., bk, not bk+1, ..., not bm.

where a, b1, ..., bm are atoms and not denotes default negation.

Intuitive reading: If b1, ..., bk are true, and bk+1, ..., bm are not true, then a is true.

(9)

Datalog¬ Syntax

Program P: finite set of safe rules.

Base B_P: set of all ground atoms constructible from constants and predicates in P.

Ground Program Ground(P): set of rules obtained by applying all possible substitutions (from variables in P to constants in P) to P.

(10)

Stable Model Semantics

An interpretation I ⊆ B_P is a model of a program P if it satisfies all rules in Ground(P).

The reduct P^I of a ground program P (wrt I) is the positive ground program obtained by

1. deleting all rules whose negative body is false wrt I, and

2. deleting the negative body of the other rules.

An interpretation I is a stable model of P iff it is the least model of Ground(P)^I.

(11)

Example

The program P1

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).

has exactly two stable models:

S1 = {p(1), e(1)} and S2 = {q(1), e(1)}

Ground(P1)^S1 = { p(1) :- e(1). e(1). }

Ground(P1)^S2 = { q(1) :- e(1). e(1). }

(12)

Example

The program P2

z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

e(1).

has exactly one stable model: S1 = {p(1), e(1)}

S2 = {z, q(1), t(1), e(1)} is not a stable model, as the reduct Ground(P2)^S2 does not contain a rule with z in the head.

Note: z :- t(1), not z. acts like an integrity constraint t(1) ⇒ ⊥, inhibiting any stable model containing t(1).
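The reduct and stable-model definitions can be checked mechanically on the ground versions of P1 and P2. What follows is a minimal brute-force sketch in Python, not a description of how DLV evaluates programs: it tries every subset of atoms as a candidate and keeps those that equal the least model of their own reduct. The encoding of a ground rule as a (head, positive body, negative body) triple and all function names are assumptions made for this illustration.

```python
from itertools import combinations

# Hypothetical encoding: a ground rule is (head, positive_body, negative_body),
# with atoms written as plain strings.

def reduct(rules, interp):
    """Drop rules with a negated atom that is true in interp; strip negation elsewhere."""
    return [(head, pos) for head, pos, neg in rules if not (set(neg) & interp)]

def least_model(positive_rules):
    """Least model of a positive ground program, by naive fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in positive_rules:
            if head not in model and set(pos) <= model:
                model.add(head)
                changed = True
    return model

def stable_models(rules):
    """Enumerate all interpretations and keep those equal to the least model of their reduct."""
    atoms = sorted({h for h, _, _ in rules}
                   | {a for _, pos, neg in rules for a in list(pos) + list(neg)})
    found = []
    for size in range(len(atoms) + 1):
        for candidate in combinations(atoms, size):
            interp = set(candidate)
            if least_model(reduct(rules, interp)) == interp:
                found.append(interp)
    return found

# Ground(P1) and Ground(P2) from the examples, grounded over the constant 1.
P1 = [
    ("p(1)", ["e(1)"], ["q(1)"]),
    ("q(1)", ["e(1)"], ["p(1)"]),
    ("e(1)", [], []),
]
P2 = P1 + [
    ("z", ["t(1)"], ["z"]),
    ("t(1)", ["q(1)"], []),
]
```

Calling `stable_models(P1)` yields the two models S1 and S2 listed for P1, while `stable_models(P2)` yields only {p(1), e(1)}. The enumeration is exponential in the number of atoms, so it is only suitable for toy programs like these.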

(14)

Brave/Cautious Consequences

A ground atom a is a

brave consequence for P (P |=b a) if a is true in some stable model of P.

cautious consequence for P (P |=c a) if a is true in all stable models.

Note: If no stable model exists, all atoms in B_P are cautious consequences, and no atom is a brave consequence.

(15)

Example

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).

Stable Models: {p(1), e(1)} and {q(1), e(1)}

Brave consequences: p(1), q(1), e(1).

Cautious consequences: e(1).

z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

e(1).

Stable Model: {p(1), e(1)}

Brave and cautious consequences: {p(1), e(1)}.
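The two notions reduce to a union and an intersection over the stable models, with the degenerate case of no stable model handled separately. The sketch below assumes the stable models are already given as Python sets of atom strings; the function names and the explicit `base` argument (standing in for the base of the program) are assumptions of this illustration.

```python
def brave_consequences(models):
    """Atoms true in at least one stable model (empty if there is no stable model)."""
    result = set()
    for model in models:
        result |= model
    return result

def cautious_consequences(models, base):
    """Atoms true in every stable model; the whole base if there is no stable model."""
    result = set(base)
    for model in models:
        result &= model
    return result

# Stable models of the two example programs, as listed on the slide.
BASE_1 = {"p(1)", "q(1)", "e(1)"}
MODELS_1 = [{"p(1)", "e(1)"}, {"q(1)", "e(1)"}]

BASE_2 = {"p(1)", "q(1)", "e(1)", "t(1)", "z"}
MODELS_2 = [{"p(1)", "e(1)"}]
```

Starting the intersection from the full base gives the convention from the note above for free: with an empty list of models, every atom of the base is a cautious consequence and none is a brave one.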

(17)

Queries

Syntax: Query q:

c?

c: atom (with variables)

Brave answers: Substitutions θ s.t. P |=b cθ

Cautious answers: Substitutions θ s.t. P |=c cθ

(18)

Query Evaluation

Desideratum: Evaluate only a subprogram relevant to the query

Implicit in top-down methods.

Problem: Not straightforward for query answering using stable models.

Generating subprograms along head → body dependencies alone is not sufficient.

(19)

Example

z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

e(1).

Generating a subprogram for the evaluation of the query p(X)? by moving only along “head to body”, we would produce P′:

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X). e(1).

But then X = 1 is not a cautious answer for P′, while it is for the original program.

(21)

Example

z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

e(1).

The rule z :- t(1), not z. should not be dropped.

t(1) should be treated as if it were reached from the query, hence both rules

t(X) :- q(X). and z :- t(1), not z.

should be included in the relevant subprogram.

(24)

Dangerous Predicates and Rules

A predicate d is dangerous if

d occurs in a cycle with an odd number of negations, or

d occurs in the body of a rule with a dangerous head predicate.

A rule r is dangerous, if its head is dangerous.
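The definition can be turned into a small graph computation on predicate names. The sketch below is one possible reading, under stated assumptions: the dependency graph has an edge from each head predicate to each body predicate, marked negative when the body literal is negated, and a "cycle" is taken to be any closed walk. Odd cycles are found by searching over (predicate, parity) pairs, and the second condition is then applied as a fixpoint. The rule encoding and function name are hypothetical.

```python
def dangerous_predicates(rules):
    """rules: list of (head, positive_body, negative_body) over predicate names."""
    # Signed dependency graph: head -> (body predicate, 1 if negated else 0).
    edges = {}
    preds = set()
    for head, pos, neg in rules:
        preds.add(head)
        targets = edges.setdefault(head, set())
        for body_pred in pos:
            preds.add(body_pred)
            targets.add((body_pred, 0))
        for body_pred in neg:
            preds.add(body_pred)
            targets.add((body_pred, 1))

    def on_odd_cycle(start):
        # Depth-first search over (predicate, parity of negations seen so far).
        seen = set()
        stack = [(start, 0)]
        while stack:
            pred, parity = stack.pop()
            for nxt, sign in edges.get(pred, ()):
                state = (nxt, parity ^ sign)
                if state == (start, 1):
                    return True  # back at start after an odd number of negations
                if state not in seen:
                    seen.add(state)
                    stack.append(state)
        return False

    # Condition 1: predicates on a cycle with an odd number of negations.
    dangerous = {p for p in preds if on_odd_cycle(p)}

    # Condition 2: body predicates of rules with a dangerous head, up to fixpoint.
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            if head in dangerous:
                for body_pred in list(pos) + list(neg):
                    if body_pred not in dangerous:
                        dangerous.add(body_pred)
                        changed = True
    return dangerous

# Predicate-level view of the running example, extended with an even cycle a/b.
EXAMPLE_RULES = [
    ("z", ["t"], ["z"]),
    ("t", ["q"], []),
    ("p", ["e"], ["q"]),
    ("q", ["e"], ["p"]),
    ("e", [], []),
    ("a", [], ["b"]),
    ("b", [], ["a"]),
]
```

On `EXAMPLE_RULES`, only z lies on an odd cycle (through `not z`), but the second condition then pulls in t, q, e and p. The predicates a and b form a cycle with two negations and stay non-dangerous.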

(25)

Independent Sets

An independent set for a ground program is a set S ⊆ BP such that for each a ∈ S:

if a is the head of rule r then all atoms of r are in S, and

if a appears in the body of a dangerous rule r then all atoms of r are in S.

A subprogram T of a program P is a module if T consists of exactly the rules with head atoms from S for an independent set S.
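To see how the second condition changes the picture for the running example, the smallest independent set containing a given atom can be computed as a closure. The sketch below is a hedged illustration: ground rules are encoded as (head, positive body, negative body) triples, a ground rule is treated as dangerous when its head atom is in a supplied set, and for the ground running example every head atom is assumed to be dangerous (z because of `not z`, the others by propagation through rule bodies). All names are hypothetical.

```python
def smallest_independent_set(rules, dangerous_heads, seed):
    """Close {seed} under the two conditions of the independent-set definition."""
    independent = {seed}
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            body = set(pos) | set(neg)
            atoms = body | {head}
            # Condition 1: head already in the set.
            # Condition 2: dangerous rule with some body atom already in the set.
            triggered = head in independent or (
                head in dangerous_heads and body & independent)
            if triggered and not atoms <= independent:
                independent |= atoms
                changed = True
    return independent

def module_for(rules, independent):
    """The rules whose head atom belongs to the independent set."""
    return [rule for rule in rules if rule[0] in independent]

# Ground version of the running example.
GROUND_RULES = [
    ("p(1)", ["e(1)"], ["q(1)"]),
    ("q(1)", ["e(1)"], ["p(1)"]),
    ("e(1)", [], []),
    ("z", ["t(1)"], ["z"]),
    ("t(1)", ["q(1)"], []),
]
# Assumption of this sketch: every rule of the ground example is dangerous.
DANGEROUS_HEADS = {"z", "t(1)", "q(1)", "p(1)", "e(1)"}
```

With `DANGEROUS_HEADS`, the closure starting from p(1) reaches t(1) and z, so the module is the whole program. With an empty set of dangerous heads, which corresponds to moving only from head to body, the closure stops at {p(1), e(1), q(1)} and the rules for t and z are lost.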

(26)

Theorems

Let T be a module of P, and let q occur in T. Then SM(P)/T ⊆ SM(T), and hence

(T |=c q) ⇒ (P |=c q), and (T |=b q) ⇐ (P |=b q).

Moreover, if P is consistent, then SM(T) = SM(P)/T, and hence

(T |=c q) ⇔ (P |=c q), and (T |=b q) ⇔ (P |=b q).

(28)

Evaluation

Optimal: For a query c? use the smallest module containing c.

⇒ infeasible

⇒ use an approximating technique: Adaptation of Magic Sets

(29)

Magic-Set Method

Given a query q, and a program P

Focuses on the subset of P which is relevant for q

“Pushes-down” the query constants, to eliminate rule-instances which cannot contribute to the derivation of q

Simulates the top-down evaluation of q

(30)

Magic-Set Method

Rewrite P into a query-equivalent program P′:

1. Adorn P (simulate the binding passing).

2. Generate Magic (magic rules identify the relevant atoms).

3. Modify P (limit P to the Magic Set).

(31)

Modification for Datalog¬

Rule-by-rule processing

Process also dangerous rules

. . . but only for generating magic rules

. . . by swapping head and body, and applying standard magic generation

(32)

Enhanced Magic-Set Algorithm

Input: A Datalog¬ program P, and a query Q = g(t).

Output: The optimized program MS¬(Q, P).

var S: stack of adorned predicates; modifiedRules, magicRules: set of rules;

modifiedRules := ∅; magicRules := BuildQuerySeeds(Q, S);

while S ≠ ∅ do
    p^α := S.pop();
    for each rule r ∈ P with H(r) = p(t_p) do
        r_a := Adorn(r, p^α, S);
        magicRules := magicRules ∪ Generate(r_a);
        modifiedRules := modifiedRules ∪ {Modify(r_a)};
    for each dangerous rule d ∈ P of the form h(t_h) :- q1(t1), ..., qm(tm) with qi = p do
        let d_s be the rule qi(ti) :- h(t_h), q1(t1), ..., qi−1(ti−1), qi+1(ti+1), ..., qm(tm);
        ...

(33)

Magic Sets: Example

e(1). z :- t(1), not z. t(X) :- q(X).

p(X) :- e(X), not q(X). q(X) :- e(X), not p(X).

a(X) :- not b(X). b(X) :- not a(X).

with query p(1)? yields the following

e(1). z :- t(1), not z. t(X):-magic_tb(X), q(X).

p(X):-magic_pb(X), e(X), not q(X). q(X):-magic_qb(X), e(X), not p(X).

magic_pb(1). magic_tb(X):-magic_qb(X).

magic_qb(X):-e(X), magic_pb(X). magic_pb(X):-e(X), magic_qb(X).

(34)

Theorem

Let P be a Datalog¬ program, let Q be a query.

Then, it holds that

MS¬(⟨Q, P⟩) ⊆^c_Q P and MS¬(⟨Q, P⟩) ⊇^b_Q P, and if SM(P) ≠ ∅,

MS¬(⟨Q, P⟩) ≡^b_Q P and MS¬(⟨Q, P⟩) ≡^c_Q P.

Remark: Data Integration Programs Π(I) always have stable models, so we obtain query equivalence for these!

(36)

Demo Scenario

EU Project INFOMIX (IST-2001-33570)

Information system of University “La Sapienza” in Rome.

14 global relations,

29 integrity constraints,

29 relations (in 3 legacy databases) and 12 web wrappers,

More than 24MB of data regarding students,

(37)

Experiments

Relative Gain

(38)

Conclusion

Optimization for Datalog¬ with stable models

Important for Data Integration

Modularity results for Datalog¬

Magic Sets for Datalog¬

Positive impact on Data Integration Application
