Data integration: A theoretical perspective
Maurizio Lenzerini
Dipartimento di Informatica e Sistemistica “Antonio Ruberti” Universit `a di Roma “La Sapienza”
Tutorial at PODS 2002
Data integration
Source 1 Source 2 Global schema Mapping Query R1 C1 D1 T1 c1 d1 t1 c2 d2 t2Outline
•
Formal framework for data integration•
Approaches to data integration•
Query answering in different approaches•
Dealing with inconsistency•
Reasoning on queries in data integration•
ConclusionsFormal framework
A data integration system
I
is a triplehG
,
S
,
Mi
, where•
G
is the global schema (over an alphabetA
G)•
S
is the source schema (over an alphabetA
S)•
M
is the mapping betweenG
andS
Semantics of
I
: which are the databases that satisfyI
(models ofI
)?We refer only to databases over a fixed infinite domain
Γ
, and we start with a source databaseC
, (data available at the sources, also called source model) overΓ
. The set of databases that satisfyI
relative toC
is:sem
C(
I
)
=
{ B | B
is legal wrtG
Semantics of queries to
I
A query
q
of arityn
is a FOL formula withn
free variables.If
D
is a database, thenq
D denotes the extension ofq
inD
(i.e., the set of valuations inΓ
for the free variables ofq
that makeq
true inD
).If
q
is a query of arityn
posed to a data integration systemI
(i.e., a query overA
G), then the set of certain answers toq
wrtI
andC
isDatabases with incomplete information
•
Traditional database: one model of a first-order theoryQuery answering means evaluating a formula in the model.
•
Database with incomplete information: set of models (specified, for example, as a restricted first-order theory)Query answering means computing the tuples that satisfy the query in all the models in the set.
There is a strong connection between query answering in data integration and query answering in database with incomplete information under constraints.
Outline
•
Formal framework for data integration•
Approaches to data integration•
Query answering in different approaches•
Dealing with inconsistency•
Reasoning on queries in data integration•
ConclusionsThe mapping
How is the mapping
M
betweenG
andS
specified?•
Are the sources defined in terms of the global schema?Approach called source-centric, or local-as-view, or LAV.
•
Is the global schema defined in terms of the sources?Approach called global-schema-centric, or global-as-view, or GAV.
•
A mixed approach?Approach called GLAV.
•
Mapping between sources, without global schema? Approach called P2P.GAV vs LAV – example
Global schema: movie
(
Title
,
Year
,
Director
)
european(
Director
)
review
(
Title
,
Critique
)
Source 1: r1
(
Title
,
Year
,
Director
)
since 1960, european directors Source 2: r2(
T itle, Critique
)
since 1990Query: Title and critique of movies in 1998
∃
D.
movie(
T
,
1998
, D
)
∧
review(
T
,
R
)
,
writtenFormalization of LAV
In LAV, the mapping
M
is constituted by a set of assertions:s
⊆
φ
G (sound source)∀
~
x
(
s
(
~
x
)
→
φ
G(
~
x
))
s
≡
φ
G (exact source)∀
~
x
(
s
(
~
x
)
≡
φ
G(
~
x
))
one for each source element
s
inA
S, whereφ
G is a query overG
. Given source databaseC
, a databaseB
forG
satisfiesM
wrtC
if for eachs
∈ S
:s
C⊆
φ
GB (sound source)s
C=
φ
GB (exact source)The mapping
M
and the source databaseC
do not provide direct information about which data satisfy the global schema. Sources are views, and we have to answerLAV – example
Global schema: movie
(
Title
,
Year
,
Director
)
european(
Director
)
review
(
Title
,
Critique
)
LAV: associated to source relations we have views over the global schema
r1
(
T, Y, D
)
⊆
{
(
T, Y, D
)
|
movie(
T, Y, D
)
∧
european(
D
)
∧
Y
≥
1960
}
r2(
T, R
)
⊆
{
(
T, R
)
|
movie(
T, Y, D
)
∧
review(
T, R
)
∧
Y
≥
1990
}
The query
{
(T, R)
|
movie(T,
1998, D)
∧
review(T, R)
}
is processed bymeans of an inference mechanism that aims at re-expressing the atoms of the global schema in terms of atoms at the sources. In this case:
Formalization of GAV
In GAV, the mapping
M
is constituted by a set of assertions:g
⊇
φ
S (sound source)∀
~
x
(
φ
S(
~
x
)
→
g
(
~
x
))
g
≡
φ
S (exact source)∀
~
x
(
φ
S(
~
x
)
≡
g
(
~
x
))
one for each element
g
inA
G, whereφ
S is a query overS
. Given source databaseC
, a databaseB
forG
satisfiesM
wrtC
if for eachg
∈ G
:g
B⊇
φ
SC (sound source)g
B=
φ
SC (exact source)Given a source database,
M
provides direct information about which data satisfy the elements of the global schema. Relations inG
are views, and queries are expressed over the views. Thus, it seems that we can simply evaluate the query over the dataGAV – example
Global schema: movie
(
Title
,
Year
,
Director
)
european(
Director
)
review
(
Title
,
Critique
)
GAV: associated to relations in the global schema we have views over the sources
movie
(
T, Y, D
)
⊇
{
(
T, Y, D
)
|
r1(
T, Y, D
)
}
european(
D
)
⊇
{
(
D
)
|
r1(
T, Y, D
)
}
GAV – example of query processing
The query
{
(T, R)
|
movie(T,
1998, D)
∧
review(T, R)
}
is processed bymeans of unfolding, i.e., by expanding the atoms according to their definitions, so as to come up with source relations. In this case:
movie(T,1998,D)
∧
review(T,R)
unfolding
GAV and LAV – comparison
LAV: (Information Manifold, DWQ, Picsel)
•
Quality depends on how well we have characterized the sources•
High modularity and extensibility (if the global schema is well designed, when a source changes, only its definition is affected)•
Query processing needs reasoning (query reformulation complex) GAV: (Carnot, SIMS, Tsimmis, IBIS, Picsel, . . . )•
Quality depends on how well we have compiled the sources into the global schema through the mapping•
Whenever a source changes or a new one is added, the global schema needs to be reconsidered•
Query processing can be based on some sort of unfolding (query reformulation looks easier)Beyond GAV and LAV: GLAV
In GLAV, the mapping
M
is constituted by a set of assertions:φ
S⊆
φ
G (sound source)∀
~
x
(
φ
S(
~
x
)
→
φ
G(
~
x
))
φ
S≡
φ
G (exact source)∀
~
x
(
φ
S(
~
x
)
≡
φ
G(
~
x
))
where
φ
S is a query overS
, andφ
G is a query overG
. Given source databaseC
, a databaseB
that is legal wrtG
satisfiesM
wrtC
if for each assertion inM
:φ
SC⊆
φ
GB (sound source)φ
SC=
φ
GB (exact source)The mapping
M
does not provide direct information about which data satisfy the global schema: to answer a queryq
overG
, we have to infer how to useM
in orderExample of GLAV
Global schema:
W ork
(
P erson, P roject
)
,
Area
(
P roject, F ield
)
Source 1:
HasJob
(
P erson, F ield
)
Source 2:
T each
(
P rof essor, Course
)
, In
(
Course, F ield
)
Source 3:
Get
(
Researcher, Grant
)
, F or
(
Grant, P roject
)
GLAV mapping:
{
(
r, f
)
|
HasJob
(
r, f
)
}
⊆
{
(
r, f
)
|
W ork
(
r, p
)
∧
Area
(
p, f
)
}
{
(
r, f
)
|
T each
(
r, c
)
∧
In
(
c, f
)
} ⊆
{
(
r, f
)
|
W ork
(
r, p
)
∧
Area
(
p, f
)
}
Beyond GLAV: P2P data integration
In P2P, the global schema does not exist. Constraints (that we can still call
G
) are defined overA
G=
A
S1∪ · · · ∪ A
Snand the mapping
M
is constituted by a set of assertions (φ
S1i, φ
S2j are queries over the alphabetsA
Si andA
Sj, respectively):φ
S1i⊆
φ
S2j.A
S is a distinguished subset of predicates inA
G, called “base predicates” (wheredata are). A source database is a database for the base predicates. Given source database
C
, a databaseW
that satisfiesI
relative toC
is a database forS
such that, for each assertionφ
1⊆
φ
2 inM
,φ
W1⊆
φ
W2 .Queries are now expressed over alphabet
A
Si, and the notion of certain answers is the usual one.A unified view
•
Alphabet:A
=
A
G∪ A
S•
Integrity constraints: constraintsG
, and mappingM
•
Partial database: source database•
Database: data for all symbols inA
that are both coherent with the partial database and satisfy the integrity constraints•
Query answering: computing the tuples that satisfies the query in every databaseUnder this view, the difference between LAV, GAV, GLAV, P2P is reflected in the
Query answering with incomplete information
•
[Reiter ’84]: relational setting, databases with incomplete information modeled as a first order theory•
[Vardi ’86]: relational setting, complexity of reasoning in closed world databases with unknown values•
Several approaches both from the DB and the KR communityConnection to query containment
Query containment (under constraints
T
) is the problem of checking whetherq
1B is contained inq
2B for every databaseB
(satisfyingT
), whereq
1, q
2 are queries with the same arity.•
A source databaseC
can be represented as a conjunctionq
C of ground literals overA
S (e.g., if~
x
is ins
C, then the corresponding literal iss
(
~
x
)
)•
Ifq
is a query, and~
t
is a tuple, then we denote byq
~t the query obtained by substituting the free variables ofq
with~
t
•
The problem of checking whether~
t
∈
q
I,C can be reduced to the problem of checking whetherq
C is contained inq
~t under the constraintsG ∪ M
The combined complexity of checking certain answers is identical to the complexity of query containment under constraints, and the data complexity is at most the
Outline
•
Formal framework for data integration•
Approaches to data integration•
Query answering in different approaches•
Dealing with inconsistency•
Reasoning on queries in data integration•
ConclusionsDealing with incompleteness and inconsistency
We analyze the problem of query answering in different cases, depending on two parameters:
•
Global schema: - without constraints, - with constraints•
Mapping: - GAV or LAV, - sound or completeGiven a source database
C
, we call retrieved global database any database forG
that satisfies the mapping wrtC
.Incompleteness and inconsistency
Constraints Type of Incomple-
Inconsi-in
G
mapping teness stencyno GAV/exact no no
no GAV/sound yes/no no
no LAV/sound yes no
no LAV/exact yes yes
yes GAV/exact no yes
yes GAV/sound yes yes
yes LAV/sound yes yes
Incompleteness and inconsistency
Constraints Type of Incomple-
Inconsi-in
G
mapping teness stencyno GAV/exact no no
no GAV/sound yes/no no
no LAV/sound yes no
no LAV/exact yes yes
yes GAV/exact no yes
yes GAV/sound yes yes
yes LAV/sound yes yes
INT[noconstr, GAV/exact]: example
Consider
I
=
hG
,
S
,
Mi
, with Global schemaG
:student
(
Scode
,
Sname
,
Scity
)
university(
Ucode
,
Uname
)
enrolled
(
Scode
,
Ucode
)
Source schemaS
: database relations s1,
s2,
s3Mapping
M
:student
(
X, Y, Z
)
≡
{
(
X, Y, Z
)
|
s1(
X, Y, Z, W
)
}
university(
X, Y
)
≡
{
(
X, Y
)
|
s2(
X, Y
)
}
INT[noconstr, GAV/exact]: example
Student oslo bill 15 florence anne 12 city name code University ucla BN bocconi AF name code Enrolled AF 12 BN 16 Ucode Scodes
C112
anne florence21
15
bill oslo24
s
C2AF
bocconiBN
uclas
C312
AF
16
BN
INT[noconstr, GAV/exact]
Sources
Mapping
Global schema
Retrieved GDB
Source model
Model of I
=
INT[noconstr, GAV/exact]: query answering
•
UseM
for computing fromC
the retrieved global database, where each elementg
ofG
satisfies exactly the tuples ofC
satisfying theφ
S thatM
associates tog
•
SinceG
does not have constraints, the retrieved global database is legal wrtG
•
Actually, it is the only database that is legal wrtG
, and that satisfiesM
wrtC
•
Thus, we can simply evaluate the queryq
over the retrieved global database,which is equivalent to unfolding the query according to
M
, in order to obtain a query onA
S to be evaluated overC
INT[noconstr, GAV/exact]: example of query answering
MappingM
: student(
X, Y, Z
)
≡
{
(
X, Y, Z
)
|
s1(
X, Y, Z, W
)
}
university(
X, Y
)
≡
{
(
X, Y
)
|
s2(
X, Y
)
}
enrolled(
X, W
)
≡
{
(
X, W
)
|
s3(
X, W
)
}
s
C112
anne florence21
15
bill oslo24
s
C2AF
bocconiBN
uclas
C312
AF
16
BN
Query:
{
(
X
)
|
student(
X, Y, Z
)
,
enrolled(
X, W
)
}
Unfolding wrt
M
:{
(
X
)
|
s
1(
X, Y, Z, V
)
, s
3(
X, W
)
}
Incompleteness and inconsistency
Constraints Type of Incomple-
Inconsi-in
G
mapping teness stencyno GAV/exact no no
no GAV/sound yes/no no
no LAV/sound yes no
no LAV/exact yes yes
yes GAV/exact no yes
yes GAV/sound yes yes
yes LAV/sound yes yes
INT[noconstr, GAV/sound]: example
Student
oslo bill 15 florence anne 12 city name codeUniversity
uniroma UR ucla BN bocconi AF name codeEnrolled
AF 12 BN 16 Ucode Scodes
C112
anne florence21
15
bill oslo24
s
C2AF
bocconiBN
uclas
C312
AF
16
BN
INT[noconstr, GAV/sound]
The GAV mapping assertions have the logical form:
∀
~
x
φ
s(
~
x
)
→
g
(
~
x
)
The intersection of all retrieved global databases (which can be computed by letting each element
g
ofG
satisfy exactly the tuples ofC
satisfiying theφ
S thatM
associates to
g
) still satisfiesM
wrtC
, and therefore, is the only “minimal”model of
I
.Incompleteness is of special form. For queries without negation, unfolding is sufficient.
INT[noconstr, GAV/sound]
Sources
Mapping
Global schema
Intersection
of retrieved GDBs
Source model
Minimal
Model of I
=
Incompleteness and inconsistency
Constraints Type of Incomple-
Inconsi-in
G
mapping teness stencyno GAV/exact no no
no GAV/sound yes/no no
no LAV/sound yes no
no LAV/exact yes yes
yes GAV/exact no yes
yes GAV/sound yes yes
yes LAV/sound yes yes
INT[noconstr, LAV/sound]: incompleteness
The LAV mapping assertions have the logical form:
∀
~
x
s
(
~
x
)
→
φ
G(
~
x
)
In general, given a source database
C
there are several solutions of the above assertions (i.e., different databases that are legal wrtG
that satisfiesM
wrtC
). Incompleteness comes from the mapping. This holds even for the case of simple queriesφ
G:s
1(
x
)
⊆
{
(
x
)
| ∃
y g
(
x, y
)
}
INT[noconstr, LAV/sound]
Sources
Mapping
Global schema
Retrieved GDBs
Source model
Models of I
=
=
INT[noconstr, LAV/sound]: dealing with incompleteness
View-based query processing: Answer a query based on a set of materialized views, rather than on the raw data in the database.
Relevant problem in
•
Data warehousing•
Query optimizationINT[noconstr, LAV/sound]: dealing with incompleteness
In LAV/sound data integration, the views are the sources. Two approaches to view-based query processing:
•
View-based query rewriting: query processing is divided in two steps1. re-express the query in terms of a given query language over the alphabet of
A
S2. evaluate the rewriting over the source database
C
•
View-based query answering: no limitation is posed on how queries areprocessed, and the only goal is to exploit all possible information, in particular the source database, to compute the certain answers to the query
INT[noconstr, LAV/sound]: connection to query containment
•
If queries inM
are conjunctive queries, then we can substitute the query thatM
associates tos
for everys
-literal inq
C, and therefore, checking certain answers can be reduced to checking pure containment (withoutconstraints) of two queries in the alphabet
A
GINT[noconstr, LAV/sound]: some results for query answering
•
Conjunctive queries using conjunctive views [Levy&al. PODS’95]•
Recursive queries (datalog programs) using conjunctive views[Duschka&Genesereth PODS’97], [Afrati&al. ICDT’99]
•
Complexity analysis [Abiteboul&Duschka PODS’98] [Grahne&Mendelzon ICDT’99]•
Variants of Regular Path Queries [Calvanese&al. ICDE’00, PODS’00] [Deutsch&Tannen DBPL’01], [Calvanese&al. DBPL’01]INT[noconstr, LAV/sound]: data complexity
From [Abiteboul&Duschka PODS’98]:
Sound sources CQ CQ6= PQ datalog FOL
CQ PTIME coNP PTIME PTIME undec.
CQ6= PTIME coNP PTIME PTIME undec.
PQ coNP coNP coNP coNP undec.
datalog coNP undec. coNP undec. undec.
INT[noconstr, LAV/sound]: basic technique
Consider conjunctive queries and conjunctive views.
r1
(
T
)
⊆
{
(
T
)
|
movie(
T, Y, D
)
∧
european(
D
)
}
r2(
T, V
)
⊆
{
(
T, V
)
|
movie(
T, Y, D
)
∧
review(
T, V
)
}
∀
T
r1(
T
)
→ ∃
Y
∃
D
movie(
T, Y, D
)
∧
european(
D
)
∀
T
∀
V
r2(
T, V
)
→ ∃
Y
∃
D
movie(
T, Y, D
)
∧
review(
T, V
)
movie(
T, f
1(
T
)
, f
2(
T
))
←
r1(
T
)
european(
f
2(
T
))
←
r1(
T
)
movie(
T, f
4(
T, V
)
, f
5(
T, V
))
←
r2(
T, V
)
review(
T, V
))
←
r2(
T, V
)
Answering a query means evaluating a goal wrt to this nonrecursive logic program
INT[noconstr, LAV/sound]: polynomial intractability
Given a graph
G
= (
N, E
)
, we defineI
=
hG
,
S
,
Mi
, and source databaseC
:V
b⊆
R
bV
f⊆
R
fV
t⊆
R
rg∨
R
gr∨
R
rb∨
R
br∨
R
gb∨
R
bgV
bC=
{
(
c, a
)
|
a
∈
N, c
6∈
N
}
V
fC=
{
(
a, d
)
|
a
∈
N, d
6∈
N
}
V
tC=
{
(
a, b
)
,
(
b, a
)
|
(
a, b
)
∈
E
}
Q
←
R
b·
M
·
R
fwhere
M
describes all mismatched edge pairs (e.g.,R
rg·
R
rb). IfG
is 3-colorable, then∃B
whereM
(andQ
) is empty, i.e.(
c, d
)
6∈
Q
I,C. IfG
is not 3-colorable, thenM
is nonempty∀B
, i.e.(
c, d
)
∈
Q
I,C.INT[noconstr, LAV/sound]: in coNP
Consider the case of Datalog queries and positive views.
•
~
t
is not a certain answer toQ
wrtI
andC
, if and only if there is a databaseB
forI
such that~
t
6∈
Q
B, andB
satisfiesM
wrtC
•
Because of the form ofM
∀
~
x
(s
(
~
x
)
→ ∃
y
~
1α
1(
~
x
, ~
y
1)
∨
. . .
∨ ∃
y
~
hα
h(
~
x
, ~
y
h)
)
each tuple in
C
forces the existence ofk
tuples in any database that satisfiesM
wrtC
, wherek
is the maximal length of conjuncts inM
•
IfC
hasn
tuples, then there is a databaseB
0⊆ B
forI
that satisfiesM
wrtC
with at mostn
·
k
tuples. SinceQ
is monotone,~
t
6∈
Q
B0.•
Checking whetherB
0 satisfiesM
wrtC
can be done in PTIME wrt the size ofB
0.=
⇒
coNP data complexity for Datalog queries and positive views.INT[noconstr, LAV/sound]: the case of RPQ
We deal with the problem of answering queries to data integration systems of the form
hG
,
S
,
Mi
, where•
G
simply fixes the labels (alphabetΣ
) of a semi-structured database•
the sources inS
are relational•
the mappingM
is of type LAVGlobal semi-structured database
sub sub
sub var sub sub sub
sub sub sub
calls calls
calls
Global semi-structured databases and queries
sub sub
sub var sub sub sub
sub sub sub
calls calls
calls
var var var var
a
b
Global semi-structured databases and queries
sub sub
sub var sub sub sub
sub sub sub
calls calls
calls
var var var var
a
b
INT[noconstr, LAV/sound]: the case of RPQ
Given
•
I
=
hG
,
S
,
Mi
, where–
G
simply fixes the labels (alphabetΣ
) of a semi-structured database– the sources in
S
are binary relations– the mapping
M
is of type LAV, and associates to each sources
a 2RPQw
overΣ
∀
x, y
s
(
x, y
)
⊆
x
→
wy
•
a source databaseC
•
a 2RPQQ
overΣ
•
a pair of objects~t
~t
∈
Q
I,CQuery answering: Technique
•
We search for a counterexample to~t
∈
Q
I,C, i.e., a databaseB
legal forI
wrtC
such that~t
6∈
Q
B•
Crucial point: it is sufficient to restrict our attention to canonical databases, i.e., databasesB
that can be represented by a wordw
B$
d
1w
1d
2$
d
3w
2d
4$
· · ·
$
d
2m−1w
md
2m$
where
d
1, . . . , d
2m are constants inC
,w
i∈
Σ
+, and $ acts as a separatorWe need techniques for ...
checking whether a pair of objects satisfies a 2RPQ query in the case of
•
a word representing a path•
a word representing semipathFinite-state automata and RPQs
a b cr
dp
q
….
….
Q
=
r
·
(p
∪
q
)
·
q
·
q
∗Automaton for Q
s
1∈
δ
(
s
0, r
)
, s
2∈
δ
(
s
1, p
)
, s
2∈
δ
(
s
1, q
)
,
s
3∈
δ
(
s
2, q
)
, s
3∈
δ
(
s
3, q
)
2way Regular Path Queries
2way Regular Path Queries (2RPQ) are expressed by means of finite-state automata over
Σ
0∪ {
p
−|
p
∈
Σ
0}
.r
·
(
p
∪
q
)
·
(
p
−·
p
)
∗·
q
·
q
∗ r p q p_ p q qFinite-state automata and 2RPQs
a b cr
dp
q
….
….
Word:r
p q
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for Q
s
1∈
δ
(
s
0, r
)
, s
2∈
δ
(
s
1, p
)
, s
2∈
δ
(
s
1, q
)
,
s
3∈
δ
(
s
2, p
−)
, s
4∈
δ
(
s
3, p
)
, s
5∈
δ
(
s
4, q
)
, s
5∈
δ
(
s
5, q
)
State:s
0 Transition:s
1∈
δ
(
s
0, r
)
Finite-state automata and 2RPQs
a b cr
dp
q
….
….
Word:r
p
q
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for Q
s
1∈
δ
(
s
0, r
)
, s
2∈
δ
(
s
1, p
)
, s
2∈
δ
(
s
1, q
)
,
s
3∈
δ
(
s
2, p
−)
, s
4∈
δ
(
s
3, p
)
, s
5∈
δ
(
s
4, q
)
, s
5∈
δ
(
s
5, q
)
State:s
1 Transition:s
∈
δ
(
s
, p
)
Finite-state automata and 2RPQs
a b cr
dp
q
….
….
Word:r p
q
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for Q
s
1∈
δ
(
s
0, r
)
, s
2∈
δ
(
s
1, p
)
, s
2∈
δ
(
s
1, q
)
,
s
3∈
δ
(
s
2, p
−)
, s
4∈
δ
(
s
3, p
)
, s
5∈
δ
(
s
4, q
)
, s
5∈
δ
(
s
5, q
)
State:s
2 Transition: noneFinite-state automata and 2RPQs
a b cr
dp
q
….
….
Word:r p
q
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗ State:s
2 Transition: none(
a, d
)
satisfies queryQ
, but the path froma
tod
is not accepted by the 1NFAcorresponding to
Q
: the computation for 2RPQs is not captured by finite-state automata.2way automata (2NFA)
A 2way automaton
A
= (Γ
, S, S
0, ρ, F
)
consists of an alphabetΓ
, a finite set of statesS
, a set of initial statesS
0⊆
S
, a transition functionρ
:
S
×
Σ
→
2
S × {−1,0,1} and a set of accepting statesF
⊆
S
.Given a 2way automaton
A
withn
states, one can construct a one-way automatonB
1 withO
(2
nlogn)
states such thatL
(
B
1) =
L
(
A
)
, and a one-way automatonB
2 withO
(2
n)
states such thatL
(
B
2) = Γ
∗−
L
(
A
)
.2way automata and 2RPQs
Given a 2RPQ
E
= (Σ
, S, I, δ, F
)
over the alphabetΣ
, the corresponding 2way automatonA
E is:(Σ
A= Σ
∪ {
$
}
, S
A=
S
∪ {
s
f} ∪ {
s
←|
s
∈
S
}
, I, δ
A,
{
s
f}
)
whereδ
A is defined as follows:•
(
s
2,
1)
∈
δ
A(
s
1, r
)
, for each transitions
2∈
δ
(
s
1, r
)
ofE
•
enter backward mode:(
s
←,
−
1)
∈
δ
A(
s, `
)
, for eachs
∈
S
and`
∈
Σ
A•
exit backward mode:(
s
2,
0)
∈
δ
A(
s
←1, r
)
, for eachs
2∈
δ
(
s
1, r
−)
ofE
•
(
s
f,
1)
∈
δ
A(
s,
$)
, for eachs
∈
F
.2way automata and 2RPQs
a b cr
dp
q
….
….
Q
=
r
·
(p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for Q
s
1∈
δ
(
s
0, r
)
, s
2∈
δ
(
s
1, p
)
, s
2∈
δ
(
s
1, q
)
,
s
3∈
δ
(
s
2,
p
−)
, s
4∈
δ
(
s
3, p
)
, s
5∈
δ
(
s
4, q
)
, s
5∈
δ
(
s
5, q
)
2way automaton
(
s
1,
1)
∈
δ
A(
s
0, r
)
,
(
s
2,
1)
∈
δ
A(
s
1, p
)
,
(
s
2,
1)
∈
δ
A(
s
1, q
)
,
(
s
←2,
−
1
)
∈
δ
A(
s
2,
q
)
,
(
s
3,
0
)
∈
δ
A(
s
←2,
p
)
,
(
s
4,
1)
∈
δ
A(
s
3, p
)
,
(
s
5,
1)
∈
δ
A(
s
4, q
)
,
(
s
f,
1)
∈
δ
A(
s
5,
$)
2NFA and 2RPQs
a b cr
dp
q
….
….
Word:r
p q
$
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for
Q
(
s
1,
1)
∈
δ
A(
s
0, r
)
,
(
s
2,
1)
∈
δ
A(
s
1, p
)
,
(
s
←2,
−
1)
∈
δ
A(
s
2, q
)
,
(
s
3,
0)
∈
δ
A(
s
←2, p
)
,
(
s
4,
1)
∈
δ
A(
s
3, p
)
,
(
s
5,
1)
∈
δ
A(
s
4, q
)
,
(
s
f,
1)
∈
δ
A(
s
5,
$)
State:s
0(
s
,
1)
∈
δ
(
s
, r
)
2NFA and 2RPQs
a b cr
dp
q
….
….
Word:r
p
q
$
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for
Q
(
s
1,
1)
∈
δ
A(
s
0, r
)
,
(
s
2,
1)
∈
δ
A(
s
1, p
)
,
(
s
←2,
−
1)
∈
δ
A(
s
2, q
)
,
(
s
3,
0)
∈
δ
A(
s
←2, p
)
,
(
s
4,
1)
∈
δ
A(
s
3, p
)
,
(
s
5,
1)
∈
δ
A(
s
4, q
)
,
(
s
f,
1)
∈
δ
A(
s
5,
$)
State:s
1 Transition:(
s
2,
1)
∈
δ
A(
s
1, p
)
2NFA and 2RPQs
a b cr
dp
q
….
….
Word:r p
q
$
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for
Q
(
s
1,
1)
∈
δ
A(
s
0, r
)
,
(
s
2,
1)
∈
δ
A(
s
1, p
)
,
(
s
←2,
−
1)
∈
δ
A(
s
2, q
)
,
(
s
3,
0)
∈
δ
A(
s
←2, p
)
,
(
s
4,
1)
∈
δ
A(
s
3, p
)
,
(
s
5,
1)
∈
δ
A(
s
4, q
)
,
(
s
f,
1)
∈
δ
A(
s
5,
$)
State:s
2(
s
←,
−
1)
∈
δ
(
s
, q
)
2NFA and 2RPQs
a b cr
dp
q
….
….
Word:r
p
q
$
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for
Q
(
s
1,
1)
∈
δ
A(
s
0, r
)
,
(
s
2,
1)
∈
δ
A(
s
1, p
)
,
(
s
←2,
−
1)
∈
δ
A(
s
2, q
)
,
(
s
3,
0)
∈
δ
A(
s
←2, p
)
,
(
s
4,
1)
∈
δ
A(
s
3, p
)
,
(
s
5,
1)
∈
δ
A(
s
4, q
)
,
(
s
f,
1)
∈
δ
A(
s
5,
$)
State:s
←2 Transition:(
s
3,
0)
∈
δ
A(
s
←2, p
)
2NFA and 2RPQs
a b cr
dp
q
….
….
Word:r
p
q
$
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for
Q
(
s
1,
1)
∈
δ
A(
s
0, r
)
,
(
s
2,
1)
∈
δ
A(
s
1, p
)
,
(
s
←2,
−
1)
∈
δ
A(
s
2, q
)
,
(
s
3,
0)
∈
δ
A(
s
←2, p
)
,
(
s
4,
1)
∈
δ
A(
s
3, p
)
,
(
s
5,
1)
∈
δ
A(
s
4, q
)
,
(
s
f,
1)
∈
δ
A(
s
5,
$)
State:s
3(
s
,
1)
∈
δ
(
s
, p
)
2NFA and 2RPQs
a b cr
dp
q
….
….
Word:r p
q
$
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for
Q
(
s
1,
1)
∈
δ
A(
s
0, r
)
,
(
s
2,
1)
∈
δ
A(
s
1, p
)
,
(
s
←2,
−
1)
∈
δ
A(
s
2, q
)
,
(
s
3,
0)
∈
δ
A(
s
←2, p
)
,
(
s
4,
1)
∈
δ
A(
s
3, p
)
,
(
s
5,
1)
∈
δ
A(
s
4, q
)
,
(
s
f,
1)
∈
δ
A(
s
5,
$)
State:s
4 Transition:(
s
5,
1)
∈
δ
A(
s
4, q
)
2NFA and 2RPQs
a b cr
dp
q
….
….
Word:r p q
$
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗Automaton for
Q
(
s
1,
1)
∈
δ
A(
s
0, r
)
,
(
s
2,
1)
∈
δ
A(
s
1, p
)
,
(
s
←2,
−
1)
∈
δ
A(
s
2, q
)
,
(
s
3,
0)
∈
δ
A(
s
←2, p
)
,
(
s
4,
1)
∈
δ
A(
s
3, p
)
,
(
s
5,
1)
∈
δ
A(
s
4, q
)
,
(
s
f,
1)
∈
δ
A(
s
5,
$)
State:s
5(
s
,
1)
∈
δ
(
s
,
$)
2NFA and 2RPQs
a b cr
dp
q
….
….
Word:r p q
$
Query:Q
=
r
·
(
p
∪
q
)
·
p
−·
p
·
q
·
q
∗ State:s
f(
a, d
)
satisfies queryQ
, and the path froma
tod
is accepted by the 2NFA2NFA and view extensions
Global schema G:
(r
∪
p
∪
q
∪
r
−∪
p
−∪
q
−)
∗r
∪
q
∗(p
∪
r)
°p
(p
−∪
q
)
∗r
°r
r
°(q
∪
p)
(d1,d2) (d4,d5) (d4,d2) (d3,d3) (d2,d3)Sources:
Database forG
: d1 d2r
d3r
r
q
r
d4p
p
d5p
2NFA and view extensions
d1 d2r
d3r
r
q
r
d4p
p
d5p
DatabaseB
as a word:$d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗To verify that
(
d
1, d
3)
satisfiesQ
in the above databaseB
, we buildA
(Q,d1,d3), by exploiting not only the ability of 2way automata to move on the word both forward and backward, but also the ability to jump from one position in the word representing a node to any other position (either preceding or succeeding) representing the same node.A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
0A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
0A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p
p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
0A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p
p
d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
0A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p
d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
0A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
0A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
0A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
1 Transition:(
s
1,
1)
∈
δ
A(
s
1, d
1)
A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r
d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
1A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r
d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
2A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:(
s
2, d
2)
A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:(
s
2, d
2)
A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:(
s
2, d
2)
A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:(
s
2, d
2)
A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
2A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
←2A run of
A
(Q,d1,d3) d1 d2r
d3r
r
q
r
d4p
p
d5p
Word:$
d
4p p d
5$
d
1r d
2$
d
4p
−d
2$
d
3r r d
3$
d
2r q d
3$
Q
=
r
·
(
p
∪
q
)
·
(
p
−·
r
)
∗·
q
·
q
∗ State:s
3A run of
A
(Q,d1,d3)