

Reasoning About Data in XML Data Integration

Tadeusz Pankowski (1, 2)

(1) Institute of Control and Information Engineering, Poznań University of Technology, Poland
(2) Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań, Poland

[email protected]

Abstract

In this paper, we propose solutions to some problems that arise when data from different sources is to be integrated under a given target schema. We address the following problems: inferring missing data based on constraints imposed by the target schema, generating mappings from a source schema to a target schema based on key constraints and value dependencies, and merging data based on subsumptions between XML data controlled by an ontology and semantics defined by means of description logic.

1 Introduction

In data integration [2, 11] we identify the following issues concerning reasoning about data: (1) inferring data values which are not given explicitly in sources but can be deduced based on some constraints enforced by the target schema; (2) finding an executable mapping from a source schema into a target schema so that an instance of the target schema can be computed from a given set of source instances; (3) merging heterogeneous source data in such a way that the result is subsumed by all merged components, that is, the result is at least as specific as any component and is free of overlapping data.

In the process of data transformation some missing or incomplete data may be inferred. We achieve that by representing missing data by terms reflecting constraints imposed by the schema. In some cases such terms may be resolved and replaced by the actual data [15]. In the paper we propose a method for generating mappings between schemas based on key constraints and value dependencies defined by means of an XML schema. In the first step an automapping over a schema, i.e. a mapping from the schema onto itself, is generated. The automapping represents the schema. A composition of automappings over two schemas gives a mapping between these schemas. We propose a language, called XDMap, for mapping specification based on source-to-target dependencies and Skolem functions.

The data taken from different sources may not only have different structures but may also use different names, concepts, precision, etc. In order to handle them we have to use a domain ontology. However, semantic relationships provided by the ontology must be generalized to XML tree structures in order to reason about subsumptions or equivalences between XML data. We have done this using the semantics of description logic.

Section 2 illustrates the problem of inferring some missing data in data integration. In Section 3 an approach to create executable schema mappings is proposed. We show how key constraints defined in XML Schema may be used to generate automappings and how mappings can be derived from automappings. In Section 4 we discuss subsumptions on XML data trees and their use for merging data. Section 5 concludes the paper.


2 Using constraints for inferring data in data integration

We will show how missing data may be inferred in data integration using some constraints on the target schema. Suppose there are three schemas S₁, S₂, and S₃ (Fig. 1) and that only S₂ and S₃ are associated with data, while S₁ is a mediated (or target) schema that does not store any data. The meanings of the labels are: author (A), name (N) and university (U) of the author; paper (P), title (T), year (Y) of publication and the conference (C) where the paper has been presented. Elements labeled with R and K are used to join authors with their papers. I₂ and I₃ are instances of S₂ and S₃, respectively.

In such a scenario we meet the problem of data integration (data exchange), i.e. computing target instances from source instances [3, 8, 12, 17, 20]. It is commonly agreed that mappings are needed to perform these functions effectively, where a mapping specifies a relationship between a set of source schemas and a target schema.

In particular, an instance of S₁ in Fig. 1 can be obtained by the transformations M₂₁(I₂) or M₃₁(I₃), or by merging, M₂₁(I₂) ∪ M₃₁(I₃) = (M₂₁ ∪ M₃₁)(I₂, I₃), where Mᵢⱼ denotes a mapping from Sᵢ into Sⱼ.

We can use two kinds of constraints to define mappings, namely:

1. Value dependencies (on the target) to declare that a value of a path depends on a tuple of values of other paths;

2. Key constraints (on a source) to declare that a subtree is uniquely identified by a tuple of values of key paths.

Value dependencies can be used to infer missing data [3, 15, 20]. Suppose we want to transform the instance I₂ to the target schema S₁, i.e. an instance I₁₁ = M₂₁(I₂) must be produced (Fig. 2(a)). The original instance provides no data about the publication year. We know, however, that the publication year (Y) uniquely depends on the title (T), denoted by the value dependency constraint Y = y(T), where y is the name of a function mapping titles into publication years. Hence, we assign the term y(t) as the text value of Y, where t is the title. This convention forces some elements of type Y to have the same values (Fig. 2(a)). Such value dependencies can be defined within a schema declared by means of an (extended) XML Schema (Fig. 3).

A term, like y(t), may be resolved using other mappings. Suppose we want to merge the instance in Fig. 2(a) and I₃. In this process terms denoting years will be replaced with actual values (Fig. 2(b)). Note that in this way we are able to infer the publication year of the paper written by a2. This information is given explicitly neither in I₂ nor in I₃.
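A minimal Python sketch of this resolution step follows. It is an illustration under stated assumptions, not the paper's implementation: the names `Term`, `resolve` and `years_from_i3` are hypothetical, terms are encoded as (function name, argument) pairs, and the title/year pairs are transcribed from I₃ as far as Fig. 1 can be read.

```python
from typing import Dict, List, NamedTuple, Union


class Term(NamedTuple):
    """A symbolic value f(arg), e.g. y(t1), standing for an unknown datum."""
    func: str
    arg: str


Value = Union[str, Term]

# Papers of I11 = M21(I2): titles are known, years are still terms y(T).
i11_papers: List[Dict[str, Value]] = [
    {"T": "t1", "Y": Term("y", "t1")},
    {"T": "t2", "Y": Term("y", "t2")},
]

# Title -> year pairs as they appear in I3 (assumed transcription of Fig. 1).
years_from_i3: Dict[str, str] = {"t1": "05", "t2": "03", "t3": "04"}


def resolve(value: Value, known: Dict[str, Dict[str, str]]) -> Value:
    """Replace the term f(a) by an actual value if another source provides it."""
    if isinstance(value, Term):
        return known.get(value.func, {}).get(value.arg, value)
    return value


resolved = [
    {label: resolve(v, {"y": years_from_i3}) for label, v in paper.items()}
    for paper in i11_papers
]
```

A term for which no source provides a value, such as u(a3), is simply left in place, which mirrors the unresolved university of a3 in Fig. 2(b).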

Information provided by key constraints (the <xs:key> elements within XML Schema, Fig. 3) is used to specify how many instances (nodes) of an element type must be in the computed target instance. For example, the element type /A1/A in S₁ is uniquely identified by the key path N. So, there are as many nodes of type /A1/A as there are different values of /A1/A/N. In S₂, however, elements of type A are identified by N but only in a context determined by the element type /P2/P that is identified by T. Thus, to identify /P2/P/A we need a pair of values determined by the paths /P2/P/T and /P2/P/A/N.
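The effect of the two key definitions on the number of nodes can be checked with a short Python sketch. The flat (T, N, U) rows and the variable names are assumptions made for illustration; the rows are transcribed from I₂ as far as Fig. 1 can be read.

```python
# Flat (T, N, U) tuples read off I2 in Fig. 1 (assumed transcription).
i2_rows = [("t1", "a1", "u1"), ("t1", "a2", "u2"), ("t2", "a1", "u1")]

# /A1/A in S1 has the key path N: one node per distinct author name.
a_nodes_in_s1 = {n for (_t, n, _u) in i2_rows}

# /P2/P/A in S2 is keyed by N only inside a /P2/P keyed by T:
# one node per distinct (T, N) pair.
a_nodes_in_s2 = {(t, n) for (t, n, _u) in i2_rows}

print(len(a_nodes_in_s1), len(a_nodes_in_s2))
```

Under these assumptions the same three source rows give two A nodes in the target S₁ but three A nodes in S₂.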

3 XML schema mappings

3.1 Basic ideas of mappings

We will show how, from the declaration in Fig. 3, the automapping M₁₁ over S₁ can be generated (Fig. 4). The clause foreach defines variables. Lines (1) and (2) are obvious. Line (3) includes value dependencies specified in the schema. Let $y = f($x₁) and $z = f($x₂) be two value dependencies, let Ω be a set of bindings for $x₁ and Ω′ be a set of bindings for $z and $x₂, and let there be no binding for $y, neither in Ω nor in Ω′ ($x₁ denotes a vector of variables). The value of $y is assigned according to the rules:

1. For a binding ω ∈ Ω, the term f(a), where a = ω($x₁), is assigned to $y;


Figure 1: Schemas S₁, S₂, S₃, and schema instances I₂ and I₃ (S₁ does not have any stored instance)

Figure 2: Instances of schema S₁ produced by mappings using value dependency constraints: (a) I₁₁ = M₂₁(I₂); (b) I₁₃ = M₂₁(I₂) ∪ M₃₁(I₃) = (M₂₁ ∪ M₃₁)(I₂, I₃)

<xs:schema xmlns:xs="...">
  <xs:element name="A1">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="A"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="A">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="N" type="xs:string"/>
        <xs:element name="U" type="xs:string"/>
        <xs:element ref="P"/>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="AKey">
      <xs:selector xpath="."/>
      <xs:field xpath="N"/>
    </xs:key>
    <xs:valdep>
      <xs:target name="U"/>
      <xs:function name="u"/>
      <xs:source xpath="N"/>
    </xs:valdep>
  </xs:element>
  <xs:element name="P">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="T" type="xs:string"/>
        <xs:element name="Y" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="PKey">
      <xs:selector xpath="."/>
      <xs:field xpath="T"/>
    </xs:key>
    <xs:valdep>
      <xs:target name="Y"/>
      <xs:function name="y"/>
      <xs:source xpath="T"/>
    </xs:valdep>
  </xs:element>
</xs:schema>

Figure 3: XML Schema of S₁, extended with the <xs:valdep> declaration

2. If there is a binding ω′ ∈ Ω′ such that ω′($x₂) = a, then the value ω′($z) is assigned to $y (we say that the term f(a) has been resolved).

M₁₁ = (G₁₁, Φ₁₁, C₁₁, E₁₁) =
(1)  foreach $yA1 in /A1, $yA in $yA1/A, $yN in $yA/N, $yU in $yA/U,
             $yP in $yA/P, $yT in $yP/T, $yY in $yP/Y
(2)  where   true
(3)  when    $yU = u($yN), $yY = y($yT)
     exists
(4)    F_/A1() in F_()()/A1
(5)    F_/A1/A($yN) in F_/A1()/A
(6)    F_/A1/A/N($yN) in F_/A1/A($yN)/N with $yN
(7)    F_/A1/A/U($yN,$yU) in F_/A1/A($yN)/U with $yU
(8)    F_/A1/A/P($yN,$yT) in F_/A1/A($yN)/P
(9)    F_/A1/A/P/T($yN,$yT) in F_/A1/A/P($yN,$yT)/T with $yT
(10)   F_/A1/A/P/Y($yN,$yT,$yY) in F_/A1/A/P($yN,$yT)/Y with $yY

Figure 4: Automapping M₁₁ over S₁

Line (4) creates two new nodes, the root r and the node n of the outermost element of type /A1, as results of the Skolem functions F_()() and F_/A1(), respectively. The node n is a child of type A1 of r. Line (5) creates a new node n′ for any distinct value of $yN; each such node has the type /A1/A and is a child of type A of the node created by F_/A1() in (4). In line (6), for any distinct value of $yN a new node n″ of type /A1/A/N is created as a child of type N of the node created by the invocation of F_/A1/A($yN) in (5) for the same value of $yN. Because n″ is a leaf, it obtains the text value equal to the current value of $yN. The remaining lines are analogous.

3.2 Capturing key constraints by automappings

In specification of automappings, Skolem functions and their arguments play a crucial role. We assume that:

• for any path P in the schema there is exactly one Skolem function F_P(...),

• arguments of a Skolem function F_P(...) are determined by key paths defined for the element of type P in the schema.

In S₁ there is exactly one root and one outermost element, so the corresponding Skolem functions have empty lists of arguments. The element of type /A1/A has a key path N. Each of its subelements inherits this key path and additionally has its local key paths. Local key paths for non-leaf elements are defined in the schema. The local key path for a leaf element is, by default, this leaf element itself. Thus, in S₁ we have the following key paths: N for /A1/A and for /A1/A/N; (N,T) for /A1/A/P and for /A1/A/P/T; and (N,T,Y) for /A1/A/P/Y. Values of these key paths are bound to variables and are used as arguments of Skolem functions.
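The inheritance of key paths described above can be sketched in Python as follows. This is only an illustration: `LOCAL_KEYS`, `LEAVES` and `skolem_args` are hypothetical names, and the tables are a hand transcription of the <xs:key> declarations of Fig. 3.

```python
from typing import Dict, List, Tuple

# Local key paths declared for non-leaf element types of S1 (from <xs:key> in Fig. 3).
LOCAL_KEYS: Dict[str, Tuple[str, ...]] = {
    "/A1": (),
    "/A1/A": ("N",),
    "/A1/A/P": ("T",),
}
# Leaf element types of S1.
LEAVES: List[str] = ["/A1/A/N", "/A1/A/U", "/A1/A/P/T", "/A1/A/P/Y"]


def skolem_args(path: str) -> Tuple[str, ...]:
    """Key paths of `path`: those inherited from its ancestors plus its local ones."""
    steps = path.strip("/").split("/")
    args: List[str] = []
    for depth in range(1, len(steps) + 1):
        prefix = "/" + "/".join(steps[:depth])
        # A leaf is, by default, its own local key path.
        default = (steps[depth - 1],) if prefix in LEAVES else ()
        for key in LOCAL_KEYS.get(prefix, default):
            if key not in args:
                args.append(key)
    return tuple(args)
```

For example, `skolem_args("/A1/A/P/Y")` collects N from /A1/A, T from /A1/A/P and Y from the leaf itself, which corresponds to the argument list ($yN, $yT, $yY) in line (10) of Fig. 4.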

In the definition of S₃ (Fig. 5), the schema specifies the key and keyref relationships between the K child element of the P element (the referenced key) and the R child element of the A element (the foreign key). Additionally, the value dependency K = k(T, N) says that the path N must start at the element A referencing P via its foreign key defined in AKeyref. Key references are captured as follows:

• in the exists clause any occurrence of a variable $xf ranging over values of a foreign key is replaced with a variable $xk ranging over values of the corresponding referenced key;

• the equality $xf = $xk is inserted into the where clause.

<xs:element name="A">
  <xs:complexType>...</xs:complexType>...
  <xs:keyref name="AKeyref" refer="PKey">
    <xs:selector xpath="."/>
    <xs:field xpath="R"/>
  </xs:keyref>
</xs:element>
<xs:element name="P">
  <xs:complexType>...</xs:complexType>
  <xs:key name="PKey">
    <xs:selector xpath="."/>
    <xs:field xpath="K"/>
  </xs:key>
  <xs:valdep>
    <xs:target name="K"/>
    <xs:function name="k"/>
    <xs:source xpath="T"/>
    <xs:source xpath="N" ref="AKeyref"/>
  </xs:valdep>
  ...
</xs:element>

Figure 5: Fragment of XML Schema for S₃

Using these rules, we obtain the following specification of the automapping over S3:

M₃₃ = foreach $zD3 in /D3, $zA in $zD3/A, $zN in $zA/N, $zR in $zA/R,
              $zP in $zD3/P, $zK in $zP/K, $zT in $zP/T, $zY in $zP/Y, $zC in $zP/C
      where   $zR = $zK
      when    $zK = k($zN,$zT), $zY = y($zT), $zC = c($zT)
      exists
        F_/D3() in F_()()/D3
        F_/D3/A($zN) in F_/D3()/A
        F_/D3/A/N($zN) in F_/D3/A($zN)/N with $zN
        F_/D3/A/R($zN,$zK) in F_/D3/A($zN)/R with $zK
        F_/D3/P($zK) in F_/D3()/P
        F_/D3/P/K($zK) in F_/D3/P($zK)/K with $zK
        F_/D3/P/T($zK,$zT) in F_/D3/P($zK)/T with $zT
        F_/D3/P/Y($zK,$zY) in F_/D3/P($zK)/Y with $zY
        F_/D3/P/C($zK,$zC) in F_/D3/P($zK)/C with $zC
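The role of the equality $zR = $zK in the where clause can be illustrated by enumerating bindings over I₃ in Python. The data layout (`authors`, `papers`) and the dictionary encoding of bindings are assumptions made for this sketch; the values are transcribed from I₃ as far as Fig. 1 can be read.

```python
# I3 as two lists (assumed transcription of Fig. 1):
# each author has a name N and a list of foreign keys R,
# each paper has a key K, a title T, a year Y and a conference C.
authors = [("a1", ["i1", "i2"]), ("a3", ["i3"])]
papers = [
    ("i1", "t1", "05", "C1"),
    ("i2", "t2", "03", "C2"),
    ("i3", "t3", "04", "C1"),
]

# foreach ranges over every author/R and every paper; the where clause
# $zR = $zK keeps only the combinations joined by the key reference.
bindings = [
    {"zN": n, "zR": r, "zK": k, "zT": t, "zY": y, "zC": c}
    for n, refs in authors
    for r in refs
    for (k, t, y, c) in papers
    if r == k
]
```

Without the equality the foreach clause would produce nine bindings; with it only the three author/paper pairs connected through R and K remain.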

3.3 Syntax and semantics for mappings

The foreach/where/when part of a mapping M determines a partially ordered set (Ω, ≤) of bindings of variables ($x,$y). For example, in the mapping M₂₁ (Fig. 6), for two bindings over I₂, ω₁ = ($xT → t1, $xN → a1, $xU → u1, $yY → y(t1)) and ω₂ = ($xT → t1, $xN → a2, $xU → u2, $yY → y(t1)), we have ω₁ < ω₂, because the tuple of leaf nodes providing values for ω₁ precedes the tuple of leaf nodes providing values for ω₂. Bindings from Ω are used in the exists E part to produce the result target instance. The ordering imposed in Ω by a source instance should be preserved in the target instance.

Note that if the foreach/where clause is defined over S₂, while the when/exists clause concerns S₁, then we deal with a mapping M₂₁ from S₂ into S₁. Then, after an appropriate replacement of variables, we obtain:

M₂₁ = foreach $xP2 in /P2, $xP in $xP2/P, $xT in $xP/T,
              $xA in $xP/A, $xN in $xA/N, $xU in $xA/U
      where   true
      when    C₁₁($yN,$yU,$yT,$yY) [$yN → $xN, $yU → $xU, $yT → $xT]
      exists  E₁₁($yN,$yU,$yT,$yY) [$yN → $xN, $yU → $xU, $yT → $xT]

Figure 6: Mapping M₂₁ from S₂ into S₁

In M₂₁ there is no replacement for $yY, thus its value must be set in some other way, e.g. as a null value [3]. We set it as the term y(t), where t is the current value of $xT (see Fig. 2(a)). It is a form of Skolemization. Thus, a mapping specification in XDMap conforms to the general form of source-to-target generating dependencies [1, 9, 12, 13]: ∀$x (G($x) ∧ Φ($x) ⇒ ∃$y C($x,$y) ∧ E($x,$y)).

Definition 1 An executable schema mapping in XDMap (or mapping for short) between a source schema S and a target schema T is a sequence M ::= (M, ..., M) of mapping constraints between S and T, where:

M := foreach G($x)
     where Φ($x)
     when C($x,$y)
     exists F_P/l($x,$y) in F_P($x′,$y′)/l [with $x″]

• G is a list of variable definitions over a source schema: $x in Q or $x in $x′/Q;

• Φ is a conjunction of atomic conditions: $x = $x′ or $x ≠ $x′;

• C is a list of target constraints $x = f($x) or $y = f($x), where $x ∈ $x, $y ∈ $y;

• F_P($x,$y) is a Skolem term, where P is a rooted path in a target schema;

• ($x′,$y′) ⊆ ($x,$y), $x″ ∈ ($x,$y). □

Definition 2 Let M = (G, Φ, C, E)($x,$y) be a mapping, and (Ω, ≤) be a partially ordered set of bindings of variables ($x,$y) determined by (G, Φ, C). A target instance I of a target schema T is then obtained as follows:

1. F_()() = r, the root of I.

2. F_P($x,$y)(ω) = n, a node of type P.

3. If F_P/l($x,$y)(ω) = n and F_P($x′,$y′)(ω) = n′, and ($x′,$y′) ⊆ ($x,$y), then n is a child of type l of the node n′.

4. Let F_P/l($x,$y)(ω₁) = n₁, F_P/l($x,$y)(ω₂) = n₂, where ω₁ ≤ ω₂, and ($x′,$y′)(ω₁) = ($x′,$y′)(ω₂). Then n₁ ≤ n₂ in the document order in the set of children of type l of the node F_P($x′,$y′)(ω₁).

5. If F_P/l($x,$y)(ω) = n is a leaf, then the text value of n is equal to ω($x″).
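The construction of Definition 2 can be sketched in Python for the mapping M₂₁. This is a simplified illustration, not the author's implementation: `Node`, `build_target`, `F` and `show` are hypothetical names, Skolem functions are modelled by a dictionary keyed on (path, arguments), terms are encoded as strings, and the bindings are transcribed from I₂ as far as Fig. 1 can be read.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    label: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def build_target(bindings: List[Dict[str, str]]) -> Node:
    """Evaluate the exists part of M21 over an ordered list of bindings."""
    created: Dict[Tuple[str, Tuple[str, ...]], Node] = {}

    def F(path: str, args: Tuple[str, ...], parent: Optional[Node] = None,
          text: Optional[str] = None) -> Node:
        # One Skolem function per path; equal arguments give the same node.
        key = (path, args)
        if key not in created:
            node = Node(path.rsplit("/", 1)[-1], text)
            created[key] = node
            if parent is not None:
                parent.children.append(node)  # binding order gives document order
        return created[key]

    root = F("", ())
    a1 = F("/A1", (), root)
    for w in bindings:
        n, u, t = w["xN"], w["xU"], w["xT"]
        y = f"y({t})"  # $yY has no source value, so the term y(t) is used
        a = F("/A1/A", (n,), a1)
        F("/A1/A/N", (n,), a, n)
        F("/A1/A/U", (n, u), a, u)
        p = F("/A1/A/P", (n, t), a)
        F("/A1/A/P/T", (n, t), p, t)
        F("/A1/A/P/Y", (n, t, y), p, y)
    return root


def show(node: Node) -> str:
    """Serialize a tree as label:text or label(child,...)."""
    if node.text is not None:
        return f"{node.label}:{node.text}"
    return f"{node.label}(" + ",".join(show(c) for c in node.children) + ")"


# Bindings over I2 in document order (assumed transcription of Fig. 1).
omega = [
    {"xT": "t1", "xN": "a1", "xU": "u1"},
    {"xT": "t1", "xN": "a2", "xU": "u2"},
    {"xT": "t2", "xN": "a1", "xU": "u1"},
]
i11 = build_target(omega)
```

Calling `show(i11.children[0])` serializes the /A1 subtree; because the node for author a1 is created once and reused for the third binding, both of a1's papers end up under the same A element, as in Fig. 2(a).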

4 Subsumptions on XML data trees

Till now we have assumed that source documents are "ontologically homogeneous". In real applications [16], however, we need domain ontologies to make use of relationships between the concepts used for data modeling. Relationships between concepts need to be generalized to cope with XML data trees. Then XML data, taken from different sources, can be merged into a document that is the greatest lower bound of the set of data being merged, i.e. is subsumed by the data. To discuss the problem more precisely, we will use a simple tree language TL to express paths and tree patterns (at the schema level) as well as values and trees (at the instance level).

T ::= P |P/(T, ..., T) – (tree patterns)

P ::= l|l/P – (paths)

t ::= v |T:v |P/(t, ..., t) – (trees)

v ::= s|(v, ..., v) – (values)

where l is a node label, and s is a string value. Note that a tree pattern is a set of paths with a common prefix.


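To define semantics for TL, we will use the approach used in description logic [4]. Let $\Delta$ be a non-empty set of individuals, and $\mathit{child} \subseteq \Delta \times \Delta$ be a transitively closed binary relation over $\Delta$. An interpretation of TL is a function $\cdot^I$ defined as follows:

\[
\begin{aligned}
c^I &\subseteq \Delta, \qquad l^I \subseteq \Delta,\\
(v_1, \dots, v_n)^I &= v_1^I \cap \dots \cap v_n^I,\\
(l/P)^I &= (l^I \bowtie \mathit{child} \bowtie P^I).2, \quad \text{where } (X \bowtie \mathit{child} \bowtie Y).2 = \{\, y \in Y \mid \exists x\,(x \in X \wedge (x, y) \in \mathit{child}) \,\},\\
(T_1, \dots, T_n)^I &= T_1^I \cap \dots \cap T_n^I,\\
(P/(T_1, \dots, T_n))^I &= (P^I \bowtie \mathit{child} \bowtie (T_1, \dots, T_n)^I).2,\\
(t_1, \dots, t_n)^I &= t_1^I \cap \dots \cap t_n^I,\\
(P/(t_1, \dots, t_n))^I &= (P^I \bowtie \mathit{child} \bowtie (t_1, \dots, t_n)^I).2,\\
(T{:}v)^I &= (T^I \bowtie \mathit{child} \bowtie v^I).2.
\end{aligned}
\]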
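The path part of this interpretation can be tried out on a tiny finite model in Python. Everything about the model is an assumption made for illustration: individuals are integers, `PARENT` lists direct parent links of one small document, `LABELS` plays the role of the label interpretation, and values are left out.

```python
from typing import Dict, Set, Tuple

# A tiny model: 0 = paper, 1 = title, 2 = author, 3 = fname, 4 = lname.
PARENT: Dict[int, int] = {1: 0, 2: 0, 3: 2, 4: 2}
LABELS: Dict[str, Set[int]] = {
    "paper": {0}, "title": {1}, "author": {2}, "fname": {3}, "lname": {4},
}


def transitive_child(parent: Dict[int, int]) -> Set[Tuple[int, int]]:
    """The transitively closed child relation: all (ancestor, descendant) pairs."""
    pairs: Set[Tuple[int, int]] = set()
    for node in parent:
        anc = parent.get(node)
        while anc is not None:
            pairs.add((anc, node))
            anc = parent.get(anc)
    return pairs


CHILD = transitive_child(PARENT)


def join2(xs: Set[int], ys: Set[int]) -> Set[int]:
    """(X join child join Y).2: the members of Y that have a predecessor in X."""
    return {y for y in ys if any((x, y) in CHILD for x in xs)}


def interp_path(path: str) -> Set[int]:
    """Interpretation of a path l or l/P, following the definition."""
    head, _, rest = path.partition("/")
    if not rest:
        return set(LABELS[head])
    return join2(LABELS[head], interp_path(rest))
```

Because the child relation is transitively closed, `interp_path("paper/fname")` still reaches the fname node even though fname is not a direct child of paper.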

We say that an expression $E_1$ is subsumed by an expression $E_2$, or that $E_2$ subsumes $E_1$, written $E_1 \sqsubseteq E_2$, if $E_1^I \subseteq E_2^I$. If both $E_1 \sqsubseteq E_2$ and $E_2 \sqsubseteq E_1$, then $E_1$ is equivalent to $E_2$, written $E_1 \equiv E_2$.

Theorem 1 The following rules hold:

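R1. $(v_1, \dots, v_n) \sqsubseteq (v_1, \dots, v_i)$, $(T_1, \dots, T_n) \sqsubseteq (T_1, \dots, T_i)$, $(t_1, \dots, t_n) \sqsubseteq (t_1, \dots, t_i)$, for any $1 \le i \le n$;

R2. $P/t \sqsubseteq t$;

R3. $P/T{:}v \sqsubseteq P{:}v$;

R4. if $P_1/t_1$ and $P_2/t_2$ are valid trees, then $P_1 \sqsubseteq P_2 \wedge t_1 \sqsubseteq t_2 \Rightarrow P_1/t_1 \sqsubseteq P_2/t_2$;

R5. $P/(T_1{:}v_1, \dots, T_n{:}v_n) \sqsubseteq P{:}(v_1, \dots, v_n)$.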

Proof (R1) follows from the property of set intersection; to prove (R2) note that $(P/t)^I = (P^I \bowtie \mathit{child} \bowtie t^I).2 \subseteq t^I$; in the proof of (R3) we use the fact that the child relation is transitively closed, thus we have $(P/T{:}v)^I = ((P^I \bowtie \mathit{child} \bowtie T^I).2 \bowtie \mathit{child} \bowtie v^I).2 \subseteq (P^I \bowtie \mathit{child} \bowtie v^I).2$; (R4) is a standard property of partial ordering relations. (R5) follows from the definition and from (R1) and (R3): $(P/(T_1{:}v_1, \dots, T_n{:}v_n))^I = (P/T_1{:}v_1, \dots, P/T_n{:}v_n)^I \subseteq (P{:}v_1, \dots, P{:}v_n)^I = (P{:}(v_1, \dots, v_n))^I$. □

In data integration we try to merge different XML documents into one duplicate-free and well-constructed document. In order to realize this we can use:

• definitions of source data schemas given by means of DTD or XML Schemas, if they are available;

• domain ontologies both for names and tags (at the schema level) and for values (at the instance level),

• any other resources which can be used to understand and classify data correctly, such as dictionaries, taxonomies, the-sauri, user provided match and mismatch information as well as knowledge discov-ered in data, e.g. keys and statistical characteristics.

Using these resources and methods, we can classify XML tree fragments, such as values, paths and tree patterns, into equivalence classes with respect to the synonymy relation. The value representing the class of semantically equivalent values resolves such issues as diversity of currencies, measures, and representation formats, in order to overcome difficulties in duplicate elimination and value comparison. For text values there is a problem with synonyms, different languages, jargon and so on. To solve these problems, methods from information retrieval can be used [5, 7].

Next, subsumption on these classes can be defined, where $v_1 \sqsubseteq v_2$ means that $v_1$ is more desirable than $v_2$, because $v_1$ is more informative, more reliable (one database may be considered to be more reliable than others) or has higher precision. A correct definition of this relation is crucial because it is used to define the subsumption relation over complex expressions.

In order to define subsumption on tree patterns, we start with establishing it on individual labels. As for values, patterns with different syntax may have the same meaning, e.g. fname, first-name, and firstname belong to the same equivalence class. The path author/name and the tree pattern author/(fname, lname) belong to different, but somehow related equivalence classes. Again, identification of such patterns can be supported by ontologies, statistics and machine learning methods [10, 18]. For complex patterns, the subsumption relation can be inferred from atomic patterns by means of the rules proved in Theorem 1.

t1: article/(title:"title-1", author/(fname:"John", lname:"Smith"))
t2: paper/(title:"title-1", author:"John Smith", journal:"journal-1")
t3: paper/(title:"title-1", author/(fname:"John", lname:"Smith"), journal:"journal-1")

Figure 7: Source data trees t1 and t2, and their join t3. Fat arrows denote equivalent key paths

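The following inference rules follow from Theorem 1 and are of special importance for data merging during data integration:

\[
\begin{aligned}
& P_1 \sqsubseteq P_2 \wedge v_1 \sqsubseteq v_2 \Rightarrow P_1{:}v_1 \sqsubseteq P_2{:}v_2,\\
& P/(P_1, \dots, P_n) \sqsubseteq P'/(P'_1, \dots, P'_m) \wedge (v_1, \dots, v_n) \sqsubseteq (v'_1, \dots, v'_m) \Rightarrow {}\\
& \qquad P/(P_1{:}v_1, \dots, P_n{:}v_n) \sqsubseteq P'/(P'_1{:}v'_1, \dots, P'_m{:}v'_m).
\end{aligned}
\]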

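It follows from Theorem 1 that it is sufficient to inspect subsumptions between trees and paths, rather than between trees and trees.

Example For data trees $t_1$, $t_2$, and $t_3$ from Fig. 7, we have:

Patterns: $\mathit{article} \equiv \mathit{paper}$, $\mathit{author}/\mathit{fname} \sqsubseteq \mathit{author}$, $\mathit{author}/\mathit{lname} \sqsubseteq \mathit{author}$, $\mathit{author}/(\mathit{fname}, \mathit{lname}) \sqsubseteq \mathit{author}$.

Values: $\text{"John Smith"} \sqsubseteq \text{"John"}$, $\text{"John Smith"} \sqsubseteq \text{"Smith"}$.

Trees: $\mathit{author}/(\mathit{fname}{:}\text{"John"}, \mathit{lname}{:}\text{"Smith"}) \sqsubseteq \mathit{author}{:}\text{"John Smith"}$, $t_3 \sqsubseteq t_1$, $t_3 \sqsubseteq t_2$.

Note that if we restricted ourselves to paths only, we would not be able to construct the expected minimal result tree $t_3$ from $t_1$ and $t_2$, because neither $\mathit{author}/\mathit{fname}{:}\text{"John"}$ nor $\mathit{author}/\mathit{lname}{:}\text{"Smith"}$ is subsumed by the path $\mathit{author}{:}\text{"John Smith"}$.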
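How declared atomic subsumptions combine with the first inference rule can be sketched in Python. This is a simplified illustration: the fact tables `PATTERN_FACTS` and `VALUE_FACTS` only restate the pattern and value subsumptions of the Example (in practice they would come from a domain ontology), and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Facts = Set[Tuple[str, str]]  # pairs (a, b) meaning "a is subsumed by b"

# Atomic facts restated from the Example; article and paper are equivalent,
# so the subsumption is declared in both directions.
PATTERN_FACTS: Facts = {
    ("article", "paper"), ("paper", "article"),
    ("author/fname", "author"), ("author/lname", "author"),
    ("author/(fname,lname)", "author"),
}
VALUE_FACTS: Facts = {("John Smith", "John"), ("John Smith", "Smith")}


def subsumed(a: str, b: str, facts: Facts) -> bool:
    """True if a is subsumed by b in the reflexive-transitive closure of the facts."""
    seen: Set[str] = {a}
    frontier: List[str] = [a]
    while frontier:
        cur = frontier.pop()
        if cur == b:
            return True
        for lower, upper in facts:
            if lower == cur and upper not in seen:
                seen.add(upper)
                frontier.append(upper)
    return False


def pv_subsumed(p1: str, v1: str, p2: str, v2: str) -> bool:
    """First inference rule: P1 under P2 and v1 under v2 give P1:v1 under P2:v2."""
    return subsumed(p1, p2, PATTERN_FACTS) and subsumed(v1, v2, VALUE_FACTS)
```

The function only reports what this one rule derives from the declared facts; a subsumption that needs the tree-pattern rule, such as the one between author/(fname:"John", lname:"Smith") and author:"John Smith", is outside its scope.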

Trees $t_1$ and $t_2$ from Fig. 7 can be joined because there are two keys holding in $t_1$ and $t_2$, respectively, which are equivalent and have equivalent values, i.e. $\mathit{article}/\mathit{title}{:}\text{"title-1"} \equiv \mathit{paper}/\mathit{title}{:}\text{"title-1"}$. Thus, these trees can be treated as describing the same entity from the semantic domain of interest. When trees describe different entities they are non-joinable. Non-joinable trees are merged in such a way that a new root label is created and all trees under consideration become the highest-level subtrees of the newly created root.
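The two merging cases can be sketched in Python on the trees of Fig. 7. This is a rough illustration under several assumptions that are not taken from the paper: trees are encoded as (label, children) pairs, the table `CANONICAL` stands in for the ontology stating that article and paper are equivalent, the key path is fixed to title, a structured subtree is preferred over a flat string as the more specific one, and the new root label `merged` is made up.

```python
from typing import Any, Dict, Tuple

Tree = Tuple[str, Any]  # (root label, children); children is a dict or a list

CANONICAL: Dict[str, str] = {"article": "paper"}  # article is equivalent to paper


def canon(label: str) -> str:
    """Representative of the equivalence class of a label."""
    return CANONICAL.get(label, label)


def merge(t1: Tree, t2: Tree, key: str = "title", new_root: str = "merged") -> Tree:
    """Join two trees with equivalent keys; otherwise put both under a new root."""
    (l1, c1), (l2, c2) = t1, t2
    joinable = (canon(l1) == canon(l2)
                and key in c1 and key in c2 and c1[key] == c2[key])
    if not joinable:
        return (new_root, [t1, t2])
    joined: Dict[str, Any] = dict(c2)
    for label, sub in c1.items():
        other = joined.get(label)
        # Keep the more specific subtree: a structured one beats a flat string.
        if other is None or (isinstance(sub, dict) and not isinstance(other, dict)):
            joined[label] = sub
    return (canon(l1), joined)


t1: Tree = ("article", {"title": "title-1",
                        "author": {"fname": "John", "lname": "Smith"}})
t2: Tree = ("paper", {"title": "title-1", "author": "John Smith",
                      "journal": "journal-1"})
t3 = merge(t1, t2)
```

With these inputs `t3` has the shape of the join shown in Fig. 7, while merging `t1` with a paper that has a different title yields a new root with both trees as its subtrees.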

5 Conclusion

We discussed some reasoning methods useful in XML data integration systems. We motivated our research with a scenario of data exchange in which data structured under source schemas is to be transformed into data structured under another schema (a target schema). In such data integration some missing or incomplete data can be inferred. The reasoning about missing data is based on data constraints imposed by the target schema. Integration of data needs mappings which describe the transformation from a source into a target schema. We propose a novel approach to XML schema mapping specification based on key constraints [6, 19]. First, automappings over schemas are generated, and next the automappings are combined to create mappings between the schemas represented by these automappings. The other kind of reasoning is based on ontologies and concerns the problem of finding the greatest lower bound of merged data. The assumption of the existence of some domain-oriented taxonomies and ontologies makes the problem more feasible than in the case of "deep Web integration" [10]. The method presented in the paper is a part of our research on XML data integration [16, 15], XML data transformation [14] and query reformulation.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, Reading, Massachusetts, 1995.

[2] S. Abiteboul, L. Segoufin, and V. Vianu. Representing and Querying XML with Incomplete Information. In PODS Conference, pages 150–161, 2001.

[3] M. Arenas and L. Libkin. XML Data Exchange: Consistency and Query Answering. In PODS, pages 13–24, 2005.

[4] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge, 2003.

[5] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, New York, 1999.

[6] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan. Reasoning about keys for XML. Information Systems, 28(8):1037–1063, 2003.

[7] J. C. P. Carvalho and A. S. da Silva. Finding similar identities among objects from multiple web sources. In WIDM 2003, pages 90–93. ACM, 2003.

[8] R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: getting to the core. ACM TODS, 30(1):174–210, 2005.

[9] R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Composing schema mappings: Second-order dependencies to the rescue. In PODS, pages 83–94, 2004.

[10] B. He, K. C.-C. Chang, and J. Han. Discovering complex matchings across web query interfaces: a correlation mining approach. In KDD 2004, pages 148–157. ACM, 2004.

[11] M. Lenzerini. Data integration: A theoretical perspective. In PODS, pages 233–246, 2002.

[12] S. Melnik, P. A. Bernstein, A. Y. Halevy, and E. Rahm. Supporting executable mappings in model management. In SIGMOD Conference, pages 167–178, 2005.

[13] A. Nash, P. A. Bernstein, and S. Melnik. Composition of mappings given by embedded dependencies. In PODS, 2005.

[14] T. Pankowski. A High-Level Language for Specifying XML Data Transformations. In ADBIS 2004. Lecture Notes in Computer Science, 3255:159–172, 2004.

[15] T. Pankowski. Management of executable schema mappings for XML data exchange. In DATAX 2006, EDBT 2006 Workshops. Lecture Notes in Computer Science (to appear), pages 1–12, 2006.

[16] T. Pankowski and E. Hunt. Data merging in life science data integration systems. In Intelligent Information Systems, Advances in Soft Computing, pages 279–288. Springer Verlag, 2005.

[17] L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernández, and R. Fagin. Translating web data. In VLDB, pages 598–609, 2002.

[18] A. Theobald and G. Weikum. The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking. In EDBT 2002. Lecture Notes in Computer Science, 2287:477–495, 2002.

[19] XML Schema Part 1: Structures Second Edition. www.w3.org/TR/xmlschema-1, 2004.

[20] C. Yu and L. Popa. Constraint-based XML query rewriting for data integration. In SIGMOD Conference, pages 371–382, 2004.