Approximate Subtree Identification in Heterogeneous XML Document Collections

(1)

Ismael Sanz¹, Marco Mesiti², Giovanna Guerrini³ and Rafael Berlanga¹

¹ Universitat Jaume I, Spain

² Università degli Studi di Milano, Italy

³ Università degli Studi di Pisa, Italy

(2)

Context

• Heterogeneous XML Document Collections

• No schema information

• Approximate subtrees

• Switched parent/children relationships

• Missing elements and levels

• Presence of don’t care nodes

[Figure: small example trees over the labels a, b and c]

(3)

Motivating Example

[Figure: example document trees under the abstract root db, built from the elements person, employee, address, street, city and town, with text values such as ‘London’ and ‘Old Street’]

(4)

Objectives

• Flexibility and adaptability:

• Support diverse structural similarity measures

• Tag similarity (syntactic and semantic)

• Work with standard XML indexing schemes

• Use the measures to capture

(5)

Summary of the Approach

• 2 steps:

• Create a flexible, generic way of retrieving candidate subtrees.

• Use one (or several) similarity measures to rank the result.

• Terminology

• Target tree: set of heterogeneous documents, represented as a tree with an abstract root ‘db’.

• Pattern tree: an abstract representation of a user query

(6)

Representation

• Access nodes using a numbering scheme

• Should be as generic as possible

• Minimal: (pre, post, level)

• Should work with more complex schemes

[Figure: example tree with root a (1, 3, 1) and children b (2, 2, 2) and c (3, 3, 2)]
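The minimal numbering scheme can be illustrated with a short sketch. It is not the authors' implementation: the `Tree` class and the helper names are hypothetical, and it assumes, based on the triples shown on the slides, that the second component of each triple is the largest preorder number found in the node's subtree. Under that reading a descendant test reduces to an interval check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Tree:
    """Hypothetical in-memory tree used only for this illustration."""
    label: str
    children: List["Tree"] = field(default_factory=list)


def number_tree(root: Tree) -> Dict[int, Tuple[str, int, int]]:
    """Return {pre: (label, end, level)} for every node below and including root.

    Assumption: 'end' is the largest preorder number inside the node's subtree,
    which is how the triples on the slides behave.
    """
    table: Dict[int, Tuple[str, int, int]] = {}
    counter = 0

    def visit(node: Tree, level: int) -> None:
        nonlocal counter
        counter += 1
        pre = counter
        for child in node.children:
            visit(child, level + 1)
        # After visiting all children, counter holds the last preorder
        # number assigned inside this subtree.
        table[pre] = (node.label, counter, level)

    visit(root, 1)
    return table


def is_descendant(d: Tuple[int, int, int], a: Tuple[int, int, int]) -> bool:
    """True if node d = (pre, end, level) lies strictly inside the subtree of a."""
    return a[0] < d[0] <= a[1]
```

Numbering a root a with two children b and c gives a (1, 3, 1), b (2, 2, 2) and c (3, 3, 2). A scheme based on true postorder ranks would work as well, with a different comparison in `is_descendant`.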

(7)

Pattern, Fragment, Region

• Fragment: subtree of the target with only relevant nodes

• Region: combination of fragments rooted at their nearest common ancestor

[Figure, repeated three times: the target tree rooted at db, with nodes labelled b, c, d, e, f and h, next to the pattern with nodes b, d and c]

(8)

Pattern-region Matching

[Figure: the pattern (nodes b, d, c) and three regions R1, R2 and R3, whose nodes carry the labels b, c, d, f and h]

(9)

Pattern-region Similarity

• Evaluation

$$\mathrm{Eval}(M) = \frac{\sum_{x_p \in V(P)\,:\,M(x_p) \neq \bot} \mathrm{Sim}(x_p, M(x_p))}{\max(|V(P)|, |V(R)|)}$$

• Similarity

$$\mathrm{MatchSim}(P, R) = \max_{M} \mathrm{Eval}(M)$$
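The evaluation of a mapping and the resulting pattern-region similarity can be sketched as follows. This is a brute-force illustration under stated assumptions, not the authors' procedure: vertices are hypothetical `(id, label, level)` tuples, a mapping may only pair vertices with equal labels, it must be injective, and unmapped pattern vertices are represented by `None` (standing in for ⊥).

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Vertex = Tuple[int, str, int]  # (id, label, level); hypothetical representation
Mapping = Dict[Vertex, Optional[Vertex]]


def eval_mapping(mapping: Mapping,
                 sim: Callable[[Vertex, Vertex], float],
                 n_pattern: int, n_region: int) -> float:
    """Sum the similarity of mapped vertices, normalized by the larger tree size."""
    total = sum(sim(xp, xr) for xp, xr in mapping.items() if xr is not None)
    return total / max(n_pattern, n_region)


def match_sim(pattern: List[Vertex], region: List[Vertex],
              sim: Callable[[Vertex, Vertex], float]) -> float:
    """Best evaluation over all label-preserving, injective partial mappings."""
    # For each pattern vertex: every region vertex with the same label, or None.
    options = [[xr for xr in region if xr[1] == xp[1]] + [None] for xp in pattern]
    best = 0.0
    for choice in product(*options):
        used = [xr for xr in choice if xr is not None]
        if len(used) != len(set(used)):
            continue  # two pattern vertices mapped to the same region vertex
        mapping = dict(zip(pattern, choice))
        best = max(best, eval_mapping(mapping, sim, len(pattern), len(region)))
    return best
```

With a constant vertex similarity of 1, a three-vertex pattern whose vertices are all found in a four-vertex region evaluates to 3/4, and to 2/3 when only two of them are found in a three-vertex region. Enumerating all mappings is exponential and is used here only to keep the definition visible.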

(10)

Vertex Similarity

• Match-based

$$\mathrm{Sim}_M(x_p, x_r) = 1$$

• Level-based

$$\mathrm{Sim}_L(x_p, x_r) = 1 - \frac{|\mathrm{level}_P(x_p) - \mathrm{level}_R(x_r)|}{\max(\mathrm{level}(P), \mathrm{level}(R))}$$

• Distance-based

$$\mathrm{Sim}_D(x_p, x_r) = 1 - \frac{|d_P(x_p) - d_R(x_r)|}{\max(d_P^{\max}, d_R^{\max})}$$

• Many other possibilities
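The match-based and level-based vertex similarities can be written down directly; a minimal sketch follows. The function names are hypothetical, and it assumes that level(P) and level(R) denote the number of levels of the pattern and of the region. Exact fractions are used so the results can be compared with hand calculations.

```python
from fractions import Fraction


def sim_match() -> Fraction:
    """Match-based similarity: every matched pair of vertices scores 1."""
    return Fraction(1)


def sim_level(level_p: int, level_r: int,
              levels_in_pattern: int, levels_in_region: int) -> Fraction:
    """Level-based similarity: penalize the difference between the level of
    the pattern vertex and the level of the region vertex it is mapped to.

    Assumption: levels_in_pattern and levels_in_region are the total number
    of levels of the pattern and of the region.
    """
    gap = abs(level_p - level_r)
    return 1 - Fraction(gap, max(levels_in_pattern, levels_in_region))
```

For example, a pattern vertex at level 2 mapped to a region vertex at level 3, with a two-level pattern and a three-level region, scores 1 − 1/3 = 2/3. A distance-based variant would have the same shape, with distances in place of levels.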

(11)

Similarity example

• Similarity of matching vertices

• Similarity of the pattern with regions

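[Table: Sim_M, Sim_L and Sim_D for the pattern vertices x_1^P, x_2^P and x_3^P; values in extraction order: 1, 1, 2/3, 1, 1, 2/3, 1/5]

[Table: Sim_M, Sim_L and Sim_D for the regions R1, R2 and R3; values in extraction order: 1, 1, 7/9, 2/3, 1/2, 4/9, 3/5, 2/5]

[Figure: pattern and region trees over the labels b, c, d and f]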

(12)

Fragment Construction

• Target index

• Correlates the element labels with their occurrences in the target (inverted index)

• Use a normalized label set to account for inexact label matching: two syntactically or semantically similar labels are indexed together

• Pattern index

• Obtained by extracting from the target index the elements similar to those in the pattern and organizing them level-by-level
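The two indexes can be sketched with plain dictionaries. This is an illustration under assumptions rather than the authors' data structure: label normalization is reduced to lowercasing plus a hypothetical synonym table (`SYNONYMS`), and the sample entries in `TARGET_NODES` are the labelled triples that appear on the later slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, int, int, int]  # (label, pre, post, level)

# Hypothetical stand-in for syntactic/semantic label similarity.
SYNONYMS = {"town": "city"}

# Labelled entries taken from the slides' running example.
TARGET_NODES: List[Entry] = [
    ("b", 1, 5, 1), ("c", 2, 4, 2), ("d", 5, 5, 2), ("f", 6, 10, 1),
    ("b", 7, 8, 2), ("c", 8, 8, 3), ("d", 10, 10, 3), ("h", 11, 16, 1),
    ("b", 12, 13, 2), ("d", 14, 16, 2),
]


def normalize(label: str) -> str:
    """Map similar labels to one representative so they are indexed together."""
    label = label.lower()
    return SYNONYMS.get(label, label)


def build_target_index(nodes: Iterable[Entry]) -> Dict[str, List[Tuple[int, int, int]]]:
    """Inverted index: normalized label -> list of (pre, post, level) occurrences."""
    index: Dict[str, List[Tuple[int, int, int]]] = defaultdict(list)
    for label, pre, post, level in nodes:
        index[normalize(label)].append((pre, post, level))
    return dict(index)


def build_pattern_index(target_index: Dict[str, List[Tuple[int, int, int]]],
                        pattern_labels: Iterable[str]) -> Dict[int, List[Entry]]:
    """Keep only occurrences whose label appears in the pattern, grouped by level."""
    by_level: Dict[int, List[Entry]] = defaultdict(list)
    for label in {normalize(l) for l in pattern_labels}:
        for pre, post, level in target_index.get(label, []):
            by_level[level].append((label, pre, post, level))
    # Sort levels top-down and entries within a level by preorder number.
    return {lvl: sorted(by_level[lvl], key=lambda e: e[1]) for lvl in sorted(by_level)}
```

For the pattern labels b, c and d, the sample data yields b,1,5,1 at level 1, five entries at level 2 and c,8,8,3 with d,10,10,3 at level 3.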

(13)

Fragment Construction

• Target and pattern index

[Figure: the target tree rooted at db, with nodes labelled b, c, d, e, f and h, and the pattern with nodes b, d and c]

(14)

Fragment Construction

• Target and pattern index

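[Figure: target index, an inverted index from each label (b, c, d, e, f, h) to its occurrences as pre,post,level triples: 1,5,1; 2,4,2; 5,5,2; 3,3,3; 4,4,3; 11,16,1; 7,8,2; 8,8,3; 10,10,3; 9,10,2; 6,10,1; 14,16,2; 13,13,2; 16,16,4; 12,13,2; 15,16,2]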

(15)

Fragment Construction

• Target and pattern index

[Figure: the target index (labels b, c, d, e, f, h with their pre,post,level triples), the pattern with nodes b, d and c, and the pattern index built so far]

Level 1: b,1,5,1

Level 2: b,7,8,2; b,12,13,2

(16)

Fragment Construction

• Target and pattern index

[Figure: the target index (labels b, c, d, e, f, h with their pre,post,level triples), the pattern with nodes b, d and c, and the pattern index built so far]

Level 1: b,1,5,1

Level 2: c,2,4,2; b,7,8,2; b,12,13,2

Level 3: c,8,8,3

(17)

Fragment Construction

• Target and pattern index

[Figure: the target index (labels b, c, d, e, f, h with their pre,post,level triples), the pattern with nodes b, d and c, and the complete pattern index]

Level 1: b,1,5,1

Level 2: c,2,4,2; d,5,5,2; b,7,8,2; b,12,13,2; d,14,16,2

Level 3: c,8,8,3; d,10,10,3

(18)

Fragment Construction

• Compute fragments by traversing the pattern index

• Algorithm:

• Begin at the highest available level

• Find descendants in the sublevels

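• Cost

• K = maximal size of a level structure

[Formula: an O(·) bound on the cost, written in terms of T, P, label, NL and K; its exact form is not recoverable from the extracted slide]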
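The traversal described above can be sketched as follows. This is one possible reading, not the authors' algorithm: it assumes that "highest available level" means the level closest to the root, that the second component of each triple closes the interval of preorder numbers in a node's subtree, and that every node joins the first fragment whose root contains it. The sample `PATTERN_INDEX` holds the entries of the slides' running example.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, int, int, int]  # (label, pre, end, level)

# Pattern index of the running example, organized level by level.
PATTERN_INDEX: Dict[int, List[Entry]] = {
    1: [("b", 1, 5, 1)],
    2: [("c", 2, 4, 2), ("d", 5, 5, 2), ("b", 7, 8, 2),
        ("b", 12, 13, 2), ("d", 14, 16, 2)],
    3: [("c", 8, 8, 3), ("d", 10, 10, 3)],
}


def build_fragments(pattern_index: Dict[int, List[Entry]]) -> List[List[Entry]]:
    """Group pattern-index entries into fragments, top level first."""
    fragments: List[List[Entry]] = []
    assigned: Set[Entry] = set()
    levels = sorted(pattern_index)
    for i, level in enumerate(levels):
        for root in pattern_index[level]:
            if root in assigned:
                continue  # already part of a fragment rooted higher up
            _, pre, end, _ = root
            members = [root]
            assigned.add(root)
            # Look for descendants of this root in the deeper levels.
            for deeper in levels[i + 1:]:
                for node in pattern_index[deeper]:
                    if node not in assigned and pre < node[1] <= end:
                        members.append(node)
                        assigned.add(node)
            fragments.append(members)
    return fragments
```

On the sample data this produces five fragments: b,1,5,1 with c,2,4,2 and d,5,5,2; b,7,8,2 with c,8,8,3; and b,12,13,2, d,14,16,2 and d,10,10,3 each on their own.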

(19)

Fragment Construction

• Target and pattern index

[Figure: the target index (labels b, c, d, e, f, h with their pre,post,level triples), the pattern with nodes b, d and c, and the complete pattern index]

Level 1: b,1,5,1

Level 2: c,2,4,2; d,5,5,2; b,7,8,2; b,12,13,2; d,14,16,2

Level 3: c,8,8,3; d,10,10,3

(20)

Region Construction

• Potentially exponential complexity

• Locality principle: merging fragments or regions only makes sense when they are close

• Remark: merging adds don’t care nodes

• In practice, merge adjacent

(21)

Region Construction

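[Figure: fragment nodes before merging: b,1,5,1; c,2,4,2; d,5,5,2; b,7,8,2; c,8,8,3; d,10,10,3; b,12,13,2; d,14,16,2]

[Figure: regions after merging: b,7,8,2; c,8,8,3 and d,10,10,3 under the added node f,6,10,1; b,12,13,2 and d,14,16,2 under the added node h,11,16,1]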
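Merging two fragments into a region rooted at their nearest common ancestor can be sketched with the same interval test. The code below is an illustration, not the authors' procedure: `TARGET_NODES` lists only the labelled nodes of the running example, the triple for the abstract root db (0, 16, 0) is an assumption, and only the common ancestor itself is added to the region.

```python
from typing import List, Tuple

Entry = Tuple[str, int, int, int]  # (label, pre, end, level)

# Labelled nodes of the running example; the db triple is assumed.
TARGET_NODES: List[Entry] = [
    ("db", 0, 16, 0),
    ("b", 1, 5, 1), ("c", 2, 4, 2), ("d", 5, 5, 2), ("f", 6, 10, 1),
    ("b", 7, 8, 2), ("c", 8, 8, 3), ("d", 10, 10, 3), ("h", 11, 16, 1),
    ("b", 12, 13, 2), ("d", 14, 16, 2),
]


def nearest_common_ancestor(nodes: List[Entry], target: List[Entry]) -> Entry:
    """Deepest target node whose subtree interval strictly contains all nodes."""
    lo = min(n[1] for n in nodes)
    hi = max(n[1] for n in nodes)
    ancestors = [t for t in target if t[1] < lo and hi <= t[2]]
    # Common ancestors lie on one root path, so the largest level is the nearest.
    return max(ancestors, key=lambda t: t[3])


def merge(frag_a: List[Entry], frag_b: List[Entry], target: List[Entry]) -> List[Entry]:
    """Region = both fragments plus their nearest common ancestor, in document order."""
    root = nearest_common_ancestor(frag_a + frag_b, target)
    return sorted([root] + frag_a + frag_b, key=lambda n: n[1])
```

Merging the fragment b,7,8,2; c,8,8,3 with d,10,10,3 adds f,6,10,1 as the region root, and merging b,12,13,2 with d,14,16,2 adds h,11,16,1. Merging fragments from different top-level subtrees falls back to db, which is where the locality principle would stop the process.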

(23)

Experimental Results

(26)

Conclusions

• Conclusions

• Developed an approach for the identification of subtrees which are similar to a given pattern in a collection of heterogeneous XML documents

• Future work

• Framework for selecting, composing and applying similarity measures

• Add some constraints to vertices and