Holistic Twig Joins: Optimal XML Pattern Matching

(1)

Nicolas Bruno ColumbiaUniversity [email protected] Nick Koudas AT&T Labs{Research [email protected]

Divesh Srivastava

AT&T Labs{Research

[email protected]

Abstract

XMLemploysatree-structureddatamodel,and,naturally,XMLqueriesspecifypatternsofselection

predicatesonmultipleelementsrelatedbyatreestructure.Findingalloccurrencesofsuchatwigpattern

inanXMLdatabaseisacoreoperationforXMLqueryprocessing. Priorworkhastypicallydecomposed

the twigpattern into binarystructural (parent-childand ancestor-descendant)relationships, and twig

matchingis achievedby: (i)using structuraljoinalgorithmstomatchthebinaryrelationships against

the XML database,and (ii) stitching together thesebasic matches. A limitation ofthis approachfor

matchingtwigpatternsisthatintermediateresultsizescangetlarge,evenwhentheinput andoutput

sizesaremoremanageable.

In this paper, we propose a novel holistic twig join algorithm, TwigStack, for matching an XML

query twigpattern. Ourtechniqueusesachain oflinkedstacksto compactlyrepresent partialresults

toroot-to-leafquerypaths,whicharethencomposedtoobtainmatchesforthetwigpattern. Whenthe

twigpatternusesonlyancestor-descendantrelationshipsbetweenelements,TwigStackis I/OandCPU

optimalamongallsequentialalgorithmsthatreadtheentireinput: itislinearinthesumofsizesofthe

input listsandthe nalresultlist,butindependentofthe sizesofintermediateresults. Wethenshow

howtouse(amodicationof)B-trees,alongwithTwigStack,tomatchquerytwigpatternsinsub-linear

time. Finally, we complementour analysis withexperimental results onarange ofreal and synthetic

data,andquerytwigpatterns.

1 Introduction

XMLemploysatree-structuredmodelforrepresentingdata. Queriesin XMLquerylanguages(see,e.g.,[7,

4, 2]) typically specify patterns of selectionpredicates on multiple elements that havesome specied tree

structuredrelationships. Forexample,theXQueryexpression:

book[title = `XML']//author[fn = `jane' AND ln = `doe']

matchesauthorelementsthat(i)haveachildsubelementfnwithcontentjane,(ii)haveachildsubelement

lnwith contentdoe, and (iii) are descendantsof bookelementsthat haveachild titlesubelement with

contentXML.Thisexpressioncanberepresentedasanode-labeledtwig(orsmalltree)patternwithelements

andstringvaluesasnodelabels.

FindingalloccurrencesofatwigpatterninadatabaseisacoreoperationinXMLqueryprocessing,both

inrelationalimplementationsofXMLdatabases,andinnativeXMLdatabases. Priorwork(see,forexample,

[11,23, 16,18, 17, 26,1])hastypicallydecomposed thetwigpatterninto asetof binary(parent-childand

ancestor-descendant)relationshipsbetweenpairsofnodes,e.g.,theparent-childrelationships(book,title)

(2)

(ii)\stitching"togetherthesebasicmatches.

Forsolvingtherstsub-problemofmatchingbinarystructuralrelationships,Zhangetal.[26]proposed

avariation ofthe traditionalmerge join algorithm,the multi-predicatemerge join (MPMGJN) algorithm,

based on the (DocId, LeftPos:RightPos, LevelNum) representation of positions of XML elements and

stringvalues(seeSection2.3fordetailsaboutthisrepresentation). TheirresultsshowedthattheMPMGJN

algorithm couldoutperformstandardRDBMS join algorithmsbymorethananorder ofmagnitude. More

recently, Al-Khalifa et al. [1] took advantageof the samerepresentation of positions of XML elements to

deviseI/OandCPUoptimaljoinalgorithmstomatchbinarystructuralrelationshipsonanXMLdatabase.

Thesecondsub-problemofstitchingtogetherthebasicmatchesobtainedusingbinary\structural"joins

requiresidentifyingagoodjoinorderinginacost-basedmanner,takingselectivitiesandintermediateresult

size estimatesinto account. In this paper, weshow that a basiclimitation of this (traditional) approach

for matchingquery twigpatterns isthat intermediate resultsizes cangetverylarge,evenwhen theinput

and nalresult sizesare muchmoremanageable. As aresult, weseek abetter solutionto theproblem of

matchingquerytwig patternseÆciently.

Inthis paper, wepropose anovel holistic twig join approach formatching XML query twig patterns,

whereinnolargeintermediateresultsarecreated. Ourtechniqueusesthe(DocId, LeftPos : RightPos,

LevelNum)representationofpositionsofXMLelementsandstringvalues(thatsuccinctlycapturesstructural

relationshipsbetweennodesintheXMLdatabase). Italsousesachainoflinkedstackstocompactlyrepresent

partialresultstoindividualqueryroot-to-leafpaths,whicharethencomposedtoobtainmatchestothequery

twigpattern.

Sinceagreat deal of XML data isexpected to be storedin relationaldatabasesystems(all themajor

DBMSvendorsincludingOracle,IBMandMicrosoftareprovidingsystemsupportforXMLdata),ourstudy

providesevidencethatRDBMSsystemsneedtoaugmenttheirsuiteofqueryprocessingstrategiestoinclude

holistictwigjoins foreÆcientXMLqueryprocessing. Our studyis equallyrelevantfor nativeXMLquery

engines,sinceholistictwigjoinsareaneÆcientset-at-a-timestrategyformatchingXMLquerypatterns, in

contrasttothenode-at-a-timeapproachofusingtreetraversals.

1.1 Outline and Contributions

Webeginbypresentingbackgroundmaterial(datamodel,querytwigpatterns,andpositionalrepresentations

ofXMLelements)in Section2. Ourmaincontributionsare:

We develop two families of holistic path join algorithms in Section 3 to match XML query

root-to-leaf paths eÆciently. The rst, PathStack, generalizes the Stack-Tree-Desc binary structural join

algorithm of[1], whilethesecond,PathMPMJ,generalizestheMPMGJNbinaryjoin algorithmof [26].

WeanalyzePathStackandshowthatitisI/OandCPUoptimalamongallsequentialalgorithmsthat

read theentireinput,andhasworst-casecomplexitieslinearinthesumofinputandoutputsizesbut

independentof the sizesofintermediateresults.

Wethen developTwigStackin Section4,aholistic twig join algorithmthat (i)renesPathStackto

ensure that resultscomputed for oneroot-to-leafpath ofa twigpattern are likely to havematching

results in other paths of thetwig pattern, and (ii) merges resultsfor thedierentroot-to-leafpaths

in the querytwig pattern, to computethedesired output. Whenthequery twig usesonly

ancestor-descendantrelationships between elements, weanalytically demonstrate that TwigStack is I/O and

(3)

<head> Origins <=head> ...

<=section> ... <=chapter> ... <=book>

book

(1,66,4)

head

(1,68:78,3)

(1,65:67,3)

2000

title

section

(1,61:63,2)

(1,64:93,2)

year

chapter

(1,3,3)

(1,2:4,2)

title

(1,5:60,2)

allauthors

XML

author

(1,62,3)

author

fn

ln

fn

(1,8,5)

jane

ln

fn

ln

(1,43,5)

jane

john

(1,11,5)

poe

doe

XML

Origins

(1,70,5)

(1,7:9,4)

(1,6:20,3)

(1,26,5)

(1,46,5)

(1,1:150,1)

(1,69:71,4)

(a) (b)

Figure1: (a)AsampleXMLdatabasefragment,(b)Tree representation

Finally,in Section5wepresentexperimentalresultsonarangeofrealandsyntheticdata,andquery

twigpatterns,to complementouranalyticalresults:

{ Weshowthesubstantialperformancebenetsof usingholistic twig joins overbinary structural

joins (forarbitraryjoinorders).

{ WeshowthatPathStackissignicantlyfasterthanPathMPMJamongholisticpathjoinalgorithms.

ThisvalidatestheanalyticalresultsdemonstratingtheI/OandCPUoptimalityofPathStack.

{ For the case of twigpatterns, weshow that the use of TwigStack is better (both in time and

space) than the independent use of PathStack on each root-to-leaf path, even when the twig

patterncontainsparent-childstructuralrelationships.

{ We show how to use a modication of B-trees, denoted XB-trees, along with TwigStack, to

performmatchingofquerytwigpatternsin sub-lineartime.

WedescriberelatedworkinSection6,andconcludebydiscussingongoingandfutureworkinSection7.

2 Background

2.1 Data Model and Query Twig Patterns

An XMLdatabaseisaforestofrooted, ordered,labeledtrees, each nodecorrespondingto anelementora

value, and theedges representing(direct) element-subelementor element-valuerelationships. Node labels

consist of aset of (attribute, value) pairs, which suÆces to model tags, IDs,IDREFs, etc. Theordering of

sibling nodesimplicitlydenesatotalorderonthenodesin atree,obtainedbyapreordertraversalofthe

treenodes. ForthesampleXML document ofFigure1(a), itstree representationisshownin Figure 1(b).

(TheutilityofthenumbersassociatedwiththetreenodeswillbeexplainedinSection2.3.)

QueriesinXMLquerylanguageslikeXQuery[2],Quilt[4]andXML-QL[7]makeuseof(nodelabeled)

(4)

XML

book

year

(a)

title

2000

XML

author

title

book

(b)

fn

ln

jane

doe

Figure2: Querytwig patterns

include elementtags, attribute-valuecomparisons, andstringvalues,and thequerytwig pattern edgesare

eitherparent-childedges(depictedusingasingleline)orancestor-descendantedges(depictedusingadouble

line). Forexample,theXQueryexpression:

book[title = `XML' AND year = `2000']

which matchesbookelementsthat (i)haveachild titlesubelement withcontentXML,and(ii) havea

childyearsubelementwithcontent2000,canberepresentedasthetwigpatterninFigure2(a). Only

parent-childedgesareusedinthiscase. Similarly,theXQueryexpressionintheintroductioncanberepresentedas

thetwigpatternin Figure 2(b). Notethat anancestor-descendantedgeis usedbetweenthebookelement

andtheauthorelement.

Ingeneral,ateachnodeinthequerytwigpattern,thereisanode predicate ontheattributes(e.g.,tag,

content)ofthenodeinquestion. Forthepurposesofthispaper,exactlywhatispermittedinthispredicate

isnotmaterial. Similarly,thephysicalrepresentationof thenodesin theXML databaseisnotrelevantto

theresultsinthispaper. ItsuÆcesforourpurposesthattherebeeÆcientaccessmechanisms(suchasindex

structures)toidentifythenodesintheXMLdatabasethatsatisfyanygivennodepredicateq,andreturna

streamofmatchesT

q .

2.2 Twig Pattern Matching

Givenaquerytwigpattern QandanXMLdatabaseD,amatch ofQinD isidentiedbyamappingfrom

nodesin Qtonodesin D,such that: (i)querynodepredicatesaresatisedbythecorrespondingdatabase

nodes(the imagesunder themapping),and (ii)thestructural (parent-childandancestor-descendant)

rela-tionships betweenquerynodesare satisedbythecorrespondingdatabasenodes. Theanswer to queryQ

withnnodescan berepresentedasann-aryrelationwhereeachtuple (d

1 ;:::;d

n

)consistsofthedatabase

nodesthatidentify adistinctmatchof Qin D.

Finding all matches of a query twig pattern in an XML database is a core operation in XML query

processing, both in relational implementations of XML databases, and in nativeXML databases. In this

paper,weconsiderthetwigpatternmatchingproblem:

GivenaquerytwigpatternQ,andanXMLdatabaseDthathasindexstructurestoidentify

databasenodesthat satisfyeachofQ'snodepredicates,computetheanswertoQonD.

Consider,for example,the querytwig patternin Figure 2(a),and thedatabasetree in Figure 1. This

querytwigpatternhasonematchinthedatatreethat mapsthenodesinthequerytotherootofthedata

(5)

Thekeyto aneÆcient,uniform mechanismfor set-at-a-time (join-based)matching ofquery twigpatterns

isapositionalrepresentationofoccurrencesofXMLelementsand stringvaluesin theXMLdatabase(see,

e.g.,[5,6,26]),which extendstheclassicinvertedindex datastructureininformationretrieval[21].

Werepresenttheposition ofastringoccurrencein the XMLdatabaseasa3-tuple (DocId, LeftPos,

LevelNum),andthepositionofanelementoccurrenceasa3-tuple(DocId, LeftPos:RightPos, LevelNum),

where(i)DocIdistheidentierofthedocument;(ii)LeftPosand RightPoscanbegeneratedbycounting

word numbers from the beginning of the document DocId until the start and the end of the element,

respectively; and (iii) LevelNum is the nesting depth of the element (or string value) in the document.

Figure1(b) shows3-tuplesassociatedwithsometreenodes,basedonthisrepresentation(theDocIdforall

nodesischosentobeone).

Structuralrelationshipsbetweentreenodeswhosepositionsarerecordedinthisfashioncanbedetermined

easily: (i)ancestor-descendant: atreenoden

2

whosepositionin theXML databaseisencodedas(D

2 ;L 2 : R 2 ;N 2

)isadescendantofatreenoden

1

whosepositionisencodedas(D

1 ;L 1 :R 1 ;N 1 )iD 1 =D 2 ;L 1 <L 2 , andR 2 <R 1 1

;(ii)parent-child: atreenoden

2

whosepositionintheXMLdatabaseisencodedas(D

2 ;L 2 : R 2 ;N 2

)isachildofatreenoden

1

whosepositionisencodedas(D

1 ;L 1 :R 1 ;N 1 )iD 1 =D 2 ;L 1 <L 2 ;R 2 < R 1 ,andN 1 +1=N 2

. Forexample,inFigure1(b),theauthornodewithposition(1;6:20;3)isadescendant

ofthe book node withposition (1;1:150;1),and thestring\jane"withposition (1;8;5) isachild ofthe

authornodewithposition(1;7:9;4).

A key point worth noting about this representation of node positions in the XML data tree is that

checking an ancestor-descendant relationship is as simple as checking a parent-child relationship(we can

checkforanancestor-descendantstructuralrelationshipwithoutknowledgeoftheintermediatenodesonthe

path). Also,thisrepresentationofpositionsofnodesallowforcheckingorder(e.g.,noden

2

followsnoden

1 )

andstructuralproximity(e.g., noden

2

isadescendantwithin3levelsofn

1

)relationships.

3 Holistic Path Join Algorithms

3.1 Notation

Letq (with or withoutsubscripts) denotetwigpatterns, aswell as(interchangeably) theroot nodeof the

twigpattern. Inouralgorithms,wemakeuseofthefollowingtwignodeoperations: isLeaf:Node!Bool,

isRoot: Node!Bool, parent: Node!Node,children: Node!fNodeg,and subtreeNodes: Node!

fNodeg. Pathquerieshaveonlyonechildpernode,otherwisechildren(q)returnsthesetofchildrennodes

ofq. TheresultofsubtreeNodes(q)isthenodeqand allitsdescendants.

Associated with each node q in a query twig pattern there is a stream T

q

. The stream contains the

positional representationsof the databasenodes that match the node predicate at the twig pattern node

q (possibly obtained using an eÆcient access mechanism, such asan index structure). The nodes in the

streamaresortedbytheir(DocId, LeftPos)values. Theoperationsoverstreamsare: eof,advance,next,

nextL,andnextR. ThelasttwooperationsreturntheLeftPosandRightPoscoordinatesin thepositional

representationofthenextelementin thestream,respectively.

Inourstack-basedalgorithms, PathStackand TwigStack, we alsoassociatewith each query nodeq a

stackS

q

. Eachdatanodeinthestackconsistsofapair: (positionalrepresentationofanodefromT

q

,pointer

toanodein S

parent(q)

). Theoperationsoverstacksare: empty, pop,push,topL, andtopR. Thelast two

1

(6)

in thestack,respectively. Atevery pointduring thecomputation, (i)the nodesin stackS

q

(frombottom

totop)areguaranteedtolieonaroot-to-leafpathintheXMLdatabase,and(ii)thesetofstackscontaina

compactencoding ofpartialandtotalanswerstothequerytwigpattern,whichcanrepresentinlinearspace

apotentially exponential(in thenumberof querynodes)numberof answersto thequerytwig pattern, as

illustratedbelow.

A

₁

B

₁

C

₁

A

₁

B

₂

C

₁

A

₂

B

₂

C

₁

C

₁

S

_C

B

₁

S

_B

B

₂

A

₁

S

_A

A

₂

A

B

C

A

₁

B

₁

A

2 B

2 C

₁

(a) Data

(b) Query

(c) Stack encoding

(d) Query results

Figure 3: Compactencoding ofanswersusingstacks.

Example3.1 Figure3illustratesthe stack encodingof answerstoapathqueryfor asample dataset. The

answer [A 2 ;B 2 ;C 1

] is encoded since C

1 points to B 2 , and B 2 points to A 2 . Since A 1 is below A 2 on the stack S A , [A 1 ;B 2 ;C 1

] isalso an answer. Finally, since B

1 is below B 2 on the stack S B ,and B 1 pointsto A 1 ,[A 1 ;B 1 ;C 1

]isalsoananswer. Notethat[A

2 ;B

1 ;C

1

]isnotananswer,sinceA

2

isabovethenode (A

1 ) onstack S A towhich B 1 points.

Wemakecrucialuseofthiscompactstackencodingin ouralgorithms,PathStackandTwigStack.

3.2 PathStack

Algorithm PathStack, which computes answersto aquery pathpattern, is presentedin Figure 4, forthe

casewhenthestreamscontainnodesfrom asingleXML document. Whenthestreams containnodesfrom

multiple XML documents, the algorithm is easily extended to test equalityof DocIdbefore manipulating

thenodesinthestreamsandstacks.

Thekeyidea of Algorithm PathStack is to repeatedly construct (compact) stack encodings of partial

andtotalanswerstothequerypathpattern,byiteratingthroughthestreamnodesinsortedorderoftheir

LeftPosvalues;thus,thequerypathpatternnodeswillbematchedfromthequeryrootdowntothequery

leaf. Line2,in AlgorithmPathStack,identiesthestreamcontainingthenextnodetobeprocessed. Lines

3-5 removepartial answersfrom thestacksthat cannot be extended to totalanswers, given knowledge of

thenextstreamnodeto beprocessed. Line6augmentsthepartial answersencodedin thestackswiththe

newstreamnode. WheneveranodeispushedonthestackS

qmin

, whereq

min

istheleafnodeof thequery

path,the stackscontainanencoding oftotalanswerstothe querypath,andAlgorithm showSolutionsis

invokedbyAlgorithmPathStack(lines7-9)to \output"these answers.

Anaturalwayfor AlgorithmshowSolutionsto outputquerypath answersencoded inthe stacksis as

n-tuplesthat arein sortedleaf-to-rootorder ofthequerypath. Thiswill ensurethat,overthesequenceof

invocationsof AlgorithmshowSolutionsbyAlgorithmPathStack,theanswersto thequerypatharealso

computedinleaf-to-rootorder.Figure5showssuchaprocedureforthecasewhenonlyancestor-descendant

(7)

01 while :end (q)

02 q

min

=getMinSource(q)

03 for qi in subtreeNodes(q) //cleanstacks

04 while (:empty (S q i )^topR (S q i )<nextL(T q min )) 05 pop(Sq i ) 06 moveStreamToStack(T q min ;S q min , pointer to top (S parent (q min ) )) 07 if (isLeaf(q min )) 08 showSolutions(Sq min ;1) 09 pop (S q min ) Function end(q)

return 8qi2subtreeNodes(q):isLeaf (qi))eof(Tq

i )

Function getMinSource(q)

return qi2subtreeNodes(q) such that nextL (Tq

i ) is minimal Procedure moveStreamToStack(T q ;S q ;p) 01 push (Sq;(next (Tq);p)) 02 advance(Tq)

Figure4: AlgorithmPathStack

Whenparent-childedgesarepresentinthequerypath, wealsoneed totaketheLevelNuminformation

into account. PathStack doesnot needto change, but we needto ensurethat each time showSolutions

is invoked, it does not output incorrect tuples, in addition to avoiding unnecessary work. This can be

achieved by modifying the recursive call (lines 6-7) to check for parent-child edges, in which case only a

single recursive call (showSolutions(SN 1;S[SN]:index[SN]:pointer totheparentstack)) needs to be

invoked,afterverifyingthat theLevelNumofthetwonodesdierbyone. Loopingthroughallnodesinthe

stackS[SN 1]wouldstillbecorrect,butitwoulddomoreworkthanisstrictlynecessary.

Ifwedesirethenal answersto thequerypathbepresentedin sortedroot-to-leaforder(as opposed to

sorted leaf-to-root order), it is easyto see that it doesnotsuÆce that each invocationof showSolutions

outputanswersencodedin thestackin theroot-to-leaforder. Toproduceanswersin thesortedroot-to-leaf

order,wewouldneedto \block" answers,and delaytheiroutput untilwearesurethat noanswerpriorto

themin thesortordercan becomputed. Thedetails ofhowto achievethisnaturallyextendtheintuitions

ofAlgorithmStack-Tree-Ancfrom [1],andarepresentedin thenextsection.

Example3.2 Consider the leftmost path, book{title{XML,in each of the query twigs of Figure 2. If we

used the binary structural join algorithms of [26,1], we wouldrst needto compute matches to oneof the

parent-childstructural relationships: book{title,ortitle{XML.Sinceevery bookhas atitle,thisbinary

join would producea lot of matches against an XML books database, even whenthere are only afew books

whose title isXML. If,instead,werst computedmatches to title{XML,wewould also match pairs under

chapterelements,asinthe XMLdatatreeofFigure1(b),which donotextendtototalanswerstothe query

pathpattern. UsingAlgorithmPathStack,partialanswers arecompactlyrepresentedinthe stacks,andnot

output. Using the XML data tree of Figure 1(b), only one totalanswer, identied by the mapping [ book

(8)

//Assume,forsimplicity,thatthestacksofthequerynodesfromtheroottothecurrent

// leafnodeweareinterestedincanbeaccessedasS[1];:::;S[n].

//Also assumethatwehaveaglobalarrayindex [1::n]ofpointerstothestackelements.

//index [i]representsthepositioninthei'thstackthatweareinterestedinforthe

// currentsolution,wherethebottomofeachstackhasposition1.

//MarkweareinterestedinpositionSP ofstackSN.

01 index [SN]=SP

02 if (SN==1) //weareintheroot

03 //outputsolutionsfromthestacks

04 output (S[n]:index[n];:::;S[1]:index[1])

05 else //recursivecall

06 for i=1 to S[SN]:index[SN]:pointertoparent

07 showSolutions(SN 1;i)

Figure5: ProcedureshowSolutions

3.2.1 Blockingresults

ConsiderthesimplepathqueryA==D. Atanypointduringthequeryprocessing,ifnodeafromA's stack

isfoundtobeanancestorofnodedfrom D'sstream,theneverynodea 0

fromA'sstackthat isanancestor

ofaisalsoanancestorofd. Sincea 0

:L<a:L, wemustdelayoutputofthesolution(a;d)until after(a 0

;d)

hasbeenoutput. Butthereremainsthepossibilityofanewelementd 0

afterdinD'sstreamjoiningwitha 0

aslonga 0

is onA'sstack. Therefore,wecannotoutputthepair(a;d)untiltheancestornodea 0

ispopped

from A'sstack. Meanwhile,wecanbuilduplargejoinresultsthat cannotyet beoutput. Thissituation is

furtheraggravatedforlongerpathqueries. Wenowpresentaprocedurethatreturnssolutionsinroot-to-leaf

orderandgeneralizestheideasin [1]forbinarystructuraljoins andisbasedontheconceptofblocking.

Forthis purpose, we maintain twolinked lists associated with each element n in the stacks: the rst,

(S)elf-list,representsallblockeddescendantextensionswithrootelementn,andthesecondsecond,

(I)nherit-listrepresentsallblockeddescendantextensionswithrootelementsthat weredescendantsofninthestack

(onlythetopelementsin thestackshaveinherit-lists). Atanypointduringthealgorithm,thedescendant

extensions in the self- and inherit-lists of node n need to be expanded withthe elementsin the stacks of

theancestor nodesin the query, in the samewayasin the procedure showSolutions. The main ideasof

thealgorithmremainthesame,butwedonotoutputsolutions(usingshowSolutions)assoonaswedetect

them. Instead,weaccumulate inthe self-andinherit-lists thepartialsolutionsfound inroot-to-leaforder,

sowhenwenally outputresultstheyareguaranteedtobein therightorder.

Inparticular,when anewelementispushedto astack,weinitializeitsself- andinherit-liststoempty.

SupposewearepoppingelementD

c

fromthestackofquerynodeD. Dependingonthecurrentconguration,

weproceedasfollows 2

:

(a) NodeDis notthequeryroot,but hasparentnodeA. ElementD

c

isnotatthebottom ofthestack,

but has ancestor D

p

(Figure 6(a)). In this case, we rst determine nodes A

c and A

p

(the nodes in

A's stackpointedto byD

c andD

p

respectively). Then weappendtheself- and inherit-lists(in that

order) fromD

c

totheself-lists ofallelementsin A's stackstartingwith A

c

(inclusive) andending in

A

p

(exclusive).

(b) Node D is not the query root, but hasparent node A. Element D

c

is at the bottom of the stack.

(Figure6(b)). Inthiscase,werstdeterminenodeA

c

(the nodeinA'sstackpointedtobyD

c

). Then

2

(9)

A

_p

A

_c

D

_p

D

_c

.

D

_p

D

_c

.

_D

c

(a)

(b)

(c)

(d)

D

_c

.S to D

_p

.I

(D

_c

.S+D

_c

.I) to A

_c

.S, ..., A

_p

.S

(exclusive)

(D

_c

.S+D

_c

.I) to A

_c

.S, ..., A

_b

.S

(D

_c

.S+D

_c

.I) to D

_p

.I

Output (D

_c

.S+D

_c

.I)

A

_c

D

_c

.

A

_b

Figure6: Possiblestackcongurationswhenblockingresults.

weappendtheself-andinherit-lists(inthatorder)fromD

c

totheself-listsofallelementsinA'sstack

startingwithA

c

and endinginA'sbottomelement.

(c) Node D is the query root. Element D

c

is not at the bottom of the stack, but has ancestor D

p

(Figure6(c)). Inthiscaseweappendtheself-andinherit-lists(inthat order)from D

c

tothe

inherit-listofelementD

p .

(d) NodeD is the queryroot. ElementD

c

is at thebottomof the stack(Figure 6(d)). We rstoutput

thecontentsoftheself-listandthenthecontentsoftheinherit-listofelementD

c .

It is fairly simpleto show that in cases(a),(b), and (c) we preserveall solutionswhen rearrangingthe

linked lists. Also,each timeweappend onelist to another,weareguaranteed that nofuture element will

beappended outoforder. Therefore,in case(d)weoutputsolutionsin thedesired order. Asan informal

proof, considerforinstance, case(a). All descendantextension fromD

c

'sself-andinherit- listsneedto be

expandedwithallelementsfrom A

c

to thebottomof thestack. Forthatpurpose, weappend theself-and

inherit-lists D

c

to allnodesin the parent querynode from A

c to A

p

exclusive. Since we alsoappend this

descendant extensionsto theinherit-list of D

p

, weare guaranteedthat in afuture iteration this solutions

willreachtoalltheremainingnodesinstackA. Sowedidnotloseanysolutionwhenmanipulatingthelists.

Also, when we pop elementD

c

, we know that nonew elementfrom D's stream canstart before D

c (and

all its descendants) does, so any new solutionwill start after everydescendant extension in D

c

's self and

inherit lists. Incontrast,wecannotappendthese lists to theself listof A

p

andits ancestorsin the stack,

becausesomesolutionsthatstartbeforethemmightbeblockedin D

p

anditsancestors. Wesolvethatcase

byappendingD

c

'sself-andinherit-liststo D

p

'sinheritlist. Theothercasescanbeexplainedsimilarly.

Theonlyoperationweperform overlists is \append"(exceptfor thenal read out). Since our

imple-mentation ofthe linkedlists maintainstheheadand tailpointers, these append operationscanbecarried

outinconstanttimeandrequirenocopying. Therefore,weonlyneedtohaveaccessto thetailofeachlist

inmemoryascomputationproceeds. Therest ofthelistcanbepagedout. Whenlistxisappended tolist

y, it is notnecessarythat thehead of list x bein memory, since theappend operation only establishesa

link to thishead inthe tailofy. Soallwe needisto knowthepointer for theheadofeachlist, evenifit

is paged out. Each list pageis thus paged outat mostonce, at paged back in againonly when thelist is

ready foroutput. Also,thetotalnumberof entries in thelistsis exactlyequalto thenumberofentries in

theoutput. WehavethenthattheI/Orequiredtomaintainlistsofresultsisproportionaltothesizeofthe

(10)

ThefollowingpropositionisakeytoestablishingthecorrectnessofAlgorithmPathStack.

Proposition 3.1 If wex node Y,the sequenceofcases betweennode Y andnodesX on increasing order

of LeftPos (L) is: (1j2) 3 4

. Cases 1 and Cases 2are interleaved, then all nodes in Case 3 before any

node inCase 4, andnallyall nodesinCase 4(seeFigure7).

X

Y

Case 1

Case 2

Case 3

Case 4

Y

Root

X

Y

Root

X

Y

Root

X

Y

Root

X

Y

X.R<Y.L

X.L<Y.L

X.L>Y.R

X.R>Y.R

X.L>Y.L

X.R<Y.R

Property

Segments

Tree

Figure 7: CasesforPathStackandTwigStack

Lemma3.1 Supposethatforanarbitrarynodeqinthepathpatternquery,wehavethatgetMinSource(q)=

q

N

. Also,suppose that t

q

N

isthe next element in q

N

'sstream. Then, after t

q N ispushedon tostack S q N ,

the chain of stacks from S

qN to S

q

veries that their labels areincludedin the chain of nodesin the XML

datatreefromt

q

N

tothe root.

For each node t

qmin

pushed onto stack S

qmin

, it is easy to see that the above lemma, along with the

iterativenature ofAlgorithm showSolutions, ensuresthatall answersin whicht

q

min

isamatchforquery

nodeq

min

willbeoutput. Thisleadstothefollowingcorrectnessresult:

Theorem3.1 GivenaquerypathpatternqandanXMLdatabaseD,AlgorithmPathStackcorrectlyreturns

allanswers for qon D.

Wenextshowoptimality. GivenanXMLquerypathoflengthn, PathStacktakesninput listsoftree

nodes sortedby (DocId, LeftPos), and computesanoutput sortedlist ofn-tuples that match thequery

path. It is straightforward to see that, excluding the invocations to showSolutions, the I/O and CPU

costsof PathStack arelinear in thesum of sizes of then inputlists. Sincethe costof showSolutionsis

proportionaltothesizeoftheoutputlist,wehavethefollowingoptimalityresult:

Theorem3.2 Given aquery pathpattern qwith nnodes, andanXML database D,AlgorithmPathStack

has worst-caseI/OandCPUtimecomplexitieslinearin thesum ofsizesof theninputlistsandthe output

list. Further,theworst-casespacecomplexityofAlgorithmPathStackistheminimumof(i)the sumofsizes

of then inputlists,and(ii) the maximumlengthofa root-to-leafpath inD.

It is particularly important to note that the worst-case time complexity of Algorithm PathStack is

(11)

AstraightforwardgeneralizationoftheMPMGJNalgorithm [26]forpathqueriesproceeds onestreamata

time to get all solutions. Considerthe path queryq

1 ==q

2 ==q

3

. Thebasic ideais as follows: Get the rst

(next) element from the stream T

q

1

and generate all solutions that use that particular element from T

q 1 . Then,advanceT q1 andbacktrack T q2 andT q3

accordingly(i.e.,to theearliestpositionthat mightleadtoa

solution). Thisprocedureis repeateduntilT

q

1

isempty. Thegenerate all solutions steprecursivelystarts

with the rst marked elementin T

q2

, gets all solutions that use that element (and the calling element in

T

q

1

),thenadvancesthestreamT

q

2

untiltherearenomoresolutionswiththecurrentelementinT

q

2

,andso

on. Werefertothisalgorithmas PathMPMJNaive,inourexperiments.

It turns outthat maintaining onlyone mark per stream (for backtracking purposes) is tooineÆcient,

sinceallmarksneedtopointto theearliestsegmentthat canmatchthecurrentelementinT

q1

(the stream

of the root node). A better strategy is to use astack of marks, as described in Algorithm PathMPMJ in

Figure8. InthisoptimizedgeneralizationofMPMGJN,eachquerynodewill nothaveasinglemarkinthe

stream,but \k"markswhere kisthenumberofitsancestorsin thequery. Eachmarkpointstoanearlier

positionin thestream, andforquerynodeq,thei'thmarkistherstpointin T

q

suchthattheelementin

T

q

startsafterthecurrentelementin thestream ofq'si'thancestor.

Algorithm PathMPMJ(q)

01 while (:eof (Tq)^ (isRoot (q)_nextL (q)<nextR (parent (q))))

02 for (qi2subtreeNodes(q)) //advancedescendants

03 while (nextL (q i )<nextL (parent (q i ))) 04 advance(Tq i ) 05 PushMark(T q i )

06 if (isLeaf (q)) outputSolution() //solutioninthestreams'heads

07 else PathMPMJ(child (q))

08 advance(Tq)

09 for (qi2subtreeNodes(q)) //backtrackdescendants

10 PopMark(Tq

i )

Figure8: AlgorithmPathMPMJ

Theorem3.3 GivenaquerypathpatternqandanXMLdatabaseD,AlgorithmPathMPMJcorrectlyreturns

allanswers for qon D.

Whilethetwoextensions ofMPMGJN appearsimilar, the dierencebetweentheirperformanceis

no-ticeable, as we shall see in the experimental section. Further, as was the case with MPMGJN,

Algo-rithmPathMPMJisnotasymptoticallyoptimaleither.

4 Twig Join Algorithms

4.1 Limitations of Using PathStack

Astraightforwardwayofcomputinganswerstoaquerytwigpatternistodecomposethetwigintomultiple

root-to-leafpathpatterns,usePathStacktoidentifysolutionstoeachindividualpath,andthenmerge-join

these solutionsto computethe answersto thequery. This approach, which weexperimentallyevaluate in

Section 5,faces thesamefundamental problemasthetechniquesbasedonbinarystructuraljoins,towards

(12)

Against the XML database in Figure 1(b), the twopaths of this query: author{fn{jane,and author{ln{

doe,have twosolutionseach, butthe querytwig patternhas onlyone solution.

In general, if the query (root-to-leaf) paths have many solutions that do not contribute to the nal

answers,usingPathStack(as asub-routine) issuboptimal,in that theoverall computationcostfor atwig

pattern is proportional not just to the sizes of the input and the nal output, but also to the sizes of

intermediateresults. Inthissection,weseektoovercomethissuboptimalityusingAlgorithmTwigStack.

4.2 TwigStack

Algorithm TwigStack, which computesanswersto aquery twig pattern, is presented in Figure 9,for the

casewhen thestreams containnodes from asingleXML document. Aswith Algorithm PathStack, when

thestreamscontainnodesfrommultiple XMLdocuments,thealgorithmis easilyextendedto testequality

ofDocIdbeforemanipulatingthenodesin thestreamsandonthestacks.

Algorithm TwigStack(q) //Phase 1 01 while :end (q) 02 q act =getNext (q) 03 if (:isRoot(qact)) 04 cleanStack(parent (q act ), nextL (q act )) 05 if (isRoot(qact)_:empty(S parent (q act ) ))

06 cleanStack(qact, next (qact))

07 moveStreamToStack(T qact ;S qact ;pointertotop(S parent (q act ) )) 08 if (isLeaf(qact)) 09 showSolutionsWithBlocking(S qact ;1) 10 pop(Sq act ) 11 else advance(Tq act ) //Phase 2 12 mergeAllPathSolutions() Function getNext (q) 01 if (isLeaf(q)) return q 02 for qi in children(q) 03 n i =getNext (q i )

04 if (ni6=qi) return ni

05 nmin= minargn i nextL (Tn i ) 06 nmax= maxargn i nextL (Tn i )

07 while (nextR (Tq)<nextL (Tn

max )) 08 advance (T q ) 09 if (nextL (Tq)<nextL (Tn min )) return q 10 else return n min

Procedure cleanStack(S, actL)

01 while (:empty (S)^(topR (S)<actL))

02 pop(S)

Figure9: AlgorithmTwigStack

AlgorithmTwigStackoperatesin twophases. Intherstphase(lines1-11),some(butnotall)solutions

to individual query root-to-leaf paths are computed. In the second phase (line 12), these solutions are

(13)

A

₁

B

₁

A

₂

D

₁

C

₁

D

₂

(a) Data set

(b) Query

A

B

C

D

Figure10: ExampledatasetandquerytoillustratethegetNextprocedure.

The key dierence betweenPathStackandthe rstphaseof TwigStackis that before anode h

q from

thestream T

q

is pushedonitsstackS

q

,TwigStack(via itscallto getNext)ensuresthat: (i)node h

q has

adescendant h

q

i

in eachofthestreamsT

q

i ,forq

i

2children(q),and (ii)eachofthenodesh

q

i

recursively

satises the rst property. Algorithm PathStackdoesnot satisfy this property (and it does not need to

do so to ensure(asymptotic) optimality for querypath patterns). Thus, when thequery twigpattern has

only ancestor-descendant edges, each solution to each individual query root-to-leafpath is guaranteed to

bemerge-joinable with at least onesolution to each of theother root-to-leafpaths. This ensuresthat no

intermediatesolutionislargerthanthenal answertothequerytwigpattern.

The second merge-join phase of Algorithm TwigStack is linearin the sum of its input (the solutions

to individual root-to-leafpaths) and output (the answer to the query twig pattern) sizes, only when the

inputsareinsortedorderofthecommonprexesofthedierentqueryroot-to-leafpaths. Thisrequiresthat

thesolutionstoindividual querypathsbeoutputin root-to-leaforderaswell,whichnecessitates blocking;

showSolutions(fromFigure5), cannotbeusedandneedstobeextended withtheideasofSection 3.2.1.

Example4.2 Consider again the query of Example 4.1, which isthe sub-twig rooted at the authornode

of thetwig pattern inFigure2(b), andthe XMLdatabasetreeinFigure1(b). BeforeAlgorithmTwigStack

pushes anauthornodeon thestack S

author

,itensuresthat thisauthornode has: (i) adescendantfnnode

in the stream T

fn

(which in turn has a descendant jane node in T

jane

), and (ii) a descendant ln node in

the stream T

ln

(which in turn has a descendant doe node in T

doe

). Thus, only one of the three author

nodes (corresponding to the thirdauthor) from the XML data tree in Figure 1(b) is pushed on the stacks.

Subsequentstepsensurethatonlyone solutiontoeachof thetwopathsof thisquery: author{fn{jane,and

author{ln{doe,is computed. Finally, the merge-joinphase computes thedesiredanswer.

4.3 Analysis of TwigStack

Inthis sectionwediscuss thecorrectnessof algorithmTwigStackforprocessingtwig queries,and thenwe

analyzeitscomplexity.

Denition4.1 Consider a twig query Q. For each node q 2 subtreeNodes(Q) we dene the head of q,

denotedh

q

, asthe rstelement in T

q

that participates in asolution for the sub-queryrootedatq. We say

thatanodeqhasaminimaldescendantextensionifthereisasolutionforthesub-queryrootedatqcomposed

entirely ofthe head elementsof subtreeNodes(q).

Example4.1 Consider the datasetand queryof Figure 10. The rstelement inT

A

thatparticipatesina

solution forthe query,h

A =A

1

(suchsolutioninvolves the nodesA

1 ,B 1 ,C 1 ,andD 2 ). Similarly,h B =B 1 , h C =C 1 andh D =D 1

,since those are the rst elements that participate in asolution for the sub-queries

(14)

1 2

does notuse h

D =D

1

. Forthe samereason, node Adoesnot haveaminimal descendant extension.

Proposition 3.1, based on Figure 7, is also important for establishing the following lemma, used for

provingthecorrectnessofAlgorithmTwigStack.

Lemma4.1 [Descendantextension] Supposethatfor anarbitrarynodeqinthetwigquerywehavethat

getNext(q)=q

N

. Then, the following properties hold:

1. q

N

hasaminimal descendantextension.

2. Foreach node q 0 2subtreeNodes(q N ),the rstelement inT q 0 ish q 0 . 3. Either (a) q = q N or (b) parent(q N

) does not have a minimal right extension because of q

N (and

possibly othernodes). Inotherwords, thesolution rootedatp=parent(q

N

)that usesh

p

doesnotuse

h

q

for node qbut someother elementwhoseL componentis larger thanthatof h

q .

Proof: (Inductiononthenumberofdescendantsofq). Ifqisaleafnode,ittriviallyveriesallproperties,

sowe returnit in line 1 of the algorithm. Otherwise, we get n

i

=getNext(q

i

) recursivelyfor each child

q

i

of q in lines 2-4. If for somei, wehavethat n

i 6=q

i

but somedescendantof q

i

, we know byinductive

hypothesisthatn

i

veriesproperties(1),(2)and(3-b)abovewithrespecttoq

i

. Allpropertiesarestillvalid

withrespecttonodeq,so wereturnnoden

i

inline4. Otherwise,weknowbyinductivehypothesisthatall

ofq'schildrensatisfyproperties(1),(2)and(3-a)abovewithrespecttotheircorrespondingsub-queries. We

rstadvancefromT

q

allsegmentsthatarein Case1withrespect toany headsegmentin then

i

'sstreams

(lines7-8). Afterthat,weknowthatqisincase2,3or4withrespecttoeveryn

i

byusingProposition3.1.

IfqisinCase2withrespecttoalln

i

,itsatisesproperties(1),(2)and(3-a)above,sowereturnit(line9).

Otherwise,wecanguaranteethattheminimaln

i

veriesproperties(1),(2)and(3-b)above,sinceitsparent

(q)doesnothaveaminimaldescendantextensionbecauseofn

i

(andpossiblyotherchildren),sowereturn

suchminimalnoden

i

inline10(theelementinqdoesnotsatisfyproperty(3)atthemoment,butcoulddo

solateron). Toverifythelastcondition(lines9-10)wesimplycheckiftheleftboundoftheelementinqis

lowerthantheleftbound oftheminimaln

i

. Ifthat isthecase,weknowthat q'selementstartsbeforeany

n

i

andendsaftereachn

i

(sincewecheckedthat itendsaftereach n

i

starts).

Lemma4.2 Consider the following fragment ofcode:

while :end(q) q N =getNext(q); advance(q N )

The following propertieshold:

1. getNext returnsin node q

N

allandonlyelementsin the datasetwith adescendant extension.

2. If elementx iselement y'sancestorinsome solution,thenx isreturnedbygetNext before y.

Proof: (Inductiononthenumberofcalls togetNext). Inparticular, wewillshowthatproperty(1) above

holds byprovingthat all elements that are skippedin line 10of getNext areguaranteed notto haveany

descendantextension. ConsidertherstcalltogetNextforqueryq,andassumethatgetNext(q)=q

N . By

Lemma4.1, property(1), weknowthat alltherstelementsinthestreamsof subtreeNodes(q

N

)are part

(15)

either q

N

=q (with noancestors) orthe solutionrootedat q

N

is guaranteed notto bepartof any solution

involvingq

N

'sancestors,becauseeachdescendantextensionfromp=parent(q

N

),usingsomeelementxfrom

p'sstream,veriesx:Lh

p :L>h

qN

:L(usingLemma4.1,property(3)). Therefore,property2holds.

For subsequent calls to getNext, we proceed asfollows. Assume that getNext(q) = q

N

and consider

eachelementx (fromnodeq

x

2subtreeNodes(q

N

))thatisadvanced inline 10. Elementx cannotbepart

of any descendant extension with elements in the streams of subtreeNodes(q

x

), since x ends before the

rstsolutionin somechild's stream(see conditionin line9of getNext). Besides, x cannot bepartof any

descendantextensionwithelementsthatwerealreadyprocessedbygetNextbyproperty(2)oftheinductive

hypothesis. Therefore,x isguaranteednottobepartofanydescendantextension,andproperty(1)holds.

Nowsupposethatq

N

6=qandp=parent(q

N

). UsingLemma4.1,property(3),weknowthath

p :L>h

q

N :L.

Therefore, any solution involving element h

q

:L must use some element from p that wasalready returned

bygetNext. Usingproperty(2)of theinductive hypothesis, we knowthat allelementsfrom p'sancestors

in such solutions were returned by getNext before the corresponding element from p, which in turn was

returnedbeforeh

q

N

. Property(2)holdsaswell.

Wethenknowthatwhensomenodeq

N

isreturnedbygetNext,h

qN

isguaranteedtohaveadescendant

extensionin subtreeNodes(q

N

). Wealsoknowthatanyelementin theancestors ofq

N

that usesh

q

N ina

descendantextensionwasreturnedbygetNextbeforeh

qN

. Therefore,wecanmaintainforeachnodeqinthe

query,theelementsthatarepartofasolutioninvolvingotherelementsinthestreamsofsubtreeNodes(q).

Then, eachtime thatq

N

=getNext(q)is aleafnode,weoutput allsolutionsthat useh

q

N

. Wenowprove

thatwecanachievethatgoalbymaintainingasinglestackpernodeinthequery.

Lemma4.3 Supposethatinthe currentiterationwehave getNext(q)=q

N

. Then,thereisnonew solution

involving: (a) someelement fromq

N

thatends before h

q

N

starts,and(b)someelementfrom thestreamsin

subtreeNodes(q

N ).

Proof: Suppose that on the contrary, there is a new solution using some (already processed) element

from q N (denoted s qN ) for which s qN :R < h qN

:L. Using the containment property, we know that all

elements from subtreeNodes(q

N

) in such solution must end before s

q

N

:R , and therefore, before h

q

N :L.

SincegetNext(q)=q

N

weknow(Lemma4.1) thatq

N

hasaminimal descendantextension, andtherefore,

allelementsinthestreamsofsubtreeNodes(q

N

)startafterh

q

N

does,whichisacontradiction.

Thelemmaaboveguaranteesthatforeachnodeq,thesetofelementsthatcanbepartofanewsolution

areexactlythoseincludingthelastelementfromqreturnedbygetNext. Therefore,asinglestackpernode

suÆcestokeeptrackofallelementswithpotentialnewsolutions. Thefollowinglemma showsthatwecan

easilychaintheelementsofthestacksinthesamewayasin PathStackanduseshowSolutionstooutput

allsolutions.

Lemma4.4 Suppose that, in the current iteration, we have getNext(q)=q

N , and q 6= q N . Let p = parent(q N

). Then, there is no \new" solution involving: (a) some element from p that ends before h

q

N

starts,and(b) someelementfromthe streamsin subtreeNodes(p).

Proof: TheproofissimilartothatofLemma4.3. Supposethatonthecontrary,thereisanewsolutionusing

someelementfromp(denoteds

p

)forwhichs

p :R<h

qN

:L. WeknowthatallelementsfromsubtreeNodes(p)

in suchsolution mustend before s

p

:R ,and therefore, before h

q

N

:L. SincegetNext(q)=q

N

weknow(see

algorithmgetNextforTwigStack)thatforallchildrenn

i ofp(includingq N ): (1)getNext(n i )=n i ,and(2) h q N :Lh n i

:L. UsingLemma 4.1weknowthateachn

i

(16)

i ni qN

contradictionandthelemmaholds.

Thelemma above guarantees that each time getNext returnsnodeq

N

wecan clean from q

N

's parent

stackallelements thatdo notinclude h

q

N

(since theycannotparticipatein anynewsolution) andset the

h

qN

'spointerinq

N

'sstacktothetopofq

N

'sparentstack. Usingthetwolemmasabove,weproveourmain

resultforTwigStack.

Theorem4.1 Given a query twig pattern q, and an XML database D, Algorithm TwigStack correctly

returnsallanswers for qonD.

Proof: InAlgorithmTwigStack,werepeatedlyndgetNext(q)forqueryq. AssumethatgetNext(q)=q

N .

LetA

qN

bethesetofnodesinthequerythatareancestorsofq

N

. UsingLemma4.2weknowthatgetNext

alreadyreturned allelementsfrom thestreams ofnodes in Aq

N

that arepartof asolutionthat usesh

q

N .

Ifq6=q

N

,in line 3wepopfrom parent(q

N

)'sstackallelements thatare guaranteednotto participatein

any new solution (Lemma 4.4). After that, in line 4wetest whether h

q

N

participatesin asolution. We

knowthat q

N

hasadescendantextension by Lemma4.1, property1. If q6=q

N andparent(q N )'sstackis empty, nodeq N

doesnothavean ancestorextension. Therefore itis guaranteednot toparticipate in any

solution, so we advance q

N

in line 10 and continuewith the nextiteration. Otherwise, nodeq

N

has both

ancestoranddescendantextensionsandthereforeitparticipatesinatleastonesolution. Wethencleanq

N 's

stack(Lemma 4.3)and after that wepushh

qN

to it (lines5-6). ByLemma 4.4 weknowthat thepointer

tothetopofparent(q

N

)'sstackcorrectlyidentiesallsolutionsusingh

q

N

. Finally,ifq

N

isaleafnode,we

decompressthestoredsolutionsfrom thestacks(lines7-9).

Whilecorrectnessholds forquerytwig patternswith bothancestor-descendantandparent-childedges,

wecanproveoptimalityonlyforthecasewherethequerytwigpatternhasonlyancestor-descendantedges.

The intuitionis simple. Since wepush into the stacks onlyelementsthat haveboth adescendantand an

ancestor extension, weare guaranteedthat noelementthat doesnot participatein any solutionispushed

intoanystack. Therefore,themergepostprocessingstepisoptimal,andwehavethefollowingresult.

Theorem4.2 Consider aquery twig pattern q with n nodes, and only ancestor-descendant edges, and an

XMLdatabaseD. AlgorithmTwigStackhasworst-caseI/OandCPUtimecomplexitieslinearinthesumof

sizesoftheninputlistsandtheoutputlist. Further,theworst-casespacecomplexityofAlgorithmTwigStack

isthe minimum of (i)the sum of sizesof the ninput lists,and(ii) n timesthe maximum lengthofa

root-to-leafpath inD.

Itisparticularlyimportanttonotethat,forthecaseofquerytwigswithancestor-descendantedges,the

worst-casetimecomplexityofAlgorithmTwigStackisindependentofthesizesofsolutionstoanyroot-to-leaf

pathof the twig.

4.4 Suboptimality for Parent-Child Edges

Theorem 4.2 holdsonly forquery twigs with ancestor-descendantedges. Unfortunately, in thecase where

the twig pattern contains aparent-childedge betweentwo elements (e.g., see the queryin Example 4.2),

AlgorithmTwigStackisnolongerguaranteedtobeI/OandCPUoptimal. Inparticular,thealgorithmmight

produce asolutionfor oneroot-to-leafpath that doesnotmatch withanysolutionin another root-to-leaf

path.

Considerthequery twigpatternwith three nodes: A;B andC, andparent-childedgesbetween(A;B)

and between(A;C). Letthe XML datatree consist of nodeA

1

, withchildren(in order) A

2 ;B 2 ;C 2 , such

(17)

2 1 1 A B C 1 1 1

respectively. Inthiscase, wecannot sayifanyofthem participatesin asolutionwithoutadvancing other

streams, and we cannot advance any stream before knowing if it participates in a solution. As a result,

optimalitycannolongerbeguaranteed.

4.5 Using XB-Trees

AlgorithmsPathStackandTwigStackneedtoprocesseach nodeintheinputliststo checkwhetherornot

it is partof ananswer to thequery (pathortwig) pattern. When theinput lists areverylong, this may

takealot oftime. Inthissection,weproposetheuseofavariantofB-trees,denotedXB-tree,ontheinput

liststospeedupthisprocessing.

4.5.1 XB-Tree Description

Asitsnamesuggests,theXB-treeisavariantoftheB-tree,designedforindexingthepositionalrepresentation

(DocId, LeftPos : RightPos, LevelNum)ofelementsintheXMLtree. Wedescribetheindexstructure

whenallnodesbelongtothesameXMLdocument;theextensiontomultipledocumentsisstraightforward.

ThenodesintheleafpagesoftheXB-treearesortedbytheirLeftPos(L)values;thisissimilartotheleaf

pagesofaB-treeontheLvalues. ThedierencebetweenaB-treeandanXB-treeisinthedatamaintained

atinternalpages. EachnodeN inaninternalpageoftheXB-treeconsistsofaboundingsegment[N:L;N:R ]

(whereLdenotesLeftPosandRdenotesRightPos)andapointertoitschildpageN:page(whichcontains

nodes with bounding segments completely included in [N:L;N:R ]). The bounding segments of nodes in

internal pagesmight partiallyoverlap,but their Lpositions arein increasingorder. Besides, each pageP

hasapointer totheparentpageP:parent andtheintegerP:parentIndex which istheindex ofthenodein

P:parent thatpointsbackto P. Theconstruction andmaintenance ofanXB-treeis verysimilar to those

in aB-tree,usingtheL valueasthe key;thedierenceis that theR valuesneed tobepropagatedupthe

indexstructure.

4.5.2 UsingXB-Trees

Wemaintainapointeract=(actPage;actIndex)totheactIndex'thnodein pageactPageoftheXB-tree.

TherearetwooperationsovertheXB-treethataectthispointer:

1. advance. Ifact=(actPage;actIndex)doesnotpointto thelast nodeinthecurrentpage,wesimply

advance actIndex. Otherwisewereplace actwith thevalue (actPage:parent;actPage:parentIndex)

andrecursivelyadvanceit.

2. drillDown. Ifact=(actPage;actIndex), actPageisnotaleafpage,andN is theactIndex'th node

in actPage,wereplaceactwith(N:page;0)sothat itpointstotherstnodeinN:p.

Initiallyact=(rootPage;0),pointingtotherstnodeintherootpageoftheXB-tree. Whenactpoints

tothelastnodein rootPageandweadvanceit,wenishthetraversal.

Wecanmodify the previousalgorithms easily touse XB-trees. AlgorithmTwigStackXB, in Figure 11,

extends Algorithm TwigStack so that it uses XB-trees. The only changes are in the lines indicated by

parentheses. The function isPlainValuereturns true if the actual pointer in the XB-tree is pointing to

a leaf node (actual value in the original stream). If we dene isPlainValue(T)=true when T is not an

(18)

01 while :end (q) 02 q act =getNext (q) (03) if (isPlainValue(Tq act )) 04 if (:isRoot (q act ))

05 cleanStack(parent (qact), next (qact))

06 if (isRoot (q act )_:empty (S parent (qact) )) 07 cleanStack(q act , next (q act )) 08 moveStreamToStack(Tq act ;Sq act ; pointer to top (S parent (qact) )) 09 if (isLeaf(q act )) 10 showSolutionsWithBlocking(Sqact;1) 11 pop (Sq act ) 12 else advance(Tqact)

(13) else if (:isRoot(qact)^empty (S

parent (qact) )^nextL (T parent (qact) )>nextR (Tq act )) (14) advance(T qact

) //Notpartofasolution

(15) else //Mighthaveachildinsomesolution

(16) drillDown(T qact ) //Phase2 17 mergeAllPathSolutions() Function getNext(q) 01 if (isLeaf (q)) return q 02 for qi in children (q) 03 n i =getNext (q i ) (04) if (qi6=ni_:isPlainValue(Tn i )) return ni 05 n min = minarg n i nextL(T n i ) 06 n max = maxarg n i nextL (T n i )

07 while (nextR (Tq)<nextL (Tn

max )) 08 advance (T q ) 09 if (nextL (Tq)<nextL (Tn min )) return q 10 else return n min

Procedure cleanStack(S, actL)

01 while (:empty(S)^(topR (S)<actL))

02 pop(S)

Figure 11: AlgorithmTwigStackXB

Theorem4.3 Given a query twig pattern q and an XML database D, Algorithm TwigStackXB correctly

returnsallanswers for qonD.

Whilewedo not haveany analyticalresults about theeÆciency of Algorithm TwigStackXB, we show

experimentallythat itperformsmatchingofquerytwigpatternsinsub-linear time.

5 Experimental Evaluation

In this section we present experimental results on the performance of the join algorithms introduced in

Sections3and4usingbothreal andsynthetic data.

5.1 Experimental Setting

We implemented all XML join algorithms in C++ using the le system asa simple storage engine. All

(19)

XMark

XMach-1

Figure 12: TheBenchmark datasetcombinesbothXMarkandXMach-1documents.

ofdiskspace, runningRedHat Linux7.1. Weusedthefollowingsynthetic andreal-worlddatasetsforour

experiments:

Random: We generated random trees using three parameters: depth, fan-out and number of dierent

labels. For most of the experiments presented involvingRandom data sets, we generated full binary and

ternarytrees. Unless speciedexplicitly, thenodelabelsin thetreeswere uniformlydistributed. Wetried

othercongurations(largerfanoutandrandomdepth inthetree)obtainingconsistentresults.

Benchmark: We used a combination of the XMark [25] and XMach-1 [24] benchmarks. The XMark

benchmark is basedon arich DTD that models adatabasein an Internet auction site. However, XMark

usually produces shallow XML les with not too many nested levels. On the other hand, the

XMach-1 benchmark is based on a simpler DTD that models documents. As a result, the produced XML les

can model deeply nested sections and subsections. To take advantage of both models, we generated the

Benchmark datasetbymergingpartialresultsfromXMarkandXMach-1. Inparticular,werstgenerated

atemporaryXMLleusingtheXMarkbenchmarkwithscalingfactorequaltoone. Wethenreplacedeach

of the 105,396 descriptiontags (that originally corresponded to free text) with a small XML fragment

produced bytheXMark benchmark(seeFigure 12). Theresultingdata set hasdepth37 andaround11.7

millionnodes.

DBLP: Therealdatasetisan\unfolded"fragmentoftheDBLPdatabase. IntheoriginalDBLPdataset,

eachauthorisrepresentedbyaname,ahomepage,andalistofpapers. Inturn, eachpapercontainsatitle,

theconferencewhereitwaspublished,andalistofcoauthors. WegeneratedourunfoldedfragmentofDBLP

in thefollowingway. Westartedwithanarbitraryauthor andconvertedthecorresponding informationto

XMLformat. Further,foreachpaper,wereplacedeachcoauthornamewiththeactualinformationforthat

author. We recursively continued unfolding authors until we reached apreviously traversed author, ora

depthof200authors. TheresultingXMLdatasethasdepth 805andaround3millionnodes. It represents

93,536dierentpapersfrom 36,900uniqueauthors.

5.2 Holism: Binary Structural Joins vs PathStack

InthisrstexperimentwecompareourholisticPathStackalgorithmagainststrategiesthatusea

combina-tionofbinarystructuraljoins[1]. Forthispurpose,weusedaRandom dataset consistingof1Mnodesand

six dierent labels, namely: A

1 ;A 2 ;:::;A 6 . 3

We issued thepath queryA

1 ==A 2 ==:::==A 6 and evaluated 3

NotethattheactualXMLdatacancontainmanymorelabels,butthatdoesnotaectourtechniquessinceweonlyaccess

(20)

0

10

20

30

40

50

60 Execution

time

(seconds)

Binary Joins

PathStack

SS

Figure13: Holisticversusstructuralbinaryjoinsforpathqueries

it using PathStack. Then, we evaluated all binary structural join strategies resulting from applying all

possiblejoin orders. Figure13showstheexecutiontimeofallbinaryjoin strategies,where eachstrategyis

representedwithasinglebar. Wealsoshowwith asolid linetheexecutiontime ofPathStack, andwitha

dottedline thetimeittakesto simplyreadtheinputdatausingasequentialscan(labeledSS).

Forthisquery,PathStacktook2:53s,slightlymorethanthe1:87stakenbythesequentialscanoverthe

inputdata. Incontrast,thestrategiesbasedonbinarystructuraljoinsrangedfrom16:1sforthejoinordering

((((A 1 ==A 2 )==A 3 )==A 4 )==A 5 )==A 6

, to 53:07sforthejoin orderingA

1 ==((A 2 ==(A 3 ==(A 4 ==A 5 )))==A 6 ). The

rst conclusion we can draw from these execution times is that optimization plays an important role for

binary structuraljoins, sinceabad join ordering canresult in aplan that ismore thanthree times worse

than the best plan. The second conclusion we can draw is that the holistic strategy is superior to the

approach of using binary structural joins for arbitrary join orders; in this case, it results in morethan a

six-foldimprovementinexecutiontimeoverthebeststrategythatusesbinarystructuraljoins.

5.3 Paths: PathStack vs PathMPMJ

In this section we study the eÆciency of the dierent holisticpath join algorithms of Section 3. We rst

comparethetwoversionsofPathMPMJ. Weuseda64KRandom datasetwith10dierentnodesA

1 ;:::A

10 ,

and issuepath queriesof dierent lengths. Figure 14(a) shows theexecution times ofbothtechniques, as

well asthetime takenforasequentialscan overtheinput data. PathMPMJNaiveismuchslowercompared

to the optimized PathMPMJ(generally over anorder of magnitude). Thereason is that PathMPMJNaiveis

tooconservativewhenbacktrackingand readsseveraltimesunnecessaryportionsofthedata. Asshown in

Figure 14(b), PathMPMJNaiveread as much as 15 timesthe numberof elementsPathMPMJ did. Since the

performance of PathMPMJNaivedegrades considerably with the size of the data set and the length of the

inputquery,wedonotconsiderthisstrategyfortheremainderofthispaper.

WenowcomparePathStackagainsttheoptimizedPathMPMJ. InFigure15weshowtheexecutiontime

andthenumberof nodesread fromdisk forpathqueriesof dierentlengthandaRandom data setof 1M

nodesand10dierentvalues. Clearly,PathStackresultsinconsiderablybetterperformancethanPathMPMJ,

andthis dierenceincreaseswithlongerpathqueries. Thisis explainedbythefact thatPathStackmakes

asinglepassovertheinputdata,whilePathMPMJneedstobacktrackandreadagainlargeportionsofdata.

For instance, for a path query of length 10, PathMPMJ reads the equivalent of ve times the size of the

originaldata,asseeninFigure15(b). InFigure15(a),forpathqueriesoflengthtwo,theexecutiontimeof

PathStackis considerably slowerthan that ofthesequentialscan,and closerto PathMPMJ. This behavior

isdue tothefact thatforthepathqueryoflengthtwo,thenumberofsolutionsisratherlarge(morethan

100K),somostoftheexecutiontimeisusedinprocessingthesesolutionsandwritingthembacktodisk. For

(21)

0

5

10

15

20

2

3

4

5

6

7

8

9

10 Path Length

Execution

Time

(seconds)

SS

PathMPMJ

PathMPMJNaive

0 1000000

2000000

3000000

4000000

5000000

2

3

4

5

6

7

8

9

10 Path Length

V

a

lu

es

Read

PathMPMJ

PathMPMJNaive

(a)Executiontime (b)Numberofelementsread

Figure14: PathMPMJversusPathMPMJNaiveforpathqueriesusingRandom data

0

5

10

15

20

25

30

2

4

6

8

10 Path length

Execution

time

(seconds)

SS

PathStack

PathMPMJ

0 1000000

2000000

3000000

4000000

5000000

6000000

2

4

6

8

10 Path length

Values

read

PathStack

PathMPMJ

Figure15: PathStackversusPathMPMJusingRandom datasets

isclosertothatofthesequentialscanandmuchsmallerthanthatofPathMPMJ.

Figure 16 shows the execution time and number of values read for two simple path queries over the

unfoldedDBLP dataset (notethelogarithmic scaleontheY axis). Due tothespecic nestingproperties

betweennodesin this dataset, thePathMPMJ algorithmspends much time backtrackingand reads several

times thesamevalues. Forinstance, forthe pathqueryof lengththree in Figure16, PathMPMJ reads two

ordersofmagnitudemoreelementsthanPathStack.

Finally,Figure17showstheexecutiontimeandnumberofvaluesreadforafamilyoflongpathqueries

overthe Benchmark data set. In particular, wegenerated 10 dierent queriesby starting with thequery

template REGIONS/NAMERICA//ITEM//DESCRIPTION/SECTION/SECTION/SECTION/HEAD/WORD-iand

replac-ing the label WORD-i with each one of the 10 most frequent words in the data set. As in the previous

experiments,PathStackresultsinsignicantlymoreeÆcientexecutionsthanPathMPMJ.

5.4 Twigs: PathStack vs TwigStack

We now focus on twig queries, and compare our TwigStack algorithm against the naive application of

PathStacktoeachbranchinthetreefollowedbyamergestep. AsshowninSection4,TwigStackisoptimal

forancestor/descendantrelationships,but itis provably suboptimalforparent/child relationships. Inthis

(22)

1

10

100 1000

PAPER//YEAR

PAPER//YEAR//1960

Execution

time

(seconds)

SS

PathStack

PathMPMJ

10000

100000

1000000

10000000

100000000

PAPER//YEAR

PAPER//YEAR//1960

Values

Read

PathStack

PathMPMJ

Figure16: PathStackversusPathMPMJfortheunfoldedDBLPdataset

0

20

40

60

80

100

120

140

0

1

2

3

4

5

6

7

8

9 Word

i

Execution

Time

(seconds)

SS

PathStack

PathMPMJ

0 5000000

10000000

15000000

20000000

25000000

30000000

0

1

2

3

4

5

6

7

8

9 Word

i

V

a

lu

es

read

PathStack

PathMPMJ

(23)

A

₁

A

₂

A

₃

A

₄

A

₅

A

₆

A

₇

A

₁

A

₂

A

₃

A

₄

A

₅

A

₆

A

₇

AUTHOR

PAPER

YEAR

CO-AUTHOR

2000

PAPER

YEAR

1990

PAPER

CO-AUTHOR

YEAR

1980

(2)

(parameter d)

(a)

(b)

(c)

(parameter d)

Figure18: Twigqueriesusedin theexperimentsofSection5.4.1.

0

5

10

15

20

25

30 8%

9%

10%

11%

12%

13%

15%

17%

20%

24%

Fraction of data set with solutions

Execution

time

(seconds)

SS

TwigStack

PathStack

1

10

100 1000

10000

100000

1000000

8%

9%

10% 11% 12% 13% 15% 17% 20% 24%

Fraction of data set with solutions

Number

o

f

s

olutions

Partial TwigStack

Partial PathStack

Total

(a)Executiontime (b)Numberofsolutions

Figure19: PathStackversusTwigStackforabinarytwigquery.

5.4.1 Ancestor-Descendant Relationships

Fortherstexperimentinthissection,weusedthequeryshowninFigure18(a)overdierentRandom data

sets. Each dataset wasgenerated asafull ternarytree. Therst subtreeoftheroot nodecontainedonly

nodeslabeledA 1 ;A 2 ;A 3 andA 4

. ThesecondsubtreecontainednodeslabeledA

1 ;A 5 ;A 6 andA 7 . Finally,the

thirdsubtreecontainedallpossiblenodes. Clearly,therearemanypartialsolutionsinthersttwosubtrees

butthosedonotproduceanycompletesolution.Onlythethirdsubtreecontainsactualsolutions. Wevaried

the size of the third subtree relativeto the sizes of the rst twofrom 8%to 24%(beyond that point the

numberofsolutionsbecametoolarge). Figure 19showsthe executiontimeof PathStackandTwigStack,

and thenumberof partial solutionseachalgorithm produces before the mergingstep. The consistent gap

betweenTwigStackandPathStackresultsfromthelattergeneratingallpartialsolutionsfromthersttwo

subtrees which arelaterdiscardedin themergestep(A

1 ==A 2 ==A 3 ==A 4 )./(A 1 ==A 5 ==A 6 ==A 7 ). As seenin

Figure19(b), thenumberof partialsolutionsproducedbyPathStackisseveralordersofmagnitude larger

thanthat oftheTwigStackalgorithm. Thenumberoftotalsolutionsto thequerytwig computedbyboth

algorithmsis,ofcourse,thesame.

Forthesecondexperiment,weusethetwigqueryofFigure18(b)andgenerateddierentRandom data

sets inthefollowingway. Asbefore, eachdata setisafull ternarytree. Therstsubtreedoesnotcontain

any nodeslabeledA

2 orA

3

. The secondsubtreedoesnotcontainanyA

4 orA

5

nodes. Finally, the third

subtreedoesnotcontainanyA

6 orA

7