Nicolas Bruno ColumbiaUniversity [email protected] Nick Koudas AT&T Labs{Research [email protected]
Divesh Srivastava
AT&T Labs{Research
Abstract
XMLemploysatree-structureddatamodel,and,naturally,XMLqueriesspecifypatternsofselection
predicatesonmultipleelementsrelatedbyatreestructure.Findingalloccurrencesofsuchatwigpattern
inanXMLdatabaseisacoreoperationforXMLqueryprocessing. Priorworkhastypicallydecomposed
the twigpattern into binarystructural (parent-childand ancestor-descendant)relationships, and twig
matchingis achievedby: (i)using structuraljoinalgorithmstomatchthebinaryrelationships against
the XML database,and (ii) stitching together thesebasic matches. A limitation ofthis approachfor
matchingtwigpatternsisthatintermediateresultsizescangetlarge,evenwhentheinput andoutput
sizesaremoremanageable.
In this paper, we propose a novel holistic twig join algorithm, TwigStack, for matching an XML
query twigpattern. Ourtechniqueusesachain oflinkedstacksto compactlyrepresent partialresults
toroot-to-leafquerypaths,whicharethencomposedtoobtainmatchesforthetwigpattern. Whenthe
twigpatternusesonlyancestor-descendantrelationshipsbetweenelements,TwigStackis I/OandCPU
optimalamongallsequentialalgorithmsthatreadtheentireinput: itislinearinthesumofsizesofthe
input listsandthe nalresultlist,butindependentofthe sizesofintermediateresults. Wethenshow
howtouse(amodicationof)B-trees,alongwithTwigStack,tomatchquerytwigpatternsinsub-linear
time. Finally, we complementour analysis withexperimental results onarange ofreal and synthetic
data,andquerytwigpatterns.
1 Introduction
XMLemploysatree-structuredmodelforrepresentingdata. Queriesin XMLquerylanguages(see,e.g.,[7,
4, 2]) typically specify patterns of selectionpredicates on multiple elements that havesome specied tree
structuredrelationships. Forexample,theXQueryexpression:
book[title = `XML']//author[fn = `jane' AND ln = `doe']
matchesauthorelementsthat(i)haveachildsubelementfnwithcontentjane,(ii)haveachildsubelement
lnwith contentdoe, and (iii) are descendantsof bookelementsthat haveachild titlesubelement with
contentXML.Thisexpressioncanberepresentedasanode-labeledtwig(orsmalltree)patternwithelements
andstringvaluesasnodelabels.
FindingalloccurrencesofatwigpatterninadatabaseisacoreoperationinXMLqueryprocessing,both
inrelationalimplementationsofXMLdatabases,andinnativeXMLdatabases. Priorwork(see,forexample,
[11,23, 16,18, 17, 26,1])hastypicallydecomposed thetwigpatterninto asetof binary(parent-childand
ancestor-descendant)relationshipsbetweenpairsofnodes,e.g.,theparent-childrelationships(book,title)
(ii)\stitching"togetherthesebasicmatches.
Forsolvingtherstsub-problemofmatchingbinarystructuralrelationships,Zhangetal.[26]proposed
avariation ofthe traditionalmerge join algorithm,the multi-predicatemerge join (MPMGJN) algorithm,
based on the (DocId, LeftPos:RightPos, LevelNum) representation of positions of XML elements and
stringvalues(seeSection2.3fordetailsaboutthisrepresentation). TheirresultsshowedthattheMPMGJN
algorithm couldoutperformstandardRDBMS join algorithmsbymorethananorder ofmagnitude. More
recently, Al-Khalifa et al. [1] took advantageof the samerepresentation of positions of XML elements to
deviseI/OandCPUoptimaljoinalgorithmstomatchbinarystructuralrelationshipsonanXMLdatabase.
Thesecondsub-problemofstitchingtogetherthebasicmatchesobtainedusingbinary\structural"joins
requiresidentifyingagoodjoinorderinginacost-basedmanner,takingselectivitiesandintermediateresult
size estimatesinto account. In this paper, weshow that a basiclimitation of this (traditional) approach
for matchingquery twigpatterns isthat intermediate resultsizes cangetverylarge,evenwhen theinput
and nalresult sizesare muchmoremanageable. As aresult, weseek abetter solutionto theproblem of
matchingquerytwig patternseÆciently.
Inthis paper, wepropose anovel holistic twig join approach formatching XML query twig patterns,
whereinnolargeintermediateresultsarecreated. Ourtechniqueusesthe(DocId, LeftPos : RightPos,
LevelNum)representationofpositionsofXMLelementsandstringvalues(thatsuccinctlycapturesstructural
relationshipsbetweennodesintheXMLdatabase). Italsousesachainoflinkedstackstocompactlyrepresent
partialresultstoindividualqueryroot-to-leafpaths,whicharethencomposedtoobtainmatchestothequery
twigpattern.
Sinceagreat deal of XML data isexpected to be storedin relationaldatabasesystems(all themajor
DBMSvendorsincludingOracle,IBMandMicrosoftareprovidingsystemsupportforXMLdata),ourstudy
providesevidencethatRDBMSsystemsneedtoaugmenttheirsuiteofqueryprocessingstrategiestoinclude
holistictwigjoins foreÆcientXMLqueryprocessing. Our studyis equallyrelevantfor nativeXMLquery
engines,sinceholistictwigjoinsareaneÆcientset-at-a-timestrategyformatchingXMLquerypatterns, in
contrasttothenode-at-a-timeapproachofusingtreetraversals.
1.1 Outline and Contributions
Webeginbypresentingbackgroundmaterial(datamodel,querytwigpatterns,andpositionalrepresentations
ofXMLelements)in Section2. Ourmaincontributionsare:
We develop two families of holistic path join algorithms in Section 3 to match XML query
root-to-leaf paths eÆciently. The rst, PathStack, generalizes the Stack-Tree-Desc binary structural join
algorithm of[1], whilethesecond,PathMPMJ,generalizestheMPMGJNbinaryjoin algorithmof [26].
WeanalyzePathStackandshowthatitisI/OandCPUoptimalamongallsequentialalgorithmsthat
read theentireinput,andhasworst-casecomplexitieslinearinthesumofinputandoutputsizesbut
independentof the sizesofintermediateresults.
Wethen developTwigStackin Section4,aholistic twig join algorithmthat (i)renesPathStackto
ensure that resultscomputed for oneroot-to-leafpath ofa twigpattern are likely to havematching
results in other paths of thetwig pattern, and (ii) merges resultsfor thedierentroot-to-leafpaths
in the querytwig pattern, to computethedesired output. Whenthequery twig usesonly
ancestor-descendantrelationships between elements, weanalytically demonstrate that TwigStack is I/O and
<title> XML <=title> <allauthors> <author> <fn>jane<=> <ln>poe<=> <=author> <author> <fn>john<=> <ln>doe<=> <=author> <author> <fn>jane<=> <ln>doe<=> <=author> <=allauthors> <year> 2000 <=year> <chapter> <title> XML <=title> <section>
<head> Origins <=head> ...
<=section> ... <=chapter> ... <=book>
book
(1,66,4)
head
(1,68:78,3)
(1,65:67,3)
2000
title
section
(1,61:63,2)
(1,64:93,2)
year
chapter
(1,3,3)
(1,2:4,2)
title
(1,5:60,2)
allauthors
XML
author
author
(1,62,3)
author
fn
ln
fn
(1,8,5)
jane
ln
fn
ln
(1,43,5)
jane
john
(1,11,5)
poe
doe
doe
XML
Origins
(1,70,5)
(1,7:9,4)
(1,6:20,3)
(1,26,5)
(1,46,5)
(1,1:150,1)
(1,69:71,4)
(a) (b)Figure1: (a)AsampleXMLdatabasefragment,(b)Tree representation
Finally,in Section5wepresentexperimentalresultsonarangeofrealandsyntheticdata,andquery
twigpatterns,to complementouranalyticalresults:
{ Weshowthesubstantialperformancebenetsof usingholistic twig joins overbinary structural
joins (forarbitraryjoinorders).
{ WeshowthatPathStackissignicantlyfasterthanPathMPMJamongholisticpathjoinalgorithms.
ThisvalidatestheanalyticalresultsdemonstratingtheI/OandCPUoptimalityofPathStack.
{ For the case of twigpatterns, weshow that the use of TwigStack is better (both in time and
space) than the independent use of PathStack on each root-to-leaf path, even when the twig
patterncontainsparent-childstructuralrelationships.
{ We show how to use a modication of B-trees, denoted XB-trees, along with TwigStack, to
performmatchingofquerytwigpatternsin sub-lineartime.
WedescriberelatedworkinSection6,andconcludebydiscussingongoingandfutureworkinSection7.
2 Background
2.1 Data Model and Query Twig Patterns
An XMLdatabaseisaforestofrooted, ordered,labeledtrees, each nodecorrespondingto anelementora
value, and theedges representing(direct) element-subelementor element-valuerelationships. Node labels
consist of aset of (attribute, value) pairs, which suÆces to model tags, IDs,IDREFs, etc. Theordering of
sibling nodesimplicitlydenesatotalorderonthenodesin atree,obtainedbyapreordertraversalofthe
treenodes. ForthesampleXML document ofFigure1(a), itstree representationisshownin Figure 1(b).
(TheutilityofthenumbersassociatedwiththetreenodeswillbeexplainedinSection2.3.)
QueriesinXMLquerylanguageslikeXQuery[2],Quilt[4]andXML-QL[7]makeuseof(nodelabeled)
XML
book
year
(a)
title
2000
XML
author
title
book
(b)
fn
ln
jane
doe
Figure2: Querytwig patterns
include elementtags, attribute-valuecomparisons, andstringvalues,and thequerytwig pattern edgesare
eitherparent-childedges(depictedusingasingleline)orancestor-descendantedges(depictedusingadouble
line). Forexample,theXQueryexpression:
book[title = `XML' AND year = `2000']
which matchesbookelementsthat (i)haveachild titlesubelement withcontentXML,and(ii) havea
childyearsubelementwithcontent2000,canberepresentedasthetwigpatterninFigure2(a). Only
parent-childedgesareusedinthiscase. Similarly,theXQueryexpressionintheintroductioncanberepresentedas
thetwigpatternin Figure 2(b). Notethat anancestor-descendantedgeis usedbetweenthebookelement
andtheauthorelement.
Ingeneral,ateachnodeinthequerytwigpattern,thereisanode predicate ontheattributes(e.g.,tag,
content)ofthenodeinquestion. Forthepurposesofthispaper,exactlywhatispermittedinthispredicate
isnotmaterial. Similarly,thephysicalrepresentationof thenodesin theXML databaseisnotrelevantto
theresultsinthispaper. ItsuÆcesforourpurposesthattherebeeÆcientaccessmechanisms(suchasindex
structures)toidentifythenodesintheXMLdatabasethatsatisfyanygivennodepredicateq,andreturna
streamofmatchesT
q .
2.2 Twig Pattern Matching
Givenaquerytwigpattern QandanXMLdatabaseD,amatch ofQinD isidentiedbyamappingfrom
nodesin Qtonodesin D,such that: (i)querynodepredicatesaresatisedbythecorrespondingdatabase
nodes(the imagesunder themapping),and (ii)thestructural (parent-childandancestor-descendant)
rela-tionships betweenquerynodesare satisedbythecorrespondingdatabasenodes. Theanswer to queryQ
withnnodescan berepresentedasann-aryrelationwhereeachtuple (d
1 ;:::;d
n
)consistsofthedatabase
nodesthatidentify adistinctmatchof Qin D.
Finding all matches of a query twig pattern in an XML database is a core operation in XML query
processing, both in relational implementations of XML databases, and in nativeXML databases. In this
paper,weconsiderthetwigpatternmatchingproblem:
GivenaquerytwigpatternQ,andanXMLdatabaseDthathasindexstructurestoidentify
databasenodesthat satisfyeachofQ'snodepredicates,computetheanswertoQonD.
Consider,for example,the querytwig patternin Figure 2(a),and thedatabasetree in Figure 1. This
querytwigpatternhasonematchinthedatatreethat mapsthenodesinthequerytotherootofthedata
Thekeyto aneÆcient,uniform mechanismfor set-at-a-time (join-based)matching ofquery twigpatterns
isapositionalrepresentationofoccurrencesofXMLelementsand stringvaluesin theXMLdatabase(see,
e.g.,[5,6,26]),which extendstheclassicinvertedindex datastructureininformationretrieval[21].
Werepresenttheposition ofastringoccurrencein the XMLdatabaseasa3-tuple (DocId, LeftPos,
LevelNum),andthepositionofanelementoccurrenceasa3-tuple(DocId, LeftPos:RightPos, LevelNum),
where(i)DocIdistheidentierofthedocument;(ii)LeftPosand RightPoscanbegeneratedbycounting
word numbers from the beginning of the document DocId until the start and the end of the element,
respectively; and (iii) LevelNum is the nesting depth of the element (or string value) in the document.
Figure1(b) shows3-tuplesassociatedwithsometreenodes,basedonthisrepresentation(theDocIdforall
nodesischosentobeone).
Structuralrelationshipsbetweentreenodeswhosepositionsarerecordedinthisfashioncanbedetermined
easily: (i)ancestor-descendant: atreenoden
2
whosepositionin theXML databaseisencodedas(D
2 ;L 2 : R 2 ;N 2
)isadescendantofatreenoden
1
whosepositionisencodedas(D
1 ;L 1 :R 1 ;N 1 )iD 1 =D 2 ;L 1 <L 2 , andR 2 <R 1 1
;(ii)parent-child: atreenoden
2
whosepositionintheXMLdatabaseisencodedas(D
2 ;L 2 : R 2 ;N 2
)isachildofatreenoden
1
whosepositionisencodedas(D
1 ;L 1 :R 1 ;N 1 )iD 1 =D 2 ;L 1 <L 2 ;R 2 < R 1 ,andN 1 +1=N 2
. Forexample,inFigure1(b),theauthornodewithposition(1;6:20;3)isadescendant
ofthe book node withposition (1;1:150;1),and thestring\jane"withposition (1;8;5) isachild ofthe
authornodewithposition(1;7:9;4).
A key point worth noting about this representation of node positions in the XML data tree is that
checking an ancestor-descendant relationship is as simple as checking a parent-child relationship(we can
checkforanancestor-descendantstructuralrelationshipwithoutknowledgeoftheintermediatenodesonthe
path). Also,thisrepresentationofpositionsofnodesallowforcheckingorder(e.g.,noden
2
followsnoden
1 )
andstructuralproximity(e.g., noden
2
isadescendantwithin3levelsofn
1
)relationships.
3 Holistic Path Join Algorithms
3.1 Notation
Letq (with or withoutsubscripts) denotetwigpatterns, aswell as(interchangeably) theroot nodeof the
twigpattern. Inouralgorithms,wemakeuseofthefollowingtwignodeoperations: isLeaf:Node!Bool,
isRoot: Node!Bool, parent: Node!Node,children: Node!fNodeg,and subtreeNodes: Node!
fNodeg. Pathquerieshaveonlyonechildpernode,otherwisechildren(q)returnsthesetofchildrennodes
ofq. TheresultofsubtreeNodes(q)isthenodeqand allitsdescendants.
Associated with each node q in a query twig pattern there is a stream T
q
. The stream contains the
positional representationsof the databasenodes that match the node predicate at the twig pattern node
q (possibly obtained using an eÆcient access mechanism, such asan index structure). The nodes in the
streamaresortedbytheir(DocId, LeftPos)values. Theoperationsoverstreamsare: eof,advance,next,
nextL,andnextR. ThelasttwooperationsreturntheLeftPosandRightPoscoordinatesin thepositional
representationofthenextelementin thestream,respectively.
Inourstack-basedalgorithms, PathStackand TwigStack, we alsoassociatewith each query nodeq a
stackS
q
. Eachdatanodeinthestackconsistsofapair: (positionalrepresentationofanodefromT
q
,pointer
toanodein S
parent(q)
). Theoperationsoverstacksare: empty, pop,push,topL, andtopR. Thelast two
1
in thestack,respectively. Atevery pointduring thecomputation, (i)the nodesin stackS
q
(frombottom
totop)areguaranteedtolieonaroot-to-leafpathintheXMLdatabase,and(ii)thesetofstackscontaina
compactencoding ofpartialandtotalanswerstothequerytwigpattern,whichcanrepresentinlinearspace
apotentially exponential(in thenumberof querynodes)numberof answersto thequerytwig pattern, as
illustratedbelow.
A
1
B
1
C
1
A
1
B
2
C
1
A
2
B
2
C
1
C
1
S
C
B
1
S
B
B
2
A
1
S
A
A
2
A
B
C
A
1
B
1
A
2
B
2
C
1
(a) Data
(b) Query
(c) Stack encoding
(d) Query results
Figure 3: Compactencoding ofanswersusingstacks.
Example3.1 Figure3illustratesthe stack encodingof answerstoapathqueryfor asample dataset. The
answer [A 2 ;B 2 ;C 1
] is encoded since C
1 points to B 2 , and B 2 points to A 2 . Since A 1 is below A 2 on the stack S A , [A 1 ;B 2 ;C 1
] isalso an answer. Finally, since B
1 is below B 2 on the stack S B ,and B 1 pointsto A 1 ,[A 1 ;B 1 ;C 1
]isalsoananswer. Notethat[A
2 ;B
1 ;C
1
]isnotananswer,sinceA
2
isabovethenode (A
1 ) onstack S A towhich B 1 points.
Wemakecrucialuseofthiscompactstackencodingin ouralgorithms,PathStackandTwigStack.
3.2 PathStack
Algorithm PathStack, which computes answersto aquery pathpattern, is presentedin Figure 4, forthe
casewhenthestreamscontainnodesfrom asingleXML document. Whenthestreams containnodesfrom
multiple XML documents, the algorithm is easily extended to test equalityof DocIdbefore manipulating
thenodesinthestreamsandstacks.
Thekeyidea of Algorithm PathStack is to repeatedly construct (compact) stack encodings of partial
andtotalanswerstothequerypathpattern,byiteratingthroughthestreamnodesinsortedorderoftheir
LeftPosvalues;thus,thequerypathpatternnodeswillbematchedfromthequeryrootdowntothequery
leaf. Line2,in AlgorithmPathStack,identiesthestreamcontainingthenextnodetobeprocessed. Lines
3-5 removepartial answersfrom thestacksthat cannot be extended to totalanswers, given knowledge of
thenextstreamnodeto beprocessed. Line6augmentsthepartial answersencodedin thestackswiththe
newstreamnode. WheneveranodeispushedonthestackS
qmin
, whereq
min
istheleafnodeof thequery
path,the stackscontainanencoding oftotalanswerstothe querypath,andAlgorithm showSolutionsis
invokedbyAlgorithmPathStack(lines7-9)to \output"these answers.
Anaturalwayfor AlgorithmshowSolutionsto outputquerypath answersencoded inthe stacksis as
n-tuplesthat arein sortedleaf-to-rootorder ofthequerypath. Thiswill ensurethat,overthesequenceof
invocationsof AlgorithmshowSolutionsbyAlgorithmPathStack,theanswersto thequerypatharealso
computedinleaf-to-rootorder.Figure5showssuchaprocedureforthecasewhenonlyancestor-descendant
01 while :end (q)
02 q
min
=getMinSource(q)
03 for qi in subtreeNodes(q) //cleanstacks
04 while (:empty (S q i )^topR (S q i )<nextL(T q min )) 05 pop(Sq i ) 06 moveStreamToStack(T q min ;S q min , pointer to top (S parent (q min ) )) 07 if (isLeaf(q min )) 08 showSolutions(Sq min ;1) 09 pop (S q min ) Function end(q)
return 8qi2subtreeNodes(q):isLeaf (qi))eof(Tq
i )
Function getMinSource(q)
return qi2subtreeNodes(q) such that nextL (Tq
i ) is minimal Procedure moveStreamToStack(T q ;S q ;p) 01 push (Sq;(next (Tq);p)) 02 advance(Tq)
Figure4: AlgorithmPathStack
Whenparent-childedgesarepresentinthequerypath, wealsoneed totaketheLevelNuminformation
into account. PathStack doesnot needto change, but we needto ensurethat each time showSolutions
is invoked, it does not output incorrect tuples, in addition to avoiding unnecessary work. This can be
achieved by modifying the recursive call (lines 6-7) to check for parent-child edges, in which case only a
single recursive call (showSolutions(SN 1;S[SN]:index[SN]:pointer totheparentstack)) needs to be
invoked,afterverifyingthat theLevelNumofthetwonodesdierbyone. Loopingthroughallnodesinthe
stackS[SN 1]wouldstillbecorrect,butitwoulddomoreworkthanisstrictlynecessary.
Ifwedesirethenal answersto thequerypathbepresentedin sortedroot-to-leaforder(as opposed to
sorted leaf-to-root order), it is easyto see that it doesnotsuÆce that each invocationof showSolutions
outputanswersencodedin thestackin theroot-to-leaforder. Toproduceanswersin thesortedroot-to-leaf
order,wewouldneedto \block" answers,and delaytheiroutput untilwearesurethat noanswerpriorto
themin thesortordercan becomputed. Thedetails ofhowto achievethisnaturallyextendtheintuitions
ofAlgorithmStack-Tree-Ancfrom [1],andarepresentedin thenextsection.
Example3.2 Consider the leftmost path, book{title{XML,in each of the query twigs of Figure 2. If we
used the binary structural join algorithms of [26,1], we wouldrst needto compute matches to oneof the
parent-childstructural relationships: book{title,ortitle{XML.Sinceevery bookhas atitle,thisbinary
join would producea lot of matches against an XML books database, even whenthere are only afew books
whose title isXML. If,instead,werst computedmatches to title{XML,wewould also match pairs under
chapterelements,asinthe XMLdatatreeofFigure1(b),which donotextendtototalanswerstothe query
pathpattern. UsingAlgorithmPathStack,partialanswers arecompactlyrepresentedinthe stacks,andnot
output. Using the XML data tree of Figure 1(b), only one totalanswer, identied by the mapping [ book
//Assume,forsimplicity,thatthestacksofthequerynodesfromtheroottothecurrent
// leafnodeweareinterestedincanbeaccessedasS[1];:::;S[n].
//Also assumethatwehaveaglobalarrayindex [1::n]ofpointerstothestackelements.
//index [i]representsthepositioninthei'thstackthatweareinterestedinforthe
// currentsolution,wherethebottomofeachstackhasposition1.
//MarkweareinterestedinpositionSP ofstackSN.
01 index [SN]=SP
02 if (SN==1) //weareintheroot
03 //outputsolutionsfromthestacks
04 output (S[n]:index[n];:::;S[1]:index[1])
05 else //recursivecall
06 for i=1 to S[SN]:index[SN]:pointertoparent
07 showSolutions(SN 1;i)
Figure5: ProcedureshowSolutions
3.2.1 Blockingresults
ConsiderthesimplepathqueryA==D. Atanypointduringthequeryprocessing,ifnodeafromA's stack
isfoundtobeanancestorofnodedfrom D'sstream,theneverynodea 0
fromA'sstackthat isanancestor
ofaisalsoanancestorofd. Sincea 0
:L<a:L, wemustdelayoutputofthesolution(a;d)until after(a 0
;d)
hasbeenoutput. Butthereremainsthepossibilityofanewelementd 0
afterdinD'sstreamjoiningwitha 0
aslonga 0
is onA'sstack. Therefore,wecannotoutputthepair(a;d)untiltheancestornodea 0
ispopped
from A'sstack. Meanwhile,wecanbuilduplargejoinresultsthat cannotyet beoutput. Thissituation is
furtheraggravatedforlongerpathqueries. Wenowpresentaprocedurethatreturnssolutionsinroot-to-leaf
orderandgeneralizestheideasin [1]forbinarystructuraljoins andisbasedontheconceptofblocking.
Forthis purpose, we maintain twolinked lists associated with each element n in the stacks: the rst,
(S)elf-list,representsallblockeddescendantextensionswithrootelementn,andthesecondsecond,
(I)nherit-listrepresentsallblockeddescendantextensionswithrootelementsthat weredescendantsofninthestack
(onlythetopelementsin thestackshaveinherit-lists). Atanypointduringthealgorithm,thedescendant
extensions in the self- and inherit-lists of node n need to be expanded withthe elementsin the stacks of
theancestor nodesin the query, in the samewayasin the procedure showSolutions. The main ideasof
thealgorithmremainthesame,butwedonotoutputsolutions(usingshowSolutions)assoonaswedetect
them. Instead,weaccumulate inthe self-andinherit-lists thepartialsolutionsfound inroot-to-leaforder,
sowhenwenally outputresultstheyareguaranteedtobein therightorder.
Inparticular,when anewelementispushedto astack,weinitializeitsself- andinherit-liststoempty.
SupposewearepoppingelementD
c
fromthestackofquerynodeD. Dependingonthecurrentconguration,
weproceedasfollows 2
:
(a) NodeDis notthequeryroot,but hasparentnodeA. ElementD
c
isnotatthebottom ofthestack,
but has ancestor D
p
(Figure 6(a)). In this case, we rst determine nodes A
c and A
p
(the nodes in
A's stackpointedto byD
c andD
p
respectively). Then weappendtheself- and inherit-lists(in that
order) fromD
c
totheself-lists ofallelementsin A's stackstartingwith A
c
(inclusive) andending in
A
p
(exclusive).
(b) Node D is not the query root, but hasparent node A. Element D
c
is at the bottom of the stack.
(Figure6(b)). Inthiscase,werstdeterminenodeA
c
(the nodeinA'sstackpointedtobyD
c
). Then
2
A
p
A
c
D
p
D
c
.
.
.
.
.
.
.
.
.
.
.
.
D
p
D
c
.
.
.
D
c
(a)
(b)
(c)
(d)
D
c
.S to D
p
.I
(D
c
.S+D
c
.I) to A
c
.S, ..., A
p
.S
(exclusive)
(D
c
.S+D
c
.I) to A
c
.S, ..., A
b
.S
(D
c
.S+D
c
.I) to D
p
.I
Output (D
c
.S+D
c
.I)
A
c
D
c
.
.
.
.
.
.
A
b
Figure6: Possiblestackcongurationswhenblockingresults.
weappendtheself-andinherit-lists(inthatorder)fromD
c
totheself-listsofallelementsinA'sstack
startingwithA
c
and endinginA'sbottomelement.
(c) Node D is the query root. Element D
c
is not at the bottom of the stack, but has ancestor D
p
(Figure6(c)). Inthiscaseweappendtheself-andinherit-lists(inthat order)from D
c
tothe
inherit-listofelementD
p .
(d) NodeD is the queryroot. ElementD
c
is at thebottomof the stack(Figure 6(d)). We rstoutput
thecontentsoftheself-listandthenthecontentsoftheinherit-listofelementD
c .
It is fairly simpleto show that in cases(a),(b), and (c) we preserveall solutionswhen rearrangingthe
linked lists. Also,each timeweappend onelist to another,weareguaranteed that nofuture element will
beappended outoforder. Therefore,in case(d)weoutputsolutionsin thedesired order. Asan informal
proof, considerforinstance, case(a). All descendantextension fromD
c
'sself-andinherit- listsneedto be
expandedwithallelementsfrom A
c
to thebottomof thestack. Forthatpurpose, weappend theself-and
inherit-lists D
c
to allnodesin the parent querynode from A
c to A
p
exclusive. Since we alsoappend this
descendant extensionsto theinherit-list of D
p
, weare guaranteedthat in afuture iteration this solutions
willreachtoalltheremainingnodesinstackA. Sowedidnotloseanysolutionwhenmanipulatingthelists.
Also, when we pop elementD
c
, we know that nonew elementfrom D's stream canstart before D
c (and
all its descendants) does, so any new solutionwill start after everydescendant extension in D
c
's self and
inherit lists. Incontrast,wecannotappendthese lists to theself listof A
p
andits ancestorsin the stack,
becausesomesolutionsthatstartbeforethemmightbeblockedin D
p
anditsancestors. Wesolvethatcase
byappendingD
c
'sself-andinherit-liststo D
p
'sinheritlist. Theothercasescanbeexplainedsimilarly.
Theonlyoperationweperform overlists is \append"(exceptfor thenal read out). Since our
imple-mentation ofthe linkedlists maintainstheheadand tailpointers, these append operationscanbecarried
outinconstanttimeandrequirenocopying. Therefore,weonlyneedtohaveaccessto thetailofeachlist
inmemoryascomputationproceeds. Therest ofthelistcanbepagedout. Whenlistxisappended tolist
y, it is notnecessarythat thehead of list x bein memory, since theappend operation only establishesa
link to thishead inthe tailofy. Soallwe needisto knowthepointer for theheadofeachlist, evenifit
is paged out. Each list pageis thus paged outat mostonce, at paged back in againonly when thelist is
ready foroutput. Also,thetotalnumberof entries in thelistsis exactlyequalto thenumberofentries in
theoutput. WehavethenthattheI/Orequiredtomaintainlistsofresultsisproportionaltothesizeofthe
ThefollowingpropositionisakeytoestablishingthecorrectnessofAlgorithmPathStack.
Proposition 3.1 If wex node Y,the sequenceofcases betweennode Y andnodesX on increasing order
of LeftPos (L) is: (1j2) 3 4
. Cases 1 and Cases 2are interleaved, then all nodes in Case 3 before any
node inCase 4, andnallyall nodesinCase 4(seeFigure7).
X
Y
Case 1
Case 2
Case 3
Case 4
Y
Root
X
Y
Root
X
Y
Root
X
Y
Root
X
X
X
X
Y
Y
Y
X.R<Y.L
X.L<Y.L
X.L>Y.R
X.R>Y.R
X.L>Y.L
X.R<Y.R
Property
Segments
Tree
Figure 7: CasesforPathStackandTwigStack
Lemma3.1 Supposethatforanarbitrarynodeqinthepathpatternquery,wehavethatgetMinSource(q)=
q
N
. Also,suppose that t
q
N
isthe next element in q
N
'sstream. Then, after t
q N ispushedon tostack S q N ,
the chain of stacks from S
qN to S
q
veries that their labels areincludedin the chain of nodesin the XML
datatreefromt
q
N
tothe root.
For each node t
qmin
pushed onto stack S
qmin
, it is easy to see that the above lemma, along with the
iterativenature ofAlgorithm showSolutions, ensuresthatall answersin whicht
q
min
isamatchforquery
nodeq
min
willbeoutput. Thisleadstothefollowingcorrectnessresult:
Theorem3.1 GivenaquerypathpatternqandanXMLdatabaseD,AlgorithmPathStackcorrectlyreturns
allanswers for qon D.
Wenextshowoptimality. GivenanXMLquerypathoflengthn, PathStacktakesninput listsoftree
nodes sortedby (DocId, LeftPos), and computesanoutput sortedlist ofn-tuples that match thequery
path. It is straightforward to see that, excluding the invocations to showSolutions, the I/O and CPU
costsof PathStack arelinear in thesum of sizes of then inputlists. Sincethe costof showSolutionsis
proportionaltothesizeoftheoutputlist,wehavethefollowingoptimalityresult:
Theorem3.2 Given aquery pathpattern qwith nnodes, andanXML database D,AlgorithmPathStack
has worst-caseI/OandCPUtimecomplexitieslinearin thesum ofsizesof theninputlistsandthe output
list. Further,theworst-casespacecomplexityofAlgorithmPathStackistheminimumof(i)the sumofsizes
of then inputlists,and(ii) the maximumlengthofa root-to-leafpath inD.
It is particularly important to note that the worst-case time complexity of Algorithm PathStack is
AstraightforwardgeneralizationoftheMPMGJNalgorithm [26]forpathqueriesproceeds onestreamata
time to get all solutions. Considerthe path queryq
1 ==q
2 ==q
3
. Thebasic ideais as follows: Get the rst
(next) element from the stream T
q
1
and generate all solutions that use that particular element from T
q 1 . Then,advanceT q1 andbacktrack T q2 andT q3
accordingly(i.e.,to theearliestpositionthat mightleadtoa
solution). Thisprocedureis repeateduntilT
q
1
isempty. Thegenerate all solutions steprecursivelystarts
with the rst marked elementin T
q2
, gets all solutions that use that element (and the calling element in
T
q
1
),thenadvancesthestreamT
q
2
untiltherearenomoresolutionswiththecurrentelementinT
q
2
,andso
on. Werefertothisalgorithmas PathMPMJNaive,inourexperiments.
It turns outthat maintaining onlyone mark per stream (for backtracking purposes) is tooineÆcient,
sinceallmarksneedtopointto theearliestsegmentthat canmatchthecurrentelementinT
q1
(the stream
of the root node). A better strategy is to use astack of marks, as described in Algorithm PathMPMJ in
Figure8. InthisoptimizedgeneralizationofMPMGJN,eachquerynodewill nothaveasinglemarkinthe
stream,but \k"markswhere kisthenumberofitsancestorsin thequery. Eachmarkpointstoanearlier
positionin thestream, andforquerynodeq,thei'thmarkistherstpointin T
q
suchthattheelementin
T
q
startsafterthecurrentelementin thestream ofq'si'thancestor.
Algorithm PathMPMJ(q)
01 while (:eof (Tq)^ (isRoot (q)_nextL (q)<nextR (parent (q))))
02 for (qi2subtreeNodes(q)) //advancedescendants
03 while (nextL (q i )<nextL (parent (q i ))) 04 advance(Tq i ) 05 PushMark(T q i )
06 if (isLeaf (q)) outputSolution() //solutioninthestreams'heads
07 else PathMPMJ(child (q))
08 advance(Tq)
09 for (qi2subtreeNodes(q)) //backtrackdescendants
10 PopMark(Tq
i )
Figure8: AlgorithmPathMPMJ
Theorem3.3 GivenaquerypathpatternqandanXMLdatabaseD,AlgorithmPathMPMJcorrectlyreturns
allanswers for qon D.
Whilethetwoextensions ofMPMGJN appearsimilar, the dierencebetweentheirperformanceis
no-ticeable, as we shall see in the experimental section. Further, as was the case with MPMGJN,
Algo-rithmPathMPMJisnotasymptoticallyoptimaleither.
4 Twig Join Algorithms
4.1 Limitations of Using PathStack
Astraightforwardwayofcomputinganswerstoaquerytwigpatternistodecomposethetwigintomultiple
root-to-leafpathpatterns,usePathStacktoidentifysolutionstoeachindividualpath,andthenmerge-join
these solutionsto computethe answersto thequery. This approach, which weexperimentallyevaluate in
Section 5,faces thesamefundamental problemasthetechniquesbasedonbinarystructuraljoins,towards
Against the XML database in Figure 1(b), the twopaths of this query: author{fn{jane,and author{ln{
doe,have twosolutionseach, butthe querytwig patternhas onlyone solution.
In general, if the query (root-to-leaf) paths have many solutions that do not contribute to the nal
answers,usingPathStack(as asub-routine) issuboptimal,in that theoverall computationcostfor atwig
pattern is proportional not just to the sizes of the input and the nal output, but also to the sizes of
intermediateresults. Inthissection,weseektoovercomethissuboptimalityusingAlgorithmTwigStack.
4.2 TwigStack
Algorithm TwigStack, which computesanswersto aquery twig pattern, is presented in Figure 9,for the
casewhen thestreams containnodes from asingleXML document. Aswith Algorithm PathStack, when
thestreamscontainnodesfrommultiple XMLdocuments,thealgorithmis easilyextendedto testequality
ofDocIdbeforemanipulatingthenodesin thestreamsandonthestacks.
Algorithm TwigStack(q) //Phase 1 01 while :end (q) 02 q act =getNext (q) 03 if (:isRoot(qact)) 04 cleanStack(parent (q act ), nextL (q act )) 05 if (isRoot(qact)_:empty(S parent (q act ) ))
06 cleanStack(qact, next (qact))
07 moveStreamToStack(T qact ;S qact ;pointertotop(S parent (q act ) )) 08 if (isLeaf(qact)) 09 showSolutionsWithBlocking(S qact ;1) 10 pop(Sq act ) 11 else advance(Tq act ) //Phase 2 12 mergeAllPathSolutions() Function getNext (q) 01 if (isLeaf(q)) return q 02 for qi in children(q) 03 n i =getNext (q i )
04 if (ni6=qi) return ni
05 nmin= minargn i nextL (Tn i ) 06 nmax= maxargn i nextL (Tn i )
07 while (nextR (Tq)<nextL (Tn
max )) 08 advance (T q ) 09 if (nextL (Tq)<nextL (Tn min )) return q 10 else return n min
Procedure cleanStack(S, actL)
01 while (:empty (S)^(topR (S)<actL))
02 pop(S)
Figure9: AlgorithmTwigStack
AlgorithmTwigStackoperatesin twophases. Intherstphase(lines1-11),some(butnotall)solutions
to individual query root-to-leaf paths are computed. In the second phase (line 12), these solutions are
A
1
B
1
A
2
D
1
C
1
D
2
(a) Data set
(b) Query
A
B
C
D
Figure10: ExampledatasetandquerytoillustratethegetNextprocedure.
The key dierence betweenPathStackandthe rstphaseof TwigStackis that before anode h
q from
thestream T
q
is pushedonitsstackS
q
,TwigStack(via itscallto getNext)ensuresthat: (i)node h
q has
adescendant h
q
i
in eachofthestreamsT
q
i ,forq
i
2children(q),and (ii)eachofthenodesh
q
i
recursively
satises the rst property. Algorithm PathStackdoesnot satisfy this property (and it does not need to
do so to ensure(asymptotic) optimality for querypath patterns). Thus, when thequery twigpattern has
only ancestor-descendant edges, each solution to each individual query root-to-leafpath is guaranteed to
bemerge-joinable with at least onesolution to each of theother root-to-leafpaths. This ensuresthat no
intermediatesolutionislargerthanthenal answertothequerytwigpattern.
The second merge-join phase of Algorithm TwigStack is linearin the sum of its input (the solutions
to individual root-to-leafpaths) and output (the answer to the query twig pattern) sizes, only when the
inputsareinsortedorderofthecommonprexesofthedierentqueryroot-to-leafpaths. Thisrequiresthat
thesolutionstoindividual querypathsbeoutputin root-to-leaforderaswell,whichnecessitates blocking;
showSolutions(fromFigure5), cannotbeusedandneedstobeextended withtheideasofSection 3.2.1.
Example4.2 Consider again the query of Example 4.1, which isthe sub-twig rooted at the authornode
of thetwig pattern inFigure2(b), andthe XMLdatabasetreeinFigure1(b). BeforeAlgorithmTwigStack
pushes anauthornodeon thestack S
author
,itensuresthat thisauthornode has: (i) adescendantfnnode
in the stream T
fn
(which in turn has a descendant jane node in T
jane
), and (ii) a descendant ln node in
the stream T
ln
(which in turn has a descendant doe node in T
doe
). Thus, only one of the three author
nodes (corresponding to the thirdauthor) from the XML data tree in Figure 1(b) is pushed on the stacks.
Subsequentstepsensurethatonlyone solutiontoeachof thetwopathsof thisquery: author{fn{jane,and
author{ln{doe,is computed. Finally, the merge-joinphase computes thedesiredanswer.
4.3 Analysis of TwigStack
Inthis sectionwediscuss thecorrectnessof algorithmTwigStackforprocessingtwig queries,and thenwe
analyzeitscomplexity.
Denition4.1 Consider a twig query Q. For each node q 2 subtreeNodes(Q) we dene the head of q,
denotedh
q
, asthe rstelement in T
q
that participates in asolution for the sub-queryrootedatq. We say
thatanodeqhasaminimaldescendantextensionifthereisasolutionforthesub-queryrootedatqcomposed
entirely ofthe head elementsof subtreeNodes(q).
Example4.1 Consider the datasetand queryof Figure 10. The rstelement inT
A
thatparticipatesina
solution forthe query,h
A =A
1
(suchsolutioninvolves the nodesA
1 ,B 1 ,C 1 ,andD 2 ). Similarly,h B =B 1 , h C =C 1 andh D =D 1
,since those are the rst elements that participate in asolution for the sub-queries
1 2
does notuse h
D =D
1
. Forthe samereason, node Adoesnot haveaminimal descendant extension.
Proposition 3.1, based on Figure 7, is also important for establishing the following lemma, used for
provingthecorrectnessofAlgorithmTwigStack.
Lemma4.1 [Descendantextension] Supposethatfor anarbitrarynodeqinthetwigquerywehavethat
getNext(q)=q
N
. Then, the following properties hold:
1. q
N
hasaminimal descendantextension.
2. Foreach node q 0 2subtreeNodes(q N ),the rstelement inT q 0 ish q 0 . 3. Either (a) q = q N or (b) parent(q N
) does not have a minimal right extension because of q
N (and
possibly othernodes). Inotherwords, thesolution rootedatp=parent(q
N
)that usesh
p
doesnotuse
h
q
for node qbut someother elementwhoseL componentis larger thanthatof h
q .
Proof: (Inductiononthenumberofdescendantsofq). Ifqisaleafnode,ittriviallyveriesallproperties,
sowe returnit in line 1 of the algorithm. Otherwise, we get n
i
=getNext(q
i
) recursivelyfor each child
q
i
of q in lines 2-4. If for somei, wehavethat n
i 6=q
i
but somedescendantof q
i
, we know byinductive
hypothesisthatn
i
veriesproperties(1),(2)and(3-b)abovewithrespecttoq
i
. Allpropertiesarestillvalid
withrespecttonodeq,so wereturnnoden
i
inline4. Otherwise,weknowbyinductivehypothesisthatall
ofq'schildrensatisfyproperties(1),(2)and(3-a)abovewithrespecttotheircorrespondingsub-queries. We
rstadvancefromT
q
allsegmentsthatarein Case1withrespect toany headsegmentin then
i
'sstreams
(lines7-8). Afterthat,weknowthatqisincase2,3or4withrespecttoeveryn
i
byusingProposition3.1.
IfqisinCase2withrespecttoalln
i
,itsatisesproperties(1),(2)and(3-a)above,sowereturnit(line9).
Otherwise,wecanguaranteethattheminimaln
i
veriesproperties(1),(2)and(3-b)above,sinceitsparent
(q)doesnothaveaminimaldescendantextensionbecauseofn
i
(andpossiblyotherchildren),sowereturn
suchminimalnoden
i
inline10(theelementinqdoesnotsatisfyproperty(3)atthemoment,butcoulddo
solateron). Toverifythelastcondition(lines9-10)wesimplycheckiftheleftboundoftheelementinqis
lowerthantheleftbound oftheminimaln
i
. Ifthat isthecase,weknowthat q'selementstartsbeforeany
n
i
andendsaftereachn
i
(sincewecheckedthat itendsaftereach n
i
starts).
Lemma4.2 Consider the following fragment ofcode:
while :end(q) q N =getNext(q); advance(q N )
The following propertieshold:
1. getNext returnsin node q
N
allandonlyelementsin the datasetwith adescendant extension.
2. If elementx iselement y'sancestorinsome solution,thenx isreturnedbygetNext before y.
Proof: (Inductiononthenumberofcalls togetNext). Inparticular, wewillshowthatproperty(1) above
holds byprovingthat all elements that are skippedin line 10of getNext areguaranteed notto haveany
descendantextension. ConsidertherstcalltogetNextforqueryq,andassumethatgetNext(q)=q
N . By
Lemma4.1, property(1), weknowthat alltherstelementsinthestreamsof subtreeNodes(q
N
)are part
either q
N
=q (with noancestors) orthe solutionrootedat q
N
is guaranteed notto bepartof any solution
involvingq
N
'sancestors,becauseeachdescendantextensionfromp=parent(q
N
),usingsomeelementxfrom
p'sstream,veriesx:Lh
p :L>h
qN
:L(usingLemma4.1,property(3)). Therefore,property2holds.
For subsequent calls to getNext, we proceed asfollows. Assume that getNext(q) = q
N
and consider
eachelementx (fromnodeq
x
2subtreeNodes(q
N
))thatisadvanced inline 10. Elementx cannotbepart
of any descendant extension with elements in the streams of subtreeNodes(q
x
), since x ends before the
rstsolutionin somechild's stream(see conditionin line9of getNext). Besides, x cannot bepartof any
descendantextensionwithelementsthatwerealreadyprocessedbygetNextbyproperty(2)oftheinductive
hypothesis. Therefore,x isguaranteednottobepartofanydescendantextension,andproperty(1)holds.
Nowsupposethatq
N
6=qandp=parent(q
N
). UsingLemma4.1,property(3),weknowthath
p :L>h
q
N :L.
Therefore, any solution involving element h
q
:L must use some element from p that wasalready returned
bygetNext. Usingproperty(2)of theinductive hypothesis, we knowthat allelementsfrom p'sancestors
in such solutions were returned by getNext before the corresponding element from p, which in turn was
returnedbeforeh
q
N
. Property(2)holdsaswell.
Wethenknowthatwhensomenodeq
N
isreturnedbygetNext,h
qN
isguaranteedtohaveadescendant
extensionin subtreeNodes(q
N
). Wealsoknowthatanyelementin theancestors ofq
N
that usesh
q
N ina
descendantextensionwasreturnedbygetNextbeforeh
qN
. Therefore,wecanmaintainforeachnodeqinthe
query,theelementsthatarepartofasolutioninvolvingotherelementsinthestreamsofsubtreeNodes(q).
Then, eachtime thatq
N
=getNext(q)is aleafnode,weoutput allsolutionsthat useh
q
N
. Wenowprove
thatwecanachievethatgoalbymaintainingasinglestackpernodeinthequery.
Lemma4.3 Supposethatinthe currentiterationwehave getNext(q)=q
N
. Then,thereisnonew solution
involving: (a) someelement fromq
N
thatends before h
q
N
starts,and(b)someelementfrom thestreamsin
subtreeNodes(q
N ).
Proof: Suppose that on the contrary, there is a new solution using some (already processed) element
from q N (denoted s qN ) for which s qN :R < h qN
:L. Using the containment property, we know that all
elements from subtreeNodes(q
N
) in such solution must end before s
q
N
:R , and therefore, before h
q
N :L.
SincegetNext(q)=q
N
weknow(Lemma4.1) thatq
N
hasaminimal descendantextension, andtherefore,
allelementsinthestreamsofsubtreeNodes(q
N
)startafterh
q
N
does,whichisacontradiction.
Thelemmaaboveguaranteesthatforeachnodeq,thesetofelementsthatcanbepartofanewsolution
areexactlythoseincludingthelastelementfromqreturnedbygetNext. Therefore,asinglestackpernode
suÆcestokeeptrackofallelementswithpotentialnewsolutions. Thefollowinglemma showsthatwecan
easilychaintheelementsofthestacksinthesamewayasin PathStackanduseshowSolutionstooutput
allsolutions.
Lemma4.4 Suppose that, in the current iteration, we have getNext(q)=q
N , and q 6= q N . Let p = parent(q N
). Then, there is no \new" solution involving: (a) some element from p that ends before h
q
N
starts,and(b) someelementfromthe streamsin subtreeNodes(p).
Proof: TheproofissimilartothatofLemma4.3. Supposethatonthecontrary,thereisanewsolutionusing
someelementfromp(denoteds
p
)forwhichs
p :R<h
qN
:L. WeknowthatallelementsfromsubtreeNodes(p)
in suchsolution mustend before s
p
:R ,and therefore, before h
q
N
:L. SincegetNext(q)=q
N
weknow(see
algorithmgetNextforTwigStack)thatforallchildrenn
i ofp(includingq N ): (1)getNext(n i )=n i ,and(2) h q N :Lh n i
:L. UsingLemma 4.1weknowthateachn
i
i ni qN
contradictionandthelemmaholds.
Thelemma above guarantees that each time getNext returnsnodeq
N
wecan clean from q
N
's parent
stackallelements thatdo notinclude h
q
N
(since theycannotparticipatein anynewsolution) andset the
h
qN
'spointerinq
N
'sstacktothetopofq
N
'sparentstack. Usingthetwolemmasabove,weproveourmain
resultforTwigStack.
Theorem4.1 Given a query twig pattern q, and an XML database D, Algorithm TwigStack correctly
returnsallanswers for qonD.
Proof: InAlgorithmTwigStack,werepeatedlyndgetNext(q)forqueryq. AssumethatgetNext(q)=q
N .
LetA
qN
bethesetofnodesinthequerythatareancestorsofq
N
. UsingLemma4.2weknowthatgetNext
alreadyreturned allelementsfrom thestreams ofnodes in Aq
N
that arepartof asolutionthat usesh
q
N .
Ifq6=q
N
,in line 3wepopfrom parent(q
N
)'sstackallelements thatare guaranteednotto participatein
any new solution (Lemma 4.4). After that, in line 4wetest whether h
q
N
participatesin asolution. We
knowthat q
N
hasadescendantextension by Lemma4.1, property1. If q6=q
N andparent(q N )'sstackis empty, nodeq N
doesnothavean ancestorextension. Therefore itis guaranteednot toparticipate in any
solution, so we advance q
N
in line 10 and continuewith the nextiteration. Otherwise, nodeq
N
has both
ancestoranddescendantextensionsandthereforeitparticipatesinatleastonesolution. Wethencleanq
N 's
stack(Lemma 4.3)and after that wepushh
qN
to it (lines5-6). ByLemma 4.4 weknowthat thepointer
tothetopofparent(q
N
)'sstackcorrectlyidentiesallsolutionsusingh
q
N
. Finally,ifq
N
isaleafnode,we
decompressthestoredsolutionsfrom thestacks(lines7-9).
Whilecorrectnessholds forquerytwig patternswith bothancestor-descendantandparent-childedges,
wecanproveoptimalityonlyforthecasewherethequerytwigpatternhasonlyancestor-descendantedges.
The intuitionis simple. Since wepush into the stacks onlyelementsthat haveboth adescendantand an
ancestor extension, weare guaranteedthat noelementthat doesnot participatein any solutionispushed
intoanystack. Therefore,themergepostprocessingstepisoptimal,andwehavethefollowingresult.
Theorem4.2 Consider aquery twig pattern q with n nodes, and only ancestor-descendant edges, and an
XMLdatabaseD. AlgorithmTwigStackhasworst-caseI/OandCPUtimecomplexitieslinearinthesumof
sizesoftheninputlistsandtheoutputlist. Further,theworst-casespacecomplexityofAlgorithmTwigStack
isthe minimum of (i)the sum of sizesof the ninput lists,and(ii) n timesthe maximum lengthofa
root-to-leafpath inD.
Itisparticularlyimportanttonotethat,forthecaseofquerytwigswithancestor-descendantedges,the
worst-casetimecomplexityofAlgorithmTwigStackisindependentofthesizesofsolutionstoanyroot-to-leaf
pathof the twig.
4.4 Suboptimality for Parent-Child Edges
Theorem 4.2 holdsonly forquery twigs with ancestor-descendantedges. Unfortunately, in thecase where
the twig pattern contains aparent-childedge betweentwo elements (e.g., see the queryin Example 4.2),
AlgorithmTwigStackisnolongerguaranteedtobeI/OandCPUoptimal. Inparticular,thealgorithmmight
produce asolutionfor oneroot-to-leafpath that doesnotmatch withanysolutionin another root-to-leaf
path.
Considerthequery twigpatternwith three nodes: A;B andC, andparent-childedgesbetween(A;B)
and between(A;C). Letthe XML datatree consist of nodeA
1
, withchildren(in order) A
2 ;B 2 ;C 2 , such
2 1 1 A B C 1 1 1
respectively. Inthiscase, wecannot sayifanyofthem participatesin asolutionwithoutadvancing other
streams, and we cannot advance any stream before knowing if it participates in a solution. As a result,
optimalitycannolongerbeguaranteed.
4.5 Using XB-Trees
AlgorithmsPathStackandTwigStackneedtoprocesseach nodeintheinputliststo checkwhetherornot
it is partof ananswer to thequery (pathortwig) pattern. When theinput lists areverylong, this may
takealot oftime. Inthissection,weproposetheuseofavariantofB-trees,denotedXB-tree,ontheinput
liststospeedupthisprocessing.
4.5.1 XB-Tree Description
Asitsnamesuggests,theXB-treeisavariantoftheB-tree,designedforindexingthepositionalrepresentation
(DocId, LeftPos : RightPos, LevelNum)ofelementsintheXMLtree. Wedescribetheindexstructure
whenallnodesbelongtothesameXMLdocument;theextensiontomultipledocumentsisstraightforward.
ThenodesintheleafpagesoftheXB-treearesortedbytheirLeftPos(L)values;thisissimilartotheleaf
pagesofaB-treeontheLvalues. ThedierencebetweenaB-treeandanXB-treeisinthedatamaintained
atinternalpages. EachnodeN inaninternalpageoftheXB-treeconsistsofaboundingsegment[N:L;N:R ]
(whereLdenotesLeftPosandRdenotesRightPos)andapointertoitschildpageN:page(whichcontains
nodes with bounding segments completely included in [N:L;N:R ]). The bounding segments of nodes in
internal pagesmight partiallyoverlap,but their Lpositions arein increasingorder. Besides, each pageP
hasapointer totheparentpageP:parent andtheintegerP:parentIndex which istheindex ofthenodein
P:parent thatpointsbackto P. Theconstruction andmaintenance ofanXB-treeis verysimilar to those
in aB-tree,usingtheL valueasthe key;thedierenceis that theR valuesneed tobepropagatedupthe
indexstructure.
4.5.2 UsingXB-Trees
Wemaintainapointeract=(actPage;actIndex)totheactIndex'thnodein pageactPageoftheXB-tree.
TherearetwooperationsovertheXB-treethataectthispointer:
1. advance. Ifact=(actPage;actIndex)doesnotpointto thelast nodeinthecurrentpage,wesimply
advance actIndex. Otherwisewereplace actwith thevalue (actPage:parent;actPage:parentIndex)
andrecursivelyadvanceit.
2. drillDown. Ifact=(actPage;actIndex), actPageisnotaleafpage,andN is theactIndex'th node
in actPage,wereplaceactwith(N:page;0)sothat itpointstotherstnodeinN:p.
Initiallyact=(rootPage;0),pointingtotherstnodeintherootpageoftheXB-tree. Whenactpoints
tothelastnodein rootPageandweadvanceit,wenishthetraversal.
Wecanmodify the previousalgorithms easily touse XB-trees. AlgorithmTwigStackXB, in Figure 11,
extends Algorithm TwigStack so that it uses XB-trees. The only changes are in the lines indicated by
parentheses. The function isPlainValuereturns true if the actual pointer in the XB-tree is pointing to
a leaf node (actual value in the original stream). If we dene isPlainValue(T)=true when T is not an
01 while :end (q) 02 q act =getNext (q) (03) if (isPlainValue(Tq act )) 04 if (:isRoot (q act ))
05 cleanStack(parent (qact), next (qact))
06 if (isRoot (q act )_:empty (S parent (qact) )) 07 cleanStack(q act , next (q act )) 08 moveStreamToStack(Tq act ;Sq act ; pointer to top (S parent (qact) )) 09 if (isLeaf(q act )) 10 showSolutionsWithBlocking(Sqact;1) 11 pop (Sq act ) 12 else advance(Tqact)
(13) else if (:isRoot(qact)^empty (S
parent (qact) )^nextL (T parent (qact) )>nextR (Tq act )) (14) advance(T qact
) //Notpartofasolution
(15) else //Mighthaveachildinsomesolution
(16) drillDown(T qact ) //Phase2 17 mergeAllPathSolutions() Function getNext(q) 01 if (isLeaf (q)) return q 02 for qi in children (q) 03 n i =getNext (q i ) (04) if (qi6=ni_:isPlainValue(Tn i )) return ni 05 n min = minarg n i nextL(T n i ) 06 n max = maxarg n i nextL (T n i )
07 while (nextR (Tq)<nextL (Tn
max )) 08 advance (T q ) 09 if (nextL (Tq)<nextL (Tn min )) return q 10 else return n min
Procedure cleanStack(S, actL)
01 while (:empty(S)^(topR (S)<actL))
02 pop(S)
Figure 11: AlgorithmTwigStackXB
Theorem4.3 Given a query twig pattern q and an XML database D, Algorithm TwigStackXB correctly
returnsallanswers for qonD.
Whilewedo not haveany analyticalresults about theeÆciency of Algorithm TwigStackXB, we show
experimentallythat itperformsmatchingofquerytwigpatternsinsub-linear time.
5 Experimental Evaluation
In this section we present experimental results on the performance of the join algorithms introduced in
Sections3and4usingbothreal andsynthetic data.
5.1 Experimental Setting
We implemented all XML join algorithms in C++ using the le system asa simple storage engine. All
XMark
XMach-1
Figure 12: TheBenchmark datasetcombinesbothXMarkandXMach-1documents.
ofdiskspace, runningRedHat Linux7.1. Weusedthefollowingsynthetic andreal-worlddatasetsforour
experiments:
Random: We generated random trees using three parameters: depth, fan-out and number of dierent
labels. For most of the experiments presented involvingRandom data sets, we generated full binary and
ternarytrees. Unless speciedexplicitly, thenodelabelsin thetreeswere uniformlydistributed. Wetried
othercongurations(largerfanoutandrandomdepth inthetree)obtainingconsistentresults.
Benchmark: We used a combination of the XMark [25] and XMach-1 [24] benchmarks. The XMark
benchmark is basedon arich DTD that models adatabasein an Internet auction site. However, XMark
usually produces shallow XML les with not too many nested levels. On the other hand, the
XMach-1 benchmark is based on a simpler DTD that models documents. As a result, the produced XML les
can model deeply nested sections and subsections. To take advantage of both models, we generated the
Benchmark datasetbymergingpartialresultsfromXMarkandXMach-1. Inparticular,werstgenerated
atemporaryXMLleusingtheXMarkbenchmarkwithscalingfactorequaltoone. Wethenreplacedeach
of the 105,396 descriptiontags (that originally corresponded to free text) with a small XML fragment
produced bytheXMark benchmark(seeFigure 12). Theresultingdata set hasdepth37 andaround11.7
millionnodes.
DBLP: Therealdatasetisan\unfolded"fragmentoftheDBLPdatabase. IntheoriginalDBLPdataset,
eachauthorisrepresentedbyaname,ahomepage,andalistofpapers. Inturn, eachpapercontainsatitle,
theconferencewhereitwaspublished,andalistofcoauthors. WegeneratedourunfoldedfragmentofDBLP
in thefollowingway. Westartedwithanarbitraryauthor andconvertedthecorresponding informationto
XMLformat. Further,foreachpaper,wereplacedeachcoauthornamewiththeactualinformationforthat
author. We recursively continued unfolding authors until we reached apreviously traversed author, ora
depthof200authors. TheresultingXMLdatasethasdepth 805andaround3millionnodes. It represents
93,536dierentpapersfrom 36,900uniqueauthors.
5.2 Holism: Binary Structural Joins vs PathStack
InthisrstexperimentwecompareourholisticPathStackalgorithmagainststrategiesthatusea
combina-tionofbinarystructuraljoins[1]. Forthispurpose,weusedaRandom dataset consistingof1Mnodesand
six dierent labels, namely: A
1 ;A 2 ;:::;A 6 . 3
We issued thepath queryA
1 ==A 2 ==:::==A 6 and evaluated 3
NotethattheactualXMLdatacancontainmanymorelabels,butthatdoesnotaectourtechniquessinceweonlyaccess
0
10
20
30
40
50
60
Execution
time
(seconds)
Binary Joins
PathStack
SS
Figure13: Holisticversusstructuralbinaryjoinsforpathqueries
it using PathStack. Then, we evaluated all binary structural join strategies resulting from applying all
possiblejoin orders. Figure13showstheexecutiontimeofallbinaryjoin strategies,where eachstrategyis
representedwithasinglebar. Wealsoshowwith asolid linetheexecutiontime ofPathStack, andwitha
dottedline thetimeittakesto simplyreadtheinputdatausingasequentialscan(labeledSS).
Forthisquery,PathStacktook2:53s,slightlymorethanthe1:87stakenbythesequentialscanoverthe
inputdata. Incontrast,thestrategiesbasedonbinarystructuraljoinsrangedfrom16:1sforthejoinordering
((((A 1 ==A 2 )==A 3 )==A 4 )==A 5 )==A 6
, to 53:07sforthejoin orderingA
1 ==((A 2 ==(A 3 ==(A 4 ==A 5 )))==A 6 ). The
rst conclusion we can draw from these execution times is that optimization plays an important role for
binary structuraljoins, sinceabad join ordering canresult in aplan that ismore thanthree times worse
than the best plan. The second conclusion we can draw is that the holistic strategy is superior to the
approach of using binary structural joins for arbitrary join orders; in this case, it results in morethan a
six-foldimprovementinexecutiontimeoverthebeststrategythatusesbinarystructuraljoins.
5.3 Paths: PathStack vs PathMPMJ
In this section we study the eÆciency of the dierent holisticpath join algorithms of Section 3. We rst
comparethetwoversionsofPathMPMJ. Weuseda64KRandom datasetwith10dierentnodesA
1 ;:::A
10 ,
and issuepath queriesof dierent lengths. Figure 14(a) shows theexecution times ofbothtechniques, as
well asthetime takenforasequentialscan overtheinput data. PathMPMJNaiveismuchslowercompared
to the optimized PathMPMJ(generally over anorder of magnitude). Thereason is that PathMPMJNaiveis
tooconservativewhenbacktrackingand readsseveraltimesunnecessaryportionsofthedata. Asshown in
Figure 14(b), PathMPMJNaiveread as much as 15 timesthe numberof elementsPathMPMJ did. Since the
performance of PathMPMJNaivedegrades considerably with the size of the data set and the length of the
inputquery,wedonotconsiderthisstrategyfortheremainderofthispaper.
WenowcomparePathStackagainsttheoptimizedPathMPMJ. InFigure15weshowtheexecutiontime
andthenumberof nodesread fromdisk forpathqueriesof dierentlengthandaRandom data setof 1M
nodesand10dierentvalues. Clearly,PathStackresultsinconsiderablybetterperformancethanPathMPMJ,
andthis dierenceincreaseswithlongerpathqueries. Thisis explainedbythefact thatPathStackmakes
asinglepassovertheinputdata,whilePathMPMJneedstobacktrackandreadagainlargeportionsofdata.
For instance, for a path query of length 10, PathMPMJ reads the equivalent of ve times the size of the
originaldata,asseeninFigure15(b). InFigure15(a),forpathqueriesoflengthtwo,theexecutiontimeof
PathStackis considerably slowerthan that ofthesequentialscan,and closerto PathMPMJ. This behavior
isdue tothefact thatforthepathqueryoflengthtwo,thenumberofsolutionsisratherlarge(morethan
100K),somostoftheexecutiontimeisusedinprocessingthesesolutionsandwritingthembacktodisk. For
0
5
10
15
20
2
3
4
5
6
7
8
9
10
Path Length
Execution
Time
(seconds)
SS
PathMPMJ
PathMPMJNaive
0
1000000
2000000
3000000
4000000
5000000
2
3
4
5
6
7
8
9
10
Path Length
V
a
lu
es
Read
PathMPMJ
PathMPMJNaive
(a)Executiontime (b)Numberofelementsread
Figure14: PathMPMJversusPathMPMJNaiveforpathqueriesusingRandom data
0
5
10
15
20
25
30
2
4
6
8
10
Path length
Execution
time
(seconds)
SS
PathStack
PathMPMJ
0
1000000
2000000
3000000
4000000
5000000
6000000
2
4
6
8
10
Path length
Values
read
PathStack
PathMPMJ
(a)Executiontime (b)Numberofelementsread
Figure15: PathStackversusPathMPMJusingRandom datasets
isclosertothatofthesequentialscanandmuchsmallerthanthatofPathMPMJ.
Figure 16 shows the execution time and number of values read for two simple path queries over the
unfoldedDBLP dataset (notethelogarithmic scaleontheY axis). Due tothespecic nestingproperties
betweennodesin this dataset, thePathMPMJ algorithmspends much time backtrackingand reads several
times thesamevalues. Forinstance, forthe pathqueryof lengththree in Figure16, PathMPMJ reads two
ordersofmagnitudemoreelementsthanPathStack.
Finally,Figure17showstheexecutiontimeandnumberofvaluesreadforafamilyoflongpathqueries
overthe Benchmark data set. In particular, wegenerated 10 dierent queriesby starting with thequery
template REGIONS/NAMERICA//ITEM//DESCRIPTION/SECTION/SECTION/SECTION/HEAD/WORD-iand
replac-ing the label WORD-i with each one of the 10 most frequent words in the data set. As in the previous
experiments,PathStackresultsinsignicantlymoreeÆcientexecutionsthanPathMPMJ.
5.4 Twigs: PathStack vs TwigStack
We now focus on twig queries, and compare our TwigStack algorithm against the naive application of
PathStacktoeachbranchinthetreefollowedbyamergestep. AsshowninSection4,TwigStackisoptimal
forancestor/descendantrelationships,but itis provably suboptimalforparent/child relationships. Inthis
1
10
100
1000
PAPER//YEAR
PAPER//YEAR//1960
Execution
time
(seconds)
SS
PathStack
PathMPMJ
10000
100000
1000000
10000000
100000000
PAPER//YEAR
PAPER//YEAR//1960
Values
Read
PathStack
PathMPMJ
(a)Executiontime (b)Numberofelementsread
Figure16: PathStackversusPathMPMJfortheunfoldedDBLPdataset
0
20
40
60
80
100
120
140
0
1
2
3
4
5
6
7
8
9
Word
i
Execution
Time
(seconds)
SS
PathStack
PathMPMJ
0
5000000
10000000
15000000
20000000
25000000
30000000
0
1
2
3
4
5
6
7
8
9
Word
i
V
a
lu
es
read
PathStack
PathMPMJ
(a)Executiontime (b)Numberofelementsread
A
1
A
2
A
3
A
4
A
5
A
6
A
7
A
1
A
2
A
3
A
4
A
5
A
6
A
7
AUTHOR
PAPER
YEAR
CO-AUTHOR
2000
PAPER
YEAR
1990
PAPER
CO-AUTHOR
YEAR
1980
(2)
(parameter d)
(a)
(b)
(c)
(parameter d)
Figure18: Twigqueriesusedin theexperimentsofSection5.4.1.
0
5
10
15
20
25
30
8%
9%
10%
11%
12%
13%
15%
17%
20%
24%
Fraction of data set with solutions
Execution
time
(seconds)
SS
TwigStack
PathStack
1
10
100
1000
10000
100000
1000000
8%
9%
10% 11% 12% 13% 15% 17% 20% 24%
Fraction of data set with solutions
Number
o
f
s
olutions
Partial TwigStack
Partial PathStack
Total
(a)Executiontime (b)Numberofsolutions
Figure19: PathStackversusTwigStackforabinarytwigquery.
5.4.1 Ancestor-Descendant Relationships
Fortherstexperimentinthissection,weusedthequeryshowninFigure18(a)overdierentRandom data
sets. Each dataset wasgenerated asafull ternarytree. Therst subtreeoftheroot nodecontainedonly
nodeslabeledA 1 ;A 2 ;A 3 andA 4
. ThesecondsubtreecontainednodeslabeledA
1 ;A 5 ;A 6 andA 7 . Finally,the
thirdsubtreecontainedallpossiblenodes. Clearly,therearemanypartialsolutionsinthersttwosubtrees
butthosedonotproduceanycompletesolution.Onlythethirdsubtreecontainsactualsolutions. Wevaried
the size of the third subtree relativeto the sizes of the rst twofrom 8%to 24%(beyond that point the
numberofsolutionsbecametoolarge). Figure 19showsthe executiontimeof PathStackandTwigStack,
and thenumberof partial solutionseachalgorithm produces before the mergingstep. The consistent gap
betweenTwigStackandPathStackresultsfromthelattergeneratingallpartialsolutionsfromthersttwo
subtrees which arelaterdiscardedin themergestep(A
1 ==A 2 ==A 3 ==A 4 )./(A 1 ==A 5 ==A 6 ==A 7 ). As seenin
Figure19(b), thenumberof partialsolutionsproducedbyPathStackisseveralordersofmagnitude larger
thanthat oftheTwigStackalgorithm. Thenumberoftotalsolutionsto thequerytwig computedbyboth
algorithmsis,ofcourse,thesame.
Forthesecondexperiment,weusethetwigqueryofFigure18(b)andgenerateddierentRandom data
sets inthefollowingway. Asbefore, eachdata setisafull ternarytree. Therstsubtreedoesnotcontain
any nodeslabeledA
2 orA
3
. The secondsubtreedoesnotcontainanyA
4 orA
5
nodes. Finally, the third
subtreedoesnotcontainanyA
6 orA
7