XML Data
Torsten Schlieder
Freie Universität Berlin
schlied@inf.fu-berlin.de
Felix Naumann
Humboldt-Universität zu Berlin
naumann@dbis.informatik.hu-berlin.de
Abstract
Querying heterogeneous collections of data-centric XML documents requires a combination of
database languages and concepts used in information retrieval, in particular similarity search
and ranking. In this paper we present an approach to find approximate answers to formal user
queries. We reduce the problem of answering queries against XML document collections to
the well-known unordered tree inclusion problem. We extend this problem to an optimization
problem by applying a cost model to the embeddings. Thereby we are able to determine how
closely parts of the XML document match a user query. We present an efficient algorithm that
finds all approximate matches and ranks them according to their similarity to the query.
1 Introduction
XML has gained much attention in both the information retrieval community and in the field
of database research. One reason is that XML offers a uniform and standardized way to
represent and exchange both documents in the sense of information retrieval and data exported
from databases. On the one end of the spectrum of XML usage are text-centric documents with
only few, interspersed markups. On the other end of the spectrum are data-centric documents
that are solely created and interpreted by some application logic. Such documents have a
well-defined structure and are typically stored in homogeneous collections. But XML is also
considered as the common format for the integration of a wide range of data. One example is
federated digital library catalogues that combine the data from several repositories, which
typically have differing schemata. Another example is data warehouses storing all company-wide
XML-formatted documents such as messages and database reports. Such document collections are
data-centric but do not have a common schema, i.e., they are heterogeneous.
To search in text-centric XML documents, techniques adopted from information retrieval are
appropriate. For querying data-centric XML documents in homogeneous collections, new query
languages have been developed (for an overview, see [FSW99, BC00]). However, for querying
heterogeneous collections of data-centric documents neither information retrieval techniques nor
query languages are appropriate. Classic information retrieval models like the vector space model
completely ignore the structure of the documents. Even if meaningful element and attribute names
are included in the creation of descriptors, the semantics of the nesting is lost. Query languages,
on the other hand, cannot cope with documents that are only similar to the query. A substructure
of a document either matches the query or it does not. The consequence of the binary decision
This research was supported by the German Research Society, Berlin-Brandenburg Graduate School in
the document and that often not all desired results are returned. Note that enriching a query
language with regular path expressions does not solve the problem of querying collections with
partially known schema, as we will point out in Section 2.
We believe that for heterogeneous collections of data-centric documents the formalism of database
languages combined with the concept of uncertainty used in information retrieval will achieve
the best results. In this paper we integrate some aspects of the two concepts and present the
XML query language approXQL as a syntactic subset of XQL [RLS98]. Unlike XQL or any other
XML query language, results of approXQL are not only all strict matches to the query but also
results that have additional inner elements not specified by the query. This method returns
satisfying results even when the user does not know the exact structure of the collection. The
results of approXQL queries are sorted by their similarity to the query. Exact matches are given
first, followed by results with only slight deviations, etc. Together with the language we present a
query-centric algorithm that efficiently answers approXQL queries.
In Section 2 we show that it is difficult for a user to query a heterogeneous document collection
using a traditional XML query language. In Section 3 we introduce the approXQL query language
and show how we interpret the queries and the data collection as trees. In Section 4 we review the
well-known unordered tree inclusion problem and extend it to the approximate tree embedding
optimization problem. In Section 5 we present our algorithm. In Section 6 we survey related work,
and in Section 7 we give an outlook on future improvements and extensions.
2 Motivating Example
Assume that we have a collection of documents storing metadata about books. Figure 1 shows
a subset of this collection. It contains information about different books the author "Bradley"
contributed to.
Figure 1: A heterogeneous collection of data-centric documents. [Four tree diagrams, Doc 1 to Doc 4; node labels include book, author, title, name, chapter, list, and PCDATA values such as "Bradley".]
Assume further that we need information about all books that have "Bradley" as author. We
specify an XQL query that exactly models our information need:
/book/author/text() = "Bradley"
Obviously, the only result of the query is document 1, despite the fact that at least the documents
2 and 3 exactly match our information need too. Moreover, document 4 may also be of interest.
Fortunately, all XML query languages support regular path expressions. Even with the simplest
kind of regular path expression, the skip operator, we can partly solve our problem:
documents have to be skipped and where they are. Coping with structural heterogeneity requires
knowledge of the entire data structure and often leads to complex queries. Furthermore, query
processors for XML query languages do not measure how many elements have to be skipped in
order to match the query. Therefore, no ranking according to structural closeness is possible and
the results are returned in an arbitrary order.
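The limitation can be illustrated with a minimal sketch using Python's standard xml.etree.ElementTree module. The four documents below are hypothetical and only loosely modeled on Figure 1 (their exact structure is an assumption), and the two helper functions merely emulate a strict path and a skip operator; they are not an XQL implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents, loosely modeled on Figure 1 (structure assumed).
DOCS = {
    "Doc 1": "<book><title>T1</title><author>Bradley</author></book>",
    "Doc 2": "<book><title>T2</title><author><name>Bradley</name></author></book>",
    "Doc 3": "<book><title>T3</title><list><author>Bradley</author></list></book>",
    "Doc 4": "<book><title>T4</title><chapter><author>Bradley</author></chapter></book>",
}

def strict_match(root):
    """Emulates /book/author/text() = "Bradley": author must be a child of book."""
    return root.tag == "book" and any(
        (a.text or "").strip() == "Bradley" for a in root.findall("author"))

def skip_match(root):
    """Emulates a skip operator: "Bradley" anywhere below an author below book."""
    return root.tag == "book" and any(
        t.strip() == "Bradley"
        for a in root.iter("author") for t in a.itertext())

roots = {name: ET.fromstring(xml) for name, xml in DOCS.items()}
strict_hits = [name for name, r in roots.items() if strict_match(r)]
skip_hits = [name for name, r in roots.items() if skip_match(r)]
print(strict_hits)  # only the document with the exact structure
print(skip_hits)    # all documents, but without any notion of closeness
```

The skip variant returns a plain yes/no per document, so nothing records how many elements were skipped, which is the information a ranking would need.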
We assume that the user has only partial knowledge of the data structure. The user specifies
queries that exactly meet his or her information need. The system retrieves results that exactly fit
the query, but also relaxes the matching conditions to retrieve documents that have additional
elements not specified in the query. The results are ranked according to the similarity to the query.
3 Data and Query Modeling
In this section we introduce an elementary data model for XML. We use some normalizations to
map elements, attributes and PCDATA to a single node type. We further introduce the approXQL
query language.
3.1 Modeling XML Data
We model a collection of XML documents as a labeled tree with a single root. Currently, we
ignore ID references and hyperlinks. To simplify our model, we only use a single node type. Each
node d of that type has a label, written as label(d). We use the notation parent(d) to refer to the
(unique) parent node of d.
We perform the following normalizations: Elements are mapped to nodes using the element name
as label. Each attribute is mapped to two nodes: the first one gets the name as label, the second
one gets the value of the attribute as label. PCDATA sections are mapped to nodes by treating
the whole text as label. Figure 2 shows an example of the normalization.
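Figure 2: Simplification of a data tree. (a) Part of an XML data tree: a book element with the attribute year = 1998 and the child elements title (PCDATA "XML") and author (PCDATA "Bradley"). (b) Normalized version: a book node with the children year (leaf 1998), title (leaf XML) and author (leaf Bradley).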
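The normalization rules can be sketched in a few lines of Python on top of the standard xml.etree.ElementTree parser. The Node class, the normalize and show helpers, and the decision to strip surrounding whitespace from text are assumptions made for this illustration, not part of the model described in the paper.

```python
import xml.etree.ElementTree as ET

class Node:
    """A node of the normalized tree: a single node type carrying a label."""
    def __init__(self, label, parent=None):
        self.label, self.parent, self.children = label, parent, []
        if parent is not None:
            parent.children.append(self)

def normalize(elem, parent=None):
    """Map an ElementTree element (and its subtree) to the single-node-type model."""
    node = Node(elem.tag, parent)                # element -> node labeled with its name
    for name, value in elem.attrib.items():      # attribute -> name node + value node
        Node(value, Node(name, node))
    if elem.text and elem.text.strip():          # PCDATA -> node labeled with the text
        Node(elem.text.strip(), node)
    for child in elem:
        normalize(child, node)
        if child.tail and child.tail.strip():    # text that follows a child element
            Node(child.tail.strip(), node)
    return node

def show(node):
    """Nested (label, children) tuples, convenient for printing and comparing."""
    return (node.label, [show(c) for c in node.children])

tree = normalize(ET.fromstring(
    '<book year="1998"><title>XML</title><author>Bradley</author></book>'))
print(show(tree))
```

After normalization the attribute value 1998 and the element contents XML and Bradley are all leaf nodes, so a query no longer has to distinguish between attributes and elements.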
Note that we do not consider semantic heterogeneity of attribute and element names. We assume
that all elements modeling the same real world domain also carry the same label. This can be
achieved by providing a function that maps semantically equal names to a common label.
3.2 The ApproXQL Query Language
We introduce a simple pattern matching language called approXQL that is a syntactic subset of
XQL [RLS98]. Informally, approXQL consists of (1) the path operator "/", (2) the filter operator
"[]", (3) the text() selector, and (4) the operators "=" and "$and$". The following query requests
Note that the user does not need to know how the year 1999 is modeled in the original documents.
Due to our simplified data model, attribute values and element contents are both mapped to the
leaf nodes of the data tree. Thus, the query will match both cases. We do not require that query
paths end with the text() selector. The following query matches books that have a foreword and
at least one editor:
book[foreword $and$ editor]
An approXQL query can be interpreted as a tree (see Figure 3). Each $and$ expression is mapped
to an inner node of the tree. The children of an $and$ node are the roots of the paths that are
conjunctively connected. We use the notation children(q) to refer to the set of children of q and
the notation label(q) to identify the label of q. Note that different query nodes may have the
same label. The result of an approXQL query is always a subtree of the data tree. In contrast to
XQL we do not allow queries with boolean results.
book[ title/text() = 'XML' $and$
      author[ firstname/text() = 'Neil' $and$
              lastname/text() = 'Bradley' ]]
Figure 3: Mapping of an approXQL query to a tree. (a) approXQL query. (b) Query tree: a book node with the children title (leaf XML) and author; the author node has the children firstname (leaf Neil) and lastname (leaf Bradley).
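The mapping from a query string to a query tree can be sketched as a small recursive-descent parser. The grammar used here is only our reading of the informal description above (path operator, filter, text() selector, "=" and "$and$"); the function names tokenize, parse_path and parse are hypothetical, and features such as a leading "/" are deliberately not handled.

```python
import re

NAME = r"[A-Za-z_][\w.-]*"
TOKEN = re.compile(r"""\s*(\$and\$|text\(\)|'[^']*'|"[^"]*"|[\[\]/=]|""" + NAME + ")")

def tokenize(query):
    tokens, pos, query = [], 0, query.strip()
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_path(tokens, i):
    """Parse one path starting at tokens[i]; return ((label, children), next index)."""
    if i >= len(tokens) or not re.fullmatch(NAME, tokens[i]):
        raise ValueError("expected an element name")
    label, children, i = tokens[i], [], i + 1
    if i < len(tokens) and tokens[i] == "[":       # filter: $and$-connected paths
        while True:
            child, i = parse_path(tokens, i + 1)   # i + 1 skips "[" or "$and$"
            children.append(child)
            if i >= len(tokens) or tokens[i] not in ("$and$", "]"):
                raise ValueError("expected '$and$' or ']'")
            if tokens[i] == "]":
                i += 1
                break
    if i < len(tokens) and tokens[i] == "/":       # path operator
        if tokens[i + 1:i + 2] == ["text()"]:
            i += 2
            if tokens[i:i + 1] == ["="]:           # text() = 'value' becomes a leaf
                if i + 1 >= len(tokens):
                    raise ValueError("expected a literal after '='")
                children.append((tokens[i + 1][1:-1], []))
                i += 2
        else:
            child, i = parse_path(tokens, i + 1)
            children.append(child)
    return (label, children), i

def parse(query):
    tokens = tokenize(query)
    tree, i = parse_path(tokens, 0)
    if i != len(tokens):
        raise ValueError("trailing input after query")
    return tree

figure3 = parse("book[ title/text() = 'XML' $and$ "
                "author[ firstname/text() = 'Neil' $and$ lastname/text() = 'Bradley' ]]")
print(figure3)
```

In this sketch the node that carries a filter becomes the inner node whose children are the roots of the conjunctively connected paths, and a comparison with text() contributes a leaf labeled with the literal.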
4 The Tree Embedding Problem
We have shown how to interpret both the data and the query as trees. With this interpretation,
the problem of answering a query can be mapped to the problem of embedding a query tree in
the data tree.
The exact ordered tree embedding problem has been extensively studied (see [RR92] for an
overview) and is solvable in polynomial time. However, we are not interested in an ordered
embedding, because ordering in the data may be inconsistent, forcing us to ignore it. Even if
it were consistent, the order might not be known to the user. Also, we are not interested in an
exact embedding, as we will see in the following sections. There, we first review the unordered
tree inclusion problem. Then, to answer user queries in a meaningful way, we extend it to an
optimization problem.
4.1 The Unordered Tree Inclusion Problem
Our goal is to approximately embed the query tree into the data tree such that the labels and
the ancestorship of the nodes are preserved. That is, we allow the data tree to have nodes in between
those found in the query tree, but their predecessor-successor relationship must still hold in the
tree. Figure 4 shows an allowable embedding. The query tree on the left is not perfectly embedded
Figure 4: Unordered inclusion of a query tree in a data tree. (a) Query tree. (b) Part of an XML data tree. [Tree diagrams; node labels include book, title, chapter, section, author, name, XML and Bradley.]
To find all such embeddings, Kilpeläinen introduced the so-called unordered tree inclusion problem
[Kil92]. The author proved that even the decision problem, i.e., the problem of deciding whether
one embedding exists, is NP-complete. We follow [Kil92] in describing the problem.
Definition 1 (Embedding) A function $f$ from the set $Q$ of query nodes into the set $D$ of data
nodes is called an embedding if for all $q_i, q_j \in Q$
1. $f(q_i) = f(q_j) \Leftrightarrow q_i = q_j$,
2. $\mathrm{label}(q_i) = \mathrm{label}(f(q_i))$,
3. $q_i$ is the parent of $q_j$ $\Leftrightarrow$ $f(q_i)$ is an ancestor of $f(q_j)$.
We call the data node to which the query root is mapped the embedding root. Note the difference
between parent and ancestor in condition 3. Between a node and its parent there exists an edge,
whereas between a node and its ancestor there exists merely a path.
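The three conditions can be checked mechanically for a candidate mapping. The following minimal sketch assumes a hypothetical Node class with parent links and represents the function f as a Python dict. As a labeled interpretation, condition 3 is read here as "ancestorship is preserved in both directions", so that a parent edge in the query may correspond to a longer path in the data.

```python
class Node:
    def __init__(self, label, *children):
        self.label, self.children, self.parent = label, list(children), None
        for c in self.children:
            c.parent = self

def nodes(root):
    yield root
    for c in root.children:
        yield from nodes(c)

def is_ancestor(a, d):
    """True if a is a proper ancestor of d (follows parent links)."""
    d = d.parent
    while d is not None:
        if d is a:
            return True
        d = d.parent
    return False

def is_embedding(f, query_root):
    """f maps query nodes to data nodes (a dict keyed by node objects)."""
    qs = list(nodes(query_root))
    if len({id(f[q]) for q in qs}) != len(qs):            # condition 1: injective
        return False
    if any(q.label != f[q].label for q in qs):            # condition 2: equal labels
        return False
    return all(is_ancestor(qi, qj) == is_ancestor(f[qi], f[qj])   # condition 3
               for qi in qs for qj in qs)

# Query book/author/Bradley against a data tree with two skipped nodes.
q_leaf = Node("Bradley"); q_author = Node("author", q_leaf); q_book = Node("book", q_author)
d_leaf = Node("Bradley"); d_author = Node("author", Node("name", d_leaf))
d_book = Node("book", Node("chapter", d_author))
good = {q_book: d_book, q_author: d_author, q_leaf: d_leaf}

# A mapping that breaks ancestorship: "Bradley" is a sibling of author in the data.
e_leaf = Node("Bradley"); e_author = Node("author"); e_book = Node("book", e_author, e_leaf)
bad = {q_book: e_book, q_author: e_author, q_leaf: e_leaf}

print(is_embedding(good, q_book), is_embedding(bad, q_book))
```

The check is quadratic in the number of query nodes, which is acceptable for illustration because queries are small compared to the data.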
In [Kil92] the author also studied the problem of locating all subtrees (represented by their root
nodes) of the data tree that include the query tree minimally. However, in the framework of the
author the term "minimal" simply means that only those matching subtrees are retrieved that
do not include another matching subtree. In the following subsection we introduce a different
interpretation of a minimal embedding.
4.2 The Approximate Tree Embedding Problem
The tree inclusion approach is in itself only of limited value to the user because the solutions are
not ranked: The embedding roots are retrieved in arbitrary order, independent of how many nodes
have to be skipped in order to perfectly embed the query. We do not want only answers that
perfectly match the user query; answers should match as well as possible. We solve this problem
by measuring the quality of an embedding. We use a cost function over the data nodes that must
be skipped to embed the query perfectly.
We use the cost model in two ways: First, among all embeddings that have the same embedding
root we only select the cost-minimal embedding. Thus, we avoid situations like that of Figure 4
where several embeddings refer to the same book. Second, we rank the cost-minimal embeddings
according to the cost. Embeddings where none or only a small number of nodes are skipped have
a low cost and are ranked high. To this end, we assign a deletion cost cost(d) to each data node.
Definition 2 (Embedding cost) Let $f$ be an embedding of $Q$ in $D$. Let $f(Q)$ be the image of $f$
and $P$ be the set of data nodes along all paths starting and ending at any two nodes
$d_i, d_j \in f(Q)$. Then the embedding cost of $Q$ in $D$ is
$$C := \sum_{d \in P \setminus f(Q)} \mathrm{cost}(d).$$
Definition 3 (Match) Let $f$ be an embedding according to Definition 1. Let $d$ be the embedding
root. We call the node $d$ together with its embedding cost a match: $m = (d, C)$.
We use the notation $d_m$ and $C_m$ to refer to the components of $m$. Note that we also use the term
match for embeddings of query subtrees.
Definition 4 (Minimal match) Let $F$ be the set of embeddings from a query tree into a data
tree and let $M$ be the set of matches belonging to $F$. We partition $M$ into groups of matches that
share the same root node. For each group, the match with the minimal cost is called minimal
match.
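Definitions 2 to 4 can be illustrated with a short sketch. It assumes that data nodes are plain strings, that the tree is given as a hypothetical parent dictionary, and that every node has delete cost 1. Since the embedding root is an ancestor of every other image node, collecting the nodes on the paths from each image node up to the embedding root covers all paths between pairs of image nodes.

```python
from collections import namedtuple

Match = namedtuple("Match", "root cost")      # m = (d, C)

def embedding_cost(image, root, parent, cost):
    """Sum cost(d) over the skipped nodes, i.e. nodes on a path between image
    nodes that are not image nodes themselves. Each node counts once."""
    skipped = set()
    for d in image:
        while d != root:
            d = parent[d]
            if d not in image:
                skipped.add(d)
    return sum(cost[d] for d in skipped)

def minimal_matches(matches):
    """Keep the cheapest match per embedding root and rank by increasing cost."""
    best = {}
    for m in matches:
        if m.root not in best or m.cost < best[m.root].cost:
            best[m.root] = m
    return sorted(best.values(), key=lambda m: m.cost)

# Hypothetical data tree: one book that mentions "Bradley" in two places.
parent = {"chapter": "book1", "author_a": "chapter", "name": "author_a",
          "bradley_a": "name", "author_b": "book1", "bradley_b": "author_b"}
cost = {d: 1 for d in list(parent) + ["book1"]}

# Two embeddings of the query book/author/Bradley with the same embedding root.
deep = embedding_cost({"book1", "author_a", "bradley_a"}, "book1", parent, cost)
flat = embedding_cost({"book1", "author_b", "bradley_b"}, "book1", parent, cost)
ranked = minimal_matches([Match("book1", deep), Match("book1", flat)])
print(deep, flat, ranked)
```

The deep embedding skips the chapter and name nodes, the flat one skips nothing, so only the flat embedding survives as the minimal match for this book.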
The following section describes an algorithm that finds the cost-ranked set of minimal matches to
an approXQL query.
5 An Algorithm for the Approximate Tree Embedding Problem
The complexity result of [Kil92] suggests that there is no polynomial algorithm solving the
approximate tree embedding problem. We introduce a new algorithm for the approximate tree
embedding problem that is still exponential in the worst case. But in typical cases our algorithm
only takes sublinear time with respect to the database size. First, we sketch the basic idea of our
approach and then we discuss the details of the algorithm.
Our algorithm is based on dynamic programming. Its most important property is the query-centric
execution. For each query node only those data nodes are fetched that are members of a potential
embedding of the query tree in the data tree. The query tree is processed bottom up. For each
query node q the embeddings of the query tree rooted at q are combined from the embeddings of
the query trees rooted at the child nodes of q. Each embedding is represented by a match. The
matches belonging to a certain query node are grouped by their data nodes. From each group
only the minimal match is selected. The minimal matches belonging to the query root are
the results of the algorithm. They are sorted by increasing cost and returned to the user.
5.1 Constructing the Embeddings
We use a postorder numeration of the nodes in the query tree. The main loop of the algorithm
visits the query nodes in ascending order (see Algorithm 1, line 1). The postorder ensures that
for each child node of the current query node, the minimal matches have already been calculated
and stored in the match set belonging to that child node. Recall from Section 4 that a minimal
match $m$ is a pair consisting of the embedding cost $C_m$ and a data node $d_m$ that is the root of
the minimal embedding of a query subtree.
Assume that $q_i$ is the current query node. Now, the algorithm fetches all data nodes that have the
same label as $q_i$ (line 3). For each matching data node $d$, the algorithm tries to embed the query
subtree rooted at $q_i$ in the data subtree rooted at $d$. To construct those embeddings the algorithm
builds all combinations of matches where each combination consists of $k$ matches, one from each
match set. Each combination represents a potential embedding of the query subtree rooted at $q_i$
in the data subtree rooted at $d$. To choose the proper combinations out of the set of potential
ones, the algorithm checks for each combination $S$ whether all data nodes of the matches in $S$ are
descendants of $d$ and whether the matches in $S$ are not blocking. We first formalize the terms
blocking matches and proper combination. Afterwards we explain how the algorithm incrementally
Input:  $Q$: a list of query nodes sorted in postorder
        $D$: a set of data nodes
Output: The sorted list of minimal matches
 1: for all $q_i \in Q$ do    // Iterate through the query nodes in postorder
 2:   $M_{q_i} := \emptyset$    // The match set belonging to $q_i$
 3:   for all $d \in D$ such that $\mathrm{label}(d) = \mathrm{label}(q_i)$ do
 4:     $\mathcal{S} := \{\emptyset\}$    // The set of proper combinations
 5:     for all $q_j \in \mathrm{children}(q_i)$ do
 6:       $\mathcal{S}' := \{\emptyset\}$
 7:       for all $m \in M_{q_j}$ such that $d$ is an ancestor of $d_m$ do
 8:         for all $S \in \mathcal{S}$ do    // At least, $\emptyset$ will be selected
 9:           if $d_m$ is no ancestor or descendant of any $d_{m'}$ from $m' \in S$ then
10:             $\mathcal{S}' := \mathcal{S}' \cup \{S \cup \{m\}\}$    // Append $\mathcal{S}'$ by a new partial combination
11:           end if
12:         end for
13:       end for
14:       $\mathcal{S} := \mathcal{S}'$
15:     end for
16:     if $\mathcal{S} \neq \{\emptyset\}$ then    // There is at least one proper combination
17:       $M_{q_i} := M_{q_i} \cup \{\mathrm{select\_minimal\_match}(d, \mathcal{S})\}$
18:     end if
19:   end for
20: end for
21: output the match set of the query root sorted by increasing cost
Definition 5 (Blocking matches) Let $m_i$ and $m_j$ be two matches with the data nodes $d_{m_i}$ and
$d_{m_j}$, respectively. Let $D_i, D_j \subseteq D$ be the node sets of the subtrees rooted at $d_{m_i}$ and $d_{m_j}$,
respectively. Two matches $m_i, m_j$ are blocking if the subtrees they refer to share common nodes,
i.e., $D_i \cap D_j \neq \emptyset$.
As an example, see Figure 4. The query subtrees rooted at the chapter and the author
node, respectively, both have two matches. In particular, they both have one match in the middle
part of the data tree. But the data subtree rooted at the chapter node contains the subtree rooted
at the author node. Therefore, those matches are blocking.
The blocking of two subtrees can be determined efficiently: Two subtrees are blocking if their root
nodes are identical or if the root node of one subtree is an ancestor of the root node of the other (line 9).
Definition 6 (Proper combination) Let $q_i$ be a query node and $d$ be a data node to which $q_i$
is mapped. Let $M_{q_j}$ be the match set belonging to $q_j \in \mathrm{children}(q_i)$. Each tuple $S$ of the cartesian
product $\times_{q_j \in \mathrm{children}(q_i)} M_{q_j}$ is called a proper combination if (1) the data node $d_m$ of each match
$m \in S$ is a descendant of $d$ and (2) each pair of matches $m_i, m_j \in S$ is non-blocking.
Our algorithm incrementally constructs the proper combinations for a query node $q_i$ and a data
node $d$. First, the set of proper combinations contains only the empty combination (line 4). The
temporary set of combinations $\mathcal{S}'$ just collects all matches attached to the first child node of $q_i$
(lines 7 to 13). Then, $\mathcal{S}$ is set to $\mathcal{S}'$ (line 14). The second loop combines all matches of the second
child with the matches of the first child now stored in $\mathcal{S}$. That is, combinations of length two are
constructed. The loop beginning at line 5 finishes if all $k$ child nodes of $q_i$ are processed, i.e., if
proper combinations of length $k$ are constructed. Each proper combination $S$ represents an
embedding of the query subtree rooted at $q_i$
each combination $S$ has to be calculated. In Section 5.3 we introduce an algorithm that efficiently
calculates the embedding costs in order to select the minimal match. In the following subsection
we show how the ancestor-descendant relationship of two data nodes can be tested efficiently. This
crucial test is needed both in Algorithm 1 and for the calculation of the embedding cost.
5.2 Testing the Ancestor-Descendant Relationship
In a tree, the test whether a node is an ancestor of another node can be done in constant time
after a linear time preprocessing of the tree [Hav97]. This method uses a preorder numeration
of the nodes. If a node $d_j$ is a descendant of a node $d_i$ then $\mathrm{number}(d_i) < \mathrm{number}(d_j)$. If the
comparison is true, then $d_j$ is either a descendant of $d_i$ or $d_j$ is in a subtree right of $d_i$. The latter
case can be excluded if the number of the right-most leaf node of $d_i$ is known. We explicitly store
this $\log(|D|)$ number for each node $d_i$ and access it with the function $\mathrm{bound}(d_i)$. Now the test
whether a node $d_i$ is an ancestor of a node $d_j$ can be performed as follows:
$$d_i \text{ is an ancestor of } d_j \iff \mathrm{number}(d_i) < \mathrm{number}(d_j) \text{ and } \mathrm{bound}(d_i) \geq \mathrm{number}(d_j).$$
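A minimal sketch of this test, assuming the tree is given as a hypothetical dictionary that maps each node to its list of children: a single preorder traversal assigns number and bound, after which the ancestor test is a pair of comparisons. The recursive traversal is chosen for brevity; an explicit stack would be preferable for very deep trees.

```python
def number_tree(children, root):
    """Preorder-number a tree given as {node: [child, ...]}.
    Returns two dicts: number[n] and bound[n] (number of n's right-most leaf)."""
    number, bound = {}, {}

    def visit(n):
        number[n] = len(number)
        for c in children.get(n, []):
            visit(c)
        bound[n] = len(number) - 1    # largest number assigned inside n's subtree

    visit(root)
    return number, bound

def is_ancestor(number, bound, di, dj):
    """Constant-time test: di is a proper ancestor of dj."""
    return number[di] < number[dj] <= bound[di]

children = {"book": ["title", "chapter"], "chapter": ["author"], "author": ["name"]}
number, bound = number_tree(children, "book")
print(number, bound)
print(is_ancestor(number, bound, "book", "name"),
      is_ancestor(number, bound, "title", "chapter"))
```

Here title precedes chapter in preorder, so the first comparison alone would wrongly accept the pair; the bound of title excludes it.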
5.3 Calculating the Embedding Cost
To calculate the embedding cost for a given data node $d$ and a proper combination $S$ two steps
are necessary. First, we sum up the cost attached to the matches in $S$. Second, for each data node
$d_m$ belonging to a match $m \in S$ we add the cost of the path between $d$ and $d_m$. This is performed
by traversing the parent links starting at $d_m$ and ending at $d$. During traversal we add the delete
cost $\mathrm{cost}(d_i)$ of each visited node $d_i$. Note that the paths may share common nodes, but for
each node, the delete cost is included only once.
The preorder numeration together with the right-most leaf node numbers can be used to efficiently
calculate the embedding cost. In Algorithm 1 the embedding cost of a query node $q_i$ and a data
node $d$ is calculated using a set of proper combinations $\mathcal{S}$ belonging to the child nodes of $q_i$
(line 17). In the following we assume that each $S \in \mathcal{S}$ is a list sorted ascendingly by the preorder
node numbers. The operator shift($S$) retrieves the first component (i.e., match) of $S$ and deletes
this component from $S$.
$S$ has three important properties that follow from its construction:
1. $d$ is an ancestor of all $d_m$, $m \in S$,
2. no node $d_{m_i}$ is an ancestor or descendant of another node $d_{m_j}$, and therefore
3. if $\mathrm{number}(d_{m_i}) < \mathrm{number}(d_{m_j})$ then $d_{m_j}$ is in a subtree to the right of $d_{m_i}$.
These properties lead to a simple algorithm for calculating the embedding cost (see Procedure 1).
It calculates the embedding cost starting at the left-most data node $d_m$ being part of a match
$m \in S$. Because of the preorder numeration this is the node with the smallest number among all
data nodes belonging to matches in $S$. The algorithm traverses the path starting at $d_m$ to the
root node of the embedding $d_r$, setting $d_i$ to each inspected node (line 2). The delete cost $\mathrm{cost}(d_i)$
is summed up while traversing the path (line 3). Whenever another matching data node $d_m$ is
a descendant of $d_i$ then the procedure calculate_cost is called recursively using $d_i$ as root and
$d_m$ as the new left-most node (line 5). After the path costs for all matching nodes in the subtree
rooted at $d_i$ have been added the algorithm proceeds with the parent of $d_i$ (line 8). Note that for
each $m \in S$ the path between $d_m$
Procedure calculate_cost($d_r$, $d_i$, $C$, $S$)
Input:  $d_r$: the root node of the subtree for which the cost is calculated
        $d_i$: the data node currently inspected
        $C$: the embedding cost calculated so far
        $S$: a proper combination (in-out parameter)
Return: The embedding cost
 1: $m := \mathrm{shift}(S)$
 2: while $\mathrm{number}(d_i) \geq \mathrm{number}(d_r)$ and ($m = \mathrm{nil}$ or $\mathrm{bound}(d_i) < \mathrm{number}(d_m)$) do
 3:   $C := C + \mathrm{cost}(d_i)$
 4:   while $m \neq \mathrm{nil}$ and $\mathrm{bound}(d_i) \geq \mathrm{number}(d_m)$ do
 5:     $C := C + \mathrm{calculate\_cost}(d_i, d_m, C, S) - \mathrm{cost}(d_i)$
 6:     $m := \mathrm{shift}(S)$
 7:   end while
 8:   $d_i := \mathrm{parent}(d_i)$
 9: end while
10: return $C$
is bound by $O(kh)$ where $k$ is the maximal number of children of any query node and $h$ is the
height of the data tree.
The only step left is an algorithm that selects the minimal embedding and returns the
accompanying minimal match to Algorithm 1. This is performed by the simple Procedure 2 that uses
Procedure 1 as subroutine.
Procedure 2 Selecting the minimal match.
Procedure select_minimal_match($d$, $\mathcal{S}$)
Input:  $d$: a data node that is the root of the subtree embeddings
        $\mathcal{S}$: a set of proper combinations
Return: The minimal match
1: $C := \infty$
2: for $S \in \mathcal{S}$ do
3:   $m := \mathrm{shift}(S)$
4:   $C := C_m + \min(C, \mathrm{calculate\_cost}(d, d_m, 0, S))$
5: end for
6: return $(d, C)$
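The pieces described in this section can be put together in a compact sketch. It follows the bottom-up scheme of Algorithm 1 but is not a transcription of the procedures above: the Node class, the label index and all function names are assumptions, every node has delete cost 1 by default, and the embedding cost is computed by plain parent-link traversal with a visited set rather than by the recursive Procedure 1.

```python
class Node:
    def __init__(self, label, *children, cost=1):
        self.label, self.children, self.cost = label, list(children), cost
        self.parent = None
        for c in self.children:
            c.parent = self

def preorder(root):
    yield root
    for c in root.children:
        yield from preorder(c)

def postorder(root):
    for c in root.children:
        yield from postorder(c)
    yield root

def number_tree(root):
    """Assign preorder numbers and the number of each node's right-most leaf."""
    for i, n in enumerate(preorder(root)):
        n.number = i
    for n in postorder(root):
        n.bound = n.children[-1].bound if n.children else n.number

def is_ancestor(a, d):
    return a.number < d.number <= a.bound

def blocking(a, b):
    return a is b or is_ancestor(a, b) or is_ancestor(b, a)

def embedding_cost(d, combination):
    """Costs of the child matches plus delete costs of skipped nodes, each once."""
    total, seen = sum(c for _, c in combination), set()
    for dm, _ in combination:
        n = dm.parent
        while n is not d:
            if n.number not in seen:
                seen.add(n.number)
                total += n.cost
            n = n.parent
    return total

def approximate_matches(query_root, data_root):
    number_tree(data_root)
    by_label = {}                                  # simple label index on the data
    for n in preorder(data_root):
        by_label.setdefault(n.label, []).append(n)
    match_sets = {}                                # id(query node) -> [(d_m, C_m)]
    for q in postorder(query_root):                # bottom-up over the query tree
        match_sets[id(q)] = []
        for d in by_label.get(q.label, []):
            combos = [[]]                          # partial combinations so far
            for qc in q.children:
                combos = [s + [(dm, cm)]
                          for dm, cm in match_sets[id(qc)] if is_ancestor(d, dm)
                          for s in combos
                          if not any(blocking(dm, dn) for dn, _ in s)]
            if combos:                             # keep only the minimal match
                match_sets[id(q)].append(
                    (d, min(embedding_cost(d, s) for s in combos)))
    return sorted(match_sets[id(query_root)], key=lambda m: m[1])

data = Node("lib",
            Node("book", Node("title", Node("XML")), Node("author", Node("Bradley"))),
            Node("book", Node("title", Node("XML")),
                 Node("chapter", Node("author", Node("name", Node("Bradley"))))))
query = Node("book", Node("title", Node("XML")), Node("author", Node("Bradley")))
result = approximate_matches(query, data)
print([(d.number, c) for d, c in result])
```

In this example the first book matches exactly (cost 0), while the second book is still returned but ranked lower, because the chapter and name nodes have to be skipped (cost 2).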
5.4 Runtime Complexity
The tree inclusion problem, which is contained in our approximate tree embedding problem, is
NP-complete. However, real world instances often have favorable properties. Our algorithm is
especially designed for such instances and in fact solves them in polynomial or even sublinear
time, yet its worst-case runtime complexity remains exponential. First, we analyze the worst-case
complexity exactly and then we discuss why the algorithm is nevertheless fast for queries and data
used in practice.
The outer loop of Algorithm 1 iterates over the $|Q|$ query nodes (line 1). For each query node $q_i$
there are at most $s \leq |D|$ data nodes that have the same label as $q_i$ (line 3). We assume that
fetching those matches needs $I$ cycles, e.g., $I = \log|D|$. The loop at line 5 iterates over all $k$
children of $q_i$. Assume that we are at the $j$th child ($1 \leq j \leq k$). There are at most $s^{j-1}$
for the current child. This leads to $s^{j-1}(j-1)s = (j-1)s^j$ comparisons. Note that the set of
combinations $\mathcal{S}$ may grow exponentially. Thus, at the end of the iteration over all child nodes of
$q_i$, we needed at most $\sum_{j=2}^{k} (j-1)s^j = O(k^2 s^k)$ loops. The calculation of the embedding cost
of one proper combination (see Procedure 1) can be done in $O(kh)$ where $h$ is the height of the
data tree. Because there are at most $s^k$ proper combinations, the selection of the minimal match
takes at most $O(s^k kh)$ cycles (Procedure 2). Therefore, the overall runtime complexity is
$$O(|Q| \, I \, s \, (k^2 s^k + s^k k h)) = O(|Q| \, I \, s^{k+1} k (k + h)).$$
One should not be scared off by the worst case runtime complexity of the algorithm. First,
no factor of the formula depends directly on the size of the data. The number of data nodes
influences only the time for the index lookup $I$, the selectivity $s$, and the height of the data tree
$h$. Second, the only factor that rises exponentially is $s^k$. This factor represents the number of
proper combinations belonging to a query node and a data node. It rises exponentially only in
pathological cases, e.g., if all query nodes and data nodes carry the same label. For real data,
this factor is typically decreasing during query processing. If the tree structure is non-recursive
(i.e., no node has an ancestor with the same label) then the number of proper combinations is
bounded by $O(K^k)$ where $K$ is the maximal number of children of a data node. If both $k$ and $K$
are bounded by a constant then $K^k$ is just a constant value. First experiments with our prototype
implementation show that the above arguments are true in practice. But further experiments have
to be done to confirm our claim.
6 Related Work
Our work has four related areas: query languages for semistructured data, structural queries in
information retrieval, structure matching in graphs, and distance metrics for trees.
Many query languages for semistructured data and XML in particular have been developed during
the last years [AQM+97, RLS98, DFF+98, CJS99, Moe00]. All languages support operators that
allow the user to skip certain parts of the data. However, as already mentioned, they require that
the user knows which parts have to be skipped. Furthermore, they do not consider how many nodes
have to be left out in order to match the query. Thus, no ranking depending on structural similarity
is supported.
There are also many proposals of query languages in the field of information retrieval. Here, the
endeavor is to integrate content and structure in the query answering process. A good overview
of existing methods can be found in [NBY96]. Again, to the best of our knowledge no approach
measures the structural similarity between the query and the documents.
The problem of tree and graph embedding has also been studied in the theory of graphs and in
formal query answering models. The work most closely related to ours is [Kil92] that introduced
the tree inclusion problem. This approach was also taken up and modified in [Meu98]. Both
approaches do not measure the closeness of the query tree and the data tree. Other tree and
graph matching approaches, e.g., [BF99, NS00] also do not quantify the similarity between the
query and the data.
Several measures for the similarity of trees have been developed [Tai79, JWZ94]. In the ordered
case they are computable in polynomial time. But as we argued in this paper, ordered mappings
are not useful for querying XML data. The problem of finding the minimal edit or alignment
distance between unordered trees is MAX SNP-hard [AG97]. The complexity results suggest that
both measures are not useful for querying trees. Even algorithms for restricted forms of the edit
distance [Zha93, ZWS95] are bound by $O(|Q| |D| K)$ where $K$ depends on the maximal degree
of the nodes. We believe that algorithms that touch every data node are not applicable for large
For future research we conceive two directions: improvement of the algorithm power and execution
time, and improvement of the usefulness for the user.
Goldman and Widom introduced strong DataGuides as a schema for semistructured data [GW97].
Our algorithm can make use of such schemata to greatly accelerate the search for embeddings in
the data tree. Further optimization techniques make use of certain properties of the query and
the data. Also, we plan to relax the tree assumption and allow directed acyclic and even cyclic
data structures.
The current version of our algorithm compensates under-specification of the user query: Missing
nodes in the user query are skipped in the data. An over-specified user query has nodes that do
not appear in the data. We plan to compensate by skipping nodes in the query in the same way as
we do in the data. Further, we plan to investigate methods for learning the delete costs. Currently
each node has the same delete cost regardless of the importance of the node.
References
[AG97] A. Apostolico and Z. Galil, editors. Pattern Matching Algorithms, chapter 14.
Approximate Tree Pattern Matching. Oxford University Press, June 1997.
[AQM+97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query
language for semistructured data. International Journal on Digital Libraries,
1(1):68–88, April 1997.
[BC00] A. Bonifati and S. Ceri. Comparative analysis of five XML query languages.
SIGMOD Record, 29(1), March 2000.
[BF99] A. Bergholz and J. C. Freytag. Querying semistructured data based on schema
matching. In Proceedings of the International Workshop on Database Programming
Languages (DBPL), September 1999.
[CJS99] S. Cluet, S. Jacqmin, and J. Simeon. The new YATL: Design and specifications.
Technical report, INRIA, 1999.
[DFF+98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query
language for XML. W3C Note, August 1998. http://www.w3.org/TR/NOTE-xml-ql.
[FSW99] M. Fernandez, J. Simeon, and P. Wadler. XML and query languages: Experiences
and examples. Unpublished manuscript,
http://www-db.research.bell-labs.com/user/simeon/xquery.ps, September 1999.
[GW97] R. Goldman and J. Widom. DataGuides: Enabling query formulation and
optimization in semistructured data. In Proceedings of the 23rd International
Conference on Very Large Databases (VLDB), pages 436–445, Athens, Greece,
August 1997.
[Hav97] P. Havlak. Nesting of reducible and irreducible loops. ACM Transactions on
Programming Languages and Systems, 19(4):557–567, July 1997.
[JWZ94] T. Jiang, L. Wang, and K. Zhang. Alignment of trees - an alternative to tree edit. In
Combinatorial Pattern Matching, pages 75–86, June 1994.
[Kil92] Pekka Kilpeläinen. Tree Matching Problems with Applications to Structured Text
on Principles of Digital Document Processing (PODDP), 1998.
[Moe00] G. Moerkotte. YAXQL: A powerful and web-aware query language supporting query
reuse. Technical report, University of Mannheim, January 2000.
[NBY96] G. Navarro and R. Baeza-Yates. Integrating content and structure in text retrieval.
SIGMOD Record, 25(1):67–79, March 1996.
[NS00] F. Neven and T. Schwentick. Expressive and efficient pattern languages for
tree-structured data. In Proceedings of the 19th Symposium on Principles of
Database Systems (PODS), May 2000.
[RLS98] J. Robie, J. Lapp, and D. Schach. XML query language XQL, 1998.
http://www.w3.org/TandS/QL/QL98/pp/xql.html.
[RR92] R. Ramesh and L. V. Ramakrishnan. Nonlinear pattern matching in trees. Journal of
the ACM, 39(2):295–316, April 1992.
[Tai79] K.-C. Tai. The tree-to-tree correction problem. Journal of the ACM, 26(3):422–433,
July 1979.
[Zha93] K. Zhang. A new editing based distance between unordered labeled trees. In
Combinatorial Pattern Matching, pages 254–265, June 1993.
[ZWS95] K. Zhang, J. T. L. Wang, and D. Shasha. On the editing distance between undirected
acyclic graphs and related problems. In Combinatorial Pattern Matching, pages