XML Data

Torsten Schlieder

Freie Universität Berlin

schlied@inf.fu-berlin.de

Felix Naumann

Humboldt-Universität zu Berlin

naumann@dbis.informatik.hu-berlin.de

Abstract

Querying heterogeneous collections of data-centric XML documents requires a combination of database languages and concepts used in information retrieval, in particular similarity search and ranking. In this paper we present an approach to find approximate answers to formal user queries. We reduce the problem of answering queries against XML document collections to the well-known unordered tree inclusion problem. We extend this problem to an optimization problem by applying a cost model to the embeddings. Thereby we are able to determine how close parts of the XML document match a user query. We present an efficient algorithm that finds all approximate matches and ranks them according to their similarity to the query.

1 Introduction

XML has gained much attention in both the information retrieval community and in the field of database research. One reason is that XML offers a uniform and standardized way to represent and exchange both documents in the sense of information retrieval and data exported from databases. On the one end of the spectrum of XML usage are text-centric documents with only few, interspersed markups. On the other end of the spectrum are data-centric documents that are solely created and interpreted by some application logic. Such documents have a well-defined structure and are typically stored in homogeneous collections. But XML is also considered as the common format for the integration of a wide range of data. One example are federated digital library catalogues that combine the data from several repositories which typically have differing schemata. Another example are data warehouses storing all company-wide XML-formatted documents such as messages and database reports. Such document collections are data-centric but do not have a common schema, i.e., they are heterogeneous.

To search in text-centric XML documents, techniques adopted from information retrieval are appropriate. For querying data-centric XML documents in homogeneous collections, new query languages have been developed (for an overview, see [FSW99, BC00]). However, for querying heterogeneous collections of data-centric documents neither information retrieval techniques nor query languages are appropriate. Classic information retrieval models like the vector space model completely ignore the structure of the documents. Even if meaningful element and attribute names are included in the creation of descriptors, the semantics of the nesting is lost. Query languages, on the other hand, cannot cope with documents that are only similar to the query. A substructure of a document either matches the query or it does not. The consequence of the binary decision

This research was supported by the German Research Society, Berlin-Brandenburg Graduate School in


the document and that often not all desired results are returned. Note that enriching a query language with regular path expressions does not solve the problem of querying collections with partially known schema, as we will point out in Section 2.

We believe that for heterogeneous collections of data-centric documents the formalism of database languages combined with the concept of uncertainty used in information retrieval will achieve the best results. In this paper we integrate some aspects of the two concepts and present the XML query language approXQL as a syntactic subset of XQL [RLS98]. Unlike XQL or any other XML query language, results of approXQL are not only all strict matches to the query but also results that have additional inner elements not specified by the query. This method returns satisfying results even when the user does not know the exact structure of the collection. The results of approXQL queries are sorted by their similarity to the query. Exact matches are given first, followed by results with only slight deviations, etc. Together with the language we present a query-centric algorithm that efficiently answers approXQL queries.

In Section 2 we show that it is difficult for a user to query a heterogeneous document collection using a traditional XML query language. In Section 3 we introduce the approXQL query language and show how we interpret the queries and the data collection as trees. In Section 4 we review the well-known unordered tree inclusion problem and extend it to the approximate tree embedding optimization problem. In Section 5 we present our algorithm. In Section 6 we survey related work and give an outlook of future improvements and extensions in Section 7.

2 Motivating Example

Assume that we have a collection of documents storing metadata about books. Figure 1 shows a subset of this collection. It contains information about different books the author "Bradley" contributed to.

[Figure 1: four document trees, Doc 1 to Doc 4, built from the element labels book, list, chapter, author, name, and title, with PCDATA leaves such as "Bradley".]

Figure 1: A heterogeneous collection of data-centric documents

Assume further that we need information about all books that have as author "Bradley". We specify an XQL query that exactly models our information need:

/book/author/text() = "Bradley"

Obviously, the only result of the query is document 1, despite the fact that at least the documents 2 and 3 exactly match our information need too. Moreover, the document 4 may also be of interest. Fortunately, all XML query languages support regular path expressions. Even with the simplest kind of regular path expression, the skip operator, we can partly solve our problem:


documents have to be skipped and where they are. Coping with structural heterogeneity requires knowledge of the entire data structure and often leads to complex queries. Furthermore, query processors for XML query languages do not measure how many elements have to be skipped in order to match the query. Therefore, no ranking according to structural closeness is possible and the results are returned in an arbitrary order.

We assume that the user has only partial knowledge of the data structure. The user specifies queries that exactly meet his or her information need. The system retrieves results that exactly fit the query, but also relaxes the matching conditions to retrieve documents that have additional elements not specified in the query. The results are ranked according to the similarity to the query.

3 Data and Query Modeling

In this section we introduce an elementary data model for XML. We use some normalizations to map elements, attributes and PCDATA to a single node type. We further introduce the approXQL query language.

3.1 Modeling XML Data

We model a collection of XML documents as a labeled tree with a single root. Currently, we ignore ID-references and hyperlinks. To simplify our model, we only use a single node type. Each node d of that type has a label, written as label(d). We use the notation parent(d) to refer to the (unique) parent node of d.

We perform the following normalizations: Elements are mapped to nodes using the element name as label. Each attribute is mapped to two nodes, the first one gets the name as label, the second one gets the value of the attribute as label. PCDATA sections are mapped to nodes by treating the whole text as label. Figure 2 shows an example of the normalization.

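[Figure 2 panels: (a) Part of an XML data tree: a book element with the attribute year = 1998 and the child elements title and author, whose PCDATA contents are XML and Bradley. (b) Normalized version: a book node with the children year, title, and author, whose leaf children are 1998, XML, and Bradley.]

Figure 2: Simplification of a data tree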

Note that we do not consider semantic heterogeneity of attribute and element names. We assume that all elements modeling the same real world domain also carry the same label. This can be achieved by providing a function that maps semantically equal names to a common label.
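The normalizations of this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `Node`, the function `normalize`, and the sample document are hypothetical names chosen for illustration, and the sketch assumes that the value node of an attribute becomes the child of its name node. Mixed content after child elements is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """Single node type of the normalized data tree (hypothetical helper)."""
    label: str
    children: list = field(default_factory=list)


def normalize(elem):
    """Map an element, its attributes, and its PCDATA to labeled nodes."""
    node = Node(elem.tag)                      # element name becomes the label
    for name, value in elem.attrib.items():    # attribute: name node with a value child
        node.children.append(Node(name, [Node(value)]))
    text = (elem.text or "").strip()
    if text:                                   # PCDATA: the whole text is the label
        node.children.append(Node(text))
    for child in elem:
        node.children.append(normalize(child))
    return node


doc = '<book year="1998"><title>XML</title><author>Bradley</author></book>'
tree = normalize(ET.fromstring(doc))
print([c.label for c in tree.children])
```

After normalization a query does not need to distinguish whether "1998" was an attribute value or element content, since both end up as leaf labels.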

3.2 The ApproXQL Query Language

We introduce a simple pattern matching language called approXQL that is a syntactic subset of XQL [RLS98]. Informally, approXQL consists of (1) the path operator "/", (2) the filter operator "[]", (3) the text() selector, and (4) the operators "=" and "$and$". The following query requests


Note that the user does not need to know how the year 1999 is modeled in the original documents. Due to our simplified data model, attribute values and element contents are both mapped to the leaf nodes of the data tree. Thus, the query will match both cases. We do not require that query paths end with the text() selector. The following query matches books that have a foreword and at least one editor:

book[foreword $and$ editor]

An approXQL query can be interpreted as a tree (see Figure 3). Each $and$ expression is mapped to an inner node of the tree. The children of an $and$ node are the roots of the paths that are conjunctively connected. We use the notation children(q) to refer to the set of children of q and the notation label(q) to identify the label of q. Note that different query nodes may have the same label. The result of an approXQL query is always a subtree of the data tree. In contrast to XQL we do not allow queries with boolean results.

book[ title/text() = 'XML' $and$
      author[ firstname/text() = 'Neil' $and$
              lastname/text() = 'Bradley' ]]

(a) approXQL query

[Query tree: book has the children title and author; title has the leaf XML; author has the children firstname (leaf Neil) and lastname (leaf Bradley).]

(b) Query tree

Figure 3: Mapping of an approXQL query to a tree.
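The mapping from query text to query tree can be sketched with a small recursive-descent parser. This is only an illustration under assumptions: the grammar below is inferred from the informal description and the two example queries, the names `QNode`, `tokenize`, `parse_path`, `parse`, and `to_tuple` are hypothetical, and a leading "/" as in the XQL query of Section 2 is not handled.

```python
import re
from dataclasses import dataclass, field


@dataclass
class QNode:
    """Query tree node (hypothetical helper)."""
    label: str
    children: list = field(default_factory=list)


TOKEN = re.compile(r"""\s*(\$and\$|text\(\)|[A-Za-z_][\w.-]*|'[^']*'|"[^"]*"|[\[\]/=])""")


def tokenize(query):
    tokens, pos = [], 0
    query = query.strip()
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_path(tokens):
    """path := name ['[' path ('$and$' path)* ']'] ['/' (path | 'text()' '=' string)]"""
    node = QNode(tokens.pop(0))
    if tokens and tokens[0] == "[":            # filter: conjunctively connected paths
        tokens.pop(0)
        node.children.append(parse_path(tokens))
        while tokens and tokens[0] == "$and$":
            tokens.pop(0)
            node.children.append(parse_path(tokens))
        if not tokens or tokens.pop(0) != "]":
            raise ValueError("missing ']'")
    if tokens and tokens[0] == "/":            # path operator
        tokens.pop(0)
        if tokens and tokens[0] == "text()":
            tokens.pop(0)
            if not tokens or tokens.pop(0) != "=":
                raise ValueError("expected '=' after text()")
            node.children.append(QNode(tokens.pop(0)[1:-1]))  # strip the quotes
        else:
            node.children.append(parse_path(tokens))
    return node


def parse(query):
    tokens = tokenize(query)
    root = parse_path(tokens)
    if tokens:
        raise ValueError(f"unexpected token {tokens[0]!r}")
    return root


def to_tuple(node):
    return (node.label, [to_tuple(c) for c in node.children])


q = parse("book[ title/text() = 'XML' $and$ author[ firstname/text() = 'Neil' "
          "$and$ lastname/text() = 'Bradley' ]]")
print(to_tuple(q))
```

In this sketch the node carrying a filter plays the role of the $and$ node: its children are the roots of the conjunctively connected paths.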

4 The Tree Embedding Problem

We have shown how to interpret both the data and the query as trees. With this interpretation, the problem of answering a query can be mapped to the problem of embedding a query tree in the data tree.

The exact ordered tree embedding problem has been extensively studied (see [RR92] for an overview) and is solvable in polynomial time. However, we are not interested in an ordered embedding, because ordering in the data may be inconsistent, forcing us to ignore it. Even if it were consistent, the order might not be known to the user. Also, we are not interested in an exact embedding, as we will see in the following sections. There, we first review the unordered tree inclusion problem. Then, to answer user queries in a meaningful way, we extend it to an optimization problem.

4.1 The Unordered Tree Inclusion Problem

Our goal is to approximately embed the query tree into the data tree such that the labels and the ancestorship of the nodes are preserved. I.e., we allow the data tree to have nodes in between those found in the query tree, but their predecessor-successor relationship must still hold in the tree. Figure 4 shows an allowable embedding. The query tree on the left is not perfectly embedded

[Figure 4 panels: (a) Query tree; (b) Part of an XML data tree. Node labels appearing in the figure: book, chapter, section, title, author, name, XML, Bradley.]

Figure 4: Unordered inclusion of a query tree in a data tree.

To find all such embeddings, Kilpeläinen introduced the so-called unordered tree inclusion problem [Kil92]. The author proved that even the decision problem, i.e., the problem of deciding whether one embedding exists, is NP-complete. We follow [Kil92] in describing the problem.

Definition 1 (Embedding) A function $f$ from the set $Q$ of query nodes into the set $D$ of data nodes is called an embedding if for all $q_i, q_j \in Q$

1. $f(q_i) = f(q_j) \Leftrightarrow q_i = q_j$,
2. $\mathrm{label}(q_i) = \mathrm{label}(f(q_i))$,
3. $q_i$ is the parent of $q_j$ $\Leftrightarrow$ $f(q_i)$ is an ancestor of $f(q_j)$.

We call the data node that is mapped to the query root the embedding root. Note the difference between parent and ancestor in condition 3. Between a node and its parent there exists an edge whereas between a node and its ancestor there exists merely a path.
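The three conditions can be checked mechanically for a given candidate mapping. The following Python sketch is an illustration only: trees are represented by hypothetical dictionaries (`q_parent`, `d_parent` for parent links, `q_label`, `d_label` for labels), and for condition 3 it verifies only the direction needed when testing a candidate, namely that the image of a parent is an ancestor of the image of its child.

```python
def is_ancestor(parent, a, d):
    """True if a is a proper ancestor of d, following parent links."""
    d = parent.get(d)
    while d is not None:
        if d == a:
            return True
        d = parent.get(d)
    return False


def is_embedding(f, q_parent, q_label, d_parent, d_label):
    """Check a candidate mapping f (query node -> data node)."""
    if len(set(f.values())) != len(f):          # condition 1: f is injective
        return False
    for q, d in f.items():
        if q_label[q] != d_label[d]:            # condition 2: labels are preserved
            return False
        p = q_parent.get(q)
        if p is not None and not is_ancestor(d_parent, f[p], d):
            return False                        # condition 3: parent maps to an ancestor
    return True


# Query: book / title / "XML".  Data: book / chapter / title / "XML".
q_parent = {"q_book": None, "q_title": "q_book", "q_xml": "q_title"}
q_label = {"q_book": "book", "q_title": "title", "q_xml": "XML"}
d_parent = {1: None, 2: 1, 3: 2, 4: 3}
d_label = {1: "book", 2: "chapter", 3: "title", 4: "XML"}

good = {"q_book": 1, "q_title": 3, "q_xml": 4}   # skips the chapter node
bad = {"q_book": 2, "q_title": 3, "q_xml": 4}    # chapter is not labeled book
print(is_embedding(good, q_parent, q_label, d_parent, d_label))
print(is_embedding(bad, q_parent, q_label, d_parent, d_label))
```

The mapping `good` is accepted although the chapter node lies between book and title in the data, which is exactly the relaxation from parent to ancestor that condition 3 permits.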

In [Kil92] the author also studied the problem of locating all subtrees (represented by their root nodes) of the data tree that include the query tree minimally. However, in the framework of the author the term "minimal" simply means that only those matching subtrees are retrieved that do not include another matching subtree. In the following subsection we introduce a different interpretation of a minimal embedding.

4.2 The Approximate Tree Embedding Problem

The tree inclusion approach is in itself only of limited value to the user because the solutions are not ranked: The embedding roots are retrieved in arbitrary order independent of how many nodes have to be skipped in order to perfectly embed the query. We want not only answers that perfectly match the user query, but also answers that match as well as possible. We solve this problem by measuring the quality of an embedding. We use a cost function over the data nodes that must be skipped to embed the query perfectly.

We use the cost model in two ways: First, among all embeddings that have the same embedding root we only select the cost-minimal embedding. Thus, we avoid situations like that of Figure 4 where several embeddings refer to the same book. Second, we rank the cost-minimal embeddings according to the cost. Embeddings where none or only a small number of nodes are skipped have a low cost and are ranked high. To this end, we assign a deletion cost cost(d) to each data node.

Definition 2 (Embedding cost) Let $f$ be an embedding of $Q$ in $D$. Let $f(Q)$ be the image of $f$ and $P$ be the set of data nodes along all paths starting and ending at any two nodes $d_i, d_j \in f(Q)$. Then the embedding cost of $Q$ in $D$ is

$$C := \sum_{d \in P \setminus f(Q)} \mathrm{cost}(d).$$

Definition 3 (Match) Let $f$ be an embedding according to Definition 1. Let $d$ be the embedding root. We call the node $d$ together with its embedding cost a match: $m = (d, C)$.

We use the notation $d_m$ and $C_m$ to refer to the components of $m$. Note that we also use the term match for embeddings of query subtrees.
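As a worked illustration of the embedding cost and of a match, the following Python sketch sums the deletion costs of the skipped data nodes for one given embedding. It is a sketch under assumptions: `embedding_match`, `d_parent`, and `cost` are hypothetical names, and the set $P$ is computed as the union of the upward paths from each image node to the embedding root, which in a tree covers all paths between any two image nodes.

```python
def embedding_match(f, q_root, d_parent, cost):
    """Return the match (embedding root, embedding cost) for an embedding f."""
    image = set(f.values())
    root = f[q_root]
    on_paths = set()
    for d in image:                 # walk from every image node up to the root
        while d != root:
            on_paths.add(d)
            d = d_parent[d]
    skipped = on_paths - image      # nodes on the paths that f does not hit
    return root, sum(cost[d] for d in skipped)


# Data: book(1) > chapter(2) > title(3) > "XML"(4); every node has delete cost 1.
d_parent = {1: None, 2: 1, 3: 2, 4: 3}
cost = {1: 1, 2: 1, 3: 1, 4: 1}
f = {"book": 1, "title": 3, "XML": 4}

match = embedding_match(f, "book", d_parent, cost)
print(match)
```

Only the chapter node is skipped here, so the match consists of data node 1 and cost 1; an embedding that skips nothing has cost 0.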

Definition 4 (Minimal match) Let $F$ be the set of embeddings from a query tree into a data tree and let $M$ be the set of matches belonging to $F$. We partition $M$ into groups of matches that share the same root node. For each group, the match with the minimal cost is called minimal match.
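The grouping step can be sketched in a few lines of Python. The function name `minimal_matches` and the sample data are hypothetical; matches are represented as (root, cost) pairs, and the result is additionally sorted by cost to reflect the ranking described in Section 4.2.

```python
def minimal_matches(matches):
    """Keep the cheapest match per root node and rank the result by cost."""
    best = {}
    for root, cost in matches:
        if root not in best or cost < best[root]:
            best[root] = cost
    return sorted(best.items(), key=lambda m: m[1])


matches = [("book_a", 2), ("book_b", 0), ("book_a", 1)]
print(minimal_matches(matches))
```

Several embeddings that refer to the same book thus collapse into one result entry.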

The following section describes an algorithm that finds the cost-ranked set of minimal matches to an approXQL query.

5 An Algorithm for the Approximate Tree Embedding Problem

The complexity result of [Kil92] suggests that there is no polynomial algorithm solving the approximate tree embedding problem. We introduce a new algorithm for the approximate tree embedding problem that is still exponential in the worst case. But in typical cases our algorithm only takes sublinear time with respect to the database size. First, we sketch the basic idea of our approach and then we discuss the details of the algorithm.

Our algorithm is based on dynamic programming. Its most important property is the query-centric execution. For each query node only those data nodes are fetched that are members of a potential embedding of the query tree in the data tree. The query tree is processed bottom up. For each query node $q$ the embeddings of the query tree rooted at $q$ are combined from the embeddings of the query trees rooted at the child nodes of $q$. Each embedding is represented by a match. The matches belonging to a certain query node are grouped by their data nodes. From each group only the minimal match is selected. The set of minimal matches belonging to the query root are the results of the algorithm. They are sorted by increasing cost and returned to the user.

5.1 Constructing the Embeddings

We use a postorder numeration of the nodes in the query tree. The main loop of the algorithm visits the query nodes in ascending order (see Algorithm 1, line 1). The postorder ensures that for each child node of the current query node, the minimal matches have already been calculated and stored in the match set belonging to that child node. Recall from Section 4 that a minimal match $m$ is a pair consisting of the embedding cost $C_m$ and a data node $d_m$ that is the root of the minimal embedding of a query subtree.

Assume that $q_i$ is the current query node. Now, the algorithm fetches all data nodes that have the same label as $q_i$ (line 3). For each matching data node $d$, the algorithm tries to embed the query subtree rooted at $q_i$ in the data subtree rooted at $d$. To construct those embeddings the algorithm builds all combinations of matches where each combination consists of $k$ matches, one from each match set. Each combination represents a potential embedding of the query subtree rooted at $q_i$ in the data subtree rooted at $d$. To choose the proper combinations out of the set of potential ones, the algorithm checks for each combination $S$ whether all data nodes of the matches in $S$ are descendants of $d$ and whether the matches in $S$ are not blocking. We first formalize the terms blocking matches and proper combination. Afterwards we explain how the algorithm incrementally

Input: $Q$: a list of query nodes sorted in postorder
       $D$: a set of data nodes
Output: The sorted list of minimal matches

 1: for all $q_i \in Q$ do   // Iterate through the query nodes in postorder
 2:   $M_{q_i} := \emptyset$   // The match set belonging to $q_i$
 3:   for all $d \in D$ such that $\mathrm{label}(d) = \mathrm{label}(q_i)$ do
 4:     $\mathcal{S} := \{\emptyset\}$   // The set of proper combinations
 5:     for all $q_j \in \mathrm{children}(q_i)$ do
 6:       $\mathcal{S}' := \{\emptyset\}$
 7:       for all $m \in M_{q_j}$ such that $d$ is an ancestor of $d_m$ do
 8:         for all $S \in \mathcal{S}$ do   // At least, $\emptyset$ will be selected
 9:           if $d_m$ is no ancestor or descendant of any $d_{m'}$ from $m' \in S$ then
10:             $\mathcal{S}' := \mathcal{S}' \cup \{S \cup \{m\}\}$   // Append $\mathcal{S}'$ by a new partial combination
11:           end if
12:         end for
13:       end for
14:       $\mathcal{S} := \mathcal{S}'$
15:     end for
16:     if $\mathcal{S} \neq \{\emptyset\}$ then   // There is at least one proper combination
17:       $M_{q_i} := M_{q_i} \cup \{\mathrm{select\_minimal\_match}(d, \mathcal{S})\}$
18:     end if
19:   end for
20: end for
21: output The match set of the query root sorted by increasing cost

Definition 5 (Blocking matches) Let $m_i$ and $m_j$ be two matches with the data nodes $d_{m_i}$ and $d_{m_j}$ respectively. Let $D_i, D_j \subseteq D$ be the node sets of the subtrees rooted at $d_{m_i}$ and $d_{m_j}$ respectively. Two matches $m_i, m_j$ are blocking if the subtrees they refer to share common nodes, i.e., $D_i \cap D_j \neq \emptyset$.

As an example, see Figure 4. The query subtrees rooted at the chapter and the author node respectively both have two matches. In particular, they both have one match in the middle part of the data tree. But the data subtree rooted at the chapter node contains the subtree rooted at the author node. Therefore, those matches are blocking.

The blocking of two subtrees can be determined efficiently: Two subtrees are blocking if their root nodes are identical or if the root node of one subtree is an ancestor of the other (line 9).

Definition 6 (Proper combination) Let $q_i$ be a query node and $d$ be a data node to which $q_i$ is mapped. Let $M_{q_j}$ be the match set belonging to $q_j \in \mathrm{children}(q_i)$. Each tuple $S$ of the cartesian product $\times_{q_j \in \mathrm{children}(q_i)} M_{q_j}$ is called a proper combination if (1) the data node $d_m$ of each match $m \in S$ is a descendant of $d$ and (2) each pair of matches $m_i, m_j \in S$ is non-blocking.

Our algorithm incrementally constructs the proper combinations for a query node $q_i$ and a data node $d$. First, the set of proper combinations is empty (line 4). The temporary set of combinations $\mathcal{S}'$ just collects all matches attached to the first child node of $q_i$ (lines 7-13). Then, $\mathcal{S}$ is set to $\mathcal{S}'$ (line 14). The second loop combines all matches of the second child with the matches of the first child now stored in $\mathcal{S}$. That is, combinations of length two are constructed. The loop beginning at line 5 finishes if all $k$ child nodes of $q_i$ are processed, i.e., if proper combinations of length $k$ are constructed. Each proper combination $S$ represents an embedding of the query subtree rooted at $q_i$

each combination $S$ has to be calculated. In Section 5.3 we introduce an algorithm that efficiently calculates the embedding costs in order to select the minimal match. In the following subsection we show how the ancestor-descendant relationship of two data nodes can be tested efficiently. This crucial test is both needed in Algorithm 1 and for the calculation of the embedding cost.
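The bottom-up construction can be illustrated with a compact Python sketch. It is not the paper's Algorithm 1: it enumerates the cartesian product of the child match sets directly instead of building the combinations incrementally, it scans all data nodes instead of using a label index, and it tests ancestorship by walking parent links. All names (`DNode`, `build`, `match_set`, `approx_query`, and the sample data) are hypothetical, and every node is assumed to have delete cost 1.

```python
from itertools import product


class DNode:
    """Data node with a label, a parent link, and a deletion cost."""
    def __init__(self, label, parent=None, cost=1):
        self.label, self.parent, self.cost = label, parent, cost


def build(spec, parent=None, out=None):
    """Create DNodes from a (label, [child specs]) tuple; return them in preorder."""
    out = [] if out is None else out
    node = DNode(spec[0], parent)
    out.append(node)
    for child in spec[1]:
        build(child, node, out)
    return out


def is_ancestor(a, d):
    """True if a is a proper ancestor of d (naive parent-link walk)."""
    d = d.parent
    while d is not None:
        if d is a:
            return True
        d = d.parent
    return False


def blocking(a, b):
    """The subtrees rooted at a and b share nodes."""
    return a is b or is_ancestor(a, b) or is_ancestor(b, a)


def path_cost(d, roots):
    """Cost of the nodes strictly between d and each root, each counted once."""
    skipped = set()
    for r in roots:
        n = r.parent
        while n is not d:
            skipped.add(n)
            n = n.parent
    return sum(n.cost for n in skipped)


def match_set(q, data):
    """Minimal matches {data node: cost} for the query subtree q = (label, [children])."""
    label, children = q
    child_sets = [match_set(c, data) for c in children]    # children first (postorder)
    result = {}
    for d in (n for n in data if n.label == label):
        below = [[(dm, c) for dm, c in ms.items() if is_ancestor(d, dm)]
                 for ms in child_sets]
        costs = []
        for combo in product(*below):                      # one match per child
            roots = [dm for dm, _ in combo]
            if any(blocking(a, b) for i, a in enumerate(roots) for b in roots[i + 1:]):
                continue                                   # not a proper combination
            costs.append(sum(c for _, c in combo) + path_cost(d, roots))
        if costs:
            result[d] = min(costs)                         # keep the minimal match only
    return result


def approx_query(q, data):
    """Minimal matches of the query root, sorted by increasing cost."""
    return sorted(match_set(q, data).items(), key=lambda m: m[1])


data = build(("list", [
    ("book", [("title", [("XML", [])]), ("author", [("Bradley", [])])]),
    ("book", [("chapter", [("title", [("XML", [])])]),
              ("author", [("name", [("Bradley", [])])])]),
]))
query = ("book", [("title", [("XML", [])]), ("author", [("Bradley", [])])])
ranked = approx_query(query, data)
print([(d.label, c) for d, c in ranked])
```

In this example the first book matches exactly (cost 0), while the second book is still returned with cost 2 because a chapter node and a name node have to be skipped.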

5.2 Testing the Ancestor-Descendant Relationship

In a tree, the test whether a node is an ancestor of another node can be done in constant time after a linear time preprocessing of the tree [Hav97]. This method uses a preorder numeration of the nodes. If a node $d_j$ is a descendant of a node $d_i$ then $\mathrm{number}(d_i) < \mathrm{number}(d_j)$. If the comparison is true, then $d_j$ is either a descendant of $d_i$ or $d_j$ is in a subtree right of $d_i$. The latter case can be excluded if the number of the right-most leaf node of $d_i$ is known. We explicitly store this $\log(|D|)$ number for each node $d_i$ and access it with the function $\mathrm{bound}(d_i)$. Now the test whether a node $d_i$ is an ancestor of a node $d_j$ can be performed as follows:

$$d_i \text{ is an ancestor of } d_j \iff \mathrm{number}(d_i) < \mathrm{number}(d_j) \text{ and } \mathrm{bound}(d_i) \geq \mathrm{number}(d_j).$$
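A minimal Python sketch of this preprocessing and of the constant-time test follows. The names `number_tree`, `is_ancestor`, and `blocking`, as well as the dictionary-based tree representation, are assumptions made for illustration; the recursive traversal is kept for brevity and would need an explicit stack for very deep trees.

```python
def number_tree(children, root):
    """Assign preorder numbers and bounds (largest preorder number in each subtree)."""
    number, bound = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        number[node] = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        bound[node] = counter - 1   # number of the right-most leaf below node

    visit(root)
    return number, bound


def is_ancestor(number, bound, a, d):
    """Constant-time test: a is a proper ancestor of d."""
    return number[a] < number[d] <= bound[a]


def blocking(number, bound, a, b):
    """Subtrees block each other if the roots are identical or ancestor-related."""
    return a == b or is_ancestor(number, bound, a, b) or is_ancestor(number, bound, b, a)


children = {"book": ["title", "chapter"], "chapter": ["author"]}
number, bound = number_tree(children, "book")
print(number, bound)
print(is_ancestor(number, bound, "chapter", "author"))
```

The same two numbers per node also decide the blocking test of Section 5.1, since two subtrees share nodes exactly when their roots are identical or one root is an ancestor of the other.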

5.3 Calculating the Embedding Cost

To calculate the embedding cost for a given data node $d$ and a proper combination $S$ two steps are necessary. First, we sum up the cost attached to the matches in $S$. Second, for each data node $d_m$ belonging to a match $m \in S$ we add the cost of the path between $d$ and $d_m$. This is performed by traversing the parent links starting at $d_m$ and ending at $d$. During traversal we add the delete cost $\mathrm{cost}(d_i)$ of each visited node $d_i$. Note that the paths may share common nodes, but for each node, the delete cost is included only once.

The preorder numeration together with the right-most leaf node numbers can be used to efficiently calculate the embedding cost. In Algorithm 1 the embedding cost of a query node $q_i$ and a data node $d$ is calculated using a set of proper combinations $\mathcal{S}$ belonging to the child nodes of $q_i$ (line 17). In the following we assume that each $S \in \mathcal{S}$ is a list sorted ascendingly by the preorder node numbers. The operator shift($S$) retrieves the first component (i.e., match) of $S$ and deletes this component from $S$.

$S$ has three important properties that follow from its construction:

1. $d$ is an ancestor of all $d_m$, $m \in S$,
2. no node $d_{m_i}$ is an ancestor or descendant of another node $d_{m_j}$, and therefore
3. if $\mathrm{number}(d_{m_i}) < \mathrm{number}(d_{m_j})$ then $d_{m_j}$ is in a subtree to the right of $d_{m_i}$.

These properties lead to a simple algorithm for calculating the embedding cost (see Procedure 1). It calculates the embedding cost starting at the left-most data node $d_m$ being part of a match $m \in S$. Because of the preorder numeration this is the node with the smallest number among all data nodes belonging to matches in $S$. The algorithm traverses the path starting at $d_m$ to the root node of the embedding $d_r$, setting $d_i$ to each inspected node (line 2). The delete cost $\mathrm{cost}(d_i)$ is summed up while traversing the path (line 3). Whenever another matching data node $d_m$ is a descendant of $d_i$ then the procedure calculate_cost is called recursively using $d_i$ as root and $d_m$ as the new left-most node (line 5). After the path costs for all matching nodes in the subtree rooted at $d_i$ have been added the algorithm proceeds with the parent of $d_i$ (line 8). Note that for each $m \in S$ the path between $d_m$

Procedure calculate_cost($d_r$, $d_i$, $C$, $S$)
Input: $d_r$: the root node of the subtree for which the cost is calculated
       $d_i$: the data node currently inspected
       $C$: the embedding cost calculated so far
       $S$: a proper combination (in-out parameter)
Return: The embedding cost

 1: $m := \mathrm{shift}(S)$
 2: while $\mathrm{number}(d_i) \geq \mathrm{number}(d_r)$ and ($m = \mathrm{nil}$ or $\mathrm{bound}(d_i) < \mathrm{number}(d_m)$) do
 3:   $C := C + \mathrm{cost}(d_i)$
 4:   while $m \neq \mathrm{nil}$ and $\mathrm{bound}(d_i) \geq \mathrm{number}(d_m)$ do
 5:     $C := C + \mathrm{calculate\_cost}(d_i, d_m, C, S) - \mathrm{cost}(d_i)$
 6:     $m := \mathrm{shift}(S)$
 7:   end while
 8:   $d_i := \mathrm{parent}(d_i)$
 9: end while
10: return $C$

is bound by $O(kh)$ where $k$ is the maximal number of children of any query node and $h$ is the height of the data tree.

The only step left is an algorithm that selects the minimal embedding and returns the accompanying minimal match to Algorithm 1. This is performed by the simple Procedure 2 that uses Procedure 1 as subroutine.

Procedure 2 Selecting the minimal match.

Procedure select_minimal_match($d$, $\mathcal{S}$)
Input: $d$: a data node that is the root of the subtree embeddings
       $\mathcal{S}$: a set of proper combinations
Return: The minimal match

1: $C := \infty$
2: for $S \in \mathcal{S}$ do
3:   $m := \mathrm{shift}(S)$
4:   $C := C_m + \min(C, \mathrm{calculate\_cost}(d, d_m, 0, S))$
5: end for
6: return $(d, C)$

5.4 Runtime Complexity

The tree inclusion problem which is contained in our approximate tree embedding problem is NP-complete. However, real world instances often have favorable properties. Our algorithm is especially designed for such instances and in fact solves them in polynomial or even sublinear time, yet its runtime complexity remains exponential. First, we analyze exactly the worst-case complexity and then we discuss why the algorithm is nevertheless fast for queries and data used in practice.

The outer loop of Algorithm 1 iterates over the $|Q|$ query nodes (line 1). For each query node $q_i$ there are at most $s \leq |D|$ data nodes that have the same label as $q_i$ (line 3). We assume that fetching those matches needs $I$ cycles, e.g., $I = \log |D|$. The loop at line 5 iterates over all $k$ children of $q_i$. Assume that we are at the $j$th child ($1 \leq j \leq k$). There are at most $s^{j-1}$


for the current child. This leads to $s^{j-1} \cdot (j-1) \cdot s = (j-1) s^j$ comparisons. Note that the set of combinations $\mathcal{S}$ may grow exponentially. Thus, at the end of the iteration over all child nodes of $q_i$, we needed at most $\sum_{j=2}^{k} (j-1) s^j = O(k^2 s^k)$ loops. The calculation of the embedding cost of one proper combination (see Procedure 1) can be done in $O(kh)$ where $h$ is the height of the data tree. Because there are at most $s^k$ proper combinations, the selection of the minimal match takes at most $O(s^k k h)$ cycles (Procedure 2). Therefore, the overall runtime complexity is

$$O(|Q| \cdot I \cdot s \cdot (k^2 s^k + s^k k h)) = O(|Q| \cdot I \cdot s^{k+1} \cdot k (k + h)).$$
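The bound on the number of loops can be checked numerically. The following Python sketch evaluates the sum $\sum_{j=2}^{k} (j-1) s^j$ for a few sample values and compares it with $k^2 s^k$; the function name `comparisons` and the sample values are chosen for illustration only. The inequality holds for $s \geq 1$ because each of the $k-1$ terms is at most $(k-1) s^k$.

```python
def comparisons(s, k):
    """Worst-case number of loops for one data node: sum of (j - 1) * s**j, j = 2..k."""
    return sum((j - 1) * s**j for j in range(2, k + 1))


for s, k in [(2, 3), (5, 4), (10, 2)]:
    exact = comparisons(s, k)
    bound = k**2 * s**k
    print(f"s={s}, k={k}: sum={exact}, k^2 * s^k={bound}")
    assert exact <= bound
```

For s = 2 and k = 3 the sum is 1 * 4 + 2 * 8 = 20, well below the bound 9 * 8 = 72.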

One should not be scared off by the worst case runtime complexity of the algorithm. First, no factor of the formula depends directly on the size of the data. The number of data nodes influences only the time for the index lookup $I$, the selectivity $s$, and the height of the data tree $h$. Second, the only factor that rises exponentially is $s^k$. This factor represents the number of proper combinations belonging to a query node and a data node. It rises exponentially only in pathological cases, e.g., if all query nodes and data nodes carry the same label. For real data, this factor is typically decreasing during query processing. If the tree structure is non-recursive (i.e., no node has an ancestor with the same label) then the number of proper combinations is bound by $O(K^k)$ where $K$ is the maximal number of children of a data node. If both $k$ and $K$ are bound by a constant then $K^k$ is just a constant value. First experiments with our prototypical implementation show that the above arguments are true in practice. But further experiments have to be done to confirm our claim.

6 Related Work

Our work has four related areas: query languages for semistructured data, structural queries in information retrieval, structure matching in graphs, and distance metrics for trees.

Many query languages for semistructured data and XML in particular have been developed during the last years [AQM+97, RLS98, DFF+98, CJS99, Moe00]. All languages support operators that allow to skip certain parts of the data. However, as already mentioned, they require that the user knows which parts have to be skipped. Furthermore, they do not consider how many nodes have to be left out in order to match the query. Thus, no ranking depending on structural similarity is supported.

There are also many proposals of query languages in the field of information retrieval. Here, the endeavor is to integrate content and structure in the query answering process. A good overview of existing methods can be found in [NBY96]. Again, to the best of our knowledge no approach measures the structural similarity between the query and the documents.

The problem of tree and graph embedding has also been studied in the theory of graphs and in formal query answering models. The work most closely related to ours is [Kil92] that introduced the tree inclusion problem. This approach was also taken up and modified in [Meu98]. Both approaches do not measure the closeness of the query tree and the data tree. Other tree and graph matching approaches, e.g., [BF99, NS00] also do not quantify the similarity between the query and the data.

Several measures for the similarity of trees have been developed [Tai79, JWZ94]. In the ordered case they are computable in polynomial time. But as we argued in this paper, ordered mappings are not useful for querying XML data. The problem of finding the minimal edit or alignment distance between unordered trees is MAX SNP-hard [AG97]. The complexity results suggest that both measures are not useful for querying trees. Even algorithms for restricted forms of the edit distance [Zha93, ZWS95] are bound by $O(|Q| \cdot |D| \cdot K)$ where $K$ depends on the maximal degree of the nodes. We believe that algorithms that touch every data node are not applicable for large


For future research we conceive two directions: improvement of the algorithm power and execution time, and improvement of the usefulness for the user.

Goldman and Widom introduced strong DataGuides as a schema for semistructured data [GW97]. Our algorithm can make use of such schemata to greatly accelerate the search for embeddings in the data tree. Further optimization techniques make use of certain properties of the query and the data. Also, we plan to relax the tree assumption and allow directed acyclic and even cyclic data structures.

The current version of our algorithm compensates under-specification of the user query: Missing nodes in the user query are skipped in the data. An over-specified user query has nodes that do not appear in the data. We plan to compensate by skipping nodes in the query in the same way as we do in the data. Further, we plan to investigate methods for learning the delete costs. Currently each node has the same delete cost regardless of the importance of the node.

References

[AG97] A. Apostolico and Z. Galil, editors. Pattern Matching Algorithms, chapter 14. Approximate Tree Pattern Matching. Oxford University Press, June 1997.

[AQM+97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1):68-88, April 1997.

[BC00] A. Bonifati and S. Ceri. Comparative analysis of five XML query languages. SIGMOD Record, 29(1), March 2000.

[BF99] A. Bergholz and J. C. Freytag. Querying semistructured data based on schema matching. In Proceedings of the International Workshop on Database Programming Languages (DBPL), September 1999.

[CJS99] S. Cluet, S. Jacqmin, and J. Siméon. The new YATL: Design and specifications. Technical report, INRIA, 1999.

[DFF+98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. W3C Note, August 1998. http://www.w3.org/TR/NOTE-xml-ql.

[FSW99] M. Fernandez, J. Siméon, and P. Wadler. XML and query languages: Experiences and examples. Unpublished manuscript, http://www-db.research.bell-labs.com/user/simeon/xquery.ps, September 1999.

[GW97] R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured data. In Proceedings of the 23rd International Conference on Very Large Databases (VLDB), pages 436-445, Athens, Greece, August 1997.

[Hav97] P. Havlak. Nesting of reducible and irreducible loops. ACM Transactions on Programming Languages and Systems, 19(4):557-567, July 1997.

[JWZ94] T. Jiang, L. Wang, and K. Zhang. Alignment of trees - an alternative to tree edit. In Combinatorial Pattern Matching, pages 75-86, June 1994.

[Kil92] Pekka Kilpeläinen. Tree Matching Problems with Applications to Structured Text


on Principles of Digital Document Processing (PODDP), 1998.

[Moe00] G. Moerkotte. YAXQL: A powerful and web-aware query language supporting query reuse. Technical report, University of Mannheim, January 2000.

[NBY96] G. Navarro and R. Baeza-Yates. Integrating content and structure in text retrieval. SIGMOD Record, 25(1):67-79, March 1996.

[NS00] F. Neven and T. Schwentick. Expressive and efficient pattern languages for tree-structured data. In Proceedings of the 19th Symposium on Principles of Database Systems (PODS), May 2000.

[RLS98] J. Robie, J. Lapp, and D. Schach. XML query language XQL, 1998. http://www.w3.org/TandS/QL/QL98/pp/xql.html.

[RR92] R. Ramesh and L. V. Ramakrishnan. Nonlinear pattern matching in trees. Journal of the ACM, 39(2):295-316, April 1992.

[Tai79] K.-C. Tai. The tree-to-tree correction problem. Journal of the ACM, 26(3):422-433, July 1979.

[Zha93] K. Zhang. A new editing based distance between unordered labeled trees. In Combinatorial Pattern Matching, pages 254-265, June 1993.

[ZWS95] K. Zhang, J. T. L. Wang, and D. Shasha. On the editing distance between undirected acyclic graphs and related problems. In Combinatorial Pattern Matching, pages