Doument Retrieval
Thomas Roelleke,Mounia Lalmas and Gabriella Kazai
QueenMary,University of London
fthor,mounia,gabsgds.qmul.a.uk
Ian Ruthven
University of Strathlyde
Ian.Ruthvenis.strath.a.uk
Stefan Quiker
University of Dortmund
quikerls6.s.uni-dortmund.de
Abstrat
Strutureddoument retrievalaims at retrievingthe doument omponentsthat best satisfy a
query,insteadofmerelyretrievingpre-deneddoumentunits.Thispaperreportsonaninvestigation
of a tf-idf-a approah, where tf and idf are the lassial termfrequeny and inverse doument
frequeny,anda,anewparameteralledaessibility,thatapturesthestrutureofdouments.
The tf-idf-a approah is dened usinga probabilisti relational algebra. To investigate the
retrievalqualityandestimatethea values,we developedamethodthatautomatiallyonstruts
diversetest olletionsofstrutured doumentsfrom astandardtest olletion,withwhih
experi-mentswerearriedout. Theanalysisoftheexperimentsprovidesestimatesofthea values.
1 Introdution
Intraditionalinformationretrieval(IR)systems[18℄,retrievableunitsarexed. Forexample,thewhole
doument, or, sometimes, pre-dened parts suh as paragraphs onstitute the retrievable units. The
logial struture ofdouments(hapter,setion, table,formula, authorinformation, bibliographiitem,
et)istherefore\attened"andnotexploited. Classialretrievalmethodslakthepossibilityto
intera-tivelydeterminethesizeandthetypeofretrievableunits,thatbestsuitanatualretrievaltaskoruser
preferenes.
Currentresearhisaimingatdevelopingretrievalmodelsthatdynamiallyreturndoumentomponents
of varying omplexity. A retrieval result may then onsist of several entry points to a same
dou-ment, whereby eah entry point is weighted aordingto how it satises the query. Authors suh as
based on the aggregation of the estimated relevane of their related omponents. These models have
been based on various formal theories (e.g. fuzzy logi [3℄, Dempster-Shafer's theory of evidene [13℄,
probabilistilogi[17, 2℄,andBayesianinferene[15℄). What thesemodels haveinommonis that the
basi omponents of their retrieval funtion are variants of the twostandard term weighting shemes,
termfrequeny(tf)andinversedoumentfrequeny(idf). Evideneassoiatedwiththelogialstruture
ofdoumentsisoftenenodedinto oneorbothofthesedimensions.
In this paper, we make this evidene expliitby introduingthe \aessibility" dimension, denoted by
a. Thisdimensionmeasuresthestrengthofthestruturalrelationshipbetweendoumentomponents:
thestrongertherelationship,themoreimpathastheontentofaomponentindesribingtheontent
of its related omponents(e.g. [17, 6℄). We refer to the approah astf-idf-a. We are interested in
investigatinghowthea dimensionaetstheretrievalofdoumentomponentsofvaryingomplexity.
To arry out this investigation, we require a framework that expliitly aptures the ontent and the
struture of doumentsin terms of our tf-idf-a approah. We use aprobabilisti relational algebra
[17℄forthispurpose(Setion2). Ourinvestigationalsorequirestestolletionsofstrutureddouments.
Sine relevaneassessments for strutureddouments are diÆult to obtain and manual assessmentis
expensiveandtaskspei,wedevelopedanautomatiapproahforreatingstrutured doumentsand
generatingrelevaneassessmentsfromaattestolletion(Setion3). Finally,weperformretrievalruns
fordierentsettingsoftheaessibilitydimension. Theanalysisprovidesuswithmethodsforestimating
theappropriatesettingoftheaessibilitydimensionforstrutureddoumentretrieval(Setion4).
2 The tf-idf-a Approah
We view a strutured doument as a tree whose nodes, alled ontexts, are the omponents of the
doument (e.g., hapters,setions, et.) and whose edges representthe omposition relationship(e.g.,
ahapter ontainsseveral setions). The root ontext of the tree, whih is uniquefor eah doument,
embodies thewhole doument. Atomi ontextsare doumentomponentsthat orrespond to thelast
elementsoftheompositionhains. Allothernodesarereferredto asinner ontexts.
Retrieval onstrutured douments takesboththe logialstruture and the ontentof douments into
aountandreturns,inresponsetoauserquery,doumentomponentsofvaryingomplexity(root,inner
oratomiontexts). Thisretrievalmethodologyombinesquerying-ndingwhihatomiontextsmath
thequery-andbrowsingalongthedouments'struture-ndingthelevelofomplexitythatbestmath
the query[5℄. Take, forexample, an artilewith vesetions. Weandene a retrievalstrategy that
isto retrievethewholeartileif morethan threeof itssetionsarerelevantto thequery. Dynamially
returning doument omponents of varying omplexity an, however, lead to user disorientation and
ognitiveoverload [6,7, 10℄. This isbeausethe presentationof doument omponentsin the retrieval
dierentpositionsintheranking.
To redue user disorientation and ognitive overload, a retrieval strategy would be to retrieve (and
thereforedisplaytotheuser)super-ontexts omposed ofmanyrelevantsub-ontexts before-orinstead
of -retrievingthe sub-ontextsthemselves. Thisstrategy would prioritise theretrievalof larger
super-ontextsfrom wherethesub-ontextsouldbeaessedthroughdiret browsing[6, 5℄. Other strategies
ouldfavoursmaller,morespeiontexts,byassigningsmallerrelevanestatusvaluestolargeontexts.
Toallow the implementation of any of the above retrieval strategies, wefollow the aggregation-based
approahtostrutureddoumentretrieval(e.g. [17,11,2,15,6,8℄). Webasetheestimationofrelevane,
orinotherwords,weomputetheretrievalstatusvalue(RSV)ofaontextbasedonaontentdesription
that is derivedfrom its ontentand the ontent ofits sub-ontexts. Forthis purpose, weaugment the
ontent of a super-ontext with that of its sub-ontexts. The augmentation proess is applied to the
wholedoumenttreestruture,startingwiththeatomiontexts,wherenoaugmentationisperformed,
andendingwiththerootontext.
Inthisframework,thebrowsingelementoftheretrievalstrategyisthenimplementedwithinthemethod
of augmentation. By ontrollingtheextent to whih the sub-ontexts of asuper-ontextontributeto
itsrepresentationweandiretlyinuene thederivedRSVs.
Via the augmentation proess we an also inuene the extent to whih the individual sub-ontexts
ontribute to the representation of the super-ontext. This way ertain sub-ontexts, suh as titles,
abstratsandonlusions,et. ouldbeemphasisedwhileothersouldbede-emphasised.
Wemodel thisimpatofthe sub-ontextsonthesuper-ontextbyexpandingthetwodimensions,term
frequeny,tf,and inverse doumentfrequeny, idf, standardto IR,withathird dimension,the
aes-sibility dimension,a 1
. Wereferto theinueningpowerof asub-ontextoveritssuper-ontextas
thesub-ontext'simportane. What thea dimensionrepresents,then,isthe degreetowhih
the sub-ontext isimportantto the super-ontext.
Intheremainderofthis setionweshow,bymeansofprobabilistirelationalalgebradened in [17, 9℄,
howthisdimensionanbeusedtoinorporatethisqualitativenotionofontextimportaneforstrutured
doumentretrieval.
2.1 Probabilisti Relational Algebra
ProbabilistiRelationalAlgebra(PRA)isalanguagemodelthatinorporatesprobabilitytheorywiththe
wellknownrelationalparadigm. Thealgebraallowsthemodellingofdoumentandqueryrepresentations
asrelationsonsistingofprobabilistituples,anditdenesoperators,withsimilarsemantistoSQLbut
1
Theterm\aessibility"istakenfromtheframeworkofpossibleworldsandaessibilityrelationsofaKripkestruture
[4℄. In[12℄,asemantisofthe tf-idf-aisdened,whereontextsaremodelledaspossibleworldsandtheirstrutureis
2.1.1 Doument and query modelling
A PRA program desribingthe representation of a strutured doumentolletion uses relations suh
asterm,termspae,anda. Theterm relationrepresentsthetf dimensionand onsistsofprobabilisti
tuplesintheformoftfweightterm(indexterm,ontext),whihassigntfweightvaluestoeah(index term,
ontext)pairintheolletion(e.g. indextermsthat ourinontexts). Thevalueoftfweight 2[0,1℄is
aprobabilistiinterpretationofthetermfrequeny. Thetermspae relationmodelstheidf dimensionby
assigningidf valuestotheindextermsintheolletion. Thisisstoredintuplesintheformofidf weight
termspae(index term),whereidfweight 2[0,1℄. Thea relationdesribesthedoumentstrutureand
onsistsoftuplesa weighta(ontextp,ontext),whereontextis\aessible"fromontextpwith
aprobabilityaweight 2
.
A query representation in PRA is desribed by the qterm relation whih is in the form of q weight
qterm(queryterm),where theqweight desribestheimportaneof thequeryterm, andquery term are
thetermsomposingthequery.
2.1.2 Relationaloperators in PRA
SimilarlytoSQL,PRAsupportsanumberofrelationaloperators. ThoseusedinthispaperareSELECT,
PROJECT,JOIN,andUNITE.Theirsyntax andfuntionalitiesaredesribednext.
SELECT[riteria℄(relation) returns those probabilisti tuples of relation that math the
speied riteria, where the format of riteria is $olumn= 3
avalue. For example,
SE-LECT[$1=sailing℄(termspae) will return all those tuples from the termspae relation that have
the term \sailing"in olumn one. Tostore theresulting tuple in arelation weuse thefollowing
syntax: new relation = SELECT[riteria℄(relation). The arity of the new relation equalsto the
arityofrelation.
JOIN[$olumn1=$olumn2℄(relation1,relation2)joins (mathes) tworelations and returns
tuplesthatontainmathingdataintheirrespetiveolumns,whereolumn1 speiesaolumnof
relation1 andolumn2 relatestorelation2. Thearityofthereturnedtuplesisthesumofthearity
ofthetuplesinrelation1 andrelation2. Forexample,new term=JOIN[$1=$1℄(term,termspae)will
populatenew termwithtuplesthathavethesamevalueinolumnone. Theformatoftheresulting
tuples is new weight newterm(index term,ontext,indexterm), where the value of newweight is
2
Onaoneptuallevelthereisnorestritiononwhihontextsanaesswhihotherontexts,sothisformalismanbe
adoptedtodesribenetworkedarhitetures.Inthisstudy,however,weonlydealwithtreetypestrutureswhereontextp
andontextformaparent-hild(super-ontextandsub-ontext)relationship.
3
theory).
PROJECT[olumns℄(relation) returns tuples that ontain only the speied olumns
of relation, where the format of olumns is $olumn1,$olumn2, et. For example,
new term=PROJECT[$1,$3℄(JOIN[$2=$2℄(term,a)) returns all (indexterm,ontextp) pairs
where the index term ours in ontext p's sub-ontext. This is beause the JOIN operator
re-turnsthetuplesnew weight(index term,ontext,ontextp,ontext ),whereontext pisthe
super-ontext of ontext. Projeting olumn one and three into new term results in new weight
new term(index term,ontextp).
UNITE(relation1,relation2)returnstheunion ofthe tuplesstored in relation1 andrelation2.
For example, new term=UNITE(term,PROJECT[$1,$3℄(JOIN[$2=$2℄(term,a))) will produe a
relationthatinludes tuplesoftheterm relationandtheresultingtuplesofthePROJECT
opera-tion. Theprobabilitiesof thetuples innew term arealulatedaordingto probabilitytheoryor
fuzzytheory.
BasedontherelationsdesribingthedoumentandqueryspaeandwiththeuseofthePRAoperators
weanimplementaretrievalstrategythat takesinto aountboththestrutureandtheontentof the
douments byaugmenting theontentof sub-ontextsinto the super-ontext, wherethe augmentation
anbeontrolledbythea weight values.
2.1.3 Retrieval strategies
Let us rst model the lassial tfidf retrieval funtion. For this, we use the JOIN and PROJECT
operationsofPRA.
tdfindex =PROJECT[$1,$2℄(JOIN[$1=$1℄(term,termspae))
retrievetdf =PROJECT[$3℄(JOIN[$1=$1℄(qterm,tdfindex))
Here, the rst funtion omputes the tfidf indexing. It produes tuples in the form of tfidf weight
tdfindex(index term,ontext). The tf idf weight is alulated using probability theory and the
inde-pendene assumption as tfweightidfweight. Theseond funtion joins (mathes) the query terms
withthetdfindex termsandproduestheretrievalresultsintheformofrsvretrievetdf(ontext). The
RSVs given in rsv are alulatedaordingto probability theoryassuming disjointnesswith respet to
thetermspae(termspae)asfollows:
rsv(ontext)= X
queryterm
i
q weight
i
tf weight
i
(ontext)idf weight
super-ontextbythat ofitssub-ontexts. Thisismodelled bythefollowingPRAequation.
tdfa index =PROJECT[$1,$3℄(JOIN[$2=$2℄(tdfindex,a))
Theaugmentedrelationonsistsoftuplestdfa weighttdfaindex(index term,ontextp)where
on-textp is the super-ontext of ontext whih is indexed by index term. The value of tdfa weight
is alulated astf idf weighta weight (assuming probabilistiindependene). Based on the
td-fa index weannowdene ourretrievalstrategyforstrutureddouments.
tdfa index =PROJECT[$1,$3℄(JOIN[$2=$2℄(tdfindex,a))
retrievetdfa=PROJECT[$3℄(JOIN[$1=$1℄(qterm,tdfaindex))
retrieve =UNITE(retrieve tdf,retrieve tdfa)
TheRSVsoftheretrievalresultarealulated(bytheUNITEoperator)aordingtoprobabilitytheory
andassumingindependene:
rsv(ontext) = P(retrievetfidf(ontext) OR retrieve tfidfa(ontext))
= P(retrievetfidf(ontext))+P(retrievetfidfa(ontext))
P(retrievetfidf(ontext) AND retrievetfidfa(ontext))
= P(retrievetfidf(ontext))+P(retrievetfidfa(ontext))
P(retrievetfidf(ontext))P(retrievetfidfa(ontext))
Sinethe weight oftdfa index is diretlyinuenedbytheweightassoiated witha, theresulting
RSVsisalsodependentona.
2.2 Example
Considerthefollowingolletionofonedoumentdo1omposed oftwosetions,se1andse2. Terms
suhassailing,boats,et. ourin theolletion:
0.1term(sailing,do1)
0.8term(boats,do1)
0.7term(sailing,se1)
0.8term(greee,se2)
0.4termspae(sailing)
0.3termspae(boats)
0.2termspae(greee)
0.1termspae(santorini)
Letustakethequery\sailingboats",representedbythefollowingPRAprogram.
qterm(sailing)
qterm(boats)
Giventhisqueryandapplyingthelassialtfidfretrievalfuntion(asdesribedintheprevioussetion)
toourdoumentolletionweretrievethefollowingdoumentomponents.
0.28retrievedtdf(do1)
0.28retrievedtdf(se1)
Both retrieved ontexts havetherelevanestatus valueof 0.28. Theaboveretrievalstrategy, however,
doesnot take into aount thestruture of thedoument, e.g. that se1whih isaboutsailingis part
ofdo1. From auser'spointof view,itmight bebetterto retrieverst- oronly -do1 sinese1an
beaessed from do1 by browsing down from do1 to se1. Let us now apply our tfidfa retrieval
strategy. Weobtain:
0.441retrieve(do1)
0.28retrieve(se1)
This shows that theRSV ofdo1 inreaseswhen wetake into aount thefat that do1 is omposed
ofse1,whihisalso indexedbytheterm"sailing". Thisisdoneusing thestruturalknowledgestored
ina. Thisdemonstratesthatbyusingourthirddimension,a,weobtainarankingthatexploits the
struture ofthe doumentto determinewhih doumentomponentsshould beretrievedhigher in the
ranking.
Indesigningappliationsforstrutureddoumentretrieval,wearefaedwiththeproblemofdetermining
theprobabilities(weights)ofthearelation. Inourretrievalappliationssofar,onstantavaluessuh
as0.5 and0.6 were used. Howeverwewantto establishmethodsto deriveestimatesofthe a values.
Toahievethis,werequiretest olletionswith ontrolledparameterstoallowus toderiveappropriate
estimations of the a valueswith respet to these parameters. In the following setion we present a
methodforreatingsimulatedtestolletionsofstrutureddoumentsthatallowsuh aninvestigation.
3 Automati Constrution of Strutured Doument Test
Col-letions
Althoughmanytestolletionsareomposedofdoumentsthatontainsomeinternalstruture[19, 1℄,
relevane judgements are usually madeat the doument level(root ontexts) orat theatomi ontext
level. Thismeansthattheyannotbeusedfortheevaluationofstrutureddoumentretrievalsystems,
(e.g. depthandwidthofdoumenttreestruture). Thesewillenableustoinvestigatethea dimension
under dierentonditions. Sine relevane assessmentsforstrutured doumentsare diÆultto obtain
and manual assessmentisexpensiveandtask spei,it wasimperativetond awayto automatially
build suh test olletions. We developed amethodology that allowed us to reate diverse olletions
of strutured douments and automatially generate relevane assessments. Ourmethodology exploits
existingstandardtestolletionswiththeirexistingqueriesandrelevanejudgementssothatnohuman
resouresare neessary. In addition ourmethodologyallowsthereationof alltest olletionsdeemed
neessarytoarryoutourinvestigationregardingtheeetoftheadimensionforstrutureddoument
retrieval.
In Setion 3.1 we disuss how we reatedthe strutured douments, in Setion 3.2 wedisuss howwe
deidedontherelevaneof thedoumentomponents,andnally,in Setion3.3 weshowtheresultsof
themethodologyusingtheCACMolletion 4
.
3.1 Constrution of the douments
Our basi methodology is to ombine douments from a test olletion to form simulated strutured
douments. That is to treat a number of original douments from the olletion as omponents of a
strutured doument. A simplied versionof this strategy was used in [13℄. In the remainder of this
setionweshallpresentamoresophistiatedversion,and dealwithsomeoftheissuesarisingfrom the
onstrutionof simulated strutureddouments. Toillustrate ourmethodology, we used awell known
smallstandardtestolletion,theCACMtestolletion.
Using the methodology of ombining douments, it is possible to reate two types of test olletions:
homogeneous olletionsin whihthedoumentshavethesamelogialstruture andheterogeneous
ol-letions in whih the douments havevarying logial struture. In ourexperiments, Setion 4,we use
theseolletionstoseehowthevaluesofaompareforthetwotypesofolletions.
By ontrolling the number of douments ombined, and the way douments are ombined, it is also
possible to generate dierent types of strutured douments. We used two main riteria to generate
strutured douments. Therstriterion is width. This orresponds tothe numberofdoumentsthat
are ombined at eah level, i.e. how many ontexts in a doument, and how many sub-ontexts per
ontext. The seondriterion is depth. This orrespondsto howmanylevelsare in thetree struture.
Forexampleadoument withnosub-ontexts(allthe textis at onelevel)hasdepth of1,adoument
with sub-ontexts hasdepth 2,adoumentwith sub-sub-ontextshas depth3, andso on. Using these
riteriaitispossibletoautomatiallygeneratetestolletionsofstrutureddoumentsthatvaryinwidth
anddepth.
4
The olletion has 3204 douments and 64 queries. See www.ds.gla.a.uk/idom/irresoures/testsolletions/ for
douments. The types of logial struture are shown in Figure 1. Pair (EE), Triple (EEE), Quad
(EEEE),Sext(EEEEEE)andOt(EEEEEEEE)areomposedofrootandatomiontextsonly. These
test olletions will be useful in estimating the a values based on the width riterion. The other
olletionsPair-E ((EE)E), Pair-2 ((EE)(EE)) and Triple-3 ((EEE)(EEE)(EEE)) have root, inner and
atomiontexts. Theseolletions areusefulforestimatingthea valuesbasedonthedepthriterion.
ColletionsofeahtypewerebuiltfromtheCACMtestolletionwheredoumentsofthetestolletion,
referredto as\originaldouments",wereused toformtheatomiontexts.
d1
e1
e2
Pair(EE)d1
e1
e2
e3
s1
Pair-E((EE)E)d1
e1
e2
s1
s2
e3
e4
Pair-2((EE)(EE))d1
e2
e1
e3
Triple(EEE)e1
e2
e3
d1
s1
s2
s3
e4
e5
e6
e7
e8
e9
Triple-3
((EEE)(EEE)(EEE))
e1
e2
e3
e4
d1
Quad(EEEE)e6
e5
e4
e3
e2
e1
d1
Sext(EEEEEE)e6
e5
e4
e3
e2
e1
d1
e7
e8
Ot(EEEEEEEE)Figure1: Typesoflogialstruture
Wealsoreatedoneheterogeneous testolletion,referredtoasMix,whihisomposedofamixtureof
Pair(EE)andTriple(EEE)douments.
We should note here that we are aware that the strutured douments reated in this manner often
willnothaveameaningfulontentand maynotreettermdistributionsin realstrutureddouments.
Neverthelesstheuseofsimulateddoumentsdoesallowforextensiveinvestigation,Setion4,toprovide
initial estimates for the a values. We are urrently using these estimates in our urrent work on a
realtest olletionofstrutureddouments(XML-baseddouments). Wearenot,therefore,suggesting
that weanuse the artiially reatedtest olletions assubstitutes forreal douments and relevane
assessments. Ratherweusethem asatest-bedto obtainestimatesfor parametersthat willbeused in
more realistievaluations. As mentioned before, the neessityof using artiial test olletionsomes
fromthelakofrealtestolletions.
Oneoftheadvantagesofourapproahisthatweanautomatiallyreateolletionsdiverseintypeand
size. Howeverwemust takestepsto ensurethatthereatedolletionsarerealistiand ofmanageable
sizeto allowexperimentation.
With a straight ombinatori approah, we an derive from a olletion of N original douments the
possiblenumberofstrutureddoumentswouldbeN 2
over2forthePairtypeofolletion(thatisabout
5million doumentsfor the 3204 douments of the CACM olletion). Thereforewe requiremethods
to ut down thenumberof atual doumentsombined. In thepartiular experimentswearried out,
we used twostrategies to aomplishthis: disarding \noisy"douments and minimising \dependent"
atomiontextsofthestrutureddouments,i.e. theoriginaldoumentsfrom theCACM olletion.
(1) disarding\noisy" douments: Ifadoument'ssub-ontextsareamixtureofrelevantand
non-relevantontextsforallqueriesintheolletionthenthedoumentisonsideredto benoisy.
Thatis,thereisnoqueryintheolletionforwhihallsub-ontextsarerelevantorallsub-ontextsare
non-relevant. Wedisardallnoisydoumentsfromtheolletion. Thisdoesnotmeanthatweareonly
onsideringstrutureddoumentswhereallsub-ontextsareinagreement;wesimplyinsistthattheyare
inagreementforatleastonequeryin theolletion 5
.
(2) minimising\dependent"douments: With astraightombination approahwealso have the
problem of multiple ourrenes of thesameatomi onepts (the original doumentsin thetest
olletionappearingmanytimes). Thisouldmeanthatoursimulatedstrutureddoumentsmay
beverysimilar-ordependent-due totheoverlapbetweenthesub-ontexts.
Our seond approah to utting down the number of reated douments is therefore to minimise the
numberofdependentdouments.
Wedonot,however,wanttoeradiatemultipleourreneompletely. First,multipleourrenemimi
real-worldsituationswhere similardoumentparts areused in severaldouments(e.g. web,hypertext,
digitallibraries). Seond,exlusiveusageofanatomiontext requiresaproedureto determinewhih
atomiontextleadsto the\best"strutureddoument,whihisdiÆult,ifnotimpossibletoassess.
Thewaywereduethenumberofmultipleourrenesistoreduetherepeateduseofatomiontexts
thatarerelevanttothesamequery. Thatis,wedonotwanttoreatemanystrutureddoumentsthat
ontainthesamesetofrelevantontexts.
Ourbasiproedureistoreduethenumberofdoumentswhoseatomiontextsareallrelevanttothe
samequery,i.e. omposed ofomponentsthatare allrelevantto thequery. Thereasonweonentrate
onrelevantontextsis thatthesearetheonesweuseto deidewhether thewhole strutureddoument
isrelevantornot,(seeSetion 3.2).
Foreahatomiontext,e
i
,whihisrelevanttoaquery,q
j
,weonlyallowe
i
toappearinonedoument
whoseotheratomiontextsareallrelevanttoq
j
. Thisreduesmultiple ourrenesofe
i
in douments
omposed entirelyofrelevantatomiontexts. Asthere maybemanystrutured doumentsontaining
e
i
whose atomi ontexts are relevant, we need a method to hoose whih of these douments to use
in the olletion. We dothis by hoosing thedoument with the lowest noisevalue. This means that
weprefer doumentsthatare relevanttomultiple queriesoverdoumentsthat areonlyrelevantto one
query. Ifmorethanonesuhdoumentsexistswehooseonerandomly.
Boththesestepsreduethenumberof strutureddoumentstoamanageablesize.
5
Wehavesofardesribedhowwereatedthestrutured doumentsandhowweutdownthepotential
numberofdoumentsreated. Whatwehavenowtoonsiderarethequeriesandrelevaneassessments.
The queries and relevane assessments ome from the standard test olletionsthat are used to build
thesimulatedstrutureddouments. However,given that astrutured doument may beomposed of
a mixture of relevant and non-relevant douments, we have to deide when to lassify a root ontext
(strutureddoument),oraninnerontext,asrelevantornon-relevant.
Ourapproah denes therelevaneof non-atomi ontexts asthe aggregationof therelevane of their
sub-ontexts.
Leta non-atomiontext dbe omposed ofk sub-ontextse1;:::;ek. Foragivenquery, wehavethree
ases: all the sub-ontexts e1to ek are relevant; all the sub-ontexts are not relevant; and neither of
theprevioustwoasesholds. Inthelatter,wesaythatwehave\ontraditory"relevane assessments.
For the rst two ases, it is reasonableto assess that d is relevantand dis not relevant to the query,
respetively. Inthethirdase,anaggregationstrategyisrequiredtodeidetherelevaneofdtoaquery.
Weapplythefollowingtwostrategies:
optimistirelevane: disassessedrelevanttothequeryifatleastoneofitssub-ontextsisassessed
relevantto thequery;disassessednon-relevantifallitssub-ontextsareassessednon-relevantto
thequery.
pessimistirelevane: disassessedrelevanttothequeryifallitssub-ontextsareassessedrelevant
to thequery;in allotherases,disassessednon-relevanttothequery.
Variantsoftheaboveouldbeused; e.g.,disonsidered relevantif2=3of itssub-ontexts arerelevant
[11℄. We are urrently arrying outresearh to devisestrategies that may be loser to user's views of
relevanewithrespettostrutureddoumentretrieval 6
.
Thepointofusingdierentaggregationstrategiesisthatitallowsustoinvestigatetheperformaneofthe
a dimensionwhenusingdierentrelevaneriteria. Forexample,theoptimististrategyorresponds
to aloose denition of relevane (where adoument is relevant ifit ontains any relevantomponent)
and thepessimististrategyorrespondsto astrit denition ofrelevane(where allomponentsmust
berelevantbeforethestrutureddoumentisrelevant).
3.3 Example
Inthe previoussetions wehaveshownhowwe anuse existingtest olletionsto reateolletionsof
strutureddouments. Theseolletionsanbeofvaryingwidthanddepth,bebasedondieringnotions
6
For instane, inan experimentrelated to passageretrieval,somerelevant douments ontained no partsthat were
this methodology is that it allows the reation of diverse olletion types from a single original test
olletion.
TheolletionswereatedfortheexperimentsreportedinthispaperwerebasedontheCACMolletion.
Wehavedesribedtheolletiontypes,weshallnowexaminetheolletionsinmoredetailtoshowthe
dierenesbetweenthem.
Table1showsthenumberofroot, inner andatomiontextsfor theolletions. Asitanbeseen,the
homogeneousolletions displayarelationshipbetweentheatomiontextsandrootandinnerontext.
Forinstane,thePairolletionhastwieasmanyatomiontextsasrootontexts,theTripleolletion
has three times as many atomi ontexts as root ontexts, et. This does not hold, however, for the
heterogeneousMixolletion,whihisombinedofamixtureofdoumenttypes.
Coll. Num. Num. Num. Total
Root Inner Atomi Num.
Pair 383 0 766 1149
Pair-E 247 247 741 1235
Triple 247 0 741 988
Quad 180 0 720 900
Pair-2 180 360 720 1260
Triple-3 66 198 594 858
Sext 109 0 654 763
Ot 80 0 640 720
Mix 280 0 700 980
Table1: Numberofontexts
Oneofthewaysweutdownthenumberofreatedstrutureddoumentswastoreduethenumberof
multipleourrenesofatomiontexts. AsshowninTable2,wedonotexludeallmultipleourrenes,
however suh ourrenes are rare. For, example, in the Triple-3 olletion, 20 of the original 3204
doumentsareusedthreetimesamongthe594atomiontexts.
Theabovetwomeasuresareindependentofhowwedeideontherelevaneofaontext,i.e. whetherwe
usetheoptimistiorpessimistiaggregationstrategy. Thehoieofaggregationstrategywillaet the
numberofrelevantontexts. Asanexampleweshowin,Figure2,thenumberofrelevantrootontexts
when usingthe Pairolletion, (fullguresanbefound in [16℄). As it anbeseen, fortheoptimisti
aggregationstrategywehavealmosttwieasmanyrelevantroot ontexts(average15.53 perquery) as
forthepessimistiaggregationstrategy(average7.85perquery). Thisdemonstratesthattheaggregation
strategyanbeused toreateolletionswithdierentharateristis.
Inthis setion wedesribedthe reationofa number ofolletionsbased onthe CACM olletion. In
frequeny
Pair 398 86 33 14 7 1
Pair-E 392 82 31 14 6 1
Triple 392 82 31 14 6 1
Quad 388 82 28 12 6 1
Pair-2 391 84 32 11 3 1
Triple-3 336 83 20 8 0 0
Sext 362 74 22 12 6 0
Ot 352 75 21 11 5 1
Mix 366 85 27 12 7 0
Table2: Multipleourreneofontexts
0
10
20
30
40
50
60
70
0
10
20
30
40
50
60
Contexts
Queries
Distribution of relevant root contexts
Average(15.53125)
Pair,opt,root
0
5
10
15
20
25
30
35
0
10
20
30
40
50
60
Contexts
Queries
Distribution of relevant root contexts
Average(7.859375)
Pair,pess,root
Using the dierent types of olletions, and their assoiated properties, we arried out a number of
experiments to investigate the aessibilitydimension for the retrievalof strutured douments. With
our set of test olletions of strutureddouments and their various and ontrolled harateristis, we
studiedtheeet ofdierenta valuesontheretrievalquality.
Wetargetedthefollowingquestions:
1. Is thereanoptimalsetting ofthea parameterfor aontext withnsub-ontexts? (Setion4.1).
An optimalsettingisonewhihgivesthebestaveragepreision.
2. With high a values, we expet large ontexts to be retrieved with a higher RSV than small
ontexts. What is the \break-even point", i.e. whih setting of a will retrieve largeand small
ontextswith thesamepreferene? (Setion 4.2)
4.1 Optimal values of the aessibility dimension
For allouronstruted olletions,forinreasing valuesof a (rangingfrom 0:1 to 0:9), weomputed
the RSV of eah ontext, using the augmentation proess desribed in Setion 2. With the obtained
rankings(ofroot,inner andatomi ontexts)andourrelevane assessments(optimistiorpessimisti),
wealulatedpreision/reallvaluesandthentheaveragepreisionvalues. ThegraphsinFigure3show
for eah aessibility value the orresponding average preision. We show the graphs for Pair, Pair-E
andMixonly. Allgraphsshowa\bellshape". Theoptimalaessibilityvaluesand theirorresponding
maximalpreisionvaluesaregiveninTable3.
Optimistirelevane Pessimistirelevane
olletion max. av. preision a max. av. preision a
Pair 0.4702 0.75 0.4359 0.65
Triple 0.4719 0.6 0.4479 0.45
Quad 0.455 0.55 0.4474 0.35
Sext 0.4431 0.45 0.4507 0.25
Ot 0.4277 0.35 0.4404 0.2
Pair-2 0.4722 0.8 0.4556 0.6
Pair-E 0.4787 0.75 0.4464 0.65
Triple-3 0.4566 0.65 0.4694 0.4
Mix 0.4608 0.75 0.4307 0.5
Table3: Optimalaessibilityvaluesandorrespondingmaximumpreision
0.3
0.32
0.34
0.36
0.38
0.4
0.42
0.44
0.46
0.48
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Precision
Accessibility Value
Pair,opt0.34
0.35
0.36
0.37
0.38
0.39
0.4
0.41
0.42
0.43
0.44
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Precision
Accessibility Value
Pair,pess0.28
0.3
0.32
0.34
0.36
0.38
0.4
0.42
0.44
0.46
0.48
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Precision
Accessibility Value
Pair-E,opt0.34
0.36
0.38
0.4
0.42
0.44
0.46
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Precision
Accessibility Value
Pair-E,pess0.3
0.32
0.34
0.36
0.38
0.4
0.42
0.44
0.46
0.48
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Precision
Accessibility Value
Mix,opt0.35
0.36
0.37
0.38
0.39
0.4
0.41
0.42
0.43
0.44
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Precision
Accessibility Value
Mix,pess
Figure3: Aessibilityvaluesandorrespondingaveragepreisionvalues
with the numberof sub-ontexts. This holds for bothrelevane aggregationstrategies. Thea value
anbeapproximatedwiththefuntion
a=a 1
p
n
wherenisthenumberofsub-ontexts. Theparameteradependsontherelevaneaggregationstrategy.
Withthemethodofleastsquarepolynomials(seeAppendix),valuesofaare1:068and0:78foroptimisti
andpessimistirelevaneassessments,respetively.
TheoptimalavaluesobtainedforPairandTriplearelosetothoseofPair-2andPair-E,andTriple-3,
respetively. This indiates that for depth-two olletions (Pair-2, Pair-E, Triple-3) we an apply the
above estimates for a independently of the depth of the olletion, indiating that approximations
basedonthenumberofsub-ontextsseemappropriate.
Thea valueforMixusedthesamexedaessibilityvaluesforalldouments,whether theywerePair
orTriple douments. This ouldbeonsidered as\unfair",sine,asdisussedabove,thesettingof the
a for a ontext depends on the number of its sub-ontexts. Therefore, we performed an additional
experiment,wherethea valueswereset to0.75and0.6,respetively,forontextswith twoandthree
sub-ontextsin theoptimistirelevanease, and0.65and 0.45forthe pessimisti ase. These are the
optimalaessibilityvaluesobtainedforPairand Triple(seeTable3). Theaveragepreisionvaluesare
0.4615 and 0.4301for optimisti and pessimisti relevane assessments, respetively. Comparedto the
values obtainedwith xed aessibility values(0.4608and 0.4307,respetively), there is no signiant
From theresultson thehomogeneousolletions,weonludethat wean set thea parameterfor a
ontextaordingtothefuntion a 1
p
n
whereaanbeviewedastheparameterreetingtherelevane
aggregationstrategy.
4.2 Large or small ontexts
Onemajor role of the aessibility dimensionis to emphasise the retrievalpreferene of large vssmall
ontexts. For example, ontexts deeper in the struture (small ontexts) should be retrieved before
ontextsupper(largeontexts)in thestruturewhenspeiontextsarepreferredtomoreexhaustive
ontexts[6℄. Theavaluegivespowerfulontrolregardingexhaustivenessandspeiityoftheretrieval.
With small a values, small ontexts \overtake" large ontexts, whereas with high a values large
ontextsdominatetheupperranks.
For demonstrating and investigating this eet, we produed for eah olletion with our tf-idf-a
method dened inSetion 2arankedlistof ontextsfordierenta values,ranging againfrom 0:1to
0:9. Foreah typeofontexts(atomi, inner and root), wealulate itsaveragerank overall retrieval
resultsforaolletion. Theseaveragevaluesarethenplottedintoagraphinrelationtotheaessibility
values. Figure 4shows theobtainedgraphsfor Ot, Triple-3and Mix. In allgraphs, theroot ontext
urvestartsintheupperleftorner,whereastheatomiontexturvestartsinthelowerleftorner. For
instane,wesee thatfortheOtolletion,the\break-evenpoint"isaround0:1and0:2forpessimisti
relevaneassessment.
With the Triple-3 olletion we obtain three break-even points for root-inner, root-atomi, and
inner-atomi. Whereas the average rank of inner nodes does not vary greatly with varying a values, the
eet on root and atomi ontexts is similar to the eet observed with the Ot olletion, but with
dierentbreak-evenpointsvalues(e.g.around0:4 0:5foroptimistirelevaneassessment).FortheMix
olletionthebreak-even-pointloatesaround0:5,ahighervaluethanthat fortheOtolletion.
Whereas as in Setion 4.1, the maximum average preisionleads to a setting of a, the experiments
regardingsmall and large ontexts provide us with aseond soure for setting thea value, onethat
ontrolstheretrievalofexhaustivevsspei doumententrypoints.
5 Conlusion
Inthisworkweinvestigatedhowtoexpliitlyinorporatethenotionofstrutureintostrutureddoument
retrieval. This is in ontrast to otherresearh, (e.g. [3, 15, 8, 20℄), wherethe struture ofadoument
is only impliitly aptured within the retrieval model. The advantageof ourapproah is that we an
0
10
20
30
40
50
60
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Rank
Accessibility Value
Root Context
Inner Context
Atomic Context
Ot,opt0
5
10
15
20
25
30
35
40
45
50
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Rank
Accessibility Value
Root Context
Inner Context
Atomic Context
Ot,pess20
40
60
80
100
120
140
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Rank
Accessibility Value
Root Context
Inner Context
Atomic Context
Triple-3,opt20
40
60
80
100
120
140
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Rank
Accessibility Value
Root Context
Inner Context
Atomic Context
Triple-3,pess0
10
20
30
40
50
60
70
80
90
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Rank
Accessibility Value
Root Context
Inner Context
Atomic Context
Mix,opt0
10
20
30
40
50
60
70
80
90
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Rank
Accessibility Value
Root Context
Inner Context
Atomic Context
Mix,pessFigure4: Eetoftheaessibilityonthetypesofretrievedontexts
Ourapproahtostrutureddoumentsranksdoumentontexts(doumentomponentsofvarying
gran-ularity) based on a desription of their individual ontent augmented with that of their sub-ontexts.
Thereforeadoument'sontextenapsulatestheontentofallitssub-ontextstakingintoaounttheir
importane. This was implemented using PRA, a probabilisti relational algebra. In our model, and
implementation, we quantitatively inorporated the degree to whih a sub-ontext ontributes to the
ontent of asuper-ontext using the a dimension, i.e. higher a values mean that the sub-ontext
ontributesmoretothedesriptionofasuper-ontext.
We arried out extensive experiments on olletions of douments with varying struture to provide
estimatesfora. Thisinvestigationisneessaryto allowthesettingofa tovaluesthatwill failitate
theretrievalofdoumentomponentsofvaryinggranularity.
The experiments requiredthe development of test olletionsof strutured douments. We developed
a methodology for the automati onstrution of test olletionsof strutured douments using
stan-dardtestolletionswiththeirsetof douments,queriesandorrespondingrelevaneassessments. The
methodologymakesitpossibleto generatetest olletions ofstrutured douments withvarying width
anddepth, basedondieringnotionsofrelevaneandwithidential orvaryingstruture.
Theanalysisoftheretrievalresultsallowedustoderiveageneralreommendationforappropriatesettings
ofthea valueforstrutureddoumentretrieval. Theavaluesdependonthenumberofsub-ontexts
of a ontexts, and the relevane assessment aggregation strategies. They also depend on the required
exhaustivenessandspeiityoftheretrieval. Theseresultsarebeingusedasthebasisforanevaluation
[1℄ Baeza-Yates,R., andRibeiro-Neto,B.ModernInformationRetrieval. AddisonWesley,1999.
[2℄ Baumgarten,C.Aprobabilistimodelfordistributedinformationretrieval. InProeedingsofACM-SIGIR
ConfereneonResearh andDevelopmentinInformationRetrieval (Philadelphia,USA,1997),pp.258{266.
[3℄ Bordogna,G.,andPasi,G.Flexiblequeryingofstrutureddouments.InProeedingsofFlexibleQuery
AnsweringSystems(FQAS) (Warsaw,Poland,2000),pp.350{361.
[4℄ Chellas,B. ModalLogi. CambridgeUniversityPress,1980.
[5℄ Chiaramella, Y. Browsing and querying: two omplementary approahes for multimedia information
retrieval. InProeedings Hypermedia-InformationRetrieval-Multimedia (Dortmund,Germany,1997).
[6℄ Chiaramella, Y., Mulhem, P.,and Fourel, F. A modelfor multimedia information retrieval. Teh.
Rep.FermiESPRITBRA8134,UniversityofGlasgow,1996.
[7℄ Edwards,D.,andHardman,L.Lostinhyperspae:Cognitivenavigationinahypertextenvironment. In
Proeedings ofHypertextI (1989),pp.105{125.
[8℄ Frisse,M. Searhingfor informationinahypertextmedialhandbook. Communiations of theACM31,
7(1988),880{886.
[9℄ Fuhr,N.,andRoelleke,T. Aprobabilistirelationalalgebrafortheintegrationofinformationretrieval
anddatabasesystems.ACMTransationsonInformationSystems14,1(1997).
[10℄ Iweha,C.VisualisationofStruturedDouments: AnInvestigationintotheRoleofVisualisingStruturefor
Information RetrievalInterfaesand HumanComputer Interation. PhDthesis,QueenMarty&Westeld
College,1999.
[11℄ Lalmas,M.,andMoutogianni,E. ADempster-Shaferindexingforthefoussedretrievalofhierarhially
strutureddouments:Implememtationandexperimentsonawebmuseumolletion. InRIAO(2000).
[12℄ Lalmas, M.,and Roelleke, T. Four-valued knowledgeaugmentationforstrutured doumentretrieval.
Submittedfor Publiation.
[13℄ Lalmas,M.,and Ruthven,I. RepresentingandretrievingstrutureddoumentswithDempster-Shafer's
theoryofevidene: Modellingandevaluation. JournalofDoumentation 54,5(1998),529{565.
[14℄ Mizzaro, S. Relevane: Thewhole story. Journal of the Ameria Soiety for Information Siene 48,9
(1997),810{832.
[15℄ Myaeng, S., Jang, D. H., Kim, M. S., and Zhoo, Z. C. A exiblemodelfor retrievalof SGML
do-uments. InProeedings of ACM-SIGIR Conferene onResearh andDevelopmentinInformation Retrieval
(Melbourne,Australia, 1998),pp.138{145.
[16℄ Quiker, S. Relevanzuntersuhung fur das Retrieval von strukturierten Dokumenten. Master's thesis,
UniversityofDortmund,1998.
[17℄ Roelleke,T.POOL:ProbabilistiObjet-OrientedLogialRepresentationandRetrievalofComplexObjets
-AModelforHypermediaRetrieva. PhDthesis,UniversityofDortmund,Germany,1999.
[18℄ vanRijsbergen,C.J. InformationRetrieval,2ed. Butterworths,London,1979.
[19℄ Voorhees,E.,andHarman,D.OverviewoftheFifthTextREtrievalConferene(TREC-5).InProeedings
Researh andDevelopmentinInformationRetrieval (Dublin,Ireland,1994),pp.311{317.
Least square polynomials
Considertheexperimentalvaluesofa in relationwiththesquareroot ofthenumberof sub-ontexts.
Assuming a 1 p ni where n i
ranges in the set f2;3;4;6;8g is the funtion for estimating the optimal
aessibilityvalues,weapplyleastsquarepolynomialsasfollowsforalulatinga.