The accessibility dimension for structured document retrieval

(1)

Doument Retrieval

Thomas Roelleke,Mounia Lalmas and Gabriella Kazai

QueenMary,University of London

fthor,mounia,gabsgds.qmul.a.uk

Ian Ruthven

University of Strathlyde

Ian.Ruthvenis.strath.a.uk

Stefan Quiker

University of Dortmund

quikerls6.s.uni-dortmund.de

Abstrat

Strutureddoument retrievalaims at retrievingthe doument omponentsthat best satisfy a

query,insteadofmerelyretrievingpre-deneddoumentunits.Thispaperreportsonaninvestigation

of a tf-idf-a approah, where tf and idf are the lassial termfrequeny and inverse doument

frequeny,anda,anewparameteralledaessibility,thatapturesthestrutureofdouments.

The tf-idf-a approah is dened usinga probabilisti relational algebra. To investigate the

retrievalqualityandestimatethea values,we developedamethodthatautomatiallyonstruts

diversetest olletionsofstrutured doumentsfrom astandardtest olletion,withwhih

experi-mentswerearriedout. Theanalysisoftheexperimentsprovidesestimatesofthea values.

1 Introdution

Intraditionalinformationretrieval(IR)systems[18℄,retrievableunitsarexed. Forexample,thewhole

doument, or, sometimes, pre-dened parts suh as paragraphs onstitute the retrievable units. The

logial struture ofdouments(hapter,setion, table,formula, authorinformation, bibliographiitem,

et)istherefore\attened"andnotexploited. Classialretrievalmethodslakthepossibilityto

intera-tivelydeterminethesizeandthetypeofretrievableunits,thatbestsuitanatualretrievaltaskoruser

preferenes.

Currentresearhisaimingatdevelopingretrievalmodelsthatdynamiallyreturndoumentomponents

of varying omplexity. A retrieval result may then onsist of several entry points to a same

dou-ment, whereby eah entry point is weighted aordingto how it satises the query. Authors suh as

(2)

based on the aggregation of the estimated relevane of their related omponents. These models have

been based on various formal theories (e.g. fuzzy logi [3℄, Dempster-Shafer's theory of evidene [13℄,

probabilistilogi[17, 2℄,andBayesianinferene[15℄). What thesemodels haveinommonis that the

basi omponents of their retrieval funtion are variants of the twostandard term weighting shemes,

termfrequeny(tf)andinversedoumentfrequeny(idf). Evideneassoiatedwiththelogialstruture

ofdoumentsisoftenenodedinto oneorbothofthesedimensions.

In this paper, we make this evidene expliitby introduingthe \aessibility" dimension, denoted by

a. Thisdimensionmeasuresthestrengthofthestruturalrelationshipbetweendoumentomponents:

thestrongertherelationship,themoreimpathastheontentofaomponentindesribingtheontent

of its related omponents(e.g. [17, 6℄). We refer to the approah astf-idf-a. We are interested in

investigatinghowthea dimensionaetstheretrievalofdoumentomponentsofvaryingomplexity.

To arry out this investigation, we require a framework that expliitly aptures the ontent and the

struture of doumentsin terms of our tf-idf-a approah. We use aprobabilisti relational algebra

[17℄forthispurpose(Setion2). Ourinvestigationalsorequirestestolletionsofstrutureddouments.

Sine relevaneassessments for strutureddouments are diÆult to obtain and manual assessmentis

expensiveandtaskspei,wedevelopedanautomatiapproahforreatingstrutured doumentsand

generatingrelevaneassessmentsfromaattestolletion(Setion3). Finally,weperformretrievalruns

fordierentsettingsoftheaessibilitydimension. Theanalysisprovidesuswithmethodsforestimating

theappropriatesettingoftheaessibilitydimensionforstrutureddoumentretrieval(Setion4).

2 The tf-idf-a Approah

We view a strutured doument as a tree whose nodes, alled ontexts, are the omponents of the

doument (e.g., hapters,setions, et.) and whose edges representthe omposition relationship(e.g.,

ahapter ontainsseveral setions). The root ontext of the tree, whih is uniquefor eah doument,

embodies thewhole doument. Atomi ontextsare doumentomponentsthat orrespond to thelast

elementsoftheompositionhains. Allothernodesarereferredto asinner ontexts.

Retrieval onstrutured douments takesboththe logialstruture and the ontentof douments into

aountandreturns,inresponsetoauserquery,doumentomponentsofvaryingomplexity(root,inner

oratomiontexts). Thisretrievalmethodologyombinesquerying-ndingwhihatomiontextsmath

thequery-andbrowsingalongthedouments'struture-ndingthelevelofomplexitythatbestmath

the query[5℄. Take, forexample, an artilewith vesetions. Weandene a retrievalstrategy that

isto retrievethewholeartileif morethan threeof itssetionsarerelevantto thequery. Dynamially

returning doument omponents of varying omplexity an, however, lead to user disorientation and

ognitiveoverload [6,7, 10℄. This isbeausethe presentationof doument omponentsin the retrieval

(3)

dierentpositionsintheranking.

To redue user disorientation and ognitive overload, a retrieval strategy would be to retrieve (and

thereforedisplaytotheuser)super-ontexts omposed ofmanyrelevantsub-ontexts before-orinstead

of -retrievingthe sub-ontextsthemselves. Thisstrategy would prioritise theretrievalof larger

super-ontextsfrom wherethesub-ontextsouldbeaessedthroughdiret browsing[6, 5℄. Other strategies

ouldfavoursmaller,morespeiontexts,byassigningsmallerrelevanestatusvaluestolargeontexts.

Toallow the implementation of any of the above retrieval strategies, wefollow the aggregation-based

approahtostrutureddoumentretrieval(e.g. [17,11,2,15,6,8℄). Webasetheestimationofrelevane,

orinotherwords,weomputetheretrievalstatusvalue(RSV)ofaontextbasedonaontentdesription

that is derivedfrom its ontentand the ontent ofits sub-ontexts. Forthis purpose, weaugment the

ontent of a super-ontext with that of its sub-ontexts. The augmentation proess is applied to the

wholedoumenttreestruture,startingwiththeatomiontexts,wherenoaugmentationisperformed,

andendingwiththerootontext.

Inthisframework,thebrowsingelementoftheretrievalstrategyisthenimplementedwithinthemethod

of augmentation. By ontrollingtheextent to whih the sub-ontexts of asuper-ontextontributeto

itsrepresentationweandiretlyinuene thederivedRSVs.

Via the augmentation proess we an also inuene the extent to whih the individual sub-ontexts

ontribute to the representation of the super-ontext. This way ertain sub-ontexts, suh as titles,

abstratsandonlusions,et. ouldbeemphasisedwhileothersouldbede-emphasised.

Wemodel thisimpatofthe sub-ontextsonthesuper-ontextbyexpandingthetwodimensions,term

frequeny,tf,and inverse doumentfrequeny, idf, standardto IR,withathird dimension,the

aes-sibility dimension,a 1

. Wereferto theinueningpowerof asub-ontextoveritssuper-ontextas

thesub-ontext'simportane. What thea dimensionrepresents,then,isthe degreetowhih

the sub-ontext isimportantto the super-ontext.

Intheremainderofthis setionweshow,bymeansofprobabilistirelationalalgebradened in [17, 9℄,

howthisdimensionanbeusedtoinorporatethisqualitativenotionofontextimportaneforstrutured

doumentretrieval.

2.1 Probabilisti Relational Algebra

ProbabilistiRelationalAlgebra(PRA)isalanguagemodelthatinorporatesprobabilitytheorywiththe

wellknownrelationalparadigm. Thealgebraallowsthemodellingofdoumentandqueryrepresentations

asrelationsonsistingofprobabilistituples,anditdenesoperators,withsimilarsemantistoSQLbut

1

Theterm\aessibility"istakenfromtheframeworkofpossibleworldsandaessibilityrelationsofaKripkestruture

[4℄. In[12℄,asemantisofthe tf-idf-aisdened,whereontextsaremodelledaspossibleworldsandtheirstrutureis

(4)

2.1.1 Doument and query modelling

A PRA program desribingthe representation of a strutured doumentolletion uses relations suh

asterm,termspae,anda. Theterm relationrepresentsthetf dimensionand onsistsofprobabilisti

tuplesintheformoftfweightterm(indexterm,ontext),whihassigntfweightvaluestoeah(index term,

ontext)pairintheolletion(e.g. indextermsthat ourinontexts). Thevalueoftfweight 2[0,1℄is

aprobabilistiinterpretationofthetermfrequeny. Thetermspae relationmodelstheidf dimensionby

assigningidf valuestotheindextermsintheolletion. Thisisstoredintuplesintheformofidf weight

termspae(index term),whereidfweight 2[0,1℄. Thea relationdesribesthedoumentstrutureand

onsistsoftuplesa weighta(ontextp,ontext),whereontextis\aessible"fromontextpwith

aprobabilityaweight 2

.

A query representation in PRA is desribed by the qterm relation whih is in the form of q weight

qterm(queryterm),where theqweight desribestheimportaneof thequeryterm, andquery term are

thetermsomposingthequery.

2.1.2 Relationaloperators in PRA

SimilarlytoSQL,PRAsupportsanumberofrelationaloperators. ThoseusedinthispaperareSELECT,

PROJECT,JOIN,andUNITE.Theirsyntax andfuntionalitiesaredesribednext.

SELECT[riteria℄(relation) returns those probabilisti tuples of relation that math the

speied riteria, where the format of riteria is $olumn= 3

avalue. For example,

SE-LECT[$1=sailing℄(termspae) will return all those tuples from the termspae relation that have

the term \sailing"in olumn one. Tostore theresulting tuple in arelation weuse thefollowing

syntax: new relation = SELECT[riteria℄(relation). The arity of the new relation equalsto the

arityofrelation.

JOIN[$olumn1=$olumn2℄(relation1,relation2)joins (mathes) tworelations and returns

tuplesthatontainmathingdataintheirrespetiveolumns,whereolumn1 speiesaolumnof

relation1 andolumn2 relatestorelation2. Thearityofthereturnedtuplesisthesumofthearity

ofthetuplesinrelation1 andrelation2. Forexample,new term=JOIN[$1=$1℄(term,termspae)will

populatenew termwithtuplesthathavethesamevalueinolumnone. Theformatoftheresulting

tuples is new weight newterm(index term,ontext,indexterm), where the value of newweight is

2

Onaoneptuallevelthereisnorestritiononwhihontextsanaesswhihotherontexts,sothisformalismanbe

adoptedtodesribenetworkedarhitetures.Inthisstudy,however,weonlydealwithtreetypestrutureswhereontextp

andontextformaparent-hild(super-ontextandsub-ontext)relationship.

3

(5)

theory).

PROJECT[olumns℄(relation) returns tuples that ontain only the speied olumns

of relation, where the format of olumns is $olumn1,$olumn2, et. For example,

new term=PROJECT[$1,$3℄(JOIN[$2=$2℄(term,a)) returns all (indexterm,ontextp) pairs

where the index term ours in ontext p's sub-ontext. This is beause the JOIN operator

re-turnsthetuplesnew weight(index term,ontext,ontextp,ontext ),whereontext pisthe

super-ontext of ontext. Projeting olumn one and three into new term results in new weight

new term(index term,ontextp).

UNITE(relation1,relation2)returnstheunion ofthe tuplesstored in relation1 andrelation2.

For example, new term=UNITE(term,PROJECT[$1,$3℄(JOIN[$2=$2℄(term,a))) will produe a

relationthatinludes tuplesoftheterm relationandtheresultingtuplesofthePROJECT

opera-tion. Theprobabilitiesof thetuples innew term arealulatedaordingto probabilitytheoryor

fuzzytheory.

BasedontherelationsdesribingthedoumentandqueryspaeandwiththeuseofthePRAoperators

weanimplementaretrievalstrategythat takesinto aountboththestrutureandtheontentof the

douments byaugmenting theontentof sub-ontextsinto the super-ontext, wherethe augmentation

anbeontrolledbythea weight values.

2.1.3 Retrieval strategies

Let us rst model the lassial tfidf retrieval funtion. For this, we use the JOIN and PROJECT

operationsofPRA.

tdfindex =PROJECT[$1,$2℄(JOIN[$1=$1℄(term,termspae))

retrievetdf =PROJECT[$3℄(JOIN[$1=$1℄(qterm,tdfindex))

Here, the rst funtion omputes the tfidf indexing. It produes tuples in the form of tfidf weight

tdfindex(index term,ontext). The tf idf weight is alulated using probability theory and the

inde-pendene assumption as tfweightidfweight. Theseond funtion joins (mathes) the query terms

withthetdfindex termsandproduestheretrievalresultsintheformofrsvretrievetdf(ontext). The

RSVs given in rsv are alulatedaordingto probability theoryassuming disjointnesswith respet to

thetermspae(termspae)asfollows:

rsv(ontext)= X

queryterm

i

q weight

i

tf weight

i

(ontext)idf weight

(6)

super-ontextbythat ofitssub-ontexts. Thisismodelled bythefollowingPRAequation.

tdfa index =PROJECT[$1,$3℄(JOIN[$2=$2℄(tdfindex,a))

Theaugmentedrelationonsistsoftuplestdfa weighttdfaindex(index term,ontextp)where

on-textp is the super-ontext of ontext whih is indexed by index term. The value of tdfa weight

is alulated astf idf weighta weight (assuming probabilistiindependene). Based on the

td-fa index weannowdene ourretrievalstrategyforstrutureddouments.

tdfa index =PROJECT[$1,$3℄(JOIN[$2=$2℄(tdfindex,a))

retrievetdfa=PROJECT[$3℄(JOIN[$1=$1℄(qterm,tdfaindex))

retrieve =UNITE(retrieve tdf,retrieve tdfa)

TheRSVsoftheretrievalresultarealulated(bytheUNITEoperator)aordingtoprobabilitytheory

andassumingindependene:

rsv(ontext) = P(retrievetfidf(ontext) OR retrieve tfidfa(ontext))

= P(retrievetfidf(ontext))+P(retrievetfidfa(ontext))

P(retrievetfidf(ontext) AND retrievetfidfa(ontext))

= P(retrievetfidf(ontext))+P(retrievetfidfa(ontext))

P(retrievetfidf(ontext))P(retrievetfidfa(ontext))

Sinethe weight oftdfa index is diretlyinuenedbytheweightassoiated witha, theresulting

RSVsisalsodependentona.

2.2 Example

Considerthefollowingolletionofonedoumentdo1omposed oftwosetions,se1andse2. Terms

suhassailing,boats,et. ourin theolletion:

0.1term(sailing,do1)

0.8term(boats,do1)

0.7term(sailing,se1)

0.8term(greee,se2)

0.4termspae(sailing)

0.3termspae(boats)

0.2termspae(greee)

0.1termspae(santorini)

(7)

Letustakethequery\sailingboats",representedbythefollowingPRAprogram.

qterm(sailing)

qterm(boats)

Giventhisqueryandapplyingthelassialtfidfretrievalfuntion(asdesribedintheprevioussetion)

toourdoumentolletionweretrievethefollowingdoumentomponents.

0.28retrievedtdf(do1)

0.28retrievedtdf(se1)

Both retrieved ontexts havetherelevanestatus valueof 0.28. Theaboveretrievalstrategy, however,

doesnot take into aount thestruture of thedoument, e.g. that se1whih isaboutsailingis part

ofdo1. From auser'spointof view,itmight bebetterto retrieverst- oronly -do1 sinese1an

beaessed from do1 by browsing down from do1 to se1. Let us now apply our tfidfa retrieval

strategy. Weobtain:

0.441retrieve(do1)

0.28retrieve(se1)

This shows that theRSV ofdo1 inreaseswhen wetake into aount thefat that do1 is omposed

ofse1,whihisalso indexedbytheterm"sailing". Thisisdoneusing thestruturalknowledgestored

ina. Thisdemonstratesthatbyusingourthirddimension,a,weobtainarankingthatexploits the

struture ofthe doumentto determinewhih doumentomponentsshould beretrievedhigher in the

ranking.

Indesigningappliationsforstrutureddoumentretrieval,wearefaedwiththeproblemofdetermining

theprobabilities(weights)ofthearelation. Inourretrievalappliationssofar,onstantavaluessuh

as0.5 and0.6 were used. Howeverwewantto establishmethodsto deriveestimatesofthe a values.

Toahievethis,werequiretest olletionswith ontrolledparameterstoallowus toderiveappropriate

estimations of the a valueswith respet to these parameters. In the following setion we present a

methodforreatingsimulatedtestolletionsofstrutureddoumentsthatallowsuh aninvestigation.

3 Automati Constrution of Strutured Doument Test

Col-letions

Althoughmanytestolletionsareomposedofdoumentsthatontainsomeinternalstruture[19, 1℄,

relevane judgements are usually madeat the doument level(root ontexts) orat theatomi ontext

level. Thismeansthattheyannotbeusedfortheevaluationofstrutureddoumentretrievalsystems,

(8)

(e.g. depthandwidthofdoumenttreestruture). Thesewillenableustoinvestigatethea dimension

under dierentonditions. Sine relevane assessmentsforstrutured doumentsare diÆultto obtain

and manual assessmentisexpensiveandtask spei,it wasimperativetond awayto automatially

build suh test olletions. We developed amethodology that allowed us to reate diverse olletions

of strutured douments and automatially generate relevane assessments. Ourmethodology exploits

existingstandardtestolletionswiththeirexistingqueriesandrelevanejudgementssothatnohuman

resouresare neessary. In addition ourmethodologyallowsthereationof alltest olletionsdeemed

neessarytoarryoutourinvestigationregardingtheeetoftheadimensionforstrutureddoument

retrieval.

In Setion 3.1 we disuss how we reatedthe strutured douments, in Setion 3.2 wedisuss howwe

deidedontherelevaneof thedoumentomponents,andnally,in Setion3.3 weshowtheresultsof

themethodologyusingtheCACMolletion 4

.

3.1 Constrution of the douments

Our basi methodology is to ombine douments from a test olletion to form simulated strutured

douments. That is to treat a number of original douments from the olletion as omponents of a

strutured doument. A simplied versionof this strategy was used in [13℄. In the remainder of this

setionweshallpresentamoresophistiatedversion,and dealwithsomeoftheissuesarisingfrom the

onstrutionof simulated strutureddouments. Toillustrate ourmethodology, we used awell known

smallstandardtestolletion,theCACMtestolletion.

Using the methodology of ombining douments, it is possible to reate two types of test olletions:

homogeneous olletionsin whihthedoumentshavethesamelogialstruture andheterogeneous

ol-letions in whih the douments havevarying logial struture. In ourexperiments, Setion 4,we use

theseolletionstoseehowthevaluesofaompareforthetwotypesofolletions.

By ontrolling the number of douments ombined, and the way douments are ombined, it is also

possible to generate dierent types of strutured douments. We used two main riteria to generate

strutured douments. Therstriterion is width. This orresponds tothe numberofdoumentsthat

are ombined at eah level, i.e. how many ontexts in a doument, and how many sub-ontexts per

ontext. The seondriterion is depth. This orrespondsto howmanylevelsare in thetree struture.

Forexampleadoument withnosub-ontexts(allthe textis at onelevel)hasdepth of1,adoument

with sub-ontexts hasdepth 2,adoumentwith sub-sub-ontextshas depth3, andso on. Using these

riteriaitispossibletoautomatiallygeneratetestolletionsofstrutureddoumentsthatvaryinwidth

anddepth.

4

The olletion has 3204 douments and 64 queries. See www.ds.gla.a.uk/idom/irresoures/testsolletions/ for

(9)

douments. The types of logial struture are shown in Figure 1. Pair (EE), Triple (EEE), Quad

(EEEE),Sext(EEEEEE)andOt(EEEEEEEE)areomposedofrootandatomiontextsonly. These

test olletions will be useful in estimating the a values based on the width riterion. The other

olletionsPair-E ((EE)E), Pair-2 ((EE)(EE)) and Triple-3 ((EEE)(EEE)(EEE)) have root, inner and

atomiontexts. Theseolletions areusefulforestimatingthea valuesbasedonthedepthriterion.

ColletionsofeahtypewerebuiltfromtheCACMtestolletionwheredoumentsofthetestolletion,

referredto as\originaldouments",wereused toformtheatomiontexts.

d1

e1

e2

Pair(EE)

d1

e1

e2

e3

s1

Pair-E((EE)E)

d1

e1

e2

s1

s2

e3

e4

Pair-2((EE)(EE))

d1

e2

e1

e3

Triple(EEE)

e1

e2

e3

d1

s1

s2

s3

e4

e5

e6

e7

e8

e9

Triple-3

((EEE)(EEE)(EEE))

e1

e2

e3

e4

d1

Quad(EEEE)

e6

e5

e4

e3

e2

e1

d1

Sext(EEEEEE)

e6

e5

e4

e3

e2

e1

d1

e7

e8

Ot(EEEEEEEE)

Figure1: Typesoflogialstruture

Wealsoreatedoneheterogeneous testolletion,referredtoasMix,whihisomposedofamixtureof

Pair(EE)andTriple(EEE)douments.

We should note here that we are aware that the strutured douments reated in this manner often

willnothaveameaningfulontentand maynotreettermdistributionsin realstrutureddouments.

Neverthelesstheuseofsimulateddoumentsdoesallowforextensiveinvestigation,Setion4,toprovide

initial estimates for the a values. We are urrently using these estimates in our urrent work on a

realtest olletionofstrutureddouments(XML-baseddouments). Wearenot,therefore,suggesting

that weanuse the artiially reatedtest olletions assubstitutes forreal douments and relevane

assessments. Ratherweusethem asatest-bedto obtainestimatesfor parametersthat willbeused in

more realistievaluations. As mentioned before, the neessityof using artiial test olletionsomes

fromthelakofrealtestolletions.

Oneoftheadvantagesofourapproahisthatweanautomatiallyreateolletionsdiverseintypeand

size. Howeverwemust takestepsto ensurethatthereatedolletionsarerealistiand ofmanageable

sizeto allowexperimentation.

With a straight ombinatori approah, we an derive from a olletion of N original douments the

possiblenumberofstrutureddoumentswouldbeN 2

over2forthePairtypeofolletion(thatisabout

5million doumentsfor the 3204 douments of the CACM olletion). Thereforewe requiremethods

to ut down thenumberof atual doumentsombined. In thepartiular experimentswearried out,

we used twostrategies to aomplishthis: disarding \noisy"douments and minimising \dependent"

(10)

atomiontextsofthestrutureddouments,i.e. theoriginaldoumentsfrom theCACM olletion.

(1) disarding\noisy" douments: Ifadoument'ssub-ontextsareamixtureofrelevantand

non-relevantontextsforallqueriesintheolletionthenthedoumentisonsideredto benoisy.

Thatis,thereisnoqueryintheolletionforwhihallsub-ontextsarerelevantorallsub-ontextsare

non-relevant. Wedisardallnoisydoumentsfromtheolletion. Thisdoesnotmeanthatweareonly

onsideringstrutureddoumentswhereallsub-ontextsareinagreement;wesimplyinsistthattheyare

inagreementforatleastonequeryin theolletion 5

.

(2) minimising\dependent"douments: With astraightombination approahwealso have the

problem of multiple ourrenes of thesameatomi onepts (the original doumentsin thetest

olletionappearingmanytimes). Thisouldmeanthatoursimulatedstrutureddoumentsmay

beverysimilar-ordependent-due totheoverlapbetweenthesub-ontexts.

Our seond approah to utting down the number of reated douments is therefore to minimise the

numberofdependentdouments.

Wedonot,however,wanttoeradiatemultipleourreneompletely. First,multipleourrenemimi

real-worldsituationswhere similardoumentparts areused in severaldouments(e.g. web,hypertext,

digitallibraries). Seond,exlusiveusageofanatomiontext requiresaproedureto determinewhih

atomiontextleadsto the\best"strutureddoument,whihisdiÆult,ifnotimpossibletoassess.

Thewaywereduethenumberofmultipleourrenesistoreduetherepeateduseofatomiontexts

thatarerelevanttothesamequery. Thatis,wedonotwanttoreatemanystrutureddoumentsthat

ontainthesamesetofrelevantontexts.

Ourbasiproedureistoreduethenumberofdoumentswhoseatomiontextsareallrelevanttothe

samequery,i.e. omposed ofomponentsthatare allrelevantto thequery. Thereasonweonentrate

onrelevantontextsis thatthesearetheonesweuseto deidewhether thewhole strutureddoument

isrelevantornot,(seeSetion 3.2).

Foreahatomiontext,e

i

,whihisrelevanttoaquery,q

j

,weonlyallowe

i

toappearinonedoument

whoseotheratomiontextsareallrelevanttoq

j

. Thisreduesmultiple ourrenesofe

i

in douments

omposed entirelyofrelevantatomiontexts. Asthere maybemanystrutured doumentsontaining

e

i

whose atomi ontexts are relevant, we need a method to hoose whih of these douments to use

in the olletion. We dothis by hoosing thedoument with the lowest noisevalue. This means that

weprefer doumentsthatare relevanttomultiple queriesoverdoumentsthat areonlyrelevantto one

query. Ifmorethanonesuhdoumentsexistswehooseonerandomly.

Boththesestepsreduethenumberof strutureddoumentstoamanageablesize.

5

(11)

Wehavesofardesribedhowwereatedthestrutured doumentsandhowweutdownthepotential

numberofdoumentsreated. Whatwehavenowtoonsiderarethequeriesandrelevaneassessments.

The queries and relevane assessments ome from the standard test olletionsthat are used to build

thesimulatedstrutureddouments. However,given that astrutured doument may beomposed of

a mixture of relevant and non-relevant douments, we have to deide when to lassify a root ontext

(strutureddoument),oraninnerontext,asrelevantornon-relevant.

Ourapproah denes therelevaneof non-atomi ontexts asthe aggregationof therelevane of their

sub-ontexts.

Leta non-atomiontext dbe omposed ofk sub-ontextse1;:::;ek. Foragivenquery, wehavethree

ases: all the sub-ontexts e1to ek are relevant; all the sub-ontexts are not relevant; and neither of

theprevioustwoasesholds. Inthelatter,wesaythatwehave\ontraditory"relevane assessments.

For the rst two ases, it is reasonableto assess that d is relevantand dis not relevant to the query,

respetively. Inthethirdase,anaggregationstrategyisrequiredtodeidetherelevaneofdtoaquery.

Weapplythefollowingtwostrategies:

optimistirelevane: disassessedrelevanttothequeryifatleastoneofitssub-ontextsisassessed

relevantto thequery;disassessednon-relevantifallitssub-ontextsareassessednon-relevantto

thequery.

pessimistirelevane: disassessedrelevanttothequeryifallitssub-ontextsareassessedrelevant

to thequery;in allotherases,disassessednon-relevanttothequery.

Variantsoftheaboveouldbeused; e.g.,disonsidered relevantif2=3of itssub-ontexts arerelevant

[11℄. We are urrently arrying outresearh to devisestrategies that may be loser to user's views of

relevanewithrespettostrutureddoumentretrieval 6

.

Thepointofusingdierentaggregationstrategiesisthatitallowsustoinvestigatetheperformaneofthe

a dimensionwhenusingdierentrelevaneriteria. Forexample,theoptimististrategyorresponds

to aloose denition of relevane (where adoument is relevant ifit ontains any relevantomponent)

and thepessimististrategyorrespondsto astrit denition ofrelevane(where allomponentsmust

berelevantbeforethestrutureddoumentisrelevant).

3.3 Example

Inthe previoussetions wehaveshownhowwe anuse existingtest olletionsto reateolletionsof

strutureddouments. Theseolletionsanbeofvaryingwidthanddepth,bebasedondieringnotions

6

For instane, inan experimentrelated to passageretrieval,somerelevant douments ontained no partsthat were

(12)

this methodology is that it allows the reation of diverse olletion types from a single original test

olletion.

TheolletionswereatedfortheexperimentsreportedinthispaperwerebasedontheCACMolletion.

Wehavedesribedtheolletiontypes,weshallnowexaminetheolletionsinmoredetailtoshowthe

dierenesbetweenthem.

Table1showsthenumberofroot, inner andatomiontextsfor theolletions. Asitanbeseen,the

homogeneousolletions displayarelationshipbetweentheatomiontextsandrootandinnerontext.

Forinstane,thePairolletionhastwieasmanyatomiontextsasrootontexts,theTripleolletion

has three times as many atomi ontexts as root ontexts, et. This does not hold, however, for the

heterogeneousMixolletion,whihisombinedofamixtureofdoumenttypes.

Coll. Num. Num. Num. Total

Root Inner Atomi Num.

Pair 383 0 766 1149

Pair-E 247 247 741 1235

Triple 247 0 741 988

Quad 180 0 720 900

Pair-2 180 360 720 1260

Triple-3 66 198 594 858

Sext 109 0 654 763

Ot 80 0 640 720

Mix 280 0 700 980

Table1: Numberofontexts

Oneofthewaysweutdownthenumberofreatedstrutureddoumentswastoreduethenumberof

multipleourrenesofatomiontexts. AsshowninTable2,wedonotexludeallmultipleourrenes,

however suh ourrenes are rare. For, example, in the Triple-3 olletion, 20 of the original 3204

doumentsareusedthreetimesamongthe594atomiontexts.

Theabovetwomeasuresareindependentofhowwedeideontherelevaneofaontext,i.e. whetherwe

usetheoptimistiorpessimistiaggregationstrategy. Thehoieofaggregationstrategywillaet the

numberofrelevantontexts. Asanexampleweshowin,Figure2,thenumberofrelevantrootontexts

when usingthe Pairolletion, (fullguresanbefound in [16℄). As it anbeseen, fortheoptimisti

aggregationstrategywehavealmosttwieasmanyrelevantroot ontexts(average15.53 perquery) as

forthepessimistiaggregationstrategy(average7.85perquery). Thisdemonstratesthattheaggregation

strategyanbeused toreateolletionswithdierentharateristis.

Inthis setion wedesribedthe reationofa number ofolletionsbased onthe CACM olletion. In

(13)

frequeny

Pair 398 86 33 14 7 1

Pair-E 392 82 31 14 6 1

Triple 392 82 31 14 6 1

Quad 388 82 28 12 6 1

Pair-2 391 84 32 11 3 1

Triple-3 336 83 20 8 0 0

Sext 362 74 22 12 6 0

Ot 352 75 21 11 5 1

Mix 366 85 27 12 7 0

Table2: Multipleourreneofontexts

0

10

20

30

40

50

60

70

0

10

20

30

40

50

60 Contexts

Queries

Distribution of relevant root contexts

Average(15.53125)

Pair,opt,root

0

5

10

15

20

25

30

35

0

10

20

30

40

50

60 Contexts

Queries

Distribution of relevant root contexts

Average(7.859375)

Pair,pess,root

(14)

Using the dierent types of olletions, and their assoiated properties, we arried out a number of

experiments to investigate the aessibilitydimension for the retrievalof strutured douments. With

our set of test olletions of strutureddouments and their various and ontrolled harateristis, we

studiedtheeet ofdierenta valuesontheretrievalquality.

Wetargetedthefollowingquestions:

1. Is thereanoptimalsetting ofthea parameterfor aontext withnsub-ontexts? (Setion4.1).

An optimalsettingisonewhihgivesthebestaveragepreision.

2. With high a values, we expet large ontexts to be retrieved with a higher RSV than small

ontexts. What is the \break-even point", i.e. whih setting of a will retrieve largeand small

ontextswith thesamepreferene? (Setion 4.2)

4.1 Optimal values of the aessibility dimension

For allouronstruted olletions,forinreasing valuesof a (rangingfrom 0:1 to 0:9), weomputed

the RSV of eah ontext, using the augmentation proess desribed in Setion 2. With the obtained

rankings(ofroot,inner andatomi ontexts)andourrelevane assessments(optimistiorpessimisti),

wealulatedpreision/reallvaluesandthentheaveragepreisionvalues. ThegraphsinFigure3show

for eah aessibility value the orresponding average preision. We show the graphs for Pair, Pair-E

andMixonly. Allgraphsshowa\bellshape". Theoptimalaessibilityvaluesand theirorresponding

maximalpreisionvaluesaregiveninTable3.

Optimistirelevane Pessimistirelevane

olletion max. av. preision a max. av. preision a

Pair 0.4702 0.75 0.4359 0.65

Triple 0.4719 0.6 0.4479 0.45

Quad 0.455 0.55 0.4474 0.35

Sext 0.4431 0.45 0.4507 0.25

Ot 0.4277 0.35 0.4404 0.2

Pair-2 0.4722 0.8 0.4556 0.6

Pair-E 0.4787 0.75 0.4464 0.65

Triple-3 0.4566 0.65 0.4694 0.4

Mix 0.4608 0.75 0.4307 0.5

Table3: Optimalaessibilityvaluesandorrespondingmaximumpreision

(15)

0.3

0.32

0.34

0.36

0.38

0.4

0.42

0.44

0.46

0.48

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Precision

Accessibility Value

Pair,opt

0.34

0.35

0.36

0.37

0.38

0.39

0.4

0.41

0.42

0.43

0.44

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Precision

Accessibility Value

Pair,pess

0.28

0.3

0.32

0.34

0.36

0.38

0.4

0.42

0.44

0.46

0.48

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Precision

Accessibility Value

Pair-E,opt

0.34

0.36

0.38

0.4

0.42

0.44

0.46

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Precision

Accessibility Value

Pair-E,pess

0.3

0.32

0.34

0.36

0.38

0.4

0.42

0.44

0.46

0.48

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Precision

Accessibility Value

Mix,opt

0.35

0.36

0.37

0.38

0.39

0.4

0.41

0.42

0.43

0.44

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Precision

Accessibility Value

Mix,pess

Figure3: Aessibilityvaluesandorrespondingaveragepreisionvalues

with the numberof sub-ontexts. This holds for bothrelevane aggregationstrategies. Thea value

anbeapproximatedwiththefuntion

a=a 1

p

n

wherenisthenumberofsub-ontexts. Theparameteradependsontherelevaneaggregationstrategy.

Withthemethodofleastsquarepolynomials(seeAppendix),valuesofaare1:068and0:78foroptimisti

andpessimistirelevaneassessments,respetively.

TheoptimalavaluesobtainedforPairandTriplearelosetothoseofPair-2andPair-E,andTriple-3,

respetively. This indiates that for depth-two olletions (Pair-2, Pair-E, Triple-3) we an apply the

above estimates for a independently of the depth of the olletion, indiating that approximations

basedonthenumberofsub-ontextsseemappropriate.

Thea valueforMixusedthesamexedaessibilityvaluesforalldouments,whether theywerePair

orTriple douments. This ouldbeonsidered as\unfair",sine,asdisussedabove,thesettingof the

a for a ontext depends on the number of its sub-ontexts. Therefore, we performed an additional

experiment,wherethea valueswereset to0.75and0.6,respetively,forontextswith twoandthree

sub-ontextsin theoptimistirelevanease, and0.65and 0.45forthe pessimisti ase. These are the

optimalaessibilityvaluesobtainedforPairand Triple(seeTable3). Theaveragepreisionvaluesare

0.4615 and 0.4301for optimisti and pessimisti relevane assessments, respetively. Comparedto the

values obtainedwith xed aessibility values(0.4608and 0.4307,respetively), there is no signiant

(16)

From theresultson thehomogeneousolletions,weonludethat wean set thea parameterfor a

ontextaordingtothefuntion a 1

p

n

whereaanbeviewedastheparameterreetingtherelevane

aggregationstrategy.

4.2 Large or small ontexts

Onemajor role of the aessibility dimensionis to emphasise the retrievalpreferene of large vssmall

ontexts. For example, ontexts deeper in the struture (small ontexts) should be retrieved before

ontextsupper(largeontexts)in thestruturewhenspeiontextsarepreferredtomoreexhaustive

ontexts[6℄. Theavaluegivespowerfulontrolregardingexhaustivenessandspeiityoftheretrieval.

With small a values, small ontexts \overtake" large ontexts, whereas with high a values large

ontextsdominatetheupperranks.

For demonstrating and investigating this eet, we produed for eah olletion with our tf-idf-a

method dened inSetion 2arankedlistof ontextsfordierenta values,ranging againfrom 0:1to

0:9. Foreah typeofontexts(atomi, inner and root), wealulate itsaveragerank overall retrieval

resultsforaolletion. Theseaveragevaluesarethenplottedintoagraphinrelationtotheaessibility

values. Figure 4shows theobtainedgraphsfor Ot, Triple-3and Mix. In allgraphs, theroot ontext

urvestartsintheupperleftorner,whereastheatomiontexturvestartsinthelowerleftorner. For

instane,wesee thatfortheOtolletion,the\break-evenpoint"isaround0:1and0:2forpessimisti

relevaneassessment.

With the Triple-3 olletion we obtain three break-even points for root-inner, root-atomi, and

inner-atomi. Whereas the average rank of inner nodes does not vary greatly with varying a values, the

eet on root and atomi ontexts is similar to the eet observed with the Ot olletion, but with

dierentbreak-evenpointsvalues(e.g.around0:4 0:5foroptimistirelevaneassessment).FortheMix

olletionthebreak-even-pointloatesaround0:5,ahighervaluethanthat fortheOtolletion.

Whereas as in Setion 4.1, the maximum average preisionleads to a setting of a, the experiments

regardingsmall and large ontexts provide us with aseond soure for setting thea value, onethat

ontrolstheretrievalofexhaustivevsspei doumententrypoints.

5 Conlusion

Inthisworkweinvestigatedhowtoexpliitlyinorporatethenotionofstrutureintostrutureddoument

retrieval. This is in ontrast to otherresearh, (e.g. [3, 15, 8, 20℄), wherethe struture ofadoument

is only impliitly aptured within the retrieval model. The advantageof ourapproah is that we an

(17)

0

10

20

30

40

50

60

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Rank

Accessibility Value

Root Context

Inner Context

Atomic Context

Ot,opt

0

5

10

15

20

25

30

35

40

45

50

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Rank

Accessibility Value

Root Context

Inner Context

Atomic Context

Ot,pess

20

40

60

80

100

120

140

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Rank

Accessibility Value

Root Context

Inner Context

Atomic Context

Triple-3,opt

20

40

60

80

100

120

140

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Rank

Accessibility Value

Root Context

Inner Context

Atomic Context

Triple-3,pess

0

10

20

30

40

50

60

70

80

90

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Rank

Accessibility Value

Root Context

Inner Context

Atomic Context

Mix,opt

0

10

20

30

40

50

60

70

80

90

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9 Average Rank

Accessibility Value

Root Context

Inner Context

Atomic Context

Mix,pess

Figure4: Eetoftheaessibilityonthetypesofretrievedontexts

Ourapproahtostrutureddoumentsranksdoumentontexts(doumentomponentsofvarying

gran-ularity) based on a desription of their individual ontent augmented with that of their sub-ontexts.

Thereforeadoument'sontextenapsulatestheontentofallitssub-ontextstakingintoaounttheir

importane. This was implemented using PRA, a probabilisti relational algebra. In our model, and

implementation, we quantitatively inorporated the degree to whih a sub-ontext ontributes to the

ontent of asuper-ontext using the a dimension, i.e. higher a values mean that the sub-ontext

ontributesmoretothedesriptionofasuper-ontext.

We arried out extensive experiments on olletions of douments with varying struture to provide

estimatesfora. Thisinvestigationisneessaryto allowthesettingofa tovaluesthatwill failitate

theretrievalofdoumentomponentsofvaryinggranularity.

The experiments requiredthe development of test olletionsof strutured douments. We developed

a methodology for the automati onstrution of test olletionsof strutured douments using

stan-dardtestolletionswiththeirsetof douments,queriesandorrespondingrelevaneassessments. The

methodologymakesitpossibleto generatetest olletions ofstrutured douments withvarying width

anddepth, basedondieringnotionsofrelevaneandwithidential orvaryingstruture.

Theanalysisoftheretrievalresultsallowedustoderiveageneralreommendationforappropriatesettings

ofthea valueforstrutureddoumentretrieval. Theavaluesdependonthenumberofsub-ontexts

of a ontexts, and the relevane assessment aggregation strategies. They also depend on the required

exhaustivenessandspeiityoftheretrieval. Theseresultsarebeingusedasthebasisforanevaluation

(18)

[1℄ Baeza-Yates,R., andRibeiro-Neto,B.ModernInformationRetrieval. AddisonWesley,1999.

[2℄ Baumgarten,C.Aprobabilistimodelfordistributedinformationretrieval. InProeedingsofACM-SIGIR

ConfereneonResearh andDevelopmentinInformationRetrieval (Philadelphia,USA,1997),pp.258{266.

[3℄ Bordogna,G.,andPasi,G.Flexiblequeryingofstrutureddouments.InProeedingsofFlexibleQuery

AnsweringSystems(FQAS) (Warsaw,Poland,2000),pp.350{361.

[4℄ Chellas,B. ModalLogi. CambridgeUniversityPress,1980.

[5℄ Chiaramella, Y. Browsing and querying: two omplementary approahes for multimedia information

retrieval. InProeedings Hypermedia-InformationRetrieval-Multimedia (Dortmund,Germany,1997).

[6℄ Chiaramella, Y., Mulhem, P.,and Fourel, F. A modelfor multimedia information retrieval. Teh.

Rep.FermiESPRITBRA8134,UniversityofGlasgow,1996.

[7℄ Edwards,D.,andHardman,L.Lostinhyperspae:Cognitivenavigationinahypertextenvironment. In

Proeedings ofHypertextI (1989),pp.105{125.

[8℄ Frisse,M. Searhingfor informationinahypertextmedialhandbook. Communiations of theACM31,

7(1988),880{886.

[9℄ Fuhr,N.,andRoelleke,T. Aprobabilistirelationalalgebrafortheintegrationofinformationretrieval

anddatabasesystems.ACMTransationsonInformationSystems14,1(1997).

[10℄ Iweha,C.VisualisationofStruturedDouments: AnInvestigationintotheRoleofVisualisingStruturefor

Information RetrievalInterfaesand HumanComputer Interation. PhDthesis,QueenMarty&Westeld

College,1999.

[11℄ Lalmas,M.,andMoutogianni,E. ADempster-Shaferindexingforthefoussedretrievalofhierarhially

strutureddouments:Implememtationandexperimentsonawebmuseumolletion. InRIAO(2000).

[12℄ Lalmas, M.,and Roelleke, T. Four-valued knowledgeaugmentationforstrutured doumentretrieval.

Submittedfor Publiation.

[13℄ Lalmas,M.,and Ruthven,I. RepresentingandretrievingstrutureddoumentswithDempster-Shafer's

theoryofevidene: Modellingandevaluation. JournalofDoumentation 54,5(1998),529{565.

[14℄ Mizzaro, S. Relevane: Thewhole story. Journal of the Ameria Soiety for Information Siene 48,9

(1997),810{832.

[15℄ Myaeng, S., Jang, D. H., Kim, M. S., and Zhoo, Z. C. A exiblemodelfor retrievalof SGML

do-uments. InProeedings of ACM-SIGIR Conferene onResearh andDevelopmentinInformation Retrieval

(Melbourne,Australia, 1998),pp.138{145.

[16℄ Quiker, S. Relevanzuntersuhung fur das Retrieval von strukturierten Dokumenten. Master's thesis,

UniversityofDortmund,1998.

[17℄ Roelleke,T.POOL:ProbabilistiObjet-OrientedLogialRepresentationandRetrievalofComplexObjets

-AModelforHypermediaRetrieva. PhDthesis,UniversityofDortmund,Germany,1999.

[18℄ vanRijsbergen,C.J. InformationRetrieval,2ed. Butterworths,London,1979.

[19℄ Voorhees,E.,andHarman,D.OverviewoftheFifthTextREtrievalConferene(TREC-5).InProeedings

(19)

Researh andDevelopmentinInformationRetrieval (Dublin,Ireland,1994),pp.311{317.

Least square polynomials

Considertheexperimentalvaluesofa in relationwiththesquareroot ofthenumberof sub-ontexts.

Assuming a 1 p ni where n i

ranges in the set f2;3;4;6;8g is the funtion for estimating the optimal

aessibilityvalues,weapplyleastsquarepolynomialsasfollowsforalulatinga.