• No results found

Deep Web Navigation by Example

N/A
N/A
Protected

Academic year: 2020

Share "Deep Web Navigation by Example"

Copied!
12
0
0

Loading.... (view fulltext now)

Full text

(1)

DEEP WEB NAVIGATIONBY EXAMPLE

YANG WANG AND THOMAS HORNUNG

Abstrat. LargeportionsoftheWebareburiedbehinduser-orientedinterfaes,whihanonlybeaessedbyllingoutforms.

Tomakethethereinontainedinformationaessibletoautomatiproessing,oneofthemajorhurdlesistonavigatetotheatual

resultpage. Inthispaperwepresentaframeworkfornavigatingtheseso-alledDeepWebsitesbasedonthepage-keyword-ation

paradigm: thesystemllsout formswithprovidedinputparametersand then submitstheform. Afterwards itheksifithas

alreadyfoundaresultpagebylookingforpre-speiedkeywordpatternsintheurrentpage.Basedontheoutomeeitherfurther

ationstoreaharesultpageareexeutedortheresultingURLisreturned.

Keywords: formanalysis,deepWebnavigationbypage-keyword-ations

1. Introdution. The Web an be lassied into two ategories with respet to aess patterns: the

SurfaeWebandtheDeepWeb[7℄. TheSurfaeWebonsistsofstatiandpublilyavailableWebpages,whih

ontainlinksto otherpagesand anberepresentedasadireted graph. This Webgraphanbetraversedby

rawlers(alsoknownasspiders)andthefoundpagesarethentraditionallyindexedbysearhengines.

The Deep Web in ontrastonsists of dynamially generated result pages of numerous databases, whih

anbe queriedviaaWebform. These pagesannot bereahedbyfollowinglinks from other pagesand it is

thereforehallengingto index theirontent. Figure1depitsthegeneralinteration patternbetweentheuser

and aDeepWeb site. Theuserlls outtheform eld with the desiredinformation (1) andthe Web form is

sent to the serverwhere it is transformed in a database query. In this phase it is possible, that the system

needsfurther userinput due toambiguityin the underlying data,e.g. there mightbe toomany resultsfora

query, andtheuserhasto providefurther informationonintermediatepages(2). Finally, theWebserverhas

gatheredallneessaryinformationtogeneratetheresultpageand itisdeliveredtotheuser(3).

[12℄disoveredanexponentialgrowthandgreat subjetdiversityof theseDeep Websites. Among others

theyarrivedatthefollowingonlusions:

Thereareapproximately43.00096.000Web-aessibledatabases,

TheDeepWebis400500timeslargerthantheSurfaeWeb,

95%oftheavailabledataandinformationontheDeepWebisfreelyavailable.

Taking into aount this vast amountof high-quality data, whih is gearedtowards human visitors, it is not

surprisingthatmanydierentresearhquestionsareativelypursuedinthisareaatthemoment,e.g. vertial

searhengines[13℄.

Inthis paperwepresent DNavigator, aframework to automatethe neessaryinteration stepsto obtain

datafromDeepWebsites. Theideaistoreordtheuserinterationsfromtheinitial Webformtothedesired

result page. These interations are generalized, where two dierent phases an be distinguished: the lling

outandsubmissionofthefrontendform(f.(1)in Figure1)andtheationsperformedonintermediatepages

(f.(2)inFigure1). Consequently,DNavigatoronsistsoftwoomponents: WebformanalysisandDeepWeb

navigationmodeling.

Thisframework hasbeenmotivated bytheFireSearhprojet[15℄,whihisgearedtowardsolletingand

analyzingDeepWebdata atquerytime. Theultimateextrationandlabelingof datafrom theresultpage is

donewiththeViPER extrationsystem[23℄. However,astheframeworkhasbeenimplementedinJavaSript

andJavaasaFirefoxpluginitouldbeusedwithminormodiationsinotherprojets,e.g.foradomain-spei

MetaSearhengine,wheretherelevantDeepWebsouresouldbeintegratedbyaninterestedommunity,as

well.

Thepaperisstruturedasfollows:westartwithadesriptionofthetwomainomponentsofourframework,

namelytheanalysisofform eldsinSetion2andthenavigationmodelinSetion 3. Setion4dealswiththe

intriaiesofimplementingourresearhprototypeintheFirefoxbrowser. InSetion5wepresentanevaluation

ofoursystemandinSetion6wedisussrelatedwork.Finally,wegiveanoutlookonfutureworkinSetion7

andonludewithSetion 8.

InstituteofComputerSiene,Albert-LudwigsUniversityFreiburg,Germany,

(2)

Fig.1.1.Aessing aDeepWebsite

2. Form analysis. Web forms are omnipresent: whether the user searhes for information on Google,

partiipatesin anonlinevote,orommentsanentryin ablog,she alwaysprovidesinformation viallingout

andsubmittingaform. Onamoretehniallevel,eahinputelement(intheontextofthispaperwereferto

allelementsin theform eldthat anbeprovidedwithavalue,e.g. hekboxes,asinputelements) ofaWeb

formisassoiatedwithauniqueIDandonsubmissionoftheformthevalueassignmentsareenodedaseither

GETorPOSTHTTPrequest[3℄.

Fig.2.1. WebformwithHTMLrepresentation

Figure 2.1 shows an exampleof asimple Web form. The uniqueID for theinput element labeled First

name iss1andthustheassoiatedHTTPGETrequestlooksasfollows:

GET /searh.gi?s1=Yang&s2=Wang HTTP/1.1

Host: www.example.org

User-Agent: Mozilla/4.0

Aept: image/gif, image/jpeg, */*

Connetion: lose

InSetion2.1wedisusshowtomapuser-denedlabelsontoinputelements,whilewedealinSetion 2.2

(3)

2.1. Labeling of Input Elements. Initially for eah newWeb page westore allourring forms with

allinput elements, IDs and therange oflegal values(i. e. for dropdownmenu lists, this would be theset of

legaloptions),inadatabaseforlateranalysis. Afterwardstheuseranloadthedesiredformeldandlabelthe

desiredinputelements,e.g. inFigure2.2themaximumdesiredpriethevisitoriswillingtopayforausedar

hasbeenlabeled Prie-To. Thelabelingof theWeb forms isinspiredby theidea ofsoialbookmarking[14℄:

eah user has apersonal, evolvingvoabulary of tags. Here a tagis aombination of astring label with an

XMLdatatype[15℄. Theexamplein Figure2.2 showstheuservoabularyin theupperorner,wherethesize

ofthelabelsis determinedbythefrequenytheyhavebeenusedbefore.

Overallshehaslabeledsix inputelements,e.g. thedesiredbrandandthemakeof thear. Nowwehek

foreahlabeledinputelement,iftheyarestatiorifthereareanydynamidependenies,whihmightbedue

toAjaxinterationswiththeserver. Note,that onlytheseinputelementsoftheformanbeusedlateronfor

queryingthat havebeenlabeledin thisstage.

Fig.2.2. Labelingofinputelements

OurrunningexampleistheanalysisofaWebsearhengineforusedars(http://www.autosout24.de),

whereeah armodeldependsonitsarmake. Theotherinputelementsarestati,i.e.theydonothangeif

oneoftheotherinputelementsishanging.

2.2. Dependeny Chek of Input Elements. The dynami and stati ombinationsare determined

automatiallyaftertheuserhasnishedlabelingthedesiredinputelementsbasedonthefollowingidea:modify

therstdropdownmenu(onlydropdownmenusareurrentlyonsideredasandidatesfordynamielements,all

otherinputelementtypesareassumedtobestatibydefault)andhekallotherlabeleddropdownmenus,ifthe

availableoptionshavehanged. Ifthisisthease,thenmodifythedependentdropdownmenutounoverlayered

dependeniesandmarkthedependentmenuasdynami. Afteralldropdownmenushavebeenheked,wemark

allmenusthatarenotdynamiasstati. Toavoidloops,weonlyhekpossibledropdownmenusthathavenot

partiipatedin adependenyin theurrentanalysisylebefore, e.g. in theexampleshownin Figure 2.2the

armodelwouldnotbeonsideredifwehekforfurtherdynamidependeniesforthearmakeinputelement.

Figure2.3andFigure2.4showtheresultingstatianddynami dependeniesforourrunningexample.

Afterthedependenyhek,theformissubmittedandeitheraPOSTorGETHTTPrequest[3℄is

(4)

Fig.2.3. Relationtreeforstatiinputelementsforhttp://www.autosout24 .de

Fig.2.4.Relationtreefordynamiinputelementsforhttp://www.autosout24.d e

2.3. SimulationofWebFormBehavior. Usingthegathereddatawenowhavetwopossibleoptionsto

simulatetheWebformbehavior: weaneitherusethevariablebindingsfortheuser-denedtagstolloutand

submittheWebform online,takingintoonsiderationthedynamiandstatidependeniesorweandiretly

generateaPOST/GETHTTPrequestoine. Forobviousreasons,weusuallyprefertheoinegeneration,but

asisdisussedin Setion5.2itissometimesneessaryto(automatially)llouttheWebformonline.

Suppose theuserprovidedthefollowingvariablebindingsforourrunningarsearhexample

Car-Brand=BMW

Car-Model=850

andtheoriginallyapturedrequestURLwas

http://www.autosout24.de/List.aspx

withthefollowingsearhpart

vis=1&make=9&model=16581&...

Now werst math the tagsto the orresponding URLeld and the string representationto the assoiated

value,yielding

(5)

Thesetwokey-valuepairsan thenbeinsertedintheoriginalsearhpart,whihgivesusthenewsearhpart:

vis=1&make=13&model=1664&...

Depending ifaPOST orGETrequestisrequired,thevariablebindings areeitherenodedin thebody of

themessageordiretlyintheURL.

AftertheHTTPrequestissendtotheserver,weeitherdiretlygetbaktheresultpage,oralternativelyan

intermediatepage. Inthelatteraseweautomatiallynavigatetotheresultpagebasedonthe

Page-Keyword-Ation paradigm,whihispresentedinthenextsetion.

3. Deep Web navigation. Thenavigationmodel isa ruial partof oursystem. Based on themodel

thesystemandetermineanytime,ifithasalreadyreahedtheresultpageorifitisonanintermediatepage.

Additionallythemodeldeterminestheations,whihshouldbeperformedforaspeiintermediatepage,e.g.

tolikonalink orlloutanewform eld. Thekeyideaof ourPage-Keyword-Ationparadigmisthat the

systemrstdeterminesitsloation(intermediatevs. resultpage)basedonapagekeywordandtheninvokesa

seriesofassoiatedationsifappropriate.

Fig.3.1. Navigationproess

3.1. DeepWebNavigation. TheoverallnavigationproessisillustratedinFigure3.1: theuserprovides

thesystemwithavaluemapthatontainsforeahdesiredinputelementlabel/valueombinations. Iftheform

eld ontainsdynami input elementsfor whih she hasprovided inputlabel/valueombinationswehek if

theyarelegal. Ifso,wesubsequentlylloutandsubmittheformeldwiththeseombinations,whihyieldsa

newWebpage (additionally,weusetheinformationobtainedduring form analysisfor diretlygenerating the

requestPOST/GETURL;therebyweanoinemimithebehavioroftheformeld). Forthis Webpagewe

hek, if wean ndoneof ourdened keywords(f. Setion 3.2). If so, weperform theassoiated ations

whih resultin anewWebpageand hekagainifweareonaintermediatepage. Theyleontinuesaslong

asweanndkeywordsontheWebpage. Toavoidaninniteloop,theuseranspeifyanupperboundonthe

numberofpossibleintermediate pages,after whih anerrormessageisreturned. Ifweannot ndakeyword

ontheurrentWebpage,wehavefoundthegoalpageandreturnitsURL.

3.2. Intermediate Page Keyword. DeepWebpagesaretypiallyreateddynamially,i. e. datafrom

a bakground database is lled into a predened presentation template. Therefore, we an usually identify

xedelements,whih arepartofthetemplateandarealmostidentialbetweendierentmanifestations. After

theform analysisis nishedtheuseraniterativelysubmittheform withdierentoptions. If aninputvalue

ombination leads her to an intermediate page, she an identify the relevant keyword as desribed in the

following. If she has already reahed a result page for a valueombination no further user interations are

required. Note that as long as she is in the ontext of the urrentlyative form eld, she an also aess a

series of intermediate pages and for eah page speify aseries of ations. For the identiation of a spei

intermediatepageweoptedforastatitexteld. ThereasonisthatitanbeinludedinmanyHTMLelements,

e.g. thediv, h2,orthe spantag and givenour templateassumption they serveasasuientdisriminatory

fator. Other moreadvaned tehniquesbasedonvisual markerson the pageormoreIR-related tehniques,

(6)

Fig.3.2.Intermediatepagesforhttp://www.autosout24.de (left)andhttp://www.imdb.om (right)

Themostlikelyandidateswhiharemostharateristiareenirledwithanellipse,e.g.theerrormessage

forthearsearhservieshownontheleft. Aftertheuserhasidentiedthekeywordinthepage,sheannow

speifyationsthatshould beperformedinordertoreahtheresultpage.

3.3. Intermediate Page Ations. The above speied keywordsan be used to identify intermediate

pages. However,ourultimategoalistondaresultpagegivenasetofinputvalueombinationsfortheinitial

formeld. Thereforesomeations,suhaslikingonalinkorllingoutandsubmittinganew(intermediate)

form, have to be performed to aess the nextpreferably resultpage. In order to uniquely identify the

appropriate HTML elements on whih the stored ations should be exeuted, we dened a path addressing

languagealled KApath, whih is a semanti subsetof XPath [5℄. In order to aess theappropriate ation

element, thesystemrstndstheommon anestorof thekeywordelementandtheation elementand then

desendsdownwardsintheationelementbranh. Afterwards,theregisteredationsareexeutedforthefound

ationelement. Thus,KApathsupportsthefollowingpathexpressions:

/Node[aname 1

=avalue

1

=℄ . . . [aname

n

=avalue

n

℄: The element in the DOM tree that

mathesthespeiedattributename-valueombinationsoftypeNode,

/P: Immediateparentnodeofurrentnode,

/P::P:All(transitive)parentnodesofurrentnode,

/P::P/Node[aname 1

=avalue

1

℄. . . [aname

n

=avalue

n

℄: Therstfound parentnodein the

DOMtreethatmathesthespeiedattributename-valueombinationsstartingfromtheurrentnode

andisoftypeNode,

/Child: Immediate hildnodesofurrentnode,

/Child::Child: All (transitive) hildnodesoftheurrentnode,

/Child::Child/Node[aname 1

=avalue

1

℄ . . . [aname

n

=avalue

n

℄: Therstfound hildnode

intheDOMtreethatmathesthespeiedattributename-valueombinationsstartingfromtheurrent

nodeandisoftypeNode.

Figure 3.3 showsanexample howthe assoiatedation element in apageanbereferened with respet

to thepagekeywordwithaKApathexpression. Here,the TBODYnode isthe rstommonparentnodefor

both(keywordandation)elements. ThereforethesystemautomatiallygeneratesaKApathexpressionwhih

allowsoptionalintermediateelementsbetweenthekeywordandtherstommonparentnode. Forndingthe

orretationelementitisruialtoonsider itsattributesaswell.

If thedesired ationelementshaveno (e.g. links) ordynami attributes(e.g. visibility),weadditionally

storetheabsolutepathfromkeywordtoationelementandthetreestruturestartingfromtheommonparent.

AnothersituationwhereweanmakeuseoftheabsolutepathiswhentheHTMLpagestruture hashanged

(7)

Fig.3.3.ExampleKAPathexpression,whihallowsoptionalHTMLelementsintheintermediatepage

3.4. Reording User Ations. Based on the user's browsing behavior, the system an generate the

omplete navigation model. First, she identies the keyword for an intermediate page by liking on the

relevanttextintheWebpage. Then,thesystemdeterminesthelosestsurroundingHTMLelementandstores

therelevantontextinformation. Afterwards,thesystemmonitorstheuserbehaviorandstoreseahationshe

performs untilshe reahesanewpage. Based onthis ationlog, thesystemanautomatially determinethe

pathsandtreestruturesforeahation.

ToeasethereordingoftheuserationswehaveimplementedWSript,aHTMLationlanguagesimilarto

Chikenfoot [8℄. Thisintermediatesriptlanguageisonvenient,beauseinorder tondtheHTMLelements

onwhih the ationshave to beinvokedwehave to relyon the navigation strutures dened in Setion 3.3.

Therefore,theprovidedationshaveanavigationand (ifappliable)aninputpart.

Thefollowingtypesofationsaresupportedbyoursystem:

Clikingonlinks: link(absolutepath,KApath, treestruture),

Entering textin input elds: enter(absolute path, KApath, treestruture, elementname, element ID,

inputvalue),

Seletingahekboxorradiobutton: lik(absolutepath,KApath,treestruture,elementname,element

ID),

Seleting an option from a dropdown menu: dropdown(absolute path, KApath, tree struture, option

text,element name,elementID),and

Submittingforms: lik(absolutepath,KApath,treestruture,element name,elementID).

TheelementnameandIDthatarepresentforsomeationsareidential tothenameandIDattributesof

theunderlyingHTMLelementandareusedrstto ndtherelevantHTMLelement. IfthelookupbyIDand

namefails, thesearhforthe ationelement ontinues withtheKApathasdesribedin Setions3.3 and4.2.

ForexamplethefollowingationexpressionwouldenterHalloWorld intothetextinputeldoftheHTMLtree

inFigure3.3:

enter(/ParentNode/ParentNode/ParentNode/Child[1℄/Child[1℄/Child[0℄

,/P::P/TBODY[a1=v1℄[a2=v2℄/Child::Child/INPUT[a3=v3℄,

TBODY(TR,TR)/TR(TD,TD)/TD(INPUT),Hallo World)

Together, keyword and the assoiated ations form the navigation model for this intermediate page (f.

Figure3.3).

4. Implementation. In this setion, we desribein detailthe implementation of DNavigator. Beause

the framework is geared towardsasual Web users, important requirements must be met, most notably the

(8)

provides JavaSript with the ability to reate and manipulate standardJava objets so that the systeman

onnettothedatabase,e.g. tostoretheextrateddynamidependeniesandthenavigationmodel,andfeth

thepredenednavigationmodelsfrom thedatabasetomanipulateanintermediatepage.

Fig.4.1. Systemarhiteture

Therestofthissetiondesribestheimplementationaswellasthemainissueswesolvedwhileimplementing

thesystem.

4.1. Navigation Model Creation. The system traks a user's navigation ations on an intermediate

page by adding JavaSript event handlers to Web pages before reording. These eventhandlers are invoked

when ertain user ations our (e.g., liking on text, liking on a link, hanging the seleted optionin a

dropdownmenu,et.), whihare supportedbyoursystem. Thereordingproessisasfollows: whentheuser

pressestheanalysisbuttonintheFirefoxpluginwindow,thesystemsetseventhandlersonalllikableelements

in the pagedisplayed in thebrowser(i. e., handlers forlinks, handlers forforms, et.). Whenan eventres,

thesystemreordsalltheneessaryinformationfortheevent,e.g. KApath,absolutepath,treestruture et.

It must thenwait until thefollowingpage is loaded to repeatthe proess ofadding handlers and waiting for

events.

InordertodeterminetheKApath,absolutepathandtreestruturewithrespettothekeywordandation

elementthesystemtraversestheDoumentObjetModel[1℄treestartingfrombothelements.

4.2. Deep Web Navigation. After submittingthe oinegenerated HTTP request, therst web page

isreturnedfrom thetarget server. Thesysteminsertsan onload handler intheWeb pageto detetwhenthe

pagehasbeenompletelyloaded. Then,afterthepagehasbeendownloaded,thenavigationisinvoked,i.e. the

systemwillhekwhetheroneofourpredenedkeywordelementsexistinthisreturnedpage. Ifthisisthease,

itisanintermediatepage. BeauseforeahkeywordelementwehavesaveditsHTMLtype,attributessequene

and ontainedtext, thehek proesswasrealized in JavaSriptusing the doument objet and node objet

based onthe DOM tree, i. e. with the method getElementsByTagName(). The systemrst nds all HTML

elementsthathaveidentialHTMLtypesaskeywordelement. Aftertheomparisonbetweentheattributesof

these found elements and thestored attributes of thekeywordelement, andbetweenthe saved keyword text

and thefound keywordtext, the systemandetermine whethera saved keyword(keywordelement) existsin

thisintermediatepage.

For any intermediatepage anumber of relatedations must be performed, so that the systemis able to

navigate in thediretion of the resultpage. Before suh ationsare exeuted, the systemmust rstnd the

ation-relatedelements,i.e. wemustndallHTMLelements,onwhihtheation-eventshavetobeativated.

ForthisgoalweuseWSriptthatwaspresentedin Setion3.4. Theassoiatedsriptwillbefethed fromthe

databaseusingthe Website IDand theidentied keywords. Afterwards, aninterpreterfuntion is invoked to

parseandexeuteeveryWSriptexpressionstepbystep. Here,weiterativelyusethefollowingapproahes:

(9)

2. Otherwise,wetrytondtheationelementbasedontheKApath-expression.

3. Finally, iftheationelementafter exeutingthersttwostrategiesannotbefound,thesystemuses

theabsolute pathandthetreestruturetoloatetheationelement.

Theexeutionofrelatedationsis simulatedusingDOMLevel2events[1,2℄,i. e. fakeeventobjetsare

reatedusingthedoument.reateEvent()method. Afterwards,theyareativatedonthedesiredationelement

usingtheelement.dispathEvent() method.

5. Evaluation. In our experiments, we evaluated the following aspets for our twomajor omponents:

auray andruntime. For this, we seleted100 DeepWebsites from dierent domains, e.g. ar searhand

video searh. 60 of them were diretly adopted from the website table in [7℄, beause they ontain a large

amountof data. Theotherswere seletedby afousedsearhonGoogleon DeepWebrepositories. Forafull

listofthetestedWebsiteswerefertheinterestedreaderto[25℄.

5.1. ExperimentalResults. AllexperimentshavebeenondutedonaThinkpadT60(IntelCoreDuo

2ProessorT72002,00Ghzwitha667MHzfrontsidebusand2GBofmainmemory)runningWindowsVista,

MySQLServer5.0, JavaJDK1.6andFirefox2.0.0.12. Themaximaldownloadrateoftheinternetonnetion

was2048Kbit/sand themaximumuploadrate256Kbit/s.

5.1.1. FrontendAnalysis. For99%ofthetestedWebsitesthefrontendanalysiswassuessful,nding

theorretstatianddynamidependenies. Dependingonthenumberofitemsinthedropdownmenusofthe

formelds, thetimeneeded foranalysistookfrom 0.5to 30seonds, i.e. 4.28seondsonaverage. Sinethis

analysishasonlytobeperformedone,wefeelthat performaneoptimizationsforthisanalysis areoflimited

benet,beauseourmajorfousisonorretlyidentifyinghiddendependeniesbetweenthedropdownmenus.

Table5.1

Time(inseonds)fornavigationexperiments.

# Int. Pages #WebSites PageLoad 1Model 6Models

0 58 2.25 2.26 2.31

1 22 - 4.60 4.66

2 14 - 6.47 6.55

3 4 - 8.12 8.23

4 1 - 9.70 9.83

5 1 - 11.06 11.22

5.1.2. Deep Web Navigation. For 96% of the tested Web sites we were able to suessfully nd a

keywordand to navigateto the desiredresult page. Thenavigationproess took from 2.26to 11.22seonds,

i. e. 3.79 seondsonaverage. As shown in Table5.1 mostofthe timewasspend forloadingpages,i. e. 2.25

seondson average. Theolumns labeled1 Model and 6Models indiate thenumberofregisterednavigation

modelsfor eah page. As anbeseen,theoverhead forheking multiple modelswasmarginalin ontrastto

thetimespentforloadingpages. Thisisdue tothefat thattheexeutionoftheationsisperformedbythe

browseronthelientsideandsinenoomputationallyintensivealgorithmisrequiredtoidentifyintermediate

pages.

5.2. Open Issues. Ourevaluationrevealedthefollowingopenissuesofoursystem.

5.2.1. FrontendAnalysis.

DelayedAJAXinterations: ForoneWeb sitewewereunable toorretlydetet thedynami

depen-deniesbeausetheservertooklongerthanourspeiedthresholdtohangetheitemsintherespetive

dropdownmenu.

Thisouldberemediedbyinreasingourthresholdvaluetosomeextent,butfurtherinvestigationisneeded

tondageneralsolutionforthisproblem.

5.2.2. DeepWeb Navigation.

Dynami requestURLs: Usually, dierent requestURLs only dierin thesearhpart, i.e.the partof

(10)

Hidden form elements: Sinethe useranonlydrag labelsto visible form elements, valuesin hidden

formelementsthat havetobeorrelatedwithvisibleelementsannotbedetetedbyoursystem.

SessionIDs: SessionIDs areoften usedto trakuser interationswith Web pagesand areonly valid

foraertainperiod. Beause weare notableto produeanew(fake) session IDforeah servie,the

oinegeneratedURLbeomesinvalidovertime.

All oftheabovementionedissuesouldbesolvedbyllingoutthefrontend format runtimeandskipping

theoinegenerationoftheURLforsuhresoures.

Stati URLs: Oursystemdetermines,ifanewWebpagehasbeenloaded basedontheurrentURL.

IftheURLdoesnothangeafteraformhasbeensubmitted,wearenotabletoinitiatethenavigation

proessandaddtherequiredeventhandlerasdesribedin Setion4.2.

This an be solved by using another metri for determining if a new Web page has been loaded, e.g.

aheksumoftheWebpage.

6. Related Work. [22℄presentsaframeworkalled DEQUEforqueryingWebformswhereinputvalues

are allowed from relations as well as from result pages. As a part of their system they also model Web

form interfaes, but their fous is moreon the modeling of onseutive forms and they did not onsider the

dependeniesbetweenforminputelements.

AnumberofnavigationoneptshavebeenproposedforaessingDeepWebsoures.[10℄and[18℄proposed

proess-orientednavigationmaps,whih desribeasetof pathsfrom astart pageto aresult page. Butthese

maps relyononseutivestatetransitions andxed interations betweenthem. In[16℄ theuserationsfrom

aspeied start page overpossibly multiple intermediate pages to an end page are reorded in a navigation

map. The ations that link two adjaent pages are strongly onneted as well. A sophistiated Deep Web

navigationstrategybasedonthebranhednavigationmodelisproposedin[6℄. Thenavigationisrepresentedas

asequeneofpages,withenvisionedfuturesupportforstandardproess-owlanguagessuhasWS-BPEL[4℄.

In[21℄ anavigationsequene wasspeiedin NESQL[20℄. The NESQLexpression ontainsmetadataabout

ation elements, for instane, their speiednames and types. Eah expression will beinterpreted based on

theseelementproperties. Bystoringhistorialinformationfrom previousaessesofaDeepWebresoureand

utilizingbrowserpools,theirsystemtriestoreuse theurrentstateofabrowser.[24℄desribeasystemalled

WebVCR,whih isableto reordand replayaseries ofbrowserstepsasasmartbookmark, but theydonot

onsideroptionalintermediatepages.

Ourframeworkisnotdependentonarigidsequeneof intermediatepages,beauseforeah newpageall

keywordpatterns are heked and therefore the previousstate of the system is not important for our

page-orientednavigationmodel. Besides,wedonotneedaomplexnavigationalgebraoralulusforthenavigation

proess beause we just savetheabovedesribed navigation model foreah intermediatepage. Forinstane,

the framework proposed by [10℄ relies on a subset of serial-Horn Transation F-Logi [17℄. As disussed in

Setion 3.4, the saved ation sequenes are just maro proedures, whih are interpreted by our JavaSript

maroengine.

7. Future Work. At the moment we only perform a hard string math between user inputs and the

options in adropdown menu. If the stringsdo notmath exatly an error is returned. At the moment we

are investigating approximate string mathing tehniques [9℄ to alleviate this problem to some extent. An

alternativewouldbeto usesemantisimilaritymetris,suh asproposedin [27℄, whihwouldalso be ableto

apture thesimilaritybetweenthetwoar ompaniesToyota andLexus (a division of Toyota). Thework by

[26℄triestoautomatetheextrationofqueryapabilities,suhaslabelingforminputelementsandndinglegal

rangesofinputvalues. Thisouldbeinterestingto ombine withourapproahto suggesttagstotheuser, or

to tryto math thelabelsontheWebform withthe tagsin theuservoabularyandthuseasing thelabeling

oftheWebforms.

Ourexperiments suggestthat the determination ofasuitable keywordis ruial forthesuessful

identi-ation of an intermediate page, and that for someases it mightbe better to skip the oine generation of

the startURL. Currently, we are extendingourresearh prototype to aeptalist of keywordsand work on

analgorithm toautomatially suggestmeaningfulanddisriminatorykeywords. Ultimately, weareinterested

in generalizing theonept of immediate pageidentiation to moreelaborate tehniques,suh asthe visual

appearaneof theWebpage.

(11)

be performedone, and afterwardsthesysteman simulate thebehavioroftheWebform oine. Seond, we

haveproposed asimplebut eetiveDeepWeb navigationstrategy,whihreplaesaheavy-weightnavigation

aluluswithanintermediatepageidentiationproedureandasetofationsthatnavigatetothenextpage.

Theproposednavigationstrategyhasthefollowingbenets:

1. Itis stateless. Beause foreah page,wehekall availablenavigationmodels, wearenotdependent

onaspeinavigationorder.

2. Simpleextensibility. Ifthesystemenountersanewandsofarunknownimmediatepage,theuseran

easilyextendtheexisting navigationmodelwithonlyafewsteps.

3. Simplepresentation ofthe model. Eahnavigationmodelhasanintuitivetextualrepresentationwhih

iseasierto understandandusethanaompliatednavigationalulus.

Tosumup,DNavigatoroersasimpleuserinterfae,butsuessfullydealswithmostoftheproblemsthatare

posed byreal-worldDeepWebsitesasourevaluationhasshown.

REFERENCES

[1℄ Doumentobjetmodel(dom). http://www.w3.org/DOM/

[2℄ Doumentobjetmodel(dom)level2eventsspeiation.http://www.w3.org/TR/DO M-L evel -2-Eve nts/

[3℄ Hypertexttransferprotoolhttp/1.1(rf2616). http://tools.ietf.org/htm l/r f26 16/

[4℄ Web servies business proess exeution language version 2.0. http://www-128.ibm.om/ dev elop erwo rks /lib rar y/

speifiation/ws-bpel/

[5℄ Xmlpathlanguage(xpath)version1.0.http://www.w3.org/TR/xpa th

[6℄ R.Baumgartner,M.Ceresna,andG.Ledermüller,Deepwebnavigationinwebdataextration,inProeedingsofthe

2005InternationalConfereneonComputationalIntelligeneforModelling,ControlandAutomation,andInternational

ConfereneonIntelligentAgents,WebTehnologiesandInternetCommere.,Vienna,Austria,2005,pp.698703.

[7℄ M.K.Bergman,Thedeepweb: Surfainghiddenvalue,whitepaper.http://www.brightplanet. om /ima ges/ sto ries /pd f/

deepwebwhitepaper.pdf 2001.

[8℄ M.Bolin, M.Webber,P.Rha,T.Wilson,andR.C. Miller,Automationandustomizationofrendered webpages,

inProeedingsofthe18thAnnualACMSymposiumonUserInterfaeSoftwareandTehnology,Seattle,WA,USA,2005,

pp.163172.

[9℄ W. W. Cohen,P.Ravikumar,andS.E.Fienberg,Aomparisonof stringdistanemetris forname-mathing tasks,

inProeedingsoftheWorkshoponInformationIntegrationontheWeb,Aapulo,Mexio,pp.7378.

[10℄ H.Davulu,J.Freire,M.Kifer,andI.V.Ramakrishnan,Alayeredarhitetureforqueryingdynamiwebontent,in

ProeedingsoftheACMSIGMODInternationalConfereneonManagementofData,Philadelphia,Pennsylvania,USA,

19999,pp.491502.

[11℄ D.Flanagan,JavaSript: TheDenitiveGuide,FourthEdition,O’Reilly,Sebastopol,CA,USA,2001.

[12℄ B. He, M. Patel, Z.Zhang, andK. C. C. Chang,Aessing thedeep web,Communiationsof theACM,50(2007),

pp.94101.

[13℄ H.He, W. Meng, C.T. Yu, andZ.Wu,Wise-integrator: Asystem forextratingand integratingomplexwebsearh

interfaesof thedeep web,inProeedingsofthe 31stInternational Confereneon VeryLargeDataBases,Trondheim,

Norway,2005,pp.13141317.

[14℄ P.Heymann,G.Koutrika,andH.Garia-Molina,Cansoialbookmarkingimprovewebsearh?,inProeedingsofthe

InternationalConfereneonWebSearhandWebDataMining,PaloAlto,California,USA,2008,pp.195206.

[15℄ T. Hornung, K. Simon, and G. Lausen, Mashing up the deep webresearh in progress, in Proeedings of the 4th

InternationalConfereneonWebInformationSystemsandTehnologies,Funhal,MadeiraPortugal,2008,pp.5866.

[16℄ N.Julasana,A.Khandelwal,A.Lolage,P.Singh,P.Vasudevan,H.Davulu,andI.V.Ramakrishnan,Winagent:

Asystem for reating and exeutingpersonal information assistantsusinga webbrowser, inProeedingsofthe 2004

InternationalConfereneonIntelligentUserInterfaes,Funhal,Madeira,Portugal,2004,pp.356357.

[17℄ M. Kifer,Dedutive and objet-oriented datalanguages: Aquestfor integration,inProeedingsof the4th International

ConfereneonDedutiveandObjet-OrientedDatabases,Singapore,1995,pp.187212.

[18℄ J.P.Lage,A.S.daSilva,P.B.Golgher,andA.H.F.Laender,Colletinghiddenwebpagesfordataextration,in

Proeedingsofthe4thACMCIKMInternationalWorkshoponWebInformationandDataManagement,Virginia,USA,

2002,pp.6975.

[19℄ K.Nigam,A.MCallum,S.Thrun,andT.M.Mithell,Learningtolassifytextfromlabeledandunlabeleddouments,

inProeedingsofthe15thNationalConfereneonArtiialIntelligeneandTenthInnovativeAppliationsofArtiial

IntelligeneConferene,Wisonsin,USA,1998,pp.792799.

[20℄ A. Pan, J. Raposo, M. Alvarez, J. Hidalgo, and A. Vinaet, Semi-automati wrapper generation for ommerial

web soures, inProeedingsof the WorkingConferene on EngineeringinformationSystems inthe Internet Context,

Kanazawa,Japan,2002,pp.265283.

[21℄ J.Raposo,M.Alvarez,J.Losada,andA.Pan,Maintainingwebnavigationowsforwrappers,inProeedingsofthe

2nd International Workshopon DataEngineeringIssuesinE-CommereandServies,San Franiso,CA,USA,2006,

pp.100114.

(12)

[23℄ K.SimonandG. Lausen,Viper: Augmentingautomati informationextrationwithvisualpereption,inProeedingsof

the2005ACMCIKMInternationalConfereneonInformationandKnowledgeManagement,Bremen, Germany,2005,

pp.381388.

[24℄ A.Vinod, J.Freire,B.Kumar, andD.F. Lieuwenet,Automatingwebnavigationwiththewebvr,inProeedingsof

the9thInternationalWorldWideWebConferene,Amsterdam,TheNetherlands,2000,pp.503517.

[25℄ Y. Wang, Deep web navigation by example, master's thesis, Institute of Computer Siene, Albert-Ludwigs University

Freiburg,2008.

[26℄ Z.Zhang,B.He, andK.C. C.Chang,Understandingweb queryinterfaes: Best-eortparsing withhiddensyntax,in

ProeedingsoftheACMSIGMODInternationalConfereneonManagementofData,Paris,Frane,2004,pp.107118.

[27℄ C. N. Ziegler,K. Simon, andG. Lausen, Automatiomputationof semantiproximity using taxonomi knowledge,

inProeedings of the ACM CIKMInternational Conferene on Informationand Knowledge Management, Arlington,

Virginia,USA,2006,pp.465474.

Editedby: DominikFlejter,TomaszKazmarek,MarekKowalkiewiz

Reeived: January11th,2008

Aepted: Marh 19th,2008

References

Related documents

With a membership of 60,000 nationwide, the National Student Nurses’ Association mentors the professional development of future registered nurses and facilitates their entrance

As competitive density increases, agency theory pre- dicts higher revenues and the transaction cost perspective predicts higher expenses for a relational supplier learning

 With 141 meter frontage, located on 60 With 141 meter frontage, located on 60 meter main sector road to be meter main sector road to be developed which developed which.

We show that miR-30s overexpression in human breast cancer cells reduces tumor cell invasion, osteomimicry and bone destruction in vivo by directly targeting multiple

Note that in addition to the standard side navigation menu, this web page provides a top navigation menu to allow configuration of advanced modem features such as traffic

In creating a new page you will be Naming the page, choosing what type of page and then Placing the page where you want it in the Navigation tree.. Click on ADD NAVIGATION on

Studies that had previously reported the economic burden on primary caregivers, either as a review paper or in a systematic review, used different methodological approaches that

Concerning the parameter estimation, we have carried out a One-Step construction based on the optimal IC and compared this One-Step estimator with MoM, MLE, and M-estimator