DEEP WEB NAVIGATIONBY EXAMPLE
YANG WANG AND THOMAS HORNUNG
∗
Abstrat. LargeportionsoftheWebareburiedbehinduser-orientedinterfaes,whihanonlybeaessedbyllingoutforms.
Tomakethethereinontainedinformationaessibletoautomatiproessing,oneofthemajorhurdlesistonavigatetotheatual
resultpage. Inthispaperwepresentaframeworkfornavigatingtheseso-alledDeepWebsitesbasedonthepage-keyword-ation
paradigm: thesystemllsout formswithprovidedinputparametersand then submitstheform. Afterwards itheksifithas
alreadyfoundaresultpagebylookingforpre-speiedkeywordpatternsintheurrentpage.Basedontheoutomeeitherfurther
ationstoreaharesultpageareexeutedortheresultingURLisreturned.
Keywords: formanalysis,deepWebnavigationbypage-keyword-ations
1. Introdution. The Web an be lassied into two ategories with respet to aess patterns: the
SurfaeWebandtheDeepWeb[7℄. TheSurfaeWebonsistsofstatiandpublilyavailableWebpages,whih
ontainlinksto otherpagesand anberepresentedasadireted graph. This Webgraphanbetraversedby
rawlers(alsoknownasspiders)andthefoundpagesarethentraditionallyindexedbysearhengines.
The Deep Web in ontrastonsists of dynamially generated result pages of numerous databases, whih
anbe queriedviaaWebform. These pagesannot bereahedbyfollowinglinks from other pagesand it is
thereforehallengingto index theirontent. Figure1depitsthegeneralinteration patternbetweentheuser
and aDeepWeb site. Theuserlls outtheform eld with the desiredinformation (1) andthe Web form is
sent to the serverwhere it is transformed in a database query. In this phase it is possible, that the system
needsfurther userinput due toambiguityin the underlying data,e.g. there mightbe toomany resultsfora
query, andtheuserhasto providefurther informationonintermediatepages(2). Finally, theWebserverhas
gatheredallneessaryinformationtogeneratetheresultpageand itisdeliveredtotheuser(3).
[12℄disoveredanexponentialgrowthandgreat subjetdiversityof theseDeep Websites. Among others
theyarrivedatthefollowingonlusions:
•
Thereareapproximately43.00096.000Web-aessibledatabases,•
TheDeepWebis400500timeslargerthantheSurfaeWeb,•
95%oftheavailabledataandinformationontheDeepWebisfreelyavailable.Taking into aount this vast amountof high-quality data, whih is gearedtowards human visitors, it is not
surprisingthatmanydierentresearhquestionsareativelypursuedinthisareaatthemoment,e.g. vertial
searhengines[13℄.
Inthis paperwepresent DNavigator, aframework to automatethe neessaryinteration stepsto obtain
datafromDeepWebsites. Theideaistoreordtheuserinterationsfromtheinitial Webformtothedesired
result page. These interations are generalized, where two dierent phases an be distinguished: the lling
outandsubmissionofthefrontendform(f.(1)in Figure1)andtheationsperformedonintermediatepages
(f.(2)inFigure1). Consequently,DNavigatoronsistsoftwoomponents: WebformanalysisandDeepWeb
navigationmodeling.
Thisframework hasbeenmotivated bytheFireSearhprojet[15℄,whihisgearedtowardsolletingand
analyzingDeepWebdata atquerytime. Theultimateextrationandlabelingof datafrom theresultpage is
donewiththeViPER extrationsystem[23℄. However,astheframeworkhasbeenimplementedinJavaSript
andJavaasaFirefoxpluginitouldbeusedwithminormodiationsinotherprojets,e.g.foradomain-spei
MetaSearhengine,wheretherelevantDeepWebsouresouldbeintegratedbyaninterestedommunity,as
well.
Thepaperisstruturedasfollows:westartwithadesriptionofthetwomainomponentsofourframework,
namelytheanalysisofform eldsinSetion2andthenavigationmodelinSetion 3. Setion4dealswiththe
intriaiesofimplementingourresearhprototypeintheFirefoxbrowser. InSetion5wepresentanevaluation
ofoursystemandinSetion6wedisussrelatedwork.Finally,wegiveanoutlookonfutureworkinSetion7
andonludewithSetion 8.
∗
InstituteofComputerSiene,Albert-LudwigsUniversityFreiburg,Germany,
Fig.1.1.Aessing aDeepWebsite
2. Form analysis. Web forms are omnipresent: whether the user searhes for information on Google,
partiipatesin anonlinevote,orommentsanentryin ablog,she alwaysprovidesinformation viallingout
andsubmittingaform. Onamoretehniallevel,eahinputelement(intheontextofthispaperwereferto
allelementsin theform eldthat anbeprovidedwithavalue,e.g. hekboxes,asinputelements) ofaWeb
formisassoiatedwithauniqueIDandonsubmissionoftheformthevalueassignmentsareenodedaseither
GETorPOSTHTTPrequest[3℄.
Fig.2.1. WebformwithHTMLrepresentation
Figure 2.1 shows an exampleof asimple Web form. The uniqueID for theinput element labeled First
name iss1andthustheassoiatedHTTPGETrequestlooksasfollows:
GET /searh.gi?s1=Yang&s2=Wang HTTP/1.1
Host: www.example.org
User-Agent: Mozilla/4.0
Aept: image/gif, image/jpeg, */*
Connetion: lose
InSetion2.1wedisusshowtomapuser-denedlabelsontoinputelements,whilewedealinSetion 2.2
2.1. Labeling of Input Elements. Initially for eah newWeb page westore allourring forms with
allinput elements, IDs and therange oflegal values(i. e. for dropdownmenu lists, this would be theset of
legaloptions),inadatabaseforlateranalysis. Afterwardstheuseranloadthedesiredformeldandlabelthe
desiredinputelements,e.g. inFigure2.2themaximumdesiredpriethevisitoriswillingtopayforausedar
hasbeenlabeled Prie-To. Thelabelingof theWeb forms isinspiredby theidea ofsoialbookmarking[14℄:
eah user has apersonal, evolvingvoabulary of tags. Here a tagis aombination of astring label with an
XMLdatatype[15℄. Theexamplein Figure2.2 showstheuservoabularyin theupperorner,wherethesize
ofthelabelsis determinedbythefrequenytheyhavebeenusedbefore.
Overallshehaslabeledsix inputelements,e.g. thedesiredbrandandthemakeof thear. Nowwehek
foreahlabeledinputelement,iftheyarestatiorifthereareanydynamidependenies,whihmightbedue
toAjaxinterationswiththeserver. Note,that onlytheseinputelementsoftheformanbeusedlateronfor
queryingthat havebeenlabeledin thisstage.
Fig.2.2. Labelingofinputelements
OurrunningexampleistheanalysisofaWebsearhengineforusedars(http://www.autosout24.de),
whereeah armodeldependsonitsarmake. Theotherinputelementsarestati,i.e.theydonothangeif
oneoftheotherinputelementsishanging.
2.2. Dependeny Chek of Input Elements. The dynami and stati ombinationsare determined
automatiallyaftertheuserhasnishedlabelingthedesiredinputelementsbasedonthefollowingidea:modify
therstdropdownmenu(onlydropdownmenusareurrentlyonsideredasandidatesfordynamielements,all
otherinputelementtypesareassumedtobestatibydefault)andhekallotherlabeleddropdownmenus,ifthe
availableoptionshavehanged. Ifthisisthease,thenmodifythedependentdropdownmenutounoverlayered
dependeniesandmarkthedependentmenuasdynami. Afteralldropdownmenushavebeenheked,wemark
allmenusthatarenotdynamiasstati. Toavoidloops,weonlyhekpossibledropdownmenusthathavenot
partiipatedin adependenyin theurrentanalysisylebefore, e.g. in theexampleshownin Figure 2.2the
armodelwouldnotbeonsideredifwehekforfurtherdynamidependeniesforthearmakeinputelement.
Figure2.3andFigure2.4showtheresultingstatianddynami dependeniesforourrunningexample.
Afterthedependenyhek,theformissubmittedandeitheraPOSTorGETHTTPrequest[3℄is
Fig.2.3. Relationtreeforstatiinputelementsforhttp://www.autosout24 .de
Fig.2.4.Relationtreefordynamiinputelementsforhttp://www.autosout24.d e
2.3. SimulationofWebFormBehavior. Usingthegathereddatawenowhavetwopossibleoptionsto
simulatetheWebformbehavior: weaneitherusethevariablebindingsfortheuser-denedtagstolloutand
submittheWebform online,takingintoonsiderationthedynamiandstatidependeniesorweandiretly
generateaPOST/GETHTTPrequestoine. Forobviousreasons,weusuallyprefertheoinegeneration,but
asisdisussedin Setion5.2itissometimesneessaryto(automatially)llouttheWebformonline.
Suppose theuserprovidedthefollowingvariablebindingsforourrunningarsearhexample
Car-Brand=BMW
Car-Model=850
andtheoriginallyapturedrequestURLwas
http://www.autosout24.de/List.aspx
withthefollowingsearhpart
vis=1&make=9&model=16581&...
Now werst math the tagsto the orresponding URLeld and the string representationto the assoiated
value,yielding
Thesetwokey-valuepairsan thenbeinsertedintheoriginalsearhpart,whihgivesusthenewsearhpart:
vis=1&make=13&model=1664&...
Depending ifaPOST orGETrequestisrequired,thevariablebindings areeitherenodedin thebody of
themessageordiretlyintheURL.
AftertheHTTPrequestissendtotheserver,weeitherdiretlygetbaktheresultpage,oralternativelyan
intermediatepage. Inthelatteraseweautomatiallynavigatetotheresultpagebasedonthe
Page-Keyword-Ation paradigm,whihispresentedinthenextsetion.
3. Deep Web navigation. Thenavigationmodel isa ruial partof oursystem. Based on themodel
thesystemandetermineanytime,ifithasalreadyreahedtheresultpageorifitisonanintermediatepage.
Additionallythemodeldeterminestheations,whihshouldbeperformedforaspeiintermediatepage,e.g.
tolikonalink orlloutanewform eld. Thekeyideaof ourPage-Keyword-Ationparadigmisthat the
systemrstdeterminesitsloation(intermediatevs. resultpage)basedonapagekeywordandtheninvokesa
seriesofassoiatedationsifappropriate.
Fig.3.1. Navigationproess
3.1. DeepWebNavigation. TheoverallnavigationproessisillustratedinFigure3.1: theuserprovides
thesystemwithavaluemapthatontainsforeahdesiredinputelementlabel/valueombinations. Iftheform
eld ontainsdynami input elementsfor whih she hasprovided inputlabel/valueombinationswehek if
theyarelegal. Ifso,wesubsequentlylloutandsubmittheformeldwiththeseombinations,whihyieldsa
newWebpage (additionally,weusetheinformationobtainedduring form analysisfor diretlygenerating the
requestPOST/GETURL;therebyweanoinemimithebehavioroftheformeld). Forthis Webpagewe
hek, if wean ndoneof ourdened keywords(f. Setion 3.2). If so, weperform theassoiated ations
whih resultin anewWebpageand hekagainifweareonaintermediatepage. Theyleontinuesaslong
asweanndkeywordsontheWebpage. Toavoidaninniteloop,theuseranspeifyanupperboundonthe
numberofpossibleintermediate pages,after whih anerrormessageisreturned. Ifweannot ndakeyword
ontheurrentWebpage,wehavefoundthegoalpageandreturnitsURL.
3.2. Intermediate Page Keyword. DeepWebpagesaretypiallyreateddynamially,i. e. datafrom
a bakground database is lled into a predened presentation template. Therefore, we an usually identify
xedelements,whih arepartofthetemplateandarealmostidentialbetweendierentmanifestations. After
theform analysisis nishedtheuseraniterativelysubmittheform withdierentoptions. If aninputvalue
ombination leads her to an intermediate page, she an identify the relevant keyword as desribed in the
following. If she has already reahed a result page for a valueombination no further user interations are
required. Note that as long as she is in the ontext of the urrentlyative form eld, she an also aess a
series of intermediate pages and for eah page speify aseries of ations. For the identiation of a spei
intermediatepageweoptedforastatitexteld. ThereasonisthatitanbeinludedinmanyHTMLelements,
e.g. thediv, h2,orthe spantag and givenour templateassumption they serveasasuientdisriminatory
fator. Other moreadvaned tehniquesbasedonvisual markerson the pageormoreIR-related tehniques,
Fig.3.2.Intermediatepagesforhttp://www.autosout24.de (left)andhttp://www.imdb.om (right)
Themostlikelyandidateswhiharemostharateristiareenirledwithanellipse,e.g.theerrormessage
forthearsearhservieshownontheleft. Aftertheuserhasidentiedthekeywordinthepage,sheannow
speifyationsthatshould beperformedinordertoreahtheresultpage.
3.3. Intermediate Page Ations. The above speied keywordsan be used to identify intermediate
pages. However,ourultimategoalistondaresultpagegivenasetofinputvalueombinationsfortheinitial
formeld. Thereforesomeations,suhaslikingonalinkorllingoutandsubmittinganew(intermediate)
form, have to be performed to aess the nextpreferably resultpage. In order to uniquely identify the
appropriate HTML elements on whih the stored ations should be exeuted, we dened a path addressing
languagealled KApath, whih is a semanti subsetof XPath [5℄. In order to aess theappropriate ation
element, thesystemrstndstheommon anestorof thekeywordelementandtheation elementand then
desendsdownwardsintheationelementbranh. Afterwards,theregisteredationsareexeutedforthefound
ationelement. Thus,KApathsupportsthefollowingpathexpressions:
•
/Node[aname 1=avalue
1
=℄ . . . [aname
n
=avalue
n
℄: The element in the DOM tree that
mathesthespeiedattributename-valueombinationsoftypeNode,
•
/P: Immediateparentnodeofurrentnode,•
/P::P:All(transitive)parentnodesofurrentnode,•
/P::P/Node[aname 1=avalue
1
℄. . . [aname
n
=avalue
n
℄: Therstfound parentnodein the
DOMtreethatmathesthespeiedattributename-valueombinationsstartingfromtheurrentnode
andisoftypeNode,
•
/Child: Immediate hildnodesofurrentnode,•
/Child::Child: All (transitive) hildnodesoftheurrentnode,•
/Child::Child/Node[aname 1=avalue
1
℄ . . . [aname
n
=avalue
n
℄: Therstfound hildnode
intheDOMtreethatmathesthespeiedattributename-valueombinationsstartingfromtheurrent
nodeandisoftypeNode.
Figure 3.3 showsanexample howthe assoiatedation element in apageanbereferened with respet
to thepagekeywordwithaKApathexpression. Here,the TBODYnode isthe rstommonparentnodefor
both(keywordandation)elements. ThereforethesystemautomatiallygeneratesaKApathexpressionwhih
allowsoptionalintermediateelementsbetweenthekeywordandtherstommonparentnode. Forndingthe
orretationelementitisruialtoonsider itsattributesaswell.
If thedesired ationelementshaveno (e.g. links) ordynami attributes(e.g. visibility),weadditionally
storetheabsolutepathfromkeywordtoationelementandthetreestruturestartingfromtheommonparent.
AnothersituationwhereweanmakeuseoftheabsolutepathiswhentheHTMLpagestruture hashanged
Fig.3.3.ExampleKAPathexpression,whihallowsoptionalHTMLelementsintheintermediatepage
3.4. Reording User Ations. Based on the user's browsing behavior, the system an generate the
omplete navigation model. First, she identies the keyword for an intermediate page by liking on the
relevanttextintheWebpage. Then,thesystemdeterminesthelosestsurroundingHTMLelementandstores
therelevantontextinformation. Afterwards,thesystemmonitorstheuserbehaviorandstoreseahationshe
performs untilshe reahesanewpage. Based onthis ationlog, thesystemanautomatially determinethe
pathsandtreestruturesforeahation.
ToeasethereordingoftheuserationswehaveimplementedWSript,aHTMLationlanguagesimilarto
Chikenfoot [8℄. Thisintermediatesriptlanguageisonvenient,beauseinorder tondtheHTMLelements
onwhih the ationshave to beinvokedwehave to relyon the navigation strutures dened in Setion 3.3.
Therefore,theprovidedationshaveanavigationand (ifappliable)aninputpart.
Thefollowingtypesofationsaresupportedbyoursystem:
•
Clikingonlinks: link(absolutepath,KApath, treestruture),•
Entering textin input elds: enter(absolute path, KApath, treestruture, elementname, element ID,inputvalue),
•
Seletingahekboxorradiobutton: lik(absolutepath,KApath,treestruture,elementname,elementID),
•
Seleting an option from a dropdown menu: dropdown(absolute path, KApath, tree struture, optiontext,element name,elementID),and
•
Submittingforms: lik(absolutepath,KApath,treestruture,element name,elementID).TheelementnameandIDthatarepresentforsomeationsareidential tothenameandIDattributesof
theunderlyingHTMLelementandareusedrstto ndtherelevantHTMLelement. IfthelookupbyIDand
namefails, thesearhforthe ationelement ontinues withtheKApathasdesribedin Setions3.3 and4.2.
ForexamplethefollowingationexpressionwouldenterHalloWorld intothetextinputeldoftheHTMLtree
inFigure3.3:
enter(/ParentNode/ParentNode/ParentNode/Child[1℄/Child[1℄/Child[0℄
,/P::P/TBODY[a1=v1℄[a2=v2℄/Child::Child/INPUT[a3=v3℄,
TBODY(TR,TR)/TR(TD,TD)/TD(INPUT),Hallo World)
Together, keyword and the assoiated ations form the navigation model for this intermediate page (f.
Figure3.3).
4. Implementation. In this setion, we desribein detailthe implementation of DNavigator. Beause
the framework is geared towardsasual Web users, important requirements must be met, most notably the
provides JavaSript with the ability to reate and manipulate standardJava objets so that the systeman
onnettothedatabase,e.g. tostoretheextrateddynamidependeniesandthenavigationmodel,andfeth
thepredenednavigationmodelsfrom thedatabasetomanipulateanintermediatepage.
Fig.4.1. Systemarhiteture
Therestofthissetiondesribestheimplementationaswellasthemainissueswesolvedwhileimplementing
thesystem.
4.1. Navigation Model Creation. The system traks a user's navigation ations on an intermediate
page by adding JavaSript event handlers to Web pages before reording. These eventhandlers are invoked
when ertain user ations our (e.g., liking on text, liking on a link, hanging the seleted optionin a
dropdownmenu,et.), whihare supportedbyoursystem. Thereordingproessisasfollows: whentheuser
pressestheanalysisbuttonintheFirefoxpluginwindow,thesystemsetseventhandlersonalllikableelements
in the pagedisplayed in thebrowser(i. e., handlers forlinks, handlers forforms, et.). Whenan eventres,
thesystemreordsalltheneessaryinformationfortheevent,e.g. KApath,absolutepath,treestruture et.
It must thenwait until thefollowingpage is loaded to repeatthe proess ofadding handlers and waiting for
events.
InordertodeterminetheKApath,absolutepathandtreestruturewithrespettothekeywordandation
elementthesystemtraversestheDoumentObjetModel[1℄treestartingfrombothelements.
4.2. Deep Web Navigation. After submittingthe oinegenerated HTTP request, therst web page
isreturnedfrom thetarget server. Thesysteminsertsan onload handler intheWeb pageto detetwhenthe
pagehasbeenompletelyloaded. Then,afterthepagehasbeendownloaded,thenavigationisinvoked,i.e. the
systemwillhekwhetheroneofourpredenedkeywordelementsexistinthisreturnedpage. Ifthisisthease,
itisanintermediatepage. BeauseforeahkeywordelementwehavesaveditsHTMLtype,attributessequene
and ontainedtext, thehek proesswasrealized in JavaSriptusing the doument objet and node objet
based onthe DOM tree, i. e. with the method getElementsByTagName(). The systemrst nds all HTML
elementsthathaveidentialHTMLtypesaskeywordelement. Aftertheomparisonbetweentheattributesof
these found elements and thestored attributes of thekeywordelement, andbetweenthe saved keyword text
and thefound keywordtext, the systemandetermine whethera saved keyword(keywordelement) existsin
thisintermediatepage.
For any intermediatepage anumber of relatedations must be performed, so that the systemis able to
navigate in thediretion of the resultpage. Before suh ationsare exeuted, the systemmust rstnd the
ation-relatedelements,i.e. wemustndallHTMLelements,onwhihtheation-eventshavetobeativated.
ForthisgoalweuseWSriptthatwaspresentedin Setion3.4. Theassoiatedsriptwillbefethed fromthe
databaseusingthe Website IDand theidentied keywords. Afterwards, aninterpreterfuntion is invoked to
parseandexeuteeveryWSriptexpressionstepbystep. Here,weiterativelyusethefollowingapproahes:
2. Otherwise,wetrytondtheationelementbasedontheKApath-expression.
3. Finally, iftheationelementafter exeutingthersttwostrategiesannotbefound,thesystemuses
theabsolute pathandthetreestruturetoloatetheationelement.
Theexeutionofrelatedationsis simulatedusingDOMLevel2events[1,2℄,i. e. fakeeventobjetsare
reatedusingthedoument.reateEvent()method. Afterwards,theyareativatedonthedesiredationelement
usingtheelement.dispathEvent() method.
5. Evaluation. In our experiments, we evaluated the following aspets for our twomajor omponents:
auray andruntime. For this, we seleted100 DeepWebsites from dierent domains, e.g. ar searhand
video searh. 60 of them were diretly adopted from the website table in [7℄, beause they ontain a large
amountof data. Theotherswere seletedby afousedsearhonGoogleon DeepWebrepositories. Forafull
listofthetestedWebsiteswerefertheinterestedreaderto[25℄.
5.1. ExperimentalResults. AllexperimentshavebeenondutedonaThinkpadT60(IntelCoreDuo
2ProessorT72002,00Ghzwitha667MHzfrontsidebusand2GBofmainmemory)runningWindowsVista,
MySQLServer5.0, JavaJDK1.6andFirefox2.0.0.12. Themaximaldownloadrateoftheinternetonnetion
was2048Kbit/sand themaximumuploadrate256Kbit/s.
5.1.1. FrontendAnalysis. For99%ofthetestedWebsitesthefrontendanalysiswassuessful,nding
theorretstatianddynamidependenies. Dependingonthenumberofitemsinthedropdownmenusofthe
formelds, thetimeneeded foranalysistookfrom 0.5to 30seonds, i.e. 4.28seondsonaverage. Sinethis
analysishasonlytobeperformedone,wefeelthat performaneoptimizationsforthisanalysis areoflimited
benet,beauseourmajorfousisonorretlyidentifyinghiddendependeniesbetweenthedropdownmenus.
Table5.1
Time(inseonds)fornavigationexperiments.
# Int. Pages #WebSites PageLoad 1Model 6Models
0 58 2.25 2.26 2.31
1 22 - 4.60 4.66
2 14 - 6.47 6.55
3 4 - 8.12 8.23
4 1 - 9.70 9.83
5 1 - 11.06 11.22
5.1.2. Deep Web Navigation. For 96% of the tested Web sites we were able to suessfully nd a
keywordand to navigateto the desiredresult page. Thenavigationproess took from 2.26to 11.22seonds,
i. e. 3.79 seondsonaverage. As shown in Table5.1 mostofthe timewasspend forloadingpages,i. e. 2.25
seondson average. Theolumns labeled1 Model and 6Models indiate thenumberofregisterednavigation
modelsfor eah page. As anbeseen,theoverhead forheking multiple modelswasmarginalin ontrastto
thetimespentforloadingpages. Thisisdue tothefat thattheexeutionoftheationsisperformedbythe
browseronthelientsideandsinenoomputationallyintensivealgorithmisrequiredtoidentifyintermediate
pages.
5.2. Open Issues. Ourevaluationrevealedthefollowingopenissuesofoursystem.
5.2.1. FrontendAnalysis.
•
DelayedAJAXinterations: ForoneWeb sitewewereunable toorretlydetet thedynamidepen-deniesbeausetheservertooklongerthanourspeiedthresholdtohangetheitemsintherespetive
dropdownmenu.
Thisouldberemediedbyinreasingourthresholdvaluetosomeextent,butfurtherinvestigationisneeded
tondageneralsolutionforthisproblem.
5.2.2. DeepWeb Navigation.
•
Dynami requestURLs: Usually, dierent requestURLs only dierin thesearhpart, i.e.the partof•
Hidden form elements: Sinethe useranonlydrag labelsto visible form elements, valuesin hiddenformelementsthat havetobeorrelatedwithvisibleelementsannotbedetetedbyoursystem.
•
SessionIDs: SessionIDs areoften usedto trakuser interationswith Web pagesand areonly validforaertainperiod. Beause weare notableto produeanew(fake) session IDforeah servie,the
oinegeneratedURLbeomesinvalidovertime.
All oftheabovementionedissuesouldbesolvedbyllingoutthefrontend format runtimeandskipping
theoinegenerationoftheURLforsuhresoures.
•
Stati URLs: Oursystemdetermines,ifanewWebpagehasbeenloaded basedontheurrentURL.IftheURLdoesnothangeafteraformhasbeensubmitted,wearenotabletoinitiatethenavigation
proessandaddtherequiredeventhandlerasdesribedin Setion4.2.
This an be solved by using another metri for determining if a new Web page has been loaded, e.g.
aheksumoftheWebpage.
6. Related Work. [22℄presentsaframeworkalled DEQUEforqueryingWebformswhereinputvalues
are allowed from relations as well as from result pages. As a part of their system they also model Web
form interfaes, but their fous is moreon the modeling of onseutive forms and they did not onsider the
dependeniesbetweenforminputelements.
AnumberofnavigationoneptshavebeenproposedforaessingDeepWebsoures.[10℄and[18℄proposed
proess-orientednavigationmaps,whih desribeasetof pathsfrom astart pageto aresult page. Butthese
maps relyononseutivestatetransitions andxed interations betweenthem. In[16℄ theuserationsfrom
aspeied start page overpossibly multiple intermediate pages to an end page are reorded in a navigation
map. The ations that link two adjaent pages are strongly onneted as well. A sophistiated Deep Web
navigationstrategybasedonthebranhednavigationmodelisproposedin[6℄. Thenavigationisrepresentedas
asequeneofpages,withenvisionedfuturesupportforstandardproess-owlanguagessuhasWS-BPEL[4℄.
In[21℄ anavigationsequene wasspeiedin NESQL[20℄. The NESQLexpression ontainsmetadataabout
ation elements, for instane, their speiednames and types. Eah expression will beinterpreted based on
theseelementproperties. Bystoringhistorialinformationfrom previousaessesofaDeepWebresoureand
utilizingbrowserpools,theirsystemtriestoreuse theurrentstateofabrowser.[24℄desribeasystemalled
WebVCR,whih isableto reordand replayaseries ofbrowserstepsasasmartbookmark, but theydonot
onsideroptionalintermediatepages.
Ourframeworkisnotdependentonarigidsequeneof intermediatepages,beauseforeah newpageall
keywordpatterns are heked and therefore the previousstate of the system is not important for our
page-orientednavigationmodel. Besides,wedonotneedaomplexnavigationalgebraoralulusforthenavigation
proess beause we just savetheabovedesribed navigation model foreah intermediatepage. Forinstane,
the framework proposed by [10℄ relies on a subset of serial-Horn Transation F-Logi [17℄. As disussed in
Setion 3.4, the saved ation sequenes are just maro proedures, whih are interpreted by our JavaSript
maroengine.
7. Future Work. At the moment we only perform a hard string math between user inputs and the
options in adropdown menu. If the stringsdo notmath exatly an error is returned. At the moment we
are investigating approximate string mathing tehniques [9℄ to alleviate this problem to some extent. An
alternativewouldbeto usesemantisimilaritymetris,suh asproposedin [27℄, whihwouldalso be ableto
apture thesimilaritybetweenthetwoar ompaniesToyota andLexus (a division of Toyota). Thework by
[26℄triestoautomatetheextrationofqueryapabilities,suhaslabelingforminputelementsandndinglegal
rangesofinputvalues. Thisouldbeinterestingto ombine withourapproahto suggesttagstotheuser, or
to tryto math thelabelsontheWebform withthe tagsin theuservoabularyandthuseasing thelabeling
oftheWebforms.
Ourexperiments suggestthat the determination ofasuitable keywordis ruial forthesuessful
identi-ation of an intermediate page, and that for someases it mightbe better to skip the oine generation of
the startURL. Currently, we are extendingourresearh prototype to aeptalist of keywordsand work on
analgorithm toautomatially suggestmeaningfulanddisriminatorykeywords. Ultimately, weareinterested
in generalizing theonept of immediate pageidentiation to moreelaborate tehniques,suh asthe visual
appearaneof theWebpage.
be performedone, and afterwardsthesysteman simulate thebehavioroftheWebform oine. Seond, we
haveproposed asimplebut eetiveDeepWeb navigationstrategy,whihreplaesaheavy-weightnavigation
aluluswithanintermediatepageidentiationproedureandasetofationsthatnavigatetothenextpage.
Theproposednavigationstrategyhasthefollowingbenets:
1. Itis stateless. Beause foreah page,wehekall availablenavigationmodels, wearenotdependent
onaspeinavigationorder.
2. Simpleextensibility. Ifthesystemenountersanewandsofarunknownimmediatepage,theuseran
easilyextendtheexisting navigationmodelwithonlyafewsteps.
3. Simplepresentation ofthe model. Eahnavigationmodelhasanintuitivetextualrepresentationwhih
iseasierto understandandusethanaompliatednavigationalulus.
Tosumup,DNavigatoroersasimpleuserinterfae,butsuessfullydealswithmostoftheproblemsthatare
posed byreal-worldDeepWebsitesasourevaluationhasshown.
REFERENCES
[1℄ Doumentobjetmodel(dom). http://www.w3.org/DOM/
[2℄ Doumentobjetmodel(dom)level2eventsspeiation.http://www.w3.org/TR/DO M-L evel -2-Eve nts/
[3℄ Hypertexttransferprotoolhttp/1.1(rf2616). http://tools.ietf.org/htm l/r f26 16/
[4℄ Web servies business proess exeution language version 2.0. http://www-128.ibm.om/ dev elop erwo rks /lib rar y/
speifiation/ws-bpel/
[5℄ Xmlpathlanguage(xpath)version1.0.http://www.w3.org/TR/xpa th
[6℄ R.Baumgartner,M.Ceresna,andG.Ledermüller,Deepwebnavigationinwebdataextration,inProeedingsofthe
2005InternationalConfereneonComputationalIntelligeneforModelling,ControlandAutomation,andInternational
ConfereneonIntelligentAgents,WebTehnologiesandInternetCommere.,Vienna,Austria,2005,pp.698703.
[7℄ M.K.Bergman,Thedeepweb: Surfainghiddenvalue,whitepaper.http://www.brightplanet. om /ima ges/ sto ries /pd f/
deepwebwhitepaper.pdf 2001.
[8℄ M.Bolin, M.Webber,P.Rha,T.Wilson,andR.C. Miller,Automationandustomizationofrendered webpages,
inProeedingsofthe18thAnnualACMSymposiumonUserInterfaeSoftwareandTehnology,Seattle,WA,USA,2005,
pp.163172.
[9℄ W. W. Cohen,P.Ravikumar,andS.E.Fienberg,Aomparisonof stringdistanemetris forname-mathing tasks,
inProeedingsoftheWorkshoponInformationIntegrationontheWeb,Aapulo,Mexio,pp.7378.
[10℄ H.Davulu,J.Freire,M.Kifer,andI.V.Ramakrishnan,Alayeredarhitetureforqueryingdynamiwebontent,in
ProeedingsoftheACMSIGMODInternationalConfereneonManagementofData,Philadelphia,Pennsylvania,USA,
19999,pp.491502.
[11℄ D.Flanagan,JavaSript: TheDenitiveGuide,FourthEdition,OReilly,Sebastopol,CA,USA,2001.
[12℄ B. He, M. Patel, Z.Zhang, andK. C. C. Chang,Aessing thedeep web,Communiationsof theACM,50(2007),
pp.94101.
[13℄ H.He, W. Meng, C.T. Yu, andZ.Wu,Wise-integrator: Asystem forextratingand integratingomplexwebsearh
interfaesof thedeep web,inProeedingsofthe 31stInternational Confereneon VeryLargeDataBases,Trondheim,
Norway,2005,pp.13141317.
[14℄ P.Heymann,G.Koutrika,andH.Garia-Molina,Cansoialbookmarkingimprovewebsearh?,inProeedingsofthe
InternationalConfereneonWebSearhandWebDataMining,PaloAlto,California,USA,2008,pp.195206.
[15℄ T. Hornung, K. Simon, and G. Lausen, Mashing up the deep webresearh in progress, in Proeedings of the 4th
InternationalConfereneonWebInformationSystemsandTehnologies,Funhal,MadeiraPortugal,2008,pp.5866.
[16℄ N.Julasana,A.Khandelwal,A.Lolage,P.Singh,P.Vasudevan,H.Davulu,andI.V.Ramakrishnan,Winagent:
Asystem for reating and exeutingpersonal information assistantsusinga webbrowser, inProeedingsofthe 2004
InternationalConfereneonIntelligentUserInterfaes,Funhal,Madeira,Portugal,2004,pp.356357.
[17℄ M. Kifer,Dedutive and objet-oriented datalanguages: Aquestfor integration,inProeedingsof the4th International
ConfereneonDedutiveandObjet-OrientedDatabases,Singapore,1995,pp.187212.
[18℄ J.P.Lage,A.S.daSilva,P.B.Golgher,andA.H.F.Laender,Colletinghiddenwebpagesfordataextration,in
Proeedingsofthe4thACMCIKMInternationalWorkshoponWebInformationandDataManagement,Virginia,USA,
2002,pp.6975.
[19℄ K.Nigam,A.MCallum,S.Thrun,andT.M.Mithell,Learningtolassifytextfromlabeledandunlabeleddouments,
inProeedingsofthe15thNationalConfereneonArtiialIntelligeneandTenthInnovativeAppliationsofArtiial
IntelligeneConferene,Wisonsin,USA,1998,pp.792799.
[20℄ A. Pan, J. Raposo, M. Alvarez, J. Hidalgo, and A. Vinaet, Semi-automati wrapper generation for ommerial
web soures, inProeedingsof the WorkingConferene on EngineeringinformationSystems inthe Internet Context,
Kanazawa,Japan,2002,pp.265283.
[21℄ J.Raposo,M.Alvarez,J.Losada,andA.Pan,Maintainingwebnavigationowsforwrappers,inProeedingsofthe
2nd International Workshopon DataEngineeringIssuesinE-CommereandServies,San Franiso,CA,USA,2006,
pp.100114.
[23℄ K.SimonandG. Lausen,Viper: Augmentingautomati informationextrationwithvisualpereption,inProeedingsof
the2005ACMCIKMInternationalConfereneonInformationandKnowledgeManagement,Bremen, Germany,2005,
pp.381388.
[24℄ A.Vinod, J.Freire,B.Kumar, andD.F. Lieuwenet,Automatingwebnavigationwiththewebvr,inProeedingsof
the9thInternationalWorldWideWebConferene,Amsterdam,TheNetherlands,2000,pp.503517.
[25℄ Y. Wang, Deep web navigation by example, master's thesis, Institute of Computer Siene, Albert-Ludwigs University
Freiburg,2008.
[26℄ Z.Zhang,B.He, andK.C. C.Chang,Understandingweb queryinterfaes: Best-eortparsing withhiddensyntax,in
ProeedingsoftheACMSIGMODInternationalConfereneonManagementofData,Paris,Frane,2004,pp.107118.
[27℄ C. N. Ziegler,K. Simon, andG. Lausen, Automatiomputationof semantiproximity using taxonomi knowledge,
inProeedings of the ACM CIKMInternational Conferene on Informationand Knowledge Management, Arlington,
Virginia,USA,2006,pp.465474.
Editedby: DominikFlejter,TomaszKazmarek,MarekKowalkiewiz
Reeived: January11th,2008
Aepted: Marh 19th,2008