Optimal Crawling Strategies for Web Search Engines

J.L. Wolf, M.S. Squillante, P.S. Yu
IBM Watson Research Center
{jlwolf,mss,psyu}@us.ibm.com

J. Sethuraman
IEOR Department
Columbia University
jay@ieor.columbia.edu

L. Ozsen
OR Department
Northwestern University
ozsen@yahoo.com
ABSTRACT

Web search engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: the frequency with which a client makes a search engine query and then clicks on a returned url only to find that the result is incorrect. The first part of our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems which are highly computationally efficient -- crucial for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.
Categories and Subject Descriptors

H.3 [Information Systems]: Information Storage and Retrieval; G.2,3 [Mathematics of Computing]: Discrete Mathematics, Probability and Statistics

General Terms

Algorithms, Performance, Design, Theory
Copyright is held by the author/owner(s).
WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA.
ACM 1-58113-449-5/02/0005.

1. INTRODUCTION

Web search engines play a vital role on the World Wide Web, since they provide for many clients the first pointers to web pages of interest. Such search engines employ crawlers to build local repositories containing web pages, which they then use to build data structures useful to the search process. For example, an inverted index is created that typically consists of, for each term, a sorted list of its positions in the various web pages.
On the other hand, web pages are frequently updated by their owners [8, 21, 27], sometimes modestly and sometimes more significantly. A large study in [2] notes that 23% of the web pages change daily, while 40% of commercial web pages change daily. Some web pages disappear completely, and [2] reports a half-life of 10 days for web pages. The data gathered by a search engine during its crawls can thus quickly become stale, or out of date. So crawlers must regularly revisit the web pages to maintain the freshness of the search engine's data.
In this paper, we propose a two-part scheme to optimize the crawling (or perhaps, more precisely, the recrawling) process. One reasonable goal in such a scheme is the minimization of the average level of staleness over all web pages, and the scheme we propose here can solve this problem. We believe, however, that a slightly different metric provides greater utility. This involves a so-called embarrassment metric: the frequency with which a client makes a search engine query, clicks on a url returned by the search engine, and then finds that the resulting page is inconsistent with respect to the query. In this context, goodness would correspond to the search engine having a fresh copy of the web page. However, badness must be partitioned into lucky and unlucky categories: the search engine can be bad but lucky in a variety of ways. In order of increasing luckiness the possibilities are: (1) the web page might be stale, but not returned to the client as a result of the query; (2) the web page might be stale, returned to the client as a result of the query, but not clicked on by the client; and (3) the web page might be stale, returned to the client as a result of the query, clicked on by the client, but might be correct with respect to the query anyway. So the metric under discussion would only count those queries on which the search engine is actually embarrassed: the web page is stale, returned to the client, who clicks on the url only to find that the page is either inconsistent with respect to the original query, or (worse yet) has a broken link. (According to [17], and noted in [6], up to 14% of the links in search engines are broken, not a good state of affairs.)
The first component of our proposed scheme determines the optimal number of crawls for each web page over a fixed scheduling period, as well as the ideal times at which those crawls should occur; the second component then schedules the crawlers themselves. These two problems are highly interconnected. The same basic scheme can be used to optimize either the staleness or embarrassment metric, and is applicable for a wide variety of stochastic update processes. The ability to handle complex update processes in a general unified framework turns out to be an important advantage and a contribution of our approach in that the update patterns of some classes of web pages appear to follow fairly complex processes, as we will demonstrate. Another important model supported by our general framework is motivated by, for example, an information service that updates its web pages at certain times of the day, if an update to the page is necessary. This case, which we call quasi-deterministic, is characterized by web pages whose updates might be characterized as somewhat more deterministic, in the sense that there are fixed potential times at which updates might or might not occur. Of course, web pages with deterministic updates are a special case of the quasi-deterministic model. Furthermore, the crawling frequency problem can be solved under additional constraints which make its solution more practical in the real world: for example, one can impose minimum and maximum bounds on the number of crawls for a given web page. The latter bound is important because crawling can actually cause performance problems for web sites.
This first component problem is formulated and solved using a variety of techniques from probability theory [23, 28] and the theory of resource allocation problems [12, 14]. We note that the optimization problems described here must be solved for huge instances: the size of the World Wide Web is now estimated at over one billion pages, the update rate of these web pages is large, a single crawler can crawl more than a million pages per day, and search engines employ multiple crawlers. (Actually, [2] notes that their own crawler can handle 50-100 crawls per second, while others can handle several hundred crawls per second. We should note, however, that crawling is often restricted to less busy periods in the day.) A contribution of this paper is the introduction of state-of-the-art resource allocation algorithms to solve these problems.
The second component of our scheme employs as its input the output from the first component. (Again, this consists of the optimal numbers of crawls and the ideal crawl times.) It then finds an optimal achievable schedule for the crawlers themselves. This is a parallel machine scheduling problem [3, 20] because of the multiple crawlers. Furthermore, some of the scheduling tasks have release dates, because, for example, it is not useful to schedule a crawl task on a quasi-deterministic web page before the potential update takes place. Our solution is based on network flow theory, and can be posed specifically as a transportation problem [1]. This problem must also be solved for enormous instances, and again there are fast algorithms available at our disposal. Moreover, one can impose additional real-world constraints, such as restricted crawling times for a given web page.
We know of relatively few related papers in the research literature. Perhaps the most relevant is [6]. (See also [2] for a more general survey article.) In [6] the authors essentially introduce and solve a version of the problem of finding the optimal number of crawls per page. They employ a staleness metric and assume a Poisson update process. Their algorithm solves the resulting continuous resource allocation problem by the use of Lagrange multipliers. In [7] the authors consider a related problem (with general crawl time distributions), with weights proportional to the page update frequencies. They present a heuristic to handle large problem instances. The problem of optimizing the number of crawlers is tackled in [26], based on a queueing-theoretic analysis, in order to avoid the two extremes of starvation and saturation. In summary, there is some literature on the first component of our crawler optimization scheme, though we have noted above several potential advantages of our approach. To our knowledge, this is the first paper that meaningfully examines the corresponding scheduling problem which is the second component of our scheme.
Another important aspect of our study concerns the statistical properties of the update patterns for web pages. This clearly is a critical issue for the analysis of the crawling problem, but unfortunately there appears to be very little in the literature on the types of update processes found in practice. To the best of our knowledge, the sole exception is a recent study [19] which suggests that the update processes for pages at a news service web site are not Poisson. Given the assumption of Poisson update processes in most previous studies, and to further investigate the prevalence of Poisson update processes in practice, we analyze the page update data from a highly accessed web site serving highly dynamic pages. A representative sample of the results from our analysis is presented and discussed. Most importantly, these results demonstrate that the interupdate processes span a wide range of complex statistical properties across different web pages and that these processes can differ significantly from a Poisson process. By supporting in our general unified approach such complex update patterns (including the quasi-deterministic model) in addition to the Poisson case, we believe that our optimal scheme can provide even greater benefits in real-world environments.
The remainder of this paper is organized as follows. Section 2 describes our formulation and solution for the twin problems of finding the optimal number of crawls and the idealized crawl times. We loosely refer to this first component as the optimal frequency problem. Section 3 contains the formulation and solution for the second component of our scheme, namely the scheduling problem. Section 4 discusses some issues of parameterizing our approach, including several empirical results on update pattern distributions for real web pages based on traces from a production web site. In Section 5 we provide experimental results showing both the quality of our solutions and their running times. Section 6 contains conclusions and areas for future work.
2. CRAWLING FREQUENCY PROBLEM

2.1 General Framework
We formulate the crawling frequency problem within the context of a general model framework, based on stochastic marked point processes. This makes it possible for us to study the problem in a unified manner across a wide range of web environments and assumptions. A rigorous formal definition of our general framework and its important mathematical properties, as well as a rigorous formal analysis of various aspects of our general framework, are beyond the scope of the present paper. We therefore sketch here the model framework and an analysis of a specific instance of this framework, omitting additional technical details. Furthermore, see [24] for additional details on stochastic marked point processes.
We denote by $N$ the total number of web pages to be crawled, which shall be indexed by $i$. We consider a scheduling interval of length $T$ as a basic atomic unit of decision making, where $T$ is sufficiently large to support our model assumptions below. The basic idea is that these scheduling intervals repeat every $T$ units of time, and we will make decisions about one scheduling interval using both new data and the results from the previous scheduling interval. Let $R$ denote the total number of crawls possible in a single scheduling interval.
Let $u_{i,n} \in \mathbb{R}_+$ denote the point in time at which the $n$th update of page $i$ occurs, where $0 < u_{i,1} < u_{i,2} < \cdots \le T$, $i \in \{1,2,\ldots,N\}$. Associated with the $n$th update of page $i$ is a mark $k_{i,n} \in \mathbb{K}$, where $k_{i,n}$ is used to represent all important and useful information for the $n$th update of page $i$ and $\mathbb{K}$ is the space of all such marks (called the mark space). Examples of possible marks include information on the probability of whether an update actually occurs at the corresponding point in time (e.g., see Section 2.3.3) and the probability of whether an actual update matters from the perspective of the crawling frequency problem (e.g., a minimal update of the page may not change the results of the search engine mechanisms). The occurrences of updates to page $i$ are then modeled as a stationary stochastic marked point process $U_i = \{(u_{i,n}, k_{i,n}); n \in \mathbb{N}\}$ defined on the state space $\mathbb{R}_+ \times \mathbb{K}$. In other words, $U_i$ is a stochastic sequence of points $\{u_{i,1}, u_{i,2}, \ldots\}$ in time at which updates of page $i$ occur, together with a corresponding sequence of general marks $\{k_{i,1}, k_{i,2}, \ldots\}$ containing information about these updates.
A counting process $N^u_i(t)$ is associated with the marked point process $U_i$ and is given by $N^u_i(t) \equiv \max\{n : u_{i,n} \le t\}$, $t \in \mathbb{R}_+$. This counting process represents the number of updates of page $i$ that occur in the time interval $[0,t]$. The interval of time between the $(n-1)$st and $n$th update of page $i$ is given by $U_{i,n} \equiv u_{i,n} - u_{i,n-1}$, $n \in \mathbb{N}$, where we define $u_{i,0} \equiv 0$ and $k_{i,0} \equiv \emptyset$. The corresponding forward and backward recurrence times are given by $A^u_i(t) \equiv u_{i,N^u_i(t)+1} - t$ and $B^u_i(t) \equiv t - u_{i,N^u_i(t)}$, respectively, $t \in \mathbb{R}_+$. In this paper we shall make the assumption that the time intervals $U_{i,n} \in \mathbb{R}_+$ between updates of page $i$ are independent and identically distributed (i.i.d.) following an arbitrary distribution function $G_i(\cdot)$ with mean $\lambda_i^{-1} > 0$, and thus the counting process $N^u_i(t)$ is a renewal process [23, 28], $i \in \{1,2,\ldots,N\}$. Note that $u_{i,0}$ does not represent the time of an actual update, and therefore the counting process $\{N^u_i(t); t \in \mathbb{R}_+\}$ (starting at time 0) is, more precisely, an equilibrium renewal process (which is an instance of a delayed renewal process) [23, 28].
Suppose we decide to crawl web page $i$ a total of $x_i$ times during the scheduling interval $[0,T]$ (where $x_i$ is a non-negative integer less than or equal to $R$), and suppose we decide to do so at the arbitrary times $0 \le t_{i,1} < t_{i,2} < \cdots < t_{i,x_i} \le T$. Our approach in this paper is based on computing a particular probability function that captures, in a certain sense, whether the search engine will have a stale copy of web page $i$ at an arbitrary time $t$ in the interval $[0,T]$. From this we can in turn compute a corresponding time-average staleness estimate $a_i(t_{i,1},\ldots,t_{i,x_i})$ by averaging this probability function over all $t$ within $[0,T]$. Specifically, we consider the arbitrary time $t_{i,j}$ as falling within the interval $U_{i,N^u_i(t_{i,j})+1}$ between the two updates of page $i$ at times $u_{i,N^u_i(t_{i,j})}$ and $u_{i,N^u_i(t_{i,j})+1}$, and our interest is in a particular time-average measure of staleness with respect to the forward recurrence time $A^u_i(t_{i,j})$ until the next update, given the backward recurrence time $B^u_i(t_{i,j})$. Figure 1 depicts a simple example of this situation.

Figure 1: Example of Stationary Marked Point Process Framework
More formally, we exploit conditional probabilities to define the following time-average staleness estimate:

$$a_i(t_{i,1},\ldots,t_{i,x_i}) = \frac{1}{T} \sum_{j=0}^{x_i} \int_{t_{i,j}}^{t_{i,j+1}} \int_0^\infty \frac{P[A,B,C]}{P[B]} \, g_{B^u_i}(v) \, dv \, dt, \quad (1)$$

where $A = \{U_{i,n_{i,j}+1} \le t - t_{i,j} + v\}$, $B = \{U_{i,n_{i,j}+1} > v\}$, $C = \{k_{i,n_{i,j}+1} \in K\}$, $t_{i,0} \equiv 0$, $t_{i,x_i+1} \equiv T$, $n_{i,j} \equiv N^u_i(t_{i,j})$, $g_{B^u_i}(\cdot)$ is the stationary density for the backward recurrence time, and $K \subseteq \mathbb{K}$ is the mark set of interest for the staleness estimate under consideration. Note that the variable $v$ is used to integrate over all possible values of $B^u_i(t_{i,j}) \in [0,\infty)$. Further observe the dependence of the staleness estimate on the update patterns for web page $i$.
When $K = \mathbb{K}$ (i.e., all marks are considered in the definition of the time-average staleness estimate), then the inner integral above reduces as follows:

$$\int_0^\infty \frac{G_i(t - t_{i,j} + v) - G_i(v)}{1 - G_i(v)} \, g_{B^u_i}(v) \, dv = \int_0^\infty \left[ 1 - \frac{\overline{G}_i(t - t_{i,j} + v)}{\overline{G}_i(v)} \right] g_{B^u_i}(v) \, dv, \quad (2)$$

where $\overline{G}_i(t) \equiv 1 - G_i(t)$ is the tail distribution of the interupdate times $U_{i,n}$, $n \in \mathbb{N}$. From standard renewal theory [23, 28], we have $g_{B^u_i}(t) = \lambda_i \overline{G}_i(t)$, and thus after some simple algebraic manipulations we obtain

$$a_i(t_{i,1},\ldots,t_{i,x_i}) = \frac{1}{T} \sum_{j=0}^{x_i} \int_{t_{i,j}}^{t_{i,j+1}} \left[ 1 - \lambda_i \int_0^\infty \overline{G}_i(t - t_{i,j} + v) \, dv \right] dt. \quad (3)$$
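As a concrete check, Equation (3) can be evaluated numerically for an arbitrary tail distribution $\overline{G}_i$. The following Python sketch does so with simple midpoint Riemann sums; the function name, the truncation point `v_max` for the infinite inner integral, and the grid sizes are our own illustrative choices, not part of the paper.

```python
import math

def staleness_eq3(crawl_times, T, lam, G_bar, v_max=40.0, n=400):
    """Riemann-sum evaluation of the time-average staleness estimate
    of Equation (3): crawl_times are the crawl instants
    0 <= t_1 < ... < t_x <= T, lam is the update rate (the reciprocal
    of the mean interupdate time), and G_bar is the tail distribution
    of the interupdate times. v_max truncates the infinite inner
    integral; n controls the grid resolution."""
    ts = [0.0] + sorted(crawl_times) + [T]
    dv = v_max / n
    total = 0.0
    for j in range(len(ts) - 1):      # intervals [t_{i,j}, t_{i,j+1}]
        a, b = ts[j], ts[j + 1]
        m = max(1, int(n * (b - a) / T))
        dt = (b - a) / m
        for s in range(m):
            t = a + (s + 0.5) * dt    # midpoint rule in t
            # inner integral: \int_0^inf G_bar(t - t_{i,j} + v) dv
            inner = sum(G_bar(t - a + (k + 0.5) * dv)
                        for k in range(n)) * dv
            total += (1.0 - lam * inner) * dt
    return total / T
```

For the exponential tail $\overline{G}_i(t) = e^{-\lambda_i t}$, the inner integral has the closed form $e^{-\lambda_i (t - t_{i,j})}/\lambda_i$, so the numeric result can be checked against the resulting closed-form expression.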
Naturally we would also like the times $t_{i,1},\ldots,t_{i,x_i}$ to be chosen so as to minimize the time-average staleness estimate $a_i(t_{i,1},\ldots,t_{i,x_i})$, given that there are $x_i$ crawls of page $i$. Deferring for the moment the question of how to find these optimal values $t_{i,1},\ldots,t_{i,x_i}$, let us define the function $A_i$ by setting

$$A_i(x_i) = a_i(t_{i,1},\ldots,t_{i,x_i}). \quad (4)$$

Thus, the domain of this function $A_i$ is the set $\{0,\ldots,R\}$. We now must decide how to find the optimal values of the $x_i$ variables. While one would like to choose $x_i$ as large as possible, there is competition for crawls from the other web pages. The goal is thus to minimize

$$\sum_{i=1}^N w_i A_i(x_i) \quad (5)$$

subject to the constraints

$$\sum_{i=1}^N x_i = R, \quad (6)$$

$$x_i \in \{0,\ldots,R\}. \quad (7)$$
Here the weights $w_i$ will determine the relative importance of each web page $i$. If each weight $w_i$ is chosen to be 1, then the problem becomes one of minimizing the time-average staleness estimate across all the web pages $i$. However, we will shortly discuss a way to pick these weights a bit more intelligently, thereby minimizing a metric that computes the level of embarrassment the search engine has to endure.

The optimization problem just posed has a very nice form. Specifically, it is an example of a so-called discrete, separable resource allocation problem. The problem is separable by the nature of the objective function, written as the summation of functions of the individual $x_i$ variables. The problem is discrete because of the second constraint, and a resource allocation problem because of the first constraint. For details on resource allocation problems, we refer the interested reader to [12]. This is a well-studied area in optimization theory, and we shall borrow liberally from that literature. In doing so, we first point out that there exists a dynamic programming algorithm for solving such problems, which has computational complexity $O(NR^2)$.
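To make that baseline concrete, such an $O(NR^2)$ dynamic program can be sketched as follows in Python. The function and variable names are our own, and the functions $A_i$ and weights $w_i$ are assumed to be supplied as callables and a list; the paper itself does not present pseudocode.

```python
def dp_allocation(A, w, R):
    """O(N*R^2) dynamic program for minimizing sum_i w[i]*A[i](x_i)
    subject to sum_i x_i = R with x_i in {0,...,R}. Convexity of the
    A_i is not required. Returns (optimal value, allocation)."""
    N = len(A)
    INF = float("inf")
    # best[r] = minimal cost of distributing r crawls over pages 0..i
    best = [w[0] * A[0](r) for r in range(R + 1)]
    choice = []
    for i in range(1, N):
        new = [INF] * (R + 1)
        arg = [0] * (R + 1)
        for r in range(R + 1):
            for x in range(r + 1):
                v = best[r - x] + w[i] * A[i](x)
                if v < new[r]:
                    new[r], arg[r] = v, x
        best = new
        choice.append(arg)
    # backtrack to recover the allocation
    alloc = [0] * N
    r = R
    for i in range(N - 1, 0, -1):
        alloc[i] = choice[i - 1][r]
        r -= alloc[i]
    alloc[0] = r
    return best[R], alloc
```

The convexity structure discussed next allows much faster algorithms than this quadratic-in-$R$ baseline.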
Fortunately, we will show shortly that the function $A_i$ is convex, which in this discrete context means that the first differences $A_i(j+1) - A_i(j)$ are non-decreasing as a function of $j$. (These first differences are just the discrete analogues of derivatives.) This extra structure makes it possible to employ even faster algorithms, but before we can do so there remain a few important issues. Each of these issues is discussed in detail in the next three subsections, which involve: (1) computing the weights $w_i$ of the embarrassment level metric for each web page $i$; (2) computing the functional forms of $a_i$ and $A_i$ for each web page $i$, based on the corresponding marked point process $U_i$; and (3) solving the resulting discrete, convex, separable resource allocation problem in a highly efficient manner.
It is important to point out that we can actually handle a more general resource allocation constraint than that given in Equation (7). Specifically, we can handle both minimum and maximum numbers of crawls $m_i$ and $M_i$ for each page $i$, so that the constraint instead becomes $x_i \in \{m_i,\ldots,M_i\}$. We can also handle other types of constraints on the crawls that tend to arise in practice [4], but omit details here in the interest of space.
2.2 Computing the Weights $w_i$

Consider Figure 2, which illustrates a decision tree tracing the possible results for a client making a search engine query. Let us fix a particular web page $i$ in mind, and follow the decision tree down from the root to the leaves.

The first possibility is for the page to be fresh: in this case the web page will not cause embarrassment to the search engine. So, assume the page is stale. If the page is never returned by the search engine, there again can be no embarrassment: the search engine is simply lucky in this case.
Figure 2: Embarrassment Level Decision Tree (the branches distinguish whether the page is fresh or stale, whether a stale page is returned, whether a returned page is clicked, and whether the query result is correct; the leaves range from GOOD, through BAD BUT LUCKY, to UGLY)
What happens if the page is returned? A search engine will typically organize its query responses into multiple result pages, and each of these result pages will contain the urls of several returned web pages, in various positions on the page. Let $P$ denote the number of positions on a returned page (which is typically on the order of 10). Note that the position of a returned web page on a result page reflects the ordered estimate of the search engine for the web page matching what the user wants. Let $b_{i,j,k}$ denote the probability that the search engine will return page $i$ in position $j$ of query result page $k$. The search engine can easily estimate these probabilities, either by monitoring all query results or by sampling them for the client queries.

The search engine can still be lucky even if the web page $i$ is stale and returned: a client might not click on the page, and thus never have a chance to learn that the page was stale. Let $c_{j,k}$ denote the frequency that a client will click on a returned page in position $j$ of query result page $k$. These frequencies also can be easily estimated, again either by monitoring or sampling.
One can speculate that this clicking probability function might typically decrease both as a function of the overall position $(k-1)P + j$ of the returned page and as a function of the page $k$ on which it is returned. Assuming a Zipf-like function [29, 15] for the first function and a geometric function to model the probability of cycling through $k-1$ pages to get to returned page $k$, one would obtain a clicking probability function that looks like the one provided in Figure 3. (According to [4] there is some evidence that the clicking probabilities in Figure 3 actually rise rather than fall as a new page is reached. This is because some clients do not scroll down -- some do not even know how to do so. More importantly, [4] notes that this data can actually be collected by the search engine.)
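As an illustration, one hypothetical form of such a clicking function, combining a Zipf-like decay in the overall position with a geometric page-cycling factor, can be sketched as follows. The constants and the function name are purely illustrative assumptions, not values taken from the paper or from [4].

```python
def click_probability(j, k, P=10, theta=0.8, q=0.5, C=0.08):
    """Hypothetical clicking frequency c_{j,k} for position j (1..P)
    on query result page k (1, 2, ...): a Zipf-like decay in the
    overall position (k-1)*P + j, damped by a geometric probability
    q**(k-1) that the client cycles through the first k-1 pages.
    theta, q and the scale C are illustrative constants only."""
    overall = (k - 1) * P + j
    return C * overall ** (-theta) * q ** (k - 1)
```

By construction this model falls within each result page and drops again at each page boundary, i.e., the speculated falling pattern rather than the rising pattern reported in [4].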
Finally, even if the web page is stale, returned by the search engine, and clicked on: the changes to the page might not cause the results of the query to be wrong. This truly lucky loser scenario can be more common than one might initially suspect, because most web pages typically do not change in terms of their basic content, and most client queries will probably try to address this basic content. In this context, let $d_i$ denote the probability that a click on a stale version of page $i$ yields an incorrect response. Once again, this parameter can be easily estimated.

Figure 3: Probability of Clicking as Function of Position/Page
Assuming, as seems reasonable, that these three types of events are independent, one can compute the total level of embarrassment caused to the search engine by web page $i$ as

$$w_i = d_i \sum_j \sum_k c_{j,k} \, b_{i,j,k}. \quad (8)$$

Note that although the functional form of the above equation is a bit involved, the value of $w_i$ is simply a constant from the perspective of the resource allocation problem.
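Computationally, Equation (8) is a straightforward weighted sum. A minimal Python sketch, assuming the estimated probabilities are held in dictionaries keyed by (position, page) pairs -- the data layout is our choice, not the paper's:

```python
def embarrassment_weight(d_i, b_i, c):
    """Embarrassment weight w_i = d_i * sum_{j,k} c_{j,k} * b_{i,j,k}
    of Equation (8). b_i maps (j, k) to the probability that page i
    is returned in position j of result page k; c maps (j, k) to the
    clicking frequency. Both are assumed to have been estimated by
    monitoring or sampling, as described above."""
    return d_i * sum(c[jk] * p for jk, p in b_i.items())
```

Since each $w_i$ is a constant, these weights can be computed once per scheduling interval before the resource allocation problem is solved.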
2.3 Computing the Functions $a_i$ and $A_i$

As previously noted, the functions to be computed in this section depend upon the characteristics of the marked point process $U_i = \{(u_{i,n}, k_{i,n}); n \in \mathbb{N}\}$ which models the update behaviors of each page $i$. We consider three types of marked point processes that represent different cases which are expected to be of interest in practice. The first two cases are based on the use of Equation (3) to compute the functions $a_i$ and $A_i$ under different distributional assumptions for the interupdate times $U_{i,n}$ of the marked point process $U_i$; specifically, we consider $G_i(\cdot)$ to be exponential and general distribution functions, respectively, where the former has been the primary case considered in previous studies and the latter is used to state a few important properties of this general form (which are used in turn to obtain the experimental results presented in Section 4). The third case, which we call quasi-deterministic, is based on a finite number of specific times $u_{i,n}$ at which page $i$ might be updated, where the corresponding marks $k_{i,n}$ represent the probability that the update at time $u_{i,n}$ actually occurs.
2.3.1 Exponential Distribution Function $G_i$

Consider as a simple but prototypical example the case in which the time intervals $U_{i,n} \in \mathbb{R}_+$ between updates of page $i$ are i.i.d. following an exponential distribution with parameter $\lambda_i$, i.e., $G_i(t) = 1 - e^{-\lambda_i t}$ and $\overline{G}_i(t) = e^{-\lambda_i t}$ [28]. In this case we also assume that all updates are of interest irrespective of their associated mark values (i.e., $K = \mathbb{K}$). Suppose as before that we crawl a total of $x_i$ times in the scheduling interval at the arbitrary times $t_{i,1},\ldots,t_{i,x_i}$. It then follows from Equation (3) that the time-average staleness estimate is given by

$$a_i(t_{i,1},\ldots,t_{i,x_i}) = \frac{1}{T} \sum_{j=0}^{x_i} \int_{t_{i,j}}^{t_{i,j+1}} \left[ 1 - \lambda_i \int_0^\infty e^{-\lambda_i (t - t_{i,j} + v)} \, dv \right] dt = \frac{1}{T} \sum_{j=0}^{x_i} \int_{t_{i,j}}^{t_{i,j+1}} \left[ 1 - e^{-\lambda_i (t - t_{i,j})} \right] dt. \quad (9)$$

After some minor algebraic manipulations, we obtain that the time-average staleness estimate is given by

$$a_i(t_{i,1},\ldots,t_{i,x_i}) = 1 + \frac{1}{\lambda_i T} \sum_{j=0}^{x_i} \left[ e^{-\lambda_i (t_{i,j+1} - t_{i,j})} - 1 \right]. \quad (10)$$

Letting $T_{i,j} = t_{i,j+1} - t_{i,j}$ for all $0 \le j \le x_i$, the problem of finding $A_i$ reduces to minimizing

$$1 + \frac{1}{\lambda_i T} \sum_{j=0}^{x_i} \left[ e^{-\lambda_i T_{i,j}} - 1 \right] \quad (11)$$
subject to the constraints

$$0 \le T_{i,j} \le T, \quad (12)$$

$$\sum_{j=0}^{x_i} T_{i,j} = T. \quad (13)$$

Modulo a few constants, which are not important to the optimization problem, this now takes the form of a continuous, convex, separable resource allocation problem. As before, the problem is separable by the nature of the objective function. It is continuous because of the first constraint, and it is a resource allocation problem because of the second constraint; it is also convex for the reasons provided below. The key point is that the optimum value is known to occur at the value $(T_{i,1},\ldots,T_{i,x_i})$ where the derivatives $-T^{-1} e^{-\lambda_i T_{i,j}}$ of the summands in Equation (11) are equal, subject to the constraints $0 \le T_{i,j} \le T$ and $\sum_{j=0}^{x_i} T_{i,j} = T$. The general result, originally due to [9], appeared in the seminal paper in the theory of resource allocation problems, and there exist several fast algorithms for finding the values of $T_{i,j}$. See [12] for a good exposition of these algorithms. In our special case of the exponential distribution, however, the summands are all identical, and thus the optimal decision variables as well can be found by inspection: they occur when each $T_{i,j} = T/(x_i + 1)$. Hence, we can write

$$A_i(x_i) = 1 + \frac{x_i + 1}{\lambda_i T} \left[ e^{-\lambda_i T/(x_i + 1)} - 1 \right], \quad (14)$$

which is easily shown to be convex.
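Equation (14) is simple to evaluate directly. The sketch below (our own naming) also makes it easy to check the convexity claim numerically via the first differences $A_i(j+1) - A_i(j)$:

```python
import math

def A_exponential(x, lam, T):
    """Time-average staleness A_i(x) of Equation (14) for a page with
    exponential(lam) interupdate times and x crawls equally spaced in
    a scheduling interval of length T."""
    n = x + 1
    return 1.0 + n / (lam * T) * (math.exp(-lam * T / n) - 1.0)
```

As expected, $A_i$ decreases toward 0 as the number of crawls grows, and its first differences are non-decreasing, which is exactly the discrete convexity exploited by the fast allocation algorithms of Section 2.4.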
2.3.2 General Distribution Function $G_i$

Now let us consider the same scenario as the previous section but where the distribution of the interupdate times $U_{i,n} \in \mathbb{R}_+$ for page $i$ is an arbitrary distribution $G_i(\cdot)$ with mean $\lambda_i^{-1}$. Then we observe from Equation (3) a few important properties of this general form; in particular, it is clear from this formula that the summands remain separable. Given that all the summands are also identical, the optimal decision variables occur when each $T_{i,j} = T/(x_i + 1)$, as in the exponential case.
2.3.3 Quasi-Deterministic Case

Suppose the marked point process $U_i$ consists of a deterministic sequence of points $\{u_{i,1}, u_{i,2}, \ldots, u_{i,Q_i}\}$ defining possible update times for page $i$, together with a sequence of marks $\{k_{i,1}, k_{i,2}, \ldots, k_{i,Q_i}\}$ defining the probability of whether the corresponding update actually occurs. Here we eliminate the i.i.d. assumption of Section 2.1 and consider an arbitrary sequence of specific times such that $0 \le u_{i,1} < u_{i,2} < \cdots < u_{i,Q_i} \le T$. Recall that $u_{i,0} \equiv 0$, and define $u_{i,Q_i+1} \equiv T$ for convenience. The update at time $u_{i,j}$ occurs with probability $k_{i,j}$. If $k_{i,j} = 1$ for all $j \in \{1,\ldots,Q_i\}$, then the update pattern reduces to being purely deterministic. We shall assume that the value $k_{i,0}$ can be inferred from the crawling strategy employed in the previous scheduling interval(s). Our interest is again in determining the time-average staleness estimate $A_i(x_i)$ for $x_i$ optimally chosen crawls.

A key observation is that all crawls should be done at the potential update times, because there is no reason to delay beyond when the update has occurred. This also implies that we can assume $x_i \le Q_i + 1$, as there is no reason to crawl more frequently. (The maximum of $Q_i + 1$ crawls corresponds to the time 0 and the $Q_i$ other potential update times.) Hence, consider the binary decision variables

$$y_{i,j} = \begin{cases} 1, & \text{if a crawl occurs at time } u_{i,j}, \\ 0, & \text{otherwise}. \end{cases} \quad (15)$$

If we crawl $x_i$ times, then we have $\sum_{j=0}^{Q_i} y_{i,j} = x_i$.
Note that, as a consequence of the above assumptions and observations, the two integrals in Equation (1) reduce to a much simpler form. Specifically, let us consider a staleness probability function $p(y_{i,0},\ldots,y_{i,Q_i}; t)$ at an arbitrary time $t$, which we now compute. Recall that $N^u_i(t)$ provides the index of the latest potential update time that occurs at or before time $t$, so that $N^u_i(t) \le Q_i$. Similarly, define

$$J_i(t) = \left[ \max\{j : u_{i,j} \le t, \; y_{i,j} = 1, \; 0 \le j \le Q_i\} \right]^+, \quad (16)$$

which is the index of the latest potential update time at or before time $t$ that is actually going to be crawled. Clearly we can also unambiguously use $J_{i,j}$ to abbreviate the value of $J_i(t)$ at any time $t$ for which $j = N^u_i(t)$. Now we have

$$p(y_{i,0},\ldots,y_{i,Q_i}; t) = 1 - \prod_{j=J_i(t)+1}^{N^u_i(t)} (1 - k_{i,j}), \quad (17)$$

where a product over the empty set, as per normal convention, is assumed to be 1.
Figure 4 illustrates a typical staleness probability function $p$. (For visual clarity we display the freshness function $1-p$ rather than the staleness function in this figure.) Here the potential update times are noted by circles on the x-axis. Those which are actually crawled are depicted as filled circles, while those that are not crawled are left unfilled. The freshness function jumps to 1 during each interval immediately to the right of a crawl time, and then decreases, interval by interval, as more terms are multiplied into the product (see Equation (17)). The function is constant during each interval -- that is precisely why $J_{i,j}$ can be defined.

Figure 4: Freshness Probability Function for Quasi-Deterministic Web Pages

Now we can compute the corresponding time-average probability estimate as

$$a(y_{i,0},\ldots,y_{i,Q_i}) = \frac{1}{T} \sum_{j=0}^{Q_i} (u_{i,j+1} - u_{i,j}) \left[ 1 - \prod_{k=J_{i,j}+1}^{j} (1 - k_{i,k}) \right]. \quad (18)$$
The question of how to choose the optimal $x_i$ crawl times is perhaps the most subtle issue in this paper. We can write this as a discrete resource allocation problem, namely the minimization of Equation (18) subject to the constraints $y_{i,j} \in \{0,1\}$ and $\sum_{j=0}^{Q_i} y_{i,j} = x_i$. The problem with this is that the decision variables $y_{i,j}$ are highly intertwined in the objective function. While our optimization problem can be solved exactly by the use of so-called non-serial dynamic programming algorithms, as shown in [12], or can be solved as a general integer program, such means to obtain the problem solution will not have good performance. Hence, for reasons we shall describe momentarily, we choose to employ a greedy algorithm for the optimization problem: that is, we first estimate the value of $A_i(1)$ by picking that index $0 \le j \le Q_i$ for which the objective function will decrease the most when $y_{i,j}$ is turned from 0 to 1. In the general inductive step we assume that we are given an estimate for $A_i(x_i - 1)$. Then to compute $A_i(x_i)$ we pick that index $0 \le j \le Q_i$ with $y_{i,j} = 0$ for which the objective function decreases the most upon setting $y_{i,j} = 1$. It can be shown that the greedy algorithm does not, in general, find the optimal solution. However, the average freshness can be easily shown to be an increasing, submodular function (in the number of crawls), and so the greedy algorithm is guaranteed to produce a solution with average freshness at least $(1 - 1/e)$ of the best possible [18]. For the special case we consider, we believe the worst-case performance guarantee of the greedy algorithm is strictly better. We therefore feel comfortable employing the greedy algorithm. Note also that the function $A_i$, when estimated in this way, is convex: if, given two successive greedy choices, the first difference decreases, then the second greedy choice would have been chosen before the first one.
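A brute-force sketch of this greedy procedure in Python follows (our own naming and data layout). Each step re-evaluates Equation (18) for every candidate index, so this sketch runs in $O(Q_i^2 x_i)$ time; it illustrates the method rather than an optimized implementation.

```python
def greedy_crawl_times(u, k, T, x):
    """Greedily choose x crawl times among the potential update times
    u[0..Q] (with u[0] = 0), where k[j] is the probability that the
    update at u[j] actually occurs. Returns the 0/1 crawl indicators
    y and the resulting time-average staleness of Equation (18)."""
    Q = len(u) - 1
    bounds = u + [T]  # u_{i,Q_i+1} = T closes the last interval

    def avg_staleness(y):
        # prod tracks the product of (1 - k[m]) terms since the last crawl
        total, prod = 0.0, 1.0
        for j in range(Q + 1):
            prod = 1.0 if y[j] else prod * (1.0 - k[j])
            total += (bounds[j + 1] - bounds[j]) * (1.0 - prod)
        return total / T

    y = [0] * (Q + 1)
    for _ in range(x):
        # pick the unset index whose activation lowers staleness the most
        best = min((j for j in range(Q + 1) if not y[j]),
                   key=lambda j: avg_staleness(y[:j] + [1] + y[j + 1:]))
        y[best] = 1
    return y, avg_staleness(y)
```

In the purely deterministic special case (all $k_{i,j} = 1$), each greedy step simply claims the longest remaining uncovered interval, which matches the intuition behind Equation (18).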
2.4 Solving the Discrete Separable Convex Resource Allocation Problem

As noted, the optimization problem described above is a special case of a discrete, convex, separable resource allocation problem. The problem of minimizing

$$\sum_{i=1}^N F_i(x_i) \quad (19)$$

subject to the constraints

$$\sum_{i=1}^N x_i = R \quad (20)$$

and

$$x_i \in \{m_i,\ldots,M_i\} \quad (21)$$

with convex $F_i$'s is very well studied in the optimization literature. We point the reader to [12] for details on these algorithms. We content ourselves here with a brief overview.
The earliest known algorithm for discrete, convex, separable resource allocation problems is essentially due to Fox [9]. More precisely, Fox looked at the continuous case, noting that the Lagrange multipliers (or Kuhn-Tucker conditions) implied that the optimal value occurred when the derivatives were as equal as possible, subject to the above constraints. This gave rise to a greedy algorithm for the discrete case which is usually attributed to Fox. One forms a matrix in which the (i,j)th term D_{i,j} is defined to be the first difference: D_{i,j} = F_i(j+1) - F_i(j). By convexity the columns of this matrix are guaranteed to be monotone, and specifically non-decreasing. The greedy algorithm initially sets each x_i to be m_i. It then finds the index i for which x_i + 1 ≤ M_i and the value of the next first difference D_{i,x_i} is minimal. For this index i one increments x_i by 1. Then this process repeats until Equation (20) is satisfied, or until the set of allowable indices empties. (In that case there is no feasible solution.) Note that the first differences are just the discrete analogs of derivatives for the continuous case, and that the greedy algorithm finds a solution in which, modulo Constraint (21), all first differences are as equal as possible. The complexity of the greedy algorithm is O(N + R log N).
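Fox's greedy algorithm can be sketched with a heap of first differences, which is what yields the O(N + R log N) bound (heapify in O(N), then at most R pops and pushes at O(log N) each). The convex cost functions and bounds below are illustrative, not from the paper.

```python
import heapq

def fox_greedy(F, m, M, R):
    # Minimize sum_i F[i](x[i]) s.t. sum_i x[i] = R and m[i] <= x[i] <= M[i],
    # with each F[i] convex, by repeatedly granting one unit of resource to
    # the index with the smallest first difference F[i](x+1) - F[i](x).
    x = list(m)
    heap = [(F[i](x[i] + 1) - F[i](x[i]), i)
            for i in range(len(F)) if x[i] + 1 <= M[i]]
    heapq.heapify(heap)
    while sum(x) < R and heap:
        _, i = heapq.heappop(heap)        # smallest first difference
        x[i] += 1
        if x[i] + 1 <= M[i]:
            heapq.heappush(heap, (F[i](x[i] + 1) - F[i](x[i]), i))
    if sum(x) != R:
        raise ValueError("no feasible solution")
    return x

# Two toy convex costs: quadratic growth vs. linear growth.
x = fox_greedy([lambda v: v * v, lambda v: 3 * v], m=[0, 0], M=[5, 5], R=4)
# x == [2, 2], total cost 2*2 + 3*2 = 10, the discrete optimum here
```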
There is a faster algorithm for our problem, due to Galil and Megiddo [11], which has complexity O(N(log R)^2). The fastest algorithm is due to Frederickson and Johnson [10], and it has complexity O(max{N, N log(R/N)}). The algorithm is highly complex, consisting of three components. The first component eliminates elements of the D matrix from consideration, leaving O(R) elements and taking O(N) time. The second component iterates O(log(R/N)) times, each iteration taking O(N) time. At the end of this component only O(N) elements of the D matrix remain. Finally, the third component is a linear time selection algorithm, finding the optimal value in O(N) time. For full details on this algorithm see [12]. We employ the Frederickson and Johnson algorithm in this paper.
There do exist some alternative algorithms which could be considered for our particular optimization problem. For the quasi-deterministic pages the optimization problem is inherently discrete, but the portions corresponding to other distributions can be considered as giving rise to a continuous problem (which is done in [6]). In the case of distributions for which the expression for A_i is differentiable, and for which the derivative has a closed form expression, there do exist very fast algorithms for (a) solving the continuous case, and (b) relaxing the continuous solution to a discrete solution. So, if all web pages had such distributions the above approach could be attractive. Indeed, if most web pages had such distributions, one could partition the set of web pages into two components. The first set could be solved by continuous relaxation, while the complementary set could be solved by a discrete algorithm such as that given by [10]. As the amount of resource given to one set goes up, the amount given to the other set would go down. So a bracket and bisection algorithm [22], which is logarithmic in complexity, could be quite fast. We shall not pursue this idea further here.
3. CRAWLER SCHEDULING PROBLEM
Given that we know how many crawls should be made for each web page, the question now becomes how to best schedule the crawls over a scheduling interval of length T. (Again, we shall think in terms of scheduling intervals of length T. We are trying to optimally schedule the current scheduling interval using some information from the last one.) We shall assume that there are C possibly heterogeneous crawlers, and that each crawler k can handle S_k crawl tasks in time T. Thus we can say that the total number of crawls in time T is R = Σ_{k=1}^{C} S_k. We shall make one simplifying assumption that each crawl on crawler k takes approximately the same amount of time. Thus we can divide the time interval T into S_k equal size time slots, and estimate the start time of the lth slot on crawler k by T_{k,l} = (l-1)T/S_k for each 1 ≤ l ≤ S_k and 1 ≤ k ≤ C.
We know from the previous section the desired number of crawls x_i for each web page i. Since we have already computed the optimal schedule for the last scheduling interval, we further know the start time t_{i,0} of the final crawl for web page i within the last scheduling interval. Thus we can compute the optimal crawl times t_{i,1}, ..., t_{i,x_i} for web page i during the current scheduling interval. For the stochastic case it is important for the scheduler to initiate each of these crawl tasks at approximately the proper time, but being a bit early or a bit late should have no serious impact for most of the update probability distribution functions we envision. Thus it is reasonable to assume a scheduler cost function for the jth crawl of page i, whose update patterns follow a stochastic process, that takes the form S(t) = |t - t_{i,j}|. On the other hand, for a web page i whose update patterns follow a quasi-deterministic process, being a bit late is acceptable, but being early is not useful. So an appropriate scheduler cost function for the jth crawl of a quasi-deterministic page i might have the form

  S(t) = { ∞,            if t < t_{i,j}
         { t - t_{i,j},  otherwise          (22)

In terms of scheduling notation, the above crawl task is said to have a release time of t_{i,j}. See [3] for more information on scheduling theory.
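The two scheduler cost functions above transcribe directly, the symmetric cost for stochastically updated pages and the release-time cost of Equation (22) for quasi-deterministic pages:

```python
INF = float("inf")

def stochastic_cost(t, t_ideal):
    # Cost of crawling at time t a page with stochastic updates: |t - t_ideal|.
    return abs(t - t_ideal)

def quasi_deterministic_cost(t, t_ideal):
    # Equation (22): crawling before the release time t_ideal is useless,
    # so it gets infinite cost; afterwards the cost grows linearly.
    return INF if t < t_ideal else t - t_ideal
```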
Virtually no work seems to have been done on this particular scheduling problem. Fortunately, there is a simple, exact solution for the problem. Specifically, the problem can be posed and solved as a transportation problem, as we shall now describe. See [1] for more information on transportation problems and network flows in general. We describe our scheduling problem in terms of a network.

Figure 5: Transportation Problem Network (supply nodes for crawl tasks 1 through R, each with supply 1, connected to demand nodes for the time slots of crawlers 1 through C, each with demand 1)
We define a bipartite network with one directed arc from each supply node to each demand node. The R supply nodes, indexed by j, correspond to the crawls to be scheduled. Each of these nodes has a supply of 1 unit. There will be one demand node per time slot and crawler pair, each of which has a demand of 1 unit. We index these by 1 ≤ l ≤ S_k and 1 ≤ k ≤ C. The cost of the arc emanating from a supply node j to a demand node (k,l) is S_j(T_{k,l}). Figure 5 shows the underlying network for an example of this particular transportation problem. Here for simplicity we assume that the crawlers are homogeneous, and thus that each can crawl the same number S = S_k of pages in the scheduling interval T. In the figure the number of crawls is R = 4, which equals the number of crawler time slots. The number of crawlers is C = 2, and the number of crawls per crawler is S = 2. Hence R = CS.
The specific linear optimization problem solved by the transportation problem can be formulated as follows, where f_{ijk} denotes the flow on the arc from the supply node for crawl task i to the demand node for slot j of crawler k:

  Minimize  Σ_{i=1}^{R} Σ_{k=1}^{C} Σ_{j=1}^{S_k} S_i(T_{k,j}) f_{ijk}   (23)

such that

  Σ_{i=1}^{R} f_{ijk} = 1  for all 1 ≤ j ≤ S_k and 1 ≤ k ≤ C,   (24)

  Σ_{k=1}^{C} Σ_{j=1}^{S_k} f_{ijk} = 1  for all 1 ≤ i ≤ R,   (25)

  f_{ijk} ≥ 0  for all i, j, k.   (26)
The solution of a transportation problem can be accomplished quickly. See, for example, [1]. Moreover, the transportation problem formulation ensures that there exists an optimal solution with integral flows, and the techniques in the literature find such a solution. Again, see [1] for details. This implies that each f_{ijk} is binary. If f_{ijk} = 1 then a crawl of web page i is assigned to the jth crawl of crawler k.

If it is required to fix or restrict certain crawl tasks from certain crawler slots, this can be easily done: One simply changes the costs of the restricted directed arcs to be infinite. (Fixing a crawl task to a subset of crawler slots is the same as restricting it from the complementary crawler slots.)
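A minimal sketch of the formulation on a toy instance: R = 4 crawl tasks, C = 2 homogeneous crawlers with S = 2 slots each, and the stochastic cost |t - t_{i,j}|. For illustration we solve the one-to-one assignment by brute force rather than with a network-flow solver such as the OSL package used in the paper; all of the numbers are made up.

```python
from itertools import permutations

T = 24.0                       # scheduling interval (hours)
C, S = 2, 2                    # 2 homogeneous crawlers, 2 slots each
# Demand nodes: (crawler k, slot l, start time T_{k,l} = l*T/S for 0-based l).
slots = [(k, l, l * T / S) for k in range(C) for l in range(S)]
ideal = [0.0, 6.0, 12.0, 18.0]   # ideal crawl times for the R = 4 supply nodes

def cost(t_ideal, t_slot):
    # Stochastic-page scheduler cost: deviation from the ideal crawl time.
    return abs(t_slot - t_ideal)

# Brute force over all one-to-one assignments (fine for toy sizes only;
# a real system would solve the transportation problem).
best = min(
    (sum(cost(ideal[i], slots[p[i]][2]) for i in range(len(ideal))), p)
    for p in permutations(range(len(slots)))
)
total, assignment = best
print(total)   # minimum total deviation for this toy instance
```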
4. PARAMETERIZATION ISSUES
The use of our formulation and solution in practice requires calculating estimates of the parameters of our general model framework. In the interest of space, we sketch here some of the issues involved in addressing this problem, and refer the interested reader to the sequel for additional details.

Note that when each page i is crawled we can easily obtain the last update time for the page. While this does not provide information about any other updates occurring since the last crawl of page i, we can use this information together with the data and models for page i from previous scheduling intervals to statistically infer key properties of the update process for the page. This is then used in turn to construct a probability distribution (including a quasi-deterministic distribution) for the interupdate times of page i.
Another important aspect of our approach concerns the statistical properties of the update process. The analysis of previous studies has essentially assumed that the update process is Poisson [6, 7, 26], i.e., the interupdate times for each page follow an exponential distribution. Unfortunately, very little has been published in the research literature on the properties of update processes found in practice, with the sole exception (to our knowledge) of a recent study [19] suggesting that the interupdate times of pages at a news service web site are not exponential. To further investigate the prevalence of exponential interupdate times in practice, we analyze the page update data from another web site environment whose content is highly accessed and highly dynamic. Specifically, we consider the update patterns found at the web site for the 1998 Nagano Olympic Games, referring the interested reader to [16, 25] for more details on this environment. Figure 6 plots the tail distributions of the time between updates for each of a set of 18 individual dynamic pages which are representative of the update pattern behaviors found in our study of all dynamic pages that were modified a fair number of times. In other words, the curves illustrate the probability that the time between updates to a given page is greater than t as a function of time t.
We first observe from these results that the interupdate time distributions can differ significantly from an exponential distribution. More precisely, our results suggest that the interupdate time distributions for some of the web pages at Nagano have a tail that decays at a subexponential rate and can be closely approximated by a subset of the Weibull distributions; i.e., the tail of the long-tailed Weibull interupdate distribution is given by G(t) = e^{-λt^α}, where t ≥ 0, λ > 0 and 0 < α < 1. We further find that the interupdate time distributions for some of the other web pages at Nagano can be closely approximated by a subset of the class of Pareto distributions; i.e., the tail of the Pareto interupdate time distribution is given by G(t) = (β/(β+t))^α, where t ≥ 0, β > 0 and 0 < α < 2. Moreover, some of the periodic behavior observed in the update patterns for some web pages can be addressed with our quasi-deterministic distribution.

Figure 6: Tail Distributions of Update Processes (two panels plotting, for web pages 1-9 and 10-18 respectively, the probability that the time between updates exceeds t against time t in seconds)
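The two tail distributions above are easy to compare numerically. The parameter values below are illustrative only, chosen merely to show that both tails decay far more slowly than a comparable exponential tail:

```python
import math

def weibull_tail(t, lam=0.01, alpha=0.5):
    # Long-tailed Weibull: P(interupdate time > t) = exp(-lam * t**alpha),
    # with 0 < alpha < 1 giving subexponential decay.
    return math.exp(-lam * t ** alpha)

def pareto_tail(t, beta=100.0, alpha=1.2):
    # Pareto: P(interupdate time > t) = (beta / (beta + t)) ** alpha,
    # heavy-tailed for 0 < alpha < 2.
    return (beta / (beta + t)) ** alpha

def exp_tail(t, mu=0.01):
    # Exponential tail with a comparable scale, for contrast.
    return math.exp(-mu * t)
```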
The above analysis is not intended to be an exhaustive study by any means. Our results, together with those in [19], suggest that there are important web site environments in which the interupdate times do not follow an exponential distribution. This includes the quasi-deterministic instance of our general model, which is motivated by information services as discussed above. The key point is that there clearly are web environments in which the update process for individual web pages can be much more complex than a Poisson process, and our general formulation and solution of the crawler scheduling problem makes it possible for us to handle such a wide range of web environments within this unified framework.
5. EXPERIMENTAL RESULTS
Using the empirical data and analysis of the previous section we now illustrate the performance of our scheme. We will focus on the problem of finding the optimal number of crawls, comparing our scheme with two simpler algorithms. Both of these algorithms were considered in [6], and they are certainly natural alternatives. The first scheme might be called proportional. We simply allocate the total amount of crawls R according to the average update rates of the various web pages. Modulo integrality concerns, this means that we choose x_i ∝ λ_i, where λ_i is the average update rate of page i. The second scheme is simpler yet, allocating the number of crawls as evenly as possible amongst the web pages. We label this the uniform scheme. Each of these schemes can be amended to handle our embarrassment metric weights. When speaking of these variants we will use the terms weighted proportional and weighted uniform. The former chooses x_i ∝ w_i λ_i. The latter is something of a misnomer: We are choosing x_i ∝ w_i, so this is essentially a slightly different proportional scheme. We can also think of our optimal scheme as weighted, solving for the smallest objective function Σ_{i=1}^{N} w_i A_i. If we solve instead for the smallest objective function Σ_{i=1}^{N} A_i, we get an unweighted optimal algorithm. This is essentially the same problem solved in [6], provided, of course, that all web pages are updated according to a Poisson process. But our scheme will have much greater speed. Even though this algorithm omits the weights in the formulation of the problem, we must compare the quality of the solution based on the weighted objective function value.
In our experiments we consider combinations of different types of web page update distributions. In a number of cases we use a mixture of 60% Poisson, 30% Pareto and 10% quasi-deterministic distributions. In the experiments we choose T to be one day, though we have made runs for a week as well. We set N to be one million web pages, and varied R between 1.5 and 5 million crawls. We assumed that the average rate of updates over all pages was 1.5, and these updates were chosen according to a Zipf-like distribution with parameters N and θ, the latter chosen between 0 and 1 [29, 15]. Such distributions run the spectrum from highly skewed (when θ = 0) to totally uniform (when θ = 1). We considered both the staleness and embarrassment metrics. When considering the embarrassment metric we derived the weights in Equation (8) by considering a search engine which returned 5 result pages per query, with 10 urls on a page. The probabilities b_{i,j,k} with which the search engine returns page i in position j of query result page k were chosen by linearizing these 50 positions, picking a randomly chosen center, and imposing a truncated normal distribution about that center. The clicking frequencies c_{j,k} for position j of query page k are chosen as described in Section 2.2, via a Zipf-like function with parameters 50 and θ = 0, with a geometric function for cycling through the pages. We assumed that the client went from one result page to the next with probability 0.8. We choose the lucky loser probability d_i of web page i yielding an incorrect response to the client query by picking a uniform random number between 0 and 1. All curves in our experiments display the analytically computed objective function values of the various schemes.
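As a sketch of how such Zipf-like update rates might be generated. The exponent convention 1 - θ is our assumption, chosen so that θ = 0 gives the highly skewed true Zipf case and θ = 1 the totally uniform case; the text does not spell out its exact convention:

```python
def zipf_like_rates(n, theta, mean_rate=1.5):
    # Rate of page i (1-based) proportional to 1 / i**(1 - theta), normalized
    # so that the average update rate over all n pages equals mean_rate.
    weights = [1.0 / (i ** (1.0 - theta)) for i in range(1, n + 1)]
    scale = mean_rate * n / sum(weights)
    return [scale * w for w in weights]

rates = zipf_like_rates(1000, theta=0.0)    # true Zipf: highly skewed
uniform = zipf_like_rates(1000, theta=1.0)  # totally uniform: all rates equal
```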
Figure 7: Two Embarrassment Metric Examples (left: embarrassments per 1000 queries as a function of the crawl/web page ratio R/N; right: embarrassments per 1000 queries as a function of the Zipf-like parameter θ; each panel compares the optimal, proportional and uniform schemes)

Figure 7 shows two experiments using the embarrassment metric. In the left-hand side of the figure we consider the effect of varying the ratio of R to N from 1.5 to 5. We consider
here a true Zipf distribution for the update frequencies; in other words we choose Zipf-like parameter θ = 0. There are six curves, namely optimal, proportional and uniform in both weighted and unweighted versions. (The unweighted optimal curve is the result of employing unit weights during the computation phase, but displaying the weighted optimal objective function.) By definition the unweighted optimal scheme will not perform as well as the weighted optimal scheme, which is indeed the best possible solution. In all other cases, however, the unweighted variant does better than the weighted variant. So the true uniform policy does the best amongst all of the heuristics, at least for the experiments considered in our study. This somewhat surprising state of affairs was noticed in [6] as well. Both uniform policies do better than their proportional counterparts. Notice that the weighted optimal curve is, in fact, convex as a function of increasing R. This will always be true.
In the right-hand side of Figure 7 we show a somewhat differing mixture of distributions. In this case we vary the Zipf-like parameter θ while holding the value of R to be 2.5 million (so that R/N = 2.5). As θ increases, thus yielding less skew, the objective functions generally increase as well. This is appealing, because it shows in particular that the optimal scheme does very well in highly skewed scenarios, which we believe are more representative of real web environments. Moreover, notice that the curves essentially converge to each other as θ increases. This is not too surprising, since the optimal, proportional and uniform schemes would all result in the same solutions precisely in the absence of weights when θ = 1. In general the uniform scheme does relatively better in this figure than it did in the previous one. The explanation is complex, but it essentially has to do with the correlation of the weights and the update frequencies. In fact, we show this example because it puts uniform in its best possible light.
In Figure 8 we show four experiments where we varied the ratio of R to N. These figures depict the average staleness metric, and so we only have three (unweighted) curves per figure. The top left-hand figure depicts a mixture of update distribution types, and the other three figures depict, in turn, pure Poisson, Pareto and quasi-deterministic distributions. Notice that these curves are cleaner than those in Figure 7. The weightings, while important, introduce a degree of noise into the objective function values. For this reason we will focus on the non-weighted case from here on. The y-axis ranges differ on each of the four figures, but in all cases the optimal scheme yields a convex function of R/N (and thus of R). The uniform scheme performs better than the proportional scheme once again. It does relatively less well in the Poisson update scenario. In the quasi-deterministic figure the optimal scheme is actually able to reduce average staleness to 0 for sufficiently large R values.
In Figure 9 we explore the case of Pareto interupdate distributions in more detail. Once again, the average staleness metric is plotted, here as a function of the parameter α in the Pareto distribution; refer to Section 4. This distribution is said to have a heavy tail when 0 < α < 2, which is quite interesting because it is over this range of parameter values that the optimal solution is most sensitive. In particular, we observe that the optimal solution value is rather flat for values of α ranging from 4 toward 2. However, as α approaches 2, the optimal average staleness value starts to rise, and it continues to increase in an exponential manner as α ranges from 2 toward 0. These trends appear to hold for all three schemes, with our optimal scheme continuing to provide the best performance and uniform continuing to outperform the proportional scheme. Our results suggest the importance of supporting heavy-tailed distributions when they exist in practice (and our results of the previous section demonstrate that they do indeed exist in practice). This is appealing because it shows in particular that the optimal scheme does very well in these complex environments which may be more representative of real web environments than those considered in previous studies.
The transportation problem solution to the scheduling problem is optimal, of course. Furthermore, the quality of the solution, measured in terms of the deviation of the actual time slots for the various tasks from their ideal time slots, will nearly always be outstanding. Figure 10 shows an illustrative example. This example involves the scheduling of one day with C = 10 homogeneous crawlers and 1 million crawls per crawler. So there are 10 million crawls in all.
Figure 8: Four Average Staleness Metric Examples: Mixed, Poisson, Pareto and Quasi-Deterministic (each panel plots average staleness against the crawl/web page ratio R/N for the optimal, proportional and uniform schemes)
Figure 9: Pareto Example (average staleness versus the average Pareto parameter α, for the optimal, proportional and uniform schemes)
Figure 10: Distribution of Actual/Ideal Task Time Slots (percent of tasks versus deviation from the ideal time slot, shown separately for the Poisson/Pareto tasks and the quasi-deterministic tasks)

Note, in particular for the quasi-deterministic tasks, that they occur on or after their ideal time slots, as required. The quasi-deterministic crawls amounted to 20% of the overall crawls in this example. The bottom line is that this scheduling problem will nearly always yield optimal solutions of very high absolute quality.
The crawling frequency scheme was implemented in C and run on an IBM RS/6000 Model 50. In no case did the algorithm require more than a minute of elapsed time. The crawler scheduling algorithm was implemented using IBM's Optimization Subroutine Library (OSL) package [13], which can solve network flow problems. Problems of our size run in approximately two minutes.
6. CONCLUSION
Given the important role of search engines in the World Wide Web, we studied the crawling process employed by such search engines with the goal of improving the quality of the service they provide to clients. Our analysis of the optimal crawling process considered both the metric of staleness, as done by the few studies in this area, and the metric of embarrassment, which we introduced as a preferable goal. We proposed a general two-part scheme to optimize the crawling process, where the first component determines the optimal number of crawls for each page together with the optimal times at which these crawls should take place if there were no practical constraints. The second component of our scheme then finds an optimal achievable schedule for a set of crawlers to follow. An important contribution of the paper is this formulation, which makes it possible for us to exploit very efficient algorithms. These algorithms are significantly faster than those considered in previous studies. Our formulation and solution is also more general than previous work for several reasons, including the use of weights in the objective function and the handling of significantly more general update patterns. Given the lack of published data on web page update patterns, and given the assumption of exponential interupdate times in the analysis of the few previous studies, we analyzed the page update data from a highly accessed web site serving highly dynamic pages. The corresponding results clearly demonstrate the benefits of our general unified approach in that the distributions of the times between updates to some web pages clearly span a wide range of complex behaviors. By accommodating such complex update patterns, we believe that our optimal scheme can provide even greater benefits in real-world environments than previous work in the area.
Acknowledgement. We thank Allen Downey for pointing us to [19].
7. REFERENCES
[1] R. Ahuja, T. Magnanti and J. Orlin, Network Flows, Prentice Hall, 1993.
[2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan, "Searching the Web", ACM Transactions on Internet Technology, 1(1), 2001.
[3] J. Blazewicz, K. Ecker, G. Schmidt and J. Weglarz, Scheduling in Computer and Manufacturing Systems, Springer-Verlag, 1993.
[4] A. Broder, Personal communication.
[5] J. Challenger, P. Dantzig, A. Iyengar, M.S. Squillante and L. Zhang, "Efficiently Serving Dynamic Data at Highly Accessed Web Sites", Preprint, May 2001.
[6] J. Cho and H. Garcia-Molina, "Synchronizing a Database to Improve Freshness", ACM SIGMOD Conference, 2000.
[7] E. Coffman, Z. Liu and R. Weber, "Optimal Robot Scheduling for Web Search Engines", INRIA Research Report, 1997.
[8] F. Douglis, A. Feldmann and B. Krishnamurthy, "Rate of Change and other Metrics: A Live Study of the World Wide Web", USENIX Symposium on Internetworking Technologies and Systems, 1999.
[9] B. Fox, "Discrete Optimization via Marginal Analysis", Management Science, 13:210-216, 1966.
[10] G. Frederickson and D. Johnson, "The Complexity of Selection and Ranking in X+Y and Matrices with Sorted Columns", Journal of Computer and System Sciences, 24:197-208, 1982.
[11] Z. Galil and N. Megiddo, "A Fast Selection Algorithm and the Problem of Optimum Distribution of Efforts", Journal of the ACM, 26:58-64, 1979.
[12] T. Ibaraki and N. Katoh, Resource Allocation Problems: Algorithmic Approaches, MIT Press, Cambridge, MA, 1988.
[13] International Business Machines Corporation, Optimization Subroutine Library Guide and Reference, IBM, 1995.
[14] N. Katoh and T. Ibaraki, "Resource Allocation Problems", in Handbook of Combinatorial Optimization, D.-Z. Du and P. Pardalos, editors, Kluwer Academic Press, 2000.
[15] D. Knuth, The Art of Computer Programming, vol. 2, Addison Wesley, 1973.
[16] A. Iyengar, M. Squillante and L. Zhang, "Analysis and Characterization of Large-Scale Web Server Access Patterns and Performance", World Wide Web, 2:85-100, 1999.
[17] S. Lawrence and C. Giles, "Accessibility of Information on the Web", Nature, 400:107-109, 1999.
[18] G. Nemhauser and L. Wolsey, Integer and Combinatorial Optimization, J. Wiley, 1988.
[19] V.N. Padmanabhan and L. Qiu, "The Content and Access Dynamics of a Busy Web Site: Findings and Implications", ACM SIGCOMM Conference, 2000.
[20] M. Pinedo, Scheduling: Theory, Algorithms and Systems, Prentice-Hall, 1995.
[21] J. Pitkow and P. Pirolli, "Life, Death and Lawfulness on the Electronic Frontier", CHI Conference on Human Factors in Computing Systems, 1997.
[22] W. Press, B. Flannery, S. Teukolsky and W. Vetterling, Numerical Recipes, Cambridge University Press, 1986.
[23] S.M. Ross, Stochastic Processes, Second Edition, John Wiley and Sons, 1997.
[24] K. Sigman, Stationary Marked Point Processes: An Intuitive Approach, Chapman and Hall, 1995.
[25] M. Squillante, D. Yao and L. Zhang, "Web Traffic Modeling and Web Server Performance Analysis", IEEE Conference on Decision and Control, 1999.
[26] J. Talim, Z. Liu, P. Nain and E. Coffman, "Optimizing the Number of Robots for Web Search Engines", Telecommunication Systems Journal, 17(1-2):234-243, 2001.
[27] C. Wills and M. Mikhailov, "Towards a Better Understanding of Web Resources and Server Responses for Improved Caching", WWW Conference, 1999.
[28] R.W. Wolff, Stochastic Modeling and the Theory of Queues, Prentice Hall, 1989.