
Optimal Crawling Strategies for Web Search Engines

J.L. Wolf, M.S. Squillante,

P.S. Yu

IBM Watson Research Center

{jlwolf,mss,psyu}@us.ibm.com

J. Sethuraman

IEOR Department

Columbia University

jay@ieor.columbia.edu

L. Ozsen

OR Department

Northwestern University

ozsen@yahoo.com

ABSTRACT

Web search engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: the frequency with which a client makes a search engine query and then clicks on a returned url only to find that the result is incorrect. The first part of our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems which are highly computationally efficient, a crucial property for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.

Categories and Subject Descriptors

H.3 [Information Systems]: Information Storage and Retrieval; G.2,3 [Mathematics of Computing]: Discrete Mathematics, Probability and Statistics

General Terms

Algorithms, Performance, Design, Theory

1. INTRODUCTION

Web search engines play a vital role on the World Wide Web, since they provide for many clients the first pointers to web pages of interest. Such search engines employ crawlers to build local repositories containing web pages, which they then use to build data structures useful to the search process. For example, an inverted index is created that typically consists of, for each term, a sorted list of its positions in the various web pages.

Copyright is held by the author/owner(s).
WWW2002, May 7–11, 2002, Honolulu, Hawaii, USA.
ACM 1-58113-449-5/02/0005.

On the other hand, web pages are frequently updated by their owners [8, 21, 27], sometimes modestly and sometimes more significantly. A large study in [2] notes that 23% of the web pages change daily, while 40% of commercial web pages change daily. Some web pages disappear completely, and [2] reports a half-life of 10 days for web pages. The data gathered by a search engine during its crawls can thus quickly become stale, or out of date. So crawlers must regularly revisit the web pages to maintain the freshness of the search engine's data.

In this paper, we propose a two-part scheme to optimize the crawling (or perhaps, more precisely, the recrawling) process. One reasonable goal in such a scheme is the minimization of the average level of staleness over all web pages, and the scheme we propose here can solve this problem. We believe, however, that a slightly different metric provides greater utility. This involves a so-called embarrassment metric: the frequency with which a client makes a search engine query, clicks on a url returned by the search engine, and then finds that the resulting page is inconsistent with respect to the query. In this context, goodness would correspond to the search engine having a fresh copy of the web page. However, badness must be partitioned into lucky and unlucky categories: the search engine can be bad but lucky in a variety of ways. In order of increasing luckiness the possibilities are: (1) the web page might be stale, but not returned to the client as a result of the query; (2) the web page might be stale, returned to the client as a result of the query, but not clicked on by the client; and (3) the web page might be stale, returned to the client as a result of the query, clicked on by the client, but might be correct with respect to the query anyway. So the metric under discussion would only count those queries on which the search engine is actually embarrassed: the web page is stale, returned to the client, who clicks on the url only to find that the page is either inconsistent with respect to the original query, or (worse yet) has a broken link. (According to [17], and noted in [6], up to 14% of the links in search engines are broken, not a good state of affairs.)

The first component of our proposed scheme determines the optimal number of crawls for each web page over a fixed scheduling interval, as well as the ideal times at which these crawls should occur. The second component determines an optimal achievable schedule for the crawlers themselves. These two problems are highly interconnected. The

same basic scheme can be used to optimize either the staleness or embarrassment metric, and is applicable for a wide variety of stochastic update processes. The ability to handle complex update processes in a general unified framework turns out to be an important advantage and a contribution of our approach, in that the update patterns of some classes of web pages appear to follow fairly complex processes, as we will demonstrate. Another important model supported by our general framework is motivated by, for example, an information service that updates its web pages at certain times of the day, if an update to the page is necessary. This case, which we call quasi-deterministic, is characterized by web pages whose updates might be characterized as somewhat more deterministic, in the sense that there are fixed potential times at which updates might or might not occur. Of course, web pages with deterministic updates are a special case of the quasi-deterministic model. Furthermore, the crawling frequency problem can be solved under additional constraints which make its solution more practical in the real world: for example, one can impose minimum and maximum bounds on the number of crawls for a given web page. The latter bound is important because crawling can actually cause performance problems for web sites.

This first component problem is formulated and solved using a variety of techniques from probability theory [23, 28] and the theory of resource allocation problems [12, 14]. We note that the optimization problems described here must be solved for huge instances: the size of the World Wide Web is now estimated at over one billion pages, the update rate of these web pages is large, a single crawler can crawl more than a million pages per day, and search engines employ multiple crawlers. (Actually, [2] notes that their own crawler can handle 50-100 crawls per second, while others can handle several hundred crawls per second. We should note, however, that crawling is often restricted to less busy periods in the day.) A contribution of this paper is the introduction of state-of-the-art resource allocation algorithms to solve these problems.

The second component of our scheme employs as its input the output from the first component. (Again, this consists of the optimal numbers of crawls and the ideal crawl times.) It then finds an optimal achievable schedule for the crawlers themselves. This is a parallel machine scheduling problem [3, 20] because of the multiple crawlers. Furthermore, some of the scheduling tasks have release dates because, for example, it is not useful to schedule a crawl task on a quasi-deterministic web page before the potential update takes place. Our solution is based on network flow theory, and can be posed specifically as a transportation problem [1]. This problem must also be solved for enormous instances, and again there are fast algorithms available at our disposal. Moreover, one can impose additional real-world constraints, such as restricted crawling times for a given web page.

We know of relatively few related papers in the research literature. Perhaps the most relevant is [6]. (See also [2] for a more general survey article.) In [6] the authors essentially introduce and solve a version of the problem of finding the optimal number of crawls per page. They employ a staleness metric and assume a Poisson update process. Their algorithm solves the resulting continuous resource allocation problem by the use of Lagrange multipliers. In [7] the [...] with general crawl time distributions, with weights proportional to the page update frequencies. They present a heuristic to handle large problem instances. The problem of optimizing the number of crawlers is tackled in [26], based on a queueing-theoretic analysis, in order to avoid the two extremes of starvation and saturation. In summary, there is some literature on the first component of our crawler optimization scheme, though we have noted above several potential advantages of our approach. To our knowledge, this is the first paper that meaningfully examines the corresponding scheduling problem which is the second component of our scheme.

Another important aspect of our study concerns the statistical properties of the update patterns for web pages. This clearly is a critical issue for the analysis of the crawling problem, but unfortunately there appears to be very little in the literature on the types of update processes found in practice. To the best of our knowledge, the sole exception is a recent study [19] which suggests that the update processes for pages at a news service web site are not Poisson. Given the assumption of Poisson update processes in most previous studies, and to further investigate the prevalence of Poisson update processes in practice, we analyze the page update data from a highly accessed web site serving highly dynamic pages. A representative sample of the results from our analysis are presented and discussed. Most importantly, these results demonstrate that the interupdate processes span a wide range of complex statistical properties across different web pages and that these processes can differ significantly from a Poisson process. By supporting in our general unified approach such complex update patterns (including the quasi-deterministic model) in addition to the Poisson case, we believe that our optimal scheme can provide even greater benefits in real-world environments.

The remainder of this paper is organized as follows. Section 2 describes our formulation and solution for the twin problems of finding the optimal number of crawls and the idealized crawl times. We loosely refer to this first component as the optimal frequency problem. Section 3 contains the formulation and solution for the second component of our scheme, namely the scheduling problem. Section 4 discusses some issues of parameterizing our approach, including several empirical results on update pattern distributions for real web pages based on traces from a production web site. In Section 5 we provide experimental results showing both the quality of our solutions and their running times. Section 6 contains conclusions and areas for future work.

2. CRAWLING FREQUENCY PROBLEM

2.1 General Framework

We formulate the crawling frequency problem within the context of a general model framework, based on stochastic marked point processes. This makes it possible for us to study the problem in a unified manner across a wide range of web environments and assumptions. A rigorous formal definition of our general framework and its important mathematical properties, as well as a rigorous formal analysis of various aspects of our general framework, are beyond the scope of the present paper. We therefore sketch here the model framework and an analysis of a specific instance of [...] additional technical details. Furthermore, see [24] for additional details on stochastic marked point processes.

We denote by $N$ the total number of web pages to be crawled, which shall be indexed by $i$. We consider a scheduling interval of length $T$ as a basic atomic unit of decision making, where $T$ is sufficiently large to support our model assumptions below. The basic idea is that these scheduling intervals repeat every $T$ units of time, and we will make decisions about one scheduling interval using both new data and the results from the previous scheduling interval. Let $R$ denote the total number of crawls possible in a single scheduling interval.

Let $u_{i,n} \in \mathbb{R}_+$ denote the point in time at which the $n$th update of page $i$ occurs, where $0 < u_{i,1} < u_{i,2} < \ldots \le T$, $i \in \{1,2,\ldots,N\}$. Associated with the $n$th update of page $i$ is a mark $k_{i,n} \in \mathbb{K}$, where $k_{i,n}$ is used to represent all important and useful information for the $n$th update of page $i$ and $\mathbb{K}$ is the space of all such marks (called the mark space). Examples of possible marks include information on the probability of whether an update actually occurs at the corresponding point in time (e.g., see Section 2.3.3) and the probability of whether an actual update matters from the perspective of the crawling frequency problem (e.g., a minimal update of the page may not change the results of the search engine mechanisms). The occurrences of updates to page $i$ are then modeled as a stationary stochastic marked point process $U_i = \{(u_{i,n}, k_{i,n});\ n \in \mathbb{N}\}$ defined on the state space $\mathbb{R}_+ \times \mathbb{K}$. In other words, $U_i$ is a stochastic sequence of points $\{u_{i,1}, u_{i,2}, \ldots\}$ in time at which updates of page $i$ occur, together with a corresponding sequence of general marks $\{k_{i,1}, k_{i,2}, \ldots\}$ containing information about these updates.

A counting process $N^u_i(t)$ is associated with the marked point process $U_i$ and is given by $N^u_i(t) \equiv \max\{n : u_{i,n} \le t\}$, $t \in \mathbb{R}_+$. This counting process represents the number of updates of page $i$ that occur in the time interval $[0,t]$. The interval of time between the $(n-1)$st and $n$th update of page $i$ is given by $U_{i,n} \equiv u_{i,n} - u_{i,n-1}$, $n \in \mathbb{N}$, where we define $u_{i,0} \equiv 0$ and $k_{i,0} \equiv \emptyset$. The corresponding forward and backward recurrence times are given by $A^u_i(t) \equiv u_{i,N^u_i(t)+1} - t$ and $B^u_i(t) \equiv t - u_{i,N^u_i(t)}$, respectively, $t \in \mathbb{R}_+$. In this paper we shall make the assumption that the time intervals $U_{i,n} \in \mathbb{R}_+$ between updates of page $i$ are independent and identically distributed (i.i.d.) following an arbitrary distribution function $G_i(\cdot)$ with mean $\lambda_i^{-1} > 0$, and thus the counting process $N^u_i(t)$ is a renewal process [23, 28], $i \in \{1,2,\ldots,N\}$. Note that $u_{i,0}$ does not represent the time of an actual update, and therefore the counting process $\{N^u_i(t);\ t \in \mathbb{R}_+\}$ (starting at time 0) is, more precisely, an equilibrium renewal process (which is an instance of a delayed renewal process) [23, 28].

Suppose we decide to crawl web page $i$ a total of $x_i$ times during the scheduling interval $[0,T]$ (where $x_i$ is a non-negative integer less than or equal to $R$), and suppose we decide to do so at the arbitrary times $0 \le t_{i,1} < t_{i,2} < \ldots < t_{i,x_i} \le T$. Our approach in this paper is based on computing a particular probability function that captures, in a certain sense, whether the search engine will have a stale copy of web page $i$ at an arbitrary time $t$ in the interval $[0,T]$. From this we can in turn compute a corresponding time-average staleness estimate $a_i(t_{i,1},\ldots,t_{i,x_i})$ by averaging this probability function over all $t$ within $[0,T]$. Specifically, we consider the arbitrary time $t_{i,j}$ as falling within the interval $U_{i,N^u_i(t_{i,j})+1}$

[Figure 1 here: a timeline showing the updates $u_{i,N^u_i(t_{i,j})}$ and $u_{i,N^u_i(t_{i,j})+1}$, the crawl times $t_{i,j}$ and $t_{i,j+1}$, the backward recurrence time $B^u_i(t_{i,j})$, the interupdate interval $U_{i,N^u_i(t_{i,j})+1}$, and the forward recurrence time $A^u_i(t_{i,j})$.]

Figure 1: Example of Stationary Marked Point Process Framework

between the two updates of page $i$ at times $u_{i,N^u_i(t_{i,j})}$ and $u_{i,N^u_i(t_{i,j})+1}$, and our interest is in a particular time-average measure of staleness with respect to the forward recurrence time $A^u_i(t_{i,j})$ until the next update, given the backward recurrence time $B^u_i(t_{i,j})$. Figure 1 depicts a simple example of this situation.

More formally, we exploit conditional probabilities to define the following time-average staleness estimate:

$$a_i(t_{i,1},\ldots,t_{i,x_i}) = \frac{1}{T} \sum_{j=0}^{x_i} \int_{t_{i,j}}^{t_{i,j+1}} \left( \int_0^\infty \frac{P[A,B,C]}{P[B]} \, g_{B^u_i}(v)\,dv \right) dt, \quad (1)$$

where $A = \{U_{i,n_{i,j}+1} \le t - t_{i,j} + v\}$, $B = \{U_{i,n_{i,j}+1} > v\}$, $C = \{k_{i,n_{i,j}+1} \in K\}$, $t_{i,0} \equiv 0$, $t_{i,x_i+1} \equiv T$, $n_{i,j} \equiv N^u_i(t_{i,j})$, $g_{B^u_i}(\cdot)$ is the stationary density for the backward recurrence time, and $K \subseteq \mathbb{K}$ is the mark set of interest for the staleness estimate under consideration. Note that the variable $v$ is used to integrate over all possible values of $B^u_i(t_{i,j}) \in [0,\infty)$. Further observe the dependence of the staleness estimate on the update patterns for web page $i$.

When $K = \mathbb{K}$ (i.e., all marks are considered in the definition of the time-average staleness estimate), then the inner integral above reduces as follows:

$$\int_0^\infty \frac{G_i(t - t_{i,j} + v) - G_i(v)}{1 - G_i(v)} \, g_{B^u_i}(v)\,dv = \int_0^\infty \left( 1 - \frac{\bar{G}_i(t - t_{i,j} + v)}{\bar{G}_i(v)} \right) g_{B^u_i}(v)\,dv, \quad (2)$$

where $\bar{G}_i(t) \equiv 1 - G_i(t)$ is the tail distribution of the interupdate times $U_{i,n}$, $n \in \mathbb{N}$. From standard renewal theory [23, 28], we have $g_{B^u_i}(t) = \lambda_i \bar{G}_i(t)$, and thus after some simple algebraic manipulations we obtain

$$a_i(t_{i,1},\ldots,t_{i,x_i}) = \frac{1}{T} \sum_{j=0}^{x_i} \int_{t_{i,j}}^{t_{i,j+1}} \left( 1 - \lambda_i \int_0^\infty \bar{G}_i(t - t_{i,j} + v)\,dv \right) dt. \quad (3)$$
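Equation (3) can be evaluated numerically for any tail distribution $\bar{G}_i$. The following Python sketch is our own illustration (not part of the paper); it uses simple midpoint quadrature, and the truncation horizon and grid sizes are arbitrary choices:

```python
import math

def avg_staleness_eq3(Gbar, lam, T, crawl_times, n=200, V=20.0, m=500):
    """Numerically evaluate Equation (3): the time-average staleness for a
    page with i.i.d. interupdate times having tail distribution Gbar and
    rate lam, crawled at times 0 <= t_1 < ... < t_x <= T.  Midpoint-rule
    quadrature; Gbar must be negligible beyond the horizon V."""
    def staleness_at(s):
        # 1 - lam * \int_0^infty Gbar(s + v) dv, truncated at v = V
        h = V / m
        return 1.0 - lam * h * sum(Gbar(s + (i + 0.5) * h) for i in range(m))

    pts = [0.0] + sorted(crawl_times) + [T]
    total = 0.0
    for a, b in zip(pts, pts[1:]):        # one term per interval [t_j, t_{j+1}]
        h = (b - a) / n
        total += h * sum(staleness_at((i + 0.5) * h) for i in range(n))
    return total / T
```

For exponential interupdate times this agrees with the closed form derived in Section 2.3.1.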

Naturally we would also like the times $t_{i,1},\ldots,t_{i,x_i}$ to be chosen so as to minimize the time-average staleness estimate $a_i(t_{i,1},\ldots,t_{i,x_i})$, given that there are $x_i$ crawls of page $i$. Deferring for the moment the question of how to find these optimal values $t^*_{i,1},\ldots,t^*_{i,x_i}$, let us define the function $A_i$ by setting

$$A_i(x_i) = a_i(t^*_{i,1},\ldots,t^*_{i,x_i}). \quad (4)$$

Thus, the domain of this function $A_i$ is the set $\{0,\ldots,R\}$.

We now must decide how to find the optimal values of the $x_i$ variables. While one would like to choose each $x_i$ as large as possible, there is competition for crawls from the other web pages. We therefore choose the $x_i$ to minimize

$$\sum_{i=1}^{N} w_i A_i(x_i) \quad (5)$$

subject to the constraints

$$\sum_{i=1}^{N} x_i = R, \quad (6)$$

$$x_i \in \{0,\ldots,R\}. \quad (7)$$

Here the weights $w_i$ will determine the relative importance of each web page $i$. If each weight $w_i$ is chosen to be 1, then the problem becomes one of minimizing the time-average staleness estimate across all the web pages $i$. However, we will shortly discuss a way to pick these weights a bit more intelligently, thereby minimizing a metric that computes the level of embarrassment the search engine has to endure.

The optimization problem just posed has a very nice form. Specifically, it is an example of a so-called discrete, separable resource allocation problem. The problem is separable by the nature of the objective function, written as the summation of functions of the individual $x_i$ variables. The problem is discrete because of the second constraint, and a resource allocation problem because of the first constraint. For details on resource allocation problems, we refer the interested reader to [12]. This is a well-studied area in optimization theory, and we shall borrow liberally from that literature. In doing so, we first point out that there exists a dynamic programming algorithm for solving such problems, which has computational complexity $O(NR^2)$.

Fortunately, we will show shortly that the function $A_i$ is convex, which in this discrete context means that the first differences $A_i(j+1) - A_i(j)$ are non-decreasing as a function of $j$. (These first differences are just the discrete analogues of derivatives.) This extra structure makes it possible to employ even faster algorithms, but before we can do so there remain a few important issues. Each of these issues is discussed in detail in the next three subsections, which involve: (1) computing the weights $w_i$ of the embarrassment level metric for each web page $i$; (2) computing the functional forms of $a_i$ and $A_i$ for each web page $i$, based on the corresponding marked point process $U_i$; and (3) solving the resulting discrete, convex, separable resource allocation problem in a highly efficient manner.

It is important to point out that we can actually handle a more general resource allocation constraint than that given in Equation (7). Specifically, we can handle both minimum and maximum numbers of crawls $m_i$ and $M_i$ for each page $i$, so that the constraint instead becomes $x_i \in \{m_i,\ldots,M_i\}$. We can also handle other types of constraints on the crawls that tend to arise in practice [4], but omit details here in the interest of space.

2.2 Computing the Weights $w_i$

Consider Figure 2, which illustrates a decision tree tracing the possible results for a client making a search engine query. Let us fix a particular web page $i$ in mind, and follow the decision tree down from the root to the leaves.

The first possibility is for the page to be fresh: in this case the web page will not cause embarrassment to the search engine. So, assume the page is stale. If the page is never returned by the search engine, there again can be no embarrassment: the search engine is simply lucky in this case.

[Figure 2 here: a decision tree. A page is either fresh (GOOD: Page fresh) or stale; a stale page is either not returned, returned but not clicked, or clicked with the query nevertheless correct (all BAD BUT LUCKY), or clicked with the query incorrect (UGLY: Query incorrect).]

Figure 2: Embarrassment Level Decision Tree

What happens if the page is returned? A search engine will typically organize its query responses into multiple result pages, and each of these result pages will contain the urls of several returned web pages, in various positions on the page. Let $P$ denote the number of positions on a returned page (which is typically on the order of 10). Note that the position of a returned web page on a result page reflects the ordered estimate of the search engine for the web page matching what the user wants. Let $b_{i,j,k}$ denote the probability that the search engine will return page $i$ in position $j$ of query result page $k$. The search engine can easily estimate these probabilities, either by monitoring all query results or by sampling them for the client queries.

The search engine can still be lucky even if the web page $i$ is stale and returned: a client might not click on the page, and thus never have a chance to learn that the page was stale. Let $c_{j,k}$ denote the frequency with which a client will click on a returned page in position $j$ of query result page $k$. These frequencies also can be easily estimated, again either by monitoring or sampling.

One can speculate that this clicking probability function might typically decrease both as a function of the overall position $(k-1)P + j$ of the returned page and as a function of the page $k$ on which it is returned. Assuming a Zipf-like function [29, 15] for the first function and a geometric function to model the probability of cycling through $k-1$ pages to get to returned page $k$, one would obtain a clicking probability function that looks like the one provided in Figure 3. (According to [4] there is some evidence that the clicking probabilities in Figure 3 actually rise rather than fall as a new page is reached. This is because some clients do not scroll down; some do not even know how to do so. More importantly, [4] notes that this data can actually be collected by the search engine.)
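For concreteness, one such clicking probability function might be sketched as follows; the exponent, cycling probability, and scale below are invented values for illustration, not estimates from the paper:

```python
def click_probability(j, k, P=10, alpha=1.0, q=0.6, scale=0.08):
    """Illustrative clicking frequency c_{j,k} for position j (1..P) of query
    result page k (1, 2, ...): a Zipf-like decay in the overall position
    (k-1)*P + j, damped by a geometric factor q**(k-1) modeling the
    probability of cycling through k-1 result pages to reach page k."""
    overall = (k - 1) * P + j
    return scale * (q ** (k - 1)) / (overall ** alpha)
```

This reproduces the qualitative shape of Figure 3: the probability falls within each result page and drops again at each page boundary.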

Finally, even if the web page is stale, returned by the search engine, and clicked on: the changes to the page might not cause the results of the query to be wrong. This truly lucky loser scenario can be more common than one might initially suspect, because most web pages typically do not change in terms of their basic content, and most client queries will probably try to address this basic content. Let $d_i$ denote the probability that a stale version of page $i$ yields an incorrect response. Once again, this parameter can be easily estimated.

[Figure 3 here: a plot of the probability of clicking (about 0.08 decaying toward 0) as a function of overall position (0 to 50), with the position ranges corresponding to result pages Page 1, Page 2, Page 3, ..., Page P marked along the curve.]

Figure 3: Probability of Clicking as Function of Position/Page

Assuming, as seems reasonable, that these three types of events are independent, one can compute the total level of embarrassment caused to the search engine by web page $i$ as

$$w_i = d_i \sum_j \sum_k c_{j,k} \, b_{i,j,k}. \quad (8)$$

Note that although the functional form of the above equation is a bit involved, the value of $w_i$ is simply a constant from the perspective of the resource allocation problem.
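Equation (8) is then a direct summation. In the sketch below (ours, for illustration), the returned-slot probabilities $b_{i,j,k}$ and clicking frequencies $c_{j,k}$ are stored as dictionaries keyed by (position, result page); that data-structure choice is an assumption, not something the paper specifies:

```python
def embarrassment_weight(d_i, b_i, c):
    """Equation (8): w_i = d_i * sum_{j,k} c_{j,k} * b_{i,j,k}.  Here b_i and
    c are dicts mapping a (position j, result page k) pair to, respectively,
    the probability that page i is returned in that slot and the clicking
    frequency for that slot; d_i is the probability that a stale copy of
    page i yields an incorrect response."""
    return d_i * sum(c[slot] * b_i[slot] for slot in b_i)
```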

2.3 Computing the Functions $a_i$ and $A_i$

As previously noted, the functions to be computed in this section depend upon the characteristics of the marked point process $U_i = \{(u_{i,n}, k_{i,n});\ n \in \mathbb{N}\}$ which models the update behavior of each page $i$. We consider three types of marked point processes that represent different cases which are expected to be of interest in practice. The first two cases are based on the use of Equation (3) to compute the functions $a_i$ and $A_i$ under different distributional assumptions for the interupdate times $U_{i,n}$ of the marked point process $U_i$; specifically, we consider $G_i(\cdot)$ to be exponential and general distribution functions, respectively, where the former has been the primary case considered in previous studies and the latter is used to state a few important properties of this general form (which are used in turn to obtain the experimental results presented in Section 4). The third case, which we call quasi-deterministic, is based on a finite number of specific times $u_{i,n}$ at which page $i$ might be updated, where the corresponding marks $k_{i,n}$ represent the probability that the update at time $u_{i,n}$ actually occurs.

2.3.1 Exponential Distribution Function $G_i$

Consider as a simple but prototypical example the case in which the time intervals $U_{i,n} \in \mathbb{R}_+$ between updates of page $i$ are i.i.d. following an exponential distribution with parameter $\lambda_i$, i.e., $G_i(t) = 1 - e^{-\lambda_i t}$ and $\bar{G}_i(t) = e^{-\lambda_i t}$ [28]. In this case we also assume that all updates are of interest irrespective of their associated mark values (i.e., $K = \mathbb{K}$). Suppose as before that we crawl a total of $x_i$ times in the scheduling interval at the arbitrary times $t_{i,1},\ldots,t_{i,x_i}$. It then follows from Equation (3) that the time-average staleness estimate is given by

$$a_i(t_{i,1},\ldots,t_{i,x_i}) = \frac{1}{T} \sum_{j=0}^{x_i} \int_{t_{i,j}}^{t_{i,j+1}} \left( 1 - \lambda_i \int_0^\infty e^{-\lambda_i (t - t_{i,j} + v)}\,dv \right) dt = \frac{1}{T} \sum_{j=0}^{x_i} \int_{t_{i,j}}^{t_{i,j+1}} \left( 1 - e^{-\lambda_i (t - t_{i,j})} \right) dt. \quad (9)$$

After some minor algebraic manipulations, we obtain that the time-average staleness estimate is given by

$$a_i(t_{i,1},\ldots,t_{i,x_i}) = 1 + \frac{1}{\lambda_i T} \sum_{j=0}^{x_i} \left( e^{-\lambda_i (t_{i,j+1} - t_{i,j})} - 1 \right). \quad (10)$$

Letting $T_{i,j} = t_{i,j+1} - t_{i,j}$ for all $0 \le j \le x_i$, the problem of finding $A_i$ reduces to minimizing

$$1 + \frac{1}{\lambda_i T} \sum_{j=0}^{x_i} \left( e^{-\lambda_i T_{i,j}} - 1 \right) \quad (11)$$

subject to the constraints

$$0 \le T_{i,j} \le T, \quad (12)$$

$$\sum_{j=0}^{x_i} T_{i,j} = T. \quad (13)$$

Modulo a few constants, which are not important to the optimization problem, this now takes the form of a continuous, convex, separable resource allocation problem. As before, the problem is separable by the nature of the objective function. It is continuous because of the first constraint, and it is a resource allocation problem because of the second constraint; it is also convex for the reasons provided below. The key point is that the optimum is known to occur at the value $(T^*_{i,1},\ldots,T^*_{i,x_i})$ where the derivatives $-T^{-1} e^{-\lambda_i T^*_{i,j}}$ of the summands in Equation (11) are equal, subject to the constraints $0 \le T^*_{i,j} \le T$ and $\sum_{j=0}^{x_i} T^*_{i,j} = T$. The general result is originally due to [9], the seminal paper in the theory of resource allocation problems, and there exist several fast algorithms for finding the values of $T^*_{i,j}$. See [12] for a good exposition of these algorithms. In our special case of the exponential distribution, however, the summands are all identical, and thus the optimal decision variables can be found by inspection: they occur when each $T^*_{i,j} = T/(x_i+1)$. Hence, we can write

$$A_i(x_i) = 1 + \frac{x_i + 1}{\lambda_i T} \left( e^{-\lambda_i T/(x_i+1)} - 1 \right), \quad (14)$$

which is easily shown to be convex.
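Equation (14) is straightforward to compute, and its convexity can be checked numerically through the first differences. A quick illustrative Python sketch (ours, not from the paper):

```python
import math

def A_exponential(x, lam, T):
    """Equation (14): the optimal time-average staleness A_i(x) of a page
    with exponential(lam) interupdate times, crawled x times at equally
    spaced intervals T/(x+1) within a scheduling interval of length T."""
    n = x + 1
    return 1.0 + n / (lam * T) * (math.exp(-lam * T / n) - 1.0)
```

The values decrease with the number of crawls, and the first differences $A_i(x+1) - A_i(x)$ are non-decreasing, which is exactly the discrete convexity exploited by the allocation algorithms of Section 2.4.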

2.3.2 General Distribution Function $G_i$

Now let us consider the same scenario as in the previous section, but where the distribution of the interupdate times $U_{i,n} \in \mathbb{R}_+$ for page $i$ is an arbitrary distribution $G_i(\cdot)$ with mean $\lambda_i^{-1}$. Then we observe from Equation (3) a few [...] this formula that the summands remain separable. Given that all the summands are also identical, the optimal decision variables occur when each $T^*_{i,j} = T/(x_i+1)$, as in the exponential case.

2.3.3 Quasi-Deterministic Case

Suppose the marked point process $U_i$ consists of a deterministic sequence of points $\{u_{i,1}, u_{i,2}, \ldots, u_{i,Q_i}\}$ defining possible update times for page $i$, together with a sequence of marks $\{k_{i,1}, k_{i,2}, \ldots, k_{i,Q_i}\}$ defining the probability of whether the corresponding update actually occurs. Here we eliminate the i.i.d. assumption of Section 2.1 and consider an arbitrary sequence of specific times such that $0 \le u_{i,1} < u_{i,2} < \ldots < u_{i,Q_i} \le T$. Recall that $u_{i,0} \equiv 0$ and define $u_{i,Q_i+1} \equiv T$ for convenience. The update at time $u_{i,j}$ occurs with probability $k_{i,j}$. If $k_{i,j} = 1$ for all $j \in \{1,\ldots,Q_i\}$, then the update pattern reduces to being purely deterministic. We shall assume that the value $k_{i,0}$ can be inferred from the crawling strategy employed in the previous scheduling interval(s). Our interest is again in determining the time-average staleness estimate $A_i(x_i)$ for $x_i$ optimally chosen crawls.

A key observation is that all crawls should be done at the potential update times, because there is no reason to delay beyond when the update has occurred. This also implies that we can assume $x_i \le Q_i + 1$, as there is no reason to crawl more frequently. (The maximum of $Q_i + 1$ crawls corresponds to the time 0 and the $Q_i$ other potential update times.) Hence, consider the binary decision variables

$$y_{i,j} = \begin{cases} 1, & \text{if a crawl occurs at time } u_{i,j}; \\ 0, & \text{otherwise}. \end{cases} \quad (15)$$

If we crawl $x_i$ times, then we have $\sum_{j=0}^{Q_i} y_{i,j} = x_i$.

Note that, as a consequence of the above assumptions and observations, the two integrals in Equation (1) reduce to a much simpler form. Specifically, let us consider a staleness probability function $p(y_{i,0},\ldots,y_{i,Q_i};t)$ at an arbitrary time $t$, which we now compute. Recall that $N^u_i(t)$ provides the index of the latest potential update time that occurs at or before time $t$, so that $N^u_i(t) \le Q_i$. Similarly, define

$$J_i(t) = \left( \max\{j : u_{i,j} \le t,\ y_{i,j} = 1,\ 0 \le j \le Q_i\} \right)^+, \quad (16)$$

which is the index of the latest potential update time at or before time $t$ that is actually going to be crawled. Clearly we can also unambiguously use $J_{i,j}$ to abbreviate the value of $J_i(t)$ at any time $t$ for which $j = N^u_i(t)$. Now we have

$$p(y_{i,0},\ldots,y_{i,Q_i};t) = 1 - \prod_{j=J_i(t)+1}^{N^u_i(t)} (1 - k_{i,j}), \quad (17)$$

where a product over the empty set, as per normal convention, is assumed to be 1.

Figure 4 illustrates a typical staleness probability function $p$. (For visual clarity we display the freshness function $1-p$ rather than the staleness function in this figure.) Here the potential update times are denoted by circles on the x-axis. Those which are actually crawled are depicted as filled circles, while those that are not crawled are left unfilled. The freshness function jumps to 1 during each interval immediately to the right of a crawl time, and then decreases, interval by interval, as more terms are multiplied into the product (see Equation (17)). The function is constant during each interval; that is precisely why $J_{i,j}$ can be defined.

[Figure 4 here: a step function of the freshness probability, between 0 and 1, versus time on $[0,T]$.]

Figure 4: Freshness Probability Function for Quasi-Deterministic Web Pages

Now we can compute the corresponding time-average probability estimate as

$$\bar{a}(y_{i,0},\ldots,y_{i,Q_i}) = \frac{1}{T} \sum_{j=0}^{Q_i} (u_{i,j+1} - u_{i,j}) \left[ 1 - \prod_{k=J_{i,j}+1}^{j} (1 - k_{i,k}) \right]. \quad (18)$$

The question of how to choose the optimal x_i crawl times is perhaps the most subtle issue in this paper. We can write this as a discrete resource allocation problem, namely the minimization of Equation (18) subject to the constraints y_{i,j} ∈ {0,1} and Σ_{j=0}^{Q_i} y_{i,j} = x_i. The difficulty is that the decision variables y_{i,j} are highly intertwined in the objective function. While our optimization problem can be solved exactly by the use of so-called non-serial dynamic programming algorithms, as shown in [12], or can be solved as a general integer program, such means of obtaining the solution will not have good performance. Hence, for reasons we shall describe momentarily, we choose to employ a greedy algorithm for the optimization problem: That is, we first estimate the value of A_i(1) by picking that index 0 ≤ j ≤ Q_i for which the objective function will decrease the most when y_{i,j} is turned from 0 to 1. In the general inductive step we assume that we are given an estimate for A_i(x_i − 1). Then, to compute A_i(x_i), we pick that index 0 ≤ j ≤ Q_i with y_{i,j} = 0 for which the objective function decreases the most upon setting y_{i,j} = 1. It can be shown that the greedy algorithm does not, in general, find the optimal solution. However, the average freshness can easily be shown to be an increasing, submodular function (in the number of crawls), and so the greedy algorithm is guaranteed to produce a solution with average freshness at least (1 − 1/e) of the best possible [18]. For the special case we consider, we believe the worst-case performance guarantee of the greedy algorithm is strictly better. Note also that the function A_i, when estimated in this way, is convex: if, between two successive greedy choices, the first difference were to decrease, then the second greedy choice would have been chosen before the first one.
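As an illustration, the greedy procedure can be sketched as follows, assuming a simplified staleness objective in the spirit of Equation (18): here u[j] stands in for the per-interval update probabilities u_{i,j}, and a crawl placed in interval j is assumed to restore freshness at the start of that interval. The function names and the exact objective are ours, not the paper's implementation.

```python
def greedy_crawl_intervals(u, x):
    """Greedily choose x crawl intervals (setting y[j] = 1) so as to
    reduce a simplified time-average staleness objective the most,
    one crawl at a time, in the spirit of Equation (18).

    u: per-interval update probabilities (stand-in for u_{i,j}).
    x: number of crawls to place (the x_i of the text).
    """
    Q = len(u)
    y = [0] * Q

    def staleness(y):
        # Freshness is the product of (1 - u[k]) over the intervals
        # since the most recent crawl; a crawl resets it to 1.
        total, fresh = 0.0, 1.0
        for j in range(Q):
            if y[j]:
                fresh = 1.0          # crawl restores freshness
            fresh *= (1.0 - u[j])    # an update may then intervene
            total += (1.0 - fresh)   # accumulate staleness probability
        return total / Q

    for _ in range(x):
        # Pick the unused interval whose activation helps the most.
        best_j = min((j for j in range(Q) if not y[j]),
                     key=lambda j: staleness(y[:j] + [1] + y[j + 1:]))
        y[best_j] = 1
    return y
```

On a toy profile with a single likely update in the third interval, the greedy choice lands just after that update, as one would expect.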

2.4 Solving the Discrete Separable Convex Resource Allocation Problem

As noted, the optimization problem described above is a special case of a discrete, convex, separable resource allocation problem. The problem of minimizing

    \sum_{i=1}^{N} F_i(x_i)   (19)

subject to the constraints

    \sum_{i=1}^{N} x_i = R   (20)

and

    x_i \in \{m_i, \ldots, M_i\}   (21)

with convex F_i's is very well studied in the optimization literature. We point the reader to [12] for details on these algorithms. We content ourselves here with a brief overview.

The earliest known algorithm for discrete, convex, separable resource allocation problems is essentially due to Fox [9]. More precisely, Fox looked at the continuous case, noting that the Lagrange multipliers (or Kuhn-Tucker conditions) implied that the optimal value occurred when the derivatives were as equal as possible, subject to the above constraints. This gave rise to a greedy algorithm for the discrete case which is usually attributed to Fox. One forms a matrix in which the (i,j)th term D_{i,j} is defined to be the first difference: D_{i,j} = F_i(j+1) − F_i(j). By convexity the columns of this matrix are guaranteed to be monotone, and specifically non-decreasing. The greedy algorithm initially sets each x_i to be m_i. It then finds the index i for which x_i + 1 ≤ M_i and the value of the next first difference D_{i,x_i} is minimal. For this index i* one increments x_{i*} by 1. Then this process repeats until Equation (20) is satisfied, or until the set of allowable indices i empties. (In that case there is no feasible solution.) Note that the first differences are just the discrete analogs of derivatives for the continuous case, and that the greedy algorithm finds a solution in which, modulo Constraint (21), all first differences are as equal as possible. The complexity of the greedy algorithm is O(N + R log N).
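A minimal sketch of Fox's marginal-analysis greedy, with a heap over the first differences to achieve the O(N + R log N) bound (heapify is O(N); each of the R increments costs O(log N)). The function names are ours.

```python
import heapq

def fox_allocate(F, m, M, R):
    """Fox's greedy for: minimize sum_i F_i(x_i) subject to
    sum_i x_i = R and m_i <= x_i <= M_i, with each F_i convex.

    F: list of callables F_i; m, M: per-index bounds; R: total resource.
    Returns the allocation, or None if no feasible solution exists.
    """
    N = len(F)
    x = list(m)                      # start every x_i at its lower bound
    remaining = R - sum(x)
    if remaining < 0 or sum(M) < R:
        return None                  # no feasible solution
    # Heap of first differences D_{i,x_i} = F_i(x_i + 1) - F_i(x_i).
    heap = [(F[i](x[i] + 1) - F[i](x[i]), i) for i in range(N) if x[i] < M[i]]
    heapq.heapify(heap)
    for _ in range(remaining):
        d, i = heapq.heappop(heap)   # smallest first difference wins
        x[i] += 1
        if x[i] < M[i]:              # push this index's next difference
            heapq.heappush(heap, (F[i](x[i] + 1) - F[i](x[i]), i))
    return x
```

For example, with F_1(x) = x^2 and F_2(x) = 2x^2 and R = 3, the marginal costs equalize at the allocation (2, 1).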

There is a faster algorithm for our problem, due to Galil and Megiddo [11], which has complexity O(N (log R)^2). The fastest algorithm is due to Frederickson and Johnson [10], and it has complexity O(max{N, N log(R/N)}). The algorithm is highly complex, consisting of three components. The first component eliminates elements of the D matrix from consideration, leaving O(R) elements and taking O(N) time. The second component iterates O(log(R/N)) times, each iteration taking O(N) time. At the end of this component only O(N) elements of the D matrix remain. Finally, the third component is a linear time selection algorithm, finding the optimal value in O(N) time. For full details on this algorithm see [12]. We employ the Frederickson and Johnson algorithm in this paper.

There do exist some alternative algorithms which could be considered for our particular optimization problem. For the quasi-deterministic pages the optimization problem is inherently discrete, but the portions corresponding to other distributions can be considered as giving rise to a continuous problem (which is done in [6]). In the case of distributions for which the expression for A_i is differentiable, and for which the derivative has a closed-form expression, there do exist very fast algorithms for (a) solving the continuous case, and (b) relaxing the continuous solution to a discrete solution. So, if all web pages had such distributions the above approach could be attractive. Indeed, if most web pages had such distributions, one could partition the set of web pages into two components. The first set could be solved by continuous relaxation, while the complementary set could be solved by a discrete algorithm such as that given by [10]. As the amount of resource given to one set goes up, the amount given to the other set would go down. So a bracket and bisection algorithm [22], which is logarithmic in complexity, could be quite fast. We shall not pursue this idea further here.

3. CRAWLER SCHEDULING PROBLEM

Given that we know how many crawls should be made for each web page, the question now becomes how best to schedule the crawls over a scheduling interval of length T. (Again, we shall think in terms of scheduling intervals of length T. We are trying to optimally schedule the current scheduling interval using some information from the last one.) We shall assume that there are C possibly heterogeneous crawlers, and that each crawler k can handle S_k crawl tasks in time T. Thus we can say that the total number of crawls in time T is R = Σ_{k=1}^{C} S_k. We shall make one simplifying assumption, that each crawl on crawler k takes approximately the same amount of time. Thus we can divide the time interval T into S_k equal-size time slots, and estimate the start time of the lth slot on crawler k by T_{k,l} = (l − 1)T/S_k for each 1 ≤ l ≤ S_k and 1 ≤ k ≤ C.
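For concreteness, the slot start times T_{k,l} = (l − 1)T/S_k can be computed as follows (a trivial sketch; the function name is ours):

```python
def slot_start_times(T, S):
    """Start times of the equal-size crawl slots on each crawler.

    T: length of the scheduling interval.
    S: list with S_k, the number of crawl tasks crawler k handles in T.
    Returns times[k][l-1] = T_{k,l} = (l - 1) * T / S_k.
    """
    return [[(l - 1) * T / Sk for l in range(1, Sk + 1)] for Sk in S]
```

For a one-day interval (T = 86400 seconds), a crawler with S_k = 2 gets slots starting at 0 and 43200 seconds.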

We know from the previous section the desired number of crawls x*_i for each web page i. Since we have already computed the optimal schedule for the last scheduling interval, we further know the start time t_{i,0} of the final crawl for web page i within the last scheduling interval. Thus we can compute the optimal crawl times t*_{i,1}, …, t*_{i,x*_i} for web page i during the current scheduling interval. For the stochastic case it is important for the scheduler to initiate each of these crawl tasks at approximately the proper time, but being a bit early or a bit late should have no serious impact for most of the update probability distribution functions we envision. Thus it is reasonable to assume a scheduler cost function for the jth crawl of page i, whose update patterns follow a stochastic process, that takes the form S(t) = |t − t*_{i,j}|. On the other hand, for a web page i whose update patterns follow a quasi-deterministic process, being a bit late is acceptable, but being early is not useful. So an appropriate scheduler cost function for the jth crawl of a quasi-deterministic page i might have the form

    S(t) = \begin{cases} \infty & \text{if } t < t^*_{i,j} \\ t - t^*_{i,j} & \text{otherwise} \end{cases}   (22)

In terms of scheduling notation, the above crawl task is said to have a release time of t*_{i,j}. See [3] for more information on scheduling theory.
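The two scheduler cost functions can be sketched directly (a small illustration; the function names are ours):

```python
def stochastic_cost(t, t_star):
    """Cost for a page with stochastic updates: being early or late
    is penalized symmetrically, S(t) = |t - t*|."""
    return abs(t - t_star)

def quasi_deterministic_cost(t, t_star):
    """Cost for a quasi-deterministic page (Equation (22)): crawling
    before the ideal time t* is useless, so t* acts as a release time
    and earliness costs infinity; lateness is penalized linearly."""
    return float("inf") if t < t_star else t - t_star
```

In the transportation problem below, these functions supply the arc costs, so an infinite cost simply rules the corresponding crawler slot out.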

Virtually no work seems to have been done on this particular scheduling problem. Fortunately, there is a simple, exact solution for the problem. Specifically, the problem can be posed and solved as a transportation problem, as we shall now describe. See [1] for more information on transportation problems and network flows in general. We describe our scheduling problem in terms of a network.

Figure 5: Transportation Problem Network

We define a bipartite network with one directed arc from each supply node to each demand node. The R supply nodes, indexed by j, correspond to the crawls to be scheduled. Each of these nodes has a supply of 1 unit. There will be one demand node per time slot and crawler pair, each of which has a demand of 1 unit. We index these by 1 ≤ l ≤ S_k and 1 ≤ k ≤ C. The cost of the arc emanating from a supply node j to a demand node kl is S_j(T_{k,l}). Figure 5 shows the underlying network for an example of this particular transportation problem. Here for simplicity we assume that the crawlers are homogeneous, and thus that each can crawl the same number S = S_k of pages in the scheduling interval T. In the figure the number of crawls is R = 4, which equals the number of crawler time slots. The number of crawlers is C = 2, and the number of crawls per crawler is S = 2. Hence R = CS.

The specific linear optimization problem solved by the transportation problem can be formulated as follows. Minimize

    \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{M} R_i(T_{jk}) f_{ijk}   (23)

such that

    \sum_{i=1}^{M} f_{ijk} = 1 \quad \forall\, 1 \le j \le N \text{ and } 1 \le k \le M,   (24)

    \sum_{j=1}^{N} \sum_{k=1}^{M} f_{ijk} = 1 \quad \forall\, 1 \le i \le M,   (25)

    f_{ijk} \ge 0 \quad \forall\, 1 \le i, k \le M \text{ and } 1 \le j \le N.   (26)

The solution of a transportation problem can be accomplished quickly. See, for example, [1].

The transportation problem formulation ensures that there exists an optimal solution with integral flows, and the techniques in the literature find such a solution. Again, see [1] for details. This implies that each f_{ijk} is binary. If f_{ijk} = 1 then a crawl of web page i is assigned to the jth crawl of crawler k.

If it is required to fix or restrict certain crawl tasks from certain crawler slots, this can be easily done: One simply changes the costs of the restricted directed arcs to be infinite. (Fixing a crawl task to a subset of crawler slots is the same as restricting it from the complementary crawler slots.)
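To illustrate the formulation on a toy instance such as that of Figure 5 (R = 4 crawls, C = 2 crawlers, S = 2 slots each), note that with unit supplies and unit demands an optimal integral flow is simply a minimum-cost perfect assignment of crawl tasks to (crawler, slot) pairs. A real system would use a network-flow solver (the paper uses IBM's OSL); the exhaustive search below is only for illustration, and the function name is ours.

```python
from itertools import permutations

def schedule_crawls(cost):
    """Exhaustively solve a tiny assignment instance of the
    transportation problem.

    cost[j][s] is the cost of placing crawl task j into crawler slot s,
    where slots enumerate all (crawler, time-slot) pairs. Restricted
    arcs carry infinite cost and are thus never chosen if avoidable.
    Returns (assignment, total_cost) with assignment[j] = chosen slot.
    """
    R = len(cost)
    best = None
    for perm in permutations(range(R)):     # every perfect assignment
        total = sum(cost[j][perm[j]] for j in range(R))
        if best is None or total < best[1]:
            best = (list(perm), total)
    return best
```

With an infinite cost on a restricted arc, the solver routes that task to another slot, exactly as the arc-cost trick in the text prescribes.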

4. PARAMETERIZATION ISSUES

The use of our formulation and solution in practice requires calculating estimates of the parameters of our general model framework. In the interest of space, we sketch here some of the issues involved in addressing this problem, and refer the interested reader to the sequel for additional details.

Note that when each page i is crawled we can easily obtain the last update time for the page. While this does not provide information about any other updates occurring since the last crawl of page i, we can use this information together with the data and models for page i from previous scheduling intervals to statistically infer key properties of the update process for the page. This is then used in turn to construct a probability distribution (including a quasi-deterministic distribution) for the interupdate times of page i.

Another important aspect of our approach concerns the statistical properties of the update process. The analysis of previous studies has essentially assumed that the update process is Poisson [6, 7, 26], i.e., the interupdate times for each page follow an exponential distribution. Unfortunately, very little has been published in the research literature on the properties of update processes found in practice, with the sole exception (to our knowledge) of a recent study [19] suggesting that the interupdate times of pages at a news service web site are not exponential. To further investigate the prevalence of exponential interupdate times in practice, we analyze the page update data from another web site environment whose content is highly accessed and highly dynamic. Specifically, we consider the update patterns found at the web site for the 1998 Nagano Olympic Games, referring the interested reader to [16, 25] for more details on this environment. Figure 6 plots the tail distributions of the time between updates for each of a set of 18 individual dynamic pages which are representative of the update pattern behaviors found in our study of all dynamic pages that were modified a fair number of times. In other words, the curves illustrate the probability that the time between updates to a given page is greater than t, as a function of time t.

Figure 6: Tail Distributions of Update Processes

We first observe from these results that the interupdate time distributions can differ significantly from an exponential distribution. More precisely, our results suggest that the interupdate time distribution for some of the web pages at Nagano has a tail that decays at a subexponential rate and can be closely approximated by a subset of the Weibull distributions; i.e., the tail of the long-tailed Weibull interupdate distribution is given by Ḡ(t) = e^{−(λt)^β}, where t ≥ 0, λ > 0 and 0 < β < 1. We further find that the interupdate time distribution for some of the other web pages at Nagano has a tail that can be closely approximated by a class of Pareto distributions; i.e., the tail of the Pareto interupdate time distribution is given by Ḡ(t) = (β/(β + t))^α, where t ≥ 0, β > 0 and 0 < α < 2. Moreover, some of the periodic behavior observed in the update patterns for some web pages can be addressed with our quasi-deterministic distribution.
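These tail forms are easy to work with directly. In the sketch below, the parameterizations Ḡ(t) = exp(−(λt)^β) and Ḡ(t) = (β/(β+t))^α follow one common convention, an assumption on our part since the printed formulas lost their Greek letters; the empirical-tail helper shows how curves like those in Figure 6 are obtained from observed interupdate times.

```python
import math

def weibull_tail(t, lam, beta):
    """Long-tailed Weibull tail, G(t) = exp(-(lam * t)**beta),
    subexponential decay when 0 < beta < 1."""
    return math.exp(-((lam * t) ** beta))

def pareto_tail(t, beta, alpha):
    """Pareto tail, G(t) = (beta / (beta + t))**alpha; the distribution
    is heavy-tailed when 0 < alpha < 2."""
    return (beta / (beta + t)) ** alpha

def empirical_tail(samples, t):
    """Empirical estimate of P(time between updates > t) from a list
    of observed interupdate times."""
    return sum(1 for s in samples if s > t) / len(samples)
```

Plotting empirical_tail against the two model tails over a grid of t values reproduces the kind of comparison made in Figure 6.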

The above analysis is not intended to be an exhaustive study by any means. Our results, together with those in [19], suggest that there are important web site environments in which the interupdate times do not follow an exponential distribution. This includes the quasi-deterministic instance of our general model, which is motivated by information services as discussed above. The key point is that there clearly are web environments in which the update process for individual web pages can be much more complex than a Poisson process, and our general formulation and solution of the crawler scheduling problem makes it possible for us to handle such a wide range of web environments within this unified framework.

5. EXPERIMENTAL RESULTS

Using the empirical data and analysis of the previous section we now illustrate the performance of our scheme. We will focus on the problem of finding the optimal number of crawls, comparing our scheme with two simpler algorithms. Both of these algorithms were considered in [6], and they are certainly natural alternatives. The first scheme might be called proportional: We simply allocate the total number of crawls R according to the average update rates of the various web pages. Modulo integrality concerns, this means that we choose x_i ∝ λ_i. The second scheme is simpler yet, allocating the number of crawls as evenly as possible amongst the web pages. We label this the uniform scheme. Each of these schemes can be amended to handle our embarrassment metric weights. When speaking of these variants we will use the terms weighted proportional and weighted uniform. The former chooses x_i ∝ w_i λ_i. The latter is something of a misnomer: We are choosing x_i ∝ w_i, so this is essentially a slightly different proportional scheme. We can also think of our optimal scheme as weighted, solving for the smallest objective function Σ_{i=1}^{N} w_i A_i. If we solve instead for the smallest objective function Σ_{i=1}^{N} A_i, we get an unweighted optimal algorithm. This is essentially the same problem solved in [6], provided, of course, that all web pages are updated according to a Poisson process. But our scheme will have much greater speed. Even though this algorithm omits the weights in the formulation of the problem, we must compare the quality of the solution based on the weighted objective function value.
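The two baseline allocations (and their weighted variants) can be sketched as follows; the rounding rule is our own choice, since the text only notes "modulo integrality concerns":

```python
def proportional_allocation(rates, R, weights=None):
    """Allocate R crawls with x_i proportional to (w_i *) lambda_i.
    Fractions are floored and leftovers handed out by largest
    fractional part, one simple way to respect integrality."""
    w = weights or [1.0] * len(rates)
    scores = [wi * li for wi, li in zip(w, rates)]
    total = sum(scores)
    raw = [R * s / total for s in scores]
    x = [int(r) for r in raw]
    leftovers = sorted(range(len(x)), key=lambda i: raw[i] - x[i], reverse=True)
    for i in leftovers[: R - sum(x)]:
        x[i] += 1
    return x

def uniform_allocation(n, R):
    """Allocate R crawls as evenly as possible among n pages."""
    q, r = divmod(R, n)
    return [q + (1 if i < r else 0) for i in range(n)]
```

Passing weights gives the weighted-proportional variant; weighted uniform is just proportional_allocation with the weights as the rates.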

In our experiments we consider combinations of different types of web page update distributions. In a number of cases we use a mixture of 60% Poisson, 30% Pareto and 10% quasi-deterministic distributions. In the experiments we choose T to be one day, though we have made runs for a week as well. We set N to be one million web pages, and varied R between 1.5 and 5 million crawls. We assumed that the average rate of updates over all pages was 1.5, and these updates were chosen according to a Zipf-like distribution with parameters N and θ, the latter chosen between 0 and 1 [29, 15]. Such distributions run the spectrum from highly skewed (when θ = 0) to totally uniform (when θ = 1). We considered both the staleness and embarrassment metrics. When considering the embarrassment metric we derived the weights in Equation (8) by considering a search engine which returned 5 result pages per query, with 10 urls on a page. The probabilities b_{i,j,k} with which the search engine returns page i in position j of query result page k were chosen by linearizing these 50 positions, picking a randomly chosen center, and imposing a truncated normal distribution about that center. The clicking frequencies c_{j,k} for position j of query page k are chosen as described in Section 2.2, via a Zipf-like function with parameters 50 and θ = 0, with a geometric function for cycling through the pages. We assumed that the client went from one result page to the next with probability 0.8. We choose the lucky loser probability d_i of web page i yielding an incorrect response to the client query by picking a uniform random number between 0 and 1. All curves in our experiments display the analytically computed objective function values of the various schemes.

Figure 7: Two Embarrassment Metric Examples

Figure 7 shows two experiments using the embarrassment metric. In the left-hand side of the figure we consider the effect of varying the ratio of R to N from 1.5 to 5. We consider here a true Zipf distribution for the update frequencies; in other words, we choose Zipf-like parameter θ = 0. There are six curves, namely optimal, proportional and uniform, in both weighted and unweighted versions. (The unweighted optimal curve is the result of employing unit weights during the computation phase, but displaying the weighted optimal objective function.) By definition the unweighted optimal scheme will not perform as well as the weighted optimal scheme, which is indeed the best possible solution. In all other cases, however, the unweighted variant does better than the weighted variant. So the true uniform policy does the best amongst all of the heuristics, at least for the experiments considered in our study. This somewhat surprising state of affairs was noticed in [6] as well. Both uniform policies do better than their proportional counterparts. Notice that the weighted optimal curve is, in fact, convex as a function of increasing R. This will always be true.

In the right-hand side of Figure 7 we show a somewhat differing mixture of distributions. In this case we vary the Zipf-like parameter θ while holding the value of R to be 2.5 million (so that R/N = 2.5). As θ increases, thus yielding less skew, the objective functions generally increase as well. This is appealing, because it shows in particular that the optimal scheme does very well in highly skewed scenarios, which we believe are more representative of real web environments. Moreover, notice that the curves essentially converge to each other as θ increases. This is not too surprising, since the optimal, proportional and uniform schemes would all result in the same solutions precisely in the absence of weights when θ = 1. In general the uniform scheme does relatively better in this figure than it did in the previous one. The explanation is complex, but it essentially has to do with the correlation of the weights and the update frequencies. In fact, we show this example because it puts uniform in its best possible light.

In Figure 8 we show four experiments where we varied the ratio of R to N. These figures depict the average staleness metric, and so we only have three (unweighted) curves per figure. The top left-hand figure depicts a mixture of update distribution types, and the other three figures depict, in turn, pure Poisson, Pareto and quasi-deterministic distributions. Notice that these curves are cleaner than those in Figure 7. The weightings, while important, introduce a degree of noise into the objective function values. For this reason we will focus on the non-weighted case from here on. The y-axis ranges differ on each of the four figures, but in all cases the optimal scheme yields a convex function of R/N (and thus of R). The uniform scheme performs better than the proportional scheme once again. It does relatively less well in the Poisson update scenario. In the quasi-deterministic figure the optimal scheme is actually able to reduce average staleness to 0 for sufficiently large R values.

In Figure 9 we explore the case of Pareto interupdate distributions in more detail. Once again, the average staleness metric is plotted, here as a function of the parameter α in the Pareto distribution; refer to Section 4. This distribution is said to have a heavy tail when 0 < α < 2, which is quite interesting because it is over this range of parameter values that the optimal solution is most sensitive. In particular, we observe that the optimal solution value is rather flat for values of α ranging from 4 toward 2. However, as α approaches 2, the optimal average staleness value starts to rise, and it continues to increase in an exponential manner as α ranges from 2 toward 0. These trends appear to hold for all three schemes, with our optimal scheme continuing to provide the best performance and uniform continuing to outperform the proportional scheme. Our results suggest the importance of supporting heavy-tailed distributions when they exist in practice (and our results of the previous section demonstrate that they do indeed exist in practice). This is appealing because it shows in particular that the optimal scheme does very well in these complex environments, which may be more representative of real web environments than those considered in previous studies.

The transportation problem solution to the scheduling problem is optimal, of course. Furthermore, the quality of the solution, measured in terms of the deviation of the actual time slots for the various tasks from their ideal time slots, will nearly always be outstanding. Figure 10 shows an illustrative example. This example involves the scheduling of one day with C = 10 homogeneous crawlers and 1 million crawls per crawler. So there are 10 million crawls in all.

Figure 8: Four Average Staleness Metric Examples: Mixed, Poisson, Pareto and Quasi-Deterministic

Figure 9: Pareto Example

Figure 10: Distribution of Actual/Ideal Task Time Slots

Examining the quasi-deterministic tasks, we notice that they occur on or after their ideal time slots, as required. The quasi-deterministic crawls amounted to 20% of the overall crawls in this example. The bottom line is that this scheduling problem will nearly always yield optimal solutions of very high absolute quality.

The crawling frequency scheme was implemented in C and run on an IBM RS/6000 Model 50. In no case did the algorithm require more than a minute of elapsed time. The crawler scheduling algorithm was implemented using IBM's Optimization Subroutine Library (OSL) package [13], which can solve network flow problems. Problems of our size run in approximately two minutes.

6. CONCLUSION

Given the important role of search engines in the World Wide Web, we studied the crawling process employed by such search engines with the goal of improving the quality of the service they provide to clients. Our analysis of the optimal crawling process considered both the metric of staleness, as done by the few studies in this area, and the metric of embarrassment, which we introduced as a preferable goal. We proposed a general two-part scheme to optimize the crawling process, where the first component determines the optimal number of crawls for each page together with the optimal times at which these crawls should take place if there were no practical constraints. The second component of our scheme then finds an optimal achievable schedule for a set of crawlers to follow. An important contribution of the paper is this formulation, which makes it possible for us to exploit very efficient algorithms. These algorithms are significantly faster than those considered in previous studies. Our formulation and solution is also more general than previous work for several reasons, including the use of weights in the objective function and the handling of significantly more general update patterns. Given the lack of published data on web page update patterns, and given the assumption of exponential interupdate times in the analysis of the few previous studies, we analyzed the page update data from a highly accessed web site serving highly dynamic pages. The corresponding results clearly demonstrate the benefits of our general unified approach in that the distributions of the times between updates to some web pages clearly span a wide range of complex behaviors. By accommodating such complex update patterns, we believe that our optimal scheme can provide even greater benefits in real-world environments than previous work in the area.

Acknowledgement. We thank Allen Downey for pointing us to [19].

7. REFERENCES

[1] R. Ahuja, T. Magnanti and J. Orlin, Network Flows, Prentice Hall, 1993.
[2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke and S. Raghavan, "Searching the Web", ACM Transactions on Internet Technology, 1(1), 2001.
[3] J. Blazewicz, K. Ecker, G. Schmidt and J. Weglarz, Scheduling in Computer and Manufacturing Systems, Springer-Verlag, 1993.
[4] A. Broder, Personal communication.
[5] J. Challenger, P. Dantzig, A. Iyengar, M.S. Squillante and L. Zhang, "Efficiently Serving Dynamic Data at Highly Accessed Web Sites", Preprint, May 2001.
[6] J. Cho and H. Garcia-Molina, "Synchronizing a Database to Improve Freshness", ACM SIGMOD Conference, 2000.
[7] E. Coffman, Z. Liu and R. Weber, "Optimal Robot Scheduling for Web Search Engines", INRIA Research Report, 1997.
[8] F. Douglis, A. Feldmann and B. Krishnamurthy, "Rate of Change and other Metrics: A Live Study of the World Wide Web", USENIX Symposium on Internetworking Technologies and Systems, 1999.
[9] B. Fox, "Discrete Optimization via Marginal Analysis", Management Science, 13:210-216, 1966.
[10] G. Frederickson and D. Johnson, "The Complexity of Selection and Ranking in X+Y and Matrices with Sorted Columns", Journal of Computer and System Science, 24:197-208, 1982.
[11] Z. Galil and N. Megiddo, "A Fast Selection Algorithm and the Problem of Optimum Distribution of Efforts", Journal of the ACM, 26:58-64, 1981.
[12] T. Ibaraki and N. Katoh, Resource Allocation Problems: Algorithmic Approaches, MIT Press, Cambridge, MA, 1988.
[13] International Business Machines Corporation, Optimization Subroutine Library Guide and Reference, IBM, 1995.
[14] N. Katoh and T. Ibaraki, "Resource Allocation Problems", in Handbook of Combinatorial Optimization, D.-Z. Du and P. Pardalos, editors, Kluwer Academic Press, 2000.
[15] D. Knuth, The Art of Computer Programming, vol. 2, Addison Wesley, 1973.
[16] A. Iyengar, M. Squillante and L. Zhang, "Analysis and Characterization of Large-Scale Web Server Access Patterns and Performance", World Wide Web, 2:85-100, 1999.
[17] S. Lawrence and C. Giles, "Accessibility of Information on the Web", Nature, 400:107-109, 1999.
[18] G. Nemhauser and L. Wolsey, Integer and Combinatorial Optimization, J. Wiley, 1988.
[19] V.N. Padmanabhan and L. Qiu, "The Content and Access Dynamics of a Busy Web Site: Findings and Implications", ACM SIGCOMM '00 Conference, 2000.
[20] M. Pinedo, Scheduling: Theory, Algorithms and Systems, Prentice-Hall, 1995.
[21] J. Pitkow and P. Pirolli, "Life, Death and Lawfulness on the Electronic Frontier", CHI Conference on Human Factors in Computing Systems, 1997.
[22] W. Press, B. Flannery, S. Teukolsky and W. Vetterling, Numerical Recipes, Cambridge University Press, 1986.
[23] S.M. Ross, Stochastic Processes, John Wiley and Sons, Second Edition, 1997.
[24] K. Sigman, Stationary Marked Point Processes: An Intuitive Approach, Chapman and Hall, 1995.
[25] M. Squillante, D. Yao and L. Zhang, "Web Traffic Modeling and Web Server Performance Analysis", IEEE Conference on Decision and Control, 1999.
[26] J. Talim, Z. Liu, P. Nain and E. Coffman, "Optimizing the Number of Robots for Web Search Engines", Telecommunication Systems Journal, 17(1-2):234-243, 2001.
[27] C. Wills and M. Mikhailov, "Towards a Better Understanding of Web Resources and Server Responses for Improved Caching", WWW Conference, 1999.
[28] R.W. Wolff, Stochastic Modeling and the Theory of Queues, Prentice Hall, 1989.
