AN ADAPTIVE FILE DISTRIBUTIONALGORITHM FOR WIDEAREA NETWORK
TAKASHI HOSHINO
∗
,KENJIRO TAURA
∗
, AND TAKASHI CHIKAYAMA
∗
Abstrat. Thispaperdesribesadatadistributionalgorithmsuitableforopyinglargelestomanynodesinmultiplelusters
inwide-areanetworks. Itisaself-organizingalgorithmthatahievespipelinetransfers,faulttolerane,salability,andaneient
routeseletion. Itworksinthepreseneoftoday'stypialnetworkrestritionssuhasrewallsandNetworkAddressTranslations,
makingitsuitableinwide-areasetting. Experimentalresultsindiateouralgorithmisabletoautomatiallybuildatransferroute
losetotheoptimal. Propagationofa300MBlefromonerootnodetoover150nodestakesabout1.5timesaslongasthebest
timeobtainedbythemanuallyoptimizedtransferroute.
Keywords. Self-stabilizingdistributedalgorithm,faulttolerane,salability,wide-areanetwork
1. Introdution. This paper desribes apratial algorithm for opying large data (typially in a le)
fromasourenode(s) to manydestinationnodesinparallel. We seekasalablesolutionsuitablebothwithin
aluster and aross many lusters in wide-area. By suitablewithin a luster, we mean that it fully utilizes
the available bandwidth of LAN/luster interonnet. Forexample, assuming 32 nodes are onneted via a
suiently high-throughput swith, it should be ableto opya singlelarge le to the 32 nodes in not muh
morethanthetime it takesto opy thele toa singlenode. Suh analgorithm must atleast performmany
one-to-onetransfersin parallel. Bysuitableinwide-area,wemeanitmakesagood hoiein seletingtransfer
routes. If manynodesin aluster retrieve data from another luster, alink arossthe twoeasily saturates.
Thus suhanalgorithmshould haveamehanismto transferdatawithin alusterwherepossible.
Tobepratial,itshouldworkwithasimpleandsmallmanualongurationthatmaynotbeveryaurate.
It won't be pratialto assume, for example, that theuser givesa omplete andaurate information about
physial network topology, desirable paths for transferring data, or even logial network onnetivity (i.e.,
network settings suh as rewall and Network Address Translation (NAT)). Assuming suh information is
notpratialnot onlybeausethe usermaynot want to write them, but also beausesuh informationmay
hange overtime dueto suheventsasnode/network failures. Thesystemtherefore musttolerateinaurate
informationandadapttotheonditionsobservedatruntime. Suhanadaptivesystemnaturallysupportsfault
toleranein thesensethatevenifsomenodesfail, remainingnodesaomplishtheirworkandnodesthat one
failedanjoin thetransferagain.
Webelievesuh afault-tolerantand adaptivelerepliatorisamandatorybuilding blokforlusterand
Grid omputing. It may be used for installing large program/data to many nodes. It may also be used in
le synhronizers[5℄ sotheysupport synhronizing dataamong alargenumberof nodes in parallel. Perhaps
mostimportant, repliating alarge datato many nodes will beapratial tehniqueto reset adistributed
omputation;itsimplyreinitializesalltheinvolvednodes,soastoreoverfromsomebroken/inonsistentstates.
Thisobservationaordswith reentpratiesin large-saleluster management,wherereinstallingoperating
systemsfromsrathisonsideredasanormaloperation,ratherthanthelastresort,toxbrokenlusters[13℄.
Togetanintuitiveideaabouthowagoodtransferroutetypiallylooks,onsideranetwork inFigure1.1.
Therearetwoloalareanetworks(LANs) named
A
andB
,eah inludingthreelusters(A1
,A2
,andA3
inA
andB1
,B2
,andB3
inB
). Assumenodesan onnetto eahotherviatheTCPlayer.Supposethedataisonanodeinluster
A1
andshouldpropagatetoallothernodes. Inthegure,asmall irleisanode,aretangleaswith,andalineonnetinganodeandaswithanetworkablethatantransferdatawith100Mbps. 1
Intuitively, the best strategy is to form a transfer route like the one shown in Figure 1.2. Figure 1.3
representsthesameroutein thephysialtopology. Speially,thefollowingtwopropertiesareimportant.
•
Thenumberof onnetionsthat rossaLAN/lusterboundaryissmall; there is onlyoneonnetionarossthetwoLANsand veonnetionsarossthesixlusters.
•
Theentiretransferrouteformsalist. That is,nonodesservedatatotwoormorenodes.Thereasonwhytherstpropertyisimportantwillbelear. Asimplealulationwill revealthat ifnodesare
randomly onnetedwithout any eortto onnet nodes loseto eah other, links arossLANs/lusters will
∗
UniversityofTokyo,{hoshino,tau,hikayama}logos.t.u-tokyo.a.jp
1
Cluster B1
Cluster B2
Cluster B3
Cluster A1
Cluster A2
Cluster A3
Subnet A
Subnet B
Node
Switch
100Mbps Line
Fig.1.1. Typialnetworkenvironmentforwhihoursolutionissuitable
Cluster A1
Cluster A2
Cluster A3
Cluster B2
Cluster B1
Cluster B3
Subnet B
Subnet A
Root
Intra-cluster edge
Intra-subnet edge
Inter-cluster edge
Inter-subnet edge
Fig.1.2. Besttransferrouteinappliation layer
easily beome abottlenek. This isespeially truein today'stypialnetwork ongurationwhere apaity of
longlinks(orporate-/ampus-/wide-area)issimilartooratbestonlyanorderofmagnitudelargerorsothan
typial loal arealinks. For example, letus assumefor the purpose ofdisussion that wehavetwo100Mbps
swithed LANs onneted via a 1Gbps link. In suh settings, we should be able to transferdata among all
thenodesinthetwoLANsapproximatelyattheLANbandwidth(100Mbps),but ifonnetionsarerandomly
hosen, alink arossthe twoLANsansustain only10suh onnetionsatbest. Thusthe1Gbps link won't
beenoughforsupporting10ormorenodesin eah sideof it.
Theseondbulletmaybelessobvious. ItisimportantforreduingthebottlenekinNICs. Supposethree
Cluster B1
Cluster B2
Cluster B3
Cluster A1
Cluster A2
Cluster A3
Subnet A
Subnet B
Node
Switch
100Mbps Line
Root
Fig.1.3.Best transferrouteaordingtoourguidelinesintypialnetwork
whenlinksarossLANsaresuientlypowerful,so theywon'tbeomebottleneksaslongaswemaintainthe
rstproperty.
Ouralgorithmtries tobuild atransferrouteloseto suh bestroutes. Notethat itisnotalwayspossible
to onnetallnodes in alist. Forexample,ifrewallsdonotallowsomeonnetions, itmaybeunavoidable
forsomenodesto servedata totwoor morehildren. Thus, ouralgorithm in generalforms atransferforest,
with some heurististo onnet nodes lose to eah other and to make the tree deeper. It may be aforest,
ratherthan asingletree, beausethere maybemultiple nodes that haveomplete datain the beginning. In
suh ases,aseparatetreewillbeformedrootedat eahsourenode.
Thepaperisorganizedasfollows. Setion2desribesamodelofthenetworkandthegoalofthisresearh.
Then,weproposeouralgorithm andproofofeienyin Setion3. Andvalidationandevaluationareshown
in Setion 4. In Setion 5, we explain related work. Finally, we onlude and summarize this researh and
remarktofuture workin Setion6.
2. ProblemDesription. Inthissetion,wedenegoalsof thealgorithmandformalizetheproblem.
2.1. Goals.
Tolerate faults and adapt to resoure onditions: Copyingalargele tomanynodestakesalongtime.
Thereforeour solution must toleratetemporal/permanentnetwork faults and node rashes. Whena
noderashes,nodesreeivingdatafrom therashednodemustndasubstitutesothattheremaining
nodesnishtheirtasks. Whenanodereovers,itmustbeabletojointhetransfernetworkandontinue
itsjob,withoutwaitingfortheongoingoperationtonishandthenrestartingfromsrath. Inaddition
to beingfault-tolerant, itmust adapt to hangesin network onditions; itshould hange thetransfer
routedepending onhangesofonditions.
Both of these requirements prelude a simplisti solution that statially onstruts a route in the
beginning and tries to retain the sameroute until they nish. Nodes must ontinuouslysearhfor a
bettertransferroute.
Make aneienttransfer routeautomatially: AsmotivatedinSetion1,ourgeneralriteriaforgood
datatomultiple(morethanone)hildren. Thisisbasedonourassumptionthatabottlenekistypially
aused by an inter-subnetedge or anode. Examples for thelatter are disks and network interfaes.
Ouralgorithm tries to optimizethe numberof longonnetionsand the numberof hildrenfor eah
node,withaverysimpleloalsearhheuristis.
Work on today'stypial networkongurations: Today's typial network ongurations do not allow
eah node to onnet to all other nodes. Firewalls may blok onnetions between LANs. Inside a
LAN, itis ommonto plae all luster nodesbut one (a master node) behind aNAT router,so that
aessestolustersneedgothroughthemaster. WithDHCP,itmayevenbeimpratialtoassumeall
nodestohavepersistentnames.
In short, wemust model the network as a generalgraph where allowed onnetions are represented
byits edges. Yet itis impratialto assumesuhagraphis given by theuser(or theadministrator)
eitheroineorinthebeginningofthealgorithm. Altogether,wemustdesignanalgorithmthatbegins
withaminimumamountofglobalinformation(e.g.,partiipatingnodes)andaloalknowledgeofthe
network(e.g.,neighbors)in eah node.
Do notassumephysial networktopology: Knowing physial network topology would help us to
opti-mize transfer routes. Designing the algorithm assuming aomplete knowledge aboutit is, however,
impratial formanyreasons disussedsofar. First itis umbersome fortheuser orthe
administra-tor to maintain suh information. We may be able to obtain suh information by using tools suh
astraeroute, but suh toolstend to beunavailable these daysfor seurity onsiderations. It is also
diult to obtain the topologyof the network behind asingle router with traeroute. Seond, even
iftopologyinformationisavailable,dynamiallyprobingthenetworkis alwaysneessaryto makethe
algorithm fault-tolerant and adaptive. Algorithms based on probing onnetivity and proximity at
runtimenaturallyworkwithoutdetailedknowledgeaboutnetworktopology.
Of ourse, weould alwaysuse physial topologyashints to ouralgorithm, among manyother hints
suhasIP addressprex,lateny,andobservedthroughput.
Toahievethesegoals,eahnodeinvolvedinouralgorithmontinuouslyseeksaparent,anodethatserves
datato thenode. Whenitfaes suh eventsasparentrashesordisonnetions, ittriesto ndanewparent.
Even without suh events, they ontinuously searh for abetter parent to optimize the transferroute. The
riteriaforabetterparentarethat(1)theloseranodeistoitselfthebetter,and(2)thefewerhildrenanode
hasthebetter.
Ouralgorithmisasimpleloalsearhalgorithmthatonvergestoasatisfatorytransferintypialnetwork
ongurationsoftoday. Ideally,wedesireanalgorithmtondagloballyoptimalsolutionforanygivennetwork.
Aplausibledenitionoftheoptimalwouldbetominimizethesumofseletededgeweightsandthenumberof
branhes(orequivalently,thenumberofleaves)inthegraph. Thetworiteriamayonitforgeneralweighted
graphsandeven iftheydonot,theywill requireaomplexglobal optimizationalgorithm (e.g.,fault-tolerant
MST onstrution) whose pratial importane may not be very lear. In the following, we formulate our
problemandproveoursimplealgorithmhasapropertywhihtranslatestoasuientlygood transferroute
intypialreal networkongurations.
2.2. ProblemFormulation. Asusual,wemodelthenetworkbyadiretedgraph
G
=
hV, Ei
,whereV
is asetofnodespartiipatingintherepliation.E
representspossibleonnetionsbetweennodes;(
a, b
)
∈
E
⇐⇒
a
knowsb
'snameandtheurrentnetwork statusallowsa
toonnettob
.The graphis for modeling purposes only; in pratie, the network status may hange overtime, soeah
nodeannotknowtheompletestatusofthenetwork. It mayevenbeimpratialto assumeeahnodeknows
alltheneighborsit anonnetto. Inourimplementation, eah nodebeginswith knowinginformationabout
afewofitsneighborsandreeivingaommandthatinstrutsittopartiipateintherepliationofale. They
learnother node nameson the yby propagating informationalong establishedonnetions. This way, they
learnother onnetionsthey may be ableto make. Theylearn whethera partiularonnetion is allowed or
not by tryingto establisha onnetiononly when neessary. Nodes never maintaininformation about edges
theyarenotadjaentto.
Below, we proveour optimization algorithm eventually reahes atransfer forest that has somedesirable
properties, assuming that thegraphis xed at somepoint. Notethat our algorithmorretly nishesits job
Todene thegoodness ofatransferforest,wemust introdueanotionofdistane betweennodes. One
plausible formulation would be to give edges arbitrary weights, and to aim at reduing the total weights of
seleted edges (i.e., minimum spanning forest). We do not use this formulation but introdue a stronger
assumptionaboutthedistanes betweennodeswhihwebelieveisapratialapproximationof realnetworks,
andshowasimplerloalsearhobtainssuientlygoodresults.
Weassumenodesanbedeomposedintogroupssothatnodeslosetoeahotheronstituteagroup. Our
optimizationalgorithm doesnotassumethat eahnodeknowsthedeompositionexpliitly,but onlyassumes
that eah node an somehow ompare relative distanes from the loal node to other nodes. We show in
Setion3.3suhaomparisoninduesadeomposition. Itissuhadeompositionforwhihouralgorithmtries
toreduethenumberofinter-groupedges. Again,therepliationorretlynisheswithinaurateinformation,
thusanimplementationanuseanysuientlyauratemeasurement. Oururrentimplementationisgivenin
Setion3.2.1.
We say a deomposition is omplete if nodes in eah group form a lique (a omplete subgraph) of
G
. That is, nodes inside a group an onnet to eah other without being bloked by, e.g., rewalls. For anydeomposition whih mayor maynot beomplete, onean derive aomplete deomposition by dividing its
inompletegroupintoanumberofgroupssoeahofthemisalique. Weallsuhaompletedeompositiona
ompletesubdivision oftheoriginaldeomposition. Givenadeomposition
D
,aompletesubdivision thathas theminimumnumberofgroupsisalledtheoarsest subdivision ofD
.Given adeomposition,the goalwould beto makea transferforest lose to the followingbest desirable,
whih has
1. theminimumnumberofedgesonnetingnodesin dierentgroups,and
2. theminimumnumberofbranhes.
Ouralgorithmonvergesto theoptimalifeah nodeanonnetto anyothernode (i.e.,theentiregraph
is omplete, orin pratialterms,rewall,NAT,orDHCP do notdenyanyonnetion againstus). Inmore
generalgraphs,ouralgorithmhasthefollowingproperty. Let
D
thedeompositioninduedbyaheuristisused tomeasuretherelativedistane betweennodes,andD
theoarsestompletesubdivisionofD
. Ouralgorithm ahieves(1)thenumberofinter-grouponnetions≤
N
−
F
and(2) thenumberof branhes≤
N
−
1
,whereN
is thetotalnumberofgroupsinD
andF
thenumberofgroupsinD
ontainingatleastone nished node, anodewhihhasreeivedtheentiredata.Ourlaimthat theabovepropertytranslatesto agoodresult inpratieis basedonthefollowing
obser-vations.
•
Asimplemeasurementanreasonablyapproximatethelosenessbetweennodes. Forexample,givenanodein thesameLANastheloal nodeandanothernotinthesameLAN,itwillberelativelyeasy
fortheloalnodetojudge ifonenodeis losertotheother, thus shouldbepreferred. Therefore,one
anobtainadeomposition eahgroupofwhihhasnodeslosetoeah other.
•
Intypialnetworkongurations,nodeslosetoeahothertendtobeallowedtoonnettoeahother.Mosttypially,nodeswithinaLANanonnetto eahother. Makingagroupofnodeslosetoeah
otherthus tendstoyieldasubgraphthat isnearlyomplete.
Therstbulletimpliesthat,ifwegroupnodesbasedonareasonablyauratemeasurementofdistanesbetween
them,wewillhavegroupseahofwhihonsists ofnodesloseto eah other. Eah suh groupwill benearly
omplete (bullet #2), therefore
N
will belose toN
. Together,the numberof onnetionsrossing agroup boundarywillbelosetoN
−
F
,andthenumberof branheslosetoN
−
1
.3. Algorithm. Thealgorithmhasseveralfeaturesthatweshould remark.
Asimple, self-stabilizingdistributed algorithm: Eahnodeworksbasedoninformationaboutits
neigh-bors and optimizestransferroutes withasmall amountof loalinformation. Eahnode ontinuously
seeksalosernode thatmayservedata faster. Thismehanismnaturallymakesouralgorithm
fault-tolerantandallowsnodesto joinorleaveomputation atanytime.
Parallel and pipelinedtransfer: Transferringdatafromnode
A
toB
andfromC
toD
anourinparallel. Moreover,transferringapiee of datafromA
toB
andtransferring anotherpiee of data fromB
toC
analsotakeplae inparallel (pipelined transfer). Thisisespeially importantforrepliatinglarge lesinswithednetworks.together withthe self-stabilizingnature of thealgorithm,is enoughto makeit deadlok-free; whena
noderashes,itshildrenwill eventuallylearnthereis noprogressforalongtime, inwhihasethey
tryto onnettoanothernodethatisaheadofit.
01: /*StartingorAfterReovered*/
02: oset =urrentlesize ondisk;
03: parent=invalid;/*thenodeselfisgettingdatafrom.
*/
04: andidate =null;
05: is_sending_giveme =false;
06: hildren =none;/*nodesselfisgivingdatato*/
07: siblings=none;/*usedforTree2ListSuggestion */
08: neighbors =listofneighbors(deadoralive);
09: while(true){
10: /**********SearhingforParent**********/
11: (andidate==null&&parent ==invalid)
⇒
12: andidate =anodeinneighbors;
13: send(andidate,ask(id,oset));
14: /*NearParentHeuristis*/
15: (andidate ==null&&anodeinneighbors satises
16: is_loser(self,node,parent))
⇒
17: andidate =node;
18: send(andidate,ask(id,oset));
19: /*Tree2List Heuristis*/
20: (andidate ==null&&asibling insiblingssatises
21: !is_loser(self,parent,sibling))
⇒
22: andidate =sibling;
23: send(andidate,ask(id,oset));
24: reeived(ask(wid,woset))
⇒
25: if ((oset
>
woset)&&26: (MAX_NODE
>
numberofhildren)){27: addthisnode(wid,woset) tohildren;
28: send(wid,ok(id,oset));
29: }else{
30: send(wid,ng(id));
31: }
32: reeived(ok(wid,woset))
⇒
33: if (woset
>
oset){34: parent=wid;andidate=null;
35: }
36: reeived(ng(wid))
⇒
37: if (wid ==andidate) {
38: andidate =null;
39: }else if (wid ==parent){
40: parent=invalid;
41: }
42: /********** DataTransfer**********/
43: (parent !=invalid&&oset
<
lesize &&44: !is_sending_giveme)
⇒
45: is_sending_giveme =true;
46: send(parent,giveme(id,oset));
47: reeived(giveme(hild,woset))
⇒
48: if (oset
>
woset){49: size =max(BLOCKSIZE,osetwoset);
50: buf =load(lename,woset,size);
51: send(hild,data(id,woset,size,buf));
52: }else{
53: send(hild,ng(id));
54: }
55: reeived(data(wid,woset,size,buf)
⇒
56: if (woset ==oset){
57: is_sending_giveme =false;
58: save(lename,woset,size,buf);
59: oset +=woset;
60: }
61: (oset ==lesize &&parent!=null)
⇒
62: if (parent!=invalid)
63: send(parent,disonnet(id));
64: parent=null;
65: reeived(disonnet(hild))
⇒
66: deletethehild fromhildren;
67: /********** Tree2List Suggestion**********/
68: (havingmorethanonehild)
⇒
69: foreahhild inhildren {
70: send(hild,suggestion(id,hildren));
71: }
72: reeived(suggestion(parent,new_siblings))
⇒
73: siblings =new_siblings;
74: /********** FaultHandling**********/
75: (timeout(data,ng)fromparent)
⇒
76: parent=invalid;
77: (timeout(giveme,disonnet)fromhild)
⇒
78: deletethehild fromhildren;
79: (timeout(ok,ng)fromandidate)
⇒
80: andidate =null;
81: }
Fig.3.1. Pseudo-ode ofouralgorithm
Figure 3.1 shows the loal algorithm running on eah node. Prior to running this algorithm, eah node
knows its neighbors (neighbors) and the size of the le eah node must eventually have (lesize). In atual
implementation,eahnodemaybeginwithaninompletelistofneighbors. Nodespropagatetheirneighborsto
otherandlearnfrom others.
Insidethemain whileloop(line981)is writtenasalistofthefollowingform:
ondition
⇒
ationaretrue,anyoneofthemishosenarbitrarily.
First,we explainthe base part of this algorithm in Setion 3.1. We ontinue with theroute optimization
heuristisin Setion3.2
3.1. The Base Algorithm. Eahnoderepeatsthefollowinguntilitgetstheentiredata.
•
Itseeksanodethatisaheadofitself(i.e.,hasmoredatathanitself). Letusallsuhanodeitsparent.Aparentmayhangeovertime.
•
Oneitndsaparent,itaskstheparenttosendthedatathatshouldomenexttothedataiturrentlyhas. Forexample,ifanodehastherst1000bytesofale,itwillasktheparenttosendsomeamount
ofdatafromoset1000.
•
Inaddition,Eahnode,exeptonesthathaveobtainedtheentiredata,seeksanodethatislosertoitsurrent
parent. DetailsareinSetion 3.2.1.
Eah node having twoormorehildrentries to resolvethis situation,by suggestinghildrento
onnetto oneofitssiblings.
When a node reeivesan instrution to partiipate in a repliation, eah node heks how muh data it
has (line 2), searhes for a andidate node that has data grater than itself by onneting to some nodes in
its neighbors list. Variable oset indiates the size of data at that time, and satises the inequality 0
≤
oset≤
lesize. Duringdata transfer,theinvarianthild's oset≤
parent'soset ismaintained(line25, 33, and48).Anodesearhingforaparentsendsanaskmessagearryingitsoset (datasize)toaandidate(line1113).
Ifthereeiverhasmoredatathan thesender,itsends anokmessageto thenode sender(line2428,3235).
Atthat time,therelationbetweenparent-hildisestablished. Afterthat,thehildsends agivememessageto
theparent(line 4346)and theparentsends ahunkof datato thehild (line4751). This repeatsuntil the
hildeitherathesuptheparentindatasize (line5254),ndsabetterandidatethantheurrentparent,or
reeivesanerror. If thereeiverof askdoesnothavemoredatathan thesender,it sendsananswerng (line
2931)tothesender. Reeivinganngmessage(line3641),thenodeontinuestosearhforaparent.
Anodeanbeaparentofsomenodesandahildofanotheratthesametime. Ineet,weahieveapipeline
transferthroughallnodes.
Whena parentbeomes unreahablefrom its hild (due to a parentrash ora network failure), thehild
merelysearhes foranew parent. When anodereovers,it anpartiipatein the transferfrom theosetat
thetimeithasfailed. Hene,thisalgorithmisfault-tolerant(line7480).
3.2. Adaptive Transfer Route Optimization. Now, we explain optimizing heuristis on top of the
basealgorithm(line1423,6773).
is_closer(A,B,C)
parent
candidate
Sub
Tree
new
parent
A
C
B
C
B
A
Sub
Tree
Sub
Tree
Sub
Tree
Fig.3.2. NearParent Operation
3.2.1. NearParent Heuristis. Eah node periodiallytries to onnet to a node that is loserto its
urrentparent(andidateinFigure3.2,line1518inFigure3.1). Iftheandidatenodeturnsouttohavemore
Notethatevenifeahnodehasonnetedtoitsparent,itsearhesforanevenloserandidateperiodially.
Wehavenotonduted anextensivestudyaboutthebest frequeny. Frequentmeasurementswill allowus to
ndagoodtransferroutefastattheostofinreasednetworktra. Oururrentimplementationguarantees
thatthere isatmostonetrafrom eahnodeforthemeasurement. It alsoguaranteeseahnodeperformsa
measurementatmostoneevery100ms. ThiswillhardlyaetsCPUornetworkload.
Theprediatetojudgeifanode
B
isloserthanC
fromtheloalnodeA
,is_loser(A
,B
,C
),urrentlyuses thefollowingriteriain thelistedorder.Throughputobserved in the past: Eah node reordsthroughput from eah ofthe nodesthat havebeen
hosenasitsparent. If
A
hashosenbothB
andC
asits parentbefore, whiheverprodued abetter throughputisonsideredloser.Observed lateny: The above riterionis notappliablewhen either
B
orC
hasneverbeen hosenone asA
'sparent. InthisaseA
useslateniesittakestoonnettoB
andC
.The length ofthe mathingIP address prex: Whenobservedlateniesaretoolosetodisriminate,we
useIPaddressesof
A
,B
,andC
. WeomparethelengthsoftheommonprexesofIPaddressesofA
andB
tothat ofA
andC
.For thepurpose ofprovingthetheoretial property ofthe algorithmmentioned in Setion 3.3(also stated as
Theorem3.7),is_loseranbeanyprediatethat satisesthefollowingproperties.
•
is_loser(
A, B, C
)
andis_loser(
A, C, B
)
donotbeometrueatthesametime.•
ForagivenA
, thebinaryrelation:R
A
(
B, C
)
def
=
is_loser(
A, B, C
)
istransitive. That is,is_loser
(
A, B, C
)
∧
is_loser(
A, C, D
)
⇒
is_loser(
A, B, D
)
•
is_loser(
A, B, C
)
⇒
is_loser(
B, A, C
)
Itwillbelearthatanyreasonabledenitionofrelativedistaneandanauratemeasurementofit,inluding
theoneslistedabove,willsatisfythersttwobullets. Thethirdpropertymaynotsoundveryobvious. Examples
thatsatisfythepropertyinlude:
•
Adenitionbasedonthebottlenekedgeontrees. Thatis,assumenodesareonnetedviaaweightedtreeandletis_loser
(
A, B, C
)
betrueitheminimumweightonthepathbetweenA
andB
islarger thanthatonthepathbetweenA
andC
.•
Adenition basedonthedistaneontrees. Thatis, assumenodesareonnetedviaatreeandletis_loser
(
A, B, C
)
betrueithepathbetweenA
andB
is shorterthanA
andC
.•
Adenition basedonaddressprexes. That is,assumenodesareassignedintegeraddressesandletis_loser
(
A, B, C
)
betrueithelengthofthemathingaddressprexbetweenA
andB
islargerthan thatbetweenA
andC
.Thereforeweexpet that oururrentimplementation ofis_loserbasedon measuredbandwidthsbetween
nodes,measuredlateniesbetweennodes,andthelengthofIP addressprexes,will satisfythethird property
providedmeasurementsareaurate.
Notethatimplementingsuhaprediatedoesnotrequireanyapriori notionofgroups. Just
dening/mea-suring therelativeloseness betweennodeswill sue, aslongassuh adenition/measurement satisesthe
above properties. In Setion 2.2, we show suh a prediate in generalimpliitly indues a distane between
nodes, whih in turn indues a deomposition of nodes based on the distane. Our algorithm redues the
numberofinter-groupedgesforadeomposition derivedthisway.
3.2.2. Tree2ListHeuristis. NearParentheuristisreduesthenumberofedgesthatrossgroup
bound-aries. It howeveris not useful for reduing the number of branhes. Another optimization, alled Tree2List
heuristis,omesintoplayto makethetransferroutelosertoalist.
Anodethathastwoormorehildrensendsitshildrenlisttoeveryhild(line6871).Whenanodereeivesa
suggestionmessage,whiheetivelyontainsitsurrentsiblings,ithoosesoneinthelistasthenextandidate
iftheurrentparentisnotlosertoit(lines7273,2023).Figure3.3showshowTree2Listheuristismodies
Sub
Tree
Sub
Tree
Sub
Tree
Sub
Tree
Sub
Tree
Sub
Tree
A
B
C
A
B
C
!is_closer(C,A,B)
Fig.3.3. Tree2ListOperation
AnimportantpropertyaboutTree2List,provedin thenextsetion,isthatitneverinreasesthenumberof
inter-groupedges. ThisguaranteesthatapplyingTree2ListdoesnotimpedetheNearParent'seortofreduing
the numberof inter-group edges. In the next setion, we stateand prove properties of transferforests after
applyingbothheuristisinanarbitraryorder.
3.3. Properties of the Route Optimization Algorithm. Letis_loser satisfythe properties stated
inSetion3.2.1. Werstshowthefollowing,thatsaysis_loser
(
A, B, C
)
isequivalenttoomparingadistane betweenA
andB
andbetweenA
andC
, forsomedenitionofadistane.Lemma 3.1. For is_losersatisfyingthe property statedinSetion 3.2.1, thereexists adistane funtion
d
thatsatisesthe following.•
For allnodesA
andB
,d
(
A, B
) =
d
(
B, A
)
.•
For allnodesA
,B
,andC
,is_loser
(
A, B, C
)
⇐⇒
d
(
A, B
)
< d
(
A, C
)
.
Proof: SeeAppendix A.1.ThefollowingLemma isimportantforguaranteeingTree2Listisappliablewhenwehavemanybranhes.
Lemma3.2. For any
d
satisfying the onditioninLemma3.1,max(
d
(
A, B
)
, d
(
A, C
))
≥
d
(
B, C
)
istruefor all nodes
A, B
,andC
. Proof: SeeAppendixA.2.Adistane funtion
d
andathresholdt
deneanaturaldeomposition of agraph. That is, weremoveall edges(
x, y
)
suh thatd
(
x, y
)
> t
from theoriginal graph, and leta groupbea onnetedomponent of the graph. We allsuh adeomposition is derived from is_loser. Manydeompositions anbe derivedfrom asingledenition ofis_loser,dependingonthehoieof
d
andt
.Wemodel ourrouteoptimization heuristisasaproessof rewritingthetransferforest aordingto
Near-Parent,Tree2List,ornishingthetransfertoanode.
Definition 3.3. A state of omputation is a forest among partiipating nodes, indued by their parent
pointers. Let
S
andS
′
bestates. Wedene relations
→
n
,→
t
,→
f
,and→
by:1.
S
→
n
S
′
def⇐⇒
S
′
isobtainedby applying NearParent to
S
(Figure3.2), 2.S
→
t
S
′
def⇐⇒
S
′
isobtainedby applying Tree2List to
S
(Figure3.3), 3.S
→
f
S
′
def⇐⇒
S
′
isobtainedbynishing anode andmaking itsparentpointernull, and
4.
→
def=
→
n
∪ →
t
∪ →
f
. Thatis,S
→
S
′
def⇐⇒
(
S
→
n
S
′
)
or(
S
→
t
S
′
)
or(
S
→
f
S
′
)
.
Definition3.4.
•
Letw
(
S
)
bethe numberof edgesinforestS
thatrossgroupboundaries. Fortehnialonveniene, we onsider an invalid parentpointertorossagroup boundary, anda nullparentpointernotto rossany group boundary.
•
Letf
(
S
)
be the number of nished nodes (having parent=
null) andF
(
S
)
be the number of groups that have at least one nished node. We say suh agroup isnished. Note there may be unnishednodesin anishedgroup.
•
Letl
(
S
)
bethe numberof leaves (i.e., nodesthat arenotpointedtobyany parentpointer).Lemma 3.5. Transition paths are bounded. That is, the lengthof a path
S0
→
S1
→
S2
→ · · ·
is bounded. Proof: DeneSUMDIST(
S
)
,SUMDEPTH(
S
)
,andQ
(
S
)
asfollows.SUMDIST
(
S
) =
P
x
:noded
(
x, x
'sparent)
,
SUMDEPTH(
S
) =
P
x
: nodedepth
(
x
)
,
andQ
(
S
) = (
f
(
S
)
,
−
SUMDIST(
S
)
,
SUMDEPTH(
S
))
,
where depth
(
x
)
is thenumberof hopsfrom theroot ofthe treex
belongsto.d
(
x, x
'sparent)
isthe distane betweenx
anditsparent. Againfortehnialonveniene, ifx
'sparentpointer isinvalidweonsiderithas avaluelargerthananyotherd
(
y, z
)
forz
6
=
invalid. Similarly,ifx
'sparentisnull,ittakesavaluesmaller thananyotherd
(
y, z
)
forz
6
=
null.Ifweintroduealexiographialorderamongtriples
Q
(
S
)
,itiseasytoseeQ
(
S
)
stritlyinreasesbyasingle transitionstep. Thatis,S
→
S
′
⇒
Q
(
S
)
< Q
(
S
′
)
.
Infat,
→
f
inreasesf
(
S
)
,→
n
doesnothangef
(
S
)
andinreases−
SUMDIST(
S
)
, and→
t
doesnothangef
(
S
)
,neverdereases−
SUMDIST(
S
)
,andinreasesSUMDEPTH(
S
)
.Sineallquantitiesofthetriplesarelearlyboundedabove,wehaveprovedtransitionpathsarebounded.
Lemma 3.6.
1. If
S
satisesw
(
S
)
> N
−
F
(
S
)
,then→
n
isappliabletoS
. Thatis,thereexistsS
′
suhthat
S
→
n
S
′
.
2. If
S
satisesl
(
S
)
−
f
(
S
)
≥
N
,f
(
S
)
≥
1
,and→
n
isnot appliable toS
,then→
t
isappliable toS
. Proof:1. If
w
(
S
)
> N
−
F
(
S
)
(=
thenumberofunnished groups),eitherofthefollowingmusthold.•
Thereisanunnishedgrouphavingmorethanoneoutgoinginter-groupedges.•
Thereisanishedgrouphavinganoutgoinginter-groupinter-groupedge.Anoutgoingedge isaparentpointerpointingtoanodeoutsidethegroup. Intheformerase,lettwo
ofsuhedgesbe
(
A, B
)
and(
C, D
)
.A
andC
belongtoonegroup,sayX
,whileneitherB
norD
belong toX
. Thus,atransitionby→
n
thateithermakesA
oneofC
'shildrenorvieversa,isappliable. In thelatterase,letonesuhedgebe(
A, B
)
andonenishednodeinthegroupbeP
. Thus,atransition by→
n
thatmakesA
oneofP
'shildrenisappliable.2. Wesplittheproofintotwoases,(i)
l
(
S
)
−
f
(
S
)
> N
,and(ii)l
(
S
)
−
f
(
S
) =
N
. (i)l
(
S
)
−
f
(
S
)
> N
:Wehaveatleastonegroup
X
thatsatises:l
−
f >
1
where
l
andf
denote thenumberof leavesinX
andthenumberofnishednodesinX
,respetively. Leta1, a2,
· · ·
a
l
betheleavesinX
(l
≥
2
). Leta
i,
1
=
a
i
and~a
i
= (
a
i,
1, a
i,
2,
· · ·
, a
i,ni
)
(i
= 1
,
· · ·
, l
)be hains ofparentpointers startingfroma
i
. That is,a
i,j
isahildofa
i,j
+1
)
for alli
andj
(1
≤
i
≤
l
,1
≤
j
≤
n
i
−
1
).Wearguebyontraditionthatallbutoneofsuhhainsmustbeentirelyin
X
. Letusassumew.o.l.g. neitherof~a1
nor~a2
are inX
. Then therearej
andk
(1
≤
j
≤
n1
−
1
and1
≤
k
≤
n2
−
1
)suh thata1
,j
anda2
,k
∈
X
, anda1
,j
+1
anda2
,k
+1
6∈
X
. Then atransition by→
n
that onnetsa1
,j
anda2
,k
shouldbeappliable. Thisontraditstheassumptionthat→
n
isnotappliableinS
.twohains. Itremainstoshowwehaveeither(
¬
is_loser(
B, A, C
)
)or(¬
is_loser(
C, A, B
)
),soeitherB
orC
antrigger→
t
. ByLemma3.2,wehavemax
(
d
(
A, B
)
, d
(
A, C
))
≥
d
(
B, C
)
,
fromwhihweanderive:
max
(
d
(
A, B
)
, d
(
A, C
))
≥
d
(
B, C
)
⇔
d
(
A, B
)
≥
d
(
B, C
)
ord
(
A, C
)
≥
d
(
B, C
)
⇔
d
(
B, A
)
≥
d
(
B, C
)
ord
(
C, A
)
≥
d
(
C, B
)
⇒
¬
is_loser(
B, A, C
)
or¬
is_loser(
C, A, B
)
.
(ii)
l
(
S
)
−
f
(
S
) =
N
:Ifwehaveonegroup
X
that satises:l
−
f >
1
,
thenthesamedisussionas(i)applies. Intheremainingaseallthegroupssatisfy:
l
−
f
= 1
.
Let
X
beanygroup. Asin (i),onsider thel
hainsstartingfrom anode inX
. Ifallthel
hains are entirelyinX
,twoofthemmustmergeinX
,andthefollowingargumentisthesameas(i). Therefore eah group hasexatly onehain outgoingfrom the group. Then wehaveN
inter-group edges, i.e.,w
(
S
)
≥
N
. This implies, however,→
n
is appliablebeausef
(
S
)
≥
1
⇒
F
(
S
)
≥
1
⇒
w
(
S
)
≥
N >
N
−
F
(
S
)
.Theorem 3.7. Alongany path ofstatetransitions startingfrom any state
I
, wereahwithin nite stepsa stateS
∞
satisfying:1.
w
(
S
∞)
≤
N
−
F
(
S
∞)
,and 2.l
(
S
∞)
−
f
(
S
∞)
≤
N
−
1
.Proof: FromLemma 3.5, anytransition path
I
=
S0
→
S1
→ · · ·
is bounded, therefore reahesastateS
∞
inwhihneither→
n
nor→
t
(or→
,forthatmatter)isappliable. Lemma3.6showsinthisstatewehaveboth oftheaboveproperties.Remark1:. Asaspeialasewhere
D
=
D
(i.e.,noedgesareblokedinsideagroupofD
),wehaveN
=
N
. Inthisasethetheoremimpliesthat,forsuientlylongtransfers,thenumberofedgesbetweengroupsreahestheoptimal
N
−
F
(
S
)
. RepliatingalefromF
(
S
)
groupstotherestwill learlyneedN
−
F
(
S
)
inter-group edges. Forbeinglosetoalist,theseondbulletofthetheoremimpliesthatthenumberofbranhes,eetivelyalulatedby
l
(
S
)
−f
(
S
)
,istheoptimalN
−
1
. Toseethisisoptimalingeneral,onsideranetworkonguration shownin Figure3.4,whih foresinter-groupedgesto formastar.Remark2:. Reallthatthetheoremappliestoany deompositionderivedfromis_loser. Ifthenetworkhas
multiple levelsofhierarhies, (e.g.,inside aluster,lusters insideaLAN,LANsin aampus/orporatearea,
and LANsin wide area),and is_loseran desriminate all ofthem, ouralgorithm simultaneouslyoptimizes
all the levels. For example, let us say we have
N1
LANs andN2
lusters andf
(
S
) = 1
as the usual ase. If weassume is_loserandesriminate intra-luster,inter-luster but intra-LAN, and inter-LANedges, andthenetworkongurationallowsallonnetions,ouralgorithm onvergestoa statein whih wehave
N1
−
1
inter-LANedgesandN2
−
1
inter-lusteredges.4. Evaluation.
4.1. Implementation. We have implemented the desribed algorithm in Java. This is exeutable on
ommonomputerssupportingJavaandTCP/IPv4protool. WeonrmedtheprogramrunsonSolaris(spar),
Finished Node (and only it can be connected to by other’s group.)
Leaf Node
N = 4
l(S) = 4
f(S) = 1
l(S)-f(S) = 3 = N-1
Data Flow
Group
Fig.3.4. Anexample wheretheoptimalvalue of
l
(
S
)
−
f
(
S
)
isN
−
1
4.2. Single Cluster Experiments. First, we ran some experiments in a single luster. The luster
onsists of 16 nodes. Eah node has two Alpha CPUs and a loal hard-disk. Network ables of nodes are
onneted to a 100Mbps swith. Loal disk bandwidth is faster than network, so it does not reveal as a
bottlenek. CPU isalsofastenough.
We initially let onenode have 500MB le, and others have no data. Sine there is only a singleluster,
NearParent optimization does not play any role in this experiment. So this experiment is to see the eet
of Tree2List. In addition to Tree2List, we ran the base algorithm without any optimization, hanging the
maximumnumberofhildreneah nodeanserve,from oneto ve. Theylearlydemonstratehowimportant
isittomakethetransfertreelosetoalist.
ThetimewhihthedistributiontasksspentisshowninFigure4.1.
In this result, it is lear that the average distribution time inreasesas themaximumnumber of hildren
inreases. Thegraphalsoindiates that,in thispartiular experiment,limitingthenumberof hildrentoone
yieldsthebestresult. Thatis, restritingtheshapeofthetransfertreetoalistintherstplaeisbetterthan
our Tree2List strategy whih rst forms an arbitrary tree and then tries to developit to alist. We believe,
however, ourstrategy hasseveraladvantages. First, nodes maynot be ableto form alist in the presene of
rewallset. Insuhases,onemustfallbaktoatree. Seond,formingalistinthebeginningmaytakemuh
longerthanformingatree,espeiallywhenthenumberofnodesbeomeslarge,sinealistanonlygrowone
nodeatatime.
4.3. Multiple Cluster Experiments. Next,wemade experimentsin seven lusters illustratedin
Fig-ure4.2. Theyareallplaedin theampusofUniversityofTokyo.
•
AnIBMLinuxlusteralledistbs ontains70nodes. Weusedallofthemfortheexperiment. Nodeswithin a luster are onneted via 1Gbps links. A node in this luster is the soure node in this
experiment. Bandwidthfrom/to otherlustersbelowispoor100Mbps.
•
A SunFire15KSMP alled istsun has70CPUs, ofwhih weused 20. We usedthis mahine asifitwere20separatenodes. Ithasa100MbpsNICsharedbyallCPUs. Repliationof300MBdataamong
0
50
100
150
200
250
tree2list
children
limit 1
children
limit 2
children
limit 3
children
limit 4
children
limit 5
Time to Distribute 500MB (sec)
Kinds of Making Transfer Tree
Plot of Experiments changing Children Limit in One Cluster
Fig.4.1. Performaneinasingleluster
•
A luster of lusters alled kototoi ontains three luster eah having 16 nodes. Network speed is100Mbpsinside eahluster. Throughputbetweentwoofthethreeisseveral hundredsMbps. Having
morethanoneonnetiontoasinglelustereasilysaturatesthelink. Nonodesoutsidekototoiannot
diretlyonnettoinside it.
•
An HPAlpha lusteralled oxen ontains16nodes,whih isthesamelusterin Setion 4.2. Thereare two (and only two) gateway nodes that an onnet to and an be onneted from outside the
luster.
•
ALinuxluster alledmarten eahofwhihruns Linux insideVMWare. Itsongurationisalmostthesameasalusterinkototoi.
•
For onnetivity, any node an onnet to istsun nodes and the gateways of oxen. Also, istsun andistbs are in the same virtual LAN, so nodes in the two lusters an diretly onnet to eah other.
Connetionsto remainingnodesfrom otherlustersarebloked.
Weomparedthefollowingalgorithms.
Randomtree: The base algorithmwithoutanyheuristis, withno limitonthenumber ofhildren foreah
node.
NearParent only: Thebase algorithm
+
NearParent. NoTree2List.Tree2List only: Thebasealgorithm
+
Tree2List. NoNearParent.NearParent
+
Tree2List: UsebothTree2ListandNearParent.Manual: Fix thetransferroutethat weonsider will be thebest, asfollows;istbs onnetsto istsunviaone
inter-lusteredge. Itisbranhedintothreeinsideistsun. Theygotokototoi,oxen,andmarten. Inside
lusters,therearenobranhes. Thethroughputshouldbeloseto100Mbps/3=33Mbps,determined
bythethreeoutgoingedgesfromistsun, whihshare asingle100MbpsNIC.
InFigure4.3, the resultsarepresented. Notsurprisingly,Manual isthe fastest. NearParent
+
Tree2Listahieved an overhead of 50-100% to the manually tuned transfer and more than four times faster than the
randomtree.
Figure4.4showsthatthenumberofinter-lusteredgesanddistributiontimehaveastrongorrelation.
istbs 70 nodes
kototoi 16x3 nodes
istsun 20 nodes
Root
Max 100Mbps Data Route
oxen 16 nodes
marten 14 nodes
Internet
Gateway Node
Fig.4.2. Conditionof7lusters
5. Related Work.
5.1. Minimum Spanning Tree Constrution. MST onstrution is a ommonly used tehnique for
optimizingowsinnetworks.Therehavebeenanumberofpublishedalgorithmsandtheirappliations[2,6,1℄.
Itisompellingtomodelourproblembyageneralweightedgraph,withthegoalbeingatreethathasasmall
weightandasmallnumberofbranhes.
Weonsideredapproahesalongthislineandthenabandonedthemforseveralreasons.First,fromtheoretial
pointofview,minimizingthetworiteriaatthesametimeisimpossibleforgeneralweightedgraphs,sowemust
makeadiult(andsomewhatarbitrary)deisionabouthowtotradeonefortheother. Fromthepratialside,
buildinganMSTforgeneralweightedgraphinfault-tolerantandself-stabilizingmannerisalreadyomplexto
implement. Finally,typialrealnetworkshavearelativelysimplestruturewean(andshould)exploit. That
is, nodeslose to eah otherin termsof physial proximityan logiallyonnet to eah other at some level
andbelow. Thereforethesenodesshould beabletoform alistentirelywithin thelique. Wehaveshownthis
isinfat possiblewithaverysimplehill-limbingwithfault-toleraneandadaptiveness.
5.2. Appliation-Level Multiast and CDN. Our work is in spirit similar to anumber of work on
appliation-level multiast and ontent distribution networks (CDN). Our optimization riteria are dierent
fromthem, partiularlyinthatwetrytoredue thenumberofbranhes.
ALMI[9℄ usesaentralized treemanagementshemeand makesMST forgoodperformane. End System
Multiast[7℄takesbothlatenyandbandwidthintoaountwhenmakingatreeofend-hosts. In[12℄,CAN[11℄
is used for the infrastruture of multiast. Bayeux [15℄ uses Tapestry [14℄ that is also ontent-addressable
network. Overast[8℄ is a multiasting systemthat ahievesbothsmall latenies and high throughput. The
main appliation ofthese systemsismultimedia streaming to widelydistributed nodes. Insuh settings,it is
importantto bound lateniesbeausethe appliation maybe aninterative multimedia appliation. Alsoin
CDNs, the main riteria are latenies and tra load balaning, rather than delivering as muh bandwidth
as possible. So researhes about CDN suh [10, 4, 3℄ mainly onern how to alloate replias of ontents,
and how to rediret user requests to appropriate replias. On the other hand, it is less important for suh
0
200
400
600
800
1000
1200
1400
1600
random
tree
nearparent
only
tree2list
only
nearparent
tree2list
ideal
fixed tree
Time to Distribute 300MB (sec)
Kinds of Making Transfer Tree
Plot of Experiments using Verious Transfer Tree on 7 Clusters
Fig.4.3. Performaneon7lusters
0
200
400
600
800
1000
1200
1400
1600
0
10
20
30
40
50
60
70
80
90
Time to Distribute 300MB Data(sec)
Number of Inter-cluster Edges
Edges Crossing over Clusters and Time to Distribute
Fig.4.4. Correlationbetweennumberofinter-lusteredgesanddistributiontime
lateniesaggressively, beause therst priorityis on theompletion time oftransferring largeles. It is also
veryimportantto utilize LANbandwidthas muhaspossible,asthe typialusagewill beto opylarge les
6. Summary and Future Work. We have desribed a large le distribution algorithm that realizes
salability, adaptiveness, fault-tolerane,and eientuse of bandwidths. It is based ona simpledistributed
algorithm with simpleloal heurististo optimize transfers. Weformalized and provedthe properties of our
algorithm andargued that thisgivesagood resultin pratialsettings. Oursystemwill beusefulfor setting
up a number of lusters and preparing wide-area distributed omputations with a large data. Evaluations
showthatourimplementationiseetiveinrealenvironmentonsistingofover150nodesarosssevenlusters
ampus-wide.
Oururrentimplementationoftheprotoolisnotseure. Anymaliiousnodeanpartiipateinthe
replia-tionandbreakstheintegrity. Tobeausefultoolfordistributedomputing,wemustuseasuitableauthentiation
whennodesonnetto eahother. Whileintroduingseureauthentiationsispossible,thismayinreasethe
ostof deployingsuh tools, whoseverypurpose will beto help maintainalargenumberof nodeseasily. We
muststudyhowtomaintaineaseofinstallationanduseofthistoolwhileahievingareasonablelevelofseurity.
REFERENCES
[1℄ AbhishekAgrawalandHenriCasanova.ClusteringHostsinP2PandGlobalComputingPlatforms.InProeedingsofthe3rd
IEEE/ACMInternationalSymposiumonClusterComputingandtheGrid(CCGrid2003),pages367373,2003.
[2℄ F.BauerandA.Varma.DistributedAlgorithmsforMultiastPathSetupinDataNetworks. TehnialReport
UCSC-CRL-95-10,UniversityofCaliforniaatSantaCruz,August1995.
[3℄ A.Biliris,C.Cranor,F.Douglis,M.Rabinovih,S.Sibal,O.Spatshek,andW.Sturm. CDNbrokering. InProeedingsof
WCW'01,June2001.
[4℄ PeiCaoandSandyIrani. Cost-awareWWWproxyahingalgorithms. InProeedingsof the1997UsenixSymposiumon
InternetTehnologiesandSystems(USITS-97),Monterey,CA,1997.
[5℄ CVShome. http://www.vshome.org/.
[6℄ LisaHighamand ZhiyingLiang. Self-StabilizingMinimumSpanningTree ConstrutiononMessage-Passing Networks. In
Proeedingsofthe15thConf.onDistributedComputing,DISC,LNCS2180,pages194208,2001.
[7℄ YanghuaChu,SanjayG.Rao,SrinivasanSeshan,andHuiZhang.EnablingConfereningAppliationsontheInternetUsing
anOverlayMultiastArhiteture. InACMSIGCOMM2001,SanDiago,CA,August2001.ACM.
[8℄ JohnJannotti, David K.Giord, KirkL. Johnson, M. Frans Kaashoek, and James W. O'Toole, Jr. Overast: Reliable
Multiasting with an Overlay Network. In Proeedings of the Fourth Symposium on Operating System Design and
Implementation(OSDI),pages197212,Otober2000.
[9℄ DimitrisPendarakis,SherliaShi,DineshVerma,andMarelWaldvogel.ALMI:AnAppliationLevelMultiastInfrastruture.
InProeedings of the 3rd USNIXSymposium on Internet Tehnologies and Systems (USITS '01), pages 4960, San
Franiso,CA,USA,Marh2001.
[10℄ LiliQiu,VenkataN.Padmanabhan,andGeoreyM.Voelker. Ontheplaementofwebserverreplias. InINFOCOM,pages
15871596,2001.
[11℄ SylviaRatnasamy,PaulFranis, MarkHandley,Rihard Karp, and Sott Shenker. Asalable ontent-addressable
net-work. InProeedings of the 2001 onferene on appliations, tehnologies, arhitetures, and protools for omputer
ommuniations(SIGCOMM2001),pages161172.ACMPress,August2001.
[12℄ SylviaRatnasamy,MarkHandley,RihardKarp,andSottShenker. Appliation-LevelMultiastUsingContent-Addressable
Networks. LetureNotesinComputerSiene,2233,2001.
[13℄ YasuhitoTakamiya,AtsushiManabe,andSatoshiMatsuoka.Luie:Afastinstallationandadministrationtoolforlarge-saled
lusters(inJapanese). InSACSIS2003,pages365372,May2003.
[14℄ B.Y.Zhao,J.D.Kubiatowiz,andA.D.Joseph. Tapestry: AnInfrastruture forFault-tolerantWide-areaLoationand
Routing.TehnialReportUCB/CSD-01-1141,UCBerkeley,April2001.
[15℄ ShelleyQ.Zhuang,BenY.Zhao,AnthonyD.Joseph,RandyH.Katz,andJohnD.Kubiatowiz. Bayeux: AnArhiteture
forSalableandFault-tolerantWide-areaDataDissemination. InProeedings oftheEleventhInternationalWorkshop
onNetworkandOperatingSystemSupportforDigitalAudioandVideo(NOSSDAV2001),June2001.
AppendixA. OmittedProofs. Inthissetionweabbreviateis_loserto
C
.A.1. Lemma3.1. Let
V
betheset ofallnodes. Weintrodueanunknownx
AB
foreahA, B
∈
V
. For eahtriple(
A, B, C
)
suh thatC
(
A, B, C
)
is true, wegenerate aonstraintx
AB
< x
AC
. Wethen unifyx
AB
andx
BA
for allA, B
∈
V
, replaingall ourreneofonewith theother. Weare goingto showthereare no loopsofonstraintsx
AB
< x
CD
<
· · ·
< x
AB
,thus theonstraintsare satisable. Whenwehaveprovedthis, weletd
(
A, B
) =
x
AB
,forallA, B
∈
V
.Tobeginwith,weshowthefollowing:
x
AB
<
· · ·
< x
Y Z
⇒ C
(
A, B, Z
)
orC
(
A, B, Y
)
,
1.
n
= 1
:Observewemust have
A
=
Y
,A
=
Z
,B
=
Y
, orB
=
Z
sinethis onstraintwasgenerated fromC
. WhenA
=
Y
,x
AB
< x
Y Z
⇒
x
AB
< x
AZ
⇒ C
(
A, B, Z
)
. Other asesaresimilar.2. Assumethelaimholdsupto
n
−
1
andnowwehavex
AB
< x
CD
<
· · ·
< x
Y Z
oflength
n
. Byindutionhypothesis, weeitherhave: (a)C
(
C, D, Z
)
,or(b)
C
(
C, D, Y
)
.By
x
AB
< x
CD
,weeitherhave: (i)A
=
C
andC
(
A, B, D
)
, (ii)A
=
D
andC
(
A, B, C
)
, (iii)B
=
C
andC
(
A, B, D
)
,or (iv)B
=
D
andC
(
A, B, C
)
.Sine(a)and(b)aresimilarweonlyprovethease(a)byanalyzingthefourases(i)(iv).
(i)
C
(
A, B, D
)
andC
(
A, D, Z
)
⇒ C
(
A, B, Z
)
(ii)
C
(
A, B, C
)
andC
(
C, A, Z
)
⇒ C
(
A, B, C
)
andC
(
A, C, Z
)
⇒ C
(
A, B, Z
)
.(iii)
C
(
A, B, D
)
andC
(
B, D, Z
)
⇒ C
(
B, A, D
)
andC
(
B, D, Z
)
⇒ C
(
B, A, Z
)
⇒
(
A, B, Z
)
.
(iv)
C
(
A, B, C
)
andC
(
C, B, Z
)
⇒ C
(
B, A, C
)
andC
(
B, C, Z
)
⇒ C
(
B, A, Z
)
⇒
(
A, B, Z
)
.Nowweprovebyontraditiontherearenoloops:
x
AB
<
· · ·
< x
Y Z
< x
AB
.
Bytheaboveindution,weeitherhave:
(a)
C
(
A, B, Z
)
or, (b)C
(
A, B, Y
)
.By
x
Y Z
< x
AB
,weeitherhave: (i)Y
=
A
andC
(
A, Z, B
)
, (ii)Y
=
B
andC
(
B, Z, A
)
, (iii)Z
=
A
andC
(
A, Y, B
)
,or (iv)Z
=
B
andC
(
B, Y, A
)
.Weseeombininganyof(a)(b)andanyof(i)(iv)willleadtoontradition. Weonlyprovease(a)sine(b)
issimilar.
(i)
C
(
A, B, Z
)
andC
(
A, Z, B
)
⇒
false.(ii)
C
(
A, B, Z
)
andC
(
B, Z, A
)
⇒ C
(
B, A, Z
)
andC
(
B, Z, A
)
⇒
false.(iii)
C
(
A, B, Z
)
andZ
=
A
⇒
false. (iv) Sameas(iii).A.2. Lemma 3.2. Analyze the three ases, (i)
d
(
A, C
)
< d
(
A, B
)
, (ii)d
(
A, B
)
< d
(
A, C
)
, and (iii)d
(
A, B
) =
d
(
A, C
)
. Proveeahasebyontradition. (i) Letusassumed
(
A, C
)
< d
(
A, B
)
< d
(
B, C
)
. Then,d
(
A, C
)
< d
(
A, B
)
andd
(
A, B
)
< d
(
B, C
)
⇒
d
(
A, C
)
< d
(
A, B
)
andd
(
B, A
)
< d
(
B, C
)
⇒ C
(
A, C, B
)
andC
(
B, A, C
)
⇒ C
(
A, C, B
)
andC
(
A, B, C
)
(ii) Similarto (i).
(iii) Letusassume
d
(
A, B
) =
d
(
A, C
)
< d
(
B, C
)
. Then,d
(
A, B
) =
d
(
A, C
)
andd
(
A, C
)
< d
(
B, C
)
⇒
d
(
A, B
) =
d
(
A, C
)
andd
(
C, A
)
< d
(
C, B
)
⇒
d
(
A, B
) =
d
(
A, C
)
andC
(
C, A, B
)
⇒
d
(
A, B
) =
d
(
A, C
)
andC
(
A, C, B
)
⇒
d
(
A, C
) =
d
(
A, B
)
andd
(
A, C
)
< d
(
A, B
)
⇒
false.Editedby: WilsonRivera,JaimeSeguel.
Reeived: July3,2003.