• No results found

An Adaptive File Distribution Algorithm for Wide Area Network

N/A
N/A
Protected

Academic year: 2020

Share "An Adaptive File Distribution Algorithm for Wide Area Network"

Copied!
18
0
0

Loading.... (view fulltext now)

Full text

(1)

AN ADAPTIVE FILE DISTRIBUTIONALGORITHM FOR WIDEAREA NETWORK

TAKASHI HOSHINO

,KENJIRO TAURA

, AND TAKASHI CHIKAYAMA

Abstrat. Thispaperdesribesadatadistributionalgorithmsuitableforopyinglargelestomanynodesinmultiplelusters

inwide-areanetworks. Itisaself-organizingalgorithmthatahievespipelinetransfers,faulttolerane,salability,andaneient

routeseletion. Itworksinthepreseneoftoday'stypialnetworkrestritionssuhasrewallsandNetworkAddressTranslations,

makingitsuitableinwide-areasetting. Experimentalresultsindiateouralgorithmisabletoautomatiallybuildatransferroute

losetotheoptimal. Propagationofa300MBlefromonerootnodetoover150nodestakesabout1.5timesaslongasthebest

timeobtainedbythemanuallyoptimizedtransferroute.

Keywords. Self-stabilizingdistributedalgorithm,faulttolerane,salability,wide-areanetwork

1. Introdution. This paper desribes apratial algorithm for opying large data (typially in a le)

fromasourenode(s) to manydestinationnodesinparallel. We seekasalablesolutionsuitablebothwithin

aluster and aross many lusters in wide-area. By suitablewithin a luster, we mean that it fully utilizes

the available bandwidth of LAN/luster interonnet. Forexample, assuming 32 nodes are onneted via a

suiently high-throughput swith, it should be ableto opya singlelarge le to the 32 nodes in not muh

morethanthetime it takesto opy thele toa singlenode. Suh analgorithm must atleast performmany

one-to-onetransfersin parallel. Bysuitableinwide-area,wemeanitmakesagood hoiein seletingtransfer

routes. If manynodesin aluster retrieve data from another luster, alink arossthe twoeasily saturates.

Thus suhanalgorithmshould haveamehanismto transferdatawithin alusterwherepossible.

Tobepratial,itshouldworkwithasimpleandsmallmanualongurationthatmaynotbeveryaurate.

It won't be pratialto assume, for example, that theuser givesa omplete andaurate information about

physial network topology, desirable paths for transferring data, or even logial network onnetivity (i.e.,

network settings suh as rewall and Network Address Translation (NAT)). Assuming suh information is

notpratialnot onlybeausethe usermaynot want to write them, but also beausesuh informationmay

hange overtime dueto suheventsasnode/network failures. Thesystemtherefore musttolerateinaurate

informationandadapttotheonditionsobservedatruntime. Suhanadaptivesystemnaturallysupportsfault

toleranein thesensethatevenifsomenodesfail, remainingnodesaomplishtheirworkandnodesthat one

failedanjoin thetransferagain.

Webelievesuh afault-tolerantand adaptivelerepliatorisamandatorybuilding blokforlusterand

Grid omputing. It may be used for installing large program/data to many nodes. It may also be used in

le synhronizers[5℄ sotheysupport synhronizing dataamong alargenumberof nodes in parallel. Perhaps

mostimportant, repliating alarge datato many nodes will beapratial tehniqueto reset adistributed

omputation;itsimplyreinitializesalltheinvolvednodes,soastoreoverfromsomebroken/inonsistentstates.

Thisobservationaordswith reentpratiesin large-saleluster management,wherereinstallingoperating

systemsfromsrathisonsideredasanormaloperation,ratherthanthelastresort,toxbrokenlusters[13℄.

Togetanintuitiveideaabouthowagoodtransferroutetypiallylooks,onsideranetwork inFigure1.1.

Therearetwoloalareanetworks(LANs) named

A

and

B

,eah inludingthreelusters(

A1

,

A2

,and

A3

in

A

and

B1

,

B2

,and

B3

in

B

). Assumenodesan onnetto eahotherviatheTCPlayer.

Supposethedataisonanodeinluster

A1

andshouldpropagatetoallothernodes. Inthegure,asmall irleisanode,aretangleaswith,andalineonnetinganodeandaswithanetworkablethatantransfer

datawith100Mbps. 1

Intuitively, the best strategy is to form a transfer route like the one shown in Figure 1.2. Figure 1.3

representsthesameroutein thephysialtopology. Speially,thefollowingtwopropertiesareimportant.

Thenumberof onnetionsthat rossaLAN/lusterboundaryissmall; there is onlyoneonnetion

arossthetwoLANsand veonnetionsarossthesixlusters.

Theentiretransferrouteformsalist. That is,nonodesservedatatotwoormorenodes.

Thereasonwhytherstpropertyisimportantwillbelear. Asimplealulationwill revealthat ifnodesare

randomly onnetedwithout any eortto onnet nodes loseto eah other, links arossLANs/lusters will

UniversityofTokyo,{hoshino,tau,hikayama}logos.t.u-tokyo.a.jp

1

(2)

Cluster B1

Cluster B2

Cluster B3

Cluster A1

Cluster A2

Cluster A3

Subnet A

Subnet B

Node

Switch

100Mbps Line

Fig.1.1. Typialnetworkenvironmentforwhihoursolutionissuitable

Cluster A1

Cluster A2

Cluster A3

Cluster B2

Cluster B1

Cluster B3

Subnet B

Subnet A

Root

Intra-cluster edge

Intra-subnet edge

Inter-cluster edge

Inter-subnet edge

Fig.1.2. Besttransferrouteinappliation layer

easily beome abottlenek. This isespeially truein today'stypialnetwork ongurationwhere apaity of

longlinks(orporate-/ampus-/wide-area)issimilartooratbestonlyanorderofmagnitudelargerorsothan

typial loal arealinks. For example, letus assumefor the purpose ofdisussion that wehavetwo100Mbps

swithed LANs onneted via a 1Gbps link. In suh settings, we should be able to transferdata among all

thenodesinthetwoLANsapproximatelyattheLANbandwidth(100Mbps),but ifonnetionsarerandomly

hosen, alink arossthe twoLANsansustain only10suh onnetionsatbest. Thusthe1Gbps link won't

beenoughforsupporting10ormorenodesin eah sideof it.

Theseondbulletmaybelessobvious. ItisimportantforreduingthebottlenekinNICs. Supposethree

(3)

Cluster B1

Cluster B2

Cluster B3

Cluster A1

Cluster A2

Cluster A3

Subnet A

Subnet B

Node

Switch

100Mbps Line

Root

Fig.1.3.Best transferrouteaordingtoourguidelinesintypialnetwork

whenlinksarossLANsaresuientlypowerful,so theywon'tbeomebottleneksaslongaswemaintainthe

rstproperty.

Ouralgorithmtries tobuild atransferrouteloseto suh bestroutes. Notethat itisnotalwayspossible

to onnetallnodes in alist. Forexample,ifrewallsdonotallowsomeonnetions, itmaybeunavoidable

forsomenodesto servedata totwoor morehildren. Thus, ouralgorithm in generalforms atransferforest,

with some heurististo onnet nodes lose to eah other and to make the tree deeper. It may be aforest,

ratherthan asingletree, beausethere maybemultiple nodes that haveomplete datain the beginning. In

suh ases,aseparatetreewillbeformedrootedat eahsourenode.

Thepaperisorganizedasfollows. Setion2desribesamodelofthenetworkandthegoalofthisresearh.

Then,weproposeouralgorithm andproofofeienyin Setion3. Andvalidationandevaluationareshown

in Setion 4. In Setion 5, we explain related work. Finally, we onlude and summarize this researh and

remarktofuture workin Setion6.

2. ProblemDesription. Inthissetion,wedenegoalsof thealgorithmandformalizetheproblem.

2.1. Goals.

Tolerate faults and adapt to resoure onditions: Copyingalargele tomanynodestakesalongtime.

Thereforeour solution must toleratetemporal/permanentnetwork faults and node rashes. Whena

noderashes,nodesreeivingdatafrom therashednodemustndasubstitutesothattheremaining

nodesnishtheirtasks. Whenanodereovers,itmustbeabletojointhetransfernetworkandontinue

itsjob,withoutwaitingfortheongoingoperationtonishandthenrestartingfromsrath. Inaddition

to beingfault-tolerant, itmust adapt to hangesin network onditions; itshould hange thetransfer

routedepending onhangesofonditions.

Both of these requirements prelude a simplisti solution that statially onstruts a route in the

beginning and tries to retain the sameroute until they nish. Nodes must ontinuouslysearhfor a

bettertransferroute.

Make aneienttransfer routeautomatially: AsmotivatedinSetion1,ourgeneralriteriaforgood

(4)

datatomultiple(morethanone)hildren. Thisisbasedonourassumptionthatabottlenekistypially

aused by an inter-subnetedge or anode. Examples for thelatter are disks and network interfaes.

Ouralgorithm tries to optimizethe numberof longonnetionsand the numberof hildrenfor eah

node,withaverysimpleloalsearhheuristis.

Work on today'stypial networkongurations: Today's typial network ongurations do not allow

eah node to onnet to all other nodes. Firewalls may blok onnetions between LANs. Inside a

LAN, itis ommonto plae all luster nodesbut one (a master node) behind aNAT router,so that

aessestolustersneedgothroughthemaster. WithDHCP,itmayevenbeimpratialtoassumeall

nodestohavepersistentnames.

In short, wemust model the network as a generalgraph where allowed onnetions are represented

byits edges. Yet itis impratialto assumesuhagraphis given by theuser(or theadministrator)

eitheroineorinthebeginningofthealgorithm. Altogether,wemustdesignanalgorithmthatbegins

withaminimumamountofglobalinformation(e.g.,partiipatingnodes)andaloalknowledgeofthe

network(e.g.,neighbors)in eah node.

Do notassumephysial networktopology: Knowing physial network topology would help us to

opti-mize transfer routes. Designing the algorithm assuming aomplete knowledge aboutit is, however,

impratial formanyreasons disussedsofar. First itis umbersome fortheuser orthe

administra-tor to maintain suh information. We may be able to obtain suh information by using tools suh

astraeroute, but suh toolstend to beunavailable these daysfor seurity onsiderations. It is also

diult to obtain the topologyof the network behind asingle router with traeroute. Seond, even

iftopologyinformationisavailable,dynamiallyprobingthenetworkis alwaysneessaryto makethe

algorithm fault-tolerant and adaptive. Algorithms based on probing onnetivity and proximity at

runtimenaturallyworkwithoutdetailedknowledgeaboutnetworktopology.

Of ourse, weould alwaysuse physial topologyashints to ouralgorithm, among manyother hints

suhasIP addressprex,lateny,andobservedthroughput.

Toahievethesegoals,eahnodeinvolvedinouralgorithmontinuouslyseeksaparent,anodethatserves

datato thenode. Whenitfaes suh eventsasparentrashesordisonnetions, ittriesto ndanewparent.

Even without suh events, they ontinuously searh for abetter parent to optimize the transferroute. The

riteriaforabetterparentarethat(1)theloseranodeistoitselfthebetter,and(2)thefewerhildrenanode

hasthebetter.

Ouralgorithmisasimpleloalsearhalgorithmthatonvergestoasatisfatorytransferintypialnetwork

ongurationsoftoday. Ideally,wedesireanalgorithmtondagloballyoptimalsolutionforanygivennetwork.

Aplausibledenitionoftheoptimalwouldbetominimizethesumofseletededgeweightsandthenumberof

branhes(orequivalently,thenumberofleaves)inthegraph. Thetworiteriamayonitforgeneralweighted

graphsandeven iftheydonot,theywill requireaomplexglobal optimizationalgorithm (e.g.,fault-tolerant

MST onstrution) whose pratial importane may not be very lear. In the following, we formulate our

problemandproveoursimplealgorithmhasapropertywhihtranslatestoasuientlygood transferroute

intypialreal networkongurations.

2.2. ProblemFormulation. Asusual,wemodelthenetworkbyadiretedgraph

G

=

hV, Ei

,where

V

is asetofnodespartiipatingintherepliation.

E

representspossibleonnetionsbetweennodes;

(

a, b

)

E

⇐⇒

a

knows

b

'snameandtheurrentnetwork statusallows

a

toonnetto

b

.

The graphis for modeling purposes only; in pratie, the network status may hange overtime, soeah

nodeannotknowtheompletestatusofthenetwork. It mayevenbeimpratialto assumeeahnodeknows

alltheneighborsit anonnetto. Inourimplementation, eah nodebeginswith knowinginformationabout

afewofitsneighborsandreeivingaommandthatinstrutsittopartiipateintherepliationofale. They

learnother node nameson the yby propagating informationalong establishedonnetions. This way, they

learnother onnetionsthey may be ableto make. Theylearn whethera partiularonnetion is allowed or

not by tryingto establisha onnetiononly when neessary. Nodes never maintaininformation about edges

theyarenotadjaentto.

Below, we proveour optimization algorithm eventually reahes atransfer forest that has somedesirable

properties, assuming that thegraphis xed at somepoint. Notethat our algorithmorretly nishesits job

(5)

Todene thegoodness ofatransferforest,wemust introdueanotionofdistane betweennodes. One

plausible formulation would be to give edges arbitrary weights, and to aim at reduing the total weights of

seleted edges (i.e., minimum spanning forest). We do not use this formulation but introdue a stronger

assumptionaboutthedistanes betweennodeswhihwebelieveisapratialapproximationof realnetworks,

andshowasimplerloalsearhobtainssuientlygoodresults.

Weassumenodesanbedeomposedintogroupssothatnodeslosetoeahotheronstituteagroup. Our

optimizationalgorithm doesnotassumethat eahnodeknowsthedeompositionexpliitly,but onlyassumes

that eah node an somehow ompare relative distanes from the loal node to other nodes. We show in

Setion3.3suhaomparisoninduesadeomposition. Itissuhadeompositionforwhihouralgorithmtries

toreduethenumberofinter-groupedges. Again,therepliationorretlynisheswithinaurateinformation,

thusanimplementationanuseanysuientlyauratemeasurement. Oururrentimplementationisgivenin

Setion3.2.1.

We say a deomposition is omplete if nodes in eah group form a lique (a omplete subgraph) of

G

. That is, nodes inside a group an onnet to eah other without being bloked by, e.g., rewalls. For any

deomposition whih mayor maynot beomplete, onean derive aomplete deomposition by dividing its

inompletegroupintoanumberofgroupssoeahofthemisalique. Weallsuhaompletedeompositiona

ompletesubdivision oftheoriginaldeomposition. Givenadeomposition

D

,aompletesubdivision thathas theminimumnumberofgroupsisalledtheoarsest subdivision of

D

.

Given adeomposition,the goalwould beto makea transferforest lose to the followingbest desirable,

whih has

1. theminimumnumberofedgesonnetingnodesin dierentgroups,and

2. theminimumnumberofbranhes.

Ouralgorithmonvergesto theoptimalifeah nodeanonnetto anyothernode (i.e.,theentiregraph

is omplete, orin pratialterms,rewall,NAT,orDHCP do notdenyanyonnetion againstus). Inmore

generalgraphs,ouralgorithmhasthefollowingproperty. Let

D

thedeompositioninduedbyaheuristisused tomeasuretherelativedistane betweennodes,and

D

theoarsestompletesubdivisionof

D

. Ouralgorithm ahieves(1)thenumberofinter-grouponnetions

N

F

and(2) thenumberof branhes

N

1

,where

N

is thetotalnumberofgroupsin

D

and

F

thenumberofgroupsin

D

ontainingatleastone nished node, anodewhihhasreeivedtheentiredata.

Ourlaimthat theabovepropertytranslatesto agoodresult inpratieis basedonthefollowing

obser-vations.

Asimplemeasurementanreasonablyapproximatethelosenessbetweennodes. Forexample,given

anodein thesameLANastheloal nodeandanothernotinthesameLAN,itwillberelativelyeasy

fortheloalnodetojudge ifonenodeis losertotheother, thus shouldbepreferred. Therefore,one

anobtainadeomposition eahgroupofwhihhasnodeslosetoeah other.

Intypialnetworkongurations,nodeslosetoeahothertendtobeallowedtoonnettoeahother.

Mosttypially,nodeswithinaLANanonnetto eahother. Makingagroupofnodeslosetoeah

otherthus tendstoyieldasubgraphthat isnearlyomplete.

Therstbulletimpliesthat,ifwegroupnodesbasedonareasonablyauratemeasurementofdistanesbetween

them,wewillhavegroupseahofwhihonsists ofnodesloseto eah other. Eah suh groupwill benearly

omplete (bullet #2), therefore

N

will belose to

N

. Together,the numberof onnetionsrossing agroup boundarywillbeloseto

N

F

,andthenumberof branhesloseto

N

1

.

3. Algorithm. Thealgorithmhasseveralfeaturesthatweshould remark.

Asimple, self-stabilizingdistributed algorithm: Eahnodeworksbasedoninformationaboutits

neigh-bors and optimizestransferroutes withasmall amountof loalinformation. Eahnode ontinuously

seeksalosernode thatmayservedata faster. Thismehanismnaturallymakesouralgorithm

fault-tolerantandallowsnodesto joinorleaveomputation atanytime.

Parallel and pipelinedtransfer: Transferringdatafromnode

A

to

B

andfrom

C

to

D

anourinparallel. Moreover,transferringapiee of datafrom

A

to

B

andtransferring anotherpiee of data from

B

to

C

analsotakeplae inparallel (pipelined transfer). Thisisespeially importantforrepliatinglarge lesinswithednetworks.

(6)

together withthe self-stabilizingnature of thealgorithm,is enoughto makeit deadlok-free; whena

noderashes,itshildrenwill eventuallylearnthereis noprogressforalongtime, inwhihasethey

tryto onnettoanothernodethatisaheadofit.

01: /*StartingorAfterReovered*/

02: oset =urrentlesize ondisk;

03: parent=invalid;/*thenodeselfisgettingdatafrom.

*/

04: andidate =null;

05: is_sending_giveme =false;

06: hildren =none;/*nodesselfisgivingdatato*/

07: siblings=none;/*usedforTree2ListSuggestion */

08: neighbors =listofneighbors(deadoralive);

09: while(true){

10: /**********SearhingforParent**********/

11: (andidate==null&&parent ==invalid)

12: andidate =anodeinneighbors;

13: send(andidate,ask(id,oset));

14: /*NearParentHeuristis*/

15: (andidate ==null&&anodeinneighbors satises

16: is_loser(self,node,parent))

17: andidate =node;

18: send(andidate,ask(id,oset));

19: /*Tree2List Heuristis*/

20: (andidate ==null&&asibling insiblingssatises

21: !is_loser(self,parent,sibling))

22: andidate =sibling;

23: send(andidate,ask(id,oset));

24: reeived(ask(wid,woset))

25: if ((oset

>

woset)&&

26: (MAX_NODE

>

numberofhildren)){

27: addthisnode(wid,woset) tohildren;

28: send(wid,ok(id,oset));

29: }else{

30: send(wid,ng(id));

31: }

32: reeived(ok(wid,woset))

33: if (woset

>

oset){

34: parent=wid;andidate=null;

35: }

36: reeived(ng(wid))

37: if (wid ==andidate) {

38: andidate =null;

39: }else if (wid ==parent){

40: parent=invalid;

41: }

42: /********** DataTransfer**********/

43: (parent !=invalid&&oset

<

lesize &&

44: !is_sending_giveme)

45: is_sending_giveme =true;

46: send(parent,giveme(id,oset));

47: reeived(giveme(hild,woset))

48: if (oset

>

woset){

49: size =max(BLOCKSIZE,osetwoset);

50: buf =load(lename,woset,size);

51: send(hild,data(id,woset,size,buf));

52: }else{

53: send(hild,ng(id));

54: }

55: reeived(data(wid,woset,size,buf)

56: if (woset ==oset){

57: is_sending_giveme =false;

58: save(lename,woset,size,buf);

59: oset +=woset;

60: }

61: (oset ==lesize &&parent!=null)

62: if (parent!=invalid)

63: send(parent,disonnet(id));

64: parent=null;

65: reeived(disonnet(hild))

66: deletethehild fromhildren;

67: /********** Tree2List Suggestion**********/

68: (havingmorethanonehild)

69: foreahhild inhildren {

70: send(hild,suggestion(id,hildren));

71: }

72: reeived(suggestion(parent,new_siblings))

73: siblings =new_siblings;

74: /********** FaultHandling**********/

75: (timeout(data,ng)fromparent)

76: parent=invalid;

77: (timeout(giveme,disonnet)fromhild)

78: deletethehild fromhildren;

79: (timeout(ok,ng)fromandidate)

80: andidate =null;

81: }

Fig.3.1. Pseudo-ode ofouralgorithm

Figure 3.1 shows the loal algorithm running on eah node. Prior to running this algorithm, eah node

knows its neighbors (neighbors) and the size of the le eah node must eventually have (lesize). In atual

implementation,eahnodemaybeginwithaninompletelistofneighbors. Nodespropagatetheirneighborsto

otherandlearnfrom others.

Insidethemain whileloop(line981)is writtenasalistofthefollowingform:

ondition

ation

(7)

aretrue,anyoneofthemishosenarbitrarily.

First,we explainthe base part of this algorithm in Setion 3.1. We ontinue with theroute optimization

heuristisin Setion3.2

3.1. The Base Algorithm. Eahnoderepeatsthefollowinguntilitgetstheentiredata.

Itseeksanodethatisaheadofitself(i.e.,hasmoredatathanitself). Letusallsuhanodeitsparent.

Aparentmayhangeovertime.

Oneitndsaparent,itaskstheparenttosendthedatathatshouldomenexttothedataiturrently

has. Forexample,ifanodehastherst1000bytesofale,itwillasktheparenttosendsomeamount

ofdatafromoset1000.

Inaddition,

Eahnode,exeptonesthathaveobtainedtheentiredata,seeksanodethatislosertoitsurrent

parent. DetailsareinSetion 3.2.1.

Eah node having twoormorehildrentries to resolvethis situation,by suggestinghildrento

onnetto oneofitssiblings.

When a node reeivesan instrution to partiipate in a repliation, eah node heks how muh data it

has (line 2), searhes for a andidate node that has data grater than itself by onneting to some nodes in

its neighbors list. Variable oset indiates the size of data at that time, and satises the inequality 0

oset

lesize. Duringdata transfer,theinvarianthild's oset

parent'soset ismaintained(line25, 33, and48).

Anodesearhingforaparentsendsanaskmessagearryingitsoset (datasize)toaandidate(line1113).

Ifthereeiverhasmoredatathan thesender,itsends anokmessageto thenode sender(line2428,3235).

Atthat time,therelationbetweenparent-hildisestablished. Afterthat,thehildsends agivememessageto

theparent(line 4346)and theparentsends ahunkof datato thehild (line4751). This repeatsuntil the

hildeitherathesuptheparentindatasize (line5254),ndsabetterandidatethantheurrentparent,or

reeivesanerror. If thereeiverof askdoesnothavemoredatathan thesender,it sendsananswerng (line

2931)tothesender. Reeivinganngmessage(line3641),thenodeontinuestosearhforaparent.

Anodeanbeaparentofsomenodesandahildofanotheratthesametime. Ineet,weahieveapipeline

transferthroughallnodes.

Whena parentbeomes unreahablefrom its hild (due to a parentrash ora network failure), thehild

merelysearhes foranew parent. When anodereovers,it anpartiipatein the transferfrom theosetat

thetimeithasfailed. Hene,thisalgorithmisfault-tolerant(line7480).

3.2. Adaptive Transfer Route Optimization. Now, we explain optimizing heuristis on top of the

basealgorithm(line1423,6773).

is_closer(A,B,C)

parent

candidate

Sub

Tree

new

parent

A

C

B

C

B

A

Sub

Tree

Sub

Tree

Sub

Tree

Fig.3.2. NearParent Operation

3.2.1. NearParent Heuristis. Eah node periodiallytries to onnet to a node that is loserto its

urrentparent(andidateinFigure3.2,line1518inFigure3.1). Iftheandidatenodeturnsouttohavemore

(8)

Notethatevenifeahnodehasonnetedtoitsparent,itsearhesforanevenloserandidateperiodially.

Wehavenotonduted anextensivestudyaboutthebest frequeny. Frequentmeasurementswill allowus to

ndagoodtransferroutefastattheostofinreasednetworktra. Oururrentimplementationguarantees

thatthere isatmostonetrafrom eahnodeforthemeasurement. It alsoguaranteeseahnodeperformsa

measurementatmostoneevery100ms. ThiswillhardlyaetsCPUornetworkload.

Theprediatetojudgeifanode

B

isloserthan

C

fromtheloalnode

A

,is_loser(

A

,

B

,

C

),urrentlyuses thefollowingriteriain thelistedorder.

Throughputobserved in the past: Eah node reordsthroughput from eah ofthe nodesthat havebeen

hosenasitsparent. If

A

hashosenboth

B

and

C

asits parentbefore, whiheverprodued abetter throughputisonsideredloser.

Observed lateny: The above riterionis notappliablewhen either

B

or

C

hasneverbeen hosenone as

A

'sparent. Inthisase

A

useslateniesittakestoonnetto

B

and

C

.

The length ofthe mathingIP address prex: Whenobservedlateniesaretoolosetodisriminate,we

useIPaddressesof

A

,

B

,and

C

. WeomparethelengthsoftheommonprexesofIPaddressesof

A

and

B

tothat of

A

and

C

.

For thepurpose ofprovingthetheoretial property ofthe algorithmmentioned in Setion 3.3(also stated as

Theorem3.7),is_loseranbeanyprediatethat satisesthefollowingproperties.

is_loser

(

A, B, C

)

andis_loser

(

A, C, B

)

donotbeometrueatthesametime.

Foragiven

A

, thebinaryrelation:

R

A

(

B, C

)

def

=

is_loser

(

A, B, C

)

istransitive. That is,

is_loser

(

A, B, C

)

is_loser

(

A, C, D

)

is_loser

(

A, B, D

)

is_loser

(

A, B, C

)

is_loser

(

B, A, C

)

Itwillbelearthatanyreasonabledenitionofrelativedistaneandanauratemeasurementofit,inluding

theoneslistedabove,willsatisfythersttwobullets. Thethirdpropertymaynotsoundveryobvious. Examples

thatsatisfythepropertyinlude:

Adenitionbasedonthebottlenekedgeontrees. Thatis,assumenodesareonnetedviaaweighted

treeandletis_loser

(

A, B, C

)

betrueitheminimumweightonthepathbetween

A

and

B

islarger thanthatonthepathbetween

A

and

C

.

Adenition basedonthedistaneontrees. Thatis, assumenodesareonnetedviaatreeandlet

is_loser

(

A, B, C

)

betrueithepathbetween

A

and

B

is shorterthan

A

and

C

.

Adenition basedonaddressprexes. That is,assumenodesareassignedintegeraddressesandlet

is_loser

(

A, B, C

)

betrueithelengthofthemathingaddressprexbetween

A

and

B

islargerthan thatbetween

A

and

C

.

Thereforeweexpet that oururrentimplementation ofis_loserbasedon measuredbandwidthsbetween

nodes,measuredlateniesbetweennodes,andthelengthofIP addressprexes,will satisfythethird property

providedmeasurementsareaurate.

Notethatimplementingsuhaprediatedoesnotrequireanyapriori notionofgroups. Just

dening/mea-suring therelativeloseness betweennodeswill sue, aslongassuh adenition/measurement satisesthe

above properties. In Setion 2.2, we show suh a prediate in generalimpliitly indues a distane between

nodes, whih in turn indues a deomposition of nodes based on the distane. Our algorithm redues the

numberofinter-groupedgesforadeomposition derivedthisway.

3.2.2. Tree2ListHeuristis. NearParentheuristisreduesthenumberofedgesthatrossgroup

bound-aries. It howeveris not useful for reduing the number of branhes. Another optimization, alled Tree2List

heuristis,omesintoplayto makethetransferroutelosertoalist.

Anodethathastwoormorehildrensendsitshildrenlisttoeveryhild(line6871).Whenanodereeivesa

suggestionmessage,whiheetivelyontainsitsurrentsiblings,ithoosesoneinthelistasthenextandidate

iftheurrentparentisnotlosertoit(lines7273,2023).Figure3.3showshowTree2Listheuristismodies

(9)

Sub

Tree

Sub

Tree

Sub

Tree

Sub

Tree

Sub

Tree

Sub

Tree

A

B

C

A

B

C

!is_closer(C,A,B)

Fig.3.3. Tree2ListOperation

AnimportantpropertyaboutTree2List,provedin thenextsetion,isthatitneverinreasesthenumberof

inter-groupedges. ThisguaranteesthatapplyingTree2ListdoesnotimpedetheNearParent'seortofreduing

the numberof inter-group edges. In the next setion, we stateand prove properties of transferforests after

applyingbothheuristisinanarbitraryorder.

3.3. Properties of the Route Optimization Algorithm. Letis_loser satisfythe properties stated

inSetion3.2.1. Werstshowthefollowing,thatsaysis_loser

(

A, B, C

)

isequivalenttoomparingadistane between

A

and

B

andbetween

A

and

C

, forsomedenitionofadistane.

Lemma 3.1. For is_losersatisfyingthe property statedinSetion 3.2.1, thereexists adistane funtion

d

thatsatisesthe following.

For allnodes

A

and

B

,

d

(

A, B

) =

d

(

B, A

)

.

For allnodes

A

,

B

,and

C

,

is_loser

(

A, B, C

)

⇐⇒

d

(

A, B

)

< d

(

A, C

)

.

Proof: SeeAppendix A.1.

ThefollowingLemma isimportantforguaranteeingTree2Listisappliablewhenwehavemanybranhes.

Lemma3.2. For any

d

satisfying the onditioninLemma3.1,

max(

d

(

A, B

)

, d

(

A, C

))

d

(

B, C

)

istruefor all nodes

A, B

,and

C

. Proof: SeeAppendixA.2.

Adistane funtion

d

andathreshold

t

deneanaturaldeomposition of agraph. That is, weremoveall edges

(

x, y

)

suh that

d

(

x, y

)

> t

from theoriginal graph, and leta groupbea onnetedomponent of the graph. We allsuh adeomposition is derived from is_loser. Manydeompositions anbe derivedfrom a

singledenition ofis_loser,dependingonthehoieof

d

and

t

.

Wemodel ourrouteoptimization heuristisasaproessof rewritingthetransferforest aordingto

Near-Parent,Tree2List,ornishingthetransfertoanode.

Definition 3.3. A state of omputation is a forest among partiipating nodes, indued by their parent

pointers. Let

S

and

S

bestates. Wedene relations

n

,

t

,

f

,and

by:

1.

S

n

S

def

⇐⇒

S

isobtainedby applying NearParent to

S

(Figure3.2), 2.

S

t

S

def

⇐⇒

S

isobtainedby applying Tree2List to

S

(Figure3.3), 3.

S

f

S

def

⇐⇒

S

isobtainedbynishing anode andmaking itsparentpointernull, and

4.

def

=

n

∪ →

t

∪ →

f

. Thatis,

S

S

def

⇐⇒

(

S

n

S

)

or

(

S

t

S

)

or

(

S

f

S

)

.

(10)

Definition3.4.

Let

w

(

S

)

bethe numberof edgesinforest

S

thatrossgroupboundaries. Fortehnialonveniene, we onsider an invalid parentpointertorossagroup boundary, anda nullparentpointernotto ross

any group boundary.

Let

f

(

S

)

be the number of nished nodes (having parent

=

null) and

F

(

S

)

be the number of groups that have at least one nished node. We say suh agroup isnished. Note there may be unnished

nodesin anishedgroup.

Let

l

(

S

)

bethe numberof leaves (i.e., nodesthat arenotpointedtobyany parentpointer).

Lemma 3.5. Transition paths are bounded. That is, the lengthof a path

S0

S1

S2

→ · · ·

is bounded. Proof: DeneSUMDIST

(

S

)

,SUMDEPTH

(

S

)

,and

Q

(

S

)

asfollows.

SUMDIST

(

S

) =

P

x

:node

d

(

x, x

'sparent

)

,

SUMDEPTH

(

S

) =

P

x

: node

depth

(

x

)

,

and

Q

(

S

) = (

f

(

S

)

,

SUMDIST

(

S

)

,

SUMDEPTH

(

S

))

,

where depth

(

x

)

is thenumberof hopsfrom theroot ofthe tree

x

belongsto.

d

(

x, x

'sparent

)

isthe distane between

x

anditsparent. Againfortehnialonveniene, if

x

'sparentpointer isinvalidweonsiderithas avaluelargerthananyother

d

(

y, z

)

for

z

6

=

invalid. Similarly,if

x

'sparentisnull,ittakesavaluesmaller thananyother

d

(

y, z

)

for

z

6

=

null.

Ifweintroduealexiographialorderamongtriples

Q

(

S

)

,itiseasytosee

Q

(

S

)

stritlyinreasesbyasingle transitionstep. Thatis,

S

S

Q

(

S

)

< Q

(

S

)

.

Infat,

f

inreases

f

(

S

)

,

n

doesnothange

f

(

S

)

andinreases

SUMDIST

(

S

)

, and

t

doesnothange

f

(

S

)

,neverdereases

SUMDIST

(

S

)

,andinreasesSUMDEPTH

(

S

)

.

Sineallquantitiesofthetriplesarelearlyboundedabove,wehaveprovedtransitionpathsarebounded.

Lemma 3.6.

1. If

S

satises

w

(

S

)

> N

F

(

S

)

,then

n

isappliableto

S

. Thatis,thereexists

S

suhthat

S

n

S

.

2. If

S

satises

l

(

S

)

f

(

S

)

N

,

f

(

S

)

1

,and

n

isnot appliable to

S

,then

t

isappliable to

S

. Proof:

1. If

w

(

S

)

> N

F

(

S

)

(

=

thenumberofunnished groups),eitherofthefollowingmusthold.

Thereisanunnishedgrouphavingmorethanoneoutgoinginter-groupedges.

Thereisanishedgrouphavinganoutgoinginter-groupinter-groupedge.

Anoutgoingedge isaparentpointerpointingtoanodeoutsidethegroup. Intheformerase,lettwo

ofsuhedgesbe

(

A, B

)

and

(

C, D

)

.

A

and

C

belongtoonegroup,say

X

,whileneither

B

nor

D

belong to

X

. Thus,atransitionby

n

thateithermakes

A

oneof

C

'shildrenorvieversa,isappliable. In thelatterase,letonesuhedgebe

(

A, B

)

andonenishednodeinthegroupbe

P

. Thus,atransition by

n

thatmakes

A

oneof

P

'shildrenisappliable.

2. Wesplittheproofintotwoases,(i)

l

(

S

)

f

(

S

)

> N

,and(ii)

l

(

S

)

f

(

S

) =

N

. (i)

l

(

S

)

f

(

S

)

> N

:

Wehaveatleastonegroup

X

thatsatises:

l

f >

1

where

l

and

f

denote thenumberof leavesin

X

andthenumberofnishednodesin

X

,respetively. Let

a1, a2,

· · ·

a

l

betheleavesin

X

(

l

2

). Let

a

i,

1

=

a

i

and

~a

i

= (

a

i,

1, a

i,

2,

· · ·

, a

i,ni

)

(

i

= 1

,

· · ·

, l

)be hains ofparentpointers startingfrom

a

i

. That is,

a

i,j

isahildof

a

i,j

+1

)

for all

i

and

j

(

1

i

l

,

1

j

n

i

1

).

Wearguebyontraditionthatallbutoneofsuhhainsmustbeentirelyin

X

. Letusassumew.o.l.g. neitherof

~a1

nor

~a2

are in

X

. Then thereare

j

and

k

(

1

j

n1

1

and

1

k

n2

1

)suh that

a1

,j

and

a2

,k

X

, and

a1

,j

+1

and

a2

,k

+1

6∈

X

. Then atransition by

n

that onnets

a1

,j

and

a2

,k

shouldbeappliable. Thisontraditstheassumptionthat

n

isnotappliablein

S

.

(11)

twohains. Itremainstoshowwehaveeither(

¬

is_loser

(

B, A, C

)

)or(

¬

is_loser

(

C, A, B

)

),soeither

B

or

C

antrigger

t

. ByLemma3.2,wehave

max

(

d

(

A, B

)

, d

(

A, C

))

d

(

B, C

)

,

fromwhihweanderive:

max

(

d

(

A, B

)

, d

(

A, C

))

d

(

B, C

)

d

(

A, B

)

d

(

B, C

)

or

d

(

A, C

)

d

(

B, C

)

d

(

B, A

)

d

(

B, C

)

or

d

(

C, A

)

d

(

C, B

)

¬

is_loser

(

B, A, C

)

or

¬

is_loser

(

C, A, B

)

.

(ii)

l

(

S

)

f

(

S

) =

N

:

Ifwehaveonegroup

X

that satises:

l

f >

1

,

thenthesamedisussionas(i)applies. Intheremainingaseallthegroupssatisfy:

l

f

= 1

.

Let

X

beanygroup. Asin (i),onsider the

l

hainsstartingfrom anode in

X

. Ifallthe

l

hains are entirelyin

X

,twoofthemmustmergein

X

,andthefollowingargumentisthesameas(i). Therefore eah group hasexatly onehain outgoingfrom the group. Then wehave

N

inter-group edges, i.e.,

w

(

S

)

N

. This implies, however,

n

is appliablebeause

f

(

S

)

1

F

(

S

)

1

w

(

S

)

N >

N

F

(

S

)

.

Theorem 3.7. Alongany path ofstatetransitions startingfrom any state

I

, wereahwithin nite stepsa state

S

satisfying:

1.

w

(

S

∞)

N

F

(

S

∞)

,and 2.

l

(

S

∞)

f

(

S

∞)

N

1

.

Proof: FromLemma 3.5, anytransition path

I

=

S0

S1

→ · · ·

is bounded, therefore reahesastate

S

inwhihneither

n

nor

t

(or

,forthatmatter)isappliable. Lemma3.6showsinthisstatewehaveboth oftheaboveproperties.

Remark1:. Asaspeialasewhere

D

=

D

(i.e.,noedgesareblokedinsideagroupof

D

),wehave

N

=

N

. Inthisasethetheoremimpliesthat,forsuientlylongtransfers,thenumberofedgesbetweengroupsreahes

theoptimal

N

F

(

S

)

. Repliatingalefrom

F

(

S

)

groupstotherestwill learlyneed

N

F

(

S

)

inter-group edges. Forbeinglosetoalist,theseondbulletofthetheoremimpliesthatthenumberofbranhes,eetively

alulatedby

l

(

S

)

−f

(

S

)

,istheoptimal

N

1

. Toseethisisoptimalingeneral,onsideranetworkonguration shownin Figure3.4,whih foresinter-groupedgesto formastar.

Remark2:. Reallthatthetheoremappliestoany deompositionderivedfromis_loser. Ifthenetworkhas

multiple levelsofhierarhies, (e.g.,inside aluster,lusters insideaLAN,LANsin aampus/orporatearea,

and LANsin wide area),and is_loseran desriminate all ofthem, ouralgorithm simultaneouslyoptimizes

all the levels. For example, let us say we have

N1

LANs and

N2

lusters and

f

(

S

) = 1

as the usual ase. If weassume is_loserandesriminate intra-luster,inter-luster but intra-LAN, and inter-LANedges, and

thenetworkongurationallowsallonnetions,ouralgorithm onvergestoa statein whih wehave

N1

1

inter-LANedgesand

N2

1

inter-lusteredges.

4. Evaluation.

4.1. Implementation. We have implemented the desribed algorithm in Java. This is exeutable on

ommonomputerssupportingJavaandTCP/IPv4protool. WeonrmedtheprogramrunsonSolaris(spar),

(12)

Finished Node (and only it can be connected to by other’s group.)

Leaf Node

N = 4

l(S) = 4

f(S) = 1

l(S)-f(S) = 3 = N-1

Data Flow

Group

Fig.3.4. Anexample wheretheoptimalvalue of

l

(

S

)

f

(

S

)

is

N

1

4.2. Single Cluster Experiments. First, we ran some experiments in a single luster. The luster

onsists of 16 nodes. Eah node has two Alpha CPUs and a loal hard-disk. Network ables of nodes are

onneted to a 100Mbps swith. Loal disk bandwidth is faster than network, so it does not reveal as a

bottlenek. CPU isalsofastenough.

We initially let onenode have 500MB le, and others have no data. Sine there is only a singleluster,

NearParent optimization does not play any role in this experiment. So this experiment is to see the eet

of Tree2List. In addition to Tree2List, we ran the base algorithm without any optimization, hanging the

maximumnumberofhildreneah nodeanserve,from oneto ve. Theylearlydemonstratehowimportant

isittomakethetransfertreelosetoalist.

ThetimewhihthedistributiontasksspentisshowninFigure4.1.

In this result, it is lear that the average distribution time inreasesas themaximumnumber of hildren

inreases. Thegraphalsoindiates that,in thispartiular experiment,limitingthenumberof hildrentoone

yieldsthebestresult. Thatis, restritingtheshapeofthetransfertreetoalistintherstplaeisbetterthan

our Tree2List strategy whih rst forms an arbitrary tree and then tries to developit to alist. We believe,

however, ourstrategy hasseveraladvantages. First, nodes maynot be ableto form alist in the presene of

rewallset. Insuhases,onemustfallbaktoatree. Seond,formingalistinthebeginningmaytakemuh

longerthanformingatree,espeiallywhenthenumberofnodesbeomeslarge,sinealistanonlygrowone

nodeatatime.

4.3. Multiple Cluster Experiments. Next,wemade experimentsin seven lusters illustratedin

Fig-ure4.2. Theyareallplaedin theampusofUniversityofTokyo.

AnIBMLinuxlusteralledistbs ontains70nodes. Weusedallofthemfortheexperiment. Nodes

within a luster are onneted via 1Gbps links. A node in this luster is the soure node in this

experiment. Bandwidthfrom/to otherlustersbelowispoor100Mbps.

A SunFire15KSMP alled istsun has70CPUs, ofwhih weused 20. We usedthis mahine asifit

were20separatenodes. Ithasa100MbpsNICsharedbyallCPUs. Repliationof300MBdataamong

(13)

0

50

100

150

200

250

tree2list

children

limit 1

children

limit 2

children

limit 3

children

limit 4

children

limit 5

Time to Distribute 500MB (sec)

Kinds of Making Transfer Tree

Plot of Experiments changing Children Limit in One Cluster

Fig.4.1. Performaneinasingleluster

A luster of lusters alled kototoi ontains three luster eah having 16 nodes. Network speed is

100Mbpsinside eahluster. Throughputbetweentwoofthethreeisseveral hundredsMbps. Having

morethanoneonnetiontoasinglelustereasilysaturatesthelink. Nonodesoutsidekototoiannot

diretlyonnettoinside it.

An HPAlpha lusteralled oxen ontains16nodes,whih isthesamelusterin Setion 4.2. There

are two (and only two) gateway nodes that an onnet to and an be onneted from outside the

luster.

ALinuxluster alledmarten eahofwhihruns Linux insideVMWare. Itsongurationisalmost

thesameasalusterinkototoi.

For onnetivity, any node an onnet to istsun nodes and the gateways of oxen. Also, istsun and

istbs are in the same virtual LAN, so nodes in the two lusters an diretly onnet to eah other.

Connetionsto remainingnodesfrom otherlustersarebloked.

Weomparedthefollowingalgorithms.

Randomtree: The base algorithmwithoutanyheuristis, withno limitonthenumber ofhildren foreah

node.

NearParent only: Thebase algorithm

+

NearParent. NoTree2List.

Tree2List only: Thebasealgorithm

+

Tree2List. NoNearParent.

NearParent

+

Tree2List: UsebothTree2ListandNearParent.

Manual: Fix thetransferroutethat weonsider will be thebest, asfollows;istbs onnetsto istsunviaone

inter-lusteredge. Itisbranhedintothreeinsideistsun. Theygotokototoi,oxen,andmarten. Inside

lusters,therearenobranhes. Thethroughputshouldbeloseto100Mbps/3=33Mbps,determined

bythethreeoutgoingedgesfromistsun, whihshare asingle100MbpsNIC.

InFigure4.3, the resultsarepresented. Notsurprisingly,Manual isthe fastest. NearParent

+

Tree2List

ahieved an overhead of 50-100% to the manually tuned transfer and more than four times faster than the

randomtree.

Figure4.4showsthatthenumberofinter-lusteredgesanddistributiontimehaveastrongorrelation.

(14)

istbs 70 nodes

kototoi 16x3 nodes

istsun 20 nodes

Root

Max 100Mbps Data Route

oxen 16 nodes

marten 14 nodes

Internet

Gateway Node

Fig.4.2. Conditionof7lusters

5. Related Work.

5.1. Minimum Spanning Tree Constrution. MST onstrution is a ommonly used tehnique for

optimizingowsinnetworks.Therehavebeenanumberofpublishedalgorithmsandtheirappliations[2,6,1℄.

Itisompellingtomodelourproblembyageneralweightedgraph,withthegoalbeingatreethathasasmall

weightandasmallnumberofbranhes.

Weonsideredapproahesalongthislineandthenabandonedthemforseveralreasons.First,fromtheoretial

pointofview,minimizingthetworiteriaatthesametimeisimpossibleforgeneralweightedgraphs,sowemust

makeadiult(andsomewhatarbitrary)deisionabouthowtotradeonefortheother. Fromthepratialside,

buildinganMSTforgeneralweightedgraphinfault-tolerantandself-stabilizingmannerisalreadyomplexto

implement. Finally,typialrealnetworkshavearelativelysimplestruturewean(andshould)exploit. That

is, nodeslose to eah otherin termsof physial proximityan logiallyonnet to eah other at some level

andbelow. Thereforethesenodesshould beabletoform alistentirelywithin thelique. Wehaveshownthis

isinfat possiblewithaverysimplehill-limbingwithfault-toleraneandadaptiveness.

5.2. Appliation-Level Multiast and CDN. Our work is in spirit similar to anumber of work on

appliation-level multiast and ontent distribution networks (CDN). Our optimization riteria are dierent

fromthem, partiularlyinthatwetrytoredue thenumberofbranhes.

ALMI[9℄ usesaentralized treemanagementshemeand makesMST forgoodperformane. End System

Multiast[7℄takesbothlatenyandbandwidthintoaountwhenmakingatreeofend-hosts. In[12℄,CAN[11℄

is used for the infrastruture of multiast. Bayeux [15℄ uses Tapestry [14℄ that is also ontent-addressable

network. Overast[8℄ is a multiasting systemthat ahievesbothsmall latenies and high throughput. The

main appliation ofthese systemsismultimedia streaming to widelydistributed nodes. Insuh settings,it is

importantto bound lateniesbeausethe appliation maybe aninterative multimedia appliation. Alsoin

CDNs, the main riteria are latenies and tra load balaning, rather than delivering as muh bandwidth

as possible. So researhes about CDN suh [10, 4, 3℄ mainly onern how to alloate replias of ontents,

and how to rediret user requests to appropriate replias. On the other hand, it is less important for suh

(15)

0

200

400

600

800

1000

1200

1400

1600

random

tree

nearparent

only

tree2list

only

nearparent

tree2list

ideal

fixed tree

Time to Distribute 300MB (sec)

Kinds of Making Transfer Tree

Plot of Experiments using Verious Transfer Tree on 7 Clusters

Fig.4.3. Performaneon7lusters

0

200

400

600

800

1000

1200

1400

1600

0

10

20

30

40

50

60

70

80

90

Time to Distribute 300MB Data(sec)

Number of Inter-cluster Edges

Edges Crossing over Clusters and Time to Distribute

Fig.4.4. Correlationbetweennumberofinter-lusteredgesanddistributiontime

lateniesaggressively, beause therst priorityis on theompletion time oftransferring largeles. It is also

veryimportantto utilize LANbandwidthas muhaspossible,asthe typialusagewill beto opylarge les

(16)

6. Summary and Future Work. We have desribed a large le distribution algorithm that realizes

salability, adaptiveness, fault-tolerane,and eientuse of bandwidths. It is based ona simpledistributed

algorithm with simpleloal heurististo optimize transfers. Weformalized and provedthe properties of our

algorithm andargued that thisgivesagood resultin pratialsettings. Oursystemwill beusefulfor setting

up a number of lusters and preparing wide-area distributed omputations with a large data. Evaluations

showthatourimplementationiseetiveinrealenvironmentonsistingofover150nodesarosssevenlusters

ampus-wide.

Oururrentimplementationoftheprotoolisnotseure. Anymaliiousnodeanpartiipateinthe

replia-tionandbreakstheintegrity. Tobeausefultoolfordistributedomputing,wemustuseasuitableauthentiation

whennodesonnetto eahother. Whileintroduingseureauthentiationsispossible,thismayinreasethe

ostof deployingsuh tools, whoseverypurpose will beto help maintainalargenumberof nodeseasily. We

muststudyhowtomaintaineaseofinstallationanduseofthistoolwhileahievingareasonablelevelofseurity.

REFERENCES

[1℄ AbhishekAgrawalandHenriCasanova.ClusteringHostsinP2PandGlobalComputingPlatforms.InProeedingsofthe3rd

IEEE/ACMInternationalSymposiumonClusterComputingandtheGrid(CCGrid2003),pages367373,2003.

[2℄ F.BauerandA.Varma.DistributedAlgorithmsforMultiastPathSetupinDataNetworks. TehnialReport

UCSC-CRL-95-10,UniversityofCaliforniaatSantaCruz,August1995.

[3℄ A.Biliris,C.Cranor,F.Douglis,M.Rabinovih,S.Sibal,O.Spatshek,andW.Sturm. CDNbrokering. InProeedingsof

WCW'01,June2001.

[4℄ PeiCaoandSandyIrani. Cost-awareWWWproxyahingalgorithms. InProeedingsof the1997UsenixSymposiumon

InternetTehnologiesandSystems(USITS-97),Monterey,CA,1997.

[5℄ CVShome. http://www.vshome.org/.

[6℄ LisaHighamand ZhiyingLiang. Self-StabilizingMinimumSpanningTree ConstrutiononMessage-Passing Networks. In

Proeedingsofthe15thConf.onDistributedComputing,DISC,LNCS2180,pages194208,2001.

[7℄ YanghuaChu,SanjayG.Rao,SrinivasanSeshan,andHuiZhang.EnablingConfereningAppliationsontheInternetUsing

anOverlayMultiastArhiteture. InACMSIGCOMM2001,SanDiago,CA,August2001.ACM.

[8℄ JohnJannotti, David K.Giord, KirkL. Johnson, M. Frans Kaashoek, and James W. O'Toole, Jr. Overast: Reliable

Multiasting with an Overlay Network. In Proeedings of the Fourth Symposium on Operating System Design and

Implementation(OSDI),pages197212,Otober2000.

[9℄ DimitrisPendarakis,SherliaShi,DineshVerma,andMarelWaldvogel.ALMI:AnAppliationLevelMultiastInfrastruture.

InProeedings of the 3rd USNIXSymposium on Internet Tehnologies and Systems (USITS '01), pages 4960, San

Franiso,CA,USA,Marh2001.

[10℄ LiliQiu,VenkataN.Padmanabhan,andGeoreyM.Voelker. Ontheplaementofwebserverreplias. InINFOCOM,pages

15871596,2001.

[11℄ SylviaRatnasamy,PaulFranis, MarkHandley,Rihard Karp, and Sott Shenker. Asalable ontent-addressable

net-work. InProeedings of the 2001 onferene on appliations, tehnologies, arhitetures, and protools for omputer

ommuniations(SIGCOMM2001),pages161172.ACMPress,August2001.

[12℄ SylviaRatnasamy,MarkHandley,RihardKarp,andSottShenker. Appliation-LevelMultiastUsingContent-Addressable

Networks. LetureNotesinComputerSiene,2233,2001.

[13℄ YasuhitoTakamiya,AtsushiManabe,andSatoshiMatsuoka.Luie:Afastinstallationandadministrationtoolforlarge-saled

lusters(inJapanese). InSACSIS2003,pages365372,May2003.

[14℄ B.Y.Zhao,J.D.Kubiatowiz,andA.D.Joseph. Tapestry: AnInfrastruture forFault-tolerantWide-areaLoationand

Routing.TehnialReportUCB/CSD-01-1141,UCBerkeley,April2001.

[15℄ ShelleyQ.Zhuang,BenY.Zhao,AnthonyD.Joseph,RandyH.Katz,andJohnD.Kubiatowiz. Bayeux: AnArhiteture

forSalableandFault-tolerantWide-areaDataDissemination. InProeedings oftheEleventhInternationalWorkshop

onNetworkandOperatingSystemSupportforDigitalAudioandVideo(NOSSDAV2001),June2001.

AppendixA. OmittedProofs. Inthissetionweabbreviateis_loserto

C

.

A.1. Lemma3.1. Let

V

betheset ofallnodes. Weintrodueanunknown

x

AB

foreah

A, B

V

. For eahtriple

(

A, B, C

)

suh that

C

(

A, B, C

)

is true, wegenerate aonstraint

x

AB

< x

AC

. Wethen unify

x

AB

and

x

BA

for all

A, B

V

, replaingall ourreneofonewith theother. Weare goingto showthereare no loopsofonstraints

x

AB

< x

CD

<

· · ·

< x

AB

,thus theonstraintsare satisable. Whenwehaveprovedthis, welet

d

(

A, B

) =

x

AB

,forall

A, B

V

.

Tobeginwith,weshowthefollowing:

x

AB

<

· · ·

< x

Y Z

⇒ C

(

A, B, Z

)

or

C

(

A, B, Y

)

,

(17)

1.

n

= 1

:

Observewemust have

A

=

Y

,

A

=

Z

,

B

=

Y

, or

B

=

Z

sinethis onstraintwasgenerated from

C

. When

A

=

Y

,

x

AB

< x

Y Z

x

AB

< x

AZ

⇒ C

(

A, B, Z

)

. Other asesaresimilar.

2. Assumethelaimholdsupto

n

1

andnowwehave

x

AB

< x

CD

<

· · ·

< x

Y Z

oflength

n

. Byindutionhypothesis, weeitherhave: (a)

C

(

C, D, Z

)

,or

(b)

C

(

C, D, Y

)

.

By

x

AB

< x

CD

,weeitherhave: (i)

A

=

C

and

C

(

A, B, D

)

, (ii)

A

=

D

and

C

(

A, B, C

)

, (iii)

B

=

C

and

C

(

A, B, D

)

,or (iv)

B

=

D

and

C

(

A, B, C

)

.

Sine(a)and(b)aresimilarweonlyprovethease(a)byanalyzingthefourases(i)(iv).

(i)

C

(

A, B, D

)

and

C

(

A, D, Z

)

⇒ C

(

A, B, Z

)

(ii)

C

(

A, B, C

)

and

C

(

C, A, Z

)

⇒ C

(

A, B, C

)

and

C

(

A, C, Z

)

⇒ C

(

A, B, Z

)

.

(iii)

C

(

A, B, D

)

and

C

(

B, D, Z

)

⇒ C

(

B, A, D

)

and

C

(

B, D, Z

)

⇒ C

(

B, A, Z

)

(

A, B, Z

)

.

(iv)

C

(

A, B, C

)

and

C

(

C, B, Z

)

⇒ C

(

B, A, C

)

and

C

(

B, C, Z

)

⇒ C

(

B, A, Z

)

(

A, B, Z

)

.

Nowweprovebyontraditiontherearenoloops:

x

AB

<

· · ·

< x

Y Z

< x

AB

.

Bytheaboveindution,weeitherhave:

(a)

C

(

A, B, Z

)

or, (b)

C

(

A, B, Y

)

.

By

x

Y Z

< x

AB

,weeitherhave: (i)

Y

=

A

and

C

(

A, Z, B

)

, (ii)

Y

=

B

and

C

(

B, Z, A

)

, (iii)

Z

=

A

and

C

(

A, Y, B

)

,or (iv)

Z

=

B

and

C

(

B, Y, A

)

.

Weseeombininganyof(a)(b)andanyof(i)(iv)willleadtoontradition. Weonlyprovease(a)sine(b)

issimilar.

(i)

C

(

A, B, Z

)

and

C

(

A, Z, B

)

false.

(ii)

C

(

A, B, Z

)

and

C

(

B, Z, A

)

⇒ C

(

B, A, Z

)

and

C

(

B, Z, A

)

false.

(iii)

C

(

A, B, Z

)

and

Z

=

A

false. (iv) Sameas(iii).

A.2. Lemma 3.2. Analyze the three ases, (i)

d

(

A, C

)

< d

(

A, B

)

, (ii)

d

(

A, B

)

< d

(

A, C

)

, and (iii)

d

(

A, B

) =

d

(

A, C

)

. Proveeahasebyontradition. (i) Letusassume

d

(

A, C

)

< d

(

A, B

)

< d

(

B, C

)

. Then,

d

(

A, C

)

< d

(

A, B

)

and

d

(

A, B

)

< d

(

B, C

)

d

(

A, C

)

< d

(

A, B

)

and

d

(

B, A

)

< d

(

B, C

)

⇒ C

(

A, C, B

)

and

C

(

B, A, C

)

⇒ C

(

A, C, B

)

and

C

(

A, B, C

)

(18)

(ii) Similarto (i).

(iii) Letusassume

d

(

A, B

) =

d

(

A, C

)

< d

(

B, C

)

. Then,

d

(

A, B

) =

d

(

A, C

)

and

d

(

A, C

)

< d

(

B, C

)

d

(

A, B

) =

d

(

A, C

)

and

d

(

C, A

)

< d

(

C, B

)

d

(

A, B

) =

d

(

A, C

)

and

C

(

C, A, B

)

d

(

A, B

) =

d

(

A, C

)

and

C

(

A, C, B

)

d

(

A, C

) =

d

(

A, B

)

and

d

(

A, C

)

< d

(

A, B

)

false.

Editedby: WilsonRivera,JaimeSeguel.

Reeived: July3,2003.

References

Related documents

The aims of the study were as follows: firstly, to examine the latent variable structure of the BCET in relation to the subscale assessment domains of quantity, time/age, length,

As argued earlier, the goal of Event Tracker is to provide an inte- grated platform that provides both automated low-latency ingestion and augmentation of report streams

5.26 Details of Hong Kong and other taxes levied on the income and capital of the MPF scheme or pooled investment fund including tax, if any, deducted from benefits accrued

The image processing algorithm for malaria diagnosis uses (a) the ARR transform algorithm to segment the blood components, (b) WBC differentiation algorithm using size of

5 (b), so that the transition region can be filled by region filling and the system can produce a good segmentation result [1]. Transi- tional pixels attached to each other

92 Dundalk Gaels Gaelic Football Club 93 County Cavan Motor Club Limited 94 Showjumping Association of Ireland 96 Castle Golf Club Limited.. 98 Irish Basketball Association Limited

Future role and relevance of the Waitangi Tribunal and Māori Land Court Chief Judge Wilson Isaac, Judge Stephanie Milroy, Prof... Is there a need for a separate Family Court

Figure 3: Top view of the prismatic body and the occurring cavitation structures - comparison od the experiment (left) to the numerical result (right)... The areas indicated by