Satin: Simple and Efficient Java-based Grid Programming

(1)

SATIN: SIMPLEANDEFFICIENT JAVA-BASED GRID PROGRAMMING

ROB V. VAN NIEUWPOORT, JASON MAASSEN,THILO KIELMANN, HENRI E. BAL

∗

Abstrat. Grid programming environments need tobeboth portable and eient to exploitthe omputationalpower of

dynamially available resoures. In previous work, we have presented the divide-and-onquer based Satin model for parallel

omputingonlusteredwide-areasystems.Inthispaper,wepresenttheSatinimplementationontopofournewIbisplatformwhih

ombinesJava'swriteone,runeverywhere witheientommuniationbetweenJVMs. WeevaluateSatin/Ibis onthetestbed

of the EU-funded GridLabprojet, showingthat Satin's load-balaning algorithmautomatiallyadapts both to heterogeneous

proessor speedsandvaryingnetworkperformane,resultingineientutilizationofthe omputingresoures. Ourresultsshow

thatwhenthewide-arealinkssuerfromongestion,Satin'sload-balaningalgorithmanstillahievearound80%eieny,while

analgorithmthatisnotgridawaredropsto26%orless.

Keywords. Satin,Ibis,divide-and-onquer,loadbalaning,distributedsuperomputing.

1. Introdution. In omputational grids, appliations need to simultaneously tap the omputational

powerofmultiple,dynamiallyavailablesites. Theruxofdesigninggridprogrammingenvironmentsstems

ex-atlyfromthedynamiavailabilityofomputeyles: gridprogrammingenvironmentsneedtobebothportable

torunonasmanysites aspossible,and theyneedtobeexible to opewithdierentnetwork protoolsand

dynamiallyhanginggroupsof heterogeneousomputenodes.

Existingprogrammingenvironmentsare either portable and exible(Jini, Java RMI),orthey arehighly

eient(MPI).TheGlobalGridForumalsohasinvestigatedpossiblegridprogrammingmodels[19℄. Reently,

GridRPC has been proposed as a grid programming model [30℄. GridRPC allows writing grid appliations

basedonthemanager/workerparadigm.

Unlikemanager/workerprograms,divide-and-onqueralgorithmsoperatebyreursivelydividingaproblem

intosmallersubproblems. Thisreursivesubdivisiongoesonuntiltheremainingsubproblembeomestrivialto

solve. Aftersolvingsubproblems,theirresultsarereursivelyreombineduntilthenalsolutionisassembled.

By allowingsubproblems to be divided reursively, the lass of divide-and-onquer algorithms subsumes the

manager/workeralgorithms,thus enlargingthesetofpossiblegridappliations.

Of ourse,there aremany kindsofappliations that donotlend themselveswellto adivide-and-onquer

algorithm.However,we(andothers)believethelassofdivide-and-onqueralgorithmstobesuientlylargeto

justifyitsdeploymentforhierarhialwide-areasystems. Computationsthatusethedivide-and-onquermodel

inludegeometryproedures,sortingmethods, searhalgorithms,datalassiationodes,n-bodysimulations

anddata-parallelnumerialprograms[33℄.

Divide-and-onquerappliationsmaybeparallelizedbylettingdierentproessorssolvedierent

subprob-lems. Thesesubproblemsareoftenalledjobsinthisontext. Generatedjobsaretransferredbetweenproessors

to balane the load in the omputation. The divide-and-onquer model lends itself well to

hierarhially-strutured systems beause tasks are reated by reursive subdivision. This leads to a task graph that is

hierarhiallystrutured,and whihan beexeutedwithexellentommuniationloality,espeiallyon

hier-arhialplatforms.

Inpreviouswork[26℄,wepresentedourSatinsystemfordivide-and-onquerprogrammingongridplatforms.

Satinimplementsaveryeientloadbalaningalgorithmforlustered,wide-areaplatforms. Sofar,weould

only evaluate Satin based on simulations in whih all jobs have been exeuted on one single, homogeneous

luster. Inthiswork,weevaluateSatinonarealgridtestbed[2℄,onsistingofvarious heterogeneoussystems,

onnetedbytheInternet.

InSetion2,webrieypresentSatin'sprogrammingmodelandsomesimulator-basedresultsthatindiate

thesuitabilityofSatin asagridprogrammingenvironment. InSetion3,wepresentIbis,ournewJava-based

gridprogrammingplatform that ombines Java'sruneverywhere paradigmwith highly eient yet exible

ommuniationmehanisms. InSetion 4,weevaluatetheperformaneofSatinontopofIbisin theGridLab

testbed,spanningseveralsitesinEurope. Setion5disussesrelatedwork,andinSetion6wedrawonlusions.

∗

Dept. of Computer Siene, Vrije Universiteit, Amsterdam, The Netherlands, {rob,jason,kielmann,bal} s. vu. nl

(2)

2. Divide-and Conquer in Satin. Satin's programmingmodel is an extension of the single-threaded

Java model. To ahieve parallel exeution, Satin programs do not have to use Java's threads or Remote

Method Invoations(RMI). Instead, they use muh simplerdivide-and-onquer primitives. Satin does allow

theombinationofitsdivide-and-onquerprimitiveswithJavathreadsandRMIs. Additionally,Satinprovides

sharedobjetsviaRepMI.Inthispaper,however,wefousonpuredivide-and-onquerprograms.

interfae FibInter extends satin.Spawnable {

publi long fib(long n);

}

lass Fib extends satin.SatinObjet

i m p l e m e n t s FibInter {

publi long fib(long n) {

if(n < 2) return n;

long x = fib(n

−

1); // spawned long y = fib(n

−

2); // spawned syn();

return x + y;

}

publi stati void main(String[℄ args) {

Fib f = new Fib();

long res = f.fib(10);

f.syn();

System.out.println("Fib 10 = " + res);

}

Fig.2.1. Fib:anexampledivide-and-onquerprograminSatin.

Satin expressesdivide-and-onquer parallelismentirelyin the Javalanguageitself, withoutrequiring any

newlanguageonstruts. Satinusesso-alledmarkerinterfaestoindiatethatertainmethodinvoationsneed

tobeonsideredforpotentiallyparallel(soalledspawned)exeution,ratherthanbeingexeutedsynhronously

likenormalmethods. Furthermore,amehanismisneededtosynhronizewith(waitfortheresultsof)spawned

methodinvoations. With Satin,thisanbeexpressedusingaspeialinterfae,satin.Spawnable,andthelass

satin.SatinObjet. This is shown in Fig. 2.1, using the example of a lass Fib for omputing the Fibonai

numbers. First, an interfae FibInter is implemented whih extends satin.Spawnable. All methods dened in

thisinterfae(hereb)aremarkedtobespawnedratherthanexeutednormally. Seond,thelassFibextends

satin.SatinObjetandimplementsFibInter. Fromsatin.SatinObjetitinheritsthesynmethod,fromFibInterthe

spawnedbmethod. Finally,theinvokingmethod(inthisasemain)simplyallsFibandusessyntowaitfor

theresultoftheparallelomputation.

Satin'sbyteoderewritergeneratestheneessaryode. Coneptually, anewthreadisstartedforrunning

aspawnedmethoduponinvoation. Satin'simplementation,however,eliminatesthreadreationaltogether. A

spawnedmethodinvoationisputinto aloalworkqueue. Fromthequeue, themethodmightbetransferred

to adierentCPU where it may runonurrentlywiththe method thatexeuted thespawnedmethod. The

syn method waits until allspawnedalls in theurrentmethod invoationare nished; the returnvalues of

spawnedmethod invoationsareundeneduntilasynis reahed.

Spawned method invoations are distributed aross the proessors of a parallel Satin program by work

stealingfromtheworkqueuesmentionedabove. In[26℄,wepresentedanewworkstealingalgorithm,

Cluster-awareRandomStealing(CRS),speiallydesignedforluster-based,wide-area(gridomputing)systems. CRS

isbasedonthetraditionalRandomStealing(RS)algorithmthathasbeenproventobeoptimalforhomogeneous

(singleluster) systems[8℄. Webrieydesribebothalgorithmsin turn.

2.1. RandomStealing(RS). RSattemptstostealajobfromarandomlyseletedpeerwhenaproessor

nds its own work queueempty, repeating steal attempts until it sueeds [8, 33℄. This approah minimizes

(3)

load-balaningalgorithm. Onwide-areasystems,however,thisisnotthease. With

C

lusters,onaverage

(

C

−

1)

/C

×

100%

ofallstealrequestswillgotonodesinremotelusters,ausingsigniantwide-areaommuniation

overheads.

2.2. Cluster-aware RandomStealing (CRS). In CRS,eah nodeandiretly steal jobs from nodes

in remote lusters, but at mostone job at a time. Whenevera node beomesidle, it rst attempts to steal

from a node in a remoteluster. This wide-areasteal request is sent asynhronously: Instead of waiting for

theresult,thethiefsimplysetsaagandperformsadditional,synhronousstealrequeststorandomlyseleted

nodes within its own luster, until it nds a new job. As longas the ag is set, only loal stealing will be

performed. Thehandlerroutineforthewide-areareplysimplyresetstheagand,iftherequestwassuessful,

putsthenewjobintotheworkqueue. CRSombinestheadvantagesofRSinsidealusterwithaverylimited

amountofasynhronouswide-areaommuniation. Below,wewillshowthat CRSperformsalmostasgoodas

withasingle,largeluster,eveninextremewide-areanetwork settings.

2.3. Simulator-basedomparisonof RSand CRS. Adetailed desriptionofSatin'swide-areawork

stealing algorithm an be found in [26℄. We have extrated the omparison of RS and CRS from that work

intoTable2.1. Theruntimesshownin thistableareforparallelrunswith 64CPUs eah,either withasingle

lusterof64CPUS, orwith4lusters of16CPUseah.

Thewide-areanetworkbetweenthevirtuallustershasbeensimulatedwithourPandaWANsimulator[17℄.

Wesimulatedallombinationsof20msand200msroundtriplatenywithbandwidthapaitiesof100KByte/s

and1000KByte/s. The testshad beenperformed onthe predeessorhardwareto oururrent DAS-2luster.

DASonsistsof200MHzPentiumPro'swithaMyrinet network,runningtheMantaparallelJavasystem[23℄.

Table2.1

PerformaneofRSandCRSwithdierentsimulatedwide-arealinks(timesinseonds).

single 20ms 20ms 200ms 200ms

luster 1000KByte/s 100KByte/s 1000KByte/s 100KByte/s

appliation time e. time e. time e. time e. time e.

adaptiveintegration

RS 71.8 99.6% 78.0 91.8% 79.5 90.1% 109.3 65.5% 112.3 63.7%

CRS 71.8 99.7% 71.6 99.9% 71.7 99.8% 73.4 97.5% 73.2 97.7%

N-queens

RS 157.6 92.5% 160.9 90.6% 168.2 86.6% 184.3 79.1% 197.4 73.8%

CRS 156.3 93.2% 158.1 92.2% 156.1 93.3% 158.4 92.0% 158.1 92.2%

TSP

RS 101.6 90.4% 105.3 87.2% 105.4 87.1% 130.6 70.3% 129.7 70.8%

CRS 100.7 91.2% 103.6 88.7% 101.1 90.8% 105.0 87.5% 107.5 85.4%

raytraer

RS 147.8 94.2% 152.1 91.5% 171.6 81.1% 175.8 79.2% 182.6 76.2%

CRS 147.2 94.5% 145.0 95.9% 152.6 91.2% 146.5 95.0% 149.3 93.2%

InTable2.1weompareRS andCRSusing fourparallel appliations,with network onditionsdegrading

fromtheleft(single luster)to theright(highlateny, lowbandwidth). Foreahase,wepresenttheparallel

runtimeand theorrespondingeieny (labelede. in the table). With

t

s

beingthesequentialruntime

fortheappliation,with theSatin operationsexluded,(not shown)and

t

p

theparallel runtime asshownin

thetable,and

N

= 64

beingthenumberofCPUs,weomputetheeienyasfollows:

efficiency

=

t

s

t

p

· N

∗

100%

Adaptive integration numerially integratesafuntion overagiveninterval. Itsends veryshort messages

andhasalsoverynegrainedjobs. ThisombinationmakesRSsensitivetohighlateny,inwhihaseeieny

drops to about

65 %

. CRS, however,suessfully hides the high round trip times and ahieveseienies of

morethan

97 %

inallases.

N Queens solvestheproblem ofplaing

n

queensona

n

×

n

hessboard. It sendsmedium-sizemessages andhasaveryirregulartasktree. With eienyofonly

74 %

,RSagainsuersfromhighround triptimesas

itannotquiklyompensateloadimbalaneduetotheirregulartasktree. CRS,however,sustainseienies

(4)

TSP solves the problem of nding the shortest path between

n

ities. By passingthe distane table as parameter,ishasasomewhat higherparallelizationoverhead,resultinginslightlylowereienies, evenwith

asingleluster. Inthewide-areaases,theselongerparametermessagesontributetohigherroundtriptimes

whenstealingjobsfromremotelusters. Consequently,RSsuersmorefromslowernetworks(eieny

>

70 %

) thanCRSwhih sustainseieniesof

85 %

.

RayTraer rendersamodeledsenetoarasterimage. Itdividesasreendowntojobsofsinglepixels. Due

to thenatureof raytraing,individual pixelshaveveryirregularrenderingtimes. The appliationsends long

resultmessagesontainingimage frations, makingitsensitivetothe available bandwidth. Thissensitivity is

reetedin theeienyof RS,goingdownto

76 %

,whereasCRShides mostWAN ommuniationoverhead

andsustainseieniesof

91 %

.

To summarize, our simulator-based experiments show the superiority of CRS to RS in ase of multiple

lusters,onnetedbywide-areanetworks. Thissuperiorityisindependentofthepropertiesoftheappliations,

aswehaveshownwithbothregularand irregulartaskgraphsaswellasshortand longparameterandresult

messagesizes. Inallinvestigated ases,theeienyofCRSneverdroppedbelow

85 %

.

Althoughwewereabletoidentify theindividualeets ofwide-arealateny andbandwidth,theseresults

are limited to homogeneousIntel/Linux lusters (due to the Manta ompiler). Furthermore, we only tested

lusters of idential size. Finally, thewide area network hasbeen simulatedand thus been withoutpossibly

disturbingthird-partytra.

An evaluationon arealgridtestbed, with heterogeneousCPUs, JVMs, andnetworks,beomes neessary

toprovethesuitabilityofSatinasagridprogrammingplatform. Inthefollowing,werstpresentIbis,ournew

runeverywhere Javaenvironmentforgridomputing. ThenweevaluateSatinontopofIbisonthetestbedof

theEUGridLabprojet.

0000000000000000000000000000000000000000000

1111111111111111111111111111111111111111111

RMI

GMI

RepMI

Satin

Application

Grid

Monitoring

Ibis Portability Layer (IPL)

Topology

Discovery

NWS, etc.

TopoMon

GRAM, etc.

etc.

TCP, UDP, MPI

Panda, GM, etc.

Information

Service

GIS, etc.

Resource

Management

Communication

Serialization &

Fig.3.1.DesignofIbis. Thevariousmodulesanbeloaded dynamially,usingrun timelassloading.

3. Ibis, exible and eient Java-based Grid programming. The Satinruntimesystem used for

this paperis implemented on topof Ibis [31℄. In this setion we will briey explainthe Ibis philosophy and

design. The global struture of the Ibis system is shown in Figure 3.1. A entral part of the system is the

IbisPortabilityLayer(IPL)whihonsists ofanumberofwell-denedinterfaes. TheIPLanhavedierent

implementations,thatanbeseletedandloadedintotheappliationatruntime. TheIPLdenesserialization

and ommuniation,but also typial gridservies suh astopologydisoveryand monitoring. Although it is

possibleto usethe IPLdiretly from anappliation, Ibis also providesmorehigh-levelprogrammingmodels.

Currently, we have implemented four. Ibis RMI [31℄ provides Remote Method Invoation, using the same

interfaeasSunRMI,butwithamoreeientwireprotool. GMI[21℄providesMPI-likeolletiveoperations,

leanlyintegratedintoJava'sobjetmodel. RepMI[22℄extendsJavawithrepliatedobjets. Inthispaper,we

fousonthefourthprogrammingmodelthatIbisimplements,Satin.

3.1. Ibis Goals. AkeyprobleminmakingJavasuitableforgridprogrammingishowtodesignasystem

(5)

aspet. (The reently added java.nio pakage will hopefully at leas partially address this problem). The

Ibis strategy to ahieveboth goalssimultaneously is to developreasonably eient solutions using standard

tehniquesthatworkeverywhere,supplementedwithhighlyoptimizedbutnon-standardsolutionsforinreased

performaneinspeialases.Weapplythisstrategytobothomputationandommuniation. Ibisisdesignedto

useanystandardJVM,butifanative,optimizingompiler(e.g., Manta[23℄)isavailableforatargetmahine,

Ibis anuse it instead. Likewise, Ibis an use standard ommuniationprotools, e.g., TCP/IP orUDP, as

providedbytheJVM,butitanalsopluginanoptimizedlow-levelprotoolforahigh-speedinteronnet,like

GMorMPI,ifavailable. ThehallengesforIbisare:

1. how to make the system exible enough to run seamlessly on a variety of dierent ommuniation

hardwareandprotools;

2. howtomakethestandard,100%pureJavaaseeientenoughtobeusefulforgridomputing;

3. study whih additional optimizations an be done to improve performane further in speial

(high-performane)ases.

With Ibis, grid appliations an run simultaneously on a variety of dierent mahines, using optimized

softwarewherepossible(e.g.,anativeompiler,theGMMyrinetprotool,orMPI),andusingstandardsoftware

(e.g., TCP) when neessary. Interoperability is ahieved by using the TCP protool between multiple Ibis

implementations that usedierentprotools(likeGMorMPI) loally. Thisway,allmahines anbeusedin

onesingleomputation. Below,wedisussthethreeaforementionedissuesin moredetail.

3.2. Flexibility. ThekeyharateristiofIbisisitsextremeexibility, whihis requiredtosupportgrid

appliations. Amajordesigngoalistheabilitytoseamlesslyplugindierentommuniationsubstrateswithout

hangingtheuserode. Forthispurpose,theIbisdesignusestheIPL.AsoftwarelayerontopoftheIPLan

negotiatewithIbisinstantiationsthroughthewell-denedIPLinterfae,toseletandloadthemodulesitneeds.

ThisexibilityisimplementedusingJava'sdynamilass-loadingmehanism.

ManymessagepassinglibrariessuhasMPIandGMguaranteereliablemessagedeliveryandFIFOmessage

ordering. Whenappliationsdonotrequirethese properties,adierentmessagepassinglibrarymightbeused

to avoidthe overhead that omes with reliability and messageordering. The IPLsupports bothreliableand

unreliableommuniation,orderedandunorderedmessages,impliitandexpliitreeipt,usingasingle,simple

interfae. Usinguser-denableproperties(key-valuepairs),appliationsanreateexatlytheommuniation

hannelstheyneed,withoutunneessaryoverhead.

3.3. Optimizing the Common Case. To obtain aeptable ommuniation performane, Ibis

imple-ments several optimizations. Most importantly, the overhead of serialization and reetion is avoided by

ompile-time generation of speial methods (in byte ode) for eah objettype. These methods anbeused

to onvert objets to bytes (and vie versa), and to reate new objets on the reeiving side, without using

expensivereetionmehanisms. This way,theoverheadofserializationisredueddramatially.

Furthermore,ourommuniationimplementationsuseanoptimizedwireprotool.TheSunRMIprotool,

for example, resends type information for eah RMI. Our implementation ahes this type information per

onnetion. Using this optimization, our protool sends lessdata overthe wire,but moreimportantly, saves

proessingtimeforenodinganddeodingthetypeinformation.

3.4. Optimizing Speial Cases. Inmanyases,thetarget mahine mayhaveadditionalfailitiesthat

allowfaster omputationorommuniation,whiharediulttoahievewithstandardJavatehniques. One

exampleweinvestigated in previous work [23℄ is using a native, optimizing ompilerinsteadof a JVM.This

ompiler(Manta),oranyotherhighperformaneJavaimplementation,ansimplybeusedbyIbis. Themost

importantspeialaseforommuniationisthepreseneofahigh-speedloalinteronnet. Usually,speialized

user-levelnetworksoftwareisrequiredforsuhinteronnets,insteadofstandardprotools (TCP,UDP) that

use the OS kernel. Ibis therefore was designed to allow other protools to be plugged in. So, lower-level

ommuniationmaybebased,forexample,onaloally-optimizedMPIlibrary. TheIPLisdesignedinsuha

waythatitispossibletoexploiteienthardwaremultiast,whenavailable.

AnotherimportantfeatureoftheIPListhatitallowsazero-opyimplementation. Implementingzero-opy

(orsingle-opy)ommuniationin Javais anon-trivialtask, but itisessentialtomakeJavaompetitivewith

systems like MPI for whih zero-opy implementations already exist. The zero-opy Ibis implementation is

(6)

4. Satin on the GridLab testbed. In this setion, we will present a ase study to analyze the

per-formane that Satin/Ibis ahieves in areal grid environment. We ran the ray traer appliation introdued

in Setion 2.3 on theEuropean GridLab [2℄testbed. Morepreisely,wewereusing aharateristisubset of

the mahines on this testbed that wasavailable for our measurements at the time thestudy wasperformed.

Beause simultaneously startingand running a parallel appliation on multiple lusters still is atedious and

time-onsuming task, wehad to restritourselvesto asingletest appliation. Wehavehosentheray traer

forour testsasit issending themostdata of allourappliations, makingit verysensitive tonetwork issues.

Theraytraeriswritten inpure Javaand generatesahigh resolutionimage (

4096

×

4096

, with24-bitolor).

Ittakesapproximately10minutestosolvethisproblemonourtestbed.

Thisisaninterestingexperimentforseveralreasons. Firstly,weusetheIbisimplementationontopofTCP

forthemeasurementsinthis setion. Thismeansthat thenumbersshownbelowwere measuredusinga100%

Javaimplementation. Therefore, theyare interesting, givinga lear indiationof the performane level that

anbeahievedinJavawitharuneverywhereimplementation,withoutusinganynativeode.

Seondly, the testbed ontains mahines with several dierent arhitetures; Intel, SPARC, MIPS, and

Alphaproessorsareused. Somemahinesare32bit,whileothersare64bit. Also,dierentoperatingsystems

andJVMsareinuse. Therefore,thisexperimentisagoodmethodtoinvestigatewhetherJava'swriteone,run

everywherefeaturereallyworksinpratie. Theassumptionthatthisfeaturesuessfullyhidestheomplexity

ofthedierentunderlyingarhiteturesandoperatingsystems,wasthemostimportantreasonforinvestigating

theJava-entrisolutionspresentedinthispaper. Itisthusimportanttoverifythevalidityofthislaim.

-10

-5

0

5

10

15

20

25

35

40

45

50

55

60

0 200 400

km

Amsterdam

Berlin

Lecce

Cardiff

(7)

Thirdly, themahines are onneted by the Internet. The links show typial wide-area behavior, as the

physial distane betweenthe sites is large. Forinstane, the distane from Amsterdam to Lee is roughly

2000kilometers(about1250miles). Figure 4.1showsamapofEurope,annotatedwiththemahine loations.

Thisgivesan ideaofthedistanes betweenthesites. Weusethis experimenttoverifySatin's load-balaning

algorithms in pratie, with real non-dediated wide-area links. We haverun the ray traer both with the

standardrandomstealingalgorithm(RS)andwiththenewluster-awarealgorithm(CRS)asintroduedabove.

Forpratialreasons,wehadtouserelativelysmalllustersforthemeasurementsinthissetion. Thesimulation

resultsin Setion2.3showthattheperformaneofCRSinreaseswhenlargerlustersareused,beausethere

ismoreopportunitytobalanetheloadinside alusterduring wide-areaommuniation.

Table4.1

MahinesontheGridLabtestbed.

Operating CPUs/ total

loation arhiteture System JIT nodes node CPUs

VrijeUniversiteit Intel RedHat

Amsterdam Pentium-III Linux IBM

TheNetherlands 1GHz kernel2.4.18 1.4.0 8 1 8

VrijeUniversiteit SunFire280R SUN

Amsterdam UltraSPARC-III Sun HotSpot

TheNetherlands 750MHz64bit Solaris8 1.4.2 1 2 2

ISUFI/HighPerf. Compaq Compaq HP1.4.0

ComputingCenter Alpha Tru64UNIX basedon

Lee,Italy 667MHz64bit V5.1A HotSpot 1 4 4

Cardi Intel RedHat SUN

University Pentium-III Linux7.1 HotSpot

Cardi,Wales,UK 1GHz kernel2.4.2 1.4.1 1 2 2

MasarykUniversity, IntelXeon DebianLinux IBM

Brno,CzehRepubli 2.4GHz kernel2.4.20 1.4.0 4 2 8

Konrad-Zuse-Zentrum SGI SGI

für Origin3000 1.4.1-EA

Informationstehnik MIPSR14000 basedon

Berlin,Germany 500MHz IRIX6.5 HotSpot 1 16 16

Some information about the mahines we used is shown in Table 4.1. To run the appliation, we used

whiheverJavaJIT(Just-In-Timeompiler)thatwaspre-installedoneahpartiularsystemwheneverpossible,

beausethisiswhat mostuserswouldprobablydoinpratie.

Table4.2

Round-tripwide-arealateny(inmilliseonds)andahievablebandwidth(inKByte/s)betweentheGridLabsites.

daytime nighttime

to to to to

A'dam A'dam to to to to A'dam A'dam to to to to

soure DAS-2 Sun Lee Cardi Brno Berlin DAS-2 Sun Lee Cardi Brno Berlin

latenyfrom

A'damDAS-2 1 204 16 20 42 1 65 15 20 18

A'damSun 1 204 15 19 43 1 62 14 19 17

Lee 198 195 210 204 178 63 66 60 66 64

Cardi 9 9 198 28 26 9 9 51 27 21

Brno 20 20 188 33 22 20 19 64 33 22

Berlin 18 17 185 31 22 18 17 59 30 22

bandwidthfrom

A'damDAS-2 11338 42 750 3923 2578 11442 40 747 4115 2578

A'damSun 11511 22 696 2745 2611 11548 46 701 3040 2626

Lee 73 425 44 43 75 77 803 94 110 82

Cardi 842 791 29 767 825 861 818 37 817 851

Brno 3186 2709 26 588 2023 3167 2705 37 612 2025

Berlin 2555 2633 9 533 2097 2611 2659 9 562 2111

Beausethe sitesare onnetedviatheInternet, we haveno inueneon theamountoftra that ows

overthelinks. ToreduetheinueneofInternettraonthemeasurements,wealsoperformedmeasurements

aftermidnight(CET). However,in pratietherestillis somevariabilityin thelink speeds. Wemeasuredthe

lateny of the wide-area links by running ping 50 times, while the ahievable bandwidth is measured with

netperf [25℄,using32KBytepakets. ThemeasuredlateniesandbandwidthsareshowninTable4.2. Allsites

haddiultiesfromtimetotimewhilesendingtrato Lee,Italy. Forinstane, fromAmsterdamto Lee,

wemeasuredlateniesfrom44milliseondsupto3.5seonds. Also,weexperienedpaketlosswiththislink: up

(8)

there anbemorethan afatorof twodierenein link speedsduring daytimeand nighttime,espeiallythe

links from and to Lee show a largevariability. It is also interesting to see that thelink performanefrom

Leetothetwosites inAmsterdamisdierent. Weveriedthiswithtraeroute, andfoundthat thetrais

indeed routed dierentlyas the twomahinesuse dierentnetwork numbersdespite being loatedwithin the

samebuilding.

Table4.3

Problemsenounteredinarealgridenvironment,andtheirsolutions.

problem solution

rewalls bindallsoketstoportsintheopenrange

buggyJITs upgradetoJava1.4JITs

multi-homesmahines useasingle,externallyvalidIPaddress

Ibis, Satin and the ray traer appliation were all ompiled with the standard Java ompiler java on

the DAS-2 mahine in Amsterdam, and then just opiedto the other GridLab sites, without reompiling or

reonguringanything. Onmostsites,thisworksawlessly. However,wedidrunintoseveralpratialproblems.

AsummaryisgiveninTable4.3. SomeoftheGridLabsiteshaverewallsinstalled,whih blokSatin'stra

whenno speial measuresare taken. Mostsites in ourtestbed havesomeopenport range,whihmeans that

tra toportswithin this rangeanpassthrough. Thesolutionweuseto avoidbeingblokedby rewallsis

straightforward: allsokets usedforommuniationin Ibisareboundto aport withinthe(site-spei) open

portrange. Weareworkingonamoregeneralsolutionthat multiplexesalltraoverasingleport. Another

solutionis tomultiplexall tra overa(Globus) ssh onnetion,asis donebyKanedaet al. [16℄, orusinga

mehanismlikeSOCKS[20℄.

Another problem we enountered was that the JITs installed on some sites ontained bugs. Espeially

theombination ofthreads andsokets presentedsomediulties. Thereseemsto beabugin Sun's1.3 JIT

(HotSpot) related to threads and soket ommuniation. In some irumstanes, a bloking operation on a

soketwouldblokthe wholeappliation insteadofjust the threadthat doesthe operation. Thesolutionfor

thisproblemwastoupgradetoaJava1.4JIT,wheretheproblem issolved.

Finally, some mahines in the testbed are multi-homed: they have multiple IP addresses. The original

Ibis implementation on TCP got onfused by this, beause the InetAddress.getLoalHost method an return

an IP addressin aprivaterange, or anaddress for aninterfae that is not aessiblefrom the outside. Our

urrentsolutionistomanuallyspeifywhihIPaddresshastobeusedwhenmultiplehoiesareavailable.All

mahines in the testbedhave aGlobus[10℄ installation, so weused GSI-SSH (GlobusSeurityInfrastruture

Seure Shell) [11℄ to login to the GridLab sites. We had to start the appliation by hand, as not all sites

have a job manager installed. When ajob manager is present, Globus an be used to start the appliation

automatially.

AsshowninTable4.1,weused 40proessorsin total,using 6mahines loatedat 5sites alloverEurope,

with4dierentproessorarhitetures. Aftersolvingtheaforementionedpratialproblems,SatinontheTCP

Ibisimplementationranonallsites,inpure Java,withouthavingtoreompileanything.

Table4.4

RelativespeedsofthemahineandJVMombinationsinthetestbed.

run relative relativetotal %oftotal

site arhiteture time(s) nodespeed speedofluster system

A'damDAS-2 1GHzIntelPentium-III 233.1 1.000 8.000 32.4

A'damSun 750MHzUltraSPARC-III 445.2 0.523 1.046 4.2

Lee 667MHZCompaqAlpha 512.7 0.454 1.816 7.4

Cardi 1GHzIntelPentium-III 758.9 0.307 0.614 2.5

Brno 2.4GHzIntelXeon 152.8 1.525 12.200 49.5

Berlin 500MHzMIPSR14000 3701.4 0.062 0.992 4.0

total 24.668 100.0

Asabenhmark,werstrantheparallel versionof theraytraerwithasmallerproblemsize (

512 ×

512

,

with 24 bit olor) on asingle mahine on all lusters. This way, we anompute the relative speeds of the

(9)

theDAS-2lusterinAmsterdam. ItisinterestingtonotethatthequalityoftheJITompileranhavealarge

impatontheperformaneattheappliationlevel. AnodeintheDAS-2lusterandthemahineinCardiare

both1GHzIntelPentium-IIIs, but thereismorethanafatorofthree diereneinappliation performane.

Thisis aused bythedierent JITompilersthat were used. OntheDAS-2, weused themoreeientIBM

1.4JIT,whiletheSUN1.4JIT(HotSpot) wasinstalledonthemahinein Cardi.

Furthermore,theresultsshowthat,althoughthelokfrequenyofthemahineatBrnois2.4timesashigh

asthefrequenyofaDAS-2node,thespeedimprovementisonly53%. BothmahinesuseIntelproessors,but

theXeonmahineinBrnoisbasedonPentium-4proessors,whihdolessworkperylethanthePentium-III

CPUs that are used by theDAS-2. We haveto onlude that it is in generalnot possibleto simply use the

lokfrequeniestoompareproessorspeeds.

Finally, it is obvious that the Originmahine in Berlin is slowompared to theother mahines. This is

partlyaused bytheineientJIT, whih isbasedontheSUNHotSpot JVM.Beauseoftheombinationof

slowproessorsandtheineientJIT,the16nodesof theOriginweusedareaboutasfastasasingle1GHz

Pentium-III with theIBM JIT. The Originthus hardly ontributesanything to the omputation. Thetable

showsthat,althoughweused40CPUsintotalforthegridrun,therelativespeedoftheseproessorstogether

addsupto 24.668DAS-2 nodes(1 GHzPentium-IIIs). Theperentageofthe totalomputepowerthat eah

individuallusterdeliversisshownintherightmost olumnofTable4.4.

Table4.5

PerformaneoftheraytraerappliationontheGridLabtestbed.

run ommuniation parallelization

algorithm time(s) time(s) overhead time(s) overhead eieny

nighttime

RS 877.6 198.5 36.1% 121.9 23.5% 62.6%

CRS 676.5 35.4 6.4% 83.9 16.6% 81.3%

daytime

RS 2083.5 1414.5 257.3% 111.8 21.7% 26.4%

CRS 693.0 40.1 7.3% 95.7 18.8% 79.3%

singleluster25

RS 579.6 11.3 2.0% 11.0 1.9% 96.1%

WealsorantheraytraeronasingleDAS-2mahine, withthelargeproblemsizethatwewilluseforthe

gridruns. This took 13746seonds (almostfourhours). ThesequentialprogramwithouttheSatinonstruts

takes13564seonds,theoverheadoftheparallelversionthusisabout1%. With perfetspeedup,therun time

oftheparallelprogramontheGridLabtestbedwouldbe13564dividedby24.668,whihis549.8seonds(about

nine minutes). We onsider this runtime the upper bound on theperformane that an be ahieved on the

testbed,

t

perfect

. Weanuse this numberto alulate theeieny that isahievedby therealparallel runs. Wealltheatualruntimeoftheappliationonthetestbed

t

grid

. InanalogytoSetion2.3,eieny anbe denedasfollows:

efficiency

=

t

perfect

t

grid

∗

100%

Wehavealsomeasuredthetimethatisspentinommuniation(

t

comm

). Thisinludesidletime,beauseallidle timein thesystemisaused bywaitingforommuniationtonish. Wealulatetherelativeommuniation

overhead withthisformula:

communication overhead

=

t

comm

t

perfect

∗

100%

Finally, thetimethat islostdueto parallelizationoverhead(

t

par

)is alulatedasshownbelow:

t

par

=

t

grid

−

t

comm

−

t

perfect

parallelization overhead

=

t

par

t

perfect

(10)

CommuniationstatistisfortheraytraerappliationontheGridLabtestbed.

intraluster inter luster

alg. messages MByte messages MByte

nighttime

RS 3218 41.8 11473 137.3

CRS 1353295 131.7 12153 86.0

daytime

RS 56686 18.9 149634 154.1

CRS 2148348 130.7 10115 82.1

singleluster25

RS 45458 155.6 n.a. n.a.

The results of the grid runs are shown in Table 4.5. For referene, we also provide measurements on a

single luster, using 25 nodes of the DAS-2 system. The results presented here are the fastest runs out of

threeexperiments. Duringdaytime,theperformaneoftheraytraerwithRSshowedalargevariability,some

runs took longerthan anhour to omplete, whilethe fastest run took about half anhour. Therefore, in this

partiularase,wetookthebestresultofsixruns. ThisapproahthusisinfavorofRS.WithCRS, thiseet

doesnotour: thedierenebetweenthefastestandtheslowestrunduringdaytimewaslessthan20seonds.

Duringnight,when thereis littleInternet tra, theappliation withCRS isalready morethan200seonds

faster(about23%)thanwiththeRSalgorithm. Duringdaytime,whentheInternetlinksareheavilyused,CRS

outperformsRSbyafatorofthree. Regardlessofthetimeoftheday,theeienyofaparallelrunwithCRS

isabout80%.

The numbers in Table 4.5 show that the parallelization overhead on the testbed is signiantly higher

ompared to asingleluster. Soures of this overheadare thread reationand swithingaused by inoming

stealrequests, andthelokingoftheworkqueues. Theoverheadis higheronthetestbed,beauseveofthe

six mahines we use are SMPs (i.e. they have a shared memory arhiteture). In general, this means that

the CPUs in suh a system have to share resoures, making memory aess and espeially synhronization

potentially more expensive. The latter has anegative eet on the performane of the work queues. Also,

multiple CPUs share a singlenetwork interfae, makingaess to the ommuniationdevie moreexpensive.

The urrent implementation of Satin treats SMPs as lusters (i.e., on a N-way SMP, we start N JVMs).

Therefore,Satin pays theprieoftheSMP overhead,but doesnotexploit thebenetsofSMP systems,suh

astheavailablesharedmemory. Animplementationthatdoesutilizesharedmemorywhenavailableisplanned

forthefuture.

Communiationstatistisofthegridrunsareshownin Table4.6. Thenumbersin thetabletotalsforthe

whole run, summed over all CPUs. Again, statistisfor a singleluster run are inluded for referene. The

numbersshowthatalmostalloftheoverheadofRSisinexessivewide-areaommuniation. Duringdaytime,

for instane, it tries to send 154 MByte over the busy Internet links. Duringthe time-onsuming wide-area

transfers,thesendingmahineisidle,beausethealgorithmissynhronous. CRSsendsonlyabout82MBytes

overthewide-arealinks(abouthalftheamountofRS),but moreimportantly,thetransfersareasynhronous.

With CRS,themahinethatinitiates thewide-areatraonurrentlytriesto stealworkin theloalluster,

andalsoonurrentlyexeutestheworkthat isfound.

CRSeetivelytradeslesswide-areatraformoreloalommuniation. As showninTable4.6, therun

during the nightsends about 1.4 million loal-areamessages. Duringdaytime, theCRS algorithm has to do

moreeort to keep theloadbalaned: during thewide-areasteals, about 2.1 millionloal messagesare sent

whiletryingto ndworkwithin theloal lusters. This isabout60%morethanduring thenight. Still,only

40.1seondsarespentommuniating. WithCRS,therunduringdaytimeonlytakes16.5seonds(about2.4%)

longerthantherunatnight. ThetotalommuniationoverheadofCRS isat most7.3%,while withRS, this

anbeasmuhastwothirdsoftheruntime(i.e. thealgorithm spendsmoretimeonommuniatingthanon

alulatingusefulwork).

Beauseallidletimeisausedbyommuniation,thetimethatisspentontheatualomputationanbe

(11)

0%

20%

40%

60%

80%

100%

perfect

RS night

CRS night

RS day

CRS day

%

o

f

w

o

r

k

c

a

lc

u

la

te

d

Berlin

Brno

Cardiff

Lecce

A'dam Sun

A'dam DAS-2

Fig.4.2.Distributionofworkoverthedierentsites.

omputingtheatualproblem. Giventheamountoftimealusterperformsusefulworkandtherelativespeed

oftheluster,weanalulatewhat frationofthetotalworkisalulatedbyeahindividualluster. Wean

omparethisworkloaddistributionwiththeidealdistributionwhihisrepresentedbytherightmostolumnof

Table4.4. Theidealdistributionandtheresultsforthefourgridrunsareshownin Figure4.2. Thedierene

betweentheperfet distributionand theatualdistributions of thefour gridruns ishardly visible. Fromthe

gure,weanonludethat,although theworkloaddistribution ofbothRS andCRSis virtuallyperfet, the

RSalgorithmitselfspendsalargeamountoftimeonahieving thisdistribution. CRSdoesnotsuerfromthis

problem, beausewide-areatra isasynhronousandis overlappedwithusefulworkthat wasfoundloally.

Still,itahievesanalmostoptimaldistribution.

Tosummarize,theexperimentdesribedin thissetionshowsthattheJava-entriapproahtogrid

om-puting,andtheSatin/Ibissystemin partiular,worksextremelywellinpratieinarealgridenvironment. It

took hardlyanyeorttorunIbisandSatinonaheterogeneoussystem. Furthermore,theperformaneresults

learlyshowthat CRSoutperformsRSin arealgridenvironment,espeiallywhenthewide-arealinksarealso

usedforother(Internet)tra. WithCRS,thesystemisidle(waitingforommuniation)duringonlyasmall

frationof thetotalruntime. Weexpet evenbetterperformanewhen largerlusters areused, asindiated

byoursimulatorresultsfromSetion2.3.

5. Relatedwork. WehavedisussedaJava-entriapproahtowritingwide-areaparallel(grid

omput-ing)appliations. Most othergridomputing systems(e.g., Globus[10℄ andLegion[13℄) support avariety of

languages. GridLab [2℄ is building a toolkit of grid serviesthat anbe aessed from various programming

languages. Converse [15℄ is a framework for multi-lingual interoperability. The SuperWeb [1℄, and

Bayani-han [29℄ are examples of global omputing infrastrutures that support Java. A language-entri approah

makesiteasiertodealwithheterogeneoussystems,sinethedatatypesthataretransferredoverthenetworks

arelimited to theones supportedin thelanguage(thus obviating the needfor aseparate interfaedenition

(12)

appliations on the grid [5℄. AppLeS fouses on seleting the best set of resoures for the appliation out

of the resoure pool of the grid. Satin addresses the more low-level problem of load balaning the parallel

omputationitself,givensomesetofgridresoures. AppLeSprovides(amongstothers)atemplatefor

master-workerappliations, whereas Satin provides load balaning for the moregeneral lass of divide-and-onquer

algorithms.

Manydivide-and-onquersystemsarebasedontheClanguage. Amongthem,Cilk[7℄onlysupports

shared-memorymahines,CilkNOW[9℄andDCPAR[12℄runonloal-area,distributed-memorysystems. SilkRoad[27℄

isaversionofCilkfordistributedmemorysystemsthatusesasoftwareDSMtoprovidesharedmemorytothe

programmer,targetingatsmall-sale,loal-areasystems.

The Java lasses presented by Lea [18℄ an be used to write divide-and-onquer programs for

shared-memory systems. Satin is a divide-and-onquer extension of Java that was designed for wide-area systems,

withoutsharedmemory. LikeSatin,Javar[6℄isompiler-based. WithJavar,theprogrammerusesannotations

to indiate divide-and-onquer and other forms of parallelism. The ompiler then generates multithreaded

Javaode,that runsonanyJVM.Therefore,Javarprogramsrunonlyonshared-memorymahinesandDSM

systems.

Herrmann et al. [14℄ desribe a ompiler-based approah to divide-and-onquer programming that uses

skeletons. TheirDHC ompilersupports apurelyfuntional subsetof Haskell,andtranslatessoureprograms

into Cand MPI. Alt et al.[3℄ developeda Java-basedsystem,in whih skeletons areused to express parallel

programs, one of whih for expressing divide-and-onquer parallelism. Although the programming system

targetsgridplatforms, itisnotlearhowsalabletheapproahis: in [3℄,measurementsareprovidedonlyfor

aloal lusterof8mahines.

Most systemsdesribed aboveuse someform of randomstealing (RS). It hasbeen proven[8℄that RS is

optimalinspae,timeandommuniation,atleastforrelativelytightlyoupledsystemslikeSMPsandlusters

that havehomogeneousommuniationperformane. Inpreviouswork[26℄, wehaveshownthatthis property

annotbeextendedtowide-areasystems. WeextendedRStoperformasynhronouswide-areaommuniation

interleaved with synhronous loal ommuniation. The resulting randomized algorithm, alled CRS, does

performwellinloosely-oupledsystems.

AnotherJava-baseddivide-and-onquersystemisAtlas [4℄. Atlasisaset ofJavalassesthat anbeused

to write divide-and-onquer programs. Javelin 3 [24℄ provides a set of Javalasses that allow programmers

to express branh-and-bound omputations, suh asthe travelingsalesperson problem. LikeSatin, Atlas and

Javelin3aredesigned forwide-areasystems. Both Atlasand Javelin 3usetree-basedhierarhialsheduling

algorithms. Wefoundthatsuhalgorithmsareineientforne-grainedappliations andthatCRSperforms

better[26℄.

6. Conlusions. Grid programmingenvironments need to bebothportable and eient to exploit the

omputational powerof dynamially available resoures. Satinmakesit possibleto write divide-and-onquer

appliations in Java,and istargeted at lusteredwide-areasystems. TheSatinimplementationon topofour

newIbisplatformombinesJava'sruneverywhere witheientommuniationbetweenJVMs. Theresulting

systemiseasytouseinagridenvironment. Toahievehighperformane,Satinusesaspeialgrid-aware

load-balaningalgorithm.Previoussimulationresultssuggestedthatthisalgorithmismoreeientthantraditional

algorithmsthatareusedontightly-oupledsystems. Inthispaper,weveriedthesesimulationresultsinareal

gridenvironment.

WeevaluatedSatin/Ibis onthe highlyheterogeneoustestbed oftheEU-fundedGridLab projet,showing

thatSatin'sload-balaningalgorithmautomatiallyadaptsbothtoheterogeneousproessorspeedsandvarying

network performane, resulting in eient utilization of the omputing resoures. Measurements show that

Satin'sCRS algorithmindeedoutperformsthewidelyusedRS algorithmbyawidemargin. WithCRS, Satin

ahievesaround80%eieny, even during daytimewhen the links between thesites are heavilyloaded. In

ontrast, with thetraditional RS algorithm, the eieny drops to about26% when the wide-arealinks are

ongested.

Aknowledgments. PartofthisworkhasbeensupportedbytheEuropeanCommission,grant

IST-2001-32133(GridLab). WewouldalsoliketothankOlivierAumage,RutgerHofman,CerielJaobs,MaikNijhuisand

GosiaWrzesi«skafortheirontributionstotheIbisode. KeesVerstoepis doingamarvelousjobmaintaining

(13)

REFERENCES

[1℄ A.D.Alexandrov,M.Ibel,K.E.Shauser,andC. J.Sheiman,SuperWeb: Researh IssuesinJava-Based Global

Computing,Conurreny: PratieandExperiene,9(1997),pp.535553.

[2℄ G. Allen, K. Davis, K. N. Dolkas, N. D. Doulamis, T. Goodale, T. Kielmann, A. Merzky, J. Nabrzyski,

J.Pukaki,T. Radke,M. Russell, E.Seidel, J.Shalf,andI.Taylor,EnablingAppliationsontheGrid-A

GridLabOverview,nternationalJournalofHighPerformaneComputingAppliations,(2003).aeptedforpubliation.

[3℄ M.Alt,H.Bishof,andS.Gorlath,ProgramDevelopmentforComputationalGridsusingSkeletonsandPerformane

Predition,ParallelProessingLetters,12(2002),pp.157174. WorldSientiPublishingCompany.

[4℄ E.J.Baldeshwieler,R. Blumofe,andE.Brewer,ATLAS:AnInfrastrutureforGlobalComputing,inProeedings

oftheSeventhACMSIGOPSEuropeanWorkshoponSystemSupportforWorldwideAppliations,Connemara,Ireland,

September1996,pp.165172.

[5℄ F.Berman,R.Wolski,S.Figueira,J.Shopf,andG.Shao,Appliation-levelShedulingonDistributedHeterogeneous

Networks,inProeedingsoftheACM/IEEEConfereneon Superomputing(SC'96),Pittsburgh,PA,November 1996.

Onlineathttp://www.superomp.org.

[6℄ A. Bik, J. Villais, and D. Gannon, Javar: APrototype Java Restruturing Compiler, Conurreny: Pratie and

Experiene,9(1997),pp.11811191.

[7℄ R.D.Blumofe,C.F.Joerg,B.C.Kuszmaul,C.E.Leiserson,K.H.Randall,andY.Zhou.,Cilk: AnEient

Multithreaded RuntimeSystem,in5thACMSIGPLANSymposiumonPriniplesandPratieofParallelProgramming

(PPoPP'95),SantaBarbara,CA,July1995,pp.207216.

[8℄ R.D.BlumofeandC.E.Leiserson,ShedulingMultithreadedComputationsbyWorkStealing,in35thAnnualSymposium

onFoundationsofComputerSiene(FOCS'94),SantaFe,NewMexio,November1994,pp.356368.

[9℄ R.D.BlumofeandP.Lisieki,AdaptiveandReliableParallelComputingonNetworksofWorkstations,inUSENIX1997

AnnualTehnialConfereneonUNIXandAdvanedComputingSystems,Anaheim,CA,1997,pp.133147.

[10℄ I.Foster andC.Kesselman,Globus: AMetaomputingInfrastrutureToolkit,InternationalJournalofSuperomputer

Appliations,11(1997),pp.115128.

[11℄ I. Foster, C. Kesselman, G. Tsudik, and S. Tueke,Aseurityarhiteture for omputational grids, in5th ACM

ConfereneonComputerandCommuniationSeurity,SanFraniso,CA,November1998,pp.8392.

[12℄ B.FreislebenandT.Kielmann,AutomatedTransformationofSequentialDivideandConquerAlgorithmsintoParallel

Programs,ComputersandArtiialIntelligene,14(1995),pp.579596.

[13℄ A.GrimshawandW.A.Wulf,TheLegionVisionofaWorldwideVirtualComputer,Comm.ACM,40(1997),pp.3945.

[14℄ C.A.HerrmannandC.Lengauer,HDC:AHigher-OrderLanguageforDivide-and-Conquer,ParallelProessingLetters,

10(2000),pp.239250.

[15℄ L.V.Kalé,M.Bhandarkar,N.Jagathesan,S.Krishnan,andJ.Yelon,Converse: Aninteroperableframeworkfor

parallelprogramming,inIntl.ParallelProessingSymposium,1996.

[16℄ K. Kaneda, K. Taura, and A.Yonezawa,Virtual privategrid: Aommandshell for utilizinghundreds of mahines

eiently, in2nd IEEE/ACMInternational Symposiumon Cluster Computingand the Grid (CCGrid 2002),Berlin,

Germany,May2002,pp.212219.

[17℄ T.Kielmann,H.E.Bal,J.Maassen,R.vanNieuwpoort,L.Eyraud,R.Hofman,andK.Verstoep,Programming

EnvironmentsforHigh-PerformaneGridComputing: theAlbatrossProjet,FutureGenerationComputerSystems,18

(2002),pp.11131125.

[18℄ D.Lea,AJavaFork/JoinFramework,inProeedingsoftheACM2000JavaGrandeConferene,SanFraniso,CA,June

2000,pp.3643.

[19℄ C.Lee, S.Matsuoka,D. Talia,A.Sussmann, M. Müller,G. Allen,andJ.Saltz,AGridprogramming primer.

GlobalGridForum,August2001.

[20℄ M.Leeh,M.Ganis,Y.Lee,R.Kuris,D.Koblas,andL.Jones,RFC1928: SOCKSprotoolversion5,April1996.

[21℄ J.Maassen,T.Kielmann,andH.Bal,GMI:FlexibleandEientGroupMethodInvoationforParallelProgramming,

inInproeedingsofLCR-02:SixthWorkshoponLanguages,Compilers,andRun-timeSystemsforSalableComputers,

WashingtonDC,Marh2002,pp.16.

[22℄ J.Maassen,T.Kielmann,andH.E.Bal,Parallel Appliation Experienewith Repliated MethodInvoation,

Conur-renyandComputation: PratieandExperiene,13(2001),pp.681712.

[23℄ J.Maassen,R. vanNieuwpoort, R. Veldema,H.Bal, T. Kielmann,C.Jaobs,andR. Hofman,Eient Java

RMIforParallel Programming,ACMTransationsonProgrammingLanguagesandSystems,23(2001),pp.747775.

[24℄ M.O.NearyandP.Cappello,AdvanedEagerShedulingforJava-BasedAdaptivelyParallelComputing,inProeedings

of the JointACM2002 JavaGrande- ISCOPE(International Symposiumon ComputinginObjet-Oriented Parallel

Environments)Conferene,Seattle,November2002,pp.5665.

[25℄ Publinetperfhomepage. www.netperf.org.

[26℄ R. V. v. Nieuwpoort, T. Kielmann, and H. E. Bal, Eient Load Balaning for Wide-area Divide-and-Conquer

Appliations, inProeedings EighthACMSIGPLAN Symposiumon PriniplesandPratie ofParallelProgramming

(PPoPP'01),Snowbird,UT,June2001,pp.3443.

[27℄ L. Peng, W. Wong, M. Feng, andC. Yuen, SilkRoad: AMultithreaded RuntimeSystem with Software Distributed

Shared MemoryforSMP Clusters,inIEEEInternational Confereneon ClusterComputing(Cluster2000), Chemnitz,

Saxony,Germany,November2000,pp.243249.

[28℄ M.Philippsen,B. Haumaher, andC. Nester,More eientserializationand RMIforJava, Conurreny: Pratie

andExperiene,12(2000),pp.495518.

[29℄ L.F. G.Sarmenta,VolunteerComputing,PhDthesis,Dept.ofEletrialEngineeringandComputerSiene,MIT,2001.

(14)

[31℄ R. V.van Nieuwpoort, J.Maassen,R. Hofman, T. Kielmann,andH.E. Bal,Ibis: anEient Java-based Grid

ProgrammingEnvironment,inJointACMJavaGrande-ISCOPE2002Conferene,Seattle,Washington,USA,November

2002,pp.1827.

[32℄ A.Wollrath,J.Waldo,andR. Riggs,Java-CentriDistributedComputing,IEEEMiro,17(1997),pp.4453.

[33℄ I.-C. Wu and H. Kung, Communiation Complexity for Parallel Divide-and-Conquer, in 32ndAnnual Symposiumon

FoundationsofComputerSiene(FOCS'91),SanJuan,PuertoRio,Ot.1991,pp.151162.

Editedby: WilsonRivera,JaimeSeguel.

Reeived: July15,2003.