SATIN: SIMPLEANDEFFICIENT JAVA-BASED GRID PROGRAMMING
ROB V. VAN NIEUWPOORT, JASON MAASSEN,THILO KIELMANN, HENRI E. BAL
∗
Abstrat. Grid programming environments need tobeboth portable and eient to exploitthe omputationalpower of
dynamially available resoures. In previous work, we have presented the divide-and-onquer based Satin model for parallel
omputingonlusteredwide-areasystems.Inthispaper,wepresenttheSatinimplementationontopofournewIbisplatformwhih
ombinesJava'swriteone,runeverywhere witheientommuniationbetweenJVMs. WeevaluateSatin/Ibis onthetestbed
of the EU-funded GridLabprojet, showingthat Satin's load-balaning algorithmautomatiallyadapts both to heterogeneous
proessor speedsandvaryingnetworkperformane,resultingineientutilizationofthe omputingresoures. Ourresultsshow
thatwhenthewide-arealinkssuerfromongestion,Satin'sload-balaningalgorithmanstillahievearound80%eieny,while
analgorithmthatisnotgridawaredropsto26%orless.
Keywords. Satin,Ibis,divide-and-onquer,loadbalaning,distributedsuperomputing.
1. Introdution. In omputational grids, appliations need to simultaneously tap the omputational
powerofmultiple,dynamiallyavailablesites. Theruxofdesigninggridprogrammingenvironmentsstems
ex-atlyfromthedynamiavailabilityofomputeyles: gridprogrammingenvironmentsneedtobebothportable
torunonasmanysites aspossible,and theyneedtobeexible to opewithdierentnetwork protoolsand
dynamiallyhanginggroupsof heterogeneousomputenodes.
Existingprogrammingenvironmentsare either portable and exible(Jini, Java RMI),orthey arehighly
eient(MPI).TheGlobalGridForumalsohasinvestigatedpossiblegridprogrammingmodels[19℄. Reently,
GridRPC has been proposed as a grid programming model [30℄. GridRPC allows writing grid appliations
basedonthemanager/workerparadigm.
Unlikemanager/workerprograms,divide-and-onqueralgorithmsoperatebyreursivelydividingaproblem
intosmallersubproblems. Thisreursivesubdivisiongoesonuntiltheremainingsubproblembeomestrivialto
solve. Aftersolvingsubproblems,theirresultsarereursivelyreombineduntilthenalsolutionisassembled.
By allowingsubproblems to be divided reursively, the lass of divide-and-onquer algorithms subsumes the
manager/workeralgorithms,thus enlargingthesetofpossiblegridappliations.
Of ourse,there aremany kindsofappliations that donotlend themselveswellto adivide-and-onquer
algorithm.However,we(andothers)believethelassofdivide-and-onqueralgorithmstobesuientlylargeto
justifyitsdeploymentforhierarhialwide-areasystems. Computationsthatusethedivide-and-onquermodel
inludegeometryproedures,sortingmethods, searhalgorithms,datalassiationodes,n-bodysimulations
anddata-parallelnumerialprograms[33℄.
Divide-and-onquerappliationsmaybeparallelizedbylettingdierentproessorssolvedierent
subprob-lems. Thesesubproblemsareoftenalledjobsinthisontext. Generatedjobsaretransferredbetweenproessors
to balane the load in the omputation. The divide-and-onquer model lends itself well to
hierarhially-strutured systems beause tasks are reated by reursive subdivision. This leads to a task graph that is
hierarhiallystrutured,and whihan beexeutedwithexellentommuniationloality,espeiallyon
hier-arhialplatforms.
Inpreviouswork[26℄,wepresentedourSatinsystemfordivide-and-onquerprogrammingongridplatforms.
Satinimplementsaveryeientloadbalaningalgorithmforlustered,wide-areaplatforms. Sofar,weould
only evaluate Satin based on simulations in whih all jobs have been exeuted on one single, homogeneous
luster. Inthiswork,weevaluateSatinonarealgridtestbed[2℄,onsistingofvarious heterogeneoussystems,
onnetedbytheInternet.
InSetion2,webrieypresentSatin'sprogrammingmodelandsomesimulator-basedresultsthatindiate
thesuitabilityofSatin asagridprogrammingenvironment. InSetion3,wepresentIbis,ournewJava-based
gridprogrammingplatform that ombines Java'sruneverywhere paradigmwith highly eient yet exible
ommuniationmehanisms. InSetion 4,weevaluatetheperformaneofSatinontopofIbisin theGridLab
testbed,spanningseveralsitesinEurope. Setion5disussesrelatedwork,andinSetion6wedrawonlusions.
∗
Dept. of Computer Siene, Vrije Universiteit, Amsterdam, The Netherlands, {rob,jason,kielmann,bal} s. vu. nl
2. Divide-and Conquer in Satin. Satin's programmingmodel is an extension of the single-threaded
Java model. To ahieve parallel exeution, Satin programs do not have to use Java's threads or Remote
Method Invoations(RMI). Instead, they use muh simplerdivide-and-onquer primitives. Satin does allow
theombinationofitsdivide-and-onquerprimitiveswithJavathreadsandRMIs. Additionally,Satinprovides
sharedobjetsviaRepMI.Inthispaper,however,wefousonpuredivide-and-onquerprograms.
interfae FibInter extends satin.Spawnable {
publi long fib(long n);
}
lass Fib extends satin.SatinObjet
i m p l e m e n t s FibInter {
publi long fib(long n) {
if(n < 2) return n;
long x = fib(n
−
1); // spawned long y = fib(n−
2); // spawned syn();return x + y;
}
publi stati void main(String[℄ args) {
Fib f = new Fib();
long res = f.fib(10);
f.syn();
System.out.println("Fib 10 = " + res);
}
}
Fig.2.1. Fib:anexampledivide-and-onquerprograminSatin.
Satin expressesdivide-and-onquer parallelismentirelyin the Javalanguageitself, withoutrequiring any
newlanguageonstruts. Satinusesso-alledmarkerinterfaestoindiatethatertainmethodinvoationsneed
tobeonsideredforpotentiallyparallel(soalledspawned)exeution,ratherthanbeingexeutedsynhronously
likenormalmethods. Furthermore,amehanismisneededtosynhronizewith(waitfortheresultsof)spawned
methodinvoations. With Satin,thisanbeexpressedusingaspeialinterfae,satin.Spawnable,andthelass
satin.SatinObjet. This is shown in Fig. 2.1, using the example of a lass Fib for omputing the Fibonai
numbers. First, an interfae FibInter is implemented whih extends satin.Spawnable. All methods dened in
thisinterfae(hereb)aremarkedtobespawnedratherthanexeutednormally. Seond,thelassFibextends
satin.SatinObjetandimplementsFibInter. Fromsatin.SatinObjetitinheritsthesynmethod,fromFibInterthe
spawnedbmethod. Finally,theinvokingmethod(inthisasemain)simplyallsFibandusessyntowaitfor
theresultoftheparallelomputation.
Satin'sbyteoderewritergeneratestheneessaryode. Coneptually, anewthreadisstartedforrunning
aspawnedmethoduponinvoation. Satin'simplementation,however,eliminatesthreadreationaltogether. A
spawnedmethodinvoationisputinto aloalworkqueue. Fromthequeue, themethodmightbetransferred
to adierentCPU where it may runonurrentlywiththe method thatexeuted thespawnedmethod. The
syn method waits until allspawnedalls in theurrentmethod invoationare nished; the returnvalues of
spawnedmethod invoationsareundeneduntilasynis reahed.
Spawned method invoations are distributed aross the proessors of a parallel Satin program by work
stealingfromtheworkqueuesmentionedabove. In[26℄,wepresentedanewworkstealingalgorithm,
Cluster-awareRandomStealing(CRS),speiallydesignedforluster-based,wide-area(gridomputing)systems. CRS
isbasedonthetraditionalRandomStealing(RS)algorithmthathasbeenproventobeoptimalforhomogeneous
(singleluster) systems[8℄. Webrieydesribebothalgorithmsin turn.
2.1. RandomStealing(RS). RSattemptstostealajobfromarandomlyseletedpeerwhenaproessor
nds its own work queueempty, repeating steal attempts until it sueeds [8, 33℄. This approah minimizes
load-balaningalgorithm. Onwide-areasystems,however,thisisnotthease. With
C
lusters,onaverage(
C
−
1)
/C
×
100%
ofallstealrequestswillgotonodesinremotelusters,ausingsigniantwide-areaommuniationoverheads.
2.2. Cluster-aware RandomStealing (CRS). In CRS,eah nodeandiretly steal jobs from nodes
in remote lusters, but at mostone job at a time. Whenevera node beomesidle, it rst attempts to steal
from a node in a remoteluster. This wide-areasteal request is sent asynhronously: Instead of waiting for
theresult,thethiefsimplysetsaagandperformsadditional,synhronousstealrequeststorandomlyseleted
nodes within its own luster, until it nds a new job. As longas the ag is set, only loal stealing will be
performed. Thehandlerroutineforthewide-areareplysimplyresetstheagand,iftherequestwassuessful,
putsthenewjobintotheworkqueue. CRSombinestheadvantagesofRSinsidealusterwithaverylimited
amountofasynhronouswide-areaommuniation. Below,wewillshowthat CRSperformsalmostasgoodas
withasingle,largeluster,eveninextremewide-areanetwork settings.
2.3. Simulator-basedomparisonof RSand CRS. Adetailed desriptionofSatin'swide-areawork
stealing algorithm an be found in [26℄. We have extrated the omparison of RS and CRS from that work
intoTable2.1. Theruntimesshownin thistableareforparallelrunswith 64CPUs eah,either withasingle
lusterof64CPUS, orwith4lusters of16CPUseah.
Thewide-areanetworkbetweenthevirtuallustershasbeensimulatedwithourPandaWANsimulator[17℄.
Wesimulatedallombinationsof20msand200msroundtriplatenywithbandwidthapaitiesof100KByte/s
and1000KByte/s. The testshad beenperformed onthe predeessorhardwareto oururrent DAS-2luster.
DASonsistsof200MHzPentiumPro'swithaMyrinet network,runningtheMantaparallelJavasystem[23℄.
Table2.1
PerformaneofRSandCRSwithdierentsimulatedwide-arealinks(timesinseonds).
single 20ms 20ms 200ms 200ms
luster 1000KByte/s 100KByte/s 1000KByte/s 100KByte/s
appliation time e. time e. time e. time e. time e.
adaptiveintegration
RS 71.8 99.6% 78.0 91.8% 79.5 90.1% 109.3 65.5% 112.3 63.7%
CRS 71.8 99.7% 71.6 99.9% 71.7 99.8% 73.4 97.5% 73.2 97.7%
N-queens
RS 157.6 92.5% 160.9 90.6% 168.2 86.6% 184.3 79.1% 197.4 73.8%
CRS 156.3 93.2% 158.1 92.2% 156.1 93.3% 158.4 92.0% 158.1 92.2%
TSP
RS 101.6 90.4% 105.3 87.2% 105.4 87.1% 130.6 70.3% 129.7 70.8%
CRS 100.7 91.2% 103.6 88.7% 101.1 90.8% 105.0 87.5% 107.5 85.4%
raytraer
RS 147.8 94.2% 152.1 91.5% 171.6 81.1% 175.8 79.2% 182.6 76.2%
CRS 147.2 94.5% 145.0 95.9% 152.6 91.2% 146.5 95.0% 149.3 93.2%
InTable2.1weompareRS andCRSusing fourparallel appliations,with network onditionsdegrading
fromtheleft(single luster)to theright(highlateny, lowbandwidth). Foreahase,wepresenttheparallel
runtimeand theorrespondingeieny (labelede. in the table). With
t
s
beingthesequentialruntimefortheappliation,with theSatin operationsexluded,(not shown)and
t
p
theparallel runtime asshowninthetable,and
N
= 64
beingthenumberofCPUs,weomputetheeienyasfollows:efficiency
=
t
s
t
p
·
N
∗
100%
Adaptive integration numerially integratesafuntion overagiveninterval. Itsends veryshort messages
andhasalsoverynegrainedjobs. ThisombinationmakesRSsensitivetohighlateny,inwhihaseeieny
drops to about
65 %
. CRS, however,suessfully hides the high round trip times and ahieveseienies ofmorethan
97 %
inallases.N Queens solvestheproblem ofplaing
n
queensonan
×
n
hessboard. It sendsmedium-sizemessages andhasaveryirregulartasktree. With eienyofonly74 %
,RSagainsuersfromhighround triptimesasitannotquiklyompensateloadimbalaneduetotheirregulartasktree. CRS,however,sustainseienies
TSP solves the problem of nding the shortest path between
n
ities. By passingthe distane table as parameter,ishasasomewhat higherparallelizationoverhead,resultinginslightlylowereienies, evenwithasingleluster. Inthewide-areaases,theselongerparametermessagesontributetohigherroundtriptimes
whenstealingjobsfromremotelusters. Consequently,RSsuersmorefromslowernetworks(eieny
>
70 %
) thanCRSwhih sustainseieniesof85 %
.RayTraer rendersamodeledsenetoarasterimage. Itdividesasreendowntojobsofsinglepixels. Due
to thenatureof raytraing,individual pixelshaveveryirregularrenderingtimes. The appliationsends long
resultmessagesontainingimage frations, makingitsensitivetothe available bandwidth. Thissensitivity is
reetedin theeienyof RS,goingdownto
76 %
,whereasCRShides mostWAN ommuniationoverheadandsustainseieniesof
91 %
.To summarize, our simulator-based experiments show the superiority of CRS to RS in ase of multiple
lusters,onnetedbywide-areanetworks. Thissuperiorityisindependentofthepropertiesoftheappliations,
aswehaveshownwithbothregularand irregulartaskgraphsaswellasshortand longparameterandresult
messagesizes. Inallinvestigated ases,theeienyofCRSneverdroppedbelow
85 %
.Althoughwewereabletoidentify theindividualeets ofwide-arealateny andbandwidth,theseresults
are limited to homogeneousIntel/Linux lusters (due to the Manta ompiler). Furthermore, we only tested
lusters of idential size. Finally, thewide area network hasbeen simulatedand thus been withoutpossibly
disturbingthird-partytra.
An evaluationon arealgridtestbed, with heterogeneousCPUs, JVMs, andnetworks,beomes neessary
toprovethesuitabilityofSatinasagridprogrammingplatform. Inthefollowing,werstpresentIbis,ournew
runeverywhere Javaenvironmentforgridomputing. ThenweevaluateSatinontopofIbisonthetestbedof
theEUGridLabprojet.
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111
RMI
GMI
RepMI
Satin
Application
Grid
Monitoring
Ibis Portability Layer (IPL)
Topology
Discovery
NWS, etc.
TopoMon
GRAM, etc.
etc.
TCP, UDP, MPI
Panda, GM, etc.
Information
Service
GIS, etc.
Resource
Management
Communication
Serialization &
Fig.3.1.DesignofIbis. Thevariousmodulesanbeloaded dynamially,usingrun timelassloading.
3. Ibis, exible and eient Java-based Grid programming. The Satinruntimesystem used for
this paperis implemented on topof Ibis [31℄. In this setion we will briey explainthe Ibis philosophy and
design. The global struture of the Ibis system is shown in Figure 3.1. A entral part of the system is the
IbisPortabilityLayer(IPL)whihonsists ofanumberofwell-denedinterfaes. TheIPLanhavedierent
implementations,thatanbeseletedandloadedintotheappliationatruntime. TheIPLdenesserialization
and ommuniation,but also typial gridservies suh astopologydisoveryand monitoring. Although it is
possibleto usethe IPLdiretly from anappliation, Ibis also providesmorehigh-levelprogrammingmodels.
Currently, we have implemented four. Ibis RMI [31℄ provides Remote Method Invoation, using the same
interfaeasSunRMI,butwithamoreeientwireprotool. GMI[21℄providesMPI-likeolletiveoperations,
leanlyintegratedintoJava'sobjetmodel. RepMI[22℄extendsJavawithrepliatedobjets. Inthispaper,we
fousonthefourthprogrammingmodelthatIbisimplements,Satin.
3.1. Ibis Goals. AkeyprobleminmakingJavasuitableforgridprogrammingishowtodesignasystem
aspet. (The reently added java.nio pakage will hopefully at leas partially address this problem). The
Ibis strategy to ahieveboth goalssimultaneously is to developreasonably eient solutions using standard
tehniquesthatworkeverywhere,supplementedwithhighlyoptimizedbutnon-standardsolutionsforinreased
performaneinspeialases.Weapplythisstrategytobothomputationandommuniation. Ibisisdesignedto
useanystandardJVM,butifanative,optimizingompiler(e.g., Manta[23℄)isavailableforatargetmahine,
Ibis anuse it instead. Likewise, Ibis an use standard ommuniationprotools, e.g., TCP/IP orUDP, as
providedbytheJVM,butitanalsopluginanoptimizedlow-levelprotoolforahigh-speedinteronnet,like
GMorMPI,ifavailable. ThehallengesforIbisare:
1. how to make the system exible enough to run seamlessly on a variety of dierent ommuniation
hardwareandprotools;
2. howtomakethestandard,100%pureJavaaseeientenoughtobeusefulforgridomputing;
3. study whih additional optimizations an be done to improve performane further in speial
(high-performane)ases.
With Ibis, grid appliations an run simultaneously on a variety of dierent mahines, using optimized
softwarewherepossible(e.g.,anativeompiler,theGMMyrinetprotool,orMPI),andusingstandardsoftware
(e.g., TCP) when neessary. Interoperability is ahieved by using the TCP protool between multiple Ibis
implementations that usedierentprotools(likeGMorMPI) loally. Thisway,allmahines anbeusedin
onesingleomputation. Below,wedisussthethreeaforementionedissuesin moredetail.
3.2. Flexibility. ThekeyharateristiofIbisisitsextremeexibility, whihis requiredtosupportgrid
appliations. Amajordesigngoalistheabilitytoseamlesslyplugindierentommuniationsubstrateswithout
hangingtheuserode. Forthispurpose,theIbisdesignusestheIPL.AsoftwarelayerontopoftheIPLan
negotiatewithIbisinstantiationsthroughthewell-denedIPLinterfae,toseletandloadthemodulesitneeds.
ThisexibilityisimplementedusingJava'sdynamilass-loadingmehanism.
ManymessagepassinglibrariessuhasMPIandGMguaranteereliablemessagedeliveryandFIFOmessage
ordering. Whenappliationsdonotrequirethese properties,adierentmessagepassinglibrarymightbeused
to avoidthe overhead that omes with reliability and messageordering. The IPLsupports bothreliableand
unreliableommuniation,orderedandunorderedmessages,impliitandexpliitreeipt,usingasingle,simple
interfae. Usinguser-denableproperties(key-valuepairs),appliationsanreateexatlytheommuniation
hannelstheyneed,withoutunneessaryoverhead.
3.3. Optimizing the Common Case. To obtain aeptable ommuniation performane, Ibis
imple-ments several optimizations. Most importantly, the overhead of serialization and reetion is avoided by
ompile-time generation of speial methods (in byte ode) for eah objettype. These methods anbeused
to onvert objets to bytes (and vie versa), and to reate new objets on the reeiving side, without using
expensivereetionmehanisms. This way,theoverheadofserializationisredueddramatially.
Furthermore,ourommuniationimplementationsuseanoptimizedwireprotool.TheSunRMIprotool,
for example, resends type information for eah RMI. Our implementation ahes this type information per
onnetion. Using this optimization, our protool sends lessdata overthe wire,but moreimportantly, saves
proessingtimeforenodinganddeodingthetypeinformation.
3.4. Optimizing Speial Cases. Inmanyases,thetarget mahine mayhaveadditionalfailitiesthat
allowfaster omputationorommuniation,whiharediulttoahievewithstandardJavatehniques. One
exampleweinvestigated in previous work [23℄ is using a native, optimizing ompilerinsteadof a JVM.This
ompiler(Manta),oranyotherhighperformaneJavaimplementation,ansimplybeusedbyIbis. Themost
importantspeialaseforommuniationisthepreseneofahigh-speedloalinteronnet. Usually,speialized
user-levelnetworksoftwareisrequiredforsuhinteronnets,insteadofstandardprotools (TCP,UDP) that
use the OS kernel. Ibis therefore was designed to allow other protools to be plugged in. So, lower-level
ommuniationmaybebased,forexample,onaloally-optimizedMPIlibrary. TheIPLisdesignedinsuha
waythatitispossibletoexploiteienthardwaremultiast,whenavailable.
AnotherimportantfeatureoftheIPListhatitallowsazero-opyimplementation. Implementingzero-opy
(orsingle-opy)ommuniationin Javais anon-trivialtask, but itisessentialtomakeJavaompetitivewith
systems like MPI for whih zero-opy implementations already exist. The zero-opy Ibis implementation is
4. Satin on the GridLab testbed. In this setion, we will present a ase study to analyze the
per-formane that Satin/Ibis ahieves in areal grid environment. We ran the ray traer appliation introdued
in Setion 2.3 on theEuropean GridLab [2℄testbed. Morepreisely,wewereusing aharateristisubset of
the mahines on this testbed that wasavailable for our measurements at the time thestudy wasperformed.
Beause simultaneously startingand running a parallel appliation on multiple lusters still is atedious and
time-onsuming task, wehad to restritourselvesto asingletest appliation. Wehavehosentheray traer
forour testsasit issending themostdata of allourappliations, makingit verysensitive tonetwork issues.
Theraytraeriswritten inpure Javaand generatesahigh resolutionimage (
4096
×
4096
, with24-bitolor).Ittakesapproximately10minutestosolvethisproblemonourtestbed.
Thisisaninterestingexperimentforseveralreasons. Firstly,weusetheIbisimplementationontopofTCP
forthemeasurementsinthis setion. Thismeansthat thenumbersshownbelowwere measuredusinga100%
Javaimplementation. Therefore, theyare interesting, givinga lear indiationof the performane level that
anbeahievedinJavawitharuneverywhereimplementation,withoutusinganynativeode.
Seondly, the testbed ontains mahines with several dierent arhitetures; Intel, SPARC, MIPS, and
Alphaproessorsareused. Somemahinesare32bit,whileothersare64bit. Also,dierentoperatingsystems
andJVMsareinuse. Therefore,thisexperimentisagoodmethodtoinvestigatewhetherJava'swriteone,run
everywherefeaturereallyworksinpratie. Theassumptionthatthisfeaturesuessfullyhidestheomplexity
ofthedierentunderlyingarhiteturesandoperatingsystems,wasthemostimportantreasonforinvestigating
theJava-entrisolutionspresentedinthispaper. Itisthusimportanttoverifythevalidityofthislaim.
-10
-10
-5
-5
0
0
5
5
10
10
15
15
20
20
25
25
35
35
40
40
45
45
50
50
55
55
60
60
0
200 400
km
Amsterdam
Berlin
Lecce
Cardiff
Thirdly, themahines are onneted by the Internet. The links show typial wide-area behavior, as the
physial distane betweenthe sites is large. Forinstane, the distane from Amsterdam to Lee is roughly
2000kilometers(about1250miles). Figure 4.1showsamapofEurope,annotatedwiththemahine loations.
Thisgivesan ideaofthedistanes betweenthesites. Weusethis experimenttoverifySatin's load-balaning
algorithms in pratie, with real non-dediated wide-area links. We haverun the ray traer both with the
standardrandomstealingalgorithm(RS)andwiththenewluster-awarealgorithm(CRS)asintroduedabove.
Forpratialreasons,wehadtouserelativelysmalllustersforthemeasurementsinthissetion. Thesimulation
resultsin Setion2.3showthattheperformaneofCRSinreaseswhenlargerlustersareused,beausethere
ismoreopportunitytobalanetheloadinside alusterduring wide-areaommuniation.
Table4.1
MahinesontheGridLabtestbed.
Operating CPUs/ total
loation arhiteture System JIT nodes node CPUs
VrijeUniversiteit Intel RedHat
Amsterdam Pentium-III Linux IBM
TheNetherlands 1GHz kernel2.4.18 1.4.0 8 1 8
VrijeUniversiteit SunFire280R SUN
Amsterdam UltraSPARC-III Sun HotSpot
TheNetherlands 750MHz64bit Solaris8 1.4.2 1 2 2
ISUFI/HighPerf. Compaq Compaq HP1.4.0
ComputingCenter Alpha Tru64UNIX basedon
Lee,Italy 667MHz64bit V5.1A HotSpot 1 4 4
Cardi Intel RedHat SUN
University Pentium-III Linux7.1 HotSpot
Cardi,Wales,UK 1GHz kernel2.4.2 1.4.1 1 2 2
MasarykUniversity, IntelXeon DebianLinux IBM
Brno,CzehRepubli 2.4GHz kernel2.4.20 1.4.0 4 2 8
Konrad-Zuse-Zentrum SGI SGI
für Origin3000 1.4.1-EA
Informationstehnik MIPSR14000 basedon
Berlin,Germany 500MHz IRIX6.5 HotSpot 1 16 16
Some information about the mahines we used is shown in Table 4.1. To run the appliation, we used
whiheverJavaJIT(Just-In-Timeompiler)thatwaspre-installedoneahpartiularsystemwheneverpossible,
beausethisiswhat mostuserswouldprobablydoinpratie.
Table4.2
Round-tripwide-arealateny(inmilliseonds)andahievablebandwidth(inKByte/s)betweentheGridLabsites.
daytime nighttime
to to to to
A'dam A'dam to to to to A'dam A'dam to to to to
soure DAS-2 Sun Lee Cardi Brno Berlin DAS-2 Sun Lee Cardi Brno Berlin
latenyfrom
A'damDAS-2 1 204 16 20 42 1 65 15 20 18
A'damSun 1 204 15 19 43 1 62 14 19 17
Lee 198 195 210 204 178 63 66 60 66 64
Cardi 9 9 198 28 26 9 9 51 27 21
Brno 20 20 188 33 22 20 19 64 33 22
Berlin 18 17 185 31 22 18 17 59 30 22
bandwidthfrom
A'damDAS-2 11338 42 750 3923 2578 11442 40 747 4115 2578
A'damSun 11511 22 696 2745 2611 11548 46 701 3040 2626
Lee 73 425 44 43 75 77 803 94 110 82
Cardi 842 791 29 767 825 861 818 37 817 851
Brno 3186 2709 26 588 2023 3167 2705 37 612 2025
Berlin 2555 2633 9 533 2097 2611 2659 9 562 2111
Beausethe sitesare onnetedviatheInternet, we haveno inueneon theamountoftra that ows
overthelinks. ToreduetheinueneofInternettraonthemeasurements,wealsoperformedmeasurements
aftermidnight(CET). However,in pratietherestillis somevariabilityin thelink speeds. Wemeasuredthe
lateny of the wide-area links by running ping 50 times, while the ahievable bandwidth is measured with
netperf [25℄,using32KBytepakets. ThemeasuredlateniesandbandwidthsareshowninTable4.2. Allsites
haddiultiesfromtimetotimewhilesendingtrato Lee,Italy. Forinstane, fromAmsterdamto Lee,
wemeasuredlateniesfrom44milliseondsupto3.5seonds. Also,weexperienedpaketlosswiththislink: up
there anbemorethan afatorof twodierenein link speedsduring daytimeand nighttime,espeiallythe
links from and to Lee show a largevariability. It is also interesting to see that thelink performanefrom
Leetothetwosites inAmsterdamisdierent. Weveriedthiswithtraeroute, andfoundthat thetrais
indeed routed dierentlyas the twomahinesuse dierentnetwork numbersdespite being loatedwithin the
samebuilding.
Table4.3
Problemsenounteredinarealgridenvironment,andtheirsolutions.
problem solution
rewalls bindallsoketstoportsintheopenrange
buggyJITs upgradetoJava1.4JITs
multi-homesmahines useasingle,externallyvalidIPaddress
Ibis, Satin and the ray traer appliation were all ompiled with the standard Java ompiler java on
the DAS-2 mahine in Amsterdam, and then just opiedto the other GridLab sites, without reompiling or
reonguringanything. Onmostsites,thisworksawlessly. However,wedidrunintoseveralpratialproblems.
AsummaryisgiveninTable4.3. SomeoftheGridLabsiteshaverewallsinstalled,whih blokSatin'stra
whenno speial measuresare taken. Mostsites in ourtestbed havesomeopenport range,whihmeans that
tra toportswithin this rangeanpassthrough. Thesolutionweuseto avoidbeingblokedby rewallsis
straightforward: allsokets usedforommuniationin Ibisareboundto aport withinthe(site-spei) open
portrange. Weareworkingonamoregeneralsolutionthat multiplexesalltraoverasingleport. Another
solutionis tomultiplexall tra overa(Globus) ssh onnetion,asis donebyKanedaet al. [16℄, orusinga
mehanismlikeSOCKS[20℄.
Another problem we enountered was that the JITs installed on some sites ontained bugs. Espeially
theombination ofthreads andsokets presentedsomediulties. Thereseemsto beabugin Sun's1.3 JIT
(HotSpot) related to threads and soket ommuniation. In some irumstanes, a bloking operation on a
soketwouldblokthe wholeappliation insteadofjust the threadthat doesthe operation. Thesolutionfor
thisproblemwastoupgradetoaJava1.4JIT,wheretheproblem issolved.
Finally, some mahines in the testbed are multi-homed: they have multiple IP addresses. The original
Ibis implementation on TCP got onfused by this, beause the InetAddress.getLoalHost method an return
an IP addressin aprivaterange, or anaddress for aninterfae that is not aessiblefrom the outside. Our
urrentsolutionistomanuallyspeifywhihIPaddresshastobeusedwhenmultiplehoiesareavailable.All
mahines in the testbedhave aGlobus[10℄ installation, so weused GSI-SSH (GlobusSeurityInfrastruture
Seure Shell) [11℄ to login to the GridLab sites. We had to start the appliation by hand, as not all sites
have a job manager installed. When ajob manager is present, Globus an be used to start the appliation
automatially.
AsshowninTable4.1,weused 40proessorsin total,using 6mahines loatedat 5sites alloverEurope,
with4dierentproessorarhitetures. Aftersolvingtheaforementionedpratialproblems,SatinontheTCP
Ibisimplementationranonallsites,inpure Java,withouthavingtoreompileanything.
Table4.4
RelativespeedsofthemahineandJVMombinationsinthetestbed.
run relative relativetotal %oftotal
site arhiteture time(s) nodespeed speedofluster system
A'damDAS-2 1GHzIntelPentium-III 233.1 1.000 8.000 32.4
A'damSun 750MHzUltraSPARC-III 445.2 0.523 1.046 4.2
Lee 667MHZCompaqAlpha 512.7 0.454 1.816 7.4
Cardi 1GHzIntelPentium-III 758.9 0.307 0.614 2.5
Brno 2.4GHzIntelXeon 152.8 1.525 12.200 49.5
Berlin 500MHzMIPSR14000 3701.4 0.062 0.992 4.0
total 24.668 100.0
Asabenhmark,werstrantheparallel versionof theraytraerwithasmallerproblemsize (
512
×
512
,with 24 bit olor) on asingle mahine on all lusters. This way, we anompute the relative speeds of the
theDAS-2lusterinAmsterdam. ItisinterestingtonotethatthequalityoftheJITompileranhavealarge
impatontheperformaneattheappliationlevel. AnodeintheDAS-2lusterandthemahineinCardiare
both1GHzIntelPentium-IIIs, but thereismorethanafatorofthree diereneinappliation performane.
Thisis aused bythedierent JITompilersthat were used. OntheDAS-2, weused themoreeientIBM
1.4JIT,whiletheSUN1.4JIT(HotSpot) wasinstalledonthemahinein Cardi.
Furthermore,theresultsshowthat,althoughthelokfrequenyofthemahineatBrnois2.4timesashigh
asthefrequenyofaDAS-2node,thespeedimprovementisonly53%. BothmahinesuseIntelproessors,but
theXeonmahineinBrnoisbasedonPentium-4proessors,whihdolessworkperylethanthePentium-III
CPUs that are used by theDAS-2. We haveto onlude that it is in generalnot possibleto simply use the
lokfrequeniestoompareproessorspeeds.
Finally, it is obvious that the Originmahine in Berlin is slowompared to theother mahines. This is
partlyaused bytheineientJIT, whih isbasedontheSUNHotSpot JVM.Beauseoftheombinationof
slowproessorsandtheineientJIT,the16nodesof theOriginweusedareaboutasfastasasingle1GHz
Pentium-III with theIBM JIT. The Originthus hardly ontributesanything to the omputation. Thetable
showsthat,althoughweused40CPUsintotalforthegridrun,therelativespeedoftheseproessorstogether
addsupto 24.668DAS-2 nodes(1 GHzPentium-IIIs). Theperentageofthe totalomputepowerthat eah
individuallusterdeliversisshownintherightmost olumnofTable4.4.
Table4.5
PerformaneoftheraytraerappliationontheGridLabtestbed.
run ommuniation parallelization
algorithm time(s) time(s) overhead time(s) overhead eieny
nighttime
RS 877.6 198.5 36.1% 121.9 23.5% 62.6%
CRS 676.5 35.4 6.4% 83.9 16.6% 81.3%
daytime
RS 2083.5 1414.5 257.3% 111.8 21.7% 26.4%
CRS 693.0 40.1 7.3% 95.7 18.8% 79.3%
singleluster25
RS 579.6 11.3 2.0% 11.0 1.9% 96.1%
WealsorantheraytraeronasingleDAS-2mahine, withthelargeproblemsizethatwewilluseforthe
gridruns. This took 13746seonds (almostfourhours). ThesequentialprogramwithouttheSatinonstruts
takes13564seonds,theoverheadoftheparallelversionthusisabout1%. With perfetspeedup,therun time
oftheparallelprogramontheGridLabtestbedwouldbe13564dividedby24.668,whihis549.8seonds(about
nine minutes). We onsider this runtime the upper bound on theperformane that an be ahieved on the
testbed,
t
perfect
. Weanuse this numberto alulate theeieny that isahievedby therealparallel runs. Wealltheatualruntimeoftheappliationonthetestbedt
grid
. InanalogytoSetion2.3,eieny anbe denedasfollows:efficiency
=
t
perfect
t
grid
∗
100%
Wehavealsomeasuredthetimethatisspentinommuniation(
t
comm
). Thisinludesidletime,beauseallidle timein thesystemisaused bywaitingforommuniationtonish. Wealulatetherelativeommuniationoverhead withthisformula:
communication overhead
=
t
comm
t
perfect
∗
100%
Finally, thetimethat islostdueto parallelizationoverhead(
t
par
)is alulatedasshownbelow:t
par
=
t
grid
−
t
comm
−
t
perfect
parallelization overhead
=
t
par
t
perfect
CommuniationstatistisfortheraytraerappliationontheGridLabtestbed.
intraluster inter luster
alg. messages MByte messages MByte
nighttime
RS 3218 41.8 11473 137.3
CRS 1353295 131.7 12153 86.0
daytime
RS 56686 18.9 149634 154.1
CRS 2148348 130.7 10115 82.1
singleluster25
RS 45458 155.6 n.a. n.a.
The results of the grid runs are shown in Table 4.5. For referene, we also provide measurements on a
single luster, using 25 nodes of the DAS-2 system. The results presented here are the fastest runs out of
threeexperiments. Duringdaytime,theperformaneoftheraytraerwithRSshowedalargevariability,some
runs took longerthan anhour to omplete, whilethe fastest run took about half anhour. Therefore, in this
partiularase,wetookthebestresultofsixruns. ThisapproahthusisinfavorofRS.WithCRS, thiseet
doesnotour: thedierenebetweenthefastestandtheslowestrunduringdaytimewaslessthan20seonds.
Duringnight,when thereis littleInternet tra, theappliation withCRS isalready morethan200seonds
faster(about23%)thanwiththeRSalgorithm. Duringdaytime,whentheInternetlinksareheavilyused,CRS
outperformsRSbyafatorofthree. Regardlessofthetimeoftheday,theeienyofaparallelrunwithCRS
isabout80%.
The numbers in Table 4.5 show that the parallelization overhead on the testbed is signiantly higher
ompared to asingleluster. Soures of this overheadare thread reationand swithingaused by inoming
stealrequests, andthelokingoftheworkqueues. Theoverheadis higheronthetestbed,beauseveofthe
six mahines we use are SMPs (i.e. they have a shared memory arhiteture). In general, this means that
the CPUs in suh a system have to share resoures, making memory aess and espeially synhronization
potentially more expensive. The latter has anegative eet on the performane of the work queues. Also,
multiple CPUs share a singlenetwork interfae, makingaess to the ommuniationdevie moreexpensive.
The urrent implementation of Satin treats SMPs as lusters (i.e., on a N-way SMP, we start N JVMs).
Therefore,Satin pays theprieoftheSMP overhead,but doesnotexploit thebenetsofSMP systems,suh
astheavailablesharedmemory. Animplementationthatdoesutilizesharedmemorywhenavailableisplanned
forthefuture.
Communiationstatistisofthegridrunsareshownin Table4.6. Thenumbersin thetabletotalsforthe
whole run, summed over all CPUs. Again, statistisfor a singleluster run are inluded for referene. The
numbersshowthatalmostalloftheoverheadofRSisinexessivewide-areaommuniation. Duringdaytime,
for instane, it tries to send 154 MByte over the busy Internet links. Duringthe time-onsuming wide-area
transfers,thesendingmahineisidle,beausethealgorithmissynhronous. CRSsendsonlyabout82MBytes
overthewide-arealinks(abouthalftheamountofRS),but moreimportantly,thetransfersareasynhronous.
With CRS,themahinethatinitiates thewide-areatraonurrentlytriesto stealworkin theloalluster,
andalsoonurrentlyexeutestheworkthat isfound.
CRSeetivelytradeslesswide-areatraformoreloalommuniation. As showninTable4.6, therun
during the nightsends about 1.4 million loal-areamessages. Duringdaytime, theCRS algorithm has to do
moreeort to keep theloadbalaned: during thewide-areasteals, about 2.1 millionloal messagesare sent
whiletryingto ndworkwithin theloal lusters. This isabout60%morethanduring thenight. Still,only
40.1seondsarespentommuniating. WithCRS,therunduringdaytimeonlytakes16.5seonds(about2.4%)
longerthantherunatnight. ThetotalommuniationoverheadofCRS isat most7.3%,while withRS, this
anbeasmuhastwothirdsoftheruntime(i.e. thealgorithm spendsmoretimeonommuniatingthanon
alulatingusefulwork).
Beauseallidletimeisausedbyommuniation,thetimethatisspentontheatualomputationanbe
0%
20%
40%
60%
80%
100%
perfect
RS night
CRS night
RS day
CRS day
%
o
f
w
o
r
k
c
a
lc
u
la
te
d
Berlin
Brno
Cardiff
Lecce
A'dam Sun
A'dam DAS-2
Fig.4.2.Distributionofworkoverthedierentsites.
omputingtheatualproblem. Giventheamountoftimealusterperformsusefulworkandtherelativespeed
oftheluster,weanalulatewhat frationofthetotalworkisalulatedbyeahindividualluster. Wean
omparethisworkloaddistributionwiththeidealdistributionwhihisrepresentedbytherightmostolumnof
Table4.4. Theidealdistributionandtheresultsforthefourgridrunsareshownin Figure4.2. Thedierene
betweentheperfet distributionand theatualdistributions of thefour gridruns ishardly visible. Fromthe
gure,weanonludethat,although theworkloaddistribution ofbothRS andCRSis virtuallyperfet, the
RSalgorithmitselfspendsalargeamountoftimeonahieving thisdistribution. CRSdoesnotsuerfromthis
problem, beausewide-areatra isasynhronousandis overlappedwithusefulworkthat wasfoundloally.
Still,itahievesanalmostoptimaldistribution.
Tosummarize,theexperimentdesribedin thissetionshowsthattheJava-entriapproahtogrid
om-puting,andtheSatin/Ibissystemin partiular,worksextremelywellinpratieinarealgridenvironment. It
took hardlyanyeorttorunIbisandSatinonaheterogeneoussystem. Furthermore,theperformaneresults
learlyshowthat CRSoutperformsRSin arealgridenvironment,espeiallywhenthewide-arealinksarealso
usedforother(Internet)tra. WithCRS,thesystemisidle(waitingforommuniation)duringonlyasmall
frationof thetotalruntime. Weexpet evenbetterperformanewhen largerlusters areused, asindiated
byoursimulatorresultsfromSetion2.3.
5. Relatedwork. WehavedisussedaJava-entriapproahtowritingwide-areaparallel(grid
omput-ing)appliations. Most othergridomputing systems(e.g., Globus[10℄ andLegion[13℄) support avariety of
languages. GridLab [2℄ is building a toolkit of grid serviesthat anbe aessed from various programming
languages. Converse [15℄ is a framework for multi-lingual interoperability. The SuperWeb [1℄, and
Bayani-han [29℄ are examples of global omputing infrastrutures that support Java. A language-entri approah
makesiteasiertodealwithheterogeneoussystems,sinethedatatypesthataretransferredoverthenetworks
arelimited to theones supportedin thelanguage(thus obviating the needfor aseparate interfaedenition
appliations on the grid [5℄. AppLeS fouses on seleting the best set of resoures for the appliation out
of the resoure pool of the grid. Satin addresses the more low-level problem of load balaning the parallel
omputationitself,givensomesetofgridresoures. AppLeSprovides(amongstothers)atemplatefor
master-workerappliations, whereas Satin provides load balaning for the moregeneral lass of divide-and-onquer
algorithms.
Manydivide-and-onquersystemsarebasedontheClanguage. Amongthem,Cilk[7℄onlysupports
shared-memorymahines,CilkNOW[9℄andDCPAR[12℄runonloal-area,distributed-memorysystems. SilkRoad[27℄
isaversionofCilkfordistributedmemorysystemsthatusesasoftwareDSMtoprovidesharedmemorytothe
programmer,targetingatsmall-sale,loal-areasystems.
The Java lasses presented by Lea [18℄ an be used to write divide-and-onquer programs for
shared-memory systems. Satin is a divide-and-onquer extension of Java that was designed for wide-area systems,
withoutsharedmemory. LikeSatin,Javar[6℄isompiler-based. WithJavar,theprogrammerusesannotations
to indiate divide-and-onquer and other forms of parallelism. The ompiler then generates multithreaded
Javaode,that runsonanyJVM.Therefore,Javarprogramsrunonlyonshared-memorymahinesandDSM
systems.
Herrmann et al. [14℄ desribe a ompiler-based approah to divide-and-onquer programming that uses
skeletons. TheirDHC ompilersupports apurelyfuntional subsetof Haskell,andtranslatessoureprograms
into Cand MPI. Alt et al.[3℄ developeda Java-basedsystem,in whih skeletons areused to express parallel
programs, one of whih for expressing divide-and-onquer parallelism. Although the programming system
targetsgridplatforms, itisnotlearhowsalabletheapproahis: in [3℄,measurementsareprovidedonlyfor
aloal lusterof8mahines.
Most systemsdesribed aboveuse someform of randomstealing (RS). It hasbeen proven[8℄that RS is
optimalinspae,timeandommuniation,atleastforrelativelytightlyoupledsystemslikeSMPsandlusters
that havehomogeneousommuniationperformane. Inpreviouswork[26℄, wehaveshownthatthis property
annotbeextendedtowide-areasystems. WeextendedRStoperformasynhronouswide-areaommuniation
interleaved with synhronous loal ommuniation. The resulting randomized algorithm, alled CRS, does
performwellinloosely-oupledsystems.
AnotherJava-baseddivide-and-onquersystemisAtlas [4℄. Atlasisaset ofJavalassesthat anbeused
to write divide-and-onquer programs. Javelin 3 [24℄ provides a set of Javalasses that allow programmers
to express branh-and-bound omputations, suh asthe travelingsalesperson problem. LikeSatin, Atlas and
Javelin3aredesigned forwide-areasystems. Both Atlasand Javelin 3usetree-basedhierarhialsheduling
algorithms. Wefoundthatsuhalgorithmsareineientforne-grainedappliations andthatCRSperforms
better[26℄.
6. Conlusions. Grid programmingenvironments need to bebothportable and eient to exploit the
omputational powerof dynamially available resoures. Satinmakesit possibleto write divide-and-onquer
appliations in Java,and istargeted at lusteredwide-areasystems. TheSatinimplementationon topofour
newIbisplatformombinesJava'sruneverywhere witheientommuniationbetweenJVMs. Theresulting
systemiseasytouseinagridenvironment. Toahievehighperformane,Satinusesaspeialgrid-aware
load-balaningalgorithm.Previoussimulationresultssuggestedthatthisalgorithmismoreeientthantraditional
algorithmsthatareusedontightly-oupledsystems. Inthispaper,weveriedthesesimulationresultsinareal
gridenvironment.
WeevaluatedSatin/Ibis onthe highlyheterogeneoustestbed oftheEU-fundedGridLab projet,showing
thatSatin'sload-balaningalgorithmautomatiallyadaptsbothtoheterogeneousproessorspeedsandvarying
network performane, resulting in eient utilization of the omputing resoures. Measurements show that
Satin'sCRS algorithmindeedoutperformsthewidelyusedRS algorithmbyawidemargin. WithCRS, Satin
ahievesaround80%eieny, even during daytimewhen the links between thesites are heavilyloaded. In
ontrast, with thetraditional RS algorithm, the eieny drops to about26% when the wide-arealinks are
ongested.
Aknowledgments. PartofthisworkhasbeensupportedbytheEuropeanCommission,grant
IST-2001-32133(GridLab). WewouldalsoliketothankOlivierAumage,RutgerHofman,CerielJaobs,MaikNijhuisand
GosiaWrzesi«skafortheirontributionstotheIbisode. KeesVerstoepis doingamarvelousjobmaintaining
REFERENCES
[1℄ A.D.Alexandrov,M.Ibel,K.E.Shauser,andC. J.Sheiman,SuperWeb: Researh IssuesinJava-Based Global
Computing,Conurreny: PratieandExperiene,9(1997),pp.535553.
[2℄ G. Allen, K. Davis, K. N. Dolkas, N. D. Doulamis, T. Goodale, T. Kielmann, A. Merzky, J. Nabrzyski,
J.Pukaki,T. Radke,M. Russell, E.Seidel, J.Shalf,andI.Taylor,EnablingAppliationsontheGrid-A
GridLabOverview,nternationalJournalofHighPerformaneComputingAppliations,(2003).aeptedforpubliation.
[3℄ M.Alt,H.Bishof,andS.Gorlath,ProgramDevelopmentforComputationalGridsusingSkeletonsandPerformane
Predition,ParallelProessingLetters,12(2002),pp.157174. WorldSientiPublishingCompany.
[4℄ E.J.Baldeshwieler,R. Blumofe,andE.Brewer,ATLAS:AnInfrastrutureforGlobalComputing,inProeedings
oftheSeventhACMSIGOPSEuropeanWorkshoponSystemSupportforWorldwideAppliations,Connemara,Ireland,
September1996,pp.165172.
[5℄ F.Berman,R.Wolski,S.Figueira,J.Shopf,andG.Shao,Appliation-levelShedulingonDistributedHeterogeneous
Networks,inProeedingsoftheACM/IEEEConfereneon Superomputing(SC'96),Pittsburgh,PA,November 1996.
Onlineathttp://www.superomp.org.
[6℄ A. Bik, J. Villais, and D. Gannon, Javar: APrototype Java Restruturing Compiler, Conurreny: Pratie and
Experiene,9(1997),pp.11811191.
[7℄ R.D.Blumofe,C.F.Joerg,B.C.Kuszmaul,C.E.Leiserson,K.H.Randall,andY.Zhou.,Cilk: AnEient
Multithreaded RuntimeSystem,in5thACMSIGPLANSymposiumonPriniplesandPratieofParallelProgramming
(PPoPP'95),SantaBarbara,CA,July1995,pp.207216.
[8℄ R.D.BlumofeandC.E.Leiserson,ShedulingMultithreadedComputationsbyWorkStealing,in35thAnnualSymposium
onFoundationsofComputerSiene(FOCS'94),SantaFe,NewMexio,November1994,pp.356368.
[9℄ R.D.BlumofeandP.Lisieki,AdaptiveandReliableParallelComputingonNetworksofWorkstations,inUSENIX1997
AnnualTehnialConfereneonUNIXandAdvanedComputingSystems,Anaheim,CA,1997,pp.133147.
[10℄ I.Foster andC.Kesselman,Globus: AMetaomputingInfrastrutureToolkit,InternationalJournalofSuperomputer
Appliations,11(1997),pp.115128.
[11℄ I. Foster, C. Kesselman, G. Tsudik, and S. Tueke,Aseurityarhiteture for omputational grids, in5th ACM
ConfereneonComputerandCommuniationSeurity,SanFraniso,CA,November1998,pp.8392.
[12℄ B.FreislebenandT.Kielmann,AutomatedTransformationofSequentialDivideandConquerAlgorithmsintoParallel
Programs,ComputersandArtiialIntelligene,14(1995),pp.579596.
[13℄ A.GrimshawandW.A.Wulf,TheLegionVisionofaWorldwideVirtualComputer,Comm.ACM,40(1997),pp.3945.
[14℄ C.A.HerrmannandC.Lengauer,HDC:AHigher-OrderLanguageforDivide-and-Conquer,ParallelProessingLetters,
10(2000),pp.239250.
[15℄ L.V.Kalé,M.Bhandarkar,N.Jagathesan,S.Krishnan,andJ.Yelon,Converse: Aninteroperableframeworkfor
parallelprogramming,inIntl.ParallelProessingSymposium,1996.
[16℄ K. Kaneda, K. Taura, and A.Yonezawa,Virtual privategrid: Aommandshell for utilizinghundreds of mahines
eiently, in2nd IEEE/ACMInternational Symposiumon Cluster Computingand the Grid (CCGrid 2002),Berlin,
Germany,May2002,pp.212219.
[17℄ T.Kielmann,H.E.Bal,J.Maassen,R.vanNieuwpoort,L.Eyraud,R.Hofman,andK.Verstoep,Programming
EnvironmentsforHigh-PerformaneGridComputing: theAlbatrossProjet,FutureGenerationComputerSystems,18
(2002),pp.11131125.
[18℄ D.Lea,AJavaFork/JoinFramework,inProeedingsoftheACM2000JavaGrandeConferene,SanFraniso,CA,June
2000,pp.3643.
[19℄ C.Lee, S.Matsuoka,D. Talia,A.Sussmann, M. Müller,G. Allen,andJ.Saltz,AGridprogramming primer.
GlobalGridForum,August2001.
[20℄ M.Leeh,M.Ganis,Y.Lee,R.Kuris,D.Koblas,andL.Jones,RFC1928: SOCKSprotoolversion5,April1996.
[21℄ J.Maassen,T.Kielmann,andH.Bal,GMI:FlexibleandEientGroupMethodInvoationforParallelProgramming,
inInproeedingsofLCR-02:SixthWorkshoponLanguages,Compilers,andRun-timeSystemsforSalableComputers,
WashingtonDC,Marh2002,pp.16.
[22℄ J.Maassen,T.Kielmann,andH.E.Bal,Parallel Appliation Experienewith Repliated MethodInvoation,
Conur-renyandComputation: PratieandExperiene,13(2001),pp.681712.
[23℄ J.Maassen,R. vanNieuwpoort, R. Veldema,H.Bal, T. Kielmann,C.Jaobs,andR. Hofman,Eient Java
RMIforParallel Programming,ACMTransationsonProgrammingLanguagesandSystems,23(2001),pp.747775.
[24℄ M.O.NearyandP.Cappello,AdvanedEagerShedulingforJava-BasedAdaptivelyParallelComputing,inProeedings
of the JointACM2002 JavaGrande- ISCOPE(International Symposiumon ComputinginObjet-Oriented Parallel
Environments)Conferene,Seattle,November2002,pp.5665.
[25℄ Publinetperfhomepage. www.netperf.org.
[26℄ R. V. v. Nieuwpoort, T. Kielmann, and H. E. Bal, Eient Load Balaning for Wide-area Divide-and-Conquer
Appliations, inProeedings EighthACMSIGPLAN Symposiumon PriniplesandPratie ofParallelProgramming
(PPoPP'01),Snowbird,UT,June2001,pp.3443.
[27℄ L. Peng, W. Wong, M. Feng, andC. Yuen, SilkRoad: AMultithreaded RuntimeSystem with Software Distributed
Shared MemoryforSMP Clusters,inIEEEInternational Confereneon ClusterComputing(Cluster2000), Chemnitz,
Saxony,Germany,November2000,pp.243249.
[28℄ M.Philippsen,B. Haumaher, andC. Nester,More eientserializationand RMIforJava, Conurreny: Pratie
andExperiene,12(2000),pp.495518.
[29℄ L.F. G.Sarmenta,VolunteerComputing,PhDthesis,Dept.ofEletrialEngineeringandComputerSiene,MIT,2001.
[31℄ R. V.van Nieuwpoort, J.Maassen,R. Hofman, T. Kielmann,andH.E. Bal,Ibis: anEient Java-based Grid
ProgrammingEnvironment,inJointACMJavaGrande-ISCOPE2002Conferene,Seattle,Washington,USA,November
2002,pp.1827.
[32℄ A.Wollrath,J.Waldo,andR. Riggs,Java-CentriDistributedComputing,IEEEMiro,17(1997),pp.4453.
[33℄ I.-C. Wu and H. Kung, Communiation Complexity for Parallel Divide-and-Conquer, in 32ndAnnual Symposiumon
FoundationsofComputerSiene(FOCS'91),SanJuan,PuertoRio,Ot.1991,pp.151162.
Editedby: WilsonRivera,JaimeSeguel.
Reeived: July15,2003.