Volume 6, Number 3, pp. 45-55. http://www.scpe.org
2005 SWPS
JUXMEM: AN ADAPTIVE SUPPORTIVE PLATFORM FOR DATA SHARING ON THE GRID
G. ANTONIU∗, L. BOUGÉ†, AND M. JAN∗
Abstract. We address the challenge of managing large amounts of numerical data within computing grids
consisting of a federation of clusters. We claim that storing, accessing, updating and sharing such data
should be considered by applications as an external service. We propose a hierarchical architecture for
this service, based on a peer-to-peer approach. This architecture is illustrated through a software
platform called JuxMem (for Juxtaposed Memory), which provides transparent access to mutable data, while
enhancing data persistence in a dynamic environment. Managing the volatility of storage resources is
specially emphasized. As a proof of concept, we describe a prototype implementation on top of the JXTA
peer-to-peer framework, and we report on a preliminary experimental evaluation.
Keywords. data sharing, grid, peer-to-peer, hierarchical architecture, JXTA.
1. Introduction. A major contribution of the grid computing environments developed so far is to have
decoupled computation from deployment. Deployment is then considered as an external service provided by
the underlying infrastructure, outside the application. This service is in charge of locating and
interacting with the physical resources, in order to efficiently schedule and map the computation. In
contrast, as of today, no such sophisticated service exists regarding data management on the grid.
Paradoxically enough, complex infrastructures are available for transparent computation scheduling on
distributed sites, whereas the user is still left to explicitly store and transfer the data needed by the
computation between these sites. At best, advanced FTP-like functionalities are proposed by existing
environments. Within the context of a growing number of applications using large amounts of data, this
explicit data management arises as a major limitation against the efficient use of modern computational
grids.
Like deployment, we claim that an adequate approach to this problem consists in decoupling data
management from computation, through an external service tailored to the requirements of scientific
computation. In this work, we focus on the case of a grid consisting of a federation of distributed
clusters. Such a data sharing service should meet the following two properties.
Persistence. The data sets used by the grid computing applications may be very large. Their transfer from
one site to another may be costly (in terms of both bandwidth and latency), so such data movements
should be carefully optimized. Therefore, a data management service should allow data to be stored
on the grid infrastructure independently of the applications, in order to allow their reuse in an
efficient way. Such a service should also provide data localization information, in order to co-operate
with the computation scheduling service, and thereby enhance the global efficiency.
Transparency. Such a data management service should provide transparent access to data. It should handle
data localization and transfer without any help from the programmer. Yet, it should make good
use of additional information and hints provided by the programmer, if any. The service should also
transparently use adequate replication strategies and consistency protocols to ensure data availability
and consistency in a large-scale, dynamic architecture. In particular, it should support events such as
computational and storage resources joining and leaving, or even unexpectedly failing.
At the same time, three main constraints need to be addressed:
Volatility and dynamicity. The clusters which make up the grid are not guaranteed to remain constantly
available. Nodes may leave due to technical problems or because some resources become temporarily
unavailable. This should obviously not result in disabling the data management service. Also, new
nodes may dynamically join the physical infrastructure: the service should be able to dynamically take
into account the additional resources they provide.
Scalability. The algorithms proposed for parallel computing have often been studied on small-scale
configurations. Our target architecture is typically made of thousands of computing nodes, say tens of
hundred-node clusters. It is well-known that designing low-level, explicit MPI programs is most
difficult at such a scale. In contrast, high-level, peer-to-peer approaches have proved to remain
effective at much larger scales.
∗ IRISA/INRIA, Campus de Beaulieu, 35042 Rennes, FR. ({Gabriel.Antoniu,Mathieu.Jan}@irisa.fr).
†
Mutable data. In our target applications, data are generally shared and can be modified by multiple
partners. A large number of strategies have been proposed for handling data replication and data
consistency, in the context of Distributed Shared Memory (DSM) systems. Again, these strategies and
protocols have been designed with the assumption of a small-scale, static, homogeneous architecture,
typically of clusters of few tens of nodes. A data sharing service for the grid should consider
consistency protocols adapted to a dynamic, large-scale, heterogeneous architecture.
The type of service we propose is similar in some respects to several types of existing data management
systems. However, these systems address only partially the goals and the three constraints mentioned
above.
Non-transparent, large-scale data management. Currently, the most widely-used approach to data
management for distributed grid computation relies on explicit data transfers between clients and
computing servers. As an example, the Globus [7] platform provides data access mechanisms (Globus Access
to Secondary Storage [3]) based on the GridFTP protocol [1]. Though this protocol provides
authentication, parallel transfers, checkpoint/restart mechanisms, etc., it is still an FTP-like protocol
which requires explicit data localization and transfer. Globus also integrates data catalogs, where
multiple copies of the same data can be recorded. The management of these catalogs is manual: it is the
user's responsibility to record these copies and make sure they are consistent: no consistency guarantee
is provided by Globus.
Large-scale data storage. The IBP Project [2] provides a large-scale data storage system, consisting of a
set of buffers distributed over Internet. The user can rent these storage areas and use them as temporary
buffers for efficient data transfers across a wide-area network. IBP has been used by the Netsolve [18]
computing environment to implement a service of persistent data. Transfer management is still at the
user's charge. Besides, IBP does not handle dynamic join/departure of storage nodes and provides no
consistency guarantee for multiple copies of the same data.
Transparent, small-scale data sharing. Distributed Shared Memory (DSM) systems provide transparent
data sharing, via a unique address space accessible to physically distributed machines. Within this
context, a variety of consistency models and protocols have been defined, in order to allow an efficient
management of replicated data. These systems do offer transparent access to data: all nodes can read
and write data in a uniform way, using a unique identifier or a virtual address. It is the responsibility
of the DSM system to localize, transfer, replicate data, and guarantee their consistency according to
some semantics. Nevertheless, existing DSM systems have generally shown satisfactory efficiency only
on small-scale configurations, typically, a few tens of nodes [11].
Peer-to-peer sharing of immutable data. Recently, peer-to-peer (P2P) has proven to be an efficient
approach for large-scale data sharing. The peer-to-peer model is complementary to the client-server
model: the relations between machines are symmetrical, each node can be client in a transaction and
server in another. This paradigm has been made popular by Napster [17], Gnutella [10], and now KaZaA [16].
We can note that these systems focus on sharing immutable files: the shared data are read-only and
can be replicated at ease.
Peer-to-peer sharing of mutable data. Recently, some mechanisms for sharing mutable data in a
peer-to-peer environment have been proposed by systems like OceanStore [8], Ivy [9] and P-Grid [6]. In
OceanStore, for each piece of data, only a small set of primary replicas, called the inner ring, agrees
on, serializes and applies updates. Updates are then multicast down a dissemination tree to all other
cached copies of the data, called secondary replicas. However, OceanStore uses a versioning mechanism
which has not proven to be efficient at large scales. Second, although it provides hooks for managing the
consistency of data, applications still have to use low-level mechanisms for each consistency model [12].
Third, published measurements on the performance of updates only assume a single writer per data block.
Finally, servers making up inner rings are assumed to be highly available. The Ivy system has one
main limitation: applications have to repair conflicting writes, thus the number of writers per piece of
data is very limited. Both OceanStore and Ivy target general-purpose, persistent file storage, not data
management for high-performance computing grids where, for example, distributed matrices have to
be moved using parallel transfers. P-Grid proposes a flooding-based algorithm for updating data, but
[Figure 2.1: clusters A1, A2 and A3 connected through a wide-area network.]
Fig. 2.1. Numerical simulation for weather forecast using a pipeline communication scheme with 3 clusters.
2. Designing a data sharing service for the grid.
2.1. Motivating scenarios. Let us consider a distributed federation of 3 clusters: A1, A2 and A3, which
co-operate together as shown on Figure 2.1. Each cluster is typically interconnected through a
high-performance local-area network, whereas they are all coupled together through a regular wide-area
network. Consider for instance a weather forecast simulation. Cluster A1 may compute the forecast for a
given day, then A2 for the next day, and finally A3 for the day after. Thus, A3 uses data produced by A2,
which in turn uses data produced by A1, as in a pipeline. Alternatively, cluster A1 may simulate the
weather forecast in a given country, while A2 and A3 simulate it for two neighboring countries.
Such simulations produce large amounts of numerical data, and data-related actions are deeply intertwined
with computation. The data management systems described in the previous section do not provide any simple
technique to support such designs. Consider for instance transferring data from A1 to A2: a widely-used
technique consists in explicitly writing the data on a disk within cluster A1, then using a file transfer
tool to deposit them on a disk within cluster A2. The application is directly involved in this series of
actions. In contrast, we propose to decouple the application from the data management, by making data
storage and localization transparent with respect to the application. Cluster A1 should only store the
data within the federation-wide data management service, from which cluster A2 could request them as
needed. Data localization and transfer are then completely external to the applications.
Let us now suppose that our 3 applications no longer co-operate according to a pipeline scheme, but rather
according to a multiple-writers scheme. For instance, each application simulates a single phenomenon part
of the global weather forecast: say, wind, rain and clouds. In this case, each cluster needs data from the
other ones in order to make progress. A data sharing service could allow the concurrent applications not
only to read, but also to write to the globally shared data, while transparently handling data
consistency. This is similar to DSM systems, but at a much larger scale, and in a fully dynamic context.
Also, assume that some nodes fail in cluster A2. Some of the data necessary for A3 could thus become
unavailable. The data sharing service should also provide mechanisms to tolerate such faults, for
instance, based on redundancy.
2.2. Design principles. We consider two major sources of inspiration for the design of a data sharing
service for scientific grid computing:
DSM systems, which propose consistency models and protocols for efficient transparent management of
mutable data, on static, small-scaled configurations (tens of nodes);
P2P systems, which have proven adequate for the management of immutable data on highly dynamic,
large-scale configurations (millions of nodes).
These two classes of systems have been designed and studied in very different contexts. In DSM systems, the
nodes are generally under the control of a single administration, and the resources are trusted. In
contrast, P2P systems aggregate resources located at the edge of the Internet, with no trust guarantee,
and loose control. Moreover these numerous resources are essentially heterogeneous in terms of processors,
operating systems and
Table 2.1
A grid data sharing service as a compromise between DSM and P2P systems.

                        DSM                 Grid data service              P2P
Scale                   10^1 to 10^2        10^3 to 10^4                   10^5 to 10^6
Resource control and    High                Medium                         Null
trust degree
Dynamicity              Null                Medium                         High
Resource homogeneity    Homogeneous         Rather heterogeneous           Heterogeneous
                        (clusters)          (clusters of clusters)         (Internet)
Data type               Mutable             Mutable                        Immutable
Application complexity  Complex             Complex                        Simple
Typical applications    Scientific          Scientific computation and     File sharing and
                        computation         data storage                   storage
multiple nodes. In contrast, P2P systems generally serve as a support for storing and sharing immutable
files. These antagonist features are summarized in the first and third columns of Table 2.1.
Our data sharing service targets physical architectures with features intermediate between DSM and P2P
systems. We address scales of the order of thousands of nodes, organized as a federation of clusters, say
tens of hundred-node clusters. At a global level, the resources are thus rather heterogeneous, while they
can probably be considered as homogeneous within the individual clusters. The control degree and the trust
degree are also intermediate, since the clusters may belong to different administrations, which set up
agreements on the sharing protocol. Finally, we target numerical applications like heavy simulations, made
by coupling individual codes. These simulations process large amounts of data, with significant
requirements in terms of data storage and sharing. These intermediate features are illustrated in the
second column of Table 2.1.
The contribution of this paper is to propose an architecture for such a data sharing service, which
addresses the problem of managing mutable data on dynamic, large-scale configurations. Our approach aims
at taking benefit of both DSM systems (transparent access to data, consistency protocols) and P2P systems
(scalability, support for resource volatility and dynamicity).
2.3. The JXTA implementation framework. Our proposal is partly inspired by the P2P approach. It
can usefully benefit from a platform providing basic mechanisms for peer-to-peer interaction. To our
knowledge, the most advanced implementation platform in this area is JXTA [14]. The name JXTA stands for
juxtaposed, in order to suggest the juxtaposition rather than the opposition of the P2P and client-server
models. JXTA is a project originally initiated by Sun Microsystems.
JXTA is an open-source framework, which specifies a set of language- and platform-independent XML-based
protocols [15]. JXTA provides a rich set of building blocks for the management of peer-to-peer systems:
resource discovery, peer group management, peer-to-peer communication, etc.
Peers. The basic entity in JXTA is the peer. Peers are organized in networks. They are uniquely identified
by IDs. An ID is a logical address independent of the location of the peer in the physical network. JXTA
introduces several types of peers. The most relevant as far as we are concerned are the edge peers and
rendezvous peers. Edge peers are able to communicate with other peers in the JXTA virtual network.
They can also store advertisements of resources they discover in the network. Rendezvous peers have
the extra ability of forwarding the requests they receive to other rendezvous peers. They can also offer
a storage area for advertisements that have been published by edge peers. Finally, they are internally
managed by JXTA using a distributed hash table (DHT) and make up the frame of JXTA. They
can thus be dynamically located in an efficient way. Joining, leaving, and even unexpected failing of
rendezvous peers are supported by the JXTA protocols.
Peer groups. Peers can be members of one or several peer groups. A peer group is made up of several peers
that share a common set of interests, e.g., peers that have the same access rights to some resources.
The main motivation for creating peer groups is to build services collectively delivered by peer groups,
Pipes. Communication between peers or peer groups within the JXTA virtual network is made by using pipes.
Pipes are unidirectional, unreliable and asynchronous logical channels. JXTA offers two types of pipes:
point-to-point pipes, and propagate pipes. Propagate pipes can be used to build a multicast layer at
the virtual level.
Advertisements. Every resource in the JXTA network (peer, peer group, pipe, service, etc.) is described and
published using advertisements. Advertisements are structured XML documents which are published
within the network of rendezvous peers. To request a service, a client has first to discover a matching
advertisement using specific localization protocols.
JXTA protocols. JXTA proposes six generic protocols. Out of these, two are particularly useful for building
higher-level peer-to-peer services: the Peer Discovery Protocol, which allows for advertisement
publishing and discovery; and the Pipe Binding Protocol, which dynamically establishes links between peers
communicating on a given pipe.
The data sharing service that we propose is designed using the JXTA building blocks described above.
3. JuxMem: a supportive platform for data sharing on the grid. The architecture of the data
sharing service we propose mirrors an architecture consisting of a federation of distributed clusters. The
architecture is therefore hierarchical, and is illustrated through the proposition of a software platform
called JuxMem (for Juxtaposed Memory), whose goal is to be the foundation for a data sharing service for
grid computing environments, like DIET [4].
[Figure 3.1: the overlay network contains the group "juxmem", which encloses the groups "cluster A",
"cluster B", "cluster C" and "data"; the physical network contains clusters A, B and C. Legend: node,
client, provider, cluster manager.]
Fig. 3.1. Hierarchy of the entities in the network overlay defined by JuxMem.
3.1. Hierarchical architecture. Figure 3.1 shows the hierarchy of the entities defined in the architecture
of JuxMem. This architecture is made up of a network of peer groups (cluster groups A, B and C), which
generally correspond to clusters at the physical level. All the groups are inside a wider group which
includes all the peers which run the service (the juxmem group). Each cluster group consists of a set of
nodes which provide memory for data storage. We will call these nodes providers. In each cluster group, a
node is in charge of managing the memory made available by the providers of the group. This node is called
cluster manager. Finally, a node which simply uses the service to allocate and/or access data blocks is
called client. It should be noted that a node can be at the same time a cluster manager, a client and a
provider, but for the sake of clarity, each node plays only one role in the example illustrated on the
Figure 3.1.
providers from different cluster groups. Indeed, a piece of data can be spread over several clusters (here
A and C). For this reason, the data and cluster groups are at the same level of the group hierarchy. Note
also that the cluster groups could correspond to subsets of the same physical cluster.
Another important feature is that the architecture of JuxMem is dynamic, since cluster and data groups
can be created at run time. For instance, for each block of data inserted into the system, a data group is
automatically instantiated.
API of the data sharing service. The Application Programming Interface (API) provided by JuxMem
illustrates the functionalities of a data sharing service providing data persistence as well as
transparency with respect to data localization.
alloc(size, attributes) creates a memory area of the specified size on a cluster. The attributes
parameter specifies the level of redundancy and the default protocol used to manage the
consistency of the copies of the corresponding data block. This function returns an ID which can be
seen at the application level as a data block ID.
map(id, attributes) retrieves the advertisement of a data communication channel which has to
be used to manipulate the data block identified by id. The attributes argument specifies
parameters for the view of the data block desired by the client, for instance what we call the degree
of consistency: some clients may have weaker consistency requirements than the one ensured by the
default protocol used to manage the data block.
put(id, value) modifies the value of the data block identified by id. The new value is then value.
get(id) returns the current value of the data block identified by id.
lock(id) locks the data block identified by id. A lock is implicitly associated to each data block.
Clients which access a shared data block need to synchronize using this lock.
unlock(id) unlocks the data block identified by id.
reconfigure(attributes) dynamically reconfigures a node. The attributes parameter
indicates if the node is going to act as a cluster manager and/or as a provider. If the node is going to
act as a provider, the attributes parameter also specifies the amount of memory that the node
provides to JuxMem.
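To make the calling sequence of these primitives concrete, here is a minimal sketch of a toy, single-process stand-in for the API. It is not the JuxMem implementation (which is distributed and written in Java): the class name InMemoryJuxMem, the "redundancy" attribute key and the choice to return the block ID from map are assumptions made purely for illustration, and reconfigure is left out.

```python
import threading
import uuid


class InMemoryJuxMem:
    """Toy, single-process stand-in for the API described above (hypothetical)."""

    def __init__(self):
        self._blocks = {}  # block ID -> bytearray holding the current value
        self._locks = {}   # block ID -> lock implicitly associated with the block
        self._attrs = {}   # block ID -> attributes given at allocation time

    def alloc(self, size, attributes=None):
        """Create a memory area of `size` bytes and return its block ID."""
        block_id = uuid.uuid4().hex
        self._blocks[block_id] = bytearray(size)
        self._locks[block_id] = threading.Lock()
        self._attrs[block_id] = dict(attributes or {})
        return block_id

    def map(self, block_id, attributes=None):
        """Check that the block exists; this sketch hands back the ID itself
        instead of a communication-channel advertisement."""
        if block_id not in self._blocks:
            raise KeyError(f"unknown data block: {block_id}")
        return block_id

    def put(self, block_id, value):
        """Overwrite the beginning of the block with `value`."""
        data = bytes(value)
        if len(data) > len(self._blocks[block_id]):
            raise ValueError("value larger than the allocated area")
        self._blocks[block_id][:len(data)] = data

    def get(self, block_id):
        """Return the current value of the block."""
        return bytes(self._blocks[block_id])

    def lock(self, block_id):
        self._locks[block_id].acquire()

    def unlock(self, block_id):
        self._locks[block_id].release()


service = InMemoryJuxMem()
block = service.alloc(4, {"redundancy": 2})  # attribute key is an assumption
service.map(block)
service.lock(block)
service.put(block, b"\x01\x02")
service.unlock(block)
print(service.get(block))
```

The client only ever manipulates the ID returned by alloc, which is the point of the API: no host name or port appears at the application level.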
3.2. Managing memory resources.
Publishing and placement of resource advertisements. Memory resources are managed using advertisements.
Each provider publishes the amount of memory it offers within the cluster group to which it belongs, by
means of a provider advertisement. The cluster manager of the group stores all such advertisements
available in its group. It is also responsible for publishing the amount of memory available in the
cluster by using a cluster advertisement. This advertisement lists the amounts of memory offered by
providers of the associated cluster group. These cluster advertisements are published inside the juxmem
group, so that they can then be used by all the clients in order to allocate memory.
Cluster managers are thus in charge of making the link between the cluster group and the juxmem group.
They make up a network organized using a DHT at the level of the juxmem group, in order to build the
frame of the data sharing service. This frame is represented by the ring on the Figure 3.2. Each cluster
manager G1 to G6 is responsible for a cluster, respectively A1 to A6, each of which is made up of five
nodes. At the level of the juxmem group, the DHT works as follows. Each cluster advertisement contains a
list which enumerates the amounts of memory available in the cluster. Each individual amount is separately
used to generate an ID, by means of a hash function. This ID is then used to determine the cluster manager
responsible for all advertisements having this amount of available memory in their list. This cluster
manager is not the peer that stores the advertisement, it only knows the cluster manager which published
it in the JuxMem network. This placement of cluster advertisements allows clients to easily retrieve
advertisements in order to allocate memory: any request for a given amount of memory is directed to the
cluster manager responsible for that amount of memory, using the hash mechanisms described above.
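The placement rule described above can be sketched as follows. This is an illustrative simplification, not the JXTA DHT: the helper names (responsible_manager, build_index, lookup) are hypothetical, and mapping a SHA-1 digest onto the manager list with a modulo is an assumption standing in for the real ID space.

```python
import hashlib


def responsible_manager(amount_mb, managers):
    """Hash a memory amount to the cluster manager responsible for it."""
    digest = hashlib.sha1(str(amount_mb).encode("ascii")).digest()
    return managers[int.from_bytes(digest, "big") % len(managers)]


def build_index(cluster_adverts, managers):
    """Place every advertised amount on its responsible manager.

    cluster_adverts maps a publishing cluster manager to the list of memory
    amounts (in MB) of its cluster advertisement. The responsible manager
    records only who published a matching advertisement, not the
    advertisement itself.
    """
    index = {manager: {} for manager in managers}
    for publisher, amounts in cluster_adverts.items():
        for amount in set(amounts):
            owner = responsible_manager(amount, managers)
            index[owner].setdefault(amount, set()).add(publisher)
    return index


def lookup(amount_mb, index, managers):
    """Return the managers that published an advertisement listing amount_mb."""
    owner = responsible_manager(amount_mb, managers)
    return sorted(index[owner].get(amount_mb, set()))


managers = ["G1", "G2", "G3", "G4", "G5", "G6"]
adverts = {"G2": [8, 16], "G4": [8], "G5": [32]}  # made-up example amounts
index = build_index(adverts, managers)
print(lookup(8, index, managers))
```

Because the same hash is applied when publishing and when searching, a request for a given amount reaches exactly one responsible manager, whichever peer issues it.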
Searching for advertisements is therefore short, and responses are exact and exhaustive, i.e., all the
advertisements that include the requested memory size will be returned. But since using a DHT on memory
sizes means generating a different hash for each memory size, JuxMem uses a parameterizable policy for the
discretization of the space of memory sizes. Thus, JuxMem will search for the minimum memory size, given
[Figure 3.2: ring of cluster managers G1 to G6, each attached to one of the groups "cluster A1" to
"cluster A6", with a client C and a provider P; arrows numbered 1, 2, 3a, 3b, 4, 5 and 6 mark the steps
of the request. Legend: provider, client, cluster manager.]
Fig. 3.2. Steps of an allocation request made by a client.
if it uses a power of 2 law for the space discretization. Providers also internally use the same law when
offering memory areas, but provide the maximum memory size, given by the policy used, that is inferior to
the one they wish to offer.
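A power of 2 discretization of this kind can be sketched with two small helpers. The function names are hypothetical, and the sketch assumes that sizes are positive integers and that a size which is already a power of 2 is left unchanged on both sides.

```python
def round_up_pow2(size):
    """Client side: smallest power of 2 that is at least `size`."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return 1 << (size - 1).bit_length()


def round_down_pow2(size):
    """Provider side: largest power of 2 that is at most `size`."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return 1 << (size.bit_length() - 1)


requested, offered = 10, 10  # sizes in MB, chosen only for illustration
print(round_up_pow2(requested), round_down_pow2(offered))
```

With these helpers, a 10 MB request is searched as 16 MB while a provider wishing to offer 10 MB advertises 8 MB, so an advertised area is never smaller than what a matching request needs.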
One of the constraints we fixed is to support the volatility of nodes which make up the clusters.
Therefore, the advertisements published at a time t1 can be invalid at the time t2 > t1, since providers
can disappear from JuxMem at any time. The mechanism used to manage this volatility of peers is based on
republishing the cluster advertisements whenever a change in the amount of memory provided is detected.
Besides, advertisements have a limited but parameterizable lifetime, so it is necessary to periodically
republish them.
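The combination of a limited lifetime and republishing can be illustrated with a small cache that forgets advertisements once they expire. The class AdvertisementCache and its explicit `now` arguments are assumptions made for this sketch; they do not describe how JXTA stores advertisements.

```python
class AdvertisementCache:
    """Keeps cluster advertisements until their lifetime has elapsed."""

    def __init__(self, lifetime):
        self.lifetime = lifetime  # lifetime in seconds (parameterizable)
        self._entries = {}        # publisher -> (memory amounts, expiry time)

    def publish(self, publisher, amounts, now):
        """Publish or republish; a republication replaces the old entry."""
        self._entries[publisher] = (list(amounts), now + self.lifetime)

    def valid(self, now):
        """Return the advertisements that have not expired at time `now`."""
        return {
            publisher: amounts
            for publisher, (amounts, expiry) in self._entries.items()
            if now < expiry
        }


cache = AdvertisementCache(lifetime=60)
cache.publish("G2", [8], now=0)
cache.publish("G4", [8, 16], now=30)
print(cache.valid(now=50))   # both advertisements are still valid
print(cache.valid(now=70))   # the one from G2 has expired
cache.publish("G2", [16], now=70)  # G2 republishes after a change
print(cache.valid(now=80))
```

An advertisement from a manager that has silently disappeared is therefore dropped after at most one lifetime, while a live manager keeps its entry fresh by republishing.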
Processing an allocation request. Clients make allocation requests by specifying the size of the memory
area they want to allocate. The different steps for such a request, numbered on the Figure 3.2, are the
following:
1. The client C of the cluster group A1 wants to allocate a memory area of 8 MB with a redundancy degree
of two. Consequently, it submits its request to the cluster manager G1 to which it is connected.
2. The cluster manager G1 then determines that the peer responsible for advertisements having a memory
size of 8 MB in their list is the cluster manager G3, using the hash mechanism described previously.
Therefore, the cluster manager peer G1 forwards the request to G3.
3. The cluster manager G3 then determines that cluster managers G2 and G4 match the criterion of the
client, and asks them to forward their cluster advertisement to the client C.
4. The client C then chooses the cluster manager G2 as the peer having the best advertisement: for
instance the corresponding cluster offers a higher degree of redundancy than the cluster handled by the
cluster manager G4. Thus, it submits its allocation request to G2.
5. The cluster manager G2 receives the allocation request and handles it. If it can satisfy the request
then it asks one of its providers, for example P, to allocate an 8 MB memory area. If the request cannot
be satisfied, an error message is sent back to the client.
6. If the provider P can satisfy this request, it creates an 8 MB memory area, then sends back the
advertisement of this memory area to the client C. P becomes the cluster manager of the associated
data group, which means that it is responsible for replicating the data block stored in that memory
area. If the provider P cannot satisfy the request, an error message is sent back to the cluster manager
G2, which can try other provider peers of the cluster group.
If no providers can be found on the last step of an allocation request, an error message is sent back to
the client.
Then the client can restart the allocation request from step 4, e.g., with another cluster manager
matching the requested memory size. Finally, if no cluster manager can allocate the memory area, the client
increases the requested memory size and restarts the allocation request from the beginning. This can be
done N times (for
3.3. Managing shared data. When a memory area is allocated by a client, a data group is created
on the chosen provider and an advertisement is sent to the client. This advertisement allows the client to
communicate with the data group. This advertisement is published at the juxmem group level, but only the
ID of this advertisement is returned at the application level. Access to data by other clients is then
possible by using this ID: the platform transparently locates the corresponding data block.
Storage of data blocks is independent of clients. Indeed, when clients disconnect from JuxMem, data blocks
still remain stored in the data sharing service on the providers. Consequently, clients can have access to
data blocks previously stored by other clients: they simply need to look for the advertisement of the data
group associated with the data block (whose identifier is assumed to be known). The map primitive of the
API of JuxMem does this by taking as input the ID of the data block. In this way, the storage of data
blocks is persistent.
Each data block is replicated on a fixed, parameterizable number of providers for a better availability.
This redundancy degree is specified as an attribute at allocation time. The consistency of the different
copies must then be handled. In this first version of JuxMem, the use of a multicast at the level of the
juxmem group solves this problem: the different copies of the same data block are simultaneously updated
whenever a write access is made. Alternative consistency models and protocols will be experimented with in
further versions. Note that clients which have previously read a data block are not notified of this
update: clients do not store a copy of the data block. Therefore, the result of a read which is valid at a
time t1 may not be valid at time t2 > t1. It is worth noting that this difference between clients and
providers makes it possible to handle a high number of clients without having to deal with a high number of
copies of data blocks. Synchronization between clients which concurrently access a data block is handled
using the lock/unlock primitives.
3.4. Handling volatile providers. In order to tolerate the volatility of peers, a static replication of
data on a fixed and parameterizable number of providers is not enough. Indeed, the providers hosting a copy
of the same data block can successively become unavailable. A dynamic monitoring of the number of copies
for data is therefore needed. Consequently, each data group has a manager (noted data manager) which is in
charge of monitoring the level of redundancy of the data block. If this number goes below the one specified
by clients, the data manager must search for a provider and ask it to host an extra copy of the data block.
When the data manager decides to replicate the block, it must first lock it (internally) in order to
maintain consistency. The provider which will host this new copy is then responsible for unlocking it. A
timeout mechanism followed by a ping test is used in order to detect if the provider became unavailable
just before unlocking the data block. If it is the case, then the data manager unlocks the data block
itself.
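The monitoring step of the data manager can be sketched as a pure function that, given the current holders of a block and the set of providers known to be alive, returns the holders after repair. The function name and the alphabetical choice of new hosts are assumptions made for illustration; the internal lock, the unlock by the new host and the timeout with ping test are deliberately left out.

```python
def repair_redundancy(holders, alive, target):
    """Return the providers holding a copy once the redundancy is restored.

    holders: providers that hosted a copy before the check
    alive:   set of providers currently reachable
    target:  redundancy degree specified at allocation time
    """
    live_holders = [p for p in holders if p in alive]
    # Only providers not already hosting a copy are candidates.
    candidates = [p for p in sorted(alive) if p not in live_holders]
    missing = max(0, target - len(live_holders))
    # If too few candidates remain, the block stays under-replicated.
    return live_holders + candidates[:missing]


holders = ["P1", "P2", "P3"]
alive = {"P2", "P3", "P4", "P5"}  # P1 has gone down
print(repair_redundancy(holders, alive, target=3))
```

In this example one copy was lost, so a single replication is triggered and the block is again hosted by three providers.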
3.5. Handling volatile managers. If a cluster manager goes down, this could lead to the unavailability
of resources provided by a whole cluster. The role of cluster manager (noted main cluster manager) is
therefore automatically duplicated on another provider of the cluster (called secondary cluster manager).
Managers periodically synchronize using a mechanism based on the exchange of provider advertisements, in
order to find out about newly published advertisements. They can thus both know in a nearly accurate manner
the amount of memory available in the cluster. A mechanism based on periodical heartbeats dynamically
ensures this duplication of cluster managers. Such a mechanism is also used for the data managers (see
Section 3.4). Note that the possible changes of managers in the cluster and data groups, due to the
unavailability of managers, are not seen outside these groups. The availability of clusters and of data
blocks is thus maximized, whereas the perturbation on the client side is minimized.
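A heartbeat-based duplication of the manager role can be sketched as follows. The class ManagerPair, its timeout parameter and the rule of taking the first spare provider as new secondary are assumptions chosen to keep the example short; the source does not specify these details.

```python
class ManagerPair:
    """Main and secondary manager of one group, watched through heartbeats."""

    def __init__(self, main, secondary, timeout):
        self.main = main
        self.secondary = secondary
        self.timeout = timeout      # seconds of silence tolerated
        self.last_heartbeat = 0.0   # time of the last heartbeat from main

    def heartbeat(self, now):
        """Called whenever a heartbeat from the main manager is received."""
        self.last_heartbeat = now

    def check(self, now, spare_providers):
        """Promote the secondary if the main manager has been silent too long.

        Returns True when a promotion took place. A spare provider, if any,
        becomes the new secondary so that the role stays duplicated.
        """
        if now - self.last_heartbeat <= self.timeout:
            return False
        self.main = self.secondary
        self.secondary = spare_providers.pop(0) if spare_providers else None
        self.last_heartbeat = now
        return True


pair = ManagerPair(main="G2", secondary="P1", timeout=10)
spares = ["P2", "P3"]
pair.heartbeat(now=5)
print(pair.check(now=12, spare_providers=spares), pair.main)
print(pair.check(now=20, spare_providers=spares), pair.main, pair.secondary)
```

The promotion happens entirely inside the group, which matches the remark above that manager changes are not seen from outside.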
4. Implementation and preliminary evaluations.
4.1. Implementation of JuxMem within the JXTA framework. In order to build a prototype
of the software architecture described in the previous section, we have used the JXTA generic peer-to-peer
framework (see Section 2.3). Our JuxMem prototype uses the reference Java binding of JXTA (which is today
the only binding compatible with the JXTA 2.0 specification). JuxMem is written in Java and includes about
50 classes (5000 code lines).
JXTA fully meets the needs of JuxMem. Thus, managers of data and cluster groups are based on
JXTA's rendezvous peers. Indeed, managers have to know if providers are still alive by using a ping test in
order to manage a cluster or a block of data. This can only be done if providers have previously published
[Figure 4.1: plot of the relative overhead (%), from 0 to 100, against the seconds elapsed between
provider losses, from 160 down to 30.]
Fig. 4.1. Relative overhead due to the volatility of providers for a sequence lock-put-unlock, with respect
to a stable system.
managers. For example, data managers have to forward access requests, made by clients, to providers hosting
a copy of the data block. In the same way, cluster managers have to forward allocation requests, made by
clients, to providers. Clients and providers which do not act as data managers for one or several blocks of
data are based on JXTA's edge peers. Indeed, they do not have to play a role in the dynamic monitoring of
the number of copies for a block of data in the system. Therefore, they do not have to store published
provider advertisements. Moreover, clients only need to discover and store cluster advertisements which
will allow them to allocate memory areas. The various groups defined in JuxMem are implemented by JXTA's
peer groups. The juxmem group implements a JXTA peer group service providing the API of JuxMem (see Section
3.1). Finally, the communication channels of JXTA also offer the needed support for building multicast
communications for simultaneously updating copies of the same block of data.
4.2. Preliminary evaluations. For our preliminary experiments, we used a cluster of 450 MHz Pentium
II nodes with 256 MB RAM, interconnected by a 100 Mb/s Fast Ethernet network.
We first measured the memory consumption overhead generated by the different JuxMem peers with respect
to the underlying JXTA peers used to build JuxMem peers. This overhead is reasonable: it ranges between
5% and 7.4%.
We then measured the influence of the volatility degree of provider peers on the duration of a sequence
lock-put-unlock executed in a loop by a client. This sequence in the loop is made on a data block stored in
JuxMem. The goal of this measure is to evaluate the relative overhead generated by the replications which
take place in order to maintain a given redundancy degree for a given block of data. These replications are
transparently triggered when the service detects that a provider holding a data block goes down. If these
replications take place while a client accesses the data block being replicated, these accesses slow down.
The test program first allocates a small memory area (1 byte) on a provider belonging to the cluster and
writes a data block to it. The redundancy degree is set to 3. The allocation takes place on a cluster
initially consisting of 16 providers and one cluster manager. 16 machines of the cluster previously
described each host a provider, one machine of the same cluster hosts a cluster manager and another
machine of the same cluster hosts a client. The client executes a 100-iteration loop, and each iteration
consists of a sequence lock-put-unlock.
During the execution of this loop, a random provider hosting a copy of the data is killed every δ seconds,
where δ is a parameter of the experiment. In order to measure only the overhead due to the volatility of
providers, the data manager of the associated group is never killed.
Figure 4.1 shows the relative overhead measured, with respect to a stable system (i.e. where no provider
goes down during the loop execution: δ = ∞). When the data manager detects that providers holding a copy of
the data block have gone down, it tries to replicate the block on other available providers, which are not
already hosting a copy of the data block. To ensure the consistency of the data during its replication,
clients are not
The curve profile is explained by the number of times the system replicates the data on providers, in order
to maintain the redundancy degree specified by the client (which is 3 for this test). For the whole
duration of our test, the number of triggered replications is given in Table 4.1 as a function of the δ
parameter. For highly volatile systems (δ < 80 s), the number of replications triggered becomes higher than
2 and the relative overhead becomes significant. For δ = 30 s, it reaches more than 65% (10 replications
triggered). However, in a realistic situation, the node volatility on the architecture we consider is
typically a lot weaker (δ ≫ 80 s). For such values, the reconfiguration overhead is less than 5%. We can
reasonably say that the JuxMem platform includes a mechanism which dynamically maintains a certain
redundancy degree for data blocks, in order to improve data availability, without significant overhead,
while tolerating node failures.
Table 4.1
Number of triggered replications when the volatility of provider peers evolves from 160 to 30 seconds.

δ (seconds)                        160   140   120   100   80    60    50    40    30
Number of triggered replications   1     1     1     1     2     2.5   5     5.5   10
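
As a small cross-check of the discussion above, the values of Table 4.1 can be encoded and queried directly. The sketch below uses only the figures reported in the table; the function names are ours and purely illustrative.

```python
# Table 4.1 as reported: kill period δ (seconds) -> replications triggered.
replications_by_delta = {
    160: 1, 140: 1, 120: 1, 100: 1, 80: 2, 60: 2.5, 50: 5, 40: 5.5, 30: 10,
}


def highly_volatile(table, threshold=2):
    """Return the δ values whose replication count exceeds `threshold`."""
    return sorted(d for d, n in table.items() if n > threshold)


def is_non_increasing_in_delta(table):
    """Check that a larger δ (fewer failures) never triggers more replications."""
    counts = [table[d] for d in sorted(table)]
    return all(a >= b for a, b in zip(counts, counts[1:]))


print(highly_volatile(replications_by_delta))             # [30, 40, 50, 60]
print(is_non_increasing_in_delta(replications_by_delta))  # True
```

The first result matches the statement that more than 2 replications are triggered only for δ < 80 s.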
5. Conclusion. This paper defines a hierarchical architecture for a data sharing service managing mutable
data within a grid consisting of a federation of clusters. This architecture has been designed using a
peer-to-peer approach, and demonstrated through the JuxMem platform. Not only does the architecture reduce the
number of messages needed to search for a piece of data, thanks to a hierarchical search scheme, but it also
makes it possible to take advantage of specific features of the underlying physical architecture. The
management policy for each cluster can be specific to its configuration, for instance in terms of network links
to be used. Thus, some clusters could use high-bandwidth, low-latency networks for intra-cluster communication,
if available.
The JuxMem user can allocate memory areas in the system, by specifying an area size and some attributes,
such as a redundancy degree. The allocation primitive returns an ID which identifies the block of data. Then,
data localization and transfer are fully transparent, since this ID is sufficient to access and manipulate
the corresponding data wherever it is: neither an IP address nor a port number needs to be specified at the
application level.
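The ID-based access style described above can be sketched with a toy, single-process stand-in. This is an illustration under stated assumptions, not the JuxMem API: the class `ToyDataService`, the method signatures and the `get` operation are hypothetical, and the redundancy attribute is only recorded, not enforced.

```python
import threading
import uuid


class ToyDataService:
    """In-process stand-in for a data sharing service (hypothetical API)."""

    def __init__(self):
        self._blocks = {}   # block ID -> bytearray holding the data
        self._locks = {}    # block ID -> lock guarding that block
        self._attrs = {}    # block ID -> attributes given at allocation

    def alloc(self, size, redundancy=1):
        """Reserve `size` bytes and return an opaque ID for the block."""
        block_id = uuid.uuid4().hex
        self._blocks[block_id] = bytearray(size)
        self._locks[block_id] = threading.Lock()
        self._attrs[block_id] = {"redundancy": redundancy}
        return block_id

    def lock(self, block_id):
        self._locks[block_id].acquire()

    def unlock(self, block_id):
        self._locks[block_id].release()

    def put(self, block_id, data):
        """Overwrite the start of the block; the caller should hold the lock."""
        block = self._blocks[block_id]
        if len(data) > len(block):
            raise ValueError("data larger than allocated area")
        block[:len(data)] = data

    def get(self, block_id):
        """Return a snapshot of the block contents."""
        return bytes(self._blocks[block_id])


service = ToyDataService()
block_id = service.alloc(size=1, redundancy=3)   # only the ID is needed afterwards

for i in range(100):                 # lock-put-unlock loop, as in the evaluation
    service.lock(block_id)
    service.put(block_id, bytes([i % 256]))
    service.unlock(block_id)

print(service.get(block_id))         # b'c' (byte value 99 from the last iteration)
```

The point of the sketch is that the client code never names a host or a port: every operation takes the ID returned by the allocation and nothing else.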
Our architecture supports the volatility of all types of peers. This kind of volatility is also supported in
peer-to-peer systems such as Gnutella or KaZaA, which enhance data availability thanks to redundancy. However,
this is a side effect of the user actions. In contrast, our system actively takes this volatility into account:
this makes it possible not only to maintain a certain degree of data redundancy (as in systems like Ivy or
CFS [5]), but also to support the volatility of peers with specific responsibilities (e.g., cluster managers,
or data managers).
The implementation of a JXTA-based prototype has shown the feasibility of such a system. However,
note that the design of JuxMem is not dependent on JXTA. Actually, other libraries could be used, such as
JavaGroups [13]. We used the Java version of JXTA, since this is the most advanced binding of JXTA, the only
one compatible with the JXTA 2.0 specification.
The modular architecture of JXTA makes it easy to add and remove services and/or protocols, including
communication protocols. This should eventually allow the platform to take advantage of high-performance
networks (such as Myrinet or SCI) for data transfer. We plan to address this problem in the future. We also
plan to use JuxMem as an experimental platform for different data consistency strategies supporting peer
volatility, in order to build a configurable, adaptive data sharing service for mutable data. The final goal is
to integrate this service into large-scale computing environments, such as DIET [4], developed at ENS Lyon.
This will allow an extensive evaluation of the service, with realistic codes, using various data access schemes.
REFERENCES
[1] B. Allcock, J. Bester, J. Bresnahan, A. Chervenak, L. Liming, S. Meder and S. Tuecke, GridFTP Protocol
Specification, GGF GridFTP Working Group Document, Sept. 2002.
[2] A. Bassi, M. Beck, G. Fagg, T. Moore, J. Plank, M. Swany and R. Wolski, The Internet Backplane Protocol:
A study in resource sharing, In 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid
(CCGrid 2002), pages 194-201, Berlin, Germany, May 2002. IEEE.
[3] J. Bester, I. Foster, C. Kesselman, J. Tedesco and S. Tuecke, GASS: A data movement and access service for
[4] E. Caron, F. Desprez, F. Lombard, J.-M. Nicod, M. Quinson and F. Suter, A scalable approach to network
enabled servers, In B. Monien and R. Feldmann, editors, 8th International Euro-Par Conference, volume 2400 of
Lecture Notes in Computer Science, pages 907-910, Paderborn, Germany, Aug. 2002. Springer-Verlag.
[5] F. Dabek, F. Kaashoek, D. Karger, R. Morris and I. Stoica, Wide-area cooperative storage with CFS, In 18th
ACM Symposium on Operating Systems Principles (SOSP '01), pages 202-215, Chateau Lake Louise, Banff, Alberta,
Canada, Oct. 2001.
[6] A. Datta, M. Hauswirth and K. Aberer, Updates in highly unreliable, replicated peer-to-peer systems, In
23rd International Conference on Distributed Computing Systems (ICDCS 2003), pages 76-87, Providence, Rhode
Island, USA, May 2003.
[7] I. Foster and C. Kesselman, Globus: A metacomputing infrastructure toolkit, The International Journal of
Supercomputer Applications and High Performance Computing, 11(2):115-128, 1997.
[8] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer,
C. Wells and B. Zhao, OceanStore: An architecture for global-scale persistent storage, In 9th International
Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS 2000), number 2218
in Lecture Notes in Computer Science, pages 190-201, Cambridge, MA, Nov. 2000. Springer.
[9] A. Muthitacharoen, R. Morris, T. M. Gil and B. Chen, Ivy: A read/write peer-to-peer file system, In 5th
Symposium on Operating Systems Design and Implementation (OSDI '02), Boston, MA, Dec. 2002.
[10] A. Oram, Peer-to-Peer: Harnessing the Power of Disruptive Technologies, chapter Gnutella, pages 94-122,
O'Reilly, May 2001.
[11] J. Protić, M. Tomasević and V. Milutinović, Distributed Shared Memory: Concepts and Systems, IEEE,
Aug. 1997.
[12] S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao and J. Kubiatowicz, Pond: the OceanStore prototype,
In 2nd USENIX Conference on File and Storage Technologies (FAST '03), Californie, CA, USA, Mar. 2003.
[13] JavaGroups, http://www.javagroups.com/javagroupsnew/docs/index.html
[14] The JXTA project, http://www.jxta.org/
[15] JXTA v2.0 protocol specification, http://spec.jxta.org/nonav/v1.0/docbook/JXTAProtocols.pdf, Mar. 2003.
[16] KaZaA, http://www.kazaa.com/
[17] Napster protocol specification, http://opennap.sourceforge.net/napster.txt, Mar. 2001.
[18] The NetSolve project, http://icl.cs.utk.edu/netsolve/
Edited by: Wilson Rivera, Jaime Seguel.
Received: June 26, 2003.