• No results found

JuxMem: An Adaptive Supportive Platform for Data Sharing on the Grid

N/A
N/A
Protected

Academic year: 2020

Share "JuxMem: An Adaptive Supportive Platform for Data Sharing on the Grid"

Copied!
11
0
0

Loading.... (view fulltext now)

Full text

(1)

Volume6,Number3,pp.4555. http://www.spe.org

2005SWPS

JUXMEM: AN ADAPTIVE SUPPORTIVE PLATFORM FOR DATA SHARING ON THE

GRID

G. ANTONIU

, L. BOUGÉ

,AND M. JAN

Abstrat. Weaddressthe hallenge of managinglargeamountsof numerial data within omputinggridsonsisting ofa

federationoflusters. Welaimthatstoring,aessing,updatingandsharingsuhdata shouldbeonsideredbyappliationsas

anexternalservie.Weproposeahierarhialarhiteture forthisservie,basedonapeer-to-peer approah. Thisarhitetureis

illustratedthrough asoftwareplatformalledJuxMem(forJuxtaposedMemory),whihprovidestransparentaesstomutable

data,whileenhaningdatapersisteneinadynamienvironment. Managingthevolatilityofstorageresoures isspeially

empha-sized. Asaproofofonept,wedesribeaprototypeimplementationontopoftheJXTApeer-to-peerframework,andwereport

onapreliminaryexperimentalevaluation.

Keywords. datasharing,grid,peer-to-peer,hierarhialarhiteture,JXTA.

1. Introdution. A majorontribution of thegrid omputingenvironmentsdeveloped sofar isto have

deoupled omputation from deployment. Deployment is then onsidered as an external servie provided by

the underlying infrastruture, outside the appliation. This servie is in harge of loating and interating

withthephysialresoures,inorderto eientlysheduleandmaptheomputation. Inontrast,asoftoday,

no suh sophistiated servie exists regardingdata management on the grid. Paradoxially enough, omplex

infrastruturesareavailablefortransparentomputationshedulingondistributedsites,whereastheuserisstill

lefttoexpliitlystoreandtransferthedataneededbytheomputationbetweenthesesites. Atbest,advaned

FTP-like funtionalities are proposed by existing environments. Within the ontext of a growing number of

appliationsusinglargeamountsofdata,thisexpliitdatamanagement arisesasamajorlimitationagainstthe

eientuseofmodernomputationalgrids.

Likedeployment,welaimthatanadequateapproahtothisproblemonsistsindeouplingdata

manage-ment fromomputation, throughanexternalservie tailoredto therequirementsofsientiomputation. In

thiswork,wefousontheaseofagridonsistingofafederationofdistributed lusters. Suhadatasharing

servie shouldmeetthefollowingtwoproperties.

Persistene. The datasets used by thegrid omputingappliations may be verylarge. Theirtransferfrom

onesiteto anothermaybeostly(intermsof bothbandwidthandlateny),sosuhdatamovements

should bearefullyoptimized. Therefore, adata management servieshould allow datato be stored

onthegridinfrastrutureindependentlyoftheappliations,inordertoallowtheirreuseinaneient

way. Suh aservieshould alsoprovidedata loalizationinformation, inorder too-operatewiththe

omputationshedulingservie,andtherebyenhane theglobal eieny.

Transpareny. Suhadatamanagementservieshould providetransparentaesstodata. Itshouldhandle

data loalization and transfer without any help from the programmer. Yet, it should make good

useof additionalinformation and hints provided by theprogrammer, ifany. Theservie should also

transparentlyuse adequaterepliationstrategiesandonsistenyprotoolsto ensuredataavailability

andonsistenyin alarge-sale,dynami arhiteture. Inpartiular,itshould support eventssuhas

omputationalandstorageresouresjoining andleaving,orevenunexpetedlyfailing.

Atthesametime,threemainonstraintsneedto beaddressed:

Volatility and dynamiity. The lusters whih make up thegrid are not guaranteed to remainonstantly

available. Nodesmayleavedue totehnial problemsorbeausesomeresouresbeometemporarily

unavailable. This should obviously not result in disabling the data management servie. Also, new

nodesmaydynamiallyjointhephysialinfrastruture: theservieshouldbeabletodynamiallytake

intoaounttheadditionalresourestheyprovide.

Salability. The algorithms proposed for parallel omputing have often been studied on small-sale

ong-urations. Our target arhiteture is typially made of thousands of omputing nodes, say tens of

hundred-nodelusters. Itis well-knownthat designinglow-level,expliitMPI programsis most

di-ultatsuhasale. Inontrast,high-level,peer-to-peerapproaheshaveprovedtoremaineetiveat

muh largersales.

IRISA/INRIACampusdeBeaulieu,35042Rennes,FR.({Gabriel.Antoniu,Mathieu .Jan }i risa .fr ).

(2)

Mutable data. Inourtargetappliations,dataaregenerallysharedandanbemodiedbymultiplepartners.

A large number of strategieshave been proposed forhandling data repliationand data onsisteny,

in the ontext of Distributed Shared Memory (DSM)systems. Again,these strategiesand protools

havebeendesignedwiththeassumptionofasmall-sale,stati,homogeneousarhiteture,typiallyof

lustersoffewtensofnodes. Adatasharingservieforthegridshouldonsider onsistenyprotools

adaptedto adynami,large-sale,heterogeneousarhiteture.

The type of servie we propose is similar in some respets to several types of existing data

manage-ment systems. However, these systems addressonly partially the goalsand thethree onstraintsmentioned

above.

Non-transparent,large-saledata management. Currently,themostwidely-usedapproahtodata

man-agementfordistributedgridomputationreliesonexpliitdatatransfersbetweenlientsandomputing

servers. As an example,the Globus [7℄ platform provides dataaess mehanisms(GlobusAess to

Seondary Storage [3℄) based on the GridFTP protool [1℄. Though this protool provides

authen-tiation, parallel transfers, hekpoint/restartmehanisms, et., it is still a FTP-like protool whih

requires expliit data loalization and transfer. Globus also integratesdata atalogs, where multiple

opiesofthesamedataanbereorded. Themanagementof theseatalogsismanual: itistheuser's

responsibility to reord these opies and make sure they are onsistent: no onsisteny guarantee is

providedbyGlobus.

Large-sale data storage. TheIBPProjet[2℄providesalarge-saledatastoragesystem,onsistingofaset

ofbuersdistributedoverInternet. Theuseranrentthesestorageareasandusethemastemporary

buersforeientdatatransfersarossawide-areanetwork. IBPhasbeenusedbytheNetsolve[18℄

omputingenvironmenttoimplementaservieofpersistentdata. Transfermanagementisstill atthe

user'sharge. Besides, IBPdoesnothandle dynamijoin/departureof storagenodesandprovidesno

onsistenyguaranteeformultipleopies ofthesamedata.

Transparent,small-saledata sharing. Distributed Shared Memory (DSM) systems provide transparent

data sharing, via a unique address spae aessible to physially distributed mahines. Within this

ontext,avarietyofonsistenymodelsandprotoolshavebeendened,in ordertoallowaneient

managementofrepliateddata. Thesesystemsdooertransparentaesstodata: allnodesanread

andwrite datainauniformway,usingauniqueidentieroravirtualaddress. Itistheresponsibility

of theDSM systemto loalize, transfer, repliatedata, and guaranteetheir onsisteny aordingto

somesemantis. Nevertheless, existingDSM systemshavegenerallyshownsatisfatory eienyonly

onsmall-saleongurations,typially,afewtensofnodes[11℄.

Peer-to-peersharing ofimmutabledata. Reently, peer-to-peer (P2P) has provento be an eient

ap-proahforlarge-saledatasharing. Thepeer-to-peermodelisomplementarytothelient-servermodel:

therelationsbetweenmahinesaresymmetrial,eahnodeanbelientinatransationandserverin

another. This paradigmhasbeenmade popular by Napster[17℄,Gnutella[10℄, andnowKaZaA [16℄.

Wean note that these systemsfous onsharingimmutable les: theshared dataare read-only and

anberepliatedatease.

Peer-to-peersharing ofmutable data. Reently, some mehanisms for sharing mutable data in a

peer-to-peer environment have been proposed by systemslike OeanStore [8℄, Ivy [9℄ and P-Grid [6℄. In

OeanStore,for eah dataonly asmall set ofprimary replias,alled the innerring agrees,serializes

andappliesupdates. Updatesarethenmultiastdownadisseminationtreeto allotherahedopies

ofthedata,alledseondaryreplias. However,OeanStoreusesaversioningmehanismwhihhasnot

provento beeientat largesales. Seond, despiteit provideshooksfor managing theonsisteny

of data, appliations still have to use low-level mehanismsfor eah onsisteny model [12℄. Third,

published measurements on the performane of updates only assumea single writer per data blok.

Finally, serversmaking up inner rings are assumed to be highly available. The Ivy systemhas one

main limitation: appliations have to repaironiting writes, thus the number of writers per data

is very limited. Both Oeanstore and Ivy target general-purpose, persistent le storage, not data

management for high-performane, omputing grids where for example distributed matries have to

bemovedusingparallel transfers. P-Gridproposesaooding-based algorithmforupdatingdata,but

(3)

Cluster A1

Cluster A3

Cluster A2

Wide−Area

Network

Fig.2.1. Numerialsimulationforweatherforeastusinga pipelineommuniationshemewith3 lusters.

2. Designinga data sharingservie for thegrid.

2.1. Motivatingsenarios. Letusonsideradistributedfederationof3lusters:

A

1

,

A

2

and

A

3

,whih

o-operatetogetherasshownonFigure2.1. Eahlusteristypiallyinteronnetedthroughahigh-performane

loal-areanetwork, whereasthey areall oupled togetherthrough aregular wide-areanetwork. Consider for

instaneaweatherforeastsimulation. Cluster

A

1

mayomputethe foreastforagivenday,then

A

2

forthe

nextday,andnally

A

3

forthedayafter.Thus,

A

3

usesdataproduedby

A

2

,whihinturnusesdataprodued

by

A

1

, asinapipeline. Alternatively, luster

A

1

maysimulate the weatherforeastin agivenountry, while

A

2

et

A

3

simulateitfortwoneighboringountries.

Suh simulationsprodue largeamount ofnumerial data, and data-related ationsare deeply intriated

withomputation. Thedatamanagementsystemsdesribedin theprevioussetiondonotprovideanysimple

tehnique to support suh designs. Consider for instane transferring data from

A

1

to

A

2

: a widely-used

tehniqueonsistsinexpliitlywritingthedataonadiskwithinluster

A

1

,thenusealetransfertooltodeposit

them on a disk within luster

A

2

. The appliation is diretly involved in this series of ations. In ontrast,

we propose to deouplethe appliation from the data management, bymaking data storageand loalization

transparentwith respetto theappliation. Cluster

A

1

should onlystorethedata withinthe federation-wide

datamanagementservie,fromwhihluster

A

2

ouldrequestthemasneeded. Dataloalizationandtransfer

arethenompletelyexternaltotheappliations.

Letusnowsupposethatour3appliationsnolongero-operateaordingtoapipelinesheme,butrather

aordingtoamultiple-writers sheme. Forinstane, eahappliation simulates asinglephenomenon partof

theglobal weatherforeast: say, wind, rain and louds. Inthis ase, eah luster needs data from the other

onesinordertomakeprogress.Adatasharingservieouldallowtheonurrentappliationsnotonlytoread,

butalsoto writeto theglobally shareddata,while transparentlyhandlingdataonsisteny. Thisissimilarto

DSMsystems,butatamuhlargersale,andinafullydynamiontext. Also,assumethatsomenodesfailin

luster

A

2

. Someofthedataneessaryfor

A

3

ouldthusbeomeunavailable. Thedatasharingservieshould

alsoprovidemehanismstotoleratesuhfaults, forinstane,basedonredundany.

2.2. Design priniples. We onsider twomajor souresof inspiration for the design of adata sharing

servieforsientigridomputing:

DSM systems, whihproposeonsistenymodelsandprotoolsforeienttransparentmanagementof

mu-tabledata, onstati, small-saledongurations (tensof nodes);

P2P systems, whihhaveprovenadequateforthemanagementofimmutable data onhighly dynami,

large-sale ongurations(millions ofnodes).

Thesetwolassesofsystemshavebeendesignedandstudied in verydierentontexts. InDSM systems,the

nodes are generally under the ontrol of asingle administration, and theresoures are trusted. In ontrast,

P2PsystemsaggregateresouresloatedattheedgeoftheInternet,withnotrustguarantee,andlooseontrol.

Moreoverthesenumerousresouresareessentiallyheterogeneousin termsofproessors,operatingsystemsand

(4)

Table2.1

AgriddatasharingservieasaompromisebetweenDSM andP2Psystems.

DSM Grid data servie P2P

Sale

10

1

10

2

10

3

10

4

10

5

10

6

Resoure ontrol

and trust degree

High Medium Null

Dynamiity Null Medium High

Resoure

homogeneity

Homogeneous

(lusters)

Ratherheterogeneous

(lustersoflusters)

Heterogeneous

(Internet)

Data type Mutable Mutable Immutable

Appliation

omplexity

Complex Complex Simple

Typial

appliations

Sienti

omputation

Sientiomputationand

datastorage

Filesharingand

storage

multiple nodes. Inontrast,P2Psystemsgenerallyserveasasupportforstoringand sharingimmutableles.

Theseantagonistfeaturesaresummarizedin therstandthird olumnsofTable2.1.

Ourdata sharingservietargetsphysialarhitetureswithfeatures intermediatebetweenDSM andP2P

systems. Weaddresssalesoftheorderofthousandsofnodes,organizedasafederationoflusters,saytensof

hundred-nodelusters. Atagloballevel,theresouresarethusratherheterogeneous,whiletheyanprobably

beonsideredashomogeneouswithintheindividuallusters. Theontroldegreeand thetrustdegreearealso

intermediate,sinethelustersmaybelongtodierentadministrations,whihsetupagreementsonthesharing

protool. Finally,wetarget numerialappliationslikeheavysimulations,madebyouplingindividualodes.

These simulations proess large amountsof data, with signiantrequirements in terms of data storage and

sharing. These intermediatefeaturesareillustratedintheseondolumn ofTable2.1.

Theontributionofthispaperis namelytoproposeanarhitetureforsuhadata sharingservie,whih

addresses the problem of managing mutable data on dynami, large-sale ongurations. Ourapproah aims

at taking benet of both DSM systems (transparent aess to data, onsistenyprotools) and P2Psystems

(salability,supportforresourevolatilityanddynamiity).

2.3. The JXTAimplementationframework. OurproposalispartlyinspiredbytheP2Papproah. It

anusefullybenetfromaplatformprovidingbasimehanismsforpeer-to-peerinteration. Toourknowledge,

themostadvanedimplementationplatforminthis areaisJXTA [14℄. ThenameJXTAstandsforjuxtaposed,

inorder tosuggestthejuxtapositionratherthantheoppositionoftheP2Pandlient-servermodels. JXTAis

aprojetoriginallyinitiatedbySunMirosystems.

JXTAisanopen-soureframework,whihspeiesasetoflanguage-andplatform-independentXML-based

protools[15℄. JXTAprovidesarihsetofbuildingbloksforthemanagementofpeer-to-peersystems: resoure

disovery,peergroupmanagement,peer-to-peerommuniation,et.

Peers. ThebasientityinJXTAisthepeer. Peersareorganizedinnetworks. Theyareuniquelyidentiedby

IDs. An IDisalogialaddressindependentoftheloationofthepeerinthephysialnetwork. JXTA

introduesseveraltypesofpeers. Themostrelevantasfarasweareonernedaretheedgepeers and

rendezvous peers. Edgepeersareableto ommuniate withother peersin theJXTAvirtualnetwork.

Theyan alsostoreadvertisementsofresourestheydisoverin thenetwork. Rendezvouspeershave

theextraabilityofforwardingtherequeststheyreeivetootherrendezvouspeers. Theyanalsooer

astorageareaforadvertisementsthat havebeenpublished byedgepeers. Finally, theyare internally

managedbyJXTAusingadistributedhashtable (DHT)andaremakinguptheframeofJXTA.They

anthus be dynamially loatedin aneient way. Joining, leaving, andeven unexpeted failing of

rendezvouspeersaresupportedbytheJXTA protools.

Peer groups. Peersanbemembersofoneorseveralpeer groups. Apeergroupismadeupof severalpeers

that share a ommonset ofinterests, e.g., peers that havethe sameaess rightsto some resoures.

Themainmotivationforreatingpeergroupsistobuildserviesolletivelydeliveredbypeergroups,

(5)

Pipes. CommuniationbetweenpeersorpeergroupswithintheJXTAvirtualnetworkismadebyusingpipes.

Pipesareunidiretional,unreliableandasynhronouslogialhannels. JXTAoerstwotypesofpipes:

point-to-point pipes,and propagatepipes. Propagatepipesanbeused to build amultiast layerat

thevirtuallevel.

Advertisements. EveryresoureintheJXTAnetwork(peer,peergroup,pipe,servie,et.) isdesribedand

published using advertisements. Advertisementsare strutured XML doumentswhih arepublished

withinthenetworkofrendezvouspeers. Torequestaservie,alienthasrsttodisover amathing

advertisementusingspei loalizationprotools.

JXTA protools. JXTAproposessixgeneriprotools. Outofthese,twoarepartiularlyusefulforbuilding

higher-levelpeer-to-peerservies: thePeerDisoveryProtool,whihallowsforadvertisement

publish-ingand disovery; and thePipe Binding Protool, whih dynamially establishes links between peers

ommuniatingonagivenpipe.

ThedatasharingserviethatweproposeisdesignedusingtheJXTA buildingbloksdesribedabove.

3. JuxMem: a supportive platform for data sharing on the grid. Thearhiteture of the data

sharing servie we propose, mirrors an arhiteture onsisting of a federation of distributed lusters. The

arhiteture is therefore hierarhial, and is illustratedthrough the proposition of asoftware platform alled

JuxMem (for Juxtaposed Memory), whose goal is to be the foundation for a data sharing servie for grid

omputingenvironments,likeDIET[4℄.

Group "cluster A"

Group "data"

Group "cluster B"

Physical network

Overlay network

Group "cluster C"

Cluster C

Cluster B

Cluster A

Node

Group "juxmem"

Client

Provider

Cluster manager

Fig.3.1. Hierarhyof theentitiesinthenetworkoverlaydenedby JuxMem.

3.1. Hierarhialarhiteture. Figure3.1showsthehierarhyoftheentitiesdenedinthearhiteture

of JuxMem. Thisarhitetureis madeupof anetworkof peer groups(lustergroups

A

,

B

and

C

), whih

generallyorrespond to lusters at thephysial level. All the groupsareinside awider groupwhih inludes

allthepeers whih runtheservie(the juxmemgroup). Eahluster grouponsistsof aset ofnodeswhih

providememoryfordatastorage. Wewillallthesenodesproviders. Ineahlustergroup,anodeisinharge

of managingthe memorymade available by theprovidersof thegroup. This node isalled lustermanager.

Finally, a nodewhih simplyuses theservie to alloate and/oraess data bloks is alled lient. It should

be notedthat anodeanbeat thesametime aluster manager,alientand aprovider, but forthe sakeof

larity,eahnodeplaysonlyonerolein theexampleillustratedontheFigure3.1.

(6)

providersfromdierentlustergroups. Indeed,adataanbespreadoveronseverallusters(hereAandC).

Forthisreason,thedataand lustergroupsareatthesamelevelofthegrouphierarhy. Note alsothatthe

lustergroupsouldalsoorrespondtosubsetsofthesamephysial luster.

Anotherimportantfeatureisthatthearhitetureof JuxMemisdynami,sinelusteranddatagroups

an be reated at run time. For instane, for eah blok of data inserted into the system, a data group is

automatiallyinstantiated.

API of the data sharing servie. The Appliation Programming Interfae (API) provided by JuxMem

illustratesthefuntionalities ofadatasharingservieprovidingdata persisteneaswellastransparenywith

respettodata loalization.

allo(size, attributes) allowstoreateamemoryareaofthespeiedsizeonaluster. Theattributes

parameter allows to speify the level of redundany and the default protool used to manage the

onsistenyof theopies of the orrespondingdata blok. This funtion returns anID whih anbe

seenat theappliationlevelasadatablokID.

map(id, attributes) allows to retrieve the advertisement of a data ommuniation hannel whih has to

beused to manipulate the data blokidentied by id. The attributes argumentallows to speify

parametersfortheviewofthedatablokdesiredbythelient,likeforinstanewhatweallthedegree

of onsisteny: somelientsmay haveweaker onsistenyrequirementsthan the oneensured by the

defaultprotoolusedtomanagethedatablok.

put(id, value) allowstomodifythevalueofthedatablokidentied byid. Thenewvalueisthenvalue.

get(id) allowstogettheurrentvalueofthedatablokidentiedbyid.

lok(id) allows to lok the data blok identied by id. A lok is impliitly assoiated to eah data blok.

Clientswhihaessashareddatablokneedto synhronizeusingthis lok.

unlok(id) allowsto unlokthedataidentiedbyid.

reonfigure(attributes) allowsto dynamially reongure anode. The attributesparameter allows to

indiateifthenodeisgoingtoatasalustermanagerand/orasaprovider. Ifthenodeisgoingtoat

asaprovider, theattributesparameteralsoallowsto speify theamountofmemory thatthe node

providestoJuxMem.

3.2. Managing memoryresoures.

Publishingandplaementofresoureadvertisements. Memoryresouresaremanagedusingadvertisements.

Eahproviderpublishestheamountofmemoryitoerswithinthelustergrouptowhihitbelongs,bythe

meansof aprovider advertisement. Theluster managerof thegroupstoresallsuh advertisementsavailable

in his group. He is also responsible for publishing theamount of memoryavailable in the luster byusing a

luster advertisement. Thisadvertisementliststheamountsof memoryoeredbyprovidersofthe assoiated

lustergroup. These lusteradvertisementsarepublishedinside thejuxmemgroup,so thattheyanthenbe

usedbyallthelientsin ordertoalloatememory.

Clustermanagersarethusinhargeofmakingthelink betweenthelustergroupandthejuxmemgroup.

Theymakeup anetwork organizedusingaDHT at thelevel ofthejuxmemgrouplevel,in order tobuild the

frameofthedatasharingservie. ThisframeisrepresentedbytheringontheFigure3.2. Eahlustermanager

G1toG6isresponsibleforaluster,respetivelyA1toA6,eahofwhihismadeupofvenodes. Atthelevel

ofthe juxmemgroup,theDHT worksas follows. Eah lusteradvertisementontainsalistwhihenumerates

the amounts of memory available in the luster. Eah individual amount is separately used to generate an

ID, by means of a hash funtion. This ID is then used to determine the luster manager responsible for all

advertisementshavingthisamountofavailablememoryintheirlist. Thislustermanagerisnotthepeer that

storestheadvertisement,itonlyknowstheluster managerwhihpublisheditin theJuxMemnetwork. This

plaementoflusteradvertisementsallowslientstoeasilyretrieveadvertisementsinordertoalloatememory:

any request for agiven amountof memory isdireted to theluster manager responsible forthat amount of

memory,usingthehashmehanismsdesribedabove

Searhing for advertisements is therefore short, and responses are exatand exhaustive, e.g., all the

ad-vertisements that inlude the requested memory size will be returned. But sine using a DHT on memory

sizes meansto generateadierenthashfor eah memorysize,JuxMem usesaparameterizable poliy forthe

disretization of thespaeof memory sizes. Thus, JuxMem will searh forthe minimummemory size,given

(7)

1

G1

G2

P

C

Group "cluster A4"

G4

3a

G3

3a

5

Group "cluster A3"

Group "cluster A1"

3b

4

6

2

3b

G6

G5

Group "cluster A5"

Group "cluster A6"

Group "cluster A2"

Provider

Client

Cluster manager

Fig.3.2. Stepsofanalloation requestmadebyalient.

ifitusesapowerof2lawforthespaedisretization. Providersalsointernallyusethesamelawwhenoering

memoryareas,butprovidethemaximummemorysize,givenbythepoliyused,thatisinferiortotheonethey

wishtooer.

Oneoftheonstraintswexedistosupportthevolatilityofnodeswhihmakeupthelusters. Therefore,

theadvertisementspublishedatatime

t

1

anbeinvalidatthetime

t

2

> t

1

,sineprovidersandisappearfrom

JuxMematanytime. Themehanismusedtomanagethisvolatilityofpeersisbasedonrepublishingtheluster

advertisements wheneverahanging ofthe amountof memoryprovided is deteted. Besides, advertisements

havealimitedbut parameterizablelifetime,soitis neessarytoperiodiallyrepublishthem.

Proessinganalloationrequest. Clientsmakealloationrequestsbyspeifyingthesizeofthememoryarea

theywanttoalloate. Thedierentstepsforsuharequest,numberedontheFigure 3.2,arethefollowing:

1. Thelient

C

ofthe lustergroup

A

1

wantsto alloate amemory areaof 8MB witha redundany

degreeoftwo. Consequently,itsubmitsitsrequesttothelustermanager

G

1

towhihitisonneted.

2. Thelustermanager

G

1

thendeterminesthatthepeerresponsibleofadvertisementshavingamemory

size of 8MB in their listis the luster manager

G

3

, usingthe hash mehanismdesribedpreviously.

Therefore,theluster managerpeer

G

1

forwardstherequestto

G

3

.

3. Thelustermanager

G

3

thendeterminesthat lustermanagers

G

2

and

G

4

maththeriterionofthe

lient,andasksthem toforwardtheirlusteradvertisementtothelient

C

.

4. The lient

C

then hooses the luster manager

G

2

asthe peer having the best advertisement: for

instanetheorrespondinglusteroersahigherdegreeofredundanythanthelusterhandledbythe

lustermanager

G

4

. Thus, itsubmitsitsalloationrequestto

G

2

.

5. Thelustermanager

G

2

reeivesthealloationrequestandhandlesit. Ifitansatisfytherequestthen

itasksoneofitsproviders,forexample

P

,to alloatea8MBmemoryarea. Iftherequestannotbe

satised,anerrormessageissentbaktothelient.

6. If the provider

P

an satisfy this request, it reates a 10 MB memory area, then sends bak the

advertisementof this memoryareato thelient

C

.

P

beomesthe lustermanager ofthe assoiated

datagroup, whih means that it is responsible for repliatingthe data blokstored in that memory

area. Iftheprovider

P

annotsatisfytherequest,anerrormessageissentbaktothelustermanager

G

2

,whihantryotherproviderpeersofthelustergroup.

Ifnoprovidersanbefoundonthelaststepofanalloationrequest,anerrormessageissentbaktothelient.

Thenthelientanrestartthealloationrequestfromstep4,e.g.,withanotherlustermanagermathingthe

requested memorysize. Finally, if no luster manageranalloatethe memoryarea, thelient inreasesthe

requestedmemorysize andrestartsthealloationrequestfrom thebeginning. This anbedone

N

times(for

(8)

3.3. Managing shared data. When a memory area is alloated by a lient, a data group is reated

on the hosen provider and an advertisement is sent to the lient. This advertisement allows the lient to

ommuniatewith thedatagroup. This advertisementispublished at thejuxmem'sgrouplevel,but onlythe

IDofthisadvertisementisreturnedattheappliationlevel. Aesstodatabyotherlientsisthenpossibleby

usingthisID:theplatformtransparently loatestheorrespondingdatablok.

Storageofdatabloksisindependentoflients. Indeed,whenlientsdisonnetfromJuxMem,databloks

still remainstoredin thedata sharingservieontheproviders. Consequently,lientsanhaveaessto data

blokspreviously stored by other lients: they simply need to look for the advertisement of the datagroup

assoiated with the data blok (whose identier is assumed to be known). The map primitive of the API of

JuxMem does this by taking in input the ID of the data blok. In this way, the storage of data bloks is

persistent.

Eahdatablokisrepliatedonaxed,parameterizablenumberofprovidersforabetteravailability. This

redundanydegreeisspeiedasanattribute atalloationtime. Theonsistenyofthedierentopiesmust

thenbehandled. InthisrstversionofJuxMem,theuseofamultiastatthelevelofthejuxmemgroupsolves

thisproblem: thedierentopiesofasamedatablokaresimultaneouslyupdatedwheneverawritingaessis

made. Alternativeonsistenymodelsandprotoolswillbeexperimentedinfurtherversions. Notethatlients

whih havepreviously read a data blok are not notied of this update: lients do notstore a opy of data

blok. Therefore, theresult of areading whih is valid at a time

t

1

, may notbevalid at time

t

2

> t

1

. It is

worthnotingthatthisdierenebetweenlientandprovidersallowstohandleahighnumberoflientswithout

havingtodealwithahighnumberofopiesofdatabloks. Synhronizationbetweenlientswhihonurrently

aessadatablokishandled usingthelok/unlokprimitives.

3.4. Handlingvolatileproviders. Inordertotoleratethevolatilityofpeers,astatirepliationofdata

onaxed andparameterizablenumberofprovidersisnotenough. Indeed,theset ofprovidershostingaopy

of the samedata blokan suessivelybeomeunavailable. A dynami monitoring of thenumber of opies

fordatais thereforeneeded. Consequently,eahdatagrouphasamanager(noted datamanager)whih isin

harge of monitoringthe levelof redundany of the data blok. If this numbergoesbelowthe one speied

bylients, thedata managermust searhand ask aproviderto hostan extraopy ofthe data blok. When

thedatamanagerdeidestorepliateit,itmustrstlokit(internally)inordertomaintainonsisteny. The

providerwhihwillhostthisnewopyisthenresponsibleforunlokingit. A timeout mehanismfollowedbya

ping testisused in ordertodetetiftheproviderbeameunavailablejust beforeunlokingthedatablok. If

itisthease,thenthedatamanagerunloksitselfthedatablok.

3.5. Handlingvolatilemanagers. Ifalustermanagergoesdown,thisouldleadtotheunavailability

ofresouresprovidedbyawholeluster. Theroleoflustermanager(notedmainlustermanager)istherefore

automatially dupliated on another provider of the luster (alled seondary luster manager). Managers

periodiallysynhronizeusing amehanismbasedontheexhangeofprovideradvertisements,in ordertond

out new advertisements published. They an thus both know in a nearly aurate manner the amount of

memoryavailableintheluster. Amehanismbasedonperiodialheartbeatsallowstodynamiallyensurethis

dupliationofluster managers. Suhamehanismisalsoused forthedatamanagers(seeSetion3.4). Note

that,thepossiblehangesofmanagersinthelusteranddatagroups,duetotheunavailabilityofmanagers,

are notseenoutsidethese groups. Theavailabilityof lusters and ofdata bloksis thus maximized, whereas

theperturbationonthelientsideisminimized.

4. Implementationand preliminary evaluations.

4.1. Implementation of JuxMem within the JXTA framework. In order to build a prototype

of the software arhiteture desribed in the previous setion, we have used the JXTA generi peer-to-peer

framework(seeSetion2.3). OurJuxMemprototypeusestherefereneJavabindingofJXTA(whihistoday

theonlybindingompatiblewiththeJXTA2.0speiation).JuxMemiswritten inJavaand inludesabout

50lasses(5000odelines).

JXTA fully meets the needs of JuxMem. Thus, managers of data and luster groups are based on

JXTA's rendezvous peers. Indeed, managers have to know if providersare still alive by using a ping test in

order to manage aluster or ablok of data. This an only be done if providers have previouslypublished

(9)

0

20

40

60

80

100

160

140

120

100

80

60

50

40

30

Seconds elapsed between provider losses

Relative overhead (%)

Fig.4.1. Relativeoverheadduetothevolatilityofprovidersforasequenelok-put-unlok,withrespettoa stablesystem.

managers. Forexample,datamanagershaveto forwardaessrequests, madebylients, toprovidershosting

a opy of the data blok. In the same way, luster managers have to forward alloation requests, made by

lients,toproviders. Clientsandproviderswhihdonotatasdatamanagersforoneorseveralbloksofdata

are based on JXTA's edge peers. Indeed, theydo not have to play a role in the dynami monitoring of the

numberof opies for ablok ofdata in thesystem. Therefore, theydo nothave to storepublished provider

advertisements. Moreover,lientsonlyneedtodisoverandstorelusteradvertisementswhihwillallowthem

toalloatememoryareas. ThevariousgroupsdenedinJuxMemareimplementedbyJXTA'speergroups. The

juxmemgroupimplementsaJXTApeergroupservieprovidingtheAPIofJuxMem(seeSetion3.1). Finally,

theommuniationhannels ofJXTAalso oertheneededsupport forbuildingmultiastommuniationsfor

simultaneouslyupdatingopiesofthesameblokofdata.

4.2. Preliminaryevaluations. Forourpreliminaryexperiments,weusedalusterof450MHzPentium

II nodeswith256MB RAM,interonnetedbya100MB/sFastEthernetnetwork.

WerstmeasuredthememoryonsumptionoverheadgeneratedbythedierentJuxMempeerswithrespet

to theunderlying JXTA peersused to build JuxMem peers. This overheadis reasonable: it rangesbetween

5%and7.4%.

We then measured the inuene of the volatility degree of provider peers on the duration of a sequene

lok-put-unlokexeuted in aloopbyalient. Thissequene in theloopis madeonadata blok storedin

JuxMem. Thegoal of this measure is to evaluate therelative overhead generated by the repliations whih

take plae in order to maintain a given redundany degree for a given blok of data. This repliations are

transparently triggered when the servie detets that a provider holding a data blok goes down. If these

repliationstakeplaewhilealientaessesthedatablokbeingrepliated,theseaessesslowdown.

Thetestprogramrstalloatesasmallmemoryarea(1byte)onaproviderbelongingtolusterandwrites

toitadatablok. Theredundanydegreeissetto3. Thealloationtakesplaeonalusterinitiallyonsisting

of16providersandonelustermanager. 16mahines oftheluster previouslydesribedhostaprovider,one

mahine of the sameluster hosts aluster managerand another mahine of thesame luster hosts alient.

Thelientexeutesa100iterationloop,andeahiterationonsistsofasequenelok-put-unlok.

Duringtheexeutionofthis loop,arandomproviderhostingaopyofthe dataiskilled every

δ

seonds,

where

δ

is a parameter of the experiment. In order to measure only the overhead due to the volatility of

providers,thedatamanageroftheassoiatedgroupisneverkilled.

Figure 4.1showstherelativeoverheadmeasured, withrespet to astable system(i.e. where no provider

goesdownduringtheloopexeution:

δ

=

). Whenthedatamanagerdetetsthatprovidersholdingaopyof

thedatablokhavegonedown,ittriestorepliatetheblokonotheravailableproviders,whiharenotalready

hostingaopyof thedata blok. Toensure theonsistenyof thedata during itsrepliation, lientsare not

(10)

Theurveproleisexplainedbythenumberoftimesthesystemrepliatesthedataonproviders,in order

to maintaintheredundany degreespeiedbythelient(whihis3forthis test). Forthewhole durationof

ourtest,thenumberoftriggeredrepliationsisgivenin theTable4.1asafuntion ofthe

δ

parameter.

For highly volatile systems(

δ <

80

s), the number of repliations triggered beomes higher than 2 and

therelativeoverheadbeomessigniant. For

δ

= 30

s, itreahesmorethan65%(10repliations triggered).

However, in arealisti situation, thenodevolatility onthe arhiteturewe onsider is typially alot weaker

(

δ

80

s). Forsuh values, the reonguration overhead is less than 5%. Wean reasonably say that the

JuxMemplatforminludesamehanismwhihallowstodynamially maintainaertainredundanydegreefor

databloks,in ordertoimprovedataavailability,withoutsigniantoverhead, whileauthorizingnodefailures.

Table4.1

Numberoftriggeredrepliationswhenthevolatilityofproviderpeersevolvesfrom160to30seonds.

Seonds 160 140 120 100 80 60 50 40 30

Numberof triggered repliations 1 1 1 1 2 2.5 5 5.5 10

5. Conlusion. Thispaperdenesahierarhial arhitetureforadatasharingserviemanagingmutable

datawithinagridonsistingofafederationoflusters. Thisarhiteturehasbeendesignedusingapeer-to-peer

approah, and demonstratedthrough theJuxMem platform. Not only the arhiteture allows to redue the

numberofmessagestosearhforapiee ofdata,thanksto ahierarhialsearhsheme,but italso allowsto

takeadvantageof spei features of the underlying physial arhiteture. Themanagement poliy for eah

lusteranbespeitoitsonguration,forinstaneintermsofnetworklinkstobeused. Thus,somelusters

ouldusehigh-bandwidth,low-latenynetworksforintra-lusterommuniation,ifavailable.

TheJuxMemuseranalloatememoryareasinthesystem,byspeifyinganareasizeandsomeattributes,

suhasaredundanydegree. ThealloationprimitivereturnsanIDwhihidentiestheblokof data. Then,

dataloalization andtransferisfully transparent,sinethis IDissuientin orderto aessand manipulate

theorrespondingdatawhereveritis: noIP addressnorportnumberneedstobespeiedat theappliation

level.

Ourarhiteturesupportsthevolatilityofalltypesofpeers. Thiskindofvolatilityisalsosupportedin

peer-to-peersystemssuhasGnutellaorKaZaA, whih enhane dataavailabilitythanksto redundany. However,

this isasideeet oftheuserations. Inontrast,oursystematively takesinto aountthisvolatility: this

allowsnotonlytomaintainaertaindegreeofdataredundany(asinsystemslikeIvyorCFS[5℄),butalsoto

supportthevolatilityofpeerswithspei responsibilities(e.g., lustermanagers,ordatamanagers).

The implementation of a JXTA-based prototype has shown the feasibility of suh a system. However,

note that thedesign of JuxMem isnot dependenton JXTA. Atually, other librariesould beused, suh as

JavaGroups[13℄. WeusedtheJavaversionofJXTA,sinethisisthemostadvanedbindingofJXTA,theonly

oneompatible withtheJXTA2.0speiation.

The modular arhiteture of JXTA allowsto easily add and remove serviesand/or protools, inluding

ommuniationprotools. This should eventuallyallow the platform to take advantage of high-performane

networks(suhasMyrinet orSCI) fordatatransfer. Weplan to addressthis problem in thefuture. Wealso

plan to use JuxMem as an experimental platform for dierent data onsisteny strategies supporting peer

volatility,in ordertobuild aongurable, adaptivedata sharingservieformutabledata. Thenal goalis to

integratethisservieintolarge-saleomputingenvironments,suhasDIET[4℄,developedatENSLyon. This

willallowanextensiveevaluationoftheservie,withrealistiodes,usingvariousdataaessshemes.

REFERENCES

[1℄ B. Allok, J.Bester, J.Bresnahan, A. Chervenak, L. Liming, S. Meder and S. Tueke, GridFTP Protool

Speiation,GGFGridFTPWorkingGroupDoument,Sept.2002.

[2℄ A.Bassi,M.Bek,G.Fagg,T.Moore,J.Plank,M.SwanyandR.Wolski,TheInternetBakplaneProtool: Astudy

inresoure sharing,In2ndIEEE/ACMInternational SymposiumonCluster Computingandthe Grid(CCGrid2002),

pages194201,Berlin,Germany,May2002.IEEE.

[3℄ J.Bester, I. Foster, C. Kesselman, J. Tedeso and S. Tueke, GASS:Adata movement andaess servie for

(11)

[4℄ E.Caron,F.Desprez,F.Lombard,J.-M.Niod,M.QuinsonandF.Suter,Asalableapproahtonetworkenabled

servers,InB.MonienandR.Feldmann,editors,8thInternationalEuro-ParConferene,volume2400ofLetureNotes

inComputerSiene,pages907910,Paderborn,Germany,Aug.2002.Springer-Verlag.

[5℄ F.Dabek,F.Kaashoek,D.Karger,R.MorrisandI.Stoia,Wide-area ooperativestoragewithCFS,In18thACM

SymposiumonOperatingSystemsPriniples(SOSP'01),pages202215,ChateauLakeLouise,Ban,Alberta,Canada,

Ot.2001.

[6℄ A.Datta,M.HauswirthandK.Aberer,Updates inhighlyunreliable,repliated peer-to-peersystems,In23rd

Interna-tionalConfereneonDistributedComputingSystems(ICDCS2003),pages7687,Providene,RhodeIsland,USA,May

2003.

[7℄ I.FosterandC.Kesselman,Globus: Ametaomputinginfrastruturetoolkit,TheInternationalJournalofSuperomputer

AppliationsandHighPerformaneComputing,11(2):115128,1997.

[8℄ J.Kubiatowiz,D.Bindel,Y.Chen,P.Eaton,D.Geels,R.Gummadi,S.Rhea,H.Weatherspoon,W.Weimer,

C.WellsandB.Zhao,OeanStore:Anarhitetureforglobal-salepersistentstorage,In9thInternationalConferene

onArhitetureSupportforProgrammingLanguagesandOperatingSystems(ASPLOS2000),number2218inLeture

NotesinComputerSiene,pages190201,Cambridge,MA,Nov.2000.Springer.

[9℄ A.Muthitaharoen,R.Morris,T.M.GilandB.Chen,Ivy:Aread/writepeer-to-peerlesystem,In5thSymposium

onOperatingSystemsDesignandImplementation(OSDI'02),Boston,MA,De.2002.

[10℄ A.Oram,Peer-to-Peer: Harnessing thePowerof DisruptiveTehnologies,hapterGnutella,pages 94122,O'Reilly,May

2001.

[11℄ J.Proti¢,M.Tomasevi¢andV.Milutinovi¢,DistributedSharedMemory: ConeptsandSystems,IEEE,Aug.1997.

[12℄ S.Rhea,P.Eaton,D.Geels,H.Weatherspoon,B. ZhaoandJ.Kubiatowiz,Pond: theoeanstoreprototype,In

2ndUSENIXConfereneonFileandStorageTehnologies(FAST'03),Californie,CA,USA,Mar.2003.

[13℄ JavaGroups, http://www.javagroups.o m/j avag roup sne w/do s/ inde x.ht ml

[14℄ TheJXTAprojet, http://www.jxta.org/

[15℄ JXTA v2.0protoolspeifiation, http://spe.jxta.org/no nav/ v1.0 /do boo k/J XTAP roto ol s.pd f, Mar.2003.

[16℄ KaZaA,http://www.kazaa.om/

[17℄ Napsterprotoolspeifiation, http://opennap.sourefo rge .net /nap ste r.tx t,Mar.2001.

[18℄ TheNetSolveprojet, http://il.s.utk.edu/nets olv e/

Editedby: WilsonRivera, JaimeSeguel.

Reeived: June26,2003.

References

Related documents

To analyze how the humor works in Burger King’s tweets, this study uses the extended version of General Theory of Verbal Humor (GTVH) by Attardo (2001) &amp; Tsakona (2013) and the

Based on the results of the automotive case study and Wang’s (2006) research, the inter-firm CobiT framework aims to improve the performance of supply chain management

To attain that objective the standard neo-classical Solow growth model at steady-state level was taken as theoretical framework and made a production function

Tenslotte moet worden opgemerkt dat het herstel van de financiën al spoeclig weer volleclig teruet werd gedaan door Antoons enorme uitgaven voor drie Luxemburgse veldtochten, die

We take advantage of an unprecedented change in policy in a number of Swiss occupational pension plans: The 20 percent reduction in the rate at which retirement capital is

 The Insurance Regulatory and Development Authority of India (IRDA) defines TPA as a Third Party Administrator who, for the time being, is licensed by the Authority, and

Distance Education Support: Law librarians can provide an additional point of connection to distance education students by providing them with both access to print and digital