• No results found

CiteSeerX — Garbage Collection for Storage-Oriented Clusters

N/A
N/A
Protected

Academic year: 2022

Share "CiteSeerX — Garbage Collection for Storage-Oriented Clusters"

Copied!
10
0
0

Loading.... (view fulltext now)

Full text

(1)

William Brodie-Tyrrell,Henry Detmold, Katrina Falkner, David S. Munro

S hool of Computer S ien e, University of Adelaide

Adelaide SA 5005, Australia

{william, henry, katrina, dave} s.adelaide.edu.au

Abstra t

Storage-oriented lusterspresentunique hallengesto

the implementation of storage management. Su h

lustersmanageavastamountofdata,mostofwhi h

islo atedonse ondarystorage. Manualstorageman-

agement in storage-oriented luster environments is

omplex, error-prone and tedious. As a result there

is a lear need for automati storage management

(garbage olle tion)forsu h lusters.

Thegoalsofagarbage olle torforuseinastorage-

oriented luster are safety, ompleteness and s ala-

bility in the fa e of distributed y les of garbage in

se ondarystorage. Ofthefewextantdistributed se -

ondarystoragegarbage olle tors,nonemeetallofthe

statedgoalswhilst alsooperatinge iently.

This paperdes ribesthe design and implementa-

tion of anew distributed garbage olle tor basedon

the train algorithm, spe i ally for use in storage-

oriented lusters. The olle tor presented here ex-

tendsthetrainalgorithm,employinganasyn hronous

distributed termination dete tion algorithm for iso-

lated train dete tion, a me hanism for deferring the

update of metadata and a new external root tra k-

ingme hanismtopermitintera tionwith lientsthat

a he andswizzlepointers. Ourexperimentsdemon-

strate that these extensions su essfully adapt the

train algorithm for e ient operation in a storage-

oriented luster, fullling the stated goals of safety,

ompletenessands alability.

1 Introdu tion

Thefo usofresear hinto luster omputinghas on-

entrated on omputational lusters, while storage-

Copyright 2004, AustralianComputerSo iety,In .Thispa-

per appeared at Twenty-Seventh Australasian ComputerS i-

en eConferen e(ACSC2004),Dunedin,NewZealand,January

2004. Conferen es in Resear h and Pra ti e in Information

Te hnology,Vol. 26. VladimirEstivill-Castro,Ed. Reprodu -

tionfora ademi , not-forprotpurposespermitted provided

oriented lusters have re eived less attention. Clus-

tering of storage provides s alability of I/O per-

forman e without the use of expensive spe ial-

purpose hardware. Computational lusters provide

super omputer- lassperforman eat ommodity ost,

similarly storage-oriented lusters havethe potential

to provide mainframe- lassI/O for ommodity ost.

Signi antly, thereis agrowingneedforsu h luster

te hnologyto support an important lass ofappli a-

tionsthatexhibitsstrongly onne tedgraphsofdata

obje ts within very large data sets. Finite Element

Analysis for the simulation of physi al systems is a

widely usedmemberof this lass. However,storage-

oriented luster te hnology is urrently less mature

than omputational luster te hnology and many of

the key problems in the implementation of storage-

oriented lustershaveyettobeadequatelyaddressed.

Ourresear hattempts tosolvekeyproblemsin stor-

agemanagementforstorage-oriented lusters.

Ourmodelforprogrammingstorage-oriented lus-

tersisbasedontheorthogonalpersisten eabstra tion

overstorage(Atkinson, Bailey, Chisholm, Co kshott

& Morrison 1983), permitting the form of programs

to be independent of the lifetime of data they ma-

nipulate and the lifetime of data to be independent

of its type. Extending this abstra tion to distribu-

tionmakestheformofaprogramindependentofthe

underlying stable storage me hanism and the num-

berofstoragesites parti ipating. Thebenetsofthe

persisten e abstra tion have been well do umented

(Atkinson &Morrison1995)and areparti ularlyap-

propriateforthe lassoflong-runningappli ationsex-

pe tedonstorage-oriented lusters.

To manage the storage resour es of long-running

appli ations,itisessentialtoidentifyandre laimthe

storageallo ated to those obje tsthat areno longer

needed. The onsequen eofnotdoing so isthat ap-

pli ationswillfailunexpe tedlyandunne essarilydue

to exhaustion of freespa e; the approa h ofdis ard-

ingtheentireaddressspa eattheendofexe utionis

(2)

tifyingtheobje tsthatarenolongerrequiredsothat

theymaysafelybere laimedisa omplex,error-prone

and tedious task even for small programs where the

data stru tures are simple. Furthermore, when the

majority of those obje ts are lo ated in se ondary

storage,therearesigni antengineering hallengesto

performing both identi ation and re lamation e-

iently. Finally,as ertaininglivenessofanobje tina

distributed system(su h asa luster) requiresglobal

knowledge. However, in order to maintain s alabil-

ity(andhen ee ien y)thisglobal knowledgemust

bedistilled overaseries ofasyn hronouslo al steps.

Moreover, these steps should involve only the mini-

mal subset of sites ne essaryto identify aparti ular

garbage omponent.

Anyimplementationofstoragemanagementatthe

appli ation level must orre tly address all of the

problems identied above. At best, this burdens

appli ation programmers with a tedious and time-

onsumingproblem. Amorelikelyout ome(andone

whi hismu hworse)isthattheappli ationlevelstor-

agemanagementisin orre tinsomeaspe t,resulting

in unsafe re lamationof liveobje tsor failure to re-

laimgarbageorboth. Itisthereforevirtuallyessen-

tial to takeasystemati and automati approa h to

storage management, namely, theimplementation of

adistributedgarbage olle tor.

Thetrainalgorithm (Hudson &Moss1992,Selig-

mann & Grarup 1995), from whi h our resear h is

derived, exhibits the desirable properties of safety,

ompleteness,asyn hronyands alability. Thesealign

withthemajorityoftherequirementsidentiedabove,

hen e justifying the use of this garbage identi-

ation te hnique. The train algorithm has previ-

ously been shown to be amenable to implementa-

tion in a wide range of ar hite tures, ea h with

their own unique set of hallenges. Existing adap-

tations of the train algorithm in lude generational

(Hudson&Moss1992),unipro essorpersistent(Moss,

Munro & Hudson 1996), distributed main memory

(Hudson, Morrison, Moss & Munro 1997) and dis-

tributed shared memory (Munro, Falkner, Lowry &

Vaughan 2001) garbage olle tion. The hallenge is

toadaptthetrainalgorithmtouseinstorage-oriented

lusters,a ontextwhereithasnotyetbeenseen.

The ontributions of this paper are the design,

implementation and initial evaluation of a new dis-

tributedse ondarystoragegarbage olle torbasedon

the train algorithm. This new olle tor adapts and

ombinesanumberofprovente hniquestoprovide:

 safetythroughpointer tra king,

Node Boundaries Persistent Heap

Stable Storage Object Caches Mutators

Figure1: LayeredPro essBaseAr hite ture

 ompleteness through isolated train dete tion,

underpinned by Safra's Algorithm (Dijkstra

1987),

 toleran eofse ondarystoragelaten ythrougha

deferredmetadataupdatesimilartothatusedin

PMOS(Mosset al.1996),and

 s alabilityofpersistentgarbage olle tionbyin-

rementallygatheringexternalreferen einforma-

tion.

To provide an e ient programming abstra tion, it

is ne essaryforsites to a he datain primarymem-

ory. A further ontribution of the distributed store

des ribedhereis anewte hniquefortra kingpoint-

ersin lient a hes, whi h isne essaryto permitthe

de ouplingofgarbage olle tioninthepersistentheap

fromthat inthe lient a hes.

This paperdes ribes thear hite ture and experi-

mentalplatforminwhi hthegarbage olle torexists,

the hallenges onfronting su h a olle tor and their

solutionsaswellasinitialexperimentationperformed

to onrmthevalidityofourapproa h.

2 Ar hite ture

Our ar hite ture in ludes distribution of both stor-

age and omputation a ross the nodes of a lus-

ter. The omputational model is based on the Pro-

essBase (Morrison, Balasubramaniam, Greenwood,

Kirby, Mayes, Munro & Warboys 1999) program-

ming language. The storage model is an orthog-

onal persisten e (Atkinson et al. 1983, Atkinson &

Morrison1995)abstra tionsupportedbythePro ess-

(3)

heaplayer. Inparti ular,wehavedesignedandimple-

mentedanovelse ondarystoragestoragegarbage ol-

le tor, DPMOS (Distributed Persistent Mature Ob-

je tSpa e),basedonthetrainalgorithm.

These ondarystoragear hite tureisasetofpeer

sites that are in ommuni ation with ea h other to

synthesize a single address spa e from separate sta-

ble storageheld on ea h siteas perFigure 1. Client

virtual ma hines an onne t to any of the parti i-

patingstorenodes;astoresiteneednothavebotha

onne ted lientand lo al storage. The stablestores

are independent and do not ommuni ate with ea h

other. Obje t lo ations are en oded in their identi-

ers,providingatrivialresolutions heme.

Ea h virtual ma hine has a write-ba k obje t

a he and a olle tionof mutatorthreads that oper-

ate over the a hed obje ts. Pro essBase is a per-

sistent language system designed for use in hyper-

programming (Kirby, Connor, Cutts, Dearle, Farkas

&Morrison1992)andpro ess-modellingsystemsand

for whi h there is ongoing resear h into distribution

of omputation and storage at both the persistent

heap and a he levels (Howard, Detmold, Falkner &

Munro2003,Falkner,Detmold, Munro&Oldsn.d.).

ThePro essBaseprogrammingmodelis thatof mul-

tiple on urrentmutatorsthatoperatedire tlyonob-

je tsheldinthe a hes.

Ca he oheren yistheresponsibilityofthe lients,

however ertaineventsrelatingtothe opyingandde-

stru tionofreferen esmustbereportedtothepersis-

tentheaptopermitsafegarbage olle tiontherein.

Our se ondary storage garbage olle tion algo-

rithm operates within the persistent heap layer. A

ommonapproa htoa hievings alabilitythroughin-

rementalityinagarbage olle tor(distributedoroth-

erwise) is to partition the obje t spa e and re laim

garbageinpartitionsindividually. Distributeda y li

garbage is re laimed by repeated invo ation of the

lo al garbage olle torin onjun tionwith apointer

tra kingalgorithm that maintains a onservative set

ofexternalroots forea hpartition. Distributed y li

garbage is re laimed using an asyn hronous imple-

mentationofthetrainalgorithm.

Stable storage is implemented at ea h site using

anafter-imageshadow paging(Gray, M Jones,Blas-

gen, Lindsay, Lorie, Pri e, Putzolu & Traiger 1981)

me hanismand page a he. Theshadowpagedstore

provides stability with he kpoint semanti s and its

use of a a he permits the dire t mapping of data

stru tures in the heap onto the page store's address

spa ewithoutsigni antperforman eloss. The a he

supportspage-pinningtoexpli itlyretain riti aland

Layer n Load(n−1) = Cost(n)

Layer n+1

Layer n−1 Load(n) = Cost(n+1)

Figure 2: Layeringin MeasurementModel

frequently-a esseddatastru turesinthe a he. The

stable storage spa e is a at address spa e with no

on eptofobje ts,referen esorrea hability.

Communi ation betweensites is by asyn hronous

messagepassingoveranetwork that preservesFIFO

orderingofmessagesbetweenendpoints. Thesystem

assumesreliablemessagedelivery,messagelossresults

in gra efulshutdown. Similarly, thealgorithm isnot

rashtolerantandthelossofasitealsoleadstogra e-

fulshutdown. Finally,Byzantinebehaviourisnotper-

mittedanywhere inthesystem.

2.1 Measurement

Thear hite turehasbeendesignedinsu hawayasto

fa ilitatemeasurementandevaluation. Inparti ular,

thelayeringofthear hite turepermitsthesubstitu-

tionofanylayerforthepurposesof omparisonandis

thebasisforquantitativeperforman eands alability

measurements. Thisenablesoursystemtobeusedin

adetailed investigation into distributed garbage ol-

le tion.

Byfo usingmeasurementona tivityat theinter-

fa es between layers, it be omes possible to dire tly

omparetheee tsofdierentpoli ieswithinalayer

oreventwoentirelydierentimplementations ofthe

samelayer.

Ea hlayerinthear hite tureprovidesaservi eto

layersaboveit by omposing ame hanism andpoli-

iesofitsownwithrequestsoflowerlayers. Theload

uponalayern isdened asthesetof requestsmade

of that layer while the ost of a layer is the set of

requests it makes of the layer below itself, see Fig-

ure 2. Ea h layertransforms its loadinto a ost by

theappli ationof poli ies tome hanisms,with those

poli iespossiblyspe iedbyhigherlayersin the ase

of a ompliantsystem(Morrison, Balasubramaniam,

Greenwood, Kirby, Mayes,Munro &Warboys2000).

There maybe multiple independent systemsat ea h

(4)

sidered to be a layer below the Persistent Heap yet

independentofea hother. Inprin ipletheloadupon

alayer ould be the unionof a esses from multiple

higherlayersbut thisdoesnoto urin theDPMOS

ar hite ture: thereareno ir umstan eswheremulti-

plesystemsinanupperlayeroperatedire tlyupona

singlesystemin alowerlayer. Thelong-termgoalof

this resear histo attempt tomeasure quantitatively

thetransformationof loadinto ostat thepersistent

heaplayerandtherebytomake omparisonsbetween

dierentdistributed persistentgarbage olle tors.

In the ase of the distributed persistent garbage

olle tor, the load is represented in terms of obje t

read and write requests and pointer tra king events

from a hes;the ost anbemeasuredintermsofthe

network tra generated and the load pla ed upon

ea h ofthestable stores.

3 Challenges

Thedesign andimplementationof adistributed per-

sistent garbage olle tor for use in storage-oriented

lustersposearangeof hallengesthatareaddressed

byourresear h. Theserelateto:

 The e ient and a urate tra king of the re-

ation, destru tionandtransmissionofreferen es

externaltoapartition.

 A essingstable datainsu hawayastobetol-

erantofse ondarystoragelaten y.

 The e ient operation of as alable distributed

algorithm to a urately identify distributed y-

les.

This se tion des ribes in detail the three hallenges

addressedbyourresear h.

3.1 PointerTra king

Toperform safegarbage olle tionof apartition, in-

formation regarding referen es into that partition is

required. Whilst it is in prin iple possible to a -

quirethisinformationviaaglobaltra e,this renders

thealgorithmuns alable. Hen e,tosupports alabil-

ityof olle tion,inter-partitionreferen einformation

must be maintained in rementally. This pro ess is

known aspointer tra king. The algorithm des ribed

in (Hudson,Morrison,Moss&Munro1998)provides

an e ientsolution to the pointer tra king problem

but is suited only to a single layer store where ref-

eren es are not a hed at a higher layer. However,

te ture,thereforeitisne essaryto adaptthepointer

tra king algorithm to ope with referen es held in

a hes.

Fors alability, itis desirableto de ouplegarbage

olle tionin a hesfromgarbage olle tionintheper-

sistentheapsothat bothmayoperateindependently.

Thisensuresthe s alabilityofthe persistentgarbage

olle tor be ause it does notrequiresyn hronisation

with and interruption of the a hes. For these rea-

sons, a new pointer tra king me hanism is required

that en ompasses lient a hesand is aware of their

behaviour.

3.2 Laten y Toleran e

Garbage olle tion in primary memory is a well-

investigated area (Wilson 1992). In this eld, algo-

rithms typi allyassumethat storageisrandomlyad-

dressable with littlepenalty, i.e. random a esses to

storage have low laten y. Clearly, this assumption

doesnotholdfordis storage.Forase ondarystorage

garbage olle tortooperatee iently,itmustbeim-

plementedsoastotakea ountofthereadandwrite

laten yofse ondarystorageandtominimise theim-

pa tofthatlaten yonoverallperforman e. Themain

strategiesavailabletotheimplementationare a hing

inmainmemoryandoptimisinga esspatternssoas

toprefersequentiala essoverrandoma ess.

Thepointertra kingalgorithmrequiredtosupport

partitioned olle tion entailsthe reation and main-

tenan e of metadata, re ording inter-partition refer-

en es. As with all other data, this metadata must

be kept on se ondarystorage. There is at least one

metadata update for every inter-partition referen e

update. Furthermore,agivenreferen eupdateandits

asso iatedmetadataupdatesarephysi allydispersed.

Theseproperties ompli atetheimplementationofan

approa hto maintainingmetadatathat istolerantof

high storage laten ies, whilst simultaneouslymaking

itmoreimportanttoadopt su happroa hes.

3.3 Colle tion ofCy li Garbage

Referen e ountingalgorithms,su haspointertra k-

ing, are inadequate to dete t y les of garbage that

span partitions. In a main memory olle tor one

might simply ignore su h y les sin e they will dis-

appear whenthe programexits. This isnotaviable

design hoi einastorage-oriented lustersin ethead-

dressspa e is persistentand so ompleteness is thus

a ne essary requirement. Hen e, an additional dis-

tributedalgorithmisrequiredtohandlethis aseand

(5)

le dete tion algorithm must be e ient and s al-

able. This isan inherently hallengingproblem that

has been the subje t of extensive resear h (Tel &

Mattern 1991, Bla kburn, Hudson, Morrison, Moss,

Munro & Zigman 2001, Lowry& Munro 2002, Nor-

ross,Morrison,Munro&Detmold2003).

4 Implementation

Herewedes ribeourapproa h tosolvingthespe i

hallenges of implementing a safe, omplete, asyn-

hronous and s alable distributed persistentgarbage

olle torposedinthepreviousse tion.

The generi train algorithm, on whi h our olle -

tor is based, operates as follows. Obje tsare stored

in ars (partitions),of whi h therearemanypersite

andwhi hdonot rosssiteboundaries. Ea h ar an

be olle ted independently of any other ar in the

system and the inter-partition referen e information

(from thepointer tra kingalgorithm) requiredto do

thisiskeptinrememberedsets(remsets). Ea h arbe-

longstoatrain,whi hmayspanmultiplesites. When

a ar is olle ted, the rea hable obje tsin it are re-

asso iated ( opied) into other ars, possibly in other

trains, depending on the information in the remset

and then the spa e o upied by the olle ted ar is

re laimed. Thereasso iationpoli ysele tsadestina-

tiontrainforea hobje tsothatdistributed y lesare

marshalledinto asingletrain. In addition,rea hable

obje tsareeventuallymovedintoothertrains,leaving

thistrainisolated. Trainisolationisastableproperty

whi h maybeestablishedusingaDistributedTermi-

nationDete tion(DTD) algorithm.

4.1 PointerTra king

There are two ases where pointer events requiring

tra kingo ur:

 Obje twritesissuedby a hesandpertainingto

obje ts on a site other than the one the a he

is onne ted to (remote writes). A remote ob-

je twrite involvesthetransmissionofreferen es

from a a he to a persistent heap site and the

reationofreferen esinthelatter;this ase(the

reationand destru tionof referen esin theob-

je twritten aswellasthepresen e ofreferen es

innetworkmessages)ishandledbyanimplemen-

tation of the pointer tra kingalgorithm dened

in(Hudson etal.1998).

 Remoteobje treads issuedby a hes. These in-

volve the reation of referen es in a a he and

mote. Be ause the referen es reatedare notin

the persistent heap, they annot be tra ked by

Hudson et. al.'salgorithm. Oursolutionto this

isdes ribedbelow.

Ca hes in our ar hite ture swizzle (translate) refer-

en esfromthepersistentheap'sformattotheirnative

ma hine address format in order to provide a ept-

able omputational performan e in the virtual ma-

hine. Our approa h is to tra k referen es as they

enter (are faulted into) and leave (are purged from)

a hes. Swizzling allows us to ignore the referen e

reationsanddeletions performedbymutators.

The a he pointer tra king algorithm must a u-

ratelyreportwhenanon-zero ountofreferen estoa

givenobje tareheldbyany a he. Furthermore,this

mustbedonewithoutsyn hronisingwith a hesand

withoutinformation regarding mutations of swizzled

referen es.

Referen es are tra ked separately, depending on

whether or not they are swizzled, only the non-

swizzled referen es are ounted by this algorithm.

Referen estoanobje t anbeswizzledonlyifa opy

oftheobje tisheldbythe a he,sothepointertra k-

ingalgorithm onservativelyassumesthatifanobje t

is present in the a he thenthere areswizzled refer-

en es to it. If anobje t is purged to makespa e in

the a he,allreferen estoitmustbedeswizzled(and

therefore ounted).

The Pro essBase virtual ma hine (VM) operates

only on swizzled referen es and sohas no dire t in-

tera tion with the a he pointer tra king algorithm.

Sin e the pointer tra king algorithm is only invoked

during obje tfault and purge, its impa ton theop-

erationofthevirtualma hineis minimal.

A fault willo urat any pointwhere theVM at-

temptstodereferen eanunswizzledpointertoanob-

je tnotpresentin the a he. Iftheobje tisalready

present, the pointer is swizzled, resulting in aminus

message. Oursystem uses asimple poli y by whi h

purgesaretriggeredonlyby a heexhaustion. Ifswiz-

zledpointersexist inamodiedobje ttobepurged,

theymustbeunswizzled(requiringplusmessages)be-

foretheobje tiswritten ba k.

Forea hobje t ontainedin orreferredtoby one

ormore a hes,abooleanvalueand ounter,ea hve -

torisedby a heidentity,iskept. Thesearereferredto

asR

b

[X℄ and R

[X℄ respe tively. R

b

[X℄denotes the

presen eofRin a heXandR

[X℄ ontainsa onser-

vativeestimateof thenumberofunswizzledpointers

toRin X. Ifeitherofthesevaluesarenon-zero,Ris

onsidered live and annot be re laimed by the per-

(6)

Site A Site C

In Cache on Site B

P1 P2

R Q

p2 p1 r

Figure3: ExampleObje tGraph

C A

{r,R}

R {−,R,B}

{+,P2,B}

{+,P1,B}

3 5

1

2 4

B

{f,R,B}

Figure4: FaultSequen e

Thetra kingofreferen esinthe a hesisdes ribed

byfourevents:

 Plus: {+, R, B} indi ates the reation of an

unswizzledreferen eRat a hesiteB;it auses

R

[B℄ tobein remented.

 Minus: {-,R,B}indi ates thedestru tionofan

unswizzledreferen eRat a hesiteB;it auses

R

[B℄ tobede remented.

 Fault: {f,R,B}indi atesthata opyofRisheld

in a heB;it ausesR

b

[B℄to beset.

 Purge: {p, R, B} indi ates that B has lost its

opyofR ;it ausesR

b

[B℄tobe leared.

Considertheexample obje t graphofFigure 3. Ob-

je tRresidesonsiteAandobje tsP1andP2reside

onsiteC. A a heatsiteBholdsareferen er(toR )

in obje tQ. Figure4andFigure 5des ribethefault

A C

R

{p,R,B}

{−,P,B}

1 3 2

B

Figure5: PurgeSequen e

When a mutator on B dereferen es the referen e

toRinQ,thereferen emustbeswizzled. Forthisto

o ur,R must befaulted in (Figure 4) so B sendsa

read request (1) toA. Thefault event{f ,R, B} (2)

o urs at A and plus events are generated at A for

everyreferen e(p1andp2in Figure3) inR . Also,a

messageissent(3)tothehomesite C ofea h obje t

Pnforwhi haplusevento urred. The ontentsof

Rarethentransmitted(4)toBwhi h answizzlethe

referen ein Q,resulting in thetransmission (5) ofa

{-,R,B}messagetoA.

WhenBpurgesR(Figure5),itwritesthe ontents

ba k(1)toAifRwasmodied. Thisinvolvespointer

transmissionandthe reationanddeletionofpointers,

all of whi h are tra ked by the proto ol of (Hudson

et al. 1998). Minus messages (2) are generated for

ea h unswizzledreferen e in R and A is notied (3)

thatR hasbeenpurgedfrom B.

Our ar hite ture uses a he kpoint model of sta-

bility and permits the sharing of referen es between

obje t a hes; our implementation of the persistent

heap is ne essarily onservative in its treatment of

referen es held by lient a hes. In ontrast, trans-

a tional systems introdu e a degree of syn hronisa-

tionbetween lient and server(Amsaleg, Franklin &

Gruber1995),permittingtheservertomakeassump-

tionsaboutwhenreferen es anbe reated.

4.2 Laten y Toleran e

Inaddressingthelaten yofse ondarystoragea ess,

ourfo usisonredu ingthe ostsofmetadataupdate.

Wepursuetwoapproa hestothis:

 a hing ofmetadatatoamortizese ondarystor-

(7)

 bat hing of updates to metadata to enable se -

ondary storage writes to be performed sequen-

tially.

Eagerlyupdatingremsetsonse ondarystorageis in-

feasible be auseupdates are requiredat upto three

addressesnotspatially orrelatedwithea hother: the

referen e update, a de rement at the remset for the

pointer overwritten and an in rement at the remset

forthepointer reated. Thisimpliesuptothreeseek

operationsforea hpointer update andresultsinun-

a eptablethroughput. ThesolutionusedinDPMOS

istodeferandbat hupdatesuntilanopportunetime

orwhen the log of deferred updates hasbe ome too

large(apoli yde ision)andmustbepurged. Thisso-

lutionmeansthatmultipleupdatestoasingleremset

are bat hed into a single dis -blo k write whi h re-

quiresonlyasingleseek. Ourresultsshowthatwhere

ablo kwriteisdominatedbyseektimesu hbat hing

improvesperforman e.

DPMOSremsetshaveatreestru tureinmemory,

permitting O(logn) sear h and update but are at-

tenedforpla ementonse ondarystorage. There isa

xed size a he of remsets in memory. Remsets are

purgedfromthe a hewhenanewremsetisrequired

tobereadin.

If areferen eupdate o urs and the desiredrem-

set is not in memory, the update is appended to a

stru ture in memoryreferredtoasaref set. This

set ontains a ve tor of re ords, ea h des ribing an

in rementorde rementoperationonaremset entry.

ref setsaretheme hanismfordeferringandbat h-

ing updates to remsets, the assumption being that

many operationswill bestored in a ref set before

the orrespondingremset isloaded andupdated. On

loadingaremsetintothe a he,anyappli ableref

is applied. When the spa e o upied by ref sets

be omestoolarge(again,apoli y de ision),remsets

arefaultedin andupdated.

ref sets are always applied to their respe tive

remsets on he kpoint and therefore never resideon

stablestorage. Ourinitialexperimentsleadusto be-

lieve that he kpoint speedmay be improved at the

ostof omplexityandre overyspeedbystoringref

setsstablyinsteadofeagerlyapplyingthemat he k-

point. Wewillinvestigatethisalternativeapproa hin

futurework.

The stable storagesubsystem used by DPMOS is

ashadowpagingsystemwithapage a heand he k-

pointdurabilitysemanti s. Partitionsbeing olle ted

arepinnedintothe a hetopermitagarbage olle -

anditsin remental olle tionmeansthatonlyasmall

amountofpersistentspa eneedstobein a heatany

giventime.

PMOS (Moss et al. 1996) is a unipro essor per-

sistent implementation of the train algorithm that

usesdeferredmetadataupdatesto gainperforman e.

PMOSmaintainstwosu h a hes:theref andlo

sets. TheformerissimilartoDPMOS'sref system

andthelatterdes ribesobje tmotionduetogarbage

olle tion. DPMOS does not require a lo set as

thereisanindire tionbetweenpersistentobje tiden-

tiersand theirphysi al lo ation;in ontrast,PMOS

must update all referen es to an obje t when it is

movedbythegarbage olle torandthelo setdefers

theseupdates.

4.3 Colle tion ofCy li Garbage

The olle tionof distributed y les requires ame h-

anism to identify isolated trains. The reasso iation

poli ies employed by the train algorithm ause dis-

tributed y les to move into a single train and re-

main there if unrea hable while live obje ts will be

reasso iatedoutofthetrain. A partitionalongtrain

boundariesis thereby formedbetweenrea hable and

unrea hableobje ts. There lamationofawholetrain

isthepro essbywhi hinter-partitionanddistributed

y lesarere laimed.

Liveness of a site in a omputation subje t to

DTD is equivalent (Tel & Mattern 1991, Bla kburn

et al. 2001, Lowry & Munro 2002) to the presen e

of rea hable obje ts in a partition of a distributed

garbage olle tor. Severalsu halgorithmsalreadyex-

istand anbereadilyadaptedtooursystemtodete t

isolation dete tionfor thepurpose of garbage olle -

tion. In(Nor rossetal.2003),thenotionof lubrules

isinvestigatedtopermittheuseof dierentDTDal-

gorithmsforthepurposeof trainisolationdete tion.

Ea h olle tor partition orresponds to a DTD site

andatrainisequivalenttotheDTDalgorithm's om-

putations ope.

Reasso iationofanobje tinto atrain onstitutes

spontaneous a tivity in the ontext of a DTD algo-

rithmso is notpermissible ifa DTD algorithm is to

beused todete ttrainisolation. Thereforeatrainis

losed (Lowry&Munro2002)duringtheexe utionof

theDTD algorithm,therebyprohibitingthereasso i-

ationof obje tsintothat train. ShouldtheDTD fail

(i.e. not dete t isolation), the trainis reopened and

reasso iation re ommen es. A orre t asyn hronous

DTD algorithm is safe with respe t to the exporta-

tionofreferen esfromatrainsin esu hexportation

(8)

1e+07 1e+08 1e+09

0 10 20 30 40 50 60 70

Cost (Stable Words Written)

Remset Cache Size (remsets) max delta-ref 0 max delta-ref 8 max delta-ref 64 max delta-ref 128

Figure6: Ee tofCa hingMetadata(FEA)

ano uronlyifthereisanexistingexternalreferen e

tothetrain.

Wehave hosenSafra'sAlgorithm(Dijkstra1987)

for the purpose of dete ting isolated trains. This is

an asyn hronous ring-based wave algorithm for dis-

tributed terminationdete tion. It is parti ularlyap-

propriateforourpurposesnotonlybe auseitisasyn-

hronousbutalsobe auseit loselymat hesthestru -

tureoftrainsinourimplementation. The losingand

opening of trainsis performed bythe exe ution of a

waveovertheringofsitesparti ipatinginatrain. The

FIFOnatureofmessagedeliveryandtheuseofwaves

over a ring to manage train membership as well as

openand losetrainsensuresthat trainmanagement

eventsdonotoverlapwiththeopeningor losingofa

train.

5 Experimentsand Results

Ourexperimentalplatform onsistsofapairofAthlon

XPsystems: oneof1527MHzand1024MBRAMand

the other 1470MHzand 512MBRAM onne tedvia

swit hed fast ethernet. Wall- lo k time is the only

ma hine-dependent result as the I/O osts are de-

terministi (dependingonlyon ongurationandthe

loaddatastru tures)andthereforeindependentofthe

hardwareusedforexe ution.

To verify integrity of the store, the OO7 (Carey,

DeWitt&Naughton1993)medium datastru tures

were reated,installedintothestoreandveried. To

measure performan e in thefa e of highly onne ted

data,asmallnite-elementsimulationwaswrittenin

Pro essBaseand he kpointed. It reatesa2Dplanar

mesh of triangles, ea h onne ted to all of itsneigh-

boursandobje tsrepresentingitsverti es. Asanini-

tial experiment, the ee t of the remset a hes and

1e+07 1e+08 1e+09 1e+10

0 10 20 30 40 50 60 70

Cost (Stable Words Written)

Remset Cache Size (remsets) max delta-ref 0 max delta-ref 64 max delta-ref 128

Figure 7: Ee tofCa hingMetadata(OO7)

in stableI/O withrespe t tothe sizeof both remset

a hesandref sets.

Figure 6 shows at least one order of magnitude

redu tioninexe ution ost(as measuredby theload

on the stable storage layer) is possible by using the

laten y tolerant metadataupdate s hemes des ribed

inthispaper. Theresultsalsoshowthatthereislittle

benetin in reasing ref set andremset a he size

beyonda ertainminimum.

Experiments with OO7 (Figure 7) in onjun tion

with extremely small remset a hes and ref sets

showapproximately2.5ordersofmagnitudedieren e

in I/O ost betweenthehighestandlowest ost on-

gurations. The OO7 data stru ture has lower on-

ne tivityandthereforefewerinter-partitionreferen es

thantheFEAtest,soasexpe tedtheuseofmetadata

a hinghaslessee t.

Furtherexperimentationinto a hingpoli ies (in-

luding ref setsand remset a hes)and othersys-

tempoli ies isongoing.

6 RelatedWork

Distributed persistent obje t stores have been de-

signedandimplementedinthepastbuttheirgarbage

olle torshaveingeneralsueredfromala kof om-

pleteness,s alabilityorboth.

Lar hant (Shapiro & Ferreira 1995) and PerDIS

(Ferreira,Shapiro, Blondel,Fambon,Gar ia,Kloost-

erman, Ri her, Roberts, Sandakly, Colouris, Dol-

limore,Guedes,Hagimont&Krakowiak2000)aredis-

tributedpersistentstoreswithgarbage olle tionbut

theyarenot omplete inthefa e oflargedistributed

y les and are only probabilisti ally omplete with

smallerdistributed y les.

Thor (Maheshwari 1993) is a distributed

(9)

(Maheshwari & Liskov 1997b), distributed mark-

ing (Maheshwari & Liskov 1997 ) or ba k-tra ing

(Maheshwari& Liskov1997a). Controlled migration

requires that anentiredistributed y le bemigrated

to a single site to be re laimed by that site's lo al

garbage olle tion and is therefore not s alable to

large onne ted omponentsof garbage. Distributed

marking operates in phases over all sites in the

system, regardless of the sites that the omponent

to be re laimed a tually spans. The global nature

of garbage dete tion in that s heme means that it

is not s alable to large numbers of nodes. Thor's

ba k-tra ing algorithm is omplete and s alable but

spa e-andtime-ine ientbe auseofitsrequirement

to store state for ea h remset entry involved in a

ba k tra e. Safety of the ba k tra ealso requires a

pointer tra kingalgorithm that is theoreti ally more

expensivethanthat usedbyDMOSandDPMOS.

7 Further Work

The ontext of this resear h is a detailed ompar-

ison between two dierent approa hes to obtaining

omplete olle tion in the fa e of distributed y les

of garbage while maintaining e ien y and s alabil-

ity. Thear hite tureandmeasurementframeworkde-

s ribed inthis paperis thebasisforthequantitative

omparisonsbetweenthe trainalgorithmand aform

ofba k-tra ingbasedonThor.

8 Con lusions

Powerful s ienti appli ations in ommon use de-

mand s alable, omplete garbage olle tion for

storage-oriented lusters. S alability and omplete-

nessof olle tionare madene essarybythetypesof

data stru tures employed by these appli ations and

e ien y and s alability are required of the system

whi h storesthem.

The ontributionsofDPMOSare:

 Thedesignandimplementationofanewpointer

tra kingalgorithmtopermitsafegarbage olle -

tion in the persistentheap, de oupledfrom ol-

le tionin a hes.

 The novel ombination of the Train Algorithm,

Distributed Termination Dete tion, Deferred

MetadataUpdateand ournewPointerTra king

algorithmtoprovidesafe, omplete,s alableand

e ient garbage olle tion for storage-oriented

lusters.

system that uses this novel ombination and shows

thatthis news alableand ompleteapproa hto dis-

tributedpersistentgarbage olle tionisvaluablesin e

itpermitsthesupportofalargeandpowerful lassof

appli ations ommonlytargetedbyhighperforman e

anddistributedsystems.

9 A knowledgments

The work is supported in part by ARC Linkage In-

ternationalAwardLX0349049:Extendingafamilyof

garbage olle tors.

Referen es

Amsaleg,L.,Franklin, M.&Gruber,O.(1995),E-

ient in remental garbage olle tion for lient

server obje t database systems, in `VLDB95',

Zuri h, Switzerland,pp.4253.

Atkinson, M. P., Bailey, P., Chisholm, K., Co k-

shott, W. & Morrison,R. (1983), `Anapproa h

to persistent programming', Computer Journal

26(4), 360365.

Atkinson, M. P. & Morrison, R. (1995), `Orthogo-

nally persistent obje t systems', VLDB Journal

4(3),319401.

Bekkers, Y. & Cohen, J., eds (1992), Pro eedings

of International Workshopon Memory Manage-

ment, Vol. 637 of LNCS, Springer-Verlag, St

Malo,Fran e.

Bla kburn, S., Hudson, R., Morrison, R., Moss, J.,

Munro, D. & Zigman, J. (2001), Starting with

termination: A methodology for building dis-

tributed garbage olle tionalgorithms, in`Pro .

ACSC2001'.

Carey, M. J., DeWitt, D. J. & Naughton, J. F.

(1993),`TheOO7ben hmark', SIGMODRe ord

22(2), 1221.

Dijkstra,E.W.(1987),ShmuelSafra'sversionofter-

minationdete tion,Te hni alReportEWD998,

TheUniversityofTexasatAustin.

Falkner, K., Detmold, H., Munro, D. & Olds, T.

(n.d.), Towards ompliant distributed shared

memory,in`WSDSM'02,2ndInt'lConfonClus-

ter ComputingandtheGrid (CCGRID)'.

Ferreira, P., Shapiro, M., Blondel, X., Fambon, O.,

Gar ia,J.,Kloosterman,S.,Ri her,N.,Roberts,

(10)

(2000),PerDiS:Design,implementationanduse

of aPERsistent DIstributed Store, Vol. 1752 of

LNCS,pp.427452.

Gray,J.,M Jones,P.,Blasgen,M.,Lindsay,B.,Lorie,

R., Pri e, T., Putzolu, F. & Traiger, I. (1981),

`There overymanageroftheSystemRdatabase

manager',ACMComputing Surveys13(2), 223

242.

Howard,D.,Detmold,H.,Falkner,K.&Munro,D.S.

(2003),Usingthe ompliantsystemsar hite ture

todeliverexiblepoli ieswithintwo-phase om-

mit,inA.E.James,B.Lings&M.Younas,eds,

`BNCOD', Vol. 2712 of Le ture Notes in Com-

puterS ien e,Springer,pp.245252.

Hudson,R.L.,Morrison,R.,Moss,J.E.B.&Munro,

D.S. (1997),Garbage olle tingtheworld: One

aratatime, in`OOPSLA'97ACM Conferen e

onObje t-OrientedSystems,LanguagesandAp-

pli ations  Twelth Annual Conferen e', Vol.

32(10)of ACM SIGPLAN Noti es, ACM Press,

Atlanta,GA,pp.162175.

Hudson,R.L.,Morrison,R.,Moss,J.E.B.&Munro,

D. S. (1998), Where haveall the pointers gone,

in `Pro eedings of 21st Australasian Computer

S ien eConferen e',Perth,pp.107119.

Hudson, R. L.& Moss, J.E. B.(1992), In remental

olle tionofmatureobje ts,inBekkers&Cohen

(1992),pp.388403.

Kirby, G., Connor, R., Cutts, Q., Dearle, A.,

Farkas, A. & Morrison, R. (1992), Persistent

hyper-programs,in `PersistentObje tSystems',

Springer-Verlag,pp.86106.

Lowry,M.C.&Munro,D.S. (2002),SafeandCom-

plete Distributed Garbage Colle tion with The

TrainAlgorithm,in`Pro eedingsofICPADS'02',

Taipei,Taiwan,pp.651658.

Maheshwari, U. (1993), Distributed garbage olle -

tion in a lientserver, transa tional, persistent

obje tsystem,Te hni alReportMIT/LCS/TR

574,MITPress.

Maheshwari, U. & Liskov, B. (1997a), Colle ting

y li distributed garbage by ba k tra ing, in

`Pro eedings of PODC'97 Prin iples of Dis-

tributedComputing',pp.239248.

Maheshwari, U. & Liskov, B. (1997b), `Colle ting

y li distributed garbage by ontrolled migra-

garbage olle tionofalargeobje tstore,in`Pro-

eedingsof SIGMOD'97',pp.313323.

Morrison,R.,Balasubramaniam,D.,Greenwood,M.,

Kirby,G.N.C.,Mayes,K.,Munro,D.S.&War-

boys,B.C.(1999),Pro essBaseStandardLibrary

Referen eManual(Version1.0.1),Universitiesof

StAndrews andMan hester.

Morrison,R., Balasubramaniam, D.,Greenwood, R.,

Kirby, G., Mayes, K., Munro, D. & Warboys,

B. (2000),`A ompliantpersistentar hite ture',

Software, Pra ti e &Experien e30(4), 363386.

Moss,J.E.B.,Munro,D.S.&Hudson,R.L.(1996),

PMOS:A ompleteand oarse-grainedin remen-

tal garbage olle torforpersistentobje tstores,

in `Pro .ofthe6thInt.Workshop onPersistent

Obje tSystems',CapeMayNJ(USA), pp.140

150.

Munro, D., Falkner, K., Lowry, M. & Vaughan,

F. (2001), Mosai : A non-instrusive omplete

garbage olle tor for DSM systems, in `Pro .

1st CCGRID, WSDSM'01,Brisbane,Australia',

pp.539546.

Nor ross, S., Morrison, R., Munro, D. & Det-

mold, H. (2003), Implementing a family of dis-

tributed garbage olle tors, in `Pro eedings of

the2003AustralasianComputerS ien eConfer-

en e',Adelaide,Australia,pp.161170.

Seligmann, J. &Grarup, S. (1995), In remental ma-

ture garbage olle tion using the train algo-

rithm,in O.Nierstras,ed.,`Pro eedings of1995

European Conferen e on Obje t-Oriented Pro-

gramming',LNCS,Springer-Verlag,Universityof

Aarhus,pp.235252.

Shapiro,M.&Ferreira,P. (1995),Lar hantRDOSS:

a distributed shared persistent memory and its

garbage olle tor, in J.-M. Helary & M. Ray-

mond, eds, `Workshop on Distributed Algo-

rithms', number972in `LNCS',Le MontSaint-

Mi hel,pp.198214.

Tel, G. &Mattern, F. (1991),The derivation of dis-

tributed termination dete tion algorithms from

garbage olle tion s hemes  (extended ab-

stra t), inAarts et al.,eds, `PARLE'91Parallel

Ar hite tures and Languages Europe', Vol. 505

of LNCS,Springer-Verlag.

Wilson,P.R.(1992),Unipro essorgarbage olle tion

te hniques,inBekkers&Cohen(1992).

References

Related documents

The divine doctrine which Olayinka Daini employed to attain salvation and which he was using to attempt sanctification was called Rebirth by Primate A.A.T Oshinyemi, but when

Makes your project and git fork pull request workflow to work, not only workflows run on the master branch contains merge commit on the great. Review branch to the pull requests, but

a) Costs appear only in the objective function. b) The number of variables is (number of origins) × (number of destinations). c) The number of constraints is (number of origins)

At least two Thai militant organizations were represented at these discussions, Patani Islamic Mujahideen Movement (Gerakan Mujahidin Islam Patani or GMIP) and

The data come from a routine clinical identification programme called Cognitive Geriatric Assessment research (CGA), which was a retrospective cohort study

∎ Gives providers quick access to information in other Military Health System (MHS) applications such as AHLTA Web Print, HAIMS, and the Virtual Lifetime Electronic Record

The undersigned hereby releases and forever discharges the ACR, its directors, officers, members agents, volunteers, and employees from and against any and all claims, suits,

Using available test-retest data, we compare HYDECA estimates of V ND and binding potentials to those obtained based on V ND estimated using a purported..