William Brodie-Tyrrell, Henry Detmold, Katrina Falkner, David S. Munro
School of Computer Science, University of Adelaide
Adelaide SA 5005, Australia
{william, henry, katrina, dave}@cs.adelaide.edu.au
Abstract
Storage-oriented clusters present unique challenges to the implementation of storage management. Such clusters manage a vast amount of data, most of which is located on secondary storage. Manual storage management in storage-oriented cluster environments is complex, error-prone and tedious. As a result there is a clear need for automatic storage management (garbage collection) for such clusters.
The goals of a garbage collector for use in a storage-oriented cluster are safety, completeness and scalability in the face of distributed cycles of garbage in secondary storage. Of the few extant distributed secondary storage garbage collectors, none meet all of the stated goals whilst also operating efficiently.
This paper describes the design and implementation of a new distributed garbage collector based on the train algorithm, specifically for use in storage-oriented clusters. The collector presented here extends the train algorithm, employing an asynchronous distributed termination detection algorithm for isolated train detection, a mechanism for deferring the update of metadata and a new external root tracking mechanism to permit interaction with clients that cache and swizzle pointers. Our experiments demonstrate that these extensions successfully adapt the train algorithm for efficient operation in a storage-oriented cluster, fulfilling the stated goals of safety, completeness and scalability.
1 Introduction
The focus of research into cluster computing has concentrated on computational clusters, while storage-oriented clusters have received less attention. Clustering of storage provides scalability of I/O performance without the use of expensive special-purpose hardware. Computational clusters provide supercomputer-class performance at commodity cost; similarly, storage-oriented clusters have the potential to provide mainframe-class I/O for commodity cost. Significantly, there is a growing need for such cluster technology to support an important class of applications that exhibits strongly connected graphs of data objects within very large data sets. Finite Element Analysis for the simulation of physical systems is a widely used member of this class. However, storage-oriented cluster technology is currently less mature than computational cluster technology and many of the key problems in the implementation of storage-oriented clusters have yet to be adequately addressed. Our research attempts to solve key problems in storage management for storage-oriented clusters.
Copyright 2004, Australian Computer Society, Inc. This paper appeared at Twenty-Seventh Australasian Computer Science Conference (ACSC2004), Dunedin, New Zealand, January 2004. Conferences in Research and Practice in Information Technology, Vol. 26. Vladimir Estivill-Castro, Ed. Reproduction for academic, not-for-profit purposes permitted provided
Our model for programming storage-oriented clusters is based on the orthogonal persistence abstraction over storage (Atkinson, Bailey, Chisholm, Cockshott & Morrison 1983), permitting the form of programs to be independent of the lifetime of data they manipulate and the lifetime of data to be independent of its type. Extending this abstraction to distribution makes the form of a program independent of the underlying stable storage mechanism and the number of storage sites participating. The benefits of the persistence abstraction have been well documented (Atkinson & Morrison 1995) and are particularly appropriate for the class of long-running applications expected on storage-oriented clusters.
To manage the storage resources of long-running applications, it is essential to identify and reclaim the storage allocated to those objects that are no longer needed. The consequence of not doing so is that applications will fail unexpectedly and unnecessarily due to exhaustion of free space; the approach of discarding the entire address space at the end of execution is
Identifying the objects that are no longer required so that they may safely be reclaimed is a complex, error-prone and tedious task even for small programs where the data structures are simple. Furthermore, when the majority of those objects are located in secondary storage, there are significant engineering challenges to performing both identification and reclamation efficiently. Finally, ascertaining liveness of an object in a distributed system (such as a cluster) requires global knowledge. However, in order to maintain scalability (and hence efficiency) this global knowledge must be distilled over a series of asynchronous local steps. Moreover, these steps should involve only the minimal subset of sites necessary to identify a particular garbage component.
Any implementation of storage management at the application level must correctly address all of the problems identified above. At best, this burdens application programmers with a tedious and time-consuming problem. A more likely outcome (and one which is much worse) is that the application level storage management is incorrect in some aspect, resulting in unsafe reclamation of live objects or failure to reclaim garbage or both. It is therefore virtually essential to take a systematic and automatic approach to storage management, namely, the implementation of a distributed garbage collector.
The train algorithm (Hudson & Moss 1992, Seligmann & Grarup 1995), from which our research is derived, exhibits the desirable properties of safety, completeness, asynchrony and scalability. These align with the majority of the requirements identified above, hence justifying the use of this garbage identification technique. The train algorithm has previously been shown to be amenable to implementation in a wide range of architectures, each with its own unique set of challenges. Existing adaptations of the train algorithm include generational (Hudson & Moss 1992), uniprocessor persistent (Moss, Munro & Hudson 1996), distributed main memory (Hudson, Morrison, Moss & Munro 1997) and distributed shared memory (Munro, Falkner, Lowry & Vaughan 2001) garbage collection. The challenge is to adapt the train algorithm for use in storage-oriented clusters, a context where it has not yet been seen.
The contributions of this paper are the design, implementation and initial evaluation of a new distributed secondary storage garbage collector based on the train algorithm. This new collector adapts and combines a number of proven techniques to provide:
• safety through pointer tracking,
• completeness through isolated train detection, underpinned by Safra's Algorithm (Dijkstra 1987),
• tolerance of secondary storage latency through a deferred metadata update similar to that used in PMOS (Moss et al. 1996), and
• scalability of persistent garbage collection by incrementally gathering external reference information.
Figure 1: Layered ProcessBase Architecture (Stable Storage, Persistent Heap, Object Caches and Mutators, with node boundaries)
To provide an efficient programming abstraction, it is necessary for sites to cache data in primary memory. A further contribution of the distributed store described here is a new technique for tracking pointers in client caches, which is necessary to permit the decoupling of garbage collection in the persistent heap from that in the client caches.
This paper describes the architecture and experimental platform in which the garbage collector exists, the challenges confronting such a collector and their solutions as well as initial experimentation performed to confirm the validity of our approach.
2 Architecture
Our architecture includes distribution of both storage and computation across the nodes of a cluster. The computational model is based on the ProcessBase (Morrison, Balasubramaniam, Greenwood, Kirby, Mayes, Munro & Warboys 1999) programming language. The storage model is an orthogonal persistence (Atkinson et al. 1983, Atkinson & Morrison 1995) abstraction supported by the Process-
heap layer. In particular, we have designed and implemented a novel secondary storage garbage collector, DPMOS (Distributed Persistent Mature Object Space), based on the train algorithm.
The secondary storage architecture is a set of peer sites that are in communication with each other to synthesize a single address space from separate stable storage held on each site as per Figure 1. Client virtual machines can connect to any of the participating store nodes; a store site need not have both a connected client and local storage. The stable stores are independent and do not communicate with each other. Object locations are encoded in their identifiers, providing a trivial resolution scheme.
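To illustrate how an identifier that encodes location gives trivial resolution, the sketch below packs a site number and a site-local object number into one integer, so that finding the home site is a shift rather than a directory lookup. The 16/48-bit split and the names `make_oid`, `site_of` and `local_of` are assumptions made for illustration only; the actual DPMOS identifier layout is not given here.

```python
SITE_BITS = 16    # assumed width of the site field (hypothetical)
LOCAL_BITS = 48   # assumed width of the site-local field (hypothetical)
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def make_oid(site: int, local: int) -> int:
    """Pack a site number and a site-local object number into one identifier."""
    if not 0 <= site < (1 << SITE_BITS):
        raise ValueError("site out of range")
    if not 0 <= local <= LOCAL_MASK:
        raise ValueError("local id out of range")
    return (site << LOCAL_BITS) | local


def site_of(oid: int) -> int:
    """Resolve the home site of an object without consulting any other site."""
    return oid >> LOCAL_BITS


def local_of(oid: int) -> int:
    """Recover the site-local part, which the home site maps to a physical location."""
    return oid & LOCAL_MASK
```

Only the site is encoded in this sketch; the site-local part would still be resolved by the home site itself.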
Each virtual machine has a write-back object cache and a collection of mutator threads that operate over the cached objects. ProcessBase is a persistent language system designed for use in hyper-programming (Kirby, Connor, Cutts, Dearle, Farkas & Morrison 1992) and process-modelling systems and for which there is ongoing research into distribution of computation and storage at both the persistent heap and cache levels (Howard, Detmold, Falkner & Munro 2003, Falkner, Detmold, Munro & Olds n.d.). The ProcessBase programming model is that of multiple concurrent mutators that operate directly on objects held in the caches.
Cache coherency is the responsibility of the clients; however, certain events relating to the copying and destruction of references must be reported to the persistent heap to permit safe garbage collection therein.
Our secondary storage garbage collection algorithm operates within the persistent heap layer. A common approach to achieving scalability through incrementality in a garbage collector (distributed or otherwise) is to partition the object space and reclaim garbage in partitions individually. Distributed acyclic garbage is reclaimed by repeated invocation of the local garbage collector in conjunction with a pointer tracking algorithm that maintains a conservative set of external roots for each partition. Distributed cyclic garbage is reclaimed using an asynchronous implementation of the train algorithm.
Stable storage is implemented at each site using an after-image shadow paging (Gray, McJones, Blasgen, Lindsay, Lorie, Price, Putzolu & Traiger 1981) mechanism and page cache. The shadow paged store provides stability with checkpoint semantics and its use of a cache permits the direct mapping of data structures in the heap onto the page store's address space without significant performance loss. The cache supports page-pinning to explicitly retain critical and frequently-accessed data structures in the cache. The stable storage space is a flat address space with no concept of objects, references or reachability.
Figure 2: Layering in Measurement Model (Layer n-1, Layer n, Layer n+1; Load(n-1) = Cost(n); Load(n) = Cost(n+1))
Communication between sites is by asynchronous message passing over a network that preserves FIFO ordering of messages between endpoints. The system assumes reliable message delivery; message loss results in graceful shutdown. Similarly, the algorithm is not crash tolerant and the loss of a site also leads to graceful shutdown. Finally, Byzantine behaviour is not permitted anywhere in the system.
2.1 Measurement
The architecture has been designed in such a way as to facilitate measurement and evaluation. In particular, the layering of the architecture permits the substitution of any layer for the purposes of comparison and is the basis for quantitative performance and scalability measurements. This enables our system to be used in a detailed investigation into distributed garbage collection.
By focusing measurement on activity at the interfaces between layers, it becomes possible to directly compare the effects of different policies within a layer or even two entirely different implementations of the same layer.
Each layer in the architecture provides a service to layers above it by composing a mechanism and policies of its own with requests of lower layers. The load upon a layer n is defined as the set of requests made of that layer while the cost of a layer is the set of requests it makes of the layer below itself (see Figure 2). Each layer transforms its load into a cost by the application of policies to mechanisms, with those policies possibly specified by higher layers in the case of a compliant system (Morrison, Balasubramaniam, Greenwood, Kirby, Mayes, Munro & Warboys 2000). There may be multiple independent systems at each
considered to be a layer below the Persistent Heap yet independent of each other. In principle the load upon a layer could be the union of accesses from multiple higher layers but this does not occur in the DPMOS architecture: there are no circumstances where multiple systems in an upper layer operate directly upon a single system in a lower layer. The long-term goal of this research is to attempt to measure quantitatively the transformation of load into cost at the persistent heap layer and thereby to make comparisons between different distributed persistent garbage collectors.
In the case of the distributed persistent garbage collector, the load is represented in terms of object read and write requests and pointer tracking events from caches; the cost can be measured in terms of the network traffic generated and the load placed upon each of the stable stores.
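The load and cost definitions can be made concrete with a small counting model. The sketch below is illustrative only: the `Layer` class, the `fanout` policy parameter and the request kinds are hypothetical and are not part of DPMOS. It shows that the cost recorded at one layer is, by construction, the load recorded at the layer beneath it.

```python
from collections import Counter


class Layer:
    """A layer that records its load and forwards requests to the layer below."""

    def __init__(self, name, below=None, fanout=1):
        self.name = name
        self.below = below
        self.fanout = fanout    # assumed policy: requests issued below per request received
        self.load = Counter()   # requests made of this layer
        self.cost = Counter()   # requests this layer makes of the layer below

    def request(self, kind):
        self.load[kind] += 1
        if self.below is not None:
            for _ in range(self.fanout):
                self.cost[kind] += 1
                self.below.request(kind)


stable = Layer("stable store")
heap = Layer("persistent heap", below=stable, fanout=2)
cache = Layer("object cache", below=heap)
for _ in range(3):
    cache.request("write")

print(heap.load, heap.cost, stable.load)
```

Swapping in a different `heap` layer while keeping `cache` and `stable` fixed would allow two policies to be compared on the same load, which is the kind of comparison the measurement model is meant to support.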
3 Challenges
The design and implementation of a distributed persistent garbage collector for use in storage-oriented clusters pose a range of challenges that are addressed by our research. These relate to:
• The efficient and accurate tracking of the creation, destruction and transmission of references external to a partition.
• Accessing stable data in such a way as to be tolerant of secondary storage latency.
• The efficient operation of a scalable distributed algorithm to accurately identify distributed cycles.
This section describes in detail the three challenges addressed by our research.
3.1 Pointer Tracking
To perform safe garbage collection of a partition, information regarding references into that partition is required. Whilst it is in principle possible to acquire this information via a global trace, this renders the algorithm unscalable. Hence, to support scalability of collection, inter-partition reference information must be maintained incrementally. This process is known as pointer tracking. The algorithm described in (Hudson, Morrison, Moss & Munro 1998) provides an efficient solution to the pointer tracking problem but is suited only to a single layer store where references are not cached at a higher layer. However,
architecture, therefore it is necessary to adapt the pointer tracking algorithm to cope with references held in caches.
For scalability, it is desirable to decouple garbage collection in caches from garbage collection in the persistent heap so that both may operate independently. This ensures the scalability of the persistent garbage collector because it does not require synchronisation with and interruption of the caches. For these reasons, a new pointer tracking mechanism is required that encompasses client caches and is aware of their behaviour.
3.2 Latency Tolerance
Garbage collection in primary memory is a well-investigated area (Wilson 1992). In this field, algorithms typically assume that storage is randomly addressable with little penalty, i.e. random accesses to storage have low latency. Clearly, this assumption does not hold for disc storage. For a secondary storage garbage collector to operate efficiently, it must be implemented so as to take account of the read and write latency of secondary storage and to minimise the impact of that latency on overall performance. The main strategies available to the implementation are caching in main memory and optimising access patterns so as to prefer sequential access over random access.
The pointer tracking algorithm required to support partitioned collection entails the creation and maintenance of metadata, recording inter-partition references. As with all other data, this metadata must be kept on secondary storage. There is at least one metadata update for every inter-partition reference update. Furthermore, a given reference update and its associated metadata updates are physically dispersed. These properties complicate the implementation of an approach to maintaining metadata that is tolerant of high storage latencies, whilst simultaneously making it more important to adopt such approaches.
3.3 Collection of Cyclic Garbage
Reference counting algorithms, such as pointer tracking, are inadequate to detect cycles of garbage that span partitions. In a main memory collector one might simply ignore such cycles since they will disappear when the program exits. This is not a viable design choice in a storage-oriented cluster since the address space is persistent, and so completeness is a necessary requirement. Hence, an additional distributed algorithm is required to handle this case and
cycle detection algorithm must be efficient and scalable. This is an inherently challenging problem that has been the subject of extensive research (Tel & Mattern 1991, Blackburn, Hudson, Morrison, Moss, Munro & Zigman 2001, Lowry & Munro 2002, Norcross, Morrison, Munro & Detmold 2003).
4 Implementation
Here we describe our approach to solving the specific challenges of implementing a safe, complete, asynchronous and scalable distributed persistent garbage collector posed in the previous section.
The generic train algorithm, on which our collector is based, operates as follows. Objects are stored in cars (partitions), of which there are many per site and which do not cross site boundaries. Each car can be collected independently of any other car in the system and the inter-partition reference information (from the pointer tracking algorithm) required to do this is kept in remembered sets (remsets). Each car belongs to a train, which may span multiple sites. When a car is collected, the reachable objects in it are reassociated (copied) into other cars, possibly in other trains, depending on the information in the remset and then the space occupied by the collected car is reclaimed. The reassociation policy selects a destination train for each object so that distributed cycles are marshalled into a single train. In addition, reachable objects are eventually moved into other trains, leaving this train isolated. Train isolation is a stable property which may be established using a Distributed Termination Detection (DTD) algorithm.
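The collection of a single car can be sketched as follows. This is a simplified, single-site illustration under stated assumptions, not the DPMOS implementation: the function name `collect_car`, the data shapes, and the rule for choosing a destination (the lowest-numbered foreign train that refers to the object, otherwise the car's own train) are all hypothetical. It shows the remset supplying the roots, reachable objects being reassociated, and the remainder of the car being reclaimed.

```python
def collect_car(car, edges, remset, home_train):
    """Collect one car.

    car        -- set of object names stored in the car
    edges      -- dict: object -> list of objects it references
    remset     -- dict: object -> set of train numbers holding references
                  to it from outside this car
    home_train -- number of the train this car belongs to

    Returns (moves, reclaimed): moves maps each surviving object to the
    train it is reassociated into; reclaimed is the set of dead objects.
    """
    moves = {}
    # Roots referenced from other trains are processed first so that they
    # decide the destination of anything they can reach.
    roots = sorted(
        (obj for obj in car if remset.get(obj)),
        key=lambda obj: (all(t == home_train for t in remset[obj]), obj),
    )
    for root in roots:
        foreign = sorted(t for t in remset[root] if t != home_train)
        target = foreign[0] if foreign else home_train   # assumed policy
        stack = [root]
        while stack:
            obj = stack.pop()
            if obj in moves or obj not in car:
                continue                  # already moved, or lives in another car
            moves[obj] = target
            stack.extend(edges.get(obj, ()))
    reclaimed = set(car) - set(moves)     # includes cycles contained in the car
    return moves, reclaimed
```

For example, with `car = {"a", "b", "c", "d"}`, `edges = {"a": ["b"], "c": ["d"], "d": ["c"]}` and `remset = {"a": {2}}`, objects `a` and `b` are reassociated into train 2 while the unreferenced cycle `c`, `d` is reclaimed with the car.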
4.1 Pointer Tracking
There are two cases where pointer events requiring tracking occur:
• Object writes issued by caches and pertaining to objects on a site other than the one the cache is connected to (remote writes). A remote object write involves the transmission of references from a cache to a persistent heap site and the creation of references in the latter; this case (the creation and destruction of references in the object written as well as the presence of references in network messages) is handled by an implementation of the pointer tracking algorithm defined in (Hudson et al. 1998).
• Remote object reads issued by caches. These involve the creation of references in a cache and
remote. Because the references created are not in the persistent heap, they cannot be tracked by Hudson et al.'s algorithm. Our solution to this is described below.
Caches in our architecture swizzle (translate) references from the persistent heap's format to their native machine address format in order to provide acceptable computational performance in the virtual machine. Our approach is to track references as they enter (are faulted into) and leave (are purged from) caches. Swizzling allows us to ignore the reference creations and deletions performed by mutators.
The cache pointer tracking algorithm must accurately report when a non-zero count of references to a given object is held by any cache. Furthermore, this must be done without synchronising with caches and without information regarding mutations of swizzled references.
References are tracked separately, depending on whether or not they are swizzled; only the non-swizzled references are counted by this algorithm. References to an object can be swizzled only if a copy of the object is held by the cache, so the pointer tracking algorithm conservatively assumes that if an object is present in the cache then there are swizzled references to it. If an object is purged to make space in the cache, all references to it must be deswizzled (and therefore counted).
The ProcessBase virtual machine (VM) operates only on swizzled references and so has no direct interaction with the cache pointer tracking algorithm. Since the pointer tracking algorithm is only invoked during object fault and purge, its impact on the operation of the virtual machine is minimal.
A fault will occur at any point where the VM attempts to dereference an unswizzled pointer to an object not present in the cache. If the object is already present, the pointer is swizzled, resulting in a minus message. Our system uses a simple policy by which purges are triggered only by cache exhaustion. If swizzled pointers exist in a modified object to be purged, they must be unswizzled (requiring plus messages) before the object is written back.
For each object contained in or referred to by one or more caches, a boolean value and a counter, each vectorised by cache identity, are kept. These are referred to as R_b[X] and R_c[X] respectively. R_b[X] denotes the presence of R in cache X and R_c[X] contains a conservative estimate of the number of unswizzled pointers to R in X. If either of these values is non-zero, R is considered live and cannot be reclaimed by the per-
Figure 3: Example Object Graph (Site A holds R; Site C holds P1 and P2; the cache on Site B holds Q; references p1, p2 and r)
Figure 4: Fault Sequence (sites A, B and C; messages {r,R}, {f,R,B}, {+,P1,B}, {+,P2,B} and {-,R,B}; steps 1 to 5)
The tracking of references in the caches is described by four events:
• Plus: {+, R, B} indicates the creation of an unswizzled reference to R at cache site B; it causes R_c[B] to be incremented.
• Minus: {-, R, B} indicates the destruction of an unswizzled reference to R at cache site B; it causes R_c[B] to be decremented.
• Fault: {f, R, B} indicates that a copy of R is held in cache B; it causes R_b[B] to be set.
• Purge: {p, R, B} indicates that B has lost its copy of R; it causes R_b[B] to be cleared.
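The four events above can be read as updates to a small piece of per-object state held at the object's home site. The following sketch is an illustration under assumptions, not the DPMOS code: the class name `CacheTracker` and its fields are hypothetical, and one tracker instance is assumed per tracked object R, so events carry only their kind and the cache identity.

```python
from collections import defaultdict


class CacheTracker:
    """Tracking state for one object R, kept at R's home site."""

    def __init__(self):
        self.present = defaultdict(bool)  # R_b[X]: cache X holds a copy of R
        self.count = defaultdict(int)     # R_c[X]: unswizzled pointers to R in X

    def apply(self, kind, cache):
        if kind == "+":                   # plus: unswizzled reference created
            self.count[cache] += 1
        elif kind == "-":                 # minus: unswizzled reference destroyed
            self.count[cache] -= 1
        elif kind == "f":                 # fault: cache now holds a copy
            self.present[cache] = True
        elif kind == "p":                 # purge: cache lost its copy
            self.present[cache] = False
        else:
            raise ValueError(f"unknown event {kind!r}")

    def is_live(self):
        """R is live while any R_b[X] is set or any R_c[X] is non-zero."""
        return any(self.present.values()) or any(
            c != 0 for c in self.count.values()
        )


tracker = CacheTracker()
tracker.apply("+", "B")   # an unswizzled reference to R arrives in cache B
tracker.apply("f", "B")   # R itself is faulted into B
tracker.apply("-", "B")   # the reference is swizzled
held_while_cached = tracker.is_live()
tracker.apply("p", "B")   # B purges R
held_after_purge = tracker.is_live()
```

Here `held_while_cached` is true because R_b[B] is still set even though R_c[B] has returned to zero, which reflects the conservative assumption that a cached object may have swizzled references to it.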
Consider the example object graph of Figure 3. Object R resides on site A and objects P1 and P2 reside on site C. A cache at site B holds a reference r (to R) in object Q. Figure 4 and Figure 5 describe the fault
Figure 5: Purge Sequence (sites A, B and C; messages {p,R,B} and {-,P,B}; steps 1 to 3)
When a mutator on B dereferences the reference to R in Q, the reference must be swizzled. For this to occur, R must be faulted in (Figure 4) so B sends a read request (1) to A. The fault event {f, R, B} (2) occurs at A and plus events are generated at A for every reference (p1 and p2 in Figure 3) in R. Also, a message is sent (3) to the home site C of each object Pn for which a plus event occurred. The contents of R are then transmitted (4) to B which can swizzle the reference in Q, resulting in the transmission (5) of a {-, R, B} message to A.
When B purges R (Figure 5), it writes the contents back (1) to A if R was modified. This involves pointer transmission and the creation and deletion of pointers, all of which are tracked by the protocol of (Hudson et al. 1998). Minus messages (2) are generated for each unswizzled reference in R and A is notified (3) that R has been purged from B.
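The numbered steps of the fault sequence can be written out as an ordered list of messages. The sketch below is only a restatement of the prose for the example graph; the function name `fault_sequence`, the tuple layout and the `"read"` and `"contents"` labels are assumptions, not names from DPMOS.

```python
def fault_sequence(obj, home, refs, cache):
    """Return the steps of faulting `obj` into `cache` as
    (step, sender, receiver, message) tuples.

    refs maps each object referenced by `obj` to that object's home site.
    """
    steps = [
        (1, cache, home, ("read", obj)),        # cache asks the home site for obj
        (2, home, home, ("f", obj, cache)),     # fault event recorded at the home site
    ]
    for target, target_home in refs.items():
        # one plus message per reference contained in obj
        steps.append((3, home, target_home, ("+", target, cache)))
    steps.append((4, home, cache, ("contents", obj)))   # obj shipped to the cache
    steps.append((5, cache, home, ("-", obj, cache)))   # reference to obj swizzled
    return steps


for step in fault_sequence("R", "A", {"P1": "C", "P2": "C"}, "B"):
    print(step)
```

For the example graph this lists one read request, one fault event, two plus messages to site C, the transfer of R to B, and the final minus message from B to A.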
Our architecture uses a checkpoint model of stability and permits the sharing of references between object caches; our implementation of the persistent heap is necessarily conservative in its treatment of references held by client caches. In contrast, transactional systems introduce a degree of synchronisation between client and server (Amsaleg, Franklin & Gruber 1995), permitting the server to make assumptions about when references can be created.
4.2 Latency Tolerance
In addressing the latency of secondary storage access, our focus is on reducing the costs of metadata update. We pursue two approaches to this:
• caching of metadata to amortize secondary stor-
• batching of updates to metadata to enable secondary storage writes to be performed sequentially.
Eagerly updating remsets on secondary storage is infeasible because updates are required at up to three addresses not spatially correlated with each other: the reference update, a decrement at the remset for the pointer overwritten and an increment at the remset for the pointer created. This implies up to three seek operations for each pointer update and results in unacceptable throughput. The solution used in DPMOS is to defer and batch updates until an opportune time or when the log of deferred updates has become too large (a policy decision) and must be purged. This solution means that multiple updates to a single remset are batched into a single disc-block write which requires only a single seek. Our results show that where a block write is dominated by seek time such batching improves performance.
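The difference between the eager and batched schemes can be seen by counting remset seeks for a short sequence of pointer updates. This is a back-of-envelope sketch under the assumption that each remset occupies one disc block and that each write to a distinct remset costs one seek; the function names and the update encoding are hypothetical, and the seek for the reference write itself is left out of both counts.

```python
def eager_remset_seeks(updates):
    """Eager scheme: every remset decrement or increment is its own seek."""
    return sum(1 for old, new in updates for r in (old, new) if r is not None)


def batched_remset_seeks(updates):
    """Batched scheme: one block write, hence one seek, per distinct remset."""
    return len({r for old, new in updates for r in (old, new) if r is not None})


# Each update is (remset of the pointer overwritten, remset of the pointer created);
# None means there was no previous pointer.
updates = [("r1", "r2"), ("r2", "r1"), (None, "r1"), ("r1", "r2")]
eager = eager_remset_seeks(updates)
batched = batched_remset_seeks(updates)
```

The more often the same few remsets are touched between flushes, the larger the gap between the two counts becomes.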
DPMOS remsets have a tree structure in memory, permitting O(log n) search and update but are flattened for placement on secondary storage. There is a fixed size cache of remsets in memory. Remsets are purged from the cache when a new remset is required to be read in.
If a reference update occurs and the desired remset is not in memory, the update is appended to a structure in memory referred to as a Δref set. This set contains a vector of records, each describing an increment or decrement operation on a remset entry. Δref sets are the mechanism for deferring and batching updates to remsets, the assumption being that many operations will be stored in a Δref set before the corresponding remset is loaded and updated. On loading a remset into the cache, any applicable Δref is applied. When the space occupied by Δref sets becomes too large (again, a policy decision), remsets are faulted in and updated.
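The interaction between the fixed-size remset cache and the deferred-update sets can be sketched as below. This is a toy model under assumptions rather than the DPMOS data structures: the class `RemsetStore`, its least-recently-used eviction, the in-memory dictionary standing in for stable storage and the single global limit on deferred records are all hypothetical choices.

```python
from collections import OrderedDict, defaultdict


class RemsetStore:
    def __init__(self, cache_size, max_deferred):
        self.disk = {}                    # stands in for remsets on stable storage
        self.cache = OrderedDict()        # remset id -> {entry: count}
        self.delta = defaultdict(list)    # remset id -> deferred (entry, change) records
        self.cache_size = cache_size      # must be at least 1
        self.max_deferred = max_deferred  # policy: total deferred records allowed
        self.disk_reads = 0
        self.disk_writes = 0

    def update(self, remset_id, entry, change):
        """Apply an increment or decrement, deferring it if the remset is not cached."""
        if remset_id in self.cache:
            self.cache[remset_id][entry] += change
            self.cache.move_to_end(remset_id)
            return
        self.delta[remset_id].append((entry, change))
        if sum(len(v) for v in self.delta.values()) > self.max_deferred:
            self.flush()

    def load(self, remset_id):
        """Bring a remset into the cache and apply any deferred records to it."""
        if remset_id not in self.cache:
            if len(self.cache) >= self.cache_size:
                old_id, old = self.cache.popitem(last=False)   # evict oldest
                self.disk[old_id] = dict(old)
                self.disk_writes += 1
            self.disk_reads += 1
            self.cache[remset_id] = defaultdict(int, self.disk.get(remset_id, {}))
        remset = self.cache[remset_id]
        for entry, change in self.delta.pop(remset_id, []):
            remset[entry] += change
        return remset

    def flush(self):
        """Fault in every remset that has deferred records and update it."""
        for remset_id in list(self.delta):
            self.load(remset_id)


store = RemsetStore(cache_size=1, max_deferred=8)
for _ in range(5):
    store.update("car7", "objA", +1)
store.update("car7", "objA", -1)
reads_before_load = store.disk_reads
remset = store.load("car7")
```

In this run six updates are absorbed without touching the stand-in disk, and a single read applies them all when the remset is eventually loaded.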
Δref sets are always applied to their respective remsets on checkpoint and therefore never reside on stable storage. Our initial experiments lead us to believe that checkpoint speed may be improved at the cost of complexity and recovery speed by storing Δref sets stably instead of eagerly applying them at checkpoint. We will investigate this alternative approach in future work.
The stable storage subsystem used by DPMOS is a shadow paging system with a page cache and checkpoint durability semantics. Partitions being collected are pinned into the cache to permit a garbage collec-
and its incremental collection means that only a small amount of persistent space needs to be in cache at any given time.
PMOS (Moss et al. 1996) is a uniprocessor persistent implementation of the train algorithm that uses deferred metadata updates to gain performance. PMOS maintains two such caches: the Δref and Δloc sets. The former is similar to DPMOS's Δref system and the latter describes object motion due to garbage collection. DPMOS does not require a Δloc set as there is an indirection between persistent object identifiers and their physical location; in contrast, PMOS must update all references to an object when it is moved by the garbage collector and the Δloc set defers these updates.
4.3 Collection of Cyclic Garbage
The collection of distributed cycles requires a mechanism to identify isolated trains. The reassociation policies employed by the train algorithm cause distributed cycles to move into a single train and remain there if unreachable while live objects will be reassociated out of the train. A partition along train boundaries is thereby formed between reachable and unreachable objects. The reclamation of a whole train is the process by which inter-partition and distributed cycles are reclaimed.
Liveness of a site in a computation subject to DTD is equivalent (Tel & Mattern 1991, Blackburn et al. 2001, Lowry & Munro 2002) to the presence of reachable objects in a partition of a distributed garbage collector. Several DTD algorithms already exist and can be readily adapted to our system to detect isolation for the purpose of garbage collection. In (Norcross et al. 2003), the notion of club rules is investigated to permit the use of different DTD algorithms for the purpose of train isolation detection. Each collector partition corresponds to a DTD site and a train is equivalent to the DTD algorithm's computation scope.
Reassociation of an object into a train constitutes spontaneous activity in the context of a DTD algorithm so is not permissible if a DTD algorithm is to be used to detect train isolation. Therefore a train is closed (Lowry & Munro 2002) during the execution of the DTD algorithm, thereby prohibiting the reassociation of objects into that train. Should the DTD fail (i.e. not detect isolation), the train is reopened and reassociation recommences. A correct asynchronous DTD algorithm is safe with respect to the exportation of references from a train since such exportation can occur only if there is an existing external reference to the train.
Figure 6: Effect of Caching Metadata (FEA). Cost (Stable Words Written), ticks from 1e+07 to 1e+09, against Remset Cache Size (remsets), 0 to 70; series: max delta-ref 0, 8, 64 and 128.
We have chosen Safra's Algorithm (Dijkstra 1987) for the purpose of detecting isolated trains. This is an asynchronous ring-based wave algorithm for distributed termination detection. It is particularly appropriate for our purposes not only because it is asynchronous but also because it closely matches the structure of trains in our implementation. The closing and opening of trains is performed by the execution of a wave over the ring of sites participating in a train. The FIFO nature of message delivery and the use of waves over a ring to manage train membership as well as open and close trains ensures that train management events do not overlap with the opening or closing of a train.
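For readers unfamiliar with Safra's algorithm, the sketch below is a sequential toy model of its token rules: each node keeps a message counter and a colour, a token accumulates the counters as it travels round the ring, and termination is declared only when the token returns white to a white initiator with a zero total. It is an illustration under assumptions, not the asynchronous DPMOS implementation: each token wave is run atomically, a wave is simply abandoned if any node is active, and all names are hypothetical.

```python
WHITE, BLACK = "white", "black"


class Node:
    def __init__(self):
        self.active = False
        self.counter = 0       # basic messages sent minus basic messages received
        self.colour = WHITE


def send(nodes, in_flight, src, dst):
    """Node src sends a basic message to dst; it stays in flight until delivered."""
    nodes[src].counter += 1
    in_flight.append(dst)


def deliver(nodes, in_flight):
    """Deliver the oldest in-flight message; receipt blackens and activates the receiver."""
    node = nodes[in_flight.pop(0)]
    node.counter -= 1
    node.colour = BLACK
    node.active = True


def probe(nodes):
    """Run one token wave from node 0 round the ring; True means termination detected."""
    if any(n.active for n in nodes):
        return False           # simplification: a real token would wait at the active node
    nodes[0].colour = WHITE    # the initiator whitens itself and sends a white token
    total, token = 0, WHITE
    for node in reversed(nodes[1:]):   # token visits N-1, N-2, ..., 1
        total += node.counter
        if node.colour == BLACK:
            token = BLACK
        node.colour = WHITE    # a node whitens itself when it passes the token on
    first = nodes[0]
    return token == WHITE and first.colour == WHITE and first.counter + total == 0


nodes = [Node() for _ in range(3)]
in_flight = []
nodes[0].active = True
send(nodes, in_flight, 0, 2)
nodes[0].active = False
results = [probe(nodes)]       # a message is still in flight: counters do not sum to zero
deliver(nodes, in_flight)
nodes[2].active = False
results.append(probe(nodes))   # node 2 is black, so the token is blackened
results.append(probe(nodes))   # all white and counters sum to zero
```

The first probe fails because of the in-flight message and the second because the receiver was blackened, so only the third wave reports termination. Under the correspondence described earlier, a successful wave over a closed train would indicate that the train is isolated.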
5 Experiments and Results
Our experimental platform consists of a pair of Athlon XP systems: one of 1527MHz and 1024MB RAM and the other 1470MHz and 512MB RAM connected via switched fast ethernet. Wall-clock time is the only machine-dependent result as the I/O costs are deterministic (depending only on configuration and the load data structures) and therefore independent of the hardware used for execution.
To verify integrity of the store, the OO7 (Carey, DeWitt & Naughton 1993) medium data structures were created, installed into the store and verified. To measure performance in the face of highly connected data, a small finite-element simulation was written in ProcessBase and checkpointed. It creates a 2D planar mesh of triangles, each connected to all of its neighbours and objects representing its vertices. As an initial experiment, the effect of the remset caches and
Figure 7: Effect of Caching Metadata (OO7). Cost (Stable Words Written), ticks from 1e+07 to 1e+10, against Remset Cache Size (remsets), 0 to 70; series: max delta-ref 0, 64 and 128.
in stable I/O with respect to the size of both remset caches and Δref sets.
Figure 6 shows that a reduction of at least one order of magnitude in execution cost (as measured by the load on the stable storage layer) is possible by using the latency tolerant metadata update schemes described in this paper. The results also show that there is little benefit in increasing Δref set and remset cache size beyond a certain minimum.
Experiments with OO7 (Figure 7) in conjunction with extremely small remset caches and Δref sets show approximately 2.5 orders of magnitude difference in I/O cost between the highest and lowest cost configurations. The OO7 data structure has lower connectivity and therefore fewer inter-partition references than the FEA test, so as expected the use of metadata caching has less effect.
Further experimentation into caching policies (including Δref sets and remset caches) and other system policies is ongoing.
6 Related Work
Distributed persistent object stores have been designed and implemented in the past but their garbage collectors have in general suffered from a lack of completeness, scalability or both.
Larchant (Shapiro & Ferreira 1995) and PerDIS (Ferreira, Shapiro, Blondel, Fambon, Garcia, Kloosterman, Richer, Roberts, Sandakly, Colouris, Dollimore, Guedes, Hagimont & Krakowiak 2000) are distributed persistent stores with garbage collection but they are not complete in the face of large distributed cycles and are only probabilistically complete with smaller distributed cycles.
Thor (Maheshwari 1993) is a distributed
(Maheshwari & Liskov 1997b), distributed marking (Maheshwari & Liskov 1997c) or back-tracing (Maheshwari & Liskov 1997a). Controlled migration requires that an entire distributed cycle be migrated to a single site to be reclaimed by that site's local garbage collection and is therefore not scalable to large connected components of garbage. Distributed marking operates in phases over all sites in the system, regardless of the sites that the component to be reclaimed actually spans. The global nature of garbage detection in that scheme means that it is not scalable to large numbers of nodes. Thor's back-tracing algorithm is complete and scalable but space- and time-inefficient because of its requirement to store state for each remset entry involved in a back trace. Safety of the back trace also requires a pointer tracking algorithm that is theoretically more expensive than that used by DMOS and DPMOS.
7 Further Work
The context of this research is a detailed comparison between two different approaches to obtaining complete collection in the face of distributed cycles of garbage while maintaining efficiency and scalability. The architecture and measurement framework described in this paper is the basis for the quantitative comparisons between the train algorithm and a form of back-tracing based on Thor.
8 Conclusions
Powerful scientific applications in common use demand
scalable, complete garbage collection for
storage-oriented clusters. Scalability and completeness
of collection are made necessary by the types of
data structures employed by these applications and
efficiency and scalability are required of the system
which stores them.
The contributions of DPMOS are:
• The design and implementation of a new pointer
  tracking algorithm to permit safe garbage collection
  in the persistent heap, decoupled from collection
  in caches.
• The novel combination of the Train Algorithm,
  Distributed Termination Detection, Deferred
  Metadata Update and our new Pointer Tracking
  algorithm to provide safe, complete, scalable and
  efficient garbage collection for storage-oriented
  clusters.
[...] system that uses this novel combination and shows
that this new scalable and complete approach to
distributed persistent garbage collection is valuable since
it permits the support of a large and powerful class of
applications commonly targeted by high performance
and distributed systems.
9 Acknowledgments
The work is supported in part by ARC Linkage
International Award LX0349049: Extending a family of
garbage collectors.
References
Amsaleg, L., Franklin, M. & Gruber, O. (1995), Efficient
incremental garbage collection for client-server
object database systems, in `VLDB95',
Zurich, Switzerland, pp. 42–53.
Atkinson, M. P., Bailey, P., Chisholm, K., Cockshott,
W. & Morrison, R. (1983), `An approach
to persistent programming', Computer Journal
26(4), 360–365.
Atkinson, M. P. & Morrison, R. (1995), `Orthogonally
persistent object systems', VLDB Journal
4(3), 319–401.
Bekkers, Y. & Cohen, J., eds (1992), Proceedings
of International Workshop on Memory Management,
Vol. 637 of LNCS, Springer-Verlag, St
Malo, France.
Blackburn, S., Hudson, R., Morrison, R., Moss, J.,
Munro, D. & Zigman, J. (2001), Starting with
termination: A methodology for building distributed
garbage collection algorithms, in `Proc.
ACSC2001'.
Carey, M. J., DeWitt, D. J. & Naughton, J. F.
(1993), `The OO7 benchmark', SIGMOD Record
22(2), 12–21.
Dijkstra, E. W. (1987), Shmuel Safra's version of
termination detection, Technical Report EWD998,
The University of Texas at Austin.
Falkner, K., Detmold, H., Munro, D. & Olds, T.
(n.d.), Towards compliant distributed shared
memory, in `WSDSM'02, 2nd Int'l Conf on Cluster
Computing and the Grid (CCGRID)'.
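Ferreira, P., Shapiro, M., Blondel, X., Fambon, O.,
Garcia, J., Kloosterman, S., Richer, N., Roberts, M.,
Sandakly, F., Coulouris, G., Dollimore, J., Guedes, P.,
Hagimont, D. & Krakowiak, S.
(2000), PerDiS: Design, implementation and use
of a PERsistent DIstributed Store, Vol. 1752 of
LNCS, pp. 427–452.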
Gray, J., McJones, P., Blasgen, M., Lindsay, B., Lorie,
R., Price, T., Putzolu, F. & Traiger, I. (1981),
`The recovery manager of the System R database
manager', ACM Computing Surveys 13(2),
223–242.
Howard, D., Detmold, H., Falkner, K. & Munro, D. S.
(2003), Using the compliant systems architecture
to deliver flexible policies within two-phase
commit, in A. E. James, B. Lings & M. Younas, eds,
`BNCOD', Vol. 2712 of Lecture Notes in Computer
Science, Springer, pp. 245–252.
Hudson, R. L., Morrison, R., Moss, J. E. B. & Munro,
D. S. (1997), Garbage collecting the world: One
car at a time, in `OOPSLA'97 ACM Conference
on Object-Oriented Systems, Languages and
Applications – Twelfth Annual Conference', Vol.
32(10) of ACM SIGPLAN Notices, ACM Press,
Atlanta, GA, pp. 162–175.
Hudson, R. L., Morrison, R., Moss, J. E. B. & Munro,
D. S. (1998), Where have all the pointers gone,
in `Proceedings of 21st Australasian Computer
Science Conference', Perth, pp. 107–119.
Hudson, R. L. & Moss, J. E. B. (1992), Incremental
collection of mature objects, in Bekkers & Cohen
(1992), pp. 388–403.
Kirby, G., Connor, R., Cutts, Q., Dearle, A.,
Farkas, A. & Morrison, R. (1992), Persistent
hyper-programs, in `Persistent Object Systems',
Springer-Verlag, pp. 86–106.
Lowry, M. C. & Munro, D. S. (2002), Safe and Complete
Distributed Garbage Collection with The
Train Algorithm, in `Proceedings of ICPADS'02',
Taipei, Taiwan, pp. 651–658.
Maheshwari, U. (1993), Distributed garbage collection
in a client-server, transactional, persistent
object system, Technical Report MIT/LCS/TR-574,
MIT Press.
Maheshwari, U. & Liskov, B. (1997a), Collecting
cyclic distributed garbage by back tracing, in
`Proceedings of PODC'97 Principles of Distributed
Computing', pp. 239–248.
Maheshwari, U. & Liskov, B. (1997b), `Collecting
cyclic distributed garbage by controlled migration'
[...]
Maheshwari, U. & Liskov, B. (1997c), [...]
garbage collection of a large object store, in
`Proceedings of SIGMOD'97', pp. 313–323.
Morrison, R., Balasubramaniam, D., Greenwood, M.,
Kirby, G. N. C., Mayes, K., Munro, D. S. & Warboys,
B. C. (1999), ProcessBase Standard Library
Reference Manual (Version 1.0.1), Universities of
St Andrews and Manchester.
Morrison, R., Balasubramaniam, D., Greenwood, R.,
Kirby, G., Mayes, K., Munro, D. & Warboys,
B. (2000), `A compliant persistent architecture',
Software, Practice & Experience 30(4), 363–386.
Moss, J. E. B., Munro, D. S. & Hudson, R. L. (1996),
PMOS: A complete and coarse-grained incremental
garbage collector for persistent object stores,
in `Proc. of the 6th Int. Workshop on Persistent
Object Systems', Cape May NJ (USA),
pp. 140–150.
Munro, D., Falkner, K., Lowry, M. & Vaughan,
F. (2001), Mosaic: A non-intrusive complete
garbage collector for DSM systems, in `Proc.
1st CCGRID, WSDSM'01, Brisbane, Australia',
pp. 539–546.
Norcross, S., Morrison, R., Munro, D. & Detmold,
H. (2003), Implementing a family of distributed
garbage collectors, in `Proceedings of
the 2003 Australasian Computer Science
Conference', Adelaide, Australia, pp. 161–170.
Seligmann, J. & Grarup, S. (1995), Incremental mature
garbage collection using the train algorithm,
in O. Nierstrasz, ed., `Proceedings of 1995
European Conference on Object-Oriented
Programming', LNCS, Springer-Verlag, University of
Aarhus, pp. 235–252.
Shapiro, M. & Ferreira, P. (1995), Larchant-RDOSS:
a distributed shared persistent memory and its
garbage collector, in J.-M. Helary & M. Raynal,
eds, `Workshop on Distributed Algorithms',
number 972 in `LNCS', Le Mont Saint-Michel,
pp. 198–214.
Tel, G. & Mattern, F. (1991), The derivation of
distributed termination detection algorithms from
garbage collection schemes (extended abstract),
in Aarts et al., eds, `PARLE'91 Parallel
Architectures and Languages Europe', Vol. 505
of LNCS, Springer-Verlag.
Wilson, P. R. (1992), Uniprocessor garbage collection
techniques, in Bekkers & Cohen (1992).