PARROT: AN APPLICATION ENVIRONMENT FOR DATA-INTENSIVE COMPUTING
DOUGLAS THAIN AND MIRON LIVNY∗
Abstract. Distributed computing continues to be an alphabet-soup of services and protocols for managing computation and storage. To live in this environment, applications require middleware that can transparently adapt standard interfaces to new distributed systems; such middleware is known as an interposition agent. In this paper, we present several lessons learned about interposition agents via a progressive study of design possibilities. Although performance is an important concern, we pay special attention to less tangible issues such as portability, reliability, and compatibility. We begin with a comparison of seven methods of interposition and select one method, the debugger trap, that is the slowest but also the most reliable. Using this method, we implement a complete interposition agent, Parrot, that splices existing remote I/O systems into the namespace of standard applications. The primary design problem of Parrot is the mapping of fixed application semantics into the semantics of the available I/O systems. We offer a detailed discussion of how errors and other unexpected conditions must be carefully managed in order to keep this mapping intact. We conclude with an evaluation of the performance of the I/O protocols employed by Parrot, and use an Andrew-like benchmark to demonstrate that semantic differences have consequences in performance.¹
Keywords. Adaptive middleware, error diagnosis, interposition agents, virtual machines.
1. Introduction. The field of distributed computing has produced countless systems for harnessing remote processors and accessing remote data. Despite the intentions of their designers, no single system has achieved universal acceptance or deployment. Each carries its own strengths and weaknesses in performance, manageability, and reliability. Renewed interest in world-wide computational systems is increasing the number of protocols and interfaces in play. A complex ecology of distributed systems is here to stay.
[Figure: hourglass diagram. Top: Distributed Computing Services (Condor, PBS, NQE, LSF, LoadLeveler), reached through the Common Process Interface (main, exit, abort, kill, sleep). Center: Application, with Parrot and the Local Operating System. Bottom: Distributed I/O Services (FTP, NeST, Chirp, RFIO, DCAP), reached through the Common I/O Interface (open, close, read, write, lseek). Other labels: Process, CPU/IO Interaction.]
Fig. 1.1. The Hourglass Model
The result is an hourglass model of distributed computing, shown in Figure 1.1. At the center lie ordinary applications built to standard interfaces such as POSIX. Above lie a number of batch systems that manage processors, interact with users, and deal with failures of execution. A batch system interacts with an application through simple interfaces such as main and exit. Below lie a number of I/O services that organize and communicate with remote memory, disks, and tapes. An ordinary operating system (OS) transforms an application's explicit reads and writes into the low-level block and network operations that compose a local or distributed file system.
However, attaching a new I/O service to a traditional OS is not a trivial task. Although the principle of an extensible OS has received much attention in the research community [19], production operating systems have limited facilities for extension, usually requiring kernel modifications or administrator privileges. Although this may be acceptable for a personal computer, this requirement makes it difficult or impossible to provide custom I/O and naming services for applications visiting a borrowed computing environment such as a timeshared mainframe, a commodity computing cluster, or an opportunistic workgroup.
To remedy this situation, we advocate the use of interposition agents [13]. These devices transform standard interfaces into remote I/O protocols not normally found in an operating system. In effect, an agent allows an application to bring its filesystem and namespace along with it wherever it goes. This releases the dependence on the details of the execution site while preserving the use of standard interfaces. In addition, the agent can tap into naming services that transform private names into fully-qualified names relevant in the larger system.
∗ Computer Sciences Department, University of Wisconsin
              internal techniques                      external techniques
              poly.     static   dyn.      binary      debug        remote       kernel
              exten.    link     link      rewrite     trap         filesys.     callout
scope         library   static   dynamic   dynamic     no setuid    any          any
burden        rewrite   relink   identify  identify    run command  superuser    modify os
layer         fixed     any      any       any         syscall      fs ops only  syscall
init/fini     hard      hard     hard      hard        easy         impossible   easy
aff. linker   no        no       no        no          yes          yes          yes
debug         yes       yes      yes       yes         limited      yes          yes
secure        no        no       no        no          yes          yes          yes
find holes    easy      hard     hard      hard        easy         easy         easy
porting       easy      hard     hard      hard        medium       easy         medium
Fig. 1.2. Properties of Interposition Techniques
In this paper, we present practical lessons learned from several years of building and deploying interposition agents within the Condor project [20, 28, 21, 22]. Although the notion of such agents is not unique to Condor [13, 2, 12], they have seen relatively little use in other production systems. This is due to a variety of technical and semantic difficulties that arise in connecting real systems together.
We present this paper as a progressive design study that explores these problems and explains our solutions. We begin with a detailed study of seven methods of interposition, five of which we have experience building and deploying. The remaining two are effective but impractical because of the privilege required. We will compare the performance and functionality of these methods, giving particular attention to intangibles such as portability and reliability. In particular, we will concentrate on one method that has not been explored in detail: the debugger trap. Although this method has been employed in idealized operating systems, it requires additional techniques in order to provide acceptable performance on popular operating systems with limited debugging capabilities, such as Linux.
Using the debugger trap, we focus on the design of Parrot, an interposition agent that splices remote I/O systems into the filesystem space of ordinary applications. A central problem in the design of an I/O agent is the semantic problem of mapping not-quite-identical interfaces to each other. The outgoing mapping is usually quite simple: read becomes a get, write becomes a put, and so forth. The real difficulty lies in interpreting the large space of return values from remote services. Many new kinds of failure are introduced: servers crash, credentials expire, and disks fill. Trivial transformations into the application's standard interface lead to a brittle and frustrating experience for the user.
A corollary to this observation is that access to computation and storage cannot be fully divorced. Abstract notions of design often encourage the partition of distributed systems into two activities: either computation or storage. An interposition agent serves as a connection between these two concerns; like an operating system kernel, it manages both types of devices and must mediate their interaction, sometimes bypassing the application itself.
This paper is a condensed version of a workshop paper. Due to space limitations, we have omitted a number of sections and details, indicated by footnotes. The interested reader may find further details in the original paper [23] or in a technical report [24].²
2. Interposition Techniques Compared. There are many techniques for interposing services between an application and the underlying system. Each has particular strengths and weaknesses. Figure 1.2 summarizes seven interposition techniques. They may be broken into two broad categories: internal and external. Internal techniques modify the memory space of an application process in some fashion. These techniques are flexible and efficient, but cannot be applied to arbitrary processes. External techniques capture and modify operations that are visible outside an application's address space. These techniques are less flexible and have higher overhead, but can be applied to nearly any process. The Condor project has experience building and deploying all of the internal techniques as well as one external technique: the debugger trap. We describe the remaining two external techniques from relevant publications.
The simplest technique is the polymorphic extension. If the application structure is amenable to extension, we may simply add a new implementation of an existing interface. The user then must make small code changes to invoke the appropriate constructor or factory in order to produce the new object. This technique is used in Condor's Java Universe [22] to connect an ordinary InputStream or OutputStream to a secure remote proxy. It is also found in general purpose libraries such as SFIO [25].
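The polymorphic extension can be illustrated with a minimal Python sketch. It is not drawn from Condor or SFIO: the class name RemoteProxyStream and the in-memory fetch function are hypothetical stand-ins for a connection to a remote proxy. The existing application code (count_lines) is unchanged; only the construction of the stream differs.

```python
import io


class RemoteProxyStream(io.RawIOBase):
    """Hypothetical stream that reads through a caller-supplied function.

    fetch(offset, length) -> bytes stands in for a request to a remote proxy.
    """

    def __init__(self, fetch):
        super().__init__()
        self._fetch = fetch
        self._offset = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Ask the "remote" side for at most len(buffer) bytes at our offset.
        data = self._fetch(self._offset, len(buffer))
        buffer[:len(data)] = data
        self._offset += len(data)
        return len(data)  # returning 0 signals end of file


def count_lines(stream):
    """Existing application code: it depends only on the stream interface."""
    return sum(1 for _ in stream)


# The small code change required of the user: construct the new object.
REMOTE_DATA = b"alpha\nbeta\ngamma\n"
stream = io.BufferedReader(
    RemoteProxyStream(lambda off, n: REMOTE_DATA[off:off + n]))
line_count = count_lines(stream)
```

The design point is that the extension works only where the application already programs against an interface and can be edited to select a different implementation.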
The static library technique involves creating a replacement for an existing library. The user is obliged to re-link the application with the new library. For example, Condor's Standard Universe [20] provides a drop-in replacement for the standard C library that provides transparent checkpointing as well as proxying of I/O back to the submission site, fully emulating the user's home environment. The dynamic library technique also involves creating a replacement for an existing library. However, through the use of linker controls, the user may direct the new library to be used in place of the old for any given dynamically linked library. This technique is used by DCache [8], some implementations of SOCKS [15], as well as our own Bypass [21] toolkit. The binary rewriting technique involves modifying the machine code of a process at runtime to redirect the flow of control. This requires very detailed knowledge of the CPU architecture in use, but this can be hidden behind an abstraction such as the Paradyn [17] toolkit. This technique has been used to hijack an unwitting process at runtime [28].
Traditional debuggers make use of a specialized operating system interface for stopping, examining, and resuming a process. The debugger trap technique uses this interface, but instead of merely examining the process, the debugging agent traps each system call, provides an implementation, and then places the result back in the target process while nullifying the intended system call. An example of this technique is UFO [2], which allows access to HTTP and FTP resources via whole-file fetching. A difficulty with the debugger trap is that many tools compete for access to a single process' debug interface. The Tool Daemon Protocol (TDP) [18] provides an interface for managing such tools in a distributed system.
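The control flow of the debugger trap (stop at each system call, supply an implementation, place the result back, resume) can be modeled without any operating system support. The following Python sketch is only a model under stated assumptions: a generator plays the traced process, each yield plays a trapped system call, and the Agent class, the path /chirp/server/data, and the file contents are all hypothetical. A real agent would use the operating system's debugging interface instead.

```python
def application():
    """Model of a traced process: each yield is a system call that stops
    the process until the agent resumes it with a result."""
    fd = yield ("open", "/chirp/server/data")
    data = yield ("read", fd, 5)
    yield ("close", fd)
    return data


class Agent:
    """Hypothetical agent that implements the trapped calls itself."""

    def __init__(self, files):
        self.files = files      # path -> bytes, stands in for a remote service
        self.open_files = {}    # fd -> [path, offset]
        self.next_fd = 3

    def handle(self, call):
        name, *args = call
        if name == "open":
            fd, self.next_fd = self.next_fd, self.next_fd + 1
            self.open_files[fd] = [args[0], 0]
            return fd
        if name == "read":
            fd, length = args
            path, offset = self.open_files[fd]
            chunk = self.files[path][offset:offset + length]
            self.open_files[fd][1] += len(chunk)
            return chunk
        if name == "close":
            del self.open_files[args[0]]
            return 0
        raise NotImplementedError(name)


def run_traced(process, agent):
    """Trap loop: stop at each call, supply a result, resume the process."""
    try:
        call = next(process)
        while True:
            result = agent.handle(call)   # the intended call is never executed
            call = process.send(result)   # place the result back and resume
    except StopIteration as done:
        return done.value


agent = Agent({"/chirp/server/data": b"hello world"})
result = run_traced(application(), agent)
```

In this model the application never touches a real file; every result it observes is produced by the agent, which is the essential property of the technique.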
A remote filesystem may be used as an interposition agent by simply modifying the file server. NFS is a popular choice for this technique, and is used by the Legion [27] object-space translator, as well as the Slice [4] microproxy. Finally, short of modifying the kernel itself, we may install a one-time kernel callout which permits a filesystem to be serviced by a user-level process. This facility can be present from the ground up in a microkernel [1], but can also be added as an afterthought, which is the case for most implementations of AFS [11].
The four internal techniques may only be applied to certain kinds of programs. Polymorphic extension and static linking only apply to those programs that can be rebuilt. The dynamic library technique requires that the replaced library be dynamic, while binary rewriting (with the Paradyn toolkit) requires the presence of the dynamic loader, although no particular library must be dynamic. The three external techniques apply to any process, with the exception that the debugging trap prevents the traced process from elevating its privilege level through the setuid feature.
The burden upon the user for each of these techniques also varies widely. For example, polymorphic extension requires small code changes while static linking requires rebuilding. These techniques may not be possible with packaged commercial software. Dynamic linking and binary rewriting require that the user understand which programs are dynamically linked and which are not. Most standard system utilities are dynamic, but many commercial packages are static. Our experience is that users are surprised and quite frustrated when an (unexpectedly) static application blithely ignores an interposition agent. The remote filesystem and kernel callout techniques impose the smallest user burden, but require a cooperative system administrator to make the necessary changes. The debugger trap imposes a small burden on the user to simply invoke the agent executable.
Perhaps the most significant difference between the techniques is the ability to trap different layers of software. Each of the internal techniques may be applied at any layer of code. For example, Bypass has been used to instrument an application's calls to the standard memory allocator, the X Window System library, and the OpenGL library. In contrast, the external techniques are fixed to particular interfaces. The debugger trap only operates on physical system calls, while the remote filesystem and kernel callout are limited to certain filesystem operations.
Differences in these techniques affect the design of code that they attach to. Consider the matter of implementing a directory listing on a remote device. The internal techniques are capable of intercepting library calls such as open and opendir. These are easily mapped to remote file access protocols, which generally have separate procedures for accessing files and directories. However, the Unix interface unifies files and directories; both are accessed through the system call open. External techniques must accept an open on either a file or
The external techniques also differ in the range of operations that they are able to trap. While the debugger trap can modify any system call, the remote filesystem and kernel callout techniques are limited to filesystem operations. A particular remote filesystem may have even further restrictions. For example, the stateless NFS protocol has no representation of the system calls open and close. Without access to this information, the interposed service cannot provide semantics significantly different from those provided by NFS. Further, such file system interfaces do not express any binding between individual operations and the processes that initiate them. That is, a remote filesystem agent sees a read or write but not the process id that issued it. Without this information, it is difficult or impossible to perform accounting for the purposes of security or performance.
A number of important activities take place during the initialization and finalization of a process: dynamic libraries are loaded; constructors, destructors, and other automatic routines are run; I/O streams are created or flushed. During these transitions, the libraries and other resources in use by a process are in a state of flux. This complicates the implementation of internal agents that wish to intercept such activity. For example, the application may perform I/O in a global constructor or destructor. Thus, an internal agent itself cannot rely on global constructors or destructors: there is no ordering enforced between those of the application and those of the agent. Likewise, a dynamically loaded agent cannot interpose on the actions of the dynamic linker. The programmer of such agents must not only exercise care in constructing the agent, but also in selecting the libraries invoked by the agent. Such code is time consuming to create and debug. These activities are much more easily manipulated through external techniques. For example, external techniques can easily trap and modify the activities of the dynamic linker.
No code is ever complete nor fully debugged. Production deployment of interposition agents requires that users be permitted to debug both applications and agents. All techniques admit debugging of user programs, with the only complication arising in the debugger trap. For obvious reasons, a single process cannot be debugged by two processes at once, so a debugger cannot be attached to an instrumented process. However, a debugger trap agent can be used to manage an entire process tree, so instead the user may use the agent to invoke the debugger, which may then invoke the application. The debugger's operations may be trapped just like any other system call and passed along to the application, all under the supervision of the agent.
Interposition agents may be used for security as well as convenience. An agent may provide a sandbox which prevents an untrusted application from modifying any external data that it is not permitted to access. The internal techniques are not suitable for this security purpose, because they may easily be subverted by a program that invokes system calls directly without passing through libraries. The external techniques, however, cannot be fooled in this way and are thus suitable for security.
Related to security is the matter of hole detection. An interposition agent may fail to trap an operation attempted by an application. This may simply be a bug in the agent, or it may be that the interface has evolved over time, and the application is using a deprecated or newly added interface that the agent is not aware of. Internal agents are especially sensitive to this bug. As standard libraries develop, interfaces are added and deleted, and modified library routines may invoke system calls directly without passing through the corresponding public interface function. For example, fopen may invoke the open system call without passing through the open function. Such an event causes general chaos in both the application and agent, often resulting in crashes or (worse) silent output errors. No such problem occurs in external agents. Although interfaces still change, any unexpected event is detected as an unknown system call. The agent may then terminate the application and indicate the exact problem.
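Hole detection in an external agent can be sketched as a dispatch table with an explicit failure path. The handler table, the exception name, and the sample call trace below are hypothetical; openat2 merely plays the role of a newly added interface that the agent does not yet know about.

```python
class UnknownSystemCall(Exception):
    """Raised when the traced process issues a call the agent cannot handle."""


# Hypothetical table of the calls this agent knows how to implement.
HANDLERS = {
    "open": lambda path: 3,
    "close": lambda fd: 0,
    "getpid": lambda: 1234,
}


def dispatch(name, *args):
    handler = HANDLERS.get(name)
    if handler is None:
        # An external agent sees every call, so a hole is reported by name
        # here instead of silently bypassing the agent.
        raise UnknownSystemCall(
            f"agent does not implement system call {name!r}")
    return handler(*args)


def run(calls):
    """Replay (name, args) pairs; stop at the first unknown call."""
    try:
        for name, args in calls:
            dispatch(name, *args)
    except UnknownSystemCall as hole:
        return f"terminated: {hole}"
    return "completed"


report = run([("getpid", ()), ("open", ("/tmp/x",)), ("openat2", ("/tmp/y",))])
```

The diagnosis is immediate because the missing entry point is named in the report; an internal agent in the same situation would simply never see the call.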
The problem of hole detection must not be underestimated. Our experience is that any significant operating system upgrade includes changes to the standard libraries, which in turn require modifications to internal trapping techniques. Thus, internal agents are rarely forward compatible. Further, identifying and fixing such holes is time consuming. Because the missed operation itself is unknown, one must spend long hours with a debugger to see where the expected course of the application differs from the actual behavior. Once discovered, a new entry point must be added to the agent. The treatment is simple but the diagnosis is difficult. We have learned this lesson the hard way by porting both the Condor remote system call library and the Bypass toolkit to a wide variety of Unix-like platforms.
For these reasons, we have described porting in Figure 1.2 as follows. The polymorphic extension and the remote filesystem are quite easy to build on a new system. The debugger trap and the kernel callout have
                      getpid            stat           open/close     read 8KB       bandwidth
unmod                 .18 ± .03 µs      1.85 ± .09     3.18 ± .08     3.27 ± .19     282 ± 13 MB/s
rewrite               .21 ± .25 µs      1.82 ± .02     3.21 ± .05     3.26 ± .03     280 ± 7 MB/s
static                .21 ± .02 µs      1.80 ± .17     3.59 ± .05     3.34 ± .02     280 ± 17 MB/s
dynamic               1.22 ± .01 µs     3.60 ± .10     5.53 ± .06     4.31 ± .09     278 ± 4 MB/s
(relative to unmod)   (6.8x)            (1.9x)         (1.7x)         (1.3x)         (0.99x)
debug                 10.06 ± .21 µs    55.41 ± .50    42.09 ± .06    30.99 ± .26    122 ± 4 MB/s
(relative to unmod)   (56x)             (30x)          (13x)          (9x)           (0.43x)
Fig. 2.1. Overhead of Interposition Techniques
and binary rewriting should be viewed as a significant porting challenge that must be revisited at every minor operating system upgrade.
Figure 2.1 compares the performance of four transparent interposition techniques. We constructed a benchmark C program which timed 100,000 iterations of various system calls on a 1545 MHz Athlon XP 1800 running Linux 2.4.18. Available bandwidth was measured by reading a 100 MB file sequentially in 1 MB blocks. The mean and standard deviation of 1000 cycles of each benchmark are shown. File operations were performed on an existing file in a temporary file system. The unmod case gives the performance of this benchmark without any agent attached, while the remaining four show the same benchmark modified by each interposition technique. In each case, we constructed a very minimal agent to trap system calls and invoke them without modification.
As can be seen, the binary rewriting and static linking methods add no significant cost to the application. The dynamic method has overhead on the order of microseconds, as it must manage the structure of (potentially) multiple agents and invoke a function pointer. However, these overheads are quickly dominated by the cost of moving data in and out of the process. The debugger trap has the greatest overhead of all the techniques, ranging from a 56x slowdown for getpid to a 6x slowdown for writing 8 KB. Most importantly, the bandwidth measurement demonstrates that the debugger trap achieves less than half of the unmodified I/O bandwidth. It should be fairly noted that this latency and bandwidth will be dominated by the latency and bandwidth of accessing remote services on commodity networks. Security and reliability come at a measurable cost.³
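The relative overheads quoted for the dynamic library and debugger trap methods follow directly from the mean values in Figure 2.1. The short Python calculation below reproduces them; the numbers are copied from the figure, while the variable and function names are ours.

```python
# Mean latencies (microseconds) and bandwidth (MB/s) copied from Figure 2.1.
COLUMNS = ["getpid", "stat", "open/close", "read 8KB", "bandwidth"]
UNMOD = [0.18, 1.85, 3.18, 3.27, 282.0]
DYNAMIC = [1.22, 3.60, 5.53, 4.31, 278.0]
DEBUG = [10.06, 55.41, 42.09, 30.99, 122.0]


def relative(row, base=UNMOD):
    """Ratio of each measurement to the unmodified case."""
    return [value / reference for value, reference in zip(row, base)]


dynamic_ratio = relative(DYNAMIC)
debug_ratio = relative(DEBUG)

for name, dyn, dbg in zip(COLUMNS, dynamic_ratio, debug_ratio):
    print(f"{name:>10}: dynamic {dyn:5.2f}x   debug {dbg:5.2f}x")
```

For the latency columns a ratio above 1 is a slowdown; for the bandwidth column a ratio below 1 is a loss of throughput, so 122/282 (about 0.43) corresponds to the "less than half" observation above.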
Fig. 3.1. Interactive Browsing with Parrot
3. Parrot. The Parrot interposition agent attaches standard applications to a variety of distributed I/O systems by way of the debugger trap, described above. Each I/O protocol is presented as a normal filesystem entry under a new top-level directory bearing the name of the protocol. In addition, an optional mountlist may be given, which redirects parts of the filesystem namespace to external paths. Figure 3.1 shows Parrot being used with standard tools to manipulate files stored at the Mass Storage Server (MSS) at the National Center for Supercomputing Applications (NCSA) via the Grid Security Infrastructure (GSI) [9] variant of the File Transfer Protocol (FTP).
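The namespace just described can be sketched as a small path resolver. This is an illustration under stated assumptions, not Parrot's implementation: we assume paths of the form /protocol/host/path, a mountlist given as (prefix, target) pairs, and the host names shown are hypothetical.

```python
import posixpath

# Protocol names taken from the text; the local driver handles everything else.
PROTOCOLS = {"chirp", "ftp", "nest", "rfio", "dcap"}


def apply_mountlist(path, mountlist):
    """Redirect a path using (prefix, target) pairs; the longest prefix wins."""
    ordered = sorted(mountlist, key=lambda entry: len(entry[0]), reverse=True)
    for prefix, target in ordered:
        stem = prefix.rstrip("/")
        if path == prefix or path.startswith(stem + "/"):
            return target.rstrip("/") + path[len(stem):]
    return path


def resolve(path, mountlist=()):
    """Return (driver, host, remote_path) for an application-visible path."""
    path = apply_mountlist(posixpath.normpath(path), mountlist)
    parts = path.split("/")        # "/ftp/host/a" -> ["", "ftp", "host", "a"]
    if len(parts) >= 3 and parts[1] in PROTOCOLS:
        return parts[1], parts[2], "/" + "/".join(parts[3:])
    return "local", None, path


# Hypothetical mountlist entry: /data is served by a Chirp server.
MOUNTS = [("/data", "/chirp/server.example.org/project")]
```

For example, resolve("/data/run1/x", MOUNTS) selects the chirp driver with remote path /project/run1/x, while resolve("/etc/passwd") falls through to the local driver.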
Parrot is equipped with a variety of drivers for communicating with external storage systems; each has particular features and limitations. The simplest is the Local driver, which simply passes operations on to the underlying operating system. The Chirp protocol was designed by the authors in an earlier work [22] to provide remote I/O with semantics very similar to POSIX. A standalone Chirp server is distributed with Parrot. The venerable File Transfer Protocol (FTP) has been in heavy use since the early days of the Internet. Its simplicity allows for a wide variety of implementations, which, for our purposes, results in an unfortunate degree of imprecision that we will expand upon below. Parrot supports the secure GSI [3] variant of FTP. The NeST protocol is the native language of the NeST storage appliance [6], which provides an array of authentication, allocation, and accounting mechanisms for storage that may be shared among multiple transient users. The RFIO and DCAP protocols were designed in the high-energy physics community to provide access to hierarchical mass storage devices such as Castor [5] and DCache [8].
Because Parrot must preserve POSIX semantics for the sake of the application, our foremost concern is the ability of each of these protocols to provide the necessary semantics. Performance is a secondary concern, although it is affected significantly by semantic issues. A summary of the semantics of each of these protocols is given in Figure 3.2.⁴
         name binding   discipline   dirs     metadata   symlinks   connections
posix    open/close     random       yes      direct     yes        -
chirp    open/close     random       yes      direct     yes        per client
ftp      get/put        sequential   varies   indirect   no         per file
nest     get/put        random       yes      indirect   yes        per client
rfio     open/close     random       yes      direct     no         per file/op
dcap     open/close     random       no       direct     no         per client
Fig. 3.2. Protocol Compatibility with POSIX
4. Errors and Boundary Conditions. Error handling has not been a pervasive problem in the design of traditional operating systems. As new models of file interaction have developed, attending error modes have been added to existing systems by expanding the software interface at every level. For example, the addition of distributed filesystems to the Unix kernel created the new possibility of a stale file handle, represented by the ESTALE error. As this error mode was discovered at the very lowest layers of the kernel, the value was added to the device driver interface, the filesystem interface, the standard library, and expected to be handled directly by applications.
We have no such luxury in an interposition agent. Applications use the existing interface, and we have neither the desire nor the ability to change it. Sometimes, if we are lucky, we may re-use an error such as ESTALE for an analogous, if not identical purpose. Yet, the underlying device drivers generate errors ranging from the vague "filesystem error" to the microscopically precise "server's certification authority is not trusted." How should the unlimited space of errors in the lower layers be transformed into the fixed space of errors available to the application?⁵
For example, several device drivers have the necessary machinery to carry out all of a user's possible requests, but provide vague errors when a supported operation fails. The FTP driver allows an application to read a file via the GET command. However, if the GET command fails, the only available information is the error code 550, which encompasses almost any sort of filesystem error including "no such file," "access denied," and "is a directory." The POSIX interface does not permit a catch-all error value; it requires a specific reason. Which error code should be returned to the application?
One technique for dealing with this problem is to interview the service in order to narrow down the cause of the error, in a manner similar to that of an expert system. Suppose that we attempt to retrieve a file using an FTP GET operation. If the GET should fail, we may hypothesize that the named file is actually a directory. The hypothesis may be tested with a change directory (CWD) command. If that succeeds, the hypothesis is true, and we may return the precise error "not a file." If that fails, we must propose another hypothesis and test it. Parrot performs a number of two- and three-step interviews in response to a variety of FTP errors.
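A two-step interview of this kind can be sketched as follows. The FakeFtp class is a hypothetical stand-in for an FTP session so that the example runs without a server. The choice of EISDIR for the confirmed hypothesis and of ENOENT as the fallback when no hypothesis is confirmed are our assumptions, not a description of Parrot's actual mapping.

```python
import errno


class FakeFtp:
    """Stand-in for an FTP session; a real driver would talk to a server."""

    def __init__(self, files, dirs):
        self.files, self.dirs = files, dirs

    def get(self, path):
        # Every failure collapses into the same vague reply code, 550.
        if path in self.files:
            return 226, self.files[path]
        return 550, None

    def cwd(self, path):
        return 250 if path in self.dirs else 550


def read_file(ftp, path):
    """Return (errno_value, data); errno_value is 0 on success."""
    code, data = ftp.get(path)
    if code != 550:
        return 0, data
    # Hypothesis: the name is actually a directory. Test it with CWD.
    if ftp.cwd(path) == 250:
        return errno.EISDIR, None
    # Assumed fallback when the hypothesis is rejected; a fuller interview
    # would propose and test further hypotheses before settling on this.
    return errno.ENOENT, None


session = FakeFtp(files={"/pub/a.txt": b"data"}, dirs={"/pub"})
outcomes = [read_file(session, p) for p in ("/pub/a.txt", "/pub", "/pub/missing")]
```

Each additional hypothesis costs at least one more round trip to the server, so the interview is performed only after a failure and never on the successful path.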
The connection structure of a remote I/O protocol also has implications for semantics as well as performance. Chirp, NeST, and DCAP require one TCP connection between each client and server. FTP and RFIO require a new connection made for each file opened. In addition, RFIO requires a new connection for each operation performed on a non-open file. Because most file system operations are metadata queries, this can result in an extraordinary number of connections in a short amount of time. Ignoring the latency penalties of this activity, a large number of TCP connections can consume resources at clients, servers, and network devices such as address translators.⁶
⁴ Omitted: Details of the various protocols supported by Parrot.
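The effect of these connection disciplines can be estimated with a small counting model. The disciplines are those listed in Figure 3.2; the set of operations treated as acting on a non-open file and the sample trace are our assumptions, and the model deliberately ignores any long-lived control connection.

```python
# Connection disciplines as listed in Figure 3.2.
DISCIPLINE = {
    "chirp": "per client",
    "nest": "per client",
    "dcap": "per client",
    "ftp": "per file",
    "rfio": "per file/op",
}

# Assumed examples of operations performed on a file that is not open.
NON_OPEN_OPS = {"stat", "unlink", "rename", "mkdir"}


def connections(protocol, trace):
    """Count the TCP connections implied by a trace of operation names."""
    rule = DISCIPLINE[protocol]
    if rule == "per client":
        return 1
    count = sum(1 for op in trace if op == "open")
    if rule == "per file/op":
        count += sum(1 for op in trace if op in NON_OPEN_OPS)
    return count


# A metadata-heavy trace: 100 stat calls and 10 files opened, read, and closed.
TRACE = ["stat"] * 100 + ["open", "read", "close"] * 10
counts = {name: connections(name, TRACE) for name in DISCIPLINE}
```

Under this model the trace needs one connection for Chirp, NeST, or DCAP, ten for FTP, and 110 for RFIO, which illustrates how metadata queries dominate the connection count.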
5. Performance. We have deferred a discussion of performance until this point so that we may see the performance effects of semantic constraints. Although it is possible to write applications explicitly to use remote I/O protocols in the most efficient manner, Parrot must provide conservative and complete implementations of POSIX operations. For example, an application may only need to know the size of a file, but if it requests this information via stat, Parrot is obliged to fill the structure with everything it can, possibly at great cost.
[Figure: plot of Bandwidth (MB/s, 0 to 8) against Block Size (4K to 64M) for ftp, rfio, dcap, nest, and chirp.]
Fig. 5.1. Throughput of 128 MB File Copy
The I/O services discussed here, with the exception of Chirp, are designed primarily for efficient high-volume data movement. This is demonstrated by Figure 5.1, which compares the throughput of the protocols at various block sizes. The throughput was measured by copying a 128 MB file into the remote storage device with the standard cp command equipped with Parrot and a varying default block size, as controlled through the stat emulation described above.
Of course, the absolute values are an artifact of our system; however, it can be seen that all of the protocols must be tuned for optimal performance. The exception is Chirp, which only reaches about one half of the available bandwidth. This is because of the strict RPC nature required for POSIX semantics; the Chirp server does not extract from the underlying filesystem any more data than necessary to supply the immediate read. Although it is technically feasible for the server to read ahead in anticipation of the next operation, such data pulled into the server's address space might be invalidated by other actors on the file in the meantime and is thus semantically incorrect.
The hiccup in throughput of DCAP at a block size of 64 KB is an unintended interaction with the default TCP buffer size of 64 KB. The developers of DCAP are aware of the artifact and recommend changing either the block size or the buffer size to avoid it. This is reasonable advice, given that all of the protocols require tuning of some kind.
Figure 5.2 benchmarks the latency of POSIX-equivalent operations in each I/O protocol. These measurements were obtained in a manner identical to that of Figure 2.1, with the indicated servers residing on the same system as in Figure 5.1. Notice that the latencies are measured in milliseconds, whereas Figure 2.1 gave microseconds.
proto   stat                 open/close        read 8KB      write 8KB     bandwidth
chirp   .50 ± .14 ms         .84 ± .09         2.80 ± .06    2.23 ± .04    4.1 MB/s
ftp     .87 ± .09 ms         2.82 ± .26        (no random access)          7.9 MB/s
nest    2.51 ± .05 ms        2.53 ± .17        4.48 ± .14    7.41 ± .32    7.9 MB/s
rfio    13.41 ± .28 ms       23.11 ± 1.29      3.32 ± .14    2.85 ± .18    7.3 MB/s
dcap    152.53 ± 16.68 ms    159.09 ± 16.68    3.01 ± 0.62   3.14 ± .62    7.5 MB/s
Fig. 5.2. Performance of I/O Protocols On a Local-Area Network
We hasten to note that this comparison, in a certain sense, is not fair. These data servers provide vastly different services, so the performance differences demonstrate the cost of the service, not the cleverness of the implementation. For example, Chirp and FTP achieve low latencies because they are lightweight translation layers over an ordinary file system. NeST has somewhat higher latency because it provides the abstraction of a virtual file system, user namespace, access control lists, and a storage allocation system, all built on an existing filesystem. The cost is due to the necessary metadata log that records all such activity that cannot be stored directly in the underlying filesystem. Both RFIO and DCAP are designed to interact with mass storage systems; single operations may result in gigabytes of activity within a disk cache, possibly moving files to or from tape. In that context, low latency is not a concern.
dist.   proto   copy                list            scan            make             delete
local   local   .15 ± .02 sec       .09 ± .20       .08 ± .02       65.38 ± 3.47     .86 ± .18 sec
local   chirp   1.22 ± .03 sec      .34 ± .02       .40 ± .01       81.02 ± 1.46     .79 ± .01 sec
lan     chirp   6.16 ± .22 sec      .57 ± .30       1.32 ± .03      144.00 ± 1.35    1.26 ± .02 sec
--------------------------------------------------------------------------------------------------
lan     chirp   10.67 ± .90 sec     .53 ± .07       4.72 ± .32      95.05 ± 2.33     1.24 ± .03 sec
lan     ftp     34.88 ± 1.72 sec    1.47 ± .02      17.78 ± 1.14    122.54 ± 3.14    2.95 ± .15 sec
lan     nest    52.35 ± 4.18 sec    12.92 ± 4.87    28.14 ± 4.52    307.19 ± 3.26    31.73 ± 4.37 sec
lan     rfio    (overwhelmed by repeated connections)
lan     dcap    (does not support directories without nfs)
Fig. 5.3. Performance of the Andrew-Like Benchmark
That said, several things may be observed from Figure 5.2. Although FTP has benefitted from years of optimizations, the cost of a stat is greater than that of Chirp because of the need for multiple round trips to fill in the necessary details. The additional latency of open/close is due to the multiple round trips to name and establish a new TCP connection. Both RFIO and DCAP have higher latencies for single byte reads and writes than for 8 KB reads and writes. This is due to buffering which delays small operations in anticipation of further data. Most importantly, all of these remote operations exceed the latency of the debugger trap itself by several orders of magnitude. Thus, we are comfortable with the previous decision to sacrifice performance in favor of reliability in the interposition technique.
We conclude with a macrobenchmark similar to the Andrew benchmark [11]. This Andrew-like benchmark consists of a series of operations on the Parrot source tree, which consists of 13 directories and 296 files totaling 955 KB. To prepare, the source tree is moved to the remote device. In the copy stage, the tree is duplicated on the remote device. In the list stage, a detailed list (ls -lR) of the tree is made. In the scan stage, all files in the tree are searched (grep) for a text string. In the make stage, the software is built. From an I/O perspective, this involves a sequential read of every source file, a sequential write of every object file, and a series of random reads and writes to create the executables. In the delete stage, the tree is deleted.
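The staged structure of such a benchmark can be sketched with the Python standard library. This is not the harness used for the measurements in this paper: the make stage is omitted because it depends on a build system, the search string is arbitrary, and the two-file sample tree is hypothetical.

```python
import os
import shutil
import tempfile
import time


def timed(stage, times):
    """Run one stage and append its wall-clock time in seconds."""
    start = time.perf_counter()
    value = stage()
    times.append(time.perf_counter() - start)
    return value


def list_tree(tree):
    """Detailed listing: stat every file, in the spirit of ls -lR."""
    return [(os.path.join(root, name), os.stat(os.path.join(root, name)).st_size)
            for root, _dirs, files in os.walk(tree) for name in files]


def scan_tree(tree, needle):
    """Count the files that contain a byte string, in the spirit of grep."""
    hits = 0
    for root, _dirs, files in os.walk(tree):
        for name in files:
            with open(os.path.join(root, name), "rb") as handle:
                hits += needle in handle.read()
    return hits


def run_benchmark(source, workdir, needle=b"main"):
    """Time the copy, list, scan, and delete stages (make is omitted)."""
    copy = os.path.join(workdir, "copy")
    times = []
    timed(lambda: shutil.copytree(source, copy), times)
    listing = timed(lambda: list_tree(copy), times)
    hits = timed(lambda: scan_tree(copy, needle), times)
    timed(lambda: shutil.rmtree(copy), times)
    return dict(zip(["copy", "list", "scan", "delete"], times)), len(listing), hits


# Tiny demonstration on a throwaway tree with hypothetical file names.
with tempfile.TemporaryDirectory() as work:
    src = os.path.join(work, "src")
    os.makedirs(os.path.join(src, "sub"))
    for rel, text in [("a.c", b"int main(void) { return 0; }\n"),
                      (os.path.join("sub", "b.txt"), b"notes\n")]:
        with open(os.path.join(src, rel), "wb") as handle:
            handle.write(text)
    stage_times, file_count, match_count = run_benchmark(src, work)
```

Run under an interposition agent, the same script would exercise the remote protocol simply by pointing the source and working directories at paths in the agent's namespace.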
Figure 5.3 compares the performance of the Andrew-like benchmark in a variety of configurations. In the three cases above the horizontal rule, we measure the cost of each layer of software added: first with Parrot only, then with a Chirp server on the same host, then with a Chirp server across the local area network. Not surprisingly, the I/O cost of separating computation from storage is high. Copying data is much slower over the network, although the slowdown in the make stage is quite acceptable if we intend to increase throughput via remote parallelization.
In the two cases adjacent to the rule, the only change is the enabling of caching. As might be expected, the cost of unnecessary duplication causes an increase in copying the source tree, although the difference is easily made up in the make stage, where the cache eliminates the multiple random I/O necessary to link executables. The list and delete stages only involve directory structure and metadata access and are thus not affected by the cache.
In the five cases below the horizontal rule, we explore the use of various protocols to run the benchmark. In all of these cases, caching is enabled in order to eliminate the cost of random access as discussed. The DCAP protocol is semantically unable to run the benchmark, as it does not provide the necessary access to directories. The RFIO protocol is semantically able to run the benchmark, but the high frequency of filesystem operations results in a large number of TCP connections, which quickly exhausts networking resources at both the client and the server, thus preventing the benchmark from running. Chirp, FTP, and NeST are all able to complete the benchmark. The NeST results have a high variance, due to delays incurred while the metadata log is periodically compressed. The difference in performance between Chirp, FTP, and NeST is primarily attributable to the cost of metadata lookups. All the stages make heavy use of stat; the multiple round trips necessary to implement this completely for FTP and NeST have a striking cumulative effect.
6. Conclusions. Interposition agents provide a stable platform for bringing old applications into new environments. We have outlined the difficulties that we have encountered as well as the solutions we have
in the use of virtual machines in distributed systems [26] the need for powerful but low overhead methods of interposition grows. The appropriate interface for this task is still an open research topic.
The notion of virtualizing or multiplexing an existing interface is a common technique [14, 7], but the plague of errors and other boundary conditions seems to be suffered silently by practitioners. Such problems are rarely publicized; however, we are aware of two excellent exceptions. C. Metz [16] describes how the Berkeley sockets interface is surprisingly hard to multiplex. T. Garfinkel [10] describes the subtle semantic problems of sandboxing untrusted applications.
For more information: http://www.cs.wisc.edu/~thain/research/parrot
7. Acknowledgments. We thank John Bent and Sander Klous for their help deploying and debugging Parrot. Victor Zandy wrote the mechanism for binary rewriting. Alain Roy gave thoughtful comments on early drafts of this paper.
REFERENCES
[1] M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian, and M. Young, Mach: A new kernel
foundation for Unix development, in Proceedings of the USENIX Summer Technical Conference, Atlanta, GA, 1986.
[2] A. Alexandrov, M. Ibel, K. Schauser, and C. Scheiman, UFO: A personal global file system based on user-level
extensions to the operating system, ACM Transactions on Computer Systems, (1998), pp. 207-233.
[3] W. Allcock, A. Chervenak, I. Foster, C. Kesselman, and S. Tuecke, Protocols and services for distributed
data-intensive science, in Proceedings of Advanced Computing and Analysis Techniques in Physics Research, 2000, pp. 161-163.
[4] D. Anderson, J. Chase, and A. Vahdat, Interposed request routing for scalable network storage, in Proceedings of the
Fourth Symposium on Operating Systems Design and Implementation, 2000.
[5] O. Barring, J. Baud, and J. Durand, CASTOR project status, in Proceedings of Computing in High Energy Physics,
Padua, Italy, 2000.
[6] J. Bent, V. Venkataramani, N. LeRoy, A. Roy, J. Stanley, A. Arpaci-Dusseau, R. Arpaci-Dusseau, and
M. Livny, Flexibility, manageability, and performance in a grid storage appliance, in Proceedings of the Eleventh
IEEE Symposium on High Performance Distributed Computing, Edinburgh, Scotland, July 2002.
[7] D. Cheriton, UIO: A uniform I/O system interface for distributed systems, ACM Transactions on Computer Systems, 5
(1987), pp. 12-46.
[8] M. Ernst, P. Fuhrmann, M. Gasthuber, T. Mkrtchyan, and C. Waldman, dCache, a distributed storage data caching
system, in Proceedings of Computing in High Energy Physics, Beijing, China, 2001.
[9] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, A security architecture for computational grids, in Proceedings of
the 5th ACM Conference on Computer and Communications Security Conference, 1998, pp. 83-92.
[10] T. Garfinkel, Traps and pitfalls: Practical problems in system call interposition based security tools, in Proceedings of
the Network and Distributed Systems Security Symposium, February 2003.
[11] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West, Scale and
performance in a distributed file system, ACM Transactions on Computer Systems, 6 (1988), pp. 51-81.
[12] G. Hunt and D. Brubacher, Detours: Binary interception of Win32 functions, Tech. Report MSR-TR-98-33, Microsoft
Research, February 1999.
[13] M. Jones, Interposition agents: Transparently interposing user code at the system interface, in Proceedings of the 14th ACM
Symposium on Operating Systems Principles, 1993.
[14] S. Kleiman, Vnodes: An architecture for multiple file system types in Sun Unix, in Proceedings of the USENIX Technical
Conference, 1986, pp. 151-163.
[15] M. Leech, M. Ganis, Y. Lee, R. Kuris, D. Koblas, and L. Jones, SOCKS protocol version 5. Internet Engineering
Task Force, Request for Comments 1928, March 1996.
[16] C. Metz, Protocol independence using the sockets API, in Proceedings of the USENIX Technical Conference, June 2002.
[17] B. Miller, M. Callaghan, J. Cargille, J. Hollingsworth, R. B. Irvin, K. Karavanic, K. Kunchithapadam, and
T. Newhall, The Paradyn parallel performance measurement tools, IEEE Computer, 28 (1995), pp. 37-46.
[18] B. Miller, A. Cortes, M. A. Senar, and M. Livny, The tool daemon protocol (TDP), in Proceedings of Supercomputing,
Phoenix, AZ, November 2003.
[19] C. Small and M. Seltzer, A comparison of OS extension technologies, in Proceedings of the USENIX Technical Conference,
1996, pp. 41-54.
[20] M. Solomon and M. Litzkow, Supporting checkpointing and process migration outside the Unix kernel, in Proceedings of
the USENIX Winter Technical Conference, 1992.
[21] D. Thain and M. Livny, Multiple bypass: Interposition agents for distributed computing, Journal of Cluster Computing, 4
(2001), pp. 39-47.
[22] D. Thain and M. Livny, Error scope on a computational grid, in Proceedings of the Eleventh IEEE Symposium on High Performance
Distributed Computing, July 2002.
[23] D. Thain and M. Livny, Parrot: Transparent user-level middleware for data-intensive computing, in Proceedings of the Workshop on Adaptive
Grid Middleware, September 2003.
[24] D. Thain and M. Livny, Parrot: Transparent user-level middleware for data-intensive computing, Tech. Report 1493, Computer Sciences
Department, University of Wisconsin, December 2003.
[26] A. Whitaker, M. Shaw, and S. D. Gribble, Scale and performance in the Denali isolation kernel, in Proceedings of the
Fifth Symposium on Operating System Design and Implementation, Boston, MA, December 2002.
[27] B. White, A. Grimshaw, and A. Nguyen-Tuong, Grid-Based File Access: The Legion I/O Model, in Proceedings of the
Ninth IEEE Symposium on High Performance Distributed Computing, August 2000.
[28] V. Zandy, B. Miller, and M. Livny, Process hijacking, in Proceedings of the Eighth IEEE International Symposium on
High Performance Distributed Computing, 1999.
Edited by: Wilson Rivera, Jaime Seguel.
Received: July 14, 2003.