for a SurvivableStorage System
JayJ.Wylie, MehmetBakkaloglu,VijayPandurangan,
MichaelW. Bigrigg,SemihOguz,KenTew,CoryWilliams,
GregoryR.Ganger,PradeepK.Khosla
May2001
CMU-CS-01-120
SchoolofComputerScience
CarnegieMellonUniversity
Pittsburgh,PA 15213
Abstract
Survivable storage system design has become a popular research topic. This paper tackles the diÆcult problem of
reasoningabout theengineering trade-os inherentindatadistributionscheme selection. Thechoiceof anencoding
algorithmanditsparameterspositionsasystemataparticularpointinacomplextrade-ospacebetweenperformance,
availability, and security. We demonstrate that no choice is right for all systems, and we present an approach to
codifyingand visualizing this trade-o space. Using this approach, we explore the sensitivity of thespace tosystem
characteristics, workload,anddesiredlevelsofsecurityandavailability.
Thisresearch ispart ofthePASISproject[40,48 ]atCarnegieMellonUniversity.
ThisworkispartiallyfundedbyDARPA/ATO'sOrganicallyAssuredandSurvivableInformationSystems(OASIS)program(Air
ForcecontractnumberF30602-99-2-0539-AFRL).WethankthemembersandcompaniesoftheParallelDataConsortium(atthetimeof
thiswriting:EMCCorporation,Hewlett-PackardLabs,Hitachi,IBMCorporation,IntelCorporation,LSILogic,LucentTechnologies,
NetworkAppliances,PANASAS,PlatysCommunications,SeagateTechnology,SnapAppliances,SunMicrosystems,VeritasSoftware
Digitalinformationisacriticalresource,creatinganeedfordistributedstoragesystemsthatprovidesuÆcient
dataavailabilityanddatasecurityinthefaceoffailuresandmaliciousattacks. Manyresearchgroups[13,14,
17,24, 37,38,40, 41]arenowexploringthedesignandimplementationof suchsurvivable storage systems.
These systemsbuild on mature technologies from decentralized storage systems [1, 3, 8, 20, 33, 34] and
alsosharethesamehigh-levelarchitectures. Infact,developmentofsurvivablestoragewiththesamebasic
architecture was pursuedover15 years ago[12, 15, 16]. The challenge now, asit was then, is to achieve
acceptable levels of performance and manageability. Moreover, a means to evaluate survivable storage
systemsis requiredtofacilitatethedesignofthesesystems.
Onekeytomaximizingsurvivablestorageperformanceismindfulselectionofthedatadistributionscheme.
A data distribution scheme consists of aspecic algorithm for data encoding &partitioning and a set of
valuesforitsparameters. Therearemanyalgorithmsapplicabletosurvivablestorage,includingencryption,
replication,striping, erasure-resilientcoding, secretsharing,andvarious combinations. Eachalgorithmhas
oneormore tunable parameters. The resultis a largetoolbox of possible schemes, each oeringdierent
levelsofperformance(throughput),availability(probabilitythatdatacanbeaccessed),andsecurity(eort
required to compromisethe condentiality orintegrityof stored data). Forexample, replication provides
availabilityat a high cost in network bandwidthand storage space, whereasshort secret sharingprovides
availabilityandsecurityatlowerstorageandbandwidthcostbuthigherCPUutilization. Likewise,selecting
thenumberof sharesrequiredto reconstructasecret-sharedvalueinvolvesatrade-obetweenavailability
andcondentiality:ifmoremachinesmustbecompromisedtostealthesecret,thenmoremustbeoperational
to provideit legitimately. Secret sharingschemes and other data distribution algorithms are described in
Section2.
Nosingledata distributionscheme is right forall systems. Instead, the rightchoicefor anyparticular
systemdependsonanarrayoffactors,includingexpectedworkload,systemcomponentcharacteristics,and
desiredlevelsofavailability andsecurity. Unfortunately, mostsystemdesignsappearto involvean ad hoc
choice, oftenresultinginasubstantialperformancelossduetomissed opportunities andover-engineering.
Thispaperpromotesabetterapproachtoselectingthedatadistributionscheme. Atahighlevel,thisnew
approachconsistsofthreesteps: enumeratingpossibledatadistributionschemes(<algorithm;parameters>
pairs),modelingtheconsequencesofeachscheme,andidentifyingthebest-performingschemeforanygiven
set of availability and security requirements. The surface shown in Figure 1 illustrates one result of the
approach. Generatingsuch asurfacerequirescodifyingeachdimensionof thetrade-o spacesuchthat all
data distribution schemesfall intoatotal order. Thesurfaceservestwofunctions: (1)it enablesinformed
trade-osamongsecurity,availability,andperformance,and(2)itidentiesthebest-performingschemefor
each point in the trade-o space. Specically, the surface shown represents the performance of the
best-performingschemethatprovidesatleastthecorrespondinglevelsofavailabilityandsecurity. Manyschemes
arenotbest atanyofthepointsinthespaceandassucharenotrepresentedonthesurface.
Thispaperdemonstratesthefeasibilityand importanceof carefuldata distributionscheme choice. The
resultsshowthattheoptimalchoicevariesasafunctionofworkload,systemcharacteristics,andthedesired
levels ofavailabilityand security. Theresults showthat minor (2) changes in these determinants have
little eect, which means that the models need not be exact to be useful. Importantly, the results also
showthatlargechanges,whichwouldcorrespondtodistinctsystems,createsubstantiallydierenttrade-o
spacesandbestchoices. Thus,failingtoexaminethetrade-ospaceinthecontextofone'ssystemcanyield
bothpoorperformanceandunfullledrequirements. With sensitivitystudiesandsurvivablestoragesystem
examplesfrom theliterature,weidentify interestingtrendsanddesignpoints.
Theremainderofthispaperisorganizedasfollows. Section2discussessurvivablestoragesystemdesign
and data distribution scheme options. Section 3 describes how we codify the trade-o space. Section 4
describesPASIS[40,48],ourprototypesurvivablestoragesystem,anditsroleinvalidatingourperformance
model. Section 5exploresthetrade-ospaceand itssensitivitytomodel inaccuracies,systemcomponents,
and expected workloads. Section 6 describes current and pastsurvivablestorage system research,
identi-fying insights from our explorations that could enhance their designs. Section 7 summarizes this papers
formance,availability,andsecurityaxes. Performanceisquantiedasthenumberof32KBrequestspersecondthatcanbe
satisedforasingleclient;onlythe best-performingschemethatprovidesatleastthegivensecurityandavailabilitylevelsis
shown,resultinginamonotonicdecreasealongthoseaxes. Availabilityisthe probabilitythatstoreddata canbeaccessed.
Securityistheeortrequiredtocompromiseeitherthecondentialityorintegrityofstoreddata.Section2describesthedata
distributionalgorithmslistedinthelegend.Section3describestheaxesindetail.
2 Survivable Storage Systems
Survivable systems operate from the fundamental design thesis that no individual service, node, or user
can be fully trusted; having some compromised entities must be viewed asthe common case rather than
the exception. Survivable storage systems must encode and distribute data across independent storage
nodes,entrustingthedata'spersistencetosetsofnodesratherthantoindividualnodes. Ifcondentialityis
required,unencoded datashouldnotbestoreddirectly onstoragenodes;otherwise,compromisingasingle
storagenodeenablesanattackertobypassaccess-controlpolicies.
Priorwork in cluster storagesystems [1,3, 8,20, 33, 34] providesmuch insightinto how to eÆciently
decentralize storage services while providing a single, unied view to applications. Figure 2 illustrates
the basic decentralized storage architecture. Inmost cases, some intermediary software is responsible for
translating between the unied view and the decentralized reality. This piece of software may execute
directlyoneachclientsystem[1,3,20,33],onstoragenodesidentiedasleaders[8,34],oratintermediary
nodes[1,3].
Inmostcluster storagesystems, thedata andsomeform of redundancy informationare striped across
storagenodes. Survivablestoragesystemsdierfromdecentralizedstoragesystemsmainlyin theencoding
mechanism used (i.e., the data distribution scheme). The data distribution scheme enables the storage
systemtosurvivebothfailuresandcompromisesofstoragenodes. Eachscheme(a<algorithm;parameters>
pair)oersadierentlevelofperformance,availabilityandsecurity. Littleunderstandingexistsofthelarge
trade-ospacethat theschemesoccupyinpractice.
Thispapertakesasteptowardsunderstandingtherelativemeritsofdierentdatadistribution schemes
byexaminingtheminthecontextofthetrade-ospace. Theremainderofthissectiondescribesvariousdata
distributionalgorithms and the parametersthat specifyeach algorithm. There are several issues involved
with constructing acomplete survivablestorage systemthat are notcentral to theresults and insightsof
thispaper. Specically,thispaperarguesforandprovidesanapproachformakingmindfulselectionsofthe
data distribution scheme, but explicitly tries to doso while remainingimpartial onissues that arelargely
orthogonal. Examplesofsuchissuesincludemetadataandnaming mechanisms,clientnodeauthentication,
Applications
Decode/
Encode
Multi-read/
Multi-write
Meta-Data
...
Storage
Nodes
...
Shares
Data
Requests
Intermediary
Software
Applications
Decode/
Encode
Multi-read/
Multi-write
Meta-Data
...
Storage
Nodes
...
Shares
Data
Requests
Intermediary
Software
Figure2: Genericdecentralizedstoragearchitecture. Intermediarysoftwaretranslatestheapplications'uniedview
ofstorage tothedecentralizedreality. Solidlinestrace thedata path. Dashedlinestracethe meta-datapath. Applications
readand writeblockstotheintermediarysoftware.Encodingtransformsblocksintoshares(decodingdoesthereverse). Sets
ofsharesarereadfrom(writtento)storagenodes. Intermediarysoftwaremayrunonclients,leaderstoragenodes,oratsome
pointinbetween.
2.1 Threshold algorithms
There is a wide array of data distribution algorithms, including encryption, replication, striping,
erasure-resilient coding, information dispersal, and secret sharing. Threshold algorithms, characterized by three
parameters(p, m, and n), representalarge setof these algorithms. Inap-m-n thresholdscheme, data is
encoded into n shares such that any m of the shares canreconstruct the data and less than p reveal no
information abouttheencodeddata. Thus, astored valueisavailable ifat leastm of thenshares canbe
retrieved. Attackers must compromise at least p storage nodes before it is even theoretically possible to
determineanypartoftheencodeddata.
Table1listssomep-m-nthresholdschemesthathavemorefamiliarnames. Perhapsthesimplestexample
is n-wayreplication, which is a1-1-n thresholdscheme. That is, outofthenreplicas that arestored, any
single replica provides theoriginal data (m =1), andeach replica reveals information about theencoded
data(p=1). Anothersimpleexampleisdecimation(orstriping,asindiskarrays),whereinalargeblockof
datais partitionedinton sub-blocks,each containing1=n ofthedata(so,p=1andm=n). Attheother
end of thespectrumis \splitting", ann-n-n thresholdscheme that consists of storingn 1randomvalues
and onevalue thatis theexclusive-orofthe originalvalueand thosen 1values;p=m=n forsplitting,
since alln shares are neededto recoverthe original data. Replication, decimation, and splitting schemes
haveasingletunableparameter,n,whichaectstheirplacein thetrade-ospace.
With more CPU-intensivemathematics, the full range of p-m-nthreshold schemes becomes available.
Forexample, secretsharingschemes[4,27, 43]are m-m-nthreshold schemes. Shamir'simplementationof
secretsharingis basedoninterpolatingpointsonapolynomialinaniteeld [43]. Thesecretvaluealong
1-1-n Replication 1-n-n Decimation(Striping) n-n-n Splitting (XORing) 1-m-n InformationDispersal m-m-n SecretSharing p-m-n RampSchemes
Table1: Examplep-m-nthresholdschemes.
shareisgeneratedbyevaluating thepolynomialat distinctpoints(distinctfromothersharesandthepoint
containingthesecretvalue).
Rabin'sinformationdispersalalgorithm,a1-m-nthresholdscheme,usesthesamepolynomial-basedmath
asShamir'ssecretsharing,but norandomnumbers[42]; msecretvaluesareused todeterminetheunique
encoding polynomial. Thus, each share reveals partial information about the m simultaneously encoded
values, but the encoding is much more space-eÆcient. Ramp schemes [5] are p-m-n threshold schemes,
and they can also be implemented with the same polynomial-based math. The points used to uniquely
determinetheencodingpolynomialarep 1randomvaluesandm (p 1)secretvalues. Rampschemesthus
oerinformation-theoreticcondentialityofuptop 1shares. Theyarealsomorespace-eÆcientthansecret
sharing (so long as m>p). For p= 1, ramp schemes are equivalentto information dispersal; for p = m,
they are equivalent to secret sharing. Threshold algorithms canbe implemented in waysother than the
polynomialmethodsummarizedinthissection. Forexample,Blakleyproposedimplementingsecretsharing
with intersecting hyper-planes [4], and Luby proposed implementing information dispersal with Tornado
codes[35].
Parameteroptionsforp-m-nthresholdschemesexposealargespaceofpossibledatadistributionschemes.
Infact, there areon theorder ofN 3
dierent optionsto considergivenN storagenodes. Each scheme in
this spaceoersdierent levelsofavailability,condentiality,CPU costs,and storagerequirements(which
alsotranslatesinto networkbandwidthrequirements). Forexample,asnincreases,informationavailability
increases(itismoreprobablethatmsharesareavailable),buttheamountofstoragerequiredincreases(more
shares are stored) and condentiality decreases (more shares are available to steal). As m increases, the
storagespacerequireddecreases(ashare'ssizeisproportionalto1=(m (p 1))),butsodoesitsavailability
(moreshares arerequiredto reconstructtheoriginalobject). Also,asmincreases,eachshare containsless
information;thismayincreasethenumberofsharesthatmustbecapturedbeforeanintrudercanreconstruct
ausefulportionoftheoriginalobject. Aspincreases,theinformationsystem'scondentialityincreases,but
thestoragespacerequiredalsoincreases. Withsuchawidearrayofoptions,selectingthemostappropriate
datadistribution schemeforagivenenvironmentisfarfromtrivial.
2.2 Other algorithms
Therearealsoimportantdatadistributionalgorithmsoutsidetheclassofp-m-nthresholdschemes. Notably,
encryptionisacommonapproachtoprotectingthecondentialityofinformation. Symmetrickeyencryption
(e.g., triple-DES, AES) is adata distribution algorithm characterizedby a single parameter|keylength.
Hybrid data distribution algorithms can be constructed by combining the algorithms already discussed.
Forexample,many survivable storagesystemscombinereplication with encryption to address availability
and security, respectively. Security in such a systemhinges both upon how well the encryption keys are
protectedand upon the diÆculty of cryptanalysis. As anotherexample, short secret sharing encrypts the
original valuewitharandom key, storesthe encryptionkeyusing secretsharing,andstoresthe encrypted
informationusing informationdispersal[30]. Shortsecretsharingalgorithmshavethree parameters: m, n,
andk(keylength). Publickeycryptography(e.g.,RSA)canbeusedinsteadofsymmetrickeycryptography
to protect information condentiality. The management of cryptographic keysmust be addressedin the
designofasystemthatusescryptography;symmetrickeyandpublickeycryptographyrequiredierentkey
management strategies. Finally, compression algorithms (e.g., Human coding) canbe used before other
algorithms(e.g.,MD5,SHA-1)canbeusedtoadddigeststodatabeforeitisencodedwithanotheralgorithm.
Thehashofthedecodeddatacanbecomparedwiththedigesttoverifythedecodeddata'sintegrity. Digests
can also be generated for shares resulting from an encoding (e.g., distributed ngerprints [29]), allowing
integrity to be veried prior to decoding the data. Hash algorithms are parameterized by the hash size,
and theymust either beencoded withor storedseparately from the datain order tobe eective. Digital
signatures(e.g.,DSA)canprovidesimilar integrityguaranteesashashalgorithms.
Twoclasses of integrity algorithms are often used to build authentication and directory services that
protecttheintegrityofmeta-data. Therstclassareagreementalgorithms[32]suchastheByzantinefault
tolerantlibrary [9]. Quorum systems [19, 36], which are asuperset of voting algorithms [21, 45], are the
secondclass. Also,thereareintegrityalgorithmsthatworkexclusivelywiththresholdalgorithms. Inschemes
where m is lessthan n, excess shares can be used during decode (dierent permutations of m shares are
usedforvalidation). Secretsharingschemescanalsobemodieddirectly tooeraprobabilisticguarantee
ofcheaterdetection[46].
3 The Trade-o Space
The algorithms described in Section 2, together with the degrees of freedom given by their parameters,
provide thousandsof data distributionschemesfor survivablestoragesystems. Thoughtfullyselecting one
schemefromthissetrequirestheabilitytoevaluatetheirrelativemeritsand,further,todosointhecontext
ofthetarget environment. Thissectiondescribesourapproachtocomparing theperformance, availability,
andsecurityofvariousschemes.
3.1 Evaluating availability
Thesubstantialbody ofpriorwork in buildinghighly-availablesystems[23, 44]providesaclearmetricfor
evaluatinginformationavailability: theprobabilitythatadesiredpieceofinformationcanbeaccessedatany
givenpointintime. Assuminguncorrelatedfailures,thisprobabilitycanbecomputedfromtheprobabilities
thatrequiredsystemcomponentsareavailable. Forexample,ap-m-nthresholdschemerequires atleastm
ofthenstoragenodescontainingsharestobeoperationaltoperformaread. Iff
node
istheprobabilitythat
astoragenodehasfailedorisotherwiseunavailable,thentheavailabilityofthestoredinformationis
Availability read = n m X i=0 n i (f node ) i (1 f node ) n i
Forwrites, thecomputationof availabilitydependsuponthesystem'sdesign. Asystemcouldrequirethat
all of n specic nodes be operationalfor a write to succeed. This approach clearly has poor availability
characteristics. ToimprovewriteavailabilityinasystemwithN >nstoragenodes,thewriteoperationcan
attempt to writeshares to dierentstoragenodesuntil ithascompleted n writes. Alternatively,asystem
couldallowawritetocompletewhenfewerthannshareshavebeenwritten. Thisrequiresmoreworkduring
failurerecoveryorreducesreadavailabilityofthestoreddata (becauseniseectivelylowerforthatdata).
We calculate write availability using thelast approach. Thus, m storagenodes of theN storage nodes in
thesystemmustbeavailableforawritetobeperformed,andwriteavailabilityis
Availability write = N m X i=0 N i (f node ) i (1 f node ) N i
Availability requirementsfor storage systemstend to be quite high. A popular manner for discussing
high-availability valuesis in terms of \nines,"referring to the numberof nines after the decimalpoint in
theavailabilityvaluebefore adigit otherthannine. Forexample,alowavailabilityvalueof0.993 hasjust
twonines, whereasahigh availability valueof0.99999996hasseven nines. Itis worthnotingthat failures
arenotalwaysuncorrelated,asisgenerallyassumedinavailabilitycomputations. Forsurvivablesystemsin
particular, denial-of-service (DoS)attackscaninduce highly-correlatedfailures, bringing into questionthe
certainlyarewhencomparingdatadistributionschemesforagivensetofsystemcomponents. Inparticular,
availabilityincreasesonlywhenmoreoptionsareavailableforservicingrequests. Thus,ahigheravailability
valuemeansthat aDoSattackmusteliminatealargersetofstoragenodes.
3.2 Evaluating security
Byfar, securityis thehardest of thethree dimensionsto evaluate, asthere areno provenmetricsoreven
a mature body of research from which to draw. Our initial plan was to reuse the mathematics of fault
tolerance, counting how many storage nodes must be compromised to bypass condentiality or integrity.
However,thisapproachraisedsignicantproblems: Firstandforemost,itreliesuponhavinga\probability
of being compromised"with which to computesecurity values. Unfortunately, weare aware ofno reliable
wayto obtain or even estimate such a value. Further, using such avalue would bequestionable, because
securityproblemsexperiencedbynodesin adistributed systemareexpectedto behighly correlated;when
thesystemisunderattack,theprobabilityvaluewouldgoup. Anotherproblemwithevaluatingsecurityin
termsof aprobability ofanodebeingcompromised isthat it is auseful measure foronly asubset ofthe
data distribution algorithms (thresholdalgorithms). Forexample, it ignoresthe additional condentiality
provided byencryption.
Our current approach to evaluating security focuses on the eort (E) required for an active foe to
compromise the security of the system. For example, for a n-n-n threshold scheme, condentiality can
becompromisedbybreaking intoalln heterogenousstoragenodes orby compromisingtheauthentication
system: E Conf =min[E Auth ;(nE Break In )]
Extending this example, assume that the data is also encrypted and that the names of shares reveal no
information about the encoded data. The attacker has many paths to the data|attempt cryptanalysis
orsteal theencryption key, attempt combining sharesin many permutations orcompromisethedirectory
service. Assumingthat anattackerisgoingtotaketheeasiestpathtothedata:
E Conf = min[E Auth ;(nE Break In ) +min[E Cryptanalysis ;E StealKey ] +min[E IdentifyShares ;E StealNames ]]
Inthispaper,wefocusstrictlyonthesecurity ofthestoragesystem;attacksontheauthenticationservice
anddirectoryservicearenotconsideredfurther. Thus,weuseonlytwovaluesinoursecuritymodel: E
BreakIn
and E
CircumventCryptography
. These valuesare dimensionless. Thus, the unit of the securityaxis is \eort
units,"andsecurityvaluesacrossallruns havebeennormalizedonascalethatrangesfrom0to100.
Therstterm,E
BreakIn
,is theeortrequiredto stealapiece ofdatafrom astoragenode. Thesecond
term,E
CircumventCryptography
,istheeortrequiredtocircumventcryptographiccondentiality. Itisacoarse
metricthat includescryptanalysis,keyguessing,andthetheftofdecryption keys. Thecondentialityofan
encryptedreplicaisthusthesummationofthesetwoterms: oneencryptedreplica mustbestolenandthen
the encryption must be broken. Alternatively, the condentiality of secret sharing is solely afunction of
therstterm(i.e., mE
BreakIn
)|atotalof msharesmustbestolen to compromisethecondentialityof
data,thustheeortismtimestheeorttostealapieceofdata. Wecalculatetheeorttocompromisethe
condentiality ofinformation dispersal asaweightedsum of theinformation contentof the shares, where
the weight is proportionalto the numberof shares that must bestolen to decode. This heuristicsets the
condentialityofinformationdispersalabovethat ofreplication,belowthatofsecretsharing,andhigheras
m increases. Weassume that E
BreakIn
is constant forall storagenodes (i.e.,distinct attacksof equivalent
diÆculty are necessaryto compromiseeach storage node). For this assumption to be true storage nodes
must be heterogenous. Beyond this, we hypothesize that, asn increases, the likelihood that an attacker
cannd astorage node that they cancompromiseincreases; assuch, condentiality should decrease asn
increases. Wedonotcurrentlyconsiderninthecalculationofsecurity. Amoredetailedsecuritymodelcan
beimplemented,howeverthesesimplemodelscapturethemajorfeaturesofthealgorithmswearecurrently
guished availabilityfrom the two other securitycharacteristics because there isa trade-o betweenit and
them. Wefocusoursecurityaxisoncondentiality,sincethereappearstobeagreementinthecommunity
that cryptographichashesare the right way to protectintegrity, and we havefound nocontradictory
evi-dence. Thehashescanbegeneratedfrom thecleartextortheciphertext, andtheycanbestoredwithdata
shares,encodedintothename,orstoredinthedirectoryservice. Alloftheseincreaseintegritysignicantly,
usuallywellbeyondothereortlevels.
Although eort terms are diÆcult to quantify, we believe that eort-based evaluation focuses the
de-signer's attention on the right thing|raising the security bar \high enough." As with availability, the
relative merits of schemes canbe compared usefully even if the absolute eort quantities are inaccurate.
Further,othersecurityengineeringresearchprojectsarenowaddressingtheproblemof measuringsecurity
anddoingsointermsofeort[39]. Asbettereortmodelsbecomeavailable,theycanbemodularlyinserted
intothetrade-oexplorationapproach.
3.3 Evaluating performance
As with availability, performance metrics and evaluation techniques of distributed systems are part of a
mature body of prior work [26, 28]. The main issue faced in creatinga general tool is balancing detail,
whichshould yieldgreateraccuracy,withgeneralityintheperformancemodels. Weuseasimple,abstract
system model to predict the relativeperformance of dierent data distribution options. Like thegeneral
decentralized-storagemodel describedin Section 2,weintend foritto representawidearrayofsurvivable
storagesystemdesigns. Themodelincludesthreeparts:CPUtimeforencodeordecodeintheintermediary
software,networkbandwidthfordeliveringtheshares,andstoragenoderesponse times.
CPUTime. TheencodeanddecodeoperationsinvolvedwithanydatadistributionschemerequireCPU
time. ThereareordersofmagnitudedierencesbetweentheCPUtimesfordierentschemes. Forexample,
Figure3showstheCPU costforencoding a32KBblockforfour p-m-nthresholdschemesforallpossible
parameterselectionswithn25. Noticethelargedierencesbetweentheshapesoftheperformancecurves
andthetimesatthesamemandnfordierentschemes.
Wehaveconstructedandcalibratedsimplemodelsforthefollowingdatadistribution algorithms: ramp
schemes, secretsharing, information dispersal, replication, decimation, splitting, short secret sharing,
en-cryption,andhashalgorithms. ThemodelsrequireCPUmeasurementsofafewkeyprimitives: polynomial
interpolationoforderm 1,randomnumbergeneration,triple-DESencryption,andMD5hashgeneration.
Given therequisite measurements,themodelscanpredicttheCPUtime requiredforanencodeordecode
operationforanyofthedatadistributionschemeswithat least90%accuracy.
Network Bandwidth. Theamountof datathat mustpassbetween theclient andthe storagenodes
depends directly on the data distribution scheme. For any of the p-m-n threshold schemes, the network
bandwidthrequired depends onthe read-writeratioand the valuesof p,m, and n. Each share's size can
becomputedastheoriginaldatasizemultipliedby1=(m (p 1)). Forawrite,weassumethat allnshares
mustbeupdated. For aread,weassumethatonlymsharesarefetchedbydefault.
Wemodel thenetwork bandwidthwith adistribution, indicatingthe probability ofa givenbandwidth
during any period of time, assuming that a single bottleneck link determines the aggregate bandwidth.
Although imprecise, we believe that this allows the rst-order eects of concern to be captured, when
combined with the storage node latencies discussed below. For alocal-area network ordial-up client, we
expect thisbottleneck link tobeat theclient'snetworkinterfacecard. Forawide-areasystem,weexpect
this link to be atthe edgerouter throughwhich theclientinteracts with thewide-areanetwork. Network
congestion would appearin this verysimplemodelasareductionofavailable bandwidth(if consistent)or
anincreasein variability(ifsporadic).
StorageNode Latency. Aread orwrite requestto astoragenodeinvolvesworkat thestoragenode
in addition to moving data over the network. This is observed at the client as a response time latency
that includes both delays, but accurate modeling requires separating these delays and recombining them
appropriately. In particular, the time for an operation that simultaneously writes data to n nodes must
captureboththesharedbandwidthandconcurrentlatencyaspects.
Wemodelstoragenodelatencyasarandomvariablewhosemeanandvariancedependontheoperation
n=25areshownforfourthresholdschemes: secretsharing(p=m),rampschemewithp=dm=2e ,rampschemewithp=2,
andinformationdispersal(p=1).
onthelatternecessaryto modelstorageprotocols(e.g., FTPorNT4'sCIFS implementation)that opena
newTCP/IPconnectionwithslow-startforeach letransfered. Competing loadsonastoragenode would
appearinthissimplemodelasincreasesinthemeanand/orthevariance.
Overall performance model. To predict the access latency for a given request, we combine these
partial models asfollows. Fora write, ignoring varianceand assuming homogeneous servers, we sumthe
modeled CPU timefor encode, the storagenode latency for writingone share to one server,and n times
the network transmission time required per share. Thus, in sequence, the data is encoded, each share is
sentoverthenetworktoitsappropriatestoragenode,andallresponsesareawaited;then th
requestissent
after therstn 1andit nishesaftertheaveragestoragenodelatency. Foraread,thesamestepsoccur
in reverse. With variance, thelast request sentis notnecessarilythe last to nish, so apredictedlatency
foreachrequestmustbecomputed. Ourtoolsusesimulationtocomputetheexpectedperformanceforany
given scheme. In our experience, the simulation is fast enoughfor our purposes, with each single-scheme
predictionrequiring only10{30ms. It is alsoaccurate enough,yielding predictionsthat are within 15%of
valuesmeasuredontheprototypedescribedin Section4foravarietyofvalidation experiments.
4 Prototype and Validation
Wehaveimplementedaprototypesurvivablestoragesystem,PASIS[40, 48], which weuseto validateour
performancemodelsof datadistributionschemes. Oursystemts thearchitecture illustratedin Figure2,
withtheintermediarysoftwareexecutingontheclientmachines. Theencode/decodeand
multi-read/multi-writeaspectsofthearchitectureareimplementedinseparatelibraries. PASISallowsustoselectfrommany
datadistributionalgorithmsandspecifyawiderangeofparameters;thisenablesustoexploremanyschemes
The encode/decode librarysupports allof the basicschemes described in Section 2. The libraryinterface
exportstwomainfunctions: (1)adecodecallthattakesaschemedescriptionandasetofsharesasinputand
returnstheoriginaldataasoutput,and(2)anencodecallthatdoesthereverse. Theimplementationbuilds
onversion4.1oftheCrypto++library[11],whichprovidescodeforencryption(triple-DES),hashing(MD5),
and polynomial operationsin anite eld. Toenhance its performance for encoding/decoding bulk data
(as opposed tokey-like secrets), we replaced itsstream-based internal functions withblock-basedversions
andreducedtheeldsize. Thelatterchangeenablestheuseoflookuptables tondmultiplicativeinverses
and the product of two elements. Together,these changes yield 3{5 faster encode and decode times for
the information dispersal, secret sharing,and ramp algorithms when compared to the original Crypto++
implementations.
Themulti-read/multi-writelibrary manages parallel communicationwith a set of storage nodes. The
libraryinterfaceexportstwomainfunctions: (1)afetchcallthat takesasetofnsharedescriptionsanda
numberofsharesrequired(m), and(2)astorecallthat doesthesameforwrites. Eachshare description
includesastoragenodedescriptionandanameforthesharewithinthatstoragenode. Fortheexperiments
describedin thispaper,theselibrarieswere combinedinabenchmarkapplicationthat itself keepstrackof
per-blockmetadata.
The library currently supports the use of NFS storage nodes, CIFS storage nodes, and FTP storage
nodes. By using these standard protocols, we are able to use independently-implemented, o-the-shelf
storagenodes. Thisis animportantpracticaldesignconsiderationthat allowsus to enhance storagenode
heterogeneity, which in turnenhances security; forexample,thelikelihood islowofndingasingleattack
thatcancompromiseaSunNFSserver,aNetworkApplianceNFSserver,andaSNAPNFSserver. Without
thisfeature,Eort
BreakIn
forthesecondthroughn th
storagenodescouldbezero.
We have also implemented an NFS proxy server that exports an NFSv2 interface to clients and uses
the librariesaboveto encode anddistribute the lesacrossaset of storagenodes. Tostorethe necessary
metadata,eachexportedle ordirectoryis representedbytwoactualsets ofstorageobjects: oneset that
storessharedescriptionsforeachleblockand oneset thatstorestheactualdata. Theshare descriptions
fortheformerarestoredin thecorrespondingdirectoryentries. Wegenerallyallowonlythelocal machine
to mountthe NFS proxy server(over the loopback interface), thus allowing us to use survivablestorage
without changing the kernel. The downside, as expected, is lower performance relative to an in-kernel
implementation.
4.2 Model Verication
Wehaveusedourprototype,PASIS,toverifythatthesimpleperformancemodelofSection3cancapturethe
keyperformancetrendsofatleastonerealsurvivablestoragesystem. Todoso,wemeasuredthenecessary
modelparameters,conguredthemodelaccordingly,andthencomparedthemodelpredictionstomeasured
systemperformance. Themodelpredictionsofreadandwritethroughputwerecomparedtothistestbedfor
allthresholdscheme optionsupto n=6. Inalmostallcases, theyarewithin 10%and thelargestobserved
errorwas15%. Section5showsthatthislevelofaccuracyismorethansuÆcientforproperdatadistribution
schemeselection.
ThetestbedconsistedofsevenPCsystemsinter-connectedviaadedicated100Mbpsswitch. Eachsystem
containsa600MHzIntelPentiumIII,256MBofRAM,a3ComFastEtherlinkXLnetworkinterfacecard,
andaQuantumAtlas10KdiskconnectedviaanAdaptecULTRA-2SCSIcontroller. Onesystem,actingas
theclient,runsWindowsNT4.0withServicePack5. Theothersixsystems,actingasCIFSstoragenodes,
runaSAMBAserveronLinuxRedHat6.2withkernelversion2.2.14. Timemeasurementsarebasedonthe
Pentiumcyclecounter,andallvaluesarethemeanof20measurements.
In addition to the overall validation, we have validated the libraries in isolation. The encode/decode
model matches measured library performance to within 10% for all supported data distribution schemes
up to n = 25. This was also veried on a 300 MHz Pentium II and a 500 MHz Pentium III, and the
measuredCPUtimes were observedtoscalelinearlywith CPUperformance. Likewise,the modelpredicts
Wehavebuiltamodelwithagoodbalancebetweenaccuracyandsensitivity. Thisallowsusto capturethe
large-scaleeectsin thetrade-ospace. Comprehensionofsuchfeatures inthetrade-ospace isnecessary
to make system-leveldesign decisions. For large variations in system characteristics (i.e., distinct system
congurations),largedierencesintheselectionsurfaceareobserved. Thissupportsourclaimthatsystems
with dierent characteristics require dierent schemes. Also, a specic system should, for performance
reasons,usedierentschemeswhenoperatingin verydierentregionsofthespace|aregionbeinganarea
of theavailability-securityplane. For smallvariations in thesystem conguration(i.e., slightinaccuracies
in modelsand/or measurements), onlysmall dierencesin the selectionsurfaceare observed. This means
thattheschemeselectionsurfaceisstable. AlthoughitisinherentlydiÆculttoaccuratelycharacterizereal
systems,weareabletodetermineaclose-to-optimalschemechoiceforagivensystembecauseofthestability
oftheselectionsurface.
Intheremainderofthissection,thedefaultcongurationisdened,andourmethodofmeasurementis
explained. Aseriesofexperimentsareperformedtoinvestigatethesensitivityoftheselectionsurfacetosmall
congurationchanges,tolargecongurationchanges,andtoworkloadchanges. Finally,theavailabilityand
securitymodelsareconsidered.
Default conguration. For small systems (e.g., with n 10), there are around 1000 schemes to
consider. Inthe remainderof thissection, weconcentrateonasystemwith 10 storagenodes. All scheme
selectiongraphsin thissectionareplotted withperformancemeasuredin32KBblockspersecond,security
measuredin eort,andavailabilitymeasuredin \numberofnines." Performanceiscalculatedbasedonthe
CPU,network,andstoragenodesasdescribedinSection4.2. Thedefaultvalueforthenetworkparameteris
100Mbps. Becausenoclearmetrichasemergedfromresearchintoestimatingeortrequiredtocompromise
security,weset thevaluesforE
BreakIn and E
CircumventCryptography
equalin thedefault model. Thedefault
valueforthesystemworkloadisaread/writeprobabilityof0.5(anequalnumberofreadsandwrites). The
defaultvalueforasinglestoragenode'sfailureprobabilityisf
node
=0:005.
The default conguration is the base case against which all other congurations are compared. The
schemeselectionsurfaceforthedefaultcongurationisshowninFigure 1.
Measurements. To compare scheme selection surfaces, we use ametric with twocomponents. The
rstcomponentquantieswhatpercentageoftheschemechoiceschange. Thesecondcomponentquanties
the magnitude ofthe performance costassociatedwith using the other surface'sschemeat agiven point.
Forexample, we statesuch resultsas, \The selection surfacediersoverX% of thearea with anaverage
dierenceofY%." Whenperformingsensitivityanalysis,wereversethedirectionofthecomparison(i.e.,we
examinetheimpactofinaccuracies intheassumedsystemconguration). Thisallowsus todeterminehow
accuratethecongurationrepresentationmustbeforthetrade-oanalysistobeuseful. Insomeregionsof
thetrade-ospace,manyschemeshavesimilarperformanceoverarangeofcongurations. Insuchregions,
dierent schemes may be selected for dierent congurations, but with minimal performance impact. In
other regions, signicant performance costs are incurred if the wrong scheme is selected. To capture this
interestingeect,wesometimesperformacomparisonbetweentwoselectionsurfacesandonlylistthearea
and dierence for regionsthat havechanged bysome minimum amount (i.e., we ignore dierences of less
than10%to getabettermeasure oflargescaleeects).
Anothermethod of analyzing the trade-o space is used to understand the consumption of resources
by the system. Specically, we examinethe ratio of time spent performing CPU operations to the time
spentperformingnetworkI/Oinordertoascertainwhat regionsoftheselectionsurfacearebound byeach
resource. A schemeis considered to bebound byaresourceif theresourceaccounts forgreaterthan 80%
of the overall performance time. The resource consumption for the default conguration is presented in
Figure4.
5.1 Sensitivityto the model
Iftheselectionsurfaceishighlysensitivetosmallchangesintheconguration,thenitmightnotbeauseful
designtoolbecauseitwouldrequireexactinformationabouthowasystemwill behave. Toensurethatthe
selectionsurfaceis relativelystable, weexamineditssensitivity tothesystemcharacteristicsandworkload
selectedschemeisbound bythe availablenetworkbandwidth. Darkgreyindicatesa regioninwhichthe selectedschemeis
bound bythe CPU.Inthedefault conguration, replication(which appearsalongthe zerosecurityline)isnetwork bound.
Fartherdownthesecurityaxis,schemesareCPUbound.Thereisalargeregion,comprisedofinformationdispersal,replication
withcryptography,rampschemes,andshortsecretsharing,withbalancedresourceconsumption.
factoroftwoineachdirection. Fortheworkload,weconsideredworkloadsof40%readsaswellas60%reads.
Forthe securitymodel we variedthe value of E
BreakIn
by 10%in either direction relativeto thevalue of
E
CircumventCryptography
. Consideringthesecases,theselectionsurfacediersbylessthan9%oftheareawith
adierenceoflessthan9%. Fromthis, weconcludethat theselectionsurfaceis relativelystable. As well,
thismeansthat aslongas systemcharacteristicsare modeledwithmoderateaccuracyandtheworkloadis
roughlyunderstood, theschemeselectionsurfacecanprovidesignicantinsightintotherealconguration.
5.2 System characteristics
Toinvestigatetheimpactthatsystemcharacteristicshaveonproperschemeselection,wextheCPUspeed
andvarythenetworkspeed. Thisexploressystemcongurationsrepresentativeofawiderangeofdistributed
systemssuchaspeer-basedcomputingonaLANorclient-serverarchitectures.
Webegin by varyingthe defaultnetwork speedoneorder ofmagnitude (i.e.,to 10 Mbpsand 1Gbps).
Speedingupthenetworkhaslittleeectontheselectionsurface. Theselectionsurfacediersoverlessthan
10%ofthe areawithanaverage dierenceof 3%. Changesresultedfrom replication withcryptography,a
computationallycheapandspace-ineÆcientalgorithm,becomingabetterselectionthaninformation
disper-sal. Figure 5 shows the selection surfacefor the fast network. When the network bandwidth is reduced,
theselectionsurfacediersover26%oftheareawith anaverage dierenceof17%. Clearly, thischangeis
substantial. However,someofthisregionis comprisedofa\checkerboard,"wheretherightchoicebounces
between two schemeswith little performance dierence. In this checkerboard region, theselection surface
diersover11%oftheareawithanaveragedierenceof3%. Inonepartofthecheckerboardregion,short
secretsharingandrampschemeshavesimilarperformance. Inanotherpart,informationdispersalschemes
withparametersthatmakethemmorespace-eÆcientareselectedoverlessspace-eÆcientones. The
remain-der ofthedierence betweenthe surfacesis due to 14%of theareawithan averagedierence of29%. In
thisregion,whichisalongtheavailabilityaxis,informationdispersaldominatesreplication. Figure6shows
theselectionsurfacefortheslownetwork.
Ananalysisoftheresourceconsumptionovertheselectionsurfacesforthevariousnetworkspeeds
consid-eredshowsthat,forthefastnetworkmostoftheselectionsurfaceisCPUbound,forthedefaultmodelmost
of thesurfaceis balanced,andfor slownetworks mostofthe surfaceisnetwork bound. Many conclusions
canbedrawnfromtheresultsoftheseexperiments. First,schemeselectiondependsonthebalancebetween
the speed ofthe processorand the speed of thenetwork. Second, in system congurationsthat include a
slownetwork,thespace-eÆciencyofaschemeisasignicantindicatorofitsoverallperformance. Third,the
ofthe graph,because thenetworkconsumptionofreplicationhasalowperformancecostonafastnetwork. Note: Thefast
networkrequireddoublingtheperformanceaxisscale.
Figure6: Slownetwork(10Mbps). Informationdispersaldominatesalongtheavailabilityaxis.Shortsecretsharinghas
similarperformancetorampschemesintheforegroundofthegraph.
CPUboundin thedefault conguration.
5.3 Impact of workload
Toinvestigatetheimpactthatworkloadhasonschemeselection,theratioofreadstowriteswasvariedfrom
a100%read workloadtoa100%write workload. Recall thatthedefaultworkloadisequalpartsreadsand
writes. As shown in Table 2,this experiment found that scheme selection is insensitive to workload over
a largeportion of this range, 10%reads to 90%reads. However, at the end pointsof therange, there is
substantialchange inthe selectionsurface. In particular,changingtheworkloadcanchange theoperating
regionofanalgorithm(i.e.,itchangestheavailabilityguaranteethatcanbemade)|acleardierencefrom
modifyingsystemcharacteristics.
Thereasontheselectionsurfaceisinsensitiveto theworkload,exceptatextremes,isbecausethe
work-load mainly changes system performance. As the read workload increases/decreasesthe selection surface
expands/contractsalongtheperformanceaxis. Thereasonforthisisthattheread(decode)costsarelower
thanthewrite(encode)costsformostalgorithms. Onlyattheendsoftheworkloadrangedothedierent
mid-0% 97% 33% 5% 67% 6% 20% 22% 5% 40% 6% 4% 60% 4% 2% 80% 26% 7% 95% 50% 16% 100% 61% 20%
Table 2: Impact of workload on selection surface. Schemeselection isinsensitiveto changes inthe mid-rangeof
workloads.Signicantchangesoccurneartheendpointsofthepossibleworkloads.
rangeworkload. Theconclusionofthisexperimentisthat, formostread-writeworkloads,schemeselection
canbe doneassuming a workload of 50%reads becausethe selection surfaceis stable overa largerange.
However,forwrite-once/read-manystoragesystems,thetrade-ospacediersradically.
The selection surfaces for the 100% read and 100% write workloads have some specic features that
provideinsightintothetrade-ospace. Figure7andFigure8showtheselectionsurfacesfortheseworkloads.
Splittingdominatesthelowavailability-highsecurityregion(the backplaneofthegraph)forthe100%read
workload. Whereas, splitting dominates the diagonal that demarcates the edge of the operationalregion
for the 100% write workload. For other workloads, splitting only occurs at the maximum security point.
Splitting's encodeoperation isfaster thansecretsharing's; however,foranyother workloadsecretsharing
dominates splitting because splitting has extremely poor read availability. Splitting's decode operation
is extremely fast compared to its encode. This is why it is selected for the read workload|its encode
performancedoesnotpenalize it.
Thisinsightaboutsplittingisapplicabletotherestofthethresholdschemes. Forthe100%readworkload,
schemesarenotpenalizedforpoorwriteperformance. Theresultofthisisthatrampschemesdominatethe
region held by information dispersal for write dominated workloads|rampschemes with large n and low
marecompetitivewithinformationdispersalschemeswithmid-rangenandsimilarmvaluesbecausetheir
poorwriteperformance iscounted. Also, shortsecretsharing isshownto havelittlevaluefor aread-only
workload. Shortsecretsharingoersgoodsecurityfor alowencode cost. Again,theencodecostdoesnot
matterinaread-onlyworkload|rampschemesoerbetterread-onlyperformancethanshortsecretsharing.
Consideringthe resource consumption of the 100% read and 100% write workload is also instructive.
Figure9andFigure10demonstratethatthebestperformingschemeshaveabalancedutilizationofresources
when performing reads. Whereas, the CPU is thescarcest resourcewhen encoding data forhigh security.
Thefact that securityisCPU bound matchesintuition|securityisexpensiveand systemdesignersloathe
payingtheprice. However,the fact that reading secure datais notasexpensive, in termsof CPU cycles,
issignicant. Indeed,forworkloadswith>60%readstheutilizationofresourcesinthesystemisbalanced
deepintothesecurityregion.
5.4 Failure model
Thefailureprobabilityforstoragenodes,f
node
,isaninterestingparameterin ourmodel. Atotalordering
of all schemes based on their availability can be performed without substituting a specic value for this
parameter. Thus, theparameter only contractsorexpands the surfacein theavailability dimension. For
example,thesurfacein Figure11,whichisbasedonafailureprobability10largerthanthedefaultvalue,
hastheexactsameshapeasthesurfaceinFigure1,exceptforthescalingalongtheavailabilityaxis. Thus,in
somesensetheselectionsurfacedoesnotdependonthefailureprobability|orderingalongtheavailability
axis is solely a function of n and m. Clearly though, an accurate understanding of the failure model of
Figure8: 100%Writeworkload.
5.5 Security model
In this section, we investigate what happens if one assumes circumventing cryptographic protection and
compromising storage nodes require dierent amounts of eort. It is important to only consider relative
changesin theseexperiments|theeortmetriconlyprovidesarelativeorderingofschemes.
WeperformedexperimentstoseewhathappenswhenE
CircumventCryptography isvariedrelativetoE BreakIn . Reducing E CircumventCryptography
by anorder of magnitude relative to E
BreakIn
has little eect on the
se-lection surface. It diers over less than 4% of the area with an average dierence of 9%. Thus, the
eort valuation used to generate a total order for schemes along the security axis gives less weight to
cryptographiccondentiality than to information-theoretic condentiality. The results are much dierent
when E
CircumventCryptography
is increased by an order of magnitude relativeto E
BreakIn
. Replication with
cryptography dominates almost the entirety of the security-availability plane. This is because the
max-imum possible value of n is 10 in our default model. A threshold scheme can only achieve security of
10E
BreakIn
|one order of magnitude. To gain more insight into how the selection surface changes, we
increasedE
CircumventCryptography
byafactorof 2.5relativeto E
BreakIn
. Figure12 presentstheresulting
se-lectionsurface.Replicationwithcryptographydominatesaverylargeregionoftheschemeselectionsurface.
As well, theregion in which short secretsharingdominatesramp schemes growssignicantly. Indeed,the
consumptionofnetworkandCPUresourceswhenperformingreads.
Figure 10: Balance of resources fora 100% write workload. Thecostofsecurityisentirelyinthewrite(encode)
operation. Aswell,thelevelofsecurityisboundbytheCPU.
Even though the numbersassigned to the security costs have an arbitrary nature, it is interesting to
consider how the value placed in cryptographictechniques can change depending on the amount of time
condentiality must be maintained. A goal of security engineering is to expend just enough eort (i.e.,
minimize performance degradation) to achievethe security goal. For systems with ashort condentiality
window, the weighting of E
CircumventCryptography
relative to E
BreakIn
should be higher. Eectively, this
assumes a bound on the time that an attacker can utilize to crack the cipher, steal keys, or guess keys.
On the other hand, information whose condentiality is a long-term priority must carefully consider the
valuationofcryptographictechniques.
5.6 Analysis
Although the trade-o space is complex, we expected specic algorithms to dominate certain regions of
theavailability-securityplane. Within aregionweexpected theparametersto havea largeimpactonthe
performanceachieved. Our investigation showsthat ouroriginal insightswere basically correct. However,
sometransitionsbetweenregionsareabrupt,indicatingthatalargeperformancedierenceoccursforasmall
change in security oravailability guarantees. For example, the transition from replication to information
afactor oftwo alongthe availabilityaxis. Remember,the scaleof theavailabilityaxisisthe numberofnines. Anorderof
magnitudeincreaseintheprobabilityoffailureresultshalvesthenumberofninesofavailability.
Figure12: Placinghighervalueoncryptographythannodesecurity. Ifcryptographyprovidesbettercondentiality
thanthesecurityofthestoragenodes,replicationwithcryptographyisaveryeectivescheme. Aswell,shortsecretsharing,
whichcombinescryptographywithinformationdispersal,dominatesrampschemesdeepintotheregionofhighsecurity.
havethis property. Thiscontradictsouroriginalintuitionthatthelargenumberofschemeswouldresultin
aselectionsurfacethat issmooth. Ifasystemisoperatingnearasharpperformancetransition, thesystem
requirementsmustbeconsideredandanengineeringdecisionmustbemadeastowhethertoerrin favorof
performance,availability,orsecurity.
Sharptransitionstendtobelocalizedin thebackcornerofthetrade-o space(lowavailabilityandlow
security). The sharpness is due to the data redundancy of threshold schemes. Share size is calculated as
1=(m (p 1)),andnsharesaregeneratedbyathresholdscheme. Considera1-1-4thresholdscheme(4-fold
replication) and a 1-2-4 threshold scheme (an information dispersal scheme). By increasing m from 1 to
2, thedata redundancyis halved (i.e., storagespaceconsumption is thesameas2-fold replication). Ona
read, both schemesrequire the sameamount of data (one replica versus two shares, each half the size of
the replica). The abrupt performance dierence ontheselection surfaceis due to writeoperations,which
dependontheamountofdatathatmustbesentacrossthenetwork.
Anotherfeature of the selection surfaceoccurs when two algorithms perform similarly; the resultis a
\checkerboard"ontheselectionsurface. Insuchcases,itisinformativetoconsidertheresourceconsumption
Anotheraspect of the space that interests us is the performance dierence between algorithms. Our
intuition was that limiting the set of potential algorithms would often result in poor performance. Our
intuition was correct: each algorithm tends to be good for only some region of the availability-security
plane. If thesystem beingconsidered doesnot operatewithin the algorithm's \good"region, thesystem
hasincurredanunnecessaryperformancepenalty. Indeed,limitingtheset ofalgorithmsreducestheregion
of theavailability-securityplanein which the systemcanoperate. Constrainingan algorithmto aspecic
setofparameters(i.e.,buildingasystemaroundasinglescheme)xestheavailabilityandsecuritythatcan
beoered. Moreover,unlessthoroughanalysisofthetrade-ospacewasdoneaspartofthesystemdesign,
itis unlikelythat thescheme selectedis well matchedto thesystembeingbuilt. Thisis onlyareasonable
designdecisionifthesystemisto operatein asingleregionand theschemeusedis ontheselectionsurface
attheregion.
Thehighestdegreeofavailabilityisalwaysachievedbyareplicationalgorithm(becauseonlyreplication
can handle n 1failures) and the highest degreeof securityis always achieved by splitting (because only
splittingcanhandle n 1compromises).
Aftersomelevelofsecurity,secretsharingdominatesonlytheedgeregionthatdemarcatesthereachable
portion of the availability-securityplane. This is because for am
1 -m
1 -n
1
secretsharing scheme, there is
usuallyap 2 -m 2 -n 2
rampschemewithp
2 =m 1 ,m 2 >m 1 ,andn 2 >n 1
whichprovides,bydenition,thesame
securityguarantee,providesadequateavailability,but hasmuch betterperformance. Thereasonforthisis
that thespace savingsof ramp schemes withm>p resultsin fewercomputations beingrequired,and thus
better performance (i.e., since shares are smaller in size, fewer calculations are required to generate each
share). Thus,ifsecretsharingistheonlyalgorithmthatoersthesecurityandavailabilityrequired,
increas-ing thenumberof storagenodes, (i.e., themaximum valuen may take),will produceabetter performing
ramp scheme(>2 improvement)that meets theavailabilityandsecurityrequirements. Clearly, this can
onlybedoneifthereisnotaharddesignconstraintonthenumberofstoragenodesin thesystem.
6 Survivable Storage Projects
Survivablestorage systems are an active areaof current research. This section describes six current and
pastsurvivable storageprojects,focusing ontheir systemmodels, target workloads, and data distribution
schemes. Foreach,designinsightsresultingfrom ourexplorationofthetrade-ospacearenoted.
Delta-4. TheDelta-4system [12,16] iscomprisedofclients,securityserversanddata storageservers.
The data distributionscheme employed is referredto asfragmentation, redundancy and scattering. Data
is decimatedinto fragments,each fragmentis encrypted (withachained cipherso that fragmentsmust be
decoded in acertain order), and theencrypted fragmentsare replicated. Each datastorage serveris sent
all ofthe fragmentsand uses apseudo-randomalgorithm to decidewhich fragments to store. Thenames
of fragmentsare self-verifying but givenoinformation aboutthefragment'scontents. Toaccessapiece of
data, aclient mustget authorization from a set of storageservers. The storageserversrun anagreement
protocolto provideintegrity. Thehash,whichenablesaclientto determinethenamesofthe fragmentsit
wantstoread,isstoredusingsecretsharingonthesecurityservers. Boththeintegrityandcondentialityof
thedatastoredin Delta-4hingesonthesecurityserversadequatelyprotectingfragments'meta-data. Since
CPUcyclesarenotasscarceastheywerefteenyearsago,cryptographycoupledwithinformationdispersal
should be used in a Delta-4 type system rather than replication. Indeed, since all shares are sent to all
storagenodes,space-eÆciencyofencodingisparamounttoreducingbandwidthconsumption.
Publius. Publius [47] isstrongly in uencedbyAnderson'sEternityService proposal [2],whichargues
forsurvivablestorageplusanonymity. Publiususesencrypted replicas fordatadistribution. Thekeyused
forencryptionisencodedwithsecretsharing,andasingleshareofthekeyisstoredwitheachreplica. This
algorithm is verysimilar to short secretsharing, exceptthat replication rather thaninformation dispersal
isused toencodetheciphertext. Noindicationis givenastothespecicparametersselectedforthedata
distributionscheme. However,theauthorsdoidentifyparameterselectionasadiÆculttask,citingthedesire
tobalanceresistancetocensorshipandperformance. Theavailabilityofdatain aPubliussystemislimited
bythesecretsharingoftheencryptionkey. Thus,theextraspaceconsumedbyreplication,overinformation
utilizingabetterscheme.
Intermemory. TheIntermemoryproject[10, 22] focuses onarchivalstorageofpublicdata(i.e.,
write-once, read-manydata with high availability and integrityrequirementsovertime). Intermemory's goal is
to provideextraordinarilyhigh availabilitywhile minimizingstorageconsumptionandthe expectedcostof
reading data. Deepencodingisthedatadistribution schemeused inIntermemory. Deepencodinginvolves
storingasinglecopyofapieceofdata,thenapplyingathresholdscheme(Tornadoschemesinthiscase[35])
to thepieceofdata, distributingtheshares,and thenhavingthestoragenodesrecursivelyapply thesame
threshold scheme to each share they receive. Deep encoding is performed to some depth level(e.g., deep
encodingofdepth threestoresareplica,nshares, andn 2
subsequentsharesofshares). Intermemoryusesa
1-16-32thresholdschemeapplied to depthlevelthree. Public-keycryptographyis discussedasameansto
providelong-termintegrityofstoreddataintheIntermemorysystem. TheIntermemoryprojecthasselected
anexcellentschemethatprovideshighavailabilitywithreasonableexpectedreadperformanceandverylow
storage spaceoverhead. However, integrity is explicitly stated asagoal of the system. An attackerneed
onlycompromiseasinglereplicato defeattheintegrityofthestored data. Addingahash totheencoding
algorithm and removingthe rst depth level of deep encoding would greatlyincrease the integrity of the
Intermemorysystemattheexpenseofincreasingtheexpected costofperformingaread.
Oceanstore. Oceanstore [31] is envisioned asa wide-area data utility. It intends to leverage excess
storagecapacityoverawideareatoprovidesurvivablestorageofdataaccessiblefromanywhere. Oceanstore
storestwotypesofdata: activedata(whichcanbereadandwritten)andarchivaldata(whichiswrite-once,
read-many). Active data isstored using encrypted replicas. Replicashave self-verifyingnames to provide
integrity. Agreementalgorithms are used to manageauthorization to data. Archival data is stored using
deepencodingasin Intermemory. Oceanstoreisaddressingmanyotherissuespertainingtobuildingadata
storageutilityofglobalscale,includingclientmobility,robustnaming,datamigration,andreplicadiscovery.
Analyzingthetrade-os amongstdatadistributionschemeswouldenableOceanstoretoclearlyunderstand
thedierentcharacteristicsofthetwodistinctdatadistributionschemesitemploys. Indeed,suchananalysis
isnecessaryforthemto determineat whichworkloaddatashould switchfrombeingactivetoarchival.
Farsite. TheFarsiteproject[7]isexploringpeer-basedstorageinalocal-areasetting. Farsiteis
attempt-ingto leveragetheidle cycles andunused storagecapacity ofdesktop workstations, whichinherently have
poor availabilitycharacteristics,to provide asurvivablestoragesystem. Farsite usesencrypted replication
with self-verifying names. A Byzantine agreement protocol amongstclient machines storing meta-data is
usedtoguaranteetheintegrityofthedirectoryservice(andsubsequentlyprovidetheself-verifyingnatureof
thereplicanames). Tolimitspaceconsumption, singleinstance storage[6],in conjunctionwithconvergent
encryption,is usedto ensurethat onlynreplicas ofanysimilarpiece of datais storedsystemwide. Since
Farsiteexplicitlystatesconcernsaboutstoragespaceconsumption,informationdispersal ratherthen
repli-cationshouldprobablybeused. WithrelativelysmallCPUcosts,2{4reductionsin spaceandbandwidth
utilization are available. Further, bothsingle instancestorageand convergentencryption canstill beused
withthesharesgeneratedbyinformationdispersal.
e-Vault. Thee-Vaultsystem[18,25]isasurvivablestoragesystemthatuseseitherinformationdispersal
orshort secretsharing, depending onwhether ornotstrong condentiality isdesired. Distributed
nger-prints [29] are used for integrity. The systemis designed for a local-areasetting in which clients interact
withstorageserversviaa\gateway"(adesignated storageserver). Performancenumbersaregivenforthe
1-2-3information dispersal scheme[25], sincethetest platform is limitedto three storageservers. This is
averylimitedsystemin which to consider informationdispersal schemes. A systemwith just afew more
storagenodeshassignicantlybetterperformanceand availabilitycharacteristics.
PASIS.Ourprototypesystem,PASIS[40,48], isdescribedinSection 4. PASISprovidesan
implemen-tation ofasurvivablestoragesystemagainstwhich wevalidate ourperformancemodels. The maindesign
goalofPASISistobe exibleenoughforustoinvestigatemanydierentapproachestobuildingsurvivable
storagesystems. Ourimplementationofencode/decodefunctionalityisseparatefromour
multi-read/multi-writefunctionality. Theencode/decodelibraryprovidesthresholdalgorithms,cryptographicalgorithms,and
hybridalgorithms,allofwhichcanbeinstantiatedoverawiderangeofparameters. Thelibraryisstructured
soastofacilitatetheadditionofdatadistributionalgorithmsandthecompositionofhybridalgorithms. The
multi-read/multi-writelibrarysupportsNFS,CIFS,andFTPstoragenodesandcanbeextendedtosupport
7 Summary
Thispaperillustratesthecomplextrade-ospaceassociatedwithselectingtherightdatadistributionscheme
forasurvivablestoragesystem. Areasonedapproachtoexploringthistrade-ospaceisdevelopedandused
to explorethespace'ssensitivityto varioussystemand workloadcharacteristics. Althoughfurther workis
neededtorenethemodels,webelievethatthispapertakesabigstepintherightdirection|awayfromad
hocselectionsandtowardsinformedengineeringdecisions.
ThisworkwasdoneinthecontextofthePASISsurvivablestorageprojectatCarnegieMellonUniversity.
Additionalinformationonourworkandresultscanbefoundontheprojectwebsite [40].
8 Acknowledgements
TheauthorswouldliketothankJohnD.Strunk,GarthR.Goodson,andTheodore M.Wong forextensive
discussionandfeedbackonthedirectionsandgoalsofthisresearch. Wewouldalsoliketo thankthemany
other graduate students in the Parallel Data Lab that provided feedback on rough drafts of this paper.
Finally, we would like to thank Matt Vestal for administering the cluster of machines on which we ran
experiments.
References
[1] DarrellC.Anderson, Jerey S.Chase, andAminM. Vahdat. Interposedrequestroutingfor scalablenetwork
storage. Symposiumon Operating SystemsDesignandImplementation(SanDiego, CA,22{25October2000),
2000.
[2] RossJ.Anderson. TheEternityService. PRAGOCRYPT,pages242{253. CTUPublishing,1996.
[3] ThomasE.Anderson,MichaelD.Dahlin,JeannaM.Neefe,DavidA.Patterson,DrewS.Roselli,andRandolphY.
Wang. Serverlessnetworklesystems.ACM TransactionsonComputer Systems,14(1):41{79,February1996.
[4] G.R.Blakley. Safeguarding cryptographickeys. AFIPSNational Computer Conference (NewYork,NY,4{7
June1979),pages313{317. AFIPS,1979.
[5] G.R.BlakleyandCatherine Meadows. Security oframpschemes. Advances inCryptology -CRYPTO,pages
242{268. Springer-Verlag,1985.
[6] WilliamJ.Bolosky,ScottCorbin,DavidGoebel,andJohnR.Douceur.SingleInstanceStorageinWindows2000.
USENIX Windows Systems Symposium(Seattle, WA, 3{4 August2000), pages13{24. USENIX Association,
2000.
[7] WilliamJ.Bolosky, JohnR.Douceur, DavidEly,andMarvinTheimer. Feasabilityof aserverless distributed
lesystemdeployedonanexistingsetofdesktopPCs. ACM SIGMETRICSConference onMeasurement and
Modeling of Computer Systems (Santa Clara, CA, 17{21 June 2000). Published as Performance Evaluation
Review,28(1):34{43. ACM,2000.
[8] MiguelCastroandBarbaraLiskov.Practicalbyzantinefaulttolerance.SymposiumonOperatingSystemsDesign
andImplementation(NewOrleans,LA,22{25February1999),pages173{186. ACM,1998.
[9] Miguel Castro and Barbara Liskov. Proactive recovery ina Byzantine-fault-tolerant system. Symposium on
OperatingSystemsDesignandImplementation(SanDiego,CA,23{25 October2000),pages273{287. USENIX
Association, 2000.
[10] Yuan Chen, Jan Edler, Andrew Goldberg, Allan Gottlieb, Sumeet Sobti, and Peter Yianilos. A prototype
implementation ofarchival Intermemory. ACM Conference on DigitalLibraries(Berkeley, CA, 11{14 August
1999),pages28{37. ACM,1999.
[11] WeiDai. Crypto++referencemanual . http://cryptopp.sourceforge.net/docs/ref/.
[12] YvesDeswarte,L.Blain,andJean-CharlesFabre. Intrusiontolerance indistributedcomputingsystems. IEEE
SymposiumonSecurityandPrivacy(Oakland,CA,20{22May1991),pages110{121,1991.
[15] J.FragaandD.Powell. Afaultandintrusion-tolerantlesystem. IFIPInternationalConferenceonComputer
Security,pages203{218. Elsevier,1985.
[16] Jean-Michel Fray, Yves Deswarte, and David Powell. Intrusion-tolerance using ne-grain
fragmentation-scattering. IEEE Symposium onSecurity and Privacy(Oakland, CA,7{9April 1986),pages194{201. IEEE,
1986.
[17] FreeHaven.http://www.freehaven.net/.
[18] Juan A. Garay, Rosario Gennaro, Charanjit Jutla, and Tal Rabin. Secure distributed storage and retrieval.
TheoreticalComputer Science,243(1{2):363{389. Elsevier,September2000.
[19] HectorGarcia-Molina andDanielBarbara. Howtoassignvotesinadistributedsystem. Journalof theACM,
32(4):841{855,October1985.
[20] GarthA.Gibson,DavidF.Nagle,KhalilAmiri,JeButler,FayW.Chang,HowardGobio,CharlesHardin,Erik
Riedel,DavidRochberg,andJimZelenka. Acost-eective,high-bandwidth storagearchitecture. Architectural
Support forProgramming Languages andOperating Systems(SanJose, CA,3{7October1998). Published as
SIGPLANNotices,33(11):92{103,November1998.
[21] D. K. Giord. Violet: an experimental decentralized system. Technical report CSL{79{12. Xerox Palo Alto
ResearchCenter,CA,November1979.
[22] AndrewV. Goldberg andPeterN.Yianilos. Towards anarchivalintermemory. IEEE InternationalForumon
Research andTechnologyAdvancesinDigitalLibraries(SantaBarbara,CA,22{24April1998),pages147{156.
IEEE,1998.
[23] JimGrayandDanielP.Siewiorek.High-availabilitycomputersystems.IEEEComputer,24(9):39{48,September
1991.
[24] Intermemory. http://www.intermemory.org/.
[25] A. Iyengar, R. Cahn, C. Jutla, and J. A. Garay. Design and implementation of a secure distributed data
repository. IFIP InternationalInformation Security Conference (Vienna,Austria,and Budapest,Hungary,31
August{4September1998),pages123{135. ACM,1998.
[26] RajJain. Theart ofcomputersystemsperformance analysis. JohnWiley&Sons,1991.
[27] EhudD.Karnin,JonathanW.Greene,andMartinE.Hellman. Onsecretsharingsystems. IEEETransactions
onInformationTheory,IT-29(1):35{41. IEEE,January1983.
[28] LeonardKleinrock. Queueingsystems. JohnWileyandSons,1973.
[29] HugoKrawczyk. Distributedngerprintsandsecureinformation dispersal. ACM SymposiumonPrinciples of
DistributedComputing(Ithaca,NY,15{18August1993),pages207{218,1993.
[30] HugoKrawczyk. Secretsharing madeshort. Advances in Cryptology - CRYPTO(SantaBarbara, CA,22{26
August1993),pages136{146. Springer-Verlag,1994.
[31] John Kubiatowicz, David Bindel, YanChen, Steven Czerwinski, Patrick Eaten, Dennis Geels, Ramakrishna
Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao. OceanStore: an
architectureforglobal-scalepersistentstorage.ArchitecturalSupportforProgrammingLanguagesandOperating
Systems(Cambridge,MA,12{15November2000).PublishedasOperatingSystemsReview,34(5):190{201,2000.
[32] LeslieLamport,RobertShostak,andMarshall Pease. TheByzantinegeneralsproblem. ACMTransactionson
ProgrammingLanguagesandSystems,4(3):382{401. ACM,July1982.
[33] EdwardK. Leeand Chandramohan A. Thekkath. Petal: distributed virtual disks. Architectural Support for
ProgrammingLanguagesandOperatingSystems(Cambridge,MA,1{5October1996). Published asSIGPLAN
Notices,31(9):84{92, 1996.
[34] BarbaraLiskov,SanjayGhemawat,RobertGruber, PaulJohnson,LiubaShrira,andMichael Williams.
Repli-cation inthe Harp le system. ACM Symposium on Operating System Principles (Pacic Grove, CA, 13{16
October1991). PublishedasOperatingSystemsReview,25(5):226{238,1991.
[35] Michael Luby,MichaelMitzenmacher,MohammadAminShokrollahi,andDanielA.Spielman. Analysisoflow
densitycodesandimproveddesignsusingirregulargraphs. ACMSymposiumonTheory ofComputing(Dallas,
TX,23{26May1998),pages249{258. ACM,1998.
[36] DahliaMalkhiandMichaelK.Reiter.SecureandscalablereplicationinPhalanx. IEEESymposiumonReliable
[38] Oceanstore. http://oceanstore.cs.berkeley.edu/.
[39] RodolpheOrtalo,YvesDeswarte,andMohamedKaaniche.Experimentingwithquantitativeevaluationtoolsfor
monitoringoperationalsecurity. IEEETransactionsonSoftwareEngineering,25(5):633{650. IEEE,September
1999.
[40] PASIS. http://www.ices.cmu.edu/pasis/.
[41] Publius. http://cs1.cs.nyu.edu/waldman/publius/.
[42] MichaelO.Rabin. EÆcientdispersalofinformationforsecurity,loadbalancing,andfaulttolerance. Journalof
theACM,36(2):335{348. ACM,April1989.
[43] AdiShamir. Howtoshareasecret. CommunicationsoftheACM,22(11):612{613. ACM,November1979.
[44] Daniel P. Siewiorek and Robert S.Swarz. Reliable computer systems: design and evaluation. Digital Press,
Secondedition,1992.
[45] R.H.Thomas.Amajorityconsensusapproachtoconcurrencycontrol.ACMTransactionsonDatabaseSystems,
4:180{209, 1979.
[46] MartinTompaandHeatherWoll. Howtoshareasecretwithcheaters. JournalofCryptography,1(2):133{138,
1988.
[47] Marc Waldman, Aviel D. Rubin, and Lorrie Faith Cranor. Publius: a robust, tamper-evident,
censorship-resistant,webpublishingsystem.USENIXSecuritySymposium(Denver,CO,14{17August2000),pages59{72.
USENIXAssociation, 2000.
[48] JayJ.Wylie,MichaelW.Bigrigg,JohnD.Strunk,GregoryR.Ganger,HanKiliccote,andPradeepK.Khosla.