Applications. Decode/ Encode ... Meta- Data. Data. Shares. Multi-read/ Multi-write. Intermediary Software ... Storage Nodes

(1)

for a SurvivableStorage System

JayJ.Wylie, MehmetBakkaloglu,VijayPandurangan,

MichaelW. Bigrigg,SemihOguz,KenTew,CoryWilliams,

GregoryR.Ganger,PradeepK.Khosla

May2001

CMU-CS-01-120

SchoolofComputerScience

CarnegieMellonUniversity

Pittsburgh,PA 15213

Abstract

Survivable storage system design has become a popular research topic. This paper tackles the diÆcult problem of

reasoningabout theengineering trade-os inherentindatadistributionscheme selection. Thechoiceof anencoding

algorithmanditsparameterspositionsasystemataparticularpointinacomplextrade-ospacebetweenperformance,

availability, and security. We demonstrate that no choice is right for all systems, and we present an approach to

codifyingand visualizing this trade-o space. Using this approach, we explore the sensitivity of thespace tosystem

characteristics, workload,anddesiredlevelsofsecurityandavailability.

Thisresearch ispart ofthePASISproject[40,48 ]atCarnegieMellonUniversity.

ThisworkispartiallyfundedbyDARPA/ATO'sOrganicallyAssuredandSurvivableInformationSystems(OASIS)program(Air

ForcecontractnumberF30602-99-2-0539-AFRL).WethankthemembersandcompaniesoftheParallelDataConsortium(atthetimeof

thiswriting:EMCCorporation,Hewlett-PackardLabs,Hitachi,IBMCorporation,IntelCorporation,LSILogic,LucentTechnologies,

NetworkAppliances,PANASAS,PlatysCommunications,SeagateTechnology,SnapAppliances,SunMicrosystems,VeritasSoftware

(2)

(3)

Digitalinformationisacriticalresource,creatinganeedfordistributedstoragesystemsthatprovidesuÆcient

dataavailabilityanddatasecurityinthefaceoffailuresandmaliciousattacks. Manyresearchgroups[13,14,

17,24, 37,38,40, 41]arenowexploringthedesignandimplementationof suchsurvivable storage systems.

These systemsbuild on mature technologies from decentralized storage systems [1, 3, 8, 20, 33, 34] and

alsosharethesamehigh-levelarchitectures. Infact,developmentofsurvivablestoragewiththesamebasic

architecture was pursuedover15 years ago[12, 15, 16]. The challenge now, asit was then, is to achieve

acceptable levels of performance and manageability. Moreover, a means to evaluate survivable storage

systemsis requiredtofacilitatethedesignofthesesystems.

Onekeytomaximizingsurvivablestorageperformanceismindfulselectionofthedatadistributionscheme.

A data distribution scheme consists of aspecic algorithm for data encoding &partitioning and a set of

valuesforitsparameters. Therearemanyalgorithmsapplicabletosurvivablestorage,includingencryption,

replication,striping, erasure-resilientcoding, secretsharing,andvarious combinations. Eachalgorithmhas

oneormore tunable parameters. The resultis a largetoolbox of possible schemes, each oeringdierent

levelsofperformance(throughput),availability(probabilitythatdatacanbeaccessed),andsecurity(eort

required to compromisethe condentiality orintegrityof stored data). Forexample, replication provides

availabilityat a high cost in network bandwidthand storage space, whereasshort secret sharingprovides

availabilityandsecurityatlowerstorageandbandwidthcostbuthigherCPUutilization. Likewise,selecting

thenumberof sharesrequiredto reconstructasecret-sharedvalueinvolvesatrade-obetweenavailability

andcondentiality:ifmoremachinesmustbecompromisedtostealthesecret,thenmoremustbeoperational

to provideit legitimately. Secret sharingschemes and other data distribution algorithms are described in

Section2.

Nosingledata distributionscheme is right forall systems. Instead, the rightchoicefor anyparticular

systemdependsonanarrayoffactors,includingexpectedworkload,systemcomponentcharacteristics,and

desiredlevelsofavailability andsecurity. Unfortunately, mostsystemdesignsappearto involvean ad hoc

choice, oftenresultinginasubstantialperformancelossduetomissed opportunities andover-engineering.

Thispaperpromotesabetterapproachtoselectingthedatadistributionscheme. Atahighlevel,thisnew

approachconsistsofthreesteps: enumeratingpossibledatadistributionschemes(<algorithm;parameters>

pairs),modelingtheconsequencesofeachscheme,andidentifyingthebest-performingschemeforanygiven

set of availability and security requirements. The surface shown in Figure 1 illustrates one result of the

approach. Generatingsuch asurfacerequirescodifyingeachdimensionof thetrade-o spacesuchthat all

data distribution schemesfall intoatotal order. Thesurfaceservestwofunctions: (1)it enablesinformed

trade-osamongsecurity,availability,andperformance,and(2)itidentiesthebest-performingschemefor

each point in the trade-o space. Specically, the surface shown represents the performance of the

best-performingschemethatprovidesatleastthecorrespondinglevelsofavailabilityandsecurity. Manyschemes

arenotbest atanyofthepointsinthespaceandassucharenotrepresentedonthesurface.

Thispaperdemonstratesthefeasibilityand importanceof carefuldata distributionscheme choice. The

resultsshowthattheoptimalchoicevariesasafunctionofworkload,systemcharacteristics,andthedesired

levels ofavailabilityand security. Theresults showthat minor (2) changes in these determinants have

little eect, which means that the models need not be exact to be useful. Importantly, the results also

showthatlargechanges,whichwouldcorrespondtodistinctsystems,createsubstantiallydierenttrade-o

spacesandbestchoices. Thus,failingtoexaminethetrade-ospaceinthecontextofone'ssystemcanyield

bothpoorperformanceandunfullledrequirements. With sensitivitystudiesandsurvivablestoragesystem

examplesfrom theliterature,weidentify interestingtrendsanddesignpoints.

Theremainderofthispaperisorganizedasfollows. Section2discussessurvivablestoragesystemdesign

and data distribution scheme options. Section 3 describes how we codify the trade-o space. Section 4

describesPASIS[40,48],ourprototypesurvivablestoragesystem,anditsroleinvalidatingourperformance

model. Section 5exploresthetrade-ospaceand itssensitivitytomodel inaccuracies,systemcomponents,

and expected workloads. Section 6 describes current and pastsurvivablestorage system research,

identi-fying insights from our explorations that could enhance their designs. Section 7 summarizes this papers

(4)

formance,availability,andsecurityaxes. Performanceisquantiedasthenumberof32KBrequestspersecondthatcanbe

satisedforasingleclient;onlythe best-performingschemethatprovidesatleastthegivensecurityandavailabilitylevelsis

shown,resultinginamonotonicdecreasealongthoseaxes. Availabilityisthe probabilitythatstoreddata canbeaccessed.

Securityistheeortrequiredtocompromiseeitherthecondentialityorintegrityofstoreddata.Section2describesthedata

distributionalgorithmslistedinthelegend.Section3describestheaxesindetail.

2 Survivable Storage Systems

Survivable systems operate from the fundamental design thesis that no individual service, node, or user

can be fully trusted; having some compromised entities must be viewed asthe common case rather than

the exception. Survivable storage systems must encode and distribute data across independent storage

nodes,entrustingthedata'spersistencetosetsofnodesratherthantoindividualnodes. Ifcondentialityis

required,unencoded datashouldnotbestoreddirectly onstoragenodes;otherwise,compromisingasingle

storagenodeenablesanattackertobypassaccess-controlpolicies.

Priorwork in cluster storagesystems [1,3, 8,20, 33, 34] providesmuch insightinto how to eÆciently

decentralize storage services while providing a single, unied view to applications. Figure 2 illustrates

the basic decentralized storage architecture. Inmost cases, some intermediary software is responsible for

translating between the unied view and the decentralized reality. This piece of software may execute

directlyoneachclientsystem[1,3,20,33],onstoragenodesidentiedasleaders[8,34],oratintermediary

nodes[1,3].

Inmostcluster storagesystems, thedata andsomeform of redundancy informationare striped across

storagenodes. Survivablestoragesystemsdierfromdecentralizedstoragesystemsmainlyin theencoding

mechanism used (i.e., the data distribution scheme). The data distribution scheme enables the storage

systemtosurvivebothfailuresandcompromisesofstoragenodes. Eachscheme(a<algorithm;parameters>

pair)oersadierentlevelofperformance,availabilityandsecurity. Littleunderstandingexistsofthelarge

trade-ospacethat theschemesoccupyinpractice.

Thispapertakesasteptowardsunderstandingtherelativemeritsofdierentdatadistribution schemes

byexaminingtheminthecontextofthetrade-ospace. Theremainderofthissectiondescribesvariousdata

distributionalgorithms and the parametersthat specifyeach algorithm. There are several issues involved

with constructing acomplete survivablestorage systemthat are notcentral to theresults and insightsof

thispaper. Specically,thispaperarguesforandprovidesanapproachformakingmindfulselectionsofthe

data distribution scheme, but explicitly tries to doso while remainingimpartial onissues that arelargely

orthogonal. Examplesofsuchissuesincludemetadataandnaming mechanisms,clientnodeauthentication,

(5)

Applications

Decode/

Encode

Multi-read/

Multi-write

Meta-Data

...

Storage

Nodes

...

Shares

Data

Requests

Intermediary

Software

Applications

Decode/

Encode

Multi-read/

Multi-write

Meta-Data

...

Storage

Nodes

...

Shares

Data

Requests

Intermediary

Software

Figure2: Genericdecentralizedstoragearchitecture. Intermediarysoftwaretranslatestheapplications'uniedview

ofstorage tothedecentralizedreality. Solidlinestrace thedata path. Dashedlinestracethe meta-datapath. Applications

readand writeblockstotheintermediarysoftware.Encodingtransformsblocksintoshares(decodingdoesthereverse). Sets

ofsharesarereadfrom(writtento)storagenodes. Intermediarysoftwaremayrunonclients,leaderstoragenodes,oratsome

pointinbetween.

2.1 Threshold algorithms

There is a wide array of data distribution algorithms, including encryption, replication, striping,

erasure-resilient coding, information dispersal, and secret sharing. Threshold algorithms, characterized by three

parameters(p, m, and n), representalarge setof these algorithms. Inap-m-n thresholdscheme, data is

encoded into n shares such that any m of the shares canreconstruct the data and less than p reveal no

information abouttheencodeddata. Thus, astored valueisavailable ifat leastm of thenshares canbe

retrieved. Attackers must compromise at least p storage nodes before it is even theoretically possible to

determineanypartoftheencodeddata.

Table1listssomep-m-nthresholdschemesthathavemorefamiliarnames. Perhapsthesimplestexample

is n-wayreplication, which is a1-1-n thresholdscheme. That is, outofthenreplicas that arestored, any

single replica provides theoriginal data (m =1), andeach replica reveals information about theencoded

data(p=1). Anothersimpleexampleisdecimation(orstriping,asindiskarrays),whereinalargeblockof

datais partitionedinton sub-blocks,each containing1=n ofthedata(so,p=1andm=n). Attheother

end of thespectrumis \splitting", ann-n-n thresholdscheme that consists of storingn 1randomvalues

and onevalue thatis theexclusive-orofthe originalvalueand thosen 1values;p=m=n forsplitting,

since alln shares are neededto recoverthe original data. Replication, decimation, and splitting schemes

haveasingletunableparameter,n,whichaectstheirplacein thetrade-ospace.

With more CPU-intensivemathematics, the full range of p-m-nthreshold schemes becomes available.

Forexample, secretsharingschemes[4,27, 43]are m-m-nthreshold schemes. Shamir'simplementationof

secretsharingis basedoninterpolatingpointsonapolynomialinaniteeld [43]. Thesecretvaluealong

(6)

1-1-n Replication 1-n-n Decimation(Striping) n-n-n Splitting (XORing) 1-m-n InformationDispersal m-m-n SecretSharing p-m-n RampSchemes

Table1: Examplep-m-nthresholdschemes.

shareisgeneratedbyevaluating thepolynomialat distinctpoints(distinctfromothersharesandthepoint

containingthesecretvalue).

Rabin'sinformationdispersalalgorithm,a1-m-nthresholdscheme,usesthesamepolynomial-basedmath

asShamir'ssecretsharing,but norandomnumbers[42]; msecretvaluesareused todeterminetheunique

encoding polynomial. Thus, each share reveals partial information about the m simultaneously encoded

values, but the encoding is much more space-eÆcient. Ramp schemes [5] are p-m-n threshold schemes,

and they can also be implemented with the same polynomial-based math. The points used to uniquely

determinetheencodingpolynomialarep 1randomvaluesandm (p 1)secretvalues. Rampschemesthus

oerinformation-theoreticcondentialityofuptop 1shares. Theyarealsomorespace-eÆcientthansecret

sharing (so long as m>p). For p= 1, ramp schemes are equivalentto information dispersal; for p = m,

they are equivalent to secret sharing. Threshold algorithms canbe implemented in waysother than the

polynomialmethodsummarizedinthissection. Forexample,Blakleyproposedimplementingsecretsharing

with intersecting hyper-planes [4], and Luby proposed implementing information dispersal with Tornado

codes[35].

Parameteroptionsforp-m-nthresholdschemesexposealargespaceofpossibledatadistributionschemes.

Infact, there areon theorder ofN 3

dierent optionsto considergivenN storagenodes. Each scheme in

this spaceoersdierent levelsofavailability,condentiality,CPU costs,and storagerequirements(which

alsotranslatesinto networkbandwidthrequirements). Forexample,asnincreases,informationavailability

increases(itismoreprobablethatmsharesareavailable),buttheamountofstoragerequiredincreases(more

shares are stored) and condentiality decreases (more shares are available to steal). As m increases, the

storagespacerequireddecreases(ashare'ssizeisproportionalto1=(m (p 1))),butsodoesitsavailability

(moreshares arerequiredto reconstructtheoriginalobject). Also,asmincreases,eachshare containsless

information;thismayincreasethenumberofsharesthatmustbecapturedbeforeanintrudercanreconstruct

ausefulportionoftheoriginalobject. Aspincreases,theinformationsystem'scondentialityincreases,but

thestoragespacerequiredalsoincreases. Withsuchawidearrayofoptions,selectingthemostappropriate

datadistribution schemeforagivenenvironmentisfarfromtrivial.

2.2 Other algorithms

Therearealsoimportantdatadistributionalgorithmsoutsidetheclassofp-m-nthresholdschemes. Notably,

encryptionisacommonapproachtoprotectingthecondentialityofinformation. Symmetrickeyencryption

(e.g., triple-DES, AES) is adata distribution algorithm characterizedby a single parameter|keylength.

Hybrid data distribution algorithms can be constructed by combining the algorithms already discussed.

Forexample,many survivable storagesystemscombinereplication with encryption to address availability

and security, respectively. Security in such a systemhinges both upon how well the encryption keys are

protectedand upon the diÆculty of cryptanalysis. As anotherexample, short secret sharing encrypts the

original valuewitharandom key, storesthe encryptionkeyusing secretsharing,andstoresthe encrypted

informationusing informationdispersal[30]. Shortsecretsharingalgorithmshavethree parameters: m, n,

andk(keylength). Publickeycryptography(e.g.,RSA)canbeusedinsteadofsymmetrickeycryptography

to protect information condentiality. The management of cryptographic keysmust be addressedin the

designofasystemthatusescryptography;symmetrickeyandpublickeycryptographyrequiredierentkey

management strategies. Finally, compression algorithms (e.g., Human coding) canbe used before other

(7)

algorithms(e.g.,MD5,SHA-1)canbeusedtoadddigeststodatabeforeitisencodedwithanotheralgorithm.

Thehashofthedecodeddatacanbecomparedwiththedigesttoverifythedecodeddata'sintegrity. Digests

can also be generated for shares resulting from an encoding (e.g., distributed ngerprints [29]), allowing

integrity to be veried prior to decoding the data. Hash algorithms are parameterized by the hash size,

and theymust either beencoded withor storedseparately from the datain order tobe eective. Digital

signatures(e.g.,DSA)canprovidesimilar integrityguaranteesashashalgorithms.

Twoclasses of integrity algorithms are often used to build authentication and directory services that

protecttheintegrityofmeta-data. Therstclassareagreementalgorithms[32]suchastheByzantinefault

tolerantlibrary [9]. Quorum systems [19, 36], which are asuperset of voting algorithms [21, 45], are the

secondclass. Also,thereareintegrityalgorithmsthatworkexclusivelywiththresholdalgorithms. Inschemes

where m is lessthan n, excess shares can be used during decode (dierent permutations of m shares are

usedforvalidation). Secretsharingschemescanalsobemodieddirectly tooeraprobabilisticguarantee

ofcheaterdetection[46].

3 The Trade-o Space

The algorithms described in Section 2, together with the degrees of freedom given by their parameters,

provide thousandsof data distributionschemesfor survivablestoragesystems. Thoughtfullyselecting one

schemefromthissetrequirestheabilitytoevaluatetheirrelativemeritsand,further,todosointhecontext

ofthetarget environment. Thissectiondescribesourapproachtocomparing theperformance, availability,

andsecurityofvariousschemes.

3.1 Evaluating availability

Thesubstantialbody ofpriorwork in buildinghighly-availablesystems[23, 44]providesaclearmetricfor

evaluatinginformationavailability: theprobabilitythatadesiredpieceofinformationcanbeaccessedatany

givenpointintime. Assuminguncorrelatedfailures,thisprobabilitycanbecomputedfromtheprobabilities

thatrequiredsystemcomponentsareavailable. Forexample,ap-m-nthresholdschemerequires atleastm

ofthenstoragenodescontainingsharestobeoperationaltoperformaread. Iff

node

istheprobabilitythat

astoragenodehasfailedorisotherwiseunavailable,thentheavailabilityofthestoredinformationis

Availability read = n m X i=0 n i (f node ) i (1 f node ) n i

Forwrites, thecomputationof availabilitydependsuponthesystem'sdesign. Asystemcouldrequirethat

all of n specic nodes be operationalfor a write to succeed. This approach clearly has poor availability

characteristics. ToimprovewriteavailabilityinasystemwithN >nstoragenodes,thewriteoperationcan

attempt to writeshares to dierentstoragenodesuntil ithascompleted n writes. Alternatively,asystem

couldallowawritetocompletewhenfewerthannshareshavebeenwritten. Thisrequiresmoreworkduring

failurerecoveryorreducesreadavailabilityofthestoreddata (becauseniseectivelylowerforthatdata).

We calculate write availability using thelast approach. Thus, m storagenodes of theN storage nodes in

thesystemmustbeavailableforawritetobeperformed,andwriteavailabilityis

Availability write = N m X i=0 N i (f node ) i (1 f node ) N i

Availability requirementsfor storage systemstend to be quite high. A popular manner for discussing

high-availability valuesis in terms of \nines,"referring to the numberof nines after the decimalpoint in

theavailabilityvaluebefore adigit otherthannine. Forexample,alowavailabilityvalueof0.993 hasjust

twonines, whereasahigh availability valueof0.99999996hasseven nines. Itis worthnotingthat failures

arenotalwaysuncorrelated,asisgenerallyassumedinavailabilitycomputations. Forsurvivablesystemsin

particular, denial-of-service (DoS)attackscaninduce highly-correlatedfailures, bringing into questionthe

(8)

certainlyarewhencomparingdatadistributionschemesforagivensetofsystemcomponents. Inparticular,

availabilityincreasesonlywhenmoreoptionsareavailableforservicingrequests. Thus,ahigheravailability

valuemeansthat aDoSattackmusteliminatealargersetofstoragenodes.

3.2 Evaluating security

Byfar, securityis thehardest of thethree dimensionsto evaluate, asthere areno provenmetricsoreven

a mature body of research from which to draw. Our initial plan was to reuse the mathematics of fault

tolerance, counting how many storage nodes must be compromised to bypass condentiality or integrity.

However,thisapproachraisedsignicantproblems: Firstandforemost,itreliesuponhavinga\probability

of being compromised"with which to computesecurity values. Unfortunately, weare aware ofno reliable

wayto obtain or even estimate such a value. Further, using such avalue would bequestionable, because

securityproblemsexperiencedbynodesin adistributed systemareexpectedto behighly correlated;when

thesystemisunderattack,theprobabilityvaluewouldgoup. Anotherproblemwithevaluatingsecurityin

termsof aprobability ofanodebeingcompromised isthat it is auseful measure foronly asubset ofthe

data distribution algorithms (thresholdalgorithms). Forexample, it ignoresthe additional condentiality

provided byencryption.

Our current approach to evaluating security focuses on the eort (E) required for an active foe to

compromise the security of the system. For example, for a n-n-n threshold scheme, condentiality can

becompromisedbybreaking intoalln heterogenousstoragenodes orby compromisingtheauthentication

system: E Conf =min[E Auth ;(nE Break In )]

Extending this example, assume that the data is also encrypted and that the names of shares reveal no

information about the encoded data. The attacker has many paths to the data|attempt cryptanalysis

orsteal theencryption key, attempt combining sharesin many permutations orcompromisethedirectory

service. Assumingthat anattackerisgoingtotaketheeasiestpathtothedata:

E Conf = min[E Auth ;(nE Break In ) +min[E Cryptanalysis ;E StealKey ] +min[E IdentifyShares ;E StealNames ]]

Inthispaper,wefocusstrictlyonthesecurity ofthestoragesystem;attacksontheauthenticationservice

anddirectoryservicearenotconsideredfurther. Thus,weuseonlytwovaluesinoursecuritymodel: E

BreakIn

and E

CircumventCryptography

. These valuesare dimensionless. Thus, the unit of the securityaxis is \eort

units,"andsecurityvaluesacrossallruns havebeennormalizedonascalethatrangesfrom0to100.

Therstterm,E

BreakIn

,is theeortrequiredto stealapiece ofdatafrom astoragenode. Thesecond

term,E

,istheeortrequiredtocircumventcryptographiccondentiality. Itisacoarse

metricthat includescryptanalysis,keyguessing,andthetheftofdecryption keys. Thecondentialityofan

encryptedreplicaisthusthesummationofthesetwoterms: oneencryptedreplica mustbestolenandthen

the encryption must be broken. Alternatively, the condentiality of secret sharing is solely afunction of

therstterm(i.e., mE

BreakIn

)|atotalof msharesmustbestolen to compromisethecondentialityof

data,thustheeortismtimestheeorttostealapieceofdata. Wecalculatetheeorttocompromisethe

condentiality ofinformation dispersal asaweightedsum of theinformation contentof the shares, where

the weight is proportionalto the numberof shares that must bestolen to decode. This heuristicsets the

condentialityofinformationdispersalabovethat ofreplication,belowthatofsecretsharing,andhigheras

m increases. Weassume that E

BreakIn

is constant forall storagenodes (i.e.,distinct attacksof equivalent

diÆculty are necessaryto compromiseeach storage node). For this assumption to be true storage nodes

must be heterogenous. Beyond this, we hypothesize that, asn increases, the likelihood that an attacker

cannd astorage node that they cancompromiseincreases; assuch, condentiality should decrease asn

increases. Wedonotcurrentlyconsiderninthecalculationofsecurity. Amoredetailedsecuritymodelcan

beimplemented,howeverthesesimplemodelscapturethemajorfeaturesofthealgorithmswearecurrently

(9)

guished availabilityfrom the two other securitycharacteristics because there isa trade-o betweenit and

them. Wefocusoursecurityaxisoncondentiality,sincethereappearstobeagreementinthecommunity

that cryptographichashesare the right way to protectintegrity, and we havefound nocontradictory

evi-dence. Thehashescanbegeneratedfrom thecleartextortheciphertext, andtheycanbestoredwithdata

shares,encodedintothename,orstoredinthedirectoryservice. Alloftheseincreaseintegritysignicantly,

usuallywellbeyondothereortlevels.

Although eort terms are diÆcult to quantify, we believe that eort-based evaluation focuses the

de-signer's attention on the right thing|raising the security bar \high enough." As with availability, the

relative merits of schemes canbe compared usefully even if the absolute eort quantities are inaccurate.

Further,othersecurityengineeringresearchprojectsarenowaddressingtheproblemof measuringsecurity

anddoingsointermsofeort[39]. Asbettereortmodelsbecomeavailable,theycanbemodularlyinserted

intothetrade-oexplorationapproach.

3.3 Evaluating performance

As with availability, performance metrics and evaluation techniques of distributed systems are part of a

mature body of prior work [26, 28]. The main issue faced in creatinga general tool is balancing detail,

whichshould yieldgreateraccuracy,withgeneralityintheperformancemodels. Weuseasimple,abstract

system model to predict the relativeperformance of dierent data distribution options. Like thegeneral

decentralized-storagemodel describedin Section 2,weintend foritto representawidearrayofsurvivable

storagesystemdesigns. Themodelincludesthreeparts:CPUtimeforencodeordecodeintheintermediary

software,networkbandwidthfordeliveringtheshares,andstoragenoderesponse times.

CPUTime. TheencodeanddecodeoperationsinvolvedwithanydatadistributionschemerequireCPU

time. ThereareordersofmagnitudedierencesbetweentheCPUtimesfordierentschemes. Forexample,

Figure3showstheCPU costforencoding a32KBblockforfour p-m-nthresholdschemesforallpossible

parameterselectionswithn25. Noticethelargedierencesbetweentheshapesoftheperformancecurves

andthetimesatthesamemandnfordierentschemes.

Wehaveconstructedandcalibratedsimplemodelsforthefollowingdatadistribution algorithms: ramp

schemes, secretsharing, information dispersal, replication, decimation, splitting, short secret sharing,

en-cryption,andhashalgorithms. ThemodelsrequireCPUmeasurementsofafewkeyprimitives: polynomial

interpolationoforderm 1,randomnumbergeneration,triple-DESencryption,andMD5hashgeneration.

Given therequisite measurements,themodelscanpredicttheCPUtime requiredforanencodeordecode

operationforanyofthedatadistributionschemeswithat least90%accuracy.

Network Bandwidth. Theamountof datathat mustpassbetween theclient andthe storagenodes

depends directly on the data distribution scheme. For any of the p-m-n threshold schemes, the network

bandwidthrequired depends onthe read-writeratioand the valuesof p,m, and n. Each share's size can

becomputedastheoriginaldatasizemultipliedby1=(m (p 1)). Forawrite,weassumethat allnshares

mustbeupdated. For aread,weassumethatonlymsharesarefetchedbydefault.

Wemodel thenetwork bandwidthwith adistribution, indicatingthe probability ofa givenbandwidth

during any period of time, assuming that a single bottleneck link determines the aggregate bandwidth.

Although imprecise, we believe that this allows the rst-order eects of concern to be captured, when

combined with the storage node latencies discussed below. For alocal-area network ordial-up client, we

expect thisbottleneck link tobeat theclient'snetworkinterfacecard. Forawide-areasystem,weexpect

this link to be atthe edgerouter throughwhich theclientinteracts with thewide-areanetwork. Network

congestion would appearin this verysimplemodelasareductionofavailable bandwidth(if consistent)or

anincreasein variability(ifsporadic).

StorageNode Latency. Aread orwrite requestto astoragenodeinvolvesworkat thestoragenode

in addition to moving data over the network. This is observed at the client as a response time latency

that includes both delays, but accurate modeling requires separating these delays and recombining them

appropriately. In particular, the time for an operation that simultaneously writes data to n nodes must

captureboththesharedbandwidthandconcurrentlatencyaspects.

Wemodelstoragenodelatencyasarandomvariablewhosemeanandvariancedependontheoperation

(10)

n=25areshownforfourthresholdschemes: secretsharing(p=m),rampschemewithp=dm=2e ,rampschemewithp=2,

andinformationdispersal(p=1).

onthelatternecessaryto modelstorageprotocols(e.g., FTPorNT4'sCIFS implementation)that opena

newTCP/IPconnectionwithslow-startforeach letransfered. Competing loadsonastoragenode would

appearinthissimplemodelasincreasesinthemeanand/orthevariance.

Overall performance model. To predict the access latency for a given request, we combine these

partial models asfollows. Fora write, ignoring varianceand assuming homogeneous servers, we sumthe

modeled CPU timefor encode, the storagenode latency for writingone share to one server,and n times

the network transmission time required per share. Thus, in sequence, the data is encoded, each share is

sentoverthenetworktoitsappropriatestoragenode,andallresponsesareawaited;then th

requestissent

after therstn 1andit nishesaftertheaveragestoragenodelatency. Foraread,thesamestepsoccur

in reverse. With variance, thelast request sentis notnecessarilythe last to nish, so apredictedlatency

foreachrequestmustbecomputed. Ourtoolsusesimulationtocomputetheexpectedperformanceforany

given scheme. In our experience, the simulation is fast enoughfor our purposes, with each single-scheme

predictionrequiring only10{30ms. It is alsoaccurate enough,yielding predictionsthat are within 15%of

valuesmeasuredontheprototypedescribedin Section4foravarietyofvalidation experiments.

4 Prototype and Validation

Wehaveimplementedaprototypesurvivablestoragesystem,PASIS[40, 48], which weuseto validateour

performancemodelsof datadistributionschemes. Oursystemts thearchitecture illustratedin Figure2,

withtheintermediarysoftwareexecutingontheclientmachines. Theencode/decodeand

multi-read/multi-writeaspectsofthearchitectureareimplementedinseparatelibraries. PASISallowsustoselectfrommany

datadistributionalgorithmsandspecifyawiderangeofparameters;thisenablesustoexploremanyschemes

(11)

The encode/decode librarysupports allof the basicschemes described in Section 2. The libraryinterface

exportstwomainfunctions: (1)adecodecallthattakesaschemedescriptionandasetofsharesasinputand

returnstheoriginaldataasoutput,and(2)anencodecallthatdoesthereverse. Theimplementationbuilds

onversion4.1oftheCrypto++library[11],whichprovidescodeforencryption(triple-DES),hashing(MD5),

and polynomial operationsin anite eld. Toenhance its performance for encoding/decoding bulk data

(as opposed tokey-like secrets), we replaced itsstream-based internal functions withblock-basedversions

andreducedtheeldsize. Thelatterchangeenablestheuseoflookuptables tondmultiplicativeinverses

and the product of two elements. Together,these changes yield 3{5 faster encode and decode times for

the information dispersal, secret sharing,and ramp algorithms when compared to the original Crypto++

implementations.

Themulti-read/multi-writelibrary manages parallel communicationwith a set of storage nodes. The

libraryinterfaceexportstwomainfunctions: (1)afetchcallthat takesasetofnsharedescriptionsanda

numberofsharesrequired(m), and(2)astorecallthat doesthesameforwrites. Eachshare description

includesastoragenodedescriptionandanameforthesharewithinthatstoragenode. Fortheexperiments

describedin thispaper,theselibrarieswere combinedinabenchmarkapplicationthat itself keepstrackof

per-blockmetadata.

The library currently supports the use of NFS storage nodes, CIFS storage nodes, and FTP storage

nodes. By using these standard protocols, we are able to use independently-implemented, o-the-shelf

storagenodes. Thisis animportantpracticaldesignconsiderationthat allowsus to enhance storagenode

heterogeneity, which in turnenhances security; forexample,thelikelihood islowofndingasingleattack

thatcancompromiseaSunNFSserver,aNetworkApplianceNFSserver,andaSNAPNFSserver. Without

thisfeature,Eort

BreakIn

forthesecondthroughn th

storagenodescouldbezero.

We have also implemented an NFS proxy server that exports an NFSv2 interface to clients and uses

the librariesaboveto encode anddistribute the lesacrossaset of storagenodes. Tostorethe necessary

metadata,eachexportedle ordirectoryis representedbytwoactualsets ofstorageobjects: oneset that

storessharedescriptionsforeachleblockand oneset thatstorestheactualdata. Theshare descriptions

fortheformerarestoredin thecorrespondingdirectoryentries. Wegenerallyallowonlythelocal machine

to mountthe NFS proxy server(over the loopback interface), thus allowing us to use survivablestorage

without changing the kernel. The downside, as expected, is lower performance relative to an in-kernel

implementation.

4.2 Model Verication

Wehaveusedourprototype,PASIS,toverifythatthesimpleperformancemodelofSection3cancapturethe

keyperformancetrendsofatleastonerealsurvivablestoragesystem. Todoso,wemeasuredthenecessary

modelparameters,conguredthemodelaccordingly,andthencomparedthemodelpredictionstomeasured

systemperformance. Themodelpredictionsofreadandwritethroughputwerecomparedtothistestbedfor

allthresholdscheme optionsupto n=6. Inalmostallcases, theyarewithin 10%and thelargestobserved

errorwas15%. Section5showsthatthislevelofaccuracyismorethansuÆcientforproperdatadistribution

schemeselection.

ThetestbedconsistedofsevenPCsystemsinter-connectedviaadedicated100Mbpsswitch. Eachsystem

containsa600MHzIntelPentiumIII,256MBofRAM,a3ComFastEtherlinkXLnetworkinterfacecard,

andaQuantumAtlas10KdiskconnectedviaanAdaptecULTRA-2SCSIcontroller. Onesystem,actingas

theclient,runsWindowsNT4.0withServicePack5. Theothersixsystems,actingasCIFSstoragenodes,

runaSAMBAserveronLinuxRedHat6.2withkernelversion2.2.14. Timemeasurementsarebasedonthe

Pentiumcyclecounter,andallvaluesarethemeanof20measurements.

In addition to the overall validation, we have validated the libraries in isolation. The encode/decode

model matches measured library performance to within 10% for all supported data distribution schemes

up to n = 25. This was also veried on a 300 MHz Pentium II and a 500 MHz Pentium III, and the

measuredCPUtimes were observedtoscalelinearlywith CPUperformance. Likewise,the modelpredicts

(12)

Wehavebuiltamodelwithagoodbalancebetweenaccuracyandsensitivity. Thisallowsusto capturethe

large-scaleeectsin thetrade-ospace. Comprehensionofsuchfeatures inthetrade-ospace isnecessary

to make system-leveldesign decisions. For large variations in system characteristics (i.e., distinct system

congurations),largedierencesintheselectionsurfaceareobserved. Thissupportsourclaimthatsystems

with dierent characteristics require dierent schemes. Also, a specic system should, for performance

reasons,usedierentschemeswhenoperatingin verydierentregionsofthespace|aregionbeinganarea

of theavailability-securityplane. For smallvariations in thesystem conguration(i.e., slightinaccuracies

in modelsand/or measurements), onlysmall dierencesin the selectionsurfaceare observed. This means

thattheschemeselectionsurfaceisstable. AlthoughitisinherentlydiÆculttoaccuratelycharacterizereal

systems,weareabletodetermineaclose-to-optimalschemechoiceforagivensystembecauseofthestability

oftheselectionsurface.

Intheremainderofthissection,thedefaultcongurationisdened,andourmethodofmeasurementis

explained. Aseriesofexperimentsareperformedtoinvestigatethesensitivityoftheselectionsurfacetosmall

congurationchanges,tolargecongurationchanges,andtoworkloadchanges. Finally,theavailabilityand

securitymodelsareconsidered.

Default conguration. For small systems (e.g., with n 10), there are around 1000 schemes to

consider. Inthe remainderof thissection, weconcentrateonasystemwith 10 storagenodes. All scheme

selectiongraphsin thissectionareplotted withperformancemeasuredin32KBblockspersecond,security

measuredin eort,andavailabilitymeasuredin \numberofnines." Performanceiscalculatedbasedonthe

CPU,network,andstoragenodesasdescribedinSection4.2. Thedefaultvalueforthenetworkparameteris

100Mbps. Becausenoclearmetrichasemergedfromresearchintoestimatingeortrequiredtocompromise

security,weset thevaluesforE

BreakIn and E

equalin thedefault model. Thedefault

valueforthesystemworkloadisaread/writeprobabilityof0.5(anequalnumberofreadsandwrites). The

defaultvalueforasinglestoragenode'sfailureprobabilityisf

node

=0:005.

The default conguration is the base case against which all other congurations are compared. The

schemeselectionsurfaceforthedefaultcongurationisshowninFigure 1.

Measurements. To compare scheme selection surfaces, we use ametric with twocomponents. The

rstcomponentquantieswhatpercentageoftheschemechoiceschange. Thesecondcomponentquanties

the magnitude ofthe performance costassociatedwith using the other surface'sschemeat agiven point.

Forexample, we statesuch resultsas, \The selection surfacediersoverX% of thearea with anaverage

dierenceofY%." Whenperformingsensitivityanalysis,wereversethedirectionofthecomparison(i.e.,we

examinetheimpactofinaccuracies intheassumedsystemconguration). Thisallowsus todeterminehow

accuratethecongurationrepresentationmustbeforthetrade-oanalysistobeuseful. Insomeregionsof

thetrade-ospace,manyschemeshavesimilarperformanceoverarangeofcongurations. Insuchregions,

dierent schemes may be selected for dierent congurations, but with minimal performance impact. In

other regions, signicant performance costs are incurred if the wrong scheme is selected. To capture this

interestingeect,wesometimesperformacomparisonbetweentwoselectionsurfacesandonlylistthearea

and dierence for regionsthat havechanged bysome minimum amount (i.e., we ignore dierences of less

than10%to getabettermeasure oflargescaleeects).

Anothermethod of analyzing the trade-o space is used to understand the consumption of resources

by the system. Specically, we examinethe ratio of time spent performing CPU operations to the time

spentperformingnetworkI/Oinordertoascertainwhat regionsoftheselectionsurfacearebound byeach

resource. A schemeis considered to bebound byaresourceif theresourceaccounts forgreaterthan 80%

of the overall performance time. The resource consumption for the default conguration is presented in

Figure4.

5.1 Sensitivityto the model

Iftheselectionsurfaceishighlysensitivetosmallchangesintheconguration,thenitmightnotbeauseful

designtoolbecauseitwouldrequireexactinformationabouthowasystemwill behave. Toensurethatthe

selectionsurfaceis relativelystable, weexamineditssensitivity tothesystemcharacteristicsandworkload

(13)

selectedschemeisbound bythe availablenetworkbandwidth. Darkgreyindicatesa regioninwhichthe selectedschemeis

bound bythe CPU.Inthedefault conguration, replication(which appearsalongthe zerosecurityline)isnetwork bound.

Fartherdownthesecurityaxis,schemesareCPUbound.Thereisalargeregion,comprisedofinformationdispersal,replication

withcryptography,rampschemes,andshortsecretsharing,withbalancedresourceconsumption.

factoroftwoineachdirection. Fortheworkload,weconsideredworkloadsof40%readsaswellas60%reads.

Forthe securitymodel we variedthe value of E

BreakIn

by 10%in either direction relativeto thevalue of

E

. Consideringthesecases,theselectionsurfacediersbylessthan9%oftheareawith

adierenceoflessthan9%. Fromthis, weconcludethat theselectionsurfaceis relativelystable. As well,

thismeansthat aslongas systemcharacteristicsare modeledwithmoderateaccuracyandtheworkloadis

roughlyunderstood, theschemeselectionsurfacecanprovidesignicantinsightintotherealconguration.

5.2 System characteristics

Toinvestigatetheimpactthatsystemcharacteristicshaveonproperschemeselection,wextheCPUspeed

andvarythenetworkspeed. Thisexploressystemcongurationsrepresentativeofawiderangeofdistributed

systemssuchaspeer-basedcomputingonaLANorclient-serverarchitectures.

Webegin by varyingthe defaultnetwork speedoneorder ofmagnitude (i.e.,to 10 Mbpsand 1Gbps).

Speedingupthenetworkhaslittleeectontheselectionsurface. Theselectionsurfacediersoverlessthan

10%ofthe areawithanaverage dierenceof 3%. Changesresultedfrom replication withcryptography,a

computationallycheapandspace-ineÆcientalgorithm,becomingabetterselectionthaninformation

disper-sal. Figure 5 shows the selection surfacefor the fast network. When the network bandwidth is reduced,

theselectionsurfacediersover26%oftheareawith anaverage dierenceof17%. Clearly, thischangeis

substantial. However,someofthisregionis comprisedofa\checkerboard,"wheretherightchoicebounces

between two schemeswith little performance dierence. In this checkerboard region, theselection surface

diersover11%oftheareawithanaveragedierenceof3%. Inonepartofthecheckerboardregion,short

secretsharingandrampschemeshavesimilarperformance. Inanotherpart,informationdispersalschemes

withparametersthatmakethemmorespace-eÆcientareselectedoverlessspace-eÆcientones. The

remain-der ofthedierence betweenthe surfacesis due to 14%of theareawithan averagedierence of29%. In

thisregion,whichisalongtheavailabilityaxis,informationdispersaldominatesreplication. Figure6shows

theselectionsurfacefortheslownetwork.

Ananalysisoftheresourceconsumptionovertheselectionsurfacesforthevariousnetworkspeeds

consid-eredshowsthat,forthefastnetworkmostoftheselectionsurfaceisCPUbound,forthedefaultmodelmost

of thesurfaceis balanced,andfor slownetworks mostofthe surfaceisnetwork bound. Many conclusions

canbedrawnfromtheresultsoftheseexperiments. First,schemeselectiondependsonthebalancebetween

the speed ofthe processorand the speed of thenetwork. Second, in system congurationsthat include a

slownetwork,thespace-eÆciencyofaschemeisasignicantindicatorofitsoverallperformance. Third,the

(14)

ofthe graph,because thenetworkconsumptionofreplicationhasalowperformancecostonafastnetwork. Note: Thefast

networkrequireddoublingtheperformanceaxisscale.

Figure6: Slownetwork(10Mbps). Informationdispersaldominatesalongtheavailabilityaxis.Shortsecretsharinghas

similarperformancetorampschemesintheforegroundofthegraph.

CPUboundin thedefault conguration.

5.3 Impact of workload

Toinvestigatetheimpactthatworkloadhasonschemeselection,theratioofreadstowriteswasvariedfrom

a100%read workloadtoa100%write workload. Recall thatthedefaultworkloadisequalpartsreadsand

writes. As shown in Table 2,this experiment found that scheme selection is insensitive to workload over

a largeportion of this range, 10%reads to 90%reads. However, at the end pointsof therange, there is

substantialchange inthe selectionsurface. In particular,changingtheworkloadcanchange theoperating

regionofanalgorithm(i.e.,itchangestheavailabilityguaranteethatcanbemade)|acleardierencefrom

modifyingsystemcharacteristics.

Thereasontheselectionsurfaceisinsensitiveto theworkload,exceptatextremes,isbecausethe

work-load mainly changes system performance. As the read workload increases/decreasesthe selection surface

expands/contractsalongtheperformanceaxis. Thereasonforthisisthattheread(decode)costsarelower

thanthewrite(encode)costsformostalgorithms. Onlyattheendsoftheworkloadrangedothedierent

(15)

mid-0% 97% 33% 5% 67% 6% 20% 22% 5% 40% 6% 4% 60% 4% 2% 80% 26% 7% 95% 50% 16% 100% 61% 20%

Table 2: Impact of workload on selection surface. Schemeselection isinsensitiveto changes inthe mid-rangeof

workloads.Signicantchangesoccurneartheendpointsofthepossibleworkloads.

rangeworkload. Theconclusionofthisexperimentisthat, formostread-writeworkloads,schemeselection

canbe doneassuming a workload of 50%reads becausethe selection surfaceis stable overa largerange.

However,forwrite-once/read-manystoragesystems,thetrade-ospacediersradically.

The selection surfaces for the 100% read and 100% write workloads have some specic features that

provideinsightintothetrade-ospace. Figure7andFigure8showtheselectionsurfacesfortheseworkloads.

Splittingdominatesthelowavailability-highsecurityregion(the backplaneofthegraph)forthe100%read

workload. Whereas, splitting dominates the diagonal that demarcates the edge of the operationalregion

for the 100% write workload. For other workloads, splitting only occurs at the maximum security point.

Splitting's encodeoperation isfaster thansecretsharing's; however,foranyother workloadsecretsharing

dominates splitting because splitting has extremely poor read availability. Splitting's decode operation

is extremely fast compared to its encode. This is why it is selected for the read workload|its encode

performancedoesnotpenalize it.

Thisinsightaboutsplittingisapplicabletotherestofthethresholdschemes. Forthe100%readworkload,

schemesarenotpenalizedforpoorwriteperformance. Theresultofthisisthatrampschemesdominatethe

region held by information dispersal for write dominated workloads|rampschemes with large n and low

marecompetitivewithinformationdispersalschemeswithmid-rangenandsimilarmvaluesbecausetheir

poorwriteperformance iscounted. Also, shortsecretsharing isshownto havelittlevaluefor aread-only

workload. Shortsecretsharingoersgoodsecurityfor alowencode cost. Again,theencodecostdoesnot

matterinaread-onlyworkload|rampschemesoerbetterread-onlyperformancethanshortsecretsharing.

Consideringthe resource consumption of the 100% read and 100% write workload is also instructive.

Figure9andFigure10demonstratethatthebestperformingschemeshaveabalancedutilizationofresources

when performing reads. Whereas, the CPU is thescarcest resourcewhen encoding data forhigh security.

Thefact that securityisCPU bound matchesintuition|securityisexpensiveand systemdesignersloathe

payingtheprice. However,the fact that reading secure datais notasexpensive, in termsof CPU cycles,

issignicant. Indeed,forworkloadswith>60%readstheutilizationofresourcesinthesystemisbalanced

deepintothesecurityregion.

5.4 Failure model

Thefailureprobabilityforstoragenodes,f

node

,isaninterestingparameterin ourmodel. Atotalordering

of all schemes based on their availability can be performed without substituting a specic value for this

parameter. Thus, theparameter only contractsorexpands the surfacein theavailability dimension. For

example,thesurfacein Figure11,whichisbasedonafailureprobability10largerthanthedefaultvalue,

hastheexactsameshapeasthesurfaceinFigure1,exceptforthescalingalongtheavailabilityaxis. Thus,in

somesensetheselectionsurfacedoesnotdependonthefailureprobability|orderingalongtheavailability

axis is solely a function of n and m. Clearly though, an accurate understanding of the failure model of

(16)

Figure8: 100%Writeworkload.

5.5 Security model

In this section, we investigate what happens if one assumes circumventing cryptographic protection and

compromising storage nodes require dierent amounts of eort. It is important to only consider relative

changesin theseexperiments|theeortmetriconlyprovidesarelativeorderingofschemes.

WeperformedexperimentstoseewhathappenswhenE

CircumventCryptography isvariedrelativetoE BreakIn . Reducing E CircumventCryptography

by anorder of magnitude relative to E

BreakIn

has little eect on the

se-lection surface. It diers over less than 4% of the area with an average dierence of 9%. Thus, the

eort valuation used to generate a total order for schemes along the security axis gives less weight to

cryptographiccondentiality than to information-theoretic condentiality. The results are much dierent

when E

is increased by an order of magnitude relativeto E

BreakIn

. Replication with

cryptography dominates almost the entirety of the security-availability plane. This is because the

max-imum possible value of n is 10 in our default model. A threshold scheme can only achieve security of

10E

BreakIn

|one order of magnitude. To gain more insight into how the selection surface changes, we

increasedE

byafactorof 2.5relativeto E

BreakIn

. Figure12 presentstheresulting

se-lectionsurface.Replicationwithcryptographydominatesaverylargeregionoftheschemeselectionsurface.

As well, theregion in which short secretsharingdominatesramp schemes growssignicantly. Indeed,the

(17)

consumptionofnetworkandCPUresourceswhenperformingreads.

Figure 10: Balance of resources fora 100% write workload. Thecostofsecurityisentirelyinthewrite(encode)

operation. Aswell,thelevelofsecurityisboundbytheCPU.

Even though the numbersassigned to the security costs have an arbitrary nature, it is interesting to

consider how the value placed in cryptographictechniques can change depending on the amount of time

condentiality must be maintained. A goal of security engineering is to expend just enough eort (i.e.,

minimize performance degradation) to achievethe security goal. For systems with ashort condentiality

window, the weighting of E

relative to E

BreakIn

should be higher. Eectively, this

assumes a bound on the time that an attacker can utilize to crack the cipher, steal keys, or guess keys.

On the other hand, information whose condentiality is a long-term priority must carefully consider the

valuationofcryptographictechniques.

5.6 Analysis

Although the trade-o space is complex, we expected specic algorithms to dominate certain regions of

theavailability-securityplane. Within aregionweexpected theparametersto havea largeimpactonthe

performanceachieved. Our investigation showsthat ouroriginal insightswere basically correct. However,

sometransitionsbetweenregionsareabrupt,indicatingthatalargeperformancedierenceoccursforasmall

change in security oravailability guarantees. For example, the transition from replication to information

(18)

afactor oftwo alongthe availabilityaxis. Remember,the scaleof theavailabilityaxisisthe numberofnines. Anorderof

magnitudeincreaseintheprobabilityoffailureresultshalvesthenumberofninesofavailability.

Figure12: Placinghighervalueoncryptographythannodesecurity. Ifcryptographyprovidesbettercondentiality

thanthesecurityofthestoragenodes,replicationwithcryptographyisaveryeectivescheme. Aswell,shortsecretsharing,

whichcombinescryptographywithinformationdispersal,dominatesrampschemesdeepintotheregionofhighsecurity.

havethis property. Thiscontradictsouroriginalintuitionthatthelargenumberofschemeswouldresultin

aselectionsurfacethat issmooth. Ifasystemisoperatingnearasharpperformancetransition, thesystem

requirementsmustbeconsideredandanengineeringdecisionmustbemadeastowhethertoerrin favorof

performance,availability,orsecurity.

Sharptransitionstendtobelocalizedin thebackcornerofthetrade-o space(lowavailabilityandlow

security). The sharpness is due to the data redundancy of threshold schemes. Share size is calculated as

1=(m (p 1)),andnsharesaregeneratedbyathresholdscheme. Considera1-1-4thresholdscheme(4-fold

replication) and a 1-2-4 threshold scheme (an information dispersal scheme). By increasing m from 1 to

2, thedata redundancyis halved (i.e., storagespaceconsumption is thesameas2-fold replication). Ona

read, both schemesrequire the sameamount of data (one replica versus two shares, each half the size of

the replica). The abrupt performance dierence ontheselection surfaceis due to writeoperations,which

dependontheamountofdatathatmustbesentacrossthenetwork.

Anotherfeature of the selection surfaceoccurs when two algorithms perform similarly; the resultis a

\checkerboard"ontheselectionsurface. Insuchcases,itisinformativetoconsidertheresourceconsumption

(19)

Anotheraspect of the space that interests us is the performance dierence between algorithms. Our

intuition was that limiting the set of potential algorithms would often result in poor performance. Our

intuition was correct: each algorithm tends to be good for only some region of the availability-security

plane. If thesystem beingconsidered doesnot operatewithin the algorithm's \good"region, thesystem

hasincurredanunnecessaryperformancepenalty. Indeed,limitingtheset ofalgorithmsreducestheregion

of theavailability-securityplanein which the systemcanoperate. Constrainingan algorithmto aspecic

setofparameters(i.e.,buildingasystemaroundasinglescheme)xestheavailabilityandsecuritythatcan

beoered. Moreover,unlessthoroughanalysisofthetrade-ospacewasdoneaspartofthesystemdesign,

itis unlikelythat thescheme selectedis well matchedto thesystembeingbuilt. Thisis onlyareasonable

designdecisionifthesystemisto operatein asingleregionand theschemeusedis ontheselectionsurface

attheregion.

Thehighestdegreeofavailabilityisalwaysachievedbyareplicationalgorithm(becauseonlyreplication

can handle n 1failures) and the highest degreeof securityis always achieved by splitting (because only

splittingcanhandle n 1compromises).

Aftersomelevelofsecurity,secretsharingdominatesonlytheedgeregionthatdemarcatesthereachable

portion of the availability-securityplane. This is because for am

1 -m

1 -n

1

secretsharing scheme, there is

usuallyap 2 -m 2 -n 2

rampschemewithp

2 =m 1 ,m 2 >m 1 ,andn 2 >n 1

whichprovides,bydenition,thesame

securityguarantee,providesadequateavailability,but hasmuch betterperformance. Thereasonforthisis

that thespace savingsof ramp schemes withm>p resultsin fewercomputations beingrequired,and thus

better performance (i.e., since shares are smaller in size, fewer calculations are required to generate each

share). Thus,ifsecretsharingistheonlyalgorithmthatoersthesecurityandavailabilityrequired,

increas-ing thenumberof storagenodes, (i.e., themaximum valuen may take),will produceabetter performing

ramp scheme(>2 improvement)that meets theavailabilityandsecurityrequirements. Clearly, this can

onlybedoneifthereisnotaharddesignconstraintonthenumberofstoragenodesin thesystem.

6 Survivable Storage Projects

Survivablestorage systems are an active areaof current research. This section describes six current and

pastsurvivable storageprojects,focusing ontheir systemmodels, target workloads, and data distribution

schemes. Foreach,designinsightsresultingfrom ourexplorationofthetrade-ospacearenoted.

Delta-4. TheDelta-4system [12,16] iscomprisedofclients,securityserversanddata storageservers.

The data distributionscheme employed is referredto asfragmentation, redundancy and scattering. Data

is decimatedinto fragments,each fragmentis encrypted (withachained cipherso that fragmentsmust be

decoded in acertain order), and theencrypted fragmentsare replicated. Each datastorage serveris sent

all ofthe fragmentsand uses apseudo-randomalgorithm to decidewhich fragments to store. Thenames

of fragmentsare self-verifying but givenoinformation aboutthefragment'scontents. Toaccessapiece of

data, aclient mustget authorization from a set of storageservers. The storageserversrun anagreement

protocolto provideintegrity. Thehash,whichenablesaclientto determinethenamesofthe fragmentsit

wantstoread,isstoredusingsecretsharingonthesecurityservers. Boththeintegrityandcondentialityof

thedatastoredin Delta-4hingesonthesecurityserversadequatelyprotectingfragments'meta-data. Since

CPUcyclesarenotasscarceastheywerefteenyearsago,cryptographycoupledwithinformationdispersal

should be used in a Delta-4 type system rather than replication. Indeed, since all shares are sent to all

storagenodes,space-eÆciencyofencodingisparamounttoreducingbandwidthconsumption.

Publius. Publius [47] isstrongly in uencedbyAnderson'sEternityService proposal [2],whichargues

forsurvivablestorageplusanonymity. Publiususesencrypted replicas fordatadistribution. Thekeyused

forencryptionisencodedwithsecretsharing,andasingleshareofthekeyisstoredwitheachreplica. This

algorithm is verysimilar to short secretsharing, exceptthat replication rather thaninformation dispersal

isused toencodetheciphertext. Noindicationis givenastothespecicparametersselectedforthedata

distributionscheme. However,theauthorsdoidentifyparameterselectionasadiÆculttask,citingthedesire

tobalanceresistancetocensorshipandperformance. Theavailabilityofdatain aPubliussystemislimited

bythesecretsharingoftheencryptionkey. Thus,theextraspaceconsumedbyreplication,overinformation

(20)

utilizingabetterscheme.

Intermemory. TheIntermemoryproject[10, 22] focuses onarchivalstorageofpublicdata(i.e.,

write-once, read-manydata with high availability and integrityrequirementsovertime). Intermemory's goal is

to provideextraordinarilyhigh availabilitywhile minimizingstorageconsumptionandthe expectedcostof

reading data. Deepencodingisthedatadistribution schemeused inIntermemory. Deepencodinginvolves

storingasinglecopyofapieceofdata,thenapplyingathresholdscheme(Tornadoschemesinthiscase[35])

to thepieceofdata, distributingtheshares,and thenhavingthestoragenodesrecursivelyapply thesame

threshold scheme to each share they receive. Deep encoding is performed to some depth level(e.g., deep

encodingofdepth threestoresareplica,nshares, andn 2

subsequentsharesofshares). Intermemoryusesa

1-16-32thresholdschemeapplied to depthlevelthree. Public-keycryptographyis discussedasameansto

providelong-termintegrityofstoreddataintheIntermemorysystem. TheIntermemoryprojecthasselected

anexcellentschemethatprovideshighavailabilitywithreasonableexpectedreadperformanceandverylow

storage spaceoverhead. However, integrity is explicitly stated asagoal of the system. An attackerneed

onlycompromiseasinglereplicato defeattheintegrityofthestored data. Addingahash totheencoding

algorithm and removingthe rst depth level of deep encoding would greatlyincrease the integrity of the

Intermemorysystemattheexpenseofincreasingtheexpected costofperformingaread.

Oceanstore. Oceanstore [31] is envisioned asa wide-area data utility. It intends to leverage excess

storagecapacityoverawideareatoprovidesurvivablestorageofdataaccessiblefromanywhere. Oceanstore

storestwotypesofdata: activedata(whichcanbereadandwritten)andarchivaldata(whichiswrite-once,

read-many). Active data isstored using encrypted replicas. Replicashave self-verifyingnames to provide

integrity. Agreementalgorithms are used to manageauthorization to data. Archival data is stored using

deepencodingasin Intermemory. Oceanstoreisaddressingmanyotherissuespertainingtobuildingadata

storageutilityofglobalscale,includingclientmobility,robustnaming,datamigration,andreplicadiscovery.

Analyzingthetrade-os amongstdatadistributionschemeswouldenableOceanstoretoclearlyunderstand

thedierentcharacteristicsofthetwodistinctdatadistributionschemesitemploys. Indeed,suchananalysis

isnecessaryforthemto determineat whichworkloaddatashould switchfrombeingactivetoarchival.

Farsite. TheFarsiteproject[7]isexploringpeer-basedstorageinalocal-areasetting. Farsiteis

attempt-ingto leveragetheidle cycles andunused storagecapacity ofdesktop workstations, whichinherently have

poor availabilitycharacteristics,to provide asurvivablestoragesystem. Farsite usesencrypted replication

with self-verifying names. A Byzantine agreement protocol amongstclient machines storing meta-data is

usedtoguaranteetheintegrityofthedirectoryservice(andsubsequentlyprovidetheself-verifyingnatureof

thereplicanames). Tolimitspaceconsumption, singleinstance storage[6],in conjunctionwithconvergent

encryption,is usedto ensurethat onlynreplicas ofanysimilarpiece of datais storedsystemwide. Since

Farsiteexplicitlystatesconcernsaboutstoragespaceconsumption,informationdispersal ratherthen

repli-cationshouldprobablybeused. WithrelativelysmallCPUcosts,2{4reductionsin spaceandbandwidth

utilization are available. Further, bothsingle instancestorageand convergentencryption canstill beused

withthesharesgeneratedbyinformationdispersal.

e-Vault. Thee-Vaultsystem[18,25]isasurvivablestoragesystemthatuseseitherinformationdispersal

orshort secretsharing, depending onwhether ornotstrong condentiality isdesired. Distributed

nger-prints [29] are used for integrity. The systemis designed for a local-areasetting in which clients interact

withstorageserversviaa\gateway"(adesignated storageserver). Performancenumbersaregivenforthe

1-2-3information dispersal scheme[25], sincethetest platform is limitedto three storageservers. This is

averylimitedsystemin which to consider informationdispersal schemes. A systemwith just afew more

storagenodeshassignicantlybetterperformanceand availabilitycharacteristics.

PASIS.Ourprototypesystem,PASIS[40,48], isdescribedinSection 4. PASISprovidesan

implemen-tation ofasurvivablestoragesystemagainstwhich wevalidate ourperformancemodels. The maindesign

goalofPASISistobe exibleenoughforustoinvestigatemanydierentapproachestobuildingsurvivable

storagesystems. Ourimplementationofencode/decodefunctionalityisseparatefromour

multi-read/multi-writefunctionality. Theencode/decodelibraryprovidesthresholdalgorithms,cryptographicalgorithms,and

hybridalgorithms,allofwhichcanbeinstantiatedoverawiderangeofparameters. Thelibraryisstructured

soastofacilitatetheadditionofdatadistributionalgorithmsandthecompositionofhybridalgorithms. The

multi-read/multi-writelibrarysupportsNFS,CIFS,andFTPstoragenodesandcanbeextendedtosupport

(21)

7 Summary

Thispaperillustratesthecomplextrade-ospaceassociatedwithselectingtherightdatadistributionscheme

forasurvivablestoragesystem. Areasonedapproachtoexploringthistrade-ospaceisdevelopedandused

to explorethespace'ssensitivityto varioussystemand workloadcharacteristics. Althoughfurther workis

neededtorenethemodels,webelievethatthispapertakesabigstepintherightdirection|awayfromad

hocselectionsandtowardsinformedengineeringdecisions.

ThisworkwasdoneinthecontextofthePASISsurvivablestorageprojectatCarnegieMellonUniversity.

Additionalinformationonourworkandresultscanbefoundontheprojectwebsite [40].

8 Acknowledgements

TheauthorswouldliketothankJohnD.Strunk,GarthR.Goodson,andTheodore M.Wong forextensive

discussionandfeedbackonthedirectionsandgoalsofthisresearch. Wewouldalsoliketo thankthemany

other graduate students in the Parallel Data Lab that provided feedback on rough drafts of this paper.

Finally, we would like to thank Matt Vestal for administering the cluster of machines on which we ran

experiments.

References

[1] DarrellC.Anderson, Jerey S.Chase, andAminM. Vahdat. Interposedrequestroutingfor scalablenetwork

storage. Symposiumon Operating SystemsDesignandImplementation(SanDiego, CA,22{25October2000),

2000.

[2] RossJ.Anderson. TheEternityService. PRAGOCRYPT,pages242{253. CTUPublishing,1996.

[3] ThomasE.Anderson,MichaelD.Dahlin,JeannaM.Neefe,DavidA.Patterson,DrewS.Roselli,andRandolphY.

Wang. Serverlessnetworklesystems.ACM TransactionsonComputer Systems,14(1):41{79,February1996.

[4] G.R.Blakley. Safeguarding cryptographickeys. AFIPSNational Computer Conference (NewYork,NY,4{7

June1979),pages313{317. AFIPS,1979.

[5] G.R.BlakleyandCatherine Meadows. Security oframpschemes. Advances inCryptology -CRYPTO,pages

242{268. Springer-Verlag,1985.

[6] WilliamJ.Bolosky,ScottCorbin,DavidGoebel,andJohnR.Douceur.SingleInstanceStorageinWindows2000.

USENIX Windows Systems Symposium(Seattle, WA, 3{4 August2000), pages13{24. USENIX Association,

2000.

[7] WilliamJ.Bolosky, JohnR.Douceur, DavidEly,andMarvinTheimer. Feasabilityof aserverless distributed

lesystemdeployedonanexistingsetofdesktopPCs. ACM SIGMETRICSConference onMeasurement and

Modeling of Computer Systems (Santa Clara, CA, 17{21 June 2000). Published as Performance Evaluation

Review,28(1):34{43. ACM,2000.

[8] MiguelCastroandBarbaraLiskov.Practicalbyzantinefaulttolerance.SymposiumonOperatingSystemsDesign

andImplementation(NewOrleans,LA,22{25February1999),pages173{186. ACM,1998.

[9] Miguel Castro and Barbara Liskov. Proactive recovery ina Byzantine-fault-tolerant system. Symposium on

OperatingSystemsDesignandImplementation(SanDiego,CA,23{25 October2000),pages273{287. USENIX

Association, 2000.

[10] Yuan Chen, Jan Edler, Andrew Goldberg, Allan Gottlieb, Sumeet Sobti, and Peter Yianilos. A prototype

implementation ofarchival Intermemory. ACM Conference on DigitalLibraries(Berkeley, CA, 11{14 August

1999),pages28{37. ACM,1999.

[11] WeiDai. Crypto++referencemanual . http://cryptopp.sourceforge.net/docs/ref/.

[12] YvesDeswarte,L.Blain,andJean-CharlesFabre. Intrusiontolerance indistributedcomputingsystems. IEEE

SymposiumonSecurityandPrivacy(Oakland,CA,20{22May1991),pages110{121,1991.

(22)

[15] J.FragaandD.Powell. Afaultandintrusion-tolerantlesystem. IFIPInternationalConferenceonComputer

Security,pages203{218. Elsevier,1985.

[16] Jean-Michel Fray, Yves Deswarte, and David Powell. Intrusion-tolerance using ne-grain

fragmentation-scattering. IEEE Symposium onSecurity and Privacy(Oakland, CA,7{9April 1986),pages194{201. IEEE,

1986.

[17] FreeHaven.http://www.freehaven.net/.

[18] Juan A. Garay, Rosario Gennaro, Charanjit Jutla, and Tal Rabin. Secure distributed storage and retrieval.

TheoreticalComputer Science,243(1{2):363{389. Elsevier,September2000.

[19] HectorGarcia-Molina andDanielBarbara. Howtoassignvotesinadistributedsystem. Journalof theACM,

32(4):841{855,October1985.

[20] GarthA.Gibson,DavidF.Nagle,KhalilAmiri,JeButler,FayW.Chang,HowardGobio,CharlesHardin,Erik

Riedel,DavidRochberg,andJimZelenka. Acost-eective,high-bandwidth storagearchitecture. Architectural

Support forProgramming Languages andOperating Systems(SanJose, CA,3{7October1998). Published as

SIGPLANNotices,33(11):92{103,November1998.

[21] D. K. Giord. Violet: an experimental decentralized system. Technical report CSL{79{12. Xerox Palo Alto

ResearchCenter,CA,November1979.

[22] AndrewV. Goldberg andPeterN.Yianilos. Towards anarchivalintermemory. IEEE InternationalForumon

Research andTechnologyAdvancesinDigitalLibraries(SantaBarbara,CA,22{24April1998),pages147{156.

IEEE,1998.

[23] JimGrayandDanielP.Siewiorek.High-availabilitycomputersystems.IEEEComputer,24(9):39{48,September

1991.

[24] Intermemory. http://www.intermemory.org/.

[25] A. Iyengar, R. Cahn, C. Jutla, and J. A. Garay. Design and implementation of a secure distributed data

repository. IFIP InternationalInformation Security Conference (Vienna,Austria,and Budapest,Hungary,31

August{4September1998),pages123{135. ACM,1998.

[26] RajJain. Theart ofcomputersystemsperformance analysis. JohnWiley&Sons,1991.

[27] EhudD.Karnin,JonathanW.Greene,andMartinE.Hellman. Onsecretsharingsystems. IEEETransactions

onInformationTheory,IT-29(1):35{41. IEEE,January1983.

[28] LeonardKleinrock. Queueingsystems. JohnWileyandSons,1973.

[29] HugoKrawczyk. Distributedngerprintsandsecureinformation dispersal. ACM SymposiumonPrinciples of

DistributedComputing(Ithaca,NY,15{18August1993),pages207{218,1993.

[30] HugoKrawczyk. Secretsharing madeshort. Advances in Cryptology - CRYPTO(SantaBarbara, CA,22{26

August1993),pages136{146. Springer-Verlag,1994.

[31] John Kubiatowicz, David Bindel, YanChen, Steven Czerwinski, Patrick Eaten, Dennis Geels, Ramakrishna

Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao. OceanStore: an

architectureforglobal-scalepersistentstorage.ArchitecturalSupportforProgrammingLanguagesandOperating

Systems(Cambridge,MA,12{15November2000).PublishedasOperatingSystemsReview,34(5):190{201,2000.

[32] LeslieLamport,RobertShostak,andMarshall Pease. TheByzantinegeneralsproblem. ACMTransactionson

ProgrammingLanguagesandSystems,4(3):382{401. ACM,July1982.

[33] EdwardK. Leeand Chandramohan A. Thekkath. Petal: distributed virtual disks. Architectural Support for

ProgrammingLanguagesandOperatingSystems(Cambridge,MA,1{5October1996). Published asSIGPLAN

Notices,31(9):84{92, 1996.

[34] BarbaraLiskov,SanjayGhemawat,RobertGruber, PaulJohnson,LiubaShrira,andMichael Williams.

Repli-cation inthe Harp le system. ACM Symposium on Operating System Principles (Pacic Grove, CA, 13{16

October1991). PublishedasOperatingSystemsReview,25(5):226{238,1991.

[35] Michael Luby,MichaelMitzenmacher,MohammadAminShokrollahi,andDanielA.Spielman. Analysisoflow

densitycodesandimproveddesignsusingirregulargraphs. ACMSymposiumonTheory ofComputing(Dallas,

TX,23{26May1998),pages249{258. ACM,1998.

[36] DahliaMalkhiandMichaelK.Reiter.SecureandscalablereplicationinPhalanx. IEEESymposiumonReliable

(23)

[38] Oceanstore. http://oceanstore.cs.berkeley.edu/.

[39] RodolpheOrtalo,YvesDeswarte,andMohamedKaaniche.Experimentingwithquantitativeevaluationtoolsfor

monitoringoperationalsecurity. IEEETransactionsonSoftwareEngineering,25(5):633{650. IEEE,September

1999.

[40] PASIS. http://www.ices.cmu.edu/pasis/.

[41] Publius. http://cs1.cs.nyu.edu/waldman/publius/.

[42] MichaelO.Rabin. EÆcientdispersalofinformationforsecurity,loadbalancing,andfaulttolerance. Journalof

theACM,36(2):335{348. ACM,April1989.

[43] AdiShamir. Howtoshareasecret. CommunicationsoftheACM,22(11):612{613. ACM,November1979.

[44] Daniel P. Siewiorek and Robert S.Swarz. Reliable computer systems: design and evaluation. Digital Press,

Secondedition,1992.

[45] R.H.Thomas.Amajorityconsensusapproachtoconcurrencycontrol.ACMTransactionsonDatabaseSystems,

4:180{209, 1979.

[46] MartinTompaandHeatherWoll. Howtoshareasecretwithcheaters. JournalofCryptography,1(2):133{138,

1988.

[47] Marc Waldman, Aviel D. Rubin, and Lorrie Faith Cranor. Publius: a robust, tamper-evident,

censorship-resistant,webpublishingsystem.USENIXSecuritySymposium(Denver,CO,14{17August2000),pages59{72.

USENIXAssociation, 2000.

[48] JayJ.Wylie,MichaelW.Bigrigg,JohnD.Strunk,GregoryR.Ganger,HanKiliccote,andPradeepK.Khosla.