
for a Survivable Storage System

Jay J. Wylie, Mehmet Bakkaloglu, Vijay Pandurangan,
Michael W. Bigrigg, Semih Oguz, Ken Tew, Cory Williams,
Gregory R. Ganger, Pradeep K. Khosla

May 2001

CMU-CS-01-120

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

Survivable storage system design has become a popular research topic. This paper tackles the difficult problem of reasoning about the engineering trade-offs inherent in data distribution scheme selection. The choice of an encoding algorithm and its parameters positions a system at a particular point in a complex trade-off space between performance, availability, and security. We demonstrate that no choice is right for all systems, and we present an approach to codifying and visualizing this trade-off space. Using this approach, we explore the sensitivity of the space to system characteristics, workload, and desired levels of security and availability.

This research is part of the PASIS project [40, 48] at Carnegie Mellon University.

This work is partially funded by DARPA/ATO's Organically Assured and Survivable Information Systems (OASIS) program (Air Force contract number F30602-99-2-0539-AFRL). We thank the members and companies of the Parallel Data Consortium (at the time of this writing: EMC Corporation, Hewlett-Packard Labs, Hitachi, IBM Corporation, Intel Corporation, LSI Logic, Lucent Technologies, Network Appliances, PANASAS, Platys Communications, Seagate Technology, Snap Appliances, Sun Microsystems, Veritas Software


Digital information is a critical resource, creating a need for distributed storage systems that provide sufficient data availability and data security in the face of failures and malicious attacks. Many research groups [13, 14, 17, 24, 37, 38, 40, 41] are now exploring the design and implementation of such survivable storage systems. These systems build on mature technologies from decentralized storage systems [1, 3, 8, 20, 33, 34] and also share the same high-level architectures. In fact, development of survivable storage with the same basic architecture was pursued over 15 years ago [12, 15, 16]. The challenge now, as it was then, is to achieve acceptable levels of performance and manageability. Moreover, a means to evaluate survivable storage systems is required to facilitate the design of these systems.

One key to maximizing survivable storage performance is mindful selection of the data distribution scheme. A data distribution scheme consists of a specific algorithm for data encoding and partitioning and a set of values for its parameters. There are many algorithms applicable to survivable storage, including encryption, replication, striping, erasure-resilient coding, secret sharing, and various combinations. Each algorithm has one or more tunable parameters. The result is a large toolbox of possible schemes, each offering different levels of performance (throughput), availability (probability that data can be accessed), and security (effort required to compromise the confidentiality or integrity of stored data). For example, replication provides availability at a high cost in network bandwidth and storage space, whereas short secret sharing provides availability and security at lower storage and bandwidth cost but higher CPU utilization. Likewise, selecting the number of shares required to reconstruct a secret-shared value involves a trade-off between availability and confidentiality: if more machines must be compromised to steal the secret, then more must be operational to provide it legitimately. Secret sharing schemes and other data distribution algorithms are described in Section 2.

No single data distribution scheme is right for all systems. Instead, the right choice for any particular system depends on an array of factors, including expected workload, system component characteristics, and desired levels of availability and security. Unfortunately, most system designs appear to involve an ad hoc choice, often resulting in a substantial performance loss due to missed opportunities and over-engineering.

This paper promotes a better approach to selecting the data distribution scheme. At a high level, this new approach consists of three steps: enumerating possible data distribution schemes (<algorithm, parameters> pairs), modeling the consequences of each scheme, and identifying the best-performing scheme for any given set of availability and security requirements. The surface shown in Figure 1 illustrates one result of the approach. Generating such a surface requires codifying each dimension of the trade-off space such that all data distribution schemes fall into a total order. The surface serves two functions: (1) it enables informed trade-offs among security, availability, and performance, and (2) it identifies the best-performing scheme for each point in the trade-off space. Specifically, the surface shown represents the performance of the best-performing scheme that provides at least the corresponding levels of availability and security. Many schemes are not best at any of the points in the space and as such are not represented on the surface.

This paper demonstrates the feasibility and importance of careful data distribution scheme choice. The results show that the optimal choice varies as a function of workload, system characteristics, and the desired levels of availability and security. The results show that minor (2×) changes in these determinants have little effect, which means that the models need not be exact to be useful. Importantly, the results also show that large changes, which would correspond to distinct systems, create substantially different trade-off spaces and best choices. Thus, failing to examine the trade-off space in the context of one's system can yield both poor performance and unfulfilled requirements. With sensitivity studies and survivable storage system examples from the literature, we identify interesting trends and design points.

The remainder of this paper is organized as follows. Section 2 discusses survivable storage system design and data distribution scheme options. Section 3 describes how we codify the trade-off space. Section 4 describes PASIS [40, 48], our prototype survivable storage system, and its role in validating our performance model. Section 5 explores the trade-off space and its sensitivity to model inaccuracies, system components, and expected workloads. Section 6 describes current and past survivable storage system research, identifying insights from our explorations that could enhance their designs. Section 7 summarizes this paper's


... performance, availability, and security axes. Performance is quantified as the number of 32 KB requests per second that can be satisfied for a single client; only the best-performing scheme that provides at least the given security and availability levels is shown, resulting in a monotonic decrease along those axes. Availability is the probability that stored data can be accessed. Security is the effort required to compromise either the confidentiality or integrity of stored data. Section 2 describes the data distribution algorithms listed in the legend. Section 3 describes the axes in detail.

2 Survivable Storage Systems

Survivable systems operate from the fundamental design thesis that no individual service, node, or user can be fully trusted; having some compromised entities must be viewed as the common case rather than the exception. Survivable storage systems must encode and distribute data across independent storage nodes, entrusting the data's persistence to sets of nodes rather than to individual nodes. If confidentiality is required, unencoded data should not be stored directly on storage nodes; otherwise, compromising a single storage node enables an attacker to bypass access-control policies.

Prior work in cluster storage systems [1, 3, 8, 20, 33, 34] provides much insight into how to efficiently decentralize storage services while providing a single, unified view to applications. Figure 2 illustrates the basic decentralized storage architecture. In most cases, some intermediary software is responsible for translating between the unified view and the decentralized reality. This piece of software may execute directly on each client system [1, 3, 20, 33], on storage nodes identified as leaders [8, 34], or at intermediary nodes [1, 3].

In most cluster storage systems, the data and some form of redundancy information are striped across storage nodes. Survivable storage systems differ from decentralized storage systems mainly in the encoding mechanism used (i.e., the data distribution scheme). The data distribution scheme enables the storage system to survive both failures and compromises of storage nodes. Each scheme (an <algorithm, parameters> pair) offers a different level of performance, availability, and security. Little understanding exists of the large trade-off space that the schemes occupy in practice.

This paper takes a step towards understanding the relative merits of different data distribution schemes by examining them in the context of the trade-off space. The remainder of this section describes various data distribution algorithms and the parameters that specify each algorithm. There are several issues involved with constructing a complete survivable storage system that are not central to the results and insights of this paper. Specifically, this paper argues for and provides an approach for making mindful selections of the data distribution scheme, but explicitly tries to do so while remaining impartial on issues that are largely orthogonal. Examples of such issues include metadata and naming mechanisms, client node authentication,


[Figure 2 diagram. Labels: Applications, Requests, Data, Decode/Encode, Multi-read/Multi-write, Meta-Data, Intermediary Software, Shares, Storage Nodes.]

Figure 2: Generic decentralized storage architecture. Intermediary software translates the applications' unified view of storage to the decentralized reality. Solid lines trace the data path. Dashed lines trace the meta-data path. Applications read and write blocks to the intermediary software. Encoding transforms blocks into shares (decoding does the reverse). Sets of shares are read from (written to) storage nodes. Intermediary software may run on clients, leader storage nodes, or at some point in between.

2.1 Threshold algorithms

There is a wide array of data distribution algorithms, including encryption, replication, striping, erasure-resilient coding, information dispersal, and secret sharing. Threshold algorithms, characterized by three parameters (p, m, and n), represent a large set of these algorithms. In a p-m-n threshold scheme, data is encoded into n shares such that any m of the shares can reconstruct the data and fewer than p reveal no information about the encoded data. Thus, a stored value is available if at least m of the n shares can be retrieved. Attackers must compromise at least p storage nodes before it is even theoretically possible to determine any part of the encoded data.

Table 1 lists some p-m-n threshold schemes that have more familiar names. Perhaps the simplest example is n-way replication, which is a 1-1-n threshold scheme. That is, out of the n replicas that are stored, any single replica provides the original data (m = 1), and each replica reveals information about the encoded data (p = 1). Another simple example is decimation (or striping, as in disk arrays), wherein a large block of data is partitioned into n sub-blocks, each containing 1/n of the data (so, p = 1 and m = n). At the other end of the spectrum is "splitting", an n-n-n threshold scheme that consists of storing n − 1 random values and one value that is the exclusive-or of the original value and those n − 1 values; p = m = n for splitting, since all n shares are needed to recover the original data. Replication, decimation, and splitting schemes have a single tunable parameter, n, which affects their place in the trade-off space.
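The splitting scheme described above is simple enough to sketch directly. The following Python is a minimal illustration, not the PASIS implementation; the function names `split` and `combine` are hypothetical, and the random values are drawn with the standard `secrets` module as an assumption about what a suitable random source looks like.

```python
import secrets

def split(data: bytes, n: int) -> list[bytes]:
    """Encode data into n shares; all n are needed to decode (n-n-n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n - 1 shares are uniformly random byte strings of the same length.
    shares = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    # The final share is the XOR of the data with every random share.
    last = bytearray(data)
    for share in shares:
        for i, b in enumerate(share):
            last[i] ^= b
    shares.append(bytes(last))
    return shares

def combine(shares: list[bytes]) -> bytes:
    """Decode by XORing all n shares together."""
    out = bytearray(len(shares[0]))
    for share in shares:
        for i, b in enumerate(share):
            out[i] ^= b
    return bytes(out)
```

Every share has the same length as the original data, so storage grows linearly with n, which is the cost side of the p = m = n trade-off.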

With more CPU-intensive mathematics, the full range of p-m-n threshold schemes becomes available. For example, secret sharing schemes [4, 27, 43] are m-m-n threshold schemes. Shamir's implementation of secret sharing is based on interpolating points on a polynomial in a finite field [43]. The secret value along

1-1-n    Replication
1-n-n    Decimation (Striping)
n-n-n    Splitting (XORing)
1-m-n    Information Dispersal
m-m-n    Secret Sharing
p-m-n    Ramp Schemes

Table 1: Example p-m-n threshold schemes.

share is generated by evaluating the polynomial at distinct points (distinct from other shares and the point containing the secret value).
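A polynomial-based m-m-n secret sharing scheme of the kind attributed to Shamir above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: it works over the integers modulo the Mersenne prime 2^127 − 1 instead of the small finite field a bulk-data implementation would use, it places the secret at x = 0, and the names `make_shares` and `recover` are hypothetical.

```python
import secrets

# Assumption: a prime field large enough to hold the secret as one integer.
P = 2**127 - 1

def make_shares(secret: int, m: int, n: int) -> list[tuple[int, int]]:
    """Return n points on a random degree-(m - 1) polynomial with f(0) = secret."""
    if not (1 <= m <= n < P) or not (0 <= secret < P):
        raise ValueError("need 1 <= m <= n and 0 <= secret < P")
    # The secret plus m - 1 random coefficients determine the polynomial.
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(m - 1)]

    def f(x: int) -> int:
        y = 0
        for c in reversed(coeffs):  # Horner's rule, reduced mod P
            y = (y * x + c) % P
        return y

    # Shares are evaluations at distinct nonzero points.
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 from at least m shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for k, (xk, _) in enumerate(shares):
            if k != j:
                num = (num * -xk) % P
                den = (den * (xj - xk)) % P
        secret = (secret + yj * num * pow(den, -1, P)) % P
    return secret
```

Any m of the n shares recover the secret, for example `recover(make_shares(42, 3, 5)[:3])`.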

Rabin's information dispersal algorithm, a 1-m-n threshold scheme, uses the same polynomial-based math as Shamir's secret sharing, but no random numbers [42]; m secret values are used to determine the unique encoding polynomial. Thus, each share reveals partial information about the m simultaneously encoded values, but the encoding is much more space-efficient. Ramp schemes [5] are p-m-n threshold schemes, and they can also be implemented with the same polynomial-based math. The points used to uniquely determine the encoding polynomial are p − 1 random values and m − (p − 1) secret values. Ramp schemes thus offer information-theoretic confidentiality of up to p − 1 shares. They are also more space-efficient than secret sharing (so long as m > p). For p = 1, ramp schemes are equivalent to information dispersal; for p = m, they are equivalent to secret sharing. Threshold algorithms can be implemented in ways other than the polynomial method summarized in this section. For example, Blakley proposed implementing secret sharing with intersecting hyper-planes [4], and Luby proposed implementing information dispersal with Tornado codes [35].

Parameter options for p-m-n threshold schemes expose a large space of possible data distribution schemes. In fact, there are on the order of N^3 different options to consider given N storage nodes. Each scheme in this space offers different levels of availability, confidentiality, CPU costs, and storage requirements (which also translate into network bandwidth requirements). For example, as n increases, information availability increases (it is more probable that m shares are available), but the amount of storage required increases (more shares are stored) and confidentiality decreases (more shares are available to steal). As m increases, the storage space required decreases (a share's size is proportional to 1/(m − (p − 1))), but so does its availability (more shares are required to reconstruct the original object). Also, as m increases, each share contains less information; this may increase the number of shares that must be captured before an intruder can reconstruct a useful portion of the original object. As p increases, the information system's confidentiality increases, but the storage space required also increases. With such a wide array of options, selecting the most appropriate data distribution scheme for a given environment is far from trivial.
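To make the size of this parameter space concrete, the sketch below enumerates every p-m-n combination for N storage nodes and computes the total storage consumed relative to the original data. It assumes the ordering 1 ≤ p ≤ m ≤ n ≤ N implied by the definitions above; the function names are hypothetical.

```python
from fractions import Fraction

def enumerate_schemes(N: int):
    """Yield every (p, m, n) with 1 <= p <= m <= n <= N."""
    for n in range(1, N + 1):
        for m in range(1, n + 1):
            for p in range(1, m + 1):
                yield p, m, n

def share_fraction(p: int, m: int) -> Fraction:
    """Share size as a fraction of the original data: 1 / (m - (p - 1))."""
    return Fraction(1, m - (p - 1))

def storage_blowup(p: int, m: int, n: int) -> Fraction:
    """Total stored bytes divided by original bytes (n shares are stored)."""
    return n * share_fraction(p, m)
```

For N = 25 the enumeration yields 2925 schemes, which is N(N + 1)(N + 2)/6 and therefore grows as N^3. The familiar schemes fall out as special cases: 3-way replication (1-1-3) stores 3 times the data, while decimation across 3 nodes (1-3-3) stores exactly 1 times the data.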

2.2 Other algorithms

There are also important data distribution algorithms outside the class of p-m-n threshold schemes. Notably, encryption is a common approach to protecting the confidentiality of information. Symmetric key encryption (e.g., triple-DES, AES) is a data distribution algorithm characterized by a single parameter: key length. Hybrid data distribution algorithms can be constructed by combining the algorithms already discussed. For example, many survivable storage systems combine replication with encryption to address availability and security, respectively. Security in such a system hinges both upon how well the encryption keys are protected and upon the difficulty of cryptanalysis. As another example, short secret sharing encrypts the original value with a random key, stores the encryption key using secret sharing, and stores the encrypted information using information dispersal [30]. Short secret sharing algorithms have three parameters: m, n, and k (key length). Public key cryptography (e.g., RSA) can be used instead of symmetric key cryptography to protect information confidentiality. The management of cryptographic keys must be addressed in the design of a system that uses cryptography; symmetric key and public key cryptography require different key management strategies. Finally, compression algorithms (e.g., Huffman coding) can be used before other


algorithms (e.g., MD5, SHA-1) can be used to add digests to data before it is encoded with another algorithm. The hash of the decoded data can be compared with the digest to verify the decoded data's integrity. Digests can also be generated for shares resulting from an encoding (e.g., distributed fingerprints [29]), allowing integrity to be verified prior to decoding the data. Hash algorithms are parameterized by the hash size, and they must either be encoded with or stored separately from the data in order to be effective. Digital signatures (e.g., DSA) can provide integrity guarantees similar to those of hash algorithms.
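The digest-then-encode pattern described above can be sketched with Python's standard `hashlib`. This sketch appends the digest to the data so that it is encoded along with it; SHA-256 is substituted here for the MD5 and SHA-1 examples named in the text, and the function names are hypothetical.

```python
import hashlib
import hmac

# Assumption: SHA-256 stands in for the hash algorithms named in the text.
DIGEST_LEN = hashlib.sha256().digest_size

def add_digest(data: bytes) -> bytes:
    """Append a digest so it is encoded together with the data."""
    return data + hashlib.sha256(data).digest()

def verify_and_strip(blob: bytes) -> bytes:
    """Check the trailing digest of decoded data and return the payload."""
    if len(blob) < DIGEST_LEN:
        raise ValueError("blob is too short to contain a digest")
    data, digest = blob[:-DIGEST_LEN], blob[-DIGEST_LEN:]
    if not hmac.compare_digest(hashlib.sha256(data).digest(), digest):
        raise ValueError("integrity check failed")
    return data
```

The output of `add_digest` would be passed to whichever encoding algorithm is in use, and `verify_and_strip` would run on the result of the matching decode.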

Two classes of integrity algorithms are often used to build authentication and directory services that protect the integrity of meta-data. The first class is agreement algorithms [32] such as the Byzantine fault tolerant library [9]. Quorum systems [19, 36], which are a superset of voting algorithms [21, 45], are the second class. Also, there are integrity algorithms that work exclusively with threshold algorithms. In schemes where m is less than n, excess shares can be used during decode (different permutations of m shares are used for validation). Secret sharing schemes can also be modified directly to offer a probabilistic guarantee of cheater detection [46].

3 The Trade-off Space

The algorithms described in Section 2, together with the degrees of freedom given by their parameters, provide thousands of data distribution schemes for survivable storage systems. Thoughtfully selecting one scheme from this set requires the ability to evaluate their relative merits and, further, to do so in the context of the target environment. This section describes our approach to comparing the performance, availability, and security of various schemes.

3.1 Evaluating availability

The substantial body of prior work in building highly-available systems [23, 44] provides a clear metric for evaluating information availability: the probability that a desired piece of information can be accessed at any given point in time. Assuming uncorrelated failures, this probability can be computed from the probabilities that required system components are available. For example, a p-m-n threshold scheme requires at least m of the n storage nodes containing shares to be operational to perform a read. If $f_{node}$ is the probability that a storage node has failed or is otherwise unavailable, then the availability of the stored information is

$$\mathit{Availability}_{read} = \sum_{i=0}^{n-m} \binom{n}{i} (f_{node})^{i} (1 - f_{node})^{n-i}$$

For writes, the computation of availability depends upon the system's design. A system could require that all of n specific nodes be operational for a write to succeed. This approach clearly has poor availability characteristics. To improve write availability in a system with N > n storage nodes, the write operation can attempt to write shares to different storage nodes until it has completed n writes. Alternatively, a system could allow a write to complete when fewer than n shares have been written. This requires more work during failure recovery or reduces read availability of the stored data (because n is effectively lower for that data). We calculate write availability using the last approach. Thus, m storage nodes of the N storage nodes in the system must be available for a write to be performed, and write availability is

$$\mathit{Availability}_{write} = \sum_{i=0}^{N-m} \binom{N}{i} (f_{node})^{i} (1 - f_{node})^{N-i}$$
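Both availability formulas are the same binomial sum with a different node count, which the following sketch makes explicit. It is an illustration of the formulas above under the stated assumption of uncorrelated failures; the function names are hypothetical.

```python
from math import comb

def availability(total: int, m: int, f_node: float) -> float:
    """Probability that at most (total - m) of `total` nodes are unavailable."""
    return sum(
        comb(total, i) * f_node**i * (1 - f_node) ** (total - i)
        for i in range(total - m + 1)
    )

def read_availability(m: int, n: int, f_node: float) -> float:
    """At least m of the n nodes holding shares must be operational."""
    return availability(n, m, f_node)

def write_availability(m: int, N: int, f_node: float) -> float:
    """At least m of the N nodes in the system must be operational."""
    return availability(N, m, f_node)
```

As a check on the arithmetic, with a per-node failure probability of 0.1, 3-way replication (m = 1, n = 3) is unavailable only when all three nodes fail, giving 1 − 0.1^3 = 0.999, whereas splitting across three nodes (m = n = 3) needs all of them and gives 0.9^3 = 0.729.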

Availability requirements for storage systems tend to be quite high. A popular manner for discussing high-availability values is in terms of "nines," referring to the number of nines after the decimal point in the availability value before a digit other than nine. For example, a low availability value of 0.993 has just two nines, whereas a high availability value of 0.99999996 has seven nines. It is worth noting that failures are not always uncorrelated, as is generally assumed in availability computations. For survivable systems in particular, denial-of-service (DoS) attacks can induce highly-correlated failures, bringing into question the


certainly are when comparing data distribution schemes for a given set of system components. In particular, availability increases only when more options are available for servicing requests. Thus, a higher availability value means that a DoS attack must eliminate a larger set of storage nodes.
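The "nines" shorthand used in this section to discuss availability values can be computed directly from the decimal representation. The sketch below takes the value as a decimal string, which is an assumption made to sidestep binary floating-point rounding; the function name is hypothetical.

```python
def count_nines(availability: str) -> int:
    """Count the nines after the decimal point before any other digit."""
    whole, _, frac = availability.partition(".")
    if whole not in ("", "0"):
        raise ValueError("expected a value below 1, such as '0.993'")
    # Leading nines are whatever lstrip removes from the fractional part.
    return len(frac) - len(frac.lstrip("9"))
```

For the two examples in the text, `count_nines("0.993")` gives 2 and `count_nines("0.99999996")` gives 7.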

3.2 Evaluating security

By far, security is the hardest of the three dimensions to evaluate, as there are no proven metrics or even a mature body of research from which to draw. Our initial plan was to reuse the mathematics of fault tolerance, counting how many storage nodes must be compromised to bypass confidentiality or integrity. However, this approach raised significant problems. First and foremost, it relies upon having a "probability of being compromised" with which to compute security values. Unfortunately, we are aware of no reliable way to obtain or even estimate such a value. Further, using such a value would be questionable, because security problems experienced by nodes in a distributed system are expected to be highly correlated; when the system is under attack, the probability value would go up. Another problem with evaluating security in terms of a probability of a node being compromised is that it is a useful measure for only a subset of the data distribution algorithms (threshold algorithms). For example, it ignores the additional confidentiality provided by encryption.

Our current approach to evaluating security focuses on the effort (E) required for an active foe to compromise the security of the system. For example, for an n-n-n threshold scheme, confidentiality can be compromised by breaking into all n heterogeneous storage nodes or by compromising the authentication system:

$$E_{Conf} = \min[E_{Auth},\ (n \times E_{BreakIn})]$$

Extending this example, assume that the data is also encrypted and that the names of shares reveal no information about the encoded data. The attacker has many paths to the data: attempt cryptanalysis or steal the encryption key, attempt combining shares in many permutations or compromise the directory service. Assuming that an attacker is going to take the easiest path to the data:

$$E_{Conf} = \min[E_{Auth},\ (n \times E_{BreakIn}) + \min[E_{Cryptanalysis}, E_{StealKey}] + \min[E_{IdentifyShares}, E_{StealNames}]]$$
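The easiest-path reasoning in the extended example translates directly into nested minimums. The following sketch mirrors that formula; the parameter names are hypothetical, and any numbers passed in are dimensionless effort units chosen purely for illustration, not measured values.

```python
def confidentiality_effort(
    n: int,
    e_break_in: float,
    e_auth: float,
    e_cryptanalysis: float,
    e_steal_key: float,
    e_identify_shares: float,
    e_steal_names: float,
) -> float:
    """Effort of the cheapest attack on an encrypted n-n-n scheme."""
    # Path through the storage nodes: break into all n nodes, then defeat
    # the encryption and work out which shares belong together, each by
    # whichever option is cheaper.
    storage_path = (
        n * e_break_in
        + min(e_cryptanalysis, e_steal_key)
        + min(e_identify_shares, e_steal_names)
    )
    # The attacker may instead compromise the authentication system.
    return min(e_auth, storage_path)
```

With illustrative values n = 3, a break-in effort of 10, key theft at 20 (cheaper than cryptanalysis at 50), and share identification at 5, the storage path costs 30 + 20 + 5 = 55, so an authentication system with effort 100 is not the weakest link, while one with effort 40 would be.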

In this paper, we focus strictly on the security of the storage system; attacks on the authentication service and directory service are not considered further. Thus, we use only two values in our security model: $E_{BreakIn}$ and $E_{CircumventCryptography}$. These values are dimensionless. Thus, the unit of the security axis is "effort units," and security values across all runs have been normalized on a scale that ranges from 0 to 100.

The first term, $E_{BreakIn}$, is the effort required to steal a piece of data from a storage node. The second term, $E_{CircumventCryptography}$, is the effort required to circumvent cryptographic confidentiality. It is a coarse metric that includes cryptanalysis, key guessing, and the theft of decryption keys. The confidentiality of an encrypted replica is thus the summation of these two terms: one encrypted replica must be stolen and then the encryption must be broken. Alternatively, the confidentiality of secret sharing is solely a function of the first term (i.e., $m \times E_{BreakIn}$): a total of m shares must be stolen to compromise the confidentiality of data, thus the effort is m times the effort to steal a piece of data. We calculate the effort to compromise the confidentiality of information dispersal as a weighted sum of the information content of the shares, where the weight is proportional to the number of shares that must be stolen to decode. This heuristic sets the confidentiality of information dispersal above that of replication, below that of secret sharing, and higher as m increases. We assume that $E_{BreakIn}$ is constant for all storage nodes (i.e., distinct attacks of equivalent difficulty are necessary to compromise each storage node). For this assumption to be true, storage nodes must be heterogeneous. Beyond this, we hypothesize that, as n increases, the likelihood that an attacker can find a storage node that they can compromise increases; as such, confidentiality should decrease as n increases. We do not currently consider n in the calculation of security. A more detailed security model can be implemented; however, these simple models capture the major features of the algorithms we are currently


... distinguished availability from the two other security characteristics because there is a trade-off between it and them. We focus our security axis on confidentiality, since there appears to be agreement in the community that cryptographic hashes are the right way to protect integrity, and we have found no contradictory evidence. The hashes can be generated from the cleartext or the ciphertext, and they can be stored with data shares, encoded into the name, or stored in the directory service. All of these increase integrity significantly, usually well beyond other effort levels.

Although effort terms are difficult to quantify, we believe that effort-based evaluation focuses the designer's attention on the right thing: raising the security bar "high enough." As with availability, the relative merits of schemes can be compared usefully even if the absolute effort quantities are inaccurate. Further, other security engineering research projects are now addressing the problem of measuring security and doing so in terms of effort [39]. As better effort models become available, they can be modularly inserted into the trade-off exploration approach.

3.3 Evaluating performance

As with availability, performance metrics and evaluation techniques of distributed systems are part of a mature body of prior work [26, 28]. The main issue faced in creating a general tool is balancing detail, which should yield greater accuracy, with generality in the performance models. We use a simple, abstract system model to predict the relative performance of different data distribution options. Like the general decentralized-storage model described in Section 2, we intend for it to represent a wide array of survivable storage system designs. The model includes three parts: CPU time for encode or decode in the intermediary software, network bandwidth for delivering the shares, and storage node response times.

CPU Time. The encode and decode operations involved with any data distribution scheme require CPU time. There are orders of magnitude differences between the CPU times for different schemes. For example, Figure 3 shows the CPU cost for encoding a 32 KB block for four p-m-n threshold schemes for all possible parameter selections with n ≤ 25. Notice the large differences between the shapes of the performance curves and the times at the same m and n for different schemes.

We have constructed and calibrated simple models for the following data distribution algorithms: ramp schemes, secret sharing, information dispersal, replication, decimation, splitting, short secret sharing, encryption, and hash algorithms. The models require CPU measurements of a few key primitives: polynomial interpolation of order m − 1, random number generation, triple-DES encryption, and MD5 hash generation. Given the requisite measurements, the models can predict the CPU time required for an encode or decode operation for any of the data distribution schemes with at least 90% accuracy.

Network Bandwidth. The amount of data that must pass between the client and the storage nodes depends directly on the data distribution scheme. For any of the p-m-n threshold schemes, the network bandwidth required depends on the read-write ratio and the values of p, m, and n. Each share's size can be computed as the original data size multiplied by 1/(m − (p − 1)). For a write, we assume that all n shares must be updated. For a read, we assume that only m shares are fetched by default.
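The share-size rule and the read and write assumptions above give the number of bytes each request puts on the network. The sketch below works this out; rounding a share up to a whole number of bytes is an assumption not stated in the text, and the function names are hypothetical.

```python
def share_size(block_bytes: int, p: int, m: int) -> int:
    """Bytes per share: block size times 1 / (m - (p - 1)), rounded up."""
    if not 1 <= p <= m:
        raise ValueError("need 1 <= p <= m")
    divisor = m - (p - 1)
    return -(-block_bytes // divisor)  # integer ceiling division

def write_bytes(block_bytes: int, p: int, m: int, n: int) -> int:
    """A write updates all n shares."""
    return n * share_size(block_bytes, p, m)

def read_bytes(block_bytes: int, p: int, m: int) -> int:
    """A read fetches only m shares by default."""
    return m * share_size(block_bytes, p, m)
```

For a 32 KB (32768-byte) block, a 2-3-5 ramp scheme produces 16384-byte shares, so a write moves 81920 bytes and a read moves 49152 bytes; 1-2-3 information dispersal writes 49152 bytes and reads exactly 32768.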

We model the network bandwidth with a distribution, indicating the probability of a given bandwidth during any period of time, assuming that a single bottleneck link determines the aggregate bandwidth. Although imprecise, we believe that this allows the first-order effects of concern to be captured, when combined with the storage node latencies discussed below. For a local-area network or dial-up client, we expect this bottleneck link to be at the client's network interface card. For a wide-area system, we expect this link to be at the edge router through which the client interacts with the wide-area network. Network congestion would appear in this very simple model as a reduction of available bandwidth (if consistent) or an increase in variability (if sporadic).

Storage Node Latency. A read or write request to a storage node involves work at the storage node in addition to moving data over the network. This is observed at the client as a response time latency that includes both delays, but accurate modeling requires separating these delays and recombining them appropriately. In particular, the time for an operation that simultaneously writes data to n nodes must capture both the shared bandwidth and concurrent latency aspects.

We model storage node latency as a random variable whose mean and variance depend on the operation


... n = 25 are shown for four threshold schemes: secret sharing (p = m), ramp scheme with p = ⌈m/2⌉, ramp scheme with p = 2, and information dispersal (p = 1).

on the latter necessary to model storage protocols (e.g., FTP or NT4's CIFS implementation) that open a new TCP/IP connection with slow-start for each file transferred. Competing loads on a storage node would appear in this simple model as increases in the mean and/or the variance.

Overall performance model. To predict the access latency for a given request, we combine these partial models as follows. For a write, ignoring variance and assuming homogeneous servers, we sum the modeled CPU time for encode, the storage node latency for writing one share to one server, and n times the network transmission time required per share. Thus, in sequence, the data is encoded, each share is sent over the network to its appropriate storage node, and all responses are awaited; the nth request is sent after the first n − 1 and it finishes after the average storage node latency. For a read, the same steps occur in reverse. With variance, the last request sent is not necessarily the last to finish, so a predicted latency for each request must be computed. Our tools use simulation to compute the expected performance for any given scheme. In our experience, the simulation is fast enough for our purposes, with each single-scheme prediction requiring only 10–30 ms. It is also accurate enough, yielding predictions that are within 15% of values measured on the prototype described in Section 4 for a variety of validation experiments.
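Ignoring variance, the write-latency sum described above reduces to a one-line formula. The sketch below is an illustration of that variance-free case only, not of the simulation the tools use; the read case assumes m shares are transferred, following the read assumption in the network bandwidth model, and all parameter names are hypothetical.

```python
def write_latency(
    cpu_encode_s: float,
    node_latency_s: float,
    share_bytes: int,
    n: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """Encode, send n shares over one bottleneck link, await the last reply."""
    transmit_per_share_s = share_bytes / bandwidth_bytes_per_s
    return cpu_encode_s + n * transmit_per_share_s + node_latency_s

def read_latency(
    cpu_decode_s: float,
    node_latency_s: float,
    share_bytes: int,
    m: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """The same steps in reverse, assuming only m shares are fetched."""
    transmit_per_share_s = share_bytes / bandwidth_bytes_per_s
    return node_latency_s + m * transmit_per_share_s + cpu_decode_s
```

For instance, with three 16384-byte shares, a 12.5 MB/s link, a 2 ms encode, and a 5 ms node latency (all illustrative inputs), the predicted write latency is 0.002 + 3 × 0.00131072 + 0.005 seconds, or roughly 10.9 ms.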

4 Prototype and Validation

We have implemented a prototype survivable storage system, PASIS [40, 48], which we use to validate our performance models of data distribution schemes. Our system fits the architecture illustrated in Figure 2, with the intermediary software executing on the client machines. The encode/decode and multi-read/multi-write aspects of the architecture are implemented in separate libraries. PASIS allows us to select from many data distribution algorithms and specify a wide range of parameters; this enables us to explore many schemes


The encode/decode library supports all of the basic schemes described in Section 2. The library interface exports two main functions: (1) a decode call that takes a scheme description and a set of shares as input and returns the original data as output, and (2) an encode call that does the reverse. The implementation builds on version 4.1 of the Crypto++ library [11], which provides code for encryption (triple-DES), hashing (MD5), and polynomial operations in a finite field. To enhance its performance for encoding/decoding bulk data (as opposed to key-like secrets), we replaced its stream-based internal functions with block-based versions and reduced the field size. The latter change enables the use of lookup tables to find multiplicative inverses and the product of two elements. Together, these changes yield 3–5× faster encode and decode times for the information dispersal, secret sharing, and ramp algorithms when compared to the original Crypto++ implementations.

The multi-read/multi-write library manages parallel communication with a set of storage nodes. The library interface exports two main functions: (1) a fetch call that takes a set of n share descriptions and a number of shares required (m), and (2) a store call that does the same for writes. Each share description includes a storage node description and a name for the share within that storage node. For the experiments described in this paper, these libraries were combined in a benchmark application that itself keeps track of per-block metadata.
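One way a fetch call of this shape might behave is sketched below. It is not the PASIS interface: each share description is modeled as a hypothetical zero-argument callable that either returns the share's bytes or raises, and the policy of requesting m shares first and falling back to the remaining descriptions on failure is an assumption consistent with the read model of Section 3.3.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Sequence

def fetch(readers: Sequence[Callable[[], bytes]], m: int) -> list[tuple[int, bytes]]:
    """Read m shares in parallel, trying further nodes when a read fails."""
    if not 1 <= m <= len(readers):
        raise ValueError("need 1 <= m <= number of share descriptions")
    remaining = iter(enumerate(readers))
    results: list[tuple[int, bytes]] = []
    with ThreadPoolExecutor(max_workers=m) as pool:
        in_flight = {}

        def submit_next() -> None:
            # Start a read against the next untried share description, if any.
            for index, read in remaining:
                in_flight[pool.submit(read)] = index
                return

        for _ in range(m):
            submit_next()
        while in_flight and len(results) < m:
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                index = in_flight.pop(future)
                try:
                    results.append((index, future.result()))
                except Exception:
                    submit_next()  # replace the failed read with another node
    if len(results) < m:
        raise RuntimeError("fewer than m shares could be retrieved")
    return results
```

The returned pairs carry the index of the share description alongside the share, since a decode call needs to know which shares it was given.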

The library currently supports the use of NFS storage nodes, CIFS storage nodes, and FTP storage nodes. By using these standard protocols, we are able to use independently-implemented, off-the-shelf storage nodes. This is an important practical design consideration that allows us to enhance storage node heterogeneity, which in turn enhances security; for example, the likelihood is low of finding a single attack that can compromise a Sun NFS server, a Network Appliance NFS server, and a SNAP NFS server. Without this feature, $\mathit{Effort}_{BreakIn}$ for the second through nth storage nodes could be zero.

We have also implemented an NFS proxy server that exports an NFSv2 interface to clients and uses the libraries above to encode and distribute the files across a set of storage nodes. To store the necessary metadata, each exported file or directory is represented by two actual sets of storage objects: one set that stores share descriptions for each file block and one set that stores the actual data. The share descriptions for the former are stored in the corresponding directory entries. We generally allow only the local machine to mount the NFS proxy server (over the loopback interface), thus allowing us to use survivable storage without changing the kernel. The downside, as expected, is lower performance relative to an in-kernel implementation.

4.2 Model Verification

We have used our prototype, PASIS, to verify that the simple performance model of Section 3 can capture the key performance trends of at least one real survivable storage system. To do so, we measured the necessary model parameters, configured the model accordingly, and then compared the model predictions to measured system performance. The model predictions of read and write throughput were compared with measurements on our testbed for all threshold scheme options up to n = 6. In almost all cases, the predictions are within 10% of the measurements, and the largest observed error was 15%. Section 5 shows that this level of accuracy is more than sufficient for proper data distribution scheme selection.

The testbed consisted of seven PC systems interconnected via a dedicated 100 Mbps switch. Each system contains a 600 MHz Intel Pentium III, 256 MB of RAM, a 3Com Fast Etherlink XL network interface card, and a Quantum Atlas 10K disk connected via an Adaptec ULTRA-2 SCSI controller. One system, acting as the client, runs Windows NT 4.0 with Service Pack 5. The other six systems, acting as CIFS storage nodes, run a SAMBA server on Linux Red Hat 6.2 with kernel version 2.2.14. Time measurements are based on the Pentium cycle counter, and all values are the mean of 20 measurements.

In addition to the overall validation, we have validated the libraries in isolation. The encode/decode model matches measured library performance to within 10% for all supported data distribution schemes up to n = 25. This was also verified on a 300 MHz Pentium II and a 500 MHz Pentium III, and the measured CPU times were observed to scale linearly with CPU performance. Likewise, the model predicts


We have built a model with a good balance between accuracy and sensitivity. This allows us to capture the large-scale effects in the trade-off space. Comprehension of such features in the trade-off space is necessary to make system-level design decisions. For large variations in system characteristics (i.e., distinct system configurations), large differences in the selection surface are observed. This supports our claim that systems with different characteristics require different schemes. Also, a specific system should, for performance reasons, use different schemes when operating in very different regions of the space (a region being an area of the availability-security plane). For small variations in the system configuration (i.e., slight inaccuracies in models and/or measurements), only small differences in the selection surface are observed. This means that the scheme selection surface is stable. Although it is inherently difficult to accurately characterize real systems, we are able to determine a close-to-optimal scheme choice for a given system because of the stability of the selection surface.

In the remainder of this section, the default configuration is defined, and our method of measurement is explained. A series of experiments is performed to investigate the sensitivity of the selection surface to small configuration changes, to large configuration changes, and to workload changes. Finally, the availability and security models are considered.

Default configuration. For small systems (e.g., with n ≤ 10), there are around 1000 schemes to consider. In the remainder of this section, we concentrate on a system with 10 storage nodes. All scheme selection graphs in this section are plotted with performance measured in 32 KB blocks per second, security measured in effort, and availability measured in "number of nines." Performance is calculated based on the CPU, network, and storage nodes as described in Section 4.2. The default value for the network parameter is 100 Mbps. Because no clear metric has emerged from research into estimating effort required to compromise security, we set the values for $E_{\text{BreakIn}}$ and $E_{\text{CircumventCryptography}}$ equal in the default model. The default value for the system workload is a read/write probability of 0.5 (an equal number of reads and writes). The default value for a single storage node's failure probability is $f_{\text{node}} = 0.005$.

The default configuration is the base case against which all other configurations are compared. The scheme selection surface for the default configuration is shown in Figure 1.
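To give a feel for the "number of nines" scale, the following Python sketch computes read availability for an m-of-n threshold scheme. It assumes that storage nodes fail independently with probability $f_{\text{node}}$ and that a read succeeds whenever at least m of the n nodes are reachable; the availability model actually used in this paper is defined in an earlier section and may differ, so the function names and the binomial model here are illustrative assumptions only.

```python
from math import comb, log10


def unavailability(m, n, f_node):
    """Probability that fewer than m of n independent nodes are up.

    Assumption: node failures are independent with probability f_node.
    Computing the failure probability directly (rather than 1 - availability)
    avoids losing precision when availability is very close to 1.
    """
    up = 1.0 - f_node
    return sum(comb(n, k) * up**k * f_node ** (n - k) for k in range(m))


def nines(m, n, f_node):
    """Availability expressed as a number of nines (requires m >= 1)."""
    return -log10(unavailability(m, n, f_node))


# Default failure probability from the text; the schemes are examples.
F_NODE = 0.005
single_copy = nines(1, 1, F_NODE)       # one node holding one copy
four_replicas = nines(1, 4, F_NODE)     # any 1 of 4 nodes suffices
dispersal_2_of_4 = nines(2, 4, F_NODE)  # any 2 of 4 nodes suffice
```

Under these assumptions, a single copy offers a little over two nines, while requiring more shares (larger m) for the same n lowers the number of nines.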

Measurements. To compare scheme selection surfaces, we use a metric with two components. The first component quantifies what percentage of the scheme choices change. The second component quantifies the magnitude of the performance cost associated with using the other surface's scheme at a given point. For example, we state such results as, "The selection surface differs over X% of the area with an average difference of Y%." When performing sensitivity analysis, we reverse the direction of the comparison (i.e., we examine the impact of inaccuracies in the assumed system configuration). This allows us to determine how accurate the configuration representation must be for the trade-off analysis to be useful. In some regions of the trade-off space, many schemes have similar performance over a range of configurations. In such regions, different schemes may be selected for different configurations, but with minimal performance impact. In other regions, significant performance costs are incurred if the wrong scheme is selected. To capture this interesting effect, we sometimes perform a comparison between two selection surfaces and only list the area and difference for regions that have changed by some minimum amount (i.e., we ignore differences of less than 10% to get a better measure of large-scale effects).
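One way to read this two-component metric is sketched below in Python. The representation is hypothetical: each surface is a dictionary from grid points on the availability-security plane to the selected scheme, both surfaces are assumed to cover the same points, and `perf` gives each scheme's throughput under the configuration being evaluated. Averaging the relative cost over only the differing points is also an assumption; the text does not spell out the averaging.

```python
def compare_surfaces(best, other, perf, min_diff=0.0):
    """Return (fraction of area that differs, average relative cost).

    best, other: dict mapping grid point -> selected scheme (hypothetical
                 representation of two selection surfaces over the same grid).
    perf: dict mapping scheme -> throughput under the configuration for
          which `best` is the correct surface.
    min_diff: ignore points whose relative cost is below this threshold
              (e.g. 0.10 to focus on large-scale effects).
    """
    costs = []
    for point, best_scheme in best.items():
        other_scheme = other[point]
        if other_scheme == best_scheme:
            continue
        # Relative throughput lost by using the other surface's scheme here.
        cost = (perf[best_scheme] - perf[other_scheme]) / perf[best_scheme]
        if cost >= min_diff:
            costs.append(cost)
    area = len(costs) / len(best)
    average = sum(costs) / len(costs) if costs else 0.0
    return area, average


# Toy example with made-up schemes and throughputs.
best = {(0, 0): "A", (0, 1): "A", (1, 0): "B", (1, 1): "B"}
other = {(0, 0): "A", (0, 1): "C", (1, 0): "B", (1, 1): "C"}
perf = {"A": 100.0, "B": 80.0, "C": 60.0}
area, average = compare_surfaces(best, other, perf)
```

In the toy example, half of the points select a different scheme; setting `min_diff` filters out points where the two schemes perform nearly the same.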

Another method of analyzing the trade-off space is used to understand the consumption of resources by the system. Specifically, we examine the ratio of time spent performing CPU operations to the time spent performing network I/O in order to ascertain what regions of the selection surface are bound by each resource. A scheme is considered to be bound by a resource if the resource accounts for greater than 80% of the overall performance time. The resource consumption for the default configuration is presented in Figure 4.
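The 80% rule can be expressed as a short Python sketch. It assumes, for illustration only, that the overall time is the sum of CPU time and network time; the function name and the three labels are hypothetical.

```python
def classify_resource_bound(cpu_time, network_time, threshold=0.8):
    """Label a scheme as CPU bound, network bound, or balanced.

    Assumption: overall time = cpu_time + network_time. A scheme is bound
    by a resource if that resource accounts for more than `threshold`
    (80% in the text) of the overall time.
    """
    total = cpu_time + network_time
    if total <= 0:
        raise ValueError("total time must be positive")
    if cpu_time / total > threshold:
        return "cpu-bound"
    if network_time / total > threshold:
        return "network-bound"
    return "balanced"
```

Any scheme for which neither resource exceeds the threshold falls into the balanced category.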

5.1 Sensitivity to the model

If the selection surface is highly sensitive to small changes in the configuration, then it might not be a useful design tool because it would require exact information about how a system will behave. To ensure that the selection surface is relatively stable, we examined its sensitivity to the system characteristics and workload


selected scheme is bound by the available network bandwidth. Dark grey indicates a region in which the selected scheme is bound by the CPU. In the default configuration, replication (which appears along the zero security line) is network bound. Farther down the security axis, schemes are CPU bound. There is a large region, comprised of information dispersal, replication with cryptography, ramp schemes, and short secret sharing, with balanced resource consumption.

factor of two in each direction. For the workload, we considered workloads of 40% reads as well as 60% reads. For the security model, we varied the value of $E_{\text{BreakIn}}$ by 10% in either direction relative to the value of $E_{\text{CircumventCryptography}}$. Considering these cases, the selection surface differs by less than 9% of the area with a difference of less than 9%. From this, we conclude that the selection surface is relatively stable. As well, this means that as long as system characteristics are modeled with moderate accuracy and the workload is roughly understood, the scheme selection surface can provide significant insight into the real configuration.

5.2 System characteristics

To investigate the impact that system characteristics have on proper scheme selection, we fix the CPU speed and vary the network speed. This explores system configurations representative of a wide range of distributed systems such as peer-based computing on a LAN or client-server architectures.

We begin by varying the default network speed by one order of magnitude (i.e., to 10 Mbps and 1 Gbps). Speeding up the network has little effect on the selection surface. The selection surface differs over less than 10% of the area with an average difference of 3%. Changes resulted from replication with cryptography, a computationally cheap and space-inefficient algorithm, becoming a better selection than information dispersal. Figure 5 shows the selection surface for the fast network. When the network bandwidth is reduced, the selection surface differs over 26% of the area with an average difference of 17%. Clearly, this change is substantial. However, some of this region consists of a "checkerboard," where the right choice bounces between two schemes with little performance difference. In this checkerboard region, the selection surface differs over 11% of the area with an average difference of 3%. In one part of the checkerboard region, short secret sharing and ramp schemes have similar performance. In another part, information dispersal schemes with parameters that make them more space-efficient are selected over less space-efficient ones. The remainder of the difference between the surfaces is due to 14% of the area with an average difference of 29%. In this region, which is along the availability axis, information dispersal dominates replication. Figure 6 shows the selection surface for the slow network.

An analysis of the resource consumption over the selection surfaces for the various network speeds considered shows that, for the fast network, most of the selection surface is CPU bound; for the default model, most of the surface is balanced; and for slow networks, most of the surface is network bound. Many conclusions can be drawn from the results of these experiments. First, scheme selection depends on the balance between the speed of the processor and the speed of the network. Second, in system configurations that include a slow network, the space-efficiency of a scheme is a significant indicator of its overall performance. Third, the


of the graph, because the network consumption of replication has a low performance cost on a fast network. Note: The fast network required doubling the performance axis scale.

Figure 6: Slow network (10 Mbps). Information dispersal dominates along the availability axis. Short secret sharing has similar performance to ramp schemes in the foreground of the graph.

CPU bound in the default configuration.

5.3 Impact of workload

To investigate the impact that workload has on scheme selection, the ratio of reads to writes was varied from a 100% read workload to a 100% write workload. Recall that the default workload is equal parts reads and writes. As shown in Table 2, this experiment found that scheme selection is insensitive to workload over a large portion of this range, 10% reads to 90% reads. However, at the end points of the range, there is substantial change in the selection surface. In particular, changing the workload can change the operating region of an algorithm (i.e., it changes the availability guarantee that can be made), a clear difference from modifying system characteristics.

The reason the selection surface is insensitive to the workload, except at extremes, is that the workload mainly changes system performance. As the read workload increases/decreases, the selection surface expands/contracts along the performance axis. The reason for this is that the read (decode) costs are lower than the write (encode) costs for most algorithms. Only at the ends of the workload range do the different


Workload    Area that differs    Average difference
0%          97%                  33%
5%          67%                  6%
20%         22%                  5%
40%         6%                   4%
60%         4%                   2%
80%         26%                  7%
95%         50%                  16%
100%        61%                  20%

Table 2: Impact of workload on selection surface. Scheme selection is insensitive to changes in the mid-range of workloads. Significant changes occur near the endpoints of the possible workloads.

mid-range workload. The conclusion of this experiment is that, for most read-write workloads, scheme selection can be done assuming a workload of 50% reads because the selection surface is stable over a large range. However, for write-once/read-many storage systems, the trade-off space differs radically.

The selection surfaces for the 100% read and 100% write workloads have some specific features that provide insight into the trade-off space. Figure 7 and Figure 8 show the selection surfaces for these workloads. Splitting dominates the low availability-high security region (the back plane of the graph) for the 100% read workload. In contrast, splitting dominates the diagonal that demarcates the edge of the operational region for the 100% write workload. For other workloads, splitting only occurs at the maximum security point. Splitting's encode operation is faster than secret sharing's; however, for any other workload, secret sharing dominates splitting because splitting has extremely poor read availability. Splitting's decode operation is extremely fast compared to its encode. This is why it is selected for the read workload: its encode performance does not penalize it.

This insight about splitting is applicable to the rest of the threshold schemes. For the 100% read workload, schemes are not penalized for poor write performance. The result of this is that ramp schemes dominate the region held by information dispersal for write-dominated workloads: ramp schemes with large n and low m are competitive with information dispersal schemes with mid-range n and similar m values because their poor write performance is not counted. Also, short secret sharing is shown to have little value for a read-only workload. Short secret sharing offers good security for a low encode cost. Again, the encode cost does not matter in a read-only workload; ramp schemes offer better read-only performance than short secret sharing.

Considering the resource consumption of the 100% read and 100% write workloads is also instructive. Figure 9 and Figure 10 demonstrate that the best performing schemes have a balanced utilization of resources when performing reads, whereas the CPU is the scarcest resource when encoding data for high security. The fact that security is CPU bound matches intuition: security is expensive, and system designers loathe paying the price. However, the fact that reading secure data is not as expensive, in terms of CPU cycles, is significant. Indeed, for workloads with >60% reads, the utilization of resources in the system is balanced deep into the security region.

5.4 Failure model

The failure probability for storage nodes, $f_{\text{node}}$, is an interesting parameter in our model. A total ordering of all schemes based on their availability can be performed without substituting a specific value for this parameter. Thus, the parameter only contracts or expands the surface in the availability dimension. For example, the surface in Figure 11, which is based on a failure probability 10× larger than the default value, has the exact same shape as the surface in Figure 1, except for the scaling along the availability axis. Thus, in some sense the selection surface does not depend on the failure probability: ordering along the availability axis is solely a function of n and m. Clearly though, an accurate understanding of the failure model of


Figure 8: 100% write workload.

5.5 Security model

In this section, we investigate what happens if one assumes circumventing cryptographic protection and compromising storage nodes require different amounts of effort. It is important to only consider relative changes in these experiments, since the effort metric only provides a relative ordering of schemes.

We performed experiments to see what happens when $E_{\text{CircumventCryptography}}$ is varied relative to $E_{\text{BreakIn}}$. Reducing $E_{\text{CircumventCryptography}}$ by an order of magnitude relative to $E_{\text{BreakIn}}$ has little effect on the selection surface. It differs over less than 4% of the area with an average difference of 9%. Thus, the effort valuation used to generate a total order for schemes along the security axis gives less weight to cryptographic confidentiality than to information-theoretic confidentiality. The results are much different when $E_{\text{CircumventCryptography}}$ is increased by an order of magnitude relative to $E_{\text{BreakIn}}$. Replication with cryptography dominates almost the entirety of the security-availability plane. This is because the maximum possible value of n is 10 in our default model. A threshold scheme can only achieve security of $10 \times E_{\text{BreakIn}}$, which is one order of magnitude. To gain more insight into how the selection surface changes, we increased $E_{\text{CircumventCryptography}}$ by a factor of 2.5 relative to $E_{\text{BreakIn}}$. Figure 12 presents the resulting selection surface. Replication with cryptography dominates a very large region of the scheme selection surface. As well, the region in which short secret sharing dominates ramp schemes grows significantly. Indeed, the


consumption of network and CPU resources when performing reads.

Figure 10: Balance of resources for a 100% write workload. The cost of security is entirely in the write (encode) operation. As well, the level of security is bound by the CPU.

Even though the numbers assigned to the security costs have an arbitrary nature, it is interesting to consider how the value placed in cryptographic techniques can change depending on the amount of time confidentiality must be maintained. A goal of security engineering is to expend just enough effort (i.e., minimize performance degradation) to achieve the security goal. For systems with a short confidentiality window, the weighting of $E_{\text{CircumventCryptography}}$ relative to $E_{\text{BreakIn}}$ should be higher. Effectively, this assumes a bound on the time that an attacker can utilize to crack the cipher, steal keys, or guess keys. On the other hand, for information whose confidentiality is a long-term priority, the valuation of cryptographic techniques must be considered carefully.

5.6 Analysis

Although the trade-off space is complex, we expected specific algorithms to dominate certain regions of the availability-security plane. Within a region, we expected the parameters to have a large impact on the performance achieved. Our investigation shows that our original insights were basically correct. However, some transitions between regions are abrupt, indicating that a large performance difference occurs for a small change in security or availability guarantees. For example, the transition from replication to information


a factor of two along the availability axis. Remember, the scale of the availability axis is the number of nines. An order of magnitude increase in the probability of failure halves the number of nines of availability.

Figure 12: Placing higher value on cryptography than node security. If cryptography provides better confidentiality than the security of the storage nodes, replication with cryptography is a very effective scheme. As well, short secret sharing, which combines cryptography with information dispersal, dominates ramp schemes deep into the region of high security.

have this property. This contradicts our original intuition that the large number of schemes would result in a selection surface that is smooth. If a system is operating near a sharp performance transition, the system requirements must be considered and an engineering decision must be made as to whether to err in favor of performance, availability, or security.

Sharp transitions tend to be localized in the back corner of the trade-off space (low availability and low security). The sharpness is due to the data redundancy of threshold schemes. Share size is calculated as $1/(m - (p - 1))$, and n shares are generated by a threshold scheme. Consider a 1-1-4 threshold scheme (4-fold replication) and a 1-2-4 threshold scheme (an information dispersal scheme). By increasing m from 1 to 2, the data redundancy is halved (i.e., storage space consumption is the same as 2-fold replication). On a read, both schemes require the same amount of data (one replica versus two shares, each half the size of the replica). The abrupt performance difference on the selection surface is due to write operations, which depend on the amount of data that must be sent across the network.
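The arithmetic in this example can be checked with a few lines of Python. The sketch below expresses all sizes as fractions of the original data size; the function names are hypothetical, and the formulas are just the share-size expression above multiplied by the number of shares written (n) or read (m).

```python
from fractions import Fraction


def share_size(p, m):
    """Size of one share of a p-m-n scheme, relative to the original data."""
    return Fraction(1, m - (p - 1))


def write_volume(p, m, n):
    """Data stored (and sent on a write): n shares."""
    return n * share_size(p, m)


def read_volume(p, m):
    """Data fetched on a read: m shares."""
    return m * share_size(p, m)


# The two schemes compared in the text.
replication = (1, 1, 4)  # 4-fold replication
dispersal = (1, 2, 4)    # information dispersal

replication_write = write_volume(*replication)
dispersal_write = write_volume(*dispersal)
replication_read = read_volume(1, 1)
dispersal_read = read_volume(1, 2)
```

For these two schemes, the read volumes are equal while the write volume drops from four times the data size to two times, which is the source of the abrupt transition described above.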

Another feature of the selection surface occurs when two algorithms perform similarly; the result is a "checkerboard" on the selection surface. In such cases, it is informative to consider the resource consumption


Another aspect of the space that interests us is the performance difference between algorithms. Our intuition was that limiting the set of potential algorithms would often result in poor performance. Our intuition was correct: each algorithm tends to be good for only some region of the availability-security plane. If the system being considered does not operate within the algorithm's "good" region, the system incurs an unnecessary performance penalty. Indeed, limiting the set of algorithms reduces the region of the availability-security plane in which the system can operate. Constraining an algorithm to a specific set of parameters (i.e., building a system around a single scheme) fixes the availability and security that can be offered. Moreover, unless thorough analysis of the trade-off space was done as part of the system design, it is unlikely that the scheme selected is well matched to the system being built. Building a system around a single scheme is only a reasonable design decision if the system is to operate in a single region and the scheme used is on the selection surface in that region.

The highest degree of availability is always achieved by a replication algorithm (because only replication can handle n − 1 failures), and the highest degree of security is always achieved by splitting (because only splitting can handle n − 1 compromises).

After some level of security, secret sharing dominates only the edge region that demarcates the reachable portion of the availability-security plane. This is because for an $m_1$-$m_1$-$n_1$ secret sharing scheme, there is usually a $p_2$-$m_2$-$n_2$ ramp scheme with $p_2 = m_1$, $m_2 > m_1$, and $n_2 > n_1$ which provides, by definition, the same security guarantee, provides adequate availability, but has much better performance. The reason for this is that the space savings of ramp schemes with m > p results in fewer computations being required, and thus better performance (i.e., since shares are smaller in size, fewer calculations are required to generate each share). Thus, if secret sharing is the only algorithm that offers the security and availability required, increasing the number of storage nodes (i.e., the maximum value n may take) will produce a better performing ramp scheme (>2× improvement) that meets the availability and security requirements. Clearly, this can only be done if there is not a hard design constraint on the number of storage nodes in the system.
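As an illustration of this argument, the Python sketch below enumerates the ramp schemes that become candidates once more storage nodes are allowed. It uses the share-size expression $1/(m - (p - 1))$ from the analysis above. The availability filter, which keeps only schemes that tolerate at least as many node failures as the original secret sharing scheme, is a simplifying assumption standing in for "adequate availability"; the function name is hypothetical.

```python
from fractions import Fraction


def ramp_alternatives(m1, n1, max_n):
    """List ramp schemes that could replace an m1-m1-n1 secret sharing scheme.

    Candidates have p2 = m1 (same security guarantee), m2 > m1, and
    n1 < n2 <= max_n. Assumption for illustration: availability is
    "adequate" if the ramp scheme tolerates at least as many unavailable
    nodes (n2 - m2) as the original scheme (n1 - m1).
    Returns tuples (p2, m2, n2, share_size), share size relative to the data.
    """
    candidates = []
    for n2 in range(n1 + 1, max_n + 1):
        for m2 in range(m1 + 1, n2 + 1):
            if n2 - m2 < n1 - m1:
                continue
            size = Fraction(1, m2 - (m1 - 1))
            candidates.append((m1, m2, n2, size))
    return candidates


# Example: a 2-2-4 secret sharing scheme when up to 6 nodes are available.
options = ramp_alternatives(2, 4, 6)
```

Every candidate has a share size of at most one half of the data size, compared with full-size shares for secret sharing, which is consistent with the improvement of more than 2× noted above.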

6 Survivable Storage Projects

Survivable storage systems are an active area of current research. This section describes six current and past survivable storage projects, focusing on their system models, target workloads, and data distribution schemes. For each, design insights resulting from our exploration of the trade-off space are noted.

Delta-4. The Delta-4 system [12, 16] consists of clients, security servers, and data storage servers. The data distribution scheme employed is referred to as fragmentation, redundancy and scattering. Data is decimated into fragments, each fragment is encrypted (with a chained cipher so that fragments must be decoded in a certain order), and the encrypted fragments are replicated. Each data storage server is sent all of the fragments and uses a pseudo-random algorithm to decide which fragments to store. The names of fragments are self-verifying but give no information about the fragment's contents. To access a piece of data, a client must get authorization from a set of storage servers. The storage servers run an agreement protocol to provide integrity. The hash, which enables a client to determine the names of the fragments it wants to read, is stored using secret sharing on the security servers. Both the integrity and confidentiality of the data stored in Delta-4 hinge on the security servers adequately protecting fragments' meta-data. Since CPU cycles are not as scarce as they were fifteen years ago, cryptography coupled with information dispersal should be used in a Delta-4 type system rather than replication. Indeed, since all shares are sent to all storage nodes, space-efficiency of encoding is paramount to reducing bandwidth consumption.

Publius. Publius [47] is strongly influenced by Anderson's Eternity Service proposal [2], which argues for survivable storage plus anonymity. Publius uses encrypted replicas for data distribution. The key used for encryption is encoded with secret sharing, and a single share of the key is stored with each replica. This algorithm is very similar to short secret sharing, except that replication rather than information dispersal is used to encode the ciphertext. No indication is given as to the specific parameters selected for the data distribution scheme. However, the authors do identify parameter selection as a difficult task, citing the desire to balance resistance to censorship and performance. The availability of data in a Publius system is limited by the secret sharing of the encryption key. Thus, the extra space consumed by replication, over information


utilizing a better scheme.

Intermemory. The Intermemory project [10, 22] focuses on archival storage of public data (i.e., write-once, read-many data with high availability and integrity requirements over time). Intermemory's goal is to provide extraordinarily high availability while minimizing storage consumption and the expected cost of reading data. Deep encoding is the data distribution scheme used in Intermemory. Deep encoding involves storing a single copy of a piece of data, then applying a threshold scheme (Tornado schemes in this case [35]) to the piece of data, distributing the shares, and then having the storage nodes recursively apply the same threshold scheme to each share they receive. Deep encoding is performed to some depth level (e.g., deep encoding of depth three stores a replica, n shares, and $n^2$ subsequent shares of shares). Intermemory uses a 1-16-32 threshold scheme applied to depth level three. Public-key cryptography is discussed as a means to provide long-term integrity of stored data in the Intermemory system. The Intermemory project has selected an excellent scheme that provides high availability with reasonable expected read performance and very low storage space overhead. However, integrity is explicitly stated as a goal of the system. An attacker need only compromise a single replica to defeat the integrity of the stored data. Adding a hash to the encoding algorithm and removing the first depth level of deep encoding would greatly increase the integrity of the Intermemory system at the expense of increasing the expected cost of performing a read.

Oceanstore. Oceanstore [31] is envisioned as a wide-area data utility. It intends to leverage excess storage capacity over a wide area to provide survivable storage of data accessible from anywhere. Oceanstore stores two types of data: active data (which can be read and written) and archival data (which is write-once, read-many). Active data is stored using encrypted replicas. Replicas have self-verifying names to provide integrity. Agreement algorithms are used to manage authorization to data. Archival data is stored using deep encoding as in Intermemory. Oceanstore is addressing many other issues pertaining to building a data storage utility of global scale, including client mobility, robust naming, data migration, and replica discovery. Analyzing the trade-offs amongst data distribution schemes would enable Oceanstore to clearly understand the different characteristics of the two distinct data distribution schemes it employs. Indeed, such an analysis is necessary for them to determine at which workload data should switch from being active to archival.

Farsite. The Farsite project [7] is exploring peer-based storage in a local-area setting. Farsite is attempting to leverage the idle cycles and unused storage capacity of desktop workstations, which inherently have poor availability characteristics, to provide a survivable storage system. Farsite uses encrypted replication with self-verifying names. A Byzantine agreement protocol amongst client machines storing meta-data is used to guarantee the integrity of the directory service (and subsequently provide the self-verifying nature of the replica names). To limit space consumption, single instance storage [6], in conjunction with convergent encryption, is used to ensure that only n replicas of any similar piece of data are stored system-wide. Since Farsite explicitly states concerns about storage space consumption, information dispersal rather than replication should probably be used. With relatively small CPU costs, 2–4× reductions in space and bandwidth utilization are available. Further, both single instance storage and convergent encryption can still be used with the shares generated by information dispersal.

e-Vault. The e-Vault system [18, 25] is a survivable storage system that uses either information dispersal or short secret sharing, depending on whether or not strong confidentiality is desired. Distributed fingerprints [29] are used for integrity. The system is designed for a local-area setting in which clients interact with storage servers via a "gateway" (a designated storage server). Performance numbers are given for the 1-2-3 information dispersal scheme [25], since the test platform is limited to three storage servers. This is a very limited system in which to consider information dispersal schemes. A system with just a few more storage nodes has significantly better performance and availability characteristics.

PASIS. Our prototype system, PASIS [40, 48], is described in Section 4. PASIS provides an implementation of a survivable storage system against which we validate our performance models. The main design goal of PASIS is to be flexible enough for us to investigate many different approaches to building survivable storage systems. Our implementation of encode/decode functionality is separate from our multi-read/multi-write functionality. The encode/decode library provides threshold algorithms, cryptographic algorithms, and hybrid algorithms, all of which can be instantiated over a wide range of parameters. The library is structured so as to facilitate the addition of data distribution algorithms and the composition of hybrid algorithms. The multi-read/multi-write library supports NFS, CIFS, and FTP storage nodes and can be extended to support


7 Summary

This paper illustrates the complex trade-off space associated with selecting the right data distribution scheme for a survivable storage system. A reasoned approach to exploring this trade-off space is developed and used to explore the space's sensitivity to various system and workload characteristics. Although further work is needed to refine the models, we believe that this paper takes a big step in the right direction: away from ad hoc selections and towards informed engineering decisions.

This work was done in the context of the PASIS survivable storage project at Carnegie Mellon University. Additional information on our work and results can be found on the project web site [40].

8 Acknowledgements

The authors would like to thank John D. Strunk, Garth R. Goodson, and Theodore M. Wong for extensive discussion and feedback on the directions and goals of this research. We would also like to thank the many other graduate students in the Parallel Data Lab who provided feedback on rough drafts of this paper. Finally, we would like to thank Matt Vestal for administering the cluster of machines on which we ran experiments.

References

[1] Darrell C. Anderson, Jeffrey S. Chase, and Amin M. Vahdat. Interposed request routing for scalable network storage. Symposium on Operating Systems Design and Implementation (San Diego, CA, 22–25 October 2000), 2000.

[2] Ross J. Anderson. The Eternity Service. PRAGOCRYPT, pages 242–253. CTU Publishing, 1996.

[3] Thomas E. Anderson, Michael D. Dahlin, Jeanna M. Neefe, David A. Patterson, Drew S. Roselli, and Randolph Y. Wang. Serverless network file systems. ACM Transactions on Computer Systems, 14(1):41–79, February 1996.

[4] G. R. Blakley. Safeguarding cryptographic keys. AFIPS National Computer Conference (New York, NY, 4–7 June 1979), pages 313–317. AFIPS, 1979.

[5] G. R. Blakley and Catherine Meadows. Security of ramp schemes. Advances in Cryptology - CRYPTO, pages 242–268. Springer-Verlag, 1985.

[6] William J. Bolosky, Scott Corbin, David Goebel, and John R. Douceur. Single Instance Storage in Windows 2000. USENIX Windows Systems Symposium (Seattle, WA, 3–4 August 2000), pages 13–24. USENIX Association, 2000.

[7] William J. Bolosky, John R. Douceur, David Ely, and Marvin Theimer. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (Santa Clara, CA, 17–21 June 2000). Published as Performance Evaluation Review, 28(1):34–43. ACM, 2000.

[8] Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance. Symposium on Operating Systems Design and Implementation (New Orleans, LA, 22–25 February 1999), pages 173–186. ACM, 1998.

[9] Miguel Castro and Barbara Liskov. Proactive recovery in a Byzantine-fault-tolerant system. Symposium on Operating Systems Design and Implementation (San Diego, CA, 23–25 October 2000), pages 273–287. USENIX Association, 2000.

[10] Yuan Chen, Jan Edler, Andrew Goldberg, Allan Gottlieb, Sumeet Sobti, and Peter Yianilos. A prototype implementation of archival Intermemory. ACM Conference on Digital Libraries (Berkeley, CA, 11–14 August 1999), pages 28–37. ACM, 1999.

[11] Wei Dai. Crypto++ reference manual. http://cryptopp.sourceforge.net/docs/ref/.

[12] Yves Deswarte, L. Blain, and Jean-Charles Fabre. Intrusion tolerance in distributed computing systems. IEEE Symposium on Security and Privacy (Oakland, CA, 20–22 May 1991), pages 110–121, 1991.

[15] J. Fraga and D. Powell. A fault and intrusion-tolerant file system. IFIP International Conference on Computer Security, pages 203–218. Elsevier, 1985.

[16] Jean-Michel Fray, Yves Deswarte, and David Powell. Intrusion-tolerance using fine-grain fragmentation-scattering. IEEE Symposium on Security and Privacy (Oakland, CA, 7–9 April 1986), pages 194–201. IEEE, 1986.

[17] Free Haven. http://www.freehaven.net/.

[18] Juan A. Garay, Rosario Gennaro, Charanjit Jutla, and Tal Rabin. Secure distributed storage and retrieval. Theoretical Computer Science, 243(1–2):363–389. Elsevier, September 2000.

[19] Hector Garcia-Molina and Daniel Barbara. How to assign votes in a distributed system. Journal of the ACM, 32(4):841–855, October 1985.

[20] Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, and Jim Zelenka. A cost-effective, high-bandwidth storage architecture. Architectural Support for Programming Languages and Operating Systems (San Jose, CA, 3–7 October 1998). Published as SIGPLAN Notices, 33(11):92–103, November 1998.

[21] D. K. Gifford. Violet: an experimental decentralized system. Technical report CSL–79–12. Xerox Palo Alto Research Center, CA, November 1979.

[22] Andrew V. Goldberg and Peter N. Yianilos. Towards an archival intermemory. IEEE International Forum on Research and Technology Advances in Digital Libraries (Santa Barbara, CA, 22–24 April 1998), pages 147–156. IEEE, 1998.

[23] Jim Gray and Daniel P. Siewiorek. High-availability computer systems. IEEE Computer, 24(9):39–48, September 1991.

[24] Intermemory. http://www.intermemory.org/.

[25] A. Iyengar, R. Cahn, C. Jutla, and J. A. Garay. Design and implementation of a secure distributed data repository. IFIP International Information Security Conference (Vienna, Austria, and Budapest, Hungary, 31 August–4 September 1998), pages 123–135. ACM, 1998.

[26] Raj Jain. The art of computer systems performance analysis. John Wiley & Sons, 1991.

[27] Ehud D. Karnin, Jonathan W. Greene, and Martin E. Hellman. On secret sharing systems. IEEE Transactions on Information Theory, IT-29(1):35–41. IEEE, January 1983.

[28] Leonard Kleinrock. Queueing systems. John Wiley and Sons, 1973.

[29] Hugo Krawczyk. Distributed fingerprints and secure information dispersal. ACM Symposium on Principles of Distributed Computing (Ithaca, NY, 15–18 August 1993), pages 207–218, 1993.

[30] Hugo Krawczyk. Secret sharing made short. Advances in Cryptology - CRYPTO (Santa Barbara, CA, 22–26 August 1993), pages 136–146. Springer-Verlag, 1994.

[31] John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao. OceanStore: an architecture for global-scale persistent storage. Architectural Support for Programming Languages and Operating Systems (Cambridge, MA, 12–15 November 2000). Published as Operating Systems Review, 34(5):190–201, 2000.

[32] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401. ACM, July 1982.

[33] Edward K. Lee and Chandramohan A. Thekkath. Petal: distributed virtual disks. Architectural Support for Programming Languages and Operating Systems (Cambridge, MA, 1–5 October 1996). Published as SIGPLAN Notices, 31(9):84–92, 1996.

[34] Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, and Michael Williams. Replication in the Harp file system. ACM Symposium on Operating System Principles (Pacific Grove, CA, 13–16 October 1991). Published as Operating Systems Review, 25(5):226–238, 1991.

[35] Michael Luby, Michael Mitzenmacher, Mohammad Amin Shokrollahi, and Daniel A. Spielman. Analysis of low density codes and improved designs using irregular graphs. ACM Symposium on Theory of Computing (Dallas, TX, 23–26 May 1998), pages 249–258. ACM, 1998.

[36] Dahlia Malkhi and Michael K. Reiter. Secure and scalable replication in Phalanx. IEEE Symposium on Reliable

[38] Oceanstore. http://oceanstore.cs.berkeley.edu/.

[39] Rodolphe Ortalo, Yves Deswarte, and Mohamed Kaaniche. Experimenting with quantitative evaluation tools for monitoring operational security. IEEE Transactions on Software Engineering, 25(5):633–650. IEEE, September 1999.

[40] PASIS. http://www.ices.cmu.edu/pasis/.

[41] Publius. http://cs1.cs.nyu.edu/waldman/publius/.

[42] Michael O. Rabin. Efficient dispersal of information for security, load balancing, and fault tolerance. Journal of the ACM, 36(2):335–348. ACM, April 1989.

[43] Adi Shamir. How to share a secret. Communications of the ACM, 22(11):612–613. ACM, November 1979.

[44] Daniel P. Siewiorek and Robert S. Swarz. Reliable computer systems: design and evaluation. Digital Press, Second edition, 1992.

[45] R. H. Thomas. A majority consensus approach to concurrency control. ACM Transactions on Database Systems, 4:180–209, 1979.

[46] Martin Tompa and Heather Woll. How to share a secret with cheaters. Journal of Cryptology, 1(2):133–138, 1988.

[47] Marc Waldman, Aviel D. Rubin, and Lorrie Faith Cranor. Publius: a robust, tamper-evident, censorship-resistant, web publishing system. USENIX Security Symposium (Denver, CO, 14–17 August 2000), pages 59–72. USENIX Association, 2000.

[48] Jay J. Wylie, Michael W. Bigrigg, John D. Strunk, Gregory R. Ganger, Han Kiliccote, and Pradeep K. Khosla.
