P2P-based Storage Systems



Dr.-Ing. Kalman Graffi

Faculty for Electrical Engineering, Computer Science at the University of Paderborn

P2P-based Storage Systems

Peer-to-Peer Networks and Applications

This slide set is based on the lecture “Peer-to-Peer Networks” of Prof. Dr. Jussi Kangasharju at the University of Helsinki


1 P2P Filesystems

1.1 P2P Filesystems: Basic Techniques

1.2 Maintaining redundancy

2 Three Examples of P2P Filesystems

2.1 CFS – Cooperative File System

2.2 OceanStore

2.3 Ivy


Using DHTs to build more complex systems

How can DHTs help?

Which problems do DHTs solve?

What problems are left unsolved?

P2P storage basics, with examples

Splitting into blocks (CFS)

Wide-scale replication (OceanStore)

Modifiable filesystem with logs (Ivy)

Future of P2P filesystems


Recall: DHT maps keys to values

Applications built on DHTs must actually need this key-to-value functionality

Or: Must be designed in this way!

Possible to design an application in several ways

Keys and values are application specific

For filesystem: Value = file

For email: Value = email message

For distributed DB: Value = contents of entry, etc.

For distributed data structures:

• Value = end item / file

• Or value = structured set of further keys

Application stores values in DHT and uses them

Simple, but a powerful tool
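The put/get abstraction described above can be sketched in a few lines. This toy version keeps the table in a single process as a stand-in for a real DHT; all names are illustrative:

```python
import hashlib

class LocalDHT:
    """Toy stand-in for a DHT: a plain hash table in one process.
    A real DHT (Chord, Pastry, ...) spreads this table over many peers."""
    def __init__(self):
        self._table = {}

    def _key_id(self, key):
        # Keys are hashed into a flat identifier space, as in a real DHT.
        return hashlib.sha1(key.encode()).hexdigest()

    def put(self, key, value):
        self._table[self._key_id(key)] = value

    def get(self, key):
        return self._table.get(self._key_id(key))

dht = LocalDHT()
dht.put("/alice/notes.txt", b"file contents")   # filesystem: value = file
assert dht.get("/alice/notes.txt") == b"file contents"
```

The application chooses what keys and values mean; the DHT only maps one to the other.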


DHT solves

The problem of mapping keys to values in the distributed hash table


Efficient storage and retrieval of values / objects

Efficient routing

• Robust against many failures

• Efficient in terms of network usage

Provides hash table-like abstraction to application


Everything else except what is on previous slide…

In particular, the following problems


• DHT is: Write-Once-Read-Many

• No support for (multi-user) value modification, deletion


• No guarantees against big failures

• Threat models for DHTs not well-understood yet


• Data not guaranteed to be available

• Only probabilistic guarantees (but possible to get high prob.)


• No support for consistency (= same view on object for everyone)

• Data in DHT often highly replicated, consistency is a problem

Version management

• No support for version management

• Might be possible to support this to some degree


P2P filesystems (FS) or P2P storage systems

File storage and retrieval

First applications of DHTs

Fundamental principle:

Key = filename, Value = file contents

Different kinds of systems

Storage for read-only objects

Read-write support

Stand-alone storage systems

Systems with links to standard filesystems (e.g., NFS)

P2P Filesystems


The only examples of P2P filesystems come from research

Research prototypes exist for many systems

No wide-area deployment

Experiments on research testbeds

No examples of real deployment and real usage in wide-area

After initial work, no recent advances?

At least, not visible advances

Exception: new p2p applications (for social networks)

Three examples:

Cooperative File System, CFS

OceanStore

Ivy



Why build P2P filesystems?

Light-weight, reliable, wide-area storage

At least in principle…

Distributed filesystems not widely deployed either…

Were already studied a long time ago

Principles used in cloud computing

Gain experience with DHT and how DHTs could be used

in real applications

DHT abstraction is powerful, but it has limitations

Understanding of the limitations is valuable


Design Space of P2P Storage


Fundamental basic techniques for building distributed storage


Splitting files into blocks

Replicating files (or blocks!)

Using logs to allow modifications

Search in unstructured overlays

For now: Simple analysis of advantages and disadvantages, and three examples

P2P Filesystems: Basic Techniques



Files are of different sizes and peers storing large files have to serve more data


Load balancing can be achieved by

• Dividing files into equal-sized blocks

• Storing blocks on different peers

If different files share blocks, we can save on storage


Instead of needing one peer to be online, all peers holding the file’s blocks must be online

Need metadata about blocks to be stored somewhere

Granularity tradeoff:

• Small blocks → good load balance, lots of overhead

• Large blocks → bad load balance, small overhead
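The splitting technique can be sketched as follows; the block size and names are illustrative assumptions, and addressing blocks by content hash (as CFS does) makes identical blocks deduplicate for free:

```python
import hashlib

BLOCK_SIZE = 8192  # granularity tradeoff: smaller -> better balance, more metadata

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file into fixed-size blocks, each addressed by its content
    hash. Returns the ordered list of block keys (the metadata) and the
    key -> block map that would be spread over different peers."""
    blocks, keys = {}, []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        key = hashlib.sha1(block).hexdigest()
        blocks[key] = block        # identical blocks are stored only once
        keys.append(key)
    return keys, blocks

keys, blocks = split_into_blocks(b"A" * 20000)
assert len(keys) == 3                       # 8192 + 8192 + 3616 bytes
assert len(blocks) == 2                     # the two full blocks deduplicate
assert b"".join(blocks[k] for k in keys) == b"A" * 20000
```

The `keys` list is exactly the metadata that must itself be stored somewhere, as the slide notes.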



If file (or block) is stored only on one peer and that peer is offline, data is not available

Replicating content to multiple peers significantly increases content availability


High availability and reliability

But only probabilistic guarantees


How to coordinate lots of replicas?

• Especially important if content can change (or can be deleted)

Good availability in unreliable networks requires high degree of replication

• “Wastes” storage space


Excursion: Erasure Codes


A class of erasure codes with the property that a potentially limitless sequence of encoding symbols can be generated from a given set of source symbols

such that the original source symbols can be recovered

• from any subset of the encoding symbols of size equal to or only slightly larger than the number of source symbols


Digital fountains, Online codes, …

Implications for P2P storage

From a maintenance perspective: a hybrid of replication and traditional erasure codes
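As a minimal illustration of erasure coding in general (not the rateless codes named above), a single XOR parity block already tolerates the loss of any one block; real systems use stronger codes (Reed-Solomon, Tornado) tolerating n − k losses:

```python
def encode_with_parity(blocks):
    """(k+1, k) erasure code: k equal-sized data blocks plus one XOR
    parity block. Any single lost block can be rebuilt."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return blocks + [parity]

def recover_missing(coded, missing_index):
    """Rebuild the single missing block by XOR-ing all survivors."""
    rebuilt = bytes(len(coded[0]))
    for i, b in enumerate(coded):
        if i != missing_index:
            rebuilt = bytes(x ^ y for x, y in zip(rebuilt, b))
    return rebuilt

data = [b"abcd", b"efgh", b"ijkl"]      # k = 3 source blocks
coded = encode_with_parity(data)         # n = 4 encoded blocks
assert recover_missing(coded, 1) == b"efgh"
```

This is only a sketch: with one parity block the storage efficiency is k/(k+1) and only one erasure is tolerated.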

Excursion: Rateless Erasure Codes




Replication (r copies)

Faults that can be tolerated: r − 1

Probability of failure: f^r

Storage efficiency: 1/r

Access: Find any one good replica

Erasure coded (k of n)

Faults that can be tolerated: n − k

Probability of failure: Σ_{i=0}^{k−1} C(n, i) · (1 − f)^i · f^(n−i)

Storage efficiency: k/n

Access: Find k good blocks

Assume: Peer failure is i.i.d. with failure probability f

Excursion: Static Resilience


Excursion: Static Resilience

Churn rate, f=0.25


Replication:

• r = 5: Availability ~ 0.73 → too low

• r = 24: Availability > 0.999 → too much overhead!

Erasure code

n = 517, k = 100

• Availability ~ 0.999

• Overhead = n/k = 5.17
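These numbers can be sanity-checked with the standard i.i.d. availability formulas. One assumption is needed: the quoted values (~0.73 for r = 5, ~0.999 for r = 24) are roughly reproduced if each peer is independently online with probability p = 0.25; the slide’s exact model may differ:

```python
from math import comb

def replication_availability(p, r):
    """Available iff at least one of the r replicas is on an online peer."""
    return 1 - (1 - p) ** r

def erasure_availability(p, n, k):
    """Available iff at least k of the n coded blocks are on online peers."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p = 0.25  # assumed per-peer online probability (see lead-in)
print(replication_availability(p, 5))     # ~0.76, near the slide's ~0.73
print(replication_availability(p, 24))    # ~0.999
print(erasure_availability(p, 517, 100))  # close to 0.999; overhead 517/100 = 5.17
```

The point survives any modeling detail: erasure coding reaches replication-level availability at a fraction of the storage overhead.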



If we want to change the stored files, we need to modify every stored replica

Keep a log for every file (user, …)

• Log gives information about the latest version


Changes concentrated in one place

Anyone can figure out what is the latest version


How to keep the log available? By replicating it? ;-)



Distributed data structures are harmed by churn

Idea: skip structure, have clever search mechanisms


Easy to set up and implement

Good in small scenarios

• Example: former p2p collaboration tool Groove (up to 10 users)


Large search overhead comes with full retrievability

• Makes the solution unscalable

Too simple for many applications


Why maintain redundancy?


Naïve solution (Eager maintenance)

Probe (e.g., periodically) and replace “lost” redundancy

How do we know whether a peer left permanently?

A heuristic timeout T might help

Avoids re-replication when a peer is only offline for a short time

Maintaining redundancy


Basic idea

After probing for all blocks,

if the number of available blocks (a) is below a repair threshold (t), then replace all missing blocks (deterministic lazy repair)

If originally n blocks were being stored, then n − a new blocks are regenerated

t is a design parameter

If (n, k) erasure coding is being used, k < t
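The deterministic lazy-repair decision can be sketched as follows, using a for the available-block count, t for the repair threshold, and n/k for the erasure-code parameters (the naming is an assumption):

```python
def lazy_repair(available, n, t, k):
    """Deterministic lazy repair: only when fewer than t of the n blocks
    are still available, regenerate ALL missing blocks at once.
    The threshold must exceed k, else the object may already be lost."""
    assert k < t <= n, "repair threshold must satisfy k < t <= n"
    if available < t:
        return n - available   # number of blocks to regenerate (a burst)
    return 0                   # still healthy enough: do nothing

assert lazy_repair(available=9, n=12, t=8, k=6) == 0   # above threshold
assert lazy_repair(available=7, n=12, t=8, k=6) == 5   # repair burst: peaky bandwidth
```

The second call shows why bandwidth usage is "peaky": nothing happens for a long time, then several blocks are regenerated at once.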


+ Saves bandwidth, however peaky

− System complexity

− Meta-info overheads

• Start with extra redundancy

• Otherwise system vulnerable

Lazy Maintenance

Total Recall: System Support for Automated Availability Management


Randomly probe the redundant nodes

till a threshold (t) of live blocks is found (or all nodes are probed)

E.g., probe randomly till 8 online blocks are found

Replenish the blocks detected to be missing

E.g., 10 probes: probes 3 and 9 detect 2 out of 6 unavailable blocks
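The probing round might look like the following sketch (names and the seeded RNG are illustrative assumptions):

```python
import random

def randomized_probe(blocks_online, t, rng=random.Random(42)):
    """Probe block holders in random order until t live blocks are found
    (or all are probed); return the dead blocks discovered on the way,
    which are the only ones replenished in this round."""
    order = list(range(len(blocks_online)))
    rng.shuffle(order)
    live, dead = 0, []
    for i in order:
        if blocks_online[i]:
            live += 1
            if live == t:
                break              # enough live blocks seen: stop probing
        else:
            dead.append(i)
    return dead

# 10 blocks, the last 3 offline; stop as soon as 6 live ones are seen.
missing = randomized_probe([True] * 7 + [False] * 3, t=6)
assert all(i >= 7 for i in missing)   # only truly dead blocks are reported
assert len(missing) <= 3              # often fewer than all unavailable blocks
```

Because the round usually stops before seeing every dead block, repairs trickle in continuously, smoothing bandwidth usage.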

Randomized Lazy Maintenance


Number of replacements is generally less than the total number of unavailable blocks

A continuous but slower process, thus smoother bandwidth usage

System is in a healthier shape

Inherently adaptive

System complexity is (again) higher than in the naïve solution

Randomized Lazy Maintenance

Slide based on Anwitaman Datta, “Peer-to-Peer Storage Systems: Crowdsourcing the Storage Cloud“, ICDCN’10

Internet-scale storage systems under churn – A steady state analysis, A. Datta, K. Aberer, IEEE P2P 2006


A radically different idea


Total bandwidth and storage usage are not such a big deal

But the bandwidth used at any time instant is capped to a maximum level

Continuously create extra redundancy (up to some maximum redundancy level)

Proactive Replication Strategy

Proactive replication for data durability.

E. Sit, A. Haeberlen, F. Dabek, B. Chun, H. Weatherspoon, R. Morris, M. Frans Kaashoek and J. Kubiatowicz, IPTPS 2006


What happens when nodes come back online?

Duplicates and its implications

• Replication vs. Erasure codes vs. Rateless codes

Garbage collection decisions

To keep duplicates or not to keep duplicates

• Affects system complexity

• Keep track of duplicates and/or detect duplicates to remove

For coded fragments, limit the degree of diversity to keep

Tricky question:

What happens with updates? Changes of the content? How to find duplicates and replicates?

Garbage collection

Slide based on Anwitaman Datta, “Peer-to-Peer Storage Systems: Crowdsourcing the Storage Cloud“, ICDCN’10

Redundancy Maintenance and Garbage Collection Strategies in Peer-to-Peer Storage Systems



CFS: Basic, read-only system, based on Chord


OceanStore: Read-only, global vision, based on Tapestry


Ivy: Read-write, provides NFS semantics, based on Chord

Three Examples of P2P Filesystems



Developed at MIT, by the same people as Chord

CFS is based on the Chord DHT

Read-only system, only 1 publisher

CFS stores blocks instead of whole files

Part of CFS is a generic block storage layer


Load balancing

Performance equal to FTP

Efficiency and fast recovery from failures

CFS – Cooperative File System


Scalability (comes from Chord)


In absence of catastrophic failures, of course…

Non-adaptive replication strategy

Balanced number of files

Storage load balanced according to peers’ capabilities

See AH-Chord: traffic load is not balanced in Chord


Data will be stored for as long as agreed


Possibility of per-user quotas


As fast as common FTP in wide area



FS: provides filesystem API, interprets blocks as files

DHash: provides block storage

Chord: DHT layer, slightly modified

• Instead of 1 successor: r successors

• Node-to-node latency is measured and communicated

Clients access files, servers just provide storage

CFS: Layers, Clients, and Servers

[Figure: layered architecture – the client runs Filesystem, DHash, and Chord layers; each server runs DHash and Chord]


DHash stores blocks as opposed to whole files

Better load balancing

More network query traffic, but not really significant

Each block replicated k times, using Chord’s r successors


Two kinds of blocks:

Content block, addressed by hash of contents

Signed blocks (= root blocks), addressed by public key

• Signed block is the root of the filesystem

• One filesystem per publisher

Blocks are cached in network

Most of caching near the responsible node

Blocks in local cache replaced according to least-recently-used

Consistency not a problem, blocks addressed by content

• Root blocks different, may get old (but consistent) data


DHash provides following API to clients:

Put_h(block): store block under the hash of its contents

Put_s(block, PubKey): store or update a signed block

Get(key): fetch the block associated with key
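A hypothetical in-memory sketch of this API; real DHash spreads blocks over Chord nodes and verifies actual public-key signatures, both of which are elided here:

```python
import hashlib

class DHashSketch:
    """In-memory sketch of the DHash client API (illustrative only)."""
    def __init__(self):
        self.store = {}

    def put_h(self, block):
        """Store a content block under the SHA-1 hash of its contents."""
        key = hashlib.sha1(block).hexdigest()
        self.store[key] = block
        return key

    def put_s(self, block, pub_key):
        """Store or update the signed (root) block under the publisher's
        public key. Real DHash would verify the signature here."""
        key = hashlib.sha1(pub_key.encode()).hexdigest()
        self.store[key] = block
        return key

    def get(self, key):
        """Fetch the block stored under key. Content blocks are
        self-certifying: the fetcher can check sha1(block) == key."""
        return self.store.get(key)

d = DHashSketch()
k1 = d.put_h(b"piece of a file")
root = d.put_s(b"root -> " + k1.encode(), pub_key="alice-public-key")
assert d.get(k1) == b"piece of a file"
assert d.get(root).startswith(b"root -> ")
```

Note how the root block embeds the content keys it points to; overwriting it via put_s is the only way the publisher changes the filesystem.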

Filesystem starts with root (= signed) block

Block stored under the publisher’s public key and signed with the private key

• Clients: block is read-only

• Publisher can modify filesystem by inserting a new root block

Root block has pointers to other blocks

• Either piece of a file or filesystem metadata (e.g., directory)

Data stored for an agreed-upon finite interval


Root block identified by publisher’s public key

Each publisher has its own filesystem

Different publishers are on separate filesystems

Other blocks can be metadata or pieces of file

Metadata: contains links to other keys

Piece of file: other blocks identified based on hash of contents

CFS Filesystem Structure

[Figure: root block (key = publisher’s public key, signed) points via Hash(Dir) and Hash(File) to directory metadata and content blocks 1 and 2]


One real server can run several virtual servers

Different servers have different capabilities

Number of virtual servers depends on the capabilities

“Big” servers run more virtual servers

CFS operates at virtual server level

Virtual nodes on a single real node know each other

Possible to use short cuts in routing

Quotas can be used to limit storage for each node

Quotas set on a per-IP address basis

Better quotas require central administration

Some systems implement “better” quotas, e.g., PAST


First write

Create data elements to store

Sign root document

Publish data elements

CFS will store any block under its content hash

Highly unlikely to find two blocks with same SHA-1 hash

No explicit protection for content blocks needed

Only publisher can change data in filesystem

All links in CFS originate in the root document

Root block is signed by the publisher

• Validity can be checked by storage peer and users

• Publisher must keep private key secret

No explicit delete operation

Data stored only for agreed-upon period

Publisher must refresh periodically if persistence is needed



OceanStore developed at UC Berkeley

Runs on the Tapestry DHT

Supports object modification

Vision of ubiquitous computing:

Intelligent devices, transparently in the environment Highly dynamic, untrusted environment

Question: Where does persistent information reside?

OceanStore aims to be the answer OceanStore’s target in 2003/2004:

• 10^10 users, each with 10^4 files, i.e., 10^14 files total

See Cloud storage today



Users pay for the storage service

Several companies can provide services together

Aiming to support:

1. Untrusted infrastructure

Everything is encrypted, infrastructure unreliable

However, assume that “most servers are working correctly most of the time”

One class of servers trusted to follow protocol (but not trusted with data)

2. Nomadic data

Anytime, anywhere

Introspection used to tune system at run-time


OceanStore suitable for many kinds of applications:

Storage and sharing of large amounts of data

Data follows users dynamically

Groupware applications

Concurrent updates from many people

For example, calendars, contact lists, etc.

In particular, email applications

Streaming applications

Also for sensor networks and dissemination

Here we concentrate on storage



Each object has a globally unique identifier (GUID)

Objects replicated and migrated on the fly

Replicas located in two ways

Probabilistic, fast algorithm tried first

Slower, but deterministic algorithm used if first one fails

Objects exist in two forms, active and archival

Active is the latest version with a handle for updates

Archival is a permanent, read-only version

• Archive versions encoded with erasure codes with a lot of redundancy

• Only a global disaster can make data disappear


Object naming

Self-certifying object names

• Hash of the object owner’s key and a human-readable name

Allows directories

System has no root, but user can select her own root(s)

Access control

Restricting readers

• All data is encrypted, can restrict readers by not giving out the key

• Revoke read permission by re-encrypting everything

Restricting writers

• All writes must be signed and compared against ACL


Probabilistic algorithm

Frequently accessed objects likely to be nearby and easily found

This algorithm finds them fast

Uses attenuated Bloom filters

• See next slides

Deterministic algorithm

OceanStore based on Tapestry

• Like the p2p overlay Pastry

Deterministic routing is Tapestry’s routing

Guaranteed to find the object


Bloom filters are

“a space-efficient probabilistic data structure that is used to test whether or not an element is a member of a set”

Test if element e is part of a set:

• Answer NO:

– Answer is guaranteed

– False negatives are NOT possible

• Answer YES:

– Element might be in set

– False positives are possible

Components of a Bloom filter:

Array of k bits

Also need m different hash functions

• Each hash function maps its input to an array index in {0, …, k−1}; the bit at that index is then set to 1

To insert

Calculate all m hash functions (giving m indices) and set the bits at those indices to 1

To check

Calculate all m hash functions

• If all bits are 1, key is “probably” in the set

• If any bit is 0, then it is definitely not in the set
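A minimal Bloom filter along these lines, following the slide’s naming (k bits, m hash functions; note many texts swap k and m). Salted SHA-1 digests stand in for m independent hash functions:

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter: no false negatives, possible false positives."""
    def __init__(self, k_bits=64, m_hashes=3):
        self.bits = [0] * k_bits
        self.k = k_bits
        self.m = m_hashes

    def _indices(self, item):
        # Derive m pseudo-independent indices from salted SHA-1 digests.
        for salt in range(self.m):
            h = hashlib.sha1(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.k

    def insert(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # All m bits set -> "probably yes"; any bit clear -> definitely no.
        return all(self.bits[i] for i in self._indices(item))

bf = BloomFilter()
bf.insert("object-X")
assert bf.might_contain("object-X")   # an inserted item is always found
```

A query for an item that was never inserted may still return True (false positive), which is exactly the behavior the probabilistic lookup must tolerate.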

Sidenote: Bloom Filters


To insert or check for an item, Bloom filters take O(m) time

Fixed constant! Independent of number of entries

No other data structure allows for this (hash table is close)

Attenuated Bloom filter of depth D: an array of D normal Bloom filters

First filter is for locally stored objects at current node

The i-th Bloom filter is the union of all Bloom filters at distance i

• Through any path from current node

Attenuated Bloom filter for each network edge

• Gives information “What is in that direction”

Queries routed on the edge where the distance to object is shortest

• “Shortest”: a match at a low filter level indicates a promising direction
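The edge-selection rule can be sketched by representing each per-depth filter as the set of its set bit indices (an illustrative simplification of the bit arrays):

```python
def best_edge(query_bits, edge_filters):
    """Pick the outgoing edge whose attenuated Bloom filter matches the
    query at the shallowest depth. edge_filters maps edge -> list of
    per-depth filters (sets of set bits); depth 0 = the neighbor itself."""
    best, best_depth = None, None
    for edge, levels in edge_filters.items():
        for depth, level_bits in enumerate(levels):
            if query_bits <= level_bits:       # all query bits set here
                if best_depth is None or depth < best_depth:
                    best, best_depth = edge, depth
                break
    return best

# Object X hashes to bits {0, 1, 3}. Neighbor n2 matches at depth 0,
# the edge towards n4 only at depth 2, so the query goes towards n2.
edges = {
    "n2": [{0, 1, 3, 4}, {2, 5}],
    "n4": [{2}, {5}, {0, 1, 3}],
}
assert best_edge({0, 1, 3}, edges) == "n2"
```

Because the filters are probabilistic, the chosen direction may be a false positive; the deterministic Tapestry routing is the fallback.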


Node n1 wants to find object X

X hashes to bits 0, 1, 3

Set bits at indices 0, 1, 3 to 1

Node n1 does not have 0, 1, and 3 in local filter

Neighbor filter for n2 has them, forward query to n2

Node n2 does not have them in local filter

Filter for neighbor n3 has them, forward to n3

Node n3 has object

Probabilistic Query Process - Example

[Figure: local and neighbor Bloom filters for nodes n1–n4; the query for X (bits 0, 1, 3) is forwarded n1 → n2 → n3]


Update model in OceanStore based on conflict resolution

Update semantics

Each update has a list of predicates with associated actions

Predicates evaluated in order

Actions for first true predicate are atomically applied (commit)

If no predicates are true, the update aborts

Update is logged in both cases

List of predicates is short

Untrusted environment limits what predicates can do

Available predicates:

• Compare-version (metadata comparison, easy)

• Compare-size (same as above)

• Compare-block (easy if encryption is a position-dependent block cipher)

• Search (possible to search ciphertext, get a boolean result)


Four operations available

Replace-block

Insert-block

Delete-block

Append

In case of position-dependent cipher

I.e., the symmetric encryption key depends on the storage key

Replace-block and Append are easy operations

For Insert-block and Delete-block

Two kinds of blocks: Index and data blocks

Index blocks can contain pointers to other blocks

To insert a block, we replace the old block with an index block that points to the old block and the new block

• Actual blocks are appended to object

May be susceptible to traffic analysis


Replicas divided into two tiers

Primary tier

• Is trusted to follow protocol

• Peers cooperate in a Byzantine agreement protocol

Secondary tier is everyone else

• Communicates with the primary tier and within the secondary tier via epidemic algorithms

Reason for two tiers:

Byzantine fault-tolerant protocols are communication-intensive, hence feasible only with a small number of replicas

Primary tier is well-connected and small



Several divisions of the Byzantine army surround an enemy city. Each division is commanded by a general.


The generals communicate only through messenger

• Need to decide on a common plan after observing the enemy

Some of the generals may be traitors

• Traitors can send false messages

Required: An algorithm to guarantee that

All loyal generals decide upon the same plan of action, irrespective of what the traitors do.

A small number of traitors cannot cause the loyal generals to adopt a bad plan.


If no more than m generals out of n = 3m + 1 are traitors

Loyal generals can agree on the plan and follow the orders

Byzantine Generals Problem


Node in secondary tier initiates an update

Sends update to one node in primary tier

Sends also some replicas to random nodes in secondary tier

Update propagation

In primary tier: Byzantine agreement is performed

• Object eventually accepted (or rejected)

In secondary tier: replicas are forwarded randomly (epidemically)

Update flush

In case of positive result of Byzantine agreement:

• Update is sent over multicast tree to all secondary replicas


Archival mechanism uses erasure codes

Reed-Solomon, Tornado, etc.

Generate redundant data fragments

Created by the primary tier

If there are enough fragments and they are spread widely, then it is likely that we can retrieve the data

Archival copies are created when objects are changed

Every version is archived

Can be tuned to be done less frequently



Ivy developed at MIT

Based on Chord

Provides NFS-like semantics

At least for fully connected networks

Any user can modify any file

Ivy handles everything through logs

Ivy presents a conventional filesystem interface



Multiple distributed writers

make it difficult to maintain consistent metadata

Unreliable participants make locking unattractive

Locking could help maintain consistency

Participants cannot be trusted

Machines may be compromised

Need to be able to undo

Distributed filesystem may become partitioned

System must remain operational during partitions

Help applications repair conflicting updates made during partitions


Each participant maintains log of its changes

Logs maintain filesystem data and metadata

Participant has a private snapshot of the logs of others

Participant writes to its own log, reads all others

• Log-head points to most recent log record

Logs stored in DHash (see under CFS for details)

Version vectors impose order on log records from multiple logs

Solution Approach: Logs

[Figure: view block pointing to the participants’ log heads; each log head points to the most recent log record]



Each user writing to a filesystem has its own log

Participants agree on a view

• View is a set of logs that comprise the filesystem

View block is immutable (“root”)

View block has log heads for all participants


Private snapshots avoid traversing whole filesystem

• Contents of snapshots mostly the same for all participants

• DHash will store them only once

Snapshot contains the entire state

To create a snapshot, a node:

• Gets all logs more recent than the current snapshot

• Writes a new snapshot

New user must either build from scratch or take a trusted snapshot


How to determine order of modifications from logs?

Order should obey causality

All participants should agree on the order

Each new log record (= modification) has

Sequence number: increasing (managed locally)

Version vector: summarizes knowledge about the log

• Has sequence numbers from other logs in view

Log records ordered by comparing version vectors

Vectors u and v are comparable if u < v, v < u, or v = u

Otherwise concurrent

Simultaneous operations result in equal or concurrent vectors

Ordered by public keys of participants

May need special actions to return to consistency

• In case of overlapping modifications
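The version-vector comparison described above can be sketched as follows (the dict-based representation, participant name → sequence number, is an assumption):

```python
def compare(u, v):
    """Compare two version vectors (participant -> sequence number).
    Returns '<', '>', '=', or 'concurrent'. Missing entries count as 0."""
    keys = set(u) | set(v)
    u_le_v = all(u.get(k, 0) <= v.get(k, 0) for k in keys)
    v_le_u = all(v.get(k, 0) <= u.get(k, 0) for k in keys)
    if u_le_v and v_le_u:
        return "="
    if u_le_v:
        return "<"
    if v_le_u:
        return ">"
    # Neither dominates: the records were made concurrently.
    return "concurrent"

assert compare({"alice": 2, "bob": 1}, {"alice": 3, "bob": 1}) == "<"
assert compare({"alice": 2}, {"bob": 1}) == "concurrent"
assert compare({"alice": 1}, {"alice": 1}) == "="
```

Concurrent records are the ones Ivy must still order deterministically (e.g., by the participants’ public keys) and, for overlapping modifications, possibly reconcile.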


Built on top of unreliable nodes and network

How to achieve reliability?

Replication gives reliability

Replication makes maintaining consistency difficult

• Common view of all nodes on file system needed

Replication difficult if file modified or deleted

Nodes cannot be trusted in the general case

Must encrypt all data

• How to manage keys and access control rights?

Hard to do content-based conflict resolution (e.g., diff)


Distributed filesystems have much lower performance


What is right area of application?


Corporate network?

• Trusted environment, high bandwidth

• Possibly easy to deploy?

– Need to make a product first?

• But is a server net sufficient?

Global Internet?

• Lots of untrusted peers, widely distributed

• All the problems from above

• Can take off as a “hobby” project?

– see LifeSocial – p2p based social network


• P2P filesystems are a total waste of time?

• Especially in the presence of cloud storage


Do we need a P2P FS to build useful applications?

Yes, they allow distributed storage

• Working storage system is the basic building block of a useful application

– E.g. social networks

• Unlimited scalability (?)

– Combining Youtube and Facebook

• Privacy

– Distributed secure storage might guarantee privacy

No, DHT is enough (?)

Vision of fully reliable P2P FS

Would make network into one virtual computer

• Still some performance issues

Building P2P apps would be like building normal apps

See idea of p2p cloud


Capacity of hybrid architecture, and efficient schemes

How much centralized provisioning is necessary for a given distribution of peer resources?

Redundancy placement issues

Cover 24 hours

Minimize storage cost & access latency

Adaptive redundancy management

Different data may need different QoS


Fine-grained access control

Example scenarios from P2P-based online social networks

Heterogeneity, energy efficiency, load balance, etc.


Microsoft has done a lot of research in this area

Even built prototypes

Why no products on the market?

Below some wild speculation

P2P FS would compete with traditional file servers?

• P2P needs to be built into the OS, would kill the server market?

• Who would build such an OS?

– No big companies. But small companies are too small? Academia?

– Building “too much” software in academia: not research?

Impossible to build a “good enough” P2P FS?

• Theoretically doable, but too slow and weak in practice?

P2P filesystem can be done, but has no advantages?

• Solving one task (file system) opens many new requirements

– Monitoring, quality control, security and access control, privacy, ability to delete criminal content, accounting, quota enforcement, …

– Many tasks simple to solve centralized, but hard to solve decentralized


P2P Filesystems

P2P Filesystems: Basic Techniques

Replication Strategies

Three Examples of P2P Filesystems

CFS – Cooperative File System

OceanStore

Ivy




