• No results found

Replication and Consistency in Cloud File Systems

N/A
N/A
Protected

Academic year: 2021

Share "Replication and Consistency in Cloud File Systems"

Copied!
31
0
0

Loading.... (view fulltext now)

Full text

(1)

A. Reinefeld, F. Schintke, ZIB 1

Replication and Consistency

in Cloud File Systems

Alexander Reinefeld und Florian Schintke

Zuse-Institut Berlin

(2)

A. Reinefeld, F. Schintke, ZIB 2

Let’s start with a little quiz

Who invented Cloud Computing?

a) Werner Vogels

b) Ian Foster

c) Konrad Zuse

The correct answer is … c)

“Schließlich werden auch Rechenzentren über

Fernmeldeleitungen miteinander vernetzt werden, …”

Konrad Zuse in „Rechnender Raum“ (1969)

Konrad Zuse 22.06.1910-18.12.1995

(3)

A. Reinefeld, F. Schintke, ZIB 3

Zuse Institute Berlin

Research institute for applied mathematics

and computer science • Peter Deuflhard

chair for scientific computing, FU Berlin

• Martin Grötschel

chair for discrete mathematics, TU Berlin

• Alexander Reinefeld

(4)

A. Reinefeld, F. Schintke, ZIB 4

HPC Systems @ ZIB

1987 Cray X-MP 471 MFlops 1994 Cray T3D 38 GFlops 1997 Cray T3E 486 GFlops 2002 IBM p690 2,5 TFlops 1984 Cray 1M 160 MFlops

1.000.000-fold performance increase in 25 years

1984 2009

2008/09 SGI ICE, XE

(5)

H

L

R

N

2 sites 98 computer racks 26112 CPU cores 128 TB memory 1620 TB disk 300 TF peak perf.

(6)

S

T

O

R

A

G

E

3 SL8500 robots 39 tape drives 19000 slots

(7)

A. Reinefeld, F. Schintke, ZIB 7

What is Cloud Computing?

Cloud Computing = Grid Computing on Datacenters?

• not that simple …

Cloud and Grid both abstract resources through interfaces.

Grid: via new middleware.

Requires Grid APIs.

Cloud: via virtualization.

Allows legacy APIs.

Software as a Service (SaaS)

Applications Application Services

Platform as a Service (PaaS)

Programming Environment Execution Environment

Infrastructure as a Service (IaaS)

Infrastructure Services Resource Set

(8)

Alexander Reinefeld, ZIB 8

Why Cloud?

Pros

• It scales

because it‘s their resources, not yours

• It‘s simple

because they operate it

• Pay for what you need

don‘t pay for empty spinning disks

Cons

• It‘s expensive

Amazon S3 charges $.15 / GB / month. = $1800 / TB / year

• It‘s not 100% secure

S3 now allows to bring your own

RSA key-pair. But: Would you put your bank account into the cloud?

• It‘s not 100% available

S3 provides “service credits“ if availability drops (10% for 99.0-99.9% availability)

(9)

Alexander Reinefeld, ZIB 9

File System Landscape

ext3, ZFS, NTFS NFS, SMB AFS/Coda Lustre, Panasas, GPFS, CEPH... Cloud/Grid Cluster FS/ Datacenter Network FS/ Centralized PC, local system GDM "gridftp" Grid File System

(10)

Alexander Reinefeld, ZIB 10

Consistency, Availability, Partition tolerance:

Pick two of three!

Consistency: All clients have the same view of the data.

Availability: Each client can always read and write.

Partition tolerance: Operations will complete, even if individual components are unavailable.

Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2004.

A

C

P

A + P Amazon S3 Mercurial Coda/AFS C + A single server, Linux HA (one data center)

C + P

distributed databases, distributed file systems

(11)

Alexander Reinefeld, ZIB 11

Which semantic do you expect?

Distributed file systems should provide C + P

(12)

Alexander Reinefeld, ZIB 12

Grid File System

• provides access to heterogeneous

storage resources, but

• middleware causes additional complexity, vulnerability

• requires explicit file transfer

• whole file: latency to 1st access,

bandwidth, disk storage

• also partial file access (gridftp) and pattern access (falls)

• no consistency among replicas

• user must take care

(13)

Alexander Reinefeld, ZIB 13

Cloud File System: XtreemFS

Focus on • data distribution • data replication • object based Key features • MRCs are separated from OSDs

• fat Client is the “link”

MRC = metadata & replica catalogue

OSD = object storage device

(14)

A. Reinefeld, F. Schintke, ZIB 14

A closer look at XtreemFS

Features

distributed, replicated POSIX compliant file system

Server software (Java) runs on Linux, OS X, SolarisClient software (C++) runs on Linux, OS X, Windows

secure: X.509 and SSL

open source (GPL)

Assumptions

• synchronous clocks with max. time drift (needed for OSD lease

negotiation, reasonable assumption in clouds)

• upper limit on round trip time

(15)

A. Reinefeld, F. Schintke, ZIB 15

(16)

Alexander Reinefeld, ZIB 16

File access protocol

XtreemFS Client (fuse) User appl.

(Linux VFS) MRC OSD

FileSize = 128k Update(Cap, FileSize=128k)

(17)

A. Reinefeld, F. Schintke, ZIB 17

Client

gets list of OSDs from MRC

• get a capability (signed by MRC) per file

selects best OSD(s) for parallel I/O

• various striping policies: scatter/gather, RAIDx, erasure codes

scalable and fast access

• no communication between OSD and MRC needed

(18)

A. Reinefeld, F. Schintke, ZIB 18

MRC – Metadata and Replication Catalogue

provides

• open(), close(), readdir(), rename(), …

• attributes per file: size, last access, access rights, location (OSDs), …

• capability (file handle) to authorize a client to access objects on OSDs

implemented with a key/value store (BabuDB)

• fast index

• append-only DB

(19)

A. Reinefeld, F. Schintke, ZIB 19

OSD – Object Storage Device

serves file content operations

• read(), write(), truncate(), flush(), …

implements object replication

• also partial replicas for read-access 

• data is filled on demand

gets OSD list from MRC

slave OSD redirects to master OSD

• write ops only on master OSD

(20)

A. Reinefeld, F. Schintke, ZIB 20

OSD – Object Storage Device

Which OSD to select?

• object list

• bandwidth

• rarest first

• network coordinates, datacenter map, …

(21)

A. Reinefeld, F. Schintke, ZIB 21

OSD – Object Storage Device

implements concurrency control for replica consistency

• POSIX compliant

• master/slave replication with failover

• group membership service provided by MRC

• lease service “Flease”: distributed, scalable and failure-tolerant

• 50,000 leases/sec with 30 OSDs

(22)

A. Reinefeld, F. Schintke, ZIB 22

Quorum consensus

Basic algorithm

• When a majority is informed, each other majority has at least one

member with up-to-date information.

• A minority may crash at any time.

Paxos Consensus

1 Step: Check whether a consensus c was already established

2 Step: Re-establish c or try to establish own proposal

x

x

x

x

x

x

x x

(23)

Alexander Reinefeld, ZIB 23

Proposer

Init

r = 1 // lokale Runden-Nr

rlatest= 0 // Nr der höchsten bestätigten Runde

latestv= ⊥ // Wert d. höchsten bestätigten Runde // Neues Proposal senden

acknum= 0 // Anzahl gültiger Bestätigungen

Sende prepare(r) an alle acceptors

Empfange ack(rack,vi,ri) von acceptor i Falls r == rack

acknum++ Falls ri> rlatest

rlatest= ri // jüngere akzeptierte Runde

latestv= vi // jüngerer Wert

Falls acknum≥maj Falls latestv== ⊥

schlage selbst einen Wert latestv≠ ⊥vor sende accept(r, latestv) an alle acceptors

Learner

numaccepted= 0 // Anzahl gesammelter accepts

Empfange accepted(r, v) von acceptor i Wenn r steigt: numaccepted= 0

numaccepted++

Falls numaccepted== maj

decide v; inform client // v ist Konsens

Acceptor

Init

rack = 0 // zuletzt bestätigte Runde

raccepted= 0 // zuletzt akzeptierte Runde

v = ⊥ // aktueller lokaler Wert

Empfange prepare(r) von proposer

Falls r > rack ∧r > raccepted // höhere Runde

rack= r

Sende ack(rack, v, raccepted) an Proposer

Empfange accept(r, w) Falls r ≥rack∧r > raccepted

raccepted= r v = w

Sende accepted (raccepted, v) an Learners

(24)

A. Reinefeld, F. Schintke, ZIB 24

Striping Performance on Cluster

Striping

• parallel transfer from/to

many OSDs

• bandwidth scales with the

number of OSDs

• client is the bottleneck:

(slower reads are caused by TCP ingress problem) R E A D W R IT E

One client writes/reads a single 4GB file using

asynchr. writes, read-ahead, 1MB chunk size, 29 OSDs. Nodes are connected with IP over IB (1.2 GB/s).

(25)

A. Reinefeld, F. Schintke, ZIB 25

Snapshots & Backups

Metadata snapshots (MRC)

need atomic operation without service interrupt

• asynchronous consolidation in background

• granularity: subdirectories or volumes

implemented by BabuDB or Scalaris

File snapshots (OSD)

• taken implicitly when file is idle

or explicitly when closing file or fsync()

(26)

A. Reinefeld, F. Schintke, ZIB 26

Atomic Snapshots in MRC

implemented with BabuDB backend

• a large-scale DB for data that exceeds the system’s main memory

• 2 components:

small mutable overlay trees (LSM trees)

large immutable memory-mapped index on disk

• non-transactional key-value store

• prefix and range queries

primary design goal: Performance!

• 300,000 lookups/sec (30M entries)

• fast crash recovery

(27)

A. Reinefeld, F. Schintke, ZIB 27

Log Structured Merge Trees: .

(28)

Alexander Reinefeld, ZIB 28

Replicating MRC, OSDs

Master/Slave Scheme

Pros

• fast local read

• no distributed transactions

• easy to implement

Cons

• master is performance bottleneck

• interrupt when master fails: needs stable master election

Replicated State Machine (Paxos)

Pros

• no master, no single point of failure

• no extra latency on failure

Cons

• slower: 2 round trips per op

(29)

Alexander Reinefeld, ZIB 29

XtreemFS Features

Release 1.2.1 (current)

• RAID and parallel I/O

• POSIX compatibility

• Read-only replication

• Partial replicas (on-demand)

• Security (SSL, X.509)

• Internet ready

• Checksums

• Extensions

• OSD and replica selection (Vivaldi, datacenter maps)

• Asynchronous MRC backups

• Metadata caching

• Graphical admin console

• Hadoop file system driver (experimental)

Release 1.3 (very soon)

• DIR and MRC replication with automatic failover

• Read/write replication

Release 2.x

• Consistent Backups

• Snapshots

• Automatic replica creation, deletion and maintenance

(30)

A. Reinefeld, F. Schintke, ZIB 30

Source Code

XtreemFS

• http://code.google.com/p/xtreemfs

• 35.000 lines of C++ and Java code

• GNU GPL v2 license

BabuDB

• http://code.google.com/p/babudb

• 10.000 lines of Java code

• new BSD license

Scalaris

• http://code.google.com/p/scalaris

• 28.214 lines of Erlang and C++ code

(31)

A. Reinefeld, F. Schintke, ZIB 31

Summary

Cloud file systems require replication

• availability

• fast access, striping

Replication requires consistency algorithm

when crashes are rare: use master/slave replication

with frequent crashes: use Paxos

Only Consistency + Partition tolerance from CAP theorem

Our next step: Faster high-level data services

References

Related documents

Thus far, with the exception of India’s Jamkesmas program targeting the poor (which, if you recall, is completely subsided by the central government), none of the

The coherently dedispersed pulse profile of FRB180528 at its DM of 899 pc cm −3 shows hints of temporal broadening at high time resolution. Fitting the profile with the model defined

Cultural inaccessibility is a concept I’ve created to describe the ways that women are made to feel unwelcome in spaces of game play and games culture, both offline and

This enterprise-wide compliance data warehouse is currently supporting a host of tax compliance management and reporting solutions, including tax discovery programs, advanced

Consequently, this study aims to define the factors (techniques) of social media marketing in B & B industry, and then utilize decision trees (DT) to recognize the

Warning : You must turn off the power switch once the battery is fully charged and then take out the plug otherwise you may shorten the working life of the battery.. If the red

Strong Brands AmBev Portfolio Leader in the flavor segment Second in cola segment Leader in sports drinks.. Strong Brands Key

the Rhodesian Parliament had condoned an aot of rebellion and had actively supported the regime. The Queen refused to appoint a Governor-General on the advice