A. Reinefeld, F. Schintke, ZIB 1
Replication and Consistency
in Cloud File Systems
Alexander Reinefeld and Florian Schintke
Zuse-Institut Berlin
Let’s start with a little quiz
Who invented Cloud Computing?
a) Werner Vogels
b) Ian Foster
c) Konrad Zuse
The correct answer is … c)
“Finally, computing centers will also be interconnected via telecommunication lines, …”
Konrad Zuse in “Rechnender Raum” (1969)
Konrad Zuse 22.06.1910-18.12.1995
Zuse Institute Berlin
Research institute for applied mathematics and computer science
• Peter Deuflhard
chair for scientific computing, FU Berlin
• Martin Grötschel
chair for discrete mathematics, TU Berlin
• Alexander Reinefeld
HPC Systems @ ZIB
1,000,000-fold performance increase in 25 years (1984 to 2009)
• 1984 Cray 1M: 160 MFlops
• 1987 Cray X-MP: 471 MFlops
• 1994 Cray T3D: 38 GFlops
• 1997 Cray T3E: 486 GFlops
• 2002 IBM p690: 2.5 TFlops
• 2008/09 SGI ICE, XE
HLRN
• 2 sites, 98 computer racks, 26112 CPU cores, 128 TB memory, 1620 TB disk, 300 TF peak perf.
Storage
• 3 SL8500 robots, 39 tape drives, 19000 slots
What is Cloud Computing?
Cloud Computing = Grid Computing on Datacenters?
• not that simple …
Cloud and Grid both abstract resources through interfaces.
• Grid: via new middleware.
Requires Grid APIs.
• Cloud: via virtualization.
Allows legacy APIs.
Software as a Service (SaaS)
Applications Application Services
Platform as a Service (PaaS)
Programming Environment Execution Environment
Infrastructure as a Service (IaaS)
Infrastructure Services Resource Set
Why Cloud?
Pros
• It scales
because it’s their resources, not yours
• It’s simple
because they operate it
• Pay for what you need
don’t pay for empty spinning disks
Cons
• It’s expensive
Amazon S3 charges $0.15 / GB / month = $1800 / TB / year
• It’s not 100% secure
S3 now lets you bring your own RSA key pair. But: would you put your bank account into the cloud?
• It’s not 100% available
S3 provides “service credits” if availability drops (10% for 99.0-99.9% availability)
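The storage price quoted under "It's expensive" can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1 TB = 1000 GB), which is what the slide's yearly figure implies; the function name is only illustrative.

```python
# Convert a per-GB monthly storage price into a per-TB yearly price.
# Assumption: decimal units (1 TB = 1000 GB), as implied by the slide's figure.
GB_PER_TB = 1000
MONTHS_PER_YEAR = 12


def yearly_cost_per_tb(price_per_gb_month: float) -> float:
    """Return the yearly cost of storing 1 TB at the given $/GB/month price."""
    return price_per_gb_month * GB_PER_TB * MONTHS_PER_YEAR


cost = yearly_cost_per_tb(0.15)
print(f"${cost:.0f} per TB per year")
```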
File System Landscape
• PC, local system: ext3, ZFS, NTFS
• Network FS / Centralized: NFS, SMB, AFS/Coda
• Cluster FS / Datacenter: Lustre, Panasas, GPFS, CEPH, ...
• Cloud/Grid: GDM, "gridftp", Grid File System
Consistency, Availability, Partition tolerance:
Pick two of three!
Consistency: All clients have the same view of the data.
Availability: Each client can always read and write.
Partition tolerance: Operations will complete, even if individual components are unavailable.
Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2000.
• A + P: Amazon S3, Mercurial, Coda/AFS
• C + A: single server, Linux HA (one data center)
• C + P: distributed databases, distributed file systems
Which semantics do you expect?
Distributed file systems should provide C + P
Grid File System
• provides access to heterogeneous storage resources, but
• middleware causes additional complexity, vulnerability
• requires explicit file transfer
• whole file: latency to first access, bandwidth, disk storage
• also partial file access (gridftp) and pattern access (falls)
• no consistency among replicas
• user must take care
Cloud File System: XtreemFS
Focus on
• data distribution
• data replication
• object based
Key features
• MRCs are separated from OSDs
• fat client is the “link”
MRC = metadata & replica catalogue
OSD = object storage device
A closer look at XtreemFS
• Features
  • distributed, replicated, POSIX-compliant file system
  • server software (Java) runs on Linux, OS X, Solaris
  • client software (C++) runs on Linux, OS X, Windows
  • secure: X.509 and SSL
  • open source (GPL)
• Assumptions
  • synchronized clocks with a maximum time drift (needed for OSD lease negotiation; a reasonable assumption in clouds)
  • upper limit on round-trip time
File access protocol
[Sequence diagram: user application (Linux VFS), XtreemFS client (FUSE), MRC, OSD. Messages shown: FileSize = 128k; Update(Cap, FileSize=128k)]
Client
• gets list of OSDs from MRC
  • gets a capability (signed by MRC) per file
• selects best OSD(s) for parallel I/O
  • various striping policies: scatter/gather, RAIDx, erasure codes
• scalable and fast access
  • no communication between OSD and MRC needed
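The idea that a signed capability lets an OSD authorize a client without contacting the MRC can be sketched as follows. This is only an illustration: the slides do not specify the signature scheme or the capability fields, so the HMAC over a shared secret, the field names, and both function names are assumptions rather than the XtreemFS wire format.

```python
import hashlib
import hmac
import time

# Hypothetical secret shared between the MRC and all OSDs (an assumption
# for this sketch; the slides only say the capability is signed by the MRC).
SHARED_SECRET = b"example-secret"


def issue_capability(file_id: str, mode: str, expires: int) -> dict:
    """MRC side: sign the tuple (file_id, mode, expires)."""
    payload = f"{file_id}|{mode}|{expires}".encode()
    sig = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    return {"file_id": file_id, "mode": mode, "expires": expires, "sig": sig}


def verify_capability(cap: dict, now: int) -> bool:
    """OSD side: check signature and expiry locally, without asking the MRC."""
    payload = f"{cap['file_id']}|{cap['mode']}|{cap['expires']}".encode()
    expected = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cap["sig"]) and now < cap["expires"]


now = int(time.time())
cap = issue_capability("volume1/file42", "rw", now + 600)
forged = dict(cap, mode="admin")  # tampered field, signature no longer matches

print(verify_capability(cap, now))     # valid capability
print(verify_capability(forged, now))  # rejected: signature mismatch
```

Because verification needs only the secret and the local clock, the OSD can serve requests at full speed while the MRC stays off the data path.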
MRC – Metadata and Replica Catalogue
• provides
  • open(), close(), readdir(), rename(), …
  • attributes per file: size, last access, access rights, location (OSDs), …
  • capability (file handle) to authorize a client to access objects on OSDs
• implemented with a key/value store (BabuDB)
  • fast index
  • append-only DB
OSD – Object Storage Device
• serves file content operations
  • read(), write(), truncate(), flush(), …
• implements object replication
  • also partial replicas for read access
  • data is filled on demand
• gets OSD list from MRC
• slave OSD redirects to master OSD
  • write ops only on master OSD
OSD – Object Storage Device
• Which OSD to select?
  • object list
  • bandwidth
  • rarest first
  • network coordinates, datacenter map, …
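One of the criteria listed above, selection by network coordinates, can be sketched as sorting OSDs by their distance to the client in a coordinate space. The OSD names, the two-dimensional coordinates, and the function name are hypothetical; how the coordinates are obtained is outside this sketch.

```python
import math


def rank_osds(client_coord, osd_coords):
    """Return OSD names ordered by Euclidean distance to the client.

    osd_coords maps an OSD name to its (hypothetical) network coordinate.
    """
    return sorted(osd_coords, key=lambda name: math.dist(client_coord, osd_coords[name]))


# Made-up coordinates, where distance approximates round-trip time.
osds = {
    "osd-tokyo": (90.0, 40.0),
    "osd-berlin": (0.0, 1.0),
    "osd-paris": (8.0, 3.0),
}
ranked = rank_osds((1.0, 1.0), osds)
print(ranked)
```

A real policy would likely combine this ranking with the other criteria on the slide (which objects a replica holds, available bandwidth, rarest first).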
OSD – Object Storage Device
• implements concurrency control for replica consistency
  • POSIX compliant
  • master/slave replication with failover
  • group membership service provided by MRC
  • lease service “Flease”: distributed, scalable and failure-tolerant
  • 50,000 leases/sec with 30 OSDs
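Leases give master/slave replication a safe failover rule as long as clock drift is bounded. The sketch below shows only that timing guard, not the Flease protocol itself; the drift bound, the `Lease` type, and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical bound on the clock difference between any two nodes, in seconds.
MAX_DRIFT = 0.5


@dataclass
class Lease:
    holder: str        # node that currently acts as master
    expires_at: float  # absolute expiry timestamp


def holder_may_act(lease: Lease, node: str, local_time: float) -> bool:
    """The holder stops acting MAX_DRIFT before expiry on its own clock."""
    return lease.holder == node and local_time < lease.expires_at - MAX_DRIFT


def may_take_over(lease: Lease, local_time: float) -> bool:
    """Another node may acquire the lease once it has expired on its clock."""
    return local_time >= lease.expires_at


lease = Lease(holder="osd1", expires_at=100.0)
print(holder_may_act(lease, "osd1", 99.0))  # still master
print(holder_may_act(lease, "osd1", 99.6))  # inside the safety margin: stop
print(may_take_over(lease, 100.0))          # a new master may be elected
```

If two clocks differ by at most MAX_DRIFT, the old holder has already stopped by the time any other node sees the lease as expired, so there are never two masters at once.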
Quorum consensus
• Basic algorithm
  • When a majority is informed, every other majority has at least one member with up-to-date information.
  • A minority may crash at any time.
• Paxos consensus
  • Step 1: Check whether a consensus c was already established
  • Step 2: Re-establish c or try to establish own proposal
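The property behind the basic algorithm, that any two majorities share at least one member, can be checked by brute force for small node sets. This is a minimal sketch; the node names and function names are illustrative only.

```python
from itertools import combinations


def majorities(nodes):
    """All subsets containing more than half of the nodes."""
    need = len(nodes) // 2 + 1
    return [set(c) for k in range(need, len(nodes) + 1)
            for c in combinations(nodes, k)]


def all_majorities_intersect(nodes):
    """True if every pair of majorities has at least one common member."""
    ms = majorities(nodes)
    return all(a & b for a in ms for b in ms)


nodes = ["n1", "n2", "n3", "n4", "n5"]
result = all_majorities_intersect(nodes)
print(result)
```

The overlap follows from counting: two sets that each hold more than n/2 of n nodes cannot be disjoint. That shared member is what carries the up-to-date information from one round to the next.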
Proposer
  Init
    r = 1              // local round number
    r_latest = 0       // number of the highest acknowledged round
    latest_v = ⊥       // value of the highest acknowledged round
  // Send a new proposal
    acknum = 0         // number of valid acknowledgements
    send prepare(r) to all acceptors
  Receive ack(r_ack, v_i, r_i) from acceptor i
    if r == r_ack
      acknum++
      if r_i > r_latest
        r_latest = r_i   // more recent accepted round
        latest_v = v_i   // more recent value
      if acknum ≥ maj
        if latest_v == ⊥
          propose own value latest_v ≠ ⊥
        send accept(r, latest_v) to all acceptors
Learner
  numaccepted = 0      // number of collected accepts
  Receive accepted(r, v) from acceptor i
    when r increases: numaccepted = 0
    numaccepted++
    if numaccepted == maj
      decide v; inform client   // v is the consensus
Acceptor
  Init
    r_ack = 0          // last acknowledged round
    r_accepted = 0     // last accepted round
    v = ⊥              // current local value
  Receive prepare(r) from proposer
    if r > r_ack ∧ r > r_accepted   // higher round
      r_ack = r
      send ack(r_ack, v, r_accepted) to proposer
  Receive accept(r, w)
    if r ≥ r_ack ∧ r > r_accepted
      r_accepted = r
      v = w
      send accepted(r_accepted, v) to learners
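The proposer, acceptor, and learner roles above can be exercised in a small in-memory simulation. This is a sketch under simplifying assumptions that are not in the slides: messages are synchronous function calls, nothing is lost, the proposer also counts the accepts (folding in the learner role), and `None` stands for ⊥. The class and function names are illustrative.

```python
class Acceptor:
    def __init__(self):
        self.r_ack = 0       # last acknowledged round
        self.r_accepted = 0  # last accepted round
        self.v = None        # current local value (None stands for ⊥)

    def on_prepare(self, r):
        """Acknowledge a prepare only for a round higher than any seen so far."""
        if r > self.r_ack and r > self.r_accepted:
            self.r_ack = r
            return (self.r_ack, self.v, self.r_accepted)
        return None  # stale round: no answer

    def on_accept(self, r, w):
        """Accept value w for round r unless a higher round was acknowledged."""
        if r >= self.r_ack and r > self.r_accepted:
            self.r_accepted = r
            self.v = w
            return (self.r_accepted, self.v)
        return None


def propose(acceptors, r, own_value):
    """Run one complete round r; return the decided value, or None on failure."""
    maj = len(acceptors) // 2 + 1
    acks = [a.on_prepare(r) for a in acceptors]
    acks = [ack for ack in acks if ack is not None]
    if len(acks) < maj:
        return None
    # Adopt the value of the most recent accepted round, if there is one.
    r_latest, latest_v = 0, None
    for _r_ack, v_i, r_i in acks:
        if r_i > r_latest:
            r_latest, latest_v = r_i, v_i
    if latest_v is None:
        latest_v = own_value  # no earlier consensus: propose own value
    accepted = [a.on_accept(r, latest_v) for a in acceptors]
    num_accepted = sum(1 for x in accepted if x is not None)
    return latest_v if num_accepted >= maj else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "A")   # establishes "A"
second = propose(acceptors, 2, "B")  # must re-establish "A", not "B"
stale = propose(acceptors, 1, "C")   # old round number: rejected
print(first, second, stale)
```

The second proposer ends up re-establishing the earlier value, which is exactly the "check whether a consensus was already established" step from the quorum slide.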
Striping Performance on Cluster
• Striping
  • parallel transfer from/to many OSDs
  • bandwidth scales with the number of OSDs
  • client is the bottleneck (slower reads are caused by TCP ingress problem)
[Plot panels: READ, WRITE]
One client writes/reads a single 4 GB file using asynchronous writes, read-ahead, 1 MB chunk size, 29 OSDs. Nodes are connected with IP over IB (1.2 GB/s).
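How a striped file spreads over the OSDs can be sketched with simple index arithmetic. The chunk size and OSD count are taken from the measurement setup above; the round-robin (RAID-0 style) placement and the function name are assumptions, since the slides list several striping policies without detailing any of them.

```python
from collections import Counter

CHUNK_SIZE = 1 << 20  # 1 MB chunk size, as in the measurement
NUM_OSDS = 29         # number of OSDs, as in the measurement


def locate(offset, chunk_size=CHUNK_SIZE, num_osds=NUM_OSDS):
    """Map a byte offset to (object number, OSD index, offset within object).

    Assumes round-robin placement of objects across OSDs.
    """
    obj = offset // chunk_size
    return obj, obj % num_osds, offset % chunk_size


# Distribution of a 4 GB file: count how many objects land on each OSD.
file_size = 4 * (1 << 30)
num_objects = file_size // CHUNK_SIZE
objects_per_osd = Counter(obj % NUM_OSDS for obj in range(num_objects))

print(locate(30 * CHUNK_SIZE + 5))
print(num_objects, objects_per_osd[0], objects_per_osd[NUM_OSDS - 1])
```

Since consecutive objects sit on different OSDs, a client using read-ahead or asynchronous writes can keep many OSDs busy at once, which is why the aggregate bandwidth grows with the OSD count until the client's own link saturates.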
Snapshots & Backups
Metadata snapshots (MRC)
• need an atomic operation without service interruption
• asynchronous consolidation in background
• granularity: subdirectories or volumes
• implemented by BabuDB or Scalaris
File snapshots (OSD)
• taken implicitly when the file is idle, or explicitly on close or fsync()
Atomic Snapshots in MRC
• implemented with BabuDB backend
  • a large-scale DB for data that exceeds the system’s main memory
  • 2 components:
    • small mutable overlay trees (LSM trees)
    • large immutable memory-mapped index on disk
  • non-transactional key-value store
  • prefix and range queries
• primary design goal: Performance!
  • 300,000 lookups/sec (30M entries)
  • fast crash recovery
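The two-component layout, a small mutable overlay in front of a large immutable index, can be illustrated with a toy key-value store. This is not BabuDB's API: the class and method names are made up, the "index on disk" is an in-memory tuple, and deletes are left out to keep the sketch short.

```python
from bisect import bisect_left


class TinyLSM:
    def __init__(self, base=None):
        # Immutable base index: sorted (key, value) pairs, never modified.
        self._base = tuple(sorted((base or {}).items()))
        self._base_keys = [k for k, _ in self._base]
        self._overlay = {}  # small mutable overlay that absorbs all writes

    def put(self, key, value):
        self._overlay[key] = value

    def get(self, key):
        """Look in the overlay first, then binary-search the base index."""
        if key in self._overlay:
            return self._overlay[key]
        i = bisect_left(self._base_keys, key)
        if i < len(self._base_keys) and self._base_keys[i] == key:
            return self._base[i][1]
        return None

    def prefix(self, p):
        """Prefix query: merge matching base entries with newer overlay ones."""
        merged = {}
        i = bisect_left(self._base_keys, p)
        while i < len(self._base) and self._base_keys[i].startswith(p):
            merged[self._base_keys[i]] = self._base[i][1]
            i += 1
        merged.update({k: v for k, v in self._overlay.items() if k.startswith(p)})
        return sorted(merged.items())

    def merge(self):
        """Consolidate overlay and base into a new store with an empty overlay."""
        data = dict(self._base)
        data.update(self._overlay)
        return TinyLSM(data)


db = TinyLSM({"/a/x": 1, "/a/y": 2, "/b/z": 3})
db.put("/a/y", 20)  # overrides the base entry
db.put("/a/w", 5)   # new key, lives only in the overlay
print(db.get("/a/y"), db.prefix("/a/"))
merged = db.merge()
```

Because the base is never modified in place, a consistent view of it can be kept around while new writes go to a fresh overlay, which is the property that makes cheap snapshots and background consolidation possible.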
Log-Structured Merge Trees
Replicating MRC, OSDs
Master/Slave Scheme
• Pros
• fast local read
• no distributed transactions
• easy to implement
• Cons
• master is performance bottleneck
• service interruption when master fails: needs stable master election
Replicated State Machine (Paxos)
• Pros
• no master, no single point of failure
• no extra latency on failure
• Cons
• slower: 2 round trips per op
XtreemFS Features
Release 1.2.1 (current)
• RAID and parallel I/O
• POSIX compatibility
• Read-only replication
• Partial replicas (on-demand)
• Security (SSL, X.509)
• Internet ready
• Checksums
• Extensions
• OSD and replica selection (Vivaldi, datacenter maps)
• Asynchronous MRC backups
• Metadata caching
• Graphical admin console
• Hadoop file system driver (experimental)
Release 1.3 (very soon)
• DIR and MRC replication with automatic failover
• Read/write replication
Release 2.x
• Consistent Backups
• Snapshots
• Automatic replica creation, deletion and maintenance
Source Code
• XtreemFS
  • http://code.google.com/p/xtreemfs
  • 35,000 lines of C++ and Java code
  • GNU GPL v2 license
• BabuDB
  • http://code.google.com/p/babudb
  • 10,000 lines of Java code
  • new BSD license
• Scalaris
  • http://code.google.com/p/scalaris
  • 28,214 lines of Erlang and C++ code
Summary
• Cloud file systems require replication
  • availability
  • fast access, striping
• Replication requires a consistency algorithm
  • when crashes are rare: use master/slave replication
  • with frequent crashes: use Paxos