D u k e S y s t e m s
Scalable File Storage
Jeff Chase
Duke University
Why a shared network file service?
•
Data sharing across people and their apps
–
common name space (/usr/project/stuff
…
)
•
Resource sharing
–
fluid mapping of data to storage resources
–
incremental scalability
–
diversity of demands, not predictable in advance
–
statistical multiplexing, central limit theorem
•
Obvious?
Network File System (NFS)
syscall layerUFS
NFS
server
VFS VFSNFS
client
UFS
syscall layer
client
user programs
network
server
Virtual File System (VFS) enables pluggable file system
implementations as OS kernel modules (“drivers”).
Google File System (GFS)
•
SOSP 2003
•
Foundation for data-intensive parallel cluster
computing at Google
–
MapReduce OSDI 2004, 2000+ cites
•
Client access by RPC library, through kernel
system calls (via FUSE)
•
Uses Chubby lock service for consensus
–
e.g., Master election
Google File System (GFS)
Similar
: Hadoop HDFS, p-NFS, many other parallel file systems.
A
master
server stores metadata (names, file maps) and acts as lock server.
Clients call master to open file, acquire locks, and obtain metadata. Then they
read/write directly to a scalable array of data servers for the actual data. File
data may be spread across many data servers: the maps say where it is.
GFS (or HDFS) and MapReduce
•
Large files
•
Streaming access (sequential)
•
Parallel access
•
Append-mode writes
•
Record-oriented
MapReduce: Example
Handles failures automatically, e.g., restarts tasks if a node fails; Runs multiple copies of a task so a slow node does not limit the job.
GFS Architecture
Separate data (chunks) from metadata (names etc.).
Centralize the metadata; spread the chunks around.
Chunks
•
Variable size, up to 64MB
•
Stored as a file, named by a handle
•
Replicated on multiple nodes, e.g., x3
–
chunkserver
==
datanode
•
Master caches chunk maps
–
per-file chunk map: what chunks make up a file
GFS Architecture
•
Single master, multiple chunkservers
Single master
•
From distributed systems we know this is a:
–
Single point of failure
–
Scalability bottleneck
•
GFS solutions:
–
Shadow masters
–
Minimize master involvement
•
never move data through it, use only for metadata
–
and cache metadata at clients
•
large chunk size
•
master delegates authority to primary replicas in
data mutations (chunk leases)
GFS Write
The client asks the master for a list of replicas, and which replica holds the lease to act as primary. If no one has a lease, the master grants a lease to a replica it
chooses. ...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed).
Google File System (GFS)
Similar
: Hadoop HDFS, p-NFS, many other parallel file systems.
A
master
server stores metadata (names, file maps) and acts as lock server.
Clients call master to open file, acquire locks, and obtain metadata. Then they
read/write directly to a scalable array of data servers for the actual data. File
data may be spread across many data servers: the maps say where it is.
GFS: leases
•
Primary must hold a “lock” on its chunks.
•
Use leased locks to tolerate primary failures.
We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.
The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the
chunk is being mutated, the primary can request and typically receive exten- sions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all
chunkservers. …Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
Leases (leased locks)
•
A lease is a grant of ownership or
control for a limited time.
•
The owner/holder can renew or
extend the lease.
•
If the owner fails, the lease expires
and is free again.
•
The lease might end early.
–
lock service may
recall
or
evict
A lease service in the real world
acquire
grant
acquire
A
B
x=x+1
X
grant
release
x=x+1
Leases and time
•
The lease holder and lease service must agree when
a lease has expired.
–
i.e., that its expiration time is in the past
–
Even if they can’t communicate!
•
We all have our clocks, but do they agree?
–
synchronized clocks
•
For leases, it is sufficient for the clocks to have a
known bound on clock drift.
–
|T(C
i) – T(C
j)| <
ε
–
Build in
slack time
> ε into the lease protocols as a safety
A network partition
C rashed
router
A
network partition
is any event that blocks all
message traffic between subsets of nodes.
Never two kings at once
acquire
grant
acquire
A
x=x+1
???
B
grant
release
x=x+1
Lease callbacks/recalls
•
GFS master recalls primary leases to give the master
control for metadata operations
–
rename
–
snapshots
...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires….
...When the master receives a snapshot request, it first revokes any
outstanding leases on the chunks in the files it is about to snapshot. This
ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first...
GFS: who is the primary?
•
The master tells clients which chunkserver is the primary
for a chunk.
•
The primary is the current lease owner for the chunk.
•
What if the primary fails?
–
Master gives lease to a new primary.
–
The client’s answer may be cached and may be stale.
Since clients cache chunk locations, they may read from a stale
replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file.
Lease sequence number
•
Each lease has a
lease sequence number
.
–
Master increments it when it issues a lease.
–
Client and replicas get it from the master.
•
Use it to validate that a replica is up to date before
accessing the replica.
–
If replica fails/disconnects, its lease number lags.
–
Easy to detect by comparing lease numbers.
•
The lease sequence number is a common technique. In
GFS chunk version number
•
In GFS, the sequence number for a chunk lease acts as
a
chunk version number.
•
Master passes it to the replicas after issuing a lease, and
to the client in the
chunk handle
.
–
If a replica misses updates to a chunk, its version falls behind
–
Client checks for stale chunks on reads.
–
Replicas report chunk versions to master: master reclaims any
stale chunks, creates new replicas.
Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas…before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica when the
GFS consistency model
Easy.
Primary chooses an arbitrary total
order for concurrent writes. Writes
that cross chunk boundaries are not
atomic: the two chunk primaries
may choose different orders.
Records appended
atomically at least once.
…but file may contain
duplicates and/or padding.
Anything can happen if writes fail:
a failed write may succeed at
some replicas but not others.
Reads from different replicas
may return different results.
PSM: a closer look
•
The following slides are from Roxana Geambasu
–
Summer internship at Microsoft Research
–
Now at Columbia
•
Goal: specify consistency and failure formally
•
For primary/secondary/master (PSM) systems
–
GFS
–
BlueFS (MSR scalable storage behind Hotmail etc.)
•
The study is useful to understand where PSM protocols
35
GFS
Primary Client write ACKwrite read read
value/error write write write ACK ACK/error Master:
• Maintains replica group config.
• Monitoring • Reconfiguration • Recovery
Master
ACK[Roxana Geambasu]
36
Counter-examples to GFS’ Linearizability
n
Counter-example 1 –
Non-atomic writes
¨
In GFS, files are split into chunks, replicated by separate groups
¨
So, writes to file can be
split
and executed non-atomically
n
Counter-example 2 –
Stale reads
¨
In GFS, no strong membership checks are done for reads
¨
So, a read can go to a
stale
replica
n
Counter-example 3 –
Read uncommitted
¨
In GFS, a read can go to
any
replica
¨
So, a read can return the value of an in-progress write
Chunks
“Append atomically at least once”
•
If append would cross chunk boundary
–
Primary pads to chunk boundary; client retries (on next chunk).
•
If append fails on any replica: client retries.
–
May cause duplicates on subset of replicas where earlier write
succeeded..
The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the
chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. … If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data possibly
including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. …
GFS: choose scalable semantics
•
GFS argues that the loose semantics are “good enough”.
–
“Regardless of consistency and concurrency issues, this
approach has served us well.”
Appending is far more efficient and more resilient to application failures than random writes.
In one typical use, a writer generates a file from beginning to end. It …
periodically checkpoints how much has been successfully written. Checkpoints may include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. …Checkpointing allows writers to restart incrementally...
…Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.
Impact on apps
•
Sometimes it is useful to push problems up to a higher level so that
the system doesn’t have to solve them.
•
Instead, change the semantics and place more burden on apps.
•
Here’s how it’s done:
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on
appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records....
Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra
padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent
operations), it can filter them out using unique identifiers in the records, which are often needed anyway…
GFS: talking points
•
commodity solution
•
executables are replicated more widely (why?)
•
pipelining and dissemination tree
•
hard vs. soft state?
–
operation log on the master for namespace
–
chunk inventories are soft state on master
•
Q: could multiple masters (volumes) share the same
pool of chunkservers?
•
atomic namespace ops, and namespace locking
•
chunk balancing, and interconnect
Fault Tolerance
•
High availability
–
fast recovery
•
master and chunkservers restartable in a few seconds
–
chunk replication
•
default: 3 replicas.
–
shadow masters
•
Data integrity
GFS:
master log
log snapshot
Recoverable Data with a Log
volatile memory
log snapshot
Your program (or file system or database software) executes transactions that
read or change the data structures in memory. Your data structures
in memory
Push a checkpoint or
snapshot of the entire structure to disk
periodically
Log transaction events as they occur
Anatomy of a Log
•
A log
is a sequence of records (entries) on
recoverable storage (e.g., disk).
•
Each entry is associated with some
operation/transaction T.
•
Create log entries for T as T executes, to
record progress of T.
•
Log writes are atomic and durable, and
complete detectably in order.
– Append/write entry to log
– Truncate older entries up to time t
•
Log entries for T must be durable (e.g.,
pushed to disk) before any effects of T
become visible.
–
Called write-ahead logging
(old)
(new)
LSN 11 XID 18 LSN 13 XID 19 LSN 12 XID 18 LSN 14 XID 18 commit...
Log
Sequence
Number
(LSN)
Transaction
ID (XID)
commit
record
GFS master log: details
All metadata is kept in the master’s memory. The first two types (names- paces and file-to-chunk mapping) are also kept persistent by logging
mutations to an operation log…. The operation log contains a historical [and persistent] record of critical metadata changes… it also serves as a logical time line that defines the order of concurrent operations…the master’s operation log defines a global total order of these operations.
The log records contain the sequence of commands/requests executed, rather than the updates made as they executed: this is called operational logging.
The master recovers its file system state by replaying the operation log… loading the latest checkpoint from local disk and replaying only the limited number of log records after that.
“Replay” the log by re-executing the operations listed in the log, in order. They are deterministic: the result is the same as their execution before the crash. First it rolls back to the state of the most recent checkpoint. Then it re-executes only the operations after the checkpoint, since the checkpoint already includes the results from all preceding operations.
GFS master log: details
We must store [the operation log] reliably and not make changes visible to clients until metadata changes are made persistent. …Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely.
The master batches several log records together before flushing thereby
reducing the impact of flushing and replication on overall system throughput.
The log is made durable by writing copies to multiple disks/machines. The log writes must complete before the master responds to the operation request. This is a form of write-ahead logging also called output commit.
This is a common optimization called group commit. It increases disk write bandwidth for the log at the price of increasing commit latency.
Parallel File Systems 101
§
Manage data sharing in large data stores
[Renu Tewari, IBM]
Asymmetric
• E.g., PVFS2, Lustre, High Road
• Ceph, GFS
Symmetric
• E.g., GPFS, Polyserve
Parallel NFS (pNFS)
pNFS
Clients
Block (FC) /
Object (OSD) /
File (NFS)
Storage
NFSv4+ Server
data[David Black, SNIA]
Modifications to standard NFS protocol (v4.1, 2005-2010) to offload
bulk data storage to a scalable cluster of block servers or OSDs.
Based on an asymmetric structure similar to GFS and Ceph.
pNFS architecture
• Only this is covered by the pNFS protocol
• Client-to-storage data path and server-to-storage control path are specified elsewhere, e.g.
– SCSI Block Commands (SBC) over Fibre Channel (FC)
– SCSI Object-based Storage Device (OSD) over iSCSI
– Network File System (NFS)
pNFS
Clients
Block (FC) /
Object (OSD) /
File (NFS)
Storage
NFSv4+ Server
datapNFS basic operation
•
Client gets a layout from the NFS Server
•
The layout maps the file onto storage devices and addresses
•
The client uses the layout to perform direct I/O to storage
•
At any time the server can recall the layout (leases/delegations)
•
Client commits changes and returns the layout when it’s done
•
pNFS is optional, the client can always use regular NFSv4 I/O
Clients
Storage NFSv4+ Server
layout
Cache consistency
•
How to ensure that each read sees the value stored by
the most recent write? (Or some reasonable value)?
•
This problem also appears in multi-core architecture.
•
It appears in distributed data systems of various kinds.
–
DNS, Web
•
Various solutions are available.
–
It may be OK for clients to read data that is “a little bit stale”.
–
In some cases, the clients themselves don’t change the data.
•
But for “strong” consistency, we need (in essence) a
distributed reader/writer lock (SharedLock) for the
clients to coordinate for each piece of data.
NFS: revised picture
BufferCache
FS
Applications
BufferCache
FS
Client
File
server
Multiple clients
BufferCache
FS
Applications
BufferCache
FS
File
server
BufferCache
FS
Applications
BufferCache
FS
Applications
Multiple clients
BufferCache
FS
Applications
BufferCache
FS
Applications
BufferCache
FS
Applications
Read(server=xx.xx
…
, inode=i27412, blockID=27,
…
)
Multiple clients
BufferCache
FS
Applications
BufferCache
FS
Applications
BufferCache
FS
Applications
Write(server=xx.xx
…
, inode=i27412, blockID=27,
…
)
Multiple clients
BufferCache
FS
Applications
BufferCache
FS
Applications
BufferCache
FS
Applications
What if either of the other clients reads that block?
Will it get the right data? What is the “right” data?
Will it get the “last” version of the block written?
How to coordinate reads/writes and caching on multiple clients?
How to keep the copies “in sync”?
Lease example:
network file cache
•
A read lease ensures that no other client is writing the
data. Holder is free to read from its cache.
•
A write lease ensures that no other client is reading or
writing the data. Holder is free to read/write from cache.
•
Writer must push modified (dirty) cached data to the
server before relinquishing lease.
–
Must ensure that another client can see all updates before it is
able to acquire a lease allowing it to read or write.
•
If some client requests a conflicting lock, server may
recall or evict on existing leases.
–
Callback
RPC from server to lock holder: “please release now.”
Lease example
network file cache consistency
This approach is used in NFS and various other networked data services.
Stronger write atomicity
•
The systems in this module are not suitable to store
complex data with multiple writers.
•
“Complex” means structured data, in which multiple
writes to different parts of the structure must complete
together (grouped writes).
•
Q: How to ensure that a group of writes completes “all or
nothing” with respect to crashes and other client reads?
Transactions
BEGIN T1
read X
read Y
…
write X
END
Database systems and other systems use a programming
construct called
atomic transactions
(“ACID”) to represent
a group of related reads/writes, often on different data items.
BEGIN T2
read X
write Y
…
write X
END
Transactions commit
atomically in a serial order.
Atomic
Consistent
Independent