Scalable File Storage

(1)

Duke Systems

Scalable File Storage

Jeff Chase

Duke University

(2)

Why a shared network file service?

• Data sharing across people and their apps

– common name space (/usr/project/stuff…)

• Resource sharing

– fluid mapping of data to storage resources

– incremental scalability

– diversity of demands, not predictable in advance

– statistical multiplexing, central limit theorem

• Obvious?

(3)

Network File System (NFS)

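[Figure: NFS client and server stacks. On the client, user programs enter the kernel through the syscall layer and VFS, which dispatches to the local UFS or to the NFS client. The NFS client talks over the network to the NFS server, which reaches its own UFS through VFS.]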

Virtual File System (VFS) enables pluggable file system implementations as OS kernel modules (“drivers”).

(4)

(5)

Google File System (GFS)

• SOSP 2003

• Foundation for data-intensive parallel cluster computing at Google

– MapReduce OSDI 2004, 2000+ cites

• Client access by RPC library, through kernel system calls (via FUSE)

• Uses Chubby lock service for consensus

– e.g., Master election

(6)

Google File System (GFS)

Similar: Hadoop HDFS, pNFS, many other parallel file systems.

A master server stores metadata (names, file maps) and acts as lock server.

Clients call master to open file, acquire locks, and obtain metadata. Then they read/write directly to a scalable array of data servers for the actual data. File data may be spread across many data servers: the maps say where it is.

(7)

GFS (or HDFS) and MapReduce

• Large files

• Streaming access (sequential)

• Parallel access

• Append-mode writes

• Record-oriented

(8)

MapReduce: Example

Handles failures automatically, e.g., restarts tasks if a node fails. Runs multiple copies of a task so a slow node does not limit the job.

(9)

(10)

GFS Architecture

Separate data (chunks) from metadata (names etc.).

Centralize the metadata; spread the chunks around.

(11)

Chunks

• Variable size, up to 64MB

• Stored as a file, named by a handle

• Replicated on multiple nodes, e.g., x3

– chunkserver == datanode

• Master caches chunk maps

– per-file chunk map: what chunks make up a file
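A per-file chunk map can be pictured as an ordered list of chunk handles, indexed by byte offset. The sketch below is only an illustration: the file name, the handles, and the function names are hypothetical, and it assumes every chunk except possibly the last is filled to the 64 MB maximum.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB maximum chunk size

# Hypothetical per-file chunk map: file name -> ordered list of chunk handles.
chunk_map = {"/data/log": ["h-001", "h-002", "h-003"]}


def chunk_index(offset: int) -> int:
    """Index of the chunk that holds the given byte offset (assumes full chunks)."""
    return offset // CHUNK_SIZE


def locate(path: str, offset: int) -> tuple[str, int]:
    """Return (chunk handle, offset within that chunk) for a byte offset in a file."""
    handles = chunk_map[path]
    return handles[chunk_index(offset)], offset % CHUNK_SIZE
```

A client would then ask the master which replicas hold the returned handle and read from one of them directly.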

(12)

GFS Architecture

• Single master, multiple chunkservers

(13)

Single master

• From distributed systems we know this is a:

– Single point of failure

– Scalability bottleneck

• GFS solutions:

– Shadow masters

– Minimize master involvement

• never move data through it, use only for metadata, and cache metadata at clients

• large chunk size

• master delegates authority to primary replicas in data mutations (chunk leases)

(14)

(15)

(16)

GFS Write

The client asks the master for a list of replicas, and which replica holds the lease to act as primary. If no one has a lease, the master grants a lease to a replica it chooses. ...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed).

(17)

(18)

(19)

(20)


(21)

(22)

GFS: leases

• Primary must hold a “lock” on its chunks.

• Use leased locks to tolerate primary failures.

We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.

The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. …Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.

(23)

Leases (leased locks)

• A lease is a grant of ownership or control for a limited time.

• The owner/holder can renew or extend the lease.

• If the owner fails, the lease expires and is free again.

• The lease might end early.

– lock service may recall or evict
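The lease rules above can be sketched as a small in-memory service. This is a minimal illustration, not GFS's implementation: `LeaseService` and its method names are hypothetical, time is passed in explicitly instead of being read from a clock, and the 60-second default is borrowed from the GFS lease timeout quoted earlier.

```python
class LeaseService:
    """Grants at most one unexpired lease per resource (hypothetical sketch)."""

    def __init__(self, duration: float = 60.0):
        self.duration = duration
        self.leases = {}  # resource -> (holder, expiry time)

    def acquire(self, resource, holder, now):
        current = self.leases.get(resource)
        if current is not None:
            owner, expiry = current
            if owner != holder and now < expiry:
                return False  # another holder still has an unexpired lease
        self.leases[resource] = (holder, now + self.duration)
        return True

    def renew(self, resource, holder, now):
        current = self.leases.get(resource)
        if current is None or current[0] != holder or now >= current[1]:
            return False  # not the holder, or the lease already expired
        self.leases[resource] = (holder, now + self.duration)
        return True

    def release(self, resource, holder):
        current = self.leases.get(resource)
        if current is not None and current[0] == holder:
            del self.leases[resource]


svc = LeaseService()
svc.acquire("chunk-7", "A", now=0)        # A becomes the holder until t=60
svc.acquire("chunk-7", "B", now=30)       # refused: A's lease is unexpired
svc.acquire("chunk-7", "B", now=60)       # granted: A never renewed
```

If A fails and stops renewing, nothing else is needed to free the lease: it simply expires.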

(24)

A lease service in the real world

[Figure: timeline of clients A and B using a lease service. Labels in order: acquire, grant, x=x+1, a failure (X), grant, x=x+1, release.]

(25)

Leases and time

• The lease holder and lease service must agree when a lease has expired.

– i.e., that its expiration time is in the past

– Even if they can’t communicate!

• We all have our clocks, but do they agree?

– synchronized clocks

• For leases, it is sufficient for the clocks to have a known bound on clock drift.

– $|T(C_i) - T(C_j)| < \epsilon$

– Build in slack time $> \epsilon$ into the lease protocols as a safety margin.
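One way to build in the slack is for the holder to give up the lease early, while the service waits for the full term. The sketch below assumes that convention; the constant values and function names are hypothetical.

```python
EPSILON = 0.5  # assumed bound on clock disagreement, in seconds (hypothetical value)
SLACK = 1.0    # safety margin, chosen larger than EPSILON


def holder_may_act(expiry: float, holder_clock: float) -> bool:
    """The holder stops using the lease SLACK seconds before its nominal expiry."""
    return holder_clock < expiry - SLACK


def service_may_regrant(expiry: float, service_clock: float) -> bool:
    """The service waits until the nominal expiry on its own clock."""
    return service_clock >= expiry
```

Because the two clocks differ by less than EPSILON and SLACK exceeds EPSILON, there is no moment at which the service may regrant while the old holder still believes it may act.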

(26)

A network partition

[Figure: a crashed router partitions the network.]

A network partition is any event that blocks all message traffic between subsets of nodes.

(27)

Never two kings at once

[Figure: timeline of clients A and B using a lease service. Labels in order: acquire, grant, x=x+1, ???, grant, x=x+1, release.]

(28)

Lease callbacks/recalls

• GFS master recalls primary leases to give the master control for metadata operations

– rename

– snapshots

...The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires….

...When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first...

(29)

GFS: who is the primary?

• The master tells clients which chunkserver is the primary for a chunk.

• The primary is the current lease owner for the chunk.

• What if the primary fails?

– Master gives lease to a new primary.

– The client’s answer may be cached and may be stale.

Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file.

(30)

Lease sequence number

• Each lease has a lease sequence number.

– Master increments it when it issues a lease.

– Client and replicas get it from the master.

• Use it to validate that a replica is up to date before accessing the replica.

– If replica fails/disconnects, its lease number lags.

– Easy to detect by comparing lease numbers.

• The lease sequence number is a common technique. In

(31)

GFS chunk version number

• In GFS, the sequence number for a chunk lease acts as a chunk version number.

• Master passes it to the replicas after issuing a lease, and to the client in the chunk handle.

– If a replica misses updates to a chunk, its version falls behind

– Client checks for stale chunks on reads.

– Replicas report chunk versions to master: master reclaims any stale chunks, creates new replicas.
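The version check can be sketched for a single chunk as follows. The class and method names are hypothetical, and real replicas would report their versions over the network instead of sharing a dictionary with the master.

```python
class ChunkMaster:
    """Tracks the current version of one chunk and what each replica last recorded."""

    def __init__(self, replica_names):
        self.version = 0
        self.replica_versions = {name: 0 for name in replica_names}

    def grant_lease(self, reachable):
        """Bump the chunk version and inform only the replicas that are reachable."""
        self.version += 1
        for name in reachable:
            self.replica_versions[name] = self.version
        return self.version

    def stale_replicas(self):
        """Replicas whose recorded version lags the master's current version."""
        return sorted(
            name for name, v in self.replica_versions.items() if v < self.version
        )


master = ChunkMaster(["r1", "r2", "r3"])
master.grant_lease(["r1", "r2", "r3"])   # all replicas at version 1
master.grant_lease(["r1", "r2"])         # r3 is unreachable and stays at version 1
```

After the second grant, `master.stale_replicas()` names `r3`, which the master can reclaim and replace.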

Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas…before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica when the

(32)

(33)

GFS consistency model

Easy.

Primary chooses an arbitrary total order for concurrent writes. Writes that cross chunk boundaries are not atomic: the two chunk primaries may choose different orders.

Records appended atomically at least once. …but file may contain duplicates and/or padding.

Anything can happen if writes fail: a failed write may succeed at some replicas but not others. Reads from different replicas may return different results.

(34)

PSM: a closer look

• The following slides are from Roxana Geambasu

– Summer internship at Microsoft Research

– Now at Columbia

• Goal: specify consistency and failure formally

• For primary/secondary/master (PSM) systems

– GFS

– BlueFS (MSR scalable storage behind Hotmail etc.)

• The study is useful to understand where PSM protocols

(35)

GFS

[Figure: GFS as a primary/secondary/master system. Labels: client, primary, master; write, read, ACK, value/error, ACK/error.]

Master:

• Maintains replica group config.

• Monitoring

• Reconfiguration

• Recovery

[Roxana Geambasu]

(36)

Counter-examples to GFS’ Linearizability

• Counter-example 1 – Non-atomic writes

– In GFS, files are split into chunks, replicated by separate groups

– So, writes to file can be split and executed non-atomically

• Counter-example 2 – Stale reads

– In GFS, no strong membership checks are done for reads

– So, a read can go to a stale replica

• Counter-example 3 – Read uncommitted

– In GFS, a read can go to any replica

– So, a read can return the value of an in-progress write

(37)

“Append atomically at least once”

• If append would cross chunk boundary

– Primary pads to chunk boundary; client retries (on next chunk).

• If append fails on any replica: client retries.

– May cause duplicates on subset of replicas where earlier write succeeded.

The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. … If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. …
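The primary's padding decision can be sketched as a pure function of how full the chunk is. The function name and return convention are hypothetical; replication to the secondaries and the client's retry loop are left out.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # maximum chunk size (64 MB)


def record_append(chunk_used: int, record_len: int):
    """Decide where a record goes in the current chunk.

    Returns (action, offset, new_chunk_used). The offset is None when the
    chunk is padded and the client must retry on the next chunk.
    """
    if record_len > CHUNK_SIZE:
        raise ValueError("record cannot fit in any chunk")
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the chunk to its maximum size; the record is not written here.
        return ("retry_next_chunk", None, CHUNK_SIZE)
    # The record fits: it lands at the current end of the chunk.
    return ("appended", chunk_used, chunk_used + record_len)
```

Note that the primary, not the client, picks the offset, which is what lets many clients append to the same file concurrently.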

(38)

GFS: choose scalable semantics

• GFS argues that the loose semantics are “good enough”.

– “Regardless of consistency and concurrency issues, this approach has served us well.”

Appending is far more efficient and more resilient to application failures than random writes.

In one typical use, a writer generates a file from beginning to end. It … periodically checkpoints how much has been successfully written. Checkpoints may include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. …Checkpointing allows writers to restart incrementally...

…Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.

(39)

Impact on apps

• Sometimes it is useful to push problems up to a higher level so that the system doesn’t have to solve them.

• Instead, change the semantics and place more burden on apps.

• Here’s how it’s done:

GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records....

Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway…
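A reader along these lines might look like the sketch below. The record layout (a dictionary with `id`, `payload`, and `crc` fields) and the use of `None` to stand for padding are assumptions made for illustration, not the GFS on-disk format.

```python
import zlib


def make_record(record_id: int, payload: bytes) -> dict:
    """Writer side: attach a unique identifier and a checksum to each record."""
    return {"id": record_id, "payload": payload, "crc": zlib.crc32(payload)}


def read_valid(records):
    """Reader side: drop padding, corrupt fragments, and duplicate records."""
    seen = set()
    payloads = []
    for rec in records:
        if rec is None or zlib.crc32(rec["payload"]) != rec["crc"]:
            continue  # padding or a fragment that fails its checksum
        if rec["id"] in seen:
            continue  # duplicate left behind by a retried append
        seen.add(rec["id"])
        payloads.append(rec["payload"])
    return payloads
```

The checksum makes each record self-validating and the identifier makes it self-identifying, so the reader needs no help from the file system to clean up.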

(40)

GFS: talking points

• commodity solution

• executables are replicated more widely (why?)

• pipelining and dissemination tree

• hard vs. soft state?

– operation log on the master for namespace

– chunk inventories are soft state on master

• Q: could multiple masters (volumes) share the same pool of chunkservers?

• atomic namespace ops, and namespace locking

• chunk balancing, and interconnect

(41)

Fault Tolerance

• High availability

– fast recovery

• master and chunkservers restartable in a few seconds

– chunk replication

• default: 3 replicas.

– shadow masters

• Data integrity

(42)

GFS: master log

[Figure: the master's log and snapshot.]

(43)

Recoverable Data with a Log

[Figure: your data structures in volatile memory, backed by a log and a snapshot on disk.]

Your program (or file system or database software) executes transactions that read or change the data structures in memory.

Log transaction events as they occur.

Push a checkpoint or snapshot of the entire structure to disk periodically.

(44)

Anatomy of a Log

• A log is a sequence of records (entries) on recoverable storage (e.g., disk).

• Each entry is associated with some operation/transaction T.

• Create log entries for T as T executes, to record progress of T.

• Log writes are atomic and durable, and complete detectably in order.

– Append/write entry to log

– Truncate older entries up to time t

• Log entries for T must be durable (e.g., pushed to disk) before any effects of T become visible.

– Called write-ahead logging

[Figure: a log drawn from old to new entries: LSN 11 (XID 18), LSN 12 (XID 18), LSN 13 (XID 19), LSN 14 (XID 18, commit record), ... Each entry carries a Log Sequence Number (LSN) and a Transaction ID (XID).]

(45)

GFS master log: details

All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log…. The operation log contains a historical [and persistent] record of critical metadata changes… it also serves as a logical time line that defines the order of concurrent operations…the master’s operation log defines a global total order of these operations.

The log records contain the sequence of commands/requests executed, rather than the updates made as they executed: this is called operational logging.

The master recovers its file system state by replaying the operation log… loading the latest checkpoint from local disk and replaying only the limited number of log records after that.

“Replay” the log by re-executing the operations listed in the log, in order. They are deterministic: the result is the same as their execution before the crash. First it rolls back to the state of the most recent checkpoint. Then it re-executes only the operations after the checkpoint, since the checkpoint already includes the results from all preceding operations.
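Replay from a checkpoint can be sketched with a toy namespace. The three operation kinds and the tuple encoding below are hypothetical stand-ins for the master's real log records; what matters is that each operation is deterministic and that replay starts from a copy of the checkpoint.

```python
def apply(state: dict, op: tuple) -> None:
    """Re-execute one logged operation against the in-memory namespace."""
    kind = op[0]
    if kind == "create":
        state[op[1]] = []
    elif kind == "add_chunk":
        state[op[1]].append(op[2])
    elif kind == "delete":
        del state[op[1]]
    else:
        raise ValueError(f"unknown operation: {kind}")


def recover(checkpoint: dict, log_tail: list) -> dict:
    """Start from the latest checkpoint and replay only the operations after it."""
    state = {name: list(chunks) for name, chunks in checkpoint.items()}
    for op in log_tail:  # in log order
        apply(state, op)
    return state


checkpoint = {"/a": ["h1"]}
log_tail = [("create", "/b"), ("add_chunk", "/b", "h2"), ("delete", "/a")]
state = recover(checkpoint, log_tail)
```

Because the log stores operations, not resulting values, recovery is only correct if `apply` produces the same result every time it is given the same state and operation.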

(46)

GFS master log: details

We must store [the operation log] reliably and not make changes visible to clients until metadata changes are made persistent. …Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely.

The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.

The log is made durable by writing copies to multiple disks/machines. The log writes must complete before the master responds to the operation request. This is a form of write-ahead logging also called output commit.

Batching log records before flushing is a common optimization called group commit. It increases disk write bandwidth for the log at the price of increasing commit latency.
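Group commit can be sketched as a log that buffers records and acknowledges them only when the whole batch is flushed. The class is hypothetical, and extending a list stands in for the disk write and remote replication.

```python
class GroupCommitLog:
    """Buffers log records and makes them durable one batch at a time."""

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.pending = []   # appended but not yet durable
        self.durable = []   # stands in for the on-disk, replicated log
        self.flushes = 0    # number of (expensive) flush operations

    def append(self, record):
        """Buffer a record; return the records that became durable, if any."""
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []

    def flush(self):
        batch, self.pending = self.pending, []
        if batch:
            self.durable.extend(batch)  # one write covers the whole batch
            self.flushes += 1
        return batch
```

With a batch size of 3, the first two callers wait for the third before any of them can be answered: fewer flushes, but higher latency per operation.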

(47)

Parallel File Systems 101

• Manage data sharing in large data stores

[Renu Tewari, IBM]

Asymmetric

•  E.g., PVFS2, Lustre, High Road

•  Ceph, GFS

Symmetric

•  E.g., GPFS, Polyserve

(48)

Parallel NFS (pNFS)

[Figure: pNFS clients, an NFSv4+ server, and storage accessed as block (FC), object (OSD), or file (NFS); data moves between clients and storage.]

[David Black, SNIA]

Modifications to standard NFS protocol (v4.1, 2005-2010) to offload bulk data storage to a scalable cluster of block servers or OSDs.

Based on an asymmetric structure similar to GFS and Ceph.

(49)

pNFS architecture

•  Only the client-to-server path is covered by the pNFS protocol

•  Client-to-storage data path and server-to-storage control path are specified elsewhere, e.g.

–  SCSI Block Commands (SBC) over Fibre Channel (FC)

–  SCSI Object-based Storage Device (OSD) over iSCSI

–  Network File System (NFS)


(50)

pNFS basic operation

• Client gets a layout from the NFS Server

• The layout maps the file onto storage devices and addresses

• The client uses the layout to perform direct I/O to storage

• At any time the server can recall the layout (leases/delegations)

• Client commits changes and returns the layout when it’s done

• pNFS is optional, the client can always use regular NFSv4 I/O

[Figure: clients, NFSv4+ server, and storage; the server hands a layout to the client.]

(51)

Cache consistency

• How to ensure that each read sees the value stored by

the most recent write? (Or some reasonable value)?

• This problem also appears in multi-core architecture.

• It appears in distributed data systems of various kinds.

– DNS, Web

• Various solutions are available.

– It may be OK for clients to read data that is “a little bit stale”.

– In some cases, the clients themselves don’t change the data.

• But for “strong” consistency, we need (in essence) a distributed reader/writer lock (SharedLock) for the clients to coordinate for each piece of data.

(52)

NFS: revised picture

[Figure: a client (applications, FS, buffer cache) and a file server (FS, buffer cache).]

(53)

Multiple clients

[Figure: three clients, each with applications, FS, and buffer cache, sharing one file server that has its own FS and buffer cache.]

(54)

Multiple clients

[Figure: three clients, each with applications, FS, and buffer cache, and a request:]

Read(server=xx.xx…, inode=i27412, blockID=27, …)

(55)

Multiple clients

[Figure: three clients, each with applications, FS, and buffer cache, and a request:]

Write(server=xx.xx…, inode=i27412, blockID=27, …)

(56)

Multiple clients

[Figure: three clients, each with applications, FS, and buffer cache.]

What if either of the other clients reads that block?

Will it get the right data? What is the “right” data?

Will it get the “last” version of the block written?

How to coordinate reads/writes and caching on multiple clients?

How to keep the copies “in sync”?

(57)

Lease example: network file cache

• A read lease ensures that no other client is writing the data. Holder is free to read from its cache.

• A write lease ensures that no other client is reading or writing the data. Holder is free to read/write from cache.

• Writer must push modified (dirty) cached data to the server before relinquishing lease.

– Must ensure that another client can see all updates before it is able to acquire a lease allowing it to read or write.

• If some client requests a conflicting lock, server may recall or evict existing leases.

– Callback RPC from server to lock holder: “please release now.”
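The read/write lease rules can be sketched for a single cached block. All names are hypothetical, a direct method call stands in for the callback RPC, and lease expiry is ignored so that only the recall path is shown.

```python
class Client:
    def __init__(self, name):
        self.name = name
        self.cache = None   # cached copy of the block, if any
        self.dirty = False

    def write(self, value):
        """Modify the cached copy; requires a write lease."""
        self.cache = value
        self.dirty = True

    def on_recall(self, server):
        """Callback from the server: push dirty data, then drop the cache."""
        if self.dirty:
            server.data = self.cache
            self.dirty = False
        self.cache = None


class LeaseServer:
    def __init__(self, data=b""):
        self.data = data
        self.readers = set()
        self.writer = None

    def acquire(self, client, mode):
        # A write lease held by someone else conflicts with any new lease.
        if self.writer is not None and self.writer is not client:
            self.writer.on_recall(self)
            self.writer = None
        if mode == "write":
            # Read leases held by other clients conflict with a write lease.
            for reader in list(self.readers - {client}):
                reader.on_recall(self)
            self.readers.clear()
            self.writer = client
        else:
            self.readers.add(client)
        if client.cache is None:
            client.cache = self.data  # fill the cache under the new lease


server = LeaseServer(b"v0")
a, b = Client("A"), Client("B")
server.acquire(a, "write")
a.write(b"v1")               # dirty data lives only in A's cache
server.acquire(b, "read")    # recall forces A to push b"v1" first
```

B's read lease is granted only after A's dirty block reaches the server, so B cannot observe the old value.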

(58)

Lease example

network file cache consistency

This approach is used in NFS and various other networked data services.

(59)

Stronger write atomicity

• The systems in this module are not suitable to store complex data with multiple writers.

• “Complex” means structured data, in which multiple writes to different parts of the structure must complete together (grouped writes).

• Q: How to ensure that a group of writes completes “all or nothing” with respect to crashes and other client reads?

(60)

Transactions

BEGIN T1

read X

read Y

…

write X

END

Database systems and other systems use a programming construct called atomic transactions (“ACID”) to represent a group of related reads/writes, often on different data items.

BEGIN T2

read X

write Y

…

write X

END

Transactions commit atomically in a serial order.

Atomic, Consistent, Independent

(61)

ACID properties of transactions

• Transactions are Atomic

– Each transaction either commits or aborts: it either executes entirely or not at all.

– Transactions don’t interfere with one another (I).

• Transactions appear to commit in some serial order (serializable schedule).

• Each transaction is coded to transition the store from one Consistent state to another.

• One-copy serializability (1SR): Transactions observe the effects of their predecessors, and not of their successors.

• Transactions are Durable.

(62)

Serial schedule

[Figure: a serial schedule. Transactions T1, T2, …, Tn move the store through consistent states S0, S1, S2, …, Sn.]

A consistent state is one that does not violate any internal invariant relationships in the data.

(63)

Limitations of Transactions?

• Why not use ACID transactions for everything?

• How much work is it to serialize and commit transactions?

(64)

Two-phase commit (2PC)

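[Figure: message exchange between TM/C (transaction manager, the coordinator) and RM/P (resource managers, the participants).]

1. Precommit or prepare: TM asks “commit or abort?”

2. Vote: each RM answers “here’s my vote”. RMs validate Tx and prepare by logging their local updates and decisions.

3. Decide: if unanimous to commit, decide to commit, else decide to abort. TM logs commit/abort (commit point).

4. Notify: TM sends “commit/abort!”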
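The message flow can be sketched without failures or timeouts as follows. The class and function names are hypothetical, in-memory lists stand in for durable logs, and recovery after a crash of the TM or an RM is not shown.

```python
class Participant:
    """A resource manager (RM) that votes and then obeys the decision."""

    def __init__(self, can_commit: bool):
        self.can_commit = can_commit
        self.log = []
        self.state = "init"

    def prepare(self) -> bool:
        # Log the local decision before voting, so the vote survives a crash.
        self.log.append("prepared" if self.can_commit else "vote-abort")
        return self.can_commit

    def finish(self, decision: str) -> None:
        self.log.append(decision)
        self.state = decision


def two_phase_commit(participants, tm_log) -> str:
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    # Commit only if the vote is unanimous.
    decision = "commit" if all(votes) else "abort"
    tm_log.append(decision)  # the commit point: the TM logs its decision
    # Phase 2: notify every participant of the outcome.
    for p in participants:
        p.finish(decision)
    return decision


outcome = two_phase_commit([Participant(True), Participant(False)], [])
```

One "no" vote is enough to abort everyone, which is why `outcome` is `"abort"` here.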

More on 2PC later.

What about CAP?

(65)

Transactions: logging

1. Begin transaction

2. Append info about modifications to a log

3. Append “commit” to log to end x-action

4. Write new data to normal database

→ Single-sector write commits x-action (3)

Invariant: append new data to log before applying to DB

Called “write-ahead logging”

[Figure: log contents in order: Begin, Write1, …, WriteN, Commit. Transaction complete.]

(66)

Transactions: logging

1. Begin transaction

2. Append info about modifications to a log

3. Append “commit” to log to end x-action

4. Write new data to normal database

→ Single-sector write commits x-action (3)

What if we crash here (between 3,4)?

On reboot, reapply committed updates in log order.

[Figure: log contents in order: Begin, Write1, …, WriteN, Commit.]

(67)

Transactions: logging

1. Begin transaction

2. Append info about modifications to a log

3. Append “commit” to log to end x-action

4. Write new data to normal database

→ Single-sector write commits x-action (3)

What if we crash here?

On reboot, discard uncommitted updates.

[Figure: log contents in order: Begin, Write1, …, WriteN, with no Commit record.]

(68)

Recovery from a log

• Log entries for T record the writes by T (or operations in T).

– Redo logging: the log records contain sufficient info to “redo” T after a crash.

• To recover, read the checkpoint and replay committed log entries.

– “Redo” by reissuing writes or reinvoking the operations/methods.

– Redo in order (old to new)

– Skip the records of uncommitted Ts

– Skip records of any Ts that committed before the atomic checkpoint was taken.
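The recovery rules above can be sketched over a toy key-value store. The record layout `(lsn, xid, kind, key, value)` is an assumption made for illustration, and the sketch assumes the checkpoint was taken between transactions, so a transaction is either entirely inside the checkpoint or entirely after it.

```python
def recover(checkpoint: dict, checkpoint_lsn: int, log: list) -> dict:
    """Rebuild the store from a checkpoint plus a redo log.

    Each log record is (lsn, xid, kind, key, value) with kind "write" or "commit".
    """
    # Find the LSN of each transaction's commit record, if it has one.
    commit_lsn = {xid: lsn for lsn, xid, kind, _k, _v in log if kind == "commit"}
    db = dict(checkpoint)
    for lsn, xid, kind, key, value in sorted(log, key=lambda rec: rec[0]):
        if kind != "write":
            continue
        if xid not in commit_lsn:
            continue  # skip the records of uncommitted transactions
        if commit_lsn[xid] <= checkpoint_lsn:
            continue  # committed before the checkpoint: already included
        db[key] = value  # redo the write, oldest first
    return db


log = [
    (11, 18, "write", "x", 1),
    (12, 18, "write", "y", 2),
    (13, 19, "write", "x", 99),   # transaction 19 never commits
    (14, 18, "commit", None, None),
]
db = recover({"x": 0}, checkpoint_lsn=10, log=log)
```

Transaction 18's writes are redone in LSN order, while transaction 19's write is discarded because the log holds no commit record for it.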

[Figure: a log from old to new entries, each labeled with a Log Sequence Number (LSN) and a Transaction ID (XID), with a commit record marked.]

(69)

Managing a log

• On checkpoint, truncate the log

– No longer need the entries to recover.

• Checkpoint how often? Tradeoff:

– Checkpoints are expensive, BUT

– Long logs take up space.

– Long logs increase recovery time.

• Checkpoint+truncate is “atomic”

– Is it safe to redo/replay records whose effect is already in the checkpoint?

– Checkpoint “between” transactions, so checkpoint records a consistent state.

– Lots of approaches

[Figure: a log from old to new entries, each labeled with a Log Sequence Number (LSN) and a Transaction ID (XID), with a commit record marked.]