File System Design and Implementation

(1)

Transactions and Reliability

Sarah Diesburg Operating Systems CS 3430

(2)

Motivation

File systems have lots of metadata:

Free blocks, directories, file headers, indirect blocks

Metadata is heavily cached for performance

(3)

Problem

System crashes

OS needs to ensure that the file system does not reach an inconsistent state

Example: move a file between directories

Remove a file from the old directory

Add a file to the new directory

What happens when a crash occurs in the middle?

(4)

UNIX File System (Ad Hoc Failure-Recovery)

Metadata handling:

Uses a synchronous write-through caching policy

A call to update metadata does not return until the changes are propagated to disk

Updates are ordered

When crashes occur, run fsck to repair in-progress operations
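A minimal sketch of what "does not return until the changes are propagated to disk" means at the system-call level. The helper name write_through and the file contents are hypothetical; os.fsync is the standard call that blocks until the OS reports the data as flushed. This illustrates the policy from user space and is not how UFS itself is implemented.

```python
import os

def write_through(path: str, data: bytes) -> None:
    """Write data and return only after the OS reports it flushed to storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # block until the kernel has written it out

def ordered_update(first_path: str, second_path: str) -> None:
    # Updates are ordered: the second write starts only after the first is on disk.
    write_through(first_path, b"step 1")
    write_through(second_path, b"step 2")
```

Because each call blocks on the disk, this policy is simple to reason about but slow, which is why it is reserved for metadata.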

(5)

Some Examples of Metadata Handling

Undo effects not yet visible to users

If a new file is created, but not yet added to the directory

Delete the file

Continue effects that are visible to users

If file blocks are already allocated, but not recorded in the bitmap

Update the bitmap

(6)

UFS User Data Handling

Uses a write-back policy

Modified blocks are written to disk at 30-second intervals

Unless a user issues the sync system call

Data updates are not ordered

In many cases, consistent metadata is good enough

(7)

Example: Vi

Vi saves changes by doing the following

1. Writes the new version in a temp file

Now we have old_file and new_temp file

2. Moves the old version to a different temp file

Now we have new_temp and old_temp

3. Moves the new version into the real file

Now we have new_file and old_temp

4. Removes the old version

(8)

Example: Vi

When crashes occur

Looks for the leftover files

Moves forward or backward depending on the integrity of files
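The save and recovery steps above can be sketched as follows. This is an illustration under assumptions, not Vi's actual code: the temp-file suffixes and the function names save and recover are hypothetical, and the recovery rule (prefer the real file, else move forward if a complete new version exists, else move backward) is one plausible reading of "depending on the integrity of files".

```python
import os

def save(real: str, new_contents: str) -> None:
    new_temp = real + ".new_temp"   # hypothetical temp-file names
    old_temp = real + ".old_temp"
    # 1. Write the new version in a temp file and force it to disk.
    with open(new_temp, "w") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())
    # 2. Move the old version to a different temp file.
    if os.path.exists(real):
        os.rename(real, old_temp)
    # 3. Move the new version into the real file.
    os.rename(new_temp, real)
    # 4. Remove the old version.
    if os.path.exists(old_temp):
        os.remove(old_temp)

def recover(real: str) -> None:
    new_temp = real + ".new_temp"
    old_temp = real + ".old_temp"
    if os.path.exists(real):
        # Crash during step 1 or after step 3: the real file is intact,
        # so just clean up the leftovers.
        for leftover in (new_temp, old_temp):
            if os.path.exists(leftover):
                os.remove(leftover)
    elif os.path.exists(new_temp) and os.path.exists(old_temp):
        # Crash between steps 2 and 3: move forward to the new version.
        os.rename(new_temp, real)
        os.remove(old_temp)
    elif os.path.exists(old_temp):
        # Only the old version survives: move backward.
        os.rename(old_temp, real)
```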

(9)

Transaction Approach

A transaction groups operations as a unit, with the following characteristics:

Atomic: all operations either happen or they do not (no partial operations)

Serializable: transactions appear to happen one after the other

Durable: once a transaction happens, it is recoverable and can survive crashes

(10)

More on Transactions

A transaction is not done until it is committed

Once committed, a transaction is durable

If a transaction fails to complete, it must roll back as if it did not happen at all

Critical sections are atomic and serializable, but not durable

(11)

Transaction Implementation (One Thread)

Example: money transfer

Begin transaction

x = x - 1;

y = y + 1;

Commit

(12)

Transaction Implementation (One Thread)

Common implementations involve the use of a log, a journal that is never erased

A file system uses a write-ahead log to track all transactions

(13)

Transaction Implementation (One Thread)

Once the changes to accounts x and y are in the log, the log is committed to disk in a single write

Actual changes to those accounts are done later

(14)

Transaction Illustrated

x = 1; y = 1; x = 1;

(15)

Transaction Illustrated

x = 1; y = 1; x = 0;

(16)

Transaction Illustrated

x = 1; y = 1; x = 0; y = 2;

Log: begin transaction; old x: 1; old y: 1; new x: 0; new y: 2; commit

Commit the log to disk before updating the actual values on disk

(17)

Transaction Steps

Mark the beginning of the transaction

Log the changes in account x

Log the changes in account y

Commit

Modify account x on disk

Modify account y on disk
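The steps above can be sketched with an in-memory model. The dictionary disk and the list log stand in for on-disk state and the write-ahead log; both, along with the record format and the function name transfer, are assumptions made for illustration.

```python
def transfer(disk: dict, log: list) -> None:
    """Move one unit from account x to account y using a write-ahead log."""
    old_x, old_y = disk["x"], disk["y"]
    new_x, new_y = old_x - 1, old_y + 1

    log.append(("begin",))                      # mark the beginning
    log.append(("update", "x", old_x, new_x))   # log the change to x
    log.append(("update", "y", old_y, new_y))   # log the change to y
    log.append(("commit",))                     # durable from this point on

    # Only after the commit record is logged are the accounts modified.
    disk["x"] = new_x
    disk["y"] = new_y

disk = {"x": 1, "y": 1}
log = []
transfer(disk, log)
```

With the starting values from the illustration (x = 1, y = 1), the log holds old and new values for both accounts before either account is touched.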

(18)

Scenarios of Crashes

If a crash occurs after the commit

Replays the log to update accounts

If a crash occurs before or during the commit

Rolls back the transaction as if it did not happen
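A sketch of recovery for the two crash scenarios, assuming a log of tuples of the form ("begin",), ("update", name, old, new), and ("commit",). The record format and the function name recover are hypothetical. Because on-disk values are modified only after the commit, an uncommitted transaction has not touched the accounts, so ignoring its records is enough to roll it back.

```python
def recover(disk: dict, log: list) -> None:
    """Replay committed transactions; ignore any transaction without a commit."""
    pending = []
    for record in log:
        kind = record[0]
        if kind == "begin":
            pending = []
        elif kind == "update":
            pending.append(record)
        elif kind == "commit":
            # Crash after the commit: replay the logged new values.
            for _, name, _old, new in pending:
                disk[name] = new
            pending = []
    # Anything still in pending never committed and is discarded.

# Crash after commit, with only x updated on disk:
disk_a = {"x": 0, "y": 1}
recover(disk_a, [("begin",), ("update", "x", 1, 0),
                 ("update", "y", 1, 2), ("commit",)])

# Crash before commit:
disk_b = {"x": 1, "y": 1}
recover(disk_b, [("begin",), ("update", "x", 1, 0)])
```

Replaying is safe to repeat because each record stores the new value itself rather than an increment.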

(19)

Two-Phase Locking (Multiple Threads)

Logging alone is not enough to prevent multiple transactions from trashing one another (not serializable)

Solution: two-phase locking

1. Acquire all locks

2. Perform updates and release all locks

Thread A cannot see thread B’s changes until thread B commits and releases locks
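A minimal sketch of two-phase locking with Python threads. The account names, balances, and the transfer function are hypothetical. Acquiring locks in sorted order is an addition not stated above; it is a common way to keep two transactions from deadlocking while each waits for the other's lock.

```python
import threading

accounts = {"x": 100, "y": 100}
locks = {name: threading.Lock() for name in accounts}

def transfer(src: str, dst: str, amount: int) -> None:
    needed = sorted({src, dst})   # fixed global order (assumption, avoids deadlock)
    # Phase 1: acquire all locks before touching any data.
    for name in needed:
        locks[name].acquire()
    try:
        accounts[src] -= amount
        accounts[dst] += amount
    finally:
        # Phase 2: updates are done, release all locks.
        for name in reversed(needed):
            locks[name].release()

threads = [threading.Thread(target=transfer, args=("x", "y", 1)) for _ in range(50)]
threads += [threading.Thread(target=transfer, args=("y", "x", 1)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No thread observes a state where money has left one account but not yet arrived in the other, since both locks are held for the whole update.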

(20)

Transactions in File Systems

Almost all file systems built since 1985 use write-ahead logging

NTFS, HFS+, ext3, ext4, …

+ Eliminates running fsck after a crash

+ Write-ahead logging provides reliability

- All modifications need to be written twice

(21)

Log-Structured File System (LFS)

If logging is so great, why don’t we treat everything as log entries?

Log-structured file system

Everything is a log entry (file headers, directories, data blocks)

Write the log only once

Use version stamps to distinguish between old and new entries
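A toy sketch of the version-stamp idea: every write appends a new entry, and a reader picks the entry with the highest stamp. The class name Log, its methods, and the use of a simple counter as the version stamp are all assumptions; a real LFS also keeps an index so reads do not scan the whole log.

```python
class Log:
    def __init__(self):
        self.entries = []       # (version, key, value), in append order
        self.next_version = 0   # monotonically increasing version stamp

    def append(self, key, value):
        """Append a new entry; older entries for the same key are left in place."""
        version = self.next_version
        self.entries.append((version, key, value))
        self.next_version += 1
        return version

    def latest(self, key):
        """Return the value with the highest version stamp for key, or None."""
        best = None
        for version, k, value in self.entries:
            if k == key and (best is None or version > best[0]):
                best = (version, value)
        return None if best is None else best[1]

log = Log()
log.append("file_header:foo", "size=1")
log.append("data:foo:0", "hello")
log.append("file_header:foo", "size=2")
```

Here the first file header entry stays in the log but is superseded by the later one.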

(22)

More on LFS

New log entries are always appended to the end of the existing log

All writes are sequential

Seeks occur only during reads

Not so bad due to temporal locality and caching

Problem:

Need to create more contiguous space all the time

(23)

RAID and Reliability

So far, we assume that we have a single disk

What if we have multiple disks?

The chance of a single-disk failure increases

RAID: redundant array of independent disks

Standard way of organizing disks and classifying the reliability of multi-disk systems

General methods: data duplication, parity, and error-correcting codes (ECC)

(24)

RAID 0

No redundancy

Uses block-level striping across disks

i.e., 1st block stored on disk 1, 2nd block stored on disk 2
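Block-level striping amounts to a simple address calculation. The sketch below numbers blocks and disks from 0 (so the "1st block" is block 0 and "disk 1" is disk 0); the function name locate is hypothetical.

```python
def locate(block: int, num_disks: int) -> tuple:
    """Map a logical block number to (disk index, block offset on that disk)."""
    return block % num_disks, block // num_disks

# With 4 disks, consecutive blocks land on consecutive disks.
layout = [locate(b, 4) for b in range(6)]
```

Consecutive blocks fall on different disks, so a large sequential transfer can use all disks in parallel.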

(25)

Non-Redundant Disk Array Diagram (RAID Level 0)

open(foo) read(bar) write(zoo)

File System

(26)

Mirrored Disks (RAID Level 1)

Each disk has a second disk that mirrors its contents

Writes go to both disks

+ Reliability is doubled

+ Read access faster

- Write access slower
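A toy model of a mirrored pair, with Python lists standing in for disks. The class name Mirror and the failed parameter are hypothetical and exist only to show that a read can be served by whichever copy is available.

```python
class Mirror:
    def __init__(self, num_blocks: int):
        # Two disks with identical contents.
        self.disks = [[None] * num_blocks, [None] * num_blocks]

    def write(self, block: int, data) -> None:
        # Writes go to both disks.
        for disk in self.disks:
            disk[block] = data

    def read(self, block: int, failed=None):
        # Serve the read from the first disk that has not failed.
        for index, disk in enumerate(self.disks):
            if index != failed:
                return disk[block]
        raise IOError("both copies unavailable")

m = Mirror(4)
m.write(2, "payload")
```

Every write costs two disk writes, while a read needs only one of the two copies.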

(27)

Mirrored Disk Diagram (RAID Level 1)

File System

(28)

Memory-Style ECC (RAID Level 2)

Some disks in array are used to hold ECC

Byte to detect error, extra bits for error correcting

Bit-level striping

Bit 1 of file on disk 1, bit 2 of file on disk 2

+ More efficient than mirroring

+ Can correct, not just detect, errors

- Still fairly inefficient

(29)

Memory-Style ECC Diagram (RAID Level 2)

File System

(30)

Byte-Interleaved Parity (RAID Level 3)

Uses byte-level striping across disks

i.e., 1st byte stored on disk 1, 2nd byte stored on disk 2

One disk in the array stores parity for the other disks

Parity can be used to recover bits on a lost disk

No detection bits needed, relies on disk controller to detect errors

+ More efficient than Levels 1 and 2

- Parity disk doesn’t add bandwidth

(31)

Parity Method

Disk 1: 1001

Disk 2: 0101

Disk 3: 1000

Parity: 0100 = 1001 xor 0101 xor 1000

To recover disk 2: 0101 = 1001 xor 1000 xor 0100 (disk 1 xor disk 3 xor parity)
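The parity example can be checked with a few lines of code, using the same disk contents. The function name parity is hypothetical; recovery works because XOR-ing the parity with the surviving disks cancels their contribution and leaves the lost disk's bits.

```python
from functools import reduce
from operator import xor

def parity(blocks):
    """XOR all blocks together."""
    return reduce(xor, blocks, 0)

disks = [0b1001, 0b0101, 0b1000]   # disks 1, 2, 3
p = parity(disks)                  # contents of the parity disk

# Disk 2 is lost: rebuild it from the surviving disks and the parity.
recovered = parity([disks[0], disks[2], p])
```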

(32)

Byte-Interleaved RAID Diagram (Level 3)

File System

(33)

Block-Interleaved Parity (RAID Level 4)

Like byte-interleaved, but data is interleaved in blocks

+ More efficient data access than level 3

- Parity disk can be a bottleneck

- Small writes require 4 I/Os

Read the old block

Read the old parity

Write the new block

Write the new parity
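A sketch of the small-write sequence, with lists standing in for disks. The names small_write, data_disks, and parity_disk are hypothetical. The new parity is the old parity XOR the old block XOR the new block, so the other data disks never need to be read; writing that new parity is the fourth I/O.

```python
def small_write(data_disks, parity_disk, disk, stripe, new_block):
    """Update one block and its parity using 4 I/Os."""
    old_block = data_disks[disk][stripe]     # I/O 1: read the old block
    old_parity = parity_disk[stripe]         # I/O 2: read the old parity
    data_disks[disk][stripe] = new_block     # I/O 3: write the new block
    # I/O 4: write the new parity
    parity_disk[stripe] = old_parity ^ old_block ^ new_block

data_disks = [[0b1001], [0b0101], [0b1000]]
parity_disk = [0b0100]
small_write(data_disks, parity_disk, disk=1, stripe=0, new_block=0b1111)
```

After the write, the stored parity still equals the XOR of all three data blocks.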

(34)

Block-Interleaved Parity Diagram (RAID Level 4)

File System

(35)

Block-Interleaved Distributed-Parity (RAID Level 5)

Sort of the most general level of RAID

Spreads the parity out over all disks

+ No parity disk bottleneck

+ All disks contribute read bandwidth
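One way to spread parity over all disks is to rotate the parity position from stripe to stripe. The rotation rule below (parity on disk stripe mod N) and the function names are assumptions chosen for simplicity; real arrays use several different layouts.

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Disk holding the parity block for a stripe (assumed simple rotation)."""
    return stripe % num_disks

def data_disk(stripe: int, index: int, num_disks: int) -> int:
    """Disk holding the index-th data block of a stripe, skipping the parity disk."""
    p = parity_disk(stripe, num_disks)
    return index if index < p else index + 1

# With 4 disks, each stripe has 3 data blocks and 1 parity block.
stripes = [
    (parity_disk(s, 4), [data_disk(s, i, 4) for i in range(3)])
    for s in range(4)
]
```

Over four stripes every disk holds parity exactly once, so no single disk absorbs all parity writes and every disk also serves data reads.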

(36)

Block-Interleaved Distributed-Parity Diagram (RAID Level 5)

File System