File System Design and Implementation

(1)

Transactions and Reliability

Sarah Diesburg Operating Systems CS 3430

(2)

Motivation

File systems have lots of metadata:

Free blocks, directories, file headers, indirect blocks

Metadata is heavily cached for performance

(3)

Problem

System crashes

OS needs to ensure that the file system does not reach an inconsistent state

Example: move a file between directories

Remove a file from the old directory

Add a file to the new directory

What happens when a crash occurs in the middle?

(4)

UNIX File System (Ad Hoc Failure-Recovery)

Metadata handling:

Uses a synchronous write-through caching policy

A call to update metadata does not return until the changes are propagated to disk

Updates are ordered

When crashes occur, run fsck to repair in-progress operations

(5)

Some Examples of Metadata Handling

Undo effects not yet visible to users

If a new file is created, but not yet added to the directory

Delete the file

Continue effects that are visible to users

If file blocks are already allocated, but not yet recorded in the bitmap

Mark those blocks as allocated in the bitmap

(6)

UFS User Data Handling

Uses a write-back policy

Modified blocks are written to disk at 30-second intervals

Unless a user issues the sync system call

Data updates are not ordered

In many cases, consistent metadata is good enough
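A program that cannot wait for the 30-second write-back can force its own data out with the POSIX fsync call; a minimal sketch (the helper name save_durably is illustrative):

/* Write user data and force it to disk immediately, instead of
 * waiting for the periodic write-back. */
#include <fcntl.h>
#include <unistd.h>

int save_durably(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {  /* may still sit in the cache */
        close(fd);
        return -1;
    }
    if (fsync(fd) < 0) {                        /* flush this file's blocks now */
        close(fd);
        return -1;
    }
    return close(fd);
}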

(7)

Example: Vi

Vi saves changes by doing the following

1. Writes the new version in a temp file

Now we have old_file and new_temp

2. Moves the old version to a different temp file

Now we have new_temp and old_temp

3. Moves the new version into the real file

Now we have new_file and old_temp

4. Removes the old version
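The same dance can be sketched in C with the POSIX rename and remove calls; the file names mirror the slide and are illustrative only:

/* vi-style save: the new version is assumed to be fully written
 * (and synced) to new_temp before the renames begin. */
#include <stdio.h>

int vi_style_save(void)
{
    if (rename("file", "old_temp") != 0)   /* step 2: old version aside    */
        return -1;
    if (rename("new_temp", "file") != 0)   /* step 3: new version in place */
        return -1;
    remove("old_temp");                    /* step 4: drop the old version */
    return 0;
}

After a crash, whichever of new_temp, old_temp, and the real file are present tells recovery how far the sequence got.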

(8)

Example: Vi

When a crash occurs, vi

Looks for the leftover temp files

Rolls forward or backward depending on which files are intact

(9)

Transaction Approach

A transaction groups operations as a unit, with the following characteristics:

Atomic: all operations either happen or they do not (no partial operations)

Serializable: transactions appear to happen one after the other

Durable: once a transaction happens, it is recoverable and can survive crashes

(10)

More on Transactions

A transaction is not done until it is committed

Once committed, a transaction is durable

If a transaction fails to complete, it must roll back as if it did not happen at all

Critical sections are atomic and serializable, but not durable

(11)

Transaction Implementation (One Thread)

Example: money transfer

Begin transaction
    x = x - 1;
    y = y + 1;
Commit

(12)

Transaction Implementation (One Thread)

Common implementations involve the use of a log, a journal that is never erased

A file system uses a write-ahead log to track all transactions

(13)

Transaction Implementation (One Thread)

Once the old and new values of x and y are in the log, the log is committed to disk in a single write

Actual changes to those accounts are done later
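A minimal sketch of such a commit, assuming a single-record log file; the record layout and the name commit_transfer are illustrative, not an actual file system format:

/* Write-ahead log record for the x/y transfer. One write covers the
 * whole record; fsync makes the commit durable. */
#include <fcntl.h>
#include <unistd.h>

struct log_record {
    int old_x, old_y;    /* values before the transaction */
    int new_x, new_y;    /* values after the transaction  */
};

int commit_transfer(int log_fd, int x, int y)
{
    struct log_record rec = { x, y, x - 1, y + 1 };

    if (write(log_fd, &rec, sizeof rec) != sizeof rec)
        return -1;
    if (fsync(log_fd) < 0)       /* the transaction commits here */
        return -1;
    /* Only now may the actual x and y be updated in place on disk. */
    return 0;
}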

(14)

Transaction Illustrated

On disk: x = 1; y = 1. In memory: x = 1

(15)

Transaction Illustrated

On disk: x = 1; y = 1. In memory: x = 0

(16)

Transaction Illustrated

On disk: x = 1; y = 1. In memory: x = 0; y = 2. The log holds:

begin transaction
old x: 1
old y: 1
new x: 0
new y: 2
commit

Commit the log to disk before updating the actual values on disk

(17)

Transaction Steps

Mark the beginning of the transaction

Log the changes in account x

Log the changes in account y

Commit

Modify account x on disk

Modify account y on disk

(18)

Scenarios of Crashes

If a crash occurs after the commit

Replays the log to update accounts

If a crash occurs before or during the commit

Discards the incomplete log entries; the on-disk values are untouched, as if the transaction never happened
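A recovery sketch covering both cases, reusing the illustrative record layout from earlier plus a hypothetical committed flag:

struct log_record {
    int old_x, old_y;
    int new_x, new_y;
    int committed;               /* nonzero once the commit hit disk */
};

void recover(const struct log_record *rec, int *x, int *y)
{
    if (rec->committed) {
        /* Crash after commit: replay. Writing new_x/new_y again is
         * idempotent, so a repeated replay is harmless. */
        *x = rec->new_x;
        *y = rec->new_y;
    }
    /* Crash before or during commit: do nothing; disk still holds
     * old_x and old_y, as if the transaction never started. */
}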

(19)

Two-Phase Locking (Multiple Threads)

Logging alone is not enough to prevent multiple transactions from trashing one another (not serializable)

Solution: two-phase locking

1. Acquire all locks

2. Perform updates and release all locks

Thread A cannot see thread B’s changes until thread B commits and releases its locks
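A pthreads sketch of the two phases for the x/y transfer (the lock names are illustrative):

#include <pthread.h>

pthread_mutex_t lock_x = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_y = PTHREAD_MUTEX_INITIALIZER;
int x = 1, y = 1;

void transfer(void)
{
    /* Phase 1: acquire every lock up front (a fixed order avoids deadlock). */
    pthread_mutex_lock(&lock_x);
    pthread_mutex_lock(&lock_y);

    x = x - 1;                      /* updates happen under all locks */
    y = y + 1;

    /* Phase 2: commit, then release everything. No other thread can
     * observe the intermediate state. */
    pthread_mutex_unlock(&lock_y);
    pthread_mutex_unlock(&lock_x);
}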

(20)

Transactions in File Systems

Almost all file systems built since 1985 use write-ahead logging

NTFS, HFS+, ext3, ext4, …

+ Eliminates running fsck after a crash

+ Write-ahead logging provides reliability

- All modifications need to be written twice (once to the log, once in place)

(21)

Log-Structured File System (LFS)

If logging is so great, why don’t we treat everything as log entries?

Log-structured file system

Everything is a log entry (file headers, directories, data blocks)

Write the log only once

Use version stamps to distinguish between old and new entries
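An illustrative C layout for such entries (not an actual LFS on-disk format):

/* Everything in the log is an entry; the version stamp says which
 * copy of an object is current. */
enum entry_type { FILE_HEADER, DIRECTORY, DATA_BLOCK };

struct log_entry {
    enum entry_type type;
    unsigned inode;        /* which file or directory this belongs to */
    unsigned version;      /* higher stamp wins over older copies */
    char data[4096];       /* payload; the size here is illustrative */
};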

(22)

More on LFS

New log entries are always appended to the end of the existing log

All writes are sequential

Seeks occur only during reads

Not so bad due to temporal locality and caching

Problem:

Need to continually create contiguous free space at the end of the log (segment cleaning)

(23)

RAID and Reliability

So far, we have assumed a single disk

What if we have multiple disks?

The chance that at least one disk fails increases

RAID: redundant array of independent disks

Standard way of organizing disks and classifying the reliability of multi-disk systems

General methods: data duplication, parity, and error-correcting codes (ECC)

(24)

RAID 0

No redundancy

Uses block-level striping across disks

i.e., 1st block stored on disk 1, 2nd block stored on disk 2, …
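The striping rule is just modular arithmetic; a sketch with 0-indexed disks and blocks (the slide’s 1-indexed wording maps the same way):

struct location { unsigned disk, offset; };

/* Logical block b on an N-disk RAID 0 array lives on disk b % N
 * at offset b / N: block 0 -> disk 0, block 1 -> disk 1, ... */
struct location raid0_map(unsigned block, unsigned num_disks)
{
    struct location loc = { block % num_disks, block / num_disks };
    return loc;
}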

(25)

Non-Redundant Disk Array Diagram (RAID Level 0)

[Diagram: the file system issues open(foo), read(bar), write(zoo) and stripes the blocks across a non-redundant disk array]

(26)

Mirrored Disks (RAID Level 1)

Each disk has a second disk that mirrors its contents

Writes go to both disks

+ Reliability is doubled

+ Read access is faster (either copy can serve a read)

- Write access is slower (both copies must be written)

(27)

Mirrored Disk Diagram (RAID Level 1)

[Diagram: the file system issues open(foo), read(bar), write(zoo) to mirrored disk pairs]

(28)

Memory-Style ECC (RAID Level 2)

Some disks in the array are used to hold ECC

Bits to detect errors, plus extra bits for error correction

Bit-level striping

Bit 1 of file on disk 1, bit 2 of file on disk 2

+ More efficient than mirroring

+ Can correct, not just detect, errors

- Still fairly inefficient

(29)

Memory-Style ECC Diagram (RAID Level 2)

[Diagram: the file system issues open(foo), read(bar), write(zoo) to bit-striped data disks plus ECC disks]

(30)

Byte-Interleaved Parity (RAID Level 3)

Uses byte-level striping across disks

i.e., 1st byte stored on disk 1, 2nd byte stored on disk 2, …

One disk in the array stores parity for the other disks

Parity can be used to recover bits on a lost disk

No detection bits needed, relies on disk controller to detect errors

+ More efficient than Levels 1 and 2

- Parity disk doesn’t add bandwidth

(31)

Parity Method

Disk 1: 1001

Disk 2: 0101

Disk 3: 1000

Parity: 0100 = 1001 xor 0101 xor 1000

To recover disk 2: 0101 = 1001 xor 1000 xor 0100 (xor the surviving disks with the parity)
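The same arithmetic in C, checking that the lost disk really comes back:

#include <stdio.h>

int main(void)
{
    unsigned d1 = 0x9;                       /* 1001 */
    unsigned d2 = 0x5;                       /* 0101 */
    unsigned d3 = 0x8;                       /* 1000 */

    unsigned parity = d1 ^ d2 ^ d3;          /* 0100 */
    unsigned rebuilt = d1 ^ d3 ^ parity;     /* recover disk 2 */

    printf("parity = %x, rebuilt disk 2 = %x\n", parity, rebuilt);
    return rebuilt == d2 ? 0 : 1;
}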

(32)

Byte-Interleaved RAID Diagram (Level 3)

[Diagram: the file system issues open(foo), read(bar), write(zoo) to byte-striped data disks plus one parity disk]

(33)

Block-Interleaved Parity (RAID Level 4)

Like byte-interleaved, but data is interleaved in blocks

+ More efficient data access than level 3

- Parity disk can be a bottleneck

- Small writes require 4 I/Os

Read the old block

Read the old parity

Write the new block

Write the new parity
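The two reads are enough to compute the new parity without touching the other data disks; a one-line sketch:

/* New parity = old parity with the old data XORed out and the new
 * data XORed in; only the data disk and parity disk are written. */
unsigned new_parity(unsigned old_block, unsigned new_block,
                    unsigned old_parity)
{
    return old_parity ^ old_block ^ new_block;
}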

(34)

Block-Interleaved Parity Diagram (RAID Level 4)

[Diagram: the file system issues open(foo), read(bar), write(zoo) to block-striped data disks plus one parity disk]

(35)

Block-Interleaved Distributed-Parity (RAID Level 5)

Arguably the most general level of RAID

Spreads the parity out over all disks

+ No parity disk bottleneck

+ All disks contribute read bandwidth
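One common way to rotate the parity, as a sketch (real arrays use several layout variants):

/* For stripe s on an N-disk array, place the parity block on disk
 * s % N, so parity writes spread evenly across all disks. */
unsigned parity_disk(unsigned stripe, unsigned num_disks)
{
    return stripe % num_disks;
}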

(36)

Block-Interleaved Distributed-Parity Diagram (RAID Level 5)

[Diagram: the file system issues open(foo), read(bar), write(zoo) to disks with parity blocks rotated across all of them]
