File systems for persistent memory. CS 839 - Persistence

(1)

File systems for persistent memory

CS 839 - Persistence

(2)

Questions on homework?

• Can we shift the schedule and do BPFS on Thursday and Nova on Monday? Drop Aerie or SplitFS.

(3)

Learning outcomes

• Understand how disk-based file systems update metadata and handle consistency

• Understand the properties of NVM that can change file system design

• Understand the key ordering requirements for file systems

• Understand BPFS software and hardware mechanisms and their limitations

(4)

Background story

• PCM is becoming popular, first for main memory

• Obvious approach seems to be to use it for file systems too

• Question: how do you optimize?

(5)

Background: normal file systems

• Use page cache to buffer data in DRAM

• Access SSD through block layer

• Use logging for consistency

(6)

Background: FS data structures

• Standard FS data structures

• Superblock: describes FS parameters, location of root inode

• Inode: metadata for a single file

• Attributes, size, location of data blocks

• Number

• Data block: holds file or directory contents

• Directory entry: String name and inode number

• Inode and data block bitmaps: track free/used locations on storage

• Indirect block: location of other data blocks or indirect blocks

(7)

Background: FS consistency

• What gets updated when appending to a file?

• Allocate block from data bitmap

• Write data to data block

• Write block address to inode or indirect block

• Update file length & modification time in inode

• What happens if system crashes in the middle?
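One way to explore the crash question is to enumerate which of the four updates reached storage. The following is a toy Python sketch, not any real file system: `apply`, `check`, `BLOCK`, and the dictionary layout are made up for illustration, and it assumes storage may persist any subset of the updates.

```python
from itertools import combinations

BLOCK = 7  # hypothetical block number picked by the allocator
UPDATES = ["alloc", "data", "link", "size"]


def apply(persisted):
    """Build the on-storage state if only the updates in `persisted` survive."""
    fs = {"bitmap": set(), "data": {}, "ptrs": [], "size": 0}
    if "alloc" in persisted:
        fs["bitmap"].add(BLOCK)            # allocate block from data bitmap
    if "data" in persisted:
        fs["data"][BLOCK] = b"new bytes"   # write data to data block
    if "link" in persisted:
        fs["ptrs"].append(BLOCK)           # write block address to inode
    if "size" in persisted:
        fs["size"] = len(b"new bytes")     # update file length in inode
    return fs


def check(fs):
    """Return a list of inconsistencies found in the toy state."""
    problems = []
    for b in fs["bitmap"]:
        if b not in fs["ptrs"]:
            problems.append("leaked block")
    for b in fs["ptrs"]:
        if b not in fs["bitmap"]:
            problems.append("pointer to free block")
        if b not in fs["data"]:
            problems.append("pointer to garbage")
    if fs["size"] > 0 and not fs["ptrs"]:
        problems.append("size covers missing block")
    return problems


for k in range(len(UPDATES) + 1):
    for subset in combinations(UPDATES, k):
        print(list(subset), "->", check(apply(set(subset))) or "consistent")
```

In this model, persisting only the bitmap update leaks a block, while persisting only the inode pointer leaves a pointer to a free, unwritten block.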

(8)

Background: FS consistency mechanisms

• Journaling: write metadata (and/or data) to a journal before writing it in place – redo logging

• Write journal, force to storage

• Later checkpoint – write metadata/data in real place

• Can skip data journaling for performance

• Shadow updates: write data/metadata updates to new location (used in BPFS)

• Basically copy-on-write data structures
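A minimal sketch of redo journaling, assuming a toy in-memory model: `ToyJournal`, its methods, and the record format are hypothetical, and appending to the `journal` list stands in for a forced write to storage.

```python
class ToyJournal:
    def __init__(self, home):
        self.home = dict(home)  # in-place ("real") locations
        self.journal = []       # stands in for the durable journal

    def log(self, addr, value):
        """Redo record: the new value is written to the journal first."""
        self.journal.append(("write", addr, value))

    def commit(self):
        """The commit record makes all preceding records take effect."""
        self.journal.append(("commit",))

    def recover(self):
        """Replay committed records in place; drop uncommitted ones."""
        pending = []
        for record in self.journal:
            if record[0] == "write":
                pending.append(record[1:])
            else:  # commit record
                for addr, value in pending:
                    self.home[addr] = value
                pending = []
        self.journal = []


fs = ToyJournal({"A": "old A", "B": "old B"})
fs.log("A", "A'")
fs.log("B", "B'")
fs.recover()                 # crash before commit: nothing changes
uncommitted = dict(fs.home)

fs.log("A", "A'")
fs.log("B", "B'")
fs.commit()
fs.recover()                 # crash after commit: both updates are redone
committed = dict(fs.home)
```

The same `recover` pass doubles as the checkpoint step in this sketch; a real journal would checkpoint lazily and reclaim log space separately.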

(9)

Review 1: Journaling

• Write to journal, then write to file system

[Figure: blocks A and B in the file system; new versions A’ and B’ are written to the journal, then to the file system.]

• Reliable, but all data is written twice


(10)

Review 2: Shadow Paging

• Use copy-on-write up to root of file system

[Figure: tree under the file’s root pointer; blocks A and B are replaced by copies A’ and B’.]

• Any change requires bubbling to the FS root

• Small writes require large copying overhead
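The bubbling cost can be seen in a small sketch. This assumes a toy tree of nested tuples; `cow_update` and the example tree are hypothetical, and swapping `root` stands in for the single atomic root-pointer update.

```python
def cow_update(node, path, value):
    """Return a new tree with the leaf at `path` replaced; `node` is never mutated."""
    if not path:
        return value
    children = list(node)  # copy this level of the tree
    children[path[0]] = cow_update(node[path[0]], path[1:], value)
    return tuple(children)


old_root = (("A", "B"), ("C", "D"))
new_root = cow_update(old_root, [0, 1], "B'")

# Commit point: one pointer swap makes the new version visible.
root = new_root
```

Every node on the path from the changed leaf to the root is copied, while untouched subtrees such as `("C", "D")` are shared between the old and new versions.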


(11)

Atomicity requirements

• What happens when you crash while writing data to a file?

1. Entire write takes place or none takes place

2. Some blocks may be written entirely but not all

3. Arbitrary bytes of file may be replaced

• What do normal file systems do?

• “torn write” – partially written block

• Data vs metadata journaling

(12)

Basic idea: RAM disk

• Idea 1: RAM disk

• Make a block device that accesses NVM instead of going to a storage device

• BTT: block-translation table, uses shadow updates to allow atomic block-sized writes

• Problems:

• Still copy data to DRAM – inefficient

• All writes are block sized -- inefficient
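The shadow-update idea behind a block-translation table can be sketched as follows. This is a simplified illustration, not the real BTT layout: `ToyBTT`, the single spare block, and the `crash_before_commit` flag are assumptions made for the example.

```python
class ToyBTT:
    def __init__(self, n_blocks, block_size=4096):
        self.block_size = block_size
        # One extra physical block serves as the spare (shadow) block.
        self.physical = [bytes(block_size) for _ in range(n_blocks + 1)]
        self.map = list(range(n_blocks))  # logical block -> physical block
        self.free = n_blocks              # index of the spare block

    def read(self, lba):
        return self.physical[self.map[lba]]

    def write(self, lba, data, crash_before_commit=False):
        if len(data) != self.block_size:
            raise ValueError("writes must be exactly one block")
        self.physical[self.free] = data   # 1. write new data to the spare block
        if crash_before_commit:
            return                        # old block is still mapped
        old = self.map[lba]
        self.map[lba] = self.free         # 2. one map-entry update commits
        self.free = old                   # 3. old block becomes the spare


btt = ToyBTT(n_blocks=4, block_size=8)
btt.write(0, b"AAAAAAAA")
btt.write(0, b"BBBBBBBB", crash_before_commit=True)
```

After the simulated crash, `btt.read(0)` still returns the old block in full, which is the atomic block-sized write the slide refers to.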

(13)

What changes with NVM/SCM/PMem?

(14)

What changes with NVM/SCM/PMem?

• Fine grained writes

• Don’t have to write entire blocks when updating a single value

• Fast random access

• Don’t need to optimize metadata for sequential extents

• No buffering

• Can serve data directly from memory

• But:

• Loss of ordering

(15)

Short-Circuit Shadow Paging

• Uses byte-addressability and atomic 64b writes

[Figure: tree with blocks A and B and their copies A’ and B’.]

• Inspired by shadow paging

– Optimization: In-place update when possible

(16)

Opt. 1: In-Place Writes

• Aligned 64-bit writes are performed in place

• Data and metadata

[Figure: in-place write]

(17)

Opt. 2: Exploit Data-Metadata Invariants

• Appends committed by updating file size

[Figure: in-place append under the file’s root pointer + size, followed by the file size update.]
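The two optimizations can be read as a decision rule for each write. The sketch below is a simplification under stated assumptions: `choose_update` and `WORD` are hypothetical names, any write that fits inside one aligned 8-byte word is treated as atomic, and a write that straddles the current end of file conservatively falls back to copy-on-write.

```python
WORD = 8  # bytes covered by one atomic 64-bit write


def choose_update(offset, length, file_size):
    """Pick a commit strategy for writing `length` bytes at `offset`."""
    if length <= 0:
        raise ValueError("length must be positive")
    end = offset + length
    fits_one_word = offset // WORD == (end - 1) // WORD
    if fits_one_word and end <= file_size:
        return "in-place"         # a single aligned 64-bit store is atomic
    if offset >= file_size:
        return "in-place append"  # invisible until the file size is updated
    return "copy-on-write"        # copy blocks, then swap one pointer
```

For example, `choose_update(16, 8, 4096)` selects an in-place write, while `choose_update(12, 8, 4096)` crosses a word boundary and falls back to copy-on-write.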

(18)

BPFS Example

[Figure: BPFS tree with labels: root pointer, indirect blocks, inode file, inodes, directory, directory, file; operations shown: remove entry, add entry.]

• Cross-directory rename bubbles to common ancestor

(19)

What happens if you memory-map a file?

(20)

Consistent updates

• Rely on hardware for 1-word atomic update

➢ CPU cache may reorder writes to NVM

• Breaks “crash-consistent” update protocols

[Figure: a write-back cache and NVM each hold value and valid; value changes from 0xDEADBEEF to 0xC02 and valid from 0 to 1. Instruction sequence: STORE value = 0xC02; STORE valid = 1.]

(21)

Primitive operation: ordering writes

• Why?

• Ensures ability to commit a change

• How?

• Flush – MOVNTQ/CLFLUSH

• Fence – MFENCE

• Inefficiencies:

• Removes recent data from cache

[Figure: a write-back cache and NVM each hold value and valid; value changes from 0xDEADBEEF to 0xC02 and valid from 0 to 1. Instruction sequence: STORE value = 0xC02; FLUSH (&value); FENCE; STORE valid = 1.]
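To see why the flush and fence matter, the sketch below enumerates every NVM state a crash could expose. It is a toy model, not real hardware: `reachable_nvm_states` and the tuple-based instruction format are invented for illustration, the cache is assumed to write back any dirty line at any time, and a flush is assumed to complete synchronously (so the fence is a no-op here).

```python
def reachable_nvm_states(program, initial):
    """All NVM contents a crash could leave behind, at any point in `program`."""
    start = (0, (), tuple(sorted(initial.items())))
    seen, stack, results = set(), [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pc, cache_items, nvm_items = state
        results.add(nvm_items)  # power may fail right here
        cache, nvm = dict(cache_items), dict(nvm_items)
        # Choice 1: the cache writes back any one dirty line on its own.
        for var, val in cache.items():
            rest = {k: v for k, v in cache.items() if k != var}
            after = dict(nvm)
            after[var] = val
            stack.append((pc, tuple(sorted(rest.items())),
                          tuple(sorted(after.items()))))
        # Choice 2: the CPU executes the next instruction.
        if pc < len(program):
            op = program[pc]
            if op[0] == "store":
                cache[op[1]] = op[2]
            elif op[0] == "flush" and op[1] in cache:
                nvm[op[1]] = cache.pop(op[1])
            # "fence" needs no action: flush is synchronous in this model
            stack.append((pc + 1, tuple(sorted(cache.items())),
                          tuple(sorted(nvm.items()))))
    return [dict(items) for items in results]


INITIAL = {"value": 0xDEADBEEF, "valid": 0}
UNORDERED = [("store", "value", 0xC02), ("store", "valid", 1)]
ORDERED = [("store", "value", 0xC02), ("flush", "value"), ("fence",),
           ("store", "valid", 1)]


def can_see_stale_valid(program):
    """True if some crash state has valid == 1 but the old value."""
    return any(s["valid"] == 1 and s["value"] == 0xDEADBEEF
               for s in reachable_nvm_states(program, INITIAL))
```

In this model the unordered sequence can expose `valid = 1` next to the stale `0xDEADBEEF`, and the flushed sequence cannot.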

(22)

Ordering in BPFS

[Figure: CoW writes followed by a commit write, flowing through the L1 / L2 caches to BPRAM.]

(23)

Atomicity in BPFS

[Figure: CoW writes followed by a commit write, flowing through the L1 / L2 caches to BPRAM.]

(24)

Enforcing Ordering and Atomicity

• Ordering

• Solution: Epoch barriers to declare constraints

• Faster than write-through

• Important hardware primitive (cf. SCSI TCQ)

• Atomicity

• Solution: Capacitor on DIMM

• Simple and cheap!


(25)

Intel x86 flush mechanism

[Figure: timeline with labels ST A, CLWB A, SFENCE, ACK, and “SFENCE commits”.]

Instruction sequence: ST A; ST B; CLWB A; CLWB B; SFENCE; ST C; CLWB C; SFENCE

(26)

Intel x86 flush mechanism

Drawback 1: No distinction between ordering and durability

Drawback 2: Ordering introduces stalls

Instruction sequence: ST A; ST B; CLWB A; CLWB B; SFENCE; ST C; CLWB C; SFENCE

(27)

Epoch ordering

• Goal:

• No software flushes – too expensive/complex

• Ordering is asynchronous – too expensive to stall

• Solution

• Persist barriers

(28)

Persist barriers: Ordering Fence

• Orders stores preceding barrier before later stores

[Figure: Thread 1 issues ST A=1, Barrier, ST B=2. Two timelines, volatile memory order and persistence order, both show ST A=1 happens before ST B=2.]

(29)

Ordering Epochs without Flushing

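Instruction sequence on CPU 1:

1. ST A = 1

2. ST B = 1

3. LD R1 = A

4. BARRIER

5. ST A = 2

[Figure: CPU 1 with a local timestamp (25, then 26) and an L1 cache whose lines carry epoch tags: A = 1 (25), A = 2 (26), B = 1 (25).]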

(30)

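Ordering and Atomicity with Epoch Barriers

[Figure: CoW writes, a barrier, then the commit write, flowing through the L1 / L2 caches to BPRAM. Cache lines carry epoch tags 1, 1, 1, and 2; one line is marked “Ineligible for eviction!”.]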

(31)

Epoch ordering complexity

• When is it safe to let something leave the cache?

• When all writes from preceding epoch have left already

• What happens if you overwrite something from a preceding epoch?

• Must flush earlier epoch first – can’t store multiple versions

• What happens when you access something from another core?

• Can’t track ordering across cores (epoch numbers across cores aren’t ordered)

• Old data must be flushed

• How do you implement efficiently?

• Store 8-bit pointer in each cacheline to registers holding 8 in-flight epochs
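The first two rules above can be sketched for a single core. This is a software model only, with hypothetical names (`EpochCache`, `nvm_log`); it ignores cross-core accesses and the limit on in-flight epochs.

```python
class EpochCache:
    def __init__(self, epoch=0):
        self.epoch = epoch  # current epoch number of this core
        self.lines = {}     # addr -> (value, epoch tag) for dirty lines
        self.nvm_log = []   # order in which writes reach persistent memory

    def barrier(self):
        self.epoch += 1     # later stores belong to a new epoch

    def _write_back(self, addr):
        value, epoch = self.lines.pop(addr)
        self.nvm_log.append((addr, value, epoch))

    def flush_through(self, epoch):
        """Write back every dirty line tagged `epoch` or earlier, oldest first."""
        for addr in sorted(self.lines, key=lambda a: self.lines[a][1]):
            if self.lines[addr][1] <= epoch:
                self._write_back(addr)

    def store(self, addr, value):
        if addr in self.lines and self.lines[addr][1] < self.epoch:
            # A line cannot hold two versions: drain the older epoch first.
            self.flush_through(self.lines[addr][1])
        self.lines[addr] = (value, self.epoch)

    def can_evict(self, addr):
        """A line may leave only if no older epoch is still in the cache."""
        oldest = min(epoch for _, epoch in self.lines.values())
        return self.lines[addr][1] == oldest

    def evict(self, addr):
        if not self.can_evict(addr):
            raise ValueError("older epoch still dirty")
        self._write_back(addr)


cache = EpochCache(epoch=25)
cache.store("A", 1)
cache.store("B", 1)
cache.barrier()
cache.store("A", 2)  # overwrites an epoch-25 line, so epoch 25 drains first
```

After this sequence, `cache.nvm_log` holds the epoch-25 versions of A and B, and only the epoch-26 version of A remains dirty in the cache.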

(32)

Considerations for epoch ordering

• How complex is it?

• How easy to use is it?

(33)

Considerations for epoch ordering

• How complex is it?

• Need hardware walkers to evict cachelines during cache replacement

• How easy to use is it?

• Dependencies across volatile variables not recorded

• Example:

• Could reboot with Y=2, A=4

Acquire(vol_lock);

X = 1;

Y = 2;

Release(vol_lock);

Acquire(vol_lock);

A = 4;

B = 5;

Release(vol_lock);

(34)

Microbenchmarks

[Figure: two bar charts, “Random n Byte Write” and “Append n Bytes”, for n = 8, 64, 512, 4096 bytes, comparing NTFS - Disk, NTFS - RAM, and BPFS - RAM. Axis labels: “Time (s)” and “Thousands”. Annotations: “NOT DURABLE!” and “DURABLE!”.]

(35)

Notes from reviews

• How much performance improvement should we expect?

• How important is using real PCM (or real PCM latency) in evaluation?

• Could we have systems with just Pmem and no SSD?

• What journaling mode does NTFS use?

• Ordered journaling

• Is modifying HW ok?

• Using volatile structures

• Free blocks, freed & allocated inode numbers, data freed by CoW operation

• Dentry cache

(36)

How well does it perform?

• Evaluation:

• Implement in Windows & run over DRAM (no epoch barrier delays)

• Implement in usermode & run in a simulator

• Analytical model