File systems for persistent memory
CS 839 - Persistence
Questions on homework?
• Can we shift the schedule and do BPFS on Thursday and Nova on
Monday? Drop Aerie or SplitFS.
Learning outcomes
• Understand how disk-based file systems update metadata and handle consistency
• Understand the properties of NVM that can change file system design
• Understand the key ordering requirements for file systems
• Understand BPFS software and hardware mechanisms and their
limitations
Background story
• PCM is becoming popular, first for main memory
• The obvious approach seems to be to use it for file systems too
• Question: how do you optimize?
Background: normal file systems
• Use page cache to buffer data in DRAM
• Access SSD through block layer
• Use logging for consistency
Background: FS data structures
• Standard FS data structures
• Superblock: describes FS parameters, location of root inode
• Inode: metadata for a single file
• Attributes, size, location of data blocks
• Inode number
• Data block: holds file or directory contents
• Directory entry: String name and inode number
• Inode and data block bitmaps: track free/used locations on storage
• Indirect block: location of other data blocks or indirect blocks
Background: FS consistency
• What gets updated when appending to a file?
• Allocate block from data bitmap
• Write data to data block
• Write block address to inode or indirect block
• Update file length & modification time in inode
• What happens if system crashes in the middle?
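The crash question above can be explored with a toy model. This is a minimal sketch, assuming the four append steps reach storage in the order listed; the step names and the `state_after` and `problems` helpers are hypothetical, not taken from any real file system.

```python
# Toy model of an append: which inconsistencies appear if the system
# crashes after only the first n steps have reached storage?
STEPS = ["alloc_bitmap", "write_data", "set_pointer", "update_size"]

def state_after(n_steps):
    """On-storage state if the crash happens after the first n_steps steps."""
    done = set(STEPS[:n_steps])
    return {
        "bitmap_allocated": "alloc_bitmap" in done,
        "data_written": "write_data" in done,
        "pointer_set": "set_pointer" in done,
        "size_updated": "update_size" in done,
    }

def problems(s):
    """List the inconsistencies visible after a reboot."""
    found = []
    if s["bitmap_allocated"] and not s["pointer_set"]:
        found.append("leaked block: allocated but not referenced")
    if s["pointer_set"] and not s["data_written"]:
        found.append("garbage data: referenced block never written")
    if s["pointer_set"] and not s["size_updated"]:
        found.append("invisible data: block referenced beyond end of file")
    return found

for n in range(len(STEPS) + 1):
    print(n, problems(state_after(n)) or "consistent")
```

Only the crash points before the first step and after the last step leave a consistent state in this model; if storage could also reorder the steps, the "garbage data" case becomes reachable too.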
Background: FS consistency mechanisms
• Journaling: write metadata (and/or data) to a journal before writing it in place – redo logging
• Write journal, force to storage
• Later checkpoint – write metadata/data in real place
• Can skip data journaling for performance
• Shadow updates: write data/metadata updates to new location (used in BPFS)
• Basically copy-on-write data structures
Review 1: Journaling
• Write to journal, then write to file system
[Figure: journaling. New versions A’ and B’ are written to the journal first, then over blocks A and B in the file system.]
• Reliable, but all data is written twice
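The journaling protocol can be sketched as a small simulation. It assumes a redo journal with an explicit commit record and storage writes that arrive in program order; `storage_writes`, `run_until_crash`, and `recover` are hypothetical names used only for illustration.

```python
# Toy redo journal: every block is written twice (journal, then in place).
OLD = {"A": "A", "B": "B"}
UPDATES = {"A": "A'", "B": "B'"}

def storage_writes(updates):
    """Ordered storage writes for one transaction."""
    writes = [("journal", blk, val) for blk, val in updates.items()]
    writes.append(("commit", None, None))          # commit record
    writes += [("in_place", blk, val) for blk, val in updates.items()]
    return writes

def run_until_crash(updates, crash_after):
    """Apply only the first crash_after writes, then 'crash'."""
    disk, journal = dict(OLD), []
    for kind, blk, val in storage_writes(updates)[:crash_after]:
        if kind == "journal":
            journal.append((blk, val))
        elif kind == "commit":
            journal.append("COMMIT")
        else:
            disk[blk] = val
    return disk, journal

def recover(disk, journal):
    """Replay the journal only if its commit record made it to storage."""
    if "COMMIT" in journal:
        for entry in journal:
            if entry != "COMMIT":
                blk, val = entry
                disk[blk] = val
    return disk

total = len(storage_writes(UPDATES))
outcomes = [recover(*run_until_crash(UPDATES, n)) for n in range(total + 1)]
for n, state in enumerate(outcomes):
    print("crash after", n, "writes ->", state)
```

At every crash point, recovery yields either the old blocks or the new blocks, never a mix. The cost is visible in `storage_writes`: two data writes per block plus the commit record.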
Review 2: Shadow Paging
• Use copy-on-write up to root of file system
[Figure: shadow paging. Copies A’ and B’ are written next to A and B, and the change propagates up to the file’s root pointer.]
• Any change requires bubbling to the FS root
• Small writes require large copying overhead
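The copying overhead can be seen in a few lines. This is a minimal sketch that assumes the file system is a tree of immutable nodes, modeled here as nested tuples with strings as leaf blocks; `cow_write` is a hypothetical helper, not BPFS code.

```python
def cow_write(node, path, new_data):
    """Return (new_node, blocks_copied). The original tree is never modified."""
    if not path:
        return new_data, 1                      # new copy of the leaf block
    idx = path[0]
    new_child, copied = cow_write(node[idx], path[1:], new_data)
    # Copy this interior node so that it points at the new child.
    new_node = node[:idx] + (new_child,) + node[idx + 1:]
    return new_node, copied + 1

old_root = (("A", "B"), ("C", "D"))
new_root, copied = cow_write(old_root, [0, 1], "B'")

# The commit is a single pointer swap from old_root to new_root.
print(new_root)                    # (('A', "B'"), ('C', 'D'))
print(copied)                      # 3
print(new_root[1] is old_root[1])  # True: the untouched subtree is shared
```

Changing one leaf copies every node on the path to the root, which is the overhead that short-circuit shadow paging tries to avoid.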
Atomicity requirements
• What happens when you crash while writing data to a file?
1. Entire write takes place or none takes place
2. Some blocks may be written entirely but not all
3. Arbitrary bytes of file may be replaced
• What do normal file systems do?
• “torn write” – partially written block
• Data vs metadata journaling
Basic idea: RAM disk
• Idea 1: RAM disk
• Make a block device that accesses NVM instead of going to a storage device
• BTT: block-translation table, uses shadow updates to allow atomic block-sized writes
• Problems:
• Still copy data to DRAM – inefficient
• All writes are block sized – inefficient
What changes with NVM/SCM/PMem?
• Fine grained writes
• Don’t have to write entire blocks when updating a single value
• Fast random access
• Don’t need to optimize metadata for sequential extents
• No buffering
• Can serve data directly from memory
• But:
• Loss of ordering
Short-Circuit Shadow Paging
• Uses byte-addressability and atomic 64-bit writes
[Figure: the shadow-paging tree with A, A’, B’ and B below the file’s root pointer]
• Inspired by shadow paging
– Optimization: In-place update when possible
Opt. 1: In-Place Writes
• Aligned 64-bit writes are performed in place
• Data and metadata
[Figure: an in-place write to a block below the file’s root pointer]
• Appends committed by updating file size
[Figure: an in-place append below the file’s root pointer + size, followed by the file size update]
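The append case can be sketched as follows. This is a toy model that assumes the data block is already allocated and that the size field is updated with one atomic 64-bit write; the `File` class and the `crash_before_commit` flag are hypothetical. A real implementation would also need the data to reach persistent memory before the size update, which this model does not capture.

```python
class File:
    def __init__(self, capacity):
        self.block = bytearray(capacity)  # persistent data block (assumed preallocated)
        self.size = 0                     # persistent size field (assumed 8 bytes, atomic)

    def read(self):
        # Bytes past `size` are never visible to readers.
        return bytes(self.block[:self.size])

    def append(self, data, crash_before_commit=False):
        end = self.size + len(data)
        if end > len(self.block):
            raise ValueError("toy model: no block allocation")
        self.block[self.size:end] = data  # in-place write beyond the current end
        if crash_before_commit:
            return                        # size unchanged: the append never happened
        self.size = end                   # a single atomic write commits the append

f = File(16)
f.append(b"abc")
f.append(b"def", crash_before_commit=True)
print(f.read())   # b'abc'
f.append(b"xyz")
print(f.read())   # b'abcxyz'
```

No copy-on-write is needed here because the bytes written before the commit lie beyond the old file size and are therefore invisible.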
Opt. 2: Exploit Data-Metadata Invariants
BPFS Example
[Figure: BPFS tree. Labels: root pointer, indirect blocks, inode file, inodes, two directories and a file; one directory is marked “add entry” and the other “remove entry”.]
• Cross-directory rename bubbles to common ancestor
What happens if you memory-map a file?
• Rely on hardware for 1-word atomic update
➢ CPU cache may reorder writes to NVM
• Breaks “crash-consistent” update protocols
Consistent updates
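[Figure: a write-back cache in front of NVM. The program executes STORE value = 0xC02, then STORE valid = 1. NVM starts with value = 0xDEADBEEF and valid = 0; if the cache writes back valid = 1 first, NVM holds valid = 1 with value = 0xDEADBEEF.]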
Primitive operation: ordering writes
• Why?
• Ensures ability to commit a change
• How?
• Flush – MOVNTQ/CLFLUSH
• Fence – MFENCE
• Inefficiencies:
• Removes recent data from cache
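The value/valid example can be checked exhaustively with a small model. This sketch assumes a write-back cache that may evict any dirty line at any time, models a flush as synchronous, and treats the fence as a no-op marker for that reason; `crash_states` and `torn` are hypothetical helpers and do not model real x86 semantics.

```python
INITIAL = {"value": 0xDEADBEEF, "valid": 0}

def crash_states(program, initial=INITIAL):
    """Every NVM image that a crash could expose while running `program`."""
    seen, results = set(), set()

    def explore(pc, cache, nvm):
        key = (pc, tuple(sorted(cache.items())), tuple(sorted(nvm.items())))
        if key in seen:
            return
        seen.add(key)
        results.add(key[2])                 # a crash here exposes this NVM image
        for var, val in cache.items():      # the cache may evict any dirty line
            rest = {k: v for k, v in cache.items() if k != var}
            explore(pc, rest, {**nvm, var: val})
        if pc == len(program):
            return
        op = program[pc]
        if op[0] == "store":
            explore(pc + 1, {**cache, op[1]: op[2]}, nvm)
        elif op[0] == "flush":
            rest = {k: v for k, v in cache.items() if k != op[1]}
            flushed = {**nvm, op[1]: cache[op[1]]} if op[1] in cache else nvm
            explore(pc + 1, rest, flushed)
        else:                               # "fence": flush is already synchronous here
            explore(pc + 1, cache, nvm)

    explore(0, {}, dict(initial))
    return results

def torn(states):
    """True if some crash leaves valid = 1 with the stale value."""
    return any(dict(s)["valid"] == 1 and dict(s)["value"] == 0xDEADBEEF
               for s in states)

unordered = [("store", "value", 0xC02), ("store", "valid", 1)]
ordered = [("store", "value", 0xC02), ("flush", "value"), ("fence",),
           ("store", "valid", 1)]

print(torn(crash_states(unordered)))  # True
print(torn(crash_states(ordered)))    # False
```

With the flush and fence in place, only three NVM images are reachable in this model: both old, new value with valid = 0, and both new.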
[Figure: the same write-back cache and NVM with ordering enforced: STORE value = 0xC02; FLUSH (&value); FENCE; STORE valid = 1. value = 0xC02 reaches NVM before valid = 1.]
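Ordering in BPFS
[Figure: a sequence of CoW writes followed by a commit write, passing through the L1 / L2 caches to BPRAM]
Atomicity in BPFS
[Figure: a sequence of CoW writes followed by a commit write, passing through the L1 / L2 caches to BPRAM]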
Enforcing Ordering and Atomicity
• Ordering
• Solution: Epoch barriers to declare constraints
• Faster than write-through
• Important hardware primitive (cf. SCSI TCQ)
• Atomicity
• Solution: Capacitor on DIMM
• Simple and cheap!
Intel x86 flush mechanism
[Figure: timeline of ST A, CLWB A, SFENCE; the SFENCE commits only after the write-back of A is acknowledged (ACK)]
Example sequence: ST A; ST B; CLWB A; CLWB B; SFENCE; ST C; CLWB C; SFENCE
Drawback 1: No distinction between ordering and durability
Drawback 2: Ordering introduces stalls
Epoch ordering
• Goal:
• No software flushes – too expensive/complex
• Ordering is asynchronous – too expensive to stall
• Solution
• Persist barriers
Persist barriers: Ordering Fence
[Figure: Thread 1 executes ST A=1, a barrier, then ST B=2. Both the volatile memory order and the persistence order place ST A=1 before ST B=2 (happens-before).]
• Orders stores preceding barrier before later stores
Ordering Epochs without Flushing
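[Figure: CPU 1 with a local timestamp (TS) and an L1 cache whose lines carry epoch tags. Instruction sequence: 1. ST A = 1; 2. ST B = 1; 3. LD R1 = A; 4. BARRIER; 5. ST A = 2. The local TS advances from 25 to 26 at the barrier; A = 1 and B = 1 are tagged 25, A = 2 is tagged 26.]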
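Ordering and Atomicity with Epoch Barriers
[Figure: CoW writes tagged epoch 1, a barrier, then the commit write tagged epoch 2, passing through the L1 / L2 caches to BPRAM. The epoch-2 line is marked “Ineligible for eviction!” while epoch-1 lines remain in the cache.]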
Epoch ordering complexity
• When is it safe to let something leave the cache?
• When all writes from preceding epoch have left already
• What happens if you overwrite something from a preceding epoch?
• Must flush earlier epoch first – can’t store multiple versions
• What happens when you access something from another core?
• Can’t track ordering across cores (epoch numbers across cores aren’t ordered)
• Old data must be flushed
• How do you implement efficiently?
• Store 8-bit pointer in each cacheline to registers holding 8 in-flight epochs
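The eviction and overwrite rules above can be sketched for a single core. This is a software toy, not the hardware design: the `EpochCache` class and its methods are hypothetical, cross-core accesses are not modeled, and the starting epoch of 25 is chosen only to mirror the earlier example.

```python
class EpochCache:
    def __init__(self, start_epoch=0):
        self.epoch = start_epoch
        self.lines = {}   # addr -> (value, epoch tag) for dirty lines
        self.nvm = {}
        self.log = []     # order in which writes reached NVM

    def barrier(self):
        self.epoch += 1   # later stores belong to a new epoch

    def _write_back(self, addr):
        value, epoch = self.lines.pop(addr)
        self.nvm[addr] = value
        self.log.append((addr, value, epoch))

    def flush_through(self, epoch):
        """Write back all dirty lines tagged `epoch` or earlier, oldest first."""
        for addr in sorted(self.lines, key=lambda a: self.lines[a][1]):
            if self.lines[addr][1] <= epoch:
                self._write_back(addr)

    def store(self, addr, value):
        if addr in self.lines and self.lines[addr][1] < self.epoch:
            # The cache cannot hold two versions of a line, so the
            # earlier epoch must be flushed before the overwrite.
            self.flush_through(self.lines[addr][1])
        self.lines[addr] = (value, self.epoch)

    def evict(self, addr):
        oldest = min(e for _, e in self.lines.values())
        if self.lines[addr][1] != oldest:
            raise RuntimeError("ineligible for eviction: earlier epoch still dirty")
        self._write_back(addr)

# The instruction sequence from the earlier example (the load is omitted).
c = EpochCache(start_epoch=25)
c.store("A", 1)
c.store("B", 1)
c.barrier()
c.store("A", 2)          # overwrites an epoch-25 line, so epoch 25 is flushed
print(c.log)             # [('A', 1, 25), ('B', 1, 25)]
print(c.lines)           # {'A': (2, 26)}
```

Calling `evict` on a line from a newer epoch while an older epoch is still dirty raises an error, which corresponds to the "ineligible for eviction" state.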
Considerations for epoch ordering
• How complex is it?
• Need hardware walkers to evict cachelines during cache replacement
• How easy to use is it?
• Dependencies across volatile variables not recorded
• Example:
• Could reboot with Y=2, A=4
Acquire(vol_lock);
X = 1;
Y = 2;
Release(vol_lock);
Acquire(vol_lock);
A = 4;
B = 5;
Release(vol_lock);
Microbenchmarks
[Figure: two bar charts comparing NTFS - Disk, NTFS - RAM, and BPFS - RAM for n = 8, 64, 512, and 4096 bytes. Panel “Random n Byte Write”: vertical axis 0 to 10 (thousands). Panel “Append n Bytes”: vertical axis Time (s), 0 to 2. Two configurations are annotated “NOT DURABLE!” and two “DURABLE!”.]
Notes from reviews
• How much performance improvement should we expect?
• How important is using real PCM (or real PCM latency) in evaluation?
• Could we have systems with just Pmem and no SSD?
• What journaling mode does NTFS use?
• Ordered journaling
• Is modifying HW ok?
• Using volatile structures
• Free blocks, freed and allocated inode numbers, data freed by CoW operations
• Dentry cache
How well does it perform?
• Evaluation:
• Implement in Windows & run over DRAM (no epoch barrier delays)
• Implement in usermode & run in a simulator
• Analytical model