
File systems for persistent memory. CS Persistence


Academic year: 2021


(1)

File systems for persistent memory

CS 839 - Persistence

(2)

Questions on homework?

• Can we shift the schedule and do BPFS on Thursday and Nova on Monday? Drop Aerie or SplitFS.

(3)

Learning outcomes

• Understand how disk-based file systems update metadata and handle consistency

• Understand the properties of NVM that can change file system design

• Understand the key ordering requirements for file systems

• Understand BPFS software and hardware mechanisms and their limitations

(4)

Background story

• PCM is becoming popular, first for main memory

• Obvious approach seems to be to use it for file systems too

• Question: how do you optimize?

(5)

Background: normal file systems

• Use page cache to buffer data in DRAM

• Access SSD through block layer

• Use logging for consistency

(6)

Background: FS data structures

• Standard FS data structures

• Superblock: describes FS parameters, location of root inode

• Inode: metadata for a single file

• Attributes, size, location of data blocks

• Number

• Data block: holds file or directory contents

• Directory entry: String name and inode number

• Inode and data block bitmaps: track free/used locations on storage

• Indirect block: location of other data blocks or indirect blocks

(7)

Background: FS consistency

• What gets updated when appending to a file?

• Allocate block from data bitmap

• Write data to data block

• Write block address to inode or indirect block

• Update file length & modification time in inode

• What happens if system crashes in the middle?
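
The crash question above can be explored with a small toy model. This is only a sketch under stated assumptions: the dictionary layout, the step functions, and the `problems` checker are hypothetical illustrations, not part of any real file system. It applies the four append updates one at a time and reports which invariants are broken if the system stops after each step.

```python
def fresh_fs():
    # Hypothetical on-storage state: one empty file, two free data blocks.
    return {"bitmap": [0, 0], "blocks": [None, None],
            "inode": {"ptrs": [], "size": 0}}

# The four updates from the slide, each modeled as one persistent write.
def allocate(fs):
    fs["bitmap"][0] = 1

def write_data(fs):
    fs["blocks"][0] = b"new data"

def link_block(fs):
    fs["inode"]["ptrs"].append(0)

def set_size(fs):
    fs["inode"]["size"] = 8

STEPS = [allocate, write_data, link_block, set_size]

def problems(fs):
    """Return the invariant violations visible in the toy state."""
    found = []
    ptrs = fs["inode"]["ptrs"]
    for block, used in enumerate(fs["bitmap"]):
        if used and block not in ptrs:
            found.append("leaked block")
    for block in ptrs:
        if not fs["bitmap"][block]:
            found.append("pointer to free block")
        if fs["blocks"][block] is None:
            found.append("pointer to unwritten block")
    data_len = sum(len(fs["blocks"][b] or b"") for b in ptrs)
    if fs["inode"]["size"] != data_len:
        found.append("size does not match data")
    return found

def crash_after(n):
    """Apply the first n steps, then 'crash' and inspect what is on storage."""
    fs = fresh_fs()
    for step in STEPS[:n]:
        step(fs)
    return problems(fs)

for n in range(len(STEPS) + 1):
    print(n, crash_after(n))
```

In this model only the states before the first step and after the last step are consistent, which is the gap that journaling and shadow updates are meant to close.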

(8)

Background: FS consistency mechanisms

• Journaling: write metadata (and/or data) to a journal before writing it in place – redo logging

• Write journal, force to storage

• Later checkpoint – write metadata/data in real place

• Can skip data journaling for performance

• Shadow updates: write data/metadata updates to new location (used in BPFS)

• Basically copy-on-write data structures

(9)

Review 1: Journaling

• Write to journal, then write to file system

[Figure: blocks A and B in the file system; the new versions A’ and B’ are written to the journal first and then copied into place.]

• Reliable, but all data is written twice
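
A minimal sketch of redo journaling, assuming a toy store in which the journal, the commit record, and the home locations are plain Python fields. The class `JournaledStore` and the helper `run` are hypothetical names used only for illustration.

```python
class JournaledStore:
    def __init__(self, blocks):
        self.home = dict(blocks)   # blocks at their real locations
        self.journal = []          # journaled (name, new value) records
        self.committed = False     # commit record for the journal

    def log(self, updates):
        # Step 1: write the new versions (A', B') to the journal.
        self.journal = list(updates.items())

    def commit(self):
        # Step 2: force the journal to storage and mark it complete.
        self.committed = True

    def checkpoint(self, limit=None):
        # Step 3: copy journaled blocks into place.
        # `limit` simulates a crash after only some blocks were copied.
        for name, value in self.journal[:limit]:
            self.home[name] = value

    def recover(self):
        # Redo logging: replay a committed journal, discard an uncommitted one.
        if self.committed:
            self.checkpoint()
        self.journal, self.committed = [], False


def run(crash_point):
    store = JournaledStore({"A": "a0", "B": "b0"})
    store.log({"A": "a1", "B": "b1"})
    if crash_point == "before_commit":
        store.recover()
        return store.home
    store.commit()
    store.checkpoint(limit=1)      # crash after only A' reached its home
    store.recover()
    return store.home


print(run("before_commit"))    # old versions of both blocks
print(run("mid_checkpoint"))   # new versions of both blocks
```

Either way the two blocks change together, at the cost of writing each block once to the journal and once in place.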


(10)

Review 2: Shadow Paging

• Use copy-on-write up to root of file system

[Figure: tree under the file’s root pointer with blocks A and B; updated copies A’ and B’ are written to new locations.]

• Any change requires bubbling to the FS root

• Small writes require large copying overhead
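
The bubbling cost can be sketched with a toy copy-on-write tree, assuming nested dictionaries stand in for indirect blocks and strings stand in for data blocks. The function `cow_update` and the example tree are hypothetical.

```python
def cow_update(node, path, value):
    """Return (new_node, blocks_written) without modifying `node`."""
    if not path:
        return value, 1                      # the new data block itself
    child, written = cow_update(node[path[0]], path[1:], value)
    new_node = dict(node)                    # copy this interior block
    new_node[path[0]] = child                # point the copy at the new child
    return new_node, written + 1


old_root = {
    "dir": {
        "file": {"blk0": "A", "blk1": "B"},
        "other": {"blk0": "C"},
    }
}

new_root, written = cow_update(old_root, ["dir", "file", "blk0"], "A'")

print(written)                               # blocks written for one small update
print(old_root["dir"]["file"]["blk0"])       # old tree is untouched
print(new_root["dir"]["file"]["blk0"])       # new tree sees the update
```

Switching the root pointer from `old_root` to `new_root` is the commit. Changing one leaf three levels down writes four blocks here, while untouched subtrees such as `other` are shared between the two versions.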


(11)

Atomicity requirements

• What happens when you crash while writing data to a file?

1. Entire write takes place or none takes place

2. Some blocks may be written entirely but not all

3. Arbitrary bytes of file may be replaced

• What do normal file systems do?

• “torn write” – partially written block

• Data vs metadata journaling

(12)

Basic idea: RAM disk

• Idea 1: RAM disk

• Make a block device that accesses NVM instead of going to a device

• BTT: block-translation table, uses shadow updates to allow atomic block-sized writes

• Problems:

• Still copy data to DRAM – inefficient

• All writes are block sized -- inefficient

(13)

What changes with NVM/SCM/PMem?

(14)

What changes with NVM/SCM/PMem?

• Fine grained writes

• Don’t have to write entire blocks when updating a single value

• Fast random access

• Don’t need to optimize metadata for sequential extents

• No buffering

• Can serve data directly from memory

• But:

• Loss of ordering

(15)

Short-Circuit Shadow Paging

• Uses byte-addressability and atomic 64-bit writes

[Figure: the shadow paging tree again: file’s root pointer, blocks A and B, and copies A’ and B’.]

• Inspired by shadow paging

– Optimization: In-place update when possible

(16)

Opt. 1: In-Place Writes

• Aligned 64-bit writes are performed in place

• Data and metadata

[Figure: an in-place write to a block below the file’s root pointer.]

(17)

Opt. 2: Exploit Data-Metadata Invariants

• Appends committed by updating file size

[Figure: an in-place append below the file’s root pointer + size, followed by the file size update.]
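
A minimal sketch of committing an append through the file size, assuming a toy file whose data area and size field both live in persistent memory and whose size update is a single atomic write. The class `PmFile` and its `crash_before_commit` flag are hypothetical.

```python
class PmFile:
    def __init__(self, data=b""):
        self.blocks = bytearray(data)   # persistent data area
        self.size = len(data)           # persistent size, updated atomically

    def append(self, data, crash_before_commit=False):
        # Write the new bytes in place, past the current end of file.
        # Readers ignore them because they lie beyond `size`.
        end = self.size + len(data)
        self.blocks[self.size:end] = data
        if crash_before_commit:
            return                      # simulated crash: size never updated
        # Single atomic write that makes the append visible.
        self.size = end

    def read(self):
        return bytes(self.blocks[:self.size])


f = PmFile(b"abc")
f.append(b"de", crash_before_commit=True)
print(f.read())      # the interrupted append is invisible
f.append(b"xyz")
print(f.read())      # the next append overwrites the leftover bytes
```

The invariant being exploited is that bytes past the file size are never interpreted, so no copy-on-write is needed for the appended data.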

(18)

BPFS Example

[Figure: BPFS tree from the root pointer through the inode file (indirect blocks, inodes) to two directories and a file; one directory is labeled “add entry” and the other “remove entry”.]

• Cross-directory rename bubbles to common ancestor

(19)

What happens if you memory-map a file?

(20)

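Consistent updates

• Rely on hardware for 1-word atomic update

➢ CPU cache may reorder writes to NVM

• Breaks “crash-consistent” update protocols

[Figure: a write-back cache in front of NVM. NVM initially holds value = 0xDEADBEEF and valid = 0. After the two stores below, the cache holds value = 0xC02 and valid = 1, and NVM can end up with valid = 1 while value is still 0xDEADBEEF.]

STORE value = 0xC02

STORE valid = 1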

(21)

Primitive operation: ordering writes

• Why?

• Ensures ability to commit a change

• How?

• Flush – MOVNTQ/CLFLUSH

• Fence – MFENCE

• Inefficiencies:

• Removes recent data from cache

[Figure: the same write-back cache and NVM, starting from value = 0xDEADBEEF and valid = 0. The flush and fence force value = 0xC02 to NVM before valid = 1 is stored.]

STORE value = 0xC02

FLUSH (&value)

FENCE

STORE valid = 1
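
The effect of the flush can be sketched with a toy write-back cache, assuming that on a crash any subset of dirty lines may already have reached NVM. The class `WriteBackCache` is hypothetical. Its `flush` is synchronous, so the fence has no separate counterpart in this model.

```python
import itertools


class WriteBackCache:
    def __init__(self, nvm):
        self.nvm = dict(nvm)    # contents that survive a crash
        self.dirty = {}         # cached stores not yet written back

    def store(self, addr, value):
        self.dirty[addr] = value

    def flush(self, addr):
        # Write the line back now (models flush followed by fence).
        if addr in self.dirty:
            self.nvm[addr] = self.dirty.pop(addr)

    def crash_states(self):
        """Every NVM image possible if power fails at this point."""
        lines = list(self.dirty.items())
        states = []
        for count in range(len(lines) + 1):
            for evicted in itertools.combinations(lines, count):
                image = dict(self.nvm)
                image.update(evicted)
                states.append(image)
        return states


def is_torn(image):
    # valid is set, but the value it guards never reached NVM
    return image["valid"] == 1 and image["value"] != 0xC02


initial = {"value": 0xDEADBEEF, "valid": 0}

unsafe = WriteBackCache(initial)
unsafe.store("value", 0xC02)
unsafe.store("valid", 1)

safe = WriteBackCache(initial)
safe.store("value", 0xC02)
safe.flush("value")
safe.store("valid", 1)

print(any(is_torn(s) for s in unsafe.crash_states()))
print(any(is_torn(s) for s in safe.crash_states()))
```

Without the flush, one of the reachable crash images has valid = 1 next to the stale 0xDEADBEEF. With it, every reachable image has the new value in NVM before valid can be set.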

(22)

Ordering in BPFS

[Figure: CoW writes followed by the commit write, moving from the L1/L2 caches to BPRAM.]

(23)

Atomicity in BPFS

[Figure: CoW writes followed by the commit write, moving from the L1/L2 caches to BPRAM.]

(24)

Enforcing Ordering and Atomicity

• Ordering

• Solution: Epoch barriers to declare constraints

• Faster than write-through

• Important hardware primitive (cf. SCSI TCQ)

• Atomicity

• Solution: Capacitor on DIMM

• Simple and cheap!


(25)

Intel x86 flush mechanism

[Figure: timeline of ST A, CLWB A, and SFENCE; the SFENCE commits once the write-back of A is acknowledged (ACK).]

ST A; ST B; CLWB A; CLWB B; SFENCE; ST C; CLWB C; SFENCE

(26)

Intel x86 flush mechanism

Drawback 1: No distinction between ordering and durability

Drawback 2: Ordering introduces stalls

ST A; ST B; CLWB A; CLWB B; SFENCE; ST C; CLWB C; SFENCE

(27)

Epoch ordering

• Goal:

• No software flushes – too expensive/complex

• Ordering is asynchronous – too expensive to stall

• Solution

• Persist barriers

(28)

Persist barriers: Ordering Fence

[Figure: Thread 1 executes ST A=1, Barrier, ST B=2. In both the volatile memory order and the persistence order, ST A=1 happens before ST B=2.]

• Orders stores preceding barrier before later stores

(29)

Ordering Epochs without Flushing

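[Figure: CPU 1 with a local timestamp (TS) that advances from 25 to 26, and an L1 cache whose lines are tagged with a timestamp.]

1. ST A = 1

2. ST B = 1

3. LD R1 = A

4. BARRIER

5. ST A = 2

L1 cache contents: A = 1 (TS 25), B = 1 (TS 25), A = 2 (TS 26)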

(30)

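Ordering and Atomicity with Epoch Barriers

[Figure: CoW writes, a barrier, and the commit write moving from the L1/L2 caches to BPRAM. Cache lines are tagged with epoch 1 or epoch 2, and one line is marked “Ineligible for eviction!”]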

(31)

Epoch ordering complexity

• When is it safe to let something leave the cache?

• When all writes from preceding epoch have left already

• What happens if you overwrite something from a preceding epoch?

• Must flush earlier epoch first – can’t store multiple versions

• What happens when you access something from another core?

• Can’t track ordering across cores (epoch numbers across cores aren’t ordered)

• Old data must be flushed

• How do you implement efficiently?

• Store 8-bit pointer in each cacheline to registers holding 8 in-flight epochs
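
The first two rules above can be sketched as a toy single-core cache, assuming each dirty line carries the epoch number of its last store. The class `EpochCache` is hypothetical. It ignores cross-core accesses and the hardware encoding of epoch tags, and it assumes that overwriting a line from an earlier epoch flushes that whole epoch and everything before it.

```python
class EpochCache:
    def __init__(self):
        self.epoch = 0
        self.dirty = {}          # addr -> (value, epoch of the store)
        self.nvm = {}
        self.persist_log = []    # order in which stores reached NVM

    def barrier(self):
        self.epoch += 1          # later stores belong to a new epoch

    def _write_back(self, addr):
        value, _ = self.dirty.pop(addr)
        self.nvm[addr] = value
        self.persist_log.append((addr, value))

    def flush_through(self, epoch):
        """Write back all dirty lines from `epoch` or earlier, oldest first."""
        victims = [a for a, (_, e) in self.dirty.items() if e <= epoch]
        victims.sort(key=lambda a: self.dirty[a][1])
        for addr in victims:
            self._write_back(addr)

    def store(self, addr, value):
        if addr in self.dirty and self.dirty[addr][1] < self.epoch:
            # The cache cannot hold two versions of one line,
            # so the earlier epoch must reach NVM first.
            self.flush_through(self.dirty[addr][1])
        self.dirty[addr] = (value, self.epoch)

    def can_evict(self, addr):
        # Safe only when no line from a preceding epoch is still dirty.
        epoch = self.dirty[addr][1]
        return all(e >= epoch for _, e in self.dirty.values())

    def evict(self, addr):
        if not self.can_evict(addr):
            raise RuntimeError("earlier epoch still in cache")
        self._write_back(addr)


cache = EpochCache()
cache.store("A", 1)
cache.store("B", 1)
cache.barrier()
cache.store("A", 2)       # forces the first epoch (A = 1, B = 1) out first
print(cache.persist_log)
print(cache.dirty)
```

The usage follows the store sequence ST A = 1, ST B = 1, BARRIER, ST A = 2: the second store to A pushes both lines of the first epoch to NVM before the new value is cached.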

(32)

Considerations for epoch ordering

• How complex is it?

• How easy to use is it?

(33)

Considerations for epoch ordering

• How complex is it?

• Need hardware walkers to evict cachelines during cache replacement

• How easy to use is it?

• Dependencies across volatile variables not recorded

• Example:

• Could reboot with Y=2, A=4

Acquire(vol_lock);

X = 1;

Y = 2;

Release(vol_lock);

Acquire(vol_lock);

A = 4;

B = 5;

Release(vol_lock);

(34)

Microbenchmarks

[Figure: two charts, “Append n Bytes” and “Random n Byte Write”, for n = 8, 64, 512, 4096. Axis labels: Time (s) and Thousands. Series: NTFS - Disk, NTFS - RAM, BPFS - RAM. Annotations: two marked “NOT DURABLE!” and two marked “DURABLE!”.]

(35)

Notes from reviews

• How much performance improvement should we expect?

• How important is using real PCM (or real PCM latency) in evaluation?

• Could we have systems with just Pmem and no SSD?

• What journaling mode does NTFS use?

• Ordered journaling

• Is modifying HW ok?

• Using volatile structures

• Free blocks, freed & allocated inode numbers

• Data freed by CoW operation

• Dentry cache

(36)

How well does it perform?

• Evaluation:

• Implement in Windows & run over DRAM (no epoch barrier delays)

• Implement in usermode & run in a simulator

• Analytical model

• Workloads
