Performance Analysis of Flash Storage Devices and their Application in High Performance Computing

(1)

Performance Analysis of Flash

Storage Devices and their

Application

in High Performance Computing

Nicholas J. Wright

With contributions from

R. Shane Canon,

Neal M. Master,

Matthew Andrews, and Jason Hick

(2)

Outline

•  Why look at flash memory ?

•  Performance evaluation of individual

flash devices

•  What applications is flash going to be

good for ?

•  Flash in a parallel filesystem

•  Summary

(3)

Data Driven Science

-  Ability to generate data is challenging

our ability to store, analyze, & archive it.

-  Some observational devices grow in capability

with Moore’s Law.

-  Data sets are growing exponentially.

•  Petabyte (PB) data sets soon will be common:

–  Climate: next IPCC estimates 10s of PBs

–  Genome: JGI alone will have .5 PB this year

and double each year

–  Particle physics: LHC projects 16 PB / yr

–  Astrophysics: LSST, others, estimate 5 PB / yr

•  Redefine the way science is done?

–  One group generates data, different group

(4)

(5)

I/O Performance Challenges

Performance Crisis #1

• Disks are outpaced by compute speed.

• To achieve reasonable aggregate

bandwidth many spindles needed –

10

3

_{spindles = 1PB but only ~0.1 TB/s !}

1 10 100 1000 10000 is te r -c h ip -c h ip R A M n ec t te m

Pi

co

Jo

u

le

s

now

2018

Performance Crisis #2

Data Motion on an Exascale

Machine

(6)

Memory Capacity Trends

•  Technology trends:

–  Memory density 2X every 3 yrs; processor logic every 2

–  Storage costs ($/MB) drops more gradually than logic costs

Source: David Turek, IBM

Cost of Computation vs. Memory

(7)

Hardware Trends are

exacerbating the issue

•  Data volumes exploding!

•  Memory Capacity per Flop decreasing

•  I/O Bandwidths not keeping pace

•  Data movement is expensive

•  Will NVRAM save the day ?

(8)

(9)

Flash – What is it good for?

•  Read - Word level ~20 us

•  Write - Erase/Write - block level ~200 us

•  $/GB

•  Finite Number of Erase Cycles

•  Lots of open Q’s:

–  PCI vs SATA vs ?

–  SLC vs MLC

–  Write requires block erase - performance

dependent upon previous IO pattern

–  Correct algorithm in software at all levels

–  ….

(10)

Devices Evaluated

•  3 PCI-e SLC

–  Virident tachIOn 400GB 8x

–  FusionIO ioDrive Duo 2x

160GB 4x

–  Texas Memory Systems

RamSan-20 450GB 4x

•  2 SATA MLC

–  Intel X-25M 160GB

–  OCZ Colossus 250GB

Performance Analysis of Commodity and Enterprise Class Flash Devices.

Neal M. Master, Matthew Andrews, Jason Hick, Shane Canon & Nicholas J. Wright

PDSI Workshop, Supercomputing 2010, New Orleans.

(11)

IOZone Experiments

•  Bandwidth

– Vary block size: 2

n

_{KB, n =2-8}

– Vary concurrency: 2

n

_{threads, n=0-7 (1-128)}

– Vary IO Patterns: Sequential Write/Re-write,

Sequential Read/Re-read, Random Write,

Mixed Random Write/Read, Random Read

•  IOPS

– 4KB block size

(12)

PCI-Bandwidths

4 16 64 256 0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1 ₂ 4 ₈ 16 ₃₂ 64 ₁₂₈ IO Block Size (KB) B an d w id th (MB /s ) Number of Threads 0-100 100-200 200-300 300-400 400-500 500-600 600-700 700-800 800-900 900-1000 1000-1100 1100-1200

(13)

Bandwidth Summary

0 200 400 600 800 1000 1200 1400

TMS RamSan 20 Virident tachIOn Fusion IO ioDrive Intel X-25M OCZ Colossus

B an d w id th (MB /s )

Read Bandwidth Company Reported Read Bandwidth Write Bandwidth Company Reported Write Bandwidth

(14)

IOPS - READ

0 20 40 60 80 100 120 140 160 180 1 2 4 8 16 32 64 128 IO /s (th o u sa n d s) Number of Threads

Virident tachIOn (400GB) TMS RamSan 20 (450GB) Fusion IO ioDrive Duo (Single Slot, 160GB) Intel X-25M (160GB)

(15)

IOPS - Write

0 20 40 60 80 100 120 140 160 180 IO /s (th o u sa n d s)

(16)

Flash Device Evaluation - IOPS

0 20 40 60 80 100 120 140 160 180 TMS RamSan 20 (450GB) Virident tachIOn (400GB) Fusion IO ioDrive Duo (Single Slot,

160GB) Intel X-25M (160GB) OCZ Colossus (250 GB) T h o u sa n d s IO Ps

(17)

Degradation Experiment

•  Create a file using

– Cat /dev/urandom | dd

– that fills X% of the drive X=30,50,70,90

•  Using FIO randomly write to the file

– Using 4KB blocks - IOPS

– Using 128KB blocks - BW

(18)

Degradation - IOPS

0 5 10 15 20 25 30 35 0 10 20 30 40 50 60 IO /s (th o u sa n d s) Minutes

(19)

Degradation – IOPS Summary

0% 20% 40% 60% 80% 100% 120% Virident tachIOn (400GB) TMS RamSan 20 (450GB) Fusion IO ioDrive Duo (Single Slot, Intel X-25M (160GB) OCZ Colossus (250 GB) Pe rc en ta g e o f Pe ak W ri te IO /s 30% 50% 70% 90%

(20)

Degradation - Bandwidth

0 200 400 600 800 1000 1200 0 10 20 30 40 50 MB /s Minutes

(21)

Degradation BW Summary

0% 10% 20% 30% 40% 50% 60% 70% 80% 90%

Virident tachIOn TMS RamSan 20 Fusion IO ioDrive Intel X-25M OCZ Colossus

Pe rc en ta g e o f Pe ak W ri te B an d w id th

(22)

(23)

GPFS Flash Filesystem

•  8 Virident Devices

–  2 per node dual socket Nehalem 2.67 GHz 24 GB

QBR-IB

–  v.1.0 Virident Driver Software

•  GPFS v 3.2

–  4 NSD servers – 2 cards per server

–  256K block size (default)

–  Metadata stored with data

(24)

GPFS: Bandwidth Measurements

0 1000 2000 3000 4000 5000 6000 7000 8000 8 16 32 64 128 256 B an d w id th (MB /s )

(25)

GPFS: IOPS Measurements

0.0E+00 5.0E+04 1.0E+05 1.5E+05 2.0E+05 2.5E+05 IO PS

(26)

GPFS block size variation: Read

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M 8M 16M B an d w id th (MB /s ) Block Size

(27)

GPFS block size variation: write

0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 B an d w id th (MB /s )

(28)

Performance for Two Devices

Simultaneously

0 0.5 1 1.5 2 2.5 Read Write Pe rfo rm an ce R ati o

One Device Two Simultaneously Two RAID0

(29)

GPFS Unaligned I/O Performance

10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Pe rc en ta g e o f A lig n ed Pe rfo rm an ce Read Write

(30)

Lustre - GPFS Comparison

0 2000 4000 6000 8000 10000 12000 14000 Read Write B W (MB /s ) Lustre GPFS 0.E+00 5.E+04 1.E+05 2.E+05 2.E+05 3.E+05 3.E+05 Read Write IO Ps Lustre GPFS

(31)

Applications for Flash?

Device

Price in $

Capacity

in GB

Bandwidth

/ GB/s

IOPS

$/GB

$/GB/s

$/IOP

SATA MLC

FLASH

120

64

0.2 8600

$1.88

$600

$0.01

SATA SLC

FLASH

740

64

0.2 5000

$11.56

$3,700

$0.15

PCI SLC

FLASH

11500

640

1.2 140000

$17.97

$9,583

$0.08

SATA HDD

80 2000

0.07

90 $0.04

$1,143

$0.89

High-Perf

Array

250000

240000

5 100000

$1.04

$50,000

$2.50

(32)

Graph500 with NAND Flash

Graph500: Traversing massive graphs with

NAND Flash

Roger Pearce

12

Maya Gokhale

1

Nancy M. Amato

2

1_{Center for Applied Scientific Computing}

Lawrence Livermore National Laboratory

2_{Department of Computer Science and Engineering}

Texas A&M University

June 2011

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE- AC52-07NA27344. LLNL-PRES-487136.

(33)

Graph500 Implementation

Semi-External

�

O(|V|) data can fit into main memory, G = (V, E)

�

Read-only from NAND Flash, output and algorithm data kept

in-memory

�

e.g. store graph in external memory, keep BFS data in memory

Asynchronous Traversal Technique [SC 2010]

�

Exploit fine-grained path parallelism

�

Tolerate data latencies to graph storage (NAND Flash)

�

Re-order vertex visitation to improve page-level locality

�

Allow over-subscription of thread-level parallelism (256 threads) to

maximize NAND Flash IOPS

(34)

34 Experimental Setup

Kraken’s hardware:

�

single node, 32-core Opteron(tm) Processor 6128 @ 2.0Ghz

�

512 GB DRAM

�

4x 640GB Fusion-io MLC; Software RAID0

�

Red Hat Enterprise Linux 5.6

�

Approximate system cost $71K

�

$25K for base system

�

$46K for 2.56TB of Fusion-io NAND Flash

Graph500: Traversing massive graphs with NAND Flash Roger Pearce 4/6

Graph500 Results

Using Fusion-io: 8x larger with 50% performance loss over DRAM only

DRAM + Fusion-io: Scale 34, 55.6 MTEPS

DRAM Only:

Scale 31, 104.6 MTEPS

(35)

Summary

•  Bandwidths per device are impressive

–  But cf. RAID’d set of regular HDD’s

•  IOP’s numbers are very impressive

–  100x HDD – RAID’ing won’t help here

–  Large numbers of threads needed to saturate

(Pearce, Gokhale & Amato SC10)

•  Previous I/O pattern can effect performance

•  Parallel Filesystem Software needs Tweaking

to use with Flash

(36)

Going Forward…

•  ‘When you’ve got a hammer in your

hand, everything looks like a nail.’

•  THE FIRST Law of Technology says we

invariably overestimate the short-term

impact of new technologies while

underestimating their longer-term

effects - Dr. Francis Collins

(37)

Flash Going Forward

•  $/GB unlikely to match regular disk

•  $/IOP already significantly better

•  Energy costs will be less that a regular

HDD

•  Usefulness with depend upon the data

(38)

Reliability – Too Few Electrons

Per Gate

Source: The Inevitable Rise of NVM in Computing, Jim Handy, Nonvolatile Memory Seminar,

Hot Chips Conference August 22, 2010 Memorial Auditorium Stanford University

(39)

Performance Analysis of Flash Storage Devices and their Application in High Performance Computing