HPC data becomes Big Data. Peter Braam

(1)

HPC data becomes Big Data

Peter Braam

(2)

me

1983 - 2000 Academia

• Maths & Computer Science

Entrepreneur with startups (5x)

• 4 startups sold

• Lustre emerged

• Held executive jobs with acquirers

2014 – Independent, advise, research

(3)

Two Questions

§ Given an HPC storage system, how can it be used for Big Data Analysis?

§ What storage platforms are candidates to meet

(6)

IDC market data

% of sites using co-processors: 28.2% (2011), 76.9% (2013)
HPC sites performing big data analysis: 67%
% of compute cycles dedicated to big data: 30%
% of sites using cloud infrastructure for HPC: 18.8% (2011), 23.5% (2013)
Year over year growth in high density servers ($): 25.5%
Year over year growth in servers ($): -6.2%

(7)

Other facts

§ Flash and much faster persistent memory tiers are inevitably coming.

§ Multiple software challenges arise from this

§ Management of tiers

§ Much faster storage software to keep up with devices

§ Gap between disk and other system performance continues to increase

§ There is “embedded” processing on servers with attached storage and client-server processing with clients networked to servers.

(8)

(9)

Big Data Problems – samples

§ Input generally from simulation or sensors

§ Climate modeling – simulate then …

§ Find the hottest day each year in Cape Town

§ Find very low pressure spots (typhoons) on Earth

§ Genomics, Astronomy

§ Find patterns (e.g. strings, galaxies) in huge data sets

§ Pre-process data at TB/sec rates

§ Data management
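The "hottest day each year" sample can be sketched as a single streaming pass over the simulation output. This is only an illustration: the record layout (ISO date string, temperature) and the function name are assumptions, not part of any climate code referenced here.

```python
def hottest_day_per_year(records):
    """Return {year: (iso_date, temperature)} for the hottest day of each year.

    `records` is any iterable of (iso_date, temperature) pairs, so the
    data can be streamed from disk without holding it all in RAM.
    """
    best = {}
    for iso_date, temp in records:
        year = iso_date[:4]
        if year not in best or temp > best[year][1]:
            best[year] = (iso_date, temp)
    return best


# Hypothetical sample records (made-up values).
sample = [
    ("2012-01-15", 35.1),
    ("2012-02-03", 38.4),
    ("2013-01-20", 36.7),
    ("2013-12-30", 33.0),
]
result = hottest_day_per_year(sample)
print(result)
```

The state kept per year is tiny, so this kind of problem is bound by how fast the storage system can stream the input, not by RAM.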

(10)

Big Data Problems – samples 2

§ Social network, advertising & intelligence

§ Most of these become graph problems, some very hard

§ Non-compliance in stock market transaction logs

§ Replace legacy consumer information data warehousing with modern analytics

§ Replacements of Teradata / Netezza sometimes difficult

(11)

Wide variations

§ Some problems (e.g. some graph problems) must be executed in RAM. Graph500 benchmark: 2000x speedup in 2.5 years

§ Other problems require many iterations through disk-resident data

§ Netezza analytics systems use FPGA’s for

(12)

Big Data Algorithms

§ Considerable variation

§ Machine learning

§ Bayesian analysis

§ Indexing, sorting – DB like

§ Graph algorithms

§ Maximal Information Coefficients – generalize regressions

§ Compressed sensing (aka sparse recovery)

§ Topological Data Analysis

(13)

Ogres …

§ Analogously to Berkeley Dwarfs, big data problems have been classified: see

Geoffrey Fox, Judy Qiu: Understanding Big Data Applications and Architectures. 1st JTC 1 SGBD Meeting, SDSC San Diego, March 19 2014

(14)

So…

Given these variations, a single architecture is not likely to address all big data problems well.

(15)

(16)

HPC data

§ Traditional model – cluster file system and

§ Single Shared File (with # cores readers / writers)

§ File Per Process (and 1 process per core …)

§ Tightly coupled problems allow little scheduling of “tasklets” or redistribution of I/O

§ Problems…

§ Throughput == #server nodes x (speed of slowest node)
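The throughput relation above can be illustrated with a small model. This is a sketch under the assumption that data is striped evenly over all servers, so every server must finish its share before the I/O completes; the function name and the sample bandwidths are hypothetical.

```python
def striped_throughput(node_bandwidths_gbs):
    """Aggregate throughput (GB/s) of evenly striped I/O across server nodes.

    Assumes each node handles an equal share of the data, so the
    slowest node determines when the whole transfer completes.
    """
    if not node_bandwidths_gbs:
        raise ValueError("need at least one server node")
    return len(node_bandwidths_gbs) * min(node_bandwidths_gbs)


# Hypothetical cluster: ten healthy nodes at 3 GB/s ...
uniform = striped_throughput([3.0] * 10)
# ... versus nine healthy nodes and one degraded to 1 GB/s.
degraded = striped_throughput([3.0] * 9 + [1.0])
print(uniform, degraded)
```

In this model a single slow node drags the whole system down to its speed, which is why tightly coupled workloads are so sensitive to uniform hardware.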

(17)

Results quite reasonable

§ Systems like Lustre, GPFS, Panasas

§ Use carefully configured and tested hardware

§ Fast networks

§ Deliver 80% of slowest hardware component

§ Pipelines from clients to disk are uniformly wide

§ Servers can deliver ~3 GB/sec per controller

§ Achilles heels:

§ Metadata

§ Availability

(18)

A sample of hard cases…

First write then read. Why the gap? Opening & creating files is too slow. Should run >2x faster!

First seen at ORNL in 2006.

Metadata performance on Sequoia and on Cove (50 & 5 SSD drives)

Low 1000’s to ~15K ops / sec

Maximum seen ever ~50K ops

(19)

HPC hard cases ctd

Larger numbers of concurrent metadata clients are not easy.

Conclusion:

1. Problems with systems like Lustre remain

2. Sensitivity to uniformly good hardware

3. Honest data from the users & understanding exists

4. It has been used at very large scale

(20)

Cloud data into HPC file system

§ Intel’s FastForward project

§ Ingest massive ACG graphs through Hadoop

§ Represent ACG using an HDF5 adaptation layer (HAL) & in Lustre DAOS objects.

§ Then compute.

Acknowledgement: Figure from Intel’s hpdd.intel.com wiki

(21)

(22)

Hybrid solutions may be best

TACC “Wrangler” system

§ Big Data “companion” to Stampede

§ DSSD storage is PCI connected and has KV interface

§ 120 node Dell cluster with DSSD storage

275M IOPS

Undoubtedly

§ This will solve many big data problems well

§ There will be problems that don’t fit or for which

(23)

Typical Cloud Storage

§ Combines

§ memcached

§ key value stores or DB’s

§ Relational, Distributed Key Value, Embedded Key Value

§ MySQL, Cassandra / HBase, RocksDB / LevelDB

§ object stores (swift, CEPH, …)

§ Results

§ Read heavy loads from one cluster

§ 100’s of servers serving 10M’s of requests/sec

§ Only the embedded DBs keep up with flash and NVRAM

§ Flash means: ~10us / read or write, RAM means <1us
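The latency figures above leave very little room for software. As a rough sketch, the share of each request spent in the software stack can be estimated as follows; the 10us device latency is the figure from the slide, while the software overhead values and function names are purely hypothetical.

```python
FLASH_LATENCY_US = 10.0  # from the slide: ~10us per read or write


def serial_ops_per_sec(device_latency_us, software_overhead_us):
    """Requests per second with one outstanding request at a time."""
    return 1_000_000 / (device_latency_us + software_overhead_us)


def software_share(device_latency_us, software_overhead_us):
    """Fraction of each request spent in software rather than the device."""
    return software_overhead_us / (device_latency_us + software_overhead_us)


# Hypothetical per-request software overheads (assumed, not measured).
for label, overhead_us in [("embedded store", 2.0), ("networked store", 90.0)]:
    rate = serial_ops_per_sec(FLASH_LATENCY_US, overhead_us)
    share = software_share(FLASH_LATENCY_US, overhead_us)
    print(f"{label}: {rate:,.0f} ops/sec per queue, {share:.0%} in software")
```

Under these assumed numbers, any stack that adds tens of microseconds per request spends most of its time in software, which is consistent with only embedded stores keeping up with the devices.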

(24)

Manageability

§ AWS elastic cloud – masterpiece

§ Open source solutions do similar

(25)

Tiered storage

When is tiered storage important?

§ For HPC, dumping RAM requires a flash cache

§ Likely of increased importance:

§ L1,2,3 – PCM – Flash – Disk – Tape

Tiered storage can use container concept

§ Cache misses fetch a container to faster memory

§ High bandwidth transfers container relatively quickly

§ One time latency – e.g. 1 sec

§ Then speed of faster tiers
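The container idea can be put in numbers with a back-of-the-envelope model. This is a sketch, not a measurement: the 1 sec one-time latency is the example from the slide, while the container size, tier bandwidths and function name are made-up values for illustration.

```python
def time_with_container_fetch(bytes_read, container_bytes,
                              fetch_latency_s, transfer_bw, fast_tier_bw):
    """Seconds to read `bytes_read` from a container that starts in a slow tier.

    The cache miss costs a one-time latency plus the bulk transfer of the
    whole container; afterwards all reads run at the faster tier's speed.
    Bandwidths are in bytes per second.
    """
    fetch = fetch_latency_s + container_bytes / transfer_bw
    return fetch + bytes_read / fast_tier_bw


GB = 10**9

# Hypothetical workload: a 10 GB container, moved at 5 GB/s after a 1 s
# latency, then read ten times over (100 GB in total) at 20 GB/s.
total_s = time_with_container_fetch(
    bytes_read=100 * GB,
    container_bytes=10 * GB,
    fetch_latency_s=1.0,
    transfer_bw=5 * GB,
    fast_tier_bw=20 * GB,
)
print(total_s)
```

The more often the container is re-read once it sits in the faster tier, the smaller the share of the one-time fetch cost becomes.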

(26)

Cloud object stores - CEPH

§ Object is file with an “id” not with a name

§ CEPH manages

§ Removal and addition of storage

§ Failed nodes, racks

§ Quite clever load balancing and data placement

(27)

Cloud objects still to demonstrate

§ HPC bandwidth == #nodes x BW/node

§ only limited testing at scale, no models

§ Not yet clear: how it integrates with tiered storage

(28)

Data layout - placement

§ How to place many stripes?

§ Bottleneck in RAID arrays:

§ Rebuilding a drive goes at the rate of the BW of 1 drive – takes days

§ Parity de-clustering & distributed spare

§ Rebuild at BW of N drives (N = 60 / 600 / 6000?)

§ For e.g. 10+2 redundancy, speedup 60/10, 600/10, etc.

§ Benefit is large 5x – 100x+

§ Algorithms & math are hard: block mappings
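The rebuild arithmetic above can be written out as a small calculation. This sketch uses the slide's estimate of speedup = N / (data units per stripe); the drive capacity and rebuild bandwidth are hypothetical values, and the function names are illustrative only.

```python
def rebuild_speedup(drives_in_pool, data_units_per_stripe):
    """Estimated speedup of a declustered rebuild over a single-drive rebuild.

    Follows the slide's estimate: with N drives sharing the rebuild and
    k data units read per stripe (k = 10 for 10+2), speedup is N / k.
    """
    return drives_in_pool / data_units_per_stripe


def rebuild_hours(drive_capacity_tb, drive_bw_mbs, speedup=1.0):
    """Hours to rebuild one drive at the given per-drive bandwidth."""
    seconds = drive_capacity_tb * 1e12 / (drive_bw_mbs * 1e6) / speedup
    return seconds / 3600


# Hypothetical drive: 8 TB, rebuilt at 50 MB/s while serving other I/O.
baseline = rebuild_hours(8, 50)
for n in (60, 600, 6000):
    s = rebuild_speedup(n, 10)
    print(f"N={n}: speedup {s:.0f}x, {rebuild_hours(8, 50, s):.2f} h "
          f"(baseline {baseline:.1f} h)")
```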

(29)

Data layout – erasure codes

§ How to rebuild a single stripe faster

§ Generalizes RAID, Reed-Solomon codes etc.

§ Benefits stripe reconstruction I/O by 1-2x

§ Tons of attention and publications

§ If the network is the slowest component this is
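As a point of reference for what erasure codes generalize, the simplest case is single XOR parity as in RAID. The sketch below is a toy illustration with made-up block contents and hypothetical function names; it is not one of the newer codes the slide alludes to.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data units followed by one XOR parity unit."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild_unit(stripe, lost_index):
    """Reconstruct one lost unit by XOR-ing all surviving units."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


data = [b"HPC!", b"big ", b"data"]   # made-up 4-byte data units
stripe = make_stripe(data)
recovered = rebuild_unit(stripe, 1)  # pretend unit 1 was lost
print(recovered)
```

Note that rebuilding a single unit here reads every surviving unit of the stripe; reducing that reconstruction I/O is what the work on faster single-stripe rebuild is about.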

(30)

(31)

Conclusions

§ There are many Big Data algorithms

§ There are many cloud storage solutions

§ Big data on HPC – several vendors

§ New specialized solutions (DSSD)

§ More attention for modeling the problems & solutions

(32)
