HPC data becomes Big Data
Peter Braam
me
1983 - 2000 Academia
• Maths & Computer Science
Entrepreneur with startups (5x)
• 4 startups sold
• Lustre emerged
• Held executive jobs with acquirers
2014 – Independent, advise, research
Contents
§ Introduction: market & key questions
§ Some Big Data problems & Algorithms
§ HPC storage
§ Cloud storage
Two Questions
§ Given an HPC storage system, how can it be used
for Big Data Analysis?
§ What storage platforms are candidates to meet
IDC market data
§ % of sites using co-processors: 28.2% (2011), 76.9% (2013)
§ HPC sites performing big data analysis: 67%
§ % of compute cycles dedicated to big data: 30%
§ % of sites using cloud infrastructure for HPC: 18.8% (2011), 23.5% (2013)
§ Year over year growth in high density servers ($): 25.5%
§ Year over year growth in servers ($): -6.2%
Other facts
§ Flash and much faster persistent memory tiers are
inevitably coming.
§ Multiple software challenges arise from this
§ Management of tiers
§ Much faster storage software to keep up with devices
§ Gap between disk and other system performance
continues to increase
§ There is “embedded” processing on servers with
attached storage and client-server processing with clients networked to servers.
Big Data Problems – samples
§ Input generally from simulation or sensors
§ Climate modeling – simulate then …
§ Find the hottest day each year in Cape Town
§ Find very low pressure spots (typhoons) on Earth
§ Genomics, Astronomy
§ Find patterns (e.g. strings, galaxies) in huge data sets
§ Pre-process data at TB/sec rates
§ Data management
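The climate example above ("find the hottest day each year") is a simple per-key reduction, which is why it maps well onto streaming or map-reduce style processing. A minimal sketch follows, assuming the input arrives as (date, temperature) pairs; the function name and the sample readings are invented for illustration and are not measurements.

```python
from datetime import date


def hottest_day_per_year(readings):
    """Return {year: (day, temp)} from an iterable of (date, temp) pairs.

    Single pass, constant state per year, so it also works on a stream
    that is far larger than memory.
    """
    best = {}
    for day, temp in readings:
        current = best.get(day.year)
        if current is None or temp > current[1]:
            best[day.year] = (day, temp)
    return best


# Invented sample values, purely illustrative.
sample = [
    (date(2012, 1, 14), 33.1),
    (date(2012, 2, 3), 36.4),
    (date(2013, 1, 29), 35.0),
    (date(2013, 12, 30), 34.2),
]
for year, (day, temp) in sorted(hottest_day_per_year(sample).items()):
    print(year, day.isoformat(), temp)
```

Because the state kept per year is tiny, partial results from many nodes can be merged with the same comparison.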
Big Data Problems – samples 2
§ Social network, advertising & intelligence
§ Most of these become graph problems, some very hard
§ Non-compliance in stock market transaction logs
§ Replace legacy consumer information data
warehousing with modern analytics
§ Replacements of Teradata / Netezza sometimes difficult
Wide variations
§ Some problems (e.g. some graph problems) must
be executed in RAM. Graph500 benchmark:
2000x speedup in 2.5 years
§ Other problems require many iterations through
disk-resident data
§ Netezza analytics systems use FPGA’s for
Big Data Algorithms
§ Considerable variation
§ Machine learning
§ Bayesian analysis
§ Indexing, sorting – DB like
§ Graph algorithms
§ Maximal Information Coefficients – generalize
regressions
§ Compressed sensing (aka sparse recovery)
§ Topological Data Analysis
Ogres …
§ Analogously to the Berkeley Dwarfs, big data problems
have been classified: see
Geoffrey Fox and Judy Qiu, “Understanding Big Data
Applications and Architectures”, 1st JTC 1 SGBD Meeting,
SDSC, San Diego, March 19, 2014
So…
Given these variations, a single architecture is unlikely to address all big data problems well.
HPC data
§ Traditional model – cluster file system and
§ Single Shared File (with # cores readers / writers)
§ File Per Process (and 1 process per core …)
§ Tightly coupled problems allow little scheduling of
“tasklets” or redistribution of I/O
§ Problems…
§ Throughput == #server nodes x (speed of slowest node)
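The throughput relation above can be sketched as a small model. This is a minimal illustration, assuming data is striped evenly so that every server must deliver an equal share; the function name and the per-node speeds are hypothetical.

```python
def striped_throughput(node_speeds_gb_s):
    """Aggregate throughput when every server delivers an equal share.

    With even striping a transfer finishes only when the slowest server
    finishes, so the aggregate is (number of servers) * (slowest speed).
    """
    if not node_speeds_gb_s:
        raise ValueError("need at least one server")
    return len(node_speeds_gb_s) * min(node_speeds_gb_s)


# Hypothetical cluster: nine healthy servers and one degraded server.
speeds = [3.0] * 9 + [1.5]
print(striped_throughput(speeds))  # 15.0, although sum(speeds) is 28.5
```

One degraded node halves the whole system here, which is the sensitivity to uniformly good hardware that tightly coupled I/O implies.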
Results quite reasonable
§ Systems like Lustre, GPFS, Panasas
§ Use carefully configured and tested hardware
§ Fast networks
§ Deliver 80% of the throughput of the slowest hardware component
§ Pipelines from clients to disk are uniformly wide
§ Servers can deliver ~3 GB/sec per controller
§ Achilles heels:
§ Metadata
§ Availability
A sample of hard cases…
First write then read. Why the gap? Opening & creating files is too slow. Should run >2x faster!
First seen at ORNL in 2006.
Metadata performance on Sequoia and on Cove (50 & 5 SSD drives)
§ Low 1000’s to ~15K ops / sec
§ Maximum seen ever ~50K ops
HPC hard cases ctd
Larger numbers of concurrent metadata clients are not easy.
Conclusion:
1. Problems with systems like Lustre remain
2. Sensitivity to uniformly good hardware
3. Honest data from the users & understanding exists
4. It has been used at very large scale
Cloud data into HPC file system
§ Intel’s FastForward project
§ Ingest massive ACG graphs through Hadoop
§ Represent ACG using an HDF5 adaptation layer (HAL) & in Lustre DAOS objects.
§ Then compute.
Acknowledgement: Figure from Intel’s hpdd.intel.com wiki
Hybrid solutions may be best
TACC “Wrangler” system
§ Big Data “companion” to Stampede
§ DSSD storage is PCI connected and has a KV interface
§ 120 node Dell cluster with DSSD storage
275M IOPS
Undoubtedly
§ This will solve many big data problems well
§ There will be problems that don’t fit or for which
Typical Cloud Storage
§ Combines
§ memcached
§ key value stores or DB’s
§ Relational, Distributed Key Value, Embedded Key Value
§ MySQL, Cassandra / HBase, RocksDB / LevelDB
§ object stores (Swift, Ceph, …)
§ Results
§ Read heavy loads from one cluster
§ 100’s of servers serving 10M’s of requests/sec
§ Only the embedded DBs keep up with flash and NVRAM
§ Flash means: ~10us / read or write, RAM means <1us
Manageability
§ AWS elastic cloud – a masterpiece
§ Open source solutions offer similar capabilities
Tiered storage
When is tiered storage important?
§ For HPC, dumping RAM requires a flash cache
§ Likely of increased importance:
§ L1,2,3 – PCM – Flash – Disk – Tape
Tiered storage can use container concept
§ Cache misses fetch a container to faster memory
§ High bandwidth transfers container relatively quickly
§ One time latency – e.g. 1 sec
§ Then speed of faster tiers
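The container idea above amounts to paying one latency plus one bulk transfer, then running at the speed of the faster tier. A rough cost model follows, assuming the 1 sec miss latency from the slide; the container size, transfer bandwidth, function names, and the 10 µs fast-tier access time are hypothetical choices for illustration.

```python
def container_fetch_time(container_bytes, bandwidth_bytes_s, miss_latency_s=1.0):
    """Seconds to bring one container into the faster tier after a miss."""
    return miss_latency_s + container_bytes / bandwidth_bytes_s


def amortized_access_time(n_accesses, fast_access_s, fetch_s):
    """Average cost per access: one fetch followed by fast-tier hits."""
    return (fetch_s + n_accesses * fast_access_s) / n_accesses


GB = 10**9
# Hypothetical: a 10 GB container moved at 5 GB/s.
fetch = container_fetch_time(10 * GB, 5 * GB)
print(fetch)  # 3.0 seconds: 1 s latency + 2 s transfer

# After a million accesses the fetch cost is nearly invisible.
print(amortized_access_time(1_000_000, 10e-6, fetch))
```

The model suggests containers pay off when the number of accesses per fetched container is large; with few accesses the one-time latency dominates.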
Cloud object stores - Ceph
§ An object is a file with an “id” rather than a name
§ Ceph manages
§ Removal and addition of storage
§ Failed nodes, racks
§ Quite clever load balancing and data placement
Cloud objects still to demonstrate
§ HPC bandwidth == #nodes x BW/node
§ only limited testing at scale, no models
§ Not yet clear: how it integrates with tiered storage
Data layout - placement
§ How to place many stripes?
§ Bottleneck in RAID arrays:
§ Rebuilding a drive proceeds at the bandwidth of 1 drive – takes days
§ Parity de-clustering & distributed spare
§ Rebuild at BW of N drives (N = 60 / 600 / 6000?)
§ For e.g. 10+2 redundancy, speedup 60/10, 600/10, etc.
§ Benefit is large 5x – 100x+
§ Algorithms & math are hard: block mappings
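The rebuild arithmetic above can be worked through in a few lines. This is a back-of-the-envelope sketch, assuming the speedup is simply N divided by the number of data units in the stripe, as in the 60/10 example; the function names, the drive capacity, and the per-drive bandwidth are hypothetical.

```python
def declustered_speedup(n_drives, data_units):
    """Rebuild speedup when N drives share the work of a stripe with
    `data_units` data blocks (10 for a 10+2 layout)."""
    return n_drives / data_units


def rebuild_hours(drive_tb, drive_mb_s, speedup=1.0):
    """Hours to rebuild one drive; speedup=1.0 models a classic RAID array
    where a single drive's bandwidth is the bottleneck."""
    seconds = drive_tb * 1e6 / drive_mb_s / speedup
    return seconds / 3600


for n in (60, 600, 6000):
    print(n, declustered_speedup(n, 10))  # 6.0, 60.0, 600.0

# Hypothetical 4 TB drive rebuilt at a sustained 100 MB/s.
print(round(rebuild_hours(4, 100), 1))                              # 11.1
print(round(rebuild_hours(4, 100, declustered_speedup(60, 10)), 1)) # 1.9
```

The sketch ignores foreground load and placement imbalance, both of which lengthen real rebuilds.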
Data layout – erasure codes
§ How to rebuild a single stripe faster
§ Generalizes RAID, Reed-Solomon codes etc.
§ Benefits stripe reconstruction I/O 1-2x
§ Tons of attention and publications
§ If the network is the slowest component this is
Conclusions
§ There are many Big Data algorithms
§ There are many cloud storage solutions
§ Big data on HPC – several vendors
§ New specialized solutions (DSSD)
§ More attention for modeling the problems & solutions