HPC ABDS: The Case for an Integrating Apache Big Data Stack

(1)

HPC‐ABDS: The Case for an Integrating Apache Big Data Stack with HPC

1st JTC 1 SGBD Meeting, SDSC, San Diego, March 19 2014

Judy Qiu, Shantenu Jha (Rutgers), Geoffrey Fox

[email protected]

http://www.infomall.org

School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

(2)

Enhanced Apache Big Data Stack

ABDS

• ~120 Capabilities
• >40 Apache
• Green layers have strong HPC Integration opportunities
• Goal
– Functionality of ABDS
– Performance of HPC

(3)

Broad Layers in HPC‐ABDS

• Workflow‐Orchestration
• Application and Analytics
• High level Programming
• Basic Programming model and runtime
– SPMD, Streaming, MapReduce, MPI
• Inter process communication
– Collectives, point to point, publish‐subscribe
• In memory databases/caches
• Object‐relational mapping
• SQL and NoSQL, File management
• Data Transport
• Cluster Resource Management (Yarn, Slurm, SGE)
• File systems (HDFS, Lustre …)
• DevOps (Puppet, Chef …)
• IaaS Management from HPC to hypervisors (OpenStack)
• Cross Cutting
– Message Protocols
– Distributed Coordination
– Security & Privacy
– Monitoring

(4)

(5)

(6)

Getting High Performance on Data Analytics (e.g. Mahout, R …)

• On the systems side, we have two principles
– The Apache Big Data Stack, with ~120 projects, offers important broad functionality and a vital, large support organization
– HPC, including MPI, has had striking success in delivering high performance, but with a fragile sustainability model

• There are key systems abstractions, which are levels in the HPC‐ABDS software stack, where the Apache approach needs careful integration with HPC
– Resource management
– Storage
– Programming model: horizontal scaling parallelism
– Collective and Point to Point communication
– Support of iteration
– Data interface (not just key‐value)

• In application areas, we define application abstractions to support
– Graphs/network
– Geospatial
– Images etc.

(7)

4 Forms of MapReduce

[Figure: four forms of MapReduce, each drawn as Input, map and reduce stages]

(a) Map Only: BLAST Analysis, Parametric sweep, Pleasingly Parallel

(b) Classic MapReduce: High Energy Physics (HEP) Histograms, Distributed search

(c) Iterative MapReduce: Expectation maximization, Clustering e.g. Kmeans, Linear Algebra, Page Rank

(d) Loosely Synchronous: Classic MPI, PDE Solvers and particle dynamics

Figure labels: Domain of MapReduce and Iterative Extensions; Science Clouds; MPI; Giraph

MPI is Map followed by Point to Point or Collective Communication – as in style c) plus d)
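
The remark that MPI is a map followed by collective communication can be illustrated with K‐means written in the iterative style (c). This is a minimal single‐process sketch, not code from Harp, Twister or MPI: the names `local_map`, `allreduce_sum` and `kmeans` are hypothetical, and the collective is simulated with plain Python lists instead of a real communication library.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centers: List[Point]) -> int:
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def local_map(points: List[Point], centers: List[Point]) -> List[List[float]]:
    """Map step on one worker: per-center [sum of coordinates..., count]."""
    dim = len(centers[0])
    partial = [[0.0] * (dim + 1) for _ in centers]
    for pt in points:
        k = nearest(pt, centers)
        for d in range(dim):
            partial[k][d] += pt[d]
        partial[k][dim] += 1.0
    return partial


def allreduce_sum(partials: List[List[List[float]]]) -> List[List[float]]:
    """Simulated collective: element-wise sum over all workers' partials."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def kmeans(partitions: List[List[Point]], centers: List[Point],
           iterations: int = 10) -> List[Point]:
    """Iterate map + collective; every worker would see the same new centers."""
    dim = len(centers[0])
    for _ in range(iterations):
        partials = [local_map(part, centers) for part in partitions]  # map
        total = allreduce_sum(partials)                               # collective
        centers = [
            tuple(row[d] / row[dim] for d in range(dim)) if row[dim] > 0 else old
            for row, old in zip(total, centers)
        ]
    return centers
```

In a classic MapReduce version, the sum would be a reduce step whose output is written out and re‐read on every iteration; the collective form keeps the partitions in memory and exchanges only the per‐center sums.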

(8)

HPC‐ABDS Hourglass

[Figure: hourglass with the HPC ABDS System (Middleware) of 120 Software Projects at one end, High performance Applications at the other, and abstractions/standards at the waist]

System Abstractions/standards
• Data format
• Storage
• HPC Yarn for Resource management
• Horizontally scalable parallel programming model
• Collective and Point to Point communication
• Support of iteration

Application Abstractions/standards
• Graphs, Networks, Images, Geospatial ….

High performance Applications
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library) or High performance Mahout, R, Matlab …..

(9)

(10)

We are sort of working on Use Cases with HPC‐ABDS

• Use Case 10, Internet of Things: Yarn, Storm, ActiveMQ
• Use Case 19, 20, Genomics: Hadoop, Iterative MapReduce, MPI; much better analytics than Mahout
• Use Case 26, Deep Learning: High performance distributed GPU (optimized collectives) with Python front end (planned)
• Variant of Use Case 26, 27, Image classification using Kmeans: Iterative MapReduce
• Use Case 28, Twitter: optimized index for HBase, Hadoop and Iterative MapReduce
• Use Case 30, Network Science: MPI and Giraph for network structure and dynamics (planned)
• Use Case 39, Particle Physics: Iterative MapReduce (wrote proposal)
• Use Case 43, Radar Image Analysis: Hadoop for multiple individual images, moving to Iterative MapReduce for global integration over “all” images
• Use Case 44, Radar Images: Running on Amazon

(11)

Features of Harp Hadoop Plug in

• Hadoop Plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
• Hierarchical data abstraction on arrays, key‐values and graphs for easy programming expressiveness.
• Collective communication model to support various communication operations on the data abstractions.
• Caching with buffer management for memory allocation required from computation and communication
• BSP style parallelism
• Fault tolerance with check‐pointing
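
Three of the features listed above (a collective operation on a key‐value abstraction, BSP style supersteps, and check‐pointing) can be sketched together in a few lines. This is an illustrative single‐process simulation and does not use or reproduce Harp's API; `allreduce_tables`, `run_bsp` and the `checkpoint_every` parameter are hypothetical names chosen for this example.

```python
import copy
from typing import Callable, Dict, List, Sequence, Tuple

Table = Dict[str, float]


def allreduce_tables(tables: List[Table]) -> Table:
    """Collective on key-value tables: sum the values of matching keys."""
    combined: Table = {}
    for table in tables:
        for key, value in table.items():
            combined[key] = combined.get(key, 0.0) + value
    return combined


def run_bsp(
    worker_data: Sequence[Sequence[str]],
    compute: Callable[[Sequence[str], Table], Table],
    supersteps: int,
    checkpoint_every: int = 2,
) -> Tuple[Table, List[Tuple[int, Table]]]:
    """Run BSP supersteps: local compute, then a collective acting as barrier."""
    state: Table = {}
    checkpoints: List[Tuple[int, Table]] = []
    for step in range(1, supersteps + 1):
        # Compute phase: each worker sees only its own data and shared state.
        local_tables = [compute(data, state) for data in worker_data]
        # Communication phase: all workers end up with the same combined table.
        state = allreduce_tables(local_tables)
        # Check-point: keep a copy that a restarted job could resume from.
        if step % checkpoint_every == 0:
            checkpoints.append((step, copy.deepcopy(state)))
    return state, checkpoints
```

For example, passing a `compute` function that counts the words in each worker's data yields a global word count after one superstep, with a saved copy of the table every `checkpoint_every` steps.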

(12)

Architecture

[Figure: layered architecture, top to bottom]

Application: MapReduce Applications | Map‐Collective Applications

Framework: MapReduce V2 | Harp

YARN

(13)

Performance on Madrid Cluster (8 nodes)

[Figure: K‐Means Clustering, Harp v.s. Hadoop on Madrid. x‐axis: Problem Size (100m, 500; 10m, 5k; 1m, 50k). y‐axis: Execution Time (s), 0 to 1600. Series: Hadoop and Harp on 24, 48 and 96 cores.]

Note: compute is the same in each case, as the product of centers times points is identical.

Across the problem sizes: Increasing Communication, Identical Computation
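
The note on this chart can be checked with a little arithmetic. The sketch below assumes the problem sizes read as (points, centers) pairs, which matches the statement that the product of centers times points is identical. The `dims` parameter is a hypothetical vector dimension that is not given on the slide; it only scales the communication estimate.

```python
# (points, centers) pairs as read from the chart's problem-size axis
PROBLEM_SIZES = [(100_000_000, 500), (10_000_000, 5_000), (1_000_000, 50_000)]


def distance_evaluations(points: int, centers: int) -> int:
    """Point-to-center distance computations in one K-means iteration."""
    return points * centers


def center_values_exchanged(centers: int, dims: int) -> int:
    """Coordinates in the centers table combined in each iteration.

    dims is an assumed dimensionality, not a figure from the slide.
    """
    return centers * dims


if __name__ == "__main__":
    for points, centers in PROBLEM_SIZES:
        print(points, centers,
              distance_evaluations(points, centers),
              center_values_exchanged(centers, dims=3))
```

Every pair gives the same 5 × 10^10 distance evaluations per iteration, while the size of the centers table that has to be exchanged grows by a factor of 10 at each step, which is why communication dominates toward the right of the chart.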

(14)

• Mahout and Hadoop MR: slow due to MapReduce
• Python: slow as a scripting language
• Spark: Iterative MapReduce, non‐optimal communication
• Harp: Hadoop plug‐in with ~MPI collectives
• MPI: fastest, as it is C not Java

Across the problem sizes: Increasing Communication, Identical Computation

(15)

Performance of MPI Kernel Operations

[Figure: Performance of MPI send and receive operations. x‐axis: Message size (bytes), 0B to 512KB. y‐axis: Average time (us). Series: MPI.NET C# in Tempest; FastMPJ Java in FG; OMPI‐nightly Java FG; OMPI‐trunk Java FG; OMPI‐trunk C FG.]

[Figure: Performance of MPI allreduce operation. x‐axis: Message size (bytes), 4B to 4MB. y‐axis: Average time (us). Same five series.]

[Figure: Performance of MPI send and receive on Infiniband and Ethernet. x‐axis: Message Size (bytes). y‐axis: Average Time (us). Series: OMPI‐trunk C Madrid; OMPI‐trunk Java Madrid; OMPI‐trunk C FG; OMPI‐trunk Java FG.]

[Figure: Performance of MPI allreduce on Infiniband and Ethernet. Same axes and series.]

Pure Java as in FastMPJ slower than Java interfacing to C version of MPI

(16)

Use case 28: Truthy: Information diffusion research from Twitter Data

• Building blocks:
– Yarn
– Parallel query evaluation using Hadoop MapReduce
– Related hashtag mining algorithm using Hadoop MapReduce
– Meme daily frequency generation using MapReduce over index tables
– Parallel force‐directed graph layout algorithm using Twister (Harp) iterative MapReduce
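
The meme daily frequency building block follows the classic map and reduce pattern. The sketch below is a simplified in‐memory illustration and not the Truthy implementation: it scans raw tweet records rather than index tables, and the record fields `created_at` (an ISO date string) and `text`, together with the function names, are assumptions made for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Key = Tuple[str, str]  # (hashtag, day)


def map_tweet(tweet: Dict[str, str]) -> Iterator[Tuple[Key, int]]:
    """Map: emit ((hashtag, day), 1) once per distinct hashtag in a tweet."""
    day = tweet["created_at"][:10]  # assumes "YYYY-MM-DD..." timestamps
    for tag in set(re.findall(r"#\w+", tweet["text"].lower())):
        yield (tag, day), 1


def reduce_counts(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    """Reduce: sum the counts emitted for each (hashtag, day) key."""
    counts: Dict[Key, int] = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def daily_frequency(tweets: Iterable[Dict[str, str]]) -> Dict[Key, int]:
    """Run the map step over all tweets, then reduce by key."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return reduce_counts(pairs)
```

Running over index tables instead, as the slide describes, would let the map step read only the rows for the memes of interest rather than every tweet.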

(17)

Use case 28: Truthy: Information diffusion research from Twitter Data

[Figure: Two months’ data loading for varied cluster size]

[Figure: Scalability of iterative graph layout algorithm on Twister]

Hadoop‐FS not indexed

(18)

[Figure: Different Kmeans Implementation, Total execution time vs. mapper number. x‐axis: number of mappers (24, 48, 96). y‐axis: Total execution time (s), 0 to 2000. Series: Hadoop, Harp, Pig HD1 and Pig Yarn, each for problem sizes 100m,500; 10m,5000; 1m,50000.]

(19)

Lines of Code

                 Pig      Hadoop   Pig IndexedHBase      IndexedHBase
                 Kmeans   Kmeans   meme‐cooccur‐count    meme‐cooccur‐count
Java             ~345     780      152                   ~434
Pig              10       0        10                    0
Python / Bash    ~40      0        0                     28
Total Lines      395      780      162                   462

(20)

DACIDR for Gene Analysis (Use Case 19,20)

• Deterministic Annealing Clustering and Interpolative Dimension Reduction Method (DACIDR)
• Use Hadoop for pleasingly parallel applications, and Twister (being replaced by Yarn) for iterative MapReduce applications
• Sequences – Cluster → Centers
• Add Existing data and find Phylogenetic Tree

[Figure: Simplified Flow Chart of DACIDR. Stages shown: All‐Pair Sequence Alignment, Streaming, Pairwise Clustering, Multidimensional Scaling, Visualization]
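
The all‐pair stage of the flow chart is the pleasingly parallel part: the dissimilarity matrix can be cut into row blocks that are computed independently and then handed to clustering and multidimensional scaling. The sketch below illustrates that decomposition only. It swaps in plain edit distance as a stand‐in for a real sequence alignment score, and `row_block`, `all_pairs` and `block_size` are hypothetical names, not part of DACIDR.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as a stand-in for an alignment score."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def row_block(sequences: Sequence[str], start: int, stop: int) -> List[List[int]]:
    """One independent task: rows start..stop-1 of the distance matrix."""
    return [[edit_distance(sequences[i], s) for s in sequences]
            for i in range(start, stop)]


def all_pairs(sequences: Sequence[str], block_size: int = 2) -> List[List[int]]:
    """Assemble the full matrix from row blocks (run serially in this sketch)."""
    matrix: List[List[int]] = []
    for start in range(0, len(sequences), block_size):
        stop = min(start + block_size, len(sequences))
        matrix.extend(row_block(sequences, start, stop))
    return matrix
```

Each call to `row_block` needs no data from any other block, so in a map‐only job every block could be one map task with no reduce step.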

(21)

Summarize a million Fungi Sequences

Spherical Phylogram Visualization

RAxML result visualized in FigTree.

Spherical Phylogram from new MDS method visualized in PlotViz

(22)

Lessons / Insights

• Integrate (don’t compete) HPC with “Commodity Big data” (Google to Amazon to Enterprise data Analytics)
– i.e. improve Mahout; don’t compete with it
– Use Hadoop plug‐ins rather than replacing Hadoop
– Enhanced Apache Big Data Stack HPC‐ABDS has 120 members – please improve!

• HPC‐ABDS+ Integration areas include
– file systems,
– cluster resource management,
– file and object data management,
– inter process and thread communication,
– analytics libraries,
– Workflow
– monitoring