Iterative MapReduce Enabling HPC Cloud Interoperability

(1)

S

A

L

S

A

HPC Group

http://salsahpc.indiana.edu

(2)

A New Book from Morgan Kaufmann Publishers, an imprint of Elsevier, Inc., Burlington, MA 01803, USA. (ISBN: 9780123858801)

Distributed Systems and Cloud Computing:

From Parallel Processing to the Internet of Things

(3)

Twister

Bingjing Zhang, Richard Teng

Funded by Microsoft Foundation Grant, Indiana University's Faculty Research Support Program and NSF OCI-1032677 Grant

Twister4Azure

Thilina Gunarathne

Funded by Microsoft Azure Grant

High-Performance

Visualization Algorithms

For Data-Intensive Analysis

(4)

DryadLINQ CTP Evaluation

Hui Li, Yang Ruan, and Yuduo Zhou

Funded by Microsoft Foundation Grant

Cloud Storage, FutureGrid

Xiaoming Gao, Stephen Wu

Funded by Indiana University's Faculty Research Support Program and Natural Science Foundation Grant 0910812

Million Sequence Challenge

Saliya Ekanayake, Adam Hughs, Yang Ruan

Funded by NIH Grant 1RC2HG005806-01

Cyberinfrastructure for

Remote Sensing of Ice Sheets

Jerome Mitchell

(5)

5

(6)

(7)

(8)

(9)

S

A

L

S

A

Linux HPC

Bare-system

Amazon Cloud Windows Server

HPC

Bare-system

_{Virtualization}

Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)

Kernels,

Genomics

,

Proteomics

, Information Retrieval, Polar Science,

Scientific Simulation Data Analysis and Management, Dissimilarity

Computation,

Clustering

,

Multidimensional Scaling

, Generative Topological

Mapping

CPU Nodes

Virtualization

Applications

Programming

Model

Infrastructure

Hardware

Azure Cloud

Security, Provenance, Portal

High Level Language

Distributed File Systems

Data Parallel File System

Grid

Appliance

GPU Nodes

Support Scientific Simulations (Data Mining and Data Analysis)

Runtime

Storage

Services and Workflow

(10)

(11)

What are the challenges?

Providing both cost effectiveness and powerful parallel programming paradigms that is

capable of handling the incredible increases in dataset sizes.

(

large-scale data analysis for Data Intensive applications

)

Research issues

§

portability between HPC and Cloud systems

§

scaling performance

§

fault tolerance

These challenges must be met for both

computation and storage

. If computation and

storage are separated, it’s not possible to bring computing to data.

Data locality

§

its impact on performance;

§

the factors that affect data locality;

§

the maximum degree of data locality that can be achieved.

Factors beyond data locality to improve performance

To achieve the best data locality is not always the optimal scheduling decision. For

instance, if the node where input data of a task are stored is overloaded, to run the task

on it will result in performance degradation.

Task granularity and load balance

In MapReduce , task granularity is fixed.

This mechanism has two drawbacks

1) limited degree of concurrency

(12)

12 12

(13)

S

A

L

S

A

13

(14)

S

A

L

S

A

Clouds hide Complexity

14

SaaS

: Software as a Service

(e.g. Clustering is a service)

IaaS

(

HaaS

): Infrasturcture as a Service

(get computer time with a credit card and with a Web interface like EC2)

PaaS

: Platform as a Service

IaaS plus core software capabilities on which you build SaaS (e.g. Azure is a PaaS; MapReduce is a Platform)

Cyberinfrastructure

(15)

• Please sign and return your video waiver.

• Plan to arrive early to your session in order to copy

your presentation to the conference PC.

• Poster drop-off is at Scholars Hall on Wednesday

from 7:30 am – Noon. Please take your poster

with

you after the session on Wednesday evening.

Topic

Innovations in IP (esp. Open Source) SystemsConsistency models Integration of Mainframe and Large Systems Power-aware Profiling, Modeling, and OptimizationsIT Service and Relationship Management Scalable Fault Resilience Techniques for Large ComputingNew and Innovative Pedagogical Approaches Data grid & Semantic webPeer to peer computing Autonomic Computing Scalable Scheduling on Heterogeneous ArchitecturesHardware as a Service (HaaS) Utility computing Novel Programming Models for Large ComputingFault tolerance and reliability Optimal deployment configurationLoad balancing Web services Auditing, monitoring and schedulingSoftware as a Service (SaaS) Security and Risk Virtualization technologies High-performance computing Cloud-based Services and EducationMiddleware frameworks Cloud /Grid architecture

Number of Submissions

0 20 40 60 80 100

(16)

S

A

L

S

A

Gartner 2009 Hype Curve

Source: Gartner (August 2009)

(17)

17

L1 cache reference 0.5 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lock/unlock 25 ns

Main memory reference 100 ns

Compress 1K w/cheap compression algorithm 3,000 ns

Send 2K bytes over 1 Gbps network 20,000 ns

Read 1 MB sequentially from memory 250,000 ns

Round trip within same datacenter 500,000 ns

Disk seek 10,000,000 ns

Read 1 MB sequentially from disk 20,000,000 ns

(18)

18

http://thecloudtutorial.com/hadoop-tutorial.html

Servers running Hadoop at Yahoo.com

(19)

19

(20)

(21)

21

SPMD Software

v

Single Program Multiple Data (SPMD):

a coarse-grained

SIMD approach to programming for MIMD systems.

v

Data parallel software:

do the same thing to all elements

of a structure (e.g., many matrix algorithms).

Easiest to write and understand.

v

Unfortunately, difficult to apply to complex problems (as were the SIMD

machines; Mapreduce).

(22)

22

MPMD Software

v

Multiple Program Multiple Data (SPMD):

a coarse-grained M

IMD approach to programming

v

Data parallel software:

do the same thing to all elements

of a structure (e.g., many matrix algorithms).

Easiest to write and understand.

v

It applies to complex problems (e.g. MPI, distributed system).

(23)

23

Programming Models and Tools

23

MapReduce in Heterogeneous Environment

(24)

S

A

L

S

A

Next Generation Sequencing Pipeline on Cloud

24

Blast PairwiseDistance Calculation

Dissimilarity Matrix

N(N-1)/2 values FASTA File

N Sequences Pairingsblock

MapReduce

1

2

3

Clustering Visualization Plotviz

4

Visualization Plotviz MDS Pairwise clustering

MPI

4

5

• Users submit their jobs to the pipeline and the results will be shown in a visualization tool.

• This chart illustrate a hybrid model with MapReduce and MPI. Twister will be an unified solution for the pipeline mode.

• The components are services and so is the whole pipeline.

(25)

Motivation

Data

Deluge

MapReduce

Runtimes (MPI)

Classic Parallel

Experiencing in

many domains

Data Centered, QoS

Efficient and

Proven techniques

Input

Output

map

Input

map

reduce

Input

map

reduce

iterations

Pij

Expand the Applicability of MapReduce to more

classes

of Applications

(26)

• Distinction on static and variable

data

• Configurable long running

(cacheable) map/reduce tasks

• Pub/sub messaging based

communication/data transfers

• Broker Network for facilitating

(27)

configureMaps(..) configureReduce(..)

runMapReduce(..) while(condition){

} //end while

updateCondition() close() Combine() operation Reduce() Map() Worker Nodes

Communications/data transfers via the pub-sub broker network & direct TCP

Iterations

May send <Key,Value> pairs directly

Local Disk

Cacheable map/reduce tasks

• Main program may contain many

MapReduce invocations or iterative

MapReduce invocations

(28)

Worker Node Local Disk Worker Pool Twister Daemon Master Node Twister Driver Main Program B B B B

Pub/sub

Broker Network

Worker Node Local Disk Worker Pool Twister Daemon Scripts perform:

Data distribution, data collection, andpartition file creation

map

reduce _{Cacheable tasks}

(29)

(30)

(31)

(32)

• Distributed, highly scalable and highly available cloud

services as the building blocks.

• Utilize eventually-consistent , high-latency cloud services

effectively to deliver performance comparable to

traditional MapReduce runtimes.

• Decentralized architecture with global queue based

dynamic task scheduling

• Minimal management and maintenance overhead

• Supports dynamically scaling up and down of the compute

resources.

(33)

(34)

Performance with/without

data caching Speedup gained using data cache

Scaling speedup Increasing number of iterations

Number of Executing Map Task Histogram

Strong Scaling with 128M Data Points

(35)

Performance with/without

data caching Speedup gained using data cache

Scaling speedup Increasing number of iterations

Azure Instance Type Study Number of Executing Map Task Histogram Weak Scaling Data Size Scaling

(36)

Parallel Visualization

(37)

GTM

MDS (SMACOF)

Maximize Log-Likelihood

Minimize STRESS or SSTRESS

Objective Function

O(KN) (K << N)

O(N

2

₎

Complexity

• Non-linear dimension reduction

• Find an optimal configuration in a lower-dimension

• Iterative optimization method

Purpose

EM

Iterative Majorization (EM-like)

Optimization Method

Vector-based data

Non-vector (Pairwise similarity matrix)

(38)

MPI / MPI-IO

• Finding K clusters for N data points

• Relationship is a bipartite graph (bi-graph)

• Represented by K-by-N matrix (K << N)

• Decomposition for P-by-Q compute grid

• Reduce memory requirement by 1/PQ

K latent

points

N data

points

1

2 A

B

C

1

2 A

B

C

Parallel File System

Cray / Linux / Windows Cluster

Parallel HDF5

ScaLAPACK

(39)

Parallel MDS

• O(N

2

_{) memory and computation}

required.

– 100k data  480GB memory

• Balanced decomposition of NxN

matrices by P-by-Q grid.

– Reduce memory and computing requirement by 1/PQ

• Communicate via MPI primitives

MDS Interpolation

• Finding approximate

mapping position w.r.t.

k-NN’s prior mapping.

• Per point it requires:

–

O(M) memory

–

O(k) computation

• Pleasingly parallel

• Mapping 2M in 1450 sec.

–

vs. 100k in 27000 sec.

–

7500 times faster than

estimation of the full MDS.

39 c1

c2

c3

r1

(40)

• Full data processing by GTM or MDS is computing- and

memory-intensive

• Two step procedure

• Training

: training by M samples out of N data

• Interpolation

: remaining (N-M) out-of-samples are

approximated without training

n

In-sample

N-n

Out-of-sample

Total N data

(41)

PubChem data with CTD

visualization by using MDS (left) and GTM (right)

About 930,000 chemical compounds are visualized as a point in 3D space, annotated by the related genes in Comparative Toxicogenomics Database (CTD)

Chemical compounds shown in literatures, visualized by MDS (left) and GTM (right)

(42)

• This demo is for real time visualization of the

process of multidimensional scaling(MDS)

calculation.

• We use Twister to do parallel calculation inside the

cluster, and use PlotViz to show the intermediate

results at the user client computer.

(43)

(44)

Master Node Twister

Driver Twister-MDS

ActiveMQ

Broker MDS Monitor

PlotViz I. Send message to

start the job II. Send intermediate

results

Local Disk

(45)

MDS Output Monitoring Interface Pub/Sub

Broker Network

Worker Node Worker Pool Twister Daemon Master Node

Twister Driver Twister-MDS

Worker Node Worker Pool Twister Daemon

map reduce

map

(46)

Gene Sequences (N

= 1 Million)

Distance Matrix Interpolative MDS with Pairwise Distance Calculation Multi-Dimensional Scaling (MDS) Visualizatio

n 3D Plot

Reference Sequence Set

(M = 100K)

N - M Sequence Set (900K) Select Referenc e Reference Coordinates

x, y, z

N - M

Coordinates x, y, z

Pairwise Alignment & Distance Calculation

(47)

Twister Driver Node

Twister Daemon Node ActiveMQ Broker Node

Broker-Daemon Connection Broker-Broker Connection

Broker-Driver Connection

7 Brokers and 32 Computing Nodes in total

A. Full Mesh Network

B. Hierarchical Sending

C. Streaming

(48)

189.288 359.625 816.364 1508.487 148.805 303.432 737.073 1404.431

Number of Data Points

38400 51200 76800 102400

Total Execution Time (Seconds) 0.000 200.000 400.000 600.000 800.000 1000.000 1200.000 1400.000 1600.000

Twister-MDS Execution Time

100 iterations, 40 nodes, under different input data sizes

(49)

13.07 18.79 24.50 46.19 70.56 93.14

400M 600M 800M

Broadcasting Time (Unit: Second) 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00 100.00

(50)

Twister New Architecture

Worker Node

Twister Daemon Twister Driver

Worker Node

Twister Daemon map

reduce merge Broker

Map

Reduce

Master Node

Configure Mapper

Add to MemCache

Broker Broker

Cacheable tasks map broadcasting chain

(51)

Chain/Ring Broadcasting

Twister Daemon Node

Twister Driver Node • Driver sender:

• send broadcasting data • get acknowledgement

• send next broadcasting data • …

• Daemon sender:

• receive data from the last daemon (or driver)

• cache data to daemon

• Send data to next daemon (waits for ACK)

(52)

Chain Broadcasting Protocol

send

get ack send

Driver

Daemon 0

receive handle data send

Daemon 1

ack receive handle data send ack receive handle data get ack ack receive handle data get ack send send get ack

Daemon 2

receive handle data ack receive handle data ack ack get ack get ack ack

ack get ack

I know this is the end of Daemon Chain

(53)

Broadcasting Time Comparison

Execution No.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Broadcasting

Time

(Unit:

Seconds)

0 5 10 15 20 25 30

Broadcasting Time Comparison on 80 nodes, 600 MB data, 160

pieces

(54)

S

A

L

S

A

Applications & Different Interconnection Patterns

Map Only

Classic

MapReduce

Iterative Reductions

Twister

Loosely Synchronous

CAP3Analysis

Document conversion (PDF -> HTML)

Brute force searches in cryptography

Parametric sweeps

High Energy Physics (HEP) Histograms

SWG gene alignment Distributed search Distributed sorting Information retrieval Expectation maximization algorithms Clustering Linear Algebra

Many MPI scientific applications utilizing wide variety of

communication constructs including local interactions - CAP3 Gene Assembly

- PolarGrid Matlab data analysis

Information Retrieval -HEP Data Analysis

- Calculation of Pairwise Distances for ALU

Sequences - Kmeans - Deterministic Annealing Clustering - Multidimensional ScalingMDS

- Solving Differential Equations and

- particle dynamics with short range forces

Input

Output

map

Input

map

reduce

Input

map

reduce

iterations

Pij

(55)

• Development of

library of Collectives

to use at Reduce phase

• Broadcast and Gather needed by current applications

• Discover other important ones

• Implement efficiently on each platform – especially Azure

• Better

software message routing

with broker networks using

asynchronous I/O with communication fault tolerance

• Support

nearby location of data and computing

using data

parallel file systems

• Clearer application

fault tolerance

model based on implicit

synchronizations points at iteration end points

• Later: Investigate

GPU

support

(56)

Convergence is Happening

Multicore

Clouds

Data Intensive Paradigms

Data intensive application (three basic activities): capture, curation, and analysis (visualization)

Cloud infrastructure and runtime

(57)

S

A

L

S

A

FutureGrid: a Grid Testbed

• IU

Cray operational,

IU

IBM (iDataPlex) completed stability test May 6

• UCSD

IBM operational,

UF

IBM stability test completes ~ May 12

• Network

,

NID

and

PU

HTC system operational

• UC

IBM stability test completes ~ May 27;

TACC

Dell awaiting delivery of components

NID: Network Impairment Device

Private

(58)

S

A

L

S

A

• Switchable clusters on the same hardware (~5 minutes between different OS such as Linux+Xen to Windows+HPCS)

• Support for virtual clusters

• SW-G : Smith Waterman Gotoh Dissimilarity Computation as an pleasingly parallel problem suitable for MapReduce style applications Pub/Sub Broker Network Summarizer Switcher Monitoring Interface iDataplex Bare-metal Nodes XCAT Infrastructure Virtual/Physical Clusters

Monitoring & Control Infrastructure

iDataplex Bare-metal Nodes

(32 nodes)

XCAT Infrastructure

Linux Bare-system Linux on Xen Windows Server 2008 Bare-system SW-G Using

Hadoop SW-G UsingHadoop SW-G UsingDryadLINQ

Monitoring Infrastructure

Dynamic Cluster Architecture

SALSAHPC Dynamic Virtual Cluster on

(59)

S

A

L

S

A

SALSAHPC Dynamic Virtual Cluster on

FutureGrid -- Demo at SC09

• Top: 3 clusters are switching applications on fixed environment. Takes approximately 30 seconds.

• Bottom: Cluster is switching between environments: Linux; Linux +Xen; Windows + HPCS. Takes approxomately 7 minutes

• SALSAHPC Demo at SC09. This demonstrates the concept of Science on Clouds using a FutureGrid iDataPlex.

(60)

S

A

L

S

A

• Background: data intensive computing requires storage

solutions for huge amounts of data

• One proposed solution: HBase, Hadoop implementation of

Google’s BigTable

(61)

S

A

L

S

A

System design and implementation –

solution

• Inverted index:

“cloud” -> doc1, doc2, …

“computing” -> doc1, doc3, …

• Apache Lucene:

- A library written in Java for building inverted indices and supporting

full-text search

- Incremental indexing, document scoring, and multi-index search with

merged results, etc.

- Existing solutions using Lucene store index data with files – no natural

integration with HBase

(62)

S

A

L

S

A

System design

• Data from a real digital library application: bibliography data,

page image data, texts data

(63)

S

A

L

S

A

System design

• Table schemas:

- title index table: <term value> --> {frequencies:[<doc id>, <doc id>, ...]}

- texts index table: <term value> --> {frequencies:[<doc id>, <doc id>, ...]}

- texts term position vector table: <term value> --> {positions:[<doc id>,

<doc id>, ...]}

• Natural integration with HBase

• Reliable and scalable index data storage

• Real-time document addition and deletion

(64)

S

A

L

S

A

System implementation

• Experiments completed in the Alamo HPC cluster of FutureGrid

• MyHadoop -> MyHBase

(65)

S

A

L

S

A

Index data analysis

• Test run with 5 books

• Total number of distinct terms: 8263

(66)

(67)

S

A

L

S

A

Comparison with related work

• Pig and Hive:

- Distributed platforms for analyzing and warehousing large data sets

- Pig Latin and HiveQL have operators for search

- Suitable for batch analysis to large data sets

• SolrCloud, ElasticSearch, Katta:

- Distributed search systems based on Lucene indices

- Indices organized as files; not a natural integration with HBase

• Solandra:

(68)

S

A

L

S

A

Future work

• Distributed performance evaluation

• More data analysis or text mining based on the index data

(69)

(70)

(71)

(72)

(73)

73 High Energy Physics Data Analysis

Input to a map task: <key, value>

key = Some Id value = HEP file Name

Output of a map task: <key, value>

key = random # (0<= num<= max reduce tasks) value = Histogram as binary data

Input to a reduce task: <key, List<value>>

key = random # (0<= num<= max reduce tasks) value = List of histogram as binary data

Output from a reduce task: value

value = Histogram file

Combine outputs from reduce tasks to form the final histogram

(74)

74 Reduce Phase of Particle Physics

“Find the Higgs” using Dryad

• Combine Histograms produced by separate Root “Maps” (of event data to partial histograms) into a single Histogram delivered to Client

• This is an example using MapReduce to do distributed histogramming.

(75)

75

http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview.png

Sort

0.11,0.89,0.27 0.29,0.23,0.89 0.27, 0.23,0.11

Emit <

Bin

i

,

1 >

0.23, 0.89,0.27,

0.29,0.23, 0.89,

0.27, 0.23, 0.11

Events (

P

i

)

Bin23, 1 Bin89, 1 Bin27, 1

Bin29,1 Bin23, 1 Bin89, 1

Bin27, 1 Bin23, 1 Bin11, 1

Bin23,1 Bin23, 1 Bin23, 1 Bin11, 1 Bin11, 1

Bin27, 1 Bin27, 1

Bin89,1 Bin89, 1

Bin11, 2

Bin23, 3

Bin27, 2

Bin89, 2

(76)

76

18. _{output, Reporter reporter) throws IOException {}public voidmap(LongWritable key, Text value, OutputCollector<Text, IntWritable>

19. String line = value.toString();

20. StringTokenizer tokenizer = new StringTokenizer(line);

21. while (tokenizer.hasMoreTokens()) {

22. word.set(tokenizer.nextToken());// Parsing 23. output.collect(word, one);

24. }

/* Single property/histogram */

double event = 0; int BIN_SIZE = 100;

double bins[] = new double[BIN_SIZE]; …….

if (event is in bin[i]) {//Pseudo

event.set (i); }

output.collect(event, one);

From

WordCount

to

HEP Analysis

/* Multiple properties/histograms */

int PROPERTIES = 10;

int BIN_SIZE = 100; //assume properties are normalized double eventVector[] = new double[VEC_LENGTH]; double bins[] = new double[BIN_SIZE];

…….

for (int i=0; i <VEC_LENGTH; i++) {

for (int j = 0; j < PROPERTIES; j++) {

if (eventVector[i] is in bins[j]) {//Pseudo

++bins[j]; }

} }

(77)