Analysis Tools for Data Enabled Science

(1)

S

A

L

S

A

HPC Group

http://salsahpc.indiana.edu

(2)

Twister

Bingjing Zhang

Funded by Microsoft Foundation Grant,

Indiana University's Faculty Research Support

Program and NSF OCI-1032677 Grant

Twister4Azure

Thilina Gunarathne

Funded by Microsoft Azure Grant

High-Performance

Visualization Algorithms

For Data-Intensive Analysis

Seung-Hee Bae and Jong Youl Choi

(3)

DryadLINQ CTP Evaluation

Hui Li, Yang Ruan, and Yuduo Zhou

Funded by Microsoft Foundation Grant

Million Sequence Challenge

Saliya Ekanayake, Adam Hughs, Yang Ruan

Funded by NIH Grant 1RC2HG005806-01

Cyberinfrastructure for

Remote Sensing of Ice Sheets

Jerome Mitchell

(4)

Linux HPC

Bare-system

Amazon Cloud Windows Server

HPC

Bare-system

_{Virtualization}

CPU Nodes

Virtualization

Infrastructure

Hardware

Azure Cloud

_Grid

Appliance

GPU Nodes

Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)

Kernels, Genomics, Proteomics, Information Retrieval, Polar Science

Scientific Simulation Data Analysis and Management

Dissimilarity Computation, Clustering, Multidimentional Scaling, Generative

Topological Mapping

Applications

Programming

Model

Services and Workflow

High Level Language

Distributed File Systems

Data Parallel File System

Runtime

Storage

Object Store

(5)

(a) Map Only (b) Classic MapReduce (c) Iterative MapReduce (d) Loosely Synchronous Input map reduce Input map reduce Iterations Input Output map

P

ij CAP3 Analysis Smith-Waterman Distances Parametric sweeps

PolarGrid Matlab data analysis

High Energy Physics (HEP) Histograms

Distributed search Distributed sorting Information retrieval

Many MPI scientific applications such as solving differential equations and particle dynamics

Domain of MapReduce and Iterative Extensions MPI

Expectation maximization clustering e.g. Kmeans

(6)

GTM

MDS (SMACOF)

Maximize Log-Likelihood

Minimize STRESS or SSTRESS

Objective

Function

O(KN) (K << N)

O(N

2

₎

Complexity

• Non-linear dimension reduction

• Find an optimal configuration in a lower-dimension

• Iterative optimization method

Purpose

EM

Iterative Majorization (EM-like)

Optimization

Method

Vector-based data

Non-vector (Pairwise similarity matrix)

(7)

Parallel Visualization

(8)

• Distinction on static and variable

data

• Configurable long running

(cacheable) map/reduce tasks

• Pub/sub messaging based

communication/data transfers

• Broker Network for facilitating

(9)

configureMaps(..)

configureReduce(..)

runMapReduce(..)

while(

condition

){

} //end while

updateCondition()

close()

Combine()

operation

Reduce()

Map()

Worker Nodes

Communications/data transfers via the

pub-sub broker network & direct TCP

Iterations

May send <Key,Value> pairs directly

Local Disk

Cacheable map/reduce tasks

• Main program may contain many

MapReduce invocations or iterative

MapReduce invocations

(10)

Worker Node

Local Disk

Worker Pool

Twister Daemon

Master Node

Twister

Driver

Main Program

B

Pub/sub

Broker Network

Worker Node

Local Disk

Worker Pool

Twister Daemon

Scripts perform:

Data distribution, data collection,

and

partition file

creation

map

reduce

_{Cacheable tasks}

(11)

Master Node

Twister

Driver

Twister-MDS

ActiveMQ

Broker

MDS Monitor

PlotViz

I. Send message to

start the job

II. Send intermediate

results

(12)

• Method A

• Hierarchical Sending

• Method B

• Improved Hierarchical Sending

• Method C

(13)

Twister Driver Node Twister Daemon Node ActiveMQ Broker Node

Broker-Daemon Connection Broker-Broker Connection 8 Brokers and 32 Daemon Nodes in total

(14)

Twister Daemon Node ActiveMQ Broker Node

Broker-Daemon Connection Broker-Broker Connection

8 Brokers and 32 Daemon Nodes in total

Twister Driver Node

(15)

• Time used for the first level sending,

(

𝑁

𝑏

+ 𝑏−1)𝛼

• Time used for the second level sending

𝑁

𝑏

𝛼

(sending in

parallel)

• 𝑁

is the number of Twister Daemon Nodes

• 𝑏

is the number of brokers

• 𝛼

is the transmission time for each sending

• Get the derivation of

𝑏

,

𝑏 = 2𝑁

• That is when

𝑏 = 2𝑁

, the total broadcasting time is the

(16)

Twister Driver Node Twister Daemon Node

ActiveMQ Broker Node 7 Brokers and 32 DaemonNodes in total

(17)

Twister Daemon Node

ActiveMQ Broker Node 7 Brokers and 32 ComputingNodes in total

Twister Driver Node

(18)

• 𝑡 = 𝑏−1 𝛼 +

𝑁

𝑏−1

𝛼

,

𝑡

comes to the minimum when

𝑏 = 𝑁 + 1

,

𝑡 = 2 𝑁 𝛼

• 𝑁

is the number of Twister Daemon Nodes

• 𝑏

is the number of brokers

(19)

Number of Brokers

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

Execution

Time

(Unit:

Second)

(20)

Twister Daemon Node

ActiveMQ Broker Node 5 Brokers and 4 ComputingNodes in total

Twister Driver Node

(21)

Centroids

Centroid 1

Centroid 2

Centroid 3

(22)

Centroid 1

Twister Daemon Node ActiveMQ Broker Node

Twister Driver Node

Centroid 2 Centroid 3 Centroid 4

Centroid 1

Centroid 2

Centroid 4

Centroid 3

Centroid 1

Centroid 2

Centroid 4

Centroid 3

Centroid 1

Centroid 2

Centroid 4

Centroid 3

Centroid 1

Centroid 2

(23)

Twister Map Task ActiveMQ Broker Node

Centroid 1

Centroid 2

Centroid 3

Centroid 4

Twister Reduce Task

Centroid 1

Centroid 2

Centroid 4

Centroid 3

Centroid 1

Centroid 2

Centroid 4

Centroid 3

Centroid 1

Centroid 2

Centroid 4

Centroid 3

Centroid 1

Centroid 2

(24)

13.07 18.79 24.50 46.19 70.56 93.14

400M 600M 800M

Broadcasting Time (Unit: Second) 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00 100.00

(25)

• Distributed, highly scalable and highly available cloud

services as the building blocks.

• Utilize eventually-consistent , high-latency cloud services

effectively to deliver performance comparable to

traditional MapReduce runtimes.

• Decentralized architecture with global queue based

dynamic task scheduling

• Minimal management and maintenance overhead

• Supports dynamically scaling up and down of the compute

resources.

(26)

(27)

(28)

Performance with/without

data caching

Speedup gained using data cache

Scaling speedup

Increasing number of iterations

Number of Executing Map Task Histogram

Strong Scaling with 128M Data Points

(29)

Performance with/without

data caching

Speedup gained using data cache

Scaling speedup

Increasing number of iterations

Azure Instance Type Study Number of Executing Map Task Histogram

Weak Scaling

Data Size Scaling

(30)

(31)

Chris Hemmerich, Adam Hughes, Yang Ruan, Aaron Buechlein, Judy Qiu, and Geoffrey Fox. Map-Reduce Expansion of the ISGA Genomic Analysis Web Server (2010) The 2nd IEEE International Conference on Cloud Computing Technology and Science

ISGA

Ergatis

TIGR Workflow

SGE

Condor

_{Other DCEs}

Cloud,

<<XML>>

(32)

(33)

Gene Sequences (N

= 1 Million)

Distance Matrix Interpolative MDS with Pairwise Distance Calculation Multi-Dimensional Scaling (MDS) Visualizatio

n 3D Plot

Reference Sequence Set

(M = 100K)

N - M Sequence Set (900K) Select Referenc e Reference Coordinates

x, y, z

N - M

Coordinates x, y, z

Pairwise Alignment & Distance Calculation

(34)

• Input DataSize: 680k

• Sample Data Size: 100k

• Out-Sample Data Size: 580k

• Test Environment: PolarGrid with 100 nodes, 800 workers.

(35)

(36)

MPI / MPI-IO

•

Finding K clusters for N data points

• Relationship is a bipartite graph (bi-graph)

• Represented by K-by-N matrix (K << N)

• Decomposition for P-by-Q compute grid

• Reduce memory requirement by 1/PQ

K latent

points

N data

points

1

2 A

B

C

1

2 A

B

C

Parallel File System

Cray / Linux / Windows Cluster

Parallel HDF5

ScaLAPACK

(37)

Parallel MDS

• O(N

2

_{) memory and computation}

required.

–

100k data



480GB memory

• Balanced decomposition of NxN

matrices by P-by-Q grid.

–

Reduce memory and computing

requirement by 1/PQ

• Communicate via MPI primitives

MDS Interpolation

• Finding approximate

mapping position w.r.t.

k-NN’s prior mapping.

• Per point it requires:

–

O(M) memory

–

O(k) computation

• Pleasingly parallel

• Mapping 2M in 1450 sec.

–

vs. 100k in 27000 sec.

–

7500 times faster than

estimation of the full MDS.

37 c1

c2

c3

r1

(38)

• Full data processing by GTM or MDS is computing- and

memory-intensive

• Two step procedure

• Training

: training by M samples out of N data

• Interpolation

: remaining (N-M) out-of-samples are

approximated without training

n

In-sample

N-n

Out-of-sample

Total N data

(39)

PubChem data with CTD

visualization by using MDS (left)

and GTM (right)

About 930,000 chemical compounds

are visualized as a point in 3D space,

annotated by the related genes in

Comparative Toxicogenomics

Database (CTD)

Chemical compounds shown in

literatures, visualized by MDS (left)

and GTM (right)

(40)

(41)

(42)

(43)

(44)

(45)

• Investigate in applicability and performance of DryadLINQ CTP to

develop scientific applications.

• Goals:

• Evaluate key features and

interfaces

• Probe parallel programming

models

• Three applications:

• SW-G bioinformatics application

• Matrix Multiplication

(46)

• Parallel algorithms for

matrix multiplication

• Row partition

• Row column partition

• 2 dimensional block

decomposition in Fox

algorithm

• Multi core technologies

• PLINQ, TPL, and Thread Pool

• Hybrid parallel model

• Port multi-core to Dryad task

to improve Performance

• Timing model for MM

Input data size

2400 4800 7200 9600 12000 14400 16800 19200

Parallel Efficiency 0 0.1 0.2 0.3 0.4

0.5 _RowPartition _{RowColumnPartition} _Fox-Hey

RowPartition RowColumnPartition Fox-Hey

Speed up 0 20 40 60 80 100 120 140

(47)

• Workload of SW-G, a pleasingly parallel application, is heterogeneous

due to the difference in input gene sequences. Hence workload balancing

becomes an issue.

• Two approach to alleviate it:

• Randomized distributed input data

• Partition job into finer granularity tasks

0 50 100 150 200 250 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Skewed Randomized Standard Deviation Exectuion Time (Seconds)

Number of Partitions

31 62 124 186 248

Execution Time (Seconds) 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

(48)