• No results found

Scientific Data Analytics on Cloud and HPC Platforms

N/A
N/A
Protected

Academic year: 2020

Share "Scientific Data Analytics on Cloud and HPC Platforms"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

Scientific Data Analytics on Cloud

and HPC Platforms

S

A

L

S

A

HPC Group

http://salsahpc.indiana.edu

School of Informatics and Computing

Indiana University

Judy Qiu

(2)

03/02/2020

Bill Howe, eScience Institute

2

"... computing may someday be organized as a public utility just as

the telephone system is a public utility... The computer utility could

become the basis of a new and important industry.

Emeritus at Stanford

Inventor of LISP

-- John McCarthy

(3)
(4)

Challenges and Opportunities

Iterative MapReduce

A Programming Model instantiating the paradigm of

bringing computation to data

Supporting for Data Mining and Data Analysis

Interoperability

Using the same computational tools on HPC and Cloud

Enabling scientists to focus on science not programming

distributed systems

Reproducibility

Using Cloud Computing for Scalable, Reproducible

Experimentation

(5)

S

A

L

S

A

(6)

S

A

L

S

A

Linux HPC

Bare-system

Amazon Cloud Windows Server

HPC

Bare-system

Virtualization

Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)

Kernels,

Genomics

,

Proteomics

, Information Retrieval, Polar Science,

Scientific Simulation Data Analysis and Management, Dissimilarity

Computation,

Clustering

,

Multidimensional Scaling

, Generative Topological

Mapping

CPU Nodes

Virtualization

Applications

Programming

Model

Infrastructure

Hardware

Azure Cloud

Security, Provenance, Portal

High Level Language

Distributed File Systems

Data Parallel File System

Grid

Appliance

GPU Nodes

Support Scientific Simulations (Data Mining and Data Analysis)

Runtime

Storage

Services and Workflow

(7)

S

A

L

S

A

Map

Reduce

Programming Model

Moving Computation

to Data

Scalable

Fault

Tolerance

Simple programming model

Excellent fault tolerance

Moving computations to data

Works very well for data intensive pleasingly

parallel applications

(8)

8

MapReduce in Heterogeneous Environment

(9)

Twister

[1]

Map->Reduce->Combine->Broadcast

Long running map tasks (data in memory)

Centralized driver based, statically scheduled.

Daytona

[3]

Iterative MapReduce on Azure using cloud services

Architecture similar to Twister

Haloop

[4]

On disk caching, Map/reduce input caching, reduce output caching

Spark

[5]

Iterative Mapreduce Using Resilient Distributed Dataset to ensure the

fault tolerance

Pregel

[6]

Graph processing from Google

(10)

Others

Mate-EC2

[6]

Local reduction object

Network Levitated Merge

[7]

RDMA/infiniband based shuffle & merge

Asynchronous Algorithms in MapReduce

[8]

Local & global reduce

MapReduce online

[9]

online aggregation, and continuous queries

Push data from Map to Reduce

Orchestra

[10]

Data transfer improvements for MR

iMapReduce

[11]

Async iterations, One to one map & reduce mapping, automatically

joins loop-variant and invariant data

CloudMapReduce

[12]

& Google AppEngine MapReduce

[13]

(11)
(12)

Distinction on static and variable

data

Configurable long running

(cacheable) map/reduce tasks

Pub/sub messaging based

communication/data transfers

(13)

configureMaps(..) configureReduce(..)

runMapReduce(..)

while(condition){

} //end while

updateCondition() close() Combine() operation Reduce() Map() Worker Nodes

Communications/data transfers via the pub-sub broker network & direct TCP

Iterations

May send <Key,Value> pairs directly

Local Disk

Cacheable map/reduce tasks

Main program may contain many

MapReduce invocations or iterative

MapReduce invocations

(14)

Worker Node Local Disk Worker Pool Twister Daemon Master Node Twister Driver Main Program B B B B

Pub/sub

Broker Network

Worker Node Local Disk Worker Pool Twister Daemon Scripts perform:

Data distribution, data collection,

andpartition file creation

map

reduce Cacheable tasks

(15)

Applications of Twister4Azure

Implemented

Multi Dimensional Scaling

KMeans Clustering

PageRank

SmithWatermann-GOTOH sequence alignment

WordCount

Cap3 sequence assembly

Blast sequence search

GTM & MDS interpolation

Under Development

(16)

Twister4Azure Architecture

(17)

Data Intensive Iterative Applications

Growing class of applications

Clustering, data mining, machine learning & dimension

reduction applications

Driven by data deluge & emerging computation fields

Compute Communication Reduce/ barrier

New Iteration

Larger

Loop-Invariant Data

(18)

Iterative MapReduce for Azure Cloud

Merge step

http://salsahpc.indiana.edu/twister4azure

Extensions to support

broadcast data

Multi-level caching

of static data

Hybrid intermediate

data transfer

Cache-aware

Hybrid Task

Scheduling

Collective

Communication

Primitives

(19)

Performance of Pleasingly Parallel Applications

on Azure

BLAST Sequence Search

Cap3 Sequence Assembly

Smith Watermann Sequence Alignment

(20)

Number of Instances/Cores

32 64 96 128 160 192 224 256

Relative Parallel Efficiency 0 0.2 0.4 0.6 0.8 1 1.2

Twister4Azure Twister Hadoop

Performance with/without

data caching Speedup gained using data cache

Scaling speedup Increasing number of iterations

Number of Executing Map Task Histogram

Strong Scaling with 128M Data Points

Weak Scaling Task Execution Time Histogram

First iteration performs the initial data fetch

Overhead between iterations

Scales better than Hadoop on bare metal

Num Nodes x Num Data Points

32 x 32 M 64 x 64 M 96 x 96 M 128 x 128 M 192 x 192 M 256 x 256 M

(21)

Weak Scaling Data Size Scaling

Performance adjusted for sequential performance difference

X:

Calculate invV

(BX)

Map Reduce Merge

BC:

Calculate BX

Map Reduce Merge

Calculate

Stress

Map Reduce Merge

New Iteration

(22)

Parallel Data Analysis using Twister

MDS (Multi Dimensional Scaling)

Clustering (Kmeans)

SVM (Scalable Vector Machine)

Indexing

(23)

MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children’s Research Institute

(24)

Data Intensive Kmeans Clustering

Image Classification: 1.5 TB;

500 features per image;10k clusters

1000 Map tasks; 1GB data transfer per Map task

(25)

Ø

Broadcasting

q Data could be large

q Chain & MST

Ø

Map Collectives

q Local merge

Ø

Reduce Collectives

q Collect but no merge

Ø

Combine

q Direct download or Gather

Map Tasks Map Tasks

(26)

Improving Performance of Map Collectives

(27)

Polymorphic Scatter-Allgather in Twister

Number of Nodes

0

20

40

60

80

100

120

140

(28)

Twister Performance on Kmeans Clustering

Per Iteration Cost (Before)

Per Iteration Cost (After)

Ti

me

(U

ni

t:

Seco

nd

s)

0

50

100

150

200

250

300

350

400

450

(29)

Twister on InfiniBand

InfiniBand successes in HPC community

More than 42% of Top500 clusters use InfiniBand

Extremely high throughput and low latency

Up to 40Gb/s between servers and 1μsec latency

Reduce CPU overhead up to 90%

Cloud community can benefit from InfiniBand

Accelerated Hadoop (sc11)

HDFS benchmark tests

RDMA can make Twister faster

Accelerate static data distribution

Accelerate data shuffling between mappers and reducer

In collaboration with ORNL on a large InfiniBand

(30)
(31)

Twister Broadcast Comparison:

Ethernet vs. InfiniBand

Seco

nd

0

5

10

15

20

25

InfiniBand Speed Up Chart – 1GB bcast

(32)

32

Building Virtual Clusters

Towards Reproducible eScience in the Cloud

Separation of concerns between two layers

Infrastructure Layer

– interactions with the Cloud API

(33)

33

Separation Leads to Reuse

Infrastructure Layer = (*)

Software Layer = (#)

(34)

34

Design and Implementation

Equivalent machine images (MI) built in separate clouds

Common underpinning in separate clouds for software

installations and configurations

Configuration management used for software automation

(35)

35

Cloud Image Proliferation

ahassanyandbos ashley-image-bucket buzztrollcentos53centos56 cidtestimage clovr debian-rm1984 dikim-fedora-bucketfedora-image-bucket fedora-mex-image-bucket grid-appliance

grid-appliance-test1gridappliance-twisterimage-bucket-gerald

jdiazjklingin mybucketmyimage p434-ubuntu.9.04-image-bucket pegasus-images provision saga-mr-euca-bucket SGXImage smaddi2-bfast-bj tbuckettry-xen ubuntu-image-bucket ubuntu-MEX-image-bucket ubuntu904wchen wchen-server-stage-1 yye 0 2 4 6 8 10 12

(36)
(37)

37

Implementation - Hadoop Cluster

Hadoop cluster commands

knife hadoop launch {name} {slave count}

(38)

38

Running CloudBurst on Hadoop

Running CloudBurst on a 10 node Hadoop Cluster

• knife hadoop launch cloudburst 9

• echo ‘{"run list": "recipe[cloudburst]"}' > cloudburst.json

• chef-client -j cloudburst.json

Cluster Size (node count)

10 20 50

Run Time (seconds ) 0 50 100 150 200 250 300 350

400 CloudBurst Sample Data Run-Time Results

CloudBurst FilterAlignments

(39)
(40)
(41)

Applications & Different Interconnection Patterns

Map Only

Classic

MapReduce

Iterative MapReduce

Twister

Synchronous

Loosely

CAP3Analysis

Document conversion (PDF -> HTML)

Brute force searches in cryptography

Parametric sweeps

High Energy Physics (HEP) Histograms

SWG gene alignment Distributed search Distributed sorting Information retrieval Expectation maximization algorithms Clustering Linear Algebra

Many MPI scientific applications utilizing wide variety of

communication constructs including local interactions - CAP3 Gene Assembly

- PolarGrid Matlab data analysis

Information Retrieval -HEP Data Analysis

- Calculation of Pairwise Distances for ALU

Sequences - Kmeans - Deterministic Annealing Clustering - Multidimensional ScalingMDS

- Solving Differential Equations and

- particle dynamics with short range forces

Input

Output

map

Input

map

reduce

Input

map

reduce

iterations

Pij

(42)
(43)

S

A

L

S

A

HPC Group

http://salsahpc.indiana.edu

References

Related documents

As a result of this wage moderation, workers experienced deteriorating real wages resulting in a strong wage compression at the upper tail of the real hourly wage distribution

“I think we’re just scared.” Sooyoung took time to look at Hyoyeon and then at Jessica?. “We’re scared of knowing what she

Table 3b shows that when SENTIMENT is positive, monthly returns are 0.32 percent higher on profitable than unprofitable firms and 0.45 percent higher on payers than nonpayers..

Na detekciu prítomnosti karty a na zisťovanie polohy, v ktorej sa nachádza posuvník proti zápisu na kartu sa používajú dva piny, ktoré sú pripojené na GPIO rozhranie..

In particular, we find that long-term capital gain distributions are increasing in the proportion of defined contribution assets in the fund and that mutual funds held primarily

• Other top municipalities in 2007 thus far are Atlantic City with $151.4 million of work, Newark with $130.7 million (both of these figures are mentioned above – do you want to

In such a distribution setup the customer service (lead-time) is maximized, without increasing the inventory levels (being the main logistics costs driver). The responsive,

This study allows educators within higher education to better understand the complex processes of civic commitment development and how to holistically support college students