FutureGrid and Cyberinfrastructure supporting Data Analysis

(1)


October 11 2010

Googleplex

Mountain View CA

Geoffrey Fox

[email protected]

http://www.infomall.org http://www.futuregrid.org

Community Grids Laboratory, Pervasive Technology Institute

(2)

Abstract

• TeraGrid has been NSF's production environment typified by large scale scientific computing simulations typically using MPI.

• Recently TeraGrid has added three experimental environments: Keeneland (a GPGPU cluster), Gordon (a distributed shared memory cluster with SSD disks aimed at data analysis and visualization), and FutureGrid.

• FutureGrid is a small distributed set of clusters (~5000 cores) supporting HPC, Cloud and Grid computing experiments for both applications and computer science.

• Users can request arbitrary configurations; the allocated FutureGrid nodes are then rebooted on demand from a library of certified images.

• FutureGrid will in particular allow traditional Grid and MPI researchers to explore the value of new technologies such as MapReduce, Bigtable and basic cloud VM infrastructure.

• Further it supports development of the needed new approaches to data

(3)

US Cyberinfrastructure Context

• There is a rich set of facilities

– Production TeraGrid facilities with distributed and shared memory

– Experimental “Track 2D” Awards

  • FutureGrid: Distributed Systems experiments cf. Grid5000

  • Keeneland: Powerful GPU Cluster

  • Gordon: Large (distributed) Shared memory system with SSD aimed at data analysis/visualization

– Open Science Grid aimed at High Throughput computing and strong campus bridging

(4)

TeraGrid

• ~2 Petaflops; over 20 PetaBytes of storage (disk and tape), over 100 scientific data collections

[Figure: map of TeraGrid sites, from TeraGrid ‘10, August 2-5, 2010, Pittsburgh, PA. Legend: Resource Provider (RP); Software Integration Partner; Grid Infrastructure Group (UChicago). Sites shown: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC/ISI, UNC/RENCI, UW, NICS, LONI]

(5)

Keeneland – NSF-Funded Partnership to Enable Large-scale Computational Science on Heterogeneous Architectures

• NSF Track 2D System of Innovative Design

– Georgia Tech

– UTK NICS

– ORNL

– UTK

• Two GPU clusters

– Initial delivery (~250 CPU, 250 GPU): being built now; expected availability is November 2010

– Full scale (> 500 GPU): Spring 2012

– NVIDIA, HP, Intel, Qlogic

• Operations, user support

• Education, Outreach, Training for scientists, students, industry

• Software tools, application development

• Exploit graphics processors to provide extreme performance and energy efficiency

[Figure: NVIDIA’s new Fermi GPU]

(6)

(7)

(8)

FutureGrid key Concepts I

• FutureGrid is an international testbed modeled on Grid5000

• Rather than loading images onto VM’s, FutureGrid supports Cloud, Grid and Parallel computing environments by dynamically provisioning software as needed onto “bare-metal” using Moab/xCAT

– Image library for MPI, OpenMP, Hadoop, Dryad, gLite, Unicore, Globus, Xen, ScaleMP (distributed Shared Memory), Nimbus, Eucalyptus, OpenNebula, KVM, Windows …..

• The FutureGrid testbed provides to its users:

– A flexible development and testing platform for middleware and application users looking at interoperability, functionality and performance

– Each use of FutureGrid is an experiment that is reproducible

– A rich education and teaching platform for advanced cyberinfrastructure classes

(9)

Dynamic Provisioning Results

[Figure: Total Provisioning Time in minutes (time axis from 0:00:00 to 0:04:19) plotted for the values 4, 8, 16 and 32]

Time elapsed between requesting a job and the job's reported start time on the provisioned node. The numbers here are an average of 2 sets of experiments.

(10)

FutureGrid key Concepts II

• Support Computer Science and Computational Science

– Industry and Academia

– Europe and USA

• FutureGrid has ~5000 distributed cores with a dedicated network and a Spirent XGEM network fault and delay generator

• Key early user oriented milestones:

– June 2010: Initial users

– November 2010-September 2011: Increasing number of users allocated by FutureGrid

– October 2011: FutureGrid allocatable via TeraGrid process

– 3 classes using FutureGrid this fall

(11)

FutureGrid Partners

• Indiana University (Architecture, core software, Support)

– Collaboration between research and infrastructure groups

• Purdue University (HTC Hardware)

• San Diego Supercomputer Center at University of California San Diego (INCA, Monitoring)

• University of Chicago/Argonne National Labs (Nimbus)

• University of Florida (ViNE, Education and Outreach)

• University of Southern California Information Sciences (Pegasus to manage experiments)

• University of Tennessee Knoxville (Benchmarking)

• University of Texas at Austin/Texas Advanced Computing Center (Portal)

• University of Virginia (OGF, Advisory Board and allocation)

• Center for Information Services and GWT-TUD from Technische Universität Dresden (VAMPIR)

(12)

FutureGrid: a Grid/Cloud/HPC

Testbed

• Operational: IU Cray operational; IU, UCSD, UF & UC IBM iDataPlex operational

• Network, NID operational

• TACC Dell finished acceptance tests

NID: Network Impairment Device

[Figure: FutureGrid network map showing the private and public FG Network]

(13)

Network & Internal Interconnects

• FutureGrid has dedicated network (except to TACC) and a network fault and delay generator

• Can isolate experiments on request; IU runs Network for NLR/Internet2

• (Many) additional partner machines will run FutureGrid software and be supported (but allocated in specialized ways)

Machine | Name | Internal Network

IU Cray | xray | Cray 2D Torus SeaStar

IU iDataPlex | india | DDR IB, QLogic switch with Mellanox ConnectX adapters; Blade Network Technologies & Force10 Ethernet switches

SDSC iDataPlex | sierra | DDR IB, Cisco switch with Mellanox ConnectX adapters; Juniper Ethernet switches

UC iDataPlex | hotel | DDR IB, QLogic switch with Mellanox ConnectX adapters; Blade Network Technologies & Juniper switches

(14)

Network Impairment Device

• Spirent XGEM Network Impairments Simulator for jitter, errors, delay, etc

• Full Bidirectional 10G w/64 byte packets

• up to 15 seconds introduced delay (in 16ns increments)

• 0-100% introduced packet loss in .0001% increments

• Packet manipulation in first 2000 bytes

• up to 16k frame size

(15)

FutureGrid Usage Model

• The goal of FutureGrid is to support the research on the future of distributed, grid, and cloud computing

• FutureGrid will build a robustly managed simulation environment and test-bed to support the development and early use in science of new technologies at all levels of the software stack: from networking to middleware to scientific applications

• The environment will mimic TeraGrid and/or general parallel and distributed systems – FutureGrid is part of TeraGrid (but not part of formal TeraGrid process for first two years)

– Supports Grids, Clouds, and classic HPC

– It will mimic commercial clouds (initially IaaS not PaaS)

– Expect FutureGrid PaaS to grow in importance

• FutureGrid can be considered as a (small ~5000 core) Science/Computer Science Cloud but it is more accurately a virtual machine or bare-metal based simulation environment

• This test-bed will succeed if it enables major advances in

(16)

Some Current FutureGrid early uses

• Investigate metascheduling approaches on Cray and iDataPlex

• Deploy Genesis II and Unicore end points on Cray and iDataPlex clusters

• Develop new Nimbus cloud capabilities

• Prototype applications (BLAST) across multiple FutureGrid clusters and Grid’5000

• Compare Amazon, Azure with FutureGrid hardware running Linux, Linux on Xen or Windows for data intensive applications

• Test ScaleMP software shared memory for genome assembly

• Develop Genetic algorithms on Hadoop for optimization

• Attach power monitoring equipment to iDataPlex nodes to study power use versus use characteristics

• Industry (Columbus IN) running CFD codes to study combustion strategies to maximize energy efficiency

• Support evaluation needed by XD TIS and TAS services

• Investigate performance of Kepler workflow engine

• Study scalability of SAGA in different latency scenarios

• Test and evaluate new algorithms for phylogenetics/systematics research in CIPRES portal

• Investigate performance overheads of clouds in parallel and distributed environments

• Support tutorials and classes in cloud, grid and parallel computing (IU, Florida, LSU)

• ~12 active/finished users out of ~32 early user applicants

(17)

Grid Interoperability from Andrew Grimshaw

• Colleagues,

• FutureGrid has as two of its many goals the creation of a Grid middleware testing and interoperability testbed as well as the maintenance of standards compliant endpoints against which experiments can be executed. We at the University of Virginia are tasked with bringing up three stacks as well as maintaining standard-endpoints against which these experiments can be run.

• We currently have UNICORE 6 and Genesis II endpoints functioning on X-Ray (a Cray). Over the next few weeks we expect to bring two additional resources, India and Sierra (essentially Linux clusters), on-line in a similar manner (Genesis II is already up on Sierra). As called for in the FutureGrid program execution plan, once those two stacks are operational we will begin to work on g-lite (with help we may be able to accelerate that). Other standards-compliant endpoints are welcome in the future, but not part of the current funding plan.

• I’m writing to the PGI and GIN working groups to see if there is interest in using these resources (endpoints) as a part of either the GIN or PGI work, in particular in demonstrations or projects for OGF in October or SC in November. One of the key differences between these endpoints and others is that they can be expected to persist. These resources will not go away when a demo is done. They will be there as a testbed for future application and middleware development (e.g., a metascheduler that works across g-lite and Unicore 6).

(18)

OGF’10 Demo

[Figure: demo sites SDSC, UF, UC, Lille, Rennes and Sophia]

ViNe provided the necessary inter-cloud connectivity to deploy CloudBLAST across 5 Nimbus sites, with a mix of public and private subnets.

(19)

July 26-30, 2010 NCSA Summer School Workshop http://salsahpc.indiana.edu/tutorial

Sites shown: University of Arkansas, Indiana University, University of California at Los Angeles, Penn State, Iowa State, Univ. Illinois at Chicago, University of Minnesota, Michigan State, Notre Dame, University of Texas at El Paso, IBM Almaden Research Center, Washington University, San Diego Supercomputer Center, University of Florida, Johns Hopkins

(20)

Software Components

• Portals including “Support” “use FutureGrid” “Outreach”

• Monitoring – INCA, Power (GreenIT)

• Experiment Manager: specify/workflow

• Image Generation and Repository

• Intercloud Networking ViNE

• Virtual Clusters built with virtual networks

• Performance library

• Rain or Runtime Adaptable InsertioN Service: Schedule and Deploy images

• Security (including use of isolated network),

(21)

FutureGrid Layered Software Stack

(22)

FutureGrid Interaction with Commercial Clouds

• We support experiments that link Commercial Clouds and FutureGrid with one or more workflow environments and portal technology installed to link components across these platforms

• We support environments on FutureGrid that are similar to Commercial Clouds and natural for performance and functionality comparisons

– These can both be used to prepare for using Commercial Clouds and as the most likely starting point for porting to them

– One example would be support of MapReduce-like environments on FutureGrid including Hadoop on Linux and Dryad on Windows HPCS which are already part of FutureGrid portfolio of supported software

• We develop expertise and support porting to Commercial Clouds from other Windows or Linux environments

• We support comparisons between and integration of multiple commercial Cloud environments – especially Amazon and Azure in the immediate future

• We develop tutorials and expertise to help users move to Commercial

(23)

(24)

Scientific Computing Architecture

• Traditional Supercomputers (TeraGrid and DEISA) for large scale parallel computing – mainly simulations

– Likely to offer major GPU enhanced systems

• Traditional Grids for handling distributed data – especially instruments and sensors

• Clouds for “high throughput computing” including much data analysis and emerging areas such as Life Sciences using loosely coupled parallel computations

– May offer small clusters for MPI style jobs

– Certainly offer MapReduce

• What is architecture for data analysis?

– MapReduce Style? Certainly good in several cases

– MPI Style? Data analysis uses linear algebra, iterative EM

– Shared Memory? NSF Gordon

• Integrating these needs new work on distributed file systems and high quality data transfer service

– Link Lustre WAN, Amazon/Google/Hadoop/Dryad File System

(25)

Application Classes

No. | Class | Description | Typical platform

1 | Synchronous | Lockstep Operation as in SIMD architectures | SIMD

2 | Loosely Synchronous | Iterative Compute-Communication stages with independent compute (map) operations for each CPU. Heart of most MPI jobs | MPP

3 | Asynchronous | Computer Chess; Combinatorial Search often supported by dynamic threads | MPP

4 | Pleasingly Parallel | Each component independent – in 1988, Fox estimated at 20% of total number of applications | Grids

5 | Metaproblems | Coarse grain (asynchronous) combinations of classes 1)-4). The preserve of workflow. | Grids

6 | MapReduce and Enhancements | It describes file(database) to file(database) operations, which has subcategories including: 1) Pleasingly Parallel Map Only; 2) Map followed by reductions; 3) Iterative “Map followed by reductions” – Extension of Current Technologies that supports much linear algebra and data mining | Clouds; Hadoop/Dryad, Twister, Pregel
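The first two MapReduce subcategories in class 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `map_only`, `map_reduce`, and the toy `events` data are hypothetical names invented here, and the code runs serially in one process rather than on Hadoop, Dryad, or any other runtime named in the table.

```python
from collections import defaultdict


def map_only(records, map_fn):
    """Subcategory 1: every record is processed independently, with no reduction."""
    return [map_fn(record) for record in records]


def map_reduce(records, map_fn, reduce_fn):
    """Subcategory 2: map emits (key, value) pairs, which are grouped by key and reduced."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Hypothetical toy data standing in for event measurements.
events = [0.5, 1.2, 1.7, 2.4, 0.1, 1.9]

# Map only: an independent transformation of each event.
squares = map_only(events, lambda e: e * e)

# Classic MapReduce: a histogram with unit-width bins.
hist = map_reduce(events, lambda e: [(int(e), 1)], lambda key, values: sum(values))
```

Subcategory 3 wraps the second function in a loop, feeding each reduced result back into the next map phase until the result stops changing.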

(26)

Applications & Different Interconnection Patterns

Pattern | Description | Examples

Map Only (Input → map → Output) | CAP3 Analysis; Document conversion (PDF -> HTML); Brute force searches in cryptography; Parametric sweeps | CAP3 Gene Assembly; PolarGrid Matlab data analysis

Classic MapReduce (Input → map → reduce) | High Energy Physics (HEP) Histograms; SWG gene alignment; Distributed search; Distributed sorting; Information retrieval | Information Retrieval; HEP Data Analysis; Calculation of Pairwise Distances for ALU Sequences

Iterative Reductions MapReduce++ (Input → map → reduce, with iterations) | Expectation maximization algorithms; Clustering; Linear Algebra | Kmeans; Deterministic Annealing Clustering; Multidimensional Scaling MDS

Loosely Synchronous (iterations with Pij communication) | Many MPI scientific applications utilizing wide variety of communication constructs including local interactions | Solving Differential Equations; particle dynamics with short range forces

(27)

What hardware/software is

needed for data analysis?

• Largest compute systems in world are commercial systems used for internet and commerce data analysis

• Largest US academic systems (TeraGrid) are essentially not used for data analysis

– Open Science Grid and EGI (European Grid Initiative) have large CERN LHC data analysis component – largely pleasingly parallel problems

– TeraGrid “data systems” shared memory

• Runtime models

– “Dynamic Scheduling”, MapReduce, MPI …. (when they work, all ~same performance)

• Agreement on architecture for large scale simulation (GPGPU relevance question of detail); little consensus on scientific data analysis architecture

(28)

www.egi.eu EGI-InSPIRE RI-261323

European Grid Infrastructure

Status April 2010 (yearly increase)

• 10000 users: +5%

• 243020 LCPUs (cores): +75%

• 40PB disk: +60%

• 61PB tape: +56%

• 15 million jobs/month: +10%

• 317 sites: +18%

• 52 countries: +8%

• 175 VOs: +8%

• 29 active VOs: +32%

(29)

Performance Study: MapReduce v Scheduling

Linux, Linux on VM, Windows, Azure, Amazon on

(30)

Hadoop/Dryad Comparison: Inhomogeneous Data I

[Figure: Randomly Distributed Inhomogeneous Data, Mean: 400, Dataset Size: 10000. Time (s), axis from 1500 to 1900, versus Standard Deviation, axis from 0 to 300, for DryadLinq SWG, Hadoop SWG and Hadoop SWG on VM]

Inhomogeneity of data does not have a significant effect when the sequence lengths are randomly distributed

(31)

Scaled Timing with Azure/Amazon MapReduce

[Figure: Cap3 Sequence Assembly. Time (s), axis from 1000 to 1900, versus Number of Cores * Number of files (64 * 1024, 96 * 1536, 128 * 2048, 160 * 2560, 192 * 3072) for Azure MapReduce, Amazon EMR, Hadoop Bare Metal and Hadoop on EC2]

(32)

(33)

Smith Waterman: MPI DryadLINQ Hadoop

[Figure: Time per Actual Calculation (ms), axis from 0.000 to 0.025, versus No. of Sequences (10000 to 40000) for Hadoop SW-G, MPI SW-G and DryadLINQ SW-G]

(34)

Parallel Data Analysis Algorithms

• Clustering

– Vector based $O(N)$

– Distance based $O(N^2)$

• Dimension Reduction for visualization and analysis

– Vector based Generative Topographic Map GTM $O(N)$

– Distance based Multidimensional Scaling MDS $O(N^2)$

• All have faster hierarchical (interpolation) algorithms

• All with deterministic annealing (DA)

• Easy to parallelize but linear algebra/Iterative EM – need MPI

(35)

Typical Application Challenge: DNA Sequencing Pipeline

Modern Commercial Gene Sequencers: Illumina/Solexa, Roche/454 Life Sciences, Applied Biosystems/SOLiD

[Figure: pipeline diagram. Stages shown: Internet; Read Alignment; FASTA File (N Sequences); Blocking (block Pairings); Sequence Alignment/Assembly; Dissimilarity Matrix (N(N-1)/2 values); Pairwise clustering; MDS; Visualization. Stages are labelled as MapReduce or MPI]

(36)

Metagenomics

This visualizes results of dimension reduction to 3D of 30000 gene sequences from an environmental sample. The many different genes are classified by clustering algorithm and visualized by MDS

(37)

General Deterministic Annealing Formula

N data points E(x) in D dimensional space and minimize F by EM

$$F = -T \sum_{x=1}^{N} p(x) \ln\Big\{ \sum_{k=1}^{K} \exp\big[ -(E(x) - Y(k))^2 / T \big] \Big\}$$

Deterministic Annealing Clustering (DAC)

• F is Free Energy (E(x) is energy to be minimized)

• p(x) with $\sum_x p(x) = 1$

• T is annealing temperature varied down from $\infty$ with final value of 1

• Determine cluster center Y(k) by EM method

• EM is well known expectation maximization method corresponding to steepest descent
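The EM iteration and annealing schedule described above can be sketched for one-dimensional data as follows. This is an illustration, not the author's implementation, and it makes several assumptions: p(x) is uniform, the number of centers k is fixed in advance, a finite `t_start` stands in for an infinite starting temperature, and a small deterministic perturbation is added at each temperature so that coincident centers can separate. The names `responsibilities` and `da_cluster` and all default parameter values are hypothetical.

```python
import math


def responsibilities(x, centers, t):
    """Soft assignment of point x to each center: exp(-(x - Y(k))^2 / T), normalized."""
    d2 = [(x - y) ** 2 for y in centers]
    d_min = min(d2)  # subtract the minimum so the largest weight is exp(0) = 1
    weights = [math.exp(-(d - d_min) / t) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


def da_cluster(points, k, t_start, t_final=1.0, cooling=0.9, em_steps=50):
    """Anneal T from t_start down to t_final, running EM at each temperature."""
    centers = [sum(points) / len(points)] * k  # all centers start at the data mean
    t = t_start
    while True:
        # Assumed tie-breaker: nudge the centers apart so a split can develop.
        centers = [y + 1e-3 * (j - (k - 1) / 2) for j, y in enumerate(centers)]
        for _ in range(em_steps):
            # Soft assignments at the current temperature.
            resp = [responsibilities(x, centers, t) for x in points]
            # Each center becomes the responsibility-weighted mean of the points.
            centers = [
                sum(r[j] * x for r, x in zip(resp, points)) / sum(r[j] for r in resp)
                for j in range(k)
            ]
        if t <= t_final:
            return centers
        t = max(t * cooling, t_final)


# Hypothetical toy data: two well separated groups on a line.
data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
centers = sorted(da_cluster(data, k=2, t_start=100.0))
```

At high temperature both centers sit at the overall mean; they separate only once T drops low enough for the split to be favorable, which is the behaviour the annealing schedule is designed to exploit. The sketch does not guard against a center receiving zero total responsibility, which could occur for data with a much larger spread relative to T.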

(38)

Deterministic Annealing I

• Gibbs Distribution at Temperature T

$P(\xi) = \exp(-H(\xi)/T) \,/ \int d\xi\, \exp(-H(\xi)/T)$

• Or $P(\xi) = \exp(-H(\xi)/T + F/T)$

• Minimize Free Energy $F = \langle H - T S(P) \rangle = \int d\xi\, \{ P(\xi) H + T P(\xi) \ln P(\xi) \}$

• Where $\xi$ are (a subset of) parameters to be minimized

• Simulated annealing corresponds to doing these integrals by Monte Carlo

• Deterministic annealing corresponds to doing integrals analytically and is naturally much faster

• In each case temperature is lowered slowly – say by a factor
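The two forms of the Gibbs distribution on this slide agree only if F normalizes P. The short check below writes $\xi$ for the annealed variables (a symbol chosen here for illustration) and assumes the entropy is $S(P) = -\int d\xi\, P \ln P$, which matches the $T P \ln P$ term in the free energy. Under those assumptions the normalizing F is the same quantity that is being minimized:

```latex
\begin{align*}
\exp(-F/T) &= \int d\xi\, \exp(-H(\xi)/T)
  \quad\Longrightarrow\quad
  F = -T \ln \int d\xi\, \exp(-H(\xi)/T), \\
\ln P(\xi) &= -H(\xi)/T + F/T, \\
\int d\xi\, \{ P(\xi) H(\xi) + T P(\xi) \ln P(\xi) \}
  &= \int d\xi\, P(\xi)\, \{ H(\xi) - H(\xi) + F \} = F .
\end{align*}
```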

(39)

Deterministic Annealing

• Minimum evolving as temperature decreases

• Movement at fixed temperature going to local minima if not initialized “correctly”

Solve Linear Equations for each temperature

Nonlinearity effects mitigated by initializing with solution at previous higher temperature

[Figure: F({y}, T)]

(40)

Deterministic Annealing II

• For some cases such as vector clustering and Gaussian Mixture Models one can do integrals by hand but usually will be impossible

• So introduce Hamiltonian $H_0(\varepsilon, \xi)$ which by choice of $\varepsilon$ can be made similar to $H(\xi)$ and which has tractable integrals

• $P_0(\xi) = \exp(-H_0(\xi)/T + F_0/T)$ approximate Gibbs

• $F_R(P_0) = \langle H_R - T S_0(P_0) \rangle|_0 = \langle H_R - H_0 \rangle|_0 + F_0(P_0)$

• Where $\langle \ldots \rangle|_0$ denotes $\int d\xi\, P_0(\xi)$

• Easy to show that real Free Energy $F_A(P_A) \le F_R(P_0)$

• In many problems, decreasing temperature is classic multiscale – finer resolution (T is “just” distance scale)

• Related to variational inference
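The bound quoted on this slide can be checked in a few lines. The sketch below writes $\xi$ for the annealed variables and assumes that $P_A(\xi) = \exp(-H_R(\xi)/T + F_A/T)$ is the exact Gibbs distribution for the real Hamiltonian and that $S_0(P_0) = -\int d\xi\, P_0 \ln P_0$; neither definition is spelled out on the slide.

```latex
\begin{align*}
H_R(\xi) &= F_A - T \ln P_A(\xi), \\
F_R(P_0) &= \langle H_R \rangle|_0 + T \langle \ln P_0 \rangle|_0
          = F_A + T \int d\xi\, P_0(\xi) \ln \frac{P_0(\xi)}{P_A(\xi)}, \\
\int d\xi\, P_0(\xi) \ln \frac{P_0(\xi)}{P_A(\xi)} &\ge 0
  \quad\Longrightarrow\quad F_A(P_A) \le F_R(P_0).
\end{align*}
```

The integral in the last line is the Kullback-Leibler divergence between $P_0$ and $P_A$, which is why minimizing $F_R(P_0)$ over the parameters of $H_0$ is a form of variational inference.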

(41)

Implementation of DA I

• Expectation step E is find $\varepsilon$ minimizing $F_R(P_0)$ and

• Follow with M step setting $\xi = \langle \xi \rangle|_0 = \int d\xi\, \xi\, P_0(\xi)$ and, if one does not anneal over all parameters, one follows with a traditional minimization of the remaining parameters

• In clustering, one then looks at second derivative matrix of $F_R(P_0)$ wrt $\varepsilon$ and as temperature is lowered this develops negative eigenvalue corresponding to instability

• This is a phase transition and one splits cluster into two and continues EM iteration

• One starts with just one cluster

(42)

Rose, K., Gurewitz, E., and Fox, G. C. "Statistical mechanics and phase transitions in clustering," Physical Review Letters, 65(8):945-948, August 1990.

(43)

Implementation II

• Clustering variables are $M_i(k)$ where this is probability point i belongs to cluster k

• In Clustering, take $H_0 = \sum_{i=1}^{N} \sum_{k=1}^{K} M_i(k)\, \varepsilon_i(k)$

• $\langle M_i(k) \rangle = \exp(-\varepsilon_i(k)/T) \,/ \sum_{k=1}^{K} \exp(-\varepsilon_i(k)/T)$

• Central clustering has $\varepsilon_i(k) = (X(i) - Y(k))^2$ and $\varepsilon_i(k)$ determined by Expectation step in pairwise clustering

– $H_{Central} = \sum_{i=1}^{N} \sum_{k=1}^{K} M_i(k)\, (X(i) - Y(k))^2$

– $H_{Central}$ and $H_0$ are identical

– Centers Y(k) are determined in M step

• Pairwise Clustering given by nonlinear form

• $H_{PC} = 0.5 \sum_{i=1}^{N} \sum_{j=1}^{N} \delta(i,j) \sum_{k=1}^{K} M_i(k)\, M_j(k) / C(k)$

• with $C(k) = \sum_{i=1}^{N} M_i(k)$ as number of points in Cluster k

• And now $H_0$ and $H_{PC}$ are different
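To make the pairwise form concrete, the sketch below evaluates C(k) and H_PC for a small example. It is an illustration only: the function names and toy data are hypothetical, the assignments M_i(k) are taken as hard 0/1 values, and δ(i, j) is assumed to be the squared distance between one-dimensional points.

```python
def cluster_sizes(m):
    """C(k) = sum over points i of M_i(k)."""
    return [sum(row[k] for row in m) for k in range(len(m[0]))]


def pairwise_hamiltonian(delta, m):
    """H_PC = 0.5 * sum_{i,j} delta(i,j) * sum_k M_i(k) * M_j(k) / C(k)."""
    sizes = cluster_sizes(m)
    total = 0.0
    for i in range(len(m)):
        for j in range(len(m)):
            overlap = sum(m[i][k] * m[j][k] / sizes[k] for k in range(len(sizes)))
            total += delta[i][j] * overlap
    return 0.5 * total


# Hypothetical toy data: four points on a line, squared distances as dissimilarities.
xs = [0.0, 1.0, 10.0, 11.0]
delta = [[(a - b) ** 2 for b in xs] for a in xs]

good = [[1, 0], [1, 0], [0, 1], [0, 1]]  # groups nearby points together
bad = [[1, 0], [0, 1], [1, 0], [0, 1]]   # mixes near and far points

h_good = pairwise_hamiltonian(delta, good)
h_bad = pairwise_hamiltonian(delta, bad)
```

For squared Euclidean dissimilarities and hard assignments, H_PC equals the within-cluster sum of squares, so `h_good` matches the central Hamiltonian with centers at 0.5 and 10.5. The advantage of the pairwise form is that it needs only δ(i, j), not coordinates. The sketch assumes no cluster is empty, since C(k) = 0 would divide by zero.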

(44)

Multidimensional Scaling MDS

• Map points in high dimension to lower dimensions

• Many such dimension reduction algorithms (PCA Principal component analysis easiest); simplest but perhaps best is MDS

• Minimize Stress $\sigma(X) = \sum_{i<j=1}^{n} \mathrm{weight}(i,j)\, (\delta_{ij} - d(X_i, X_j))^2$

• $\delta_{ij}$ are input dissimilarities and $d(X_i, X_j)$ the Euclidean distance squared in embedding space (3D usually)

• SMACOF or Scaling by majorizing a complicated function is clever steepest descent (expectation maximization EM) algorithm

• Computational complexity goes like $N^2 \cdot$ Reduced Dimension

• There is Deterministic annealed version of it

• Could just view as non linear $\chi^2$ problem (Tapia et al. Rice)
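The stress function above is straightforward to evaluate directly, which is useful for checking any MDS solver. The sketch below is an illustration under assumptions: the function name `stress`, the `squared` flag, and the triangle data are hypothetical, and d defaults to the plain Euclidean distance, with `squared=True` switching to the squared distance mentioned on the slide.

```python
import math


def stress(delta, x, weight=None, squared=False):
    """Sum over pairs i < j of weight(i,j) * (delta_ij - d(X_i, X_j))^2."""
    total = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(x[i], x[j])  # Euclidean distance in the embedding space
            if squared:
                d = d * d
            w = 1.0 if weight is None else weight[i][j]
            total += w * (delta[i][j] - d) ** 2
    return total


# Hypothetical dissimilarities taken from a 3-4-5 right triangle.
delta = [[0.0, 3.0, 4.0],
         [3.0, 0.0, 5.0],
         [4.0, 5.0, 0.0]]

exact = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]      # reproduces every distance
collapsed = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]  # all points at the origin

s_exact = stress(delta, exact)
s_collapsed = stress(delta, collapsed)
```

The double loop visits N(N-1)/2 pairs, which is where the quadratic cost in the number of points comes from.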

(45)

Implementation III

• One tractable form was a linear Hamiltonian

• Another is Gaussian $H_0 = \sum_{i=1}^{n} (X(i) - \mu(i))^2 / 2$

• Where X(i) are vectors to be determined as in formula for Multidimensional scaling

• $H_{MDS} = \sum_{i<j=1}^{n} \mathrm{weight}(i,j)\, (\delta(i,j) - d(X(i), X(j)))^2$

• Where $\delta(i,j)$ are observed dissimilarities and we want to represent as Euclidean distance between points X(i) and X(j) ($H_{MDS}$ is quartic or involves square roots)

• The E step is minimize $\sum_{i<j=1}^{n} \mathrm{weight}(i,j)\, (\delta(i,j) - \mathrm{constant} \cdot T - (\mu(i) - \mu(j))^2)^2$

• with solution $\mu(i) = 0$ at large T

• Points pop out from origin as Temperature lowered

(46)

MPI & Iterative MapReduce papers

• MapReduce on MPI: Torsten Hoefler, Andrew Lumsdaine and Jack Dongarra, Towards Efficient MapReduce Using MPI, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, 2009, Volume 5759/2009, 240-249

• MPI with generalized MapReduce

• Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, Twister: A Runtime for Iterative MapReduce, Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, 2010 http://grids.ucs.indiana.edu/ptliupages/publications/twister__hpdc_mapreduce.pdf http://www.iterativemapreduce.org/

• Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, Pregel: A System for Large-Scale Graph Processing, Proceedings of the 2010 international conference on Management of data, Indianapolis, Indiana, USA, Pages: 135-146, 2010

• Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst, HaLoop: Efficient Iterative Data Processing on Large Clusters, Proceedings of the VLDB Endowment, Vol. 3, No. 1, The 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore.

• Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, Spark: Cluster Computing with Working Sets poster at

(47)

Twister

• Streaming based communication

• Intermediate results are directly transferred from the map tasks to the reduce tasks – eliminates local files

• Cacheable map/reduce tasks

– Static data remains in memory

• Combine phase to combine reductions

• User Program is the composer of MapReduce computations

• Extends the MapReduce model to iterative computations

[Figure: Twister architecture. The User Program and MR Driver communicate through a Pub/Sub Broker Network with Worker Nodes; each worker node runs an MRDeamon with Map Workers (M) and Reduce Workers (R), and Data Splits (D) are read from and written to the File System]

[Figure: Twister programming model. Configure() loads static data; Map(Key, Value), Reduce(Key, List<Value>) and Combine(Key, List<Value>) are iterated, with the δ flow passing through the User Program; Close() ends the computation]
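The iterate-over-cached-data pattern described on this slide can be illustrated with a small K-means loop. This is a serial Python sketch of the pattern only, not the Twister API: the function names (`configure`, `kmeans_map`, `kmeans_reduce`, `combine`, `run_kmeans`) and the toy data are hypothetical, and the "workers" are ordinary list partitions in a single process.

```python
def configure(points, n_workers):
    """Split the static data once; the partitions are reused in every iteration."""
    return [points[w::n_workers] for w in range(n_workers)]


def kmeans_map(partition, centers):
    """Map: per-partition partial (sum, count) for the nearest center of each point."""
    partial = {}
    for x in partition:
        k = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
        s, c = partial.get(k, (0.0, 0))
        partial[k] = (s + x, c + 1)
    return partial


def kmeans_reduce(key, values):
    """Reduce: merge the partial sums and counts emitted for one center."""
    return (sum(s for s, _ in values), sum(c for _, c in values))


def combine(reduced, centers):
    """Combine: gather all reduce outputs into the next set of centers."""
    return [reduced[k][0] / reduced[k][1] if k in reduced else centers[k]
            for k in range(len(centers))]


def run_kmeans(points, centers, n_workers=2, max_iter=20):
    partitions = configure(points, n_workers)  # static data, loaded once
    for _ in range(max_iter):
        grouped = {}
        for partition in partitions:  # only the small centers list changes per pass
            for k, value in kmeans_map(partition, centers).items():
                grouped.setdefault(k, []).append(value)
        reduced = {k: kmeans_reduce(k, vs) for k, vs in grouped.items()}
        new_centers = combine(reduced, centers)
        if new_centers == centers:  # converged: stop iterating
            break
        centers = new_centers
    return centers


final_centers = run_kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

The point of the design is that the large input is partitioned once, while only the small, changing state (the centers) flows through each map, reduce and combine cycle.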

(48)

Iterative and non-Iterative Computations

K-means

[Figure: Performance of K-Means]

(49)

Matrix Multiplication 64 cores

Square blocks Twister

Row/Col decomp Twister

(50)

Overhead OpenMPI v Twister

negative overhead due to cache

(51)

Performance of Pagerank using

ClueWeb Data (Time for 20 iterations)

(52)

Fault Tolerance and MapReduce

• MPI does “maps” followed by “communication” including “reduce” but does this iteratively

• There must (for most communication patterns of interest) be a strict synchronization at end of each communication phase

– Thus if a process fails then everything grinds to a halt

• In MapReduce, all Map processes and all reduce processes are independent and stateless and read and write to disks

– As 1 or 2 (reduce+map) iterations, no difficult synchronization issues

• Thus failures can easily be recovered by rerunning process without other jobs hanging around waiting

• Re-examine MPI fault tolerance in light of MapReduce

– Relevant for Exascale?
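The recovery argument above rests on tasks being independent and stateless, so a failed task can simply be rerun. A minimal Python sketch of that idea follows. The names `run_with_retries` and `flaky_square` are hypothetical, the failure is simulated, and real runtimes detect failures and reschedule tasks in more elaborate ways.

```python
def run_with_retries(task, inputs, max_attempts=3):
    """Run independent, stateless tasks and rerun only the ones that fail."""
    results = {}
    attempts = {i: 0 for i in range(len(inputs))}
    pending = list(range(len(inputs)))
    while pending:
        still_failing = []
        for i in pending:
            attempts[i] += 1
            try:
                results[i] = task(inputs[i], attempts[i])
            except RuntimeError:
                if attempts[i] >= max_attempts:
                    raise  # give up after repeated failures
                still_failing.append(i)  # completed tasks are left untouched
        pending = still_failing
    return [results[i] for i in range(len(inputs))], attempts


def flaky_square(x, attempt):
    """Simulated map task that fails on its first attempt for one input."""
    if x == 2 and attempt == 1:
        raise RuntimeError("simulated worker failure")
    return x * x


outputs, attempts = run_with_retries(flaky_square, [1, 2, 3])
```

Only the failed task is executed a second time; the other results are kept. A tightly synchronized iterative computation cannot do this as easily, because the surviving processes hold state that depends on the failed one.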

(53)

TwisterMPIReduce

• Runtime package supporting subset of MPI mapped to Twister

• Set-up, Barrier, Broadcast, Reduce

[Figure: TwisterMPIReduce shown with PairwiseClustering MPI, Multi Dimensional Scaling MPI, Generative Topographic Mapping MPI and Other …]