Bioinformatics on Cloud Cyberinfrastructure

(1)

Bioinformatics on Cloud

Cyberinfrastructure

Bio-IT

April 14 2011

Geoffrey Fox

[email protected]

http://www.infomall.org

http://www.futuregrid.org

Director, Digital Science Center, Pervasive Technology Institute

(2)

Abstract

• _{Clouds offer computing on demand plus}

important platforms capabilities including

MapReduce and Data Parallel File systems.

• _{This talk will look at public and private clouds for}

large scale sequence processing characterizing

performance and usability

• _{As well as FutureGrid, an NSF facility supporting}

such studies.

(3)

Philosophy of

Clouds and Grids

• _Clouds

_{are (by definition) commercially supported approach to large}

scale computing

–

So we should expect

Clouds to replace Compute Grids

–

_{Current Grid technology involves “non-commercial” software solutions which}

are hard to evolve/sustain

–

Maybe Clouds

~4% IT

expenditure 2008 growing to

14%

in 2012 (IDC Estimate)

• _{Public Clouds}

_{are broadly accessible resources like Amazon and}

Microsoft Azure – powerful but not easy to customize and perhaps data

trust/privacy issues

• _{Private Clouds}

_{run similar software and mechanisms but on “your own}

computers” (not clear if still elastic)

–

_{Platform features such as Queues, Tables, Databases currently limited}

• _Services

_{still are correct architecture with either REST (Web 2.0) or Web}

Services

(4)

Cloud Computing:

Infrastructure and Runtimes

• _{Cloud infrastructure:}

_{outsourcing of servers, computing, data, file}

space, utility computing, etc.

–

_{Handled through Web services that control virtual machine}

lifecycles.

• _{Cloud runtimes or Platform:}

_{tools (for using clouds) to do}

data-parallel (and other) computations.

–

_{Apache Hadoop, Google MapReduce, Microsoft Dryad, Bigtable,}

Chubby and others

–

_{MapReduce designed for information retrieval but is excellent for}

a wide range of

science data analysis applications

–

_{Can also do much traditional parallel computing for data-mining}

if extended to support

iterative

operations

(5)

Components of a Scientific Computing Platform

Authentication and Authorization: Provide single sign in to both FutureGrid and Commercial Clouds linked by workflow

Workflow: Support workflows that link job components between FutureGrid and Commercial Clouds. Trident from Microsoft Research is initial candidate

Data Transport: Transport data between job components on FutureGrid and Commercial Clouds respecting custom storage patterns

Program Library: Store Images and other Program material (basic FutureGrid facility)

Blob: Basic storage concept similar to Azure Blob or Amazon S3

DPFS Data Parallel File System: Support of file systems like Google (MapReduce), HDFS (Hadoop) or Cosmos (dryad) with compute-data affinity optimized for data processing

Table: Support of Table Data structures modeled on Apache Hbase/CouchDB or Amazon SimpleDB/Azure Table. There is “Big” and “Little” tables – generally NOSQL

SQL: Relational Database

Queues: Publish Subscribe based queuing system

Worker Role: This concept is implicitly used in both Amazon and TeraGrid but was first introduced as a high level construct by Azure

MapReduce: Support MapReduce Programming model including Hadoop on Linux, Dryad on Windows HPCS and Twister on Windows and Linux

Software as a Service: This concept is shared between Clouds and Grids and can be supported without special attention

(6)

MapReduce

• _{Implementations (Hadoop – Java; Dryad – Windows) support:}

–

_{Splitting of data}

–

_{Passing the output of map functions to reduce functions}

–

_{Sorting the inputs to the reduce function based on the intermediate}

keys

–

_{Quality of service}

Map(Key, Value)

Reduce(Key, List<Value>)

Data Partitions

Reduce Outputs

(7)

MapReduce “File/Data Repository” Parallelism

Instruments

Disks Map₁ Map₂ Map₃ Reduce

Communication

Map = (data parallel) computation reading and writing data

Reduce = Collective/Consolidation phase e.g. forming multiple global sums as in histogram

Portals /Users

Iterative MapReduce

(8)

All-Pairs Using DryadLINQ

35339 50000 0

2000 4000 6000 8000 10000 12000 14000 16000 18000 20000

DryadLINQ MPI

Calculate Pairwise Distances (Smith Waterman Gotoh)

125 million distances 4 hours & 46 minutes 125 million distances 4 hours & 46 minutes

• Calculate pairwise distances for a collection of genes (used for clustering, MDS)

• _{Fine grained tasks in MPI}

• Coarse grained tasks in DryadLINQ

• Performed on 768 cores (Tempest Cluster)

(9)

Hadoop VM Performance Degradation

15.3% Degradation at largest data set size

10000 20000 30000 40000 50000

-5% 0% 5% 10% 15% 20% 25% 30%

Perf. Degradation On VM (Hadoop)

No. of Sequences

Perf. Degradation = (T

_vm

– T

_baremetal

)/T

_baremetal

(10)

Cap3 Performance with

Different EC2 Instance Types

Serie s1 0 500 1000 1500 2000 0.00 1.00 2.00 3.00 4.00 5.00 6.00 Amortized Compute Cost Compute Cost (per hour units)

(11)

Cap3 Cost

0 2 4 6 8 10 12 14 16 18

Azure MapReduce Amazon EMR

Hadoop on EC2

Num. Cores * Num. Files

C

o

st

(

(12)

SWG Cost

64 * 1024 96 * 1536 128 * 2048 160 * 2560 192 * 3072 0

5 10 15 20 25 30

AzureMR Amazon EMR Hadoop on EC2

Num. Cores * Num. Blocks

C

o

st

(

(13)

Smith Waterman:

Daily Effect

W ed_D

ay W

ed_N igh_t Th u D ay Th u N igh_t

Fri_D

ay Fri N_igh t Sa t D ay Sa t N igh_t Su n D ay Su n N igh_t M on D ay M on_N

igh_t Tu e D ay Tu e N igh_t 1000 1020 1040 1060 1080 1100 1120 1140 1160

EMR

Azure MR Adj.

Ti

m

e

(

(14)

Grids MPI and Clouds

• _Grids

_{are useful for}

_{managing distributed systems}

– _{Pioneered service model for Science} – Developed importance of Workflow

– _{Performance issues – communication latency – intrinsic to distributed systems} – Can never run large differential equation based simulations or datamining

• _Clouds

_{can execute any job class that was good for Grids}

_plus

– More attractive due to platform plus elastic on-demand model

– MapReduce easier to use than MPI for appropriate parallel jobs

– _{Currently have performance limitations due to poor affinity (locality) for compute-compute}

(MPI) and Compute-data

– These limitations are not “inevitable” and should gradually improve as in July 13 2010 Amazon Cluster announcement

– _{Will probably never be best for most sophisticated parallel differential equation based}

simulations

• _{Classic Supercomputers}

_{(MPI Engines) run}

_{communication demanding differential}

equation based simulations

– _{MapReduce and Clouds replaces MPI}_{for other problems}

– _{Much more data processed today by MapReduce than MPI (Industry Informational Retrieval}

(15)

Fault Tolerance and MapReduce

• _MPI

_{does “maps” followed by “communication” including “reduce”}

but does this iteratively

• _{There must (for most communication patterns of interest) be a}

_strict

synchronization

at end of each communication phase

–

_{Thus if a}

_{process fails then everything grinds to a halt}

• _{In MapReduce, all Map processes and all reduce processes are}

independent

and stateless and read and write to disks

–

_{As 1 or 2 (reduce+map) iterations, no difficult synchronization issues}

• _Thus

_{failures can easily be recovered}

_{by rerunning process without}

other jobs hanging around waiting

• _{Re-examine MPI fault tolerance in light of MapReduce}

(16)

Twister v0.9

March 15, 2011

New Interfaces for Iterative MapReduce Programming

http://www.iterativemapreduce.org/

SALSA Group

Bingjing Zhang, Yang Ruan, Tak-Lon Wu, Judy Qiu, Adam

Hughes, Geoffrey Fox,

Applying Twister to Scientific

Applications

, Proceedings of IEEE CloudCom 2010

Conference, Indianapolis, November 30-December 3, 2010

Twister4Azure

to be released May 2011

MapReduceRoles4Azure

available now at

(17)

• _{Iteratively refining operation}

• _{Typical MapReduce runtimes incur extremely high overheads}

– _{New maps/reducers/vertices in every iteration} – _{File system based communication}

• _{Long running tasks and faste}

_{r communication in Twister enables it to}

perform close to MPI

Time for 20 iterations

K-Means Clustering

map map

reduce

Compute the distance to each data point from each cluster center and assign points to cluster centers

Compute new cluster centers

(18)

Twister

• Streaming based communication

• Intermediate results are directly transferred from the map tasks to the reduce tasks –

eliminates local files

• Cacheablemap/reduce tasks

•Static data remains in memory

• Combine phase to combine reductions

• User Program is the composer of MapReduce computations

• Extendsthe MapReduce model to iterative computations Data Split D _MR Driver User Program Pub/Sub Broker Network

D File System M R M R M R M R Worker Nodes M R D Map Worker Reduce Worker MRDeamon Data Read/Write Communication

Reduce (Key, List<Value>) Reduce (Key, List<Value>)

Iterate

Map(Key, Value) Map(Key, Value)

Combine (Key, List<Value>) Combine (Key, List<Value>) User Program User Program Close() Close() Configure() Configure() Static data Static data δ flow δ flow

(19)

Performance of Pagerank using

ClueWeb Data (Time for 20 iterations)

(20)

Twister-BLAST vs.

(21)

Twister4Azure

early results

128 228 328 428 528 628 728

0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00% Hadoop-Blast EC2-ClassicCloud-Blast DryadLINQ-Blast AzureTwister

Number of Query Files

(22)

Twister4Azure

Architecture

Azure BLOB Storage

MW₁ MW₂ MW₃ MW_m

RW₁ RW₂

Azure BLOB Storage

Intermediate Data

(through BLOB storage)

Reduce Task Int. Data Transfer Table Meta-Data on intermediate data products Map Workers Reduce Workers

M_n . . M_x . . M₃ M₂ M₁

Map Task Queue

Rk . . Ry . . R3 R2 R1

Reduce Task Queue Client API

Command Line or Web UI

Map Task Meta-Data Table

Reduce Task Meta-Data Table

(23)

(24)

(25)

Scaling MDS in Cloud

• _{MDS makes clustering quality very clear}

• _{MDS scales like O(N}

2

_{) and 100,000 points can}

take several hours on a 1000 cores

• _Using

_Twister

_on

_Azure

_and

_{ordinary clusters}

to run combination of MDS and interpolated

MDS which scales like N

• _{Aim to process 20 million points for both MDS}

(26)

https://portal.futuregrid.org

US Cyberinfrastructure Context

• _{There are a rich set of facilities}

–

_{Production TeraGrid}

_{facilities with distributed and}

shared memory

–

_{Experimental “Track 2D” Awards}

• _FutureGrid

_{: Distributed Systems experiments cf. Grid5000}

• _Keeneland

_{: Powerful GPU Cluster}

• _Gordon

_{: Large (distributed) Shared memory system with}

SSD aimed at data analysis/visualization

–

_{Open Science Grid}

_{aimed at High Throughput}

computing and strong campus bridging

(27)

FutureGrid key Concepts I

• _{FutureGrid is an}

_{international testbed}

_{modeled on Grid5000}

• _{Supporting international}

_{Computer Science}

_and

_{Computational}

Science

research in cloud, grid and parallel computing (HPC)

–

_{Industry and Academia}

–

_{Note much of current use Education, Computer Science Systems}

and Biology/Bioinformatics

• _{The FutureGrid testbed provides to its users:}

–

_{A flexible development and testing platform for middleware}

and application users looking at

interoperability

,

functionality

,

performance

or

evaluation

–

_{Each use of FutureGrid is an}

_experiment

_{that is}

_reproducible

–

_{A rich}

_{education and teaching}

_{platform for advanced}

(28)

FutureGrid key Concepts II

• Rather than loading images onto VM’s, FutureGrid supports

Cloud, Grid and Parallel computing

environments by

dynamically provisioning

software as needed onto “bare-metal”

using Moab/xCAT

– Image library

for MPI, OpenMP, Hadoop, Dryad, gLite, Unicore, Globus,

Xen, ScaleMP (distributed Shared Memory), Nimbus, Eucalyptus,

OpenNebula, KVM, Windows …..

• Growth comes from users depositing novel images in library

• FutureGrid has ~4000 (will grow to ~5000) distributed cores

with a dedicated network and a Spirent XGEM network fault

and delay generator

Image1

Image1 _Image2Image2

…

_ImageNImageN

Load

(29)

Dynamic Provisioning Results

4 8 16 32

0:00:00 0:00:43 0:01:26 0:02:09 0:02:52 0:03:36 0:04:19

Total Provisioning Time minutes

Time elapsed between requesting a job and the jobs reported start time on the provisioned node. The numbers here are an average of 2 sets of experiments.

(30)

FutureGrid Partners

• _{Indiana University}

_{(Architecture, core software, Support)}

• _{Purdue University}

_{(HTC Hardware)}

• _{San Diego Supercomputer Center}

_{at University of California San Diego}

(INCA, Monitoring)

• _{University of Chicago}

_{/Argonne National Labs (Nimbus)}

• _{University of Florida}

_{(ViNE, Education and Outreach)}

• _{University of Southern California Information Sciences (Pegasus to manage}

experiments)

• _{University of Tennessee Knoxville (Benchmarking)}

• _{University of Texas at Austin}

_{/Texas Advanced Computing Center (Portal)}

• _{University of Virginia (OGF, Advisory Board and allocation)}

• _{Center for Information Services and GWT-TUD from Technische Universtität}

Dresden. (VAMPIR)

(31)

FutureGrid:

a Grid/Cloud/HPC Testbed

Private

Public FG Network

(32)

5 Use Types for FutureGrid

• _~110

_{approved projects over last 8 months}

• _{Training Education and Outreach}

–

_{Semester and short events; promising for non research intensive}

universities

• _{Interoperability test-beds}

–

_{Grids and Clouds;}

_Standards

_{; Open Grid Forum OGF really needs}

• _{Domain Science applications}

–

_{Life sciences highlighted}

• _{Computer science}

–

_{Largest current category (> 50%)}

• _{Computer Systems Evaluation}

–

_{TeraGrid (TIS, TAS, XSEDE), OSG, EGI}

• _{Clouds are meant to need less support than other models;}

FutureGrid needs more

user support

…….

(33)

Some Current FutureGrid projects I

Project Institution Details

Educational Projects

VSCSE Big Data IU PTI, Michigan, NCSA and 10 sites

Over 200 students in week Long Virtual School of Computational Science and Engineering on Data Intensive Applications &

Technologies

LSU Distributed Scientific

Computing Class LSU

13 students use Eucalyptus and SAGA enhanced version of MapReduce

Topics on Systems: Cloud

Computing CS Class IU SOIC

27 students in class using virtual machines, Twister, Hadoop and Dryad

Interoperability Projects

OGF Standards Virginia, LSU, Poznan Interoperability experiments between OGF standard Endpoints

(34)

Some Current FutureGrid projects II

34

Domain Science Application Projects

Combustion _Cummins Performance Analysis of codes aimed at

engine efficiency and pollution

Cloud Technologies for

Bioinformatics Applications IU PTI

Performance analysis of pleasingly

parallel/MapReduce applications on Linux, Windows, Hadoop, Dryad, Amazon, Azure with and without virtual machines

Computer Science Projects

Cumulus _{Univ. of Chicago} Open Source Storage Cloud for Science

based on Nimbus

Differentiated Leases for IaaS _{University of Colorado} Deployment of always-on preemptible

VMs to allow support of Condor based on demand volunteer computing

Application Energy Modeling _UCSD/SDSC Fine-grained DC power measurements on

HPC resources and power benchmark system

Evaluation and TeraGrid/OSG Support Projects

Use of VM’s in OSG _{OSG, Chicago, Indiana} Develop virtual machines to run the

services required for the operation of the OSG and deployment of VM based applications in OSG environments.