(1)

MapReduce in the Clouds for Science

CloudCom 2010 Nov 30 – Dec 3, 2010

Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox

(2)

Introduction

Cloud computing combined with cloud infrastructure services

A very viable alternative for scientists

MapReduce

Excellent fault tolerance features

Scalability

Ease of use

Several options for using MapReduce in cloud environments

MapReduce as a service

Setting up MapReduce cluster on cloud instances

Specialized cloud MapReduce runtimes

(3)

Introduction

Analyze the performance and viability of two types of bioinformatics computations using MapReduce in cloud environments

Sequence alignment

Sequence assembly

AzureMapReduce

Leverages the high-latency, eventually consistent, yet highly scalable Azure infrastructure services to provide a decentralized, on-demand MapReduce framework

(4)

Platforms

Apache Hadoop

On BareMetal

On EC2

Amazon Web Services

Elastic MapReduce

Microsoft Azure

(5)

Challenges for MapReduce in the clouds

Data storage

Reliability

Master node

Metadata storage

Performance consistency

Communication consistency and scalability

Sustained performance

Choosing suitable instance types

(6)

AzureMapReduce

A solution to the void of parallel programming frameworks on Microsoft Azure

Built using Azure cloud services

Distributed, highly scalable & highly available services, backed by industrial-strength data centers and technologies

Minimal management / maintenance overhead

Reduced footprint

Decentralized control

(7)

AzureMapReduce Features

Familiar programming model

Fault Tolerance

Co-exists with the eventual consistency of cloud services

Easy testing and deployment

Combiner step
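
To make the programming model concrete, here is a minimal single-process sketch of map, combine and reduce in Python. It is not AzureMapReduce code: `run_mapreduce`, `count_bases` and `sum_values` are hypothetical names, and counting nucleotide bases is only an illustrative workload. The combiner is applied to each mapper's local output before the values are grouped for the reducer.

```python
from collections import defaultdict


def run_mapreduce(inputs, mapper, reducer, combiner=None):
    """Run map -> (optional combine) -> reduce in a single process.

    Hypothetical helper for illustration only; a real runtime would
    distribute the map and reduce tasks across worker instances.
    """
    intermediate = defaultdict(list)
    for record in inputs:
        # Map phase: collect this map task's output, grouped by key.
        local = defaultdict(list)
        for key, value in mapper(record):
            local[key].append(value)
        # Combine phase: pre-aggregate locally to shrink intermediate data.
        for key, values in local.items():
            if combiner is not None:
                values = [combiner(key, values)]
            intermediate[key].extend(values)
    # Sort & reduce phase: one reducer call per key, in key order.
    return {key: reducer(key, values)
            for key, values in sorted(intermediate.items())}


def count_bases(sequence):
    """Mapper: emit (base, 1) for every character of a sequence."""
    for base in sequence:
        yield base, 1


def sum_values(key, values):
    """Reducer and combiner: add up the counts for one key."""
    return sum(values)
```

Because summing is associative and commutative, the same function can serve as both combiner and reducer, and the result is the same with or without the combiner step.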

(15)

AzureMapReduce Architecture

The Sort & Reduce phases start when:

All the map tasks are finished, and

The reduce task has finished downloading all the intermediate data products

There is no guarantee when all the intermediate data will appear in the Task tables
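
The start condition above can be sketched as a check that a reduce task repeats while polling. This is a sketch under stated assumptions, not the AzureMapReduce implementation: `ReducerView` and its fields are hypothetical, and it assumes every map task writes at least one intermediate product record per reducer, so a finished map with no visible record means the table has not caught up yet.

```python
from dataclasses import dataclass, field


@dataclass
class ReducerView:
    """What one reduce task currently sees (hypothetical structure)."""
    total_map_tasks: int
    finished_maps: set = field(default_factory=set)     # map task ids marked done
    visible_products: set = field(default_factory=set)  # (map_id, part) rows seen so far
    downloaded: set = field(default_factory=set)        # (map_id, part) already fetched

    def can_start_sort_and_reduce(self) -> bool:
        # Condition 1: all the map tasks are finished.
        all_maps_done = len(self.finished_maps) == self.total_map_tasks
        # Eventual consistency: a finished map's records may not be visible yet.
        # Assumption: each map writes at least one record for this reducer.
        maps_with_records = {map_id for map_id, _ in self.visible_products}
        every_map_visible = maps_with_records >= self.finished_maps
        # Condition 2: every visible intermediate product has been downloaded.
        all_downloaded = self.visible_products <= self.downloaded
        return all_maps_done and every_map_visible and all_downloaded
```

A reduce task would re-evaluate this check each time it polls, and begin sorting only once it returns `True`.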

(16)

Performance

Parallel efficiency

AzureMapReduce

Azure small instances – Single Core

Hadoop Bare Metal – IBM iDataPlex cluster

EMR & Hadoop on EC2

Cap3 – HighCPU Extra Large (8 Cores, 20 CU)
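
The slide does not spell out the formula, so the sketch below assumes the standard definition of parallel efficiency, T(1) / (p × T(p)), where T(1) is the sequential run time, T(p) the run time on p cores, and a value of 1.0 means ideal scaling. The function name and parameters are hypothetical.

```python
def parallel_efficiency(t_sequential, t_parallel, cores):
    """Return T(1) / (p * T(p)) using the standard definition (assumed here).

    t_sequential: run time on one core, T(1)
    t_parallel:   run time on `cores` cores, T(p)
    cores:        number of cores, p
    """
    if t_sequential <= 0 or t_parallel <= 0 or cores <= 0:
        raise ValueError("times and core count must be positive")
    return t_sequential / (cores * t_parallel)
```

For instance, a job that takes 100 s on one core and 25 s on 8 cores has an efficiency of 100 / (8 × 25) = 0.5.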

(17)

Sequence Alignment

Smith-Waterman-Gotoh to calculate all-pairs dissimilarity

OutFile1

OutFile2

OutFile3
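
As a rough illustration of the computation, the sketch below scores local alignments with affine gap penalties (the Smith-Waterman recurrence with Gotoh's refinement) and fills a symmetric all-pairs matrix. The scoring parameters, the function names, and the normalization of a score into a dissimilarity are assumptions made for this example. The slides do not give the presenters' actual settings or distance formula.

```python
def sw_gotoh_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score with affine gaps.

    gap_open is the cost of the first gap position and gap_extend the cost
    of each further position (illustrative values, not from the slides).
    """
    neg_inf = float("-inf")
    prev_h = [0] * (len(b) + 1)        # H: best score ending at (i-1, j)
    prev_f = [neg_inf] * (len(b) + 1)  # F: best score ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (len(b) + 1)
        cur_f = [neg_inf] * (len(b) + 1)
        e = neg_inf                    # E: best score ending in a horizontal gap
        for j in range(1, len(b) + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open)
            cur_f[j] = max(prev_f[j] - gap_extend, prev_h[j] - gap_open)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


def all_pairs_dissimilarity(seqs):
    """Symmetric dissimilarity matrix; only pairs with i < j are computed.

    Assumed normalization: 1 - score / min(self-alignment scores).
    """
    n = len(seqs)
    self_scores = [sw_gotoh_score(s, s) for s in seqs]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            denom = min(self_scores[i], self_scores[j])
            score = sw_gotoh_score(seqs[i], seqs[j])
            d[i][j] = d[j][i] = 1.0 - score / denom if denom else 1.0
    return d
```

Because the matrix is symmetric, only the upper triangle needs to be computed. In a MapReduce setting, blocks of that triangle could be handed to separate map tasks.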

(19)

Sequence Assembly

Assemble sequences using Cap3

Pleasingly parallel
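
"Pleasingly parallel" here means each input file can be assembled independently, so the job needs only a map phase. A minimal sketch of that pattern follows. `run_map_only` and `fake_assemble` are hypothetical names, and `fake_assemble` is only a stand-in for running the Cap3 executable on one file. No real Cap3 invocation is shown.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map_only(input_files, assemble, max_workers=4):
    """Apply `assemble` to every input file independently (no reduce step).

    Returns a dict mapping each input file to its result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with input_files.
        results = list(pool.map(assemble, input_files))
    return dict(zip(input_files, results))


def fake_assemble(path):
    """Stand-in worker: pretends to assemble one file and names its output."""
    return path + ".contigs"
```

Since no task depends on another, adding workers should shorten the run time until storage or scheduling overhead starts to dominate.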

(22)

Thanks

http://salsahpc.indiana.edu/azuremapredu
