• No results found

Clouds will win!

N/A
N/A
Protected

Academic year: 2020

Share "Clouds will win!"

Copied!
26
0
0

Loading.... (view fulltext now)

Full text

(1)

Clouds will win!

CTS Conference 2011

Philadelphia

May 23 2011

Geoffrey Fox

[email protected]

http://www.infomall.org http://www.futuregrid.org

Director, Digital Science Center, Pervasive Technology Institute

(2)

Important Trends

Data Deluge

in all fields of science

Multicore

implies parallel computing important again

Performance from extra cores – not extra clock speed

GPU enhanced systems can give big power boost

Clouds

– new commercially supported data center

model replacing compute

grids

(and your general

purpose computer center)

Light weight clients

: Sensors, Smartphones and tablets

accessing and supported by backend services in cloud

Commercial efforts

moving

much faster

than

academia

(3)

Jaliya Ekanayake - School of Informatics and Computing 3

Big Data in Many Domains

Ø

According to

one

estimate, we created 150 exabytes (billion gigabytes) of data in

2005. This year, we will create 1,200 exabytes

Ø

PC’s have ~100 Gigabytes disk and 4 Gigabytes of memory

Ø

Size of the web

~ 3 billion web pages: MapReduce at Google was on average

processing 20PB per day in January 2008

Ø

During 2009, American drone aircraft flying over Iraq and Afghanistan sent back

around 24 years’ worth of video footage

Ø http://www.economist.com/node/15579717

Ø New models being deployed this year will produce ten times as many data streams as their predecessors, and those in 2011 will produce 30 times as many

Ø

~108 million sequence records in

GenBank

in 2009, doubling in every 18 months

Ø

~20 million purchases at Wal-Mart a day

Ø

90 million

Tweets a day

Ø

Astronomy, Particle Physics, Medical Records …

Ø

Most scientific task shows CPU:IO ratio of 10000:1 – Dr. Jim Gray

Ø

The Fourth Paradigm: Data-Intensive Scientific Discovery

(4)
(5)

Data Centers Clouds &

Economies of Scale I

Range in size from “edge”

facilities to megascale.

Economies of scale

Approximate costs for a small size

center (1K servers) and a larger,

50K server center.

Each data center is

11.5 times

the size of a football field

Technology Cost in small-sized Data Center

Cost in Large

Data Center Ratio Network $95 per Mbps/

month $13 per Mbps/month 7.1 Storage $2.20 per GB/

month $0.40 per GB/month 5.7 Administration ~140 servers/

Administrator >1000 Servers/Administrator 7.1

2 Google warehouses of computers on

the banks of the Columbia River, in

The Dalles, Oregon

Such centers use 20MW-200MW

(Future) each with 150 watts per CPU

Save money from large size,

(6)

6

Builds giant data centers with 100,000’s of computers;

~ 200-1000 to a shipping container with Internet access

“Microsoft will cram between 150 and 220 shipping containers filled

with data center gear into a new 500,000 square foot Chicago

facility. This move marks the most significant, public use of the

shipping container systems popularized by the likes of Sun

(7)

X as a Service

SaaS

:

Software

as a

Service

imply software capabilities

(programs) have a service (messaging) interface

– Applying systematically reduces system complexity to being linear in number of components

– Access via messaging rather than by installing in /usr/bin

IaaS

:

Infrastructure

as a

Service

or

HaaS

:

Hardware

as a

Service

– get your

computer time with a credit card and with a Web interface

PaaS

:

Platform

as a

Service

is

IaaS

plus core software capabilities on

which you build

SaaS

Cyberinfrastructure

is

“Research as a Service”

Other Services

(8)

Sensors as a Service

Cell phones are important sensor

Sensors as a Service

Sensor

Processing as

(9)

Clouds and Jobs

Clouds are a major industry thrust with a growing fraction of IT

expenditure that IDC estimates will grow to $44.2 billion direct

investment in 2013 while 15% of IT investment in 2011 will be

related to cloud systems with a 30% growth in public sector.

Gartner also rates cloud computing high on list of critical

emerging technologies with for example “Cloud Computing” and

“Cloud Web Platforms” rated as transformational (their highest

rating for impact) in the next 2-5 years.

Correspondingly there is and will continue to be major

opportunities for new jobs in cloud computing with a recent

European study estimating there will be 2.4 million new cloud

computing jobs in Europe alone by 2015.

Cloud computing is an attractive for projects focusing on

workforce development. Note that the recently signed “America

Competes Act” calls out the importance of economic

(10)

Gartner 2009 Hype Curve

Clouds, Web2.0

Service Oriented Architectures

Transformational

High

Moderate

Low

(11)

Philosophy of Clouds and Grids

Clouds

are (by definition) commercially supported

approach to large scale computing

So we should expect

Clouds to replace Compute Grids

Current Grid technology involves “non-commercial” software

solutions which are hard to evolve/sustain

Public Clouds

are broadly accessible resources like

Amazon and Microsoft Azure – powerful but not easy

to optimize and perhaps data trust/privacy issues

Private Clouds

run similar software and mechanisms

but on “your own computers”

Services

still are correct architecture with either REST

(Web 2.0) or Web Services

(12)

Grids MPI and Clouds

Grids

are useful for

managing distributed systems

Pioneered service model for Science

Developed importance of

Workflow

Performance issues – communication latency – intrinsic to distributed systems

Can never run large differential equation based simulations or datamining

Clouds

can execute any job class that was good for Grids

plus

More attractive due to platform plus

elastic

on-demand model

MapReduce

easier to use than MPI for appropriate parallel jobs

Currently have performance limitations due to poor affinity (locality) for

compute-compute (MPI) and Compute-data

These limitations are not “inevitable” and should gradually improve as in July

13 2010 Amazon Cluster announcement

Will probably never be best for most sophisticated parallel differential

equation based simulations

Classic Supercomputers

(MPI Engines) run

communication demanding

differential equation based simulations

MapReduce and Clouds replaces MPI

for other problems

(13)

Fault Tolerance and MapReduce

MPI

does “maps” followed by “communication” including

“reduce” but does this iteratively

There must (for most communication patterns of interest) be a

strict synchronization

at end of each communication phase

Thus if a

process fails then everything grinds to a halt

In MapReduce, all Map processes and all reduce processes are

independent

and stateless and read and write to disks

As 1 or 2 (reduce+map) iterations, no difficult synchronization

issues

Thus

failures can easily be recovered

by rerunning process

without other jobs hanging around waiting

Re-examine MPI fault tolerance in light of MapReduce

(14)

Components of a Scientific Computing Platform

Authentication and Authorization: Provide single sign in to both FutureGrid and Commercial Clouds linked by workflow

Workflow: Support workflows that link job components between FutureGrid and Commercial Clouds. Trident from Microsoft Research is initial candidate

Data Transport: Transport data between job components on FutureGrid and Commercial Clouds respecting custom storage patterns

Program Library: Store Images and other Program material (basic FutureGrid facility)

Blob: Basic storage concept similar to Azure Blob or Amazon S3

DPFS Data Parallel File System: Support of file systems like Google (MapReduce), HDFS (Hadoop) or Cosmos (dryad) with compute-data affinity optimized for data processing

Table: Support of Table Data structures modeled on Apache Hbase/CouchDB or Amazon SimpleDB/Azure Table. There is “Big” and “Little” tables – generally NOSQL

SQL: Relational Database

Queues: Publish Subscribe based queuing system

Worker Role: This concept is implicitly used in both Amazon and TeraGrid but was first introduced as a high level construct by Azure

MapReduce: Support MapReduce Programming model including Hadoop on Linux, Dryad on Windows HPCS and Twister on Windows and Linux

Software as a Service: This concept is shared between Clouds and Grids and can be supported without special attention

(15)
(16)

Traditional File System?

Typically a shared file system (Lustre, NFS …) used to

support high performance computing

Big advantages in flexible computing on shared data but

doesn’t “

bring computing to data

(17)

Data Parallel File System?

No archival storage and computing brought to data

C Data C Data C Data C Data C Data C Data C Data C Data C Data C Data C Data C Data C Data C Data C Data C Data File1 Block1 Block2 BlockN

……

Breakup Replicate each block

File1 Block1 Block2 BlockN

……

Breakup

(18)

Important Platform Capability

MapReduce

Implementations (Hadoop – Java; Dryad – Windows)

support:

Splitting of data

Passing the output of map functions to reduce functions

Sorting the inputs to the reduce function based on the

intermediate keys

Quality of service

Map(Key, Value)

Reduce(Key, List<Value>)

Data Partitions

Reduce Outputs

(19)

MapReduce “File/Data Repository” Parallelism

Instruments

Disks Map1 Map2 Map3 Reduce

Communication

Map = (data parallel) computation reading and writing data

Reduce = Collective/Consolidation phase e.g. forming multiple global sums as in histogram

Portals /Users

Iterative MapReduce

Map Map Map Map

(20)

DNA Sequencing Pipeline

Visualization Plotviz

Blocking Sequencealignment

MDS

Dissimilarity Matrix

N(N-1)/2 values FASTA File

N Sequences

Form block Pairings

Pairwise clustering

Illumina/Solexa Roche/454 Life Sciences Applied Biosystems/SOLiD

Internet

Read Alignment

~300 million base pairs per day leading to ~3000 sequences per day per instrument ? 500 instruments at ~0.5M$ each

MapReduce

(21)

Cap3 Cost

Num. Cores * Num. Files

64 * 102496 * 1536 128 *

2048 160 *2560 192 *3072

Co

st

($

)

0 2 4 6 8 10 12 14 16 18

Azure MapReduce Amazon EMR

(22)

SWG Cost

Num. Cores * Num. Blocks

64 * 1024 96 * 1536 128 * 2048 160 * 2560 192 * 3072

Co

st

($

)

0 5 10 15 20 25 30

(23)

Twister v0.9

March 15, 2011

New Interfaces for Iterative MapReduce Programming

http://www.iterativemapreduce.org/

SALSA Group

Bingjing Zhang, Yang Ruan, Tak-Lon Wu, Judy Qiu, Adam

Hughes, Geoffrey Fox,

Applying Twister to Scientific

Applications

, Proceedings of IEEE CloudCom 2010

Conference, Indianapolis, November 30-December 3, 2010

Twister4Azure

released May 2011

http://salsahpc.indiana.edu/twister4azure/

MapReduceRoles4Azure

available for some time at

(24)

Iteratively refining operation

Typical MapReduce runtimes incur extremely high overheads

– New maps/reducers/vertices in every iteration

– File system based communication

Long running tasks and faste

r communication in Twister enables it to

perform close to MPI

Time for 20 iterations

K-Means Clustering

map map

reduce

Compute the distance to each data point from each cluster center and assign points to cluster centers

Compute new cluster centers

Compute new cluster centers

(25)

Twister

• Streaming based communication

• Intermediate results are directly transferred from the map tasks to the reduce tasks –eliminates local files

• Cacheablemap/reduce tasks

• Static data remains in memory • Combinephase to combine reductions

• User Program is the composerof MapReduce computations

Extendsthe MapReduce model to

iterativecomputations

Data Split

D MR

Driver ProgramUser

Pub/Sub Broker Network

D File System M R M R M R M R Worker Nodes M R D Map Worker Reduce Worker MRDeamon Data Read/Write Communication

Reduce (Key, List<Value>) Iterate

Map(Key, Value)

Combine (Key, List<Value>) User Program Close() Configure() Static data δ flow

(26)

Twister4Azure

early results

Number of Query Files

128 228 328 428 528 628 728 828

Parallel

Efficiency

0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% 100.00%

Hadoop-Blast

EC2-ClassicCloud-Blast

DryadLINQ-Blast

References

Related documents

Correlation between Breast Self-Examination Practices and Demographic Characteristics, Risk Factors and Clinical Stage of Breast Cancer among Iraqi Patients..

The hit rate ( H ) to detect SGI-based groundwater droughts using the SPI with the (a) optimal accumulation period and (b, c) 6 and 12 months of uniform accumulation periods at

Number of approaches in rational drug design, the molecular mechanism of drug action, Drug metabolizing enzyme action upon the structure of to drug molecule,

The corrosion inhibition of Armco steel in 0.5 M sulfuric acid in the presence of cationic Copolymers CQGP of Quaternary 4-vinylpyridine (QVPy) Graft

The analysis of the formation of the institution of civil legal personality of citizens at various periods in the development of history allows us to say that

Based on some sort of this kind of considerations, some uncertainty factor like, limit of detection, coefficient of variation, Figure of merit, standard deviation, minimum

In [7] the main challenging problem for latent fingerprint enhancement is to remove various types of image noises while reliably restoring the corrupted regions

Travel and Evacuation Insurance: Each participant MUST have Travel Medical and Evacuation insurance to attend this course.. This insurance is available from