• No results found

Lightweight Streaming based Runtime for Cloud Computing

N/A
N/A
Protected

Academic year: 2020

Share "Lightweight Streaming based Runtime for Cloud Computing"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

Lightweight Streaming-based Runtime for Cloud Computing

Shrideep Pallickara

Community Grids Lab, Indiana University

(2)

A unique confluence of factors have

driven the need for cloud computing

DEMAND PULLS: Process and store large

data volumes

• Y02 22-EB : Y06 161-EB : Y10 988-EB ~ I ZB

TECHNOLOGY PUSHES: Falling hardware

(3)

Since the cloud is not monolithic it is

easier to cope with flux and evolve

• Replacement and upgrades are two sides of the same coin

• Desktop: 4 GB RAM, 400 GB Disk, 50 GFLOPS & 4 cores

(4)

Cloud and traditional HPC systems have

some fundamental differences

• One job at a time = Underutilization

– Execution pipelines – IO Bound activities

• An application is the sum of its parts

(5)

Projects utilizing

NaradaBrokering

Q

u

akeS

(6)
(7)

Projects utilizing

NaradaBrokering

Se

n

sor G

ri

d

s

RFID Tag RFID Reader

(8)

Lessons learned from multi-disciplinary

settings

Framework for processing streaming data

• Compute demands will outpace availability

(9)

Big Picture

Networked Devices

{Sensors, Instruments}

Simulations & Services

Experimental Models

(10)

Fueled the need for a new type of

computational task

• Operate on dynamic and voluminous data. • Comparably smaller CPU-bound times

– Milliseconds to minutes

BUT concurrently interleave 1000s of these long running tasks

(11)

Granules is dispersed over, and

permeates, distributed components

R1 G R2 G RN G C C G A Discovery Services Lifecycle Management Diagnostics Deployment Manager Concurrent Execution

Queue InitializerDataset

G

Content Distribution Network

(12)

An application is the sum of its

computational tasks that are …

• Agnostic about the resources that they execute on

• Responsible for processing a subset of the data

– Fragment of a stream – Subset of files

(13)

Granules does most of the work for

the applications except for …

1. Processing Functionality 2. Specifying the Datasets

(14)

Granules processing functionality

1

is

domain specific

• Implement just one method: execute()

(15)

Computational tasks often need to

cope with multiple datasets

2

T

YPES: Streams, Files, Databases & URIs

I

NITIALIZE: Configuration & allocations

A

CCESS: Permissions and authorizations

(16)

Computational tasks specify their

lifetime and scheduling

3

strategy

Number of times Data availability

Permute on any of

these dimensions

Change during

(17)

Granules discovers resources to

deploy computational tasks

Deploy computation instance on multiple

resources

Instantiate computational tasks & execute in Sandbox

Initialize task’s STATE and DATASETS

Interleave multiple computations on a

(18)

C G

Content Distribution Network

Deploying computational tasks

RN

(19)

Granules manages the state

transitions for computational tasks

Terminate

Initialize Dormant

Activated

Transition Triggers:

• Data Availability • Periodicity

(20)

Activation and processing of

computational tasks

(21)

M

AP

-R

EDUCE

enables concurrent

processing of large datasets

d1

d2

dN

Map1

Map2

MapN

Reduce1

ReduceK

o1

oK

D

at

ase

(22)

Substantial benefits can be accrued in a

streaming version of M

AP

-R

EDUCE

• File-based = Disk IO  Expensive

• Streaming is much faster

– Allows access to intermediate results – Enables time bound responses

(23)

In Granules M

AP

and R

EDUCE

are two

roles of a computational task

Linking of MAP-REDUCE roles is easy

M1.addReduce(R1) or R1.addMap(M1)

– Unlinking is easy too: remove

• Maps generate result streams, which are consumed by reducers

(24)

In Granules M

AP

-R

EDUCE

roles are

interchangeable

M1

M2

M3

R1

R2

RM

(25)

Scientific applications can harness

M

AP

-R

EDUCE

variants in Granules

ITERATIVE: Fixed number of times

RECURSIVE: Till termination condition

PERIODIC

(26)

Complex computational pipelines

can be set up using Granules

• Iterative, Periodic, Recursive & Data driven

(27)

Granules manages pipeline

communications complexity

• No arduous management of

fan-ins

• Facilities to track outputs

Confirm receipt from all

preceding stages.

1

2

3

(28)

Granules allows computational tasks

to be cloned

M1

M2

M3 M1

M2

R1

R2 R1

• Fine tune redundancies

Double-check results

(29)

Related work

HADOOP: File-based, Java, HDFS

DRYAD: Dataflow graphs, C#, LinQ, MSD

DISCO: File-based, Erlang

PHEONIX: Multicore

(30)

Granules outperforms Hadoop & Dryad

in a traditional IR benchmark

(31)

Clustering data points using K-means

0.1 1 10 100 1000 10000

0.1 1 10 100

O ver h e ad ( S ec on ds )

(32)

Computing the product of two 16K

x

16 K

matrices using streaming datasets

2048 4096 8192 16384

om

pu

ting ov

er

he

ad i

n S

ec

(33)

Maximizing core utilizations when

assembling mRNA sequences

8 16 32 64 128 256 512 1024

1 2 4 8 16 32 64 1

2 4 8 16 32 64 C om pu ting ov er he ad i n S ec onds S peed -up

Number of Cores Computing Overhead

Speedup

(34)

Preserving scheduling periodicity for 10

4

concurrent computational tasks

(35)

The streaming substrate provides

consistent high performance throughput

0 1 2 3 4 5

100 1000 10000 100000 0

1 2 3 4 5 M e an t rans it de la y (M ill is e cond s) S tand ar d D ev ia tion ( M ill is ec on ds )

Content Payload Size (Bytes)

Streaming overheads for different payload sizes (100B - 100KB)

(36)

The streaming substrate provides

consistent high performance throughput

10 20 30 40 50 60 4 6 8 10 12 14 rans it de la y (M ill is e cond s) ar d D ev ia tion ( M ill is ec on ds )

Streaming overheads for different payload sizes (100B - 1MB)

(37)

Cautionary Tale: Gains when Disk-IO

cannot keep pace with processing

2000 4000 6000 8000 10000 12000 14000

5 10 15 20 25 30 35 40 45 50

P roc es si ng ov er head ( S ec on ds )

Number of Processors

Histogramming words in a 500 GB dataset 1600 Files & 100 Map Functions

95%

72%

41%

65%

(38)

Key innovations within Granules

Easy to develop applications

• Support for real-time streaming datasets

• Rich lifecycle and scheduling support for computational tasks.

(39)

Future Work

Probabilistic guarantees within the cloud

• Efficient generation of compute streams

Throttling and steering of computations

Staging datasets to maximize throughput

(40)

Conclusions

• Pressing need to cloud-enable network data intensive systems

Complexity should be managed BY the

runtime, and NOT by domain specialists

Autonomy of Granules instances allows it

References

Related documents

But, for example, the significant positive effect of a number of measures for innovation on the probability of enterprises to be involved in a combination of export and

mas mokytojo profesija, visuomenės dėmesio stoka. Tai atsispindi vyk­ dant savo, kaip dėstytojų bei studentų, priedermes. Čia paminėjome tik bendriausias Vakarų šalių

Per the expectations set by the funder via the original request for proposal announcement, sites included in the study were required to have well-established CMM services delivered

Additionally, Global Network for Isotopes in Precipitation data (means for 11 months, Irkutsk) and Lake Baikal mean isotope composition (light-green square) are given; (b) daily

En general, hoy, gracias a los nuevos enfoques existe todo un universo de líneas y temas de investigación dentro de la historia militar en España, como el de la violencia

 A brief outline of the DSA assessment process and its importance. This letter would confirm that SLC accepts the evidence of disability provided by the student and clearly sets

Compared to the primary findings, our study found no significant differences in the sensitivity analysis on viral load suppression rates between Teen Club

We look critically at the construction of celebrity Jools Oliver and fictional character Bridget Jones (Fielding, 2013), to show how aesthetic labour has become a central