DynamicCloudSim: Simulating Heterogeneity in Computational Clouds

(1)


Marc Bux, Ulf Leser

{bux|leser}@informatik.hu-berlin.de

The 2nd International Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET'13)

(2)

(3)

(4)

(5)

• Small Instance: 1.7 GB RAM, 1 EC2 Compute Unit, 160 GB local storage

• Compute Unit: equivalent CPU capacity of a 1.0-1.2 GHz Opteron or Xeon

• No guarantees with respect to I/O throughput and network delay / bandwidth

(6)

Any one cloud instance is unlike another.

(7)

Heterogeneity in EC2 Cloud Instances

• Different CPUs on physical host systems [Jackson10, Schad10]
– Intel Xeon E5430 (2.66 GHz quad)
– AMD Opteron 270 (2 GHz dual)
– AMD Opteron 2218 HE (2.6 GHz dual)

• I/O throughput varies as well [Dejun10]
– No correlation between CPU and I/O performance

[Figures: Amazon EC2 performance [Schad10]; source: [Dejun10]]

(8)

Dynamic Changes of Performance

• Occasional CPU performance slumps and failures during task execution [Dejun10, Jackson10]

• Variance in I/O and network throughput [Zaharia08, Jackson10]

• Performance depends on hour of day and day of week [Schad10]

[Figure: EC2 disk performance vs. VM co-allocation [Zaharia08]]

(9)

Vision

Adaptive scheduling of scientific workflows

• Exploit heterogeneous resources

(10)

Vision

• The standard approach for evaluation is simulation [Braun01, Blythe05]

(11)

Agenda

1) Simulating Heterogeneity in Computational Clouds

2) Evaluating Established Workflow Schedulers

(12)

Agenda

3) Summary and Outlook

(13)

CloudSim

[Diagram: Datacenter / Host / VM / Task]

• R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. De Rose, R. Buyya (2011), CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms, Software - Practice and Experience 41(1):23-50.

• More than 250 citations in Google Scholar

(14)

DynamicCloudSim

[Diagram: Datacenter / Heterogeneous Host / Dynamic VM / Error-prone Task]

• Extend CloudSim with models for
1. Heterogeneous computational resources (Het)
2. Dynamic changes of performance at runtime (DCR)
3. Straggler VMs and failed task executions (SaF)

• More fine-grained representation of computational resources
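The straggler and failure model (SaF) listed above can be pictured with a minimal Python sketch. This is not DynamicCloudSim's code: the function names, the straggler coefficient of 0.5, and the failure runtime factor of 20 are illustrative assumptions. The sketch only shows the idea that a VM becomes a straggler, or a task attempt fails, with a configurable likelihood.

```python
import random


def provision_vm_speed(base_speed, straggler_likelihood,
                       straggler_coefficient=0.5, rng=random):
    """Return the effective speed of a newly provisioned VM.

    With probability `straggler_likelihood` the VM is a straggler and
    only delivers `straggler_coefficient` (assumed value) of its speed.
    """
    if rng.random() < straggler_likelihood:
        return base_speed * straggler_coefficient
    return base_speed


def task_runtime(work, vm_speed, failure_likelihood,
                 failure_runtime_factor=20.0, rng=random):
    """Return (runtime, succeeded) for one task execution attempt.

    A failed attempt is assumed to waste `failure_runtime_factor` times
    the regular runtime before the failure is noticed.
    """
    runtime = work / vm_speed
    if rng.random() < failure_likelihood:
        return runtime * failure_runtime_factor, False
    return runtime, True


# Example: eight VMs and one task attempt at a likelihood of 0.025.
rng = random.Random(42)
speeds = [provision_vm_speed(1.0, 0.025, rng=rng) for _ in range(8)]
runtime, succeeded = task_runtime(100.0, speeds[0], 0.025, rng=rng)
```

A scheduler that retries failed tasks would call `task_runtime` again and add up the wasted time, which is why even small likelihoods can inflate the overall runtime.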

(15)

Realism – can we ever get there?

• Simulation can never perfectly resemble reality

• We model inhomogeneity and dynamic changes by sampling from normal distributions

• Default mean and standard deviation (STD) / relative standard deviation (RSD) parameters are obtained from [Zaharia08, Dejun10, Jackson10, Schad10, Iosup11]
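As a rough illustration of sampling VM performance from a normal distribution parameterized by a mean and an RSD, consider the following Python sketch. The function name, the example mean of 1000, and the lower bound that prevents non-positive values are assumptions made for this example, not details taken from DynamicCloudSim.

```python
import random


def sample_vm_performance(mean, rsd, rng, floor=0.01):
    """Draw one VM's performance from N(mean, (rsd * mean)^2).

    RSD is the relative standard deviation, so the absolute standard
    deviation is rsd * mean. Samples are clamped to floor * mean
    (an assumption) so that performance never becomes zero or negative.
    """
    std = rsd * mean
    value = rng.gauss(mean, std)
    return max(value, floor * mean)


# Example: eight VMs with a hypothetical mean of 1000 and an RSD of 0.25.
rng = random.Random(7)
vm_speeds = [sample_vm_performance(mean=1000.0, rsd=0.25, rng=rng)
             for _ in range(8)]
```

With `rsd=0` every VM receives exactly the mean, which corresponds to a fully homogeneous setup.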

(16)

Simulating VM Performance: DCS vs CS

1. Heterogeneous computational resources (Het)

2. Dynamic changes of performance at runtime (DCR)
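Dynamic changes at runtime (DCR) can be sketched in the same spirit: a VM keeps a fixed baseline, and its momentary performance fluctuates around that baseline. In the Python sketch below, the discrete time steps, the independence of consecutive samples, and all names are simplifying assumptions for illustration only.

```python
import random


def performance_trace(baseline, dcr_rsd, steps, rng, floor=0.01):
    """Sample a VM's speed for each time step around its baseline.

    Each step draws from N(baseline, (dcr_rsd * baseline)^2), clamped to
    floor * baseline so that the speed stays positive (an assumption).
    """
    trace = []
    for _ in range(steps):
        value = rng.gauss(baseline, dcr_rsd * baseline)
        trace.append(max(value, floor * baseline))
    return trace


def runtime_on_trace(work, trace, step_length=1.0):
    """Time needed to process `work` units on a VM following `trace`.

    Returns None if the trace ends before the work is finished.
    """
    elapsed = 0.0
    remaining = work
    for speed in trace:
        capacity = speed * step_length
        if capacity >= remaining:
            return elapsed + remaining / speed
        remaining -= capacity
        elapsed += step_length
    return None


# Example: the same task on a stable VM and on a fluctuating VM.
rng = random.Random(3)
stable = runtime_on_trace(25.0, performance_trace(10.0, 0.0, 10, rng))
unstable = runtime_on_trace(25.0, performance_trace(10.0, 0.5, 10, rng))
```

A static schedule computed from the baseline alone cannot anticipate the fluctuating trace, whereas a scheduler that assigns work at runtime reacts to it implicitly.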

(17)

Agenda

a) Scheduling Scientific Workflows

b) Evaluation Workflows

c) Evaluation Results

(18)

Agenda

a) Scheduling Scientific Workflows

(19)

Scheduling of Scientific Workflows

• Scheduling:
– Mapping tasks to the available physical resources
– Usual goal: minimize overall execution time

• Static Scheduling:
– Schedule is assembled prior to workflow execution
– Schedule is strictly adhered to at runtime

• Adaptive Scheduling:
– Monitor computational infrastructure
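To make the scheduling goal concrete, the following Python sketch computes the overall execution time (makespan) of a given mapping of tasks to VMs. It assumes independent tasks that run one after another on their VM and ignores data dependencies and transfers; the task names, work units, and VM speeds are made up for the example.

```python
def makespan(assignment, task_work, vm_speeds):
    """Overall execution time of a task-to-VM mapping.

    assignment: task name -> VM index
    task_work:  task name -> amount of work (arbitrary units)
    vm_speeds:  work units each VM processes per time unit
    """
    busy = [0.0] * len(vm_speeds)
    for task, vm in assignment.items():
        busy[vm] += task_work[task] / vm_speeds[vm]
    return max(busy)


# Hypothetical example: three tasks, one fast VM (index 0), one slow VM.
task_work = {"a": 40.0, "b": 20.0, "c": 20.0}
vm_speeds = [2.0, 1.0]

schedule_1 = {"a": 0, "b": 1, "c": 1}
schedule_2 = {"a": 0, "b": 0, "c": 1}

time_1 = makespan(schedule_1, task_work, vm_speeds)  # 40.0
time_2 = makespan(schedule_2, task_work, vm_speeds)  # 30.0
```

Both schedules use the same resources, yet the second one finishes earlier because it places more work on the faster VM.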

(20)

Static Schedulers

• Baseline: Round Robin
– Assign tasks to resources in turn
– Equal number of tasks per resource

• Elaborate: HEFT (Heterogeneous Earliest Finish Time) [Topcuoglu02]
– Implemented in the SWfMS Pegasus
– Requires runtime estimates for each task on each resource
– Assign the tasks with the longest time to finish to a fixed time slot on a suitable (well-performing) resource
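The two static schedulers can be contrasted in a short Python sketch. The HEFT variant below is deliberately simplified: it ranks tasks by their upward rank and places each task on the VM with the earliest finish time, but it ignores communication costs and the insertion-based slot search of the original algorithm [Topcuoglu02]. The workflow, task names, and runtime estimates are hypothetical.

```python
def round_robin(tasks, n_vms):
    """Assign tasks to VMs in turn, regardless of VM performance."""
    return {task: i % n_vms for i, task in enumerate(tasks)}


def heft_simplified(succ, est_runtime):
    """Simplified HEFT for a workflow DAG with positive runtime estimates.

    succ:        task -> list of successor tasks
    est_runtime: task -> list of estimated runtimes, one per VM
    Returns (assignment, finish_times).
    """
    n_vms = len(next(iter(est_runtime.values())))
    avg = {t: sum(r) / n_vms for t, r in est_runtime.items()}

    # Upward rank: average runtime plus the longest path to an exit task.
    rank = {}

    def upward_rank(t):
        if t not in rank:
            rank[t] = avg[t] + max(
                (upward_rank(s) for s in succ.get(t, [])), default=0.0)
        return rank[t]

    # Descending rank order guarantees predecessors are scheduled first.
    order = sorted(est_runtime, key=upward_rank, reverse=True)

    pred = {t: [] for t in est_runtime}
    for t, children in succ.items():
        for c in children:
            pred[c].append(t)

    vm_free = [0.0] * n_vms
    assignment, finish = {}, {}
    for t in order:
        ready = max((finish[p] for p in pred[t]), default=0.0)
        # Pick the VM on which this task would finish earliest.
        best = min(range(n_vms),
                   key=lambda v: max(ready, vm_free[v]) + est_runtime[t][v])
        finish[t] = max(ready, vm_free[best]) + est_runtime[t][best]
        vm_free[best] = finish[t]
        assignment[t] = best
    return assignment, finish


# Hypothetical workflow: prep -> (align1, align2) -> merge; VM 0 is faster.
succ = {"prep": ["align1", "align2"], "align1": ["merge"], "align2": ["merge"]}
est_runtime = {"prep": [10, 20], "align1": [30, 60],
               "align2": [20, 40], "merge": [10, 20]}

rr_assignment = round_robin(list(est_runtime), 2)
heft_assignment, heft_finish = heft_simplified(succ, est_runtime)
```

In this example the simplified HEFT keeps the critical path on the fast VM and moves only `align2` to the slow one. Its result is only as good as the runtime estimates it is given.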

(21)

Adaptive Schedulers

• Baseline: Greedy Task Queue
– Assign tasks to resources at runtime in a first-come-first-served manner
– Adapts to changes of performance at runtime (DCR)

• Elaborate: LATE (Longest Approximate Time to End) [Zaharia08]
– Developed for Hadoop to increase robustness to instability
– 10% of tasks progressing at a below-average rate are replicated and speculatively executed
– Exploit dynamic changes of performance
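Both adaptive strategies can be sketched in a few lines of Python. The greedy queue below hands the next task to whichever VM becomes idle first, assuming independent tasks. The LATE part estimates each task's time to end as (1 - progress) / progress rate and selects the slowest tasks for speculative execution. The below-average threshold and the 10% cap follow the wording of this slide. All names and example values are assumptions, and the Hadoop implementation may differ in detail.

```python
import heapq


def greedy_queue_makespan(task_work, vm_speeds):
    """First-come-first-served: an idle VM pulls the next queued task."""
    idle_at = [(0.0, vm) for vm in range(len(vm_speeds))]
    heapq.heapify(idle_at)
    makespan = 0.0
    for work in task_work:
        now, vm = heapq.heappop(idle_at)  # VM that becomes idle first
        done = now + work / vm_speeds[vm]
        makespan = max(makespan, done)
        heapq.heappush(idle_at, (done, vm))
    return makespan


def late_time_left(progress, elapsed):
    """Estimated time to end, with progress rate = progress / elapsed."""
    rate = progress / elapsed
    return (1.0 - progress) / rate


def speculation_candidates(running, cap_fraction=0.1):
    """Pick tasks to replicate.

    running: task -> (progress in (0, 1], elapsed time)
    Tasks with a below-average progress rate are ordered by longest
    estimated time to end; at most cap_fraction of them (at least one)
    are returned.
    """
    rates = {t: p / e for t, (p, e) in running.items()}
    mean_rate = sum(rates.values()) / len(rates)
    slow = [t for t in running if rates[t] < mean_rate]
    slow.sort(key=lambda t: late_time_left(*running[t]), reverse=True)
    cap = max(1, int(cap_fraction * len(running)))
    return slow[:cap]


# Hypothetical example: four equal tasks, a fast VM and a slow VM.
queue_time = greedy_queue_makespan([10.0, 10.0, 10.0, 10.0], [2.0, 1.0])

running = {"t1": (0.8, 10.0), "t2": (0.2, 10.0),
           "t3": (0.9, 10.0), "t4": (0.5, 10.0)}
to_replicate = speculation_candidates(running)
```

In the example the fast VM ends up processing three of the four tasks without any prior knowledge of VM speeds, and `t2` is selected for replication because it has the longest estimated time to end.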

(22)

Agenda

(23)

(24)

Abstract Montage Workflow

(25)

Concrete Montage Workflow

• 43,318 tasks reading and writing 534 GB of data

• 10 GB of input files which have to be uploaded to the cloud

(26)

(27)

(28)

Concrete Genomics Workflow

• Align 10% of the reads produced in a sequencing experiment against one of the smallest human chromosomes (chr22)
– Use about 0.2% of the available data

• 4,266 tasks reading and writing 436 GB of data (2.3 GB upload)

[Figure: workflow stages, as labeled: Indexing (bowtie, SHRiMP, PerM); Alignment (bowtie, SHRiMP, PerM); Convert (samtools view); Sort (samtools sort); Merge (merge); Preprocess (samtools mpileup); Variant calling (VarScan); "Sense-Making" (VCFTools); Upload to cloud]

(29)

Agenda

(30)

Runtime depending on Heterogeneity (Het)

[Figure: two bar charts of average runtime in minutes by RSD parameter for heterogeneous resources (Het). Bar labels are listed below in legend order.]

First chart:

RSD (Het)   Static Round Robin   HEFT   Greedy Queue   LATE
0           368                  304    296            311
0.125       371                  301    300            308
0.25        450                  296    303            315
0.375       715                  296    308            313
0.5         1314                 286    300            300

Second chart:

RSD (Het)   Static Round Robin   HEFT   Greedy Queue   LATE
0           203                  143    163            178
0.125       220                  148    163            179
0.25        275                  150    166            177
0.375       602                  152    187            182
0.5         747                  149    195            185

(31)

Runtime depending on Dynamic Changes (DCR)

[Figure: two bar charts of average runtime in minutes by RSD parameter for dynamic changes at runtime (DCR). Bar labels are listed below in legend order.]

First chart:

RSD (DCR)   Static Round Robin   HEFT   Greedy Queue   LATE
0           368                  304    296            311
0.125       352                  301    296            317
0.25        394                  357    299            308
0.375       465                  439    311            299
0.5         574                  530    307            289

Second chart:

RSD (DCR)   Static Round Robin   HEFT   Greedy Queue   LATE
0           203                  143    163            178
0.125       216                  165    166            176
0.25        241                  190    165            179
0.375       295                  255    170            180
0.5         393                  314    207            177

(32)

Runtime with Stragglers and Failures (SaF)

[Figure: two bar charts of average runtime in minutes by likelihood of straggler VMs and failed tasks (SaF). Bar labels are listed below in legend order.]

First chart:

Likelihood (SaF)   Static Round Robin   HEFT   Greedy Queue   LATE
0                  368                  304    296            311
0.00625            598                  405    396            316
0.0125             876                  659    586            317
0.01875            1365                 962    790            316
0.025              2559                 1291   1137           321

Second chart:

Likelihood (SaF)   Static Round Robin   HEFT   Greedy Queue   LATE
0                  203                  143    163            178
0.00625            352                  262    237            180
0.0125             617                  411    444            187
0.01875            1025                 604    635            188
0.025              1990                 984    1125           195

(33)

That’s all well and good, but…

• Scheduling in SWfMS: Static or Greedy Task Queue

• HEFT and LATE have a computational overhead and require information not available in real scenarios:
– HEFT: runtime estimates of each task on each machine
– LATE: progress rate of each running task

• Untapped optimization potential: multiple resource scheduling

(34)

Summary and Outlook

• EC2: Heterogeneity and instability in VM performance

• DynamicCloudSim introduces several factors of instability into CloudSim

• Simulation experiments reproduce known strengths and shortcomings of established schedulers

(35)

Thanks for your attention!

(36)


(37)

Literature

• [Braun01] T. D. Braun, H. J. Siegel, N. Beck, L. L. Boloni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, R. F. Freund (2001), A Comparison Study of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems, Journal of Parallel and Distributed Computing 61:810–837.

• [Blythe05] J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi, A. Mandal, K. Kennedy (2005), Task Scheduling Strategies for Workflow-based Applications in Grids, in: Proceedings of the 5th IEEE International Symposium on Cluster Computing and the Grid, volume 2, Cardiff, UK, pp. 759–767.

(38)

Literature (cont.)

• [Jackson10] K. R. Jackson, et al. (2010), Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud, in: Proceedings of the 2nd International Conference on Cloud Computing Technology and Science, Indianapolis, USA, pp. 159–168.

• [Dejun10] J. Dejun, et al. (2009), EC2 Performance Analysis for Resource Provisioning of Service-Oriented Applications, in: Proceedings of the 7th International Conference on Service Oriented Computing, Stockholm, Sweden, pp. 197–207.

• [Zaharia08] M. Zaharia, et al. (2008), Improving MapReduce Performance in Heterogeneous Environments, in: Proceedings of the 8th USENIX Symposium on Operating Systems Design

(39)

Literature (cont.)

• [Schad10] J. Schad, J. Dittrich, J.-A. Quiané-Ruiz (2010), Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance, Proceedings of the VLDB Endowment 3(1):460–471.

• [Iosup11] A. Iosup, N. Yigitbasi, D. Epema (2011), On the Performance Variability of Production Cloud Services, in: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Newport Beach, California, USA, pp. 104–113.

(40)

Literature (cont.)

• [Topcuoglu02] H. Topcuoglu, S. Hariri, M.-Y. Wu (2002), Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing, IEEE Transactions on Parallel and Distributed Systems 13(3):260–274.

• [Berriman04] G. B. Berriman, et al. (2004), Montage: a grid-enabled engine for delivering custom science-grade mosaics on demand, in: Proceedings of the SPIE Conference on Astronomical Telescopes and Instrumentation, volume 5493, Glasgow, Scotland, pp. 221–232.

https://code.google.com/p/cloudsim/

https://code.google.com/p/dynamiccloudsim/