DynamicCloudSim: Simulating Heterogeneity in Computational Clouds
Marc Bux, Ulf Leser
{bux|leser}@informatik.hu-berlin.de
The 2nd International Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET'13)
• Small Instance: 1.7 GB RAM, 1 EC2 Compute Unit, 160 GB local storage
• Compute Unit: equivalent to the CPU capacity of a 1.0-1.2 GHz Opteron or Xeon
• No guarantees regarding I/O throughput and network delay / bandwidth
Any one cloud instance is unlike another.
Heterogeneity in EC2 Cloud Instances
• Different CPUs on physical host systems [Jackson10, Schad10]
– Intel Xeon E5430 (2.66 GHz quad)
– AMD Opteron 270 (2 GHz dual)
– AMD Opteron 2218 HE (2.6 GHz dual)
• I/O throughput varies as well [Dejun10]
– No correlation between CPU and I/O performance
Figure: Amazon EC2 performance [Schad10]. Source: [Dejun10]
Dynamic Changes of Performance
• Occasional CPU performance slumps and failures during task execution [Dejun10, Jackson10]
• Variance in I/O and network throughput [Zaharia08, Jackson10]
• Performance depends on hour of day and day of week [Schad10]
Figure: EC2 disk performance vs. VM co-allocation [Zaharia08]
Vision
Adaptive scheduling of scientific workflows
• Exploit heterogeneous resources
• The standard approach for evaluation is simulation [Braun01, Blythe05]
Agenda
1) Simulating Heterogeneity in Computational Clouds
2) Evaluating Established Workflow Schedulers
3) Summary and Outlook
CloudSim
Datacenter
Host
VM
Task
• R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. De Rose, R. Buyya (2011), CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms, Software - Practice and Experience 41(1):23-50.
• More than 250 citations in Google Scholar
DynamicCloudSim
Datacenter
Heterogeneous Host
Dynamic VM
Error-prone Task
• Extend CloudSim with models for
1. Heterogeneous computational resources (Het)
2. Dynamic changes of performance at runtime (DCR)
3. Straggler VMs and failed task executions (SaF)
• More fine-grained representation of computational resources
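The straggler and failure model (SaF) can be illustrated with a minimal Python sketch. This is not DynamicCloudSim's code (DynamicCloudSim extends the Java toolkit CloudSim); the function names, the straggler coefficient of 0.5, and the retry-until-success policy are hypothetical choices made for illustration only.

```python
import random


def make_vm(base_perf, straggler_likelihood, straggler_coeff=0.5, rng=random):
    """Return the effective performance of a newly allocated VM.

    With probability `straggler_likelihood` the VM is a straggler and its
    performance is scaled by `straggler_coeff` (hypothetical value).
    """
    if rng.random() < straggler_likelihood:
        return base_perf * straggler_coeff
    return base_perf


def run_task(task_length, vm_perf, failure_likelihood, rng=random):
    """Simulate one task execution; return (runtime, succeeded)."""
    runtime = task_length / vm_perf
    failed = rng.random() < failure_likelihood
    return runtime, not failed


def run_until_success(task_length, vm_perf, failure_likelihood, rng=random):
    """Re-run a failing task until it succeeds; return the total time spent.

    Assumes failure_likelihood < 1, otherwise the loop never terminates.
    """
    total = 0.0
    while True:
        runtime, succeeded = run_task(task_length, vm_perf, failure_likelihood, rng)
        total += runtime
        if succeeded:
            return total
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run repeatable, which helps when comparing schedulers under the same sequence of stragglers and failures.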
Realism – can we ever get there?
• Simulation can never perfectly resemble reality
• We model inhomogeneity and dynamic changes by sampling from normal distributions
• Default mean and standard deviation (STD) / relative standard deviation (RSD) parameters are obtained from [Zaharia08, Dejun10, Jackson10, Schad10, Iosup11]
Simulating VM Performance: DynamicCloudSim (DCS) vs. CloudSim (CS)
1. Heterogeneous computational resources (Het)
2. Dynamic changes of performance at runtime (DCR)
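The two sampling steps can be sketched in a few lines of Python. This is an illustrative sketch, not DynamicCloudSim's implementation: the function names are hypothetical, and resampling non-positive draws is an assumption added here to keep performance values valid. The RSD is the standard deviation divided by the mean, so the standard deviation passed to the sampler is `rsd * mean`.

```python
import random


def sample_positive_normal(mean, rsd, rng):
    """Draw from a normal distribution with std = rsd * mean.

    Non-positive draws are resampled (an assumption of this sketch).
    """
    if rsd == 0:
        return mean
    while True:
        value = rng.gauss(mean, rsd * mean)
        if value > 0:
            return value


def provision_vm(base_perf, het_rsd, rng):
    """Het: a VM's baseline performance is drawn once at allocation time."""
    return sample_positive_normal(base_perf, het_rsd, rng)


def performance_trace(vm_perf, dcr_rsd, steps, rng):
    """DCR: at runtime, performance fluctuates around the VM's own baseline."""
    return [sample_positive_normal(vm_perf, dcr_rsd, rng) for _ in range(steps)]
```

With both RSD parameters set to 0, every VM performs exactly at `base_perf` for the whole run, which corresponds to the homogeneous, static behaviour that the Het and DCR models are meant to relax.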
Agenda
1) Simulating Heterogeneity in Computational Clouds
2) Evaluating Established Workflow Schedulers
a) Scheduling Scientific Workflows
b) Evaluation Workflows
c) Evaluation Results
Scheduling of Scientific Workflows
• Scheduling:
– Mapping tasks to the available physical resources
– Usual goal: minimize overall execution time
• Static Scheduling:
– Schedule is assembled prior to workflow execution
– Schedule is strictly adhered to at runtime
• Adaptive Scheduling:
– Monitor computational infrastructure
Static Schedulers
• Baseline: Round Robin
– Assign tasks to resources in turn
– Equal number of tasks per resource
• Elaborate: HEFT (Heterogeneous Earliest Finish Time) [Topcuoglu02]
– Implemented in the SWfMS Pegasus
– Requires runtime estimates for each task on each resource
– Assign the tasks with the longest time to finish to fixed time slots on suitable (well-performing) resources
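Both static strategies can be sketched in Python. The Round Robin function follows the description above directly. The HEFT function is a simplified variant, not the full algorithm from [Topcuoglu02]: it ranks tasks by upward rank and assigns each to the resource with the earliest finish time, but it ignores communication costs and omits the insertion-based slot search. All names and the input layout are hypothetical, and runtime estimates are assumed to be positive.

```python
from functools import lru_cache


def round_robin(tasks, resources):
    """Assign tasks to resources in turn."""
    return {task: resources[i % len(resources)] for i, task in enumerate(tasks)}


def heft(successors, runtime):
    """Simplified HEFT (no communication costs, no slot insertion).

    successors: dict task -> list of child tasks (must form a DAG)
    runtime:    dict task -> dict resource -> estimated runtime (> 0)
    Returns a dict task -> (resource, start, finish).
    """
    def average_runtime(task):
        return sum(runtime[task].values()) / len(runtime[task])

    @lru_cache(maxsize=None)
    def upward_rank(task):
        # Average runtime of the task plus the longest remaining path below it.
        children = successors.get(task, [])
        return average_runtime(task) + max(
            (upward_rank(child) for child in children), default=0.0)

    predecessors = {task: [] for task in runtime}
    for parent, children in successors.items():
        for child in children:
            predecessors[child].append(parent)

    resource_free = {}  # resource -> time at which it becomes idle
    schedule = {}
    # A parent always outranks its children, so this order is topological.
    for task in sorted(runtime, key=upward_rank, reverse=True):
        ready = max((schedule[p][2] for p in predecessors[task]), default=0.0)
        best = None
        for resource, cost in runtime[task].items():
            start = max(ready, resource_free.get(resource, 0.0))
            finish = start + cost
            if best is None or finish < best[2]:
                best = (resource, start, finish)
        schedule[task] = best
        resource_free[best[0]] = best[2]
    return schedule
```

The sketch makes the dependency on runtime estimates explicit: `heft` cannot run without the full `runtime` table, whereas `round_robin` needs no knowledge about tasks or resources.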
Adaptive Schedulers
• Baseline: Greedy Task Queue
– Assign tasks to resources at runtime in a first-come-first-served manner
– Adapts to changes of performance at runtime (DCR)
• Elaborate: LATE (Longest Approximate Time to End) [Zaharia08]
– Developed for Hadoop to increase robustness to instability
– 10% of the tasks progressing at a below-average rate are replicated and speculatively executed
– Exploit dynamic changes of performance
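The two adaptive strategies can be sketched as follows. The greedy queue hands the next waiting task to whichever VM becomes idle first. The LATE helper estimates each running task's time to end as (1 - progress) / progress rate, as in [Zaharia08], and proposes the slowest-finishing tasks among those with a below-average rate for replication. Function names, the input layout, and the reading of the 10% cap as a fraction of running tasks (with a minimum of one) are assumptions of this sketch.

```python
import heapq
from collections import deque


def greedy_task_queue(task_lengths, vm_speeds):
    """Simulate first-come-first-served assignment; return the makespan."""
    queue = deque(task_lengths)
    idle = [(0.0, vm) for vm in range(len(vm_speeds))]  # (free_at, vm index)
    heapq.heapify(idle)
    makespan = 0.0
    while queue:
        free_at, vm = heapq.heappop(idle)  # VM that becomes idle first
        finish = free_at + queue.popleft() / vm_speeds[vm]
        makespan = max(makespan, finish)
        heapq.heappush(idle, (finish, vm))
    return makespan


def late_candidates(running, now, cap_fraction=0.1):
    """Pick running tasks to replicate speculatively.

    running: dict task -> (progress in [0, 1], start_time)
    Returns below-average-rate tasks, longest estimated time to end first,
    limited to cap_fraction of the running tasks (at least one).
    """
    rates, time_left = {}, {}
    for task, (progress, start) in running.items():
        elapsed = now - start
        if elapsed <= 0 or progress <= 0:
            continue  # no usable progress information yet
        rates[task] = progress / elapsed
        time_left[task] = (1.0 - progress) / rates[task]
    if not rates:
        return []
    average_rate = sum(rates.values()) / len(rates)
    slow = [task for task in rates if rates[task] < average_rate]
    slow.sort(key=lambda task: time_left[task], reverse=True)
    cap = max(1, int(cap_fraction * len(running)))
    return slow[:cap]
```

Note that `late_candidates` depends on a progress value for every running task, which is exactly the kind of runtime information a real workflow system may not expose.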
Abstract Montage Workflow
Concrete Montage Workflow
• 43,318 tasks reading and writing 534 GB of data
• 10 GB of input files, which have to be uploaded to the cloud
Concrete Genomics Workflow
• Align 10% of the reads produced in a sequencing experiment against one of the smallest human chromosomes (chr22)
– Use about 0.2% of the available data
• 4,266 tasks reading and writing 436 GB of data (2.3 GB upload)
Figure labels: Indexing (bowtie, SHRiMP, PerM); Alignment (bowtie, SHRiMP, PerM); Convert (samtools view); Sort (samtools sort); Merge (merge); Preprocess (samtools mpileup); Variant calling (VarScan); "Sense-Making" (VCFTools); Upload to cloud
Runtime depending on Heterogeneity (Het)
Average runtime in minutes by RSD parameter for heterogeneous resources (Het), first panel:
RSD     Static Round Robin   HEFT   Greedy Queue   LATE
0       368                  304    296            311
0.125   371                  301    300            308
0.25    450                  296    303            315
0.375   715                  296    308            313
0.5     1314                 286    300            300
Average runtime in minutes by RSD parameter for heterogeneous resources (Het), second panel:
RSD     Static Round Robin   HEFT   Greedy Queue   LATE
0       203                  143    163            178
0.125   220                  148    163            179
0.25    275                  150    166            177
0.375   602                  152    187            182
0.5     747                  149    195            185
Runtime depending on Dynamic Changes (DCR)
Average runtime in minutes by RSD parameter for dynamic changes at runtime (DCR), first panel:
RSD     Static Round Robin   HEFT   Greedy Queue   LATE
0       368                  304    296            311
0.125   352                  301    296            317
0.25    394                  357    299            308
0.375   465                  439    311            299
0.5     574                  530    307            289
Average runtime in minutes by RSD parameter for dynamic changes at runtime (DCR), second panel:
RSD     Static Round Robin   HEFT   Greedy Queue   LATE
0       203                  143    163            178
0.125   216                  165    166            176
0.25    241                  190    165            179
0.375   295                  255    170            180
0.5     393                  314    207            177
Runtime with Stragglers and Failures (SaF)
Average runtime in minutes by likelihood of straggler VMs and failed tasks (SaF), first panel:
Likelihood   Static Round Robin   HEFT   Greedy Queue   LATE
0            368                  304    296            311
0.00625      598                  405    396            316
0.0125       876                  659    586            317
0.01875      1365                 962    790            316
0.025        2559                 1291   1137           321
Average runtime in minutes by likelihood of straggler VMs and failed tasks (SaF), second panel:
Likelihood   Static Round Robin   HEFT   Greedy Queue   LATE
0            203                  143    163            178
0.00625      352                  262    237            180
0.0125       617                  411    444            187
0.01875      1025                 604    635            188
0.025        1990                 984    1125           195
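To put the SaF series into perspective, the slowdown of each scheduler between the lowest and the highest likelihood can be computed from the data labels of the first SaF panel. The sketch assumes that the four values per likelihood follow the legend order (Static Round Robin, HEFT, Greedy Queue, LATE); the variable names are illustrative.

```python
# Average runtimes in minutes from the first SaF panel (legend order assumed).
schedulers = ["Static Round Robin", "HEFT", "Greedy Queue", "LATE"]
runtime_at_0 = [368, 304, 296, 311]         # SaF likelihood 0
runtime_at_0025 = [2559, 1291, 1137, 321]   # SaF likelihood 0.025

# Slowdown factor: runtime at the highest likelihood relative to the baseline.
slowdown = {
    name: round(high / low, 2)
    for name, low, high in zip(schedulers, runtime_at_0, runtime_at_0025)
}

for name, factor in slowdown.items():
    print(f"{name}: {factor}x")
```

Under that assumption, Static Round Robin slows down by a factor of about 6.95, HEFT by about 4.25, and the Greedy Queue by about 3.84, while LATE stays at about 1.03.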
That’s all well and good, but…
• Scheduling in SWfMS: Static or Greedy Task Queue
• HEFT and LATE have a computational overhead and require information not available in real scenarios:
– HEFT: runtime estimates of each task on each machine
– LATE: progress rate of each running task
• Untapped optimization potential: multiple resource scheduling
Summary and Outlook
• EC2: Heterogeneity and instability in VM performance
• DynamicCloudSim introduces several factors of instability into CloudSim
• Simulation experiments reproduce known strengths and shortcomings of established schedulers
Thanks for your attention!
Literature
• [Braun01] T. D. Braun, H. J. Siegel, N. Beck, L. L. Boloni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, R. F. Freund (2001), A Comparison Study of Eleven Static Heuristics for Mapping a Class of Independent Tasks onto Heterogeneous Distributed Computing Systems, Journal of Parallel and Distributed Computing 61:810–837.
• [Blythe05] J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi, A. Mandal, K. Kennedy (2005), Task Scheduling Strategies for Workflow-based Applications in Grids, in: Proceedings of the 5th IEEE International Symposium on Cluster Computing and the Grid, volume 2, Cardiff, UK, pp. 759–767.
Literature (cont.)
• [Jackson10] K. R. Jackson, et al. (2010), Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud, in: Proceedings of the 2nd International Conference on Cloud Computing Technology and Science, Indianapolis, USA, pp. 159-168.
• [Dejun10] J. Dejun, et al. (2009), EC2 Performance Analysis for Resource Provisioning of Service-Oriented Applications, in: Proceedings of the 7th International Conference on Service Oriented Computing, Stockholm, Sweden, pp. 197-207.
• [Zaharia08] M. Zaharia, et al. (2008), Improving MapReduce Performance in Heterogeneous Environments, in: Proceedings of the 8th USENIX Symposium on Operating Systems Design
Literature (cont.)
• [Schad10] J. Schad, J. Dittrich, J.-A. Quiané-Ruiz (2010), Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance, Proceedings of the VLDB Endowment 3(1):460–471.
• [Iosup11] A. Iosup, N. Yigitbasi, D. Epema (2011), On the Performance Variability of Production Cloud Services, in: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Newport Beach, California, USA, pp. 104–113.
Literature (cont.)
• [Topcuoglu02] H. Topcuoglu, S. Hariri, M.-Y. Wu (2002), Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing, IEEE Transactions on Parallel and Distributed Systems 13(3):260-274.
• [Berriman04] G. B. Berriman, et al. (2004), Montage: a grid-enabled engine for delivering custom science-grade mosaics on demand, in: Proceedings of the SPIE Conference on Astronomical Telescopes and Instrumentation, volume 5493, Glasgow, Scotland, pp. 221-232.