DYNAMIC CLOUD PROVISIONING
FOR SCIENTIFIC GRID WORKFLOWS
Simon Ostermann, Radu Prodan and Thomas Fahringer
Institute of Computer Science, University of Innsbruck, Technikerstrasse 21a, Innsbruck, Austria
• Introduction
• Optimized Cloud Provisioning
  • Cloud Start
  • Instance Size
  • Grid Scheduling
  • Cloud Stop
• Evaluation using 3 scientific workflows
  • Wien2k
  • Invmod
  • Meteoag
• Conclusion
INTRODUCTION
• Infrastructure as a Service (IaaS), a branch of Cloud computing
  • On-demand resources, e.g. Amazon EC2, GoGrid, ...
• Other common Cloud computing areas, not covered:
  • Platform as a Service
  • Software as a Service
CLOUD COMPUTING FOR
SCIENTIFIC COMPUTING?
• Rent resources instead of buying own hardware
• Eliminates permanent operation, maintenance, and depreciation costs
• Scale an infrastructure up or down based on temporary immediate needs
• Significantly reduced over-provisioning
• Virtualised resources enable scalable deployment and provisioning of application software
• Reliability through business SLA relationships that bind
CLOUD MODELS
• Cloud computing is mostly available on an hourly basis
  • Some research papers assume finer granularity
• Interesting problems arise:
  • How much of this full hour do I use?
  • How can I maximize the usage / minimize the cost?
[Figure: life cycle of a Cloud instance over time. States: Unallocated, Requested, Starting, Running, Accessible, Shutting down, Terminated, Unallocated. Marked intervals: paid interval, usable interval, next billing interval.]
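The two questions above can be made concrete with a small calculation. This is only a sketch under the assumption that every started hour is billed in full; the function names and the price used in the usage note are hypothetical and not taken from the slides.

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Hours charged when every started hour is billed in full (assumption)."""
    if runtime_minutes <= 0:
        return 0
    return math.ceil(runtime_minutes / 60)


def utilization(runtime_minutes: float) -> float:
    """Fraction of the paid time that was actually used."""
    hours = billed_hours(runtime_minutes)
    return runtime_minutes / (hours * 60) if hours else 0.0


def cost(runtime_minutes: float, price_per_hour: float) -> float:
    """Total charge for one instance; price_per_hour is a hypothetical rate."""
    return billed_hours(runtime_minutes) * price_per_hour
```

With these definitions, an instance used for 61 minutes is charged for two hours, and an instance used for 30 minutes has a utilization of 0.5.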
GRID COMPUTING
• Grid has emerged as a worldwide shared distributed platform
for solving large-scale scientific problems
• Grid computing with additional Cloud resources to speed up
scientific computing
• Just-in-time scheduler from ASKALON, a workflow execution system for Grid and Cloud resources
• ASKALON is a workflow system developed by the DPS group at the University of Innsbruck
GROUDSIM
• Grid and Cloud simulator
• Event-based for scalability reasons
• Experiments showed up to 90% better performance and better scalability than GridSim
• Java-based, to allow integration into existing software
• Simulation allows wide analysis of Cloud usage without expenses
• Simulation results match real executions
GROUDSIM ARCHITECTURE
[Figure: GroudSim architecture. Components: Simulation Engine, Future Event List (FEL), Grid and Cloud Entities, Failure Generator, Background Loader, Distributions, User. Labels: infrastructure + application simulation; callbacks; put events in list; get next event; submit jobs; transfer files; generate failure.]
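The event-based idea (put events in a list, get the next event, invoke a callback) can be illustrated with a generic discrete-event loop. This is a minimal sketch, not GroudSim's API: GroudSim is Java-based, and the class, method names and time values below are hypothetical.

```python
import heapq
from itertools import count


class Simulator:
    """Generic discrete-event loop driven by a future event list."""

    def __init__(self):
        self.clock = 0.0
        self._events = []    # future event list, kept as a min-heap on time
        self._ids = count()  # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback):
        """Put an event in the list, 'delay' time units from now."""
        heapq.heappush(self._events, (self.clock + delay, next(self._ids), callback))

    def run(self):
        """Get the next event, advance the clock, invoke its callback."""
        while self._events:
            time, _, callback = heapq.heappop(self._events)
            self.clock = time
            callback(self)


log = []


def job_finished(sim):
    log.append(("job finished", sim.clock))


def job_submitted(sim):
    log.append(("job submitted", sim.clock))
    sim.schedule(120.0, job_finished)  # hypothetical job runtime


sim = Simulator()
sim.schedule(10.0, job_submitted)
sim.run()
```

Because the clock jumps from event to event instead of advancing in fixed steps, the cost of a run depends on the number of events rather than on the simulated time span.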
OPTIMIZED CLOUD
PROVISIONING
• Analysis of regular executions and the resulting costs
• The analysis identified multiple parts needing optimization
• Choices have to be made about the start and stop of resources and the number of instances requested
• Four optimizations were found and are defined as algorithms in the paper
CLOUD START
• Parallel regions with more tasks than available cores
• Depending on Cloud and Grid speed, serialization and imbalance overheads are analyzed
• Cloud resources are started when they can reduce the runtime of the parallel section
[Figure: two Gantt charts of jobs 1 to 6 over time. "Serialization": all jobs run on Grid cores 1 to 3. "Imbalance": jobs run on Grid cores 1 to 3 and Cloud core 1.]
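The start decision can be sketched as a comparison of two estimated runtimes for the parallel section, with and without Cloud cores. This is not the algorithm from the paper; it assumes identical tasks, a known per-task time for each core, and a greedy assignment, and all names are hypothetical.

```python
import heapq


def makespan(task_count, core_task_times):
    """Estimated runtime of a parallel section with identical tasks.

    core_task_times holds, per core, the time that core needs for one task.
    Each task is greedily given to the core that would finish it earliest.
    """
    if not core_task_times:
        raise ValueError("at least one core is required")
    # Heap entries: (finish time if the next task goes here, time per task).
    heap = [(t, t) for t in core_task_times]
    heapq.heapify(heap)
    end = 0.0
    for _ in range(task_count):
        finish, per_task = heapq.heappop(heap)
        end = max(end, finish)
        heapq.heappush(heap, (finish + per_task, per_task))
    return end


def should_start_cloud(task_count, grid_times, cloud_times):
    """Start Cloud cores only if they shorten the parallel section."""
    with_cloud = makespan(task_count, list(grid_times) + list(cloud_times))
    return with_cloud < makespan(task_count, grid_times)
```

For example, four tasks on three Grid cores that need 120 time units per task are serialized into two rounds. A Cloud core that needs 150 per task removes the second round, whereas one that needs 300 per task brings no benefit.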
INSTANCE SIZE
• Instances may offer different numbers of cores
• When only part of the Cloud cores are used, the cost efficiency is lower
• Getting too few cores may result in serialization and no benefit
• It is important to decide whether the number of instances to request is rounded up or down, resulting in 2 behaviors:
  • generous: better performance but more expensive
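The rounding decision can be illustrated as follows. This is a sketch only: the slide names the rounded-up behavior "generous", while the flag, the function names and the example numbers are hypothetical.

```python
import math


def instances_to_request(cores_needed, cores_per_instance, generous):
    """Round the instance count up (generous) or down (the other behavior)."""
    if cores_per_instance <= 0:
        raise ValueError("cores_per_instance must be positive")
    exact = cores_needed / cores_per_instance
    return math.ceil(exact) if generous else math.floor(exact)


def core_usage(cores_needed, cores_per_instance, instances):
    """Fraction of the rented cores that would actually be used."""
    rented = instances * cores_per_instance
    return min(cores_needed, rented) / rented if rented else 0.0
```

With 10 cores needed and 8 cores per instance, rounding up requests 2 instances and uses 62.5% of the rented cores, while rounding down requests 1 instance, uses all of its cores, and leaves 2 tasks to be serialized.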
GRID SCHEDULING
• The Grid is a dynamic, shared environment
• Grid resources may become available while the workflow execution uses Cloud resources
• Rescheduling jobs to the Grid might save cost or decrease execution time
• Decisions depend on the work already completed by a job mapped to a Cloud resource and on the speed difference between Grid and Cloud
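One way to picture this decision is to compare the time the job still needs on the Cloud with the time a fresh run on the newly available Grid core would take. The sketch below assumes that a moved job restarts from scratch and that progress is known as a fraction; both are assumptions, and the function is hypothetical rather than the algorithm from the paper.

```python
def should_move_to_grid(progress, cloud_task_time, grid_task_time):
    """Decide whether a job running on the Cloud should move to the Grid.

    progress: fraction of the job already completed on the Cloud (0.0 to 1.0).
    cloud_task_time / grid_task_time: full runtime of the job on each resource.
    Assumption: a moved job restarts from the beginning on the Grid.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    remaining_on_cloud = (1.0 - progress) * cloud_task_time
    return grid_task_time < remaining_on_cloud
```

A job that is 10% done on a slow Cloud core is worth moving to a faster Grid core, while a job that is 80% done is usually better left where it is.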
CLOUD STOP
• Unused resources are shut down to save money
• Shutting down 5 minutes into a paid hour is as expensive as shutting down after 58 minutes
• Resources might be reused in the remaining 53 minutes, and this reuse reduces the overall Cloud provisioning overhead
• Shutting down takes time that falls within the paid period; therefore the point in time to stop has to be chosen knowing the shutdown time of the Cloud
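The timing rule can be sketched as computing the latest moment at which a shutdown may begin and still complete inside the period that is already paid for. The 2-minute shutdown time used in the usage note is an assumption chosen to match the 58-minute example; the function names are hypothetical.

```python
import math


def latest_shutdown_start(uptime_minutes, shutdown_minutes, period_minutes=60.0):
    """Latest uptime at which shutdown can start without paying a new period."""
    if shutdown_minutes >= period_minutes:
        raise ValueError("shutdown must be shorter than one billing period")
    periods = max(1, math.ceil(uptime_minutes / period_minutes))
    deadline = periods * period_minutes - shutdown_minutes
    if deadline < uptime_minutes:
        # Too late for the current period: the next one will be paid anyway.
        deadline += period_minutes
    return deadline


def should_stop_now(uptime_minutes, shutdown_minutes, idle, period_minutes=60.0):
    """Stop an idle instance only once the shutdown deadline is reached."""
    deadline = latest_shutdown_start(uptime_minutes, shutdown_minutes, period_minutes)
    return idle and uptime_minutes >= deadline
```

With a 2-minute shutdown, an instance that becomes idle after 5 minutes is kept until minute 58, so it stays available for reuse for 53 minutes at no extra cost.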
EVALUATION
• Three different scientific workflows with different levels of parallelism
• Execution simulated using GroudSim
• Impact of the different optimizations on the three workflows when using 3 different types of Cloud resources and 3 clusters from the Austrian Grid
METRIC
• Comparison of executions on Grid resources only with executions using Grid and additional on-demand Cloud resources
• We define a new metric C_T, the cost per unit of saved time ($/T)
• It represents how expensive one unit of saved execution time is, assuming that Grid resources are freely available
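Read literally, the metric divides the Cloud cost by the execution time saved relative to the Grid-only run. The sketch below follows that reading; the exact definition is in the paper, and the function name, the handling of runs that save no time, and the example numbers are assumptions.

```python
import math


def cost_per_saved_time(cloud_cost, grid_time, grid_cloud_time):
    """C_T: Cloud cost divided by the time saved versus a Grid-only run.

    Grid resources are treated as free, so cloud_cost is the whole cost.
    Returns infinity when the Cloud resources save no time (assumption).
    """
    saved = grid_time - grid_cloud_time
    if saved <= 0:
        return math.inf
    return cloud_cost / saved
```

For instance, paying 10 dollars to shorten a 20-hour run to 15 hours gives a C_T of 2 dollars per saved hour.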
WORKFLOWS
• From different fields of science with different structures
• The parallelisation size x is a factor that determines the number of tasks in a workflow; it is evaluated for values from 1 to 900
• Computationally intensive; data transfers are a small part of each workflow
• Influence of Cloud network speed and storage is kept low
GENERAL OBSERVATIONS
[Figure: Cost [$] versus parallelisation size [x] for Grid+m1.small, Grid+m1.large and Grid+c1.xlarge, each with the Cloud stop optimization and with no optimization.]
Comparison of regular and optimized executions of workflows of different sizes
WIEN2K
• Vienna University of Technology
• Theoretical chemistry (materials science)
• Electronic structure calculations for solids using density functional theory
• Number of activities
  • 2 * x + 3
WIEN2K
[Figure: Time [hours] versus parallelisation size [x] for Grid, Grid + m1.small, Grid + m1.large and Grid + c1.xlarge; Cost [$] versus parallelisation size [x] for the three Grid + Cloud configurations.]
Execution times and cost on the Grid and with additional Cloud resources
[Figure: Cost / Saved time [min/$], logarithmic scale [log C_T], versus parallelisation size [x] for Grid + m1.small, Grid + m1.large and Grid + c1.xlarge.]
Cost per unit of saved time ($/T) for the three different Cloud resource types, on a logarithmic scale
INVMOD
• A hydrological application using the Levenberg-Marquardt algorithm to minimize the error between simulation and measurements
• Number of activities
  • 12 * x + 1
  • x = parallelisation size
INVMOD
[Figure: Time [hours] versus parallelisation size [x] for Grid, Grid + m1.small, Grid + m1.large and Grid + c1.xlarge; Cost [$] versus parallelisation size [x] for the three Grid + Cloud configurations.]
Execution times and cost on the Grid and with additional Cloud resources
[Figure: Cost / Saved time [min/$], logarithmic scale [log C_T], versus parallelisation size [x] for Grid + m1.small, Grid + m1.large and Grid + c1.xlarge.]
Cost per unit of saved time ($/T) for the three different Cloud resource types, on a logarithmic scale
METEOAG
• Meteorology and Geophysics Institute
• Meteorological simulations with the numerical model RAMS
• Resolve alpine watersheds and thunderstorms in the Arlberg region of western Austria
• Number of activities
  • 69 * x + 2
  • x = parallelisation size
[Figure: Meteoag workflow. Activities: simulation_init, case_init, rams_makevfile, rams_init, rams_hist, revu_compare, raver, revu_dump, stageout. Stages: Initial Conditions, 6 h Simulation, Post Process, Verify and Select, a "continue?" decision (yes/no), 18 h Simulation.]
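The activity counts stated for the three workflows can be compared for a given parallelisation size x with a few lines of code. The formulas are the ones on the workflow slides; the dictionary and function names are hypothetical.

```python
# Number of activities as a function of the parallelisation size x,
# as stated on the workflow slides.
ACTIVITY_COUNT = {
    "Wien2k": lambda x: 2 * x + 3,
    "Invmod": lambda x: 12 * x + 1,
    "Meteoag": lambda x: 69 * x + 2,
}


def activity_counts(x):
    """Activities per workflow for one parallelisation size."""
    return {name: formula(x) for name, formula in ACTIVITY_COUNT.items()}
```

At the same x, Meteoag therefore produces far more activities than Wien2k, which is worth keeping in mind when comparing the cost plots.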
METEOAG
[Figure: Time [hours] versus parallelisation size [x] for Grid, Grid + m1.small, Grid + m1.large and Grid + c1.xlarge; Cost [$] versus parallelisation size [x] for the three Grid + Cloud configurations.]
Execution times and cost on the Grid and with additional Cloud resources
[Figure: Cost / Saved time [min/$], logarithmic scale [log C_T], versus parallelisation size [x] for Grid + m1.small, Grid + m1.large and Grid + c1.xlarge.]
Cost per unit of saved time ($/T) for the three different Cloud resource types, on a logarithmic scale
CONCLUSION
• The granularity of Cloud payment plays an important role in Cloud allocation decisions
• Optimizations like those presented are needed to use this dynamic resource class efficiently
• The longer Cloud resources are needed, the lower the impact