DYNAMIC CLOUD PROVISIONING FOR SCIENTIFIC GRID WORKFLOWS

(1)


Simon Ostermann, Radu Prodan and Thomas Fahringer

Institute of Computer Science, University of Innsbruck, Technikerstrasse 21a, Innsbruck, Austria

(2)

• Introduction

• Optimized Cloud Provisioning

  • Cloud Start

  • Instance Size

  • Grid Scheduling

  • Cloud Stop

• Evaluation using 3 scientific workflows

  • Wien2k

  • Invmod

  • Meteoag

• Conclusion

(3)

INTRODUCTION

• Infrastructure as a Service is a branch of Cloud computing

• On-demand resources, e.g. Amazon EC2, GoGrid, ...

• Other common Cloud computing areas, not covered here:

  • Platform as a Service

  • Software as a Service

(4)

CLOUD COMPUTING FOR SCIENTIFIC COMPUTING?

• Rent resources instead of buying own hardware

• Eliminates permanent operation, maintenance, and depreciation costs

• Scale an infrastructure up or down based on temporary, immediate needs

• Significantly reduced over-provisioning

• Virtualised resources enable scalable deployment and provisioning of application software

• Reliability through business SLA relationships that bind

(5)

CLOUD MODELS

• Cloud computing is mostly available on an hourly basis

• Some research papers assume finer granularity

• Interesting problems arise:

  • How much of this full hour do I use?

  • How can I maximize the usage / minimize the cost?
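The two questions above can be made concrete with a small Python sketch. It assumes that every started hour is charged in full; the function names and the price used in the usage note are hypothetical and do not come from the slides.

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Hours charged under hourly billing: every started hour is paid in full.

    Assumption: an instance that was started is charged for at least one hour.
    """
    return max(1, math.ceil(runtime_minutes / 60))


def cost(runtime_minutes: float, price_per_hour: float) -> float:
    """Total charge for one instance (price_per_hour is a hypothetical rate)."""
    return billed_hours(runtime_minutes) * price_per_hour


def utilization(runtime_minutes: float) -> float:
    """Fraction of the paid time that was actually used."""
    return runtime_minutes / (billed_hours(runtime_minutes) * 60)
```

Under this model a 61-minute run is billed as two hours, so `utilization(61)` is only about 0.51, while a 58-minute run uses almost all of its single paid hour.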

[Figure: timeline of Cloud instance states (Unallocated, Requested, Starting, Running, Accessible, Shutting down, Terminated, Unallocated) along a Time axis, marking the paid interval, the usable interval, and the next billing interval]

(6)

GRID COMPUTING

• The Grid has emerged as a worldwide shared distributed platform for solving large-scale scientific problems

• Grid computing with additional Cloud resources to speed up scientific computing

• Just-in-time scheduler from ASKALON, a workflow execution system for Grid and Cloud resources

• ASKALON is a workflow system developed by the DPS group at the University of Innsbruck

(7)

GROUDSIM

• Grid and Cloud Simulator

• Event-based for scalability reasons

• Experiments showed up to 90% better performance and better scalability than GridSim

• Java-based, to allow integration into existing software

• Simulation allows extensive analysis of Cloud usage without incurring costs

• Simulation results match real executions

(8)

GROUDSIM ARCHITECTURE

[Figure: GroudSim architecture. Components: User, Simulation Engine, Future Event List (FEL), Grid and Cloud Entities, Failure Generator, Background Loader, Distribution. Interactions: infrastructure + application simulation, callbacks, put events in list, get next event, submit jobs, transfer files, generate failure]
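The "put events in list / get next event" cycle of an event-based simulator can be illustrated with a minimal Python sketch. This is not GroudSim code (GroudSim is Java-based); the `Simulator` class and the event times below are hypothetical and only show the general technique of a future event list.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class Simulator:
    """Minimal discrete-event loop built around a future event list."""

    def __init__(self) -> None:
        self.now = 0.0
        self._events: List[Tuple[float, int, Callable[[], None]]] = []
        # The counter breaks ties so events with equal times keep insertion order.
        self._counter = itertools.count()

    def schedule(self, delay: float, action: Callable[[], None]) -> None:
        """Put an event into the list, 'delay' time units from now."""
        heapq.heappush(self._events, (self.now + delay, next(self._counter), action))

    def run(self) -> None:
        """Repeatedly get the next event, advance the clock, and fire its callback."""
        while self._events:
            time, _, action = heapq.heappop(self._events)
            self.now = time
            action()


log = []
sim = Simulator()


def submit_job() -> None:
    log.append((sim.now, "job submitted"))
    # The job completion is itself a future event (hypothetical runtime: 120).
    sim.schedule(120.0, lambda: log.append((sim.now, "job finished")))


sim.schedule(10.0, submit_job)
sim.schedule(5.0, lambda: log.append((sim.now, "file transferred")))
sim.run()
```

Because the clock jumps directly from one event to the next, the cost of a run depends on the number of events rather than on the simulated time span, which is the usual scalability argument for event-based designs.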

(9)

OPTIMIZED CLOUD PROVISIONING

• Analysis of regular executions and the resulting costs

• The analysis identified several parts needing optimization

• Choices have to be made about the start and stop of resources and the number of instances requested

• Four optimizations found, defined as algorithms (in the paper)

(10)

CLOUD START

• Parallel regions with more tasks than available cores

• Depending on Cloud and Grid speed, serialization and imbalance overheads are analyzed

• When the runtime of the parallel section can be reduced, Cloud resources are started

[Figure: two Gantt charts over a Time axis showing jobs (Job 1 to Job 6) mapped to Grid cores 1 to 3 and Cloud core 1. One chart illustrates the serialization overhead, the other the imbalance overhead]
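The Cloud start decision can be sketched in Python as a comparison of estimated runtimes of the parallel region with and without Cloud cores. This is a simplified illustration, not the algorithm from the paper: it assumes equal-sized tasks, a greedy earliest-finish mapping, and hypothetical task times.

```python
def makespan(num_tasks: int, grid_cores: int, grid_task_time: float,
             cloud_cores: int = 0, cloud_task_time: float = 0.0,
             cloud_startup: float = 0.0) -> float:
    """Estimated runtime of a parallel region with equal-sized tasks.

    Each core is [time when it becomes free, time one task takes on it].
    Every task goes to the core on which it would finish earliest.
    """
    cores = [[0.0, grid_task_time] for _ in range(grid_cores)]
    cores += [[cloud_startup, cloud_task_time] for _ in range(cloud_cores)]
    finish = 0.0
    for _ in range(num_tasks):
        best = min(cores, key=lambda c: c[0] + c[1])
        best[0] += best[1]
        finish = max(finish, best[0])
    return finish


def should_start_cloud(num_tasks: int, grid_cores: int, grid_task_time: float,
                       cloud_cores: int, cloud_task_time: float,
                       cloud_startup: float = 0.0) -> bool:
    """Start Cloud resources only if they shorten the parallel region."""
    grid_only = makespan(num_tasks, grid_cores, grid_task_time)
    with_cloud = makespan(num_tasks, grid_cores, grid_task_time,
                          cloud_cores, cloud_task_time, cloud_startup)
    return with_cloud < grid_only
```

With seven tasks of 120 time units on three Grid cores, the Grid alone needs three rounds (360), whereas one slower Cloud core (250 per task) absorbs the seventh task and brings the estimate down to 250. With six tasks the Grid cores are already balanced at 240, so a Cloud core needing 300 per task brings no benefit.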

(11)

INSTANCE SIZE

• Instances may offer different numbers of cores

• When only some of the Cloud cores are used, cost efficiency is lower

• Requesting too few cores may result in serialization or no benefit

• It is important to decide whether the number of instances to request is rounded up or down, resulting in two behaviors:

  • generous: better performance but more expensive
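The rounding choice can be sketched as follows. The function name and the `generous` flag are hypothetical, and the core counts in the usage note are illustrative assumptions.

```python
import math


def instances_to_request(cores_needed: int, cores_per_instance: int,
                         generous: bool) -> int:
    """Number of instances to request for a given core demand.

    generous=True rounds up (no task has to wait, but some cores may idle);
    generous=False rounds down (every paid core is used, but some tasks
    may be serialized or no instance is requested at all).
    """
    exact = cores_needed / cores_per_instance
    return math.ceil(exact) if generous else math.floor(exact)
```

For a demand of 10 cores and instances with 8 cores each, the generous setting requests two instances (16 cores, 6 of them idle), while rounding down requests one instance and leaves two tasks to be serialized.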

(12)

GRID SCHEDULING

• The Grid is a dynamic shared environment

• Grid resources may become available while the workflow execution uses Cloud resources

• Rescheduling jobs to the Grid might save cost or decrease execution time

• Decisions depend on the work already completed by a job mapped to a Cloud resource and on the speed difference between Grid and Cloud
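A minimal Python sketch of such a rescheduling decision is given below. It assumes that a moved job has to restart from scratch on the Grid (no checkpointing) and considers execution time only, not cost; the function and parameter names are hypothetical.

```python
def should_move_to_grid(total_work: float, completed_fraction: float,
                        cloud_speed: float, grid_speed: float) -> bool:
    """True if restarting the job on the Grid finishes earlier than
    letting it continue on the Cloud resource.

    total_work is in abstract work units, speeds in work units per time unit.
    """
    remaining_on_cloud = total_work * (1.0 - completed_fraction) / cloud_speed
    restart_on_grid = total_work / grid_speed
    return restart_on_grid < remaining_on_cloud
```

A job that is 10% done on a Cloud core half as fast as a newly free Grid core is worth moving; the same job at 80% completion is better left where it is.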

(13)

CLOUD STOP

• Unused resources are shut down to save money

• Shutting down after 5 minutes of a paid hour is as expensive as shutting down after 58 minutes

• Resources might be reused in the upcoming 53 minutes, and this reuse reduces the overall Cloud provisioning overheads

• The shutdown time falls within the paid period; therefore the point in time to stop has to be chosen knowing the shutdown time of the Cloud
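The timing rule can be sketched in Python as follows. It assumes a 60-minute billing period and a known, fixed shutdown duration; the 2-minute shutdown time in the usage note is a hypothetical value.

```python
def latest_shutdown_start(uptime_minutes: float, shutdown_minutes: float,
                          period: float = 60.0) -> float:
    """Uptime (in minutes) at which shutdown has to begin so that it
    completes before the current paid period ends."""
    period_end = (uptime_minutes // period + 1) * period
    return period_end - shutdown_minutes


def should_shut_down_now(uptime_minutes: float, shutdown_minutes: float,
                         idle: bool, period: float = 60.0) -> bool:
    """Keep an idle instance for possible reuse until the last moment at
    which it can still be released within the paid period."""
    if not idle:
        return False
    return uptime_minutes >= latest_shutdown_start(uptime_minutes,
                                                   shutdown_minutes, period)
```

With a 2-minute shutdown time, an instance that becomes idle after 5 minutes is kept until minute 58, because releasing it earlier saves nothing and removes the chance of reuse.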

(14)

EVALUATION

• Three different scientific workflows with different levels of parallelism

• Execution simulated using GroudSim

• Impact of the different optimizations on the three workflows when using three different types of Cloud resources and three clusters from the Austrian Grid

(15)

METRIC

• Comparison of executions on Grid resources only with executions using Grid and additional on-demand Cloud resources

• We define a new metric C_T, called cost per unit of saved time ($/T)

• It represents how expensive a unit of saved execution time is, under the assumption that Grid resources are freely available
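One plausible reading of this metric is the Cloud cost divided by the execution time saved relative to the Grid-only run. The sketch below follows that reading; the exact definition is given in the paper, and the function name and the handling of runs that save no time are assumptions.

```python
def cost_per_saved_time(cloud_cost: float, grid_time: float,
                        grid_cloud_time: float) -> float:
    """C_T: money spent on Cloud resources per unit of execution time saved.

    grid_time is the runtime on Grid resources only (assumed free of charge),
    grid_cloud_time the runtime with additional Cloud resources.
    Returns infinity when no time is saved (assumption for this sketch).
    """
    saved = grid_time - grid_cloud_time
    if saved <= 0:
        return float("inf")
    return cloud_cost / saved
```

For example, spending $10 on Cloud resources to shorten a 20-hour run to 15 hours gives C_T = 2 $/hour.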

(16)

WORKFLOWS

• From different fields of science, with different structures

• Parallelisation size x: a factor that determines the number of tasks in a workflow, evaluated for values from 1 to 900

• Computationally intensive; data transfers are a small part of each workflow

• Influence of Cloud network speed and storage kept low

(17)

GENERAL OBSERVATIONS

[Figure: Comparison of regular and optimized executions of workflows of different sizes. Cost [$] vs. parallelisation size [x] for Grid+m1.small, Grid+m1.large, and Grid+c1.xlarge, each with the Cloud stop optimization and without optimization]

(18)

WIEN2K

• Vienna University of Technology

• Theoretical chemistry (materials science)

• Electronic structure calculations for solids using density functional theory

• Number of activities: 2 * x + 3

(19)

WIEN2K

[Figure: Execution times and cost on the Grid and with additional Cloud resources. Time [hours] vs. parallelisation size [x] for Grid, Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge; Cost [$] vs. parallelisation size [x] for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]

[Figure: Cost per unit of saved time ($/T) for the three different types of Cloud resources, logarithmic scale (log C_T), vs. parallelisation size [x]]

(20)

INVMOD

• A hydrological application using the Levenberg-Marquardt algorithm to minimize the error between simulation and measurements

• Number of activities: 12 * x + 1

• x = parallelisation size

(21)

INVMOD

[Figure: Execution times and cost on the Grid and with additional Cloud resources. Time [hours] vs. parallelisation size [x] for Grid, Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge; Cost [$] vs. parallelisation size [x] for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]

(22)

METEOAG

• Meteorology and Geophysics Institute

• Meteorological simulations with the numerical model RAMS

• Resolve alpine watersheds and thunderstorms in the Arlberg region of Western Austria

• Number of activities: 69 * x + 2

• x = parallelisation size
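The activity counts of the three workflows grow linearly with the parallelisation size x. The short Python sketch below collects the three formulas from the slides; the dictionary and function names are hypothetical.

```python
# (slope, offset) of "number of activities = slope * x + offset" per workflow,
# taken from the workflow slides.
ACTIVITY_FORMULAS = {
    "Wien2k": (2, 3),
    "Invmod": (12, 1),
    "Meteoag": (69, 2),
}


def num_activities(workflow: str, x: int) -> int:
    """Number of activities of a workflow for parallelisation size x."""
    slope, offset = ACTIVITY_FORMULAS[workflow]
    return slope * x + offset
```

At the same parallelisation size, Meteoag therefore has far more activities than Wien2k: `num_activities("Meteoag", 100)` is 6902, compared with 203 for Wien2k.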

[Figure: Meteoag workflow. Activities: simulation_init, case_init, rams_makevfile, rams_init, rams_hist, revu_compare, raver, revu_dump, stageout. Stages: Initial Conditions, 6 h Simulation, Post Process, Verify and Select, 18 h Simulation. Decision: continue? (yes/no)]

(23)

METEOAG

[Figure: Execution times and cost on the Grid and with additional Cloud resources. Time [hours] vs. parallelisation size [x] for Grid, Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge; Cost [$] vs. parallelisation size [x] for Grid + m1.small, Grid + m1.large, and Grid + c1.xlarge]

(24)

CONCLUSION

• The granularity of Cloud payment plays an important role in Cloud allocation decisions

• Optimizations like those presented are needed to allow efficient usage of this dynamic resource class

• The longer Cloud resources are needed, the lower the impact
