Performance, Availability and Power Analysis for IaaS Cloud

(1)


Kishor Trivedi


[email protected] www.ee.duke.edu/~kst

Dept. of ECE, Duke University, Durham, NC 27708

Universita Napoli

(2)

Duke University

Research Triangle Park (RTP)

UNC-CH

Duke

NC state


(3)

Trivedi’s Research Triangle

Theory

Stochastic modeling methods & numerical solution methods:

Large fault trees, stochastic Petri nets, large/stiff Markov & non-Markov models, fluid stochastic models

Performability & Markov reward models

Software aging and rejuvenation

Attack countermeasure trees

Software Packages

HARP (NASA), SAVE (IBM), IRAP (Boeing)

SHARPE, SPNP, SREPT

Applications

Reliability/availability/performance

Avionics, space, power systems, transportation systems, automobile systems

Computer systems (hardware/software)

Telco systems

Computer networks

Virtualized data center

Cloud computing

Books: Blue, Red, White

(4)

Talk outline

Overview of Reliability and Availability Quantification

Overview of Cloud Computing

Performance Quantification for IaaS Cloud (PRDC 2010)

Availability Quantification for IaaS Cloud (DSN 2011)

Power Quantification for IaaS Cloud (DSN workshop 2011)

Future Research

(5)

An Overview of Reliability and Availability Quantification Methods

Software + hardware in operation

Dynamic as opposed to static behavior

(6)

Reliability and Availability Quantification

Measurement-Based

More accurate

Expensive due to many parameters and configurations

Not always possible during system design.

Model-Based

Combined approach where measurements are made at the subsystem level and models are built to derive system-level measures

(7)

Reliability and Availability Evaluation Methods

Quantitative Evaluation

- Measurement-based

- Model-based

  - Discrete-event simulation

  - Analytic models: closed-form solution, or numerical solution via a tool

  - Hybrid

Numerical solution of analytic models is not as well utilized as it could be; simulation is used unnecessarily often.

(8)

Analytic Modeling Taxonomy

Quantitative Dependability Evaluation

- Measurement-based

- Model-based

  - Discrete-event simulation

  - Analytic models

  - Hybrid

Analytic models

- Non-state-space models

- State-space models

- Hierarchical composition

- Fixed point iterative models

(9)

Non-state space models

Reliability block diagrams (RBDs), reliability graphs (relgraphs) and fault trees (FTs) are easy to use and efficient to solve for system reliability, system availability and system mean time to failure (MTTF)

Product-form queuing networks for performance analysis
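To make the "easy to use and efficient to solve" point concrete, here is a minimal sketch of how a small RBD reduces to products of component reliabilities when components fail independently. The function names and the numeric values are illustrative assumptions, not taken from the talk or from any tool.

```python
from functools import reduce


def series(*rel):
    """Reliability of independent components in series: all must work."""
    return reduce(lambda acc, r: acc * r, rel, 1.0)


def parallel(*rel):
    """Reliability of independent components in parallel: at least one works."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r), rel, 1.0)


def mttf_series_exponential(*rates):
    """MTTF of a series system whose components have exponential lifetimes."""
    return 1.0 / sum(rates)


# Hypothetical example: two redundant servers (0.95 each) in front of
# a single database (0.99).
system = series(parallel(0.95, 0.95), 0.99)
print(f"system reliability = {system:.6f}")
```

The whole evaluation is a handful of multiplications, which is why non-state-space models scale well as long as the independence assumption is acceptable.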

(10)

Example: Reliability Analysis of Boeing 787

Current Return Network Modeled as a Reliability Graph (Relgraph)

(11)

Reliability Analysis of Boeing 787

(cont’d)

This real problem has too many minpaths

Non-state space models also face largeness problem

Number of paths from source to target
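The growth in the number of source-to-target paths can be illustrated with a small sketch. The layered graph below is a hypothetical stand-in, not the Boeing 787 network, and the helper names are assumptions; the point is only that the path count doubles with every added stage.

```python
from functools import lru_cache


def count_paths(edges, source, target):
    """Count distinct directed paths from source to target in an acyclic graph."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == target:
            return 1
        return sum(paths_from(nxt) for nxt in succ.get(node, []))

    return paths_from(source)


def ladder(stages):
    """Hypothetical layered relgraph: 2 nodes per stage, fully connected forward."""
    edges = [("s", (0, 0)), ("s", (0, 1))]
    for k in range(stages - 1):
        edges += [((k, a), (k + 1, b)) for a in (0, 1) for b in (0, 1)]
    edges += [((stages - 1, 0), "t"), ((stages - 1, 1), "t")]
    return edges


for n in (3, 10, 20):
    print(n, count_paths(ladder(n), "s", "t"))
```

Counting paths is cheap here because of memoization, but any method that has to enumerate the paths individually faces the exponential count, which is the largeness problem the slide refers to.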

(12)

Reliability Analysis of Boeing 787

(cont’d)

Our approach: a new efficient algorithm for (un)reliability bounds computation was developed and incorporated in SHARPE

SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator)

(13)

Non-State-Space Models

Failure/repair dependencies are often present; RBDs, relgraphs and FTREEs cannot easily handle these (e.g., shared repair, warm/cold spares, imperfect coverage, non-zero switching time, travel time of repair person, reliability with repair).

Product-form often does not hold when modeling real-life aspects such as simultaneous resource possession, priorities, retries, etc.

(14)

State-space models : Markov chains

To model complex interactions between components, use models such as Markov chains or, more generally, state-space models.

Many examples of dependencies among system components have been observed in practice and captured by continuous-time Markov chains (CTMCs).

Extension to Markov reward models makes computation of measures of interest relatively easy.
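As a small illustration of a dependency that a CTMC captures and an RBD does not, the sketch below compares two redundant components repaired by a single shared repair person against the independent-repair result. The rates are hypothetical and the helper name is an assumption; the chain is a birth-death process, so its steady state has a simple product form.

```python
def birth_death_steady_state(up_rates, down_rates):
    """Steady-state probabilities of a birth-death CTMC.

    up_rates[i] is the rate from state i to i+1,
    down_rates[i] is the rate from state i+1 to i.
    """
    weights = [1.0]
    for up, down in zip(up_rates, down_rates):
        weights.append(weights[-1] * up / down)
    total = sum(weights)
    return [w / total for w in weights]


lam, mu = 0.001, 0.1  # hypothetical per-hour failure and repair rates

# State = number of failed components; one repair person serves both.
shared = birth_death_steady_state([2 * lam, lam], [mu, mu])
unavail_shared = shared[2]

# Independent repair, which is what a parallel RBD implicitly assumes.
unavail_independent = (lam / (lam + mu)) ** 2

print(f"shared repair:      {unavail_shared:.3e}")
print(f"independent repair: {unavail_independent:.3e}")
```

With these numbers the shared-repair unavailability is roughly twice the independent-repair value, so ignoring the dependency would be optimistic.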

(15)

Markov Availability Model of WebSphere AP Server

Application server and proxy server (with escalated levels of recovery)

Delay and imperfect coverage in each step of recovery modeled

Failure detection: by WLM, by Node Agent, manual detection

Recovery: Node Agent auto process restart, manual recovery (process restart, node reboot, repair)

(16)

Analytic Modeling Taxonomy

Non-state-space models

Analytic models

(17)

Should I Use Markov Models?

+ Model Fault-Tolerance and Recovery/Repair

+ Model Dependencies

+ Model Contention for Resources and concurrency

+ Generalize to Markov Reward Models for Degradable systems

+ Can relax exponential assumption

+ Performance, Availability and Performability Modeling Possible

- Large State Space

(18)

State Space Explosion

State space explosion can be avoided by using hierarchical model composition.

Use state-space models for those parts of a system that require them, and use non-state-space models for the more “well-behaved” parts of the system.

(19)

Analytic Modeling Taxonomy

Analytic models

Non-state-space models: efficiency, simplicity

State-space models: dependency capture

(20)

Example: Architecture of SIP on IBM WebSphere

AS: WebSphere Appl. Server (WAS)

Replication domain | Nodes
1 | A, D
2 | A, E
3 | B, F
4 | B, D
5 | C, E
6 | C, F

(21)

Hierarchical composition

[Figure: hierarchical fault tree with top event System Failure. One branch is the proxy system (PX1, PX2, with leaves P1, P2, BSG, BSH, CM1, CM2); the other is the App servers branch (k of 12 over AS1 to AS12), whose leaves carry node labels 1A to 6F, BSA to BSF, CM1 and CM2.]

This model was responsible for the actual sale of the system by IBM to their Telco customer
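The "k of 12" gate over the application servers can be sketched as a binomial sum. This is a simplification under the assumption that the servers are identical and independent; in the actual model they share hardware, which is exactly why the lower levels are composed hierarchically. The availability value is hypothetical.

```python
from math import comb


def k_of_n_availability(k, n, a):
    """Probability that at least k of n independent, identical components are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


# Hypothetical per-server availability of 0.99, with 12 servers.
for k in (12, 11, 10):
    print(k, f"{k_of_n_availability(k, 12, 0.99):.8f}")
```

In a hierarchical composition, the value `a` would itself be the output of a lower-level (for example Markov) sub-model rather than a constant.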

(22)

Fixed-Point Iteration

Input parameters of sub-models can be functions of outputs of other models

If the import graph is not acyclic then we solve using fixed-point iteration

Analytic models

Non-state-space models: efficiency, simplicity

State-space models: dependency capture

Hierarchical composition: to avoid largeness

Fixed-point iteration: to deal with interdependent submodels

(23)

An Overview of Cloud Computing

(24)

NIST definition of cloud computing

Definition by National Institute of Standards and Technology (NIST):

“Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with

(25)

Key characteristics

On-demand self-service:

Provisioning of computing capabilities without human intervention

Resource pooling:

Shared physical and virtualized environment

Rapid elasticity:

Through standardization and automation, quick scaling

Metered Service:

Pay-as-you-go model of computing

Source: P. Mell and T. Grance, “The NIST Definition of Cloud Computing”, October 7, 2009

(26)

Evolution of cloud computing

Time line of evolution:

Cluster computing (early 80s)

Grid computing (early 90s)

Utility computing (around 2000)

Cloud computing (around 2005-06)

*Source: http://seekingalpha.com/article/167764-tipping-point-gartner-annoints-cloud-computing-top-strategic-technology

(27)

Cloud Service models

Infrastructure-as-a-Service (IaaS) Cloud:

Examples: Amazon EC2, IBM Smart Business Development and Test Cloud

Platform-as-a-Service (PaaS) Cloud:

Examples: Microsoft Windows Azure, Google AppEngine

Software-as-a-Service (SaaS) Cloud:

Examples: Gmail, Google Docs

(28)

Deployment models

Private Cloud:

- Cloud infrastructure solely for an organization

- Managed by the organization or third party

- May exist on premise or off-premise

Public Cloud:

- Cloud infrastructure available for use for general users

- Owned by an organization providing cloud services

Hybrid Cloud:

- Composition of two or more clouds (private or public)

(29)

Key Challenges

Three critical metrics for a cloud:

- Service (un)availability

- Performance (response time) unpredictability

- Power consumption

A large number of parameters can affect performance, availability and power:

- Workload parameters

- Failure/recovery characteristics

- Types of physical infrastructure

- Characteristics of virtualization infrastructures

- Large scale; thousands of servers

Performance, availability & power quantification are difficult!

(30)

Our goals in the IBM Cloud project

Develop a comprehensive analytic modeling approach

High fidelity

Scalable and tractable

Apply these models to cloud capacity planning

(31)

Our approach and motivations behind it

Difficulty with measurement-based approach:

expensive experimentation for each workload and system configuration

A monolithic analytic model will suffer from largeness and hence is not scalable

Our approach:

overall system model consists of a set of sub-models

sub-model solutions composed via an interacting Markov chain approach

scalable and tractable

(32)

Duke/IBM project on cloud computing

Joint work with

Rahul Ghosh and Dong Seong Kim (Duke), Francesco Longo (Univ. of Messina)

Vijay Naik, Murthy Devarakonda and Daniel Dias

(33)

Performance Quantification for IaaS Cloud

[paper in Proc. IEEE PRDC 2010]


(34)

System model

Current Assumptions [will be relaxed soon]

Homogeneous requests

All physical machines (PMs) are identical.

To minimize power consumption, PMs are divided into three pools:

Hot pool: fast provisioning but high power usage

Warm pool: slower provisioning but lower power usage

(35)

Life-cycle of a job inside an IaaS cloud

[Figure: job flow with stages Arrival, Queuing, Provisioning Decision (Resource Provisioning Decision Engine), Instantiation (VM deployment), Run-time Execution (Actual Service), Out; the provisioning response delay is marked on the flow.]

Provisioning and servicing steps:

(i) resource provisioning decision,

(ii) VM provisioning and

(iii) run-time execution

Job rejection due to buffer full

Job rejection due to insufficient capacity

(36)

Resource provisioning decision engine (RPDE)


(37)

Flow-chart:

Resource provisioning decision engine (RPDE)

(38)

CTMC model for RPDE

Input Parameters:

$\lambda$ – job arrival rate

$1/\delta_h, 1/\delta_w, 1/\delta_c$ – mean search delays for resource provisioning decision engine: from searching algorithms or measurements

$P_h, P_w, P_c$ – probability of being able to provision: computed from VM provisioning model

$N$ – maximum # jobs in RPDE: from system/server specification

Output Measures:

Job rejection probability due to buffer full ($P_{block}$)

Job rejection probability due to insufficient capacity ($P_{drop}$)

Mean decision delay for an accepted job ($E[T_{decision}]$)

Mean queuing delay for an accepted job ($E[T_{q\_dec}]$)
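The role of the three provisioning probabilities can be illustrated with a back-of-the-envelope sketch for a single job that has reached the head of the queue. It assumes the engine searches the hot pool first, then warm, then cold, which is an assumption made here for illustration; it ignores queuing and is not the CTMC from the talk. All names and numbers are hypothetical.

```python
def provisioning_outcomes(p_h, p_w, p_c):
    """Outcome probabilities for one job under a sequential hot/warm/cold search."""
    return {
        "hot": p_h,
        "warm": (1 - p_h) * p_w,
        "cold": (1 - p_h) * (1 - p_w) * p_c,
        "dropped": (1 - p_h) * (1 - p_w) * (1 - p_c),
    }


def mean_search_time(p_h, p_w, p_c, d_h, d_w, d_c):
    """Mean time spent searching per job, accepted or not.

    d_h, d_w, d_c are the mean search delays of the three pools; a pool is
    searched only if every earlier pool could not provision the job.
    """
    return d_h + (1 - p_h) * d_w + (1 - p_h) * (1 - p_w) * d_c


outcome = provisioning_outcomes(0.9, 0.5, 0.5)
print(outcome)
print(mean_search_time(0.9, 0.5, 0.5, d_h=1.0, d_w=2.0, d_c=4.0))
```

The "dropped" entry corresponds to a job that no pool can accept; the full model additionally accounts for jobs waiting in, or blocked by, the finite buffer.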

(40)

VM provisioning


(41)

VM provisioning model

[Figure: accepted jobs from the Resource Provisioning Decision Engine are placed on idle resources on a hot, warm or cold machine (hot PM pool, warm pool); running VMs leave through service out.]

[Figure: CTMC of a hot PM, with service rates $\mu, 2\mu, \ldots, (m-1)\mu, m\mu$.]

State $(i, j, k)$: $i$ = number of jobs in the queue, $j$ = number of VMs being provisioned,

$L_h$ is the buffer size and $m$ is the max. # VMs that can run

(43)

VM provisioning model (for each hot PM)

Input Parameters:

$\lambda_h = \lambda (1 - P_{block}) / n_h$

$1/\beta_h$: can be measured experimentally

$1/\mu$: obtained from the lower level run-time model

$P_{block}$: obtained from the resource provisioning decision model

Hot pool model is the set of independent hot PM models

Output Measure:

$P_h$ = prob. that a job is accepted in the hot pool

$$P_h = 1 - \left( \sum_{i=0}^{m-1} \phi^{(h)}_{(L_h,1,i)} + \phi^{(h)}_{(L_h,0,m)} \right)^{n_h}$$

where the term in parentheses is the steady state probability that a PM cannot accept a job for provisioning, obtained from the solution of the Markov model of a hot PM on the previous slide
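A minimal sketch of how the per-PM result is lifted to the pool, assuming (as the slide states) that the hot PMs are independent and identical. The function names and the numeric inputs are hypothetical; `p_pm_full` stands for the steady-state probability that one PM cannot accept a job, which would come from solving the hot PM CTMC.

```python
def per_pm_arrival_rate(lam, p_block, n_h):
    """Arrival rate seen by each of n_h hot PMs after RPDE blocking."""
    return lam * (1.0 - p_block) / n_h


def hot_pool_accept_probability(p_pm_full, n_h):
    """Pool accepts a job unless all n_h independent PMs are unable to."""
    return 1.0 - p_pm_full**n_h


lam_h = per_pm_arrival_rate(lam=100.0, p_block=0.1, n_h=10)
p_h = hot_pool_accept_probability(p_pm_full=0.5, n_h=3)
print(lam_h, p_h)
```

Because only one small per-PM chain has to be solved, the cost of the pool model does not grow with the number of PMs in the pool.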

(44)

VM provisioning model for each warm PM

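[Figure: CTMC of a warm PM. Visible state labels include (0,0,0), (0,1*,0), ($L_w$,1*,0) and (0,1,1); visible transition rates include $\lambda_w$ and $\beta_w$.]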

(45)

VM provisioning model for each cold PM

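[Figure: CTMC of a cold PM. Visible state labels include (0,0,0), (0,1*,0), ($L_c$,1*,0), (0,1**,0), ($L_c$,1**,0), (0,1,0), ($L_c$,1,0), (0,0,1), ($L_c$-1,1,1) and ($L_c$,1,1); visible transition rates include $\lambda_c$, $\gamma_c$, $\beta_c$, $\beta_h$ and $\mu$.]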

Fixed-point iteration

To solve the hot, warm and cold PM models, we need $P_{block}$ from the resource provisioning decision model

To solve the provisioning decision model, we need $P_h, P_w, P_c$ from the hot, warm and cold pool models respectively

This leads to a cyclic dependency among the resource provisioning decision model and the VM provisioning models (hot, warm, cold)

We resolve this dependency via fixed-point iteration

Observe that our fixed-point variable is $P_{block}$ and the corresponding fixed-point equation is of the form:

$$P_{block} = f(P_{block})$$
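The iteration itself can be sketched as successive substitution. The two "models" below are toy stand-ins with made-up formulas and no physical meaning; they only reproduce the shape of the dependency (pool models need the blocking probability, the decision model needs the pool probabilities). All names are hypothetical.

```python
def fixed_point(f, x0, tol=1e-10, max_iter=1000):
    """Successive substitution x <- f(x) until the change is below tol."""
    x = x0
    for iteration in range(1, max_iter + 1):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, iteration
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")


def toy_pool_models(p_block):
    """Toy stand-in for the hot/warm/cold models: returns (p_h, p_w, p_c)."""
    load = 1.0 - p_block
    return 1.0 - 0.5 * load, 1.0 - 0.3 * load, 1.0 - 0.1 * load


def toy_decision_model(p_h, p_w, p_c):
    """Toy stand-in for the decision model: returns a new p_block."""
    return 0.2 + 0.5 * (1 - p_h) * (1 - p_w) * (1 - p_c)


def f(p_block):
    """One pass around the cycle: decision model applied to pool outputs."""
    return toy_decision_model(*toy_pool_models(p_block))


p_block, iterations = fixed_point(f, x0=0.0)
print(f"p_block = {p_block:.8f} after {iterations} iterations")
```

Convergence is guaranteed here because the toy `f` is a contraction on [0, 1]; for real sub-models, existence and convergence of the fixed point have to be argued or checked separately.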

(49)

1 PM per pool and 1 VM per PM

Performance measures comparison with monolithic model

Jobs/hr | Mean RPDE queue length (ISP) | Mean RPDE queue length (monolithic) | Rejection probability (ISP) | Rejection probability (monolithic)
1 | 9.0332e-07 | 9.2321e-07 | 9.8899e-06 | 1.1221e-03
5 | 4.1622e-05 | 4.3364e-05 | 4.2334e-02 | 8.0500e-02
10 | 2.3731e-04 | 2.4225e-04 | 2.3496e-01 | 2.6587e-01
15 | 6.3539e-04 | 6.4377e-04 | 3.9860e-01 | 4.1493e-01
20 | 1.2526e-03 | 1.2655e-03 | 5.1069e-01 | 5.1969e-01
25 | 2.0990e-03 | 2.1179e-03 | 5.8915e-01 | 5.9449e-01
30 | 3.1826e-03 | 3.2091e-01 | 6.4648e-01 | 6.4985e-01

The error is between e-03 and e-07 for all the results.

The number of states in the monolithic model is 912, while in the ISP model it is 21

(50)

Availability Quantification for IaaS Cloud

[paper in Proc. IEEE/IFIP DSN 2011]

(51)

Assumptions

We consider the net effect of different failures and repairs of PMs

MTTF of each hot PM is $1/\lambda_h$ and that of each warm PM is $1/\lambda_w$, with $1/\lambda_h < 1/\lambda_w$.

MTTF of each cold PM is $1/\lambda_c$, with $1/\lambda_w \ll 1/\lambda_c$.

Each pool has repair facilities and shared repair policy is assumed

PMs can migrate from one pool to another upon a failure and repair

(52)

Monolithic availability model

(53)

Interacting Sub-models

SRN sub-model for hot pool

SRN sub-model for warm pool

SRN sub-model for cold pool

(54)

Import graph and model outputs

Model outputs:

mean number of PMs in each pool ($E[\#P_h]$, $E[\#P_w]$, and $E[\#P_c]$)

availability of cloud when at least $k$ PMs are available across all the pools

downtime
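Downtime is usually reported in minutes per year and follows directly from steady-state availability. The sketch below shows that conversion; the function names are assumptions and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, assuming a 365-day year


def downtime_minutes_per_year(availability):
    """Expected yearly downtime implied by a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_per_year):
    """Inverse conversion: availability implied by a yearly downtime."""
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR


print(downtime_minutes_per_year(0.99999))   # "five nines"
print(availability_from_downtime(525.6))    # about three nines
```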

(55)

Monolithic vs. interacting sub-models

Number of model states and non-zero entries

#PMs in each pool | #monolithic model states | #sub-models states | #monolithic model non-zero entries | #sub-models non-zero entries
5 | 7056 | 56 | 44520 | 210
10 | 207636 | 286 | 1535490 | 1320
15 | 1775616 | 136 | 13948160 | 480
17 | 3508920 | 171 | 27976968 | 612
19 | 6468000 | 210 | 52189200 | 760
20 | Memory overflow | 231 | Memory overflow | 840
50 | - | 1326 | - | 5100
100 | - | 5151 | - | 20200
150 | - | 11476 | - | 45300

Copyright © 2011 by K.S. Trivedi

(56)

Monolithic vs. interacting sub-models

Average number of PMs in each pool

#PMs in each pool to start with | Monolithic model: hot | warm | cold | Interacting sub-models: hot | warm | cold
5 | 4.99 | 4.98 | 4.99 | 5.00 | 4.98 | 4.99
10 | 10.00 | 9.96 | 9.98 | 10.00 | 9.96 | 9.98
15 | 14.99 | 14.95 | 14.97 | 15.00 | 14.95 | 14.97
17 | 16.99 | 16.94 | 16.97 | 17.00 | 16.94 | 16.97
19 | 18.99 | 18.93 | 18.97 | 19.00 | 18.93 | 18.97

(57)

Monolithic vs. interacting sub-models

Comparison of downtime with 10 PMs in each pool to start with.

Cloud is available when at least $k$ PMs are UP.

Maximum number of PMs that can be repaired in parallel is $n_r$

Downtime (minutes/year):

k | n_r | Monolithic | Interacting sub-models
30 | 1 | 23185.793 | 23178.956
30 | 2 | 22904.919 | 22898.454
30 | 3 | 22903.681 | 22897.219
29 | 1 | 792.475 | 798.651
29 | 2 | 499.081 | 505.258
29 | 3 | 497.787 | 503.964
28 | 1 | 24.722 | 25.336
28 | 2 | 8.412 | 8.691
28 | 3 | 7.118 | 7.396

(58)

Monolithic vs. interacting sub-models

Comparison of solution times

#PMs in each pool to start with | Monolithic model (sec) | Sub-models (sec)
5 | 0.627 | 0.406
10 | 18.670 | 0.517
15 | 373.822 | 0.278
17 | 1004.494 | 0.279
19 | 2459.553 | 0.280
20 | Memory overflow | 0.281
50 | - | 0.296
100 | - | 0.377
150 | - | 0.564
200 | - | 0.948

(59)

Solution time for large IaaS cloud

We use closed-form solutions of the sub-models

#PMs in each pool to start with

Solution time (sec)

(60)

Resiliency Quantification for IaaS Cloud

[paper in Proc. IEEE SRDS RACOS workshop 2010]

(61)

Resiliency Quantification: Definitions

Past research mostly interpreted resiliency as the fault-tolerance capability of the system

We use the following definition:

Resiliency is the persistence of service delivery that is predictable and can be trusted to perform when subjected to changes*

Changes of interest in the context of IaaS cloud:

Increase/decrease in workload

Increase/decrease in system capacity

Increase/decrease in faultload

Security attacks

Accidents or disasters

*[1] J. Laprie, “From Dependability to resiliency”, DSN 2008

[2] L. Simoncini, “Resilient Computing: An Engineering Discipline”, IPDPS 2009

(62)

General steps for resiliency quantification

(1) Construct a stochastic analytic model of a given system to find measure(s) of interest.

Such a model can be performance or availability model of the system.

(2) Determine the steady state behavior of the developed model in step (1).

We compute steady state values of performance and/or availability measures.

Note the analogy with the Phased-Mission System reliability analysis

(3) Apply change(s) to the system by increasing (or decreasing) the value(s) of input parameter(s) of the model.

Examples of such changes can be variation of call arrival rates, failure rates.

(4) Analyze the transient behavior of the system model to compute the transient measures after applying the change(s).

Initial probabilities for this transient analysis are obtained from the steady state

probabilities as computed from the system model in the step (2).

The transient response of the performance/availability measures quantifies the resiliency of the system.
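The four steps can be sketched on a deliberately small stand-in model: an M/M/1/K queue whose arrival rate jumps from one value to another. This is not the IaaS model from the talk; the queue, the rates and the function names are assumptions chosen so that the example stays self-contained. Transient probabilities are computed by uniformization.

```python
import math


def mm1k_generator(lam, mu, K):
    """Step 1: generator matrix of an M/M/1/K queue (state = jobs in system)."""
    n = K + 1
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i < K:
            Q[i][i + 1] = lam
        if i > 0:
            Q[i][i - 1] = mu
        Q[i][i] = -sum(Q[i])  # diagonal is still 0 here, so this is -row sum
    return Q


def mm1k_steady_state(lam, mu, K):
    """Step 2: closed-form steady state of the M/M/1/K queue."""
    rho = lam / mu
    weights = [rho**i for i in range(K + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def transient(Q, p0, t):
    """Step 4: state probabilities at time t by uniformization."""
    n = len(Q)
    q = 1.02 * max(-Q[i][i] for i in range(n))
    qt = q * t
    if qt > 700.0:
        raise ValueError("q*t too large for this simple sketch")
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    weight = math.exp(-qt)  # Poisson probability of 0 jumps
    v = list(p0)
    result = [weight * x for x in v]
    # Truncate the Poisson sum well past its mean.
    for k in range(1, int(qt + 10.0 * math.sqrt(qt) + 30.0)):
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= qt / k
        result = [r + weight * x for r, x in zip(result, v)]
    return result


K, mu = 10, 10.0
old = mm1k_steady_state(5.0, mu, K)   # steady state before the change
Q_new = mm1k_generator(9.0, mu, K)    # Step 3: arrival rate raised to 9
new = mm1k_steady_state(9.0, mu, K)   # where the system will settle

for t in (0.0, 0.5, 1.0, 5.0, 20.0):
    print(t, f"{transient(Q_new, old, t)[K]:.6f}")
print("new steady-state blocking:", f"{new[K]:.6f}")
```

The printed blocking probability rises from its old steady-state value towards the new one; the time it takes to get within a chosen band of the new value is one way to read a settling time off such a curve.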

(63)

IaaS cloud resiliency w.r.t. change in arrival rate

t_set is settling time; one of the metrics to quantify resiliency

(64)

Power Quantification for IaaS Cloud

[paper in Proc. IEEE/IFIP DSN workshop DCDV 2011]

(66)

Power Consumption from Warm PM Model

[Table: warm PM CTMC states and their reward rates]

(67)

Power Consumption from Cold PM Model

[Table: cold PM CTMC states and their reward rates]
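Attaching reward rates to CTMC states turns the state probabilities into an expected power figure. The sketch below shows that weighted sum; the state names, probabilities and wattages are hypothetical placeholders, not the reward assignments from the slides.

```python
def expected_reward(state_probs, reward_rates):
    """Expected steady-state reward rate: sum over states of prob * reward."""
    return sum(state_probs[s] * reward_rates[s] for s in state_probs)


# Hypothetical three-state summary of one PM (probabilities sum to 1).
probs = {"idle": 0.2, "provisioning": 0.1, "running": 0.7}
watts = {"idle": 150.0, "provisioning": 200.0, "running": 250.0}

mean_power = expected_reward(probs, watts)
print(f"expected power per PM = {mean_power:.1f} W")
```

The same function works for any measure that can be expressed as a reward per state, which is what makes Markov reward models convenient for combined performance, availability and power analysis.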

(68)

Power-performance trade-offs

region where intuition based grouping is bad

(i, j, k) denotes #PMs in hot, warm and cold pool respectively

optimization problem:

What is the optimal #PMs per pool that minimizes total power consumption but

does not violate the SLA (upper bound on mean response delay)?
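For small configurations the optimization can be sketched as an exhaustive search over (i, j, k). The power and delay functions below are toy stand-ins with invented coefficients; in practice both would be evaluated from the analytic sub-models. Fixing the total number of PMs is also an assumption made here for illustration.

```python
from itertools import product


def best_pool_sizes(total_pms, power, delay, sla):
    """Return (power, (i, j, k)) minimizing power subject to delay <= sla."""
    best = None
    for i, j in product(range(total_pms + 1), repeat=2):
        k = total_pms - i - j
        if k < 0:
            continue  # more hot and warm PMs than exist
        if delay(i, j, k) > sla:
            continue  # violates the SLA bound on mean response delay
        cost = power(i, j, k)
        if best is None or cost < best[0]:
            best = (cost, (i, j, k))
    return best


def toy_power(i, j, k):
    """Hypothetical power draw: hot PMs cost most, cold PMs least."""
    return 200 * i + 100 * j + 10 * k


def toy_delay(i, j, k):
    """Hypothetical mean response delay: hot PMs reduce it most."""
    return 60.0 / (1 + 4 * i + 2 * j + 0.5 * k)


print(best_pool_sizes(6, toy_power, toy_delay, sla=5.5))
```

With these toy functions the search returns 440 units of power at (2, 0, 4). Exhaustive search grows quadratically with the number of PMs, so larger clouds would need a smarter search, but each candidate is cheap to evaluate when the sub-models solve in well under a second.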

(69)

Future Research

(70)

Cost analysis

Providers have two key costs for providing cloud based services: (i) Capital Expenditure (CapEx) and (ii) Operational Expenditure (OpEx)

Capital Expenditure (CapEx)

Examples of CapEx include infrastructure cost and software licensing cost

Usually CapEx is fixed over time

Operational Expenditure (OpEx)

Examples of OpEx include power usage cost, cost or penalty due to violation of different SLA metrics, and management costs

OpEx is more interesting since it varies with time depending upon different factors like system configuration, management strategy or workload arrivals

(71)

SLA driven capacity planning

What is the optimal #PMs so that total cost is minimized and SLA is upheld?

Large sized cloud, large variability, fixed # configurations

(72)

Proposed Extensions to Current Models

More detailed workload Model

Different workload arrival processes [e.g., bursty]

Different types of service time distributions

Heterogeneous requests

Requests with different priorities

More detailed availability model

Different types of service time distributions

Model validation

Application of existing models to different cloud services/systems

Cost analysis

(73)

Conclusions

(74)

Conclusions

Analytic models are powerful for the construction and numerical solution of various reliability, availability, performance, and resiliency [behavior under changes in workload, faultload, configuration] models

Not only exponential but also non-exponential distributions can be admitted when constructing such models.

For very complex systems such as clouds, hierarchical, fixed-point iterative and approximate solutions are needed.

Performance, availability, resiliency and power consumption analysis can be done using such an approach.