Performance, Availability and Power
Analysis for IaaS Cloud
Kishor Trivedi
[email protected] www.ee.duke.edu/~kst
Dept. of ECE, Duke University, Durham, NC 27708
Universita Napoli
Duke University
Research Triangle Park (RTP)
UNC-CH
Duke
NC state
Trivedi’s Research Triangle
Theory
- Stochastic modeling methods & numerical solution methods: large fault trees, stochastic Petri nets, large/stiff Markov & non-Markov models, fluid stochastic models
- Performability & Markov reward models
- Software aging and rejuvenation
- Attack countermeasure trees
Software packages
- HARP (NASA), SAVE (IBM), IRAP (Boeing), SHARPE, SPNP, SREPT
Applications
- Reliability/availability/performance
- Avionics, space, power systems, transportation systems, automobile systems
- Computer systems (hardware/software)
- Telco systems
- Computer networks
- Virtualized data center
- Cloud computing
Books: Blue, Red, White
Talk outline
Overview of Reliability and Availability Quantification
Overview of Cloud Computing
Performance Quantification for IaaS Cloud (PRDC 2010)
Availability Quantification for IaaS Cloud (DSN 2011)
Power Quantification for IaaS Cloud (DSN workshop 2011)
Future Research
An Overview of Reliability and
Availability Quantification Methods
Copyright © 2011 by K.S. Trivedi
Reliability and Availability Quantification
Software + hardware in operation
Dynamic as opposed to static behavior
Measurement-Based
- More accurate
- Expensive due to many parameters and configurations
- Not always possible during system design
Model-Based
Combined approach where measurements are made at the subsystem level and models are built to derive system-level measures
Reliability and Availability Evaluation Methods
Quantitative Evaluation
- Measurement-based
- Model-based
  - Discrete-event simulation
  - Analytic models
    - Closed-form solution
    - Numerical solution via a tool
  - Hybrid
- Hybrid
Numerical solution of analytic models is not as well utilized; simulation is used unnecessarily often
Analytic Modeling Taxonomy
Quantitative Dependability Evaluation
- Measurement-based
- Model-based
  - Discrete-event simulation
  - Analytic models
    - Non-state-space models
    - State-space models
    - Hierarchical composition
    - Fixed point iterative models
  - Hybrid
Non-state-space models
Reliability block diagrams (RBDs), reliability graphs (relgraphs) and fault trees (FTs) are easy to use and efficient to solve for system reliability, system availability and system mean time to failure (MTTF)
Product-form queuing networks for performance analysis
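A minimal sketch of why non-state-space models are cheap to solve: under the independence assumption an RBD reduces to products of block availabilities. The MTTF/MTTR numbers and the function names below are hypothetical and chosen only for illustration; they do not come from the talk.

```python
from functools import reduce


def steady_state_availability(mttf, mttr):
    """Availability of one repairable block: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def series(*avail):
    """Series structure: every block must be up (blocks assumed independent)."""
    return reduce(lambda acc, a: acc * a, avail, 1.0)


def parallel(*avail):
    """Parallel structure: at least one block must be up (blocks assumed independent)."""
    all_down = reduce(lambda acc, a: acc * (1.0 - a), avail, 1.0)
    return 1.0 - all_down


# Hypothetical system: two redundant servers in series with one switch.
server = steady_state_availability(mttf=1000.0, mttr=10.0)
switch = steady_state_availability(mttf=5000.0, mttr=5.0)
system = series(parallel(server, server), switch)
print(f"server={server:.6f} switch={switch:.6f} system={system:.6f}")
```

The whole evaluation is a handful of multiplications, which is the efficiency these models offer; the price is that dependencies such as shared repair cannot be expressed.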
Example: Reliability Analysis of Boeing 787
Current Return Network Modeled as a
Reliability Graph
(Relgraph)
Reliability Analysis of Boeing 787
(cont’d)
This real problem has too many minpaths
Non-state space models also face largeness problem
Number of paths from source to target
Reliability Analysis of Boeing 787
(cont’d)
Our approach: developed a new efficient algorithm for (un)reliability bounds computation and incorporated it in SHARPE
SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator)
Non-State-Space Models
Failure/repair dependencies are often present; RBDs, relgraphs and FTs cannot easily handle these (e.g., shared repair, warm/cold spares, imperfect coverage, non-zero switching time, travel time of repair person, reliability with repair).
Product-form does not often hold when modeling real-life aspects such as simultaneous resource possession, priorities, retries, etc.
State-space models : Markov chains
To model complex interactions between components, use Markov chains or, more generally, state-space models.
Many examples of dependencies among system components have been observed in practice and captured by continuous-time Markov chains (CTMCs).
Extension to Markov reward models makes computation of measures of interest relatively easy.
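As an illustration of a dependency that a CTMC captures, the sketch below models two identical components that share a single repair person, and then attaches reward rates to the states. The failure and repair rates, the state encoding and the small linear solver are assumptions made for this example only.

```python
def steady_state(n, rates):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible CTMC.

    rates is a list of (src, dst, rate) triples over states 0..n-1.
    Plain Gauss-Jordan elimination with partial pivoting.
    """
    a = [[0.0] * (n + 1) for _ in range(n)]
    for src, dst, rate in rates:
        a[dst][src] += rate  # flow into dst
        a[src][src] -= rate  # flow out of src
    a[n - 1] = [1.0] * (n + 1)  # replace one balance equation by normalization
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Hypothetical rates (per hour). State = number of failed components (0, 1, 2).
lam, mu = 1.0 / 1000.0, 1.0 / 10.0
rates = [
    (0, 1, 2 * lam),  # either of the two working components fails
    (1, 2, lam),      # the remaining component fails
    (1, 0, mu),       # the single repair person fixes one component
    (2, 1, mu),       # shared repair: still only one repair at a time
]
pi = steady_state(3, rates)

# Markov reward model: reward 1 in up states gives availability,
# reward "number of working components" gives expected capacity.
availability = sum(p * (1.0 if s < 2 else 0.0) for s, p in enumerate(pi))
capacity = sum(p * (2 - s) for s, p in enumerate(pi))
print(f"availability={availability:.8f} expected capacity={capacity:.6f}")
```

Changing the reward vector is all it takes to move from availability to a performability-style measure, which is the convenience the Markov reward extension provides.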
Markov availability model of WebSphere AP Server
Application server and proxy server (with escalated levels of recovery)
• Delay and imperfect coverage in each step of recovery modeled
[Figure: Markov model with failure detection (by WLM, by Node Agent, manual detection) and recovery (Node Agent auto process restart, manual recovery: process restart, node reboot, repair)]
Analytic Modeling Taxonomy
Non-state-space models
Analytic models
Should I Use Markov Models?
+ Model Fault-Tolerance and Recovery/Repair
+ Model Dependencies
+ Model Contention for Resources and concurrency
+ Generalize to Markov Reward Models for Degradable systems
+ Can relax exponential assumption
+ Performance, Availability and Performability Modeling Possible
- Large State Space
State Space Explosion
State space explosion can be avoided by using hierarchical model composition.
Use state-space models for those parts of a system that require them, and use non-state-space models for the more “well-behaved” parts of the system.
Analytic Modeling Taxonomy
Analytic models
- Non-state-space models: efficiency, simplicity
- State-space models: dependency capture
- Hierarchical composition: to avoid largeness
Example: Architecture of SIP on IBM WebSphere
AS: WebSphere Appl. Server (WAS)
Replication domain   Nodes
1                    A, D
2                    A, E
3                    B, F
4                    B, D
5                    C, E
6                    C, F
Hierarchical composition
[Figure: hierarchical model. Top event: System Failure, combining the proxy system (PX 1, PX 2) and the App servers through a k of 12 gate over AS 1 to AS 12; each AS and PX block is expanded into lower-level components (labels such as 1A, BSA, CM1, 1D, BSD, CM2, P1, BSG, P2, BSH)]
This model was responsible for the actual sale of the system by IBM to their Telco customer
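The k of 12 gate at the top level can be sketched as follows: a lower-level model supplies the unavailability of one application server, and the upper level combines 12 of them. This is a simplified illustration only. It assumes identical and independent servers (the actual model has shared hardware, which is exactly why it uses hierarchy and state-space sub-models), and the per-server unavailability used here is a hypothetical number.

```python
from math import comb


def prob_at_least_k_failed(n, k, unavail):
    """P(at least k of n independent, identical components are down)."""
    return sum(
        comb(n, j) * unavail**j * (1.0 - unavail) ** (n - j)
        for j in range(k, n + 1)
    )


# Hypothetical output of a lower-level sub-model: one server is down
# with probability 1e-3 in steady state.
as_unavailability = 1e-3
for k in (1, 2, 3):
    u = prob_at_least_k_failed(12, k, as_unavailability)
    print(f"k={k}: P(at least {k} of 12 servers down) = {u:.3e}")
```

The point of the composition is that the upper level only needs one number per sub-model, so the lower-level models can be as detailed as required without enlarging the top-level model.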
Fixed-Point Iteration
Input parameters of sub-models can be functions of outputs of other models
If the import graph is not acyclic then we solve using fixed-point iteration
Analytic models
- Non-state-space models: efficiency, simplicity
- State-space models: dependency capture
- Hierarchical composition: to avoid largeness
- Fixed-point iteration: to deal with interdependent submodels
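A minimal sketch of fixed-point iteration by successive substitution, for two sub-models whose inputs depend on each other's outputs. The two sub-model functions are hypothetical stand-ins (simple linear maps) so that the exact fixed point is known; in practice each call would solve a Markov or non-state-space sub-model.

```python
def submodel_a(y):
    """Hypothetical sub-model A: its output x depends on B's output y."""
    return 0.5 * (1.0 - y)


def submodel_b(x):
    """Hypothetical sub-model B: its output y depends on A's output x."""
    return 0.2 + 0.3 * x


def fixed_point(update, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- update(x) until successive values differ by less than tol."""
    x = x0
    for iteration in range(1, max_iter + 1):
        x_new = update(x)
        if abs(x_new - x) < tol:
            return x_new, iteration
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")


# Composing the two sub-models gives one equation x = f(x).
x_star, n_iter = fixed_point(lambda x: submodel_a(submodel_b(x)), x0=0.0)
y_star = submodel_b(x_star)
print(f"x*={x_star:.6f} y*={y_star:.6f} after {n_iter} iterations")
```

Convergence is not guaranteed for arbitrary sub-models; this toy map is a contraction, which is why plain substitution works here.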
An Overview of Cloud Computing
NIST definition of cloud computing
Definition by National Institute of Standards and Technology (NIST):
“Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with
Key characteristics
On-demand self-service:
Provisioning of computing capabilities without human intervention
Resource pooling:
Shared physical and virtualized environment
Rapid elasticity:
Through standardization and automation, quick scaling
Metered Service:
Pay-as-you-go model of computing
Source: P. Mell and T. Grance, “The NIST Definition of Cloud Computing”, October 7, 2009
Evolution of cloud computing
Time line of evolution:
- Cluster computing: early 80s
- Grid computing: early 90s
- Utility computing: around 2000
- Cloud computing: around 2005-06
*Source: http://seekingalpha.com/article/167764-tipping-point-gartner-annoints-cloud-computing-top-strategic-technology
Cloud Service models
Infrastructure-as-a-Service (IaaS) Cloud:
Examples: Amazon EC2, IBM Smart Business Development and Test Cloud
Platform-as-a-Service (PaaS) Cloud:
Examples: Microsoft Windows Azure, Google AppEngine
Software-as-a-Service (SaaS) Cloud:
Examples: Gmail, Google Docs
Deployment models
Private Cloud:
- Cloud infrastructure solely for an organization
- Managed by the organization or third party
- May exist on-premise or off-premise
Public Cloud:
- Cloud infrastructure available for use for general users
- Owned by an organization providing cloud services
Hybrid Cloud:
- Composition of two or more clouds (private or public)
Key Challenges
Three critical metrics for a cloud:
- Service (un)availability
- Performance (response time) unpredictability
- Power consumption
Large number of parameters can affect performance, availability and power
- Workload parameters
- Failure/recovery characteristics
- Types of physical infrastructure
- Characteristics of virtualization infrastructures
- Large scale; thousands of servers
Performance, availability & power quantification are difficult!
Our goals in the IBM Cloud project
Develop a comprehensive analytic modeling approach
High fidelity
Scalable and tractable
Apply these models to cloud capacity planning
Our approach and motivations behind it
Difficulty with measurement-based approach:
expensive experimentation for each workload and system configuration
Monolithic analytic model will suffer largeness and hence is not scalable
Our approach:
overall system model consists of a set of sub-models
sub-model solutions composed via an interacting Markov chain approach
scalable and tractable
Duke/IBM project on cloud computing
Joint work with
Rahul Ghosh and Dong Seong Kim (Duke), Francesco Longo (Univ. of Messina)
Vijay Naik, Murthy Devarakonda and Daniel Dias
Performance Quantification for IaaS Cloud
[paper in Proc. IEEE PRDC 2010]
System model
Current Assumptions [will be relaxed soon]
Homogeneous requests
All physical machines (PMs) are identical.
To minimize power consumption, PMs are divided into three pools:
Hot pool: fast provisioning but high power usage
Warm pool: slower provisioning but lower power usage
Cold pool: slowest provisioning but lowest power usage
Life-cycle of a job inside an IaaS cloud
[Figure: Arrival -> Queuing -> Provisioning Decision (Resource Provisioning Decision Engine) -> Instantiation (VM deployment) -> Actual Service (Run-time Execution) -> Out; labels: provisioning response delay, job rejection due to buffer full, job rejection due to insufficient capacity]
Provisioning and servicing steps:
(i) resource provisioning decision,
(ii) VM provisioning and
(iii) run-time execution
Resource provisioning decision engine (RPDE)
[Figure: job life-cycle diagram (Arrival, Queuing, Provisioning Decision, Instantiation, Actual Service, Out) with job rejection due to buffer full and due to insufficient capacity]
Resource provisioning decision engine (RPDE)
Flow-chart:
[Figure: RPDE flow-chart]
CTMC model for RPDE
State $(i, s)$: i = number of jobs in queue, s = pool (hot, warm or cold)
[Figure: CTMC with states $(0,0)$ and $(i,h)$, $(i,w)$, $(i,c)$ for $i = 0, \ldots, N-1$; transition rates $\lambda$, $\delta_h P_h$, $\delta_h (1 - P_h)$, $\delta_w P_w$, $\delta_w (1 - P_w)$, $\delta_c P_c$, $\delta_c (1 - P_c)$]
RPDE model: parameters & measures
Input Parameters:
– $\lambda$: arrival rate: data collected from cloud
– $1/\delta_h$, $1/\delta_w$, $1/\delta_c$: mean search delays for resource provisioning decision engine: from searching algorithms or measurements
– $P_h$, $P_w$, $P_c$: probability of being able to provision: computed from VM provisioning model
– $N$: maximum # jobs in RPDE: from system/server specification
Output Measures:
Job rejection probability due to buffer full ($P_{block}$)
Job rejection probability due to insufficient capacity ($P_{drop}$)
Mean decision delay for an accepted job ($E[T_{decision}]$)
Mean queuing delay for an accepted job ($E[T_{q\_dec}]$)
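The sketch below builds an RPDE-style CTMC numerically and computes the two rejection probabilities. It is one reading of the state diagram, not the authors' implementation: the transition structure (search hot, then warm, then cold; drop the job if the cold pool also fails), the expressions used for $P_{block}$ and $P_{drop}$, and all numeric parameter values are assumptions made for illustration.

```python
def steady_state(n, rates):
    """Solve pi * Q = 0, sum(pi) = 1, by Gauss-Jordan elimination."""
    a = [[0.0] * (n + 1) for _ in range(n)]
    for src, dst, rate in rates:
        a[dst][src] += rate
        a[src][src] -= rate
    a[n - 1] = [1.0] * (n + 1)  # normalization replaces one balance equation
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def rpde_measures(lam, delta, p, N):
    """States: (0, 0) = empty; (i, s) = i jobs waiting while the head job
    is being searched a place in pool s, with s in 'h', 'w', 'c'."""
    pools = ["h", "w", "c"]
    idx = {(0, 0): 0}
    for i in range(N):
        for s in pools:
            idx[(i, s)] = len(idx)
    rates = [(idx[(0, 0)], idx[(0, "h")], lam)]
    for i in range(N):
        # State reached once the head job leaves (accepted or dropped).
        nxt = idx[(i - 1, "h")] if i > 0 else idx[(0, 0)]
        for pos, s in enumerate(pools):
            here = idx[(i, s)]
            if i < N - 1:
                rates.append((here, idx[(i + 1, s)], lam))  # arrival joins the queue
            rates.append((here, nxt, delta[s] * p[s]))      # provisioning possible
            fail = delta[s] * (1.0 - p[s])
            if s != "c":
                rates.append((here, idx[(i, pools[pos + 1])], fail))  # try next pool
            else:
                rates.append((here, nxt, fail))  # no pool can serve: job dropped
    sol = steady_state(len(idx), rates)
    pi = {state: sol[k] for state, k in idx.items()}
    p_block = sum(pi[(N - 1, s)] for s in pools)  # arrivals that see a full RPDE
    p_drop = sum(pi[(i, "c")] for i in range(N)) * delta["c"] * (1.0 - p["c"]) / lam
    return pi, p_block, p_drop


# Hypothetical inputs (rates per hour).
lam, N = 1.0, 5
delta = {"h": 5.0, "w": 4.0, "c": 3.0}
p = {"h": 0.6, "w": 0.5, "c": 0.4}
pi, p_block, p_drop = rpde_measures(lam, delta, p, N)
print(f"P_block={p_block:.6f} P_drop={p_drop:.6f}")
```

A useful sanity check on any such model is flow balance: the rate of admitted jobs, $\lambda (1 - P_{block})$, must equal the rate of accepted jobs plus the rate of dropped jobs.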
VM provisioning
[Figure: job life-cycle diagram (Arrival, Queuing, Provisioning Decision, Instantiation, Actual Service, Out) with job rejection due to buffer full and due to insufficient capacity]
VM provisioning model
[Figure: accepted jobs from the Resource Provisioning Decision Engine are placed on idle resources on a hot, warm or cold machine (hot PM pool, warm pool, cold pool), become running VMs, then service out]
VM provisioning model for each hot PM
State $(i, j, k)$: i = number of jobs in the queue, j = number of VMs being provisioned, k = number of VMs running
$L_h$ is the buffer size and m is max. # VMs that can run simultaneously on a PM
[Figure: CTMC with states from $(0,0,0)$ to $(L_h,0,m)$; transition rates $\lambda_h$, $\beta_h$, $\mu$, $2\mu$, ..., $(m-1)\mu$, $m\mu$]
VM provisioning model (for each hot PM)
Input Parameters:
$\lambda_h = \lambda (1 - P_{block}) / n_h$
$1/\beta_h$: can be measured experimentally
$1/\mu$: obtained from the lower level run-time model
$P_{block}$: obtained from the resource provisioning decision model
Hot pool model is the set of independent hot PM models
Output Measure:
$P_h$ = prob. that a job is accepted in the hot pool = $1 - \left( \sum_{i=0}^{m-1} \varphi^{(h)}_{(L_h,1,i)} + \varphi^{(h)}_{(L_h,0,m)} \right)^{n_h}$
where $\sum_{i=0}^{m-1} \varphi^{(h)}_{(L_h,1,i)} + \varphi^{(h)}_{(L_h,0,m)}$ is the steady state probability that a PM cannot accept a job for provisioning, from the solution of the Markov model of a hot PM on the previous slide
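The output measure can be evaluated directly once the steady-state probabilities of one hot PM are known. The sketch below assumes that $n_h$ is the number of PMs in the hot pool and that the PMs are independent, as the slide states; the state probabilities in the example are hypothetical placeholders, not solved values.

```python
def pm_cannot_accept(phi, L, m):
    """Probability that one PM cannot accept a job: buffer full (i = L) with
    a VM being provisioned and i running VMs, or with all m VMs running."""
    return sum(phi.get((L, 1, i), 0.0) for i in range(m)) + phi.get((L, 0, m), 0.0)


def pool_accept_probability(phi, L, m, n_pms):
    """P_h = 1 - (P[one PM cannot accept])^n_pms, PMs assumed independent."""
    return 1.0 - pm_cannot_accept(phi, L, m) ** n_pms


# Hypothetical steady-state probabilities for a PM with L = 2 and m = 2;
# only the states that matter for the formula are listed explicitly.
phi = {(2, 1, 0): 0.01, (2, 1, 1): 0.02, (2, 0, 2): 0.03, (0, 0, 0): 0.94}
print(pm_cannot_accept(phi, L=2, m=2))
print(pool_accept_probability(phi, L=2, m=2, n_pms=3))
```

The exponent is what makes the pool-level model cheap: one small per-PM chain is solved once and the result is raised to the pool size.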
VM provisioning model for each warm PM
[Figure: CTMC of a warm PM with states $(i, j, k)$ from $(0,0,0)$ to $(L_w,0,m)$, including start-up states $(i,1^{*},0)$ and $(i,1^{**},0)$; transition rates $\lambda_w$, $\gamma_w$, $\beta_w$, $\beta_h$, $\mu$, $2\mu$, ..., $(m-1)\mu$, $m\mu$]
VM provisioning model for each cold PM
[Figure: CTMC of a cold PM with states $(i, j, k)$ from $(0,0,0)$ to $(L_c,0,m)$, including start-up states $(i,1^{*},0)$ and $(i,1^{**},0)$; transition rates $\lambda_c$, $\gamma_c$, $\beta_c$, $\beta_h$, $\mu$, $2\mu$, ..., $(m-1)\mu$, $m\mu$]
VM provisioning model: Summary
Warm/cold PM model is similar to hot PM, except:
(i) Effective job arrival rate
(ii) For first job, warm/cold PM requires additional start-up time
(iii) Mean provisioning delay for a VM for the first job is longer
Outputs of hot, warm and cold pool models:
Probabilities ($P_h$, $P_w$, $P_c$) that at least one PM in hot/warm/cold pool can accept a job
Import graph for performance models
[Figure: import graph. The RPDE model exports $P_{block}$ to the VM provisioning models (hot pool model, warm pool model, cold pool model); these export $P_h$, $P_w$ and $P_c$ to the RPDE model. Outputs: job rejection probability and mean response delay]
Fixed-point iteration
To solve hot, warm and cold PM models, we need $P_{block}$ from resource provisioning decision model
To solve provisioning decision model, we need $P_h$, $P_w$, $P_c$ from hot, warm and cold pool model respectively
This leads to a cyclic dependency among the resource provisioning decision model and VM provisioning models (hot, warm, cold)
We resolve this dependency via fixed-point iteration
Observe, our fixed-point variable is $P_{block}$ and corresponding fixed-point equation is of the form:
$P_{block} = f(P_{block})$
Performance measures comparison with monolithic model
1 PM per pool and 1 VM per PM
Jobs/hr   Mean RPDE queue length      Rejection probability
          ISP          Monolithic     ISP          Monolithic
1         9.0332e-07   9.2321e-07     9.8899e-06   1.1221e-03
5         4.1622e-05   4.3364e-05     4.2334e-02   8.0500e-02
10        2.3731e-04   2.4225e-04     2.3496e-01   2.6587e-01
15        6.3539e-04   6.4377e-04     3.9860e-01   4.1493e-01
20        1.2526e-03   1.2655e-03     5.1069e-01   5.1969e-01
25        2.0990e-03   2.1179e-03     5.8915e-01   5.9449e-01
30        3.1826e-03   3.2091e-01     6.4648e-01   6.4985e-01
The error is between e-03 and e-07 for all the results.
The number of states in monolithic model is 912 while in ISP model it is 21
Availability Quantification for IaaS Cloud
[paper in Proc. IEEE/IFIP DSN 2011]
Assumptions
We consider the net effect of different failures and repairs of PMs
MTTF of each hot PM is $1/\lambda_h$ and that of each warm PM is $1/\lambda_w$, with $1/\lambda_h < 1/\lambda_w$.
MTTF of each cold PM is $1/\lambda_c$, with $1/\lambda_w \ll 1/\lambda_c$.
Each pool has repair facilities and shared repair policy is assumed
PMs can migrate from one pool to another upon a failure and repair
Monolithic availability model
Interacting Sub-models
SRN sub-model for hot pool
SRN sub-model for warm pool
SRN sub-model for cold pool
Import graph and model outputs
Model outputs:
mean number of PMs in each pool (E[#Ph], E[#Pw], and E[#Pc])
availability of cloud when at least k PMs are available across all the pools
downtime
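Availability and downtime are two views of the same output. The helper below converts one into the other, assuming a 365-day year (525,600 minutes); the function names are introduced here for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability):
    """Expected yearly downtime for a given steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_per_year):
    """Inverse conversion: steady-state availability from yearly downtime."""
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR


# "Five nines" availability corresponds to a little over five minutes per year.
print(downtime_minutes_per_year(0.99999))
print(availability_from_downtime(500.0))
```

Reporting downtime in minutes per year makes small differences between models visible that would disappear in the trailing digits of an availability figure.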
Monolithic vs. interacting sub-models
Number of model states and non-zero entries
#PMs in each pool   #monolithic model states   #sub-models states   #monolithic model non-zero entries   #sub-models non-zero entries
5                   7056                       56                   44520                                210
10                  207636                     286                  1535490                              1320
15                  1775616                    136                  13948160                             480
17                  3508920                    171                  27976968                             612
19                  6468000                    210                  52189200                             760
20                  Memory overflow            231                  Memory overflow                      840
50                  -                          1326                 -                                    5100
100                 -                          5151                 -                                    20200
150                 -                          11476                -                                    45300
Monolithic vs. interacting sub-models
Average number of PMs in each pool
#PMs in each pool   Avg. #PMs in pools for        Avg. #PMs in pools for
to start with       monolithic model              interacting sub-models
                    hot      warm     cold        hot      warm     cold
5                   4.99     4.98     4.99        5.00     4.98     4.99
10                  10.00    9.96     9.98        10.00    9.96     9.98
15                  14.99    14.95    14.97       15.00    14.95    14.97
17                  16.99    16.94    16.97       17.00    16.94    16.97
19                  18.99    18.93    18.97       19.00    18.93    18.97
Monolithic vs. interacting sub-models
Comparison of downtime with 10 PMs in each pool to start with.
Cloud is available when at least “k” PMs are UP.
Maximum number of PMs that can be repaired in parallel is $n_r$
k    n_r   Downtime (minutes/year)
           Monolithic    Interacting sub-models
30   1     23185.793     23178.956
30   2     22904.919     22898.454
30   3     22903.681     22897.219
29   1     792.475       798.651
29   2     499.081       505.258
29   3     497.787       503.964
28   1     24.722        25.336
28   2     8.412         8.691
28   3     7.118         7.396
Monolithic vs. interacting sub-models
Comparison of solution times
#PMs in each pool to start with   Monolithic model (sec)   Sub-models (sec)
5                                 0.627                    0.406
10                                18.670                   0.517
15                                373.822                  0.278
17                                1004.494                 0.279
19                                2459.553                 0.280
20                                Memory overflow          0.281
50                                -                        0.296
100                               -                        0.377
150                               -                        0.564
200                               -                        0.948
Solution time for large IaaS cloud
We use closed-form solutions of the sub-models
#PMs in each pool to start with   Solution time (sec)
500                               0.251
1000                              0.592
1500                              0.911
2000                              1.715
3000                              2.483
4000                              2.651
Resiliency Quantification for IaaS Cloud
[paper in Proc. IEEE SRDS RACOS workshop 2010]
Resiliency Quantification: Definitions
Past research mostly interpreted resiliency as the fault-tolerance capability of the system
We use the following definition:
Resiliency is the persistence of service delivery that is predictable and can be trusted to perform when subjected to changes*
Changes of interest in the context of IaaS cloud:
- Increase/decrease in workload
- Increase/decrease in system capacity
- Increase/decrease in faultload
- Security attacks
- Accidents or disasters
*[1] J. Laprie, “From Dependability to resiliency”, DSN 2008
[2] L. Simoncini, “Resilient Computing: An Engineering Discipline”, IPDPS 2009
General steps for resiliency quantification
(1) Construct a stochastic analytic model of a given system to find measure(s) of interest.
Such a model can be performance or availability model of the system.
(2) Determine the steady state behavior of the developed model in step (1).
We compute steady state values of performance and/or availability measures.
Note the analogy with the Phased-Mission System reliability analysis
(3) Apply change(s) to the system by increasing (or decreasing) the value(s) of input parameter(s) of the model.
Examples of such changes can be variation of call arrival rates, failure rates.
(4) Analyze the transient behavior of the system model to compute the transient measures after applying the change(s).
Initial probabilities for this transient analysis are obtained from the steady state
probabilities as computed from the system model in the step (2).
The transient response of the performance/availability measures quantifies the resiliency of the system.
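The four steps can be sketched on a deliberately small stand-in system. The example uses an M/M/1/K queue rather than the IaaS cloud model, takes blocking probability as the measure of interest, applies a jump in arrival rate as the change, and uses uniformization as one common way to do the transient analysis. The queue, the rates and the function names are all assumptions made for illustration.

```python
import math


def mm1k_steady_state(lam, mu, K):
    """Steps 1-2: steady state of an M/M/1/K queue (closed form)."""
    weights = [(lam / mu) ** i for i in range(K + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mm1k_generator(lam, mu, K):
    """Generator matrix Q of the M/M/1/K birth-death chain."""
    n = K + 1
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i < K:
            Q[i][i + 1] = lam
        if i > 0:
            Q[i][i - 1] = mu
        Q[i][i] = -sum(Q[i])
    return Q


def transient(Q, p0, t, eps=1e-10, max_terms=100000):
    """Step 4: p(t) = p0 * exp(Q t) by uniformization.
    Suitable for moderate q*t only, since exp(-q*t) must not underflow."""
    n = len(Q)
    q = 1.02 * max(-Q[i][i] for i in range(n))
    P = [[Q[i][j] / q + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    weight = math.exp(-q * t)  # Poisson probability of zero jumps
    v = list(p0)
    result = [weight * x for x in v]
    acc = weight
    for k in range(1, max_terms):
        if 1.0 - acc <= eps:
            break
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k
        acc += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result


# Step 2: steady state before the change. Step 3: the arrival rate jumps.
mu, K = 1.0, 10
lam_old, lam_new = 0.5, 0.8
p0 = mm1k_steady_state(lam_old, mu, K)
Q_new = mm1k_generator(lam_new, mu, K)
target = mm1k_steady_state(lam_new, mu, K)

# Step 4: transient blocking probability after the change.
for t in (0.0, 1.0, 5.0, 20.0, 100.0):
    p_t = transient(Q_new, p0, t)
    print(f"t={t:6.1f}  P_block(t)={p_t[K]:.6f}  (new steady state {target[K]:.6f})")
```

The time the curve takes to come within a chosen band of the new steady-state value is one way to read off a settling time from such a run.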
IaaS cloud resiliency w.r.t. change in arrival rate
$t_{set}$ is the settling time, one of the metrics used to quantify resiliency
Power Quantification for IaaS Cloud
[paper in Proc. IEEE/IFIP DSN workshop DCDV 2011]
Power Consumption from Hot PM Model
[Figure: CTMC of a hot PM with states $(i, j, k)$ from $(0,0,0)$ to $(L_h,0,m)$; transition rates $\lambda_h$, $\beta_h$, $\mu$, $2\mu$, ..., $(m-1)\mu$, $m\mu$]
When no VM is running, hot PM consumes an idle power of $h_l$.
Power consumption of a VM with average resource utilization is assumed to be $v_a$.
For each state $(i, j, k)$ of the CTMC we assign a reward rate: $r(i, j, k) = h_l + k v_a$
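With reward rate $r(i, j, k) = h_l + k v_a$, the mean power of a hot PM is the reward-weighted sum over the steady-state distribution. The sketch below shows that computation; the distribution and the power figures (in watts) are hypothetical placeholders rather than solved or measured values.

```python
def expected_power(pi, idle_power, vm_power):
    """Steady-state expected power: sum over states of pi(i,j,k) * (h_l + k * v_a)."""
    return sum(
        prob * (idle_power + k * vm_power)
        for (_i, _j, k), prob in pi.items()
    )


# Hypothetical steady-state distribution over (i, j, k) for a PM with m = 2.
pi = {(0, 0, 0): 0.5, (0, 0, 1): 0.3, (0, 0, 2): 0.2}
h_l, v_a = 100.0, 20.0  # hypothetical idle power and per-VM power, in watts
print(expected_power(pi, h_l, v_a))
```

Only k enters the reward, so states that differ in queue length or in VMs being provisioned contribute the same power in this formulation.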
Power Consumption from Warm PM Model
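Warm PM CTMC states and reward rates
[Table: reward rate assigned to each group of warm PM CTMC states; the reward rates satisfy $w_{l_1} \le w_{l_2} \le w_{l_3} \le h_l$]
Power Consumption from Cold PM Model
Cold PM CTMC states and reward rates
[Table: reward rate assigned to each group of cold PM CTMC states; the reward rates satisfy $c_{l_1} \le c_{l_2} \le c_{l_3} \le h_l$]
Power-performance trade-offs
[Figure: power-performance trade-off plot, marking the region where intuition based grouping is bad]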
(i, j, k) denotes #PMs in hot, warm and cold pool respectively
optimization problem:
What is the optimal #PMs per pool that minimizes total power consumption but
does not violate the SLA (upper bound on mean response delay)?
Future Research
Cost analysis
Providers have two key costs for providing cloud based services:
(i) Capital Expenditure (CapEx) and
(ii) Operational Expenditure (OpEx)
Capital Expenditure (CapEx)
- Examples of CapEx include infrastructure cost and software licensing cost
- Usually CapEx is fixed over time
Operational Expenditure (OpEx)
- Examples of OpEx include power usage cost, cost or penalty due to violation of different SLA metrics, and management costs
- OpEx is more interesting since it varies with time depending upon different factors like system configuration, management strategy or workload arrivals
SLA driven capacity planning
What is the optimal #PMs so that total cost is minimized and SLA is upheld?
Large sized cloud, large variability, fixed # configurations
Proposed Extensions to Current Models
More detailed workload Model
Different workload arrival processes [e.g., bursty]
Different types of service time distributions
Heterogeneous requests
Requests with different priorities
More detailed availability model
Different types of service time distributions
Model validation
Application of existing models to different cloud services/systems
Cost analysis
Conclusions
Analytic modeling is a powerful approach for constructing and numerically solving reliability, availability, performance and resiliency [behavior under changes in workload, faultload, configuration] models
Non-exponential as well as exponential distributions can be admitted when constructing such models.
For very complex systems such as clouds, hierarchical, fixed-point iterative and approximate solutions are needed.
Performance, availability, resiliency and power consumption analysis can be done using such an approach.