RELIABILITY AND AVAILABILITY OF CLOUD COMPUTING
Eric Bauer
Randee Adams
IEEE PRESS / WILEY
CONTENTS
Figures
Tables
Equations
Introduction
I BASICS
1 CLOUD COMPUTING
1.1 Essential Cloud Characteristics
1.1.1 On-Demand Self-Service
1.1.2 Broad Network Access
1.1.3 Resource Pooling
1.1.4 Rapid Elasticity
1.1.5 Measured Service
1.2 Common Cloud Characteristics
1.3 But What, Exactly, Is Cloud Computing?
1.3.1 What Is a Data Center?
1.3.2 How Does Cloud Computing Differ from Traditional Data Centers?
1.4 Service Models
1.5 Cloud Deployment Models
1.6 Roles in Cloud Computing
1.7 Benefits of Cloud Computing
1.8 Risks of Cloud Computing
2 VIRTUALIZATION
2.1 Background
2.2 What Is Virtualization?
2.2.1 Types of Hypervisors
2.2.2 Virtualization and Emulation
2.3 Server Virtualization
2.3.1 Full Virtualization
2.3.2 Paravirtualization
2.3.3 OS Virtualization
2.3.4 Discussion
2.4 VM Lifecycle
2.4.1 VM Snapshot
2.4.2 Cloning VMs
2.4.3 High Availability Mechanisms
2.5 Reliability and Availability Risks of Virtualization
3 SERVICE RELIABILITY AND SERVICE AVAILABILITY
3.1 Errors and Failures
3.2 Eight-Ingredient Framework
3.3 Service Availability
3.3.1 Service Availability Metric
3.3.2 MTBF and MTTR
3.3.3 Service and Network Element Impact Outages
3.3.4 Partial Outages
3.3.5 Availability Ratings
3.3.6 Outage Attributability
3.3.7 Planned or Scheduled Downtime
3.4 Service Reliability
3.4.1 Service Reliability Metrics
3.4.2 Defective Transactions
3.5 Service Latency
3.6 Redundancy and High Availability
3.6.1 Redundancy
3.6.2 High Availability
3.7 High Availability and Disaster Recovery
3.8 Streaming Services
3.8.1 Control and Data Planes
3.8.2 Service Quality Metrics
3.8.3 Isochronal Data
3.8.4 Latency Expectations
3.8.5 Streaming Quality Impairments
II ANALYSIS
4 ANALYZING CLOUD RELIABILITY AND AVAILABILITY
4.1 Expectations for Service Reliability and Availability
4.2 Risks of Essential Cloud Characteristics
4.2.1 On-Demand Self-Service
4.2.2 Broad Network Access
4.2.3 Resource Pooling
4.2.4 Rapid Elasticity
4.2.5 Measured Service
4.3 Impacts of Common Cloud Characteristics
4.3.1 Virtualization
4.3.2 Geographic Distribution
4.3.3 Resilient Computing
4.3.4 Advanced Security
4.3.5 Massive Scale
4.3.6 Homogeneity
4.4 Risks of Service Models
4.4.1 Traditional Accountability
4.4.2 Cloud-Based Application Accountability
4.5 IT Service Management and Availability Risks
4.5.1 ITIL Overview
4.5.2 Service Strategy
4.5.3 Service Design
4.5.4 Service Transition
4.5.5 Service Operation
4.5.6 Continual Service Improvement
4.5.7 IT Service Management Summary
4.5.8 Risks of Service Orchestration
4.5.9 IT Service Management Risks
4.6 Outage Risks by Process Area
4.6.1 Validating Outage Attributability
4.7 Failure Detection Considerations
4.7.1 Hardware Failures
4.7.2 Programming Errors
4.7.3 Data Inconsistency and Errors
4.7.4 Redundancy Errors
4.7.5 System Power Failures
4.7.6 Network Errors
4.7.7 Application Protocol Errors
4.8 Risks of Deployment Models
5 RELIABILITY ANALYSIS OF VIRTUALIZATION
5.1 Reliability Analysis Techniques
5.1.1 Reliability Block Diagrams
5.1.2 Single Point of Failure Analysis
5.1.3 Failure Mode Effects Analysis
5.2 Reliability Analysis of Virtualization Techniques
5.2.1 Analysis of Full Virtualization
5.2.2 Analysis of OS Virtualization
5.2.3 Analysis of Paravirtualization
5.2.4 Analysis of VM Coresidency
5.2.5 Discussion
5.3 Software Failure Rate Analysis
5.3.1 Virtualization and Software Failure Rate
5.3.2 Hypervisor Failure Rate
5.3.3 Miscellaneous Software Risks of Virtualization and Cloud
5.4 Recovery Models
5.4.1 Traditional Recovery Options
5.4.2 Virtualized Recovery Options
5.4.3 Discussion
5.5 Application Architecture Strategies
5.5.1 On-Demand Single-User Model
5.5.2 Single-User Daemon Model
5.5.3 Multiuser Server Model
5.5.4 Consolidated Server Model
5.6 Availability Modeling of Virtualized Recovery Options
5.6.1 Availability of Virtualized Simplex Architecture
5.6.2 Availability of Virtualized Redundant Architecture
5.6.3 Critical Failure Rate
5.6.4 Failure Coverage
5.6.5 Failure Detection Latency
5.6.6 Switchover Latency
5.6.7 Switchover Success Probability
5.6.8 Modeling and "Fast Failure"
5.6.9 Comparison of Native and Virtualized Deployments
6 HARDWARE RELIABILITY, VIRTUALIZATION, AND SERVICE AVAILABILITY
6.1 Hardware Downtime Expectations
6.2 Hardware Failures
6.4 Hardware Failure Detection
6.5 Hardware Failure Containment
6.6 Hardware Failure Mitigation
6.7 Mitigating Hardware Failures via Virtualization
6.7.1 Virtual CPU
6.7.2 Virtual Memory
6.7.3 Virtual Storage
6.8 Virtualized Networks
6.8.1 Virtual Network Interface Cards
6.8.2 Virtual Local Area Networks
6.8.3 Virtual IP Addresses
6.8.4 Virtual Private Networks
6.9 MTTR of Virtualized Hardware
6.10 Discussion
7 CAPACITY AND ELASTICITY
7.1 System Load Basics
7.1.1 Extraordinary Event Considerations
7.1.2 Slashdot Effect
7.2 Overload, Service Reliability, and Service Availability
7.3 Traditional Capacity Planning
7.4 Cloud and Capacity
7.4.1 Nominal Cloud Capacity Model
7.4.2 Elasticity Expectations
7.5 Managing Online Capacity
7.5.1 Capacity Planning Assumptions of Cloud Computing
7.6 Capacity-Related Service Risks
7.6.1 Elasticity and Elasticity Failure
7.6.2 Partial Capacity Failure
7.6.3 Service Latency Risk
7.6.4 Capacity Impairments and Service Reliability
7.7 Capacity Management Risks
7.7.1 Brittle Application Architecture
7.7.2 Faulty or Inadequate Monitoring Data
7.7.3 Faulty Capacity Decisions
7.7.4 Unreliable Capacity Growth
7.7.5 Unreliable Capacity Degrowth
7.7.6 Inadequate Slew Rate
7.7.7 Tardy Capacity Management Decisions
7.7.8 Resource Stock Out Not Covered
7.7.9 Cloud Burst Fails
7.7.10 Policy Constraints
7.8 Security and Service Availability
7.8.1 Security Risk to Service Availability
7.8.2 Denial of Service Attacks
7.8.3 Defending against DoS Attacks
7.8.4 Quantifying Service Availability Impact of Security Attacks
7.8.5 Recommendations
7.9 Architecting for Elastic Growth and Degrowth
8 SERVICE ORCHESTRATION ANALYSIS
8.1 Service Orchestration Definition
8.2 Policy-Based Management
8.2.1 The Role of SLRs
8.2.2 Service Reliability and Availability Measurements
8.3 Cloud Management
8.3.1 Role of Rapid Elasticity in Cloud Management
8.3.2 Role of Cloud Bursting in Cloud Management
8.4 Service Orchestration's Role in Risk Mitigation
8.4.1 Latency
8.4.2 Reliability
8.4.3 Regulatory
8.4.4 Security
8.5 Summary
9 GEOGRAPHIC DISTRIBUTION, GEOREDUNDANCY, AND DISASTER RECOVERY
9.1 Geographic Distribution versus Georedundancy
9.2 Traditional Disaster Recovery
9.3 Virtualization and Disaster Recovery
9.4 Cloud Computing and Disaster Recovery
9.5 Georedundancy Recovery Models
9.6 Cloud and Traditional Collateral Benefits of Georedundancy
9.6.1 Reduced Planned Downtime
9.6.2 Mitigate Catastrophic Network Element Failures
9.6.3 Mitigate Extended Uncovered and Duplex Failure Outages
III RECOMMENDATIONS
10 APPLICATIONS, SOLUTIONS, AND ACCOUNTABILITY
10.1 Application Configuration Scenarios
10.2 Application Deployment Scenario
10.3 System Downtime Budgets
10.3.1 Traditional System Downtime Budget
10.3.2 Virtualized Application Downtime Budget
10.3.3 IaaS Hardware Downtime Expectations
10.3.4 Cloud-Based Application Downtime Budget
10.3.5 Summary
10.4 End-to-End Solutions Considerations
10.4.1 What Is an End-to-End Solution?
10.4.2 Consumer-Specific Architectures
10.4.3 Data Center Redundancy
10.5 Attributability for Service Impairments
10.6 Solution Service Measurement
10.6.1 Service Availability Measurement Points
10.7 Managing Reliability and Service of Cloud Computing
11 RECOMMENDATIONS FOR ARCHITECTING A RELIABLE SYSTEM
11.1 Architecting for Virtualization and Cloud
11.1.1 Mapping Software into VMs
11.1.2 Service Load Distribution
11.1.3 Data Management
11.1.4 Software Redundancy and High Availability Mechanisms
11.1.5 Rapid Elasticity
11.1.6 Overload Control
11.1.7 Coresidency
11.1.8 Multitenancy
11.1.9 Isochronal Applications
11.2 Disaster Recovery
11.3 IT Service Management Considerations
11.3.1 Software Upgrade and Patch
11.3.2 Service Transition Activity Effect Analysis
11.3.3 Mitigating Service Transition Activity Effects via VM Migration
11.3.5 Minimizing Procedural Errors
11.3.6 Service Orchestration Considerations
11.4 Many Distributed Clouds versus Fewer Huge Clouds
11.5 Minimizing Hardware-Attributed Downtime
11.5.1 Hardware Downtime in Traditional High Availability Configurations
11.6 Architectural Optimizations
11.6.1 Reliability and Availability Criteria
11.6.2 Optimizing Accessibility
11.6.3 Optimizing High Availability, Retainability, Reliability, and Quality
11.6.4 Optimizing Disaster Recovery
11.6.5 Operational Considerations
11.6.6 Case Study
11.6.7 Theoretically Optimal Application Architecture
12 DESIGN FOR RELIABILITY OF VIRTUALIZED APPLICATIONS
12.1 Design for Reliability
12.2 Tailoring DfR for Virtualized Applications
12.2.1 Hardware Independence Usage Scenario
12.2.2 Server Consolidation Usage Scenario
12.2.3 Multitenant Usage Scenario
12.2.4 Virtual Appliance Usage Scenario
12.2.5 Cloud Deployment Usage Scenario
12.3 Reliability Requirements
12.3.1 General Availability Requirements
12.3.2 Service Reliability and Latency Requirements
12.3.3 Overload Requirements
12.3.4 Online Capacity Growth and Degrowth
12.3.5 (Virtualization) Live Migration Requirements
12.3.6 System Transition Activity Requirements
12.3.7 Georedundancy and Service Continuity Requirements
12.4 Qualitative Reliability Analysis
12.4.1 SPOF Analysis for Virtualized Applications
12.4.2 Failure Mode Effects Analysis for Virtualized Applications
12.4.3 Capacity Growth and Degrowth Analysis
12.5 Quantitative Reliability Budgeting and Modeling
12.5.1 Availability (Downtime) Modeling
12.5.2 Converging Downtime Budgets and Targets
12.6 Robustness Testing
12.6.1 Baseline Robustness Testing
12.6.2 Advanced Topic: Can Virtualization Enable Better Robustness Testing?
12.7 Stability Testing
12.8 Field Performance Analysis
12.9 Reliability Roadmap
12.10 Hardware Reliability
13 DESIGN FOR RELIABILITY OF CLOUD SOLUTIONS
13.1 Solution Design for Reliability
13.2 Solution Scope and Expectations
13.3 Reliability Requirements
13.3.1 Solution Availability Requirements
13.3.2 Solution Reliability Requirements
13.3.3 Disaster Recovery Requirements
13.3.4 Elasticity Requirements
13.3.5 Specifying Configuration Parameters
13.4 Solution Modeling and Analysis
13.4.1 Reliability Block Diagram of Cloud Data Center Deployment
13.4.2 Solution Failure Mode Effects Analysis
13.4.3 Solution Service Transition Activity Effects Analysis
13.4.4 Cloud Data Center Service Availability (MP 2) Analysis
13.4.5 Aggregate Service Availability (MP 3) Modeling
13.4.6 Recovery Point Objective Analysis
13.5 Element Reliability Diligence
13.6 Solution Testing and Validation
13.6.1 Robustness Testing
13.6.2 Service Reliability Testing
13.6.3 Georedundancy Testing
13.6.4 Elasticity and Orchestration Testing
13.6.5 Stability Testing
13.6.6 In-Service Testing
13.7 Track and Analyze Field Performance
13.7.1 Cloud Service Measurements
13.7.2 Solution Reliability Roadmapping
13.8 Other Solution Reliability Diligence Topics
13.8.1 Service-Level Agreements
13.8.2 Cloud Service Provider Selection
14 SUMMARY
14.1 Service Reliability and Service Availability
14.2 Failure Accountability and Cloud Computing
14.3 Factoring Service Downtime
14.4 Service Availability Measurement Points
14.5 Cloud Capacity and Elasticity Considerations
14.6 Maximizing Service Availability
14.6.1 Reducing Product Attributable Downtime
14.6.2 Reducing Data Center Attributable Downtime
14.6.3 Reducing IT Service Management Downtime
14.6.4 Reducing Disaster Recovery Downtime
14.6.5 Optimal Cloud Service Availability
14.7 Reliability Diligence
14.8 Concluding Remarks
Abbreviations
References
About the Authors
Index