Virtualizing Mission-Critical Apps
1PM EST, 3/29/2011
Ilya Mirman
Agenda
• The Rise of “The Virtualization Chasm”
• 3 fundamental inefficiencies
• Best practices
Before Virtualization
[Figure: capacity chart]
• Traditional IT guarantees apps’ performance by:
– Dedicating physical machines (PMs) to apps
– Provisioning sufficient capacity to service peak loads
• Consider an app requiring 16 cores, 8 GB memory and 10k IOPS (IO operations per second) of IO bandwidth to service its peaks
Over-Provisioning Waste
• Workloads are ‘bursty’: the ratio of average to peak load is often under 10%
• Dedicating hardware wastes the slack capacity between average and peak
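To make the slack concrete, the sketch below computes the average-to-peak ratio and the average idle fraction of a PM dedicated to one app and sized for its peak. The hourly demand trace and the function name `overprovisioning_waste` are hypothetical, chosen only so the ratio lands under the 10% mentioned above.

```python
# Hypothetical hourly CPU demand (cores) of one app over 24 hours; not measured data.
hourly_cores = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                16, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


def overprovisioning_waste(demand, provisioned):
    """Return (average/peak ratio, average fraction of provisioned capacity left idle)."""
    average = sum(demand) / len(demand)
    peak = max(demand)
    idle_fraction = (provisioned - average) / provisioned
    return average / peak, idle_fraction


# A dedicated PM is provisioned for the peak, as in traditional IT.
ratio, idle = overprovisioning_waste(hourly_cores, provisioned=max(hourly_cores))
print(f"average/peak = {ratio:.1%}, idle capacity = {idle:.1%}")
```

With an average/peak ratio under 10%, more than 90% of the dedicated capacity sits idle on average.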
Virtualization is Set to Resolve This Waste
• Consolidate workloads onto shared PMs
• This increases average utilization: the average loads of co-located VMs add up
• But it also increases interference among VMs
– E.g., a traffic peak in VM1 can reduce the CPU available to other VMs
[Figure: peak workloads of VMs VM1 to VM10, consolidated onto PMs]
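A minimal sketch of both effects, assuming three hypothetical VMs with made-up per-interval CPU demands on one 16-core PM: the aggregate average is simply the sum of the VMs' averages, but in any interval where demands coincide, the aggregate can exceed capacity and the VMs interfere. The function name `consolidate` is illustrative.

```python
# Hypothetical per-interval CPU demand (cores) of three VMs; illustrative only.
vm_demand = {
    "VM1": [2, 2, 13, 2],
    "VM2": [3, 3, 3, 10],
    "VM3": [1, 8, 1, 1],
}
PM_CORES = 16  # capacity of the shared PM


def consolidate(demands, capacity):
    """Sum co-located VMs' demand per interval and report utilization and overloads."""
    aggregate = [sum(vals) for vals in zip(*demands.values())]
    avg_util = sum(aggregate) / (len(aggregate) * capacity)
    # Intervals where aggregate demand exceeds capacity, i.e. where VMs interfere.
    overloaded = [t for t, d in enumerate(aggregate) if d > capacity]
    return aggregate, avg_util, overloaded


aggregate, avg_util, overloaded = consolidate(vm_demand, PM_CORES)
print(aggregate, f"{avg_util:.0%}", overloaded)  # [6, 13, 17, 13] 77% [2]
```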
VMs Compete for Resources
• Best-effort resource allocation (vs. dedicated)
– VMs get their allocations if capacity is available
– VMs experience interference when capacity is insufficient
• Interference can create congestion, bottlenecks and delays
• Performance-insensitive apps can tolerate interference
– This permits simple, risk-free virtualization
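One way to picture best-effort allocation is the sketch below. It assumes, purely for illustration, that when requests exceed capacity every VM is scaled down in proportion to its request; real hypervisors use their own share, reservation and limit mechanisms, and `best_effort_allocate` is a hypothetical name.

```python
def best_effort_allocate(requests, capacity):
    """Grant every request in full when it fits; otherwise scale all of them down.

    Proportional scaling is an illustrative assumption, not any specific
    hypervisor's scheduler.
    """
    total = sum(requests.values())
    if total <= capacity:
        return dict(requests)
    scale = capacity / total
    return {vm: req * scale for vm, req in requests.items()}


# VM1 hits a traffic peak on a hypothetical 16-core PM.
print(best_effort_allocate({"VM1": 4, "VM2": 6, "VM3": 6}, 16))   # all requests granted
print(best_effort_allocate({"VM1": 12, "VM2": 6, "VM3": 6}, 16))  # every VM gets about 2/3
```

In the second call, VM2 and VM3 lose a third of their CPU because of VM1's peak, which is exactly the interference a performance-sensitive app cannot tolerate.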
The Rise of “The Virtualization Chasm”
[Figure: ROI vs. percentage of apps virtualized; “The Virtualization Chasm” separates Virtualization 1.0 (performance-insensitive apps) from Virtualization 2.0 (production apps)]
• Virtualization 1.0: virtualize performance-insensitive apps
– E.g., print servers, non-critical web apps (the low-hanging fruit): 20%-30% of enterprise apps
• Virtualization 2.0: virtualize production apps
The Key Challenge:
Ensuring That Production
Apps Get Their Resources
• Interference results from statistical over-commitment
– Apps’ demands can momentarily exceed capacity
• Interference may be controlled by two mechanisms
– Resource allocation: protect apps against over-commitment
– Workload placement: move workloads to minimize interference
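Statistical over-commitment can be quantified with a simple model. The sketch assumes, as a simplification not taken from the slides, that each VM peaks independently with the same probability in any interval; the probability of over-commitment is then a binomial tail. All numbers and the name `overcommit_probability` are hypothetical.

```python
from math import comb


def overcommit_probability(num_vms, p_peak, max_concurrent_peaks):
    """P(more than max_concurrent_peaks VMs peak at once), assuming each VM peaks
    independently with probability p_peak in any interval (a simplifying assumption)."""
    p_ok = sum(
        comb(num_vms, k) * p_peak**k * (1 - p_peak) ** (num_vms - k)
        for k in range(max_concurrent_peaks + 1)
    )
    return 1 - p_ok


# Hypothetical: 10 VMs, each peaking 10% of the time; the PM can absorb 3 concurrent peaks.
print(f"{overcommit_probability(10, 0.10, 3):.2%}")  # 1.28%
```

Even a small per-interval probability adds up over thousands of intervals, which is why production apps need either protective allocations or interference-aware placement.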
VMware Best Practices:
Managing Production Apps’ Performance
Best Practice Guide to Exchange Server Virtualization:
http://www.vmware.com/files/pdf/Exchange_2010_on_VMware_-_Best_Practices_Guide.pdf
• Assure peak utilization: “It is recommended that standalone servers…be designed to not exceed 70% utilization during peak period.”
• Avoid over-commitment: “For performance-critical Exchange virtual machines (i.e., production systems), try to ensure the total number of vCPUs assigned to all the virtual machines is equal…”
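The 70% peak-utilization guideline translates into a simple sizing rule. The sketch below applies it, purely as an illustration, to the example app's peaks of 16 cores and 10k IOPS; the guide states the figure for standalone Exchange servers, and the function names are hypothetical.

```python
def min_capacity_for_peak(peak_demand: int, max_peak_pct: int = 70) -> int:
    """Smallest integer capacity C such that peak_demand / C <= max_peak_pct / 100."""
    # Integer ceiling division avoids float rounding at exact boundaries.
    return -(-peak_demand * 100 // max_peak_pct)


def meets_peak_target(peak_demand: int, capacity: int, max_peak_pct: int = 70) -> bool:
    """True if running at peak_demand keeps utilization at or below max_peak_pct."""
    return peak_demand * 100 <= max_peak_pct * capacity


print(min_capacity_for_peak(16))      # 23 (cores)
print(min_capacity_for_peak(10_000))  # 14286 (IOPS)
print(meets_peak_target(16, 16))      # False: sized exactly to peak means 100% at peak
```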
VMware Best Practices:
Managing Production Apps’ Performance
VMware’s Production Apps Strategy Rests on 2 Rules:
VMs running production apps should ensure that:
R-I: “Resource allocations are sufficient to serve peak demands.”
R-II: “Aggregate allocations do not exceed the PM capacity.”
R-I guarantees that an app may get its peak demands served, if capacity is available.
R-II guarantees that the capacity allocation will be available.
I.e., if VM1 and VM2 each need 4 vCPUs, we need a PM with ≥8 CPUs!
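The two rules reduce to a per-VM check and an aggregate check. A minimal sketch, assuming each VM is described by a peak vCPU demand and a vCPU allocation (the `VM` class and `check_rules` are hypothetical names), reproduces the slide’s VM1/VM2 example:

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    peak_vcpus: int       # vCPUs needed to serve the VM's peak demand
    allocated_vcpus: int  # vCPUs allocated (reserved) to the VM


def check_rules(vms, pm_cpus):
    """Return (R-I holds, R-II holds) for a set of VMs placed on one PM."""
    r1 = all(vm.allocated_vcpus >= vm.peak_vcpus for vm in vms)  # allocations cover peaks
    r2 = sum(vm.allocated_vcpus for vm in vms) <= pm_cpus      # no over-commitment
    return r1, r2


vms = [VM("VM1", 4, 4), VM("VM2", 4, 4)]
print(check_rules(vms, pm_cpus=8))  # (True, True)
print(check_rules(vms, pm_cpus=6))  # (True, False): R-II violated
```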
Wait… Really? Then why virtualize?
• Even though resources are not shared, you still enjoy the other benefits of virtualization (app isolation, VM set-up, back-up, etc.)
Virtualization Can Result in
3 Fundamental Inefficiencies
1. Over-provisioning inefficiency
2. Workload-packing inefficiency
3. Non-adaptive control inefficiency
How to Avoid Over-Provisioning Waste?
• To avoid waste: increase the average workload without increasing reservations
– Add performance-insensitive apps with a high average workload
– E.g., consolidate spam-filter and email-archival apps alongside mission-critical apps
• This calls for an additional best-practice rule: smart consolidation
Best Practice #1:
Maintain a consolidation balance between performance-sensitive and performance-insensitive workloads
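A minimal sketch of balanced consolidation, with hypothetical numbers: two production VMs reserve their peaks (R-I) and together fill the PM’s reservations (R-II), yet average only 2.5 cores. Adding performance-insensitive apps that hold no reservation raises average utilization while total reservations stay unchanged. The dictionaries and `pm_stats` are illustrative, not a real API.

```python
PM_CORES = 16

# Hypothetical production VMs: reservations sized to peaks (R-I), low averages.
sensitive = [{"reserved": 8, "avg": 1.0}, {"reserved": 8, "avg": 1.5}]
# Hypothetical performance-insensitive apps (e.g., spam filter, email archival):
# no reservation; assumed to tolerate being squeezed during production peaks.
insensitive = [{"reserved": 0, "avg": 4.0}, {"reserved": 0, "avg": 3.0}]


def pm_stats(vms, capacity):
    """Return (total reserved cores, average utilization) of the VMs on one PM."""
    reserved = sum(vm["reserved"] for vm in vms)
    avg_util = sum(vm["avg"] for vm in vms) / capacity
    return reserved, avg_util


print(pm_stats(sensitive, PM_CORES))                # (16, 0.15625)
print(pm_stats(sensitive + insensitive, PM_CORES))  # (16, 0.59375)
```

The balance matters: the insensitive apps absorb any interference, so adding too many of them only hurts their own throughput, while adding none leaves the reserved capacity mostly idle.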
A Greatly Simplified Example
[Figure: manual, ad hoc assignment of virtualized workloads VM1 to VM6 onto PM1, PM2 and PM3]
CPU capacity: 16 cores
Memory capacity: 8 GB
What If We Get New VMs?
[Figure: ad hoc assignment of new VMs VM7 to VM10 spreads the workloads across PM1 to PM5]
• Can we do better?
• An optimized assignment uses 40% fewer resources (3 PMs vs. 5)
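A placement algorithm can find the tighter packing automatically. The slides do not say which optimizer produced the 3-PM result, so the sketch below swaps in first-fit decreasing, a classic bin-packing heuristic (not guaranteed optimal), extended to two dimensions. It assumes each PM has the example’s 16 cores and 8 GB, and the VM sizes are hypothetical.

```python
from typing import Dict, List, Tuple

PM_CPU, PM_MEM = 16, 8  # assumed per-PM capacity from the example: 16 cores, 8 GB


def first_fit_decreasing(vms: Dict[str, Tuple[int, int]]) -> List[List[str]]:
    """Pack VMs given as (cores, GB) onto identical PMs using first-fit decreasing."""
    # Largest VMs first, sized by whichever resource they use the larger share of.
    order = sorted(vms, key=lambda n: max(vms[n][0] / PM_CPU, vms[n][1] / PM_MEM), reverse=True)
    placement: List[List[str]] = []
    used: List[List[int]] = []  # [cores used, GB used] per PM
    for name in order:
        cpu, mem = vms[name]
        for pm, (c, m) in enumerate(used):
            if c + cpu <= PM_CPU and m + mem <= PM_MEM:  # fits in both dimensions
                placement[pm].append(name)
                used[pm][0] += cpu
                used[pm][1] += mem
                break
        else:  # no existing PM has room: open a new one
            placement.append([name])
            used.append([cpu, mem])
    return placement


# Hypothetical VM sizes (cores, GB); the slides do not give the actual figures.
vms = {"VM1": (8, 2), "VM2": (4, 4), "VM3": (6, 2), "VM4": (2, 4), "VM5": (8, 2),
       "VM6": (4, 2), "VM7": (4, 2), "VM8": (6, 2), "VM9": (2, 2), "VM10": (4, 2)}
for i, pm in enumerate(first_fit_decreasing(vms), start=1):
    print(f"PM{i}: {pm}")
```

For these sizes the heuristic fills three PMs exactly; with other sizes it may need more PMs than an optimal packing would.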
What Can We Learn from This Example?
• Changes may require (re-)assignment of workloads
• Even a greatly simplified example can be very complex
• Complexity and waste can grow dramatically
– When the number of VMs increases
– When physical machines vary
– When there are constraints (e.g., storage access, security policies)
– When the rate of change is high
• Ad hoc processes can lead to costly inefficiencies
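The growth in complexity is easy to see by counting candidate assignments: each of n VMs can go to any of m PMs, giving m^n possibilities before any capacity check. Finding a minimal packing is the bin-packing problem, which is NP-hard, so practical tools rely on heuristics rather than enumeration. The function name below is illustrative.

```python
def assignment_count(num_vms: int, num_pms: int) -> int:
    """Number of ways to map each VM to one PM, before any capacity check."""
    return num_pms ** num_vms


print(assignment_count(6, 3))    # 729: six VMs on three PMs
print(assignment_count(10, 5))   # 9765625
print(assignment_count(50, 10))  # 1 followed by 50 zeros
```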
Overcoming the Packing Inefficiency
• Use improved workload placement algorithms
– Look holistically at all workloads and resources
– Exploit the flexibility of performance-insensitive workloads
– Exploit the dynamics of workloads’ peaks and troughs
Best Practice #2:
Use improved workload placement algorithms
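Exploiting peaks and troughs means comparing the sum of individual peaks (what static, per-app sizing reserves) with the peak of the combined load (what co-located apps actually need at once). The two demand profiles below are hypothetical: one app peaks overnight, the other during business hours.

```python
# Hypothetical CPU demand (cores) in six 4-hour slots, starting at midnight.
night_app = [10, 2, 1, 1, 1, 2]  # e.g., an app that peaks overnight
day_app = [1, 2, 8, 9, 6, 2]     # e.g., an interactive app that peaks in business hours


def sum_of_peaks(*series):
    """Capacity needed if each workload is sized for its own peak."""
    return sum(max(s) for s in series)


def peak_of_sum(*series):
    """Capacity needed when co-located workloads share it over time."""
    return max(sum(slot) for slot in zip(*series))


print(sum_of_peaks(night_app, day_app))  # 19
print(peak_of_sum(night_app, day_app))   # 11
```

Pairing complementary workloads only pays off while their peaks stay apart; if the profiles shift, placement has to be revisited.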
Mission-Critical App Example
[Figure: IO rate (k-IOPS) of the app over a 24-hour period, 15:00 to 14:00 the next day]
• Virtualized MS Exchange app
• High IOPS during the night (2AM-5AM)
– Peak: 10 k-IOPS
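Identifying the busy window from an hourly trace is the first step toward exploiting it. The trace below is hypothetical, shaped to match the description above (a 10 k-IOPS peak between 2AM and 5AM), and the threshold of 5 k-IOPS is an arbitrary illustrative choice.

```python
# Hypothetical hourly IO rate (k-IOPS) by hour of day (index 0 = midnight), shaped
# to match the example: a 10 k-IOPS peak between 2AM and 5AM. Not measured data.
kiops = [2, 3, 10, 10, 10, 4, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]


def busy_hours(trace, threshold):
    """Hours of the day whose load is at or above threshold."""
    return [hour for hour, load in enumerate(trace) if load >= threshold]


busy = busy_hours(kiops, threshold=5)
print(busy)                                         # [2, 3, 4], i.e. 2AM-5AM
print(f"{len(busy) / len(kiops):.1%} of the day")   # 12.5% of the day
```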
What If Workloads Grow?
• What if VM1 needs more memory & storage?
[Figure: VM1 to VM6 assigned across PM1 to PM4]
• Can we do better?
• An optimized assignment uses 25% fewer resources
Adaptive vs. Non-Adaptive Workload Control
• Workload demands (and interference) change over time
– E.g., the Exchange server is active through the night
– Why keep its reservation during the day?
• Static workload management is limited in handling emergent problems
– App profiles reflect long-term statistics; fluctuations can cause interference
• Adaptive workload control offers superior management
– Exploit workload dynamics to reduce the waste of static policies
– Eliminate emergent interference
Best Practice #3:
Provide adaptive control to
optimize resource use & avoid
interference
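The predictable part of adaptive control can be sketched as a time-of-day reservation schedule. Assuming a hypothetical hourly profile shaped like the Exchange example and a hypothetical 25% headroom, the schedule keeps every hour’s reservation above its expected demand yet reserves far less than holding the 10 k-IOPS peak all day. Reacting to emergent interference would additionally need live monitoring, which this sketch does not model.

```python
# Hypothetical hourly Exchange-like profile (k-IOPS); index 0 = midnight.
profile = [2, 3, 10, 10, 10, 4, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
HEADROOM = 1.25  # hypothetical safety margin over the expected hourly demand


def static_reservation(profile):
    """Static policy: reserve the all-day peak around the clock."""
    return [max(profile)] * len(profile)


def scheduled_reservation(profile, headroom=HEADROOM):
    """Time-of-day policy: reserve headroom x expected demand, capped at the peak."""
    peak = max(profile)
    return [min(peak, headroom * p) for p in profile]


static = static_reservation(profile)
scheduled = scheduled_reservation(profile)
assert all(r >= p for r, p in zip(scheduled, profile))  # every hour covers its expected demand
print(f"reserved k-IOPS-hours: static {sum(static)}, scheduled {sum(scheduled)}")
print(f"saving: {1 - sum(scheduled) / sum(static):.0%}")
```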
Best Practice #4:
Use forward-looking workload projection
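Forward-looking projection lets the system act before a growing workload outgrows its reservation. The sketch swaps in the simplest possible projection, a least-squares linear trend over daily peaks; a production tool would also account for seasonality and noise. The daily peaks, the 10 k-IOPS reservation and the name `linear_forecast` are hypothetical.

```python
def linear_forecast(history, steps_ahead):
    """Least-squares linear trend over equally spaced samples, extrapolated
    steps_ahead samples past the last one. Needs at least two samples."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
    slope = sxy / sxx
    return y_mean + slope * (n - 1 + steps_ahead - x_mean)


# Hypothetical daily peaks (k-IOPS) of a growing workload, and its current reservation.
daily_peaks = [8.0, 8.5, 9.0, 9.5]
RESERVATION = 10.0

projected = linear_forecast(daily_peaks, steps_ahead=4)
print(projected)  # 11.5
if projected > RESERVATION:
    print("Projected peak exceeds the reservation: resize or re-place before it happens.")
```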
Adaptive Control:
Too Complex for Manual Management
• Manual management requires administrators to:
– Master voluminous details of hypervisor and application internals
– Manage interference and waste problems manually
– Manage resource allocations and move applications as workloads change
– Maintain tight coordination between virtualization and app administrators
Virtualizing Production Apps:
Conclusions
• Workload placement can be very inefficient
– Over-provisioning waste; workload-packing waste; non-adaptive inefficiencies
• Virtualization is much too complex for manual administration
• It must be augmented by workload management:
– Eliminate the over-provisioning waste through balanced consolidation
– Minimize the workload-packing waste by exploiting workload features
– Support adaptive control to optimize resource use & avoid interference