Virtualizing Mission-Critical Apps


1PM EST, 3/29/2011

Ilya Mirman


Agenda

• The Rise of “The Virtualization Chasm”
• 3 fundamental inefficiencies
• Best practices


Before Virtualization

[Figure: capacity chart, axis from 2 to 16]

• Traditional IT guarantees apps’ performance by
– Dedicating physical machines (PMs) to apps
– Provisioning sufficient capacity to service peak loads
• Consider an app requiring 16 cores, 8 GB of memory and 10k IOPS (I/O operations per second) of I/O bandwidth to service its peaks


Over-Provisioning Waste

• Workloads are ‘bursty’: the average-to-peak ratio is often under 10%
• Dedicating hardware wastes the slack capacity between average & peak
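The size of this waste follows from simple arithmetic. The sketch below is illustrative only: `slack_fraction` is a hypothetical helper, and the figures assume a machine provisioned with 16 cores for peak and an average load at 10% of that peak.

```python
def slack_fraction(average_load: float, peak_capacity: float) -> float:
    """Fraction of the provisioned capacity that sits idle on average."""
    if peak_capacity <= 0:
        raise ValueError("peak_capacity must be positive")
    return 1.0 - average_load / peak_capacity


# Assumed figures: 16 cores provisioned for peak, average load at 10% of peak.
peak_cores = 16
average_cores = 0.10 * peak_cores
idle = slack_fraction(average_cores, peak_cores)
print(f"{idle:.0%} of the dedicated cores are idle on average")
```

Under these assumptions, roughly nine tenths of the dedicated capacity is slack.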


Virtualization is Set to Resolve This Waste

• Consolidate workloads into shared PMs
• This increases average utilization additively
• But it also increases interference among VMs
– E.g., peak traffic of VM1 can interfere with CPU availability for other VMs
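Both effects can be seen in a small sketch. The per-interval demands and the 16-core PM below are made-up values, and `consolidated_load` and `congested_intervals` are hypothetical helper names: average utilization adds up across the consolidated VMs, but so do coinciding peaks.

```python
from typing import Sequence


def consolidated_load(vm_loads: Sequence[Sequence[float]]) -> list[float]:
    """Sum the per-interval loads of all VMs sharing one PM."""
    return [sum(interval) for interval in zip(*vm_loads)]


def congested_intervals(vm_loads: Sequence[Sequence[float]], capacity: float) -> list[int]:
    """Indices of intervals where combined demand exceeds the PM capacity."""
    return [t for t, load in enumerate(consolidated_load(vm_loads)) if load > capacity]


# Hypothetical CPU demand (cores) per interval for three VMs on a 16-core PM.
vm1 = [1, 1, 12, 1]   # bursty: peaks in interval 2
vm2 = [2, 2, 2, 2]    # steady
vm3 = [1, 6, 6, 1]    # peak overlaps with VM1
loads = [vm1, vm2, vm3]

total = consolidated_load(loads)
print(total)                              # [4, 9, 20, 4]
print(sum(total) / (len(total) * 16))     # average utilization of the shared PM
print(congested_intervals(loads, 16))     # [2]
```

In interval 2 the combined demand exceeds the PM, so VM1's peak interferes with VM2 and VM3.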

[Figure: peak workloads of VM1 through VM10, consolidated onto shared PMs]


VMs Compete for Resources

• Best-effort resource allocations (vs. dedicated)
– VMs get their allocations, if capacity is available
– VMs experience interference when capacity is insufficient
• Interference can create congestion, bottlenecks and delays
• Performance-insensitive apps can tolerate interference
– Permit simple, risk-free virtualization


The Rise of “The Virtualization Chasm”

[Figure: ROI versus percentage of apps virtualized (ticks at 20%, 40%, 80%, 100%); regions labeled Virtualization 1.0 (performance-insensitive apps) and Virtualization 2.0 (production apps), with “The Virtualization Chasm” marked]

• Virtualization 1.0: Virtualize performance-insensitive apps
– E.g., print servers, non-critical web apps (the low-hanging fruit)
– 20%-30% of enterprise apps
• Virtualization 2.0: Virtualize production apps


The Key Challenge: Ensuring That Production Apps Get Their Resources

• Interference results from statistical over-commitment
– Apps’ demands can exceed capacity momentarily
• Interference may be controlled by two mechanisms
– Resource allocation: protect apps against over-commitment
– Workload placement: move workloads to minimize interference


VMware Best Practices: Managing Production Apps’ Performance

Best Practice Guide to Exchange Server Virtualization:
http://www.vmware.com/files/pdf/Exchange_2010_on_VMware_-_Best_Practices_Guide.pdf

Assure Peak Utilization:

“It is recommended that standalone servers…be designed to not exceed 70% utilization during peak period.”

Avoid Over-Commitment:

“For performance-critical Exchange virtual machines (i.e., production systems), try to ensure the total number of vCPUs assigned to all the virtual machines is equal


VMware Best Practices: Managing Production Apps’ Performance

VMware’s production apps strategy rests on 2 rules. VMs running production apps should ensure that:

R-I: “Resource allocations are sufficient to serve peak demands.”

R-II: “Aggregate allocations do not exceed the PM capacity.”

• R-I guarantees that an app may get its peak demands served, if capacity is available.
• R-II guarantees that the capacity allocation will be available.

I.e., if VM1 and VM2 each need 4 vCPUs, we need a PM with ≥8 CPUs!
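The two rules can be expressed as simple checks. This is a minimal sketch, not VMware tooling: the `VM` record and the `satisfies_rule_1` / `satisfies_rule_2` helpers are hypothetical names introduced for illustration, and only vCPUs are modeled.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    allocated_vcpus: int   # what the VM is configured with
    peak_vcpus: int        # what the app needs at its peak


def satisfies_rule_1(vm: VM) -> bool:
    """R-I: the allocation is sufficient to serve the peak demand."""
    return vm.allocated_vcpus >= vm.peak_vcpus


def satisfies_rule_2(vms: list[VM], pm_cpus: int) -> bool:
    """R-II: aggregate allocations do not exceed the PM capacity."""
    return sum(vm.allocated_vcpus for vm in vms) <= pm_cpus


# The example from the slide: two VMs that each need 4 vCPUs.
vms = [VM("VM1", 4, 4), VM("VM2", 4, 4)]
print(all(satisfies_rule_1(vm) for vm in vms))   # True
print(satisfies_rule_2(vms, pm_cpus=8))          # True
print(satisfies_rule_2(vms, pm_cpus=6))          # False: over-committed
```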


Wait… Really? Then why virtualize?

• Though there’s no sharing of resources, we still enjoy the other benefits of virtualization (app isolation, VM set-up, back-up, etc.)


Virtualization Can Result in 3 Fundamental Inefficiencies

1. Over-provisioning inefficiency
2. Workload packing inefficiency
3. Non-adaptive control inefficiency


How to Avoid Over-Provisioning Waste?

• To avoid waste: increase average workload without increasing reservations
– Add performance-insensitive apps with high average workload
– E.g., consolidate spam-filter apps, email archival apps alongside mission-critical apps
• Need an additional best-practice rule: smart consolidation

Best Practice #1: Maintain a consolidation balance between performance-sensitive and performance-insensitive workloads


A Greatly Simplified Example

[Figure: manual ad hoc assignment of virtualized workloads VM1 through VM6 to PM1 through PM3]

CPU capacity: 16 cores

Memory capacity: 8 GB


What If We Get New VMs?

[Figure: ad hoc assignment of new VMs (VM7 through VM10) across PM1 through PM5, compared with an assignment using PM1 through PM3]

• Can we do better?
• Optimized assignment uses 40% fewer resources (3 PMs vs. 5)
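The gap between ad hoc and optimized assignment can be reproduced with a standard bin-packing heuristic. The slides do not say which algorithm produced the optimized assignment; the sketch below swaps in first-fit decreasing as one plausible choice. The PM capacity (16 cores, 8 GB) comes from the example, while the VM demands and the names `Demand`, `first_fit` and `first_fit_decreasing` are assumptions made for illustration.

```python
from typing import NamedTuple


class Demand(NamedTuple):
    name: str
    cores: int
    mem_gb: int


PM_CORES, PM_MEM_GB = 16, 8  # per-PM capacity from the example


def first_fit(vms: list[Demand]) -> list[list[Demand]]:
    """Place each VM on the first PM with room; open a new PM otherwise.

    Assumes every VM fits on an empty PM.
    """
    pms: list[list[Demand]] = []
    for vm in vms:
        for pm in pms:
            if (sum(v.cores for v in pm) + vm.cores <= PM_CORES
                    and sum(v.mem_gb for v in pm) + vm.mem_gb <= PM_MEM_GB):
                pm.append(vm)
                break
        else:  # no existing PM had room
            pms.append([vm])
    return pms


def first_fit_decreasing(vms: list[Demand]) -> list[list[Demand]]:
    """First-fit after sorting VMs by their dominant resource share, largest first."""
    def size(vm: Demand) -> float:
        return max(vm.cores / PM_CORES, vm.mem_gb / PM_MEM_GB)
    return first_fit(sorted(vms, key=size, reverse=True))


# Hypothetical demands, listed in arrival order.
vms = [Demand("VM1", 6, 2), Demand("VM2", 6, 2), Demand("VM3", 6, 2),
       Demand("VM4", 10, 2), Demand("VM5", 10, 2), Demand("VM6", 10, 2)]

ad_hoc = first_fit(vms)                # placement in arrival order
packed = first_fit_decreasing(vms)     # placement after sorting
print(len(ad_hoc), len(packed))        # 4 3
```

With these made-up demands, arrival-order placement needs four PMs while the sorted placement needs three. First-fit decreasing is a heuristic and does not guarantee an optimal packing in general.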


What Can We Learn from This Example?

• Changes may require (re-)assignment of workloads
• Even a trivialized example can be very complex
• Complexity and waste can grow dramatically
– When the number of VMs increases
– When physical machines vary
– When there are constraints (e.g., storage access, security policies)
– When the rate of changes is high
• Ad hoc processes can lead to costly inefficiencies


Overcoming the Packing Inefficiency

• Use improved workload placement algorithms
– Look holistically at all workloads and resources
– Exploit the flexibility of performance-insensitive workloads
– Exploit the dynamics of workload peaks & troughs

Best Practice #2: Use improved workload placement algorithms
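One way to exploit peaks and troughs is to co-locate workloads whose peaks fall at different times. The sketch below uses made-up demand profiles and hypothetical helper names. It compares the capacity needed when each peak is reserved separately with the capacity needed when the combined profile is considered.

```python
def sum_of_peaks(profiles: list[list[float]]) -> float:
    """Capacity needed if every workload's own peak is reserved."""
    return sum(max(profile) for profile in profiles)


def peak_of_sum(profiles: list[list[float]]) -> float:
    """Capacity needed to serve the combined demand in every period."""
    return max(sum(period) for period in zip(*profiles))


# Hypothetical demand (cores) over four periods of a day.
night_batch = [8, 2, 1, 6]   # busy at night
daytime_web = [1, 6, 8, 2]   # busy during the day
profiles = [night_batch, daytime_web]

print(sum_of_peaks(profiles))   # 16
print(peak_of_sum(profiles))    # 9
```

Because the two peaks do not coincide, a placement that looks at the whole profile needs far less capacity than one that sizes for each peak independently.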


Mission-Critical App Example

[Figure: IOPS rate (k-IOPS, 1 to 10) versus time of day (hours 15 through 14)]

• Virtualized MS Exchange app
• High IOPS during the night (2AM-5AM)
– Peak: 10 k-IOPS


What If Workloads Grow?

• What if VM1 needs more memory & storage?
• Can we do better?
• Optimized assignment uses 25% fewer resources

[Figure: assignment of VM1 through VM6 across PM1 through PM4]


Adaptive vs. Non-Adaptive Workload Control

• Workload demands (and interference) change over time
– E.g., Exchange server is active through the night
– Why keep its reservation during the day?
• Static workload management is limited in handling emergent problems
– App profiles reflect long-term statistics; fluctuations can cause interference
• Adaptive workload control offers superior management
– Exploit workload dynamics to reduce the waste of static policies
– Eliminate emergent interference

Best Practice #3: Provide adaptive control to optimize resource use & avoid interference

Best Practice #4: Use forward-looking workload projection
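The difference between a static and an adaptive reservation can be sketched as follows. This is an illustration under stated assumptions, not a description of any product: the projection rule (highest value seen for each time slot), the 20% headroom, the demand history and the function names are all hypothetical.

```python
def project_demand(history: list[list[float]]) -> list[float]:
    """Project each time slot's demand as the highest value seen for that slot."""
    return [max(samples) for samples in zip(*history)]


def adaptive_reservations(history: list[list[float]], headroom: float = 0.2) -> list[float]:
    """Reserve the projected demand plus headroom, slot by slot."""
    return [demand * (1 + headroom) for demand in project_demand(history)]


def static_reservations(history: list[list[float]], headroom: float = 0.2) -> list[float]:
    """Reserve the overall peak plus headroom in every slot."""
    projection = project_demand(history)
    return [max(projection) * (1 + headroom)] * len(projection)


# Hypothetical k-IOPS per time slot over two days: [night, morning, afternoon, evening].
history = [[10, 2, 1, 1],
           [9, 3, 1, 2]]

projection = project_demand(history)
adaptive = adaptive_reservations(history)
static = static_reservations(history)
print(projection)                    # [10, 3, 1, 2]
print(sum(adaptive), sum(static))    # total reserved k-IOPS across the four slots
```

In this example the adaptive schedule keeps the full reservation only for the night slot and releases most of it during the day, which is the waste a static policy cannot recover.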


Adaptive Control: Too Complex for Manual Management

• Manual management requires administrators to:
– Master voluminous details of hypervisor and application internals
– Manage interference and waste problems manually
– Manage resource allocations and move applications as workloads change
– Maintain tight coordination between virtualization & app administrators


Virtualizing Production Apps:


Conclusions

• Workload placement can be very inefficient
– Over-provisioning waste; workload-packing waste; non-adaptive inefficiencies
• Virtualization is much too complex for manual administration
• Must be augmented by workload management:
– Eliminate the over-provisioning waste through balanced consolidation
– Minimize the workload-packing waste by exploiting workload features
– Support adaptive control to optimize resource use & avoid interference

Virtualization 2.0 Strategy:
