Data Center Network Minerals Tutorial


vManage: Loosely Coupled Platform and Virtualization Management in Data Centers

Sanjay Kumar (Intel), Vanish Talwar (HP Labs),

Vibhore Kumar (IBM Research), Partha Ranganathan (HP Labs),

Karsten Schwan (Georgia Tech)


Problem – Silos of management domains

• Inefficiency and redundancy
• High costs for customers

Platforms Mgmt. (e.g., server config, power mgmt.)
Virtualization/OS Mgmt. (e.g., VM prov., runtime monitoring)
App Mgmt. (e.g., SLA mgmt., patch updates)

Subsystem-level solutions across layers


An illustrative problem with silos

[Figure: average power (Watts) and response time (msecs) plotted against time (sec)]

1 July 2009

• Oscillations among SLA and power violations
• Significant violations & instability due to oscillatory behavior


vManage: New enablements for coordinated mgmt.

[Diagram components: Node(s); Platform Sensors & Actuators; Virtualization & App Sensors & Actuators; Local Virtualization Access Point; Management Node; Platform Manager; Cluster Virtualization Manager; Utilization Repository; Power Usage Repository]

Key building blocks

[Diagram: the same components as the previous slide, plus Coordinator 1, Coordinator 2, and two Registry & Proxy Service instances]


vManage: New enablements for coordinated mgmt.

[Diagram: Node(s), Management Node, sensors & actuators, repositories, Platform Manager, Cluster Virtualization Manager, Coordinators, and Registry & Proxy Services, as on the previous slide]

Key building blocks
• Platform-aware Virtualization Management
• Virtualization-aware Platform Management

Benefits
• Structured & automated
• Loosely-coupled & extensible
• Works with legacy controllers

Challenges
• Discovery & meta-data registration
• Coordinated policies
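The discovery and meta-data registration challenge can be illustrated with a minimal in-memory sketch. The slides do not specify the registry interface, so the `Endpoint` fields, the `Registry` class, and its `register` and `discover` methods are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """Metadata a sensor or actuator registers (all field names are hypothetical)."""
    name: str   # e.g. "node1/power"
    kind: str   # "sensor" or "actuator"
    layer: str  # "platform", "virtualization", or "app"
    node: str   # node that hosts the endpoint


@dataclass
class Registry:
    """In-memory stand-in for a registry service (an assumption, not the vManage API)."""
    _entries: dict = field(default_factory=dict)

    def register(self, endpoint: Endpoint) -> None:
        # Later registrations under the same name overwrite earlier ones.
        self._entries[endpoint.name] = endpoint

    def discover(self, **criteria) -> list:
        """Return endpoints whose metadata matches every keyword given."""
        return [
            e for e in self._entries.values()
            if all(getattr(e, key) == value for key, value in criteria.items())
        ]


registry = Registry()
registry.register(Endpoint("node1/power", "sensor", "platform", "node1"))
registry.register(Endpoint("node1/pstate", "actuator", "platform", "node1"))
registry.register(Endpoint("node1/vm-migrate", "actuator", "virtualization", "node1"))

# A coordinator could look up the actuators of another management layer like this.
platform_actuators = registry.discover(kind="actuator", layer="platform")
print([e.name for e in platform_actuators])
```

Keeping lookup keyed on metadata rather than on hard-wired references is one way a coordinator could stay loosely coupled to the managers it talks to.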


Coordinated policies

Current state of the art

[Diagram components: Requests; Virtualization Mgr.; VM Resource Requirements; Node 1, Node 2, Node 3, ..., Node n]

Coordinated policies

Platform-aware Virtualization Manager

[Diagram components: Requests; Virtualization Mgr.; Coordinator; VM migration]

Coordinator: VM Res Rqt + Power Budget
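A minimal sketch of what combining VM resource requirements with a power budget could look like when choosing candidate nodes. The `Node` fields, the units, and the per-VM power estimate are assumptions for illustration; the slides do not give the actual policy.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float      # unallocated CPU capacity (hypothetical units)
    power_watts: float   # current power draw
    power_budget: float  # power cap for this node


def feasible_nodes(nodes, vm_cpu, vm_power_estimate):
    """Keep nodes that satisfy both the resource and the power constraint."""
    return [
        n for n in nodes
        if n.free_cpu >= vm_cpu
        and n.power_watts + vm_power_estimate <= n.power_budget
    ]


nodes = [
    Node("node1", free_cpu=2.0, power_watts=290.0, power_budget=300.0),
    Node("node2", free_cpu=1.0, power_watts=240.0, power_budget=300.0),
    Node("node3", free_cpu=0.2, power_watts=220.0, power_budget=300.0),
]

# node1 has CPU headroom but would exceed its power budget;
# node3 has power headroom but too little CPU.
candidates = feasible_nodes(nodes, vm_cpu=0.5, vm_power_estimate=20.0)
print([n.name for n in candidates])
```

A resource-only manager would have accepted node1 here, which is the kind of decision that later triggers a power violation.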


Coordinated policies

Platform-aware Virtualization Manager

[Diagram components: Requests; Virtualization Mgr.; Coordinator]

Coordinator: VM Res Rqt + Power Budget + Stability, e.g.,

$$P_j^r(t_0, T) = \frac{1}{T} \int_{t_0}^{t_0 + T} F_{v_{M_j}}\left(a_j^r(t),\, t\right) dt$$

Coordinated policies

Current state of the art

[Diagram components: Power Mgr.; Perf counters; Power Violations; P-State actuators]


Coordinated policies

Virtualization-aware Power Manager

[Diagram components: Power Mgr.; Power Violations; P-State actuators; Coordinator; SLA notifications; SLA + power violations; VM migration actuators (to Platform-aware Virtualization Manager); Virtualization Mgr.; Requests]

Stability

Platform-aware Virtualization Manager

Coordinator: VM Res Rqt + Power Budget + Stability

Stability criterion

Pick the node with the highest probability that the placement decision remains valid over a given duration in the future.


Stability (contd.)

• Assume we can do offline profiling of applications, or that behavior traces are available
• i.e., the mean, standard deviation, and PDF are known a priori
• Average probability with which the host can provide sufficient resources of a particular type to a set M of VMs over a given time interval T is

$$P_j^r(t_0, T) = \frac{1}{T} \int_{t_0}^{t_0 + T} F_{v_{M_j}}\left(a_j^r(t),\, t\right) dt$$

• If we assume the PDF to have a normal distribution

$$F_{v_{M_j}}(x, t) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x - \mu_{v_{M_j}}(t)}{\sigma_{v_{M_j}}(t)\sqrt{2}}\right)\right)$$
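A minimal sketch of the stability computation under the normal-distribution assumption. It evaluates the normal CDF with `math.erf` and averages it over the interval from t0 to t0 + T using the midpoint rule, then picks the host with the highest score. The function names, the host data, and the reading of the inputs (available resources and the mean and standard deviation of aggregate VM demand, each as a function of time) are assumptions for illustration, not the authors' implementation.

```python
import math


def demand_cdf(x, mu, sigma):
    """Normal CDF: probability that aggregate VM demand is at most x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def stability(available, mu, sigma, t0, T, steps=100):
    """Average of F(a(t), t) over [t0, t0 + T], approximated by the midpoint rule.

    available, mu, and sigma are functions of time t.
    """
    dt = T / steps
    total = 0.0
    for i in range(steps):
        t = t0 + (i + 0.5) * dt
        total += demand_cdf(available(t), mu(t), sigma(t)) * dt
    return total / T


# Hypothetical hosts: constant capacity, constant demand statistics.
hosts = {
    "A": dict(available=lambda t: 4.0, mu=lambda t: 3.0, sigma=lambda t: 1.0),
    "B": dict(available=lambda t: 4.0, mu=lambda t: 4.0, sigma=lambda t: 1.0),
}

scores = {name: stability(t0=0.0, T=600.0, **h) for name, h in hosts.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```

Host A leaves one standard deviation of headroom above mean demand, so it scores about 0.84, while host B, with no headroom, scores 0.5.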

Prototype implementation

Per-node view

[Diagram components: Application; OS; Guest VM(s); Xen Hypervisor; Dom0; Dom-M; Power Manager; Coordinator; Registry & Proxy Services; SLA Sensors; Virt. Sensors, Actuators; Power Actuators; iLO; iLO Firmware; Power Sensors; x86 Hardware]


Experimental Setup

• Hardware
− Dual-core, dual-socket machines with Intel 5150 processors
− 4 GB memory per machine
• Applications
− RUBiS multi-tier app (4 VMs)
− Nutch Web 2.0 app (1 VM)
− Webserver app (1 VM)
− Batch model app - CPU-intensive custom script
• Sensors and actuators
− Custom SLA sensors
− iLO power sensors and offline model-based calibration
− XenMon for utilization
− CPU freq. driver for P-state actuations
− VM migration actuator (using the Xen API)

Evaluation Results

• 28 VMs run over 20 hours on a 13-node testbed
− 10 Nutch instances, 3 RUBiS instances, 6 static webserver instances
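As a consistency check, assuming each instance uses the VM counts listed in the experimental setup (1 VM per Nutch instance, 4 VMs per RUBiS instance, 1 VM per webserver instance), the instance counts add up to the stated total:

$$10 \times 1 + 3 \times 4 + 6 \times 1 = 28$$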

[Figure: three bar charts comparing Base (no coordination) with the Coordinated Solution (vManage): average power (Watts), stability (# VM migrations), and SLA violations (normalized to Base)]

• Significantly better QoS (71%)
• Improved power savings (10%)


Snapshot of prototype operation

[Figure: application response time (sec) and power usage (Watts) versus time (sec), traditional manager vs. our approach]

Snapshot of prototype operation

Traditional managers at low load

[Figure: application response time, traditional manager vs. our approach]


Snapshot of prototype operation

Traditional manager at high load

[Figure: application response time (sec) versus time (sec), traditional manager vs. our approach; annotations mark "Violations" on the traditional manager curves]

Snapshot of prototype operation

Coordinated manager at low load

[Figure: application response time, traditional manager vs. our approach; annotation: "Lower Power"]


Snapshot of prototype operation

Coordinated manager at high load

[Figure: application response time (sec) versus time (sec), traditional manager vs. our approach; annotations mark "VM migration" events]

Snapshot of prototype operation

Coordinated manager at high load

Richness of information & actuation

[Figure: application response time, traditional manager vs. our approach; annotations mark "VM migration" events]


Benefits of our approach

Additional experimentation

Effects of stabilizer in coordinated solution

[Figure: three bar charts comparing Base, C1, C2, and C3: SLA violations (normalized to Base), power (normalized to Base), and stability (# migrations)]

• C1: “First fit” coordinated placement
• C2: “Best fit” coordinated placement
• C3: “Stability-aware” coordinated placement
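The three placement variants can be contrasted with a minimal sketch. The slides do not define the policies in detail, so this assumes the usual bin-packing meanings of "first fit" and "best fit", and treats "stability-aware" as choosing the feasible node with the highest stability score. The node data and the `free` and `stability` field names are hypothetical.

```python
def first_fit(nodes, demand):
    """C1-style: the first node with enough free capacity."""
    for node in nodes:
        if node["free"] >= demand:
            return node["name"]
    return None


def best_fit(nodes, demand):
    """C2-style: the feasible node that leaves the least spare capacity."""
    feasible = [n for n in nodes if n["free"] >= demand]
    if not feasible:
        return None
    return min(feasible, key=lambda n: n["free"] - demand)["name"]


def stability_aware(nodes, demand):
    """C3-style: the feasible node with the highest stability score."""
    feasible = [n for n in nodes if n["free"] >= demand]
    if not feasible:
        return None
    return max(feasible, key=lambda n: n["stability"])["name"]


nodes = [
    {"name": "node1", "free": 3.0, "stability": 0.70},
    {"name": "node2", "free": 1.2, "stability": 0.55},
    {"name": "node3", "free": 2.0, "stability": 0.90},
]

choices = {
    "C1": first_fit(nodes, 1.0),
    "C2": best_fit(nodes, 1.0),
    "C3": stability_aware(nodes, 1.0),
}
print(choices)
```

With this data the three policies pick three different nodes, which shows why the stabilizer can change the number of later migrations even when every choice is feasible at placement time.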

Related Work

• Several individual mgmt. solutions for virtualization mgmt., platform mgmt., application mgmt.
− exist in isolated silos and represent partial subsystem-level solutions
• Few recent studies towards coordinated and unified mgmt. [Raghavendra08], [Nathuji07], [Verma08], [Kephart07, Das08], [Chen08], [Adve02]
− lack a systematic systems/architecture approach to the coordination problem across hw-sw
− some focused on ad-hoc solutions dealing with limited actuators only
• Overall, vManage takes a loosely-coupled and practical approach
− works with most existing mgmt. infrastructures
− easy plug-and-play of coordination solutions
− extensible with multiple actuators

• Real prototype solution with enterprise applications running on a large


Summary

• Management silos are a critical and relevant problem
• Our contributions
− architecture for cross-layer coordination in mgmt. systems
− mechanisms for unified discovery, coordination policies, and stability
− Xen-based prototypes; experimentation on real testbeds
• Future Work
− applying coordination mechanisms to more use cases
− extending the architecture for large scale (targeting millions of managed objects)