vManage: Loosely Coupled Platform
and Virtualization Management in
Data Centers
Sanjay Kumar (Intel), Vanish Talwar (HP Labs),
Vibhore Kumar (IBM Research), Partha Ranganathan (HP Labs),
Karsten Schwan (Georgia Tech)
Problem – Silos of management domains
• Inefficiency and redundancy
• High costs for customers
Platform Mgmt. (e.g., server config, power mgmt.)
Virtualization/OS Mgmt. (e.g., VM prov., runtime monitoring)
App Mgmt. (e.g., SLA mgmt., patch updates)
Subsystem-level solutions across layers
[Figure: average power (Watts) and response time (msecs) vs. time (sec)]
1 July 2009
Significant violations & instability due to oscillatory behavior
An illustrative problem with silos
Oscillations among SLA and power violations
vManage: New enablements for coordinated mgmt.
[Diagram labels: Node(s); Platform Sensors & Actuators; Virtualization & App Sensors & Actuators; Local Virtualization Access Point; Management Node; Platform Manager; Cluster Virtualization Manager; Utilization Repository; Power Usage Repository; Coordinator 1; Coordinator 2; Registry & Proxy Service (shown twice)]
Key building blocks
• Platform-aware Virtualization Management
• Virtualization-aware Platform Management
Benefits
• Structured & automated
• Loosely coupled & extensible
• Works with legacy controllers
Challenges
• Discovery & metadata registration
• Coordinated policies
Coordinated policies
• Current state of the art
−virtualization mgr. handles requests for Node 1 … Node n using VM resource requirements only
• Platform-aware virtualization manager
−coordinator adds a power budget: VM resource requirements + power budget
−actuation: VM migration
• Platform-aware virtualization manager with stability
−VM resource requirements + power budget + stability, e.g., $P_j^{r}(t_0, T)$
Coordinated policies (contd.)
• Current state of the art
−power mgr. uses perf counters and power violations to drive P-state actuators
• Virtualization-aware power manager
−coordinator passes SLA notifications, so the power mgr. sees SLA + power violations
−P-state actuators, plus VM migration actuators via requests to the platform-aware virtualization manager
Stability
[Diagram: platform-aware virtualization manager and coordinator placing VMs on Node 1 … Node n using VM resource requirements + power budget + stability]
Stability criterion
Pick the node with the highest probability that the placement decision remains valid over a given duration in the future
Stability (contd.)
• Assume applications can be profiled offline or behavior traces are available
• i.e., the mean, standard deviation, and PDF are known a priori
• The average probability with which host $j$ can provide sufficient resources of a particular type $r$ to a set $M$ of VMs over a given time interval $T$ is
$$P_j^{r}(t_0, T) = \frac{1}{T} \int_{t_0}^{t_0+T} F_{v_j^{M}}\big(a_j(t),\, t\big)\, dt$$
• If we assume the PDF to have a normal distribution
$$F_{v_j^{M}}(x, t) = \frac{1}{2}\left(1 + \operatorname{erf}\!\left(\frac{x - \mu_{v_j^{M}}(t)}{\sigma_{v_j^{M}}(t)\,\sqrt{2}}\right)\right)$$
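The stability criterion can be sketched numerically. The Python below is a minimal illustration, not the prototype's code. It assumes the aggregate demand of the VM set on a host is normally distributed with time-varying mean and standard deviation, and it reads $a_j(t)$ as the capacity the host has available at time $t$. The function names, the midpoint-rule averaging, and the example numbers are all hypothetical.

```python
import math

def normal_cdf(x, mu, sigma):
    """Normal CDF: 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def avg_fit_probability(available, mu, sigma, t0, T, steps=100):
    """Approximate (1/T) * integral over [t0, t0 + T] of F(available(t), t) dt.

    available, mu, and sigma are functions of time (assumed inputs taken
    from offline profiles); the integral uses the midpoint rule.
    """
    dt = T / steps
    total = 0.0
    for k in range(steps):
        t = t0 + (k + 0.5) * dt
        total += normal_cdf(available(t), mu(t), sigma(t)) * dt
    return total / T

def pick_most_stable_node(nodes, t0, T):
    """Return the node whose placement is most likely to stay valid."""
    return max(nodes, key=lambda name: avg_fit_probability(*nodes[name], t0, T))

# Hypothetical example: two hosts, same demand profile, different capacity.
nodes = {
    "node1": (lambda t: 2.0, lambda t: 1.5, lambda t: 0.5),
    "node2": (lambda t: 3.0, lambda t: 1.5, lambda t: 0.5),
}
print(pick_most_stable_node(nodes, t0=0.0, T=10.0))
```

With identical demand profiles, the host with more available capacity scores higher, so the sketch selects `node2`.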
Prototype implementation
Per-node view
[Diagram labels: x86 hardware; iLO (iLO firmware, power sensors); Xen hypervisor; Dom0; Dom-M; guest VM(s) running application and OS; power manager; coordinator; registry & proxy services; SLA sensors; virtualization sensors and actuators; power actuators]
Experimental Setup
• Hardware
−Dual-core, dual-socket machines with Intel 5150 processors
−4 GB memory per machine
• Applications
−RUBiS multi-tier app (4 VMs)
−Nutch Web 2.0 app (1 VM)
−Webserver app (1 VM)
−Batch model app - CPU-intensive custom script
• Sensors and actuators
−Custom SLA sensors
−iLO power sensors and offline model-based calibration
−XenMon for utilization
−CPU freq. driver for P-state actuations
−VM migration actuator (using the Xen API)
Evaluation Results
• 28 VMs run over 20 hours on a 13-node testbed
−10 Nutch instances, 3 RUBiS instances, 6 static webserver instances
[Figure: bar charts comparing Base (no coordination) with the coordinated solution (vManage): average power (Watts), stability (# VM migrations), and SLA violations (normalized to Base)]
• Significantly better QoS (71%)
• Improved power savings (10%)
Snapshot of prototype operation
[Figure: application response time (sec) and power usage (Watts) vs. time (sec), traditional manager vs. our approach]
• Traditional managers at low load
• Traditional manager at high load: violations
• Coordinated manager at low load: lower power
• Coordinated manager at high load: VM migration
• Richness of information & actuation
Benefits of our approach
Additional experimentation
Effects of stabilizer in coordinated solution
[Figure: bar charts for Base, C1, C2, and C3: SLA violations (normalized to Base), power (normalized to Base), and stability (# migrations)]
• C1: “First fit” coordinated placement
• C2: “Best fit” coordinated placement
• C3: “Stability-aware” coordinated placement
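The three placement variants can be contrasted with a small sketch. The slides only name the policies, so everything below is an assumption for illustration: the `Node` and `VM` fields, the feasibility test (enough spare CPU and enough headroom under the power budget), and the example values are hypothetical, and this is not the prototype's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    free_cpu: float         # spare CPU capacity (hypothetical units)
    power_headroom: float   # watts left under the node's power budget
    stay_valid_prob: float  # estimated probability the placement stays valid

@dataclass
class VM:
    cpu: float    # CPU the VM needs
    power: float  # estimated power the VM adds

def feasible(node: Node, vm: VM) -> bool:
    """A node qualifies only if it meets both the CPU and power constraints."""
    return node.free_cpu >= vm.cpu and node.power_headroom >= vm.power

def first_fit(nodes: List[Node], vm: VM) -> Optional[Node]:
    """C1: take the first feasible node in list order."""
    return next((n for n in nodes if feasible(n, vm)), None)

def best_fit(nodes: List[Node], vm: VM) -> Optional[Node]:
    """C2: take the feasible node that leaves the least spare CPU."""
    candidates = [n for n in nodes if feasible(n, vm)]
    return min(candidates, key=lambda n: n.free_cpu - vm.cpu, default=None)

def stability_aware(nodes: List[Node], vm: VM) -> Optional[Node]:
    """C3: take the feasible node most likely to keep the placement valid."""
    candidates = [n for n in nodes if feasible(n, vm)]
    return max(candidates, key=lambda n: n.stay_valid_prob, default=None)

# Hypothetical example: each policy picks a different node.
nodes = [
    Node("n1", free_cpu=4.0, power_headroom=100.0, stay_valid_prob=0.6),
    Node("n2", free_cpu=2.0, power_headroom=100.0, stay_valid_prob=0.7),
    Node("n3", free_cpu=8.0, power_headroom=100.0, stay_valid_prob=0.9),
]
vm = VM(cpu=2.0, power=50.0)
for policy in (first_fit, best_fit, stability_aware):
    chosen = policy(nodes, vm)
    print(policy.__name__, chosen.name if chosen else None)
```

In this example the first-fit policy stops at `n1`, best fit packs the VM onto `n2`, and the stability-aware policy prefers `n3`, which shows how the same constraints can lead to different placements.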
Related Work
• Several individual mgmt. solutions exist for virtualization mgmt., platform mgmt., and application mgmt.
− they exist in isolated silos and represent partial subsystem-level solutions
• A few recent studies move towards coordinated and unified mgmt.: [Raghavendra08], [Nathuji07], [Verma08], [Kephart07, Das08], [Chen08], [Adve02]
− they lack a systematic systems/architecture approach to the coordination problem across hw-sw
− some focus on ad-hoc solutions dealing with limited actuators only
• Overall, vManage takes a loosely-coupled and practical approach
− works with most existing mgmt. infrastructures
− easy plug-and-play of coordination solutions
− extensible with multiple actuators
• Real prototype solution with enterprise applications running on a large
Summary
• Management silos are a critical and relevant problem
• Our contributions
−architecture for cross-layer coordination in mgmt. systems
−mechanisms for unified discovery, coordination policies, and stability
−Xen-based prototypes; experimentation on real testbeds
• Future Work
−applying coordination mechanisms to more use cases
−extending the architecture for large scale (targeting millions of managed objects)