OS Virtualization. CSC 456 Final Presentation Brandon D. Shroyer

(1)

OS Virtualization

CSC 456 Final Presentation Brandon D. Shroyer

(2)

Introduction

  Virtualization: Providing an interface to software that maps to some underlying system.

  A one-to-one mapping between a guest and the host on which it runs [9, 10].

  Virtualized system should be an “efficient, isolated duplicate” [8] of the real one.

  Process virtual machine just supports a process;

system virtual machine supports an entire system.

(3)

Why Virtualize?

  Reasons for Virtualization

  Hardware Economy

  Versatility

  Environment Specialization

  Security

  Safe Kernel Development

  OS Research [12]

(4)

Process Virtualization

  VM interfaces with a single process

  Application sees “virtual machine” as address space, registers, and instruction set [10].

  Examples:

  Multiprogramming

  Emulation for binaries

  High-level language VMs (e.g., JVM)

[Diagram: layer stack of Application, Virtualization Layer, OS, Hardware.]

(5)

System Virtualization

[Diagram, Classical Virtualization: Application, OS, Virtualization Layer, Hardware.]

[Diagram, Hosted Virtualization/Emulation: Application, OS, Virtualization Layer, OS, Hardware.]

(6)

System Virtualization

  Interfaces with operating system

  OS sees VM as an actual machine: memory, I/O, CPU, etc. [10].

  Classic virtualization: virtualization layer runs atop the hardware.

  Usually found on servers (Xen, VMWare ESX)

  Hosted or whole-system virtualization: virtualization runs on an operating system.

  Popular for desktops (VMWare Workstation, Virtual PC)

(7)

Emulation

  Providing an interface to a system so that it can run on a system with a different interface [10].

  Lets compiled binaries and OSes run on architectures with a different ISA (binary translation).

  Performance usually worse than classic virtualization.

  Example: QEMU [11]

  Breaks CPU instructions into small ops, coded in C.

  C code is compiled into small objects on native ISA.

  dyngen utility runs code by dynamically stitching objects together (dynamic code generation).
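The stitching idea can be illustrated with a minimal sketch. It is not QEMU code: the guest instructions (`mov`, `add`), the temporaries `T0`/`T1`, and every function name below are hypothetical, and Python callables stand in for the small precompiled native objects that the real tool concatenates.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
MicroOp = Tuple[Callable[[State, Optional[str]], None], Optional[str]]

# Pre-built micro-operations: stand-ins for the small compiled objects.
def load_t0(state: State, reg: Optional[str]) -> None:
    state["T0"] = state[reg]

def load_t1(state: State, reg: Optional[str]) -> None:
    state["T1"] = state[reg]

def add_t0_t1(state: State, _reg: Optional[str]) -> None:
    state["T0"] += state["T1"]

def store_t0(state: State, reg: Optional[str]) -> None:
    state[reg] = state["T0"]

def decompose(instr: Tuple[str, str, str]) -> List[MicroOp]:
    """Break one (hypothetical) guest instruction into micro-ops."""
    op, dst, src = instr
    if op == "mov":   # dst = src
        return [(load_t0, src), (store_t0, dst)]
    if op == "add":   # dst = dst + src
        return [(load_t0, dst), (load_t1, src), (add_t0_t1, None), (store_t0, dst)]
    raise ValueError(f"unknown guest instruction: {op}")

def translate_block(instrs: List[Tuple[str, str, str]]) -> Callable[[State], None]:
    """Stitch the micro-ops of a whole block into one runnable unit."""
    stitched: List[MicroOp] = []
    for instr in instrs:
        stitched.extend(decompose(instr))

    def run(state: State) -> None:
        for micro_op, operand in stitched:
            micro_op(state, operand)

    return run

def demo() -> int:
    state = {"r0": 0, "r1": 5, "r2": 7, "T0": 0, "T1": 0}
    block = translate_block([("mov", "r0", "r1"), ("add", "r0", "r2")])
    block(state)
    return state["r0"]
```

Calling `demo()` translates a two-instruction block once and runs the stitched result; the decomposition cost is paid at translation time rather than on every execution.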

(8)

Some Important Terms

  Virtual Machine (VM): An instance of an operating system running on a virtualized system. Also known as a virtual or guest OS.

  Hypervisor: The underlying virtualization system sitting between the guest OSes and the hardware. Also known as a Virtual Machine Monitor (VMM).

(9)

Requirements of a VMM

Developed by Popek & Goldberg in 1974 [8]:

1.  Provides environment identical to underlying hardware.

2.  Most of the instructions coming from the guest OS are executed by the hardware without being modified by the VMM.

3.  Resource management is handled by the VMM (this includes all non-CPU hardware such as memory and peripherals).

(10)

Guest OS Model

  Hypervisor exists as a layer between the operating systems and the hardware.

  Performs memory management and scheduling required to coordinate multiple operating systems.

  May also have a separate controlling interface.

[Diagram: Apps on each of three Guest OSes, all running on the Hypervisor (Host), which runs on the Hardware.]

(11)

Virtualization Challenges

  Privileged Instructions

  Handling architecture-imposed instruction privilege levels.

  Performance Requirements

  Holding down the cost of VMM activities.

  Memory Management

  Managing multiple address spaces efficiently.

  I/O Virtualization

  Handling I/O requests from multiple operating systems.

(12)

Virtualizing Privileged Instructions

  x86 architecture has four privilege levels (rings).

  The OS assumes it will be executing in Ring 0.

  Many system calls require Ring 0 privileges to execute.

  Any virtualization strategy must find a way to circumvent this.

[Figure 4: x86 privilege level architecture without virtualization.]

Image Source: VMWare White Paper, “Understanding Full Virtualization, Paravirtualization, and Hardware Assist”, 2007.

(13)

Full Virtualization

  “Hardware is functionally identical to underlying architecture.” [3]

  Typically accomplished through interpretation or binary translation.

  Advantage: Guest OS will run without any changes to source code.

  Disadvantage: Complex, usually slower than paravirtualization.

(14)

Paravirtualization

  Replace certain unvirtualized sections of OS code with virtualization-friendly code.

  Virtual architecture “similar but not identical to the underlying architecture.” [3]

  Advantages: easier, lower virtualization overhead

  Disadvantages: requires modifications to guest OS

(15)

Performance

  Modern VMMs are based around trap-and-emulate [8].

  When a guest OS executes a privileged instruction, control is passed to the VMM (the VMM “traps” on the instruction), which decides how to handle the instruction [8].

  VMM generates instructions to handle the trapped instruction (emulation).

  Non-privileged instructions do not trap (system stays in guest context).

[Diagram: instruction flow between Guest OS and VMM; labels: CPU_INST, TRAP, CPU_INST1, EXEC, CPU_INST.]
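The trap-and-emulate flow on this slide can be sketched as a small simulation. Everything here is an assumption made for illustration: the instruction names, the `VirtualCPU` fields, and the use of a Python exception to stand in for a hardware trap.

```python
# Hypothetical set of privileged instructions for this toy machine.
PRIVILEGED = {"cli", "sti", "write_cr3"}

class Trap(Exception):
    """Stands in for the hardware trap raised by a privileged instruction."""
    def __init__(self, instr):
        super().__init__(instr)
        self.instr = instr

class VirtualCPU:
    """Per-guest virtual state maintained on the guest's behalf."""
    def __init__(self):
        self.acc = 0
        self.interrupts_enabled = True
        self.cr3 = 0

def execute_in_guest(vcpu, instr):
    """Run one instruction in guest context; privileged ones trap."""
    op = instr[0]
    if op in PRIVILEGED:
        raise Trap(instr)
    if op == "add":
        vcpu.acc += instr[1]   # non-privileged: no VMM involvement
    else:
        raise ValueError(f"unknown instruction: {op}")

class VMM:
    def __init__(self):
        self.trap_count = 0

    def emulate(self, vcpu, instr):
        """Apply the effect of a trapped instruction to virtual state only."""
        op = instr[0]
        if op == "cli":
            vcpu.interrupts_enabled = False
        elif op == "sti":
            vcpu.interrupts_enabled = True
        elif op == "write_cr3":
            vcpu.cr3 = instr[1]

    def run(self, vcpu, program):
        for instr in program:
            try:
                execute_in_guest(vcpu, instr)
            except Trap as trap:
                self.trap_count += 1
                self.emulate(vcpu, trap.instr)
```

Running a program such as `[("add", 2), ("cli",), ("write_cr3", 0x1000), ("add", 3)]` through `VMM.run` leaves the two `add` instructions untouched by the VMM and counts one trap for each privileged instruction.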

(16)

Trap-and-Emulate Problems

  Trap-and-emulate is expensive

  Requires context-switch from guest OS mode to VMM.

  x86 is not trap-friendly

  Guest’s current privilege level (CPL) is visible in hardware registers; the VMM cannot change it in a way that the guest OS cannot detect [5].

  Some instructions are not privileged, but access privileged systems (page tables, for example) [5].

(17)

VMWare Virtualization

  Full virtualization implemented through dynamic binary translation [5].

  Translated code is grouped and stored in translation caches (TCs).

  Callout method replaces traps with stored emulation functions.

  In-TC emulation blocks are even more efficient.

  Adaptive binary translation rewrites translated blocks to minimize PTE traps [5].

  Direct execution of user-space code further reduces overhead [5].
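A translation cache with callouts can be sketched in a few lines. This is a simplified illustration, not VMware's implementation: the `SENSITIVE` set, the tuple encoding of instructions, and the `("callout", ...)` marker are all hypothetical.

```python
# Hypothetical set of instructions that must not run unmodified in the guest.
SENSITIVE = {"cli", "sti", "popf"}

class BinaryTranslator:
    def __init__(self, guest_code):
        # guest_code maps a block's start address to its list of instructions.
        self.guest_code = guest_code
        self.translation_cache = {}
        self.translations = 0

    def translate(self, block):
        """Copy harmless instructions as-is; replace sensitive ones with a callout."""
        translated = []
        for instr in block:
            if instr[0] in SENSITIVE:
                # Instead of trapping at run time, call a stored emulation routine.
                translated.append(("callout", instr))
            else:
                translated.append(instr)
        return tuple(translated)

    def fetch(self, addr):
        """Return the translated block, translating only on a cache miss."""
        if addr not in self.translation_cache:
            self.translation_cache[addr] = self.translate(self.guest_code[addr])
            self.translations += 1
        return self.translation_cache[addr]
```

Fetching the same block address twice translates it once; later executions reuse the cached block, which is where the approach recovers the translation cost.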

(18)

Xen Virtualization

  Xen occupies privilege level 0; guest OS occupies privilege level 1.

  OS code is modified so that high-privilege calls (hypercalls) are made to and trapped by Xen [3].

  Xen traps guest OS instructions using table of exception handlers.

  Frequently used handlers (e.g., system calls) have special handlers that allow guest OS to bypass privilege level 0 [3].

  Approach does not work with page faults.

  Handlers are vetted by Xen before being stored.

(19)

Hardware-Assisted Virtualization

  Hardware virtualization-assist released in 2006 [5].

  Intel, AMD both have technologies of this type.

  Introduces new VMX runtime mode.

  Two modes: guest (for OS) and root (for VMM).

  Each mode has all four CPL privilege levels available [8].

  Switching from guest to VMM does not require changes in privilege level.

  Root mode supports special VMX instructions.

  Virtual machine control block [5] contains control flags and state information for active guest OS.

  New CPU instructions for entering and exiting VMM mode.

  Does not support I/O virtualization.
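The guest/root split can be pictured with a toy state machine. This is a sketch under loud assumptions: `vm_entry`, `vm_exit`, `ControlBlock`, and the `"io_access"` exit reason are hypothetical names for illustration, not the actual Intel or AMD instructions or data layouts.

```python
from dataclasses import dataclass, field

@dataclass
class ControlBlock:
    """Toy stand-in for the virtual machine control block."""
    guest_state: dict = field(default_factory=lambda: {"cpl": 0})
    exit_reason: str = ""

class CPU:
    def __init__(self):
        self.mode = "root"   # "root" (VMM) or "guest" (OS)
        self.cpl = 0         # each mode has its own CPL 0-3
        self._host_cpl = 0

    def vm_entry(self, block):
        """Switch from root to guest mode, loading the guest's saved CPL."""
        if self.mode != "root":
            raise RuntimeError("VM entry is only valid from root mode")
        self._host_cpl = self.cpl
        self.cpl = block.guest_state["cpl"]
        self.mode = "guest"

    def vm_exit(self, block, reason):
        """Switch back to root mode, saving guest state in the control block."""
        if self.mode != "guest":
            raise RuntimeError("VM exit is only valid from guest mode")
        block.guest_state["cpl"] = self.cpl
        block.exit_reason = reason
        self.cpl = self._host_cpl
        self.mode = "root"
```

The point of the sketch is that the guest kernel can sit at CPL 0 inside guest mode: the mode bit, not the privilege level, separates the guest from the VMM.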

(20)

Intel VT-x

  Both modes have no restrictions on privilege

  No need for software-based deprivileging

Image Source: Smith, J. and Nair, R. Virtual Machines, Morgan Kaufmann, 2005.

(21)

Applications of VT-x

  Xen uses Intel VT-x to host fully-virtualized guests alongside paravirtualized guests [6].

  System has root (VMM) and non-root (guest) modes, each with privilege levels 0-3.

  QEMU/Bochs projects provide emulations

  VMWare does not make use of VT technology [5].

  VMWare’s software-based VMMs significantly outperformed VT-x-based VMMs [5].

  VT-x virtualization is trap-based, while dynamic binary translation (DBT) tries to eliminate traps wherever possible.

(22)

Virtualizing Memory

  Virtualization software must find a way to handle paging requests of operating systems, keeping each set of pages separate.

  Memory virtualization must not impose too much overhead, or performance and scalability will be impaired.

  Each guest OS must have its own address space and must be convinced that it has access to the entire address space.

  SOLUTION: most modern VMMs add an additional layer of abstraction in address space [4].

  Machine Address: bare hardware address.

  Physical Address: VMM abstraction of machine address, used by guest OSes.

  Guest maintains virtual-to-physical page tables.

  VMM maintains pmap structure containing physical-to-machine page mappings.
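The two mappings compose as follows. This is a minimal sketch that assumes 4 KiB pages and uses plain dictionaries for both tables; the function name and the page numbers are made up for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch

def virtual_to_machine(vaddr, guest_page_table, pmap):
    """Translate a guest virtual address to a machine address in two steps."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    ppn = guest_page_table[vpn]   # lookup 1 (guest): virtual -> physical
    mpn = pmap[ppn]               # lookup 2 (VMM): physical -> machine
    return mpn * PAGE_SIZE + offset
```

For example, with `guest_page_table = {2: 7}` and `pmap = {7: 42}`, virtual page 2 resolves to machine page 42, and the page offset is carried through unchanged.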

(23)

Memory Problem

Page table for program m on VM n: virtual address a → physical address b.

Pmap structure in VMM: physical address b → machine address c (frame).

That’s a lot of lookups!

(24)

Shadow Page Tables

  Shadow page tables map virtual memory to machine memory [4].

  One page table maintained per guest OS.

  TLB caches results from shadow page tables.

  Shadow page tables must be kept consistent with guest pages.

  VMM updates shadow page tables when pmap (physical-to-machine) records are updated.

  VMM now has access to virtual addresses, eliminating two page table lookups.
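A shadow page table can be viewed as a cache of the composed virtual-to-machine mapping that is invalidated whenever either underlying table changes. The sketch below assumes dictionary-based tables and lazy refill on a miss; the class and method names are hypothetical and do not describe any particular VMM.

```python
class ShadowPageTable:
    def __init__(self, guest_page_table, pmap):
        self.guest_page_table = guest_page_table  # virtual -> physical (guest)
        self.pmap = pmap                          # physical -> machine (VMM)
        self.shadow = {}                          # virtual -> machine (VMM)
        self.fills = 0

    def lookup(self, vpn):
        """Single lookup on a hit; compose both tables only on a miss."""
        if vpn not in self.shadow:
            ppn = self.guest_page_table[vpn]
            self.shadow[vpn] = self.pmap[ppn]
            self.fills += 1
        return self.shadow[vpn]

    def guest_remap(self, vpn, ppn):
        """Guest changes its page table: drop the stale shadow entry."""
        self.guest_page_table[vpn] = ppn
        self.shadow.pop(vpn, None)

    def pmap_remap(self, ppn, mpn):
        """VMM moves a physical page: drop every shadow entry that used it."""
        self.pmap[ppn] = mpn
        stale = [v for v, p in self.guest_page_table.items() if p == ppn]
        for vpn in stale:
            self.shadow.pop(vpn, None)
```

Repeated lookups of an unchanged page cost one dictionary access, while each remap forces a refill, which is the consistency cost that comes with the faster lookup path.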

(25)

Shadow Page Tables

Guest: page table for program m on VM n maps virtual address a → physical address b.

VMM: pmap structure maps physical address b → machine address c.

VMM: shadow page table maps virtual address a → machine address c.

(26)

Shadow Page Table Drawbacks

  Updates are expensive

  On a write to a guest page table, the VMM must update both the VM's page table and the shadow page table.

  TLB must be flushed on world switch.

  TLB from other guest will be full of machine addresses that would be invalid in the new context.

(27)

Direct Access

  Direct access to hardware is not permitted by the Popek and Goldberg model [8].

  VMWare and Xen both bend this rule, allowing guests to access hardware directly in certain cases.

  Xen uses validated access model [3].

  Fine-grained control over direct access.

  VMWare allows user-mode instructions to bypass binary translation (BT) and go straight to the CPU [5].

  Memory accesses are sometimes batched to minimize context switches.

(28)

Load Balancing Problem

  Assume VMM divides address space evenly among guests.

  If guest workload is not balanced, one guest could be routinely starved for memory.

  Other guests may have far more than they need.

  Solution: memory overcommitment

[Diagram: guest memory shares labeled 2/n, 1/n, (n–2)/n, and 4/n.]

(29)

Memory Overcommitment

  Overcommitment: committing more total memory to guest OSes than actually exists on the system [4].

  Guest memory can be adjusted according to workload.

  Higher-workload servers get better performance than with a simple even allocation.

  Requires some mechanism to reclaim memory from other guests [4].

  Poor page replacement schemes can result in double paging [4].

  VMM marks page for reclamation; guest OS immediately moves the reclaimed page out of memory.

  Most common in high memory-usage situations.

(30)

Ballooning

  Mechanism for page reclamation.

  Technique to induce page-ins and page-outs in a guest OS.

  “Balloon module” [4] loaded on guest OS reserves physical pages; can be expanded or contracted.

  Balloon inflates, guest starts releasing memory.

  Balloon deflates, guest may start allocating pages.

  VMWare and Xen both support ballooning.

Image Source: Waldspurger, C. “Memory Resource Management in VMware ESX Server”, OSDI 2002.
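The inflate/deflate behavior can be modeled with simple page accounting. This sketch is an assumption-laden toy: it tracks only page counts, the class and field names are hypothetical, and the guest's own paging policy is reduced to "page out exactly the shortfall".

```python
class BalloonedGuest:
    def __init__(self, total_pages, pages_in_use):
        self.total_pages = total_pages    # "physical" pages the guest sees
        self.pages_in_use = pages_in_use  # pages holding guest data
        self.balloon_pages = 0            # pages pinned by the balloon module
        self.paged_out = 0                # pages the guest pushed to its own swap

    @property
    def free_pages(self):
        return self.total_pages - self.pages_in_use - self.balloon_pages

    def inflate(self, pages):
        """Balloon grows; the guest pages out its own data if free memory runs short."""
        pages = min(pages, self.total_pages - self.balloon_pages)
        shortfall = max(0, pages - self.free_pages)
        self.pages_in_use -= shortfall
        self.paged_out += shortfall
        self.balloon_pages += pages
        return pages   # pages the VMM can now hand to another guest

    def deflate(self, pages):
        """Balloon shrinks; the released pages become free for the guest again."""
        pages = min(pages, self.balloon_pages)
        self.balloon_pages -= pages
        return pages
```

With a guest of 100 pages, 80 of them in use, inflating the balloon by 30 pages consumes the 20 free pages and forces the guest to page out 10 more; the guest, not the VMM, chooses which pages to evict.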

(31)

I/O Virtualization

  Performance is critical for virtualized I/O

  Many I/O devices are time-sensitive or require low latency [7].

  Most common method: device emulation

  VMM presents guest OS with a virtual device [7].

  Preserves security, handles concurrency, but imposes more overhead.

[Diagram: Guest OS (Guest Driver) → VMM (Virtual Device, Virtual Driver) → Physical Device.]

(32)

I/O Virtualization Problems

  Multiplexing

  How to share hardware access among multiple OSes.

  Switching Expense

  Low-level I/O functionality happens at the VMM level, requiring a context switch.

(33)

Packet Queuing

  Both major VMMs use an asynchronous ring buffer to store I/O descriptors.

  Batches I/O operations to minimize cost of world switches [7].

  Sends and receives exist in same buffer.

  If buffer fills up, an exit is triggered [7].

[Figure 2: The structure of asynchronous I/O rings, which are used for data transfer between Xen and guest OSes. Pointers: Request Consumer (private pointer in Xen), Request Producer (shared pointer updated by guest OS), Response Consumer (private pointer in guest OS), Response Producer (shared pointer updated by Xen). Regions: request queue (descriptors queued by the VM but not yet accepted by Xen), outstanding descriptors (descriptor slots awaiting a response from Xen), response queue (descriptors returned by Xen in response to serviced requests), unused descriptors.]

Image Source: Barham, P. et al. “Xen and the Art of Virtualization”, SOSP 2003.
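The asynchronous I/O ring on this slide, with its two producer-consumer pointer pairs, can be sketched as below. This is a single-threaded toy that services requests in order; the class, method names, and the `handler` callback are hypothetical, and real rings live in shared memory with notification handled separately.

```python
class IORing:
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.req_prod = 0   # shared pointer, advanced by the guest OS
        self.req_cons = 0   # private pointer in the hypervisor
        self.rsp_prod = 0   # shared pointer, advanced by the hypervisor
        self.rsp_cons = 0   # private pointer in the guest OS

    def guest_enqueue(self, request_id, buffer_ref):
        """Guest queues a descriptor; returns False if the ring is full."""
        if self.req_prod - self.rsp_cons >= self.size:
            return False
        self.slots[self.req_prod % self.size] = ("req", request_id, buffer_ref)
        self.req_prod += 1
        return True

    def hypervisor_service(self, handler):
        """Hypervisor drains all pending requests in one batch."""
        served = 0
        while self.req_cons < self.req_prod:
            _, request_id, buffer_ref = self.slots[self.req_cons % self.size]
            self.req_cons += 1
            status = handler(buffer_ref)
            # The response reuses a slot whose request was already consumed.
            self.slots[self.rsp_prod % self.size] = ("rsp", request_id, status)
            self.rsp_prod += 1
            served += 1
        return served

    def guest_collect(self):
        """Guest picks up responses, matched to requests by identifier."""
        responses = []
        while self.rsp_cons < self.rsp_prod:
            _, request_id, status = self.slots[self.rsp_cons % self.size]
            responses.append((request_id, status))
            self.rsp_cons += 1
        return responses
```

Because the guest can enqueue several descriptors before the hypervisor runs, one switch into the hypervisor can service a whole batch, which is the cost-saving the slide attributes to the ring buffer.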

(34)

I/O Rings, continued

Xen

  Rings contain memory descriptors pointing to I/O buffer regions declared in guest address space.

  Guest and VMM deposit and remove messages using a producer-consumer model [2].

  Xen 3.0 places device drivers on their own virtual domains, minimizing the effect of driver crashes.

VMWare

  Ring buffer is constructed in and managed by VMM.

  If VMM detects a great deal of entries and exits, it starts queuing I/O requests in ring buffer [7].

  Next interrupt triggers transmission of accumulated messages.

(35)

Summary

  Current VMM implementations provide safe, relatively efficient virtualization, albeit often at the expense of theoretical soundness [8].

  The x86 architecture requires a) binary translation, b) paravirtualization, or c) hardware support to virtualize.

  Binary translation and instruction trapping costs are currently the largest drains on efficiency [5].

  Management of memory and other resources remains a complex and expensive task in modern virtualization implementations.

(36)

References

1.  Singh, A. “An Introduction To Virtualization”, www.kernelthread.com, 2004.

2.  VMWare White Paper, “Understanding Full Virtualization, Paravirtualization, and Hardware Assist”, 2007.

3.  Barham, P. et al. “Xen and the Art of Virtualization”, SOSP 2003.

4.  Waldspurger, C. “Memory Resource Management in VMware ESX Server”, OSDI 2002.

5.  Adams, K. and Agesen, O. “A Comparison of Software and Hardware Techniques for x86 Virtualization”, ASPLOS 2006.

6.  Pratt, I. et al. “Xen 3.0 and the Art of Virtualization”, Linux Symposium 2005.

7.  Sugerman, J. et al. “Virtualizing I/O Devices on VMware Workstation’s Hosted Virtual Machine Monitor”, Usenix, 2001.

8.  Popek, G. and Goldberg, R. “Formal Requirements for Virtualizable Third-Generation Architectures”, Communications of the ACM, 1974.

9.  Mahalingam, M. “I/O Architectures for Virtualization”, VMWorld, 2006.

10.  Smith, J. and Nair, R. Virtual Machines, Morgan Kaufmann, 2005.

11.  Bellard, F. “QEMU, a Fast and Portable Translator”, USENIX 2005.

12.  Silberschatz, A., Galvin, P., Gagne, G. Operating System Concepts, Eighth Edition. Wiley & Sons, 2009.
