Nested hardware support - Nesting Virtual Machines in Virtualization Test Frameworks

                       L1 hypervisor (all HV)
L2 hypervisor          VirtualBox   VMware   XEN   KVM
VirtualBox   DBT           ✓          ✓       ✓     ✓
VirtualBox   HV            ×          ×       ×     ×
VMware       DBT           ∼          ✓       ×     ✓
VMware       HV            ×          ×       ×     ×
Xen          PV            ✓          ✓       ✓     ✓
Xen          HV            ×          ×       ×     ×
KVM          HV            ×          ×       ×     ×

Table 5.12: Overview of the nesting setups with second generation hardware support as the L1 hypervisor technique.

translation and Xen using paravirtualization are the most suitable choice for the L2 hypervisor.

Many setups that were unresponsive in section 5.2 became responsive when using a hardware supported MMU. The use of EPT or NPT improves the performance of memory management and releases the L1 hypervisor from maintaining shadow tables. The maintenance of shadow tables is done in software and can contain bugs. It must also be implemented in a performance-oriented way since it is a crucial part of the hypervisor. After some research⁸, it became clear that hypervisors normally take shortcuts in order to improve the performance of memory management. Thus, the main issue is the shadow tables, which optimize the MMU virtualization but, for performance reasons, do not exactly follow architectural equivalence. Two levels of shadow page tables seemed to be the cause of unresponsiveness in several setups. Replacing the shadow tables in the L1 hypervisor with EPT or NPT removes the inaccurate virtualization of the memory management unit. The second generation hardware support provides the L1 hypervisor with an accurate hardware MMU with two levels of address translation, allowing L2 hypervisors and L2 guests to run successfully.
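The difference between a software shadow table and a hardware supported MMU can be illustrated with a minimal Python sketch. The dictionaries and page numbers below are hypothetical and only model the lookup order, not a real MMU: with EPT or NPT every access walks both tables, whereas a shadow table is a precomputed composition that the hypervisor has to keep in sync itself.

```python
# Toy page tables as dicts of page numbers (all values are made up for illustration).
guest_pt = {0: 7, 1: 3}             # guest virtual page  -> guest physical page
nested_pt = {7: 42, 3: 19, 5: 77}   # guest physical page -> host physical page (EPT/NPT role)


def translate_two_level(gva_page, guest_pt, nested_pt):
    """Two-level walk, as performed on every access with EPT or NPT."""
    return nested_pt[guest_pt[gva_page]]


def build_shadow(guest_pt, nested_pt):
    """Software shadow table: precomputed guest virtual -> host physical mapping."""
    return {gva: nested_pt[gpa] for gva, gpa in guest_pt.items()}


shadow = build_shadow(guest_pt, nested_pt)

# The guest remaps virtual page 1. The two-level walk sees the change at once,
# while the shadow table stays stale until the hypervisor rebuilds it.
guest_pt[1] = 5
stale = shadow[1]
fresh = translate_two_level(1, guest_pt, nested_pt)
```

In this toy model the stale entry is exactly the kind of inconsistency that the hypervisor's shadow table code must prevent, and with two nested levels of shadow tables that bookkeeping has to be correct at both levels.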

5.4 Nested hardware support

Nested hardware support is the support of hardware extensions for virtualization on x86 architectures within a guest. Its main goal is to enable nested virtualization for L2 hypervisors that are based on that hardware support.


In section 4.3, we concluded that in order to nest a hypervisor based on hardware support, the virtualized processor should provide the hardware extensions. In subsection 5.2.3 and subsection 5.3.3, we noticed that none of the hypervisors provide a virtualized processor with hardware extensions, so none of the setups were able to nest a hypervisor based on hardware support. Recently, KVM and Xen started research in this domain in order to develop nested hardware support. In the following subsections, the work in progress of both KVM and Xen is presented.

5.4.1 KVM

Nested hardware support was not supported by default in KVM. The virtualized processor provided to the guest is similar to the host processor, but lacks the hardware extensions for virtualization. These extensions are needed in order to use KVM or any other hypervisor based on hardware support. The introduction of nested hardware support should allow these hypervisors to be nested inside a virtual machine.

The first announcement of nested hardware support was made in September 2008 in a blog post by Avi Kivity [48]. He writes about an e-mail from Alexander Graf and Joerg Roedel presenting a patch for nested SVM support [49], i.e. nested hardware support for AMD processors with SVM support, and about the relative simplicity of this patch. More information on AMD SVM itself can be found in section 3.3. Alexander Graf and Joerg Roedel are both developers working on new features for KVM. The patch was eventually included in development version kvm-82 and allows a guest on an AMD processor with hardware extensions for virtualization to run a nested hypervisor based on hardware support. The implementation of the patch stayed relatively simple by exploiting the design of the SVM instruction set.

A year later, in September 2009, Avi Kivity announced that support for nested VMX, i.e. nested hardware support for Intel processors with Intel VT-x extensions, was coming. The bad news is that it will take longer to implement this feature since nested VMX is more complex than nested SVM. In section 3.3, we explained that Intel VT-x and AMD SVM are very similar but that the terminology is somewhat different. Despite the similarities, there are some fundamental differences in their implementation that make nested VMX support more complex.

A first difference is the manipulation of the data structure used by the hypervisor to communicate with the processor. For Intel VT-x, this data structure is called the VMCS; the equivalent in AMD SVM is called the VMCB. Intel uses two instructions, VMREAD and VMWRITE, to manipulate the VMCS, while AMD allows manipulation of the VMCB by reading and writing a memory region. The drawback of the two extra instructions is that KVM must trap and emulate them. For SVM, KVM can simply allow the guest to read and write the memory region of the VMCB without intervention.
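The cost difference between the two access styles can be sketched in Python. The classes and the field name below are hypothetical stand-ins, not KVM code: the Intel-style structure counts one trap into the L1 hypervisor per emulated VMREAD or VMWRITE, while the AMD-style structure is modeled as plain memory that the guest accesses without intervention.

```python
class EmulatedVmcs:
    """Intel-style control structure: every access is a trapped, emulated instruction."""

    def __init__(self):
        self._fields = {}
        self.traps = 0  # number of transitions into the L1 hypervisor

    def vmwrite(self, field, value):
        self.traps += 1          # VMWRITE executed in the guest traps to the L1 hypervisor
        self._fields[field] = value

    def vmread(self, field):
        self.traps += 1          # VMREAD executed in the guest traps to the L1 hypervisor
        return self._fields[field]


class GuestVmcb(dict):
    """AMD-style control structure: an ordinary memory region, accessed without traps."""


vmcs = EmulatedVmcs()
vmcs.vmwrite("guest_rip", 0x1000)
rip_intel = vmcs.vmread("guest_rip")

vmcb = GuestVmcb()
vmcb["rip"] = 0x1000
rip_amd = vmcb["rip"]
```

After one write and one read, `vmcs.traps` is 2 in this model, whereas the VMCB accesses involve no hypervisor at all. This is only meant to show why the instruction-based interface adds emulation work for nested VMX.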

A second difference is the number of fields used in the data structure. Intel uses many more fields for the communication between hypervisor and processor. The AMD SVM VMCB has 91 fields, while the Intel VT-x VMCS has no fewer than 144 fields. KVM needs to virtualize all these fields and make sure that the guest, running a hypervisor, can use them correctly.


Another reason for the longer development time of nested VMX support is that the patch will immediately support nested EPT. This means that not only the hypervisor in the host can use Extended Page Tables (see section 3.4), but the hypervisor in the guest also benefits from EPT support. As already pointed out in section 4.3, nested EPT or nested NPT could be critical for obtaining reasonable performance. With the VMX support, a KVM guest must support the 32-bit and 64-bit page table formats as well as the EPT format.

In practice

The nested hardware support was tested on an AMD processor⁹ since the nested SVM patch was already released. The installation is the same as a regular install, but in order to use the patch one must set a flag when loading the modules. We can do this using the following commands:

modprobe kvm

modprobe kvm-amd nested=1

“nested=1” indicates that we want to use nested SVM. The tested setup was KVM as both L1 and L2 hypervisor. After installing and booting the L1 guest, KVM was installed inside the guest in exactly the same way as a normal installation of KVM. The nested hypervisor’s modules do not need to be loaded with “nested=1”. In subsection 5.2.3 and subsection 5.3.3, we could not install KVM within the guest. Being able to install KVM within the guest is a promising step towards nested virtualization with KVM, or any other hypervisor based on hardware support, as a nested hypervisor. When starting the L2 guest to install an operating system or to boot an existing one, some “handle exit” messages occurred. On KVM’s mailing list, Joerg Roedel replied¹⁰ in March 2010 that the messages result from a difference between real SVM hardware and the SVM emulated by KVM. A patch should fix this issue, but because it still needed more testing, the tested setup was not able to boot. Nonetheless, developers are constantly improving nested SVM by means of new patches and tests, so it is just a matter of time before this setup will work.
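Whether the flag took effect can be checked with a short Python sketch. It assumes a Linux system where the module parameter is exposed at /sys/module/kvm_amd/parameters/nested and where the processor flags appear on the "flags" lines of /proc/cpuinfo. The accepted values ("1" or "Y") are an assumption, since the reported format can differ between kernel versions. The helper names are hypothetical.

```python
from pathlib import Path

# Standard Linux locations; they may be absent on other systems or kernels.
NESTED_PARAM = Path("/sys/module/kvm_amd/parameters/nested")
CPUINFO = Path("/proc/cpuinfo")


def nested_enabled(param_text):
    """Interpret the module parameter text (assumed to be 1/0 or Y/N)."""
    return param_text.strip() in {"1", "Y", "y"}


def cpu_has_flag(cpuinfo_text, flag):
    """Return True if any 'flags' line of the cpuinfo text lists the given flag."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags" and flag in value.split():
            return True
    return False


def read_or_empty(path):
    """Read a text file, returning an empty string if it cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    # On the host: was kvm-amd loaded with nested=1?
    print("nested SVM enabled:", nested_enabled(read_or_empty(NESTED_PARAM)))
    # Inside the L1 guest: does the virtualized processor expose SVM?
    print("svm flag present:", cpu_has_flag(read_or_empty(CPUINFO), "svm"))
```

Run on the host, the first check reflects the module parameter. Run inside the L1 guest, the second check shows whether the virtualized processor advertises the SVM extension that a nested KVM needs.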

5.4.2 Xen

Xen is also working on nested virtualization with an emphasis on virtualization based on hardware support. In November 2009, during the Xen Summit in Asia, Qing He presented his work on nested virtualization [50]. Qing He has been working on Xen since 2006 and is a software engineer at the Intel Open Source Technology Center. His work focusses on virtualization based on hardware support, and more specifically on Intel’s VT-x hardware support. The current progress is a proof of concept for a simple scenario with a single processor and one nested guest. The nested guest is able to boot successfully to an early stage with KVM as the L2 hypervisor. Before the current version can be released, it still needs some stabilization and refinement.

The main target is the virtualization of VMX in order to present a virtualized VMX to the guest. This means that every part of the hardware support must be available in the guest. The guest should be able to use the data structures and the instructions to manipulate the VMCS. The guest should also be able to control the execution flow of VMX with VMEntry and VMExit.

⁹ The nested hardware support was tested on a Quad-Core AMD Opteron™ 2350 processor.

Figure 5.6: Nested virtualization architecture based on hardware support.

The data structures are shown in figure 5.6. The L1 guest has a VMCS that is loaded into the hardware when this guest is running. The VMCS is maintained by the L1 hypervisor. If the L2 guest wants to execute, it needs to have a corresponding VMCS. That corresponding VMCS is maintained by the L2 hypervisor running in the L1 guest and is called the virtual VMCS, or vVMCS. The L2 hypervisor sees the virtual VMCS as the controlling VMCS of the L2 guest, but it is called virtual because the L1 hypervisor maintains a corresponding shadow VMCS, or sVMCS. This shadow VMCS is not a complete duplicate of the virtual VMCS but contains translations, similar to the shadow tables (see subsection 3.1.3). It is the shadow VMCS that is loaded into the hardware when the L2 guest is running. Thus, each nested guest has a virtual VMCS in the L2 hypervisor and a corresponding shadow VMCS in the L1 hypervisor. The general idea is to treat the L2 guests as guests of the L1 hypervisor using the shadow VMCS.
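The relation between the virtual VMCS and the shadow VMCS can be sketched as follows. This is a simplified model and not Xen's implementation: the VMCS is reduced to a dictionary, the set of translated fields is an illustrative subset, and the address map is made up. The point is only that fields holding addresses seen by the L1 guest must be translated to host addresses, while other fields can be copied.

```python
# Fields assumed to hold L1-guest physical addresses (illustrative subset only).
ADDRESS_FIELDS = {"io_bitmap_a", "msr_bitmap"}


def build_shadow_vmcs(vvmcs, gpa_to_hpa):
    """Derive the sVMCS that the L1 hypervisor loads into the hardware."""
    svmcs = {}
    for field, value in vvmcs.items():
        if field in ADDRESS_FIELDS:
            svmcs[field] = gpa_to_hpa[value]   # translate to a host physical address
        else:
            svmcs[field] = value               # plain state is copied unchanged
    return svmcs


# Virtual VMCS as written by the L2 hypervisor (hypothetical values).
vvmcs = {"guest_rip": 0x1000, "io_bitmap_a": 0x2000, "msr_bitmap": 0x3000}
# Mapping maintained by the L1 hypervisor for the L1 guest's memory (hypothetical).
gpa_to_hpa = {0x2000: 0x9A000, 0x3000: 0x9B000}

svmcs = build_shadow_vmcs(vvmcs, gpa_to_hpa)
```

The virtual VMCS is left untouched, so the L2 hypervisor keeps seeing its own values, while the hardware only ever sees the translated shadow copy.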

Figure 5.7 shows the execution flow in a nested virtualization scenario based on hardware support. On the left side of the figure, the L1 guest is running and wants to start a nested guest. The guest does this by executing a VMEntry with the instruction VMLAUNCH or VMRESUME. This virtual VMEntry cannot switch directly to the L2 guest because that is not supported by the hardware. The L1 guest is already running in VMX guest mode and can only trigger a VMExit. The VMExit results in a transition to the L1 hypervisor, which intercepts the VMEntry call and tries to switch to the shadow VMCS indicated by the VMEntry. This results in the transition to the L2 guest, and the L2 guest can run from then on.

Figure 5.7: Execution flow in nested virtualization based on hardware support.

Similar to a virtual VMEntry, a virtual VMExit first causes a transition to the L1 hypervisor. The L1 hypervisor does not know beforehand whether the VMExit is a virtual VMExit or whether it happened because the L2 guest executed a privileged instruction. When the L2 guest tries to run a privileged instruction, the L1 hypervisor can handle this without having to forward the VMExit to the L2 hypervisor. An algorithm in the L1 hypervisor determines whether a VMExit is a virtual VMExit that should be forwarded to the L2 hypervisor, or another type of VMExit that can be handled by the L1 hypervisor itself. For a virtual VMExit, the L1 hypervisor unloads the shadow VMCS of the L2 guest, switches the controlling VMCS to the VMCS of the L1 guest and forwards the VMExit to the L2 hypervisor. In the figure, there are three VMExits that result in a transition to the L1 hypervisor. The first and the last VMExit are forwarded by the L1 hypervisor to the L2 hypervisor, and the second VMExit is handled by the L1 hypervisor itself.
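The decision taken by the L1 hypervisor on each VMExit can be sketched in Python. The source does not describe the actual algorithm, so the criterion used here is an assumption: a VMExit counts as virtual when its reason is in the set of events that the L2 hypervisor asked to intercept. The exit reason names are hypothetical, chosen so that the trace reproduces the three VMExits of the figure.

```python
def run_exits(exit_reasons, l2_intercepts):
    """Replay VMExits taken while the L2 guest runs and record how each is handled."""
    loaded_vmcs = "sVMCS"
    log = []
    for reason in exit_reasons:
        if reason in l2_intercepts:
            # Virtual VMExit: unload the shadow VMCS and let the L2 hypervisor run.
            loaded_vmcs = "VMCS of L1 guest"
            log.append((reason, "forwarded to L2 hypervisor", loaded_vmcs))
            # The L2 hypervisor later resumes its guest with a virtual VMEntry,
            # after which the shadow VMCS is loaded again.
            loaded_vmcs = "sVMCS"
        else:
            # Handled by the L1 hypervisor itself; the shadow VMCS stays loaded.
            log.append((reason, "handled by L1 hypervisor", loaded_vmcs))
    return log


# Assumed set of events the L2 hypervisor wants to see (hypothetical names).
l2_intercepts = {"io_access", "hlt"}
log = run_exits(["io_access", "external_interrupt", "hlt"], l2_intercepts)
```

In this trace the first and last VMExit are forwarded and the second one is handled locally, matching the pattern described for figure 5.7. Every forwarded VMExit costs an extra round trip through the L1 guest, which is why reducing the number of virtual VMExits matters for performance.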

There is no special handling in place for the memory management. The nested EPT, as described in the previous subsection, is also very helpful in this case because it significantly reduces the number of virtual VMExits. Nested EPT support is still work in progress.

Subsections   VirtualBox   VMware   XEN   KVM   Gen.
HV DBT HV DBT HV PV HV HV

VirtualBox   DBT   × × × ✓ × × ×   1st gen.
                   ✓ ✓ ✓ ✓         2nd gen.
VirtualBox   HV    × × × × × × ×   1st gen.
                   × × × ×         2nd gen.
VMware       DBT   × ∼ × ✓ × × ✓   1st gen.
                   ∼ ✓ × ✓         2nd gen.
VMware       HV    × × × × × × ×   1st gen.
                   × × × ×         2nd gen.
Xen          PV    × ∼ ✓ ✓ × ✓ ✓   1st gen.
                   ✓ ✓ ✓ ✓         2nd gen.
Xen          HV    × × × × × × ×   1st gen.
                   × × × ×         2nd gen.
KVM          HV    × × × × × × ×   1st gen.
                   × × × ×         2nd gen.

Chapter 6. Performance results

This chapter elaborates on the performance of the working setups for nested virtualization on x86 architectures. Chapter 5 showed that there was one working setup for nested virtualization when using dynamic binary translation as the L1 hypervisor technique. There were also ten working setups when using an L1 hypervisor based on hardware support with a processor that contains the second generation hardware extensions for virtualization on x86 architectures. The performance in a normal virtual machine is compared to the performance in a nested virtual machine in order to get an idea of the performance degradation between virtualization and nested virtualization.

The performed tests measure the processor, memory and I/O performance. These are the three most important components of a computer system. The evolution of hardware support for virtualization on x86 architecture also shows that the processor, the memory management unit and I/O are important components, see chapter 3. The first generation hardware support focusses on the processor, second generation hardware support concentrates on a hardware supported MMU and the newer generation provides support for directed I/O. The benchmarks used for the tests are sysbench¹, iperf² and iozone³. sysbench was used for the processor, memory and file I/O performance. iperf was used for network performance and iozone was used as a second benchmark for file I/O.

The rest of this chapter is organized using these three components. The first section elaborates on the performance of the processor in nested virtualization. The next section evaluates the memory performance of the nested virtual machines and the third section shows the performance of I/O in a nested setup. The last section gives an overall conclusion on the performance of nested virtualization.

¹ http://sysbench.sourceforge.net/
² http://iperf.sourceforge.net/


Whenever a test ran directly on the host operating system, without any virtualization, the test is labeled with the word “native”. If the label is the name of a single virtualization product, the test ran inside an L1 guest with the indicated hypervisor as L1 hypervisor. The “DBT” suffix indicates that the L1 hypervisor uses the dynamic binary translation technique. All “HV” tests use the hardware support of the processor⁴ for virtualization. A label of the form “L1 hypervisor - L2 hypervisor” shows the result of a performance test executed in an L2 guest using the given L1 hypervisor and L2 hypervisor. For example, “KVM (HV) - VirtualBox (DBT)” indicates the setup where KVM is used as L1 hypervisor and VirtualBox is used as L2 hypervisor based on dynamic binary translation. All nested setups use the hardware support of the processor in the L1 hypervisor, except for “VMware (DBT) - Xen (PV)”. The latter uses VMware as the L1 hypervisor based on dynamic binary translation and Xen as the L2 hypervisor based on paravirtualization. The L2 hypervisor is never based on hardware support, as can be seen in chapter 5. Thus, when used as L2 hypervisor, VirtualBox and VMware are always based on dynamic binary translation and Xen is always based on paravirtualization.
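The labeling convention can be made concrete with a small parser. This is only a sketch of the convention described above. The function name and the returned dictionary layout are hypothetical, and the accepted technique suffixes are limited to the three that occur in this chapter.

```python
import re

# One label part: a product name followed by its technique in parentheses.
_PART = re.compile(r"^(?P<name>\w+) \((?P<tech>DBT|HV|PV)\)$")


def parse_label(label):
    """Split a benchmark label into (product, technique) tuples for L1 and L2."""
    if label == "native":
        return {"l1": None, "l2": None}
    parts = [part.strip() for part in label.split(" - ")]
    if len(parts) > 2:
        raise ValueError(f"too many nesting levels in label: {label!r}")
    parsed = []
    for part in parts:
        match = _PART.match(part)
        if match is None:
            raise ValueError(f"unrecognized label part: {part!r}")
        parsed.append((match["name"], match["tech"]))
    return {"l1": parsed[0], "l2": parsed[1] if len(parsed) == 2 else None}


nested = parse_label("KVM (HV) - VirtualBox (DBT)")
single = parse_label("VMware (DBT)")
```

Here `nested` describes KVM with hardware support as the L1 hypervisor and VirtualBox with dynamic binary translation as the L2 hypervisor, while `single` has no L2 entry because the test ran in an L1 guest.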

6.1 Processor performance

The experiment used to measure the performance of the processor consists of a sysbench test which calculates prime numbers. It calculates the prime numbers up to a set maximum and does this a given number of times. The number of threads that calculate the prime numbers can also be set prior to running the test. In the executed tests, the maximum number for the primes was 150000, and all prime numbers up to 150000 were calculated 10000 times, spread over 10 threads. The measured quantity was the duration of the test in seconds.
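The structure of such a test can be sketched in Python. This is not sysbench's code: trial division is an assumption about how the primes are found, the function names are hypothetical, and Python threads do not run CPU-bound work in parallel, so the measured times are not comparable to the results in this chapter. The sketch only shows how the three parameters (maximum prime, number of repetitions, number of threads) fit together.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(max_prime):
    """Count the primes in [2, max_prime] using trial division."""
    count = 0
    for candidate in range(2, max_prime + 1):
        if all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
            count += 1
    return count


def run_cpu_test(max_prime, requests, threads):
    """Run `requests` prime calculations spread over `threads` worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(count_primes, [max_prime] * requests))
    return time.perf_counter() - start, results


# Scaled-down run; the tests in the text used 150000, 10000 and 10 respectively.
duration, results = run_cpu_test(max_prime=1000, requests=20, threads=10)
```

The workload is purely computational and touches almost no memory or I/O, which is why it isolates the processor from the other two components measured in this chapter.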

Figure 6.1 shows the first results of the performance test for the processor. The left bar is the result on the computer system without virtualization and the other bars are the results of the tests in L1 guests. The figure shows a serious gap between the native performance and the performance in a virtual machine. The reason for this big gap in performance is the use of only one core inside the virtual machine while the host operating system can use four cores. The tests were executed in virtual machines with only one core so that the comparison between the different virtualization software would be fair.

Figure 6.1: CPU performance for native with four cores and L1 guest with one core (lower is better). [Bar chart of the CPU test; bars: native, VirtualBox (HV), VMware (HV), Xen (HV), KVM (HV); axis: duration in seconds.]

In order to get an indication of the real performance degradation, the same test was executed in a VMware guest that can use four cores and in a “VMware (HV) - VMware (DBT)” nested guest that can use four cores. The results of these tests are given in figure 6.2. The figure shows that the performance degradation between a virtual machine and a nested virtual machine is smaller than the performance degradation between a native platform and a virtual machine. By adding an extra level of virtualization, one expects a certain overhead, but the results show that the degradation caused by the extra level is limited. The performance overhead is linear and does not increase exponentially, which is promising because the latencies of VMEntry and VMExit (see section 3.3) do not have to be improved dramatically in order to get acceptable performance in the nested guest.

⁴ All performance tests were executed on an Intel® Core™ i7-860 processor that provides second […]

The results of the tests on virtual machines and nested virtual machines are shown in figure 6.3. The performance between L1 guests with “HV” is about the same since the L1 hypervisors use hardware support for virtualization. The L1 guest that is virtualized using dynamic binary translation, “VMware (DBT)”, was able to perform equally well. The results of the L2 guests vary heavily between the different setups and are higher than the results of the L1 guests. However, the performance degradation is not problematic, except for one outlier which uses dynamic binary translation for the L1 hypervisor. With a duration of 496.83 seconds, the “VMware (DBT) - Xen (PV)” setup performs much worse than other nested setups.

6.2 Memory performance

In this section, the performance degradation of the memory management unit is evaluated. In section 5.3 we explained that the hardware supported L1 hypervisors use the hardware supported MMU of the processor and the L2 hypervisors use a software technique for maintaining the page tables of their guests. In the “VMware (DBT) - Xen (PV)” setup, the L1 hypervisor maintains shadow tables and the L2 hypervisor provides paravirtual interfaces to its guests.

Figure 6.2: CPU performance for native, L1 and L2 guest with four cores (lower is better). [Bar chart of the CPU test; bars: native, VMware (HV), VMware (HV) - VMware (DBT); axis: duration in seconds.]

The performed memory tests evaluate the read and write throughput. The tests read or write data with a total size of 2 GB from or to memory in blocks of 256 bytes. Each test was run in two variants, one that reads or writes in sequential order and one that reads or writes in random order. Figure 6.4 presents the results of the memory tests for the native platform, L1 guests and L2 guests. Several observations for nested virtualization can be made from the results.
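The access pattern of these tests can be sketched as follows. This is an illustration of the parameters, not the benchmark's implementation: the total size is interpreted as 2 GiB, which is an assumption, and the helper names are hypothetical. The example run uses a small buffer so that it finishes immediately.

```python
import random

TOTAL_BYTES = 2 * 1024**3   # total size, interpreted here as 2 GiB (assumption)
BLOCK_BYTES = 256           # block size used in the tests

# Number of block accesses needed to transfer the total size.
NUM_ACCESSES = TOTAL_BYTES // BLOCK_BYTES


def block_offsets(total_bytes, block_bytes, order, seed=0):
    """Yield the byte offset of every block access, in sequential or random order."""
    num_blocks = total_bytes // block_bytes
    if order == "sequential":
        for index in range(num_blocks):
            yield index * block_bytes
    elif order == "random":
        rng = random.Random(seed)
        for _ in range(num_blocks):
            yield rng.randrange(num_blocks) * block_bytes
    else:
        raise ValueError(f"unknown order: {order!r}")


def write_blocks(buffer, block_bytes, offsets):
    """Write one zero-filled block at each offset; return the number of bytes written."""
    payload = bytes(block_bytes)
    written = 0
    for offset in offsets:
        buffer[offset:offset + block_bytes] = payload
        written += block_bytes
    return written


# Scaled-down run on a 4 KiB buffer instead of the full total size.
small = bytearray(4096)
written = write_blocks(small, BLOCK_BYTES, block_offsets(len(small), BLOCK_BYTES, "random"))
```

With these parameters each test performs 8388608 block accesses, so the fixed cost per access matters more than raw bandwidth, and the random variant additionally stresses address translation because consecutive accesses rarely fall on the same page.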

A first observation is that the duration of the tests increases greatly when using virtualization. The L1 guests needed approximately 10 seconds to read or write 2 GB,
