
Nesting Virtual Machines in Virtualization Test Frameworks

Dissertation submitted in May 2010 to the Department of Mathematics and Computer Science of the Faculty of Sciences, University of Antwerp, in partial fulfillment of the requirements for the degree of Master of Science.

Supervisor: Prof. Dr. Jan Broeckhove
Co-supervisor: Dr. Kurt Vanmechelen
Mentors: Sam Verboven & Ruben Van den Bossche

Olivier Berghmans

Research Group Computational Modelling and Programming

Contents

List of Figures
List of Tables
Nederlandstalige samenvatting (Dutch summary)
Preface
Abstract

1 Introduction
1.1 Goals
1.2 Outline

2 Virtualization
2.1 Applications
2.2 Taxonomy
2.2.1 Process virtual machines
2.2.2 System virtual machines
2.3 x86 architecture
2.3.1 Formal requirements
2.3.2 The x86 protection level architecture
2.3.3 The x86 architecture problem

3 Evolution of virtualization for the x86 architecture
3.1 Dynamic binary translation
3.1.1 System calls
3.1.2 I/O virtualization
3.1.3 Memory management
3.2 Paravirtualization
3.2.1 System calls
3.2.2 I/O virtualization
3.2.3 Memory management
3.3 First generation hardware support
3.4 Second generation hardware support
3.5 Current and future hardware support
3.6 Virtualization software
3.6.1 VirtualBox
3.6.2 VMware
3.6.3 Xen
3.6.4 KVM
3.6.5 Comparison between virtualization software

4 Nested virtualization
4.1 Dynamic binary translation
4.2 Paravirtualization
4.3 Hardware supported virtualization

5 Nested virtualization in Practice
5.1 Software solutions
5.1.1 Dynamic binary translation
5.1.2 Paravirtualization
5.1.3 Overview software solutions
5.2 First generation hardware support
5.2.1 Dynamic binary translation
5.2.2 Paravirtualization
5.2.3 Hardware supported virtualization
5.2.4 Overview first generation hardware support
5.3 Second generation hardware support
5.3.1 Dynamic binary translation
5.3.2 Paravirtualization
5.3.3 Hardware supported virtualization
5.3.4 Overview second generation hardware support
5.4 Nested hardware support
5.4.1 KVM
5.4.2 Xen

6 Performance results
6.1 Processor performance
6.2 Memory performance
6.3 I/O performance
6.3.1 Network I/O
6.3.2 Disk I/O
6.4 Conclusion

7 Conclusions
7.1 Nested virtualization and performance results
7.2 Future work

Appendices

Appendix A Virtualization software
A.1 VirtualBox

Appendix B Details of the nested virtualization in practice
B.1 Dynamic binary translation
B.1.1 VirtualBox
B.1.2 VMware Workstation
B.2 Paravirtualization
B.3 First generation hardware support
B.3.1 Dynamic binary translation
B.3.2 Paravirtualization
B.4 Second generation hardware support
B.4.1 Dynamic binary translation
B.4.2 Paravirtualization
B.5 KVM’s nested SVM support

Appendix C Details of the performance tests
C.1 sysbench
C.2 iperf
C.3 iozone

List of Figures

2.1 Implementation layers in a computer system.
2.2 Taxonomy of virtual machines.
2.3 The x86 protection levels.
3.1 Memory management in x86 virtualization using shadow tables.
3.2 Execution flow using virtualization based on Intel VT-x.
3.3 Latency reductions by CPU implementation [30].
4.1 Layers in a nested virtualization setup with hosted hypervisors.
4.2 Memory architecture in a nested situation.
5.1 Layers for nested paravirtualization in dynamic binary translation.
5.2 Layers for nested Xen paravirtualization.
5.3 Layers for nested dynamic binary translation in paravirtualization.
5.4 Layers for nested dynamic binary translation in a hypervisor based on hardware support.
5.5 Layers for nested paravirtualization in a hypervisor based on hardware support.
5.6 Nested virtualization architecture based on hardware support.
5.7 Execution flow in nested virtualization based on hardware support.
6.1 CPU performance for native with four cores and L1 guest with one core.
6.2 CPU performance for native, L1 and L2 guest with four cores.
6.3 CPU performance for L1 and L2 guests with one core.
6.4 Memory performance for L1 and L2 guests.
6.5 Threads performance for native, L1 guests and L2 guests with sysbench benchmark.
6.6 Network performance for native, L1 guests and L2 guests.
6.7 File I/O performance for native, L1 guests and L2 guests with sysbench benchmark.
6.8 File I/O performance for native, L1 guests and L2 guests with iozone benchmark.

List of Tables

3.1 Comparison between a selection of the most popular hypervisors.
5.1 Index table containing directions in which subsections information can be found about a certain nested setup.
5.2 The nesting setups with dynamic binary translation as the L1 hypervisor technique.
5.3 The nesting setups with paravirtualization as the L1 hypervisor technique.
5.4 Overview of the nesting setups with a software solution as the L1 hypervisor technique.
5.5 The nesting setups with first generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.
5.6 The nesting setups with first generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.
5.7 The nesting setups with first generation hardware support as the L1 and L2 hypervisor technique.
5.8 Overview of the nesting setups with first generation hardware support as the L1 hypervisor technique.
5.9 The nesting setups with second generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.
5.10 The nesting setups with second generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.
5.11 The nesting setups with first generation hardware support as the L1 and L2 hypervisor technique.
5.12 Overview of the nesting setups with second generation hardware support as the L1 hypervisor technique.
5.13 Overview of all nesting setups.

Nederlandstalige samenvatting (Dutch summary)

Virtualization has grown into a widespread technology that is used to abstract, combine or divide computing resources. Requests for these resources thereby depend only minimally on the underlying physical layer. The x86 architecture was not specifically designed for virtualization and contains a number of non-virtualizable instructions. Several software solutions and hardware support have resolved this. The growing number of applications means that users increasingly want to use virtualization. Among other things, the need for complete physical setups for research purposes can be avoided by using virtualization. To virtualize components that may themselves use virtualization, it must be possible to nest virtual machines inside one another. Only little information on nested virtualization was available, and this thesis examines in more depth what is possible with current techniques.

We test the nesting of hypervisors based on the different virtualization techniques. The techniques used are dynamic binary translation, paravirtualization and hardware support. For hardware support, a distinction was made between first generation and second generation hardware support. Successful nested setups use software solutions for the second hypervisor and hardware support for the first hypervisor. Only one working nested setup uses a software solution for both.

Benchmarks were run to determine whether working nested setups perform well. The performance of the processor, the memory and I/O was tested and compared across the different levels of virtualization.

We found that nested virtualization works for certain setups, especially with a software solution on top of a hypervisor with hardware support. Setups with hardware support for the top hypervisor are not yet possible. Nested hardware support will become available soon, but for now the only option is to use a software solution for the top hypervisor. The benchmark results showed that the performance of nested setups is promising.

Preface

In this section I will give some insight into the creation of this thesis. It was submitted in partial fulfillment of the requirements for a Master’s degree in Computer Science. I have always been fascinated by virtualization and during the presentation of open thesis subjects I stumbled upon the subject of nested virtualization. Right from the start I found the subject very interesting so I made an appointment for more information and I eventually got it!

I had already used some virtualization software but I did not know much about the underlying techniques. During the first semester I followed a course on virtualization, which helped me to learn the fundamentals. It took time to become familiar with the installation and use of the different virtualization packages. At first, it took a long time to test one nested setup and it seemed that all I was doing was installing operating systems in virtual machines. Predefined images can save a lot of work but I had to find this out the hard way! But even with these predefined images, a nested setup can take a long time to test and re-test since there are so many possible configurations.

After the first series of tests, I was quite disappointed with the obtained results. Due to some setbacks in December and January, I also fell behind schedule, leading to a hard second semester. It was hard combining this thesis with other courses and with extracurricular responsibilities during this second semester. I am pleased that I got back on track and finished the thesis on time! This would not have been possible without the help from the people around me. I want to thank my girlfriend Anneleen Wislez for supporting me, not only during this year but during the last few years. She also helped me with creating the figures for this thesis and reading the text.

Further, I would like to show appreciation to my mentors Sam Verboven and Ruben Van den Bossche for always pointing me in the right direction and for the help during this thesis. Additionally, I also want to thank my supervisor Prof. Dr. Jan Broeckhove and co-supervisor Dr. Kurt Vanmechelen for giving me the opportunity to make this thesis.

A special thank you goes out to all my fellow students and especially to Kristof Overdulve for the interesting conversations and the laughter during the past years. And last but not least I want to thank my parents and sister for supporting me throughout my education; my dad for offering support by buying his new computer earlier and lending it to me so I could do a second series of tests on a new processor, and my mom for the excellent care and interest in what I was doing.

Abstract

Virtualization has become a widespread technology that is used to abstract, combine or divide computing resources to allow resource requests to be described and fulfilled with minimal dependence on the underlying physical delivery. The x86 architecture was not designed with virtualization in mind and contains certain non-virtualizable instructions. This has resulted in the emergence of several software solutions and has led to the introduction of hardware support. The expanding range of applications ensures that users increasingly want to use virtualization. Among other things, the need for entire physical setups for research purposes can be avoided by using virtualization. For components that already use virtualization, executing a virtual machine inside a virtual machine is necessary; this is called nested virtualization. There has been little related work on nested virtualization and this thesis elaborates on what is possible with current techniques.

We tested the nesting of hypervisors based on the different virtualization techniques. The techniques that were used are dynamic binary translation, paravirtualization and hardware support. For hardware support, a distinction was made between first generation and second generation hardware support. Successful nested setups use a software solution for the inner hypervisor and hardware support for the bottom layer hypervisor. Only one working nested setup uses software solutions for both hypervisors.

Performance benchmarks were conducted to find out if the performance of working nested setups is reasonable. The performance of the processor, the memory and I/O was tested and compared with the different levels of virtualization.

We found that nested virtualization on the x86 architecture works for certain setups, especially with a software solution on top of a hardware supported hypervisor. Setups with hardware support for the inner hypervisor are not yet possible. The nested hardware support will be coming soon but until then, the only option is the use of a software solution for the inner hypervisor. Results of the performance benchmarks showed that performance of the nested setups is promising.

Chapter 1: Introduction

Within the research surrounding grid and cluster computing there are many developments at different levels that make use of virtualization. Virtualization can be used for all, or a selection of, the components in grid or cluster middleware. Grids or clusters are also using virtualization to run separate applications in a sandbox environment. Both developments bring advantages concerning security, fault tolerance, legacy support, isolation, resource control, consolidation, etc.

Complete test setups are not available or desirable for many development and research purposes. If certain performance limitations do not pose a problem, virtualization of all components in a system can avoid the need for physical grid or cluster setups. This thesis focuses on the latter, the consolidation of several physical cluster machines by virtualizing them on a single physical machine. The virtualization of cluster machines that use virtualization themselves leads to a combination of the above-mentioned levels.

1.1 Goals

The goal of this thesis is to find out whether different levels of virtualization are possible with current virtualization techniques. The research question is whether nested virtualization works on the x86 architecture. In cases where nested virtualization works we want to find out what the performance degradation is when compared to a single level of virtualization or to a native solution. For cases where nested virtualization does not work we search for the reasons of the failure and what needs to be changed in order for it to work. The experiments are conducted with some of the most popular virtualization software to find an answer to the posed question.

1.2 Outline

The outline of this thesis is as follows. Chapter 2 contains an introduction to virtualization: a brief history of virtualization is given, followed by a few definitions and a taxonomy of virtualization in general. The chapter ends with the formal requirements needed for virtualization on a computer architecture and how the x86 architecture compares to these requirements.

Chapter 3 describes the evolution of virtualization for the x86 architecture. Virtualization software first used software techniques; at a later stage processor vendors provided hardware support for virtualization. The last section of the chapter provides an overview of a selection of the most popular virtualization software.

Chapter 4 provides a theoretical view of the requirements of nested virtualization on the x86 architecture. For each technique described in chapter 3, a detailed explanation of the theoretical requirements gives more insight into whether nested virtualization can work for the given technique.

Chapter 5 investigates the actual nesting of virtual machines using some of the most popular virtualization software solutions. The different virtualization techniques are combined to get an overview of which nested setup works best. Chapter 6 presents performance results of the working nested setups in chapter 5. System benchmarks are executed on each setup and the results are compared.

Chapter 7 summarizes the results in this thesis and gives directions for future work.

Chapter 2: Virtualization

In recent years virtualization has become a widespread technology that is used to abstract, combine or divide computing resources to allow resource requests to be described and fulfilled with minimal dependence on the underlying physical delivery. The first traces of virtualization date back to the 1960s [1, 2], in research projects that provided concurrent, interactive access to mainframes. Each virtual machine (VM) gave the user the illusion of working directly on a physical machine. By partitioning the system into virtual machines, multiple users could concurrently use the system, each within their own operating system. The projects provided an elegant way to enable time- and resource-sharing on expensive mainframes. Users could execute, develop, and test applications within their own virtual machine without interfering with other users. At that time, virtualization was used to reduce the cost of acquiring new hardware and to improve productivity by letting more users work simultaneously.

In the late 1970s and early 1980s virtualization became unpopular because of the introduction of cheaper hardware and multiprocessing operating systems. The popular x86 architecture lacked the power to run multiple operating systems at the same time. But since this hardware was so cheap, a dedicated machine was used for each separate application. The use of these dedicated machines led to a decrease in the use of virtualization.

The ideas of virtualization became popular again in the late 1990s with the emergence of a wide variety of operating systems and hardware configurations. Virtualization was used for executing a series of applications, targeted for different hardware or operating systems, on a given machine. Instead of buying dedicated machines and operating systems for each application, the use of virtualization on one machine offers the ability to create virtual machines that are able to run these applications.

Virtualization concepts can be used in many areas of computer science. Large variations in the abstraction level and underlying architecture lead to many definitions of virtualization. In “A survey on virtualization technologies”, S. Nanda and T. Chiueh define virtualization by the following relaxed definition [1]:

Definition 2.1 Virtualization is a technology that combines or divides computing resources to present one or many operating environments using methodologies like hardware and software partitioning or aggregation, partial or complete machine simulation, emulation, time-sharing, and many others.

The definition mentions the aggregation of resources but in this context the focus lies on the partitioning of resources. Throughout the rest of this thesis, virtualization provides infrastructure used to abstract lower-level, physical resources and to create multiple independent and isolated virtual machines.

2.1 Applications

The expanding range of computer applications and their varied requirements for hardware and operating systems increases the need for users to start using virtualization. Most people will have already used virtualization without realizing it because there are many applications where virtualization can be used in some form. This section elaborates on some practical applications where virtualization can be used. S. Nanda and T. Chiueh enumerate some of these applications in “A survey on virtualization technologies” but the list is not complete and one can easily think of other applications [1].

A first practical application that benefits from using virtualization is server consolidation [3]. It allows system administrators to consolidate workloads of multiple under-utilized machines to a few powerful machines. This saves hardware, management, administration of the infrastructure, space, cooling and power. A second application that also involves consolidation is application consolidation. A legacy application might require faster and newer hardware but might also require a legacy operating system. The need for such legacy applications could be served well by virtualizing the newer hardware.

Virtual machines can be used for providing secure, isolated environments to run foreign or less-trusted applications. This form of sandboxing can help build secure computing platforms. Besides sandboxing, virtualization can also be used for debugging purposes. It can help debug complicated software such as operating systems or device drivers by letting the user execute them on an emulated PC with full software controls. Moreover, virtualization can help produce arbitrary test scenarios that are hard to produce in reality and thus eases the testing of software.

Virtualization provides the ability to capture the entire state of a running virtual machine, which creates new management possibilities. Saving the state of a virtual machine, also called a snapshot, offers the user the capability to roll back to the saved state when, for example, a crash occurs in the virtual machine. The saved state can also be used to package an application together with its required operating system; this is often called an “appliance”. This eases the installation of that application on a new server, lowering the entry barrier for its use. Another advantage of snapshots is that the user can copy the saved state to other physical servers and use the new instance of the virtual machine without having to install it from scratch. This is useful for migrating virtual machines from one physical server to other physical servers when needed.
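The snapshot idea can be illustrated with a minimal sketch. The `ToyVM` class, its `memory` and `registers` fields and the `clone_from` helper are hypothetical names introduced only for this illustration; a real hypervisor saves far more state (devices, disks, CPU context) and uses its own formats.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical stand-in for a virtual machine; not a real hypervisor API."""
    memory: dict = field(default_factory=dict)
    registers: dict = field(default_factory=dict)

    def snapshot(self) -> dict:
        # Capture the entire state as an independent deep copy.
        return copy.deepcopy({"memory": self.memory, "registers": self.registers})

    def restore(self, saved: dict) -> None:
        # Roll back by replacing the live state with a copy of the snapshot,
        # so the snapshot itself stays reusable.
        state = copy.deepcopy(saved)
        self.memory = state["memory"]
        self.registers = state["registers"]


def clone_from(saved: dict) -> ToyVM:
    # Build a new instance from the saved state, e.g. on another server.
    vm = ToyVM()
    vm.restore(saved)
    return vm


vm = ToyVM(memory={0x10: "app data"}, registers={"pc": 4})
saved = vm.snapshot()
vm.memory[0x10] = "corrupted"   # simulate a crash that damages the state
vm.restore(saved)               # roll back to the snapshot
replica = clone_from(saved)     # second instance without reinstalling
```

Because the snapshot is a deep copy, later changes in the running machine do not leak into the saved state, which is what makes both rollback and cloning safe in this sketch.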

Another practical application is the use of virtualization within distributed network computing systems [4]. Such a system must deal with the complexity of decoupling local administration policies and configuration characteristics of distributed resources from the quality of service expected from end users. Virtualization can simplify or eliminate this complex decoupling because it offers functionality like consolidation of physical resources, security and isolation, flexibility and ease of management.

It is not difficult to see that the practical applications given in this section are just a few examples of the many possible uses for virtualization. The number of possible advantages that virtualization can provide continues to rise, making it more and more popular.

2.2 Taxonomy

Virtual machines can be divided into two main categories, namely process virtual machines and system virtual machines. In order to describe the differences, this section starts with an overview of the different implementation layers in a computer system, followed by the characteristics of process virtual machines. Finally, the characteristics of system virtual machines are explained. Most information in this section is drawn from the book “Virtual machines: Versatile platforms for systems and processes” by J. E. Smith and R. Nair [5].

The complexity in computer systems is tackled by the division into levels of abstraction separated by well-defined interfaces. Implementation details at lower levels are ignored or simplified by introducing levels of abstraction. In both hardware and software in a computer system, the levels of abstraction correspond to implementation layers. A typical architecture of a computer system consists of several implementation layers. Figure 2.1 shows the key implementation layers in a typical computer system. At the base of the computer system we have the hardware layer consisting of all the different components of a modern computer. Just above the hardware layer, we find the operating system layer which exploits the hardware resources to provide a set of services to system users [6]. The libraries layer allows application calls to invoke various services available on the system, including those provided by the operating system. At the top, the application layer consists of the applications running on the computer system.

Figure 2.1 also shows the three interfaces between the implementation layers – the instruction set architecture (ISA), the application binary interface (ABI), and the application programming interface (API) – which are especially important for virtual machine construction [7]. The division between hardware and software is marked by the instruction set architecture. The ISA consists of two interfaces, the user ISA and the system ISA. The user ISA includes the aspects visible to the libraries and application layers. The system ISA is a superset of the user ISA which also includes those aspects visible to supervisor software, such as the operating system.

The application binary interface provides a program or library access to the hardware resources and services available in the system. This interface consists of the user ISA and a system call interface which allows application programs to interact with the shared hardware resources indirectly. The ABI allows the operating system to perform operations on behalf of a user program.

The application programming interface allows a program to invoke various services available on the system and is usually defined with respect to a high-level language (HLL). An API enables applications written to the API to be ported easily to other systems that support the same API. The interface consists of the user ISA and of HLL library calls.

Using the three interfaces, virtual machines can be divided into two main categories: process virtual machines and system virtual machines. A process VM runs a single program, supporting only an individual process. It provides a user application with a virtual ABI or API environment. The process virtual machine is created when the corresponding process is created and terminates when the process terminates. The system virtual machines provide a complete system environment in which many processes can coexist. System VMs do this by virtualizing the ISA layer.

2.2.1 Process virtual machines

Process virtual machines virtualize the ABI or API and can run only a single user program. Each virtual machine thus supports a single process, possibly consisting of multiple threads. The most common process VM is an operating system. It supports multiple user processes to run simultaneously by time-sharing the limited hardware resources. The operating system provides a replicated process VM for each executing program so that each program thinks that it has its own machine.

Program binaries that are compiled for a different instruction set are also supported by process VMs. There are two approaches for emulating the instruction set. Interpretation is a simple but slow approach; an interpreter fetches, decodes and emulates each individual instruction. A more efficient approach is dynamic binary translation, which is explained in section 3.1.
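The fetch, decode and emulate cycle of an interpreter can be sketched as follows. The three-opcode instruction set (`li`, `add`, `halt`) and the register names are invented for this illustration and do not correspond to any real ISA.

```python
def interpret(program, registers=None):
    """Run a list of (opcode, operands...) tuples on a toy register machine."""
    regs = dict(registers or {})
    pc = 0  # program counter of the emulated machine
    while pc < len(program):
        instr = program[pc]            # fetch
        opcode, *operands = instr      # decode
        pc += 1
        # emulate: apply the effect of the instruction to the emulated state
        if opcode == "li":             # load immediate: li rd, value
            rd, value = operands
            regs[rd] = value
        elif opcode == "add":          # add rd, rs1, rs2
            rd, rs1, rs2 = operands
            regs[rd] = regs[rs1] + regs[rs2]
        elif opcode == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return regs


program = [
    ("li", "r1", 2),
    ("li", "r2", 40),
    ("add", "r0", "r1", "r2"),
    ("halt",),
]
result = interpret(program)
print(result["r0"])  # prints 42
```

Every guest instruction passes through the whole loop, including the opcode dispatch, which is why interpretation is simple to write but slow compared with translating blocks of instructions once and reusing them.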

The emulation between different instruction sets provides cross-platform compatibility only on a case-by-case basis and requires considerable programming effort. Designing a process-level VM together with an HLL application development environment is an easier way to achieve full cross-platform portability. The HLL virtual machine does not correspond to any real platform, but is designed for ease of portability. The virtual machine of the Java programming language is a widely used example of an HLL VM.

2.2.2 System virtual machines

System virtual machines provide a complete system environment by virtualizing the ISA layer. They allow a physical hardware system to be shared among multiple, isolated guest operating system environments simultaneously. The layer that provides the hardware virtualization is called the virtual machine monitor (VMM) or hypervisor. It manages the hardware resources so that multiple guest operating system environments and their user programs can execute simultaneously. System VMs are subdivided according to the supported ISAs of the guest operating systems, that is, whether virtualization or emulation is used. Virtualization can be further subdivided based on the location where the hypervisor is executed: native or hosted. The following two paragraphs clarify the subdivision according to the supported ISAs.

Emulation: Guest operating systems with a different ISA from the host ISA can be supported through emulation. The hypervisor must emulate both the application and operating system code by translating each instruction to the ISA of the physical machine. The translation is applied to each instruction so that the hypervisor can easily manage all hardware resources. When emulation is used for guest operating systems with the same ISA as the host ISA, performance is considerably lower than with virtualization.

Virtualization: When the ISA of the guest operating system is the same as the host ISA, virtualization can be used to improve performance. It treats non-privileged instructions and privileged instructions differently. A privileged instruction is an instruction that traps when executed in user mode instead of in kernel mode and will be discussed in more detail in section 2.3. Non-privileged instructions are executed directly on the hardware without intervention of the hypervisor. Privileged instructions are caught by the hypervisor and translated in order to guarantee correct results. When guest operating systems primarily execute non-privileged instructions, the performance is close to native speed.
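The different treatment of the two instruction classes can be sketched with a toy model. All names here (`ToyCPU`, `ToyHypervisor`, the opcodes `add`, `set_timer` and `disable_interrupts`) are assumptions made for this illustration; the sketch models the trap with a Python exception and does not describe how any particular hypervisor is implemented.

```python
class PrivilegedInstructionTrap(Exception):
    """Raised when a privileged instruction is executed in user mode."""


PRIVILEGED = {"set_timer", "disable_interrupts"}  # hypothetical opcodes


class ToyCPU:
    def __init__(self):
        self.mode = "user"      # the guest runs de-privileged
        self.acc = 0
        self.direct = 0         # instructions executed without intervention

    def execute(self, opcode, arg=0):
        if opcode in PRIVILEGED and self.mode != "kernel":
            raise PrivilegedInstructionTrap(opcode)
        if opcode == "add":
            self.acc += arg
        self.direct += 1


class ToyHypervisor:
    def __init__(self, cpu):
        self.cpu = cpu
        self.virtual_timer = None
        self.virtual_interrupts_enabled = True
        self.trapped = 0

    def run_guest(self, instructions):
        for opcode, arg in instructions:
            try:
                self.cpu.execute(opcode, arg)   # direct execution
            except PrivilegedInstructionTrap:
                self.trapped += 1
                self.emulate(opcode, arg)       # hypervisor takes over

    def emulate(self, opcode, arg):
        # Apply the effect to the guest's virtual state, not to the real machine.
        if opcode == "set_timer":
            self.virtual_timer = arg
        elif opcode == "disable_interrupts":
            self.virtual_interrupts_enabled = False


cpu = ToyCPU()
hypervisor = ToyHypervisor(cpu)
hypervisor.run_guest([
    ("add", 5),
    ("set_timer", 100),
    ("add", 7),
    ("disable_interrupts", 0),
])
```

In this run two instructions execute directly and two trap into the hypervisor. The more a guest workload is dominated by the first kind, the closer its speed comes to native execution.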

Thus, when the ISA of the guest and the host are the same, the best performing technique is virtualization. It improves performance in terms of execution speed by running non-privileged instructions directly on the hardware. If the ISA of the guest and the host are different, emulation is the only way to execute the guest operating

(19)

2.2. TAXONOMY 8

system. The subdivision of virtualization based on the location of the hypervisor is clarified in the next two paragraphs.

Native, bare-metal hypervisor: A native, bametal hypervisor, also

re-ferred to as a Type 1 hypervisor, is the first layer of software installed on a clean system. The hypervisor runs in the most privileged mode, while all the guests run in a less privileged mode. It runs directly on the hardware and executes the in-tercepted instructions directly on the hardware. According to J. E. Smith and R. Nair, a bare-metal hypervisor is more efficient than a hosted hypervisor in many respects since it has direct access to hardware resources, enabling greater scalabil-ity, robustness and performance [5]. There are some variations of this architecture where a privileged guest operating system handles the intercepted instructions. The disadvantage of a native, bare-metal hypervisor is that a user must clear the existing operating systems in order to install the hypervisor.

Hosted hypervisor: An alternative to a native, bare-metal hypervisor is the hosted or Type 2 hypervisor. It runs on top of a standard operating system and supports the broadest range of hardware configurations [3]. The installation of the hypervisor is similar to the installation of an application within the host operating system. The hypervisor relies on the host OS for device support and physical resource management. Privileged instructions cannot be executed directly on the hardware but are modified by the hypervisor and passed down to the host OS.

The implementation specifics of Type 1 and Type 2 hypervisors can be separated into several categories: dynamic binary translation, paravirtualization and hardware assisted virtualization. These approaches are discussed in more detail in chapter 3, which elaborates on virtualization within system virtual machines. An overview of the taxonomy of virtual machines is shown in figure 2.2.


2.3 x86 architecture

The taxonomy given in the previous section provides an overview of different virtual machines and different implementation approaches. This section gives detailed information about the requirements associated with virtualization and the problems that occur when virtualization technologies are implemented on the x86 architecture.

2.3.1 Formal requirements

In order to provide insight into the problems and solutions for virtualization on top of the x86 architecture, the formal requirements for a virtualizable architecture are given first. These requirements describe what is needed in order to use virtualization on a computer architecture. In “Formal requirements for virtualizable third generation architectures”, G. J. Popek and R. P. Goldberg defined a set of formal requirements for a virtualizable computer architecture [8]. They divided the ISA instructions into several groups. The first group contains the privileged instructions:

Definition 2.2 Privileged instructions are all the ISA instructions that only work in kernel mode and trap when executed in user mode.

Another important group of instructions that has a big influence on the virtualizability of a particular machine consists of the sensitive instructions. Before defining sensitive instructions, the notions of behaviour sensitive and control sensitive are explained.

Definition 2.3 An instruction is behaviour sensitive if the effect of its execution depends on the state of the hardware, i.e. upon its location in real memory, or on the mode.

Definition 2.4 An instruction is control sensitive if it changes the state of the hardware upon execution, i.e. it attempts to change the amount of resources available or affects the processor mode without going through the memory trap sequence.

With these notions, instructions can be separated into sensitive instructions and innocuous instructions.

Definition 2.5 Sensitive instructions are the group of instructions that are either control sensitive or behaviour sensitive.

Definition 2.6 Innocuous instructions are the group of instructions that are not sensitive instructions.

According to Popek and Goldberg, there are three properties of interest when any arbitrary program is executed while the control program (the virtual machine monitor) is resident: efficiency, resource control, and equivalence.

The efficiency property: All innocuous instructions are executed by the hardware directly, with no intervention at all on the part of the control program.

The hypervisor should not intervene for instructions that do no harm. These instructions do not change the state of the hardware and should be executed by the hardware directly in order to preserve performance. The more instructions are executed directly, the better the performance of the virtualization will be. This property highlights the contrast between emulation - where every single instruction is analyzed - and virtualization.

The resource control property: It must be impossible for that arbitrary program to affect the system resources, i.e. memory, available to it; the allocator of the control program is to be invoked upon any attempt.

The hypervisor is in full control of the hardware resources. A virtual machine should not be able to access the hardware resources directly. It should go through the hypervisor to ensure correct results and isolation from other virtual machines.

The equivalence property: Any program K executing with a control program resident, with two possible exceptions, performs in a manner indistinguishable from the case when the control program did not exist and K had whatever freedom of access to privileged instructions that the programmer had intended.

A program running on top of a hypervisor should exhibit the same behaviour as it would when running on the hardware directly. As mentioned, there are two exceptions: timing and resource availability problems. The hypervisor will occasionally intervene, so instruction sequences may take longer to execute. A program that makes assumptions about how long its execution takes can therefore produce incorrect results. The second exception, the resource availability problem, might occur when the hypervisor does not satisfy a particular request for space. The program may then be unable to function in the same way as if the space were made available. The problem could easily occur, since the virtual machine monitor itself and other possible virtual machines take space as well. A virtual machine environment can be seen as a “smaller” version of the actual hardware: logically the same, but with a smaller quantity of certain resources.

Given the categories of instructions and the properties, they define the hypervisor and a virtualizable architecture as:

Definition 2.7 We say that a virtual machine monitor, or hypervisor, is any control program that satisfies the three properties of efficiency, resource control and equivalence. Then functionally, the environment which any program sees when running with a virtual machine monitor present is called a virtual machine. It is composed of the original real machine and the virtual machine monitor.

Definition 2.8 For any conventional third generation computer, a virtual machine monitor may be constructed, i.e. it is a virtualizable architecture, if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.
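The subset condition of Definition 2.8 can be illustrated with a small sketch. The two instruction sets below are hypothetical toy ISAs chosen for illustration only; they are not taken from Popek and Goldberg or from any real processor.

```python
# Toy check of the condition in Definition 2.8: an architecture is
# virtualizable if every sensitive instruction is also privileged.
# All instruction names are hypothetical.

def non_virtualizable(sensitive, privileged):
    """Return the sensitive instructions that do not trap in user mode."""
    return set(sensitive) - set(privileged)

def is_virtualizable(sensitive, privileged):
    """True if the sensitive set is a subset of the privileged set."""
    return not non_virtualizable(sensitive, privileged)

# Hypothetical ISA A: every sensitive instruction traps in user mode.
isa_a = {"sensitive": {"LOAD_MODE", "SET_TIMER"},
         "privileged": {"LOAD_MODE", "SET_TIMER", "HALT"}}

# Hypothetical ISA B: READ_MODE is sensitive but does not trap.
isa_b = {"sensitive": {"LOAD_MODE", "READ_MODE"},
         "privileged": {"LOAD_MODE", "HALT"}}

if __name__ == "__main__":
    for name, isa in (("A", isa_a), ("B", isa_b)):
        print(name, is_virtualizable(**isa), sorted(non_virtualizable(**isa)))
```

In this sketch ISA B fails the condition because of a single instruction, which is the kind of situation the definition rules out.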


2.3.2 The x86 protection level architecture

The x86 architecture recognizes four privilege levels, numbered from 0 to 3 [9]. Figure 2.3 shows how the privilege levels can be interpreted as rings of protection. The center ring, ring 0, is reserved for the most privileged code and is used for the kernel of an operating system. When the processor is running in kernel mode, the code is executing in ring 0. Rings 1 and 2 are less privileged and are used for operating system services. These two are rarely used but some techniques in virtualization will run the guests inside ring 1. The outermost ring is used for applications and has the least privileges. The code of applications running in user mode will execute in ring 3.

Figure 2.3: The x86 protection levels.

These rings are used to prevent a program operating in a less privileged ring from accessing more privileged system routines. A call gate is used to allow an outer ring to access an inner ring’s resource in a predefined manner.

2.3.3 The x86 architecture problem

A computer architecture can support virtualization if it meets the formal requirements described in subsection 2.3.1. The x86 architecture, however, does not meet these requirements. The x86 instruction set architecture contains sensitive instructions that are non-privileged, called non-virtualizable instructions. In other words, these instructions will not trap when executed in user mode and they depend on or change the hardware state. This is not desirable because the hypervisor cannot simulate the effect of the instruction. The current hardware state could belong to another virtual machine, producing an incorrect result for the current virtual machine.

The non-virtualizable instructions make virtualization on the x86 architecture more difficult. Virtualization techniques will need to deal with these instructions. Applications will only run at near native speed when they contain a minimal number of non-virtualizable instructions. Approaches that overcome the limitations of the x86 architecture are discussed in the next chapter.


3 Evolution of virtualization for the x86 architecture

Developers of virtualization software did not wait until processor vendors solved the x86 architecture problem. They introduced software solutions like binary translation and, when virtualization became more popular, paravirtualization. Processor vendors then introduced hardware support to solve the design problem of the x86 architecture and at a later stage to improve the performance. The next generation hardware support was introduced to improve performance concerning the memory management.

This chapter gives an overview of the evolution towards hardware supported virtualization on x86 architectures. Dynamic binary translation, a software solution that tries to circumvent the design problem of the x86 architecture, is explained in the first section. The second section explains paravirtualization, a software solution which tries to improve the binary translation concept. It has some advantages and disadvantages over dynamic binary translation. The third section gives details on the first generation hardware support and its advantages and disadvantages over software solutions. In many cases the software solutions outperform the hardware support. The next generation hardware support tries to further close the performance gap by eliminating major sources of virtualization overhead. The second generation hardware support focuses on memory management and is discussed in the fourth section. The last section gives an overview of VirtualBox, KVM and Xen, which are virtualization products, and VMware, a company providing multiple virtualization products.

3.1 Dynamic binary translation

In full virtualization, the guest OS is not aware that it is running inside a virtual machine and requires no modifications [10]. Dynamic binary translation is a technique that implements full virtualization. It requires no hardware assisted or operating system assisted support while other techniques, like paravirtualization, need modifications to either the hardware or the operating system.

Dynamic binary translation is a technique which works by translating code from one instruction set to another. The word “dynamic” indicates that the translation is done on the fly and is interleaved with execution of the generated code [11]. The word “binary” indicates that the input is binary code and not source code. To improve performance, the translation is mostly done on blocks of code instead of single instructions [12]. A block of code is defined as a sequence of instructions that ends with a jump or branch instruction. A translation cache is used to avoid retranslating code blocks multiple times.

In x86 virtualization, dynamic binary translation is not used to translate between different instruction set architectures. Instead, the translation is done from x86 instructions to x86 instructions. This makes the translation a lot lighter than previous binary translation technologies [13]. Since it is a translation between the same ISA, a copy of the original instructions often suffices. In other words, generally no translation is needed and the code can be executed as is. In particular, whenever the guest OS is executing code in user mode, no translation will be carried out and the instructions are executed directly, which is comparable in performance to execution of the code natively. Code that the guest OS wants to execute in kernel mode will be translated on the fly and is saved in the translation cache.

Even when the guest OS is running kernel code, most of the time no translation is needed and the code is copied as is. Only in some cases will the hypervisor need to translate instructions of the kernel code to guarantee the integrity of the guest. The kernel of the guest is executed in ring 1 instead of ring 0 when using software virtualization. As explained in section 2.3, the x86 instruction set architecture contains sensitive instructions that are non-privileged. If the kernel of the guest operating system wants to execute privileged instructions or one of these non-virtualizable instructions, the dynamic binary translation will translate the instructions into a safe equivalent. The safe equivalent will not harm other guests or the hypervisor. For example, if access to the physical hardware is needed, the performed translation ensures that the code will use the virtual hardware instead. In these cases, the translation ensures that the safe code is also less costly than the code with privileged instructions. The code with privileged instructions would trap when running in ring 1 and the hypervisor would have to handle these traps. The dynamic binary translation thus avoids the traps by replacing the privileged instructions, so that there are fewer interrupts and the safe code is less costly.
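The block-based translation with a translation cache described above can be sketched as follows. This is a minimal illustration on a toy instruction format, not how any particular hypervisor implements it; the opcode names, the `SENSITIVE` replacement table and the `Translator` class are all assumptions made for the example.

```python
# Minimal sketch of block-based dynamic binary translation with a
# translation cache. A program is a list of (opcode, operand) tuples.
# Opcodes and their "safe" replacements are hypothetical.

BRANCHES = {"JMP", "BR", "RET"}               # instructions that end a block
SENSITIVE = {"CLI": "VCLI", "STI": "VSTI"}    # assumed safe equivalents

def fetch_block(program, start):
    """Collect instructions from `start` up to and including the next branch."""
    block = []
    for insn in program[start:]:
        block.append(insn)
        if insn[0] in BRANCHES:
            break
    return block

def translate_block(block):
    """Copy innocuous instructions unchanged; rewrite sensitive ones."""
    return [(SENSITIVE.get(op, op), arg) for op, arg in block]

class Translator:
    def __init__(self, program):
        self.program = program
        self.cache = {}        # block start address -> translated block
        self.translations = 0  # number of blocks actually translated

    def lookup(self, start):
        """Return the translated block, translating only on a cache miss."""
        if start not in self.cache:
            block = fetch_block(self.program, start)
            self.cache[start] = translate_block(block)
            self.translations += 1
        return self.cache[start]

program = [("MOV", 1), ("CLI", None), ("ADD", 2), ("JMP", 0), ("NOP", None)]
t = Translator(program)
first = t.lookup(0)
second = t.lookup(0)   # served from the cache, no retranslation
```

Most instructions pass through unchanged, which mirrors the observation that an x86-to-x86 translation is largely a copy; only the entries in the replacement table are rewritten.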

The translation of code into safer equivalents is less costly than letting the privileged instructions trap, but the translation itself should also be taken into account. Luckily, the translation overhead is rather low and will decrease over time since translated pieces of code are cached in order to avoid retranslation in case of loops in the code. Yet, dynamic binary translation has a few cases it cannot fully solve: system calls, I/O, memory management and complex code. The latter is the set of code that, for example, does self-modification or has indirect control flows. This code is complex to execute, even on an operating system that runs natively. The other cases are described in more detail in the next subsections.


3.1.1 System calls

A system call is a mechanism used by processes to access the services provided by the operating system. This involves a transition to the kernel where the required function is then performed [6, 14]. The kernel of an operating system is also a process, but it differs from other processes in that it has privileged access to processor instructions. The kernel will not execute directly but only when it receives an interrupt from the processor or a system call from another process also running in the operating system. There are many different techniques for implementing system calls. One way is to use a software interrupt and trap, but for x86 a faster technique was chosen [13, 15]. Intel and AMD have come up with the instructions SYSCALL/SYSENTER and SYSRET/SYSEXIT for a process to do a system call. These instructions transfer control to the kernel without the overhead of an interrupt.

In software virtualization the kernel of the guest will run inside ring 1 instead of ring 0. This implies that the hypervisor should intercept a SYSENTER (or SYSCALL), translate the code and hand over control to the kernel of the guest. This kernel then executes the translated code and executes a SYSEXIT (or SYSRET) to return control back to the process that requested the service of the kernel. Because the kernel of the guest is running inside ring 1, it does not have the privilege to perform the SYSEXIT. This will cause an interrupt at the processor and the hypervisor has to emulate the effect of this instruction.

System calls will cause a significant amount of overhead when using software virtualization. In a virtual machine, a system call costs about 10 times the cycles needed for a system call on a native machine. In “A comparison of software and hardware techniques for x86 virtualization”, the authors measured that a system call on a 3.8 GHz Pentium 4 takes 242 cycles [11]. On the same machine, a system call in a virtual machine, virtualized with dynamic binary translation and the kernel running in ring 1, takes 2308 cycles. In an environment where virtualization is used there will most likely be more than one virtual machine on a physical machine. In this case, the overhead of the system calls can become a significant part of the virtualization overhead. As we will see later, hardware support for virtualization offers a solution for this.
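As a quick check of the figures quoted above, the ratio and the added cost per call can be worked out directly. The three constants are the values cited from [11]; the conversion to nanoseconds simply assumes one cycle per clock period at 3.8 GHz.

```python
# Arithmetic on the system call figures quoted in the text [11].
NATIVE_CYCLES = 242       # system call on the native machine
VIRTUAL_CYCLES = 2308     # same call under dynamic binary translation
CLOCK_HZ = 3.8e9          # 3.8 GHz Pentium 4

slowdown = VIRTUAL_CYCLES / NATIVE_CYCLES        # virtual / native ratio
extra_cycles = VIRTUAL_CYCLES - NATIVE_CYCLES    # added cycles per call
extra_ns = extra_cycles / CLOCK_HZ * 1e9         # added time per call

print(f"slowdown: {slowdown:.1f}x")
print(f"extra cost per call: {extra_cycles} cycles (~{extra_ns:.0f} ns)")
```

The ratio comes out slightly below ten, which is consistent with the "about 10 times" stated in the text.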

3.1.2 I/O virtualization

When creating a virtual machine, not only the processor needs to be virtualized but also all the essential hardware like memory and storage. Each I/O device type has its own characteristics and needs to be controlled in its own special way [5]. There are often a large number of devices for an I/O device type and this number continues to rise. The strategy consists of constructing a virtual I/O device and then virtualizing the I/O activity that is directed at the device. Every access to this virtual hardware must be translated to the real hardware. The hypervisor must intercept all I/O operations issued by the guest operating system and it must emulate these instructions using software that understands the semantics of the specific I/O port accessed [16]. The I/O devices are emulated because of the ease of migration and multiplexing advantages [17]. Migration is easy because the virtual device exists in memory and can easily be transferred. The hypervisor can present a virtual device to each guest while performing the multiplexing.
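The interception and emulation of I/O described above can be sketched as a dispatch from intercepted port accesses to in-memory device models. The class names, the port number and the device behaviour below are hypothetical and serve only to illustrate the idea of one virtual device per guest.

```python
# Sketch of device emulation: the hypervisor intercepts guest port I/O
# and routes it to a software device model. Port numbers, class names
# and device behaviour are hypothetical.

class VirtualSerialPort:
    """Device model that lives entirely in memory."""
    def __init__(self):
        self.output = []

    def write(self, value):
        self.output.append(value)

    def read(self):
        return 0  # no input pending in this toy model

class Hypervisor:
    def __init__(self):
        self.devices = {}   # (guest id, port) -> device model

    def attach(self, guest, port, device):
        self.devices[(guest, port)] = device

    def on_port_write(self, guest, port, value):
        # Called when a guest write to an I/O port is intercepted.
        self.devices[(guest, port)].write(value)

    def on_port_read(self, guest, port):
        # Called when a guest read from an I/O port is intercepted.
        return self.devices[(guest, port)].read()

hv = Hypervisor()
hv.attach("guest-a", 0x3F8, VirtualSerialPort())
hv.attach("guest-b", 0x3F8, VirtualSerialPort())  # same port, separate device
hv.on_port_write("guest-a", 0x3F8, ord("A"))
```

Because each device model is an ordinary in-memory object, its state could be copied along with the guest, which is the property that makes migration straightforward.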

Emulation has the disadvantage of poor performance. The hypervisor must perform a significant amount of work to present the illusion of a virtual device. The great number of physical devices makes the emulation of the I/O devices in the hypervisor complex. The hypervisor needs drivers for every physical device in order to be usable on different physical systems. A hosted hypervisor has the advantage that it can reuse the device drivers provided by the host operating system. Another problem is that the virtual I/O device is often a device model which does not match the full power of the underlying physical devices [18]. This means that optimizations implemented by specific devices can be lost in the process of emulation.

3.1.3 Memory management

In an operating system, every application has the illusion that it is working with a piece of contiguous memory, whereas in reality the memory used by applications can be dispersed across the physical memory. The application is working with virtual addresses that are translated to physical addresses. The operating system manages a set of tables to do the translation of the virtual addresses to the physical addresses. The x86 architecture provides support for paging in the hardware. Paging is the process that translates the virtual addresses of a process to physical addresses of the system. The hardware that translates the virtual addresses to physical addresses is called the memory management unit or MMU.

The page table walker performs address translation using the page tables and uses a hardware page table pointer, the CR3 register, to start the page walk [19]. It will traverse several page table entries which point to the next level of the walk. The memory hierarchy will be traversed many times when the page walker performs address translation. To keep this overhead within limits, a translation look-aside buffer (TLB) is used. The most recent translations are saved in this buffer. The processor will first check the TLB to see whether the translation is located in the cache. When the translation is found in the buffer this translation is used, otherwise a page walk is performed and the result is saved in the TLB. The operating system and the processor must cooperate in order to ensure that the TLB stays consistent.

Inside a virtual machine the guest operating system manages its own page tables. The task of the hypervisor is not only to virtualize the memory but also to virtualize the virtual memory so that the guest operating system can use virtual memory [20]. This introduces an extra level of translation which maps physical addresses of the guest to real physical addresses of the system. The hypervisor must manage the address translation on the processor using software techniques. It derives a shadow version of the page table from the guest page table, which holds the translations of the virtual guest addresses to the real physical addresses. This shadow page table will be used by the processor when the guest is active and the hypervisor manages this shadow table to keep it synchronized with the guest page table. The guest does not have access to these shadow page tables and can only see its own guest page tables, which run on an emulated MMU. It has the illusion that it can translate the virtual addresses to real physical ones. In the background, the hypervisor will deal with the real translation using the shadow page tables.

Figure 3.1: Memory management in x86 virtualization using shadow tables.

Figure 3.1 shows the translations needed for translating a virtual guest address into a real physical address. Without the shadow page tables, the virtual guest memory (orange area) will be translated into physical guest memory (blue area) and the latter is translated into real physical memory (white area). The shadow page tables avoid the double translation by immediately translating the virtual guest memory (orange) into real physical memory (white) as shown by the red arrow.
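The way a shadow page table collapses the two translation steps into one can be sketched with plain mappings. The page numbers are made up and the page tables are flattened to dictionaries, so this is only an illustration of the composition, not of the multi-level x86 page table format.

```python
# Sketch of how a shadow page table composes two mappings. Page tables
# are flattened to dicts of page numbers; all numbers are made up.

guest_page_table = {0: 7, 1: 3, 2: 5}   # guest virtual page -> guest physical page
physical_map = {3: 40, 5: 41, 7: 42}    # guest physical page -> real physical page

def build_shadow(guest_pt, phys_map):
    """Derive guest virtual -> real physical translations in one step."""
    return {vpage: phys_map[gpage]
            for vpage, gpage in guest_pt.items()
            if gpage in phys_map}

shadow = build_shadow(guest_page_table, physical_map)

def translate(vpage):
    # Only the shadow table is consulted while the guest runs.
    return shadow[vpage]

# When the guest edits its page table, the shadow copy is out of date
# and the hypervisor has to resynchronize it.
guest_page_table[1] = 5
shadow = build_shadow(guest_page_table, physical_map)
```

Rebuilding the whole table on every change is the simplest possible synchronization strategy and is used here only to keep the sketch short.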

In software, several techniques can be used to keep the shadow page tables and guest page tables consistent. These techniques use the page fault exception mechanism of the processor. It throws an exception when a page fault occurs and allows the hypervisor to update the current shadow page table. This introduces extra page faults due to the shadow paging. The shadow page tables introduce an overhead because of the extra page faults and the extra work in keeping the shadow tables up to date. The shadow page tables also consume additional memory. Maintaining shadow page tables for SMP guests also introduces a certain overhead. Each processor in the guest can use the same guest page table instance. The hypervisor could maintain a separate shadow page table instance for each processor, which results in memory overhead. Another possibility is to share the shadow page table between the virtual processors, leading to synchronization overheads.

3.2 Paravirtualization

Paravirtualization is in many ways comparable to dynamic binary translation. It is also a software technique designed to enable virtualization on the x86 architecture. As explained in “Denali: Lightweight Virtual Machines for Distributed and Networked Applications” and used in Denali [21], paravirtualization exposes a virtual architecture to the guest that is slightly different from the physical architecture.


Dynamic binary translation translates “critical” code into safe code on the fly. Paravirtualization does the same thing but requires changes in the source code of the operating system in advance. The operating systems built for the x86 architecture are by default not compatible with the paravirtualized architecture. This is a major disadvantage for existing operating systems because extra effort is needed in order to run these operating systems inside a paravirtualized guest. In the case of Denali, which provides lightweight virtual machines, it allowed them to co-design the virtual architecture with the operating system.

The advantages of a successful paravirtualization are a simpler hypervisor implementation and a smaller performance degradation compared to the physical system. Better performance is achieved because many unnecessary traps by the hypervisor are eliminated. The hypervisor provides hypercall interfaces for critical kernel operations such as memory management, interrupt handling and time keeping [10]. The guest operating system is adapted so that it is aware of the virtualization. The kernel is modified to replace non-virtualizable instructions with hypercalls that communicate directly with the hypervisor. The binary translation overhead is completely eliminated since the modifications are done in the operating system at design time. The implementation of the hypervisor is much simpler because it does not contain the binary translator.

3.2.1 System calls

The overhead of system calls can be reduced somewhat. The dynamic binary translation technique intercepts each SYSENTER/SYSCALL instruction and translates the instruction to hand over the control to the kernel of the guest operating system. Afterwards, the guest operating system’s kernel executes a SYSEXIT/SYSRET instruction to return to the application. This instruction is again intercepted and translated by the dynamic binary translation. The paravirtualization technique allows guest operating systems to install a handler for system calls, permitting direct calls from an application into its guest OS and avoiding indirection through the hypervisor on every call [22]. This handler is validated before installation and is accessed directly by the processor without indirection via ring 0.

3.2.2 I/O virtualization

Paravirtualization software mostly uses a different approach for I/O virtualization compared to the emulation used with dynamic binary translation. The guest operating system utilizes a paravirtualized driver that operates on a simplified abstract device model exported by the hypervisor [23]. The real device driver can reside in the hypervisor, but often resides in a separate device driver domain which has privileged access to the device hardware. The latter option is attractive since the hypervisor does not need to provide the device drivers; instead, the drivers of a legacy operating system can be used. Separating the address space of the device drivers from guest and hypervisor code also prevents buggy device drivers from causing system crashes.

The paravirtualized drivers remove the need to emulate devices. They free up processor time and resources which would otherwise be needed to emulate hardware. Since there is no emulation of the device hardware, the overhead is significantly reduced. In Xen, well-known for its use of paravirtualization, the real device drivers reside in a privileged guest known as domain 0. A description of Xen can be found in subsection 3.6.3. However, Xen is not the only hypervisor that uses paravirtualization for I/O. VMware has a paravirtualized I/O device driver, vmxnet, that shares data structures with the hypervisor [10]. “A Performance Comparison of Hypervisors” states that by using the paravirtualized vmxnet network driver they can now run network I/O intensive datacenter applications with very acceptable network performance [24].

3.2.3 Memory management

Paravirtual interfaces can be used by both the hypervisor and guest to reduce hypervisor complexity and overhead in virtualizing x86 paging [19]. When using a paravirtualized memory management unit, the guest operating system page tables are registered directly with the MMU [22]. To reduce the overhead and complexity associated with the use of shadow page tables, the guest operating system has read-only access to the page tables. A page table update is passed to Xen via a hypercall and validated before being applied. Guest operating systems can locally queue page table updates and apply the entire batch with a single hypercall. This minimizes the number of hypercalls needed for the memory management.
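The batching of page table updates into a single validated hypercall can be sketched as follows. The method name `mmu_update`, the ownership check and the data layout are assumptions made for this illustration and are only loosely modelled on the description above; they do not reproduce the actual Xen interface.

```python
# Sketch of batched, validated page table updates. The hypercall name,
# the ownership check and the data layout are hypothetical.

class Hypervisor:
    def __init__(self, owned_frames):
        self.owned_frames = set(owned_frames)  # frames this guest may map
        self.page_table = {}                   # read-only from the guest's side
        self.hypercalls = 0

    def mmu_update(self, updates):
        """Validate a whole batch first, then apply it, in one hypercall."""
        self.hypercalls += 1
        for vpage, frame in updates:
            if frame not in self.owned_frames:
                raise PermissionError(f"frame {frame} not owned by guest")
        for vpage, frame in updates:
            self.page_table[vpage] = frame

class Guest:
    def __init__(self, hv):
        self.hv = hv
        self.pending = []   # locally queued updates

    def queue_update(self, vpage, frame):
        self.pending.append((vpage, frame))

    def flush(self):
        """Submit all queued updates with a single hypercall."""
        self.hv.mmu_update(self.pending)
        self.pending = []

hv = Hypervisor(owned_frames=[10, 11, 12])
guest = Guest(hv)
for vpage, frame in [(0, 10), (1, 11), (2, 12)]:
    guest.queue_update(vpage, frame)
guest.flush()
```

Three updates cost one hypercall here instead of three, which is the saving the batching is meant to achieve.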

3.3 First generation hardware support

In the meantime, processor vendors noticed that virtualization was becoming increasingly popular and they created a solution that solves the virtualization problem on the x86 architecture by introducing hardware assisted support. Hardware support for processor virtualization enables simple, robust and reliable hypervisor software [25]. It eliminates the need for the hypervisor to listen, trap and execute certain instructions for the guest OS [26]. Both Intel and AMD provide these hardware extensions in the form of Intel VT-x and AMD SVM respectively [11, 27, 28].

The first generation hardware support introduces a data structure for virtualization, together with specific instructions and a new execution flow. In AMD SVM, the data structure is called the virtual machine control block (VMCB). The VMCB combines control state with the guest’s processor state. Each guest has its own VMCB with its own control state and processor state. The VMCB contains a list of which instructions or events in the guest to intercept, various control bits and the guest’s processor state. The various control bits specify the execution environment of the guest or indicate special actions to be taken before running guest code. The VMCB is accessed by reading and writing to its physical address. The execution environment of the guest is referred to as guest mode. The execution environment of the hypervisor is called host mode. The new VMRUN instruction transfers control from host to guest mode. The instruction saves the current processor state and loads the corresponding guest state from the VMCB. The processor now runs the guest code until an intercept event occurs. This results in a #VMEXIT at which point the processor writes the current guest state back to the VMCB and resumes host execution at the instruction following the VMRUN. The processor is then executing the hypervisor again. The hypervisor can retrieve information from the VMCB to handle the exit. When the effect of the exiting operation is emulated, the hypervisor can execute VMRUN again to return to guest mode.
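The alternation between host mode and guest mode described above can be sketched as a loop. This is a toy software model: on real hardware the state save and restore is done by the processor, whereas here a dictionary stands in for the VMCB, the guest is a list of operation names, and all field names are hypothetical.

```python
# Toy model of the VMRUN / #VMEXIT cycle. A dict stands in for the
# VMCB and the guest is a list of operations. Field names are
# hypothetical and the hardware behaviour is heavily simplified.

def vmrun(vmcb, guest_code):
    """Run guest operations until one is intercepted or the code ends."""
    pc = vmcb["guest_state"]["pc"]
    while pc < len(guest_code):
        op = guest_code[pc]
        if op in vmcb["intercepts"]:
            vmcb["guest_state"]["pc"] = pc   # guest state written back on exit
            vmcb["exit_reason"] = op
            return
        pc += 1                              # operation runs without intervention
    vmcb["guest_state"]["pc"] = pc
    vmcb["exit_reason"] = "SHUTDOWN"

def run_guest(guest_code, intercepts):
    """Hypervisor loop: enter the guest, handle each exit, enter again."""
    vmcb = {"intercepts": set(intercepts),
            "guest_state": {"pc": 0},
            "exit_reason": None}
    handled = []
    while True:
        vmrun(vmcb, guest_code)              # host mode -> guest mode
        if vmcb["exit_reason"] == "SHUTDOWN":
            return handled
        handled.append(vmcb["exit_reason"])  # hypervisor emulates the operation
        vmcb["guest_state"]["pc"] += 1       # resume after the emulated operation

exits = run_guest(["ADD", "HLT", "MOV", "CPUID", "ADD"], {"HLT", "CPUID"})
```

Only the intercepted operations reach the hypervisor; everything else runs inside the guest without a transition.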

Intel has implemented its own version of hardware support. It has many similarities with the implementation of AMD, although the terminology is somewhat different. Intel uses a virtual machine control structure (VMCS) instead of a VMCB. A VMCS can be manipulated by the new instructions VMCLEAR, VMPTRLD, VMREAD and VMWRITE which clear, load, read from, and write to a VMCS respectively. The hypervisor runs in “VMX root operation” and the guest in “VMX non-root operation” instead of host and guest mode. Software enters the VMX operation by executing the VMXON instruction. From then on, the hypervisor can use a VMEntry to transfer control to one of its guests. There are two instructions available for triggering a VMEntry: VMLAUNCH and VMRESUME. As with AMD SVM, the hypervisor regains control using VMExits. Eventually, the hypervisor can leave the VMX operation with the instruction VMXOFF.

Figure 3.2: Execution flow using virtualization based on Intel VT-x.

The execution flow of a guest, virtualized by hardware support, can be seen in figure 3.2. The VMXON instruction starts and the VMXOFF stops the VMX operation. The guest is started using a VMEntry which loads the VMCS of the guest into the hardware. The hypervisor regains control using a VMExit when a guest tries to execute a privileged instruction. After intervention of the hypervisor, a VMEntry transfers control back to the guest. In the end, the guest can shut down and control is handed back to the hypervisor with a VMExit.

The basic idea behind the first generation hardware support is to fix the problem that the x86 architecture cannot be virtualized. The VMExit forces a transition from guest to hypervisor, which is based on the philosophy of trapping all exceptions and privileged instructions. Nevertheless, each transition between the hypervisor and a virtual machine requires a fixed number of processor cycles. When the hypervisor has to handle a complex operation, the overhead is relatively low. However, for a simple operation the overhead of switching from guest to hypervisor and back is relatively high. Creating processes, context switches and small page table updates are all simple operations that will have a large overhead. In these cases, software solutions like binary translation and paravirtualization perform better than hardware supported virtualization.
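The effect of a fixed transition cost on simple versus complex operations can be made concrete with a small model. The cycle counts below are hypothetical placeholders chosen for illustration, not measurements from any processor.

```python
# Back-of-the-envelope model of transition overhead. The cycle counts
# are hypothetical placeholders, not measurements.

def overhead_fraction(transition_cycles, handler_cycles):
    """Share of the total cost spent on the guest/hypervisor round trip."""
    return transition_cycles / (transition_cycles + handler_cycles)

TRANSITION = 1000   # assumed fixed cost of one exit plus re-entry

simple_op = overhead_fraction(TRANSITION, handler_cycles=100)
complex_op = overhead_fraction(TRANSITION, handler_cycles=50_000)

print(f"simple operation:  {simple_op:.0%} of the cost is transition")
print(f"complex operation: {complex_op:.0%} of the cost is transition")
```

Under these assumed numbers the transition dominates the simple operation and is almost negligible for the complex one, which matches the qualitative argument in the text.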

The overhead can be lowered by reducing the number of processor cycles required for a transition between guest and hypervisor. The exact number of extra processor cycles depends on the processor architecture. For Intel, the format and layout of the VMCS in memory is not architecturally defined, allowing implementation-specific optimizations to improve performance in VMX non-root operation and to reduce the latency of a VMEntry and VMExit [29]. Intel and AMD are reducing these latencies in their newer processors, as figure 3.3 shows for Intel.

Figure 3.3: Latency reductions by CPU implementation [30].

System calls are an example of complex operations having a low transition overhead. System calls do not automatically transfer control from the guest to the hypervisor in hardware supported virtualization. A hypervisor intervention is only needed when the system call contains critical instructions. The overhead when a system call requires intervention is relatively low since a system call is rather complex and already requires a lot of processor cycles.

First generation hardware support does not include support for I/O virtualization and memory management unit virtualization. Hypervisors that use the first generation hardware extensions will need to use a software technique for virtualizing the I/O devices and the MMU. For the MMU, this can be done using shadow tables or paravirtualization of the MMU.

3.4 Second generation hardware support

First generation hardware support has made the x86 architecture virtualizable, but a performance improvement can be measured only in some cases [11]. Maintaining the shadow tables can be an intensive task, as was pointed out in subsection 3.1.3. The next step of the processor vendors was to provide hardware MMU support. This second generation hardware support adds memory management support so the hypervisor does not have to maintain the integrity of the shadow page table mappings [17].

The shadow page tables remove the need to translate the virtual memory of the process to the guest OS physical memory and then translate the latter into the real physical memory, as can be seen in figure 3.1. They provide the ability to immediately translate the virtual memory of the guest process into real physical memory. On the other hand, the hypervisor must do the bookkeeping to keep the shadow page table up to date when an update occurs to the guest OS page table. In existing software solutions like binary translation, this bookkeeping introduces overhead, which is even worse for first generation hardware support. The hypervisor must maintain the shadow page tables, and every time a guest changes an address translation or uses one that is not yet present in the shadow page table, the hypervisor must intervene. In software solutions this intervention is an extra page fault, but in the first generation hardware support this will result in a VMExit and VMEntry roundtrip. As shown in figure 3.3, the latencies of such a roundtrip are improving, but the second generation hardware support removes the need for the roundtrip.
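The bookkeeping burden can be illustrated with a minimal sketch in which page tables are plain dictionaries from page numbers to frame numbers. The names `build_shadow`, `guest_pt` and `p2m` are hypothetical, and real hypervisors update shadow entries incrementally instead of rebuilding the whole table.

```python
def build_shadow(guest_pt, p2m):
    """Compose guest-virtual -> guest-physical with guest-physical -> real physical."""
    return {vpage: p2m[gframe] for vpage, gframe in guest_pt.items() if gframe in p2m}


def demo_stale_shadow():
    # Guest OS page table: guest virtual page -> guest physical frame.
    guest_pt = {0x10: 0x50, 0x20: 0x60}
    # Hypervisor map: guest physical frame -> real physical frame.
    p2m = {0x50: 0x9A, 0x60: 0x9B, 0x70: 0x9C}

    shadow = build_shadow(guest_pt, p2m)
    before = shadow[0x20]                  # direct one-step translation

    guest_pt[0x20] = 0x70                  # the guest updates its own page table
    stale = shadow[0x20]                   # the shadow table still holds the old frame

    shadow = build_shadow(guest_pt, p2m)   # hypervisor intervention (bookkeeping)
    after = shadow[0x20]
    return before, stale, after
```

Until the hypervisor intervenes, the shadow table keeps returning the old frame, which is why every guest page table update has to be intercepted.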

Intel and AMD introduced their own hardware MMU support. Like the first generation hardware support, this results in two different implementations with similar characteristics. Intel proposed the extended page tables (EPT) and AMD proposed their nested page tables (NPT). In Intel’s EPT, the guest page tables translate from virtual memory to guest physical addresses while a separate set of page tables, the extended page tables, translate from guest physical addresses to the real physical addresses [29]. The guest can modify its page tables without hypervisor intervention. The new extended page tables remove the VMExits associated with page table virtualization.

AMD’s nested paging also uses additional page tables, the nested page tables (nPT), to translate guest physical addresses to real physical addresses [19]. The guest page tables (gPT) map the virtual memory addresses to guest physical addresses. The gPT are set up by the guest and the nPT by the hypervisor. When nested paging is enabled and a guest attempts to reference memory using a virtual address, the page walker performs a two-dimensional walk using the gPT and nPT to translate the guest virtual address to the real physical address. Like Intel’s EPT, nested paging removes the overheads associated with software shadow paging.
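The two-dimensional walk can be sketched as follows. This is a simplified model under stated assumptions: the nPT is flattened into a single dictionary, the gPT has only two levels, and the names `nested_walk`, `npt` and `host_mem` are hypothetical. The point it illustrates is that the guest page tables themselves live at guest physical addresses, so every step of the gPT walk needs an nPT translation of its own.

```python
def npt_translate(npt, guest_frame, counter):
    """Translate a guest physical frame to a real physical frame via the nPT."""
    counter["npt_lookups"] += 1
    return npt[guest_frame]


def nested_walk(gva_indices, guest_root_frame, npt, host_mem):
    """Walk the gPT level by level, translating each table location through the nPT.

    gva_indices holds one table index per gPT level, taken from the guest
    virtual address. Returns the real physical frame and the number of nPT
    translations that were needed.
    """
    counter = {"npt_lookups": 0}
    guest_frame = guest_root_frame
    for index in gva_indices:
        # The guest table sits at a guest physical address: locate it in real memory.
        real_frame = npt_translate(npt, guest_frame, counter)
        # Read the guest entry: the next-level table or the final guest frame.
        guest_frame = host_mem[real_frame][index]
    # Finally translate the guest physical frame of the data itself.
    real_frame = npt_translate(npt, guest_frame, counter)
    return real_frame, counter["npt_lookups"]


# Two-level gPT: root table in guest frame 1, second-level table in guest frame 2.
npt = {1: 101, 2: 102, 7: 107}            # guest frame -> real frame (set up by hypervisor)
host_mem = {101: {0: 2}, 102: {3: 7}}     # real frame -> table contents (set up by guest)

frame, lookups = nested_walk([0, 3], 1, npt, host_mem)
print(frame, lookups)
```

In this model a walk over n guest levels needs n + 1 translations through the nPT, which is the extra cost that appears on a TLB miss.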

Another feature introduced by both Intel and AMD in the second generation hardware support is tagged TLBs. Intel uses Virtual-Processor Identifiers (VPIDs) that allow a hypervisor to assign a different identifier to each virtual processor. The zero VPID is reserved for the hypervisor itself. The processor then uses the VPIDs to tag translations in the TLB. AMD calls these identifiers Address Space IDs (ASIDs). During a TLB lookup, the VPID or ASID value of the active guest is matched against the ID tag in the TLB entry. In this way, TLB entries belonging to different guests and to the hypervisor can coexist without causing incorrect address translations. The tagged TLBs eliminate the need for TLB flushes on every VMEntry and VMExit, and with it the impact of those flushes on performance. The tagged TLBs are an improvement compared to the other virtualization techniques, which need to flush the TLB every time a guest switches to the hypervisor or back. The drawback of the extended page tables or nested paging is that a TLB miss has a larger performance hit for guests because it introduces an additional level of address translation. This is mitigated by making the TLBs much larger than before. Previous techniques like shadow page tables immediately translate the virtual guest address to the real physical address, eliminating the additional level of address translation.
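The effect of tagging can be shown with a small sketch that compares a tagged and an untagged TLB, both modelled as dictionaries. The classes `TaggedTlb` and `UntaggedTlb` are hypothetical and ignore capacity, replacement and the hypervisor's own entries.

```python
class TaggedTlb:
    """TLB whose entries carry an identifier (VPID or ASID) of the owning guest."""

    def __init__(self):
        self.entries = {}   # (tag, virtual page) -> real physical frame

    def insert(self, tag, vpage, frame):
        self.entries[(tag, vpage)] = frame

    def lookup(self, tag, vpage):
        # Only entries with the tag of the active guest can match; None is a miss.
        return self.entries.get((tag, vpage))


class UntaggedTlb:
    """TLB without tags: it must be flushed whenever the active guest changes."""

    def __init__(self):
        self.entries = {}   # virtual page -> real physical frame

    def insert(self, vpage, frame):
        self.entries[vpage] = frame

    def lookup(self, vpage):
        return self.entries.get(vpage)

    def flush(self):
        self.entries.clear()


def demo():
    # Two guests use the same virtual page but map it to different frames.
    tagged = TaggedTlb()
    tagged.insert(1, 0x10, 0xA0)    # guest with tag 1
    tagged.insert(2, 0x10, 0xB0)    # guest with tag 2
    tagged_hits = (tagged.lookup(1, 0x10), tagged.lookup(2, 0x10))

    untagged = UntaggedTlb()
    untagged.insert(0x10, 0xA0)     # first guest runs
    untagged.flush()                # switch to the second guest
    untagged.insert(0x10, 0xB0)
    untagged.flush()                # switch back to the first guest
    miss_after_switch = untagged.lookup(0x10)
    return tagged_hits, miss_after_switch
```

In the tagged variant both guests keep their entries across switches, while the untagged variant starts every switch with an empty TLB.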

The second generation hardware support is completely focussed on the improvement of the memory management. It eliminates the need for the hypervisor to maintain the shadow tables and eliminates the TLB flushes. The EPT and NPT help to improve performance for memory intensive workloads.

3.5 Current and future hardware support

Intel and AMD are still working on support for virtualization. They are improving the latencies of the VMEntry and VMExit transitions, but are also working on new hardware techniques for supporting virtualization on the x86 architecture. The first generation hardware support for virtualization was based primarily on the processor and the second generation focusses on the memory management unit. The final component required next to CPU and memory virtualization is device and I/O virtualization [10]. Recent techniques are Intel VT-d and AMD IOMMU.

There are three general techniques for I/O virtualization. The first technique is emulation and is described in subsection 3.1.2. The second technique, explained in subsection 3.2.2, is paravirtualization. The last technique is direct I/O. The device is not virtualized but assigned directly to a guest virtual machine. The guest’s device drivers are used for the dedicated device.

In order to improve the performance of I/O virtualization, Intel and AMD are looking at allowing virtual machines to talk to the device hardware directly. With Intel VT-d and AMD IOMMU, hardware support is introduced for assigning I/O devices to virtual machines. In such cases, the ability to multiplex the I/O device is lost. Depending on the I/O device, this is not necessarily an issue. For example, network interface cards can easily be added to the hardware in order to provide a NIC for each virtual machine.
