A Generic Virtualization Interface - Secure Virtualization of Latency-Constrained Systems

To address the problems mentioned above, we added new mechanisms to the L4 kernel meant to genuinely support virtualization of CPUs, referring to the resulting mechanism as a virtual CPU (vCPU).

The goal is to provide a virtualization interface that supports both faithful virtualization and paravirtualization with the same functional approach, covering CPU and memory virtualization. A further goal is the integration into microkernel-based systems, so that VMs and native microkernel applications can coexist and work together symbiotically.

The last goal is to provide a platform for latency-conscious workloads.

The basic idea for the virtualization interface is straightforward. Operating systems have been developed to run on hardware, and hardware follows a simple model of execution. A processor executes instructions until an executed instruction causes an exception or an external interrupt occurs. In this case the processor continues execution at handler code that has been provided by the operating system. The operating system then handles the exception and removes its cause so that the original execution can continue where it was interrupted.

Virtual machine monitors follow a similar approach: They execute the guest operating system and handle any exception that is caused by executing the guest operating system.

The virtualization interface provided by the microkernel shall have similar characteristics when executing guest operating systems: let the guest execute code and redirect any exception to a predefined exception handler function within the guest kernel. For integration with the host system, the virtualized guest shall also be able to communicate with other system modules by means of L4 operations such as IPC. Because such a virtualization interface provides virtual CPUs to guest operating systems, we call it the vCPU interface for short.

Overall, the following features need to be provided by a vCPU:

• One or more entry points must be defined where execution continues when an event shall be delivered to the vCPU.

• It must be possible to disable injection of events to the vCPU to be able to implement atomic sections.

• A memory area must be available which can hold the state of one interrupted execution.

• Communication with other microkernel threads must be possible through the same mechanism, so that any existing L4 component can be used directly and without restrictions.

• Multi-processor guests must be supported.
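The feature list above can be made concrete with a small model. The following Python sketch is only an illustration under our own assumptions: the names `VCpu`, `SavedState`, and `deliver` are hypothetical and do not correspond to the L4 kernel interface. It shows how an entry point, an event delivery flag, and a state save area for one interrupted execution fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical names throughout: this models the feature list, not the L4 ABI.

@dataclass
class SavedState:
    """State save area: holds the state of ONE interrupted execution."""
    ip: int = 0
    sp: int = 0
    event: Optional[str] = None   # cause of the entry, so handlers can tell events apart

@dataclass
class VCpu:
    entry: Callable[["VCpu"], None]   # entry point, comparable to an interrupt vector
    events_enabled: bool = False      # event delivery flag
    saved: SavedState = field(default_factory=SavedState)
    ip: int = 0
    sp: int = 0

    def deliver(self, event: str) -> bool:
        """Deliver an event if the flag allows it; return whether it was delivered."""
        if not self.events_enabled:
            return False              # guest is inside an atomic section
        self.saved = SavedState(ip=self.ip, sp=self.sp, event=event)
        self.events_enabled = False   # stays off until the guest re-enables it
        self.entry(self)
        return True

# Usage: a handler that only records the cause of each entry.
log = []
vcpu = VCpu(entry=lambda v: log.append(v.saved.event), events_enabled=True, ip=0x1000)
first = vcpu.deliver("irq")        # delivered, flag is cleared as a side effect
second = vcpu.deliver("irq-2")     # rejected, because delivery is now disabled
```

A multi-processor guest would simply hold several such objects, one per virtual processor.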

Entry Point(s) The entry point is the start of a function similar to an interrupt vector on physical processors. Execution branches to this function whenever an interrupt is injected into the vCPU. Besides external events there are also synchronous events that are triggered during code execution, for example, an exception when executing an undefined instruction or a fault on a memory access. Those types of events can be handled in the same way as external events, provided that the handler can tell them apart.

Event Delivery Flag The event delivery flag indicates whether the vCPU is ready to be interrupted and to be branched to an entry point. This flag is used very frequently by prospective guest operating systems and must therefore be modifiable with very low overhead.

State Save Area The state save area is a memory area that stores sufficient state to resume execution after the guest operating system has handled an interrupt or exception. Both the host system and the vCPU access this region. The event delivery flag prevents reentry and thereby acts as a lock on this region, preventing inconsistencies from concurrent modifications. The kernel disables event delivery whenever the vCPU is branched to the entry point. Once the guest has saved the context information from the state save area, it can re-enable event delivery to signal that the state save area is free for new content.
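The locking role of the event delivery flag can be sketched as follows. This is a toy model in Python, not kernel code. All names are hypothetical, and the `pending` queue is our assumption about how events that arrive while delivery is disabled are held back until the guest re-enables delivery.

```python
class VCpu:
    """Toy model: the event delivery flag guards the single state save area."""

    def __init__(self, entry):
        self.entry = entry
        self.events_enabled = True
        self.save_area = None       # holds one interrupted context at a time
        self.pending = []           # assumption: undelivered events wait here
        self.context = "guest-main"

    def inject(self, event):
        self.pending.append(event)
        self._try_deliver()

    def enable_events(self):
        self.events_enabled = True
        self._try_deliver()

    def _try_deliver(self):
        while self.events_enabled and self.pending:
            event = self.pending.pop(0)
            self.save_area = (self.context, event)
            self.events_enabled = False   # done by the kernel on every entry
            self.entry(self)

handled = []

def entry(vcpu):
    ctx = vcpu.save_area              # 1. copy the context out of the save area
    if ctx[1] == "timer":
        vcpu.inject("net")            # a second event arrives during handling ...
        assert vcpu.save_area == ctx  # ... but cannot overwrite the save area
    handled.append(ctx)
    vcpu.enable_events()              # 2. only now is the save area free again

vcpu = VCpu(entry)
vcpu.inject("timer")
```

The second event is delivered only after the handler has copied the first context and re-enabled delivery, which is the ordering the text requires.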

Receiving and Sending Messages The primary communication method on L4 systems is IPC which is synchronous. This is in contrast to the asynchronous nature of vCPUs that receive messages whenever event delivery is enabled without invoking a specific operation.

However, vCPUs shall be able to communicate with other servers and to use the respective client libraries of the servers without the need to redesign those services and libraries. It is therefore crucial that L4 IPC works the same way in vCPU mode as in non-vCPU mode.

For transferring message data, L4 IPC uses a user-level thread control block (UTCB). Each L4 thread has a UTCB, and UTCBs are located in kernel-provided memory. A thread stores its message in the UTCB prior to invoking the corresponding IPC operation and retrieves incoming messages from the UTCB when receiving. The UTCBs are also used for memory and capability mappings.

When in vCPU mode, an IPC message is received in one of two ways, depending on the state of event delivery:

Event delivery is enabled: Whenever a message is received, execution continues at the entry point. Event delivery is then disabled, and the message contents are stored in the UTCB and the state save area.

Event delivery is disabled: The vCPU behaves as in non-vCPU mode and messages can be received by explicitly invoking a message receive operation.
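The two receive paths can be contrasted in a short Python model. It is a sketch under simplifying assumptions: the class `VCpuThread` and its methods are hypothetical, the `waiting` list stands in for senders that block in synchronous IPC, and `receive` returns `None` where a real thread would block.

```python
class VCpuThread:
    """Toy model of IPC reception in vCPU mode (hypothetical names)."""

    def __init__(self, entry):
        self.entry = entry
        self.events_enabled = False
        self.utcb = None
        self.save_area = None
        self.waiting = []            # senders waiting for an explicit receive

    def incoming(self, msg, context="interrupted-code"):
        if self.events_enabled:
            # Asynchronous path: disable delivery, store the message in the
            # UTCB and the interrupted context in the state save area.
            self.events_enabled = False
            self.utcb = msg
            self.save_area = context
            self.entry(self)
        else:
            # Synchronous path: the sender waits, as with a non-vCPU thread.
            self.waiting.append(msg)

    def receive(self):
        """Explicit receive operation; None where a real thread would block."""
        if not self.waiting:
            return None
        self.utcb = self.waiting.pop(0)
        return self.utcb

seen = []
thread = VCpuThread(entry=lambda t: seen.append(t.utcb))
thread.incoming("a")                 # delivery disabled: nothing reaches the entry
explicit = thread.receive()          # picked up by an explicit receive
thread.events_enabled = True
thread.incoming("b")                 # delivery enabled: entry point is invoked
```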

With vCPUs, a thread is able to receive an IPC message without blocking in a corresponding IPC receive operation. Receiving interrupt notifications works in the same way, as those are also delivered via the IPC mechanism. Asynchronous event reception is not only required in virtualization environments but should also prove useful for other asynchronous usage models.

Sending an IPC message, or more generally, invoking any IPC operation, can be done at any time while in vCPU mode. However, for the duration of the IPC, event delivery through the entry point is disabled so that the semantics of IPC are retained. Further, as sending and receiving IPC involves copying data to and from the UTCB, event delivery must be disabled for as long as the UTCB is in use. Otherwise, an asynchronously incoming IPC message would overwrite the UTCB contents. Consequently, event delivery to the vCPU must be disabled before putting data into the UTCB and only enabled again after retrieving the reply from the UTCB. Event delivery must also be disabled for library calls that may invoke IPC operations.
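The rule that event delivery stays disabled from the first UTCB write until the reply has been read can be expressed as a scoped guard. The following Python sketch is illustrative only: `events_disabled`, `call`, and the list-based `utcb` are hypothetical stand-ins, and the `server` callable replaces a real synchronous IPC call.

```python
from contextlib import contextmanager

class VCpu:
    def __init__(self):
        self.events_enabled = True
        self.utcb = []               # stand-in for the message registers

@contextmanager
def events_disabled(vcpu):
    """Disable event delivery for a block and restore the previous state."""
    previous = vcpu.events_enabled
    vcpu.events_enabled = False
    try:
        yield
    finally:
        vcpu.events_enabled = previous

def call(vcpu, server, request):
    """Send a request and return the reply without exposing the UTCB to events."""
    with events_disabled(vcpu):
        vcpu.utcb = list(request)        # marshal the message into the UTCB
        vcpu.utcb = server(vcpu.utcb)    # hypothetical synchronous IPC call
        return list(vcpu.utcb)           # copy the reply out before re-enabling

vcpu = VCpu()
reply = call(vcpu, lambda words: [sum(words)], [1, 2, 3])
```

Restoring the previous flag value instead of unconditionally enabling delivery keeps the guard safe for nested use, for example when a library call that performs IPC is itself invoked from an atomic section.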

Paravirtualization and Multiple Processes Operating systems commonly use multiple virtual address spaces to provide isolation among the processes on their system. They also guard the guest kernel from the user processes. Protection of the kernel could be accomplished by using a separate address space; however, that would require address space switches for all kernel invocations such as system calls. To avoid the switching, the virtual address space is divided into a part accessible to the user process and a part that is only usable by the kernel. Furthermore, the kernel is part of every user address space. With such a configuration, kernel requests require only two privilege changes, and address spaces are not switched.

To apply the same approach to paravirtualized setups, the system would need three privilege levels: one for the host kernel (hypervisor), one for the guest kernel, and one for the user processes, in order to guard the host kernel from the guest kernel and to guard the host and guest kernels from the user processes. However, hardware commonly offers only two privilege levels. For this reason, guest setups can only use the non-privileged mode and therefore have to use separate address spaces for the guest kernel and user processes. A side effect of using separate address spaces is that both the guest kernel’s address space and the users’ address spaces can use the full size of the virtual address space available to user-level programs on the host system. In a hypothetical system with three privilege levels, the virtual address space would have to be divided into three parts, one for each level, yielding less available address space for each part. This is of particular interest on 32-bit architectures with today’s multi-gigabyte memory setups.

With separate address spaces, the address space must be switched when changing from vCPU guest kernel execution to a vCPU user process and again when switching back. This ensures that the guest kernel code and data are protected from accesses by the user processes. To support this common use case of address space switching, a vCPU can switch between host tasks and thus between different guest address spaces. When continuing execution in a user process, the guest kernel supplies the host kernel with the address space to which the vCPU is to be migrated. Whenever an event is to be delivered or the user code triggers an exception, the vCPU is switched back to its guest kernel address space to continue execution at the entry point.

Besides the address space argument, the corresponding vCPU operation can also be supplied with pages to be mapped into the target address space. This allows page faults of user tasks to be handled within the same call instead of using separate map operations, speeding up page fault handling.

Figure 3.1 depicts a schematic view of a paravirtualized operating system, consisting of an address space for the guest operating system kernel and two user processes.

[Figure 3.1 labels: System call, Device, User Task A, User Task B, OS Guest Kernel, Host Kernel]

Figure 3.1: Using multiple address spaces for a vCPU to implement user processes in a paravirtualized operating system. The execution starts in the kernel task and continues with switching the vCPU to task A (1). Executing user code, the vCPU will execute a system call and thus transition back to the kernel task (2). Similarly, executing in task B (3), an external event (4) switches the vCPU back to the kernel task (5).
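The sequence of steps (1) to (5) can be replayed in a small Python model. It is a sketch under our own assumptions: `Task`, `VCpu`, `resume_user`, and `event` are hypothetical names, and a Python set of page addresses stands in for the mappings of a host task.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Stand-in for a host task, i.e. one address space."""
    name: str
    pages: set = field(default_factory=set)

class VCpu:
    def __init__(self, kernel_task, entry):
        self.kernel_task = kernel_task
        self.task = kernel_task          # address space the vCPU currently runs in
        self.entry = entry
        self.save_area = None
        self.trace = []

    def resume_user(self, user_task, map_pages=()):
        """Guest kernel continues a user process; pages may be mapped in the same call."""
        user_task.pages.update(map_pages)
        self.task = user_task
        self.trace.append(f"kernel -> {user_task.name}")

    def event(self, cause):
        """Any event or exception switches back to the guest kernel task."""
        self.save_area = (self.task.name, cause)
        self.trace.append(f"{self.task.name} -> kernel ({cause})")
        self.task = self.kernel_task
        self.entry(self)

kernel, task_a, task_b = Task("kernel"), Task("A"), Task("B")
causes = []
vcpu = VCpu(kernel, entry=lambda v: causes.append(v.save_area))
vcpu.resume_user(task_a, map_pages={0x1000})  # (1), resolving a page fault on the way
vcpu.event("syscall")                         # (2)
vcpu.resume_user(task_b)                      # (3)
vcpu.event("device irq")                      # (4) and (5)
```

The guest kernel can inspect the state save area at the entry point to find out which user task was interrupted and why.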

3.3.1 Multi-Processor Guest Operating Systems

Symmetric multi-processor systems have multiple processors within one system that can be used in parallel. Operating systems make all the processors available to applications. In virtualized setups multiple processors can be made available to VMs, allowing guest systems to use multiple host processors.

With vCPUs, the guest operating system is supplied with multiple vCPUs that are placed on the available host processors. No specific features or enhancements are required by the vCPU functionality to run multi-processor capable guest operating systems. Inter-processor interrupts (IPIs) are mapped to L4 interrupts that can be triggered by software.
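How IPIs map to software-triggered interrupts can be pictured with the following Python sketch. It is purely illustrative: `SoftIrq` is a hypothetical stand-in for an L4 interrupt object, and each vCPU merely collects triggered interrupts in a list instead of entering its entry point.

```python
class VCpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.pending = []            # interrupts waiting for delivery

class SoftIrq:
    """Stand-in for an interrupt object that software can trigger."""

    def __init__(self, target, label):
        self.target = target
        self.label = label

    def trigger(self):
        self.target.pending.append(self.label)

def make_ipis(vcpus):
    """One software-triggerable interrupt per destination vCPU."""
    return {v.cpu_id: SoftIrq(v, "ipi") for v in vcpus}

def send_ipi(ipis, dest_cpu):
    ipis[dest_cpu].trigger()

vcpus = [VCpu(0), VCpu(1)]
ipis = make_ipis(vcpus)
send_ipi(ipis, 1)                    # vCPU 0 asks vCPU 1 to reschedule, for example
```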

3.3.2 Full Virtualization and vCPUs

Modern processors such as x86-based systems from AMD and Intel provide hardware support for operating system virtualization [Cor14; Dev11]. The virtualization extensions provide an execution mode that is free of virtualization holes and further provide several performance-oriented features for virtualization, such as nested paging (see Section 2.4.1).

The execution of a virtual machine is very similar to the execution model of vCPUs. As already introduced earlier, each virtual CPU of a virtual machine is configured by a VMCx. The CPUs provide additional privileged instructions for managing the VMCx and handling virtual machines. Whenever an event inside the virtual machine cannot be handled within the virtual machine, the CPU will exit the VM and return to the VMM, delivering sufficient information to let the VMM handle the event.

Mapping this type of execution to vCPUs is straightforward. A virtual machine always requires a task for execution, which comprises the resources of the VM, such as memory. The VMCx is configured by the VMM, which runs a vCPU for each CPU of the virtual machine. To hold the VMCx data, such a vCPU requires a larger state save area than a normal vCPU. Resuming into the virtual machine is done through the kernel, which copies and sanity-checks the VMCx and then enters the virtual machine. The exit from the virtual machine goes through the kernel back to the VMM. Depending on the state of event delivery within the vCPU, the VMM is entered either through the vCPU’s entry point or when the vCPU resume call returns.
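The resume and exit cycle can be sketched as a simple loop. The following Python model is an illustration under stated assumptions: the VMCx is reduced to a dictionary with a made-up field set, the sanity check only rejects unknown fields, and the guest is a callable that reports an exit reason. It covers only the case in which event delivery is disabled, so that a VM exit shows up as the return of the resume call.

```python
ALLOWED_FIELDS = {"ip", "sp", "exit_info"}     # hypothetical VMCx layout

def kernel_resume(vmcx, run_guest):
    """Kernel side: sanity-check and copy the VMCx, enter the VM, report the exit."""
    unknown = set(vmcx) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"rejected VMCx fields: {sorted(unknown)}")
    checked = dict(vmcx)             # the kernel works on its own checked copy
    reason = run_guest(checked)      # runs until the next VM exit
    vmcx.update(checked)             # hand the updated guest state back to the VMM
    return reason

def vmm_loop(vmcx, run_guest, handlers, max_exits=100):
    """VMM side: resume the VM and handle each exit until the guest halts."""
    handled = []
    for _ in range(max_exits):
        reason = kernel_resume(vmcx, run_guest)
        if reason == "halt":
            break
        handlers[reason](vmcx)       # e.g. emulate a device access
        handled.append(reason)
    return handled

# Usage with a scripted guest that exits twice for I/O and then halts.
script = iter(["io", "io", "halt"])

def fake_guest(state):
    state["ip"] += 4
    return next(script)

vmcx = {"ip": 0, "sp": 0x8000}
handled = vmm_loop(vmcx, fake_guest, {"io": lambda s: s.update(exit_info="emulated")})
```

Checking a private copy rather than the VMM-visible structure mirrors the statement that the kernel copies and sanity-checks the VMCx before entering the virtual machine.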

On Intel processors, when using nested paging, the page table for VMs uses the Extended Page Table (EPT) format [Cor14], which differs from the standard process page table format. For that reason a VM needs a different type of address space and cannot use a standard L4::Task for holding a VM’s page table. Instead it has to use an L4::Vm, which is derived from L4::Task and behaves in the same way but uses the EPT format.

Recent versions of the ARM architecture also include support for virtualization [Lim14]. On ARM, the CPU-based virtualization functionality has been integrated differently than on x86. For example, the guest VM configuration must be saved and restored register by register instead of being handled as a whole by a single instruction as on x86. For memory virtualization, ARM uses the same principle as x86, providing nested page tables for VMs.

Overall, the ARM virtualization extensions fit well into the vCPU model. As the format of the nested page table is the same for VMs and host applications, no specific L4::Vm object is required, unlike on x86, and the standard L4::Task object works for both VMs and host applications. For VMs an extended state save area has to be used to store the additional VM state. The data layout of the area is defined by the host kernel.

Although ARM and x86 follow different approaches for CPU virtualization, both virtualization architectures can be made available through the vCPU interface. I thus believe that hardware-assisted virtualization functionality on other architectures can be integrated in the same way.

3.3.3 Conclusion

In conclusion, the vCPU interface provides an execution model with asynchronous event delivery for user-level programs on L4. It is generic enough to support any guest operating system, including multi-processor configurations, and multiple hardware architectures. Compared to other virtualization solutions, such as DISCO [BDR97] and Xen [Bar+03], the vCPU model integrates into and naturally extends an existing system, so that system services can be used in virtualization components and virtualization functionality can be used in microkernel-based applications. As an example, the vCPU mechanism is used to implement redundant multi-threading for replication-based fault tolerance [Döb14].
