3.4 Evaluation
3.4.2 Execution Times
The execution times presented below were determined by counting the processor cycles needed to execute the investigated code part, based on the instruction trace generated by the simulator. The PowerPC 405 executes all instructions in one processor cycle, except for multiplication and division, memory accesses with a cache miss, and branch instructions [IBM, 2005]. For instructions that do not execute in one cycle, the worst-case execution time is always assumed (five cycles for a multiplication, three cycles for a branch with unsuccessful branch prediction).
The simulator is not cycle accurate for external memory accesses, which is why we execute the analyzed software routines with preloaded instruction and data caches. As a result, each instruction fetch and each load/store access takes one processor cycle. In practice, preloading is not always possible, which results in a larger execution time that depends on the system’s hardware.
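The counting scheme described above can be illustrated with a minimal sketch. It assumes the trace has already been reduced to instruction classes; the class labels ("mul", "div", "branch") and the function name are hypothetical and not taken from the simulator. Since the text gives no worst-case cost for division, the sketch requires the caller to supply one.

```python
from typing import Iterable, Optional

# Worst-case cycle costs stated in the text. Every other instruction is
# charged one cycle, which presumes preloaded instruction and data caches.
MULTIPLY_CYCLES = 5
MISPREDICTED_BRANCH_CYCLES = 3


def count_cycles(trace: Iterable[str], divide_cycles: Optional[int] = None) -> int:
    """Sum worst-case cycles over a trace of (hypothetical) instruction classes."""
    total = 0
    for kind in trace:
        if kind == "mul":
            total += MULTIPLY_CYCLES
        elif kind == "branch":
            # Worst case: always charge the misprediction penalty.
            total += MISPREDICTED_BRANCH_CYCLES
        elif kind == "div":
            if divide_cycles is None:
                raise ValueError("division cost is not given in the text; pass divide_cycles")
            total += divide_cycles
        else:
            # Single-cycle instruction, including cache-hit load/store.
            total += 1
    return total


example_trace = ["load", "add", "mul", "store", "branch"]
example_cycles = count_cycles(example_trace)  # 1 + 1 + 5 + 1 + 3 = 11
```

Charging every branch the misprediction penalty keeps the result a safe upper bound, at the price of overestimating routines with well-predicted branches.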
Virtual Machine Context Switch
If multiple virtual machines share a core, switching between them involves saving the context of the preempted VM (the content of the virtualized registers), selecting the next VM, and resuming this VM, including restoring its context. The execution time for this procedure is 1250 ns (375 processor cycles). This figure excludes scheduling, since the scheduling time depends heavily on the specific scheduling algorithm; the algorithm-specific execution time has to be added.
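The three steps of the switch (save, select, restore) and the way the scheduling cost enters the total can be sketched as follows. This is an illustration under stated assumptions, not the hypervisor's code: the `VM` class, the register dictionary, and the `pick_next` callback are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

CONTEXT_SWITCH_CYCLES = 375  # from the text; excludes the scheduling decision


@dataclass
class VM:
    name: str
    saved_registers: Dict[str, int] = field(default_factory=dict)


def context_switch(core_registers: Dict[str, int], current: VM, ready: List[VM],
                   pick_next: Callable[[List[VM]], VM]) -> VM:
    # 1. Save the context of the preempted VM.
    current.saved_registers = dict(core_registers)
    # 2. Select the next VM (the algorithm-specific part).
    next_vm = pick_next(ready)
    # 3. Restore the context of the selected VM and resume it.
    core_registers.clear()
    core_registers.update(next_vm.saved_registers)
    return next_vm


def total_switch_cycles(scheduler_cycles: int) -> int:
    """Fixed switch cost plus the cost of the chosen scheduling algorithm."""
    return CONTEXT_SWITCH_CYCLES + scheduler_cycles


# Usage: vm_a is preempted, vm_b is the only ready VM.
vm_a = VM("A")
vm_b = VM("B", saved_registers={"r1": 42})
core = {"r1": 7}
running = context_switch(core, vm_a, [vm_b], pick_next=lambda ready: ready[0])
```

Passing the scheduler as a callback mirrors the separation made in the text: the 375 cycles are independent of the policy, and only `scheduler_cycles` changes with the algorithm.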
Synchronized Shared Resource Access Routines
The execution times of the semaphore operations wait() and signal(), implemented according to Lamport’s Bakery Algorithm for synchronized shared resource access across processor cores, are plotted in Figure 3.4. The execution time increases linearly with the number of cores, since the operations perform an iteration over an array of length equal to the number of cores.
However, these execution times actually denote the best case for the operation wait(), namely an uninterrupted execution of the operation, which takes place if resource access is granted immediately. When the resource is not available, wait() blocks the calling VM. The worst-case blocking time is essential for real-time systems. In case of four cores, the worst case occurs if the calling VM is blocked by a VM on each of the three other cores. In this case, the worst-case blocking time for synchronized shared resource access sums up to 1797 processor cycles or about 6 µs [Gilles, 2012].
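To make the dependence on the number of cores concrete, here is a textbook form of Lamport's Bakery Algorithm as a plain mutual-exclusion lock. It is a sketch only and makes several assumptions: Python threads stand in for cores, the class and method names are hypothetical, waiting is done by spinning rather than by blocking a VM, and it does not reproduce the hypervisor's wait() and signal() routines.

```python
import threading
import time


class BakeryLock:
    """Textbook bakery lock for n participants (one slot per core)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.choosing = [False] * n
        self.number = [0] * n

    def acquire(self, i: int) -> None:
        # Take a ticket larger than every ticket currently in use.
        self.choosing[i] = True
        self.number[i] = 1 + max(self.number)
        self.choosing[i] = False
        # One pass over all slots: this loop grows with the number of cores.
        for j in range(self.n):
            if j == i:
                continue
            while self.choosing[j]:
                time.sleep(0)  # yield while j is still picking a ticket
            # Wait while j holds a smaller ticket (ties broken by index).
            while self.number[j] != 0 and (self.number[j], j) < (self.number[i], i):
                time.sleep(0)

    def release(self, i: int) -> None:
        self.number[i] = 0


def run_demo(participants: int = 3, rounds: int = 50) -> int:
    lock = BakeryLock(participants)
    shared = {"counter": 0}

    def worker(i: int) -> None:
        for _ in range(rounds):
            lock.acquire(i)
            value = shared["counter"]
            time.sleep(0)  # invite a thread switch inside the critical section
            shared["counter"] = value + 1
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(participants)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]
```

When no other participant holds a ticket, `acquire` returns after a single pass over the arrays, which corresponds to the best case described above. Each additional contender can add waiting time in the inner loops, which is where the blocking time arises.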
Figure 3.4: Execution time of routines for protected access to a shared resource (x-axis: number of cores; y-axis: execution time in ns). Plotted values for 1, 2, 3, and 4 cores, in legend order: wait() 417, 547, 660, 743 ns; signal() 400, 530, 643, 727 ns.
Interrupt Latency
Virtualization increases the interrupt latency. Any interrupt is first delegated to the hypervisor, analyzed there, and potentially forwarded back to the guest operating system. For example, the additional latency of a programmable timer interrupt (PIT IRQ) is about 6.6 µs [Kerstan, 2011]. The additional latency for a system call interrupt (Syscall IRQ) is about 4 µs [Kerstan, 2011]. To obtain the total interrupt latency, one has to add this delay to the interrupt latency of the guest OS. The additional latency is longer for the timer interrupt, since the virtual interrupt timer has to be updated, as discussed in Section 3.3.7. In case of a system call interrupt, the hypervisor just has to analyze in which virtual processor execution mode it was raised, in order to find out whether it was caused by a hypercall or by a system call. In case of a hypercall, the hypervisor dispatches to the associated hypercall handler. Otherwise, the interrupt is delegated back to the VM and the latencies for this case are given above.
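The handling of a system call interrupt and the latency bookkeeping can be sketched as follows. The mapping from virtual execution mode to interrupt cause is an assumption made for this illustration (hypercall if raised in virtual supervisor mode, guest system call if raised in virtual user mode); the text only states that the mode decides. All names are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict

# Additional latencies introduced by virtualization, in microseconds (from the text).
ADDED_LATENCY_US = {"PIT": 6.6, "SYSCALL": 4.0}


class VirtualMode(Enum):
    SUPERVISOR = "supervisor"
    USER = "user"


def handle_syscall_interrupt(mode: VirtualMode, hypercall_id: int,
                             hypercall_handlers: Dict[int, Callable[[], str]]) -> str:
    # ASSUMPTION: supervisor mode means hypercall, user mode means system call.
    if mode is VirtualMode.SUPERVISOR:
        return hypercall_handlers[hypercall_id]()
    # Otherwise the interrupt is delegated back to the VM.
    return "forwarded to guest OS"


def total_latency_us(irq: str, guest_latency_us: float) -> float:
    """Total latency = latency of the guest OS + delay added by the hypervisor."""
    return guest_latency_us + ADDED_LATENCY_US[irq]


# Usage with a made-up handler table and a made-up guest latency of 10 µs.
handlers = {1: lambda: "hypercall 1 handled"}
from_kernel = handle_syscall_interrupt(VirtualMode.SUPERVISOR, 1, handlers)
from_user = handle_syscall_interrupt(VirtualMode.USER, 1, handlers)
pit_total = total_latency_us("PIT", 10.0)
```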
Efficient Virtualization by Paravirtualization
The emulation of privileged instructions is the major cause of virtualization overhead. The virtualized execution of instructions that manipulate or depend on the hardware state is the core functionality of a hypervisor. Since the guest is executed in problem mode, each emulation necessarily includes a context switch to the hypervisor and a processor mode switch. The emulation service is requested via an interrupt (Program IRQ) in case of full virtualization or via a hypercall in case of paravirtualization.
Paravirtualization can be exploited to achieve a significant speedup for the emulation of privileged instructions. An analysis of the steps of an emulation routine helps to understand why:
1. Reenabling of the data translation and saving of the contents of those registers that are needed to execute the emulation routine.
2. Analysis of the exception in order to identify the correct emulation subroutine and jump to it (dispatching).
3. Actual emulation of the instruction.
4. Restoring of the register contents.
The actual emulation accounts for the smallest fraction of the total execution time and is the same for both full virtualization and paravirtualization. A significant performance gain of paravirtualization stems from the lower overhead for identifying the cause of the exception and dispatching to the correct handler routine. In case of full virtualization, a memory access is required to identify the instruction that raised the interrupt, since the PowerPC stores only the address of the instruction in a register. In addition, the hypervisor has to analyze in which virtual processor privilege mode the instruction was executed. In case of paravirtualization, only a register read-out is necessary in order to obtain the hypercall ID, since Proteus’ hypercall application binary interface specifies that register 13 is used to pass the hypercall ID. The WCET of privileged instructions is between 7% and 42% smaller in case of paravirtualization compared to full virtualization [Baldin, 2009].
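The difference between the two dispatch paths can be sketched as follows. The sketch is illustrative only: memory and registers are modeled as dictionaries, instruction words are used directly as lookup keys (a real decoder would extract opcode fields), and the reaction to an instruction issued in unprivileged virtual mode is an assumption. Only the use of register 13 for the hypercall ID is taken from the text.

```python
from typing import Callable, Dict

Handler = Callable[[], str]

HYPERCALL_ID_REGISTER = 13  # from the text: register 13 passes the hypercall ID


def dispatch_full_virtualization(memory: Dict[int, int], faulting_address: int,
                                 privileged_virtual_mode: bool,
                                 handlers: Dict[int, Handler]) -> str:
    # The processor records only the address, so the instruction word
    # has to be fetched from memory before it can be identified.
    instruction_word = memory[faulting_address]
    # Additional step: check the virtual privilege mode of the guest.
    if not privileged_virtual_mode:
        # ASSUMPTION: an unprivileged attempt is reported to the guest OS.
        return "exception forwarded to guest OS"
    return handlers[instruction_word]()


def dispatch_paravirtualization(registers: Dict[int, int],
                                handlers: Dict[int, Handler]) -> str:
    # A single register read yields the handler index.
    return handlers[registers[HYPERCALL_ID_REGISTER]]()


# Usage with made-up addresses, instruction words, and IDs.
memory = {0x1000: 0xAAAA}
full_result = dispatch_full_virtualization(
    memory, 0x1000, True, {0xAAAA: lambda: "emulated via trap"})
para_result = dispatch_paravirtualization(
    {13: 4}, {4: lambda: "emulated via hypercall"})
```

The paravirtualized path avoids both the memory fetch and the decoding step, which is the source of the lower dispatch overhead described above.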
An additional performance gain for paravirtualized guests is achieved by Innocuous Register File Mapping (IRFM) [Kerstan, 2011]. For each VM, the hypervisor maintains a virtual register set. Accesses to registers that require privileged instructions trap to the hypervisor, which emulates the instruction on the virtual register set. However, there are registers that can be accessed without having immediate influence on the state or behavior of the VM. Because IRFM maps this specific set of registers into the memory space of the VM, they can be accessed by load and store instructions without trapping to the hypervisor. Paravirtualization is required, since all privileged instructions that access these registers have to be replaced by load and store instructions. A similar approach has been applied by Xen [Barham et al., 2003].
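The effect of IRFM on the number of traps can be illustrated with a small model. It is a sketch under assumptions: the class, the trap counter, and the register names `scratch_a` and `state_ctrl` are placeholders, since the text does not list which registers are mapped.

```python
from typing import Dict, Iterable


class VirtualCPU:
    """Toy model: mapped registers are plain memory, all others trap."""

    def __init__(self, mapped_registers: Iterable[str]) -> None:
        self.mapped = set(mapped_registers)
        self.guest_memory: Dict[str, int] = {}       # stands in for the mapped region
        self.virtual_registers: Dict[str, int] = {}  # maintained by the hypervisor
        self.traps = 0

    def write_register(self, name: str, value: int) -> None:
        if name in self.mapped:
            # Innocuous register: a plain store, no hypervisor involvement.
            self.guest_memory[name] = value
        else:
            # Privileged access: trap, then emulation on the virtual register set.
            self.traps += 1
            self.virtual_registers[name] = value

    def read_register(self, name: str) -> int:
        if name in self.mapped:
            return self.guest_memory.get(name, 0)
        self.traps += 1
        return self.virtual_registers.get(name, 0)


cpu = VirtualCPU(mapped_registers={"scratch_a"})
cpu.write_register("scratch_a", 5)   # no trap
cpu.write_register("state_ctrl", 1)  # one trap
scratch_value = cpu.read_register("scratch_a")  # no trap
trap_count = cpu.traps
```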
Hypercalls
As introduced in Section 3.3.4, a guest operating system can request hypervisor services via the paravirtualization interface. The hypercall vm_yield, which voluntarily releases the core, has an execution time of 507 ns (152 processor cycles). By calling sched_set_param, the guest OS passes information to the hypervisor’s scheduler. The execution time of this hypercall is 793 ns (238 processor cycles). The hypercall create_comm_tunnel requests the creation of a shared-memory tunnel for communication between the calling VM and a second VM and has an execution time of 1027 ns (308 processor cycles). The hypercall vm_yield does not return to the VM, so its execution time is measured until the start of the hypervisor’s schedule routine. The other two hypercalls return to the VM, and the execution time measurement stops when the calling VM resumes its execution.
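The cycle counts and durations quoted in this section are related by a fixed clock rate. The following sketch infers that rate from the context-switch figures (375 cycles in 1250 ns, that is, 0.3 cycles per ns or 300 MHz) and applies it to the hypercall cycle counts; the inference and the helper name are assumptions of this example, as the clock frequency is not stated in this section.

```python
# Clock rate inferred from the context-switch figures: 375 cycles in 1250 ns.
CYCLES_PER_NS = 375 / 1250  # 0.3 cycles per ns, i.e. 300 MHz


def cycles_to_ns(cycles: int) -> float:
    return cycles / CYCLES_PER_NS


HYPERCALL_CYCLES = {
    "vm_yield": 152,
    "sched_set_param": 238,
    "create_comm_tunnel": 308,
}

hypercall_ns = {name: round(cycles_to_ns(c)) for name, c in HYPERCALL_CYCLES.items()}
# hypercall_ns == {"vm_yield": 507, "sched_set_param": 793, "create_comm_tunnel": 1027}

worst_case_blocking_us = cycles_to_ns(1797) / 1000  # about 5.99 µs
```

The rounded results match the durations given in the text, which supports the inferred clock rate.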