Approach - SimuBoost: Scalable Parallelization of Functional System Simulation

To achieve a maximum speedup for a given hardware setup, the scalability of an acceleration method is of utmost importance. In § 3.2, we highlighted that the parallelization of simulation time has proven to provide very good scalability with almost linear speedups. In contrast to methods that parallelize the execution of simulated CPUs, the parallelization of simulation time is also applicable to single-core simulations. At the same time, it is not limited to sampling, but equally allows continuous full-length observation (i.e., temporal completeness). Another advantage is that the concept is very portable because it does not merely gain speedup from optimizing implementation details – which exhibits only limited scalability anyway.

Due to its effectiveness, we have chosen to employ the parallelization of simulation time as the core idea in SimuBoost[219]. The run time of the serial simulation is split into disjoint time intervals (see Figure 4.1). The intervals are then distributed over a set of nodes for parallel simulation, where the nodes are dedicated physical CPU cores on one or more hosts. In this model, the overall run time equals the simulation time of the longest interval.

The simulation for interval i[k] with k ∈ [2, n] and n ∈ N requires the simulated machine’s state at the beginning of this very interval. The machine state, in turn, results from the execution of the intervals i[1] to i[k − 1]. This creates a dependency chain which forbids the early start of parallel simulations. Another drawback of this method is that it neither improves accuracy nor allows interactivity because each interval is still executed with conventional slow functional simulation.

Figure 4.1: The serial simulation is split into disjoint time intervals and distributed over simulation nodes (e.g., CPU cores or hosts) for parallel simulation.
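The idealized model of Figure 4.1 can be illustrated with a minimal sketch. It assumes equally long intervals, one node per interval, and that every interval's start state is already available, which is exactly what the dependency chain described above prevents. The function names are hypothetical and not part of SimuBoost.

```python
def split_into_intervals(total_time, n):
    """Split [0, total_time) into n disjoint, equally long intervals."""
    length = total_time / n
    return [(k * length, (k + 1) * length) for k in range(n)]


def ideal_parallel_runtime(intervals):
    """With one node per interval, the run time is that of the longest interval."""
    return max(end - start for start, end in intervals)


# One hour of serial simulation time distributed over 8 nodes.
intervals = split_into_intervals(3600.0, 8)
print(ideal_parallel_runtime(intervals))  # 450.0
```

Under these assumptions, the run time shrinks linearly with the number of nodes. The remainder of the section is concerned with obtaining the start states without first running the serial simulation.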

A setup phase with an initial full-length serial simulation, as in trace-driven approaches, would allow gathering the machine state at the interval boundaries. Subsequent runs could then be executed in parallel. However, the lack of accuracy and interactivity remains. Moreover, a time-consuming setup phase is unfavorable because it considerably increases turnaround time.

To quickly retrieve the machine state at sampling points, pFSA[229] leverages hardware-assisted virtualization (HAV). Instead of running the workload in a serial functional simulation to fast-forward between samples, pFSA uses a regular fast virtual machine and forks off simulations as appropriate. SimuBoost adopts this general principle (see Figure 4.2) and applies it to continuous simulation: The workload executes in a hardware-assisted virtual machine (HVM) at near-native speed. Periodically, SimuBoost takes a checkpoint to capture the state of the VM, thereby marking interval boundaries. The checkpoint contains a consistent image of the virtual machine’s memory, device states, and persistent storage at the time it is taken. SimuBoost uses the checkpoint to bootstrap the simulation of the respective interval on a different node. In contrast to forking, checkpoints can more easily be transferred over a network, improving scalability. They can also be saved to disk for repeated runs. Although the simulations do not all start right at the beginning, the slowdown between hardware-assisted virtualization and functional simulation (depicted as 4x by way of example) drives the parallelization.

Figure 4.2: The workload runs in a hardware-assisted virtual machine at near-native speed. At interval boundaries, SimuBoost takes checkpoints, which it uses to bootstrap parallel simulations. The slowdown between HAV and functional simulation (here 4x) drives the parallelization.

Using a hardware-assisted virtual machine as input to the simulation has the additional benefit of recovering interactivity. Assuming checkpoints can be taken with low overhead (see Chapter 6), the HVM is fast enough to be actively controlled by a user just like a regular virtual machine in production use. Similarly, the virtual machine is capable of communicating with non-simulated remote peers. Since the simulations execute in parallel to the HVM, the performance of the virtual machine is decoupled from the execution speed of the simulations. This effectively maintains interactivity even in the face of additional instrumentation overhead. The portability of the approach is limited only in that the target architecture must support some form of fast hardware-assisted virtualization. However, this feature is generally present today (e.g., x86, ARM, MIPS, and PowerPC) or planned for future releases (e.g., RISC-V).
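How the slowdown drives the parallelization can be made concrete with a small timing model. The sketch below is an illustration only: it assumes equally long intervals, a constant slowdown, zero checkpointing and transfer cost, and that the simulation of an interval starts as soon as its checkpoint has been taken. All names and parameters are hypothetical.

```python
def simulation_schedule(n_intervals, interval_len, slowdown):
    """(start, end) wall-clock times for simulating each interval.

    Assumption: the simulation of interval k starts when the HVM takes the
    checkpoint at the beginning of that interval, i.e. at k * interval_len.
    """
    sim_len = slowdown * interval_len
    return [(k * interval_len, k * interval_len + sim_len)
            for k in range(n_intervals)]


def peak_nodes(schedule):
    """Largest number of simulations that run at the same time."""
    events = sorted([(start, 1) for start, _ in schedule] +
                    [(end, -1) for _, end in schedule])
    running = peak = 0
    for _, delta in events:  # at equal times, ends (-1) sort before starts (+1)
        running += delta
        peak = max(peak, running)
    return peak


schedule = simulation_schedule(n_intervals=10, interval_len=2.0, slowdown=4)
makespan = max(end for _, end in schedule)  # parallel run time
serial = 10 * 2.0 * 4                       # purely serial simulation
print(peak_nodes(schedule), makespan, serial)  # 4 26.0 80.0
```

In this model, a slowdown of 4x keeps at most four nodes busy at any time, and the parallel run finishes one simulated interval after the HVM has taken its last checkpoint.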

4.2.1 State Deviation

Whereas simulations can be built to always emit identical deterministic runs, for example, according to a specified timing model, hardware-assisted virtualization is subject to non-deterministic input such as erratic I/O completion timing. In consequence, the parallel simulations in SimuBoost experience different timing behavior than the hardware-assisted execution of the same interval. Furthermore, the checkpoints only capture the result of past interaction, but they do not contain information on interactions in the current interval. Hence, any external input such as user commands or network packets received by the HVM is not reproduced in the simulations. The same applies to non-deterministic instruction results (e.g., readings of a timestamp counter). Consequently, the executions in the HVM and the FFSS of each interval diverge. Besides effectively losing interactivity, this is problematic in two ways:

1. Since simulations start off checkpoints that originate from the hardware-assisted virtual machine, the divergence breaks the functional continuity between interval boundaries in the simulation stage. That is, the machine state at the end of the simulation of interval i[k] does not match the state at the beginning (i.e., the checkpoint) of interval i[k + 1]. For example, recording a coherent instruction trace under such circumstances is infeasible.

2. For researchers to take measurements and retrieve data from a simulation which behaves differently from what they can see (i.e., the HVM) is at least confusing. In the worst case, the simulation is of little value if it does not reproduce the desired behavior as triggered in the HVM.

A possible solution is to take checkpoints not periodically but at each non-deterministic event. This inherently captures the point in time as well as the state modification caused by the event. This is similar to what has been done in Kemari [253], where a backup VM is synchronized via a checkpoint whenever the master VM sends network packets or writes to persistent storage. However, Kemari incurs a 2x performance overhead. Furthermore, our experiments show that on average 7400 (max: 230K) non-deterministic events per second occur during a Linux kernel build. This suggests that we would have to take a checkpoint every 135 µs on average and every 4 µs at peak times, which is clearly not feasible without severe performance degradation.
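The checkpoint periods quoted above follow directly from the measured event rates. A minimal check, assuming one checkpoint per non-deterministic event (the function name is hypothetical):

```python
def checkpoint_period_us(events_per_second):
    """Mean time between checkpoints in microseconds, one checkpoint per event."""
    return 1e6 / events_per_second


print(round(checkpoint_period_us(7400)))        # 135 (average rate)
print(round(checkpoint_period_us(230_000), 1))  # 4.3 (peak rate)
```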

To counter state deviation, SimuBoost therefore leverages heterogeneous deterministic replay. Non-deterministic input such as the timing of interrupts, the payload of I/O operations, and the results of non-deterministic instructions are recorded during the hardware-assisted virtualization as discrete events and are precisely replayed in the simulations. This restores functional continuity and reproduces user as well as network input, which is indispensable to simulate interactive workloads. Since non-deterministic input needs to be fully captured prior to simulation, SimuBoost delays the parallelization by one interval (compare Figure 4.2).
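The record-and-replay principle can be sketched with a toy model. This is not SimuBoost's implementation: the "machine" is a simple deterministic state update, a seeded random generator stands in for non-deterministic input in the HVM, and events are positioned by a plain instruction count. All names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    icount: int   # position in the instruction stream (hypothetical landmark)
    payload: int  # e.g. interrupt vector, I/O data, or a timestamp reading


def step(state, icount):
    """One deterministic 'instruction' of the toy machine."""
    return (state * 31 + icount) % 1_000_003


def apply_input(state, payload):
    """State change caused by an external input."""
    return state ^ payload


def run_recording(n_instr, seed):
    """Stand-in for the HVM: inputs arrive unpredictably and are logged."""
    rng = random.Random(seed)  # models the non-deterministic environment
    state, log = 0, []
    for icount in range(n_instr):
        if rng.random() < 0.05:
            event = Event(icount, rng.randrange(1 << 16))
            log.append(event)
            state = apply_input(state, event.payload)
        state = step(state, icount)
    return state, log


def run_replay(n_instr, log):
    """Stand-in for the simulation: no live input, only logged events."""
    pending = iter(log)
    upcoming = next(pending, None)
    state = 0
    for icount in range(n_instr):
        if upcoming is not None and upcoming.icount == icount:
            state = apply_input(state, upcoming.payload)
            upcoming = next(pending, None)
        state = step(state, icount)
    return state


final_state, log = run_recording(10_000, seed=1)
assert run_replay(10_000, log) == final_state
```

Because every input is injected at the same position with the same payload, the replayed run ends in the recorded state. The replay can only consume events that have already been logged, which is why the recording of an interval has to be complete before its simulation starts.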

An advantage of deterministic replay is that it injects realistic timing into the simulation. This frees researchers from having to install a sophisticated timing model. In fact, the authors of PTLsim/X [292] advocate the use of deterministic replay even for microarchitectural simulations in order to create realistic timing. Furthermore, it has been shown that small changes in timing can have a great effect on simulation results[29]. With SimuBoost, this can easily be taken into account by replaying different runs of the same scenario. This stands in stark contrast to deterministic simulations such as Simics [163], which always produce exactly the same execution and thus fail to capture variations. On the other hand, if a repeated, exactly identical execution is needed with SimuBoost, the same recording can be replayed any number of times.

On the flip side, the combination of checkpointing and deterministic replay closely ties the simulation to the hardware-assisted execution. Although Viennot et al. demonstrated that a replay can tolerate modified executable code to some extent [263], experiments generally have to remain passive observations. SimuBoost is thus not suited, for instance, to evaluating the effects of novel memory architectures because this would require a feedback loop into the execution to apply new timing information. Forcing the simulation to deviate from the recorded execution path, however, breaks functional continuity and prevents a correct replay of interactions and other non-deterministic events. Nevertheless, SimuBoost is perfectly suited if a detailed insight into an existing realistic execution is desired, for example when debugging or collecting traces. These traces, in turn, can be fed into new architectural models.
