8.2.5 Miscellaneous

In addition to the aforementioned issues, we had to perform numerous other changes to the simulation in QEMU in order to faithfully imitate the hardware behavior of our test platform. In the following, we describe three of the more prominent changes.

Footnote 12: The most common distances between recorded and replayed instruction counts were either 5 or 30 instructions[262].

Page Faults on Code Pages. Besides faithfully reproducing page faults on data accesses, the simulation also has to accurately trigger page faults for code pages. In the case of QEMU, the binary translator reads guest code when it generates a translation block (TB). This creates a fundamentally different access pattern compared to a physical CPU:

• A TB typically comprises multiple instructions but does not cross jumps.

• The translator does not perform speculative execution.

• Previously translated guest code is accessed again only when the respective TB has been removed from the code cache.

To prevent premature page faults caused by block translation, we adjusted QEMU to stop translation at page boundaries. Regarding speculative execution, we neither observed it triggering page faults nor found statements to that effect in the architecture manual. Our solution is thus consistent with existing heterogeneous replay systems[287]. The last point is uncritical because the physical CPU raises a page fault on a previously faulted code page only if the corresponding mapping has been invalidated or changed in the meantime. A binary translator has to cope with these cases anyway, so the last point does not constitute a problem for deterministic replay.
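The page-boundary rule can be illustrated with a minimal sketch. It is not QEMU's translator code: the function `split_into_blocks`, the tuple layout of an instruction, and the 4 KiB page size are assumptions chosen for illustration. Instructions that straddle a page boundary are deliberately not handled.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_of(addr):
    """Return the page number that contains the given address."""
    return addr // PAGE_SIZE


def split_into_blocks(instructions):
    """Group (address, length, is_jump) tuples into translation blocks.

    A block ends after a jump. It also ends before any instruction that
    starts on a different page than the first instruction of the block,
    so translating a block never reads code from a second page.
    """
    blocks, current = [], []
    for addr, length, is_jump in instructions:
        if current and page_of(addr) != page_of(current[0][0]):
            blocks.append(current)
            current = []
        current.append((addr, length, is_jump))
        if is_jump:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Hypothetical instruction stream crossing the boundary at 0x1000.
stream = [
    (0x0FF8, 4, False),
    (0x0FFC, 4, False),
    (0x1000, 4, False),
    (0x1004, 2, True),
    (0x1006, 4, False),
]
blocks = split_into_blocks(stream)
```

In this example the first block ends at the page boundary even though no jump occurs, so a fault on the second page can only be raised once execution actually reaches it.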

Resume Flag. Bit 16 in the EFLAGS register is the so-called resume flag (RF), which controls whether the CPU stops at an instruction breakpoint[125]. The flag is intended to prevent a breakpoint from recurring when a debugger continues execution. The CPU clears RF after every instruction and sets it by pushing a corresponding EFLAGS value onto the stack when calling an event handler such as the breakpoint handler. This way, the set resume flag is popped from the stack on return. For interrupt handlers, the CPU sets the resume flag only if the interrupt arrives after any iteration of a repeated (REP) string instruction except the last.

QEMU does not faithfully implement the resume flag, which leads to divergences in memory when the flag is pushed onto the stack. Furthermore, we observed that the physical CPU did not show fully deterministic RF states for interrupts during the last iteration of REP-prefixed instructions. We therefore replay the resume flag from the landmark in case of interrupts and replicate the behavior described in the architecture manual in all other cases.
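The replay policy for the resume flag can be sketched as follows. Only the bit position (bit 16 of EFLAGS) is taken from the text; the function names and the parameters `landmark_rf` and `manual_rf` are hypothetical and merely stand for "the RF value stored in the landmark" and "the RF value derived from the architecture manual".

```python
RF_BIT = 16
RF_MASK = 1 << RF_BIT  # 0x10000


def has_rf(eflags):
    """Return True if the resume flag is set in the EFLAGS value."""
    return bool(eflags & RF_MASK)


def with_rf(eflags, value):
    """Return the EFLAGS value with the resume flag set or cleared."""
    return eflags | RF_MASK if value else eflags & ~RF_MASK


def eflags_to_push(eflags, is_interrupt, landmark_rf, manual_rf):
    """Choose the EFLAGS image pushed onto the stack for an event.

    Assumption: for interrupts the recorded RF state wins, because the
    hardware state was observed to be not fully deterministic; for all
    other events the value derived from the manual is used.
    """
    rf = landmark_rf if is_interrupt else manual_rf
    return with_rf(eflags, rf)
```

For example, `eflags_to_push(0x202, True, landmark_rf=True, manual_rf=False)` yields an image with bit 16 set, regardless of what the manual-based model would have computed.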

Floating Point Unit (FPU). Besides the already mentioned reasons for diverging memory images, we found that many differences later in the system boot phase (i.e., starting of user-mode services) stemmed from an incomplete FPU implementation. This became apparent whenever the state of the FPU was written to memory using the FSAVE or XSAVE instructions, for instance on a context switch.

Deficiencies included:

• Missing reflection of FPU exceptions in the SSE status register (MXCSR).

• Missing implementation of the last FPU instruction pointer (FIP), the last FPU data pointer (FDP), and the last FPU instruction opcode (FOP) registers.

• Missing initialization of reserved words in the x87 FPU state and XSAVE areas.

We fixed these issues by extending QEMU and do not have to record any supplemental information.

8.3 Conclusion

Bootstrapping simulations based on periodic checkpoints alone does not reproduce the exact execution of the hardware-assisted virtualization. Instead, this requires recording and replay of non-deterministic events. In SimuBoost, we implemented a heterogeneous deterministic replay that collects events during hardware-assisted execution in KVM and precisely replays these events in QEMU’s binary translator for simulation. To guarantee identical runs, we have to capture at least nine types of events, six synchronous (e.g., CPUID, RDTSC, IN) and three asynchronous (INT, SMI, Write DMA). Although strictly necessary only for asynchronous events, we also capture landmarks for synchronous events, which greatly simplifies debugging. The landmarks are built around the retired instruction count and the RCX register to differentiate individual iterations of repeated instructions (REP). Since, however, the hardware performance counter for instruction counting is not fully reliable on x86, the replay matches supplementary CPU registers in a window around the alleged target instruction to recognize the correct injection time.
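The window-based landmark matching can be illustrated with a small sketch. The `Landmark` fields beyond the instruction count and RCX, the use of RIP as the supplementary register, and the function `find_injection_point` are assumptions for illustration and do not reflect the actual SimuBoost data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Landmark:
    icount: int  # recorded retired instruction count (may be slightly off)
    rcx: int     # distinguishes iterations of REP-prefixed instructions
    rip: int     # supplementary register; hypothetical choice


def find_injection_point(trace, landmark, window):
    """Return the index of the replay state that matches the landmark.

    trace is a list of (icount, rcx, rip) states in execution order.
    Only states whose instruction count lies within +/- window of the
    recorded count are considered. Returns None if nothing matches.
    """
    lo, hi = landmark.icount - window, landmark.icount + window
    for index, (icount, rcx, rip) in enumerate(trace):
        if icount < lo:
            continue
        if icount > hi:
            break
        if rcx == landmark.rcx and rip == landmark.rip:
            return index
    return None


# Hypothetical replay states; three of them belong to one REP instruction.
trace = [
    (100, 7, 0x400000),
    (101, 3, 0x400004),
    (101, 2, 0x400004),
    (101, 1, 0x400004),
    (102, 0, 0x400006),
]
recorded = Landmark(icount=99, rcx=2, rip=0x400004)
```

Here the recorded count is off by two instructions, yet a window of five still identifies the second REP iteration, whereas a window of one finds no matching state.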

A particular challenge with heterogeneous deterministic replay is that besides exact handling of non-deterministic events, the simulation needs to be refined to match the recorded hardware platform also in the execution of deterministic operations. To this end, we adapted QEMU’s status flag computation, added memory write probing, and made many further refinements.

Interestingly, we found MMU-induced non-determinism to be entirely ignored in research, in contrast to the commercial products from VMware that seem to handle it. Our results confirm that a software-based solution is feasible, but only with considerable run-time overhead. We thus support VMware’s proposal for a corresponding hardware extension. As MMU-induced non-determinism surfaces only in conjunction with stale TLB entries, we regard recording and replaying it as an optional measure that protects against malicious guests. We did not include it in our final prototype.

The evaluation of our implementation revealed that trapping RDTSC instructions is responsible for most of the run-time overhead. In consequence, benchmarks making frequent use of the timestamp counter suffer from notable slowdown (up to 90% for apache). All other benchmarks are barely affected by recording and show run-time overheads below 1%. Compression with LZMA proved to be very effective with recording logs, reducing the necessary network bandwidth to only 3 MiB/s for the most demanding workload. However, the compression ratio strongly correlates with the share of DMA events in the log, which can result in higher bandwidth consumption during DMA-heavy phases – e.g., up to 16 MiB/s during the initialization phase of the kernel build benchmark. Nevertheless, even with parallel checkpoint distribution Gigabit Ethernet generally provides enough bandwidth. Our replay solution thus fulfills the requirements of SimuBoost.

Evaluation

In the previous chapters, we have described the four building blocks of SimuBoost: (1) the performance model, (2) continuous checkpointing, (3) checkpoint distribution, and (4) heterogeneous deterministic replay. In this chapter, we bring all these components together and evaluate to what extent SimuBoost is able to accelerate functional full system simulation.

In particular, the evaluation covers:

• Achievable parallel simulation time and speedup

• Scalability and efficiency with increasing number of simulation nodes

• Applicability of the performance model

We start with a detailed description of our evaluation setup in Section 9.1. In the following Section 9.2, we show that SimuBoost is able to drastically reduce the slowdown of functional full system simulation and that it is able to maintain this performance even with heavyweight instrumentation enabled. Section 9.3 demonstrates that SimuBoost delivers scalability beyond the limits of a single physical machine and that this is the basis for high acceleration. We further elaborate on the factors that determine the parallelization efficiency. In Section 9.4, we compare the predictions of the performance model with the actual results from our practical experiments and discuss conceptual weaknesses of the current model. We summarize the results in Section 9.5.

9.1 Evaluation Setup

As mentioned in § 7.3, our final prototype of SimuBoost does not yet integrate multicast distribution. Although this is a central component to perform immediate parallel simulation, we can do a representative evaluation of SimuBoost without the live distribution. For this purpose, we separate the checkpointing and recording phase from the parallel simulation and instead manually copy all data to the simulation nodes in between. Our evaluation thus misses potential effects of multicast transmission delays because all data is already present on the target nodes. The results without live distribution will nevertheless be very close to what can be expected for a complete prototype. In Chapters 7 and 8, we demonstrate that the compression pipeline is typically able to reduce the data volume so that it remains below the bandwidth limit of Gigabit Ethernet. That means with actual distribution the simulation would start at most with the delay that is necessary for the compression and transmission of the first checkpoint. This delay is around 2 s in our experiments. All following checkpoints usually remain below the bandwidth limit and would not cause significant further delays.

Accordingly, we split each experiment into three consecutive phases:

1. Execution in a hardware-assisted virtual machine with active continuous checkpointing (copy-on-write, pre-scan, sparse) and recording of non-deterministic events. All data is compressed live, hence incurring overhead comparable to that of multicast distribution. However, instead of sending the data into the network, we store everything on local storage.

2. Manual distribution of checkpoints and replay data to all systems in the simulation cluster (machine by machine) using rsync. Based on the aforementioned reasoning, we do not include this phase in the results. As stated in § 7.3, using a network file system to avoid the copy phase is not an option.

3. Parallel simulation using timed job submission to mimic the gradual availability of new checkpoints. We log the exact timing of checkpoints in the hardware-assisted run and then reproduce this very sequence.

Figure 9.1 illustrates our main evaluation setup consisting of five SMP systems with 108 physical cores in total, each core serving as one node for simulation (see Tables 9.2 and 9.1 for details on the hardware and software configurations). Whereas Systems 2 to 5 only perform simulations, System V/1 serves as host for both the hardware-assisted execution at the beginning and simulations later on. To coordinate the parallel simulations, we establish a global job queue that is shared between all machines over the network. We use Python’s capability for synchronized access to remote objects for this purpose. In this case, this is a regular Python multiprocessing queue hosted on System V/1. Since only a single process per machine dequeues jobs (i.e., five in our setup), contention on the queue is not an issue. For larger setups, a more sophisticated job assignment with a cluster scheduler such as SLURM[290] is certainly appropriate.
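The combination of a shared job queue and timed job submission can be sketched as follows. This is a simplified stand-in, not the benchmark framework itself: it uses threads and an in-process `queue.Queue` instead of a multiprocessing queue shared over the network, and the names `submit_timed`, `worker`, and `run_demo` as well as the `time_scale` parameter are hypothetical. In a distributed setup, one standard way to expose such a queue to other machines is `multiprocessing.managers.BaseManager`.

```python
import queue
import threading
import time


def submit_timed(jobs, job_queue, num_workers, time_scale=1.0):
    """Enqueue each checkpoint ID once its recorded offset has elapsed.

    jobs is a list of (checkpoint_id, offset_in_seconds) sorted by offset,
    as logged during the hardware-assisted run.
    """
    start = time.monotonic()
    for ckpt_id, offset_s in jobs:
        delay = offset_s * time_scale - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        job_queue.put(ckpt_id)
    for _ in range(num_workers):
        job_queue.put(None)  # sentinel: no further jobs


def worker(job_queue, done, lock):
    """Dequeue jobs until the sentinel arrives."""
    while True:
        ckpt_id = job_queue.get()
        if ckpt_id is None:
            break
        # Placeholder: a real worker would start a simulation here.
        with lock:
            done.append(ckpt_id)


def run_demo(jobs, num_workers=2, time_scale=1.0):
    """Run the timed submission against a small pool of workers."""
    job_queue, done, lock = queue.Queue(), [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(job_queue, done, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    submit_timed(jobs, job_queue, num_workers, time_scale)
    for t in threads:
        t.join()
    return done
```

Calling `run_demo([(1, 0.0), (2, 0.01), (3, 0.02)])` processes the three hypothetical checkpoints no earlier than their logged offsets. The order in which workers finish is not guaranteed.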

To attain a thorough picture of SimuBoost’s speedup characteristics and to be able to verify the predictions of the performance model, we conduct experiments for all combinations of the following parameters:

[Figure 9.1, diagram labels: System V/1 (master, 2x Xeon Gold 6138, 40 cores @ 2.00 GHz); Systems 2, 3, and 4 (slaves, each 2x Xeon E5-2630 v3, 16 cores @ 2.40 GHz); System 5 (slave, 1x Xeon Gold 6138, 20 cores @ 2.00 GHz); 10 Gigabit Ethernet; checkpoints (ckpt) on every system; log; timed job submission; shared job queue.]

Figure 9.1: The hardware-assisted virtual machine runs on System V/1. Afterward, the checkpoints and replay data are copied to all other machines. Then, the benchmark framework starts parallel simulations that are coordinated with a shared job queue. A timed submission of jobs mimics the gradual availability of new checkpoints.

Workload. In the previous chapters, we have found that the run-time overhead for continuous checkpointing and recording of non-deterministic events heavily differs between various benchmarks. As the run-time overhead defines a line below which decreasing the interval length does not lead to higher but lower speedup, we can expect considerable differences in maximum achievable speedup between workloads. We run applications from the same set of workloads as in the previous chapters.

Although we see truly interactive workloads such as user-driven desktop usage as an important workload category for operating system research, we do not explicitly include such a benchmark. In Chapter 6, we demonstrate that SimuBoost is able to maintain interactivity for real-world workloads with downtimes below 10 ms. The deterministic replay, in turn, injects all interactions into the simulation. Regarding interactive workloads, this is the decisive improvement of SimuBoost over conventional slow functional full system simulation, where it is impossible to faithfully capture and simulate realistic user behavior. As interactive workloads generally exhibit a disproportionate amount of idle phases, we include the idle workload for comparison. Depending on the applications run, a true interactive scenario will then be located between idle and one of the other workloads.

Number of Simulation Nodes. Gradually increasing the number of nodes allows us to explore the scalability of SimuBoost. Although our simulation cluster provides a total of 216 logical cores due to hyperthreading, we only consider physical cores for simulation (i.e., 108). This is because preliminary experiments showed that running simulations on logical cores (i.e., two simulations per physical core) rarely increases performance. In fact, the opposite is often the case, which we attribute to exceeding memory bandwidth limits and mutual cache pollution. This confirms results from Wallace and Hazelwood[267] who also did not observe further gain from using hyperthreads.

Accordingly, we repeat all parallel simulations with the following number of physical nodes: N ∈ {4, 8, 16, 24, 32, 48, 64, 80, 96, 108}. We use the same set of checkpoints and replay data for every configuration. In order to somewhat balance the load between the hosts, we assign each system a share of parallel jobs proportional to its number of physical cores.
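The proportional assignment can be sketched as follows. The core counts are those of the five systems in Figure 9.1; the rounding scheme (largest remainder) and the function name `proportional_shares` are assumptions, since the rounding used in the experiments is not specified here.

```python
CORES = {
    "System V/1": 40,
    "System 2": 16,
    "System 3": 16,
    "System 4": 16,
    "System 5": 20,
}


def proportional_shares(cores, n):
    """Split n parallel jobs across hosts in proportion to their cores.

    Uses integer arithmetic: every host first receives the floor of its
    exact share, then the remaining jobs go to the hosts with the largest
    remainders (assumed rounding scheme).
    """
    total = sum(cores.values())
    shares, remainders = {}, {}
    for host, count in cores.items():
        shares[host], remainders[host] = divmod(n * count, total)
    leftover = n - sum(shares.values())
    ranked = sorted(cores, key=lambda host: remainders[host], reverse=True)
    for host in ranked[:leftover]:
        shares[host] += 1
    return shares
```

With all 108 nodes, every system simply receives as many jobs as it has physical cores. For smaller N the shares shrink accordingly while always summing to N.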

Interval Length. In order to find the optimal interval length for a certain number of nodes and to verify the predictions of the performance model, we run each benchmark configuration with different interval lengths: L ∈ {100, 300, 500, 1000, 2000, 3000, 4000} ms. For each configuration, we take the interval length with the highest speedup.

Simulation Slowdown. While the simulation slowdown already varies between workloads due to each workload’s individual instruction mix, we additionally run each simulation a second time with activated hooks for tracing memory writes (w), and a third time with tracing hooks for both memory reads and writes (r+w). This gives an impression of how speedup, scalability, and optimal interval length react to changes in simulation slowdown, as would be the case for different types of analyses connected to the simulation. Since collecting memory traces requires considerable computational resources (e.g., for compression) that we rather want to use for exploring larger simulation clusters, we do not perform actual tracing. Instead, we only activate the hooks. This adds a helper call to a tracing function for each memory access (depending on the tracing mode). Compared to actual tracing, the function only assembles the trace entry but leaves out submitting it. This slows down the simulation without taxing other CPU cores. We omit configurations with N ∈ {4, 8} for the tracing runs as the higher slowdown makes such small setups less interesting.

For a more genuine comparison, we run conventional serial simulations without tracing with an unmodified version of TCG. Besides the tracing hooks, this version also misses the instruction counting and the refinements discussed in § 8.2. These are all features that further slow down the simulation. Both tracing runs, in contrast, add the hooks to our refined version of TCG.

We determine for each combination of workload and tracing mode the best configuration of L and N by gradually increasing N until a further increase does not provide any significant improvement in parallel simulation time but only harms efficiency. Based on our results, we found a threshold of 10 percentage points to be appropriate. If, for instance, selecting the next higher N only reduces the slowdown from 1.5x to 1.45x, we prefer the smaller setup with the higher slowdown but better efficiency.
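The selection rule can be sketched as follows. The sketch reads the threshold as an absolute difference of 0.10 in slowdown, which is one interpretation of the 1.5x versus 1.45x example. The measurement values and the names `best_slowdown_per_n` and `pick_configuration` are hypothetical.

```python
def best_slowdown_per_n(results):
    """Reduce {(n, interval_ms): slowdown} to {n: (slowdown, interval_ms)}.

    For every number of nodes, keep the interval length that yields the
    lowest slowdown (i.e., the highest speedup).
    """
    best = {}
    for (n, interval_ms), slowdown in results.items():
        if n not in best or slowdown < best[n][0]:
            best[n] = (slowdown, interval_ms)
    return best


def pick_configuration(results, threshold=0.10):
    """Increase N until the slowdown improves by less than the threshold.

    Returns (n, interval_ms) of the preferred configuration.
    """
    best = best_slowdown_per_n(results)
    ns = sorted(best)
    chosen = ns[0]
    for n in ns[1:]:
        if best[chosen][0] - best[n][0] < threshold:
            break  # improvement too small: keep the smaller setup
        chosen = n
    return chosen, best[chosen][1]


# Hypothetical measurements: (N, L in ms) -> slowdown factor.
measurements = {
    (16, 1000): 3.2, (16, 2000): 3.0,
    (32, 1000): 1.9, (32, 2000): 2.1,
    (64, 500): 1.5, (64, 1000): 1.6,
    (96, 500): 1.45, (96, 1000): 1.5,
}
```

For these values, going from 64 to 96 nodes improves the slowdown by only 0.05, so the rule settles on N = 64 with L = 500 ms.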
