
3.2 Acceleration Techniques

3.2.2 Reducing the Observation Space

A prevailing practice to make simulation-based analysis applicable to long-running workloads is to trade accuracy for speed by limiting the simulation to short time frames, so-called samples. To this end, many simulators support dynamically switching between simulation modes of differing detail at run time, enabling the user to fast-forward between samples with some form of accelerated execution. In Simics [163], the user can choose whether timing models are applied. SimOS [223] and MARSSx86 [199] allow switching between functional and microarchitectural simulation. PTLSim/X [292] and FSA [229], in turn, provide hardware-assisted virtualization as an alternative mode of execution.

A fundamental challenge with sampling is to choose the samples such that the selected subset reflects the characteristics of the overall workload. The gathered information can then be extrapolated to draw conclusions about the whole workload. In research, three primary variants have emerged [289]:

In truncated execution, the workload is simulated for only a short duration on the presumption that the abbreviated execution phase is representative of the whole program. Most applications exhibit an initialization phase, in which internal data structures are set up and input data is loaded into memory before the application actually starts performing its task. The latter phase then usually dominates the program behavior. A common variant of truncated execution is thus to fast-forward over the initialization phase and start detailed simulation or analysis for a limited duration afterward. Depending on the level of simulation and type of analysis, a warm-up phase is prepended before taking measurements to line up any additional state (e.g., a cache model). Because the policy is easy to implement, it has been widely adopted in the literature. According to a study by Yi et al., over 50% of publications⁶ at HPCA, ISCA, and MICRO base their results on this technique [289]. However, the same study also revealed that truncated execution is highly inaccurate. This finding has been confirmed by other groups [101, 273]. The inaccuracy arises because the approach does not account for changing program behavior and at the same time depends on manually and arbitrarily chosen parameters such as the time span to simulate.
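The effect of an arbitrarily chosen window can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Plan` parameters, the `truncated_ipc` helper, and the per-instruction cycle trace, which merely stands in for the output of a detailed simulator.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    fast_forward: int  # instructions skipped without measurement (initialization)
    warm_up: int       # instructions run in detail only to warm up state
    measure: int       # instructions run in detail and measured


def truncated_ipc(cycle_costs, plan):
    """Estimate IPC from a single contiguous window, as truncated execution does.

    cycle_costs is a hypothetical per-instruction cycle trace that stands in
    for the output of a detailed simulator.
    """
    start = plan.fast_forward + plan.warm_up
    window = cycle_costs[start:start + plan.measure]
    return len(window) / sum(window)


# Toy workload: initialization, then two phases with different behavior.
trace = [4] * 1000 + [1] * 4000 + [3] * 5000

true_ipc = len(trace) / sum(trace)
estimate = truncated_ipc(trace, Plan(fast_forward=1000, warm_up=100, measure=1000))

print(f"true IPC      = {true_ipc:.3f}")
print(f"truncated IPC = {estimate:.3f}")
```

In this toy trace the measured window falls entirely into the first phase, so the estimate is 1.0 while the IPC of the whole run is roughly 0.43. Moving the window into the second phase would underestimate instead, which is the dependence on manually chosen parameters described above.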

SimPoint [109, 237] leverages sampling to reduce simulation time and increases accuracy by selecting multiple time frames to sample from. It is thus able to incorporate changing program behavior. Furthermore, the time frames are selected algorithmically by detecting phase behavior in the simulated workload. SimPoint thereby focuses the simulation and analysis on windows with representative characteristics. To gather initial information on the program behavior and to fast-forward between samples, SimPoint requires functional simulation. It is thus only suitable for accelerating more detailed execution modes such as microarchitectural simulation, which face even higher slowdowns than emulation. However, it does not present a solution for accelerating functional simulation, which is the goal of our work⁷. At the same time, SimPoint’s strength of building on phase behavior becomes its weakness when no sufficient phase behavior exists. SimPoint has been developed with a single process in mind. However, even for such a scenario, it has been shown that representative intervals may not be clearly identifiable when the program behavior is too complex (e.g., gcc [273]). For operating system research, where a mix of processes runs alongside OS kernel and driver threads, observing phase behavior becomes even more difficult.
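The following sketch illustrates how representative intervals might be selected by clustering. It follows the idea described in the SimPoint publications of grouping per-interval basic block vectors, but the naive k-means, its deterministic initialization, all function names, and the input data are assumptions made for illustration and do not reproduce the actual tool.

```python
def normalize(counts):
    """Turn raw per-block execution counts into a frequency vector."""
    total = sum(counts)
    return [c / total for c in counts]


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: dist2(point, centroids[j]))


def kmeans(points, k, iterations=20):
    # Deterministic farthest-point initialization (an assumption made here
    # to keep the example reproducible).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p, centroids) for p in points], centroids


def pick_representatives(points, k):
    """Map the index of one representative interval per cluster to its weight."""
    labels, centroids = kmeans(points, k)
    reps = {}
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            best = min(members, key=lambda i: dist2(points[i], centroids[j]))
            reps[best] = len(members) / len(points)
    return reps


# Hypothetical execution counts of three basic blocks in seven intervals.
intervals = [
    [90, 10, 0], [88, 12, 0], [10, 85, 5], [12, 80, 8],
    [91, 9, 0], [5, 5, 90], [11, 84, 5],
]
reps = pick_representatives([normalize(v) for v in intervals], k=3)
print(reps)
```

Only the returned intervals would be simulated in detail. Their results would then be combined using the weights, which give the fraction of intervals each representative stands for. If the intervals do not form clear clusters, the representatives lose their meaning, which mirrors the weakness discussed above.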

SMARTS [284] evades this problem by collecting samples periodically at high frequency, ignoring program behavior. The number of samples taken in SMARTS is thus higher, while each sample is considerably smaller (1000 instructions versus 100 M in SimPoint). SMARTS employs sampling theory to choose a minimal sampling frequency and to achieve a quantifiable accuracy and precision in its measurements. As with SimPoint, SMARTS targets the acceleration of microarchitectural simulations and requires functional simulation to fast-forward between sampling points and to warm up microarchitectural state. The functional simulation consequently occupies more than 99% of simulation run time. DirectSMARTS [58] demonstrates how simulations using SMARTS can benefit from acceleration methods for functional simulation, in this case by using emulation with dynamic binary translation instead of interpretation in the process-level RSIM [196] simulator. To improve the run time of subsequent experiments with the same benchmark, Wenisch et al. checkpoint the warmed-up state in so-called LivePoints [276]. While subsequent runs can then start from the checkpoints, the concept still requires a complete functional simulation beforehand. FSA [229], a recent publication targeting SMARTS, skips most of the functional simulation by using hardware-assisted virtualization instead. FSA then dynamically switches to functional simulation before each sample to warm up microarchitectural state and finally switches to detailed simulation for the sample itself.

⁶ Over a ten-year period ending in 2005.
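How sampling theory ties the number of samples to a quantifiable precision can be sketched with the textbook formula for estimating a mean, n ≥ (z · V / ε)², where V is the coefficient of variation, ε the tolerated relative error, and z the normal quantile for the chosen confidence level. The function names and all input values below are assumptions for illustration, and the exact procedure used by SMARTS may differ in detail.

```python
import math
from statistics import NormalDist, mean, stdev


def z_score(confidence):
    """Two-sided standard normal quantile for the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def required_samples(cv, rel_error, confidence):
    """Smallest n with z * cv / sqrt(n) <= rel_error."""
    return math.ceil((z_score(confidence) * cv / rel_error) ** 2)


def confidence_interval(samples, confidence):
    """Return (mean, half-width) of the confidence interval of the mean."""
    half_width = z_score(confidence) * stdev(samples) / math.sqrt(len(samples))
    return mean(samples), half_width


# Hypothetical inputs: CPI varies with a coefficient of variation of 1.0 and
# the estimate should lie within +/-3% at 99.7% confidence.
n = required_samples(cv=1.0, rel_error=0.03, confidence=0.997)
print(n)

# Hypothetical CPI values measured in four small sampling units.
cpi_mean, cpi_half_width = confidence_interval([1.0, 2.0, 3.0, 2.0], 0.95)
print(cpi_mean, cpi_half_width)
```

Because the required sample count grows with the square of the coefficient of variation, a workload with more variable behavior needs disproportionately more samples. The formula does not depend on detecting phases, which is consistent with the many small, periodic samples described above.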

A major drawback of all sampling-based methods is that they are directed toward the estimation of metrics that can be extrapolated from samples (e.g., instructions per cycle). Sampling is less suited to observing the actual system execution, as required in security research, malware analysis, or debugging. Moreover, limiting the observation window to discrete samples may not be an option because it does not permit the tracking of individual events. For example, the identification of memory <allocation, deallocation> pairs, as used in Undangle [52] to detect invalid pointers in use-after-free and double-free vulnerabilities, cannot be done this way. The same applies to the memory access pattern analysis in Bochspwn [133] and the evaluation of sharing opportunities for memory deduplication [105, 176, 217]. Instead, such applications demand an acceleration method that offers continuous simulation.
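A toy example shows why event tracking breaks down under sampling. The event trace, the window boundaries, and the helper names are hypothetical, and the matching logic is a deliberately simple stand-in that is not meant to reflect how Undangle works.

```python
def match_pairs(events):
    """Pair every free with the preceding allocation of the same address."""
    live = {}             # address -> time of the allocation still in use
    pairs = []            # (address, alloc_time, free_time)
    unmatched_frees = []  # frees without a live allocation, e.g. a double free
    for time, kind, addr in events:
        if kind == "alloc":
            live[addr] = time
        elif kind == "free":
            if addr in live:
                pairs.append((addr, live.pop(addr), time))
            else:
                unmatched_frees.append((addr, time))
    return pairs, unmatched_frees


def observe(events, windows):
    """Keep only the events that fall into one of the sampled windows."""
    return [e for e in events if any(lo <= e[0] < hi for lo, hi in windows)]


# Hypothetical trace of (time, kind, address) with a double free at time 260.
trace = [
    (10, "alloc", 0x1000),
    (120, "alloc", 0x2000),
    (250, "free", 0x1000),
    (260, "free", 0x1000),
    (900, "free", 0x2000),
]

full_pairs, full_unmatched = match_pairs(trace)
sampled_pairs, sampled_unmatched = match_pairs(observe(trace, [(0, 50), (240, 255)]))

print("continuous:", full_unmatched)
print("sampled:   ", sampled_unmatched)
```

With the complete trace, the second free of the same address is reported as unmatched. With the two sampled windows, the offending event is never observed, so the analysis reports nothing, and no extrapolation from the remaining events can recover it.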

In document SimuBoost: Scalable Parallelization of Functional System Simulation (Page 78-80)