Overhead - Software Performance Engineering using Virtual Time Program Execution

3.3 Evaluation

3.3.2 Overhead

The main source of VEX overhead is the logging of method entry and exit events. At each method entry, the CPU-time counter is accessed twice (for lost time compensation purposes), the virtual time of the thread is updated, and a timestamp is generated and pushed onto the thread-local profiler stack for this event. The event is popped at method exit. I/O methods have a higher mean overhead, because an identification point is registered for each distinct I/O invocation point (presented in Chapter 4).
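The entry and exit logging described above can be sketched roughly as follows. This is a minimal illustration and not the VEX implementation: all names (`ThreadProfile`, `on_method_entry`, `on_method_exit`, `thread_cpu_ns`) are hypothetical, the POSIX thread CPU-time clock stands in for whichever counter is configured, and the reading of why the counter is accessed twice (once to close the application interval, once so that the handler's own time is not charged) is an assumption.

```cpp
#include <cstdint>
#include <ctime>
#include <vector>

// Hypothetical names throughout; this is not the VEX API.
struct MethodEvent {
    int method_id;
    std::int64_t entry_virtual_ns;  // virtual timestamp taken at method entry
};

struct ThreadProfile {
    std::int64_t virtual_ns = 0;     // virtual time accumulated by this thread
    std::int64_t last_cpu_ns = 0;    // CPU time when the last handler returned
                                     // (0 initially, so the first event also
                                     // charges the CPU time used before it)
    std::vector<MethodEvent> stack;  // per-thread profiler stack
};

thread_local ThreadProfile tls_profile;

// CPU time consumed so far by the calling thread, in nanoseconds (POSIX).
inline std::int64_t thread_cpu_ns() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Entry handler: the first counter read closes the interval of application
// code; the second marks the end of the handler, so that (by assumption) the
// handler's own time is not added to the thread's virtual time.
void on_method_entry(int method_id) {
    ThreadProfile& p = tls_profile;
    const std::int64_t now = thread_cpu_ns();      // first counter access
    p.virtual_ns += now - p.last_cpu_ns;           // update virtual time
    p.stack.push_back({method_id, p.virtual_ns});  // push timestamped event
    p.last_cpu_ns = thread_cpu_ns();               // second counter access
}

// Exit handler: pops the matching entry event and returns the virtual time
// spent between entry and exit, in nanoseconds.
std::int64_t on_method_exit() {
    ThreadProfile& p = tls_profile;
    const std::int64_t now = thread_cpu_ns();
    p.virtual_ns += now - p.last_cpu_ns;
    const MethodEvent ev = p.stack.back();
    p.stack.pop_back();
    const std::int64_t elapsed = p.virtual_ns - ev.entry_virtual_ns;
    p.last_cpu_ns = thread_cpu_ns();
    return elapsed;
}
```

Even in this simplified form, each instrumented call costs several counter reads plus a stack operation, which is why the choice of CPU-time counter dominates the per-method overhead.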

Table 3.2 shows the mean times in nanoseconds for various CPU-time counter selections. The

Timer                          Method instr. [ns]   I/O method instr. [ns]
Perfctr-light                  408                  2294
PAPI+perfctr                   519                  2311
PAPI+clock thread cputime id   524                  2308
PAPI+times                     532                  2357
PAPI+proc                      737                  3322
PAPI+getrusage                 794                  3668
Clock gettime                  4915                 10504

Table 3.2: Method instrumentation overheads for different CPU-timer selections


module. Although the “PAPI+perfctr” setup also accesses CPU-time counters through the “Perfctr” module, our “Perfctr-light” timer setup outperforms it by approximately 20% on method instrumentation, by caching the pointer to the performance counter in a thread-local variable within VEX and removing some PAPI-related checks from the code that returns the CPU time.
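Per-call timer costs of the kind reported in Table 3.2 can be estimated by timing many back-to-back reads of a counter. The sketch below is an illustration under stated assumptions, not the procedure used for the table: the helper names are hypothetical, and only the POSIX thread CPU-time clock is shown because it needs no extra library.

```cpp
#include <chrono>
#include <cstdint>
#include <ctime>

// Reads the calling thread's CPU time in nanoseconds via POSIX clock_gettime.
inline std::int64_t read_thread_cpu_ns() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Mean wall-clock cost, in nanoseconds, of one call to `read_timer`,
// averaged over `calls` back-to-back invocations.
template <typename TimerFn>
double mean_timer_cost_ns(TimerFn read_timer, int calls) {
    std::int64_t sink = 0;  // consume results so the calls are not elided
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; ++i) {
        sink += read_timer();
    }
    const auto stop = std::chrono::steady_clock::now();
    volatile std::int64_t keep = sink;
    (void)keep;
    return std::chrono::duration<double, std::nano>(stop - start).count() / calls;
}
```

A call such as `mean_timer_cost_ns(read_thread_cpu_ns, 1000000)` returns the mean cost on the current host. For the figures in Table 3.2, the saving of “Perfctr-light” over “PAPI+perfctr” on method instrumentation works out to (519 - 408) / 519, or about 21%.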

Overhead compensation

Compensating for delays incurred by VEX is essential for improving prediction accuracy and maintaining a correct interleaving of the simulated threads. In this section, we demonstrate the effect of the compensation scheme on VEX predictions. The instrumentation overheads are profiled as described in Section 3.2.3, namely by measuring the virtual time wrongly attributed to an empty method. We also show how small differences in code execution can affect the success of the compensation.

The testing program executes a loop that calls a method M in each iteration. We select three short computation-based methods for M, so that the ratio of VEX instrumentation code time to the execution time of M is large. The three resulting programs are executed first in real time and then in virtual time, both without and with compensation using the measured instrumentation code delays.

Test - g++ flag   Real time               VEX prediction          Prediction error (%)
                  Mean [s]   C.O.V. (%)   Mean [s]   C.O.V. (%)
Loops1-O3         0.121      1.14         0.893      5.72         639.32
Loops2-O3         0.069      3.78         0.901      4.04         1200.81
Loops3-O3         0.429      1.10         1.312      2.13         205.79
Loops1-O0         0.202      5.21         0.908      3.55         349.12
Loops2-O0         0.181      9.83         0.966      4.02         432.59
Loops3-O0         0.507      2.86         1.353      3.31         166.81

Table 3.3: Prediction errors for trivial short methods without any overhead compensation. Results from 100 measurements on Host-1 are shown

To investigate the behaviour of the profiler under such circumstances, we define three different methods that each perform a short calculation (1-3 lines of code) and are invoked iteratively within a loop; the resulting tests are called Loops1, Loops2 and Loops3. All method invocations are manually instrumented with VEX method entry and exit calls. Setting up the experiment in

[Figure 3.12 (plot): “Sensitivity of prediction error to ns-level changes in compensation time”. x-axis: difference from profiled compensation value, ∆ [ns], from -6 to 6; y-axis: prediction error, E [%], from -120 to 200; one line per test: “Loops1-O3”, “Loops2-O3”, “Loops3-O3”, “Loops1-O0”, “Loops2-O0”, “Loops3-O0”.]

Figure 3.12: Demonstrating the effect of a single nanosecond in the lost time compensation scheme in the prediction error of VEX for three trivial short methods and two different compilation options. Results are from 100 measurements on Host-1 and 95% confidence intervals.

this way, we expect the observer effects from VEX profiling to be high. Indeed, without any compensation we obtain the prediction errors of Table 3.3: on average, the predicted times exceed the real execution times by about five times the real time (a mean error of roughly 500%).

In Figure 3.12 we demonstrate how sensitive the compensation mechanism is when methods that are “short” in relation to the VEX overhead are invoked at a high rate. As explained in Section 3.2.3, VEX compensates for the additional execution time caused by the invocation of instrumentation code by profiling the corresponding code and acquiring a compensation value C. C is measured once for a simulation host and is then retrieved (from a file) for all virtual time executions on that host. Using C for overhead compensation in each of our test programs, we acquire the prediction errors that correspond to ∆ = 0 on the x-axis of Figure 3.12. The results for this compensation value vary from a relative overprediction of 20% for “Loops2-O3” to an underprediction of 60% for “Loops1-O0”. As we change C in steps of 1 ns, we find that each


program yields the best prediction results (E = 0 on the y-axis) for a simulation with a different compensation value. For example, “Loops1-O3” and “Loops2-O0” return the lowest prediction errors for −3 < ∆ < −2, “Loops3-O0” for ∆ = 0 and “Loops3-O3” for ∆ = 3. The sensitivity of the prediction error to a change of 1 ns (visible as the slope of the line corresponding to each test) also varies amongst the different test codes.

This could be attributed to the fact that the code of each test is optimised differently by the compiler (because of the use of the “-O3” g++ flag), which generates different binaries in each case and affects the code before and after the invocation of the VEX instruments. Disabling the optimisation and disassembling the code, we find that the additional code is exactly the same in all instrumented methods. The results for this setup, presented in Figure 3.12 as the “Loops*-O0” measurements, show that the behaviour is similarly erratic. This leads us to believe that the effect of invoking the instrumentation code on the CPU pipeline, instruction execution order and memory access patterns is different in each case and can only be approximately compensated. We conclude that the compensation scheme benefits prediction accuracy, but does not provide a precise solution: in extreme cases like the ones presented here it is preferable to remove the instrumentation altogether. This is the responsibility of the upper instrumentation layer presented in Section 5.1.3.
