5.4 Implementation

As explained in previous chapters, multithreaded applications are composed of sequential sections and parops. In Figure 5.6a we illustrate this concept for an OpenMP parallel loop. First, the application runs on one thread (sequential section A). Then, the application executes a fork parop to spawn a parallel loop and schedule its iterations to the available threads (four in this example). Every thread then executes its share of the parallel loop, and each of the iterations is in itself a sequential section (B, C, D and E). After that, a join parop finalizes the parallel loop and execution continues on one thread (sequential section F).

Our methodology leverages this knowledge to effectively filter each sequential section separately. This makes sense because the memory accesses in a sequential section are always in the same order and their effects on the private L1 cache state are always the same with respect to the previous and following accesses. To accomplish this, the trace generation engine must be notified whenever a parop executes so it knows when to reset the filter cache.
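The filtering step described above can be sketched as follows. This is a minimal illustration and not the implementation used in this work: the direct-mapped organization, the 64-byte line size, the number of sets, and the names filter_cache_t, filter_reset and filter_access are all assumptions made for the example. Reads and writes are treated alike for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parameters: 64-byte lines, 256 direct-mapped sets. */
#define LINE_BITS 6
#define NUM_SETS  256

typedef struct {
    uint64_t tag[NUM_SETS];   /* line number stored in each set */
    bool     valid[NUM_SETS]; /* whether the set holds a line   */
} filter_cache_t;

/* Invalidate every line. Intended to be called before tracing
 * starts and again every time a parop is notified. */
void filter_reset(filter_cache_t *fc) {
    assert(fc != NULL);
    memset(fc->valid, 0, sizeof fc->valid);
}

/* Returns true if the access misses in the filter cache and must
 * therefore be recorded in the trace; returns false if it hits and
 * can be filtered out. The line is installed on a miss. */
bool filter_access(filter_cache_t *fc, uint64_t addr) {
    assert(fc != NULL);
    uint64_t line = addr >> LINE_BITS;
    size_t   set  = (size_t)(line % NUM_SETS);

    if (fc->valid[set] && fc->tag[set] == line)
        return false;          /* hit: not written to the trace */

    fc->valid[set] = true;     /* miss: install the line ...    */
    fc->tag[set]   = line;
    return true;               /* ... and keep the access       */
}
```

Because the filter is reset at every parop, each sequential section starts from an empty filter cache, so the recorded accesses of a section do not depend on which section happened to run before it.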

The effort required to instrument parallel applications to catch parops during trace generation depends on the programming model. In general, however, all programming models provide a runtime system API (runtime library) and/or compiler support for intermediate code generation. Figure 5.6b shows an example of the code generated by the GCC compiler for an OpenMP parallel for loop [2]. The programmer annotates the parallel loop with the #pragma omp for directive, and the compiler transforms that construct into intermediate code that calls the libgomp OpenMP runtime library. In contrast, the pthreads programming model exposes the API to the programmer, who then has to manage parallelism explicitly.

5.4. Implementation Chapter 5. Trace Filtering of MT Apps

[Diagram: sequential section A; parallel sections B, C, D and E; sequential section F]

(a) OpenMP for scheme

Original OpenMP C code:

    #pragma omp for
    for (int i = 0; i < n; ++i) {
        body;
    }

GCC generated code:

    long i, _s0, _e0;
    if (GOMP_loop_runtime_start(0, n, 1, &_s0, &_e0))
        do {
            long _e1 = _e0;
            for (i = _s0; i < _e0; i++)
                body;
        } while (GOMP_loop_runtime_next(&_s0, &_e0));
    GOMP_loop_end();

(b) OpenMP for code

Figure 5.6: OpenMP for loop construct. The figure shows: (a) a scheme of the execution flow of an OpenMP for; (b) the original OpenMP C code and the intermediate C code generated by GCC, including the calls to the libgomp OpenMP runtime library.

Some runtime API functions provide information about the properties or the state of the application (e.g., omp_get_thread_num), but most of them execute a parop (e.g., omp_set_lock). To make use of our methodology, the runtime system API therefore has to be instrumented, either manually or with a dynamic binary instrumentation tool. This instrumentation must work together with the tool used to generate the memory access trace, so that parop calls are notified and the filter cache is reset. Here we discuss two alternatives for trace generation, execution-driven simulators and dynamic binary instrumentation tools, and how each can be notified of parop calls to reset the filter cache.

Execution-driven simulators can be notified of certain application events by means of special instruction codes. Some simulators provide this mechanism built in and make it available to users. An example is the magic instructions in Simics (see Section 2.5.2) for C programs. The user can add instrumentation points to the code with the MAGIC(X) construct. The simulator then invokes a callback function when it reads the magic assembly instruction, and the parameter X can be used to pass information to the simulator. Other simulators, such as SimpleScalar (see Section 2.5.1), do not provide this functionality built in, but it can be added easily because the simulator is open source and the extension is not complex.

Dynamic binary instrumentation tools, such as PIN [113] and Valgrind [130], also provide instrumentation points for calling a callback function specified by the user. In this case, the user has to register API function symbols and the corresponding callback functions are invoked when the target program executes them, without having to manually add instrumentation in the code.
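The symbol-to-callback registration described above can be sketched without depending on any particular tool. The following is only an illustration: hook_table_t, hook_register, hook_on_call and count_reset are hypothetical names, not part of PIN, Valgrind or any runtime library, and a real tool would resolve symbols in the binary instead of comparing strings on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_HOOKS 16  /* hypothetical table capacity */

typedef void (*parop_callback_t)(void *ctx);

typedef struct {
    const char      *symbol;    /* runtime API function name   */
    parop_callback_t callback;  /* invoked when it is executed */
    void            *ctx;       /* user data for the callback  */
} hook_t;

typedef struct {
    hook_t hooks[MAX_HOOKS];
    size_t count;
} hook_table_t;

/* Register a callback for a runtime API symbol.
 * Returns 0 on success, -1 if the table is full. */
int hook_register(hook_table_t *t, const char *symbol,
                  parop_callback_t cb, void *ctx) {
    assert(t != NULL && symbol != NULL && cb != NULL);
    if (t->count == MAX_HOOKS)
        return -1;
    t->hooks[t->count++] = (hook_t){ symbol, cb, ctx };
    return 0;
}

/* Stand-in for the tool's dispatch: called with the name of each
 * function the target program enters. Returns how many callbacks
 * were invoked for that symbol. */
int hook_on_call(const hook_table_t *t, const char *symbol) {
    assert(t != NULL && symbol != NULL);
    int fired = 0;
    for (size_t i = 0; i < t->count; ++i) {
        if (strcmp(t->hooks[i].symbol, symbol) == 0) {
            t->hooks[i].callback(t->hooks[i].ctx);
            ++fired;
        }
    }
    return fired;
}

/* Example callback standing in for "reset the filter cache":
 * it only counts how many resets were requested. */
void count_reset(void *ctx) {
    ++*(int *)ctx;
}
```

In this sketch, registering count_reset for GOMP_loop_end means that every dispatch of that symbol requests one reset, while functions that were not registered, such as omp_get_thread_num, are ignored.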

Considering this, the operation of our methodology is the same for both alternatives. Whenever the execution of a parop triggers the associated callback function, the callback resets the contents of the filter cache. In Section 5.2, we covered some works that assume a fixed core and L1 configuration for all experiments and replace the core and L1 cache models by a trace reader that sends memory accesses directly to the L2. In our methodology, however, the trace-driven simulator must include the L1 caches. Since we want to account for cache coherence actions, such as invalidations, we need to maintain the L1 cache state for those actions to take place. In the following section, we describe our implementation of this methodology to show its feasibility and as an example of how it can be put into practice.
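To illustrate why the L1 state must be kept in the trace-driven simulator, the sketch below replays accesses through per-core L1 caches and invalidates remote copies on a write. It is a deliberately simplified write-invalidate scheme under assumed parameters (four cores, direct-mapped L1s, 64-byte lines), not a full coherence protocol and not the TaskSim implementation; sim_t and sim_access are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical configuration for the example. */
#define NUM_CORES 4
#define LINE_BITS 6
#define NUM_SETS  64

typedef struct {
    uint64_t tag[NUM_SETS];
    bool     valid[NUM_SETS];
} l1_t;

typedef struct {
    l1_t          l1[NUM_CORES];
    unsigned long l2_accesses;    /* L1 misses forwarded to the L2 */
    unsigned long invalidations;  /* remote L1 lines invalidated   */
} sim_t;

static bool l1_present(const l1_t *c, uint64_t line) {
    size_t set = (size_t)(line % NUM_SETS);
    return c->valid[set] && c->tag[set] == line;
}

/* Replay one trace record on the given core.
 * Returns true if the access hit in that core's L1. */
bool sim_access(sim_t *s, int core, uint64_t addr, bool is_write) {
    assert(s != NULL && core >= 0 && core < NUM_CORES);
    uint64_t line = addr >> LINE_BITS;
    size_t   set  = (size_t)(line % NUM_SETS);

    bool hit = l1_present(&s->l1[core], line);
    if (!hit) {
        /* Miss: fetch the line through the L2 and install it. */
        s->l1[core].valid[set] = true;
        s->l1[core].tag[set]   = line;
        ++s->l2_accesses;
    }
    if (is_write) {
        /* A write invalidates the copies held by the other cores. */
        for (int c = 0; c < NUM_CORES; ++c) {
            if (c != core && l1_present(&s->l1[c], line)) {
                s->l1[c].valid[set] = false;
                ++s->invalidations;
            }
        }
    }
    return hit;
}
```

In this sketch, a line that one core has cached misses again after another core writes to it. A trace reader that bypasses the L1 and sends accesses directly to the L2 has no state in which to observe that effect.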

5.4.1 Sample implementation

We implemented this methodology in the mem mode of TaskSim (see Section 4.3.3), using the simulation methodology explained in Chapter 3. We use our NANOS++ instrumentation plug-in and PIN to generate traces at the memory level of abstraction (see Section 4.2). Figure 5.7 shows how the different components are combined to generate traces and carry out simulation. Two versions of the OmpSs application binary are generated with Mercurium, one linked to an instrumented version of NANOS++ (instrumented binary) and another one linked to a non-instrumented version (regular binary). The instrumented binary is run natively on one thread and generates an application-level trace at the shared-memory abstraction level (see Section 4.2). The regular binary is executed with PIN, also on one thread, to catch all memory accesses. For this purpose, we developed a PIN tool that generates the memory-access trace for the instrumented application. If filtering is set in the PIN tool, the memory accesses are passed to a configurable filter cache that decides whether or not to record them in the trace. In addition, the NANOS++ API functions are registered in our PIN tool to trigger the callback functions that reset the filter cache, if filtering is enabled.

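[Diagram: the OmpSs .c source is compiled with Mercurium and linked either to the instrumented NANOS++ .so (instrumented OmpSs binary) or to the regular NANOS++ .so (OmpSs binary). The instrumented binary is run natively and the regular binary is run with PIN and the PIN tool (filter/full). Trace generation produces the full/filtered trace, which TaskSim simulates with the multi-core model configuration to produce the simulation results.]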

Figure 5.7: Trace generation and simulation process.

The result of the two runs is an application-level trace and one memory-access trace for every computation section. The two traces are combined to generate a single (full or filtered) trace for TaskSim.

The trace generation process (dashed line in the figure) is carried out by our trace reader/writer library, which is also used in TaskSim to read the input trace.


In document Raising the level of abstraction : simulation of large chip multiprocessors running multithreaded applications (Page 106-109)