Chapter 7 Heterogeneous Parallel Execution Model
7.2.4 Execution Mode Operations
Execution modes can be understood as a set of thread capabilities: an execution mode is a token that gives the thread permission to execute instructions from a specific ISA, and access to a subset of the virtual address space and file descriptors [Lev84]. The HPE model takes this view and therefore offers programmers two fundamental execution mode operations: delegation and copy.
Delegation transfers the execution mode from the thread invoking the operation to a target thread, effectively revoking the caller thread's permission to switch to that execution mode. For instance, delegation might be used in streaming applications, where each execution thread switches to an accelerator mode to perform some computation over a tile of the application dataset and, when this computation is done, delegates its accelerator execution context to the following thread in the pipeline.
Execution mode copy duplicates the execution mode of the invoking thread and transfers one copy to a target thread. This operation effectively allows several execution threads to share one accelerator execution mode. One sample application of execution mode copy is hybrid filtering, where several filters are applied to an input set and the output data from all filters are combined into a single output. However, sharing one accelerator execution mode among several threads might serialize the execution of these threads in accelerator mode. The accelerator execution mode copy operation also provides a means to share data between several threads when running in accelerator mode. The data sharing granularity accomplished by execution mode copy is the whole virtual address sub-space accessible from the accelerator execution mode.
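The two operations above can be sketched as operations on capability tokens. This is only an illustrative model, not the HPE implementation: the names `ExecutionMode`, `ExecThread`, `delegate`, `copy` and `run_in` are hypothetical, and a per-mode lock stands in for the serialization that sharing one accelerator execution mode might cause.

```python
import threading


class ExecutionMode:
    """Hypothetical capability token: an ISA plus the address sub-space it exposes."""

    def __init__(self, isa, address_space):
        self.isa = isa
        self.address_space = address_space  # shared by every holder of this mode
        self.lock = threading.Lock()        # holders of one mode run serialized


class ExecThread:
    """Hypothetical stand-in for an execution thread and the modes it holds."""

    def __init__(self, name):
        self.name = name
        self.modes = set()

    def delegate(self, mode, target):
        """Hand the mode to `target`; the caller loses the permission."""
        self.modes.remove(mode)  # raises KeyError if the caller does not hold it
        target.modes.add(mode)

    def copy(self, mode, target):
        """Share the mode with `target`; the caller keeps the permission."""
        if mode not in self.modes:
            raise KeyError(mode)
        target.modes.add(mode)

    def run_in(self, mode, fn):
        """Run `fn` against the mode's address sub-space, if permitted."""
        if mode not in self.modes:
            raise PermissionError(f"{self.name} does not hold this execution mode")
        with mode.lock:
            return fn(mode.address_space)


# Usage: t1 starts with the accelerator mode and hands it down a pipeline.
acc = ExecutionMode("gpu", {})
t1, t2, t3 = ExecThread("t1"), ExecThread("t2"), ExecThread("t3")
t1.modes.add(acc)
t1.delegate(acc, t2)  # t1 can no longer switch to the accelerator mode
t2.copy(acc, t3)      # t2 and t3 now share the mode and its address sub-space
```

In this sketch, data written by `t2` through `run_in` is visible to `t3`, which mirrors the whole-sub-space sharing granularity described above, while `t1` is refused access after delegating.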
7.2.5 Benefits and Limitations
The HPE model simplifies the task of programming heterogeneous multi-accelerator systems. This model is fully compatible with existing programming models and simplifies the task of porting applications to use accelerators. Performance-critical functions can be substituted by accelerator calls in our model because it offers the same calling semantics. This simple porting path is not possible in the other execution models previously discussed. For instance, the IBM Cell SDK requires encapsulating performance-critical functions into separate execution threads. This might require major changes in the application code if non-thread-safe libraries or code are used. The CUDA driver API and OpenCL may require a major rewrite of application code to incorporate context creation and management. The NVIDIA CUDA run-time API does not support by-reference parameter passing, so complex wrapping functions are required, as discussed in Chapter 4.
Figure 7.2: Sample data-flow that illustrates the importance of fine-grained synchronization between parallel control-flows in CPUs and accelerators.
A potential limitation of the ADSM model is the need for specific calls to allocate data objects used in the accelerator code, but this limitation is also present in all existing programming models for heterogeneous systems. However, shared data object allocations are done with a single call, while the other models require two separate calls to allocate system and accelerator memory. This is key to achieving backward compatibility: ADSM shared data allocations in accelerator-less systems only allocate system memory, whereas accelerator memory allocation calls in other models simply fail, aborting the application execution. HPE cleanly complements ADSM to allow applications using accelerators to run on accelerator-less systems. In these systems, accelerator execution mode triggers the emulation of the accelerator code using the CPU [DKK09]. This accelerator emulation mode does not differ from the floating-point emulation already implemented by most OSs [HH97].
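The backward-compatibility argument can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration (`FakeAccelerator`, `shared_alloc`, `acc_call`, `increment`); none of it is the ADSM or HPE interface, and the explicit host/device copies are a simplification of how shared data objects are kept coherent.

```python
class FakeAccelerator:
    """Hypothetical device, present only so the sketch is runnable."""

    def alloc(self, size):
        return bytearray(size)

    def run(self, kernel, buf):
        kernel(buf)


def shared_alloc(size, accelerator=None):
    """Single allocation call: system memory always, accelerator memory if present."""
    return {
        "host": bytearray(size),
        "device": accelerator.alloc(size) if accelerator is not None else None,
    }


def acc_call(kernel, obj, accelerator=None):
    """Synchronous accelerator call with ordinary function-call semantics."""
    if accelerator is None:
        kernel(obj["host"])  # accelerator-less system: emulate on the CPU
        return "emulated"
    obj["device"][:] = obj["host"]          # simplified transfer to the device
    accelerator.run(kernel, obj["device"])
    obj["host"][:] = obj["device"]          # simplified transfer back
    return "accelerated"


def increment(buf):
    """Placeholder kernel: add one to every byte."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256
```

The same application code runs in both configurations: without an accelerator, `shared_alloc` allocates system memory only and `acc_call` falls back to CPU emulation instead of failing.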
Asynchronous accelerator execution is not directly supported in the HPE model. In contrast, all accelerator calls in the OpenCL and CUDA execution models are asynchronous. Moreover, these models compel programmers to use asynchronous accelerator calls extensively and to defer accelerator synchronization (i.e., waiting for the accelerator to finish) as much as possible [NVI09]. Hence, the lack of asynchronous accelerator calls might be viewed as a major limitation of our execution model. However, HPE intentionally prevents asynchronous accelerator calls because they do not follow the sequential execution model and provide only a limited form of parallel execution on CPUs and accelerators.
Operation    Description
accAlloc     Allocates memory at the accelerator
accRelease   Releases accelerator memory
accLaunch    Launches a computation in the accelerator
Table 7.1: Basic accelerator object interface used in GMAC
For instance, consider the data-flow in Figure 7.2, where dark-grey circles represent accelerator computations and light-grey circles CPU computations. This data-flow is a simplified version of a pattern found, for instance, in a Finite-Difference Time-Domain (FDTD) simulation of electromagnetic wave propagation in unbounded volumes, where snapshots of electromagnetic fields are taken every few iterations. In the data-flow in Figure 7.2, computations A and B, and C and D, can proceed in parallel. However, the code in the CPU has to ensure that B has finished before starting C, but the GPU code can start D just after finishing B. Asynchronous accelerator calls do not provide an easy mapping of this data-flow. The programmer asynchronously calls B and computes A concurrently. Then, a synchronization call between CPU and accelerator is required to ensure that B has finished, before launching D asynchronously and starting to compute C. This CPU-accelerator synchronization, required due to the data dependency between B and C, degrades the application performance. For instance, the execution of D is delayed whenever the CPU has not yet finished computing A.
HPE provides an elegant way to map the data-flow in Figure 7.2. The application uses one execution thread to compute A and C in the CPU, and another execution thread to compute B and D in the accelerator. A semaphore, for instance, can be used to avoid starting computation C before B has finished: the first execution thread waits on the semaphore after finishing A, and the second thread increments the semaphore after finishing computation B. As illustrated in this example, the fine-grained inter-thread synchronization provided by HPE allows efficient CPU-accelerator parallel execution.
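The two-thread mapping of Figure 7.2 can be sketched with ordinary threads and a semaphore. This is a minimal illustration under stated assumptions: `run_dataflow` and the `work` dictionary are hypothetical, both threads run on the CPU here, and the callables stand in for the real CPU and accelerator computations.

```python
import threading


def run_dataflow(work):
    """Run A and C on a 'CPU' thread, and B and D on an 'accelerator' thread.

    `work` maps the names "A", "B", "C", "D" to callables (placeholders).
    Returns the order in which the four computations completed.
    """
    order = []
    order_lock = threading.Lock()
    b_done = threading.Semaphore(0)  # incremented once B has finished

    def step(name):
        work[name]()
        with order_lock:
            order.append(name)

    def cpu_thread():
        step("A")
        b_done.acquire()  # wait on the semaphore: C depends on B
        step("C")

    def accelerator_thread():
        step("B")
        b_done.release()  # increment the semaphore: B's result is ready
        step("D")         # D starts right after B, without waiting for A

    threads = [
        threading.Thread(target=cpu_thread),
        threading.Thread(target=accelerator_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Unlike the asynchronous-call mapping, no CPU-accelerator synchronization sits between B and D: the only wait is the CPU thread's wait for B before starting C, so a slow A delays C but not D.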