
4.2 AMA: Asynchronous Management of Accelerators

4.2.3 OmpSs Xeon Phi Support with AMA

This section describes how task offloading on Intel Xeon Phi cards in OmpSs is supported by means of the hStreams library [20]. Developed by Intel, this library offers an interface to offload pieces of code on a Xeon Phi device. Conceptually, hStreams is very similar to CUDA or OpenCL: memory transfers must be explicit between host and device memory spaces, streams and events are used to issue and control device operations, data transfers and offloaded executions are asynchronous, etc.

    // Allocate device memory
    void *nanos_malloc_hstreams(size_t size);

    // Free device memory
    void nanos_free_hstreams(void *address);

Figure 4.18: Nanos++ API functions for allocating and deallocating Xeon Phi memory

The OmpSs Xeon Phi support on top of hStreams was developed in an iterative, interactive process with Intel: Intel provided early software releases, we requested new features, and we reported several bugs in the hStreams library. We took advantage of class abstraction and inheritance to avoid duplicating code between the GPU and Xeon Phi support components. As a result, the implementation of the main execution flow is shared between both devices, and only a few small parts are specialized for each device.

4.2.3.1 Xeon Phi Accelerator Initialization

The hStreams library needs to be initialized before any call to the library and finalized at the end of the application. The initialization includes setting the desired options for the Xeon Phi device, like configuring the number of partitions or OpenMP core thread affinity.

The OmpSs Xeon Phi support component performs all these operations internally, so they are hidden from the programmer. The programmer can still configure the number of hStreams partitions through an OmpSs environment variable.
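As an illustration of this kind of configuration, the following minimal sketch reads a partition count from the environment and falls back to a default when the value is missing or invalid. The variable name `EXAMPLE_XPHI_PARTITIONS` and both function names are hypothetical: the text does not name the actual OmpSs variable, and this is not the Nanos++ implementation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical variable name, used only for this illustration. */
#define PARTITIONS_ENV_VAR "EXAMPLE_XPHI_PARTITIONS"

/* Parse a partition count from text. Returns `fallback` when the text is
 * missing, empty, not a plain decimal number, non-positive, or too large. */
int parse_partitions(const char *text, int fallback)
{
    assert(fallback > 0);
    if (text == NULL || *text == '\0')
        return fallback;

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0 || value > INT_MAX)
        return fallback;
    return (int)value;
}

/* Read the partition count from the (hypothetical) environment variable. */
int partitions_from_env(int fallback)
{
    return parse_partitions(getenv(PARTITIONS_ENV_VAR), fallback);
}
```

Rejecting malformed values instead of silently truncating them keeps a typo in the environment from producing an unexpected partition layout.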

As in the GPU component, one helper thread is created for each Xeon Phi card in the system, and each thread is linked to one of the cards. The device characteristics are also captured at this point to guarantee correct execution of the application.

4.2.3.2 Xeon Phi Memory Management

By default, data are allocated in the Xeon Phi memory space the first time a task needs them. However, we detected that the hStreams interface used to allocate such data takes a long time to perform allocations. Thus, for performance reasons, the OmpSs Xeon Phi support offers two Nanos++ API functions to allocate and deallocate user data. Figure 4.18 shows the syntax of these functions. It is not possible to follow the same approach as the GPU support component here, because hStreams streams handle dependencies between operations based on the addresses of their parameters. If the whole device memory were allocated at once, we could break this hStreams dependence detection mechanism.
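A possible usage pattern for the two functions of Figure 4.18 is to allocate a buffer once, before the tasks that use it, and to release it after the last one, so that the slow allocation is paid a single time. The sketch below only illustrates that pattern: the two function bodies are host-side stand-ins (plain `malloc` and `free`) so that the example compiles without Nanos++, and `sum_with_preallocated_buffer` is a hypothetical helper whose loops stand in for offloaded tasks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins with the signatures of Figure 4.18. In an OmpSs program these
 * are provided by Nanos++; here they wrap the host heap (an assumption made
 * only so that the sketch is self-contained). */
void *nanos_malloc_hstreams(size_t size) { return malloc(size); }
void nanos_free_hstreams(void *address) { free(address); }

/* Allocate one buffer up front, reuse it in every iteration, free it once.
 * Returns the accumulated sum, or -1.0 if the allocation fails. */
double sum_with_preallocated_buffer(size_t n, int iterations)
{
    assert(iterations >= 0);
    double *data = nanos_malloc_hstreams(n * sizeof *data);
    if (data == NULL)
        return -1.0;

    double total = 0.0;
    for (int it = 0; it < iterations; ++it) {
        /* Stand-in for a task that would run on the card. */
        for (size_t i = 0; i < n; ++i)
            data[i] = (double)it;
        for (size_t i = 0; i < n; ++i)
            total += data[i];
    }

    nanos_free_hstreams(data);
    return total;
}
```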

4.2.3.3 Event-driven Flow

The hStreams library provides a slightly different stream abstraction compared to CUDA: operations issued to the same stream may not be executed in FIFO order. Only dependent operations, that is, those referring to the same host address, are guaranteed to execute in order.

[Timeline diagram: rows for the helper thread and for Xeon Phi device Streams #1, #2 and #3 along a time axis. Legend: Run kernel, HtD transfer, DtH transfer. Stream #1 carries the operations of one task with events E1 to E3; Streams #2 and #3 carry the HtD, Run and DtH operations of other tasks with events E4 to E6 and E7 to E9.]

Figure 4.19: Distribution of asynchronous operations and events on Xeon Phi device streams

This automatic dependence detection allows the OmpSs Xeon Phi support component to apply an optimization in the AMA design: it is not necessary to wait for the completion of input data transfers before launching the kernel. Instead, the task offload can be issued immediately after its input data transfers in the same stream, and the hStreams library will preserve their dependencies. In this case, the three stages of a task (active, run and completion) are issued to the same stream, and several streams are used to overlap the stages of different tasks.
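The ordering rule can be illustrated with a small model. The sketch below is not hStreams code: `stream_op` and `depends_on` are hypothetical names, and it assumes that each operation refers to a single host address and that two operations are dependent exactly when those addresses are equal. Under these assumptions, a kernel issued right after its input transfer in the same stream is ordered after that transfer, while operations on unrelated addresses remain free to be reordered.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_HTD, OP_RUN, OP_DTH } op_kind;

/* One operation issued to a stream (simplified to a single operand). */
typedef struct {
    op_kind kind;
    const void *host_addr; /* host address the operation refers to */
} stream_op;

/* For the operation at `index` in issue order, return the index of the most
 * recent earlier operation in the same stream that refers to the same host
 * address, or -1 if there is none (the operation may then run out of order). */
int depends_on(const stream_op *ops, int index)
{
    assert(ops != NULL && index >= 0);
    for (int j = index - 1; j >= 0; --j) {
        if (ops[j].host_addr == ops[index].host_addr)
            return j;
    }
    return -1;
}
```

For a stream holding an HtD transfer of `a`, an HtD transfer of `b`, a kernel on `a` and a DtH transfer of `a`, the model orders the kernel after the first transfer and the DtH transfer after the kernel, while the transfer of `b` has no predecessor.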

To maximize resource utilization, the device helper thread creates several partitions of the Xeon Phi card, with the cores evenly distributed between partitions. Tasks are then assigned to partitions in a round-robin fashion. In addition, several streams are created per partition, so that several operations can overlap in each partition.
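The distribution described above can be sketched with three small functions. All names are hypothetical, and two details are assumptions rather than facts from the text: leftover cores go to the first partitions when the core count is not a multiple of the partition count, and the streams of a partition are also rotated in round-robin order.

```c
#include <assert.h>

/* Cores given to partition `p` when `cores` are spread as evenly as possible
 * over `partitions`. Assumption: the first (cores % partitions) partitions
 * receive one extra core. */
int cores_in_partition(int cores, int partitions, int p)
{
    assert(partitions > 0 && cores >= 0 && p >= 0 && p < partitions);
    return cores / partitions + (p < cores % partitions ? 1 : 0);
}

/* Round-robin assignment: the k-th task goes to partition k mod partitions. */
int partition_for_task(int task, int partitions)
{
    assert(partitions > 0 && task >= 0);
    return task % partitions;
}

/* Assumption: within a partition, successive tasks rotate over its streams. */
int stream_for_task(int task, int partitions, int streams_per_partition)
{
    assert(partitions > 0 && streams_per_partition > 0 && task >= 0);
    return (task / partitions) % streams_per_partition;
}
```

With 4 partitions and 2 streams per partition, for example, task 5 lands on partition 1, stream 1, and task 9 lands on partition 1, stream 0.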

All device operations are issued asynchronously and an hStreams event is associated with each operation. The hStreams asynchronous API functions receive a pointer to an hStreams event as one of their parameters, so the event is automatically associated with its API call. Like CUDA, the library offers calls to either wait for event completion or query its state. The Xeon Phi helper thread always uses the query method to avoid blocking.

Figure 4.19 shows how asynchronous operations are issued for a task: the input data transfers and kernel execution are launched one after the other in the same stream. Unlike CUDA, the hStreams runtime creates and initializes the events automatically for each operation. The events associated with input data transfers are still needed to notify the OmpSs software cache of their completion. Finally, when the kernel launch is completed, the task's output data transfers are issued. For simplicity, the complete event-driven flow is shown only for one task, issued to Stream #1, but the operations of other tasks (shaded boxes) can be handled simultaneously in other streams (in this example, Stream #2 and Stream #3).

Figure 4.20 illustrates the task execution flow of four tasks t1, t2, t3 and t4. The active stages of the tasks are issued one after the other. According to the hStreams specification, the data transfers happen simultaneously from the programmer's point of view. However, it is not clear how the DMA transfers are programmed in the hardware. Assuming that the Xeon Phi device is divided into four partitions, each task runs on a different partition, so they can run in parallel on the same device. Once the tasks have executed, their output data transfers are issued.

[Timeline diagram: along a time axis, the cycles of tasks t1 to t4, each consisting of an HtD transfer, a Run stage and a DtH transfer.]

Figure 4.20: Task execution flow on a Xeon Phi device

In document Programming models and scheduling techniques for heterogeneous architectures (Page 71-74)