Launch Promotion - VERTICAL AGGREGATION - Techniques for optimizing dynamic parallelism on grap

CHAPTER 4 VERTICAL AGGREGATION

4.1 Launch Promotion

Promotion is the first step in the transformation whereby recursive kernel calls are hoisted to the beginning of the kernel body. Figure 4.3 illustrates the transformation that takes place when promotion is applied to the example in Figure 4.2. In the transformed code, the kernel call is moved to the beginning

of the kernel before the prologue. However, to preserve correct ordering in between the parent and child, code is inserted to release a flag that the child will acquire to know when it is safe to execute. In all but the first kernel instance (which is launched from the host), code is also inserted before the prologue to acquire the release from the parent. The main advantage of this step is that the launch overhead is removed from the critical path and replaced with a more lightweight release acquire chain. The launch is thus set up by the device runtime and the child thread blocks are scheduled on the SMs while the parent thread block executes the prologue and before it reaches the release operation.

Figure 4.4 shows in pseudocode the code transformation that takes place to achieve kernel launch promotion. The target pattern of the original kernel is shown in Figure 4.4(a). The pattern is such that the kernel executes some prologue code, then selects a single launcher thread to perform a kernel launch, then executes an epilogue code. The kernel launch must be configured to have a single thread block. Moreover, the parameters/arguments of the kernel can be classified into two categories. The first category is the available parameters/arguments which are those that are available at the beginning of the kernel before the prologue. The second category is the postponed parameters/arguments which are those that are not available until after the prologue.

The kernel in Figure 4.4(a) is transformed to two different versions, one called from the host and one called recursively from the device, shown in Figures 4.4(b) and (c) respectively. In the kernel called from the host in Figure 4.4(b), the transformation is as follows. First, code is inserted before the prologue for the launcher thread to perform the launch prematurely. The launcher thread starts by allocating buffers (line 10) to be passed to the child grid where the postponed arguments will be stored later once they become available. The launcher thread also allocates a flag that is used by the parent to communicate with the child to release it when its data is ready (line 11). The child is then launched prematurely (lines 12-13) with the available arguments kept as they are and the postponed arguments replaced with their corresponding buffers. Performing a premature launch requires that any launch condition (not shown in the figure) also be promotable to the beginning of the kernel before the epilogue. In cases where the condition is not promotable because it is dependent on the prologue, the launch can be

01 __global__ void kernel(paramsavail, paramspost) {

02 prologue

03 if(launcher thread) {

04 kernel<<<1,nThreads>>>(argsavail, argspost)

05 }

06 epilogue

07 }

(a) Original Kernel

08 __global__ void kernel(paramsavail, paramspost) {

09 if(launcher thread) {

10 allocate postponed arg buffers

11 allocate child flag

12 kernel_from_kernel<<<1,nThreads>>>

13 (argsavail, postponed arg buffers, child flag)

14 }

15 prologue

16 if(launcher thread) {

17 store args_postin postponed arg buffers

18 memory fence

19 set child flag to release child

20 }

21 epilogue

22 }

(b) Transformed Kernel (called from host)

23 __global__ void kernel_from_kernel(paramsavail,

24 postponed param buffers, flag) {

25 if(launcher thread) {

26 allocate postponed arg buffers

27 allocate child flag

28 kernel_from_kernel<<<1,nThreads>>>

29 (args_avail, postponed arg buffers, child flag)

30 }

31 wait to acquire flag

32 load paramspostfrom postponed param buffers

33 prologue

34 if(launcher thread) {

35 store argspostin postponed arg buffers

36 memory fence

37 set flag to release child

38 }

39 epilogue

40 }

argspost : arguments whose values are “postponed” because they

are not available at the beginning of the kernel paramsavail : parameters corresponding to available arguments

params_post : parameters corresponding to postponed arguments

performed speculatively and an abort signal can be sent instead of a release signal to abort trailing launches.

After the child grid is launched prematurely, the prologue is executed normally. At the end of the prologue, the values of the postponed arguments become available so the launcher thread stores these variables in the postponed argument buffers previously allocated (line 17). A memory fence is then executed to ensure that these values become visible to the child grid before it is released (line 18). Finally, the flag is set to be released (line 19) which signals to the child that it is safe to start executing. The release is implemented in CUDA as a non-cached store while the acquire is implemented as a polling loop with a volatile load. In OpenCL 2.0 [25], it is also possible to use built-in support for release and acquire. After the child is released, the execution of the epilogue continues normally.

The second version of the transformed kernel is the one intended to be launched from the device. The transformation is shown in Figure 4.4(c). The transformation is largely similar to that in Figure 4.4(b) with a few essential differences. One difference is that postponed parameters are replaced with parameter buffers and a flag parameter is also added to the parameter list (line 24). Another difference is that between the premature child launch and the prologue, a wait is inserted to make sure that a child thread block waits for its parent thread block to release it before executing its prologue (lines 31). After the thread block is released, it also loads the postponed parameters from the parameter buffers.

In document Techniques for optimizing dynamic parallelism on graphics processing units (Page 40-43)