8.2 WFVOpenCL
8.2.2 Runtime Callbacks
OpenCL allows the user to organize instances in multiple dimensions (each instance is identified by an n-tuple of identifiers for n dimensions). Given a kernel and a global number of instances $N_0 \times \dots \times N_{n-1}$ organized in an n-dimensional grid with work groups of size $G_0 \times \dots \times G_{n-1}$, the driver is responsible for calling the kernel $N_0 \times \dots \times N_{n-1}$ times and for making sure that calls to get_global_id etc. return the appropriate identifiers of the requested dimension. The most natural iteration scheme for this employs nested "outer" loops that iterate over the work groups of each dimension ($N_0/G_0, \dots, N_{n-1}/G_{n-1}$) and nested "inner" loops that iterate over the instances of each work group ($G_0, \dots, G_{n-1}$). Listing 8.1 shows pseudo-code of this iteration scheme for two dimensions.
If the application uses more than one dimension for its input data, the driver has to choose one SIMD dimension for vectorization. This means that only queries for instance identifiers of this dimension return a vector; queries for other dimensions return a single identifier. Because it is the natural choice for the kernels we have analyzed so far, our driver currently always uses the first dimension. However, it would be easy to implement a heuristic that chooses the best dimension, e.g. by comparing
Listing 8.1 Pseudo-code implementation of clEnqueueNDRangeKernel and the kernel wrapper before inlining and optimization (2D case, W = 4). The outer loops iterate the number of work groups, which can easily be parallelized across multiple threads. The inner loops iterate all instances of a work group (step size 4 for the SIMD dimension 0).
    cl_int clEnqueueNDRangeKernel(Kernel kernelWrapper, TA arg_struct,
                                  int* globalSizes, int* localSizes) {
        int iter_0 = globalSizes[0] / localSizes[0];
        int iter_1 = globalSizes[1] / localSizes[1];
        for (int i=0; i<iter_0; ++i) {
            for (int j=0; j<iter_1; ++j) {
                int groupIDs[2] = { i, j };
                kernelWrapper(arg_struct, groupIDs, globalSizes, localSizes);
            }
        }
    }

    void kernelWrapper(TA arg_struct, int* groupIDs,
                       int* globalSizes, int* localSizes) {
        T0 param0 = arg_struct.p0;
        ...
        TN paramN = arg_struct.pN;
        int base0 = groupIDs[0] * localSizes[0];
        int base1 = groupIDs[1] * localSizes[1];
        __m128i base0V = <base0, base0, base0, base0>;
        for (int i=0; i<localSizes[1]; ++i) {
            int lid1 = i;
            int tid1 = base1 + lid1;
            for (int j=0; j<localSizes[0]; j+=4) {
                __m128i lid0 = <j, j+1, j+2, j+3>;
                __m128i tid0 = base0V + lid0;
                simdKernel(param0, ..., paramN,
                           lid0, lid1, tid0, tid1,
                           groupIDs, globalSizes, localSizes);
            }
        }
    }
the number of memory operations that can be vectorized in either case, leveraging information from the Vectorization Analysis. The counter of the inner loop that iterates over the dimension chosen for vectorization is incremented by the SIMD width W in each iteration, as depicted in Listing 8.1.
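The effect of choosing dimension 0 as the SIMD dimension can be sketched in plain C. The following is only an illustrative emulation of the inner loops of Listing 8.1, not the driver's actual code: the names IdVector, run_work_group, run_grid and the visits array are hypothetical, a small struct stands in for __m128i, and the work group size in dimension 0 is assumed to be a multiple of W.

```c
#define W 4  /* SIMD width, as in Listing 8.1 */

/* Hypothetical stand-in for a vector of W instance identifiers
   (Listing 8.1 uses __m128i for this purpose). */
typedef struct { int lane[W]; } IdVector;

/* Executes all instances of one work group of a 2D grid and returns the
   number of (emulated) SIMD kernel calls. Dimension 0 is the SIMD
   dimension: its global identifier is a vector of W consecutive values,
   whereas the global identifier of dimension 1 is a single value.
   visits is a row-major array of global[1] x global[0] counters.
   Assumption: local[0] is a multiple of W. */
static int run_work_group(const int group[2], const int local[2],
                          const int global[2], int *visits)
{
    int calls = 0;
    int base0 = group[0] * local[0];
    int base1 = group[1] * local[1];
    for (int i = 0; i < local[1]; ++i) {
        int tid1 = base1 + i;                  /* scalar id, dimension 1 */
        for (int j = 0; j < local[0]; j += W) {
            IdVector tid0;                     /* vector id, dimension 0 */
            for (int l = 0; l < W; ++l)
                tid0.lane[l] = base0 + j + l;
            /* Emulated kernel body: one call handles W instances. */
            for (int l = 0; l < W; ++l)
                visits[tid1 * global[0] + tid0.lane[l]] += 1;
            ++calls;
        }
    }
    return calls;
}

/* Outer loops over all work groups; returns the total number of calls. */
static int run_grid(const int global[2], const int local[2], int *visits)
{
    int calls = 0;
    for (int g0 = 0; g0 < global[0] / local[0]; ++g0) {
        for (int g1 = 0; g1 < global[1] / local[1]; ++g1) {
            int group[2] = { g0, g1 };
            calls += run_work_group(group, local, global, visits);
        }
    }
    return calls;
}
```

For a grid of 8 x 4 instances with work groups of size 4 x 2, run_grid performs 8 calls instead of 32, and every instance is visited exactly once.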
We automatically generate a wrapper around the original kernel. This wrapper contains the inner loops, while only the outer loops are implemented directly in the driver (to allow multi-threading, e.g. via OpenMP). This makes it possible to remove all overhead of the callback functions: all these calls query information that is either statically fixed (e.g. get_global_size, which returns the total number of instances of the requested dimension) or depends only on the state of the inner loops' iteration (e.g. for one dimension, get_global_id is the work group size multiplied by the work group identifier plus the local identifier of the instance within its work group). The static values are supplied as arguments to the wrapper; the others are computed directly in the inner loops. After the original kernel has been inlined into the wrapper, each callback to the driver can be replaced by a direct access to the corresponding value. Generating the inner loops "behind" the driver-kernel barrier also exposes additional optimization potential of the kernel code together with the surrounding loops and the callback values. For example, loop-invariant code motion moves computations that depend only on work group identifiers out of the innermost loop. This would not be possible if those loops were implemented statically in the driver instead of being generated at compile time of the kernel.
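The replacement of callbacks by direct values can be illustrated with a small scalar sketch in C. It is written under several assumptions and is not the generated wrapper itself: DriverState, wrapper_with_callbacks and wrapper_direct are hypothetical names, the kernel body (storing 10*y + x) is made up for the example, vectorization is omitted for brevity, and the hoisting that a compiler would perform is written out by hand.

```c
/* Hypothetical driver state that a callback would have to consult. */
typedef struct {
    int group_id[2];
    int local_id[2];
    int local_size[2];
    int global_size[2];
} DriverState;

/* Callback-style query:
   global id = work group size * work group id + local id. */
static int get_global_id(const DriverState *s, int dim)
{
    return s->local_size[dim] * s->group_id[dim] + s->local_id[dim];
}

/* Variant 1: the kernel body queries the driver state for every instance. */
static void wrapper_with_callbacks(DriverState *s, int *out)
{
    for (s->local_id[1] = 0; s->local_id[1] < s->local_size[1]; ++s->local_id[1]) {
        for (s->local_id[0] = 0; s->local_id[0] < s->local_size[0]; ++s->local_id[0]) {
            int x = get_global_id(s, 0);
            int y = get_global_id(s, 1);
            out[y * s->global_size[0] + x] = 10 * y + x;
        }
    }
}

/* Variant 2: static values arrive as arguments and the identifiers are
   computed directly from the loop counters. Everything that depends only
   on the work group (base0, base1) or on the outer inner loop (y, row)
   is hoisted out of the innermost loop by hand here. */
static void wrapper_direct(const int group_id[2], const int local_size[2],
                           const int global_size[2], int *out)
{
    int base0 = group_id[0] * local_size[0];
    int base1 = group_id[1] * local_size[1];
    for (int i = 0; i < local_size[1]; ++i) {
        int y = base1 + i;
        int row = y * global_size[0];   /* invariant in the innermost loop */
        for (int j = 0; j < local_size[0]; ++j) {
            int x = base0 + j;
            out[row + x] = 10 * y + x;
        }
    }
}
```

Both variants write the same values for every work group; the second one needs no calls into the driver and performs less work per iteration of the innermost loop.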