Generating CUDA-C shim - Linking Scheme code to data-parallel CUDA-C code

To generate the CUDA-C shim, our implementation requires the kernel name and the actual kernel arguments. Kernel parameters carry type information in their names, so it is easy to identify and generate the C-types of the kernel arguments. For example, the kernel parameter u32v_src has the prefix u32v, which identifies its C-type as uint32_t*. Here, u32 identifies the C-type uint32_t and v specifies that it is a C-pointer. Our implementation generates the CUDA-C shim vector_addition_cu_driver for the kernel skeleton vector_addition on lines 7–33 of Listing 4.6. First, it generates a forward declaration for the kernel vector_addition on lines 2–3. The name of the kernel is extracted from the parse tree nodes. Then the parameter list is generated. In the parameter list, the C-type for the kernel parameter u32_constant is uint32_t, and the C-type for u32v_src is uint32_t*. The parameters that take the length of a vector are also generated. In this example, u32v_src_len is generated by appending the suffix _len to the actual name u32v_src. Its default type, int, is also generated.
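The prefix decoding described above can be sketched in plain C. This is only an illustration, not the generator used by our implementation: the helper names decode_param and emit_kernel_param are hypothetical, the sketch assumes the type prefix is everything before the first underscore, and it handles only the u32 and u32v prefixes mentioned in the text.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decode the type prefix of a kernel parameter name.
   Assumption: the prefix is everything before the first underscore.
   Only u32 (scalar) and u32v (vector) are handled, as in the text. */
int decode_param(const char *name, const char **c_type, int *is_vector)
{
    const char *underscore = strchr(name, '_');
    if (underscore == NULL)
        return 0;                          /* no type prefix found */
    size_t prefix_len = (size_t)(underscore - name);

    if (prefix_len == 3 && strncmp(name, "u32", 3) == 0) {
        *c_type = "uint32_t";
        *is_vector = 0;
        return 1;
    }
    if (prefix_len == 4 && strncmp(name, "u32v", 4) == 0) {
        *c_type = "uint32_t*";             /* v marks a C-pointer */
        *is_vector = 1;
        return 1;
    }
    return 0;                              /* prefix not covered here */
}

/* Hypothetical helper: write the forward-declaration entry for one
   kernel parameter. A vector also gets an int <name>_len parameter. */
int emit_kernel_param(const char *name, char *out, size_t out_size)
{
    const char *c_type;
    int is_vector;
    if (!decode_param(name, &c_type, &is_vector))
        return 0;
    if (is_vector)
        snprintf(out, out_size, "%s %s, int %s_len", c_type, name, name);
    else
        snprintf(out, out_size, "%s %s", c_type, name);
    return 1;
}
```

Called on u32_constant and u32v_src in turn, emit_kernel_param yields the two entries that make up the parameter list of the forward declaration in Listing 4.6.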

On line 7, our implementation encloses the CUDA-C shim in extern "C" because it is linked from a .c file containing the compiled host program in C. It generates the name vector_addition_cu_driver by adding the suffix _cu_driver to the kernel name vector_addition. Then it generates the parameter list. Seven parameters are generated on lines 8–9 to take the execution configuration values and the size of the dynamic shared memory. Then the actual kernel parameters are generated, and for each vector an extra parameter is generated to take the length of that vector. For a vector, h_ is added at the beginning of its name on line 10 to distinguish it from its device memory pointer. The name of the parameter taking the length of a vector is also generated by adding h_ at the beginning of the vector’s name and _len at the end; the length of the vector u32v_src is named h_u32v_src_len on line 11. The scalar kernel parameter u32_constant on line 10 remains unchanged in the parameter list of the CUDA-C shim. This is because a scalar is passed to a CUDA-C kernel by value, so it does not need a counterpart in device memory.

 1  //---kernel forward declaration----
 2  __global__ void vector_addition(uint32_t u32_constant, uint32_t* u32v_src,
 3          int u32v_src_len);
 4
 5  //---Kernel forward declaration----
 6  //---CUDA-C shim Start----
 7  extern "C" {
 8  void vector_addition_cu_driver(int gDx, int gDy, int gDz, int bDx,
 9          int bDy, int bDz, int shared_size,
10          uint32_t u32_constant, uint32_t* h_u32v_src,
11          int h_u32v_src_len){
12      // device pointers
13      uint32_t* d_u32v_src;
14      // calculating the size of device memory
15      size_t size_u32v_src = h_u32v_src_len * sizeof(uint32_t);
16      // allocating device memory
17      cudaMalloc((void**) &d_u32v_src, size_u32v_src);
18      // copying host to device
19      cudaMemcpy(d_u32v_src, h_u32v_src, size_u32v_src, cudaMemcpyHostToDevice);
20      // defining Grid configuration
21      dim3 dimGrid(gDx, gDy, gDz);
22      // defining Block configuration
23      dim3 dimBlock(bDx, bDy, bDz);
24      size_t size = shared_size;
25      // Now, at this point it calls the kernel
26      vector_addition<<<dimGrid, dimBlock, size>>>(u32_constant, d_u32v_src,
27              h_u32v_src_len);
28      // copying device to host
29      cudaMemcpy(h_u32v_src, d_u32v_src, size_u32v_src, cudaMemcpyDeviceToHost);
30      // deallocation of device memory
31      cudaFree(d_u32v_src);
32  }
33  }
34  //---CUDA-C shim End----
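The naming rules for the shim and its host-side parameters amount to simple string concatenation. The following C sketch shows them under that assumption; the function names shim_name, host_vector_name, and host_length_name are hypothetical and are not taken from our implementation.

```c
#include <stdio.h>

/* Hypothetical helper: shim name = kernel name + "_cu_driver". */
void shim_name(const char *kernel, char *out, size_t out_size)
{
    snprintf(out, out_size, "%s_cu_driver", kernel);
}

/* Hypothetical helper: host pointer parameter = "h_" + vector name. */
void host_vector_name(const char *vec, char *out, size_t out_size)
{
    snprintf(out, out_size, "h_%s", vec);
}

/* Hypothetical helper: host length parameter = "h_" + vector name + "_len". */
void host_length_name(const char *vec, char *out, size_t out_size)
{
    snprintf(out, out_size, "h_%s_len", vec);
}
```

For the kernel vector_addition and the vector u32v_src, these helpers reproduce the identifiers on lines 8, 10, and 11 of Listing 4.6. A scalar such as u32_constant would bypass them and keep its name.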

The memory operations on lines 17, 19, 29, and 31 are generated from the vector’s name, with prefixes added to distinguish the device, host, and size variables. Since u32v_src has neither OUT nor IN in its name, line 19 transfers this vector from host to device memory before the kernel call (lines 26–27) and line 29 transfers it from device to host memory after the kernel call. Note that our implementation will not generate line 19 if u32v_src has an OUT notation. Similarly, line 29 will not be generated if an IN notation is present in u32v_src. Our implementation generates the device pointer on line 13 by adding d_ at the beginning of the vector name. The size calculation in bytes is also generated on line 15, with the size variable named by adding size_ at the beginning of the vector name u32v_src.
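The effect of the IN and OUT notations on the generated transfers can be sketched as follows. The exact spelling of the notation inside a parameter name is not shown in this section, so the sketch assumes a plain substring match on IN and OUT; the names plan_transfers and emit_copy_in, and the example name u32v_OUT_dst, are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Which transfers the shim should contain for one vector. */
struct transfer_plan {
    int copy_host_to_device;  /* emit the cudaMemcpy before the kernel call */
    int copy_device_to_host;  /* emit the cudaMemcpy after the kernel call  */
};

/* Assumption: the notation appears as the substring "IN" or "OUT"
   somewhere in the vector's name. */
struct transfer_plan plan_transfers(const char *vec)
{
    struct transfer_plan plan;
    plan.copy_host_to_device = (strstr(vec, "OUT") == NULL);
    plan.copy_device_to_host = (strstr(vec, "IN") == NULL);
    return plan;
}

/* Write the host-to-device copy for vec, or an empty string when the
   plan omits it. Returns 1 if a statement was written, 0 otherwise. */
int emit_copy_in(const char *vec, char *out, size_t out_size)
{
    if (!plan_transfers(vec).copy_host_to_device) {
        if (out_size > 0)
            out[0] = '\0';
        return 0;
    }
    snprintf(out, out_size,
             "cudaMemcpy(d_%s, h_%s, size_%s, cudaMemcpyHostToDevice);",
             vec, vec, vec);
    return 1;
}
```

For u32v_src both flags are set, and emit_copy_in produces the statement on line 19 of Listing 4.6. For a name carrying OUT, such as the hypothetical u32v_OUT_dst, the host-to-device copy is skipped and only the copy back to the host remains.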

The grid and thread block configuration variables on lines 21 and 23 are generated from fixed strings. A kernel call with an execution configuration is generated on lines 26 and 27. The kernel’s name is extracted from the parse tree. The execution configuration is also a fixed string, <<<dimGrid, dimBlock, size>>>. The arguments are generated from the extracted parameters, except that for a vector the device memory pointer is passed to the kernel. Our implementation can distinguish a vector from a scalar by identifying the v in the type prefix of a kernel parameter’s name. The first extracted kernel parameter, u32_constant, is added to the beginning of the argument list without any modification of its name because it is a scalar. The second extracted kernel parameter, u32v_src, is identified as a vector, so the prefix d_ is added to its name and the device memory pointer d_u32v_src is passed to the kernel instead of the vector u32v_src. For a vector, its length is also passed to the kernel. Therefore h_u32v_src_len, representing the length of u32v_src, is added to the argument list.
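The construction of the kernel call can be sketched in C as below. This is an illustration under stated assumptions rather than our generator: the names is_vector_param and emit_kernel_call are hypothetical, and a vector is recognised by a v immediately before the first underscore of the parameter name.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: the type prefix ends at the first underscore, and a
   trailing v in that prefix marks a vector. */
int is_vector_param(const char *name)
{
    const char *underscore = strchr(name, '_');
    return underscore != NULL && underscore > name && underscore[-1] == 'v';
}

/* Hypothetical helper: build the kernel call statement. Scalars are
   passed unchanged; a vector contributes d_<name> and h_<name>_len.
   Returns 1 on success, 0 if out is too small. */
int emit_kernel_call(const char *kernel, const char *const *params,
                     size_t n, char *out, size_t out_size)
{
    size_t used;
    int w = snprintf(out, out_size,
                     "%s<<<dimGrid, dimBlock, size>>>(", kernel);
    if (w < 0 || (size_t)w >= out_size)
        return 0;
    used = (size_t)w;

    for (size_t i = 0; i < n; i++) {
        const char *sep = (i == 0) ? "" : ", ";
        if (is_vector_param(params[i]))
            w = snprintf(out + used, out_size - used,
                         "%sd_%s, h_%s_len", sep, params[i], params[i]);
        else
            w = snprintf(out + used, out_size - used,
                         "%s%s", sep, params[i]);
        if (w < 0 || (size_t)w >= out_size - used)
            return 0;
        used += (size_t)w;
    }

    w = snprintf(out + used, out_size - used, ");");
    return w >= 0 && (size_t)w < out_size - used;
}
```

Given the kernel name vector_addition and the parameters u32_constant and u32v_src, the sketch builds the same call as lines 26–27 of Listing 4.6, written on a single line.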

This CUDA-C shim in Listing 4.6 is generated in a file named maingpu.cu because the kernel skeleton vector_addition is in file main.scm.
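The file naming can be illustrated with a short C sketch. The rule used here, replacing the .scm extension with gpu.cu, is an assumption inferred from the single main.scm example above, and the function name shim_file_name is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: derive the shim file name from the Scheme file
   name. Assumed rule: "<base>.scm" becomes "<base>gpu.cu".
   Returns 1 on success, 0 if the input does not end in .scm or the
   output buffer is too small. */
int shim_file_name(const char *scheme_file, char *out, size_t out_size)
{
    const char *ext = ".scm";
    size_t len = strlen(scheme_file);
    size_t ext_len = strlen(ext);

    if (len <= ext_len || strcmp(scheme_file + len - ext_len, ext) != 0)
        return 0;

    int w = snprintf(out, out_size, "%.*sgpu.cu",
                     (int)(len - ext_len), scheme_file);
    return w >= 0 && (size_t)w < out_size;
}
```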
