Scheme is a dynamically typed language. Therefore, the Gambit compiler does not check type compatibility at compile time, nor does it have a type-inference system that can automatically deduce the type of an expression. In this thesis, we link Scheme code to CUDA-C kernels, and CUDA-C is a statically typed language. We must therefore supply type information for kernel parameters to Gambit so that it can generate appropriate types for them in both the Scheme shim and the CUDA-C shim that link a CUDA-C kernel. To obtain type information from Scheme, we follow a strict naming convention with constant prefixes for kernel parameters. The Gambit compiler can extract type information for kernel parameters from their names, check them, and generate the appropriate type information to be used by nvcc.
In our implementation, constant prefixes in kernel parameter names help programmers identify types easily. They also give programmers a standard to follow, so that others can easily read and understand the type information. The prefixes further help programmers anticipate the operations generated in both the Scheme and CUDA-C shims, as those operations depend strictly on types.
In our implementation, a vector passed as a kernel argument goes through the generated foreign-function interface to a CUDA-C kernel. In the Scheme shim, a pointer-casting operation is generated to extract the C pointer of that vector. This pointer-casting operation depends on the C type of the vector. In the CUDA-C shim, operations such as the declaration of a device memory pointer, the allocation of space in device memory, and the transfers between host and device memory are also generated for that vector. All of these operations depend strictly on the vector type extracted from the constant prefix attached to the vector's name.
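The following is a minimal sketch, in C, of how type information could be recovered from a prefixed parameter name. It is not the Gambit implementation. The names `param_type`, `prefix_table`, and `parse_param_prefix` are hypothetical, and only the prefixes `u32_` and `u32v_` appear in this chapter; the remaining table entries are assumed by analogy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Type information recovered from a kernel parameter name (hypothetical). */
typedef struct {
    const char *c_type;   /* C element type to emit in the shims */
    size_t elem_size;     /* size of one element in bytes */
    int is_vector;        /* 1 for a vector parameter, 0 for a scalar */
} param_type;

/* Assumed prefix table: only the u32 entry is confirmed by the text. */
static const struct {
    const char *tag;
    const char *c_type;
    size_t size;
} prefix_table[] = {
    { "u8",  "uint8_t",  sizeof(uint8_t)  },
    { "u32", "uint32_t", sizeof(uint32_t) },
    { "u64", "uint64_t", sizeof(uint64_t) },
    { "s32", "int32_t",  sizeof(int32_t)  },
};

/* Parse "<tag>_name" (scalar) or "<tag>v_name" (vector).
   Returns 1 and fills *out on success, 0 if the prefix is not recognised. */
int parse_param_prefix(const char *name, param_type *out)
{
    size_t n = sizeof prefix_table / sizeof prefix_table[0];
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(prefix_table[i].tag);
        if (strncmp(name, prefix_table[i].tag, len) != 0)
            continue;
        const char *rest = name + len;
        int is_vector = (rest[0] == 'v');
        if (is_vector)
            rest++;
        /* The tag must be followed by '_' and a non-empty name. */
        if (rest[0] != '_' || rest[1] == '\0')
            continue;
        out->c_type = prefix_table[i].c_type;
        out->elem_size = prefix_table[i].size;
        out->is_vector = is_vector;
        return 1;
    }
    return 0;
}
```

Under these assumptions, `u32v_src` yields a vector of `uint32_t` with 4-byte elements, `u32_constant` yields a `uint32_t` scalar, and an unprefixed name such as `src` is rejected. The element type and size are exactly what the pointer cast, the device allocation, and the host-device transfers need.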
There are further reasons to use constant prefixes for kernel parameter names in our implementation. First, suppose we want to pass a Scheme vector of 64-bit unsigned integers to a kernel. The vector must pass through the generated foreign-function interface of that kernel, and every operation on the vector in that interface must be compatible with its 64-bit unsigned-integer element type. Suppose that Gambit, lacking information about the element size in bits, generated code for 32-bit signed integers instead. The vector may then contain values that are out of range for 32-bit signed integers, and integer overflow may occur. Second, specifying the element size saves device memory. For example, if we pass a vector of ten 8-bit unsigned integers to a CUDA-C kernel, Gambit should generate code that allocates ten 8-bit unsigned integers in device memory. Without size information for the elements, Gambit would instead generate code that allocates 32-bit integers: 40 bytes rather than 10, so 30 bytes of device memory would go unused. It is therefore preferable to generate allocation code with the correct element size to avoid unnecessary device memory consumption. In addition, transferring 40 bytes between host and device memory takes longer than transferring 10 bytes, which adds overhead to the overall execution time.
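The arithmetic behind both arguments can be checked with a short C sketch. The helper names `device_bytes`, `wasted_bytes`, and `fits_in_int32` are hypothetical and serve only to illustrate the reasoning; they are not part of Gambit or of the generated shims.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed in device memory for `count` elements of `elem_size` bytes. */
size_t device_bytes(size_t count, size_t elem_size)
{
    return count * elem_size;
}

/* Bytes left unused when elements are allocated wider than necessary. */
size_t wasted_bytes(size_t count, size_t needed_size, size_t allocated_size)
{
    return device_bytes(count, allocated_size) - device_bytes(count, needed_size);
}

/* Returns 1 if a 64-bit unsigned value is representable as a
   32-bit signed integer, 0 otherwise. */
int fits_in_int32(uint64_t value)
{
    return value <= (uint64_t)INT32_MAX;
}
```

For ten elements, `device_bytes(10, sizeof(uint8_t))` is 10 and `device_bytes(10, sizeof(uint32_t))` is 40, so `wasted_bytes(10, 1, 4)` is 30. Likewise, a value such as 2^32 is a valid 64-bit unsigned integer but fails `fits_in_int32`, which is the overflow case described above.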
Following a strict naming convention for data types also makes it easier for programmers to understand the types and sizes of variables, and constant prefixes are easy to implement in Gambit. However, the convention is inflexible, which programmers may sometimes find irritating.
We could instead implement a type-inference system in Gambit to obtain the types of kernel parameters. Programmers would then not need to use constant prefixes to name kernel parameters. However, implementing a type-inference system is much harder than implementing constant prefixes in Gambit. Therefore, we chose constant prefixes in kernel parameter names as the source of type information.
On lines 2–3 of Listing 4.1 we provide an example of a kernel skeleton binding in Scheme for the kernel vector_addition. The keyword kernel specifies that this is a binding for a kernel. The kernel takes two parameters, u32_constant and u32v_src, named with constant prefixes on line 3. Here, u32_constant is a scalar 32-bit unsigned integer and u32v_src is a vector of 32-bit unsigned integers; the corresponding vector argument is defined on line 6. The kernel skeleton has no expression as its body because our extension to Gambit does not compile the body of a kernel defined in Scheme. Our implementation requires only a skeleton of the kernel in Scheme in order to generate both shims that link a kernel implemented in CUDA-C.
This kernel is called with its execution configuration on lines 12–14, with two arguments (constant and src):
(vector_addition <<< (nblocks) (blockSize) (* (* 2 N) size-int32) >>> constant
src)
In the execution configuration, nblocks specifies the organization of the grid; blockSize specifies the size of the participating thread blocks; and (* (* 2 N) size-int32) determines the size of dynamic shared memory.

 1  ;; Binding for a kernel skeleton
 2  (define vector_addition
 3    (kernel (u32_constant u32v_src)))
 4
 5  (define N 100)
 6  (define src (make-u32vector N 0))
 7  (define constant 786)
 8  (let ((nblocks 1)
 9        (blockSize N)
10        (size-int32 4))
11    ;; Grid-x Block-x shared memory
12    (vector_addition <<< (nblocks) (blockSize) (* (* 2 N) size-int32) >>>
13      constant  ;; scalar type
14      src)      ;; vector type
15    (display src))
Listing 4.1: Binding of a kernel skeleton and a call to that kernel with execution configuration special form
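As a worked check of the execution configuration in Listing 4.1, the sketch below computes the three values in C. The struct `launch_config` and the function `listing_config` are hypothetical names used only for illustration; they do not appear in the generated shims.

```c
#include <stddef.h>

/* Hypothetical record of the three values in the <<< ... >>> form. */
typedef struct {
    unsigned grid_x;       /* number of thread blocks (nblocks) */
    unsigned block_x;      /* threads per block (blockSize) */
    size_t shared_bytes;   /* dynamic shared memory per block, in bytes */
} launch_config;

/* Mirrors the let-bindings of Listing 4.1 for a given N. */
launch_config listing_config(unsigned n)
{
    const size_t size_int32 = 4;   /* size-int32 in the listing */
    launch_config cfg;
    cfg.grid_x = 1;                                  /* nblocks */
    cfg.block_x = n;                                 /* blockSize = N */
    cfg.shared_bytes = (size_t)2 * n * size_int32;   /* (* (* 2 N) size-int32) */
    return cfg;
}
```

With N = 100, as in the listing, this gives one block of 100 threads and 800 bytes of dynamic shared memory.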
Our implementation generates both the Scheme shim and the CUDA-C shim from a kernel skeleton. Therefore, we replace the Scheme shim shown in Listing 3.2 of Chapter 3 (file main.scm) with the kernel skeleton of vector_addition on lines 2–3 of Listing 4.1. From this skeleton, our implementation generates the Scheme shim (Listing 4.2) and the CUDA-C shim (Listing 4.3).
The kernel call on line 7 of Listing 3.1 is likewise replaced by the new special form for calling a kernel in Scheme, shown on lines 12–14 of Listing 4.1. Our implementation converts this special form into an ordinary function call to the Scheme shim, which links to the CUDA-C kernel.