Summary - Linking Scheme code to data-parallel CUDA-C code

In this chapter, we gave details about the GPU programming model. We also described the different types of GPU memory and their features. Then we gave details about the CUDA-C language constructs used to develop data-parallel programs. We also described the linking strategy of a linker. Then we described the data type mapping between Scheme and C, and the C-interfaces that link Scheme code to C code in Gambit.

The primary motivation of our work is to link Scheme code to CUDA-C kernels by generating a foreign-function interface from Scheme that reduces hands-on memory management with reasonable runtime overhead. The next chapter describes the design of the generated interface in Gambit that links Scheme code to CUDA-C kernels.

Chapter 3 Design

Gambit provides special constructs to link C code with Scheme code. These special constructs can be used in a foreign-function interface that links a CUDA-C kernel to Scheme. In this chapter we describe our design for such a foreign-function interface.

At the beginning of this chapter we describe the necessary parts of a foreign-function interface that links a CUDA-C kernel from Scheme. First we describe the parts designed using Scheme constructs, and next we describe the parts of the foreign-function interface designed using CUDA-C constructs. We then provide the complete code of a foreign-function interface in Listings 3.2 and 3.3. Finally, we give an example makefile that shows the command lines needed to compile this foreign-function interface through Gambit.

3.1 A foreign-function interface for linking Scheme code to a CUDA-C kernel

In this thesis we use both Scheme and CUDA-C constructs to design our foreign-function interface. Gambit provides C-interfaces such as c-lambda and c-declare to link C code from Scheme.

The parts of the foreign-function interface designed with Scheme constructs (referred to collectively as the Scheme shim) reside in a file with a .scm extension. The Scheme shim links a function call in Scheme to the CUDA-C constructs of this foreign-function interface. These constructs include device-memory operations and the kernel call with its execution configuration.

Those parts of the foreign-function interface designed with CUDA-C constructs — referred to collectively as the CUDA-C shim — reside in a file with a .cu extension. The CUDA-C shim connects the Scheme shim to a CUDA-C kernel. We separate the Scheme constructs from the CUDA-C constructs into two different files, because Gambit forbids some standard C constructs in C-interface constructs.

In our approach, a function call in Scheme reaches a CUDA-C kernel through both shims of this foreign-function interface. Therefore, the kernel arguments of a function call in Scheme linking to a CUDA-C kernel are passed through both shims to that CUDA-C kernel.

A CUDA-C kernel must be called from a C host function, and the call to that kernel must have an execution configuration. Scheme does not have any construct to define the execution configuration. Therefore, we pass the parameters defining an execution configuration for a CUDA-C kernel through the Scheme shim to the CUDA-C shim, and the kernel is called from the CUDA-C shim. In this foreign-function interface, the first seven arguments of a function call in Scheme linking to a CUDA-C kernel represent execution configuration parameters. The first three of them define the x-, y- and z-dimensions of a grid; the next three define the x-, y- and z-dimensions of a thread block; and the seventh argument defines the amount of dynamic shared memory needed by that CUDA-C kernel. These seven arguments are not passed directly to a CUDA-C kernel. Instead, the CUDA-C shim uses them to define an execution configuration as part of a kernel call in CUDA-C. They are converted to C types in the Scheme shim because gsc requires that Scheme-to-C data type conversions happen in C code defined with C-interfaces.
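As a minimal sketch of how the seven values might become an execution configuration inside the CUDA-C shim, consider the following. The struct LaunchConfig and the helper make_config are hypothetical names introduced here for illustration and do not come from the thesis listings; dim3 is CUDA's built-in vector type for grid and block dimensions.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical holder for the seven configuration values (illustration only).
struct LaunchConfig {
    dim3   grid;          // x-, y-, z-dimensions of the grid
    dim3   block;         // x-, y-, z-dimensions of a thread block
    size_t shared_bytes;  // dynamic shared memory per block, in bytes
};

// Packs seven plain C ints, as they might arrive from the Scheme shim,
// into the types that the <<<...>>> launch syntax expects.
static LaunchConfig make_config(int gx, int gy, int gz,
                                int bx, int by, int bz,
                                int shmem)
{
    LaunchConfig cfg;
    cfg.grid         = dim3(gx, gy, gz);
    cfg.block        = dim3(bx, by, bz);
    cfg.shared_bytes = (size_t)shmem;
    return cfg;
}

// At a launch site (some_kernel is a placeholder name):
//   LaunchConfig cfg = make_config(gx, gy, gz, bx, by, bz, shmem);
//   some_kernel<<<cfg.grid, cfg.block, cfg.shared_bytes>>>(/* kernel args */);
```

Note that none of the seven values appears in the kernel's own parameter list; they only shape the launch.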

The rest of the arguments of a function call in Scheme are the actual kernel arguments. They are passed through the Scheme shim and the CUDA-C shim of this foreign-function interface to a CUDA-C kernel. Note that these actual kernel arguments are defined in Scheme; this means they reside in host memory. They need to be transferred to device memory for data-parallel computation by a kernel, because a CUDA-C kernel can only access device memory. Furthermore, because they are Scheme values, they are also converted to C types. In CUDA-C, it is the copy of a vector in device memory that is actually passed to the CUDA-C kernel. Moreover, allocating space in device memory for each vector requires the length of that vector, and lengths are only available to Scheme. Therefore, the lengths of Scheme vectors are calculated in the Scheme shim and passed to the CUDA-C shim to allocate space in device memory.

In many parallel applications the C code requires the vector length. Therefore we also pass lengths of Scheme vectors to a CUDA-C kernel. The benefit of this is that programmers don’t need to send lengths of vectors to a kernel manually. However, in cases where vector lengths may not be used for data-parallel computation in a kernel, these extra arguments are superfluous.
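To illustrate why a kernel typically needs the vector length, here is a small sketch. The kernel scale_kernel and the helper blocks_for are hypothetical names, and the float element type is an assumption; the argument order (scalar, device pointer, length) follows the pattern of the kernel call described in this chapter.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element of p by s.
// l is the length of the vector behind the device pointer p.
__global__ void scale_kernel(float s, float *p, int l)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The grid usually contains more threads than elements,
    // so threads past the end of the vector must do nothing.
    if (i < l)
        p[i] *= s;
}

// Number of blocks needed to cover l elements, rounding up.
static int blocks_for(int l, int threads_per_block)
{
    return (l + threads_per_block - 1) / threads_per_block;
}
```

For a vector of 1000 elements and 256 threads per block, blocks_for gives 4 blocks, that is, 1024 threads; the bounds check against l keeps the last 24 threads from writing past the end of the vector.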

The Scheme shim of this foreign-function interface, which links a function call in Scheme to the CUDA-C shim, has three parts:

1. A vector-length-calculation helper function calculates the length of each Scheme vector passed as an argument to a CUDA-C kernel. It takes all arguments from a function call in Scheme and calls a c-lambda function with all of them. For each vector, it passes one extra argument: the length of that vector.

2. A c-lambda function performs Scheme-to-C data type conversion and calls into the CUDA-C shim with the converted C arguments in host memory. Data defined in Scheme does not have type information, but the CUDA-C shim needs type information for that data. We therefore need to convert the Scheme data to C-typed data.

3. A forward declaration of the CUDA-C shim entry point defined within a c-declare construct. The c-lambda function calls the CUDA-C shim.

The CUDA-C shim of this foreign-function interface links the Scheme shim to a CUDA-C kernel. The CUDA-C shim performs the following operations:

• It declares a device memory pointer for each converted C vector in host memory and allocates space in device memory for each of these pointers, using the supplied vector lengths.

• It transfers each vector from host to device memory.

• It also uses CUDA-C built-in vector data types to define the execution configuration parameters.

• Next, the CUDA-C shim calls a kernel with the execution configuration parameters. It also passes the actual kernel arguments to the CUDA-C kernel. Note that for each vector passed as a kernel argument, the length of that vector is also passed to the kernel.

• After the kernel finishes its execution, the CUDA-C shim copies vectors from device to host memory to return the resultant data.

• Finally, it deallocates device memory occupied by the device memory pointers.
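The operations listed above can be sketched for one scalar and one vector as follows. This is an illustration under stated assumptions, not the code of Listings 3.2 and 3.3: the names add_scalar and add_scalar_shim, the float element type, the int return code, and the extern "C" linkage (assumed here so that C code generated by gsc could call into a file compiled by nvcc) are all hypothetical.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical kernel: adds s to every element of the vector behind p.
__global__ void add_scalar(float s, float *p, int l)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < l)
        p[i] += s;
}

// Hypothetical shim entry point. Takes the seven configuration values,
// one scalar, one host vector v, and the length l of that vector.
// Returns 0 on success and -1 on any CUDA error.
extern "C" int add_scalar_shim(int gx, int gy, int gz,
                               int bx, int by, int bz, int shmem,
                               float s, float *v, int l)
{
    float *p = NULL;                           // device memory pointer
    size_t bytes = (size_t)l * sizeof(float);

    // Allocate device memory using the supplied length.
    if (cudaMalloc((void **)&p, bytes) != cudaSuccess)
        return -1;

    // Transfer the vector from host to device memory.
    if (cudaMemcpy(p, v, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(p);
        return -1;
    }

    // Build the execution configuration and call the kernel,
    // passing the device pointer followed by the vector length.
    dim3 grid(gx, gy, gz);
    dim3 block(bx, by, bz);
    add_scalar<<<grid, block, (size_t)shmem>>>(s, p, l);
    cudaError_t err = cudaGetLastError();      // catches launch errors

    // Copy the result back; this copy waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(v, p, bytes, cudaMemcpyDeviceToHost);

    // Deallocate device memory.
    cudaFree(p);
    return err == cudaSuccess ? 0 : -1;
}
```

Returning an error code is one possible design choice; it lets the Scheme side decide how to report a failed allocation or launch rather than aborting inside the shim.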

In Figure 3.1 we show how our foreign-function interface links a function call in Scheme to a CUDA-C kernel. In this diagram, the blue area represents the Scheme shim, and green represents the CUDA-C shim.

[Figure 3.1 diagram: the Scheme call (kernel cfg s1 s2 v ...) passes through the Scheme shim and the CUDA-C shim to the CUDA-C kernel call kernel <<<cfg>>> (s1, s2, p, l, ...);]

Figure 3.1: Foreign-function interface linking a function call in Scheme to a CUDA-C kernel

(kernel cfg s1 s2 v ...)

Here, kernel is the name of the kernel. The call passes seven execution configuration parameters, which are represented collectively by the first argument cfg. It also passes the scalars s1, s2, ... and the vectors v, ... to the CUDA-C kernel.

Each arrow in this diagram represents a direction of data flow. The following line

kernel <<< cfg >>> (s1, s2, p, l, ...);

is a kernel call in the CUDA-C shim. The name of the kernel is kernel, the same as in Scheme, so that the name stays consistent for programmers. Changing the name in the CUDA-C shim could confuse programmers, because the name in the function call would differ from the name in the actual CUDA-C definition.

The arrow from cfg in the function call in Scheme to cfg in the CUDA-C call represents the seven execution configuration parameters passed through the Scheme shim to the CUDA-C shim. These appear at the kernel call in CUDA-C to define the execution configuration. Similarly, arrows from the scalars s1 and s2 in the function call in Scheme point to s1 and s2 in the CUDA-C kernel call. This signifies that they pass through the Scheme shim to the CUDA-C shim and are then passed to the kernel. Note that the seven execution configuration parameters and the scalars are converted to C types by the Scheme shim's c-lambda function before being passed to the CUDA-C shim, as represented by the yellow dodecagons labeled C.

In this diagram we also show how a vector v is passed to a CUDA-C kernel. The arrow from vector v to the yellow triangle in the Scheme shim denotes that this vector is passed to the vector-length-calculation helper function. From this triangle, one arrow points to the yellow oval in the Scheme shim; this denotes that the vector is passed to the c-lambda function to be converted to a C pointer. The arrow from the yellow oval to the yellow trapezoid in the CUDA-C shim denotes that the C pointer is passed to the CUDA-C shim to be copied to device memory. Copying this vector to device memory also requires its length, so the length of vector v is passed to the CUDA-C shim as well, as denoted by another arrow from the yellow triangle to the same yellow trapezoid. Next, the arrow from the yellow trapezoid to the kernel argument p of the CUDA-C kernel call denotes that, after the copy operation, the device memory pointer p is passed to the kernel. The arrow from the yellow triangle in the Scheme shim to the kernel argument l denotes that the length of vector v is also passed to the kernel. Note that l is the length both of the host vector v and of its device copy pointed to by p.

Finally, the arrow from p to the other yellow trapezoid in the CUDA-C shim denotes that the data pointed to by p is copied back to host memory. The arrow from this yellow trapezoid to vector v represents that, once the data is copied to host memory, it is returned to Scheme.

Note that if a vector is passed to the kernel only to receive the kernel's resultant data, it might not need to be copied to device memory before the kernel call. Similarly, if a vector is passed to the kernel only to serve as an input vector, it might not need to be copied back to host memory after the kernel call.
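One way such unnecessary copies could be skipped is sketched below. The Direction flags and the helpers to_device and from_device are hypothetical and are not part of the interface described in this chapter, which this sketch assumes copies every vector in both directions; the float element type is likewise an assumption.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical per-vector direction flags (illustration only).
enum Direction { DIR_IN = 1, DIR_OUT = 2, DIR_INOUT = 3 };

// Allocates device memory for l floats and copies v to the device
// only if the kernel will read the vector (DIR_IN bit set).
static cudaError_t to_device(float **p, const float *v, int l, int dir)
{
    size_t bytes = (size_t)l * sizeof(float);
    cudaError_t err = cudaMalloc((void **)p, bytes);
    if (err == cudaSuccess && (dir & DIR_IN))
        err = cudaMemcpy(*p, v, bytes, cudaMemcpyHostToDevice);
    return err;
}

// Copies the device data back into v only if the kernel wrote the
// vector (DIR_OUT bit set), then frees the device memory either way.
static cudaError_t from_device(float *v, float *p, int l, int dir)
{
    size_t bytes = (size_t)l * sizeof(float);
    cudaError_t err = cudaSuccess;
    if (dir & DIR_OUT)
        err = cudaMemcpy(v, p, bytes, cudaMemcpyDeviceToHost);
    cudaFree(p);
    return err;
}
```

With these helpers, an input-only vector costs one transfer instead of two, and an output-only vector is allocated on the device without any initial copy.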

In the following sections we describe the parts of the Scheme shim and the CUDA-C shim based on the code provided in Listings 3.2 and 3.3.
