Library functions in Scheme for GPU - Linking Scheme code to data-parallel CUDA-C code

In CUDA, there are library functions for querying and managing devices, controlling kernel executions, timing kernel executions, and managing versions of the CUDA runtime libraries. These library functions are fine-grained, so programmers often need to combine several of them to perform a single task. Therefore, our implementation provides some useful library functions in Scheme for GPUs. These simple library functions reduce the burden of combining multiple CUDA library functions. They also give Scheme programmers the opportunity to use this CUDA-C functionality from Scheme.

4.8.1 Device management

Programmers sometimes need functionality for managing GPU devices. The following library functions help programmers to query devices and select an appropriate device for data-parallel computation.

• (query-device)

The (query-device) library function displays the properties of all devices on standard output. It prints the compute capability, shared memory size per block, total number of registers per block, warp size, maximum threads per block, maximum grid dimension, clock rate, total constant memory, and texture alignment of all active devices. query-device uses the CUDA library function

cudaGetDeviceProperties(struct cudaDeviceProp *prop, int device-id )

to get the properties of a particular device. This function fills the CUDA structure cudaDeviceProp [12] for a valid device-id . Here, device-id is the integer index of an active GPU device. query-device also uses cudaGetDeviceCount(int *count ) [12] to get the number of currently active devices.

• (check-and-set-device device-id )

The (check-and-set-device device-id ) library function checks for a valid device-id . That is, it checks for an actual GPU device rather than the CUDA CPU emulation device (a CPU emulator that can run CUDA-C kernels on a system with no installed GPU). First, it checks device-id against the system device count: if device-id is less than the device count (CUDA numbers devices from 0), it is a valid device. Next, it checks for CPU device emulation: if the major field of the structure cudaDeviceProp (deviceProp.major) is greater than 999, the device is not a GPU device. If the device passes these two tests, the function sets the device for the current execution and reports this on standard output. Here, device-id is an integer.

• (set-device device-id )

The (set-device device-id ) library function does not perform any checking operations, but rather sets the device using the CUDA library function

cudaSetDevice(int device-id )

If the device does not exist, it raises a cudaError_t type exception cudaErrorInvalidDevice .

• (get-device-id)

The (get-device-id) library function returns the integer index of the currently active device. It obtains this index in device-id using the CUDA library function [12]

cudaGetDevice(int *device-id )

• (reset-device device-id )

The (reset-device device-id ) library function resets all state of an active device. It also cleans up any resources allocated by the current program on the device. This function resets the device immediately. The programmer must ensure that the device is not accessed by any other running host thread when this function is called. Note that this function does not check for other running host threads that might try to access the same device. We recommend that Scheme programmers use reset-device at the beginning or at the end of a host program.

• (reset-all-devices)

The (reset-all-devices) library function resets all the devices and destroys all previous resource allocations. We recommend that Scheme programmers also use (reset-all-devices) at the beginning or at the end of a host program.
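The two tests performed by (check-and-set-device device-id ) can be sketched as a small C predicate. This is only a minimal sketch under stated assumptions, not the code of our implementation: the name is_real_gpu and its parameters are hypothetical, the caller is assumed to have obtained device_count from cudaGetDeviceCount and major from the major field of cudaDeviceProp, and the range test assumes CUDA's zero-based device numbering.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of CUDA or of cudalib.cu).
 * device_count: value assumed to come from cudaGetDeviceCount.
 * major:        value assumed to come from cudaDeviceProp.major. */
bool is_real_gpu(int device_id, int device_count, int major)
{
    /* Test 1: the index must name an existing device.
     * Assumption: devices are numbered 0 .. device_count - 1. */
    if (device_id < 0 || device_id >= device_count)
        return false;

    /* Test 2: a major version greater than 999 marks the
     * CPU emulation device rather than a real GPU. */
    if (major > 999)
        return false;

    return true;
}
```

A caller would invoke cudaSetDevice only when this predicate holds; for instance, is_real_gpu(0, 1, 3) accepts device 0 on a single-GPU system, while is_real_gpu(0, 1, 9999) rejects an emulation device.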

4.8.2 Version management

• (driver-version)

The (driver-version) library function reports the version number of the installed CUDA driver application programming interface. It prints the version number to standard output, and prints 0.0 if no driver is installed. This version number describes the features supported by the installed driver and runtime libraries. Programmers can use it to check whether they need to install a newer device driver than the one currently installed.

• (runtime-version)

The (runtime-version) library function prints the version of the installed CUDA runtime libraries to standard output. Note that the CUDA runtime libraries must be compatible with the version of the installed CUDA driver.
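As an illustration of how a version number of the form major.minor can be printed, the following C sketch decodes the integer encoding used by the CUDA runtime, where a version is reported as 1000 × major + 10 × minor (for example by cudaDriverGetVersion and cudaRuntimeGetVersion). That our library functions obtain the number through these calls is an assumption here, and the helper name format_cuda_version is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper (not part of CUDA): writes "major.minor" into buf.
 * version is assumed to use the encoding 1000 * major + 10 * minor.
 * Returns the number of characters snprintf would have written. */
int format_cuda_version(int version, char *buf, size_t len)
{
    int major = version / 1000;        /* e.g. 11020 -> 11 */
    int minor = (version % 1000) / 10; /* e.g. 11020 -> 2  */
    return snprintf(buf, len, "%d.%d", major, minor);
}
```

With this encoding, a reported value of 0 (no driver installed) is formatted as 0.0, which matches the behaviour described for (driver-version).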

4.8.3 Timing function

We implemented a simple timing function (gpu-time kernel call ...) in Scheme to measure the execution time of kernels. This timing function removes the burden of defining, synchronizing, and measuring the time elapsed between separate CUDA events. It takes into account the execution times of the kernels, the Scheme shim, and the CUDA-C shim. This means it also counts the vector transfer times between device and host memory, and the allocation and deallocation of device memory pointers in the CUDA-C shim.

gpu-time displays the kernel execution time in milliseconds on standard output, and also returns the time to the host program for later use. It is similar to the Scheme time macro. For example:

(gpu-time
  (k1 <<< >>> arg ... )
  (k2 <<< >>> arg ... ))

measures the combined execution time of kernels k1 and k2 .

In CUDA-C, programmers need to create two cudaEvent_t [12] type variables to capture the kernel start and end time-stamps. cudaEventCreate [12] creates both events, and cudaEventRecord [12] records one before the kernel call and the other after the kernel call. cudaEventSynchronize [12] then waits for the kernel execution to finish, and cudaEventElapsedTime computes the elapsed time between the two events in milliseconds. In our implementation, the gpu-time function only requires kernel call expressions and is very simple to use.

In order to implement these library functions, our implementation generates two files: cudalib.scm and cudalib.cu. The Scheme file cudalib.scm contains c-lambda and c-declare constructs in order to call C functions in file cudalib.cu. For example, the library function (query-device) calls a c-lambda function query-device in file cudalib.scm, which in turn calls the C function deviceQuery in file cudalib.cu, which contains the CUDA-C code for querying a device.

4.8.4 CUDA library functions unavailable in Scheme

We have implemented some useful library functions in Scheme that enable convenient GPU management for Scheme programmers. However, we do not provide all the CUDA library functions in Scheme. One reason is that many of these library functions are not frequently used in CUDA programs. Moreover, a Scheme implementation of these library functions would not be convenient for Scheme programs, because many of them depend on C struct values that would need to be constructed in Scheme. To use them in Scheme programs, programmers would still need to learn low-level CUDA details, which is not necessarily useful in Scheme or other functional languages.

Some CUDA library functions unavailable in Scheme are listed below:

• In CUDA, programmers can choose devices using the library function

cudaChooseDevice ( int * device, const struct cudaDeviceProp * prop )

based on the values set in the fifty-one fields of the cudaDeviceProp structure variable prop . We do not provide this library function in Scheme because programmers would need to set all fifty-one fields of prop , which might not be convenient for Scheme programmers.

• CUDA provides library function

cudaPointerGetAttributes (struct cudaPointerAttributes *attributes, const void *ptr )

to return the attributes of a pointer ptr in a cudaPointerAttributes structure variable attributes . This function can identify whether a pointer points to device or host memory. In our implementation we do not need to provide a separate device memory vector in Scheme. Our implementation hides the device memory pointer that the CUDA-C shim generates for a Scheme vector, providing an abstraction that reduces hands-on memory management. We assume that a Scheme vector defined in a host program is in host memory, but that it is in device memory when it is passed to a kernel skeleton. Therefore, we do not need to provide this library function in Scheme.

• CUDA provides library function

cudaMallocHost(void ** ptr, size_t size )

that allocates size bytes of page-locked host memory, accessible from a device, and returns it in pointer ptr . Allocating an excessive amount of page-locked memory may degrade system performance because less memory remains available for paging. Accessing host memory from a kernel is an order of magnitude slower than accessing device memory; it has both very high latency and very limited throughput. It is therefore recommended that this library function be used sparingly in terms of host memory consumption. Apart from that, this library function is usually used to allocate host memory for managing streams of operations on the GPU. Streams are not covered in this thesis; therefore, we do not provide this library function in Scheme.

• A device may have the capability to directly access the device memory of a peer device. CUDA provides library function

cudaDeviceCanAccessPeer (int * canAccessPeer, int device, int peerDevice )

to verify whether device device can access the device memory of another device peerDevice . This library function returns 1 in canAccessPeer if device has access to the device memory of device peerDevice . If not, this library function returns 0 in canAccessPeer . A device can enable another device's access to its memory by calling the library function cudaDeviceEnablePeerAccess(). Note that access to a peer device is not available to 32-bit CUDA applications. We do not provide this library function in Scheme because in this thesis we only focus on the interactions between host and device. Moreover, peer access is not a common practice in most CUDA applications.¹

¹ Only the Simple Peer-to-Peer Transfers with Multi-GPU CUDA SDK example uses this library function. It just demonstrates how to access device memory of a peer device.

• CUDA provides some stack-based library functions in order to launch a kernel. First, library function

cudaConfigureCall( dim3 gridDim , dim3 blockDim , size_t sharedMem = 0, cudaStream_t stream = 0)

pushes execution configuration parameters to the top of the stack. The arguments for a kernel can be pushed to the top of the stack using the library function

cudaSetupArgument( const void * arg , size_t size )

Here, argument arg of size bytes is pushed onto the stack. Note that calls to cudaSetupArgument() must be preceded by a call to cudaConfigureCall() . The kernel named entry is then launched using the library function

cudaLaunch( const char * entry )

which must be preceded by the calls to cudaSetupArgument() . Note that all three library functions are part of the C API, which means they can be compiled by a C compiler. These library functions are an alternative to the CUDA execution configuration syntax for launching a kernel, which can only be compiled by the nvcc compiler. In our implementation, we provide a special form in Scheme to call a kernel; therefore, we do not need to provide these stack-based library functions in Scheme.
