5.8 Related Work
5.8.1 GPGPU-Based Systems
The Lua programming language has an extension for running data-parallel OpenCL code, known as LuaGPU [19]. It allows programmers to write host programs in Lua and spares them much of the error-prone error checking and pointer manipulation that OpenCL requires. Unlike our implementation, a host program in LuaGPU takes the source code of a data-parallel OpenCL kernel as a Lua string. Programmers also need to write their own datatype mapping operations to pass kernel arguments from a Lua program to an OpenCL kernel, whereas our system generates these mapping operations automatically in the Scheme shim. In addition, LuaGPU requires programmers to write the memory transfer operations explicitly, whereas our system generates them automatically in the CUDA-C shim.
Moreover, in LuaGPU, data and kernels are launched from a queue-like data structure defined in Lua, whereas our system does not require any additional data structure to call a kernel. A kernel call in our system looks like an ordinary function call in Scheme; it only requires an additional construct to define the execution configuration. LuaGPU therefore still leaves a considerable amount of Lua coding to the programmer.
The Python programming language also has an extension for running data-parallel CUDA-C code, called PyCUDA [51]. This extension allows programmers to write host programs in Python that control and issue data-parallel programs written in CUDA-C. Unlike our system, PyCUDA requires programmers to define memory transfer operations and allocation/deallocation operations, and it takes data-parallel CUDA-C code as Python strings. PyCUDA provides special arrays for GPUs, whereas our system does not require any special GPU arrays in Scheme: an ordinary Scheme vector can be passed to a kernel.
Our system does not provide any error-reporting facility, whereas in PyCUDA, errors generated from GPU computations are detected and reported automatically. In contrast to our system, every feature of the CUDA runtime system is accessible from Python via PyCUDA, including textures, pinned host memory, OpenGL interaction, and zero-copy host memory mapping. PyCUDA also provides library functionality, such as element-wise arithmetic operations, map-reduce, and parallel scan, that allows a restricted subset of Python code to be automatically farmed out to the GPU. In addition, PyCUDA has a just-in-time compiler that generates NVIDIA’s low-level PTX abstract-machine code [6], which allows automated tuning of device code to improve runtime performance, whereas our system does not generate code at runtime.
PyCUDA has been used successfully in many research projects. Tomasz Rybak at Bialystok Technical University uses PyCUDA to generate recurrence diagrams for time-series analysis, achieving an 85-fold speedup compared to CPU computations. Romain Brette and Dan Goodman use PyCUDA to simulate spiking neural networks with their simulator Brian [39]. Brian relies on PyCUDA to generate GPU code at runtime for the integration of differential equations that users provide in a Python script. For some models, the GPU version was up to 60 times faster than a comparable CPU implementation. There are also image processing applications developed with PyCUDA that implement k-means clustering routines [27]. For those applications, PyCUDA was about 10x slower than the CUDA-C implementations, although this was probably still an improvement over CPU computations. By comparison, our system was 1.3x slower than the CUDA-C implementation for the real-world test case 3DFD, discussed in Section 5.6.
Accelerate [28] is a domain-specific, high-level, skeleton-based language for GPGPU computing embedded in the Haskell programming language. In Accelerate, both host and kernel programs are written in high-level Haskell, whereas our system allows only host programs to be written in the high-level language, Scheme. Accelerate provides abstractions for both device and host programs that ease GPU programming. Unlike our system, Accelerate has a dynamic code generator that instantiates CUDA implementations as PTX at runtime. This code generator exploits runtime information to optimize GPU code. However, compiling kernels at runtime adds overhead at execution time. To reduce this overhead, Accelerate memoizes compiled kernels, so kernels that are invoked multiple times are generated and compiled only once.
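The memoization idea can be illustrated with a minimal Python sketch. This is not Accelerate's implementation: the `KernelCache` class, the `fake_compile` stand-in, and the choice of a source-text hash as the cache key are all assumptions made for illustration.

```python
import hashlib


class KernelCache:
    """Memoize compiled kernels, keyed by a hash of their source text."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # hypothetical stand-in for a runtime compiler
        self._cache = {}
        self.compile_count = 0         # number of real compilations performed

    def get(self, source):
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._cache:
            # First request for this source: pay the compilation cost once.
            self._cache[key] = self._compile_fn(source)
            self.compile_count += 1
        return self._cache[key]


def fake_compile(source):
    # Hypothetical compiler: ignores the source and returns a Python callable
    # that doubles each element. A real system would emit device code here.
    return lambda xs: [x * 2 for x in xs]


cache = KernelCache(fake_compile)
k1 = cache.get("kernel: double each element")
k2 = cache.get("kernel: double each element")  # served from the cache
```

After both calls, `cache.compile_count` is 1 and `k1` and `k2` refer to the same compiled object, so repeated invocations of the same kernel pay the compilation cost only on first use.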
In [28], Manuel M.T. Chakravarty et al. present three test cases to evaluate the performance of Accelerate. For the parallel dot product, Accelerate takes almost precisely twice as long as CUDA-C, whereas the overhead of our system for this test case was only 0–4% compared to CUDA-C. Another test case, the Black-Scholes option pricing algorithm, shows that the overhead of Accelerate relative to CUDA-C decreases with increasing vector sizes. Sparse-matrix vector multiplication behaves like Black-Scholes, with the overhead relative to CUDA-C decreasing as the vector size increases. We observed similar behavior in our system: for the parallel sum reduction test case, discussed in Section 5.4, the overhead diminished rapidly with increasing vector sizes.
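One simple way to see why overhead can shrink with problem size is to assume that the high-level layer adds a fixed cost per kernel call while the kernel itself scales with the number of elements. The sketch below encodes that assumption only; the function name and the cost constants are hypothetical and are not measurements from Accelerate or from our system.

```python
def relative_overhead(n, fixed_cost, per_element_cost):
    """Fractional overhead of a wrapped run versus a bare run on n elements.

    Assumed model:
      bare time    = per_element_cost * n
      wrapped time = fixed_cost + per_element_cost * n
    """
    bare = per_element_cost * n
    wrapped = fixed_cost + bare
    return (wrapped - bare) / bare


# Hypothetical costs in arbitrary time units, not measured values.
FIXED = 100.0
PER_ELEMENT = 0.01

sizes = [10**3, 10**4, 10**5, 10**6, 10**7]
overheads = [relative_overhead(n, FIXED, PER_ELEMENT) for n in sizes]
for n, o in zip(sizes, overheads):
    print(f"n={n:>9}  overhead={o:8.2%}")
```

Under this model the relative overhead falls in inverse proportion to the vector size. An overhead that is proportional to the work itself would not diminish in this way.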
object-oriented features. Firepile allows programmers to write both host and kernel code in Scala, whereas our system allows only host programs to be written in the high-level language, Scheme. Like our system, the Firepile library hides the details of GPU programming by managing devices and memory operations automatically. However, Firepile manages both devices and memory operations at the library level, whereas our system generates the code for memory operations in the shims and manages GPUs through the Scheme library functions it provides.
To compile from Scala to OpenCL, the Scala compiler first compiles both host and kernel code into Java bytecode, which the Java Virtual Machine then executes. At runtime, the Firepile library identifies the bytecode for the kernel and invokes its internal compiler to convert the kernel bytecode to native OpenCL code. Firepile then copies data from host to device memory and invokes the kernel. Finally, it copies the results back from device to host memory.
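The last three steps (copy in, launch, copy out) are the part that a library or a generated shim can hide from the programmer. The following Python sketch shows the general pattern with a simulated device; `FakeDevice` and `run_kernel` are hypothetical names, and nothing here reflects Firepile's actual API.

```python
class FakeDevice:
    """Simulated device memory; a stand-in for a real GPU runtime."""

    def __init__(self):
        self.buffers = {}
        self.log = []            # records the order of operations
        self._next_handle = 0

    def copy_to_device(self, host_data):
        handle = self._next_handle
        self._next_handle += 1
        self.buffers[handle] = list(host_data)
        self.log.append("host->device")
        return handle

    def launch(self, kernel, handle):
        # Apply the kernel to every element, as a data-parallel map would.
        self.buffers[handle] = [kernel(x) for x in self.buffers[handle]]
        self.log.append("launch")

    def copy_to_host(self, handle):
        self.log.append("device->host")
        # Popping the buffer also releases the simulated device memory.
        return list(self.buffers.pop(handle))


def run_kernel(device, kernel, host_data):
    """Hide the copy-in, launch, copy-out sequence behind a single call."""
    handle = device.copy_to_device(host_data)
    device.launch(kernel, handle)
    return device.copy_to_host(handle)


device = FakeDevice()
result = run_kernel(device, lambda x: x * x, [1, 2, 3, 4])
```

The caller passes an ordinary list and receives an ordinary list; `device.log` records that the transfers and the launch still happened, in order, behind the single call.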
In [56], Nathaniel Nystrom et al. present five test cases: reduction, Black-Scholes, matrix multiplication, the discrete cosine transform (DCT8x8), and matrix transpose. The Firepile version of parallel reduction performed as well as CUDA-C. In our system, parallel reduction showed 35% overhead for small vector sizes, but this overhead diminished rapidly with increasing vector sizes. For matrix multiplication, the Firepile version was 15% faster than the NVIDIA CUDA-C version. In our system, matrix multiplication was also faster than the CUDA-C version. The Firepile versions of the discrete cosine transform (DCT8x8), Black-Scholes, and matrix transpose were consistently slower than the CUDA-C versions. In contrast to Firepile, the test cases for our system showed overhead only initially, for smaller vector sizes; we observed that the overhead diminished rapidly with increasing vector sizes or numbers of thread blocks.