Many issues become evident when working with parallelism on GPU hardware. These issues result from adapting serial algorithms into parallel algorithms and from mapping parallel concepts onto specific GPU architectures. To unlock high performance, GPU programmers must learn to take advantage of beneficial hardware features while mitigating harmful ones, and to exploit parallelism at several levels (instruction-level, data-level, thread-level, and bit-level). To make these abstract lessons more concrete, I implemented several algorithms on GPUs. In these implementations, I use my DASks and BASks9 to shorten my own development time and to provide a high-performing framework. Each case study implementation also offers additional valuable
use the __syncthreads() function for barriers within a thread block; barriers between GPU kernels happen automatically.
8 My kd-tree case study algorithm is data dependent. Since each thread represents a single query point,
each thread branches down its own unique path through the search tree. However, there are 32 threads per warp that move in lock-step through the code. Consequently, performance varies with data as CUDA serializes instructions from threads on different branches.
9 One exception: my kd-tree case study was written before I generalized the concept of DASks and BASks.
lessons about performance. I showcase many of the lessons in GPU programming via five different case studies: Memory I/O via Copy, kd-trees, Reduce/Scan, Histogram, and Radix Sort.
1.3.1 Memory I/O via Copy:
The primary focus of my Memory I/O case study is on demonstrating all three of my DASks and both of my BASks via the Copy primitive, which copies n inputs onto n outputs. The secondary focus is on showing how my DASks can achieve a high percentage of peak I/O throughput on GPUs. To this end, I implement the Copy primitive in four different ways: Simple, Block, Column, and Row. Getting the Simple copy kernel up and running correctly is described in more detail in “Chapter 4 – Case Study Memory IO”. The more complex and higher-performing versions of Copy, based on my three data access skeletons (Block, Column, and Row), are described in more detail in “Chapter 5 – Data Access Skeletons”. I conduct experiments on all four versions of Copy to find the best-performing balance between ILP and TLP, achieving up to 30%, 82%, 77%, and 77% of peak I/O throughput, respectively.
1.3.2 kd-tree for Nearest Neighbor Searches:
In the kd-tree case study, I implement GPU kernels for nearest neighbor search10 using a kd-tree11. With a nearest neighbor search, the goal is to find the closest point (or k points) within a search set of n points for each of m points in a query set. Note: Unlike my other case studies, this case study does not use any of my DASks.
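To make the problem statement concrete, the following is a minimal serial sketch of a 2D kd-tree nearest neighbor search in Python. It is an illustration under stated assumptions, not the GPU implementation described in this case study: the names `Node`, `build`, `nearest`, `nearest_all`, and `dist2` are hypothetical, the tree splits on the median while alternating between the x and y axes, and only the single closest point (k = 1) is returned for each query.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance (avoids an unnecessary square root)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


class Node:
    """One kd-tree node: a splitting point, its split axis, and two subtrees."""

    def __init__(self, point: Point, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a kd-tree by splitting on the median, alternating x and y."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))


def nearest(node: Optional[Node], query: Point,
            best: Optional[Point] = None) -> Optional[Point]:
    """Return the point in the tree that is closest to the query point."""
    if node is None:
        return best
    if best is None or dist2(query, node.point) < dist2(query, best):
        best = node.point
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    # Descend first into the side of the splitting plane that holds the query.
    best = nearest(near, query, best)
    # Visit the other side only if a closer point could lie across the plane.
    if diff * diff < dist2(query, best):
        best = nearest(far, query, best)
    return best


def nearest_all(search_set: List[Point], query_set: List[Point]) -> List[Point]:
    """For each of the m query points, find the closest of the n search points."""
    tree = build(search_set)
    return [nearest(tree, q) for q in query_set]
```

Each query follows its own data-dependent path through the tree, which is the property that makes the search awkward to run in lock-step across the threads of a warp.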
My first exposure to GPU programming was implementing a kd-tree for use on nearest neighbor searches. I intended to use this nearest neighbor search as part of a terrain visualization problem on LIDAR data. My solution took much longer than I had originally budgeted.
However, eventually I got it working and in time achieved a 25× speedup in performance over the
10 Nearest neighbor searches are discussed in the following books and papers (Bentley, 1975; Bustos et al., 2006; Mount and Arya, 2010; Shakhnarovich et al., 2005).
equivalent single-threaded CPU code. I was proud of this result at the time. However, with the benefit of more experience, I now see that my original implementation was naïve: it did not take advantage of many GPU hardware features and ran head-on into several hardware limitations that constrain parallel performance. This kd-tree implementation is described in more detail in “Chapter 7 – Case Study kd-tree”.
1.3.3 Reduce/Scan:
In my Reduce/Scan case study, I use my Row DASk to implement high performance parallel Reduce and Scan GPU primitives12. Reduce produces a total sum by accumulating n inputs into a single final sum. Scan (Prefix Sum) produces a running sum by accumulating n inputs into n outputs, where the ith output element is the cumulative sum of the first i (or i-1) input elements. Reduce and Scan have similar implementations. Both primitives are almost trivial (3-5 lines of code) to implement on a serial CPU. However, the parallel GPU implementations are much more complex. This complexity is a direct result of data being load-balanced across tens of thousands of threads and the requirement for partial per-thread sums to be hierarchically
combined and redistributed for correct final results. The Scan primitive requires that inputs be processed in sequential order for correct scanned results. Although the Reduce primitive does not require sequential ordering, I choose to implement it the same way as Scan.
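The serial versions of both primitives, and the general shape of the hierarchical combination described above, can be sketched in a few lines of Python. This is an illustrative sketch only, not the GPU implementation: all function names are hypothetical, and `scan_inclusive_chunked` uses fixed-size chunks as a stand-in for the per-thread or per-block partitions that a real GPU kernel would use.

```python
from typing import List


def reduce_sum(xs: List[int]) -> int:
    """Reduce: accumulate n inputs into a single final sum."""
    total = 0
    for x in xs:
        total += x
    return total


def scan_inclusive(xs: List[int]) -> List[int]:
    """Inclusive scan: output i is the sum of the first i inputs (1-based)."""
    out, running = [], 0
    for x in xs:
        running += x
        out.append(running)
    return out


def scan_exclusive(xs: List[int]) -> List[int]:
    """Exclusive scan: output i is the sum of the first i-1 inputs (1-based)."""
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out


def scan_inclusive_chunked(xs: List[int], chunk: int = 4) -> List[int]:
    """Three-phase scan that mimics the hierarchical parallel structure."""
    chunks = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]
    # Phase 1: each chunk independently reduces its own data to a partial sum.
    partials = [reduce_sum(c) for c in chunks]
    # Phase 2: scan the partial sums so each chunk learns the total before it.
    offsets = scan_exclusive(partials)
    # Phase 3: each chunk scans locally, then adds its redistributed offset.
    out: List[int] = []
    for c, off in zip(chunks, offsets):
        out.extend(off + v for v in scan_inclusive(c))
    return out
```

Phases 1 and 3 operate on each chunk independently, so they are the parts that can be spread across many threads; only the much smaller scan of partial sums in phase 2 ties the chunks together.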
As will be seen, I perform experiments on both ILP and TLP to find the optimal balance for best performance. My Reduce and Scan primitives achieve up to 89% and 85% of peak I/O throughput, respectively, on the GTX 580 (and up to 76% of peak I/O throughput on the GTX Titan). My Reduce/Scan primitives are described in further detail in Chapter 6.
12 For more details about the Reduce and Scan primitives, see the following papers (Blelloch, 1989 and 1990; Blelloch and Maggs, 1996; Chatterjee, 1990; Harris et al., 2008; Hillis and Steele, 1986; Merrill and Grimshaw, 2010 Parallel Scan).
1.3.4 Histogram:
In my Histogram case study, I use my Column DASk to implement a parallel 8-bit histogram primitive on the GPU. A histogram13 summarizes the frequency distribution of an entire data set via a much smaller table of counts. In a nutshell, n input elements are counted into m bins. The resulting m frequency counts form the histogram output. Each of the n inputs is assumed to be taken from a range, R = [min, max). Each of the m bins represents a sub-range r_i of R. (These sub-ranges uniquely partition and fully cover the original range R.) Counting proceeds by selecting the matching sub-range for each input element and incrementing that bin's counter.
An 8-bit histogram can be implemented using a simple indexing operation on 8-bit data. The serial CPU implementation is trivial (5-8 lines of code). Although histograms are straightforward to implement on a sequential CPU, they have proven difficult to adapt for use on GPUs, and prior GPU histogram implementations have shown low performance. I ran into similar performance issues: my GPU Histogram only achieves up to 21% of peak throughput on the GTX 580. Nevertheless, my GPU Histogram still runs up to 50% faster than prior GPU histogram methods for random data and up to 2-4× faster for image data. My 8-bit GPU histogram is described in further detail in Chapter 8.
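For reference, the serial counting procedure can be sketched in Python as follows. This is a sketch under stated assumptions rather than the implementation from this case study: the function names are hypothetical, and `histogram` assumes that the m sub-ranges have equal width, which the definition above does not require. The 8-bit version shows why no range lookup is needed in that case: each byte value is itself the bin index.

```python
from typing import List


def histogram(data: List[float], lo: float, hi: float, m: int) -> List[int]:
    """Count values from the range [lo, hi) into m equal-width bins."""
    counts = [0] * m
    width = (hi - lo) / m
    for x in data:
        if not (lo <= x < hi):
            raise ValueError(f"value {x} lies outside [{lo}, {hi})")
        # Select the matching sub-range; the min() guards against
        # floating-point rounding pushing a value into bin m.
        b = min(int((x - lo) / width), m - 1)
        counts[b] += 1
    return counts


def histogram_8bit(data: bytes) -> List[int]:
    """8-bit histogram: each byte value (0..255) directly indexes its bin."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts
```

For example, `histogram([0, 1, 2, 3, 4, 5, 6, 7], 0, 8, 4)` places two values in each of the four bins.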
1.3.5 Radix Sort:
In my Radix Sort case study, I use my Row DASk to implement a parallel Radix Sort on the GPU. Even though a serial radix sort14 is straightforward to implement (as a 3-step Counting Sort pass over each r-bit digit within a numeric key), the corresponding CPU/GPU hybrid radix sort is much more complex. This complexity arises due to the need to load-balance data across tens of thousands of threads, hierarchically scan counts into starts, compress/decompress data,
13 Histograms were created by Pearson (Pearson, 1895).
and many other complex actions required to overcome hardware limitations. My hybrid solution has the CPU implement the radix sort as multiple passes over 4-bit digits within each 32-bit numeric key and then invoke three GPU kernels (GPU_CountKeys, GPU_ScanCounts, and GPU_DistributeKeys) to do a full counting sort on each chosen digit in each pass.
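The pass structure can be illustrated with a serial Python sketch of a least-significant-digit radix sort over ‹key/value› pairs. This is not the hybrid CPU/GPU implementation: the helper names `count_keys`, `scan_counts`, and `distribute_keys` are hypothetical serial stand-ins chosen to mirror the roles of the three GPU kernels named above, and keys are assumed to be unsigned integers that fit in `key_bits` bits.

```python
from typing import Any, List, Sequence, Tuple


def count_keys(keys: Sequence[int], shift: int, mask: int) -> List[int]:
    """Step 1: count how many keys have each value of the current digit."""
    counts = [0] * (mask + 1)
    for k in keys:
        counts[(k >> shift) & mask] += 1
    return counts


def scan_counts(counts: Sequence[int]) -> List[int]:
    """Step 2: exclusive scan that turns digit counts into start offsets."""
    starts, running = [], 0
    for c in counts:
        starts.append(running)
        running += c
    return starts


def distribute_keys(keys: Sequence[int], values: Sequence[Any],
                    starts: Sequence[int], shift: int,
                    mask: int) -> Tuple[List[int], List[Any]]:
    """Step 3: move each pair to its digit's next free slot (stable)."""
    out_keys: List[int] = [0] * len(keys)
    out_values: List[Any] = [None] * len(keys)
    next_slot = list(starts)
    for k, v in zip(keys, values):
        d = (k >> shift) & mask
        out_keys[next_slot[d]] = k
        out_values[next_slot[d]] = v
        next_slot[d] += 1
    return out_keys, out_values


def radix_sort(keys: Sequence[int], values: Sequence[Any],
               key_bits: int = 32,
               digit_bits: int = 4) -> Tuple[List[int], List[Any]]:
    """One full counting sort per digit, least significant digit first."""
    mask = (1 << digit_bits) - 1
    keys, values = list(keys), list(values)
    # With 32-bit keys and 4-bit digits this loop runs 32 / 4 = 8 passes.
    for shift in range(0, key_bits, digit_bits):
        counts = count_keys(keys, shift, mask)
        starts = scan_counts(counts)
        keys, values = distribute_keys(keys, values, starts, shift, mask)
    return keys, values
```

Because each pass is stable, pairs that share a key keep their original relative order, which is what allows the passes over successive digits to compose into a complete sort.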
As will be seen, I perform experiments on both ILP and TLP to find the optimal balance for best performance. My GPU radix sort can sort up to 717 and 836 million ‹key/value› pairs per second on the GTX 580 and GTX Titan, respectively, which I estimate15 to be about 59% and 46% of the peak data throughput rates. My Radix Sort is described in further detail in Chapter 9.
15 Given 32-bit keys and 32-bit values with a 4-bit digit, the radix sort requires 8 passes (=32/4) to fully
sort the data. Given these assumptions, I estimate that the maximum data throughput rate is 1205 million and 1802 million ‹key/value› pairs per second on the GTX 580 and GTX Titan respectively.