CHAPTER 2. LITERATURE REVIEW
2.4 GPU-centric data reorganization approaches
2.4.3 Algorithms for Data Reorganization
2.4.3.1 Zhang et al.
Zhang et al. study the problem of dynamic irregularities that occur in both control flow and memory references [158].
To address these irregularities, Zhang et al. present heuristics-based algorithms, runtime adaptation techniques, and a framework called G-Streamline. The dynamic irregularities are removed through two transformation techniques: data reordering and job swapping.
Data reordering. Consider an irregular reference such as A[P[tid]]. The data reordering algorithm creates a new array A' such that A'[tid] = A[P[tid]]. The memory references in the kernel code are redirected accordingly: references to A[P[tid]] are replaced by A'[tid]. With these transformations, all memory accesses made by a warp become contiguous, which resolves the non-coalesced memory access problem.
The size of the new array A' equals the number of GPU threads, T, and no longer depends on the size of the original array A. In addition, a new array of size T is created for each distinct reference to the same array. For example, two arrays are created for an expression such as A[P[tid]] + A[P[tid] + v].
The data reordering algorithm is also referred to as the Duplication algorithm, because it creates duplicate copies of a data element that occurs multiple times in the indexing array P.
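The effect of data reordering can be illustrated with a small host-side simulation. The following is a minimal sketch rather than the G-Streamline implementation: the segment size, the warp size, the helper names (segments_touched, duplicate) and the example arrays are assumptions chosen for illustration, and the new array is named A_new to keep it distinct from the original A.

```python
SEGMENT = 4   # elements per memory segment (assumed, as in Figure 2.7)
WARP = 4      # threads per warp (assumed, as in Figure 2.7)

def segments_touched(indices, warp=WARP, segment=SEGMENT):
    """For each warp, count the distinct memory segments its threads read."""
    counts = []
    for start in range(0, len(indices), warp):
        warp_indices = indices[start:start + warp]
        counts.append(len({i // segment for i in warp_indices}))
    return counts

def duplicate(A, P):
    """Build the reordered array: A_new[tid] = A[P[tid]], one slot per thread."""
    return [A[p] for p in P]

A = list(range(100, 116))          # original array (16 elements)
P = [0, 5, 10, 15, 3, 3, 8, 12]    # hypothetical index array, T = 8 threads

A_new = duplicate(A, P)
before = segments_touched(P)                     # thread tid reads A[P[tid]]
after = segments_touched(list(range(len(P))))    # thread tid reads A_new[tid]
```

In this example each warp touches a single segment after reordering, whereas the original references spread over three or four segments. Element A[3] appears twice in A_new because index 3 occurs twice in P, which is the duplication the algorithm's second name refers to.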
2.4.3.2 Wu et al.
Wu et al. study the problem of data reorganization to eliminate the non-coalesced memory accesses that result from irregular memory references [150]. The study analyzes the inherent complexity of the data reorganization problem and proves that the problem is NP-complete. It also points out that designing an appropriate data reorganization solution amounts to a tradeoff among three factors: space, time, and complexity. Wu et al. propose two algorithms for data reorganization: (1) Padding and (2) Sharing.
Figure 2.7: Illustration of data layouts generated and the access order in (b) duplication, (c) padding, and (d) sharing algorithms. (a) depicts the original layout and accesses. Note that the example shown assumes four objects per memory segment, four threads per warp, and four warps per block.
Padding Algorithm. The Padding algorithm seeks to reduce the number of data copies made by the Duplication algorithm while retaining the optimization quality. The key observation behind the Padding algorithm is that if two threads (t1 and t2) belonging to the same warp access the same data element (a), there is no need to create two copies of (a). Instead, both threads can access the same copy. This approach does not introduce non-coalesced accesses, because both threads still access the same memory segment. However, the Padding approach gives up the regular one-to-one mapping between data and threads that the Duplication algorithm provides [158].
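The warp-level duplicate avoidance described above can be sketched as follows. This is an illustrative host-side simulation, not the algorithm of Wu et al.: the function name pad_layout, the remapped index array Q and the example data are hypothetical, and the step that pads each warp's chunk up to a segment boundary is an assumption suggested by the algorithm's name rather than a detail taken from the text.

```python
SEGMENT = 4   # elements per memory segment (assumed, as in Figure 2.7)
WARP = 4      # threads per warp (assumed, as in Figure 2.7)

def pad_layout(A, P, warp=WARP, segment=SEGMENT, fill=None):
    """Return (layout, Q) such that thread tid reads layout[Q[tid]] instead of A[P[tid]]."""
    layout, Q = [], []
    for start in range(0, len(P), warp):
        slot = {}   # original index -> position in layout, valid for this warp only
        for p in P[start:start + warp]:
            if p not in slot:            # first use within this warp: copy once
                slot[p] = len(layout)
                layout.append(A[p])
            Q.append(slot[p])            # later uses share the same copy
        # Assumption: pad so that the next warp's chunk starts on a segment boundary.
        while len(layout) % segment:
            layout.append(fill)
    return layout, Q

A = list(range(100, 116))
P = [0, 5, 5, 0, 3, 3, 8, 12]    # hypothetical index array with repeats inside each warp

layout, Q = pad_layout(A, P)
copies = sum(1 for v in layout if v is not None)   # elements actually copied
```

Here only five elements are copied, compared with eight under duplication, and every warp still reads from a single segment. The threads now need the index array Q to find their data, which reflects the loss of the one-to-one mapping between data and threads.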
Sharing Algorithm. The Sharing algorithm overcomes the limitation of the Padding algorithm by leveraging the shared memory available in GPGPUs, which widens the applicability of duplication avoidance. The algorithm exploits several properties of shared memory: writes are visible only to threads of the same thread block, access latency is low compared with global memory, and performance is, to some extent, insensitive to irregular memory accesses.
The Sharing algorithm operates as follows:
(1) Create a copy of the data accessed by a thread block and place the copy in a consecutive chunk of memory.
(2) Load the data into shared memory in a consecutive fashion, so that the loads from global memory are coalesced.
(3) Redirect the memory accesses made by the threads in the thread block to the respective copies in shared memory (instead of global memory). The algorithm uses a clustering mechanism to further increase sharing opportunities and avoid duplication.
The Sharing algorithm essentially shifts the irregular accesses from global memory to shared memory. Since shared memory is visible to all threads of a thread block, the scope of sharing is no longer limited to the warp level (as in the Padding algorithm) but extends to the thread block level. In other words, the scope of duplication avoidance is now a thread block rather than a single warp.
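The three steps of the Sharing algorithm can be sketched as a host-side simulation in which a Python list stands in for shared memory. The names build_block_chunks and run_block, the block size and the example data are assumptions made for illustration; the clustering mechanism mentioned in step (3) is not modeled.

```python
BLOCK = 8   # threads per thread block (assumed for this example)

def build_block_chunks(A, P, block=BLOCK):
    """Step 1: per block, copy the distinct elements it reads into one contiguous chunk."""
    chunks, local = [], []
    for start in range(0, len(P), block):
        slot, chunk = {}, []
        for p in P[start:start + block]:
            if p not in slot:            # one copy per element per block
                slot[p] = len(chunk)
                chunk.append(A[p])
            local.append(slot[p])        # index into the block's chunk
        chunks.append(chunk)
    return chunks, local

def run_block(block_id, chunks, local, block=BLOCK):
    """Steps 2 and 3 for one block, with a list standing in for shared memory."""
    shared = list(chunks[block_id])      # step 2: consecutive load of the chunk
    start = block_id * block
    # step 3: the irregular reads now go to shared memory, not global memory
    return [shared[j] for j in local[start:start + block]]

A = list(range(100, 116))
P = [0, 5, 5, 0, 3, 3, 8, 0,  12, 12, 1, 1, 1, 2, 2, 12]   # hypothetical, two blocks

chunks, local = build_block_chunks(A, P)
values = run_block(0, chunks, local) + run_block(1, chunks, local)
```

In this example the two blocks stage four and three elements respectively for sixteen thread accesses, because repeated indices anywhere within a block share one copy.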
Figure 2.8: A CPU code for K-means clustering.
2.4.3.3 Mokhtari et al.
Mokhtari et al. present BigKernel [92], a compiler and runtime technique that addresses several challenges associated with data processing on GPGPUs. BigKernel also addresses the problem of uncoalesced memory accesses in Big Data-style computations. The key idea is as follows. The GPU threads first identify, online, the data they will access in their computations (at this point they neither access the data nor perform any computation). The GPU transfers this information to the CPU. The CPU assembles the data and transfers it to GPU memory. The GPU threads then access the assembled data in their computations.
The BigKernel technique operates as a four-stage pipeline:
(1) Prefetch address generation (GPU side): the GPU threads calculate the memory access information and record it in an address buffer. This address buffer is sent to the CPU for data assembly.
(2) Data assembly (CPU side): the CPU assembles the prefetch data based on the information contained in the address buffer.
(3) Data transfer: the assembled prefetch data is transferred to the data buffer on the GPU side by the DMA engine.
(4) Kernel computation: the GPU threads read the data values from the data buffer instead of the original memory locations and perform the computations. The kernel code on the GPU is transformed to accommodate this redirection.
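The data flow through the four stages can be sketched as follows. This is a sequential host-side simulation under stated assumptions, not the BigKernel runtime: the function names, the access pattern index_of and the computation are hypothetical, the host-to-device copy is replaced by a list copy, and the overlap between stages that a real pipeline provides is not modeled.

```python
def generate_addresses(num_threads, index_of):
    """Stage 1 (GPU side): record which addresses each thread will read; no data is touched."""
    return [index_of(tid) for tid in range(num_threads)]

def assemble(host_data, address_buffer):
    """Stage 2 (CPU side): gather the requested values in thread order."""
    return [host_data[a] for a in address_buffer]

def transfer(assembled):
    """Stage 3: stand-in for the host-to-device copy into the GPU data buffer."""
    return list(assembled)

def kernel(data_buffer, compute):
    """Stage 4 (GPU side): thread tid reads data_buffer[tid], not the original location."""
    return [compute(data_buffer[tid]) for tid in range(len(data_buffer))]

host_data = [x * x for x in range(32)]
index_of = lambda tid: (7 * tid) % 32    # hypothetical irregular access pattern

addresses = generate_addresses(8, index_of)
data_buffer = transfer(assemble(host_data, addresses))
result = kernel(data_buffer, lambda v: v + 1)
```

Because the CPU gathers the values in thread order, the reads in the last stage are contiguous even though the original access pattern is not.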
The BigKernel technique also provides a notion of pseudo-virtual memory to GPU applications. It simplifies the programming model: programmers can write kernels that use arbitrarily large data structures, provided these can be partitioned into independently operable segments.
Figure 2.9: A GPU code for K-means clustering.
Figure 2.10: GPU code for prefetch address generation (Stage 1) [92].
2.4.3.4 Goldfarb et al.
Goldfarb et al. propose general transformations for GPU execution of tree traversals [43]. The authors describe general-purpose techniques that can be used to implement irregular algorithms on GPUs. A key feature of these techniques is that they exploit commonalities in the structure of the algorithms rather than relying on application-specific knowledge.
A transformation called auto-roping is developed. It is based on the key observation that the primary cost of general tree traversals on GPUs stems from the frequent moves up and down the tree during the traversal.
Typically, the tree traversal process involves stack management. Some studies develop “stackless” traversals, which encode the traversal order into the tree itself via auxiliary pointers called “ropes”. The provision of ropes averts the need for any stack management. However, encoding ropes into the tree is not trivial and involves tradeoffs, because it requires algorithm-specific and implementation-specific passes that may sacrifice generality in favor of efficiency. Auto-roping is essentially a generalization of ropes that is applicable to any recursive traversal algorithm.
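To make the notion of ropes concrete, the following minimal sketch installs ropes by hand for a pre-order traversal of a binary tree and then traverses the tree without a stack. It illustrates the manual technique that auto-roping generalizes, not the transformation of Goldfarb et al.; the class Node, the functions install_ropes and traverse, and the prune predicate are hypothetical names introduced for this example.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.rope = None   # node to visit once this node's subtree is finished

def install_ropes(node, next_node=None):
    """Preprocessing pass: store in each node where a pre-order traversal continues."""
    if node is None:
        return
    node.rope = next_node
    # After the left subtree, continue with the right child if present, else with our rope.
    install_ropes(node.left, node.right if node.right is not None else next_node)
    install_ropes(node.right, next_node)

def traverse(root, prune=lambda n: False):
    """Stackless pre-order traversal: descend to a child or follow the rope."""
    visited, node = [], root
    while node is not None:
        visited.append(node.key)
        if prune(node):
            node = node.rope          # skip the whole subtree in one step
        elif node.left is not None:
            node = node.left
        elif node.right is not None:
            node = node.right
        else:
            node = node.rope          # leaf: jump to the next unvisited subtree
    return visited

root = Node(1, Node(2, Node(4), Node(5)), Node(3, None, Node(6)))
install_ropes(root)
full = traverse(root)
pruned = traverse(root, prune=lambda n: n.key == 2)
```

The traversal never moves back up the tree step by step: at a leaf or a pruned node it follows a single rope to the next subtree, which is the cost that the stackless approach is meant to avoid.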