2.1 Types of Parallelism
2.2.3 Parallel Coordination
GPU architects have provided only limited support for coordination between threads. Coordination is the behavioral cooperation across the threads of a parallel algorithm that is needed to achieve correct results. It includes communicating intermediate and final data results between threads. Typically on GPUs, coordination is only supported between threads belonging to the same thread block. The coordination mechanisms between threads are memory, atomics, voting intrinsics, and barrier synchronization.
Memory:
GPUs support multiple types of memory, including registers, shared memory, and global memory. As we will see shortly, shared memory is the best fit for coordination/communication between threads.
Registers: Each thread is only allowed to access the registers that have been assigned to it by the CUDA compiler and GPU hardware. In addition, registers are not addressable; in other words, thread registers cannot be accessed using index operations. As a result, registers cannot be used to coordinate threads. Note: The new CC 3.5 Kepler architecture supports a new PTX “shuffle” command that allows the threads within a single warp to move register values between threads.
Shared Memory: Unlike registers, shared memory is addressable and accessible by all threads within the same thread block. Shared memory can be used to coordinate threads by storing common results or behavioral state.
However, coordination/communication across thread blocks within a grid cannot be done using shared memory. Since shared memory is visible to all warps belonging to the same thread block, and concurrently running warps can compete to access the same memory resources, programmers need to ensure mutual exclusion between different warps for correct parallel behavior.
Global Memory: Although global memory can be used to store common results or state for all threads within a thread block, it is much slower than shared memory. As with shared memory, programmers need to ensure mutual exclusion between different warps and blocks when accessing global memory used for communication/coordination. It is difficult to coordinate
behavior across thread blocks within a grid using global memory due to two main factors: 1) the non-deterministic thread block schedule generated by the giga-engine scheduler, and 2) the fact that new thread blocks are not scheduled onto SMs until currently running thread blocks have completed. As a result, using global memory to coordinate thread blocks across a grid is not recommended.
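As a minimal sketch of shared-memory coordination within one thread block, the kernel below has each thread publish a value in shared memory and then read the value published by its neighbor. The names `rotateLeftWithinBlock`, `rotateOnDevice`, and `BLOCK_SIZE` are hypothetical and chosen only for illustration. The sketch assumes a one-dimensional block of exactly `BLOCK_SIZE` threads and an input length that is a multiple of `BLOCK_SIZE`. It relies on `__syncthreads()`, the block-level barrier described under Barriers in this section, so that every write is complete before any thread reads.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // hypothetical block size chosen for this sketch

// Each thread publishes one value in shared memory, then reads the value
// published by its right-hand neighbor in the same block (wrapping around).
// Communication never crosses a thread block boundary.
__global__ void rotateLeftWithinBlock(const int *in, int *out)
{
    __shared__ int staged[BLOCK_SIZE];  // visible to all threads of this block

    int local  = threadIdx.x;
    int global = blockIdx.x * blockDim.x + local;

    staged[local] = in[global];
    __syncthreads();  // every thread's write must land before anyone reads

    out[global] = staged[(local + 1) % blockDim.x];
}

// Hypothetical host helper; n is assumed to be a positive multiple of BLOCK_SIZE.
bool rotateOnDevice(const int *hostIn, int *hostOut, int n)
{
    int *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(int);

    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice);
    rotateLeftWithinBlock<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(dIn, dOut);

    // The blocking copy also waits for the kernel to finish.
    bool ok = cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}
```

Without the barrier, a thread in one warp could read its neighbor's slot before the warp that owns that slot has written it, which is the kind of race the mutual-exclusion discussion above warns about.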
Atomics:
An atomic operation is a small group of hardware instructions that is guaranteed to appear to the rest of the system as if the entire group were executed as a single indivisible instruction. Once an atomic operation has started, other threads cannot interrupt the current thread until the atomic operation has successfully completed. As a result, atomic operations can be used to ensure mutual exclusion when multiple threads (warps or blocks) access common resources in shared or global memory. However, the GPU hardware serializes conflicting thread accesses, which negatively impacts parallel performance. Consider tens of thousands of threads competing to update the exact same memory address using atomics. Instead of executing concurrently, those tens of thousands of threads must now execute sequentially. As a result, atomics must be used carefully to avoid degrading performance through the massive overhead caused by hardware serialization. Note: Atomic operations on the Kepler architecture execute faster than on the Fermi architecture.
Voting Intrinsics:
The GPU implements serialization using predication hardware, which supports gathering the Boolean predicates of all 32 threads within a warp into a single 32-bit mask. That predicate mask is then shared across all threads within the warp. NVIDIA has exposed this hardware functionality to programmers as voting intrinsics in the ISA. This allows the threads within each warp to communicate the results of simple predicate {true, false} tests with each other across registers in a single instruction, without first needing to store that information in slower shared memory.
Barriers:
The CUDA platform supports single-instruction barrier commands that force all warps within a thread block to be synchronized (coordinated) at a single check-point before any warp resumes doing useful work. This synchronization helps support correct behavior of algorithms that use multiple warps per thread block. The parallel threads within a single warp do not need any barrier synchronization for correct behavior, since they already move in lock-step according to their SIMD vector-parallel design.
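The following sketch shows one way atomics, voting intrinsics, and barriers might be combined to count the elements of an array that satisfy a predicate. The names `countPositive` and `countPositiveOnDevice` are hypothetical, and the sketch assumes a one-dimensional block whose size is a multiple of 32. It uses `__ballot_sync`, the voting intrinsic in CUDA 9 and later; toolkits contemporary with Fermi and Kepler exposed the same operation as `__ballot`. Each warp votes once and issues a single atomic, so far fewer threads compete for the same address than if every thread performed its own atomic update.

```cuda
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu  // all 32 lanes of the warp take part in the vote

// Counts the elements of data[0..n) that are greater than zero.
// Assumes a 1-D block whose size is a multiple of 32.
__global__ void countPositive(const int *data, int n, unsigned int *total)
{
    __shared__ unsigned int blockCount;

    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();  // barrier: counter is initialized before any warp adds to it

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool hit = (i < n) && (data[i] > 0);  // out-of-range threads vote false

    // Voting: gather the predicate of every lane into one 32-bit mask.
    unsigned int mask = __ballot_sync(FULL_MASK, hit);

    // Atomics: lane 0 of each warp adds the warp's count to shared memory.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(&blockCount, (unsigned int)__popc(mask));

    __syncthreads();  // barrier: every warp has contributed to blockCount

    // One global atomic per block accumulates the final result.
    if (threadIdx.x == 0)
        atomicAdd(total, blockCount);
}

// Hypothetical host helper; n is assumed to be positive.
unsigned int countPositiveOnDevice(const int *hostData, int n)
{
    int *dData = nullptr;
    unsigned int *dTotal = nullptr;
    unsigned int result = 0;

    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dTotal, sizeof(unsigned int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(unsigned int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    countPositive<<<blocks, threads>>>(dData, n, dTotal);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dTotal, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dTotal);
    return result;
}
```

The final global `atomicAdd` does not depend on the order in which thread blocks run, so it avoids the cross-block scheduling problems described for global memory above.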
Each GPU card supports moderate ILP, large DLP, massive TLP, and a complex memory hierarchy. ILP is supported via 2-level MIMD/SIMD processing cores. The SM warp cores support pipelining, scoreboarding (Fermi only), and multi-issue. DLP is supported via SP core replication within each SM core, and then SM core replication within the GPU card. TLP is supported by 2-level warp-threading. The memory hierarchy, in order of decreasing access speed, is registers, shared memory, global memory, and CPU RAM. Fixed sizes for memory and execution contexts constrain how programmers can exploit data-level parallelism in their algorithms. All of this complexity, together with these constraints, results in many issues that make it hard for GPU programmers to write correct, robust, and fast code. In the next chapter, I will present some of the main issues that GPU programmers should be concerned about.