2.3 The CUDA programming model
2.3.3 Synchronisation
Key to the design of any parallel algorithm for execution on the GPU is synchronisation. As there are multiple memory types and levels of abstraction (from CUDA threads to thread blocks), there are also multiple types of parallel synchronisation. In the following subsections we review each of the synchronisation types available.
Warp Synchronisation
Threads within a warp execute non-branching regions of kernels in parallel. There is implicit synchronisation between all threads as long as the warp execution path does not diverge. However, NVIDIA suggest avoiding warp-synchronous programming as this can lead to synchronisation issues and race conditions; furthermore, implementations often incorrectly assume that the number of threads in a warp will always be 32 [31]. NVIDIA note that although warp-synchronous implementations might function correctly now, changes to the CUDA toolchain and CUDA-capable hardware may easily break these implementations. Nevertheless, warp-synchronous programming can lead to improved execution times.
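As a hedged illustration of the alternative to implicit warp-synchronous code, the sketch below sums the values held by the threads of a single warp using the explicit warp-level primitive __shfl_down_sync() and the built-in warpSize variable rather than a hard-coded 32. It assumes CUDA 9 or later and a device of compute capability 3.0 or higher; the names warp_sum_kernel and warp_sum are hypothetical, and the full 32-bit participation mask assumes that every thread of the launched warp is active.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical kernel: sums up to one warp's worth of integers.
// Launched with exactly one warp, so the lane index equals threadIdx.x.
__global__ void warp_sum_kernel(const int *in, int *out, int n)
{
    int lane = threadIdx.x;
    int value = (lane < n) ? in[lane] : 0;   // pad missing elements with 0

    // Offsets are derived from the built-in warpSize instead of a literal 32.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        // The mask names every lane as a participant; the primitive itself
        // synchronises those lanes, so no implicit lock-step is assumed.
        value += __shfl_down_sync(0xffffffffu, value, offset);
    }
    if (lane == 0) {
        *out = value;                        // lane 0 holds the warp total
    }
}

// Hypothetical host helper: copies the input to device 0, launches a single
// warp and returns the sum. Error checking is omitted for brevity.
int warp_sum(const std::vector<int> &host_in)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // query the warp size at run time
    int n = static_cast<int>(host_in.size());
    assert(n > 0 && n <= prop.warpSize);

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    warp_sum_kernel<<<1, prop.warpSize>>>(d_in, d_out, n);

    int result = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

A call such as warp_sum({1, 2, 3, 4}) would be expected to return 10 under these assumptions.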
Thread Synchronisation
There is no guarantee as to what order warps will execute, necessitating the usage of additional block synchronisation techniques where inter-thread communication is required across multiple warps. The method __syncthreads() holds each thread of a block at the point of the call in the kernel. After all threads in the block have reached the __syncthreads() call, kernel execution resumes, ensuring block synchronisation. The __syncthreads() method should not be called within an if-then-else statement unless all threads follow the same execution path, as this may cause a deadlock resulting in the kernel timing out and failing to execute successfully. As of compute capability 2.0 there are three variants of the __syncthreads() method that allow similar voting functionality to warp vote. An example of a __syncthreads() method variant is __syncthreads_count(). This variant takes an integer predicate from each of the threads in the block, synchronises all threads to a point and returns the number of threads for which the predicate was non-zero.
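The following minimal sketch shows both calls in a single-block kernel: __syncthreads() separates the write and read phases of a shared-memory array reversal, and __syncthreads_count() counts the positive elements. The names BLOCK_SIZE, reverse_and_count_kernel and reverse_and_count are hypothetical, and the sketch assumes the input fits in one block.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define BLOCK_SIZE 128   // hypothetical block size chosen for illustration

// Hypothetical kernel: reverses data[0..n) in place and counts its positive
// elements. Must be launched with one block of BLOCK_SIZE threads.
__global__ void reverse_and_count_kernel(int *data, int *num_positive, int n)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;

    if (t < n) {
        tile[t] = data[t];
    }
    // The barrier sits outside the branch so that every thread reaches it.
    __syncthreads();

    if (t < n) {
        data[t] = tile[n - 1 - t];   // read an element written by another thread
    }

    // Barrier plus vote: returns, to every thread, the number of threads
    // whose predicate is non-zero. Called unconditionally by all threads.
    int count = __syncthreads_count(t < n && tile[t] > 0);
    if (t == 0) {
        *num_positive = count;
    }
}

// Hypothetical host helper: reverses `values` in place and returns the
// number of positive elements. Error checking is omitted for brevity.
int reverse_and_count(std::vector<int> &values)
{
    int n = static_cast<int>(values.size());
    assert(n > 0 && n <= BLOCK_SIZE);

    int *d_data = nullptr, *d_count = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_data, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reverse_and_count_kernel<<<1, BLOCK_SIZE>>>(d_data, d_count, n);

    int count = 0;
    cudaMemcpy(values.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_count);
    return count;
}
```

Placing either barrier inside the `if (t < n)` branch would be an instance of the divergent-path hazard described above, since threads with t >= n would never reach it.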
Block Synchronisation
There is no built-in support for block synchronisation in CUDA during kernel execution. This allows the GPU to execute each block independently and adjust the block execution scheduling dynamically according to the number of CUDA cores available on the GPU. As a result, there is no guarantee as to the execution order of blocks or which blocks are executing at any given point, and developers should not presume any ordering. In Fig. 2.7 we illustrate a simple alternative to global synchronisation: splitting one kernel into two smaller kernels. A blocking synchronisation method, cudaDeviceSynchronize(), is used to ensure that all thread blocks have fully completed their kernel execution. A second kernel is then scheduled for execution by the host with the guarantee that the previous kernel has finished execution, thereby achieving global synchronisation up to a point.
[Figure 2.7 diagram: thread blocks, each containing threads T1, T2, T3, ..., Tn, executing the kernel operations.]
Figure 2.7: An example of global thread block synchronisation via executing the function cudaDeviceSynchronize() to explicitly synchronise all blocks.
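The kernel split illustrated in Fig. 2.7 can be sketched as a two-phase sum. In the first kernel each block reduces its slice in shared memory and writes one partial result to global memory; the host then calls cudaDeviceSynchronize() before launching a second kernel that combines the partial results. The names THREADS, partial_sums_kernel, final_sum_kernel and two_phase_sum are hypothetical, and this is an illustration under those assumptions rather than a tuned implementation.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define THREADS 256   // hypothetical block size; must be a power of two

// Phase 1: each block reduces its slice in shared memory and writes a
// single partial sum to global memory, where it outlives the kernel.
__global__ void partial_sums_kernel(const int *in, int *partial, int n)
{
    __shared__ int scratch[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        }
        __syncthreads();   // outside the branch: reached by every thread
    }
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = scratch[0];
    }
}

// Phase 2: requires every block of phase 1 to have finished.
__global__ void final_sum_kernel(const int *partial, int num_blocks, int *total)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int sum = 0;
        for (int b = 0; b < num_blocks; ++b) {
            sum += partial[b];
        }
        *total = sum;
    }
}

// Hypothetical host helper. Error checking is omitted for brevity.
int two_phase_sum(const std::vector<int> &values)
{
    int n = static_cast<int>(values.size());
    assert(n > 0);
    int blocks = (n + THREADS - 1) / THREADS;

    int *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    partial_sums_kernel<<<blocks, THREADS>>>(d_in, d_partial, n);
    cudaDeviceSynchronize();   // host waits until all blocks have completed
    final_sum_kernel<<<1, 1>>>(d_partial, blocks, d_total);
    cudaDeviceSynchronize();

    int total = 0;
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

Kernels launched into the same stream already execute in order, so the explicit cudaDeviceSynchronize() between the two launches is kept here mainly to mirror the approach described in the text.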
One restriction with this method of achieving global synchronisation is that additional expensive memory operations may be required. As previously mentioned, the lifetime of shared memory is limited to the scope of a single kernel. Therefore the contents of shared memory will have to be copied to global memory in the first kernel and back from global memory to shared memory in the second kernel. To avoid this issue, other methods have been proposed to achieve global synchronisation such as using atomic operations, but such practices are discouraged by NVIDIA as they could be unsupported after subsequent new releases of CUDA.
Atomic Operations
As there is no guarantee as to which order threads, warps and thread blocks will execute, additional atomic operations are provided to read and update global memory values without interruption from other threads accessing the same region of memory. Atomic operations perform read-modify-write on a region of global memory. Each operation locks a region of memory until the initial operating thread has read the value, modified the value and written the value back to global memory [30]. After the lock is placed on a region of memory, all other threads must wait until this lock is removed before accessing the same region of memory [33].
A simple example of an atomic operator is atomicAdd. The atomicAdd operation reads a word in global memory, adds a value to this word and writes the value back to global memory (read-modify-write). Without atomic operations two threads could potentially read the same value simultaneously, independently modify the value and write back to global memory with the second write overwriting the value of the first write. In this scenario the value written by the first thread is lost. By using the atomicAdd method, the second thread would wait until the first thread has fully completed the read-modify-write operations, ensuring the value written to global memory is the initial read value for the second thread.
Atomic operations are more expensive than read and write global memory accesses. The use of atomic operations should be limited to the absolute minimum to avoid scenarios where multiple threads are waiting on a single region of memory. For example, instead of using atomicMax, threads within a block should first use shared memory to compute the maximum value so as to avoid each thread using an atomic operation.
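The two points above can be sketched together. The first kernel below has every thread issue an atomicAdd on one global word, which is correct but serialises all threads on that word. The second first computes a per-block maximum in shared memory and then issues a single atomicMax per block. The names THREADS, atomic_sum_kernel, block_max_kernel and run_reduction are hypothetical, and the sketch assumes 32-bit integer data.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <climits>
#include <vector>

#define THREADS 256   // hypothetical block size; must be a power of two

// One atomic per thread: each read-modify-write on *sum is indivisible,
// so no update is lost, but all threads contend for the same word.
__global__ void atomic_sum_kernel(const int *in, int n, int *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(sum, in[i]);
    }
}

// One atomic per block: the maximum is first reduced in shared memory.
__global__ void block_max_kernel(const int *in, int n, int *result)
{
    __shared__ int scratch[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? in[i] : INT_MIN;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            scratch[threadIdx.x] =
                max(scratch[threadIdx.x], scratch[threadIdx.x + stride]);
        }
        __syncthreads();   // outside the branch: reached by every thread
    }
    if (threadIdx.x == 0) {
        atomicMax(result, scratch[0]);   // single atomic for the whole block
    }
}

// Hypothetical host helper: returns the maximum if want_max is true,
// otherwise the sum. Error checking is omitted for brevity.
int run_reduction(const std::vector<int> &values, bool want_max)
{
    int n = static_cast<int>(values.size());
    assert(n > 0);
    int blocks = (n + THREADS - 1) / THREADS;
    int init = want_max ? INT_MIN : 0;   // identity element of the operation

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, &init, sizeof(int), cudaMemcpyHostToDevice);

    if (want_max) {
        block_max_kernel<<<blocks, THREADS>>>(d_in, n, d_out);
    } else {
        atomic_sum_kernel<<<blocks, THREADS>>>(d_in, n, d_out);
    }

    int result = init;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

With 1000 input elements and 256 threads per block, the first kernel issues 1000 atomic operations whereas the second issues only 4, one per block.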