2.5 Hardware Acceleration
2.5.1 Basics of Single Instruction Multiple Data
In 1966, Michael Flynn proposed a classification of computer architectures based on the relation between instructions and the data they process [Flynn 1972]. According to this classification, the computers we generally use are of the Single Instruction Single Data (SISD) type. A typical instruction can look like fmul R3, R1, R2: it multiplies registers R1 and R2 and stores the result in register R3, where each register holds a single floating point value.
While this is the standard model for most applications, in graphics, as well as in physics, video compression, and other fields, we are dealing with large amounts of data that are all processed using the same code. One such example is alpha blending of two colors, where the formula A = α · B + (1 − α) · C, in which A, B, and C are color channels, is applied to every affected pixel. An obvious improvement in this case is to run the computation on multiple pixels in parallel using the exact same instructions, as each pixel performs the exact same operations.
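As a point of reference, the scalar (SISD-style) form of the blend can be sketched as follows. This is a minimal illustration only; the function names blend_channel and blend_pixels and the sample values are assumptions made for this example and do not come from any particular renderer.

```python
def blend_channel(alpha, b, c):
    """Blend one color channel: A = alpha * B + (1 - alpha) * C."""
    return alpha * b + (1.0 - alpha) * c


def blend_pixels(alpha, fg, bg):
    """Apply the same formula to every channel value, one value at a time."""
    return [blend_channel(alpha, b, c) for b, c in zip(fg, bg)]


# Hypothetical channel values for four neighboring pixels.
foreground = [1.0, 0.5, 0.0, 0.25]
background = [0.0, 0.5, 1.0, 0.75]
blended = blend_pixels(0.5, foreground, background)
```

Every iteration executes the same arithmetic on different data, which is exactly the pattern that the parallel approach exploits.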
This observation is the basis of the Single Instruction Multiple Data (SIMD) approach. Here each register contains several numbers instead of just one, and the operations are applied to each of these numbers. These registers and their associated arithmetic units are also called vector registers or vector units, as they contain, and operate on, vectors of numbers. The term SIMD lane (or simply lane) is used to identify elements of the register, such that lane 0 means we are talking about the 0th elements of registers, and the lane count refers to the number of lanes in each register. On the CPU, we most commonly encounter vector units with 4 (SSE [Raman et al. 2000, Oberman et al. 1999]) and 8 (AVX [Lomont 2011]) lanes, but other values are also quite common (e.g., SSE has 4 single precision float lanes, but only 2 double precision lanes, while Xeon Phi has 16 single precision lanes).
For brevity, the rest of the explanation will assume 4-lane SIMD. A typical instruction of this class would be simd.fmul V3, V1, V2, which would take the four floating point values stored in vector register V1, multiply each individually with the corresponding float from vector register V2, and store the individual results into vector register V3. That is, it performs the operation for i in 0 to 3: V3[i] = V1[i] · V2[i], where [i] represents the number in the ith lane of a given register. This is well complemented by the current wide memory buses, where the standard 128-bit CPU buses can deliver four floats at once. On the GPU, the buses are even wider (up to 512 bits, or 16 floats, on the Radeon R9 390 cards [AMD 2015]). Therefore, with properly aligned memory, SIMD can use a single instruction to load four floats into a register at no additional cost over loading a single float, and achieve a theoretical speedup of 4×.
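The lane-wise semantics of such an instruction can be modeled in a few lines of ordinary code. The sketch below is a behavioral model only, not real vector code: simd_load and simd_fmul are hypothetical helpers named after the instruction in the text, Python lists stand in for vector registers, and a list index stands in for a memory address.

```python
LANES = 4  # lane count assumed throughout this section


def simd_load(memory, index):
    """Model an aligned load: fetch LANES consecutive floats in one step."""
    assert index % LANES == 0, "load must start on a LANES-aligned index"
    return memory[index:index + LANES]


def simd_fmul(v1, v2):
    """Model simd.fmul: V3[i] = V1[i] * V2[i] for every lane i."""
    assert len(v1) == LANES and len(v2) == LANES
    return [v1[i] * v2[i] for i in range(LANES)]


memory_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
memory_b = [0.5] * 8

result = []
vector_multiplies = 0
for base in range(0, len(memory_a), LANES):
    v1 = simd_load(memory_a, base)
    v2 = simd_load(memory_b, base)
    result.extend(simd_fmul(v1, v2))
    vector_multiplies += 1
```

In this model, eight products are obtained with two vector multiplies instead of eight scalar ones, which is where the theoretical 4× figure comes from.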
Unfortunately, while the general code to determine color is the same for each pixel, there can still be differences that prevent a straightforward use of SIMD. The first obvious deviation from the principles mentioned above is texturing, where neighboring pixels might need to access texels that are not in contiguous memory (e.g., the texture is rotated or needs filtering, most often both). In this case, we would still map the 4 neighboring pixels to the 4 lanes of the vector registers, but the sources for the computation would come from arbitrary memory locations. To achieve this, we need to implement a gather operation where, given 4 memory addresses, 4 floats are fetched into the 4 lanes, one from each memory location. This can be implemented in hardware, where the memory controller analyzes the addresses and issues as few memory reads as possible, using crossbars and other machinery to land the floats in their respective final lanes, or in software, where the same process is done at the instruction level. The former approach is common to most GPUs, while the latter is used on standard x86 CPUs.
The basic software approach is to perform a 128-bit aligned load of 4 floats for each of the 4 addresses, use bit operations (AND and OR) to mask the up to 16 loaded floats down to the required 4, and then use swizzle operations (which move floats between lanes) to move them into the target lanes. If two addresses would load the same set of 4 floats, the second load is skipped, so loads from consecutive (or identical) addresses can be up to 4× faster than the worst case. An inverse operation, scatter, also exists and stores results from the individual lanes to 4 separate addresses.
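The software gather described above can be sketched as a behavioral model. The names gather and scatter, the loaded_blocks bookkeeping, and the use of list indices as addresses are assumptions for illustration. The mask and swizzle steps are collapsed into a single indexed pick, so this shows the data movement and load skipping rather than the actual bit operations.

```python
LANES = 4


def gather(memory, addresses):
    """Fetch one float per lane from arbitrary addresses.

    Returns the gathered vector and the number of aligned loads issued.
    """
    result = [0.0] * LANES
    loaded_blocks = {}  # aligned base index -> block of LANES floats
    for lane, address in enumerate(addresses):
        base = address - address % LANES  # start of the aligned block
        if base not in loaded_blocks:     # skip the load if already fetched
            loaded_blocks[base] = memory[base:base + LANES]
        source_lane = address % LANES
        # Stand-in for mask + swizzle: keep one element, move it to `lane`.
        result[lane] = loaded_blocks[base][source_lane]
    return result, len(loaded_blocks)


def scatter(memory, addresses, values):
    """Inverse operation: store each lane's value at its own address."""
    for address, value in zip(addresses, values):
        memory[address] = value


memory = [float(i) for i in range(16)]
near, near_loads = gather(memory, [5, 6, 4, 7])    # all in one aligned block
far, far_loads = gather(memory, [0, 5, 10, 15])    # four different blocks
scatter(memory, [1, 8, 3, 12], [-1.0, -2.0, -3.0, -4.0])
```

The first call needs a single aligned load while the second needs four, which mirrors the best-case and worst-case behavior mentioned in the text.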
Masking of the lanes is also used to implement conditional statements, the second deviation from the straightforward application of SIMD. The code that computes the final pixel color will often have conditional statements (e.g., if) where individual pixels take different branches. One example would be clamping of opacity, where pixels with opacity greater than 95% would have it snapped to 100%. For each condition, there are generally three possible outcomes: all lanes take the then branch; all lanes take the else branch; or some lanes take then and some lanes take else. The first two cases are fairly trivial: the condition simply changes the instruction flow and all lanes use the instructions from the given branch. However, in the case where the lanes disagree, the code will take both branches (i.e., first then and then else), using the previously introduced masking principle to store the results of each branch for only the lanes that should be affected.
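The three outcomes can be illustrated with the opacity clamping example. This is a sketch under assumptions: simd_select and clamp_opacity are hypothetical names, the mask is a list of booleans rather than a bit pattern, and the second return value exists only to show which path was taken.

```python
LANES = 4


def simd_select(mask, then_values, else_values):
    """Per lane, keep the then result where the mask is set, else the other."""
    return [t if m else e for m, t, e in zip(mask, then_values, else_values)]


def clamp_opacity(opacity):
    """Snap opacity above 95% to 100% for all lanes of one vector."""
    mask = [o > 0.95 for o in opacity]  # per-lane condition
    if all(mask):
        # All lanes agree on then: only the then branch is executed.
        return [1.0] * LANES, "then"
    if not any(mask):
        # All lanes agree on else: only the else branch is executed.
        return list(opacity), "else"
    # Lanes disagree: run both branches, then merge the results by mask.
    then_result = [1.0] * LANES
    else_result = list(opacity)
    return simd_select(mask, then_result, else_result), "both"


mixed, mixed_path = clamp_opacity([0.5, 0.96, 0.95, 1.0])
opaque, opaque_path = clamp_opacity([0.99, 0.97, 0.96, 1.0])
faint, faint_path = clamp_opacity([0.1, 0.2, 0.3, 0.4])
```

In the divergent case both branches are paid for, so a vector whose lanes disagree costs roughly the sum of the two branches rather than the cost of one.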
While it is possible to implement these principles by hand, explicitly taking care of the gather, scatter, and branching, there are also automated tools that make this task easier. In 2007, NVIDIA introduced their Compute Unified Device Architecture (CUDA, [NVIDIA 2015]) and with it a programming model they called Single Instruction Multiple Threads (SIMT). In this model, the programmer writes standard scalar code that is then automatically mapped onto a SIMD machine. The language provides built-in variables that the program can use to identify in which SIMD lane it is executed, which makes it possible to map a unit of computation (e.g., a pixel index) to a given lane. The same approach is also used by the cross-platform OpenCL standard [Khronos OpenCL Working Group and Munshi 2015]. Intel’s ispc compiler [Pharr and Mark 2012] compiles scalar code written in a slightly extended C language to the SSE and AVX SIMD targets, as well as to Intel’s Xeon Phi coprocessors [Reinders 2012]. The output of the compiler can be linked into a standard CPU application, allowing users to use ispc for only certain parts of their algorithm. The programming model is the same as in CUDA and OpenCL, except that Intel calls this approach Single Program Multiple Data (SPMD). A more detailed description of CUDA and its terminology is given in Section 4.2.1.
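The idea of writing scalar code that is mapped onto lanes can be sketched as follows. This is a conceptual model only and does not use CUDA, OpenCL, or ispc syntax: blend_kernel, launch, and the lane_index parameter are hypothetical stand-ins for a kernel, a launch mechanism, and a built-in lane variable. The inner loop runs sequentially here, whereas on real hardware the lanes would execute together.

```python
LANES = 4


def blend_kernel(lane_index, base, alpha, fg, bg, out):
    """Scalar program: each instance handles exactly one pixel."""
    pixel = base + lane_index          # map this lane to a pixel index
    if pixel < len(out):               # lanes past the end do nothing
        out[pixel] = alpha * fg[pixel] + (1.0 - alpha) * bg[pixel]


def launch(kernel, count, *args):
    """Run the scalar kernel over `count` items, LANES items per step."""
    for base in range(0, count, LANES):
        for lane_index in range(LANES):  # conceptually simultaneous
            kernel(lane_index, base, *args)


fg = [1.0] * 6
bg = [0.0] * 6
out = [0.0] * 6
launch(blend_kernel, len(out), 0.25, fg, bg, out)
```

The kernel itself contains no vector code. The bounds check doubles as a small example of a per-lane condition: in the last step only two of the four lanes have a pixel to process.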