7.1.1 Stencil kernels
A stencil kernel is an iterative computation that updates an array element according to a fixed computational pattern involving neighboring array elements in the same or in a separate array. The fixed computational pattern is known as a stencil. Stencil computations are encountered in several scientific domains, such as differential equation solvers, image processing, finite-element methods, and cellular automata. For this reason, stencil kernels are recognized as one of the core kernels of scientific computing (Asanovic et al., 2009).
A large body of work exists on optimizing stencil kernels for different performance concerns, such as cache reuse, communication optimizations, and SIMD vectorization (Roth et al., 1997), (Kamil et al., 2005), (Datta et al., 2008), (Tang et al., 2011), (Ragan-Kelley et al., 2013), (Henretty et al., 2013), (Acharya and Bondhugula, 2015), (Rawat et al., 2015), (Yount, 2015). A majority of these methods aim to improve the cache performance of stencil kernel loops by introducing loop tiling or cache blocking. Roth et al. (1997) explored data communication optimizations to improve the performance of stencil kernels on distributed-memory machines.
From this list of prior work, Henretty et al. (2011) and Yount (2015) notably looked at techniques specific to improving SIMD vectorization performance on IA short-vector architectures.
SIMD vectorization streams contiguous chunks of memory into short SIMD vector registers that have multiple channels, or lanes. A single SIMD instruction is then applied to the data in each lane. The main limitation of SIMD vectorization is that the contents of two vector registers need to be stream aligned, i.e., the operands of a SIMD operation need to be in the same lane of their respective registers. Typically, cross-lane SIMD operations are not supported by short-vector machines. Henretty et al. used the term stream alignment conflict to describe this problem. To overcome stream alignment conflicts, they proposed a data-layout transformation technique. This technique, which they termed dimension-lift and transpose, is subsumed in our ρφ algebra. Yount used a different technique: his method applies in-register swizzle operations to get the data stream aligned.
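The lane requirement can be illustrated with a minimal sketch. It assumes that vector registers are filled from addresses aligned to the vector width, so that array element e lands in lane e mod W. The helper names lane_of and stream_aligned are hypothetical and are not part of any library or of the cited work.

```cpp
#include <cassert>
#include <cstddef>

// Lane occupied by array element `index`, assuming registers are loaded
// from addresses aligned to the vector width (an assumption of this sketch).
constexpr std::size_t lane_of(std::size_t index, std::size_t vector_width) {
  assert(vector_width > 0);
  return index % vector_width;
}

// True when the two operands occupy the same lane of their respective
// registers, so a single SIMD instruction can combine them directly.
constexpr bool stream_aligned(std::size_t a, std::size_t b,
                              std::size_t vector_width) {
  return lane_of(a, vector_width) == lane_of(b, vector_width);
}
```

With an eight-lane register, the operands A[7] and A[9] needed for output element 8 of a sum A[x-1] + A[x+1] fall in lanes 7 and 1, so stream_aligned(7, 9, 8) is false. Operands whose indices differ by a multiple of the vector width, such as A[7] and A[15], share a lane.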
7.1.2 Short-vector SIMD architectures
The SIMD vector processing units on most modern architectures, such as x86, AMD64, Power, and ARM64, are classified as streaming SIMD multimedia extensions. These architectures use Instruction Set Architecture (ISA) extensions to add short-vector registers that support SIMD vectorization. The register size of SIMD ISA extensions on IA has doubled with each newer processor generation. The earliest Intel MMX extension offered 64-bit vector registers that provided eight 8-bit lanes. As of 2018, this has increased to 512 bits in current architectures that support the AVX512 ISA extension. A 512-bit vector register has eight 64-bit lanes, sixteen 32-bit lanes, or sixty-four 8-bit lanes.
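The lane counts quoted above follow from dividing the register width by the element width. A minimal sketch of that arithmetic is shown here; the function name lane_count is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of lanes in a vector register of `register_bits` bits when each
// element occupies `element_bits` bits.
constexpr std::size_t lane_count(std::size_t register_bits,
                                 std::size_t element_bits) {
  assert(element_bits > 0 && register_bits % element_bits == 0);
  return register_bits / element_bits;
}
```

For example, lane_count(512, 32) gives the sixteen 32-bit lanes of a 512-bit register, and lane_count(64, 8) gives the eight 8-bit lanes of a 64-bit MMX register.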
These ISA extensions were introduced primarily to improve graphics and multimedia application performance on consumer-grade processors. Graphics applications use different numbers of bits to indicate the color of a pixel. This value is known as the bit depth or color depth. Graphics applications from the 1990s mostly used 8-bit or 16-bit color. Most applications in the 2010s have moved to 24-bit or 32-bit color depths. SIMD ISA extensions allow packing multiple color values into a SIMD register and operating on them in parallel.
    // T, Z, Y, and X are the array extents, assumed to be compile-time
    // constants declared elsewhere. __restrict is a compiler extension
    // (GCC, Clang, MSVC); standard C++ has no restrict qualifier.
    void stencil_9pt(float (*__restrict A1)[Z][Y][X],
                     const float (*__restrict A2)[Z][Y][X]) {
      for (auto t = 1ul; t < T - 1; ++t)
        for (auto z = 1ul; z < Z - 1; ++z)
          for (auto y = 1ul; y < Y - 1; ++y)
            for (auto x = 1ul; x < X - 1; ++x) {
              // Sum the two neighbors along each of the four dimensions.
              A1[t][z][y][x] = A2[t - 1][z][y][x] + A2[t + 1][z][y][x]
                             + A2[t][z - 1][y][x] + A2[t][z + 1][y][x]
                             + A2[t][z][y - 1][x] + A2[t][z][y + 1][x]
                             + A2[t][z][y][x - 1] + A2[t][z][y][x + 1];
            }
      // ... Elided boundary region computations
    }
Listing 7.1: A nine-point scalar stencil
Short-vector architectures differ significantly from large vector processors such as the Cray-1 from the 1970s and 1980s, and even from GPGPUs. Most short-vector SIMD ISA extensions do not offer conditional execution via mask registers. They also do not offer non-unit stride and gather-scatter addressing modes in hardware. This is largely due to the memory organization of modern architectures. Almost all modern DRAM chips are organized as a two-dimensional array of DRAM cells. Rather than addressing individual memory locations, memory addresses are multiplexed into two parts, i.e., row address selection and column address selection. Memory is addressed first using the row address, and then the column address is decoded to access an individual element in that row. Data is also not moved as individual words; rather, it is always accessed in cache-line sized blocks. For these reasons, no gather-scatter addressing at the level of words is implemented in hardware.
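The cost of non-unit stride access under cache-line granularity can be estimated with a small sketch. It counts how many distinct cache lines are touched when one vector's worth of elements is read at a given stride. The function name cache_lines_touched is hypothetical, and the sketch assumes a line-aligned base address and an element size that divides the line size.

```cpp
#include <cassert>
#include <cstddef>

// Number of distinct cache lines touched when reading `count` elements of
// `element_bytes` bytes each, spaced `stride` elements apart, starting at a
// line-aligned address. Byte offsets grow monotonically, so a new line is
// counted whenever the line index changes.
std::size_t cache_lines_touched(std::size_t count, std::size_t stride,
                                std::size_t element_bytes,
                                std::size_t line_bytes) {
  assert(stride > 0 && element_bytes > 0);
  assert(line_bytes % element_bytes == 0);
  std::size_t lines = 0;
  std::size_t previous_line = 0;
  for (std::size_t i = 0; i < count; ++i) {
    const std::size_t line = (i * stride * element_bytes) / line_bytes;
    if (i == 0 || line != previous_line) ++lines;
    previous_line = line;
  }
  return lines;
}
```

Assuming 64-byte cache lines, sixteen contiguous 4-byte floats fit in a single line, whereas the same sixteen floats read at a stride of sixteen elements touch sixteen different lines.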
7.1.3 Stream alignment conflict
The stream alignment conflict (SAC) metric is an array reuse distance-based measure for detecting scenarios that need data-layout transformations. A SAC occurs when the same array element is read more than once in successive iterations of an innermost loop, and the reuse distance is less than or equal to the architectural SIMD register width. For such cases, vectorization of the innermost loop would lead to the overlapping SIMD register scenario that was illustrated in Section 2.3. The following definitions formalize this notion.
Definition 7.1. Reuse distance (Rd)
The Rd of an array element is defined as the number of other distinct array elements that are accessed between two consecutive accesses of the same array element. Rd is measured in terms of the number of memory references, and is a measure of temporal locality.
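Definition 7.1 can be evaluated directly on a recorded access trace. The following is a minimal sketch under stated assumptions: the trace is a list of element indices in program order, and the function names stencil_trace and reuse_distance are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Element indices read by a loop over x in [begin, end) whose body reads
// A[x + offset] for each entry of `offsets`, in program order.
std::vector<long> stencil_trace(long begin, long end,
                                const std::vector<long>& offsets) {
  std::vector<long> trace;
  for (long x = begin; x < end; ++x)
    for (long offset : offsets) trace.push_back(x + offset);
  return trace;
}

// Number of other distinct elements referenced strictly between positions
// `first` and `second` of the trace, which must refer to the same element.
std::size_t reuse_distance(const std::vector<long>& trace,
                           std::size_t first, std::size_t second) {
  assert(first < second && second < trace.size());
  assert(trace[first] == trace[second]);
  std::unordered_set<long> distinct;
  for (std::size_t i = first + 1; i < second; ++i)
    if (trace[i] != trace[first]) distinct.insert(trace[i]);
  return distinct.size();
}
```

For a loop that reads A[x-1] and A[x+1], the trace for x = 1 to 5 is 0, 2, 1, 3, 2, 4, 3, 5, 4, 6. Element 2 appears at positions 1 and 4, with elements 1 and 3 in between, so its reuse distance is two.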
Definition 7.2. Stream alignment conflict
A SAC exists inside a stencil kernel if there are two distinct array read accesses a1 and a2 that access the same array element in two different iterations i and i′, and the Rd of that element is less than or equal to the architectural SIMD register width.
Listing 7.1 is the same stencil used in Section 2.3. Definition 7.2 offers a way to identify the need for a data-layout transformation in this case. The Rd of each array element for the two accesses A2[x-1] and A2[x+1] is two, which is less than the SIMD vector width of eight. However, if the accesses are changed to A2[x-1] and A2[x+7], the Rd is eight, and a SAC does not exist. Both scenarios assume that the array has the default lexicographic row-major data-layout.
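The two cases above can be checked with a small sketch. It assumes that, for two reads differing only in the innermost index, the distance is taken as the difference between their offsets, which reproduces the values two and eight quoted above. Whether a distance exactly equal to the vector width counts as a conflict is a boundary choice; this sketch follows the worked example, in which a distance of eight with an eight-wide vector is conflict-free. The names offset_distance and has_sac are hypothetical.

```cpp
#include <cassert>

// Distance, in innermost-loop iterations, between the two reads of the same
// element made by A[x + offset_a] and A[x + offset_b].
constexpr long offset_distance(long offset_a, long offset_b) {
  return offset_a > offset_b ? offset_a - offset_b : offset_b - offset_a;
}

// Assumption of this sketch: a conflict exists when the distance is nonzero
// and strictly smaller than the vector width, as in the worked example.
constexpr bool has_sac(long offset_a, long offset_b, long vector_width) {
  assert(vector_width > 0);
  const long distance = offset_distance(offset_a, offset_b);
  return distance > 0 && distance < vector_width;
}
```

Here has_sac(-1, 1, 8) is true for the accesses A2[x-1] and A2[x+1], while has_sac(-1, 7, 8) is false for A2[x-1] and A2[x+7].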
7.1.4 Mitigating SAC
Data-layout transformations based on our ρφ algebra (Chapter 4) are a way to mitigate SAC. When using a ρφ to define a SIMD vectorizable data-layout, one or more array dimensions are reshaped and then transposed to build a new innermost dimension. This dimension is the same size as the SIMD vector width. Section 5.3 presented QUARC’s ATL interface, which is used for defining such transformations. The primary rationale for the transformation is to place the array elements in memory so that the Rd for any pair of array references in a stencil kernel does not result in a SAC.
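The reshape-then-transpose idea can be sketched for a one-dimensional array. This is an illustration in the style of dimension-lift and transpose, not QUARC's ATL interface or the ρφ algebra itself. The function names dlt_position and dlt_transform are hypothetical, and the array length is assumed to be a multiple of the vector width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Storage position of logical element `i` after viewing a length-n array as
// vector_width rows of n / vector_width elements and transposing it, so the
// new innermost dimension has vector_width elements.
constexpr std::size_t dlt_position(std::size_t i, std::size_t n,
                                   std::size_t vector_width) {
  assert(vector_width > 0 && n % vector_width == 0 && i < n);
  const std::size_t row_length = n / vector_width;
  const std::size_t row = i / row_length;     // becomes the lane
  const std::size_t column = i % row_length;  // becomes the vector index
  return column * vector_width + row;
}

// Copies `input` into the transformed layout.
std::vector<float> dlt_transform(const std::vector<float>& input,
                                 std::size_t vector_width) {
  std::vector<float> output(input.size());
  for (std::size_t i = 0; i < input.size(); ++i)
    output[dlt_position(i, input.size(), vector_width)] = input[i];
  return output;
}
```

With 32 elements and a vector width of eight, logical elements 4, 5, and 6 are stored at positions 1, 9, and 17. All three positions fall in lane 1 of consecutive vectors, so the neighbors of element 5 are in the same lane as the element itself. Neighbors that straddle a row boundary of the reshaped array, such as elements 3 and 4, still land in different lanes and need separate handling.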