Stencil Codes: Metrics and Optimization

Stencil computations are classified as memory-bound rather than compute-bound because their performance is limited by memory bandwidth rather than by arithmetic throughput. To quantify the memory-boundedness of stencil codes, we describe two performance metrics. The first is Arithmetic Intensity (AI), measured in FLOPs/byte: the ratio of floating-point operations (FLOPs) to the bytes fetched from main memory or the caches [107]. A low AI indicates a kernel limited by memory bandwidth, such as those found in sparse linear algebra applications [54]. As an example, consider the weighted Jacobi iteration, or smoother:

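$$v_{i,j,k} = \omega \times u_{i,j,k} + \bar{\omega} \times \left( u_{i,j,k+1} + u_{i,j,k-1} + u_{i,j+1,k} + u_{i,j-1,k} + u_{i+1,j,k} + u_{i-1,j,k} + H \times f_{i,j,k} \right). \tag{2.48}$$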

Equation (2.48) contains 3 multiplications and 7 additions, i.e. 10 FLOPs. Assuming the data type is double (8 bytes), the memory accessed in Equation (2.48) is 9 × 8 = 72 bytes. Only the array accesses (such as $v_{i,j,k}$ or $u_{i+1,j,k}$) are counted, because the constants ω, H and ω̄ can be held in processor registers. The theoretical AI is the number of FLOPs divided by the number of bytes accessed; thus AI = 10/72 ≈ 0.14 FLOPs/byte. Typically, the maximum AI for stencil codes is 1 FLOP/byte [108]. Operational Intensity (OI) is a related term that refers to the data movement between the caches and main memory rather than between the caches and the processor [54]. It is usually expressed in FLOPs/DRAM byte, with the word DRAM differentiating it from AI. Unfortunately, in practice, because of various NUMA effects and the behaviour of modern cache systems, computing AI or OI is not straightforward [108].
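
The counting for Equation (2.48) can be reproduced in a few lines. The following is a minimal Python sketch rather than code from the original work; the names `arithmetic_intensity`, `flops`, `bytes_moved` and `dram_bytes` are illustrative. The second figure rests on an idealized assumption that is not part of the text: with perfect cache reuse, each of the three arrays u, f and v moves between DRAM and the caches only once per grid point.

```python
# Counts read off Equation (2.48); all names here are illustrative.
MULTIPLICATIONS = 3    # omega * u, omega_bar * (...), H * f
ADDITIONS = 7          # six inside the parentheses, one outside
ARRAY_ACCESSES = 9     # one store to v, seven loads of u, one load of f
BYTES_PER_DOUBLE = 8


def arithmetic_intensity(flops, bytes_moved):
    """Return FLOPs per byte for a single stencil update."""
    return flops / bytes_moved


flops = MULTIPLICATIONS + ADDITIONS
bytes_moved = ARRAY_ACCESSES * BYTES_PER_DOUBLE
ai = arithmetic_intensity(flops, bytes_moved)
print(f"AI = {flops} / {bytes_moved} = {ai:.2f} FLOPs/byte")

# ASSUMPTION (idealized, not from the text): with perfect cache reuse each of
# the three arrays u, f and v crosses the DRAM interface once per grid point.
dram_bytes = 3 * BYTES_PER_DOUBLE
oi = arithmetic_intensity(flops, dram_bytes)
print(f"Idealized OI = {flops} / {dram_bytes} = {oi:.2f} FLOPs/DRAM byte")
```

The gap between the two figures shows why AI and OI have to be kept apart: the same kernel looks more or less memory-bound depending on which level of the hierarchy the bytes are counted at.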

The Roofline model is a visual model that gives programmers and designers insight into how to optimize floating-point computations [54]. The roofline gets its name from two lines: a horizontal line marking the peak floating-point performance (a hardware limit) and a diagonal line marking the performance attainable at the maximum memory bandwidth (in GBytes/sec) as the operational intensity varies. The diagonal line is obtained from the memory bandwidth measured with the STREAM benchmark [109]: at each operational intensity, the attainable performance is that bandwidth multiplied by the operational intensity. The angle that the diagonal makes with the horizontal axis depends on the scales chosen for the plot. An advantage of the Roofline model is that it needs to be constructed only once per multicore system rather than once per computational kernel. If a vertical line drawn from a kernel's operational intensity on the horizontal axis hits the roof (the horizontal line), the kernel is compute bound; if it hits the diagonal line, the kernel is bound by memory traffic. The X-coordinate of the ridge point (the intersection of the diagonal with the roof) gives the minimum operational intensity at which peak floating-point performance can be reached. It is therefore preferable for the ridge point to lie as far to the left as possible, so that even kernels with a very small operational intensity can achieve the theoretical peak FLOPS. Additional rooflines (ceilings), such as those for ILP (Instruction Level Parallelism), software prefetching and SIMD (Single Instruction Multiple Data), can be added to the Roofline model; their limits can be obtained by running appropriate benchmarks [54].
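
The two lines of the Roofline model reduce to a single bound: attainable performance is the smaller of the peak floating-point rate and the product of memory bandwidth and operational intensity. The Python sketch below illustrates this under stated assumptions. The machine figures `PEAK_GFLOPS` and `STREAM_BANDWIDTH_GBS` are hypothetical values chosen for illustration, not measurements, and the function names are not taken from any library.

```python
def attainable_gflops(oi, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(peak compute, bandwidth * operational intensity)."""
    return min(peak_gflops, bandwidth_gbs * oi)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Operational intensity at which the diagonal meets the horizontal roof."""
    return peak_gflops / bandwidth_gbs


# HYPOTHETICAL machine parameters, for illustration only.
PEAK_GFLOPS = 100.0            # horizontal roof
STREAM_BANDWIDTH_GBS = 25.0    # sustained bandwidth, e.g. as reported by STREAM

ridge = ridge_point(PEAK_GFLOPS, STREAM_BANDWIDTH_GBS)
for oi in (0.14, 1.0, ridge, 8.0):
    bound = attainable_gflops(oi, PEAK_GFLOPS, STREAM_BANDWIDTH_GBS)
    regime = "compute bound" if oi >= ridge else "memory bound"
    print(f"OI = {oi:5.2f} FLOPs/byte -> at most {bound:6.1f} GFLOPS ({regime})")
```

On this hypothetical machine, a kernel with the intensity of the weighted Jacobi smoother (about 0.14 FLOPs/byte) sits far to the left of the ridge point and is therefore limited by the diagonal.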

General cache optimization techniques can be applied to stencil codes. Several efforts have sought to exploit the spatial and temporal locality of the cache memory hierarchy in order to bridge the gap between fast processor speeds and comparatively slow memory access times [5, 12–16]. Researchers advocate fetching a larger fraction of data from the higher levels of the memory hierarchy, such as registers and the L1 cache, and a smaller fraction from the lower levels, such as the L3 cache and main memory [15]. The major source of cache misses is nested loops that access the same data repeatedly. Data access optimizations are transformations that change the order in which data is accessed in loops so as to exploit temporal locality [105]. Transformations such as loop skewing, loop peeling, loop unrolling, loop interchange, loop fusion (or jamming), loop fission, and loop blocking (or tiling) help make better use of caches and expose available parallelism [105, 110]. Cache tiling/blocking techniques have been heavily researched; they aim to bring a sub-domain of the data into the cache and work on it, instead of traversing the entire domain in a single iteration [5, 15, 16]. The effectiveness of these techniques on modern microprocessors has decreased owing to advances in compiler technology and the increasing size of on-chip caches [80].
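
Loop blocking (tiling) can be illustrated on a small 2-D Jacobi sweep. The following Python sketch is an illustration only and does not come from the cited works: the 5-point averaging kernel, the function names and the tile sizes are assumptions. It shows the transformation itself, namely that the tiled loop nest visits the same points in a different order and produces the same result. Pure Python does not reproduce the cache behaviour of a compiled kernel.

```python
def jacobi_sweep(u):
    """Untiled 5-point Jacobi sweep; boundary values are copied unchanged."""
    n, m = len(u), len(u[0])
    v = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            v[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return v


def jacobi_sweep_tiled(u, tile_i, tile_j):
    """Same sweep, but the interior is traversed one tile_i x tile_j block at a time."""
    n, m = len(u), len(u[0])
    v = [row[:] for row in u]
    for ii in range(1, n - 1, tile_i):           # loop over tiles
        for jj in range(1, m - 1, tile_j):
            for i in range(ii, min(ii + tile_i, n - 1)):      # loop inside a tile
                for j in range(jj, min(jj + tile_j, m - 1)):
                    v[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
    return v


# Arbitrary test grid; the tile sizes (4, 3) are illustrative, not tuned.
grid = [[float((3 * i + 5 * j) % 7) for j in range(10)] for i in range(9)]
print(jacobi_sweep(grid) == jacobi_sweep_tiled(grid, 4, 3))
```

Because a Jacobi sweep reads only from the old grid and writes only to the new one, any traversal order is valid, which is what makes tiling a legal transformation here.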

In red-black methods, a variant of Gauss-Seidel, the updates of red and black points can be combined in a single sweep by updating the red points in row i followed by the black points in row i − 1. In the same context, a blocking technique allows multiple updates of a red/black point, i.e. it re-uses cached data across multiple time-steps by updating the red points in rows i and i − 2 and the black points in rows i − 1 and i − 3 in the same pass [15]. A 2-D blocking technique that sweeps a parallelogram-shaped block through the grid has been proposed as an improvement on the simple blocking technique [15]. Further, the red and black points of the unknowns and the corresponding right-hand side values can be stored in separate arrays to reduce the traffic between the levels of the cache hierarchy, although the total traffic to main memory remains the same [111].
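
The fused sweep, red points in row i followed by black points in row i − 1, can be checked against the conventional two-sweep ordering. The Python sketch below is a minimal illustration under assumptions that are not in the text: a 2-D 5-point stencil, red points defined by (i + j) even, a fixed boundary layer, and hypothetical function names. It is not the implementation from the cited work.

```python
def update_row(u, f, h2, i, colour):
    """Gauss-Seidel update of all points of one colour in interior row i."""
    n = len(u) - 2
    start = 1 if (i + 1) % 2 == colour else 2   # first column j with (i + j) % 2 == colour
    for j in range(start, n + 1, 2):
        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                          + h2 * f[i][j])


def red_black_two_sweeps(u, f, h2):
    """Reference ordering: a full red sweep, then a full black sweep."""
    n = len(u) - 2
    for colour in (0, 1):                       # 0 = red, 1 = black
        for i in range(1, n + 1):
            update_row(u, f, h2, i, colour)


def red_black_fused(u, f, h2):
    """Single sweep: red points in row i, then black points in row i - 1."""
    n = len(u) - 2
    for i in range(1, n + 1):
        update_row(u, f, h2, i, 0)
        if i > 1:
            update_row(u, f, h2, i - 1, 1)      # its red neighbours are already updated
    update_row(u, f, h2, n, 1)                  # last black row has no row n + 1 to wait for


def make_grid(n):
    """(n + 2) x (n + 2) grid with arbitrary values; the outer layer is the boundary."""
    return [[float((3 * i + 5 * j) % 7) for j in range(n + 2)] for i in range(n + 2)]


n = 6
f = [[1.0] * (n + 2) for _ in range(n + 2)]
h2 = (1.0 / (n + 1)) ** 2
a, b = make_grid(n), make_grid(n)
red_black_two_sweeps(a, f, h2)
red_black_fused(b, f, h2)
print(a == b)
```

The two orderings agree because a red point depends only on black neighbours that have not yet been touched, while the black points in row i − 1 depend only on red points in rows i − 2, i − 1 and i, all of which have been updated by the time they are visited. The fused version passes over the grid once instead of twice.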

Early, ground-breaking work proposed partial 3-D blocking for 3-D loops, which maximizes the extent of the dimension in which the data is contiguous [6]. Analytical cost models for cache tiling fail to address the difference between load and store operations [16]. Further, cache conflict misses occur when data is read from and written to different grids, represented by multi-dimensional arrays in memory, as is the case for Jacobi updates [17]. These cache optimization techniques also interfere with the automatic optimizations implemented in the hardware and software of modern microprocessors. These automatic optimizations can be called streaming techniques; they include SIMD (Single Instruction Multiple Data) instructions (also called vectorization) and prefetching. Researchers have explored platform-specific hardware optimizations together with software optimizations to enhance performance on platforms such as IA64 (Itanium Architecture) [111]. To reap the full benefit of parallelism on specific CPUs, explicit SIMDization has to be implemented [80].

Microbenchmarks, including the Stanza Triad (STriad) and Stencil Probe, have been created to act as proxies for modelling the prefetch behaviour of the actual program [13, 16]. These benchmarks do not account for packing and unpacking times or for the changing latency that arises when derived datatypes are used in MPI implementations [48, 61]. Researchers have used hardware performance counters (for cache misses, Translation Look-Aside Buffer (TLB) misses, mispredicted branches and hardware prefetches) together with regression analysis to predict the performance of stencil codes [18]. Cache-oblivious/transcendental [112] algorithms have been proposed, which ignore the hardware characteristics of caches, as opposed to cache-aware algorithms, which use the cache specifications to minimize cache misses. The idea behind every memory optimization is to minimize the number of data accesses that occur between successive accesses to the same memory location [15]. The ExaStencils project advocates a Domain Specific Language (DSL) for generating stencil codes, ranging from an abstract mathematical description to highly optimized code for a particular platform [113]. It further stresses that stencils come in many varieties and that switching from one form of a stencil to another is non-trivial in terms of coding effort. Several domain-specific stencil initiatives exist with different goals, such as autotuning, applying cache obliviousness, and adding abstractions to a high-level language [113].

In document Efficient Domain Partitioning for Stencil-based Parallel Operators (Page 68-71)