
6.2 QOPT High-level Code-generation

6.2.4 High-level Optimizations

High-level optimization of QKs tries to eliminate redundancy and, where possible, to fuse QKs. The goal is to fuse array expressions that access the same memory locations. There are two strategies for high-level optimization of QKs. LLVM scalar redundancy elimination using value numbering can identify fusion opportunities in simpler cases. Polyhedral dependence analysis can identify fusion opportunities in more complicated cases. QOPT’s prototype implements only the first strategy. We propose a design for extending QOPT with the polyhedral strategy.

The qketfusion procedure implements a local optimization, i.e., its scope is limited to a basic block. The procedure starts by identifying QKs that can potentially be fused. The decision rests on the following two constraints:

Constraint 6.1. Currently, only QKs that are adjacent and access at least one common mddarray reference are candidates for fusion.

Constraint 6.2. An LHS array reference of any QK in the set of adjacent QKs can be accessed in any of the RHS expressions if and only if the arguments to that RHS access function call are all zeroes.

Qketfusion only looks to fuse QKs; RQKs are not considered. Two QKs are considered adjacent if the end instruction of the first QK’s QKET is immediately followed by the start instruction of the second QK’s QKET. QOPT ignores any debug or LLVM intrinsic instructions when identifying adjacent QKs. QOPT uses only value tracking to check whether two adjacent QKs share at least one array reference. Thus, any kind of pointer-based indirection prevents fusion. If two adjacent QKs meet Constraint (6.1), they are evaluated against Constraint (6.2). This constraint ensures that QK fusion does not introduce a loop-carried dependence. It is a very conservative check that may preclude legitimate fusion. Handling such fusion cases requires data dependence-based analysis. The proposed future extension of QOPT with polyhedral data dependence analysis addresses this issue.
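The candidate test described above can be sketched in a few lines of C++. This is only an illustration of one reading of Constraints 6.1 and 6.2, not QOPT's implementation: the `QK` and `Access` structs, the instruction indices, and all function names are hypothetical, and array identity is modeled by name rather than by LLVM value tracking.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one array access: array name plus shift offsets.
struct Access {
    std::string array;
    std::vector<int> shift;  // e.g. {1, 0} for GSHIFT<1,0>(); all zeroes = unshifted
};

// Hypothetical model of a QK: the instruction range of its QKET,
// the array it writes, and the accesses on its right-hand side.
struct QK {
    std::size_t start;  // index of the QKET start instruction
    std::size_t end;    // index of the QKET end instruction
    std::string lhs;
    std::vector<Access> rhs;
};

// Adjacent: the second QKET starts right after the first one ends.
// Assumes debug and intrinsic instructions were skipped when numbering.
bool adjacent(const QK& a, const QK& b) {
    assert(a.start <= a.end && b.start <= b.end);
    return b.start == a.end + 1;
}

bool references(const QK& k, const std::string& array) {
    if (k.lhs == array) return true;
    for (const Access& r : k.rhs)
        if (r.array == array) return true;
    return false;
}

// Constraint 6.1: adjacent QKs with at least one array reference in common.
bool constraint61(const QK& a, const QK& b) {
    if (!adjacent(a, b)) return false;
    if (references(b, a.lhs)) return true;
    for (const Access& r : a.rhs)
        if (references(b, r.array)) return true;
    return false;
}

bool allZero(const std::vector<int>& v) {
    for (int x : v)
        if (x != 0) return false;
    return true;
}

// Constraint 6.2: an array written by either QK may appear on a
// right-hand side only with an all-zero shift.
bool constraint62(const QK& a, const QK& b) {
    for (const QK* k : {&a, &b})
        for (const Access& r : k->rhs)
            if ((r.array == a.lhs || r.array == b.lhs) && !allZero(r.shift))
                return false;
    return true;
}

bool fusible(const QK& a, const QK& b) {
    return constraint61(a, b) && constraint62(a, b);
}
```

Under this model, `a = b.GSHIFT<1,0>()` followed by `a += b.GSHIFT<-1,0>()` passes both checks, because the second statement reads `a` only unshifted. A statement that reads a shifted copy of an array written by its neighbor is rejected, which is the conservative behavior the text describes.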

Once a candidate set of QKs is identified, qketfusion outlines the set of QKETs into a separate function. Outlining restricts the scope of scalar redundancy elimination to only the candidate set of QKETs inside one basic block. The outlined function is optimized using LLVM’s global value numbering (GVN) redundancy elimination pass. After running GVN, qketfusion invokes a slightly modified version of the QKET generation procedure. The procedure, called qkefgen, works the same way,

b = DRILL<0>(g) * a.GSHIFT<1>() + DRILL<1>(g) * a.GSHIFT<-1>();
d = DRILL<0>(g) * c.GSHIFT<1>() + DRILL<1>(g) * c.GSHIFT<-1>();

(a) Shared DRILL expressions across two QKs.

a  = b.GSHIFT< 1, 0>();
a += b.GSHIFT<-1, 0>();
a += b.GSHIFT< 0, 1>();
a += b.GSHIFT< 0,-1>();

(b) Multiple add-assignment expressions to write a five-point stencil.

Figure 6.4: METAL array expressions fusible using qketfusion.

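using GS2D = quarc::metl::global_shape<2>;
using floatArr2D = quarc::metl::mddarray<float, GS2D>;

void unNormalizedBoxFilter(const floatArr2D &a1, floatArr2D &a2) {
  // First QK: three-point sum of a1, shifted along the first GSHIFT argument.
  auto a3 = a1.GSHIFT<-1,0>() + a1 + a1.GSHIFT<1,0>();
  // Second QK: three-point sum of a3, shifted along the second GSHIFT argument.
  a2 = a3.GSHIFT<0,-1>() + a3 + a3.GSHIFT<0,1>();
}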

Listing 6.4: An unnormalized box filter kernel from 2D image processing. These two QKs cannot be fused using GVN-based redundancy elimination.

but instead of constructing a single QKET, it generates multiple QKETs, each represented inside a QKEF. Note that a QKEF contains multiple QKETs, and some of these QKETs share common nodes. Figure 6.4a and Figure 6.4b are two examples where qketfusion can use GVN to fuse the QKs. In both cases, GVN identifies the redundant values across multiple QKs and replaces them with a single value. Figure 6.4a is an excerpt from a multiple-RHS linear solver kernel. GVN identifies the two DRILL<0>(g) and the two DRILL<1>(g) accesses that are common across both QKs. Figure 6.4b is equivalent to the five-point stencil kernel from our running example shown in Listing 6.1. Instead of writing the whole stencil as a single statement, multiple add-assignment operators are used to break it into multiple statements.
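To illustrate how value numbering exposes the shared DRILL accesses in the first example, the following sketch numbers both statements with a small hash-based value table. It is a simplified stand-in for the idea and not LLVM's GVN pass; `ValueTable`, `numberStatement`, and the string operator names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

// Simplified value numbering: identical (operator, operands) triples
// always receive the same value number.
class ValueTable {
public:
    // Returns the value number of op(lhs, rhs). Leaves and unary
    // operators leave the unused operand slots at -1.
    int number(const std::string& op, int lhs = -1, int rhs = -1) {
        assert(!op.empty());
        const auto key = std::make_tuple(op, lhs, rhs);
        const auto it = table_.find(key);
        if (it != table_.end()) return it->second;  // redundant value: reuse
        const int id = static_cast<int>(table_.size());
        table_.emplace(key, id);
        return id;
    }

    // Number of distinct values seen so far.
    std::size_t size() const { return table_.size(); }

private:
    std::map<std::tuple<std::string, int, int>, int> table_;
};

// Numbers DRILL<0>(g) * x.GSHIFT<1>() + DRILL<1>(g) * x.GSHIFT<-1>()
// for the array named x and returns the value number of the root.
int numberStatement(ValueTable& vt, const std::string& x) {
    const int g  = vt.number("g");
    const int d0 = vt.number("DRILL<0>", g);
    const int d1 = vt.number("DRILL<1>", g);
    const int xv = vt.number(x);
    const int s0 = vt.number("GSHIFT<1>", xv);
    const int s1 = vt.number("GSHIFT<-1>", xv);
    const int m0 = vt.number("*", d0, s0);
    const int m1 = vt.number("*", d1, s1);
    return vt.number("+", m0, m1);
}
```

Numbering the statement for `a` creates nine distinct values. Numbering the statement for `c` afterwards adds only six, because `g`, `DRILL<0>(g)`, and `DRILL<1>(g)` are found in the table and reused. These reused entries correspond to the common nodes that the expression trees of the two QKs share.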

§Implementation Note. As currently implemented, qketfusion cannot fuse some potential fusion candidates. Listing 6.4 shows such an example. For this example, qketfusion identifies the two QKs as potential fusion candidates. However, the QKs do not share any exact array reference, and GVN is unable to locate any redundancy. Qkefgen fails to build a QKEF for the same reason, and the two QKETs are

The optimization uses dependence analysis to identify that the array a3 can in fact be replaced by a temporary. Doing so then enables fusing these two QKs. This optimization is especially useful in image processing pipelines. The Halide image processing DSL implements this type of fusion optimization. However, Halide applies it only where the fusion is specified explicitly and externally; it does not provide an analysis framework for auto-discovery. Bhaskaracharya et al. (2016) do provide an automated polyhedral method to discover and apply this type of fusion.
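The effect of replacing a3 by a temporary can be illustrated on plain nested vectors. The sketch below is not METAL or QOPT code: `Image`, `at`, `firstDimSum`, `unfused`, and `fused` are hypothetical names, and treating out-of-range elements as zero is an assumption made only for this example, since the boundary behavior of GSHIFT is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<float>>;

// Reads img[i][j]; out-of-range elements read as zero (assumption).
float at(const Image& img, long i, long j) {
    if (i < 0 || j < 0 || i >= static_cast<long>(img.size()) ||
        j >= static_cast<long>(img[0].size()))
        return 0.0f;
    return img[static_cast<std::size_t>(i)][static_cast<std::size_t>(j)];
}

// Three-point sum along the first dimension, as in the a3 statement.
float firstDimSum(const Image& a1, long i, long j) {
    return at(a1, i - 1, j) + at(a1, i, j) + at(a1, i + 1, j);
}

// Two separate loop nests: a3 is materialized as a full array.
Image unfused(const Image& a1) {
    assert(!a1.empty());
    const long rows = static_cast<long>(a1.size());
    const long cols = static_cast<long>(a1[0].size());
    Image a3(rows, std::vector<float>(cols));
    Image a2(rows, std::vector<float>(cols));
    for (long i = 0; i < rows; ++i)
        for (long j = 0; j < cols; ++j)
            a3[i][j] = firstDimSum(a1, i, j);
    for (long i = 0; i < rows; ++i)
        for (long j = 0; j < cols; ++j)
            a2[i][j] = at(a3, i, j - 1) + at(a3, i, j) + at(a3, i, j + 1);
    return a2;
}

// One loop nest: the needed a3 values live in scalar temporaries.
Image fused(const Image& a1) {
    assert(!a1.empty());
    const long rows = static_cast<long>(a1.size());
    const long cols = static_cast<long>(a1[0].size());
    Image a2(rows, std::vector<float>(cols));
    for (long i = 0; i < rows; ++i)
        for (long j = 0; j < cols; ++j) {
            const float left  = firstDimSum(a1, i, j - 1);
            const float mid   = firstDimSum(a1, i, j);
            const float right = firstDimSum(a1, i, j + 1);
            a2[i][j] = left + mid + right;
        }
    return a2;
}
```

The fused version removes the intermediate array at the cost of recomputing each three-point sum up to three times. This trade-off between storage and recomputation is why the decision is left to an explicit schedule in systems such as Halide, as noted above.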

Apart from this array storage management example, most other QK fusion cases fall under classical loop fusion. Modern polyhedral data dependence analysis, such as that provided by ISL, can identify such cases. QOPT already integrates ISL in its compiler infrastructure. Its current use is restricted to code generation from QKSCoPs and MPI communication generation. To benefit from ISL’s data dependence analysis, we would have to expand QKSCoPs to encompass multiple QKs. Doing so would allow inter-QK dependence analysis, leading to the discovery of additional fusion and parallelization opportunities.
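As a small illustration of the legality question behind classical loop fusion, the sketch below runs a producer loop `A[i] = 2 * B[i]` and a consumer loop `C[i] = A[i + d]` either separately or fused, for sequential loops only. It does not use ISL; the brute-force comparison stands in for a real dependence test, and `run` and `fusionIsLegal` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Executes the producer and consumer either as two loops or as one
// fused loop and returns the consumer's output array C.
std::vector<int> run(int n, int d, bool fused) {
    assert(n > 0);
    std::vector<int> A(n, 0), C(n, 0);
    auto B = [](int i) { return i + 1; };  // arbitrary non-zero input
    auto readA = [&](int i) {
        const int k = i + d;
        return (k >= 0 && k < n) ? A[k] : 0;  // out of range reads as zero
    };
    if (fused) {
        for (int i = 0; i < n; ++i) {
            A[i] = 2 * B(i);
            C[i] = readA(i);  // sees A[i + d] only if it was already written
        }
    } else {
        for (int i = 0; i < n; ++i) A[i] = 2 * B(i);
        for (int i = 0; i < n; ++i) C[i] = readA(i);
    }
    return C;
}

// Distance-based test for this access pattern: fusion is safe when the
// consumer never reads an element the fused loop has not yet produced.
bool fusionIsLegal(int d) { return d <= 0; }
```

For a positive offset `d`, the fused loop reads elements of `A` before they are written, so its result differs from the unfused one. For this sequential model, offsets of zero or less are safe, which is a wider set than the all-zero offsets accepted by Constraint 6.2.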
