Loop body generation - QOPT Late Scalarization

6.4 QOPT Late Scalarization

6.4.4 Loop body generation

Loop body generation for QKSCoP statements is the final compilation step inside QOPT. The process performs a post-order walk of the QKET’s RHS sub-tree, emitting the required code for each QKET node that it visits. The qketBuildRHS procedure shown in Figure 6.12 presents a high-level overview of the loop body generation steps. Each QKET node type has different code-generation requirements. This section goes over the specifics of each case.

Building a DRILL expression node requires only updating the index vector that is used to build a linearized array access. The METAL DRILL operator generates a constant index value for a nested sdlarray dimension. This constant value is appended to the index vector and used during address linearization.
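The effect of appending a DRILL constant to the index vector can be sketched in C++ as follows. This is only an illustration: the helper names `linearize` and `drill` are hypothetical, and a row-major layout with the nested sdlarray dimensions innermost is an assumption rather than something stated in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: row-major linearization of an index vector.
// `extents` lists the extent of every dimension, outer mddarray
// dimensions first, followed by the nested sdlarray dimensions (assumed order).
std::size_t linearize(const std::vector<std::size_t>& index,
                      const std::vector<std::size_t>& extents) {
    assert(index.size() == extents.size());
    std::size_t offset = 0;
    for (std::size_t d = 0; d < index.size(); ++d) {
        assert(index[d] < extents[d]);
        offset = offset * extents[d] + index[d];
    }
    return offset;
}

// Hypothetical stand-in for what a DRILL node contributes: a compile-time
// constant index for one nested dimension is appended to the index vector.
std::vector<std::size_t> drill(std::vector<std::size_t> index,
                               std::size_t constantIndex) {
    index.push_back(constantIndex);
    return index;
}
```

For a 4×5 mddarray of 3×3 elements, drilling the site index {1, 2} with the constants 2 and 1 gives the index vector {1, 2, 2, 1}, which `linearize` maps to a single offset from the base address.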

Procedure qketBuildRHS

Input: QKET qketNode
Input: QKSCoP for the QKET, qkscop
Output: Vector of LLVM virtual register values V

    switch qketNode.node type do
        case DRILL do
            updateIndexVector();
            lv ← qketBuildRHS(qketNode.leftChild);
            V ← lv;
        end
        case IF EVEN CHOOSE do
            lv ← qketBuildRHS(qketNode.leftChild);
            rv ← qketBuildRHS(qketNode.rightChild);
            V ← applyIfConversion(lv, rv);
        end
        case binary mkernel do
            lv ← qketBuildRHS(qketNode.leftChild);
            rv ← qketBuildRHS(qketNode.rightChild);
            V ← inlineMkernel(lv, rv);
        end
        case unary mkernel do
            lv ← qketBuildRHS(qketNode.leftChild);
            V ← inlineMkernel(lv);
        end
        case array access do
            addInlineBoundaryChecks(qkscop, qketNode);
            V ← createArrayAccess(qketNode);
        end
        case scalar access do
            V ← createScalarAccess(qketNode);
        end
        otherwise do
            Other
        end
    end
    return V;

Figure 6.12: A recursive post-order walk is used to lower the RHS sub-tree of a QKET into LLVM instructions. The output of qketBuildRHS is a vector of LLVM virtual registers that hold the results of memory read operations or arithmetic operations. The LLVM virtual register values are assigned to the LHS using LLVM memory write operations.

Building an IF EVEN CHOOSE expression node results in an implicit if-conversion optimization. As defined in Section 5.1.2.5, METAL’s IF EVEN CHOOSE operator generates this type of expression node. The expression has two children that are terminal GSHIFT expressions. The expression encapsulates a built-in predicate (Listing 5.1) that uses the local mddarray indices to compute a global “parity” value at each mddarray index position. qketBuildRHS generates the predicate as an inlined computation inside the innermost loop body. It then adds LLVM select instructions to select one of the two accesses at runtime. Thus, generating code for IF EVEN CHOOSE expressions avoids adding extra control flow to the loop.
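At the source level, the if-converted form corresponds roughly to the following C++ sketch. The parity formula used here (sum of the global indices modulo two) is an assumption standing in for the built-in predicate of Listing 5.1, which is not reproduced in this section. The names `isEvenSite` and `ifEvenChoose`, and the choice of which operand is taken on even sites, are likewise hypothetical.

```cpp
#include <cstddef>

// Assumed parity predicate: a site is "even" when the sum of its global
// indices is even. The real predicate is the one given in Listing 5.1.
inline bool isEvenSite(std::size_t g0, std::size_t g1) {
    return ((g0 + g1) & 1u) == 0;
}

// Both operands are already loaded values, so picking one of them needs no
// branch. A conditional expression over two computed values like this one
// is typically lowered to a single LLVM select instruction.
inline double ifEvenChoose(std::size_t g0, std::size_t g1,
                           double evenValue, double oddValue) {
    const bool even = isEvenSite(g0, g1);
    return even ? evenValue : oddValue;
}
```

Because both accesses are evaluated unconditionally, the loop body stays a single basic block, which is the property the paragraph above relies on.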

Inlining mkernel functions is done using QOPT’s custom domain-specific function inliner. This custom inliner benefits from domain-specific information that a general-purpose function inliner cannot decipher, so it can apply optimizations that a general-purpose inliner would not be able to discover.

Mkernel call inlining occurs only in the context of QKET loop body generation. Because qketBuildRHS does a depth-first walk of the tree, the mkernel function inliner knows which arguments are passed to the mkernel call. It is also aware of the guarantees provided by the METAL API. For mkernels that operate on sdlarray data types, the inliner removes all nested sdlarray accesses. These nested accesses are replaced by a single linearized address offset from the parent mddarray’s base address. This is legal because METAL ensures that nested sdlarray members are allocated contiguously inside an mddarray. For SIMD vectorized code generation, the inliner converts all scalar arithmetic operations into SIMD vector operations.

Building scalar mddarray accesses requires two things to be considered: adding boundary condition checks when the QK has GSHIFT expressions, and generating LLVM memory operations.

Boundary condition checks are derived using integer set operations involving a QKSCoP statement’s iteration domain and the QKSCoP’s boundary domains. A boundary check is inserted whenever a statement’s domain intersects a boundary domain. If an mddarray access falls inside a boundary, then it is loaded from the memory region corresponding to that boundary. The memory region can be a shared memory address, an address inside the same block, or a memory buffer that stores data copied over from a remote process. This distinction between memory regions is abstracted by QUARC-RT. Listing 6.6 shows the loop nests for the five-point example with the boundary condition checks added.

    // Inner region statement's headers
    for (int i0 = 0; i0 < D0; i0 += 1)
      for (int i1 = 0; i1 < D1; i1 += 1) {
        // Check if last row
        if (i0 == D0-1) { }
        // Check if first row
        if (i0 == 0) { }
        // Check if last column
        if (i1 == D1-1) { }
        // Check if first column
        if (i1 == 0) { }
      }
    // Boundary regions' loop headers
    {
      if (D0 >= 1) {
        for (int i1 = 0; i1 < D1; i1 += 1) {
          // Check if last column
          if (i1 == D1-1) { }
          // Check if first column
          if (i1 == 0) { }
        }
        if (D0 >= 2)
          for (int i1 = 0; i1 < D1; i1 += 1) {
            // Check if last column
            if (i1 == D1-1) { }
            // Check if first column
            if (i1 == 0) { }
          }
      }
      if (D1 >= 1) {
        for (int i0 = 1; i0 < D0 - 1; i0 += 1);
        if (D1 >= 2)
          for (int i0 = 1; i0 < D0 - 1; i0 += 1);
      }
    }

Listing 6.6: Inline boundary checks inside generated loops

LLVM load instructions are inserted after the boundary checks. On the code path for non-boundary iterations, the load is from the local MPI window of the mddarray. On the code path where an access falls inside a boundary domain, the load is from the memory region pointer returned by QUARC-RT. Each nested sdlarray element is loaded with a separate load operation.
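The two load paths can be pictured with a small C++ sketch. `Regions`, `lastColumnHalo`, and `loadRightNeighbour` are hypothetical names, the row-major layout of the local window is an assumption, and the halo buffer merely stands in for the memory region pointer that QUARC-RT returns. None of this is QUARC's actual interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two memory regions named in the text.
struct Regions {
    const double* localWindow;     // local window, D0 x D1, row-major (assumed)
    const double* lastColumnHalo;  // boundary region for the right neighbour,
                                   // one entry per row (assumed shape)
};

// Loads the neighbour one column to the right of (i0, i1). The inline
// boundary check decides which memory region the load reads from.
double loadRightNeighbour(const Regions& r, std::size_t D0, std::size_t D1,
                          std::size_t i0, std::size_t i1) {
    assert(i0 < D0 && i1 < D1);
    if (i1 == D1 - 1) {
        // Boundary path: the neighbour lies outside the local block.
        return r.lastColumnHalo[i0];
    }
    // Non-boundary path: plain load from the local window.
    return r.localWindow[i0 * D1 + (i1 + 1)];
}
```

Only the iterations that touch the last column pay for the alternative load; every other iteration reads the local window directly.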

Building SIMD mddarray accesses involves extra steps compared to building scalar mddarray accesses. The boundary condition checking is the same as for scalar mddarray accesses, but the loads inside a boundary iteration involve vector shuffle operations. These shuffle operations are needed due to an extra boundary condition introduced by QUARC’s ρφ-based data-layout transformations. Reshaping of an

[Figure 6.13: diagram of the first row of two neighboring blocks, the GSHIFT<0,-1>() and GSHIFT<0,1>() accesses, and the shufflevector(v1,v2,mask) operations that assemble each shifted vector. Masks shown in (a): {7,0,1,2} and {1,2,3,4}. Masks shown in (b): {7,0,1,2} and {1,4,3,6}.]

(a) SIMD data-layout created using ATL specification v:RT(1,4). Only the inner array dimension is reshaped and transposed to build the four-wide vector dimension.

(b) SIMD data-layout created using ATL specification v:RT(2,2). Both array dimensions are reshaped and transposed to build the four-wide vector dimension.

Figure 6.13: Handling of boundaries for ρφ transformed mddarray layouts using vector shuffle operations. The example uses a 32×32 mddarray that is blocked on the inner dimension. The sub-figures show two possible data-layouts within a block. A global two-dimensional indexing scheme is used to help understand the data distribution and data-layout. The array uses a periodic boundary condition.

well as a neighboring block. The data elements need to be shuffled to get them in the right vector lanes, before applying any SIMD arithmetic operations.

Figure 6.13 shows two scenarios that illustrate this need for vector shuffling. The two scenarios apply two different ρφ data-layout transformations to a two-dimensional mddarray with a 32×32 global shape. The inner dimension of the mddarray has been blocked by a factor of two. Each subfigure shows the first row within the two neighboring blocks. Notice that after the layout transformations, there is an inner vector dimension within each row. Therefore, each row consists of multiple vectors. The data-layout differs between the two scenarios. In Figure 6.13a, the inner mddarray dimension is ρφ transformed to build a four-wide vector dimension. In Figure 6.13b, both mddarray dimensions are transformed to build the vector dimension. As shown, the two GSHIFT operations on the inner dimension, GSHIFT<0,1>() and GSHIFT<0,-1>(), require data to be gathered from two different vector registers. These two vector registers must be shuffled or blended to get the needed elements into the right positions inside a single vector register. The shuffle operation uses architecture-specific blend instructions, such as the AVX VBLENDVPS and VPBLENVPS instructions, for this purpose. A blend instruction needs an instruction mask, specified as a list of unsigned integer values, to select the required elements from either vector register. The instruction mask needs to be generated at compile time.
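The element selection performed by a two-input shuffle can be emulated in plain C++ to check the masks against the layout of Figure 6.13a. The sketch below follows the LLVM shufflevector convention, in which mask values 0 to 3 pick from the first operand and 4 to 7 from the second. The `Vec4` alias and the `shuffle` function are illustrative names, and each element is represented only by its global column index in row 0.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<int, 4>;
using Mask4 = std::array<unsigned, 4>;

// Emulates the element selection of a four-wide, two-input shuffle:
// mask values 0..3 select lanes of v1, mask values 4..7 select lanes of v2.
Vec4 shuffle(const Vec4& v1, const Vec4& v2, const Mask4& mask) {
    Vec4 out{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        assert(mask[lane] < 8);
        out[lane] = mask[lane] < 4 ? v1[mask[lane]] : v2[mask[lane] - 4];
    }
    return out;
}
```

With the RT(1,4) layout, the first vectors of the two blocks hold columns {0,4,8,12} and {16,20,24,28}. Applying the mask {1,2,3,4} to them yields columns {4,8,12,16}, the next-column neighbors of the vector {3,7,11,15}. Likewise, the mask {7,0,1,2} applied to the last vectors {3,7,11,15} and {19,23,27,31} yields {31,3,7,11}, the previous-column neighbors of {0,4,8,12} under the periodic boundary.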

We ensure that all shifts in a given direction for a reshaped dimension use the same instruction mask. That is, the instruction mask value depends only on the ρφ transformation, and not on the shift value. This is done by enforcing a legality constraint when selecting a ρφ transformation to define an mddarray data-layout. This constraint is defined as follows:

Constraint 6.3. The absolute value of a shift on a ρφ transformed mddarray dimension should be less than the reshaped extent of that dimension.

Constraint (6.3) ensures that all array elements in inner-region iterations have the shifted neighbor in the same vector lane on another vector register. For boundary-region iterations, the shifted neighbors of the array elements are in the next or prior vector lane. Moreover, boundary region separation follows the same logic as described in Section 6.4.2.

This constraint essentially restricts the space of applicable ρφ transformations. The rationale for the constraint is based on the index mapping formulae defined in Section 4.2. Translating an array accesses

    auto sq_norm = REDUCE_{(a1*a1, su3add);}

    /* Semantically equivalent loop nest for the METAL REDUCE expression
     * auto sq_norm = 0.0;
     * for (auto k = 0ul; k < a1.get_local_extent(0); ++k) {
     *   for (auto l = 0ul; l < a1.get_local_extent(1); ++l) {
     *     auto sq = sqmag(a1.at(k,l), a1.at(k,l));
     *     sq_norm = su3add(sq_norm, sq);
     *   }
     * }
     *
     * quarc_rt_mpi_allreduce(...);
     */

Listing 6.7: After optimizing the REDUCE expression

operations. Applying the constraint ensures that no division or modulo operation is needed, and that a fixed translation of the shifted access is possible for all shifts for a given ρφ-defined data-layout.
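One way to see why Constraint 6.3 removes the division and modulo is the following C++ sketch. It assumes a mapping in which a local index j along the reshaped dimension lives in vector j mod R and lane j div R, where R is the reshaped extent. This matches the layout drawn in Figure 6.13a but is not a restatement of the formulae of Section 4.2. `VecPos`, `shiftIsLegal`, and `translateShift` are hypothetical names.

```cpp
#include <cassert>

// Position of a shifted neighbour relative to the current element:
// which vector it lives in, and by how many lanes it is displaced.
struct VecPos {
    int vec;       // vector index along the reshaped dimension, in [0, R)
    int laneStep;  // 0 = same lane, +1 = next lane, -1 = prior lane
};

// Legality check in the spirit of Constraint 6.3: |shift| < reshaped extent.
constexpr bool shiftIsLegal(int shift, int reshapedExtent) {
    return (shift < 0 ? -shift : shift) < reshapedExtent;
}

// Translates a shifted access using only an add and two comparisons.
// Because |shift| < R, the target wraps around at most once.
VecPos translateShift(int vec, int shift, int reshapedExtent) {
    assert(0 <= vec && vec < reshapedExtent);
    assert(shiftIsLegal(shift, reshapedExtent));
    const int target = vec + shift;
    if (target >= reshapedExtent) return {target - reshapedExtent, +1};
    if (target < 0) return {target + reshapedExtent, -1};
    return {target, 0};
}
```

For R = 4, a shift of +1 from vector 3 lands in vector 0 of the next lane, while a shift of +2 from vector 1 stays in the same lane. These correspond to the boundary-region and inner-region cases described for Constraint 6.3.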

Building scalar terminal nodes involves generating a load operation for the scalar variable referenced by the QKET leaf node. For SIMD code generation, each scalar load is expanded into a vector load with the scalar value replicated across all the vector lanes. This is a standard compiler optimization known as scalar expansion.
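The replication step can be illustrated with a short C++ sketch. The `broadcast` and `scale` helpers are hypothetical and use `std::array` as a stand-in for a SIMD register; they are not part of QOPT.

```cpp
#include <array>
#include <cstddef>

// Replicates one scalar value across all lanes of a vector.
template <std::size_t Lanes>
std::array<double, Lanes> broadcast(double scalar) {
    static_assert(Lanes > 0, "a vector needs at least one lane");
    std::array<double, Lanes> v;
    v.fill(scalar);
    return v;
}

// Lane-wise multiply of a data vector by an expanded scalar, the kind of
// operation that becomes possible once the scalar has a vector form.
template <std::size_t Lanes>
std::array<double, Lanes> scale(const std::array<double, Lanes>& data,
                                const std::array<double, Lanes>& factor) {
    std::array<double, Lanes> out;
    for (std::size_t lane = 0; lane < Lanes; ++lane) {
        out[lane] = data[lane] * factor[lane];
    }
    return out;
}
```

After expansion, an operation that mixed a scalar with array elements becomes a uniform lane-wise vector operation.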
