Array Transformation Language (ATL) is a small specification language for programming data-placements for METAL mddarrays. This section describes ATL and aspects of its overall design. ATL is written in YAML (Oren Ben-Kiki, Clark Evans, Brian Ingerson, 2009), and each ATL entry is a YAML “key-value” pair. Each entry defines an mddarray data-placement attribute. These include the array’s data-layout, data-distribution, and other aspects of how the array is distributed over an MPI Cartesian communicator. An ATL file is required to initialize an mddarray, and is parsed only at runtime.
5.3.1 ATL attributes
P-grid defines the processor space onto which the array partitions are mapped. In QUARC’s current implementation the p-grid corresponds to an MPI Cartesian communicator. It is specified as a list of integers that give the size of the communicator in each dimension.
Dist-rtf defines the blocking factors for each mddarray dimension. It too is specified as a list of integers. The operation encoded by this attribute represents a ρ transformation followed by a φ transformation that permutes all the block dimensions outwards. The resulting new shape of the array has a set of outermost block dimensions.
Example 5.2.
For a two-dimensional mddarray, A, the dist-rtf value of {2,2} is equivalent to the following ρφ transformation.
ρ(A, ⟨2 2⟩) → φ(A, ⟨0 2 1 3⟩).
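The shape arithmetic behind Example 5.2 can be sketched in plain C++. The helper names `rho` and `phi` are hypothetical and are not part of METAL or QUARC. The sketch assumes that ρ splits each dimension of extent N with factor f into a block dimension of extent f followed by an intra-block dimension of extent N/f, and that φ(A, ⟨p0 p1 ...⟩) makes new dimension k equal to old dimension pk. Under those assumptions a {16×16} shape becomes {2×8×2×8} after ρ and {2×2×8×8} after φ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Shape = std::vector<std::size_t>;

// Hypothetical helper (assumed semantics of rho): split every dimension d of
// `shape` into a block dimension of extent factors[d] followed by an
// intra-block dimension of extent shape[d] / factors[d].
Shape rho(const Shape& shape, const Shape& factors) {
    assert(shape.size() == factors.size());
    Shape out;
    for (std::size_t d = 0; d < shape.size(); ++d) {
        assert(factors[d] != 0 && shape[d] % factors[d] == 0);
        out.push_back(factors[d]);
        out.push_back(shape[d] / factors[d]);
    }
    return out;
}

// Hypothetical helper (assumed semantics of phi): permute dimensions so that
// new dimension k is old dimension perm[k].
Shape phi(const Shape& shape, const std::vector<std::size_t>& perm) {
    assert(shape.size() == perm.size());
    Shape out(shape.size());
    for (std::size_t k = 0; k < perm.size(); ++k) {
        assert(perm[k] < shape.size());
        out[k] = shape[perm[k]];
    }
    return out;
}

// dist-rtf {2,2} applied to a {16,16} array: rho, then phi <0 2 1 3>.
Shape dist_rtf_example() {
    return phi(rho({16, 16}, {2, 2}), {0, 2, 1, 3});
}
```

Calling `dist_rtf_example()` yields the four extents 2, 2, 8, 8: the two block dimensions sit outermost and each block holds 8×8 elements.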
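using GS = global_shape<2>;       // two-dimensional global shape
using ATE = mddarray<T, GS>;      // mddarray with element type T
GS gs(16, 16);                    // {16×16} global index-space
ATE A(&gs, "atl-spec");           // "atl-spec" names the ATL specification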
(a) Defining a two-dimensional mddarray.
{ "p-grid" : "2,2", "dist-rtf" : "2,2", "simd-rtf" : "2,2", "mapper" : "STATIC" }
(b) ATL specification as a YAML file
Figure 5.3: Example of an ATL specification.
The innermost block dimensions should cumulatively equal the SIMD register length of the target architecture. The transformation defines a new data-layout for the mddarray. The simd-rtf values can be specified by the user when code for a specific data-layout is wanted. Optionally, QOPT can speculatively generate a set of data-layout choices that are encoded as multiple simd-rtf values. QOPT then generates a code version for each of the data-layouts.
When an mddarray has nested sdlarray elements, the effect of the ρφ transformation is applied to each nested sdlarray dimension. All the nested dimensions are permuted out, and the innermost dimension is still a SIMD vector dimension.
Example 5.3.
For a two-dimensional mddarray, A, the simd-rtf value of {2,2} is equivalent to the following ρφ transformation.
ρ(A, ⟨2 2⟩) → φ(A, ⟨1 3 0 2⟩).
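To make the resulting layout concrete, the sketch below computes where a global element (i, j) of a {16×16} array would land after the transformation of Example 5.3. It is an illustration under stated assumptions, not QUARC code: ρ is assumed to split an extent N with factor f into a (block index, intra-block index) pair with extents (f, N/f), the permutation ⟨1 3 0 2⟩ is assumed to move both intra-block dimensions outwards, and the result is assumed to be stored in row-major order. The name `simd_offset` is hypothetical. The transformed shape is then {8×8×2×2}, so the innermost 2×2 = 4 elements form one SIMD vector.

```cpp
#include <cassert>
#include <cstddef>

// Assumed example parameters: a 16x16 array with simd-rtf {2,2}.
constexpr std::size_t N = 16;     // extent of each global dimension
constexpr std::size_t F = 2;      // simd-rtf factor per dimension
constexpr std::size_t B = N / F;  // intra-block extent (8)

// Hypothetical helper: row-major offset of global element (i, j) in the
// {B, B, F, F} layout obtained from rho(A, <2 2>) followed by phi <1 3 0 2>.
std::size_t simd_offset(std::size_t i, std::size_t j) {
    assert(i < N && j < N);
    const std::size_t bi = i / B, ii = i % B;  // rho on dimension 0
    const std::size_t bj = j / B, jj = j % B;  // rho on dimension 1
    // After phi the index order is (ii, jj, bi, bj).
    return ((ii * B + jj) * F + bi) * F + bj;
}
```

With these assumptions the elements (0,0), (0,8), (8,0), and (8,8) occupy offsets 0 to 3: the same intra-block position from each of the four blocks is contiguous in memory, which is what lets a single vector instruction operate on all four blocks at once.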
Mapper specifies the mapping function that maps mddarray blocks onto the p-grid. The mapper value is defined as a string. QUARC-RT internally has a dictionary mapping the string name for a mapper to a library implementation of a mapping function.
§Implementation Note. The current scope of QUARC is limited to grid and lattice-based applications that exhibit regular data access patterns. Such applications benefit from rectilinear array partitioning to define “blocked” data-distributions. For this reason, QUARC currently provides only one mapping function, which statically maps each mddarray block onto a single MPI rank within the p-grid. Future extensions can add more types of mapping functions for other use cases.
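A name-to-function dictionary of this kind could look like the following sketch. The alias `MapperFn`, the function `static_mapper`, and the `mapper_registry` table are illustrative assumptions, not QUARC-RT's actual interface. The static mapper shown here assumes one block per rank and uses the row-major rank ordering of MPI Cartesian communicators.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Coords = std::vector<std::size_t>;
// Hypothetical signature: (block coordinates, p-grid extents) -> rank.
using MapperFn = std::function<std::size_t(const Coords&, const Coords&)>;

// Assumed static mapping: block (b0, b1, ...) goes to the rank with the same
// Cartesian coordinates, linearized in row-major order.
std::size_t static_mapper(const Coords& block, const Coords& pgrid) {
    assert(block.size() == pgrid.size());
    std::size_t rank = 0;
    for (std::size_t d = 0; d < pgrid.size(); ++d) {
        assert(block[d] < pgrid[d]);
        rank = rank * pgrid[d] + block[d];
    }
    return rank;
}

// Dictionary from the ATL "mapper" string to an implementation.
const std::unordered_map<std::string, MapperFn>& mapper_registry() {
    static const std::unordered_map<std::string, MapperFn> registry = {
        {"STATIC", static_mapper},
    };
    return registry;
}

// Look up a mapper by name; throws std::out_of_range for unknown names.
const MapperFn& lookup_mapper(const std::string& name) {
    return mapper_registry().at(name);
}
```

For a {2×2} p-grid, `lookup_mapper("STATIC")` maps blocks (0,0), (0,1), (1,0), (1,1) to ranks 0, 1, 2, 3. Registering a further mapper would only require adding another entry to the table, leaving the ATL parsing code unchanged.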
[Figure 5.4 panels: {16×16} global index-space; {2×2×8×8} blocked index-space (dist-rtf : “2,2”); block-space coordinates (0,0), (0,1), (1,0), (1,1) statically mapped to MPI ranks (0,0), (0,1), (1,0), (1,1).]
Figure 5.4: ATL specifications to define a two-dimensional block distribution for an mddarray. The dist-rtf specification is added via ATL, and blocks the global index space into four blocks. These blocks are mapped bijectively to the ranks of an MPI Cartesian communicator.
5.3.2 METAL-ATL interface
The mddarray class constructor requires an ATL specification as an input parameter. The ATL specification defines the data-distribution and data-layout for the mddarray, and once defined these attributes are immutable. Listing 5.3a shows a simple example of an mddarray constructor call. Listing 5.3b shows the corresponding ATL specification file. This ATL specification defines a two-dimensional {2×2} MPI Cartesian communicator. The dist-rtf and simd-rtf attributes are the same as discussed in Example 5.2 and Example 5.3. Figure 5.4 provides a visualization of this data-distribution strategy. It only shows the block distribution over the MPI Cartesian communicator, and omits the
5.3.3 Compile-time ATL vs. Runtime ATL
The initial design of QOPT incorporated ATL as a compile-time compiler flag. Along with data-placement options, ATL even specified the mddarray shapes. The design provided several code-generation advantages. Knowing an array’s shape and placement at compilation allowed specializing array expression loops to trip counts known at compile time. QOPT could avoid several extra checks that are required when the trip counts are not known at compile time. Additionally, some auxiliary variables needed to support data communication could be allocated statically on the stack without requiring dynamic heap allocation at runtime.
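The trade-off can be illustrated with a small generic C++ sketch. It is not QOPT output, and the function names are hypothetical. When the extent is a template parameter, the loop trip count is a compile-time constant and scratch storage can live on the stack; when the extent arrives at runtime, the same loop needs a runtime bound and heap-backed storage.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Compile-time shape: N is part of the type, so the trip count is a constant
// the compiler can unroll or vectorize against, and the scratch buffer is a
// fixed-size stack object.
template <std::size_t N>
double scaled_sum_static(const std::array<double, N>& a, double s) {
    std::array<double, N> tmp{};           // stack allocation
    for (std::size_t i = 0; i < N; ++i) {  // trip count known at compile time
        tmp[i] = s * a[i];
    }
    double sum = 0.0;
    for (double v : tmp) sum += v;
    return sum;
}

// Runtime shape: the extent is only known when the function is called, so the
// scratch buffer is heap-allocated and the loop bound is a runtime value.
double scaled_sum_dynamic(const std::vector<double>& a, double s) {
    std::vector<double> tmp(a.size());            // heap allocation
    for (std::size_t i = 0; i < a.size(); ++i) {  // trip count known at runtime
        tmp[i] = s * a[i];
    }
    double sum = 0.0;
    for (double v : tmp) sum += v;
    return sum;
}
```

Both functions compute the same result, but the static variant must be instantiated, and therefore recompiled, for every distinct extent, whereas the dynamic variant serves all sizes from one compiled function.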
Despite the advantages, we found it hard to extend the compile-time ATL design to real-world applications. A fixed array shape and data-placement required a recompilation for each problem size. Moreover, integrating with existing application code required multiple hooks to pre-compiled binaries generated by QUARC. Another issue that proved hard to resolve was using differently shaped array types in the same QUARC program. There was no easy way to attach an array shape and data-placement to each array declaration with a single compiler flag. The closest solution was to add extra ATL annotations to METAL’s source code. We chose not to do this, in order to satisfy our design goal of completely separating domain-level algorithms from their architecture and parallel execution concerns. Because of these difficulties with a compile-time ATL design, the final implementation of QUARC uses a runtime specification design for ATL.
CHAPTER 6: CODE GENERATION AND RUNTIME SYSTEM
This chapter describes QUARC’s compiler and runtime system. Section 6.1 introduces the overall compiler architecture, the pass pipeline, and the different intermediate representations (IRs) used during code-generation. Section 6.2 presents the high-level code-generation steps. Section 6.3 describes our speculative SIMD vectorization technique. Section 6.4 describes QUARC’s scalarization steps for METAL array expressions. Each section provides the necessary implementation details of the set of compiler passes used in that stage of compilation. Section 6.5 presents QUARC’s runtime system. The runtime is a lightweight library that uses integer set analysis to generate MPI communication. It also provides an API to define data-distributions for METAL mddarrays, and implements the interface for selecting a data-layout from the available options that were speculatively generated during compilation.