
Chapter 3: Mercator Overview and Results

3.3 System Implementation

3.3.2 Infrastructure

Mercator’s core data structures and runtime system are implemented as a hierarchy of C++/CUDA classes. Module type, module instance, and queue objects are defined by base classes whose member functions define the semantics of data movement, scheduling, and module firing in the system. Derived classes parameterized by data type implement typed queue storage and correctly typed connection logic between objects.
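As a rough illustration of this base-class/typed-subclass split, the following host-only sketch separates an untyped queue interface from a templated subclass that supplies storage. The names `QueueBase`, `Queue`, `occupancy`, `freeSlots`, and `push` are hypothetical and are not taken from Mercator's sources; the sketch also omits everything specific to CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical untyped base class: exposes only what a scheduler
// would need to know, independent of the item type.
class QueueBase {
public:
    explicit QueueBase(std::size_t capacity) : capacity_(capacity) {}
    virtual ~QueueBase() = default;

    virtual std::size_t occupancy() const = 0;
    std::size_t capacity() const { return capacity_; }
    std::size_t freeSlots() const { return capacity_ - occupancy(); }

private:
    std::size_t capacity_;
};

// Hypothetical typed subclass: supplies storage for items of type T.
template <typename T>
class Queue : public QueueBase {
public:
    explicit Queue(std::size_t capacity) : QueueBase(capacity) {
        items_.reserve(capacity);
    }

    std::size_t occupancy() const override { return items_.size(); }

    // Returns false rather than overflowing when the queue is full.
    bool push(const T& item) {
        if (items_.size() >= capacity()) return false;
        items_.push_back(item);
        return true;
    }

private:
    std::vector<T> items_;
};
```

In this arrangement, code that only reasons about occupancy can hold a `QueueBase&`, while code that moves items works with `Queue<T>` and is checked by the compiler for type agreement.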

A C++ front end incorporating the ROSE compiler infrastructure [99] first parses the developer’s app spec file and extracts the parameters for each module type, node, and queue from the module,

void markCycles() {
    // start recursive process
    markCyclesRec(sourceNode);
}

void markCyclesRec(ModuleInstance* node) {
    // mark node as visited (preorder)
    node->set_visited(true);

    // loop over children
    for (each child of node) {
        if (child->is_visited()) // edge to child is a back edge
        {
            // set cycle indicators
            node->set_isCycleTail();
            child->set_isCycleHead();
            e->set_isBackEdge(); // e is edge from node to child
        }
        else // edge is not a back edge
        {
            // recursively process child
            markCyclesRec(child);

            // mark this node as predecessor if child
            // is a cycle head node (but not along a back edge)
            if (child->isCycleHead())
                node->set_isCyclePredecessor();
        }
    }
}

Figure 3.8: Pseudocode of application graph traversal to mark cycle components: heads, tails, head predecessors, and back edges.
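The traversal of Figure 3.8 can be made concrete on a small adjacency-list graph. The sketch below is only an illustration: the `Node`, `Edge`, and `Graph` types and the `addEdge` helper are hypothetical stand-ins rather than Mercator's classes. Like the figure, it treats any edge to an already-visited node as a back edge, which assumes that no two forward paths converge on the same node.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical graph representation (not Mercator's own classes).
struct Edge {
    std::size_t to;    // index of the downstream node
    bool isBackEdge;   // set by markCycles
};

struct Node {
    std::vector<Edge> out;
    bool visited = false;
    bool isCycleHead = false;
    bool isCycleTail = false;
    bool isCyclePredecessor = false;
};

using Graph = std::vector<Node>;

void addEdge(Graph& g, std::size_t from, std::size_t to) {
    g[from].out.push_back(Edge{to, false});
}

void markCyclesRec(Graph& g, std::size_t n) {
    g[n].visited = true;  // preorder visit
    for (Edge& e : g[n].out) {
        Node& child = g[e.to];
        if (child.visited) {
            // Edge to an already-visited node: treat it as a back edge.
            g[n].isCycleTail = true;
            child.isCycleHead = true;
            e.isBackEdge = true;
        } else {
            markCyclesRec(g, e.to);
            // The child became a head somewhere below, and this node
            // reaches it along a forward edge.
            if (child.isCycleHead) g[n].isCyclePredecessor = true;
        }
    }
}

void markCycles(Graph& g, std::size_t source) {
    markCyclesRec(g, source);
}
```

For a chain 0 → 1 → 2 → 3 with an extra edge 2 → 1, calling `markCycles(g, 0)` marks node 1 as a cycle head, node 2 as a cycle tail, node 0 as a head predecessor, and the edge 2 → 1 as a back edge.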

void verifyCycles() {
    // start recursive process
    verifyCyclesRec(sourceNode);
}

bool verifyCyclesRec(ModuleInstance* node) {
    // indicator for whether cycle is pending in traversal
    bool inCycle = false;

    // loop over children
    for (each child of node) {
        // recursively process child
        bool childInCycle = false;
        if (child was not reached from back edge)
            childInCycle = verifyCyclesRec(child);

        // if child passes on a pending cycle to this (parent) node
        if (childInCycle)
        {
            // if node was already in a pending cycle, FAIL
            if (inCycle)
                // FAIL
            else {
                // visit: mark node as being in a pending cycle now
                inCycle = true;

                // check production rate of edge along cycle; must be <= 1
                if (e->productionRate > 1) // e is edge to child
                    // FAIL
            }
        }
    } // end loop over children

    // visit: account for node being head or tail of cycle
    if (node->isCycleHead()) // head node closes the cycle
        inCycle = false;

    if (node->isCycleTail()) // tail node opens a cycle
    {
        if (inCycle) // can't open a cycle if one is already pending
            // FAIL
        else
            inCycle = true;
    }

    // pass any still-pending cycle on to the parent node
    return inCycle;
}

Figure 3.9: Pseudocode of postorder graph traversal to verify cycle validity according to Mercator’s topology constraints: overlapping and nested cycles are not permitted, and all production rates along a cycle’s edges must be ≤ 1. Overlapping or nested cycles are implicated if a node participates in more than one pending cycle in a bottom-up traversal of the graph. Note that a single node acting as head of one cycle and tail of another is permitted, as its head cycle is closed before its tail cycle is opened.

node, and edge declarations respectively. The front end then infers the application topology and checks for type compatibility and conformance to Mercator’s topological constraints. Specifically, each output edge is checked to ensure its type is identical to the input type of its downstream node; node and edge declarations are checked to ensure that any referenced topology objects therein have been validly declared elsewhere; and cycles are detected and validated. Cycle validation consists of ensuring the three conditions necessary for deadlock avoidance introduced in Section 3.1.1: cycle head nodes are marked as requiring sufficient queue space to prevent deadlock; cycle nodes are checked to ensure all production rates along the cycle are ≤ 1; and application topology is examined to ensure that no nested or overlapping cycles are present. These type and topology checks are performed using linear-time traversals of the application’s dataflow graph (see Figures 3.8 and 3.9). Finally, a codegen engine produces developer-facing CUDA function stubs for each module, along with code to create the necessary application objects as members of the Mercator infrastructure classes. The data types and other properties of modules specified in the app specification determine the signatures of the stub functions and application objects.
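The edge type check and the declared-reference check described above could be sketched as follows. This is an illustration under simplifying assumptions, not the ROSE-based front end itself: data types are compared as plain strings, and `NodeDecl`, `EdgeDecl`, and `checkEdges` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified declarations extracted from an app spec.
struct NodeDecl {
    std::string inputType;   // data type the node consumes
};

struct EdgeDecl {
    std::string from;        // name of the upstream node
    std::string to;          // name of the downstream node
    std::string dataType;    // data type carried by the edge
};

// Returns one message per violation; an empty result means every edge
// refers to declared nodes and matches its downstream input type.
std::vector<std::string> checkEdges(
        const std::map<std::string, NodeDecl>& nodes,
        const std::vector<EdgeDecl>& edges) {
    std::vector<std::string> errors;
    for (const EdgeDecl& e : edges) {
        if (nodes.find(e.from) == nodes.end())
            errors.push_back("undeclared node: " + e.from);

        auto dst = nodes.find(e.to);
        if (dst == nodes.end()) {
            errors.push_back("undeclared node: " + e.to);
            continue;  // no downstream type to compare against
        }
        if (dst->second.inputType != e.dataType)
            errors.push_back("type mismatch on edge " + e.from +
                             " -> " + e.to);
    }
    return errors;
}
```

Each edge is examined once, so the cost of the check is linear in the number of edges, apart from the logarithmic map lookups used here for brevity.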

To form a working application, the developer compiles together the application skeleton with user-supplied function bodies for each module type, the system-supplied runtime supporting code (including host-GPU communication code and the module scheduler), and the host-side instantiation code using the regular CUDA toolchain.

Queue sizing. Queue storage capacities for module instances are based on production rates of upstream nodes. In principle, every queue could be sized as small as 1 slot with the exception of cycle head nodes, which require at least 2 slots for deadlock avoidance. To facilitate SIMD parallelism, however, a queue sizing strategy should allow at least one full ensemble’s worth of items to be fired by each node without being blocked by a downstream queue. Nonuniform production rates across nodes prevent minimum queue sizes from being constant throughout the dataflow graph; for example, if a node produces two outputs per input, the queues of its downstream node(s) must be large enough to accommodate two full ensembles of items to allow a single full-ensemble firing.

The queue space required by each node to accommodate a full-ensemble upstream firing may be calculated with one linear graph traversal starting from the source node. At each step, the local production rate of a node is multiplied by the cumulative production rate of all prior nodes on its path from the source. This new cumulative production rate is stored in order to be accessed by downstream nodes, then multiplied by the size of an ensemble to obtain the final queue size for the node. To ensure sufficient queue space for head nodes of cycles to avoid deadlock, the production rate of their upstream nodes is considered to be the maximum of their non-cycle predecessor’s production rate and the cycle tail node’s rate.
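A minimal sketch of this traversal follows, under several assumptions that are not stated in the text: production rates are whole numbers attached to edges (as in Figure 3.9), only forward edges are stored, the source has a cumulative rate of 1, and the special rule for cycle head nodes is left out. The `Node`, `Edge`, and `sizeQueues` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical graph representation with a production rate per edge:
// the maximum number of outputs emitted on that edge per input item.
struct Edge {
    std::size_t to;
    std::size_t productionRate;
};

struct Node {
    std::vector<Edge> out;           // forward edges only
    std::size_t cumulativeRate = 1;  // product of rates from the source
    std::size_t queueSize = 0;       // slots for one full-ensemble firing
};

using Graph = std::vector<Node>;

void sizeQueuesRec(Graph& g, std::size_t n, std::size_t ensembleSize) {
    // The cumulative rate was stored by the parent before this call.
    g[n].queueSize = g[n].cumulativeRate * ensembleSize;
    for (const Edge& e : g[n].out) {
        // Local rate times the cumulative rate of the path so far.
        g[e.to].cumulativeRate = e.productionRate * g[n].cumulativeRate;
        sizeQueuesRec(g, e.to, ensembleSize);
    }
}

void sizeQueues(Graph& g, std::size_t source, std::size_t ensembleSize) {
    g[source].cumulativeRate = 1;
    sizeQueuesRec(g, source, ensembleSize);
}
```

For a chain in which the source produces 2 outputs per input and the next node produces 3, an ensemble size of 32 gives queue sizes of 32, 64, and 192 slots along the chain.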

To increase the likelihood of being able to fire one or more full ensembles’ worth of items, we heuristically scale each queue size from the SIMD-minimum described above by some small integer K (currently 4). In general, this strategy allows K full ensembles of items to propagate through the graph starting at the source, with space for at least K non-blocked full ensembles’ worth of items to fire at each node. As an alternative heuristic, queue sizes could be increased individually based on production rates and degrees of irregularity of nearby nodes.
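The scaling step amounts to one extra multiplication per queue. In the sketch below, `scaledQueueSize` and `kQueueScale` are hypothetical names, and the ensemble size of 128 used in the example is an assumption chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// K from the text; the value 4 is the one the text reports as current.
constexpr std::size_t kQueueScale = 4;

// SIMD-minimum size (cumulative rate times ensemble size) scaled by K.
constexpr std::size_t scaledQueueSize(std::size_t cumulativeRate,
                                      std::size_t ensembleSize,
                                      std::size_t k = kQueueScale) {
    return k * cumulativeRate * ensembleSize;
}
```

With a cumulative production rate of 2 and an assumed ensemble size of 128, the SIMD-minimum of 256 slots becomes 1024 slots after scaling.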

In document Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularity (Page 79-83)