
Chapter 5: Conclusion and Future Work

5.1 Optimization Opportunities

We foresee several opportunities to improve Mercator’s runtime and remapping support. First, because remapping is transparent to the application developer, we are free to optimize the underlying remapping primitives. For example, managing queues through atomic updates may sometimes be more efficient than our current scanning and compaction approach.
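To make the contrast between the two queue-management strategies concrete, the following is a minimal host-side C++ sketch, not Mercator’s implementation. The function names compactByScan and compactByAtomic are hypothetical, and a sequential loop stands in for the SIMD lanes that would perform these steps on a GPU.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Scan-and-compact: each lane contributes a 0/1 keep flag, an exclusive
// prefix sum converts the flags into output slots, and surviving items are
// scattered to those slots. Output order matches input order.
std::vector<int> compactByScan(const std::vector<int>& items,
                               const std::vector<int>& keep) {
    assert(items.size() == keep.size());
    std::vector<int> slot(items.size());
    std::exclusive_scan(keep.begin(), keep.end(), slot.begin(), 0);
    // Total survivors = last slot index + last flag (flags must be 0 or 1).
    std::size_t total =
        items.empty() ? 0 : static_cast<std::size_t>(slot.back() + keep.back());
    std::vector<int> out(total);
    for (std::size_t i = 0; i < items.size(); ++i)
        if (keep[i]) out[static_cast<std::size_t>(slot[i])] = items[i];
    return out;
}

// Atomic append: each surviving lane reserves a slot by incrementing a
// shared tail counter. No scan is needed, but with truly concurrent lanes
// the output order would depend on which lane wins each increment; this
// sequential loop happens to preserve input order.
std::vector<int> compactByAtomic(const std::vector<int>& items,
                                 const std::vector<int>& keep) {
    assert(items.size() == keep.size());
    std::vector<int> out(items.size());
    std::atomic<std::size_t> tail{0};
    for (std::size_t i = 0; i < items.size(); ++i)
        if (keep[i]) out[tail.fetch_add(1)] = items[i];
    out.resize(tail.load());
    return out;
}
```

Which variant is faster in practice would depend on contention for the tail counter and on how many items survive, which is why the comparison is left as an open question here.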

Second, Mercator currently executes a copy of an application’s DFG with a single GPU block in isolation. It may be advantageous in some cases to share datapaths between blocks, allowing modules from the same copy of the DFG to fire simultaneously and hence exposing pipeline parallelism not currently accessible in our execution model. Efficiently managing dataflow in this environment would require load balancing and enforcing consistent data sharing across blocks.

Third, Mercator may be augmented to search for optimal thread-to-work mappings. Mercator is designed to exploit the lightweight, dynamic thread-to-work mapping available on wide-SIMD architectures such as GPUs to manage data wavefront irregularity. In addition to this functionality, the ability to tune thread-to-work mappings enables an exploration of mapping options for a given application (indeed, the expanded name of the Mercator framework is Mapping EnumERATOR for CUDA) and a search for the optimal mapping. We propose two ways in which the Mercator framework can be used to explore thread-to-work mappings: variable module mappings and module fusion/fission. Mercator currently supports performing both of these explorations manually; support for automated exploration is future work.

5.1.1 Variable Thread-to-Work Module Mappings

To support different thread-to-work mappings inside developer module code, Mercator allows a user-specified number of input items to be assigned to each thread, or vice versa, inside the module’s run() function. For all experiments described in this dissertation so far, the mapping ratio has been one-to-one. However, different mapping ratios may confer performance benefits for particular modules: multiple inputs per thread offer the possibility of memory latency hiding through loop unrolling, while multiple threads per input allow for parallelization in the processing of each item and coalescing of memory accesses. We plan to explore under what conditions, if any, performance benefits consistently accrue for different mappings.
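The two non-unit mapping ratios can be illustrated by the index arithmetic they imply. The sketch below is a host-side C++ illustration under our own assumptions: the helpers itemsForThread and itemForThread and the Assignment struct are hypothetical and are not part of the Mercator API, and the strided item layout is one plausible choice rather than the one Mercator necessarily uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiple inputs per thread: returns the item indices handled by `thread`
// when each thread is given up to `itemsPerThread` inputs. Items are strided
// by the thread count, so in step k the threads touch the contiguous range
// [k * numThreads, (k + 1) * numThreads).
std::vector<std::size_t> itemsForThread(std::size_t thread,
                                        std::size_t numThreads,
                                        std::size_t itemsPerThread,
                                        std::size_t numItems) {
    assert(thread < numThreads);
    std::vector<std::size_t> mine;
    for (std::size_t k = 0; k < itemsPerThread; ++k) {
        std::size_t item = k * numThreads + thread;
        if (item < numItems) mine.push_back(item);  // skip past-the-end slots
    }
    return mine;
}

// Multiple threads per input: `threadsPerItem` consecutive threads cooperate
// on one item; `lane` is the thread's position within that group.
struct Assignment {
    std::size_t item;
    std::size_t lane;
};

Assignment itemForThread(std::size_t thread, std::size_t threadsPerItem) {
    assert(threadsPerItem > 0);
    return {thread / threadsPerItem, thread % threadsPerItem};
}
```

For example, with 4 threads, 2 items per thread, and 7 items, thread 1 handles items 1 and 5, while thread 3 handles only item 3 because item 7 does not exist.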

The modular structure of Mercator applications and the framework’s use of dynamic remapping make different mapping options straightforward to expose: developers simply provide the desired inputs-to-threads ratio as part of the mapping specification for each module, and the system provides an appropriate item or item set to the user in the parameters of the run() stub function (see the User Manual in Appendix A for details and examples). Although different mappings of a module currently require different developer-provided run() function implementations, future versions of Mercator may automate the process of enumerating implementations to match desired mappings given a reference implementation based on a 1:1 mapping. Section 5.2.4 gives an example of an application we have manually implemented under diverse mappings with minimal code changes.

5.1.2 Module Fusion/Fission

Our system validation experiments in Chapter 3 were designed to test whether the benefit of increased utilization outweighed the overhead of Mercator queue management for our benchmark applications. Although we did find a net benefit for most node filtering rates and workloads tested, our experiments revealed a performance tradeoff between overheads and benefits dependent on node filtering rates and workload. We propose extending our simple validation test to explore this tradeoff by searching for the optimal granularity of modularization of an application. Modularization granularity may be adjusted by fusing or splitting nodes in the application’s DFG, which corresponds to combining multiple module functions into one or dividing a module function into multiple functions. Fusing nodes may be advantageous when filtering rates or computation per node are low, as little cost is incurred from low SIMD utilization relative to the overhead of remapping operations. For example, in our benchmark tests, the Merged application outperformed the DiffTypePipe application when a filtering rate of 0.0 or a low node computation setting was used.

Conversely, splitting a node may be advantageous when filtering rates or computation per node are high, as vacant SIMD lanes incur a high cost that could be recovered by remapping for high utilization partway through the node’s computation. Our benchmark tests show this benefit for the DiffTypePipe application over the Merged application for most filtering rates and computation settings.
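The fusion/fission tradeoff can be illustrated with a deliberately simplified cost model for a two-stage pipeline. The sketch below is our own toy model, not a measurement or a model taken from the experiments: the PipelineCosts struct, the cost units, and the assumption that a fused node always pays for both stages on every SIMD pass are all hypothetical simplifications.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-pass costs in arbitrary units.
struct PipelineCosts {
    std::size_t width;   // SIMD lanes per pass
    std::size_t stage1;  // cost of one full-width pass of stage 1
    std::size_t stage2;  // cost of one full-width pass of stage 2
    std::size_t remap;   // cost of remapping one pass of stage-1 output
};

// Number of SIMD passes needed to process `items` inputs (ceiling division).
std::size_t passes(std::size_t items, std::size_t width) {
    assert(width > 0);
    return (items + width - 1) / width;
}

// Fused node: every pass runs both stages, so lanes whose items were
// filtered by stage 1 sit idle during stage 2 but still occupy the pass.
std::size_t fusedCost(const PipelineCosts& c, std::size_t inputs) {
    return passes(inputs, c.width) * (c.stage1 + c.stage2);
}

// Split nodes: stage-1 output is remapped into dense passes, so stage 2
// runs only as many passes as the survivors require.
std::size_t splitCost(const PipelineCosts& c, std::size_t inputs,
                      std::size_t survivors) {
    assert(survivors <= inputs);
    return passes(inputs, c.width) * (c.stage1 + c.remap) +
           passes(survivors, c.width) * c.stage2;
}
```

Under this model, with no filtering the split version pays the remapping cost for no gain, whereas with heavy filtering the saved stage-2 passes outweigh that cost. A real study would need measured costs in place of these placeholder units.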

Mercator inherently supports module fusion and fission, but such operations must currently be performed by the developer via specifying a new application with the desired topology. Because manually choosing an optimal granularity of modularization requires either substantial timing analysis or intuition for an application’s dataflow characteristics, automating this choice would be a helpful aid to optimization.

Since module fission involves splitting a user-defined run() function into multiple functions across module boundaries, automating the fission process for arbitrary modules poses a significant challenge. Sophisticated code analysis and transformation may be required to find feasible split points within the run() function (e.g., splitting inside a loop is not possible) and ensure consistency of the original module’s data (e.g., any function-local variables must be copied between modules). Functions with certain properties may be amenable to automated fission, but further study would be required to identify such function classes and determine the feasibility of automatic fission.

Module fusion holds more promise for automation than fission for some graph topologies. For example, in the case of a linear pipeline with no back edges, contiguous modules may be merged by concatenating their function code with no other effect on application dataflow. Automating this process would still require appropriate data management to ensure that data previously passed between module functions via Mercator queues would be connected appropriately between the fused components of the new run() function via variables, but this task would seem to require substantially less code analysis and transformation than automatic fission.
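Fusion of two contiguous modules in a linear pipeline can be sketched as follows. This is a minimal host-side C++ illustration under our own assumptions: stageA and stageB are hypothetical stand-ins for the per-item logic of two modules' run() functions, a std::vector stands in for a Mercator queue, and an empty std::optional represents a filtered item.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical per-item logic of two modules. An empty optional means the
// item was filtered out and produces no output.
std::optional<int> stageA(int x) {
    if (x % 2 != 0) return std::nullopt;  // drop odd inputs
    return x / 2;
}

std::optional<int> stageB(int y) {
    if (y > 10) return std::nullopt;  // drop large intermediates
    return y * y;
}

// Unfused pipeline: stage A's outputs pass through an explicit intermediate
// queue before stage B consumes them.
std::vector<int> runSplit(const std::vector<int>& in) {
    std::vector<int> queue;
    for (int x : in)
        if (auto y = stageA(x)) queue.push_back(*y);
    std::vector<int> out;
    for (int y : queue)
        if (auto z = stageB(y)) out.push_back(*z);
    return out;
}

// Fused module: the two bodies are concatenated, and the value that used to
// travel through the queue is carried in the local variable `y`.
std::optional<int> fusedAB(int x) {
    std::optional<int> y = stageA(x);
    if (!y) return std::nullopt;
    return stageB(*y);
}

std::vector<int> runFused(const std::vector<int>& in) {
    std::vector<int> out;
    for (int x : in)
        if (auto z = fusedAB(x)) out.push_back(*z);
    return out;
}

// For a linear pipeline, fusion should leave the output stream unchanged.
void checkFusionPreservesOutput(const std::vector<int>& in) {
    assert(runSplit(in) == runFused(in));
}
```

The only data management needed in this simple case is replacing the queue with a local variable, which is the step an automated fusion tool would have to perform.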

In the case of more complex topologies, it may not be possible to fuse a given pair of modules without altering application dataflow. For example, straightforward fusion of a downstream module to an upstream module with more than one output edge in its merged DFG may create a new data path originating at the downstream module.

To proceed with a study of automated module fission/fusion, we propose first exploring automated fusion for simple graph topologies, then expanding the study to more complex topologies for fusion or simple function semantics for fission.
