Because of the syntax-directed nature of existing asynchronous design tools, generating high-speed implementations can be an arduous process. The best-known tools (e.g., Haste/Tangram (Haste, 2008) and Balsa (Bardsley and Edwards, 2000)) use syntax-directed translation to compile behavioral specifications directly to circuits, with few high-level optimizations. In this type of design flow, each language construct directly maps to a specific hardware component. Therefore, the performance of a specification depends on the amount of optimization the designer manually performs. Straightforward specifications often have low performance due to unnecessary sequencing and unpipelined operation. As a result, designers must either contend with relatively slow implementations or bear the burden of writing highly optimized specifications themselves.

Burdening the designer with optimizing a specification has several drawbacks. First, writing highly concurrent code entails much effort and is error-prone. Second, such code often lacks readability and maintainability, and is therefore hard to modify and reuse. Finally, such a manual approach hinders automatic design-space exploration. In an ideal design flow, performance analysis tools are typically used to identify bottlenecks in the system, and then local modifications are applied to remove the bottleneck; this procedure is repeated until desired performance is achieved. Therefore, code rewriting should ideally be automated.

This chapter introduces an alternative to manual optimization: an automated “source-to-source” compiler that transforms one behavioral specification into another behavioral specification with significantly higher concurrency. The proposed approach introduces a suite of transformations:

• pipelining for increasing concurrency within a statement group,

• arithmetic optimization for increasing concurrency at the sub-statement level, and

• re-ordering of channel communication for increasing concurrency across modules.

As a result, designers can write straightforward behavioral code, focusing mainly on its functional correctness rather than on concurrency and performance. The code is automatically transformed by the proposed source-to-source compiler to be highly concurrent, and then passed back through the original Haste design flow.

The two techniques of arithmetic optimization and communication reordering are core contributions of our approach. While basic parallelization and pipelining may help optimize a specification at the granularity of individual statements, there are often performance bottlenecks due to individual statements with long-latency arithmetic operations (e.g., 64-bit adds or multiplications). Further, a single statement may have a complex expression involving multiple arithmetic operators. The proposed approach pushes concurrency enhancement down to a sub-statement level by introducing all of the following: expression re-factoring to introduce parallelism, expression pipelining, and pipelining of individual (‘atomic’) operators. As a result, bottlenecks due to complex arithmetic are alleviated.
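To illustrate the idea behind expression re-factoring, the following minimal Python sketch rebalances a left-to-right chain of additions into a balanced tree, so that independent additions can proceed in parallel. It is not the algorithm used by the proposed compiler: the names `Add`, `chain`, `operands`, `balance`, and `depth` are hypothetical, and the sketch assumes the operator is associative (as fixed-width modular addition is).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Add:
    """A binary addition node; leaves are plain variable names (str)."""
    left: "Expr"
    right: "Expr"


Expr = Union[str, Add]


def chain(names: List[str]) -> Expr:
    """Build the left-nested tree ((a + b) + c) + ... of a naive specification."""
    expr: Expr = names[0]
    for name in names[1:]:
        expr = Add(expr, name)
    return expr


def operands(expr: Expr) -> List[str]:
    """Collect the leaf operands in left-to-right order."""
    if isinstance(expr, str):
        return [expr]
    return operands(expr.left) + operands(expr.right)


def balance(ops: List[str]) -> Expr:
    """Rebuild the sum as a balanced tree (valid only for associative operators)."""
    if len(ops) == 1:
        return ops[0]
    mid = len(ops) // 2
    return Add(balance(ops[:mid]), balance(ops[mid:]))


def depth(expr: Expr) -> int:
    """Number of additions on the longest operand-to-result path."""
    if isinstance(expr, str):
        return 0
    return 1 + max(depth(expr.left), depth(expr.right))


serial = chain(list("abcdefgh"))        # seven dependent additions in sequence
balanced = balance(operands(serial))    # same operands, balanced tree
print(depth(serial), depth(balanced))
```

For eight operands, the critical path drops from seven additions to three, while the operand order is unchanged.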

The proposed approach to reordering of channel communication actions addresses a challenging problem. In particular, for a given module, changing the order of two communication actions is fundamentally different from reordering two computational actions. In the latter case, dependency analysis can easily determine which reorderings or parallel groupings of those actions preserve the original semantics. However, channel communication inherently involves subtle synchronization issues, and naïvely reordering two communication actions may introduce a deadlock into the system. A conservative approach is to always maintain the original order of channel actions; although safe, such an approach is suboptimal. The proposed strategy, instead, is to perform a careful analysis to determine the space of legal code transformations. As a result, our approach provides greater opportunity for concurrency enhancement.
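The deadlock hazard can be seen in a small Python sketch. It uses a deliberately simplified model, assumed here only for illustration: each process is a fixed sequence of actions on named channels, every channel connects exactly two processes, and channels are unbuffered, so an action completes only when both endpoints have reached it. The function `runs_to_completion` and the process names are hypothetical; this is not the analysis performed by the proposed approach.

```python
from typing import Dict, List


def runs_to_completion(processes: Dict[str, List[str]]) -> bool:
    """Return True if all processes finish, False if the system deadlocks.

    `processes` maps a process name to its channel actions in program order.
    Channels are unbuffered: an action fires only when both endpoints are
    currently waiting on that same channel.
    """
    pc = {name: 0 for name in processes}  # next action index per process
    while True:
        # Group the unfinished processes by the channel they are blocked on.
        waiting: Dict[str, List[str]] = {}
        for name, actions in processes.items():
            if pc[name] < len(actions):
                waiting.setdefault(actions[pc[name]], []).append(name)
        if not waiting:
            return True   # every process has completed
        ready = [procs for procs in waiting.values() if len(procs) == 2]
        if not ready:
            return False  # processes remain, but no channel can fire
        for procs in ready:
            for name in procs:
                pc[name] += 1


# Both modules communicate on channel A first, then on channel B.
original = {"producer": ["A", "B"], "consumer": ["A", "B"]}

# Swapping the two actions in one module only: the producer now waits on B
# while the consumer waits on A, and neither can ever proceed.
reordered = {"producer": ["B", "A"], "consumer": ["A", "B"]}

print(runs_to_completion(original), runs_to_completion(reordered))
```

In this model the original ordering completes, whereas the locally reordered one blocks at its very first actions, which is why communication reordering requires analysis beyond per-module data dependencies.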

Previous approaches for improving the throughput of implementations produced by the Haste and Balsa tools have mostly focused on the circuit and intermediate (handshake) levels, including more optimized circuit-level designs of handshake components (e.g., more concurrent sequencers (Plana et al., 2005)), and peephole optimization and re-synthesis at the intermediate level (Chelcea and Nowick, 2002). While some of these approaches have yielded significant speedup (1.54–2.06x), they are unable to take advantage of the significantly greater optimization opportunities at a higher level. As Section 3.5 shows, optimizing at the source level can provide an order of magnitude greater speedup. Moreover, the intermediate and circuit-level approaches are orthogonal to the proposed approach, so they can still be applied within the same design flow.

The domain of specifications targeted by the proposed approach is slack elastic systems (Manohar and Martin, 1998a). A slack elastic system preserves correct operation even if extra pipeline buffer stages (i.e., extra slack) are introduced on any communication channel. It was shown that a system is slack elastic if it is deadlock-free and it satisfies certain properties regarding channel probing and non-determinism (Manohar and Martin, 1998a). Since the approach introduces pipelining into a specification, the assumption of slack elasticity is a requirement.

The proposed approach has been implemented in an automated tool, and evaluated on a suite of design examples. The resulting concurrency-enhanced specifications were run through the commercial Haste tools from Philips/Handshake Solutions (Haste, 2008), and synthesized to gate-level netlists and simulated. Experimental results demonstrate that the original specifications are correctly and efficiently rewritten into highly concurrent ones. If code length is used as an indicator of designer effort, the proposed approach reduces the required effort by a factor of 3.3x on average (up to 8.8x). Alternatively, the impact can be quantified by the throughput improvement achieved by optimizing the original specification: up to 59x speedup using the basic approach, and a further 5.2x using arithmetic pipelining.

The remainder of this chapter is organized as follows. Section 3.2 reviews the Haste flow and asynchronous pipelining, and discusses related previous work. Section 3.3 presents the basic concurrency-enhancing transformations. Section 3.4 discusses advanced topics, including arithmetic optimization, handling of conditionals and loops, and reordering of channel communication actions. Section 3.5 presents results, and Section 3.6 gives conclusions and future work.