
8.3 Algorithmic Extensions

The presented synthilation algorithm does not execute any optimization steps on its input values, intermediate representations, or generated output. Hence, the performance results of this approach represent the lower bound for the acceleration potential of token set synthilation. Improving the synthilation approach to the point of optimal performance is beyond the scope of this thesis. Nonetheless, some optimizations regarding the memory access behavior, the distribution of address data, and the reduction of memory consumption by the synthilated token sets are presented to give a perspective on the algorithm's capabilities.

8.3.1 Dataflow Graph Optimizations

The most promising optimization concerns the dataflow graph. Since, for example, all local variable load operations have to be executed on the local variable memory, this memory may become a bottleneck. Hence, it is desirable to reduce the load on this functional unit. To this end, duplicate load operations are removed from the data path.

The sole remaining operation then stores the value of the local variable within a scratchpad memory. All consumers of this value then refer to the intermediate value instead of the original location. Thus, the local variable memory can be relieved. Furthermore, execution is accelerated because access operations to different scratchpads can be executed in parallel. Additionally, as already mentioned, scratchpads are implicitly addressed while the local variable memory is addressed explicitly. Hence, address data has to be sent to the local variable memory. This consumes time and may even stall the communication of other data packets. The described mechanism does not only work for local variables, but can be applied to all kinds of data from all kinds of memories.

This technique is called Common Subexpression Elimination (and should not be confused with the compilation technique of the same name). In order to keep the analysis steps efficient, and to avoid problems during the token creation process, the implemented algorithm is limited to the recognition of duplicate nodes within the graph. Nonetheless, it can be extended at will, and thus may become


overall complexity of the synthilation increases in case common subexpression elimination is enabled.

The implementation itself is realized in a single method which checks all the elimination rules for a given synthilation configuration. Here, it is decided whether a node may be eliminated or not. The configuration which has been evaluated for this thesis maps the following values to an intermediate and removes the corresponding original duplicate read access operations:

• All constant values which occur more than ten times.

• All local variables which are read a minimum of three times and require explicit addressing, i.e. all variables with an address greater than three.

• All intermediate values which are themselves read more than three times.

These values are essentially estimates of reasonable thresholds. The local variables with an address of three or lower are already accessed by special operations with implicit addressing. Hence, moving them to a scratchpad does not save communication. The thresholds for intermediate and constant values have been chosen intuitively. Finding good filters for this mechanism should be part of future research.
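The rules listed above can be sketched as a single predicate, in the spirit of the single rule-checking method mentioned earlier. This is a minimal Python sketch and not the thesis implementation: the names `ReadNode`, `may_eliminate` and `plan_elimination`, as well as the representation of read accesses as (kind, key) pairs, are assumptions made for illustration. Only the thresholds are taken from the list.

```python
from collections import Counter
from typing import NamedTuple


class ReadNode(NamedTuple):
    """One read access in the dataflow graph (hypothetical representation)."""
    kind: str   # "const", "local" or "intermediate"
    key: int    # constant value, local variable address, or intermediate id


def may_eliminate(node: ReadNode, reads: int) -> bool:
    """Check the elimination rules of the evaluated configuration."""
    if node.kind == "const":
        return reads > 10                     # occurs more than ten times
    if node.kind == "local":
        return reads >= 3 and node.key > 3    # at least three reads, address > 3
    if node.kind == "intermediate":
        return reads > 3                      # read more than three times
    return False


def plan_elimination(nodes: list[ReadNode]) -> set[ReadNode]:
    """Return the values whose duplicate reads would be replaced by an intermediate."""
    counts = Counter(nodes)
    return {node for node, reads in counts.items() if may_eliminate(node, reads)}


# Example: local variable 6 is read three times, local variable 2 four times,
# and the constant 1 twice. Only local variable 6 satisfies a rule.
reads = [ReadNode("local", 6)] * 3 + [ReadNode("local", 2)] * 4 + [ReadNode("const", 1)] * 2
selected = plan_elimination(reads)
```

Keeping the thresholds in one function makes it easy to experiment with other filters, which is the kind of follow-up work the text suggests.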

8.3.2 Parallel Distribution of Constant Values

An additional bottleneck may appear at the output port of the token machine. Here, all constant values which are required for the execution of a token are issued. These values may be address data for the local variables or the object heap, but can also be constant values that are operands for arithmetic operations. The crucial point is that these constants are most probably not destined for the same functional unit. Hence, they might block each other from being sent and thus slow down the execution.

Figure 8.14: Enhanced Dataflow Graph for the Autocorrelation Example's Inner Loop

In order to avoid this behavior, the token machine is equipped with additional output ports. The number of these ports is variable. The interesting point is the binding of the constant values to the ports. In the simplest binding process, the constants are bound to the output ports via round robin. More complex algorithms may be able to optimize the binding, and thus yield even better performance improvements. All evaluations in this thesis utilize round robin.
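The round-robin binding can be illustrated with a few lines of Python. This is only a sketch under assumptions: the function name `bind_round_robin`, the list-of-lists representation of the output ports, and the string labels of the constants are hypothetical and not taken from the thesis.

```python
def bind_round_robin(constants: list, num_ports: int) -> list[list]:
    """Assign the constants of a token to output ports in round-robin order."""
    if num_ports < 1:
        raise ValueError("at least one output port is required")
    ports = [[] for _ in range(num_ports)]
    for index, constant in enumerate(constants):
        # The i-th constant goes to port i modulo the number of ports.
        ports[index % num_ports].append(constant)
    return ports


# Five constants of one token distributed over two output ports.
binding = bind_round_robin(["addr 4", "addr 6", "const 1", "addr 5", "offset 26"], 2)
```

Round robin ignores the destination of each constant, so two constants for the same functional unit may still end up queued behind each other. A more elaborate binding would have to take the destinations into account.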


8.3.3 Token Set Compression

Another optimization concerns the already synthilated token sets for a given bytecode sequence. These token sets are represented by lines of the token memory, and as already mentioned, one token per functional unit can be stored in a single line. Hence, most lines will not be completely filled with actual tokens, but will contain many empty tokens. Nonetheless, an empty token consumes the same amount of token memory as a relevant token, because each token line has the same size regardless of the information it contains.

The only way to reduce the required token memory for a token set is to eliminate lines from it. This is a simple process, as the token distribution and the actual processing of the tokens are not synchronized with each other. Hence, it does not matter whether a token is distributed in the cycle it was originally meant to be, or whether it is issued earlier and waits an additional cycle in the token queue.

As just mentioned, many entries within the token memory are empty. Hence, they may be used to bring forward tokens which would otherwise have been issued later. If all tokens from the last line of a token set can be brought forward, the line can be eliminated.

The algorithm that performs this compression is straightforward. The complete token set is iterated. If a token for a specific functional unit succeeds an empty token for this functional unit, it is moved from its original location to the empty spot. Essentially, an as-soon-as-possible scheduling that respects the current order of the tokens is performed.
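The compression step can be sketched as follows. The sketch assumes that a token set is a list of lines with one slot per functional unit, that `None` marks an empty token, and that tokens of different functional units may be shifted independently of each other, as the unsynchronized distribution described above suggests. The name `compress_token_set` and the data layout are hypothetical.

```python
from typing import Optional

TokenLine = list[Optional[str]]   # one slot per functional unit, None = empty token


def compress_token_set(lines: list[TokenLine]) -> list[TokenLine]:
    """Move each token to the earliest free slot of its functional unit, keeping order."""
    if not lines:
        return []
    num_units = len(lines[0])
    # Collect the tokens of each functional unit in their original order.
    columns = [[line[unit] for line in lines if line[unit] is not None]
               for unit in range(num_units)]
    depth = max((len(column) for column in columns), default=0)
    # Rebuild the lines; units with fewer tokens are padded with empty tokens.
    return [[column[row] if row < len(column) else None for column in columns]
            for row in range(depth)]


# Four sparsely filled lines for three functional units.
token_set = [
    ["a1", None, "c1"],
    [None, "b1", None],
    ["a2", None, None],
    [None, "b2", "c2"],
]
compressed = compress_token_set(token_set)
```

In this example the four original lines shrink to two, because every functional unit holds exactly two tokens. The number of remaining lines always equals the largest number of tokens held by any single functional unit.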

The token set compression is a processing step which only reduces the amount of consumed token memory. It should not affect the latency of the synthilated token set. Nonetheless, the execution time may vary slightly due to side effects and potential artifacts.

In document Performance Improvement of Adaptive Processors: Hardware Synthesis, Instruction Folding and Microcode Assembly (Page 183-186)