8.4 Synthilation for an Unaltered Basic Processor

The description of the synthilation mechanism has shown that the developed algorithms are highly parameterizable. Their most essential ability is to synthilate a token set for an unaltered basic processor. It is not necessary to add new hardware to the processor in order to profit from synthilation. The only minor change that has to be made is to equip the basic operand stack with extended scratchpad functionality.

Apart from that, the synthilation mechanism can be executed completely in software and works on a processor without any additional hardware. Thus, in contrast to hardware synthesis or instruction folding, synthilation can be considered a pure software accelerator. This section presents the runtime impact of synthilation on a basic processor as well as the characteristics of the created token sets.

Like the two acceleration mechanisms presented before, token set synthilation has been evaluated with the well-known set of 32 benchmark applications. The most interesting characteristics of the synthilation approach do not differ much from those examined for synthesis and folding.

First of all, the gained speedup has been evaluated. Here, additional attention has been given to a comparison with the speedup gained through instruction folding. The synthilation speedup itself has been measured on the basic AMIDAR processor shown in section 3.3.

Furthermore, the ALU's utilization gives an impression of the effectiveness of the synthilated token sets, which is also mirrored by the amount of eliminated tokens. Although the newly synthilated token sets may be more effective, they still require storage within the token generator, and thus the size of the token sets is analyzed. As the operand stack is almost eliminated from the data path of a synthilated token set, it is also interesting to look at the memory access behavior of these token sets, as they might create a new memory bottleneck.

The measurement values for all evaluations that are discussed in this chapter can be found in appendix B.4. Note that the influence of the number of bus structures on the performance of the basic processor has not been evaluated. All benchmark runs and processor configurations relied on a six-bus communication network. This allows the synthilation results to be compared with the performance numbers of hardware synthesis and instruction folding, as these two mechanisms have been benchmarked on a processor with six buses as well.

184 CHAPTER 8. ASSEMBLY OF MICROINSTRUCTION GROUPS

[Figure 8.15 (bar chart): speedup of whitelist based instruction folding and speedup of token set synthilation per benchmark; value axis "Speedup" with ticks from 1 to 5. Benchmarks: Rijndael, Skipjack, 3DES, IDEA, RC6, Serpent, Twofish and XTEA (each as RKG and SBE), BLAKE, CubeHash, ECOH, MD5, SIMD, SHA1, SHA256, RadioGatun, ContrastFilter, GrayscaleFilter, SobelFilter, SwizzleFilter, JpegEncoder, CST, 2-D DCT, Quantization]

Figure 8.15: Comparison of Synthilation and Whitelist Based Folding with 1024 Entries

Runtime Impact and Comparison with Instruction Folding

In section 8.1.1 it has been mentioned that, even with very large instruction registers, a significant amount of stack operations remains unfolded due to potential deadlocks. Therefore, the most interesting attribute of the synthilation approach is its performance in comparison to normal instruction folding.

The diagram in figure 8.15 compares the speedup of the two acceleration mechanisms. As already mentioned, synthilation has been executed for an AMIDAR processor that is equipped with the basic functional units. The instruction folding performance mirrors the results from figure 7.18 and represents the whitelist based instruction folding with 1024 patterns from stack balance based folding. This folding approach has one of the smallest hardware overheads for the implementation of the instruction folding logic.

Even so, it can be seen that synthilation performs significantly better than instruction folding. The average speedup gained through synthilation is 2.67, while conventional folding only achieves an acceleration factor of 1.42. This is an improvement by a factor of almost two (2.67 / 1.42 ≈ 1.88). Furthermore, no application kernel performs worse when executed by a synthilated token set. Nonetheless, it has to be kept in mind that instruction folding affects all parts of the application's code, while synthilation is applied to application kernels only. Still, the performance of the Jpeg-Encoder as a whole application is better with synthilation of the kernels than with acceleration of the whole code through folding. Certainly, instruction folding is capable of delivering better performance in case of dynamic detection and folding of sequences. This typically comes at the cost of a larger hardware overhead. As synthilation can be executed completely without additional hardware, an equitable comparison has to consider the hardware overhead of the folding logic. Overall, synthilation seems to be the more promising acceleration approach.

[Figure 8.16 (stacked bar chart): relative amount of time the ALU spends in busy / pending operation mode with 1 ALU and 1 scratchpad; value axis "ALU Utilization in %" with ticks from 20 to 100; one bar per benchmark application]

Figure 8.16: Utilization of ALUs by Synthilation for Small Footprint Processor

Utilization of the ALU Functional Unit

The utilization of the ALU is shown in figure 8.16. It can be seen that the ALU is busy in ≈ 20% of all clock cycles. Its actual utilization is clearly suboptimal. Please note that the utilization numbers for the busy and pending operating states are stacked and not overlapping.


[Figure 8.17 (bar chart): amount of remaining tokens after token set synthilation; value axis "Remaining Tokens in %" with ticks from 20 to 80; one bar per benchmark application]

Figure 8.17: Eliminated Amount of Distributed Tokens by Token set Synthilation

The essential point of the utilization diagram is that memory accesses and the communication between functional units still slow down the execution significantly. The ALU is in pending operation mode during ≈ 66% of all clock cycles. This means that the ALU spends most of the time waiting for the operands of the operation it is currently executing.

This problem can be addressed either by distributing memory regions to scratchpad memories, which would allow parallel access to e.g. local variables, or by optimizing memory access patterns and avoiding unnecessary memory accesses.

Fewer Tokens, More Performance

In section 8.1 the proposition has been made that each stack operation can be eliminated from a bytecode sequence. However, it has already been mentioned that this is not completely possible due to the semantics of some instructions and the AMIDAR principle of operation. In these cases, the involved operations are executed on a scratchpad memory or even the operand stack.


[Figure 8.18 (bar chart): received / sent data packets per 100 clock cycles during execution of synthilated token sets and during plain interpretation; value axis "Data Packets" with ticks from 10 to 50; functional units: Token Machine, Object Heap, Method Stack, Operand Stack, LocalVar Memory]

Figure 8.18: Influence of Synthilation on Memory Access Behavior and Frequency

Nonetheless, the number of micro-instructions required to execute a given bytecode sequence should be significantly smaller due to the large amount of eliminated stack operations. Diagram 8.17 displays the amount of tokens that remain for the execution of the different benchmarks after token set synthilation has been applied to the application kernels.

Token set synthilation eliminated ≈ 31% to ≈ 77% of all tokens, with an average of ≈ 66%. Hence, in general, synthilation eliminates two out of three tokens and the corresponding operations. This correlates with the achieved average speedup of ≈ 3 (2.67 in figure 8.15). As only one third of the original amount of operations has to be processed, execution accelerates by a factor of about three.
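The relation between eliminated tokens and speedup can be sketched as follows. This is a minimal illustration under the simplifying assumption that runtime is proportional to the number of distributed tokens; the function name ideal_speedup is hypothetical and not part of the evaluated tool chain.

```python
def ideal_speedup(eliminated_fraction: float) -> float:
    """Speedup bound if runtime were proportional to the remaining token count.

    Assumption (not from the measurements): every token costs the same time.
    """
    if not 0.0 <= eliminated_fraction < 1.0:
        raise ValueError("eliminated_fraction must be in [0, 1)")
    return 1.0 / (1.0 - eliminated_fraction)


# Elimination rates quoted in the text: minimum, average and maximum.
for label, fraction in [("min", 0.31), ("avg", 0.66), ("max", 0.77)]:
    bound = ideal_speedup(fraction)
    print(f"{label}: {fraction:.0%} eliminated -> speedup bound {bound:.2f}")
```

For the average elimination rate of 66% the bound is about 2.94. The measured average of 2.67 lies slightly below it, which fits the observation that memory accesses and communication still delay the execution.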

The Memory Bottleneck Shifts

In section 3.2, it has been noted that the operand stack is the bottleneck of an AMIDAR based Java machine. One goal of all three proposed acceleration mechanisms has been the elimination of stack operations from the data path and, as a result, a faster execution. Hardware acceleration by a CGRA does this by shifting the execution to a new stackless data path. In contrast, instruction folding and token set synthilation try to eliminate the operand stack from the data path by token relocation.

In case this is done successfully, the operand stack is relieved, and stack operations only occur in much smaller amounts. The remaining access operations belong to code which lies outside the loops of an application kernel or could not be folded. The amount of memory access operations per 100 clock cycles is shown in diagram 8.18.


[Figure 8.19 (bar chart): tokens contained in synthilated token sets and constant values contained in synthilated token sets; value axis "Tokens / Constants" with ticks from 512 to 2048; one group of bars per benchmark application]

Figure 8.19: Size of Synthilated Token Sets Regarding Tokens and Constant Values

Obviously, the operand stack has a much lower utilization. Hence, many access operations on this functional unit have been eliminated. The new bottlenecks are located within the token machine and the local variable memory. The token machine distributes a large amount of constant values that function as address data for the different memory functional units. Furthermore, the memories are utilized more frequently as a result of the acceleration. The number of access operations is actually the same, but they are executed within a shorter timespan. Thus, the relative utilization increases, which may create a bottleneck.
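The effect of an unchanged access count within a shorter runtime can be illustrated with a small calculation. The access and cycle counts below are hypothetical placeholders, not measurement values; only the speedup of 2.67 is taken from the text.

```python
def accesses_per_100_cycles(accesses: int, cycles: float) -> float:
    """Access frequency of a functional unit, normalized to 100 clock cycles."""
    return 100.0 * accesses / cycles


baseline_cycles = 10_000  # hypothetical runtime of plain interpretation
accesses = 1_000          # hypothetical number of local variable accesses
speedup = 2.67            # average synthilation speedup reported in the text

# Same number of accesses, but the runtime shrinks by the speedup factor.
before = accesses_per_100_cycles(accesses, baseline_cycles)
after = accesses_per_100_cycles(accesses, baseline_cycles / speedup)

print(f"before: {before:.1f} accesses per 100 cycles")
print(f"after:  {after:.1f} accesses per 100 cycles")
```

Under these assumptions the access frequency of a memory grows by exactly the speedup factor, although not a single access has been added.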

Memory Requirements of Synthilated Token Sets

A critical point for the realization of token set synthilation is the size of the created token sets and the number of contained constant values that function as address data for memory access operations. In case the amount of tokens is too big, a realization may not be reasonable due to the large memory overhead.

As already mentioned in chapter 6, an AMIDAR based Java machine already has to be able to store up to 2048 or 4096 tokens. This memory size probably has to be increased if synthilated token sets are to be stored as well. Figure 8.19 depicts the amount of tokens and constant values contained in the synthilated token sets for the benchmarks' application kernels.

[Figure 8.20 (bar chart): panel title "Number of Immediate Values i.e. Scratchpad Entries"; value axis "Intermediate Values" with ticks from 32 to 128; one bar per benchmark application]

Figure 8.20: Amount of Induced Intermediate Scratchpad Values

Obviously, most of the benchmarks can be realized by a relatively small set of tokens and constant values. They are executed by token sets with fewer than 512 tokens and 256 constant values. The other benchmarks can be implemented with fewer than 2048 tokens and 512 constant values. Hence, a linear scaling of the token memory's size with the average speedup of ≈ 3, which has been shown in figure 8.15, should be sufficient. This would require an overall token memory size of e.g. 16k tokens and 1024 constant values.
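One possible reading of the 16k figure is sketched below. It assumes that the larger baseline of 4096 tokens is scaled by the rounded speedup of 3 and that the result is rounded up to the next power of two. The rounding step is an assumption made for this illustration and is not stated in the text, and next_power_of_two is a hypothetical helper.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is greater than or equal to n."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


baseline_tokens = 4096  # upper baseline token memory size named in the text
scale = 3               # rounded average speedup used for linear scaling

scaled = baseline_tokens * scale          # linear scaling
token_memory = next_power_of_two(scaled)  # assumed rounding to a memory size

print(f"scaled: {scaled} tokens, token memory: {token_memory} tokens")
```

The size of 1024 constant values is taken from the text as given and is not derived here.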

Reasonable Scratchpad Dimensions

Another memory constraint of the synthilation process is the size of the scratchpad memories, i.e. the number of entries that can be stored within such a functional unit. The overall number of scratchpad entries is the same for all synthilation runs of an instruction sequence, as their injection into the dataflow graph is independent of the processor's configuration or allocation.


The only parameter which affects the number of entries within a single scratchpad is the number of actually resident scratchpads in the underlying processor. As mentioned, the scratchpad entries are bound via balanced round robin. Hence, the number of mapped intermediate values within a scratchpad depends on the distribution of access operations within the code. Thus, the overall number of intermediates for a benchmark is an upper bound for the maximum number of intermediates within a single scratchpad, but no reliable projection about the actual number can be given.

The measurement values shown in figure 8.20 indicate that no benchmark relies on more than 128 intermediate values, each of which occupies one scratchpad entry. Hence, scratchpads with 256 entries should be sufficient not only for the displayed benchmarks, but also for even larger applications. Please remember that only one synthilated token set can be executed at a time in a single-core AMIDAR processor. In a multicore configuration, scaling the scratchpad size with the number of cores should be considered.

8.4.1 Influence of Common Subexpression Elimination

Besides the non-optimizing synthilation of token sets for the baseline processor configuration, several possible improvements of the algorithm exist. One of them is the elimination of common load operations from the dataflow graph and, with it, the relocation of memory access operations to scratchpads. Scratchpads do not have to be addressed by data packets, but by address information that is encoded within the token itself. Hence, accesses to scratchpads are faster. Furthermore, the scratchpads allow the parallelization of access operations to e.g. local variables. The performance results of synthilation for the basic AMIDAR processor with and without common subexpression elimination are shown in figure 8.21.

It can be seen that the elimination of read access operations to the memory functional units provides a performance improvement for more than half of the benchmarks. The average speedup increases from 2.67 to 2.84, an improvement of ≈ 6%. This is just a small improvement, but nonetheless, common subexpression elimination holds the potential for further runtime improvements on processors with more than a single ALU and multiple scratchpads.
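The idea of eliminating common load operations can be sketched on a simplified, linear operation list. This is not the synthilation algorithm itself: the tuple based representation, the operation names load, store, sp_write and sp_read, and the slot naming are all hypothetical. The sketch only covers named local variables within straight-line code and ignores aliasing of heap or array accesses; sp_write is assumed to copy a value into a scratchpad entry without consuming it.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str]  # (operation kind, argument); hypothetical representation


def eliminate_common_loads(ops: List[Op]) -> List[Op]:
    """Replace repeated loads of a variable by reads from a scratchpad slot."""
    slot_of: Dict[str, str] = {}  # variable -> scratchpad slot holding its value
    next_slot = 0
    result: List[Op] = []
    for kind, arg in ops:
        if kind == "load" and arg in slot_of:
            # Value is already buffered: read it from the scratchpad instead.
            result.append(("sp_read", slot_of[arg]))
        elif kind == "load":
            # First load: keep it and buffer the value in a fresh slot.
            slot = f"sp{next_slot}"
            next_slot += 1
            slot_of[arg] = slot
            result.append(("load", arg))
            result.append(("sp_write", slot))
        else:
            if kind == "store":
                # A store makes the buffered copy of this variable stale.
                slot_of.pop(arg, None)
            result.append((kind, arg))
    return result


def count_loads(ops: List[Op]) -> int:
    """Number of accesses that still go to the memory functional unit."""
    return sum(1 for kind, _ in ops if kind == "load")


kernel: List[Op] = [
    ("load", "a"), ("load", "b"), ("add", ""),
    ("load", "a"), ("load", "b"), ("mul", ""),
    ("store", "a"), ("load", "a"),
]
optimized = eliminate_common_loads(kernel)
print(count_loads(kernel), "loads before,", count_loads(optimized), "loads after")
```

In this example the second loads of a and b become scratchpad reads, while the load after the store has to access the memory again because the buffered value is no longer valid.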

In document Performance Improvement of Adaptive Processors: Hardware Synthesis, Instruction Folding and Microcode Assembly (Page 186-195)