
[Bar chart: Speedup Gained From Token Set Synthilation for 1 ALU Without / With CSE. Vertical axis: Speedup (ticks 1 to 5). Horizontal axis: the benchmarks Rijndael, Skipjack, 3DES, IDEA, RC6, Serpent, Twofish and XTEA (each as RKG and SBE), BLAKE, CubeHash, ECOH, MD5, SIMD, SHA1, SHA256, RadioGatun, ContrastFilter, GrayscaleFilter, SobelFilter, SwizzleFilter, JpegEncoder CST, 2-D DCT and Quantization.]

Figure 8.21: Kernel Speedups Gained Through Elimination of Common Subexpressions

Eliminating read operations implies the introduction of additional intermediate values. Hence, the overall number of intermediate values increases, and so does the number of required scratchpad entries. Figure 8.22 presents the resulting changes in scratchpad utilization. The average number of intermediate values per benchmark doubles from 19.8 to 39.7, and the maximum number of scratchpad entries increases from 123 to 184. Thus, the scratchpad sizes should be doubled when common subexpression elimination is enabled.
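One way to see why reuse costs storage is a small toy model. The following sketch is not the synthilation algorithm itself: it assumes straight-line code in which every destination is written exactly once, and it measures the peak number of simultaneously held values rather than the per-benchmark totals reported above. The names `eliminate_common_subexpressions`, `peak_live_values` and `kernel` are hypothetical.

```python
from typing import Dict, List, Tuple

# One operation: (destination, operator, left operand, right operand).
# Assumption: straight-line code, every destination is written exactly once.
Op = Tuple[str, str, str, str]


def eliminate_common_subexpressions(ops: List[Op]) -> List[Op]:
    """Drop operations that recompute an already available expression."""
    available: Dict[Tuple[str, str, str], str] = {}  # expression -> value name
    alias: Dict[str, str] = {}                        # dropped name -> kept name
    kept: List[Op] = []
    for dest, operator, left, right in ops:
        left, right = alias.get(left, left), alias.get(right, right)
        key = (operator, left, right)
        if key in available:
            alias[dest] = available[key]  # reuse the earlier result
        else:
            available[key] = dest
            kept.append((dest, operator, left, right))
    return kept


def peak_live_values(ops: List[Op]) -> int:
    """Largest number of computed values that must be held at the same time."""
    defined_at = {dest: i for i, (dest, _, _, _) in enumerate(ops)}
    last_use: Dict[str, int] = {}
    for i, (_, _, left, right) in enumerate(ops):
        for operand in (left, right):
            if operand in defined_at:  # primary inputs are not counted
                last_use[operand] = i
    peak = 0
    for i in range(len(ops)):
        live = sum(1 for name, end in last_use.items()
                   if defined_at[name] < i <= end)
        peak = max(peak, live)
    return peak


# Hypothetical kernel: a+b and c+d are each computed twice.
kernel: List[Op] = [
    ("t1", "+", "a", "b"),
    ("t2", "+", "c", "d"),
    ("x", "*", "t1", "t2"),
    ("u", "*", "x", "e"),
    ("v", "*", "u", "f"),
    ("t3", "+", "a", "b"),  # recomputes t1
    ("w", "*", "v", "t3"),
    ("t4", "+", "c", "d"),  # recomputes t2
    ("y", "*", "w", "t4"),
]
optimized = eliminate_common_subexpressions(kernel)
print(len(kernel), peak_live_values(kernel))        # 9 2
print(len(optimized), peak_live_values(optimized))  # 7 3
```

In this toy kernel the operation count drops from nine to seven, but the reused values t1 and t2 must be kept until their last consumer, so the peak storage demand rises from two to three entries.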

8.5 Synthilation Performance on Multi-ALU Processors

The greatest advantage of the presented synthilation algorithm is its portability to processors with variable numbers of ALUs and scratchpads. To this end, the binding and scheduling steps of the synthilation process are able to deal with resource constraints, which allows the number of hardware resource instances to vary. In other words, the other synthilation steps do not need to be adapted. If more than one instance of an ALU or a scratchpad exists, the resulting token set for that processor differs only regarding the respective

192 CHAPTER 8. ASSEMBLY OF MICROINSTRUCTION GROUPS

[Bar chart: Number of Intermediate Values, i.e. Scratchpad Entries, Without / With CSE. Vertical axis: Intermediate Values (ticks 64, 128, 192). Horizontal axis: the same benchmarks as in Figure 8.21.]

Figure 8.22: Increase of Scratchpad Utilization Through Common Subexpression Elimination

instances that carry out an operation, but the number of tokens and constants is equal to that of the basic processor.
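The effect of a resource constraint on scheduling and binding can be sketched with a greedy list scheduler. This is a minimal illustration under stated assumptions, not the binding and scheduling procedure of the synthilation process: every operation is assumed to take one cycle, and the function `list_schedule` and the dependency graph `deps` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def list_schedule(deps: Dict[str, List[str]],
                  num_alus: int) -> List[List[Tuple[str, int]]]:
    """Greedy unit-latency list scheduling.

    deps maps each operation to the operations it depends on.
    Returns one list per cycle with (operation, ALU index) pairs.
    """
    remaining = dict(deps)
    done: Set[str] = set()
    schedule: List[List[Tuple[str, int]]] = []
    while remaining:
        # An operation is ready once all predecessors finished in earlier cycles.
        ready = [op for op, preds in remaining.items()
                 if all(p in done for p in preds)]
        if not ready:
            raise ValueError("cyclic dependencies")
        issued = ready[:num_alus]  # resource constraint: one operation per ALU
        schedule.append([(op, alu) for alu, op in enumerate(issued)])
        for op in issued:
            del remaining[op]
        done.update(issued)
    return schedule


# Hypothetical dependency graph: four independent inputs, reduced pairwise.
deps = {
    "a": [], "b": [], "c": [], "d": [],
    "e": ["a", "b"],
    "f": ["c", "d"],
    "g": ["e", "f"],
}
for alus in (1, 2, 4, 8):
    print(alus, len(list_schedule(deps, alus)))
# 1 7
# 2 4
# 4 3
# 8 3
```

The set of scheduled operations is identical for every ALU count; only the cycle and the ALU index assigned to each operation change. In this toy graph, going from four to eight ALUs yields no further gain because the dependency chain limits the schedule length.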

If the synthilation is to be executed for a multi-ALU processor, it is not clear in advance how many instances of the runtime-critical resources are reasonable. Thus, the goal is to determine the sweet spot among all possible processor configurations. The evaluation therefore starts with a generously equipped processor with eight ALUs, eight scratchpads, 32 bus structures and eight constant value distribution channels within the token machine.

Afterwards, the quantity of these resources is constrained one resource type at a time, and the impact of each restriction is evaluated. Each restriction step halves the hardware costs for the respective resource type. The target is a configuration with as few resources as possible that still yields considerable performance improvements.
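The one-at-a-time restriction described above can be sketched as a simple greedy search. The selection rule used here (keep the smallest candidate that retains a given fraction of the best measured speedup) is an assumption made for illustration, since the text selects configurations by judgment. The function `constrain_one_by_one` and the model `toy_speedup` are hypothetical, and the numbers are synthetic; they do not reproduce the measurements in Figure 8.23.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def constrain_one_by_one(start: Config,
                         candidates: List[Tuple[str, List[int]]],
                         measure: Callable[[Config], float],
                         keep_fraction: float) -> Config:
    """Fix one resource type after another, in the given order."""
    config = dict(start)
    for resource, values in candidates:
        # Vary only this resource; all others keep their current values.
        results = {v: measure({**config, resource: v}) for v in values}
        best = max(results.values())
        # Assumed rule: smallest amount that keeps most of the best speedup.
        config[resource] = min(v for v, s in results.items()
                               if s >= keep_fraction * best)
    return config


def toy_speedup(cfg: Config) -> float:
    """Synthetic saturating model, NOT measured data."""
    return min(cfg["alus"], 2 * cfg["scratchpads"], cfg["buses"] / 4,
               4 * cfg["constant_channels"], 3)


start = {"alus": 8, "scratchpads": 8, "buses": 32, "constant_channels": 8}
order = [
    ("alus", [1, 2, 4, 8]),
    ("scratchpads", [1, 2, 4, 8]),
    ("constant_channels", [1, 2, 4, 8]),
    ("buses", [6, 8, 16, 32]),
]
chosen = constrain_one_by_one(start, order, toy_speedup, keep_fraction=0.6)
print(chosen)
```

Because each resource is fixed before the next one is varied, the search needs only one evaluation per candidate value instead of one per combination, at the price of possibly missing a better configuration that a full sweep would find.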

The results of these evaluations are shown in Figure 8.23. The stacked bars from left to right represent the increasing resource constraints. Each bar displays the impact of another constraint on the processor's performance. The leftmost bar shows the restriction of the number of ALUs to one, two and four.


[Stacked bar charts in four panels: Rijndael, Skipjack, 3DES, IDEA, RC6, Serpent, Twofish and XTEA as RKG; the same ciphers as SBE; BLAKE, CubeHash, ECOH, MD5, SIMD, SHA1, SHA256 and RadioGatun; ContrastFilter, GrayscaleFilter, SobelFilter, SwizzleFilter, JpegEncoder CST, 2-D DCT and Quantization. Vertical axis of each panel: Speedup (ticks 2 to 8). Bars per benchmark, from left to right:
Synthilation With 1 / 2 / 4 / 8 ALUs, 8 Scratchpads, 32 Buses and 8 Constant Channels
Synthilation With 2 ALUs, 1 / 2 / 4 / 8 Scratchpads, 32 Buses and 8 Constant Channels
Synthilation With 2 ALUs, 2 Scratchpads, 32 Buses and 1 / 2 / 4 / 8 Constant Channels
Synthilation With 2 ALUs, 2 Scratchpads, 6 / 8 / 16 / 32 Buses and 2 Constant Channels]

Figure 8.23: Determination of Reasonable Hardware Dimensions for a Multi-ALU Processor

The processor with two ALUs achieves the best performance improvement relative to the consumed chip area. Hence, all further evaluations are executed on a processor with two ALUs.

Afterwards, the number of scratchpads is constrained to the same values as the number of ALUs. The measurements show that the number of scratchpads has a logarithmic impact on the processor's performance. Each doubling of the number


keep the amount of resources low, further evaluations rely on a processor with two scratchpads.

Thirdly, the number of distribution channels for constant values is varied from one through two and four up to eight. The influence of this resource constraint is very small on a processor with only two ALUs and two scratchpads. To gain a small performance improvement, and considering the small chip area consumed by such a port, the number of ports is set to two.

Finally, as the number of communication partners has already been limited and 32 buses are most probably far too many, the number of bus structures is changed to six, eight and 16. Here, the configuration with eight buses performs significantly better than the basic processor with six buses. A further enlargement of the communication network does not pay off, and thus the number of buses is constrained to eight.

The baseline processor with a single ALU, a single scratchpad, six bus structures and one constant distribution channel within the token machine achieved an average speedup of 2.24 with common subexpression elimination enabled. The constrained processor with two ALUs, two scratchpads, eight buses and two constant channels delivers an improved speedup of 3.17. This means that the overall speedup increases by ≈ 44%, while the amount of hardware resources has only been increased by two functional units (one ALU and one scratchpad), two buses and an output port within the token machine.

Nonetheless, this is not the peak performance that can be achieved through token set synthilation. The average speedup can be increased to ≈ 4. To this end, the number of all resources has to be increased significantly. A processor with 16 ALUs and scratchpads, 64 buses and eight constant distribution channels increases the average speedup to that point, while the maximum speedup of ≈ 10 is reached for the SHA-256 digest.

Obviously, it is not possible to determine ideal characteristics for a processor with token set synthilation. If hardware resources are the limiting factor, a processor with two ALUs already delivers very good performance. Then again, it is possible to increase the average speedup from ≈ 3.2 to ≈ 4, which comes at