
FPGAs are a reprogrammable, highly parallel substrate commonly used for prototyping hardware. Structurally, an FPGA comprises a grid of homogeneous logic slices surrounded by a configurable interconnect network. Each slice consists of a lookup table, used to implement logic functionality, and a handful of registers, used to store state. To first order, FPGA implementation area can be estimated by the number of slices required to implement the design. FPGA synthesis tools seamlessly map register-transfer-level (RTL) logic, e.g. Verilog, onto the FPGA substrate to obtain a cycle-accurate emulator of the RTL design. FPGAs may additionally contain small SRAM and DSP blocks, which reduce the overhead of implementing common but logically complex structures like multipliers. FPGAs are fundamentally more constrained than application-specific integrated circuits (ASICs); as a result, an FPGA implementation requires more silicon area and reaches a lower maximum operating frequency than an implementation in a comparable ASIC process. Nonetheless, FPGAs serve as a useful platform for evaluating microarchitecture, and most production RTL designs are at least partially verified on an FPGA platform.

Several implementations of MD6 were benchmarked using the Xilinx XUP development platform, which features a Xilinx Virtex-II Pro V30 FPGA. The test system consisted of the MD6 hardware, a PowerPC core (embedded in the FPGA), and a DDR DRAM. The PowerPC served to orchestrate the other hardware and to provide some debugging capacity. All hardware components in the system were run at 100 MHz. All synthesis results were obtained using Synopsys Synplify Pro for synthesis and the Xilinx PAR tool for back-end place and route.

The hardware was benchmarked by writing a memory array to the DRAM and then invoking a driver to begin the hardware hash. The result of the hash was then verified against the reference software implementation. All reported benchmark statistics are derived from hardware performance registers embedded in the MD6 FPGA design. Although the hardware was tested with many bit lengths to ensure the correctness of the implementation, only whole-block results are reported here, since partial blocks take effectively the same amount of time to process as a full block.

Submodule         Slice Usage   Percent
Step Function     3249          43.1
Shift Register    3286          43.6
S Generation      206           2.7
Control Logic     788           10.5
Total             7529          100

Figure 5.2: Compression Function Area Breakdown, 32-Parallel

Parallel Steps   Cycles per Compression   Slice Usage   freqMAX (MHz)
1                -                        3778          151.9
2                -                        4070          146.1
4                -                        4233          144.8
8                556.1                    4449          150.6
16               249.1                    5313          150.3
32               165.7                    7529          141.7

Figure 5.3: Compression Function Parallelism

Currently, we have only benchmarked the level-by-level implementation of the MD6 hardware. In general, the performance on the FPGA was quite good, with throughputs as high as 233 MB/s obtained for MD6-512, on par with a quad-core CPU implementation. As shown in Figure 5.4, some inefficiencies exist in processing small messages. Most of these inefficiencies can be attributed to the overhead of filling and draining the processing pipeline. For large messages and low degrees of parallelism, the cycles per compression reach their asymptote at the point suggested by the level of parallelism in the compression function, indicating that the compression function bottlenecks the system.
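The fill-and-drain explanation can be checked with a simple amortization model. The sketch below is not part of the benchmark code; it assumes, purely for illustration, that the cost has the form c(n) = steady + overhead / n for a message of n blocks, and it fits the two parameters through the smallest and largest message sizes reported in this section. The names `fit_two_point` and `predict` are hypothetical.

```python
# Measured cycles per compression versus message size, copied from the
# message-block table in this section.
MEASURED = {1: 540.0, 4: 263.6, 16: 189.3, 64: 173.6,
            256: 167.9, 1024: 166.3, 2048: 165.7}


def fit_two_point(n_small, c_small, n_large, c_large):
    """Solve c(n) = steady + overhead / n through two measurements.

    Assumed model: a fixed pipeline fill/drain cost that is amortized
    over the n blocks of the message.
    """
    overhead = (c_small - c_large) / (1.0 / n_small - 1.0 / n_large)
    steady = c_small - overhead / n_small
    return steady, overhead


def predict(steady, overhead, n_blocks):
    """Modelled cycles per compression for a message of n_blocks blocks."""
    return steady + overhead / n_blocks


# Fit through the 1-block and 2048-block measurements.
steady, overhead = fit_two_point(1, MEASURED[1], 2048, MEASURED[2048])

if __name__ == "__main__":
    print(f"steady state: {steady:.1f} cycles, fixed overhead: {overhead:.1f} cycles")
    for n, measured in MEASURED.items():
        model = predict(steady, overhead, n)
        print(f"{n:5d} blocks: measured {measured:6.1f}, model {model:6.1f}")
```

Under this assumed model the intermediate message sizes are reproduced to within about two percent, which is consistent with a fixed per-message overhead of a few hundred cycles dominating the small-message results.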

Message Blocks   Cycles per Compression
1                540.0
4                263.6
16               189.3
64               173.6
256              167.9
1024             166.3
2048             165.7

CHAPTER 5. HARDWARE IMPLEMENTATIONS 73

Hash Size   PPC 405       32-Parallel IP Core (100 MHz)
(Bits)      32-bit RISC   In-order      Out-of-Order
64          93000         163.2         163.1
224         155000        178.1         163.1
256         167000        181.8         163.2
384         217000        196.7         165.2
512         256000        213.5         165.7

Figure 5.5: Cycles per Compression, various implementations

Figure 5.2 and Figure 5.3 relate some synthesis results. The shift register hardware takes a large portion of all the designs; its size is roughly 3200 slices in all cases. The dominance of the shift register is not a surprise, since MD6 requires a long memory, which helps provide increased security. As the step parallelism increases, the size of the step function increases linearly with the number of steps performed. The control logic and S generation logic make up a small portion of the total area. In general, the size of the control logic scales linearly with the number of parallel steps, since most of the control logic area consists of the I/O muxes. These muxes get larger as more steps are performed in parallel.
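The linear-scaling claim can be illustrated with the total slice counts from Figure 5.3. The following is a minimal sketch, assuming an ordinary least-squares line is a reasonable summary of six data points; the helper name `least_squares_line` is hypothetical and the fit is an illustration, not part of the original analysis.

```python
# Total slice usage per degree of step parallelism, copied from Figure 5.3.
SLICES = {1: 3778, 2: 4070, 4: 4233, 8: 4449, 16: 5313, 32: 7529}


def least_squares_line(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    xs, ys = zip(*points)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# slope: slices added per extra parallel step; intercept: fixed cost.
slope, intercept = least_squares_line(SLICES.items())

if __name__ == "__main__":
    print(f"about {slope:.0f} slices per parallel step, "
          f"about {intercept:.0f} slices of fixed cost")
```

The fit suggests a fixed cost of a few thousand slices, of the same order as the shift register, plus a roughly constant increment per additional parallel step.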

Usually, increasing the size of a circuit reduces the maximum attainable clock frequency. However, the maximum clock frequency increases with the number of parallel steps between 4 and 16 steps. This is likely due to the simplification of the logic generating S and to constant propagation for the left and right shift amounts, which repeat every 16 cycles. At 32 parallel steps, a longer critical path is introduced by the second-round circuitry, some of which depends upon the result of the first round.

The maximum memory bandwidth of the system is approximately 427 MB/s, simplex. Our non-overlapped controller architecture is unable to fully utilize the available memory bandwidth, since it must sometimes stall waiting for computation to complete. Conversely, our out-of-order memory controller, coupled with highly parallel (32 steps or more) compression functions, fully utilizes all available memory bandwidth. This can be seen in Figure 5.5, in which the cycles per compression are nearly the same for MD6-512, MD6-256, and the shorter bit lengths, even though the shorter bit lengths require far less computation than MD6-512.

In our FPGA designs, increasing the number of parallel steps per cycle in an individual compression unit was preferable to introducing a new compression function unit, mostly due to the high overhead of the shift register. Indeed, two compression functions could be implemented in the Virtex-II Pro 30 only if the number of parallel steps in each compression function was constrained to be less than 4. In the FPGA implementation a single compression function with 32-step parallelism was able to fully saturate the memory bandwidth for MD6-512, roughly 427 MB/s.

Our benchmark implementation of MD6 uses a single 100 MHz clock domain, but higher MD6 performance could be obtained by using multiple clock domains.

Algorithm               Slice Usage   freqMAX (MHz)   Throughput (Mbps)
Whirlpool [?]           1456          131             382
Whirlpool [?]           4956          94.6            4790
Whirlpool [?]           3751          93              2380
SHA-1 [?]               2526          98              2526
SHA-2,256 [?]           2384          74              291
SHA-2,384 [?]           2384          74              350
SHA-2,512 [?]           2384          74              467
SHA-2,256 [?]           1373          133             1009
SHA-2,512 [?]           4107          46              1466
MD5 [?]                 5732          84.1            652
MD6-512, 16-Parallel    5313          150.3           1232
MD6-512, 32-Parallel    7529          141.6           1894

Figure 5.6: Various Cryptographic Hash FPGA Implementations

The critical paths in the FPGA implementation run through the system bus and DDR control, rather than through the MD6 hardware. By running the MD6 hardware in a faster clock domain, it is possible to saturate the DDR bandwidth using less hardware.
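A rough calculation illustrates the clock-domain argument. The sketch below assumes that the 16-parallel core could run in its own domain at the maximum frequency reported in Figure 5.3, that its cycles per compression would be unchanged there, and that clock-domain crossing costs are negligible. None of these assumptions was measured, and the function name is hypothetical.

```python
def compression_time_us(cycles_per_compression, clock_mhz):
    """Wall-clock time of one compression in microseconds."""
    return cycles_per_compression / clock_mhz


# 32-parallel core at the 100 MHz system clock (Figure 5.3).
t32_us = compression_time_us(165.7, 100.0)

# 16-parallel core, hypothetically clocked at its reported maximum
# frequency of 150.3 MHz in a separate domain (an assumption).
t16_us = compression_time_us(249.1, 150.3)

# Slice counts from Figure 5.3: area saved by the smaller core.
slices_saved = 7529 - 5313

if __name__ == "__main__":
    print(f"32-parallel at 100.0 MHz: {t32_us:.3f} us per compression")
    print(f"16-parallel at 150.3 MHz: {t16_us:.3f} us per compression")
    print(f"slices saved: {slices_saved}")
```

Under these assumptions the two configurations take nearly the same time per compression, while the 16-parallel core uses about 2200 fewer slices.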

The PowerPC processor on the Xilinx FPGA is a 32-bit, in-order, scalar RISC pipeline with 8 KB of code and data cache, and can be clocked at 300 MHz. To compare the performance of embedded processors and the MD6 hardware in a similar environment, we measured the performance of the reference MD6 code on the PowerPC embedded in the FPGA. The software implementation requires 256,000 cycles per MD6-512 compression, three orders of magnitude more time than the hardware requires. We estimate that the system implementation without hardware acceleration draws 4.6 Watts, while the system implementation with hardware acceleration draws 5.2 Watts. From a power perspective, this implies that the energy consumption of an embedded processor running MD6 is much greater than that of the hardware-accelerated MD6, since the software implementation takes orders of magnitude longer to complete. Power consumption was determined by connecting an ammeter to the system power supply and then testing various system configurations.
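The energy argument can be made concrete with the figures quoted above. This is a back-of-the-envelope sketch, assuming both configurations run at the 100 MHz system clock and that the measured system power is constant during a compression. The function and constant names are hypothetical.

```python
def energy_per_compression_joules(power_watts, cycles, clock_hz):
    """Energy = power x time, where time = cycles / clock frequency."""
    return power_watts * cycles / clock_hz


CLOCK_HZ = 100e6  # assumption: both runs use the 100 MHz system clock

# Software on the embedded PowerPC: 4.6 W, 256,000 cycles per compression.
software_j = energy_per_compression_joules(4.6, 256_000, CLOCK_HZ)

# Hardware-accelerated system: 5.2 W, 165.7 cycles per compression
# (32-parallel core with the out-of-order controller, Figure 5.5).
hardware_j = energy_per_compression_joules(5.2, 165.7, CLOCK_HZ)

ratio = software_j / hardware_j

if __name__ == "__main__":
    print(f"software: {software_j * 1e3:.3f} mJ per compression")
    print(f"hardware: {hardware_j * 1e6:.3f} uJ per compression")
    print(f"software / hardware energy ratio: {ratio:.0f}")
```

Under these assumptions the roughly 13 percent higher power draw of the accelerated system is outweighed by an energy saving per compression of about three orders of magnitude.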

It is useful to compare MD6 to existing FPGA implementations of other cryptographic hash functions. A number of implementations of Whirlpool, MD5, SHA-1, and SHA-2 are presented in Figure 5.6, although this list is by no means complete. In general, MD6 compares quite favorably with these implementations, both in terms of throughput and area usage, particularly since MD6 requires a longer memory and more computation rounds than any of these hash algorithms.


Parallel Steps   Gate Count   Synthesis Area (µm²)
1                65595        148946
2                69119        156948
4                74571        169329
8                77691        176414
16               87627        198975
32               114862       260819
48               144717       328610

Figure 5.7: Compression Function PAR Results

Compression Cores   Parallel Steps   Gate Count   Synthesis Area (µm²)
1                   16               105102       238655
2                   16               194379       441376

Figure 5.8: Full Implementation PAR results