MD6 Memory Usage - The MD6 hash function A proposal to NIST for SHA-3

The size of the MD6 state, as implemented, is 15504 bytes. Most of this is for the stack of values: there are 29 levels to the stack, each holds 64 words of data, and each word is 8 bytes; this gives 14,848 bytes just for the stack.

The reference implementation provides for handling the full maximum message length of 2^64 − 1 bits. Should there be a smaller limit in practice, a correspondingly reduced stack size could be used. For example, if in an application there was a known maximum message size of 32GB (2^40 bits), then a stack size of 17 levels would suffice, reducing the necessary size of the MD6 state to roughly 9K bytes.
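The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted above; the one assumption (not stated in the text) is that the part of the state outside the stack stays the same size when the stack is shortened. The names below are illustrative and are not taken from the reference implementation.

```python
WORD_BYTES = 8            # each word is 8 bytes
WORDS_PER_LEVEL = 64      # each stack level holds 64 words
FULL_LEVELS = 29          # stack levels in the reference implementation
FULL_STATE_BYTES = 15504  # total state size quoted in the text


def stack_bytes(levels):
    """Bytes used by a stack with the given number of levels."""
    return levels * WORDS_PER_LEVEL * WORD_BYTES


# Derived value: state bytes that are not part of the stack.
# Assumption: this overhead does not change with the stack height.
NON_STACK_BYTES = FULL_STATE_BYTES - stack_bytes(FULL_LEVELS)


def state_bytes(levels):
    """Estimated total state size for a reduced stack height."""
    return stack_bytes(levels) + NON_STACK_BYTES


print(stack_bytes(FULL_LEVELS))  # 14848
print(state_bytes(17))           # 9360, i.e. roughly 9K bytes
```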

When L = 0, so that MD6 is operating entirely sequentially, the size of MD6 state is even further reduced. In this case, MD6 should be implementable using not much more than the size of one 89-word compression input, which is 712 bytes.

4.7 Parallel Implementations

Computing platforms are quickly becoming highly parallel, as chip manufacturers realize that little more performance is available by increasing clock rates. Increased performance is instead being obtained by utilizing increased parallelism, as through multi-core chip designs.

Indeed, typical chips may become very highly parallel, very soon.

Anwar Ghoulum, at Intel’s Microprocessor Technology Lab, says “developers should start thinking about tens, hundreds, and thousands of cores now.”

Section 4.7.1 shows how the speed of MD6 can be dramatically increased by utilizing the CILK software system for programming multicore computers.

Then Section 4.7.2 shows how the speed of MD6 can be dramatically increased by implementing MD6 on a typical graphics card (which is highly parallel internally).

4.7.1 CILK Implementation

We implemented MD6 for multicore processors using the CILK extension to the C programming language developed by Professor Charles Leiserson and colleagues. CILK began as an MIT research project, and is now a startup company based in Lexington, Mass.

The CILK technology makes multicore programming quite straightforward. The CILK programmer identifies her C procedures that are to be managed by the CILK runtime routines with the cilk keyword. She can identify procedure calls to be potentially handled by other processors with the spawn keyword. She synchronizes a number of spawned procedure calls using the sync statement. Details can be found at the MIT CILK web site (http://supertech.csail.mit.edu/cilk/) or at the company’s web site.

Our implementation of MD6 in CILK used the layer-by-layer approach (so it assumes that the input message is available all at once). It processes each layer in turn, but uses parallelism to process a layer efficiently.
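The layer-by-layer idea can be illustrated with a minimal Python sketch. This is not the authors' CILK code and it is not MD6: `compress` is a stand-in built on SHAKE-256, and the sketch omits node indices, level numbers, padding counts and every other detail of the real mode of operation. The block sizes (512-byte inputs, 128-byte outputs, hence four children per node) and all function names are assumptions chosen for illustration. Mapping a whole layer over a pool plays the role of spawning one call per compression and then synchronizing.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_BYTES = 512   # assumed input size of one compression (64 words)
CHUNK_BYTES = 128   # assumed output size of one compression (16 words)


def compress(block: bytes) -> bytes:
    """Stand-in compression function (SHAKE-256), NOT the MD6 one."""
    return hashlib.shake_256(block).digest(CHUNK_BYTES)


def hash_layer(data: bytes, pool: ThreadPoolExecutor) -> bytes:
    """Compress every block of one layer in parallel, keeping order."""
    blocks = [data[i:i + BLOCK_BYTES].ljust(BLOCK_BYTES, b"\0")
              for i in range(0, max(len(data), 1), BLOCK_BYTES)]
    # pool.map returns results in input order, like a sync after spawns.
    return b"".join(pool.map(compress, blocks))


def tree_hash(message: bytes, workers: int = 4) -> bytes:
    """Process the tree one layer at a time until a single chunk remains."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        layer = hash_layer(message, pool)
        while len(layer) > CHUNK_BYTES:
            layer = hash_layer(layer, pool)
    return layer


digest = tree_hash(b"example message" * 1000)
print(len(digest))
```

The result does not depend on the number of workers, because each layer is fully determined by the layer below it; only the time taken changes.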

CHAPTER 4. SOFTWARE IMPLEMENTATIONS 61

Bradley Kuszmaul, Charles Leiserson, Stephen Lewin-Berlin, and others associated with CILK have been extremely helpful in our experiments with MD6 (thanks!).

Using CILK, MD6’s parallel performance is the best one can hope for: its throughput increases linearly with the number of processors (cores) available.

We have experimented with 4-core, 8-core, and 16-core machines using our CILK implementation.

The 16-core machine shows best how MD6 scales up with more cores. This machine is a 2.2 GHz AMD Barcelona (64-bit) machine. See Figure 4.11.

Number of cores    Processing speed (MB/sec)
 1                   40.4
 2                  121.6
 3                  202.2
 4                  270.2
 5                  338.3
 6                  407.7
 7                  474.1
 8                  539.8
 9                  607.2
10                  674.2
11                  740.0
12                  805.8
13                  875.9
14                  940.0
15                 1004.0
16                 1069.9

Figure 4.11: Speed of MD6 using various numbers of cores on a 16-core machine using CILK.
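One way to read the linear-scaling claim from these measurements is to look at the throughput gained per additional core. The sketch below simply copies the figures from Figure 4.11; the helper names are illustrative.

```python
# Measured throughput in MB/sec, indexed by number of cores (Figure 4.11).
speeds = {
    1: 40.4, 2: 121.6, 3: 202.2, 4: 270.2, 5: 338.3, 6: 407.7,
    7: 474.1, 8: 539.8, 9: 607.2, 10: 674.2, 11: 740.0, 12: 805.8,
    13: 875.9, 14: 940.0, 15: 1004.0, 16: 1069.9,
}


def gain_per_core(lo, hi):
    """Average throughput gained per core when going from lo to hi cores."""
    return (speeds[hi] - speeds[lo]) / (hi - lo)


# Throughput added by each individual extra core.
step_gains = [speeds[n + 1] - speeds[n] for n in range(1, 16)]

print(round(gain_per_core(1, 16), 2))
print(round(min(step_gains), 1), round(max(step_gains), 1))
```

If scaling were perfectly linear, every entry of `step_gains` would be identical; a narrow spread between the smallest and largest entry indicates near-linear behaviour.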

4.7.2 GPU Implementation

A typical desktop or laptop computer contains not only the main CPU (which may be multicore, as noted earlier), but also a potentially high-performance, highly-parallel graphics processor (GPU).

Such general purpose graphics processing units are beginning to be used as cryptographic processors (e.g. Cool et al [?] and Harrison et al. [32]), since they provide a rich set of general-purpose logical instructions and a high degree of parallelism.

Current GPGPUs trade a traditional cache hierarchy and complex pipeline control for expanded vector-parallel computational resources. For example, the 8800GT (released in 2007) has 112 vector processing elements (PEs) arranged in gangs of 8, forming a cluster of 14 SIMD thread processing units (TPUs).

Figure 4.12: Chart of MD6 speed using various numbers of cores on a 16-core machine using CILK.


The TPUs operate on blocks of threads, each of which must execute the same vector instruction. If divergent control flow exists across the threads in a TPU thread block, the PEs revert to slow serial execution. TPUs, however, may follow different control paths from one another. Unlike traditional out-of-order processors, which hide high-latency operations with instructions from the same computation thread, the TPUs hide latency by running multiple blocks of threads in parallel. Indeed, a TPU cannot even execute back-to-back instructions from the same thread block in back-to-back cycles; the TPU will idle if there are no other thread blocks to execute. Thus, to exploit the resources of the GPU fully, a program must have sufficient vector parallelism to saturate a TPU, limited control flow within the TPU, and enough data-parallel tasks to keep all of the TPUs in the system busy.

MD6 is ideally suited to exploit the parallelism presented by such GPGPU architectures. The 16 steps in a compression round are vector parallel, and the compression function itself has statically determined control flow. Since individual compression functions are data parallel, multiple compression functions may be run in parallel on the processing units of the GPGPU, thereby achieving high GPU utilization. MD6 can achieve throughputs as high as 610 MB/s on a single GPU.

We tested several GPU implementations by hashing a 512MB block of memory. Performance was determined by measuring the wall clock time from the completion of data initialization to the completion of hashing. In particular, heap memory allocation is not accounted for in the reported performance. Although memory allocation incurs substantial performance overhead, MD6 is likely to be used as part of a library in which the cost of memory allocation is amortized across many hash operations.

The GPU implementation of MD6 operates on the MD6 compression tree in the same manner as the multicore implementation, that is, layer by layer (see Section 4.1.1.1). The routine uses the GPU to compress blocks of data, computing a minimum of 64 compression functions in parallel (32 kilobytes) per GPU invocation. Smaller data sizes, including the tip of the compression tree, are calculated on the main system processor, which is faster than the GPU if insufficient parallelism is available.

In contrast to the traditional software implementation detailed in Section 4.1.2, which required little manual modification to achieve good performance, a number of transformations must be made to the MD6 reference code in order to achieve any reasonable performance on the GPU. Unfortunately, many of these transformations depend on detailed knowledge of the GPU processor and memory architecture. The following paragraphs present some details about the transformations required to achieve high GPU performance, but may be skipped without affecting the remaining discussion.

Running the reference implementation of MD6 on the GPU results in an abysmal compression performance of 3 MB/s. The naive parallel decomposition of the code into multiple data-independent compression functions, similar to the algorithm used in CILK, gives a low throughput of 14 MB/s. The original compression function is not expressed in a vector-parallel fashion; as a result, many of the GPU's processing elements sit idle and performance is low.

Unfortunately, vectorizing the code provides a throughput of only 23 MB/s. A major performance issue with the original code is that it uses a large array in which each element is written exactly once. While this approach is fine for a general-purpose processor with a managed cache, as discussed in Section 4.1.2, the GPU incurs a substantial performance penalty each time the on-board DRAM, the only place where such a large array can be stored, is accessed. In MD6 compression, the ratio of computation per load is low, so the GPU is unable to hide the latency of main memory accesses with computation, resulting in severely degraded performance.

To avoid this memory access penalty, the GPU implementation uses a smaller wrap-around shift register mapped in the TPU scratchpad memory. Since vector-parallel instructions may write beyond the end of the data array, the wrap-around arrays require a halo of memory around them to avoid data corruption. Thus, the shift register is sized at 121 words, which includes a 32-word halo. The wrap-around implementation solves the memory access latency problem, but introduces costly modulo arithmetic to compute array indices. However, the benefit of high memory bandwidth outweighs the cost of modular indexing, and this implementation achieves a throughput of 140 MB/s. As an aside, it might seem that increasing the shift register to size 128 would reduce the modulo operator to a simple bit mask. However, this optimization results in an array of shift registers that does not fit in the 16KB TPU scratchpad memory.
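The equivalence between the large write-once array and a wrap-around register can be sketched in a few lines of Python. This is a scalar illustration only, so it needs no halo and uses an 89-word buffer rather than the 121 words described above. The step function is a simplified stand-in, not the MD6 step: it has no round constants and uses fixed shift amounts. The tap positions are an assumption recalled from the MD6 specification and are not given in this section.

```python
MASK = (1 << 64) - 1         # words are 64 bits wide
N = 89                       # words in one compression input
TAPS = (17, 18, 21, 31, 67)  # assumed tap positions, for illustration
STEPS = 16 * 10              # ten 16-step rounds, enough for a demo


def step(a_n, a0, a1, a2, a3, a4):
    """Simplified stand-in for the step function (not the real MD6 step)."""
    x = a_n ^ a0 ^ (a1 & a2) ^ (a3 & a4)
    x ^= x >> 10
    return (x ^ (x << 11)) & MASK


def run_full_array(init):
    """Write-once array: element i is written exactly once."""
    a = list(init) + [0] * STEPS
    for i in range(N, N + STEPS):
        a[i] = step(a[i - N], *(a[i - t] for t in TAPS))
    return a[-16:]


def run_wraparound(init):
    """Wrap-around register: only N words are ever stored."""
    buf = list(init)
    for i in range(N, N + STEPS):
        # The slot i % N still holds word i - N, which is consumed here.
        buf[i % N] = step(buf[i % N], *(buf[(i - t) % N] for t in TAPS))
    return [buf[j % N] for j in range(N + STEPS - 16, N + STEPS)]


init = [(0x9E3779B97F4A7C15 * (k + 1)) & MASK for k in range(N)]
print(run_full_array(init) == run_wraparound(init))
```

The wrap-around version works because every tap is smaller than N, so a word is only overwritten after its last use.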

Although modular arithmetic is needed to compute indices into the reduced shift register, the index progression is fixed and index values can be precomputed and stored in a lookup table in scratchpad memory. This optimization raises GPGPU throughput to 224 MB/s. However, the table lookups still require a modulo operation. By removing the modulus operators and statically unrolling the entire compression loop, a performance of 360 MB/s was achieved. To this point, the GPU kernel has operated on large blocks consisting of a constant number of MD6 compression functions, but by operating the GPU at a finer granularity some speedup can be obtained. However, large-scale parallelism is required to obtain good GPU performance. MD6 is no exception: the larger the block processed, the greater the processing efficiency. It was empirically determined that 64 parallel compressions was the point at which GPU and CPU performance were roughly equivalent. Support for smaller block sizes increased MD6 throughput to 400 MB/s.
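A precomputed index table of the kind described here might look as follows. The sketch assumes an 89-word register with no halo (the text uses 121 words) and tap positions recalled from the MD6 specification; both are illustrative assumptions, as are the names. Note that selecting the table row still takes one modulo per step, which matches the remark that the lookups themselves still needed a modulo until the loop was fully unrolled.

```python
N = 89                       # assumed register length (no halo)
TAPS = (17, 18, 21, 31, 67)  # assumed tap positions, for illustration


def build_index_table():
    """Row s lists the slots used at any step i with i % N == s.

    Entry 0 is the slot that is read and then overwritten; the
    remaining entries are the slots read for each tap.
    """
    return [tuple((s - t) % N for t in (0,) + TAPS) for s in range(N)]


TABLE = build_index_table()


def slots_for_step(i):
    """Look up all slots for step i with a single modulo."""
    return TABLE[i % N]


print(slots_for_step(89))
print(slots_for_step(100))
```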

Modern machines have the capability to copy large pieces of physical memory to bus devices, such as the GPU. These direct memory access (DMA) transfers are high-bandwidth and may be performed in parallel with computation, either at the GPU or the CPU. By enabling GPU DMA, a further speedup to 475 MB/s was obtained for MD6-512. Unfortunately, configuring memory for use with DMA has a non-negligible cost. However, if MD6 were used in the context of a library, this cost would be amortized over many calls to the compression function.

Utilizing these methods on the newer, faster 9800GTX card, we can achieve MD6 throughput of 610 MB/s for MD6-512. We also obtain high throughput for shorter hash lengths, as shown in Figure 4.13. In particular, for extremely short hash lengths, we obtain throughputs in excess of 1600 MB/s. Although short hash lengths are not cryptographically interesting, their high throughput gives some notion of the maximum throughput that can be achieved with future, higher-performance GPUs.

[Figure 4.13: Throughput (MB/s) versus hash size (bits).]

[Figure 4.14: Throughput (MB/s) versus number of GPUs, for the 9800GX2 and 9800GTX.]

One can install multiple graphics cards in a single desktop to obtain a higher MD6 throughput. The MD6 processing tree can be trivially partitioned and subtrees allocated to various graphics cards within the system, theoretically obtaining a linear speedup in operation. The multi-GPU version of MD6 naively partitions the MD6 tree into equally sized subtrees and assigns the subtrees to the available GPUs in the system. Once the subtree computation is complete, the host CPU gathers the subtree results and finishes the hash computation.
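The partitioning scheme can be sketched as follows, under simplifying assumptions. The compression function is a SHAKE-256 stand-in rather than MD6, the message length is assumed to be exactly 512 × 4^k bytes so that every layer divides evenly, and the "devices" are simulated by ordinary function calls. All names and sizes here are illustrative and do not come from the authors' GPU code.

```python
import hashlib

BLOCK_BYTES = 512   # assumed input size of one compression
CHUNK_BYTES = 128   # assumed output size of one compression


def compress(block: bytes) -> bytes:
    """Stand-in compression function (SHAKE-256), NOT the MD6 one."""
    return hashlib.shake_256(block).digest(CHUNK_BYTES)


def hash_layers(data: bytes, levels: int) -> bytes:
    """Apply the given number of tree layers to data."""
    for _ in range(levels):
        data = b"".join(compress(data[i:i + BLOCK_BYTES])
                        for i in range(0, len(data), BLOCK_BYTES))
    return data


def partitioned_hash(message, devices, device_levels, total_levels):
    """Each device hashes an equal share; the host finishes the tree."""
    share = len(message) // devices
    # Shares must line up with block boundaries on every device layer.
    assert len(message) % devices == 0
    assert share % (BLOCK_BYTES * 4 ** (device_levels - 1)) == 0
    parts = [hash_layers(message[d * share:(d + 1) * share], device_levels)
             for d in range(devices)]
    # Host step: gather the subtree results and hash the remaining layers.
    return hash_layers(b"".join(parts), total_levels - device_levels)


message = bytes(range(256)) * 128          # 512 * 4**3 bytes
whole = hash_layers(message, 4)            # four layers reach the root
print(partitioned_hash(message, 2, 3, 4) == whole)
print(partitioned_hash(message, 8, 2, 4) == whole)
```

Because the shares align with block boundaries, the partitioned computation produces the same root as hashing the whole tree in one place; only the distribution of work differs.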

Figure 4.14 shows the throughputs achieved for MD6 on two multi-GPU platforms. The first platform has two 9800GTX+ cards. The second platform has four 9800GX2 cards, each of which has two 9800 series GPUs. The second platform has more aggregate compute, although the individual 9800GTX+ GPUs are superior to the GPUs used in the 9800GX2. For small increases in the number of GPUs, some performance increase is obtained. The rapidly diminishing return for using multiple graphics cards can be attributed to two main causes. The first is the decreased GPU efficiency due to smaller problem size. To achieve good throughput, GPUs require hundreds or thousands of threads. If the hash tree is partitioned at too fine a grain, GPUs suffer idle cycles during computation. The second is the competition for memory bandwidth among the cards. Current motherboards multiplex the PCI-E bus when multiple graphics cards are in use, decreasing the effective memory bandwidth to all cards in the system.

4.8 Summary

MD6 has efficient software implementations on 8-bit, 32-bit, and (especially) 64-bit processors, without complex optimization techniques.

Because of its tree-based mode of operation, MD6 is particularly well-suited for parallel implementations on multicore processors and GPUs. Speeds of many hundreds of megabytes per second are easily obtained, and speeds of 1-2 gigabytes/second are very achievable.

Processor architecture is currently trending to larger numbers of homogeneous cores; both CPUs and GPUs are following this trend. Because the performance of individual cores is not improving, the throughput of traditional sequential algorithms, which used to have exponential performance growth, has stagnated. On the other hand, highly parallel algorithms, like MD6, are likely to continue to see improved throughput well into the foreseeable future as coarse-grained machine parallelism increases. Approximately nine months passed between the release of the 8800GT GPU (October 2007) and the 9800GTX GPU (July 2008), which runs MD6 nearly 30% faster than the 8800GT; it is unlikely that any existing sequential algorithm would have demonstrated such a marked performance gain over the same time period.

Chapter 5. Hardware Implementations

Cryptographic operations are often computationally intensive; MD6 is no exception. In our increasingly interconnected world, embedded platforms require cryptographic authentication to establish trusted connections with users. However, the limited general-purpose compute that is typically present in such systems may be incapable of satisfying the power-performance requirements imposed on them. To alleviate these issues, dedicated hardware implementations are deployed in these devices to meet performance requirements while using a fraction of the power required by general-purpose compute. Recent general-purpose processors [?] have included cryptographic accelerators, precisely because these common operations are compute-intensive. Therefore, any standard cryptographic operation must be efficiently implementable in hardware.

MD6 is highly parallelizable and exhibits strong data locality, enabling the development of efficient, extremely low power hardware implementations.

Section 5.1 first discusses our general hardware implementation strategy, paying particular attention to important hardware design tradeoffs.

We then present implementation results for a number of hardware designs for FPGA in Section 5.2. For example, throughputs as high as 233 MB/s are obtained on a common FPGA platform while consuming only 5 Watts of power. Section 5.3 provides some discussion of ASIC implementations; this is followed in Section 5.4 with a discussion of MD6 implementations on a custom multi-core embedded system.

5.1 Hardware Implementation

Our hardware implementation matches the version of MD6 submitted for the NIST contest. Therefore, some options and operational modes included in the definition of the algorithm, but not included in the proposed standard, were omitted to reduce the complexity of the hardware. We have obtained a functional FPGA implementation of MD6 that achieves throughput on par with a modern quad-core general-purpose processor. We also give some initial metrics for an ASIC implementation of MD6 in a 90nm technology. We use the same RTL source to generate both FPGA and ASIC implementations.

All hardware source code is provided under the open source “MIT License” and can be obtained from OpenCores.

5.1.1 Compression Function

Figure 5.1: Compression Function Hardware (diagram labels: U, V, Key, Q; From Memory Control; To Memory Control)

The MD6 compression function is essentially a linear feedback shift register, as depicted in Figure 5.1. To reduce the hardware overhead of the shift register, we constrain shifts to have a constant length, the number of compression steps performed per cycle. Since the shift length is constant, low-cost direct-wire connections between logically adjacent registers can be made. The logic used to compute the MD6 feedback function is similarly wired directly to the correct points in the shift register. Input to and output from the shift register are achieved by tying multiplexors to certain word registers. Some additional logic is used for bookkeeping during operation, but this state logic has limited impact on the operation of the compression function.

The compression occurs in four stages: initialization, data input, compression, and data output. In the initialization stage, the 25-word auxiliary input block is loaded into the shift register. During data input, the data input values are shifted into the shift register. During compression, the step function is applied to the shift register until the compression operation is completed. During data output, the hash result is streamed out.

The fundamental operation of compression is the application of the step function, 16 of which comprise a round; in hardware, to achieve high throughput, multiple steps and rounds may be composed within a single cycle. Since the first feedback tap of the shift register is located at index 17, up to 16 steps may be carried out in parallel without extending the circuit critical path. Of course, multiple 16-step rounds may be further composed to obtain as much parallelism as necessary; this lengthens the critical path, introducing a design tradeoff between throughput and clock frequency.

Since the number of hash rounds is a dynamic parameter, some hash lengths are not computable by some multiple-rounds-per-cycle implementations. For example, a 3-round-in-parallel implementation cannot compute a 256-bit hash (104, the number of rounds for this hash size, is not divisible by 3). Although such an implementation will compute the correct hash result for any number of rounds, the result will have a variable location in the shift register. To generalize such an implementation, additional control and multiplexing logic would be required to collect the hash result.
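The divisibility constraint can be explored with a small sketch. The text gives only one data point, 104 rounds for a 256-bit hash; the formula r = 40 + d/4 used below is an assumption recalled from the MD6 specification that agrees with that data point, and the function names are illustrative.

```python
def rounds_for_digest(d_bits):
    """Assumed default round count, r = 40 + d/4 (gives 104 for d = 256)."""
    return 40 + d_bits // 4


def usable_rounds_per_cycle(d_bits, max_parallel=8):
    """Rounds-per-cycle settings that divide the round count evenly."""
    r = rounds_for_digest(d_bits)
    return [k for k in range(1, max_parallel + 1) if r % k == 0]


def cycles_per_compression(d_bits, rounds_per_cycle):
    """Compression-stage cycles when the setting divides the round count."""
    r = rounds_for_digest(d_bits)
    if r % rounds_per_cycle != 0:
        raise ValueError("round count is not a multiple of rounds per cycle")
    return r // rounds_per_cycle


print(usable_rounds_per_cycle(256))    # [1, 2, 4, 8]
print(cycles_per_compression(256, 4))  # 26
```

A design that composes k rounds per cycle leaves the result in its fixed output position only for digest sizes whose round count is a multiple of k, which is why 3 rounds per cycle is unsuitable for a 256-bit hash.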

5.1.2 Memory Control Logic

The hardware implementation can be viewed as a gang of compression function units orchestrated by a memory controller. The function of the memory controller is simple: it maintains top-level status information and issues memory requests to and from a memory store, or, in the case of a streaming im-
