We are at a transition point in the design of processors.
No longer can we expect clock rates to continue to increase dramatically with each processor generation. Indeed, clock rates may have plateaued, after having increased by a factor of 4000 in the last three decades. The reason is that processor power usage increases linearly with clock frequency, with all other things being equal. (In practice, all other things aren’t equal, and the power usage increases nonlinearly, perhaps even quadratically or cubically, with clock frequency.) Further increases in clock rate would require exotic cooling technologies to handle the extremely high power densities that would result from the high clock rates. Instead, performance gains will now be obtained mostly through the use of parallelism, specifically through the use of multicore processors. Dual-core and quad-core processors are now easily available, and processors with more cores are becoming so. We can reasonably expect the number of cores per processor to double with each successive processor generation, while the clock rates may change little, or even decline slightly.
Anwar Ghuloum, at Intel’s Microprocessor Technology Lab, says “developers should start thinking about tens, hundreds, and thousands of cores now.”
A related expected trend is an increase in the parallel computational power of chips (measured as transistors × clock rate) relative to their I/O bandwidth, as transistor density improves faster than the density of I/O pins and pads. Over the next 15 years, this ratio is expected to roughly quadruple, while serial computational power (measured in clock rate) relative to chip I/O bandwidth will decrease by a factor of roughly 10.¹ Hence, in the long term, parallelizable functions can afford more computation (and, hopefully, higher security) for a given level of I/O utilization.
A tree-based design is the most natural approach for exploiting parallelism in a hash function. Ralph Merkle’s PhD thesis [48] described the first such approach, now known as “Merkle trees.” The computation proceeds from the leaves towards the root, with each tree node corresponding to one compression function computation. If the compression function has a compression factor of 4, then each tree node will have four children. MD6 follows this approach.
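The bottom-up computation described above can be illustrated with a minimal sketch. It is not MD6: SHA-256 stands in for the compression function, and the 32-byte chunk size, the zero padding, and the names `compress`, `split_chunks`, and `tree_hash` are all assumptions made for illustration.

```python
import hashlib

BRANCHING = 4      # children per node, matching a compression factor of 4
CHUNK_BYTES = 32   # hypothetical chunk size for this sketch (MD6's differs)


def compress(children):
    """Stand-in compression function (SHA-256, NOT MD6's): 4 chunks -> 1 chunk."""
    # Pad a short group with all-zero chunks so every node has four inputs.
    padded = list(children) + [bytes(CHUNK_BYTES)] * (BRANCHING - len(children))
    return hashlib.sha256(b"".join(padded)).digest()


def split_chunks(message):
    """Cut the message into fixed-size chunks, zero-padding the last one."""
    chunks = [message[i:i + CHUNK_BYTES].ljust(CHUNK_BYTES, b"\0")
              for i in range(0, len(message), CHUNK_BYTES)]
    return chunks or [bytes(CHUNK_BYTES)]


def tree_hash(message):
    """Hash bottom-up: each pass turns every group of four nodes into one."""
    level = split_chunks(message)
    while True:
        # All nodes within one level are independent and could run in parallel.
        level = [compress(level[i:i + BRANCHING])
                 for i in range(0, len(level), BRANCHING)]
        if len(level) == 1:
            return level[0]
```

Because the sketch pads with zeros and records no length or position information, it shows only the shape of the computation and should not be taken as a secure construction.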
Damgård [25] also described parallel approaches to hash functions. These methods can be viewed as constructing a tree by a level-by-level bottom-up computation, wherein the nodes within each level of the tree are computed in parallel. Damgård also suggested stopping the parallel level-by-level computation after some fixed number of levels, and finishing up what is left with a sequential computation. MD6 also follows this approach when L is small; here L (which he calls j) describes the number of levels to compute in parallel before switching to a sequential mode of computation.
¹ These figures apply to high-performance chips and are derived from the International
Other authors have also explored parallel or tree-based hash function design. Perhaps most relevant to MD6 are various interesting and excellent proposals by P. Sarkar, such as [65]. His constructions are somewhat different, however, in that message input values are consumed at every tree node, not just at the leaves.
MD6 thus adopts a tree-based approach for maximum parallelizability. However, there is definitely a trade-off between parallelizability and memory usage; a fully parallelizable approach may have a memory footprint that is too large for some small embedded devices.
Therefore, MD6 follows Damgård’s lead by parameterizing the amount of parallelization. MD6 allows the user to set the parameter L to be the number of parallel passes before switching to a sequential mode. By setting L to 0, MD6 acts in a purely sequential mode and uses minimal memory (under 1 KB). By setting L large, parallelism is maximized and memory usage is proportional to the logarithm of the input message size. (The value of L should of course be communicated along with the hash function output in cases where MD6 is used with a non-standard value.)
3.4.1 Hierarchical mode of operation
The initial design for MD6 did not have a parameter L controlling the mode of operation; the hash was always fully hierarchical.
However, it was felt that (a) there might be substantial need for an MD6 version that met tighter storage limits, and (b) it was easy to add an optional control parameter L that limits the height of trees used (and thus the storage used).
For small L, MD6 makes a collection of trees of height at most L (which can be done in parallel), and then, if there is enough data to make more than one such tree, combines their root values with a sequential Merkle-Damgård pass.
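A rough sketch of this two-phase behaviour follows. It makes several assumptions that are not taken from the MD6 specification: SHA-256 stands in for the compression function, chunks are 32 bytes, the sequential pass feeds an all-zero initial chaining value plus three nodes into each step, and the names `compress` and `layered_hash` are hypothetical.

```python
import hashlib

CHUNK_BYTES = 32  # hypothetical chunk size for this sketch


def compress(chunks):
    """Stand-in compression function (SHA-256, NOT MD6's): up to 4 chunks -> 1."""
    padded = list(chunks) + [bytes(CHUNK_BYTES)] * (4 - len(chunks))
    return hashlib.sha256(b"".join(padded)).digest()


def layered_hash(message, L=64):
    """Run at most L parallel tree passes, then finish sequentially if needed."""
    level = [message[i:i + CHUNK_BYTES].ljust(CHUNK_BYTES, b"\0")
             for i in range(0, len(message), CHUNK_BYTES)] or [bytes(CHUNK_BYTES)]
    for _ in range(L):
        # One tree level: every group of four nodes is compressed independently.
        level = [compress(level[i:i + 4]) for i in range(0, len(level), 4)]
        if len(level) == 1:
            return level[0]          # a single tree covered the whole message
    # Sequential pass (assumed layout): chaining value plus three nodes per step.
    chain = bytes(CHUNK_BYTES)       # all-zero initial value, an assumption
    for i in range(0, len(level), 3):
        chain = compress([chain] + level[i:i + 3])
    return chain
```

With `L=0` the loop body never runs and the whole message is consumed by the sequential pass. Once L is at least the height of the full tree, the sequential pass is never reached, so all such values of L give the same result in this sketch.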
When L > 0, an implementation with infinite parallelism could compute all of the trees in time L, leaving a sequential computation requiring time $O(m/4^L)$, where m is the size of the original message.
In practice, infinite parallelism is unavailable, but even so, using MD6 to hash a long message on a multicore processor with P cores can result in a speedup by a factor of P, for any P reasonable to imagine over the next several decades.
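The claimed speedup can be checked against a simple counting model. The sketch below assumes an idealized schedule (each compression takes one time step, each level finishes before the next begins, and there is no communication cost); the function names are hypothetical and the model is not a measurement of any real MD6 implementation.

```python
def level_sizes(leaves, branching=4):
    """Number of nodes on each level of the tree, from the leaves to the root."""
    sizes = [leaves]
    while sizes[-1] > 1:
        sizes.append(-(-sizes[-1] // branching))  # ceiling division
    return sizes


def parallel_steps(leaves, processors, branching=4):
    """Time steps needed when each level is shared evenly among the processors."""
    return sum(-(-n // processors) for n in level_sizes(leaves, branching))


def speedup(leaves, processors):
    """Ratio of single-processor time to P-processor time in this model."""
    return parallel_steps(leaves, 1) / parallel_steps(leaves, processors)
```

For a message with 4**9 leaves, `speedup(4**9, 8)` comes out just under 8 in this model: only the few small levels near the root leave processors idle, and they account for a negligible share of the work on a long message.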
Envisioning that multi-core processors will be very common, we set the default value for an unspecified L to be L = 64 (giving tree-based hashing): $H_{d,K} = H_{d,K,64}$.
However, applications on very restricted processors may wish to choose L = 0 for a purely sequential hash function with minimal memory requirements.
3.4.2 Branching factor of four
A very early version of the MD6 design had chaining variables of size 512 bits and a branching factor of eight (instead of the current branching factor of four). However, it was felt to be more important to ensure that the wide-pipe principle was maintained for all values of d (particularly for d = 512, where security requirements are the toughest), so the chaining variable size was increased to 1024 bits.
It would be possible, of course, to maintain the wide-pipe principle, while improving the branching factor when d < 512. For example, one could have c = 1024 for d = 512 or d = 384 (a branching factor of four), and c = 512 for d = 256 or d = 224 (a branching factor of eight). But this adds complexity and doesn’t actually do much in terms of improved efficiency (it makes a difference of about 16%), since almost all of the actual work is in the leaves.
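The size of this difference can be estimated by counting compression calls in a complete tree. The sketch below assumes that the number of leaves is the same for both branching factors (each leaf takes the same amount of message data, whether that data is viewed as four 1024-bit chunks or eight 512-bit chunks) and that every compression call costs the same; `compression_calls` is a hypothetical helper.

```python
def compression_calls(leaves, branching):
    """Total nodes in a tree with the given number of leaves and branching factor."""
    total, nodes = 0, leaves
    while True:
        total += nodes
        if nodes == 1:
            return total
        nodes = -(-nodes // branching)  # ceiling division: parents of this level


leaves = 8 ** 6                         # equals 4 ** 9, so both trees are complete
calls_four = compression_calls(leaves, 4)
calls_eight = compression_calls(leaves, 8)
ratio = calls_four / calls_eight        # approaches (4/3) / (8/7) = 7/6
```

Under these assumptions a branching factor of four needs about 4/3 calls per leaf and a branching factor of eight about 8/7, so the ratio is close to 7/6. That is consistent with the figure of about 16% quoted above.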