
5.7 Sensitivity Analysis

5.7.6 Baseline Micro-architecture Sensitivity

In addition to measuring sensitivity to the structures most relevant to BOLT, we also measured BOLT’s sensitivity to several parameters of the underlying micro-architecture. These experiments are briefly summarized here.

Superscalar issue width. BOLT can benefit more from a wider pipeline than a conventional ROB processor can. Without BOLT, memory-bound programs benefit little from a wider issue pipeline: an 8-wide pipeline averages a 3% speedup over the 4-wide baseline for the memory-bound subset of SPEC. By contrast, 8-wide BOLT improves performance by an additional 9% compared to 4-wide BOLT.

Likewise, BOLT suffers more from a narrower pipeline. A 2-wide ROB processor yields an 18% slowdown on the memory-bound programs compared to the 4-wide baseline. 2-wide BOLT suffers a 33% slowdown compared to 4-wide BOLT, making it 5% slower than the original 4-wide baseline.
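The relative figures in this paragraph can be chained to see what they imply about 4-wide BOLT. The sketch below is only illustrative arithmetic: it assumes that an "x% slowdown" means performance falls to (1 - x) of the reference, which the text does not state. Under an execution-time convention the derived numbers would differ. The function and variable names are invented for this example.

```python
# Illustrative arithmetic only. Assumption (not stated in the text):
# an "x% slowdown" means performance drops to (1 - x) of the reference.

def perf_after_slowdown(reference_perf, slowdown):
    """Performance after applying a fractional slowdown to a reference."""
    return reference_perf * (1.0 - slowdown)

baseline_4wide = 1.0  # normalize to the 4-wide ROB baseline

# From the text: 2-wide ROB is 18% slower than the 4-wide baseline.
rob_2wide = perf_after_slowdown(baseline_4wide, 0.18)

# From the text: 2-wide BOLT is 5% slower than the 4-wide baseline.
bolt_2wide = perf_after_slowdown(baseline_4wide, 0.05)

# From the text: 2-wide BOLT is 33% slower than 4-wide BOLT.
# Inverting that relation gives the implied 4-wide BOLT performance.
bolt_4wide = bolt_2wide / (1.0 - 0.33)

print(f"implied 4-wide BOLT vs. baseline: {bolt_4wide:.3f}")
print(f"2-wide BOLT vs. 2-wide ROB:       {bolt_2wide / rob_2wide:.3f}")
```

Under this assumed convention, 2-wide BOLT still outperforms the 2-wide ROB processor even though it loses a larger share of its own 4-wide performance.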

L3 cache capacity. Varying L3 cache capacity from 1MB to 32MB produces the expected behavior for most benchmarks: BOLT’s advantages increase for smaller cache sizes due to more misses. The exception is mcf, whose behavior is dominated by pointer chasing. With the default 8MB L3 cache, BOLT obtains a 16% speedup on mcf. BOLT with a 1MB L3 cache suffers a 53% slowdown compared to the original baseline while a conventional ROB processor with a 1MB cache suffers a 56% slowdown. This means that BOLT’s relative advantage with the smaller L3 cache is only 3%, compared to 16% for the larger L3 cache. This effect arises from the fact that BOLT is unable to increase the performance of pointer chasing. As cache size decreases, more pointer chasing loads miss, causing the pointer chasing regions to be a larger portion of the execution time, which dilutes the gains BOLT can provide in other regions.
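The dilution effect described for mcf follows the shape of Amdahl's law: if only part of the execution time can be accelerated, the overall gain shrinks as the remaining part grows. The sketch below illustrates this with hypothetical numbers. The 1.5x regional speedup and the pointer-chasing fractions are assumptions chosen for illustration, not measurements from these experiments.

```python
def overall_speedup(accelerable_fraction, regional_speedup):
    """Amdahl's law: only `accelerable_fraction` of baseline time is sped up."""
    unaffected = 1.0 - accelerable_fraction
    return 1.0 / (unaffected + accelerable_fraction / regional_speedup)

# Hypothetical: the latency-tolerant design speeds up regions that are not
# pointer chasing by 1.5x and leaves pointer-chasing regions unchanged.
REGIONAL_SPEEDUP = 1.5

# As the cache shrinks, pointer chasing takes a larger share of run time.
for chase_fraction in (0.2, 0.5, 0.8, 0.95):
    s = overall_speedup(1.0 - chase_fraction, REGIONAL_SPEEDUP)
    print(f"pointer chasing = {chase_fraction:.0%} of time "
          f"-> overall speedup {s:.3f}")
```

The overall speedup falls toward 1.0 as the pointer-chasing share approaches 100%, even though the speedup inside the other regions is held constant.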

L2 cache capacity. Varying L2 cache capacity from 128KB to 1024KB (1MB) has a relatively small (i.e., less than 5%) performance impact on most benchmarks, even in the conventional ROB processor. The three benchmarks that it impacts by more than 5% are libquantum, astar, and gcc. libquantum speeds up by 55% when L2 cache size is increased to 512KB. Its performance does not change further at 1024KB, and it does not change when the cache size is decreased to 128KB: libquantum’s working set fits in a 512KB cache but not in a 256KB cache. BOLT loses all performance advantages on libquantum at the larger cache sizes because its latency tolerance never activates.

astar and gcc both lose 6% performance with the smaller (128KB) L2 cache size, and gain 6% (gcc) or 7% (astar) at the largest (1024KB). BOLT does not obtain significant performance advantages on either of these benchmarks at any of the cache sizes because their misses typically feed mis-predicted branches. BOLT is unable to exploit ILP because the processor is executing down the wrong path. In astar, some MLP is exposed down the wrong path, but it simply pollutes the caches and harms performance slightly. Coupling BOLT with Control Independence [3, 33, 65, 66], a technique to avoid squashing correct path instructions after a mis-predicted branch’s control re-convergence point, might provide more performance opportunities in such situations.

For the remaining benchmarks, BOLT follows the expected trend of more relative benefits at smaller cache sizes; however, the relative differences are small because baseline ROB performance does not change much.

Chapter 6 iCFP: Load Latency Tolerance for In-order Processors

Load latency tolerance is beneficial to in-order cores as well as out-of-order cores. In-order cores can even benefit from applying load latency tolerance to L1 misses that hit the L2: unlike out-of-order cores, which naturally tolerate such latencies, in-order cores cannot re-order instructions around them. This chapter describes iCFP (in-order Continual Flow Pipeline), an in-order load latency tolerant design that is analogous to BOLT. It also qualitatively compares iCFP to other in-order load latency tolerant designs. It omits a quantitative performance and energy analysis primarily because the top-down area-based relative-energy approximation we use (in Chapter 5 for BOLT) is inappropriate when the marginal area is more than a few percent of the baseline area.

The mechanisms used in iCFP (described below) are similar to those used in BOLT. By design, BOLT’s mechanisms are largely core-agnostic: they can be fitted onto any kind of core. The core-agnostic design derives from the use of program-order interfaces. Miss-dependent instructions are deferred to the slice buffer in program order. Program-order slices are re-injected into the execution core when misses return. This in-order interface can be attached to an in-order core, at in-order register read and completion, just as easily as it can be attached to an out-of-order core, at in-order rename/dispatch and (speculative) retirement.

6.1 iCFP: In-order Continual Flow Pipeline

iCFP (In-order Continual Flow Pipeline) [31, 32] is an in-order load latency tolerant design that has many structural and “algorithmic” similarities to BOLT. Figure 6.1 shows BOLT (top) and iCFP (bottom). Although the underlying pipeline is different, the key latency tolerance structures (shaded grey)—the slice buffer and chained store buffer—are the same.

Like BOLT, iCFP uses checkpoint-backed speculative retirement and explicitly distinguishes between tail and deferred instructions. Both designs slice out miss-dependent instructions, capture their miss-independent inputs, and defer them to a program-order slice buffer. Deferred instructions release their resources (here, the pipeline latches themselves) and free them for younger instructions. iCFP also makes multiple passes over the slice buffer, initiating passes as misses return and re-injecting miss-dependent instructions for re-execution. As with BOLT, re-execution can be multi-threaded with tail execution and filtered by antidote bit-vector based pruning mechanisms.
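The deferral and re-injection flow described above can be illustrated with a toy functional model. This is a minimal sketch under simplifying assumptions, not the iCFP or BOLT implementation: it models a single missing load and a single re-execution pass, it omits checkpoints, the chained store buffer, and antidote-based pruning, and it uses a per-register last-writer sequence number (an assumption of this sketch) so that re-executed instructions do not overwrite values from younger tail instructions. All names (`Instr`, `tail_pass`, `reexecute_pass`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Instr:
    seq: int                      # program-order sequence number
    dest: str                     # destination register
    srcs: Tuple[str, ...]         # source registers
    fn: Optional[Callable]        # computes dest from source values
    misses: bool = False          # True for a load that misses the cache

def tail_pass(program, regs):
    """Execute miss-independent instructions; defer miss-dependent ones."""
    poisoned, last_writer, slice_buffer = set(), {}, []
    for ins in program:
        last_writer[ins.dest] = ins.seq
        if ins.misses or any(s in poisoned for s in ins.srcs):
            # Capture the miss-independent inputs that are available now.
            captured = {s: regs[s] for s in ins.srcs if s not in poisoned}
            slice_buffer.append((ins, captured))  # kept in program order
            poisoned.add(ins.dest)
        else:
            regs[ins.dest] = ins.fn(*(regs[s] for s in ins.srcs))
            poisoned.discard(ins.dest)
    return slice_buffer, last_writer

def reexecute_pass(slice_buffer, regs, last_writer, miss_data):
    """Replay the deferred slice in program order once the miss returns."""
    slice_vals = {}  # values produced by earlier slice instructions
    for ins, captured in slice_buffer:
        if ins.misses:
            value = miss_data[ins.seq]
        else:
            args = [captured[s] if s in captured else slice_vals[s]
                    for s in ins.srcs]
            value = ins.fn(*args)
        slice_vals[ins.dest] = value
        if last_writer[ins.dest] == ins.seq:  # skip if a younger writer exists
            regs[ins.dest] = value

add = lambda a, b: a + b
program = [
    Instr(0, "r1", (), None, misses=True),      # r1 = load (misses)
    Instr(1, "r2", ("r1", "r0"), add),          # depends on the miss
    Instr(2, "r3", ("r0",), lambda a: a * 5),   # miss-independent
    Instr(3, "r1", ("r3",), lambda a: a + 1),   # overwrites r1 in the tail
    Instr(4, "r4", ("r1", "r2"), add),          # depends on the miss via r2
]
regs = {"r0": 1}
slice_buffer, last_writer = tail_pass(program, regs)
deferred_seqs = [ins.seq for ins, _ in slice_buffer]
reexecute_pass(slice_buffer, regs, last_writer, miss_data={0: 10})
print(deferred_seqs, regs)
```

In this example, instructions 2 and 3 complete during the tail pass while instructions 0, 1, and 4 wait in the slice buffer. After re-execution the register state matches what plain in-order execution with a load value of 10 would produce.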
