Summary Comparison - Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era

The previous sections provided detailed comparisons of Runahead to BOLT and CFP to BOLT. Figure 5.9 shows the best configuration for each design—pruned L2+L3 Runahead, L3-only CFP, and L2+L3 BOLT—all together.

[Figure 5.10: Performance on Non-memory-Bound Benchmarks. Panels: % Speedup (axis -5 to 10), % Overhead (axis 0 to 10), % ED² Change (axis -10 to 20). Series: Runahead, CFP, BOLT. Benchmarks: calculix, dealII, gamess, lsl3d, namd, povray, tonto, wrf, astar, bzip2, gcc, gobmk, h264, hmmer, omnet, sjeng, xalanc.]

5.5.1 Behavior on Non-Memory Bound Benchmarks

While this evaluation primarily focuses on memory-bound benchmarks, the behavior of Runahead, CFP, and BOLT on the non-memory-bound benchmarks is also important. Figure 5.10 shows the performance, overhead, and ED² of Runahead, CFP, and BOLT on the remaining benchmarks in SPEC 2006. Note that only individual benchmarks are shown; the average for all of SPEC appears in Figure 5.9 and is not repeated here.
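As a rough illustration of how the three quantities relate, the sketch below computes a percent change in ED² (energy times delay squared) from a speedup and an overhead. It assumes the overhead can be treated as a relative increase in energy, which the text above does not state, and the function name and inputs are hypothetical rather than taken from the evaluation.

```python
def ed2_change_percent(speedup_pct: float, energy_overhead_pct: float) -> float:
    """Percent change in ED^2 relative to a baseline (negative is better)."""
    # A speedup of s means the new delay is the old delay divided by (1 + s).
    relative_delay = 1.0 / (1.0 + speedup_pct / 100.0)
    # ASSUMPTION: overhead o means the new energy is the old energy times (1 + o).
    relative_energy = 1.0 + energy_overhead_pct / 100.0
    # ED^2 = energy * delay^2, so the baseline-relative ratios combine the same way.
    return (relative_energy * relative_delay ** 2 - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical inputs, not values read from Figure 5.10.
    print(ed2_change_percent(8.0, 3.0))   # speedup outweighs overhead
    print(ed2_change_percent(-4.0, 5.0))  # slowdown plus overhead
```

Because delay enters squared, even a small slowdown combined with modest overhead can produce a noticeably larger ED² increase under this model.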

While these benchmarks do not have as many long-latency misses as the memory-bound ones, there are still opportunities for performance gains. One benchmark of particular interest is gamess, on which BOLT obtains an 8% speedup, while CFP and Runahead do not. gamess primarily experiences L2 misses that hit in the L3. CFP does nothing because it only applies latency tolerance to L3 misses. These misses are also far enough apart that they do not overlap in Runahead execution, so Runahead experiences a slight slowdown while it learns to suppress Runahead. BOLT obtains a speedup in this case by exploiting ILP under the moderate-latency L2 misses.

slowdowns. CFP suffers slight slowdowns on about half of these benchmarks, including 4% on hmmer. By contrast, BOLT suffers slowdowns on only two benchmarks: 3% on namd and 0.5% on astar. For these two benchmarks, on which CFP also sees slowdowns, the cause is cache pollution from applying latency tolerance down the wrong path. As the program executes down the wrong path following a branch misprediction, cache misses occur for unneeded addresses. In conventional execution, these misses and their dependents would clog the issue queue; BOLT and CFP instead apply latency tolerance and expose more wrong-path MLP. This MLP hurts performance by replacing useful data with useless data in the caches. It may be possible in either CFP or BOLT to detect wrong-path execution using previously proposed techniques [5] and shut down latency tolerance to prevent this problem. Runahead is not affected because it only begins latency tolerance when the long-latency miss reaches the head of the ROB, which cannot happen on the wrong path because the older mispredicted branch must resolve first.
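The pollution effect described above can be illustrated with a toy model. The sketch below is not the simulator used in this evaluation: it assumes a tiny fully associative LRU cache, and the class, function, and address values are all hypothetical. It counts how many correct-path accesses still hit after a burst of wrong-path fills.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache of line addresses with LRU replacement."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines = OrderedDict()  # keys are line addresses, oldest first

    def access(self, addr: int) -> bool:
        """Touch addr; return True on a hit, False on a miss (the line is filled)."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[addr] = None
        return False


def correct_path_hits(wrong_path_fills: int, capacity: int = 4) -> int:
    """Hits on a warmed working set after some number of wrong-path fills."""
    cache = LRUCache(capacity)
    working_set = [0, 1, 2, 3]  # hypothetical useful lines
    for addr in working_set:
        cache.access(addr)  # warm the cache on the correct path
    for i in range(wrong_path_fills):
        cache.access(1000 + i)  # wrong-path misses bring in useless lines
    return sum(cache.access(addr) for addr in working_set)
```

With no wrong-path fills, every access to the working set hits; with wrong-path fills, useful lines are evicted and the hit count drops. A real cache hierarchy is set-associative and much larger, so the magnitude here is exaggerated.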

CFP’s remaining slowdowns are due to its checkpoint overhead. When mis-speculations occur between checkpoints, older correct-path instructions must be squashed, decreasing performance and increasing re-execution overhead. This particularly hurts hmmer and sjeng, where the combination results in ED² increases of 17% and 15%, respectively.

CFP outperforms BOLT (and Runahead) on one benchmark—tonto. Here, difficult forwarding patterns pose problems for BOLT’s speculatively indexed store queue. CFP uses a combination of associative and indexed store queues, and is able to forward more accurately in these cases. Runahead never outperforms BOLT.
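To make the distinction between the two forwarding styles concrete, here is a minimal sketch under simplifying assumptions. It is a toy model, not the actual store queue design of BOLT or CFP: stores are held oldest first as hypothetical (address, value) pairs, and the predicted index is simply passed in rather than produced by a predictor.

```python
from typing import List, Optional, Tuple

Store = Tuple[int, int]  # (address, value); a hypothetical toy encoding


def forward_associative(stores: List[Store], load_addr: int) -> Optional[int]:
    """Compare the load address against every older store, youngest first."""
    for addr, value in reversed(stores):
        if addr == load_addr:
            return value
    return None  # no matching store: the load reads the cache instead


def forward_indexed(stores: List[Store], predicted_index: Optional[int]) -> Optional[int]:
    """Read only the predicted entry, with no address comparison at this point.

    The result is speculative; a separate verification step must catch mistakes.
    """
    if predicted_index is None:
        return None  # predicted not to forward
    return stores[predicted_index][1]


if __name__ == "__main__":
    stores = [(0x10, 1), (0x20, 2), (0x10, 3)]  # oldest first
    print(forward_associative(stores, 0x10))  # finds the youngest matching store
    print(forward_indexed(stores, 2))         # correct prediction: same value
    print(forward_indexed(stores, 0))         # stale prediction: wrong value
```

In this model, the indexed lookup is cheaper because it touches one entry, but when the forwarding store changes from one load instance to the next, a stale prediction returns the wrong value and must be repaired after verification. The associative search does not have this failure mode, at the cost of comparing against every entry.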

Overall, BOLT typically improves ED² relative to the baseline. When it hurts ED², it does so in a small way. The only two benchmarks on which BOLT increases ED² by more than 2% are namd and astar, the two on which it suffers slowdowns due to wrong-path MLP. Runahead also has only two benchmarks with ED² increases of more than 2%: gamess and leslie3d. For these, Runahead incurs overheads and performance losses while it learns that there is no MLP. By contrast, CFP has twelve benchmarks with ED² increases greater than 2%, including three over 10%.

| | Runahead | WIB | D-KIP | CFP | BOLT |
|---|---|---|---|---|---|
| Register management | | | | | |
| Checkpoints | 1 | 0 | 0 | 8 | 2 |
| Re-renaming | None | None | None | Physical | Logical |
| Load and store management | | | | | |
| Store forwarding | Assoc Store Queue + Fwd$ | Assoc Store Queue | Assoc Store Queue | Assoc + Indexed Store Queues | Indexed Store Queue |
| Load verification | Assoc Load Queue | Assoc Load Queue | Assoc Load Queue | Assoc + Set Assoc Load Queues | Indexed Load Queue |
| Re-execution | | | | | |
| Insns buffered | None | All | Miss-dep | Miss-dep | Miss-dep |
| Insns re-executed | All | Miss-dep | Miss-dep | Miss-dep | Miss-dep |
| Non-blocking | N/A | Yes | No | Yes | Yes |
| Start at load | Oldest | Any | Oldest | Any | Any |
| Visible to tail | N/A | Immediately | On Squash | Immediately | Immediately |
| Pruning | Useless/Overlapping | Miss | None | None | Miss/Join/Pointer |
| Overall Effects | | | | | |
| Performance | Low | High | Moderate | Moderate | High |
| Static cost | Low | High | High | High | Moderate |
| Dynamic cost | High | Moderate | High | Moderate | Low |

Table 5.4: Comparison and contrast of out-of-order load latency tolerant designs.

5.5.2 Qualitative Comparison of All Five Designs

Table 5.4 summarizes the differences between all five out-of-order load latency tolerant designs. The overall effects at the bottom (performance, static area and energy overhead, and dynamic energy overhead) come from the data shown above for Runahead, CFP, and BOLT. WIB and D-KIP are not simulated, but their entries can be inferred from their described behaviors and individual evaluations.

WIB. WIB’s performance should be comparable to BOLT’s with only miss-pruning. The problem with WIB is not performance but static area and energy cost: WIB buffers all instructions and, more importantly, does not scale the register file. WIB does not itself scale the load/store queues, but could be coupled with SQIP/SVW.

D-KIP. D-KIP has problems in both performance and cost. For performance, D-KIP re-executes instructions on an in-order pipeline, which makes its re-execution blocking: if it encounters a long-latency miss during re-execution, re-execution must stall until that miss returns. D-KIP does not improve performance in the presence of dependent (non-pointer-chasing) misses. For static cost, D-KIP requires an entire in-order processor as well as a large number of register checkpoints that must support incremental updates; essentially, D-KIP maintains nine register files, one physical and eight logical. For dynamic overhead, D-KIP does not propagate slice register values from the in-order re-execution processor to the out-of-order tail execution processor except in the case of a tail squash. This forces D-KIP to defer and re-execute instructions unnecessarily. D-KIP must also make multiple copies of every register value.
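The cost of blocking re-execution can be sketched with a simple cycle count. The model below is an illustration under strong assumptions, not a description of D-KIP or BOLT hardware: every slice instruction is independent, each occupies one issue cycle, and the miss latency constant and function names are hypothetical.

```python
from typing import List

MISS_LATENCY = 200  # hypothetical miss latency in cycles


def blocking_cycles(slice_misses: List[bool]) -> int:
    """In-order, blocking re-execution: each miss stalls everything behind it."""
    return sum(MISS_LATENCY if is_miss else 1 for is_miss in slice_misses)


def non_blocking_cycles(slice_misses: List[bool]) -> int:
    """Non-blocking re-execution: a miss is issued and set aside, so
    independent misses overlap instead of serializing."""
    cycles = 0
    last_miss_done = 0
    for is_miss in slice_misses:
        issue_cycle = cycles
        cycles += 1  # one issue slot per instruction
        if is_miss:
            last_miss_done = max(last_miss_done, issue_cycle + MISS_LATENCY)
    return max(cycles, last_miss_done)


if __name__ == "__main__":
    slice_misses = [True, False, True, False]  # two independent misses
    print(blocking_cycles(slice_misses))
    print(non_blocking_cycles(slice_misses))
```

With no misses in the slice, the two counts are identical. The gap opens only when re-execution itself encounters misses, which is the case the paragraph above describes.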
