KILO and Decoupled KILO-Instruction Processors

the issue queue in a manner similar to WIB and CFP—it buffers miss-dependent instructions in a non-associative structure until the miss they depend on returns. KILO calls this structure a Slow Lane Issue Queue (SLIQ).

KILO scales the register file using a combination of virtual registers and reference counting. At rename, logical registers are mapped to virtual registers, a namespace larger than the set of physical registers. The virtual register name is then remapped to a physical register when the instruction is actually ready to execute. KILO reference-counts these virtual register names to determine when they can be reclaimed, and reclaims the underlying physical register at the same time. KILO does not consider the implementation details of this register management scheme. If this scheme were implemented using a reference counting matrix, the matrix would be quite large: it would need one column per virtual register name, and one row per slow lane issue queue entry as well as one per conventional “fast lane” issue queue entry.
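The virtual register scheme described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not KILO's implementation (which the source says is left unspecified): the class name, the method names, the per-name counter, and the one-bit-per-cell matrix estimate are all hypothetical choices made for illustration.

```python
class VirtualRegisterFile:
    """Toy model: logical -> virtual at rename, virtual -> physical at execute.

    All names here are illustrative assumptions, not KILO's actual design.
    """

    def __init__(self, num_virtual, num_physical):
        self.free_virtual = list(range(num_virtual))
        self.free_physical = list(range(num_physical))
        self.virt_to_phys = {}  # filled in late, when the producer executes
        self.ref_count = {}     # outstanding references per virtual name

    def rename_dest(self):
        """Allocate a virtual name at rename; no physical register yet."""
        v = self.free_virtual.pop(0)
        self.ref_count[v] = 1   # assumed: one reference held by the map table
        return v

    def add_consumer(self, v):
        """A renamed consumer takes a reference to the virtual name."""
        self.ref_count[v] += 1

    def bind_physical(self, v):
        """Bind a physical register when the producer is ready to execute."""
        p = self.free_physical.pop(0)
        self.virt_to_phys[v] = p
        return p

    def release(self, v):
        """Drop one reference; reclaim both names when the count hits zero."""
        self.ref_count[v] -= 1
        if self.ref_count[v] == 0:
            del self.ref_count[v]
            self.free_virtual.append(v)
            p = self.virt_to_phys.pop(v, None)
            if p is not None:
                self.free_physical.append(p)


def refcount_matrix_cells(num_virtual, sliq_entries, fast_iq_entries):
    """Cells in a reference counting matrix: one column per virtual name,
    one row per slow lane and per fast lane issue queue entry."""
    return num_virtual * (sliq_entries + fast_iq_entries)
```

The point of the late binding is visible in `rename_dest`, which consumes only a virtual name: a physical register is held only from `bind_physical` until the last `release`. The helper `refcount_matrix_cells` shows why a matrix implementation grows quickly, since its size is the product of the virtual namespace and the combined issue queue capacity.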

For the load or store queues, KILO cites various prior proposals, including those used by CFP, and states that any would be satisfactory.

2.6.1 Decoupled KILO-Instruction Processor

KILO’s register management scheme is quite complex. The follow-on design, the Decoupled KILO-Instruction Processor (D-KIP) [60], addresses this problem by chaining execution between two different processors. In D-KIP, instructions start in the Cache Processor. The Cache Processor is a conventional out-of-order processor, except that miss-dependent instructions are forced out of its ROB after a certain number of cycles. As miss-dependent instructions leave the Cache Processor, they are placed into the Long Latency Instruction Buffer (LLIB), a FIFO queue which chains the out-of-order Cache Processor to an in-order Memory Processor. LLIB instructions capture their ready register input values as they exit the Cache Processor, so that the Memory Processor’s execution is self-contained. D-KIP maintains precise register state in the Cache Processor using a series of register checkpoints. These checkpoints contain register values (not mappings), and miss-dependent instructions whose values appear in them are flagged to update the checkpoint when they execute on the Memory Processor. D-KIP scales the load and store queues using hierarchy. However, it arranges its hierarchy differently from CFP, using miss-dependence or independence rather than age to determine hierarchy level [59].
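The LLIB hand-off described above can be sketched as a FIFO whose entries capture ready inputs on the way out of the Cache Processor. This is a simplified illustration, not D-KIP's actual design: the names `LLIBEntry`, `evict_to_llib`, and `drain_in_order` are hypothetical, every instruction is treated as an add, and checkpoints and the load/store queues are omitted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LLIBEntry:
    dest: str
    # List of (register name, captured value); the value is None when the
    # input is itself miss-dependent and must come from the Memory Processor.
    operands: list


def evict_to_llib(llib, dest, sources, cache_regs, poisoned):
    """Move a miss-dependent instruction into the LLIB, capturing ready inputs."""
    operands = [(r, None if r in poisoned else cache_regs[r]) for r in sources]
    poisoned.add(dest)  # the destination is now miss-dependent as well
    llib.append(LLIBEntry(dest, operands))


def drain_in_order(llib, mem_regs):
    """In-order Memory Processor: stall at the first entry that is not ready."""
    while llib:
        head = llib[0]
        if any(v is None and r not in mem_regs for r, v in head.operands):
            break
        llib.popleft()
        values = [mem_regs[r] if v is None else v for r, v in head.operands]
        mem_regs[head.dest] = sum(values)  # toy semantics: every op is an add
    return mem_regs


# Usage: a load into r1 misses; r3 = r1 + r2 and r4 = r3 + r2 depend on it.
llib, poisoned = deque(), {"r1"}
cache_regs = {"r2": 5}
evict_to_llib(llib, "r3", ["r1", "r2"], cache_regs, poisoned)
evict_to_llib(llib, "r4", ["r3", "r2"], cache_regs, poisoned)

mem_regs = {}
drain_in_order(llib, mem_regs)  # miss still outstanding: nothing executes
mem_regs["r1"] = 10             # the load miss returns
drain_in_order(llib, mem_regs)  # both entries now execute in FIFO order
```

Because `r2` is copied into each entry at eviction time, the drain loop never reads Cache Processor state, which is the sense in which the Memory Processor's execution is self-contained. The same copying is what makes each entry carry its own data values.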

One disadvantage of D-KIP is excessive copying of register values. Another disadvantage is the complexity of implementing register checkpoints that support incremental updates. A third disadvantage is that the only value communication from the in-order Memory Processor to the out-of-order Cache Processor comes in the form of restoring register checkpoints. Consequently, instructions executing in the Cache Processor may appear to be miss-dependent and re-execute on the second processor, even if the miss they depend on has already returned and their input values are known in the Memory Processor. Delaying this communication not only causes extra re-executions, but can also increase the branch misprediction penalty. Neither KILO nor D-KIP is capable of tolerating the latency of dependent load misses.

2.6.2 Scalability of Load Latency Tolerant Designs

The slice buffer of load latency tolerant designs like CFP has two scalability benefits compared to the issue queue and register file. The first benefit is that it only needs to hold miss-dependent instructions. Figure 2.13 shows the percentage of instructions which are miss-dependent (i.e., ever poisoned).

Figure 2.13: Percentage of instructions which are miss-dependent. (Axis: % miss-dependent instructions, 0 to 50; benchmarks: cactus, gems, lbm, milc, soplex, spnx, zeusmp, libq, mcf, MEM AVG.)

Figure 2.14: Scalability of a slice buffer versus the scalability of an issue queue and register file. In addition to the size of the structures, the x-axis is labeled with the effective window size given by that slice buffer capacity, assuming 15% of instructions are miss-dependent. (Series: Slice Buffer, Issue Queue, Register File; y-axis: read energy in pJ, 0 to 70; x-axis: slice buffer capacity with effective window size in parentheses: 64 (416), 256 (1664), 512 (3328), 1024 (6656), 2048 (13312).)

The second advantage is the physical scalability of the slice buffer structure compared to the issue queue. The slice buffer is an indexed structure whereas the issue queue is associatively searched. Figure 2.14 shows the per-instruction read energy of the slice buffer and the issue queue.⁴ An S-entry slice buffer is modeled as four S/4-word, 156-bit RAMs with one read port and one write port. Each entry holds a 64-bit instruction, a 64-bit (captured) data value, an 8-bit load/store queue index, an 8-bit physical register number, and a 12-bit instruction sequence number. The x-axis shows the capacity, but also lists the effective window size for the corresponding slice buffer size, assuming the average percentage of miss-dependent instructions (15.4%).
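The entry width and effective window size quoted above follow from simple arithmetic, which the short sketch below reproduces. The field names and function names are illustrative labels rather than identifiers from the source, and the effective window is assumed to be the capacity divided by the miss-dependent fraction.

```python
# Field widths of one slice buffer entry, in bits, as listed in the text.
ENTRY_FIELD_BITS = {
    "instruction": 64,
    "captured_data_value": 64,
    "load_store_queue_index": 8,
    "physical_register_number": 8,
    "instruction_sequence_number": 12,
}

MISS_DEPENDENT_FRACTION = 0.154  # average from the text (15.4%)
NUM_BANKS = 4                    # "four S/4-word" RAMs


def entry_width_bits():
    """Total width of one slice buffer entry."""
    return sum(ENTRY_FIELD_BITS.values())


def words_per_bank(capacity):
    """Words in each of the four RAM banks for an S-entry slice buffer."""
    return capacity // NUM_BANKS


def effective_window(capacity, fraction=MISS_DEPENDENT_FRACTION):
    """Assumed model: only miss-dependent instructions occupy the buffer,
    so S entries cover roughly S / fraction in-flight instructions."""
    return capacity / fraction


for capacity in (64, 256, 512, 1024, 2048):
    print(capacity, words_per_bank(capacity), round(effective_window(capacity)))
```

The field widths sum to the 156 bits quoted in the text, and a 64-entry buffer gives an effective window of about 416 instructions. The larger labels in Figure 2.14 (1664, 3328, and so on) are exact multiples of 416, so they differ by a few instructions from dividing each capacity by 0.154 directly.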

A 256-entry slice buffer requires about the same energy per read as a 36-entry issue queue requires per search. However, the slice buffer is only accessed by miss-dependent instructions. A 256-entry issue queue consumes significantly more energy for all instructions.

⁴ This graph ignores the fact that many of these issue queue sizes require changing the voltage or clock

For the same number of entries, the slice buffer and register file require about the same read energy. Here, the key scalability difference is that the slice buffer only needs to hold miss-dependent instructions, whereas the register file must hold all in-flight instructions. Additionally, the slice buffer is not in the critical execution loop, so its access latency is not as important.

In document Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era (Page 52-55)