

2.1.1 Issue Queue

The issue queue holds all un-executed instructions and schedules instructions for execution as their inputs become available. Scheduling consists of two parts: select chooses N instructions for execution from among the instructions whose inputs are available; wakeup updates the ready bits of all instructions that depend on the instructions just selected. Select is usually implemented using priority encoders. Wakeup is implemented using a content-addressable memory (CAM) that holds the input physical register names of all un-executed instructions. The issue queue in the Intel Core i7 processor has 36 entries, limiting the out-of-order window to 36 un-executed instructions.

[Figure 2.2 plot. x-axis: Issue Queue Capacity (16 to 512 entries); y-axis: Wakeup Energy (pJ), 0 to 100. Series: 1-cycle wakeup/select; 1-cycle with voltage up (V) or frequency down (F); 2-cycle wakeup/select. Point annotations in plot order: V +2% / F -1%, V +14% / F -8%, V +29% / F -17%, V +62% / F -35%.]

Figure 2.2: Energy scaling of the issue queue. The dashed line indicates designs where 1-cycle wakeup/select requires increasing voltage or decreasing clock frequency.
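The select and wakeup steps described above can be sketched as a small behavioral model. This is a minimal illustration, not the thesis's simulator: the class and method names are hypothetical, select is modeled as an oldest-first priority pick, and the CAM is modeled as a tag comparison against every waiting entry.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Entry:
    name: str
    dest: int                                   # destination physical register tag
    srcs: tuple                                 # source physical register tags
    ready: dict = field(default_factory=dict)   # one ready bit per source tag


class IssueQueue:
    """Hypothetical behavioral model of a wakeup/select issue queue."""

    def __init__(self, width, ready_regs=()):
        self.width = width                  # N: instructions selected per cycle
        self.entries = []                   # kept in program (age) order
        self.ready_regs = set(ready_regs)   # registers whose values already exist

    def insert(self, name, dest, srcs):
        e = Entry(name, dest, tuple(srcs))
        e.ready = {s: s in self.ready_regs for s in e.srcs}
        self.entries.append(e)

    def select(self):
        # Priority pick: up to `width` oldest entries with all inputs ready.
        picked = [e for e in self.entries if all(e.ready.values())][: self.width]
        for e in picked:
            self.entries.remove(e)
        return picked

    def wakeup(self, picked):
        # Broadcast destination tags; every waiting entry compares its source
        # tags against them (the associative CAM match) and sets ready bits.
        tags = {e.dest for e in picked}
        self.ready_regs |= tags
        for e in self.entries:
            for s in e.srcs:
                if s in tags:
                    e.ready[s] = True

    def cycle(self):
        picked = self.select()
        self.wakeup(picked)
        return [e.name for e in picked]


q = IssueQueue(width=2, ready_regs={1, 2})
q.insert("add", dest=3, srcs=(1, 2))
q.insert("mul", dest=4, srcs=(3, 1))
q.insert("sub", dest=5, srcs=(4, 3))
q.insert("or", dest=6, srcs=(2,))
schedule = [q.cycle() for _ in range(3)]
print(schedule)
```

In this model every wakeup touches every waiting entry, which is the property that makes the associative structure expensive as the queue grows.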

Physical scalability. One difficulty with scaling the issue queue is that its latency is critical. Scheduling that takes multiple clock cycles prevents dependent instructions from executing in consecutive cycles. Essentially, the minimum execution latency in the processor becomes that of the scheduler’s wakeup/select loop. The complexity (and therefore latency) of both wakeup and select increases with the size of the issue queue [56].
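The effect of a multi-cycle scheduling loop on dependent instructions can be illustrated with a simplified model. The function below is a sketch under an assumption that is not taken from the source: a dependent instruction can issue no earlier than max(producer latency, loop latency) cycles after its producer issues. The function name is hypothetical.

```python
def chain_completion(op_latencies, loop_cycles):
    """Cycles needed to finish a chain of dependent operations.

    Assumed model: each dependent issues max(producer latency, loop_cycles)
    cycles after its producer, so the wakeup/select loop sets a floor on
    the effective latency of every producer in the chain.
    """
    issue = 0
    for lat in op_latencies[:-1]:
        issue += max(lat, loop_cycles)
    return issue + op_latencies[-1]


one_cycle = chain_completion([1, 1, 1, 1], loop_cycles=1)   # back-to-back issue
two_cycle = chain_completion([1, 1, 1, 1], loop_cycles=2)   # a bubble after each op
mixed = chain_completion([1, 3, 1], loop_cycles=2)          # 3-cycle op hides the loop
print(one_cycle, two_cycle, mixed)
```

Under this model only operations shorter than the loop are penalized, which matches the observation that single-cycle operations bear the cost of a slower scheduler.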

Figure 2.2 quantifies the scalability of issue queue wakeup. The graph plots the energy required for a single wakeup operation for different issue queue sizes. The data in this graph is derived from a modified version of CACTI 4.1 [79], assuming a 4-wide out-of-order processor. We model an I-entry issue queue as two I-word, 8-bit CAMs, each with four write ports and four associative match ports; each 8-bit entry corresponds to one 8-bit physical register tag. Two CAMs are modeled because each instruction may have two register inputs. CACTI cannot model the select logic, so its latency is estimated from the baseline processor’s issue queue. CACTI reports that the baseline (36-entry) issue queue has a minimum latency of 284 ps. It also reports a design with a 292 ps latency that uses approximately two-thirds as much energy and about one-half the area of the minimum-latency design. Assuming that a 3.2 GHz Core i7 uses the second design, 20.5 ps of the 312.5 ps clock cycle remain for the select logic. The data in the graph uses this estimate (20.5 ps) for all points and does not model the longer select latencies of larger issue queues.

[Figure 2.3 plot. x-axis: Issue Queue Capacity (36, 64, 128, 256, 512, 1024); y-axis: % Speedup, 0 to 75. Series: Memory Bound (1 cycle), Memory Bound (2 cycle), SPEC 2006 (1 cycle), SPEC 2006 (2 cycle).]

Figure 2.3: Performance scaling of the issue queue for both 1- and 2-cycle wakeup/select. All other window structures are infinite.
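The 20.5 ps select budget follows directly from the clock period and the wakeup latency quoted above. The short calculation below reproduces it and also tallies the tag storage implied by the two-CAM model; the variable and function names are illustrative only.

```python
CLOCK_HZ = 3.2e9      # baseline Core i7 clock frequency from the text
WAKEUP_PS = 292.0     # latency of the lower-energy CACTI design from the text
TAG_BITS = 8          # one 8-bit physical register tag per CAM word

cycle_ps = 1e12 / CLOCK_HZ                 # clock period in picoseconds
select_budget_ps = cycle_ps - WAKEUP_PS    # time left in the cycle for select


def cam_tag_bits(entries, cams=2, tag_bits=TAG_BITS):
    """Total tag storage for an I-entry queue modeled as `cams` I-word CAMs."""
    return cams * entries * tag_bits


print(cycle_ps, select_budget_ps, cam_tag_bits(36))
```

For the 36-entry baseline this gives a 312.5 ps cycle, a 20.5 ps select budget, and 576 bits of tag storage across the two CAMs.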

The line marked by dark circles in Figure 2.2 plots the energy requirements for different issue queue sizes when the wakeup/select loop is constrained to a single cycle. For issue queue sizes of 64 entries and larger, CACTI is unable to produce any configuration that permits single-cycle wakeup/select at the baseline clock frequency and voltage. Issue queue designs of these sizes with single-cycle wakeup/select require either increasing voltage or decreasing clock frequency (or some of both). These designs are indicated with slightly lighter circles and a dashed line. Each point has the voltage increase/frequency decrease noted above/below it. The energy values plotted do not account for the voltage change; i.e., they are the energy required if frequency were changed instead. The line marked with light triangles shows the energy consumption when wakeup/select is relaxed to two cycles.
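To give a feel for what the voltage and frequency annotations imply, the sketch below applies two first-order relations. Both are assumptions not stated in the source: dynamic switching energy is taken to scale with the square of supply voltage, and the cycle time is taken to be the reciprocal of the reduced frequency. The function names are hypothetical, and the inputs are the 62% voltage increase and 35% frequency decrease quoted for the 512-entry design.

```python
def energy_scale(voltage_increase):
    """Energy multiplier under the assumed E ~ V^2 dynamic-energy model."""
    return (1.0 + voltage_increase) ** 2


def cycle_time_ps(base_hz, frequency_decrease):
    """Clock period in ps after reducing the frequency by the given fraction."""
    return 1e12 / (base_hz * (1.0 - frequency_decrease))


v_scale = energy_scale(0.62)            # extra factor not included in the plot
slow_cycle = cycle_time_ps(3.2e9, 0.35) # longer cycle in which the loop must fit
print(round(v_scale, 4), round(slow_cycle, 2))
```

Under these assumptions, taking the voltage route would multiply the plotted per-wakeup energy by roughly 2.6, while the frequency route stretches the cycle from 312.5 ps to about 481 ps.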

Performance impact. Figure 2.3 shows the geometric mean speedup over the baseline Core i7 configuration of different issue queue sizes and latencies for all of SPEC 2006 (diamonds) and the memory-bound subset (boxes). The solid lines represent single-cycle wakeup/select, and the dashed lines represent two-cycle wakeup/select. In all configurations, window structures other than the issue queue are made ideally large. This means that even in the configuration with the same issue queue size and latency as the baseline (36 entries, 1 cycle), the idealized window enjoys some speedup due to the benefits of the other structures. The clock frequency is unmodified in all experiments.

[Figure 2.4 plot. x-axis: benchmarks povray, astar, bzip2, gcc, gobmk, hmmer, omnet, xalanc; y-axis: % Speedup, -20 to 10. Bars: 36 (1-cycle), 36 (2-cycle), 512 (1-cycle), 512 (2-cycle).]

Figure 2.4: Slowdowns suffered by individual programs from 2-cycle wakeup/select.
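A geometric mean speedup of the kind plotted in Figure 2.3 can be computed as follows. This is a generic sketch: the function name is hypothetical and the execution times are made-up numbers chosen only to show how the geometric mean differs from the arithmetic mean; they are not data from the thesis.

```python
import math


def geomean_speedup(base_times, new_times):
    """Geometric mean of per-benchmark speedups (baseline time / new time)."""
    ratios = [b / n for b, n in zip(base_times, new_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical execution times: one program 2x faster, one unchanged, one 2x slower.
base = [10.0, 20.0, 40.0]
new = [5.0, 20.0, 80.0]

gm = geomean_speedup(base, new)
am = sum(b / n for b, n in zip(base, new)) / len(base)
pct_speedup = (gm - 1.0) * 100.0
print(gm, am, pct_speedup)
```

The geometric mean treats a 2x speedup and a 2x slowdown as cancelling, whereas the arithmetic mean of the same ratios would report a net gain.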

Maximizing IPC throughput on the memory-bound programs requires a 512-entry issue queue. Such a design requires a 6–7× increase in the per-search energy of the issue queue as well as either 2-cycle wakeup/select, a 35% frequency decrease, or a 62% voltage increase. Neither drastically increasing voltage nor decreasing frequency is an attractive option. The 2-cycle wakeup/select design is also not attractive [10]. Experiments show that programs suffer average performance losses of about 7%. For individual benchmarks, the designs with 2-cycle wakeup/select also suffer slowdowns compared to the baseline.

Figure 2.4 shows the impact of 2-cycle wakeup/select on individual benchmarks. The graph shows four bars: 36-entry/1-cycle, 36-entry/2-cycle, 512-entry/1-cycle, and 512-entry/2-cycle. The benchmarks selected for this graph are those that suffer a slowdown of greater than 2% in the 512-entry/2-cycle configuration. These programs suffer from the latency added to operations that are normally single-cycle. Some of these programs (e.g., povray, gcc, and bzip2) mitigate this cost slightly by having a large window, but the longer-latency operations are the dominant effect here.

Other approaches. Prior research has attacked the non-scalability of the issue queue in various ways. Cyclone [24] replaces a traditional associative wakeup/select issue queue with a FIFO queue. An extended rename stage places instructions into this queue according to calculated dataflow height. The Cyclone scheduler is scalable, but cache misses result in costly performance penalties. Forwardflow [27] represents register dependences as linked lists (the readers of a given register are chained in a list that starts at that register) and implements wakeup by following list pointers. Forwardflow scales the issue queue, but may impact performance for long dependence lists, i.e., register values with high fan-out.
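The linked-list style of wakeup can be illustrated with a toy model. This is a sketch of the general idea only and is not Forwardflow's actual implementation: the class, its fields, and the hop counter are hypothetical. Each register heads a chain of its readers, and wakeup walks that chain one pointer at a time.

```python
class ListWakeup:
    """Toy pointer-chasing wakeup: readers of a register form a linked list."""

    def __init__(self):
        self.head = {}      # register -> first reader node (instr, operand index)
        self.tail = {}      # register -> last reader node, for O(1) append
        self.next = {}      # reader node -> next reader of the same register
        self.pending = {}   # instr -> number of operands still waiting

    def add_reader(self, instr, srcs):
        self.pending[instr] = len(srcs)
        for i, reg in enumerate(srcs):
            node = (instr, i)
            self.next[node] = None
            if reg in self.tail:
                self.next[self.tail[reg]] = node   # chain behind the last reader
            else:
                self.head[reg] = node              # first reader of this register
            self.tail[reg] = node

    def wakeup(self, reg):
        """Walk reg's reader chain; return (newly ready instrs, pointer hops)."""
        ready, hops = [], 0
        node = self.head.pop(reg, None)
        self.tail.pop(reg, None)
        while node is not None:
            hops += 1
            instr, _ = node
            self.pending[instr] -= 1
            if self.pending[instr] == 0:
                ready.append(instr)
            node = self.next.pop(node)
        return ready, hops


w = ListWakeup()
w.add_reader("i1", ["r1"])
w.add_reader("i2", ["r1", "r2"])
w.add_reader("i3", ["r1"])
first = w.wakeup("r1")
second = w.wakeup("r2")
print(first, second)
```

The hop count grows with the number of readers, which is the sense in which values with high fan-out lengthen wakeup in a pointer-following design.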

Other techniques examine modifications to the traditional wakeup/select-based design. One approach exploits the fact that most instructions entering the window need to wait for only one input to become available. Putting such instructions into a CAM that matches only one physical register tag allows for lower latency and energy [23]. Other approaches create a large issue queue from multiple smaller structures [56, 63].

Other techniques attack the problem of fitting wakeup/select into one cycle by relaxing that constraint while still allowing dependent instructions to issue in consecutive cycles. One approach speculatively wakes up grandchildren (i.e., a dependent instruction’s dependent instructions) [77]. This allows wakeup and select to take two cycles, but requires issue queue entries either to hold more tags (i.e., the parents’ inputs) or to speculate on which input will arrive last.

In document Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era (Page 30-34)