Mark Smotherman
2.1.3 Superscalar Terminology
2.1.3.1 Program Order
Figure 2.2 illustrates various terms used in describing superscalar processors. The adjectives in-order and out-of-order refer to the ordering of instructions as compared to the program. Figure 2.2a shows a simple scalar processor with four stages, and the actions of moving an instruction from one stage to the next are named fetch, issue, and complete, respectively. Instructions are decoded and issued in program order. A simple extension to the scalar pipeline is the use of multiple pipelines, as shown in Fig. 2.2b. This can allow a scalar processor to specialize its execution pipelines (or function units; e.g., integer vs. floating point), or, as of interest here, it can produce an in-order superscalar, in which multiple instructions are fetched, decoded, and issued in program order. The stage-to-stage terminology is the same as for the simple in-order scalar processor.
2.1.3.2 Instruction Completion and Precise Exceptions
All processors that attempt to execute instructions in parallel must deal with variations in instruction execution times. That is, some instructions, such as those involving simple integer arithmetic or logic operations, will need only one cycle for execution, while others, such as floating-point instructions, will need multiple cycles for execution. If these different instructions are started at the same time, as in a superscalar processor, or even in adjacent cycles, as in a scalar pipelined processor with separate function units or execution pipelines, a simple instruction can complete earlier than a longer running instruction that appears earlier in the instruction stream. This is called out-of-order completion. We can, of course, prevent out-of-order completion by techniques such as adding delay stages so that all execution paths have the same number of stages.
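The timing argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the latency table, the function names, and the rule that instruction i starts executing in cycle i + 1 are all hypothetical and are not taken from any particular processor.

```python
# Hypothetical execute-stage latencies in cycles (illustrative values only).
EXEC_CYCLES = {"add": 1, "load": 2, "addf": 4}


def completion_cycles(ops, pad_to=None):
    """Return the cycle in which each op finishes executing.

    ops are listed in program order and are assumed to start in adjacent
    cycles (op i starts in cycle i + 1), as in a scalar pipeline with
    separate function units. If pad_to is given, it models delay stages
    that stretch every execution path to at least that many cycles.
    """
    done = []
    for i, op in enumerate(ops):
        latency = EXEC_CYCLES[op]
        if pad_to is not None:
            latency = max(latency, pad_to)
        start = i + 1
        done.append(start + latency - 1)
    return done


def completes_out_of_order(done):
    """True if some op finishes before an op that precedes it in program order."""
    return any(later < earlier for earlier, later in zip(done, done[1:]))
```

With a floating-point add followed by an integer add, `completion_cycles(["addf", "add"])` gives cycles 4 and 2, so the younger instruction completes first. Passing `pad_to=4` models the delay-stage remedy and restores in-order completion, at the cost of a longer path for the simple instruction.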
If we choose to allow a subsequent, simple instruction to write its result to storage before a longer running instruction completes, we may violate a data dependency. Dependency checking hardware can eliminate this problem while still allowing some out-of-order completions; however, dependency checking will not solve the problem of an inconsistent state of storage (registers or memory) if the longer running instruction causes an exception. To handle this exception and to be able to resume the program, we must know the precise state of the storage, that is, we must know which instructions, before the one causing the exception, have not been completed and which instructions, after the one causing the exception, have been completed. To resume at a given point in program order, the processor must restore a consistent state with all previous instructions completed and no subsequent instructions completed. A standard technique to handle this is to provide a form of buffering for the results of instructions, usually called a reorder buffer. This method is depicted in Fig. 2.2c. Instructions completing out-of-order can place their results in preassigned entries in this buffer (assignments are made when instructions are in the decode stage); and, when available, the results are retired out of this buffer in program order. (This action is alternatively called commit, completion, or graduation in some processors.) If an exception occurs, or for that matter, a branch or value misprediction occurs, instructions before the one causing the exception or misprediction are allowed to retire and then the contents of the reorder buffer beyond that instruction are flushed. Execution can then be resumed with a consistent state of storage.
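The behavior of a reorder buffer can be illustrated with a small Python model. It is a sketch, not a description of any real design: the class name, the dictionary-based entries, and the choice to flush the excepting instruction together with all younger ones (on the assumption that a handler will re-execute or emulate it) are assumptions made for this example.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: allocate in order, complete in any order, retire in order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.registers = {}     # architectural state, updated only at retire

    def allocate(self, tag, dest):
        """Reserve an entry at decode time, in program order."""
        entry = {"tag": tag, "dest": dest, "value": None,
                 "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, exception=False):
        """Record a result (or an exception); this may happen out of order."""
        entry["value"] = value
        entry["done"] = True
        entry["exception"] = exception

    def retire(self):
        """Retire finished entries from the head; return the tags retired.

        If the head entry raised an exception, it and every younger entry
        are flushed, leaving the registers in a precise state.
        """
        retired = []
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                self.entries.clear()
                break
            self.entries.popleft()
            self.registers[head["dest"]] = head["value"]
            retired.append(head["tag"])
        return retired
```

In this model a completed younger instruction cannot update `registers` until every older instruction has retired, which is what keeps the state precise when an older instruction later reports an exception.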
Other techniques for dealing with exceptions include the use of a run slow mode bit to switch between in-order and out-of-order instruction completion (IBM RS/6000), the use of a history buffer (Motorola 88110), the use of exception barrier instructions (DEC Alpha), delaying instruction issue until exceptions from previous instructions are guaranteed not to occur (called safe instruction recognition in the Intel Pentium), and the use of a future file (UltraSPARC-III).
2.1.3.3 Instruction Issue
Up to this point, instructions have been described to issue, that is, start execution, in program order. Consider the case of a long-running dependent instruction pair followed by independent instructions. It would be advantageous if the compiler would statically schedule independent instructions between the two instructions of the dependent pair; however, not all programs will be so scheduled. An alternative is to provide dynamic instruction scheduling in the hardware, also known as out-of-order execution.
[Figure 2.2. (a) In-order scalar: Icache, Decode, Execute, Write back; actions fetch, issue, complete. (b) In-order-issue superscalar: Icache, Decode, parallel Execute units, Write back; actions fetch, issue, complete. (c) Out-of-order-completion for scalar or superscalar: Icache, Decode, Execute units, Reorder buffer, Write back; actions fetch, issue, complete, retire. (d) Out-of-order-issue scalar or superscalar: Icache, Decode, Inst. window, Execute units, Reorder buffer, Write back; actions fetch, dispatch, issue, complete, retire.]
This requires providing some form of buffering for instructions subsequent to decode, as depicted in Fig. 2.2d, and the necessary control logic to identify and issue instructions that are ready to execute. The execution of ready instructions in an out-of-order processor is a form of data flow execution, and the dynamic scheduling of the portion of the instruction stream held in the instruction buffer has been called restricted data flow. The buffer for the instructions can take the form of a centralized instruction window or a decentralized set of reservation stations, in which a subset of the instruction buffers are located at each function unit. In Fig. 2.2d, the action of placing a decoded instruction into this buffer is called dispatch and the term for the start of execution (i.e., choosing and routing instructions from the instruction buffer to execution units) remains issue. Unfortunately, some authors and processor manuals make the issue and dispatch terms synonymous and some even reverse the above meanings, so the reader is advised to always read the context carefully when encountering these two terms.
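The control logic that picks ready instructions from the window can be sketched as follows. This is an idealized model under several assumptions that are not part of the text: every instruction is already dispatched into the window before cycle 1, registers are renamed so that only true (read-after-write) dependencies matter, results are forwarded as soon as execution ends, and the oldest ready instructions are chosen first. The names `schedule`, `LOOP_BODY`, and `LATENCIES` are hypothetical, and the resulting cycle numbers do not correspond to those in any figure.

```python
# One loop iteration as (op, destination, sources); None means no register result.
LOOP_BODY = [
    ("load", "f0", ["r1"]),
    ("addf", "f4", ["f0", "f2"]),
    ("store", None, ["f4", "r1"]),
    ("add", "r1", ["r1"]),
    ("bne", None, ["r1", "r2"]),
]

# Execution latencies in cycles, following the values assumed in the text.
LATENCIES = {"load": 2, "store": 2, "addf": 4, "add": 1, "bne": 1}


def schedule(program, latency, width=2):
    """Return {instruction index: issue cycle} for an idealized window."""
    # Find, for each instruction, the earlier instructions that produce its sources.
    last_writer = {}
    producers = []
    for i, (op, dest, srcs) in enumerate(program):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        if dest is not None:
            last_writer[dest] = i

    issue = {}
    cycle = 0
    while len(issue) < len(program):
        cycle += 1
        picked = []
        for i in range(len(program)):  # scan oldest first
            if i in issue:
                continue
            # Ready when every producer has issued and finished executing.
            ready = all(
                p in issue and issue[p] + latency[program[p][0]] <= cycle
                for p in producers[i]
            )
            if ready:
                picked.append(i)
                if len(picked) == width:
                    break
        for i in picked:
            issue[i] = cycle
    return issue
```

For `schedule(LOOP_BODY, LATENCIES)` the integer add (index 3) issues in cycle 1, ahead of the floating-point add and the store that precede it in program order, which is the essence of out-of-order issue.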
Figures 2.3 and 2.5 illustrate superscalar processing with in-order execution and out-of-order execution, respectively, for two iterations of a short loop that adds a value to each element in an array of floating-point values. Throughput rates of two instructions per cycle are assumed for each stage-to-stage action, and branch prediction and operand forwarding are assumed. Loads and stores have two-cycle execution, floating-point add has four-cycle execution, and integer add and branch have single-cycle execution each. The stages for the in-order processor in Fig. 2.3 are FDEW: fetch, decode, execute, and write back. The stages for the out-of-order processor in Fig. 2.5 are FDIECR: fetch, decode, issue (or inst. window), execute, complete (or reorder buffer), and retire (or write back). Renaming is assumed for the out-of-order processor.
Without the ability to dispatch dependent instructions into an instruction window or to reservation stations, the decoder in the in-order processor in Fig. 2.3 stalls from cycle 3 to cycle 5, at which point the floating-point add can be issued. Similar decoder stalls can be observed at cycles 6, 13, and 16. Because of the data dependencies within the loop body, the throughput is less than one instruction per cycle. The overall effect is that the two iterations cannot be finished within 20 cycles. To take better advantage of the in-order processor, a compiler or assembly language programmer would need to unroll the loop. For example, unrolling by a factor of two would lead to the execution diagram in Fig. 2.4. In this case, one iteration of the unrolled loop, with two element updates, finishes in cycle 12. There is still some minor stalling at the decoder, as seen in cycles 4 and 6, but the overall performance has improved greatly.

Instruction            Stages traversed (diagram spans cycles 1-20)
load  0(r1), f0        F D E E W
addf  f0, f2, f4       F D E E E E W
store f4, 0(r1)        F D E E W
add   r1, #4, r1       F D E W
bne   r1, r2, loop     F D E W
load  0(r1), f0        F D E E W
addf  f0, f2, f4       F D E E E E W
store f4, 0(r1)        F D E E
add   r1, #4, r1       F D
bne   r1, r2, loop     F

FIGURE 2.3 Superscalar execution with in-order execution.

Instruction            Stages traversed (diagram spans cycles 1-20)
load  0(r1), f0        F D E E W
load  4(r1), f2        F D E E W
addf  f0, f4, f6       F D E E E E W
addf  f2, f4, f8       F D E E E E W
store f6, 0(r1)        F D E E W
store f8, 4(r1)        F D E E W
add   r1, #8, r1       F D E W
bne   r1, r2, loop     F D E W

For the out-of-order superscalar processor illustrated in Fig. 2.5, each instruction has to traverse more stages, but dependent instructions can be buffered so that the decoder and execution units can bypass these instructions and uncover independent instructions that will be ready to execute. This can be observed in cycle 4, in which the integer add is issued before any of the three previous instructions complete. (The WAR dependency between the add and store instructions is handled by register renaming.) The integer add completes in cycle 6 but the result has to wait in the reorder buffer until it can retire in order in cycle 15. Even without compiler unrolling, the two iterations finish by cycle 20. In fact, the use of renaming and dynamic scheduling allows the processor to do hardware unrolling of the loop, as seen in cycle 7 in which the load instruction of the second iteration is issued before the first iteration is finished. (The data dependency arising from the write to register f0 in this second load is handled by register renaming.) If the loop has enough iterations, throughput for the out-of-order superscalar will continue to increase as more hardware overlap of iterations occurs and will become more competitive with the unrolled, in-order superscalar throughput. The out-of-order superscalar will also be more tolerant of cache misses than the in-order superscalar.
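The role of register renaming in the two parenthetical remarks above can be shown with a minimal sketch. The mapping scheme, the physical register names p0, p1, and so on, and the function name `rename` are assumptions for illustration; real processors manage a finite pool of physical registers and free them at retirement, which this sketch omits.

```python
from itertools import count


def rename(program, arch_regs):
    """Rewrite (op, dest, sources) tuples to use fresh physical registers.

    Every architectural register starts with its own physical register,
    and every write is given a new one, so only true dependencies remain.
    """
    fresh = count()
    mapping = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for op, dest, srcs in program:
        # Sources are read through the current mapping before the destination changes.
        new_srcs = [mapping[s] for s in srcs]
        new_dest = None
        if dest is not None:
            mapping[dest] = f"p{next(fresh)}"
            new_dest = mapping[dest]
        renamed.append((op, new_dest, new_srcs))
    return renamed


# First iteration of the loop followed by the load of the second iteration.
EXAMPLE = [
    ("load", "f0", ["r1"]),
    ("addf", "f4", ["f0", "f2"]),
    ("store", None, ["f4", "r1"]),
    ("add", "r1", ["r1"]),
    ("load", "f0", ["r1"]),
]
```

Calling `rename(EXAMPLE, ["r1", "r2", "f0", "f2", "f4"])` makes the store read the old copy of r1 (p0) while the add writes a new one (p7), so the add no longer has to wait for the store. Likewise the second load writes p8 rather than the p5 used by the first load, so the two iterations can overlap.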
A scalar processor typically has decode and issue rates of one instruction per cycle. As compared to a scalar design, a superscalar design must provide a throughput greater than one instruction per cycle for each action: fetch, dispatch, issue, complete, and retire. Note that while the throughput rates do not have to be equal, the overall throughput rate for a processor is limited by the smallest throughput rate across the individual stage-to-stage actions. In fact, a processor in the style of Fig. 2.2d that fetched, decoded, and executed multiple instructions per cycle but that could only retire one instruction per cycle would in effect be a scalar processor.
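The limiting relationship can be stated as a one-line computation. The function name and the example rates below are hypothetical and serve only to restate the point of the paragraph.

```python
def overall_throughput(rates):
    """Sustained instructions per cycle: the smallest stage-to-stage rate."""
    return min(rates.values())


# Hypothetical peak rates, in instructions per cycle, for each action.
WIDE_FRONT_END = {"fetch": 4, "dispatch": 4, "issue": 4, "complete": 4, "retire": 1}
```

Here `overall_throughput(WIDE_FRONT_END)` evaluates to 1: despite the four-wide front end, the single-instruction retire rate makes the design behave as a scalar processor.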