Kevin Skadron and David Tarjan
2.3.2 Software Techniques
Branches can be predicted or otherwise managed by both software and hardware techniques. This section focuses on software techniques, and Section 2.3.3 focuses on hardware techniques.
2.3.2.1 Branch Delay Slots
One software technique that eliminated the need for prediction in early processors is the branch delay slot. Instead of predicting the branch’s outcome, the instruction-set architecture can be defined so that some number of instructions following a branch execute regardless of the branch’s outcome. These instruction positions are called delay slots and must be filled with instructions that are safe to execute regardless of the outcome of the branch, or with nops (which do no useful work). For example, instructions to fill the delay slot might come from positions that preceded the branch in the original code schedule, provided they can safely be reordered. Consider the code sequence:
1. add r1, r2, r3
2. add r4, r5, r6
3. bnez r6
4. (delay slot)
Instruction 1 can safely be moved into the delay slot, because doing so violates no data dependencies. Instruction 2, of course, cannot be moved into the delay slot, because it computes the value of r6 that the branch then examines. More aggressive techniques can analyze instructions from after the branch, identify a safe instruction, and hoist it into the delay slot. A more thorough treatment of branch delay slots and associated techniques can be found in Ref. [11].
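The safety check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Instr record and the helper names can_move_past and pick_delay_slot_filler are hypothetical, and the sketch assumes the last operand of each add is its destination (as the text implies for r6). It looks only at register dependences among instructions before the branch; memory dependences, exceptions, and hoisting from after the branch are ignored.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, source registers."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def can_move_past(candidate: Instr, later: Sequence[Instr]) -> bool:
    """True if candidate can be moved below every instruction in `later`
    without violating a register data dependence."""
    for other in later:
        if candidate.dest is not None and candidate.dest in other.srcs:
            return False  # `other` reads the value candidate produces
        if candidate.dest is not None and candidate.dest == other.dest:
            return False  # both write the same register
        if other.dest is not None and other.dest in candidate.srcs:
            return False  # `other` overwrites a value candidate reads
    return True


def pick_delay_slot_filler(block: Sequence[Instr]) -> Optional[int]:
    """`block` ends with the branch. Return the index of an earlier
    instruction that can move into the delay slot, or None (use a nop)."""
    branch_index = len(block) - 1
    for i in range(branch_index - 1, -1, -1):
        if can_move_past(block[i], block[i + 1:]):
            return i
    return None


# The sequence from the text, with the last operand taken as the destination.
block = [
    Instr("add", "r3", ("r1", "r2")),   # instruction 1
    Instr("add", "r6", ("r4", "r5")),   # instruction 2
    Instr("bnez", None, ("r6",)),       # instruction 3 (the branch)
]
filler = pick_delay_slot_filler(block)
print(filler)  # 0, i.e. instruction 1
```

Instruction 2 is rejected because the branch reads r6, which matches the reasoning in the text; when no candidate qualifies the function returns None and the slot would hold a nop.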
Unfortunately, delay slots have drawbacks. Even the most aggressive techniques still leave some delay slots unfilled, wasting instruction-issue opportunities. Delay slots also expose processor implementation details that might change. Current instruction sets that use delay slots were defined when processors issued instructions in order, one at a time, and pipelines were short. The branch resolution delay was hence just one cycle and the corresponding penalty was only one instruction issue slot, so these instruction sets defined branches to have a single delay slot. Examples include the MIPS [12] and SPARC [13] instruction sets. Later implementations, however, lengthened the pipeline and issued multiple instructions per cycle. The resolution delay then corresponded to many issue slots, even though the number of delay slots was still fixed by the instruction set at one instruction. In addition, with multiple issue, a bundle of instructions being considered for issue in any particular cycle might consist of several instructions following a branch. Exactly one of these, the delay slot, must be issued unconditionally, while the others are control-dependent on the branch and their execution depends on the branch outcome. For these reasons, later instruction sets like Alpha AXP [14] do not include delay slots.
2.3.2.2 Profiling and Compiler Annotation
An alternative software technique is to profile the program: data about how individual branches behave is gathered while the program runs. This data can then be fed to a second compilation pass, which annotates the branches to indicate the predominant direction. The hardware then predicts each branch according to the annotation. For example, a branch that is taken 80% of the time and not taken 20% of the time would be annotated predict-taken. More sophisticated profiling and compiler analysis can even make multiple copies of segments of code so that the branches therein have more consistent behavior, or uncover branches whose behavior is correlated and thus capture some of the same behavior as global-history prediction. This is described by Young and Smith [15].
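As a minimal sketch of the annotation step, the following Python turns per-branch profile counts into predict-taken or predict-not-taken hints and estimates how often those static hints would have been wrong on the profiled run. The branch names, the counts, and the function names annotate_branches and static_misprediction_rate are all hypothetical; a real compiler would encode the hint in the branch instruction rather than in a dictionary.

```python
from typing import Dict, Tuple

# Profile: branch label -> (times taken, times not taken). Illustrative data.
Profile = Dict[str, Tuple[int, int]]


def annotate_branches(profile: Profile) -> Dict[str, str]:
    """Annotate each branch with its predominant direction."""
    annotations = {}
    for branch, (taken, not_taken) in profile.items():
        if taken >= not_taken:
            annotations[branch] = "predict-taken"
        else:
            annotations[branch] = "predict-not-taken"
    return annotations


def static_misprediction_rate(profile: Profile,
                              annotations: Dict[str, str]) -> float:
    """Fraction of profiled branch executions the static hints get wrong."""
    mispredicted = 0
    total = 0
    for branch, (taken, not_taken) in profile.items():
        total += taken + not_taken
        if annotations[branch] == "predict-taken":
            mispredicted += not_taken
        else:
            mispredicted += taken
    return mispredicted / total if total else 0.0


profile = {
    "loop_back": (80, 20),    # taken 80% of the time, as in the text
    "error_check": (1, 99),   # almost never taken
}
annotations = annotate_branches(profile)
rate = static_misprediction_rate(profile, annotations)
```

With these counts, loop_back is annotated predict-taken and the hint is wrong on the 20 executions where the branch falls through, which is the cost of committing to a single direction per branch.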
2.3.2.3 Predication
A third technique is predication or if-conversion, in which the branch is removed and instructions from both the taken and not-taken paths can be executed simultaneously. This eliminates the need to predict the branch, and converts code that was control dependent into code that is data dependent on the branch condition. This defers the dependence to the execution core and permits fetching to continue without risk of rollback due to mispredictions. If done judiciously and execution from the two paths is properly balanced, if-conversion can be done without any performance penalties. Correctness is ensured by modifying the instructions that were once controlled or guarded by the if-converted branch so that they can only commit if the branch condition would have permitted it.
If-conversion is accomplished in one of two ways. In full predication, each individual instruction is guarded by a condition. This predicate value is specified as a third operand register, usually from a dedicated register file. Clearly, this requires instruction-set support in every instruction. In partial predication, on the other hand, there is no support for guarding predicates. Instead, predication is accomplished using conditional move instructions (CMOVs), which can simply be retrofitted onto existing instruction sets. One branch path is executed unconditionally. The results for the other path are computed into temporary registers and then moved into their final destination with CMOVs. The CMOV only completes if the specified condition (the branch condition) holds true. The following code sequence gives an example:
Original code       Full predication      Partial predication
if (cond)           pdef cond, p          add  a, b, x
    x = a + b;      add  a, b, x (p)      cmov a, x (!cond)
else                mov  a, x (!p)        mul  x, x, y
    x = a;          mul  x, x, y
y = x * x;
The pdef instruction defines a predicate; the condition is evaluated and the result placed in p. In all cases, y = x * x gets the correct value of x because y is data dependent on x and can only use x once its final value is assigned. The final value of x, in turn, is either control dependent (original code) or data dependent (predicated code) on cond. Although the partially predicated sequence is shorter in this example, partial predication has two drawbacks: it requires a CMOV instruction for each destination register on the path being predicated, and each destination register requires a temporary register [16].
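To make the equivalence concrete, here is a small Python model of the three versions. It is a sketch, not a description of any real instruction set: the commit helper is a hypothetical stand-in for a guarded instruction whose result is computed but only written when its predicate is true, and the conditional move is guarded by the negation of cond so that x keeps a + b whenever cond holds.

```python
def original(cond, a, b):
    # Control dependent: a branch selects which assignment to x runs.
    if cond:
        x = a + b
    else:
        x = a
    return x * x


def full_predication(cond, a, b):
    regs = {}
    p = bool(cond)                      # pdef cond, p

    def commit(pred, dest, value):
        # Hypothetical helper: `value` has already been computed (the
        # instruction executed), but it is written only if pred is true.
        if pred:
            regs[dest] = value

    commit(p, "x", a + b)               # add a, b, x (p)
    commit(not p, "x", a)               # mov a, x (!p)
    regs["y"] = regs["x"] * regs["x"]   # mul x, x, y
    return regs["y"]


def partial_predication(cond, a, b):
    x = a + b                           # add a, b, x  (unconditional path)
    x = a if not cond else x            # cmov: x <- a only when cond is false
    return x * x                        # mul x, x, y


# All three versions agree for either outcome of the condition.
for cond in (True, False):
    assert original(cond, 2, 3) == full_predication(cond, 2, 3)
    assert original(cond, 2, 3) == partial_predication(cond, 2, 3)
```

In both predicated versions the work for both paths is performed and only the commit is conditional, so the final value of x depends on cond as data rather than through a branch.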
Research by Mahlke et al. [17] has shown that predication substantially reduces both the number of branches executed and the branch misprediction rate. Nevertheless, resource constraints mean that not all branches can be predicated, and so predication still requires the presence of branch prediction. This brings us to the hardware techniques, which can be used alone or in conjunction with the software techniques just described. Note that predication can actually hurt the predictability of the remaining branches. As seen in the next section, many branch prediction algorithms depend on the ability to track the history of earlier branch outcomes. Predication removes branches and hence removes this history from view. Simon et al. [18] explore some ways to rectify this problem.
The literature on predication techniques is a rich body in its own right, and interested readers are encouraged to consult both the architecture and compiler literature.