Kevin Skadron and David Tarjan
2.3.3 Hardware Techniques
2.3.3.1 Static Techniques
The simplest hardware technique is to stall after every branch until its outcome is known. As described above, the consequent delays lead to untenable performance penalties. A better yet still simple technique is to statically predict all branches to be either taken or not taken. A static not-taken policy is the easier of the two, because it corresponds to sequential execution. This eliminates the need for the fetch engine to identify the instructions that are branches or to compute branch targets. Unfortunately, in most programs more than half of branches are taken [19], making the performance of static not-taken usually quite poor. On the other hand, a static taken policy either requires the fetch engine to identify the instructions that are branches and immediately identify their taken targets, or requires some delay while instructions are decoded and the target is computed.
A third policy takes advantage of the fact that backward conditional branches almost always correspond to loops, which tend to iterate multiple times, so these branches are likely to be taken. Non-backward branches, on the other hand, are less biased. Patterson and Hennessy [11] found that 85% of backward branches are taken whereas only 60% of forward branches are taken. This suggests a static policy of backwards-taken, forwards-not-taken, or BTFNT. The problem of computing branch targets remains.
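As a rough illustration of the BTFNT policy, the following sketch predicts taken whenever the branch target does not lie beyond the branch itself. The function name and the example addresses are hypothetical, and treating a branch to its own address as backward is an assumption made here for definiteness.

```python
def predict_btfnt(branch_pc: int, target_pc: int) -> bool:
    """Backwards-taken, forwards-not-taken: True means 'predict taken'.

    Assumption: a branch whose target equals its own address counts as backward.
    """
    return target_pc <= branch_pc


# Hypothetical addresses: a loop-closing branch at 0x1040 that jumps back to
# 0x1000, and a forward skip from 0x1040 to 0x1080.
loop_prediction = predict_btfnt(0x1040, 0x1000)   # backward branch
skip_prediction = predict_btfnt(0x1040, 0x1080)   # forward branch
```

Note that the comparison needs the target address, which is why the text observes that the problem of computing branch targets remains.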
These policies were described by Smith [19] along with the core, bimodal dynamic prediction technique described in Section 2.3.3.4. Another seminal paper from this era is the exploration of branch predictor and branch target address cache (BTAC) design choices by Lee and Smith [20]. Both papers also survey the earliest literature on branch handling.
2.3.3.2 Branch Target Address Caches
Not only static techniques, but in fact all branch-prediction techniques have the problem that on a predicted-taken branch, the branch’s target must be computed. This requires extracting the offset field from the branch instruction and adding it to the PC, tasks that typically cannot be performed until the instruction-decode stage. If this is the case, some stall cycles result, called a branch-taken bubble. A second type of predictor, a branch target predictor, can eliminate this problem. In its simplest form, this is a small on-chip memory in the fetch stage that serves as a table of recently seen branches, the BTAC [21,22]. (The BTAC is also often referred to as a branch target buffer [BTB], but this latter term is too heavily overloaded.) The BTAC is indexed with the branch’s address (in other words, the PC, or program counter, used to fetch the branch). It may be direct-mapped or associative, and tagged or not tagged. Omitting tags reduces cost, but then a BTAC miss cannot be identified, the predicted-taken branch will use the wrong target, and this will not be discovered until the branch resolves. For this reason, BTACs are best tagged.
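To make the role of the tag concrete, here is a minimal sketch of a direct-mapped, tagged BTAC. The class name, the table size, and the assumption of 4-byte instructions (so the low two PC bits are dropped) are illustrative choices, not details taken from any particular processor.

```python
class TaggedBTAC:
    """Direct-mapped, tagged branch target address cache (illustrative sketch)."""

    def __init__(self, index_bits: int = 4):
        self.index_bits = index_bits
        # Each entry is either None (empty) or a (tag, target) pair.
        self.entries = [None] * (1 << index_bits)

    def _split(self, pc: int):
        word = pc >> 2  # assumption: 4-byte, word-aligned instructions
        index = word & ((1 << self.index_bits) - 1)
        tag = word >> self.index_bits
        return index, tag

    def lookup(self, pc: int):
        """Return the cached target, or None when the tag does not match (a miss)."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, pc: int, target: int) -> None:
        """Record the target of a taken branch once it is known."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)


btac = TaggedBTAC(index_bits=4)
btac.update(0x1000, 0x2000)
hit_target = btac.lookup(0x1000)    # same branch: the stored target is returned
alias_target = btac.lookup(0x1040)  # maps to the same entry but has a different tag
```

In this sketch the addresses 0x1000 and 0x1040 share a table entry. Without the tag comparison, the second lookup would silently return 0x2000, which is the wrong-target case the text warns about.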
The dynamic hardware schemes described later in this section maintain tables in which they track state about conditional branch directions. These direction-prediction tables are often indexed using the branch address. Because the BTAC table is also indexed by branch address, it may be convenient with these dynamic schemes to store the direction-prediction information in the BTAC along with each branch’s target. Apart from the convenience of integrating these different sources of information into one table, this confers the advantage that if the BTAC is tagged, any branch prediction state stored in the BTAC is also tagged. Although some processors use this organization, Calder and Grunwald [23] point out that many branches are not taken and hence do not require the BTAC to store a target. Decoupling the direction-prediction state from the target-prediction state therefore permits a smaller BTAC. It also improves flexibility, as some predictors, such as global-history predictors (see Section 2.3.3.5), do not keep a one-to-one mapping between branch addresses and direction-prediction entries.
Instead of a BTAC, the processor might employ a branch target instruction cache, which stores some actual instructions from the branch target rather than merely the target address. This replicates quite a bit of state from the instruction cache, so this organization is rarely seen, although it does appear in the Motorola* PowerPC G4 [24], for example.
The BTAC can also be integrated with the instruction cache. Each cache line can simply store the target address of one or more of its branches in case that branch is predicted taken. Alternatively, the I-cache can implement a next-line predictor [25]. Each cache line now stores the index of the next cache line to be fetched (and also the set if the cache is associative) [26]. If no branches are taken in the current line, the next-line address will be the next sequential address. If there is a taken branch, the next-line address will be the appropriate target address. As branches change their taken/not-taken behavior, this next-line address is updated accordingly. The next-line predictor is, therefore, a combination of the functionality of a BTB and a bimodal predictor (see Section 2.3.3.4). If a more sophisticated direction predictor is present, it overrides the next-line predictor. One motivation for using such an organization is to permit a larger, slower, but more accurate direction predictor that may not be able to be accessed in a single cycle. The Alpha 21264 takes such an approach [27], using as its slower but more accurate direction predictor the hybrid predictor described in Section 2.3.3.6.
2.3.3.3 Pipeline Issues
In the most efficient organization, both the BTAC and the branch direction predictor are consulted during the fetch stage as shown in Fig. 2.22. In this way, the PC can be updated immediately and the processor can fetch from the appropriate location (taken or not-taken) in the next cycle. This avoids introducing pipeline bubbles unless there is a BTAC miss or a branch misprediction.
*Motorola, Inc., Schaumburg, Illinois.
Unfortunately, some problems occur with probing the branch-prediction hardware in the fetch stage. One concern is the branch-predictor and BTAC lookup times. These tables must be fast enough, and hence small enough, to permit the lookup to complete and the PC to be updated within a single cycle. Otherwise, the fetch stage falls behind. Current processors use predictors as big as 32 Kbits, but Jiménez et al. [28] argue that the feasible predictor size for single-cycle access will shrink in the coming years. The reason for this is that even though the feature size on a processor die continues to shrink with Moore’s law [29], electrical RC delays are not shrinking accordingly, and hence wire delays are not shrinking as fast as logic delays. As feature size shrinks, large structures therefore seem to be getting relatively slower.
Another problem is that in a typical organization, the fetch stage cannot determine whether the instructions being fetched from the instruction cache contain any branches; that information must wait until the instructions are decoded. Several solutions are available. The first solution is for the instructions to be pre-decoded before they are installed into the instruction cache to indicate the instructions that are branches. The predictor structures can then be indexed using the actual addresses of the branches. Note that this means either that the predictor must be multiported to cope with fetch blocks that contain more than one branch, or the predictor can only predict one branch at a time. This is not necessarily a major restriction, since if the predicted result is not-taken, the remaining instructions in the fetch block after the branch are still valid and can still be passed on to decode. The second solution is for the branch predictor to just predict fetch-block successors instead of specific branches. In this case, the predictor simply predicts whether the next fetch block will be sequential (not-taken) or nonsequential (taken, in which case the target supplied by the BTAC is used). This is slightly better than the first choice, because it eliminates the need for pre-decode bits and can fetch past more than one not-taken branch in a fetch block. It does require the decode stage to identify how each branch in a fetch block was implicitly predicted. The third solution is for the BTAC and branch predictor to be indexed with the address of every instruction in the fetch block. Hits in the BTAC indicate the instructions that are branches, and only the corresponding direction predictions are then used. The problem with this approach is that it requires as many ports into the BTAC and branch-prediction structures as there are instructions in the fetch block.
These are the basic choices, although many variations and improvements have been proposed, e.g., [26,30–32].
[Figure 2.22, diagram labels only: PC; bpred; BTAC; I-cache; T/NT; BTB hit?; predicted-taken target; computed-taken target, from decode; + 4; mux; fetched instructions, to decode.]
2.3.3.4 Bimodal Prediction
The simplest dynamic technique, introduced by Smith [19], is to maintain a small, on-chip memory with a table of saturating counters that is indexed by branch address. The saturating counters, typically two bits each, simply remember the predominant direction of previous outcomes for that branch. A schematic for a bimodal predictor appears in Fig. 2.23. As mentioned, the table, usually called the pattern history table (PHT), is logically a distinct entity but might actually be implemented as a unified structure with the BTAC. This prediction scheme goes by different names, often simply two-bit prediction, but recent literature has often referred to it as bimodal prediction to distinguish it from other more sophisticated schemes that also use two-bit saturating counters.
Each time a branch resolves, its corresponding counter is incremented if the branch was taken, and decremented if not. Incrementing or decrementing has no effect if the counter is already at its maximum or minimum value, hence the term saturating counter and the name bimodal. In the simplest case of a one-bit counter, the only possibilities are values of 0 and 1 and the predictor simply remembers the last outcome for each branch. In the case of two-bit counters, values of 00 and 01 correspond to strongly not-taken and weakly not-taken, and values of 10 and 11 correspond to weakly taken and strongly taken. Two-bit counters give better performance because they exhibit some hysteresis that makes them less sensitive to infrequent occurrences of outcomes in the nondominant direction. A state-transition diagram for the most common two-bit counter configuration appears in Fig. 2.24.
Other configurations [20,33] are possible, however; for example, regardless of its current state, the counter might reset to 00 on a not-taken branch.
As an example of how two-bit counters improve over one-bit counters, recall that a loop branch will normally be taken. When the loop exits, a one-bit counter will only remember that most recent direction (not taken), even though the predominant direction is taken. When this same loop is encountered again, the loop branch will once again be taken until the loop exits, but the first prediction with a one-bit counter will be not-taken. A two-bit counter, on the other hand, only changes its state from 11 to 10 upon loop exit, and still predicts taken when it returns to the loop, thus eliminating a misprediction compared to the one-bit counter.
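The loop example can be checked with a small simulation. The sketch below is only illustrative: the function name, the loop shape (nine taken iterations followed by one exit, run three times), and the choice to start each counter at its maximum value are assumptions made for this example.

```python
def count_mispredictions(outcomes, bits: int) -> int:
    """Count mispredictions of a single saturating counter of the given width."""
    max_value = (1 << bits) - 1
    threshold = 1 << (bits - 1)   # predict taken when counter >= threshold
    counter = max_value           # assumption: start in the strongest taken state
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            misses += 1
        # Saturating update: clamp at 0 and at max_value.
        counter = min(counter + 1, max_value) if taken else max(counter - 1, 0)
    return misses


# A loop branch that is taken nine times and then falls through on exit;
# the whole loop is encountered three times.
loop_run = [True] * 9 + [False]
outcomes = loop_run * 3

one_bit_misses = count_mispredictions(outcomes, bits=1)   # 5
two_bit_misses = count_mispredictions(outcomes, bits=2)   # 3
```

Both counters mispredict each loop exit, but the one-bit counter additionally mispredicts the first iteration each time the loop is re-entered, which accounts for the two extra mispredictions.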
FIGURE 2.23 A schematic for a bimodal predictor. Baddr is the branch address or PC, which is used to index the PHT (pattern history table), select the corresponding two-bit counter, and make a prediction of taken or not-taken.
[Figure 2.24, diagram labels only: states 00, 01, 10, and 11, with Taken and Not-taken transitions.]
Wider counters have been considered [19] but confer little benefit and take longer to adjust to a change in a branch’s behavior.
The size of the PHT is of course not infinite, so the ideal of one entry per branch may not be realized. The table is indexed by the branch address modulo the table size, so some branches may collide. If these branches are biased in the same direction this is harmless, but if not, they will interfere with each other’s attempts to update the counter, and these destructive PHT conflicts will lead to mispredictions. Sources of mispredictions are discussed in Section 2.3.4.
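A destructive PHT conflict can be reproduced with a short sketch of a bimodal predictor. The class and function names, the two branch addresses, the weakly-taken initial state, and the assumption of 4-byte instructions are all hypothetical choices for illustration.

```python
class BimodalPredictor:
    """PHT of two-bit saturating counters indexed by branch address modulo table size."""

    def __init__(self, entries: int):
        self.entries = entries
        self.pht = [2] * entries   # assumption: every counter starts weakly taken (10)

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries   # assumption: 4-byte instructions

    def predict(self, pc: int) -> bool:
        return self.pht[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)


def run(entries: int, rounds: int = 10) -> int:
    """Alternate an always-taken branch with a never-taken one; count mispredictions."""
    predictor = BimodalPredictor(entries)
    always_taken_pc, never_taken_pc = 0x1000, 0x1010   # hypothetical addresses
    misses = 0
    for _ in range(rounds):
        for pc, taken in ((always_taken_pc, True), (never_taken_pc, False)):
            if predictor.predict(pc) != taken:
                misses += 1
            predictor.update(pc, taken)
    return misses


misses_with_conflict = run(entries=4)      # both branches share one counter: 10
misses_without_conflict = run(entries=16)  # separate counters: 1 (warm-up only)
```

With four entries the two oppositely biased branches keep pulling the shared counter back and forth, so the never-taken branch is mispredicted every time. With sixteen entries each branch has its own counter and only the initial warm-up misprediction remains.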
2.3.3.5 Two-Level Prediction
Bimodal prediction can be improved in two ways, both of which explicitly track earlier branch outcomes and were introduced by Yeh and Patt. Local-history prediction [34] maintains a table of per-branch histories. Instead of tracking each branch’s predominant direction, this branch history table (BHT) tracks explicit history in order to detect patterns. For example, a local history can detect patterns like TNTN . . . that confound simple saturating counters. The predictor still keeps a PHT of two-bit counters, but these are now indexed using the local history pattern, and the counters now learn outcomes for each history pattern. A schematic of a local history predictor appears in Fig. 2.25. One apparent problem with local-history prediction is that it would seem to require two serial lookups: first the BHT to obtain the history pattern, then the PHT to obtain the actual prediction.
This problem is solved by caching the most recent PHT value for a given BHT entry as an extra field in the BHT. The next time that BHT entry is indexed, it provides both the current history and the cached prediction. Fetching proceeds with that cached prediction while the PHT is probed with the history pattern. The PHT result overrides the cached result, so if the PHT disagrees with the cached prediction, the pipeline is flushed from the point of the mispredicted branch.
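The way a local history captures an alternating TNTN pattern can be sketched as follows. This is a simplified model under stated assumptions: the class name, the table sizes, the weakly-taken initial counters, and the 4-byte instruction size are hypothetical, and the cached-prediction field described above is omitted, so the sketch models only the BHT-then-PHT lookup.

```python
class LocalHistoryPredictor:
    """Per-branch history table (BHT) whose history pattern indexes a shared PHT."""

    def __init__(self, history_bits: int = 4, bht_entries: int = 64):
        self.history_mask = (1 << history_bits) - 1
        self.bht_entries = bht_entries
        self.bht = [0] * bht_entries            # one history register per entry
        self.pht = [2] * (1 << history_bits)    # two-bit counters, weakly taken

    def _bht_index(self, pc: int) -> int:
        return (pc >> 2) % self.bht_entries     # assumption: 4-byte instructions

    def predict(self, pc: int) -> bool:
        history = self.bht[self._bht_index(pc)]
        return self.pht[history] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._bht_index(pc)
        history = self.bht[i]
        counter = self.pht[history]
        self.pht[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        # Shift the newest outcome into this branch's history.
        self.bht[i] = ((history << 1) | int(taken)) & self.history_mask


predictor = LocalHistoryPredictor()
branch_pc = 0x1000                       # hypothetical branch address
outcomes = [True, False] * 20            # the TNTN... pattern
miss_positions = []
for step, taken in enumerate(outcomes):
    if predictor.predict(branch_pc) != taken:
        miss_positions.append(step)
    predictor.update(branch_pc, taken)

total_misses = len(miss_positions)                       # 2, both during warm-up
late_misses = sum(1 for s in miss_positions if s >= 20)  # 0
```

Once the counters for the two recurring history patterns have been trained, every outcome is predicted correctly, whereas a single two-bit counter would keep mispredicting the not-taken half of the pattern.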
Global-history prediction [35], on the other hand, keeps a single history register, the global branch history register or GBHR, into which all branch outcomes are shifted, as seen in Fig. 2.26. It might seem that intermingling outcomes from different branches simply produces noise, but instead global-history prediction is extremely effective. The reason is that global history exposes correlation among branches (and hence these predictors are also called correlating predictors).
Consider the following sequence of code:
B1: if (x) . . .
B2: if (y) . . .
    z = x && y;
B3: if (z) . . .
Even if B1 and B2 are entirely unpredictable because x and y have very random behavior, B3 can be predicted with 100% accuracy if the outcomes of B1 and B2 are known, because the outcome of B3 is entirely correlated with the outcomes of B1 and B2. Global history is an admittedly crude way to expose this sort of correlation, because the GBHR also contains outcomes from other branches that provide no useful information. Yet, as Section 2.3.5 shows, global history is quite effective, and Evers et al. [36] have shown that many programs contain substantial degrees of correlated branch behavior. Unfortunately, no one has come up with a practical hardware technique for exposing correlation while avoiding the noise that unrelated branches introduce into the GBHR.
FIGURE 2.25 A schematic for a PAs local-history predictor. The branch address is used to index the table of per-branch histories (the BHT), select the appropriate history, and then this history is used to index the PHT.
FIGURE 2.26 A schematic for a GAs global-history predictor. The global history of recent branch outcomes, contained in the global branch history register (GBHR), is used to index the PHT.
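The B1/B2/B3 correlation can be demonstrated with a small simulation. The sketch below makes several assumptions that are not part of the original discussion: it uses a two-bit GBHR, keeps a separate set of counters per branch (so that B1 and B2 do not disturb the counters used by B3), warms the predictor up once with each combination of x and y, and then draws x and y from a seeded pseudo-random generator. All names are hypothetical.

```python
import random

HISTORY_BITS = 2
HISTORY_MASK = (1 << HISTORY_BITS) - 1


class GlobalHistoryPredictor:
    """Global-history predictor with per-branch counters (illustrative sketch)."""

    def __init__(self):
        self.gbhr = 0    # global branch history register
        self.pht = {}    # (branch label, history) -> two-bit counter

    def predict(self, branch: str) -> bool:
        return self.pht.get((branch, self.gbhr), 2) >= 2   # default: weakly taken

    def update(self, branch: str, taken: bool) -> None:
        key = (branch, self.gbhr)
        counter = self.pht.get(key, 2)
        self.pht[key] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        # Every branch outcome is shifted into the single global register.
        self.gbhr = ((self.gbhr << 1) | int(taken)) & HISTORY_MASK


def run_sequence(predictor: GlobalHistoryPredictor, x: bool, y: bool) -> bool:
    """Execute B1, B2, B3 once; return True if B3 was predicted correctly."""
    b3_correct = False
    for branch, outcome in (("B1", x), ("B2", y), ("B3", x and y)):
        prediction = predictor.predict(branch)
        if branch == "B3":
            b3_correct = prediction == outcome
        predictor.update(branch, outcome)
    return b3_correct


predictor = GlobalHistoryPredictor()

# Warm-up: show the predictor each (x, y) combination once.
for x in (False, True):
    for y in (False, True):
        run_sequence(predictor, x, y)

# Afterwards x and y are pseudo-random, yet B3 is never mispredicted,
# because the two most recent GBHR bits are exactly the outcomes of B1 and B2.
rng = random.Random(0)
b3_misses = 0
for _ in range(1000):
    x, y = rng.random() < 0.5, rng.random() < 0.5
    if not run_sequence(predictor, x, y):
        b3_misses += 1
```

If all three branches instead shared one small history-indexed table, the random outcomes of B1 and B2 could land in the entries B3 relies on, which is the aliasing problem that motivates mixing address bits into the index.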
Both the local-history and global-history predictors described above have the problem that different branches may see the same history. All branches that see the same history will map to the same PHT entry. Especially with global prediction, equivalent history does not always mean the branches will behave the same way. To reduce the consequent destructive PHT conflicts, Pan et al. [37] point out that bits from the branch address can be combined with the history bits in order to provide some degree of anti-aliasing; see Fig. 2.27 for an example. The simplest technique is to concatenate the two bit sources. For N bits of history and M bits of branch address, this creates a configuration where each M-bit address pattern has its own 2^N-entry PHT.
For a fixed table size and hence a fixed number of bits in the index, this necessitates a reduction in the number of history bits, so a balance must be found between the added prediction capability provided by history bits and the anti-aliasing capability provided by address bits. This balance is sensitive to the table size. In a study of the SPECint95 benchmarks [38], Skadron et al. [39] found that as a general rule of thumb, both global- and local-history predictors should use at least 6–7 bits of branch address, regardless of predictor size. Predictors with more aggressive anti-aliasing techniques, e.g., the bi-mode predictor of Lee et al. [40], will need fewer address bits.
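The concatenated index can be written out explicitly. In the sketch below, the function name, the placement of the address bits in the high-order positions, and the dropping of the low two PC bits (assuming 4-byte instructions) are illustrative assumptions; the text specifies only that N history bits and M address bits are concatenated.

```python
def concatenated_index(branch_pc: int, history: int,
                       history_bits: int, address_bits: int) -> int:
    """Form a PHT index from M address bits (high) and N history bits (low)."""
    address_part = (branch_pc >> 2) & ((1 << address_bits) - 1)
    history_part = history & ((1 << history_bits) - 1)
    return (address_part << history_bits) | history_part


# With N = 4 history bits and M = 2 address bits the index is 6 bits wide, so
# the table holds 2**6 = 64 counters: four sub-tables of 2**4 = 16 entries each.
N, M = 4, 2
table_size = 1 << (N + M)

# Two hypothetical branches that see the same history 1010 map to different entries.
index_a = concatenated_index(0x1004, 0b1010, N, M)   # 26
index_b = concatenated_index(0x1008, 0b1010, N, M)   # 42
```

For a fixed table size, every bit given to the address part is a bit taken from the history part, which is the trade-off between prediction capability and anti-aliasing described above.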
To classify the different possible two-level predictor organizations, Yeh and Patt [35,41] developed a naming scheme that uses three letters to characterize the different organizational choices. The first letter, G, P, or S, indicates the type of history: global, per branch (i.e., local), or per branch set. The last choice refers to a predictor that explicitly allocates groups of branches to particular BHT entries, and is only feasible with extensive profiling or compiler support and hence has received little study. Skadron et al. [39] added a fourth type, M, to this naming scheme to describe predictors that track a combination of global and local history. The second letter, A or S, indicates whether the PHT is adaptive, using a finite state machine based on saturating counters, or fixed, using statically assigned directions (a profiling pass might determine the best PHT value for each entry); almost all predictors proposed or under study, however, are A (adaptive). The third letter, g, s, or p, indicates the PHT organization. The PHT might be indexed purely by history (g); or indexed using some concatenated branch address bits, making it set-associative (s); or the predictor might have a separate PHT for each branch (p, for per-branch). This last choice eliminates aliasing among branches but is prohibitively large for all but small history sizes, and is therefore mainly of theoretical interest. A pure global-history predictor like that in Fig. 2.26 is, therefore, a GAg predictor and a pure local-history predictor like that in Fig. 2.25 is a PAg predictor. If either of these concatenates some address bits into the index, like the global-history predictor in Fig. 2.27, they