July 1993
UILU-ENG-93-2229
CRHC-93-16
Center for Reliable and High-Performance Computing
APPLICATION OF
COMPILER-ASSISTED
MULTIPLE INSTRUCTION
ROLLBACK RECOVERY TO
SPECULATIVE EXECUTION
N.J. Alewine
W. K. Fuchs
W.-M. Hwu
Coordinated Science Laboratory
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
UNCLASSIFIED
J a
xj_
{EduRifv C
lassification of this pageREPORT DOCUMENTATION PAGE
1a. REPORT SECURITY CLASSIFICATIONUnclassified _________
1b. RESTRICTIVE MARKINGS
None
2a. SECURITY CLASSIFICATION AUTHORITY
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE
3. DISTRIBUTION/AVAILABILITY OF REPORT
Approved for public release;
distribution unlimited
4. PERFORMING ORGANIZATION REPORT NUMBER(S)
UILU-ENG-93-2229 CRHC-93-16
5. MONITORING ORGANIZATION REPORT NUMBER(S)
6a. NAME OF PERFORMING ORGANIZATION
Coordinated Science Lab
University of Illinois
6b. OFFICE SYMBOL
(If applicable)
N/A
7a. NAME OF MONITORING ORGANIZATION
National Aeronautics and Space Administration
6c ADDRESS (G'ty, State, and ZIP Code)
1101 W. Springfield Avenue
Urbana, IL 61801
7b. ADDRESS (City, State, and ZIP Code)
Moffitt Field, CA
8a. NAME OF FUNDING/SPONSORING ORGANIZATION
7a
8b. OFFICE SYMBOL
(If applicable)
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER
8c. ADDRESS (City, State, and ZIP Code) 10. SOURCE OF FUNDING NUMBERS
PROGRAM
PROJECT
TASK
7b
ELEMENT NO.
NO.
NO.
WORK UNIT ACCESSION NO.
11. TITLE (Include Security Classification)
Application of Compiler-Assisted Multiple Instruction Rollback Recovery to Speculative
Execution
12. PERSONAL AUTHOR(S)
ALEWINE, N. J, W. K. Fuchs, and W.-M. Hwu
13a. TYPE OF REPORT
Technical__
13b. TIME COVERED FROM_________ _ TO
14. DATE OF REPORT (fear, Month, Day)
1
Tnlv
19---15. PAGE COUNT
18
16. SUPPLEMENTARY NOTATION
I
17
COSATI CODES 18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number) FIELD GROUP SUB-GROUP r o l l b a c k r e c o v e r y , c o m p i l e r - a s s i s t e d m u l t i p l e i n s t r u c t i o n ,t r a n s i e n t p r o c e s s o r f a i l u r e s , i n s t r u c t i o n a l l e v e l p a r a i -Jj3 .1 ÌS 2 L _ 19. ABSTRACT (Continue on reverse if necessary and identify by block number)
Speculative execution is a method to increase instruction level parallelism which can be exploited by both super-scalar and
V L I W
architectures. The key to a successful general speculation strategy is a repair mechanism to handle mispredicted branches and accurate reporting of exceptions for speculated instrucitons. Multiple instruction rollback is a technique developed for recovery from transient processor failure. Many of the difficulties encountered during recovery from branch misprediction or from instruction re-execution due to exception in a speculative execution arChitecute are similar to those encountered during multiple instruction rollback.This paper investigates the applicability of a recently developed compiler-assisted multiple instruciton rollback scheme to aid in speculative exectuion repair. Extensions to the ocmpiler-assisted scheme to support branch and exception repair are presented along with performance measurements across ten application programs.
20. DISTRIBUTION /AVAILABILITY OF ABSTRACT
| 0UNCLASSIFIED/UNLIMITED □ SAME AS RPT. □ DTIC USERS
21. ABSTRACT SECURITY CLASSIFICAI U n c l a s s i f i e d
ION
I 22a. NAME OF RESPONSIBLE INDIVIDUAL 22b. TELEPHONE (Include Area Code) 22c. OFFICE SYMBOL
DD FORM 1473,
84 MAR 83 APR edition may be used until exhausted.All other editions are obsolete. SECURITY CLASSIFICATION OF THIS PAGE
UNCLASSIFIED
UNCLASSIFIED____________
ICCUWITY C L A M I FI CATION O F THII f»AOE
TO APPEAR: WORKSHOP OH HARDWARE AND SOFTWARE ARCHITECTURES FOR FAULT
TOLERANCE: PERSPECTIVES AND TOWARDS A SYNTHESIS
JUNE 14-16, 1993
LE MONT SAINT-MICHEL, FRANCE
APPLICATION OF COMPILER-ASSISTED MULTIPLE
INSTRUCTION ROLLBACK RECOVERY TO
SPECULATIVE EXECUTION
N. J. Alewine* W . K. Fuchs, W .-M . Hwu
Center for Reliable and High-Performance Computing
Coordinated Science Laboratory
University of Illinois at Urban a-Champaign
Abstract
Speculative execution is a method to increase in struction level parallelism which can be exploited by both super-scalar and VLIW architectures. The key to a successful general speculation strategy is a repair mechanism to handle mispredicted branches and ac curate reporting o f exceptions for speculated instruc tions. Multiple instruction rollback is a technique developed for recovery from transient processor fail ures. Many o f the difficulties encountered during re covery from branch misprediction or from instruction re-execution due to exceptions in a speculative exe cution architecture are similar to those encountered during multiple instruction rollback.
This paper investigates the applicability o f a re cently developed compiler-assisted multiple instruc tion rollback scheme to aid in speculative execution repair. Extensions to the compiler-assisted scheme to support branch and exception repair are presented along with performance measurements across ten ap plication programs.
1
Introduction
Super-scalar and VLIW architectures have been shown effective in exploiting instruction level paral lelism (ILP) present in a given application [1-3]. Cre ating additional ILP in applications has been the sub ject o f study in recent years [4-6]. Code motion within
a basic block is insufficient to unlock the full potential o f super-scalar and VLIW processors with issue rates
* International Business M achines C orporation, B oca R aton, FI.
1 This research was supported in part by the N ational Aero nautics and Space Adm inistration (N A SA) under grant NASA NAG 1-613, in cooperation with the Illinois Com puter Labora tory for Aerospace Systems and Software (IC LA SS), and in part by the Departm ent o f the Navy and m anaged by the O ffice o f the C hief o f Naval Research under Contract N00014-91-J-1283.
greater than two [3]. Given a trace o f the most fre quently executed basic blocks, limited code movement across block boundaries can create additional ILP at the expense o f requiring complex compensation code to ensure program correctness [7]. Combining multiple basic blocks into superblocks permits code movement within the superblock without the compensation code required in standard trace scheduling [3].
General upward and downward code movement across trace entry points (joins) and general down ward code motion across trace exit points (branches, or forks) is permitted without the need for special hardware support [7]. Sophisticated hardware support is required, however, for unrestricted upward code mo tion across a branch boundary. Such code motion is referred to as speculative execution and has been shown to substantially enhance performance over non- speculated architectures [8-10]. This paper focuses on the support hardware for speculative execution, which ensures correct operation in the presence o f except ing speculated instructions (referred to as exception repair) and o f mispredicted branches (referred to as branch repair). It is shown that data hazards which re sult from exception and branch repair are very similar to data hazards that result from multiple instruction rollback, and that techniques used to resolve rollback data hazards are applicable to exception and branch repair.
The remainder o f the paper is organized as follows. Section 2 gives a brief overview o f a compiler-assisted multiple instruction rollback (M IR) scheme to be used as a base for application to speculative execution re pair (SER). Section 3 describes speculative execution and the requirements for exception repair and branch repaur. Section 4 introduces a schedule reconstruc
tion scheme and extends the compiler-assisted rollback
scheme. Section 5 describes read buffer flush costs auad Section
6
presents performance impacts which resultfrom read buffer flushes.
2 C om p iler-A ssisted M u ltip le Instruc
tion R ollback Recovery
2.1
Hazard Classification
Within a general error model, data hazards result ing from instruction retry are o f two types [11-13]. On-path hazards are those encountered when the in struction path after rollback is the same as the initial path and branch hazards are those encountered when the instruction path after rollback is different than the initial path. As shown in Figure 1, rx represents an on-path hazard where during the initial instruction
se-Figure 1: On-path data hazard.
quence rx is written and after rollback is read prior to being re-written. As shown in Figure 2, ry represents a branch hazard where the initial instruction sequence writes ry and after rollback ry is read prior to being re written however this time not along the original path.
2.2
On-path Hazard Resolution Using a
Read Buffer
Hardware support consisting o f a read buffer o f size
2
N , as shown in Figure 3, has been shown to be effective in resolving on-path hazards [11-13]. The read buffer maintains a window o f register read history. If am on-path hazard is present, then prior to writing over the old value o f the hazard register, a read of that value must have taken place within the last N instructions (else after rollback o f < N , a read o f the hazard register would not occur before a redefinition). Key to this scenario is the fact that the original path is repeated. Branch hazard resolution is left to the
Figure 2: Branch data hazard.
Figure 3: Read buffer.
compiler. At rollback, the reawl buffer is flushed back to the general purpose register file (G P R F ), restoring the register file to a restartable state. The primary advantage o f the read buffer is that it does not require an additional reawl port as with a history buffer, repli cation o f the GPRF as with the future file, or bypass logic as with the reorder buffer or delayed write buffer [14,15].
2.3
Branch Hazard Removal Compiler
Transformations
Compiler transformations have been shown to be effective in resolving branch hazards [11,12]. Branch hazard resolution occurs at three levels;
1
) pseudo code, 2) machine code, and 3) post-pass. Resolution at the pseudo code level would be accomplished by renaming the pseudo register ry o f instruction /,•(Fig-ure 2) to rz . Node splitting, loop expansion and loop protection transformations aid in breaking pseudo reg ister equivalence relationships so that renaming can be performed. After the pseudo registers are mapped to physical registers, some branch hazards could re appear. This is prevented at the machine code level by adding hazard constraints to live range constraints prior to register allocation. Branch hazards that re main after the first two levels can be resolved by either creating a ‘‘covering’’ on-path hazard or by inserting nop (no operation) instructions ahead o f the hazard instruction until the rollback is guaranteed to be un der the branch. Given the branch hazard o f Figure
2
, a covering on-path hazard is created by inserting an MOV ry, ry instruction immediately before the in struction in which ry is defined. This guarantees that the old value o f ry is loaded into the read buffer and is available to restore the register file during rollback.• •
Figure 5: Speculated instruction traps.
3.1
Branch Repair
3
Speculative Execution
Figures 4 and 5 illustrate the two basic problems which are encountered when attempting upward code motion across a branch. As shown in Figure 4, if the
'* - r* o v branch taken ---1 rI - r2 *
3 2
Figure 4: r\ in live_out o f taken path.
speculated instruction (i.e., an instruction moved up ward past one or more branches) modifies the system state, and due to the branch outcome the speculated instruction should not have been executed, program correctness could be affected. Figure 5 illustrates that if the speculated instruction causes an exception, and again due to the branch outcome, the excepting in struction should not have been executed, program per formance or even program correctness could be af fected.
Figure
6
shows an original instruction schedule and a new schedule after speculation. Instructions d, *, and / have been speculated above branches c andg from their respective fall-through paths
.2
Specu lated instructions Me marked “ (s).” The motivation for such a schedule might be to hide the load delay o f the speculated instructions or to allow more time for the operands o f the branch instructions to become available. If c commits to the taken path (i.e., it is mispredicted by the static scheduler), some changes to the system state that have resulted from the execu tion o f d, *, and / , may have to be undone. No update is required for the PC; execution simply begins at j. If instead, c commits to the fall-through path but g commits to the taken path, then only i ’s changes to the system state may have to be undone.Not all changes to the system state are equally im portant. If for example, d writes to register rx and
rx £ liv e.in (j) (i.e., along the path starting at j , a
redefinition o f rx will be encountered prior to a use of
rx [16]), then the original value o f rx does not have
to be restored. Inconsistencies to the system state as a result o f mispredicted branches exhibit similari ties to branch hazards in multiple instruction rollback [11,12]. Given this similarity between branch haz ards due to instruction rollback and branch hazards due to speculative execution, compiler-driven data flow manipulations, similar to those developed to elim inate branch hazards for MIR [11,12], cam be used to resolve branch hazards that result from speculation. Such compiler transformations have been proposed for
3For this exam ple it is assumed that the fall-through paths are the m ost likely outcom e o f the branch decisions at c and g.
a a RB_c: d b « d e 0 - * - j « i f d b i e («)f jum pLl f 0 — j ■ 1 : 0 * k e RB_g: h h L 1 :[ ? J -^ k i i h jump L2 2: L2:
Original Speculated Recovery
Schedule Schedule
m
a
^
^--
m
Blocks
Figure
6
: Branch repair.branch misprediction handling [9]. Since re-execution o f speculated instructions is not required for branch misprediction, compiler resolution o f branch hazards becomes a sufficient branch repair technique.
3.2
Exception Repair
Figure
6
also demonstrates the handling o f spec ulated trapping instructions. If d is a trapping in struction and an exception occurred during its execu tion, handling o f the exception must be delayed until c commits so that changes to the system state are mini mized, and in some cases to ensure that repair is pos sible in the event that c is mispredicted. If c commits to the taken path, the exception is ignored and d is haul died like any other speculated instruction given a branch mispredict. If c was correctly predicted, three exception repair strategies are possible. The first is to undo the effects o f only those instructions speculated above c (i.e., d, i, and / ) and then branch to a recov ery block R B .c [10] as shown in Figure6
. The address of the recovery block can be obtained by using the PC value o f the excepting instruction as am index into a hash table. This strategy ensures precise interrupts [14,17] relative to the nonspeculated schedule but not relative to the original schedule. Recovery blocks can cause significamt code growth [10]. The second strat egy undoes the effects o f all instructions subsequent tod (i.e., i,
6
, and / ) , handles the exception, amd resumes execution at instruction i [9]. This latter strategy pro vides restartable states and does not require recovery blocks. A third exception repair strategy undoes the effects o f only those subsequent instructions that are speculated above c (i.e., only i amd / ) , handles the exception, and resumes execution at instruction «, how ever, this time only executing speculated instructions until c is reaudied. The improved efficiency o f strategy
3
over that o f strategy2
comes at the cost o f slightly more complex exception repair hardware.When a branch commits and is mispredicted, the exception repair hardware must perform three func tions:
1
) determine whether am exception has occurred during the execution o f a speculated instruction,2
) if an exception has occurred, determine the PC value o f the excepting instruction, amd 3) determine which changes to the system state must be undone. Func tions1
and2
aure similar to error detection amd location in multiple instruction rollback. Function 3 is similar to on-path hazard resolution in multiple instruction rollback [11,12,18]. On-path hazards assume that af ter rollback the initiad instruction sequence from the faulty instruction to the instruction where the error was detected is repeated.Figure 7 illustrates the speculation o f a group of
:
Figure 7: Exception repadr.
instructions amd re-execution strategy 3. The load in struction traps, but the exception is not hamdled un til the bramch instruction commits to the fall-through path. Control is then returned to the trapping instruc tion. This scenario is identicad to multiple instruction rollbaudc where am error occurs during the loaui instruc tion and is detected during the bramch instruction. For this examiple, only r*i must be restored during rollback since amd r
5
will be rewritten prior to use during re-execution. Figure 7 shows that exception repairhazards in speculative execution are the same as on- path hazards in multiple instruction rollback, and a read buffer as described in Section 2 can be used to resolve these hazards. The depth o f the read buffer is the maximum distance from h to In along any back wards walk3, where Jn is a trapping instruction that was speculated above branch instruction h
-3.3
Schedule Reconstruction
Assumed in Figures
6
and 7 are mechanisms to identify speculative instructions, determine the PC value o f excepting speculated instructions, and deter mine how many branches a given instruction has been speculated above. An example o f the latter case is shown in Figure6
where instructions d, t, and / , are undone if c is mispredicted; however, only i must be undone if g is mispredicted.If the hardware had access to the original code schedule, the design o f these mechanisms would be straightforward. Unfortunately, static scheduling re orders instructions at compile-time and information as to the original code schedule is lost. To enable recov ery from mispredicted branches and proper handling of speculated exceptions, some information relative to the original instruction order must be present in the compiler-emitted instructions. This will be referred to as schedule reconstruction.
By limiting the flexibility o f the scheduler, less in formation about the original schedule is required. For example, if speculation is limited to one level only (i.e., above a single branch), a single bit in the opcode fleld is sufficient to indicate that the instruction has been moved above the next branch [
8
]. The hardware would then know exactly which instruction effects to undo (i.e., the ones with this bit set). Also, remov ing branch hazards directly with the compiler permits general speculation with no schedule reconstruction for branch repair [9].4
Im plicit Index Schedule Reconstruc
tion
Implicit index scheduling supports general specula
tion o f regular and trapping instructions. The scheme was inspired by the handling o f stores in the sentinel scheduling scheme [9] and was designed to exploit the unique properties o f the read buffer hardware design described in Section 2. Schedule reconstruction is ac complished by marking each instruction speculated or
3 A walk is a sequence o f edge traversals in a graph where the edges visited can be repeated [19].
nonspeculated by including a bit in the opcode field,
and using this encoding to maintain an operand his tory o f speculated instructions in a FIFO queue called a speculation read buffer (SRB). The SRB operates similar to a read buffer with additional provisions for exception handling.
4.1
Exception Repair Using a Speculation
Read Buffer
Figure
8
shows an original code schedule and two speculative schedules, along with the contents o f the SRB at the time branches Ie and Ig commit. Instruc tions Id and If have been speculated above branch instruction / c, and has been speculated above bothIg and Ic. The encoding o f speculated instructions in
forms the hardware that the source operands are to be saved in the SRB, along with the source operand values, corresponding register addresses, and the PC o f the speculated instruction.
Speculated instructions execute normally unless they trap. If a speculated instruction traps, the ex ception bit in the SRB which corresponds to the trap ping instruction is set and program execution contin ues. Subsequent instructions that use the result o f the trapping instruction are allowed to execute normally.
A chk^except(k) instruction is placed in the home block o f each speculated instruction. Only one
chk-except(k) instruction is required for a home block.
As the name implies, chL.except(k) checks for pend ing exceptions. The command cam simultaneously in terrogate each location in the SRB by utilizing the bit field k. As shown in schedule 1 o f Figure
8
,chk.except(0011U) in J' checks exceptions for instruc
tions Id and Id. If a checked exception bit is set, the SRB is flushed in reverse order, restoring the appropri ate register and PC values. Execution can then begin with the excepting instruction.
Figure
8
illustrates several on-path hazards which are resolved by the SRB. In schedule1
, if /,• traps and the branch /<• commits to the taken path, U has cor rupted ri and If has corrupted r7
. Flushing the SRB up through /, restores both registers to their values prior to the initial execution o f Ii. Note that register re is also corrupted but not restored by the SRB, since after rollback re will be rewritten with a correct value before the corrupted value is used.As an alternative to checking for exceptions in each home block, the exception could be handled when the exception bit reaches the bottom o f the SRB. This is similar to the reorder buffer used in dynamic schedul ing [14] and eliminates the cost o f the chk-except(k) command, however, increases the exception handling
Original Schedule Speculated Schedule 1 Speculated Schecule 2 rl = r2 * r3
V
rL lr2*
'L
r3 = r4 + r5h-bne rJt r3, I.
V
r8 ri a r 7 * r8VI
rs “ r> * 4¥
V 0 “ r2 * r3 r7 = r7 + 4 bne r8, v7, 1^ rt - r6 * 4 r , = M E M (r, ) 1 c U o s r h d 2N Ic: bne rJt r3, Ij I’ : chk_except(001 111) V r8 rs + 4 V bne V r7> I*: chk_except(l 10000) lk- r6 r6 + 4 Except bit — 1 I Reg. No. i t . - 0 ic . valuti r?) 7 V value(rg) 8 V valuer7) 7 I.- - 0 1/ valuer2) 2}
}
}
-I’ : chk_except(l 10011) ri + 4 lg: bne rg , r? , lk I*: chk_except(001100) V r6 = r6 * 4 Except b it— I j R e g .N o .-i If e
1 c “ o S r h d 2N i t . - 0 i t . value(r7) 7 I,- - 0 I,- valuer2) 2 V valuerg) 8 V valuefr7) 7h
} J
}
-SRB Contents SRB Contentslatency which can impact performance depending on the frequency o f exceptions.
Implicit index scheduling derives its name from the ability o f the compiler to locate a particular register value within the SRB. This is possible only if the dy namically occurring history o f speculated instructions is deterministic at branch boundaries. Superblocks guarantee this by ensuring that the sole entry into the superblock is at the header and by limiting specula tion to within the superblock. For standard blocks, bookkeeping code [7] can be used to ensure this deter ministic behavior.
4.2
Branch Repair Using a Speculation
Read Buffer
As described in Section 2, branch repair can be han dled by resolving branch hazards with the compiler. Branch hazard resolution in multiple instruction roll back can be assisted by the read buffer when cover ing on-path hazards are present, reducing the perfor mance cost o f variable renaming [11,12]. In a similar fashion, the SRB can assist in branch repair. Figure 9 shows the original code schedule and the two spec ulative schedules o f Figure
8
. For this example, it is assumed that r j, r3
, r6, and7*7
are elements in bothlive.in (Ij) and livc.in(Ik).
As shown in schedule 1, if branch instruction Ic commits to the taken path, 7*3, re, and rr, which were modified in Ii, Id, and / / , respectively, must be re stored. If instead, Ic commits to the fall-through path and Ig commits to the taken path, only
7*2
must be re stored. Registers t*2
and7*7
are rollback hazards that result from exception repair; therefore, the SRB con tains their unmodified values. By including a flush(k) command at the target o f Ie and Ig, the SRB can be used to restore r2 and/or rr given a misprediction ofIe or Ig.
The flush(k) command selectively flushes the ap propriate register values given a branch misprediction. For example, in schedule 2 o f Figure 9, if Ic is predicted correctly and Ig is mispredicted, the SRB is flushed in reverse order up through /«, restoring vaiue(r2) from
Ii but not restoring valu e(rj) from I f . Since specu
lation is always from the most probable branch path, the flush(k) command is always placed on the most improbable branch path, minimizing the performance penalty. Not all branch hazards are resolved by the presence o f on-path hazards. These remaining haz ards can be resolved with compiler transformations.
5
S R B Flush P enalty
The examples o f Section 4 demonstrate that compiler-assisted multiple instruction rollback cam be applied to both bramch repair amd exception repair in a speculative execution architecture. The flush penadty o f the read buffer is not a key concern in multiple in struction rollback applications since instruction faults are typicadly very rare. In application to exception re- pair in speculative execution, the SRB flush penalty is also not a major concern due to the infrequency o f ex ceptions involving speculated instructions. However, in application to bramch repadr, the SRB flush penadty could produce significamt performance impacts. Stud ies o f bramch behavior show a conditioned branch fre quency o f 11% to 17% [20]. Static bramch prediction methods result in bramch mispredictions in the ramge o f 5% to 15%. This results in a branch repadr fre quency as high as 2.5%. Assuming a CPI (clock cycles per instruction) rate o f one amd am average SRB flush penadty o f ten cycles, the performance overhead o f the flush mechanism would reach 22.5%. This indicates the importamce o f minimizing the amount o f redun- damt data stored in the SRB so that the flush penadty is reduced.
Recently, a technique was proposed to reduce the amount o f redundamt data in a read buffer so that the read buffer size could be reduced [12,13]. A similar technique cam be used to assure that only the data required for bramch amd exception repadr is stored in the SRB. In the implicit index scheme o f Section 4, a bit indicating whether am instruction is speculated is added to the opcode field. By expanded this field to two bits, operamd storage requirements cam be spec ified. Figure 10 shows the reduced contents o f the SRB given schedule 1 o f Figure 9. In the modified scheme, only the first reaui o f
7*7
must be maintained. Register r% is not required since it was not modified. The improved scheme adso eliminates blamk spaces in the SRB. For this exauuple, the misprediction o f Ie in schedule 1 o f Figure 9 results in four less variables to flush.The coding o f the two speculation bits would be as follows:
00
) no save required,01
) save operamd1
,10
) save operamd 2, amd 11) save both operamds. If neither operamd o f a speculated instruction has be saved in the SRB, the instruction is not marked as speculated. This is not a problem for bramch repair: however, if such am instruction traps, the hardware would have no way o f knowing not to hamdle the exception immedi ately. There would adso be no entry in the SRB for the exception bit or for the corresponding PC vadue. One solution to the problem would be to auid another bit toOriginal Schedule Speculated Schedule 1 Speculated Schedule 2 V <i - r
2
* r3 Vr3 m r 4 + rS
Ie: bne rJf Tj. Iy V r< ” r7 • rS I .: ri “ r» + 4 y r7 = r7 + 4I^i bne / j , ry, 1^
I»: r6 mr i * 4 I.: r2 = M EM (r2 ) I r 0 r2 * rJ V 0 r2 * rJ Except b it— , I Reg. No. 2 N l: valuer?) valuefrj valueffj) I^i bne ^7 » I ^ V r6 mr 6 + 4
Iy.
flu s h (lO lllO ) 1^ flush(lOOOOO) — 2 Nr8
r8 + 4 I£ bne fg I Tjt I ^ lK r6 = ri + 4 L : flu s h (lllO lO ) .1
•
I*: flush(OOlOOO) — , V - 0 v v a lu er7) 7 I. - 0 I; va lu er2) 2 h valuerg) 8 h v a lu er?) 7 SRB Contents SRB Contents„ Except b it— » Reg. No.
a
a
f é 1 c 2N u o s r h da
]f h value(r?) 7 i, vaiue(r2) 2 SRB ContentsFigure 10: SRB with reduced content.
Figure 11: Instrumentation code placement. the opcode field which marks speculated trapping in structions. A better solution is to code all speculated trapping instructions which have no operands to save as 01. This will indicate that exception handling is to be delayed and cause a reservation o f an entry in the SRB, and also will slightly increase the flush penalty during branch repairs.
6
Perform ance Evaluation
6.1
Evaluation Methodology
In this section, results o f a read buffer flush penalty evaluation are presented. The instrumentation code segments o f Figure 11 call a branch error procedure which performs the following functions:
1. Update the read buffer model.
2. Force actual branch errors during program exe cution, allowing execution to proceed along an incorrect path for a controlled number o f instruc tions.
3. Terminate execution along the incorrect path and restore the required system state from the simu lated read buffer.
4. Measure the resulting flush cycles during the branch repair.
5. Begin execution along the correct path until the next branch is encountered.
An example instrumentation code segment is shown in Figure 12. Parameters, such as operand saving in formation, current PC, branch fall-though PC, and branch target PC values, are passed by the instru mentation code to the branch error procedure. An additional miscellaneous parameter contains instruc tion type and information used for debugging.
Figure 13 gives a high level flow o f operation for the branch error procedure. When a branch instruction in the original application program is encountered, an
arm-branch flag is set. Prior to the execution o f the
next application instruction, the arm-branch flag is checked, and if set, the branch decision made by the application program is set aside. The branch is then predicted by the branch prediction model. Four mod els are used in the evaluation: 1) predict taken, 2) pre dict not taken, 3) dynamic prediction, and 4) static prediction from profiling information. The dynamic prediction model is derived from a two bit counter branch target buffer (BTB) design [21] and is the only model that requires updating with each predic tion outcome.
After the branch is predicted, the prediction is checked against the actual branch path taken by the application program. If the prediction was correct, ex ecution proceeds normally. If the prediction was incor rect, the correct branch path is loaded into the recov ery queue along with a branch error detection (BED) latency, and the predicted path is loaded into the PC. The BED latency indicates how long the execution of instructions is to continue along the incorrect path. The branch error time-out flag is set when the BED latency is reached. When a branch error is detected, the register file state is repaired using the read buffer contents. The PC value o f the correct branch path is obtained from the recovery queue. During branch er ror rollback recovery, the number o f cycles required to flush the read buffer during branch repair is recorded.
$_3 im lb _ 2 _ 2 4_0 : # i n s t r u c t i o n 24 # B e g in b rs im _ s im h o o k : s i subu l a sw l a sw l a sw l i sw l i sw move j $ s p , $ a t , $ a t , $ a t , $ a t , $ a t , $ at
t
$ a t , $ a t , $ a t , $ a t , $ a t , 44$
16, s2 - 0 : norm al — hook address — instruction adress — next hook address — miscellaneous _s im lb _ 2 _ 2 4_0 2 0 ($ sp ) $_s im lb _ 2 _ 2 4_1 2 4 ($ sp ) $_3 im lb _ 2 _ 2 5_0 2 8 ($ sp ) 8216 ---3 2 ($ sp ) 16 « b r s im sa v e 4 0 ($ sp ) $sp # End b rs im _ s im h o o k . $ _ s im lb _ 2 _ 2 4_1 : addu $16, $16,directs read buffer to save register 16 original instruction $_s im lb _ 2 _ 2 5_0 : # i n s t r u c t i o n 25 # B e g in brsim__sim h o o k : s i - 16, s2 - 9: b ra n ch subu $spr 44 l a $ a t , $ s im lb 2 25 0 — hook address sw $ a t , 2 0 ($sp) l a $ a t , $ s im lb 2 25 1 - e - —— instruction adress sw $ a t , 2 4 ($ sp ) l a $ a t , sw $ a t , 2 8 ($ sp ) l i $ a t , 532505 « sw $ a t , 3 2 ($ sp ) l a $ a t , sw $ a t , 3 6 ($ sp ) l i $ a t , 304 e ...— sw $ a t , 4 0 ($ sp ) registers 16 and 9 move $ a t , $sp j brsim __save # End brsim__sim h o o k . $ s im lb 2 25 1 :
bne $16 , $9, $_main__5 original instruction $ main 6:
Y w update RB model update recovery Queue return
i
• restore GPRF from RB model, record flush cycles • load PC from recovery queue PC - program counterGPRF - general purpose register file RB - read buffer
BPM - branch prediction model
Table 1: Application programs. Program Static Size Description
QUEEN 148 eight-queen program WC 181 UNIX utility
QSORT 252 quick sort algorithm CMP 262 UNIX utility GREP 907 UNIX utility PUZZLE 932 simple game COMPRESS 1826 UNIX utility LEX 6856 lexical analyzer YACC 8099 parser-generator CCCP 8775 preprocessor for
gnu C compiler
It is assumed for this evaluation that two read buffer entries can be flushed in a single cycle. This cor responds to a split-cycle-save assumption o f the gen eral purpose register file [12]. Performance overhead due to read buffer flushes (% increase) is computed as
F lush-O H = 100 .
total-cycles
All instructions are assumed to require one cycle for execution. This assumption is conservative since the MIPS processor used for the evaluation requires two cycles for a load. The additional cycles would increase the total-cycles and thereby reduce the observed per formance overhead. In addition to accurately measur ing flush costs, the evaluation verifies the operation o f the read buffer and its ability to restore the appropri ate system state over a wide range o f applications.
The instrumentation insertion transformation oper ates on the s-code emitted by the MIPS code generator of the IMPACT C compiler [3]. The transformation determines which operands require saving in the read buffer and inserts calls to the initialization, branch er
ror, and summary procedures. The resulting s-code
modules are then compiled and ran on a DECstation 3100. For the evaluation, BED latencies from 1 to 10 were used. Table 1 lists the ten application programs evaluated. Static Size is the number o f assembly in structions emitted by the code generator, not includ ing the library routines and other fixed overhead.
6.2
Evaluation Results
Experimental measurements o f read buffer flush overhead ( Flush Off) for various BED latencies are shown in Figures 14 through 23. The four branch
Flush OH <%) P_Taken: o -P_N_Taken:--o- Dyn_Pred: Prof_Pred: -a
-t— I— I— I— I— I— i— I— r
1 2 3 4 5 6 7 8 9 10
BED Latency
Figure 14: Flush penalty: QUEEN.
Flush OH <%)
50H P Taken: o
-P N
Taken:--o-40- Dyn_Pred: •••»•••
Prof_Pred:
- a-
30-
20-10- _ „ s & ïS ïz l
0-
1"
**Ta* ^
l i i l i
1— I— I— r 1 2 3 4 5 6 7 8 9 10 BED LatencyFigure 15: Flush penalty: WC.
prediction strategies used for the evaluation are: 1) predict taken (P-Taken), 2) predict not taken
(P-N-Taken), 3) dynamic prediction based on a
branch target buffer ( Dyn-Pred), and 4) static branch prediction using profiling data ( Prof-Prtd).
Flush costs were closely related to branch predic tion accuracies, i.e., the more often a branch was mis predicted, the more often flush costs were incurred. In a speculative execution architecture, branch predic tion inaccuracies result in performance impacts in ad dition to the impacts from the branch repair scheme. Branch misprediction increases the base run time o f an application by permitting speculative execution of unproductive instructions. Increased levels o f specular tion increase the performance impacts associated with branch prediction inaccuracies. Only the performance impacts associated read buffer flushes are shown in Figures 14 through 23.
Flush OH (% ) 50- 40- 30-P_Taken: o -P_N_Taken:--o- Dyn_Pred: •••*•• ProfJPred: -a
-i — -i— -i— -i— -i— -i— r
2 3 4 5 6 7 8 9 10
BED Latency
Figure 16: Flush penalty: COMPRESS.
Flush OH (% ) 50- 40-
30-20
-P Taken: o -PlN_Taken:--o- Dyn_Pred: ProfPred:o
f» f ■
't
f f f 2 3 4 5 6 7 8 9 10 BED LatencyFigure 17: Flush penalty: CMP.
Flush OH (* > 50H 40-
30-20
-P_Taken: o -P_N_Taken:--o- Dyn_Pred: •••*■■ Prof_Pred: -a -2 3 4 5 6 7 8 BED Latency 9 10Figure 18: Flush penalty: PUZZLE.
Hush OH <%) 50- 40-
30-20
-110
-P_Taken: o -P_N_Taken:-o- Dyn_Pred: •••».. ProfPred: -a-II--o—a
" i — i— i— i— i— i— i— r
1 2 3 4 5 6 7 8 9 10
BED Latency
Figure 19: Flush penalty: QSORT. For nine o f the ten applications, P-N.Taken was
significantly more accurate or marginally more ac curate in predicting branch outcomes than P - Taken. For QSORT, P-Taken was significantly more accurate than P-N-Taken. This result demonstrates that in a speculative execution architecture, it is difficult to guarantee optimal performance across a range o f ap plications given a choice between predict-taken and predict-not-taken branch prediction strategies.
For all but one application, Prof-Pred was more ac curate than either P-Taken or P-N-Taken. For CMP,
Prof-Pred, P-N-Taken, and Dyn-Pred were nearly per
fect in their prediction o f branch outcomes. Prof-Pred marginally outperformed Dyn-Pred in all applications except LEX.
The purpose o f measuring read buffer flush costs given the recovery from injected branch errors is to establish the viability o f using a read buffer design
for branch repair for speculative execution. Although in such a speculative schedule only static prediction strategies would be applicable, the Dyn-Pred model was included to better assess how varying branch pre diction strategies impact flush costs. Overall, the ac curacy o f Dyn-Pred fell between P-Taken/P-N-Taken and Prof-Pred.
Over the ten applications studied, read buffer flush overhead ranged from 49.91% for the PJTaken strat egy in CCCP to .01% for the P-N-Taken strategy for CMP given a BED o f ten. It can be seen from Figures 14 through 23 that a good branch prediction strat egy is key to a low read buffer flush cost. The results show that given a static branch prediction strategy using profiling data, an average BED o f ten produces flush costs no greater than 14.8% a n d an average flush cost o f 8.1% across the ten applications studied. This performance overhead is comparable to the overhead
Hush OH
(% )
Figure 20: Flush penalty: GREP.
Flush OH
<%)
50H p_Taken: H P N
Taken:--o-OH— i— i— i— i— i— i— i— i— r
1 2 3 4 5 6 7 8 9 10
BED Latency
Figure 22: Flush penalty: YACC.
Flush OH
(% )
expected from a delayed write buffer scheme with a maximum allowable BED o f ten [15]. Given a max imum BED o f ten and an average BED o f less than ten, the flush costs o f the read buffer would be less than that o f a delayed write buffer, since a delayed write buffer is designed for a worst-case BED and the flush penalty o f a read buffer is based on the average BED. The observed flush costs are small in compari son to the substantial performance gain o f speculated architectures over that o f nonspeculated architectures
[
8
-
10
].
The BED for a given branch in this evaluation cor responds to the number o f instructions moved above a branch in a speculative schedule. The results o f the evaluation indicate that if the average number o f in structions speculated above a given branch is < 10, then the read buffer becomes a viable approach to handling branch repair.
Flush OH
<%)
2 3 4 5 6 7 8 9 10
BED Latency
Figure 23: Flush penalty: CCCP.
7
Sum m ary
Speculative execution has been shown to be an ef fective method to create additional instruction level parallelism in general applications. Speculating in structions above branches requires schemes to han dle mispredicted branches and speculated instructions that trap.
This paper showed that branch hazards resulting from branch mispredictions in speculative execution are similar to branch hazards in multiple instruction rollback developed for processor error recovery. It was shown that compiler techniques previously developed for error recovery can be used as an effective branch repair scheme in a speculative execution architecture. It was also shown that data hazards that result in rollback due to exception repair are similar to on-path hazards suggesting a read buffer approach to exception
repair.
Implicit index scheduling was introduced to exploit the unique characteristics o f rollback recovery using a read buffer approach. The read buffer design was extended to include PC values to aid in rollback from excepting speculated instructions.
Read buffer flush penalties were measured by in jecting branch errors into ten target applications and measuring the flush cycles required to recover from the branch errors using a simulated read buffer. It was shown that with a static branch prediction strat egy using profiling data, flush costs under 15% are achievable. The results o f these evaluations indicate that compiler-assisted multiple instruction rollback is viable for branch and exception repair in a speculative execution architecture.
8
Acknow ledgem ents
The authors wish to thank Shyh-Kwei Chen and C.-C. Jim Li for their help with the compiler aspects o f this paper. We would like to thank Scott Mahlke, William Chen, and John Christopher Gyllenhaal for their excellent technical suggestions and assistance with the IM PACT C compiler. Finally, we express our thanks to Janak Patel for his contributions to this research.
References
[1] R. P. Colwell, R. P. Nix, J. O ’Donnell, D. B. Par pworth, and P. K. Rodman, “A VLIW Architec ture for a Trace Scheduling Compiler,” in Proc.
2nd Int. Conf. Architecture Support Programming Languages and Operating Syst., pp. 105-111, Oct.
1987.
[2] J. C. Dehnert, P. Y . Hsu, and J. P. Bratt, “Over lapped Loop Support in the Cydra 5,” in Proc.
3rd Int. Conf. Architecture Support Programming Languages and Operating Syst., pp. 25-38, April
1989.
[3] P. Chang, W . Chen, N. Warter, and W .- M. W . Hwu, “IMPACT: An Architecture Frame work for Multiple-Instruction-Issue Processors,” in Proc. 18th Annu. Symp. Comput. Architecture, pp. 266-275, May 1991.
[4] B. R. Rau and C. D. Glaeser, “Some Scheduling Techniques and an Easily Schedulable Horizon- tad Architecture for High Performance Scientific
Computing,” in Proc. 20th Annu. Workshop Mi
croprogramming Microarchitecture, pp. 183-198,
Oct. 1981.
[5] M. S. Lam, “Software Pipelining: An Effective Scheduling Technique for VLIW Machines,” in
Proc. A C M SIGPLAN 1988 Conf. Programming Language Design Implementation, pp. 318-328,
June 1988.
[6] A. Aiken and A. Nicolau, “Optimal Loop Paral lelization,” in Proc. A C M SIGPLAN 1988 Conf.
Programming Language Design Implementation,
pp. 308-317, June 1988.
[7] J. A. Fisher, “Trace Scheduling: A Technique for Global Microcode Compaction,” IEEE Trans.
Comput., vol. c-30, no. 7, pp. 478-490, July 1981.
[8] M. D. Smith, M. S. Lam, and M. Horowitz, “Boosting Beyond Scalar Scheduling in a Super scalar Processor,” in Proc. 17th Annu. Symp.
Comput. Architecture, pp. 344-354, May 1990.
[9] S. A. Mahlke, W . Y. Chen, W .-M . W . Hwu, B. R. Rao, and M. S. Schlansker, “Sentinel Scheduling for VLIW and Superscalar Processors,” in Proc.
5th Int. Conf. Architecture Support Programming Languages and Operating Syst., pp. 238-247, Oct.
1992.
[10] M. D. Smith, M. A. Horowitz, and M. S. Lam, “Efficient Superscalar Performance Through Boosting,” in Proc. 5th Int. Conf. Architecture
Support Programming Languages and Operating Syst., pp. 248-259, Oct. 1992.
[11] N. J. Alewine, S.-K. Chen, C.-C. J. Li, W . K. * Fuchs, and W .-M . W . Hwu, “Branch Recov ery with Compiler-Assisted Multiple Instruction Retry,” in Proc. 22th. Int. Symp. Fault-Tolerant
Comput., pp. 66-73, July 1992.
[12] N. J. Alewine, Compiler-assisted Multiple In
struction Rollback Recovery using a Read Buffer.
PhD thesis, Tech. Rep. CRHC-93-06, University o f Illinois at Urbana-Champaign, 1993.
[13] N. J. Alewine, S.-K. Chen, W . K. Fuchs, and W .- M. W . Hwu, “Compiler-assisted Multiple Instruc tion Rollback Recovery using a Read Buffer,” Tech. Rep. CRHC-93-11, Coordinated Science Laboratory, University o f Illinois, May 1993. [14] J. E. Smith and A. R. Pleszkun, “Implementing
Precise Interrupts in Pipelined Processors,” IEEE
[15] Y . Tamir and M. Tremblay, “High-Performance Fault-Tolerant VLSI Systems Using Micro Roll back,” IEEE Trans. Comput., vol. 39, pp. 548- 554, Apr. 1990.
[16] A. V. Aho, R. Sethi, and J. D. Ullman, Compil
ers: Principles, Techniques, and Tools. Reading,
MA: Addison-Wesley, 1986.
[17] M. Johnson, Superscalar Microprocessor Design. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1991. [18] C.-C. J. Li, S.-K. Chen, W . K. Fuchs, and W .-
M. W . Hwu, “Compiler-Assisted Multiple In struction Retry,” Tech. Rep. CRHC-91-31, Coor dinated Science Laboratory, University o f Illinois, May 1991.
[19] J. A. Bondy and U. Murty, Graph Theory with
Applications. London, England: Macmillan Press
Ltd., 1979.
[20] J. L. Hennessy and D. A. Patterson, Computer
Architecture: A Quantitative Approach. San Ma
teo, CA: Morgan Kaufmann Publishers, Inc., 1990.
[21] J. K. Lee and A. J. Smith, “Branch Prediction Strategies and Branch Target Buffer Design,”