Application of compiler-assisted multiple instruction rollback recovery to speculative execution

(1)

NASA-CR-193360

Iuly 1993

UILU-ENG-93-2229

CRHC-93-16

Center for

Reliable

and High-Performance

Computing

APPLICATION

OF

COMPILER-ASSISTED

MULTIPLE

INSTRUCTION

ROLLBACK

RECOVERY

TO

SPECULATIVE

EXECUTION

N.J.

Alewine

W. K. Fuchs

W.-M.

Hwu

(NASA-CR-193360) APPLICATION OF COHPILER-ASSISTED MULTIPLE

INSTRUCTION ROLLBACK RECOVERY TO

SPECULATIVE EXECUTION (illinois

Univ.) 18 p

N93-32355

Uncl as

G3/62 0176461

Coordinated

Science

Laboratory

College of Engineering

UNIVERSITY

OF ILLINOIS

AT URBANA-CHAMPAIGN

Approved for Public Release. Distribution Unlimited.

(2)

UNCLASSIFIED

_CURItY CLASSIFICATION OF THIS PAGE

la. REPORT SECURITY CLASSIFICATION Unclassified

2a. SECURffY CLASSIFICATION AUTHORITY

Zb. DECLASSIFICATION I OOWNGRADING SCHEDULE

4. PERFORMING ORGANIZATION REPORT NUMBER(S)

UILU-ENG-93-2229 CRHC-93-16

'6a. NAME C)F PERFORMING ORGANIZATION

Coordinated Science Lab

University of Illinois 6c ADDRESS (O'ty, State, and ZIP Code)

ii01 W. Springfield Avenue Urbana, IL 61801

8a. NAME OF FUNDING/SPONSORING

ORGANIZATION

7a

8c ADDRESS(City, State, and ZlPCode)

7b

REPORT DOCUMENTATION PAGE

lb. RESTRICTIVE MA/tKINGS None

3. DISTRIBUTION/AVAILAB'IUTY OF REPORT

Approved for public release; dis cribution unlimited

S. MONITORING ORGANIZATION REPORT NUMBER(S)

ii

OFFICE SYMBOL 7a. NAME OF MONITORING ORGANIZATION (If app.cabl,)

N/A National Aeronautics and Space Admlnlstratic

7b. ADDRESS (C/ty, State, and ZIP Code) Moffitt Field, CA

9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER

10. SOURCE OF FUNDING NUMBERS

ELEMENT NO. . NO. ACCESSION NO.

Bb. OFFICE SYMB(_L

(If applicable)

11. TITLE (Include _curityOas_i_ation)

Application of Compiler-Asslsted Multiple Instruction Roilback Recovery to Speculative Execution

S. PAGE COUNT 12. PERSONAL AUTHOR(S)

=

13a. TYPE OF REPORT

ALEWINE, N. J, W. K. Fuchs, and W.-M. Hwu

1135. TIME COVERED 114. DATE OF REPORT O'ear, Month, Day)

Technical I FROM . TO _,. I ]qq_ .T,11v 1P

16. SUPPLEMENTARY NOTATION

18

ii

17. COSATI CODES I 18. SUBJECT TERMS (Continue on reverse if _ces_ty and identify by block numbed

FIELD I GROUP I SUB-GROUP I rollback recovery, compller-asslsted multiple instruction,

• I I transient processor failures, instructional level

paral-,, lel t _m

:9 ABSTRACT (Continue on reverse if necessary and identify by block number)

Speculative execution is a method to increase instruction level parallelism which can be exploited by both super-scalar and VLIW architectures. The key to a successful general speculation strategy is a repair mechanism to handle mispredicted branches and accurate reporting of exceptions for speculated instrucitons. Multiple instruction rollback is a technique developed for recovery from transient processor failure. Many of the difficulties encountered during recovery from branch misprediction or from instruction re-execution due to exception in a speculative execution architecute are similar to those encountered during multiple instruction rollback.

This paper investigates the applicability of a recently developed compiler-assisted multiple instruciton rollback scheme to aid in speculative exectuion repair. Extensions to the ocmpiler-assisted scheme to support branch and exception repair are presented along with performance measurements across ten application programs.

20. DISTRIBUTION I AVAIL/kBIUTY OF ABSTRACT J21. ABSTRACT SECURITY CLASSIFICATION

I_'IUNCLASSIFIEDAJNUMITED [] SAME AS RPT. [] DTI¢ USERS'J Unclassified

22a. NAME OF RESPONSIBLE INDIVIDUAL ... FZb.TELEPHONEOndu¢_ AreaCode)/22c. OFFICE SYMBOL

Illll

DD FORM 1473, 84 MAR 83 APR edition may be used until exhausted. SECURITy,, CLASSIFICATION OF THI S PAGE All other editions are obsolete.

(3)

TO APPEAR: WORKSHOP ON HARDWARE AND SOFTWARE ARCHITECTURES FOR FAULT TOLERANCE: PERSPECTIVES AND TOWARDS A SYNTHESIS JUNE 14-16, 1993 LE _R'I" S,_IFI"-'_CB_, FP,_CE

APPLICATION

OF

COMPILER-ASSISTED

MULTIPLE

INSTRUCTION

ROLLBACK

RECOVERY

TO

SPECULATIVE

EXECUTION

N. J. Alewine, W. K. Fuchs, W.-M. Hwu

Center for Reliable and High-Performance Computing

Coordinated Science Laboratory University of Illinois at Urbana-Champaign

Abstract

Speculative execution is a method to increase in-struction level parallelism which can be exploited by

both super-scalar and VLIW architectures. The key to a successful general speculation strate_W is a repair mechanism to handle mispredicted branches and ac-curate reporting of exceptions for speculated instruc-tions. Multiple instruction rollback is a technique developed for recovery from transient processor fail-ures. Many of the difficulties encountered during

re-cover/from branch misprediction or from instruction

re-execution due to exceptions in a speculative

exe-cution architecture are similar to those encountered

during multiple instruction roUback.

This paper investigates the applicability of a centiy developed compiler-a_isted multiple instruc-tion rollback scheme to aid in speculative execution

repair. Extensions to the compiler-assisted scheme to support branch and exception repair are presented along with performance measurements acro_ ten

ap-plication programs.

1 Introduction

Super-scalar and VLIW architectures have been

shown effective in exploiting instruction level

paral-lelism (ILP) present in a given application [1-3].

Cre-ating additional ILP in applications has been the

sub-ject of study in recent years [4-6]. Code motion within a basic block is insufficient to unlock the full potential of super-scalar and VLIW processors with issue rates

"InternationAl Btud_e_ Machines Corporation, Boca P_n, I;I.

I Thk research wu mpported in part by the National Aero-na, utJcJ and Space Admini_rathm _NASA ) under gran_ NASA

NAG 1-613, in cooperation with the minois Computer

Labora-tory for A_e Systen_ and Software (ICLASS), and in part

by the Department of the Navy and managed by the Omce ot the Chief of Naval Research under Contract N00014-91-J-1283.

greater than two [3]. Given a trace of the most fre-quently executed basic blocks, limited code movement across block boundaries can create additional ILP at the expense of requiring complex compensation code

to ensure program correctness [7]. Combining multiple

basic blocks intomuperblocks permits code movement within the superblock without the compensation code

required in standard trace scheduling [3].

General upward and downward code movement across trace entry points (joins) and general

down-ward code motion across trace exit points (branches, or forks) is permitted without the need for special hardware support [7]. Sophisticated hardware support

is required, however, for unrestricted upward code mo-tion across a branch boundary. Such code motion

is referred to as speculative ezecution and has been

shown to substantially enhance performance over

non-speculated architectures [8-10]. This paper focuses on the support hardware for speculative execution, which

ensures correct operation in the presence of except-ing speculated instructions (referred to as exception repair) and of mispredicted branches (referred to as branch repair). It is shown that data hazards which re-sult from exception and branch repair are very similar to data hazards that result from multiple instruction rollback, and that techniques used to resolve rollback data hazards are applicable to exception and branch

repair.

The remainder of the paper is organized as follows.

Section 2 gives a brief overview of a compiler-assisted

multiple instruction rollback (MIR) scheme to be used

as a base for application to speculative execution re-pair (SER). Section 3 describes speculative execution and the requirements for exception repair and branch repair. Section 4 introduces a schedule _consLr_c.

tion scheme and extends the compiler-assisted rollback scheme. Section 5 describes read buffer flush costs and Section 6 presents performance impacts which result

(4)

from read buffer flushes.

2 Compiler-Assisted Multiple Instruc-tion Rollback Recovery

2.1 Hazard Classification

Within a general error model, data hazards result-ing from instruction retry are of two types [11-13]. On-path hazards are those encountered when the in-struction path after rollback is the same as the initial path and branch hazards are those encountered when the instruction path after rollback is different than the initial path. As shown in Figure 1, rm represents an on-path hazard where during the initial instruction

se-1

N

is live

1

i

• i i • I I • | | Ii rye-'= rd+'_c i

Figure 2: Branch data hazard.

I'," ": +"bl ,

N •

I,:-,'= +x I

rollbsck

Figure h On-path data hazard.

quen¢_ rffiis written and after rollback is read prior to being re-written. As shown in Figure 2, ry represents a branch hazard where the initial instruction sequence writes ry and al_er rollback ry is read prior to being

re-written however this time not along the original path.

2.2 On-path Hazard Resolution Using a Read Buffer

Hardware support consisting of a read buffer of size 2N, as shown in Figure 3, has been shown to be ef-fective in resolving on-path hazards [11-13]. The read buffer maintains a window of register read history. If an on-path hazard is present, then prior to writing over the old value of the hazard register, a read of that value must have taken place within the last N instructions (else after rollback of <_ 2V, a read of the hazard register would not occur before a redefinition).

Key to this scenario is the fact that the original path is repeated. Branch hazard resolution is left to the

A

=

I

li

I,

'_-- Read Buffer C |

Figure 3: Read buffer.

compiler. At rollback, the read buffer is flushed back to the general purpose register file (GPRF), restoring the register file to a restartable state. The primary advantage of the read buffer is that it does not require an additional read port as with a history buffer, repli-cation of the GPRF as with the future file, or bypass logic as with the reorder buffer or delayed write buffer [14,15].

2.3 Branch Hazard Removal Compiler Transformations

Compiler transformations have been shown to be effective in resolving branch hazards [11, 12]. Branch

hazardresolution occursat three levels; I) pseudo

code,2) machine code,and 3) post-pass.Reso|ution at the pseudo code levelwould be accomplishedby renamingthe pseudoregisterey ofinstruction

(5)

li(Fig-ure 2) to rz. Node splitting, loop expansion and loop protection transformations aid in breaking pseudo reg-ister equivalence relationships so that renaming can

be performed. After the pseudo registers are mapped to physics] registers, some branch hazards could

re-appear. This is prevented at the machine code level

by adding hazard constraintsto liverange constraints prior to registerallocation.Branch hazards that re-main afterthe firsttwo levelscan be resolvedby either

creating a %overing _ on-path hazard or by inserting

hop (no operation) instructionsahead of the hazard instructionuntilthe rollbackisguaranteed to be un-der the branch. Given the branch hazard of Figure

2, a covering on-path hazard is created by inserting an MOV r_,rv instructionimmediately before the

in-structionin which rv is defined.This guarantees that

the old value of % is loaded into the read bufferand isavailableto restorethe registerfileduring rollback.

3

Speculative Execution

Figures 4 and 5 illustratethe two basic problems

which are encountered when attempting upward code

motion acro_ a branch. As shown in Figure 4, ifthe

"m

'oj

jbranchmkm

_v

T

ie

,

*

" ...

*Q_

I

Figure 4: rl in live_out of taken path.

speculated instruction (i.e., an instruction moved

up-ward past one or more branches) modifies the system

state, and due to the branch outcome the speculated instruction should not have been executed, program correctne_ could be affected. Figure 5 illustrates that if the speculated instruction causes an exception, and again due to the branch outcome, the excepting

in-struction should not have been executed, program

per-formance or even program correctness could be af-fected.

tJr_ = _(

r2)l-_--'trap

occun

: O I [branch taken

_v

T

:,e • •

t

J

Figure 5: Speculated instructiontraps.

3.1 Branch Repair

Figure 6 shows an originalinstructionschedule and

a new schedule after speculation. Instructionsd, i, and / have been speculated above branches c and

9 from their respectivefall-throughpaths.2 Specu-lated instructionsare marked "(s)." The motivation

for such a schedule might be to hide the load delay of the speculated instructionsor to allow more time forthe operands of the branch instructionsto become

available. If c commits to the taken path (i.e.,itis

mispredicted by the staticscheduler), some changes

to the system statethat have resultedfrom the execu-tion of d, i,and f, may have to be undone. No update is required for the PC; execution simply begins at j.

If instead,c commits to the fail-throughpath but 9

commits to the taken path, then only i'schanges to the system state may have to be undone.

Not allchanges to the system state are equally

im-portant. Iffor example, d writes to register r= and

r= _ live_in(j)(i.e.,along the path startingat j, a redefinitionof r= willbe encountered prior to a use of

r, [16]),then the originalvalue of 7"=does not have

to be restored. Inconsistenciesto the system state as a resultof mispredicted branches

exhibitsimilari-tiesto branch hazards in multipleinstructionrollback

[11,12]. Given this similaritybetween branch

haz-ards due to instructionrollba_.kand branch hazards due to speculative execution, compiler-driven

data-flow manipulations, similarto those developed to

elim-inate branch hazards forMIR [11,12],can be used to resolvebranch hazards that result from speculation.

Such compiler transformationshave been proposed for

2For thiB exmnple it is a_mmed that the fall-through patl_ are the mo,t likely outcome of the br_nch decisions at c aad

(6)

9-a a b (')cl j (')i d b ¢ (s)f

f

Lh _ k e h Lh _ k i h RB_c: d e f i jump L1 RB_g: h i jump 1.2

Original Speculau_ Recovery

Schedule Schedule Blocks

Figure 6: Branch repair.

branch misprediction handling [9]. Since re-execution of speculated instructions is not required for branch mispredlction, compiler resolution of branch hazards becomes a sufficient branch repair technique.

3.2 Exception Repair

Figure 6 also demonstrates the handling of spec-ulated trapping instructions. If d is a trapping in-struction and an exception occurred during its execu-tion, handling of the exception must be delayed until ¢ commits so that changes to the system state are mini. mized, and in some cases to ensure that repair is pos-sible in the event that c is mispredlcted. If c commits to the taken path, the exception is ignored and d is handled like any other speculated instruction given a branch mispredict. If c was correctly predicted, three exception repair strategies are possible. The first is to undo the effects of only those instructions speculated above c (i.e., d, i, and f) and then branch to a recovo ery block RB_c [10] as shown in Figure 6. The address of the recovery block can be obtained by using the PC value of the excepting instruction as an index into a hash table. This strategy ensures precise interrupts [14,17] relative to the nonspeculated schedule but not relative to the original schedule. Recovery blocks can cause significant code growth [10]. The second strat-egy undoes the effects of all instructions su.b_uent to

d (i.e., i, b, and/), handles the exception, and resumes execution at instruction i [9]. This latter strategy

pro-rides restartable states and does not require recovery blocks. A third exception repair strategy undoes the effects of only those subsequent instructions that are speculated above c (i.e., only i and/), handles the

ex-ception, and resumes execution at instruction i,

how-ever, this time only executing speculated instructions until c is reached. The improved efficiency of strategy 3 over that of strategy 2 comes at the coat of slightly more complex exception repair hardware.

When a branch commits and is mispredicted, the exception repair hardware must perform three func-tions: 1) determine whether an exception has occurred during the execution of a speculated instruction, 2) if an exception has occurred, determine the PC value of the excepting instruction, and 3) determine which changes to the system state must be undone. Func-tions 1 and 2 are similar to error detection and location in multiple instruction rollback. Function 3 is similar to on-path hazard resolution in multiple instruction rollback [11,12, 18]. On-path hazards assume that af-ter rollback the initial instruction sequence from the faulty instruction to the instruction where the error was detected is repeated.

Figure 7 illustrates the speculation of a group of

| /e

e

i nb=k

i

"

_e'D'D_DeO e,De 0 Blipi el IDIDI! H_D g

I

y

• branch •

• not •

• _ •

Figure 7: Exception repair.

instructions and re-execution strategy 3. The load in-struction traps, but the exception is not handled un-til the branch instruction commits to the fail-through path. Control is then returned to the trapping instruc-tion. This scenario is identical to multiple instruction rollback where an error occurs during the load instruc-tion and is detected during the branch instrucinstruc-tion. For this example, only el must be restored during rollback since r4 and rs will be rewritten prior to use during re-execution. Figure 7 shows that exception repair

(7)

hazards in speculative execution are the same as on-path hazards in multiple instruction rollback, and a

read buffer as described in Section 2 can be used to resolve these hazards. The depth of the read buffer is

the maximum distancefrom I, to In along any

back-wards walk 3, where In isa trapping instructionthat

was speculatedabove branch instructionI_.

3.3 Schedule Reconstruction

Assumed in Figures 6 and 7 are mechanisms to

identifyspeculative instructions,determine the PC

value of excepting speculated instructions,and deter-mine how many branches a given instructionhas been speculated above. An example of the lattercase is

shown in Figure 0 where instructions d, i, and f, axe undone if c is mispredicted; however, only i must be

undone if g is mispredicted.

If the hardware had access to the original code schedule, the design of these mechanisms would be

straightforward. Unfortunately, static scheduling re-orders instructions at compile-time and information as to the original code schedule is lost. To enable recov-ery from mispredicted branches and proper handling

of speculated exceptions, some information relative to

the originalinstructionorder must be present in the

compiler-emittedinstructions.This willbe referredto as schedulerecon._raction.

By limitingthe flexibilityof the scheduler,lessin-formation about the originalschedule isrequired. For example, ifspeculation is limited to one level only (i.e.,above a singlebranch), a singlebitin the opcode fieldissufficientto indicatethat the instructionhas been moved above the next branch [8].The hardware

would then know exactly which instructioneffectsto und6 (i.e.,the ones with this bit set). Also, remov-ing branch hazards directlywith the compiler permits

general speculation with no schedule reconstruction

for branch repair [9].

4 Implicit Index Schedule Reconstruc-tion

Implicit indez scheduling supports general

specula-tion of regular and trapping instructions. The scheme

was inspired by the handling of stores in the sentinel scheduling scheme [9] and was designed to exploit the

unique properties of the read buffer hardware design described in Section 2. Schedule reconstruction is ac-complished by marking each instruction specalated or

3 A ,#alk is a sequence of edge tr&verMia in a graph where the edges visited can be repeated [19].

nonspec,lated by including a bit in the opcode field, and using this encoding to maintain an operand his-tory of speculated instructions in a FIFO queue called

a speculation read buffer (SRB). The SRB operates

similar to a read buffer with additional provisions for exception handling.

4.1 Exception Repair Using a Speculation

Read Buffer

Figure 8 shows an originalcode schedule and two

speculativeschedules,along with the contents of the

SRB at the time branches Ic and _rw commit. Instruc-tions[d and I I have been speculated above branch

instruction Ic, and Ii has been speculated above both

I s and I¢. The encoding of speculated instructions in-forum the hardware that the source operands are to

be saved in the SRB, along with the source operand values, corresponding register addresses, and the PC

of the speculated instruction.

Speculated instructions execute normally unless they trap. If a speculated instruction traps, the ex-ception bit in the SRB which corresponds to the trap-ping instruction is set and program execution contin-ues. Subsequent instructions that use the result of the trapping instruction are allowed to execute normally.

A chk.ezcept(k) instruction is placed in the home

block of each speculated instruction. Only one

chLezcept(k) instruction is required for a home block. As the name implies, chLezcept(k) checks for pend-ing exceptions. The command can simultaneously

in-terrogate each location in the SRB by utilizing the

bit field k. As shown in schedule 1 of Figure 8,

chk.ezcept(OOIlll) in I_ checks exceptions for

instruc-tions Id and I_. If a checked exception bit isset,the SRB isflushedin reverseorder,restoringthe appropri-ate registerand PC values.Execution can then begin with the excepting instruction.

Figure 8 illustratesseveralon-path hazards which

are resolvedby the SRB. In schedule 1, if1_traps and the branch [c commits to the taken path, _r_has

cor-rupted r2 and [! has corcor-rupted rv. Flushing the SRB

up through [i restoresboth registersto their values prior to the initialexecution of li. Note that register

re isalsocorrupted but not restoredby the SRB, since

afterrollback,s willbe rewritten with a correctvalue

before the corrupted value isused.

As an alternativeto checking forexceptions in each home block,the exception could be handled when the exception bit reaches the bottom of the SRB. This is

similarto the reorder bufferused in dynamic schedul-ing[14] and eliminates the cost of the chk_ezcept(k)

(8)

Oril_inal Schedule I_ r_ = r2 * r_

x_ r3 = r, + rJ

L: t_

_l° _" ZS

I¢_ r6 = rT * r8 I.: rs = r& + 4 If: rT= rT + 4 It: bne rt, r7, Ik

zh: r6= r6 + "t

Ii: r2 = MEM(r2 )

1

c 2N d

l

PC Reg. No.

Zr

-

0

If value(r 7) 7

xd _,_

s

I d _l_[r 7) 7 Ii - 0

It value(r2) 2

SRB Contents

Speculated Schedule 1 S/_ulated Sch_ule 2

la: rl = r2 * r_ la: rl .. r2 * r_

I¢: bile r/, rj, Ij I¢: b;l_ r1, r3, lj

I_: chk_except(001111)- I_:chk..exeept(llO011)

le: re = r& + 4 I.: rs = ra + 4

It: btm rs. r7, It It" bner&, r7. It

I_ chk_except(110000) I;: chk=_except(O01100) --,]

Is: r6-:6 + 4 I_ r6 1"6+4

:

PC Except bit --I

:

It

I,: .

_,)71

oI I1,.._

IJ

-SRBConmnm

(9)

latency which can impact performance depending on

the frequency of exceptions.

Impficit index scheduling derives its name from the ability of the compiler to locate a particular restorer value within the SRB. This is pouible only if the

dy-namically occurring history of speculated instructions

is deterministic at branch boundaries. Superbloclm

guarantee this by ensuring that the sole entry into the

superblock is at the header and by limiting specula-tion to within the superblock. For standard blocks,

bookkeeping code [7] can be used to ensure this

deter-ministic behavior.

4.2 Branch Repair Using a Speculation

Read Buffer

As described in Section 2,branch repaircan be han-dled by resolvingbranch hazards with the compiler. Branch hazard resolutionin multiple instructionroll-back can be a_isted by the read bufferwhen cover-ing on-path hazards are present,reduccover-ing the

perfor-mance cost of variablerenaming [11,12]. In a similar fashion,the SRB can assistin branch repair. Figure 9 shows the originalcode schedule and the two

spec-ulativeschedules of Figure 8. For this example, it is + assumed that r_, _, re, and rv are elements in both

live_in( Ij ) and lit_e_in( I_ ).

As shown in schedule I, ifbranch instructionI, commits to the taken path, r2,rs,and rv,which were modified in I+,Ij, and I/, respectively,must be re-stored.Ifinstead,Ic commits to the fall-throughpath and Ig commits to the taken path, only r2 must be

re-stored. Registers r2 and rv are rollbackhazards that

resultfrom exception repair;therefore,the SRB

con-talns theirunmodified values.By including a fl_h(_)

command at the target of Ic and Ig, the SRB can be

used to restorer2 and/or rv given a mispredictionof

I, orig.

The flush(k) command selectively flushes the ap-propriate register values given a branch misprediction. For example, in schedule 2 of Figure 9, ifI¢ is predicted correctly and Ig is mispredicted, the SRB is flushed in

reverse order up through Ii, restoring value(r2) from

Ii but not restoringvalue(rv) from I I. Since

specu-lation is always from the most probable branch path,

the flush(k)command isalways placed on the most improbable branch path, minimizing the performance penalty. Not allbranch hazards are resolved by the presence of on-path hazards. These remaining haz-ards can be resolvedwith compiler transformations.

5

SRB

Flush

Penalty

The examples of Section 4 demonstrate that

compiler-assisted multiple instruction rollback can be applied to both branch repair and exception repair in a speculative execution architecture. The flush penalty

of the read buffer is not a key conceen in multiple

in-struction rollback applications since instruction faults are typically very rare. In application to exception

re-pair in speculative execution, the SRB flush penalty is not a major concern due to the infrequency of

ex-ceptions involving speculated instructions. However,

in application to branch repair, the SRB flush penalty could produce significant performance impacts. Stud-ies of branch behavior show a conditional branch fre-quency of 11% to 17% [20].Static branch prediction

methods resultin branch mispredictious in the range of 5% to 15%. This resultsin a branch repair fre-quency as high as 2.5%. Assuming a CPI (clockcycles

per instruction)rate of one and an average SRB flush

penalty of ten cycles,the performance overhead of the

flush mechanism would reach 22.5%. This indicates the importance of minimizing the amount of redun-dant data stored in the SRB so that the flushpenalty isreduced.

Recently, a technique was proposed to reduce the

amount of redundant data in a read bufferso that the read buffer size could be reduced [12,13]. A similar

technique can be used to assure that only the data

required for branch and exception repair isstored in

the SRB. In the implicitindex scheme of Section 4, a bit indicatingwhether an instructionisspeculated is added to the opcode field.By expanded this fieldto two bits,operand storage requirements can be spec-ified. Figure 10 shows the reduced contents of the

SRB given schedule I of Figure 9. In the modified scheme, only the firstread ofrr must be maintained.

Register rs is not required since it was not modified.

The improved scheme also eliminates blank spaces in the SRB. For this example, the misprediction of Ic in schedule 1 of Figure 9 results in four lessvariablesto

flush.

The coding of the two speculationbits would be as follows:00) no save required,01) save operand I, 10)

save operand 2,and 11) save both operands. Ifneither operand of a speculated instruction has be saved in

the SRB, the instruction is not masked as speculated.

This is not a problem for branch repair: however, if

such an instruction traps, the hardware would have no

way of knowing not to handle the exception

immedi-ately. There would also be no entry in the SRB for the exception bit or for the corresponding PC value. One solution to the problem would be to add another bit to

(10)

.Ori."l_al Schedule Speculated Schedule 1

I_ rl = r2* r3

I_ rj- v4 + rJ I_ v6 = rT , v8 I,: vs = va + 4

If _7= rT+ _

I_: b= _a. "7. Ik

I_ r6= r6+ 4

I_: r2 = M]_(r 2)

T lIf' "

_{Idl va/u_r_)}

o

₈

c 2N Id_ wNne(r:fl 7

o lI,

-

o

d Ii va/ue(r_ 2 l I I_ I"1"i"2" r3

I.+r+/_.

r8

Ib;r_+ r5 I=: bne vl. v3. Ij I,: v8 = va + 4 I_ br_ v&. v7, Ik I/_ r6" r6+ 4 e e

Ii

n_o_)--f l c 11 0 $ r h d

1

Speculated Schedule 2 SRB Contents I,: r_" rs + 4

Ii

b_

r_.

_7,I_

I_: r_ f r_ + 4 Ij:

flush(Ill010)

--I_:

nush(OOlO00)---,

I

•

I

I l If 0

_fv_, 7)7

Ii 0 2N li _l_e(,_) 2 Id value(r_) 8 I d _l_(rT) 7 SRB Contcnm

(11)

u o

L

PC Except bit Reg. No. -I d v_r7) 7

x,

2

M Contents

Figure I0: SRB with reduced content.

Inmz men ion

Ori nl odc

code imtru_om

0

E-7

Figure II:Instrumentation code placement.

the opcode field which marks speculated trapping

in-structions. A better solution is to code all speculated trapping instructions which have no operands to save

as 01. This will indicate that exception handling is to

be delayed and cause a reservation of an entry in the SRB, and also will slightly increase the flush penalty

during branch repairs.

6

Performance

Evaluation

6.1 Evaluation Methodology

In this section, results of a read buffer flush penalty evaluation are presented. The instrumentation code segments of Figure 11 call a branch error procedure which performs the following functions:

1. Update the read buffer model.

. Force actual branch errors during program

exe-cution, allowing execution to proceed along an incorrect path for a controlled number of

instruc-tions.

. Terminate execution along the incorrect path sad

restore the required system state from the simu-lated read buffer.

4. Measure the resulting flush cycles during the

branch repair.

5. Begin execution along the correct path until the next branch is encountered.

An example instrumentation code segment isshown

in Figure 12. Parameters, such as operand saving in-formation, current PC, branch fall-though PC, and

branch target PC values, are passed by the instru-mentation code to the branch error procedure. An additional miscellaneous parameter contains instruc-tion type and information used for debugging.

Figure 13 gives a high level flow of operation for the branch error procedure. When a branch instruction in the original application program is encountered, an

arm{.branch flag is set. Prior to the execution of the next application instruction, the arnt.branch flag is checked, and if set, the branch decision made by the

application program is set aside. The branch is then predicted by the branch prediction model. Four mod-eL are tuted in the evaluation: 1) predict taken, 2) pre-dict not taken, 3) dynamic prediction, and 4) static prediction from profiling information. The dynamic prediction model is derived from a two bit counter branch target buffer (BTB) design [21] and is the only model that requires updating with each

predic-tion outcome.

After the branch is predicted, the prediction is checked against the actual branch path taken by the

application program. If the prediction was correct,

ex-ecution proceeds normally. If the prediction was

incor-rect, the correct branch path is loaded into the recov-ery queue along with a branch error detection (BED)

latency, and the predicted path is loaded into the PC. The BED latency indicates how long the execution of instructions is to continue along the incorrect path. The branch error time_ouL flag is set when the BED latency is reached. When a branch error isdetected, the register file state is repaired using the read buffer contents. The PC value of the correct branch path is

obtained from the recovery queue. During branch er-ror rollback recovery, the number of cycles required to flush the read buffer during branch repair is recorded.

(12)

$ simlb 2 24 0"

T instruCtiOn 24

# Begin brsim sim hook: sl - 16, s2 - 0: normal

_spi 44 subu la sw la sw la sw li sw li sw move Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, # $ simlb 2 25 l--bne $16, $_main 6 : $ simlb 2 24 0 _ hookaddr_ 2_($sp) $ simlb 2 24 1 _ _s_c_oQadge_ss 2_($sp}

$ s imlb 2 25 0-4----next hook 28 ($sp)

8216 _ n_sc_Uaneous

32 ($sp)

16 _ dh_cts read buff= tosave

40 ($sp) regis= 16

Ssp

j brsim save

# End brsim simho_k.

$_simlb 2 24 I_

$ simlb 2 25 0: instruct i_n 25

# Begin brsim sim hook: sl - 16, s2 - 9: branch

subu _sp, 44

la Sat, $ simlb 2 25 0_@----hookadd_ss

sw Sat, 2_($sp)

la Sat, $ simlb 2 25 1-4----L_n_fionadre_

sw Sat, 2_($sp) la Sat, $ main 6 sw Sat, 2_($spY li Sat, 532505 sw Sat, 32($sp) la Sat, $ main 5 _ sw Sat, 3_($sp_ li Sat, 304 sw Sat, 40($sp) move Sat, $sp j brsim save

End brsim sim ho_k.

ne_thookaddxess m_7,elIaneous

mrget

d_-ectsreadbuff= tosave

_s_ 16 and9

$9, $_main_5 I_@--- o_ms_cfion

(13)

Y

N Y N Y N branch Y atm__h <- 1

update

RB model I ream i load recovery queue

withnotpredicu_lpath

)

m recovery queue

PC - program counter

GPRF - general purpose register file

RB - readbuffer

BPM - branch prediction model

N ' r . restore GPRF f_om ] RB model, record flushcycles • load PC fxom recovery queue

(14)

Table

1: Application programs.

I Program

IIStatic

D cnption

QUEEN WC QSORT CMP GREP PUZZLE COMPRESS LEX YACC CCCP 148 _eight-queen _program 181 UNIX utility 252 quick sort algorithm 262 UNIX utility 907 932 1826 6856 8099 8775 UNIX utility simple game UNIX utility lexical analyzer parser-_enerator preprocemor for gnu C compiler

It is smumed for this evaluation that two read buffer entries can be flushed in a single cycle. This cor-responds to a split-cycle-save assumption of the gen-eral purpoze register file [12]. Performance overhead due to read buffer flushes (% increaze) is computed sa

fluah.cpcle8 Flush..OH = 100. '

total._'ycles

All instructions are amurned to require one cycle for execution. This amumption is conservative since the MIPS processor used for the evaluation requires two cycles for a load. The additional cycles would increase the total_cyclea and thereby reduce the observed per-formance overhead. In addition to accurately measur-ing flush costs, the evaluation verifies the operation of the read buffer and its ability to restore the appropri-ate system stappropri-ate over a wide range of applications.

The instrumentation insertion transformation oper-ates on the s-code emitted by the MIPS code generator of the IMPACT C compiler [3]. The transformation determines which operanck require saving in the read buffer and inserts calls to the initialization, branch er-ror, and summary procedures. The resulting s-code modules are then compiled and run on a DECstation 3100. For the evaluation, BED latencies from 1 to 10 were used. Table 1 lists the ten application programs

evaluated. Static Size is the number of amembly in-structions emitted by the code generator, not includ-ing the library routines and other fixed overhead.

6.2 Evaluation Results

Experimental measurements of read buffer flush overhead (Fl_h OH) for various BED latencies are shown in Figures 14 through 23. The four branch

Hush OH (%) P N T_en:-o-0- o if,,! °. °., **.M...°. 10" _ BED Latency

Figure 14: Flush penalty: QUEEN.

Flush OH (%) 50- 40- 30-20. 10. 0 n: ...o-

n:-_-Lt

2 3 4 _ 6 7 8 9 I'0 BED I._Uency

Figure 15: Flush penalty: WC.

prediction strategies used for the evaluation are: 1) predict taken (P_Taken), 2) predict not taken

(P.N_Taken),3) dynamic predictionbased on a

branch target buffer (Dpn_Pred), and 4) static branch prediction using profiling data (Prof_Pred).

Flush costs were closely related to branch predic-tion accuracies, i.e., the more often a branch was mis-predicted, the more often flush costs were incurred. In a speculative execution architecture, branch predic-tion inaccuracies result in performance impacts in ad-dition to the impacts from the branch repair scheme. Branch misprediction increases the base run time of an application by permitting speculative execution of unproductive instructions. Increased levels of

specula-tion increase the performance impacts associated with branch prediction inaccuracies. Only the performance impacts associated read buffer flushes are shown in

(15)

FlushOH

(%)

5O" 4O- 30- 2O-10 P Taken:

P_-N_Takcn:-

a-Dyn_Pmd: .._.. Prof_Pred:

--4-BED Late,

no7

Figure 16: Flush penalty: COMPRESS.

Hush OH (%) 50.H P T_nt: .-_ - P N T

aken:-o-D .-Pre

Pmf Pre_ --_

. 3O- 2O- 10-1 2 3 4 5 6 7 8 910

BED Latency

Figure 17: Flush penalty: CMP.

Hush OH (%) _ P_Taken:

-Q-£

I_Y ..a-0 , BED Latency

Figure 18: Flush penalty: PUZZLE.

Flush OH (%) 50- p Taken: --_ - P-N Taken>o-

40-Prof. Pred:

--4-- 3O-- _ m GI 20-oos_'w O' _ :_ _. :_ _ "_ _ _ 1'0 BED Latency

Figure 19: Flush penalty: QSORT.

For nine of the ten applications, P_N.Taken was significantly more accurate or marginally more ac-curate in predicting branch outcomes than P_Taken. For QSORT, P.Taken was significantly more accurate than P.N_Taken. This result demonstrates that in s speculative execution architecture, it is difficult to guarantee optimal performance across a range of ap-plications given a choice between predict-taken and predict-not-taken branch prediction strategies.

For all but one application, Pref_Pred was more

ac-curate than either P.Taken or P_N.Taken. For CMP,

Prof_Pred, P.N_Taken, and Dyn_Pred were nearly

per-fect in their prediction of branch outcomes. Pre/_Pred

marginally outperformed D_ln_Pred in all applications

except LEX.

The purpose of measuring read buffer flush costs given the recovery from injected branch errors is to establish the viability of using a read buffer design

forbranch repairforspeculativeexecution.Although

in such a speculativescheduleonly staticprediction

strategies would be applicable, the Dyn.Pred model was included to better assess how varying branch pre-diction strategies impact flush costs. Overall, the ac-curacy of Dyn_Pred fell between P_Taken/P_N_Taken

and Pro f_Pred.

Over the ten applications studied, read buffer flush overhead ranged from 49.91% for the P_Taken strat-egy in CCCP to .01% for the P_N_Taken stratstrat-egy for CMP given s BED of ten. It can be seen from Figures

14 through 23 that s good branch prediction strat-egy is key to a low read buffer flush cost. The results show that given a static branch prediction strategy using profiling data, an average BED of ten produces flush costs no greater than 14.8% and an average flush coet of 8.1% acrc_mthe ten applications studied. This performance overhead is comparable to the overhead

(16)

Hush OH _ PTalmn:P N Talmn:-a---_

=-. °O°°U lo- ou--a °'°° j I..Q..Q..Q.._r ° x...x...,x .-o..x _- =..._-..- _''_f"_ I ! I BI_) Lamncy

Figure 20: Flush penalty: GREP.

Flush OH (%) 50- 40- 3O- 2O-0 E -.-@-

Taken:-o--Pmd: ---.

Pmf_Pmd: .-4-BED Latency

Figure21: Flush penalty:LEX.

expectedfrom a delayedwritebuffer scheme with a

maximum allowable BED of ten [15]. Given a max-imum BED of ten and an average BED of less than ten, the flush costs of the read buffer would be lees than that of a delayed write buffer, since a delayed write buffer is designed for a worst-case BED and the flush penalty of a read buffer is based on the average BED. The observed flush costs are small in compari-son to the substantial performance gain of speculated

architectures over that of nonspeculated architectures

[8--10].

The BED foragivenbranch inthisevaluation cor-respondsto the number of instructionsmoved above

a branchin a speculativeschedule.The resultsof the

evaluationindicatethatifthe averagenumber of in-structiousspeculatedabove a givenbranch is< 10, then the read bufferbecomes a viableapproach to handlingbranch repair. Hush OH (%)

50-2o:

0 P Taken: P-N Taken:-o-Prof_Pred: --4--BED Latency

Figure22: Flush penalty:YACC. FlushOH (%) 30-I V P-..N. Takcn_-o-- Dyn_Pred: .-_'.--20" Prof_Pred:--4-. 10-0 BED Latency

Figure23: Flush penalty:CCCP.

7 Summary

Speculativeexecutionhas been shown to be an

ef-fectivemethod tocreate additionalinstructionlevel parallelismin generalapplications.

Speculatingin-structiousabove branches requiresschemes to

han-dlemispredictedbranchesand speculatedinstructions thattrap.

This paper showed that branch hazards resulting

from branch mispredictiousin speculativeexecution are similarto branch hazardsin multipleinstruction

rollbackdevelopedforprocessorerrorrecovery.Itwas

shown thatcompilertechniquespreviouslydeveloped

for error recovery can be used as an effective branch repair scheme in a speculative execution architecture. It wae also shown that data hazards that result in rollback due to exception repair are similar to on-path hazards suggesting a read buffer approach to exception

(17)

repair.

Implicit index schedulin 8 was introduced to exploit the unique characteristics of rollback recovery using a read buffer approach. The read buffer design was extended to include PC values to aid in rollback from excepting speculated instructions.

Read buffer flush penalties were measured by in-jecting branch errors into ten target applications and

measuring the flush cycles required to recover from the branch en_rs using a simulated read buffer. It was shown that with a static branch prediction

strat-egy using profiling data, flush costs under 15% are achievable. The results of these evaluations indicate that compiler-assisted multiple instruction rollback is viable for branch and exception repair in a speculative

execution architecture.

8 Acknowledgements

The authors wish to thank Shyh-Kwei Chen and

C.-C. Jim Li for their help with the compiler aspects

of this paper. We would like to thank Scott Mahlke, William Chen, and John Christopher Gyllenhaal for their excellent technical suggestions and amistance with the IMPACT C compiler. Finally, we express our thanks to Janak Patel for his contributions to this

research. References

[1]

[2]

[3]

[4]

R. P. Colwell, R. P. Nix, J. O'Donnell, D. B. Ps-pworth, sad P. K. Rodman, "A VLIW Architec-ture for a Trace Scheduling Compiler," in Pro¢.

_nd Int. Conf. Architecture Support Programming Langeages and Operating Syst., pp. 105-111, Oct. 1987.

J. C. Dehnert, P. Y. Hsu, and J. P. Bratt, "Over-lapped Loop Support in the Cydra 5," in Proc.

3rd Int. Conf. Architecture Support Programming

Languages and Operating Syst., pp. 26-38, April

1989.

P. Chang, W. Chen, N. Warter, and W.-M. W. Hwu, "IMPACT: An Architecture Frame-work for Multiple-Instruction-Issue Processors," in Proc. lSth Anna. Syrup. Comput. Architecture,

pp. 286--275, May 1991.

B. It. Itau and C. D. Glaeser, "Some Scheduling

Techniques and an Easily Schedulable Horizon-tal Architecture for High Performance Scientific

Computing," in Proc. HOth Anna. Workshop Mi-croprogramming Microarchitectare, pp. 183-198, Oct. 1981.

[5] M. S. Laln, "Software Pipelinin8: An Effective Scheduling Technique for VLIW Machines," in

Pro¢. ACM SIGPLAN 1988 Conf. Programming Language Design Implementation, pp. 318-328, June 1988.

[6] A. Aiken and A. Nicolau, "Optimal Loop Paral-leUzation," in Pro¢. ACM SIGPLAN 1988 Conf.

Programming Language Design Implementation,

pp. 308-317, June 1988.

[7] J. A. Fisher, '_rrace Scheduling: A Technique for Global Microcode Compaction," IEEE Trans.

Comput., vol. c-30, no. 7, pp. 478-490, July 1981. [8] M. D. Smith, M. S. Lain, and M. Horowitz,

"Boceting Beyond Scalar Scheduling in a Super-scalar Processor," in Proc. 17th Anna. Syrup. Compnt. Architecture, pp. 344-354, May 1990. [9] S. A. Mahlke, W. Y. Chen, W.-M. W. Hwu, B. K.

ltao, and M. S. Schlansker, "Sentinel Scheduling for VLIW and Superscalar Processors," in Proc.

5tit Int. Conf. Architecture Support Programming Languages and Operating Sysf., pp. 238-247, Oct.

1992.

[10] M. D. Smith, M. A. Horowitz, and M. S. Lain, "Efficient Superscalar Performance Through Boosting," in Proc. 5th Int. Conf. Architecture Support Programming Languages and Operating

Sy_., pp. 248-259, Oct. 1992.

[11] N. J. A]ewine, S.-K. Chen, C.-C. J. Li, W. K. Fuchs, sad W.-M. W. Hwu, "Branch

Recov-ery with Compiler-Assisted Multiple Instruction

Retry," in Pro¢. _th. Int. Syrup. Fault-Tolerant

Comput., pp. 66-73, July 1992.

[12] N. J. Alewine, Compiler-assisted Multiple

In-struction Rollback Recovery asing a Read Buffer.

PhD thesis, Tech. Rep. CRHC-93-06, University of Illinois at Urbana-Champaign, 1993.

[13] N. J. Alewine, S.-K. Chen, W. K. Fuchs, and W.-M. W. Hwn, "Compiler-assisted Multiple Instruc-tion Rollback Recovery using a Read Buffer," Tech. Rap. CRHC-93-11, Coordinated Science Laboratory, University of Illinois, May 1993.

[14] J. E. Smith and A. K. Pleszkun, "Implementing

Precise Interrupts in Pipelined Processors," IEEE Trans. Comput., vol. 37, pp. 562-573, May 1988.

(18)

[15]Y. Tamir andM. Tremblsy,"High-Performance

Fault-Tolerant

VLSISystems

UsingMicroRoll-bsck," IEEE Tr=_. Comp,t., vol. 39, pp. 548-554, Apr. 1990.

[16] A. V. Aho, 11. Sethi, and J. D. Ullman,

Compil-ers: Principles, Techniqwes, and Tools. Reading, MA: Addison-Wesley, 1086.

[17] M. Johnson, Superscalar Microprocessor Design.

Enghwood Clh_, NJ: Prentice-Hall, Inc., 1991. [18] C.-C. J. Li, S.-K. Chen, W. K. Puchs, and

W.-M. W. Hwu, "Compiler-Assisted Multiple In-struction Retry," Tech. Rep. CRHC-91-31, Coor-dinate! Science Laboratory, University of IUinois, May 1991.

[19] J. A. Bondy and U. Murty, Graph Theory with Applications. London, England: Macmillan Press Ltd., 1979.

[20] J. L. Hennessy and D. A. Patterson, Computer

Architecture: A Quantitatiee Approach. San Ma-t_'o, CA: Morgan Kaufmann Publishers, Inc.,

1990.

[21] J. K. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design,"