Design and implementation of an asynchronous version of the MIPS R3000 microprocessor

(1)

Rochester Institute of Technology

RIT Scholar Works

Theses

Thesis/Dissertation Collections

1-1994

Design and implementation of an asynchronous

version of the MIPS R3000 microprocessor

Kevin Johnson

Follow this and additional works at:

http://scholarworks.rit.edu/theses

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].

Recommended Citation

(2)

DESIGN AND IMPLEMENTATION OF AN

ASYNCHRONOUS VERSION OF THE MIPS

R3000 MICROPROCESSOR

by

Kevin Johnson

A Thesis Submitted

In

Partial Fulfillment of the

Requirements for the Degree of

MASTER OF SCIENCE

In

Computer Engineering

Approved by:

Graduate Advisor - George A. Brown, Professor

Roy S. Czemikowski, Professor and Department Head

Tony H. Chang, Professor

Department of Computer Engineering

College of Engineering

Rochester Institute of Technology

Rochester, New York

(3)

THESIS RELEASE PERMISSION FORM

ROCHESTER INSTITUTE OF TECHNOLOGY

COLLEGE OF ENGINEERING

Title: Design and Implementation of an Asynchronous Version of the MIPS

R3000 Microprocessor.

I, Kevin Johnson, hereby deny permission to the Wallace Memorial Library of

RIT to reproduce my thesis in whole or in part.

(4)

ABSTRACT

The purpose of this thesis is to show some of the advantages for the asynchronous implementation of amajor synchronous structure. Three people were involved with the design of an asynchronous microprocessor modeled after the MIPS R3000 microprocessor. This microprocessor implemented all of the MIPS reduced instruction _set, while _eliminating the need for synchronous _clocking throughout the chip. Paul Fanelli modeled the asynchronous processor _{using VHDL} (hardware description language). Kevin Johnson created circuit level designs and layouts of the arithmetic logic unit

(5)

Table Of

Contents

Abstract ii

TableofContents _iii

ListofFigures iv

ListofTables vi

Glossary

vii

1. Introduction 1

1.1 MIPSR3000 2

1.2 Pipeline 3

1.3 ALU stage 4

2. Registers 6

2.1 Data Dependencies 6

2.2 Register Design 13

3. Arithmetic Logic Unit 21

3.1 TTLModel 21

3.2 DataPath 28

3.3 DualPurpose 28

3.4 Conditional Statements 29

3.5 Opcode

Interfacing

32

4. Shifter Design 38

5. MultiplierDivider in theAsynchronousMicroprocessor 48

6. Bus Control 50

6.1 Inputs 50

6.2 Outputs 52

7. ExceptionCreation 57

8.

Handshaking

61

8.1 CompletionSignals 61

8.2

Handshaking

ControllerCircuit 67

9. Results 70

10. Conclusions 78

(6)

[image:6.552.62.491.116.700.2]

List OfFigures

Figure 1.1 Pipeline Stages 3

Figure 1.2

Top

LevelDesignoftheALU stage 5

Figure 2.1 Simple MethodFor

Adding

Three Numbers 6

Figure 2.2 Possible Paths fortheValid Bits 9

Figure 2.3 Program

Showing

Data

Dependency

9

Figure 2.4 Petri Net fortheData

Dependency

Problem 12

Figure 2.5 Two Phase

Clocking

Scheme... 13

Figure 2.6 Bus Utilization Scheme 14

Figure 2.7 ExampleofaLoad WordLeft Instruction 16

Figure2.8 LogicDiagram ofOne RegisterBit 18

Figure 2.9 LayoutofaOne-Bit Register 19

Figure 2.10 LogicDiagramofOne 32-BitRegister 20

Figure 3.1 Schematic oftheModified ALU 22

Figure 3.2 Equations for Intermediate

Carry

Signals 24

Figure 3.3 LookAhead Circuit 26

Figure 3.4 Schematicof a32-bitAdder

Using

4-bitLookAhead

Carry

Adders 27

Figure3.5 SignExtension Circuit 30

Figure 3.6 Condition Analyzer 31

Figure3.7 ConditionBitGenerator 34

Figure3.8 InstructionInterface Circuit 35

Figure 3.9 Bit

Coding

oftheR3000Instruction Set 36

Figure 4.1 Arithmetic andLogicalShifts 38

Figure 4.2 Basic

Shifting

Cell 39

Figure 4.3 Simple Connection For

Shifting

Cells 40

Figure 4.4 CircuitDiagramofA 3-1 Multiplexor 42

Figure 4.5 Layoutofa3-1 Multiplexor 43

Figure4.6 Layoutofa2-1 Multiplexor 44

Figure 4.6 Exampleof a

Driving

Circuit 46

Figure4.7

Analog

Simulationofthe

Driving

Circuit 47

Figure 6.1 InputBusControl Block 51

Figure6.2 A-Bus Selector 53

Figure 6.3 B-Bus Selector 54

Figure 6.4 Bus Selector Decoder 55

Figure6.5 OutputSelector 56

Figure 7.1 ExampleofOverflow Conditions 58

Figure7.2 SchematicoftheOverflow Logic 59

(7)

Figure 8.1 DCVSLNOR GatewithCompletion Signal 62

Figure 8.2 PathsofDCVSLOutputs 63

Figure8.3 ComparisonofDCVSLand

Complementary

CMOSLogic 65

Figure 8.4 LayoutComparison ofDCVSLversus

Complementary

CMOS 66

Figure 8.5

Handshaking

Controller Circuit 68

Figure 9.1 SimulationofUnsigned Addition 73

Figure 9.2 Simulationof aBranch Not Equal toZeroInstruction 74

Figure9.3 SimulationofaSignedInstruction

Causing

an Exception 75

Figure 9.4 SimulationofaShiftInstruction 76

(8)

ListofTables

Table 2.1 Incorrect

Scheduling

ofDependentInstructions 10

(9)

Glossary: ALU ALU Stage Asynchronous CISC CMOS DCVSL FPU HCC HI ID stage IFstage LOW LWL LWR MEM stage MIPS

Arithmetic Logic Unit

Partofthepipelinewhere all arithmetic operations takes

place.

Actions donotoccur at specific timeperiodsor when

another actionisscheduledto takeplace.

Complex instruction setcomputer.

Complementary

metal-oxide-semiconductor

Differential Cascade Voltage Switched Logic

Floating

point unit.

Handshaking

ControllerCircuit.

32-bitregister used toholdthe high32 bitsin a_multiply operation orthequotientin adivideoperation.

Instruction decodestage ofthe pipeline wherethe

instructionis decoded into signals.

Instructionfetch stageofthepipeline wherethe

instructionis obtainedfrommemory.

32-bitregister used toholdthe low 32 bitsin a_multiply operationorthe remainderinadivideoperation.

Loadwordleft.

Loadword right.

Memory

stage ofthepipeline where all_memory

operations areperformed.

(10)

MIPS

NMOS

NOP

PC

PCL

R3000

RISC

RIT

Synchronous

TTL

VHDL

VHSIC

VLSI

WB stage

Millionsofinstructions per second.

N-typemetal-oxide-semiconductor.

Instruction thatcuses _nothing tohappen.

Programcounter.

Signalnameforthe latchedoutputofthe program counter.

Aversion ofthe MIPS architecture used in aparticular

32-bitmicroprocessor.

Reduced instruction setcomputer.

Rochester Instituteof

Technology

Actions occur_according to adeterministic schedule, e.g.

Actions occuratspecific phases ofaclock pulse.

Transistor TransistorLogic

VHSIC HardwareDescriptionLanguage

Very

HighSpeedIntegratedCircuit

Very

LargeScaleIntegration

Write backstage ofthepipeline wherethedata is written

(11)

1.

Introduction

The purpose of this project is to show some of the advantages of the

asynchronous design methodology. The design should be large enough to

allow the increaseinspeed caused

by

the asynchronous operation to be greater

than the decrease in speed caused

by

the extra circuitry. A thirty-two bit

microprocessor was chosen for this project. The asynchronous design methodology eliminates the need for _clocking circuits

by

_adding

handshaking

circuitry.

Clock lines in a microprocessor need to be routed to _every part of the

chip. These lines cause a problem in the operation of the _chip called clock

skew. The clock

lines,

upon which the clock signal_travels, spread outin many directions from the clock driver. This signal is supposed to change the state of

many circuit elements throughout the _chip simultaneously. Since these lines are not the same

length,

the clock pulses on one side of the _{chip may} or _may

not be in phase with the pulses on the other side. In order to make sure that

circuit elements on both sides of the _{chip have} changed state _using the same clock pulse, the period of the clock is extended. This makes sure that the

clock pulsehas hadtime to travel throughout thechip.

The clock lines and the gates that are driven

by

the clock create a large

capacitive load. This capacitance must be overcome in order for the clock

signal to change state in a reasonable amount of time. This causes the clock driverto be _necessarily large. The capacitance in the line must becharged in

order for the voltage on the line to come to a true logical-one value. The

capacitance also must be discharged in order for the value to return to the

logical-zero state. The time needed to charge and discharge the large

(12)

frequency

is faster than the _charging and

discharging

times, the clock pulses

will never reach their true values but will _stay

fluctuating

around some mid-levelvoltage.

Within a large clocked _{circuit, the} clock speed must be slow enough to

allow the slowest part of the circuit to be able to complete its task before the

beginning

of the next clock cycle. In large

designs,

this minimum time between clock pulses becomes a problem. For example, in a microprocessor

this minimum time is the time it takes for the slowest instruction to execute. This can be a

long

period of time because operations such as multiplication

take longerthan operations such as shifting. Even in simple

designs,

there can be large discrepancies in the time it takes for an instruction to execute. In a

thirty-two bitripple_carry adder,for example, it takes two gate delays for two

numbers to be added _together, when there are no carrys within the circuit. _If,

however,

a_carryiscreatedin everysingle stage, each addition stage must wait

until the _carry is obtained from the _proceeding stage. This condition takes 32

timeslongerthan theprevious example.

Inan asynchronous

design,

theclockiseliminated and_circuitryis added

toensurethateach stage willfinishit'sexecutionbefore thenextstagebegins.

Thecircuitwillevaluateatthemaximum speedpossibleforeachstage.

1.1 MIPSR3000

The synchronous circuit to be modified in this project is the MIPS

R3000microprocessor. It is a 32-bitmicroprocessor with an off _chip

floating

pointunit. Itemploys afive-stage pipeline structure with a reduced instruction

(13)

are_onlythreedifferentstyles forthe instruction words, immediate, register and

jump.

Themicroprocessorwas implemented in a process similarto the MOSIS

two micron process that was used in the RIT VLSI labs. The R3000

microprocessor _is a _relatively old microprocessor and there is a lot of

information available.

Many

books have been written

describing

this

architecture.

1.2 Pipeline

The Mips processor uses afive stage pipeline structure. An example of

thispipelineis shownin figure 1.1. Each stage performs adifferentpartofthe

execution of an instruction. The stages are called instruction

fetch,

instruction

decode,

arithmetic logic _{unit(ALU), memory,} and write back. The instruction

fetch stage gets the instruction from memory. The instruction decode stage

decodes the instruction into useful parts. The arithmetic logic unit performs

calculations. _{The memory} stage performs fetches and writes to memory. The

writeback stage puts completedinformation into theregisterbank.

IF

instruction fetch

ID

instruction decode

ALU

arithmetic logi

unit

MEM

memory

WB

write back

(14)

Because ofthese five stages, five instructions canbe operated on at the

same time. Different instructions will be located at different stages in the

pipeline simultaneously. This allows the microprocessor to execute

instructions fasteronce the pipeline is filled. The pipeline in the MIPS R3000

is clocked. The speed of the processor is the speed that the instructions are

able to move from stage to stage. Because this speed is controlled

by

the

clock, itmustbeableto accommodatethe slowest stagein thepipeline. In the

asynchronous

design,

the stages are controlled

by

signals that mark the

beginning

and end ofthe operation in each stage. This does not depend upon the worst case stage time and the stages are allowed to operate as fast as

possible. This projectis designed to provethatan asynchronousprocessor will

operatefasterthanits synchronouscounterpart.

1.3 ALU stage

This particular thesis _only deals with the arithmetic logic unit stage of

the total design. The thesis

by

Scott Siers deals with the other pipeline stages

and the thesis

by

Paul Fanelli describes the VHDL _modeling of the

microprocessor. The ALU stage is split _up into fivemajor parts. These parts

are the registers, the ALU, the shifter, the

handshaking

controls and the

exception creation. A _top level diagram of the ALU stage is presented in

(15)

instruction pcu

\y

in& star

^

hLlow rca

aiie

i(3l

Bus Ctrl Block

(31:1) :)) indnKiignC3l:B) el.in(3l:8> irradiate Hert hmOl: intl(3li. KljuKiM Decode

Shift Ctrl

J

no

intl(3l!fl)

;,n Alu32b it

CBUI s Conpare qui brl Ltd! Shifter ii 12 3 to s

M(31:9)3₃ <

1

SLI -OB(31:B) -OpcLout(31:0) "Construction_to_next_stage(31:0) wioiio)Overflov M(3I:1> owrfli >except int) (31:!) Bra.Ctrl ***"*"*

Obranch

\Ja ^tput Sel Sox

outOI: (31i01

intlnitHm(!li!) ffi-f>0utput(31:0)

(16)

2.

Registers

2.1 Data Dependencies

Data dependencies occur when an instruction needs the output from a

priorinstruction thathasnotleftthepipeline. For_example, a case might occur

where three numbers need to be added together. A simple method to accomplishthisoperationisshownin Figure2.1.

AUUn Rt R2 R5 addsRl toR2and the resultsis in R5 Mm R? R? R5 aits R5 toR3andthe ttmltsis inR5

Figure 2.1 Simple Methodfor

Adding

Three Numbers

This method adds the first two numbers together and adds their sum to the

third. In _{the pipeline,}

however,

these instructions will

immediately

follow

each other. The answer will be created from the first instruction when the write backstageis reached. Thesecondinstruction needs the information from the first instruction in the ALU stage. The memory stage is located between the write back and ALU stages. A space or time

delay

is needed between these twoinstructions so thatthefirstinstruction willbe in the writebackstage

when the second is in the ALU stage. Another instruction can be placed between the two add instructions to provide the space or time delay. This instruction would be in the _memory stage when the first add instruction is in

thewritebackstage.

Any

non-dependentinstructioncanbe usedforthis space

and, ifone cannot be found from the program, a simple instruction thatcauses

(17)

Most of the MIPS R3000 processor's data dependencies are taken care

ofinthecompiler or at software level. This canbe done because theprocessor

works on asynchronous

(clocked)

pipeline. For example, a branch instruction

causes a _{data dependency. A} test is needed to determine ifthe branch will be taken. This test occurs in the ALU stage, while the outcome of the branch instruction affects the instruction fetch stage. The instruction decode stage

falls between these two stages. Inthe R3000processor, thebranch instructions

are

immediately

followed

by

an executable statement. This executable

statement acts as thespacebetween the testforthebranch and theexecution of the branch. This is similar to the _way a space was needed in the addition

example above. After this second instruction _executes, the branch is then allowed to occur. The detection of these data dependencies can be done in

software and the insertion of spacer

instructions,

to eliminate the

dependency,

canbe done atthecompiler level.

Inan asynchronous _processor, one_simplycannot assume that

letting

one

instructionpassthough astage willallow enoughtimefor the data toleave the

pipeline. An instruction _may be in theprocess of_moving into the write back

stage when the dependent instruction begins to be executed in the ALU stage.

Thewrite backstage has notyetputthecorrectdatainto theregisterbeforethe

ALU stage operates on the register. This will cause the processor to act on

erroneous dataand thereforemustbe avoided.

One solution to this problem is to allow the data to be stored at intermediate points _along the pipeline and allow the stored data to be

retrievable for the dependent instruction. This would mean that _every stage

would remember the output from the instruction that _{had previously} passed though it. This implementation would be useful if the dependent instructions

(18)

was described earlier, the dependent instructions in the synchronous

microprocessor are separated

by

a spacer instruction to insure proper timing.

Because one of the goals of this project is to be compatible with the MIPS

architecture, this solution is now invalid. In order to allow dependent

instructions to be spaced _out, the intermediate storage devices now need to be

capable of _storing more than one value. This solution is problematic at best

because

increasing

the number of storage spaces reduces the speed of the

pipeline.

Another solution to this problem is to _tag the data that is stored in the

register bank. If an instruction is about to leave the instruction decode stage

and passintothe ALUstage, theinstruction can causethe value in _{the register,}

that theinstructionusesas a_target,to beinvalidated. The next

instruction,

ifit

is dependenton that

data,

would then haveto wait untilthe data intheregister

is validated. In order to implement this scheme, another bit is needed in the

register to indicate whether that register has valid or invalid data. When the

instruction passesinto theALUsection, thisregisterbitwould besetto invalid

and, when the instruction leaves the write back _{section, this} bitwould then be

setback to valid. At this pointit will be assumed _that, when the processor is

reset, thecondition bitsforalltheregisters willberesetto a valid state.

This schemeworksfora clocked systembutinanasynchronous system

thereare some hazardsthatneed tobe addressed. For

instance,

it is

conceivablypossiblefortwoinstructionsto _{be setting}and_clearingthe valid bit

atthe sametime. Thiswould cause anindeterminate state, orrace _condition,

depending

onwhich stage wasfaster. To eliminatethisproblemboth the

instruction decodestage and thewritebackstage gettheirownbitwhich

they

cantoggle. The two bitscanthenbeexclusive ORedtogetherinorderto

(19)

assigning the states 11 and 00tobe_{valid, the}raceconditionthat_may occuris

eliminated.

This method appearsto answer allthe problems with datadependencies,

however,

one problem remains. Until_{this point,} theinvestigation _{has only}

Figure 2.2 Possible Paths forthe Valid Bits

been concerned with the possible dependencies in the data part of the

instruction. The target part is just used to designate which register will be

invalidated. The flaw in this scheme can be shown in the simple program in

Figure2.3.

ill

2

3

Addn R I R2 R3

Addu R4 R5 R3

lm i;:

- ysil _n

.'.! M'l pl.ufNn in _i3 ,i,jtfe 14 %tti5 md pl.kVN n in i3

---addoneto R3

Figure2.3Program

Showing

Data

Dependency

The first two instructions are not data dependent but the third one is. It is

interesting

to note that instruction 1 is meaningless but still is a valid

instruction forthe program. An exampleofthe execution ofthis program is in

(20)

time

1

2

3

4

5

6

Pipeline Stages

IF ID ALU MEM WB

3 2 1

3 2

3

TaWe2.7 Incorrect

Scheduling

_ofDependent Instructions

As is shown in thetable the third instruction stalls until the valid bit has been

set. Thevalid bit is set attime4_allowing the third instructionto enter. This is

incorrect because the first instruction set the valid bit. In

fact,

the third

instructionwas supposedto waitforthe secondinstructionto finish.

The solution for this problem is to _only allow the instruction to enter

into the ALU stage if alldata and targetregisters are valid. This will stall the

second instruction until thefirstinstruction leavesthe pipeline. _{The validity} of

thismethod is verified in the petri net shown in the Figure 2.4.

By

_using this

method, no dependencies can occur. No set of instructions will cause the

microprocessor to operate on erroneous data caused

by

a dependency. Some care should be taken when _programming this type of microprocessor

because,

(21)

willbevery low. Theprogram will still run _but, since_many ofthe instructions

will become stalled in _{the pipeline, the} program will not run at maximum

(22)

IF)

Data

Dependent

Operations

To

The Same

Register

Valid Register

(23)

2.2 Register Design

The registers, where data is stored for eventual _use, are an important

part of _any microprocessor. Due to the asynchronous structure of this

processor, some special considerations need to be made, in order to eliminate

hazard conditions that _may occur in

loading

and _{retrieving information from}

the register bank.

First,

the synchronous _method, as used in the R3000,

processor will_{be investigated.}

Then,

the method thatallows theregisters to be

used inanasynchronous mannerwill _{be described.}

One problem encountered when

designing

registers is _preventing access

to the registers for simultaneous _reading and writing. If this

happens,

a race

condition willbe created_makingitunclear whethertheinformationobtained is

an old number stored in the register or a new number

being

written to the

register. A second problemiscaused

by

thefact that to_{do any} calculation two

numbers are needed. Since registers _generally write to a common data

bus,

gettingtwo sets ofdatacanbecomedifficult.

The R3000processor solves bothoftheseproblems

by

_usingatwo

((,1

<|)2

c

A B D

Fig

2.5 Two Phase

Clocking

Scheme

phase, non-overlappingclock. Anexample ofthis_clocking schemeis shownin

(24)

instruction fetch stage grabs the data to give to the arithmetic logic unit stage.

The clock eliminates the hazard that occurs between

loading

and _retrieving

numbers and solves the first problem of _preventing simultaneous _reading and

writing.

In order to solve the second problem of _obtaining two sets of data, the

R3000 uses one _{32-bit data bus} and multiplexes the data from the registers to

the correct lines in the arithmetic logic unit. Figure 2.5 shows that there are

four distinctperiodsinthe _clockingscheme. Onepieceofdatacan be obtained

during

the period marked with a 'C and another piece obtained

during

the period marked with a'D'. This eliminates _any chance of

having

datacollide on

thebus.

In the asynchronous _{microprocessor,} one does not have the

luxury

of

knowing

exactly when certain events are _going to happen. There is no

mechanism to _stop the write back stage from _writing at _any _time, _potentially

causing a hazard condition. In order to eliminate this problem, register

accesses need to be allowed to happen in the register bank simultaneously.

This costs some space on the_chip because, while the R3000 uses one bus and

multiplexesitsusebetween thevarious clockphases; intheasynchronous

write back VVTlte y,r back / ^ \U Asynchronous ProcessorRegister Bank

A

\1/

ALUa andb

,/32 ,/3

ALUa

\l/

ALUb

(25)

microprocessor, three different busses and _{corresponding} logic are required.

Threedifferentbusseseliminatethehazard of contention on thebus but notthe

hazard caused

by

_readingand _{writing simultaneously}to the same register. This

is similar to the data

dependency

problem described earlier in the data

dependency

section. The use of aregister

flag,

to indicate whetheraregister is

safetoread from or write_to, solves thelogical

dependency

problem thatoccurs

in the pipeline. This method also protects the register's data coherency. The

instruction fetch stage waitson the

registers'

state before accessing the register.

This solves the problem of

having

both instruction fetch and write back stages

accessingtheregisters atthe sametime.

In the asynchronous version of this processor, there are thirty-two

generalpurpose registers.

Also,

there aretwo additional registers that are used

by

themultiplier/dividerto hold sixty-fourbitnumbers. Themultiplier/divider

registers are accessed

differently

and arelocated in adifferentpart ofthe chip.

Theseregisters are _only used

by

the multiplier/divider and

by

four instructions

thatmovetheoutput ofthemultiplier/dividerto thegeneral purposeregisters.

Two other instructions also operate the registerbank differently. These

instructions areload wordleft

(LWL)

andloadword right(LWR). Onetofour

bytes of _memory can be loaded into a register with these instructions. What

makestheseinstructions different fromthe standard load instructions is that the

bytes not indicated in the target register are not changed after the end of the

operation. Ifathirty-two bitwordisneeded andit does notfall on regular byte

boundaries, acombinationofthese instructionscanbe used to getthe number.

An example of this is shown in Figure 2.7. In this example the register is

loaded with the three low order bytes from memory location 0 without

affecting the high order byte. In order to implement these

instructions,

more

logicmustbeaddedto theregisterbank. Thewritebackstage sends four

(26)

memory

address4 address0

A s fi 7

0 1 2 3

register x

before a b c d

LWL$x,l($0)

a 1 2 3

after

Fig

2.7_{Example of}aLoad Word Left Instruction

signals to the register bank before executing a load instruction. These four

signals select which byte within the register will recieve the new data and

which _bytewillbethe same. Under normal operation these signals will always

be active so that the whole register is affected. For aload word left or a load

word right, thesesignalswillindicatewhichbyteintheregister willbe loaded.

This is performed

by ANDing

the load signal with the select signals that are

sent

by

thewritebackstage.

The general purpose registers are madeout simple components designed

tofittogether to createthelargeregisterbank. Thirty-two ofthese components

are placed side to side tomake a single thirty-two bitregister. A logic diagram

of one bit of a register and the layout of that logic have been included in

Figures 2.8 and2.9. As canbe seen

by

these

figures,

theregisters are designed

with maximum modularity. A _thirty two bit register can be implemented

by

combining thirty-two of these one bit registers into four groups of eight bits.

Each eight-bitregister has an AND gate to control the selection for LWL and

LWR. The four register banks are combined together and two toggle registers

(27)

diagram of a thirty-two bit register is shown in Figure 2.10. The layout

(28)

asE>

idty

inO

bsO

Oa

-ob

(29)

one

bit

register

bs-bar

as-gar

nz

(30)

32bit

Register

as

O-

bsO-

in<31:0)O-

reg_naskOO-reg.nasklC^

reg_nask20"

reg_nas

k30-alu-tO"

reseto

loadO"

' ₀

VCC

'CLK

K 5-<P

r

K 5-iPf-1 I

" _8bit

reo <,:8)

8bit reo "<7:6)

B _8b_it

reo <7:0,

*-k7ib:

8bit reo a:0J

Nv

_>A<ai:0>

<5&M\

x>

-Ovalid

(31)

3. Arithmetic

Logic Unit

The Arithmetic Logic Unit

(ALU)

is the brain of the microprocessor.

Within the

ALU,

_{all calculations are performed.} _{It is} a complex circuit that

implements _{many different functions.} In most microprocessors, the ALU is

split into _{two parts,} _{one part operates} _on

floating

_point _numbers _and another

part operates on integernumbers.

Operating

on

floating

point numbers is more

complex due to the number's size and format.

Floating

point numbers are

typically

64 bits long. In the R2000 and R3000 _{microprocessors,} the

floating

point unit

(FPU)

is found on another _chip

(coprocessor)

that is placed beside

the microprocessor. Integer operations consist of_performing combinations of

adding, subtracting, multiplying,

dividing

andlogic operations.

3.1 TTL Model

The asynchronous microprocessor's ALU could not be modeled _exactly

after the R3000 ALU because no information on the structure of the R3000's

ALU was available. A similar ALU was found in the Fairchild Advanced

Schottky

TTL Data Book. This ALU performs all the required functions

necessary for the asynchronous _{microprocessor,} except A NOR B. The ALU

was modified toperform this operation

by

_{adding XOR} gates ontheinputs and

performing anAND operation. The XORgates complementtheinputs and,

by

DeMorgan'sLaws, theNORoperationisobtained.

The schematic ofthe modified ALU is shown inFigure 3.1. This ALU

has been designed using a look ahead _carry format. The look ahead _carry

(32)

Cout

(33)

implementation

was chosen because it is fasterthan the standard ripple _carry

implementation.

In a ripple _carry _adder, the two inputs are added together with the _carry

from the _preceding stage. This produces an output and a _carry to the next

operation. _{This carry}thenproceeds torippledown thechain ofbits to the end,

where it _becomes the _carry out. The advantage of this design is the ease of

implementation and the size ofthe circuit. The disadvantage is its slow speed.

The time needed to perform the ripple _carry addition can be calculated

by

finding

the amount oftime ittakes to create a_carry signal and _multiplying this

time

by

the number ofbitsthroughwhich this signal needs tobe carried.

If,

for

example, it takes 3 nanoseconds to create a _carry _signal, _then, for a 32-bit

adder, it would take 96 nanoseconds, in the worst _{case, to} add two numbers

together.

The look ahead _carry adder operates differently. Each pair ofinputs are

combined and a generate signal and a propagate signal are created. Each

output can now be created

by

_combining its own generate and propagate

signals with the generate and propagate signals from the _preceding pairs. The

advantageofthis method is that the entire structure ofthe adderhas only three

levels of_gates, one level to create the generate and propagate signals and two

levels to combine the signals together. For _example, ifthe gate

delay

of one

level were 3 nanoseconds, the time needed to add two numbers togetherwould

be 9 nanoseconds. This circuit is more than ten times faster than the ripple

carryadder.

The equations for the _carry in signals are shown in Figure 3.2. From

these _carry signals, derived fromthe generate and propagate signals, the output

(34)

PUk=Afn+B(i>

s-"

:-G<ifeA(imii)

|;.:

y,,

Ci=Gn+PftCh

C2=Gi+PiGo+P}PoCo

C3=n2+P2C._|_+P2PiG0+P2P_|

PoQ,

Figure3.2 EquationsforIntermediate

Carry

Signals

generate and propagate signals does not require signals other than the inputs,

theoutputscan all becreated simultaneously.

In practice, alook ahead _carry adder is _normallynotbuiltfor morethan

eight-bit adders. As the size of the adder _increases, more generate and

propagate signals need to be combined to perform the addition. (Refer to the

The larger gates needed to handle the greater number of

inputs cause theadder to be less efficient. A compromise between speed and

gate size can be reached

by

_creating another circuitto implement a look ahead

carry betweenblocksof smaller adders. This addsthree more gate levels to the

over all circuit and slows the hypothetical adder to 18 nanoseconds. This is

still a greatimprovementovertheripple _carry adder. An exampleofthis extra

look ahead circuitis shown in Figure 3.3. Itis similarto the look ahead _carry

within the adder circuit but works with the generate and propagate signals

created

by

the adderblocks. Smaller ALU blocks are used inthe asynchronous

microprocessor to perform their operations upon four bits at a time. An

example of how these smaller ALU blocks are connected can _{be found in}

Figure 3.4.

To extend the design of the look ahead _carry adder into an

ALU,

(35)

allows the look ahead _carry adder to create the proper output. This selection

(36)

(37)

(38)

3.2 DataPath

The datapath in the asynchronous microprocessor is a

"true"

thirty two

bit data path. In microprocessors that use a _clock, the sequence for

loading

data into the ALU stageis to load the firstnumber on the firstclock pulse and

the second number on the second clock pulse. The advantage of this is that

only one bus is required to send data signals from the registers to the ALU.

Thedisadvantageof_{using only}onebusto move the numbers fromthe registers

to the ALU is that it takes twice the optimum time. In the asynchronous

microprocessor, thereis noclock touse when _{scheduling bus} utilization. Each

numberneedstohave itsownbus to travelfrom theregistersto the ALU. This

causes more lines to be used but improves the overall _efficiency of the ALU's

operation.

3.3 Dual PurposeoftheALU

The R3000 processor uses the ALU for more thanjust arithmetic and

logic operations. It also uses it for base offset address calculations. When a

registerisused to holda_memorylocation and an offsetis addedto this register

to point to a desired location in memory, it is called base offset addressing.

This is _commonly used fortable

lookup

and stack operations. The ALU stage

occurs beforethe _memory stage in the pipeline order. This allows theALU to

calculate addresses for the _memory stage before the instruction is executed

withinthe_memorystage.

Because the offset is coded into the instruction word, a method is

(39)

betweentheALU andtheregisters. Thisoffsetis always thelower sixteen bits

of the _instruction. _Because _offsets _can _be _{either positive} or _negative, this

number mustbe sign extendedfrom sixteen bitsto 32 bitsto be used

by

the

32-bit ALU. Thecircuitthatperforms this operation is shown in Figure 3.5. The

circuitmultiplexesbetween the data lines from theregisters and thedata in the

instructionopcode.

3.4 ConditionalStatements

The R3000microprocessor's ALU also is used to compare two numbers

to create conditions that are used to resolve branch on conditional statements

and set on conditional statements. The conditions that are used in the R3000

microprocessor are _equal, greater than or equal _{to zero,} less than or equal to

zero, greater than zero, less than zero, and not equal. These conditions are

evaluated within the ALU and a condition bit is sent back to the instruction

fetch stage. The branch destination is calculated in the instruction fetch stage

while the ALU is

determining

whether the branch will be taken or not. The

conditionbitthenindicates thelocationthatcontainsthe nextinstruction.

No information is available on _exactly how these condition bits are

produced in the R3000 microprocessor. The asynchronous processor performs

this operation

by

_adding the two ALU input busses together. The instruction

fetch stage places the two pieces ofdata on the busses. Ifthe operation deals

with acomparison_{to zero,}register zero is usedforonepieceofdata. The zero

register in the data bank is always hardwired to contain a zero value. The

(40)

o

T

C_ o

o CD

CD CO

CO

_D CO

CO

00

(41)

<0

-t><0

in(31:0)O

-O>>0

Zero

Tesf

equal

"E^equal

-Oout(31:0)

(42)

numbers are _indicated

by

a onein thesign bit. Positive numbers are indicated

by

azero in the sign bit and aone somewhere else within the number. Zero is

indicated

by

allthenumbers

being

zero.

The analysis is then fed into another block. This block converts the

comparison outputs into a control line for the branch instructions. The

instruction decode stage uses this control line to determine whether to take the

branch or not. This block is shown in Figure 3.7. The bit code for branch

instructions includes two bits in which acondition is encoded. These two bits

arebit 26 and _{27. These bits} arethenfed into a4-1 multiplexoras selectlines.

Another series of conditional statements are the BCOND statements.

BCOND stands for branch conditionally. These statements are special

instructions unlikethe other conditionalbranches. There are_only two types of

BCOND statements.

They

aregreaterthan or equal to zero andless than zero.

These statementsare identified

by

bit 16oftheBCONDinstruction.

The Figure3.7 shows how the two typesofinstructions areresolved and

muxed togetherto createthebranchcontrol line.

3.5 Opcode

Interfacing

The disadvantage of not

having

_{any information} on the R3000's ALU is

that the control lines ofthe model ALU do not match the instruction _encoding

ofthe R3000. A circuit is required to decode the instruction _{before sending it}

to theasynchronous ALU. This circuitis shownin Figure 3.8.

The

interfacing

circuit was created

by decoding

the operations

performed inthe R3000 instruction set and _matching them to themodel ALU's

instructions. Figure 3.9 shows the bit_coding oftheR3000 instruction set and

(43)

uses two bit fields to indicate instructions. The highest five bits indicate the

instruction. If these bits are all _{zero's, this} indicates a special instruction and

the low five bits are used instead. If it is not a special instruction, the instruction is either an immediate or a conditional instruction. All register

instructionsare special instructions.

(44)

inst(31:0)1

Obranch-control

(45)

A

Interface

Logic

instruct ion(31:0)O-Sr

P

Ar

4>

Aaama

Ay

yy>

D

-oso

g^^nSf

t>T

^Dr

A

A-

SA

DASS

AA

rOsi

-OS2

-OS3

On

Figure3.8InstructionInterface Circuit

(46)

OPCODE bits 31..29 0 bits 28..26 0 1

SPECIAL BCOND J JAL BEQ BNE BLEZ BGTZ

ADDI ADDIU SLTI SLTIU ANDI ORI XORI LUI

COPO1 COP11 COP21 COP31 * * * *

* * * * * * * *

LB LH LWL LW LBU LHU LWR *

SB SH SWL SW * * _SWR *

LWCO1 LWC11 LWC21 LWC31 * * * *

SWCO1 SWC11 SWC21 SWC31 * * * *

SPECIAL

bits 2..0

bits 5..3 0 1

0 SLL * _SRL _SRA

SLLV * _SRLV _SRAV

JR JALR * * _SYSCALL _BREAK * *

MFHI MTHI MFLO MTLO * * * *

MULT MULTU Drv divu * * * *

ADD ADDU SUB SUBU AND OR XOR NOR

* * _SLT _SLTU * * * *

* * * * * * * *

BCOND

bits 18..16

20..19 0

0 BLTZ BGEZ * * * * * *

* * * * * * * *

BLTZAL BGEZAL * * * * * *

* * * * * * * *

*_{Illegal instruction}

* Coprocessor instructionswere notimplemented

(47)

INPUTS OUTPUTS

FUNCTION SO SI S2 Cn An Bn FO Fl F2 F3 G P

CLEAR 0 0 0 X X X 0 0 0 0 0 0

0 0 0 1 1 1 1 1 0

BMINUS A 1 0 0 0 0 1 0 1 1 1 0 0

0 1 0 0 0 0 0 1 1

0 1 1 1 1 1 1 1 0

1 0 0 0 0 0 0 1 0

1 0 1 1 1 1 1 0 0

1 1 0 1 0 0 0 1 1

1 1 1 0 0 0 0 1 0

0 0 0 1 1 1 1 1 0

AMINUSB 0 1 0 0 0 1 0 0 0 0 1 1

0 1 0 0 1 1 1 0 0

0 1 1 1 1 1 1 1 0

1 0 0 0 0 0 0 1 0

1 0 1 1 0 0 0 1 1

1 1 0 1 1 1 1 0 0

1 1 1 0 0 0 0 1 0

0 0 0 0 0 0 0 1 1

A PLUS B 1 1 0 0 0 1 1 1 1 1 1 0

0 1 0 1 1 1 1 1 0

0 1 1 0 1 1 1 0 0

1 0 0 1 0 0 0 1 1

1 0 1 0 0 0 0 1 0

1 1 0 0 0 0 0 1 0

1 1 1 1 1 1 1 0 0

X 0 0 0 0 0 0 1 1

AXORB 0 0 1 X 0 1 1 1 1 1 1 1

X 1 0 1 1 1 1 1 0

X 1 1 0 0 0 0 0 0

X 0 0 0 0 0 0 1 1

A ORB 1 0 1 X 0 1 1 1 1 1 1 1

X 1 0 1 1 1 1 1 1

X 1 1 1 1 1 1 1 0

X 0 0 0 0 0 0 0 0

AANDB 0 1 1 X 0 1 0 0 0 0 1 1

X 1 0 0 0 0 0 0 0

X 1 1 1 1 1 1 1 0

X 0 0 1 1 1 1 1 1

PRESET 1 1 1 X 0 1 1 1 1 1 1 1

X 1 0 1 1 1 1 1 1

X 1 1 1 1 1 1 1 0

1=_{High Voltage Level}

0=LowVoltage Level

X=Don't Care

(48)

4.

Shifter Design

Shifters are important to _any microprocessor because _they prepare the

data for use in other stages. A shifter used in a microprocessor is a _steering

circuit thatmoves the bits to the correct positions for the final output. There

are four _different types of _shifting _operations; _arithmetic, _logical, rotate, and

roll.

The R3000 architecture requires _only arithmetic and logical shifts.

These shifts operate similarly.

They

move the bits either left or right and, on

the edges ofthe number, _they either sign extend or input a 0. An arithmetic

shiftinstruction shiftsthe bits eitherleftor right. The first bitisboth held and

shifted at the same time. An example of this is shown in Figure 4.1. In a

logical _shift, a zero is always shifted into the number.

By

_shifting

left,

a

quick _multiply

by

two can be done without

having

to run it through the slow

multiplier.

0

-^L>^L>-^L>^L>^L>^n

Shift

Right Logical

Shift

Right Arithmetic

(49)

The R3000architecture containsno arithmetic left shift. The arithmetic

left shift is meaningless because it is identical in operation to a logical left

shift. _{The R3000} _architecture _{also supports} _dynamic _shifting. _A register can

be designated in the instruction to indicate how far to shift. The five least

significant bits of the register are used to control the _shifting operation. The

otherbitsare _ignored. _Rotateand rollshiftsusethebits shifted off one edge of

the number as the input to the other edge.

They

are not implemented in the

R3000reduced instruction set.

The shifter for the asynchronous microprocessor can be implemented

using a_multiplexing networkbecausethere are no rotate or roll instructions in

this processor. This implementation will not _{greatly increase} the space

required.

no shift

left input

Py

right input

Arith/logic

\

Shift/noshift

>

\\s \i/ y

3-1

MUX

output

y

Fig

4.2 Basic

Shifting

Cell

The basic cell for one

bit,

in the asynchronous

design,

is shown in

Figure 4.2. This cellis a3-1 multiplexorand has been created

by

_modifying a

(50)

4-1 multiplexor, because _only three inputs are required. A number of basic

cells arethenconnected toformthe_multiplexing network. A simple version of

this network is shown in Figure 4.3. The shifter can handle thirty-one bit

shifts. There are _{five bits in} _the _instruction

designating

_how

many lines to

shift. _Because _there _are _five _control _lines _for _{the shifter, there} _are _five

shifting stages. Each stage shifts

by

a power of two. The first stage shifts

by

2,

the second shifts

by

2l _{and so on.}

During

thelast stage the displacementis

sixteen_{bits. This does}_cause

long

_lines_to _be

formed, however,

producing line

capacitance which slowstheshifter.

Fig

4.3Simple ConnectionFor

Shifting

Cells

The multiplexors are created _using transmission gates. The

drivers,

which are needed inorderto overcome the loss of

driving

_capability caused

by

the transmission gates, enable the addition of a moderate amount of line

capacitance. These drivers can _easily handle the line capacitance caused

by

the

long

_routing channels neededinthisdesign.

The advantage of this shifter design is that the basic thirty-two bit part

can be used in other places where a three to one decision needs to be made.

(51)

second bus handles the immediate

instructions,

register instructions, and base

offsetcalculations.

The shifter measures 1800 microns wide

by

900 microns high. The

basic 3-1 mux cellcircuitdiagram is shown in Figure 4.4 and the layoutofthe

circuit is shown in Figure 4.5. 160 ofthese cells were used to form the entire

shifter. To select between logical and arithmetic shifts, five 2-1 multiplexors

were used. These multiplexors were obtained

by

_modifying the 3-1

multiplexor. The layoutofthiscellis shownin Figure 4.6.

(52)

LeftOi

rights

1E>

2E>

cy

(53)

nux

(54)

A

m

set

e

seie

(55)

The worst case line capacitance was calculated between the fourth and

final stages of the network. The area capacitance was calculated from the

equation

totalcapacitance=total _area *

capacitance coefficient.

The capacitance coefficientsdiffer

depending

onthe process used and the layer

in question.

Using

the two micron Mosis Scalable CMOS process, the

capacitance coefficient for the metal2 layer is 0.03 femtofarads per square

micron. For the metall layer it is 0.05 femtofarads per square micron. The

worst case line capacitance was found to be 135 femtofarads. The driver

circuit used to drive this load is shown in Figure 4.7. The analog trace from

the Accusim simulation of the _resulting circuit is shown in Figure 4.8. The

driver has more than _{amply driven} the load with a

delay

of _only 0.5

(56)

vcc

TP

l=2un w=8un

n

>

^

\>

TN l=2un w=^lun

TP

I

L=2un w=2^un

^

f>

TN

l=2un w=6un 136F

Figure4.6Example ofa

Driving

Circuit

(57)

Trace

5. 4. 4. 3. 3. 2. E2. I. 0. 0. 5. 4. 4. 3. >3. "2. =_2. ^1. I. 0. 0.

00^

50_|

00J 50_| 00_|

50_|

00_|

50J 00_|

50_|

00^_

00^

504

004

50_|

004

504

004

504

004

504 00_E_

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 r 0.00e+00 5. 00e-09 I.OOe-OB l.50e-0B 2.00e-08 2.50e-08

TIME (s)

(58)

5.

Multiplier/Divider

in

the Asynchronous

Microprocessor

The multiplier/divider is a special circuit in the ALU stage, where two

32-bit numbers are either multiplied or divided _creating a 64-bit result.

Multiplications and divisions cannot be handled

by

the regular ALU because

the_resulting numbers are too large for the thirty-two bit databus. The output

of this separate circuit goes

directly

to two special registers named HI and

LOW. These two registers either store the 64-bitoutputfrom a multiplication

orthe32-bitoutputfromadivisionplus the32-bitremainder.

Because all of these circuits used in the multiplier/divider are separate

from the main

ALU,

it becomes possible to allow other instructions to execute

whilethemultiplier/divideris operating. This is made possible

by latching

the

inputs to theseparate section. The method of

having

a valid bit for theregisters

and the fact that the multiplier/divider does not require _{any further} pipeline

stages, allow otherinstructions to be executed simultaneously. The latches on

theinputs hold and separatethemfromthemain buswhilethemultiplication or

division is _taking place. This allows other traffic on the data busses to be

conducted without

interfering

with themultiplier/divider'sdata.

The method of

designating

the registers valid and invalid assures that

the instructions on the bus will not interfere with the current operation of the

multiplier/divider. All instructionsthatwantto use thesemultiplier and divider

circuits must wait until the valid bit for the HI and LOW registers to become

set. Ifthe instruction does not need _{these circuits,} it can continue through the

(59)

The multiplier/divider _{automatically} places its answers into the special

registers HI and LOW. The write back and _memory stages in the pipeline are

not needed for _any operation that uses the multiplier/divider. This allows the

multiply and divide instructions to be removed from the general flow of

instructions down the pipeline. The removal of these instructions allows the

multiplier/divider to calculate atits slower speed and, at the same time, allows

therest oftheinstructions tocontinueto becalculated.

Scott Siers designed the multiplier/divider and a more complete

(60)

6.

Bus Control

One of the requirements for the ALU stage is to control the values on

thevarious busses

leading

into and outfrom the stage. All ofthe inputs to the

ALU stage need to be latched in order for the different pipeline stages to be

independent of each other. The output of the ALU stage can be either the

incremented program counter

(PC)

or the data line

leading

from the arithmetic

logic unit. The output can also come from the multiplier/divider when

executing a movefrom low or movefrom high instruction.

6.1 Inputs

The inputs to the ALU stage are comprised ofthe A and B busses from

the register

bank,

the instruction from the instruction decode stage, an

immediate signal to indicate animmediate

instruction,

and theprogram counter

value that will be incremented in the ALU stage to perform link operations.

The logic diagramoftheInput BusControlleris shownin Figure 6.1.

The instruction decode stage will select the correct registers to be

connected to the A and B busses before the start signal comes to the ALU

stage. This allows the ALU to latch the A and B bus's data

immediately

after

the start pulse is sent from the instruction decode stage. The A-bus is used

later forthe variable shifts. When this operation is _performed,a zero is sentto

the arithmetic logic unit and the lower five bits of the A-bus are allowed to

passto the shifter. The otherbus thatis connected to the ALU is thePCL bus.

The PCL contains the program counter value which allows the ALU to

(61)

a(31:0>O-

pcl_in<31:8>0-

b<31:0)O-

instruction<3l!0>O-

startO-

innedieteO-32b i l latch

inC3110)

32b it lotch

nOIIO)

32b it lotch

n(3l!(H

32bit lotch in(3IIO>

JjatcF

mtOliO

Bussel dec

"a _Asel

KMfllfl)

3i ₈sel

Oil(tB)

-OBpass(31:0)

-Opcl.out<31:0)

-OAout(31:0)

-OBout<31:0)

-Oinst(31:0)

(62)

placed into the link register

by

the write back stage. Figure 6.2 shows the

diagram of the A-bus selection logic. The B-bus is used for immediate

instructions.

During

immediate instructions the lower sixteen bits of the

instructionare sign extended to thirty-two bits and placed on the B-bus instead

of the latched value _allowing the B-bus to be used to pass information to the

next stage. Therefore the ALU can do base offset calculations for store

operations whiletheB-busis freetopass thedatato be stored inmemory. The

B selectoralso multiplexes ontotheB-bus thenumber'8'

during

linkoperations

in orderfor thePC tobe incremented

by

the correct number. Figure 6.3 shows

thelogic diagram ofthe B-buscontroller. The J and the S line are created

by

a

small decoder. This decoder is shownin Figure 6.4. It decodes both

jump

and

linkinstructionsand shiftvariableinstructions.

6.2Outputs

The outputs are controlled in a similar manner as the inputs. These

outputs are multiplexed between the arithmetic logic unit, the Hi and Low

registers, and the extra program counterincrementer. Thecontrol for this mux

comes from the instruction and is decoded

by

a small decoder. The logic

(63)

A

Bus

selector

out(31:0)

A(31:0)O

J[>

so

ocK31:0)O

zero(31:Q)

REG(i)

pci(i

t

zero(i )

OUT- out(

i)

for i := 0 to 31

(64)

o

9

o

o CD

cd CO

A)

CO

oo

(65)

inst(31:0>O-y^

AA

JAL

y

r>

y

AA

Bus Select ion

Decoder

//////////A

&i

(66)

Output

Select

Box

instruct ion(31

:0)O-alu(

hi_low(

Ii

"sm\

7

E^L>

alu(i) rcaCi)

hiJow(i) _j :

out(i)

for i :=0to 31

-Oout(31:0)

(67)

7.

Exception

Creation

Exceptions are

typically

_indicators _that

signify something out of the

ordinary has occurred. There are _many types of exceptions. These exceptions

canbegroupedinto_{two categories,} unmaskable and maskable.

Turning

offthe

power wouldbean eventthat occurs outofthe sequenceof_ordinary operation.

The _{microprocessor,}

however,

cannot do anything about the occurrence of a

power failure. Power

failure,

_{therefore, is} an unmaskable exception. An

example of a maskable exception would be an interrupt from a

input/output(I/0)

device. Themicroprocessorcan maskthe exception from the

I/O device while it is _performing a routine. This insures that the

microprocessorwill be free from interupts while itperforms a criticall routine.

The masking can then be turned _off, _{allowing interupts} to again be serviced.

This is a common practiceinside interuptserviceroutines.

The ALU stage in the R3000 architecture generates _only one single

exception. This exception occurs if the addition or subtraction of two's

complement numbers resultin numberwhich isgreaterthan the largest number

a 32-bit register can hold. This is called the overflow exception. When this

exception occurs, it signals that the number created

by

the arithmetic logic unit

is incorrect. The microprocessorcannotbe allowed to continue with incorrect

dataso thiserror isfatal to the program. At _{this point, the}microprocessor will

stop execution and wait for new instructions. The overflow instruction is an

unmaskableexception.

In the asynchronous

ALU,

the overflow exception is also created. This

signalis _simply created

by

exclusive

ORing

the signbitand the _{carry bit}ofthe

answer. Iftwo positive two's complement numbers are added together and an

(68)

changing its value from a0 to a _{1. The carry bit}however will notbe affected.

This causesthe two bits to be different and a 1 will occur on the output ofthe

XOR gate. Ifa large negative number has a large positive number subtracted

from

it,

the 1 inthe sign bitwill change to a 0 but because subtraction oftwo's

complement numbersis implemented

by

_adding the two numbers together, the

carry bitwillchangetoa 1. This againwillcause an overflow. Anexample of

this operation performedonfour bitnumbersis showninFigure 7.1.

Fig

7.1 Examples_ofOverflow Conditions

A schematic ofthe overflow creationlogic is shown in Figure 7.2. The

corresponding layout of this logic is shown in Figure 7.3. The overflow

creation logic includes some gates to decode the conditions when an overflow

occurs. An overflow can _only occur

during

a signed addition or subtraction

operation. The ALU performs the same operations for both signed and

unsigned additions so

during

unsigned operations the overflow creation is

(69)

data<31:0)O

inst<31:0)O|

t

"Ooverf

low

(70)

data(31)

instruct ion

(71)

8.

Handshaking

In an asynchronous

design,

it is important to have a method to know

when an evaluation has been completed. Each part ofthe designmusttell the

other parts when itcompletes its current work and is _ready to go onto the next

evaluation. In _{this microprocessor,} each pipeline stage must tell the previous

stage whenit is ready to accepta new set ofinputs and tell the

following

stage

whenit has completed_evaluatingthecurrent set ofinputs. Eachpartinside the

stage must have completed its work in order for the stage to signal that it is

ready tomove on. Each partis required to generate a completion signal within

its circuitry. The completion signal from one stage can tell the next stage to

begin its evaluation. From these signals, the flow within the pipeline stages

canbecontrolled.

8.1 CompletionSignals

Two methods can be used to create a completion signal. The first is a

dummy

delay. A

dummy delay

is a signal created after a period of time

long

enough toexceed theworst case

delay

within the circuit. Ithas no _relationship

to the actual _circuitry _performing the function. This method of completion

signal offers little increase in speed to the circuit because it is similar to the

method used in a synchronous circuit. No speed _up will be seen to operations

in a circuit that take a shortperiod oftime because the _{corresponding}

dummy

delay

is set for the worst case

delay

of _any of the operations the circuit

(72)

The other method to create a completion signal is to use differential

cascade voltage switched logic (DCVSL). This method allows a completion

signaltobegenerated as soon astheevaluationofthe inputs havefinished. An

example of a DCVSL gate is shown in figure 8.1. The figure consists of a

NOR gate that creates a completion signal when its evaluation is completed.

Thereare two sides

YCC

start

a

b

4

L

H

^

a

1

H

r

Vss

done

Figure8.1 DCVSL NOR Gatewith CompletionSignal

to a DCVSL gate. The first side evaluates the output and the second side

evaluates the complement of the output. The start signal controls the

evaluation of this gate. When the start signal is

low,

the P transistors on the

top

ofthegatecharge boththe outputand thecomplement to a one value. The

N transistor on the bottom is turned off, in this phase and the charge is stored

(73)

off and theN transistor is turned on. The two Nmos trees are then allowed to

evaluate. Becauseboth trees are _{complementary,} the one tree will evaluate to

zeroandtheother willevaluateto aone. In this circuit, the tree_evaluating to a

zero will allow the stored charge to discharge through to Vss. The other tree

will

keep

the stored charge and will evaluate as aone. The completion signal

can nowbe created

by

_comparingtheoutputsofthe twotrees. Thetwooutputs

are NANDed together _creating a signal that will evaluate true _only when the

twooutputs are different inthe evaluationphase. Thegatecan have its outputs

latched

by

another circuit before allowing the start signal low again. This

causes onetree to charge back upto a one value _sending the completion signal

low in the process. The outputs of the circuit will follow thepath shown in

figure 8.2.

..

precharge

/\

(f6)

(fi)

evaluation

Figure8.2 Paths _ofDCVSLOutputs

DCVSL will create a reliable completion signal that will signal when

evaluation is done. However,

by

the very nature of this

logic,

these types of

gates are slower than the more standard _{complementary} CMOS gates. The

evaluation of the Nmos trees are governed

by

the fall time in

discharging

the

stored charge. This is typically the same speed as complementary CMOS

gates. Theoutputs arethencomparedto createadonesignal. The NAND gate

which creates this done signal incurs a gate

delay,

_slowing the gate further.

(74)

Thismakes theDCVSL gates slowerthanother methods.

A simulation of theDCVSL NOR gate, shown in figure

8.1,

versus the

complementary CMOS NORgate is shown_{in figure 8.3.} _{The DCVSL}outputs

change with_roughlythesame speed asthe_{complementary} CMOS gatesbutthe

extra

delay

causesthedone signalto come_{slightly later}than theoutputs.

Also,

theprecharge time isnot showninthis simulation.

In DCVSL

logic,

the inputs to the gates need to _stay stable throughout

the entire evaluation _phase, because there is no _way to charge one of the

outputs back _up from a zero. In various _places, the outputs of the DCVSL

gates needtobe latched inorderto allow the slower sections ofthecircuittime

toevaluate and precharge. This adds another

delay

to the operation ofDCVSL

circuits.

A final drawback of DCVSL logic is that both the input and its

complement are needed foreach and _everygate. This causes twice the amount

of lines to be needed in the DCVSL implementation as compared to the

complementary CMOS implementation. Theselines becomevery cumbersome

in a large design. For

instance,

if a 32 bit adder _is to implemented _using

DCVSL, 128 lines are needfortheinputs. In atwomicron_{technology, if}these

lines were to be routed _using metal _two, 1024 microns will be needed just to

handlethe 128 bit bus.

The overall size comparison between the DCVSL and the

complementary CMOS gates is shown in figure 8.4. The two gates shown are

both two inputNOR gates. The DCVSL gate ismuch larger due to the added

transistors and to the fact that it needs both the inputs and their complements.

For this

layout,

the complements ofthe inputs are assumed to be available. If

(75)

Illlllllllllllllppifpn ii|iiiiiiii|iiiiiiiii|iiii

in

-^-rn cvj c

(A) mm in

^-m cu oinmi/iinin

L*2_,t*"^~

L

(A) v/ (A) dva ino isadqw) ino isaoq/ia

tn w o in >+ ft) w a in-^hmco o

3N0Q 1SADQ/ (A) ino"soH3/

(76)

cy

oo

CD

c_>

(77)

the complement. The DCVSL gate uses eleven transistors while the

complementary CMOS gate uses _{only four}

Design and implementation of an asynchronous version of the MIPS R3000 microprocessor

Rochester Institute of Technology

Recommended Citation

Tony H. Chang, Professor

RIT to reproduce my thesis in whole or in part.

Glossary

1.1 MIPSR3000 2

Interfacing

Showing

Carry

Shifting

Complementary

Complementary

Memory

VHSIC

Technology

handshaking

length,

frequency

designs,

however,

1.1 MIPSR3000

floating

describing

design,

handshaking

Obranch

Adding

delay

immediately

instructions,

increasing

instruction,

instance,

however,

Showing

IF ID ALU MEM WB

Scheduling

because,

loading

being

loading

during

dependency

having

instructions,

by ANDing

figures,

register

floating

Schottky

implementation

finding

delay

PUk=Afn+B(i>

figure.)

loading

lookup

determining

being

Interfacing

by decoding

Interface

SPECIAL BCOND J JAL BEQ BNE BLEZ BGTZ

MFHI MTHI MFLO MTLO * * * *

AXORB 0 0 1 X 0 1 1 1 1 1 1 1

Shifter Design

having

0

Arith/logic

designating

drivers,

instructions,

LeftOi

depending

delay

directly

by latching

interfering

Bus Control