Rochester Institute of Technology
RIT Scholar Works
Theses
Thesis/Dissertation Collections
1-1994
Design and implementation of an asynchronous
version of the MIPS R3000 microprocessor
Kevin Johnson
Follow this and additional works at:
http://scholarworks.rit.edu/theses
This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please [email protected].
Recommended Citation
DESIGN AND IMPLEMENTATION OF AN
ASYNCHRONOUS VERSION OF THE MIPS
R3000 MICROPROCESSOR
by
Kevin Johnson
A Thesis Submitted
In
Partial Fulfillment of the
Requirements for the Degree of
MASTER OF SCIENCE
In
Computer Engineering
Approved by:
Graduate Advisor - George A. Brown, Professor
Roy S. Czemikowski, Professor and Department Head
Tony H. Chang, Professor
Department of Computer Engineering
College of Engineering
Rochester Institute of Technology
Rochester, New York
THESIS RELEASE PERMISSION FORM
ROCHESTER INSTITUTE OF TECHNOLOGY
COLLEGE OF ENGINEERING
Title: Design and Implementation of an Asynchronous Version of the MIPS
R3000 Microprocessor.
I, Kevin Johnson, hereby deny permission to the Wallace Memorial Library of
RIT to reproduce my thesis in whole or in part.
ABSTRACT
The purpose of this thesis is to show some of the advantages for the asynchronous implementation of amajor synchronous structure. Three people were involved with the design of an asynchronous microprocessor modeled after the MIPS R3000 microprocessor. This microprocessor implemented all of the MIPS reduced instruction set, while eliminating the need for synchronous clocking throughout the chip. Paul Fanelli modeled the asynchronous processor using VHDL (hardware description language). Kevin Johnson created circuit level designs and layouts of the arithmetic logic unit
Table Of
Contents
Abstract ii
TableofContents iii
ListofFigures iv
ListofTables vi
Glossary
vii1. Introduction 1
1.1 MIPSR3000 2
1.2 Pipeline 3
1.3 ALU stage 4
2. Registers 6
2.1 Data Dependencies 6
2.2 Register Design 13
3. Arithmetic Logic Unit 21
3.1 TTLModel 21
3.2 DataPath 28
3.3 DualPurpose 28
3.4 Conditional Statements 29
3.5 Opcode
Interfacing
324. Shifter Design 38
5. MultiplierDivider in theAsynchronousMicroprocessor 48
6. Bus Control 50
6.1 Inputs 50
6.2 Outputs 52
7. ExceptionCreation 57
8.
Handshaking
618.1 CompletionSignals 61
8.2
Handshaking
ControllerCircuit 679. Results 70
10. Conclusions 78
List OfFigures
Figure 1.1 Pipeline Stages 3
Figure 1.2
Top
LevelDesignoftheALU stage 5Figure 2.1 Simple MethodFor
Adding
Three Numbers 6Figure 2.2 Possible Paths fortheValid Bits 9
Figure 2.3 Program
Showing
DataDependency
9Figure 2.4 Petri Net fortheData
Dependency
Problem 12Figure 2.5 Two Phase
Clocking
Scheme... 13Figure 2.6 Bus Utilization Scheme 14
Figure 2.7 ExampleofaLoad WordLeft Instruction 16
Figure2.8 LogicDiagram ofOne RegisterBit 18
Figure 2.9 LayoutofaOne-Bit Register 19
Figure 2.10 LogicDiagramofOne 32-BitRegister 20
Figure 3.1 Schematic oftheModified ALU 22
Figure 3.2 Equations for Intermediate
Carry
Signals 24Figure 3.3 LookAhead Circuit 26
Figure 3.4 Schematicof a32-bitAdder
Using
4-bitLookAheadCarry
Adders 27
Figure3.5 SignExtension Circuit 30
Figure 3.6 Condition Analyzer 31
Figure3.7 ConditionBitGenerator 34
Figure3.8 InstructionInterface Circuit 35
Figure 3.9 Bit
Coding
oftheR3000Instruction Set 36Figure 4.1 Arithmetic andLogicalShifts 38
Figure 4.2 Basic
Shifting
Cell 39Figure 4.3 Simple Connection For
Shifting
Cells 40Figure 4.4 CircuitDiagramofA 3-1 Multiplexor 42
Figure 4.5 Layoutofa3-1 Multiplexor 43
Figure4.6 Layoutofa2-1 Multiplexor 44
Figure 4.6 Exampleof a
Driving
Circuit 46Figure4.7
Analog
SimulationoftheDriving
Circuit 47Figure 6.1 InputBusControl Block 51
Figure6.2 A-Bus Selector 53
Figure 6.3 B-Bus Selector 54
Figure 6.4 Bus Selector Decoder 55
Figure6.5 OutputSelector 56
Figure 7.1 ExampleofOverflow Conditions 58
Figure7.2 SchematicoftheOverflow Logic 59
Figure 8.1 DCVSLNOR GatewithCompletion Signal 62
Figure 8.2 PathsofDCVSLOutputs 63
Figure8.3 ComparisonofDCVSLand
Complementary
CMOSLogic 65Figure 8.4 LayoutComparison ofDCVSLversus
Complementary
CMOS 66
Figure 8.5
Handshaking
Controller Circuit 68Figure 9.1 SimulationofUnsigned Addition 73
Figure 9.2 Simulationof aBranch Not Equal toZeroInstruction 74
Figure9.3 SimulationofaSignedInstruction
Causing
an Exception 75Figure 9.4 SimulationofaShiftInstruction 76
ListofTables
Table 2.1 Incorrect
Scheduling
ofDependentInstructions 10Glossary: ALU ALU Stage Asynchronous CISC CMOS DCVSL FPU HCC HI ID stage IFstage LOW LWL LWR MEM stage MIPS
Arithmetic Logic Unit
Partofthepipelinewhere all arithmetic operations takes
place.
Actions donotoccur at specific timeperiodsor when
another actionisscheduledto takeplace.
Complex instruction setcomputer.
Complementary
metal-oxide-semiconductorDifferential Cascade Voltage Switched Logic
Floating
point unit.Handshaking
ControllerCircuit.32-bitregister used toholdthe high32 bitsin amultiply operation orthequotientin adivideoperation.
Instruction decodestage ofthe pipeline wherethe
instructionis decoded into signals.
Instructionfetch stageofthepipeline wherethe
instructionis obtainedfrommemory.
32-bitregister used toholdthe low 32 bitsin amultiply operationorthe remainderinadivideoperation.
Loadwordleft.
Loadword right.
Memory
stage ofthepipeline where allmemoryoperations areperformed.
MIPS
NMOS
NOP
PC
PCL
R3000
RISC
RIT
Synchronous
TTL
VHDL
VHSIC
VLSI
WB stage
Millionsofinstructions per second.
N-typemetal-oxide-semiconductor.
Instruction thatcuses nothing tohappen.
Programcounter.
Signalnameforthe latchedoutputofthe program counter.
Aversion ofthe MIPS architecture used in aparticular
32-bitmicroprocessor.
Reduced instruction setcomputer.
Rochester Instituteof
Technology
Actions occuraccording to adeterministic schedule, e.g.
Actions occuratspecific phases ofaclock pulse.
Transistor TransistorLogic
VHSIC HardwareDescriptionLanguage
Very
HighSpeedIntegratedCircuitVery
LargeScaleIntegrationWrite backstage ofthepipeline wherethedata is written
1.
Introduction
The purpose of this project is to show some of the advantages of the
asynchronous design methodology. The design should be large enough to
allow the increaseinspeed caused
by
the asynchronous operation to be greaterthan the decrease in speed caused
by
the extra circuitry. A thirty-two bitmicroprocessor was chosen for this project. The asynchronous design methodology eliminates the need for clocking circuits
by
addinghandshaking
circuitry.
Clock lines in a microprocessor need to be routed to every part of the
chip. These lines cause a problem in the operation of the chip called clock
skew. The clock
lines,
upon which the clock signaltravels, spread outin many directions from the clock driver. This signal is supposed to change the state ofmany circuit elements throughout the chip simultaneously. Since these lines are not the same
length,
the clock pulses on one side of the chip may or maynot be in phase with the pulses on the other side. In order to make sure that
circuit elements on both sides of the chip have changed state using the same clock pulse, the period of the clock is extended. This makes sure that the
clock pulsehas hadtime to travel throughout thechip.
The clock lines and the gates that are driven
by
the clock create a largecapacitive load. This capacitance must be overcome in order for the clock
signal to change state in a reasonable amount of time. This causes the clock driverto be necessarily large. The capacitance in the line must becharged in
order for the voltage on the line to come to a true logical-one value. The
capacitance also must be discharged in order for the value to return to the
logical-zero state. The time needed to charge and discharge the large
frequency
is faster than the charging anddischarging
times, the clock pulseswill never reach their true values but will stay
fluctuating
around some mid-levelvoltage.Within a large clocked circuit, the clock speed must be slow enough to
allow the slowest part of the circuit to be able to complete its task before the
beginning
of the next clock cycle. In largedesigns,
this minimum time between clock pulses becomes a problem. For example, in a microprocessorthis minimum time is the time it takes for the slowest instruction to execute. This can be a
long
period of time because operations such as multiplicationtake longerthan operations such as shifting. Even in simple
designs,
there can be large discrepancies in the time it takes for an instruction to execute. In athirty-two bitripplecarry adder,for example, it takes two gate delays for two
numbers to be added together, when there are no carrys within the circuit. If,
however,
acarryiscreatedin everysingle stage, each addition stage must waituntil the carry is obtained from the proceeding stage. This condition takes 32
timeslongerthan theprevious example.
Inan asynchronous
design,
theclockiseliminated andcircuitryis addedtoensurethateach stage willfinishit'sexecutionbefore thenextstagebegins.
Thecircuitwillevaluateatthemaximum speedpossibleforeachstage.
1.1 MIPSR3000
The synchronous circuit to be modified in this project is the MIPS
R3000microprocessor. It is a 32-bitmicroprocessor with an off chip
floating
pointunit. Itemploys afive-stage pipeline structure with a reduced instruction
areonlythreedifferentstyles forthe instruction words, immediate, register and
jump.
Themicroprocessorwas implemented in a process similarto the MOSIS
two micron process that was used in the RIT VLSI labs. The R3000
microprocessor is a relatively old microprocessor and there is a lot of
information available.
Many
books have been writtendescribing
thisarchitecture.
1.2 Pipeline
The Mips processor uses afive stage pipeline structure. An example of
thispipelineis shownin figure 1.1. Each stage performs adifferentpartofthe
execution of an instruction. The stages are called instruction
fetch,
instructiondecode,
arithmetic logic unit(ALU), memory, and write back. The instructionfetch stage gets the instruction from memory. The instruction decode stage
decodes the instruction into useful parts. The arithmetic logic unit performs
calculations. The memory stage performs fetches and writes to memory. The
writeback stage puts completedinformation into theregisterbank.
IF
instruction fetch
ID
instruction decode
ALU
arithmetic logi
unit
MEM
memory
WB
write back
Because ofthese five stages, five instructions canbe operated on at the
same time. Different instructions will be located at different stages in the
pipeline simultaneously. This allows the microprocessor to execute
instructions fasteronce the pipeline is filled. The pipeline in the MIPS R3000
is clocked. The speed of the processor is the speed that the instructions are
able to move from stage to stage. Because this speed is controlled
by
theclock, itmustbeableto accommodatethe slowest stagein thepipeline. In the
asynchronous
design,
the stages are controlledby
signals that mark thebeginning
and end ofthe operation in each stage. This does not depend upon the worst case stage time and the stages are allowed to operate as fast aspossible. This projectis designed to provethatan asynchronousprocessor will
operatefasterthanits synchronouscounterpart.
1.3 ALU stage
This particular thesis only deals with the arithmetic logic unit stage of
the total design. The thesis
by
Scott Siers deals with the other pipeline stagesand the thesis
by
Paul Fanelli describes the VHDL modeling of themicroprocessor. The ALU stage is split up into fivemajor parts. These parts
are the registers, the ALU, the shifter, the
handshaking
controls and theexception creation. A top level diagram of the ALU stage is presented in
instruction pcu
\y
in& star^
hLlow rcaaiie
i(3lBus Ctrl Block
(31:1) :)) indnKiignC3l:B) el.in(3l:8> irradiate Hert hmOl: intl(3li. KljuKiM Decode
Shift Ctrl
J
no
intl(3l!fl)
;,n Alu32b it
CBUI s Conpare qui brl Ltd! Shifter ii 12 3 to s
M(31:9)33 <
1
SLI -OB(31:B) -OpcLout(31:0) "Construction_to_next_stage(31:0) wioiio)Overflov M(3I:1> owrfli >except int) (31:!) Bra.Ctrl ***"*"*Obranch
\Ja ^tput Sel Sox
outOI: (31i01
intlnitHm(!li!) ffi-f>0utput(31:0)
2.
Registers
2.1 Data Dependencies
Data dependencies occur when an instruction needs the output from a
priorinstruction thathasnotleftthepipeline. Forexample, a case might occur
where three numbers need to be added together. A simple method to accomplishthisoperationisshownin Figure2.1.
AUUn Rt R2 R5 addsRl toR2and the resultsis in R5 Mm R? R? R5 aits R5 toR3andthe ttmltsis inR5
Figure 2.1 Simple Methodfor
Adding
Three NumbersThis method adds the first two numbers together and adds their sum to the
third. In the pipeline,
however,
these instructions willimmediately
followeach other. The answer will be created from the first instruction when the write backstageis reached. Thesecondinstruction needs the information from the first instruction in the ALU stage. The memory stage is located between the write back and ALU stages. A space or time
delay
is needed between these twoinstructions so thatthefirstinstruction willbe in the writebackstagewhen the second is in the ALU stage. Another instruction can be placed between the two add instructions to provide the space or time delay. This instruction would be in the memory stage when the first add instruction is in
thewritebackstage.
Any
non-dependentinstructioncanbe usedforthis spaceand, ifone cannot be found from the program, a simple instruction thatcauses
Most of the MIPS R3000 processor's data dependencies are taken care
ofinthecompiler or at software level. This canbe done because theprocessor
works on asynchronous
(clocked)
pipeline. For example, a branch instructioncauses a data dependency. A test is needed to determine ifthe branch will be taken. This test occurs in the ALU stage, while the outcome of the branch instruction affects the instruction fetch stage. The instruction decode stage
falls between these two stages. Inthe R3000processor, thebranch instructions
are
immediately
followedby
an executable statement. This executablestatement acts as thespacebetween the testforthebranch and theexecution of the branch. This is similar to the way a space was needed in the addition
example above. After this second instruction executes, the branch is then allowed to occur. The detection of these data dependencies can be done in
software and the insertion of spacer
instructions,
to eliminate thedependency,
canbe done atthecompiler level.
Inan asynchronous processor, onesimplycannot assume that
letting
oneinstructionpassthough astage willallow enoughtimefor the data toleave the
pipeline. An instruction may be in theprocess ofmoving into the write back
stage when the dependent instruction begins to be executed in the ALU stage.
Thewrite backstage has notyetputthecorrectdatainto theregisterbeforethe
ALU stage operates on the register. This will cause the processor to act on
erroneous dataand thereforemustbe avoided.
One solution to this problem is to allow the data to be stored at intermediate points along the pipeline and allow the stored data to be
retrievable for the dependent instruction. This would mean that every stage
would remember the output from the instruction that had previously passed though it. This implementation would be useful if the dependent instructions
was described earlier, the dependent instructions in the synchronous
microprocessor are separated
by
a spacer instruction to insure proper timing.Because one of the goals of this project is to be compatible with the MIPS
architecture, this solution is now invalid. In order to allow dependent
instructions to be spaced out, the intermediate storage devices now need to be
capable of storing more than one value. This solution is problematic at best
because
increasing
the number of storage spaces reduces the speed of thepipeline.
Another solution to this problem is to tag the data that is stored in the
register bank. If an instruction is about to leave the instruction decode stage
and passintothe ALUstage, theinstruction can causethe value in the register,
that theinstructionusesas atarget,to beinvalidated. The next
instruction,
ifitis dependenton that
data,
would then haveto wait untilthe data intheregisteris validated. In order to implement this scheme, another bit is needed in the
register to indicate whether that register has valid or invalid data. When the
instruction passesinto theALUsection, thisregisterbitwould besetto invalid
and, when the instruction leaves the write back section, this bitwould then be
setback to valid. At this pointit will be assumed that, when the processor is
reset, thecondition bitsforalltheregisters willberesetto a valid state.
This schemeworksfora clocked systembutinanasynchronous system
thereare some hazardsthatneed tobe addressed. For
instance,
it isconceivablypossiblefortwoinstructionsto be settingandclearingthe valid bit
atthe sametime. Thiswould cause anindeterminate state, orrace condition,
depending
onwhich stage wasfaster. To eliminatethisproblemboth theinstruction decodestage and thewritebackstage gettheirownbitwhich
they
cantoggle. The two bitscanthenbeexclusive ORedtogetherinorderto
assigning the states 11 and 00tobevalid, theraceconditionthatmay occuris
eliminated.
This method appearsto answer allthe problems with datadependencies,
however,
one problem remains. Untilthis point, theinvestigation has onlyFigure 2.2 Possible Paths forthe Valid Bits
been concerned with the possible dependencies in the data part of the
instruction. The target part is just used to designate which register will be
invalidated. The flaw in this scheme can be shown in the simple program in
Figure2.3.
ill
23
Addn R I R2 R3
Addu R4 R5 R3
lm i;:
- ysil n
.'.! M'l pl.ufNn in i3 ,i,jtfe 14 %tti5 md pl.kVN n in i3
---addoneto R3
Figure2.3Program
Showing
DataDependency
The first two instructions are not data dependent but the third one is. It is
interesting
to note that instruction 1 is meaningless but still is a validinstruction forthe program. An exampleofthe execution ofthis program is in
time
1
2
3
4
5
6
Pipeline Stages
IF ID ALU MEM WB
3 2 1
3 2 1
3 2 1
3 2
3
3
TaWe2.7 Incorrect
Scheduling
ofDependent InstructionsAs is shown in thetable the third instruction stalls until the valid bit has been
set. Thevalid bit is set attime4allowing the third instructionto enter. This is
incorrect because the first instruction set the valid bit. In
fact,
the thirdinstructionwas supposedto waitforthe secondinstructionto finish.
The solution for this problem is to only allow the instruction to enter
into the ALU stage if alldata and targetregisters are valid. This will stall the
second instruction until thefirstinstruction leavesthe pipeline. The validity of
thismethod is verified in the petri net shown in the Figure 2.4.
By
using thismethod, no dependencies can occur. No set of instructions will cause the
microprocessor to operate on erroneous data caused
by
a dependency. Some care should be taken when programming this type of microprocessorbecause,
willbevery low. Theprogram will still run but, sincemany ofthe instructions
will become stalled in the pipeline, the program will not run at maximum
IF)
Data
Dependent
Operations
To
The Same
Register
Valid Register
2.2 Register Design
The registers, where data is stored for eventual use, are an important
part of any microprocessor. Due to the asynchronous structure of this
processor, some special considerations need to be made, in order to eliminate
hazard conditions that may occur in
loading
and retrieving information fromthe register bank.
First,
the synchronous method, as used in the R3000,processor willbe investigated.
Then,
the method thatallows theregisters to beused inanasynchronous mannerwill be described.
One problem encountered when
designing
registers is preventing accessto the registers for simultaneous reading and writing. If this
happens,
a racecondition willbe createdmakingitunclear whethertheinformationobtained is
an old number stored in the register or a new number
being
written to theregister. A second problemiscaused
by
thefact that todo any calculation twonumbers are needed. Since registers generally write to a common data
bus,
gettingtwo sets ofdatacanbecomedifficult.
The R3000processor solves bothoftheseproblems
by
usingatwo((,1
<|)2
c
A B D
Fig
2.5 Two PhaseClocking
Schemephase, non-overlappingclock. Anexample ofthisclocking schemeis shownin
instruction fetch stage grabs the data to give to the arithmetic logic unit stage.
The clock eliminates the hazard that occurs between
loading
and retrievingnumbers and solves the first problem of preventing simultaneous reading and
writing.
In order to solve the second problem of obtaining two sets of data, the
R3000 uses one 32-bit data bus and multiplexes the data from the registers to
the correct lines in the arithmetic logic unit. Figure 2.5 shows that there are
four distinctperiodsinthe clockingscheme. Onepieceofdatacan be obtained
during
the period marked with a 'C and another piece obtainedduring
the period marked with a'D'. This eliminates any chance ofhaving
datacollide onthebus.
In the asynchronous microprocessor, one does not have the
luxury
ofknowing
exactly when certain events are going to happen. There is nomechanism to stop the write back stage from writing at any time, potentially
causing a hazard condition. In order to eliminate this problem, register
accesses need to be allowed to happen in the register bank simultaneously.
This costs some space on thechip because, while the R3000 uses one bus and
multiplexesitsusebetween thevarious clockphases; intheasynchronous
write back VVTlte y,r back / ^ \U Asynchronous ProcessorRegister Bank
A
\1/ALUa andb
,/32 ,/3
ALUa
\l/
ALUb
microprocessor, three different busses and corresponding logic are required.
Threedifferentbusseseliminatethehazard of contention on thebus but notthe
hazard caused
by
readingand writing simultaneouslyto the same register. Thisis similar to the data
dependency
problem described earlier in the datadependency
section. The use of aregisterflag,
to indicate whetheraregister issafetoread from or writeto, solves thelogical
dependency
problem thatoccursin the pipeline. This method also protects the register's data coherency. The
instruction fetch stage waitson the
registers'
state before accessing the register.
This solves the problem of
having
both instruction fetch and write back stagesaccessingtheregisters atthe sametime.
In the asynchronous version of this processor, there are thirty-two
generalpurpose registers.
Also,
there aretwo additional registers that are usedby
themultiplier/dividerto hold sixty-fourbitnumbers. Themultiplier/dividerregisters are accessed
differently
and arelocated in adifferentpart ofthe chip.Theseregisters are only used
by
the multiplier/divider andby
four instructionsthatmovetheoutput ofthemultiplier/dividerto thegeneral purposeregisters.
Two other instructions also operate the registerbank differently. These
instructions areload wordleft
(LWL)
andloadword right(LWR). Onetofourbytes of memory can be loaded into a register with these instructions. What
makestheseinstructions different fromthe standard load instructions is that the
bytes not indicated in the target register are not changed after the end of the
operation. Ifathirty-two bitwordisneeded andit does notfall on regular byte
boundaries, acombinationofthese instructionscanbe used to getthe number.
An example of this is shown in Figure 2.7. In this example the register is
loaded with the three low order bytes from memory location 0 without
affecting the high order byte. In order to implement these
instructions,
morelogicmustbeaddedto theregisterbank. Thewritebackstage sends four
memory
address4 address0
A s fi 7
0 1 2 3
register x
before a b c d
LWL$x,l($0)
a 1 2 3
after
Fig
2.7Example ofaLoad Word Left Instructionsignals to the register bank before executing a load instruction. These four
signals select which byte within the register will recieve the new data and
which bytewillbethe same. Under normal operation these signals will always
be active so that the whole register is affected. For aload word left or a load
word right, thesesignalswillindicatewhichbyteintheregister willbe loaded.
This is performed
by ANDing
the load signal with the select signals that aresent
by
thewritebackstage.The general purpose registers are madeout simple components designed
tofittogether to createthelargeregisterbank. Thirty-two ofthese components
are placed side to side tomake a single thirty-two bitregister. A logic diagram
of one bit of a register and the layout of that logic have been included in
Figures 2.8 and2.9. As canbe seen
by
thesefigures,
theregisters are designedwith maximum modularity. A thirty two bit register can be implemented
by
combining thirty-two of these one bit registers into four groups of eight bits.
Each eight-bitregister has an AND gate to control the selection for LWL and
LWR. The four register banks are combined together and two toggle registers
diagram of a thirty-two bit register is shown in Figure 2.10. The layout
asE>
idty
inO
bsO
Oa
-ob
one
bit
register
bs-bar
as-gar
nz
32bit
Register
as
O-
bsO-
in<31:0)O-
reg_naskOO-reg.nasklC^
reg_nask20"
reg_nas
k30-alu-tO"
reseto
loadO"
' 0
VCC
'CLK
K 5-<P
r
K 5-iPf-1 I
" 8bit
reo <,:8)
8bit reo "<7:6)
B 8bit
reo <7:0,
*-k7ib:
8bit reo a:0J
Nv
>A<ai:0><5&M\
x>
-Ovalid3. Arithmetic
Logic Unit
The Arithmetic Logic Unit
(ALU)
is the brain of the microprocessor.Within the
ALU,
all calculations are performed. It is a complex circuit thatimplements many different functions. In most microprocessors, the ALU is
split into two parts, one part operates on
floating
point numbers and anotherpart operates on integernumbers.
Operating
onfloating
point numbers is morecomplex due to the number's size and format.
Floating
point numbers aretypically
64 bits long. In the R2000 and R3000 microprocessors, thefloating
point unit
(FPU)
is found on another chip(coprocessor)
that is placed besidethe microprocessor. Integer operations consist ofperforming combinations of
adding, subtracting, multiplying,
dividing
andlogic operations.3.1 TTL Model
The asynchronous microprocessor's ALU could not be modeled exactly
after the R3000 ALU because no information on the structure of the R3000's
ALU was available. A similar ALU was found in the Fairchild Advanced
Schottky
TTL Data Book. This ALU performs all the required functionsnecessary for the asynchronous microprocessor, except A NOR B. The ALU
was modified toperform this operation
by
adding XOR gates ontheinputs andperforming anAND operation. The XORgates complementtheinputs and,
by
DeMorgan'sLaws, theNORoperationisobtained.
The schematic ofthe modified ALU is shown inFigure 3.1. This ALU
has been designed using a look ahead carry format. The look ahead carry
Cout
implementation
was chosen because it is fasterthan the standard ripple carryimplementation.
In a ripple carry adder, the two inputs are added together with the carry
from the preceding stage. This produces an output and a carry to the next
operation. This carrythenproceeds torippledown thechain ofbits to the end,
where it becomes the carry out. The advantage of this design is the ease of
implementation and the size ofthe circuit. The disadvantage is its slow speed.
The time needed to perform the ripple carry addition can be calculated
by
finding
the amount oftime ittakes to create acarry signal and multiplying thistime
by
the number ofbitsthroughwhich this signal needs tobe carried.If,
forexample, it takes 3 nanoseconds to create a carry signal, then, for a 32-bit
adder, it would take 96 nanoseconds, in the worst case, to add two numbers
together.
The look ahead carry adder operates differently. Each pair ofinputs are
combined and a generate signal and a propagate signal are created. Each
output can now be created
by
combining its own generate and propagatesignals with the generate and propagate signals from the preceding pairs. The
advantageofthis method is that the entire structure ofthe adderhas only three
levels ofgates, one level to create the generate and propagate signals and two
levels to combine the signals together. For example, ifthe gate
delay
of onelevel were 3 nanoseconds, the time needed to add two numbers togetherwould
be 9 nanoseconds. This circuit is more than ten times faster than the ripple
carryadder.
The equations for the carry in signals are shown in Figure 3.2. From
these carry signals, derived fromthe generate and propagate signals, the output
PUk=Afn+B(i>
s-"
:-G<ifeA(imii)
|;.:
y,,
Ci=Gn+PftCh
C2=Gi+PiGo+P}PoCo
C3=n2+P2C.|+P2PiG0+P2P|
PoQ,
Figure3.2 EquationsforIntermediate
Carry
Signalsgenerate and propagate signals does not require signals other than the inputs,
theoutputscan all becreated simultaneously.
In practice, alook ahead carry adder is normallynotbuiltfor morethan
eight-bit adders. As the size of the adder increases, more generate and
propagate signals need to be combined to perform the addition. (Refer to the
previous
figure.)
The larger gates needed to handle the greater number ofinputs cause theadder to be less efficient. A compromise between speed and
gate size can be reached
by
creating another circuitto implement a look aheadcarry betweenblocksof smaller adders. This addsthree more gate levels to the
over all circuit and slows the hypothetical adder to 18 nanoseconds. This is
still a greatimprovementovertheripple carry adder. An exampleofthis extra
look ahead circuitis shown in Figure 3.3. Itis similarto the look ahead carry
within the adder circuit but works with the generate and propagate signals
created
by
the adderblocks. Smaller ALU blocks are used inthe asynchronousmicroprocessor to perform their operations upon four bits at a time. An
example of how these smaller ALU blocks are connected can be found in
Figure 3.4.
To extend the design of the look ahead carry adder into an
ALU,
allows the look ahead carry adder to create the proper output. This selection
3.2 DataPath
The datapath in the asynchronous microprocessor is a
"true"
thirty two
bit data path. In microprocessors that use a clock, the sequence for
loading
data into the ALU stageis to load the firstnumber on the firstclock pulse and
the second number on the second clock pulse. The advantage of this is that
only one bus is required to send data signals from the registers to the ALU.
Thedisadvantageofusing onlyonebusto move the numbers fromthe registers
to the ALU is that it takes twice the optimum time. In the asynchronous
microprocessor, thereis noclock touse when scheduling bus utilization. Each
numberneedstohave itsownbus to travelfrom theregistersto the ALU. This
causes more lines to be used but improves the overall efficiency of the ALU's
operation.
3.3 Dual PurposeoftheALU
The R3000 processor uses the ALU for more thanjust arithmetic and
logic operations. It also uses it for base offset address calculations. When a
registerisused to holdamemorylocation and an offsetis addedto this register
to point to a desired location in memory, it is called base offset addressing.
This is commonly used fortable
lookup
and stack operations. The ALU stageoccurs beforethe memory stage in the pipeline order. This allows theALU to
calculate addresses for the memory stage before the instruction is executed
withinthememorystage.
Because the offset is coded into the instruction word, a method is
betweentheALU andtheregisters. Thisoffsetis always thelower sixteen bits
of the instruction. Because offsets can be either positive or negative, this
number mustbe sign extendedfrom sixteen bitsto 32 bitsto be used
by
the32-bit ALU. Thecircuitthatperforms this operation is shown in Figure 3.5. The
circuitmultiplexesbetween the data lines from theregisters and thedata in the
instructionopcode.
3.4 ConditionalStatements
The R3000microprocessor's ALU also is used to compare two numbers
to create conditions that are used to resolve branch on conditional statements
and set on conditional statements. The conditions that are used in the R3000
microprocessor are equal, greater than or equal to zero, less than or equal to
zero, greater than zero, less than zero, and not equal. These conditions are
evaluated within the ALU and a condition bit is sent back to the instruction
fetch stage. The branch destination is calculated in the instruction fetch stage
while the ALU is
determining
whether the branch will be taken or not. Theconditionbitthenindicates thelocationthatcontainsthe nextinstruction.
No information is available on exactly how these condition bits are
produced in the R3000 microprocessor. The asynchronous processor performs
this operation
by
adding the two ALU input busses together. The instructionfetch stage places the two pieces ofdata on the busses. Ifthe operation deals
with acomparisonto zero,register zero is usedforonepieceofdata. The zero
register in the data bank is always hardwired to contain a zero value. The
o
T
C_ o
o CD
CD CO
CO
_D CO
CO
00
<0
-t><0
in(31:0)O
-O>>0
Zero
Tesf
equal
"E^equal
-Oout(31:0)
numbers are indicated
by
a onein thesign bit. Positive numbers are indicatedby
azero in the sign bit and aone somewhere else within the number. Zero isindicated
by
allthenumbersbeing
zero.The analysis is then fed into another block. This block converts the
comparison outputs into a control line for the branch instructions. The
instruction decode stage uses this control line to determine whether to take the
branch or not. This block is shown in Figure 3.7. The bit code for branch
instructions includes two bits in which acondition is encoded. These two bits
arebit 26 and 27. These bits arethenfed into a4-1 multiplexoras selectlines.
Another series of conditional statements are the BCOND statements.
BCOND stands for branch conditionally. These statements are special
instructions unlikethe other conditionalbranches. There areonly two types of
BCOND statements.
They
aregreaterthan or equal to zero andless than zero.These statementsare identified
by
bit 16oftheBCONDinstruction.The Figure3.7 shows how the two typesofinstructions areresolved and
muxed togetherto createthebranchcontrol line.
3.5 Opcode
Interfacing
The disadvantage of not
having
any information on the R3000's ALU isthat the control lines ofthe model ALU do not match the instruction encoding
ofthe R3000. A circuit is required to decode the instruction before sending it
to theasynchronous ALU. This circuitis shownin Figure 3.8.
The
interfacing
circuit was createdby decoding
the operationsperformed inthe R3000 instruction set and matching them to themodel ALU's
instructions. Figure 3.9 shows the bitcoding oftheR3000 instruction set and
uses two bit fields to indicate instructions. The highest five bits indicate the
instruction. If these bits are all zero's, this indicates a special instruction and
the low five bits are used instead. If it is not a special instruction, the instruction is either an immediate or a conditional instruction. All register
instructionsare special instructions.
inst(31:0)1
Obranch-control
A
Interface
Logic
instruct ion(31:0)O-Sr
P
Ar
4>
Aaama
Ay
yy>
D
-oso
g^^nSf
t>T
^Dr
A
A-
SA
DASS
AA
rOsi
-OS2
-OS3
On
Figure3.8InstructionInterface Circuit
OPCODE bits 31..29 0 bits 28..26 0 1
SPECIAL BCOND J JAL BEQ BNE BLEZ BGTZ
ADDI ADDIU SLTI SLTIU ANDI ORI XORI LUI
COPO1 COP11 COP21 COP31 * * * *
* * * * * * * *
LB LH LWL LW LBU LHU LWR *
SB SH SWL SW * * SWR *
LWCO1 LWC11 LWC21 LWC31 * * * *
SWCO1 SWC11 SWC21 SWC31 * * * *
SPECIAL
bits 2..0
bits 5..3 0 1
0 SLL * SRL SRA
SLLV * SRLV SRAV
JR JALR * * SYSCALL BREAK * *
MFHI MTHI MFLO MTLO * * * *
MULT MULTU Drv divu * * * *
ADD ADDU SUB SUBU AND OR XOR NOR
* * SLT SLTU * * * *
* * * * * * * *
* * * * * * * *
BCOND
bits 18..16
20..19 0
0 BLTZ BGEZ * * * * * *
* * * * * * * *
BLTZAL BGEZAL * * * * * *
* * * * * * * *
*Illegal instruction
* Coprocessor instructionswere notimplemented
INPUTS OUTPUTS
FUNCTION SO SI S2 Cn An Bn FO Fl F2 F3 G P
CLEAR 0 0 0 X X X 0 0 0 0 0 0
0 0 0 1 1 1 1 1 0
BMINUS A 1 0 0 0 0 1 0 1 1 1 0 0
0 1 0 0 0 0 0 1 1
0 1 1 1 1 1 1 1 0
1 0 0 0 0 0 0 1 0
1 0 1 1 1 1 1 0 0
1 1 0 1 0 0 0 1 1
1 1 1 0 0 0 0 1 0
0 0 0 1 1 1 1 1 0
AMINUSB 0 1 0 0 0 1 0 0 0 0 1 1
0 1 0 0 1 1 1 0 0
0 1 1 1 1 1 1 1 0
1 0 0 0 0 0 0 1 0
1 0 1 1 0 0 0 1 1
1 1 0 1 1 1 1 0 0
1 1 1 0 0 0 0 1 0
0 0 0 0 0 0 0 1 1
A PLUS B 1 1 0 0 0 1 1 1 1 1 1 0
0 1 0 1 1 1 1 1 0
0 1 1 0 1 1 1 0 0
1 0 0 1 0 0 0 1 1
1 0 1 0 0 0 0 1 0
1 1 0 0 0 0 0 1 0
1 1 1 1 1 1 1 0 0
X 0 0 0 0 0 0 1 1
AXORB 0 0 1 X 0 1 1 1 1 1 1 1
X 1 0 1 1 1 1 1 0
X 1 1 0 0 0 0 0 0
X 0 0 0 0 0 0 1 1
A ORB 1 0 1 X 0 1 1 1 1 1 1 1
X 1 0 1 1 1 1 1 1
X 1 1 1 1 1 1 1 0
X 0 0 0 0 0 0 0 0
AANDB 0 1 1 X 0 1 0 0 0 0 1 1
X 1 0 0 0 0 0 0 0
X 1 1 1 1 1 1 1 0
X 0 0 1 1 1 1 1 1
PRESET 1 1 1 X 0 1 1 1 1 1 1 1
X 1 0 1 1 1 1 1 1
X 1 1 1 1 1 1 1 0
1=High Voltage Level
0=LowVoltage Level
X=Don't Care
4.
Shifter Design
Shifters are important to any microprocessor because they prepare the
data for use in other stages. A shifter used in a microprocessor is a steering
circuit thatmoves the bits to the correct positions for the final output. There
are four different types of shifting operations; arithmetic, logical, rotate, and
roll.
The R3000 architecture requires only arithmetic and logical shifts.
These shifts operate similarly.
They
move the bits either left or right and, onthe edges ofthe number, they either sign extend or input a 0. An arithmetic
shiftinstruction shiftsthe bits eitherleftor right. The first bitisboth held and
shifted at the same time. An example of this is shown in Figure 4.1. In a
logical shift, a zero is always shifted into the number.
By
shiftingleft,
aquick multiply
by
two can be done withouthaving
to run it through the slowmultiplier.
0
-^L>^L>-^L>^L>^L>^nShift
Right Logical
Shift
Right Arithmetic
The R3000architecture containsno arithmetic left shift. The arithmetic
left shift is meaningless because it is identical in operation to a logical left
shift. The R3000 architecture also supports dynamic shifting. A register can
be designated in the instruction to indicate how far to shift. The five least
significant bits of the register are used to control the shifting operation. The
otherbitsare ignored. Rotateand rollshiftsusethebits shifted off one edge of
the number as the input to the other edge.
They
are not implemented in theR3000reduced instruction set.
The shifter for the asynchronous microprocessor can be implemented
using amultiplexing networkbecausethere are no rotate or roll instructions in
this processor. This implementation will not greatly increase the space
required.
no shift
left input
Py
right inputArith/logic
\Shift/noshift
>
\\s \i/ y
3-1
MUX
output
y
Fig
4.2 BasicShifting
CellThe basic cell for one
bit,
in the asynchronousdesign,
is shown inFigure 4.2. This cellis a3-1 multiplexorand has been created
by
modifying a4-1 multiplexor, because only three inputs are required. A number of basic
cells arethenconnected toformthemultiplexing network. A simple version of
this network is shown in Figure 4.3. The shifter can handle thirty-one bit
shifts. There are five bits in the instruction
designating
howmany lines to
shift. Because there are five control lines for the shifter, there are five
shifting stages. Each stage shifts
by
a power of two. The first stage shiftsby
2,
the second shiftsby
2l and so on.During
thelast stage the displacementissixteenbits. This doescause
long
linesto beformed, however,
producing line
capacitance which slowstheshifter.
Fig
4.3Simple ConnectionForShifting
CellsThe multiplexors are created using transmission gates. The
drivers,
which are needed inorderto overcome the loss of
driving
capability causedby
the transmission gates, enable the addition of a moderate amount of line
capacitance. These drivers can easily handle the line capacitance caused
by
the
long
routing channels neededinthisdesign.The advantage of this shifter design is that the basic thirty-two bit part
can be used in other places where a three to one decision needs to be made.
second bus handles the immediate
instructions,
register instructions, and baseoffsetcalculations.
The shifter measures 1800 microns wide
by
900 microns high. Thebasic 3-1 mux cellcircuitdiagram is shown in Figure 4.4 and the layoutofthe
circuit is shown in Figure 4.5. 160 ofthese cells were used to form the entire
shifter. To select between logical and arithmetic shifts, five 2-1 multiplexors
were used. These multiplexors were obtained
by
modifying the 3-1multiplexor. The layoutofthiscellis shownin Figure 4.6.
LeftOi
rights
1E>
2E>
cy
nux
A
m
set
e
seie
The worst case line capacitance was calculated between the fourth and
final stages of the network. The area capacitance was calculated from the
equation
totalcapacitance=total area *
capacitance coefficient.
The capacitance coefficientsdiffer
depending
onthe process used and the layerin question.
Using
the two micron Mosis Scalable CMOS process, thecapacitance coefficient for the metal2 layer is 0.03 femtofarads per square
micron. For the metall layer it is 0.05 femtofarads per square micron. The
worst case line capacitance was found to be 135 femtofarads. The driver
circuit used to drive this load is shown in Figure 4.7. The analog trace from
the Accusim simulation of the resulting circuit is shown in Figure 4.8. The
driver has more than amply driven the load with a
delay
of only 0.5vcc
TP
l=2un w=8un
n
>
^
\>
TN l=2un w=^lun
TP
I
L=2un w=2^un^
f>
TN
l=2un w=6un 136F
Figure4.6Example ofa
Driving
CircuitTrace
5. 4. 4. 3. 3. 2. E2. I. 0. 0. 5. 4. 4. 3. >3. "2. =2. ^1. I. 0. 0.00^
50_|
00J 50_| 00_|50_|
00_|
50J 00_|50_|
00^_00^
504
004
50_|
004
504
004
504
004
504 00_E_1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 r 0.00e+00 5. 00e-09 I.OOe-OB l.50e-0B 2.00e-08 2.50e-08
TIME (s)
5.
Multiplier/Divider
in
the Asynchronous
Microprocessor
The multiplier/divider is a special circuit in the ALU stage, where two
32-bit numbers are either multiplied or divided creating a 64-bit result.
Multiplications and divisions cannot be handled
by
the regular ALU becausetheresulting numbers are too large for the thirty-two bit databus. The output
of this separate circuit goes
directly
to two special registers named HI andLOW. These two registers either store the 64-bitoutputfrom a multiplication
orthe32-bitoutputfromadivisionplus the32-bitremainder.
Because all of these circuits used in the multiplier/divider are separate
from the main
ALU,
it becomes possible to allow other instructions to executewhilethemultiplier/divideris operating. This is made possible
by latching
theinputs to theseparate section. The method of
having
a valid bit for theregistersand the fact that the multiplier/divider does not require any further pipeline
stages, allow otherinstructions to be executed simultaneously. The latches on
theinputs hold and separatethemfromthemain buswhilethemultiplication or
division is taking place. This allows other traffic on the data busses to be
conducted without
interfering
with themultiplier/divider'sdata.The method of
designating
the registers valid and invalid assures thatthe instructions on the bus will not interfere with the current operation of the
multiplier/divider. All instructionsthatwantto use thesemultiplier and divider
circuits must wait until the valid bit for the HI and LOW registers to become
set. Ifthe instruction does not need these circuits, it can continue through the
The multiplier/divider automatically places its answers into the special
registers HI and LOW. The write back and memory stages in the pipeline are
not needed for any operation that uses the multiplier/divider. This allows the
multiply and divide instructions to be removed from the general flow of
instructions down the pipeline. The removal of these instructions allows the
multiplier/divider to calculate atits slower speed and, at the same time, allows
therest oftheinstructions tocontinueto becalculated.
Scott Siers designed the multiplier/divider and a more complete
6.
Bus Control
One of the requirements for the ALU stage is to control the values on
thevarious busses
leading
into and outfrom the stage. All ofthe inputs to theALU stage need to be latched in order for the different pipeline stages to be
independent of each other. The output of the ALU stage can be either the
incremented program counter
(PC)
or the data lineleading
from the arithmeticlogic unit. The output can also come from the multiplier/divider when
executing a movefrom low or movefrom high instruction.
6.1 Inputs
The inputs to the ALU stage are comprised ofthe A and B busses from
the register
bank,
the instruction from the instruction decode stage, animmediate signal to indicate animmediate
instruction,
and theprogram countervalue that will be incremented in the ALU stage to perform link operations.
The logic diagramoftheInput BusControlleris shownin Figure 6.1.
The instruction decode stage will select the correct registers to be
connected to the A and B busses before the start signal comes to the ALU
stage. This allows the ALU to latch the A and B bus's data
immediately
afterthe start pulse is sent from the instruction decode stage. The A-bus is used
later forthe variable shifts. When this operation is performed,a zero is sentto
the arithmetic logic unit and the lower five bits of the A-bus are allowed to
passto the shifter. The otherbus thatis connected to the ALU is thePCL bus.
The PCL contains the program counter value which allows the ALU to
a(31:0>O-
pcl_in<31:8>0-
b<31:0)O-
instruction<3l!0>O-
startO-
innedieteO-32b i l latch
inC3110)
32b it lotch
nOIIO)
32b it lotch
n(3l!(H
32bit lotch in(3IIO>
JjatcF
mtOliO
Bussel dec
"a Asel
KMfllfl)
3i 8sel
Oil(tB)
-OBpass(31:0)
-Opcl.out<31:0)
-OAout(31:0)
-OBout<31:0)
-Oinst(31:0)
placed into the link register
by
the write back stage. Figure 6.2 shows thediagram of the A-bus selection logic. The B-bus is used for immediate
instructions.
During
immediate instructions the lower sixteen bits of theinstructionare sign extended to thirty-two bits and placed on the B-bus instead
of the latched value allowing the B-bus to be used to pass information to the
next stage. Therefore the ALU can do base offset calculations for store
operations whiletheB-busis freetopass thedatato be stored inmemory. The
B selectoralso multiplexes ontotheB-bus thenumber'8'
during
linkoperationsin orderfor thePC tobe incremented
by
the correct number. Figure 6.3 showsthelogic diagram ofthe B-buscontroller. The J and the S line are created
by
asmall decoder. This decoder is shownin Figure 6.4. It decodes both
jump
andlinkinstructionsand shiftvariableinstructions.
6.2Outputs
The outputs are controlled in a similar manner as the inputs. These
outputs are multiplexed between the arithmetic logic unit, the Hi and Low
registers, and the extra program counterincrementer. Thecontrol for this mux
comes from the instruction and is decoded
by
a small decoder. The logicA
Bus
selector
out(31:0)
A(31:0)O
J[>
so
ocK31:0)O
zero(31:Q)
REG(i)
pci(i
t
zero(i )OUT- out(
i)
for i := 0 to 31
o
9
o
o CD
cd CO
A)
CO
oo
inst(31:0>O-y^
AA
JAL
y
r>
y
AA
Bus Select ion
Decoder
//////////A
&i
Output
Select
Box
instruct ion(31
:0)O-alu(
hi_low(
Ii
"sm\
7
E^L>
alu(i) rcaCi)
hiJow(i) j :
out(i)
for i :=0to 31
-Oout(31:0)
7.
Exception
Creation
Exceptions are
typically
indicators thatsignify something out of the
ordinary has occurred. There are many types of exceptions. These exceptions
canbegroupedintotwo categories, unmaskable and maskable.
Turning
offthepower wouldbean eventthat occurs outofthe sequenceofordinary operation.
The microprocessor,
however,
cannot do anything about the occurrence of apower failure. Power
failure,
therefore, is an unmaskable exception. Anexample of a maskable exception would be an interrupt from a
input/output(I/0)
device. Themicroprocessorcan maskthe exception from theI/O device while it is performing a routine. This insures that the
microprocessorwill be free from interupts while itperforms a criticall routine.
The masking can then be turned off, allowing interupts to again be serviced.
This is a common practiceinside interuptserviceroutines.
The ALU stage in the R3000 architecture generates only one single
exception. This exception occurs if the addition or subtraction of two's
complement numbers resultin numberwhich isgreaterthan the largest number
a 32-bit register can hold. This is called the overflow exception. When this
exception occurs, it signals that the number created
by
the arithmetic logic unitis incorrect. The microprocessorcannotbe allowed to continue with incorrect
dataso thiserror isfatal to the program. At this point, themicroprocessor will
stop execution and wait for new instructions. The overflow instruction is an
unmaskableexception.
In the asynchronous
ALU,
the overflow exception is also created. Thissignalis simply created
by
exclusiveORing
the signbitand the carry bitoftheanswer. Iftwo positive two's complement numbers are added together and an
changing its value from a0 to a 1. The carry bithowever will notbe affected.
This causesthe two bits to be different and a 1 will occur on the output ofthe
XOR gate. Ifa large negative number has a large positive number subtracted
from
it,
the 1 inthe sign bitwill change to a 0 but because subtraction oftwo'scomplement numbersis implemented
by
adding the two numbers together, thecarry bitwillchangetoa 1. This againwillcause an overflow. Anexample of
this operation performedonfour bitnumbersis showninFigure 7.1.
Fig
7.1 ExamplesofOverflow ConditionsA schematic ofthe overflow creationlogic is shown in Figure 7.2. The
corresponding layout of this logic is shown in Figure 7.3. The overflow
creation logic includes some gates to decode the conditions when an overflow
occurs. An overflow can only occur
during
a signed addition or subtractionoperation. The ALU performs the same operations for both signed and
unsigned additions so
during
unsigned operations the overflow creation isdata<31:0)O
inst<31:0)O|
t
"Ooverf
low
data(31)
instruct ion
8.
Handshaking
In an asynchronous
design,
it is important to have a method to knowwhen an evaluation has been completed. Each part ofthe designmusttell the
other parts when itcompletes its current work and is ready to go onto the next
evaluation. In this microprocessor, each pipeline stage must tell the previous
stage whenit is ready to accepta new set ofinputs and tell the
following
stagewhenit has completedevaluatingthecurrent set ofinputs. Eachpartinside the
stage must have completed its work in order for the stage to signal that it is
ready tomove on. Each partis required to generate a completion signal within
its circuitry. The completion signal from one stage can tell the next stage to
begin its evaluation. From these signals, the flow within the pipeline stages
canbecontrolled.
8.1 CompletionSignals
Two methods can be used to create a completion signal. The first is a
dummy
delay. Adummy delay
is a signal created after a period of timelong
enough toexceed theworst case
delay
within the circuit. Ithas no relationshipto the actual circuitry performing the function. This method of completion
signal offers little increase in speed to the circuit because it is similar to the
method used in a synchronous circuit. No speed up will be seen to operations
in a circuit that take a shortperiod oftime because the corresponding
dummy
delay
is set for the worst casedelay
of any of the operations the circuitThe other method to create a completion signal is to use differential
cascade voltage switched logic (DCVSL). This method allows a completion
signaltobegenerated as soon astheevaluationofthe inputs havefinished. An
example of a DCVSL gate is shown in figure 8.1. The figure consists of a
NOR gate that creates a completion signal when its evaluation is completed.
Thereare two sides
YCC
start
a
b
4
L
H
^
a
1
H
r
Vss
done
Figure8.1 DCVSL NOR Gatewith CompletionSignal
to a DCVSL gate. The first side evaluates the output and the second side
evaluates the complement of the output. The start signal controls the
evaluation of this gate. When the start signal is
low,
the P transistors on thetop
ofthegatecharge boththe outputand thecomplement to a one value. TheN transistor on the bottom is turned off, in this phase and the charge is stored
off and theN transistor is turned on. The two Nmos trees are then allowed to
evaluate. Becauseboth trees are complementary, the one tree will evaluate to
zeroandtheother willevaluateto aone. In this circuit, the treeevaluating to a
zero will allow the stored charge to discharge through to Vss. The other tree
will
keep
the stored charge and will evaluate as aone. The completion signalcan nowbe created
by
comparingtheoutputsofthe twotrees. Thetwooutputsare NANDed together creating a signal that will evaluate true only when the
twooutputs are different inthe evaluationphase. Thegatecan have its outputs
latched
by
another circuit before allowing the start signal low again. Thiscauses onetree to charge back upto a one value sending the completion signal
low in the process. The outputs of the circuit will follow thepath shown in
figure 8.2.
..
precharge
/\
(f6)
(fi)
evaluationFigure8.2 Paths ofDCVSLOutputs
DCVSL will create a reliable completion signal that will signal when
evaluation is done. However,
by
the very nature of thislogic,
these types ofgates are slower than the more standard complementary CMOS gates. The
evaluation of the Nmos trees are governed
by
the fall time indischarging
thestored charge. This is typically the same speed as complementary CMOS
gates. Theoutputs arethencomparedto createadonesignal. The NAND gate
which creates this done signal incurs a gate
delay,
slowing the gate further.Thismakes theDCVSL gates slowerthanother methods.
A simulation of theDCVSL NOR gate, shown in figure
8.1,
versus thecomplementary CMOS NORgate is shownin figure 8.3. The DCVSLoutputs
change withroughlythesame speed asthecomplementary CMOS gatesbutthe
extra
delay
causesthedone signalto comeslightly laterthan theoutputs.Also,
theprecharge time isnot showninthis simulation.
In DCVSL
logic,
the inputs to the gates need to stay stable throughoutthe entire evaluation phase, because there is no way to charge one of the
outputs back up from a zero. In various places, the outputs of the DCVSL
gates needtobe latched inorderto allow the slower sections ofthecircuittime
toevaluate and precharge. This adds another
delay
to the operation ofDCVSLcircuits.
A final drawback of DCVSL logic is that both the input and its
complement are needed foreach and everygate. This causes twice the amount
of lines to be needed in the DCVSL implementation as compared to the
complementary CMOS implementation. Theselines becomevery cumbersome
in a large design. For
instance,
if a 32 bit adder is to implemented usingDCVSL, 128 lines are needfortheinputs. In atwomicrontechnology, ifthese
lines were to be routed using metal two, 1024 microns will be needed just to
handlethe 128 bit bus.
The overall size comparison between the DCVSL and the
complementary CMOS gates is shown in figure 8.4. The two gates shown are
both two inputNOR gates. The DCVSL gate ismuch larger due to the added
transistors and to the fact that it needs both the inputs and their complements.
For this
layout,
the complements ofthe inputs are assumed to be available. IfIlllllllllllllllppifpn ii|iiiiiiii|iiiiiiiii|iiii
in
-^-rn cvj c
(A) mm in
^-m cu oinmi/iinin
L*2_,t*"^~
L
(A) v/ (A) dva ino isadqw) ino isaoq/ia
tn w o in >+ ft) w a in-^hmco o
3N0Q 1SADQ/ (A) ino"soH3/
cy
oo
CD
c_>
the complement. The DCVSL gate uses eleven transistors while the
complementary CMOS gate uses only four