04 - Microarchitecture, RF
Andreas Ehliar
September 17, 2013
Today's lecture
I Another ASIP instruction example
I Microarchitecture basics
I The register file/Special Registers
I Introduction to the ALU
Andreas Ehliar 04 - Microarchitecture, RF
A pen and paper example
I A function to process data
I Should execute in less than 80 clock cycles
I 16 bit input data
I Intermediate calculations should be done in accumulator to avoid loss of precision
I We assume that we are allowed to place the arrays in any memory
void process_data(i16 *array0, i16 *array1, i16 *result) {
    // Assume ptr to array0 in r0, array1 in r1, result in r2
    i64 sumofproducts = 0, sumofdiff = 0;
    for (int i = 0; i < 30; i++) {
        sumofproducts += array0[i] * array1[i];
        sumofdiff += abs(array0[i] - array1[i]);
    }
    // Saturate the sum of products to 32 bits
    if (sumofproducts > 0x7fffffff) {
        sumofproducts = 0x7fffffff;
    } else if (sumofproducts < -0x80000000LL) {
        sumofproducts = -0x80000000LL;
    }
    result[0] = (i16)(sumofproducts >> 16);  // sumofproducts[31:16]
    result[1] = (i16)sumofproducts;          // sumofproducts[15:0]
    result[2] = (i16)(sumofdiff >> 16);      // sumofdiff[31:16]
    result[3] = (i16)sumofdiff;              // sumofdiff[15:0]
}
Microarchitecture Design
I Step 1: Partition each assembly instruction into microoperations, allocate each microoperation into corresponding hardware modules.
I Step 2: Collect all microoperations allocated in a module and specify hardware multiplexing for RTL coding of the module
I Step 3: Fine-tune intermodule specifications of the ASIP architecture and finalize the top-level connections and pipeline.
Hardware Multiplexing
I Reusing one hardware module for several different operations
I Example: Signed and unsigned 16-bit multiplication
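Hardware Multiplexing
[Figure: datapath with hardware multiplexing [Liu2008]. Pre-processing: multiplexers MA and MB select the operands opa (A or B) and opb (C or D). Kernel processing: MUL and ADD. Post-processing 1: multiplexer MP1 selects result1. Post-processing 2: multiplexer MP2, controlled by Control[3], selects between result1 and its saturated value. Control[0], Control[1], and Control[2] drive the other multiplexers.]
Possible functions
1. A + C
2. A + D
3. B + C
4. B + D
5. A * C
6. A * D
7. B * C
8. B * D
9. SAT(A + C)
10. SAT(A + D)
11. SAT(B + C)
12. SAT(B + D)
13. SAT(A * C)
14. SAT(A * D)
15. SAT(B * C)
16. SAT(B * D)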
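The 16 functions above can be modelled in a few lines of C. This is only a sketch: the function name mux_datapath and the assignment of the four control bits to the individual multiplexers are assumptions made for illustration, since the slide does not spell out the encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate a wide intermediate result to the signed 16-bit range. */
static int32_t sat16(int32_t x) {
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return x;
}

/* One shared datapath, 16 functions.
 * Assumed (hypothetical) control encoding:
 *   bit 0: opa = A (0) or B (1)
 *   bit 1: opb = C (0) or D (1)
 *   bit 2: result1 = opa + opb (0) or opa * opb (1)
 *   bit 3: pass result1 through (0) or saturate it (1) */
int32_t mux_datapath(unsigned control, int16_t a, int16_t b,
                     int16_t c, int16_t d) {
    assert(control < 16);
    int32_t opa = (control & 1u) ? b : a;                      /* MA  */
    int32_t opb = (control & 2u) ? d : c;                      /* MB  */
    int32_t result1 = (control & 4u) ? opa * opb : opa + opb;  /* MP1 */
    return (control & 8u) ? sat16(result1) : result1;          /* MP2 */
}
```

One multiplier, one adder, and one saturation unit serve all 16 functions; only the multiplexer settings differ.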
Hardware multiplexing
I Hardware multiplexing can be implemented either by SW or by configuring the HW
I A processor is basically a very neat design pattern for multiplexing different HW units
I Perhaps the most important skill of a good VLSI designer
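Typical design pattern for datapath modules
[Figure: typical design pattern for datapath modules [Liu2008]. Collected microoperations enter a chain of pre-operations (Pre-operation-I, Pre-operation-II, ..., Pre-operation-X, Pre-operation-Y), followed by the kernel operations, followed by post-operations (Post-operation-I, Post-operation-II, ..., Post-operation-X, Post-operation-Y), ending in results and verifications.]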
Area properties (a.k.a. what to optimize)
I Relative areas of a few different components
I 32-bit Adder: 0.2 to 1 area units
I 32-bit Adder/subtracter: 0.3 to 2 area units
I 32-bit 16 to 1 mux: 0.5 – 0.6 area units
I 17 x 17 bit multiplier: 1.3-3.7 area units
I 8 KiB memory (32 bit wide): 33 area units
Performance properties
I Relative maximum frequencies
I 32-bit adder: 0.1 to 1
I 32-bit adder/subtracter: 0.1 to 0.9
I 32-bit 16 to 1 mux: 0.31 to 0.9
I 17 x 17 bit multiplier: 0.11 – 0.44
I 8 KiB memory (32 bit wide): 0.53
Optimizing memory size is often the most important task
I MP3 decoder example
I All memories in the chip are 3 times the size of the DSP core itself
I (I/O pads are also larger than the DSP core itself)
Microarchitecture design of an instruction
I Required microoperations for a typical convolution instruction:
I conv ACRx,DM0(ARy%++),DM1(ARz++)
I Required microoperations:
I Instruction decoding
I Perform addressing calculation
I Read memories
I Perform signed multiplication
I Add guard bits to the result of the multiplication
I Accumulate the result
I Set flags
I For a combined repeat/conv instruction:
I PC <= PC while in the loop
I PC <= PC + 1 as the last step in the loop
I No saturation/rounding during the iteration
I Saturate/round after final loop iteration
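The arithmetic part of the microoperations listed above can be sketched in C as follows. This is a behavioural model only, not the hardware: the name conv_sat32 is hypothetical, a 64-bit integer stands in for an accumulator with guard bits, and a 32-bit saturated result is assumed. Addressing and flags are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Behavioural sketch of a repeated conv instruction.
 * int64_t plays the role of the accumulator with guard bits, so the
 * running sum may temporarily exceed the 32-bit range without loss. */
int32_t conv_sat32(const int16_t *dm0, const int16_t *dm1, int n) {
    assert(n >= 0);
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        int32_t product = (int32_t)dm0[i] * dm1[i]; /* signed multiplication */
        acc += product;           /* accumulate, no saturation in the loop */
    }
    /* Saturate once, after the final loop iteration. */
    if (acc > INT32_MAX) return INT32_MAX;
    if (acc < INT32_MIN) return INT32_MIN;
    return (int32_t)acc;
}
```

With the guard bits, a sum that overshoots the 32-bit range in the middle of the loop and comes back in range later still yields the exact result, which would not be the case if every iteration saturated.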
The register file (RF)
I The RF gets data from data memories by running load instructions while preparing for an execution of a subroutine.
I While running a subroutine, the register file is used as computing buffers.
I After running the subroutine, results in the RF will be stored into data memories by running store instructions.
General register file
[Figure: the register file in the processor core [Liu2008]. Blocks: PC FSM, PM (program memory), instruction decoder, AGU, DM (data memory), RF, and the execution unit (ALU/MAC). Signals: program address, instruction, program flow control, operation ctrl, MEM ctrl, operand and result control, results and flags, configuration and status, and target addresses and configuration vectors from the register file.]
I Connected to almost all parts of the core
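Register file schematic
[Figure: register file schematic [Liu2008]. Write circuit: a multiplexer controlled by ctrl_reg_in selects the write data from the register file, memory 1, memory 2, the ALU, the MAC, ports, and so on. Store circuit: register 1, register 2, register 3, ..., register n. Read circuit: two multiplexers controlled by ctrl_o_a and ctrl_o_b select the operands OPA and OPB.]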
Register file area comparisons
I Adder (for comparison): relative area 0.1 - 1
I One port, both read/write (uncommon): relative area 2.5-2.7
I One read port, one write port (accumulator registers): relative area 2.5-2.6
I Two read ports, one write port (basic RISC RF): relative area 3-3.2
I Four read ports, two write ports (simple VLIW, simple superscalar): relative area 4.6-5.8
Register file speed
I Almost (but not quite) the same speed as a very fast 32-bit adder (in this particular technology)
I Also note that it is possible to use special register file memories (but at an increased verification cost)
Read before write or write before read
I A processor architect has to decide how the register file should work when reading and writing the same register
I Read before write
I The old value is read
I Write before read
I The new value is read (more costly in terms of the timing budget)
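The difference between the two policies can be illustrated with a small behavioural model in C. The names regfile_t and rf_cycle are hypothetical, and one function call is assumed to correspond to one clock cycle with a single read port and a single write port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RF_ENTRIES 32

typedef struct {
    uint16_t regs[RF_ENTRIES];
} regfile_t;

/* Model one clock cycle: read raddr and, if we is set, write wdata to waddr.
 * Returns the value seen on the read port during this cycle. */
uint16_t rf_cycle(regfile_t *rf, unsigned raddr, bool we, unsigned waddr,
                  uint16_t wdata, bool write_before_read) {
    assert(raddr < RF_ENTRIES && waddr < RF_ENTRIES);
    uint16_t value = rf->regs[raddr];          /* old contents */
    if (write_before_read && we && waddr == raddr)
        value = wdata;                         /* forward the new value */
    if (we)
        rf->regs[waddr] = wdata;               /* stored at the clock edge */
    return value;
}
```

The extra compare and multiplexer in the write-before-read case sit in the read path, which is one way to see why that policy costs more of the timing budget.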
Physical design: fan-out problem
[Figure: one operand selected from 32 registers in a register file through five stages of 2-to-1 multiplexers [Liu2008]]
I Fan-out of the control signal for the first stage: 16*16*2 = 512
I Fan-out of the control signal for the second stage: 16*8*2 = 256
I Fan-out of the control signal for the third stage: 16*4*2 = 128
I Fan-out of the control signal for the fourth stage: 16*2*2 = 64
I Fan-out of the control signal for the fifth stage: 16*1*2 = 32
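The fan-out numbers follow a simple pattern that can be written down as a formula. The sketch below assumes a tree of 2-to-1 multiplexers and takes the factor of 2 directly from the slide (read here as two loads per multiplexer, which is an assumption); the function name select_fanout is hypothetical.

```c
#include <assert.h>

/* Fan-out of the select signal for one stage of a 2-to-1 multiplexer tree.
 * Stage 1 is the stage closest to the registers. */
unsigned select_fanout(unsigned num_regs, unsigned width, unsigned stage) {
    assert(stage >= 1 && (num_regs >> stage) >= 1);
    unsigned muxes_per_bit = num_regs >> stage;  /* 16, 8, 4, 2, 1 for 32 registers */
    return width * muxes_per_bit * 2;            /* factor 2 taken from the slide */
}
```

For a 16-bit register file with 32 entries this reproduces the five values listed above, and it shows that the first stage dominates the load on the control signals.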
Register File in Verilog
reg [15:0] rf[31:0]; // 16 bit wide RF with 32 entries
always @(posedge clk) begin
if(we) rf[waddr] <= wdata;
end
always @* begin
op_a = rf[opaddr_a];
op_b = rf[opaddr_b];
end
Special Purpose Registers
I Sometimes we need special purpose registers (SPR or SR)
I BOT/TOP for modulo addressing
I AR for address register
I SP
I I/O
I Core configuration register
I Should these be included in the general purpose register file?
Special Purpose Registers as normal registers
I Convenient for the programmer. Special purpose registers can be accessed like any normal register.
I Example: add bot0,1 ; Move ringbuffer bottom one word
I Example 2 (from ARM): pop pc
I Drawbacks:
I Wastes entries in the general purpose register file
I Harder to use specialized register file memories
Special purpose registers need special instructions
I Special instructions required to access SR:s
I Example:
I move r0,bot0 ; Move ringbuffer bottom one word
I (nop) ; May need nop(s) here
I add r0,1
I (nop) ; May need nop(s) here
I move bot0,r0
I (Move is encoded as move from/to special purpose register here)
I Advantage:
I Easier to meet timing as special purpose registers can more easily be placed anywhere in the core
I Can scale easily to hundreds of special purpose registers if required. (Common on large and complex processors such as ARM/x86)
I Drawback:
I Inconvenient for special registers you need to access all the time
Conclusions: SPRs
I Only make an SPR a normal register if you believe it will be read/written via normal instructions very often
ALU in general
I ALU: Arithmetic and Logic Unit
I Arithmetic, Logic, Shift/rotate, others
I No guard bits for iterative computing
I One guard bit for single step computing
I Get operands from and send result to RF
I Handles single precision computing
Separate ALU or ALU in MAC
[Figure: two datapath organizations [Liu2008]. (a) Register file, multiplier, accumulator, and a separate ALU. (b) Register file, multiplier, and a combined ALU and accumulator. The figure also labels a DTU block.]
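ALU high level schematic
[Figure: ALU high level schematic [Liu2008]. The inputs A[15:0] and B[15:0] pass through masker, guard, carry-in, and other preprocessing, then feed the kernel units, including the shift unit and the logic unit. Saturation and flag processing produces Result[15:0] and the flags FA/FC, FS, FZ.]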
Pre-processing
I Select operands: from one of the sources
I Register file, control path, HW constant
I Typical operand pre processing:
I Guard: one guard bit
I (does not support iterative computing)
I Invert: Conditional/non-conditional invert
I Supply constant 0, 1, -1
I Mask operand(s)
I Select proper carry input
Post-processing
I Select result from multiple components
I From AU, logic unit, shift unit, and others
I Saturation operation
I Decide to generate carry-out flag or saturation
I Perform saturation on result if required
I Flag operation
I Flag computing and prediction
General instructions
Mnemonic  Operation    opa  opb  Carry in  Carry out
ADD       Addition     +    +    0         Cout/SAT
SUB       Subtraction  +    -    1         Cout/SAT
ABS       Absolute     +/-       A[15]     SAT
CMP       Compare      +    -    1         SAT
NEG       Negate       -         1         SAT
INC       Increment    +    1    0         SAT
DEC       Decrement    +    -1   0         SAT
AVG       Average      +    +    0         SAT
Special Instructions
Mnemonic  Description                        Operation
MAX       Select larger value                RF <= max(OpA,OpB)
MIN       Select smaller value               RF <= min(OpA,OpB)
DTA       Difference of two absolute values  GR <= |OpA| − |OpB|
SAD       Sum of absolute difference         GR <= |OpA−OpB|
Adder with carry in for RTL synthesis (safe solution)
[Figure: 18b full adder [Liu2008]]
FAO[17:0] = {A[15], A[15:0], 1'b1} + {B[15], B[15:0], CIN}
Result[16:0] <= FAO[17:1]
I Full adder may have no carry in
I One guard bit
I We need 2 extra bits in the adder
I LSB of the 18b result will not be used
I MSB of the 18b result will be the guard
I Works on all synthesis tools
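The trick of feeding the carry in through an extra LSB can be checked with a bit-accurate C model. This is a sketch under the assumption of 16-bit two's complement operands; the function name add_with_cin is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Computes A + B + CIN with a single 18-bit addition.
 * Returns the 17-bit result (one guard bit plus 16 data bits). */
uint32_t add_with_cin(int16_t a, int16_t b, unsigned cin) {
    assert(cin <= 1);
    /* {A[15], A[15:0], 1}: sign extend to 17 bits, append a constant 1 */
    uint32_t opa = ((((uint32_t)(int32_t)a) & 0x1FFFFu) << 1) | 1u;
    /* {B[15], B[15:0], CIN}: sign extend to 17 bits, append the carry in */
    uint32_t opb = ((((uint32_t)(int32_t)b) & 0x1FFFFu) << 1) | cin;
    uint32_t fao = (opa + opb) & 0x3FFFFu;   /* FAO[17:0] */
    return fao >> 1;                         /* Result[16:0] = FAO[17:1] */
}
```

The appended LSBs add up to 1 + CIN, so they produce a carry into bit 1 exactly when CIN is 1. The unused LSB of the sum is then dropped.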
Adder for RTL synthesis (modern version)
I {Cout,Res[15:0]} = {1'b0,A[15:0]} + {1'b0,B[15:0]} + Cin;
I Cout is 1 bit wide
I Important: Cin is 1 bit wide!
I Modern synthesis tools can usually handle this case without creating two adders
I (I’ve had to resort to the “safe” version shown on the previous slide in a few cases though. For example when combining an adder with other logic in an FPGA.)
Example: Implement an 8 bit ALU
Instruction     Function                                 OP
NOP             No change of flags                       0
A+B             A + B (without saturation)               1
A-B             A − B (without saturation)               2
SAT(A+B)        A + B (with saturation)                  3
SAT(A-B)        A − B (with saturation)                  4
SAT(ABS(A))     |A| (absolute operation, saturation)     5
SAT(ABS(A+B))   |A + B| (absolute operation, saturation) 6
SAT(ABS(A-B))   |A − B| (absolute operation, saturation) 7
CLR S           Clear S flag (other flags unchanged)     8
I There should be a negative, zero, and saturation flag!
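A behavioural reference model of this 8 bit ALU could look like the C sketch below. Several details are assumptions that the specification above leaves open: the names alu8 and flags_t are hypothetical, NOP and CLR S return 0, the S flag is sticky (set on saturation, cleared only by CLR S), and N and Z are updated by every arithmetic operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, s; } flags_t;   /* negative, zero, saturation */

enum { OP_NOP, OP_ADD, OP_SUB, OP_SAT_ADD, OP_SAT_SUB,
       OP_SAT_ABS_A, OP_SAT_ABS_ADD, OP_SAT_ABS_SUB, OP_CLR_S };

int8_t alu8(unsigned op, int8_t a, int8_t b, flags_t *f) {
    assert(op <= OP_CLR_S);
    if (op == OP_NOP) return 0;                       /* flags untouched */
    if (op == OP_CLR_S) { f->s = false; return 0; }   /* only S changes */

    int wide;   /* wider than 8 bits, plays the role of the guard bits */
    switch (op) {
    case OP_ADD: case OP_SAT_ADD: case OP_SAT_ABS_ADD: wide = a + b; break;
    case OP_SUB: case OP_SAT_SUB: case OP_SAT_ABS_SUB: wide = a - b; break;
    default: wide = a; break;                         /* OP_SAT_ABS_A */
    }
    if (op >= OP_SAT_ABS_A && wide < 0) wide = -wide; /* absolute value */

    int8_t result;
    if (op >= OP_SAT_ADD) {                           /* saturating ops */
        if (wide > INT8_MAX) { wide = INT8_MAX; f->s = true; }
        else if (wide < INT8_MIN) { wide = INT8_MIN; f->s = true; }
        result = (int8_t)wide;
    } else {
        result = (int8_t)(uint8_t)wide;               /* keep the low 8 bits */
    }
    f->n = result < 0;
    f->z = result == 0;
    return result;
}
```

A model like this is mainly useful as a golden reference when verifying the RTL implementation of the ALU.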
Problem: The following function must execute in < 2048 cycles
for(i=0; i<500; i++){
    tmp=abs(*p0++);
    tmp2=abs(*p1++);
    *p2++ = tmp-tmp2;
}
; First try in assembler
repeat 500,lend
ld r0,DM0[ar0++]
ld r1,DM0[ar1++]
abs r0
abs r1
sub r0,r0,r1
st DM0[ar2++],r0
lend:
; 3000 cycles for inner loop (6 instructions x 500 iterations)
16 bit |A| − |B|
I How many operations do we need for |A| − |B|?
I The easy approach:
I One adder for |A|
I One adder for |B|
I One adder for the final subtraction
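The easy approach with three adders can be sketched in C as follows. Each absolute value is formed as a conditional invert plus a carry in, which is how an adder would compute it; the names abs_via_adder and dta are hypothetical, and 32-bit integers stand in for the guard bits.

```c
#include <assert.h>
#include <stdint.h>

/* |x| as conditional invert plus carry in: uses one adder. */
static int32_t abs_via_adder(int16_t x) {
    int32_t wide = x;                         /* sign extension (guard bits) */
    int32_t mask = (wide < 0) ? -1 : 0;       /* invert only if negative */
    int32_t carry_in = (wide < 0) ? 1 : 0;    /* sign bit used as carry in */
    return (wide ^ mask) + carry_in;
}

/* |A| - |B| with three additions in total. */
int32_t dta(int16_t a, int16_t b) {
    int32_t abs_a = abs_via_adder(a);         /* adder 1 */
    int32_t abs_b = abs_via_adder(b);         /* adder 2 */
    assert(abs_a >= 0 && abs_b >= 0);
    return abs_a - abs_b;                     /* adder 3 */
}
```

Mapped to hardware, the three adders in series make this a single-cycle candidate only if the timing budget allows it.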
Typical ALU shift operations
[Figure: typical ALU shift operations on a 16-bit word, bits 15 down to 0 [Liu2008]]
I Arithmetic right shift
I Logical right shift (0 is shifted in)
I Logical left shift (0 is shifted in)
I Rotate right without carry flag
I Rotate left without carry flag
I Rotate left with carry flag (more than one bit not needed)
I Rotate right with carry flag (more than one bit not needed)
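The shift operations above can be written as short C functions on 16-bit words. This is a behavioural sketch with hypothetical function names; the rotates through the carry flag shift by one bit only and return the new carry in bit 16 of the return value, which is a packing chosen here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic right shift: vacated bits are filled with the sign bit. */
uint16_t asr16(uint16_t x, unsigned n) {
    assert(n < 16);
    uint16_t fill = (x & 0x8000u) ? (uint16_t)~(0xFFFFu >> n) : 0;
    return (uint16_t)((x >> n) | fill);
}

/* Logical shifts: vacated bits are filled with 0. */
uint16_t lsr16(uint16_t x, unsigned n) { assert(n < 16); return (uint16_t)(x >> n); }
uint16_t lsl16(uint16_t x, unsigned n) { assert(n < 16); return (uint16_t)(x << n); }

/* Rotates without the carry flag. */
uint16_t ror16(uint16_t x, unsigned n) {
    assert(n < 16);
    return (uint16_t)((x >> n) | (x << ((16 - n) & 15)));
}
uint16_t rol16(uint16_t x, unsigned n) {
    assert(n < 16);
    return (uint16_t)((x << n) | (x >> ((16 - n) & 15)));
}

/* Rotate left by one bit through the carry flag.
 * Bit 16 of the return value is the new carry, bits 15..0 the result. */
uint32_t rcl16(uint16_t x, unsigned carry_in) {
    assert(carry_in <= 1);
    return (((uint32_t)x << 1) | carry_in) & 0x1FFFFu;
}

/* Rotate right by one bit through the carry flag, same packing. */
uint32_t rcr16(uint16_t x, unsigned carry_in) {
    assert(carry_in <= 1);
    uint32_t result = ((uint32_t)carry_in << 15) | ((uint32_t)x >> 1);
    return (((uint32_t)x & 1u) << 16) | result;
}
```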
Shifter primitive
[Figure: 16-bit shifter primitive [Liu2008]. Four stages of 2-to-1 multiplexers, controlled by CTRL LSB, CTRL[1], CTRL[2], and CTRL MSB, move the shift input [15:0] to the shift output [15:0]. Vacated bit positions are taken from the fill-in inputs Fill in[0] to Fill in[14].]
I Note: Barrel shifters based on 4-to-1 multiplexers may be more efficient
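Hardware multiplexing in shifter
[Figure: hardware multiplexing in shifter [Liu2008]. A single 16-bit shift primitive with a fill-in port is shared between right and left shifts. For a right shift, Shift in[15:0] and Shift out[15:0] are used directly. For a left shift, the bit order is reversed before and after the primitive (Shift in[0:15], Shift out[0:15]). The fill-in port is driven from the filling-in table, whose sources include MSB [15] and "0".]
I Note: Fill in table may be complicated for some shift operations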
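The idea of sharing one shift primitive between right and left shifts can be modelled in C as below. This is a simplified sketch: the names shift_primitive, reverse16, and shifter are hypothetical, the fill-in port is modelled as a full 16-bit vector, and the filling-in table only covers logical and arithmetic shifts.

```c
#include <assert.h>
#include <stdint.h>

/* Right-shift primitive: shifts in right by n and takes the vacated
 * upper n bits from the fill-in port. */
static uint16_t shift_primitive(uint16_t in, unsigned n, uint16_t fill) {
    assert(n < 16);
    uint16_t vacated = (uint16_t)~(0xFFFFu >> n);   /* the upper n bits */
    return (uint16_t)((in >> n) | (fill & vacated));
}

/* Reverse the bit order of a 16-bit word (wiring only in hardware). */
static uint16_t reverse16(uint16_t x) {
    uint16_t r = 0;
    for (int i = 0; i < 16; i++)
        if (x & (1u << i)) r |= (uint16_t)(0x8000u >> i);
    return r;
}

typedef enum { SH_LSR, SH_ASR, SH_LSL } shift_op_t;

uint16_t shifter(shift_op_t op, uint16_t x, unsigned n) {
    /* Filling-in table: sign bit for arithmetic right shift, else 0. */
    uint16_t fill = (op == SH_ASR && (x & 0x8000u)) ? 0xFFFFu : 0;
    if (op == SH_LSL)   /* left shift = reverse, shift right, reverse */
        return reverse16(shift_primitive(reverse16(x), n, fill));
    return shift_primitive(x, n, fill);
}
```

The bit reversals cost only multiplexers at the input and output, so one primitive serves both directions instead of two separate shifters.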