04 - Microarchitecture, RF

(1)

04 - Microarchitecture, RF

Andreas Ehliar

September 17, 2013

(2)

Todays lecture

I Another ASIP instructions example

I Microarchitecture basics

I The register file/Special Registers

I Introduction to the ALU

Andreas Ehliar 04 - Microarchitecture, RF

(3)

A pen and paper example

I A function process data

I Should execute in less than 80 clock cycles

I 16 bit input data

I Intermediate calculations should be done in accumulator to avoid loss of precision

I We assume that we are allowed to place the arrays in any memory

(4)

void process_data(i16 *array0,i16 *array1,i16 *result) { // Assume ptr to array0 in r0, array1 in r1, result in r2

i64 sumofproducts = 0; sumofdiff = 0;

for(int i=0; i < 30; i++){

sumofproducts += array0[ptr0] * array1[ptr1]

sumofdiff += abs(array0[ptr0++] - array1[ptr1++]) }

if(sumofproducts > 0x7fffffff) { sumofproducts = 0x7fffffff;

} else if(sumofproducts < -0x80000000) { sumofproducts = -0x80000000;

}

result[0] = sumofproducts[31:16];

result[1] = sumofproducts[15:0];

result[2] = sumofdiff[31:16];

result[3] = r5 = sumofdiff[15:0];

}

(5)

Microarchitecture Design

I Step 1: Partition each assembly instruction into microoperations, allocate each microoperation into corresponding hardware modules.

I Step 2: Collect all microoperations allocated in a module and specify hardware multiplexing for RTL coding of the module

I Step 3: Fine-tune intermodule specifications of the ASIP architecture and finalize the top-level connections and pipeline.

(6)

Hardware Multiplexing

I Reusing one hardware module for several different operations

I Example: Signed and unsigned 16-bit multiplication

(7)

Hardware Multiplexing

Pre processing

MA A B

MB A B

Kernel processing

MUL

Post processing-1

MP1

Possible functions 1. A + C 2. A + D 3. B + C 4. B + D 5. A * C 6. A * D 7. B * C 8. B * D 9. SAT(A + C) 10.SAT(A + D) 11.SAT(B + C) 12.SAT(B + D) 13.SAT(A * C) 14.SAT(A * D) 15.SAT(B * C) 16.SAT(B * D)

ADD

Control[2]

Control[1] Control[0]

0 1

0 1 0 1

Post processing-2

Control[3] 0 1 MP2 Saturation

opa opb

result1

C D

[Liu2008]

(8)

Hardware multiplexing

I Hardware multiplexing can be implemented either by SW or by configuring the HW

I A processor is basically a very neat design pattern for multiplexing different HW units

I Perhaps the most important skill of a good VLSI designer

(9)

Typical design pattern for datapath modules

Pre-operation-I

Collectedmicrooperations

Pre-operation-II

Pre-operation-X Pre-operation-Y

... ...

Kerneloperations

Post-operation-I Post-operation-II

Post-operation-X Post-operation-Y

... ...

Resultsandverifications

[Liu2008]

(10)

Area properties (a.k.a. what to optimize

I Relative areas of a few different components

I 32-bit Adder: 0.2 to 1 area units

I 32-bit Adder/subtracter: 0.3 to 2 area units

I 32-bit 16 to 1 mux: 0.5 – 0.6 area units

I 17 x 17 bit multiplier: 1.3-3.7 area units

I 8 KiB memory (32 bit wide): 33 area units

(11)

Performance properties

I Relative maximum frequencies

I 32-bit adder: 0.1 to 1

I 32-bit adder/subtracter: 0.1 to 0.9

I 32-bit 16 to 1 mux: 0.31 to 0.9

I 17 x 17 bit multiplier: 0.11 – 0.44

I 8 KiB memory (32 bit wide): 0.53

(12)

Optimizing memory size is often the most important task

I MP3 decoder example

I All memories in the chip are 3 time the size of the DSP core itself

I (I/O pads are also larger than the DSP core itself)

(13)

Microarchitecture design of an instruction

I Required microoperations for a typical convolution instruction:

I conv ACRx,DM0[ARy%++),DM1(ARz++)

I Required microoperations:

I Instruction decoding

I Perform addressing calculation

I Read memories

I Perform signed multiplication

I Add guard bits to the result of the multiplication

I Accumulate the result

I Set flags

I For a combined repeat/conv instruction:

I PC <= PC while in the loop

I PC <= PC + 1 as the last step in the loop

I No saturation/rounding during the iteration

I Saturate/round after final loop iteration

(14)

The register file (RF)

I The RF gets data from data memories by running load instructions while preparing for an execution of a subroutine.

I While running a subroutine, the register file is used as computing buffers.

I After running the subroutine, results in the RF will be stored into data memories by running store instructions.

(15)

General register file

RF

AGU

PCFSM

PM

Program address Configuration andstatus ExecunitALU/MAC

Instruction DM

Program flow control

Operation ctrl MEM ctrl

Results

Instructiondecoder Operand &result control Results and flags

Target addresses, configuration vectors from register file

[Liu2008]

I Connected to almost all parts of the core

(16)

Register file schematic

register 1

register 2

register 3

register n ...

from register file from memory 1 from memory 2 from ALU

from MAC from ports

...

OPA

OPB ctrl_reg_in

ctrl_o_a ctrl_o_b

Write circuit

Store cir cuit

Read circuit

[Liu2008]

(17)

Register file area comparisons

I Register file area comparisons

I Adder (for comparison): 0.1 - 1

I One port, both read/write: (uncommon)

I Relative area: 2.5-2.7

I One read port, one write port: Accumulator registers

I Two read ports, one write: Basic RISC RF

I Relative area: 3-3.2

I Four read ports, two write: Simple VLIW, Simple superscalar

(18)

Register file speed

I Almost (but not quite) the same speed as a very fast 32-bit adder (in this particular technology)

I Also note that it is possible to use special register file memories (but at an increased verification cost)

(19)

Read before write or write before read

I A processor architect has to decide how the register file should work when reading and writing the same register

I Read before write

I The old value is read

I Write before read

I The new value is read (more costly in terms of the timing budget)

(20)

Physical design: fan-out problem

Fan-out of the control signal For the first stage: 16*16*2 = 512

Fan-out of the control signal For the second stage: 16*8*2 = 256

Fan-out of the control signal For the third stage: 16*4*2 = 128

Fan-out of the control signal For the fourth stage: 16*2*2 = 64

Fan-out of the control signal For the fivth stage: 16*1*2 = 32

Selected operand

From 32 registers in a register file

[Liu2008]

(21)

Register File in Verilog

reg [15:0] rf[31:0]; // 16 bit wide RF with 32 entries always @(posedge clk) begin

if(we) rf[waddr] <= wdata;

end

always @* begin

op_a = rf[opaddr_a];

op_b = rf[opaddr_b];

end

(22)

Special Purpose Registers

I Sometimes we need special purpose registers (SPR or SR)

I BOT/TOP for modulo addressing

I AR for address register

I SP

I I/O

I Core configuration register

I Should these be included in the general purpose register file?

(23)

Special Purpose Registers as normal registers

I Convenient for the programmer. Special purpose registers can be accessed like any normal register.

I Example: add bot0,1 ; Move ringbuffer bottom one word

I Example 2 (from ARM): pop pc

I Drawbacks:

I Wastes entries in the general purpose register file

I Harder to use specialized register file memories

(24)

Special purpose registers needs special instructions

I Special instructions required to access SR:s

I Example:

I move r0,bot0 ; Move ringbuffer bottom one word

I (nop) ; May need nop(s) here

I add r0,1

I (nop) ; May need nop(s) here

I bot0,r0

I (Move is encoded as move from from/to special purpose register here)

I Advantage:

I Easier to meet timing as special purpose registers can easier be located anywhere in the core

I Can scale easily to hundreds of special purpose registers if required. (Common on large and complex processors such as ARM/x86)

I Drawback:

I Inconvenient for special registers you need to access all the time

(25)

Conclusions: SPRs

I Only place SPRs as a normal register if you believe it will be read/written via normal instructions very often

(26)

ALU in general

I ALU: Arithmetic and Logic Unit

I Arithmetic, Logic, Shift/rotate, others

I No guard bits for iterative computing

I One guard bit for single step computing

I Get operands from and send result to RF

I Handles single precision computing

(27)

Separate ALU or ALU in MAC

multiplier

Register file Register file

multiplier

Register file Register file

(a) (b)

ALU

Accumulator ALU and

Accumulator

DTU

[Liu2008]

(28)

ALU high level schematic

Shift Logic unit

unit

Masker, guard, carry-in, and other preprocessing

A [15:0] B [15:0]

Saturation and flag processing

Result [15:0] FA/FC, FS, FZ

[Liu2008]

(29)

Pre-processing

I Select operands: from one of the source

I Register file, control path, HW constant

I Typical operand pre processing:

I Guard: one guard

I (does not support iterative computing)

I Invert: Conditional/non-conditional invert

I Supply constant 0, 1, -1

I Mask operand(s)

I Select proper carry input

(30)

Post-processing

I Select result from multiple components

I From AU, logic unit, shift unit, and others

I Saturation operation

I Decide to generate carry-out flag or saturation

I Perform saturation on result if required

I Flag operation

I Flag computing and prediction

(31)

General instructions

Operation opa opb Carry in Carry out

ADD Addition + + 0 Cout/SAT

SUB Subtraction + - 1 Cout/SAT

ABS Absolute +/- A[15] SAT

CMP Compare + - 1 SAT

NEG Negate - 1 SAT

INC Increment + 1 0 SAT

DEC Decrement + -1 0 SAT

AVG Average + + 0 SAT

(32)

Special Instructions

Mnemonic Description Operation

MAX Select larger value RF <= max(OpA,OpB) MIN Select smaller value RF <= min(OpA,OpB) DTA Difference of two GR <= |OpA| − |OpB|

absolute values

SAD Sum of absolute GR <= |OpA−OpB|

difference

(33)

Adder with carry in for RTL synthesis (safe solution)

+

{A[15], A[15:0], “1”}

Result [16:0] < =FAO [17:1]

{B[15],B[15:0],CIN}

18b full adder FAO [17:0]

[Liu2008]

I Full adder may have no carry in

I One guard bit

I We need 2 extra bits in the adder

I LSB of the 18b result will not be used

I MSB of the 18b result will be the guard

I Works on all synthesis tools

(34)

Adder for RTL synthesis (modern version)

I Cout,Res[15:0] = 1’b0,A[15:0]+1’b0,B[15:0]+Cin;

I Cout is 1 bit wide

I Important: Cin is 1 bit wide!

I Modern synthesis tools can usually handle this case without creating two adders

I (I’ve had to resort to the “safe” version shown on the previous slide in a few cases though. For example when combining an adder with other logic in an FPGA.)

(35)

Example: Implement an 8 bit ALU

Instructions Function OP

NOP No change of flags 0

A+B A + B (without saturation) 1

A-B A − B (without saturation) 2

SAT(A+B) A + B (with saturation) 3

SAT(A-B) A − B (with saturation) 4

SAT(ABS(A)) |A| (absolute operation, saturation) 5 SAT(ABS(A+B)) |A + B| (absolute operation, saturation) 6 SAT(ABS(A-B)) |A − B| (absolute operation, saturation) 7 CLR S Clear S flag (other flags unchanged) 8

I There shold be a negative, zero, and saturation flag!

(36)

Problem: The following function must execute in < 2048 cycles

I Problem: The following function must execute in less than 2048 cycles for(i=0; i<500;i++){

tmp=abs(*p0++]);

tmp2=abs(*p1++);

*p2++ = tmp-tmp2;

}

; First try in assembler repeat 500,lend

ld r0,DM0[ar0++]

ld r1,DM0[ar1++]

abs r0 abs r1

sub r0,r0,r1 st DM0[ar2++],r0 lend:

; 3000 cycles for inner

; loop

(37)

16 bit |A| − |B|

I How many operations do we need for |A| − |B|?

I The easy approach:

I One adder for |A|

I One adder for |B|

I One adder for the final subtraction

(38)

Typical ALU shift operations

... ...

15 14 1 0 A rithmetic right shift

... ...

15 14 1 0 L ogic right shift

0

... ...

15 14 1 0 0 L ogic left shift

Rotate right without carry flag ... ...

15 14 1 0

Rotate left without carry flag ... ...

15 14 1 0

... ...

15 14 1 0

Rotate left with carry flag (more than one bit not needed) ... ...

15 14 1 0 C

C Rotate right with carry flag

(more than one bit not needed)

[Liu2008]

(39)

Shifter primitive

Fill in [0]

Fill in [1]

Fill in [2]

Fill in [3]

Fill in [4]

Fill in [5]

Fill in [6]

Fill in [7]

Fill in [8]

Fill in [9]

[15]

[14]

[13]

[12]

[11]

[10]

[8] [9]

[6] [7]

[5]

[4]

[3]

[2]

[1]

[0]

[15]

[14]

[13]

[12]

[11]

[10]

[9]

[8]

[7]

[6]

[5]

[4]

[3]

[2]

[1]

[0]

Shift Input [15:0]

Shift output [15:0]

CTRL LSB

CTRL [1]

CTRL [2]

CTRL MSB Fill in [10]

Fill in [11]

Fill in [12]

Fill in [13]

Fill in [14]

[Liu2008]

I Note: Barrel shifters based on 4-to-1 multiplexers may be more efficient Andreas Ehliar 04 - Microarchitecture, RF

(40)

Hardware multiplexing in shifter

Filling-in 16 bits shift

primitive

16 Shift in [15:0]

Shift in [0:15]

16 Shift out [15:0]

Shift out [0:15]

MSB [15] “0”

Right shift

Left shift

Right shift

Left shift Fill in port

From filling-in table

16

[Liu2008]

I Note: Fill in table may be complicated for some shift operations