• No results found

04 - Microarchitecture, RF

N/A
N/A
Protected

Academic year: 2022

Share "04 - Microarchitecture, RF"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

04 - Microarchitecture, RF

Andreas Ehliar

September 17, 2013

(2)

Todays lecture

I Another ASIP instructions example

I Microarchitecture basics

I The register file/Special Registers

I Introduction to the ALU

Andreas Ehliar 04 - Microarchitecture, RF

(3)

A pen and paper example

I A function process data

I Should execute in less than 80 clock cycles

I 16 bit input data

I Intermediate calculations should be done in accumulator to avoid loss of precision

I We assume that we are allowed to place the arrays in any memory

(4)

void process_data(i16 *array0,i16 *array1,i16 *result) { // Assume ptr to array0 in r0, array1 in r1, result in r2

i64 sumofproducts = 0; sumofdiff = 0;

for(int i=0; i < 30; i++){

sumofproducts += array0[ptr0] * array1[ptr1]

sumofdiff += abs(array0[ptr0++] - array1[ptr1++]) }

if(sumofproducts > 0x7fffffff) { sumofproducts = 0x7fffffff;

} else if(sumofproducts < -0x80000000) { sumofproducts = -0x80000000;

}

result[0] = sumofproducts[31:16];

result[1] = sumofproducts[15:0];

result[2] = sumofdiff[31:16];

result[3] = r5 = sumofdiff[15:0];

}

Andreas Ehliar 04 - Microarchitecture, RF

(5)

Microarchitecture Design

I Step 1: Partition each assembly instruction into microoperations, allocate each microoperation into corresponding hardware modules.

I Step 2: Collect all microoperations allocated in a module and specify hardware multiplexing for RTL coding of the module

I Step 3: Fine-tune intermodule specifications of the ASIP architecture and finalize the top-level connections and pipeline.

(6)

Hardware Multiplexing

I Reusing one hardware module for several different operations

I Example: Signed and unsigned 16-bit multiplication

Andreas Ehliar 04 - Microarchitecture, RF

(7)

Hardware Multiplexing

Pre processing

MA A B

MB A B

Kernel processing

MUL

Post processing-1

MP1

Possible functions 1. A + C 2. A + D 3. B + C 4. B + D 5. A * C 6. A * D 7. B * C 8. B * D 9. SAT(A + C) 10.SAT(A + D) 11.SAT(B + C) 12.SAT(B + D) 13.SAT(A * C) 14.SAT(A * D) 15.SAT(B * C) 16.SAT(B * D)

ADD

Control[2]

Control[1] Control[0]

0 1

0 1 0 1

Post processing-2

Control[3] 0 1 MP2 Saturation

opa opb

result1

C D

[Liu2008]

(8)

Hardware multiplexing

I Hardware multiplexing can be implemented either by SW or by configuring the HW

I A processor is basically a very neat design pattern for multiplexing different HW units

I Perhaps the most important skill of a good VLSI designer

Andreas Ehliar 04 - Microarchitecture, RF

(9)

Typical design pattern for datapath modules

Pre-operation-I

Collectedmicrooperations

Pre-operation-II

Pre-operation-X Pre-operation-Y

... ...

Kerneloperations

Post-operation-I Post-operation-II

Post-operation-X Post-operation-Y

... ...

Resultsandverifications

[Liu2008]

(10)

Area properties (a.k.a. what to optimize

I Relative areas of a few different components

I 32-bit Adder: 0.2 to 1 area units

I 32-bit Adder/subtracter: 0.3 to 2 area units

I 32-bit 16 to 1 mux: 0.5 – 0.6 area units

I 17 x 17 bit multiplier: 1.3-3.7 area units

I 8 KiB memory (32 bit wide): 33 area units

Andreas Ehliar 04 - Microarchitecture, RF

(11)

Performance properties

I Relative maximum frequencies

I 32-bit adder: 0.1 to 1

I 32-bit adder/subtracter: 0.1 to 0.9

I 32-bit 16 to 1 mux: 0.31 to 0.9

I 17 x 17 bit multiplier: 0.11 – 0.44

I 8 KiB memory (32 bit wide): 0.53

(12)

Optimizing memory size is often the most important task

I MP3 decoder example

I All memories in the chip are 3 time the size of the DSP core itself

I (I/O pads are also larger than the DSP core itself)

Andreas Ehliar 04 - Microarchitecture, RF

(13)

Microarchitecture design of an instruction

I Required microoperations for a typical convolution instruction:

I conv ACRx,DM0[ARy%++),DM1(ARz++)

I Required microoperations:

I Instruction decoding

I Perform addressing calculation

I Read memories

I Perform signed multiplication

I Add guard bits to the result of the multiplication

I Accumulate the result

I Set flags

I For a combined repeat/conv instruction:

I PC <= PC while in the loop

I PC <= PC + 1 as the last step in the loop

I No saturation/rounding during the iteration

I Saturate/round after final loop iteration

(14)

The register file (RF)

I The RF gets data from data memories by running load instructions while preparing for an execution of a subroutine.

I While running a subroutine, the register file is used as computing buffers.

I After running the subroutine, results in the RF will be stored into data memories by running store instructions.

Andreas Ehliar 04 - Microarchitecture, RF

(15)

General register file

RF

AGU

PCFSM

PM

Program address Configuration andstatus ExecunitALU/MAC

Instruction DM

Program flow control

Operation ctrl MEM ctrl

Results

Instructiondecoder Operand &result control Results and flags

Target addresses, configuration vectors from register file

[Liu2008]

I Connected to almost all parts of the core

(16)

Register file schematic

register 1

register 2

register 3

register n ...

from register file from memory 1 from memory 2 from ALU

from MAC from ports

...

OPA

OPB ctrl_reg_in

ctrl_o_a ctrl_o_b

Write circuit

Store cir cuit

Read circuit

[Liu2008]

Andreas Ehliar 04 - Microarchitecture, RF

(17)

Register file area comparisons

I Register file area comparisons

I Adder (for comparison): 0.1 - 1

I One port, both read/write: (uncommon)

I Relative area: 2.5-2.7

I One read port, one write port: Accumulator registers

I Relative area: 2.5-2.6

I Two read ports, one write: Basic RISC RF

I Relative area: 3-3.2

I Four read ports, two write: Simple VLIW, Simple superscalar

I Relative area: 4.6-5.8

(18)

Register file speed

I Almost (but not quite) the same speed as a very fast 32-bit adder (in this particular technology)

I Also note that it is possible to use special register file memories (but at an increased verification cost)

Andreas Ehliar 04 - Microarchitecture, RF

(19)

Read before write or write before read

I A processor architect has to decide how the register file should work when reading and writing the same register

I Read before write

I The old value is read

I Write before read

I The new value is read (more costly in terms of the timing budget)

(20)

Physical design: fan-out problem

Fan-out of the control signal For the first stage: 16*16*2 = 512

Fan-out of the control signal For the second stage: 16*8*2 = 256

Fan-out of the control signal For the third stage: 16*4*2 = 128

Fan-out of the control signal For the fourth stage: 16*2*2 = 64

Fan-out of the control signal For the fivth stage: 16*1*2 = 32

Selected operand

From 32 registers in a register file

[Liu2008]

Andreas Ehliar 04 - Microarchitecture, RF

(21)

Register File in Verilog

reg [15:0] rf[31:0]; // 16 bit wide RF with 32 entries always @(posedge clk) begin

if(we) rf[waddr] <= wdata;

end

always @* begin

op_a = rf[opaddr_a];

op_b = rf[opaddr_b];

end

(22)

Special Purpose Registers

I Sometimes we need special purpose registers (SPR or SR)

I BOT/TOP for modulo addressing

I AR for address register

I SP

I I/O

I Core configuration register

I Should these be included in the general purpose register file?

Andreas Ehliar 04 - Microarchitecture, RF

(23)

Special Purpose Registers as normal registers

I Convenient for the programmer. Special purpose registers can be accessed like any normal register.

I Example: add bot0,1 ; Move ringbuffer bottom one word

I Example 2 (from ARM): pop pc

I Drawbacks:

I Wastes entries in the general purpose register file

I Harder to use specialized register file memories

(24)

Special purpose registers needs special instructions

I Special instructions required to access SR:s

I Example:

I move r0,bot0 ; Move ringbuffer bottom one word

I (nop) ; May need nop(s) here

I add r0,1

I (nop) ; May need nop(s) here

I bot0,r0

I (Move is encoded as move from from/to special purpose register here)

I Advantage:

I Easier to meet timing as special purpose registers can easier be located anywhere in the core

I Can scale easily to hundreds of special purpose registers if required. (Common on large and complex processors such as ARM/x86)

I Drawback:

I Inconvenient for special registers you need to access all the time

Andreas Ehliar 04 - Microarchitecture, RF

(25)

Conclusions: SPRs

I Only place SPRs as a normal register if you believe it will be read/written via normal instructions very often

(26)

ALU in general

I ALU: Arithmetic and Logic Unit

I Arithmetic, Logic, Shift/rotate, others

I No guard bits for iterative computing

I One guard bit for single step computing

I Get operands from and send result to RF

I Handles single precision computing

Andreas Ehliar 04 - Microarchitecture, RF

(27)

Separate ALU or ALU in MAC

multiplier

Register file Register file

multiplier

Register file Register file

(a) (b)

ALU

Accumulator ALU and

Accumulator

DTU

[Liu2008]

(28)

ALU high level schematic

Shift Logic unit

unit

Masker, guard, carry-in, and other preprocessing

A [15:0] B [15:0]

Saturation and flag processing

Result [15:0] FA/FC, FS, FZ

[Liu2008]

Andreas Ehliar 04 - Microarchitecture, RF

(29)

Pre-processing

I Select operands: from one of the source

I Register file, control path, HW constant

I Typical operand pre processing:

I Guard: one guard

I (does not support iterative computing)

I Invert: Conditional/non-conditional invert

I Supply constant 0, 1, -1

I Mask operand(s)

I Select proper carry input

(30)

Post-processing

I Select result from multiple components

I From AU, logic unit, shift unit, and others

I Saturation operation

I Decide to generate carry-out flag or saturation

I Perform saturation on result if required

I Flag operation

I Flag computing and prediction

Andreas Ehliar 04 - Microarchitecture, RF

(31)

General instructions

Operation opa opb Carry in Carry out

ADD Addition + + 0 Cout/SAT

SUB Subtraction + - 1 Cout/SAT

ABS Absolute +/- A[15] SAT

CMP Compare + - 1 SAT

NEG Negate - 1 SAT

INC Increment + 1 0 SAT

DEC Decrement + -1 0 SAT

AVG Average + + 0 SAT

(32)

Special Instructions

Mnemonic Description Operation

MAX Select larger value RF <= max(OpA,OpB) MIN Select smaller value RF <= min(OpA,OpB) DTA Difference of two GR <= |OpA| − |OpB|

absolute values

SAD Sum of absolute GR <= |OpA−OpB|

difference

Andreas Ehliar 04 - Microarchitecture, RF

(33)

Adder with carry in for RTL synthesis (safe solution)

+

{A[15], A[15:0], “1”}

Result [16:0] < =FAO [17:1]

{B[15],B[15:0],CIN}

18b full adder FAO [17:0]

[Liu2008]

I Full adder may have no carry in

I One guard bit

I We need 2 extra bits in the adder

I LSB of the 18b result will not be used

I MSB of the 18b result will be the guard

I Works on all synthesis tools

(34)

Adder for RTL synthesis (modern version)

I Cout,Res[15:0] = 1’b0,A[15:0]+1’b0,B[15:0]+Cin;

I Cout is 1 bit wide

I Important: Cin is 1 bit wide!

I Modern synthesis tools can usually handle this case without creating two adders

I (I’ve had to resort to the “safe” version shown on the previous slide in a few cases though. For example when combining an adder with other logic in an FPGA.)

Andreas Ehliar 04 - Microarchitecture, RF

(35)

Example: Implement an 8 bit ALU

Instructions Function OP

NOP No change of flags 0

A+B A + B (without saturation) 1

A-B A − B (without saturation) 2

SAT(A+B) A + B (with saturation) 3

SAT(A-B) A − B (with saturation) 4

SAT(ABS(A)) |A| (absolute operation, saturation) 5 SAT(ABS(A+B)) |A + B| (absolute operation, saturation) 6 SAT(ABS(A-B)) |A − B| (absolute operation, saturation) 7 CLR S Clear S flag (other flags unchanged) 8

I There shold be a negative, zero, and saturation flag!

(36)

Problem: The following function must execute in < 2048 cycles

I Problem: The following function must execute in less than 2048 cycles for(i=0; i<500;i++){

tmp=abs(*p0++]);

tmp2=abs(*p1++);

*p2++ = tmp-tmp2;

}

; First try in assembler repeat 500,lend

ld r0,DM0[ar0++]

ld r1,DM0[ar1++]

abs r0 abs r1

sub r0,r0,r1 st DM0[ar2++],r0 lend:

; 3000 cycles for inner

; loop

Andreas Ehliar 04 - Microarchitecture, RF

(37)

16 bit |A| − |B|

I How many operations do we need for |A| − |B|?

I The easy approach:

I One adder for |A|

I One adder for |B|

I One adder for the final subtraction

(38)

Typical ALU shift operations

... ...

15 14 1 0 A rithmetic right shift

... ...

15 14 1 0 L ogic right shift

0

... ...

15 14 1 0 0 L ogic left shift

Rotate right without carry flag ... ...

15 14 1 0

Rotate left without carry flag ... ...

15 14 1 0

... ...

15 14 1 0

Rotate left with carry flag (more than one bit not needed) ... ...

15 14 1 0 C

C Rotate right with carry flag

(more than one bit not needed)

[Liu2008]

Andreas Ehliar 04 - Microarchitecture, RF

(39)

Shifter primitive

Fill in [0]

Fill in [1]

Fill in [2]

Fill in [3]

Fill in [4]

Fill in [5]

Fill in [6]

Fill in [7]

Fill in [8]

Fill in [9]

[15]

[14]

[13]

[12]

[11]

[10]

[8] [9]

[6] [7]

[5]

[4]

[3]

[2]

[1]

[0]

[15]

[14]

[13]

[12]

[11]

[10]

[9]

[8]

[7]

[6]

[5]

[4]

[3]

[2]

[1]

[0]

Shift Input [15:0]

Shift output [15:0]

CTRL LSB

CTRL [1]

CTRL [2]

CTRL MSB Fill in [10]

Fill in [11]

Fill in [12]

Fill in [13]

Fill in [14]

[Liu2008]

I Note: Barrel shifters based on 4-to-1 multiplexers may be more efficient Andreas Ehliar 04 - Microarchitecture, RF

(40)

Hardware multiplexing in shifter

Filling-in 16 bits shift

primitive

16 Shift in [15:0]

Shift in [0:15]

16 Shift out [15:0]

Shift out [0:15]

MSB [15] “0”

Right shift

Left shift

Right shift

Left shift Fill in port

From filling-in table

16

[Liu2008]

I Note: Fill in table may be complicated for some shift operations

Andreas Ehliar 04 - Microarchitecture, RF

References

Related documents

In terms of mordant type and method, the use of CaO mordant with post and combined methods generated the best light fastness to light with a value of 4-5 (good

The solution ap- proaches are discussed in Section 3, where we present heuristic and dy- namic programming approaches for determining a feasible loading for one vehicle, and in

The mechanical properties and morphology results allowed us to classify the efficiency of the materials tested to compatibilize PP/PHB blends in this order: P(E–MA–GMA)  P(E–MA)

pula pull the value from the top of the stack and store in register A; increment the stack pointer; used for restoring the contents of a register at the end of a subroutine,

Running gait cycle: *for running and sprinting; IC, initial contact; TO, toe off; StR, stance phase reversal; SwR, swing phase reversal; absorption, from SwR through IC to

Digital twins of the physical asset and its operating conditions asset are a marriage of digital representation of the physical plant such as equipment data (size, capacity,

organizations and local public health departments in the next round of CHNAs. Methodology: Georgia Watch conducted a comprehensive review of 38 CHNAs and 29

The Clerk was asked to arrange a meeting between a representative from the Kent County League, Platt United and Platt Parish Council.. Cllrs Searle, Scott and Darby offered to