
(1)

Digital Design: An Embedded Systems Approach Using Verilog

Chapter 9 Accelerators

(2)

Performance and Parallelism

• A processor core performs steps in sequence
  - Performance limited by the instruction rate
• Accelerating performance
  - Perform steps in parallel
  - Takes less time overall to complete an operation
• Instruction-level parallelism
  - Within a processor core
  - Pipelining, multiple-issue
• Accelerators

(3)

Achievable Parallelism

• How many steps can be performed at once?
• Regularly structured data
  - Independent processing steps
  - Examples: video and image pixel processing, audio or sensor signal processing
• Constrained by data dependencies
  - Operations that depend on results of

(4)

Algorithm Kernels

• Algorithm: specification of the required processing steps
  - Often expressed in a programming language
• Kernel: the part that involves the most intensive, repetitive processing
  - “10% of operations take 90% of the time”
• Accelerating a kernel with parallel

(5)

Amdahl’s Law

• Time for an algorithm is t
  - Fraction f is spent on a kernel

$$ t = ft + (1 - f)t $$

• Accelerator speeds up kernel by a factor s

$$ t' = \frac{ft}{s} + (1 - f)t $$

• Overall speedup factor s'

$$ s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)} $$

  - For large f, s' ≈ s
  - For small f, s' ≈ 1

(6)

Amdahl’s Law Example

• An algorithm with two kernels
  - Kernel 1: 80% of time, can be sped up 10 times
  - Kernel 2: 15% of time, can be sped up 100 times
  - Which speedup gives best overall improvement?
• For kernel 1:

$$ s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57 $$

• For kernel 2:

$$ s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17 $$
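As a cross-check on this example, letting s grow without bound in Amdahl's Law gives a ceiling on the overall speedup for each kernel. This limit is a standard consequence of the formula, not something stated on the slide:

```latex
\[
\lim_{s \to \infty} s' = \frac{1}{1 - f}
\qquad\Rightarrow\qquad
\text{kernel 1: } \frac{1}{1 - 0.8} = 5,
\qquad
\text{kernel 2: } \frac{1}{1 - 0.15} \approx 1.18
\]
```

Even an arbitrarily fast accelerator for kernel 2 could not match the 3.57 obtained from the 10-times speedup of kernel 1.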

(7)

Parallel Architectures

• An architecture for an accelerator specifies
  - Processing blocks
  - Data flow between them
• Parallelism through replication
  - Multiple identical blocks operating on different data elements
  - Works well when elements can be

(8)

Parallel Architectures

• Parallelism through pipelining
  - Break a computation into steps, perform them in assembly-line fashion
  - Latency (time to complete a single operation) is not reduced
  - Throughput (rate of completion of operations) is increased
  - Ideally by a factor equal to the number of pipeline stages

[Figure: data flowing through a three-stage pipeline: step 1, step 2, step 3]
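The assembly-line idea can be sketched in Verilog as follows. This is a minimal illustration, not taken from the slides: the module name pipe3, the port names, and the arithmetic done in each step are all hypothetical.

```verilog
// Hypothetical three-stage pipeline computing y = ((x + 1) * 3) - 2,
// with one step per stage and a register after each step.
module pipe3 (
  input  wire        clk,
  input  wire [7:0]  x,
  output reg  [11:0] y
);

  reg [8:0]  s1;  // result of step 1 (at most 256)
  reg [10:0] s2;  // result of step 2 (at most 768)

  always @(posedge clk) begin
    s1 <= x  + 9'd1;   // step 1 works on the newest input
    s2 <= s1 * 11'd3;  // step 2 works on the previous input
    y  <= s2 - 12'd2;  // step 3 works on the input before that
  end

endmodule
```

Each input still needs three clock edges to reach y, so latency is unchanged, but a new input is accepted and a new result is produced on every clock edge once the pipeline is full.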

(9)

Direct Memory Access (DMA)

• Input/output data for accelerators must be transferred at high speed
  - Using the processor would be too slow
• Direct memory access
  - I/O controller and accelerator transfer data to and from memory autonomously
  - Program supplies starting address and

(10)

Bus Arbitration

• Bus masters take turns to use bus to access slaves
  - Controlled by a bus arbiter
• Arbitration policies
  - Priority, round-robin, …

[Figure: processor, accelerator, and controller each connected to the arbiter by request/grant signals, sharing the memory bus]

(11)

Block-Processing Accelerator

• Data arranged in regular groups of contiguous memory locations
  - Accelerator works block by block
  - E.g., images in blocks of 8 × 8 × 16-bit pixels
• Datapath comprises
  - Memory access: address generation, counters
  - Computation section

(12)

Stream-Processing Accelerator

• Streams of data from an input source
  - E.g., high-speed sensors
• Digital signal processing (DSP)
  - Analog sensor signal converted to stream of digital sample values
  - Filtering, gain/attenuation,

(13)

Processor/Accelerator Interface

• Embedded software controls an accelerator
  - Providing control parameters
  - Synchronizing operations
• Input/output registers and interrupts

(14)

Case Study: Edge Detection

• Illustration of accelerator design
• Edge detection in video processing
  - Identify where image intensity changes abruptly
  - Typically at the boundary of objects
  - First step in identifying objects in a scene
• Application areas
  - Video surveillance, computer vision, …
• For this case study
  - Monochrome images of 640 × 480 × 8-bit pixels
  - Stored row-by-row in memory

(15)

Sobel Edge Detection

• Compute derivatives of intensity in x and y directions
  - Look for minima and maxima (where

(16)

The Sobel Algorithm

• Use convolution to approximate partial derivatives Dx and Dy at each position
  - Weighted sum of value of a pixel and its eight nearest neighbors
  - Coefficients represented using a 3×3 convolution mask
• Sobel masks for x and y derivatives

$$
G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}
\qquad
G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}
$$

(17)

The Sobel Algorithm

• Combine partial derivatives

$$ |D| = \sqrt{D_x^2 + D_y^2} $$

• Since we just want maxima and minima in magnitude, approximate as:

$$ |D| \approx |D_x| + |D_y| $$

• Pixels on the image border don’t have eight neighbors
  - Skip computation of |D| for border pixels

(18)

The Algorithm in Pseudocode

for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
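For checking data values in simulation, the inner two loops can be unrolled into a small reference function for a single 3×3 window. The sketch below is an assumption-laden illustration, not the textbook's code: the names sobel_ref_tb and sobel_mag and the argument naming are hypothetical, and the centre pixel is omitted because both masks weight it by 0.

```verilog
// Sketch only: sobel_ref_tb and sobel_mag are hypothetical names.
module sobel_ref_tb;

  // Returns |Dx| + |Dy| for one 3x3 window.
  // Argument pRC holds the pixel at row offset R and column offset C,
  // where m = -1, z = 0, p = +1 (e.g. pmz is O[row-1][col]).
  function [10:0] sobel_mag;
    input [7:0] pmm, pmz, pmp;  // row - 1: left, middle, right
    input [7:0] pzm,      pzp;  // row    : left, right
    input [7:0] ppm, ppz, ppp;  // row + 1: left, middle, right
    integer dx, dy;
    begin
      // Gx: right column minus left column, middle row weighted by 2
      dx = (pmp + 2*pzp + ppp) - (pmm + 2*pzm + ppm);
      // Gy: top row minus bottom row, middle column weighted by 2
      dy = (pmm + 2*pmz + pmp) - (ppm + 2*ppz + ppp);
      if (dx < 0) dx = -dx;
      if (dy < 0) dy = -dy;
      sobel_mag = dx + dy;
    end
  endfunction

  initial begin
    // Flat region: both derivatives are zero
    if (sobel_mag(100, 100, 100, 100, 100, 100, 100, 100) !== 11'd0)
      $display("FAIL: flat region");
    // Vertical edge (left column 0, right column 255): Dx = 1020, Dy = 0
    if (sobel_mag(0, 128, 255, 0, 255, 0, 128, 255) !== 11'd1020)
      $display("FAIL: vertical edge");
    $display("reference checks done");
    $finish;
  end

endmodule
```

A function like this could be called from a testbench to compare against the accelerator's result before it is scaled back to 8 bits.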

(19)

Data Formats and Rates

• Pixel values: 0 to 255 (8 bits)
  - Coefficients are 0, ±1 and ±2
  - Partial products: –510 to +510 (10 bits)
  - D_x and D_y: –1020 to +1020 (11 bits)
  - |D|: 0 to 2040 (11 bits)
  - Final pixel value: scale back to 8 bits
• Video rate: 30 frames/sec
  - 640 × 480 = 307,200 pixels

(20)

Data Dependencies

• Pixels can be computed independently
• For each pixel:

(21)

System Architecture

• Data dependencies suggest a pipeline
• Coefficient multiplies are simple shift/negate, so

(22)

Memory Bandwidth

• Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  - Memory is 32 bits wide, byte addressable
  - Bandwidth = 50M operations/sec
• Camera produces 10 Mpixel/sec
  - Accelerator needs to process at this rate
  - (8 reads + 1 write) × 10 Mpixel/sec = 90M operations/sec

(23)

Memory Bandwidth

• Read 4 pixels at once from each of previous, current, and next rows
  - Store in accelerator to compute multiple derivative image pixels
• Produce derivative pixels row-by-row, left-to-right
  - Read 3 × 32-bit words for every 4th derivative pixel computed
  - Write 4 pixels at a time
  - (3 reads + 1 write) / 4 × 10 Mpixel/sec = 10M operations/sec
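Setting the two estimates against the available bandwidth shows why the word-wide scheme is needed. The utilization percentages below are derived here from the slide figures (one operation per 20 ns, so 50M operations/sec available) and are not quoted from the source:

```latex
\[
\text{pixel at a time: } \frac{90\ \text{M ops/sec}}{50\ \text{M ops/sec}} = 180\%
\quad (\text{exceeds the memory bandwidth})
\]
\[
\text{four pixels per word: } \frac{10\ \text{M ops/sec}}{50\ \text{M ops/sec}} = 20\%
\]
```

The remaining bandwidth is what is left for the processor and any other bus masters.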

(24)

(25)

Accelerator Sequence

• Steady state
  - Write 4 result pixels
  - Read 4 pixels for previous, current, next rows
  - Compute for 4 cycles
  - Repeat…
• Start of row
  - Omit writes until pipeline full
• End of row
  - Omit reads to drain

(26)

Memory Operation Timing

(27)

Pixel Datapath

// Computation datapath signals
reg [31:0] prev_row, curr_row, next_row;
reg [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg [7:0] abs_D;
reg [31:0] result_row;
...

// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row <= dat_i;
  else if (shift_en) prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register

function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction

(28)

Pixel Datapath

always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    D = abs(Dx) + abs(Dy);
    abs_D <= D[10:3];
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <=   $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});

(29)

Pixel Datapath

    O[-1][-1] <= O[-1][ 0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[ 0][ 0];
    O[ 0][ 0] <= O[ 0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end

always @(posedge clk_i) // Result row register

(30)

Address Generation

• Given an image in memory at base address B
  - Address for pixel in row r, column c is B + r × 640 + c
• Base address (B) is fixed
• Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
• Use word-aligned addresses
  - Two least-significant bits always 00
  - Increment word address by 1
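The step from byte addresses to word addresses can be written out as below. This is a sketch that assumes B and c are both multiples of 4, which is what word alignment of each 4-pixel group implies:

```latex
\[
\text{word address}
= \frac{B + 640\,r + c}{4}
= \frac{B}{4} + 160\,r + \frac{c}{4}
\]
\[
\text{e.g. } r = 1,\ c = 4:\qquad \frac{B}{4} + 160 + 1
\]
```

One image row therefore spans 160 words, and moving to the next group of 4 pixels adds 1 to the word address.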

(31)

(32)

Address Generation

always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];

always @(posedge clk_i) // O address offset counter
  if (offset_reset) O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;

always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];

always @(posedge clk_i) // D address offset counter
  if (offset_reset) D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...

(33)

Address Generation

assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr      = D_base + D_offset;

assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                                     D_addr;

(34)

Control/Status Registers

Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears

(35)

Slave Bus Interface

assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;

always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];

always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  // Status is read-only and cleared by reading, so test for a read (!we_i)
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    done <= 1'b0;

(36)

Slave Bus Interface

always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;

// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0; // other registers read as 0
  else

(37)

Control Sequencing

• Use a finite-state machine
  - Counters keep track of rows (0 to 477) and columns (0 to 159)
• See textbook for details of FSM output

(38)

(39)

Accelerator Verification

• Simulation-based verification of each section of the accelerator
  - Slave bus operations
  - Computation sequencing
  - Master bus operations
  - Address generation
  - Pixel computation
• Testbench including the accelerator
  - Bus functional processor model

(40)

Sobel Verification Testbench

[Figure: testbench block diagram containing the processor BFM, Sobel accelerator, arbiter, and memory model]

(41)

Processor Bus Functional Model

initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations

(42)

Processor Bus Functional Model

  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;
    end
  end
end

(43)

Memory Bus Functional Model

always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end

(44)

Bus Arbiter

• Uses sobel_cyc_o and cpu_cyc_o as request inputs
  - If both request at the same time, give accelerator priority

(45)

Bus Arbiter

always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else arbiter_current_state <= arbiter_next_state;

always @* // Arbiter logic
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
           else if (!sobel_cyc_o && cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end
           else begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
    cpu:   if (cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end
           else if (sobel_cyc_o && !cpu_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
           else begin

(46)

Simulation Results

• See waveforms in textbook
  - Demonstrates sequencing and address generation
• But what about…
  - Data values computed correctly
  - Interactions between processor and accelerator
• Need to use more sophisticated verification techniques

(47)

Summary

• Accelerators boost performance using parallel hardware
  - Replication, pipelining, …
• Amdahl’s Law
  - Best payback from accelerating a kernel
• DMA avoids processor overhead

