An Embedded Systems
Approach Using Verilog
Chapter 9
Accelerators
Performance and Parallelism
A processor core performs steps in sequence
Performance limited by the instruction rate
Accelerating performance
Perform steps in parallel
Takes less time overall to complete an operation
Instruction-level parallelism
Within a processor core
Pipelining, multiple-issue
Accelerators
Achievable Parallelism
How many steps can be performed at
once?
Regularly structured data
Independent processing steps
Examples
Video and image pixel processing
Audio or sensor signal processing
Constrained by data dependencies
Operations that depend on results of
Algorithm Kernels
Algorithm: specification of the required
processing steps
Often expressed in a programming
language
Kernel: the part that involves the most
intensive, repetitive processing
“10% of operations take 90% of the time”
Accelerating a kernel with parallel
Amdahl’s Law
Time for an algorithm is t
Fraction f is spent on a kernel
$$t = ft + (1 - f)\,t$$
Accelerator speeds up
kernel by a factor s
$$t' = \frac{ft}{s} + (1 - f)\,t$$
Overall speedup factor s'
For large f, s' → s
For small f, s' → 1
$$s' = \frac{t}{t'} = \frac{1}{\dfrac{f}{s} + (1 - f)}$$
Amdahl’s Law Example
An algorithm with two kernels
Kernel 1: 80% of time, can be sped up 10 times
Kernel 2: 15% of time, can be sped up 100 times
Which speedup gives best overall improvement?
For kernel 1:
$$s' = \frac{1}{\dfrac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$$
For kernel 2:
$$s' = \frac{1}{\dfrac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$$
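As a supplementary check that is not on the original slide, letting the kernel speedup s grow without bound shows the ceiling that the unaccelerated fraction imposes. This is a sketch using only the f values from the example above:

```latex
\lim_{s \to \infty} s' = \frac{1}{1 - f}
\qquad
\text{Kernel 1: } \frac{1}{1 - 0.8} = 5
\qquad
\text{Kernel 2: } \frac{1}{1 - 0.15} \approx 1.18
```

Even an infinitely fast accelerator for kernel 2 cannot match the 3.57 obtained from the modest 10-times speedup of kernel 1.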
Parallel Architectures
An architecture for an accelerator
specifies
Processing blocks
Data flow between them
Parallelism through replication
Multiple identical blocks operating on
different data elements
Works well when elements can be
Parallel Architectures
Parallelism through pipelining
Break a computation into steps, perform them in
assembly-line fashion
Latency (time to complete a single operation) is
not increased
Throughput (rate of completion of operations) is
increased
Ideally by a factor equal to the number of pipeline stages
[Figure: data flowing through pipeline stages step 1, step 2, step 3]
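The assembly-line idea can be sketched as a small Verilog module. This is an illustrative example only: the module name pipeline3, its ports, and the computation ((a + b) × 3 + 1) are assumptions chosen to give three simple steps, not part of the chapter's design.

```verilog
// Hypothetical 3-stage pipeline computing y = (a + b) * 3 + 1.
// All names and the arithmetic are illustrative assumptions.
module pipeline3 (
  input  wire        clk,
  input  wire [7:0]  a, b,
  output reg  [10:0] y
);
  reg [8:0]  s1;  // stage 1 result: a + b      (at most 510)
  reg [10:0] s2;  // stage 2 result: s1 * 3     (at most 1530)

  always @(posedge clk) begin
    s1 <= a + b;    // step 1 operates on the newest input pair
    s2 <= s1 * 3;   // step 2 operates on the pair from one cycle earlier
    y  <= s2 + 1;   // step 3 operates on the pair from two cycles earlier
  end
endmodule
```

Each result still takes three clock edges to travel from input to output, but once the pipeline is full a new result emerges on every clock edge.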
Direct Memory Access (DMA)
Input/Output data for accelerators
must be transferred at high speed
Using the processor would be too slow
Direct memory access
I/O controller and accelerator transfer data
to and from memory autonomously
Program supplies starting address and
Bus Arbitration
Bus masters take turns to use bus to access
slaves
Controlled by a bus arbiter
Arbitration policies
Priority, round-robin, …
[Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to memory]
Block-Processing Accelerator
Data arranged in regular groups of
contiguous memory locations
Accelerator works block by block
E.g., images in blocks of 8 × 8 × 16-bit
pixels
Datapath comprises
Memory access: address generation,
counters
Computation section
Stream-Processing Accelerator
Streams of data from an input source
E.g., high-speed sensors
Digital signal processing (DSP)
Analog sensor signal converted to stream
of digital sample values
Filtering, gain/attenuation,
Processor/Accelerator Interface
Embedded software controls an
accelerator
Providing control parameters
Synchronizing operations
Input/output registers and interrupts
Case Study: Edge Detection
Illustration of accelerator design
Edge detection in video processing
Identify where image intensity changes abruptly
Typically at the boundary of objects
First step in identifying objects in a scene
Application areas
Video surveillance, computer vision, …
For this case study
Monochrome images of 640 × 480 × 8-bit pixels
Stored row-by-row in memory
Sobel Edge Detection
Compute derivatives of intensity in x
and y directions
Look for minima and maxima (where
The Sobel Algorithm
Use convolution to approximate partial
derivatives Dx and Dy at each position
Weighted sum of value of a pixel and its eight
nearest neighbors
Coefficients represented using a 3×3 convolution
mask
Sobel masks for x and y derivatives
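$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$
$$D_x = \sum_{i=-1}^{+1} \sum_{j=-1}^{+1} O[\mathit{row}+i][\mathit{col}+j]\, G_x[i][j] \qquad D_y = \sum_{i=-1}^{+1} \sum_{j=-1}^{+1} O[\mathit{row}+i][\mathit{col}+j]\, G_y[i][j]$$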
The Sobel Algorithm
Combine partial derivatives
$$|D| = \sqrt{D_x^2 + D_y^2}$$
Since we just want maxima and minima
in magnitude, approximate as:
$$|D| = |D_x| + |D_y|$$
Edge pixels don’t have eight neighbors
Skip computation of |D| for edges
The Algorithm in Pseudocode
for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
Data Formats and Rates
Pixel values: 0 to 255 (8 bits)
Coefficients are 0, ±1 and ±2
Partial products: –510 to +510 (10 bits)
Dx and Dy: –1020 to +1020 (11 bits)
|D|: 0 to 2040 (11 bits)
Final pixel value: scale back to 8 bits
Video rate: 30 frames/sec
640 × 480 = 307,200 pixels
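The figures above can be tied together with a short worked calculation. This is a sketch, assuming that "scale back to 8 bits" means dropping the three least-significant bits of |D|, which is what the datapath code later in the chapter does with D[10:3]:

```latex
\text{Pixel rate: } 307{,}200\ \tfrac{\text{pixels}}{\text{frame}} \times 30\ \tfrac{\text{frames}}{\text{s}}
  = 9{,}216{,}000\ \tfrac{\text{pixels}}{\text{s}} \approx 10\ \text{Mpixels/s}

\text{Largest } |D|: \; 1020 + 1020 = 2040
\qquad
\text{Scaled: } \left\lfloor \frac{2040}{2^{3}} \right\rfloor = 255
```

The scaled maximum of 255 fits exactly in an 8-bit pixel.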
Data Dependencies
Pixels can be computed independently
For each pixel:
System Architecture
Data dependencies suggest a pipeline
Coefficient multiplies are simple shift/negate, so
Memory Bandwidth
Assume memory read/write takes 20ns
(2 cycles of 100MHz clock)
Memory is 32-bits wide, byte addressable
Bandwidth = 50M operations/sec
Camera produces 10Mpixels/sec
Accelerator needs to process at this rate
(8 reads + 1 write) × 10Mpixel/sec
= 90M operations/sec, more than the 50M operations/sec available
Memory Bandwidth
Read 4 pixels at once from each of previous,
current, and next rows
Store in accelerator to compute multiple derivative
image pixels
Produce derivative pixels row-by-row,
left-to-right
Read 3 × 32-bit words for every 4th derivative
pixel computed
Write 4 pixels at a time
(3 reads + 1 write) / 4 × 10Mpixel/sec
= 10M operations/sec
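A rough cycle budget, derived only from the figures stated on these slides (100 MHz clock, 2 cycles per memory operation, 10 Mpixel/sec), shows how much headroom the word-wide scheme leaves:

```latex
\frac{100\ \text{MHz}}{10\ \text{Mpixel/s}} = 10\ \text{cycles per pixel}
\;\Rightarrow\; 40\ \text{cycles per group of 4 pixels}

(3\ \text{reads} + 1\ \text{write}) \times 2\ \text{cycles} = 8\ \text{cycles of memory access per group}

\frac{10\text{M operations/s}}{50\text{M operations/s}} = 20\%\ \text{of the memory bandwidth}
```

The remaining bandwidth stays available to the processor and other bus masters.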
Accelerator Sequence
Steady state
Write 4 result pixels
Read 4 pixels for previous,
current, next rows
Compute for 4 cycles
Repeat…
Start of row
Omit writes until pipeline
full
End of row
Omit reads to drain
Memory Operation Timing
Pixel Datapath
// Computation datapath signals
reg [31:0] prev_row, curr_row, next_row;
reg [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg [7:0] abs_D;
reg [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row <= dat_i;
  else if (shift_en) prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register
function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;
endfunction
Pixel Datapath
always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    D = abs(Dx) + abs(Dy);
    abs_D <= D[10:3];
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <=   $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});
Pixel Datapath
    O[-1][-1] <= O[-1][0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[0][ 0];
    O[ 0][ 0] <= O[0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end
always @(posedge clk_i) // Result row register
Address Generation
Given an image in memory at base address B
Address for pixel in row r, column c is
B + r × 640 + c
Base address (B) is fixed
Offset (r × 640 + c) increments by 4 for
each group of 4 pixels read/written
Use word-aligned addresses
Two least-significant bits always 00
Increment word address by 1
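A worked instance of the address formula may help. This is a sketch: the base address B = 8000 (hex) is the value the testbench later writes to the O_base register, while the choice of row 1, column 4 is an arbitrary illustration.

```latex
B + r \times 640 + c = \mathtt{0x8000} + 1 \times 640 + 4 = \mathtt{0x8000} + \mathtt{0x284} = \mathtt{0x8284}

\text{Word address} = \mathtt{0x8284} \div 4 = \mathtt{0x20A1}
\qquad
\text{Words per row} = 640 \div 4 = 160
```

Column 4 starts a group of 4 pixels, so the byte address ends in binary 00 and dividing by 4 loses nothing. Moving to the same column in the next row adds 160 to the word address.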
Address Generation
always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];
always @(posedge clk_i) // O address offset counter
  if (offset_reset) O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;
always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];
always @(posedge clk_i) // D address offset counter
  if (offset_reset) D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...
Address Generation
assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr = D_base + D_offset;
assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                     D_addr;
Control/Status Registers
Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears
Slave Bus Interface
assign start = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;
always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];
always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // Acknowledged read of the status register clears done
    done <= 1'b0;
Slave Bus Interface
always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;
// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0; // other registers read as 0
  else
Control Sequencing
Use a finite-state machine
Counters keep track of rows (0 to 477) and
columns (0 to 159)
See textbook for details of FSM output
Accelerator Verification
Simulation-based verification of each section
of the accelerator
Slave bus operations
Computation sequencing
Master bus operations
Address generation
Pixel computation
Testbench including the accelerator
Bus functional processor model
Sobel Verification Testbench
[Figure: testbench block diagram connecting the processor BFM, the Sobel accelerator, the arbiter, and the memory model]
Processor Bus Functional Model
initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations
Processor Bus Functional Model
  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;
    end
  end
end
Memory Bus Functional Model
always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end
Bus Arbiter
Uses sobel_cyc_o and cpu_cyc_o as request inputs
If both request at the same time, give
accelerator priority
Bus Arbiter
always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else arbiter_current_state <= arbiter_next_state;
always @* // Arbiter logic (combinational, so blocking assignments)
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
             sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
           end
           else if (!sobel_cyc_o && cpu_cyc_o) begin
             sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
           end
           else begin
             sobel_gnt = 1'b0; cpu_gnt = 1'b0; arbiter_next_state = sobel;
           end
    cpu:   if (cpu_cyc_o) begin
             sobel_gnt = 1'b0; cpu_gnt = 1'b1; arbiter_next_state = cpu;
           end
           else if (sobel_cyc_o && !cpu_cyc_o) begin
             sobel_gnt = 1'b1; cpu_gnt = 1'b0; arbiter_next_state = sobel;
           end
           else begin
Simulation Results
See waveforms in textbook
Demonstrates sequencing and address
generation
But what about…
Data values computed correctly
Interactions between processor and
accelerator
Need to use more sophisticated
verification techniques
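One common step toward checking data values is a reference model that the testbench compares against each pixel the accelerator writes. The following is a minimal sketch, not the textbook's method: the function name sobel_ref, its argument names, and the demo module are all assumptions. Arguments are the eight neighbors of a pixel, named p_<row><col> with m = -1, z = 0, p = +1; the center pixel is omitted because both masks weight it by 0.

```verilog
// Hypothetical reference model for one derivative pixel (names are assumptions).
module sobel_ref_demo;
  function [7:0] sobel_ref (input [7:0] p_mm, p_mz, p_mp,
                            input [7:0] p_zm,       p_zp,
                            input [7:0] p_pm, p_pz, p_pp);
    integer dx, dy, d;
    begin
      // Same weights as the Gx and Gy masks
      dx = (p_mp + 2*p_zp + p_pp) - (p_mm + 2*p_zm + p_pm);
      dy = (p_mm + 2*p_mz + p_mp) - (p_pm + 2*p_pz + p_pp);
      if (dx < 0) dx = -dx;
      if (dy < 0) dy = -dy;
      d = dx + dy;          // 0 to 2040
      sobel_ref = d >> 3;   // same scaling as abs_D <= D[10:3]
    end
  endfunction

  initial begin
    // Vertical edge: left and middle columns 0, right column 255.
    // dx = 1020, dy = 0, so the scaled result is 127.
    $display("%0d", sobel_ref(8'd0, 8'd0, 8'd255,
                              8'd0,       8'd255,
                              8'd0, 8'd0, 8'd255));
  end
endmodule
```

In a fuller testbench, the memory model would hold real image data and each accelerator write would be compared with the value this function returns for the corresponding neighborhood.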
Summary
Accelerators boost performance using
parallel hardware
Replication, pipelining, …
Amdahl’s Law
Best payback from accelerating a kernel