(1)

An Embedded Systems Approach Using Verilog

Chapter 9

Accelerators

(2)

Performance and Parallelism

• A processor core performs steps in sequence
  - Performance limited by the instruction rate
• Accelerating performance
  - Perform steps in parallel
  - Takes less time overall to complete an operation
• Instruction-level parallelism
  - Within a processor core
  - Pipelining, multiple-issue
• Accelerators

(3)

Achievable Parallelism

• How many steps can be performed at once?
• Regularly structured data
  - Independent processing steps
  - Examples
    - Video and image pixel processing
    - Audio or sensor signal processing
• Constrained by data dependencies
  - Operations that depend on results of

(4)

Algorithm Kernels

• Algorithm: specification of the required processing steps
  - Often expressed in a programming language
• Kernel: the part that involves the most intensive, repetitive processing
  - “10% of operations take 90% of the time”
• Accelerating a kernel with parallel

(5)

Amdahl’s Law

• Time for an algorithm is $t$
  - Fraction $f$ is spent on a kernel

$$t = ft + (1 - f)t$$

• Accelerator speeds up kernel by a factor $s$

$$t' = \frac{ft}{s} + (1 - f)t$$

• Overall speedup factor $s'$
  - For large $f$, $s' \approx s$
  - For small $f$, $s' \approx 1$

$$s' = \frac{t}{t'} = \frac{1}{\frac{f}{s} + (1 - f)}$$

(6)

Amdahl’s Law Example

• An algorithm with two kernels
  - Kernel 1: 80% of time, can be sped up 10 times
  - Kernel 2: 15% of time, can be sped up 100 times
  - Which speedup gives best overall improvement?
• For kernel 1:

$$s' = \frac{1}{\frac{0.8}{10} + (1 - 0.8)} = \frac{1}{0.08 + 0.2} = 3.57$$

• For kernel 2:

$$s' = \frac{1}{\frac{0.15}{100} + (1 - 0.15)} = \frac{1}{0.0015 + 0.85} = 1.17$$
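
As a further check on this example, letting the kernel speedup $s$ grow without bound gives an upper limit on the overall speedup. This is a supplementary worked calculation, not part of the original slides; it follows directly from $s' = 1 / (f/s + (1 - f))$.

$$\lim_{s \to \infty} s' = \frac{1}{1 - f}, \qquad \text{kernel 1: } \frac{1}{1 - 0.8} = 5, \qquad \text{kernel 2: } \frac{1}{1 - 0.15} \approx 1.18$$

Even an arbitrarily fast accelerator for kernel 2 could not reach the 3.57 obtained by speeding up kernel 1 ten times.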

(7)

Parallel Architectures

• An architecture for an accelerator specifies
  - Processing blocks
  - Data flow between them
• Parallelism through replication
  - Multiple identical blocks operating on different data elements
  - Works well when elements can be

(8)

Parallel Architectures

• Parallelism through pipelining
  - Break a computation into steps, perform them in assembly-line fashion
  - Latency (time to complete a single operation) is not reduced
  - Throughput (rate of completion of operations) is increased
    - Ideally by a factor equal to the number of pipeline stages

[Figure: data passing through a pipeline of step 1, step 2, step 3]
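
The assembly-line idea can be illustrated with a minimal Verilog sketch. The module name `pipeline3`, its ports, and the arithmetic performed in each step are hypothetical, chosen only to give each stage something to do; they are not taken from the textbook design.

```verilog
// Hypothetical three-stage pipeline computing y = (a + b) * 3 + 1,
// with one step per clock cycle and a register after each step.
module pipeline3 (
  input  wire        clk,
  input  wire [7:0]  a,
  input  wire [7:0]  b,
  output reg  [10:0] y
);

  reg [8:0]  sum;     // result of step 1 (at most 510)
  reg [10:0] scaled;  // result of step 2 (at most 1530)

  always @(posedge clk) begin
    sum    <= a + b;        // step 1: works on the newest inputs
    scaled <= sum * 3;      // step 2: works on the previous inputs
    y      <= scaled + 1;   // step 3: works on the inputs before that
  end

endmodule
```

Each result appears three clock edges after its inputs are applied (latency), but once the pipeline is full a new result emerges on every clock edge (throughput).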

(9)

Direct Memory Access (DMA)

• Input/output data for accelerators must be transferred at high speed
  - Using the processor would be too slow
• Direct memory access
  - I/O controller and accelerator transfer data to and from memory autonomously
  - Program supplies starting address and

(10)

Bus Arbitration

• Bus masters take turns to use bus to access slaves
  - Controlled by a bus arbiter
• Arbitration policies
  - Priority, round-robin, …

[Figure: processor, accelerator, and controller each exchange request and grant signals with the arbiter and share the memory bus to memory]

(11)

Block-Processing Accelerator

• Data arranged in regular groups of contiguous memory locations
  - Accelerator works block by block
  - E.g., images in blocks of 8 × 8 × 16-bit pixels
• Datapath comprises
  - Memory access: address generation, counters
  - Computation section

(12)

Stream-Processing Accelerator

• Streams of data from an input source
  - E.g., high-speed sensors
• Digital signal processing (DSP)
  - Analog sensor signal converted to stream of digital sample values
  - Filtering, gain/attenuation,

(13)

Processor/Accelerator Interface

• Embedded software controls an accelerator
  - Providing control parameters
  - Synchronizing operations
• Input/output registers and interrupts

(14)

Case Study: Edge Detection

• Illustration of accelerator design
• Edge detection in video processing
  - Identify where image intensity changes abruptly
  - Typically at the boundary of objects
  - First step in identifying objects in a scene
• Application areas
  - Video surveillance, computer vision, …
• For this case study
  - Monochrome images of 640 × 480 × 8-bit pixels
  - Stored row-by-row in memory

(15)

Sobel Edge Detection

• Compute derivatives of intensity in x and y directions
  - Look for minima and maxima (where

(16)

The Sobel Algorithm

• Use convolution to approximate partial derivatives $D_x$ and $D_y$ at each position
  - Weighted sum of value of a pixel and its eight nearest neighbors
  - Coefficients represented using a 3×3 convolution mask
• Sobel masks for x and y derivatives

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$

(17)

The Sobel Algorithm

• Combine partial derivatives

$$|D| = \sqrt{D_x^2 + D_y^2}$$

• Since we just want maxima and minima in magnitude, approximate as:

$$|D| = |D_x| + |D_y|$$
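
How far the sum-of-absolute-values approximation can stray from the true magnitude can be bounded with a short calculation. This is a supplementary note rather than part of the original slides; it uses only the definitions of $D_x$ and $D_y$.

$$\sqrt{D_x^2 + D_y^2} \;\le\; |D_x| + |D_y| \;\le\; \sqrt{2}\,\sqrt{D_x^2 + D_y^2}$$

For example, $D_x = 3$ and $D_y = 4$ give a true magnitude of 5 and an approximation of 7. The approximation never underestimates, and overestimates by at most a factor of $\sqrt{2} \approx 1.41$, while avoiding multipliers and a square root in hardware.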

• Edge pixels don’t have eight neighbors
  - Skip computation of |D| for edges

(18)

The Algorithm in Pseudocode

for (row = 1; row <= 478; row = row + 1) begin
  for (col = 1; col <= 638; col = col + 1) begin
    sumx = 0; sumy = 0;
    for (i = -1; i <= +1; i = i + 1) begin
      for (j = -1; j <= +1; j = j + 1) begin
        sumx = sumx + O[row+i][col+j] * Gx[i][j];
        sumy = sumy + O[row+i][col+j] * Gy[i][j];
      end
    end
    D[row][col] = abs(sumx) + abs(sumy);
  end
end
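
As a cross-check on the pseudocode, a simulation-only reference model can apply both masks to a single 3×3 neighborhood. The sketch below is an illustration under stated assumptions: the module name `sobel_ref`, the function `sobel_mag`, and the packing of the nine pixels into a 72-bit vector are all hypothetical and do not appear in the textbook design.

```verilog
// Hypothetical simulation-only reference model for one output pixel.
// The nine 8-bit pixels are packed row by row, top-left pixel in the
// most significant byte.
module sobel_ref;

  function [10:0] sobel_mag (input [71:0] nbhd);
    integer i, j, p, gx, gy, sumx, sumy;
    begin
      sumx = 0;
      sumy = 0;
      for (i = -1; i <= 1; i = i + 1)
        for (j = -1; j <= 1; j = j + 1) begin
          // Pixel in row i, column j of the neighborhood
          p  = nbhd[71 - 8*(3*(i+1) + (j+1)) -: 8];
          gx = j * (i == 0 ? 2 : 1);    // entry of the x mask
          gy = -i * (j == 0 ? 2 : 1);   // entry of the y mask
          sumx = sumx + p * gx;
          sumy = sumy + p * gy;
        end
      // |D| = |Dx| + |Dy|
      sobel_mag = (sumx < 0 ? -sumx : sumx) + (sumy < 0 ? -sumy : sumy);
    end
  endfunction

  initial begin
    // Vertical edge: left and middle columns 0, right column 255.
    // Dx = 255 + 2*255 + 255 = 1020 and Dy = 0, so this displays 1020.
    $display("%0d", sobel_mag({8'd0, 8'd0, 8'd255,
                               8'd0, 8'd0, 8'd255,
                               8'd0, 8'd0, 8'd255}));
  end

endmodule
```

A model like this could supply expected values when checking that the accelerator's pixel computation is correct.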

(19)

Data Formats and Rates

• Pixel values: 0 to 255 (8 bits)
  - Coefficients are 0, ±1 and ±2
  - Partial products: –510 to +510 (10 bits)
  - Dx and Dy: –1020 to +1020 (11 bits)
  - |D|: 0 to 2040 (11 bits)
  - Final pixel value: scale back to 8 bits
• Video rate: 30 frames/sec
  - 640 × 480 = 307,200 pixels

(20)

Data Dependencies

Pixels can be computed independently

For each pixel:

(21)

System Architecture

• Data dependencies suggest a pipeline
• Coefficient multiplies are simple shift/negate, so

(22)

Memory Bandwidth

• Assume memory read/write takes 20 ns (2 cycles of 100 MHz clock)
  - Memory is 32 bits wide, byte addressable
  - Bandwidth = 50M operations/sec
• Camera produces 10 Mpixels/sec
  - Accelerator needs to process at this rate
  - (8 reads + 1 write) × 10 Mpixel/sec = 90M operations/sec
  - This exceeds the 50M operations/sec the memory can supply

(23)

Memory Bandwidth

• Read 4 pixels at once from each of previous, current, and next rows
  - Store in accelerator to compute multiple derivative image pixels
• Produce derivative pixels row-by-row, left-to-right
  - Read 3 × 32-bit words for every 4th derivative pixel computed
  - Write 4 pixels at a time
  - (3 reads + 1 write) / 4 × 10 Mpixel/sec = 10M operations/sec
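
Putting the two access schemes side by side shows how much of the memory bandwidth each would consume. This is a supplementary calculation using only the figures given on the slides (20 ns per operation, 10 Mpixel/s).

$$\text{available: } \frac{1}{20\ \text{ns}} = 50\ \text{M ops/s}$$

$$\text{pixel at a time: } \frac{(8 + 1) \times 10\ \text{M}}{50\ \text{M}} = 180\%, \qquad \text{four pixels per word: } \frac{\frac{3 + 1}{4} \times 10\ \text{M}}{50\ \text{M}} = 20\%$$

Reading whole words leaves roughly 80% of the memory bandwidth free for the processor and other bus masters.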

(24)
(25)

Accelerator Sequence

• Steady state
  - Write 4 result pixels
  - Read 4 pixels for previous, current, next rows
  - Compute for 4 cycles
  - Repeat…
• Start of row
  - Omit writes until pipeline full
• End of row
  - Omit reads to drain

(26)

Memory Operation Timing

(27)

Pixel Datapath

// Computation datapath signals
reg [31:0] prev_row, curr_row, next_row;
reg [7:0] O [-1:+1][-1:+1];
reg signed [10:0] Dx, Dy, D;
reg [7:0] abs_D;
reg [31:0] result_row;
...
// Computational datapath
always @(posedge clk_i) // Previous row register
  if (prev_row_load) prev_row <= dat_i;
  else if (shift_en) prev_row[31:8] <= prev_row[23:0];
... // Current row register
... // Next row register

function [10:0] abs (input signed [10:0] x);
  abs = x >= 0 ? x : -x;

(28)

Pixel Datapath

always @(posedge clk_i) // Computation pipeline
  if (shift_en) begin
    D = abs(Dx) + abs(Dy);
    abs_D <= D[10:3];
    Dx <= - $signed({3'b000, O[-1][-1]})
          + $signed({3'b000, O[-1][+1]})
          - ($signed({3'b000, O[ 0][-1]}) << 1)
          + ($signed({3'b000, O[ 0][+1]}) << 1)
          - $signed({3'b000, O[+1][-1]})
          + $signed({3'b000, O[+1][+1]});
    Dy <= $signed({3'b000, O[-1][-1]})
          + ($signed({3'b000, O[-1][ 0]}) << 1)
          + $signed({3'b000, O[-1][+1]})
          - $signed({3'b000, O[+1][-1]})
          - ($signed({3'b000, O[+1][ 0]}) << 1)
          - $signed({3'b000, O[+1][+1]});

(29)

Pixel Datapath

    O[-1][-1] <= O[-1][0];
    O[-1][ 0] <= O[-1][+1];
    O[-1][+1] <= prev_row[31:24];
    O[ 0][-1] <= O[0][ 0];
    O[ 0][ 0] <= O[0][+1];
    O[ 0][+1] <= curr_row[31:24];
    O[+1][-1] <= O[+1][ 0];
    O[+1][ 0] <= O[+1][+1];
    O[+1][+1] <= next_row[31:24];
  end

always @(posedge clk_i) // Result row register

(30)

Address Generation

• Given an image in memory at base address B
  - Address for pixel in row r, column c is B + r × 640 + c
• Base address (B) is fixed
• Offset (r × 640 + c) increments by 4 for each group of 4 pixels read/written
• Use word-aligned addresses
  - Two least-significant bits always 00
  - Increment word address by 1
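
The byte-to-word conversion can be written out explicitly. This is a supplementary calculation; it assumes, as the slide implies, that B and c are both multiples of 4 whenever a group of 4 pixels is accessed.

$$\text{word address} = \frac{B + 640\,r + c}{4} = \frac{B}{4} + 160\,r + \frac{c}{4}, \qquad \frac{640}{4} = 160 \ \text{words per row}$$

Moving down one row therefore adds 160 to the word address, and moving to the next group of 4 pixels adds 1.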

(31)
(32)

Address Generation

always @(posedge clk_i) // O base address register
  if (O_base_ce) O_base <= dat_i[21:2];

always @(posedge clk_i) // O address offset counter
  if (offset_reset) O_offset <= 0;
  else if (O_offset_cnt_en) O_offset <= O_offset + 1;

always @(posedge clk_i) // D base address register
  if (D_base_ce) D_base <= dat_i[21:2];

always @(posedge clk_i) // D address offset counter
  if (offset_reset) D_offset <= 0;
  else if (D_offset_cnt_en) D_offset <= D_offset + 1;
...

(33)

Address Generation

assign O_prev_addr = O_base + O_offset;
assign O_curr_addr = O_prev_addr + 640/4;
assign O_next_addr = O_prev_addr + 1280/4;
assign D_addr      = D_base + D_offset;

assign adr_o[21:2] = prev_row_load ? O_prev_addr :
                     curr_row_load ? O_curr_addr :
                     next_row_load ? O_next_addr :
                     D_addr;

(34)

Control/Status Registers

Register  Offset  Read/Write  Purpose
Int_en    0       Write-only  Interrupt enable (bit 0).
Start     4       Write-only  Write causes image processing to start (value ignored).
O_base    8       Write-only  Original image base address.
D_base    12      Write-only  Derivative image base address + 640.
Status    0       Read-only   Processing done (bit 0). Reading clears

(35)

Slave Bus Interface

assign start     = cyc_i && stb_i && we_i && adr_i == 2'b01;
assign O_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b10;
assign D_base_ce = cyc_i && stb_i && we_i && adr_i == 2'b11;

always @(posedge clk_i) // Interrupt enable register
  if (rst_i)
    int_en <= 1'b0;
  else if (cyc_i && stb_i && we_i && adr_i == 2'b00)
    int_en <= dat_i[0];

always @(posedge clk_i) // Status register
  if (rst_i)
    done <= 1'b0;
  else if (done_set)
    // This occurs when last write is acknowledged,
    // and so cannot coincide with a read of the status register.
    done <= 1'b1;
  else if (cyc_i && stb_i && !we_i && adr_i == 2'b00 && ack_o)
    // A read of the status register clears done
    done <= 1'b0;

(36)

Slave Bus Interface

always @(posedge clk_i) // Generate ack output
  ack_o <= cyc_i && stb_i && !ack_o;

// Wishbone data output multiplexer
always @*
  if (cyc_i && stb_i && !we_i)
    if (adr_i == 2'b00)
      dat_o = {31'b0, done}; // status register read
    else
      dat_o = 32'b0; // other registers read as 0
  else

(37)

Control Sequencing

• Use a finite-state machine
  - Counters keep track of rows (0 to 477) and columns (0 to 159)
• See textbook for details of FSM output

(38)
(39)

Accelerator Verification

• Simulation-based verification of each section of the accelerator
  - Slave bus operations
  - Computation sequencing
  - Master bus operations
  - Address generation
  - Pixel computation
• Testbench including the accelerator
  - Bus functional processor model

(40)

Sobel Verification Testbench

[Figure: testbench block diagram with processor BFM, Sobel accelerator, arbiter, and memory model]

(41)

Processor Bus Functional Model

initial begin // Processor bus-functional model
  cpu_adr_o <= 23'h000000;
  cpu_sel_o <= 4'b0000;
  cpu_dat_o <= 32'h00000000;
  cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
  @(negedge rst);
  @(posedge clk);
  // Write 008000 (hex) to O_base_addr register
  bus_write(sobel_reg_base + sobel_O_base_reg_offset, 32'h00008000);
  // Write 053000 + 280 (hex) to D_base_addr register
  bus_write(sobel_reg_base + sobel_D_base_reg_offset, 32'h00053280);
  // Write 1 to interrupt control register (enable interrupt)
  bus_write(sobel_reg_base + sobel_int_reg_offset, 32'h00000001);
  // Write to start register (data value ignored)
  bus_write(sobel_reg_base + sobel_start_reg_offset, 32'h00000000);
  // End of write operations

(42)

Processor Bus Functional Model

  cpu_cyc_o = 1'b0; cpu_stb_o = 1'b0; cpu_we_o = 1'b0;
  begin: loop
    forever begin
      #10000;
      @(posedge clk);
      // Read status register
      cpu_adr_o <= sobel_reg_base + sobel_status_reg_offset;
      cpu_sel_o <= 4'b1111;
      cpu_cyc_o <= 1'b1; cpu_stb_o <= 1'b1; cpu_we_o <= 1'b0;
      @(posedge clk); while (!cpu_ack_i) @(posedge clk);
      cpu_cyc_o <= 1'b0; cpu_stb_o <= 1'b0; cpu_we_o <= 1'b0;
      if (cpu_dat_i[0]) disable loop;
    end
  end
end
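
The base addresses written by the bus-functional model can be checked against the image dimensions. This is a supplementary calculation; the reading that the derivative image is placed immediately after the original image, and that the extra 640 skips the uncomputed first row, is an inference from the numbers rather than something the slides state.

$$\texttt{0x053000} - \texttt{0x008000} = \texttt{0x04B000} = 307{,}200 = 640 \times 480$$

$$\texttt{0x280} = 2 \times 256 + 8 \times 16 = 640$$

The value `32'h00053280` is therefore a derivative image base of `0x053000` plus one 640-byte row, which agrees with the "Derivative image base address + 640" entry for the D_base register.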

(43)

Memory Bus Functional Model

always begin // Memory bus-functional model
  mem_ack_o <= 1'b0;
  mem_dat_o <= 32'h00000000;
  @(posedge clk);
  while (!(bus_cyc && mem_stb_i)) @(posedge clk);
  if (!bus_we)
    mem_dat_o <= 32'h00000000; // in place of read data
  mem_ack_o <= 1'b1;
  @(posedge clk);
end

(44)

Bus Arbiter

• Uses sobel_cyc_o and cpu_cyc_o as request inputs
  - If both request at the same time, give accelerator priority

(45)

Bus Arbiter

always @(posedge clk) // Arbiter FSM register
  if (rst) arbiter_current_state <= sobel;
  else arbiter_current_state <= arbiter_next_state;

always @* // Arbiter logic
  case (arbiter_current_state)
    sobel: if (sobel_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
           else if (!sobel_cyc_o && cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end
           else begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
    cpu:   if (cpu_cyc_o) begin
             sobel_gnt <= 1'b0; cpu_gnt <= 1'b1; arbiter_next_state <= cpu;
           end
           else if (sobel_cyc_o && !cpu_cyc_o) begin
             sobel_gnt <= 1'b1; cpu_gnt <= 1'b0; arbiter_next_state <= sobel;
           end
           else begin

(46)

Simulation Results

• See waveforms in textbook
  - Demonstrates sequencing and address generation
• But what about…
  - Data values computed correctly
  - Interactions between processor and accelerator
• Need to use more sophisticated verification techniques

(47)

Summary

• Accelerators boost performance using parallel hardware
  - Replication, pipelining, …
• Amdahl’s Law
  - Best payback from accelerating a kernel
• DMA avoids processor overhead
• Verification requires advanced
