Field Programmable Gate Arrays (FPGAs) - Hardware Accelerators in Vision Processing

2.3 Hardware Accelerators in Vision Processing

2.3.3 Field Programmable Gate Arrays (FPGAs)

…

Figure 2.21: Illustration of logical view corresponding to hardware view, modified from[30]. (Software view: thread, thread block, grid; hardware view: CUDA core, streaming multiprocessor, GPU device.)

2.3.3 Field Programmable Gate Arrays (FPGAs)

This section explores the basic architectural features of FPGAs and the benefits they offer. An FPGA is a prefabricated integrated circuit that can be reprogrammed to implement different digital circuits or system functions. Some modern FPGA devices contain up to two million logic cells that can be configured to implement a variety of algorithms that would otherwise run in software[115]. When an FPGA is configured, its internal circuitry is connected so that it forms a hardware implementation of the software application. In a general-purpose processor, by contrast, an algorithm is executed as a sequence of instructions on a fixed architecture. Because the computation architecture is fixed, the best performance is obtained by following the available processing structures, and performance is a function of how well the algorithm maps to the capabilities of the processor[115].

Unlike general-purpose processors, FPGAs process algorithms in dedicated, customized hardware and do not have an operating system[1; 23]. An algorithm is implemented in an FPGA by building separate hardware for each function from the FPGA’s logic cells and other components. This approach, which the FPGA architecture supports inherently, gives a design the speed of parallel hardware while retaining the reprogrammable flexibility of software at a relatively low cost[8].

The basic architecture and components of a generic FPGA are shown in Figure 2.22. It consists of an array of configurable logic blocks, programmable interconnects, and input/output (I/O) blocks. Logic blocks implement the logic of a custom algorithm or function. Each logic block uses look-up tables (LUTs) to perform logic operations and flip-flops to store the LUT results. The logic blocks are typically arranged in a two-dimensional array and connected by configurable interconnects. During FPGA configuration, these programmable interconnects are set up to establish the connections between the logic blocks. The I/O blocks form the interface between the FPGA and external devices and can be configured as input or output ports. To increase the computational density and efficiency of the device, modern FPGA architectures combine these basic components with additional computational and data storage blocks[115], such as DSP48 blocks and dual-port block RAMs, as shown in Figure 2.23. The combination of these components provides more flexibility in the FPGA design, making it possible to implement any software algorithm that typically runs on a processor. These components are discussed in more detail in the following paragraphs.

Figure 2.22: Basic FPGA architecture[115]. (Labeled components: I/O block, logic block, programmable interconnect.)


Figure 2.23: Contemporary FPGA architecture. (Labeled components: block RAMs (BRAMs), DSP48 blocks.)

Each logic block in the FPGA is divided into several logic slices, and each logic slice consists of several logic cells, which are the smallest logic units within the FPGA device. The number of logic slices and logic cells usually differs between FPGA technologies. The basic elements inside a logic cell are illustrated in Figure 2.24. Each logic cell typically consists of a LUT and a flip-flop. A LUT is essentially a truth table: its stored contents determine the output value for each combination of inputs, so different contents implement different functions. A flip-flop is the basic storage unit that holds the LUT output. In hardware, a LUT can be represented as a collection of memory cells connected to a set of multiplexers, as shown in Figure 2.24-a[115], where the LUT inputs act as the selector bits of the multiplexers and choose which stored value appears at the output at a given point in time. A LUT can therefore serve both as a function compute engine and as a data storage element. The combination of a LUT and a flip-flop within a logic cell is illustrated in Figure 2.24-c.

Figure 2.24: Basic elements in logic block of FPGA[115]: (a) functional representation of LUT as collection of memory cells, (b) structure of flip-flop, and (c) structure of logic cell in Xilinx FPGA. (Panel labels: flip-flop ports D, Clk, CE, Set/Reset, and Q; logic cell inputs and outputs Y and Q.)
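The LUT and flip-flop behavior described above can be sketched in software. The following Python model is only an illustration under simplifying assumptions: the names `make_lut` and `LogicCell` are hypothetical, the set/reset input is omitted, and a real logic cell is configured through a bitstream rather than a Python list.

```python
from typing import Callable, Sequence


def make_lut(truth_table: Sequence[int]) -> Callable[..., int]:
    """Return a function modelling a k-input LUT.

    truth_table holds 2**k stored bits (the "memory cells"). The input
    bits act as multiplexer select lines that pick one stored bit.
    """
    size = len(truth_table)
    k = size.bit_length() - 1
    if size == 0 or 2 ** k != size:
        raise ValueError("truth table length must be a power of two")

    def lut(*inputs: int) -> int:
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs, got {len(inputs)}")
        # Pack the input bits into an address, first input as the MSB.
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return truth_table[address]

    return lut


class LogicCell:
    """Hypothetical logic cell: one LUT feeding a D flip-flop with clock enable."""

    def __init__(self, truth_table: Sequence[int]) -> None:
        self.lut = make_lut(truth_table)
        self.q = 0  # registered output Q

    def clock(self, *inputs: int, ce: int = 1) -> int:
        """Model one rising clock edge; Q only updates when ce is 1."""
        y = self.lut(*inputs)  # combinational output Y
        if ce:
            self.q = y
        return self.q


# The same structure implements different functions by changing the stored bits.
and2 = make_lut([0, 0, 0, 1])
xor2 = make_lut([0, 1, 1, 0])
```

Here `and2` and `xor2` share the same lookup structure and differ only in their stored bits, which mirrors how reconfiguring the LUT contents changes the implemented function.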


To efficiently support digital signal processing (DSP) applications, which typically use many binary multipliers and accumulators, FPGAs are equipped with DSP48 blocks, as shown in Figure 2.25. The DSP48 block is an arithmetic logic unit (ALU) embedded into the fabric of the FPGA. One DSP48 block can contain two or more slices. Each DSP48 slice supports many independent functions, including a multiplier, multiplier-accumulator (MACC), multiplier followed by an adder, three-input adder, barrel shifter, wide bus multiplexer, magnitude comparator, and wide counter. The architecture also supports connecting multiple DSP48 slices to form wide math functions, DSP filters, and complex arithmetic without using the general FPGA fabric[118].

Figure 2.25: Structure of a DSP48 block[115].
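As a rough illustration of the multiplier-accumulator (MACC) function mentioned above, the following Python sketch accumulates one product per step into a fixed-width signed register. The 48-bit accumulator width is an assumption suggested by the DSP48 name, and the function names `wrap_signed` and `macc` are hypothetical; the sketch does not model operand widths, pipelining, or any other detail of a real DSP48 slice.

```python
ACC_BITS = 48  # assumed accumulator width, not taken from the text


def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def macc(a_values, b_values, bits: int = ACC_BITS) -> int:
    """Multiply-accumulate: acc <- acc + a * b, one product per step."""
    acc = 0
    for a, b in zip(a_values, b_values):
        # Hardware registers have a fixed width, so the sum wraps on overflow.
        acc = wrap_signed(acc + a * b, bits)
    return acc


# A dot product, the core operation of a DSP filter tap sum.
result = macc([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6 = 32
```

The wraparound step is the main difference from ordinary Python integer arithmetic, which never overflows.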

A BRAM in an FPGA device is a dedicated dual-port RAM module that functions as an embedded memory element. It provides on-chip storage for a relatively large set of data. Each FPGA device usually offers two types of BRAM, which hold either 18k or 36k bits; the number of these memories available is device specific. The dual-port nature of these memories allows parallel, same-clock-cycle access to different locations[115]. In Xilinx FPGAs, five memory types can be generated from these block RAMs: single-port ROM, single-port RAM, dual-port ROM, simple dual-port RAM, and true dual-port RAM. The single-port ROM and single-port RAM have only one port for accessing the memory space. As illustrated in Figure 2.26-a and b, the ROM type provides only read access, while the RAM type uses the same port for both read and write accesses. The dual-port ROM (Figure 2.26-c) allows read access to the memory space through two ports. For the simple dual-port RAM, illustrated in Figure 2.26-d, write access is allowed through port A and read access through port B. Lastly, as shown in Figure 2.26-e, the true dual-port RAM allows both read and write accesses on either port A or port B[114].

Figure 2.26: Five memory types[114] generated from block RAMs: (a) single-port ROM, (b) single-port RAM, (c) dual-port ROM, (d) simple dual-port RAM, and (e) true dual-port RAM. (The dual-port panels label the two ports Port A and Port B.)
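The access rules of the five memory types can be summarized in a small behavioral model. The Python sketch below is an illustration only: the class `BlockRam`, the table `MEMORY_TYPES`, and the helper `make_memory` are hypothetical names, and the model ignores clocking, data widths, and what happens when both ports touch the same address in one cycle.

```python
class BlockRam:
    """Toy model of a block RAM with per-port read/write permissions."""

    def __init__(self, depth, port_modes, init=None):
        # port_modes maps a port name to "r", "w", or "rw".
        self.mem = list(init) if init is not None else [0] * depth
        if len(self.mem) != depth:
            raise ValueError("init length must equal depth")
        self.port_modes = port_modes

    def read(self, port, addr):
        if "r" not in self.port_modes.get(port, ""):
            raise PermissionError(f"port {port} has no read access")
        return self.mem[addr]

    def write(self, port, addr, value):
        if "w" not in self.port_modes.get(port, ""):
            raise PermissionError(f"port {port} has no write access")
        self.mem[addr] = value


# Port permissions for the five memory types described in the text.
MEMORY_TYPES = {
    "single_port_rom": {"A": "r"},
    "single_port_ram": {"A": "rw"},
    "dual_port_rom": {"A": "r", "B": "r"},
    "simple_dual_port_ram": {"A": "w", "B": "r"},
    "true_dual_port_ram": {"A": "rw", "B": "rw"},
}


def make_memory(kind, depth, init=None):
    """Build a BlockRam configured as one of the five memory types."""
    return BlockRam(depth, MEMORY_TYPES[kind], init)


# Simple dual-port RAM: write through port A, read back through port B.
sdp = make_memory("simple_dual_port_ram", depth=8)
sdp.write("A", 3, 42)
value = sdp.read("B", 3)  # 42
```

In this model, reading through port A of the simple dual-port RAM, or writing to either ROM type, raises `PermissionError`, which corresponds to the access restrictions listed above.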

In document Heterogeneous computing systems for vision-based multi-robot tracking (Page 41-47)