Xilinx FPGAs are heterogeneous compute platforms that include Block RAMs, DSP Slices, PCI Express support, and programmable fabric. They enable parallelism and pipelining of applications across the entire platform as all of these compute resources can be used simultaneously. SDAccel is the tool provided by Xilinx to object and assist these compute resources for OpenCL programs.
The basic structure of an FPGA is composed of the following elements:
Look-up table (LUT) – LUT performs logic operations.
Flip-Flop (FF) – This register element stores the result of the LUT.
Wires – Wires connect elements.
Input/Output (I/O) pads – These physical ports get data in and out of the FPGA.
The combination of the above-mentioned elements, including LUT, FF, wires, and I/O pads, results in the basic FPGA architecture. The mentioned elements, LUT and FF are described briefly in the following pages.
LUT
The Look-up table LUT is the basic building block of an FPGA. By using LUT, which is a small memory, we can implement any logic function of 𝑀 Boolean variables. Essentially, LUT is a truth table. Therefore, in LUT, different arrangements of the inputs resulted in various functions, which is generated output values. 𝑀 represents the number of inputs to the LUT, which is the limit on the size of the truth table. Indeed, the number of memory locations
29 accessed by the table for a LUT with N inputs is 2N. This permits the table to implement 2N^Nfunctions. Note that a typical value for 𝑀 in Xilinx FPGAs is 6. Figure (2.4) shows the functional Representation of a LUT as a Collection of Memory Cells.
Here, we try to explain the hardware implementation of a LUT. It can be considered as a collection of memory cells that is linked to a set of multiplexers. The inputs to the LUT can be regarded as a selector bits on the multiplexer so that the outcome can be selected at a given point in time. This representation of LUT makes it easier to consider LUT as a compute engine function and a data storage element.
Figure (2.4). Illustration of a functional LUT as a collection of memory cells.
Flip Flop
30 The basic structure of a flip-flop is shown in Figure (2.5) which represents a data input (d_in), clock input (clk), clock enables (clk_en), reset, and data output (d_out). In each clock pulse, the input value.
Figure (2.5). Structure of a Flip-Flop.
During normal operation, any value at the data input port is latched and passed to the output on every pulse of the clock. The clock enables pin permits the flip-flop to hold specific value for more than one clock pulse. New data inputs are only latched and passed to the data output port when both clock and clock enable are equal to one.
Present FPGA architectures have the basic elements along with additional computational and data storage blocks. The extra elements added to contemporary FPGA is shown in Figure (3.4).
These additional elements, which rise the computational density and efficiency of the device, are discussed in the following sections,
Embedded memories for distributed data storage
Phase-locked loops (PLLs) for driving the FPGA fabric at different clock rates
High-speed serial transceivers
𝑑_𝑜𝑢𝑡 𝐹𝐹
𝑑_𝑖𝑛
𝑐𝑙𝑘_𝑒𝑛
𝑟𝑒𝑠𝑒𝑡 𝑐𝑙𝑘
𝑠𝑒𝑡
31
Off-chip memory controllers
Multiply-accumulate blocks
Figure (2.6) shows the combination of these elements on a recent FPGA architecture. This provides the FPGA with the flexibility to implement any software algorithm running on a processor. Note that all of these elements across the entire FPGA can be used concurrently.
Figure (2.6). Contemporary FPGA Architecture [39].
DSP48 Block
DSP48 block, which is shown below, is the most complex computational block available in a Xilinx FPGA.
32 The DSP48 block, which is embedded in the fabric of the FPGA, is an arithmetic logic unit. It is composed of a chain of three different blocks, including add/subtract unit, multiplier, and final add/subtract/accumulate engine.
The computational chain in the DSP48 holds an add/subtract unit, which is linked to a multiplier. The multiplier is linked to a final add/subtract/accumulate engine. This chain allows a single DSP48 unit to implement functions of the form, which is represented in Figure (2.7):
𝑃 = 𝐵 × (𝐴 + 𝐷) + 𝐶 or 𝑃+= 𝐵 × (𝐴 + 𝐷)
Figure (2.7). A DSP48 Block structure [39].
BRAM and Other Memories
Embedded memory elements in FPGA fabric are random-access memory (RAM), read-only memory (ROM), or shift registers. These elements are block RAMs (BRAMs), LUTs, and shift registers.
33 The BRAM is a dual-port RAM module available on the FPGA fabric to provide on-chip storage for a relatively large set of data (18k or 36k bits). Two types of BRAM memories that can hold either 18k or 36k bits are based on device specific. The dual-port BRAM has parallel, same-clock-cycle access to different locations.
In a RAM configuration, the data can be read and written at any time during the runtime of the circuit. In contrast, in a ROM configuration, data can only be read during the runtime of the circuit. The data of the ROM is written as part of the FPGA configuration and cannot be modified in any way.
As discussed in the LUT section, the contents of a truth table of LUT are written during device configuration. Due to the flexible structure of LUT in Xilinx FPGAs, these blocks can be used as 64-bit memories. LUT is commonly referred to as distributed memories, which is the fastest kind of memory available on the FPGA. Therefore, LUT can be used in any part of the fabric in order to improve the performance of the implemented circuit.
The shift register is a chain of registers connected to each other. Figure (2.8) shows the structure of an Addressable Shift Register. The purpose of this structure is to provide data to be reused along a computational path, such as with a filter.
34 Figure (2.8). Structure of an Addressable Shift Register.
Clock cycle
The speed of a computer processor, is determined by the clock cycle. Clock cycle is the amount of time between two pulses of an oscillator. the higher number of pulses per second, the faster the computer processor can to process information. The clock speed is measured in Hz, often either megahertz (MHz) or gigahertz (GHz). For example, a 3 GHz processor performs 3,000,000,000 clock cycles per second.