
most 40 % (speed) to 60 % (complexity) per year. This corresponds to an annual speedup in computing power of over a factor of two per year, confirming the prediction of Vuillemin [3].

The speedup in computing power for CPUs, however, has also followed an exponential growth track over the last 40 years. According to Thacker [4], growth held constant for more than 30 yr at an annual factor of 1.25; thus computing power doubles roughly every 3 yr. If these trends are stable, the performance of FPGAs will increase many times faster than that of CPUs. The reason for this behavior is that the computing power of CPUs benefits mainly from advances in process technology through the increase of the CPU’s maximum frequency. Architectural improvements usually have a much lower impact. By contrast, FPGAs take very direct advantage of both effects because the higher number of logic cells is directly used for a higher degree of parallelism. Due to their advantage for bit-level algorithms and integer arithmetic, we expect that FPGA coprocessors will play an important role in image processing. Probably even algorithms requiring floating point operations will become more and more applicable to FPGA coprocessors.
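The compounding behind these growth figures can be checked with a few lines of C. This is only a sketch: the helper names `compound_growth` and `print_growth_comparison` are illustrative, and the FPGA factor of 2 per year is taken as the lower bound quoted above.

```c
#include <stdio.h>

/* Cumulative speedup after `years` years at a constant annual factor.
 * Hypothetical helper for illustration only. */
double compound_growth(double annual_factor, int years)
{
    double total = 1.0;
    for (int i = 0; i < years; ++i) {
        total *= annual_factor;
    }
    return total;
}

/* Prints the cumulative CPU and FPGA speedups and their ratio
 * for a given number of years. */
void print_growth_comparison(int years)
{
    double cpu  = compound_growth(1.25, years); /* CPU figure from Thacker [4] */
    double fpga = compound_growth(2.0, years);  /* assumed lower bound for FPGAs */
    printf("after %d yr: CPU x%.2f, FPGA x%.0f, ratio %.0f\n",
           years, cpu, fpga, fpga / cpu);
}
```

With an annual factor of 1.25, three years give a factor of about 1.95, which matches the rule of thumb of a doubling every 3 yr. Over ten years the two assumed rates differ by a factor of about 110.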

2.3 FPGA-based image processing systems

In 1989, five years after the advent of FPGAs, the first FPGA-based computing machines appeared that made use of the computing power and flexibility of this new technology. The basic idea is to establish a new class of computer consisting of an array of FPGAs as computing core. As in a conventional von Neumann computer, arbitrary applications are run by downloading an appropriate configuration bitstream (the software).

The first of these FPGA processors were large systems optimized for maximum computing power [5,6]. They consist of dozens of FPGAs, reaching or partially outrunning the computing power of supercomputers. Important examples were a DNA sequencing application outperforming a CRAY-II by a factor of 325 [7] and the world’s fastest RSA implementation [8]. A wide variety of tasks were implemented on some of the machines [9]. The large number of image processing applications demonstrated the potential of FPGA processors for this area [10]. A good collection of pointers to most FPGA processors can be found in Guccione [11].

Due to their high speed, large FPGA processors typically run their application independently of their host computer; the host only provides support functions. Typical image processing systems, however, usually need tight integration of data acquisition, image processing, and display routines. This makes a certain subclass of FPGA processors, the FPGA coprocessors, great candidates for image processing.

16 2 Field Programmable Gate Array Image Processing

Figure 2.2: Functional elements of microEnable.

An FPGA coprocessor is closely coupled to a conventional computer and optimized for fast communication with the CPU. It works similarly to the old mathematical coprocessors but, in contrast to them, executes complete algorithms instead of single instructions. FPGA coprocessors usually contain only one or a few FPGAs on relatively small boards. The following description of the hardware and software architecture uses the microEnable coprocessor (footnote 3) as an example. It is particularly suitable for image processing applications.

2.3.1 General architecture of FPGA coprocessors

An FPGA coprocessor generally consists of three components (Fig. 2.2):

1. The FPGA is the computing kernel in which algorithms are executed.

2. The memory system is used as a temporary buffer or for table lookup.

3. The I/O interface is required for communication between the host and the FPGA coprocessor. On some machines it also comprises a second interface to external electronics.

The overall performance of the FPGA coprocessor depends on the performance of all three subsystems: FPGA; memory system; and I/O interface.

2.3.2 Example system: the microEnable

Like FPGA coprocessors in general, the hardware of microEnable (Fig. 2.3) comprises the three functional units FPGA, memory system, and I/O interface. The I/O interface of the processor additionally provides a secondary port to external electronics. Used as an interface to image sources, this feature is important for image processing, making microEnable an intelligent framegrabber. A further functional unit is the clock and support circuitry of the processor. In the next paragraphs the hardware setup is described in more detail.

Figure 2.3: The microEnable board.

3 MicroEnable is a commercially available system provided by Silicon Software GmbH,

The FPGA is microEnable’s computing kernel. Due to the fast growth of FPGA complexity, and in order to keep the system inexpensive and simple, the board contains only a single device of the Xilinx XC4000 family. It supports all XC4000E/EX devices larger than the XC4013. Since 1998, a 3.3 V version has been available covering all XC4000XL FPGAs between XC4013XL and XC4085XL. This makes the computing power of the processor scalable by one order of magnitude.

The RAM system is a fast buffer used exclusively by the FPGA. It consists of 0.5 to 2 MBytes of fast SRAM. This buffer is intended for tasks such as temporary data buffering, table lookup, or coefficient storage. Experience in different application fields has shown the importance of a fast RAM system.

The PCI interface, the physical connection to the host computer and thus to the application, supports the standard PCI bus of 32-bit width and 33 MHz. It is implemented in a custom PCI chip, the PCI9080 from PLX Technology. The advantages of this implementation are the high performance of the chip, a wide variety of features, and the excellent DMA support important for high transfer rates. In our experience these features compensate for the drawbacks compared to an FPGA implementation: the need for a PLX interface on the local bus side and the need to cope with a device that is, owing to its many features, rather sophisticated.
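The peak rate of such a bus follows directly from its width and clock. The sketch below assumes one full-width transfer per clock cycle and no protocol overhead; the function names are illustrative and not part of any microEnable software.

```c
/* Peak transfer rate of a parallel bus in MBytes/s (1 MByte = 10^6 bytes),
 * assuming one full-width transfer per clock cycle and no protocol overhead. */
double bus_peak_mbytes_per_s(unsigned bus_width_bits, double clock_mhz)
{
    return (bus_width_bits / 8.0) * clock_mhz;
}

/* Fraction of the peak rate that a measured rate achieves (0.0 .. 1.0). */
double bus_utilization(double measured_mbytes_per_s,
                       unsigned bus_width_bits, double clock_mhz)
{
    return measured_mbytes_per_s /
           bus_peak_mbytes_per_s(bus_width_bits, clock_mhz);
}
```

For the 32-bit, 33 MHz PCI bus, `bus_peak_mbytes_per_s(32, 33.0)` gives 132 MBytes/s. Real transfers stay below this value because of arbitration, address phases, and driver overhead.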

For user-specific functionality and communication with external electronics, a daughterboard can be plugged onto microEnable. Two types of daughterboards are supported:

1. Common Mezzanine Cards (CMC) following the IEEE P1386 specification; and

2. S-LINK cards following a specification developed at the European particle physics lab CERN [12].

The CMC is a mechanical specification defining the board mechanics, connector type, and pinout for some signals (clock, reset, power, ground) of the daughtercard [13]. In addition to the CMC specification, other specifications define the protocol used: the most important one is the PMC specification for the PCI protocol. MicroEnable supports the 33 MHz, 32-bit PMC specification. CMC/PMC daughterboards are the type most widely used by FPGA coprocessors [14,15].

The clock and support circuitry of microEnable implements tasks such as configuration and readback of the FPGA, a JTAG port, and the provision of user-programmable clocks. For applications running on microEnable’s FPGA, four different clocks are available. The main clocks are two phase-synchronous clocks freely programmable in the range between 1 and 120 MHz (in 1-MHz steps). The ratio between these two clocks can be set to 1, 2, or 4. The third clock is the local bus clock, programmable in the interval between 1 and 40 MHz. The last usable clock signal is an external clock provided by the S-LINK connector.
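The clock ranges above can be captured in a small validity check. This is a sketch only: the structure `clock_config` and its field names are hypothetical and do not come from the microEnable library, and the external S-LINK clock is left out because the text gives no range for it.

```c
#include <stdbool.h>

/* Hypothetical description of a clock setup; field names are illustrative
 * and do not come from the microEnable library. */
typedef struct {
    unsigned main_clock_mhz;      /* main clocks: 1..120 MHz in 1-MHz steps */
    unsigned main_clock_ratio;    /* ratio between the two main clocks: 1, 2, or 4 */
    unsigned local_bus_clock_mhz; /* local bus clock: 1..40 MHz */
} clock_config;

/* Returns true if the setup lies within the ranges stated in the text. */
bool clock_config_is_valid(const clock_config *cfg)
{
    bool main_ok  = cfg->main_clock_mhz >= 1 && cfg->main_clock_mhz <= 120;
    bool ratio_ok = cfg->main_clock_ratio == 1 ||
                    cfg->main_clock_ratio == 2 ||
                    cfg->main_clock_ratio == 4;
    bool bus_ok   = cfg->local_bus_clock_mhz >= 1 &&
                    cfg->local_bus_clock_mhz <= 40;
    return main_ok && ratio_ok && bus_ok;
}
```

Using whole megahertz as the unit reflects the 1-MHz step size, so no separate step check is needed.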

2.3.3 Device driver and application interface library

A high-bandwidth bus like PCI does not by itself guarantee a high data exchange rate between FPGA coprocessor and CPU. What fraction of the raw hardware performance the application can actually use depends strongly on the device driver. MicroEnable provides device drivers for the operating systems Linux and Windows NT 4.0 and thus supports the most important aspects of PCs. The device drivers support master as well as slave accesses, two independent gather/scatter DMA channels, and interrupt capability (generated by either DMA or the FPGA). (For a sample interface, see Fig. 2.4.)

In order to provide users with a simple but fast programming interface, accesses to microEnable are memory mapped into user space. To the user, these accesses appear as ordinary memory accesses.

Although the memory map is a fast mechanism, the reachable data bandwidth is far below the theoretical limit of 132 MBytes/s. The key issue for maximum throughput is DMA transfers. Using DMA microEn-

Figure 2.4: The software interface to microEnable. Elements shown: user program, microEnable library, device driver, memory mapping, hardware model (FPGA).

PCI performance, measured to/from a user application running in user space under WinNT 4.0 on a 233 MHz Pentium II.

DMA transfers data in large blocks. In many real-life applications the hardware applet (Section 2.3.4) is not able to accept data in every cycle (this is usually data dependent). The dilemma is that a normal DMA transfer would overrun the hardware applet and cause data loss, whereas doing without DMA would decrease performance unacceptably. MicroEnable therefore allows DMA transfers using a handshake protocol with the applet running in the FPGA. Using this mechanism, transfer rates comparable to normal DMA rates are possible.
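The difference between a plain and a handshaked transfer can be illustrated with a toy cycle model. Everything in this sketch is an assumption made for illustration: the type `applet_model`, the fixed busy pattern standing in for data-dependent stalls, and the function names. It does not describe the actual microEnable protocol.

```c
#include <stddef.h>
#include <stdbool.h>

/* Toy model of a hardware applet that cannot accept data on every cycle.
 * It is busy on every `busy_every`-th cycle, an arbitrary stand-in for
 * data-dependent stalls. Values below 2 mean "never busy". */
typedef struct {
    unsigned busy_every;  /* applet is busy when cycle % busy_every == 0 */
    unsigned long cycle;  /* cycles elapsed so far */
    size_t accepted;      /* words the applet has actually received */
} applet_model;

bool applet_ready(const applet_model *a)
{
    return a->busy_every < 2 || (a->cycle % a->busy_every) != 0;
}

/* Plain DMA: one word is pushed per cycle whether or not the applet is
 * ready. Returns the number of words lost. */
size_t dma_plain(applet_model *a, size_t n_words)
{
    size_t lost = 0;
    for (size_t i = 0; i < n_words; ++i) {
        if (applet_ready(a)) {
            a->accepted++;
        } else {
            lost++;
        }
        a->cycle++;
    }
    return lost;
}

/* Handshaked DMA: a word is held until the applet signals ready.
 * Returns the number of cycles the transfer took; no word is lost. */
unsigned long dma_handshake(applet_model *a, size_t n_words)
{
    unsigned long start = a->cycle;
    size_t sent = 0;
    while (sent < n_words) {
        if (applet_ready(a)) {
            a->accepted++;
            sent++;
        }
        a->cycle++;
    }
    return a->cycle - start;
}
```

In this model, sending 8 words to an applet that is busy every fourth cycle loses 2 words with the plain transfer, while the handshaked transfer delivers all 8 words in 11 cycles. The cost of the handshake is a few stall cycles rather than lost data.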

High performance is only one important factor; the other is to provide the user with a simple view of the FPGA coprocessor’s functionality. This depends on the device driver and the application interface library.

MicroEnable provides a C library for user applications. The complete setup of the board is hidden behind a few rather simple function calls:

Example 2.1: Code example for board setup

    int main(void)
    {
        microenable   my_processor;
        fpga_design   my_design;
        unsigned long *access;
        unsigned long value;

        // initialize microEnable
        initialize_microenable(&my_processor);

        // load the hardware applet into memory
        load_design(&my_processor, &my_design, "design_file.hap");

        // now configure the FPGA
        configure_fpga(&my_processor);

        // memory map on microEnable
        access = GetAccessPointer(&my_processor);

        // access the FPGA
        access[0] = 100;      // write
        value = access[0];    // read

        return 0;
    }

Table 2.4: Configuration times for microEnable

FPGA      Time in ms
4013E     30
4028EX    90
4036EX    105

The distinction between load_design(), which stores the configuration bitstream in memory, and configure_fpga(), which performs the actual configuration, allows an efficient exchange of FPGA designs during processing. At the beginning of the application a few configuration bitstreams are stored using load_design(). During the execution of the actual application the FPGA can then be quickly reprogrammed using configure_fpga().
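The two-step scheme can be mimicked with a small host-side model. The sketch below is a simulation with hypothetical names (`board_model`, `board_load_design`, `board_configure`); it deliberately does not use the microEnable library calls, because the text does not say how a stored design is selected for configuration.

```c
#include <stddef.h>

#define MAX_DESIGNS 4

/* Hypothetical stand-ins; these are NOT the microEnable library types. */
typedef struct {
    const char *name; /* e.g., file name of the hardware applet */
    int loaded;       /* nonzero once the bitstream is held in host memory */
} design_slot;

typedef struct {
    design_slot slots[MAX_DESIGNS];
    size_t n_slots;
    int active;          /* index of the design in the FPGA, -1 if none */
    unsigned file_loads; /* how often a bitstream was fetched into memory */
    unsigned configs;    /* how often the FPGA was (re)configured */
} board_model;

void board_init(board_model *b)
{
    b->n_slots = 0;
    b->active = -1;
    b->file_loads = 0;
    b->configs = 0;
}

/* Step 1 (once per design): store the bitstream in memory.
 * Returns the slot index, or -1 if no slot is free. */
int board_load_design(board_model *b, const char *name)
{
    if (b->n_slots >= MAX_DESIGNS) {
        return -1;
    }
    b->slots[b->n_slots].name = name;
    b->slots[b->n_slots].loaded = 1;
    b->file_loads++;
    return (int)b->n_slots++;
}

/* Step 2 (repeatable): configure the FPGA from a stored bitstream.
 * Returns 0 on success, -1 for an unknown slot. */
int board_configure(board_model *b, int slot)
{
    if (slot < 0 || (size_t)slot >= b->n_slots || !b->slots[slot].loaded) {
        return -1;
    }
    b->active = slot;
    b->configs++;
    return 0;
}
```

Loading two designs once and then alternating between them several times increments only the configuration counter, which is the point of the split: the slow file access happens at startup, and only the fast configuration step is repeated during processing.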

2.3.4 Hardware applets

An application running on the FPGA coprocessor is called a hardware applet. The term refers not only to the FPGA configuration but also comprises the software routines that allow the host software to communicate with the coprocessor and a database holding information such as the supported FPGA type, the maximum clock frequency, the available address space, and the registers (and their parameters) implemented in the FPGA. The user does not need additional information to run the applet.
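The kind of information bundled with a hardware applet could be represented as a small descriptor. The structure below is a hypothetical sketch: the field names, the string encoding of the FPGA type, and the check `applet_can_run` are assumptions, not the actual applet file format.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical applet descriptor; layout and names are illustrative only. */
typedef struct {
    const char *fpga_type;       /* FPGA type the applet was built for */
    unsigned max_clock_mhz;      /* maximum clock frequency of the design */
    unsigned long address_space; /* size of the address space, in bytes */
    unsigned n_registers;        /* number of registers implemented in the FPGA */
} applet_info;

/* True if the applet matches the FPGA on the board and the requested
 * clock does not exceed the applet's maximum clock frequency. */
bool applet_can_run(const applet_info *info,
                    const char *board_fpga, unsigned clock_mhz)
{
    return strcmp(info->fpga_type, board_fpga) == 0 &&
           clock_mhz >= 1 && clock_mhz <= info->max_clock_mhz;
}
```

A host program could run such a check before configuring the FPGA, so that the user does not have to know the constraints of each applet.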

Hardware applets can be loaded in a very short time depending on the FPGA size (Table 2.4).
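To estimate what reconfiguration costs in a running application, the values of Table 2.4 can be put into a lookup table. The sketch below uses only the three devices listed there; the function names and the simple model of total time as switches times configuration time are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Configuration times from Table 2.4, in milliseconds. */
typedef struct {
    const char *device;
    unsigned time_ms;
} config_time_entry;

static const config_time_entry CONFIG_TIMES[] = {
    { "4013E",  30  },
    { "4028EX", 90  },
    { "4036EX", 105 },
};

/* Returns the configuration time in ms, or -1 if the device is not listed. */
int config_time_ms(const char *device)
{
    size_t n = sizeof CONFIG_TIMES / sizeof CONFIG_TIMES[0];
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(CONFIG_TIMES[i].device, device) == 0) {
            return (int)CONFIG_TIMES[i].time_ms;
        }
    }
    return -1;
}

/* Total reconfiguration time in ms for `n_switches` applet exchanges,
 * or -1 if the device is not listed. */
long total_reconfig_ms(const char *device, unsigned n_switches)
{
    int t = config_time_ms(device);
    return t < 0 ? -1L : (long)t * (long)n_switches;
}
```

For example, ten applet exchanges on a 4013E add up to 300 ms of configuration time under this model.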

2.3.5 Modes of operation

MicroEnable can be run in three different modes of operation that are of particular interest for image processing applications.

1. Virtual mode. The short reconfiguration times of microEnable allow dynamic switching between different hardware applets during the run of an application. A complex image processing task is divided into several subtasks, each implemented in its own applet. By dynamically exchanging the applets, the FPGA coprocessor behaves like a much larger FPGA: it becomes “virtual hardware.” This mode enhances the capabilities of the FPGA coprocessor and saves hardware resource costs. The virtual mode can also be combined with the two modes that follow.

2.4 Programming software for FPGA image processing 21