Since the first FPGA processors came into use, many applications have been implemented on them, including many image processing algorithms. We concentrate here on several industrial image processing applications implemented on microEnable during 1997 and 1998. Exemplary applications, which can be used for online or offline analysis (see the different operating modes of microEnable described in Section 2.3.5), are image preprocessing, filtering, and JPEG compression.
2.5.1 Image preprocessing
Image preprocessing generally covers simple steps that process the acquired image data for the actual image analysis. Usually these preprocessing steps have to be performed directly after the image acquisition. For this reason it is often important to execute image preprocessing concurrently with the acquisition. Examples of image preprocessing steps are:
1. Cutting out regions of interest.
2. Rotation by 0°, 90°, 180°, 270°.
3. Scaling.
4. Reformatting, for example, the creation of BMP-compatible pictures.
5. Color space conversion, for example, from RGB to YUV space or vice versa.
Figure 2.5: Data flow for 2-D convolution with a 5×5 binomial filter. [Diagram: PC memory (image) → DMA → filter → local memory (bank 0) → filter → local memory (bank 1) → DMA → PC memory (image); the filters and both local memory banks reside on microEnable.]
4 heurisko has been developed by AEON, Hanau, Germany in cooperation with the
The examples listed here were implemented on microEnable in the scope of an industrial application. All subalgorithms were implemented in a single hardware applet. With this standard configuration the functionality of an average digital frame grabber is covered by the FPGA processor.
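As a purely illustrative software model of three of the preprocessing steps listed above (region-of-interest cut-out, rotation by multiples of 90°, and RGB to YUV conversion), consider the following minimal sketch. It is not the microEnable hardware applet; the function names are hypothetical, and the YUV weights are an assumption (the common BT.601-style luma coefficients), since the text does not state which YUV definition was used.

```python
# Illustrative software model only; names are hypothetical and this is
# not the microEnable hardware applet.

def crop(img, top, left, height, width):
    """Cut out a region of interest from a list-of-rows image."""
    return [row[left:left + width] for row in img[top:top + height]]

def rotate(img, degrees):
    """Rotate clockwise by 0, 90, 180 or 270 degrees."""
    if degrees not in (0, 90, 180, 270):
        raise ValueError("only multiples of 90 degrees are supported")
    out = [list(row) for row in img]
    for _ in range(degrees // 90):
        # One clockwise quarter turn: reverse the rows, then transpose.
        out = [list(col) for col in zip(*out[::-1])]
    return out

def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to YUV (assumed BT.601-style weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
roi = crop(image, 0, 1, 2, 2)   # rows 0-1, columns 1-2
turned = rotate(roi, 90)        # quarter turn clockwise
```

Each of these steps touches every pixel exactly once and needs no neighborhood, which is why they lend themselves to being executed on the fly during acquisition.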
However, the use of an FPGA processor, especially its reprogrammability, offers some additional features. One is the possibility of extending the described standard hardware configuration with user-specific applications. This makes it possible to perform additional user-specific image processing steps on the fly. The user gains a high speedup by executing the application directly in hardware. Example applications are unusual data formats or on-the-fly convolution for data filtering. Due to their algorithmic structure, these highly parallel or bit-shuffling algorithms often cannot be executed in software with sufficient speed. On the other hand, these algorithms are too problem-specific for an ASIC solution. The second feature is dynamic reconfiguration of the hardware, that is, loading different hardware configurations sequentially. Instead of a single hardware setup, the user has several at his or her disposal.
2.5.2 Convolution
An important operation in digital image processing is convolution. In what follows, we consider an implementation of a binomial filter for microEnable. Binomial filters, typically used as smoothing filters (Volume 2, Chapter 7), belong to the class of separable filters; thus the convolution can be separated into a horizontal and a vertical direction (Volume 2, Section 5.6).
The implementation of this filter consists of two concurrent processing steps (compare Volume 2, Section 5.6.2, especially Fig. 5.7):
1. The first unit is the filter pipeline in the horizontal direction. The data are processed line by line as they are shifted through the filter pipeline. The lines convolved by the horizontal filter mask are stored in bank 0 of the local memory (Fig. 2.5).
2. Concurrently with the horizontal convolution, the vertical convolution takes place in the second filter stage. A vertical stripe equivalent to the width of the vertical mask is read out from the intermediate storage area in bank 0 and processed by a second, identical filter pipeline. The results are stored in bank 1 of microEnable's local memory (Fig. 2.5) and, in a subsequent step, read out via DMA to the host, again row by row.
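The two processing steps above can be modeled in software as follows. This is a minimal sketch under stated assumptions, not the hardware design: the lists `bank0` and `bank1` merely play the roles of the two local memory banks, the helper names are hypothetical, zero padding at the image border is assumed (the text does not specify the border treatment), and the two passes run one after the other here, whereas the hardware runs them concurrently. A direct 2-D convolution is included only as a reference to check the separation.

```python
# Illustrative software model; bank0/bank1 only mimic the roles of the
# two local memory banks and are not a microEnable API.

KERNEL = [1, 4, 6, 4, 1]   # 5-tap binomial mask, sum = 16

def filter_1d(samples, kernel):
    """Filter one line with a symmetric kernel (zero padding assumed)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(samples)):
        acc = 0
        for k, coeff in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(samples):
                acc += coeff * samples[j]
        out.append(acc)
    return out

def separable_filter(image, kernel):
    # Stage 1: horizontal pass, line by line; results go to "bank 0".
    bank0 = [filter_1d(row, kernel) for row in image]
    # Stage 2: vertical pass over the columns of bank 0; results go to "bank 1".
    columns = [filter_1d(list(col), kernel) for col in zip(*bank0)]
    bank1 = [list(row) for row in zip(*columns)]
    return bank1

def direct_filter(image, kernel):
    """Reference: full 2-D filtering with the outer-product mask."""
    half = len(kernel) // 2
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ky, cy in enumerate(kernel):
                for kx, cx in enumerate(kernel):
                    yy, xx = y + ky - half, x + kx - half
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += cy * cx * image[yy][xx]
    return out

# Impulse response of the 5x5 binomial filter.
impulse = [[0] * 5 for _ in range(5)]
impulse[2][2] = 1
response = separable_filter(impulse, KERNEL)
```

The result is unnormalized; dividing by 256 (a shift by 8 bits) gives unit gain. Per pixel, the separable form needs 5 + 5 multiplications instead of the 25 required by the direct 5×5 mask.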
This type of filter is very well suited to implementation in FPGAs. As already mentioned, the actual filter is implemented as a pipeline, which can be built of many stages (order 8 or higher without any problem). The execution speed is nearly independent of the order of the filter operation. If the coefficients are constant, they are implemented directly in hardware; if not, they are held in loadable registers.
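One way to see why a binomial filter maps naturally onto a pipeline of many stages is the well-known property that a binomial mask of order n is the n-fold convolution of the two-tap mask [1, 1], so each stage needs only an adder. The following sketch illustrates this property in software; it is not a description of the actual microEnable filter pipeline, and the function names are hypothetical.

```python
def two_tap_stage(samples):
    """One stage: full convolution with the mask [1, 1] (adder only)."""
    padded = [0] + list(samples) + [0]
    return [padded[i] + padded[i + 1] for i in range(len(samples) + 1)]

def binomial_mask(order):
    """Build the binomial mask of the given order by cascading stages."""
    mask = [1]
    for _ in range(order):
        mask = two_tap_stage(mask)
    return mask

mask5 = binomial_mask(4)   # 5 taps
mask9 = binomial_mask(8)   # 9 taps, order 8
```

Because every stage processes one sample per clock cycle, adding stages increases latency and resource usage but not the time per pixel, which is consistent with the observation that the execution speed is nearly independent of the filter order.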
The implementation in microEnable achieves a throughput of 50 MBytes/s with a 9×9 separable filter kernel, limited by the PCI bus bandwidth. If the filter is used in an online mode that avoids the PCI bottleneck, a throughput of 100 MBytes/s is achieved. With 18 multiplications and 16 additions per pixel this corresponds to a sustained computing power of 3400 MOPS. This is just about the peak performance of a 400 MHz Pentium II with MMX instructions (Section 3.4.2, Table 3.4). The real performance on this platform is, however, only about 300 MOPS, roughly 10 times slower than the FPGA (Section 3.6, Table 3.6).
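The operation count behind these figures can be reproduced with a few lines of arithmetic. The sketch below assumes one byte per pixel (so that 100 MBytes/s corresponds to 100 million pixels per second) and assumes the 18 multiplications and 16 additions arise from two 9-tap passes; both assumptions are ours, although they are consistent with the numbers quoted in the text.

```python
TAPS = 9               # 9x9 separable kernel: two 1-D passes of 9 taps each
PIXEL_RATE = 100e6     # pixels/s, assuming 1 byte per pixel at 100 MBytes/s

mults_per_pixel = 2 * TAPS          # 9 horizontal + 9 vertical
adds_per_pixel = 2 * (TAPS - 1)     # 8 horizontal + 8 vertical
ops_per_pixel = mults_per_pixel + adds_per_pixel

fpga_mops = ops_per_pixel * PIXEL_RATE / 1e6
pentium_mops = 300                  # sustained figure quoted in the text
speedup = fpga_mops / pentium_mops

print(f"{ops_per_pixel} ops/pixel, {fpga_mops:.0f} MOPS, speedup {speedup:.1f}x")
```

The ratio comes out slightly above 11, which the text rounds to a factor of 10.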
To implement several filters serially, a ping-pong mode is possible, in which data are transferred and processed from one memory bank to the other and back again. By exploiting reconfiguration, the hardware can be adapted to an arbitrary filter combination. Because of the parallel implementation, the execution time for two sequential separable filters in microEnable is the same as for a single one (however, the number of required resources is higher).
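The ping-pong scheme can be pictured as two buffers that swap the roles of source and destination after every filter. The following sketch is a software analogy only: the two-element list `banks` stands in for the two local memory banks, and `smooth` and `negate` are hypothetical placeholder filters that do not come from the text.

```python
def smooth(line):
    """Placeholder filter: 3-tap binomial [1, 2, 1] / 4 with clamped edges."""
    n = len(line)
    return [(line[max(i - 1, 0)] + 2 * line[i] + line[min(i + 1, n - 1)]) // 4
            for i in range(n)]

def negate(line):
    """Placeholder filter: invert 8-bit gray values."""
    return [255 - v for v in line]

def ping_pong(data, filters):
    """Apply the filters in sequence, alternating between two banks."""
    banks = [list(data), [0] * len(data)]
    src = 0
    for apply_filter in filters:
        dst = 1 - src                       # the other bank receives the result
        banks[dst] = apply_filter(banks[src])
        src = dst                           # swap roles for the next filter
    return banks[src], src                  # result and the bank holding it

result, bank = ping_pong([0, 0, 255, 0, 0], [smooth, negate])
```

After an even number of filters the result ends up in the bank that held the input; after an odd number it ends up in the other one.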
2.5.3 JPEG compression
JPEG is the most popular lossy compression standard for still images [16]. The data flow of the JPEG compression algorithm is shown in Fig. 2.6. It is based on a discrete cosine transform (DCT) of the image (applied to blocks of 8×8 pixels). JPEG exploits the fact that the DCT coefficients representing high image frequencies can be suppressed with minor effect on the overall image quality. This is done in the second processing step, called quantization. The computed DCT coefficients are divided and rounded to integer numbers, in which the divisor is