Implementation Architecture for the Depth from Defocus Application

This Section describes the implementation of the DFD calculation on a Virtex 2P FPGA. It is based on the algorithm proposed in [14]. The far-focused and near-focused images are added and subtracted, and each result is convolved with the pre-filter to remove the DC component as well as high-frequency components. The low-pass filter gm1 is then convolved with the subtracted image while, at the same time, the LOG filter gp1 and the correction filter gp2 are convolved with the added image. The convolved outputs are smoothed by local averaging, and the divider stage provides the required depth. The implementation is a pipelined architecture with two parallel channels and five stages. The two parallel channels process the added and the subtracted images, and the five stages are: addition and subtraction; pre-filtering; rational filtering; smoothing; and division. Two depth outputs (linear and error-corrected models) are shown here for experimental reasons, but in practice a look-up table would be employed to provide the depth estimates. The DFD algorithm is illustrated in Figure (5.8).
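The data flow of the five stages can be sketched in software as follows. This is a minimal illustration and not the FPGA design: the kernels are passed in as placeholders (the actual coefficients of the pre-filter, gm1 and gp1 come from [14] and are not reproduced here), only the linear model is shown, the gp2 correction path is omitted, the order of subtraction (far minus near) is an assumption, and the names conv2d, box_kernel and dfd_linear are hypothetical.

```python
def conv2d(img, kernel):
    """Same-size 2D filtering with zero padding (kernels assumed symmetric)."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ry, rx = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    yy, xx = y + j - ry, x + i - rx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[j][i] * img[yy][xx]
            out[y][x] = acc
    return out


def box_kernel(n):
    """n x n local-averaging kernel."""
    return [[1.0 / (n * n)] * n for _ in range(n)]


def dfd_linear(far, near, prefilter, gm1, gp1, smooth_size=5, eps=1e-9):
    h, w = len(far), len(far[0])
    # Stage 1: addition and subtraction (two parallel channels).
    added = [[far[y][x] + near[y][x] for x in range(w)] for y in range(h)]
    subtracted = [[far[y][x] - near[y][x] for x in range(w)] for y in range(h)]
    # Stage 2: pre-filtering of both channels.
    added = conv2d(added, prefilter)
    subtracted = conv2d(subtracted, prefilter)
    # Stage 3: rational filtering (gm1 on the difference, gp1 on the sum).
    num = conv2d(subtracted, gm1)
    den = conv2d(added, gp1)
    # Stage 4: smoothing by local averaging.
    smooth = box_kernel(smooth_size)
    num = conv2d(num, smooth)
    den = conv2d(den, smooth)
    # Stage 5: division, guarded against a near-zero denominator.
    return [[num[y][x] / den[y][x] if abs(den[y][x]) > eps else 0.0
             for x in range(w)] for y in range(h)]
```

With 1x1 identity kernels and smooth_size=1 the function reduces to the pixel-wise ratio (far - near) / (far + near), which is a convenient sanity check before real coefficients are substituted.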

Figure 5.8: Two Channel five stage pipelined architecture

The processing elements (PEs) of the pipelined architecture execute in parallel. The combinatorial logic blocks (adder, subtractor and multiplexers) within each PE are treated as separate components that also execute in parallel and are synchronous with the system clock. The architecture can be termed systolic, since the input data (D0 to D4) advance into the designed module sequentially under the control of the system clock, as illustrated in Figure (5.9). As the input data progress through each module, the corresponding operations are executed by the processing elements, and the final output is obtained sequentially, in step with the system clock. Once the pipeline is full, every data input produces a calculated depth output.
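The clocked behaviour can be illustrated with a small register-level simulation. This is a sketch under simplifying assumptions: each stage is modelled as a single register with a placeholder operation (the real stages have latencies of many clock cycles), and run_pipeline is a hypothetical helper. It shows that, after the initial fill, one result leaves the pipeline per clock for every sample that entered it.

```python
def run_pipeline(samples, stages):
    """Clock a chain of registered stages and record the last register each cycle."""
    regs = [None] * len(stages)          # None marks an empty register (a bubble)
    trace = []
    # Extra bubbles at the end flush the last samples out of the pipeline.
    stream = list(samples) + [None] * (len(stages) - 1)
    for value in stream:
        # Update from the last stage backwards so each stage reads the
        # value its predecessor held before this clock edge.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = None if regs[i - 1] is None else stages[i](regs[i - 1])
        regs[0] = None if value is None else stages[0](value)
        trace.append(regs[-1])
    return trace


# Placeholder operations standing in for the real processing elements.
stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
trace = run_pipeline([0, 1, 2, 3, 4], stages)   # inputs D0 to D4
# trace == [None, None, -1, 1, 3, 5, 7]
```

The two leading None entries are the pipeline fill; after that, the five inputs yield five outputs on consecutive clocks.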

The adder and subtractor stages were implemented using simple logic gates. The added and subtracted data then proceed to the pre-filter stage, where the filter module was implemented using multipliers, adders and shift registers to suit the design requirements. Since the pre-filter and rational filter stages have a similar structure and differ only in their filter coefficients, a generalised architecture is presented to illustrate the filtering process. For simplicity, only a single processing element (PE) representing the filtering module is explained. It should be noted that the actual design incorporates 5 PEs to compute the five 2D convolution operations corresponding to each stage of the pipelined architecture. The filter module shown in Figure (5.10) consists of 49 shift registers (SR), 6 RAM-based FIFO (first in, first out) blocks, 10 multipliers and 48 adders. The bit-width of each module depends on the required accuracy and the available logic. More details about bit-width selection are provided in Section 5.4.
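The multiplier count is well below the 49 taps of a 7x7 kernel. One explanation that is consistent with the stated figures, although the text does not state it, is that the kernels are symmetric under horizontal, vertical and diagonal reflection, so that taps sharing a coefficient can be added first and each distinct coefficient multiplied once. The sketch below counts the distinct coefficients under that assumed symmetry; unique_taps is a hypothetical helper.

```python
def unique_taps(n):
    """Count distinct coefficients of an n x n kernel (n odd) with 8-fold symmetry."""
    c = n // 2
    classes = set()
    for y in range(n):
        for x in range(n):
            dy, dx = abs(y - c), abs(x - c)
            # Taps at the same unordered pair of offsets share one coefficient.
            classes.add((min(dy, dx), max(dy, dx)))
    return len(classes)


n = 7
multipliers = unique_taps(n)          # distinct coefficients: 10
pre_adders = n * n - multipliers      # adders that merge equal-coefficient taps: 39
post_adders = multipliers - 1         # adders that sum the products: 9
total_adders = pre_adders + post_adders   # 48
```

Under this assumption the counts reproduce the 10 multipliers and 48 adders quoted for the filter module. Without any symmetry the adder count would still be 48, but 49 multipliers would be needed.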

The shift registers were implemented using flip-flops and were arranged to form a 2D array with 7 rows and 7 shift register blocks per row. The output of the 7th shift register (SR17) in the first row was connected to the input of FIFO 1, where the data were delayed until the image row was complete. The output of FIFO 1 was then fed to the input of the first shift register (SR21) in the next row. Likewise, the output of the 7th shift register in each row was connected to the FIFO of the same row, and the FIFO output was connected to the first shift register of the next row. This arrangement of shift registers and FIFOs formed a systolic array, and the movement of the input data through the design module was synchronised to the common clock. When implemented in hardware, the array stored a 7x7 sub-image which, when multiplied by the pre-stored coefficients and summed, provided the filtered output. The latency at each filtering stage depended on: (1) the kernel size; (2) the horizontal resolution of the image; and (3) any internal buffering present within the PE. Here, filtering operations were performed on test images with a resolution of 400 x 400 pixels using a 7x7 kernel, and each PE required internal signal buffering corresponding to 3 clock cycles. Hence the latency of a filtering process was 1207 clock cycles. A table listing the latency at each stage of the pipelined processor is provided in Section 5.5.
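The quoted latency can be reproduced from the three factors listed. The decomposition below is an assumption, since the text gives only the total: three full image rows must pass before the centre row of the 7x7 window is available, four further pixels are needed to reach the centre column, and three cycles are added for internal buffering. The function name filter_latency is hypothetical.

```python
def filter_latency(kernel_size, row_width, buffer_cycles):
    """Clock cycles from the first input pixel to the first filtered output."""
    rows_before_centre = (kernel_size - 1) // 2      # full rows stored first
    pixels_in_centre_row = (kernel_size + 1) // 2    # pixels up to the centre tap
    return rows_before_centre * row_width + pixels_in_centre_row + buffer_cycles


latency = filter_latency(kernel_size=7, row_width=400, buffer_cycles=3)
# 3 * 400 + 4 + 3 = 1207
```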

Figure 5.10: Filter block module with Shift registers and FIFOs
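The shift register and FIFO arrangement can be modelled in software to check that it exposes a 7x7 sub-image on every clock. This is a behavioural sketch and not the hardware description: the FIFO depth of (image width minus kernel size) is an assumption chosen so that each row is delayed by exactly one image line, a small 10 x 9 test image stands in for the 400 x 400 images, and the names WindowBuffer and filter_pixel are hypothetical.

```python
from collections import deque


class WindowBuffer:
    """k x k sliding window built from k rows of shift registers and k-1 FIFOs."""

    def __init__(self, k, width):
        assert width >= k
        self.k = k
        # Index 0 of each row is the newest register, index k-1 the oldest.
        self.rows = [deque([0] * k, maxlen=k) for _ in range(k)]
        # Each FIFO pads the row delay up to one full image line (assumed depth).
        self.fifos = [deque([0] * (width - k)) for _ in range(k - 1)]

    def clock(self, pixel):
        carry = pixel
        for r in range(self.k):
            leaving = self.rows[r][-1]          # output of the last register in the row
            self.rows[r].appendleft(carry)      # shift the row by one register
            if r < self.k - 1:
                self.fifos[r].appendleft(leaving)
                carry = self.fifos[r].pop()     # oldest FIFO entry feeds the next row

    def window(self):
        # Reorder so that window[0][0] is the oldest (top-left) pixel.
        return [list(reversed(row)) for row in reversed(self.rows)]


def filter_pixel(window, coeffs):
    """Multiply the stored sub-image by the coefficients and sum the products."""
    return sum(c * p for wrow, crow in zip(window, coeffs)
               for p, c in zip(wrow, crow))


width, height, k = 10, 9, 7
img = [[y * width + x for x in range(width)] for y in range(height)]
buf = WindowBuffer(k, width)
first_window = None
for y in range(height):
    for x in range(width):
        buf.clock(img[y][x])
        if (y, x) == (k - 1, k - 1):
            first_window = buf.window()   # first window lying fully inside the image
```

After the pixel at row 6, column 6 has been clocked in, first_window holds the top-left 7x7 block of the image, and each subsequent clock slides the window by one pixel along the row.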

The pre-filter output then progressed into the rational filtering stage, where the design architecture remained the same but different filter coefficients were used. After the rational filtering stage, the filtered pixels advanced into the smoothing stage, which performed local averaging and was implemented using a 5x5 systolic array incorporating shift registers and FIFOs. Finally, the smoothed data advanced into the divider stage, the output of which provided the required depth estimate. The depth output from the divider was stored in an on-chip dual-port RAM and then transferred to the desktop PC through the UART interface. The next Section provides a detailed analysis of the test pattern and the required bit-widths at each stage of the pipelined DFD calculation.

5.4. Analysis – Test pattern and Computation of bit-widths at each stage of the
