Design of a Pipelined Encoder suitable for integration into Xilinx platform based systems

FPGA Video Compression Systems

4.1 H.263 Encoder System using the Xilinx Platform

4.1.1 Design of a Pipelined Encoder suitable for integration into Xilinx platform based systems

The encoder is based on the design described in [108]. A diagram of the encoder is shown in Figure 4.1. It consists of five computational modules: the full pixel motion estimator, the half pixel motion estimator, the transform and quantiser, the inverse transform and quantiser, and the variable length encoder. Each module processes data in macroblock units and stores its output in the appropriate RAM buffer. Double buffering allows the encoder to operate in a pipelined fashion, as shown in Figure 4.2. In this encoder, the number of clock cycles required per pipeline stage is fixed at 1468. Encoding D1 (704x480) sized images at 30 frames per second therefore requires a minimum encoder clock frequency of approximately 59 MHz (equation 2.3).
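The clock requirement can be illustrated with a short calculation. Equation 2.3 is not reproduced in this section, so the sketch below assumes the simplest form: macroblocks per frame, times frames per second, times cycles per pipeline stage. The function name `min_clock_hz` and the 16x16 macroblock constant are illustrative choices, not taken from [108]. Under these assumptions the product comes to about 58.1 MHz for 704x480, slightly below the approximately 59 MHz quoted above, so equation 2.3 may include terms this sketch omits.

```python
import math

MB_SIZE = 16  # macroblock width and height in pixels (H.263 luminance block)


def min_clock_hz(width, height, fps, cycles_per_stage):
    """Lower bound on the clock for a pipeline that accepts one macroblock
    every `cycles_per_stage` cycles (assumed form of equation 2.3)."""
    mbs_per_frame = math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)
    return mbs_per_frame * fps * cycles_per_stage


if __name__ == "__main__":
    f = min_clock_hz(704, 480, 30, 1468)
    print(f"{f / 1e6:.1f} MHz")  # 1320 macroblocks * 30 fps * 1468 cycles
```

Pipeline fill and drain cycles are ignored here; they add only a few stage periods per frame.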

Figure 4.1: H.263 encoder architecture

Figure 4.2: Pipeline operation of H.263 encoder. Each macroblock (MB) is processed sequentially through each of the encoder stages

Encoder Modifications

To ease integration into Xilinx platform based systems, a number of changes were made to the encoder. The number of embedded RAMs it uses was reduced, and a more efficient method of transferring data to and from the encoder was introduced.

The encoder presented in [108] used 22 embedded RAMs for buffering the macroblock (MB) data between the computational modules, for internal storage within the five functional modules, and for buffering the reconstructed image data used by the full pixel motion search module. This represents a significant proportion of the embedded RAMs present in mid-range Spartan-3 devices and, if not reduced, would have prevented the encoder from being targeted at these FPGAs, given that in a typical Xilinx platform based design the MicroBlaze processor also uses a number of block RAMs.

Two methods were used to reduce embedded RAM usage. Where possible, the number of embedded RAMs was reduced by storing the macroblock data in a more compact fashion. For instance, the search memory is required to store 12 macroblocks at any instant, the data being held, from a schematic point of view, in 4 buffers each storing 3 macroblocks. The previous implementation mapped each buffer directly to an embedded RAM. This, while simple, wastes more than half of each embedded RAM's storage capacity of 2048 bytes. To reduce the number of embedded RAMs used, the 4 buffers were re-mapped onto 3 embedded RAMs, the minimum number required to store 12 macroblocks of image data. The second method was to utilise the distributed RAM feature available in Xilinx FPGAs, mapping some of the smaller memories required by the individual functional modules to distributed RAM.

The previous encoder stored all frames held in external memory (input frame, output frame and reconstructed frame) in raster scan format. The encoder, however, operates on macroblock units of data. Thus, when loading and saving macroblocks, the memory addresses the encoder accessed were not in a straightforward sequence. This is not an issue if static memory such as SRAM is used, as there is no performance penalty for addressing SRAM in a non-sequential fashion. However, when using dynamic memory, or when accessing memory through a bus such as the On-chip Peripheral Bus (OPB) used here, sequential addressing is desirable because the data can be transferred more efficiently using the burst mode of the memory or bus. For the output and reconstructed frames, the obvious method is to write each frame out in the macroblock ordered format shown in Figure 4.3. The appropriate data can then be easily loaded when compressing the next frame in the video sequence.
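The difference between the two layouts can be sketched by counting how many separate sequential runs (candidate bursts) one macroblock load needs. Figure 4.3 is not reproduced here, so the macroblock ordered layout below is an assumption: each 16x16 luminance macroblock is stored contiguously, with macroblocks following one another in raster order. The function names and the one-byte-per-sample addressing are likewise illustrative.

```python
MB = 16  # macroblock edge in pixels; one byte per luminance sample assumed


def raster_addresses(mb_x, mb_y, frame_width):
    """Byte addresses of one luminance macroblock in a raster-scan frame."""
    base = mb_y * MB * frame_width + mb_x * MB
    return [base + row * frame_width + col
            for row in range(MB) for col in range(MB)]


def mb_ordered_addresses(mb_x, mb_y, frame_width):
    """Addresses of the same macroblock when every macroblock is stored
    contiguously (assumed layout, not taken from Figure 4.3)."""
    mbs_per_row = frame_width // MB
    base = (mb_y * mbs_per_row + mb_x) * MB * MB
    return list(range(base, base + MB * MB))


def count_runs(addresses):
    """Number of maximal runs of consecutive addresses."""
    runs = 1
    for prev, cur in zip(addresses, addresses[1:]):
        if cur != prev + 1:
            runs += 1
    return runs


if __name__ == "__main__":
    print(count_runs(raster_addresses(3, 2, 704)))      # one run per pixel row
    print(count_runs(mb_ordered_addresses(3, 2, 704)))  # a single run
```

With a 704-pixel-wide raster frame, the macroblock splits into 16 runs of 16 bytes each, whereas the contiguous layout yields one 256-byte run that a single long burst can cover.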

Figure 4.3: Macroblock ordered format used for reconstructed and output images

Figure 4.4: Macroblock ordered format used for input images

With respect to the input frame, the issue is more complicated because it is coupled with the design of the camera interface. If the encoder accepts input frames in the format shown in Figure 4.3, the onus is put on the camera interface to write the input frame data to external memory in that format. Since any camera interface will receive camera data in raster scan order, its ability to write that data to memory efficiently is inhibited if it must write it out in the format shown in Figure 4.3. With the BT.656 source used, the data must also be converted from an interlaced 4:2:2 format to a non-interlaced 4:2:0 format.

To do this and still write out the data in a sequential manner, the camera interface uses two embedded RAMs to double buffer one line of image data. For the first field it captures both the chrominance and luminance data; for the second field it captures the luminance data only, thereby performing the required 4:2:2 to 4:2:0 conversion. Each image line is written out in bursts of sixteen 32-bit words of luminance and chrominance data. This forces the data of every 4 macroblocks to be intermingled, as shown in Figure 4.4. However, it allows a burst size of 16, instead of 4, to be used when writing the input frame data to external memory.
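A minimal sketch of the two ideas in this paragraph follows. It assumes one byte per luminance sample and represents each captured line as a hypothetical `(field, luma, chroma)` tuple; neither the names nor the data representation come from the camera interface itself. The first function shows why a 16-word burst spans four macroblock widths while a 4-word burst spans one; the second shows the field-based chrominance selection.

```python
BYTES_PER_WORD = 4  # 32-bit bus word
MB_WIDTH = 16       # luminance samples per macroblock row, one byte each (assumed)


def macroblocks_per_burst(burst_words):
    """Number of macroblock widths of luminance covered by one burst."""
    return burst_words * BYTES_PER_WORD // MB_WIDTH


def to_420(lines):
    """Keep luminance from every line, chrominance from first-field lines only.

    `lines` is a list of (field, luma, chroma) tuples, where field 0 is the
    first field and field 1 the second (hypothetical representation).
    """
    luma = [y for _, y, _ in lines]
    chroma = [c for field, _, c in lines if field == 0]
    return luma, chroma


if __name__ == "__main__":
    print(macroblocks_per_burst(16), macroblocks_per_burst(4))
    luma, chroma = to_420([(0, "Y0", "C0"), (0, "Y2", "C2"),
                           (1, "Y1", "C1"), (1, "Y3", "C3")])
    print(len(luma), len(chroma))  # chrominance has half as many lines
```

Dropping the second field's chrominance halves the vertical chrominance resolution, which is the difference between 4:2:2 and 4:2:0 sampling; no filtering is applied in this sketch.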

Encoder Interface

The camera interface and the encoder have internal registers that must be set for every frame captured or encoded. To facilitate this within the Xilinx platform systems being targeted, two OPB slave interfaces, implemented using the Xilinx provided OPB-IPIF module [109], were used. As already mentioned, OPB master interfaces were used to provide the encoder and camera with access to external memory. The external memory can be accessed through any of the OPB external memory interfaces provided by Xilinx. This allows the encoder and the camera interface to be flexible with regard to the type of external memory used with them.

In document Implementing video compression algorithms on reconfigurable devices (Page 104-109)