Video Compression: Algorithms and Architectures
2.2 Video Compression System Architectures
The architectures proposed for FPGA video encoding include dedicated hardware compression pipelines [11] [23] [14], mixed hardware/software systems in which specific tasks are performed in hardware accelerators [24] [25] while the less computationally intensive tasks are performed in software, and pure software encoders [12]. Some degree of hardware acceleration is generally required in an FPGA for the encoder to operate at a sufficient frame rate. An FPGA-based MPEG-4 software encoder was proposed in [12], in which multiple Nios-2 processors were used to encode separate slices of each video frame. Even with 3 Nios-2 processors, the compression system proposed in [12] was only capable of encoding QCIF frames at a rate of 6 per second.
One of the justifications for including a software element in a compression system is that it provides a degree of flexibility [12] [26]. Given the flexibility inherent in FPGAs, this justification carries less weight than it would if the compression system were implemented using an ASIC technology. Therefore, the focus in this thesis is primarily on hardware encoder implementations. It is assumed that any processor present within the system is used solely for encoder control and scheduling operations.
A macroblock pipeline is the predominant architecture used in hardware implementations [11] [23] [14] [10] [27]. This is unsurprising given the block-based, sequential nature of the encoding process. Pipelining allows the various encoder modules to operate simultaneously, which improves an encoder's resource utilisation and increases the throughput it can achieve. However, the performance improvement comes at a cost: a significant number of embedded RAMs are required to support pipeline operation. In [15] a generic video processing architecture is proposed. The analysis in [15] focuses on the design of the communications architecture between the processing elements within a pipeline, providing a methodology which can be used to ensure that the processing pipeline as a whole operates at the required rate.
While the general architecture is the same, there are differences between the various macroblock pipeline implementations which have been proposed.
The number of pipeline stages varies. H.264 implementations generally have a larger number of pipeline stages than implementations supporting previous standards [23] [27]. This can be attributed to the greater complexity of the H.264 encoding process. For example, the motion estimation process in H.264 supports both variable block sizes and quarter pixel prediction. As a result, there is a greater benefit from splitting the motion estimation process across a number of pipeline stages than there would be when implementing standards with a less complex motion estimation process.
The method used to control the macroblock pipeline also varies between implementations. The simplest method, used in [23] [28], is to use a fixed number of clock cycles per pipeline stage, with each stage in the macroblock pipeline advancing to the next macroblock after a set number of clock cycles. In this case the maximum frame rate an encoder can operate at is given by,
$$F_r = \frac{F}{w_{mb}\,h_{mb}\,C_p + N_p} \tag{2.3}$$
where $w_{mb}$ and $h_{mb}$ are the frame width and frame height in macroblocks, $F_r$ is the maximum frame rate, $F$ is the clock frequency of the encoder, $N_p$ is the number of pipeline stages and $C_p$ is the number of clock cycles required before the pipeline can advance.
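As a worked illustration of Equation 2.3, the following Python sketch evaluates the maximum frame rate of a fixed-cycle pipeline and inverts the equation to give the minimum clock frequency for a target frame rate. The function names are hypothetical, the 16×16 macroblock size is assumed, and the numeric values in the usage note are chosen purely for illustration; none are taken from the cited implementations.

```python
MB_SIZE = 16  # assumed macroblock size in pixels (16x16, as in MPEG-4 and H.264)


def macroblocks_per_frame(width_px, height_px):
    """Return (w_mb, h_mb), rounding partial macroblocks up."""
    w_mb = -(-width_px // MB_SIZE)   # ceiling division
    h_mb = -(-height_px // MB_SIZE)
    return w_mb, h_mb


def cycles_per_frame(width_px, height_px, cycles_per_stage, num_stages):
    """Denominator of Equation 2.3: w_mb * h_mb * C_p + N_p."""
    w_mb, h_mb = macroblocks_per_frame(width_px, height_px)
    return w_mb * h_mb * cycles_per_stage + num_stages


def max_frame_rate(clock_hz, width_px, height_px, cycles_per_stage, num_stages):
    """Equation 2.3: F_r = F / (w_mb * h_mb * C_p + N_p)."""
    return clock_hz / cycles_per_frame(
        width_px, height_px, cycles_per_stage, num_stages)


def min_clock_frequency(frame_rate, width_px, height_px,
                        cycles_per_stage, num_stages):
    """Equation 2.3 rearranged for F at a given frame rate."""
    return frame_rate * cycles_per_frame(
        width_px, height_px, cycles_per_stage, num_stages)
```

For example, with a hypothetical 100 MHz clock, CIF resolution (352×288, i.e. 22 × 18 = 396 macroblocks), 1000 cycles per stage and 4 pipeline stages, `max_frame_rate(100e6, 352, 288, 1000, 4)` evaluates to roughly 252 frames per second.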
More flexible control schemes have been used [11] [14]. In [11] each individual stage in the pipeline advances to the next macroblock as soon as it completes operations on the current macroblock. This is conditional on the next macroblock being available for processing and the succeeding pipeline stage being able to accept data. In [14], the macroblock pipeline advances as soon as all stages have completed operations for their current macroblock. The benefit of a more flexible pipeline control scheme is that each macroblock can use a different number of clock cycles at each encoder stage. This can potentially allow a reduction in the clock frequency required to support a particular frame rate and resolution.
Any actual benefit is, however, dependent on the algorithms and architectures used within the encoder, particularly those used for motion estimation and mode decision.
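The difference between the three control schemes can be sketched with a simple cycle-count model. This is an illustrative simplification and not the implementation of [11], [14], [23] or [28]: `cycles[m][s]` is a hypothetical table giving the clock cycles stage `s` needs for macroblock `m`, the fixed scheme is assumed to provision every slot for the worst case in the table, pipeline fill and drain are counted as whole slots (so the fixed total does not reproduce the fill term of Equation 2.3 exactly), and the handshake scheme assumes double-buffered (ping-pong) memories between adjacent stages.

```python
def fixed_cycles(cycles):
    """Fixed control: every slot lasts as long as the worst-case entry."""
    num_mb, num_stages = len(cycles), len(cycles[0])
    worst = max(max(row) for row in cycles)
    return (num_mb + num_stages - 1) * worst


def lockstep_cycles(cycles):
    """All stages advance together once the slowest active stage is done."""
    num_mb, num_stages = len(cycles), len(cycles[0])
    total = 0
    for slot in range(num_mb + num_stages - 1):
        # In this slot, stage s works on macroblock (slot - s), if it exists.
        active = [cycles[slot - s][s]
                  for s in range(num_stages) if 0 <= slot - s < num_mb]
        total += max(active)
    return total


def handshake_cycles(cycles):
    """Each stage advances on its own, subject to input and output readiness."""
    num_mb, num_stages = len(cycles), len(cycles[0])
    finish = [[0] * num_stages for _ in range(num_mb)]
    for m in range(num_mb):
        for s in range(num_stages):
            ready = 0
            if s > 0:                         # input produced by previous stage
                ready = max(ready, finish[m][s - 1])
            if m > 0:                         # this stage is free again
                ready = max(ready, finish[m - 1][s])
            if m > 1 and s + 1 < num_stages:  # ping-pong output buffer released
                ready = max(ready, finish[m - 2][s + 1])
            finish[m][s] = ready + cycles[m][s]
    return finish[-1][-1]


# Hypothetical workload: 3 macroblocks, 3 stages.
example = [[1, 1, 5],
           [5, 1, 1],
           [1, 1, 1]]
```

For this hypothetical table the three functions return 25, 13 and 9 cycles respectively. Under the stated assumptions the handshake total never exceeds the lockstep total, which in turn never exceeds the fixed total, but the size of the gap depends entirely on how the per-macroblock cycle counts vary.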
Apart from pipelining, other methods can be used to parallelise the encoding process. As previously mentioned, each frame can be split into several slices, as shown in Figure 2.5. This allows a separate encoder instance to be used to encode each slice. With this method of parallelisation, any redundancy that exists between adjacent macroblocks in separate slices cannot be exploited. As H.264 exploits this redundancy, this method of parallelisation reduces the video quality that is achievable for a given bit rate [29]. Parallelisation can also take place at the frame level. In this case, each separate encoder instance is used to encode a subset of the pictures within the video sequence [30]. This method is only practical if B frames, which use a forward and a backward reference frame, are used within the encoded video sequence. As such, this method of parallelisation is unsuitable for low latency applications.
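A minimal sketch of slice-based partitioning is given below. It assumes that each slice is a horizontal band of whole macroblock rows and that rows are shared as evenly as possible between encoder instances; both the layout and the function name are assumptions made for illustration, not details taken from [12] or [29].

```python
def split_into_slices(height_mb, num_slices):
    """Assign macroblock rows to slices as evenly as possible.

    Returns a list of (first_row, end_row) pairs, with end_row exclusive.
    """
    if not 1 <= num_slices <= height_mb:
        raise ValueError("need between 1 and height_mb slices")
    base, extra = divmod(height_mb, num_slices)
    slices = []
    start = 0
    for i in range(num_slices):
        rows = base + (1 if i < extra else 0)  # first `extra` slices get one more row
        slices.append((start, start + rows))
        start += rows
    return slices
```

For a QCIF frame (9 macroblock rows) split across three hypothetical encoder instances, `split_into_slices(9, 3)` gives `[(0, 3), (3, 6), (6, 9)]`. Each internal slice boundary is a row of macroblocks whose vertical neighbours can no longer be used for prediction, which is the source of the quality loss described above.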
Frame and slice based parallelisation are independent of the architecture used to implement each encoder instance. Using an efficient architecture such as a macroblock pipeline reduces the need to use these other methods. The benefit of using slice or frame based parallelisation is that it allows the total hardware resources to scale easily with the required video frame rate and resolution.
Figure 2.5: Splitting a video frame into several slices to enable parallelisation