Video Compression: Algorithms and Architectures
2.5 Mode Decision
H.264 offers more modes for compressing a macroblock than previous standards. If the choice of spatial prediction mode is considered part of the mode decision process, there are 592 different mode combinations to choose from for I-slices (assuming 8x8 intra prediction is not supported). Additional modes must be considered when P-slices are used; the candidate macroblock modes are then INTRA 16x16, INTRA 4x4, INTER 16x16, INTER 16x8, INTER 8x16 and INTER 8x8. For the INTER 8x8 mode, 4 modes must be considered for each 8x8 sub-block (INTER 8x8, INTER 8x4, INTER 4x8 and INTER 4x4). This compares to the 3 modes available in MPEG-2 and the 4 available in H.263/MPEG-4 [8].
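The figure of 592 can be reproduced with a short calculation. The breakdown used here is an assumption, since the text does not spell it out: 9 INTRA 4x4 modes for each of the 16 luma blocks, 4 INTRA 16x16 modes, and every luma candidate evaluated once for each of 4 chroma prediction modes. The constant and function names are illustrative only.

```python
# Assumed breakdown of the I-slice mode combinations (not stated in the text).
INTRA_4X4_MODES = 9          # prediction modes per 4x4 luma block
BLOCKS_PER_MACROBLOCK = 16   # 4x4 luma blocks in a 16x16 macroblock
INTRA_16X16_MODES = 4        # prediction modes for the whole macroblock
CHROMA_MODES = 4             # chroma prediction modes


def intra_mode_combinations():
    """Count luma candidates, then pair each with every chroma mode."""
    luma_candidates = INTRA_4X4_MODES * BLOCKS_PER_MACROBLOCK + INTRA_16X16_MODES
    return luma_candidates * CHROMA_MODES


print(intra_mode_combinations())  # 592
```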
The rate-distortion optimised (RDO) mode decision algorithm implemented within the H.264 reference software is computationally expensive. The distortion between the encoded and unencoded macroblocks, and the rate required to achieve that distortion, must be measured for each mode being considered. This requires each macroblock to be encoded multiple times. Because of this complexity, the RDO mode decision algorithm has rarely been used in H.264 hardware implementations.
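The cost of RDO mode decision is easier to see in a minimal sketch. It assumes the usual Lagrangian cost J = D + λR, which this section does not state explicitly. The `encode` callback, the mode names and the numbers in the table are hypothetical stand-ins for a full encode and decode pass, not part of any real encoder API.

```python
def rd_cost(distortion, rate_bits, lagrange_multiplier):
    """Lagrangian rate-distortion cost J = D + lambda * R (assumed form)."""
    return distortion + lagrange_multiplier * rate_bits


def rdo_mode_decision(macroblock, modes, encode, lagrange_multiplier):
    """Encode the macroblock once per candidate mode and keep the cheapest."""
    best_mode, best_cost = None, float("inf")
    for mode in modes:
        # Hypothetical callback: fully encodes the macroblock in this mode
        # and returns the measured (distortion, rate in bits).
        distortion, rate_bits = encode(macroblock, mode)
        cost = rd_cost(distortion, rate_bits, lagrange_multiplier)
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost


# Made-up measurements standing in for real encoder output.
FAKE_RESULTS = {
    "INTRA_16x16": (400, 60),
    "INTER_16x16": (250, 90),
    "INTER_8x8": (200, 180),
}


def fake_encode(macroblock, mode):
    return FAKE_RESULTS[mode]


print(rdo_mode_decision(None, list(FAKE_RESULTS), fake_encode, 2.0))
```

The loop body is the expensive part: each iteration stands for a complete encode of the macroblock, which is why the number of candidate modes drives the complexity.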
Instead, less complex mode decision algorithms have been used. The simplest is to use the cost measures already calculated in the intra prediction and motion estimation stages to determine which encoding mode to use. Compared to the RDO algorithm, this offers lower compression performance but is substantially less complex.
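A minimal sketch of this simplest approach follows. It assumes that the intra prediction and motion estimation stages each report one cost per candidate mode on a comparable scale. The function name, mode labels and cost values are illustrative only.

```python
def low_complexity_mode_decision(intra_costs, inter_costs):
    """Pick the mode with the smallest cost already produced upstream.

    No macroblock is re-encoded; the costs are simply compared.
    """
    candidates = dict(intra_costs)
    candidates.update(inter_costs)
    return min(candidates, key=candidates.get)


# Made-up costs, as might be reported by the two prediction stages.
intra_costs = {"INTRA_16x16": 1450, "INTRA_4x4": 1320}
inter_costs = {"INTER_16x16": 980, "INTER_16x8": 1010,
               "INTER_8x16": 1005, "INTER_8x8": 1100}

print(low_complexity_mode_decision(intra_costs, inter_costs))  # INTER_16x16
```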
Efforts have been made to develop mode decision algorithms which reduce complexity further. In general, the method used is to make the mode decision before performing all the calculations required by the motion estimation and intra prediction processes. In [72], it is proposed to determine the mode used for some macroblocks partly from the modes chosen for spatially adjacent macroblocks. This is based on the assumption that the modes chosen for adjacent macroblocks are correlated. While it reduces complexity, the approach in [72] still requires either the 4x4 intra prediction process, the 16x16 intra prediction process or the motion estimation process for a specific block size to be performed for every macroblock.
Another proposal, in [73], determines which mode class, inter or intra, to search exhaustively. To decide between the inter and intra mode classes, the work in [73] uses features representative of both the spatial and temporal redundancy. The spatial feature used is the minimum SATD of a subset of the 4x4 intra prediction modes. The temporal features used are the SATD and motion vector length of the best motion vector candidate, as determined by the vector prediction algorithm proposed in [39]. Using these features, a Bayesian cost-based decision is made on which mode class to use.
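This section does not define the SATD (sum of absolute transformed differences). A common definition applies a 4x4 Hadamard transform to the difference between the original and predicted blocks and sums the absolute values of the result. The sketch below follows that definition under stated assumptions: it applies no normalisation (some encoders halve the sum), and it is not taken from [73]. All names are illustrative.

```python
# 4x4 Hadamard matrix; entries are +1/-1 only, so hardware needs no multipliers.
HADAMARD_4 = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def satd_4x4(original, predicted):
    """Sum of absolute Hadamard-transformed differences (unnormalised)."""
    diff = [[original[i][j] - predicted[i][j] for j in range(4)]
            for i in range(4)]
    transformed = matmul(matmul(HADAMARD_4, diff), transpose(HADAMARD_4))
    return sum(abs(value) for row in transformed for value in row)


flat = [[10] * 4 for _ in range(4)]
offset = [[9] * 4 for _ in range(4)]
print(satd_4x4(flat, offset))  # 16: a constant error lands entirely in the DC term
```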
In [74], an algorithm is developed to predict when the SKIP encoding mode offers the best compression performance. When the algorithm predicts that the SKIP mode is best, neither the motion estimation nor the intra prediction process needs to be performed. The algorithm developed in [74] uses a model to predict the distortion and rate for a macroblock when it is encoded in the normal fashion (i.e. by performing both motion estimation and intra prediction). The models are based on the following assumptions:
• The best mode for a macroblock and its associated distortion will be the same as that of the co-located macroblock in the previous frame.
• The required rate for the best mode will be half that of the co-located macroblock in the previous frame.
Using these assumptions, the cost for the best mode can be determined and compared against the cost for the SKIP mode. Results are given in [74] showing that the proposed algorithm reduces encoding time in a software encoder by between 30% and 70%.
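The two assumptions above can be turned into a small decision rule. The sketch below is an illustration only, not the algorithm of [74]. It assumes a Lagrangian cost of the form D + λR and assumes that SKIP is chosen whenever its cost does not exceed the predicted cost of the best mode. The function and parameter names are hypothetical.

```python
def predict_best_mode_cost(prev_distortion, prev_rate_bits, lagrange_multiplier):
    """Predict the cost of normal encoding from the co-located macroblock.

    Assumption 1: distortion equals that of the co-located macroblock.
    Assumption 2: rate is half that of the co-located macroblock.
    """
    predicted_rate = prev_rate_bits / 2
    return prev_distortion + lagrange_multiplier * predicted_rate


def should_skip(prev_distortion, prev_rate_bits, skip_distortion,
                lagrange_multiplier, skip_rate_bits=0):
    """Return True when the SKIP cost is no worse than the predicted cost."""
    predicted_cost = predict_best_mode_cost(
        prev_distortion, prev_rate_bits, lagrange_multiplier)
    skip_cost = skip_distortion + lagrange_multiplier * skip_rate_bits
    return skip_cost <= predicted_cost


# Made-up numbers: predicted cost is 100 + 1.0 * 20 = 120.
print(should_skip(100, 40, skip_distortion=110, lagrange_multiplier=1.0))  # True
print(should_skip(100, 40, skip_distortion=130, lagrange_multiplier=1.0))  # False
```

When `should_skip` returns True, motion estimation and intra prediction can be bypassed for that macroblock, which is where the reported reduction in encoding time comes from.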
Only the work in [23] has used a mode decision algorithm which allows some of the required motion estimation and intra prediction calculations to be skipped in a pipelined encoder implementation. In general, such algorithms offer the potential to reduce the power used in a pipelined encoder. The resource savings they offer are limited because the encoder must still be able to perform the full motion estimation and intra prediction operations when required. However, as discussed in section 2.4, a fast mode decision algorithm which predicts when either the 8x8 or 4x4 intra prediction operations need to be performed can reduce the number of transform operations required. Consequently, such algorithms have the potential to reduce the resources used by the transform component.
2.6 Transform
The transform stage within DCT/DPCM based compression de-correlates the video image data, facilitating compression. In most standards apart from H.264, the Discrete Cosine Transform (DCT) is used for this because it has been shown to be a good approximation of the optimal Karhunen-Loève Transform (KLT) for natural video images. The transform is also used to de-correlate the residual images formed as a result of motion estimation/compensation. In [75] it was shown that the KLT for a motion compensated difference image is identical to the KLT for the original video image, and that the DCT remains a good approximation of the KLT for motion compensated difference images.
Within H.264, a 4x4 integer transform, derived from the DCT, is used. This is in contrast to previous standards, which have used an 8x8 DCT. An integer transform is used to remove the possibility of the encoder and decoder reference images differing due to rounding errors in the encoder and decoder DCT implementations. A 4x4 transform is justified because the spatial and temporal prediction which occurs before the transform stage in H.264 largely removes the correlation between neighbouring 4x4 transform blocks [76]. A new revision of the H.264 standard does provide support for an 8x8 integer transform [1]. The integer transform used in H.264 has comparable performance to the normal DCT [20]. In addition, it is simpler to implement because it does not require any multiplication operations.
As discussed in section 2.4, the performance requirements placed on the transform and inverse transform implementation depend on the mode decision algorithm and on the intra prediction architecture used. Within a pipelined encoder, the transform component’s performance may not be critical. In the encoder described in [23], for instance, the required vertical and horizontal transforms are performed sequentially, with one operation occurring per clock cycle. This gives a throughput of less than one pixel per clock cycle. The impact on overall encoder performance is masked, however, by the performance of other parts of the encoder implementation. This encoder uses a fast mode decision algorithm. As a result, it only needs to perform one complete transform operation per macroblock.
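The absence of multiplications can be seen in a sketch of the one-dimensional forward core transform, written as a butterfly that uses only additions, subtractions and one-bit shifts. The matrix `CORE_MATRIX` is the well-known H.264 4x4 forward core transform matrix and is used here only to check the butterfly. The scaling that the standard folds into quantisation is omitted, and the function names are illustrative.

```python
# Forward core transform matrix of the H.264 4x4 integer transform.
CORE_MATRIX = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def forward_1d_butterfly(x):
    """1-D forward core transform using only adds, subtracts and shifts."""
    s0 = x[0] + x[3]
    s1 = x[1] + x[2]
    d0 = x[0] - x[3]
    d1 = x[1] - x[2]
    return [
        s0 + s1,          # row [1,  1,  1,  1]
        (d0 << 1) + d1,   # row [2,  1, -1, -2]
        s0 - s1,          # row [1, -1, -1,  1]
        d0 - (d1 << 1),   # row [1, -2,  2, -1]
    ]


def forward_1d_matrix(x):
    """Reference version using explicit multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in CORE_MATRIX]


sample = [1, 2, 3, 4]
print(forward_1d_butterfly(sample))  # [10, -7, 0, -1]
print(forward_1d_butterfly(sample) == forward_1d_matrix(sample))  # True
```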
If the encoder is specifically targeted at higher frame rates and resolutions, a higher performance transform architecture may be needed. In [14], a distributed architecture is used to implement the transform operations. Each processing element calculates the transformed value for one pixel in a 4x4 block. This is an unusual design which requires the output values to be multiplexed onto a data bus prior to quantisation. It is more common to split the transform operation into its vertical and horizontal components, as in [23], but to perform multiple transform operations per clock cycle [77] [78]. Throughput is further increased in [78] by pipelining the vertical and horizontal transform operations. Split architectures require a transpose memory to sequence the data appropriately prior to the second transform operation. In [79], the vertical and horizontal operations are combined to realise an architecture which does not require a transpose memory.
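The data flow of a split (row-column) architecture can be sketched in software. The example below is a behavioural illustration only, not a model of any of the cited designs. It applies a 1-D pass to each row, transposes the intermediate block (the role played by the transpose memory), applies the same 1-D pass again, and transposes back. It assumes the H.264 4x4 forward core transform without the scaling stage, and all names are illustrative.

```python
def transform_1d(x):
    """1-D H.264 forward core transform (adds, subtracts and shifts only)."""
    s0, s1 = x[0] + x[3], x[1] + x[2]
    d0, d1 = x[0] - x[3], x[1] - x[2]
    return [s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1)]


def transpose(block):
    """Software stand-in for the transpose memory."""
    return [list(col) for col in zip(*block)]


def forward_4x4_split(block):
    """2-D transform as two 1-D passes separated by a transposition."""
    first_pass = [transform_1d(row) for row in block]      # horizontal pass
    reordered = transpose(first_pass)                      # transpose memory
    second_pass = [transform_1d(row) for row in reordered]  # vertical pass
    return transpose(second_pass)


flat_block = [[1] * 4 for _ in range(4)]
for row in forward_4x4_split(flat_block):
    print(row)  # only the top-left (DC) coefficient, 16, is non-zero
```

A combined architecture such as the one described for [79] would compute the same result without the intermediate `transpose` step. How it does so is not described here, so it is not sketched.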