Figure 1.18: Coding of predicted pictures. (Diagram labels: Past Frame; summing node Σ with + and − inputs; Prediction Errors.)
Figure 1.19: Coding of bidirectional predicted pictures. (Diagram labels: Current Frame; Past Frame; Future Frame; summing node Σ with + and − inputs; Prediction Errors.)
$\hat{m}_b = \mathrm{round}(\alpha_1 m_1 + \alpha_2 m_2)$, where $\alpha_1$ and $\alpha_2$ are defined below.
(a) $\alpha_1 = 0.5$ and $\alpha_2 = 0.5$ if both matches are satisfactory.
(b) $\alpha_1 = 1$ and $\alpha_2 = 0$ if only the first match is satisfactory.
(c) $\alpha_1 = 0$ and $\alpha_2 = 1$ if only the second match is satisfactory.
(d) $\alpha_1 = 0$ and $\alpha_2 = 0$ if neither match is satisfactory.
Finally, the error block $e_b$ is computed by taking the difference of $m_b$ and $\hat{m}_b$. These error blocks are coded in the same way as the blocks of an I-frame.
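The weighting rules (a) to (d) can be illustrated with a minimal sketch. The function names, the list-of-rows block layout, and the boolean flags `ok1` and `ok2` are assumptions made for illustration; how a match is judged "satisfactory" is not specified here and is left outside the sketch.

```python
def predict_bidirectional(m1, m2, ok1, ok2):
    """Blend two candidate blocks m1 and m2 (lists of rows) into a prediction.

    ok1 / ok2 state whether each match is satisfactory (hypothetical flags;
    the decision rule itself is not part of this sketch).
    """
    if ok1 and ok2:
        a1, a2 = 0.5, 0.5
    elif ok1:
        a1, a2 = 1.0, 0.0
    elif ok2:
        a1, a2 = 0.0, 1.0
    else:
        a1, a2 = 0.0, 0.0
    # int(x + 0.5) rounds half up for non-negative pixel values; Python's
    # built-in round() would round halves to the nearest even integer.
    return [[int(a1 * x + a2 * y + 0.5) for x, y in zip(r1, r2)]
            for r1, r2 in zip(m1, m2)]


def error_block(mb, pred):
    """Element-wise difference between the current block and its prediction."""
    return [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(mb, pred)]


mb = [[12, 20], [30, 41]]   # current block
m1 = [[10, 20], [30, 40]]   # best match in the past frame
m2 = [[13, 22], [28, 40]]   # best match in the future frame

pred = predict_bidirectional(m1, m2, True, True)
print(pred)                   # [[12, 21], [29, 40]]
print(error_block(mb, pred))  # [[0, -1], [1, 1]]
```

When neither match is satisfactory, the prediction is all zeros and the error block equals the block itself, so it is effectively coded like an intra block.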
Figure 1.20: Motion estimation for MPEG-2 video encoder. (Diagram labels: Reference Frame; Current Frame; Macroblock; Search Area; Motion Vector.)
1.5.1.4 Motion Estimation
In video compression, for motion-compensated prediction, pixels within the current frame are modeled as translations of those within a reference frame.
In forward prediction, each macroblock (MB) is predicted from the previous frame, assuming that all the pixels within that MB undergo the same amount of translational motion. This motion information is represented by a two-dimensional displacement vector, or motion vector. Because of this block-based representation, block-matching techniques are employed for motion estimation (see Figure 1.20). As shown in the figure, both the current frame and the reference frame are divided into blocks. Each block in the current frame is then matched at all locations within the search window of the previous frame. In this block-based matching technique, a cost function measuring the mismatch between a current MB and the reference MB is minimized to provide the motion vector. Several cost measures are used for this purpose, such as the mean of absolute differences (MAD), the sum of absolute differences (SAD), and the mean square error (MSE). The most widely used metric is the SAD, defined by
$$\mathrm{SAD}_{i,j}(u, v) = \sum_{p=0}^{N-1} \sum_{q=0}^{N-1} \left| c_{i,j}(p, q) - r_{i-u,j-v}(p, q) \right|, \tag{1.14}$$
where $\mathrm{SAD}_{i,j}(u, v)$ represents the SAD between the $(i, j)$th block and the block at the $(u, v)$th location in the search window $W_{i,j}$ of the $(i, j)$th block.
Here, $c_{i,j}(p, q)$ represents the $(p, q)$th pixel of the $N \times N$ $(i, j)$th MB $C_{i,j}$ from the current picture, and $r_{i-u,j-v}(p, q)$ represents the $(p, q)$th pixel of an $N \times N$ MB from the reference picture displaced by the vector $(u, v)$ within the search range of $C_{i,j}$. To find the MB producing the minimum mismatch error, the SAD must be computed at several locations within the search window. The simplest but most computationally intensive search method, known as the full search or exhaustive search method, evaluates the SAD at every possible pixel location in the search area. Using full search, the motion vector is computed as follows:
$$MV_{i,j} = \{(u', v') \mid \mathrm{SAD}_{i,j}(u', v') \le \mathrm{SAD}_{i,j}(u, v),\ \forall (u, v) \in W_{i,j}\}, \tag{1.15}$$
where $MV_{i,j}$ expresses the motion vector of the current block $C_{i,j}$ with the minimum SAD among all search positions. In MPEG-2, motion vectors are computed with either full-pixel or half-pixel precision. In the former case, MBs are defined from the locations of the reference frame in its original resolution. For half-pixel motion vectors, the reference image is first bilinearly interpolated to double its resolution in both directions, and the motion vectors are then computed from the locations of the interpolated reference image. For a downsampled chrominance component, the same motion vector is reused for prediction, scaled down by a factor of two to match the reduced resolution.
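The full search of Eqs. (1.14) and (1.15) can be sketched as follows. This is an illustrative sketch at full-pixel precision, not an encoder implementation. It assumes that frames are lists of pixel rows, that the block is addressed by the pixel coordinates `(top, left)` of its top-left corner, and that the search window is a square of `±search_range` pixels; all function and parameter names are hypothetical.

```python
def sad(cur, ref, top, left, n):
    """SAD between the n x n block `cur` and the block of `ref` whose
    top-left corner is at (top, left)."""
    return sum(abs(cur[p][q] - ref[top + p][left + q])
               for p in range(n) for q in range(n))


def full_search(cur_frame, ref_frame, top, left, n, search_range):
    """Return ((u, v), min_sad) for the n x n block at (top, left).

    Every displacement with |u|, |v| <= search_range is tested; candidates
    that would fall outside the reference frame are skipped.
    """
    cur = [row[left:left + n] for row in cur_frame[top:top + n]]
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for u in range(-search_range, search_range + 1):
        for v in range(-search_range, search_range + 1):
            r, c = top - u, left - v  # same sign convention as r_{i-u,j-v}
            if r < 0 or c < 0 or r + n > height or c + n > width:
                continue
            cost = sad(cur, ref_frame, r, c, n)
            if best is None or cost < best[1]:
                best = ((u, v), cost)
    return best


# Synthetic 8 x 8 reference frame, and a current frame whose content is the
# reference shifted down by 1 row and right by 2 columns.
ref = [[10 * r + 3 * c for c in range(8)] for r in range(8)]
cur = [[ref[r - 1][c - 2] if r >= 1 and c >= 2 else 0 for c in range(8)]
       for r in range(8)]

print(full_search(cur, ref, 4, 4, 2, 2))  # ((1, 2), 0)
```

Equation (1.15) defines a set of minimizers. The sketch keeps the first one found in scan order, which is one possible tie-breaking choice.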
1.5.1.5 Handling Interlaced Video
The MPEG-2 compression standard also handles interlaced video, which is common in television standards. In this case, a frame is partitioned into two fields (odd and even fields). Each field is encoded separately, and motion estimation of an MB of a field is optionally performed from the same type of field of the reference frame or from the other field of the current frame, if that field is encoded before the present one.
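The partition of a frame into two fields can be sketched as below, assuming a frame is a list of rows with an even number of rows and that the fields consist of alternate rows. The names `top_field` and `bottom_field`, and counting rows from zero, are conventions chosen for this illustration.

```python
def split_fields(frame):
    """Split a frame (list of rows) into two fields of alternate rows."""
    top_field = frame[0::2]     # rows 0, 2, 4, ...
    bottom_field = frame[1::2]  # rows 1, 3, 5, ...
    return top_field, bottom_field


def merge_fields(top_field, bottom_field):
    """Interleave two fields of equal height back into a single frame."""
    frame = []
    for top_row, bottom_row in zip(top_field, bottom_field):
        frame.append(top_row)
        frame.append(bottom_row)
    return frame


frame = [[r] * 4 for r in range(6)]   # 6 rows, each filled with its row index
top, bottom = split_fields(frame)
print([row[0] for row in top])     # [0, 2, 4]
print([row[0] for row in bottom])  # [1, 3, 5]
```

Each field can then be passed to the same block-based motion estimation as a progressive frame of half the height.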
1.5.2 MPEG-4
The MPEG-4 [135] video compression technique is distinguished by its support for object-based encoding of a video. The project was initiated by MPEG in July 1993 in view of its application in representing and delivering multimedia content. The standard was finally adopted in February 1999. In fact, the standard encompasses the representation of not only video but also other media such as synthetic scenes, audio, text, and graphics. All these entities are encapsulated in an audiovisual object (AVO). A set of AVOs represents a multimedia content, and they are composed together to form the final compound AVOs or scenes. In our discussion, we restrict ourselves to the video compression part of the MPEG-4 standard. An overview of the video compression technique is shown in Figure 1.21.
Figure 1.21: An overview of an MPEG-4 video encoder.
1.5.2.1 Video Object Layer
As shown in Figure 1.21, an input video is modeled as a set of sequences of video objects. Segmentation algorithms have to be applied to extract objects from individual frames; each extracted object is known as a video object plane (VOP), and the sequence of these objects over the whole video defines a video object (VO) or a video object layer (VOL). For example, in Figure 1.21, there are three such VOLs. We may also consider the background as another layer. However, in MPEG-4, the background of a video may be efficiently coded as sprites, which are discussed later. Individual VOs are coded independently, each with information related to shape, motion, and texture. Each of these codings is briefly discussed here.
1. Shape encoding: Every VOP in a frame can have an arbitrary shape. However, the bounding rectangle of this shape in the frame is specified. This rectangle is adjusted so that its dimensions in both the horizontal and vertical directions become integral multiples of 16, allowing it to be encoded as a set of nonoverlapping MBs of size 16 × 16. The pixels within the rectangle that do not belong to the object are usually denoted by the value zero (0); the others contain either 1 (in the binary representation of the shape) or a gray value (usually denoted by α), used for blending with other VOPs during reconstruction of the video. There are three types of MBs within this rectangle. An MB may have (i) all nonzero pixels (contained within the object), (ii) all zero pixels, or (iii) both types of pixels. The third type of MB is called a boundary MB of the VOP. The binary shape information of a VOP is encoded by context-based arithmetic encoding (CAE) [131], while for the gray-scale shape representation the usual motion-compensated (MC) DCT representation is followed.
2. Motion encoding: Like MPEG-2, each VOP of a layer is one of three types, namely, I-VOP, P-VOP, and B-VOP. In this case also, motion vectors are computed in the same manner as done for MPEG-2. However, there are other options (or modes) for computing motion vectors and subsequently obtaining the motion-compensated prediction errors for an MB. These modes include the following:
(a) Four motion vectors, one for each 8 × 8 block of an MB, are computed separately, and the prediction errors are obtained using them.
(b) Three overlapping blocks of the reference frame are used to compute the prediction errors for each 8 × 8 block of the current MB. Two of these reference blocks are obtained from the closest matches of its neighboring blocks, one to its left (or right) and one to its top (or bottom). The third one is the closest match of the concerned block itself in the reference frame. For every pixel in the current block, a weighted mean of the corresponding pixels of those three blocks provides its prediction. These weights are predefined in the standard.
3. Texture encoding: In texture coding, the MBs of a VOP are encoded in the same way as in MPEG-2. For an I-VOP, 8 × 8 blocks are transformed by the DCT, and the transformed coefficients are subsequently quantized and entropy coded. For P-VOPs and B-VOPs, after motion compensation of an MB, the residual errors (for each 8 × 8 block) are encoded in the same way. However, MPEG-4 has a few additional features for encoding textures. They are briefly discussed here.
(a) Intra DC and AC prediction: In MPEG-4 there is a provision for predicting the quantized DC and AC coefficients of a block from one of its neighboring blocks (either to its left or above it) in intra VOPs. The neighboring block that has the lower gradient in the coefficient space (with respect to the top-left diagonal neighboring block) is chosen for this purpose. The same feature is also extended to the MBs coded in intra mode in inter VOPs.
(b) DCT of boundary blocks: As boundary blocks contain pixels that do not belong to a VOP, these pixels either need to be padded with suitable values to avoid abrupt transitions in the block (which may demand a greater number of bits for encoding), or they may be ignored in the computation of the transform itself. In the first strategy, all the non-VOP pixels within the block are padded with the mean pixel value of that block, and then they are subjected to low-pass average filtering to reduce the abrupt transitions near the junctions of VOP and non-VOP pixels. Alternatively, to drop these non-VOP pixels from the compressed stream, shape-adaptive DCT (SA-DCT) [131] is applied to the corresponding block. In this case, N-point DCTs are applied successively to the columns and rows of the block that contain N (N ≤ 8) VOP pixels.
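Two steps from the list above, the classification of MBs from the binary shape mask and the mean padding of a boundary block, can be sketched as follows. This is a simplified illustration: the labels "opaque", "transparent", and "boundary" and all function names are hypothetical, the padding value is taken as the mean of the VOP pixels of the block (an assumption), and the subsequent low-pass filtering step is omitted.

```python
def pad_to_multiple(size, m=16):
    """Smallest multiple of m that is >= size (bounding-rectangle adjustment)."""
    return ((size + m - 1) // m) * m


def classify_mbs(mask, mb=16):
    """Label each mb x mb block of a binary shape mask.

    mask is a list of rows of 0/1 values whose height and width are
    already multiples of mb.
    """
    labels = []
    for r in range(0, len(mask), mb):
        row_labels = []
        for c in range(0, len(mask[0]), mb):
            inside = sum(mask[r + p][c + q]
                         for p in range(mb) for q in range(mb))
            if inside == 0:
                row_labels.append("transparent")  # all zero pixels
            elif inside == mb * mb:
                row_labels.append("opaque")       # all nonzero pixels
            else:
                row_labels.append("boundary")     # both types of pixels
        labels.append(row_labels)
    return labels


def mean_pad(block, mask):
    """Replace the non-VOP pixels of a boundary block by the mean of its
    VOP pixels (the low-pass filtering step is not included)."""
    inside = [v for brow, mrow in zip(block, mask)
              for v, m in zip(brow, mrow) if m]
    if not inside:
        raise ValueError("block contains no VOP pixels")
    mean = int(sum(inside) / len(inside) + 0.5)
    return [[v if m else mean for v, m in zip(brow, mrow)]
            for brow, mrow in zip(block, mask)]


# Small example with 2 x 2 "macroblocks" to keep the data readable.
shape = [[1, 1, 0, 1, 0, 0],
         [1, 1, 0, 0, 0, 0]]
print(classify_mbs(shape, mb=2))  # [['opaque', 'boundary', 'transparent']]
print(mean_pad([[10, 20], [30, 99]], [[1, 1], [1, 0]]))  # [[10, 20], [30, 20]]
print(pad_to_multiple(37))  # 48
```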
1.5.2.2 Background Encoding
The background of a video may be treated as a separate VOL and encoded in the same way as the others. However, MPEG-4 also makes provision for encoding it in the form of a sprite. A sprite is a still image formed from a set of consecutive video frames whose pixels do not show any motion. At any instant of time, a particular rectangular zone of the sprite is covered by a frame.
Hence, once a sprite is defined for the set of frames, it is sufficient to pass the coordinate information of that zone for rendering the concerned frame.
This is similar to the panning of the camera, and the respective parameters are encoded in the video stream. A sprite acts as a background for that set of frames. If it is defined for the complete video, the sprite is static, and it is computed offline during the encoding process. However, in real time, we have to compute dynamic sprites for a group of pictures.
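Rendering the background of a frame from a sprite can be sketched as a simple crop, assuming the sprite is stored as a list of pixel rows and that only a translational offset `(top, left)` is signaled per frame. The function name and its parameters are hypothetical.

```python
def render_from_sprite(sprite, top, left, height, width):
    """Return the height x width zone of the sprite whose top-left corner
    is at (top, left)."""
    if (top < 0 or left < 0 or top + height > len(sprite)
            or left + width > len(sprite[0])):
        raise ValueError("requested zone lies outside the sprite")
    return [row[left:left + width] for row in sprite[top:top + height]]


# A 3 x 5 sprite; each frame of the group covers a 2 x 3 zone of it.
sprite = [[10 * r + c for c in range(5)] for r in range(3)]
print(render_from_sprite(sprite, 1, 2, 2, 3))  # [[12, 13, 14], [22, 23, 24]]
```

Under these assumptions, only the pair `(top, left)` needs to be transmitted for each frame once the sprite itself has been sent.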
1.5.2.3 Wavelet Encoding of Still Images
MPEG-4 allows the use of wavelets for encoding still images. Like JPEG2000, it follows a dyadic decomposition [92] of images into a multiresolution representation. However, the coefficients of the LL band (see Section 1.4.2.1 and Figure 1.11) are encoded differently from those of the other subbands. They are simply quantized and entropy coded by arithmetic coding.
Other subbands are encoded with the help of a zero tree as proposed in [128].
The symbols of the zero tree are encoded using an arithmetic coder.
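The structure of a dyadic decomposition, in which only the LL band is split again at each level, can be sketched as below. The averaging and differencing Haar filter is used purely as a stand-in, since the text does not specify the wavelet filters; the function names and the LH/HL naming convention are likewise assumptions. Image dimensions are assumed to be divisible by 2 at every level.

```python
def haar_level(img):
    """One 2-D analysis level with a Haar-like average/difference filter."""
    half = len(img[0]) // 2
    # Filter along rows: low-pass (average) and high-pass (difference).
    lo = [[(r[2 * k] + r[2 * k + 1]) / 2 for k in range(half)] for r in img]
    hi = [[(r[2 * k] - r[2 * k + 1]) / 2 for k in range(half)] for r in img]

    def along_columns(x):
        rows = len(x) // 2
        low = [[(x[2 * k][c] + x[2 * k + 1][c]) / 2 for c in range(len(x[0]))]
               for k in range(rows)]
        high = [[(x[2 * k][c] - x[2 * k + 1][c]) / 2 for c in range(len(x[0]))]
                for k in range(rows)]
        return low, high

    ll, lh = along_columns(lo)
    hl, hh = along_columns(hi)
    return ll, lh, hl, hh


def dyadic_decompose(img, levels):
    """Split the LL band repeatedly; return the final LL band and the
    detail subbands (lh, hl, hh) of each level, finest level first."""
    details = []
    ll = img
    for _ in range(levels):
        ll, lh, hl, hh = haar_level(ll)
        details.append((lh, hl, hh))
    return ll, details


print(haar_level([[1, 3], [5, 7]]))
# ([[4.0]], [[-2.0]], [[-1.0]], [[0.0]])
```

In such a layout, the final LL band would go to the separate quantization and arithmetic coding path, while the detail subbands would be passed to the zero-tree coder.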
1.5.3 H.264/AVC
The video compression standard H.264/AVC [155] (or H.264, as it will be referred to subsequently) is the outcome of the joint effort of the ITU-T VCEG and the ISO/IEC MPEG. The major objectives of this standardization effort are (i) to develop a simple and straightforward video coding design with enhanced compression performance, and (ii) to provide a network-friendly video representation addressing both conversational (such as video telephony) and non-conversational (such as storage, broadcast, or streaming) applications. In fact, H.264 has greatly outperformed other existing video standards such as MPEG-2.
H.264 specifies the video in the form of a video coding layer (VCL) and a network abstraction layer (NAL) (see Figure 1.22). The VCL represents the video content, and the NAL formats the VCL with the requisite header information appropriate for its transmission by transport layers or for its storage on a medium. The H.264 video compression algorithm is also a hybrid of interpicture prediction and transform coding of the prediction residual error (see Figure 1.23).
Like MPEG-2 and MPEG-4, the H.264 design [155] also supports the coding of video in the 4:2:0 chroma format. It handles both progressive and interlaced videos. In Figure 1.23, the processing of a single MB of a video frame is shown with the help of a block diagram. The basic coding structure of H.264 is hierarchical [151]. A video sequence consists of several pictures or frames. A single picture is partitioned into a set of slices, each of which is in turn a collection of MBs. Some of the features of H.264 and its coding structure are discussed next.
1.5.3.1 Slices and Slice Groups
A video frame may be split into one or several slices. H.264 has the provision of defining slices with flexible macroblock ordering (FMO) by utilizing the concept of a slice group, which is a set of MBs specified by a mapping from MBs to slice groups. Each slice group, in turn, consists of one or more slices such that the MBs in a slice are processed in raster-scan order.
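The idea of a slice group as a mapping from MBs to groups can be illustrated with a small sketch. The checkerboard pattern with two slice groups is a hypothetical mapping chosen for illustration, and the function names are not taken from the standard.

```python
def checkerboard_map(mb_rows, mb_cols):
    """Assign each MB position to slice group 0 or 1 in a checkerboard pattern."""
    return [[(r + c) % 2 for c in range(mb_cols)] for r in range(mb_rows)]


def mbs_in_raster_order(group_map, group):
    """Raster-scan MB addresses (row * width + column) belonging to `group`."""
    cols = len(group_map[0])
    return [r * cols + c
            for r, row in enumerate(group_map)
            for c, g in enumerate(row) if g == group]


group_map = checkerboard_map(2, 4)
print(group_map)                          # [[0, 1, 0, 1], [1, 0, 1, 0]]
print(mbs_in_raster_order(group_map, 0))  # [0, 2, 5, 7]
print(mbs_in_raster_order(group_map, 1))  # [1, 3, 4, 6]
```

Within each group, the MBs are visited in raster-scan order even though they are not spatially contiguous in the frame.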
1.5.3.2 Additional Picture Types
Besides the usual picture types such as I, P, and B, there are two additional types for encoding slices. These two new coding types are switching I (SI) and switching P (SP) pictures, which are used for switching the bitstream from one rate to another [75].
1.5.3.3 Adaptive Frame/Field-Coding Operation
For higher coding efficiency, H.264/AVC encoders may adaptively encode fields (in the interlaced mode) separately or combine them into a single frame.