Scalable Image and Video Coding - Content scalability in multiple description image and video c

The main aim of conventional image and video coders is to achieve a high compression ratio or coding gain. However, a high compression ratio is not always the only requirement of end users, especially when they have different resources in terms of bandwidth, display device and computational capability. Scalable coding has emerged as a good solution for multimedia content distribution over heterogeneous networks.

Hierarchical subband decomposition and embedded coding are the two main components of any scalable coding framework. Scalability is the property of a bitstream whereby it is arranged according to the significance of the information and can be truncated. Scalable image and video codecs therefore allow end users to truncate the scalable bitstream at any frame rate, resolution and quality to meet data rate requirements and user preferences. Any scalable coder generates an embedded bitstream with at least two layers: a base layer and one or more enhancement layers. The base layer contains the most important information, from which a minimum quality or resolution is obtained [26]. The base layer is followed by the enhancement layers, which carry additional information to enhance the quality, resolution or frame rate of the decoded image or video. The following types of scalability are useful in image and video coding.

1. Quality or SNR Scalability: In quality or SNR scalability, at least two layers (base and enhancement) of an image/video are required to decode it at two or more different quality levels. The base layer encodes the information required to decode the image/video at a basic quality. The enhancement layer increases the quality of the decoded image/video when added to the base layer. The encoder can encode several enhancement layers, which gives the decoder the option of decoding the image/video at different quality levels.

2. Spatial or Resolution Scalability: In spatial or resolution scalability, the base layer generated by the encoder provides a basic, lower spatial resolution. The enhancement layer provides the additional information that is combined with the base layer to decode the image at a higher spatial resolution.

3. Temporal or Frame Rate Scalability: In temporal scalability, different frame rates can be selected for video encoding/decoding. Fewer frames from the video sequence are used for motion estimation and prediction in the base layer. A higher frame rate is used in the enhancement layer for better perception of motion in the video.
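As a rough illustration of quality scalability, the sketch below splits non-negative integer samples into bit-planes, most significant first, and reconstructs them from a truncated set of planes. The function names `split_bitplanes` and `reconstruct` are hypothetical, and plain bit-plane truncation is used here only as a stand-in for an embedded coder; it is not the method of any particular codec discussed in this section.

```python
def split_bitplanes(values, num_planes):
    """Return bit-planes of non-negative ints, most significant plane first."""
    return [[(v >> p) & 1 for v in values]
            for p in range(num_planes - 1, -1, -1)]


def reconstruct(planes, num_planes):
    """Rebuild values from the first len(planes) most significant planes."""
    values = [0] * len(planes[0])
    for i, plane in enumerate(planes):
        shift = num_planes - 1 - i
        for j, bit in enumerate(plane):
            values[j] |= bit << shift
    return values


samples = [200, 13, 97, 255]           # illustrative 8-bit samples
planes = split_bitplanes(samples, 8)

base_only = reconstruct(planes[:2], 8)  # "base layer": top two bit-planes
refined = reconstruct(planes[:4], 8)    # base plus two "enhancement" planes
lossless = reconstruct(planes, 8)       # all planes received

print(base_only, refined, lossless)
```

Each additional plane halves the maximum reconstruction error, so truncating the plane list at any point still yields a decodable, coarser version of the samples.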

Quality and resolution scalability can be achieved in both images and video, whereas temporal scalability is possible only in video. Figure 2.2 shows the scalable video coding framework. A scalable video coding framework is divided into three main blocks [1, 30].

1. Scalable Video Encoder.

2. Scalable Video Extractor.

3. Scalable Video Decoder.

Figure 2.2: Scalable video coding framework. [Block diagram: the input video enters the Scalable Video Encoder, which outputs the scalable bitstream (F0, G0, H0 up to FP, GP, HP) and a bitstream description. The Scalable Video Extractor outputs the extracted scalable bitstream (Fe, Ge, He) and the extracted bitstream description. The Scalable Video Decoder outputs the low, medium or high decoded video.]

The encoder block generates, only once, a scalable video bitstream and a bitstream description for the input video at the highest achievable quality, resolution and frame rate. The bitstream description can be kept separate from or interleaved with the scalable video bitstream. The scalable video bitstream is generated in such a manner that it can provide all three types of scalability discussed above. The extractor block truncates the scalable video bitstream into a new, adapted scalable video bitstream and its description. The decoder block uses the adapted scalable video bitstream and its description to decode the video at a particular quality, resolution and frame rate.

Let F0, G0 and H0 be the bitstream requirements for a basic quality, resolution and frame rate, respectively, and let FP, GP and HP be the bitstream requirements for the highest quality, resolution and frame rate, as shown in Figure 2.2. All this information is present in a single bitstream generated by the scalable video encoder. The extractor can extract the scalable video bitstream at any quality (Fe), resolution (Ge) and frame rate (He). The decoder then decodes the video at the corresponding quality, resolution and frame rate according to the extracted scalable bitstream and its description.
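The extractor's role can be sketched as a filter over packets that are tagged with the quality, resolution and frame-rate level they contribute to. The `Packet` structure, its field names and the `extract` function are assumptions made for this illustration; the source does not specify a bitstream syntax or how the bitstream description is organised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet: level indices play the role of a bitstream description."""
    quality: int      # 0 = basic quality (F0), larger = finer quality
    resolution: int   # 0 = basic resolution (G0)
    frame_rate: int   # 0 = basic frame rate (H0)
    payload: bytes


def extract(bitstream, q_e, r_e, t_e):
    """Keep only the packets needed for the operating point (q_e, r_e, t_e)."""
    return [p for p in bitstream
            if p.quality <= q_e and p.resolution <= r_e and p.frame_rate <= t_e]


# Encoded once, at the highest operating point (two levels per dimension here).
full_stream = [Packet(q, r, t, b"")
               for q in range(2) for r in range(2) for t in range(2)]

base_stream = extract(full_stream, 0, 0, 0)     # basic quality, resolution, frame rate
adapted_stream = extract(full_stream, 1, 0, 1)  # finer quality and frame rate only

print(len(full_stream), len(base_stream), len(adapted_stream))
```

The encoder runs once; every adapted bitstream is obtained by discarding packets, without re-encoding.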

Different scalable image coding algorithms are available in the literature. Shapiro [31] introduced embedded zerotree wavelet (EZW) image coding, which generates a bitstream ordered according to the significance of the wavelet coefficients. An alternative scheme implementing the same concept, named SPIHT (Set Partitioning in Hierarchical Trees), is discussed in [32]. EZW and SPIHT achieve only quality scalability. Both quality and resolution scalability are achieved in embedded block coding with optimized truncation (EBCOT) [33], which is also adopted in JPEG2000 [26]. Motion compensated temporal filtering (MCTF) [34] and the discrete wavelet transform (DWT) are used extensively in video coding to generate scalable video bitstreams [27, 35–40]. MCTF is a lifting-based wavelet approach used to decompose a video in the temporal direction. Motion compensation and prediction are performed in [41, 42], but not in [43], when applying the wavelet transform in the temporal direction.
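The lifting structure behind temporal filtering can be sketched with an unnormalised Haar wavelet applied along the time axis. This is a simplification under stated assumptions: no motion compensation is applied (frames are filtered pixel by pixel at the same position), frames are flat lists of floats, and the function names are hypothetical. The cited MCTF schemes are not reproduced here.

```python
def haar_temporal_analysis(frames):
    """One level of temporal Haar lifting over an even number of frames."""
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        h = [y - x for x, y in zip(a, b)]      # predict: odd frame minus even frame
        l = [x + d / 2 for x, d in zip(a, h)]  # update: average of the frame pair
        high.append(h)
        low.append(l)
    return low, high


def haar_temporal_synthesis(low, high):
    """Invert the lifting steps in reverse order to recover all frames."""
    frames = []
    for l, h in zip(low, high):
        a = [s - d / 2 for s, d in zip(l, h)]  # undo update
        b = [x + d for x, d in zip(a, h)]      # undo predict
        frames.extend([a, b])
    return frames


video = [[10.0, 20.0], [14.0, 20.0], [8.0, 0.0], [8.0, 4.0]]  # 4 tiny "frames"
low_band, high_band = haar_temporal_analysis(video)
restored = haar_temporal_synthesis(low_band, high_band)

print(low_band)   # half-frame-rate sequence
print(high_band)  # temporal detail needed for the full frame rate
```

Decoding only `low_band` gives the sequence at half the frame rate, which is the temporal base layer in this sketch; adding `high_band` restores every frame.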

3D wavelet decomposition, or spatio-temporal decomposition, is a two-step process: a 2D spatial transform and MCTF. Two different frameworks for the spatio-temporal decomposition are used in video coding. In one, MCTF is performed on the 2D spatial transform coefficients; this is known as the (2D+t) framework [44]. In the other, the 2D spatial transform is performed after MCTF; this is known as the (t+2D) framework [2]. All three kinds of scalability (temporal, spatial and quality) can be achieved using the spatio-temporal decomposition architecture. The motion vectors generated by MCTF can be encoded in a non-scalable fashion [2] or in a scalable fashion [45, 46]. Different wavelet-based scalable video coding approaches are discussed in detail in [1].

The major problem of a scalable bitstream is its rapid performance deterioration when transmitted over error-prone channels. In scalable coding, the higher enhancement layers depend on the base layer and the lower enhancement layers. Therefore, if the base layer is affected by transmission errors, the errors propagate because of the interdependencies among layers, and the expected improvement in quality can be lost even though the enhancement layers are received without errors. An example of such a situation is a best-effort packet network such as the Internet. To transmit video over the Internet, the scalable video bitstream is packetized according to the significance of the information. If packets are corrupted or lost at different nodes because of varying link bandwidths, buffer capacities and network congestion, the video may not be decoded properly at the decoder. In such cases, error-resilient image and video coding methods are required to address the error propagation problem of scalable bitstreams.
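The effect of layer interdependency can be illustrated with a small sketch. It assumes a strictly sequential dependency (each layer needs every layer below it), and the function name `decodable_layers` is hypothetical.

```python
def decodable_layers(received):
    """Count usable layers: the run of correctly received layers from the base up.

    received[0] is the base layer; received[k] is the k-th enhancement layer.
    A layer is only usable if every layer below it arrived intact.
    """
    count = 0
    for ok in received:
        if not ok:
            break
        count += 1
    return count


print(decodable_layers([True, True, False, True]))  # 2: the last layer is wasted
print(decodable_layers([False, True, True]))        # 0: base layer lost
print(decodable_layers([True, True, True]))         # 3: everything usable
```

In the second case two enhancement layers arrive without errors, yet nothing can be decoded because the base layer is missing.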

Different methods have been proposed for reliable transmission of images and video over the Internet and mobile wireless networks. Forward error correction (FEC) and automatic repeat request (ARQ) are the two common error correction techniques adopted in image and video transmission over error-prone channels [47, 48]. Introducing any error correction technique increases the data rate. FEC schemes can detect and correct a certain number of bit errors, depending on the error detection and correction capabilities of the adopted scheme. An FEC scheme fails when the bit errors exceed its error correction capability; this usually happens under bursty error conditions. It is shown in [49, 50] that ARQ is more effective than FEC in combating bursty errors. The only disadvantage of ARQ is the additional delay caused by requesting retransmission of corrupted packets, which makes it inappropriate for real-time applications. Instead of FEC or ARQ, a source coding method known as multiple description coding (MDC) can also be used as an effective scheme to combat channel errors.
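The behaviour of FEC under isolated and bursty errors can be sketched with a 3-fold repetition code and majority-vote decoding. The repetition code is chosen here only because it is the simplest FEC scheme; the source does not say which codes are used in [47, 48], and the function names are hypothetical.

```python
def fec_encode(bits, n=3):
    """Repeat every bit n times (rate 1/n, so the data rate triples for n=3)."""
    return [b for b in bits for _ in range(n)]


def fec_decode(coded, n=3):
    """Majority vote over each group of n received bits."""
    return [1 if sum(coded[i:i + n]) > n // 2 else 0
            for i in range(0, len(coded), n)]


def flip(coded, positions):
    """Simulate channel errors by inverting the bits at the given positions."""
    out = list(coded)
    for p in positions:
        out[p] ^= 1
    return out


message = [1, 0, 1, 1]
coded = fec_encode(message)

scattered = fec_decode(flip(coded, [0, 4, 8]))  # three isolated errors
burst = fec_decode(flip(coded, [3, 4]))         # two adjacent errors

print(scattered == message)  # True: one error per codeword is corrected
print(burst == message)      # False: the burst exceeds the correction capability
```

Three scattered errors are corrected, whereas a burst of only two errors inside one codeword defeats the code, which is the failure mode described above.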
