
4.2 The Proposed Fast Mode Decision Algorithm

4.2.1 Observations and Algorithm Formulation

To improve coding efficiency, SVC employs inter-layer prediction in addition to the inter- and intra-prediction performed within each layer, as in single-layer coding. Inter-layer prediction exploits the reconstructed data of the lower layers, so in the enhancement layer the prediction signal is obtained either by conventional motion-compensated temporal prediction or by upsampling the reconstructed lower-layer information.

Motion Information from P-Frames

Although inter-layer prediction effectively improves the coding efficiency of SVC, not all upsampled lower-layer data is suitable for inter-layer prediction, especially for video sequences with slow motion or simple texture, such as the ‘Mother-daughter’ sequence. In such sequences, the best matching macroblock can usually be found in the temporal reference frames. Therefore, an approach is proposed to determine whether the current macroblock is better suited to inter-frame prediction or inter-layer prediction, in other words, whether the current macroblock represents slow or fast motion. The prediction mode applied to the current macroblock, inter-frame or inter-layer, can then be decided accordingly. The Motion Vector Difference (MVD) between P-frames in each GOP is chosen as the measure of motion. This is motivated by the hierarchical coding structure of SVC and the fact that the MVD is a good measure with which to categorise video motion activity.

In order to realise temporal scalability, SVC adopts a hierarchical coding structure to partition a video sequence into a number of temporal layers. Fig. 4-1 illustrates a typical hierarchical coding structure. The first frame is encoded as an I-frame, and the encoder inserts a key frame at regular intervals. Each key frame is encoded as either an I-frame or a P-frame and serves as a reference for subsequent frames. In a hierarchical coding structure, an I-frame is first coded without reference to any other frames. Subsequently, each P-frame uses the previous key frame as a reference for prediction. The remaining frames of a GOP are then hierarchically predicted and coded as B-frames. In other words, the B-frames in a GOP are encoded after the I- and P-frames. Therefore, some encoding results from the P-frames can be used to reduce the computational cost of the B-frames.

[Figure: frames 0 to 16 in display order across two GOPs, with coding order 0, 4, 3, 5, 2, 7, 6, 8, 1, 12, 11, 13, 10, 15, 14, 16, 9; frames are labelled I0 and P0 (key frames) and B1, B2, B3 (B-frames).]

Fig. 4-1 A typical hierarchical B-frame coding structure.
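The coding order in Fig. 4-1 can be reproduced with a short sketch. It assumes a GOP size of 8 and a depth-first bisection of each GOP, which matches the numbering in the figure; the function names are illustrative and are not taken from any reference encoder, whose ordering of B-frames may differ.

```python
def _code_b_frames(lo, hi, order):
    """Append the B-frames strictly between frames lo and hi, coarsest first."""
    if hi - lo < 2:
        return  # no frame lies between lo and hi
    mid = (lo + hi) // 2
    order.append(mid)              # B-frame predicted from lo and hi
    _code_b_frames(lo, mid, order)
    _code_b_frames(mid, hi, order)


def hierarchical_coding_order(num_gops, gop_size=8):
    """Return display-order frame indices in the order they are coded."""
    order = [0]  # the first frame is the initial I-frame
    for g in range(num_gops):
        start, end = g * gop_size, (g + 1) * gop_size
        order.append(end)          # key frame (I or P) is coded before the B-frames
        _code_b_frames(start, end, order)
    return order


def coding_positions(order):
    """For each display-order frame, return its position in coding order."""
    positions = [0] * len(order)
    for coded_at, frame in enumerate(order):
        positions[frame] = coded_at
    return positions


order = hierarchical_coding_order(num_gops=2)
print(order)                    # frames listed in coding order
print(coding_positions(order))  # coding position of frames 0..16
```

Because the key frame of each GOP is appended before any B-frame, the motion data of the P-frame is always available when the B-frames of the same GOP are encoded.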

translation, rotation, and scaling. A sophisticated model is required to represent the motion in a video sequence, as the trajectory of a video object can be arbitrary [95]. Considering only the simplest case, the motion trajectory is assumed to be linear. In Fig. 4-2, the truck is assumed to move from the left towards the traffic lights on the right at a constant velocity of $v_t(x)$ between $t = t_{k-1}$ and $\tau$ ($\tau > t$). The trajectory of the truck can then be described using a linear model as follows:

$$x(\tau) = x(t) + v_t(x)(\tau - t) = x(t) + d_{t,\tau}(x) \tag{4.1}$$

where $d_{t,\tau}(x) = v_t(x)(\tau - t)$ is a displacement vector measured from $t$ to $\tau$.
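As a small worked example of equation (4.1), the sketch below evaluates the linear model for a two-dimensional position. The units (pixels and frame intervals) and the function names are assumptions made for illustration only.

```python
def displacement(velocity, t, tau):
    """d_{t,tau}(x) = v_t(x) * (tau - t), applied per component."""
    vx, vy = velocity
    dt = tau - t
    return (vx * dt, vy * dt)


def position_at(x_t, velocity, t, tau):
    """x(tau) = x(t) + d_{t,tau}(x) under the constant-velocity assumption."""
    dx, dy = displacement(velocity, t, tau)
    return (x_t[0] + dx, x_t[1] + dy)


# An object at (10, 20) moving 4 pixels per frame to the right,
# observed three frame intervals later.
print(displacement((4.0, 0.0), t=0, tau=3))               # (12.0, 0.0)
print(position_at((10.0, 20.0), (4.0, 0.0), t=0, tau=3))  # (22.0, 20.0)
```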


Fig. 4-2 An example of a linear motion trajectory.

In SVC, a block-based search algorithm is used to estimate the displacement of all pixels in a block. The obtained displacement is represented by an MV, which is determined by a predefined matching criterion, such as the Cross-Correlation Function (CCF), Mean Squared Error (MSE), or Mean Absolute Error (MAE). As a result, the MV can be expressed as

$$\mathrm{MV} = v_t(x)(\tau - t) \tag{4.2}$$

It can be seen from equation (4.2) that, over a given time interval, a larger MV corresponds to faster motion, and vice versa. Therefore, the MV can be used as a measure to categorise video motion activity.
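The block-based search described above can be sketched as an exhaustive (full) search using MAE, one of the matching criteria named in the text. This is a minimal illustration rather than the search used by an SVC encoder: the frame layout (lists of rows of integer samples), the function names, and the sign convention (the MV points from the block position in the current frame to the matching position in the reference frame) are all assumptions.

```python
def mean_absolute_error(cur, ref, cx, cy, rx, ry, size):
    """MAE between the block at (cx, cy) in cur and the block at (rx, ry) in ref."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
    return total / (size * size)


def full_search(cur, ref, cx, cy, size, search_range):
    """Return ((dx, dy), cost) minimising MAE within +/- search_range pixels."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = mean_absolute_error(cur, ref, cx, cy, rx, ry, size)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Synthetic data: the current frame is the reference content shifted
# by 2 pixels horizontally and 1 pixel vertically.
ref = [[x + 10 * y for x in range(12)] for y in range(12)]
cur = [[ref[y + 1][x + 2] for x in range(8)] for y in range(8)]
print(full_search(cur, ref, cx=2, cy=2, size=4, search_range=2))  # ((2, 1), 0.0)
```

With a fixed interval between the two frames, a larger shift in the synthetic data yields a larger MV, which is the relationship expressed by equation (4.2).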

In SVC, differential coding is applied to MVs to further reduce the motion information overhead. That is, only the difference between the actual MV and the Motion Vector Predictor (MVP) is encoded and transmitted, instead of coding the actual MV directly. The MVP of a macroblock in the enhancement layer is taken either from its spatially surrounding macroblocks in the same layer or from the corresponding macroblock in the previous layer.

In the same layer, the MVs of adjacent macroblocks tend to be very similar. Consequently, the current MV can be predicted from the three MVs of the blocks located to the left, above, and above-right, as shown in Fig. 4-3.

Fig. 4-3 MVs of a current block and its neighbours.

The horizontal and vertical components of the current macroblock’s MVP are calculated separately, and each component is the median value of the corresponding components of the three neighbouring MVs.

$$\begin{aligned} \mathrm{MVP}_x &= \mathrm{Median}(\mathrm{MV}_{L,x}, \mathrm{MV}_{A,x}, \mathrm{MV}_{AR,x}) \\ \mathrm{MVP}_y &= \mathrm{Median}(\mathrm{MV}_{L,y}, \mathrm{MV}_{A,y}, \mathrm{MV}_{AR,y}) \end{aligned} \tag{4.3}$$

where $x$ and $y$ denote the horizontal and vertical components, and the subscripts L, A, and AR stand for the macroblocks to the left, above, and above-right, respectively.
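Equation (4.3) can be illustrated with a few lines of code. The sketch assumes that all three neighbouring MVs are available and represents each MV as an (x, y) tuple; special cases handled by a real encoder, such as unavailable neighbours, are not covered, and the function names are illustrative.

```python
def median3(a, b, c):
    """Median of three scalar values."""
    return sorted((a, b, c))[1]


def median_mvp(mv_left, mv_above, mv_above_right):
    """Component-wise median of the three neighbouring MVs, as in (4.3)."""
    return (
        median3(mv_left[0], mv_above[0], mv_above_right[0]),
        median3(mv_left[1], mv_above[1], mv_above_right[1]),
    )


# Neighbouring MVs: left (4, -2), above (6, 0), above-right (5, 3).
print(median_mvp((4, -2), (6, 0), (5, 3)))  # (5, 0)
```

Because the two components are taken independently, the resulting MVP (5, 0) does not have to coincide with any single neighbouring MV.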

In the enhancement layers of SVC, the MVP can be obtained by the conventional median-based approach, or the scaled MV of the corresponding block in the previous layer can be used as the MVP, as shown in Fig. 4-4.


Fig. 4-4 Inter-layer MV prediction with various block sizes.
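The inter-layer alternative can be sketched as follows. The text states only that the MV of the corresponding block in the previous layer is scaled; the sketch assumes that the scaling factor equals the resolution ratio between the two layers (a factor of 2 for dyadic spatial scalability, for example QCIF to CIF). The function name and the rounding are illustrative choices, not the normative derivation.

```python
def scale_base_layer_mv(mv_base, base_size, enh_size):
    """Scale a lower-layer MV to the enhancement-layer resolution.

    base_size and enh_size are (width, height) tuples. Scaling by the
    resolution ratio is an assumption made for this illustration.
    """
    scale_x = enh_size[0] / base_size[0]
    scale_y = enh_size[1] / base_size[1]
    return (round(mv_base[0] * scale_x), round(mv_base[1] * scale_y))


# Layer 0 at QCIF (176x144), layer 1 at CIF (352x288): the MV is doubled.
print(scale_base_layer_mv((3, -5), (176, 144), (352, 288)))  # (6, -10)
```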

process is performed to find the actual MV within a defined search range under a conventional matching criterion. Finally, the MVD between the actual MV and the MVP, which is defined as

$$|\mathrm{MVD}| = |\mathrm{MV}_{\mathrm{actual}} - \mathrm{MVP}| \tag{4.4}$$

is encoded and transmitted. The relationship between the actual MV and the MVP is illustrated in Fig. 4-5.

[Fig. 4-5 labels: current MB, search area, reference frame, current frame, best matching MB.]

Based on the above analysis, it can be concluded that sequences with small motion generally tend to have small MVDs, and vice versa. Therefore, the MVD can be chosen as a measure of video motion activity. Using the MVD also satisfies the main objective of reducing the computational complexity, as the MVD is easy to extract from the coded data.
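To make the measure concrete, the sketch below computes the MVD magnitude of equation (4.4) for a set of macroblocks and averages it. Two points are assumptions rather than part of the text: the magnitude is taken as the Euclidean norm, and the per-macroblock values are combined by a simple mean. How the MVDs are aggregated and compared against a threshold is not specified in this section.

```python
import math


def mvd_magnitude(mv_actual, mvp):
    """|MVD| = |MV_actual - MVP|, using the Euclidean norm (an assumption)."""
    return math.hypot(mv_actual[0] - mvp[0], mv_actual[1] - mvp[1])


def mean_mvd(mvs, mvps):
    """Average |MVD| over paired lists of actual MVs and their predictors."""
    magnitudes = [mvd_magnitude(mv, mvp) for mv, mvp in zip(mvs, mvps)]
    return sum(magnitudes) / len(magnitudes) if magnitudes else 0.0


# Hypothetical values for two macroblocks of a slow and a fast sequence.
slow = mean_mvd([(1, 0), (0, 1)], [(1, 0), (0, 0)])
fast = mean_mvd([(9, 4), (-6, 8)], [(6, 0), (0, 0)])
print(slow, fast)  # 0.5 7.5
```

In this toy data the slow sequence yields a much smaller average than the fast one, in line with the conclusion above.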

In document Efficient algorithms for scalable video coding (Page 111-116)