A Review on reference Picture Memory and Coding Efficiency of Multiview Video


Low-Delay Multiview Video Coding for Free-Viewpoint Video Communication

Hideaki Kimata,¹ Masaki Kitahara,² Kazuto Kamikura,¹ Yoshiyuki Yashima,¹ Toshiaki Fujii,³ and Masayuki Tanimoto³

¹ NTT Cyber Space Laboratories, NTT Corporation, Yokosuka, 239-0847 Japan

² NTT Advanced Technology Corporation, Yokohama, 244-0805 Japan

³ Department of Electrical Engineering and Computer Science, Graduate School of Engineering, Nagoya University, Nagoya, 464-8603 Japan

SUMMARY

We have proposed free-viewpoint video communications, in which a viewer can change the viewpoint and viewing angle when receiving and watching video content. A free-viewpoint video consists of several views, whose viewpoints are different. To freely and instantaneously change the viewpoint and view angle, a random access capability to decode the requested view with little delay is necessary. In this paper, a multiview video coding method to achieve high coding efficiency with low-delay random access functionality is proposed. In the proposed method, the GOP is the basic unit of a view, and selective reference picture memory management is applied to multiple GOPs to improve coding efficiency. In addition, the coding method of disparity vectors, which utilizes the camera arrangement, is proposed. © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(5): 14–29, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20683

Key words: free viewpoint; free viewpoint video; multi-viewpoint video; disparity compensation; H.264.

1. Introduction

Free-viewpoint video is a visual representation in which a viewer can change the viewpoint freely as desired when watching the video content. Such highly interactive, high-quality video content can provide viewers with immersive visual experiences. Visual representations in which the viewpoint changes have been applied in movies and sports relay broadcasts. For instance, in movies, special camerawork is used to allow a viewer to see a scene as if from continuously changing viewpoints while time is stopped. To create such camerawork, camera images are captured from multiple viewpoints, and then images for the camerawork are generated at virtual camera positions determined by a movie maker in a studio. These generated images are used for transmission and broadcasting. In these traditional applications, the viewer cannot change the viewpoint. In contrast, we have proposed free-viewpoint TV and free-viewpoint video communication [1–3]. In these proposed applications, interactivity is very high because viewers can change viewpoints freely. In the MPEG standardization body, the 3DAV activity for natural three-dimensional video coding is in progress, and in that activity, standardization of free-viewpoint TV and multiview video coding is being considered [8, 16].

Systems and Computers in Japan, Vol. 38, No. 5, 2007


In free-viewpoint video communication, it is assumed that the transmitting side captures a scene with multiple cameras and produces multiview video data, and then the receiving side generates and displays an image while changing the viewpoint (Fig. 1). On the transmitting side, it is assumed that all of the cameras are calibrated in advance, and that the camera parameters obtained from calibration are transmitted together with the video bitstream. On the receiving side, after decoding the video bitstream, the image from the virtual camera position is generated by view interpolation techniques making use of the camera parameters. The higher the camera density on the transmitting side, the smoother the change of viewpoints that can be achieved.

In free-viewpoint video communications, multiview video coding with high coding efficiency is an essential technology. In addition, multiview video coding needs low-delay random access functionality for changes of viewpoint. This is because the set of captured views needed to generate the displayed view changes whenever the viewpoint changes. This paper presents research results focusing on multiview video coding.

For multiview video coding, a coding method of multiview images which exploits the epipolar constraint and encodes the images as one video has been proposed [4]. However, this proposal has been limited to still objects, and has not been extended to moving objects. The MPEG-2 multiview profile uses a coding method for a stereo video in which one view is predicted from the other view, and by extending the method used in the MPEG-2 multiview profile, a coding method of multiple views using prediction between multiple views has been proposed [5]. In this method, view scalability was also proposed, achieving the change of views to be decoded from multiple views to a stereo or single view. However, this method does not provide low-delay random access to a requested view, because all views have the same GOP structure, that is, the number of pictures in a GOP is the same for all views. Moreover, because it is based on MPEG-2, it does not make use of the reference picture selection method adopted in H.264, in which the reference picture is selected from multiple decoded images. That is why high coding efficiency has not been achieved.

In this paper, we propose a new multiview video coding method which achieves low-delay random access to a requested view with regard to change of viewpoint and view direction, while maintaining high coding efficiency. In this paper, the delay is calculated from the number of frames to be decoded in order to obtain the requested frame of the requested view. In the second section of the paper, we propose a multiview video coding method based on the reference picture selection method, which has low-delay random access functionality. In the third section, we propose a new disparity prediction method, where camera geometry information is used for coding of the disparity vector and determination of the search range of the disparity vectors.

2. Multiview Video Coding

2.1. Assumed camera arrangement and multiview video

This section describes the assumed camera arrangement and structure of multiview video. To construct a free-viewpoint video, the cameras must be arranged densely. When the epipolar constraint is utilized for generation of a virtual view, it is better if the cameras are arranged regularly [6]. Figure 2 presents an example of the assumed camera arrangement. Figure 2(a) shows the structure in which the cameras are arranged in a line, and Fig. 2(b) shows the structure in which the cameras are arranged in an arc. In practice, because the cameras are arranged manually, some error in camera positions is present, and it is difficult to remove such errors at the pixel level before capture. We could correct such errors before encoding the video signals, using the camera parameters obtained by camera calibration, but this requires a huge amount of processing time because correction must be applied to all pixels of all cameras. Thus, for a system with many cameras, such correction is particularly unsuitable for the communications application discussed in this paper. Not only errors in camera geometry but also color inconsistency is difficult to remove. Therefore, in this paper, we assume that errors at pixel level in the camera positions and colors are removed by using the camera parameters on the receiving side when a virtual view is generated, and we assume that images that contain errors at pixel level are subject to encoding.

Fig. 1. Free-viewpoint video communications.

2.2. Proposal of GoGOP structure and GOP adaptive reference picture selection method

In free-viewpoint video communications, not all views are necessarily decoded, because obtaining the requested view suffices. There are two ways to obtain partial data from a multiview video. The first is “partial decoding,” in which the receiving side decodes partial data after it receives all multiview data, and the second is “view scalability,” in which only the data necessary to obtain a requested view are transmitted [9]. Figure 3 shows the flow between the transmitting side and the receiving side for these two methods. We have proposed the GoGOP (Group of GOP) structure to implement these methods [3, 10]. It extends the concept of GOP structure in conventional 2D video to multiview video. In the GoGOP structure, a view consists of several GOPs, and prediction coding is applied between GOPs. A GOP is categorized as either a Base GOP or an Inter GOP. In a Base GOP, the images can be decoded by using images in the same GOP, and in an Inter GOP, they can be decoded using images in other GOPs as well as the same GOP. In an Inter GOP, higher coding efficiency can be achieved than in a Base GOP, because the correlation between GOPs is utilized in the prediction coding. A GOP within a GoGOP is encoded using only GOPs in that GoGOP.

Figure 4 shows examples of the GoGOP structure. A white square represents a picture in a Base GOP and a gray square represents a picture in an Inter GOP. In Fig. 4(a), a Base GOP and an Inter GOP are applied alternately in a view. The arrows show the reference relations of the pictures. Fine arrows in the figures show the relations of pictures; the picture positioned at the origin of the arrow refers to the picture positioned at the tip of the arrow. Thick arrows in the figures show the relations of GOPs; the GOP positioned at the origin of the arrow refers to the GOP positioned at the tip of the arrow. In this example, a picture in an Inter GOP refers to pictures in its own GOP as well. Partial images from all views can be obtained even if Inter GOPs are not decoded. When the correlation of pictures in the time dimension or the interview dimension is high, the images in Inter GOPs can be generated from images in the Base GOPs, and thus all images can be obtained. It is possible to obtain a requested view. In the case shown in Fig. 4(a), the number of delayed pictures is at most the number of pictures in the GoGOP. In Fig. 4(b), an Inter GOP refers only to Base GOPs. Partial decoding and view scalability can be achieved in a GOP, as well as in the case shown in Fig. 4(a). An Inter GOP may contain multiple pictures, as Fig. 4(b) shows, or it may contain only one picture. In this case, the pictures in an Inter GOP can be decoded with low delay, while the pictures in Base GOPs are decoded. Thus, the number of delayed pictures is just one at a minimum in this case. This structure is efficient in terms of reducing processing time for decoding pictures, as well as reducing the memory size of reference pictures.

In the GoGOP structure, a GOP number is assigned to each GOP. The relations of reference between GOPs are indicated by a reference GOP number, which is encoded in a GOP header. To determine the reference GOPs, camera arrangement information is useful. For instance, when the correlation between adjacent views is assumed to be high, GOPs taken where the camera positions are close together are selected as candidate reference GOPs. If a reference GOP also refers to another GOP, the delay of decoding a picture becomes high. When an acceptable delay is specified, reference GOPs must be determined so that this delay is not exceeded. This paper presents research results on the relationship between coding efficiency and the structure of the reference GOPs.
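The effect of chained GOP references on delay can be illustrated with a small sketch. It assumes the reference GOP numbers from the GOP headers are available as a dictionary, and it counts every picture of every required GOP, which is an upper bound on the delay measure used in this paper; the function names and the data layout are hypothetical.

```python
from typing import Dict, List, Set


def gops_to_decode(target: int, ref_gops: Dict[int, List[int]]) -> Set[int]:
    """Return every GOP that must be decoded to obtain GOP `target`.

    `ref_gops` maps a GOP number to its reference GOP numbers
    (hypothetical in-memory form of the GOP headers); a Base GOP maps to [].
    """
    needed: Set[int] = set()
    stack = [target]
    while stack:
        gop = stack.pop()
        if gop in needed:          # already visited (also handles self-reference)
            continue
        needed.add(gop)
        stack.extend(ref_gops.get(gop, []))
    return needed


def decoding_delay(target: int, ref_gops: Dict[int, List[int]],
                   pictures_per_gop: Dict[int, int]) -> int:
    """Upper bound on the pictures decoded before `target` is fully available."""
    return sum(pictures_per_gop[g] for g in gops_to_decode(target, ref_gops))


def within_delay(target: int, ref_gops: Dict[int, List[int]],
                 pictures_per_gop: Dict[int, int], max_delay: int) -> bool:
    """Check a candidate reference structure against an acceptable delay."""
    return decoding_delay(target, ref_gops, pictures_per_gop) <= max_delay


# Illustrative structure: GOP 1 is a Base GOP; GOP 3 refers to GOP 2,
# which itself refers to GOP 1, whereas GOP 4 refers to GOP 1 directly.
ref_gops = {1: [], 2: [1], 3: [2], 4: [1]}
pictures_per_gop = {1: 4, 2: 1, 3: 1, 4: 1}
print(decoding_delay(3, ref_gops, pictures_per_gop))
print(decoding_delay(4, ref_gops, pictures_per_gop))
```

In this toy structure the chained reference of GOP 3 requires one more GOP than the direct reference of GOP 4, which is the reason a delay budget restricts the choice of reference GOPs.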

We propose the GOP adaptive Reference Picture Selection (GRPS) method for GoGOP structure, extending the hierarchical reference picture selection method that has been proposed for temporal scalable video coding [11]. Multiple reference picture memories that are managed logically independently for each GOP are prepared, and the utilized reference picture memories are selected adaptively. Each reference picture memory assigned to a GOP has multiple decoded pictures, and a reference picture is selected from those pictures. Figure 5 shows the structure of the decoder of GRPS. The reference GOPs used for decoding are indicated by reference GOP numbers. The reference indices are assigned to the indicated reference GOPs, and the reference picture is selected according to the reference index encoded in the bitstream. In the proposed method, the reference picture is selected per block. For instance, in the case shown in Fig. 4(a), when GOP5 and GOP6 are set to the reference GOPs of GOP6, reference indices are assigned to pictures stored in the reference picture memories for GOP5 and GOP6, and then the pictures in GOP6 are decoded. The reference indices are not assigned to the picture in GOP4.
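A minimal sketch of this index assignment follows. The text does not state the order in which indices are assigned, so the ordering used here (reference GOPs in signalled order, most recently decoded picture first) is an assumption, as are the names and the (GOP number, frame number) picture identifier.

```python
from typing import Dict, List, Tuple

Picture = Tuple[int, int]  # (GOP number, frame number); hypothetical identifier


def assign_reference_indices(reference_gops: List[int],
                             memories: Dict[int, List[Picture]]) -> List[Picture]:
    """Build the reference list for the current GOP.

    Only pictures held in the memories of the signalled reference GOPs
    receive an index; the position in the returned list is the index.
    Ordering is an assumption: GOPs in signalled order, newest picture first.
    """
    ref_list: List[Picture] = []
    for gop in reference_gops:
        ref_list.extend(reversed(memories.get(gop, [])))
    return ref_list


# Per-GOP reference picture memories, oldest decoded picture first.
memories = {
    4: [(4, 0), (4, 1)],
    5: [(5, 0), (5, 1)],
    6: [(6, 0)],
}

# Decoding GOP6 with GOP6 and GOP5 as reference GOPs: GOP4 gets no index.
ref_list = assign_reference_indices([6, 5], memories)
for index, picture in enumerate(ref_list):
    print(index, picture)
```

A block then selects its reference picture by looking up the decoded reference index in this list.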

To improve coding efficiency in the reference picture selection method, the coding mode, the displacement vectors (motion vectors and disparity vectors), and the reference indices for block B are chosen so as to minimize the cost function defined by Eq. (1) [12]. Here o(i, j, g, t) represents the original image at position (i, j) in frame t of GOP g, and r(i, j, g, t) represents the decoded image. R represents the number of encoded bits for the block, and λ is the Lagrangian multiplier.
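With these definitions, and with the distortion measured as the sum of squared differences (SSD) over the block as stated in Section 2.3, the cost function of Eq. (1) takes the usual Lagrangian form. The expression below is a reconstruction from the surrounding text, so the exact notation is an assumption:

```latex
J = \sum_{(i,j) \in B} \bigl( o(i,j,g,t) - r(i,j,g,t) \bigr)^{2} + \lambda R
```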

The decoded image r(i, j, g, t) is given below, where p(i, j, g, t) is the predictive image, e(i, j) is the residual error, and (d_x, d_y) is the displacement vector. h represents the difference in GOP numbers and s represents that of the frame numbers. The coefficients a and b are used for color correction when the GOP number is different.
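A form of Eq. (2) consistent with these definitions is sketched below. It is a reconstruction, not the original typesetting: the sign conventions for the displacement, h, and s are assumptions, as is the reading that a = 1 and b = 0 when the reference picture lies in the same GOP (h = 0).

```latex
\begin{aligned}
r(i,j,g,t) &= p(i,j,g,t) + e(i,j), \\
p(i,j,g,t) &= a \, r(i + d_x,\; j + d_y,\; g - h,\; t - s) + b .
\end{aligned}
```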

2.3. Prediction error in multiview video coding

The prediction error in multiview video coding is estimated by applying the ray space representation. A disparity compensation utilizing the features of the ray space for coding multiview images has been proposed [17]. Reference 17 discusses the case in which the standard plane is moved along the viewing angle in the ray space; however, this paper discusses the case in which the standard plane is moved along the viewing position, because the cameras are arranged in a line rather than in a circle. When we set the standard plane P shown in Fig. 6(a) for the camera arrangement in Fig. 2(a), the rays across the standard plane are represented in the ray space whose dimensions are the ray directions (θ, ϕ) and positions (x, y) [6]. The camera images correspond to multiple rays (real rays) in the ray space whose dimension is u and whose horizontal position is x, as shown in Fig. 6(b), where u is given by u = tan(θ) for horizontal angle θ if the vertical information is omitted. If the camera arrangement is fixed over time, the real rays are arranged along the time dimension as shown in Fig. 6(c). The correlation of the real rays in the time dimension is dependent on the distance of the real rays ∆t, and the correlations in the positional dimensions are dependent on the distances of the real rays ∆x.

Fig. 5. Decoder configuration for GOP adaptive Reference Picture Selection (GRPS).

Fig. 6. Samples transformed into ray space with time axis from real captured images. (a) Real camera sets; (b) samples in ray space; (c) samples in ray space with time axis; (d) relations of view angle and camera distance.

Here the relationship of rays in the position and time dimensions is analyzed. The image o(i, j, g, t) captured by the cameras in Fig. 6(d) is transformed to the ray f(x, y, v, t) in the standard plane by the following:

First, the errors in the positional dimensions of the real rays are discussed. The error E_v in the standard plane for the current frame captured by camera v2 is given by Eq. (4) when the current frame refers to the frame captured by camera v1 at the same time:

Region S_c is the covered area in which v2 and v1 overlap in the standard plane, and S_u is the uncovered area in which they do not overlap. The displacement vector (α, β) shows the position where the correlation is highest in the frame of v2. The difference between f(x, y, v, t) and f(x, y, v − 1, t) can be regarded as the difference for an angle change ∆θ at position (x, y) in the standard plane. The angle change ∆θ can be approximated from the position change ∆x and the distance Z from the camera to the standard plane (∆θ ≈ ∆x/Z). If we introduce a complexity M of the real rays in the positional dimension, the difference within the frame can be defined as a quantity that depends on M. Then Ē_v, the average of the error E_v, is expressed as follows, where ρ_v(∆θ) is the average difference of the real rays in region S_c for the angle change, and ρ_a(M) is the average difference of the real rays in region S_u for the position change in the frame:

Next, errors in the time dimension of real rays are discussed. The error E_t in the standard plane for the current frame captured by camera v2 is given by Eq. (6), when the current frame refers to the previous frame in the same camera:

The average error Ē_t is given below, where ρ_t(∆t) is the average difference of the real rays in region S for the temporal change:

Based on the above analysis, the average error Ē among the real rays is obtained as a weighted average of the error in the positional dimensions Ē_v and the error in the time dimension Ē_t:
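Written out with weighting coefficients w_v and w_t for the positional and time dimensions, this weighted average, Eq. (8), has the form below. The normalization w_v + w_t = 1 is an assumption, suggested by the interpretation of the weights as selection ratios of disparity and motion compensation.

```latex
\bar{E} = w_v \, \bar{E}_v + w_t \, \bar{E}_t , \qquad w_v + w_t = 1
```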

The coding efficiency can be improved if this average Ē is reduced.

Thus, to improve coding efficiency by utilizing the correlation of views, the average error Ē_v should be reduced. Provided that ρ_v(∆x/Z) is much smaller than ρ_a(M), the average error Ē_v can be reduced by decreasing the distance ∆x of the real rays in the ray space. However, it is difficult in practice to make the distance of the real rays very small because a camera has a physical size. We therefore propose to use the reference picture selection method to improve coding efficiency. For prediction of real rays in the positional dimension, not only the adjacent frame with the same time stamp but also other frames are set as candidate reference pictures. By this scheme, the average error Ē_v of the real rays is reduced. In addition, the reference picture selection method is also applied to the time dimension. Here the weighting coefficients w_v and w_t in Eq. (8) correspond to the selection ratios of disparity compensation and motion compensation, respectively. We earlier showed that coding efficiency could be improved by increasing the number of reference pictures when the frame rate was low for temporal prediction [11].

Moreover, to improve coding efficiency, the cost function J given by Eq. (1) must be minimized for all pixels to be encoded. The average error Ē corresponds to the SSD part in Eq. (1), and it is necessary to reduce it, but it is also necessary to reduce the number of bits R, e.g., for representing the disparity vectors.

2.4. Experimental results and discussion (without reference picture memory for Inter GOPs)

The coding efficiency of multiview video with the GoGOP structure is dependent on the reference relations of GOPs. Thus, experiments were conducted to evaluate the coding efficiency of the GRPS method while changing the camera distances, the structure of the reference picture memories, and the size of the reference picture memories. The evaluation was carried out in terms of the number of bits and the PSNR of the current GOP. First, experimental results are presented for the case in which temporal prediction is not applied to Inter GOPs. The GoGOP structure corresponds to Fig. 4(b). In this case, an Inter GOP has multiple frames, but it does not have reference picture memories to store decoded images. Figure 7 presents an example of the time stamps of reference pictures, and in particular Fig. 7(b) illustrates the case in which one frame decoding delay from a Base GOP is allowed. Table 1 summarizes the test sequence conditions. Table 2 summarizes the encoding conditions. We examined two sequences whose camera arrangement differed. The sequences used were provided by the MPEG 3DAV group. The sequence “Flamenco” is provided as a KDDI test sequence for multiview video [7], and the camera arrangement corresponds to Fig. 2(a). The sequence “Aquarium” was provided by Nagoya University [8], and the camera arrangement corresponds to Fig. 2(b). Note that in this experiment, no color correction was performed. Figure 8 shows examples of the multiview video used. In the figure, the left corresponds to the camera at the left edge, the middle corresponds to the camera in the middle, the right corresponds to the camera at the right edge, the top is the first frame, and the bottom is the final frame. All GOPs consisted of the same number of frames, all reference GOPs were encoded as Base GOPs, and the quantization parameter (QP) was set the same as for the current GOP. The coefficients a and b in Eq. (2) were calculated by Eq. (9) for correction of colors. The coefficient a was the ratio for all pixels in frame F:

The proposed coding method was implemented in accordance with H.264, and color correction by coefficients a and b in Eq. (2) was carried out by weighted prediction (WP) as specified in H.264.

Figure 9 shows the PSNR when view 5 was encoded with the number of reference pictures equal to 1, using view 4 or 3 as the reference GOP for the sequence “Flamenco,” and also the PSNR when view 8 was encoded with the number of reference pictures equal to 1, and with view 7 or 6 as the reference GOP for the sequence “Aquarium.” In the figure, “base” denotes the case in which all frames were encoded as Intra frames, and “GOPx” denotes the case in which the view was encoded with view x as the reference GOP.

We see from these results that the coding efficiency is improved by encoding as an Inter GOP for both sequences, and in addition that it becomes higher when the camera distances are shorter. This is because the prediction efficiency is improved when the distance ∆x of the real rays shown in Eq. (5) is small.
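For reference, the PSNR used as the quality measure in these comparisons follows the standard definition for 8-bit samples. The sketch below is only a reminder of that definition; the function name and the flat-list frame representation are illustrative assumptions and are not taken from the experimental software.

```python
import math
from typing import Sequence


def psnr(original: Sequence[int], decoded: Sequence[int], peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized sample lists."""
    if len(original) != len(decoded) or len(original) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error over all samples of the frame.
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mse)


# Two luminance samples, each off by one level: MSE = 1.
print(round(psnr([100, 100], [101, 99]), 2))
```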

Figure 10 shows the PSNR when view 5 was encoded using view 4 as the reference GOP, with two or more reference pictures, for the sequence “Flamenco.” We see from this figure that the coding efficiency is improved when the number of reference pictures is three and one frame delay is allowed, compared with the case in which the number of reference pictures is one. When the number of reference pictures was two or when no delay was allowed for three reference pictures, a coding gain of just a few percent was obtained. For the sequence “Aquarium,” no coding gain was obtained by increasing the number of reference pictures. We see from these results that the coding efficiency is improved when the number of reference pictures is increased in two different directions in the time dimension (e.g., to the past and to the future), although the gain depends on the features of the sequence. This implies that the prediction efficiency is improved when many adjacent real rays are used for prediction, as shown in Fig. 6(c).

We also considered the coding efficiency when the number of reference GOPs was increased. Figure 11 shows the PSNR when the number of reference GOPs was two for both sequences. These results are for the case in which both sides of the views are set to the reference GOPs and the case in which one side of the views is set to the reference GOPs. In the former case, views 6 and 4 were set to the reference GOPs for the sequence “Flamenco” and views 9 and 7 were set to the reference GOPs for the sequence “Aquarium.” In the latter case, views 4 and 3 were set to the reference GOPs for the sequence “Flamenco” and views 7 and 6 were set to the reference GOPs for the sequence “Aquarium.” In the figure, “v2_bidir” denotes the results in the former case, and “v2_unidir” denotes those in the latter case. We see from these results that the coding efficiency is improved when the number of reference GOPs is increased and that it is better when both sides of the views are set to the reference GOPs. This is because the prediction efficiency is improved by an increase in the number of candidate reference pictures; in particular, when both sides of the views are set to the reference GOPs, the prediction error is reduced by a decrease of the region S_u in which real rays do not overlap, as shown in Fig. 6(d).

Figure 12 shows an example of decoded images of the sequence “Aquarium” for QP equal to 36; in this case view 8 was encoded with GOP6 and GOP7 as reference GOPs. As shown in the figure, there are no noticeable blocking artifacts or afterimages. The absence of blocking artifacts is essentially the effect of deblocking filtering. However, blurring is evident, especially around the algae area, when GOP6 is set to the reference GOP. This is because the prediction efficiency is decreased when the disparity is large.

Fig. 7. Positions of reference pictures for low-delay decoding.

Table 1. Test sequences

Fig. 8. Examples of images used in experiments.

Fig. 9. PSNR for different reference GOP in the absence of reference picture for current GOP.

Fig. 10. PSNR for different numbers of reference pictures in GOP in the absence of reference

2.5. Experimental results and discussion (with reference picture memory for Inter GOPs)

Next, experimental results are presented for the case in which reference picture memory for Inter GOPs is provided and temporal prediction can be selected even for Inter GOPs. Even in this case, when the GOP length in the time direction is small, relatively low-delay random access to a requested view is possible. The encoding conditions were the same as in Table 2. All the decoded pictures in the reference picture memory of the Inter GOPs were discarded before the first picture of the next GOP was encoded.

Figure 13 shows the PSNR when view 5 was encoded as the Base GOP for the sequence “Flamenco” and when view 8 was encoded as the Base GOP for the sequence “Aquarium.” Figure 14 shows the PSNR when view 5 was encoded with view 4 or view 3 used as the reference GOP for the sequence “Flamenco,” and when view 8 was encoded with view 7 or view 6 used as the reference GOP for the sequence “Aquarium.” The number of reference pictures was two.

We see from Fig. 13 for the sequence “Aquarium” that the coding efficiency is improved as the number of reference pictures is increased when the view is encoded as Base GOPs. However, it is not improved noticeably for the sequence “Flamenco.” The reason why it is improved for the sequence “Aquarium” is that the prediction efficiency is improved by an increase in the number of candidate reference pictures in the time dimension. It is surmised that for the sequence “Aquarium” the tendency is noticeable because the frame rate is low and the distance ∆t of the real rays in the time dimension is large.

We see from Fig. 14 that for both sequences the coding efficiency is improved when views are encoded as Inter GOPs, and that it increases at shorter camera distances. This is because the prediction efficiency is improved as the distance ∆x of the real rays in Eq. (5) becomes smaller, similarly to the case discussed in Section 2.4. This tendency is noticeable for the sequence “Flamenco.” The above results show that, for the sequence “Flamenco,” the coding efficiency is improved more when the number of reference pictures is increased for disparity compensation than when it is increased for motion compensation.

Fig. 11. PSNR for different numbers of reference GOPs in the absence of reference picture for current GOP.

Fig. 12. Examples of decoded images for different reference GOPs.

The coding efficiency was also evaluated as the number of reference GOPs was increased. Figure 15 shows the PSNR when the number of reference GOPs was two for both sequences. Both sides of the views were set to the reference GOPs. Views 6 and 4 were set to the reference GOPs to encode view 5 for the sequence “Flamenco,” and views 9 and 7 were set to the reference GOPs to encode view 8 for the sequence “Aquarium.” The number of reference pictures for a GOP was two. We see from the results that the coding efficiency is improved as the number of reference GOPs is increased. This is because the number of candidate reference pictures is increased and prediction efficiency is improved by decreasing the region S_u where real rays do not overlap, as shown in Fig. 6(d), similarly to the case discussed in Section 2.4.

Figure 16 shows the reduction ratio of the amount of bits when view 5 was encoded as Inter GOPs for the sequence “Flamenco.” When view 4 was set to the reference GOP, the ratio was high, sometimes exceeding 50%, in the first frame where temporal prediction was not applied, and the ratio was still noticeable, sometimes exceeding 10%, in the successive frames. We see that disparity compensation contributes to an increase of coding efficiency in the frames other than the first. However, the ratio for the other frames is quite small, sometimes less than one-fifth, compared with the ratio for the first frame. It is considered that, for the sequence “Flamenco,” there are large regions of real rays in which the error E_t depending on the distance ∆t of the real rays is smaller than the error E_v depending on the distance ∆x of the real rays.

Fig. 13. PSNR for coding as Base GOP.

Fig. 14. PSNR for different reference GOP when current GOP has two reference pictures.

Fig. 15. PSNR for different number of reference GOPs when current GOP has two reference pictures.

3. Adaptive Disparity Compensation

3.1. Usage of camera arrangement

In the earlier sections, we proposed a coding method to improve coding efficiency by decreasing the average error Ē in multiview video coding. In this section, we propose a coding method to reduce the number of bits of the disparity vectors. Especially for a structure intended to achieve low-delay random access to views, highly efficient disparity compensation is necessary. In this paper a fixed camera arrangement is assumed, and that condition is utilized for improving coding efficiency.

3.2. Reference disparity vector prediction method

If the camera arrangement is fixed, the disparity can be assumed to change little over time. We therefore propose a reference disparity vector prediction method in which the disparity vector of the current frame is coded using the previous disparity vectors. The objective of this proposal is to reduce the number of bits for the disparity vectors.

The disparity vectors in the first frame of the GOPs are stored and used for coding the subsequent disparity vectors. As in Fig. 17, the bitstream of the first frame is divided into disparity information and texture information. The disparity information contains disparity vectors and mode information such as the block partitioning pattern and intra/inter mode information. After decoding the first frame, the disparity vectors are stored in the memory as reference vectors (rdv_x, rdv_y). In the successive frames, those reference vectors are loaded from memory and used for decoding of the disparity vectors. The disparity vector (dv_x, dv_y) is derived from the reference vector (rdv_x, rdv_y) and the differential vector (ddv_x, ddv_y) by the equation
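In component form, the derivation described here amounts to adding the decoded differential vector to the stored reference vector. The expression below is reconstructed from the sentence above, with subscripts denoting the horizontal and vertical components; the notation is assumed.

```latex
dv_x = rdv_x + ddv_x , \qquad dv_y = rdv_y + ddv_y
```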

When reference disparity vector prediction is not used, the disparity vector is derived in the same way as the motion vectors; namely, the predictive vector is set to the median of the surrounding blocks’ disparity vectors, and the disparity vector is obtained by adding the predictive vector and the differential vector.
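The two ways of deriving a disparity vector can be contrasted in a short sketch. It assumes integer vector components, exactly three neighbouring blocks for the median predictor, and a per-block reference vector stored from the first frame of the GOP; the function names are hypothetical and the neighbour selection rules of H.264 are not modelled.

```python
from statistics import median
from typing import List, Optional, Tuple

Vector = Tuple[int, int]  # (horizontal, vertical) components


def median_predictor(neighbours: List[Vector]) -> Vector:
    """Component-wise median of the surrounding blocks' disparity vectors."""
    return (median(v[0] for v in neighbours),
            median(v[1] for v in neighbours))


def decode_disparity_vector(ddv: Vector,
                            neighbours: List[Vector],
                            reference: Optional[Vector] = None) -> Vector:
    """Add the decoded differential vector `ddv` to the chosen predictor.

    With `reference` given, the stored reference vector (rdv) is the
    predictor; otherwise the median of the neighbours is used.
    """
    predictor = reference if reference is not None else median_predictor(neighbours)
    return (predictor[0] + ddv[0], predictor[1] + ddv[1])


neighbours = [(4, 0), (6, 2), (5, 1)]
# Reference disparity vector prediction: the predictor is the stored vector.
print(decode_disparity_vector((1, -1), neighbours, reference=(8, 0)))
# Conventional prediction: the predictor is the median of the neighbours.
print(decode_disparity_vector((1, -1), neighbours))
```

When the disparity changes little over time, the stored vector is close to the true one, so the differential vector that must be coded stays small.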

Figure 18 shows the results of the reference disparity vector prediction method. It shows the PSNR when view 8 was encoded while using view 7 or 3 or 1 as the reference GOP for the sequence “Flamenco,” and when view 15 was encoded while using view 14 or 7 or 1 as the reference GOP for the sequence “Aquarium.” In each experiment, only disparity compensation is applied. In the figure, “rdv” denotes the results of the reference disparity vector prediction method. The encoding conditions are the same as in Table 2. We see from the results that the coding efficiency is improved by the reference disparity vector prediction method, regardless of the camera distances.

Fig. 16. Reduction ratio of bit numbers.

Fig. 17. Bitstream structure of reference disparity vector coding.

Table 3 shows the ratio of the number of bits for the disparity vectors in the first frame of the GOPs for the sequence “Flamenco.” The ratio is larger for smaller camera distances. This is because the prediction efficiency improves as the distances ∆x of the real rays in Eq. (5) become smaller.

3.3. Adaptive disparity vector estimation method

The direction of the disparity is often the same as the direction in which the cameras are arranged. In this section, we propose an adaptive disparity vector estimation method that exploits this feature to decrease the number of bits for the disparity vectors. When a disparity vector is sought beyond the distances ∆x of the real rays given by Eq. (5), a region Sc where the real rays overlap exists, as shown in Fig. 6(d), and the prediction efficiency is improved. Therefore, the search range should be large, but a uniform increase in the search range increases the complexity. Thus, in the proposed method, the search range is determined from the above feature of the disparity vectors, and increased complexity is avoided by limiting the search accuracy.

In the base disparity vector search (the BDS method), as in motion search by the JM method [13] of H.264, the following procedure is applied to the luminance information.

(a1) The predictive disparity vector is derived.

(a2) The disparity vector is sought with integer-pel precision in the predetermined search range, and the derived predictive vector is set to the origin for the search.

(a3) The disparity vector is sought at half-pel precision at the surrounding eight positions.

(a4) The disparity vector is sought at quarter-pel precision at the surrounding eight positions.
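Steps (a2) to (a4) form a coarse-to-fine search, which may be sketched as follows. This is a sketch under stated assumptions rather than the JM implementation: vectors are held in quarter-pel units (4 units per pixel), the predictor is assumed to lie on an integer-pel position, and `cost` is a hypothetical callable standing in for the SAD measured on the (interpolated) luminance.

```python
def refine(cost, center, step_x, step_y):
    """Evaluate the center and its eight neighbors at the given step
    and return the candidate with the lowest cost."""
    cx, cy = center
    candidates = [(cx + i * step_x, cy + j * step_y)
                  for j in (-1, 0, 1) for i in (-1, 0, 1)]
    return min(candidates, key=cost)

def bds_search(cost, pred, search_range):
    """Three-stage search in quarter-pel units.

    pred         -- predictive vector from step (a1), quarter-pel units
    search_range -- integer-pel search range in pixels
    """
    px, py = pred
    # (a2) integer-pel full search centered on the predictive vector
    grid = [(px + 4 * i, py + 4 * j)
            for j in range(-search_range, search_range + 1)
            for i in range(-search_range, search_range + 1)]
    best = min(grid, key=cost)
    # (a3) half-pel refinement at the eight surrounding positions
    best = refine(cost, best, 2, 2)
    # (a4) quarter-pel refinement at the eight surrounding positions
    best = refine(cost, best, 1, 1)
    return best

# Synthetic cost with its minimum at (13, -5) in quarter-pel units.
toy_cost = lambda v: (v[0] - 13) ** 2 + (v[1] + 5) ** 2
found = bds_search(toy_cost, (0, 0), 8)
```

With a real SAD the cost surface is not convex, so the refinement stages only find the best position near the integer-pel winner.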

In the proposed adaptive disparity vector estimation method, either the base search (the BDS method) or an extended search in which the search range is doubled in the camera arrangement direction (the DDS method) is selectively applied. A judgment criterion based on the difference of luminance is used to determine which search method is applied. Provided that the cameras are arranged in the horizontal direction, the search method is determined in accordance with the flow illustrated in Fig. 19, using the judgment standard DL obtained by Eq. (11). In Eq. (11), (L0 + L1)/2 is the average difference of luminance in the same search range as used in the base, and (L0 + L1 + L2)/3 is the average difference of luminance in the search range that is double the range used in the base.

Fig. 18. PSNR for different reference GOP when applying reference disparity vector coding.

L0, L1, and L2 are calculated by Eq. (12). L0 is the average difference of luminance for the position where the current frame is at the same position as the reference frame (G = 0), L1 is that for the position where the current frame is shifted horizontally by R, that is, the search range of the base (G = R), and L2 is that for the position where the current frame is shifted horizontally by 2R, that is, twice the search range of the base (G = 2R). Region A is the region used for calculating the difference of luminance, and Na is the number of pixels in it.
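The quantities entering the judgment can be illustrated with a short Python sketch. Several points are assumptions made here because the text does not spell them out: the difference of luminance is taken as a mean absolute difference, region A is taken as the overlap of the two frames after the shift, and the sign of the shift is chosen arbitrarily. How Eq. (11) combines the two averages into DL, and the threshold used in Fig. 19, are not modelled.

```python
def mean_abs_diff(cur, ref, shift):
    """Mean absolute luminance difference between the current frame and the
    reference frame displaced horizontally by `shift` pixels. Only pixels
    where both frames overlap are counted (assumed region A)."""
    total = 0
    count = 0
    for row_c, row_r in zip(cur, ref):
        for x, value in enumerate(row_c):
            xr = x + shift
            if 0 <= xr < len(row_r):
                total += abs(value - row_r[xr])
                count += 1
    if count == 0:
        raise ValueError("shift leaves no overlapping pixels")
    return total / count

def range_averages(cur, ref, R):
    """Return ((L0 + L1)/2, (L0 + L1 + L2)/3) for base search range R."""
    l0 = mean_abs_diff(cur, ref, 0)
    l1 = mean_abs_diff(cur, ref, R)
    l2 = mean_abs_diff(cur, ref, 2 * R)
    return (l0 + l1) / 2, (l0 + l1 + l2) / 3

# One-row toy frames whose true displacement (2 pixels) exceeds R = 1.
cur = [[10, 20, 30, 40, 50, 60]]
ref = [[10, 10, 10, 20, 30, 40]]
base_avg, extended_avg = range_averages(cur, ref, 1)
```

In this toy case the average over the doubled range is lower than the average over the base range, which is the kind of situation in which extending the search range would be expected to help.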

The following procedure is applied for the DDS method.

(b1) The predictive disparity vector is derived.

(b2) The disparity vector is sought with integer-pel precision in the vertical direction and with two-pixel precision in the horizontal direction in the predetermined search range, and the derived predictive vector is set to the origin for the search.

(b3) The disparity vector is sought with half-pel precision in the vertical direction and integer-pel precision in the horizontal direction at the surrounding eight positions.

(b4) The disparity vector is sought with quarter-pel precision in the vertical direction and half-pel precision in the horizontal direction at the surrounding eight positions.

In the adaptive disparity vector estimation method, the number of search positions is the same as in the base search method BDS, so the complexity with respect to the number of search positions is not increased. The determination of the search method by Eqs. (11) and (12) adds negligible complexity because it involves a maximum of three calculations of SAD. Note that the disparity vector is obtained at half-pel accuracy for DDS, and is coded as a half-pel accuracy vector. On the other hand, if the search range is simply extended, the complexity is greatly increased. The number of calculations of SAD is a measure of the complexity of the search, and the processing time for the search constitutes about 80% of the encoding time for a frame. Therefore, if the search range is simply doubled in both directions, the number of calculations of SAD increases by a factor of 4, and the search alone takes about 3.2 times the original encoding time of a frame (0.8 × 4). If the search range is extended only after the direction of extension has been determined, the number of calculations is still doubled, and the search alone takes about 1.6 times the original encoding time of a frame (0.8 × 2).
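The position counts and timing factors can be checked with a few lines of Python. The search range R = 16 is a hypothetical value chosen for illustration, and the timing model assumes that everything outside the search keeps its original cost. Note that the factors 3.2 and 1.6 quoted above equal 0.8 × 4 and 0.8 × 2, that is, they cover the search share only; adding the remaining 20% of the frame time gives about 3.4 and 1.8 under this model.

```python
def integer_stage_positions(range_x, range_y, step_x=1, step_y=1):
    """Number of positions visited by the first search stage for a window
    of +/-range_x by +/-range_y pixels sampled at the given steps."""
    nx = len(range(-range_x, range_x + 1, step_x))
    ny = len(range(-range_y, range_y + 1, step_y))
    return nx * ny

def frame_time_factor(search_share, sad_factor):
    """Relative frame encoding time when the SAD count is multiplied by
    sad_factor and the non-search share of the time is unchanged."""
    return (1.0 - search_share) + search_share * sad_factor

R = 16  # hypothetical base search range in pixels
bds_positions = integer_stage_positions(R, R)                 # step (a2)
dds_positions = integer_stage_positions(2 * R, R, step_x=2)   # step (b2)
doubled_both = integer_stage_positions(2 * R, 2 * R)
doubled_horizontal = integer_stage_positions(2 * R, R)

total_x4 = frame_time_factor(0.8, 4)   # search share 0.8, four times the SADs
total_x2 = frame_time_factor(0.8, 2)   # search share 0.8, twice the SADs
```

For R = 16, the BDS and DDS first stages both visit 33 × 33 positions, while doubling the range at full precision visits 65 × 65 positions (close to four times as many) or 65 × 33 positions (close to twice as many) when only the horizontal range is doubled.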

Figure 20 compares the DDS and BDS methods. It shows the PSNR when view 8 was encoded with view 7, 3, or 1 as the reference GOP for the sequence “Flamenco,” and when view 15 was encoded with view 14, 5, or 1 as the reference GOP for the sequence “Aquarium.” In each experiment, the reference disparity vector prediction method was used. Figure 21 shows the values of DL derived from Eq. (11). The encoding conditions were the same as for Table 2.

We see from Fig. 20 that the coding efficiency is improved by extension of the search range when the distance between cameras is large for the sequence “Flamenco,” but the coding efficiency is not improved for the sequence “Aquarium.” This is because the camera arrangement is in an arc and the extension of the region in which the real rays overlap on the standard plane in the ray space is small. Figure 21 shows the validity of the judgment standard using Eq. (11) to determine whether the search range is extended or not. Thus, the proposed adaptive disparity vector estimation method is effective.

Fig. 19. Flow of determination of disparity search methods in the adaptive disparity vector search method.

Fig. 20. PSNR for different reference GOPs, comparing BDS and DDS methods in disparity search.

As additional experimental results, the DDS method and a method using horizontal quarter-pel disparity prediction were compared. In the latter method, a horizontal quarter-pel search was carried out after step (b4) in the flow of DDS. In Fig. 22, “qpel” denotes the results of the latter method. We see from the results that the coding efficiency is higher at half-pel accuracy regardless of the distance between cameras. This is because the number of bits for the disparity vectors is smaller at half-pel accuracy. The coding efficiency of quarter-pel disparity compensation shown in Fig. 22 is lower than that of the BDS method without search range extension shown in Fig. 20. This degradation of coding efficiency results from the two-pel search in step (b2) of the DDS method.

3.4. Filter coefficients for disparity compensation

As shown in the previous section, half-pel disparity compensation combined with the adaptive disparity vector estimation method provides better coding efficiency when the distance between cameras is large. Thus, in this section the filter coefficients used to generate images at half-pel positions are discussed. In H.264, which serves as the basis for this paper, the images at the half-pel positions are generated by a six-tap Wiener filter. The Wiener filter improves coding efficiency for high-definition images [14, 15]. On the other hand, a two-tap filter is used for the quarter-pel positions, which achieves low-pass filtering effects.
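The difference between the two filters can be seen on a single row of luminance samples. The sketch below uses the six-tap coefficients (1, -5, 20, 20, -5, 1)/32 of H.264 for the half-pel position; the two-tap filter is assumed here to be a rounded average of the two nearest samples, and the edge replication and function names are illustrative choices.

```python
def clip8(value):
    """Clip to the 8-bit sample range."""
    return max(0, min(255, value))

def half_pel_six_tap(row, x):
    """Half-pel sample between row[x] and row[x + 1] using the taps
    (1, -5, 20, 20, -5, 1) with rounding and division by 32."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = 0
    for k, tap in enumerate(taps):
        idx = min(max(x - 2 + k, 0), len(row) - 1)  # replicate edge samples
        acc += tap * row[idx]
    return clip8((acc + 16) >> 5)

def half_pel_two_tap(row, x):
    """Half-pel sample as the rounded average of the two nearest samples
    (assumed form of the two-tap filter)."""
    return (row[x] + row[x + 1] + 1) >> 1

row = [0, 0, 100, 200, 100, 0]      # a narrow luminance peak
sharp = half_pel_six_tap(row, 2)    # follows the peak more closely
smooth = half_pel_two_tap(row, 2)   # plain average, stronger smoothing
```

On this peak the six-tap filter returns a value above the plain average of the two neighbors, while the two-tap filter returns exactly that average, which illustrates its stronger low-pass behavior.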

Figure 23 shows the average ratio of the numbers of bits when a six-tap filter and a two-tap filter are used for the horizontal half-pel positions. The PSNR was almost the same in both cases. The average ratio ave_ratio of the number of bits was calculated by Eq. (13), where Num2 and Num6 are the numbers of bits for a two-tap filter and a six-tap filter, respectively. The results for the four QP values shown in Table 2 are averaged.

Fig. 21. Value of DL.

Fig. 22. PSNR for different reference GOPs when applying quarter-pel disparity compensation.


We see from the results that the coding efficiency tends to be higher with the two-tap filter when the distance between cameras is large. This suggests that, as the distance ∆x of the real rays in the ray space increases and the prediction efficiency degrades, the low-pass filtering effect of the two-tap filter becomes more beneficial.

4. Conclusions

We propose the GoGOP structure, the GOP adaptive reference picture selection method (GRPS), the reference disparity vector prediction method, and the adaptive disparity vector estimation method to improve the coding efficiency of multiview video coding for free-viewpoint video communications. The GoGOP structure and GRPS achieve high coding efficiency, and low-delay random access of a view is provided. It is shown that the coding efficiency is improved by an increase in the number of reference pictures and an increase in the number of reference GOPs. It is also shown that the reference disparity vector prediction method and the adaptive disparity vector estimation method improve disparity compensation when the distance between cameras is large.

In free-viewpoint video communications, view scalability is necessary for changes of viewpoints by communication. A flexible coding rate control method and a mechanism to guarantee identity of the received view and the requested view are also needed, even when a round-trip delay exists between the transmitting and receiving sides [3]. Such a rate control method and communications protocol are items for further study. A decision method for choosing Base GOPs and reference GOPs is also a subject of further study, with the objective of achieving high coding efficiency in multiview video while the delay is kept within the tolerance level.

Fig. 23. Average reduction ratio “ave_ratio” of bit

REFERENCES

1. Tanimoto M, Fujii T. FTV - Free viewpoint television. M8595 MPEG Klagenfurt Document, 2002.

2. Tanimoto M. Free viewpoint television - Using multiviewpoint image processing. J Inst Image Inf Telev Eng Japan 2004;58:898–901. (in Japanese)

3. Kimata H, Kitahara M, Kamikura K, Yashima Y, Fujii T, Tanimoto M. System design of free viewpoint video communication. CIT2004.

4. Hata K, Etoh M, Chihara K. Coding of multi-viewpoint images. Trans IEICE 1999;J82-D-II:1921–1929. (in Japanese)

5. Lim JE, Ngan KN, Yang W, Sohn K. A multiview sequence CODEC with view scalability. Signal Process Image Commun 2004;19:239–256.

6. Fujii T, Kimoto T, Tanimoto M. Ray space coding for 3D visual communication. PCS’96 Vol. 2, p 447–451.

7. Kawada R. KDDI multiview video sequences for MPEG 3DAV use. M10533, MPEG Munich Document, 2004.

8. Report on 3DAV exploration. N5878 MPEG Trondheim Document, 2003.

9. Kimata H, Kitahara M. Framework on free-viewpoint video with shared memory video coding. M11232 MPEG Palma Document, 2004.

10. Kimata H, Kitahara M, Kamikura K, Yashima Y. Multi-view video coding using reference picture selection for freeviewpoint video communication. PCS2004.

11. Kimata H, Kitahara M, Kamikura K, Yashima Y. Temporal scalable video coding with hierarchical reference picture selection method. Electron Commun Japan (Part III) 2006;89:1–14.

12. Sullivan GJ, Wiegand T. Rate-distortion optimization for video compression. IEEE Signal Process Mag 1998;15:74–90.

13. Lim K-P, Sullivan GJ, Wiegand T. Text description of joint model reference encoding methods and decoding concealment methods. JVT-K049 JVT Munich Document, 2004.

14. Girod B. Motion-compensating prediction with fractional-pel accuracy. IEEE Trans Commun 1993;41:604–612.

15. Wedi T. Adaptive interpolation filter for motion compensated hybrid video coding. PCS2001, p 49–52, 2004.

16. Kimata H. Movement on MPEG 3DAV toward international standardization of 3D video. Tech Rep Inf Process Soc Japan 2005, No. 23, 2005-AVM-48, p 49–54. (in Japanese)

17. Fujii T, Kimoto T, Tanimoto M. Data compression of 3-D spatial information based on ray-space coding. J Inst Image Inf Telev Eng Japan 1998;52:356–363. (in Japanese)

18. Kimata H, Yashima Y, Kobayashi N. Time adaptive motion estimation method for software-based real-time video coding. 2001 IEEE International Conference on Multimedia and Expo (ICME) Vol. 1, p 329–330.

AUTHORS (from left to right)

Hideaki Kimata (member) received his B.E., M.E., and Ph.D. degrees from Nagoya University in 1993, 1995, and 2006. He joined Nippon Telegraph and Telephone Corporation (NTT) in 1995, and has been engaged in research on picture coding, error tolerance, and image communications systems. His research interest includes 3D video signal processing. He is currently a Senior Research Engineer at NTT Cyber Space Laboratories.

Masaki Kitahara (member) received his B.E. and M.E. degrees in industrial and management systems engineering from Waseda University in 1999 and 2001 and joined NTT. He has been engaged in R&D of data compression for image-based rendering and H.264 encoding algorithms. His research interests include signal processing methods for 3D applications and video compression.

Kazuto Kamikura (member) received his B.E. and M.E. degrees in electrical engineering from Tokyo Science University in 1984 and 1986 and joined Nippon Telegraph and Telephone Corporation (NTT). He has been engaged in research and development for video coding systems. His current research interests include digital image processing and video coding. He is currently a Senior Research Engineer, Supervisor of the Visual Media Communications Project at NTT Cyber Space Laboratories.

Yoshiyuki Yashima (member) received his B.E., M.E., and Ph.D. degrees from Nagoya University in 1981, 1983, and 1998. In 1983 he joined the Electrical Communications Laboratories, Nippon Telegraph and Telephone Corporation (NTT), where he has been engaged in the research and development of high-quality HDTV signal compression, MPEG video coding algorithm, and lossless image coding system. His research interests also include pre- and postprocessing for video coding, processing of compressed video, compressed video quality metrics, and image analysis for video communication system. He is currently a Senior Research Engineer, Supervisor of the Visual Media Communications Project at NTT Cyber Space Laboratories. He has also been a visiting professor at Tokyo Institute of Technology since 2004. He was awarded the Takayanagi Memorial Technology Prize in 2005. He is a member of the IEEE Signal Processing Society, the Information Processing Society of Japan, IEICE, and the Institute of Image Information and Television Engineers of Japan (ITE).


Toshiaki Fujii (member) received his B.E., M.E., and D.Eng. degrees in electrical engineering from the University of Tokyo in 1990, 1992, and 1995. He is currently an associate professor in the Graduate School of Engineering of Nagoya University. His research interests include 3D image processing and 3D visual communications.

Masayuki Tanimoto (member; Fellow) received his B.E., M.E., and D.Eng. degrees in electronic engineering from the University of Tokyo in 1970, 1972, and 1976. He joined Nagoya University and has been a professor in the Department of Electrical Engineering and Computer Science, Graduate School of Engineering. He received the Ichimura Award, TELECOM System Technology Award, ITE Niwa-Takayanagi Best Paper Award, and IEICE Achievement Award. He was a chairperson of the Technical Group on Communication Systems of IEICE and a councilor of IEICE and ITE. He was also the Vice President of ITE. He is a Fellow of IEICE and ITE. His current research interests include image communication, image coding, image processing, 3D images, and ITS.