Video Signal Processing and Coding on Data-Parallel Computers
P. Moulin 1
, A. T. Ogielski 1
, G. Lilienfeld 2
, and J. W. Woods 2
1 Bell Communications Research 445 South Street
Morristown, NJ 07960
2 ECSE Department, and Center for Image Processing Research
Rensselaer Polytechnic Institute Troy, NY 12180-3590
submitted to Digital Signal Processing, August 1994
Corresponding author:
Pierre Moulin, Bellcore 2M-393, 445 South St., Morristown, NJ 07960.
Tel. (201) 829-5021, fax (201) 829-4391, email [email protected]
Research at Rensselaer partially supported by ARPA contract F19628-91-K-0031.
Abstract
Special purpose hardware has been traditionally viewed as the only practical solution for high speed Video Signal Processing (VSP). However, new parallel computing technologies may provide a much more flexible alternative. In this paper we discuss techniques for implementing VSP algorithms on data-parallel computers, including data distribution and the tradeoffs between memory usage, communication, and computation. We provide theoretical analyses and illustrate them with examples of implementations written for the MasPar data-parallel computers. The algorithms studied here are a selection of classical algorithms that includes block-DCT coding, subband coding and block-matching motion estimation. Additionally, two new algorithms by the authors are presented, on intraframe nonorthogonal subband coding and on motion compensated 3-D subband coding.
1 Introduction
The concurrent progress in digital video signal processing (VSP) technologies and digital video transmission on the one hand, and in the graphical user interfaces and multimedia applications in desktop computers on the other hand, force the merging of these previously separate entities. Such convergence is expected to have a significant economic impact: widely available digital video will create a market for a multitude of new application software packages involving operations such as video editing, enhancement, computer vision, and video query languages and databases.
Integration of VSP with other applications suggests that VSP algorithms should preferably be executed on general-purpose computing platforms. Unfortunately, the computing requirements of VSP algorithms often overwhelm single-processor, sequential computers. Moreover, it can be expected that gains in microprocessor speed will be outpaced by a demand for higher image resolution. Let us, therefore, review the key computational characteristics of the VSP algorithms:
Very large amounts of data (examples: NTSC, HDTV, and super HD video),
Very large computing requirements (typically, over $10^2$ arithmetic operations per pixel),
High requirements for I/O and transport speed,
Natural potential for parallel execution, resulting from the locality of computations in space and time for intraframe and interframe processing, respectively.
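To give a feel for the orders of magnitude behind these characteristics, the following sketch estimates the sustained arithmetic rate needed for one video stream. The frame rate of 30 frames per second is an assumption made purely for illustration; the figure of $10^2$ operations per pixel comes from the list above, and the 288 × 360 frame size is the CIF example used later in the paper.

```python
def ops_per_second(width, height, frames_per_second, ops_per_pixel):
    """Sustained arithmetic rate needed to process one video stream."""
    return width * height * frames_per_second * ops_per_pixel

# Assumed parameters: 360 x 288 frame, 30 frames/s, 100 operations per pixel.
rate = ops_per_second(360, 288, 30, 100)
print(f"{rate / 1e6:.0f} million operations per second")
```

Larger formats scale this estimate linearly with the number of pixels per frame.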
The computational demands of VSP can be met in several ways. First, one may use dedicated, special-purpose hardware, from dedicated subsystems (plug-in boards with special chip sets or attached devices) to special-purpose parallel computers such as, e.g., the Princeton Engine [1]. Special hardware has been a preferred solution to date, dictated by cost considerations and the insufficient performance of CPUs. The major limitation of this approach is its limited programmability and interoperability, and therefore reduced flexibility for applications.
Another solution is to use general-purpose parallel computers. At present they are considerably more expensive than desktop computers, but one may reasonably expect the continuation of cost reduction to commodity level, especially with small attached array processors or even board-level products. In addition, since VSP is a rapidly evolving field, special-purpose hardware rapidly becomes obsolete while general-purpose computer software can be made portable to newer or upgraded platforms. Such a perspective makes the investigation of parallel computing for VSP even more important.
Already at this time there are a number of VSP applications where dedicated, special-purpose hardware is not economically justified, and a high degree of programming flexibility is absolutely required. They include, among others, 1) algorithm development and fine-tuning, 2) special applications such as video editing, restoration, or enhancement, 3) three-dimensional (3-D) image processing in biomedical and scientific applications, and, last but not least, 4) video coding. Even though standards for video decoding exist or are being established (e.g., MPEG, MPEG II), standards for video coding are not expected because of the great diversity of user needs and applications. This leaves much flexibility in the coder implementation, which may be very desirable. For instance, artistic directors may need to maintain control over what visual artifacts are or are not acceptable depending on the scene or program to be broadcast. In all of the applications above, the use of general-purpose computers increases the engineer's ability to quickly develop new ideas and algorithms, test these on sufficiently long and varied image test sequences, and extensively explore the design parameter space.
Our discussion is focused on 2-D mesh-connected processor arrays, which appear most suitable for a broad class of VSP tasks and have been used as general-purpose platforms for a variety of problems in related areas such as computer vision and image understanding [2, 3, 4, 5]. In VSP, parallel algorithms have been proposed for vector quantization [6], split-and-merge block-motion estimation [7], and entropy coding and arithmetic coding [8]. For most VSP algorithms it is sufficient to restrict attention to the data-parallel computers, also called Single Instruction Multiple Data (SIMD). It has been demonstrated in diverse applications that the data-parallel model remarkably simplifies parallel software engineering.
To make our discussion more concrete, we illustrate our study with several specific implementations of video intraframe and interframe coding algorithms. In particular, we discuss the important issues that arise in implementation of VSP algorithms on parallel computers. Such issues include schemes for data distribution (or layout) to processors, layout transformations, I/O and inter-processor communication schemes. We then review the actual implementations of selected algorithms on two MasPar computers, one a fully configured 16k-processor MP-1 at Bellcore, and the other a smaller 2k-processor machine at Rensselaer.
The review provides the theoretical analysis as well as actual performance data for a selection of classical and novel algorithms. The first category includes block-Discrete Cosine Transform (DCT) and 2-D subband coding schemes, as well as three block-matching motion estimation algorithms. The second category includes an iterative optimization algorithm for nonorthogonal 2-D subband coding [9] and a 3-D subband coding algorithm [10] introduced by the authors. We choose to concentrate on encoders since the parallel implementation issues for decoders are similar.
A number of VSP algorithms may not be directly amenable to parallel implementations. Examples include spatially recursive algorithms such as spatial DPCM coding and certain adaptive algorithms such as adaptive Huffman coding. Some of these algorithms exhibit a certain degree of parallelism while others may warrant the use of hybrid sequential/parallel schemes. Such algorithms are not discussed further in this paper.
2 Parallel Computers
2.1 Overview
We focus on the logical view of parallel computing in algorithm and software design rather than on hardware. The essential characteristics of a parallel computer are as follows:
1. The number of processing elements (PEs),
2. The topology of the PE interconnection network,
3. The architecture of local and/or global memory.
We have found that many VSP tasks can be efficiently implemented when the local PE memories are explicitly considered. The physical interconnection topology need not limit the possible communication patterns. The emulation of one network by another enables software portability across different parallel architectures [11, 12].
It has been convenient to classify parallel computer architectures into two families: the data-parallel machines and program-parallel machines.
In a data-parallel (SIMD) computer all PEs execute essentially the same program, although they may contain different data. There is a distinguished processor (controller) executing the program and broadcasting instructions which are synchronously acted upon by all active PEs. Nevertheless, the tasks executed by individual PEs may differ somewhat: a PE may be inactive and ignore the instruction, it may store local pointers to its local memory (indirect addressing), it may locally control conditional statements, etc.
In a program-parallel (MIMD) computer each PE executes its own program. The logical model is that of many processes communicating by passing messages, in an inherently asynchronous environment. Although synchronization may be hardware-assisted, it needs to be explicitly handled by the programmer. A program-parallel computer can be used in the data-parallel mode. This is usually aided by compiler support for data-parallel languages.
The VSP applications in general do not require the full flexibility of program-parallel computers. Since in addition the SIMD design is simpler and more economical, data-parallel computing appears as the method of choice for parallel VSP.
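The data-parallel execution model described above can be mimicked in a few lines of sequential code. The sketch below is only a conceptual illustration, not MasPar code: the function name `simd_step` and the activity rule are hypothetical. A controller broadcasts one instruction, every active PE applies it to its own local data, and inactive PEs ignore it.

```python
def simd_step(local_data, active, instruction):
    """Apply one broadcast instruction on every active PE; inactive PEs keep their data."""
    return [instruction(x) if on else x for x, on in zip(local_data, active)]

# Four PEs holding different local data; the controller broadcasts "double your value".
data = [1, 2, 3, 4]
# Each PE decides locally whether it is active (here: only PEs holding even values).
active = [x % 2 == 0 for x in data]
result = simd_step(data, active, lambda x: 2 * x)
print(result)  # [1, 4, 3, 8]
```

The activity mask plays the role of a locally controlled conditional statement: all PEs see the same instruction stream, but only some of them act on it.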
The 2-D mesh-interconnection network appears particularly well suited for VSP applications. However, this only means that a parallel computer should appear as a 2-D mesh to the program. In fact, a large variety of physical networks have actually been implemented in various parallel computers: hypercubes, butterflies, binary trees, fat trees, and multidimensional meshes (grids) with and without toroidal wrap-around. Currently, for engineering and economic reasons the trend is towards the choice of 2-D and 3-D mesh-connected machines.
All this points to algorithms and software design as the most important aspects of parallel computing. Nowadays, programming tends to be more expensive than hardware, creating demand for code portability. This is being aided by new software technologies. Programming languages for parallel machines are rapidly maturing.
At present all parallel machines support extended versions of C and FORTRAN, and efforts towards standardization are well underway. This implies that currently some programs may not be portable across different parallel machines without the programmer's intervention. However, well designed software (i.e., incorporating the basic elements of object-oriented style, such as data abstraction, encapsulation, and modularity) can be systematically ported from one parallel architecture to another, with localized and often minor program changes.
2.2 The MasPar Computer
The algorithms discussed in this paper have been implemented and tested on the MasPar MP-1 computers, produced by the MasPar Computer Corp. in Sunnyvale, CA [13]. The MasPar is a SIMD machine with processing elements logically configured in the shape of a 2-D rectangular array. The array size depends on the machine model; the computers used in this study have been a $64 \times 32$ = 2,048 processor machine at Rensselaer, and a $128 \times 128$ = 16,384 processor machine at Bellcore. Interprocessor communication is supported by two independent interconnection networks. One network is a 2-D mesh, with toroidal wrap-around and nearest-neighbor connections along the array rows, array columns, and diagonals. This network connects each PE to its 8 neighbors, and provides efficient data communications along the array rows, columns and diagonals. Support for arbitrary interprocessor communication patterns is provided by another network, a "global router" whose topology is transparent to the user.
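The connectivity of the mesh network can be made concrete with a small sketch. The helper below is a hypothetical illustration (it is not part of MPL): it lists the 8 neighbors of a PE on a rectangular array with toroidal wrap-around, so that PEs on the array border are linked to the opposite edge.

```python
def mesh_neighbors(row, col, n_rows, n_cols):
    """Return the 8 nearest neighbors of PE (row, col) on a toroidal 2-D mesh."""
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    return [((row + dr) % n_rows, (col + dc) % n_cols) for dr, dc in offsets]

# Corner PE of a 128 x 128 array: wrap-around links reach the opposite edges.
print(mesh_neighbors(0, 0, 128, 128))
```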
Although machine performance cannot be meaningfully characterized by a single number, it is of interest to provide certain reference benchmarks. A 16k-processor MasPar MP-1 can achieve a computing speed of over 1 Gigaflop for single-precision calculations, and over 3 billion additions and logical operations per second for 32-bit integers. The rates for 8-bit variables are considerably higher.
The local memory at each PE is 64 kB on the 16k-processor machine, and 16 kB on the 2k-processor machine. This facilitates the use of local lookup tables and data-replication techniques for speeding up computations.
All programs described in this paper have been written in MPL, which is a data-parallel extension of the ANSI standard C language [14] supported on MasPar computers.
2.3 Data Layout
The key to efficient data-parallel computing is the design of distributed data structures that can balance the computational load on individual processors, and that provide for maximum parallelism of interprocessor data communications. The elements of a single data object, such as the pixels of a frame, can be distributed into the PEs in a variety of ways, depending on the intended computations and their interprocessor communication patterns. A bad choice of data layout usually results in extreme performance degradation.
Transformations among different data layouts for a single data object and the matching of data layout with intended computations are among the most important issues of parallel computing. For illustration consider the operations on a video frame on a rectangular processor array such as a MasPar computer. A frame is read into the computer from a file or a video source as a 1-D stream, usually in line-by-line scan order. A parallel read distributes the pixels to PEs in lexicographic order. This is the input data layout, which in general will not be the one desired by the computations. A favorite frame data layout that is well suited to VSP algorithms on mesh-connected computers is the (rectangular) tiling layout, with tile size determined by the algorithm and the computer performance characteristics. These include the computation speed, memory reference time, and interprocessor communication latency and bandwidth.
In the tiling layout a frame of size $N_x \times N_y$ is partitioned into rectangular tiles of equal size $M_x \times M_y$, and each tile is assigned to one processor of the array. A concrete example is the allocation of $8 \times 8$ blocks of a $288 \times 360$ CIF frame to a $36 \times 45$ PE array, preserving the correspondence between nearest-neighbor blocks and nearest-neighbor PEs. In order for the frame to fit on a rectangular processor array of size $P_x \times P_y$, we should have $P_x M_x \ge N_x$ and $P_y M_y \ge N_y$. For simplicity we will assume that the processor array is large enough to support direct tiling of the required size. If that is not the case for the array being used, standard parallel programming techniques solve this problem. One is programming with virtual processors, i.e., a larger virtual PE array is emulated on a smaller physical array transparently to the user. Alternatively, explicit cut-and-stack methods can be used.
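The tiling layout amounts to a simple index computation, sketched below. The function names `tile_owner` and `array_fits` are hypothetical; the numbers in the usage lines are those of the CIF example above.

```python
def tile_owner(x, y, tile_x, tile_y):
    """Map pixel (x, y) to (PE coordinates, offset inside that PE's tile)."""
    pe = (x // tile_x, y // tile_y)
    local = (x % tile_x, y % tile_y)
    return pe, local

def array_fits(n_x, n_y, tile_x, tile_y, p_x, p_y):
    """Check the tiling constraints P_x M_x >= N_x and P_y M_y >= N_y."""
    return p_x * tile_x >= n_x and p_y * tile_y >= n_y

# CIF example from the text: 288 x 360 frame, 8 x 8 tiles, 36 x 45 PE array.
print(array_fits(288, 360, 8, 8, 36, 45))   # True
print(tile_owner(287, 359, 8, 8))           # ((35, 44), (7, 7))
```

Because neighboring pixels map to the same PE or to adjacent PEs, local image operations translate into local memory accesses plus nearest-neighbor communication.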
3 Intraframe Coding
In this section we discuss three intraframe video coding algorithms: block-DCT coding, 2-D (or spatial) subband coding, and a new algorithm for nonorthogonal 2-D subband coding using iterative optimization techniques.
3.1 Block-DCT Coding
The block-DCT transform and its inverse are commonly used in video coding and decoding applications such as the standards MPEG and H.261. Let $B \times B$ denote the size of the blocks. It is assumed that the frame dimensions $N_x$ and $N_y$ are multiples of $B$ and that $B$ is a power of 2. There are no dependencies between blocks, so the block-DCT transform and its inverse can be executed completely in parallel for the entire frame.
A tile of $M_x \times M_y$ frame pixels is mapped onto each PE. How should $M_x$ and $M_y$ be selected? Let $N_B = \frac{N_x N_y}{B^2}$ be the number of blocks per frame. When $M_x = M_y = B$, one block is assigned to each PE, so no inter-processor communication is needed. This implementation is straightforward, but when $N_B < P_x P_y$, only a fraction $N_B / (P_x P_y)$ of the PEs are active. A better technique that balances the load on the PEs consists in partitioning each block into $M_x \times M_y$ subblocks, where $M_x$ and $M_y$ are powers of 2, and exploiting the butterfly structure of the DCT algorithm. This is done by performing independent DCT transforms on each subblock and then recombining the results. A similar technique has been applied to the related FFT algorithm in [15]. Inter-processor communication is required only for data shuffling prior to the DCT and for recombination of the partial DCTs. Since adding $p$ data items stored in $p$ adjacent processors requires $\lceil \log_2 p \rceil$ inter-processor communications per PE, the number of inter-processor communications required here is $M_x M_y \log_2 \frac{B^2}{M_x M_y}$. The number of computations per PE is $M_x M_y \log_2 B^2$. The execution time is monotonically increasing with $M_x$ and $M_y$, so the tiles should be made as small as possible subject to the constraints $P_x M_x \ge N_x$ and $P_y M_y \ge N_y$. The block-DCT transform on $8 \times 8$ blocks has been implemented on the 16k-processor MasPar and applied to $1024 \times 1024$ frames ($M_x = M_y = B = 8$). The execution time for our parallel implementation is 13 msec [16].
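The independence of the blocks can be illustrated with a sequential sketch of the one-block-per-PE case ($M_x = M_y = B$). This is not the MPL implementation and does not use a fast butterfly algorithm: the DCT is evaluated directly from its definition, with an orthonormal scaling that is an assumption of this sketch. The two block loops are the part that a SIMD array executes in parallel.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a sequence (direct O(N^2) evaluation)."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def block_dct(frame, b):
    """Transform every b x b block independently (one block per PE when M_x = M_y = B)."""
    n_x, n_y = len(frame), len(frame[0])
    out = [[0.0] * n_y for _ in range(n_x)]
    for bx in range(0, n_x, b):          # on a SIMD array these two loops run in parallel
        for by in range(0, n_y, b):
            blk = [row[by:by + b] for row in frame[bx:bx + b]]
            for i, r in enumerate(dct_2d(blk)):
                out[bx + i][by:by + b] = r
    return out

frame = [[10.0] * 16 for _ in range(16)]  # constant 16 x 16 test frame
coeffs = block_dct(frame, 8)
print(round(coeffs[0][0], 6), round(coeffs[8][8], 6))  # DC terms of two blocks
```

For a constant frame all the energy of each block ends up in its DC coefficient, which gives a quick sanity check of the transform.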
The block-DCT is followed by quantization of the transformed image. When the quantizers are fixed, the quantization operations are independent for single pixels (scalar quantization) or blocks of pixels (vector quantization), so these operations can be performed in parallel for the entire frame. Scalar quantization is very fast since it can be implemented with integer division and a lookup table. Parallel vector quantization has been studied in [6]. When dynamic bit allocation is used to avoid buffer overflow (e.g., macroblock bit allocation for the H.261 standard), full parallelization of the quantization operations is generally not possible [16, 17].
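A fixed scalar quantizer built from an integer division and a lookup table can be sketched as follows. The step size, the number of levels, and the midpoint reconstruction values are illustrative assumptions, not the quantizers used in the reported experiments; nonnegative integer inputs are assumed.

```python
def make_quantizer(step, levels):
    """Build a uniform scalar quantizer: index by integer division, value by table lookup."""
    # Reconstruction table: midpoint of each quantization cell (an illustrative choice).
    table = [i * step + step // 2 for i in range(levels)]

    def quantize(pixel):
        index = min(pixel // step, levels - 1)   # integer division selects the cell
        return index, table[index]               # table lookup gives the output level
    return quantize

quantize = make_quantizer(step=16, levels=16)     # 8-bit input, 16 output levels
print([quantize(p) for p in (0, 17, 100, 255)])
```

Since each PE can hold its own copy of the table in local memory, every pixel is quantized without any inter-processor communication.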
3.2 Subband Coding
The use of 2-D subband coding has been shown to be an effective technique for the coding of still images [18, 19]. It is also widely used in hybrid video coders. A typical 2-D subband encoder used in intraframe video coding could be represented by the system diagrammed in Fig. 3. The first step is to spatially analyze the input into $K$ subbands; typical numbers for $K$ are 7, 10, or 16. Ideally, the subband analysis process produces $K$ independent subbands each containing uncorrelated data. The subbands which are still highly internally correlated would be coded with either spatial DPCM or block-DCT methods. The remaining subbands would be coded directly with a PCM method.
When using separable 2-D filters, the filtering operations to be performed in each band have the typical form
$$ y(k,l) = \sum_{n=0}^{L-1} x(2k-n,\, l)\, h(n) \qquad (1) $$
where $x$ is the data and $\{h(n),\ 0 \le n < L\}$ is the impulse response of an $L$-tap subband filter. Although the bands are generated and processed sequentially, the filtering operations express local spatial interactions which can be performed in parallel within each band and are identical for all bands.
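The decimating filter (1) can be sketched along one dimension as follows. The boundary rule (periodic extension) and the 2-tap averaging filter are assumptions made for this illustration; the paper does not specify them here.

```python
def analysis_filter(x, h):
    """Evaluate y(k) = sum_n x(2k - n) h(n) along one dimension, 2:1 decimated.

    Samples with a negative index are taken from the end of the line
    (periodic extension), which is an assumption of this sketch.
    """
    n_samples = len(x)
    return [sum(x[(2 * k - n) % n_samples] * h[n] for n in range(len(h)))
            for k in range(n_samples // 2)]

x = [1, 2, 3, 4, 5, 6, 7, 8]
h = [0.5, 0.5]                      # 2-tap averaging filter, for illustration only
print(analysis_filter(x, h))
```

Each output sample depends only on $L$ neighboring input samples, which is the locality that makes the operation suitable for a tiled layout.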
The tiling data layout is used again: a tile of $M_x \times M_y$ frame pixels is mapped onto each PE. For convenience, we temporarily assume that $M_x = M_y = 2^m$, where $m$ is a nonnegative integer. Denote by $j = 0$ the finest scale in the multiresolution decomposition, i.e., the original data. To compute the bands at the $(j+1)$th level from the $j$th, two row convolutions and four column convolutions are required. Because the bands at different scales have unequal sizes, the computation and communication loads of the PEs are different for each scale. At fine scales $j \le m$, each PE contains $2^{2(m-j)}$ pixels. Inter-processor communication is required only when a PE needs access to a pixel stored on another PE; in all other cases, the information transfer takes place within the same PE. The filtering operations are implemented in the following fashion. For a given $(k,l)$, the summation in (1) is first decomposed into partial sums that can each be evaluated within single PEs (computation step), then the partial sums are combined to produce the complete result (inter-processor communication step). This operation is repeated for all pixels. The number of computations per PE to compute the $(j+1)$th scale from the $j$th is $2^{2(m-j)+1} L$, while the number of inter-processor communications per PE is approximately $2^{(m-j)+1}(L-1)$.
Computation and communication loads increase with the length of the subband filter, and the relative importance of the communication term increases rapidly with $j$. The tile size should be as small as possible, subject to the constraints $P_x M_x \ge N_x$ and $P_y M_y \ge N_y$.
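The per-PE operation counts quoted above for the fine scales can be tabulated with a few lines of code. The function name is hypothetical and the counts are the approximate expressions from the text, so the numbers should be read as orders of magnitude rather than exact instruction counts.

```python
def fine_scale_costs(m, j, taps):
    """Per-PE operation counts for computing scale j+1 from scale j (valid for j <= m)."""
    computations = 2 ** (2 * (m - j) + 1) * taps
    communications = 2 ** ((m - j) + 1) * (taps - 1)
    return computations, communications

# 8 x 8 tiles (m = 3) and a 3-tap filter, chosen for illustration.
for j in range(4):
    print(j, fine_scale_costs(3, j, 3))
```

The computation count falls by a factor of 4 per scale while the communication count falls only by a factor of 2, which is why communication becomes relatively more important as $j$ grows.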
At coarse scales $j > m$, only PEs whose coordinates are both multiples of $2^{j-m}$ contain a pixel. All other PEs are to be deactivated. Computations are dominated by inter-processor communication (maximum communication distance is $2^{j-m}(L-1)$). Despite the mediocre processor usage at coarse scales, the overall performance of the algorithm is not substantially affected as long as $m$ is large enough, since most machine cycles are used at fine scales.
The analysis produces similar results for arbitrary $M_x$ and $M_y$. Let $m_x$ and $m_y$ be the largest integers such that $M_x = 2^{m_x} o_x$ and $M_y = 2^{m_y} o_y$, with $o_x$ and $o_y$ odd integers. It is easily seen that bands at resolution levels coarser than $m_x$ or $m_y$ cannot be processed using all PEs in parallel. At these resolution levels, only a fraction $2^{m_x + m_y - 2j}$ of the PEs is active at the same time.
A subband coding algorithm has been implemented on the 16k-processor MasPar. A simulation was performed using a 3-tap filter and 5 resolution levels (13 bands). The execution times for $1024 \times 1024$ ($M_x = M_y = 8$), $512 \times 512$ ($M_x = M_y = 4$), and $256 \times 256$ ($M_x = M_y = 2$) images were 13, 4, and 2.4 msec, respectively. Thus, the execution time scales linearly with image size as soon as the tile size is moderately large. This indicates that computations dominate communications and demonstrates the efficiency of the parallel implementation for medium-to-large tiles.
A special case of interest arises when the popular QMF filters are used. The lowpass and highpass filters are related by $h_h(n) = (-1)^n h_l(n)$. This suggests that (1) should be evaluated jointly for the lowpass and highpass filters. This can be done as follows. (1) is evaluated for the lowpass filter only by decomposing the summation in the right-hand side into two terms, one over even indices and the other over odd indices. The output of the lowpass filter is the sum of these two terms while the output of the highpass filter is equal to their difference. Thus, due to the special relationship between the QMF filters, the computation count may be reduced by one half. A simulation of a 4-band decomposition using 16-tap QMF filters and a $768 \times 640$ ($M_x = 12$, $M_y = 20$) image executed on the 2k-processor MasPar in 0.225 seconds.
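The joint evaluation of the two QMF outputs can be sketched in one dimension as follows. The periodic boundary extension and the 2-tap filter are assumptions of this illustration (a practical QMF would be longer, e.g. the 16-tap filters mentioned above).

```python
def qmf_pair(x, h_low):
    """Jointly compute lowpass and highpass outputs when h_high(n) = (-1)^n h_low(n).

    Periodic extension at the boundary is an assumption of this sketch.
    """
    n_samples, taps = len(x), len(h_low)
    low, high = [], []
    for k in range(n_samples // 2):
        even = sum(x[(2 * k - n) % n_samples] * h_low[n] for n in range(0, taps, 2))
        odd = sum(x[(2 * k - n) % n_samples] * h_low[n] for n in range(1, taps, 2))
        low.append(even + odd)     # lowpass output: sum of the two partial sums
        high.append(even - odd)    # highpass output: their difference
    return low, high

x = [1, 2, 3, 4, 5, 6, 7, 8]
low, high = qmf_pair(x, [0.5, 0.5])   # 2-tap filter pair, for illustration only
print(low, high)
```

The multiplications are performed once and shared by both outputs; only one extra addition per output sample is needed for the highpass band.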
Quantization can be performed in parallel within each subband using the techniques indicated in the previous section.
3.3 Nonorthogonal 2-D Subband Coding
Iterative optimization algorithms have recently been applied to nonorthogonal subband image coding [9]. Such algorithms may be used to minimize the mean squared error (MSE) of the coded image under given quantizer constraints. Examples of nonorthogonal subband transforms include semi-orthogonal transforms [20] and biorthogonal transforms [21].
Using a vector notation, denote respectively by $f$ and $\hat{f}$ the original and compressed images (size $N_x \times N_y$), and by $\tilde{M}$ and $M$ the transform matrices at the coder and decoder, respectively. For perfect-reconstruction systems, $M\tilde{M}$ is the identity matrix. The decoder computes $\hat{f} = M\hat{a}$, where $\hat{a}$ is the quantized subband image. In [9], the coding problem is formulated as follows: identify $\hat{a}$ that minimizes
$$ \mathrm{MSE}(\hat{a}) = \| f - M\hat{a} \|^2 \qquad (2) $$
subject to the constraint that each pixel of $\hat{a}$ be one of the output levels of the given quantizer. Direct quantization of the subband image $\tilde{M}f$ does not generally minimize (2) when $M$ is nonorthogonal. A better solution may be obtained by solving the discrete optimization problem (2). The algorithm in [9] uses multiscale relaxation techniques and proceeds as follows.
Denote by $\hat{a}_k$ the $k$-th subband image (size $N_k$, $1 \le k \le K$) and by $M_k$ the restriction of $M$ to that subband. Regrouping terms in (2) yields
$$ \mathrm{MSE}(\hat{f}) = \Big\| f - \sum_{k=1}^{K} M_k \hat{a}_k \Big\|^2 = \hat{a}_k^T K_k \hat{a}_k - 2 \beta_k^T \hat{a}_k + \gamma_k \qquad (3) $$
where the superscript $T$ denotes matrix and vector transpose, $K_k \triangleq M_k^T M_k$ is an $N_k \times N_k$ banded Toeplitz matrix, and $\beta_k$ (size $N_k$) and $\gamma_k$ (scalar) contain terms depending on $f$ and on $\{\hat{a}_{k'},\ k' \ne k\}$, but not on $\hat{a}_k$.
The image $\hat{a} = \tilde{M}f$ is used as an initial point for the iterative algorithm. The low-frequency band ($k = K$) is updated first. The term $\beta_{k-1}$ is then computed in preparation for an update of the next band, $\hat{a}_{k-1}$. This step is repeated at all scales down to and including the fine scale ($k = 1$). Several such sequences of coarse-to-fine steps, called sweeps, are performed successively to improve the estimates. Due to its multiscale nature, the algorithm converges rapidly.
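The claim that direct quantization of the transform coefficients need not minimize the reconstruction error when the synthesis matrix is nonorthogonal can be checked on a toy problem. Everything in the sketch below is hypothetical: the 2 × 2 matrix, the signal, and the integer quantizer levels are chosen only for illustration, and the exhaustive search stands in for the multiscale relaxation of [9], which it does not reproduce.

```python
from itertools import product

def mse(f, m, a):
    """Squared reconstruction error ||f - M a||^2 for small dense matrices."""
    recon = [sum(m[i][j] * a[j] for j in range(len(a))) for i in range(len(f))]
    return sum((fi - ri) ** 2 for fi, ri in zip(f, recon))

# Hypothetical 2 x 2 nonorthogonal synthesis matrix M and its inverse (analysis matrix).
M = [[1.0, 1.0], [0.0, 1.0]]
M_inv = [[1.0, -1.0], [0.0, 1.0]]
f = [0.6, 0.4]
levels = [-1, 0, 1, 2]                      # assumed quantizer output levels

# Direct quantization: transform, then round each coefficient to the nearest level.
coeffs = [sum(M_inv[i][j] * f[j] for j in range(2)) for i in range(2)]
direct = [min(levels, key=lambda q: abs(q - c)) for c in coeffs]

# Exhaustive search over all level combinations (feasible only for toy sizes).
best = min(product(levels, repeat=2), key=lambda a: mse(f, M, a))

print(direct, mse(f, M, direct))
print(list(best), mse(f, M, best))
```

In this example the exhaustive search finds a level combination with a smaller error than coefficient-wise rounding, which is the gap that an iterative optimization is meant to close on full-size images.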
In order to update each $\hat{a}_k$, a minimization of (3), subject to the quantizer constraints on $\hat{a}_k$, is performed using a Red/Black Gauss-Seidel or a Jacobi relaxation scheme. Both schemes have been used extensively in parallel computing and multiresolution optimization problems [22]. The computations involved include filtering and quantization operations.
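The difference between the two relaxation schemes can be illustrated on a small 1-D quadratic of the same form as (3). The sketch below rests on several assumptions that are not taken from the paper: a hypothetical tridiagonal matrix `K`, a linear-term vector `b`, and rounding to integers as the quantizer. It is a generic coordinate-wise relaxation, not the authors' implementation.

```python
def objective(K, b, a):
    """Quadratic cost a^T K a - 2 b^T a (the constant term is omitted)."""
    n = len(a)
    quad = sum(a[i] * K[i][j] * a[j] for i in range(n) for j in range(n))
    return quad - 2 * sum(bi * ai for bi, ai in zip(b, a))

def update(K, b, a, i):
    """Best quantized value of a[i] with all other entries held fixed."""
    rest = sum(K[i][j] * a[j] for j in range(len(a)) if j != i)
    return round((b[i] - rest) / K[i][i])     # integer levels: an assumed quantizer

def jacobi_sweep(K, b, a):
    """Update every entry simultaneously from the previous iterate."""
    return [update(K, b, a, i) for i in range(len(a))]

def red_black_sweep(K, b, a):
    """Update even-indexed entries first, then odd-indexed ones using the new values."""
    a = list(a)
    for parity in (0, 1):
        new = {i: update(K, b, a, i) for i in range(parity, len(a), 2)}
        for i, v in new.items():
            a[i] = v
    return a

n = 6
K = [[1.0 if i == j else 0.25 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [2.3, -1.7, 0.4, 3.1, -0.6, 1.2]
a0 = [round(v) for v in b]
a1 = red_black_sweep(K, b, a0)
print(objective(K, b, a0), objective(K, b, a1))
```

With a tridiagonal `K`, entries of the same color do not interact, so each half-sweep can be done fully in parallel and cannot increase the cost. A Jacobi sweep uses all PEs at once but carries no such guarantee.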
The algorithm has been implemented on the 16k-processor MasPar using the biorthogonal B-spline transform #2 in [9], with $K = 13$ bands. Implementation of the filtering and quantization operations was done as in Section 3.2. A distinctive feature of the algorithm is the choice of the relaxation technique in each band. With the Jacobi scheme, all pixel values are updated simultaneously. With the Red/Black Gauss-Seidel scheme, calculations are done successively on the even and odd rows (resp. columns) for all pixels in the horizontal (resp. vertical) band. Although the Gauss-Seidel algorithm converges more rapidly, at coarse resolutions processor usage is not as good as with Jacobi since twice as many processors are idle. A detailed study of the tradeoff between convergence speed and load balancing has shown that the Red/Black Gauss-Seidel scheme is desirable in the low band (Jacobi diverges in the low band) while the Jacobi scheme is appropriate at all finer levels [23]. The MSE stabilized after fewer than ten iterations of the algorithm. The execution times for $1024 \times 1024$ ($M_x = M_y = 8$), $512 \times 512$ ($M_x = M_y = 4$) and $256 \times 256$ ($M_x = M_y = 2$) images were 72, 29, and 21 msec per iteration, respectively. While the negative effects of inter-processor communication can be felt for small tiles, timing performance gradually improved for larger tiles. For small tiles, the somewhat complex control structure of the program also had an adverse effect on timing performance.
4 Interframe Video Coding
In this section, we first discuss block-matching motion estimation (BMME) schemes in terms of the performance and complexity of the parallel implementation. Then, we study a new 3-D subband coding algorithm developed by two of the authors.
4.1 Motion Estimation
Motion estimation algorithms are used to reduce the temporal redundancy of video signals. These algorithms are computationally demanding, so great care must be taken in their implementation. We have studied the following BMME algorithms: full search (FS), three-step search (TSS) [24], and Zaccarin and Liu's 8:1 and 16:1 subsampling algorithms [25]. Additional details about the MasPar implementation of these algorithms can be found in [26].
4.1.1 Full Search
Let $B \times B$ be the size of the blocks to be matched, and $N_B = \frac{N_x N_y}{B^2}$ be the number of blocks per frame. It is assumed that $N_x$ and $N_y$ are multiples of $B$ and that $B$ is a power of 2. Denote by $\{F(x,y,t);\ 0 \le x < N_x;\ 0 \le y < N_y;\ t = 0, 1, 2, \ldots\}$ the frame sequence. For each block $(i,j)$, the FS algorithm seeks the motion vector $(\hat{v}_x, \hat{v}_y)$ which minimizes the mean absolute error (MAE)
$$ \mathrm{MAE}_{i,j}(v_x, v_y) = \frac{1}{B^2} \sum_{m=1}^{B} \sum_{n=1}^{B} \big| F(iB+m,\, jB+n,\, t) - F(iB+v_x+m,\, jB+v_y+n,\, t-1) \big| \qquad (4) $$
over the motion vectors in the range $-d_{\max} \le v_x, v_y \le d_{\max}$, where $d_{\max}$ is the maximum block displacement (in pixels). The computational complexity of the algorithm is $(2 d_{\max} + 1)^2$ computations per pixel.
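A sequential sketch of the full search for one block follows. It uses 0-based pixel offsets ($m, n = 0, \ldots, B-1$) instead of the 1-based sums of (4), and it skips candidate vectors that would reach outside the previous frame; both conventions are assumptions of this sketch. The synthetic frames are constructed so that the true displacement is known.

```python
def mae(cur, prev, x0, y0, vx, vy, b):
    """Mean absolute error between a b x b block of `cur` and its displaced match in `prev`."""
    total = 0
    for m in range(b):
        for n in range(b):
            total += abs(cur[x0 + m][y0 + n] - prev[x0 + vx + m][y0 + vy + n])
    return total / (b * b)

def full_search(cur, prev, i, j, b, d_max):
    """Exhaustive search over all (2 d_max + 1)^2 candidate motion vectors for block (i, j)."""
    x0, y0 = i * b, j * b
    n_x, n_y = len(prev), len(prev[0])
    best = None
    for vx in range(-d_max, d_max + 1):
        for vy in range(-d_max, d_max + 1):
            # Skip candidates that fall outside the previous frame (assumed border rule).
            if not (0 <= x0 + vx and x0 + vx + b <= n_x and
                    0 <= y0 + vy and y0 + vy + b <= n_y):
                continue
            err = mae(cur, prev, x0, y0, vx, vy, b)
            if best is None or err < best[0]:
                best = (err, vx, vy)
    return best

# Synthetic test: the current frame is the previous frame displaced by (2, -1).
N = 16
prev = [[16 * x + y for y in range(N)] for x in range(N)]
cur = [[prev[min(x + 2, N - 1)][max(y - 1, 0)] for y in range(N)] for x in range(N)]
print(full_search(cur, prev, 1, 1, 4, 2))
```

The call returns the minimum error together with the motion vector that achieves it; for the synthetic pair above the match is exact at the known displacement.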
In order to simplify the discussion, we assume that $N_B \le P_x P_y$. As in Section 3, the frames are divided into rectangular tiles of equal size $M_x \times M_y$, each of which is assigned to one PE. Each block is stored in a small set of adjacent processors. When $M_x = M_y = B$, no inter-processor communication is needed. As in Section 3.1, this straightforward implementation has the disadvantage that only a fraction $N_B / (P_x P_y)$ of the PEs are active. Therefore, $M_x$ and $M_y$ should be chosen so as to balance the PE load. In order to keep the program control simple, $M_x$ and $M_y$ may further be restricted to be powers of 2. In this case each block is scattered over $\frac{B^2}{M_x M_y}$ PEs, without overlapping on the PEs. Inter-processor communication may be reduced by adopting a data-replication technique referred to as zooming, in which each PE collects and stores data that would otherwise need to be fetched from other PEs for every motion vector calculation [16]. The replication of frame data is acceptable given the large amount of memory available and illustrates how memory usage can be traded for processing speed. The following two algorithms illustrate the use of the zooming technique.
Algorithm #1 [26]. The technique used to evaluate (4) is similar to that used to evaluate (1) in subband filtering. The summation in the right-hand side of (4) is split into partial sums that are evaluated on different PEs (computation step) and then added up (communication step). In order to implement the computation step without fetching data from other PEs for each MAE evaluation, a $(2 d_{\max} + M_x) \times (2 d_{\max} + M_y)$ array of data representing all possible displacements of the $M_x \times M_y$ tile is zoomed onto each PE prior to the first computation. The number of computations per PE is $M_x M_y (2 d_{\max} + 1)^2$. Since the communication step requires communication between $\frac{B}{M_x} \times \frac{B}{M_y}$ PEs, the number of communications per PE is $\log_2 \frac{B^2}{M_x M_y} \, (2 d_{\max} + 1)^2$. In addition, zooming accounts for an additional $(2 d_{\max} + M_x)(2 d_{\max} + M_y)$ inter-processor communications per PE.
Algorithm #2. A more efficient technique consists in partitioning the motion vector range $[-d_{\max}, d_{\max}] \times [-d_{\max}, d_{\max}]$ into $\frac{B^2}{M_x M_y}$ identical rectangles and allocating a different PE to each rectangle in this partition. Each PE minimizes (4) over its restricted motion-vector range (computation step), then the restricted minima are compared in order to produce the optimal motion vector over the entire range $[-d_{\max}, d_{\max}] \times [-d_{\max}, d_{\max}]$ (communication step). Prior to the computations, each PE collects an array of data representing all possible displacements of the $B \times B$ block within its dedicated restricted range. The implementation of this zooming step is only slightly more complex than that used for Algorithm #1. The computation count is the same as with Algorithm #1, but since only $\frac{B^2}{M_x M_y}$ partial minima are compared, the number of communications per PE is reduced to $\log_2 \frac{B^2}{M_x M_y}$, regardless of the value of $d_{\max}$.
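The difference between the combining steps of the two algorithms is easy to quantify with the expressions given above. The sketch below counts only the per-PE communications of the combining step and leaves out the zooming cost; the function name and the 4 × 4 tile size are hypothetical choices.

```python
import math

def reduction_messages(b, m_x, m_y, d_max):
    """Per-PE communications in the combining step of Algorithms #1 and #2."""
    pes_per_block = (b * b) // (m_x * m_y)
    stages = int(math.log2(pes_per_block))
    alg1 = stages * (2 * d_max + 1) ** 2     # one reduction per candidate vector
    alg2 = stages                            # a single comparison of partial minima
    return alg1, alg2

# B = 8 and d_max = 8; the 4 x 4 tiles are an assumed choice.
print(reduction_messages(8, 4, 4, 8))
```

For these parameters Algorithm #1 needs 578 combining communications per PE against 2 for Algorithm #2, since the latter reduces once per block rather than once per candidate vector.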
4.1.2 Three-Step Search
The TSS algorithm applies a logarithmic search over motion vectors in the range $[-d_{\max}, d_{\max}] \times [-d_{\max}, d_{\max}]$. The algorithm is faster than the FS algorithm because of the limited number of motion vectors computed: $1 + 8 \log_2 d_{\max}$ vs. $(2 d_{\max} + 1)^2$ for the FS algorithm. The well-known disadvantage of the TSS algorithm is that it may be trapped in a local minimum of (4). The parallel implementation of TSS is similar to that of FS with the difference that a reduced number of motion vectors is evaluated.
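The logarithmic search can be sketched as follows. The starting step of $d_{\max}/2$ and the tie-breaking rule (keep the current center unless a strictly better candidate is found) are assumptions of this sketch, and the cost function used in the example is a hypothetical smooth surface rather than a real MAE.

```python
def three_step_search(cost, d_max):
    """Logarithmic search: 8 neighbors per step, step size halved each time.

    `cost(vx, vy)` is any matching criterion, e.g. the MAE of (4).
    Returns the best vector found and the number of cost evaluations.
    """
    best_v = (0, 0)
    best_c = cost(0, 0)
    evaluations = 1
    step = d_max // 2
    while step >= 1:
        center = best_v
        for dx in (-step, 0, step):
            for dy in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue                      # center already evaluated
                v = (center[0] + dx, center[1] + dy)
                c = cost(*v)
                evaluations += 1
                if c < best_c:
                    best_v, best_c = v, c
        step //= 2
    return best_v, evaluations

# Hypothetical smooth error surface with its minimum at (3, -5).
print(three_step_search(lambda vx, vy: (vx - 3) ** 2 + (vy + 5) ** 2, 8))
```

For $d_{\max} = 8$ the search uses $1 + 8 \log_2 8 = 25$ evaluations, against 289 for the full search. On a surface with several local minima the same procedure may stop at the wrong one.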
4.1.3 Subsampling Algorithms
A promising alternative to the TSS algorithm is the class of subsampling algorithms. Computational complexity is reduced by using block subsampling and alternating pixel subsampling techniques instead of limiting the number of motion-vector searches. According to [25], subsampling algorithms outperform the TSS algorithm in the sense that they produce better motion vectors at comparable computational cost. Two such algorithms using a 4:1 pixel subsampling rate and a 2:1 (resp. 4:1) block subsampling rate allow for a reduction of computational complexity by a factor of 8 (resp. 16). These algorithms are denoted here by S8 and S16 and are sketched below.
Block subsampling is based on the assumption that each motion vector is "close" to at least one of the vectors obtained from adjacent blocks. With 2:1 block subsampling, motion vectors for the active blocks $(2i, 2j)$ and $(2i+1, 2j+1)$, $1 \le i \le N_x/(2B)$ and $1 \le j \le N_y/(2B)$, are computed using an alternating pixel subsampling technique [25]. The motion vector for each remaining inactive block is obtained by evaluating only four motion vectors obtained from the adjacent active blocks and picking the motion vector with the smallest MAE. The 4:1 subsampling scheme uses a similar technique, where only one in four blocks is active.
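The 2:1 block-subsampling rule can be sketched as follows. The active blocks $(2i, 2j)$ and $(2i+1, 2j+1)$ form a checkerboard, so every inactive block has only active blocks as its four direct neighbours. The names `is_active` and `inactive_vector`, the dictionary layout, and the toy cost function are hypothetical; in the paper the cost is the MAE.

```python
def is_active(i, j):
    """Blocks (2i, 2j) and (2i+1, 2j+1) are exactly those with i + j even."""
    return (i + j) % 2 == 0

def inactive_vector(i, j, vectors, cost):
    """Pick, among the motion vectors of the adjacent active blocks,
    the one with the smallest cost for the inactive block (i, j).
    `vectors` maps active block indices to (dx, dy);
    `cost(i, j, dx, dy)` returns the matching error."""
    neighbours = ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
    candidates = [vectors[n] for n in neighbours if n in vectors]
    return min(candidates, key=lambda v: cost(i, j, v[0], v[1]))

# Toy example: block (1, 2) is inactive; its true motion is (3, -2).
vectors = {(0, 2): (1, 0), (2, 2): (3, -2), (1, 1): (0, 0), (1, 3): (-1, 1)}

def toy_cost(i, j, dx, dy):
    return abs(dx - 3) + abs(dy + 2)

chosen = inactive_vector(1, 2, vectors, toy_cost)
active_in_4x4 = sum(is_active(i, j) for i in range(4) for j in range(4))
```

Only four candidate vectors are evaluated per inactive block, which is where the block-subsampling saving comes from.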
The organization of computations should not be done in the same fashion as in Section 4.1.1, since all processors that store inactive blocks would be poorly used. A partitioning technique similar to that used for Algorithm #2 may be used instead [26]. For the S8 (resp. S16) algorithm, the motion-vector range for each active block is partitioned into two (resp. four) rectangles and the work load is split with the otherwise inactive PEs. This technique achieves good load balancing and reduces the number of communications per PE by a factor equal to the block subsampling rate.
4.1.4 The MasPar Implementation
The BMME algorithms FS, TSS, S8, and S16 have been implemented on the 16k-processor MasPar using the methods discussed in Sections 4.1.1-3. The execution time (including system calls but not I/O time) has been measured for various frame sizes. The MSE of the motion-compensated frame difference has been computed for each algorithm from an average over the first 30 frames of the well-known Flower Garden sequence. The results are displayed in Table 1 for $B = 8$ and $d_{\max} = 8$.
The average MSE per pixel for each frame is shown in Fig. 1. In this experiment, the S8 algorithm results on average in a moderate MSE increase (9 percent) over the FS algorithm, but its time complexity is several times lower than that of the FS algorithm. Further timing improvements can be obtained using the S16 algorithm, at the expense of MSE. The TSS algorithm has an execution time similar to that of S16, but also a larger MSE.
The execution time for the FS algorithm applied to a 1024×1024 frame ($M_x = M_y = 8$) is 349 msec. A total of $303 \times 10^6$ arithmetic operations are performed, so the processing rate is $868 \times 10^6$ computations per second, which is reasonably close to the 1 Gflop/s peak computation rate of the MasPar. In Fig. 2, the execution time for each algorithm is plotted against the number of pixels per processor $M_x M_y$. Except for very small values of $M_x M_y$, the execution time is well approximated by a linear function of $M_x M_y$, indicating that computation dominates communication.
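Both claims in this paragraph can be checked directly from the numbers in Table 1. The sketch below assumes the 16k PEs form a 128×128 array, which is consistent with $M_x = M_y = 8$ for a 1024×1024 frame; the variable names are illustrative only.

```python
# Timings from Table 1 (msec), for frame sizes
# 256x256, 256x512, 512x512, 512x1K, 1Kx1K.
PES = 128 * 128  # assumed 128x128 PE array (16k processors)
frame_sizes = [(256, 256), (256, 512), (512, 512), (512, 1024), (1024, 1024)]
times_ms = {
    "FS":  [41, 59, 100, 183, 349],
    "TSS": [5, 7, 12, 23, 44],
    "S8":  [19, 21, 27, 43, 76],
    "S16": [11, 12, 16, 26, 48],
}

# Pixels stored per PE, Mx * My, for each frame size.
pixels_per_pe = [w * h // PES for (w, h) in frame_sizes]

def slopes(times):
    """Incremental cost in msec per additional pixel per PE."""
    return [(t1 - t0) / (p1 - p0)
            for (t0, t1, p0, p1) in zip(times, times[1:],
                                        pixels_per_pe, pixels_per_pe[1:])]

fs_slopes = slopes(times_ms["FS"])

# Processing rate for FS on the 1Kx1K frame: operations / seconds.
rate = 303e6 / 0.349
```

For FS the incremental cost settles at about 5.2 msec per pixel per PE once $M_x M_y \ge 8$, which is the linear behaviour described above, and the rate evaluates to roughly $868 \times 10^6$ operations per second.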
4.2 3-D Subband Coding
The application of subband coding to include the temporal direction is a natural extension of spatial subband coding. The additional subband decomposition in the temporal direction allows the coder to efficiently take advantage of both the spatial and temporal frequency characteristics of the human visual system [27] [28]. Some of the earliest work in 3-D subband coding was done by Karlsson and Vetterli [29] and by Kronander [28]. Here, we present some simulation results based on a new 3-D subband coder [10], which is an outgrowth of a coder discussed in [28]. The basic motivation was to combine the analysis benefits of 3-D subband coding with the temporal-redundancy reduction benefits of motion estimation.
4.2.1 3-D Subband Analysis
Fig. 4 diagrams the analysis section of the coder. First, the incoming sequence is temporally analyzed into two bands. Each resulting subsequence is further decomposed into four spatial subbands. In our simulations all filtering operations were done using Johnston's 16C FIR filter [30]. The final result is eight equally-sized subbands per two incoming frames. The subbands have been labeled LLL, LHL, HLL, HHL, LLH, LHH, HLH, HHH, with the letters representing temporal, horizontal, and vertical frequencies, in the given order. Here, the temporal analysis was done first to improve the computational efficiency on the SIMD machine.
Fig. 5 diagrams the active components of the coder. The LLL, LHL, and HLL subbands are coded using temporal DPCM, which is implemented with a forward-motion-compensated predictor. Motion-compensated estimation was chosen instead of 3-D linear prediction because this type of prediction is usually superior to low-order linear prediction. To reduce the complexity, a motion estimate was made only on the LLL subband and then used to predict each of the subbands LLL, LHL, and HLL. The basis for this implementation is that moving objects within a given temporal subband must move with the same displacement in each spatial subband.
The resulting displaced frame differences and the other subbands are then coded using uniform quantizers. In order to adapt to the non-stationary statistics between frames, a new bit allocation is made and new quantizers are designed during each frame. Each subband's bit rate is assigned using the algorithm described in [18]. Each quantizer is then designed so that the estimated first-order entropy of its output is matched to its assigned rate.
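One way to match a uniform quantizer to an assigned rate is to search over the step size until the empirical first-order entropy of the quantizer indices meets the target. The sketch below uses bisection on a logarithmic scale; this search strategy, the function names, and the synthetic Gaussian data are assumptions for illustration and are not taken from the coder described here.

```python
import math
import random
from collections import Counter

def quantize(samples, step):
    """Uniform mid-tread quantizer: index of the nearest multiple of `step`."""
    return [math.floor(x / step + 0.5) for x in samples]

def first_order_entropy(indices):
    """Empirical first-order entropy of the index stream, in bits/sample."""
    n = len(indices)
    return -sum(c / n * math.log2(c / n) for c in Counter(indices).values())

def design_step(samples, target_bits, iterations=60):
    """Bisection (geometric) on the step size. Invariant: the entropy at `lo`
    exceeds the target and the entropy at `hi` does not, so the returned
    step never exceeds the assigned rate. Assumes the data are such that a
    step of 1e9 maps every sample to index 0."""
    lo, hi = 1e-9, 1e9
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if first_order_entropy(quantize(samples, mid)) > target_bits:
            lo = mid
        else:
            hi = mid
    return hi

# Synthetic subband: 2000 zero-mean Gaussian samples, target rate 2 bits/sample.
rng = random.Random(1)
samples = [rng.gauss(0.0, 10.0) for _ in range(2000)]
step = design_step(samples, 2.0)
achieved = first_order_entropy(quantize(samples, step))
```

Because the quantizers are redesigned for every frame, a search of this kind would run once per subband per frame, using the rate assigned by the bit-allocation step.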
The architecture of this coder/decoder is well matched to the architecture of the MasPar. With the exception of temporal filtering, the efficiency of each subsystem has been discussed in Section 3.2. The temporal filtering operations are efficiently performed in parallel using a circular frame buffer implemented on the PE array. The buffers contain frame tiles, and their length is equal to the temporal filter's length. Since there are no spatial interactions during the temporal filtering, no inter-processor communication is required. Using the algorithm for QMF filters, the number of computations is $M_x M_y L$, where $L$ is the temporal filter's length. A simulation of a 2-band temporal decomposition using 2-tap QMF filters and 768×640 ($M_x = 12$, $M_y = 20$) images processed approximately one frame per 0.055 seconds.
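The per-PE circular frame buffer can be sketched for a single tile as follows. The class and function names are hypothetical, and the 2-tap filter pair with coefficients $(1/2, 1/2)$ and $(1/2, -1/2)$ is an assumed normalization chosen to keep the arithmetic exact; the paper does not state its coefficients.

```python
class CircularFrameBuffer:
    """Holds the last `length` tiles of one PE; the oldest is overwritten."""
    def __init__(self, length):
        self.slots = [None] * length
        self.head = 0  # slot that receives the next tile

    def push(self, tile):
        self.slots[self.head] = tile
        self.head = (self.head + 1) % len(self.slots)

    def tap(self, k):
        """Tile from k frames back; k = 0 is the most recent tile."""
        return self.slots[(self.head - 1 - k) % len(self.slots)]

def temporal_filter(buf, taps):
    """Pixelwise FIR filter along time: Mx * My * L multiply-adds per tile,
    with no access to any other PE's data."""
    rows, cols = len(buf.tap(0)), len(buf.tap(0)[0])
    out = [[0.0] * cols for _ in range(rows)]
    for k, h in enumerate(taps):
        tile = buf.tap(k)
        for y in range(rows):
            for x in range(cols):
                out[y][x] += h * tile[y][x]
    return out

LOW, HIGH = [0.5, 0.5], [0.5, -0.5]  # assumed 2-tap filter pair

buf = CircularFrameBuffer(2)
buf.push([[2, 4]])
buf.push([[6, 8]])
low_band = temporal_filter(buf, LOW)
high_band = temporal_filter(buf, HIGH)
buf.push([[10, 0]])  # overwrites the oldest tile
next_low = temporal_filter(buf, LOW)
```

Each output pixel depends only on the same pixel position in earlier frames, which is why the temporal stage needs no inter-processor communication.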
4.2.2 Computational Results
When coding a cropped (256×256) version of the first 40 frames of the Salesman sequence at an average rate/frame of 0.54 bits/pixel, an average PSNR of 38.94 dB/frame was observed (see Fig. 6). Perceptually, there were few signs of distortion in the reconstructed sequence when viewed against the original. The motion estimation algorithm incorporated was full-search estimation with blocks of 8×8 pixels correlated over a search region of ±7 pixels using an MSE criterion. Also, an average decrease of 1.9 dB/frame was observed when the prediction was removed from the LHL and HLL subbands, thus justifying the use of motion-compensated prediction on these bands.
A comparison was made between a comparable 2-D subband coder and a simpler 3-D subband coder in order to observe the benefits of the temporal decomposition. The 2-D subband coder consisted of four spatial subbands with motion-compensated predictive coding applied only to the LL subband. The 3-D subband coder consisted of the eight-subband decomposition listed above, with motion-compensated predictive coding applied only to the LLL subband. It was found that the 3-D coder outperformed the 2-D coder by an average PSNR gain of 3.15 dB (see Fig. 7). Perceptually, the reconstructed sequence from the 3-D coder looked similar to the original except in areas where there was a large degree of motion, whereas the output of the 2-D coder looked uniformly distorted.
It should be noted that the bit rates quoted include the motion vector overhead but do not include the codebook overhead. In our algorithms, the quantizers are redesigned for each frame, so the codebooks must be transmitted to the receiver in a practical system. In this connection, relevant work [31] [32] on modeling with generalized Gaussian models can be incorporated to reduce the required side information to an insignificant amount. We are currently modifying our MasPar implementation to incorporate this needed feature.
5 Conclusions
The research reported here has examined the use of general-purpose, data-parallel computers as a platform for VSP algorithms. This approach is facilitated by technological and economic trends and provides the user with a high degree of programming flexibility. We have developed parallel implementations for a number of specific VSP algorithms, including motion estimation, interframe coding, and intraframe coding. The theoretical analyses have been supplemented with timing results on MasPar MP-1 computers.
The key to efficient computing is the selection of data structures that balance the computational load on the individual processors and optimize the inter-processor communications. The data structure design we found best suited to VSP algorithms is obtained by rectangular tiling of frames. The tile size is selected depending on the frame size, the algorithm, and the computer characteristics. The block-DCT transform and the block-matching motion estimation algorithms are efficiently implemented using $2^{m_x} \times 2^{m_y}$ tiles, where $m_x$ and $m_y$ are integers. The appropriate tiles for subband coding algorithms have size $2^{m_x} o_x \times 2^{m_y} o_y$, where $m_x$ and $m_y$ are larger than or equal to the number of resolution levels in the subband analysis. The optimization of data structures is facilitated by possible tradeoffs between processor memory utilization and processing speed: we found that layout transformations between different program modules (e.g., data-replication techniques for motion estimation algorithms) are extremely effective in that regard.
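The tile-size condition for subband coding can be checked mechanically. The sketch below factors a tile dimension into a power of two times a remaining factor and reports how many resolution levels the tile supports. The function names are hypothetical, and treating $o_x$ and $o_y$ as the odd factors left after removing powers of two is an assumption about the notation.

```python
def split_power_of_two(m):
    """Return (k, o) such that m == 2**k * o with o odd."""
    if m <= 0:
        raise ValueError("tile dimension must be positive")
    k = 0
    while m % 2 == 0:
        m //= 2
        k += 1
    return k, m

def max_subband_levels(Mx, My):
    """Largest number of resolution levels for which both tile dimensions
    remain divisible by two at every level."""
    return min(split_power_of_two(Mx)[0], split_power_of_two(My)[0])

# Tile sizes quoted in the text: 12 x 20 (768x640 frames) and 8 x 8 (1Kx1K frames).
levels_12x20 = max_subband_levels(12, 20)
levels_8x8 = max_subband_levels(8, 8)
```

With the 12×20 tiles quoted in Section 4.2.1, for example, this gives $12 = 2^2 \cdot 3$ and $20 = 2^2 \cdot 5$, so up to two resolution levels fit without splitting a sample pair across PEs.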
The capability to design and test novel video coding algorithms on realistic test sequences has been illustrated by our study of a 3-D subband coding scheme. Different numbers of bits are allocated to the various spatio-temporal bands of the image sequence. In addition, motion-compensated predictive coding may advantageously be applied to the higher spatial subbands of the low temporal subband. Experimental results demonstrated that 3-D subband coding gives visually more pleasing results than 2-D subband coding. The error in the 2-D subband coding was noticeable and uniformly distributed throughout the image, whereas the 3-D subband coded image sequence looked transparent in the regions of no motion, with only mild signs of distortion in rapidly moving areas.
References
[1] Chin, D., Passe, J., Bernard, F., Taylor, H., and Knight, S. The Princeton Engine: A Real-Time Video System Simulator. IEEE Transactions on Consumer Electronics 34(2) (1988), pp. 285-297.
[2] Levialdi, S. Issues on Parallel Algorithms for Image Processing. In The Characteristics of Parallel Algorithms (Jamieson, L.H. et al., Eds.). MIT Press, Cambridge, MA, 1987.
[3] Cypher, R., and Sanz, J.L.C. SIMD Architectures and Algorithms for Image Processing and Computer Vision. IEEE Transactions on Acoustics, Speech, and Signal Processing 37(12) (1989), pp. 2158-2174.
[4] Little, J.J., Blelloch, G.E., and Cass, T.A. Algorithmic Techniques for Computer Vision on a Fine-Grained Parallel Machine. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(3) (1989), pp. 244-257.
[5] IEEE Computer 25(2), 1992. Special Issue on Parallel Processing for Computer Vision and Image Understanding.
[6] Manohar, M., and Tilton, J.C. Progressive Vector Quantization of Multispectral Image Data Using a Massively Parallel SIMD Machine. In Proc. Data Compression Conference, Snowbird, UT, 1992, pp. 181-190.
[7] Carpentieri, B., and Storer, J.A. A Split-Merge Parallel Block-Matching Algorithm for Video Displacement Estimation. In Proc. Data Compression Conference, Snowbird, UT, 1992, pp. 239-248.
[8] Howard, P.G., and Vitter, J.S. Parallel Lossless Image Compression Using Huffman and Arithmetic Coding. In Proc. Data Compression Conference, Snowbird, UT, 1992, pp. 299-308.
[9] Moulin, P. A Relaxation Algorithm for Minimizing the $L^2$ Reconstruction Error in 2-D Nonorthogonal Subband Coding. To appear in Proc. 1st IEEE Int'l Conf. on Im. Proc. (ICIP), Austin, TX, 1994. Extended version in: A Multiscale Relaxation Algorithm for SNR Maximization in 2-D Nonorthogonal Subband Coding. Bellcore Preprint, Mar. 1994. Submitted.
[10] Woods, J.W., and Lilienfield, G. Experiments in 3-D Subband Coding. In Proc. 8th Workshop on Image and Multidimensional Signal Processing, Cannes, France, 1993, pp. 84-85. Summary only.
[11] Berman, F., and Snyder, L. On mapping parallel algorithms into parallel architectures. J. of Parallel and Distributed Computing 4 (1987), pp. 439-458.
[12] Bhatt, S.N., Chung, F., Leighton, F.T., and Rosenberg, A.L. Efficient embeddings of trees in hypercubes. SIAM J. Computing 21(1) (1992), pp. 151-162.
[13] MPL Reference Manual, MasPar Computer Corporation, Sunnyvale, CA, 1990.
[14] Kernighan, B.W., and Ritchie, D.M. The C Programming Language, Prentice Hall, Englewood Cliffs, NJ, 1988.
[15] Brass, A., and Pawley, G. Two and Three Dimensional FFTs on Highly Parallel Computers. Parallel Computing 3 (1986), pp. 167-184.
[16] Loui, A., Ogielski, A., and Liou, M. A Parallel Implementation of the H.261 Video Coding Algorithm. In Proc. IEEE Workshop on Visual Signal Processing and Communications, Raleigh, NC, 1992, pp. 80-85. Extended version in: Efficient Video Codec Simulation on a Massively Parallel Computer. Bellcore Technical Memorandum TM-ARH-020418, 1991.
[17] Hang, H.-M., Leonardi, R., Haskell, B.G., Schmidt, R.L., Bheda, H., and Othmer, J. Digital HDTV Compression at 44 Mbps Using Parallel Motion-Compensated Transform Coders. SPIE 1360 (1990), pp. 1756-1772.
[18] Woods, J.W., and O'Neil, S.D. Subband Coding of Images. IEEE Transactions on Acoustics, Speech, and Signal Processing 34 (1986), pp. 1278-1288.
[19] Gharavi, H., and Tabatabai, A. Subband Coding of Monochrome and Color Images. IEEE Transactions on Circuits and Systems 35 (1988), pp. 207-214.
[20] Wavelets: A Tutorial in Theory and Applications (Chui, C.K., Ed.). Academic Press, Boston, 1992.
[21] Antonini, M., Barlaud, M., Mathieu, P., and Daubechies, I. Image Coding Using Wavelet Transform. IEEE Trans. on Im. Proc. 1(2) (1992), pp. 205-220.
[22] Briggs, W.L. A Multigrid Tutorial. SIAM, Philadelphia, 1987.
[23] Caputo, P., and Moulin, P. Parallel Implementation of Multiresolution B-Spline Image Coding and Decoding Algorithms. Bellcore Technical Memorandum, 1993.
[24] Koga, T., Iinuma, K., Hirano, A., Iijima, Y., and Ishiguro, T. Motion compensated interframe coding for video conferencing. In Proc. Nat. Telecommun. Conf., New Orleans, LA, 1981, pp. G5.3.1-5.3.5.
[25] Zaccarin, A., and Liu, B. Fast Algorithms for Block Motion Estimation. In Proc. ICASSP'92, San Francisco, CA, 1992, pp. III.449-452.
[26] Rhee, I., and Moulin, P. The Parallel Implementation of Video Motion Estimation Algorithms. Bellcore Technical Memorandum, 1993.
[27] Podilchuk, C.I., and Farvardin, N. Perceptually Based Low Bit Rate Video Coding. In Proc. ICASSP'91, 1991, pp. 2837-2840.
[28] Kronander, T. Some Aspects of Perception Based Image Coding. Ph.D. Thesis, Linköping University, 1989.
[29] Karlsson, G., and Vetterli, M. Three Dimensional Subband Coding of Video. In Proc. ICASSP'88, New York, NY, 1988, pp. 1100-1103.
[30] Johnston, J.D. A Filter Family Designed for use in Quadrature Mirror Filter Banks. In Proc. ICASSP'80, Denver, CO, 1980, pp. 291-294.
[31] Mallat, S.G. A Theory for Multiresolution Signal Decomposition: the Wavelet Representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 11 (1989), pp. 674-693.
[32] Naveen, T., and Woods, J.W. Motion Compensated Multiresolution Transmission of High Definition Video. IEEE Trans. Circuits and Systems for Video Technology 4 (1994), pp. 29-41.
A Biographies
PIERRE MOULIN was born in Mons, Belgium, in 1963. He received the degree of Ingénieur civil électricien from the Faculté Polytechnique de Mons, Belgium, in 1984, and the M.Sc. and D.Sc. degrees in Electrical Engineering from Washington University in St. Louis in 1986 and 1990, respectively. He was a researcher at the Faculté Polytechnique de Mons in 1984-85 and at the École Royale Militaire in Brussels, Belgium, in 1986-87. He has been with the Information Sciences and Technologies Research Laboratory at Bell Communications Research since 1990. His current research interests include image and video compression, estimation theory, statistical signal processing, and the application of multiresolution methods and parallel computing to these areas.
ANDREW T. OGIELSKI received the Ph.D. degree in mathematical physics from the University of Wroclaw, Poland, in 1976. After several years of university research in Europe and in America he joined Bell Laboratories, Murray Hill, in 1982 as a research scientist. In 1989 he moved to Bellcore, where he helped establish a laboratory focusing on the applications of parallel and distributed computing in telecommunications. He is currently Director of the Parallel Computing and Algorithms Research Group at Bellcore. His own research and the projects he supervises include digital video processing, multimedia servers, distributed algorithms and software systems, packet network traffic analysis, biological signal processing, and others.
GARY S. LILIENFIELD was born in Washington, D.C. on August 6, 1963. He received his B.S.E.E., M.S.E.E., and an M.S. in mathematics from Rensselaer Polytechnic Institute in Troy, NY, in 1985, 1987, and 1993, respectively. He is currently a Ph.D. candidate in electrical engineering at Rensselaer Polytechnic Institute. His research interests include image and video compression, digital signal processing, and medical imaging. Mr. Lilienfield is a member of Eta Kappa Nu and Tau Beta Pi.
JOHN W. WOODS received the Ph.D. degree from the Massachusetts Institute of Technology in 1970. Since 1976 he has been with the ECSE Department at Rensselaer Polytechnic Institute, where he is currently Professor and Associate Director of the Center for Image Processing Research.
Dr. Woods was co-recipient of 1976 and 1987 Senior Paper Awards of the now IEEE Signal Processing (SP) Society. He served on the editorial board of Graphical Models and Image Processing and was chairman of the Seventh Workshop on Multidimensional Signal Processing in 1991. He received the Signal Processing Society Meritorious Service Award in 1990. He is currently on the editorial board of the IEEE Transactions on Video Technology. He received a Technical Achievement Award from the IEEE Signal Processing Society in 1994. He is a member of Sigma Xi, Tau Beta Pi, Eta Kappa Nu, and the AAAS. Dr. Woods is a Fellow of the IEEE.
B Tables
Table 1: Timing of BMME algorithms for various frame sizes, and MSE for the first 30 frames of the Flower Garden sequence. Block size is 8×8 and $d_{\max} = 8$.

             Computation time (msec) for frame size                 MSE
Algorithm    256×256   256×512   512×512   512×1K   1K×1K     (720×480)
FS                41        59       100      183     349           119
TSS                5         7        12       23      44           196
S8                19        21        27       43      76           130
S16               11        12        16       26      48           142