
types are called unpack operations (Fig. 3.4a). They are implemented as versatile merge operations by interleaving the packed data in the lower-order or higher-order word of two multimedia registers in the destination register.

Instructions that decrease the width are known as pack instructions (Fig. 3.4b). They take the packed data from two multimedia registers and reduce the bit length of each element to half. The MMX instruction set includes only pack instructions with signed and unsigned saturation arithmetic. The AltiVec instruction set also has modulo pack instructions.
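The difference between saturating and modulo packing can be illustrated for a single element. The following is a minimal scalar sketch in C, not actual SIMD code; the function names are hypothetical, and a real pack instruction applies the same rule to all packed elements at once.

```c
#include <stdint.h>

/* Signed saturating pack of one element: values outside [-128, 127]
   are clamped to the nearest representable value. */
int8_t pack_signed_saturate(int16_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Unsigned saturating pack of one element: a signed 16-bit value
   is clamped to [0, 255]. */
uint8_t pack_unsigned_saturate(int16_t v)
{
    if (v > UINT8_MAX) return UINT8_MAX;
    if (v < 0) return 0;
    return (uint8_t)v;
}

/* Modulo pack of one element: only the low-order 8 bits are kept. */
uint8_t pack_modulo(uint16_t v)
{
    return (uint8_t)(v & 0xFFu);
}
```

With saturation, an out-of-range value such as 300 becomes 127 (signed) or 255 (unsigned); with modulo packing it wraps around to 44.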

3.4.8 Transfer operations

The load and store instructions of standard instruction sets work only with the standard bit length of the integer registers. Therefore these instructions cannot be used to load and store the wider multimedia registers. With the MMX instruction set, 32 or 64 bits, that is, a doubleword or a quadword, can be moved from and to memory. The quadword move instructions are the most efficient because they utilize the full bus width with a single instruction.

These wide move instructions cause, however, an alignment problem. If an address is not aligned on a 64-bit boundary, which is normally the case with 8-, 16-, and 32-bit data types, memory access is slowed down. In order to take full advantage of the MMX instructions, it is necessary to align the data correctly. This requires a number of careful considerations detailed in the corresponding Intel manuals [2].
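Whether a buffer satisfies the 64-bit alignment requirement can be checked from its address. The following C sketch shows one possible way to test an address and to find the next aligned address inside a buffer; the helper names are hypothetical and are not taken from the Intel manuals.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p lies on a 64-bit (8-byte) boundary, 0 otherwise. */
int is_aligned_64bit(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Returns the first 8-byte-aligned address at or after p.
   The caller must make sure the buffer extends at least 7 bytes beyond p. */
uint8_t *align_up_64bit(uint8_t *p)
{
    size_t offset = (size_t)((8u - ((uintptr_t)p & 7u)) & 7u);
    return p + offset;
}
```

A typical use would be to allocate each image row with a few bytes of padding and start the row at the address returned by `align_up_64bit`, so that every quadword move in the inner loop hits an aligned address.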

3.5 SIMD algorithms for signal processing

In this section we will discuss which classes of signal processing algorithms can be performed and how efficiently this can be done. This includes the following classes of operations: point operations, global transforms, convolution, gray-scale morphology, segmentation, binary morphology, classification, and neural networks.

3.5.1 Point operations

Any type of point operation can be performed efficiently with SIMD algorithms because all pixels can be processed in parallel. All that is required is a scan that includes all points of a D-dimensional signal and performs the operation point by point.

Example 3.3: Image accumulation

The following code fragment shows the inner loop of a point operation that adds an 8-bit image to a 16-bit image in Intel MMX inline assembly code.

46 3 Multimedia Architectures

    pxor mm7, mm7               // Set register mm7 to zero
vsbadd1:
    // Load 4 x 4 pixels from 8-bit source into the
    // low-order doubleword of registers mm0 - mm3
    movd mm0, [esi]
    movd mm1, [esi+4]
    movd mm2, [esi+8]
    movd mm3, [esi+12]
    // Unpack from 8 bits to 16 bits, add to destination
    punpcklbw mm0, mm7
    paddw mm0, [edi]
    punpcklbw mm1, mm7
    paddw mm1, [edi+8]
    punpcklbw mm2, mm7
    paddw mm2, [edi+16]
    punpcklbw mm3, mm7
    paddw mm3, [edi+24]
    // Save in destination
    movq [edi], mm0
    movq [edi+8], mm1
    movq [edi+16], mm2
    movq [edi+24], mm3
    // Increment addresses and check loop counter
    add esi, 16
    add edi, 32
    sub ecx, 1
    jg vsbadd1

This loop contains 20 instructions that add, per loop scan, 16 pixels of the 8-bit source image to 16 pixels of the 16-bit destination image. As some instructions run in parallel (e.g., punpcklbw and paddw), the loop should effectively take about one clock per pixel. On a 400-MHz Pentium II, it should thus run at a rate of 400 MPixels/s. This performance could, however, be achieved only if all source and destination data were available in the primary cache. Because this is never possible with image data, the performance of such a simple operation is instead limited by the maximum sustained memory transfer rate from and to the main memory. If we assume that the effective memory transfer rate is on the order of 100 MB/s and count only the load operations (3 bytes/pixel), the performance is slowed down by one order of magnitude to about 30 MPixels/s.

The preceding example shows that simple operations are memory-transfer limited. In other words, many more operations per pixel can be performed before the computing power of the multimedia instruction set limits the throughput.
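The estimate in Example 3.3 can be retraced with a few lines of C. The sketch below is a scalar reference for the accumulation together with the back-of-the-envelope rate calculation; all function names are hypothetical, and the figures (400 MHz, one clock per pixel, 100 MB/s, 3 bytes/pixel) are the assumptions stated in the text, not measurements.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the accumulation: dst[i] += src[i].
   The 16-bit sum wraps around on overflow, as paddw does. */
void accumulate_u8_to_u16(const uint8_t *src, uint16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}

/* Pixel rate (MPixels/s) permitted by the instruction stream alone. */
double compute_bound_mpixels(double clock_mhz, double clocks_per_pixel)
{
    return clock_mhz / clocks_per_pixel;
}

/* Pixel rate (MPixels/s) permitted by memory transfer alone. */
double memory_bound_mpixels(double mbytes_per_s, double bytes_per_pixel)
{
    return mbytes_per_s / bytes_per_pixel;
}

/* The sustained rate is limited by the slower of the two. */
double sustained_mpixels(double compute, double memory)
{
    return compute < memory ? compute : memory;
}
```

With the numbers from the text, the compute bound is 400 MPixels/s and the memory bound is 100/3, or roughly 33 MPixels/s, so the memory bound determines the sustained rate.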

There are two classes of operations that are related to point operations but cannot be accelerated by SIMD instructions: lookup table operations and the computation of histograms. Common to both operations is an additional indirection. The value of a variable is used to compute the address of the lookup table or the address of the element in the histogram that is to be incremented. Because of the content-dependent addresses, these operations cannot be performed with SIMD instructions.

This is a serious limitation of the current multimedia instruction sets. Lookup tables are a central element for low-level image and signal processing that is part of the hardware of any frame grabber but can be used for incoming data only [14]. With a lookup table any function can be implemented. Especially useful are lookup tables for dyadic point operations with two operands. With such a lookup table any dyadic operation—including multiplication, division, magnitude of a 2-D vector, etc.—can be computed.
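The indirection that prevents a packed implementation is easy to see in scalar code. The following C sketch shows a monadic lookup table, a histogram, and a dyadic lookup; the function names and the 256 x 256 table layout are illustrative assumptions. In each case the memory address depends on the pixel value itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Monadic point operation via lookup table: the table address
   is determined by the value of each source pixel. */
void apply_lut(const uint8_t *src, uint8_t *dst, size_t n,
               const uint8_t lut[256])
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = lut[src[i]];
}

/* Histogram: the element to be incremented is selected by the pixel value. */
void histogram_u8(const uint8_t *src, size_t n, uint32_t hist[256])
{
    for (size_t i = 0; i < 256; ++i)
        hist[i] = 0;
    for (size_t i = 0; i < n; ++i)
        hist[src[i]]++;
}

/* Dyadic point operation: a table with 256 x 256 entries is indexed
   by both 8-bit operands (assumed layout: index = a * 256 + b). */
uint8_t dyadic_lookup(const uint8_t *table, uint8_t a, uint8_t b)
{
    return table[((size_t)a << 8) | b];
}
```

Filling the dyadic table once with, for example, a scaled product or a vector magnitude turns any two-operand function into a single memory access per pixel pair.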

3.5.2 Global transforms

In contrast to point operations, global transforms such as the discrete Fourier transform (DFT) (Volume 2, Section 3.3) compute each element of the transform from all elements of the input data. Nevertheless it is possible to implement such transforms efficiently by SIMD algorithms. We will show this with the example of the 2-D fast Fourier transform algorithm. The 2-D DFT can be separated into a 1-D row and a 1-D column transform (Volume 2, Section 3.4.2). The key point then is that multiple rows and columns can be transformed in parallel. In essence this leads to an inner loop with the multiplication of a complex scalar with a complex vector. This operation can be implemented efficiently with the multimedia SIMD instructions. The real and imaginary parts of the constant are loaded into the multimedia registers and multiplied with the vector. The MMX pmaddwd instruction can be used to perform one complex multiplication with four real multiplications and two real additions in a single clock cycle.

It is, however, still awkward to implement an FFT algorithm in 16-bit integer arithmetic. Either significant roundoff errors are introduced or block-floating techniques are required. Thus the recent extension of multimedia instructions to 32-bit floating-point arithmetic by AMD’s 3DNow! or Motorola’s AltiVec is very useful for algorithms such as the FFT.
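How one multiply-add instruction yields a complete complex product can be sketched in scalar C. The function `pmaddwd_emulated` below is a hypothetical stand-in that mimics the arithmetic of pmaddwd on four 16-bit pairs; the operand arrangement shown is one possible choice and is an assumption, not taken from the text.

```c
#include <stdint.h>

/* Mimics the arithmetic of pmaddwd: four pairs of signed 16-bit values are
   multiplied and adjacent products are added, giving two 32-bit sums.
   Caveat: the case where all four inputs of one sum equal -32768
   overflows and is not handled in this sketch. */
void pmaddwd_emulated(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Complex product (ar + i ai)(br + i bi) with one multiply-add step:
   four real multiplications and two real additions.
   Assumes bi != -32768 so that -bi is representable. */
void complex_mul_i16(int16_t ar, int16_t ai, int16_t br, int16_t bi,
                     int32_t *re, int32_t *im)
{
    const int16_t x[4] = { ar, ai, ar, ai };
    const int16_t y[4] = { br, (int16_t)-bi, bi, br };
    int32_t out[2];

    pmaddwd_emulated(x, y, out);
    *re = out[0];   /* ar*br - ai*bi */
    *im = out[1];   /* ar*bi + ai*br */
}
```

In an FFT inner loop the second operand would hold the constant twiddle factor and stay in a register, while the first operand streams through the complex vector.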

3.5.3 Convolution

Convolution or linear shift-invariant filtering is one of the most important neighborhood operations in signal processing (Volume 2, Section 5.3). There are several ways to execute convolution operations efficiently with an SIMD architecture. We demonstrate one here. A 1-D convolution can be written as

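$$g'_n = \sum_{n'=-R}^{R} h_{n'}\, g_{n-n'} \quad\text{or}\quad \mathbf{g}' = \sum_{n'=-R}^{R} h_{n'}\, \mathcal{S}_{n'}\, \mathbf{g} \tag{3.3}$$

where the shift operator $\mathcal{S}_{n'}$ shifts the vector $\mathbf{g}$ by $n'$ elements. Thus the inner loop consists of the following basic operation

$$\mathbf{g}' = \mathbf{g}' + h_{n'}\, \mathcal{S}_{n'}\, \mathbf{g} \tag{3.4}$$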

A vector is multiplied with a constant and the result accumulated in another vector. This operation is repeated for all nonzero coefficients of the convolution sum. Executing a convolution in this way is efficient as long as the vectors fit into the primary cache.

The real problem for this operation is caused by the shift of the input vector. The multimedia instructions require that the data are aligned on 64-bit boundaries. While it is easy to align the beginning of vectors and image rows at this boundary, the vector becomes misaligned because of the pixelwise shift. Thus additional shift operations are necessary with the pixels packed into the multimedia registers, slowing down the overall performance of convolution operations.
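The shift-multiply-accumulate scheme of Eqs. (3.3) and (3.4) can be written out as a scalar C sketch. The function names are hypothetical, and the border treatment (samples shifted in from outside the vector count as zero) is an assumption of this sketch, not something the text prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Basic operation of Eq. (3.4): out = out + coeff * S_shift(g),
   where (S_shift g)[i] = g[i - shift].
   Samples outside the vector are taken as zero (assumption). */
static void shift_multiply_accumulate(const int16_t *g, int32_t *out,
                                      size_t n, int16_t coeff, int shift)
{
    for (size_t i = 0; i < n; ++i) {
        ptrdiff_t j = (ptrdiff_t)i - shift;
        if (j >= 0 && j < (ptrdiff_t)n)
            out[i] += (int32_t)coeff * g[j];
    }
}

/* 1-D convolution as in Eq. (3.3). h holds the 2R+1 coefficients
   h_{-R}, ..., h_{R} in this order, that is, h[k + R] = h_k. */
void convolve_1d(const int16_t *g, int32_t *out, size_t n,
                 const int16_t *h, int R)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 0;
    for (int k = -R; k <= R; ++k) {
        if (h[k + R] != 0)   /* skip zero coefficients */
            shift_multiply_accumulate(g, out, n, h[k + R], k);
    }
}
```

The inner loop of `shift_multiply_accumulate` is the part that maps to packed multiply and add instructions; the index `i - shift` is where the misalignment discussed above arises whenever the shift is not a multiple of the number of pixels per register.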

3.5.4 Gray-scale morphology

Gray-scale morphology requires the computation of the minimum or maximum of vector elements for erosion and dilation operations (Volume 2, Chapter 21). With standard instruction sets, the computation of minima and maxima requires comparison operations and conditional branches.¹ As these branches cannot be predicted, they continually slow down the instruction stream.

Multimedia instruction sets either use comparison instructions that generate masks for the computation of minima and maxima with logical operations or include these operations directly. The following example shows the MMX assembly code in the inner loop of a vector maximum routine.

Example 3.4: Maximum computation with MMX instructions

This routine uses the greater-than comparison instruction pcmpgtw to mask the elements that are greater. This mask is used to cut out (and operation) the greater values in the first operand, and the negated mask to cut out the greater values in the second operand. A subsequent or operation combines all maximal values. In the following code fragment of the inner loop of the maximum routine, eight 16-bit integer pixels from two vectors are processed per loop scan.

m1:
    movq mm0, [esi]
    movq mm1, [esi+8]
    movq mm2, [edi]
    movq mm3, [edi+8]
    movq mm4, mm0
    pcmpgtw mm4, mm2            // mm4 = 1st mask
    movq mm5, mm1
    pcmpgtw mm5, mm3            // mm5 = 2nd mask
    pand mm0, mm4
    pandn mm4, mm2
    por mm0, mm4                // mm0 = maximum
    pand mm1, mm5
    pandn mm5, mm3
    por mm1, mm5                // mm1 = maximum
    movq [edi], mm0
    movq [edi+8], mm1
    add esi, 16
    add edi, 16
    dec ecx
    jg m1

¹ Recently, conditional move instructions that avoid conditional branches have been added to standard instruction sets (e.g., for the Pentium Pro and Pentium II processors).

3.5.5 Global segmentation

A global segmentation operation is also a point operation and as such can easily be implemented by the comparison instructions discussed in Sections 3.4.6 and 3.5.4. With the aid of shift operations it is also possible to generate a binary image in an efficient way.
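The two steps, comparison to a mask and shifting the mask into a bit-packed binary image, can be sketched in scalar C as follows. The function names, the unsigned comparison, and the bit order (pixel i goes to bit i) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Comparison step: produces a mask that is 0xFF where the pixel
   exceeds the threshold and 0x00 elsewhere. */
void threshold_mask(const uint8_t *src, uint8_t *mask, size_t n,
                    uint8_t threshold)
{
    for (size_t i = 0; i < n; ++i)
        mask[i] = (src[i] > threshold) ? 0xFFu : 0x00u;
}

/* Shift step: packs eight mask bytes into one byte of a binary image.
   Pixel i is stored in bit i (least significant bit first). */
uint8_t pack_mask_bits(const uint8_t mask[8])
{
    uint8_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits |= (uint8_t)((mask[i] & 1u) << i);
    return bits;
}
```

A packed implementation would perform the comparison on all elements of a register at once, so that only the bit-packing step needs extra shift and logical operations.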

It is not possible to accelerate edge-oriented segmentation operations with SIMD algorithms because such an operation is inherently serial. Compact codes for binary objects, such as the run-length code or chain code, likewise cannot be generated from image data with SIMD algorithms.

3.5.6 Binary morphology

Morphological operations on binary images require bitwise logical operations (Volume 2, Chapter 21). If binary images are stored with one bit per pixel, 32 pixels can be packed into a standard register and 64 or 128 pixels in a multimedia register. Thus operations with binary images are considerably faster than with gray-scale images because many pixels can be processed in one register in parallel. The acceleration factor is, of course, not equal to the number of bits in the registers, because misalignments of the bits require additional shift operations and the composition of a bit string from two bit strings. Even so, with standard 32-bit registers binary images can be expected to be processed about 10 times faster than gray-scale images.

For the implementation of binary morphological operations with multimedia instruction sets an additional acceleration by a factor of two or four, depending on the register width of the multimedia registers as compared with the width of the standard registers, can be achieved.
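The bitwise approach, including the composition of a bit string from two neighboring words, can be sketched in C for a horizontal three-pixel structuring element. This is an illustrative sketch with 64-bit words; the function names, the bit order (pixel x is bit x mod 64 of word x div 64), and the zero border are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Word k of the row shifted so that pixel x holds the value of pixel x-1.
   The bit crossing the word boundary is taken from word k-1. */
static uint64_t from_left(const uint64_t *row, size_t k)
{
    uint64_t carry = (k > 0) ? (row[k - 1] >> 63) : 0;
    return (row[k] << 1) | carry;
}

/* Word k of the row shifted so that pixel x holds the value of pixel x+1.
   The bit crossing the word boundary is taken from word k+1. */
static uint64_t from_right(const uint64_t *row, size_t nwords, size_t k)
{
    uint64_t carry = (k + 1 < nwords) ? (row[k + 1] << 63) : 0;
    return (row[k] >> 1) | carry;
}

/* Dilation with a horizontal 3-pixel mask: OR of the three shifted rows.
   in and out must not overlap. */
void dilate_row_3(const uint64_t *in, uint64_t *out, size_t nwords)
{
    for (size_t k = 0; k < nwords; ++k)
        out[k] = in[k] | from_left(in, k) | from_right(in, nwords, k);
}

/* Erosion with a horizontal 3-pixel mask: AND of the three shifted rows.
   in and out must not overlap. */
void erode_row_3(const uint64_t *in, uint64_t *out, size_t nwords)
{
    for (size_t k = 0; k < nwords; ++k)
        out[k] = in[k] & from_left(in, k) & from_right(in, nwords, k);
}
```

Each loop iteration processes 64 pixels, but the two helper functions show the extra shifts and the merging of neighboring words that keep the speed-up below the number of bits per register.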


3.5.7 Classification; neural networks

Finally, we discuss various classification algorithms and operations with neural networks. For the reasons discussed in Section 3.5.1, lookup table operations cannot be accelerated with SIMD instructions. Two other types of classification techniques are more suitable: box classification and minimum distance classification.

Box classification requires two comparison operations for each dimension of the boxes that model the clusters in feature space [14]. These comparisons can be performed in parallel and can thus be computed efficiently on packed data. Minimum distance classification is based on the computation of the distance between the $P$-dimensional feature vector $\mathbf{m}$ and the cluster centers $\mathbf{m}_q$:

$$d_q^2 = \left\|\mathbf{m} - \mathbf{m}_q\right\|^2 = \sum_{p=1}^{P} \left(m_p - m_{q,p}\right)^2 \tag{3.5}$$

This operation can also be implemented efficiently with SIMD operations provided that the dimension of the feature space is large enough.

The basic and computationally most expensive operation of neural networks is the accumulation of the weighted input values (Volume 2, Chapter 23):

$$g' = \sum_{p=1}^{P} w_p\, g_p \tag{3.6}$$

Mathematically it is equivalent to an inner or scalar product between the weight vector w of the neuron and the input vector g. This operation can be computed efficiently in parallel using, for example, the multiplication-addition instruction (Section 3.4.3) of the MMX instruction set.
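Both Eq. (3.5) and Eq. (3.6) reduce to multiply-accumulate loops over the feature dimension. The following scalar C sketch shows them side by side; the function names and the row-wise storage of the cluster centers are assumptions for illustration, and the inner loops are the parts that a packed multiply-add instruction would process several elements at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Eq. (3.5): squared distance between the feature vector m and
   one cluster center, both with P components. */
int64_t squared_distance(const int16_t *m, const int16_t *center, size_t P)
{
    int64_t sum = 0;
    for (size_t p = 0; p < P; ++p) {
        int32_t d = (int32_t)m[p] - center[p];
        sum += (int64_t)d * d;
    }
    return sum;
}

/* Minimum distance classification: returns the index q of the nearest
   cluster center. centers holds Q rows of P values each; Q must be >= 1. */
size_t nearest_cluster(const int16_t *m, const int16_t *centers,
                       size_t Q, size_t P)
{
    size_t best = 0;
    int64_t best_d = squared_distance(m, centers, P);
    for (size_t q = 1; q < Q; ++q) {
        int64_t d = squared_distance(m, centers + q * P, P);
        if (d < best_d) {
            best_d = d;
            best = q;
        }
    }
    return best;
}

/* Eq. (3.6): weighted sum of the inputs of one neuron,
   that is, the scalar product of w and g. */
int64_t weighted_sum(const int16_t *w, const int16_t *g, size_t P)
{
    int64_t sum = 0;
    for (size_t p = 0; p < P; ++p)
        sum += (int32_t)w[p] * g[p];
    return sum;
}
```

The loops run over P, which is why a packed implementation pays off only when the feature dimension is large enough to fill the multimedia registers.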

Usually, the output of the neuron is transformed by a nonlinear function and then used as a final output or the input for further neurons (Volume 2, Chapter 23). This is an operation that is typically performed with a lookup table and thus cannot be accelerated, as discussed in Section 3.5.1, by SIMD instructions.