
This section compares the available multimedia instruction sets in detail. The comparison includes Sun’s VIS, Intel’s MMX, AMD’s 3-DNow!, and Motorola’s AltiVec. DEC’s motion video instruction set is not considered further because it is a highly dedicated instruction set with only a few instructions to compute the sum of absolute differences, compute minima and maxima, and pack/unpack pixels, specifically targeted at improving motion video compression [13].

3.4.1 Registers and data types

All multimedia instruction sets except for Motorola’s AltiVec work with 64-bit registers (Table 3.3). The AltiVec instruction set uses 32 registers, each 128 bits wide. Because multiple pixels are packed into the multimedia registers, a number of new data types were defined. The following data formats are supported by all instruction sets: packed bytes (8 bits), packed words (16 bits), and packed doublewords (32 bits) (Table 3.3, Fig. 3.2). Packed 32-bit floating-point doublewords are supported by the 3-DNow! and AltiVec instruction sets. The processors of the Intel family have only 8 multimedia registers, while the RISC architectures from Sun and Motorola are much more flexible with 32 registers [5].

Table 3.3: Comparison of the register sets for several multimedia instruction sets

Feature                                       MMX   3-DNow!   VIS   AltiVec
Number of registers                           8     8         32    32
Width (bits)                                  64    64        64    128
Number of packed bytes                        8     8         8     16
Number of packed words                        4     4         4     8
Number of packed doublewords                  2     2         2     4
Number of packed floating-point doublewords   -     2         -     4
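The packed data types can be illustrated with a minimal sketch that treats a multimedia register as a plain integer. The helper names `pack` and `unpack` are hypothetical, and placing the first element in the lowest-order bits is an assumption made only for this illustration.

```python
def pack(values, lane_bits, register_bits=64):
    """Pack unsigned element values into one integer, first element lowest."""
    lanes = register_bits // lane_bits
    if len(values) != lanes:
        raise ValueError(f"expected {lanes} values, got {len(values)}")
    mask = (1 << lane_bits) - 1
    register = 0
    for i, value in enumerate(values):
        register |= (value & mask) << (i * lane_bits)
    return register


def unpack(register, lane_bits, register_bits=64):
    """Split a register value back into its unsigned elements."""
    mask = (1 << lane_bits) - 1
    return [(register >> (i * lane_bits)) & mask
            for i in range(register_bits // lane_bits)]


pixels = [10, 20, 30, 40, 50, 60, 70, 80]   # eight 8-bit pixels
reg = pack(pixels, lane_bits=8)             # one 64-bit register value
print(hex(reg))
print(unpack(reg, 8))    # the same bits viewed as 8 packed bytes
print(unpack(reg, 16))   # the same bits viewed as 4 packed words
print(unpack(reg, 32))   # the same bits viewed as 2 packed doublewords
```

Calling `unpack(reg, 8, register_bits=128)` on a 128-bit value yields 16 elements, which corresponds to the AltiVec column of Table 3.3.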

The Intel MMX registers are mapped onto the floating-point register set. This approach has the advantage that no architectural changes were needed that would have required software changes in operating systems for exception handling. The disadvantage is that floating-point and MMX instructions cannot be carried out simultaneously. With the extension of the MMX instruction set to include packed 32-bit floating-point arithmetic (as with AMD’s 3-DNow! instructions and probably also with Intel’s MMX2), this disadvantage is disappearing.

The 64-bit multimedia registers double the width of the standard 32-bit registers of 32-bit microprocessors. Thus twice the number of bits can be processed in these registers in parallel. Moreover, these registers match the 64-bit buses for transfer to the main memory and thus memory throughput is also improved.

3.4.2 Instruction format and processing units

The structure of the various multimedia instructions reflects the basic architecture in which they are implemented. Thus the MMX instructions (and their extensions) are two-operand instructions that are all of the same form except for the transfer instructions:

    mmxreg1 = op(mmxreg1, mmxreg2|mem64)    (3.1)

The destination and first operand is an MMX register; the first operand is overwritten by the result of the operation. This scheme has the significant disadvantage that additional register copy instructions are required if the contents of the first source register have to be used more than once. The second operand can be either an MMX register or the address of a 64-bit source. Thus, in contrast to classical RISC architectures, it is not required to load the second operand into an MMX register before it can be used. Shift operations are one exception to this rule: in this case the shift factor can be given as a direct operand.

Table 3.4: Comparison of the peak performance of various implementations of multimedia instruction sets for 16-bit integer arithmetic. The three given numbers are: number of processing units, number of operations performed in parallel, number of clocks needed per operation. If the duration (latency) of the operation is different from the latter, the latency of the operation is added in parentheses. The values for additions are also valid for any other simple arithmetic and logical instruction.

Year   Processor     Clock [MHz]   Add     Add [MOPS]   Mul        Mul [MOPS]
1997   Pentium MMX   233           2·4·1   1864         1·4·1(3)   932
1998   AMD K6        300           1·4·1   1200         1·4·1      1200
1998   Pentium II    400           2·4·1   3200         1·4·1(3)   1600
1998   AMD K6-2      333           2·4·1   2667         1·4·1(2)   1333

The visual instruction set of Sun’s UltraSPARC architecture is a classical three-register implementation. Most instructions have one destination and two source registers:

    visreg1 = op(visreg2, visreg3)    (3.2)

This has the advantage that no source register is overwritten, which saves register copy instructions.
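The cost of the two-operand format can be made concrete with a small sketch. It computes the sum and the difference of the same two registers, once under the constraint that the destination equals the first source and once with a free destination. The tiny interpreter `run`, the register names, and the use of scalar values instead of packed data are all assumptions chosen for illustration; they do not model any real instruction encoding.

```python
def run(program, registers):
    """Execute (dest, op, src1, src2) tuples on a copy of the register dict."""
    operations = {
        "mov": lambda x, y: x,
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
    }
    regs = dict(registers)
    for dest, op, src1, src2 in program:
        second = regs[src2] if src2 is not None else None
        regs[dest] = operations[op](regs[src1], second)
    return regs


start = {"r0": 7, "r1": 3, "r2": 0, "r3": 0}

# Two-operand form: the destination is always the first source, so r0
# has to be copied before it can be used a second time.
two_operand = [
    ("r2", "mov", "r0", None),   # r2 = r0 (extra register copy)
    ("r2", "add", "r2", "r1"),   # r2 = r2 + r1
    ("r0", "sub", "r0", "r1"),   # r0 = r0 - r1 (r0 is overwritten)
]

# Three-operand form: no source register is overwritten.
three_operand = [
    ("r2", "add", "r0", "r1"),   # r2 = r0 + r1
    ("r3", "sub", "r0", "r1"),   # r3 = r0 - r1
]

result_two = run(two_operand, start)
result_three = run(three_operand, start)
print(len(two_operand), result_two["r2"], result_two["r0"])
print(len(three_operand), result_three["r2"], result_three["r3"])
```

Both programs produce the same sum and difference, but the two-operand version needs one additional instruction and loses the original contents of r0.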

The AltiVec instruction set announced in June, 1998 is even more flexible. It allows instructions with one destination and up to three source registers.

On the Intel P55C processor, all MMX instructions have a duration of just one clock [2]. All arithmetic, logic, and shift operations also have a latency of only one clock. This means that the results are immediately available for the instructions executed in the next clock cycle. Only the multiplication instructions show a latency of 3 clocks.

Both the Pentium P55C and the Pentium II have the following MMX processing units: two MMX ALUs, one MMX shift and pack unit, and one MMX multiplication unit. Because the MMX processors have two execution pipelines and four MMX execution units (two ALUs, a multiplier, and a shift unit), chances are good that two MMX instructions can be scheduled at a time. For simple instructions such as addition, this results in a peak performance of 1864 and 3200 MOPS on a 233-MHz MMX Pentium and a 400-MHz Pentium II, respectively (Table 3.4).

The performance figures for the AMD processors are quite similar. Although the K6 has only one MMX processing unit and thus can only schedule one MMX instruction at a time, the multiplication instructions do not show any additional latency. For integer operations, the AMD K6-2 is quite similar to the MMX Pentium with the exception that the multiplication instructions have a latency of only two clock cycles (Table 3.4).
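The peak figures quoted in Table 3.4 follow from a simple product, which the following sketch reproduces. The function name `peak_mops` is hypothetical, and the sketch assumes that the units are fully pipelined, so that only the number of clocks per operation enters the result and not the latency.

```python
def peak_mops(clock_mhz, units, ops_per_instruction, clocks_per_op=1):
    """Peak throughput in million operations per second (MOPS)."""
    return units * ops_per_instruction * clock_mhz / clocks_per_op


# (processor, clock in MHz, addition units, multiplication units),
# as listed in Table 3.4; each instruction handles four 16-bit values.
processors = [
    ("Pentium MMX", 233, 2, 1),
    ("AMD K6", 300, 1, 1),
    ("Pentium II", 400, 2, 1),
]

for name, clock, add_units, mul_units in processors:
    add = peak_mops(clock, add_units, 4)
    mul = peak_mops(clock, mul_units, 4)
    print(f"{name}: add {add:.0f} MOPS, mul {mul:.0f} MOPS")
```

The K6-2 row is left out of the sketch because its tabulated values appear to be based on a clock slightly above 333 MHz (about 333.3 MHz), which is an inference from the numbers rather than something stated in the table.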

The AMD K6-2 processor initiates the second wave of multimedia instruction sets, extending SIMD processing to 32-bit floating-point numbers [10]. Two of them are packed into a 64-bit register. This parallel processing of only 32-bit floating-point numbers still requires much less hardware than a traditional floating-point processing unit that processes either 80-bit or 64-bit floating-point numbers. Moreover, it can share much of the circuitry for 32-bit integer multiplication. This extension brings 32-bit floating-point arithmetic to a new level of performance. With a throughput of 1 operation per clock cycle and the parallel execution of addition/subtraction and multiplication operations, the peak floating-point performance is boosted to 1333 MOPS, well above the peak performance of 200 and 400 MFLOPS for floating-point multiplication and addition of a 400-MHz Pentium II.

No implementation of the AltiVec instruction set is available yet, so a detailed assessment is not possible. Because of the 128-bit registers, however, it is evident that this architecture can process twice as many packed elements per instruction as the 64-bit instruction sets, and is thus inherently two times more powerful at the same processor clock speed and number of processing units.

3.4.3 Basic arithmetic

Integer arithmetic instructions include addition, subtraction, and multiplication (Table 3.5, Fig. 3.3). Addition and subtraction are generally implemented for all data types of the corresponding instruction set. These operations are implemented in three modes: wraparound, unsigned saturation, and signed saturation. In the wraparound mode arithmetic over- or underflow is not detected and the result is computed modulo 2^n, where n is the word length in bits.

This behavior is often not adequate for image data because, for example, a bright pixel suddenly becomes a dark one if an overflow occurs. Thus saturation arithmetic is often also implemented for both signed and unsigned values. This means that if an overflow or underflow occurs in an operation, the result is replaced by the maximum or minimum value, respectively.
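The difference between the three modes can be shown with a short sketch that works on lists of element values rather than on real registers. The function names are hypothetical, and only addition is modeled.

```python
def add_wrap(a, b, bits=8):
    """Wraparound addition: the result is taken modulo 2**bits."""
    mask = (1 << bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


def add_unsigned_sat(a, b, bits=8):
    """Unsigned saturation: sums above the maximum are clamped to it."""
    top = (1 << bits) - 1
    return [min(x + y, top) for x, y in zip(a, b)]


def add_signed_sat(a, b, bits=8):
    """Signed saturation: sums are clamped to the signed value range."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return [max(low, min(x + y, high)) for x, y in zip(a, b)]


bright = [250, 100]
offset = [10, 100]
print(add_wrap(bright, offset))            # bright pixel 250 wraps to a dark value
print(add_unsigned_sat(bright, offset))    # bright pixel 250 stays at the maximum
print(add_signed_sat([100, -100], [100, -100]))
```

In the wraparound case the first pixel ends up at 4, which is the "bright pixel becomes dark" effect described above; with unsigned saturation it is clamped to 255.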

Multiplication is implemented for fewer data types than addition and subtraction operations and also not in saturation arithmetic (Table 3.5). In the MMX instruction set multiplication is implemented only for packed 16-bit pixels in three variants. The first two store only either the 16 high-order or low-order bits of the 32-bit multiplication result. The third variant adds the 32-bit multiplication results pairwise to store two 32-bit results in two doublewords.

Table 3.5: Basic arithmetic multimedia instructions

Instruction set       8 bits      16 bits     32 bits   32 bits float

Addition/Subtraction
Sun VIS               W, US, SS   W, US, SS   W         -
Intel MMX             W, US, SS   W, US, SS   W         -
AMD MMX & 3-DNow!     W, US, SS   W, US, SS   W         Y
Motorola AltiVec      W, US, SS   W, US, SS   W         Y

Multiplication
Sun VIS               W (8×16)    W           -         -
Intel MMX             -           W           -         -
AMD MMX & 3-DNow!     -           W           -         Y
Motorola AltiVec      W, US, SS   W, US, SS   -         Y

Multiply and Add
Intel MMX             -           Y           -         -
AMD MMX & 3-DNow!     -           Y           -         -
Motorola AltiVec      W, US, SS   W, US, SS   -         Y

The abbreviations have the following meaning: US unsigned saturation arithmetic; SS signed saturation arithmetic; W wraparound (modulo) arithmetic; Y yes.
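The three multiplication variants can be sketched on lists of four signed 16-bit values. The function names follow the mnemonics shown in Fig. 3.3 (PMULLW, PMULHW, PMADDWD), but the code is only an emulation under the assumption of signed two's-complement elements; `to_signed` is a hypothetical helper.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pmullw(a, b):
    """Keep the 16 low-order bits of each 32-bit product."""
    return [to_signed(x * y, 16) for x, y in zip(a, b)]


def pmulhw(a, b):
    """Keep the 16 high-order bits of each 32-bit product."""
    return [to_signed((x * y) >> 16, 16) for x, y in zip(a, b)]


def pmaddwd(a, b):
    """Add the four 32-bit products pairwise into two 32-bit results."""
    products = [x * y for x, y in zip(a, b)]
    return [to_signed(products[0] + products[1], 32),
            to_signed(products[2] + products[3], 32)]


a = [1000, -2000, 3, 4]
b = [100, 50, 5, 6]
low = pmullw(a, b)
high = pmulhw(a, b)
print(low, high, pmaddwd(a, b))

# The full 32-bit product can be rebuilt from the two halves.
rebuilt = [(h << 16) + (l & 0xFFFF) for h, l in zip(high, low)]
print(rebuilt)
```

The last step shows why two of the variants are usually combined: neither half alone carries the complete product.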

3.4.4 Shift operations

Shift operations are available with packed data types in the very same way as for standard integers. Thus, logical (unsigned) shifts and arithmetic (signed) shifts with sign extension are distinguished. With the MMX instruction set shift operations are not implemented for packed bytes but only for packed words and doublewords and for the whole 64-bit word. In this way packed data can be shifted to other positions within the register. Such operations are required when data not aligned on 64-bit boundaries are to be addressed.
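The distinction between logical and arithmetic right shifts on packed words can be sketched as follows. The elements are given as unsigned 16-bit patterns; the function names are hypothetical and the code is an emulation, not an instruction-level model.

```python
def shift_right_logical(lanes, count, bits=16):
    """Shift each element right, filling with zeros from the left."""
    mask = (1 << bits) - 1
    return [(x & mask) >> count for x in lanes]


def shift_right_arithmetic(lanes, count, bits=16):
    """Shift each element right, replicating the sign bit from the left."""
    mask = (1 << bits) - 1
    result = []
    for x in lanes:
        x &= mask
        if x >> (bits - 1):          # sign bit set: interpret as negative
            x -= 1 << bits
        # Python's >> on negative integers extends the sign.
        result.append((x >> count) & mask)
    return result


words = [0xFF00, 0x0100, 0x8000, 0x7FFF]
print([hex(w) for w in shift_right_logical(words, 4)])
print([hex(w) for w in shift_right_arithmetic(words, 4)])
```

For elements with a cleared sign bit both shifts agree; they differ only for elements whose most significant bit is set.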

Figure 3.3: MMX (a) addition and (b) multiplication instructions. Panel a: PADD[|S|US]B (8 bits), PADD[|S|US]W (16 bits), PADDD (32 bits). Panel b: PMULLW, PMULHW, PMADDWD.

3.4.5 Logical operations

Logical operations work bitwise, so it is not required to distinguish between different packed data types. Logical operations simply process all bits of the register. The MMX instruction set includes the following logical instructions: and, or, and exclusive or (xor). In addition, a special and operation is available where the first operand is negated before the and operation. Motorola’s AltiVec instruction set also includes a nor instruction, and Sun’s VIS provides 8 types of logical instructions in total.

3.4.6 Comparison, minimum, and maximum operations

Comparison operations are normally used to set flags for conditional branches. The comparison operations in multimedia instruction sets are used in a different way. If the result is true, all bits of the corresponding element that has been compared are set to one. If the result is false, all bits are set to zero. In this way a mask is generated that can be used in subsequent logical operations to select values conditionally and to compute minima and maxima without any conditional jumps that would stall the execution pipeline.
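The masking scheme described above can be sketched with lists of non-negative 16-bit values. The function names are hypothetical. The Python conditional inside `compare_gt` only emulates what a packed comparison produces in a single instruction; the selection itself uses nothing but and, or, and negation.

```python
def compare_gt(a, b, bits=16):
    """Return an all-ones element where a > b and an all-zeros element otherwise."""
    ones = (1 << bits) - 1
    return [ones if x > y else 0 for x, y in zip(a, b)]


def select(mask, a, b, bits=16):
    """Take the element of a where the mask is all ones, else the element of b."""
    ones = (1 << bits) - 1
    return [(m & x) | (~m & ones & y) for m, x, y in zip(mask, a, b)]


def packed_max(a, b):
    return select(compare_gt(a, b), a, b)


def packed_min(a, b):
    return select(compare_gt(a, b), b, a)


a = [5, 200, 7, 0]
b = [9, 100, 7, 1]
print([hex(m) for m in compare_gt(a, b)])
print(packed_max(a, b))
print(packed_min(a, b))
```

The term `~m & y` corresponds to the special and operation with a negated first operand mentioned for the MMX logical instructions, so a maximum takes one comparison and three logical operations.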

The MMX instruction set includes only greater-than and equal comparison operations, as does Motorola’s AltiVec instruction set. While the MMX instructions are only implemented for signed packed 8-, 16-, and 32-bit data, the AltiVec instructions are implemented for both signed and unsigned data types. Sun’s VIS instruction set is much more flexible. All standard comparison instructions including greater than, greater than or equal, equal, not equal, less than or equal, and less than are implemented.

Figure 3.4: MMX (a) unpack and (b) pack instructions. Panel a: byte -> word (PUNPCKLBW, PUNPCKHBW), word -> doubleword (PUNPCKLWD, PUNPCKHWD), doubleword -> quadword (PUNPCKLDQ, PUNPCKHDQ). Panel b: PACKUSWB, PACKSSWB, PACKSSDW.

The MMX instruction set does not include any direct instructions to compute the minimum or maximum of two packed data words. Such instructions have been added for packed floating-point data with the 3-DNow! instruction extension. The AltiVec instruction set incorporates minimum and maximum instructions for both packed signed and unsigned integers and for packed floating-point data.

3.4.7 Conversion, packing, and unpacking operations

Conversion instructions convert the different packed data types into each other. Such operations are required because arithmetic operations must often be performed with data types wider than the width of the input data. The multiplication of two 8-bit values, for instance, results
