Table 3.1: Peak performance of floating-point arithmetic of the Intel microprocessor family (after [1,2]); the number of clocks refers to the throughput; if the duration (latency) is different from this figure, the latency of the operation is added in parentheses

Year  Processor    Clock rate  Add       Add       Mul       Mul
                   [MHz]       [Clocks]  [MFLOPS]  [Clocks]  [MFLOPS]
1978  8087         4.7         70-100    0.055     130-145   0.034
1982  80287        12          70-100    0.141     90-145    0.102
1985  80387        33          23-34     1.16      25-37     1.06
1989  80486        100         8-20      7.1       16        6.3
1993  Pentium      166         1(3)      167       2(3)      83
1997  Pentium MMX  233         1(3)      233       2(3)      116
1998  Pentium II   400         1(3)      400∗      2(5)∗     200∗

∗separate execution units for addition and multiplication
3.2 Signal processing performance of microprocessors
Since its invention, the microprocessor has seen an enormous and still continuing increase in performance. A single number such as the number of millions of operations per second (MOPS) or millions of floating-point operations per second (MFLOPS) or the mere clock rate of the processor does not tell much about the actual performance of a processor. We take a more detailed approach by analyzing the peak computing performance, the bus data transfer rates, and the performance for typical signal processing tasks.
3.2.1 Peak computing performance
In evaluating the computing performance of a microprocessor, the peak performance for an operation has to be strictly distinguished from the real performance for certain applications. The values for the peak performance assume that the data to be processed are already contained in the corresponding registers (or primary cache). Thus, the whole chain of transfer from the main memory to the registers is ignored (Section 3.2.2). The peak performance value also assumes that all instructions are already in the primary program cache and that they are set up in an optimum way for the specific operation. Thus, it is a rather hypothetical value, which will certainly never be exceeded. As long as one keeps in mind that real applications will miss this theoretical upper limit by a significant factor, it is still a useful value for comparing the performance of microprocessors.
3 Multimedia Architectures

Table 3.2: Peak performance of integer arithmetic of the Intel microprocessor family (after [1,2]); the number of clocks refers to the throughput; if the duration (latency) is different from this figure, the latency is added in parentheses; note that the values for the processors with MMX instructions do not refer to these instructions but to the standard integer instructions

Year  Processor    Clock   Add       Add     Mul       Mul
                   [MHz]   [Clocks]  [MOPS]  [Clocks]  [MOPS]
1978  8086         4.7     3         1.57    113-118   0.04
1982  80286        12      2         6       21        0.57
1985  80386        33      2         16.7    9-22      2.13
1989  80486        100     1         100     13        7.7
1993  Pentium      166     1/2∗      333     10        16.7
1997  Pentium MMX  233     1/2∗      466     10        23.3
1998  Pentium II   400     1/2∗      800     1 (4)     400

∗2 Pipelines, 2 ALUs
The peak performance of microprocessors can easily be extracted from the technical documentation of the manufacturers. Table 3.1 shows the peak performance for floating-point arithmetic including addition and multiplication of the Intel microprocessor family. From the early 8087 floating-point coprocessor to the Pentium II, the peak floating-point performance increased almost by a factor of 10,000. This enormous increase in performance is about equally due to the 100-fold increase in the processor clock rate and the reduction in the required number of clocks to about 1/100.
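The relation behind the MFLOPS columns of Table 3.1 can be sketched in a few lines: the peak rate is the clock rate divided by the throughput in clocks. The function name `peak_rate` is hypothetical, and representing the 70-100 clock range of the 8087 by its midpoint is an assumption made only for illustration.

```python
def peak_rate(clock_mhz, clocks_per_op):
    """Peak rate in millions of operations per second (throughput-limited)."""
    return clock_mhz / clocks_per_op


# Floating-point addition, values taken from Table 3.1.
# Assumption: the 70-100 clock range of the 8087 is replaced by its midpoint.
clocks_8087 = (70 + 100) / 2
rate_8087 = peak_rate(4.7, clocks_8087)
rate_pentium2 = peak_rate(400, 1)

speedup = rate_pentium2 / rate_8087
clock_gain = 400 / 4.7          # contribution of the higher clock rate
clocks_gain = clocks_8087 / 1   # contribution of fewer clocks per operation

print(f"8087:       {rate_8087:.3f} MFLOPS")
print(f"Pentium II: {rate_pentium2:.0f} MFLOPS")
print(f"speed-up:   {speedup:.0f} = {clock_gain:.0f} (clock) x {clocks_gain:.0f} (clocks per operation)")
```

Under these assumptions the speed-up splits into two factors of about 85 each, which illustrates why the text attributes the gain about equally to the clock rate and to the reduced number of clocks.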
The performance increase in integer arithmetic is significantly lower (Table 3.2). This is especially true for addition and other simple arithmetic and logical operations, including subtraction and shifting, because these operations required only a few clocks even on early microprocessors. The same is true for integer multiplication up to the Pentium generation: the number of clocks merely dropped from about 100 to 10. With the Pentium II, the throughput of integer multiplication is one per clock, which effectively reduces the duration to 1 clock.
Thus, low-level signal processing operations that are almost exclusively performed in integer arithmetic do not share in the performance increase to the extent that typical scientific and technical number crunching applications do. In fact, the much more complex floating-point operations can be performed faster than the integer operations on a Pentium processor. This trend is also typical for all modern RISC processors. Although it is quite convenient to perform even low-level signal processing operations in floating-point arithmetic, a 32-bit floating-point image requires, respectively, 4 and 2 times more space than an 8- or 16-bit image.

[Figure 3.1 (diagram): registers to primary data cache: 64 bits, 400 MHz, 3200 MB/s; primary to secondary data cache: 64 bits, 200 MHz, 1600 MB/s; secondary data cache to RAM: 64 bits, 100 MHz, 800 MB/s; PCI bus (video, disk): 32 bits, 33 MHz, 132 MB/s]
Figure 3.1: Hierarchical organization of memory transfer between peripheral storage devices or data sources and processor registers (Pentium II, 400 MHz).
3.2.2 Data transfer
Theoretical peak performances are almost never achieved in signal processing because a huge amount of data has to be moved from the main memory or an external image data source such as a video camera to the processor before it can be processed (Fig. 3.1).
Example 3.1: Image addition
As a simple example, take the addition of two images. Per pixel only a single addition must be performed. Beforehand, two load instructions are required to move the two pixels into the processor registers. The load instructions require two further operations to increase the address pointer to the next operands. After the addition, the sum has to be stored in the destination image. Again, an additional operation is required to increase the address to the next destination pixel. Thus, in total six additional operations are required to perform a single addition operation. Moreover, because the images often occupy more space than the secondary cache of a typical microprocessor board, the data have to be loaded from the slow main memory.
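The bookkeeping in this example can be made explicit with a small sketch that spells out every step of the per-pixel loop and tallies it. The function `add_images` and its operation categories are hypothetical names chosen for illustration; Python list indices stand in for the address pointers of a real implementation, and the loop's own compare-and-branch is left uncounted, as in the text.

```python
def add_images(a, b):
    """Add two equally sized images given as flat lists of pixel values.

    Returns the sum image and a tally of the operations performed.
    """
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    n = len(a)
    dst = [0] * n
    ops = {"load": 0, "add": 0, "store": 0, "pointer_increment": 0}
    pa = pb = pd = 0  # indices standing in for the three address pointers
    while pa < n:
        x = a[pa]  # load first operand
        y = b[pb]  # load second operand
        ops["load"] += 2
        pa += 1    # advance both source pointers
        pb += 1
        ops["pointer_increment"] += 2
        s = x + y  # the single useful operation
        ops["add"] += 1
        dst[pd] = s  # store the result
        ops["store"] += 1
        pd += 1    # advance the destination pointer
        ops["pointer_increment"] += 1
    return dst, ops


result, ops = add_images([1, 2, 3, 4], [10, 20, 30, 40])
overhead_per_pixel = (sum(ops.values()) - ops["add"]) / ops["add"]
print(result, ops, overhead_per_pixel)
```

The tally gives two loads, one store, and three pointer increments per addition, that is, the six additional operations counted above.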
One of the basic problems with modern microprocessors is that the processor clock speed has increased 100-fold, while memory bus speed has increased only 10-fold. If we further consider that two execution units typically operate in parallel and take account of an increase in the bus width from 16 bits to 64 bits, the data transfer rates from main memory lag behind the processing speed by a factor of five as compared to early microprocessors. This is why a hierarchical cache organization is so important. Caching, however, fails if only a few operations are performed with a large amount of data, which means that most of the time is spent on data transfer.
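The factor of five follows directly from the four ratios named in this paragraph. The short calculation below is only a restatement of that arithmetic; the variable names are chosen for illustration.

```python
# Growth factors relative to early microprocessors, as stated in the text.
clock_gain = 100            # processor clock speed
units_gain = 2              # two execution units operating in parallel
bus_speed_gain = 10         # memory bus clock
bus_width_gain = 64 / 16    # bus width grew from 16 to 64 bits

processing_gain = clock_gain * units_gain        # 200-fold
transfer_gain = bus_speed_gain * bus_width_gain  # 40-fold
gap = processing_gain / transfer_gain

print(f"processing: x{processing_gain}, transfer: x{transfer_gain:.0f}, gap: x{gap:.0f}")
```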
Example 3.2: Image copying
Copying of images is a good test operation to measure the performance of memory transfer. Experiments on various PC machines revealed that the transfer rates depend on the data and the cache size. Copying data between images fitting into the second level cache can be done with about 180 MBytes/s on a 266-MHz Pentium II. With large images the rate drops to 80 MBytes/s.
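A copy test of this kind can be sketched as follows. This is not the benchmark used for the figures above: the function name `copy_rate_mb_per_s`, the buffer sizes, and the number of repetitions are assumptions, and a Python measurement includes interpreter overhead, so the absolute numbers are not comparable with the rates quoted in the example. Only the qualitative drop for buffers larger than the caches is of interest.

```python
import time


def copy_rate_mb_per_s(n_bytes, repeats=5):
    """Return the best observed copy rate in MB/s (10**6 bytes/s) for n_bytes."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading of the timer
    return n_bytes / best / 1e6


# Assumed sizes: one buffer that fits into a typical cache, one that does not.
for size in (128 * 1024, 16 * 1024 * 1024):
    print(f"{size / 1024:8.0f} KiB: {copy_rate_mb_per_s(size):10.0f} MB/s")
```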
A further serious handicap of PC image processing was the slow transfer of image data from frame grabber boards over the AT bus system. Fortunately, this bottleneck ceased to exist after the introduction of the PCI bus. In its current implementation with 32-bit width and a clock rate of 33 MHz, it has a peak transfer rate of 132 MB/s. This peak transfer rate is well above the transfer rates required for real-time video image transfer. Gray-scale and true color RGB video images require transfer rates of 6–10 MB/s and 20–30 MB/s, respectively. With well-designed PCI interfaces on the frame grabber board and the PC motherboard, sustained transfer rates of up to 80–100 MB/s have been reported. This transfer rate is comparable to transfer rates from the main memory to the processor.
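Both kinds of figures in this paragraph come from simple products: bus width times clock rate for the peak bus rate, and image size times frame rate times bytes per pixel for the video rate. The sketch below reproduces them; the function names are hypothetical, and the video format of 640 × 480 pixels at 30 frames/s is an assumed example, not a format named in the text.

```python
def bus_rate_mb_per_s(width_bits, clock_mhz):
    """Peak bus transfer rate in MB/s: bytes per transfer times clock rate."""
    return width_bits / 8 * clock_mhz


def video_rate_mb_per_s(width, height, frames_per_s, bytes_per_pixel):
    """Sustained data rate of an uncompressed video stream in MB/s."""
    return width * height * frames_per_s * bytes_per_pixel / 1e6


pci_peak = bus_rate_mb_per_s(32, 33)            # PCI bus: 32 bits at 33 MHz
memory_peak = bus_rate_mb_per_s(64, 100)        # memory bus of Fig. 3.1

# Assumed video format: 640 x 480 pixels at 30 frames/s.
gray = video_rate_mb_per_s(640, 480, 30, 1)     # 8-bit gray scale
rgb = video_rate_mb_per_s(640, 480, 30, 3)      # 24-bit true color

print(f"PCI peak {pci_peak:.0f} MB/s, memory peak {memory_peak:.0f} MB/s")
print(f"gray {gray:.1f} MB/s, RGB {rgb:.1f} MB/s")
```

For the assumed format, both video rates fall within the ranges quoted above and stay well below the PCI peak rate.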
Standard personal computers have reached the critical threshold for real-time image data transfer. Thus, it appears to be only a question of time before standard digital camera interfaces are introduced to personal computers (Section 11.6.2).
3.2.3 Performance of low-level signal processing
The peak computing performance of a processor is a theoretical figure that cannot be reached by applications coded with high-level programming languages. Extensive tests with the image processing software heurisko showed that with optimized C programming the following real performances can be reached: 20-30 MFLOPS on a 166-MHz Pentium; 50-70 MFLOPS on a 200-MHz Pentium Pro; and 70-90 MFLOPS on a 266-MHz Pentium II. The integer performance is even lower, especially for multiplication, which took 10 clocks on a Pentium and more on older processors. This is why only half of the floating-point performance could be achieved. Further tests revealed that with assembler-optimized coding of the inner loops of vector functions only modest accelerations would be gained. Usually, the speed-up factors are smaller than 2. These performances do not allow sophisticated real-time image processing with rates of 6-10 MPixels/s.
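What these measured rates mean for real-time processing can be estimated by dividing the sustained rate by the pixel rate. The sketch below does this for the 266-MHz Pentium II and also relates the measured Pentium rates to the peak value of Table 3.1. It assumes that the video pixel rate is 6-10 MPixels/s, in line with the 6–10 MB/s quoted for gray-scale video in Section 3.2.2; the function name `ops_per_pixel` is hypothetical.

```python
def ops_per_pixel(mflops, mpixels_per_s):
    """Operations available per pixel at a given sustained rate and pixel rate."""
    return mflops / mpixels_per_s


# Measured range quoted for a 266-MHz Pentium II: 70-90 MFLOPS.
# Assumption: video pixel rates of 6-10 MPixels/s.
tightest = ops_per_pixel(70, 10)
most_generous = ops_per_pixel(90, 6)

# Fraction of the peak add rate reached on a 166-MHz Pentium
# (20-30 MFLOPS measured, 167 MFLOPS peak according to Table 3.1).
fraction_low = 20 / 167
fraction_high = 30 / 167

print(f"budget: {tightest:.0f} to {most_generous:.0f} operations per pixel")
print(f"fraction of peak: {fraction_low:.0%} to {fraction_high:.0%}")
```

Under these assumptions the budget is about 7 to 15 floating-point operations per pixel, which is fewer than the 17 operations (9 multiplications and 8 additions) of a single 3 × 3 convolution.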