Instruction Parallelization

Compilers translate the high-level implementation of video algorithms into low-level machine instructions. Some of these instructions do not depend on the results of earlier instructions and can therefore be scheduled to execute concurrently. This potential overlap among instructions forms the basis of instruction parallelization. For example, consider the following code:

1 R4 = R1 + R2

2 R5 = R1 - R3

3 R6 = R4 + R5

4 R7 = R4 - R5

In this example, there is no dependence between instructions 1 and 2, or between instructions 3 and 4, but instructions 3 and 4 both depend on the completion of instructions 1 and 2. Thus, instructions 1 and 2 can be executed in parallel with each other, and once they complete, instructions 3 and 4 can likewise be executed in parallel. Instruction parallelization is usually achieved by compiler-based optimization and by hardware techniques. However, unlimited instruction parallelization is not possible; it is typically limited by data dependencies, procedural dependencies, and resource conflicts.
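The dependence analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding of the four instructions and the function name `schedule_levels` are hypothetical, and only read-after-write dependences are tracked (write-after-read and write-after-write hazards, as well as resource limits, are ignored).

```python
# Hypothetical encoding of the four instructions above:
# instruction number -> (destination register, source registers)
instructions = {
    1: ("R4", ("R1", "R2")),
    2: ("R5", ("R1", "R3")),
    3: ("R6", ("R4", "R5")),
    4: ("R7", ("R4", "R5")),
}

def schedule_levels(instrs):
    """Group instructions into steps of mutually independent instructions.

    Only read-after-write dependences are considered (an assumption of
    this sketch); an instruction is placed one step after the latest
    instruction that produces one of its source registers.
    """
    producer = {}  # register -> number of the instruction that last wrote it
    level = {}     # instruction number -> earliest step it can run in
    for num in sorted(instrs):
        dest, sources = instrs[num]
        deps = [producer[reg] for reg in sources if reg in producer]
        level[num] = 1 + max((level[d] for d in deps), default=0)
        producer[dest] = num
    steps = {}
    for num, lvl in level.items():
        steps.setdefault(lvl, []).append(num)
    return [steps[lvl] for lvl in sorted(steps)]

steps = schedule_levels(instructions)
print(steps)  # [[1, 2], [3, 4]]
```

Each inner list is a group of instructions that could be issued together, which matches the grouping in the text: instructions 1 and 2 first, then instructions 3 and 4.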

Figure 5-8. An example of SIMD technique

[21] S. M. Akramullah, I. Ahmad, and M. L. Liou, “A Data-parallel Approach for Real-time MPEG-2

Instructions in reduced instruction set computer (RISC) processors pass through four stages that can be overlapped to achieve an average performance close to one instruction per cycle: instruction fetch, decode, execute, and result write-back. It is common to fetch and decode two instructions A and B simultaneously, but if instruction B has a read-after-write dependency on instruction A, the execution stage of B must wait until the write for A has completed. Mainly owing to such inter-instruction dependences, scalar processors, which execute one instruction at a time, cannot achieve more than one instruction per cycle. Superscalar processors, however, exploit instruction parallelization to execute more than one independent instruction at a time; for example, z=x+y and c=a*b can be executed together. In these processors, hardware detects the independent instructions and executes them in parallel.
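To see why overlapping the four stages brings the average close to one instruction per cycle, consider the following simplified model. It assumes an ideal scalar pipeline in which one instruction enters per cycle and every dependence stall costs whole extra cycles; the function names and the `stall_cycles` parameter are hypothetical and serve only this illustration.

```python
def pipeline_cycles(num_instructions, num_stages=4, stall_cycles=0):
    """Cycles needed to complete a run of instructions on an ideal pipeline.

    The first instruction needs num_stages cycles; each later instruction
    completes one cycle after its predecessor. stall_cycles is the total
    number of extra cycles lost to dependences (an assumed input).
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1) + stall_cycles

def cycles_per_instruction(num_instructions, num_stages=4, stall_cycles=0):
    """Average cycles per instruction for the same simplified model."""
    return pipeline_cycles(num_instructions, num_stages, stall_cycles) / num_instructions

print(pipeline_cycles(4))                            # 7
print(cycles_per_instruction(1000))                  # about 1.003
print(cycles_per_instruction(1000, stall_cycles=200))  # about 1.203
```

In this model the average approaches one cycle per instruction as the run gets longer, but it never drops below one, and every stall pushes it further above one.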

As an alternative to superscalar processors, the very long instruction word (VLIW) processor architecture takes advantage of instruction parallelization by allowing programs to explicitly specify which instructions execute in parallel. These architectures rely on an aggressive compiler to schedule multiple operations into one VLIW per cycle; the compiler, rather than the hardware, is responsible for finding and scheduling the parallel instructions. In practical VLIW processors such as the Equator BSP-15, the integrated caches are small: the 32 KB data cache and 32 KB instruction cache typically act as bridges between the higher-speed processor core and the relatively lower-speed memory. It is therefore very important to stream data in without interruption so that the core does not stall waiting for memory.
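The compile-time scheduling that a VLIW compiler performs can be illustrated with a toy bundle packer. This is a sketch under several assumptions and not the scheduler of any real compiler: the issue width of two, the operation list, and the function name `pack_bundles` are hypothetical, every register is assumed to be written only once, and all operations are assumed to take one cycle.

```python
def pack_bundles(ops, width):
    """Greedily pack operations into fixed-width instruction bundles.

    ops is a list of (name, destination register, source registers).
    An operation is placed in the earliest bundle that comes after the
    bundles producing its sources and that still has a free slot.
    Empty slots are filled with "nop".
    """
    ready_at = {}  # register -> index of the first bundle that may read it
    bundles = []
    for name, dest, sources in ops:
        slot = max((ready_at.get(reg, 0) for reg in sources), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1                      # resource conflict: bundle is full
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(name)
        ready_at[dest] = slot + 1          # result is visible to the next bundle
    return [bundle + ["nop"] * (width - len(bundle)) for bundle in bundles]

# Hypothetical operation list: three independent operations followed by
# two that depend on the results in R4 and R5.
ops = [
    ("a", "R4", ("R1", "R2")),
    ("b", "R5", ("R1", "R3")),
    ("c", "R8", ("R2", "R3")),
    ("d", "R6", ("R4", "R5")),
    ("e", "R7", ("R4", "R5")),
]
bundles = pack_bundles(ops, width=2)
print(bundles)  # [['a', 'b'], ['c', 'd'], ['e', 'nop']]
```

Operation c is independent of a and b but is pushed to the second bundle because the first is already full, which is a small instance of a resource conflict limiting parallelism.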

To better understand how to take advantage of instruction parallelism in video coding, let’s consider an example video encoder implementation on a VLIW platform [22]. Figure 5-9 shows a block diagram of the general structure of the encoding system.

Figure 5-9. A block diagram of a video encoder on a VLIW platform

[22] S.M. Akramullah, R. Giduthuri, and G. Rajan, “MPEG-4 Advanced Simple Profile Video Encoding on an Embedded Multimedia System,” in Proceedings of the SPIE 5308, Visual Communications and

Here, the macroblocks are processed in a pipelined fashion as they pass through the different encoding tasks in the various pipeline stages of the encoder core. A direct memory access (DMA) controller, commonly known as the data streamer, helps prefetch the necessary data. A double buffering technique is used to feed the pipeline stages continually. This technique uses two buffers in an alternating fashion: while the data in one buffer is actively used, the next set of data is loaded into the second buffer. When processing of the active buffer’s data is done, the second buffer becomes the new active buffer and processing of its data starts, while the buffer with used-up data is refilled with new data. Such a design helps avoid the performance bottleneck of the processor waiting for data.
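The alternating use of two buffers can be sketched as follows. This is a minimal illustration and not the encoder's actual implementation: a single worker thread stands in for the DMA data streamer, and the names `fill` and `run_double_buffered` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fill(buffer, block):
    """Stand-in for a DMA transfer: overwrite the buffer with one block."""
    buffer[:] = block

def run_double_buffered(blocks, process):
    """Process blocks in order while the next block is loaded in the background."""
    buffers = [[], []]  # the two alternating buffers
    results = []
    if not blocks:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Prime the first buffer before processing starts.
        pending = dma.submit(fill, buffers[0], blocks[0])
        for i in range(len(blocks)):
            pending.result()          # wait until the active buffer is full
            active = buffers[i % 2]
            if i + 1 < len(blocks):
                # Refill the idle buffer while the active one is processed.
                pending = dma.submit(fill, buffers[(i + 1) % 2], blocks[i + 1])
            results.append(process(active))
    return results

blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sums = run_double_buffered(blocks, sum)
print(sums)  # [6, 15, 24]
```

The idle buffer is only refilled after its previous contents have been fully processed, so the processing step never reads a buffer that is still being written.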

Fetching the appropriate information into the cache is extremely important; care needs to be taken so that both the data and the instruction caches are maximally utilized. To minimize cache misses, the instructions for each stage in the pipeline must fit into the instruction cache, while the data must fit into the data cache. It is possible to rearrange the program to coax the compiler into generating instructions that fit into the instruction cache. Similarly, careful consideration of data prefetch keeps the data cache full. For example, the quantized DCT coefficients can be stored in a way that helps data prefetching in some Intra prediction modes, where only seven coefficients (either from the top row or from the left column) are needed at a given time. The coefficients have a dynamic range of [-2048, 2047], which needs 12 bits each in two's complement representation, but they are usually stored as signed 16-bit values. Seven such coefficients fit into two 64-bit registers, leaving one 16-bit slot unoccupied. Another 16-bit element relevant to this pipeline stage, such as the quantizer scale or the DC scaler, can be packed together with the quantized coefficients to fill the unoccupied slot, thereby achieving better cache utilization.
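The packing described above can be sketched as follows. The lane order (first value in the lowest 16 bits, the scaler in the last lane) and the function names are assumptions made for this illustration; the actual register layout on a given platform may differ.

```python
MASK16 = 0xFFFF

def pack4(values):
    """Pack four signed 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    if len(values) != 4:
        raise ValueError("exactly four values are required")
    word = 0
    for lane, value in enumerate(values):
        if not -32768 <= value <= 32767:
            raise ValueError("value does not fit in 16 bits")
        word |= (value & MASK16) << (16 * lane)
    return word

def unpack4(word):
    """Recover the four signed 16-bit values from a 64-bit word."""
    values = []
    for lane in range(4):
        raw = (word >> (16 * lane)) & MASK16
        values.append(raw - 0x10000 if raw & 0x8000 else raw)
    return values

def pack_row(coeffs, scaler):
    """Pack seven coefficients plus one 16-bit scaler into two 64-bit words."""
    if len(coeffs) != 7:
        raise ValueError("exactly seven coefficients are required")
    lanes = list(coeffs) + [scaler]  # the scaler fills the otherwise empty slot
    return pack4(lanes[:4]), pack4(lanes[4:])

coeffs = [-2048, 2047, -1, 0, 5, -7, 100]
low_word, high_word = pack_row(coeffs, scaler=8)
restored = unpack4(low_word) + unpack4(high_word)
print(restored)  # [-2048, 2047, -1, 0, 5, -7, 100, 8]
```

Both words stay within 64 bits, and the eighth lane carries the scaler instead of padding, so a single pair of loads brings in everything the stage needs.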