

Introduction to TMS320C55x Digital Signal Processor

2.5 Pipeline and Parallelism

The pipeline technique has been widely used by many DSP manufacturers to improve processor performance. The pipeline execution breaks a sequence of operations into smaller segments and executes these smaller pieces in parallel. The TMS320C55x uses the pipelining mechanism to efficiently execute its instructions to reduce the overall execution time.

2.5.1 TMS320C55x Pipeline

Separated by the instruction buffer unit, the pipeline operation is divided into two independent pipelines: the program fetch pipeline and the program execution pipeline (see Figure 2.12). The program fetch pipeline consists of the following three stages (it uses three clock cycles):

PA (program address): The C55x instruction unit places the program address on the program-read address bus (PAB).

PM (program memory address stable): The C55x requires one clock cycle for its program memory address bus to be stabilized before that memory can be read.

PB (program fetch from program data bus): In this stage, four bytes of the program code are fetched from the program memory via the 32-bit program data-read bus (PB).

Figure 2.12 The C55x pipeline execution diagram


The code is placed into the instruction buffer queue (IBQ). For every clock cycle, the IU will fetch four bytes to the IBQ. The numbers on the top of the diagram represent the CPU clock cycle.

At the same time, the seven-stage execution pipeline performs the fetch, decode, address, access, read, and execute sequence independently of the program fetch pipeline. The access stage takes two clock cycles, which gives seven stages in total. The C55x program execution pipeline stages are summarized as follows:

F (fetch): In the fetch stage, an instruction is fetched from the IBQ. The size of the instruction can be one byte for simple operations, or up to six bytes for more complex operations.

D (decode): During the decoding process, decode logic gets one to six bytes from the IBQ and decodes these bytes into an instruction or an instruction pair under the parallel operation. The decode logic will dispatch the instruction to the program flow unit (PU), address flow unit (AU), or data computation unit (DU).

AD (address): In this stage, the AU calculates data memory addresses using its data-address generation unit (DAGEN), modifies pointers if required, and computes the program-space address for PC-relative branching instructions.

AC (access cycles 1 and 2): The first cycle is used for the C55x CPU to send the address for read operations to the data-read address buses (BAB, CAB, and DAB), or transfer an operand to the CPU via the C-bus (CB). The second access cycle is inserted to allow the address lines to be stabilized before the memory is read.

R (read): In the read stage, the data and operands are transferred to the CPU via the CB for the Ymem operand, the B-bus (BB) for the Cmem operand, and the D-bus (DB) for the Smem or the Xmem operands. For the Lmem operand read, both the CB and the DB will be used. The AU will generate the address for the operand write and send the address to the data-write address buses (EAB and FAB).

X (execute): Most data processing work is done in this stage. The ALU inside the AU and the ALU inside the DU perform the data processing operations. This stage also stores an operand via the F-bus (FB), or stores a long operand via the E-bus and F-bus (EB and FB).
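The overlap of the seven execution stages can be visualized with a small simulation. The following Python sketch is only an idealized model: it assumes one instruction enters the pipeline per clock cycle, with no stalls or branches. The function name `pipeline_chart` and the labels AC1 and AC2 for the two access cycles are illustrative choices made here, not names taken from the C55x documentation.

```python
# Idealized model of the seven-stage execution pipeline (no stalls, no branches).
# AC1/AC2 are labels chosen here for the two access cycles described in the text.
STAGES = ["F", "D", "AD", "AC1", "AC2", "R", "X"]


def pipeline_chart(num_instructions):
    """Return one row per instruction.

    row[c] holds the stage that instruction occupies in clock cycle c + 1,
    or an empty string when the instruction is not in the pipeline.
    """
    total_cycles = len(STAGES) + num_instructions - 1
    chart = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        chart.append(row)
    return chart


if __name__ == "__main__":
    for i, row in enumerate(pipeline_chart(4), start=1):
        print(f"I{i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Each printed row is shifted one column to the right of the row above it, which is the staircase pattern a pipeline execution diagram normally shows.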

The C55x pipeline diagram illustrated in Figure 2.12 explains how the C55x pipeline works. The execution pipeline is full after seven cycles, and every cycle that follows completes one instruction. As long as the pipeline stays full, throughput approaches seven times that of executing the seven stages of each instruction one after another. However, the efficiency of the pipeline depends on the sequential execution of instructions. When a disruptive operation such as a branch instruction occurs, the sudden change of the program flow breaks the pipeline sequence.
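The "seven times" figure is an upper bound that is approached only for long instruction sequences. A minimal sketch of the arithmetic follows, assuming that an unpipelined processor would spend seven cycles on every instruction and that the pipeline never stalls. The function names are hypothetical.

```python
def pipelined_cycles(n, stages=7):
    """Cycles needed to complete n instructions in an ideal pipeline.

    The first instruction takes `stages` cycles to fill the pipeline;
    each further instruction completes one cycle later.
    """
    if n <= 0:
        return 0
    return stages + n - 1


def speedup(n, stages=7):
    """Ratio of unpipelined cycles (stages * n) to ideal pipelined cycles."""
    return (stages * n) / pipelined_cycles(n, stages)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, pipelined_cycles(n), round(speedup(n), 3))
```

With a single instruction there is no gain at all, and the ratio rises toward seven as the number of instructions grows.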

Under such circumstances, the pipeline will be flushed and will need to be refilled. This is called pipeline break down. The use of IBQ can minimize the impact of the pipeline break down. Proper use of conditional execution instructions to replace branch instructions can also reduce the pipeline break down.
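The cost of pipeline breakdown can be estimated with a simple model. The sketch below assumes that every taken branch adds a fixed number of refill cycles. That penalty is a hypothetical parameter supplied by the caller: the real cost on the C55x depends on the branch instruction and on whether the IBQ already holds the target code, and is not given in this section.

```python
def cycles_with_flushes(n, taken_branches, refill_penalty, stages=7):
    """Estimated cycles for n instructions when some branches flush the pipeline.

    refill_penalty is an assumed, caller-supplied number of cycles lost per
    taken branch; it is not a documented C55x figure.
    """
    if n <= 0:
        return 0
    return stages + n - 1 + taken_branches * refill_penalty


def cycles_per_instruction(n, taken_branches, refill_penalty, stages=7):
    """Average cycles per instruction under the same assumptions."""
    return cycles_with_flushes(n, taken_branches, refill_penalty, stages) / n


if __name__ == "__main__":
    # Illustrative numbers only: 100 instructions, an assumed 6-cycle penalty.
    print(cycles_with_flushes(100, 10, 6))  # ten taken branches
    print(cycles_with_flushes(100, 0, 6))   # branches replaced by conditional execution
```

Setting `taken_branches` to zero models the case in which branches have been replaced by conditional execution instructions, so the pipeline is never flushed.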

2.5.2 Parallel Execution

The parallelism of the TMS320C55x relies on the processor's multiple-bus architecture, dual MAC units, and separate PU, AU, and DU. The C55x supports two types of parallel processing: implied and explicit. The implied parallel instructions are the built-in instructions. They use a pair of colons, '::', to separate the two instructions that will be processed in parallel. The explicit parallel instructions are the user-built instructions. They use the parallel bar, '||', to indicate the pair of parallel instructions. These two types of parallel instructions can be used together to form a combined parallel instruction. The following examples show the user-built, built-in, and combined parallel instructions. Each example is carried out in just one clock cycle.

User-built:

    mpym *AR1+, *AR2+, AC0    ; User-built parallel instruction
    ||and AR4, T1             ; Using DU and AU

Built-in:

    mac *AR0 , *CDP , AC0     ; Built-in parallel instruction
    ::mac *AR1 , *CDP , AC1   ; Using dual-MAC units

Built-in and User-built Combination:

    mpy *AR2+, *CDP+, AC0     ; Combined parallel instruction
    ::mpy *AR3+, *CDP+, AC1   ; Using dual-MAC units and PU
    ||rpt #15

Some of the restrictions when using parallel instructions are summarized as follows:

• For either the user-built or the built-in parallelism, only two instructions can be executed in parallel, and together these two instructions must not exceed six bytes.

• Not all instructions can be used for parallel operations.

• When addressing memory space, only the indirect addressing mode is allowed.

• Parallelism is allowed between and within execution units, but there cannot be any hardware resource conflicts between units, buses, or within the unit itself.

There are several restrictions that define the parallelism within each unit when applying parallelism to assembly coding. The detailed descriptions are given in the TMS320C55x DSP Mnemonic Instruction Set Reference Guide [4].
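Two of the restrictions listed above, the six-byte limit and the indirect-addressing requirement, are mechanical enough to check automatically. The following Python sketch illustrates the idea under stated assumptions: the `Instr` class and `check_parallel_pair` function are hypothetical, the instruction sizes are supplied by the caller rather than derived from real C55x encodings, and an operand is treated as indirect simply when it begins with '*'. It does not attempt to model hardware resource conflicts.

```python
from dataclasses import dataclass, field

MAX_PAIR_BYTES = 6  # combined size limit stated in the restrictions


@dataclass
class Instr:
    text: str            # assembly text, used only for messages
    size_bytes: int      # assumed encoding size, supplied by the caller
    mem_operands: list = field(default_factory=list)  # memory operands only


def is_indirect(operand):
    """Treat operands written as *ARn, *ARn+, *CDP, ... as indirect."""
    return operand.startswith("*")


def check_parallel_pair(first, second):
    """Return a list of rule violations for a two-instruction parallel pair."""
    problems = []
    if first.size_bytes + second.size_bytes > MAX_PAIR_BYTES:
        problems.append("pair exceeds six bytes")
    for ins in (first, second):
        for op in ins.mem_operands:
            if not is_indirect(op):
                problems.append(f"'{op}' in '{ins.text}' is not an indirect operand")
    return problems


if __name__ == "__main__":
    # Sizes below are placeholders for illustration, not real encodings.
    pair = (Instr("mpym *AR1+, *AR2+, AC0", 4, ["*AR1+", "*AR2+"]),
            Instr("and AR4, T1", 2))
    print(check_parallel_pair(*pair))
```

An empty list means that neither of the two modeled rules is violated. The pair may still be illegal for reasons this sketch does not cover, such as bus or register conflicts.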

The PU, AU, and DU can all be involved in parallel operations. Knowing which registers belong to each of these units helps in spotting potential conflicts when using the parallel instructions. Table 2.5 lists some of the registers in the PU, AU, and DU.

The parallel instructions used in the following example are incorrect because the second instruction uses the direct addressing mode:

    mov *AR2, AC0
    ||mov T1, @x

We can correct the problem by replacing the direct addressing mode, @x, with an indirect addressing mode, *AR1, so both memory accesses are using indirect addressing mode as follows:

    mov *AR2, AC0
    ||mov T1, *AR1

Table 2.5 Partial list of the C55x registers and buses

PU Registers/Buses    AU Registers/Buses            DU Registers/Buses
RPTC                  T0, T1, T2, T3                AC0, AC1, AC2, AC3
BRC0, BRC1            AR0, AR1, AR2, AR3,           TRN0, TRN1
RSA0, RSA1            AR4, AR5, AR6, AR7
REA0, REA1            CDP
                      BSA01, BSA23, BSA45, BSA67
                      BK01, BK23, BK45, BK67
Read Buses: C, D      Read Buses: C, D              Read Buses: B, C, D
Write Buses: E, F     Write Buses: E, F             Write Buses: E, F

Consider the following example where the first instruction loads the content of AC0 that resides inside the DU to the auxiliary register AR2 inside the AU. The second instruction attempts to use the content of AC3 as the program address for a function call. Because there is only one link between AU and DU, when both instructions try to access the accumulators in the DU via the single link, it creates a conflict.

    mov AC0, AR2
    ||call AC3

To solve the problem, we can change the subroutine call from call by accumulator to call by address as follows:

    mov AC0, AR2
    ||call my_func

This is because the instruction, call my_func, only needs the PU.

The coefficient indirect addressing mode can be combined with the dual-AR indirect addressing mode. Used together, they support three simultaneous memory accesses (Xmem, Ymem, and Cmem). The finite impulse response (FIR) filter, introduced in Chapter 3, is an application that can use the coefficient indirect addressing mode effectively. The following code segment is an example of using the coefficient indirect addressing mode:

    mpy *AR1+, *CDP+, AC2     ; AR1 pointer to data X1
    ::mpy *AR2+, *CDP+, AC3   ; AR2 pointer to data X2
    ||rpt #6                  ; Repeat the following 7 times
    mac *AR1+, *CDP+, AC2     ; AC2 has accumulated result
    ::mac *AR2+, *CDP+, AC3   ; AC3 has another result

In this example, the memory buffers (Xmem and Ymem) are pointed at by AR1 and AR2, respectively, while the coefficient array is pointed at by CDP. The multiplication results are added to the contents of the accumulators AC2 and AC3, and the final results are stored back in AC2 and AC3.
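The arithmetic carried out by this code segment can be mirrored in a few lines of Python. The sketch below is a functional model only: it uses plain integers and ignores the fixed-point formats, saturation, and rounding of the real accumulators. The function name `dual_mac` and the list arguments are assumptions made for illustration. The initial mpy pair followed by seven repeated mac pairs corresponds to eight products per accumulator.

```python
def dual_mac(x1, x2, coeff):
    """Model the mpy::mpy pair followed by the repeated mac::mac pair.

    x1 and x2 stand for the buffers addressed by AR1 and AR2, and coeff for
    the array addressed by CDP. All three pointers are post-incremented once
    per parallel pair, and both MAC units share the same coefficient.
    """
    if not coeff or not (len(x1) == len(x2) == len(coeff)):
        raise ValueError("x1, x2 and coeff must be non-empty and equally long")

    ar1 = ar2 = cdp = 0

    # mpy *AR1+, *CDP+, AC2 :: mpy *AR2+, *CDP+, AC3
    ac2 = x1[ar1] * coeff[cdp]
    ac3 = x2[ar2] * coeff[cdp]
    ar1, ar2, cdp = ar1 + 1, ar2 + 1, cdp + 1

    # rpt #6 repeats the mac pair 7 times when coeff has 8 entries
    for _ in range(len(coeff) - 1):
        ac2 += x1[ar1] * coeff[cdp]
        ac3 += x2[ar2] * coeff[cdp]
        ar1, ar2, cdp = ar1 + 1, ar2 + 1, cdp + 1

    return ac2, ac3


if __name__ == "__main__":
    data1 = list(range(1, 9))      # 1, 2, ..., 8
    data2 = list(range(8, 0, -1))  # 8, 7, ..., 1
    taps = list(range(1, 9))
    print(dual_mac(data1, data2, taps))
```

Because both MAC units read the same coefficient in each cycle, two sums of products over the same coefficient array are obtained in the time a single MAC unit would need for one.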