Introduction to TMS320C55x Digital Signal Processor
2.2 TMS320C55x Architecture
The C55x CPU consists of four processing units: an instruction buffer unit (IU), a program flow unit (PU), an address-data flow unit (AU), and a data computation unit (DU). These units are connected to 12 different address and data buses as shown in Figure 2.1.
2.2.1 TMS320C55x Architecture Overview
Instruction buffer unit (IU): This unit fetches instructions from memory into the CPU. The C55x is designed for optimum execution time and code density, so its instructions vary in length. Simple instructions are encoded using eight bits (one byte), while more complicated instructions may contain as many as 48 bits (six bytes). In each clock cycle, the IU can fetch four bytes of program code via its 32-bit program-read data bus. At the same time, the IU can decode up to six bytes of program code.
Figure 2.1 Block diagram of TMS320C55x CPU. Buses shown: 32-bit program-read data bus (PB); 24-bit program-read address bus (PAB); three 16-bit data-read data buses (BB, CB, DB); three 24-bit data-read address buses (BAB, CAB, DAB); two 16-bit data-write data buses (EB, FB); two 24-bit data-write address buses (EAB, FAB).
Figure 2.2 Simplified block diagram of the C55x instruction buffer unit
After four program bytes are fetched, the IU places them into the 64-byte instruction buffer. At the same time, the decoding logic decodes an instruction of one to six bytes previously placed in the instruction decoder as shown in Figure 2.2. The decoded instruction is passed to the PU, the AU, or the DU.
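The interplay between the four-byte fetch and the variable-length decode can be illustrated with a toy queue model. This is only a sketch under simplifying assumptions: the function name simulate_iu, the rate of one decoded instruction per cycle, and the rule that the decoder sees only bytes fetched in earlier cycles are illustrative choices, not a description of the actual hardware timing.

```python
from collections import deque

FETCH_BYTES = 4      # bytes delivered per cycle over the 32-bit PB
BUFFER_BYTES = 64    # capacity of the instruction buffer


def simulate_iu(instr_lengths):
    """Return the number of cycles a toy IU needs to decode all instructions.

    instr_lengths: list of instruction sizes in bytes (each 1 to 6).
    Assumption (not taken from the hardware manual): one instruction is
    decoded per cycle, using only bytes fetched in earlier cycles.
    """
    if any(not 1 <= n <= 6 for n in instr_lengths):
        raise ValueError("C55x instructions are 1 to 6 bytes long")
    pending = deque(instr_lengths)
    to_fetch = sum(instr_lengths)   # program bytes still in memory
    buffered = 0                    # bytes currently held in the buffer
    cycles = 0
    while pending:
        cycles += 1
        # Decode first, so the decoder only sees bytes from previous cycles.
        if buffered >= pending[0]:
            buffered -= pending.popleft()
        # Fetch up to four bytes if the buffer has room.
        got = min(FETCH_BYTES, to_fetch, BUFFER_BYTES - buffered)
        buffered += got
        to_fetch -= got
    return cycles
```

In this model a run of one-byte instructions is limited by the decode rate, while a run of six-byte instructions is limited by the four-byte fetch width, which is why the buffer between the two stages is useful.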
The IU improves the efficiency of program execution by maintaining a constant stream of instructions among the four units within the CPU. If the IU is able to hold the code segment of a loop, the loop can be repeated many times without fetching additional code. This capability not only improves the loop execution time but also reduces power consumption by reducing program accesses to memory. Another advantage is that the instruction buffer can hold multiple instructions that are used in conjunction with conditional program flow control. This can minimize the overhead caused by program flow discontinuities such as conditional calls and branches.
Program flow unit (PU): This unit controls DSP program execution flow. As illustrated in Figure 2.3, the PU consists of a program counter (PC), four status registers, a program address generator, and a pipeline protection unit. The PC tracks the C55x program execution every clock cycle. The program address generator produces a 24-bit address that covers 16 Mbytes of program space. Since most instructions are executed sequentially, the C55x uses a pipelined structure to improve its execution efficiency. However, instructions such as branches, calls, returns, conditional execution, and interrupts cause a non-sequential change of program address. The PU uses a dedicated pipeline protection unit to protect the program flow from pipeline vulnerabilities caused by non-sequential execution.
Address-data flow unit (AU): The address-data flow unit serves as the data access manager for the data-read and data-write buses. The block diagram in Figure 2.4 shows that the AU generates the data-space addresses for data reads and data writes. It also shows that the AU consists of eight 23-bit extended auxiliary registers (XAR0-XAR7), four 16-bit temporary registers (T0-T3), a 23-bit extended coefficient data pointer (XCDP), and a 23-bit extended stack pointer (XSP). It has an additional 16-bit ALU that can be used for simple arithmetic operations. The temporary registers may be used to improve compiler efficiency by minimizing the need for memory access. The AU allows two address registers and a coefficient pointer to be used together for processing two data values and one coefficient in a single clock cycle. The AU also supports up to five circular buffers, which will be discussed later.
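The idea behind a circular buffer is that an address pointer wraps back to the start of the buffer when it passes the end. A minimal software sketch of that index update is given below; the function names circular_step and read_circular are hypothetical, and the code models only the wrap-around concept, not the C55x circular addressing registers or their configuration.

```python
def circular_step(index, step, size):
    """Advance a buffer index by step (may be negative), wrapping into [0, size)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return (index + step) % size


def read_circular(buffer, start, count):
    """Read count samples beginning at start, wrapping at the end of buffer."""
    out = []
    idx = start
    for _ in range(count):
        out.append(buffer[idx])
        idx = circular_step(idx, 1, len(buffer))
    return out
```

For example, reading five samples from a four-element buffer starting at index 2 visits indices 2, 3, 0, 1, 2, which is the access pattern a delay line in a filter typically needs.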
Data computation unit (DU): The DU handles data processing for most C55x applications. As illustrated in Figure 2.5, the DU consists of a pair of MAC units, a 40-bit ALU, four 40-bit accumulators (AC0, AC1, AC2, and AC3), a barrel shifter, and rounding and saturation control logic. There are three data-read data buses that allow two data paths and a coefficient path to be connected to the dual-MAC units simultaneously. In a single cycle, each MAC unit can perform a 17-bit multiplication and a 40-bit addition or subtraction with a saturation option. The ALU can perform 40-bit arithmetic, logic, rounding, and saturation operations using the four accumulators. It can also perform two 16-bit arithmetic operations at the same time, one in the upper and one in the lower portion of an accumulator. The ALU can accept immediate values from the IU as data and communicate with other AU and PU registers. The barrel shifter may be used to perform a data shift in the range of 2^-32 (shift right by 32 bits) to 2^31 (shift left by 31 bits).
Figure 2.3 Simplified block diagram of the C55x program flow unit
Figure 2.4 Simplified block diagram of the C55x address-data flow unit
Figure 2.5 Simplified block diagram of the C55x data computation unit
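The arithmetic performed by the DU can be mimicked in software to make the bit widths concrete. The sketch below models a multiply-accumulate with 17-bit signed operands and a 40-bit accumulator, a dual MAC sharing one coefficient, and a shift by a count between -32 and 31. All function names are hypothetical, and two behaviors are assumptions made for illustration: results wrap in two's complement when saturation is off, and left shifts wrap to 40 bits. The real device's behavior depends on status register settings that are not covered here.

```python
ACC_BITS = 40
ACC_MAX = (1 << (ACC_BITS - 1)) - 1    # largest 40-bit signed value
ACC_MIN = -(1 << (ACC_BITS - 1))       # smallest 40-bit signed value


def saturate40(value):
    """Clamp value to the 40-bit signed range."""
    return max(ACC_MIN, min(ACC_MAX, value))


def wrap40(value):
    """Reduce value to 40 bits with two's complement wrap-around (assumed)."""
    value &= (1 << ACC_BITS) - 1
    return value - (1 << ACC_BITS) if value > ACC_MAX else value


def mac(acc, x, y, saturate=True):
    """Return acc + x*y for 17-bit signed x, y and a 40-bit accumulator."""
    for operand in (x, y):
        if not -(1 << 16) <= operand < (1 << 16):
            raise ValueError("operand does not fit in 17 signed bits")
    result = acc + x * y
    return saturate40(result) if saturate else wrap40(result)


def dual_mac(acc0, acc1, x0, x1, coeff):
    """Two MACs in one step: two data values share a single coefficient."""
    return mac(acc0, x0, coeff), mac(acc1, x1, coeff)


def barrel_shift(value, count):
    """Shift left for count >= 0, arithmetic shift right for count < 0."""
    if not -32 <= count <= 31:
        raise ValueError("shift count must be in -32..31")
    if count >= 0:
        return wrap40(value << count)
    return value >> -count
```

The dual_mac helper mirrors the bus arrangement described above: two data operands and one coefficient are consumed together, and each product is accumulated into its own accumulator.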
2.2.2 TMS320C55x Buses
As illustrated in Figure 2.1, the TMS320C55x has one 32-bit program data bus, five 16-bit data buses, and six 24-bit address buses. The program buses include a 32-bit program-read data bus (PB) and a 24-bit program-read address bus (PAB). The PAB carries the program memory address used to read code from the program space. Program addresses are in units of bytes. Thus the addressable program space is in the range of 0x000000 to 0xFFFFFF (the prefix 0x indicates that the following number is in hexadecimal format). The PB transfers four bytes of program code to the IU each clock cycle.
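The figures quoted for the program buses follow directly from the bus widths. The short sketch below works through that arithmetic; the constant and function names are hypothetical, and min_fetch_cycles gives only a lower bound that ignores memory wait states and alignment.

```python
PAB_BITS = 24    # width of the program-read address bus
PB_BITS = 32     # width of the program-read data bus


def program_space_bytes():
    """Number of bytes addressable with a 24-bit byte address."""
    return 1 << PAB_BITS


def last_program_address():
    """Highest byte address in program space."""
    return (1 << PAB_BITS) - 1


def min_fetch_cycles(code_bytes):
    """Lower bound on cycles to fetch code_bytes at four bytes per cycle."""
    bytes_per_cycle = PB_BITS // 8
    return -(-code_bytes // bytes_per_cycle)   # ceiling division
```

A 24-bit byte address therefore spans 16 Mbytes, and a 10-byte code fragment needs at least three fetch cycles.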
The data buses consist of three 16-bit data-read data buses (BB, CB, and DB) and three 24-bit data-read address buses (BAB, CAB, and DAB). This architecture supports three simultaneous data reads from data memory or I/O space. The C and D buses (CB and DB) can send data to the PU, AU, and DU, while the B bus (BB) can only work with the DU. The primary function of the BB is to connect memory to the dual-MAC units, so that certain operations, such as fetching two data values and one coefficient, can use all three data-read buses. The data-write operations are carried out using two 16-bit data-write data buses (EB and FB) and two 24-bit data-write address buses (EAB and FAB). For a single 16-bit data write, only the EB is used. A 32-bit data write uses both the EB and FB in one cycle. The data-write address buses (EAB and FAB) have the same 24-bit addressing range. Since data access uses a word unit (two bytes), the data memory space is addressed with 23-bit word addresses from 0x000000 to 0x7FFFFF.
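The relationship between byte and word addresses, and the way a 32-bit write occupies two 16-bit buses, can be sketched as follows. The function names are hypothetical. The split returns the upper and lower 16-bit halves without assigning them to EB or FB, since the text does not state which bus carries which half.

```python
def byte_to_word_address(byte_addr):
    """Convert a 24-bit byte address to the 23-bit word address containing it."""
    if not 0 <= byte_addr <= 0xFFFFFF:
        raise ValueError("byte address must fit in 24 bits")
    return byte_addr >> 1   # two bytes per word


def split_32bit_write(value):
    """Split a 32-bit value into (upper, lower) 16-bit halves.

    Each half would travel on one of the two 16-bit data-write buses.
    """
    value &= 0xFFFFFFFF
    return (value >> 16) & 0xFFFF, value & 0xFFFF
```

Halving the 24-bit byte range gives the 23-bit word range: the last byte address 0xFFFFFF falls in the last word address 0x7FFFFF.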
The C55x architecture is built around these 12 buses. The program buses carry the instruction code and immediate operands from program memory, while the data buses connect various units. This architecture maximizes the processing power by maintaining separate memory bus structures for full-speed execution.
2.2.3 TMS320C55x Memory Map
The C55x uses a unified program, data, and I/O memory configuration. All 16 Mbytes of memory are available as program or data space. The program space is used for instructions, and the data space is used for general-purpose storage and CPU memory-mapped registers. The I/O space is separate from the program/data space and is used for duplex communication with peripherals. When the CPU fetches instructions from the program space, the C55x address generator uses the 24-bit program-read address bus. The program code is stored in byte units. When the CPU accesses data space, the C55x address generator masks the least significant bit (LSB) of the data address, since data is stored in memory in word units. The 16-Mbyte memory map is shown in Figure 2.6. Data space is divided into 128 data pages (0-127). Each page has 64 K words. The memory block from address 0 to 0x5F in page 0 is reserved for memory-mapped registers (MMRs).
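The page structure of the data space can be checked with a few lines of arithmetic. In the sketch below, the names data_page and is_mmr are hypothetical, and the MMR range 0 to 0x5F is assumed to be expressed in word addresses.

```python
PAGE_WORDS = 64 * 1024   # words per data page
NUM_PAGES = 128          # data pages 0 to 127
MMR_LAST = 0x5F          # last MMR address in page 0 (assumed word address)


def data_page(word_addr):
    """Return (page, offset) for a 23-bit word address."""
    if not 0 <= word_addr < NUM_PAGES * PAGE_WORDS:
        raise ValueError("word address must fit in 23 bits")
    return word_addr >> 16, word_addr & 0xFFFF


def is_mmr(word_addr):
    """True if the word address lies in the reserved MMR block of page 0."""
    return 0 <= word_addr <= MMR_LAST
```

The constants are consistent with the rest of the memory map: 128 pages of 64 K words at two bytes per word account for the full 16 Mbytes.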