simulation (ISS). Support for RTL code generation was added later. Hence, the simulation capabilities are more mature than the RTL synthesis capabilities. The most important shortcomings of the RTL model are listed below:
• The LISA processor generation tools apparently do not recognize that template operations share the same behavior. Therefore, all instances of template operations are compiled separately. This results in increased runtime and code size for wider SIMD models.
• The modeling capabilities for hierarchical RTL models, where resources and logic are grouped locally, are insufficient. While LISA supports assigning operations and resources to functional units using the UNIT resource, bugs limit the usefulness of this mechanism: Register arrays and pipeline registers cannot be assigned to units, while operations need to belong to the same pipeline stage. All resources and operations that are not assigned to user-defined units are grouped in default units for pipeline stages. Hence, it is impossible to model a desired RTL hierarchy in LISA. The impact on gate level synthesis is discussed in section 3.4.3.
• Processor Designer can automatically generate synthesis scripts for commonly used synthesis tools, such as Synopsys Design Compiler. Yet, the generated scripts use outdated commands and are not suitable for hardware synthesis.
Despite the above-mentioned shortcomings of the Processor Designer tools, LISA is still an adequate language for ASIP design and design space exploration (DSE), because it facilitates fast development of the processor architecture and tools. Furthermore, new instructions can be added and tested with little programming effort. Many of the listed shortcomings only occur for complex models that use template resources and operations extensively, like the proposed scalable SIMD processor architecture. The modeling of such processor architectures is evidently not the design focus of the LISA tool set.
3.3 Vertical-horizontal vector processing as an alternative for LIW
LIW and especially VLIW architectures have some disadvantages compared to an architecture that issues one instruction per clock cycle: Firstly, the instruction decoder is more complex. Secondly, more read and write ports are needed for the register files, which leads to increased area and power demands and an increased delay [RDK+00].
An alternative for LIW and VLIW processing based on vertical-horizontal vector processing is sketched in the remainder of this section. Vertical-horizontal vector processing can
Chapter 3 Scalable SIMD processor architecture
potentially overcome the above-mentioned shortcomings of LIW processing at the cost of reduced flexibility.
The term vertical-horizontal vector processing or, more precisely, vertical-horizontal processing pipeline vector computer has been defined by Gao et al. [GZYC86]. The term describes parallel and time-sequential vector processing in an analogy to two-dimensional spatial processing. The horizontal component describes the parallel processing of data vectors, i.e. multiple processing units process data in parallel. The vertical component refers to the iterative processing of data vectors over time, in one instruction. In consequence, vertical-horizontal processing defines a technique where data vectors that are too long to be processed in parallel are segmented into blocks that fit into the parallel data path. The segments are then processed sequentially. The underlying concept has been applied to early vector supercomputers, such as the Cray-1 [Rus78]. Figure 3.13 illustrates the idea. In the first clock cycle, an operation on the vector MAC unit is started, which runs for multiple clock cycles. In the next clock cycle, an operation on the vector ALU is initiated. Both operations run in parallel for the next clock cycles. On a LIW processor architecture, new vmul and vadd instructions would have to be issued in each cycle to achieve the same behavior.
[Figure 3.13 diagram: time axis with Decode and Execute stages; vmul is issued in cycle i on the Vec. MAC, vadd in cycle i+1 on the Vec. ALU.]
Figure 3.13: Example for vertical-horizontal vector processing with two vector units
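The segmentation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [GZYC86] or [Lec09]: the function name `vertical_horizontal_apply` and the parameter `lanes` (the width of the parallel data path) are assumptions introduced here. Each loop pass stands for one clock cycle in which `lanes` elements are processed in parallel (horizontal component); the loop itself is the iteration over time within one instruction (vertical component).

```python
def vertical_horizontal_apply(op, a, b, lanes):
    """Apply an element-wise two-operand op to the vectors a and b.

    Hypothetical model: `lanes` elements are handled per step (horizontal),
    and steps repeat until the whole vector is consumed (vertical).
    Returns the result vector and the number of sequential steps.
    """
    if len(a) != len(b):
        raise ValueError("operand vectors must have equal length")
    result = []
    steps = 0
    for start in range(0, len(a), lanes):
        # One segment that fits into the parallel data path.
        block_a = a[start:start + lanes]
        block_b = b[start:start + lanes]
        result.extend(op(x, y) for x, y in zip(block_a, block_b))
        steps += 1
    return result, steps


a = list(range(16))
b = [2] * 16
product, steps = vertical_horizontal_apply(lambda x, y: x * y, a, b, lanes=4)
print(steps)        # 4 sequential steps for 16 elements on 4 lanes
print(product[:4])  # [0, 2, 4, 6]
```

In this model a single instruction covers all four steps, whereas a LIW processor would need one issued operation per step.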
The concept of a vertical-horizontal vector processing architecture for SDR has been examined in a diploma thesis [Lec09, in German] and is explained below in section 3.3.1. Some performance benchmarking based on three SDR algorithms is described in section 3.3.2.
3.3.1 Vertical-horizontal vector processing for SDR
The basis for the vertical-horizontal SDR architecture is the scalable SIMD processor architecture developed in section 3.1. Modifications have been made to the instruction fetch/decode mechanism and the organization of the register files.
ILP is no longer achieved by issuing multiple instructions per cycle in a long instruction word; instead, one instruction that may iterate for multiple clock cycles is issued in each cycle. As operations that start successively overlap, parallelism is achieved. The iteration count of an instruction is explicitly encoded in the instruction word. As an operation iterates over multiple clock cycles, multiple source and destination registers are required, assuming that the size of one register remains the same. This can be achieved by simply incrementing the register address; in this case, however, the number of required register file ports does not decrease compared to a LIW architecture. Instead, the proposed register file organization is based on the assumption that successive instruction iterations usually do not need to access the same data values. Hence, successive iterations may be mapped to different register file banks. A small number of register banks is provided; instructions iterate through these register banks in a cyclic manner (see figure 3.14). In each clock cycle, each register bank is accessed by a single functional unit, which reduces the number of required ports. If data from one register bank is needed for the calculations in a different bank, it has to be explicitly transferred by an instruction.
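The cyclic bank mapping can be illustrated with a small scheduling model in Python. It is a sketch under stated assumptions, not the mechanism from [Lec09]: the function `bank_schedule` is hypothetical, one instruction is issued per cycle, every instruction starts at bank 0, and iteration j of an instruction accesses bank j modulo the number of banks. The model raises an error if two instructions would access the same bank in the same cycle.

```python
def bank_schedule(num_instructions, iterations, num_banks):
    """Return {cycle: {bank: instruction}} for a cyclic bank mapping.

    Assumptions (hypothetical model): instruction k is issued in cycle k,
    starts at bank 0, and accesses bank (j % num_banks) in iteration j.
    """
    schedule = {}
    for instr in range(num_instructions):
        for iteration in range(iterations):
            cycle = instr + iteration
            bank = iteration % num_banks
            banks_in_cycle = schedule.setdefault(cycle, {})
            if bank in banks_in_cycle:
                # Two instructions would need ports on the same bank.
                raise ValueError(f"bank {bank} accessed twice in cycle {cycle}")
            banks_in_cycle[bank] = instr
    return schedule


schedule = bank_schedule(num_instructions=3, iterations=4, num_banks=4)
for cycle in sorted(schedule):
    print(cycle, schedule[cycle])
```

Under these assumptions no conflict occurs as long as the iteration count does not exceed the number of banks. This is why each bank only needs the ports of a single functional unit.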
[Figure 3.14 diagram: functional units (FU), vector register banks holding registers v0, v1, ..., vN, and the cyclic mapping between units and banks.]
Figure 3.14: Cyclic mapping of FUs on register banks: In each cycle, the read and write ports of a bank are mapped to a single unit. The FUs iterate through the register banks in a cyclic manner.
The partitioning of the register file leads to reduced area and power demands. This effect is demonstrated by the following comparison of a four-way LIW architecture and a vertical-horizontal vector processing architecture with four register banks and, hence, up to four parallel instructions. The asymptotic area and power complexity for a monolithic register file (LIW case) is $O(N_{reg} \cdot p^2)$ [RDK+00], with $p$ denoting the number of ports. The corresponding complexity for a partitioned register file (vertical-horizontal case) is $O(N_{reg} \cdot N_b \cdot p_b^2)$, with $N_b$ and $p_b$ denoting the number of banks and the number of ports per bank, respectively. $N_{reg}$ describes the number of registers per bank. Table 3.11 shows a comparison of different register file configurations. Normalized energy and area are computed based on the model of the asymptotic complexity; normalization is done relative to a monolithic register file for a four-way LIW architecture with 16 registers and 12 ports (four FUs, two read ports and one write port per unit). As energy and area depend on the squared number of ports, the reduced number of ports for a partitioned register file has a significant influence. The table also shows a combination of LIW and vertical-horizontal vector processing with two-way LIW and two register banks (eight registers and six ports per register bank). This architecture configuration requires more area and power than the pure vertical-horizontal vector processing configurations, yet in comparison to a four-way LIW architecture with monolithic register file, the area and power demands are still reduced by 75 percent.
Table 3.11: Model-based register file comparison of monolithic and partitioned register files
Description                              N_reg  N_b  p / p_b  Normalized area & power
Monolithic register file, four-way LIW   16     1    12       1.000
Partitioned register file                4      4    3        0.0625
Partitioned register file                8      4    3        0.125
Partitioned register file, two-way LIW   8      2    6        0.250
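The normalized values in Table 3.11 follow directly from the asymptotic model and can be reproduced with a short Python calculation. The function name `regfile_cost` and the configuration labels are introduced here for illustration only; constant factors of the model are dropped, which is harmless because only ratios are reported.

```python
def regfile_cost(n_reg, n_banks, ports):
    """Asymptotic area/power model N_reg * N_b * p^2, constants dropped."""
    return n_reg * n_banks * ports ** 2


# Reference: monolithic register file, four-way LIW, 16 registers, 12 ports.
reference = regfile_cost(n_reg=16, n_banks=1, ports=12)

# (registers per bank, number of banks, ports per bank), taken from Table 3.11.
configs = {
    "monolithic, four-way LIW": (16, 1, 12),
    "partitioned, 4 registers per bank": (4, 4, 3),
    "partitioned, 8 registers per bank": (8, 4, 3),
    "partitioned, two-way LIW": (8, 2, 6),
}

normalized = {
    name: regfile_cost(*params) / reference for name, params in configs.items()
}
for name, value in normalized.items():
    print(f"{name}: {value:.4f}")
# monolithic, four-way LIW: 1.0000
# partitioned, 4 registers per bank: 0.0625
# partitioned, 8 registers per bank: 0.1250
# partitioned, two-way LIW: 0.2500
```

The port count of the reference configuration is 4 FUs times 3 ports (two read, one write), which is why reducing the ports per bank from 12 to 3 or 6 dominates the result.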
3.3.2 SDR algorithm performance
While a vertical-horizontal vector processing architecture with a partitioned register file requires less power and area for the register file than a LIW architecture, the processor architecture is also less flexible, as only one instruction can be started in each clock cycle. The effects of the reduced flexibility have been studied for three different signal processing kernels [Lec09]: matrix-vector product, a 16-point FFT, and Viterbi decoding on a 64-bit SIMD processor. The performance of an architecture without ILP support, a four-way LIW architecture, and a vertical-horizontal vector processing architecture with four register banks has been measured. The results are summarized in table 3.12. The performance of the vertical-horizontal architecture is only slightly worse than the LIW architecture performance. Furthermore, the number of instructions can be significantly reduced.
Vertical-horizontal vector processing may offer performance similar to a LIW architecture with reduced area and power demands for the register files and the decoding of instruc-
3.4 SIMD architecture analysis methodology