3.2 SIMD processor modeling in LISA
3.2.3 Extensions for modeling SIMD processors
As depicted in Figure 3.10, a SIMD processor architecture contains many parallel data lanes that have the same functionality, yet operate on different data. Except for the permutation network in the VPU, the whole structure of the SIMD data path can be defined by modeling one lane and scaling the number of lanes based on the desired SIMD width.
Chapter 3 Scalable SIMD processor architecture
[Figure 3.12 labels: pipeline stages FE, DC, EX, WB; operations fetch, decode, ALU, writeback, add, sub, load]
Figure 3.12: Example of an operation hierarchy with a four-stage pipeline. Grey boxes describe pipeline stages. Arrows describe the activation of operations.
LISA contains language elements for modeling multiple resources or operations that share a common identifier and, in the case of operations, a common behavior. These language elements are called template resources and template operations. Both are defined using angle brackets; two examples are given below:
    REGISTER uint16 opnd<1..16>;
    OPERATION alu<index> IN pipe.EX { .. }
Here, the index parameter may be used inside the operation to index further template operations, building an operation hierarchy, or to access template resources. Template resources may only be indexed by constants or by index parameters of template operations. Identical or similar data paths may be defined by template operations; template resources make it possible to define resources that are local to these data paths. These two language elements can be used to model LIW architectures with multiple processing units of the same type⁶ or data paths in a SIMD architecture. However, modeling a SIMD data path is still difficult: LISA versions prior to V2009.1.0 do not support activating or calling multiple instances of the same template operation [WS09b], while newer versions at least support activations. Yet, the number of activated template instances has to be fixed, as each declaration and activation has to be implemented manually. Hence, an architecture with a scalable number of lanes cannot be modeled directly. Additionally, permutation networks have to be programmed manually for each different network size and topology. Hence, the design of the scalable SIMD architecture in section 3.1, with four SIMD widths and four network configurations, would require maintaining 16 partially different processor models in LISA.
⁶ This was the original purpose for introducing these language elements.
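The count of 16 models follows from combining every SIMD width with every network configuration. The following minimal sketch enumerates that design space. Only the 128-bit and 1024-bit widths and the network names cross1 and bfy2 appear in Table 3.10; the intermediate widths and the names cross2 and bfy1 are assumptions made for illustration.

```python
from itertools import product

# 128 and 1024 are taken from Table 3.10; 256 and 512 are assumed.
SIMD_WIDTHS = (128, 256, 512, 1024)

# cross1 and bfy2 are taken from Table 3.10; cross2 and bfy1 are assumed
# names for the double-vector crossbar and single-vector butterfly.
NETWORKS = ("cross1", "cross2", "bfy1", "bfy2")

# Every (width, network) pair would need its own hand-written model.
CONFIGURATIONS = list(product(SIMD_WIDTHS, NETWORKS))

if __name__ == "__main__":
    for width, network in CONFIGURATIONS:
        print(f"{width:>4} bit  {network}")
    print(len(CONFIGURATIONS), "model variants")
```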
An approach based on the combination of LISA with GNU M4 [SPVB08, WS09b] has been chosen to overcome this issue. The M4 language is a macro processing language, which supports string manipulations, conditional evaluation, and loops. These language features allow generating a scalable number of data paths by macros that produce LISA code for template operation declarations and activations (or calls). Hence, M4 is utilized as a preprocessing step. The extension of the LISA models has been realized in three steps:
• Necessary macros have been identified, implemented, and tested. The macros can be grouped into two classes: macros for the generation of permutation networks and macros for scalable SIMD data paths. Macros for scalable SIMD data paths primarily generate and control the accessing of multiple instances of template operations and resources. Macros for permutation networks generate the complete permutation network from macro calls.
• An M4 macro file containing definitions for all adjustable parameters has been generated. The parameter file defines the SIMD width, the configuration of the permutation network (topology and width), and further parameters, such as register file sizes and memory configurations.
• The M4 macros have been introduced in the LISA model. For example, a SIMD data path is added to the LISA model by implementing a template operation, which defines the behavior of the data path (e.g. OPERATION data_path<index> { ... }), and instantiating and activating the scalable data path using the SIMD_INSTANCE(data_path) and SIMD_ACTIVATION(data_path) macros in the parent operation.
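The effect of the SIMD_INSTANCE and SIMD_ACTIVATION macros can be pictured as a loop over the lanes that emits one template instance per lane. The sketch below emulates such an expansion in Python rather than M4. It rests on several assumptions: the 16-bit lane width, the function names, and above all the emitted text, which is illustrative only and is not verified LISA syntax or the output of the actual macros.

```python
# Emulation of a per-lane macro expansion; the generated strings are
# illustrative and are NOT verified LISA syntax.

LANE_BITS = 16  # assumed lane width; not stated in the text


def lane_count(simd_width_bits: int, lane_bits: int = LANE_BITS) -> int:
    """Return the number of identical lanes for a given SIMD width."""
    if simd_width_bits % lane_bits != 0:
        raise ValueError("SIMD width must be a multiple of the lane width")
    return simd_width_bits // lane_bits


def simd_instance(op: str, lanes: int) -> str:
    """Emit one declaration line per lane (hypothetical output format)."""
    return "\n".join(f"INSTANCE {op}<{i}>;" for i in range(1, lanes + 1))


def simd_activation(op: str, lanes: int) -> str:
    """Emit a comma-separated activation list (hypothetical output format)."""
    return ", ".join(f"{op}<{i}>" for i in range(1, lanes + 1))


if __name__ == "__main__":
    lanes = lane_count(128)  # 8 lanes under the 16-bit assumption
    print(simd_instance("data_path", lanes))
    print(simd_activation("data_path", lanes))
```

Changing only the SIMD width parameter changes the number of generated instances, which is the property that a fixed, hand-written list of declarations and activations lacks.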
The preprocessing and compilation of the LISA model have been integrated into makefiles that allow the M4 parameter configuration to be altered by passing parameters to make. Preprocessing using M4 macros has not only been done for SIMD support; further macros have been implemented for simplifying the LISA model structure and for circumventing bugs in LISA. For example, the LISA code for operand bypassing and saturation logic⁷ is generated from macros. The binary opcode encoding of instructions for one processing unit (e.g. 0b1001001 for a scalar addition) is generated by defining a base opcode and using opcode offset or increment macros. Furthermore, macros enable generating different LISA code for simulation and RTL code generation for features of LISA that are not supported for both.
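The base-plus-offset opcode scheme can be illustrated with a small sketch. Only the value 0b1001001 for a scalar addition comes from the text; the base opcode, the 7-bit field width, the offset table, and the function name are hypothetical and were chosen so that the addition reproduces the quoted encoding.

```python
# Hypothetical values: only 0b1001001 (scalar addition) appears in the text.
BASE_OPCODE = 0b1001000  # assumed base opcode of one processing unit
OPCODE_BITS = 7          # assumed width of the opcode field

# Assumed offsets relative to the base; "sub" is made up for illustration.
OPCODE_OFFSETS = {"add": 1, "sub": 2}


def opcode(instruction: str) -> str:
    """Return the binary encoding as base opcode plus instruction offset."""
    value = BASE_OPCODE + OPCODE_OFFSETS[instruction]
    if value >= 1 << OPCODE_BITS:
        raise ValueError("opcode does not fit into the opcode field")
    return f"0b{value:0{OPCODE_BITS}b}"


if __name__ == "__main__":
    for name in OPCODE_OFFSETS:
        print(name, opcode(name))
```

With such a scheme, moving a processing unit to another region of the opcode space only requires changing the base value.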
The effort for developing and maintaining a scalable SIMD model in LISA with and without M4 macro support cannot be measured directly. Yet, the number of source lines of code (SLOC) can be used as an indicator of the model complexity. Table 3.10 shows the measured SLOC, excluding comments and empty lines. The number of lines of code has been measured before and after macro expansion. The reference value includes the macro files. In the best-case scenario (128-bit SIMD processor and single-vector crossbar network), the code size after macro expansion is only approximately 67 percent higher than before macro expansion. However, in the worst-case scenario (1024-bit SIMD width and double-vector butterfly network), the model requires almost 15 times as many SLOC after macro expansion.
Table 3.10: Source lines of code measured for LISA model before and after M4 macro expansion
Measurement                         SIMD width  Network    SLOC  Normalized SLOC
Before macro expansion              all         all        7292             1.00
After macro expansion: best case    128 bit     cross1    12179             1.67
After macro expansion: worst case   1024 bit    bfy2     108986            14.95
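The normalized values in Table 3.10 are the measured SLOC divided by the SLOC before macro expansion. The short sketch below recomputes them from the table entries; the dictionary keys and the function name are chosen here for illustration.

```python
# SLOC values copied from Table 3.10.
SLOC = {
    "before": 7292,     # before macro expansion (includes the macro files)
    "best": 12179,      # 128 bit, cross1
    "worst": 108986,    # 1024 bit, bfy2
}


def normalized_sloc(sloc: int, reference: int = SLOC["before"]) -> float:
    """Return SLOC relative to the pre-expansion reference, to 2 decimals."""
    return round(sloc / reference, 2)


if __name__ == "__main__":
    for case, sloc in SLOC.items():
        print(f"{case:<6} {sloc:>6} {normalized_sloc(sloc):>6.2f}")
```

A normalized value of 1.67 corresponds to the 67 percent increase in the best case, and 14.95 to the factor of almost 15 in the worst case.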