5.6.1 Hardware Efficiency
The architectural properties of ICORE2 and ALICE after logic synthesis are listed in table 5.1. For comparison, the table also lists the corresponding values for the MIPS32 architecture, which are taken from the MIPS32 data sheet. The rows labeled Time for EVD and Energy for EVD show the absolute execution time and the energy required for computing the EVD of a 10×10 matrix. The execution time is the number of execution cycles taken from figure 5.7 divided by the maximum clock frequency. The table also lists the die size of the MIPS32, the tailored ALICE, and the ICORE2 architecture. All numbers have been obtained for a typical 0.18µm CMOS technology using defined worst case conditions for temperature, voltage, and fabrication.
Architecture (0.18µm)    ICORE2    ALICE    MIPS32
Max. Frequency (MHz)     140       190      170-200
Die Size (mm²)           ≤ 0.4     ≤ 1.2    ≤ 1.0
Time for EVD (ms)        0.32      0.43     4.19-3.57
Energy for EVD (µJ)      ∼ 10.8    ∼ 79     > 641
Table 5.1: Architecture comparison
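The relation between cycle count, clock frequency, and execution time used for table 5.1 can be sketched in a few lines of Python. The cycle counts of figure 5.7 are not reproduced in this section, so the sketch back-computes approximate counts from the rounded table values. The function names are illustrative assumptions and not part of the original tool flow.

```python
# Values copied from table 5.1. Everything derived from them is approximate,
# because the table entries are rounded.
TABLE_5_1 = {
    "ICORE2": {"f_max_mhz": 140.0, "time_ms": 0.32},
    "ALICE": {"f_max_mhz": 190.0, "time_ms": 0.43},
}


def execution_time_ms(cycles, f_max_mhz):
    """Execution time in ms: cycle count divided by the maximum clock frequency."""
    return cycles / (f_max_mhz * 1e6) * 1e3


def implied_cycles(time_ms, f_max_mhz):
    """Invert the relation above to estimate a cycle count.

    Assumption: this is a back-computation from table 5.1, not a value
    read from figure 5.7.
    """
    return time_ms * 1e-3 * f_max_mhz * 1e6


for name, row in TABLE_5_1.items():
    cycles = implied_cycles(row["time_ms"], row["f_max_mhz"])
    print(f"{name}: roughly {cycles:.0f} cycles for the 10x10 EVD")
```

With the rounded table values this gives roughly 44,800 cycles for ICORE2 and 81,700 cycles for ALICE, so ALICE's higher clock frequency only partly compensates for its larger cycle count.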
The maximum clock frequency of ICORE2 is lower than that of ALICE and MIPS32. This is a consequence of the unbalanced pipeline of ICORE2 mentioned in section 5.2: the long critical path in ICORE2’s execution stage significantly reduces the maximum clock frequency.
The die size of ICORE2 is less than half that of ALICE and MIPS32. At least for ALICE this result is not surprising, because ICORE2 gets its computational performance from a few highly specialized functional units, while ALICE obtains its performance from several general units of finer granularity (e.g. multiplier, ALU) operating in parallel. Furthermore, in contrast to ICORE2, ALICE requires area for the predecode pipeline stage that decompresses the very long instruction word. Figure 5.7 also illustrates that forwarding is very important for performance. Unfortunately, implementing bypassing is quite expensive for ALICE: its u parallel functional units are spread over n pipeline stages, so a u × n interconnection network is required, which leads to multiplexers of significant size. On the other hand, an ASIP that is part of a SoC will have data and program caches or memories attached. Compared to the size of this memory, the area consumed by the processor is only a few percent.
The time required for computing the EVD is comparable for ALICE and ICORE2, whereas the MIPS32 is slower by one order of magnitude. This result is quite impressive, especially because little time was spent on optimizing the CoSy C compiler for the ALICE architecture. The only significant advantage of ICORE2 over ALICE is its energy requirement. There are several reasons why ALICE is roughly 7 to 8 times less energy efficient than ICORE2 (∼ 79 µJ versus ∼ 10.8 µJ in table 5.1):
• ALICE requires more clock cycles to compute the EVD and at the same time has a more complex clock tree and many more registers than ICORE2.
• The predecode stage and the bypassing logic, which do not exist in ICORE2, require energy.
• ALICE instructions operate at a lower granularity than ICORE2 instructions. For example, ICORE2 has a single assembly instruction to multiply matrices, whereas ALICE needs a complete function with control overhead (condition evaluation and branches) for this task. This also means that ALICE has to access the program memory more often, which significantly increases power consumption.
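The granularity argument in the last item can be illustrated with a small counting sketch. It assumes a plain triple-loop matrix multiplication as the software routine. The actual ALICE code and the ICORE2 matrix instruction are not shown in this section, so the function and its counters are hypothetical and only model the control overhead qualitatively.

```python
def matmul_with_overhead(a, b):
    """Multiply two square matrices with an explicit loop nest.

    Returns the product, the number of multiply-accumulate operations, and
    the number of loop-condition evaluations. The latter stands in for the
    control overhead (condition evaluation and branches) that a single
    matrix instruction would avoid. Illustrative model only; this is not
    the ALICE or ICORE2 implementation.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    macs = 0
    checks = 0
    i = 0
    while True:
        checks += 1              # outer loop condition
        if not i < n:
            break
        j = 0
        while True:
            checks += 1          # middle loop condition
            if not j < n:
                break
            k = 0
            while True:
                checks += 1      # inner loop condition
                if not k < n:
                    break
                c[i][j] += a[i][k] * b[k][j]
                macs += 1
                k += 1
            j += 1
        i += 1
    return c, macs, checks


# Usage: a 10x10 identity matrix multiplied with itself.
identity = [[1.0 if r == s else 0.0 for s in range(10)] for r in range(10)]
_, macs, checks = matmul_with_overhead(identity, identity)
print(macs, checks)
```

In this model a 10×10 multiplication performs 1000 multiply-accumulate operations and 1221 loop-condition evaluations, so the control flow is of the same order as the useful arithmetic. How this maps to ALICE cycles depends on the compiler and on how many operations are packed into one instruction word.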
5.6.2 Design and Verification Time
The design and verification times for both architectures are shown in table 5.2. As one can see, the creation of the LISA processor model took only two weeks for each architecture. After that time, the assembler, linker, simulator/profiler, and the control path of the HDL model could be generated.
Architecture      ICORE2    ALICE
LISA Model        2         2
Hardware Model    3         3
Assembly Code     6         −
C Compiler        −         7
Retargeting       ∼ 8       ∼ 3
Table 5.2: Design and verification times in man-weeks for ICORE2 and ALICE

The time for creating the hardware model is also comparable for both architectures: a significant amount of time was spent on implementing the CORDIC unit that is part of both ALICE and ICORE2. Besides the CORDIC unit, the vector operations dominated the design time of ICORE2, while the concept and the implementation of the predecoder dominated the ALICE hardware design effort.

Writing the ICORE2 assembly code took 6 weeks. This time is quite long because several rewrites of the assembly code were necessary during the architecture exploration phase. Furthermore, the tedious co-verification of assembly code and HDL model turned out to be very time consuming.

When implementing the C compiler for ALICE, the designers were not very familiar with the CoSy design environment. A lot of time was spent on understanding how to modify an existing CoSy compiler for the SPARC processor so that it produces code for a VLIW architecture. Since ALICE is a very orthogonal architecture, the verification of the compiler took only about 40% of the 7 weeks. The developers of industrial compilers usually spend most of their time on implementing and verifying techniques that improve code quality. For this case study, only a verified sequence of high level optimizations (taken from the SPARC compiler) was used in the C compiler of ALICE.
The last row of table 5.2 gives the estimated time for tailoring the processor HW/SW to a different application from the signal processing domain. Since ICORE2 mostly contains specialized functional units, major changes to the hardware design would be required. Furthermore, the critical path in the execution stage could turn out to be a performance bottleneck, which might necessitate a complete redesign of the ICORE2 pipeline. In conjunction with major changes to the hardware, the assembly code needs to be rewritten and verified. The estimated gain of reusing ICORE2 instead of rewriting it from scratch is only three weeks (2 + 3 + 6 − 8 = 3). In contrast, the reusability of the ALICE architecture is much higher. If the new design stays in the ALICE processor class as described in section 5.3, the modifications of the LISA model and the CoSy compiler can be done in less than a day. If more performance is required, additional functional units have to be inserted into the pipeline. Since such units can be addressed by intrinsics on the compiler side, the three weeks mentioned in table 5.2 mainly refer to the hardware implementation of these blocks.
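The reuse estimate can be reproduced from table 5.2 with a short sketch. The dictionary keys and function names are illustrative assumptions. The retargeting entries are the approximate values from the table, and the ALICE gain is not stated in the text but follows from applying the same formula.

```python
# Effort in man-weeks, copied from table 5.2. None marks a task that was
# not performed for that architecture. Retargeting values are approximate.
TABLE_5_2 = {
    "ICORE2": {"lisa_model": 2, "hardware_model": 3, "assembly_code": 6,
               "c_compiler": None, "retargeting": 8},
    "ALICE": {"lisa_model": 2, "hardware_model": 3, "assembly_code": None,
              "c_compiler": 7, "retargeting": 3},
}


def from_scratch_effort(row):
    """Sum of all design tasks except the retargeting estimate."""
    return sum(value for task, value in row.items()
               if task != "retargeting" and value is not None)


def reuse_gain(row):
    """Man-weeks saved by retargeting instead of designing from scratch."""
    return from_scratch_effort(row) - row["retargeting"]


for name, row in TABLE_5_2.items():
    print(f"{name}: from scratch {from_scratch_effort(row)} weeks, "
          f"gain from reuse about {reuse_gain(row)} weeks")
```

For ICORE2 this reproduces the three weeks stated above (2 + 3 + 6 − 8). Applying the same formula to ALICE gives about nine weeks (2 + 3 + 7 − 3), which is one way to quantify the higher reusability claimed for ALICE.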