Study on Design Space Exploration Efficiency

Case Study

8.6 Study on Design Space Exploration Efficiency

In this section, the case study demonstrates the design efficiency of the proposed tool flow, i.e. how quickly different design alternatives can be explored using the proposed methodology. For these experiments, the fixed base processor is a simple RISC processor, named LTRISC, with a 5-stage pipeline and 16 GPRs. This processor is extended with an FPGA fabric for architecture exploration.

At the beginning of the case study, six application kernels were selected for analysis. These kernels were taken from two prominent multimedia software suites: the MediaBench II benchmark and the x264 codec implementation of the H.264 standard. The selected kernels were: the Inverse Discrete Cosine Transform (IDCT) from the MPEG2 decoder, the Inverse and Forward Discrete Cosine Transforms (IDCT/DCT) from the JPEG software, the Sum of Absolute Differences (SAD) from the H.263 and H.264 encoders, and the Sum of 8x8 Hadamard Transformed Differences (SHTD) from the H.264 encoder. DCT/IDCT and SAD were chosen because various implementations of these kernels are embedded in a large number of media applications. SHTD was selected because it is one of the most computation-intensive parts of the H.264 encoder.

The exploration flow consisted of three main steps: ISE identification, FPGA exploration and interface exploration. In the ISE identification phase, each application kernel was partitioned into a set of ISEs, which were identified using a mixture of manual algorithm analysis, profiling with µProfiler [104] and automatic ISE identification. In the FPGA and interface exploration phases, the identified ISEs were inserted into the application code and simulated on the rASIP ISS to obtain speed-up results.

The FPGA exploration consisted of synthesizing the identified ISEs to various FPGA fabrics with different logic elements, topologies and connectivities. This step was used to characterize the different ISE sets in terms of cluster usage and critical paths. At the same time, it provided hints on the best FPGA fabric for a given application kernel. The interface exploration consisted of benchmarking a variety of interfaces, as well as FPGA-internal storage structures. In the MPEG2 IDCT case, these two explorations were carried out independently, i.e. the FPGA was kept fixed during the interface exploration phase. In the other cases, the interface exploration was carried out only for the FPGA and ISEs short-listed through the FPGA exploration. Finally, the results of the two explorations were combined to determine the best FPGA structure, interfacing options and ISEs for a given application kernel.

The next subsections present the design points that were explored. For the MPEG2 IDCT routine, the FPGA and interface explorations are described in detail. For the other benchmark kernels, only the configurations which achieve the best performance are presented. During the FPGA exploration, a cycle-based cost model with the inter-cluster routing delay set to 2 FPGA clock cycles and the intra-cluster routing delay set to 1 FPGA clock cycle was used (the DM1 cost model in [137]). The gate-level synthesis results for LTRISC and the FPGA clusters were obtained using a 130nm technology library and Synopsys Design Compiler. For all design points, the base processor met a clock constraint of 2.5 ns and the FPGA met a clock constraint of 5 ns, i.e. the base processor clock ran twice as fast as the FPGA clock. All critical path results in the following subsections are reported in terms of base processor clock cycles.
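The relation between the two clock domains can be illustrated with a short sketch. It uses only the routing delays and clock constraints quoted above; the helper names, the example path and the assumption that per-hop routing delays simply add up (with PE delays left out) are illustrative and are not taken from the DM1 cost model itself.

```python
# Clock constraints reported in the text.
BASE_CLOCK_NS = 2.5
FPGA_CLOCK_NS = 5.0

# Routing delays of the cycle-based cost model, in FPGA clock cycles.
ROUTING_DELAY_FPGA_CYCLES = {"inter": 2, "intra": 1}


def routing_delay(hops):
    """Routing delay of a path in FPGA cycles.

    Assumption: per-hop delays add up along the path; PE delays are not modelled.
    """
    return sum(ROUTING_DELAY_FPGA_CYCLES[hop] for hop in hops)


def to_base_cycles(fpga_cycles):
    """Express a delay given in FPGA cycles in base processor cycles."""
    return fpga_cycles * FPGA_CLOCK_NS / BASE_CLOCK_NS


# Hypothetical path: two intra-cluster hops and one inter-cluster hop.
example_path = ["intra", "inter", "intra"]
fpga_cycles = routing_delay(example_path)
base_cycles = to_base_cycles(fpga_cycles)
print(fpga_cycles, base_cycles)
```

With the 5 ns and 2.5 ns constraints, every FPGA cycle corresponds to two base processor cycles, which is the unit used for the critical path figures in this section.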

MPEG2 IDCT Kernel

The IDCT routine from the MPEG2 decoder has two loops which process an 8×8 element data array in row-wise and column-wise fashion. These loops have very similar structures and can be covered by the same set of ISEs. After careful analysis, two sets of ISEs were identified for these loop kernels. The first set contained 4 ISEs (named ISE1 through ISE4), while the second set had only one ISE, which encompassed the entire DFG of the kernels.

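| Cluster Name | Connectivity Style | Partition I: Critical Path (cycles), ISE1 | ISE2 | ISE3 | ISE4 | Partition I: Number of Clusters | Partition II (full DFG): Critical Path (cycles) | Partition II (full DFG): Number of Clusters |
|---|---|---|---|---|---|---|---|---|
| IDCT-1 | NN-1 | 12 | 4 | 1 | 1 | 29 | 27 | 34 |
| IDCT-1 | Mesh-1, NN-2 | 11 | 4 | 1 | 1 | 28 | 23 | 33 |
| IDCT-1 | NN-1, Mesh-2 | 10 | 3 | 1 | 1 | 28 | 18 | 30 |
| IDCT-2 | NN-1 | 11 | 3 | 1 | 1 | 28 | 17 | 32 |
| IDCT-2 | Mesh-1, NN-2 | 9 | 4 | 1 | 1 | 27 | 17 | 30 |
| IDCT-2 | NN-1, Mesh-2 | 8 | 3 | 1 | 1 | 27 | 14 | 27 |
| IDCT-3 | NN-1 | 11 | 3 | 1 | 1 | 31 | 19 | 34 |
| IDCT-3 | Mesh-1, NN-2 | 10 | 3 | 1 | 1 | 34 | 17 | 35 |
| IDCT-3 | NN-1, Mesh-2 | 9 | 3 | 1 | 1 | 29 | 15 | 32 |

Table 8.18. FPGA Exploration for MPEG2 IDCT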

FPGA Exploration: Algorithm analysis and µprofiling showed that the loop kernels use various arithmetic operations, mostly additions and subtractions, and some multiplications (around 15% of the total operators). Therefore, only ALUs and multipliers were selected as PEs in the FPGA fabric. Three different cluster topologies (IDCT-1, IDCT-2 and IDCT-3 in Figure 8.14) with 2×2, 2×3 and 3×3 PEs were explored. The Nearest Neighbor (NN) connectivity scheme was used inside a single cluster, and the NN, {Mesh-1, NN-2} and {NN-1, Mesh-2} configurations were explored for inter-cluster communication. The critical paths and cluster usage for the two ISE sets with the different FPGA configurations are shown in Table 8.18.

Interface Exploration: For MPEG2 IDCT, interface exploration was carried out with all 9 FPGA configurations listed in Table 8.18. For the first ISE set, the following interface options were tried for each FPGA configuration:

1. GPR file with 4-in/4-out ports.

2. Clustered GPR file similar to the one described in [102] with 4-in/4-out ports.

3. GPR file with 4-in/4-out ports, and an additional 16 Internal Registers (IRs) accessible from FPGA. ISE1 through ISE4 use these registers to communicate the intermediate values.

4. GPR file and a block of 8×8 IRs accessible from FPGA. At the beginning of IDCT calculation, the entire 8×8 data-block is moved from memory to this register file. At the end of calculation, the block is moved back to memory.

Since the second ISE set contains a single ISE with 8 inputs and 8 outputs, a GPR file with 8-in/8-out ports was chosen for it. The 8×8 IR file was also placed inside the FPGA for the second set. In total, 45 different design points were explored for the first and second ISE sets.
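The total of 45 design points can be reproduced by enumerating the combinations. The sketch below assumes that the second ISE set was evaluated with exactly one interface option per FPGA configuration, which is the reading that matches the stated total; the string labels are shorthand introduced here for illustration.

```python
from itertools import product

cluster_topologies = ["IDCT-1", "IDCT-2", "IDCT-3"]
connectivity_styles = ["NN-1", "Mesh-1, NN-2", "NN-1, Mesh-2"]
fpga_configs = list(product(cluster_topologies, connectivity_styles))

# The four interface options tried for the first ISE set (ISE1 through ISE4).
interfaces_set1 = [
    "GPR 4-in/4-out",
    "clustered GPR 4-in/4-out",
    "GPR 4-in/4-out + 16 IRs",
    "GPR + 8x8 IR block",
]
# Assumption: a single interface option for the second (full-DFG) ISE set.
interfaces_set2 = ["GPR 8-in/8-out + 8x8 IR block"]

design_points = [
    ("ISE set 1", fpga, itf) for fpga, itf in product(fpga_configs, interfaces_set1)
] + [
    ("ISE set 2", fpga, itf) for fpga, itf in product(fpga_configs, interfaces_set2)
]

print(len(fpga_configs), len(design_points))
```

Nine FPGA configurations times four interface options give 36 points for the first set, and nine more for the second set give 45.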

As can be seen from Table 8.18, the IDCT-2 cluster topology with {NN-1, Mesh-2} inter-cluster connectivity results in the lowest critical path for all ISEs. The IDCT-3 cluster topology, in spite of having more PEs than IDCT-2, does not give better critical paths or higher cluster utilization. The IDCT-1 topology is smaller in area than IDCT-2, but results in longer critical paths. Table 8.18 also shows that the first set of ISEs (ISE1 through ISE4) always results in a smaller number of execution cycles than the second set. Moreover, the second set requires more data bandwidth from the GPR file.

The speed-up, register file area and cluster area results for the case study are summarized in Figure 8.13. For the sake of simplicity, only the results for the {NN-1, Mesh-2} inter-cluster connectivity scheme are presented, since it always produced the best result. As the figure shows, the design space is highly non-linear. As far as speed-up is concerned, the combination of the IDCT-2 FPGA structure, the 4-in/4-out register file with the 8×8 IR block, and the first set of ISEs produces the best result. However, the 8×8 IR block significantly increases the register file area. Similarly, the IDCT-1 cluster topology produces almost as good a speed-up as IDCT-2, but it is significantly smaller in area. As a consequence, the inclusion of the IDCT-2 topology and 64 IRs in the final design might depend on the area constraints. Such non-linearities of the design space underline the importance of architecture exploration.

Figure 8.13. Results for IDCT case study
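The observations drawn from Table 8.18 can be checked mechanically. The sketch below copies the table values and looks up the configuration with the lowest critical paths. The comparison between the two ISE sets assumes that ISE1 through ISE4 execute back to back, so that their critical paths add up; this is an interpretation, not something the table states.

```python
# (cluster, connectivity) ->
#   ((ISE1, ISE2, ISE3, ISE4 critical paths), clusters for Partition I,
#    critical path for Partition II, clusters for Partition II)
TABLE_8_18 = {
    ("IDCT-1", "NN-1"): ((12, 4, 1, 1), 29, 27, 34),
    ("IDCT-1", "Mesh-1, NN-2"): ((11, 4, 1, 1), 28, 23, 33),
    ("IDCT-1", "NN-1, Mesh-2"): ((10, 3, 1, 1), 28, 18, 30),
    ("IDCT-2", "NN-1"): ((11, 3, 1, 1), 28, 17, 32),
    ("IDCT-2", "Mesh-1, NN-2"): ((9, 4, 1, 1), 27, 17, 30),
    ("IDCT-2", "NN-1, Mesh-2"): ((8, 3, 1, 1), 27, 14, 27),
    ("IDCT-3", "NN-1"): ((11, 3, 1, 1), 31, 19, 34),
    ("IDCT-3", "Mesh-1, NN-2"): ((10, 3, 1, 1), 34, 17, 35),
    ("IDCT-3", "NN-1, Mesh-2"): ((9, 3, 1, 1), 29, 15, 32),
}


def best_by(metric):
    """Return the configuration that minimizes the given metric of a table row."""
    return min(TABLE_8_18, key=lambda cfg: metric(TABLE_8_18[cfg]))


best_full_dfg = best_by(lambda row: row[2])  # Partition II critical path
best_ise1 = best_by(lambda row: row[0][0])   # ISE1 critical path

# Does the best configuration reach the minimum critical path for every ISE?
lowest_for_all_ises = all(
    TABLE_8_18[best_full_dfg][0][i] == min(row[0][i] for row in TABLE_8_18.values())
    for i in range(4)
)

# Assumption: the four ISEs run back to back, so their critical paths add up.
set1_always_shorter = all(sum(row[0]) < row[2] for row in TABLE_8_18.values())

print(best_full_dfg, best_ise1, lowest_for_all_ises, set1_always_shorter)
```

A lookup of this kind only ranks configurations by critical path; area and register file cost still have to be weighed separately.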

JPEG IDCT and DCT kernels

The JPEG IDCT and DCT kernels are very similar, though not identical, to the MPEG2 IDCT kernel. The JPEG IDCT kernel uses an extra quantization table and requires even more data bandwidth from the base processor. Additionally, the JPEG DCT/IDCT kernels require more FPGA clusters due to their larger number of operators. As is generally the case with IDCT/DCT kernels, the 8×8 IR block in the re-configurable fabric results in better performance for JPEG IDCT/DCT.

SAD kernel

The SAD kernel calculates the difference of two vectors, element by element, and sums up the absolute values of these differences. This kernel is used extensively in the motion estimation blocks of video compression algorithms such as MPEG2, H.263 and H.264. Since the various implementations of SAD differ very little, a set of ISEs that can be reused across all of them was conceived. The FPGA topologies explored for SAD are marked as SAD-1, SAD-2 and SAD-3 in Figure 8.14. Each of these FPGA topologies included a special PE, named AB_ALU, which can calculate abs(x − y) or abs(x + y) in addition to the common arithmetic and logic operations. The abs(x − y) operation was included because the SAD kernel uses it repeatedly. The utility of abs(x + y) is explained later. It was also found that the high degree of parallelism available in the SAD kernel cannot be exploited through the GPR file. When ISEs requiring 8 inputs/4 outputs were inserted into the SAD code, they caused a large number of GPR spills. However, the performance improved greatly when 16 IRs were provided to the SAD ISEs for communication.
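The computation performed by the SAD kernel can be written down in a few lines. The sketch below is a plain software reference and not the ISE implementation: `sad_chunked` is a hypothetical illustration of how a step that consumes four element pairs (8 operands) could accumulate into a running value, with the accumulator standing in for an internal register.

```python
def abs_diff(x, y):
    """abs(x - y), the operation the text attributes to the AB_ALU PE."""
    return abs(x - y)


def sad(a, b):
    """Sum of absolute differences of two equally long vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs_diff(x, y) for x, y in zip(a, b))


def sad_chunked(a, b, width=4):
    """Hypothetical ISE-style evaluation.

    Each step consumes `width` element pairs and adds their absolute
    differences to an accumulator that stays outside the GPR file.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0
    for i in range(0, len(a), width):
        acc += sum(abs_diff(x, y) for x, y in zip(a[i:i + width], b[i:i + width]))
    return acc


current = [10, 12, 9, 7, 3, 3, 8, 1]
reference = [8, 15, 9, 2, 4, 0, 8, 6]
print(sad(current, reference), sad_chunked(current, reference))
```

Both functions return the same value; the chunked form only makes the operand grouping per step explicit.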

ALU → {+, -, <<, >>, &, |}

MUL → {*}

AB_ALU → {+, -, <<, >>, &, |, abs(x+y), abs(x-y)}

Figure 8.14. Cluster Structures for Different Applications

SHTD kernel

The SHTD kernel from H.264 uses several additions and subtractions, abs(x − y) operations, and abs(x + y) operations. Moreover, the software code for the kernel uses two 4×4 local arrays. Considerable speed-up can be achieved if 32 IRs are used instead of these local arrays. The SHTD-1 and SHTD-2 FPGA topologies in Figure 8.14 were used in the FPGA exploration of the SHTD kernel.
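To show where the two 4×4 local arrays and the abs(x + y) and abs(x − y) operations can arise, the following sketch computes a sum of Hadamard-transformed differences for a single 4×4 block. It is an illustrative, unnormalized textbook formulation and not the H.264 encoder code; the function names, the block size handled per call and the absence of any scaling are assumptions.

```python
def butterfly4(v):
    """Unnormalized 4-point Hadamard transform using only additions and subtractions."""
    s01, d01 = v[0] + v[1], v[0] - v[1]
    s23, d23 = v[2] + v[3], v[2] - v[3]
    return [s01 + s23, s01 - s23, d01 + d23, d01 - d23]


def shtd_4x4(block_a, block_b):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    # First local array: element-wise differences.
    diff = [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(block_a, block_b)]
    # Second local array: row-wise transform of the differences.
    tmp = [butterfly4(row) for row in diff]

    total = 0
    for col in range(4):
        c = [tmp[row][col] for row in range(4)]
        s01, d01 = c[0] + c[1], c[0] - c[1]
        s23, d23 = c[2] + c[3], c[2] - c[3]
        # The last butterfly stage is fused with the absolute value,
        # which yields abs(x + y) and abs(x - y) terms.
        total += abs(s01 + s23) + abs(s01 - s23) + abs(d01 + d23) + abs(d01 - d23)
    return total


zeros = [[0] * 4 for _ in range(4)]
ones = [[1] * 4 for _ in range(4)]
print(shtd_4x4(ones, zeros))
```

In this formulation, `diff` and `tmp` are the kind of intermediate storage that 32 internal registers (two arrays of 16 values) could hold instead of memory.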

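| Application | Storage Interface: GPR I/O | Storage Interface: IR Count | Storage Interface: IR I/O | Storage Interface: Area (KGates) | FPGA Structure: Cluster Count | FPGA Structure: Cluster Area (KGates) | Base Processor Area (KGates) | Speed-up (times) |
|---|---|---|---|---|---|---|---|---|
| MPEG2 IDCT | 4/4 | 64 | 8/4 | 76.16 | 27 | 25.63 | 96.35 | 3 |
| JPEG DCT | 4/4 | 64 | 8/4 | 76.16 | 32 | 25.63 | 96.35 | 4.45 |
| JPEG IDCT | 4/4 | 64 | 8/4 | 76.16 | 30 | 25.63 | 96.35 | 2.08 |
| SAD | 8/4 | 16 | 4/4 | 32.14 | 29 | 26.52 | 51.77 | 2.26 |
| SHTD | 8/4 | 32 | 4/4 | 48.93 | 16 | 29.72 | 68.56 | 5.59 |

Table 8.19. Best Configuration of Application Kernels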

The best results (considering the achievable speed-up only) for all the kernels are presented in Table 8.19. The FPGA configurations explored for each of them are shown in Figure 8.14. For the first three applications, namely MPEG2 IDCT, JPEG DCT and JPEG IDCT, the IDCT-2 cluster configuration is used. For SAD and SHTD, the SAD-2 and SHTD-2 cluster configurations are used, respectively. The {NN-1, Mesh-2} connectivity style yielded the best results for all the applications listed. One can see that different design points yield the best results for different applications. The best cluster topologies of all the kernels can be unified in the SHTD-2 structure. Similarly, the best interface setting is to have 64 IRs with 8 inputs/4 outputs and a GPR file with 4 inputs/4 outputs. The 64 IRs can be used in IDCT/DCT for storing the 8×8 data block, in SAD for inter-ISE communication, or in the SHTD kernel for storing the local arrays.

This case study shows how important architecture exploration is for the rASIP design process. Using the proposed design flow, 45 different design points for MPEG2 IDCT and more than 60 different design points for the other kernels taken together were explored. This exploration was done within 5 man-days, which would be extremely difficult without the comprehensive tool support offered by the proposed methodology. The results also clearly advocate a careful investigation of various rASIP design alternatives for a variety of applications. For example, an FPGA designed for IDCT/DCT is not good enough for SAD/SHTD, since it does not contain the abs operator. Similarly, the number of internal registers required for SHTD (32 registers) is not enough for the DCT/IDCT kernel.
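The two closing examples amount to a coverage check between what a fabric offers and what a kernel needs. The sketch below encodes the PE operator sets from the legend of Figure 8.14 and the IR counts from Table 8.19. The two fabrics and the per-kernel operator requirements are simplified, hypothetical models introduced here for illustration; they are not the actual cluster structures.

```python
ALU_OPS = {"+", "-", "<<", ">>", "&", "|"}
MUL_OPS = {"*"}
AB_ALU_OPS = ALU_OPS | {"abs(x+y)", "abs(x-y)"}

# Hypothetical fabrics: operator set offered by the PEs and number of IRs.
FABRICS = {
    "idct_style": {"ops": ALU_OPS | MUL_OPS, "irs": 64},
    "abs_capable_32ir": {"ops": AB_ALU_OPS | MUL_OPS, "irs": 32},
}

# Simplified kernel requirements: operators named in the text, IRs from Table 8.19.
KERNELS = {
    "IDCT/DCT": {"ops": {"+", "-", "*"}, "irs": 64},
    "SAD": {"ops": {"+", "abs(x-y)"}, "irs": 16},
    "SHTD": {"ops": {"+", "-", "abs(x+y)", "abs(x-y)"}, "irs": 32},
}


def shortfall(fabric_name, kernel_name):
    """Return (missing operators, missing internal registers) for a pairing."""
    fabric, kernel = FABRICS[fabric_name], KERNELS[kernel_name]
    missing_ops = kernel["ops"] - fabric["ops"]
    missing_irs = max(0, kernel["irs"] - fabric["irs"])
    return missing_ops, missing_irs


for fabric_name in FABRICS:
    for kernel_name in KERNELS:
        print(fabric_name, kernel_name, shortfall(fabric_name, kernel_name))
```

Under these assumptions, the IDCT-style fabric lacks the abs operators needed by SAD and SHTD, while a fabric with only 32 IRs falls 32 registers short of the 8×8 block used by IDCT/DCT.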
