Runtime-vs-Accuracy Trade-off of Estimation Algorithm

STOCHASTIC WIRELENGTH ESTIMATION-BASED HIGH-LEVEL DESIGN SPACE EXPLORATION

6.4 Iterative Binding for High-Level Design Space Exploration

6.4.4 Runtime-vs-Accuracy Trade-off of Estimation Algorithm

The runtime of the dynamic Rent-parameter extraction technique is determined by the number of levels to which the gate-level netlist is partitioned. This, in turn, is dictated by the minimum number of data points needed to compute the Rent parameters and by the granularity of the placement bins used to estimate gate positions. To determine the Rent parameters accurately, a certain minimum number of partitions is required to obtain an adequate number of data points in Region I of the Rent curve [18]. In addition, to identify all RTL modules uniquely, the smallest bin on the partitioned layout must be no larger than the smallest RTL module in the datapath. For runtime efficiency, our algorithm stops partitioning once enough data points have been obtained to compute both the Rent parameters and the RTL module locations.
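As a rough illustration of why several Region I data points are needed: Rent's rule, T = t * G^p, becomes a straight line in log-log space, so the exponent p and the coefficient t can be recovered from a least-squares line fit over (gate count, terminal count) pairs collected at successive partition levels. The sketch below assumes this standard fitting approach, which the text does not spell out; the function name `fit_rent_parameters` and the sample data are hypothetical.

```python
import math


def fit_rent_parameters(points):
    """Fit log T = log t + p * log G by ordinary least squares.

    `points` is a list of (gates, terminals) pairs, assumed to lie in
    Region I of the Rent curve. Returns the tuple (t, p).
    """
    if len(points) < 2:
        raise ValueError("need at least two data points to fit a line")
    xs = [math.log(gates) for gates, _ in points]
    ys = [math.log(terminals) for _, terminals in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("gate counts must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p = sxy / sxx                        # Rent exponent (slope)
    t = math.exp(mean_y - p * mean_x)    # Rent coefficient (intercept)
    return t, p


# Hypothetical, noise-free data generated from T = 3.5 * G^0.6.
sample = [(g, 3.5 * g ** 0.6) for g in (8, 16, 32, 64, 128)]
t, p = fit_rent_parameters(sample)
```

With noisy measurements, fewer points make the fitted slope more sensitive to any single partition level, which is consistent with the need for a minimum number of data points.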

Through experiments, we found that five or more data points in Region I of a netlist's Rent curve were adequate to determine the Rent parameters accurately. The number of bi-partitioning levels needed to uniquely identify all RTL modules in the datapath can be computed from the ratio of the gate count of the smallest datapath RTL module to the total gate count, as shown in the following equation:

$$\text{Number of partition levels} = \log_2 N_T - \log_2 N_S \tag{6.10}$$

where $N_S$ is the gate count of the smallest RTL module and $N_T$ is the total number of gates in the datapath (assuming uniformly sized gates).
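Equation 6.10 can be evaluated with a few lines of code. The sketch below makes two assumptions that the text does not state explicitly: the logarithm is taken to base 2 (each level bisects the layout), and the result is taken as a positive count and rounded up so that the smallest bin is no larger than the smallest module. The function name and the gate counts in the example are hypothetical.

```python
import math


def levels_to_resolve_modules(smallest_module_gates, total_gates):
    """Number of bisection levels so that one bin holds at most
    `smallest_module_gates` gates (uniform gate size assumed)."""
    if not 0 < smallest_module_gates <= total_gates:
        raise ValueError("expected 0 < smallest_module_gates <= total_gates")
    # Assumption: base-2 logarithm, rounded up to a whole number of levels.
    return math.ceil(math.log2(total_gates / smallest_module_gates))


# Hypothetical datapath: 20000 gates in total, smallest module of 150 gates.
levels = levels_to_resolve_modules(150, 20000)
```

For this hypothetical datapath, seven levels would leave bins of about 156 gates, slightly larger than the smallest module, so the rounded-up value is eight.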

Figure 6.17 shows the percentage error in wirelength estimation for the DCT-1 benchmark as a function of the number of levels to which the gate-level netlist is partitioned during wirelength estimation. A similar trend was observed for all the other benchmarks. The figure shows that the accuracy of wirelength estimation improves with the partitioning depth. For our experiments, we used a partitioning depth of 10 for all the benchmarks.

6.5 Experimental Results

The methods described in this paper were implemented and tested on a Linux workstation with a 1.86 GHz Intel Core 2 Duo CPU and 2 GB of RAM.

In this section, we present results from our experiments on clock period optimization with the proposed iterative binding algorithm. The algorithm was tested on four data-intensive HLS benchmarks: an 8-point IIR filter, a 16-point FIR filter, a 5-point elliptic wave filter (EWF), and an 8 x 8 Discrete Cosine Transform (DCT) filter. Our algorithm accepts inputs in the form of dataflow graphs. For our experiments, we scheduled the input dataflow graphs using ASAP and force-directed scheduling, and performed the initial resource allocation and binding with a clique-partitioning heuristic. The clock period of this initial solution was then improved by our iterative binding algorithm.

Table 6.3 illustrates the clock periods of the initial and best bindings for the benchmarks tested. In the table, column 1 lists the name of the benchmark. Column 2 shows the estimated clock period of the initial solution provided to the iterative binding algorithm, while column 3 shows the clock period of the best binding solution found. Column 4 shows the percentage improvement in the clock period. The percentage improvements in the clock period vary from 6.65% for the DCT-1 benchmark design, to 14.42% for the DCT-2 benchmark design, with an average improvement of 9.55%.
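The percentage improvement can be recomputed from the initial and best clock periods. The sketch below assumes the usual definition, 100 * (initial - best) / initial, which the text does not state explicitly; the clock periods are copied from Table 6.3, and the variable and function names are illustrative only.

```python
# (initial CP, best CP) in picoseconds, copied from Table 6.3.
CLOCK_PERIODS_PS = {
    "IIR-1": (1261, 1171),
    "IIR-2": (1227, 1115),
    "IIR-3": (1133, 1009),
    "EWF-1": (2110, 1884),
    "EWF-2": (1586, 1397),
    "EWF-3": (1347, 1243),
    "DCT-1": (2796, 2610),
    "DCT-2": (2168, 1855),
    "FIR-1": (1332, 1233),
    "FIR-2": (1127, 1021),
}


def improvement_percent(initial_ps, best_ps):
    """Relative clock period reduction, in percent (assumed definition)."""
    return 100.0 * (initial_ps - best_ps) / initial_ps


improvements = {
    name: improvement_percent(initial, best)
    for name, (initial, best) in CLOCK_PERIODS_PS.items()
}
average = sum(improvements.values()) / len(improvements)

for name, value in improvements.items():
    print(f"{name}: {value:.2f}%")
print(f"Average: {average:.2f}%")
```

Recomputed values for a few rows can differ slightly from the tabulated percentages, presumably because the clock periods in the table are rounded to whole picoseconds.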

Figures 6.18 and 6.19 show the convergence plots for our iterative binding algorithm, illustrating the trend in the clock period and total wirelength of the best solution found during design space exploration. The x-axis in these figures represents the number of binding moves attempted during the iterative improvement phase, and the y-axis represents the clock period and total wirelength of the best binding solution found. For these experiments, we set the maximum number of binding moves attempted to 200. These figures illustrate that the clock period of an initial binding solution can be improved significantly through layout-aware binding. We also observe that the total wirelength increases for all the benchmarks, albeit by a small amount, typically less than 15%. The improvements in the clock period were thus achieved without a significant sacrifice in total wirelength.

Table 6.3 Clock Period Improvement by the Iterative Binding Algorithm

Benchmark   Initial CP (ps)   Best CP (ps)   % improvement
IIR-1       1261              1171           7.14
IIR-2       1227              1115           9.13
IIR-3       1133              1009           10.94
EWF-1       2110              1884           10.8
EWF-2       1586              1397           11.92
EWF-3       1347              1243           7.72
DCT-1       2796              2610           6.65
DCT-2       2168              1855           14.42
FIR-1       1332              1233           7.38
FIR-2       1127              1021           9.41
Average                                      9.55

The improvements in the clock period through binding could be attributed to a better wire distribution of the final layout, due to a more balanced binding of DFG operations and variables among the data resources. The binding moves attempted by the algorithm try to identify natural clusters of connected modules in the datapath that could potentially lead to smaller wire delays for data transfers and lower wire congestion.
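The overall shape of such an iterative improvement loop can be sketched as follows. This is only an illustration under stated assumptions: the acceptance rule (keep a move only if it lowers the estimated clock period) is assumed, not taken from the text, and `propose_move` and `estimate_clock_period` are hypothetical stand-ins for the binding move generator and for the partitioning, placement, and wirelength estimation steps. The toy cost model at the end is likewise invented purely to make the sketch runnable.

```python
import random


def iterative_binding(initial_binding, propose_move, estimate_clock_period,
                      max_moves=200, rng=None):
    """Greedy iterative improvement over binding moves (assumed scheme)."""
    rng = rng or random.Random(0)
    best = initial_binding
    best_cp = estimate_clock_period(best)
    history = [best_cp]                 # best clock period after each move
    for _ in range(max_moves):
        candidate = propose_move(best, rng)
        cp = estimate_clock_period(candidate)
        if cp < best_cp:                # assumption: accept improvements only
            best, best_cp = candidate, cp
        history.append(best_cp)
    return best, best_cp, history


# Toy stand-ins: 8 operations bound to 2 functional units; the
# "clock period" grows with the load on the busiest unit.
def toy_move(binding, rng):
    new = list(binding)
    new[rng.randrange(len(new))] = rng.randrange(2)
    return tuple(new)


def toy_clock_period(binding):
    loads = [binding.count(unit) for unit in range(2)]
    return 1000 + 100 * max(loads)


start = (0,) * 8
best, best_cp, history = iterative_binding(start, toy_move, toy_clock_period)
```

The `history` list plays the role of a convergence plot: it records the best clock period found after each attempted move and never increases.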

Table 6.4 shows the runtime of the design space exploration performed by our algorithm for the tested benchmarks. Column 2 indicates the total number of datapath designs examined by the algorithm, and column 3 states the CPU time in minutes and seconds. Each binding move attempted by the algorithm involves creating a new datapath architecture, which is recursively partitioned and globally placed to estimate the wirelength and clock period. In a typical physical synthesis flow, every cell placement and routing step would take several minutes. Hence, in a traditional synthesis flow, evaluating each binding move would by itself take several minutes for synthesis and timing analysis. Figure 6.20 compares the runtimes of a traditional HLS design space exploration using standard-cell place and route with those of the proposed iterative binding using stochastic wirelength estimation. The iterative binding algorithm proposed in this paper performs the same task almost an order of magnitude faster than the traditional synthesis flow.

Table 6.4 CPU Runtimes for the Iterative Binding Algorithm

HLS benchmark   Number of binding moves   Execution time
IIR-1           100                       6m:24s
EWF-1           200                       11m:04s
FIR-1           200                       22m:49s
DCT-1           200                       34m:07s
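To put the runtimes of Table 6.4 in per-move terms, the total execution time can be divided by the number of binding moves. The short sketch below does this arithmetic on the table values; the helper name `to_seconds` is illustrative, and the result is an average that includes all overhead, not a measured per-move time.

```python
import re

# (number of binding moves, execution time) copied from Table 6.4.
RUNTIMES = {
    "IIR-1": (100, "6m:24s"),
    "EWF-1": (200, "11m:04s"),
    "FIR-1": (200, "22m:49s"),
    "DCT-1": (200, "34m:07s"),
}


def to_seconds(text):
    """Convert a string such as '6m:24s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m:(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = (int(part) for part in match.groups())
    return 60 * minutes + seconds


seconds_per_move = {
    name: to_seconds(time_text) / moves
    for name, (moves, time_text) in RUNTIMES.items()
}

for name, value in seconds_per_move.items():
    print(f"{name}: {value:.2f} s per binding move")
```

On these figures, the average cost of evaluating one binding move falls between roughly 3 and 10 seconds, compared with the several minutes per move cited for a traditional place-and-route flow.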

Figure 6.20 Runtime Comparison Between HLS Design Space Exploration with Traditional Place & Route and the Proposed Stochastic Wirelength Estimation Method

6.6 Conclusions

In this chapter, we presented an iterative binding algorithm for clock period optimization that uses stochastic wirelength models to estimate the total wirelength of cell-based designs and a top-down, partitioning-based RTL placement to estimate the clock period. Using these estimates to guide HLS binding decisions enables our approach to achieve an order-of-magnitude improvement in the search time for HLS design space exploration. Our wirelength estimates are within 15% of those obtained with Dragon, Capo, FengShui, and Cadence Silicon Ensemble. Experiments on dataflow-intensive HLS benchmarks show that our iterative binding algorithm can improve the clock period of a datapath by an average of 9.6%, with minimal impact on the wirelength.

CHAPTER 7 A GENETIC ALGORITHM FOR HIGH-LEVEL DESIGN SPACE

In document Temperature and Interconnect Aware Unified Physical and High Level Synthesis (Page 171-179)