TEMPERATURE-AWARE UNIFIED PHYSICAL-LEVEL AND HIGH-LEVEL SYNTHESIS
5.3 Overview of the Proposed Approach
5.3.2 Solution Encoding
5.3.3.6 Chip-Package Thermal Model
As mentioned at the beginning of Section 2.5, we use thermal modeling within the optimization flow to provide direct guidance for thermal optimization. To enable static thermal analysis of the placed modules in the floorplan, we use ISAC, a fast and accurate temperature analysis tool [69]. ISAC has been validated against FEMLAB [83], an accurate but slow commercial finite-element-based simulator, with less than 3.5% estimation error [69]. For our thermal analysis, we assume that each chip is attached to a copper heat sink using forced air cooling. Heat dissipates from the silicon die, through the cooling package to the ambient environment, and through the package to the printed circuit board. We assume an ambient temperature of 45°C and a silicon thickness of 200 μm, similar to the assumptions made in [80, 81].
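To give a feel for the quantities involved, the following is a minimal lumped, one-dimensional steady-state estimate of die temperature. It is not the ISAC model, which resolves the spatial temperature distribution across the floorplan. The silicon thermal conductivity (about 150 W/(m·K)), the die area, the package-to-ambient resistance, and the power value are illustrative assumptions rather than values from our experiments; only the ambient temperature and the silicon thickness are taken from the text above.

```python
# Lumped 1-D steady-state estimate: T_die = T_ambient + P * (R_silicon + R_package).
# All default parameter values except ambient temperature and silicon
# thickness are hypothetical and chosen only for illustration.

K_SILICON_W_PER_M_K = 150.0   # assumed approximate conductivity of silicon
AMBIENT_C = 45.0              # ambient temperature from the text
SILICON_THICKNESS_M = 200e-6  # silicon thickness from the text


def silicon_resistance(thickness_m: float, area_m2: float,
                       k: float = K_SILICON_W_PER_M_K) -> float:
    """Conduction resistance (K/W) of a uniform slab: R = t / (k * A)."""
    return thickness_m / (k * area_m2)


def die_temperature(power_w: float, area_m2: float,
                    r_package_k_per_w: float) -> float:
    """Estimated die temperature (deg C) for uniform heat flow through the stack."""
    r_si = silicon_resistance(SILICON_THICKNESS_M, area_m2)
    return AMBIENT_C + power_w * (r_si + r_package_k_per_w)


# Hypothetical case: 10 W dissipated over a 1 cm^2 die, 1 K/W package path.
estimate_c = die_temperature(power_w=10.0, area_m2=1e-4, r_package_k_per_w=1.0)
```

A lumped estimate of this kind yields a single average temperature and cannot expose hotspots, which is why a spatially resolved analysis is needed when the objective is peak module temperature.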
5.4 Experimental Results
The proposed algorithm was tested on a Linux-based workstation using a 1.86 GHz Intel Core Duo processor with 2 GB of memory. The overall flow used in our experiments is shown in Figure 5.7. Our experiments were performed on a comprehensive set of benchmarks drawn from real-life applications in the MediaBench suite [87]. Each of these benchmarks was specified as a dataflow graph (DFG) [86] capturing the behavioral description of the architecture to be synthesized. The RTL resource set used in our experiments consisted of multipliers, ALUs, registers, and multiplexers. These resources were synthesized using the Cadence BuildGates and Encounter tools and mapped to a 180 nm technology library; the capacitance values of the functional modules were then extracted. The areas, delays, and profiled power dissipation values from actual layouts of these RTL resources served as the RTL module library data in our experiments. We then estimated the switching activity between pairs of DFG operations that can potentially be executed consecutively on the same resource, and used these estimates to create the switching activity tables from which the switching power of each resource is computed.
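The use of a switching activity table can be sketched as follows. This is a minimal illustration, not the exact computation in our flow: it assumes the standard dynamic-power relation P = 0.5 · α · C · V² · f, and the operation names, activity values, capacitance, supply voltage, and clock frequency are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical table: activity[(a, b)] is the average switching activity
# when operation b executes immediately after operation a on one resource.
ActivityTable = Dict[Tuple[str, str], float]


def switching_power(sequence: List[str], activity: ActivityTable,
                    cap_farads: float, vdd: float, freq_hz: float) -> float:
    """Average switching power (W) of a resource executing `sequence` in order."""
    if len(sequence) < 2:
        return 0.0  # no consecutive pair, so no pairwise activity to look up
    pairs = list(zip(sequence, sequence[1:]))
    avg_alpha = sum(activity[pair] for pair in pairs) / len(pairs)
    # Assumed standard dynamic-power relation: P = 0.5 * alpha * C * V^2 * f
    return 0.5 * avg_alpha * cap_farads * vdd ** 2 * freq_hz


# Illustrative values only.
activity = {("mul1", "mul2"): 0.30, ("mul2", "mul3"): 0.50}
power_w = switching_power(["mul1", "mul2", "mul3"], activity,
                          cap_farads=2e-12, vdd=1.8, freq_hz=100e6)
```

Because the table is indexed by ordered operation pairs, changing the binding or the execution order of operations on a resource changes its estimated switching power, which is what allows a synthesis loop to trade off bindings against power.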
We tested our synthesis technique on a comprehensive set of twenty HLS benchmarks. Each of these benchmarks was specified as a dataflow graph capturing the behavioral description of the architecture to be synthesized. These benchmarks are drawn from two sources:
• popular high-level synthesis benchmarks used in previous literature,
• real-life examples generated from the MediaBench suite [86, 87].
Among the set of popular benchmarks, we selected seven examples widely used in HLS studies. These examples focus on frequently used numeric calculations performed by various DSP applications. They are as follows:
• ARF: an implementation of an auto-regression filter,
• EWF: an implementation of an elliptic wave filter,
• FIR1 and FIR2: two versions of a finite impulse response filter,
• COSINE1: an implementation of a 1-D eight-point fast discrete cosine transform filter, assuming constant coefficients,
• COSINE2: an implementation of a 1-D eight-point fast discrete cosine transform filter, where the coefficients are given as inputs,
• HAL: an iterative solver for a second-order differential equation.
The dataflow graphs for these examples range in size from 11 nodes to 82 nodes. Table 5.1 provides details of the first set of benchmarks used in our experiments. In Table 5.1, column 1 lists the benchmark name. Columns 2 and 3 specify the characteristics of the corresponding dataflow graph, where column 2 represents the number of nodes and column 3 the number of edges.
Table 5.1 Benchmark Set - 1
Benchmark    Number of DFG Nodes    Number of DFG Edges
HAL          11                     8
ARF          28                     30
EWF          34                     47
FIR2         40                     39
FIR1         44                     43
COSINE1      66                     76
COSINE2      82                     91
Eleven examples extracted from the MediaBench benchmark suite were used to test our algorithm. The MediaBench suite [86] contains a wide variety of complete applications drawn from image processing, communications, and DSP domains. The dataflow graphs used in experiments range in size from 51 nodes to 333 nodes, and were obtained from [87]. The dataflow graphs were derived from four MediaBench applications:
• JPEG: a lossy compression technique for digital images,
• MPEG2: a digital video compression standard used for high-quality video compression,
• EPIC: an efficient pyramid image coder, used as an image compression utility,
• MESA: a software 3-D graphics package.
Table 5.2 provides details of the dataflow graphs from the MediaBench benchmark set used in our experiments. In Table 5.2, the first column states the names of the various functions where the basic blocks (DFGs) originated, and the second column specifies the MediaBench application to which these functions belong. The remaining two columns specify the number of nodes and edges in the dataflow graph, respectively.
Table 5.2 Benchmark Set - 2
Benchmark                  Application Domain    Num. of DFG Nodes    Num. of DFG Edges
h2v2_smooth_downsample     JPEG                  51                   52
feedback_points            MESA                  53                   50
collapse_pyr               EPIC                  56                   73
write_bmp_header           JPEG                  106                  88
interpolate_aux            MESA                  108                  104
matrix_multiply            MESA                  109                  116
IDCT_col                   MPEG2                 114                  164
JPEG_IDCT_ifast            JPEG                  122                  162
JPEG_FDCT_islow            JPEG                  134                  169
smooth_color_z_triangle    MESA                  197                  196
Invert_matrix_general      MESA                  333                  354
In our experiments, the objective was to minimize the peak temperature among the functional units in a datapath during high-level synthesis. The temperature-aware synthesis method was compared with three other temperature-unaware methods:
• Method-A: A traditional floorplan-aware but power-unaware synthesis methodology that minimizes chip area,
• Method-B: A low-power floorplan-aware synthesis methodology that minimizes the total power consumption,
• Method-C: A low-power floorplan-aware synthesis methodology that minimizes peak module power.
Method-A is an SA-based layout-driven high-level synthesis technique that tightly integrates a floorplanner within the HLS loop. The SA cost function used in Method-A minimizes the schedule length and the traditional floorplanning objectives of chip area and total wirelength. Method-B augments the cost function used in Method-A with the objective of minimizing total power, while Method-C augments Method-A's cost function with the objective of minimizing the peak power consumption of the datapath functional units. Since chip temperatures are correlated with power, comparing TABS with low-power synthesis techniques allows us to highlight the advantages of a temperature-driven synthesis technique over a low-power design methodology. A thermal analysis is performed on the datapaths produced by these methods, and their peak module temperatures are compared with those of the datapaths created by TABS.
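The relationship between the three baseline cost functions can be sketched as follows. This is a minimal illustration under stated assumptions: the weighted-sum form, the weights, the `Solution` fields, and the example values are hypothetical, and a practical cost function would normalize each term before combining them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    """Hypothetical summary of one candidate datapath and floorplan."""
    schedule_length: float
    chip_area: float
    wirelength: float
    module_power: List[float]  # power of each functional unit


def cost_method_a(s: Solution, w_len: float = 1.0, w_area: float = 1.0,
                  w_wire: float = 1.0) -> float:
    """Schedule length plus the floorplanning objectives (area, wirelength)."""
    return (w_len * s.schedule_length + w_area * s.chip_area
            + w_wire * s.wirelength)


def cost_method_b(s: Solution, w_power: float = 1.0) -> float:
    """Method-A cost augmented with total power."""
    return cost_method_a(s) + w_power * sum(s.module_power)


def cost_method_c(s: Solution, w_power: float = 1.0) -> float:
    """Method-A cost augmented with peak module power."""
    return cost_method_a(s) + w_power * max(s.module_power)


# Two illustrative solutions with identical Method-A cost:
concentrated = Solution(10, 100, 50, [5.0, 1.0, 1.0])  # total 7, peak 5
balanced = Solution(10, 100, 50, [3.0, 3.0, 3.0])      # total 9, peak 3
```

With these illustrative values, Method-B prefers the `concentrated` solution because its total power is lower, while Method-C prefers the `balanced` solution because its peak module power is lower. Neither objective accounts for where the modules sit in the floorplan, which is the information a thermal analysis adds.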
Method-A is used as a baseline synthesis technique to study the contribution of a low-power design strategy towards minimizing on-chip temperatures, and contrast it against a power-
dissipation in a circuit would hopefully lower overall on-chip power density and hence the on-chip temperatures. The intuition for Method-C is that constraining peak module power could help mitigate the formation of on-chip thermal hotspots and hopefully create a more even thermal distribution.