Postprocessing Repeater Chains - BonnRepeaterTree Utilities

7.6 BonnRepeaterTree Utilities

7.6.2 Postprocessing Repeater Chains

Larger distances are often covered by chains of repeaters. A chain is a set of subsequent nets with fanout one. For larger distances, it is more probable that it is better to use higher layers with wider wires for the nets than the default layers. It is often not possible to reach the slack targets without using wider wires and a lot of short nets on lower layers increase placement congestion due to a lot of additional repeaters. On the other hand, it makes no sense to assign high fanout nets to higher layers. High fanout nets often connect a lot of sinks locally due to the pin capacitance. The benefits of wider wires are not high in such a situation. Thus, we restrict ourselves to repeater chains.

We offer two routines for improving the placement and layer assignment of repeater chains.

Postprocessing

The first routine tries to improve the layer assignment of repeater tree chains by ripping them out and rebuilding them on higher layers. The routine iterates over all repeater chains and then builds a new solution using each configured layer assignment with the Fast Buffering routine. The best solution according to our

Figure 7.6: All repeater chains longer than 4 mm on a large design. Distributing the chains

more evenly can give us less congestion while preserving the timing.

criteria, electrical violations, slack, power consumption, and length is kept. The routine either uses a shortest path computation in the blockage grid or uses the path search of the congestion map if available. In the later case, assignments are only performed if they do not violate congestion targets on the probed layers. Congestion-aware Rerouting

The second routine, RerouteChains, does not try to improve the layer assignment of repeater tree chains. Instead, it tries to improve congestion by moving the repeaters into less congested areas. Figure 7.6 shows all repeater chains longer than 4 mm on a design that has a lot of congestion. By distributing the chains evenly and by avoiding congested areas, it is possible to reduce overall congestion. We have a simple ripup-and-reroute heuristic that reroutes the chains.

This routine only works in presence of a congestion map. In several iterations all chains are collected that use an edge in the map above a certain congestion level. Then, for each chain, the following steps are performed:

1. Collect the nets of the chain and the costs of their routing in the congestion map.

2. Search for a new path from the start point to the end point of the chain in the congestion map.

3. Distribute the internal repeaters of the chain along the new path such that relative distances are preserved. The repeaters are then placed legally at the free position next to their target position.

7.6 BonnRepeaterTree Utilities

5. If the costs of the new nets are smaller than the old costs, then the new solution is kept. Otherwise, it is reverted.

The search for the new path does not take existing layer assignments into account because it is not supported by the congestion map. The assignments are only used when new routes for the nets are recomputed. Similarly, the path search ignores timing. We restrict the congestion map to the bounding box between the start and end of the path plus some tiles for detour. Thus, we expect that the routes do not get too long and timing is not degraded too much. Instead, the number of nets with long detours generated by the global router should become smaller.

The routine also does not consider placement congestion directly apart from searching for free positions. In practice, placing gates densely such that no gaps occur results in unroutable designs. Thus, if we have a design that is considered routable by the global router, we expect that placing the circuits legally in the corresponding global routing tiles should be feasible without much placement distortion. This cannot be guaranteed but the experiments did not show problems with placement legalization. If we skip placing the repeaters legally after movement, then the results are similar to runs where the circuits are legalized.

A potentially better solution to the problem has been proposed by Janzen (2012). He integrates the placement of repeater chains into the global router. Whole chains are considered as a single net by the global router, and, during path search, timing as well as placement area are considered. Compared to the method presented here, the running time of the global router increased significantly due to the additional work during path search. This approach is not suitable for application in practice so far.

Experimental Results

We test the RerouteChains routine on our testbed of chip designs (see Section 8). The test cases are output of a standard BonnTools optimization flow (see Section 7.5). We compare the effects of our routine on routability and timing.

Table 7.2 and Table 7.3 summarize the experiments. BonnRouteGlobal1 _{is used}

to measure routability. We have one run of the global router before and one run after RerouteChains. The global router and the congestion map have nearly the same number of global routing tiles in both dimensions.

Table 7.2 first shows the number of repeater chains considered on the test instances. Our tool runs in five iterations over all chains and tries to reroute a chain if it uses an edge in the congestion map with more than 85 % utilization. If a cheaper solution is found, then it is kept. The second column shows the number of improvements found over all five iterations. The following columns show the running times of both global router runs (1st _{GR, 2}nd _{GR), congestion map creation (CM), and RerouteChains}

(RC). The congestion map proves to be very fast. The RerouteChains running

Running Times

Design Chains Reroutes 1st _GR _CM _{RC 2}nd _GR

Baldassare 7613 381 9 s 2 s 3 s 10 s Beate 5043 251 14 s 2 s 5 s 14 s Wolfram 3842 1473 18 s 2 s 6 s 19 s Gerben 3493 10 14 s 1 s 5 s 15 s Luciano 27922 4200 46 s 5 s 20 s 50 s Benedikt 37893 3209 106 s 14 s 38 s 111 s Renaud 14908 12379 295 s 13 s 33 s 289 s Julius 41677 7315 178 s 24 s 185 s 150 s Franziska 56697 3108 151 s 22 s 65 s 160 s Meinolf 128023 68221 165 s 23 s 103 s 170 s Iris 76153 141865 750 s 30 s 193 s 797 s Gautier 165491 214135 6454 s 205 s 1344 s 5247 s

Table 7.2: Testbed for the RerouteChains tool. The running times are given for the

global routing (GR), congestion map creation (CM), and RerouteChains (RC).

times also contain the legalization of moved repeaters. Overall, the running times are acceptable.

Table 7.3 first shows the total overflow over all global routing edges as reported by BonnRouteGlobal. Then, the number of nets is reported with length longer than two times the length of a Steiner minimum tree and the highest relative detour. Finally, timing quality is measured with the sum of negative slacks (SNS). Most instances are uncritical and the tool has little effect. On Benedikt and Franziska the routability decreases slightly due to differences in the free edge capacities seen between the global router and the congestion map. On designs Gautier and Renaud the congestion was reduced significantly resulting in less detours. Due to the reduction of detours, also the sum of negative slacks was improved on these instances. In summary, there are instances where running the tool improves routability, while on uncritical instances no harm is done.

Figure 7.7 and Figure 7.8 show congestion maps as reported by BonnRouteGlobal for designs Julius and Gautier. The colors show the maximum relative edge utilization over all layers. Especially on Gautier, we see the huge reduction of overflow by a factor more than 100×.

7.6 BonnRepeaterTree Utilities

Design Step Overflow Nets > 100 % Max detour SNS

Baldassare before 0 0 8 % −140049 ps after 0 0 8 % −140039 ps Beate before 0 0 78 % −151997 ps after 0 0 78 % −151990 ps Wolfram before 0 0 25 % −75187 ps after 0 0 25 % −75210 ps Gerben before 0 0 30 % −109589 ps after 0 0 30 % −109626 ps Luciano before 0 52 524 % −464277 ps after 0 24 342 % −481492 ps Benedikt before 1446 0 72 % −207397 ps after 2906 0 103 % −210871 ps Renaud before 4140245 913 747 % −6557941 ps after 3444979 825 513 % −5822497 ps Julius before 0 0 99 % −18787024 ps after 0 0 89 % −18766033 ps Franziska before 1307 0 78 % −1694730 ps after 10982 0 78 % −1694434 ps Meinolf before 0 0 98 % −4622849 ps after 0 0 89 % −4715832 ps Iris before 2843214 4057 1662 % −19704061 ps after 1891486 3032 1662 % −15602736 ps Gautier before 12682825 4609 306 % −9647137 ps after 123511 111 131 % −7058513 ps

30 % 50 % 60 % 70 % 76 % 82 % 87 % 91 % 95 % 10the0 % 110 % 120 %

Figure 7.7:Congestion on design Julius before (left) and after (right) RerouteChains.

30 % 50 % 60 % 70 % 76 % 82the % 87 % 91 % 95 % 100 % 110 % 120 %

8 Experimental Results

As part of the IBM optimization tool suite, the BonnRepeaterTree tool is in daily use to optimize the physical design of current chip designs. It proved to be useful for current ASIC (application-specific integrated circuit) designs as well as for bleeding- edge processor units. Designers choose it as part of the BonnTools optimization tools to achieve timing-closure.

In this chapter, we want to show performance metrics of our tool. We compare it to another tool used by IBM for repeater tree optimization. We also compare against lower bounds for slack, netlength and repeater counts. We then show how parallel walk and effort assignment affect the results and give hints on proper selection of parameters ξ, η, and dnode.

Design Technology Instances Max. Sinks

Baldassare 22 nm 20552 152 Beate 22 nm 34382 166 Wolfram 32 nm 44413 274 Gerben 32 nm 44677 194 Luciano 22 nm 101625 263 Benedikt 22 nm 241465 12061 Renaud 45 nm 268775 389 Julius 22 nm 284934 1589 Franziska 22 nm 328600 740 Meinolf 22 nm 364969 1539 Iris 22 nm 393457 3264 Gautier 45 nm 1275731 6859

Table 8.1: Designs used for experimental results. For each design, all repeater tree instances

were build.

We have chosen twelve current chip designs from our industrial partner IBM for our experiments. For each design, we build all repeater tree instances regardless of difficulty. In total, we have more than 3.3 million instances of varying sizes with up to 12061 sinks. Table 8.1 shows the designs we used, their codename, technology, the number of instances, and the number of sinks in the biggest instance.

The distribution of instance sizes is very uneven as shown by Table 8.2, which shows it subdivided into the different technologies. The designs are dominated by single sink instances. Instances with up to four sinks already make up for more than 90 % of all instances. It is important to produce good repeater trees for

Sinks 45 nm 32 nm 22 nm Total 1 958307 58920 1307808 2325035 68.39 % 2 191709 11422 179836 382967 11.26 % 3 144054 9077 96330 249461 7.34 % 4 115818 2219 42237 160274 4.71 % 5 27687 2384 24882 54953 1.62 % 6 25420 968 20435 46823 1.38 % 7 16523 652 19334 36509 1.07 % 8 11518 521 18883 30922 0.91 % 9–20 27348 2336 44070 73754 2.17 % 21–50 14842 759 12434 28035 0.82 % 51–100 1667 239 5192 7098 0.21 % 101–250 775 46 2311 3132 0.09 % 251–500 233 1 269 503 0.01 % 501–1000 75 0 30 105 0.00 % > 1000 110 0 28 138 0.00 % Total 1536086 89544 1774079 3399709 100.00 %

Table 8.2: Test instances grouped by number of sinks and technologies.

small instances to get overall acceptable results. However, it is also important to optimize instances with more sinks well because, in practice, they are often among the timing-critical instances.

Table 8.3 shows the size of the repeater families used as libraries. The designs from 32 nm and 22 nm technologies do not have any buffers. Instead, two consecutive inverters are used if necessary.

We have performed all of our experiments on a machine with two Intel Xeon X5690 processors with 6 cores each. The machine runs with a base clock speed of 3.46 GHz. The Intel® _{Turbo Boost Technology is enabled with a maximum clock speed of}

3.73 GHz. Hyperthreading was disabled. All experiments were run single-threaded, but up to 12 experiments were run in parallel if not noted otherwise. The running times reported might be higher than necessary because the machine was fully loaded with experiments. However, in practice, designers often have to share computing resources with others, too. The machine has 192 GiB of main memory.

The code was compiled with GCC version 4.1.2 under Red Hat Enterprise Linux Server 5.6 at optimization level O2.

8.1 Comparison to an Industrial Tool

As a first experiment, we compare our algorithm to a repeater tree construction tool that is used by IBM for most repeater trees. It uses a van Ginneken-style approach with the running time improvements by Li et al. (2012). It is the default tool in

In document Fast Repeater Tree Construction (Page 103-111)