Simultaneous placement with clustering and duplication



GANG CHEN

Magma Design Automation

and

JASON CONG

UCLA

Clustering, duplication, and placement are critical steps in a cluster-based FPGA design flow. Clustering has a great impact on the wirelength, timing, and routability of a circuit. Logic duplication is an effective method for improving performance while maintaining the logic equivalence of a circuit. Based on several novel algorithmic contributions, we present an efficient and effective algorithm named SPCD (simultaneous placement with clustering and duplication) which performs clustering and duplication during placement for wirelength and timing minimization. First, we incorporate a path counting-based net weighting scheme for more effective timing optimization. Secondly, we introduce a novel method of moving a fragment of a cluster (called a fragment level move) during placement to optimize the clustering structure. To reduce the critical path detour during legalization from a more global perspective, we also introduce the notions of a monotone region and a global monotone region in which improvement to the local/global path detour is guaranteed. Furthermore, we introduce a notion of a constrained gain graph to embed all complex FPGA clustering constraints, and implement an optimal incremental legalization algorithm under such constraints. Finally, in order to reduce the circuit area, we formulate a timing-constrained global redundancy removal problem and propose a heuristic solution. Our SPCD algorithm outperforms a widely used academic FPGA placement flow, T-VPack + VPR, with an average reduction of 31% in the longest path estimate delay and 18% in the routed delay. We also apply our SPCD algorithm to Altera’s Stratix architecture in a commercial FPGA implementation flow (Quartus II 4.0). The routed result achieved by our SPCD algorithm outperforms VPR by 20% and outperforms Quartus II 4.0 by 4%.

Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids—Placement and routing; G.4 [Mathematics of Computing]: Mathematical Software—Algorithm design and analysis; J.6 [Computer Applications]: Computer-Aided Engineering—Computer-aided design (CAD)

This research was partially funded by NSF Grant CCF-0096383 and by a grant from Magma Design Automation under the California MICRO Program. This research was performed as part of G. Chen’s Ph.D. study at UCLA. Portions of this article were published in Chen and Cong [2004, 2005].

Authors’ addresses: G. Chen, Magma Design Automation, 5460 Bayfront Plaza, Santa Clara, CA 95054; email: [email protected]; J. Cong, Computer Science Department, University of California at Los Angeles, Campus Mailcode 159610, Los Angeles, CA 90095-1596; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax:+1 (212) 869-0481, or [email protected].


General Terms: Algorithms, Design, Performance

Additional Key Words and Phrases: Placement, clustering, duplication, legalization, redundancy removal, FPGA

1. INTRODUCTION

Field programmable gate arrays (FPGAs) have become more and more popular in recent years because of their short time-to-market, field programmability, ease of use, and low cost in small- to medium-volume production. A typical type of FPGA is based on a K-input lookup-table (LUT), which can implement any K-input function. A typical FPGA architecture described by Betz and Rose [1997], the LUT-based FPGA family, contains two levels of physical hierarchy: basic logic elements (BLE) and cluster-based logic blocks (CLB). As described in Figure 1, each BLE contains a K-input LUT and a flip-flop (FF), and the LUT and FF share the same output. As described in Figure 2, each CLB contains N BLEs, I inputs, and N outputs. Each of the I inputs can drive all the BLEs, and each BLE drives an output. Here, K, N, and I are parameters described by an architecture file. The interconnect delay between BLEs within the same CLB is usually much smaller than the delay between BLEs in different CLBs. We call this a baseline architecture.

Many commercial LUT-based FPGA architectures are similar to the baseline architecture, and some of them have two or more levels of physical hierarchy. For instance, Altera’s Stratix architecture consists of logic elements (LEs) and logic array blocks (LABs). As shown in Figure 3, an LE is the smallest logic unit in the Stratix architecture, and it corresponds to a BLE in the baseline architecture. An LE contains a four-input LUT, a programmable register, and a carry chain. Unlike a BLE, the LUT and FF in an LE have separate outputs. A LAB is equivalent to a CLB in the baseline architecture, and Figure 4 shows its structure in the Stratix device. Each LAB consists of ten LEs and forms the second level of the physical hierarchy. For more details, please refer to the Stratix Device Handbook [2006].

A typical FPGA physical implementation flow consists of the following steps: clustering/packing, logic duplication, placement, and routing. The clustering step packs LUTs and FFs into CLBs according to the connectivity and timing of a mapped netlist; the duplication step clones one or more logic cells on critical paths to improve speed; the placement step assigns locations to all the nodes of a clustered netlist; and the routing step connects all the nets of a placed netlist. For both Figures 5 and 6, we define a target architecture: Each CLB contains two BLEs, the interconnect delay is proportional to the Manhattan distance, and both logic and intracluster delays are 0. As illustrated in Figure 5, clustering has a great impact on placement. The initial network in Figure 5(a) consists of four FFs and two LUT+FFs. As shown in Figure 1, each LUT+FF contains a LUT directly driving an FF. An optimal clustering solution in Figure 5(b), with a minimum area of three clusters and a minimum logic level of one, can be obtained from T-VPack [Marquardt et al. 1997]. However, if the target device contains only two rows and two columns of CLBs, the optimal placement solution in Figure 5(c) on this optimal clustering yields a longest path delay of 2.0. Instead, if we perform clustering together with placement, we can obtain the clustering and placement solution in Figure 5(d) with a longest path delay of 1.0.

Fig. 1. Basic logic element (BLE).

Fig. 2. Cluster-based logic block (CLB).

Fig. 4. Stratix LAB structure [Stratix Device Handbook 2006].

Fig. 5. Impact of clustering on placement.

Fig. 6. Impact of duplication on placement.

As illustrated in Figure 6, duplication has a great impact on placement as well. The initial network in Figure 6(a) consists of five FFs and one LUT. We use the same architecture as in Figure 5, and assume that the target device contains one row and three columns of CLBs. An optimal clustering solution in Figure 6(b), with a minimum area of three clusters and a minimum logic level of two, can be obtained from T-VPack. However, the optimal placement solution in Figure 6(c) on this optimal clustering yields a longest path delay of 2.0, which cannot be improved by postplacement duplication. Instead, if we perform duplication together with placement, we can obtain the solution in Figure 6(d) with a longest path delay of 1.0.

2. REVIEW OF EXISTING WORK

Packing LUTs and FFs into CLBs is a critical step in a cluster-based FPGA design flow, since it has a great impact on wirelength, timing, and routability. VPack [Betz and Rose 1997] packs each logic block to its capacity to minimize the number of clusters and encourages input sharing to minimize the number of connections between clusters. The timing-driven version, T-VPack [Marquardt et al. 1999], minimizes the number of connections on the critical path, since internal connections are normally much faster than external ones. Rpack [Bozorgzadeh et al. 2001] introduces an effective routability metric and presents a routability-driven clustering algorithm for cluster-based FPGAs. PRIME [Cong et al. 1999] integrates retiming with performance-driven clustering/partitioning. Given an area bound for each cluster, PRIME generates a quasioptimal solution if duplication is allowed.

Logic duplication is a common technique for improving circuit performance by cloning one or more logic cells while maintaining the logic equivalence of a circuit. In the past, logic duplication for timing optimization has been studied in the following contexts. First, logic duplication is applied before placement in the logic synthesis domain. The speed of a circuit can be improved by replicating high fanout logic gates on the critical path to isolate critical sinks from noncritical ones [Lillis et al. 1996; Srivastava et al. 2000]. Lillis et al. [1996] performed gate replication to reduce the delay and area of a circuit under certain timing requirements. In Srivastava et al. [2000], the authors present an effective heuristic algorithm for the gate replication problem under a load-dependent delay model. They show that both global and local (fanout partitioning) logic duplication problems for delay optimization are NP-complete. The gate replication technique complements the popular gate sizing approach in the ASIC flow.

Secondly, logic duplication is applied after placement as a postprocessing step to further increase design performance for FPGAs. Beraudo and Lillis [2003] observe that coordinates of cells on a critical path often do not follow a monotone order and they propose a heuristic replication algorithm to straighten them (reduce detour) locally. A legalization engine based on the “ripple-move” approach in Mongrel [Hur and Lillis 2000] is used to legalize the placement incrementally. The average delay reduction over VPR obtained by Beraudo and Lillis [2003] is 7.5%. The follow-up work [Hrkic et al. 2004] improves upon Beraudo and Lillis [2003] by incorporating two new techniques: a timing-driven fanin-tree embedding and replication fanin-tree. The authors first introduce an optimal algorithm to solve the fanin-tree embedding problem under a general cost function; they then propose a replication tree to introduce large subcircuits to be solved by the embedding algorithm. The average delay reduction over VPR obtained by Hrkic et al. [2004] is 14.2%.

However, limited work has been done to carry out clustering and/or logic duplication during placement. Neumann et al. [1999] apply logic duplication in a recursive partitioning-based timing-driven placement flow. During each recursion, they sequentially perform timing analysis, net length estimation and weight calculation, bipartitioning, and cell replication. Before cells are assigned to rows, the redundancies introduced by gate replication are removed. This combined approach outperforms gate sizing by 10%, on average.

In this article we propose an efficient and effective algorithm to perform simultaneous clustering and logic duplication during placement for both wirelength and timing minimization. First, we incorporate a path counting-based net weighting scheme [Kong 2002] into our approach. Secondly, we introduce a novel fragment level move during placement to optimize the clustering structure. Thirdly, we introduce the notions of a monotone region and a global monotone region, which enable the optimization of nonmonotone paths from a global perspective. We then present an optimal incremental legalization algorithm under complex clustering constraints using a constrained gain graph. Finally, we formulate a timing-constrained global redundancy removal problem and propose a heuristic solution by solving the local redundancy removal problems optimally. The resulting algorithm, named SPCD, outperforms a widely used academic FPGA placement flow, T-VPack + VPR, with an average reduction of 31% in the longest path estimate delay and 18% in the routed delay. Meanwhile, our combined approach has the same runtime complexity as the existing VPR placement algorithm, and both runtime and area increases are very small.

3. INITIAL ANALYSIS

We first define the default baseline FPGA architecture that we use for many of the experiments in this article. In this default architecture, each BLE consists of one four-input LUT and one FF, each CLB consists of four BLEs, all wires span only one logic block, and all routing switches are tristate buffers. Later on in Section 5, we shall extend and apply our study to an FPGA architecture with more complex logic and routing structures, namely, Altera’s Stratix architecture. During the study of VPR’s placement results on this default architecture, we confirmed two observations mentioned in Beraudo and Lillis [2003].

First, the number of critical/near-critical pins is relatively small. Assuming the longest path delay is $T$, a pin $t$ is critical if $\mathit{slack}(t) = 0$; $t$ is $x\%$ critical if $\mathit{slack}(t)/T \le x\%$. From Table I, we can see that, on average, the percentage of critical pins is 0.10%, the percentage of 5% critical pins is 1.1%, the percentage of 10% critical pins is 3.7%, the percentage of 15% critical pins is 8.5%, and the percentage of 20% critical pins is 15.7%. It seems possible to perform a very small number of postplacement duplications to speed up the circuit by 5–10%. However, achieving more than a 10–15% speedup may involve many nodes.
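The criticality definition above can be checked on a small example. The following is a minimal sketch only: the function name `near_critical_fraction` and the slack values are hypothetical and are not taken from our benchmark data.

```python
def near_critical_fraction(slacks, longest_delay, x_percent):
    """Fraction of pins t with slack(t) / T <= x%, following the definition above."""
    if longest_delay <= 0:
        raise ValueError("longest_delay must be positive")
    if not slacks:
        return 0.0
    threshold = x_percent / 100.0
    hits = sum(1 for s in slacks if s / longest_delay <= threshold)
    return hits / len(slacks)


# Hypothetical pin slacks (in ns) and a hypothetical longest path delay T = 10 ns.
slacks = [0.0, 0.4, 0.9, 1.6, 3.0, 7.5]
T = 10.0
for x in (0, 5, 10, 15, 20):
    print(f"{x:>2}% critical: {near_critical_fraction(slacks, T, x):.2%}")
```

With x = 0 the test reduces to slack(t) = 0, so the first row counts only the strictly critical pins.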


Table I. Percentage of Near-Critical Pins

Circuit     0%       5%       10%      15%      20%
ex5p        0.13%    1.99%    6.54%    15.88%   30.68%
apex4       0.15%    3.04%    10.14%   22.66%   39.62%
Misex3      0.11%    1.40%    4.47%    11.83%   22.74%
Tseng       0.28%    1.71%    4.17%    7.09%    9.45%
alu4        0.12%    1.50%    5.53%    14.38%   26.53%
dsip        0.05%    0.17%    0.98%    2.29%    4.83%
seq         0.09%    0.73%    3.64%    9.69%    19.57%
diffeq      0.17%    1.02%    3.00%    6.35%    10.96%
apex2       0.11%    0.84%    5.93%    15.29%   27.36%
s298        0.23%    3.60%    10.55%   19.43%   30.03%
des         0.05%    0.24%    0.77%    1.93%    5.85%
bigkey      0.04%    0.15%    0.27%    0.38%    1.17%
spla        0.06%    1.17%    4.08%    10.85%   20.19%
elliptic    0.07%    1.03%    3.81%    8.01%    12.75%
ex1010      0.04%    0.78%    2.54%    7.23%    17.70%
pdc         0.04%    0.46%    2.16%    5.98%    12.83%
frisc       0.10%    0.76%    2.21%    4.67%    8.24%
s38584.1    0.04%    0.36%    1.01%    2.05%    3.24%
s38417      0.03%    0.44%    1.00%    1.99%    4.94%
clma        0.02%    0.15%    0.51%    1.88%    4.96%
Average     0.10%    1.08%    3.67%    8.49%    15.68%

Secondly, the critical paths are highly nonmonotone. For a path $p$ consisting of $m$ nodes (BLEs, in our case) $v_1, v_2, \ldots, v_m$, $v_1$ is the start point and $v_m$ is the end point. The $x$ coordinate of node $v_i$ is $x(v_i)$ and the $y$ coordinate of node $v_i$ is $y(v_i)$. The Manhattan distance between any two nodes $v_i$ and $v_j$ is defined as

$$\mathit{dist}(v_i, v_j) = |x(v_i) - x(v_j)| + |y(v_i) - y(v_j)|.$$

A subpath $v_{i-1}, v_i, v_{i+1}$ is monotone if both the $x$ and $y$ coordinates of $v_{i-1}, v_i, v_{i+1}$ follow a monotone order; the local monotone region of $v_i$ is the placement region in which $v_i$ can be placed so that the subpath is monotone; the deviation of $v_i$ with respect to one of its input nodes $v_{i-1}$ and one of its output nodes $v_{i+1}$, namely, the measurement of how much a node is outside of its monotone region, is defined as

$$\mathit{dev}(v_{i-1}, v_i, v_{i+1}) = \mathit{dist}(v_{i-1}, v_i) + \mathit{dist}(v_i, v_{i+1}) - \mathit{dist}(v_{i-1}, v_{i+1}).$$

The subpath $v_{i-1}, v_i, v_{i+1}$ is monotone if $\mathit{dev}(v_{i-1}, v_i, v_{i+1}) = 0$; that is, it is the shortest Manhattan distance subpath. The Manhattan distance of path $p$ is defined as

$$\mathit{dist}(p) = \sum_{i=1}^{m-1} \mathit{dist}(v_i, v_{i+1}).$$

The minimum distance of path $p$ is defined as $\mathit{min\_dist}(p) = \mathit{dist}(v_1, v_m)$; that is, the Manhattan distance between the start and end points. The path $p$ is globally monotone if $\mathit{dist}(p) = \mathit{min\_dist}(p)$. The level of a path $p$ is defined as $\mathit{level}(p) = m$. The $\mathit{unit\_dist}$ is defined as the distance between two adjacent CLBs. The detour ratio is defined as

$$\mathit{dr}(p) = \frac{\mathit{dist}(p)}{\max(\mathit{min\_dist}(p),\ \mathit{level}(p) \cdot \mathit{unit\_dist})}.$$

Here, $\mathit{dist}(p)$ is the actual Manhattan distance of $p$; assuming that $v_1$ and $v_m$ are fixed, $\mathit{min\_dist}(p)$ is the ideal Manhattan distance of $p$ when it is globally monotone; $\mathit{level}(p) \cdot \mathit{unit\_dist}$ is the distance of $p$ when it is placed almost ideally; the detour ratio describes how nonmonotone the path $p$ actually is. The reason that we use the maximum of $\mathit{min\_dist}(p)$ and $\mathit{level}(p) \cdot \mathit{unit\_dist}$ to compute the detour ratio is that sometimes nodes $v_1$ and $v_m$


Table II. Detour Ratio of 5% Near-Critical Paths

Circuit     avg dr(p)   min dr(p)   max dr(p)
ex5p        3.31        2.00        10.14
apex4       3.07        2.00        5.43
Misex3      3.52        1.71        10.83
Tseng       2.14        1.50        2.90
alu4        3.50        2.03        10.29
dsip        1.00        1.00        1.00
seq         5.75        2.12        12.00
diffeq      3.79        3.50        4.09
apex2       4.30        1.68        9.86
s298        6.64        4.73        7.92
des         4.36        1.48        19.00
bigkey      1.01        1.00        1.04
spla        5.34        1.86        14.86
elliptic    4.22        2.10        10.25
ex1010      6.08        2.64        21.00
pdc         3.53        1.85        15.50
frisc       4.10        2.93        5.50
s38584.1    3.43        1.25        8.25
s38417      4.89        2.33        10.00
clma        15.92       10.79       18.36
Average     4.49        2.53        9.91

of the average/minimum/maximum detour ratios in Table II, we consider all the 5% critical paths from PI/FFs to PO/FFs. Both the average detour ratio of 4.49 and the minimum detour ratio of 2.53 in Table II show that the near-critical paths are far from monotone.
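The deviation and detour ratio defined in this section can be computed directly from node coordinates. The sketch below is illustrative only: the helper names and the four-node path are hypothetical, and the unit distance is assumed to be 1.

```python
def dist(a, b):
    """Manhattan distance between two (x, y) locations."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dev(prev_node, node, next_node):
    """Deviation of `node` with respect to one input and one output node."""
    return dist(prev_node, node) + dist(node, next_node) - dist(prev_node, next_node)


def path_dist(path):
    """Sum of Manhattan distances between consecutive nodes of the path."""
    return sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))


def detour_ratio(path, unit_dist=1):
    """dist(p) / max(min_dist(p), level(p) * unit_dist), with level(p) = m."""
    min_dist = dist(path[0], path[-1])
    level = len(path)
    return path_dist(path) / max(min_dist, level * unit_dist)


# Hypothetical path v1 -> v2 -> v3 -> v4 on a CLB grid.
p = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(dev(p[0], p[1], p[2]), dev(p[1], p[2], p[3]))  # both subpaths are monotone
print(path_dist(p), dist(p[0], p[-1]))               # 6 versus a minimum of 2
print(detour_ratio(p))                               # 6 / max(2, 4) = 1.5
```

In this example every three-node subpath has zero deviation, yet the path as a whole is three times longer than the distance between its end points, so zero local deviation does not imply global monotonicity.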

4. SIMULTANEOUS PLACEMENT WITH CLUSTERING AND DUPLICATION

4.1 Algorithm Overview

Our algorithm uses a simulated annealing-based optimization engine [Betz and Rose 1997; Marquardt et al. 2000] to minimize a weighted function of wirelength and timing (weighted edge delays). In Figure 7, we show the overall flow of our SPCD algorithm. First, we perform an initial clustering on a mapped netlist, then we generate a random placement. During the annealing process, we optimize the clustering structure and cluster locations at the same time. To improve the suboptimal clustering structure during placement, we introduce a novel fragment level move to relocate a mapped node (BLE) into/out of a cluster (CLB). After each move, we update the cost function and decide whether the move should be kept. We iteratively perform a number of moves at each temperature and then reduce the temperature until the acceptance ratio is too low. At the end of each temperature, we perform redundancy removal, logic duplication, and legalization, sequentially. To reduce the detour in critical paths from a global perspective, we introduce the notions of a monotone region and a global monotone region. To handle the complex constraints in commercial FPGA architectures, we introduce a constrained gain graph and perform optimal incremental legalization. To control the runtime of the duplication procedure, we limit the number of duplications allowed for each temperature. Through experiments, we find that good results can be achieved in a short runtime when this limit is logarithmic in the circuit size. In order to merge duplicated copies to reduce area, we introduce a duplication graph representation and propose a heuristic to solve a global redundancy removal problem.

In the following sections, we describe the key components of our SPCD algorithm: a path counting-based net weighting scheme, clustering optimization during placement, logic duplication, optimal incremental legalization under complex constraints, a duplication graph, and redundancy removal.

4.2 Path Counting-Based Net Weighting

Net-based timing-driven placers (e.g., [Marquardt et al. 2000]) convert timing information into net weight and optimize a weighted function of all nets. The basic idea of net weighting is to assign higher weights to timing-critical nets and lower weights to noncritical nets. The net weighting scheme is both efficient and flexible enough to handle complex constraints, but most existing methods do not take path information into account.

In this article we implement a novel net weighting scheme [Kong 2002] which accurately counts all paths (critical and noncritical) for certain types of discount functions. One such discount function is $D(x, y) = a^{-x/y}$, where $a$ is a positive constant number, $x$ is the slack of a path, and $y$ is the delay of the longest path. This scheme considers path sharing, and assigns a higher weight to edges shared by two or more critical paths. For more details about path counting, please refer to Kong [2002].
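To illustrate what a path-based weight means, the sketch below enumerates every source-to-sink path of a tiny timing graph and adds the discounted contribution of each path to the edges it uses. This is a brute-force illustration under stated assumptions, not the efficient counting algorithm of Kong [2002]: the function name `edge_weights`, the toy graph, and the choice a = 16 are all hypothetical, and explicit enumeration is exponential in general.

```python
def edge_weights(edges, sources, sinks, a=16.0):
    """Sum the discount a ** (-slack / T) of every source-to-sink path over its edges.

    edges: {(u, v): delay}. Assumes a > 1 so that critical paths get larger discounts.
    """
    succ = {}
    for (u, v), delay in edges.items():
        succ.setdefault(u, []).append((v, delay))

    paths = []  # each entry: (list of edges on the path, path delay)

    def walk(node, used, delay):
        if node in sinks:
            paths.append((list(used), delay))
        for nxt, d in succ.get(node, []):
            used.append((node, nxt))
            walk(nxt, used, delay + d)
            used.pop()

    for s in sources:
        walk(s, [], 0.0)

    longest = max(delay for _, delay in paths)  # T, the longest path delay
    weights = {e: 0.0 for e in edges}
    for used, delay in paths:
        slack = longest - delay
        for e in used:
            weights[e] += a ** (-slack / longest)
    return weights


# Hypothetical graph: two inputs a, b feed node c, which drives outputs d and e.
toy = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 2.0, ("c", "e"): 1.0}
for edge, w in edge_weights(toy, sources={"a", "b"}, sinks={"d", "e"}).items():
    print(edge, round(w, 3))
```

Edges (a, c) and (c, d) lie on the critical path and are shared with one near-critical path each, so they receive the largest weights in this example.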

4.3 Clustering Optimization During Placement

As an artificial step in an FPGA implementation flow, packing nodes into clusters has two benefits. First, it hides the complex packing constraints from the placement algorithms. Secondly, it reduces the size of the placement problem and reduces runtime. However, due to a lack of physical information, the clustering procedure makes the wrong decisions in grouping logic. This becomes especially severe when the chip utilization is high, and the clustering procedure has to perform unrelated packing. Furthermore, the placement procedure honors the clustering solution and does not correct any packing mistakes.

One of our key contributions in this article is to optimize the clustering structure during placement. Conventional FPGA placers only carry out the block level move, which moves a CLB node to a new location and swaps with another CLB node if necessary. We introduce a novel and effective fragment level move, which moves a BLE node to a new CLB and swaps with another BLE node if necessary. As a result, we are able to significantly improve the suboptimal clustering structure and achieve a high-quality placement. With the simultaneous clustering and placement optimization, we can correct mistakes made during the previous clustering stage and significantly improve both wirelength and timing.

After we perform a fragment level move, we must determine whether the new CLB is in a valid configuration. To honor the packing constraints of a CLB, we need to check the number of BLEs and inputs. For commercial FPGA architectures, we also need to verify the number of clocks, the number of control signals, the number of feedbacks, etc. Hence, we dynamically update a set of hash-maps for each involved CLB whenever a fragment level move is performed. The complexity of the update is $O(K)$, where $K$ is the input size of a LUT. By carefully controlling the number of fragment level moves, we shall show in Section 6 that the complexity of our placement algorithm is $O(n^{4/3})$.

Fig. 8. Overview of a duplication algorithm.
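The validity check after a fragment level move can be sketched for the baseline architecture as follows. This is a simplified illustration: it recomputes the input set from scratch rather than maintaining incremental hash-maps, it assumes each BLE drives a net named after the BLE, and it assumes nets produced inside the CLB are fed back locally and do not consume one of the I cluster inputs. The `BLE` class and `is_legal_clb` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BLE:
    name: str          # assumed to also be the name of the net the BLE drives
    inputs: frozenset  # names of the nets driving the LUT inputs


def is_legal_clb(bles, max_bles, max_inputs):
    """Check the capacity (N) and input (I) constraints of one CLB."""
    if len(bles) > max_bles:
        return False
    produced = {b.name for b in bles}
    external = set()
    for b in bles:
        external |= b.inputs - produced  # local feedback needs no cluster input
    return len(external) <= max_inputs


# Hypothetical BLEs and a hypothetical CLB with N = 2 and I = 4.
b1 = BLE("b1", frozenset({"n1", "n2", "n3"}))
b2 = BLE("b2", frozenset({"b1", "n3", "n4"}))
b3 = BLE("b3", frozenset({"n5", "n6", "n7", "n8"}))

print(is_legal_clb([b1, b2], max_bles=2, max_inputs=4))      # shares n3, uses b1 locally
print(is_legal_clb([b1, b3], max_bles=2, max_inputs=4))      # seven distinct inputs
print(is_legal_clb([b1, b2, b3], max_bles=2, max_inputs=4))  # too many BLEs
```

A fragment level move would be accepted for consideration only if both the source and destination CLBs pass such a check; clock and control-signal limits could be added as further set-size tests in the same style.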

4.4 Logic Duplication

Our logic duplication algorithm can be performed either after a full placement or during a placement at the end of each iteration (e.g., in a simulated annealing-based placer or a quadratic programming-based placer). As shown in Figure 8, a timing analysis is first performed on the current placement. Then, we iteratively select a candidate node on the critical path, duplicate or move it to a new destination CLB, redistribute fanouts, and immediately legalize the destination CLB to resolve any possible physical constraint conflicts. In case the delay after legalization increases, both the legalization and the duplication operations will be undone. The iteration continues until there are no more candidates or until the limit on the number of duplications is reached.

4.4.1 Criticality-Driven Candidate Selection. Initially, we put all the near-critical PO pins into a heap sorted by the slack, and then we iteratively select the most critical pin t from the heap to perform the speedup operation. The candidate node, the source node s, is duplicated and moved to a new location so as to straighten all the near-critical paths flowing through edge (s, t). When two sink pins have the same criticality, we use the deviation of their source nodes to break ties. After a source node s is duplicated and legalized, we put the remaining timing-critical input pins of the clone into the heap.
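The selection order can be sketched with a binary heap keyed on slack. The tie-breaking direction is an assumption here: we take the pin whose source node has the larger deviation first, on the reasoning that it has more detour to remove. The function name and pin data are hypothetical.

```python
import heapq


def selection_order(pins):
    """Return pin names in the order they would be popped from the heap.

    pins: iterable of (name, slack, deviation_of_source_node).
    Smaller slack is more critical; ties go to the larger deviation (assumed).
    """
    heap = [(slack, -deviation, name) for name, slack, deviation in pins]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Hypothetical near-critical sink pins.
pins = [("t1", 0.0, 2), ("t2", 0.0, 6), ("t3", 0.3, 4)]
print(selection_order(pins))
```

In the full flow the heap is not drained in one pass: after a source node is duplicated and legalized, the remaining timing-critical input pins of the clone would be pushed with `heapq.heappush` before the next pop.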

4.4.2 Monotone Region and Global Monotone Region. After a candidate node $s$ is chosen, we find a destination location for it in order to minimize the critical path delay. We assume node $s$ has $k$ critical input nodes $i_1, i_2$ through $i_k$.

Fig. 9. An example of a monotone region.

Fig. 10. An example of global monotone region.

First, we define a monotone region $MR(\{i_j\}, s, t)$ for node $s$ with respect to one of its input nodes $i_j$ and the output node $t$, which is the minimum bounding box enclosing $i_j$, $s$, and $t$. Node $s$ can be placed anywhere inside the monotone region $MR(\{i_j\}, s, t)$ without increasing the deviation of $s$ with respect to $i_j$ and $t$, hence the subpath $i_j \to s \to t$ can be shortened.

Next, we define a monotone region $MR(\{i_1, i_2, \ldots, i_k\}, s, t)$, which is the intersection of all $MR(\{i_j\}, s, t)$. Node $s$ can be placed anywhere inside the monotone region $MR(\{i_1, i_2, \ldots, i_k\}, s, t)$ without increasing the deviation of $s$ with respect to any of its input nodes and the output node $t$. If we search for the destination location within this monotone region, all the critical paths passing through edge $(s, t)$ can be shortened.

Figure 9 is an example of the monotone region. The three rectangles with dashed lines are $MR(\{i_1\}, s, t)$, $MR(\{i_2\}, s, t)$, and $MR(\{i_3\}, s, t)$, respectively; the gray rectangle is $MR(\{i_1, i_2, i_3\}, s, t)$. For the largest circuit in our benchmark set, clma, the average size of the monotone region is around 5% of the total placement area.
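The construction of the monotone region as an intersection of bounding boxes can be sketched as follows. Rectangles are represented as hypothetical `(x_min, y_min, x_max, y_max)` tuples on the CLB grid, and the coordinates are made up for illustration; they do not correspond to Figure 9.

```python
def bounding_box(*points):
    """Minimum bounding box (x_min, y_min, x_max, y_max) of the given locations."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def intersect(a, b):
    """Intersection of two rectangles that are known to overlap."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def monotone_region(inputs, s, t):
    """Intersection of MR({i_j}, s, t) over all critical inputs i_j.

    Every per-input box contains both s and t, so the intersection is never empty.
    """
    region = bounding_box(inputs[0], s, t)
    for i_j in inputs[1:]:
        region = intersect(region, bounding_box(i_j, s, t))
    return region


# Hypothetical locations of three critical inputs, the candidate s, and the sink t.
inputs = [(0, 0), (0, 5), (1, 2)]
s, t = (3, 4), (6, 3)
print(monotone_region(inputs, s, t))  # (1, 3, 6, 4)
```

The same routine applies unchanged when the immediate inputs are replaced by another set of reference nodes, since it only depends on coordinates.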

One drawback of the work of Beraudo and Lillis [2003] is that it cannot handle paths that are locally monotone but globally nonmonotone. As shown in Figure 10, both subpaths $pi_1 \to r \to s$ and $r \to s \to t$ are monotone (shortest path). However, the complete path $pi_1 \to r \to s \to t$ is nonmonotone. Since the net delay is linear in the Manhattan distance, such paths cannot be shortened by the approach in Beraudo and Lillis [2003], or by using the monotone region alone. To resolve such a global nonmonotone problem, when we attempt to move node $s$, not only do we need to consider its direct fanin $r$, but also its primary input $pi_1$. Therefore, we define the notion of a global monotone region. A fanin cone $C_v$, rooted at $v$, is a connected subnetwork which consists of only $v$ and its predecessors; a critical fanin cone $C_v^{\mathit{crit}}$, rooted at $v$, is a connected subnetwork which consists of only $v$ and its timing-critical predecessors. For the primary input set of $C_s^{\mathit{crit}}$, we assume there are $l$ critical primary inputs $pi_1, pi_2, \ldots, pi_l$. A global monotone region is defined as $MR(\{pi_1, pi_2, \ldots, pi_l\}, s, t)$. During the computation of the global monotone region, we use the $x$ and $y$ coordinates of primary inputs $pi_1, pi_2, \ldots, pi_l$ instead of the immediate inputs $i_1, i_2, \ldots, i_k$. With the introduction of the global monotone region, we give priority to locations inside both regular and global monotone regions. As illustrated in Figure 10, node $s$ will first be moved to $s'$, node $r$ will then be moved to $r'$, and the complete path will be shortened perfectly to $pi_1 \to r' \to s' \to t$.

4.4.3 Destination Selection Within the Monotone Region. We iterate through each location inside the monotone region and choose the destination location $(x, y)$ such that $\mathit{cost}(x, y)$ is minimal. The cost function at a location $(x, y)$ is defined as

$$\mathit{cost}(x, y) = -\alpha \cdot \mathit{slack}(t) + \beta \cdot \mathit{overflow\_cost}(x, y) + \gamma \cdot \mathit{g\_path\_cost}(x, y).$$

The coefficients $\alpha$, $\beta$, and $\gamma$ are predefined constants. The first term, $\mathit{slack}(t)$, describes the timing improvement; here it is defined as the increase in slack at sink pin $t$ when the clone is placed at location $(x, y)$. The second term, $\mathit{overflow\_cost}(x, y)$, depicts the legality of the placement or the difficulty of the legalization. It is 0 when the destination location can accommodate the clone; otherwise, it is the difference between the actual usage and the capacity. Priority is given to locations that can accommodate the copy of $s$ without violating any physical constraints. The third term, $\mathit{g\_path\_cost}(x, y)$, characterizes the violation of the global monotonicity. It is 0 when the destination location is inside the global monotone region; otherwise, it is the minimum distance from $(x, y)$ to the global monotone region.
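A minimal sketch of this destination search is given below. Several details are assumptions made for illustration: rectangles are `(x_min, y_min, x_max, y_max)` tuples, the slack increase is supplied by a caller-provided function, the usage counted for the overflow term includes the clone itself, the distance to the global monotone region is measured in Manhattan distance, and all three coefficients default to 1.

```python
def rect_distance(x, y, rect):
    """Manhattan distance from (x, y) to a rectangle; 0 if the point is inside."""
    x0, y0, x1, y1 = rect
    dx = max(x0 - x, 0, x - x1)
    dy = max(y0 - y, 0, y - y1)
    return dx + dy


def select_destination(region, global_region, slack_gain, usage, capacity,
                       alpha=1.0, beta=1.0, gamma=1.0):
    """Scan every location in `region` and return (best_location, best_cost)."""
    x0, y0, x1, y1 = region
    best_loc, best_cost = None, None
    for x in range(x0, x1 + 1):
        for y in range(y0, y1 + 1):
            # Usage after adding the clone, compared against the CLB capacity.
            overflow = max(0, usage.get((x, y), 0) + 1 - capacity)
            cost = (-alpha * slack_gain(x, y)
                    + beta * overflow
                    + gamma * rect_distance(x, y, global_region))
            if best_cost is None or cost < best_cost:
                best_loc, best_cost = (x, y), cost
    return best_loc, best_cost


# Hypothetical data: slack improves as the clone moves east; CLB (3, 3) is full.
best = select_destination(
    region=(1, 3, 3, 4),
    global_region=(2, 3, 6, 4),
    slack_gain=lambda x, y: float(x),
    usage={(3, 3): 4},
    capacity=4,
)
print(best)  # ((3, 4), -3.0)
```

In this example locations (3, 3) and (3, 4) offer the same timing gain, and the overflow term steers the clone to the one that needs no legalization.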

4.4.4 Timing-Driven Fanout Partitioning. After a clone node is placed at a destination location, we perform timing-driven fanout partitioning to redistribute the fanouts to their corresponding inputs. Each fanout node t is assigned to a copy of the source node such that the arrival time at t is minimal. This is similar to the approach described in Beraudo and Lillis [2003].
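The assignment rule can be sketched as a simple minimization per fanout. The delay model below (interconnect delay equal to the Manhattan distance) and all names and numbers are assumptions for illustration only.

```python
def manhattan_delay(a, b):
    """Assumed delay model: interconnect delay equals the Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def partition_fanouts(copies, fanouts, delay=manhattan_delay):
    """Assign each fanout to the copy giving the smallest arrival time.

    copies:  {copy_name: (location, arrival_time_at_copy_output)}
    fanouts: {fanout_name: location}
    """
    assignment = {}
    for sink, sink_loc in fanouts.items():
        assignment[sink] = min(
            copies,
            key=lambda c: copies[c][1] + delay(copies[c][0], sink_loc),
        )
    return assignment


# Hypothetical original node s and its clone s', with three fanouts.
copies = {"s": ((0, 0), 1.0), "s'": ((4, 0), 1.5)}
fanouts = {"t1": (5, 0), "t2": (1, 1), "t3": (2, 0)}
print(partition_fanouts(copies, fanouts))
```

Here only t1 is driven by the clone; t2 and t3 stay on the original node because the clone's later output arrival outweighs its shorter wire.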

4.5 Optimal Incremental Legalization under Complex Constraints

We perform legalization immediately after each duplication operation. If the delay on the most critical path deteriorates, we will undo both legalization and duplication.

First, we describe the ripple-move-based legalization approach used in Hur and Lillis [2000]. For each ripple-move, we select a source location S with overflow and a destination location T with extra capacity, and find a maximum gain monotone path from S to T along which a sequence of cells is moved. To determine the maximum gain path and the cells to be moved, a global analysis based on the gains of individual cells is performed. Given the source S and the destination T, each cell can only be moved in, at most, two directions. The gain value associated with each cell move is the reduction in the cost function, and the gain value associated with each location and direction is the maximum gain value among all the cells moving in that direction. Then, we can construct a gain graph in which each vertex corresponds to a location inside the rectangular region determined by S and T, and in which each weighted arc represents the maximum gain value in the direction of the arc. Figure 11 is an example of a gain graph. Since a gain graph is acyclic, the maximum gain path can be found by dynamic programming in a topological order. When a ripple-move is performed on this maximum gain path, a cell is allowed to move more than once so that the final gain is equal to or better than the value determined by the maximum gain path.

Fig. 11. An example of a gain graph.
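The dynamic program on a gain graph can be sketched for the case where T lies to the northeast of S, so that every arc points east or north. The grid size and arc gains below are hypothetical and do not reproduce Figure 11; S is placed at (0, 0) and T at the opposite corner.

```python
def max_gain_path(width, height, gain_east, gain_north):
    """Maximum gain monotone path from S = (0, 0) to T = (width - 1, height - 1).

    gain_east[(x, y)]  is the gain of the arc (x, y) -> (x + 1, y);
    gain_north[(x, y)] is the gain of the arc (x, y) -> (x, y + 1).
    Row-major order is a topological order of this acyclic grid graph.
    """
    best = {(0, 0): 0.0}
    prev = {}
    for y in range(height):
        for x in range(width):
            if (x, y) == (0, 0):
                continue
            candidates = []
            if x > 0:
                candidates.append((best[(x - 1, y)] + gain_east[(x - 1, y)], (x - 1, y)))
            if y > 0:
                candidates.append((best[(x, y - 1)] + gain_north[(x, y - 1)], (x, y - 1)))
            best[(x, y)], prev[(x, y)] = max(candidates)
    path = [(width - 1, height - 1)]
    while path[-1] != (0, 0):
        path.append(prev[path[-1]])
    path.reverse()
    return best[(width - 1, height - 1)], path


# Hypothetical 3 x 2 region between S and T.
gain_east = {(0, 0): 2, (1, 0): -1, (0, 1): 1, (1, 1): 3}
gain_north = {(0, 0): -2, (1, 0): 4, (2, 0): 0}
print(max_gain_path(3, 2, gain_east, gain_north))
```

Each location is visited once and has at most two incoming arcs, so the running time is linear in the number of locations in the region.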

When each cell is allowed to move, at most, one distance away, the ripple-move algorithm is optimal for certain cost functions such as the bounding box wirelength [Hur and Lillis 2000], the weighted source-sink distance, etc. However, the optimality of the maximum gain path does not hold for a general cost function. For example, the timing cost is the summation of weighted delays over all edges, and the delay of an edge is determined by the locations of both source and sink pins. If there is an edge between two cells on the maximum monotone gain path, then the timing cost reduction precomputed for the sink node would be inaccurate, since the source node is moved as well. As a result, the maximum gain path computed by the ripple-move algorithm may not be optimal for timing optimization under a general delay model. However, if the interconnect delay is a linear function of the Manhattan distance, the maximum gain path is still optimal.

In reality, commercial FPGA architectures have complex clustering constraints at the CLB level, in addition to capacity constraints, such as the input constraint, clock constraint, control signal constraint, etc. Since the gain graph does not consider such complex constraints at all, we introduce the notion of a constrained gain graph.

For example, in the artificial FPGA architecture shown in Figure 12, each CLB contains two BLEs and six inputs. This simplified architecture imposes two clustering constraints: a capacity constraint of two, and an input constraint of six. Given that a source CLB S contains three BLEs, and a sink CLB T contains only one BLE, we want to find a monotone path from S to T to maximize gain while observing the clustering constraints for each CLB on the path. If we assume that x(T) > x(S) and y(T) > y(S), then each move along the monotone path is in the direction of either north or east. For an internal CLB C, there are, at most, two incoming nodes from the west, two incoming nodes from the south, and two outgoing nodes. If a CLB C is still in a legal configuration after moving in a node vi and moving out a node vj, we draw an edge from vi to vj with a proper gain value. At most, we can build a 4 × 2 bipartite graph between C and its neighbors from west and south. For the purpose of illustration, we draw all feasible edges between CLB C and its four neighbors in Figure 12. The gain of an edge in the constrained gain graph is the reduction of the cost function when the source node of the edge is moved from the source CLB to the sink CLB. Finally, we create a pseudosource node s and a pseudosink node t.

Fig. 12. Construction of a constrained gain graph.

Once the constrained gain graph is constructed, the constrained legalization problem is transformed to a longest-path problem for a directed acyclic graph, which can be optimally solved with a complexity of O(n). Our algorithm is very flexible for handling any CLB-level clustering constraints.
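The longest-path step on the constrained gain graph can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the graph is given as a hypothetical dictionary mapping each feasible edge `(u, v)` to its gain, with infeasible edges (those violating a CLB-level constraint) simply omitted, and `"s"` and `"t"` stand for the pseudosource and pseudosink.

```python
from graphlib import TopologicalSorter


def longest_path(edges, source, sink):
    """Longest source-to-sink path in a DAG given as {(u, v): gain}.

    Returns (total_gain, path) or None if the sink is unreachable.
    """
    preds = {}
    for (u, v), gain in edges.items():
        preds.setdefault(v, []).append((u, gain))
        preds.setdefault(u, [])

    # TopologicalSorter expects node -> predecessors and yields
    # predecessors before their successors.
    order = TopologicalSorter(
        {v: [u for u, _ in ps] for v, ps in preds.items()}
    ).static_order()

    best = {source: 0.0}
    back = {}
    for v in order:
        for u, gain in preds[v]:
            if u in best and best[u] + gain > best.get(v, float("-inf")):
                best[v] = best[u] + gain
                back[v] = u

    if sink not in best:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return best[sink], path[::-1]


# Hypothetical constrained gain graph with only feasible edges listed.
example = {
    ("s", "a"): 0, ("s", "b"): 0,
    ("a", "c"): 5, ("b", "c"): 2,
    ("a", "t"): 4, ("c", "t"): 0,
}
result = longest_path(example, "s", "t")
```

Each edge is relaxed exactly once, so the work is linear in the size of the graph, which matches the linear bound stated above.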

4.6 Duplication Graph and Redundancy Removal

Since we perform logic duplication at the end of each annealing iteration, two problems may arise. First, the area increase due to cloning may be substantial. Secondly, some of the duplications committed in the previous iterations may become noncritical. If we do not remove them in a timely manner, they may occupy timing-critical locations and affect future duplication processes. Therefore, before logic duplication takes place at the end of each annealing iteration, we merge duplicated copies to reduce the circuit area and maintain speed.

We introduce a data structure called a duplication graph, which is the original netlist with two modifications. First, to keep track of all the copies of the same node, we introduce the notion of a choice node whose fanin nodes are all logically equivalent. When a duplication graph is initialized, we create a choice node for each node in the netlist. Then, for each choice node c, we introduce a new net e. Assume c has k fanin nodes g1, g2, ..., gk, and that each of the fanin nodes has an output net e1, e2, ..., ek. The source pin of net e is choice node c, and the sink pins of net e are the union of all sink pins of e1, e2, ..., ek. Figure 13 is an illustration of the duplication graph. Under choice node c1, g1′ is a copy of g1. Choice node c1 drives two gates, g2 and g3, which fan out to choice nodes c2 and c3, respectively. Choice node c2 drives primary output po1, and c3 drives po2. Initially, each choice node contains only one fanin. During the logic duplication step, the duplicated copies are added to the duplication graph incrementally.

Fig. 13. Illustration of a duplication graph.

In a duplication graph N = (C, V, E), each node c ∈ C represents a choice node, each gate v ∈ V represents a logic gate, and each directed edge e = (c, v) ∈ E represents a wire connecting the output of choice node c to one input of a logic gate v. Each choice node c is a set in which each gate g ∈ c is a logic gate with equal functionality. For each directed edge e = (c, v) ∈ E, the arrival time arr_t(v) at the sink pin is the minimum of arr_t(g) + delay(g, v) over all g ∈ c.
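The arrival-time rule for an edge leaving a choice node can be written as a one-line minimum. The sketch below is illustrative only; the dictionaries `arr_t` and `delay` and the gate names are hypothetical stand-ins for whatever the timing engine actually stores.

```python
def arrival_through_choice(choice_gates, arr_t, delay, v):
    """Arrival time at sink gate v when it is driven by a choice node.

    choice_gates : logically equivalent gates in the choice node c
    arr_t[g]     : arrival time at the output of gate g
    delay[(g, v)]: interconnect delay from g to v
    The sink sees the earliest signal among all equivalent copies.
    """
    return min(arr_t[g] + delay[(g, v)] for g in choice_gates)


# The copy is placed closer to v, so it determines the arrival time.
arr_t = {"g1": 3.0, "g1_copy": 3.5}
delay = {("g1", "v"): 2.0, ("g1_copy", "v"): 1.0}
t = arrival_through_choice(["g1", "g1_copy"], arr_t, delay, "v")
```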

We formulate a global timing-constrained redundancy removal problem. Under the given timing constraints, slack(e) ≥ 0 for every edge e ∈ E. We want to find a maximum set S, and remove every v ∈ S and every edge e = (c, v) ∈ E from N such that in the new duplication graph N′ = (C, V′, E′), slack(e) ≥ 0 for every edge e ∈ E′ and c ≠ ∅ for every choice node c ∈ C.

We also formulate a local redundancy removal problem under timing constraints. For a choice node c, we assume c has m fanin gates g1 through gm, and n fanout gates v1 through vn. Under the timing constraints, each fanout gate vi has a required arrival time req_t(vi). We want to find a maximum subset S of c and remove every g ∈ S from c such that arr_t(vi) ≤ req_t(vi) for all the fanout gates of c. We build an m by n matrix, and define the value matrix(i, j) at row i, column j as follows: matrix(i, j) = 1 if arr_t(vj) ≤ req_t(vj) when vj is driven by gi; otherwise, matrix(i, j) = 0. To solve the local redundancy removal problem, we need to select a minimum number of rows such that every column contains at least one 1 among the selected rows. This is a unate covering, or minimum set covering, problem, which is NP-complete. Since a minimum set covering problem can be transformed to a local redundancy removal problem, the local removal problem is NP-complete as well. If we limit m to a small constant (e.g., five) during the logic duplication, we can solve the local redundancy removal problem optimally using the reduction techniques together with a branch-and-bound algorithm.
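For small m the covering problem can be solved exactly. The sketch below uses plain exhaustive search over row subsets of increasing size, which is a simpler technique swapped in for the reduction and branch-and-bound approach mentioned above. It returns the same optimum but is practical only because m is bounded by a small constant. The function name and matrix encoding are assumptions made for illustration.

```python
from itertools import combinations


def min_row_cover(matrix):
    """Smallest set of rows such that every column has a 1 in a chosen row.

    matrix[i][j] == 1 means fanout v_j meets its required time when
    driven by copy g_i. Returns the row indices to keep, or None if
    no selection of rows covers every column.
    """
    m = len(matrix)
    n = len(matrix[0]) if m else 0
    # Try subset sizes in increasing order, so the first hit is minimum.
    for k in range(1, m + 1):
        for rows in combinations(range(m), k):
            if all(any(matrix[i][j] for i in rows) for j in range(n)):
                return list(rows)
    return None


# Four copies, three fanouts: copies 0 and 3 together satisfy all fanouts,
# so copies 1 and 2 are redundant and can be removed.
keep = min_row_cover([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
])
```

With m limited to five, at most 31 subsets are examined per choice node.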

We propose a heuristic to solve the global redundancy removal problem by solving the local redundancy removal problem in a reverse topological order. During the traversal of the duplication graph from PO to PI, we optimally perform local redundancy removal for each choice node with multiple fanins.


Fig. 14. The validation flow.

After a local redundancy problem is solved, we perform duplication removal and fanout partitioning together. Then, we propagate the remaining time during the incremental timing update. We can incrementally update the required time during the redundancy removal process.

5. VALIDATION IN A COMMERCIAL FPGA IMPLEMENTATION FLOW

The Quartus University Interface Program (QUIP) kit is designed to enable university or other researchers to plug new CAD tools and ideas into Altera's complete Quartus II CAD flow, from register transfer level (and even above) descriptions of circuits to programming files for real FPGAs. With help from QUIP, we have built the first academic flow that allows for direct comparison with the Quartus II physical implementation tools. To target a Stratix device, we modify the VPR architecture file to describe LAB fitting rules (described in detail in Section 5.2.3) and cell delays for the LUT, FF, and pads. The interconnect delay is provided by an API function from QUIP.

5.1 The Validation Flow

Figure 14 is an overview of our entire validation flow. First, we run the script.algebraic in SIS [Sentovich et al. 1992], followed by Flowmap [Cong and Ding 1994], a depth-optimal mapping algorithm for LUT-Based FPGAs. In order to eliminate long interconnect delays between pads and design logic (especially when a small design is fitted to a relatively large device), we intentionally insert a flip-flop (FF) after each primary input and before each primary output. Since we only measure the clock frequency of the modified netlist, the pad placement is irrelevant.


Next, for the Quartus II flow, the modified netlist is converted to an Altera format (.vqm). This is done by a dumper utility named net2vqm, distributed as part of the QUIP package. However, the original version assumes that each CLB contains only one BLE, while in the Stratix architecture each CLB contains ten BLEs. We performed extensive modifications to the utility to support multiple BLEs in a single CLB. After that, the Quartus II fitter reads in the .vqm files and performs clustering, placement, and routing sequentially. Since Quartus II is a constraint-driven optimization engine, we set a tight clock period constraint of three ns. Finally, we run the Quartus II timer to report the timing result.

For our SPCD flow, we first run T-VPack [Marquardt et al. 1999] to generate an initial clustering solution. Then we perform simultaneous placement with clustering and duplication (SPCD). After that, we convert the new netlist to .vqm format and generate location constraints for all design logic. The new .vqm netlist, together with the timing/location constraints, is then given to Quartus II. The fitter honors our clustering and placement constraints and performs routing only. Finally, we run the Quartus II timer to report the maximum frequency.

5.2 Placement Engine Extension

In order to accurately model the Stratix architecture and generate valid physical constraints, we need to perform several enhancements to our placement engine.

5.2.1 Heterogeneous Resources. VPR only considers simple FPGA architectures with CLBs and pads. In contrast, commercial FPGA architectures such as Stratix and Virtex2 contain memory blocks, DSP blocks, etc. In order to generate valid physical locations for the Stratix device, we consider these resources as well. Since the MCNC benchmark circuits we use do not contain such macros, we simply mark such locations as blockages and do not use them to place design logic.

5.2.2 Delay Modeling. We need to modify VPR's delay model to target the Stratix devices. The delay consists of two parts: a cell delay and an interconnect delay. For the cell delay, we use the value in the library file. Since the propagation delays from each input pin of a four-input LUT are different, we use an average value. For the interconnect delay, we directly call an API function get_point_to_point_delay() provided by the QUIP package.

5.2.3 LAB Fitting Rules. In the Stratix architecture, each LAB contains ten logic elements (LEs). Since the MCNC benchmark circuits we use do not have complex clock schemes, we only need to observe the following subset of LAB fitting rules. For more details, please refer to the Stratix Device Handbook [2006].

(1) For good routability, the number of distinct data inputs (excluding CIN and feedbacks) should not exceed 26;

(2) All .clk and .ena signals on a logic cell are paired to form an LE clock. In any LAB, there can be no more than two distinct LE clock pairs;

(3) A maximum of two distinct signals can be connected to the .aclr ports; and

(4) A maximum of one distinct signal can be connected to the .aload ports.
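The four rules above, together with the ten-LE capacity of a LAB, can be expressed as a simple legality check. The sketch below is only an illustration of how such a check might look. The `LogicElement` record and its field names are hypothetical, the caller is assumed to pass only external data inputs (CIN and feedback signals already excluded), and an LE without a clock is assumed to contribute no LE clock pair.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class LogicElement:
    """Hypothetical view of one LE: only the signals the rules look at."""
    data_inputs: FrozenSet[str]   # external data inputs (no CIN, no feedback)
    clk: Optional[str] = None
    ena: Optional[str] = None
    aclr: Optional[str] = None
    aload: Optional[str] = None


def lab_is_legal(les: List[LogicElement]) -> bool:
    """Check the subset of LAB fitting rules listed above."""
    if len(les) > 10:                       # LAB capacity: ten LEs
        return False
    data_inputs = set()
    for le in les:
        data_inputs |= le.data_inputs
    clock_pairs = {(le.clk, le.ena) for le in les if le.clk is not None}
    aclr = {le.aclr for le in les if le.aclr is not None}
    aload = {le.aload for le in les if le.aload is not None}
    return (
        len(data_inputs) <= 26              # rule (1)
        and len(clock_pairs) <= 2           # rule (2)
        and len(aclr) <= 2                  # rule (3)
        and len(aload) <= 1                 # rule (4)
    )


a = LogicElement(frozenset({"x", "y"}), clk="clk0", ena="en0")
b = LogicElement(frozenset({"y", "z"}), clk="clk0", ena="en1")
c = LogicElement(frozenset({"w"}), clk="clk1", ena="en0")
ok = lab_is_legal([a, b])          # two LE clock pairs: legal
bad = lab_is_legal([a, b, c])      # three LE clock pairs: violates rule (2)
```

A check of this kind is what decides whether an edge may be drawn in the constrained gain graph of Section 4.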

6. RUNTIME/QUALITY TRADEOFF AND COMPLEXITY ANALYSIS

The runtime of our SPCD algorithm consists of two parts: a placement engine and a duplication/legalization engine. First, we analyze the complexity of the placement engine. For a given architecture, each CLB contains N BLEs, I inputs, and N outputs. In the input-clustered netlist, the number of CLBs is n, and the number of BLEs is m. Therefore, $n \le m \le N \cdot n$, and $O(m) = O(N \cdot n) = N \cdot O(n)$. In our SPCD algorithm, we perform both block level moves and fragment level moves. At each temperature, the number of block level moves performed is $n^{4/3}$, and the number of fragment level moves performed is $(\alpha \cdot m)^{1.33} \approx (\alpha \cdot N \cdot n)^{1.33}$. We can choose the value of α between zero and one and achieve the runtime/quality tradeoff. As a result, the complexity of the block level move is $O(n^{4/3})$, and the complexity of the fragment level move is $O((\alpha \cdot N \cdot n)^{4/3})$. In reality, the value of N is not very big, and we can always choose α to make $O((\alpha \cdot N \cdot n)^{4/3}) = O(n^{4/3})$. Hence, the overall complexity is $O(n^{4/3})$. As a result, our algorithm's complexity is similar to VPR, and hence quite scalable.
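To make the role of α concrete, the move counts per temperature can be computed directly from the expressions above. The helper below is a hypothetical illustration (the function name and rounding are assumptions), and it uses the exponent 4/3 throughout, which the text also writes as 1.33.

```python
def moves_per_temperature(n, big_n, alpha):
    """Approximate move counts per annealing temperature.

    n      : number of CLBs in the clustered netlist
    big_n  : number of BLEs per CLB (N in the text)
    alpha  : runtime/quality knob between zero and one
    Returns (block_moves, fragment_moves), rounded to integers.
    """
    block_moves = n ** (4 / 3)
    # m is approximated by its upper bound N * n, as in the text.
    fragment_moves = (alpha * big_n * n) ** (4 / 3)
    return round(block_moves), round(fragment_moves)


# With N = 4 and alpha = 0.25, alpha * N = 1, so both counts coincide.
block, fragment = moves_per_temperature(n=1000, big_n=4, alpha=0.25)
```

Choosing α near 1/N keeps the fragment level work at the same order as the block level work.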

Then, we analyze the complexity of the duplication/legalization engine. For each source node, the complexity of the monotone region computation is O(K). The maximum size of the monotone region is the size of the device, which is O(n). For each location in a monotone region, we need to recalculate the edge delay for all input pins of a node s and the sink pin t, and that is an O(K) operation. Since the size of the monotone region is worst-case O(n), the complexity of finding the optimal destination location is O(n). For legalization, we assume the horizontal and vertical distances between the source location and the destination location are dx and dy, respectively. The complexity for constructing the gain graph is O(dx·dy·N), and the complexity for the maximum gain path algorithm is O(dx·dy). Since dx is bounded by the width of the device and dy is bounded by the height of the device, the legalization algorithm has a worst-case complexity of O(n). Also, we perform static timing analysis during the duplication, which is an O(n) operation as well. Since we limit the number of duplications to a small number (logarithmic in the circuit size) and the number of annealing iterations to a constant (∼100), the overall duplication/legalization has a complexity of O(n log n). As a result, the overall SPCD algorithm has a runtime complexity of $O(n^{4/3})$.

7. EXPERIMENTAL RESULTS

We implemented our SPCD algorithm under the VPR framework. For purposes of comparison, we downloaded the VPR 4.3 source code, architecture file, and the complete set of 20 MCNC benchmark circuits from the FPGA Place-and-Route Challenge [2006]. We modified the architecture file to specify the number of BLEs contained in a single CLB. We compared all of the 20 MCNC circuits with the commonly used academic FPGA design flow in Figure 15. We first ran the script.algebraic in SIS [Sentovich et al. 1992], followed by Flowmap [Cong and Ding 1994]. Then we ran T-VPack [Marquardt et al. 1999] to generate an initial clustering solution. This initial clustering was then given to both timing-driven VPR and SPCD to perform placement. Except in Section 7.1, we always compare our results with the timing-driven VPR. Our SPCD algorithm has several different modes: SPC, SPD, and SPCD. SPC performs clustering during placement; SPD performs logic duplication during placement; and SPCD performs both clustering and duplication during placement. Furthermore, SPD has three options: SPD-0 has zero duplication, and it is essentially the same as VPR; SPD-1 performs logic duplication only once after the full placement is obtained; SPD-m performs simultaneous logic duplication and placement optimization. SPCD with path counting utilizes a path counting-based net weighting scheme. In Sections 7.1 to 7.4, we conduct experiments against VPR on the default architecture described in Section 3. Finally, in Section 7.5 we also compare SPCD with the Quartus II 4.0 implementation tool on the Stratix architecture.

Fig. 15. The experimental flow.

7.1 Wirelength Comparison

Since duplication does not improve the wirelength, we compare our algorithm SPC (no duplication) with VPR in Table III using the total weighted half-bounding-box wirelengths as the only optimization objective. The weights for nets of different sizes can be found in Chen and Cong [2004]. When we combine clustering with placement, we can outperform VPR by 27%, on average.
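For reference, the total weighted half-bounding-box wirelength can be computed as in the sketch below. It is illustrative only: the data layout is an assumption, and the net-size weights are passed in as a caller-supplied function because the actual weights are those given in Chen and Cong [2004] and are not reproduced here.

```python
def weighted_hpwl(nets, positions, weight_for_size):
    """Total weighted half-bounding-box wirelength.

    nets            : list of nets, each a list of block names
    positions       : block name -> (x, y) placement location
    weight_for_size : function mapping the pin count of a net to its
                      weight (placeholder for the published weights)
    """
    total = 0.0
    for pins in nets:
        xs = [positions[p][0] for p in pins]
        ys = [positions[p][1] for p in pins]
        half_perimeter = (max(xs) - min(xs)) + (max(ys) - min(ys))
        total += weight_for_size(len(pins)) * half_perimeter
    return total


positions = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
nets = [["a", "b"], ["a", "b", "c"]]
# Unit weights for every net size, purely for demonstration.
wl = weighted_hpwl(nets, positions, lambda size: 1.0)
```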

In Figure 16 we illustrate the impact of CLB size (N ) on the wirelength improvement obtained from SPC. When we change the CLB size from two to ten, the wirelength gap between SPC and T-Vpack+VPR increases monotonically from 15% to 36%. The result shows that as the size of CLB increases, it is more and more difficult to generate a good clustering solution with small wirelength without physical information. Since SPC explores different clustering solutions during the placement stage, it generates clustering and placement solutions with much shorter wirelength.


Table III. Wirelength Improvement of SPC (N = 4)

Circuit   VPR       SPC       %
ex5p      112.47    92.7707   17.52%
apex4     113.639   94.2218   17.09%
misex3    123.616   99.2435   19.72%
tseng     94.9456   61.6671   35.05%
alu4      123.03    95.3293   22.52%
dsip      195.544   94.8918   51.47%
seq       173.641   135.756   21.82%
diffeq    132.271   88.7259   32.92%
apex2     190.324   151.032   20.64%
s298      166.899   140.351   15.91%
des       278.122   210.536   24.30%
bigkey    171.986   155.525   9.57%
spla      426.227   324.999   23.75%
elliptic  359.011   228.558   36.34%
ex1010    463.618   341.405   26.36%
pdc       704.286   545.51    22.54%
frisc     584.732   432.13    26.10%
s38584.1  576.457   321.653   44.20%
s38417    696.701   424.874   39.02%
clma      1701.02   1169.64   31.24%
Average                       26.90%

Fig. 16. Impact of CLB size on wirelength improvement of SPC.

7.2 Timing Comparison

7.2.1 Impact of Clustering. In Table IV we compare SPC with both VPR and path [Kong 2002] in timing optimization. If we use the path counting-based net weighting scheme only in SPC, we can outperform VPR by 12% (column 4); if we perform clustering optimization only in SPC, we can outperform VPR by 16% (column 6); if we integrate the path counting-based net weighting scheme with the clustering optimization, SPC significantly outperforms the original VPR result by 25%. However, the wirelength reduction obtained by SPC in its timing mode is reduced to about 15% from 27% when compared with VPR.


Table IV. Timing Improvement of SPC (N = 4)

Circuit   VPR       Path      %         SPC       %         SPC+Path  %
ex5p      50.45     43.14     16.94%    42.71     18.14%    40.75     23.80%
apex4     47.44     41.96     13.07%    46.96     1.04%     41.44     14.49%
misex3    51.04     44.81     13.91%    41.00     24.49%    38.53     32.47%
tseng     38.85     36.15     7.48%     32.77     18.55%    35.11     10.65%
alu4      53.16     44.12     20.50%    46.85     13.46%    42.50     25.07%
dsip      38.32     38.15     0.44%     34.30     11.73%    40.12     −4.49%
seq       51.26     46.62     9.97%     44.46     15.29%    42.90     19.51%
diffeq    47.73     43.52     9.69%     37.49     27.30%    41.15     16.01%
apex2     56.36     54.94     2.60%     52.16     8.07%     47.23     19.34%
s298      87.36     84.63     3.23%     88.40     −1.17%    80.98     7.88%
des       83.88     75.42     11.20%    76.85     9.14%     65.44     28.18%
bigkey    41.37     43.29     −4.44%    41.56     −0.47%    41.35     0.03%
spla      72.47     63.86     13.48%    66.51     8.95%     58.27     24.35%
elliptic  71.07     54.47     30.48%    76.94     −7.63%    48.49     46.58%
ex1010    97.88     80.64     21.38%    85.16     14.93%    74.82     30.81%
pdc       113.15    77.93     45.18%    79.43     42.45%    67.60     67.38%
frisc     81.39     93.16     −12.64%   77.27     5.33%     75.73     7.47%
s38584    64.37     61.63     4.45%     45.44     41.67%    47.78     34.71%
s38417    76.63     71.94     6.52%     48.45     58.17%    49.89     53.61%
clma      137.20    114.48    19.85%    125.03    9.73%     102.0     34.52%
Average                       11.66%              15.96%              24.62%

Fig. 17. Impact of CLB size on the timing improvement of SPC.

In Figure 17 we illustrate the impact of CLB size on the timing improvement obtained from SPC with path counting. When the CLB size is two, the timing gap between SPC and T-Vpack+VPR is 17%. When the CLB size increases from four to ten, the gap remains in a narrow range between 22 and 25%. The result shows that even when the CLB size is relatively small (two or four), it is difficult to generate a good clustering solution with small delay without physical information. Since SPC explores different clustering solutions during the placement stage, it generates clustering and placement solutions with much better delay.

In Table V we illustrate the performance of SPC with path counting on a different routing architecture. For the results in Figure 17, we use the


Table V. Impact of Routing Architecture on Timing Improvement of SPC (N = 4)

Circuit   VPR       SPC       Improvement
ex5p      33.68     31.60     6.58%
apex4     31.74     31.29     1.45%
misex3    31.05     28.05     10.67%
tseng     44.56     37.12     20.03%
alu4      34.32     32.29     6.30%
dsip      16.32     17.16     −4.90%
seq       31.67     29.98     5.64%
diffeq    45.65     41.63     9.64%
apex2     38.45     34.84     10.36%
s298      61.39     61.92     −0.85%
des       32.38     27.26     18.81%
bigkey    21.64     23.15     −6.51%
spla      45.03     40.25     11.88%
elliptic  44.47     42.32     5.09%
ex1010    53.74     47.96     12.05%
pdc       58.94     44.82     31.51%
frisc     70.03     67.71     3.43%
s38584.1  31.53     35.53     −11.25%
s38417    47.84     41.37     15.66%
clma      79.07     66.25     19.35%
Average                       8.25%

Table VI. Timing Improvement of SPD (N = 4)

                                                                      SPD-m w/
Circuit   VPR       SPD-0     SPD-1     %         SPD-m     %         path counting  %
ex5p      50.45     51.66     47.76     8.2%      46.80     7.80%     42.31          19.24%
apex4     47.44     48.30     46.40     4.1%      43.86     8.16%     40.2           18.01%
misex3    51.04     48.94     45.92     6.6%      42.14     21.12%    37.85          34.85%
tseng     38.85     38.55     34.80     10.8%     28.71     35.32%    29.29          32.64%
alu4      53.16     54.85     52.45     4.6%      43.56     22.04%    43.44          22.38%
dsip      38.32     38.92     38.92     0.0%      41.37     −7.37%    33.35          14.90%
seq       51.26     53.29     49.44     7.8%      43.56     17.68%    42.4           20.90%
diffeq    47.73     45.26     40.93     10.6%     36.53     30.66%    38.57          23.75%
apex2     56.36     58.75     55.05     6.7%      53.58     5.19%     45.32          24.36%
s298      87.36     82.70     75.26     9.9%      80.56     8.44%     88.4           −1.18%
des       83.88     81.71     73.60     11.0%     71.48     17.35%    63.29          32.53%
bigkey    41.37     41.51     40.74     1.9%      38.18     8.36%     36.46          13.47%
spla      72.47     72.09     66.45     8.5%      63.72     13.73%    65.64          10.41%
elliptic  71.07     64.48     62.32     3.5%      59.89     18.67%    52.27          35.97%
ex1010    97.88     95.24     87.06     9.4%      87.82     11.46%    71.59          36.72%
pdc       113.15    95.89     87.51     9.6%      82.87     36.54%    69.68          62.39%
frisc     81.39     83.81     76.02     10.3%     75.51     7.79%     79.33          2.60%
s38584.1  64.37     53.98     52.16     3.5%      46.72     37.78%    48.82          31.85%
s38417    76.63     76.84     70.18     9.5%      53.29     43.80%    55.2           38.82%
clma      137.20    136.56    123.93    10.2%     106.92    28.32%    99.39          38.04%
Average                                 7.32%               18.64%                   25.63%


Fig. 18. Impact of CLB size on timing improvement of SPD-1.

default routing architecture obtained from the FPGA Place-and-Route Challenge [2006], in which routing segments have a length of one and all routing switches are tristate buffers. Since interconnect delays are very sensitive to distance in this default architecture, the placement algorithms are vital to design performance. In Table V we try a different routing architecture; here, routing segments have a length of four and are connected to both tristate buffers and pass transistors. Interconnect delays in this architecture are much less sensitive to distance compared to the default routing architecture. The timing improvement obtained from SPC on this architecture is only 8%. For a realistic architecture with routing segments of multiple lengths (1, 4, etc.), the delay improvement should fall somewhere between 8% and 25% (please refer to the delay comparison using Stratix architecture shown in Table XII).

7.2.2 Impact of Logic Duplication. In Table VI, we show the impact of logic duplication on timing using the default architecture. Column 3 is the result of SPD-0, which is our implementation of VPR without any logic duplication. The result of SPD-0 is, in general, similar to that of VPR. In column 4 we perform duplication/legalization only once after full placement (SPD-1), and we achieve, on average, around a 7% timing improvement. In column 6 we perform simultaneous logic duplication and placement optimization (SPD-m), and we outperform VPR by 19%. In column 8 we integrate the path counting-based net weighting scheme with the duplication optimization, and SPD-m with path counting significantly outperforms the original VPR result by 26%.

In Figure 18, we illustrate the impact of CLB size on the performance of SPD-1. When the CLB size is one, the timing improvement obtained from duplication is 5%. When the CLB size increases from two to ten, the timing improvement remains in a narrow range between 7 and 9%. The result shows that when the CLB size is greater than one, there is more room for duplication since the delay between BLEs within the same CLB is normally smaller than the delay between different CLBs.

In Figure 19, we illustrate the impact of CLB size on the performance of SPD-m with path counting. When the CLB size is one, the performance gap


Fig. 19. Impact of CLB size on timing improvement of SPD-m.

Table VII. Timing Comparisons between SPD, SPC and SPCD

Circuit   SPD-m       SPD-m+Path  SPC         SPC+Path    SPD-m+SPC+Path
ex5p      7.80%       19.24%      18.14%      23.80%      21.76%
apex4     8.16%       18.01%      1.04%       14.49%      30.23%
misex3    21.12%      34.85%      24.49%      32.47%      36.02%
tseng     35.32%      32.64%      18.55%      10.65%      21.99%
alu4      22.04%      22.38%      13.46%      25.07%      23.34%
dsip      −7.37%      14.90%      11.73%      −4.49%      −5.64%
seq       17.68%      20.90%      15.29%      19.51%      27.16%
diffeq    30.66%      23.75%      27.30%      16.01%      39.33%
apex2     5.19%       24.36%      8.07%       19.34%      29.39%
s298      8.44%       −1.18%      −1.17%      7.88%       16.08%
des       17.35%      32.53%      9.14%       28.18%      37.07%
bigkey    8.36%       13.47%      −0.47%      0.03%       31.80%
spla      13.73%      10.41%      8.95%       24.35%      37.20%
elliptic  18.67%      35.97%      −7.63%      46.58%      19.89%
ex1010    11.46%      36.72%      14.93%      30.81%      49.51%
pdc       36.54%      62.39%      42.45%      67.38%      62.86%
frisc     7.79%       2.60%       5.33%       7.47%       16.30%
s38584    37.78%      31.85%      41.67%      34.71%      30.41%
s38417    43.80%      38.82%      58.17%      53.61%      48.21%
clma      28.32%      38.04%      9.73%       34.52%      47.18%
Average   18.64%      25.63%      15.96%      24.62%      31.01%

between SPD-m and T-VPack+VPR is 18%. When the CLB size increases from two to ten, the timing gap between SPD-m and T-VPack+VPR gradually increases from 21 to 27%. The result shows that even when the CLB size is relatively small (one or two), integrating duplication with placement has a great impact on circuit performance.

7.2.3 Comparison of SPC, SPD, and SPCD. As shown in Table VII, without the path counting-based net weighting scheme, SPD-m outperforms SPC by a few percentage points; with path counting, both SPD-m and SPC achieve a similar improvement of 25–26%; when all three techniques are combined, SPCD significantly outperforms T-VPack+VPR by 31%.


Table VIII. Effect of α on Timing (CLB = 4)

          α = 0.25                α = 0.50                α = 1.0
          Timing      Runtime     Timing      Runtime     Timing      Runtime
Circuit   Improvement Ratio       Improvement Ratio       Improvement Ratio
des       24.47%      26.15%      28.29%      38.13%      32.64%      70.28%
bigkey    −10.20%     30.42%      16.00%      42.59%      7.18%       72.58%
spla      27.09%      41.17%      34.99%      49.39%      28.57%      76.59%
elliptic  51.06%      42.89%      48.61%      51.12%      49.63%      73.86%
ex1010    29.69%      36.08%      31.24%      41.53%      31.73%      64.08%
pdc       58.75%      32.75%      69.24%      39.72%      86.41%      58.41%
frisc     0.66%       33.61%      −0.72%      40.23%      7.33%       60.61%
s38584.1  43.58%      26.62%      47.86%      32.53%      34.81%      47.41%
s38417    27.05%      32.01%      53.17%      37.47%      60.67%      59.77%
clma      27.80%      25.82%      38.38%      31.75%      41.08%      48.29%
Average   21.59%      32.75%      31.18%      40.45%      30.83%      63.19%

Table IX. Effect of α on Timing (CLB = 10)

          α = 0.25                α = 0.50                α = 1.0
          Timing      Runtime     Timing      Runtime     Timing      Runtime
Circuit   Improvement Ratio       Improvement Ratio       Improvement Ratio
des       16.59%      24.53%      19.22%      36.35%      20.88%      71.59%
bigkey    −4.34%      36.24%      −9.78%      54.49%      2.95%       98.44%
spla      33.13%      61.99%      43.21%      74.56%      40.32%      121.37%
elliptic  50.30%      45.57%      54.03%      47.84%      52.97%      72.16%
ex1010    38.80%      54.47%      26.00%      61.75%      32.12%      98.68%
pdc       49.99%      50.79%      52.58%      60.38%      54.95%      95.26%
frisc     −2.29%      46.78%      10.42%      57.29%      14.86%      88.07%
s38584.1  30.30%      31.73%      39.66%      33.08%      39.01%      52.39%
s38417    22.18%      43.91%      41.60%      50.76%      40.63%      81.65%
clma      54.89%      28.85%      71.51%      36.69%      83.96%      58.08%
Average   20.18%      42.49%      27.05%      51.32%      31.25%      83.77%

Furthermore, we observe that the timing improvement of path counting is orthogonal to that of clustering and duplication, but the timing improvements of clustering and duplication overlap with each other. When we perform logic duplication, we both duplicate and relocate cells to new CLBs, thus changing the clustering structure dramatically. Therefore, clustering and duplication share a large portion of their solution space.

7.3 Runtime Comparison

7.3.1 Runtime/Quality Tradeoff of SPC. In the previous experiments in Sections 7.1 and 7.2.1 we performed $m^{1.33} \approx (N \cdot n)^{1.33}$ fragment moves and zero block moves. In this section, we fix the number of block moves to $n^{1.33}$, and set the number of fragment moves to $(\alpha \cdot m)^{1.33} \approx (\alpha \cdot N \cdot n)^{1.33}$, where α is between zero and one. In Table VIII we show the impact of α on the runtime/quality tradeoff of SPC. It is no surprise that when α increases, that is, the number of fragment moves increases, the timing improvement increases from 22% to 31%. Our runtime is generally shorter than VPR because the number of block moves we perform is only 10% of VPR's. If we reduce the number of block moves VPR performs to the same as SPC, it yields about 5% worse results (both in timing and


Table X. Area and Runtime Comparison of SPD

          SPD-0               SPD-1                                   SPD-m
Circuit   Area      Runtime   Area      %         Runtime   %         Area      %         Runtime   %
ex5p      1274      65.344    1274      0.00%     65.31     −0.05%    1311      2.90%     62.81     −3.87%
apex4     1319      65.234    1320      0.08%     65.63     0.60%     1337      1.36%     72.30     10.83%
misex3    1529      86.516    1535      0.39%     86.67     0.18%     1534      0.33%     75.53     −12.70%
tseng     1473      87.516    1476      0.20%     87.00     −0.59%    1481      0.54%     95.08     8.64%
alu4      1630      81.953    1630      0.00%     81.89     −0.08%    1631      0.06%     79.78     −2.65%
dsip      2045      115.828   2045      0.00%     116.75    0.80%     2045      0.00%     116.86    0.89%
seq       2029      134.922   2030      0.05%     137.89    2.20%     2033      0.20%     127.06    −5.82%
diffeq    2036      130.391   2036      0.00%     131.02    0.48%     2039      0.15%     123.03    −5.64%
apex2     2159      149.281   2160      0.05%     152.36    2.06%     2165      0.28%     156.48    4.83%
s298      2558      146.406   2558      0.00%     148.33    1.31%     2559      0.04%     165.83    13.27%
des       2673      203.5     2673      0.00%     203.86    0.18%     2673      0.00%     206.41    1.43%
bigkey    3361      239.516   3361      0.00%     241.52    0.83%     3361      0.00%     242.44    1.22%
spla      3999      371.922   4004      0.13%     373.81    0.51%     4036      0.93%     385.83    3.74%
elliptic  4430      473.188   4476      1.04%     475.92    0.58%     4448      0.41%     484.30    2.35%
ex1010    4740      444.25    4740      0.00%     445.86    0.36%     4743      0.06%     438.41    −1.32%
pdc       5672      664.532   5674      0.04%     666.28    0.26%     5678      0.11%     638.47    −3.92%
frisc     6061      617.265   6061      0.00%     620.28    0.49%     6074      0.21%     665.47    7.81%
s38584.1  7375      819.859   7380      0.07%     822.74    0.35%     7375      0.00%     831.14    1.38%
s38417    8589      993.36    8591      0.02%     996.30    0.30%     8604      0.17%     1105.92   11.33%
clma      13673     2214.188  13674     0.01%     2208.86   −0.24%    13688     0.11%     2415.09   9.07%
Average                                 0.10%               0.53%               0.39%               2.04%

Table XI. Routed Delay and Track Count Comparison

          VPR                 SPC                                     SPD-m
          Routed              Routed                                  Routed
Circuit   Delay     #Tracks   Delay     %         #Tracks   %         Delay     %         #Tracks   %
ex5p      52.38     646       45.47     15.20%    627       3.03%     46.15     13.50%    798       −19.05%
apex4     55.93     627       46.32     20.75%    665       −5.71%    51.72     8.14%     703       −10.81%
misex3    56.56     588       40.92     38.22%    588       0.00%     44.50     27.10%    714       −17.65%
tseng     41.08     483       36.35     13.01%    437       10.53%    33.92     21.11%    506       −4.55%
alu4      55.16     594       47.47     16.20%    506       17.39%    47.12     17.06%    616       −3.57%
dsip      38.80     935       35.73     8.59%     660       41.67%    35.06     10.67%    935       0.00%
seq       58.13     744       49.12     18.34%    768       −3.13%    54.51     6.64%     792       −6.06%
diffeq    50.41     506       39.57     27.39%    506       0.00%     40.67     23.95%    552       −8.33%
apex2     58.00     775       48.03     20.76%    725       6.90%     52.54     10.39%    875       −11.43%
s298      103.69    648       89.43     15.95%    621       4.35%     82.76     25.29%    918       −29.41%
des       88.32     960       69.26     27.52%    832       15.38%    70.76     24.82%    1024      −6.25%
bigkey    42.60     495       48.40     −11.98%   550       −10.00%   39.59     7.60%     495       0.00%
spla      78.65     1452      67.68     16.21%    1287      12.82%    64.12     22.66%    1683      −13.73%
elliptic  75.16     1156      62.14     20.95%    1054      9.68%     71.45     5.19%     1326      −12.82%
ex1010    102.88    1188      81.58     26.11%    1008      17.86%    78.95     30.31%    1512      −21.43%
pdc       125.46    2028      93.18     34.64%    1755      15.56%    89.84     39.65%    2301      −11.86%
frisc     87.64     1560      127.02    −31.00%   1600      −2.50%    85.90     2.03%     1760      −11.36%
s38584.1  66.41     1276      47.02     41.24%    924       38.10%    51.21     29.68%    1364      −6.45%
s38417    81.54     1410      54.15     50.58%    1128      25.00%    64.91     25.62%    1504      −6.25%
clma      144.02    2760      124.14    16.01%    2040      35.29%    125.73    14.55%    3120      −11.54%
Average                                 19.23%              11.61%              18.30%              −10.63%


Table XII. Placement Estimated Delay Comparison on Stratix

Circuit   VPR       Path      %         SPC       %         SPD-m     %         SPCD      %
ex5p      8.26      8.20      0.82%     7.91      4.44%     7.66      7.86%     7.34      12.54%
apex4     8.04      7.61      5.69%     7.81      2.94%     7.27      10.65%    7.48      7.55%
misex3    7.64      7.54      1.37%     7.15      6.84%     6.90      10.73%    6.61      15.53%
tseng     10.97     11.73     −6.41%    8.90      23.33%    10.59     3.59%     8.68      26.44%
alu4      8.45      8.26      2.26%     7.38      14.46%    7.23      16.78%    7.31      15.58%
dsip      4.92      4.53      8.64%     3.83      28.57%    4.25      15.66%    3.83      28.57%
seq       8.12      7.76      4.71%     7.34      10.61%    7.39      9.85%     7.34      10.61%
diffeq    10.58     10.53     0.41%     10.06     5.10%     8.74      21.01%    9.56      10.59%
apex2     9.56      9.23      3.56%     9.22      3.64%     7.82      22.25%    8.40      13.78%
s298      19.54     18.74     4.23%     17.75     10.08%    14.62     33.61%    16.12     21.17%
des       7.58      7.20      5.27%     6.91      9.67%     7.29      3.87%     6.45      17.42%
bigkey    5.46      5.26      3.69%     4.57      19.51%    4.78      14.12%    4.31      26.67%
spla      10.94     10.11     8.20%     9.85      10.99%    9.27      17.97%    9.22      18.67%
elliptic  14.85     12.53     18.59%    11.27     31.77%    11.49     29.33%    10.60     40.17%
ex1010    12.14     10.64     14.08%    10.95     10.83%    9.43      28.75%    10.58     14.74%
pdc       12.84     11.56     11.03%    10.70     19.94%    10.06     27.56%    9.89      29.87%
frisc     20.79     19.45     6.89%     16.67     24.74%    15.90     30.81%    15.85     31.16%
s38584    9.27      8.37      10.84%    8.33      11.40%    7.47      24.08%    7.89      17.59%
s38417    13.84     13.62     1.63%     10.88     27.17%    11.27     22.77%    10.29     34.52%
clma      19.73     17.98     9.72%     15.19     29.87%    14.73     33.93%    14.62     34.98%
Average                       5.76%               15.29%              19.26%              21.41%

Table XIII. Routed Delay Comparison on Stratix

Circuit   Quartus   VPR       %         Path      %         SPC       %         SPD-m     %         SPCD      %
ex5p      7.81      8.32      −6.45%    8.12      −3.86%    8.22      −5.14%    8.57      −9.65%    7.62      2.53%
apex4     7.56      8.19      −8.40%    7.81      −3.38%    7.97      −5.53%    8.06      −6.70%    7.47      1.08%
misex3    7.32      8.03      −9.60%    7.54      −2.99%    7.47      −1.99%    7.52      −2.71%    7.57      −3.32%
tseng     9.17      10.38     −13.17%   10.49     −14.39%   9.01      1.69%     10.11     −10.23%   8.96      2.34%
alu4      7.51      8.68      −15.51%   8.35      −11.11%   7.72      −2.76%    7.80      −3.83%    7.60      −1.09%
dsip      4.44      5.10      −14.76%   4.73      −6.48%    4.37      1.71%     4.41      0.63%     4.12      7.84%
seq       7.19      8.32      −15.68%   8.27      −14.98%   7.54      −4.83%    7.91      −9.97%    7.47      −3.75%
diffeq    10.51     10.80     −2.73%    10.78     −2.57%    10.57     −0.55%    10.01     4.76%     9.70      8.34%
apex2     8.81      9.66      −9.68%    9.87      −11.97%   9.09      −3.16%    8.82      −0.10%    8.45      4.29%
s298      17.29     20.64     −19.36%   19.33     −11.77%   16.75     3.12%     16.70     3.44%     16.08     7.52%
des       6.07      7.32      −20.56%   6.80      −12.08%   6.63      −9.30%    7.20      −18.63%   6.31      −3.74%
bigkey    5.47      6.04      −10.46%   5.51      −0.77%    4.66      14.79%    5.08      7.20%     4.72      15.91%
spla      10.44     11.49     −10.03%   10.62     −1.70%    10.03     3.99%     10.40     0.44%     9.60      8.80%
elliptic  11.36     15.13     −33.18%   12.39     −9.06%    11.20     1.42%     11.41     −0.46%    10.58     7.41%
ex1010    10.97     12.18     −11.05%   11.28     −2.80%    11.06     −0.76%    10.67     2.71%     10.91     0.54%
pdc       12.13     13.25     −9.18%    12.73     −4.89%    11.40     6.07%     12.13     0.05%     11.09     9.46%
frisc     15.67     19.58     −24.92%   18.97     −21.04%   16.04     −2.33%    15.90     −1.46%    15.38     1.90%
s38584    8.25      9.62      −16.61%   9.24      −11.95%   8.51      −3.11%    8.12      1.65%     8.04      2.64%
s38417    11.49     14.49     −26.06%   14.29     −24.39%   11.20     2.51%     11.95     −4.00%    11.41     0.75%
clma      16.01     20.35     −27.06%   17.83     −11.34%   15.59     2.65%     16.19     −1.10%    14.85     7.80%
Avg                           −15.22%             −9.17%              −0.07%              −2.40%              3.86%

Fig. 20. Critical path of Quartus II result (bigkey). (From Stratix Device Family Data Sheet, v3.2, July 2005, © ALTERA 2005.)

wirelength) and consumes 15% of standard VPR’s runtime. When α = 0.25, SPC uses 33% of standard VPR’s runtime. SPC’s runtime increases up to 63% as α increases to one. Table IX shows a similar trend when the size of the CLB is ten.

7.3.2 Area and Runtime Increase of SPD. In Table X we show the area (in terms of the number of CLBs) and runtime overhead of our SPD algorithm. Regardless of the number of iterations of logic duplication we performed, the area increase by both SPD-1 and SPD-m was very small, normally less than 1%. When we performed only postplacement logic duplication in SPD-1, the runtime increase was negligible; even when we performed multiple iterations of logic duplication in SPD-m, the average runtime increase remained very small at 2%. We expected more runtime increase for SPD-m, but this was not the case. Our analysis of the annealing process reveals that logic duplication helps the placement reach a local minimum faster, so SPD-m uses a smaller number of annealing iterations than does SPD-0, in general.

7.4 Routed Results

In Table XI we show the comparison of the routed delay and track count between T-VPack+VPR, SPC, and SPD-m using the default architecture. SPC outperforms T-VPack+VPR by 19% on average in routed delay, and the reduction in routed tracks is 12% on average. This is consistent with the estimated delay/wirelength reduction after placement. SPD-m outperforms T-VPack+VPR by 18% on average in routed delay, with an average increase in routed tracks of 11%. The increase in the number of routed tracks is due to the increase in the number of nodes and nets introduced by logic duplication. Note that the routed delay reduction is smaller than the estimated placement delay reduction, which is probably due to the intrinsic inaccuracy of the delay model used by placement.

Fig. 21. Critical path of SPCD result (bigkey). (From Stratix Device Family Data Sheet, v3.2, July 2005, © ALTERA 2005.)

7.5 Comparison with Quartus II 4.0

7.5.1 Placement Estimated Delay Comparison. Table XII shows the comparison of placement estimated delays of different algorithms under the Stratix delay model, which is based on a much more complex segmented routing architecture. Note that in this section, the results labeled “VPR” are actually from SPD-0 rather than the original VPR, because the original VPR cannot model the Stratix device properly. As we mentioned in Section 7.2.2, our implementation SPD-0 generates results that are similar to those produced by the original VPR.

In Table XII, if we use the path counting-based net weighting scheme, we can outperform VPR by 6% (column 4); if we perform simultaneous placement with clustering, we can outperform VPR by 15% (column 6); if we perform simultaneous placement with duplication, we can outperform VPR by 19% (column 8); if we perform simultaneous placement with both clustering and duplication, we can outperform VPR by 21% (column 10). Table XII shows that our unified synthesis and placement tool SPCD significantly outperforms VPR on a widely used commercial architecture, as well as on simplified academic architectures (Table VII).
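The averages quoted above can be reproduced from the per-circuit percentages in Table XII. The following is a minimal sketch, assuming that the "Average" row is the plain arithmetic mean of the 20 per-circuit percentages; the source does not state the averaging rule, and the names `PATH_PCT`, `SPCD_PCT`, and `column_average` are introduced here for illustration only.

```python
from statistics import mean

# Per-circuit improvement percentages transcribed from Table XII
# (Path column and SPCD column, in the table's row order).
PATH_PCT = [0.82, 5.69, 1.37, -6.41, 2.26, 8.64, 4.71, 0.41, 3.56, 4.23,
            5.27, 3.69, 8.20, 18.59, 14.08, 11.03, 6.89, 10.84, 1.63, 9.72]
SPCD_PCT = [12.54, 7.55, 15.53, 26.44, 15.58, 28.57, 10.61, 10.59, 13.78,
            21.17, 17.42, 26.67, 18.67, 40.17, 14.74, 29.87, 31.16, 17.59,
            34.52, 34.98]


def column_average(percentages):
    """Arithmetic mean of per-circuit percentages (assumed averaging rule)."""
    return mean(percentages)


if __name__ == "__main__":
    # Compare against the "Average" row of Table XII (5.76% and 21.41%).
    print(f"Path average: {column_average(PATH_PCT):.2f}%")
    print(f"SPCD average: {column_average(SPCD_PCT):.2f}%")
```

Under this assumption both computed means agree with the tabulated averages to within rounding, which suggests that each circuit is weighted equally regardless of its size.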

7.5.2 Routed Delay Comparison. Table XIII shows the comparison of the routed delays of different algorithms, as reported by the Quartus timer. The timing results of VPR lose to Quartus II by 15% (column 4). If we use the path counting-based net weighting scheme, we lose to Quartus II by 9% (column 6); if we perform simultaneous placement with clustering, SPC achieves delay results similar to Quartus II (column 8); if we perform simultaneous placement with duplication, SPD-m loses to Quartus II by 2% (column 10); if we perform simultaneous placement with both clustering and duplication, SPCD outperforms Quartus II by 4% (column 12).

For example, we ran the circuit bigkey using both the standard Quartus II flow and our SPCD Stratix flow. As shown in Figure 20, the Quartus II flow yields a critical path of 5.469 ns. As shown in Figure 21, the SPCD flow yields a critical path of 4.928 ns. The cells on the critical path in Figure 21 are placed closer together, and the delay improvement is 11%.
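The 11% figure for bigkey can be checked from the two critical path delays. The sketch below is an illustration under an assumption: the source does not state which delay is used as the denominator, so both conventions are computed. The function names are hypothetical.

```python
QUARTUS_NS = 5.469  # critical path of the Quartus II flow (Figure 20)
SPCD_NS = 4.928     # critical path of the SPCD flow (Figure 21)


def improvement_vs_new(baseline_ns, new_ns):
    """Delay reduction normalized by the new (improved) delay."""
    return (baseline_ns - new_ns) / new_ns


def improvement_vs_baseline(baseline_ns, new_ns):
    """Delay reduction normalized by the baseline delay."""
    return (baseline_ns - new_ns) / baseline_ns


if __name__ == "__main__":
    # Normalizing by the SPCD delay gives roughly 11%;
    # normalizing by the Quartus II delay gives roughly 10%.
    print(f"relative to SPCD:    {100 * improvement_vs_new(QUARTUS_NS, SPCD_NS):.2f}%")
    print(f"relative to Quartus: {100 * improvement_vs_baseline(QUARTUS_NS, SPCD_NS):.2f}%")
```

The quoted 11% matches the first convention, in which the 0.541 ns difference is divided by the SPCD delay. This is an inference from the numbers, not a definition given in the text.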

8. CONCLUSIONS

We introduce an efficient and effective algorithm for simultaneous placement with clustering and duplication. By integrating novel techniques such as path counting-based net weighting, a fragment level move, simultaneous logic duplication during placement, monotone region-based global path monotonicity optimization, optimal legalization under complex constraints, duplication graph representation, and redundancy removal, our new SPCD algorithm produces excellent results for both wirelength and timing optimization. When compared to a widely used separate academic FPGA design flow, T-VPack+VPR, across different architectures, our algorithm achieves improvements of up to 36% in wirelength and 31% in the longest path delay, with less than 1% increase in area. Although we test our algorithm in the context of FPGAs, the duplication and placement algorithms apply directly to ASICs and other architectures as well. The SPCD package is available for download from http://ballade.cs.ucla.edu/~chg/spcd.

REFERENCES

BERAUDO, G. AND LILLIS, J. 2003. Timing optimization of FPGA placements by logic replication. In Proceedings of the ACM/IEEE Design Automation Conference, 196–201.

BETZ, V. AND ROSE, J. 1997. VPR: A new packing, placement and routing tool for FPGA research. In Proceedings of the International Workshop on Field Programmable Logic and Application, 213–222.

BOZORGZADEH, E., OGRENCI, S., AND SARRAFZADEH, M. 2001. Routability-driven packing for cluster-based FPGAs. In Proceedings of the Asia and South Pacific Design Automation Conference (Yokohama, Japan), 629–634.