
CHAPTER V BLOCK-LEVEL IMPLEMENTATION OF BINARY-VALUED,

V.3.3 FC-based CAD Flow Overview

The CAD flow that converts an input logic netlist into an FC-based design is described next. The input logic netlist is technology independent in our experiments, but it could be technology dependent as well. The flow consists of several steps, which are summarized here and then explained in detail.

First, the input netlist is clustered into FCs (where $FC_i$ implements $F_i^{m,n}$), with the goal of minimizing the wiring between FCs. In our experiments, $m \le 6$ and $n \le 3$. After this, we obtain a multi-level netlist of interconnected FCs.

Next, the layout of each FC is generated. The FCs, FLAs and FLBs are extremely regular in their physical characteristics, making them amenable to the on-the-fly physical synthesis flow that we use. Based on the fanout load of the $i$-th output of $FC_j$, additional buffers are added for that output.

For comparison, the input netlist is also synthesized and mapped using commercial standard-cell based CAD tools. The resulting implementations (flash-based and standard-cell based) are compared in terms of their delay, area, power and energy, over a number of designs.

V.3.3.1 FC-based Clustering

Problem Definition: Given an arbitrary logic netlist $\eta$, cluster $\eta$ into a multi-level network $\eta^*$ of FCs, subject to the following constraints:

• The network $\eta^*$ is acyclic.

• Each $FC_i \in \eta^*$ has a logic function $F_i^{s,t}$, where $s \le m$ and $t \le n$.

Algorithm 1 Clustering a Logic Netlist into a Multi-level Network of FCs

    η = decompose_network(η, p)
    L = dfs_and_levelize_nodes(η)
    FC∗ = ∅
    η∗ = ∅
    while get_next_element(L) != NIL do
        FC∗ = FC∗ ∪ get_next_element(L)
        if (num_input(FC∗) ≤ m) && (num_output(FC∗) ≤ n) then
            continue
        else
            Q = remove_last_element(FC∗)
            η∗ = η∗ ∪ FC∗
            FC∗ = Q
        end if
    end while
    η∗ = wiring_recovery(η∗)

Algorithm 1 outlines our clustering strategy. We first decompose $\eta$ into an equivalent network of nodes, each with at most $p$ inputs. If this were not done, some node in $\eta$ could have more than $m$ inputs, making it impossible to create the multi-level FC-based netlist. We choose $p < m$; in particular, we found that $p = 3$ yielded good results. Next, $\eta$ is traversed in a depth-first manner. The resulting nodes are sorted in topological order (see footnote 2) and placed into an array $L$.
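The levelization step can be sketched as follows. This is a minimal illustration and not the thesis' `dfs_and_levelize_nodes` routine: the function name `levelize` and the dictionary-based netlist representation are assumptions made for the example. Primary inputs receive level 0, and we assume the usual convention that every other node receives a level one larger than the maximum level of its fanins.

```python
def levelize(fanins):
    """Return (levels, order) for an acyclic netlist given as {node: [fanins]}.

    Hypothetical representation: nodes without fanins are primary inputs.
    """
    levels = {}

    def visit(node):
        # Depth-first: a node's level is known once all its fanins have levels.
        if node in levels:
            return levels[node]
        ins = fanins.get(node, [])
        levels[node] = 0 if not ins else 1 + max(visit(f) for f in ins)
        return levels[node]

    for node in fanins:
        visit(node)
    # Sorting by level yields one valid topological order.
    order = sorted(levels, key=lambda node: levels[node])
    return levels, order


netlist = {"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "b"]}
levels, order = levelize(netlist)
print(levels)  # {'a': 0, 'b': 0, 'g1': 1, 'g2': 2}
```

The recursive traversal is adequate for small examples; a very deep netlist would need an explicit stack instead.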

We then greedily construct the logic in each FC by successively grouping nodes from $L$, as long as the resulting implementation of the grouped nodes, $FC^*$, does not violate the input or output cardinality constraints for the FCs. If the constraints are still satisfied after a node is added, we attempt to include another node into $FC^*$. Otherwise, we remove the node that was just added, append the last $FC^*$ that satisfied the constraints to the result $\eta^*$, and start a new $FC^*$ from the removed node.
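The greedy grouping loop can be sketched as below. This is a simplified illustration under stated assumptions, not the thesis implementation: nodes are taken strictly in the order given (the fanout preference of `get_next_element` is omitted), the names `cluster_io` and `greedy_cluster` and the dictionary inputs are hypothetical, and the final partial cluster is flushed explicitly after the loop, a step that Algorithm 1 as printed does not show.

```python
def cluster_io(cluster, fanins, fanouts, primary_outputs):
    """Count the external inputs and outputs of a group of gates."""
    members = set(cluster)
    inputs = {f for g in members for f in fanins[g] if f not in members}
    outputs = {g for g in members
               if g in primary_outputs
               or any(s not in members for s in fanouts[g])}
    return len(inputs), len(outputs)


def greedy_cluster(order, fanins, fanouts, primary_outputs, m=6, n=3):
    """Group gates from `order` (topologically sorted) into clusters."""
    clusters, current = [], []
    for gate in order:
        current.append(gate)
        n_in, n_out = cluster_io(current, fanins, fanouts, primary_outputs)
        if n_in <= m and n_out <= n:
            continue
        # The last gate broke a constraint: close the cluster without it.
        last = current.pop()
        if current:
            clusters.append(current)
        current = [last]
    if current:
        clusters.append(current)  # assumption: flush the last partial cluster
    return clusters


fanins = {"g1": ["a", "b"], "g2": ["c", "d"],
          "g3": ["g1", "g2"], "g4": ["g3", "e"]}
fanouts = {"g1": ["g3"], "g2": ["g3"], "g3": ["g4"], "g4": []}
order = ["g1", "g2", "g3", "g4"]
print(greedy_cluster(order, fanins, fanouts, {"g4"}, m=3, n=1))
# [['g1'], ['g2', 'g3'], ['g4']]
```

Because this sketch only groups consecutive nodes of a topological order, the resulting cluster graph is acyclic by construction. Once nodes are picked out of order, as the fanout preference does, acyclicity has to be checked explicitly.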

In order to reduce the wiring between FCs, the get_next_element routine preferentially returns nodes in the fanout of the nodes of $FC^*$, provided that the inclusion of such a node into $FC^*$ would not result in a cyclic dependency between the FCs of $\eta^*$. If such nodes are not available, the first unmapped node from $L$ is returned. At every step of the construction of $\eta^*$, we verify that the graph induced by the multi-level network of FCs is acyclic.
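One way to perform the acyclicity check on the graph induced by the FCs is sketched below, using Kahn's algorithm on the cluster-level graph. The thesis does not state which method it uses, so this is an assumed technique; the function name `clusters_are_acyclic` and the `assignment` mapping (gate to cluster id) are hypothetical.

```python
from collections import defaultdict, deque


def clusters_are_acyclic(assignment, fanins):
    """assignment: {gate: cluster id}; fanins: {gate: [fanin signals]}."""
    succ = defaultdict(set)
    indeg = {c: 0 for c in set(assignment.values())}
    for gate, ins in fanins.items():
        for f in ins:
            src, dst = assignment.get(f), assignment.get(gate)
            # Skip primary inputs (unassigned) and edges inside one cluster.
            if src is None or dst is None or src == dst:
                continue
            if dst not in succ[src]:
                succ[src].add(dst)
                indeg[dst] += 1
    # Kahn's algorithm: repeatedly remove clusters with no pending predecessor.
    ready = deque(c for c, d in indeg.items() if d == 0)
    removed = 0
    while ready:
        c = ready.popleft()
        removed += 1
        for d in succ[c]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    # Every cluster is removed exactly when the cluster graph has no cycle.
    return removed == len(indeg)
```

For example, with gates `x = f(a)`, `y = f(x)` and `z = f(y)`, placing `x` and `z` in one cluster and `y` in another creates a two-cluster cycle, and the function returns `False`.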

Footnote 2: Primary inputs are assigned a level 0, and other nodes are assigned a level which is one larger than the ...

After the clustering step is completed, we invoke a procedure called wiring_recovery. This is a final effort in reducing the wiring between FCs. This procedure attempts to move individual nodes in L to a different FC than their currently assigned FC. If a wiring gain is realized by such a move, the move is made. If no more nodes can be gainfully moved, or if a specified number of iterations have been made through L, the procedure returns. On average, the wiring_recovery procedure is able to reduce wiring by about 9.6%. We note the following about this procedure:

• It is possible that a node $n$ in $L$ is the only node in some FC $X$, and if $n$ can be moved to another FC, then FC $X$ can be eliminated from $\eta^*$. We came across a few instances where an FC was removed in this manner.

• wiring_recovery returns when no node can be moved without increasing the wiring cost of the multi-level network of FCs. At this point, it is still possible that more than one node can simultaneously be moved to realize a gain in wiring. However, this condition is not checked.
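The single-node move strategy described above can be sketched as follows. This is an illustration under assumptions, not the thesis code: the wiring cost is taken to be the number of gate-to-gate connections that cross an FC boundary (the thesis does not define its metric here), and `is_legal` is a hypothetical caller-supplied check standing in for the cardinality and acyclicity constraints.

```python
def wiring_cost(assignment, fanins):
    """Assumed metric: gate-to-gate connections crossing a cluster boundary."""
    return sum(1 for g, ins in fanins.items() for f in ins
               if f in assignment and assignment[f] != assignment[g])


def wiring_recovery(assignment, fanins, is_legal, max_passes=10):
    """Greedy single-node moves; returns a new {gate: cluster id} mapping."""
    assignment = dict(assignment)
    for _ in range(max_passes):
        improved = False
        for gate in list(assignment):
            home = assignment[gate]
            best, best_cost = home, wiring_cost(assignment, fanins)
            for target in set(assignment.values()) - {home}:
                assignment[gate] = target  # tentatively move the gate
                cost = wiring_cost(assignment, fanins)
                if cost < best_cost and is_legal(assignment):
                    best, best_cost = target, cost
            assignment[gate] = best  # commit the best move, or restore
            improved |= best != home
        if not improved:
            break  # no single move reduces the wiring any further
    return assignment


fanins = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "d"]}
start = {"g1": 0, "g2": 1, "g3": 1}
print(wiring_recovery(start, fanins, lambda a: True))
# {'g1': 1, 'g2': 1, 'g3': 1}
```

In the example, `g1` is the only gate in cluster 0, so moving it empties that cluster, which mirrors the first observation above. Like the procedure in the text, the sketch only considers one move at a time and does not look for simultaneous multi-node moves.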

The functional correctness of the resulting multi-level network of FCs was verified at the end of the clustering step.

V.3.3.2 On-the-fly Layout Synthesis

Once the multi-level netlist of FCs is generated in the previous step, we next generate the layout for each $FC_i \in \eta^*$. First, for each $FC_i$, we construct a table of all the $2^n$ output minterms $o_p$ and their corresponding input cubes $C_p = \sum_q c_{p,q}$. This construction is inexpensive in practice, since $m$ and $n$ are small (6 and 3 respectively in our experiments). The set of cubes $\{C_p\}$ forms a partition of the points in $B^m$, where $B = \{0, 1\}$. This table is constructed from the truth table of $F_i^{m,n}$, simply by grouping all the input minterms for each output minterm. The input minterms for each output minterm are then minimized using Espresso [29]. The output minterm which has the largest number of cubes is not implemented, and is mapped to the default output of the FC when it is precharged, as discussed in Section IV.3.5.
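The table construction can be sketched as follows. This is a minimal illustration with assumed names (`group_minterms`, `default_output`) and an assumed encoding in which inputs and outputs are integers whose bits are the signal values. The Espresso minimization step is not reproduced; the cube counts are taken as given.

```python
def group_minterms(func, m, n):
    """Map each of the 2**n output minterms to the input minterms producing it.

    `func` maps an m-bit input (as an int) to an n-bit output (as an int).
    """
    table = {out: [] for out in range(2 ** n)}
    for x in range(2 ** m):
        table[func(x)].append(x)
    return table


def default_output(cube_counts):
    """Output minterm with the most cubes; it is the one left unimplemented."""
    return max(cube_counts, key=cube_counts.get)


# Toy function: the 3-bit output is the number of 1s in the 6-bit input.
table = group_minterms(lambda x: bin(x).count("1"), m=6, n=3)
print([len(table[out]) for out in range(8)])  # [1, 6, 15, 20, 15, 6, 1, 0]

# Cube counts per output minterm as reported in Table 5.1.
cubes = {0: 8, 1: 3, 2: 5, 3: 4, 4: 6, 5: 6, 6: 3, 7: 9}
print(default_output(cubes))  # 7
```

Every input minterm lands in exactly one group, which is the partition property of $\{C_p\}$ stated above.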

Table 5.1 shows the number of input minterms (and cubes) that correspond to each output minterm for a representative function $F^{m,n}$ with $m = 6$ and $n = 3$. The cubes corresponding to output minterm 7 are not implemented: this output has the largest number of cubes (9), so it is mapped to the default output of the FC, which is available because the FC is a precharged circuit.

Output minterm      0    1    2    3    4    5    6    7    Total
# Input minterms    8    5    8   11    6    7    7   12    64
# Input cubes       8    3    5    4    6    6    3    9    44

Table 5.1: Example of Minterm Distribution of an n-output Logic Function with m Inputs

For each $FC_i \in \eta^*$, our layout synthesis algorithm adds larger output buffers for output $x$ whenever the fanout load (measured in terms of the total number of pulldown stack devices that $x$ drives) exceeds a particular value. We chose this threshold to be 96.
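The buffer sizing rule amounts to a threshold test on the total load of each output. A minimal sketch, assuming a hypothetical `loads` mapping from each output name to the number of pulldown stack devices it drives in each sink FC:

```python
def outputs_needing_buffers(loads, threshold=96):
    """Return the outputs whose total fanout load exceeds the threshold.

    loads: {output name: [pulldown stack devices driven in each sink FC]}
    """
    totals = {out: sum(devices) for out, devices in loads.items()}
    return sorted(out for out, total in totals.items() if total > threshold)


loads = {"x0": [40, 56], "x1": [64, 40], "x2": []}
print(outputs_needing_buffers(loads))  # ['x1']
```

Output `x0` drives exactly 96 devices and is therefore not flagged, since the text says the load must exceed the threshold.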

V.4 Experiments

V.4.1 Simulation Environment

In this section, we discuss the methodology used to report the results obtained through our flash-based design flow, compared to the results obtained from an equivalent CMOS standard cell based design flow. The designs presented in this thesis are implemented in a 45nm process technology, because an industry grade CMOS standard cell library in 45nm technology is easily obtained and serves as a realistic candidate to compare our flash-based design flow with. The CMOS standard cell based digital circuits are synthesized and mapped to the 45nm Nangate FreePDK45 Open Cell Library [54, 55] using Synopsys Design Compiler [56]. The delay, power and area of the CMOS standard cell based digital circuits are extracted using Design Compiler. The flash-based digital circuits were generated using custom scripts.

For CMOS devices, we used a 45nm PTM process [57]. For the flash devices, we derived the 45nm flash device models from the measurement results presented in [46] and validated our models using [58, 45]. The details of our model card regression were discussed earlier in Section III.4.2. However, we only use two $V_T$'s for the flash transistors (as described in Section IV.4), since the design flow described in this chapter uses the FC as the building cell of the flash-based design. The target programmed threshold voltages used in our designs are $V_{T0} = -0.5$ V and $V_{T1} = 0.5$ V. We simulated the flash-based FCs in HSPICE and also verified the correct logical operation of the flash-based digital circuit, which is realized as a network of interconnected FCs. Custom layouts for the FCs were generated using Cadence Virtuoso [65] to compare the physical area of the flash-based digital circuits to their standard cell based counterparts. We obtained the layout of our FCs using design rules for flash devices that were obtained from the ITRS reports [59].
