Datapath Manipulation - Digital ASIC Manual

3.7 Datapath Manipulation

This section shows how the structure of the datapath can be steered by the way the VHDL code is written. A simple arithmetic operation, such as an addition of four operands, can be written as:

z <= A + B + C + D;

If possible, the statement is mapped to hardware as a balanced adder tree in order to minimize the delay (see Figure 3.14). The same hardware structure can be obtained explicitly as follows:

z <= (A + B) + (C + D);
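The balanced form can also be written with named intermediate sums, which makes the two tree levels visible in the code. The following is a minimal sketch under stated assumptions: the entity name adder_tree4, the generic WIDTH, the unsigned operand type, and the signals sum_ab and sum_cd are hypothetical and not taken from the manual. Because the "+" operator of ieee.numeric_std returns a result as wide as its widest operand, the operands are extended with resize so that the carries are not lost.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical four-input adder; names and widths are assumptions.
entity adder_tree4 is
  generic (WIDTH : positive := 8);
  port (
    a, b, c, d : in  unsigned(WIDTH - 1 downto 0);
    -- Two extra bits are enough to hold the sum of four operands.
    z          : out unsigned(WIDTH + 1 downto 0)
  );
end entity adder_tree4;

architecture rtl of adder_tree4 is
  -- First tree level: each partial sum needs one extra bit.
  signal sum_ab, sum_cd : unsigned(WIDTH downto 0);
begin
  sum_ab <= resize(a, WIDTH + 1) + resize(b, WIDTH + 1);
  sum_cd <= resize(c, WIDTH + 1) + resize(d, WIDTH + 1);
  -- Second tree level: equivalent to z <= (A + B) + (C + D).
  z <= resize(sum_ab, WIDTH + 2) + resize(sum_cd, WIDTH + 2);
end architecture rtl;
```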

If information on the arrival times of the signals is available, the delay can be shortened by rearranging the order of the adders. Assuming that a fifth signal E arrives late, the delay can be shortened by using parentheses as follows:

z <= (A + B + C + D) + E;

The parser is forced to place E last in the adder chain, so the late signal passes through only one adder. The signals within the parentheses are arranged as a balanced adder tree, see Figure 3.15.
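The late-arrival case could be embedded in a complete design unit as sketched below. The entity name adder_late_e, the generic WIDTH, the constant ZW, and the signal early_sum are assumptions made for illustration. Whether a particular synthesis tool keeps the grouping given by the parentheses may depend on the tool and its settings, so the result should be checked in the timing report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical five-input adder where e is assumed to arrive late.
entity adder_late_e is
  generic (WIDTH : positive := 8);
  port (
    a, b, c, d : in  unsigned(WIDTH - 1 downto 0);
    e          : in  unsigned(WIDTH - 1 downto 0);  -- late signal
    -- Three extra bits are enough to hold the sum of five operands.
    z          : out unsigned(WIDTH + 2 downto 0)
  );
end entity adder_late_e;

architecture rtl of adder_late_e is
  constant ZW      : positive := WIDTH + 3;
  signal early_sum : unsigned(ZW - 1 downto 0);
begin
  -- The early operands are summed first; the tool is free to
  -- arrange this part as a balanced tree.
  early_sum <= resize(a, ZW) + resize(b, ZW)
             + resize(c, ZW) + resize(d, ZW);
  -- The late operand is added last: z <= (A + B + C + D) + E.
  z <= early_sum + resize(e, ZW);
end architecture rtl;
```

With this grouping the path from e to z contains a single adder, while the early operands pass through the adders that form early_sum before reaching it.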

Figure 3.11: Area/Delay comparison of multipliers (axes: Delay (ns) and Area (a.u.); curves labeled nbw and wall). On the left-hand side an 8-bit multiplier and on the right-hand side a 16-bit multiplier.

Chapter 3. Arithmetic

Figure 3.12: Negative Right-Shift Multiplier.

Figure 3.13: Positive Right-Shift Multiplier.

Figure 3.14: Balanced Adder Tree (inputs A, B, C, and D)

Figure 3.15: Unbalanced Adder Tree (inputs A, B, C, D, and E; output Z)

Chapter 4

Memories

One of the most important topics in digital ASIC design today is memories.

Memories occupy a lot of space and consume a lot of power, and the situation becomes even worse if off-chip memories are considered. Even though each new generation of silicon processes can fit more transistors on a single chip, it is predicted that the percentage of memory on a chip will increase. The road map from the Japanese system-LSI industry predicts that in 2014 more than 90% of a chip will be covered with memory, compared to today's 50%.

With these numbers in mind, it is reasonable to spend some time optimizing the memory system when designing a new ASIC. Each saved square millimeter of silicon corresponds to a neat pile of money, and low power consumption is required if ASICs are to be used in mobile or extremely fast applications: in mobile applications to maximize battery lifetime, and in fast applications to avoid expensive cooling equipment. Another memory issue is that the larger a memory is, the slower and more power hungry it becomes. The consequence is that the designer has to split large memories into smaller ones in order to be more power efficient or to meet timing requirements. This creates a more complex memory hierarchy and introduces more design criteria.

This chapter does not try to cover every aspect of memory design; instead, it gives some examples of what can be done to optimize memory systems. These examples do not serve as the final truth but rather show the reader that memory systems have to be tailor-made for each specific application. The first example is about minimizing the area of a Shift Register (SR) [4]. The second example deals with a memory that is not always utilized 100% [5]. The last example presents a cache system for reading from an off-chip memory in an image convolution application [6]. For more information on memory design, consult [7].

4.1 SR example

In this example, shift registers (SRs) are considered. SRs are commonly used to delay or align data in datapaths, for example in a pipelined Fast Fourier Transform (FFT) processor. Here, four different ways to implement an SR are presented together with an estimate of the required silicon area.

Since one input is read and one output is written each clock cycle, the SR could be implemented in at least four different ways, as shown in Figure 4.1:

Figure 4.1: Four ways to realize an SR. Flip-flop based (a), dual-port memory (b), two single-port memories (c), and one single-port memory of double width (d).

a. Flip-flops connected in series: This is an easy solution since there is no control logic needed. Additionally, flip-flops have a fast access time and thus allow high operation speed. However, flip-flops are not optimized for area.

b. One dual-port memory: A dual-port memory can perform both a read and a write operation each clock cycle and thus suits the requirements perfectly. However, in order to perform simultaneous read and write, additional logic is required in each memory cell compared to a single-port memory.

c. Two single-port memories of half length and alternating read/write: The input is gathered in blocks of two and written to the memories every other clock cycle. In a similar manner, two words are read from the memories every other clock cycle. Single-port memories have smaller memory cells than dual-port memories, but additional logic is required outside the memories in order to gather and separate input and output.

d. One single-port memory of half length and double width, which reads and writes every other clock cycle. This solution works in the same way as the previous solution, but instead of storing the two inputs in separate memories they are stored as one word in the same memory location. Thus, the memory has to be twice as wide.
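Solutions (a) and (b) can be sketched in VHDL as two architectures of one entity. This is a minimal behavioural sketch and not the implementation behind the area estimates: the entity name shift_reg, the generics LENGTH and WIDTH, and the pointer scheme are assumptions made for illustration. In an ASIC flow the memory would typically be an instantiated macro rather than an array in the code. The dp_mem architecture further assumes that LENGTH is at least 2 and that the memory returns the old word when the same address is read and written in one cycle.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical shift register entity shared by both architectures.
entity shift_reg is
  generic (
    LENGTH : positive := 16;  -- number of stages (>= 2 for dp_mem)
    WIDTH  : positive := 8    -- word width in bits
  );
  port (
    clk  : in  std_logic;
    din  : in  std_logic_vector(WIDTH - 1 downto 0);
    dout : out std_logic_vector(WIDTH - 1 downto 0)
  );
end entity shift_reg;

-- (a) Flip-flops in series: no control logic, all words move each cycle.
architecture ff of shift_reg is
  type word_array is array (natural range <>)
    of std_logic_vector(WIDTH - 1 downto 0);
  signal stages : word_array(0 to LENGTH - 1);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      stages(0) <= din;
      for i in 1 to LENGTH - 1 loop
        stages(i) <= stages(i - 1);
      end loop;
    end if;
  end process;
  dout <= stages(LENGTH - 1);
end architecture ff;

-- (b) Dual-port memory model: one read and one write per cycle.
-- LENGTH - 1 memory words plus the output register give LENGTH stages.
architecture dp_mem of shift_reg is
  type word_array is array (natural range <>)
    of std_logic_vector(WIDTH - 1 downto 0);
  signal mem : word_array(0 to LENGTH - 2);
  -- Initial value is for simulation only; an ASIC implementation
  -- would reset the pointer explicitly.
  signal ptr : natural range 0 to LENGTH - 2 := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      dout     <= mem(ptr);  -- read the oldest word (old contents)
      mem(ptr) <= din;       -- overwrite it with the newest word
      if ptr = LENGTH - 2 then
        ptr <= 0;            -- wrap around: circular buffer
      else
        ptr <= ptr + 1;
      end if;
    end if;
  end process;
end architecture dp_mem;
```

In both architectures a word sampled at one rising clock edge appears on dout after LENGTH - 1 further edges. The difference is that the flip-flop version moves every word each cycle, while the memory version only moves an address pointer and touches one location per cycle.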

Figure 4.2 shows the estimated area for the different SR implementations in a 0.35 µm CMOS technology. The flip-flop area does not include interconnections. All memory-based SRs include control logic and a 50 µm power ring on three sides of the block. The figure shows that, from an area perspective, flip-flops are the best solution only if the SR is shorter than approximately 400
