Circuit Parallelization - Circuits Techniques for Dynamic Power


10.3 Circuit Parallelization

Circuit parallelization has been proposed to maintain, at a reduced Vdd, the throughput of logic modules that are placed on the critical path [8,17,23,24]. It can be achieved with M parallel units clocked at ƒ/M. Results are provided at the nominal frequency ƒ through an output multiplexer controlled at ƒ (Figure 10.2(a)). Each unit can compute its result in a time slot M times longer (Figure 10.2(b)), and can therefore be supplied at a reduced supply voltage. If the units are datapaths or processors [23], they have to be duplicated, resulting in an M-fold increase in area and switched capacitance. Applying the well-known dynamic power formula, $P_{dyn} = C_L \cdot V_{dd}^2 \cdot a \cdot f$, one can write:

$$P = M \cdot C_{eff} \cdot \frac{f}{M} \cdot V_{dd}^2 = C_{eff} \cdot f \cdot V_{dd}^2 \qquad (10.3)$$

Table 10.1 presents the power reduction obtained for an 8-bit adder. From Equation 10.3, one could deduce that power is saved only if Vdd is reduced. Because the operating frequency of each unit is reduced, however, cells with smaller or unsized transistors can be used, which also reduces power. Furthermore, some parallelized logic modules do not require M-unit duplication. This is the case, for instance, for memories [25], in which each unit contains 1/M of the data or instructions, resulting in the same total area to store the information and in the same Ceff, or in a smaller total switched capacitance if cells with unsized transistors are used (Figure 10.3). In such a case, the power is the following:

$$P = C_{eff} \cdot \frac{f}{M} \cdot V_{dd}^2 \qquad (10.4)$$

At first order, power could be saved even if Vdd is not reduced; however, some overhead has to be considered, such as the duplication of the address registers and the output multiplexer (Figure 10.3). If this overhead is not too expensive, such a parallelization scheme should also be considered for logic modules that are not on the critical path. At a low Vdd, these modules work without parallelization. At the same low Vdd, power could be saved if they are parallelized at the cost of a small overhead. Memories, shift registers, and serial-parallel converters provide interesting examples.
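The two cases discussed above (M duplicated units versus a memory split into M parts) can be compared numerically with a minimal first-order sketch. The function names and the values of capacitance, frequency, and supply voltage are illustrative assumptions, and the overhead of the output multiplexer and address registers is ignored.

```python
import math

def dynamic_power(c_eff, f, vdd):
    """First-order dynamic power P = C_eff * f * Vdd^2 (activity folded into C_eff)."""
    return c_eff * f * vdd ** 2

def duplicated_units_power(c_eff, f, vdd, m):
    """M duplicated units: each has capacitance c_eff and is clocked at f/m."""
    return m * dynamic_power(c_eff, f / m, vdd)

def partitioned_memory_power(c_eff, f, vdd, m):
    """Memory split into m parts: total capacitance stays c_eff, access rate is f/m."""
    return dynamic_power(c_eff, f / m, vdd)

if __name__ == "__main__":
    c_eff, f = 1e-12, 100e6  # assumed values: 1 pF switched capacitance, 100 MHz
    base = dynamic_power(c_eff, f, 3.0)
    same_vdd = duplicated_units_power(c_eff, f, 3.0, 2)
    # Duplication alone does not save power at an unchanged Vdd.
    print("duplicated, same Vdd equals baseline:", math.isclose(same_vdd, base))
    # Savings appear only once the longer time slot is used to lower Vdd.
    print("duplicated, reduced Vdd ratio:", duplicated_units_power(c_eff, f, 2.0, 2) / base)
    # A partitioned memory saves power even at an unchanged Vdd.
    print("partitioned memory ratio:", partitioned_memory_power(c_eff, f, 3.0, 2) / base)
```

The sketch only mirrors the first-order reasoning; a real comparison would add the switched capacitance of the multiplexer and of the duplicated registers.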

10.3.1 Memory Parallelization

In a parallelized module, operations of the execution units or data accesses in memories are performed in an overlapped or interleaved fashion (Figure 10.2(b)). Therefore, the result is provided with an additional latency of M-1 clock periods compared to a nonparallel architecture. One can see on the timing diagram of Figure 10.2(b) that the output multiplexer can be controlled at ƒ/2. The operation or access of Unit 2 is started before the completion of the operation of Unit 1. Therefore, M successive computations must not depend on each other.

FIGURE 10.2 (a) Datapath parallelization concept, and (b) timing diagram.

TABLE 10.1 8-bit Adder Power Simulation With the CoolChip Library
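The interleaved timing described above can be sketched as a simple schedule. This is an illustrative model only: cycles are counted in periods of the nominal clock ƒ, and the function names are assumptions made for the example.

```python
def interleaved_schedule(num_inputs, m):
    """Return one (start, ready, unit) tuple per input for m interleaved units.

    Input i is issued in cycle i to unit i % m, which then has m periods
    of the nominal clock to produce its result.
    """
    return [(i, i + m, i % m) for i in range(num_inputs)]

def extra_latency(m):
    """Latency added relative to a single unit that finishes in one period."""
    return m - 1

if __name__ == "__main__":
    for start, ready, unit in interleaved_schedule(6, 2):
        print(f"input issued in cycle {start} -> unit {unit}, result ready in cycle {ready}")
    print("extra latency for M = 2:", extra_latency(2))
```

Each unit starts before the previous one has finished, so one result is still delivered per nominal clock period, which is why successive computations must be independent.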

Controllers with a fixed sequence of commands without any branch instruction, specialized processors for special linear computation, and random-access memories (RAMs) used to store coefficients for programmable finite impulse response (FIR) filters can be parallelized according to the structure of Figure 10.3. This structure can be used, for instance, for transcoders in which several lookup tables (i.e., read-only memories [ROMs]) are connected in parallel; however, parallel memories are difficult to use if branch instructions are present. Interleaved or parallelized memories (Figure 10.4) with branch instructions were used in the 1960s for computers [14]. With, for instance, 32 memory modules and an access time of 10 cycles (ƒ/10), the probability of having to insert a branch delay is reduced because 10 successive instructions are, most of the time, stored in different modules.
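The effect of branches on interleaved memories can be illustrated with a small model. It assumes low-order interleaving (module = address mod 32), one access issued per cycle, and a module that stays busy for the full access time; none of these details is specified here, and the function names are hypothetical.

```python
def module_of(address, num_modules=32):
    """Assumed low-order interleaving: consecutive addresses hit consecutive modules."""
    return address % num_modules

def stall_cycles(addresses, num_modules=32, access_cycles=10):
    """Count cycles lost waiting for a module that is still busy."""
    busy_until = [0] * num_modules
    cycle = 0
    stalls = 0
    for addr in addresses:
        m = module_of(addr, num_modules)
        if busy_until[m] > cycle:          # module still serving an earlier access
            stalls += busy_until[m] - cycle
            cycle = busy_until[m]
        busy_until[m] = cycle + access_cycles
        cycle += 1                          # next access is issued one cycle later
    return stalls

if __name__ == "__main__":
    print("sequential code:", stall_cycles(list(range(100))))
    print("branch back into a busy module:", stall_cycles([0, 32]))
```

Sequential fetches never collide because a module is revisited only every 32 cycles, whereas a branch to an address in a recently used module has to wait.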

FIGURE 10.3 Memory parallelization.

FIGURE 10.4 Memory parallelization in computers.

At first order, $P = C'_{eff} \cdot \frac{f}{2} \cdot V_{dd}^2$

1941_C10.fm Page 5 Thursday, September 30, 2004 4:46 PM

10-6 Low-Power Electronics Design

10.3.2 Parallelized Shift Register

Figure 10.5 depicts a parallelized shift register. Such a concept has been proposed for CCD serial memories [14,21]. The input is successively provided to the upper or to the lower half shift register at a reduced frequency, while the output multiplexer restores the output at the frequency ƒ. No latency exists because the combinatorial circuit of the state machine “shift register” is implemented by simple wires, resulting in no associated delay. The total number of D-flip-flops (DFFs) is the same as in the nonparallelized shift register [24,25].
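A behavioral sketch of this idea, assuming an idealized cycle-level model (no gate delays) and hypothetical function names, shows that splitting the register into M lanes clocked in turn yields the same output stream as the classic shift register with the same total number of flip-flops.

```python
from collections import deque

def shift_register(bits, n):
    """Classic n-stage shift register: every stage is clocked on every cycle."""
    reg = deque([0] * n, maxlen=n)
    out = []
    for b in bits:
        out.append(reg[-1])      # last stage drives the output
        reg.appendleft(b)        # shift in; the oldest bit falls off the end
    return out

def parallel_shift_register(bits, n, m=2):
    """n flip-flops split into m lanes of n/m stages, each lane clocked at f/m."""
    if n % m:
        raise ValueError("n must be a multiple of m")
    lanes = [deque([0] * (n // m), maxlen=n // m) for _ in range(m)]
    out = []
    for t, b in enumerate(bits):
        lane = lanes[t % m]      # only this lane receives a clock edge in cycle t
        out.append(lane[-1])     # output multiplexer selects the active lane
        lane.appendleft(b)
    return out

if __name__ == "__main__":
    data = [(i * 7 % 5) % 2 for i in range(40)]
    print(parallel_shift_register(data, 16, 2) == shift_register(data, 16))
```

In the model each flip-flop sees one clock edge every M cycles instead of every cycle, which is where the power saving comes from.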

For the nonparallelized shift register, the maximum frequency is limited by the delays of the latches of the DFF. For the parallelized shift register, the maximum frequency is limited by one latch delay and the output multiplexer delay. Thus, the maximum frequency of the parallelized structure is the same as that of the classic structure (a classic shift register with ƒmax = 100 MHz can be replaced by a parallelized shift register clocked at ƒ/2 = 50 MHz, but ƒ/2 cannot be raised above 50 MHz). Such a parallelization does not provide faster shift registers. It is therefore impossible to reduce Vdd if the shift register is on the critical path. For shift registers that are not on the critical path, one can reduce both ƒ and Vdd.

Table 10.2 presents the power consumption of nonparallelized and parallelized shift registers, depending on the degree of parallelism. Such a comparison is valid only for shift registers that are not at their frequency limits, however, because an 8- or 4-parallelized shift register cannot provide the same throughput as a nonparallelized shift register.

10.3.3 Serial-Parallel Converter

Figure 10.6 depicts a parallelized structure of a 16-bit serial-parallel converter in which the 1-bit input is successively loaded into four 4-bit shift registers clocked at ƒ/4. Power consumption is reduced by a factor of four at the same throughput. Because no output multiplexer exists, the maximum frequency of such a structure can be much higher than that of the nonparallelized serial-parallel converter.
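The factor of four can be illustrated with a behavioral sketch that counts flip-flop clock events as a rough proxy for power. The bit-to-register assignment (bit i goes to register i mod 4) and the function names are assumptions made for the example.

```python
def serial_to_parallel(bits, lanes=4, lane_width=4):
    """Load a serial word into `lanes` shift registers, one register per cycle in turn."""
    if len(bits) != lanes * lane_width:
        raise ValueError("word length must equal lanes * lane_width")
    regs = [[0] * lane_width for _ in range(lanes)]
    clock_events = 0
    for i, b in enumerate(bits):
        reg = regs[i % lanes]
        reg.insert(0, b)            # shift the new bit in
        reg.pop()                   # drop the bit shifted out of the last stage
        clock_events += lane_width  # only this register's flip-flops are clocked
    # Output wiring: bit i sits in register i % lanes, at stage lane_width - 1 - i // lanes.
    word = [regs[i % lanes][lane_width - 1 - i // lanes] for i in range(len(bits))]
    return word, clock_events

def nonparallel_clock_events(width):
    """A single width-bit shift register clocks all its flip-flops for every input bit."""
    return width * width

if __name__ == "__main__":
    word_in = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
    word_out, events = serial_to_parallel(word_in)
    print(word_out == word_in, nonparallel_clock_events(16) / events)
```

Reassembling the word requires only wiring, not a multiplexer, which is consistent with the absence of an output multiplexer in this structure.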

FIGURE 10.5 Parallelized shift register.

TABLE 10.2 Power Simulation with the CoolChip Library for (i) Nonparallelized 16-bit Shift Register and (ii) 2- and 4-Parallelized 16-bit Shift Registers

2 µm Technology    f (MHz)        Vdd (V)    Power (µW)    %
16-bit SR          f = 33         4.5        1535          100
2-// 16 bit SR     f/2 = 16.5     4.5        887           58
4-// 16 bit SR     f/4 = 8.25     4.5        738           48
16-bit SR          f = 33         3.2        797           100
2-// 16 bit SR     f/2 = 16.5     3.2        448           56
4-// 16 bit SR     f/4 = 8.25     4.0        585           83


Circuits Techniques for Dynamic Power Reduction 10-7

10.3.4 Linear Feed-Back Shift Registers

Shift register parallelization can be used for linear feed-back shift registers (LFSRs) [19] with as many output multiplexers as the number of inputs of the XOR tree. Figure 10.7 gives an example with two output multiplexers. The example in [19] is an M-parallelized M-bit shift register with M-input simplified multiplexers. In addition, a parallelized LFSR architecture was used for the development of a division algorithm [16] and for the implementation of stream ciphers in cryptography [11].
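Why one multiplexer is needed per XOR input can be seen in a behavioral sketch: each tap refers to a logical stage whose physical location alternates between lanes from cycle to cycle. The model below is an idealized illustration with hypothetical names and arbitrary tap positions, not the circuit of [19].

```python
def lfsr(state, taps, steps):
    """Classic LFSR: feedback is the XOR of the tapped stages, output is the last stage."""
    s = list(state)
    out = []
    for _ in range(steps):
        out.append(s[-1])
        fb = 0
        for k in taps:
            fb ^= s[k]
        s = [fb] + s[:-1]
    return out

def parallel_lfsr(state, taps, steps, m=2):
    """Same LFSR stored in m lanes; only lane t % m is clocked in cycle t."""
    n = len(state)
    if n % m:
        raise ValueError("register length must be a multiple of m")
    lanes = [[0] * (n // m) for _ in range(m)]
    for k, bit in enumerate(state):           # place the initial state for cycle 0
        lanes[(-1 - k) % m][k // m] = bit

    def read(t, k):
        """Multiplexer for logical stage k: the lane to select depends on the cycle t."""
        return lanes[(t - 1 - k) % m][k // m]

    out = []
    for t in range(steps):
        out.append(read(t, n - 1))            # output multiplexer
        fb = 0
        for k in taps:                        # one multiplexer per XOR input
            fb ^= read(t, k)
        lane = lanes[t % m]
        lane.insert(0, fb)                    # shift only the active lane
        lane.pop()
    return out

if __name__ == "__main__":
    seed, taps = [1, 0, 0, 1, 1, 0, 1, 0], (7, 5, 4, 3)
    print(parallel_lfsr(seed, taps, 40, 2) == lfsr(seed, taps, 40))
```

The check compares the parallelized model against the classic one for the same seed and taps; the equivalence does not depend on the particular tap choice.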

10.3.5 Double-Edge Triggered Flip-Flop

Figure 10.8(a) and Figure 10.8(b) show the block diagram and a circuit design of a single-edge triggered flip-flop (SET-FF), and Figure 10.8(c) to Figure 10.8(e) show the block diagram and circuit designs of a double-edge triggered flip-flop (DET-FF). A classic SET-FF is implemented with two latches in series, while its parallelization results in two latches in parallel with an output multiplexer, from which the DET-FF is derived. A DET-FF is triggered on both the rising and the falling edge of a clock pulse. Using DET-FFs, the clock frequency, ƒ, can be halved for the same throughput rate, thus reducing the power dissipation on the clock distribution network. Although many alternative DET-FF designs have been proposed, they have not been used extensively, due to the increased silicon area (i.e., increased input capacitance and number of transistors). This implies a larger number of internal nodes, whose power dissipation depends strongly on the input signal transition probability, a. It was shown [26] that if the switching activity a is low, significant power savings may be achieved, while a high activity a may lead to increased power consumption. In addition, DET-FFs exhibit increased glitching activity compared with SET-FFs. SPICE simulation results show power savings of around 10% using DET-FFs at the expense of a 10% reduction in performance.
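The two-latches-in-parallel structure can be sketched behaviorally as follows. This is an idealized model with no delays or glitches, and the class and function names are illustrative assumptions rather than any of the circuits in Figure 10.8.

```python
class DetFlipFlop:
    """Two level-sensitive latches of opposite phase followed by a multiplexer."""

    def __init__(self):
        self.latch_hi = 0   # transparent while clk == 1, holds while clk == 0
        self.latch_lo = 0   # transparent while clk == 0, holds while clk == 1

    def step(self, clk, d):
        if clk:
            self.latch_hi = d
        else:
            self.latch_lo = d
        # The multiplexer always selects the latch that is currently holding,
        # i.e., the value captured at the most recent clock edge.
        return self.latch_lo if clk else self.latch_hi

def run_det_ff(data):
    """Apply one data value per clock phase; the clock toggles once per value."""
    ff = DetFlipFlop()
    return [ff.step(i % 2, d) for i, d in enumerate(data)]

if __name__ == "__main__":
    print(run_det_ff([1, 0, 1, 1, 0]))
```

In this model a new value appears at the output after every clock edge, rising or falling, so the clock toggles only once per data item, whereas a SET-FF needs a full clock period per item.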

FIGURE 10.6 Parallel-serial converter.

FIGURE 10.7 Linear feed-back shift register.

A detailed comparative study by Chung et al. [9] of five existing DET-FFs, in terms of performance (i.e., latency^-1), total power consumption, and power-delay product (PDP), is given in Table 10.3. A 0.18-µm technology and a supply voltage of 1.8 V are assumed. Notice that the total power consumption consists of three components:

1. Internal power dissipation
2. Data power
3. Local clock power

The internal power contributes more than 70% of the total power consumption.

FIGURE 10.8 Flip-Flops: (a) block diagram of a single-edge triggered flip-flop (SET-FF), (b) circuit design of a SET-FF, (c) block diagram of a double-edge triggered flip-flop (DET-FF), (d) circuit design of a DET-FF [18], and (e) circuit design of a DET-FF [9].

TABLE 10.3 Comparison Results of DET-FF in Terms of Power Consumption, Latency, and Power-Delay Product

[22]    17.6    65.6    241.7    324.9    245.4    79.6
[18]    17.0     4.6    153.4    175.0    312.3    54.7
[11]    23.2    11.6    131.4    166.2    262.2    43.6
[26]    30.0    13.4    194.5    237.8    235.3    56.0
[9]     18.1    10.9    189.4    218.4    230.5    50.3
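The figures of Table 10.3 appear to be internally consistent under the following reading, which is an assumption and not something stated here: the first three values of each row are the power components, the fourth is their sum (total power), the fifth is the latency, and the last is the power-delay product, i.e., total power times latency divided by 1000 to account for the units. A short check of that reading, with a tolerance for rounding:

```python
# Rows copied from Table 10.3; the meaning of each column is an assumption.
ROWS = {
    "[22]": (17.6, 65.6, 241.7, 324.9, 245.4, 79.6),
    "[18]": (17.0, 4.6, 153.4, 175.0, 312.3, 54.7),
    "[11]": (23.2, 11.6, 131.4, 166.2, 262.2, 43.6),
    "[26]": (30.0, 13.4, 194.5, 237.8, 235.3, 56.0),
    "[9]": (18.1, 10.9, 189.4, 218.4, 230.5, 50.3),
}

def check_row(values, tol=0.2):
    """Return (sum_ok, pdp_ok) for one row under the assumed column meaning."""
    p1, p2, p3, total, latency, pdp = values
    sum_ok = abs((p1 + p2 + p3) - total) <= tol
    pdp_ok = abs(total * latency / 1000.0 - pdp) <= tol
    return sum_ok, pdp_ok

def largest_component_share(values):
    """Share of the largest of the three components in the total power."""
    p1, p2, p3, total, _, _ = values
    return max(p1, p2, p3) / total

if __name__ == "__main__":
    for ref, values in ROWS.items():
        print(ref, check_row(values), round(largest_component_share(values), 2))
```

Under this reading, the largest component exceeds 70% of the total in every row, which matches the statement about the internal power.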


Specifically, the first component is the power consumed inside a DET-FF, including the power consumed for driving the load capacitance C_L. Thus, power optimization techniques should focus on the careful design of the DET-FF topology to reduce capacitance or switching activity.

In document LowPowerElectronics.pdf (Page 157-163)