EECS 151/251A Homework 8

Due Monday, April 6th, 2020

Problem 1: Power and Leakage

Consider a 3-input “2-1 AOI” gate shown below with VDD = 1 V, CL = 5 fF, CD = 0.2 fF/nm.

Assume RON,n = 10 kΩ · nm, RON,p = 20 kΩ · nm, ROFF,n = 100 MΩ · nm, ROFF,p = 500 MΩ · nm for the given device length.

a) Size the gate, using as a reference a symmetrically sized inverter with Wn = 10 nm. Your sized gate should have the same input capacitance as the reference inverter for all inputs.

Solution:

To keep the pull-up and pull-down delays the same, we size Wp,a = 4Wn,a and Wp,b/c = 2Wn,b/c. The reference inverter presents Wn + Wp = 10 nm + 20 nm = 30 nm of gate width at its input. To give every input of the gate the same input capacitance as the reference inverter, we size the transistors as follows:

• an = 3/5 Wn = 6 nm

• ap = 12/5 Wn = 24 nm

• bn = cn = Wn = 10 nm

• bp = cp = 2Wn = 20 nm
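The sizing can be checked numerically. The sketch below is only an illustration: it assumes the reference inverter has Wp = 2Wn (so 30 nm of gate width per input), and the helper name `size_input` is hypothetical.

```python
# Widths are in nm. Assumption: the reference inverter has Wp = 2 * Wn.
WN_REF = 10.0
C_IN_REF = WN_REF + 2 * WN_REF  # 30 nm of total gate width per input


def size_input(p_to_n_ratio, target_width=C_IN_REF):
    """Split target_width between NMOS and PMOS so that Wp = ratio * Wn."""
    wn = target_width / (1 + p_to_n_ratio)
    return wn, p_to_n_ratio * wn


an, ap = size_input(4)  # input a: Wp = 4 * Wn
bn, bp = size_input(2)  # inputs b and c: Wp = 2 * Wn

print(f"a:    Wn = {an:g} nm, Wp = {ap:g} nm")
print(f"b, c: Wn = {bn:g} nm, Wp = {bp:g} nm")
```

Each input then sees an + ap = bn + bp = 30 nm of gate width, matching the reference inverter.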

b) Assume that the probability of an input being high is 0.5 (i.e., on any given clock cycle, each input is equally likely to be a 0 or a 1) and that all inputs are independent. What is the probability that the output is high, P(Out = 1)? What is the probability that the output is low, P(Out = 0)? What is the gate activity factor (i.e., the probability that the output will transition from low to high, P0→1)?

Solution:

The truth table for the 2-1 AOI gate is as follows:

a  b  c  out
0  0  0  1
0  0  1  1
0  1  0  1
0  1  1  0
1  0  0  0
1  0  1  0
1  1  0  0
1  1  1  0

Since each input is equally likely to be 0 or 1 and all the inputs are independent, each of the 8 input combinations has probability 1/8. The probability of the output being 1 is the sum of the probabilities of the input combinations that yield an output of 1. Therefore

$P(\text{Out} = 1) = \frac{3}{8} = 0.375$

The same process is applied to finding the probability of the output being 0.

$P(\text{Out} = 0) = \frac{5}{8} = 0.625$
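The counting argument can be reproduced by enumerating all eight input combinations. In this sketch the gate function `aoi21` is inferred from the truth table (out = NOT(a OR (b AND c))), and the last value is the product P(Out = 0) · P(Out = 1) that this solution uses for the activity factor.

```python
from fractions import Fraction
from itertools import product


def aoi21(a, b, c):
    """2-1 AOI gate as read off the truth table: NOT(a OR (b AND c))."""
    return int(not (a or (b and c)))


# All inputs are independent and equally likely, so each row has weight 1/8.
outputs = [aoi21(a, b, c) for a, b, c in product((0, 1), repeat=3)]

p_high = Fraction(sum(outputs), len(outputs))
p_low = 1 - p_high
p_0_to_1 = p_low * p_high  # assumes consecutive cycles are independent

print(p_high, p_low, p_0_to_1, float(p_0_to_1))
```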

The probability that the output will transition from low to high is the joint probability that the output is 0 in one cycle and 1 in the next. Assuming the inputs on consecutive cycles are independent, this is the product of the two probabilities.

$P_{0 \to 1} = P(\text{Out}_{t-1} = 0) \cdot P(\text{Out}_{t} = 1) = \frac{5}{8} \cdot \frac{3}{8} = \frac{15}{64} = 0.234375$

c) What is the dynamic power dissipation of the gate, if the clock frequency is 3 GHz? You may ignore the parasitic drain capacitance in the internal nodes of the PMOS stack, but not at the output.

Solution:

From lecture, we know that the dynamic power dissipation of a gate can be expressed as

$P_{dyn} = \frac{1}{2} \alpha C_{tot} V_{dd}^2 f_{clk}$

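The total capacitance (from our sizing) is the drain capacitance from an, ap, and bn along with the load capacitance. Since CD is given per nm of width, each drain contributes its width times CD.

$C_{tot} = \left(\frac{3}{5} + \frac{12}{5} + 1\right) W_n C_D + C_L = 4 W_n C_D + C_L$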

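The activity factor is the probability of the output transitioning, which you found in part (b).

$\alpha = P_{0 \to 1}$

The other values are

$V_{dd} = 1\ \text{V}, \quad f_{clk} = 3\ \text{GHz}$

$P_{dyn} = \frac{1}{2} (0.234375)(4 \cdot 0.2 \cdot 10 + 5) \cdot 10^{-15} \cdot (1)^2 \cdot (3 \cdot 10^{9})$

$P_{dyn} \approx 4.57\ \mu\text{W}$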
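As a quick numeric check of this calculation, the sketch below plugs in the values from the problem. It assumes, as the solution does, that only the drains of an, ap, and bn load the output node; the variable names are illustrative.

```python
C_D = 0.2e-15   # drain capacitance per nm of width, in farads
C_L = 5e-15     # load capacitance, in farads
VDD = 1.0       # volts
F_CLK = 3e9     # hertz
ALPHA = 15 / 64  # activity factor from part (b)

# Widths (nm) of the devices whose drains sit on the output: an, ap, bn.
widths_at_output_nm = [6, 24, 10]

c_tot = C_D * sum(widths_at_output_nm) + C_L
p_dyn = 0.5 * ALPHA * c_tot * VDD**2 * F_CLK

print(f"C_tot = {c_tot * 1e15:.1f} fF")
print(f"P_dyn = {p_dyn * 1e6:.2f} uW")
```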

d) 251A only - 151 Optional. For the following three cases, calculate the leakage current.

An approximate expression is perfectly fine as long as you explain and justify your assumptions/simplifications.

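(a) All inputs are zero.

Solution:

All PMOS devices are on, so ignore their resistance. The leakage path is the off a_nmos in parallel with the series pair of off b_nmos and c_nmos.

$R_{eq} \approx R_{off,n,a} \parallel (R_{off,n,b} + R_{off,n,c})$

$I = \frac{V_{dd}}{R_{eq}} \approx \frac{V_{dd}}{R_{off,n,a} \parallel (R_{off,n,b} + R_{off,n,c})}$

(b) All inputs are 1.

Solution:

All NMOS devices are on, so ignore their resistance. The leakage path is the off a_pmos in series with the parallel pair of off b_pmos and c_pmos.

$R_{eq} \approx R_{off,p,a} + (R_{off,p,b} \parallel R_{off,p,c})$

$I = \frac{V_{dd}}{R_{eq}} \approx \frac{V_{dd}}{R_{off,p,a} + (R_{off,p,b} \parallel R_{off,p,c})}$

(c) A = B = 1, C = 0.

Solution:

The on c_pmos is in series with the off a_pmos, which is in series with the on a_nmos.

$R_{eq} \approx R_{on,p,c} + R_{off,p,a} + R_{on,n,a}$

$I = \frac{V_{dd}}{R_{eq}} \approx \frac{V_{dd}}{R_{on,p,c} + R_{off,p,a} + R_{on,n,a}}$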
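To attach rough numbers to these three cases, the sketch below evaluates the leakage paths with the widths from part a). It assumes each device's resistance is the given Ω · nm value divided by its width, and that the off devices dominate each leakage path; the helper names are hypothetical.

```python
VDD = 1.0
R_ON = {"n": 10e3, "p": 20e3}      # ohm * nm
R_OFF = {"n": 100e6, "p": 500e6}   # ohm * nm
WIDTH_NM = {
    ("n", "a"): 6, ("p", "a"): 24,
    ("n", "b"): 10, ("p", "b"): 20,
    ("n", "c"): 10, ("p", "c"): 20,
}


def res(table, kind, name):
    """Device resistance, assuming R = (ohm * nm value) / width."""
    return table[kind] / WIDTH_NM[(kind, name)]


def parallel(x, y):
    return x * y / (x + y)


# (a) all inputs 0: off a_nmos in parallel with off b_nmos + c_nmos
r_a = parallel(res(R_OFF, "n", "a"),
               res(R_OFF, "n", "b") + res(R_OFF, "n", "c"))
# (b) all inputs 1: off a_pmos in series with off b_pmos || c_pmos
r_b = res(R_OFF, "p", "a") + parallel(res(R_OFF, "p", "b"),
                                      res(R_OFF, "p", "c"))
# (c) A = B = 1, C = 0: on c_pmos + off a_pmos + on a_nmos
r_c = res(R_ON, "p", "c") + res(R_OFF, "p", "a") + res(R_ON, "n", "a")

i_a, i_b, i_c = VDD / r_a, VDD / r_b, VDD / r_c
for label, current in (("a", i_a), ("b", i_b), ("c", i_c)):
    print(f"case ({label}): I = {current * 1e9:.1f} nA")
```

Under these assumptions the currents come out on the order of tens to about a hundred nanoamps, and in case (c) the two on-resistances barely change the result.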

Problem 2: Energy Efficiency Improvements

Consider the design of a vector add unit. As shown below, the unit has two input register banks and an output register bank. One of the input register banks holds the first vector [A3, A2, A1, A0] and the second holds [B3, B2, B1, B0]. A controller (not shown) passes the elements of the input vectors through the adder (one per clock cycle) and the result is stored in the output register bank [C3, C2, C1, C0]. As you can see, a 4-1 multiplexor is used by the controller for choosing the proper A and B register, and clock enable signals are given to select the proper C register. The circuit elements have the following delays: τadd = 16 ns, τmux = 2 ns, and τsetup = τclk−Q = 1 ns.

On average, at the nominal Vdd the energy for one data item passed through the adder block is 1 Joule, and 0.2 Joules for the multiplexor. The registers each consume 0.1 Joule on average for each new data word stored.

Your application for this circuit requires a complete vector of 4 elements be computed every 80ns.

You can ignore the time and energy required to load new values into the A and B registers.

For this problem assume that the adder operation cannot be pipelined.

Devise a scheme that would improve the switching energy efficiency while meeting the application requirements. Compare the switching energy per result of the original circuit and your new one.

Assume that a 1/n reduction in clock frequency can accommodate a 1/n reduction in Vdd.

Solution:

It takes 4 clock cycles to complete the vector addition. Each clock cycle requires going through 2 MUXes, an adder, and storing the result in the correct register. The total energy expenditure is

Etot,old = 4(2 · 0.2 + 1 + 0.1) = 6 J

Since we cannot pipeline the adder, the remaining way to improve energy efficiency while still meeting the application requirement is parallelism, at the cost of more hardware. We can use 4 adders running in parallel, which removes the need for the MUXes. The new total energy expenditure is then

Etot,new = 4 · 1 + 4 · 0.1 = 4.4 J

We can also run the clock slower, since the vector addition now completes in one clock cycle.

The new critical path is τclk−Q + τadd + τsetup = 18 ns, so the clock can be slowed by a factor of 80/18 = 4.44. This allows a factor of 4.44 reduction in Vdd, resulting in a further 4.44^2 = 19.8 times reduction in energy, to about 4.4 J / 19.8 ≈ 0.22 J per vector.
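The comparison can be tabulated with a few lines of arithmetic. This sketch uses only the numbers given in the problem and assumes, as stated above, that energy scales with the square of Vdd; the variable names are illustrative.

```python
E_ADD, E_MUX, E_REG = 1.0, 0.2, 0.1       # joules per operation
T_ADD, T_SETUP, T_CLK_Q = 16.0, 1.0, 1.0  # ns
DEADLINE_NS = 80.0
N = 4  # elements per vector

# Original: each element passes through 2 MUXes, the adder, and one register.
e_old = N * (2 * E_MUX + E_ADD + E_REG)

# Parallel version at nominal Vdd: 4 adders, 4 register writes, no MUXes.
e_parallel = N * E_ADD + N * E_REG

# One cycle now has the whole deadline, so slow the clock and lower Vdd.
t_crit = T_CLK_Q + T_ADD + T_SETUP
slowdown = DEADLINE_NS / t_crit
e_new = e_parallel / slowdown**2

print(f"old: {e_old:.2f} J, parallel: {e_parallel:.2f} J, scaled: {e_new:.3f} J")
print(f"per result: old {e_old / N:.3f} J, new {e_new / N:.3f} J")
```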

Problem 3: Race to Halt

An effective scheme for improving energy efficiency when static power consumption is a significant component of total power consumption is a technique called “race to halt”. The basic idea is to run the hardware at maximum speed to quickly compute the necessary set of computations, then turn off the power, thus preventing leakage.

Suppose you have a CPU that takes 4 seconds to run your application with an average power consumption of 8 Watts, where 50% of the power is dynamic and 50% is static. Assume that no other program is running on the CPU. You are willing to run your application slower if that could preserve energy.

You would like to determine the most effective way to run your application to preserve the battery life. You have the ability to control the supply voltage (Vdd), the clock frequency (f), and if needed can put the CPU into a sleep mode where static power is essentially zero. The CPU’s Vdd can be increased or decreased by at most 25%.

Explore “race to halt” versus running longer at a lower Vdd. Which approach will be better at conserving your battery charge? For this problem, assume that the static power usage remains constant when varying the frequency f and supply voltage Vdd. This is more or less true. Show your work and justify your answer.

Assume that an n% increase/decrease in clock frequency can accommodate an n% increase/decrease in Vdd.

Solution:

Since the nominal average power consumption is Ptot,nom = 8 W, the nominal dynamic power is $P_{dyn,nom} = \frac{1}{2} C V_{dd}^2 f_{clk} = 4\ \text{W}$ and the nominal static power is $P_{static,nom} = 4\ \text{W}$.

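Race to Halt: Increase Vdd and fclk by the maximum 25%.

$P_{dyn,race} = \frac{1}{2} C (1.25 V_{dd})^2 (1.25 f_{clk}) = 1.95 \cdot \frac{1}{2} C V_{dd}^2 f_{clk}$

The CPU now takes roughly 3 seconds instead of 4 to run the application (4 s / 1.25 = 3.2 s exactly), so the total energy is

$E_{tot,race} = 1.95 \cdot \frac{1}{2} C V_{dd}^2 f_{clk} \cdot 3\ \text{s} + 4\ \text{W} \cdot 3\ \text{s}$

$E_{tot,race} = 1.95 \cdot 4\ \text{W} \cdot 3\ \text{s} + 4\ \text{W} \cdot 3\ \text{s} = 35.4\ \text{J}$

Lower Vdd: Decrease Vdd and fclk by the maximum 25%.

$P_{dyn,low} = \frac{1}{2} C (0.75 V_{dd})^2 (0.75 f_{clk}) = 0.42 \cdot \frac{1}{2} C V_{dd}^2 f_{clk}$

The CPU now takes roughly 5 seconds instead of 4 to run the application (4 s / 0.75 ≈ 5.33 s exactly), so the total energy is

$E_{tot,low} = 0.42 \cdot \frac{1}{2} C V_{dd}^2 f_{clk} \cdot 5\ \text{s} + 4\ \text{W} \cdot 5\ \text{s}$

$E_{tot,low} = 0.42 \cdot 4\ \text{W} \cdot 5\ \text{s} + 4\ \text{W} \cdot 5\ \text{s} = 28.4\ \text{J}$

Interestingly enough, the lower Vdd scheme is more energy efficient.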
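The two options can be compared with a small helper. This is a sketch under the problem's assumptions (dynamic power scales with the cube of the common Vdd/f scale factor, static power is constant while running and zero when halted). The function name `total_energy` is hypothetical, and the run time defaults to the exact 4 s / scale rather than the rounded 3 s and 5 s used in the solution.

```python
P_DYN_NOM = 4.0  # watts
P_STATIC = 4.0   # watts, assumed constant while the CPU is running
T_NOM = 4.0      # seconds


def total_energy(scale, run_time=None):
    """Energy in joules when Vdd and f are both multiplied by `scale`."""
    if run_time is None:
        run_time = T_NOM / scale  # exact run time at the scaled frequency
    p_dyn = P_DYN_NOM * scale**3  # V^2 * f scaling
    return (p_dyn + P_STATIC) * run_time


print("nominal:         ", total_energy(1.0))
print("race, 3 s:       ", total_energy(1.25, 3.0))
print("lower Vdd, 5 s:  ", total_energy(0.75, 5.0))
print("race, exact:     ", total_energy(1.25))
print("lower Vdd, exact:", total_energy(0.75))
```

With either the rounded or the exact run times, the lower Vdd option uses less energy than both the nominal run (32 J) and race to halt.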

Problem 4: Memory

a) Suppose you want to design a 32-bit wide memory block with a capacity of 2K 32-bit words of storage (remember 1K = 1024). We would like to have the core of the block square (equal number of rows and columns). How many total address bits are needed for this memory?

How many address bits are used by the row-decoder? How many address bits are used by the column-decoder?

Solution:

In short: 2 × 1024 × 32 = 65536 bits, so the core is 256 × 256. The address is 11 bits, the column decoder requires 3 bits (2^3 = 256/32 = 8), and the row decoder requires 8 bits (11 − 3). In detail:

The total memory size is

2 × 1024 × 32 = 65536 bits

Since we want a square block, we take the square root of the memory size

√65536 = 256

This means we have to design a 256 × 256 memory core. The total number of address bits required is

L= log2(2 · 1024) = 11 bits

Each row contains 256/32 = 8 32-bit words. So the column-decoder needs

K = log2(8) = 3 bits

The row-decoder then requires L − K = 11 − 3 = 8 bits.
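The same bookkeeping can be written as a short calculation, which makes it easy to retry with other capacities. This is a sketch; the variable names are illustrative and it assumes the total bit count is a perfect square.

```python
import math

WORDS = 2 * 1024
WORD_BITS = 32

total_bits = WORDS * WORD_BITS
side = math.isqrt(total_bits)
assert side * side == total_bits, "core is not square for these parameters"

address_bits = (WORDS - 1).bit_length()          # log2 of the word count
words_per_row = side // WORD_BITS
column_bits = (words_per_row - 1).bit_length()   # log2 of words per row
row_bits = address_bits - column_bits

print(f"core: {side} x {side}")
print(f"address: {address_bits} bits, column: {column_bits}, row: {row_bits}")
```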

b) Now you want to design the row decoder using the predecoder technique presented in lecture.

You can use only gates with no more than 4 inputs. Map out the scheme and describe the design of each of the decoders.

Solution:

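Split the 8 row-address bits into 4 groups of 2 bits and predecode each group with a 2-to-4 decoder, giving 4 one-hot lines per group. Then drive each of the 256 word lines with a 4-input AND gate that takes one predecoded line from each of the 4 groups, which decodes the full 8-bit address.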
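A behavioral sketch of this predecoding scheme is shown below. It is a software model rather than a gate-level netlist; the function names are hypothetical, and the grouping of address bits as (1:0), (3:2), (5:4), (7:6) is an assumption.

```python
def predecode_2to4(b1, b0):
    """One-hot decode of a 2-bit group (each output is a 2-input AND)."""
    return [int(b1 == (i >> 1) and b0 == (i & 1)) for i in range(4)]


def row_decode(address):
    """Return the 256 word-line values for an 8-bit row address."""
    shifts = (0, 2, 4, 6)
    groups = [(address >> s) & 0b11 for s in shifts]
    pre = [predecode_2to4(g >> 1, g & 1) for g in groups]

    wordlines = []
    for row in range(256):
        sel = [(row >> s) & 0b11 for s in shifts]
        # 4-input AND: one predecoded line from each of the four groups.
        wordlines.append(pre[0][sel[0]] & pre[1][sel[1]]
                         & pre[2][sel[2]] & pre[3][sel[3]])
    return wordlines


lines = row_decode(0b10110010)
print(lines.index(1), sum(lines))
```

In this arrangement the predecoders need 4 × 4 = 16 two-input AND gates and the final stage needs 256 four-input AND gates, so no gate exceeds 4 inputs.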

Problem 5: DRAM [4 pts]

1-transistor DRAM designs usually include a “row buffer”—a register on the periphery that is used to register an entire row.

a) Explain how this register could be used and why it’s a good idea.

Solution:

It reduces power and increases memory system speed. RAM accesses exhibit a high degree of spatial locality: an access to one word in a DRAM row is likely to be followed by another access to the same row. Buffering the row saves having to read the memory cells again, returning a value to the system faster and using less power. For writing: a row is opened (copied into the row buffer) and constituent bytes/words are updated before the entire buffer is written back.

b) Explain how the inclusion of this buffer changes the detailed steps needed for a memory read and memory write operation.

Solution:

For read:

• compulsory miss: slow access - open the row, read its data into the row buffer, then move the requested data to the output

• row buffer hit: fast access - only move data from the buffer to the output

• row buffer conflict: slow access - write back the existing row, open and read the new row, and update the row buffer

For write:

• compulsory miss: slow access - open the row, load it into the row buffer, then apply the write to the buffer, where edits take place

• row buffer hit: fast access - write and make edits directly in the row buffer

• row buffer conflict: slow access - write back the existing row, open and read the new row, and update the row buffer
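The three cases for reads and writes can be illustrated with a toy model. This is a simplified sketch, not a description of any real DRAM controller: the class `RowBufferDram` and its methods are hypothetical, timing is ignored, and only the hit/miss/conflict classification and the write-back behavior are modeled.

```python
class RowBufferDram:
    """Toy DRAM array with a single row buffer."""

    def __init__(self, rows, words_per_row):
        self.cells = [[0] * words_per_row for _ in range(rows)]
        self.open_row = None   # index of the row held in the buffer
        self.buffer = None     # contents of the row buffer

    def _activate(self, row):
        """Bring `row` into the buffer and classify the access."""
        if self.open_row == row:
            return "hit"
        if self.open_row is None:
            kind = "miss"
        else:
            # Conflict: write the buffered row back before replacing it.
            self.cells[self.open_row] = list(self.buffer)
            kind = "conflict"
        self.buffer = list(self.cells[row])
        self.open_row = row
        return kind

    def read(self, row, col):
        kind = self._activate(row)
        return self.buffer[col], kind

    def write(self, row, col, value):
        kind = self._activate(row)
        self.buffer[col] = value  # edits happen in the buffer
        return kind


dram = RowBufferDram(rows=4, words_per_row=8)
print(dram.write(0, 1, 42))  # first access to any row
print(dram.read(0, 1))       # same row again
print(dram.read(1, 0))       # different row
```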

Problem 6: Memory Implementation

a) Consider the design of a (very) small asynchronous-read register file block of 4 words by 4 bits each, and with two read ports and one write port. You want to implement the register memory cells as positive edge-triggered flip-flops. Draw the circuit diagram for your design using the flip-flop cells, multiplexers, and logic gates.

Solution:

b) 251A only - 151 Optional. Now consider the redesign of the register file from part a) using latches instead of flip-flops. For this design, as above, the write operation occurs on the positive edge of the clock, but now the output data on a read become available after the falling edge of the clock.

Solution:

Problem 7: Memory Blocks [10pts]

You are given a simple dual port (SDP) memory block that is 128x8. Show how you would use multiple instances to design a memory that has 2 independent read ports and is 256x8.

Solution:

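Use two 128x8 blocks to make a 256x8 memory (increased depth), using the top address bit to select between them. Then duplicate that pair (4 blocks total) and write the same data to both copies, so that each copy serves one independent read port.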
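A behavioral sketch of this arrangement is given below. It models each SDP block as one write port plus one read port; the class names are hypothetical, and the choice of address bit 7 as the bank select is an assumption.

```python
class SdpBlock:
    """128x8 simple dual-port block: one write port, one read port."""

    DEPTH = 128

    def __init__(self):
        self.mem = [0] * self.DEPTH

    def write(self, addr, data):
        self.mem[addr] = data & 0xFF

    def read(self, addr):
        return self.mem[addr]


class TwoReadPortMemory:
    """256x8 memory with 2 read ports, built from four SdpBlock instances."""

    def __init__(self):
        # copies[port][bank]: one full 256x8 copy per read port.
        self.copies = [[SdpBlock(), SdpBlock()] for _ in range(2)]

    def write(self, addr, data):
        bank, offset = addr >> 7, addr & 0x7F
        for copy in self.copies:       # keep both copies identical
            copy[bank].write(offset, data)

    def read(self, port, addr):
        bank, offset = addr >> 7, addr & 0x7F
        return self.copies[port][bank].read(offset)


mem = TwoReadPortMemory()
mem.write(200, 0xAB)
mem.write(5, 0x12)
print(hex(mem.read(0, 200)), hex(mem.read(1, 5)))
```

Each read port only ever uses the read port of the blocks in its own copy, so the two reads can target different addresses in the same cycle.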
