Model for Power Consumption - Reducing Memory Latency by Improving Resource Utilization

Power consumption in computers can be divided into static and dynamic power consumption. Dynamic power consumption for a CMOS chip is shown in Equation II.1.

Pdynamic= Cswitched∗ Vdd2 ∗ fclk (II.1)

Vdd is supply voltage, fclkis frequency and Cswitchedis the total effective switched

capacitance, i.e. is the average capacitance of the transistors and communication lines that are switch in each clock cycle. In synchronous designs such as a micro- processor the clock distribution adds a significant value to Cswitched.

As chips get denser, leakage power increases, and transistors consume more power without switching[12]. This is called static power consumption. Caches have higher leakage currents per stored bit compared to DRAM because of higher transistor count, and will therefore consume more power in denser technologies compared to DRAM.

0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 gzip gcc

crafty mcf swim mgrid

equake applu

vpr

ammp mesa galgel lucas fma parser

sixtrack eon perlbmk gap bzi2p apsi wupwise twolf facerec art IPC Conv_2 Dest_2 Conv_16 Dest_16

Figure II.5: IPC for the applications in the SPEC2000 benchmark suite for conventional (conv) and destructive-read (dest) DRAM with 2 and 16 kbytes cache configuration. The average IPC is 13.7% higher for dest 2 compared to conv 2, and 4.5% higher for dest 16 compared to conv 16.

II.3.0.1 Voltage and Frequency

The following relationship between frequency and power has been described by Khellah and Elmasry[16]:

Td≈

CL∗ Vdd

κ ∗ (Vdd− Vth)α¯

(II.2)

Tdis the delay of a CMOS gate, where CLis the load capacitance, κ is a factor that

depends on the process and gate size, ¯α takes a value between one and two, Vthis

the threshold voltage and Vddis the supply voltage. Even though the formula is not

complicated, a lot of complexity is hidden in the ¯α, Vth, CLand κ factors. However,

what can be read from this formula is that for a given technology and circuit, the maximum operating frequency is a function of supply voltage. This fact is used for power saving in commercial computer systems in a technique called dynamic voltage-frequency scaling. The technique reduces both frequency and voltage for the system when maximum computational performance is not needed.

II.3.1

Power model of DRAM with bus

The data flow of accessing DRAM from cache is shown in Figure II.6. The address and control signals are sent to the DRAM bank. Data are read out of the bank

Figure II.6: Data communication when accessing a memory bank.

and written back along the route shown as a thick gray line. Equation II.3 shows a model for the energy consumption (EM EM) for these components for the execution

of a complete application, e.g. one of the benchmarks from Figure II.5.

EM EM = NACC∗ (EDRAM+ ESW IT CH+ EBU S) (II.3)

NACC is the number of DRAM memory accesses, and for a single access: EDRAM

is energy consumed in the DRAM banks, EBU Sis energy consumed by the bus and

ESW IT CH is energy consumed by the switch and the write-back buffer.

Yong-Ha Park et. al. [25] have made a model for the dynamic power consumption of embedded DRAM at an abstraction level that matches the simulation model that we are using for the processor. We have based Equation II.4 on this model.

EDRAM = NROW ∗ EROW+ NCOL∗ ECOL (II.4)

NROW is the number of row activations, NCOLis the number of column activations,

EROW is energy consumed by activation of a row in the embedded DRAM and

ECOL is energy consumed by activation of a column in the embedded DRAM. All

these variables are for a single memory access. Burst transfer is not used in this work because communication is on-chip and the bus width is increased instead. In this way the column decoder is not needed as one word line of the memory array is read or written simultaneously. Refresh of DRAM is not different for the two DRAM models and is therefore omitted for simplicity. The result is that NCOL

is one and NROW is the number of data bits which is 512 in our model. The

consequence is that EDRAM is equal for all memory accesses because each access

activates one row and 512 columns. In order to quantify EDRAM we use a DRAM

macro made by Morishita et.al [21]. The power consumption is 0.260 W at 250 MHz (in 130 nm technology) for accessing a 16 Mb bank. This is approx. 1 nJ for one access with 128 bits, and we conservatively multiply this with 4 for 512 bits width resulting in EDRAM = 4nJ for our DRAM bank. Power consumption

depends on DRAM macro size as this impacts wire lengths, the number of cells activated in parallel, etc. Smaller macros consume less power while larger macros

consume more power per access. Research on interconnection is a large topic and several techniques exists (see for example [27]). Ron Ho et.al.[9] have made efforts to quantify energy consumption for buses, and found that a wire of length 10 mm in 180 nm with 1.8 volt require < 1 pJ per bit for techniques with voltage swing reduction and < 10 pJ for a simple bus wire. We conservatively use the simple wire and 10 mm bus (10 pJ). This result in EBU S = 5.4pJ for 544 bus lines (512 data

+ 32 address).

The power consumption for the write-back buffer and switch was modeled as a 2 kB cache with four banks and 64 bytes block size in CACTI[29] (in 180 nm technology). It consumes 0.31 nJ per bank per access. We conservatively use ESW IT CH= 1nJ .

This results in EM EM = 10.5nJ ∗NACC. The same energy model is applied to both

destructive-read DRAM and conventional DRAM, and used for both read and write operation. This is acceptable as our main goal is to compare the two techniques, not necessarily to have exact estimates. There are fewer operations performed in the destructive-read DRAM since no write-back is performed during each read. Static power consumption is not included in this power model as it is not different for conventional and destructive-read DRAM. Using destructive-read DRAM results in shorter execution time and the static power consumption will be slightly lower. If in error, the power consumption penalty of using the destructive-read DRAM will therefore be overestimated.

In document Reducing Memory Latency by Improving Resource Utilization (Page 122-125)