Leakage power minimisation techniques for embedded processors

Figure 3.5: Sub-clock power gating timing (signals shown: Clock, Comb Logic, ISOLATE, VVDD)

is driven to a logic 1, thereby isolating the combinational outputs. When the clock is logic 0, ISOLATE is held at logic 1 while the VVDD input remains at logic 0 (discharged).

This ensures the combinational outputs remain isolated until the supply rail is charged to an equivalent logic 1, eliminating short-circuit (crowbar) currents during wake-up. The output is ANDed with the nOverride signal, which is used to disable sub-clock power gating; this ensures that when the proposed technique is disabled, the isolation gates consume no additional switching energy.
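The isolation control described above can be sketched as a small Boolean model. This is a minimal sketch and not the circuit of Fig. 3.4: it assumes that ISOLATE is asserted whenever the clock is high or the virtual rail has not yet charged to logic 1, and the function name and Boolean form are illustrative only.

```python
def isolate(clk: int, vvdd: int, n_override: int) -> int:
    """Hypothetical Boolean model of the isolation control signal.

    Assumption (not stated verbatim in the text): ISOLATE is asserted while
    the clock is high, and stays asserted in the low phase until the virtual
    rail VVDD reads as logic 1. The result is ANDed with nOverride so that
    isolation never asserts when sub-clock power gating is disabled.
    """
    return (clk | (1 - vvdd)) & n_override


# Print the full truth table (nOverride, clock, VVDD, ISOLATE) for inspection.
for n_override in (1, 0):
    for clk in (1, 0):
        for vvdd in (0, 1):
            print(n_override, clk, vvdd, isolate(clk, vvdd, n_override))
```

With nOverride at 0 the model never asserts ISOLATE, which corresponds to the isolation gates not toggling when the technique is disabled.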

To help understand what happens to the combinational logic under the proposed technique, the timing diagram of one clock period in sub-clock power gating is shown in Fig. 3.5. After the next logic state is clocked into the rising-edge triggered registers, the VVDD rail to the combinational logic is disconnected from the VDD supply. Because the virtual supply rail is capacitive [3], it takes time to discharge, which ensures that register hold times (Thold), which are on the order of ps in modern technology libraries [155], are met. At this point, the output isolation is also enforced. The virtual supply rail is held off for the remainder of the high phase of the clock (Tpgoff), minimising leakage power dissipation, and the outputs of the combinational domain remain isolated (Tisolate). Note that by changing the duty cycle of the clock it is possible to extend this off period (the high phase of the clock), maximising the leakage power savings. The virtual supply rail is restored at the negative edge of the clock, but the output isolation is held until the virtual supply rail is fully restored (Tpgstart). The remainder of the clock period is used for the evaluation of the next state (Teval) and for ensuring that the setup time (Tsetup) is met before the process repeats in the next clock period.
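The trade-off between the off period and the evaluation time can be illustrated with a simple timing budget. This is a minimal sketch under stated assumptions: Tpgoff is taken to be the whole high phase of the clock (the short discharge interval after the rising edge is ignored), and all numeric values are hypothetical rather than taken from the case studies.

```python
def timing_budget(period_ns, duty_cycle, t_pgstart_ns, t_setup_ns):
    """Return (Tpgoff, Teval) in ns for one clock period.

    Simplification: the rail is treated as off for the entire high phase,
    and the low phase is shared by Tpgstart, Teval and Tsetup.
    """
    if not 0.0 < duty_cycle < 1.0:
        raise ValueError("duty_cycle must be strictly between 0 and 1")
    t_pgoff = period_ns * duty_cycle            # high phase: rail held off
    t_low = period_ns - t_pgoff                 # low phase: wake-up, evaluate, setup
    t_eval = t_low - t_pgstart_ns - t_setup_ns  # time left for next-state evaluation
    if t_eval <= 0:
        raise ValueError("low phase too short for wake-up and setup")
    return t_pgoff, t_eval


# Hypothetical numbers: 100 ns period, 10 ns rail restore, 1 ns setup.
for duty in (0.5, 0.7):
    t_pgoff, t_eval = timing_budget(100.0, duty, 10.0, 1.0)
    print(f"duty={duty}: Tpgoff={t_pgoff:.1f} ns, Teval={t_eval:.1f} ns")
```

Raising the duty cycle lengthens Tpgoff but shrinks Teval by the same amount, so the off period can only be extended while the combinational logic still evaluates in the time that remains.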

3.2.2 Design Flow

The design flow to augment a digital design with the proposed SCPG technique is shown in Fig. 3.6; three additional steps, indicated in the figure, are added to a traditional power gating design flow (Chapter 1, Section 1.4.1.1). A brief summary of each of these steps is given here, and further details are given when discussing the fabrication of a sub-clock power gated case study in Chapter 4, Section 4.3. The design flow begins with the original RTL of the circuit that is to be mapped to a sub-clock power gating architecture.

In order to achieve the power domain split shown in Fig. 3.3, the RTL must be written with separate Verilog modules for the combinational and sequential logic so that a UPF file can be used (Chapter 1, Section 1.4.1.1). This is a constraint of the UPF standard [46] and is the primary reason the first two process steps of the design flow are required.

If the original RTL is easily split into combinational and sequential logic Verilog modules, then the first two steps shown in Fig. 3.6 can be skipped and the split can be performed manually, as will be shown in Section 3.3.1. If, however, the HDL description consists of intertwined combinational and sequential logic, the first step is used to synthesise the design to a generic gate library available through the EDA tool vendor, which in the case of Synopsys EDA tools is the GTECH library [17]. The output of this step is a flat gate level netlist of the circuit, which enables a Perl script to be used in the second step to identify sequential and combinational logic gates and separate them into two individual Verilog modules. The output of the second step is then the same GTECH gate level netlist from the first step, but with the combinational logic in one module and the sequential logic in another. The third and final additional step merges this gate level netlist with the isolation circuit shown in Fig. 3.4 in a top level wrapper, along with definitions for the control signals used for the power gates. The complete, split netlist can then be combined with the power intent UPF of the sub-clock power gating, which defines the power domains, power switches and isolation to match the architecture shown in Fig. 3.3. The rest of the design flow remains the same as a traditional power gating design flow.
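The second step above, separating sequential from combinational gates in a flat netlist, can be sketched as follows. This is an illustrative Python sketch and not the Perl script used in the flow: it assumes one gate instance per line, and the cell-name prefixes, pin names and example netlist are hypothetical choices that a real flow would take from the generic library documentation.

```python
import re

# Assumed cell-name prefixes treated as sequential (illustrative only).
SEQUENTIAL_PREFIXES = ("GTECH_FD", "GTECH_FJK", "GTECH_LD")

# Matches one simple instance per line: CELL instance ( ... );
INSTANCE_RE = re.compile(r"^\s*(\w+)\s+(\w+)\s*\(.*\)\s*;\s*$")


def split_instances(netlist_lines):
    """Partition gate instances into (combinational, sequential) lists.

    Each list holds (cell, instance_name) tuples in netlist order.
    """
    comb, seq = [], []
    for line in netlist_lines:
        match = INSTANCE_RE.match(line)
        if match is None or match.group(1) == "module":
            continue  # skip module headers, declarations and blank lines
        cell, name = match.group(1), match.group(2)
        target = seq if cell.startswith(SEQUENTIAL_PREFIXES) else comb
        target.append((cell, name))
    return comb, seq


# Hypothetical flat netlist fragment used only to exercise the function.
example = """
module top (clk, a, b, q);
  input clk, a, b;
  output q;
  wire n1;
  GTECH_AND2 U1 (.A(a), .B(b), .Z(n1));
  GTECH_FD1 q_reg (.D(n1), .CP(clk), .Q(q));
endmodule
""".splitlines()

comb, seq = split_instances(example)
print("combinational:", comb)
print("sequential:", seq)
```

A complete tool would also have to emit the two Verilog modules and the ports for every net that crosses between them; the sketch covers only the classification.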

3.3 Simulation Results

To validate the sub-clock power gating technique, three case studies were used: a 16-bit parallel binary multiplier, an ARM Cortex-M0 microprocessor and an ASIC wireless sensor node processor, the Event Processor [63]. The flow used to obtain all the results presented in this section is shown in Fig. 3.7. A brief summary of the steps in Fig. 3.7 is given here; further details of how the HSpice simulation of the circuits was conducted can be found in Appendix B.3. Firstly, the designs were all implemented using the implementation flow described in Fig. 3.6, using a nominal 1.2V 90nm technology library² and the Synopsys EDA tool suite. A full transistor level netlist, including parasitic resistors and capacitors (RC) of all signals and power grids, was then extracted from the placed and routed design using the Synopsys Star-RC tool to ensure the simulation provided an accurate representation of timing and power. Simulation vectors were captured from the Verilog simulation of the gate level netlist and ported into a digital vector file used in HSpice for the transistor level netlist simulation. The post layout simulation was then carried out using Synopsys HSpice at a scaled voltage of 0.6V. This voltage was chosen because it remains adequately above the 0.4V threshold voltages of the transistors, alleviating problems from near-threshold operation, whilst being half the nominal 1.2V voltage, giving large dynamic and leakage power savings; it is also representative of the scaled voltage used in low performance applications [60, 64, 153].

Figure 3.6: Design flow of the sub-clock power gating technique

Figure 3.7: Experimental flow for generation of sub-clock power gating power results

² Synopsys 90nm Education Kit available from Synopsys

Finally, the power and energy values were extracted from the simulation results and recorded in Tables 3.2, 3.3 and 3.4.
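The choice of a 0.6V supply described above can be put into rough numbers. This is a minimal sketch assuming the standard first-order model in which dynamic power scales with the square of the supply voltage at fixed frequency and switched capacitance; the text does not state this model explicitly, and leakage is not modelled here.

```python
V_NOMINAL = 1.2    # nominal library voltage (V), from the text
V_SCALED = 0.6     # scaled simulation voltage (V), from the text
V_THRESHOLD = 0.4  # transistor threshold voltage (V), from the text


def dynamic_power_ratio(v_scaled, v_nominal):
    """First-order dynamic power ratio, assuming P_dyn is proportional to V^2."""
    return (v_scaled / v_nominal) ** 2


ratio = dynamic_power_ratio(V_SCALED, V_NOMINAL)
headroom = V_SCALED - V_THRESHOLD  # margin above the threshold voltage

print(f"dynamic power ratio: {ratio:.2f}")
print(f"headroom above threshold: {headroom:.2f} V")
```

Under this assumption, halving the supply reduces dynamic power at a fixed frequency to roughly a quarter, while the supply stays about 0.2V above the threshold voltage.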

An integral part of implementing power gating is the choice of sleep transistors. In all the test cases, PMOS transistors are used and, as discussed in Chapter 1, Section 1.4.1, the inclusion of the header transistors introduces a small IR drop to the power gated logic. As reported in previous publications, the header transistor size, the number of headers and their arrangement directly affect the IR drop across the power domain [3, 40]. With a lower IR drop, the impact on performance is reduced, and the time taken to reach an active state from power down is also reduced because a higher current can be supplied through the power gating transistors. As shown in Chapter 1, Section 1.4.1, though, including many header transistors can have a negative impact on in-rush current, causing ground bounce [3, 42]. Constraints of up to 5% IR drop are common in many power gating designs. Achieving a very low IR drop is important in high performance systems [45] but can result in unnecessarily large effective power gating widths, increasing sleep mode leakage current and area overhead [3]. As such, iterative simulation was used to find the widths required for a 5% IR drop in

Figure axes: IR Drop (mV), Ground Bounce (µV)
