4.1.4 Area, Power and Delay Evaluation Methodology

One objective of this thesis is to provide efficient reliability solutions that incur minimal performance, power and area costs while still delivering the high reliability levels of traditional defect tolerance techniques. Area, power and delay studies therefore require their own evaluation tools and methodology.

We use an in-house¹ path-finding power, area and delay tool that models the processor's micro-architectural blocks and units. This model drives the power, area and delay analysis and takes into consideration the particular implementation of specific micro-architectural blocks. For cache-like and array structures, our model is based on CACTI 5.3 [200]. For the remaining structures (such as combinational logic, wiring and clocking), our model ports and extends Wattch 1.0 [27]. Unlike Wattch, our model works with new CACTI versions, interfaces with an advanced timing simulator and incorporates specific Intel®-internal values. An alternative model such as McPAT [99] was not used because it became publicly available and stable only after we had begun evaluating some of our techniques. The models have been parameterized for a 32nm technology node.

¹ Developed by the Intel®

Note that our model relies neither on costly and slow computer-aided design circuit tools (such as HSPICE) nor on electronic design automation tools. The reasons are twofold. First, the circuit-level implementation of our baseline processor was not available. Second, tools like CACTI and Wattch provide processor architects with power, area and delay modeling at abstraction levels above circuits and schematics. This makes it possible to explore and prune the design space early on, using faster, higher-level tools [27, 200].

The power component also counts the number of times some predefined microarchitectural events occur. For example, we count the number of times a register is read or written. This is done for every major block in the micro-architecture during program execution. The peak power of individual units and these machine utilization statistics are used to calculate the runtime power dissipation. However, to evaluate the power overheads of our solutions, we focus on peak dynamic power². Peak power numbers are obtained from maximum activity factors and the maximum energy per event. Peak power defines the maximum power consumption of a processor and therefore provides upper-bound estimates. Furthermore, this power metric critically impacts the reliability of the processor [191]. As a consequence, the power overheads we report are pessimistic.
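The relation between event counts, peak power and runtime power described above can be illustrated with a minimal sketch. It assumes that a unit's runtime power scales linearly with its utilization and that an idle unit dissipates nothing; the thesis does not state the exact scaling rule, and the names `UnitPower`, `runtime_power` and `peak_power`, as well as all numbers, are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class UnitPower:
    """Hypothetical per-unit record; field names are illustrative only."""
    peak_power_w: float        # peak dynamic power of the unit, in watts
    max_events_per_cycle: int  # e.g. number of access ports
    events: int                # events counted during the simulated run


def runtime_power(units, cycles):
    """Estimate runtime dynamic power from event counts.

    Assumption: power scales linearly with utilization, where utilization
    is the counted events divided by the maximum possible events.
    """
    total = 0.0
    for unit in units.values():
        utilization = unit.events / (cycles * unit.max_events_per_cycle)
        total += unit.peak_power_w * min(utilization, 1.0)
    return total


def peak_power(units):
    """Upper bound: every unit at its maximum activity."""
    return sum(unit.peak_power_w for unit in units.values())


# Placeholder values, not measurements from the thesis.
units = {
    "register_file": UnitPower(peak_power_w=2.0, max_events_per_cycle=4, events=1_000),
    "issue_queue": UnitPower(peak_power_w=1.5, max_events_per_cycle=2, events=500),
}
cycles = 1_000

print(runtime_power(units, cycles))
print(peak_power(units))
```

Under these assumptions the runtime figure is always bounded by the peak figure, which is why overheads reported in terms of peak power are pessimistic.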

The main blocks that the model incorporates fall into these categories:

• Array structures: Caches, cache tag arrays, TLBs, branch prediction structures, rename tables, free lists, register files, the ROB, the issue queue payload RAM and register scoreboard, as well as the load-store queue payload RAM.

• Fully Associative Content-Addressable Memories: Issue queue wake-up logic, load-store queue memory checks.

• Combinational Logic: Decoders, renaming intra-bundle dependency checking, selection logic, functional units and ROB walk (RAT recovery) logic.

• Data wires: Result and bypass buses.

• Global clocking: Clock buffers, clock wires, etc.

² In CMOS processors, dynamic power consumption ($P_d$) is the main source of power consumption, and is defined as $P_d = C \cdot V_{dd}^2 \cdot a \cdot f$, where $C$ is the load capacitance, $V_{dd}$ is the supply voltage and $f$ is the clock frequency. The activity factor $a$ is a value between 0 and 1 indicating how often clock ticks lead to switching activity on average.
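The dynamic power expression given in footnote 2 can be evaluated directly, as in the following sketch. The function name `dynamic_power` and the capacitance, voltage and frequency values are illustrative assumptions, not figures from the model.

```python
def dynamic_power(c_load_f, vdd_v, activity, freq_hz):
    """Dynamic power P_d = C * Vdd^2 * a * f, in watts.

    c_load_f: load capacitance in farads
    vdd_v:    supply voltage in volts
    activity: activity factor a, between 0 and 1
    freq_hz:  clock frequency in hertz
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity factor must lie in [0, 1]")
    return c_load_f * vdd_v ** 2 * activity * freq_hz


# Placeholder operating point: 1 nF switched capacitance, 1.0 V, 3 GHz.
p_peak = dynamic_power(1e-9, 1.0, activity=1.0, freq_hz=3e9)
p_typical = dynamic_power(1e-9, 1.0, activity=0.2, freq_hz=3e9)
print(p_peak, p_typical)
```

Setting the activity factor to its maximum yields the peak dynamic power, which is the metric used for the overhead estimates in this section.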

The design, structure and sizing of the micro-architectural blocks (described in Table 4.2) are used to derive their representation and parametrization in our power-area-delay model. A single high-level logical microarchitectural structure is sometimes represented as several components in the model. For example, the issue queue is represented as a CAM memory and a RAM memory (modeled by CACTI), plus combinational logic and wiring (modeled as in Wattch).

For array structures and CAM memories, CACTI allows specifying a block configuration based on parameters such as: cache type (i.e. data arrays, data+tag arrays, and DRAM arrays), structure size, associativity, line size, number of read, write and read/write ports, technology, voltage, frequency, temperature, number of banks, output/input bus width, explicit tag size, tag and data access mode (i.e. fast, sequential, normal) and transistor type (high-performance, low stand-by power, low operating power, DRAM).

CACTI allows specifying optimization criteria and constraints in order to find a design that better suits the user's needs. This lets the user skip many of the low-level details of the components being modeled and relieves the architect of having to work out every detail. Configurations are evaluated by assigning a weight to each optimization criterion (delay, leakage power, dynamic power, cycle time and area), and the solution space is pruned based on the maximum deviation with respect to the best solutions found during the process. Alternatively, the user can specify a design exploration criterion based on energy-delay (ED) or energy-delay-squared (ED²).
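To make the ED and ED² criteria and the deviation-based pruning more concrete, the sketch below ranks a handful of made-up candidate configurations. It is not CACTI's internal search procedure; the `Candidate` type, the helper functions and all numbers are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical array configuration with its estimated costs."""
    name: str
    delay_ns: float
    energy_nj: float


def ed(c):
    """Energy-delay product."""
    return c.energy_nj * c.delay_ns


def ed2(c):
    """Energy-delay-squared product, which weighs delay more heavily."""
    return c.energy_nj * c.delay_ns ** 2


def prune(candidates, metric, max_deviation):
    """Keep candidates within max_deviation (a fraction) of the best metric."""
    best = min(metric(c) for c in candidates)
    return [c for c in candidates if metric(c) <= best * (1.0 + max_deviation)]


# Placeholder design points, not CACTI output.
candidates = [
    Candidate("A", delay_ns=1.0, energy_nj=2.0),
    Candidate("B", delay_ns=0.8, energy_nj=3.0),
    Candidate("C", delay_ns=1.5, energy_nj=1.2),
]

best_ed = min(candidates, key=ed)
best_ed2 = min(candidates, key=ed2)
within_30pct_delay = prune(candidates, lambda c: c.delay_ns, 0.30)
print(best_ed.name, best_ed2.name, [c.name for c in within_30pct_delay])
```

In this toy data set the ED criterion prefers the low-energy design point, whereas ED² prefers the fastest one, which is the behavior wanted when targeting performance-oriented processors.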

The ED² optimization criterion has been chosen for most blocks, as we target performance-oriented processors. The blocks affected by our techniques are checked to ensure that they meet the processor cycle time (the target clock rate is used as a design constraint). Blocks that are time-critical have been optimized by CACTI using other constraints. For example, the bypass network is time-critical because it is routed over the functional units and the register files [140]. As a consequence, the register files have been optimized by prioritizing area and dynamic power, since a smaller register file shortens the wires routed over it.

For combinational logic, data buses and clocking structures, our power-area-delay model is heavily based on Wattch. Next, we provide details of several of our microarchitectural components.

• Instruction Decoders: In this case, we have used internal values from previous Intel® products, scaled by process technology and frequency.

• Intra-Bundle Dependency Checking Logic: Two parallel intra-bundle dependency checking blocks handle RAW and WAW dependencies. The area and power of each block are computed based on the number of comparators and their capacitance. Delay is assumed to be lower than the RAT access time, as noted by Palacharla et al. [140].

• Functional Units: In this case, we have used internal values from previous Intel® products, scaled by process technology and frequency.

• Write-Back Bus and Bypass Network: The number of wires equals the data width, times the number of stacks that produce a value across all the execution ports, times the number of stacks of the same type. The result bus power is computed from technology-specific internal wire capacitances and the clock frequency. The areas of the functional units and the register files are used to compute the result bus length [140], which is multiplied by the capacitance per unit length. Tristate buffers are used to model input multiplexors.

• Select Logic: We follow the approach of Wattch (and McPAT): we model it as a tree of cascaded arbiters, where each arbiter in the tree handles up to four selection requests. Select requests traverse the tree down to the root arbiter, and a bid answer traverses up to a leaf arbiter, which eventually selects an instruction. An arbiter is modeled as OR gates and priority encoders. Globally, as many trees as the number of execution ports are modeled. The centralized select logic that manages resource conflicts is included in our framework.

• Wake-up Logic: We follow the approach of Wattch (and McPAT): the CAM search operation serves as the wake-up logic for the issue queue. We model both the tag drive (including the power and area to write new tags) and the tag match components. This includes the buffers to drive the destination tags, taglines, comparators, wordlines, bitlines, matchlines and OR gates to produce the readiness bits [139].

• LSQ Checking Logic: The CAM search operation also models the detection of store-to-load forwarding and memory-ordering violation scenarios. Full-length addresses are used in the CAM matches. The load and store queue CAM memories are modeled separately, but in a similar way to the wake-up logic in the previous item. Unlike Wattch, our power and area model also accounts for the comparators that handle age information and the priority encoders that choose the youngest among the older forwarding stores.

• ROB Walk Logic: This logic is modeled similarly to the intra-bundle dependency checking logic (the second item). In this case only WAW dependencies are handled, but since the RAT can be recovered by either undoing or redoing register mappings, two independent blocks are needed. In addition, we account for the power and area needed to store and access the register mapping fields (which are kept in separate ROB banks).

• Global Clock: We enhance Wattch's H-tree model, in which the global clock signal is routed to all portions of the chip using equal-length metal wires and clock buffers. Unlike Wattch, the model also accounts for the bits required to latch each stage, and uses the processor area computed by CACTI or obtained from internal Intel® values.
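Two of the component models above lend themselves to a quick back-of-the-envelope calculation: the comparator count of the intra-bundle dependency checking logic and the size of a select tree built from four-input arbiters. The sketch below is a simplification under stated assumptions. It assumes that each instruction is compared only against the earlier instructions in its bundle, and that each instruction has two source operands and one destination. The thesis does not give these formulas, and the function names are hypothetical.

```python
import math


def dependency_comparators(bundle_width, sources_per_inst=2):
    """Comparator counts for intra-bundle RAW and WAW checks.

    Assumption: instruction i is compared against the destinations of the
    i earlier instructions in the same bundle.
    """
    pairs = bundle_width * (bundle_width - 1) // 2
    return {"raw": pairs * sources_per_inst, "waw": pairs}


def select_tree(num_entries, fan_in=4):
    """Return (levels, arbiters) for a tree of fan_in-input arbiters."""
    levels = 0
    arbiters = 0
    requests = num_entries
    while requests > 1:
        # Each arbiter at this level merges up to fan_in requests into one.
        requests = math.ceil(requests / fan_in)
        arbiters += requests
        levels += 1
    return levels, arbiters


# Illustrative sizes only: a 4-wide bundle and a 32-entry issue queue.
comparators = dependency_comparators(4)
levels, arbiters = select_tree(32)
print(comparators, levels, arbiters)
```

Since one select tree is modeled per execution port, the arbiter count for the whole select logic would be multiplied by the number of ports; area and power would then follow from per-comparator and per-arbiter costs, which are not reproduced here.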