Methodology - On the design of power- and energy-efficient functional units for vector processors

This section describes the framework used to perform the design space exploration of the VA, as well as the framework parameters and the test benchmarks that we use.

4.2.1 Framework

The framework is depicted in Figure 4.1. It is based on the PAS estimation flow (Chapter 3), so the corresponding explanations and details are valid here. We therefore focus on the added features, i.e. the differences that are characteristic of this framework.

The framework includes architectural-level (VectorSim) and circuit-level (RC, NCSim) simulators and tools, as well as an interfacing tool (tBenchGen). For various circuit- and architectural-level parameters (explained in Section 4.2.2) we obtain the metrics of interest of our VA (described in Table 4.1).

The first step in the framework is to feed VectorSim (introduced in Chapter 2)

with the vector parameters (microbenchmark (uBench), maximum vector length

(MVL), number of lanes (nL)) as well as some design parameters (number of stages

(nS), clock period (TCLK)). This stage generates data and timing traces. The traces

include information about vector additions only, excluding scalar additions.

The next step is transforming this architectural-level information into Verilog test benchmarks (tBenchs) using tBenchGen (introduced in Section 4.2.3). We generate independent tBench files for each lane of the VA. Section 4.2.4 describes the different tBenchs that we generate. Apart from the traces, we include here one more parameter (CGable), which indicates whether the design has clock-gating support.

We incorporate the design parameters (AF, TCLK, nS, CGable) into handcrafted HDL code of the adders, which is supplied to Cadence RTL Compiler [32] (RC PLE Synthesis) to produce the synthesized, mapped netlists of the different adders and to perform static timing analysis (STA). For multi-lane VA configurations, the adders in all lanes are identical.

[Figure 4.1 diagram. Labels recovered from the original: input parameters (uBench, MVL, nL, nS, f, CGable, AF); processor type; architectural parameters; circuit-level parameters; architectural level; circuit level; exploration part; files: traces, tBench, .v, .sdf, .vcd; metrics: P, t, A; Pareto-optimal graphs of VA configurations in the space of metrics of interest.]

Figure 4.1: Block diagram of the framework’s steps, parameters, and metrics.

The next step is to simulate each synthesized adder in NCSim [58] for each matching tBench with back-annotated delays using standard delay format (sdf) files [100]. This is done in order to obtain the execution time te, verify the synthesized designs, and extract the resulting switching activity information using Value Change Dump (vcd) files [110]. The final step of the framework is a precise computation of the power metrics (P and Prest) using RC Power Simulation. The inputs are the synthesized designs in Verilog and the vcd files.
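The exploration part of Figure 4.1 reduces the simulated configurations to Pareto-optimal sets in the space of the metrics of interest. A minimal Python sketch of such a filter follows. It assumes that every metric is to be minimized, and the configuration names and numbers are hypothetical placeholders, not results from this work.

```python
from typing import Dict, List, Tuple

Metrics = Tuple[float, ...]  # e.g. (P, te); all assumed "lower is better"


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is no worse than b in every metric and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: Dict[str, Metrics]) -> List[str]:
    """Return the names of the configurations that no other configuration dominates."""
    return [
        name
        for name, m in points.items()
        if not any(
            dominates(other, m)
            for other_name, other in points.items()
            if other_name != name
        )
    ]


# Illustrative placeholder values only (power, execution time), arbitrary units.
configs = {
    "cfg_a": (1.0, 4.0),
    "cfg_b": (2.0, 2.0),
    "cfg_c": (3.0, 3.0),  # dominated by cfg_b
    "cfg_d": (4.0, 1.0),
}
front = pareto_front(configs)
print(front)
```

The quadratic pairwise comparison is adequate for a design space of the size explored here; for much larger spaces a sort-based filter would be preferable.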

4.2.2 Framework Parameters

We first present the vector processor specific parameters:

• uBench is a vectorized microbenchmark (kernel) extracted from an application, and it consists of integer data. It is a representative part of the application and small enough (between 100k and 150k test vectors) to keep circuit simulation time reasonable. We use three different uBenchs extracted from three vectorized SPEC applications (described in Table 4.2) that are used in mobile devices and can also be found in server workloads. The uBenchs contain both long and short vector data, so our application set is comprehensive for our needs. They are addition intensive (additions account on average for 27.14% of the total instructions executed), so the adder’s impact on the te of the uBenchs is significant.

• MVL is the already mentioned maximum vector length of the vector processor. Possible values are 16 and 128, which represent both extremes of short and long maximum vector lengths. We chose 16 as a short vector length that is close to SIMD extensions, while 128 represents long maximum vector lengths.

• nL is the number of vector lanes. Possible values are 1, 2, and 4.

Table 4.1: Metrics of Interest. Explained in detail in Section 3.2.

Measured Metrics

P and Prest. Average power of the FU (or FUs if we have more than one lane), including and excluding the clock tree, respectively.

te. The execution time of a test benchmark tBench, also referred to as Delay (D).

A. The area of the FU.

Derived Metrics

Pd. Surface power density = P/A. By the Stefan-Boltzmann law, it is proportional to the fourth power of the temperature of the given surface (Pd ∝ T^4) [104].

E = PDP. The Power-Delay product is the total energy spent in the FU during a tBench.

PdDP, EDP, E2DP, and E3DP are commonly used Power density-Delay and Energy-Delay products [75].
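The derived metrics of Table 4.1 follow from the three measured quantities. The Python sketch below is a minimal illustration in caller-chosen, consistent units. Pd = P/A and E = P·te are taken from the table, whereas reading EDP as E·te and PdDP as Pd·te is an assumption based on the usual meaning of those names. Section 3.2 holds the authoritative definitions, so E2DP and E3DP are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measured:
    p: float   # average power P of the FU(s), clock tree included
    te: float  # execution time of the tBench (delay D)
    a: float   # area of the FU


def derived_metrics(m: Measured) -> dict:
    """Compute derived metrics from the measured ones."""
    pd = m.p / m.a   # surface power density, Pd = P/A (Table 4.1)
    e = m.p * m.te   # energy, E = PDP = P * te (Table 4.1)
    return {
        "Pd": pd,
        "E": e,
        "PdDP": pd * m.te,  # assumed reading: power density times delay
        "EDP": e * m.te,    # assumed reading: energy times delay
    }


# Placeholder numbers in arbitrary units, not results from this work.
example = derived_metrics(Measured(p=2.0, te=3.0, a=4.0))
print(example)
```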

The design parameters, which are primarily needed for the circuit-level part of the framework, are CGable, AF, TCLK (= 1/f), and nS. The explanations of CGable, AF, TCLK, and nS in Section 3.3.2 are valid here as well.
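The architectural part of the design space is the Cartesian product of the vector-processor parameter values listed above. A short Python sketch that enumerates it is given below. The lowercase uBench labels are shorthand introduced here for the applications of Table 4.2, and the circuit-level parameters (CGable, AF, TCLK, nS) are deliberately omitted because their value ranges are given in Section 3.3.2 rather than in this section.

```python
from itertools import product

UBENCHS = ("hmmer", "facerec", "h264ref")  # the three uBenchs of Table 4.2
MVLS = (16, 128)                           # maximum vector lengths, MVL
LANES = (1, 2, 4)                          # number of vector lanes, nL

# One dictionary per architectural configuration fed to the vector simulator.
arch_configs = [
    {"uBench": u, "MVL": mvl, "nL": nl}
    for u, mvl, nl in product(UBENCHS, MVLS, LANES)
]

print(len(arch_configs))  # 3 * 2 * 3 = 18 architectural configurations
```

Each of these configurations would then be combined with the circuit-level parameter values to obtain the full set of design points.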

4.2.3 Test Benchmarks Generator - tBenchGen

Table 4.2: Vectorized microbenchmarks (uBench)

Hmmer (SPEC2006) applies profile Hidden Markov Models (HMMs) and is useful in many areas such as speech synthesis, handwriting, gesture recognition, part-of-speech recognition, machine translation, bioinformatics, etc.

Facerec (SPEC2000) is an implementation of a face recognition system.

H264ref (SPEC2006) is the reference implementation of the H.264/AVC standard for video compression, which is currently one of the most commonly used formats for the recording, compression, and distribution of high-definition video.

To transform the architectural-level information into Verilog tBenchs, we developed a tool written in Perl named tBenchGen. The most important inputs are: the data and timing traces from VectorSim, all circuit- and architectural-level parameters, CGtype (explained in Section 6.5.3), the data bit-width, and the tBench length expressed in the number of Verilog test vectors. As output, it provides a tBench for each lane separately, together with its profiling report.
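The translation that tBenchGen performs, from trace data to per-lane test vectors, can be illustrated with a minimal Python sketch. This is not the Perl tool itself. The element-to-lane mapping (element i goes to lane i mod nL), the trace layout (a list of operand pairs), and the hexadecimal output format are all assumptions made for the example.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # the two operands of one element-wise addition


def split_by_lane(operands: List[Pair], n_lanes: int) -> List[List[Pair]]:
    """Distribute the element pairs of a vector addition across lanes.

    Assumed mapping: element i is handled by lane i % n_lanes.
    """
    lanes: List[List[Pair]] = [[] for _ in range(n_lanes)]
    for i, pair in enumerate(operands):
        lanes[i % n_lanes].append(pair)
    return lanes


def to_test_vectors(pairs: List[Pair], width: int = 32) -> List[str]:
    """Format operand pairs as fixed-width hex strings, one test vector per entry."""
    digits = (width + 3) // 4   # hex digits needed for the data bit-width
    mask = (1 << width) - 1     # wraps negative values to two's complement
    return [f"{a & mask:0{digits}x} {b & mask:0{digits}x}" for a, b in pairs]


# Hypothetical 4-element vector addition on a 2-lane, 8-bit configuration.
trace = [(1, 2), (3, 4), (5, 6), (255, -1)]
per_lane = [to_test_vectors(lane, width=8) for lane in split_by_lane(trace, n_lanes=2)]
print(per_lane)
```

Keeping the lane split separate from the formatting step mirrors the fact that one tBench is produced per lane, while the data bit-width only affects how each vector is written out.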

4.2.4 Test Benchmarks

We generate two kinds of tBenchs: app-tBenchs are obtained from real uBenchs, and synth-tBenchs are synthetic. The app-tBenchs are a function of all the parameters used in VectorSim. In contrast, the synth-tBenchs are not related to the vector simulator and are a function of TCLK only. There are three types of app-tBenchs (noCG, CG, and 100%) and two types of synth-tBenchs (rnd and 0%).

• noCG is used to evaluate designs without clock-gating. The input values of the VA are provided for each cycle.

• CG is used to evaluate designs with clock-gating support. Clock-gating is enabled during idle cycles. This requires additional clock-gating signals, one per stage.

• 100% represents a case where the VA is always busy, assuming that memory is fast enough to provide data on time and that consecutive vector add instructions (ADDV) are independent. Therefore, the execution time depends

• rnd is a tBench with random values that are supplied each cycle to the adder. This is the traditional methodology of testing digital designs.

• 0% is used to evaluate the case when there are no vector additions in the uBench, i.e. clock-gating is always active.

rnd and 100% have the highest and the second highest activity factors respectively. 0% has the lowest activity factor among the tBenchs. CG and 0% are supplied to the adders with clock-gating, while the rest are supplied to adders without clock-gating logic integrated.
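The two synthetic tBenchs and the pairing of tBenchs with adder variants can be sketched in Python as follows. The per-cycle record layout (an enable flag plus two operands), the seed, and the function names are hypothetical. Only the rnd and 0% semantics and the clock-gating pairing are taken from the text.

```python
import random
from typing import List, Tuple

Cycle = Tuple[int, int, int]  # (enable, operand_a, operand_b); assumed layout

# tBenchs supplied to adders with clock-gating; all others go to adders without it.
CLOCK_GATED_TBENCHS = {"CG", "0%"}


def needs_clock_gating(tbench: str) -> bool:
    """True if the tBench is meant for an adder with clock-gating logic."""
    return tbench in CLOCK_GATED_TBENCHS


def synth_rnd(n_cycles: int, width: int = 32, seed: int = 0) -> List[Cycle]:
    """rnd: fresh random operands are supplied to the adder every cycle."""
    rng = random.Random(seed)  # seeded so that the tBench is reproducible
    return [(1, rng.getrandbits(width), rng.getrandbits(width)) for _ in range(n_cycles)]


def synth_zero(n_cycles: int) -> List[Cycle]:
    """0%: no vector additions at all, so the adder stays gated in every cycle."""
    return [(0, 0, 0) for _ in range(n_cycles)]


rnd = synth_rnd(4, width=8)
idle = synth_zero(4)
print(rnd, idle)
```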
