time instructions easily predictable regions (a) original execution flow speculation stream. verification stream (b) contrail processor (4PEs) squash

(1)

of a Temp erature-aware Multicore-processor

Hidenori Sato Toshinori Sato

SeikoEpson Corporation PRESTO, JST

[email protected] [email protected]

Abstract|While lowersupply voltage is

eec-tive for energy reduction, it causes performance

loss. To mitigate the loss, weproposeto execute

only the part, which do es not have any inuence

onexecutionsp eed,withlow-speed. Weare

inves-tigating a multithreaded execution, named

Con-trail Architecture, which divides an instruction

stream into two streams. One is the sp eculation

stream, whichis the main partof a program and

where speculation is applied. The other is the

verication stream, which veries every sp

ecula-tion. The energy consumption is reduced by the

decreasein the executiontime inthe sp eculation

streamandbythelow-speedexecutioninthe

ver-ication stream. The paper willpresent our

pre-liminaryevaluationonenergyeciencyof a

Con-trailpro cessor.

I. Introduction

The increasing p opularity ofp ortable andmobile

computer platforms such as laptop PCs and smart

cell phones is a driving force in the investigation

of high-p erformance and p ower-ecient

micropro-cessors. As thecomputingp ower ofmicropro cessors

for mobile devices increases, however, their p ower

consumption alsoincreases. While p ower isalready

amajor design constraintin thearea ofmobile and

emb edded computer platforms, it has also become

alimitingfactoringeneral-purp osemicropro cessors.

Multicore-processor architecture is one of the

solu-tionsandit has b een already adoptedin emb edded

microprocessors[6,13, 25,28].

The energy consumed in a micropro cessor is the

product of its p ower consumption and execution

time. Thus, to reduce energy consumption, we

shoulddecreaseeither or bothof them. Power

con-sumptionin a CMOS digital circuit is governed by

theequation: P =P activ e +P l eak ag e (1) whereP activ e

isactive p owerand P

leak ag e

isleakage

p ower. The active p ower P

activ e

and gate delay t

pd aregivenby P activ e /fC l oad V 2 dd (2) t pd / V dd (V dd 0V t ) (3)

wheref is the clo ckfrequency, C

load

is theload

ca-pacitance, V

dd

is the supply voltage, and V

t is the

threshold voltage of the device. is a factor

de-p endent up on the carrier velocity saturation and is

around1.5 inmodern CMOS technology. Based on

Eq.(2),wecaneasilyndthatapower-supply

reduc-tion is the most eective way to lower power

con-sumption. However, Eq.(3) tells us that reductions

in the supply voltage increase gate delay, resulting

inaslowerclo ckfrequency,andthusdiminishingthe

computingp erformanceofthemicroprocessor. In

or-dertomaintainhightransistorswitchingsp eeds,itis

requiredthatthethresholdvoltageisprop ortionally

scaleddownwiththesupplyvoltage.

Ontheotherhand,theleakagep owercanb egiven

by P l eakage =I l eak ag e V dd (4) whereI leak ag e

istheleakagecurrent. The

subthresh-oldleakagecurrentI

leak ag e

isdominatedbythreshold

voltageV

t

in thefollowingequation:

I leak ag e /10 0 V t S (5)

whereSisthesubthresholdswingparameter. Thus,

lower thresholdvoltage leadsto increase

(2)

consumption.

In order to achieve high p erformance and low

power, we can exploit parallelism[3]. Two

identi-cal circuits are used in order to make each unit to

work at half the original frequency while the

origi-nal throughput is maintained. Since the speed

re-quirement for the circuit b ecomes half, the supply

voltagecan b e decreased. In this case, the amount

of parallelism can b e increased to further reduce

thetotal power consumption. Another kind of

par-allelism, which is thread level parallelism, is

uti-lizedforenergyreductionwithmaintainingpro cessor

performance[6]. Inthispap er,weprop osean

energy-ecientspeculativemulticore-pro cessor.

II. Contrail ProcessorArchitecture

To reduce the energy consumption, we divide an

executionofanapplicationintotwostreams. Oneis

calledthespeculationstreamandconsistsofthemain

part of the execution. However, it utilizes sp

ecula-tiontoskipseveralregionsoftheexecution. Inother

words,thenumberofinstructionsinthesp eculation

streamissmallerthanthatintheoriginalexecution,

resultingin energyreduction. Incontrast,the other

streamiscalledtheverication streamandsupp orts

the sp eculation stream by verifying each sp

ecula-tion. The key idea isthatthe speculativeexecution

in the original program translates its each critical

path into a non-critical one and we move it from

the sp eculation stream into the verication stream.

Hence, the verication stream can executeslowly if

thesp eculation isalmost successful. Wecan reduce

the clo ckfrequency ofthe components for the

veri-cation stream. Furthermore, the supply voltageis

alsoreduced. From these considerations, its energy

consumptionissignicantlyreduced. Wecoinedthis

techniqueContrailarchitecture[23].

A.Multicore-pro cessor mo del

Eachstreamexecutes as a thread ona

multicore-processor,eachpro cessingelement(PE)ofwhich

in-dependently works at the variable clo ck frequency

andsupplyvoltage[7,8,10]. Thespeculationstream

isexecutedonaPEinhigh-speedmo deandthe

veri-cationstreamisexecutedonotherPEsinlow-speed

mode. Intheidealcase,thatmeansthereareno

mis-predictions; the sp eculation stream nishes silently

inwhichamispredictionisdetectedin averication

stream, all threads from the misprediction p ointto

tail, including the speculation stream and any

ver-icationstreams, are squashed, andprocessor state

is recovered by the verication stream that detect

themisprediction. Andthen,thevericationstream

b ecomesthe sp eculationstream.

time

instructions

(a) original execution flow

easily predictable regions

(b) contrail processor (4PEs)

spawn new thread

verification stream

speculation stream

misprediction

squash

FIG.1 ExecutiononaContrailprocessor

Weexplainhow theprogram islogically executed

onaContrail pro cessor,using Figure1. Weassume

that half ofthe original execution of anapplication

is ideally predicted and is distributed uniformly as

explained in Figure 1(a). This is a reasonable

as-sumption, since it has b een rep orted that 59% of

dynamic traces can b e reused with the help of the

value prediction[20]. We alsoassume the clo ck

fre-quency for the verication PEs at half that for the

sp eculation PE. Under these assumptions, the

ex-ecution is divided into sp eculation and verication

streamsinaContrailpro cessorinadistributed

man-nerasdepictedinFigure1(b). Thepredictedregions

areskipp edinthesp eculationstreamandexecutein

theverication streams whileenlargingtheir

execu-tion time. Determining trigger p oints is based on

the condence information obtained from the value

predictor. When an easily predictable region is

de-tected, the head thread spawns a new sp eculation

stream on the next PE andit turns into a

verica-tionstream. Thus,theContrailpro cessorislogically

build as a ring-connected multicore-processor such

asMultiScalararchitecture[9],asshowninFigure2.

Physical interconnectsdo not necessarily followthe

logicaltop ology. Eachvericationstreamstaysalive

untilall instructionsinthe corresp ondingregion

re-movedfromthesp eculationstreamareexecuted.

(3)

D$

Data

path

I$

VP

Vdd/Clk

D$

Data

path

I$

VP

Vdd/Clk

D$

Data

path

I$

VP

Vdd/Clk

D$

Data

path

I$

VP

Vdd/Clk

FIG.2 Contrailprocessor

Thecostofspawninganewthreadmightb elarger

thanthatofverifying avaluepredictionona

single-threadedprocessor. However,insingle-threaded

pro-cessormo del,onlydatapathalternatesbetween

high-speedandlow-speedmodes. The otherblo cks,esp

e-ciallyinstruction-supplyfront-end,shouldbealways

in high-speed mo de. This reduces the eect of the

variablefrequencyandvoltagecontroltechnique. On

the other hand, every comp onent of each pro cessor

corecanalternatetwomo desinmulticoremo del,and

thus improving energyeciency is exp ected. From

these observations, we determined to adopt

multi-threadedmo delratherthansingle-threaded mo del.

One of the dierences from the previously

pro-posed speculative multithreaded pro cessors is that

the Contrail pro cessor do es not require any

mecha-nismtodetectmemorydep endenceviolations. Since

the Contrail processor strongly relies on value

pre-diction, any memory dep endence violations cannot

occur. Instead, itsuers from value mispredictions.

Thissimplieshardwarecomplexity. Thisisbecause

value mispredictions can b e detected lo cally in an

PE, while detecting memory dep endence violations

requiresa complexmechanismsuchasARBor

Ver-sioningCache[9]. Anotherdierence fromthe

previ-ously prop osed pre-computing architectures[26, 30]

isthat the Contrail pro cessor architecturedoes not

rely on redundantexecution. In the ideal case, the

number of executedinstructions is unchanged.

An-otherdierenceisthatitstargetistheimprovement

inenergyeciencyinsteadofthat inp erformance.

B.Activep ower reduction

As shown in Figure 2, each PE has its

dedi-catedvoltage/frequencycontrollerandvalue

predic-tor. The p otential eect of the Contrail pro cessor

architecture onenergy-eciency isestimated as

fol-for the sp eculation PE. When we focus on active

p ower,energy consumption iscalculated as follows:

Forthesp eculation stream, energyconsumption b

e-comes half that of the original execution since the

number of instructions is reduced by half. In

con-trast, for the verication streams, the sum of every

execution time remains unchanged since the

execu-tion time ofeachinstruction increases doubly while

the totalnumber of instructions isreduced by half.

Its energy consumption is decreased by the

reduc-tion of the clo ck frequency and the supply voltage.

Based onEq.(2), itisreduced to 1

8

. Thus,the total

energysavingsis37.5%. Itistruethattheenergy

ef-ciencyofContrail pro cessorsdep ends onthe value

predictionaccuracyandthesizeofeachpredicted

re-gion. However,webelievethatthep otentialeectof

Contrailpro cessorsonenergysavingsissubstantial.

Recent studies regarding power consumption of

value predictors found that complex value

predic-tors are p ower-hungry[1, 19, 22]. One of the

solu-tions for reducing p ower consumed in value

predic-tors is using simple value predictors such as

last-value predictor[17, 22]. However, value prediction

isnot the onlytechnique for generatingsp eculation

stream. Other techniques are probably utilized for

this purpose, and the key idea b ehind the Contrail

architecturewillb e applied.

C.Temp erature awareness

The Contrail pro cessor has a go o d characteristic

on temp erature awareness[24]. Figure 3 is a

bird's-eyeview showing how the speculation and

verica-tionstreamsareexecutedonaConrailprocessorchip

consisting of four pro cessor cores. As you can see,

there is only one hot core and it is rotating. This

feature resemblesthe activitymigration[12]andthe

cluster hopping[4]. Because hot sp ots are rotating,

heatisdiusedoverthechipanditstemp eraturewill

b e down.

D.Leakagep owerconsideration

WhiletheContrailprocessorarchitecturehasgo o d

characteristics on active p ower reduction and

tem-p erature awareness, its leakage p ower consumption

might be increased since it has multiple cores. As

mentioned in the previous sections, only one core

(4)

sleep

speculation/hot

verification/cool

FIG.3 PErotation

fortheslowcores. Similarly,thethresholdvoltageof

transistorsfortheslowcorescanb eraised,resulting

inleakagereduction. There areseveral circuits

pro-posedtoreduceleakagecurrentbydynamically

rais-ingthethresholdvoltage,forexamplebymo dulating

bo dybiasvoltage[2,14,29]. Leakagep owerstrongly

dependsupontemp erature. Sinceitisexpectedthat

theContrailprocessorreduces chiptemp erature,its

leakagep owerwillbe alsoreduced. Fromthese

con-siderations, the Contrail processor's leakage power

consumption will be comparableto or even smaller

thanthatofasingle-coreprocessor.

I II. Evaluation

In our previous study[27], we evaluated the p

o-tentialoftheContrailpro cessoronenergyeciency

undertheassumptionofperfectvalueprediction. In

this pap er, we consider p enalties due to value

mis-predictions.

A.Metho dology

We implementedtheContrail pro cessor simulator

usingMASE/SimpleScalar/PISAto olset[15]. Its

in-struction set architecture (ISA) is based on MIPS

ISA. The baseline processor and each PE are a

2-way out-of-order execution pro cessor. Only fetch

bandwidth is 8-instruction wide. The number of

PEs on the Contrail pro cessor model is 4. In the

current simulator, the followings are assumed.

In-struction and data caches are ideal. Branch

pre-dictionisp erfect. Ambiguousmemory dep endences

arep erfectly resolved. Weuse a2K-entrylast-value

go odjobwithloadbalance. Consideringother

parti-tioningp olicies[18]isremainedfor thefuture study.

The interval is 32 instructions. This value is

deter-mined based on the previous study[27]. The

over-headonspawninganew thread is8cycles.

Weevaluateascalingforsupplyvoltageandclo ck

frequencybased onSAMSUN's ARM pro cessor[16].

The verication PEs work at lower frequency and

voltage(600MHz,0.7V),andthesp eculationPEand

thebaselinepro cessor workathigher frequencyand

voltage (1.2GHz, 1.1V). As mentioned ab ove,

leak-agep owerstronglydep endsup ontemp erature. Since

evaluation here is very preliminary, we should use

a p essimistic assumption. We use temp erature of

100C,wheretheleakagepowerisequaltotheactive

p ower[4]. This is a reasonable assumption, since it

isreported that theleakagep ower iscomparableto

theactivep owerinthefuturepro cesstechnologies[2].

The verication PEs exploit the bo dy bias

technol-ogy. Itisassumedthattheleakagep owerconsumed

bythePEs,wherereversebo dybiasisapplied,is

re-ducedby22[2]. And last,weassumethat the

leak-agepowerconsumed byidlePEsisnegligible.

We use CRC, FFT, and StringSearch from

MiBench suite[11] for this evaluation, b ecause our

primary interests are on emb edded applications.

MiBenchis developed for use in the contextof

em-b edded, multimedia, and communications

applica-tions. It contains image pro cessing,

communica-tions, and DSP applications. We use original input

les provided by University of Michigan. All

pro-gramsarecompiledbytheGNUGCCwiththe

opti-mizationoptionsspeciedbyUniversityofMichigan.

Eachprogram isexecutedtocompletion.

B.Results

Figure4showshowthelast-valuepredictorworks.

Each bar is divided into three parts. The b ottom

part indicates the p ercentage of instructions whose

value is correctly predicted. The middle part

indi-cates the p ercentagethat is mispredicted. Andthe

top part indicates the percentage that is not

pre-dictedduetolowcondence. Itisobservedthatthe

characteristicsstronglydep endup onprograms. The

only thing we can nd from the gure is that the

p ercentageofmisprediction issmall.

(5)

0%

20%

40%

60%

80%

100%

CRC

FFT

Ssearch

P

re

dic

tion a

cc

ur

ac

y

correct

miss-predicted

no-speculated

FIG.4 %Valuepredictionaccuracy

ready know that the last-value predictor has only

a few contributions on single-corepro cessor

perfor-mance. The is same for the Contrail processor,

while p erformance improvement is not the goal of

thisarchitecture. The combination ofvalue

predic-tionandmultithreadingarchitectureachievesthe

im-provementofaround10% inp erformance.

75%

80%

85%

90%

95%

100%

CRC

FFT

Ssearch

Rel

at

iv

e ex

ecu

tio

n cy

cl

es

FIG.5 Relativeexecutioncycles

Figure 6 shows the relative power consumption.

Eachbar isdividedintothree parts. The lowerand

upper parts indicatethe average active andleakage

power, respectively. It should b e noted that p ower

consumed by the value predictor is not considered.

The consideration is remained for the future study.

As you can see, p ower consumption is slightly

in-creased. This is b ecause the execution cycle is

re-duced by speculation. From Figures 5 and 6, it is

observedthe cycle reduction rate islarger than the

powerincreaserate. Thus,energyconsumptionis

re-duced. The results are verydierentfrom those for

other low p ower architectures, whichachieve power

reductionat thecost ofperformanceloss.

Figure 7 shows Energy-Delay 2

pro duct (ED 2 P). 2

0%

20%

40%

60%

80%

100%

120%

CRC

FFT

Ssearch

R

ela

tive

powe

r

active

leakage

FIG.6 Relativepowerconsumption

p erformanceand energy. The vertical lineindicates

ED 2

Prelativetothatofthebaselinesingle-core

pro-cessor. Itisobservedthattheimprovementof25%in

ED 2

Pis achievedon average. As mentioned above,

the misprediction p enalties are included in the

re-sults. Wehaveseen from Figure4thatthe

percent-age of correct prediction is small. In addition, we

haveseen from Figure 6 that p owerconsumption is

increased. Nonetheless, the Contrail pro cessor

im-provesenergyeciency. Thisisanotable

character-isticofthis architecture.

50%

60%

70%

80%

90%

100%

CRC

FFT

Ssearch

R

ela

tive

ED2P

FIG.7 Relativeenergy-delay 2

product

As we have already seen in Figure 5, the

Con-trail pro cessor achieves performance improvement,

whichis notits primary target. If performance

im-provement is not required, we slow down clock

fre-quency and further reduce power. This is

hierar-chicalfrequency control. Global control signal

uni-formly throttles every lo cal clock, which originally

hastwomo des;high-sp eedmo deforthesp eculation

PEandthelow-sp eedoneforvericationPEs. Since

(6)

di-asshown in Figure 8 if wecould ideallycontrolthe

global clock. While this technique do es not aect

energy, thep owerreduction isdesirable for

temper-atureawareness.

80%

85%

90%

95%

100%

CRC

FFT

Ssearch

R

ela

tive

powe

r

FIG.8 Powerreductionviafrequencycontrol

IV. Summary

It is exp ected that multithreading and

dual-power functionalunitsare keytechniquesforenergy

reduction[21]. We have proposed such an

energy-ecient sp eculative multicore-processor. Our

pro-posed architecture exploitsthread level parallelism,

resulting in mitigating p erformance loss caused by

thesupplyvoltagereduction. Inthispap er,weshow

preliminary evaluation results of the Contrail

pro-cessor. We found that the Contrail pro cessor has a

potentialofapproximately25%ED 2

Psavingswhile

processorp erformanceisslightlyimproved.

References

[1] R. Bhargava, et al., \Latency and energy aware value

predictionforhigh-frequencyprocessors,"16thInt.Conf.

onSup ercomputing,2002.

[2] S.Borkar,\Microarchitectureanddesignchallengesfor

gigascaleintegration,"37thInt.Symp.on

Microarchitec-ture,Keynote,2004.

[3] A.P.Chandrakasan,etal.,\Minimizingp ower

consump-tionindigitalCMOScircuits,"Pro c.IEEE,83(4),1995.

[4] P.Chaparro,etal.,\Thermal-eectiveclustered

microar-chitecture,"1stWorkshoponTemperatureAware

Com-puterSystem,2004.

[5] L.Co drescu,etal.,\Ondynamicsp eculativethread

par-titioningandtheMEM-slicingalgorithm,"8thInt.Conf.

on ParallelArchitectures and Compilation Techniques,

1999.

[6] M. Edahiro, et al., \A single-chip multi-processor for

smartterminals,"IEEEMicro,20(4),2000.

[7] M.Fleischmann,\LongRunp owermanagement,"white

pap er,TransmetaCorporation,2001.

[8] D.Flynn, \Intelligentenergymanagement: anSoC

de-Publishers,2003.

[10] S. Go chman, et al., \The Intel Pentium M pro cessor:

microarchitecture and p erformance," IntelTech. Jour.,

7(2),2003.

[11] M. R.Guthaus,etal., \MiBench: afree,commercially

representativeembedded b enchmark suite," 4th

Work-shoponWorkloadCharacterization,2001.

[12] S.Heo,etal.,\Reducingp owerdensitythroughactivity

migration," Int. Symp. on LowPower Electronics and

Design,2003.

[13] S.Kaneko,etal.,\A600MHzsingle-chipmultiprocessor

with4.8GB/sinternalsharedpip elinedbusand512kB

internalmemory,"Int.SolidStateCircuitsConf.,2003.

[14] T.Kuro da,etal.,\A0.9V,150MHz,10mW,4mm 2

,2-D

discrete cosine transform core pro cessor with

variable-threshold-voltagescheme,"Int.SolidStateCircuitConf.,

1996.

[15] E.Larson,etal.,\MASE:Anovelinfrastructurefor

de-tailedmicroarchitecturalmo deling,"Int.Symp.on

Per-formanceAnalysisofSystemsandSoftware,2001.

[16] M.Levy,\SamsungtwistsARMpast1GHz,"

Micropro-cessorrep ort,16(10),2002.

[17] M.H.Lipasti,etal.,\Valuelo calityandloadvalue

pre-diction",7thInt.Conf.onArchitecturalSupp ortfor

Pro-grammingLanguagesandOp erationSystems,1996.

[18] P. Marcuello, et al., \Thread partitioning and value

predictionfor exploitingsp eculative thread-level

paral-lelism,"IEEETrans.Comput.,53(2),2004.

[19] R.Moreno,etal.,\Ap owerp ersp ectiveofvaluesp

ecu-lationfor sup erscalarmicropro cessors,"18thInt. Conf.

onComputerDesign,2000.

[20] M. L. Pilla, et al., \Predicting trace inputs with

dy-namic trace memoization: determining sp eedup upp er

b ounds,"10th Int.Conf.on ParallelArchitecturesand

CompilationTechniques,WorkinProgress,2001.

[21] J.Rattner, \Electronics in the Internet age," 10th Int.

Conf.on ParallelArchitecturesandCompilation

Tech-niques,Keynote,2001.

[22] N.B.Sam,etal.,\Ontheenergy-eciencyofsp eculative

hardware,"Int.Conf.onComputingFrontiers,2005.

[23] T. Sato, et al., \Contrail pro cessors for converting

high-p erformanceintoenergy-eciency,"10thInt.Conf.

onParallel Architecturesand Compilation Techniques,

WorkinProgress,2001.

[24] T.Sato, \Future directions of pro cessor architecture,"

6thInt.WorkshoponInnovativeArchitectureforFuture

GenerationHigh-Performance Pro cessors and Systems,

Panel, http:www.mickey.ai.kyutech.ac.jp/~tsato/

cosmos/papers/iwia03panel.pdf,2003.

[25] T. Shiota,et al., \A51.2GOPS, 1.0GB/s-DMA

single-chipmulti-processorintegratingquadruple8-wayVLIW

pro cessors,"Int.SolidStateCircuitsConf.,2005.

[26] K.Sundaramoorthy,etal.,\Slipstreampro cessors:

im-provingb othp erformance andfaulttolerance"9th Int.

Conf.on ArchitecturalSupp ort forProgramming

Lan-guagesandOp eratingSystems,2000.

[27] Y.Tanaka,eyal.,\Thep otentialinenergyeciencyof

asp eculativechip-multiprocessor,"16thSymp.on

Par-allelisminAlgorithmsandArchitectures,2004.

[28] S.Torii,etal.,\A600MIPS120mW70Aleakage

triple-CPUmobileapplicationpro cessorchip,"Int.SolidState

CircuitsConf.,2005.

[29] Transmeta Corp., \LongRun2 technology," http://

www.transmeta.com/longrun2/.

[30] C. Zilles, et al., \Master/slave sp eculative