High performance computing through SoC coprocessors

(1)

HIGH PERFORMANCECOMPUTING THROUGH SOC COPROCESSORS

GIANNI DANESE, FRANCESCOLEPORATI, MARCO BERA, MAURO GIACHERO, NELSON NAZZICARI, AND

ALVARO SPELGATTI

∗

Abstrat. In thispaper we desribe DPFPA (Double Preision Floating Point Aelerator), a FPGA-based oproessor

interfaedtotheCPUthrough standardbusonnetions;itisoneivedtoaeleratedoublepreisionoating pointoperations,

featuringtwodoublepreisionoatingpointunits,apipelinedadderandapipelinedmultiplierwithasuitablenumberofstages.

Wetested itsperformane byimplementinga Montearlo-Metropolis simulationof a dipolar system, using a proper software

developmentenvironmentdesignedand realizedinour laboratory. DPFPAanprovideaspeed-upequalto4,withrespetlast

generationPC,showingalsoagoodsalabilityintermsoflokfrequeny,memoryapabilityandnumberofomputingunits.

Keywords: FPGA;hardwareaelerator;highperformaneembeddedsystem;parallelproessing.

1. Introdution. Sienti researh owes a lot to omputer systems whih allowed the ahievement of

results otherwise unthinkable [Marsh, 2005℄[Boghosian et al., 2005℄. A powerful omputing system permits

the study of several phenomena through the employment of simulations like statistial ones into whih the

systemunderanalysisismadetoevolvefromaertaininitialondition,bymodifyingafewofitsharateristi

parameters and by evaluating the feasibility on the basis of a proper merit funtion. These operations are

iteratedthousandsoftimestobringthesysteminanewstablestate.

Severalofthesesimulationsperformdoublepreisionoatingpointoperationssinetheyprovidethe

au-rayrequiredtoappreiate eventhesmallestutuationsin thetypialvariablesofthesimulatedphenomena.

Ontheotherhand,this ouldrepresentahardtask evenforthemostpowerfulproessorswhih takealot of

lokylesto exeuteasingleoatingpointoperation.

Thelakofomputingpowerisgenerallyoveromebyresortingtosuperomputersorlusters[Dongarraet

al.,2005℄butinthelastyearstheuseofaelerators,i.e. dediatedhardwaresystems,isgraduallyestablishing

asavalid alternative,dueto thefeature ofthese devieswhihallowto performthose operationsin lesstime

thantraditionalproessors[Buelletal.,2007℄[Herbordtetal.,2007℄. Severalresearhersworkedintheseyears

not only in this sense but also to improve methodology, tools and praties supporting the integration of

hardwareandsoftwareomponentsduringsystemdesignanddevelopment [Hankeletal.,2003℄[Wolf,2003℄.

AtpresentasimilarprojetonerningaDoublePreisionFloatingPointAelerator(DPFPA)toproess

omplex funtions has been arried out in the Miroomputer laboratory at the University of Pavia (Italy).

Thisativitysuiteswellwith themission of thelaboratorywhih aimsto designand developspeialpurpose

arhitetureforomputationallyintensiveappliations. ThedesignedaeleratorisimplementedontoaFPGA

devielodgedonaboardinteronnetedwithaPersonalComputerandisabletoexeuteoatingpoint

opera-tionsfasterthanatraditionalproessor[Daneseetal,2007℄.Moreover,aproperspeiprogramminglanguage

and asuitablesoftwaredevelopment environmentwere realisedallowing theuserto write, translate andload

properinstrutionssequeneswrittenin aspeilanguage.

Thispaperdesribestheimplementation,ontotheaelerator,ofaMontearlo-Metropolissimulationofa

dipolarsystem,atypialomputationalhallengeforsuperomputers.

TheMontearlo Metropolisalgorithm isan exellent benhmarkto test performane ofaspeialpurpose

alulation system, sine its omputational ore onsists of few oating point operations (double preision)

repeatedoverandover:thisrepresentstheidealonditiontoexploitanappliationspeiarhiteturedevoted

totheaelerationofonlypartiularinstrutions.

Moreover,the algorithmfeatures aSIMDfashion soitis suitablefor adistributed implementation whih

anexploit morealulationunits soinreasingtheoverallahievedspeedup.

Finally,typialMontearlosimulationsinvolvehundredsthousandspartilesystemsandanrunforweeks

or months on the most performing omputers with a single CPU: the availability of powerful aelerating

units, in aseonneted into a luster onguration, makes possible simulations urrently unfeasible or

sim-ulations with more partiles than now, ahieving a better omprehension of the physial phenomena under

analysis.

∗

Department ofInformatia and Sistemistia,PaviaUniversity,Italy, E-mail: franeso.leporatiunipv .it, Phone: +39

(2)

Inthepastotherresearhgroupsproposed aeleratorsbasedonFPGAforMontearlosimulations:

•

oneoftherstproposalispresentedin[Postulaetal.,1996℄whereisdesribedametallurgialsintering

simulationimplementedonaFPGAdeviewithatwoordersofmagnitudespeed-upwithrespettoa

mid90'sworkstation;

•

inthesameyears,otherauthorsoneivedaFPGAimplementationofapartiularMontearlotehnique

(Swendsen-Wanglustering)withaonsiderableaelerationwithrespettoa15MHzDSPormaking

useofellularautomata[Cowenetal,1994℄[Monaghanetal,1992℄;

•

morereently, areongurableomputerwasdesigned devoted to heat transfer simulations,working

on single preisionoating point data and ahieving an order of magnitude speed-up relativeto a3

GHzP4 proessor[Gokhaleet al,2003℄; thepeuliarityof this ontribute isthe ideaof using widely

available oating pointlibrariesfor implementing aalulationfuntion ontoFPGA, thus shortening

designtime;

•

nally,in[Zhangetal,2005℄itispresentedasimulationofananialmodelimplementedonaFPGA

devie to aeleratedouble preisionoatingpointalulations. Theahievedspeed-upis 26 relative

toa1.5GHzP4proessor;

•

with regard to FPGA based arhitetures speially devoted to physis simulations, the reent

lit-erature proposed the works of Cruz and Belletti [Cruzet al, 2001℄[Belletti et al, 2006℄; the rstone

providesinterestingarhiteturalissuesalthoughusingAlteraFlex10K30omponentslimitsthe

work-ingfrequenyto48MHz;theseondisaprojetsubsequenttoourone,employingAlteraStratixfamily

omponentsandaimsto buildalusterofaeleratorsbasedonthemostreentFPGAdevies.

ForwhatonernsamoregeneraluseofSoCforomputingintensiveappliationsthereisawideliteratureto

whihthereaderouldrefer. ThemostpartoftheOtober2007issueofIEEEComputerwasdevotedto that

topi[Wolf,2007℄.

In thenext setion the arhitetural features of the aelerator, of the spei languagedesigned and of

itssoftwaredevelopmentenvironmentwill bedesribed. Then, thebasiphysialpriniples of thesimulation

and itsneededmodiationsforoptimizingthe useofthe aeleratorwillbehighlighted. Finally, wewill see

theimplementation of thealgorithm onthe aelerator,taking advantagefrom the useof a`dediated stage'

pipeline and the omparison with afew ommerial and popularproessors showinga lear speed-up. Some

remarksexplainingtheevolutionoftheprojetwillonludethepaper.

2. The Aelerator. We realized a FPGA-based aelerator onneted to a host PC to aelerate the

hardestpartofaalulus. OurideareferstoaboardwithaFPGAdevie(AlteraStratix family)andaFlash

memorystoring theongurationode; aJTAGportis used to sendtheprogram to theFlashmemoryfrom

thePC.Reently,Alterahasmadeavailable someboardswiththese features. These boardsanommuniate

with PC through the network requiring a propernetwork manager. In this ase, both the aelerators and

the network proessoran be loaded on the same FPGA. The board we bought is equipped with a Stratix

1S40 FPGA omponent on whih a 32 bit RISC CPU, alled Nios, is implemented; this proessor an be

programmed using C language and is supplied with basi libraries to easily handle the on board devies:

2 MB Ram, 8 MB Flash Memory, 16 MB Compat Flash Memory, 100 Mb/s Ethernet Interfae, 2 Serial

ports.

Wedesignedan aeleratingunit that is abletoimplementdierent funtions (alsoomplexlikesin,os,

log, ...,throughTaylorseries). Thus, itanbeusedforseveralappliations, alsoverydiversied. Moreover,

theinstrutionset isfullyre-programmableaordingto thepartiularalulationtobeperformed.

The designed unit (DPFPA) anexploit the parallelism present in theoperationssine double preision

FloatingPointMACoperations anbe exeutedat the sametime in thesum and multiplypipelines present

onto it. The main partof DPFPA is DPFPP (double preision oating point proessor),whose arhiteture

onsistsof(g. 2.1):

•

2aeleratingunits,independentlyworking;

•

aCaheMemory(4banks),whih anstoreinputdataandresultsforthetwoaeleratingunits;

A suitable bus devoted to ommuniation between Aeleratorand Nios proessor("sub bus") hasbeen

also implemented. TheMath Unit funtional oreis a double-preisionoating-point ALU, whih integrates

both anadder and amultiplier operating in a parallel fashion. Both devies are pipelined (9 stages for the

(3)

Fig.2.1. ArhitetureoftheomputationalunitimplementedontotheFPGAdevie.

Together with theadder andthe multiplier,the ALUalso ontains3registerbanks, eah ableto store4

double-preisionoatingpointnumbers. Thebanksareeahtiedtoapartiularpurpose(oneisforinputdata,

oneforadderresultsandoneformultiplier results).

Like in many similar appliations, to makeomputing elements and storage spae independent, aFIFO

memoryforbothinputsandoutputsisimplemented(therearetwoFIFOqueuesontheoutputsinearithmeti

resultsareseparatedfromlogialones).

TheALU operations are enoded in 37-bit words,able to simultaneously triggereither a sumor a

om-parison, amultipliation, adata feth, 3 write operations to the internal registerbanks and the output of a

result.

Toahievebetterperformanewithourspeitask,theoperandsoftheadderanoptionallybemultiplied

by

[

−

2 ,

−

1 ,

2]

fortherstoperand,and

[

−

1 ,

−

0 .

5 ,

0 .

5]

fortheseondone. Inasimilarway,themultiplierresult anbedoubled,halvedor negatedwithoutextralokyles.

Sinefeedingtheop-odeswouldrequirealargeandmostlywastedbandwidth(theodeisessentiallyyli,

so that the same op-odes are exeuted over and over again) the ode sequenes are stored in a Miroode

Sequener. This deviestoresthe programsequenesin aninternal RAMandassoiatestothem a6-bits

op-ode(thisismuhlikehavingaCPUwithamiro-programmedontrolunitwhoseodeanbehangedbythe

appliationtodeneaustominstrutionset).

TheMathUnititselfhasnoaddressingapabilitiestowardeitherinputoroutputhannels,soeverymemory

I/O operation must be managed by an external devie. A Memory Manager was deemed to that task and

oneived for a spei appliation lass: those where most omputations are performed on data logially

organizedin three-dimensionalmatries. Deoupling thealloationissues from theomputing algorithm,the

MemoryManageromputesthememoryaddressesfrom semanti-levelinputs,suh asaddressesinthematrix

domain (

X

−

Y

−

Z

oordinates) or osets between elements (the matrix is supposed to be yli, so that e.g. theleftmost elementin arowis adjaentto therightmost element in thesamerow). Thisis of extreme

importane,sineotherwisethe sameode would requireat least areompilationto beexeutedon matries

withdierentsizes.

TheinternalControlUnit(CU)deodesinstrutionsomingfromthehostomputeranddrivestheontrol

signalsimplementingtherequestedfuntion. Itmainlyonsistsof3units:

•

InstrutionDeode: seletsbetweendataandinstrutionsfromhosttotheDPFPA.Onlyinthelast

aseitgeneratesproperontrolsignals;

•

JumpUnit: setstheRAMaddresstothestartingpointofthenextinstrutionsequenetobeexeuted;

(4)

Instrutionsare64bitwideexploitingpartoftheredundanypresentintheIEEE754standardofoating

pointrepresentation,to distinguishthem fromdouble preisionnumbers. Twokindsofinstrutionshavebeen

implemented:

•

Programminginstrutions to storein theCURAMexeutivesequenes.

•

Exeutive instrutions toperformspei alulations,reallingsequenesalreadyloaded.

Programming instrutions to store in the CU RAM exeutive sequenes. Exeutive instrutions to perform

speialulations,reallingsequenesalreadyloaded.

A great advantageof our approah is that thesequenes of anexeutiveinstrution areperformed in an

iterativemanneruntilanewexeutiveinstrutionwill bereeivedbytheCU.So,during theexeutionofthe

alulus,CUhastodeodeonlyfewinstrutionsandansaveagreatamountoftime.

3. Programming DPFPP. As previously stated, DPFPP an handle two types of instrutions:

pro-gramminginstrutionsandexeutiveinstrutions. The formerareused tostoremiroodesequenesinto the

CU RAM, making miroode words to be loaded at the orret address into the RAM of CU. The word of

miroode,allowstheassertionofneededontrolsignalsforeahlokyle.

Eahexeutiveinstrutionallows,ontheotherhand,thereallingofsequenesalreadystored.

We realisedsoon, that the sequene developmentusing binary miroodewas avery hardand ineient

work. Thus,wehosetodesignanddevelopapseudo-assemblydediatedlanguagethatsimpliesthesequene

writing. Theinstrutionsof thelanguageare mappeddiretly onthehardware andreettheoperation that

DPFPP anexeute. Table3.1showsthelistoftheinstrutionsand theirsyntax.

Table3.1

Listandsyntaxofthelanguageinstrutions.

InstrutionSyntax

MOVreg;

SUM1op2op;SUM1op;SUM1 opopSUMop2op;SUMopop

MULopop;MULopop;MULop;

OUTxx;

INT;

Apropertranslatorwasalsodeveloped,usingstandardUnixtoolssuhasLexandYa.

Furthermore,wedevelopedanalloatorforaneasygenerationofthelewiththeprogramminginstrutions

that must be sent to theDPFPP. Finally, we designedasimulator, reproduing exatlytheDPFPP working

and enabling pipeline and registerinspetion. The simulator also allows thevisualisation of the lok yles

neededbyaspeisequeneorbyasetofsequenes. Thankstothistool,weanexeutemiroodesequenes

withoutloadingthemintotheDPFPP;thus,weansimplifythesequenedebug,verifytheresults'orretness

andhektheperformane.

Allthesetoolsareintegratedinauniquedevelopmentenvironment,realisedin theMiroomputer

labora-torytoeasethesequenedevelopment. Therearefourmainsteps: rst,wewriteandompilesoureodeusing

an internal editor, then wetest the odeusing the simulator. Finally, weproduethe programmingle that

hasto besentto theDPFPP by usingthe alloator. Moredetails on thehardwareand software forDPFPA

arein[Daneseetal.,2003℄.

4. The ConsideredProblem. Liquidrystalsandolloidalsuspensionsaretwoexamplesofsystemsfor

whih the orientation order has been widely studied through simulations. In both ases interations among

partilesplayadominantrole. Inpreviousworks, werealizedaubilattiemodeldesribingtheinterations

eets in adipolarsystem inpresene ofan externallattieeld [Bellini eral.,2001℄: simulationsmade with

this model identied thepresene oftwophase transitionsand theobtained resultsouldin partexplain the

phenomenonknownas anomalousbi-refringene asanalyzedin[O'Konskiet al.,1950℄[Radevaet al.,1996℄.

Onthe otherhand,simulationstakeunaeptablylongtimes evenonthe mostreent andpowerful

om-putingsystemsrangingfromafewdaysuptosomeweeksdepending onthesizeofthesimulatedsystem. The

oreof theomputationis, infat,theevaluationoftheenergysine,aordingto theimplementedalgorithm

(5)

evaluating theorrespondinghangein energy. Eahmoveanbeaeptedorrejeteddependingonthe

vari-ation of the energy assoiated with it [Metropolis et al., 1953℄. We simulated lattie systemswith partiles

ranging from afew hundreds upto 100.000 onsidering only rst neighborinterations, i. e. theinteration

betweeneah spinand thesixlosestones in the

X

+

,

X

−

,

Y

+

,

Y

−

,

Z

+

,

Z

−

diretions. Periodiboundary onditionswere applied [Frenkelet al., 1996℄. Theassoiatedenergyof eah dipole due to thepreseneof an

externaleldorientedtowardzaxisis:

(1)

Edip

=

momz

(

dip

)

Thetermsdueto theinterationsbetweentheonsidereddipoleandeahofitsrstneighboursare:

(2)

EX+

= 2

∗

momx

(

dip

)

∗

momx

(

X

+) +

−

momy

(

dip

)

∗

momy

(

X

+)

−

momz

(

dip

)

∗

momz

(

X

+)

(3)

EX

−

= 2

∗

momx

(

dip

)

∗

momx

(

X

−

) +

−

momy

(

dip

)

∗

momy

(

X

−

)

−

momz

(

dip

)

∗

momz

(

X

−

)

(4)

EY

+

=

−

momx

(

dip

)

∗

momx

(

Y

+) +

+2

∗

momy

(

dip

)

∗

momy

(

Y

+)

−

momz

(

dip

)

∗

momz

(

Y

+)

(5)

EY

−

=

−

momx

(

dip

)

∗

momx

(

Y

−

) +

+2

∗

momy

(

dip

)

∗

momy

(

Y

−

)

−

momz

(

dip

)

∗

momz

(

Y

−

)

(6)

EZ+

=

−

momx

(

dip

)

∗

momx

(

Z

+) +

−

momy

(

dip

)

∗

momy

(

Z

+) + 2

∗

momz

(

dip

)

∗

momz

(

Z

+)

(7)

EZ

−

=

−

momx

(

dip

)

∗

momx

(

Z

−

) +

−

momy

(

dip

)

∗

momy

(

Z

−

) + 2

∗

momz

(

dip

)

∗

momz

(

Z

−

)

wheretheomponentsofthemomentsforeahdipoleare:

(8)

momx

(

dip

) =

cos

(

θ

)

∗

sin

(

θ

)

∗

cos

(

ϕ

)

(9)

momy

(

dip

) =

cos

(

θ

)

∗

sin

(

θ

)

∗

sin

(

ϕ

)

(10)

momz

(

dip

) =

cos

′

₍

_θ

₎

and

θ

,

ϕ

aretheangularo-ordinatesofageneridipole. Theoverallenergyofthedipoleisthesumofallthese

ontributes:

(11)

ET OT

[

dip

] =

−

0 ,

5 ∗

[

Edip

−

k

∗

(

EX+

+

EX

−

+

EY

+

EY

−

+

EZ+

+

EZ

−

)]

Theglobalenergyin thesystemisthesumextendedonthewhole dipolarset.

Thesimulatedsystemisharaterisedbyaninitialrandompartiledistributionnotorrespondingtothat

ahievable at the equilibrium. This means that the hange in the orientation of a dipole will modify the

moments and the energy in the others, mainly in the neighbours. These ones, in turn, will inuene their

respetive neighbours and so on, propagating those variations in the moments throughout the lattie. This

reetsinenergyutuationsthatdisappearonlyafterasuientnumberofylesintowhihETOTforeah

dipoleisalulated(equilibration). Onlyatthispoint,theMetropolistestonenergyvariationanbeapplied.

This loop series orresponds to nearlythe 85%of the alulationbut it onsists of only few instrutions, so

justifyingthe ideaof anaelerator speialized in proessing onlythose operations. Todo this, we employed

theFPGAtehnology,whih isheaperandsimplerthanASICintermsofdesignandtest.

However,duringthedesignphase,weonsideredonvenienttorealiseamoregeneralhipabletoaelerate

thosedoublepreisionoatingpointinstrutionswhihanbeoftenfoundinsientisimulations. Thisextends

theappliabilityoftheDPFPA bothtomodelsdierenttothat used(i. e. hexagonallattiesinsteadof ubi

ones)ortoompletely dierenteldswherehighperformaneomputingismandatory.

5. Energy Evaluationand Implementation. Tosimplify thereadabilityoftheenergyalulationon

the DPFPA, as itwill be desribed in thefollowing, let's rewritethe expressions reported in setion 3. The

interation energyof eah dipole anbe written as

−

(CT

∗

CT

)

(6)

globalenergyin thesystem.

CT

istheloaleld generatedbytheneighborsof theonsidered dipoleandan

beexpressedas:

(12)

CT

=

CT X

∗

SC

∗

k

+

CT Y

∗

SS

∗

k

+

CT Z

∗

C

∗

k

+

C

where

k

isaonstantdependingonthesystemdensityand

SC

=

sin

(

θ

)

cos

(

ϕ

)

,

SS

=

sin

(

θ

)

sin

(

ϕ

)

,

C

=

cos

(

θ

)

,

with

θ

,

ϕ

angularo-ordinatesofthedipole. CTX,CTYeCTZaretheloalomponentsoftheeldgenerated

bytheneighbourdipoles. Theyarerespetivelyequalto:

(13)

CT X

= (

M XX

+

M X

∗

X

+

M Y X

+

M Y

∗

X

+

M ZX

+

M Z

∗

X

)

(14)

CT Y

= (

M XY

+

M X

∗

Y

+

M Y Y

+

M Y

∗

Y

+

M ZY

+

M Z

∗

Y

)

(15)

CT Z

= (

M XZ

+

M X

∗

Z

+

M Y Z

+

M Y

∗

Z

+

M ZZ

+

M Z

∗

Z

)

Weidentifywith

M XX

,

M XY

,

M XZ

theloaleldomponentsgeneratedbytherstneighbordipoleinthe

diretion

X

−

,andwith

M X

∗

X

,

M X

∗

Y

,

M X

∗

Z

theloaleldomponentsgeneratedbytherstneighbor dipoleinthediretion

X

+

. Theothertermsduetotheeetofdipolesindiretions

Y

+

/Y

−

and

Z

+

/Z

−

are denedaordinglytothesamenotation. Moreovertheloaleld,duetotheneighbors,hangestheomponents

ofthedipolarmoment. These shouldbeevaluatedeahtimeaordingtothefollowingexpressions:

(16)

momx

(

dip

) =

CT

∗

SC,

momy

(

dip

) =

CT

∗

SS,

momz

(

dip

) =

CT

∗

C

Whilethe

SC

,

SS

and

C

termsareevaluatedateahmovement,theothertermsshouldbere-alulatedforthe

numberofyles neessaryto equilibrate theenergyin the system. All these operations, nally,are repeated

M

∗

N

times with

M

=

ylenumber(i. e. 10.000)and

N

=

numberof dipoles in thesystem. This aounts

forthehighomputationalweightoftheelaboration.

Fig.5.1. Diagonalsanning.

With 'sanning' wemean the order throughwhih thedipoles are proessed during the simulation. The

identiationofasuitableorderansigniantlyaetthealgorithmeieny intermsofmemoryaessand

reuse ofdata. Ifwe would notuseany partiularsanningorder but ifweonly would onsiderdipoles in the

sameorder ofmemorization(

1 st

,

2 nd

,

3 rd

,...),theirelaborationwouldneed21inputdata(

SC

,

SS

and

C

of

themoveddipole plusthemomentsofitssix rstneighbors),returningthe3newomponentsofthemoment

oftheonsidereddipole.

However,iftheseletionorderonsidersdipolesloseto eahother towardadiagonaldiretion,theselast

onessharetworstneighborswhoseparametersarenomoreneededfortheelaborationofthenewdipole. Fig.

5.1showsanexampleofthis,sinepassingfromdipole 1to2,dipoles3and4arepreservedasrstneighbors.

(7)

againg. 5.1wenotethatdipoles3and4givethefollowingontributestoeahomponentoftheloaleldin

dipole1:

(17)

M X

∗

X

+

M Y

∗

Y

= 2

∗

momx

(4)

−

momx

(3)

(18)

M X

∗

Y

+

M Y

∗

Y

=

−

momy

(4) + 2

∗

momy

(3)

(19)

M X

∗

Z

+

M Y

∗

Z

=

−

momz

(4)

−

momz

(3)

Ifwenowonsider theontributeof thesamedipoles todipole 2,thenext reahedby thediagonalsanning,

wend:

(20)

M XX

+

M Y X

= 2

∗

momx

(3)

−

momx

(4)

(21)

M XY

+

M Y X

=

−

momy

(3) + 2

∗

momy

(4)

(22)

M XZ

+

M Y Z

=

−

momz

(3)

−

momz

(4)

Thevaluesontherightareobtainedbysubstitutingatthetermsonleft, thosevaluesreportedinequationsin

setion3.

Equations19and22areequalandanbealulatedonlyone. Thesameonsiderationsareappliablein

aseofmovementstoward

Y Z

or

XZ

diretionwithaonsistentsparingof operations.

Finally,themomentomponentsinvolvedinequations1719forthedipole1arealsopresent(withdierent

oeients)inequations2022and,again,theyanbealulatedonlyone(i. e. fordipole1,storingthemin

registersfromwhihtheyanberetrievedlaterforthenextdipole)withafurthersavingof time.

Fig.5.2. Diagonalsforsanningin

XY

fae.

Thediagonalsanningbasiallyonsistsof

XY

movementsasshownin g. 5.2.

Theubi lattie isonsidered as madeby `slies'and when the last dipole is reahed on an

XY

fae,a

littlemovement towardthe

Y Z

or

XZ

diretion allowsto skip to thenext

XY

slie. Ineah slie, dierent

startingpointsanbehosendependingontheodd/evennumberofdipolespresentontheedgeofthelattie,

butforsakeofsimpliitywedon'twanttoexessivelydetailthesesimulationaspets.

6. Implementationon DPFPA. As previouslysaid,asequene onsistsof amiroinstrution setand

ould be identied asa Setup or aLoop sequene. The rst problem to dealwith is the denition of those

operationsmorefrequentlyexeutedwhihshouldbeinsertedintotheLoopsequene. Inthediagonalsanning,

themostfrequentoperationregardstheinterationbetweendipolesloatedondiagonalsbelongingtothe

XY

side: thus,theLoopsequeneshouldimplementtheenergyalulationofthesedipoles,whiletheSetupshould

exeutethemovementsin the

XZ

or

Y Z

faes ofthelattie, throughwhih thealgorithm onsiders therst

dipoleofthenext

XY

`slie'andanotherLoopsequenebegins.

Aordingto whatsaidin theprevioussetion,thenumberoftheneededsumsis14forevaluating

CT X

,

CT Y

e

CT Z

(one add is shared with the previousdipole), 3 for

CT

and 1 moreto add these valuesto the

partialtotalenergyobtainedfromthepreviousdipolesonsidered. Thustheadderpipelineisusedasitsbest,

if18lokylesaretaken. Forwhat onernsmultipliations,instead,6areneededtoalulate

CT

,3forthe

(8)

Fig.6.1. Stage2intheadderpipelineduringtheLoopphase

6.1. AdderUnit. Eahstageisonsideredasanindependentregisterontainingthepartialresultwhih

anbestoredeveryLlokyles(Listhepipelinelength). TheLoopsequeneevaluatestheenergyofdipoles

onsidered in the

XY

diretion: 4stages of adderpipeline were devoted to alulate

CT X

,

CT Y

,

CT Z

and

CT

. In g. 6.1, the seond pipeline stage devoted to thealulation of

CT X

is shown, with the partiular

alulation highlightedin bold in eah of the four sumsneeded. Inthe rst step, the termin parentheses is

`shared'withthepreviousdipoleonsideredanddoesnotneedtobere-alulated(seeprevioussetion). Eah

partial result is available only when it has run aross the whole pipeline i. e. after 9 lok yles and the

ompletevalueof

CT X

isavailableafter36lokyles. Thenthestageproeedstoevaluatethe

CT X

forthe

nextdipole. Thesameonsiderationsanbemadefor

CT Y

and

CT Z

. Thealulationof

CT

isimplemented

inthestage4,whihworksagainfor36lokyles. The

CT X

,

CT Y

,

CT Z

valuesusedin thisasearethose

(9)

!

"

Fig.6.4.Stage6intheadderpipelineduringtheLoopphase.

Therefore the partial value of

CT

is savedin a register from whih it will be retrieved during stage 4B

(g.6.2). Tooptimisetheuseofthepipelinetheremainingstagesaredevotedtoimplementthesamealulations

fora seond dipole,soasto proess 2dipoles in 36 lokyles. This orresponds, aspreviouslyseen, to an

optimaluseoftheadder. Finally,stage6isdevotedto addto theglobalenergyvalue

ET OT

,thosetwoenergy

ontributes (

EN EW

) alulatedin the other stages of the pipeline upto this moment(g. 6.3). Basially it

works inthesamewayasstage4,inluding twosumssharedwiththesuessiveelaborateddipoles(againto

optimisethepipelineuse). Eventhough,during the36lokylesallthe sumsneeded fortheenergyoftwo

dipoleshavebeenperformed,thedipolesinvolvedintheelaborationaremorethan2. Infat,whiletheadder

is evaluating

CT X

,

CT Y

and

CT Z

forthe twodipoles, it is notpossibleto determineat the sametime the

orrespondent

CT

terms,sinethepreviousalulations(

CT X

,

CT Y

and

CY Z

)shouldbeompletedandthey

should also be multiplied by

SC

1 ∗

k

,

SS

1 ∗

k

and

C

1 ∗

k

(k is asuitableonstantdepending on thesystem density). Thereforethe

CT

termreallyomputedreferstothepreviousLoopsequene. Thismeansthat while

CT X

,

CT Y

and

CT Z

for dipoles

(

n

)

and

(

n

+ 1)

are evaluated, the

CT

terms referto

(

n

−

3)

and

(

n

−

4)

dipoles and the

EN EW

orrespondsto the ouple

(

n

−

5)

and

(

n

−

6)

previously started. Moreover,also the ouple

(

n

−

1 , n

−

2)

issubjetedtoapartialelaborationmakingthepipelinealwaysworking.

This ongurationbrings a onsistentlevelof parallelisation in the exeution of thealgorithm. Fig. 6.4

shows the omplete set of operationsalulated during the36 lok ylesof eah Loopsequene. Per eah

stageandlokyle,theeetivesumperformedisreportedinbold.

6.2. Multiplier Unit. Thisunit exeutesthemultipliationsneededinthetermsthatmustbeadded,i.

e. 10 pereah ofthetwodipoles of theadderunit (globally 20)and in asequentialway. Tosynhronisethe

operationsinthemultiplierpipelinewiththoseoftheadder,thelengthofthepipeline(15stages)isextendedto

18byaddingthreeNOP(nooperation)yles: thismeansthatin36lokylesthemultiplierworkseetively

for30yles,atimesuienttoexeutetherequired20produts,withoutloosingthesynhronisationwiththe

orrespondenttermsin theadderunit. Fig. 6.5desribestheoperationsperformedtogether withthe output

fromthepipelineatthatinstant,pereahlokyle. Inparenthesistheordernumberisreportedofthedipole

towhihthealulationrefers: nisthedipoleforwhihthealulationoftheenergyisinitiatedintheurrent

sequene. AttheendofeahLoopsequenethepipelineoutputsnewmomentsandenergyofthedipoleouple

whih startedtheevaluation 3sequenes before. Fititiousprodutshavebeeninsertedwhen neededtofore

thepipelinegoingonestepbeyond.

7. Results. ThewholesystemhasbeentestedbyexeutingMontearlosimulationsofdierentsizelatties

(10)

Fig.6.5. Operationsperformed inthemultipliationpipelineduring36lokyles.

Performanehasbeenevaluatedasspeed-uprespettotheexeutionofthesamesimulationonanIntelP4

proessorwith 1GBRam memory;also FPGAoupation wasused asaperformaneparameter. Simulation

ode waswritten in C language and optimized using Mirosoft Visual C++ environment. The Aelerator

elaborationtimesweremeasuredbymeansofthelokountersimplementedintheinterfaebetweenNiosand

theoproessorpreviouslydesribed.

Ing. 7.1weshowtheperformaneasspeed-upfatorrespettotwoIntelP4proessorswith3GHzand

1.7 GHz frequenyrespetively, alulatingthe dipolar energy of the simulatedsystem. That omputational

oreisrepeatedlyexeuted

k

∗

N

∗

10000

timeswhere

k

istheoeientresponsiblefortheinterationsettlement (equilibration) and N is the dipole number: this givesreason ofthe high omputational loadwhih anlead

(forbigpartile systems,e. g. 100000dipoles)to waitalot forresults,ifthesimulationshould beperformed

on aPC. Thespeed-up fatoris inreasing forthe 1.7 GHzproessordue to ahe eet,while for themost

performingIntelproessor(3GHz)sets around2.

Considering the size of the FPGA we used, other 2 aelerating units ould be implemented, we an

reasonablystate that a speed-up fator equalto 4anbe ahieved in ase of afull implementation on the

FPGAomponentwehose(StratixEP1S40). Furtherspeed-up ouldbeobtainedifother omponentsofthe

Altera'sfamily(Stratix2or Stratix3nowavailable)shouldbeemployed.

Theostofeahboardweboughtwasnearly$1200:thisrepresentsanimportantindiationwhenprediting

(11)

Fig.7.1.Speed-upoftheFPGAbasedaeleratorwithrespettheP4Intelproessors.

to aomputational unit in aPC luster, providing the sientist witha COTSdesktop omputing system on

whih he/sheanrunsimulations.

8. Conlusions. Simulationsallowtheanalysisofaphysialsystem,evenomplex,withoutexperimental

measuresor,sometimes,toonrmwhatwasexperimentallyobserved. Inertainsituationssuhasmirosopi

systems, simulationsrepresentthe simplest if notthe onlyway to quikly foreseethe behaviourof a partile

system in dierent environmental onditions. The high number of variables involved together with omplex

interationlawsoftenmake simulationtimesunaeptablylong. Finally, severalofthe requestedalulations

askfordoublepreisionoatingpointarithmeti, furtherinreasingtheomputationalpowerneeded.

In this paper, we haveshown how anappliation spei arhiteture(DPFPA) speially designed for

thiskindof problemsandbasedonFPGAtehnologyould representagoodompromisebetweenproessing

apabilitiesandlowosts. DPFPA anbeprogrammedwithadediatedlanguageto exeuteomplexoating

pointfuntions anditis equipped withasuitablesoftwaredevelopmentenvironment. Weexeutedthedipole

energy alulation through the simulator, ahieving, thanks also to the new sanning algorithm purposely

designedand heredesribed, aperformanetwieasthat of alast generation PersonalComputerbut anbe

easilyextended to4.

Afurtherimprovementouldbeahievedbyafull ustomASICimplementationoftheAeleratorwhih

isnotjustiedataprototypinglevelwhileitallowsalargesalemanufaturingwithreduedosts. Thiswould

make available several omputing units onneted in luster fashion by means of a point to point network,

providingtheuserwithagreatomputingpower.

REFERENCES

[BELLETTIF.,etal.,℄ AnadaptiveFPGAomputer,IEEEComputinginSiene&Engineering,vol.8(1),January-February

2006,pp.41-49.

[BOGHOSIANB.,etal.℄ Sientiappliationsofgridomputing,IEEEComp.inSiene&Engin.,vol.7(5),Sept.-Ot.2005,

pp.10-13.

[BUELLD.,etal℄ HighPerformaneReongurableComputing,IEEEComputer.Marh2007,pp.2326.

[COWENC.P.etal.℄ A reongurable Montearlo lustering proessor (MCCP), FPGAs for Custom Computing Mahines,

1994.Proeedings.IEEEWorkshopon1013April1994,pp.5965.

[CRUZA.,etal.℄ ASpeialPurposeComputerforspinglassmodels,ComputerPhysisCommuniations,vol.133,n°2-3,2001,

pp.165176

[DANESEG.,etal.℄ AdevelopmentandsimulationenvironmentforaoatingpointoperationsFPGAbasedaelerator,Pro.

ofDSD'033rdEuromiroSymposiumonDigitalSystemDesign,Belek(Turkey),September2003,pp.173-179.

[DANESEG.,etal.℄ AnappliationspeiproessorforMontearlosimulations,IEEEonfereneonParallelandDistributed

Proessing(PDP07),Naples,February2007,pp.262-269.

[DANESEG.,etal.℄ Fieldinduedanti-nematiorderinginassembliesofanisotropiallypolarizablespins,EurophysisLetters

55(3),pp.362-368,2001.

[DONGARRAJ.,etal.℄ High-PerformaneComputing:Clusters,Constellations,MPPs,andFutureDiretions,IEEEComp.in

(12)

[GOKHALEM.,etal.℄ MonteCarloradiativeHeatTransferSimulationonareongurableComputer,Pro.ofFPL2004,LNCS

3203,pp.95104,Springer-Verlaged.

[HANKELJ.,etal.℄ Takingontheembeddedsystemdesignhallenge,IEEEComputer,April2003,3537.

[HERBORDTM.C.℄ AhievingHighPerformanewithFPGA-BasedComputing,IEEEComputer.Marh2007,pp.50-57.

[MARSHP.℄ Highperformanehorizons,Computing&ControlEngineeringJournal,vol.15(6),DeemberJanuary2004/2005,

pp.4248.

[METROPOLISN.,etal.℄ Equation ofStateCalulationsbyFastComputingMahines,JournalofChem.Physis,21,(1953),

pp.10871092.

[MONAGHANS.,etal.℄ Reongurablespeialpurposehardwareforsientiomputationandsimulation,Computing&

Con-trolEngineeringJournal,Vol.3,Issue5,Sept.1992,pp.225234.

[O'KONSKIC.T.,etal.℄ Newmethodforstudyingeletrialorientationandrelaxationeetsinaqueousolloids:preliminary

resultswithtobaomosaivirus,Siene,111,pp.113116(1950).

[POSTULAA.,etal.℄ Thedesignof a speialized proessor forthe simulationof sintering,EUROMICRO96. 'Beyond 2000:

HardwareandSoftwareDesignStrategies',Pro.ofthe22ndEUROMICROConferene,25September1996,pp.501

508.

[RADEVAT.,etal.℄ EletriLightSatteringfromPolytetrauorethylene Suspensions,Coll.AndSurf.119,1(1996).

[WOLFW.℄ Adeadeofhardware/softwareodesign,IEEEComputer,April2003,3842

[WOLFW.℄ Theembeddedsystemslandsape,IEEEComputer,Otober2007,2933.

[ZHANGG.L.,etal.℄ ReongurableAelerationforMontearlobasednanialsimulation,Pro.ofFPT05,IEEEConferene

oneldprogrammabletehnology,SingaporeDe.11142005.

Editedby: PasquaD'Ambra,DanieladiSerano,MarioRosarioGuarraino,FranesaPerla

Reeived: June2007