HIGH PERFORMANCECOMPUTING THROUGH SOC COPROCESSORS
GIANNI DANESE, FRANCESCOLEPORATI, MARCO BERA, MAURO GIACHERO, NELSON NAZZICARI, AND
ALVARO SPELGATTI
∗
Abstrat. In thispaper we desribe DPFPA (Double Preision Floating Point Aelerator), a FPGA-based oproessor
interfaedtotheCPUthrough standardbusonnetions;itisoneivedtoaeleratedoublepreisionoating pointoperations,
featuringtwodoublepreisionoatingpointunits,apipelinedadderandapipelinedmultiplierwithasuitablenumberofstages.
Wetested itsperformane byimplementinga Montearlo-Metropolis simulationof a dipolar system, using a proper software
developmentenvironmentdesignedand realizedinour laboratory. DPFPAanprovideaspeed-upequalto4,withrespetlast
generationPC,showingalsoagoodsalabilityintermsoflokfrequeny,memoryapabilityandnumberofomputingunits.
Keywords: FPGA;hardwareaelerator;highperformaneembeddedsystem;parallelproessing.
1. Introdution. Sienti researh owes a lot to omputer systems whih allowed the ahievement of
results otherwise unthinkable [Marsh, 2005℄[Boghosian et al., 2005℄. A powerful omputing system permits
the study of several phenomena through the employment of simulations like statistial ones into whih the
systemunderanalysisismadetoevolvefromaertaininitialondition,bymodifyingafewofitsharateristi
parameters and by evaluating the feasibility on the basis of a proper merit funtion. These operations are
iteratedthousandsoftimestobringthesysteminanewstablestate.
Severalofthesesimulationsperformdoublepreisionoatingpointoperationssinetheyprovidethe
au-rayrequiredtoappreiate eventhesmallestutuationsin thetypialvariablesofthesimulatedphenomena.
Ontheotherhand,this ouldrepresentahardtask evenforthemostpowerfulproessorswhih takealot of
lokylesto exeuteasingleoatingpointoperation.
Thelakofomputingpowerisgenerallyoveromebyresortingtosuperomputersorlusters[Dongarraet
al.,2005℄butinthelastyearstheuseofaelerators,i.e. dediatedhardwaresystems,isgraduallyestablishing
asavalid alternative,dueto thefeature ofthese devieswhihallowto performthose operationsin lesstime
thantraditionalproessors[Buelletal.,2007℄[Herbordtetal.,2007℄. Severalresearhersworkedintheseyears
not only in this sense but also to improve methodology, tools and praties supporting the integration of
hardwareandsoftwareomponentsduringsystemdesignanddevelopment [Hankeletal.,2003℄[Wolf,2003℄.
AtpresentasimilarprojetonerningaDoublePreisionFloatingPointAelerator(DPFPA)toproess
omplex funtions has been arried out in the Miroomputer laboratory at the University of Pavia (Italy).
Thisativitysuiteswellwith themission of thelaboratorywhih aimsto designand developspeialpurpose
arhitetureforomputationallyintensiveappliations. ThedesignedaeleratorisimplementedontoaFPGA
devielodgedonaboardinteronnetedwithaPersonalComputerandisabletoexeuteoatingpoint
opera-tionsfasterthanatraditionalproessor[Daneseetal,2007℄.Moreover,aproperspeiprogramminglanguage
and asuitablesoftwaredevelopment environmentwere realisedallowing theuserto write, translate andload
properinstrutionssequeneswrittenin aspeilanguage.
Thispaperdesribestheimplementation,ontotheaelerator,ofaMontearlo-Metropolissimulationofa
dipolarsystem,atypialomputationalhallengeforsuperomputers.
TheMontearlo Metropolisalgorithm isan exellent benhmarkto test performane ofaspeialpurpose
alulation system, sine its omputational ore onsists of few oating point operations (double preision)
repeatedoverandover:thisrepresentstheidealonditiontoexploitanappliationspeiarhiteturedevoted
totheaelerationofonlypartiularinstrutions.
Moreover,the algorithmfeatures aSIMDfashion soitis suitablefor adistributed implementation whih
anexploit morealulationunits soinreasingtheoverallahievedspeedup.
Finally,typialMontearlosimulationsinvolvehundredsthousandspartilesystemsandanrunforweeks
or months on the most performing omputers with a single CPU: the availability of powerful aelerating
units, in aseonneted into a luster onguration, makes possible simulations urrently unfeasible or
sim-ulations with more partiles than now, ahieving a better omprehension of the physial phenomena under
analysis.
∗
Department ofInformatia and Sistemistia,PaviaUniversity,Italy, E-mail: franeso.leporatiunipv .it, Phone: +39
Inthepastotherresearhgroupsproposed aeleratorsbasedonFPGAforMontearlosimulations:
•
oneoftherstproposalispresentedin[Postulaetal.,1996℄whereisdesribedametallurgialsinteringsimulationimplementedonaFPGAdeviewithatwoordersofmagnitudespeed-upwithrespettoa
mid90'sworkstation;
•
inthesameyears,otherauthorsoneivedaFPGAimplementationofapartiularMontearlotehnique(Swendsen-Wanglustering)withaonsiderableaelerationwithrespettoa15MHzDSPormaking
useofellularautomata[Cowenetal,1994℄[Monaghanetal,1992℄;
•
morereently, areongurableomputerwasdesigned devoted to heat transfer simulations,workingon single preisionoating point data and ahieving an order of magnitude speed-up relativeto a3
GHzP4 proessor[Gokhaleet al,2003℄; thepeuliarityof this ontribute isthe ideaof using widely
available oating pointlibrariesfor implementing aalulationfuntion ontoFPGA, thus shortening
designtime;
•
nally,in[Zhangetal,2005℄itispresentedasimulationofananialmodelimplementedonaFPGAdevie to aeleratedouble preisionoatingpointalulations. Theahievedspeed-upis 26 relative
toa1.5GHzP4proessor;
•
with regard to FPGA based arhitetures speially devoted to physis simulations, the reentlit-erature proposed the works of Cruz and Belletti [Cruzet al, 2001℄[Belletti et al, 2006℄; the rstone
providesinterestingarhiteturalissuesalthoughusingAlteraFlex10K30omponentslimitsthe
work-ingfrequenyto48MHz;theseondisaprojetsubsequenttoourone,employingAlteraStratixfamily
omponentsandaimsto buildalusterofaeleratorsbasedonthemostreentFPGAdevies.
ForwhatonernsamoregeneraluseofSoCforomputingintensiveappliationsthereisawideliteratureto
whihthereaderouldrefer. ThemostpartoftheOtober2007issueofIEEEComputerwasdevotedto that
topi[Wolf,2007℄.
In thenext setion the arhitetural features of the aelerator, of the spei languagedesigned and of
itssoftwaredevelopmentenvironmentwill bedesribed. Then, thebasiphysialpriniples of thesimulation
and itsneededmodiationsforoptimizingthe useofthe aeleratorwillbehighlighted. Finally, wewill see
theimplementation of thealgorithm onthe aelerator,taking advantagefrom the useof a`dediated stage'
pipeline and the omparison with afew ommerial and popularproessors showinga lear speed-up. Some
remarksexplainingtheevolutionoftheprojetwillonludethepaper.
2. The Aelerator. We realized a FPGA-based aelerator onneted to a host PC to aelerate the
hardestpartofaalulus. OurideareferstoaboardwithaFPGAdevie(AlteraStratix family)andaFlash
memorystoring theongurationode; aJTAGportis used to sendtheprogram to theFlashmemoryfrom
thePC.Reently,Alterahasmadeavailable someboardswiththese features. These boardsanommuniate
with PC through the network requiring a propernetwork manager. In this ase, both the aelerators and
the network proessoran be loaded on the same FPGA. The board we bought is equipped with a Stratix
1S40 FPGA omponent on whih a 32 bit RISC CPU, alled Nios, is implemented; this proessor an be
programmed using C language and is supplied with basi libraries to easily handle the on board devies:
2 MB Ram, 8 MB Flash Memory, 16 MB Compat Flash Memory, 100 Mb/s Ethernet Interfae, 2 Serial
ports.
Wedesignedan aeleratingunit that is abletoimplementdierent funtions (alsoomplexlikesin,os,
log, ...,throughTaylorseries). Thus, itanbeusedforseveralappliations, alsoverydiversied. Moreover,
theinstrutionset isfullyre-programmableaordingto thepartiularalulationtobeperformed.
The designed unit (DPFPA) anexploit the parallelism present in theoperationssine double preision
FloatingPointMACoperations anbe exeutedat the sametime in thesum and multiplypipelines present
onto it. The main partof DPFPA is DPFPP (double preision oating point proessor),whose arhiteture
onsistsof(g. 2.1):
•
2aeleratingunits,independentlyworking;•
aCaheMemory(4banks),whih anstoreinputdataandresultsforthetwoaeleratingunits;A suitable bus devoted to ommuniation between Aeleratorand Nios proessor("sub bus") hasbeen
also implemented. TheMath Unit funtional oreis a double-preisionoating-point ALU, whih integrates
both anadder and amultiplier operating in a parallel fashion. Both devies are pipelined (9 stages for the
Fig.2.1. ArhitetureoftheomputationalunitimplementedontotheFPGAdevie.
Together with theadder andthe multiplier,the ALUalso ontains3registerbanks, eah ableto store4
double-preisionoatingpointnumbers. Thebanksareeahtiedtoapartiularpurpose(oneisforinputdata,
oneforadderresultsandoneformultiplier results).
Like in many similar appliations, to makeomputing elements and storage spae independent, aFIFO
memoryforbothinputsandoutputsisimplemented(therearetwoFIFOqueuesontheoutputsinearithmeti
resultsareseparatedfromlogialones).
TheALU operations are enoded in 37-bit words,able to simultaneously triggereither a sumor a
om-parison, amultipliation, adata feth, 3 write operations to the internal registerbanks and the output of a
result.
Toahievebetterperformanewithourspeitask,theoperandsoftheadderanoptionallybemultiplied
by
[
−
2
,
−
1
,
2]
fortherstoperand,and[
−
1
,
−
0
.
5
,
0
.
5]
fortheseondone. Inasimilarway,themultiplierresult anbedoubled,halvedor negatedwithoutextralokyles.Sinefeedingtheop-odeswouldrequirealargeandmostlywastedbandwidth(theodeisessentiallyyli,
so that the same op-odes are exeuted over and over again) the ode sequenes are stored in a Miroode
Sequener. This deviestoresthe programsequenesin aninternal RAMandassoiatestothem a6-bits
op-ode(thisismuhlikehavingaCPUwithamiro-programmedontrolunitwhoseodeanbehangedbythe
appliationtodeneaustominstrutionset).
TheMathUnititselfhasnoaddressingapabilitiestowardeitherinputoroutputhannels,soeverymemory
I/O operation must be managed by an external devie. A Memory Manager was deemed to that task and
oneived for a spei appliation lass: those where most omputations are performed on data logially
organizedin three-dimensionalmatries. Deoupling thealloationissues from theomputing algorithm,the
MemoryManageromputesthememoryaddressesfrom semanti-levelinputs,suh asaddressesinthematrix
domain (
X
−
Y
−
Z
oordinates) or osets between elements (the matrix is supposed to be yli, so that e.g. theleftmost elementin arowis adjaentto therightmost element in thesamerow). Thisis of extremeimportane,sineotherwisethe sameode would requireat least areompilationto beexeutedon matries
withdierentsizes.
TheinternalControlUnit(CU)deodesinstrutionsomingfromthehostomputeranddrivestheontrol
signalsimplementingtherequestedfuntion. Itmainlyonsistsof3units:
•
InstrutionDeode: seletsbetweendataandinstrutionsfromhosttotheDPFPA.Onlyinthelastaseitgeneratesproperontrolsignals;
•
JumpUnit: setstheRAMaddresstothestartingpointofthenextinstrutionsequenetobeexeuted;Instrutionsare64bitwideexploitingpartoftheredundanypresentintheIEEE754standardofoating
pointrepresentation,to distinguishthem fromdouble preisionnumbers. Twokindsofinstrutionshavebeen
implemented:
•
Programminginstrutions to storein theCURAMexeutivesequenes.•
Exeutive instrutions toperformspei alulations,reallingsequenesalreadyloaded.Programming instrutions to store in the CU RAM exeutive sequenes. Exeutive instrutions to perform
speialulations,reallingsequenesalreadyloaded.
A great advantageof our approah is that thesequenes of anexeutiveinstrution areperformed in an
iterativemanneruntilanewexeutiveinstrutionwill bereeivedbytheCU.So,during theexeutionofthe
alulus,CUhastodeodeonlyfewinstrutionsandansaveagreatamountoftime.
3. Programming DPFPP. As previously stated, DPFPP an handle two types of instrutions:
pro-gramminginstrutionsandexeutiveinstrutions. The formerareused tostoremiroodesequenesinto the
CU RAM, making miroode words to be loaded at the orret address into the RAM of CU. The word of
miroode,allowstheassertionofneededontrolsignalsforeahlokyle.
Eahexeutiveinstrutionallows,ontheotherhand,thereallingofsequenesalreadystored.
We realisedsoon, that the sequene developmentusing binary miroodewas avery hardand ineient
work. Thus,wehosetodesignanddevelopapseudo-assemblydediatedlanguagethatsimpliesthesequene
writing. Theinstrutionsof thelanguageare mappeddiretly onthehardware andreettheoperation that
DPFPP anexeute. Table3.1showsthelistoftheinstrutionsand theirsyntax.
Table3.1
Listandsyntaxofthelanguageinstrutions.
InstrutionSyntax
MOVreg;
SUM1op2op;SUM1op;SUM1 opopSUMop2op;SUMopop
MULopop;MULopop;MULop;
OUTxx;
INT;
Apropertranslatorwasalsodeveloped,usingstandardUnixtoolssuhasLexandYa.
Furthermore,wedevelopedanalloatorforaneasygenerationofthelewiththeprogramminginstrutions
that must be sent to theDPFPP. Finally, we designedasimulator, reproduing exatlytheDPFPP working
and enabling pipeline and registerinspetion. The simulator also allows thevisualisation of the lok yles
neededbyaspeisequeneorbyasetofsequenes. Thankstothistool,weanexeutemiroodesequenes
withoutloadingthemintotheDPFPP;thus,weansimplifythesequenedebug,verifytheresults'orretness
andhektheperformane.
Allthesetoolsareintegratedinauniquedevelopmentenvironment,realisedin theMiroomputer
labora-torytoeasethesequenedevelopment. Therearefourmainsteps: rst,wewriteandompilesoureodeusing
an internal editor, then wetest the odeusing the simulator. Finally, weproduethe programmingle that
hasto besentto theDPFPP by usingthe alloator. Moredetails on thehardwareand software forDPFPA
arein[Daneseetal.,2003℄.
4. The ConsideredProblem. Liquidrystalsandolloidalsuspensionsaretwoexamplesofsystemsfor
whih the orientation order has been widely studied through simulations. In both ases interations among
partilesplayadominantrole. Inpreviousworks, werealizedaubilattiemodeldesribingtheinterations
eets in adipolarsystem inpresene ofan externallattieeld [Bellini eral.,2001℄: simulationsmade with
this model identied thepresene oftwophase transitionsand theobtained resultsouldin partexplain the
phenomenonknownas anomalousbi-refringene asanalyzedin[O'Konskiet al.,1950℄[Radevaet al.,1996℄.
Onthe otherhand,simulationstakeunaeptablylongtimes evenonthe mostreent andpowerful
om-putingsystemsrangingfromafewdaysuptosomeweeksdepending onthesizeofthesimulatedsystem. The
oreof theomputationis, infat,theevaluationoftheenergysine,aordingto theimplementedalgorithm
evaluating theorrespondinghangein energy. Eahmoveanbeaeptedorrejeteddependingonthe
vari-ation of the energy assoiated with it [Metropolis et al., 1953℄. We simulated lattie systemswith partiles
ranging from afew hundreds upto 100.000 onsidering only rst neighborinterations, i. e. theinteration
betweeneah spinand thesixlosestones in the
X
+
,X
−
,Y
+
,Y
−
,Z
+
,Z
−
diretions. Periodiboundary onditionswere applied [Frenkelet al., 1996℄. Theassoiatedenergyof eah dipole due to thepreseneof anexternaleldorientedtowardzaxisis:
(1)
Edip
=
momz
(
dip
)
Thetermsdueto theinterationsbetweentheonsidereddipoleandeahofitsrstneighboursare:
(2)
EX+
= 2
∗
momx
(
dip
)
∗
momx
(
X
+) +
−
momy
(
dip
)
∗
momy
(
X
+)
−
momz
(
dip
)
∗
momz
(
X
+)
(3)
EX
−
= 2
∗
momx
(
dip
)
∗
momx
(
X
−
) +
−
momy
(
dip
)
∗
momy
(
X
−
)
−
momz
(
dip
)
∗
momz
(
X
−
)
(4)
EY
+
=
−
momx
(
dip
)
∗
momx
(
Y
+) +
+2
∗
momy
(
dip
)
∗
momy
(
Y
+)
−
momz
(
dip
)
∗
momz
(
Y
+)
(5)
EY
−
=
−
momx
(
dip
)
∗
momx
(
Y
−
) +
+2
∗
momy
(
dip
)
∗
momy
(
Y
−
)
−
momz
(
dip
)
∗
momz
(
Y
−
)
(6)
EZ+
=
−
momx
(
dip
)
∗
momx
(
Z
+) +
−
momy
(
dip
)
∗
momy
(
Z
+) + 2
∗
momz
(
dip
)
∗
momz
(
Z
+)
(7)
EZ
−
=
−
momx
(
dip
)
∗
momx
(
Z
−
) +
−
momy
(
dip
)
∗
momy
(
Z
−
) + 2
∗
momz
(
dip
)
∗
momz
(
Z
−
)
wheretheomponentsofthemomentsforeahdipoleare:
(8)
momx
(
dip
) =
cos
(
θ
)
∗
sin
(
θ
)
∗
cos
(
ϕ
)
(9)momy
(
dip
) =
cos
(
θ
)
∗
sin
(
θ
)
∗
sin
(
ϕ
)
(10)momz
(
dip
) =
cos
′
(
θ
)
and
θ
,ϕ
aretheangularo-ordinatesofageneridipole. Theoverallenergyofthedipoleisthesumofalltheseontributes:
(11)
ET OT
[
dip
] =
−
0
,
5
∗
[
Edip
−
k
∗
(
EX+
+
EX
−
+
EY
+
+
EY
−
+
EZ+
+
EZ
−
)]
Theglobalenergyin thesystemisthesumextendedonthewhole dipolarset.
Thesimulatedsystemisharaterisedbyaninitialrandompartiledistributionnotorrespondingtothat
ahievable at the equilibrium. This means that the hange in the orientation of a dipole will modify the
moments and the energy in the others, mainly in the neighbours. These ones, in turn, will inuene their
respetive neighbours and so on, propagating those variations in the moments throughout the lattie. This
reetsinenergyutuationsthatdisappearonlyafterasuientnumberofylesintowhihETOTforeah
dipoleisalulated(equilibration). Onlyatthispoint,theMetropolistestonenergyvariationanbeapplied.
This loop series orresponds to nearlythe 85%of the alulationbut it onsists of only few instrutions, so
justifyingthe ideaof anaelerator speialized in proessing onlythose operations. Todo this, we employed
theFPGAtehnology,whih isheaperandsimplerthanASICintermsofdesignandtest.
However,duringthedesignphase,weonsideredonvenienttorealiseamoregeneralhipabletoaelerate
thosedoublepreisionoatingpointinstrutionswhihanbeoftenfoundinsientisimulations. Thisextends
theappliabilityoftheDPFPA bothtomodelsdierenttothat used(i. e. hexagonallattiesinsteadof ubi
ones)ortoompletely dierenteldswherehighperformaneomputingismandatory.
5. Energy Evaluationand Implementation. Tosimplify thereadabilityoftheenergyalulationon
the DPFPA, as itwill be desribed in thefollowing, let's rewritethe expressions reported in setion 3. The
interation energyof eah dipole anbe written as
−
(CT
∗
CT
)
globalenergyin thesystem.
CT
istheloaleld generatedbytheneighborsof theonsidered dipoleandanbeexpressedas:
(12)
CT
=
CT X
∗
SC
∗
k
+
CT Y
∗
SS
∗
k
+
CT Z
∗
C
∗
k
+
C
where
k
isaonstantdependingonthesystemdensityandSC
=
sin
(
θ
)
cos
(
ϕ
)
,SS
=
sin
(
θ
)
sin
(
ϕ
)
,C
=
cos
(
θ
)
,with
θ
,ϕ
angularo-ordinatesofthedipole. CTX,CTYeCTZaretheloalomponentsoftheeldgeneratedbytheneighbourdipoles. Theyarerespetivelyequalto:
(13)
CT X
= (
M XX
+
M X
∗
X
+
M Y X
+
M Y
∗
X
+
M ZX
+
M Z
∗
X
)
(14)CT Y
= (
M XY
+
M X
∗
Y
+
M Y Y
+
M Y
∗
Y
+
M ZY
+
M Z
∗
Y
)
(15)CT Z
= (
M XZ
+
M X
∗
Z
+
M Y Z
+
M Y
∗
Z
+
M ZZ
+
M Z
∗
Z
)
Weidentifywith
M XX
,M XY
,M XZ
theloaleldomponentsgeneratedbytherstneighbordipoleinthediretion
X
−
,andwithM X
∗
X
,M X
∗
Y
,M X
∗
Z
theloaleldomponentsgeneratedbytherstneighbor dipoleinthediretionX
+
. TheothertermsduetotheeetofdipolesindiretionsY
+
/Y
−
andZ
+
/Z
−
are denedaordinglytothesamenotation. Moreovertheloaleld,duetotheneighbors,hangestheomponentsofthedipolarmoment. These shouldbeevaluatedeahtimeaordingtothefollowingexpressions:
(16)
momx
(
dip
) =
CT
∗
SC,
momy
(
dip
) =
CT
∗
SS,
momz
(
dip
) =
CT
∗
C
Whilethe
SC
,SS
andC
termsareevaluatedateahmovement,theothertermsshouldbere-alulatedforthenumberofyles neessaryto equilibrate theenergyin the system. All these operations, nally,are repeated
M
∗
N
times withM
=
ylenumber(i. e. 10.000)andN
=
numberof dipoles in thesystem. This aountsforthehighomputationalweightoftheelaboration.
Fig.5.1. Diagonalsanning.
With 'sanning' wemean the order throughwhih thedipoles are proessed during the simulation. The
identiationofasuitableorderansigniantlyaetthealgorithmeieny intermsofmemoryaessand
reuse ofdata. Ifwe would notuseany partiularsanningorder but ifweonly would onsiderdipoles in the
sameorder ofmemorization(
1
st
,
2
nd
,
3
rd
,...),theirelaborationwouldneed21inputdata(
SC
,SS
andC
ofthemoveddipole plusthemomentsofitssix rstneighbors),returningthe3newomponentsofthemoment
oftheonsidereddipole.
However,iftheseletionorderonsidersdipolesloseto eahother towardadiagonaldiretion,theselast
onessharetworstneighborswhoseparametersarenomoreneededfortheelaborationofthenewdipole. Fig.
5.1showsanexampleofthis,sinepassingfromdipole 1to2,dipoles3and4arepreservedasrstneighbors.
againg. 5.1wenotethatdipoles3and4givethefollowingontributestoeahomponentoftheloaleldin
dipole1:
(17)
M X
∗
X
+
M Y
∗
Y
= 2
∗
momx
(4)
−
momx
(3)
(18)M X
∗
Y
+
M Y
∗
Y
=
−
momy
(4) + 2
∗
momy
(3)
(19)
M X
∗
Z
+
M Y
∗
Z
=
−
momz
(4)
−
momz
(3)
Ifwenowonsider theontributeof thesamedipoles todipole 2,thenext reahedby thediagonalsanning,
wend:
(20)
M XX
+
M Y X
= 2
∗
momx
(3)
−
momx
(4)
(21)M XY
+
M Y X
=
−
momy
(3) + 2
∗
momy
(4)
(22)
M XZ
+
M Y Z
=
−
momz
(3)
−
momz
(4)
Thevaluesontherightareobtainedbysubstitutingatthetermsonleft, thosevaluesreportedinequationsin
setion3.
Equations19and22areequalandanbealulatedonlyone. Thesameonsiderationsareappliablein
aseofmovementstoward
Y Z
orXZ
diretionwithaonsistentsparingof operations.Finally,themomentomponentsinvolvedinequations1719forthedipole1arealsopresent(withdierent
oeients)inequations2022and,again,theyanbealulatedonlyone(i. e. fordipole1,storingthemin
registersfromwhihtheyanberetrievedlaterforthenextdipole)withafurthersavingof time.
Fig.5.2. Diagonalsforsanningin
XY
fae.Thediagonalsanningbasiallyonsistsof
XY
movementsasshownin g. 5.2.Theubi lattie isonsidered as madeby `slies'and when the last dipole is reahed on an
XY
fae,alittlemovement towardthe
Y Z
orXZ
diretion allowsto skip to thenextXY
slie. Ineah slie, dierentstartingpointsanbehosendependingontheodd/evennumberofdipolespresentontheedgeofthelattie,
butforsakeofsimpliitywedon'twanttoexessivelydetailthesesimulationaspets.
6. Implementationon DPFPA. As previouslysaid,asequene onsistsof amiroinstrution setand
ould be identied asa Setup or aLoop sequene. The rst problem to dealwith is the denition of those
operationsmorefrequentlyexeutedwhihshouldbeinsertedintotheLoopsequene. Inthediagonalsanning,
themostfrequentoperationregardstheinterationbetweendipolesloatedondiagonalsbelongingtothe
XY
side: thus,theLoopsequeneshouldimplementtheenergyalulationofthesedipoles,whiletheSetupshould
exeutethemovementsin the
XZ
orY Z
faes ofthelattie, throughwhih thealgorithm onsiders therstdipoleofthenext
XY
`slie'andanotherLoopsequenebegins.Aordingto whatsaidin theprevioussetion,thenumberoftheneededsumsis14forevaluating
CT X
,CT Y
eCT Z
(one add is shared with the previousdipole), 3 forCT
and 1 moreto add these valuesto thepartialtotalenergyobtainedfromthepreviousdipolesonsidered. Thustheadderpipelineisusedasitsbest,
if18lokylesaretaken. Forwhat onernsmultipliations,instead,6areneededtoalulate
CT
,3fortheFig.6.1. Stage2intheadderpipelineduringtheLoopphase
Fig.6.2. Stage4intheadderpipelineduringtheLoopphase
Fig.6.3. Stage6intheadderpipelineduringtheLoopphase
6.1. AdderUnit. Eahstageisonsideredasanindependentregisterontainingthepartialresultwhih
anbestoredeveryLlokyles(Listhepipelinelength). TheLoopsequeneevaluatestheenergyofdipoles
onsidered in the
XY
diretion: 4stages of adderpipeline were devoted to alulateCT X
,CT Y
,CT Z
andCT
. In g. 6.1, the seond pipeline stage devoted to thealulation ofCT X
is shown, with the partiularalulation highlightedin bold in eah of the four sumsneeded. Inthe rst step, the termin parentheses is
`shared'withthepreviousdipoleonsideredanddoesnotneedtobere-alulated(seeprevioussetion). Eah
partial result is available only when it has run aross the whole pipeline i. e. after 9 lok yles and the
ompletevalueof
CT X
isavailableafter36lokyles. ThenthestageproeedstoevaluatetheCT X
forthenextdipole. Thesameonsiderationsanbemadefor
CT Y
andCT Z
. ThealulationofCT
isimplementedinthestage4,whihworksagainfor36lokyles. The
CT X
,CT Y
,CT Z
valuesusedin thisasearethose!
!
"
"
Fig.6.4.Stage6intheadderpipelineduringtheLoopphase.
Therefore the partial value of
CT
is savedin a register from whih it will be retrieved during stage 4B(g.6.2). Tooptimisetheuseofthepipelinetheremainingstagesaredevotedtoimplementthesamealulations
fora seond dipole,soasto proess 2dipoles in 36 lokyles. This orresponds, aspreviouslyseen, to an
optimaluseoftheadder. Finally,stage6isdevotedto addto theglobalenergyvalue
ET OT
,thosetwoenergyontributes (
EN EW
) alulatedin the other stages of the pipeline upto this moment(g. 6.3). Basially itworks inthesamewayasstage4,inluding twosumssharedwiththesuessiveelaborateddipoles(againto
optimisethepipelineuse). Eventhough,during the36lokylesallthe sumsneeded fortheenergyoftwo
dipoleshavebeenperformed,thedipolesinvolvedintheelaborationaremorethan2. Infat,whiletheadder
is evaluating
CT X
,CT Y
andCT Z
forthe twodipoles, it is notpossibleto determineat the sametime theorrespondent
CT
terms,sinethepreviousalulations(CT X
,CT Y
andCY Z
)shouldbeompletedandtheyshould also be multiplied by
SC
1
∗
k
,SS
1
∗
k
andC
1
∗
k
(k is asuitableonstantdepending on thesystem density). ThereforetheCT
termreallyomputedreferstothepreviousLoopsequene. Thismeansthat whileCT X
,CT Y
andCT Z
for dipoles(
n
)
and(
n
+ 1)
are evaluated, theCT
terms referto(
n
−
3)
and(
n
−
4)
dipoles and the
EN EW
orrespondsto the ouple(
n
−
5)
and(
n
−
6)
previously started. Moreover,also the ouple(
n
−
1
, n
−
2)
issubjetedtoapartialelaborationmakingthepipelinealwaysworking.This ongurationbrings a onsistentlevelof parallelisation in the exeution of thealgorithm. Fig. 6.4
shows the omplete set of operationsalulated during the36 lok ylesof eah Loopsequene. Per eah
stageandlokyle,theeetivesumperformedisreportedinbold.
6.2. Multiplier Unit. Thisunit exeutesthemultipliationsneededinthetermsthatmustbeadded,i.
e. 10 pereah ofthetwodipoles of theadderunit (globally 20)and in asequentialway. Tosynhronisethe
operationsinthemultiplierpipelinewiththoseoftheadder,thelengthofthepipeline(15stages)isextendedto
18byaddingthreeNOP(nooperation)yles: thismeansthatin36lokylesthemultiplierworkseetively
for30yles,atimesuienttoexeutetherequired20produts,withoutloosingthesynhronisationwiththe
orrespondenttermsin theadderunit. Fig. 6.5desribestheoperationsperformedtogether withthe output
fromthepipelineatthatinstant,pereahlokyle. Inparenthesistheordernumberisreportedofthedipole
towhihthealulationrefers: nisthedipoleforwhihthealulationoftheenergyisinitiatedintheurrent
sequene. AttheendofeahLoopsequenethepipelineoutputsnewmomentsandenergyofthedipoleouple
whih startedtheevaluation 3sequenes before. Fititiousprodutshavebeeninsertedwhen neededtofore
thepipelinegoingonestepbeyond.
7. Results. ThewholesystemhasbeentestedbyexeutingMontearlosimulationsofdierentsizelatties
Fig.6.5. Operationsperformed inthemultipliationpipelineduring36lokyles.
Performanehasbeenevaluatedasspeed-uprespettotheexeutionofthesamesimulationonanIntelP4
proessorwith 1GBRam memory;also FPGAoupation wasused asaperformaneparameter. Simulation
ode waswritten in C language and optimized using Mirosoft Visual C++ environment. The Aelerator
elaborationtimesweremeasuredbymeansofthelokountersimplementedintheinterfaebetweenNiosand
theoproessorpreviouslydesribed.
Ing. 7.1weshowtheperformaneasspeed-upfatorrespettotwoIntelP4proessorswith3GHzand
1.7 GHz frequenyrespetively, alulatingthe dipolar energy of the simulatedsystem. That omputational
oreisrepeatedlyexeuted
k
∗
N
∗
10000
timeswherek
istheoeientresponsiblefortheinterationsettlement (equilibration) and N is the dipole number: this givesreason ofthe high omputational loadwhih anlead(forbigpartile systems,e. g. 100000dipoles)to waitalot forresults,ifthesimulationshould beperformed
on aPC. Thespeed-up fatoris inreasing forthe 1.7 GHzproessordue to ahe eet,while for themost
performingIntelproessor(3GHz)sets around2.
Considering the size of the FPGA we used, other 2 aelerating units ould be implemented, we an
reasonablystate that a speed-up fator equalto 4anbe ahieved in ase of afull implementation on the
FPGAomponentwehose(StratixEP1S40). Furtherspeed-up ouldbeobtainedifother omponentsofthe
Altera'sfamily(Stratix2or Stratix3nowavailable)shouldbeemployed.
Theostofeahboardweboughtwasnearly$1200:thisrepresentsanimportantindiationwhenprediting
Fig.7.1.Speed-upoftheFPGAbasedaeleratorwithrespettheP4Intelproessors.
to aomputational unit in aPC luster, providing the sientist witha COTSdesktop omputing system on
whih he/sheanrunsimulations.
8. Conlusions. Simulationsallowtheanalysisofaphysialsystem,evenomplex,withoutexperimental
measuresor,sometimes,toonrmwhatwasexperimentallyobserved. Inertainsituationssuhasmirosopi
systems, simulationsrepresentthe simplest if notthe onlyway to quikly foreseethe behaviourof a partile
system in dierent environmental onditions. The high number of variables involved together with omplex
interationlawsoftenmake simulationtimesunaeptablylong. Finally, severalofthe requestedalulations
askfordoublepreisionoatingpointarithmeti, furtherinreasingtheomputationalpowerneeded.
In this paper, we haveshown how anappliation spei arhiteture(DPFPA) speially designed for
thiskindof problemsandbasedonFPGAtehnologyould representagoodompromisebetweenproessing
apabilitiesandlowosts. DPFPA anbeprogrammedwithadediatedlanguageto exeuteomplexoating
pointfuntions anditis equipped withasuitablesoftwaredevelopmentenvironment. Weexeutedthedipole
energy alulation through the simulator, ahieving, thanks also to the new sanning algorithm purposely
designedand heredesribed, aperformanetwieasthat of alast generation PersonalComputerbut anbe
easilyextended to4.
Afurtherimprovementouldbeahievedbyafull ustomASICimplementationoftheAeleratorwhih
isnotjustiedataprototypinglevelwhileitallowsalargesalemanufaturingwithreduedosts. Thiswould
make available several omputing units onneted in luster fashion by means of a point to point network,
providingtheuserwithagreatomputingpower.
REFERENCES
[BELLETTIF.,etal.,℄ AnadaptiveFPGAomputer,IEEEComputinginSiene&Engineering,vol.8(1),January-February
2006,pp.41-49.
[BOGHOSIANB.,etal.℄ Sientiappliationsofgridomputing,IEEEComp.inSiene&Engin.,vol.7(5),Sept.-Ot.2005,
pp.10-13.
[BUELLD.,etal℄ HighPerformaneReongurableComputing,IEEEComputer.Marh2007,pp.2326.
[COWENC.P.etal.℄ A reongurable Montearlo lustering proessor (MCCP), FPGAs for Custom Computing Mahines,
1994.Proeedings.IEEEWorkshopon1013April1994,pp.5965.
[CRUZA.,etal.℄ ASpeialPurposeComputerforspinglassmodels,ComputerPhysisCommuniations,vol.133,n°2-3,2001,
pp.165176
[DANESEG.,etal.℄ AdevelopmentandsimulationenvironmentforaoatingpointoperationsFPGAbasedaelerator,Pro.
ofDSD'033rdEuromiroSymposiumonDigitalSystemDesign,Belek(Turkey),September2003,pp.173-179.
[DANESEG.,etal.℄ AnappliationspeiproessorforMontearlosimulations,IEEEonfereneonParallelandDistributed
Proessing(PDP07),Naples,February2007,pp.262-269.
[DANESEG.,etal.℄ Fieldinduedanti-nematiorderinginassembliesofanisotropiallypolarizablespins,EurophysisLetters
55(3),pp.362-368,2001.
[DONGARRAJ.,etal.℄ High-PerformaneComputing:Clusters,Constellations,MPPs,andFutureDiretions,IEEEComp.in
[GOKHALEM.,etal.℄ MonteCarloradiativeHeatTransferSimulationonareongurableComputer,Pro.ofFPL2004,LNCS
3203,pp.95104,Springer-Verlaged.
[HANKELJ.,etal.℄ Takingontheembeddedsystemdesignhallenge,IEEEComputer,April2003,3537.
[HERBORDTM.C.℄ AhievingHighPerformanewithFPGA-BasedComputing,IEEEComputer.Marh2007,pp.50-57.
[MARSHP.℄ Highperformanehorizons,Computing&ControlEngineeringJournal,vol.15(6),DeemberJanuary2004/2005,
pp.4248.
[METROPOLISN.,etal.℄ Equation ofStateCalulationsbyFastComputingMahines,JournalofChem.Physis,21,(1953),
pp.10871092.
[MONAGHANS.,etal.℄ Reongurablespeialpurposehardwareforsientiomputationandsimulation,Computing&
Con-trolEngineeringJournal,Vol.3,Issue5,Sept.1992,pp.225234.
[O'KONSKIC.T.,etal.℄ Newmethodforstudyingeletrialorientationandrelaxationeetsinaqueousolloids:preliminary
resultswithtobaomosaivirus,Siene,111,pp.113116(1950).
[POSTULAA.,etal.℄ Thedesignof a speialized proessor forthe simulationof sintering,EUROMICRO96. 'Beyond 2000:
HardwareandSoftwareDesignStrategies',Pro.ofthe22ndEUROMICROConferene,25September1996,pp.501
508.
[RADEVAT.,etal.℄ EletriLightSatteringfromPolytetrauorethylene Suspensions,Coll.AndSurf.119,1(1996).
[WOLFW.℄ Adeadeofhardware/softwareodesign,IEEEComputer,April2003,3842
[WOLFW.℄ Theembeddedsystemslandsape,IEEEComputer,Otober2007,2933.
[ZHANGG.L.,etal.℄ ReongurableAelerationforMontearlobasednanialsimulation,Pro.ofFPT05,IEEEConferene
oneldprogrammabletehnology,SingaporeDe.11142005.
Editedby: PasquaD'Ambra,DanieladiSerano,MarioRosarioGuarraino,FranesaPerla
Reeived: June2007