E.V.E., AN OBJECTORIENTEDSIMD LIBRARY
JOEL FALCOU AND JOCELYN SEROT
∗
Abstrat. Thispaperdesribestheeve(ExpressiveVeloityEngine)library,anobjetorientedC++librarydesignedtoease
theproessofwritingeientnumerialappliationsusingAltiVe,theSIMDextensiondesignedbyApple,MotorolaandIBM.
AltiVe-powered appliationstypiallyshow oarelativespeedupof4to 16butneedaomplexandawkwardprogrammation
style. Byusingvarious templatemetaprogrammingtehniques, E.V.E.provides an easytouse, STL-like, interfaethat allows
developer to quiklywrite eient and easy to read ode. Typial appliations written with E.V.E.an benet from a large
frationoftheorialmaximumspeedupwhilebeingwrittenassimpleC++arithmetiode.
1. Introdution.
1.1. The AltiVe Extension. Reently, SIMDenhaned instrutionshavebeenproposedasasolution
fordeliveringhighermiroproessorhardwareutilisation. SIMD(SingleInstrution,MultipleData)extensions
startedappearingin1994inHP's MAX2andSun'sVSextensionsandannowbefoundinmostof
miropro-essors, inluding Intel's Pentiums (MMX/SSE/SSE2) and Motorola/IBM's PowerPCs (Altive). They have
been provedpartiularlyusefulforaeleratingappliationsbased upon data-intensive,regularomputations,
suh assignalorimageproessing.
AltiVe[10℄isanextensiondesignedtoenhanePowerPC 1
proessorperformaneonappliationshandling
large amounts of data. The AltiVe arhiteture is based on a SIMD proessing unit integrated with the
PowerPC arhiteture. It introdues a new set of 128 bit wide registers distint from the existing general
purpose oroating-pointregisters. These registersare aessible through160 newvetor instrutionsthat
anbefreelymixedwithotherinstrutions(therearenorestritiononhowvetorinstrutionsanbeintermixed
withbranh,integeroroating-pointinstrutionswithnoontextswithingnoroverheadfordoingso). Altive
handles data as 128 bit vetors that an ontain sixteen 8 bit integers, eight 16 bit integers, four 32 bit
integersorfour32 bitoatingpointsvalues. Forexample,anyvetoroperationperformed onavetor har
is in fat performed on sixteen har simultaneously and is theoretially running sixteen times faster as the
salarequivalentoperation. AltiVevetorfuntionsoveralargespetrum,extendingfromsimplearithmeti
funtions(additions,subtrations)tobooleanevaluationorlookuptablesolving.
Altive isnativelyprogrammedbymeans ofaC API[5℄. Programming at thislevelanoer signiant
speedups (from 4to 12 fortypialsignalproessing algorithms)but is arather tediousand error-pronetask,
beause this CAPI isreally assembly in disguise. Theappliation-levelvetors(arrays,in variable number
andwithvariablesizes) mustbeexpliitlymappedontotheAltivevetors(xednumber,xedsize) andthe
programmermustdealwithseverallow-leveldetailssuhasvetorpaddingandalignment. Toorretlyturna
salarfuntion intoavetor-aeleratedone,alargepartofodehastoberewritten.
Considerforexampleasimple3x1smoothinglter(Fig.1.1):
void C_filter( har* d, short* r)
{
for(int i=1; i<SIZE-1; i++ )
r[i℄ = (d[i-1℄+2*d[i℄+d[i+1℄)/4;
}
Fig.1.1. Asimple3x1gaussianlterwritteninstandard C.
This ode an be rewritten (vetorized")using Altive vetor funtions. However, this rewriting is not
trivial. Wersthavetolookattheoriginalalgorithminaparallel way. TheC_filterfuntionisbasedonan
iterativealgorithmthatrunstrougheahitemoftheinputdata,appliestheorrespondingoperationsandwrites
the result into the output array. By ontrast, AltiVe funtions operate on abunh of data simultaneously.
Wehave to reraft thealgorithm so that itworks onvetorsinstead ofsingle salarvalues. This is done by
∗
LASMEA,UMR6602CNRS/U.ClermontFerrand,Frane(falou, jserotlasmea.univ-bplerm ont .fr) .
1
loadingdataintoAltiVevetors,shiftingthesevetorsleftandright,andperformingvetormultipliationand
addition. TheresultingodewhihisindeedsigniantlylongerthantheoriginaloneisgiveninAppendixA.
Wehavebenhmarkedboththesalarand vetorized implementation ona2GHzPowerPCG5 andobtained
theresultsshowninTable1.1. Bothodewereompiledusingg 3.3using-O3. Onthisexample,atenfold
aelerationanbeobservedwiththeAltiVeextension. However,thetimespenttorewritethealgorithmina
vetorized"wayandthesomehowawkwardAltiveAPIanhinderthedevelopmentoflargersaleappliations.
Table1.1
Exeutiontimeandrelativespeed-upfor3x1lters.
SIZEvalue C_filter AV_filter SpeedUp
16
K0
.
209
ms0
.
020
ms10
.
5
64
K0
.
854
ms0
.
075
ms11
.
4
256
K3
.
737
ms0
.
322
ms11
.
6
1024
K16
.
253
ms1
.
440
ms11
.
3
2. AltiVe in highlevelAPI. Asevidenedintheprevioussetion,writingAltiVe-basedappliations
anbeatedious task. A possibleapproah to irumventthis problem is to enapsulateAltive vetorsand
the assoiated operations within a C++ lass. Instantiating this lass and using lassi inx notations will
produetheAltiVeode. Weatuallybuiltsuhalass(AVetor)andusedittoenodethelteringexample
introduedinsetion1.1. Theresultingodeisshownbelow.
AVetor<har> img(SIZE);
AVetor<short> res(SIZE);
res = (img.sr(1)+2*img+img.sl(1))/4;
Inthisformulation,expliititerationshavebeenreplaedbyappliationofoverloadedoperatorsonAVetor
objets. Thesrandsl methodsimplementstheshiftingoperations. Theperformaneofthis ode,however,
is verydisappointing. With the arraysizes shownin Table1.1, themeasured speed-ups never exeed1. The
reasonsforsuhbehaviouraregivenbelow.
Considerasimpleodefragmentusingoverloadedoperatorsasshownbelow:
AVetor<har> r(SIZE),x(SIZE),y(SIZE),z(SIZE);
r = x + y + z;
WhenaC++ompileranalysesthisode,itreduesthesuessiveoperatorallsiteratively,resolvingrst
y+z thenx+(y+z) where y+zis in fat stored in a temporary objet. Moreover, to atually omputex+y+z,
theinvolvedoperationsare arriedoutwithin aloopthat appliestheve_addfuntion to everysinglevetor
elementofthearray. Anequivalentode,afteroperatorredutionandloopexpansionis:
AVetor<har> r(SIZE),x(SIZE),y(SIZE),z(SIZE);
AVetor<har> tmp1(SIZE),tmp2(SIZE);
for(i=0;i<SIZE/16);i++) tmp1[i℄ = ve_add(y[i℄,z[i℄);
for(i=0;i<SIZE/16);i++) tmp2[i℄ = ve_add(x[i℄,tmp1[i℄);
for(i=0;i<SIZE/16);i++) r[i℄ = tmp2[i℄;
Fig.2.1.Expanded odeforoverloaded operatorompilation
Thisodeanbeomparedtoanoptimal",hand-writtenAltiveodeliketheoneshownongure2.2. The
odegenerated by thenaive"AltiVe lass learlyexhibits unneessary loopsand opies. Whenexpressions
get moreomplex, the situation getsworse. The time spent in loopindex alulationand temporary objet
opiesquiklyoveromesthebenetsoftheSIMDparallelization,resultinginpoorperformanes.
This an be explained by the fat that all C++ ompilers use a dyadi redution sheme to evaluate
operators omposition. Some ompilers 2
an output a slightly better ode when ertain optimisations are
AVetor<har> r(SIZE),x(SIZE),y(SIZE),z(SIZE);
for(i=0;i<SIZE/16);i++) r[i℄ = ve_add(x[i℄,ve_add(y[i℄,z[i℄));
Fig.2.2.Optimal,handwrittenAltiVeode forx+y+zomputation
turned on. However,largeexpressionsoromplexfuntions allan't betotally optimised. Anotherfator is
theimpat ofthe order ofAltiVe instrutions. When writingAltiVe ode, one haveto takein aountthe
fatthatahelineshavetobelleduptotheirmaximum. Thetypialwayfordoingsoistopaktheloading
instrutionstogether, then theoperations andnally the storinginstrutions. When loading,omputingand
storinginstrutionsaremixed inanunorderedway,AltiVeperformanesgenerallydrop.
The aforementioned problem has already been identiedin [13℄ for exampleand is the major
inon-venient of the C++ languagewhen it is used for high-levelsienti omputations. In the domain of C++
sientiomputing, it hasled to thedevelopmentof theso-alled Ative Libraries [15, 2, 14,1℄, whih both
provide domain-spei abstrations and dediated ode optimisationmehanisms. Thispaperdesribeshow
thisapproahanbeappliedtothespeiproblemofgeneratingeientAltiveodefromahigh-levelC++
API.
Itisorganizedasfollows. Set.3explainswhygeneratingeientodeforvetorexpressionsisnottrivial
andintroduestheoneptoftemplate-basedmeta-programming. Set.4explainshowthisoneptanusedto
generateoptimisedAltiveode. Set.5rapidlypresentstheAPIofthelibrarywebuiltuponthesepriniples.
PerformaneresultsarepresentedinSet.6. Set. 7isabriefsurveyofrelatedworkandSet. 8onludes.
3. Template based Meta Programming. Theevaluationof anyarithmeti expression anbe viewed
asatwostagesproess:
•
Arststep,performedatompiletime,wherethestrutureoftheexpressionisanalysedto produeahainoffuntion alls.
•
A seond step, performed at run time, where the atual operands are provided to the sequene offuntionalls andtheaerentomputationsarearriedout.
Whentheexpressionstruturemathesertainpatternorwhenertainoperandsareknownatompiletime,it
isoftenpossibletoperformagivensetofomputationsatompiletimeinordertoprodueanoptimisedhain
offuntionalls. Forexample,ifweonsiderthefollowingode:
for( int i=0;i<SIZE;i++)
{
table[i℄ = os(2*i);
}
Ifthesizeofthetableisknownatompiletime,theodeouldbeoptimisedbyremovingtheloopentirely
andwritingalinearsequeneofoperations:
table[0℄ = os(0);
table[1℄ = os(2);
// later \dots
table[98℄ = os(196);
table[99℄ = os(198);
Furthermore,thevalueos(0),..., os(198)anbeomputed oneand forallat ompile-time, sothat
theruntimeostofsuhinitialisationboilsdownto 100storeoperations.
Tehnially speaking, suh a lifting of omputations from runtime to ompile-time anbeimplemented
usingamehanismknownastemplate-basedmetaprogramming. Thesequelofthissetiongivesabriefaount
ofthis tehniqueand ofitsentral onept,expressionstemplates. Moredetails anbe found,forexample,in
Veldhuizen'spapers[11,12,13℄. Wefoushereonhowthistehniqueanbeusedtoremoveunneessaryloops
andobjetopiesfromtheodeproduedfortheevaluationofvetorbasedexpressions.
ontainerlass,itprovidesawaytobuild astatirepresentationofanarray-basedexpression. Forexample,if
weonsideranoat Arraylassandanadditionfuntoradd,theexpressionD=A+B+Couldberepresentedby
thefollowingC++type:
Xpr<Array,add,Xpr<Array,add,Array>>
WhereXprisdened bythefollowingtype:
template<lass LEFT,lass OP,lass RIGHT>
lass Xpr
{
publi:
Xpr( float* lhs, float* rhs ) : mLHS(lhs), mRHS(rhs) {}
private:
LEFT mLHS;
RIGHT mRHS;
};
TheArraylassisdenedasbelow:
lass Array
{
publi:
Array( size_t s ) { mData = new float[s℄; mSize = s;}
~Array() {if(mData) delete[℄ mData; }
float* begin() { return mData; }
private:
float *mData;
size_t mSize;
};
Thistypeanbeautomatiallybuiltfromtheonretesyntaxoftheexpressionusinganoverloadedversion
ofthe'+'operatorthattakesanArrayandanXprobjetandreturnsanewXprobjet:
Xpr< Array,add,Array> operator+(Array a, Array b)
{
return Xpr<T,add,Array>(a.begin(),b.begin());
}
Usingthis kindofoperators,weansimulatetheparsingofthe aboveode(A+B+C")and seehowthe
lassesgetombinedto buildtheexpressiontree:
Array A,B,C,D;
D = A+B+C;
D = Xpr<Array,add,Array> + C
D = Xpr<Xpr<Array,add,Array>,add,Array>
FollowingthelassiC++ operator resolution,theA+B+Cexpression isparsedas(A+B)+C.TheA+B
partgetsenodedintoarsttemplatetype. Then,theompilerreduetheX+Cpart,produingthenaltype,
enodingthewhole expression.
HandlingtheassignationofA+B+CtoDanthenbedoneusinganoverloadedversionoftheassignment
operator:
template<lass XPR> Array& Array::operator=(onst XPR& xpr)
{
for(int i=0;i<mSize;i++) mData[i℄ = xpr[i℄;
TheArrayandXprlasseshavetoprovideaoperator[℄methodtobeabletoevaluatexpr[i℄:
int Array::operator[℄(size_t index)
{
return mData[index℄;
}
template<lass L,lass OP,lass R>
int Xpr<L,OP,R>::operator[℄(size_t index)
{
return OP::eval(mLHS[i℄,mRHS[i℄);
}
Westill haveto denethe addlass ode. Simplyenough, addis afuntorthat exposes astatimethod
alled eval performing the atual omputation. Suh funtors an be freely extended to inlude any other
arithmetiormathematialfuntions.
lass add
{
stati int eval(int x,int y) { return x+y; }
}
Withthese methods,eahrefereneto xpr[i℄anbeevaluated. Fortheaboveexample,thisgives:
data[i℄ = xpr[i℄;
data[i℄ = add::eval(Xpr<Array,add,Array>,C[i℄);
data[i℄ = add::eval(add::apply(A[i℄,B[i℄),C[i℄);
data[i℄ = add::eval(A[i℄+B[i℄,C[i℄);
data[i℄ = A[i℄+B[i℄+C[i℄;
4. Appliation to eient AltiVe ode generation. At this stage, we an add AltiVe support
to this meta-programming engine. If wereplae the salaromputations and the indexed aesses by vetor
operationsandloads,weanwriteanAltiVetemplateodegenerator. Thesehangesaetallthelassesand
funtionsshownintheprevioussetions.
TheArraylassnowprovidesaloadmethod thatreturnavetorinsteadofasalar:
int Array::load(size_t index) { return ve_ld(data_,index*16); }
Theaddfuntor nowuseve_addfuntions insteadofthestandard+operator:
lass add
{
stati vetor int eval(vetor int x,vetor int y)
{ return ve_add(x,y); }
}
Finally, weuseve_stto storeresults:
template<lass XPR> Array& Array::operator=(onst XPR& xpr)
{
for(size_t i=0;i<mSize/4;i++) ve_st(xpr.load(i),0,mData);
return *this;
}
Thenalresultofthisodegenerationanbeobservedongure4.1.bfortheA+B+Cexample. Figure4.1.a
givestheodeproduedbygwhenusingthestd::valarraylass.
For this simple task, one an easily see that the minimum number of loads operation is three and the
minimumnumberofstoreoperationsisone. Forthestandardode,wehavesevenextraneouslwzinstrutions
toloadpointers, threelsfxtoloadtheatualdataand onestfstostoretheresult. Fortheoptimisedode,
we havereplaed thesalar lsfxwith theAltiVe equivalent lvx, the salar faddswith vaddfpand stfsx
(a)std::valarrayode (b)optimizedode
L253: L117:
lwz r9,0(r3) slwi r2,r9,4
slwi r2,r12,2 addi r9,r9,1
lwz r4,4(r3) lvx v1,r5,r2
addi r12,r12,1 lvx v0,r4,r2
lwz r11,4(r9) lvx v13,r6,r2
lwz r10,0(r9) vaddfp v0,v0,v1
lwz r7,4(r11) vaddfp v1,v0,v13
lwz r6,4(r10) stvx v1,r2,r8
lfsx f0,r7,r2 bdnz L117
lfsx f1,r6,r2
lwz r0,4(r4)
fadds f2,f1,f0
lfsx f3,r2,r0
fadds f1,f2,f3
stfs f1,0(r5)
addi r5,r5,4
bdnz L253
Fig.4.1.Assemblyodeforasimplevetoroperation
5. The EVE library. Using the ode generation tehnique desribed in the previous setion, we have
produed a high-level array manipulation library aimed at sienti omputing and taking advantage of the
SIMD aeleration oered by the Altive extension on PowerPC proessors. This library, alled eve (for
Expressive VeloityEngine)basiallyprovidestwolasses,vetorandmatrixfor1Dand2Darrays,anda
rih setofoperatorsandfuntionsto manipulatethem. Thissetanberoughlydividedinfour families:
1. Arithmetiandbooleanoperators,whiharethediretvetorextensionoftheirC++ounterparts.
Forexample:
vetor<har> a(64),b(64),(64),d(64);
d = (a+b)/;
2. Boolean prediates. These funtions an be used to manipulate boolean vetorsand use them as
seletionmasks. Forexample:
vetor <har> a(64),b(64),(64);
// [i℄ = a[i℄ if a[i℄<b[i℄, b[i℄ otherwise
= where(a < b, a, b);
3. Mathematial and STL funtions. These funtions work like theirSTL ormath.hounterparts.
The only dierene is that theytake an array (or matrix) as awhole argument instead of aouple
of iterators. Apart from this dierene, eve funtions and operators are very similar to their STL
ounterparts(theinterfaetotheevearraylassisatuallyverysimilartotheoneoeredbytheSTL
valarraylass. ThisallowsalgorithmsdevelopedwiththeSTLto beported(andaelerated)witha
minimumeortonaPowerPCplatformwitheve. Forexample:
vetor <float> a(64),b(64);
b = tan(a);
float r = inner_produt(a, b);
4. Signal proessing funtions. These funtions allowthediret expression (without expliit
deom-positionintosumsandproduts)of1Dand2DFIRlters. Forexample:
matrix<float> image(320,240),res(320,240);
TheeveAPIallowsthedeveloperto writealargevarietyofalgorithmsaslongasthese algorithmanbe
expressedasaserieofglobaloperationonvetor.
6. Performane. Two kindsof performane tests have been performed: basi tests, involvingonly one
vetoroperationand moreomplextests, in whih severalvetoroperationsare omposed intomoreomplex
expressions. All tests involved vetors of dierent types (8 bit integers, 16 bit integers, 32 bit integersand
32 bit oats) but of the sametotal length (16Kbytes) in order to redue the impatof ahe eets on the
observedperformanes 3
. Theyhavebeenondutedona2GHzPowerPCG5withg 3.3.1andthefollowing
ompilationswithes: -faltive -ftemplate-deph-128 -O3. A seletion of performane results is given in
Table 6.1. For eah test, four numbers are given: the maximum theoretial speedup 4
(TM), the measured
speedupforahand-odedversionofthetestusingthenativeCAltiveAPI(N.C),themeasuredspeedupwith
anaive vetor librarywhih does notuse the expression template mehanismdesribed in Set. 3(N.V),
andthemeasuredspeedupwiththeevelibrary.
Table6.1
Seletedperformaneresults
Test Vetortype TM N.C N.V EVE
1.v3=v1+v2 8bitinteger 16 15.7 8.0 15.4
2.v2=tan(v1) 32bitoat 4 3.6 2.0 3.5
3.v3=v1/v2 32bitoat 4 4.8 2.1 4.6
4.v3=v1/v2 16bitinteger 8(4) 3.0 1.0 3.0
5.v3=inner_prod(v1,v2) 8bitinteger 8 7.8 4.5 7.2
6.v3=inner_prod(v1,v2) 32bitoat 4 14.1 4.8 13.8
7. 3x1Filter 8bitinteger 8 7.9 0.1 7.8
8. 3x1Filter 32bitoat 4 4.12 0.1 4.08
9.v5=sqrt(tan(v1+v2)/os(v3*v4)) 32bitoat 4 3.9 0.04 3.9
Itanbeobservedthat,formostofthetests,thespeedupobtainedwitheveisloseto theoneobtained
withahand-odedversionofthealgorithmusingthenativeCAPI.Byontrast,theperformanesofthenaive
lasslibraryareverydisappointing(espeiallyfortests7-10). Thislearlydemonstratestheeetivenessofthe
metaprogramming-basedoptimisation.
Tests13orrespondtobasioperations,whiharemappeddiretlytoasingleAltiVeinstrution. Inthis
ase,themeasuredspeedupisverylosetothetheoretialmaximum. Fortest3,itisevengreater. Thiseet
an be explained by thefat that onG5 proessors, and even for non-SIMD operations, the Altive FPU is
alreadyfaster thanthe salarFPU 5
. Whenadded to thespeedup oeredbythe SIMDparallelism,this leads
to super-linear speedups. Thesameeet explainstheresult obtainedfortest 6. Byontrast, test 4exhibits
asituation inwhih theobservedperformanes aresigniantlylowerthanexpeted. Inthis ase,this isdue
to the asymmetryof the Altive instrutionset, whih doesnot provide the basioperations forall types of
vetors. Inpartiular,itdoesnotinludedivisionon16bitintegers.Thisoperationmustthereforebeemulated
usingvetoroatdivision. Thisinvolvesseveraltypeastingoperationsandpratiallyreduesthemaximum
theoretialspeedupfrom 8to 4.
Tests 5-9 orrespond to more omplex operations, involvingseveral AltiVe instrutions. Note that for
tests 5and 7, despite thefat that the operandsare vetorsof 8bit integers,the omputations are atually
arried out on vetors of 16 bit integers,in order to keep a reasonablepreision. The theoretial maximum
speedup istherefore8insteadof16.
6.1. RealistiCase Study. Inordertoshowthateveanbeusedtosolverealistiproblems,whilestill
deliveringsigniantspeedups,wehaveusedittovetorizeseveralompleteimageproessingalgorithms. This
setiondesribestheimplementationofanalgorithmperformingthedetetionofpointsofinterest ingreysale
imagesusingtheHarrislter[7℄.
3
I.e. thevetorsize(inelements)was16Kfor8bitintegers,8Kfor16bitintegersand4Kfor32bitsintegersoroats.
4
Thisdependsonthetypeofthevetorelements:16for8bitintegers,8for16bitintegersand4for32bitintegersandoats.
Starting from an input image
I
(
x, y
)
, horizontal and vertial gaussian ltersare applied to removenoiseandthefollowingmatrixisomputed:
M
(
x, y
) =
(
∂I
∂x
)
2
∂I
∂x
∂I
∂y
∂I
∂x
∂I
∂y
(
∂I
∂y
)
2
!
Where (
∂I
∂x
)
and (∂I
∂y
)
are respetivelythe horizontal and vertial gradientofI
(
x, y
)
.M
(
x, y
)
is lteredagainwithagaussianlterandthefollowingquantityisomputed:
H
(
x, y
) =
Det
(
M
)
−
k.trace
(
M
)
2
, k
∈
[0
.
04; 0
.
06]
H
isviewedasameasure ofpixelinterest. Loal maximaofH
arethensearhedin3x3windowsandthen
th
rstmaxima arenally seleted. Figure 6.1shows theresultofthe detetionalgorithm onavideoframepituringanoutdoorsene.
Inthis implementation, onlythelteringand thepixel detetionare vetorized. Sortinganarrayannot
beeasilyvetorizedwiththeAltiVe instrutionset. It'snotworthitanyway,sinethetimespentinthenal
sortingandseletionproessonlyaountsforasmall fration(around3%)ofthetotalexeutiontimeofthe
algorithm. Theodeforomputing
M
oeientsandH
valuesisshowninFig.6.1. It anbesplitinto three setions:1. Adelarativesetionwheretheneededmatrixandfilterobjetsareinstantiated.matrixobjets
aredelaredasfloatontainerstopreventoverowwhenlteringisapplied ontheinputimageandto speed
upnalomputationbyremovingtheneedfortypeasting.
2. A lteringsetionwheretheoeientsofthe
M
matrixareomputed. Weuseevelterobjets,instantiatedforgaussianandgradientlters. Filtersupportanoverloaded*operatorthatissemantiallyused
astheomposition operator.
3. A omputingsetionwherethenal valueof
H
(
x, y
)
isomputedusingtheoverloadedversions ofarithmetioperators.
The performanes of this detetor implementation have been ompared to those of the same algorithm
writtenin C,bothusing320*240pixelsvideosequeneasinput. Thetestswererunona2GHzPowerPCG5
andompiledwithg 3.3. Asthetwostepsofthealgorithm(lteringanddetetion)usetwodierentparts
oftheE.V.E.API,wegivetheexeutiontimeforeahstepalongwiththetotalexeutiontime.
Step ExeutionTime SpeedUp
Filtering
1
.
4
ms
5
.
21
Evaluation
0
.
45
ms
4
.
23
TotalTime
1
.
85
ms
4
.
98
Theperformaneofbothpartsofthealgorithmaresatisfatory. Thelteringsetionspeed-upisnear65%
ofmaximumspeed-upwhiletheseondpartbenetsfromasuperlinearaeleration.
7. Related Work. Projets aiming at simplifying the exploitation of the SIMD extensions of modern
ap-// Delarations
#define W 320
#define H 240
matrix<short> I(W,H),a(W,H),b(W,H);
matrix<short> (W,H),t1(W,H),t2(W,H);
matrix<float> h(W,H);
float k = 0.05f;
filter<3,horizontal> smooth_x = 1,2,1;
filter<3,horizontal> grad_x = 1,0,1;
filter<3,vertial> smooth_y = 1,2,1;
filter<3,vertial> grad_y = -1,0,1;
// Computes matrix M:
//
// | a |
// M = | b |
t1 = grad_x(I);
t2 = grad_y(I);
a = (smooth_x*smooth_y)(t1*t1);
b = (smooth_x*smooth_y)(t2*t2);
= (smooth_x*smooth_y)(t1*t2);
// Computes matrix H
H = (a*b-*)-k*(a+b)*(a+b);
Fig.6.1.TheHarrisdetetor,oded with eve
Theswar(SIMDWithinAregister,[4℄)projetisanexampleoftherstapproah. Itsgoalistopropose
aversatiledata parallelClanguagemakingfull SIMD-styleprogrammingmodels eetiveforommodity
mi-roproessors. An experimentalompiler(s)hasbeendevelopedthat extendsCsemantisandtypesystem
andantargetseveralfamilyofmiroproessors. Startedin1998,theprojetseemstobein dormantstate.
Anotherexampleoftheompiler-basedapproahisgivenbyKyoetal. in[8℄. Theydesribeaompilerfor
aparallelCdialet(1d,OneDimensionalC)produingSIMDodeforPentiumproessorsandaimedatthe
suint desription of parallel image proessing algorithms. Benhmarks resultsshowthat speed-ups in the
rangeof2to 7(omparedwith odegeneratedwith aonventional Compiler)anbeobtainedfor low-level
imageproessingtasks. Buttheparallelizationtehniquesdesribedintheworkwhiharederivedfromtheone
usedforprogramminglinearproessorarraysseemstobeonlyappliabletosimpleimagelteringalgorithms
baseduponsweepingahorizontalpixel-updatinglinerow-wisearosstheimage,whihrestritsitsappliability.
Moreover,and thisanbeviewedasalimitationof ompiler-basedapproahes,retargetinganotherproessor
maybediult, sineitrequiresagoodunderstandingoftheompilerinternalrepresentations.
Thevastodeoptimiser[3℄hasaspei bak-endforgenerating Altive/PowerPCode. Thisompiler
oersautomativetorizationandparallelizationfromonventionalCsoureode,automatiallyreplaingloops
withinlinevetorextensions. Thespeedupsobtainedwithvastarelaimedtobelosedtothoseobtainedwith
hand-vetorizedode.vastisaommerialprodut.
There have been numerous attempts to provide a library-based approah to the exploitation of SIMD
features in miro-proessors. Applevelib [6℄, whih provides aset of Altive-optimisedfuntions forsignal
proessing,isanexample. Butmostoftheseattemptssuer fromtheweaknessesdesribedinSet.2;namely,
theyannot handle omplex vetor expressions and produeineient odewhen multiple vetoroperations
areinvolvedinthesamealgorithm. MaSTL[9℄istheonlyworkweareawareofthataimsateliminatingthese
optimised for Altive and relies on template-based metaprogramming tehniques for ode optimisation. The
only dierene is that it only provides STL-ompliant funtions and operators (it an atually be viewed as
a spei implementation of the STL for G4/G5 omputers) whereas eve oers additional domain-spei
funtionsforsignalandimageproessing.
8. Conlusion. Wehaveshownhowalassialtehniquetemplate-basedmetaprogramminganbe
ap-pliedto thedesignandimplementationofaneienthigh-levelvetormanipulationlibraryaimedatsienti
omputing onPowerPC platforms. Thislibrary oersasigniantimprovementin termsof expressivityover
thenativeCAPItraditionallyusedfortakingadvantageoftheSIMDapabilitiesofthisproessor. Itallows
de-veloperstoobtainsigniantspeedupswithouthavingtodealwithlowlevelimplementationdetails. Moreover,
TheeveAPIislargelyompliantwiththeSTLstandardandthereforeprovidesasmoothtransition pathfor
appliationswritten withothersientiomputinglibraries. Aprototypeversionofthelibraryanbe
down-loadedfromthefollowingURL:http://wwwlasmea.univ-bplermont.fr/Personnel/Joel.Falou/eng/eve.
We are urrently working on improving the performanes obtained with this prototype. This involves, for
instane, globally minimizingthe number ofvetorloadand storeoperations,using morejudiiously
Altive-speiahemanipulationinstrutionsortakingadvantageoffusedoperations(e.g. multiply/add). Finally,it
anbenotedthat,althoughtheurrentversionof evehasbeendesignedforPowerPCproessorswithAltive,
itouldeasilyberetargetedtoPentium4proessorswithMMX/SSE2beausetheodegeneratoritself(using
theexpressiontemplatemehanism) anbemadelargelyindependentoftheSIMDinstrutionset.
REFERENCES
[1℄ TheBOOSTLibrary.http://www.boost.org/.
[2℄ ThePOOMALibrary.http://www.odesourery .om /po oma/ .
[3℄ VAST.http://www.psrv.om/vast_a lti ve. html /.
[4℄ TheSWARHomePagehttp://shay.en.purdue.edu /~s war PurdueUniversity
[5℄ Apple,TheAltiVeInstrutionsReferenesPage.http://developer.apple. om/h ard ware /ve.
[6℄ Apple,VeLibframework.http://developer.apple. om/ hard war e/ve /ve tor _lib rar ies. html
[7℄ C.HarrisandM.Stephens,Aombinedornerandedgedetetor.In4thAlveyVisionConferene,1988.
[8℄ S. Kyo andS. OkasakiandI. Kuroda, An extended Clanguage anda SIMD ompiler foreientimplementation of
imageltersonmediaextendedmiro-proessors. inProeedingsofAivs2003(AdvanedConeptsforIntelligentVision
Systems),Ghent,Belgium,Sept.1998
[9℄ G.Low, MaSTL. http://www.pixelglow.om/ mas tl/.
[10℄ I. Ollman,AltiVeVeloityEngineTutorial.http://www.simdteh.or g/al tive . Marh2001.
[11℄ T.Veldhuizen,UsingC++TemplateMeta-Programs.InC++Report,vol.7,p.36-43,1995.
[12℄ ,ExpressionTemplates.InC++Report,vol.7,p.26-31,1995.
[13℄ ,TehniquesforSientiC++.http://osl.iu.edu/ tveldhui/papers/tehniques /.
[14℄ ,ArraysinBlitz++.InDrDobb'sJournalofSoftwareTools,p.238-44,1996.
[15℄ T. Veldhuizen and D. Gannon, Ative Libraries: Rethinkingthe roles of ompilers and libraries Pro. of the SIAM
Appendix A.A simple3x1 gaussianlter written with the Altive native C API .
void AV_filter( har* img, short* res)
{
vetor unsigned har zu8,t1,t2,t3,t4;
vetor signed short x1h,x1l,x2h;
vetor signed short x2l,x3h,x3l;
vetor signed short zs16 ,rh,rl,v0,v1,shift;
// Generate onstants
v0 = ve_splat_s16(2);
v1 = ve_splat_s16(4);
zu8 = ve_splat_u8(0);
zs16 = ve_splat_s16(0);
shift = ve_splat_s16(8);
for( int j = 0; j< SIZE/16 ; j++ )
{
// Load input vetors
t1 = ve_ld(j*16, img); t2 = ve_ld(j*16+16, img);
// Generate shifted vetors
t3 = ve_sld(t1,t2,1); t4 = ve_sld(t1,t2,2);
// Cast to short
x1h = ve_mergeh(zu8,t1); x1l = ve_mergel(zu8,t1);
x2h = ve_mergeh(zu8,t3); x2l = ve_mergel(zu8,t3);
x3h = ve_mergeh(zu8,t4); x3l = ve_mergel(zu8,t4);
// Atual filtering
rh = ve_mladd(x1h,v0,zs16);
rl = ve_mladd(x1l,v0,zs16);
rh = ve_mladd(x2h,v1,rh);
rl = ve_mladd(x2l,v1,rl);
rh = ve_mladd(x3h,v0,rh);
rl = ve_mladd(x3l,v0,rl);
rh = ve_sr(rh,shift);
rl = ve_sr(rl,shift);
// Pak and store result vetor
t1 = ve_paksu(rh,rl);
ve_st(t1,j,out);
}
}
Editedby: FrédériLoulergue
Reeived: June26,2004