E.V.E., An Object Oriented SIMD Library

(1)

E.V.E., AN OBJECTORIENTEDSIMD LIBRARY

JOEL FALCOU AND JOCELYN SEROT

∗

Abstrat. Thispaperdesribestheeve(ExpressiveVeloityEngine)library,anobjetorientedC++librarydesignedtoease

theproessofwritingeientnumerialappliationsusingAltiVe,theSIMDextensiondesignedbyApple,MotorolaandIBM.

AltiVe-powered appliationstypiallyshow oarelativespeedupof4to 16butneedaomplexandawkwardprogrammation

style. Byusingvarious templatemetaprogrammingtehniques, E.V.E.provides an easytouse, STL-like, interfaethat allows

developer to quiklywrite eient and easy to read ode. Typial appliations written with E.V.E.an benet from a large

frationoftheorialmaximumspeedupwhilebeingwrittenassimpleC++arithmetiode.

1. Introdution.

1.1. The AltiVe Extension. Reently, SIMDenhaned instrutionshavebeenproposedasasolution

fordeliveringhighermiroproessorhardwareutilisation. SIMD(SingleInstrution,MultipleData)extensions

startedappearingin1994inHP's MAX2andSun'sVSextensionsandannowbefoundinmostof

miropro-essors, inluding Intel's Pentiums (MMX/SSE/SSE2) and Motorola/IBM's PowerPCs (Altive). They have

been provedpartiularlyusefulforaeleratingappliationsbased upon data-intensive,regularomputations,

suh assignalorimageproessing.

AltiVe[10℄isanextensiondesignedtoenhanePowerPC 1

proessorperformaneonappliationshandling

large amounts of data. The AltiVe arhiteture is based on a SIMD proessing unit integrated with the

PowerPC arhiteture. It introdues a new set of 128 bit wide registers distint from the existing general

purpose oroating-pointregisters. These registersare aessible through160 newvetor instrutionsthat

anbefreelymixedwithotherinstrutions(therearenorestritiononhowvetorinstrutionsanbeintermixed

withbranh,integeroroating-pointinstrutionswithnoontextswithingnoroverheadfordoingso). Altive

handles data as 128 bit vetors that an ontain sixteen 8 bit integers, eight 16 bit integers, four 32 bit

integersorfour32 bitoatingpointsvalues. Forexample,anyvetoroperationperformed onavetor har

is in fat performed on sixteen har simultaneously and is theoretially running sixteen times faster as the

salarequivalentoperation. AltiVevetorfuntionsoveralargespetrum,extendingfromsimplearithmeti

funtions(additions,subtrations)tobooleanevaluationorlookuptablesolving.

Altive isnativelyprogrammedbymeans ofaC API[5℄. Programming at thislevelanoer signiant

speedups (from 4to 12 fortypialsignalproessing algorithms)but is arather tediousand error-pronetask,

beause this CAPI isreally assembly in disguise. Theappliation-levelvetors(arrays,in variable number

andwithvariablesizes) mustbeexpliitlymappedontotheAltivevetors(xednumber,xedsize) andthe

programmermustdealwithseverallow-leveldetailssuhasvetorpaddingandalignment. Toorretlyturna

salarfuntion intoavetor-aeleratedone,alargepartofodehastoberewritten.

Considerforexampleasimple3x1smoothinglter(Fig.1.1):

void C_filter( har* d, short* r)

{

for(int i=1; i<SIZE-1; i++ )

r[i℄ = (d[i-1℄+2*d[i℄+d[i+1℄)/4;

}

Fig.1.1. Asimple3x1gaussianlterwritteninstandard C.

This ode an be rewritten (vetorized")using Altive vetor funtions. However, this rewriting is not

trivial. Wersthavetolookattheoriginalalgorithminaparallel way. TheC_filterfuntionisbasedonan

iterativealgorithmthatrunstrougheahitemoftheinputdata,appliestheorrespondingoperationsandwrites

the result into the output array. By ontrast, AltiVe funtions operate on abunh of data simultaneously.

Wehave to reraft thealgorithm so that itworks onvetorsinstead ofsingle salarvalues. This is done by

∗

LASMEA,UMR6602CNRS/U.ClermontFerrand,Frane(falou, jserotlasmea.univ-bplerm ont .fr) .

1

(2)

loadingdataintoAltiVevetors,shiftingthesevetorsleftandright,andperformingvetormultipliationand

addition. TheresultingodewhihisindeedsigniantlylongerthantheoriginaloneisgiveninAppendixA.

Wehavebenhmarkedboththesalarand vetorized implementation ona2GHzPowerPCG5 andobtained

theresultsshowninTable1.1. Bothodewereompiledusingg 3.3using-O3. Onthisexample,atenfold

aelerationanbeobservedwiththeAltiVeextension. However,thetimespenttorewritethealgorithmina

vetorized"wayandthesomehowawkwardAltiveAPIanhinderthedevelopmentoflargersaleappliations.

Table1.1

Exeutiontimeandrelativespeed-upfor3x1lters.

SIZEvalue C_filter AV_filter SpeedUp

16

K

0 .

209

ms

0 .

020

ms

10 .

5

64

K

0 .

854

ms

0 .

075

ms

11 .

4

256

K

3 .

737

ms

0 .

322

ms

11 .

6 1024

K

16 .

253

ms

1 .

440

ms

11 .

3

2. AltiVe in highlevelAPI. Asevidenedintheprevioussetion,writingAltiVe-basedappliations

anbeatedious task. A possibleapproah to irumventthis problem is to enapsulateAltive vetorsand

the assoiated operations within a C++ lass. Instantiating this lass and using lassi inx notations will

produetheAltiVeode. Weatuallybuiltsuhalass(AVetor)andusedittoenodethelteringexample

introduedinsetion1.1. Theresultingodeisshownbelow.

AVetor<har> img(SIZE);

AVetor<short> res(SIZE);

res = (img.sr(1)+2*img+img.sl(1))/4;

Inthisformulation,expliititerationshavebeenreplaedbyappliationofoverloadedoperatorsonAVetor

objets. Thesrandsl methodsimplementstheshiftingoperations. Theperformaneofthis ode,however,

is verydisappointing. With the arraysizes shownin Table1.1, themeasured speed-ups never exeed1. The

reasonsforsuhbehaviouraregivenbelow.

Considerasimpleodefragmentusingoverloadedoperatorsasshownbelow:

AVetor<har> r(SIZE),x(SIZE),y(SIZE),z(SIZE);

r = x + y + z;

WhenaC++ompileranalysesthisode,itreduesthesuessiveoperatorallsiteratively,resolvingrst

y+z thenx+(y+z) where y+zis in fat stored in a temporary objet. Moreover, to atually omputex+y+z,

theinvolvedoperationsare arriedoutwithin aloopthat appliestheve_addfuntion to everysinglevetor

elementofthearray. Anequivalentode,afteroperatorredutionandloopexpansionis:

AVetor<har> tmp1(SIZE),tmp2(SIZE);

for(i=0;i<SIZE/16);i++) tmp1[i℄ = ve_add(y[i℄,z[i℄);

for(i=0;i<SIZE/16);i++) tmp2[i℄ = ve_add(x[i℄,tmp1[i℄);

for(i=0;i<SIZE/16);i++) r[i℄ = tmp2[i℄;

Fig.2.1.Expanded odeforoverloaded operatorompilation

Thisodeanbeomparedtoanoptimal",hand-writtenAltiveodeliketheoneshownongure2.2. The

odegenerated by thenaive"AltiVe lass learlyexhibits unneessary loopsand opies. Whenexpressions

get moreomplex, the situation getsworse. The time spent in loopindex alulationand temporary objet

opiesquiklyoveromesthebenetsoftheSIMDparallelization,resultinginpoorperformanes.

This an be explained by the fat that all C++ ompilers use a dyadi redution sheme to evaluate

operators omposition. Some ompilers 2

an output a slightly better ode when ertain optimisations are

(3)

for(i=0;i<SIZE/16);i++) r[i℄ = ve_add(x[i℄,ve_add(y[i℄,z[i℄));

Fig.2.2.Optimal,handwrittenAltiVeode forx+y+zomputation

turned on. However,largeexpressionsoromplexfuntions allan't betotally optimised. Anotherfator is

theimpat ofthe order ofAltiVe instrutions. When writingAltiVe ode, one haveto takein aountthe

fatthatahelineshavetobelleduptotheirmaximum. Thetypialwayfordoingsoistopaktheloading

instrutionstogether, then theoperations andnally the storinginstrutions. When loading,omputingand

storinginstrutionsaremixed inanunorderedway,AltiVeperformanesgenerallydrop.

The aforementioned problem has already been identiedin [13℄ for exampleand is the major

inon-venient of the C++ languagewhen it is used for high-levelsienti omputations. In the domain of C++

sientiomputing, it hasled to thedevelopmentof theso-alled Ative Libraries [15, 2, 14,1℄, whih both

provide domain-spei abstrations and dediated ode optimisationmehanisms. Thispaperdesribeshow

thisapproahanbeappliedtothespeiproblemofgeneratingeientAltiveodefromahigh-levelC++

API.

Itisorganizedasfollows. Set.3explainswhygeneratingeientodeforvetorexpressionsisnottrivial

andintroduestheoneptoftemplate-basedmeta-programming. Set.4explainshowthisoneptanusedto

generateoptimisedAltiveode. Set.5rapidlypresentstheAPIofthelibrarywebuiltuponthesepriniples.

PerformaneresultsarepresentedinSet.6. Set. 7isabriefsurveyofrelatedworkandSet. 8onludes.

3. Template based Meta Programming. Theevaluationof anyarithmeti expression anbe viewed

asatwostagesproess:

•

Arststep,performedatompiletime,wherethestrutureoftheexpressionisanalysedto produea

hainoffuntion alls.

•

A seond step, performed at run time, where the atual operands are provided to the sequene of

funtionalls andtheaerentomputationsarearriedout.

Whentheexpressionstruturemathesertainpatternorwhenertainoperandsareknownatompiletime,it

isoftenpossibletoperformagivensetofomputationsatompiletimeinordertoprodueanoptimisedhain

offuntionalls. Forexample,ifweonsiderthefollowingode:

for( int i=0;i<SIZE;i++)

{

table[i℄ = os(2*i);

}

Ifthesizeofthetableisknownatompiletime,theodeouldbeoptimisedbyremovingtheloopentirely

andwritingalinearsequeneofoperations:

table[0℄ = os(0);

table[1℄ = os(2);

// later \dots

table[98℄ = os(196);

table[99℄ = os(198);

Furthermore,thevalueos(0),..., os(198)anbeomputed oneand forallat ompile-time, sothat

theruntimeostofsuhinitialisationboilsdownto 100storeoperations.

Tehnially speaking, suh a lifting of omputations from runtime to ompile-time anbeimplemented

usingamehanismknownastemplate-basedmetaprogramming. Thesequelofthissetiongivesabriefaount

ofthis tehniqueand ofitsentral onept,expressionstemplates. Moredetails anbe found,forexample,in

Veldhuizen'spapers[11,12,13℄. Wefoushereonhowthistehniqueanbeusedtoremoveunneessaryloops

andobjetopiesfromtheodeproduedfortheevaluationofvetorbasedexpressions.

(4)

ontainerlass,itprovidesawaytobuild astatirepresentationofanarray-basedexpression. Forexample,if

weonsideranoat Arraylassandanadditionfuntoradd,theexpressionD=A+B+Couldberepresentedby

thefollowingC++type:

Xpr<Array,add,Xpr<Array,add,Array>>

WhereXprisdened bythefollowingtype:

template<lass LEFT,lass OP,lass RIGHT>

lass Xpr

{

publi:

Xpr( float* lhs, float* rhs ) : mLHS(lhs), mRHS(rhs) {}

private:

LEFT mLHS;

RIGHT mRHS;

};

TheArraylassisdenedasbelow:

lass Array

{

publi:

Array( size_t s ) { mData = new float[s℄; mSize = s;}

~Array() {if(mData) delete[℄ mData; }

float* begin() { return mData; }

private:

float *mData;

size_t mSize;

};

Thistypeanbeautomatiallybuiltfromtheonretesyntaxoftheexpressionusinganoverloadedversion

ofthe'+'operatorthattakesanArrayandanXprobjetandreturnsanewXprobjet:

Xpr< Array,add,Array> operator+(Array a, Array b)

{

return Xpr<T,add,Array>(a.begin(),b.begin());

}

Usingthis kindofoperators,weansimulatetheparsingofthe aboveode(A+B+C")and seehowthe

lassesgetombinedto buildtheexpressiontree:

Array A,B,C,D;

D = A+B+C;

D = Xpr<Array,add,Array> + C

D = Xpr<Xpr<Array,add,Array>,add,Array>

FollowingthelassiC++ operator resolution,theA+B+Cexpression isparsedas(A+B)+C.TheA+B

partgetsenodedintoarsttemplatetype. Then,theompilerreduetheX+Cpart,produingthenaltype,

enodingthewhole expression.

HandlingtheassignationofA+B+CtoDanthenbedoneusinganoverloadedversionoftheassignment

operator:

template<lass XPR> Array& Array::operator=(onst XPR& xpr)

{

for(int i=0;i<mSize;i++) mData[i℄ = xpr[i℄;

(5)

TheArrayandXprlasseshavetoprovideaoperator[℄methodtobeabletoevaluatexpr[i℄:

int Array::operator[℄(size_t index)

{

return mData[index℄;

}

template<lass L,lass OP,lass R>

int Xpr<L,OP,R>::operator[℄(size_t index)

{

return OP::eval(mLHS[i℄,mRHS[i℄);

}

Westill haveto denethe addlass ode. Simplyenough, addis afuntorthat exposes astatimethod

alled eval performing the atual omputation. Suh funtors an be freely extended to inlude any other

arithmetiormathematialfuntions.

lass add

{

stati int eval(int x,int y) { return x+y; }

}

Withthese methods,eahrefereneto xpr[i℄anbeevaluated. Fortheaboveexample,thisgives:

data[i℄ = xpr[i℄;

data[i℄ = add::eval(Xpr<Array,add,Array>,C[i℄);

data[i℄ = add::eval(add::apply(A[i℄,B[i℄),C[i℄);

data[i℄ = add::eval(A[i℄+B[i℄,C[i℄);

data[i℄ = A[i℄+B[i℄+C[i℄;

4. Appliation to eient AltiVe ode generation. At this stage, we an add AltiVe support

to this meta-programming engine. If wereplae the salaromputations and the indexed aesses by vetor

operationsandloads,weanwriteanAltiVetemplateodegenerator. Thesehangesaetallthelassesand

funtionsshownintheprevioussetions.

TheArraylassnowprovidesaloadmethod thatreturnavetorinsteadofasalar:

int Array::load(size_t index) { return ve_ld(data_,index*16); }

Theaddfuntor nowuseve_addfuntions insteadofthestandard+operator:

lass add

{

stati vetor int eval(vetor int x,vetor int y)

{ return ve_add(x,y); }

}

Finally, weuseve_stto storeresults:

template<lass XPR> Array& Array::operator=(onst XPR& xpr)

{

for(size_t i=0;i<mSize/4;i++) ve_st(xpr.load(i),0,mData);

return *this;

}

Thenalresultofthisodegenerationanbeobservedongure4.1.bfortheA+B+Cexample. Figure4.1.a

givestheodeproduedbygwhenusingthestd::valarraylass.

For this simple task, one an easily see that the minimum number of loads operation is three and the

minimumnumberofstoreoperationsisone. Forthestandardode,wehavesevenextraneouslwzinstrutions

toloadpointers, threelsfxtoloadtheatualdataand onestfstostoretheresult. Fortheoptimisedode,

we havereplaed thesalar lsfxwith theAltiVe equivalent lvx, the salar faddswith vaddfpand stfsx

(6)

(a)std::valarrayode (b)optimizedode

L253: L117:

lwz r9,0(r3) slwi r2,r9,4

slwi r2,r12,2 addi r9,r9,1

lwz r4,4(r3) lvx v1,r5,r2

addi r12,r12,1 lvx v0,r4,r2

lwz r11,4(r9) lvx v13,r6,r2

lwz r10,0(r9) vaddfp v0,v0,v1

lwz r7,4(r11) vaddfp v1,v0,v13

lwz r6,4(r10) stvx v1,r2,r8

lfsx f0,r7,r2 bdnz L117

lfsx f1,r6,r2

lwz r0,4(r4)

fadds f2,f1,f0

lfsx f3,r2,r0

fadds f1,f2,f3

stfs f1,0(r5)

addi r5,r5,4

bdnz L253

Fig.4.1.Assemblyodeforasimplevetoroperation

5. The EVE library. Using the ode generation tehnique desribed in the previous setion, we have

produed a high-level array manipulation library aimed at sienti omputing and taking advantage of the

SIMD aeleration oered by the Altive extension on PowerPC proessors. This library, alled eve (for

Expressive VeloityEngine)basiallyprovidestwolasses,vetorandmatrixfor1Dand2Darrays,anda

rih setofoperatorsandfuntionsto manipulatethem. Thissetanberoughlydividedinfour families:

1. Arithmetiandbooleanoperators,whiharethediretvetorextensionoftheirC++ounterparts.

Forexample:

vetor<har> a(64),b(64),(64),d(64);

d = (a+b)/;

2. Boolean prediates. These funtions an be used to manipulate boolean vetorsand use them as

seletionmasks. Forexample:

vetor <har> a(64),b(64),(64);

// [i℄ = a[i℄ if a[i℄<b[i℄, b[i℄ otherwise

= where(a < b, a, b);

3. Mathematial and STL funtions. These funtions work like theirSTL ormath.hounterparts.

The only dierene is that theytake an array (or matrix) as awhole argument instead of aouple

of iterators. Apart from this dierene, eve funtions and operators are very similar to their STL

ounterparts(theinterfaetotheevearraylassisatuallyverysimilartotheoneoeredbytheSTL

valarraylass. ThisallowsalgorithmsdevelopedwiththeSTLto beported(andaelerated)witha

minimumeortonaPowerPCplatformwitheve. Forexample:

vetor <float> a(64),b(64);

b = tan(a);

float r = inner_produt(a, b);

4. Signal proessing funtions. These funtions allowthediret expression (without expliit

deom-positionintosumsandproduts)of1Dand2DFIRlters. Forexample:

matrix<float> image(320,240),res(320,240);

(7)

TheeveAPIallowsthedeveloperto writealargevarietyofalgorithmsaslongasthese algorithmanbe

expressedasaserieofglobaloperationonvetor.

6. Performane. Two kindsof performane tests have been performed: basi tests, involvingonly one

vetoroperationand moreomplextests, in whih severalvetoroperationsare omposed intomoreomplex

expressions. All tests involved vetors of dierent types (8 bit integers, 16 bit integers, 32 bit integersand

32 bit oats) but of the sametotal length (16Kbytes) in order to redue the impatof ahe eets on the

observedperformanes 3

. Theyhavebeenondutedona2GHzPowerPCG5withg 3.3.1andthefollowing

ompilationswithes: -faltive -ftemplate-deph-128 -O3. A seletion of performane results is given in

Table 6.1. For eah test, four numbers are given: the maximum theoretial speedup 4

(TM), the measured

speedupforahand-odedversionofthetestusingthenativeCAltiveAPI(N.C),themeasuredspeedupwith

anaive vetor librarywhih does notuse the expression template mehanismdesribed in Set. 3(N.V),

andthemeasuredspeedupwiththeevelibrary.

Table6.1

Seletedperformaneresults

Test Vetortype TM N.C N.V EVE

1.v3=v1+v2 8bitinteger 16 15.7 8.0 15.4

2.v2=tan(v1) 32bitoat 4 3.6 2.0 3.5

3.v3=v1/v2 32bitoat 4 4.8 2.1 4.6

4.v3=v1/v2 16bitinteger 8(4) 3.0 1.0 3.0

5.v3=inner_prod(v1,v2) 8bitinteger 8 7.8 4.5 7.2

6.v3=inner_prod(v1,v2) 32bitoat 4 14.1 4.8 13.8

7. 3x1Filter 8bitinteger 8 7.9 0.1 7.8

8. 3x1Filter 32bitoat 4 4.12 0.1 4.08

9.v5=sqrt(tan(v1+v2)/os(v3*v4)) 32bitoat 4 3.9 0.04 3.9

Itanbeobservedthat,formostofthetests,thespeedupobtainedwitheveisloseto theoneobtained

withahand-odedversionofthealgorithmusingthenativeCAPI.Byontrast,theperformanesofthenaive

lasslibraryareverydisappointing(espeiallyfortests7-10). Thislearlydemonstratestheeetivenessofthe

metaprogramming-basedoptimisation.

Tests13orrespondtobasioperations,whiharemappeddiretlytoasingleAltiVeinstrution. Inthis

ase,themeasuredspeedupisverylosetothetheoretialmaximum. Fortest3,itisevengreater. Thiseet

an be explained by thefat that onG5 proessors, and even for non-SIMD operations, the Altive FPU is

alreadyfaster thanthe salarFPU 5

. Whenadded to thespeedup oeredbythe SIMDparallelism,this leads

to super-linear speedups. Thesameeet explainstheresult obtainedfortest 6. Byontrast, test 4exhibits

asituation inwhih theobservedperformanes aresigniantlylowerthanexpeted. Inthis ase,this isdue

to the asymmetryof the Altive instrutionset, whih doesnot provide the basioperations forall types of

vetors. Inpartiular,itdoesnotinludedivisionon16bitintegers.Thisoperationmustthereforebeemulated

usingvetoroatdivision. Thisinvolvesseveraltypeastingoperationsandpratiallyreduesthemaximum

theoretialspeedupfrom 8to 4.

Tests 5-9 orrespond to more omplex operations, involvingseveral AltiVe instrutions. Note that for

tests 5and 7, despite thefat that the operandsare vetorsof 8bit integers,the omputations are atually

arried out on vetors of 16 bit integers,in order to keep a reasonablepreision. The theoretial maximum

speedup istherefore8insteadof16.

6.1. RealistiCase Study. Inordertoshowthateveanbeusedtosolverealistiproblems,whilestill

deliveringsigniantspeedups,wehaveusedittovetorizeseveralompleteimageproessingalgorithms. This

setiondesribestheimplementationofanalgorithmperformingthedetetionofpointsofinterest ingreysale

imagesusingtheHarrislter[7℄.

3

I.e. thevetorsize(inelements)was16Kfor8bitintegers,8Kfor16bitintegersand4Kfor32bitsintegersoroats.

4

Thisdependsonthetypeofthevetorelements:16for8bitintegers,8for16bitintegersand4for32bitintegersandoats.

(8)

Starting from an input image

I

(

x, y

)

, horizontal and vertial gaussian ltersare applied to removenoise

andthefollowingmatrixisomputed:

M

(

x, y

) =

(

∂I

∂x

)

2 ∂I

∂x

∂I

∂y

∂I

∂x

∂I

∂y

(

∂I

∂y

)

2

!

Where (

∂I

∂x

)

and (

∂I

∂y

)

are respetivelythe horizontal and vertial gradientof

I

(

x, y

)

.

M

(

x, y

)

is ltered

againwithagaussianlterandthefollowingquantityisomputed:

H

(

x, y

) =

Det

(

M

)

−

k.trace

(

M

)

2

_{, k}

_∈

_[0

_.

_{04; 0}

_.

_06]

H

isviewedasameasure ofpixelinterest. Loal maximaof

H

arethensearhedin3x3windowsandthe

n

th

rstmaxima arenally seleted. Figure 6.1shows theresultofthe detetionalgorithm onavideoframe

pituringanoutdoorsene.

Inthis implementation, onlythelteringand thepixel detetionare vetorized. Sortinganarrayannot

beeasilyvetorizedwiththeAltiVe instrutionset. It'snotworthitanyway,sinethetimespentinthenal

sortingandseletionproessonlyaountsforasmall fration(around3%)ofthetotalexeutiontimeofthe

algorithm. Theodeforomputing

M

oeientsand

H

valuesisshowninFig.6.1. It anbesplitinto three setions:

1. Adelarativesetionwheretheneededmatrixandfilterobjetsareinstantiated.matrixobjets

aredelaredasfloatontainerstopreventoverowwhenlteringisapplied ontheinputimageandto speed

upnalomputationbyremovingtheneedfortypeasting.

2. A lteringsetionwheretheoeientsofthe

M

matrixareomputed. Weuseevelterobjets,

instantiatedforgaussianandgradientlters. Filtersupportanoverloaded*operatorthatissemantiallyused

astheomposition operator.

3. A omputingsetionwherethenal valueof

H

(

x, y

)

isomputedusingtheoverloadedversions of

arithmetioperators.

The performanes of this detetor implementation have been ompared to those of the same algorithm

writtenin C,bothusing320*240pixelsvideosequeneasinput. Thetestswererunona2GHzPowerPCG5

andompiledwithg 3.3. Asthetwostepsofthealgorithm(lteringanddetetion)usetwodierentparts

oftheE.V.E.API,wegivetheexeutiontimeforeahstepalongwiththetotalexeutiontime.

Step ExeutionTime SpeedUp

Filtering

1 .

4 ms

5 .

21

Evaluation

0 .

45 ms

4 .

23

TotalTime

1 .

85 ms

4 .

98

Theperformaneofbothpartsofthealgorithmaresatisfatory. Thelteringsetionspeed-upisnear65%

ofmaximumspeed-upwhiletheseondpartbenetsfromasuperlinearaeleration.

7. Related Work. Projets aiming at simplifying the exploitation of the SIMD extensions of modern

(9)

ap-// Delarations

#define W 320

#define H 240

matrix<short> I(W,H),a(W,H),b(W,H);

matrix<short> (W,H),t1(W,H),t2(W,H);

matrix<float> h(W,H);

float k = 0.05f;

filter<3,horizontal> smooth_x = 1,2,1;

filter<3,horizontal> grad_x = 1,0,1;

filter<3,vertial> smooth_y = 1,2,1;

filter<3,vertial> grad_y = -1,0,1;

// Computes matrix M:

//

// | a |

// M = | b |

t1 = grad_x(I);

t2 = grad_y(I);

a = (smooth_x*smooth_y)(t1*t1);

b = (smooth_x*smooth_y)(t2*t2);

= (smooth_x*smooth_y)(t1*t2);

// Computes matrix H

H = (a*b-*)-k*(a+b)*(a+b);

Fig.6.1.TheHarrisdetetor,oded with eve

Theswar(SIMDWithinAregister,[4℄)projetisanexampleoftherstapproah. Itsgoalistopropose

aversatiledata parallelClanguagemakingfull SIMD-styleprogrammingmodels eetiveforommodity

mi-roproessors. An experimentalompiler(s)hasbeendevelopedthat extendsCsemantisandtypesystem

andantargetseveralfamilyofmiroproessors. Startedin1998,theprojetseemstobein dormantstate.

Anotherexampleoftheompiler-basedapproahisgivenbyKyoetal. in[8℄. Theydesribeaompilerfor

aparallelCdialet(1d,OneDimensionalC)produingSIMDodeforPentiumproessorsandaimedatthe

suint desription of parallel image proessing algorithms. Benhmarks resultsshowthat speed-ups in the

rangeof2to 7(omparedwith odegeneratedwith aonventional Compiler)anbeobtainedfor low-level

imageproessingtasks. Buttheparallelizationtehniquesdesribedintheworkwhiharederivedfromtheone

usedforprogramminglinearproessorarraysseemstobeonlyappliabletosimpleimagelteringalgorithms

baseduponsweepingahorizontalpixel-updatinglinerow-wisearosstheimage,whihrestritsitsappliability.

Moreover,and thisanbeviewedasalimitationof ompiler-basedapproahes,retargetinganotherproessor

maybediult, sineitrequiresagoodunderstandingoftheompilerinternalrepresentations.

Thevastodeoptimiser[3℄hasaspei bak-endforgenerating Altive/PowerPCode. Thisompiler

oersautomativetorizationandparallelizationfromonventionalCsoureode,automatiallyreplaingloops

withinlinevetorextensions. Thespeedupsobtainedwithvastarelaimedtobelosedtothoseobtainedwith

hand-vetorizedode.vastisaommerialprodut.

There have been numerous attempts to provide a library-based approah to the exploitation of SIMD

features in miro-proessors. Applevelib [6℄, whih provides aset of Altive-optimisedfuntions forsignal

proessing,isanexample. Butmostoftheseattemptssuer fromtheweaknessesdesribedinSet.2;namely,

theyannot handle omplex vetor expressions and produeineient odewhen multiple vetoroperations

areinvolvedinthesamealgorithm. MaSTL[9℄istheonlyworkweareawareofthataimsateliminatingthese

(10)

optimised for Altive and relies on template-based metaprogramming tehniques for ode optimisation. The

only dierene is that it only provides STL-ompliant funtions and operators (it an atually be viewed as

a spei implementation of the STL for G4/G5 omputers) whereas eve oers additional domain-spei

funtionsforsignalandimageproessing.

8. Conlusion. Wehaveshownhowalassialtehniquetemplate-basedmetaprogramminganbe

ap-pliedto thedesignandimplementationofaneienthigh-levelvetormanipulationlibraryaimedatsienti

omputing onPowerPC platforms. Thislibrary oersasigniantimprovementin termsof expressivityover

thenativeCAPItraditionallyusedfortakingadvantageoftheSIMDapabilitiesofthisproessor. Itallows

de-veloperstoobtainsigniantspeedupswithouthavingtodealwithlowlevelimplementationdetails. Moreover,

TheeveAPIislargelyompliantwiththeSTLstandardandthereforeprovidesasmoothtransition pathfor

appliationswritten withothersientiomputinglibraries. Aprototypeversionofthelibraryanbe

down-loadedfromthefollowingURL:http://wwwlasmea.univ-bplermont.fr/Personnel/Joel.Falou/eng/eve.

We are urrently working on improving the performanes obtained with this prototype. This involves, for

instane, globally minimizingthe number ofvetorloadand storeoperations,using morejudiiously

Altive-speiahemanipulationinstrutionsortakingadvantageoffusedoperations(e.g. multiply/add). Finally,it

anbenotedthat,althoughtheurrentversionof evehasbeendesignedforPowerPCproessorswithAltive,

itouldeasilyberetargetedtoPentium4proessorswithMMX/SSE2beausetheodegeneratoritself(using

theexpressiontemplatemehanism) anbemadelargelyindependentoftheSIMDinstrutionset.

REFERENCES

[1℄ TheBOOSTLibrary.http://www.boost.org/.

[2℄ ThePOOMALibrary.http://www.odesourery .om /po oma/ .

[3℄ VAST.http://www.psrv.om/vast_a lti ve. html /.

[4℄ TheSWARHomePagehttp://shay.en.purdue.edu /~s war PurdueUniversity

[5℄ Apple,TheAltiVeInstrutionsReferenesPage.http://developer.apple. om/h ard ware /ve.

[6℄ Apple,VeLibframework.http://developer.apple. om/ hard war e/ve /ve tor _lib rar ies. html

[7℄ C.HarrisandM.Stephens,Aombinedornerandedgedetetor.In4thAlveyVisionConferene,1988.

[8℄ S. Kyo andS. OkasakiandI. Kuroda, An extended Clanguage anda SIMD ompiler foreientimplementation of

imageltersonmediaextendedmiro-proessors. inProeedingsofAivs2003(AdvanedConeptsforIntelligentVision

Systems),Ghent,Belgium,Sept.1998

[9℄ G.Low, MaSTL. http://www.pixelglow.om/ mas tl/.

[10℄ I. Ollman,AltiVeVeloityEngineTutorial.http://www.simdteh.or g/al tive . Marh2001.

[11℄ T.Veldhuizen,UsingC++TemplateMeta-Programs.InC++Report,vol.7,p.36-43,1995.

[12℄ ,ExpressionTemplates.InC++Report,vol.7,p.26-31,1995.

[13℄ ,TehniquesforSientiC++.http://osl.iu.edu/ tveldhui/papers/tehniques /.

[14℄ ,ArraysinBlitz++.InDrDobb'sJournalofSoftwareTools,p.238-44,1996.

[15℄ T. Veldhuizen and D. Gannon, Ative Libraries: Rethinkingthe roles of ompilers and libraries Pro. of the SIAM

(11)

Appendix A.A simple3x1 gaussianlter written with the Altive native C API .

void AV_filter( har* img, short* res)

{

vetor unsigned har zu8,t1,t2,t3,t4;

vetor signed short x1h,x1l,x2h;

vetor signed short x2l,x3h,x3l;

vetor signed short zs16 ,rh,rl,v0,v1,shift;

// Generate onstants

v0 = ve_splat_s16(2);

v1 = ve_splat_s16(4);

zu8 = ve_splat_u8(0);

zs16 = ve_splat_s16(0);

shift = ve_splat_s16(8);

for( int j = 0; j< SIZE/16 ; j++ )

{

// Load input vetors

t1 = ve_ld(j*16, img); t2 = ve_ld(j*16+16, img);

// Generate shifted vetors

t3 = ve_sld(t1,t2,1); t4 = ve_sld(t1,t2,2);

// Cast to short

x1h = ve_mergeh(zu8,t1); x1l = ve_mergel(zu8,t1);

// Atual filtering

rh = ve_mladd(x1h,v0,zs16);

rl = ve_mladd(x1l,v0,zs16);

rh = ve_mladd(x2h,v1,rh);

rl = ve_mladd(x2l,v1,rl);

rh = ve_mladd(x3h,v0,rh);

rl = ve_mladd(x3l,v0,rl);

rh = ve_sr(rh,shift);

rl = ve_sr(rl,shift);

// Pak and store result vetor

t1 = ve_paksu(rh,rl);

ve_st(t1,j,out);

}

Editedby: FrédériLoulergue

Reeived: June26,2004