Modeling Stream Communications in Component-based Applications

(1)

MODELINGSTREAM COMMUNICATIONS INCOMPONENT-BASEDAPPLICATIONS

∗

M. DANELUTTO

†

, D. LAFORENZA

‡

, N. TONELLOTTO

‡

, M. VANNESCHI

†

, AND C. ZOCCOLO

†

Abstrat. ComponenttehnologyisapromisingapproahtodevelopGridappliations,allowingtodesignveryomplex

appli-ationsbyhierarhialompositionofbasiomponents.Nevertheless,omponentappliationsonGridshaveomplexdeployment

models. Performane-sensitive deisions shouldbetakenbyautomati tools, mathing developer knowledge about omponent

performanewithQoSrequirementsontheappliations,inordertonddeploymentplansthatsatisfyaServieLevelAgreement

(SLA).

Thispaperpresentsasteady-stateperformanemodelforomponent-basedappliationswithstreamommuniationsemantis.

The model stritly adheres to the hierarhial nature of omponent-based appliations, and isof pratial use inlaunh-time

deisions.

Keywords: gridomputing;heterogeneousenvironments;streamomputations;performanemodel;mapping.

1. Introdution. Grid omputing is an emerging tehnology that enablesthe aggregationof

heteroge-neous, distributed resoures to solve omputational problems of ever inreasing size and omplexity. The

appliations that best perform on Grid platforms are the ones requiring large omputational power, or the

treatmentoflargedatasets,i.e. asublassofHigh-PerformaneAppliations[17℄.

Suh appliations (e.g. data-mining [12℄, query proessing[3℄, image proessing and visualization[2℄and

multimediastreaming[38℄)anbeonvenientlyexpressedusingaformalismbasedontwofundamentalnotions:

streamsof dataowingbetweenomponents,andomponents(eithersequentialorparallel) proessingthem.

Several programming languages are built on these onepts. Skeleton-based languages (e.g. SkIE [4℄ and

SBASCO [14℄) and skeleton libraries (e.g. eSkel [11℄ and Kuhen's C++ skeleton library [21℄) exploit the

notionofstreamsfortask-parallelskeletons(e.g. pipeandfarm). MoregenerallanguageslikeASSIST[33℄and

Datautter[15℄introduemodulesandstreamsasprimitiveoneptsto strutureparallelappliations.

Grid programmingframeworks (e.g. GrADS [9℄, ASSIST [13℄) are in hargeof the omplete automation

of appliation exeutionmanagement, eientlyexploiting Grid resoures. Moreover,they should be ableto

exeutetheappliationwithuser-requiredQoS,adaptingtheexeutiontothedynamihangesofGridresoures.

Thetraditionalomponentmappingstrategy,inwhihomponentsarestatially deployedin adistributed

environmentby theirdevelopers,doesnott well in suh senario. A broader deploymentmodelis required,

featuring

(i) manualmapping,inwhihtheomponentsarealreadypairedwiththeirresoures(onwhihtheyare

deployed),

(ii) resouresdisoveryandseletionat launhtime,to guaranteetheinitialdesiredperformane,

(iii) adaptiveomponentsmanagement,that at run-timeadjusttheset ofomputing resouresexploited

[31,1℄,inordertoadapttodierentperformanerequirements(on-demandomputing)ortohangingresoures

availability.

Aordingtothismodel,thedeploymentframeworkmustautomatially managetheoperationsneededto

enforetheappliationdesiredQoS.Thisanbeobtainedwiththespeiationofaperformaneontrat [34℄.

OurapproahintendstoautomatisethetasksneededtostarttheexeutionofHPCappliations. Ournal

goalistoallowanaslargeaspossibleuserommunitytogainfullbenetsfromtheGrid,andatthesametime

togivethemaximumgenerality,appliabilityandeasyofuse.

Themainontributionsofthispaperareasfollows:

(i) Wepropose an analytialmodel of thedynami behaviorof sequential/parallelomponents,

hierar-hial omponents and omponent appliations, ommuniating throughtyped streams of data. It is suited

to be used in simulation environments, to synthetially generate omponents and appliations to test

map-ping/shedulingsolutionsinarepeatableandontrolled setting. Eventually,theproposeddynami modelan

beexploited intheimplementationofdynamireongurationpoliies[1℄.

∗

Thisworkhas beensupportedby: the ItalianMIURFIRB Grid.itprojet,No. RBNE01KNFP,on High-performaneGrid

platformsandtools,andtheEuropeanCoreGRIDNoE(EuropeanResearhNetworkonFoundations,SoftwareInfrastruturesand

AppliationsforLargeSale,Distributed,GRIDandPeer-to-PeerTehnologies,ontrat no.IST-2002-004265).

†

DepartmentofComputerSiene,UniversityofPisa,Pisa,Italy

‡

(2)

(ii) Starting from the dynami model we identify the set of variables that an be used to desribe the

performanebehaviorofanappliation,andwederivethesetofrelationsamongthemwhihholdatsteady-state

(performane model). In this way we abstratfrom partiularruntime platforms and weapture all possible

steady-statebehaviorsof anappliation. Moreover, theirformulationbymeans oflinear algebraallowsus to

hierarhiallyomposetheperformanemodelsofseveralomponentsto derivethesteady-statemodelofnew

omponentsorappliations.

(iii) Weintrodueadenitionofperformanemodel forstreamappliations,whihisexploitedin

launh-timemappingand runtimereongurationdeisions.

Afterasurveyofrelatedwork(Set.2),thispaperpresentsadynamimodelofstream-basedomputations

(Set. 3), andin Set.4suhmodelis exploited toderiveasteady-stateperformane model forstream-based

appliations. In Set.5, suh model is applied to aase study, to preditthe programbehaviorat run-time,

and todeviseaorretinitial mappingforspeiedQoSlevels. Setion6onludes thepaper, disussingthe

presentedapproahandfuture work.

2. Related Work. Performane speiation of omponents and their interations is a basi problem

thatmustbesolvedtoenablesoftwareengineerstoassembleeientappliations[27℄. Moreover,performane

modeling is one of the key aspets that needs to be addressed to fae sheduling/mapping problems in

het-erogeneousplatforms. It arises in automatiomponent plaementand reonguration. Severalreentworks

fousonperformanemodelingtehniquestoanalyzethebehaviorofomponent-basedparallelappliationson

distributed,heterogenous,dynamiplatforms.

Analyti performane models in software engineering makeextensive use of UML formalism to desribe

software omponent behavioral models [35℄ and to derive models based on Queuing Networks [19℄ or

Lay-ered Queueing Networks [36℄ to be exploited in design phase of the lifeyle of software. The same holds

for Stohasti Petri Nets[20℄ and Stohasti Proess Algebras [18℄. Suh models typially translatea

paral-lel appliation into an analyti representation of its exeution behavior and the target runtime system

(a-ording to the Software Performane Engineering methodology [28℄). A detailed survey of suh models is

in [5℄. Suh translation is usually not straightforward. It may requireapproximationsto obtain

mathemat-ial models [29℄ for whih a losed-form solution is known. Stohasti models usually require the solution

of the underlying Markov hain whih an easily lead to numerial problems due to the spae state

explo-sion [5℄. More omplex models an be solved by means of simulation, at the ost of a larger omputation

time.

Symboliperformanemodeling[32℄isamethodologythat enablesarapiddevelopmentoflowomplexity

and parametri performane models. Symboli performane models anbe derived from simulationmodels,

tradingo result aurayfor model evaluation ost. In [32℄ asymboli performane model for the Pamela

modelinglanguageisintrodued. Itderiveslowerboundsforsteady-stateperformanesofappliationsstarting

fromamodeloftheprogramandofthesharedresoures,ombiningdeterministiDiretAyliGraphs(DAGs)

modeling with mutual exlusion. Oneof thestrengths of thePamelaapproahis that it is fastand easy to

transformaregularlystruturedappliation intoaperformane model. Themainlimitationofsuhapproah

isthatitomputeslowerboundsoftheperformaneofaprogram. Symboliperformanemodelsshareseveral

properties withthemodelwepropose: bothanbeextratedfrom thestrutureof programs,areparametri,

and anbe eiently evaluated. The main dierene is that the presented model doesnot omputea lower

bound,buttheasymptotisteady-stateperformaneofanappliation,thatisingeneralabetterapproximation

oftherealperformane.

Theasymptotisteady-stateanalysishasbeenpioneeredbyBertsimasandGamarnik[10℄. Thisapproah

hasbeenreently appliedto mappingandshedulingproblems ofparallel appliationsonheterogeneous

plat-forms[23,7,6℄,inwhihtheanalysisisappliedtopartiularlassesofparallelappliations(divisibleload[23℄,

master/slave[6℄, pipelined andsatter operations [7℄), in thehypothesisthat the set ofresouresis knownin

advane. The existing steady-stateapproahesapply only to arestritedlass of strutured parallel

applia-tions,assumingtoknowtheruntimeenvironmentinsuhawaytoderiveoptimalshedulingoftheappliation

omponents. Inadynami environmentlike aGrid anoptimal initial plaement of theomponentsmay

be-omeuselessverysoon,beausetheonditionsoftheexeutionplatformmayvarydynamially. Thepresented

steady-stateanalysis an beapplied to abroader lass ofstrutured parallel appliationsand tries to solvea

(3)

Strutural performane models [25℄ are the rst eortto develop ompositional performane models for

omponentappliations. MostsientiandGridomponentmodelsrelyontheoneptofalgorithmiskeleton.

Skeletons are ommon, reusable and eient strutured parallelism exploitation patterns. One advantage of

theskeletalapproahisthat parametriostmodelsanbedevisedfortheevaluationofruntimeperformane

of skeleton ompositions. In [14, 8℄ dierent ost models are assoiated to eah skeleton of an appliation

to enhane its runtime performane through parallelism/repliationdegree adjustments and initial mapping

seletion, respetively. The authors of [14℄ propose parametri ostmodels forpipe, farm and multiblok

skeletons,thatanbearbitrarilyomposedandnested. In[8℄,analytiostmodelsforappliationsomposedby

pipesanddealsarederivedwithinastohastiproessalgebraformulation. Struturalperformanemodelsare

extendedbythepresentedmodelbyproposingamethodologywell-suitedforgeneriompositionofskeletons,

andbytakingintoaountthesynhronizationproblemsintroduedbyusingstreamedommuniations.

Trae-basedperformanemodels[34,26℄areurrentlyexploitedinparallel/Gridenvironmentstomodelthe

performaneofsetsofkernelappliations. Reordingandanalyzingexeutiontraesonreferenearhitetures

ofsuhappliation itispossible,withaertaindegreeofpreision,to foreasttheperformaneofthesameor

similarappliationsondierentresoures.Traeinformationisexploitedinthepresentedmodel,butindierent

waywithrespettotheexisting approahes. Insteadofprolingawholeappliationonaset ofrepresentative

resoures,theappliationmodeliskeptindependentfromresoures. Whentheappliationwill bemappedon

atualresoures,historialinformation willbeusedto model theruntime behaviorof singleomponents,and

then suh information will beoupled with the omponent interations informationto obtain apredition of

theperformaneofthewholeappliation.

The problem of deriving a performane model for omponents has been addressed also in the ontext

of omponent frameworks suh as EJB [37℄, COM+/.NET [16℄ and CCA [24℄. Suh works apply analytial

performanemodel(LQN)ortrae-basedperformanemodeltoderiveamodelforomponents. In[30℄,

trae-basedmodelsareexploitedtoseletthemostsuitableomponents,whenmultiplehoiesareavailable,tobuild

anoptimalappliation,fromthepointofviewofperformane.

3. DynamiBehavior. Anappliationanbestruturedasahypergraphwhosenodesrepresentprimitive

omponentsandwhose(hyper)edgesrepresentommuniationsorsynhronizationsbetweenomponents. Nodes

interat withinput (server) interfaesand output(lient)interfaes. Edgesare diretedand anonnettwo

ormorenodesthroughtheirinterfaes. Twonodesmaybelinkedbymorethanasingleedge.

3.1. Communiations. Communiations between omponents are implemented through input/output

interfaesbindings. Inthisworkdata-owstreamommuniationsarestudied. Everyomponentreeivesdata

throughoneormoreinputinterfaes,performssomeomputations,andgeneratesnewdatatobesentthrough

oneormoreoutputinterfaes.

Inthisontext, astream representsatyped, unidiretionalommuniationhannelbetweenanon-empty,

nitesetofomponents(produers)andanon-empty,nitesetofomponents(onsumers). Theatomipiee

of information transferred througha stream is alled item. A produeris onneted to astream through an

output interfae, while a onsumeris onneted to astream through an input interfae. Every node anbe

produeroronsumerofseveralstreams,anditispossibletospeifyylistrutures(i.e. theommuniation

strutureisnotrestritedtobeaDAG).

Componentsanbeonnetedbystreamsaordingtothreedierentpatterns:

(i) uniast: one-to-one onnetion. Everyitem sentontheoutput streaminterfaeisreeivedin order

bytheinputstreaminterfae.

(ii) merge: many-to-oneonnetion. Everyitem sentontheoutputstreaminterfaesisreeivedbythe

inputstreaminterfae. Thetemporalordering oftheitemsomingfrom eahinputinterfaeispreserved,but

theinterleavingbetweenthedierentsouresisnon-deterministi.

(iii) broadast: one-to-manyonnetion. Everyitem sent ontheoutput stream interfaeis reeivedin

orderbytheinputstreaminterfaes. Thereeptionshappeningondierentinputinterfaesarenotsynhronized.

3.2. Computations. Componentsimplement sequential aswell asparallel omputations. A sequential

omponent exeutes asingle funtion in a single ative thread, proessing items as theyare reeived. Fora

parallelomponent,twosenariosarepossible:

(i) dataparallel: asinglefuntion isexeutedinparallel ondierentportionsofthesamedata;

(4)

A primitive omponent, either sequentialorparallel, at runtime repeatedly reeivesitems from its input

streams,performssomeomputationsanddeliversresultitemstoitsoutputstreams.

Aomponentanhaveseveralinputstreams. Thesetofinputstreamsispartitionedbetweenthe

omputa-tionsassoiatedwiththeomponents. Eahinputstreamis assoiatedtoonlyoneomputation;nevertheless,

spontaneousomputationsmayexist,thatdonotneedinputitemstoativate,butfollowownativationpoliies

(e.g. periodially).

Aomputationan beativatedifthefollowingonditionshold:

(i) theomponentanexeuteanewfuntion(thismeansthatitisidle,oritisparallelandthreadsare

availableto exeuteit),

(ii) theassoiatedinputitemshavebeenreeived,ornoitemisneessary.

Asequentialomponentanativateanewfuntiononlywhenitisidle. Aparallelomponentanhaveat

mostoneativedata-parallelomputationatanygiventime(omposedbyaxednumberofthreads),orseveral

task-parallelomputationsrunninginparallel (uptothemaximumnumberofthreadsin theomponent).

Aomponentanhaveseveraloutputstreams. Oneormoreomputations oftheomponentandispath

dataoneahoutputstream.

3.3. Node Behavior. Inordertodesribethebehaviorofaomputation atruntime,onsiderFig.3.1.

Fig.3.1. Sequentialomponentatruntime

Withoutlossofgenerality,asequentialomponentisonsidered;thedisplayedquantitiesrepresent:

(i)

i

k(

t

)

: totalnumberofreeiveditemsat time

t

fromthe

k

th

inputinterfae;

(ii)

e

(

t

)

: totalnumberofomputationsarriedoutattime

t

; (iii)

o

j(

t

)

: totalnumberofsentitemsat time

t

throughthe

j

th

outputinterfae.

Continuousquantitiesareusedtomodelpartialevolution,e.g.

e

(

t

) = 3

.

5

meansthatthenodereahedthehalf waypointin thefourthomputation.

Theativation of aomputationan happenonly whenthe numberofitemsompletely reeivedoneah

assoiatedstreamisgreaterthanthenumberofpartiallyomputeditems:

∀

k

= 1

, . . . , n

i

k(

t

)

−

e

(

t

)

>

0

(3.1)

Thenodeimplementationwillexploitnitebuerstostorereeiveditemsforeahinputinterfae,thereforefor

eahinputinterfaeandassoiatedomputationthefollowingmusthold:

∀

k

= 1

, . . . , n

i

k(

t

)

−

e

(

t

)

≤

τ

1k

(3.2)

where

τ

1k

representsthemaximumnumberofelements thatanbereeivedonthe

k

th

inputinterfaebefore

thestreambloks. Thenthemaximumadmissible valuefor

i

k

(

t

)

attime

t

is:

i

max

k

(

t

) =

τ

1k

+

e

(

t

)

(3.3)

Assumingthatnosensibledelaysarepresentbetweentheendofomputationsandthebeginningofthe

transmis-sionoftheprodueditems,thetotalnumberoftransmitteditemsisrelatedtotheprogressoftheomputations

ofthenode. Inthegeneralaseofanodewith

s

funtions,thefollowingequationholdsforeahoutputinterfae:

∀

j

= 1

, . . . , m

o

j(

t

) =

f

j

e

1(

t

)

, . . . , e

s(

t

)

(3.4)

(5)

3.4. Edge Behavior. In order to desribe thebehaviorof a data transmissionon astream, onsider a

uniaststream. Theinvolvedvariablesare

o

(

t

)

,totalnumberofitemssentattime

t

from soureinterfae,and

i

(

t

)

,totalnumberofitemsreeivedattime

t

bythedestinationinterfae. Anewtransmissionbeginsonlyafter

afullitemisprodued:

i

(

t

)

≤ ⌊

o

(

t

)

⌋

(3.5)

Theedge implementation will exploit nite ommuniationbuers and thenetwork layertransfershunks of

data. Let

q

−

1

betheminimumfration ofitemtransferredatomially. Then

o

(

t

)

−

⌊

q

· i

(

t

)

⌋

q

≤

τ

2

(3.6)

where

τ

2

representsthe maximumnumberof itemsthat anbebuered. Therefore themaximumadmissible

valuefor

o

(

t

)

at time

t

is:

o

max

(

t

) =

τ

2

+

⌊

q

· i

(

t

)

⌋

q

(3.7)

Wheneveranedgebuerisfull, aproduerwill blokassoonasittriesand sendsanewitem. From(3.4)we

obtain:

o

max

(

t

)

−

f e

1(

t

)

, . . . , e

m(

t

)

≤

0

(3.8)

For merge streams with

k

soure interfaes and broadast streams with

k

destination interfaes, the general

onstraints(Eqs.(3.5)and(3.6)fortheuniaststream)beome:

merge:

i

(

t

)

≤

P

k

o

k

(

t

)

P

k

o

k(

t

)

−

i

(

t

)

≤

τ

2k

(3.9)

broadast:

∀

k

i

k(

t

)

≤

o

(

t

)

∀

k

o

(

t

)

−

i

k

(

t

)

≤

τ

2k

(3.10)

For simpliity, inthepreviousequationsthenetworkquantizationonstant

q

hasbeensuppressed.

3.5. RuntimeBehavior. Atruntime,aomponentanbeseenasadynamisystem. Thesystemstate

at time

t

is desribedby aset ofstate variables:

i

1,...,ni(

t

)

,

e

1,...,ne(

t

)

,

o

1,...,no

(

t

)

. Thus, the state spae

P

is a

n

=

n

i

+

n

e

+

n

o

dimension Eulidean spae. The dynami behavior ofaomponent anbe modeled bya trajetory

p

(

t

)

in suhstatespae.

The runtime behavior of a omponent is fully speied when it is oupled with hosting resoures. A

omputingresoureis modeled by

w

(

t

)

, theavailableomputing powerat time

t

(measuredin MFlop/s)and a ommuniation link is modeled by

b

(

t

)

, the instantaneous bandwidth at time

t

(measured in MByte/s). Moreover, aharaterizationof the items is required. It is assumed that an item proessed by aomponent

requires

l

unitsofomputingworktobeproessed(measuredinMFlop)and

s

unitsofommuniationworkto

betransmitted(measuredinbytes).

Introduingthestepfuntion

u

(

x

)

,thenumberofperformed(partial)omputationspertimeunitis:

de

dt

=

u

min

i

1(

t

)

, . . . ,

i

n(

t

)

−

e

(

t

)

·

· u

o

max

(

t

)

−

f e

1(

t

)

, . . . , e

m(

t

)

· w

(

t

)

L

(3.11)

whiletheequations governingthenumberofpaketsowingin theuniast, mergeand broadaststreamsper

timeunitare,respetively:

di

dt

=

u

o

₍

t

₎

−

i

(

t

)

· u

i

max

₍

t

₎

₋

i

₍

t

₎

· b

₍

t

₎

s

(3.12a)

di

dt

=

u

X

k

o

k

(

t

)

−

i

(

t

)

· u

i

max

₍

t

₎

₋

i

₍

t

₎

· b

₍

t

₎

s

(3.12b)

di

k

dt

=

u

o

₍

t

₎

−

i

k

(

t

)

· u

i

max

₍

t

₎

−

i

k

(

t

)

· b

₍

t

₎

s

(6)

Note that an important assumption has been made. The work required to perform a omputation is

supposed to be independent from the values of the inoming items; their values are used just to perform

omputations. Thisisaommonassumptioninparalleldata-owprogramming,butthereareappliations(e.g.

queryproessinganddatamining)that donotrespetthisassumption.

Thedynamiequationsprovidedbythemodelanbewrittenin thegeneralform:

˙

p

(

t

) =

U

(

p

(

t

))

α

(

t

)

(3.13)

Wedenote with

U

:

P

→

M

n,n

thefuntion that, foreverypoint inthe statespae,providestheontrolpart ofthedierentialequations(the onesinvolvingthestepfuntions),andwith

α

(

t

)

theresourespart(involving

w

(

t

)

and

b

(

t

)

).

Weobservethattheontrolmatrixispiee-wiseonstantovernon-innitesimaltimeintervals: itdesends

from quantization in the generalequations for thenodes(3.11), and in theequations for the streams (3.12).

Then, the Cauhy problem an be solved onstrutively. Starting with

t

0

= 0

, p

0(

t

0) = 0

, U

0

=

U

(0)

, we indutivelydene

p

i(

t

) =

Z

t

ti

U

i

α

(

τ

)

dτ

t

i+1

= sup

{

t > t

i

|

U

(

p

i(

t

)) =

U

i

}

U

i+1

= lim

t

→

t

+

i

U

(

p

i(

t

))

Inthis way,

p

(

t

)

is dened asthe onatenationof the piees

p

i

|

[ti,ti+1)

: it is aontinuous funtion (

p

i(

t

i) =

p

i+1(

t

i)

)andpiee-wisedierentiable.

4. STEADYSTATEBEHAVIOR. Thesteady-statebehaviorofthesystemanbeanalysedby

study-ingmeanvalues

p

¯

fortherateofhangeof thestatevariables:

¯

p

=

E

[ ˙

p

|

[t0,

∞

)

] =

Z

∞

t0

˙

p

(

t

)

dt

= lim

t

→∞

p

(

t

)

−

p

(

t

0)

t

−

t

0

(4.1)

Thehoieof

t

0

isarbitrary,infattheweightofthetransientphasefadesawayonsideringinniteexeutions.

However,to ease thereasoning aboutthese quantities, wean interpret

t

0

asthe end of thetransientphase,

e.g. whenthelaststageonsumestherstdataiteminapipeline.

The essential aspet to point out is that for the steady-statemodel the fous is on relations among the

steady-statevariables,rather thanintheir values. Inthiswayitispossibletoabstrat from partiulartarget

platforms,andapturethelassofallpossiblesteady-statebehaviorsofanappliation.

Thesteady-statebehaviorofanode anbe modelledassoiatingto eah omputation

e

k

(

t

)

itsativation rate

¯

e

k

= lim

t

→∞

e

k

(

t

)

−

e

k(

t

0)

t

−

t

0

(4.2)

Spontaneous omputations are free variables in thesteady-state model. Computations that are ativated by

datareeption,instead,aresubjettothefollowingondition.

Proposition4.1. Thesteady-stateexeutionrate ofaomputationisboundtobeequal tothe inputrates

onthe inputinterfaes thatativate the omputation.

Proof. Let

k

∈

A

i

,wewillprovethat

e

¯

i

−

¯

ı

k

= 0

¯

e

i

−

¯

ı

k

= lim

t

→∞

e

i(

t

)

−

e

i(

t

0)

t

−

t

0

−

lim

t

→∞

i

k(

t

)

−

i

k(

t

0)

t

−

t

0

= lim

t

→∞

e

i(

t

)

−

e

i(

t

0)

−

i

k(

t

) +

i

k

(

t

0)

t

−

t

0

= lim

t

→∞

e

i(

t

)

−

i

k(

t

)

t

−

t

0

−

e

i(

t

0)

−

i

k(

t

0)

(7)

Thenumeratoroftherstaddendislimitedbyonstants: (3.1)gives

e

i(

t

)

−

i

k(

t

)

≤

0

and(3.2)(notingthat

e

(

t

)

≥ ⌊

e

(

t

)

⌋

)gives

e

i(

t

)

−

i

k(

t

)

≥ −

τ

1k

whilethenumeratorof theseondaddend isonstant,so thelimittendsto zerowhenthedenominator tends

toinnity.

The data transmission rate

¯

o

k

of an output stream will depend on the ativation rates of one or more omputationsofthenode. Intheprevioussetion,thenumberofdataoutputshasbeenrelatedtothenumber

ofperformedomputationsbymeansofatransferfuntion

f

k

(Eqn. (3.4)). Proposition4.2. If the transferfuntion is(asymptotially) linear

o

k

=

f

k(

e

1

, . . . , e

m) =

α

1 k

e

1

+

. . . α

m

k

e

m

+

c

k(

e

1

, . . . , e

m)

with

lim

k

e

k→∞

k

c

k(

e

)

k

e

k

= 0

then asteady-state is eventually reahed, in whih the output rate isa linear ombination of the omputation

rates:

¯

o

k

=

m

X

i=1

α

ki

e

¯

i

(4.3)

Proof.

¯

o

k

= lim

t

→∞

f

k(

e

(

t

))

−

f

k

(

e

(

t

0))

t

−

t

0

= lim

t

→∞

α

k

· (

e

(

t

)

−

e

(

t

0)) +

c

(

e

(

t

))

−

c

(

e

(

t

0))

t

−

t

0

=

α

k

· lim

t

→∞

e

(

t

)

−

e

(

t

0)

t

−

t

0

+ lim

t

→∞

c

(

e

(

t

))

−

c

(

e

(

t

0))

t

−

t

0

=

α

k

· ¯

e

+ 0 =

m

X

i=1

α

m

k

e

¯

i

Thesteady-statebehaviorofstreamsanbemodelledbyassoiatingtoeahendpointitsdatatransmission

rate. Balane equationsrelatinginputandoutputendpointsarederived.

Proposition4.3. Thesteady-statetransmissionrateatthe endpointsofastreamareharaterisedbythe

following balaneequations:

uniast:

o

¯

A

= ¯

ı

B

(4.4a)

merge:

o

¯

A

+ ¯

o

B

= ¯

ı

C

(4.4b)

broadast:

o

¯

A

= ¯

ı

B

= ¯

ı

C

(4.4)

Theseequations areeasily extendedinthe aseof moreendpoints.

Proof. Theproofissimilarto theoneofProp.4.1,exploiting:

(i) (3.5)and(3.6)foruniast,

(ii) (3.9)formerge,

(iii) (3.10)forbroadast.

Theexeutionrate for eah omputation,and thedata transfer rate foreahinput/output interfae

om-pletelyspeify theappliationstatefrom thepointof viewofitsperformane,therefore wewillallthem the

performanefeatures ofourappliation.

(8)

bymeansofsomeperformane annotations,in ordertobuildaperformanemodel. Proposition4.1allows

ustoeliminateexeutionratesassoiatedtodata-dependentomputations. Proposition4.3allowsustorelate

outputratestoinputratesoflinkedmodules.

Theperformanemodelisthereforedenedasanhomogeneoussystemofsimultaneouslinearequations,

thatdesribetherelationsthatholdinthesteady-stateamongtheperformanefeatures. Thesetofsolutions

ofthe systemisavetorsubspaeof

R

n

(where

n

isthe totalnumberofvariables, either inputrates, output

ratesorexeution rates);weall the dimensionofthe solutionspae thenumberof degrees of freedomof

theappliation. If thisdimensionis 1,thenthesystemis ompletelydetermined assoonasasinglevaluefor

anyvariableisimposed. Thedegenerateaseofaspaewithdimension0impliesthattheonlysolutiontothe

system isthe null vetor(i. e. everyvariable must bezero): this meansthat thepredited steady-stateis a

deadlokstate,inwhihnoomputationorommuniationanproeed. Thenumberofdegreesoffreedomof

thesystem will impaton how many onstraintsmustbeprovided in order to derivethe expeted valuesfor

everyvariable.

Clearly,onlypositivevaluesoftheratesaremeaningful,soweanonludethateveryassignmentofpositive

valuesforthevetor

[

i e o

]

T

∈

R

n

thatisasolutionofthesystemisapossibleoperationpointforthemodeled appliation.

Theoutlinedapproah iseient,in fatthesimpliationofthesimultaneous equationsanbeahieved

usingwellknowntehniques.

5. Appliation of the Model. We showhowthepresentedmodelan beapplied to areal appliation

(see Fig. 5.1), a rendering pipeline. The rst stage requests the renderingof asequene of senes while the

seond renderseah sene (exploiting the PovRay rendering engine), interpreting a sript desribing the 3D

modelof objets,theirpositionsandmotion. Thethirdstage olletsimagesrenderedbytheseondone,and

builds GroupsOf Pitures (GOP), that are sentto the fourth stage,performing DivXompression. Thelast

stageolletsDivXompressedpiees andstoresthemin anAVIoutputle.

Fig.5.1.Graphoftherender-enodeappliation

ForGOPsof12pitures,theperformanemodelforourtestappliationis(weeliminatedexeutionrates

fordata-dependentomputations):

C

1e

=

C

1o

=

C

2i

=

C

2o

=

C

3i

= 12

· C

3o

=

= 12

· C

4i

= 12

· C

4o

= 12

· C

5i

andhasonedegreeoffreedom.

5.1. Convergene to Steady State. Westartshowingthat theappliationbehavioratuallytends to

steady-state.

Figure5.2showsperformanefeaturestakenfromarealexeutionofthetestappliationonaBladeluster

onsisting of 32 omputing elements, eah equipped with an IntelPentium III Mobile CPU at 800MHz and

1GBofRAM,interonnetedby aswithedFastEthernet dediatednetwork. Theappliation wasongured

toexploit 20mahinesintherenderomputation,andonemahineforeahremainingnode.

Performanefeatures are measuredasin (4.2),i. e. averagingthe number ofperformedomputations on

thedurationoftheexeution. ThetopdiagramshowstheperformaneoftheRenderandtheGOPAssembler

nodes, whih operate on frames, while the bottom diagram shows the Enoder and Colletor nodes, whih

operate on GOPs. The similarity of the urves in the left and the right diagrams shows empirially that

Prop.4.2issatisednotonlyatthesteady-state,butalsoduringtheniteomputation,assoonasbuersare

(9)

0

1

2

3

4

5

0

100

200

300

400

500

600

700

800 bandwidth (activation/s)

frame

Rendering engine

GOP Assembler

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0

10

20

30

40

50

60 bandwidth (activation/s)

GOP

Encoder

Collector

Fig.5.2. Convergenetosteady-stateofaveragedperformanefeatures

Moreover,Fig. 5.2showsthat theaveragedomputation ratesstabilize during theomputation, allowing

ustoadoptasteady-statemodeltoapproximate theatualappliationrun.

5.2. FromDesired Performaneto ResoureRequirements. Typially,ifsomeoneisfainga

prob-lem by means of HPC tools, he has learin mind some sort of performane requirement for his appliation.

This anbeexpressed in dierentforms, e.g. ompletion time, omputation rate, response time, et. In our

framework we express requirementsas bounds on omputation rates. That is the most naturalway dealing

withstream parallelism. Thismeansthat, iftheproblemis expressedin dierentterms,somesort of

prelim-inary transformation should be applied (e.g. study the initial transient length to relate ompletion time to

omputationrate,orusetheLittle'sLawtotranslateresponsetimerequirementsinomputationrateones).

Suppose thatwerequire1frame/s(the onstraintis expressedby

C

5i

≥

1

12

, beauseeahinput for

C

5

is omposed by12frames). Applyingtheperformane model wederiverequiredomputation andtransferrates

foreahomputationand ommuniation.

These values,pairedwith programannotations(seeTab.5.1) ontheweightof omputationor

ommuni-ation (e.g. MFLOPper task/MB transferred to/from memory and message size, respetively) an be used

to derive requirements that the resouresmust fulll in order to meet the performane requirements on the

appliation.

For instane, we an show the requirement for stream

S

2

=

C

2o

. Sine it is required to arry 1.19MB messages with at least rate 1/s, a link of 9.5 Mbit/s is suient. Likewise, the test appliation will never

sale above10 frames/s with a100 Mbit/s network, and needs to be redesigned,if wewant to reah higher

performanes.

(10)

Deploymentannotationsfortheexampleappliation.

Component

C

1

C

2

C

3

C

4

C

5

Proessor i686 i686 i686 i686 i686

Memory(MB) 64 256 64

CPUWork 3307 52

Mem. Work 302 104

Connetor

S

1

S

2

S

3

S

4

datatype param pi GOP zip

datasize 54B 1.19MB 14.24MB 2MB

eahatomiomputation that,giventherequiredexeutionrate, produestheresourerequirements. This is

essentialinanexeutionenvironmentin whih resouresarenotknowninadvane.

Themodelpresentedin[22℄suitsourneeds. Weanassoiatetoeahomputationaweight,representedby

apairofvalues

w

= (

w

MF LOP

, w

MB

)

,speifyingthenumberofoatingpointoperations(expressedinMFLOP) andthedatatransferredto/frommainmemory(expressedinMB)perativation. Resourepowerisdesribed

bythepair

p

= (

p

MF LOP/s

, p

MB/s)

,andexeutiontimeisthereforeestimatedas

t

(

p, w

) =

w

MF LOP

p

MF LOP/s

+

w

MB

p

MB/s

.

This model an be employed also to nd appropriate parallelism degree for parallel omputation nodes.

We,infat,anrelate

t

(

p, w

)

foranaggregateresoure

p

= [

p

1

, . . . , p

k]

totheperformaneoftheodeonsingle resoures

t

(

p

i

, w

)

.

Assumingperfetspeedup,weobtain:

t

(

p, w

) =

X

i

t

(

p

i

, w

)

−

1 −

1

Inthiswayweanderive,foreahomputationnode,mathingresourerequirements. Thesewillonern

singleresouresforsequentialnodes,andaggregateonesforparallelnodes.

Resultsommented. InFig.5.3,twomappings(toponanhomogeneousluster,bottomwithheterogeneous

resoures)forthesameonstraintare displayed. Therstthing tonote isthat,eveniftheheterogeneousrun

has morevariane in ahieved bandwidth, the average bandwidth is omparablewith the homogeneousone.

Thisprovidesevidenethattheemployedperformanemodelorretlyhandlesheterogeneoussetsofresoures,

determiningtheorretparallelismdegree. Thegoodperformaneinheterogeneousrun(itsompletiontimeis

evenshorterthantheoneforhomogeneousrun)isexplainedbythefatthatthemodelanmathomputation

requirementswith suitableresoures,i. e. shedule memoryboundomputations (e.g. enoding)onmahines

withfastermemory,andFPUboundones(e.g. rendering) onmahineswithfasterFPU.

The obtained results are as expeted: the mapping omputed using the performane model fullls the

onstraint,atthebeginningandmostofthetimeoftheappliationrun. Thisoursbeause,inordertobuild

ourmodel,wesampledtheahievedperformaneontherstframesofthemovie,buttheappliationworkload

slightlyhangeswiththeevolutionofthemovie. Thisisevidenedbythesmoothedbandwidthurve,thathas

the same oursein the twoexperimental settings: theworkload is heavieraround 100s and 300s, while it is

lighterinthemiddleandattheend.

6. Conlusions and Future Work. Inthisworkwedesribedananalytialapproahtomapalassof

appliationsonaGrid. Theseappliationsinterat throughstreamsofdata, proessedbyseveralautonomous

softwareomponents, either sequential orparallel. We presented asteady stateperformane model for these

appliationsandweappliedittoaasestudy,arenderingpipelineofsequentialandparallelomponents. The

model wasexploited to predit aprogram behaviorat run-time. Then we showed ageneral methodology to

deviseaorretinitialmappingfortheappliation,drivenbyspeiedQoSlevels. Atlast,weshowedtheresults

ofourmappingmethodologywiththepresentedappliation, andwedisussedtheresultsofthemappingand

the exeutionon homogeneousand heterogeneous sets of resoures. Weobtained good results in bothases.

Theappliationwas orretlymappedandtheQoSrequirementrespetedwithasmallerror.

Analytial [35, 19, 36, 20, 18, 29℄ and strutural performane models [25, 14, 8℄ disussed in Set. 2

(11)

0

1

2

3

4

5

6

7

0

100

200

300

400

500

600 Bandwidth (frame/s)

Time (s)

Test results for cluster Fuji

instantaneous bandwidth

averaged bandwidth (100s)

constraint

0

1

2

3

4

5

6

7

0

100

200

300

400

500

600 Bandwidth (frame/s)

Time (s)

Test results for heterogeneous configuration

instantaneous bandwidth

averaged bandwidth (100s)

constraint

Fig.5.3. Twoexeutionsofthetestappliation: top)homogeneouslustersofAthlonsXP2600+, down)setof

heteroge-neousresoures(9P42GHz,1AthlonXP2800+,1P42.8GHz).

of the appliation performane from the target platform, allowing us to evaluate the model one to

de-rive enough information to drive the mapping proess. Trae-based approahes [34, 26℄ are used to

over-ome the limitations of previously disussed approahes, but they are not ompositional. Therefore they

must be applied from srath to every new appliation, even if it is built from the same set of

ompo-nents.

All thosemodelsandthepresentedoneshareanassumption onthebehaviorof theappliations:

ompu-tation exeutionsmust be independentfrom theatual valuesof theinput set. Otherwise, twoexeutions of

thesameappliationwouldbenotomparable(thisisalledergodiityforstohastimodels). Forappliations

thatdonotmeetthisrequirements,thebestsolutionistoresortto runtimeadaptation.

Thepresentedapproah is notperfet. The initialmapping anbe onsidered agood hint to start the

exeution of an appliation on a Grid. The dynami hanges in resoures during the exeution an not be

easily inludedin launh-time strategies. Our approah must be oupledwith reshedulingstrategiesat

run-time to solve suh problems. Our future work is going in this diretion. The presented steady state model

an be exploited at run-time to adapt the behavior of omponents to hanges in resoure performanes. In

this way, it should be possible to fulll the QoS requirements during the whole exeution of the

applia-tion.

REFERENCES

[1℄ M.Aldinui,A.Petroelli,E.Pistoletti,M.Torquati,M.Vanneshi,L.Veraldi,andC.Zoolo,Dynami

reongurationofgrid-awareappliationsinASSIST,inPro.11thEuro-ParConferene,Lisboa,Portugal,Aug.2005.

(12)

[3℄ B. Babok, S. Babu, M. Datar, R. Motwani, and J. Widom,Models and issues indata stream systems, inPro.

21stACM-SIGMOD-SIGACT-SIGARTSymposiumonPriniplesofdatabasesystems(PODS'02),Madison,USA,2002,

pp.116.

[4℄ B.Bai,M.Danelutto,S.Pelagatti,andM.Vanneshi,SkIE:aheterogeneousenvironmentforHPCappliations,

Par.Comp.,25(1999),pp.18271852.

[5℄ S.Balsamo,A.D.Maro,P.Inverardi,andM.Simeoni,Model-Based PerformanePreditioninSoftware

Develop-ment:ASurvey,IEEETrans.onSoftwareEngineering,30(2004),pp.295310.

[6℄ C.Banino,O.Beaumont, L.Carter,J.Ferrante,A.Legrand,andY.Robert,ShedulingStrategiesfor

Master-Slave Tasking on Heterogeneous Proessors platforms, IEEE Trans. on Parallel and Distributed Systems,15 (2004),

pp.319330.

[7℄ O. Beaumont, A. Legrand,L. Marhal,and Y.Robert,Steady-State ShedulingonHeterogeneousClusters: Why

andHow?,inPro.of

18 th

InternationalParallelandDistributedProessingSymposium(IPDPS04)(IPDPS'04),April

2004.

[8℄ A.Benoit,M.Cole,S.Gilmore,andJ.Hillston,ShedulingSkeleton-BasedGridAppliationsUsingPEPAandNWS,

TheComputerJournal,48(2005),pp.369378.

[9℄ F.Berman,A.Chien,K.Cooper,J.Dongarra,I.Foster,D.Gannon,L.Johnsson,K.Kennedy,C.Kesselman,

J.Mellor-Crummey,D.Reed,L.Torzon,andR.Wolski,TheGrADSProjet: SoftwareSupportforHigh-Level

GridAppliationDevelopment,Int.J.ofHighPerformaneComputingAppliations,15(2001),pp.327344.

[10℄ D.BertsimasandD.Gamarnik,Asymptotiallyoptimalalgorithmforjobshopshedulingandpaketrouting,Journalof

Algorithms,33(1999),pp.296318.

[11℄ M. Cole,Bringingskeletonsout of theloset: a pragmatimanifesto forskeletal parallel programming, Par.Comp.,30

(2004),pp.389406.

[12℄ M.Coppola and M.Vanneshi, High-Performane DataMining withSkeleton-based Strutured Parallel Programming,

Par.Comp.,Sp.Iss.onParallelDataIntensiveComputing,28(2002),pp.793813.

[13℄ M.Danelutto,M.Vanneshi,C. Zoolo,N.Tonellotto,R. Baraglia,T.Fagni,D.Laforenza,andA.

Pa-osi,HPCAppliationexeutiononGrids,inFGG:FutureGenerationGrid,CoreGRID,Springer,2006.

[14℄ M. Dìaz, B. Rubio, E. Soler, and J. M. Troya, SBASCO: Skeleton-based Sienti Components, inPro. of

12 th

EuromiroConfereneon Parallel, Distributed, and Network-Based Proessing (PDP'04),ACoruña, Spain,February

2004.

[15℄ W.DuandG.Agrawal,Languageandompilersupportforadaptiveappliations,inPro.2004ACM/IEEEConferene

onSuperomputing(SC'04),Pittsburgh,USA,Nov.2004.

[16℄ N.Dumitrasu,S.Murphy,andL.Murphy,AMethodologyforPreditingthePerformaneofComponent-Based

Appli-ations,inPro.of

8 th

InternationalWorkshoponComponent-OrientedProgramming(WCOP03),Darmstadt,Germany,

July2003.

[17℄ I. FosterandC. Kesselman,eds.,TheGrid: Blueprint fora NewComputingInfrastruture,Morgan KaufmannPub.,

July1998.

[18℄ S.Gilmore,J.Hillston,L.Kloul,andM.Ribaudo,SoftwareperformanemodellingusingPEPAnets,inPro.of

4 th

InternationalWorkshoponSoftwareandPerformane(WOSP04),NewYork,NY,USA,2004,ACMPress,pp.1323.

[19℄ K.Kant,IntrodutiontoComputerSystemPerformaneEvaluation,MGraw-Hill,1992.

[20℄ P.J.B.KingandR.Pooley,DerivationofPetriNetPerformaneModelsfromUMLSpeiationsofCommuniations

Software,inPro. of

11 th

International Confereneon ComputerPerformane Evaluation: ModellingTehniquesand

Tools(TOOLS00),London,UK,2000,Springer-Verlag,pp.262276.

[21℄ H.Kuhen,ASkeletonLibrary,inPro.8thEuro-ParConferene,London,UK,Aug.2002.

[22℄ A.Litke,A.Panagakis,A.D.Doulamis,N.D.Doulamis,T.A.Varvarigou,andE.A.Varvarigos,Anadvaned

arhitetureforaommerialgridinfrastruture.,inEuropeanArossGridsConferene,M.D.Dikaiakos,ed.,vol.3165

ofLetureNotesinComputerSiene,Springer,2004,pp.3241.

[23℄ L. Marhal, Y. Yang, H. Casanova, and Y. Robert, Arealisti network/appliation model for shedulingdivisible

loadsonlarge-saleplatforms,inPro.of

19 th

InternationalParallelandDistributedProessingSymposium(IPDPS05)

(IPDPS'05),April2005.

[24℄ J.Ray, N.Trebon,R. C. Armstrong, S. Shende,and A.D.Malony,PerformaneMeasurementand Modelingof

ComponentAppliationsinaHighPerformaneComputingEnvironment: ACaseStudy,inPro.of

18 th

International

ParallelandDistributedProessingSymposium(IPDPS04),SantaFé,USA,April2004.

[25℄ J.Shopf, Strutural predition models forhigh-performane distributed appliations, inPro. ofthe Cluster Computing

Conferene(CCC'97),Atlanta,USA,Marh1997.

[26℄ L.J.Senger,M.J.Santana,andR.H.C.Santana,UsingRuntimeMeasurementsandHistorialTraesforAquiring

KnowledgeinParallelAppliations,inPro.ofthe2004InternationalConfereneonComputationalSiene(ICCS04),

M.Bubak,G.D.van Albada,P.M. Sloot,andJ.J.Dongarra, eds.,vol.3036ofLeture Notes inComputerSiene,

Kraków,Poland,June2004,SpringerVerlag,pp.661665.

[27℄ M.Sitaraman,G.Kulzyki, J.Krone, W. F.Ogden,andA.L. N.Reddy,Performane speiation ofsoftware

omponents,inPro.ofthe2001SymposiumonSoftwareReusability(SSR01),Toronto,Ontario,Canada,2001,ACM

Press,pp.310.

[28℄ C.U.Smith,PerformaneEngineeringofSoftwareSystems,Addison-Wesley,1990.

[29℄ B. Spitznagel andD. Garlan,Arhiteture-Based PerformaneAnalysis,inPro.of

10 th

International Confereneon

SoftwareEngineeringandKnowledgeEngineering(SEKE98),Y.DengandM.Gerken,eds.,1998,pp.146151.

[30℄ N.Trebon, A. Morris, J.Ray, S. Shende, and A. Malony,Performane Modelingof Component Assemblieswith

TAU,inPro.ofCompFrame2005,Atlanta,USA,June2005.

[31℄ S. Vadhiyar and J. Dongarra, Self Adaptability in Grid Computing, Conurrreny and Computation: Pratie and

(13)

[32℄ A. J. C. van Gemund, Symboli Performane Modeling of Parallel Systems, IEEE Trans. on Parallel and Distributed

Systems,14(2003),pp.154165.

[33℄ M.Vanneshi,Theprogramming modelofASSIST,anenvironmentforparalleland distributed portableappliations, Par.

Comp.,28(2002),pp.17091732.

[34℄ F.Vraalsen, R. A.Aydt, C. L.Mendes,and D.A.Reed,PerformaneContrats: PreditingandMonitoring Grid

Appliation Behavior, in Pro. of

2 nd

International Workshop on Grid Computing (GRID 01), London, UK, 2001,

Springer-Verlag,pp.154165.

[35℄ L. G.WilliamsandC. U.Smith,PASA(SM):An ArhiteturalApproahtoFixingSoftwarePerformaneProblems,in

Pro.of

28 th

InternationalComputerMeasurementGroupConferene,Reno,Nevada,USA,2002,pp.307320.

[36℄ C. M. Woodside, J.E. Neilson, D. C. Petriu, andS. Majumdar,TheStohasti Rendezvous Network Model for

PerformaneofSynhronousClient-Server-likeDistributedSoftware,IEEETrans.onComputer,44(1995),pp.2034.

[37℄ J.Xu,A.Oufimtsev,M.Woodside,andL.Murphy,Performanemodelingandpreditionofenterprisejavabeanswith

layeredqueuingnetworktemplates,inPro.ofthe2005ConfereneonSpeiationandVeriationofComponent-based

Systems(SAVCBS05),NewYork,NY,USA,2005,ACMPress.

[38℄ A.Zhang,Y.Song,andM.Mielke,NetMedia:StreamingMultimediaPresentationsinDistributedEnvironments,IEEE

MultiMedia,9(2002),pp.5673.

Editedby: PasquaD'Ambra,Danieladi Serano,MarioRosarioGuarraino,FranesaPerla

Reeived: June2007