• No results found

Array level Collective Communications

N/A
N/A
Protected

Academic year: 2020

Share "Array level Collective Communications"

Copied!
11
0
0

Loading.... (view fulltext now)

Full text

(1)

) 8 1 0 2 S M S M C ( s c it s it a t S l a c it a m e h t a M d n a n o it a l u m i S , g n il e d o M , l a n o it a t u p m o C n o e c n e r e f n o C l a n o it a n r e t n I 8 1 0 2 8 7 9 : N B S

I -1-60595- 25 -9 6

y

a

r

r

A

-l

e

v

e

l

C

o

ll

e

c

it

v

e

C

o

m

m

u

n

i

c

a

it

o

n

s

g

n

a

u

h

S

Z

H

A

N

G

d

a i

n

Y -

f

e

n

g

C

H

E

N

*

a n i h C , y ti s r e v i n U g n i k e P , e c n e i c S r e t u p m o C d n a g n i r e e n i g n E l a c i r t c e l E f o l o o h c S

*Correspondingauthor

s d r o w y e

K :M , PI Distributeddata, Collecitvecommunicaiton.

t c a r t s b

A . TheMessagePassingInterface(MPI)providessomewell-definedcollectivei nterfacesfor . s e s s e c o r p e l p i t l u m n e e w t e b s e g a s s e m f o t e s

a However ,MPI can be improved .The collective u q e r l li t s s n r e t t a p n o i t a c i n u m m o c e v i t c e l l o c y n a m t a h t , d e z i l a i c e p s o o t s i I P M n i e c a f r e t n

i ire

. ) M D A P ( n o i s n e m i D l e l l a r a P l o o t l e v o n a e b i r c s e d e w , r e p a p s i h t n I . s r e m m a r g o r p y b n o it a it n a t s n i , e r o f e r e h T . s n o i t a c i n u m m o c e v i t c e l l o c f o s s a l c l a r e n e g e r o m a r o f I P M d n e t x e o t d e n i f e d s i M D A P n u m m o c e v i t c e l l o c r e h t o d n a s e c a f r e t n i I P M e h t h t o b , M D A P n

i icaitons are performed wtih the . n o i t u l o s l a r e n e g e m a

s PADM providesarray style abstraction forboth process topology and data ) S A G P ( e c a p S s s e r d d A l a b o l G d e n o i t i t r a P y l h g i h e h t e k i l n U . w e i v l a b o l g e h t m o r f n o i t u b i r t s i d p x e o t r e m m a r g o r p s w o l l a M D A P , l e d o

m lictily contro lthe distribution and communication .And . n o i t a z i m i t p o r o f e p o c s r e t a e r g s w o l l a d n a , s e c n e t n e s n o i t a c i n u m m o c e s i c n o c e r o m s e d i v o r p M D A P n o it c u d o r t n I I P

M is themos tpopular and ubiqutiouschoicefor distributed communication .I tprovideswe ll-r e t t a c s , r e h t a g , t s a c d a o r b s a h c u s , n o it a c i n u m m o c e v it c e ll o c d e z il a i c e p s e m o s r o f s e c a f r e t n i d e n i f e d t n i o p e h t f o d a e t s n i s e c a f r e t n i e v it c e ll o c e h t g n i y l p p a , s r e m m a r g o r p r o F . c t e d n

a - ot -poin tsentences

t c e ll o c e s e h t , r e v e w o H . y ti v i t c u d o r p e h t t s o o b l l i

w iveinterfacesaretoospeciailzedtocovervarious . s n r e tt a

p For example ,the group-wise collective operaiton is undefined in MPI .To apply these it l u m n i s e c a f r e t n

i ple group ,the programmers have to use the MPI communicator to construc t u m m o c d e n o it it r a

p nication contex tand process groups .However ,communicator creaiton is an n o i t a r e p

o that exchanges informaiton between processes wtih expensive synchronization cost . e tt a p n o i t a c i n u m m o c s i h T . n o it a c i n u m m o c r o b h g i e n r a l u c r i c e h t s i e l p m a x e d e n i f e d n u r e h t o n

A rn is

h c a E . g n i s s e c o r p e g a m i d n a n o it a l u m i s c i f i t n e i c s s a h c u s , s m e l b o r p l i c n e t s n i d e s u y l n o m m o c . s r o b h g i e n s t i h t i w a t a d s e g n a h c x e s s e c o r

p The programmers have to use the point- ot -poin t , d n e S _ I P M d n a v c e r I _ I P M e k il , s e c n e t n e

s ort heMPIt opologiessuchasMPI_Cart_shifti nterfaceto s i h t m r o f r e

p kindofcommunication . s h

t r u

F ermore ,i tispossibleto design specificopitmizaiton foreach collecitveinterfaceaccording e b e v a h , H C I P M d n a I P M n e p O s a h c u s , I P M r o f s e i r a r b il d e t n e m e l p m i e h T . n o it i n i f e d s ti o

t en

e v it c e ll o c f o n o i t a z i m it p o e h t n o d e s u c o f s e i d u t s y n a m e r a e r e h T . s e c a f r e t n i l a r e v e s n i d e z i m it p o g n i d n e p e d s m h t i r o g l a e l p it l u m e s u y e h t , e c a f r e t n i h c a e r o F . e c n a m r o f r e p e h t e v o r p m i o t s n o i t a r e p o s a h c u s , e z i s e g a s s e m e h t n

o [1] ,[2 .T] heoptimization only appilesto thesespecialized patternsin . s m h ti r o g l a e s e h t r o f e l b a ti u s e r a s n r e t t a p s u o i r a v y n a m , y ll a u t c A . I P M k r o w r u o e b i r c s e d e w , r e p a p s i h t n

I PADM on providing asinglegenera lsolution forcollecitve e t n i l a r e n e g s i h T . s n o it a r e p o n o i t a c i n u m m o

c rfaceisbasedon anarray-levelalgebraicrepresentaiton

n o is n e m i

d .The construction of dimension follows severa lsimple rules which is easy to learn . n o i s n e m i

D containsboththeprocesstopologyanddatadistribution .Therefore ,eachcommunicaiton m

i s s

i plifiedasapairofdimensionsandacopytosentenceinPADM .Forprogrammers ,ourworkis . y t i v it c u d o r p e h t t s o o b o t e l b

a Inaddiiton ,thedimension i a s binomina ltreestructure .I tispossible toanalyzethestructureand identify thecommunication pattern .Then wecan design theopitmized

. n r e tt a p h c a e r o f m h ti r o g l a a f o l e d o m s i s a b e h t s i S A G P . S A G P e h t n o d e s u c o f s n o it a c i n u m m o c g n i y f i l p m i s n o k r o w y l r a E o C , C l e l l a r a P d e i f i n U s a h c u s , s e g a u g n a l f o r e b m u n e g r a

(2)

. s e l b a i r a v d e r a h s g n i d i v o r p y b y t i v it c u d o r p r e m m a r g o r p t s o o b s e g a u g n a l e s e h T . s s e c o r p h c a e o t l a c o l

a u g n a l e h t f o t s o m n i t i c i l p m i s i n o it u b i r t s i d e h t , r e v e w o

H ges .The runtime performs one-sided . s s e c o r p h c a e n o t n e m e r i u q e r a t a d e h t o t g n i d r o c c a y l l a c i t a m o t u a s e c n e t n e s n o i t a c i n u m m o c

d e r a p m o c , s u h T . y ti l a c o l a t a d e h t f o e r a w a e b d l u o h s s r e m m a r g o r p , e c n a m r o f r e p e h t g n i r e d i s n o C

ti c il p x e s e d i v o r p k r o w r u o , S A G P h t i

w distributionandcommunication. n

o i t c e S : s w o ll o f s a d e z i n a g r o s i r e p a p e h t f o t s e r e h

T I I introduces the construciton rules for l

a r e n e g e h t d n a n o i s n e m i

d copytosentence ,thenSeciton IIIcomparestheMPIandPADMcodesin a

h t w o h s o t s e s a c l l a m s l a r e v e

s tourwork providesmore concise program ,and investigatesinto a .

n o i t a c il p i tl u m x i r t a m n o y d u t s e s a

c Seciton I V introduces the implementation and opitmization . n

o it c e

S V illustratesanddiscussest hecos tmode landperformance .SecitonVIconcludesourwo rk.

e ll a r a

P l Dimen ison

. n o it a c i n u m m o c e v it c e ll o c s u o i r a v r o f e c a f r e t n i l a r e n e g a e d i v o r p o t I P M f o n o i s n e t x e n a s i M D A P

n o it p e c n o c r u o n o d e s a b s i e c a f r e t n i s i h

T dimen ison .In thisseciton ,weintroducetheconstruction s

e l u

r inPADMt oprovideani ntutiiveunderstanding . s

e l u R n o it c u r t s n o C

s w o ll o f n o i t c u r t s n o c e h

T severa lsimple rules ,as shown in Equation (1) and (2). These rules a re .

n o i t u b i r t s i d n o m m o c e h t e b i r c s e d o t h g u o n e e l b i x e l f

D= gt (a size ,step, disp) ( 1)

S= D0∗...∗Dn| array(D0∗...∗Dn)| array (Sr fe, Sb eas ) (2)

, t s r i

F in Equation (1) ,each basic dimension( D)hasonetag and threefundamenta lcomponents :

e zi

s , tsep and dsip. There are three keywords for tag ,dim(dummy) ,proc(process )and data tha t t

n e r e f f i d t n e s e r p e

r elements .Dummy is a void type withou tspecific data or process informaiton . f

o r e b m u n e h t s i e z i

S elementswith defaul tvalue 0 .Step is thedistance between two elementsin e

s quence ,which is also called stride in PGAS [14] .Its defaul tvalue is 1 ,which means the d

n A . n o it a u t i s s u o u n it n o

c disp meansthe displacemen twtih defaul tvalue0 .Elements’ offse twli l e

v o

m a tacertaindistanceaccordingtothis. Foreachindexindimension ,i∈(0, size−1) ,theoffse tis :

s a d e t a l u c l a c

t e s f f

o (i) = (i%size+disp)∗step. (3) t

e c n i

S he defaul tvalues mentioned before can beomitted .Le tus look a ta simple construciton )

2 ( n o i t a u q E n i e l u r t s r i f e h t o t g n i d r o c c

a :

) 4 ( c o r p = A m i

d *data(4 ,16 )*data(16 .)

‘ r o t a r e p o n o i t a c i l p i tl u m e h

T ∗’isusedto construc tmutli-dimensions .ThisdimensionAdescribes n

o a t a d d e t u b i r t s i

d 4processes ,andeachprocesscontains64 conitnuousdataelements .I thasthree o

w t a o s l a s i t i t u b , s n o i s n e m i

d -dimensiona larraywtih16*16elements ,oraone-dimensiona larray h

t i

w 256 elements .Eachdimensionisconstructedasabinomia ltreeaccordingtothemutlipilcation . 1 e r u g i F n i n w o h s s a s e e r t t n e r e f f i d e c u d o r p l li w r e d r o e h t e g n a h c o t s t e k c a r b e h t g n i s U . r e d r

o

r e d r o n o i t a c i l p i t l u m t n e r e f f i d h t i w s e e r t n o i s n e m i D . 1 e r u g i

(3)

t n o c a e b i r c s e d o

T inuoussequence ,wehavetose tthecorrec tstepvalueforeach dimension ,like

) 6 1 , 4 ( a t a

d inA. Orwecanuset hesecondruleinEquaiton(2)tofillt hestepvaluesautomaitcallyfor :

w o l l o f s a n o i t a u t i s s u o u n it n o c

) 4 ( c o r p ( y a r r a = A m i

d *data(4 )*data(16).)

e h

T l ast rule in Equaiton (2) is called reference mechansim .I tisused to describe more flexible e

h t , s n o i s n e m i d o w t s n i a t n o c e c n e r e f e r A . s n o it u b i r t s i

d re fdimension Sr fe and the base o Sne base ,

d e t c u r t s n o c s i h c i h

w asfollowexample:

.) A , ) 8 ( m i d * ) 8 ( m i d ( y a r r a = R m i d

, e s a c s i h t n

I Aisthebaseone ,and dim(8)*dim(8) isthenewdescriptionbased on A .Weusethe

m i

d type herewtihou tspecificdataor process information .SinceA isconfigured asa16*16 two -n

e h T . x i r t a m l a n o i s n e m i

d R describesan 8*8 s -ub matrix in thetoplef tcornerofA .Theoffsetsare e

n n i s e t a l u c l a

c stedway,R.offset(i)=base.offset(ref.offset(i)) .

n o it c n u f e d i v o r p e W . t e s f f o a t a d l a c o l d n a r e b m u n s s e c o r p : t e s f f o f o s d n i k o w t e r a e r e h t , y ll a u t c A

e d o

n and offse t to calculate them respecitvely , ilke A.nodei( ) and A.ofsfe(t)i. Moreover , for y

ti c i l p m i

s ,we u ‘se +’ operator to se tthe d sip value .For example, the rank 1 process can be h

c i h w , 1 + ) 1 ( c o r p s a d e b i r c s e

d isbettert hanproc(1,1,1). o

t y p o C

s i e c a f r e t n i n o it a c i n u m m o c l a r e n e g e h

T definedas:

S m i d ( o t y p o

c s ,void*s ,dimSt ,void*t,i n telem_szie;)

S e r e h

w s ist hesourcedistribution ,andStist het arge toneoft hesamesize .Thevoidpointersaret he

d a e r o n s i e r e h t f I . s e s s e r d d a a t a d t e g r a t d n a e c r u o

s -wrtieconflic,tt heprogrammerscanuset hesame e

t e m a r a p t s a l e h T . s r e t n i o p h t o b r o f s s e r d d

a rrequirest hesizeofdataelement ,suchasszieo(flfoa)t . w

o R d n a n m u l o C

it l u m y r e v e e c n i

S -dimension is constructed as a binomia ltree ,i tis also a two-dimensiona larray . t

o o r e h t f o n e r d li h c o w t e h t d n

A /parentnoderepresen tthecolumnand row dimension ,respecitvely . s

n o i t c n u f e s u n a c e

W co(lA )and row(A )to ge tthem from dimension A .The defaul torderisrow -n

o i t c n u f e h t g n i s u , r o j a

m col_major(A )wli lchange tii ntocolumn-major . s

n o it c n u F m r o f s n a r T

, r e d r o t n e r e f f i d e b i r c s e d o t e l b a s i e c n e r e f e r e h

T er -division ,portion and repeititon of thebasic l

l e w s e d i v o r p M D A P . e n

o -defined transform funcitons to simplify the reference declaration ,and n

i n w o h s e r a s n o i t c n u f d e s u y lt n e u q e r f e m o

s Table1 .Thefirs tfourfuncitonswli ltrea ttheinpu tD m

i d e n o s

a ension ,whlietheothersrequiretha tDistwo-dimensional .Thefunctionmutl iwli lreturn acollapsed dimension ,whichmeanstha tthereferencesizeislargerthanthebasesize .Weprovide

n o i t c n u

f d isze and p isze to ge t the actua l process and loca l data size . For example ,

)( e z is p .) 4 ,) 1 ( c o r p (i tl u

m returns1 .

. 1 e l b a

T TransformFunction. n

o it c n u

F Usage

) k t n i , D m i d ( w o l _ t e g m i

d ge tkunitsfromDt ha tneart heorigin )

k t n i , D m i d ( h g i h _ t e g m i

d ge tkunitsfromDt ha tfarfromt heorigin ,

D m i d ( i t l u m m i

d in tk) repea tDkt imes )

k t n i , D m i d ( c i l c y c m i

d cyclicDwithkunits )

k t n i , D m i d ( w o R c i l c y c m i

d cyclicrowdimensionofD )

k t n i , D m i d ( l o C c i l c y c m i

d cycliccolumndimensionofD )

k t n i , D m i d ( r o j a m _ k c o l b m i

(4)

Con isceCode

I P M n i e ti r w o t d r a h s i t a h t I n o it c e S n i s e s a c e m o s d e s s u c s i d e v a h e

W .In thisseciton ,we take a e

s a c e l p m i

s asexampletocomparet heMPIandPADMcodes.Thecommoncodei somittedi nboth h

t t a h t r a e l c s i t I . e c n e r e f f i d e h t w o h s o t s n o i s r e

v e PADM program is concise and easy to .

g n i d n a t s r e d n

u Thenwewil ldiscussthedistribuiton-independen talgortihmandacasestudy ,matrix .

n o it a c il p it l u m

n o it a c i n u m m o C r o b h g i e N r a l u c r i C

s s e c o r p h c a E . n o i t a c i n u m m o c r o b h g i e n r a l u c r i c e h t s i e s a c t s r i f e h

T sends100 floa telementsto tis e

r a e r e h T . e n o t f e l e h t m o r f t n u o c e m a s e h t s e v i e c e r d n a , r o b h g i e n t h g i

r pnumprocessesi nt otal .The n

i n w o h s s i n o i s r e v I P

M Figure 2(a) .Programmers should consider many detalis ,such as the c

o l b , r e b m u n s s e c o r p n o it a n i t s e

d king ornon-blocking interfaces ,avoiding deadlock (n -on blocking e

h t g n i r u s n e d n a ) t s r i

f completion (MPI_Wati) .Even using the MPI topology interface wli lno t e

d o c f o t n u o m a e h t e c u d e

r .Whliein PADMversion ,asshownin Figure2(b) ,thereisonlyonetask r

o f s n o i s n e m i d t e g r a t d n a e c r u o s e h t n g i s e d o t s i t a h

t copytofunciton .Weusethet ransformfunciton

c il c y

c heretodescribet hedesitnationprocesses .

n o i t a c i n u m m o c r o b h g i e n M D A P s v I P M . 2 e r u g i

F .

n o it u b i r t si

D -independent d

n o c e s e h

T caseisa ilbraryfuncitonofmatrixtranspose .Whendesigninga ilbraryfunctioninMPI , w o r r o k c o l b s a h c u s , x i r t a m t u p n i f o n o it u b i r t s i d l a n i g i r o e h t f o e r a w a e b t s u m s r e m m a r g o r p e h t

m t n e r e f f i d n i tl u s e r s n o it u b i r t s i d l a n i g i r o t n e r e f f i d e h T . n o it u b i r t s i d a t a d d e r e tt a c

s essagepassingpat-

a n g i s e d o t e l b a s i t i , M D A P n i , r e v e w o H . s n r e

t di ts irbuiton- independen talgortihm .I tmeansthe r

b u s n i y r a s s e c e n n u s i t u o y a l a t a d c i f i c e p

s ouitne .Ourfunciton isshown in Figure3 .I trequirestha t A

n i s n o i s n e m i d w o r d n a n m u l o c e h

t represen tthecolumnand rowofthematrix ,respectively .We b

i r c s e d o t s n o i s n e m i d w o r d n a n m u l o c e h t e g n a h c x

e e the targe tdistribution B .Also we can use

_ l o

c major(A )toge ttiatlernaitvely .

e r u g i

F 3. Distribution-independentt ransposei nPADM.

t S e s a

C udy :MatrixMul itpilca iton

A = C t a h t n o it a c il p it l u m x i r t a m f o m h ti r o g l a D 2 l a n o it i d a r t e h t n

(5)

u l o c d n a s w o r e l p i tl u m t e g t s u

m mnsfromAandB .IneachmatrixdistributionofAandB,i ft hesize P

s i e z i s n o i t a c i n u m m o c e h t , P s i d i r g s s e c o r p f

o 1/2 itmesoft hematrixsize .

e e r h t

A -dimensiona lapproach [4]–[6] to paralle lmatrix mutlipilcation has a factor P1/6 less

o i t a c i n u m m o

c nthanthe2Dalgortihm .Inthisalgortihm ,theprocessesareconfigured asa3Dcube . s

s e c o r p h c a

E g etsa singlesub-matrix from A and B ,then performsa loca lmatrix mulitplicaiton . b

u s l a r e v e s , n o it a t u p m o c e h t r e t f

A -matrices of this produc t mus t be sent to their desitnation e

h T . C x i r t a m g n it l u s e r e h t h ti w r e h t e g o t d e m m u s n e h t d n a s e s s e c o r

p computation is described in g

i

F u 4re .Therearet hreecommunicationsi nt hisalgorithm .Thefirs tonedistributest hesub-matrices o

d n o c e s e h T . s e s s e c o r p l l a o t A f

o nedistributest hoseofBwtihdifferen tdesitnations .Andt het hird .

s s e c o r p h c a e n o t l u s e r h c a e r o f s t c u d o r p e h t l l a s r e h t a g e n

o Thecommunicaitons are complicated . n

i , e r o m r e h t r u

F MPI,t heprogrammersmus tbeawareoft heorigina ldistribu itonsofdata.

. 4 e r u g i

F Cubealgorithmformatrixmultiplication. e

h t e ti r w o t r e m m a r g o r p r o f e l b i s s o p s i t i , M D A P n

I di ts irbuiton-independen talgorithm from an e

v it i u t n

i way .In ourimplementaiton ,asshown inFigure5 ,theinpu tdimensionsareconsidered as o

w

t -dimensional .Each processperformsloca lsub-matricesmulitpilcaiton of size len*len .And the m

s i C d n a B , A f o r e b m u n s k c o l

b *n ,n*kandm*k ,respecitvely .Therefore ,weneedaprocesscube m

e z i s f

o *n*k .Thedeclarationofeachcubedimensioni sseparate ,asshowninLine7-9 .Beforet he b

u s e h t , n o it a t u p m o c l a c o

l -matricesofAarescatteredamongt hePm*Pnprocesses ,andarerepeated e

h t g n o l

a Pkdimension .Similarly ,thematrix BisscatteredamongthePn*Pkgrid ,and isrepeated e

h t g n o l

a Pmdirection .

M D A P n i m h t i r o g l a e b u C . 5 e r u g i

(6)

b u s l a c o l e h

T -matrix is described as Blk in Line 11 ,which is continuous in memory .Bu tthe e h t g n i r e d i s n o C . s e s s e c o r p e l p it l u m n o r o s u o u n i t n o c y l e r it n e t o n e b y a m a t a d t u p n i g n i d n o p s e r r o c

i d e h t f o n g i s e d e h t , e c n a m r o f r e

p mensionsmus tensuretha tthecommunication conitnuouslength is n

o i t p i r c s e d a t a d l a c o l e h t , e r o f e r e h T . h g u o n e g n o

l Blkisatt herightmos tsideoft het arge tdimension . e

h t e s u e w d n A . h t g n e l s u o u n it n o c t s e g n o l e h t s a h n o i s n e m i d t e g r a t e h t , t s a e l t

A block_majo r

w o r l a n i g i r o e h t t s u j d a o t n o i t c n u

f -major inpu tdimensions into block-major order .By using the

it l u

m -funciton ,theadjustedoffsetsarerepeatedsevera l itmes .FromLine14-20 ,matrix Aand Bare .

s e s s e c o r p e b u c e h t n o d e t a e p e r d n a d e t u b i r t s i d

t f

A ertheloca lcomputaiton gemm ,theproductsalongthePndireciton mus tbesummedtogether .

e h t e s u e

W CP dimension to enlarge the dataoffsetsscope ,and allocate memory to receivethese .

l a c o l n o s t c u d o r

p ThePn dimension correspondsto theCP dimension. And theproduc tblockson k

P * m

P correspondt ot heblock-majorofC ,asshowni nLine24-26 .Therefore,t heproductsaresen t . C x i r t a m g n it l u s e r e h t h ti w r e h t e g o t d e m m u s n e h t d n a s e s s e c o r p n o i t a n it s e d e v it c e p s e r r i e h t o t

e d o c r u o n

I ,only the malloc and rfee sentences are omitted .The code is concise and easy to n

I . n o it u b i r t s i d l a n i g i r o t n e r e f f i d s t p e c c a m h t i r o g l a s i h t , y ll a i c e p s E . d n a t s r e d n

u Figure6 ,wedescribe

a C . n o it u b i r t s i d a t a d d e r e t t a c s w o r s i e n o e s a C . A x i r t a m r o f s n o it u b i r t s i d t n e r e f f i d e e r h

t setwo isa

n o n r e t t a p d e r e t t a c s k c o l

b r* rprocessgrid ,andeachblockisasquaresub-matrix .Casethreeisalso n

o i t u b i r t s i d k c o l b

a wtih differen tgrid pattern ,whereblock isarectanglepar tofthematrix .These n

r e tt a

p s arealsoavailableformatrixBandC .Differen tdistribuitonsresul tin differen tperformance e z i s f o n o i t a c il p i tl u m x i r t a m e h t f I . h t g n e l s u o u n it n o c n o it a c i n u m m o c t n e r e f f i d e h t f o e s u a c e b

n i n w o h s s i e c n a m r o f r e p e h t , s e s s e c o r p 4 6 n o l e ll a r a p s i 2 9 1 8 * 2 9 1

8 Table2 .Thelas tcolumnc ntis t s e g r a l e h t f o e s u a c e b t s e b s m r o f r e p e s a c d r i h t e h T . n o it a c i n u m m o c n i a t a d s u o u n it n o c f o r e b m u n e h t

t n

c value .I tis possible to design a distribution-independen talgorithm in PADM by using the l

a r e n e g d n a m s i n a h c e m e c n e r e f e

r copyto function .Therefore ,tuning theperformancewtihdifferen t .

y s a e s i s t u p n i

. 6 e r u g i

F Variousi npu tdistributionforcubealgorithm.

n o i t a c i l p i t l u m x i r t a m n i s n o i t u b i r t s i d t n e r e f f i d h t i w e c n a m r o f r e P . 2 e l b a

T .

) s ( e m i

T CopyA CopyB CopyC Total c nt w

o r : 1 e s a

c 0.223 0.254 0.233 0.710 211

k c o l b : 2 e s a

c 0.282 0.338 0.320 0.940 210

e l g n a t c e r : 3 e s a

c 0.180 0.179 0.157 0.516 220

t n e m e l p m

I a itonandOp itmiza iton M

D A

P is implemented by using MPI ,OpenMP and C++ .We defined i tas a ilbrary tool tha t a

s i t I . I P M h ti w e l b it a p m o

c ilghtlyextension tha tallowsprogrammerstouseanabstrac tandlogic n

o i t p i r c s e

d ofdatawhich isseparatesfrom theactually physica lmemory. I tiseasyto use ,PADM e

m o s d d a o t d e e n y l n o r e m m a r g o r p e h T . d e d u l c n i e b o t e li f d a e h a s e d i v o r

p options such as ‘

-g n il i p m o c n e h w ’ p m n e p o

f .

e h

T genera lcopytofuncitonrequiressourcedimensionandthesamesizetarge tone .Byscanning e

p o c s e z i s e h t n i s e x e d n i e h t l l

a onlyonce itme ,wecancalculatetheelemen tlocationsa tbothends , t

n e m e l e s i h t y p o c n e h t d n

a fromsourcetotarge tbyusingpoint- ot -poin tmessagesentences .Thisis .

(7)

it l u M d n a g n i p p a l r e v

O -thread s

i p p a l r e v o s i d o h t e m n o i t a z i m it p o n o m m o c

A ng thecompuitngand communication .In oursoluiton , h

t o t s r e f e r g n i t u p m o c e h

t elocaiton calculaiton foreachindex ,calledscan .WechooseOpenMP to u

m h c n u a

l litplethreadson each process ,bu tonlyonethread isallowed to usethemessagepassing e

h t , r e v e w o H . s e c n e t n e

s scan work is parttiioned among severa lthreads .Each computing thread e h t n o s i n o i t a c o l t e g r a t r o e c r u o s e h t f I . s n o it a c o l e h t s e t a l u c l a c d n a n o i s n e m i d e h t f o n o i t r o p a s n a c s

e v i e c e r r o d n e s a h s u p e w n e h t , s s e c o r p t n e r r u

c request into globa lqueue .Once thequeue isno t n

o it a c i n u m m o c e h t , y t p m

e thread wli lpop up each reques tfrom the queue and turn i tinto a _

I P M e s u e w n e h T . e c n e t n e s v c e r I _ I P M r o d n e s I _ I P

M Tes tto ensure tha tthese requests are n

o it c n u f a e d i v o r p e w , r e v o e r o M . d e t e l p m o

c setTnumtosett henumberofscant hreads . M

D A

P -ledOp itmiza iton i

d M D A P e h

T mension is an analyzable binomia ltree .We can ge tinformaiton from the structure n

i g n i n n a c s t u o h t i

w dexes .Now we are going to discuss some opitmizations tha tbased on the .

s i s y l a n a

h t g n e L s u o u n it n o

C .Eachdimensionrepresentsasequence ofdataoffsets. Theseoffsetsmaybe n

i s u o u n i t n o

c address ,such as proc(4)*data(100) . In other cases , they may no t b e entirely e

k i l , s u o u n i t n o

c data(10,20)*data(10) ,tha tevery 10 elements is continuous in memory . The f

o r e b m u

n regularcontinuouselementson each processisconfiguredasthecontinuouslength .We e

s u n a

c A.cntLen()togeti .t

t s e t a e r g e h T . h t g n e l s u o u n it n o c t n e r e f f i d e v a h y a m n o it a c i n u m m o c e n o n i s n o i s n e m i d o w t e h T

h t g n e l s u o u n it n o c e h t s i s e u l a v o w t e s e h t f o r o s i v i d n o m m o

c cn tof this communication .I tmeans y

r e v e t a h

t cn tindexes refer to the same source and target process number ,and their offsets are s i t n e m g e s s u o u n it n o c h c a e n i x e d n i t s r i f e h t y l n o , e r o f e r e h T . s d n e h t o b t a y r o m e m n i s u o u n it n o c

c n i s i h t g n e l e g a s s e m e h t , e l i h w n a e M . d e n n a c s e b o t d e e

n reased to cnt .This soluiton called

m m o C n a c

s .

y ti r a l u n a r

G .To optimizethecommunication whencntissmall ,weintroduceanotherargumen t

n a r

g ht a tmeans the granulartiy .The message length is se tto this granularity rather than cnt . f

o e u l a v e h t , y l l a u s

U gran is severa ltimesthesize of cnt .Therefore ,each process packs severa l n o i t a n i t s e d e h T . t i d n e s d n a k c o l b r e f f u b e n o o t n i n o i t a n i t s e d e m a s e h t h t i w s t n e m g e s s u o u n i t n o c

s i e g a s s e m e v i e c e r a e c n O . e g a s s e m e h t e v i e c e r o t r e f f u b r e h t o n a s e d i v o r p s s e c o r

p completed ,the

.t e s f f o t h g i r e h t o t t n e m g e s s u o u n i t n o c h c a e e v o m d n a r e f f u b e h t k c a p n u l l i w s s e c o r p n o i t a n i t s e

d In

m r o f r e p o t s d a e r h t f o e z i s e m a s e s u e w , n o it a u ti s s i h

t packing and unpacking seperately. In the e

r o h p a m e s , d a e r h t n o i t a c i n u m m o

c isusedtonoticet heunpackingt hreadt ha tabufferblocki sready .

d e k c a p n u e b o

t ThissolutioncalledbufComm. a

n g i s e d e w , n o it i d d a n

I buffe rpoo ltorecyclet hebufferblocks .Thepooli sparttiionedi ntomany e

z i s f o s k c o l

b gran .Andt heseblockaddressesaremanagedbyagloba lqueue .Whenanewblocki s a e t a c o l l a l l i w e w , y t p m e s i e u e u q e h t f I . e u e u q e h t m o r f p u d e p p o p s i s s e r d d a r e f f u b a , d e r i u q e r

d n a l o o p e h t h t o b , n o i t a c i n u m m o c e h t r e t f A . e s u r e t f a d e l c y c e r o s l a s i h c i h w k c o l b y r a r o p m e t

e r a s k c o l b y r a r o p m e

t released . l li w n o it c n u f o t y p o c e h

T analyzetheinpu tdimensionsandse targumentsautomatically including

t n

c ,gran ,the size of buffer poo land etc .Also we provide funcitons for programmers to contro l s

a h c u s , m e h t f o e m o

s setGranByte sandsetBufBytes . y

ti l a c o

L .Byanalyzingthesourceand targe tdimensions ,wecan discussthelocality .Ifthedata h t o b t a s r e b m u n s s e c o r p e h t d n a , t e g r a t d n a e c r u o s n e e w t e b r e h t o h c a e o t d n o p s e r r o c s n o i s n e m i d

t n e m e v o m a t a d e h t t a h t s n a e m t i , e c n e u q e s e m a s e h t n i e r a s d n

e onlyoccursonl ocal ,suchas :

) 4 ( c o r p = S m i

d *data(10 ,10 )*data(10 ;)

) 4 ( c o r p = R m i

(8)

y b d e m r o f r e p s i t n e m e v o m e h t , n o it a u ti s s i h t n

I memcpyinsteadofmessagepassing .Thisbranch d

e ll a c s i n o i t u l o

s localCopy. The copyto funciton chooses the proper optimized branch soluiton s

n o i s n e m i d e h t o t g n i d r o c c

a ’i nformation .

n o it a r e p O e v it c e ll o C r o f n o it a z i m it p O d n a n o it a c if it n e d I

x e n a s a n o i t a r e p o t s a c d a o r b e k a t e

W ampleto discussthealgorithm leve lopitmizaiton .Thereisan t

p

o imizaiton forlong messagebroadcast .The message isfirs tdivided up and scattered among the e W . r e h t a g ll a n a o t r a li m i s , s e s s e c o r p l l a o t k c a b d e t c e ll o c n e h t e r a a t a d d e r e t t a c s e h T . s e s s e c o r p

e h t d e r u s a e

m u -n optimized MPI Bcas tof OpenMPI and this new algortihm on 32 processes .The w

e

n algorithm reduces up to 74.5% itme when the message length is 1GB . This exisitng d

e i l p p a e b n a c n o it a z i m it p

o forbothstandardandmorevariousscenariosinPADM .Forthesemore e

m i d t e g r a t d n a e c r u o s e h t , t s a c d a o r b l a r e n e

g nsion have some common features. I tis possible to e

h t e z y l a n

a dimensions’featurestoi dentifyt hecommunicaitonpattern . n

o it a c if it n e d n

I .Wechoosethreecasesofbroadcas tasshownin Lis t7 .Thefirs tcasedescribes t

o n h t i w s e s s e c o r p l l a n o t s a c d a o r

b enitrely continuous message ,and the data layou tis changed d n A . s e s s e c o r p r e b m u n n e v e o t a t a d t o o r e h t t s a c d a o r b y l n o e n o d n o c e s e h T . t s a c d a o r b e h t g n i r u d

p u o r g a s e b i r c s e d e n o d r i h t e h

t -wisebroadcast .Thefeaturesofbroadcas tdimensionsareconcluded t

s i B d n a n o i s n e m i d e c r u o s e h t s i A e r e h w , s w o l l o f s

a het arge tone: )

a A.psize( )<B.psize)( ; )

b A.dsize( )=B.dsize)( ; )

c onlyAcontain sacollapseddimension;� )

d A’ sdataandproces sdimensioncorrespondst ot ha to fBr especitvely;� )

e botht hedataoffset sarei nas uccessives egmen twtihou toverlaporj ump.�

n o it c n u f e h t , e r o f e b d e n o i t n e m e w s

A pszieand d iszereturn theactua lsizeofprocessand loca l c

n i S . a t a

d e thedata on roo tprocess is sen tto mutliple desitnaitons ,the actua lprocess size of the e h t s i s d n e h t o b t a a t a d l a u t c a e h t d n A . e n o t e g r a t e h t f o t a h t n a h t s s e l y l e ti n i f e d s i n o i s n e m i d e c r u o s

. e z i s e m a

s Therepettiion of roo tmessageresutls in collapsed procordata dimension. Ifthere isa w

, n o i s n e m i d a t a d d e s p a ll o

c e wil lseparate the repeititon descripiton from data and turn i tinto a n

o i s n e m i d s s e c o r p d e s p a ll o

c fori denitficaiton .

t a h t s n a e m t i , s s e c o r p s ’ e n o r e h t o n a o t s d n o p s e r r o c a t a d s ’ e n o f

I data on one process wil lbe n

r e t t a p t s a c d a o r b a t o n y l e ti n i f e d s i h c i h w , s e s s e c o r p e l p it l u m o t d e t u b i r t s i

d ,this is why we have

) d ( e r u t a e

f .Theoffsetsmayno tbeenitrely conitnuous ,bu ttheymus tbein thesuccessivememory j

d n a p a l r e v o t u o h ti w t n e m g e

s ump .

m h ti r o g l A n o it a z i m it p

O . Ift hepatterni si denitfiedasabroadcast ,weseparatedataandprocess o

b r o f s n o it p i r c s e

d th the source and targe tdimensions .Therefore ,we ge tfour new dimensions p

t e g r a t , a t a d e c r u o s , s s e c o r p e c r u o s t n e s e r p e

r rocessandtarge tdata ,respectively .Thesedimensions d

e s u e r

a to describe thescatter-allgather communication . fI themessage is long enough ,the new m

h t i r o g l

a solutionisperformed .Otherwise,i tperformst hegenera lcopyto . d

r a d n a t s e h t l l a , e r o f e r e h

T ,group- swi e ,no tenitrely continuous data ,and many other various p

o e b n a c t s a c d a o r

b itmized wtih thisnewalgortihmin PADM .Theperformanceisdiscussedinthe .

n o it c e s t x e n

d n a l e d o m t s o

C Performance

l a r e v e s m o r f e c n a m r o f r e p s s u c s i d e w , n o it c e s s i h t n

I cases .Theperformanceismeasured on a 32 -e

l o M e h t f o t e s t s e t s e d o

n -8.5Supercompuitng[3] .Andt heMPIl ibraryi st heOpenMPI2.1 version . t

s o

C Model

s s e c o r p h c a e r o f n e k a t e m i t e h t t a h t e m u s s a e W . e c n a m r o f r e p e h t e t a m i t s e o t l e d o m e l p m i s a e s u e W

t e t e l p m o c o

t he communication can be modeled as mα+ n2 γ+Nβ ,where m is the number of ,

s e g a s s e

(9)

γist hepackingorunpackingtimeperbyte ,Ni st hetota lnumberofbytest ransferred, N=m* ,n andβ

. e t y b r e p e m it r e f s n a r t e h t s

i Weassumefurthertha tthecommunicaiton timealwaysoverlappedthe e

r u g i F e k il , e m it g n i k c a p n u d n a g n i k c a

p 7 )( . a

n o i t p m u s s a s ’ l e d o m t s o C . 7 e r u g i

F andperformancet est. f

i c e p s r o f y ti r a l u n a r g g n i n u t n e h

W ic case ,the value of Nβ is constant ,and m is inversely .

n o t l a n o i t r o p o r

p Ifcntisl argeenought ha ttherei snopackingorunpacking,t hecos tmodelt urnst o

mα+Nβ ,whichmeanst hatl essmessagesi sbetter. o

t y p o C l a r e n e G

t n o c y l e r it n e t o n a e s o o h c e

W inuouscircularneighborcommunicaitoncaseasfollows:

;) m u n p ( c o r p = P m i

d

P = A m i

d *data(2 ,1024 )*data(n ,2048 )*data(1024 ;)

) 1 , P ( c il c y c = B m i

d *data(n*2048 .) ;) )t a o lf (f o e zi s ,r , B , s , A ( o t y p o c

n o it u l o s d e z i m it p o e h t d e c u d o r t n i e v a h e

W scanCommand bufCommforcopytofunciton .In this ,

e s a

c cntis4KB .Wesett hegranulartiyas1/64oftheloca ldata .Ifn=128 ,eachprocessholds1MB n

e h t , a t a

d g nra is 16KB .The performance is shown in Figure 7 )(b .The ‘cnt8’ represents a 8-n it

a c i n u m m o c s s e c o r

p on with cn tmessage length .Since the 4KB conitnuous length is no tlong e

h t , h g u o n

e bufCommsoluitoni ncreasest hemessagel engthandperformsbetter . n

i o p r e h t o n a n i s e i ti r a l u n a r g t n e r e f f i d s s u c s i d e w , e r o m r e h t r u

F t- ot -poin tcommunicaiton casethat 0

k n a

r processsends256MBdatat ot herank1process ,andt heconitnuousl engthi salso4KB .

( y a r r a = D m i

d (data(256)*data(256 ))*data(1024));

;) D ( w o r * )) D (l o c ( r o j a m _ l o c * ) 1 ( c o r p = A m i d

D * ) 1 + ) 1 ( c o r p ( = B m i d

;) )t a o lf (f o e zi s ,r , B , s , A ( o t y p o c

s i e c n a m r o f r e p e h

(10)

, e g r a l o o t s i y t i r a l u n a r g e h t f

I the2nγcos talot .Ifthegranularityistoosmall ,wehavetomanagea e

h t s e s a e r c n i h c i h w s e g a s s e m t r o h s f o t o

l startupcostmα. s

n o it a r e p O e v it c e ll o C d r a d n a t S

e h T . e c n a m r o f r e p e h t e r u s a e m o t n o i t a c i n u m m o c e v it c e l l o c c i s a b e h t s a e c n a t s n i l l a o tl l a e h t e k a t e W

t o n s e o d I P M n e p O n i l l a o t ll A I P M r o f m h ti r o g l

a attemp ttoschedulecommunication .Instead ,each n a y b d e w o ll o f , p o o l a n i d n e s I I P M e h t l l a n e h t , p o o l a n i v c e r I I P M e h t l l a s t s o p s s e c o r p

e h t o t r a l i m i s s i s i h T . l l a t i a W _ I P

M scanCommsoluiton . g

i

F u 7 )re ( d showstheperformanceforMPIAlltoal lversusPADMcopytowtihdifferen tthreads . e

h t s a e m a s e h t s i y ti r a l u n a r g e h T . a t a d B G 1 s n i a t n o c s s e c o r p h c a

E coun tsize .InPADMcases ,the e h t s w o h s h c i h w , s d a e r h t 4 f o e c n a m r o f r e p e h t n a h t r e tt e b y lt h g il s s i s d a e r h t 8 f o e c n a m r o f r e p

e l p it l u m g n i s u f o e g a t n a v d

a threads .The overhead in PADM does no tsignificanlty degrade the r

u o , y ll a u t c A . e c n a m r o f r e

p copytofuncitonperformsbetterwhent heprocessnumberi sl esst han32 . V

d e z i m it p

O ariou sBroadcas t

2 3 n o t s a c d a o r b d r a d n a t s f o e c n a m r o f r e p e h t e r u s a e m e w , t s r i

F processes .The algorithms of MPI ,

t s a c

B copyto and the new one(scatter- da -n allgather) are different .Their performance is shown in g

i

F u 7re (e)and7 )(f .Heret hemessagel engthequalst ot het otall oca ldatasize .Weseet ha tforshor t (

s e g a s s e

m ≤4MB) ,copyto performs better than both the MPI and the new one .And for long (

s e g a s s e

m ≥8MB),t henewalgorithmperformsbest . n

i s e s a c s u o i r a v e e r h t d e r u s a e m e w , n e h

T Figure 8 on32processeswtihdifferen tmessagesize .In f

o e u l a v e h t , t s il e h

t nisse tto512 ,andtheconitnuouslengthofD tdimensionis1/1024oftheloca l l

a r e n e g d n a n o it u l o s d e z i m it p o e h t f o e c n a m r o f r e p e h T . e z i s a t a

d copyto isshownin Table3 .Each

o t y p o

c containsone synchronization .TheBca tsOp tsoluiton thus issynchronized twice .Therefore , s

e m e h t n e h

w sageislonger ,theopitmized algortihm performsbetter .When themessagelength is e

h t , B M 2 1

5 Bca tsOp treduces88.3%communicationt imecomparedwtiht hegenera lsoluitoninthe .

e s a c t s r i f

. 8 e r u g i

F Variousbroadcas tpatterns.

. 3 e l b a

T Variousbroadcas tperformance.

e g a s s e M

h t g n e l

1 e s a

C Case2 Case3

t p

o copyto o pt copyto o pt copyto B

M

2 0.2 21 0.4 26 0.2 05 0.0 52 0.261 0.310 B

M

8 0.2 25 0.5 37 0.2 39 .00 79 0.258 0.340 B

M 2

3 0.2 56 0.5 72 0.2 60 0.2 09 0.256 0.355 B

M 8 2

1 0.326 1.620 0.290 0.728 0.404 0.911 B

M 2 1

5 0.778 6.640 0.684 2.808 0.913 5.200

n o is u l c n o C

. s n r e tt a p e v it c e l l o c s u o i r a v r o f n o i t u l o s l a r e n e g a s e d i v o r p M D A

P Theabstraciton ofdistribuitonis .

e l b i x e l f d n a e v it i u t n i s i h c i h w s e l u r e l p m i s l a r e v e s y b d e w o ll o

(11)

s r o t c a f d e t a l e r e c n a m r o f r e p e h t l o r t n o c o t e l b a d n a y ti l a c o l a t a d e h t f o e r a w

a . I tprovides more

e d o c e s i c n o

c s and greater scope for opitmization. Our work is open source on GtiHub , .

0 V m d a p / 1 0 0 2 9 1 s z / m o c . b u h t i g / /: s p tt

h Thefutureworki st odiscussmorescientificcases.

t n e m g d e l w o n k c A

. o N s t n a r G r e d n u a n i h C f o m a r g o r P D & R y e K l a n o it a N y b d e t r o p p u s s i h c r a e s e r s i h T

, 8 0 2 2 7 6 1 6 . o N s t n a r G a n i h C f o n o it a d n u o F e c n e i c S l a r u t a N l a n o i t a N d n a ; 2 0 5 0 0 2 0 B F Y 6 1 0 2

. 8 1 0 2 3 4 1

6 Thanks to the Mole-8.5 Supercompuitng System developed by Insttiute of Process .

s e c n e i c S f o y m e d a c A e s e n i h C , g n i r e e n i g n E

s e c n e r e f e R

] 1

[ Barnet tM. ,ShulerL. ,DeGejin R.V. ,Interprocesso rCollecitveCommunica iton Library ,High g

n i t u p m o C e c n a m r o f r e

P ,357-364 ,1994 . ]

2

[ ThakurR. ,RabenseifnerR. ,GroppW. ,Opitmizaitono fCollecitveCom- municaitonOpera iton s H

C I P M n

i ,Internationa lJourna lofHighPerformanceComputingApplications ,19(1) :49-66 ,2005 . ]

3

[ Xiaowe i Wang , We i Ge , The Mole-8.5 Supercompu itng System , Contemporary High

g n it u p m o C e c n a m r o f r e

P , .p 5p 7 -98 .NewYork :CRCPress ,2013 . ]

4

[ FoxG. ,OttoS.W. ,HeyA.J. ,e tal ,MatrixAlgortihm sona HypercubeI :MatrixMulitpilcaiton , 7

1 : ) 1 ( 4 , g n i t u p m o C l e l l a r a

P -31 ,1987 . ]

5

[ Krishnan M. ,NieplochaJ. ,Op itmi izng Paralle lMulitpilcaiton Operaiton f o rRectangula rand s

e c i r t a M d e s o p s n a r

T ,Internationa lConferenceonparalle landDistributedSystems ,257-266 ,2004 . ]

6

[ Agarwa lR.C. ,BalleS.M. ,Gustavson F.G. ,A Three-dimensiona lApproach t o Paralle lMatrix n

o it a c il p it l u

M ,IbmJourna lofResearchandDevelop- ment ,39(5) :5 -75 582 ,1995 . ]

7

[ ChanA. ,Balaj iP. ,GroppW. ,Communicaiton Analysi so fParalle l3DFFT f o rFla tCa tresian

s m e t s y S e n e G e u l B e g r a L n o s e h s e

M ,HighPerformanceComputing ,350-364 ,2008 . ]

8

[ F .Kjolstad and M .Snir .Ghos tCel lPa ttern ,In Workshop on Paralle lProgramming Patterns , .

0 1 0 2

] 9

[ Chen Y. , Cu i X. , Me i H. , PARRAY : a unfiying array representa iton fo r heterogeneou s m

s il e ll a r a

p ,Acm Sigplan Symposium on Principles and Pracitce of Paralle lProgramming ,47(8) : 1

7

1 -180 ,2012 . ]

0 1

[ Hoare C.A., Hayes I.J., Jifeng H., Lawsofprogramming,Communicationsof TheACM ,30(8) : 2

7

6 -686 ,1987 . ]

1 1

[ DraperJ.M. ,CullerD.E. ,Yelick K. ,Introducitont oUPCandLanguageSpeciifcaiton ,Center l

a n A e s n e f e D r o f e t u t i t s n I , s e c n e i c S g n i t u p m o C r o

f - yses ,1999 . ]

2 1

[ Coarfa C. , Dotsenko Y. , Mellorcrummey J. , An Evaluaiton o f Globa l Addres s Space o

C : s e g a u g n a

L -arrayFortran andUniifedParalle lC ,Acm SigplanSymposiumonPrinciplesand .

5 0 0 2 , g n i m m a r g o r P l e l l a r a P f o e c i t c a r P

] 3 1

[ Numrich R.W. ,Reid J.K. ,C -o array sin the Nex tFortran Standard ,ACM Sigplan Fortran :

) 2 ( 4 2 , m u r o

F 4-17 ,2005 . ]

4 1

[ Nieplocha J. ,Palmer B.J. ,Tipparaju V. ,Advances ,Appilcaiton sand Performance o fthe

ti k l o o T g n i m m a r g o r P y r o m e M d e r a h S s y a r r A l a b o l

G ,Internationa lJourna lof High Performance 3

0 2 : ) 2 ( 0 2 , s n o i t a c i l p p A g n i t u p m o

C -231 ,2006 . ]

5 1

[ Gropp W. ,Thakur R. ,Lusk E .Using MPI-2 :Advanced features of the message passing e

c a f r e t n

References

Related documents

The IPD and Notaires/INSEE indices represent the return on capital for housing, whereas the type of market sector represented by the listed real estate index depends on the type

To qualify the firm must have experience in payroll processing for multiple sites of not for profits charter schools and/or supporting multiple sites and the required tax filings

To experiment with infrasonic perception, the existing behaviour of sounds (also above 20Hz) was used to examine and test inherent qualities of infrasound samples in order to

To demonstrate that metamorphic testing is an effective technique for testing health care simulation software, we started by identifying the metamorphic properties that we would

To prevent excessive anticoagulation and increased risk of bleeding, the new oral anticoagulants should not be administered to patients with a creatinine clearance below 30 mL/

We are a customer-focused mutual life insurance company with a nationwide network of highly skilled financial professionals and backed by a business and investment strategy that

Over the last few years, several variants of flow-time related problems have received a lot of attention: on single and multiple machines, in online or offline setting, for

The benefits described in this policy wording should be read in conjunction with Options to vary cover (pages 6-8), Existing Medical Conditions and pregnancy (pages 8-11) Your