Array level Collective Communications

(1)

) 8 1 0 2 S M S M C ( s c it s it a t S l a c it a m e h t a M d n a n o it a l u m i S , g n il e d o M , l a n o it a t u p m o C n o e c n e r e f n o C l a n o it a n r e t n I 8 1 0 2 8 7 9 : N B S

I -1-60595- 25 -9 6

y

a

r

A

-l

e

v

e

l

C

o

ll

e

c

it

v

e

C

o

m

u

n

i

c

a

it

o

n

s

g

n

a

u

h

S

Z

H

A

N

G

d

a i

n

Y -

f

e

n

g

C

H

E

N

*

a n i h C , y ti s r e v i n U g n i k e P , e c n e i c S r e t u p m o C d n a g n i r e e n i g n E l a c i r t c e l E f o l o o h c S

*_C_o_r_r_e_s_p_o_n_d_i_n_g_a_u_t_h_o_r

s d r o w y e

K :M , PI Distributeddata, Collecitvecommunicaiton.

t c a r t s b

A . TheMessagePassingInterface(MPI)providessomewell-definedcollectivei nterfacesfor . s e s s e c o r p e l p i t l u m n e e w t e b s e g a s s e m f o t e s

a However ,MPI can be improved .The collective u q e r l li t s s n r e t t a p n o i t a c i n u m m o c e v i t c e l l o c y n a m t a h t , d e z i l a i c e p s o o t s i I P M n i e c a f r e t n

i ire

. ) M D A P ( n o i s n e m i D l e l l a r a P l o o t l e v o n a e b i r c s e d e w , r e p a p s i h t n I . s r e m m a r g o r p y b n o it a it n a t s n i , e r o f e r e h T . s n o i t a c i n u m m o c e v i t c e l l o c f o s s a l c l a r e n e g e r o m a r o f I P M d n e t x e o t d e n i f e d s i M D A P n u m m o c e v i t c e l l o c r e h t o d n a s e c a f r e t n i I P M e h t h t o b , M D A P n

i icaitons are performed wtih the . n o i t u l o s l a r e n e g e m a

s PADM providesarray style abstraction forboth process topology and data ) S A G P ( e c a p S s s e r d d A l a b o l G d e n o i t i t r a P y l h g i h e h t e k i l n U . w e i v l a b o l g e h t m o r f n o i t u b i r t s i d p x e o t r e m m a r g o r p s w o l l a M D A P , l e d o

m lictily contro lthe distribution and communication .And . n o i t a z i m i t p o r o f e p o c s r e t a e r g s w o l l a d n a , s e c n e t n e s n o i t a c i n u m m o c e s i c n o c e r o m s e d i v o r p M D A P n o it c u d o r t n I I P

M is themos tpopular and ubiqutiouschoicefor distributed communication .I tprovideswe ll-r e t t a c s , r e h t a g , t s a c d a o r b s a h c u s , n o it a c i n u m m o c e v it c e ll o c d e z il a i c e p s e m o s r o f s e c a f r e t n i d e n i f e d t n i o p e h t f o d a e t s n i s e c a f r e t n i e v it c e ll o c e h t g n i y l p p a , s r e m m a r g o r p r o F . c t e d n

a - ot -poin tsentences

t c e ll o c e s e h t , r e v e w o H . y ti v i t c u d o r p e h t t s o o b l l i

w iveinterfacesaretoospeciailzedtocovervarious . s n r e tt a

p For example ,the group-wise collective operaiton is undefined in MPI .To apply these it l u m n i s e c a f r e t n

i ple group ,the programmers have to use the MPI communicator to construc t u m m o c d e n o it it r a

p nication contex tand process groups .However ,communicator creaiton is an n o i t a r e p

o that exchanges informaiton between processes wtih expensive synchronization cost . e tt a p n o i t a c i n u m m o c s i h T . n o it a c i n u m m o c r o b h g i e n r a l u c r i c e h t s i e l p m a x e d e n i f e d n u r e h t o n

A rn is

h c a E . g n i s s e c o r p e g a m i d n a n o it a l u m i s c i f i t n e i c s s a h c u s , s m e l b o r p l i c n e t s n i d e s u y l n o m m o c . s r o b h g i e n s t i h t i w a t a d s e g n a h c x e s s e c o r

p The programmers have to use the point- ot -poin t , d n e S _ I P M d n a v c e r I _ I P M e k il , s e c n e t n e

s ort heMPIt opologiessuchasMPI_Cart_shifti nterfaceto s i h t m r o f r e

p kindofcommunication . s h

t r u

F ermore ,i tispossibleto design specificopitmizaiton foreach collecitveinterfaceaccording e b e v a h , H C I P M d n a I P M n e p O s a h c u s , I P M r o f s e i r a r b il d e t n e m e l p m i e h T . n o it i n i f e d s ti o

t en

e v it c e ll o c f o n o i t a z i m it p o e h t n o d e s u c o f s e i d u t s y n a m e r a e r e h T . s e c a f r e t n i l a r e v e s n i d e z i m it p o g n i d n e p e d s m h t i r o g l a e l p it l u m e s u y e h t , e c a f r e t n i h c a e r o F . e c n a m r o f r e p e h t e v o r p m i o t s n o i t a r e p o s a h c u s , e z i s e g a s s e m e h t n

o [1] ,[2 .T] heoptimization only appilesto thesespecialized patternsin . s m h ti r o g l a e s e h t r o f e l b a ti u s e r a s n r e t t a p s u o i r a v y n a m , y ll a u t c A . I P M k r o w r u o e b i r c s e d e w , r e p a p s i h t n

I PADM on providing asinglegenera lsolution forcollecitve e t n i l a r e n e g s i h T . s n o it a r e p o n o i t a c i n u m m o

c rfaceisbasedon anarray-levelalgebraicrepresentaiton

n o is n e m i

d .The construction of dimension follows severa lsimple rules which is easy to learn . n o i s n e m i

D containsboththeprocesstopologyanddatadistribution .Therefore ,eachcommunicaiton m

i s s

i plifiedasapairofdimensionsandacopytosentenceinPADM .Forprogrammers ,ourworkis . y t i v it c u d o r p e h t t s o o b o t e l b

a Inaddiiton ,thedimension i a s binomina ltreestructure .I tispossible toanalyzethestructureand identify thecommunication pattern .Then wecan design theopitmized

. n r e tt a p h c a e r o f m h ti r o g l a a f o l e d o m s i s a b e h t s i S A G P . S A G P e h t n o d e s u c o f s n o it a c i n u m m o c g n i y f i l p m i s n o k r o w y l r a E o C , C l e l l a r a P d e i f i n U s a h c u s , s e g a u g n a l f o r e b m u n e g r a

(2)

. s e l b a i r a v d e r a h s g n i d i v o r p y b y t i v it c u d o r p r e m m a r g o r p t s o o b s e g a u g n a l e s e h T . s s e c o r p h c a e o t l a c o l

a u g n a l e h t f o t s o m n i t i c i l p m i s i n o it u b i r t s i d e h t , r e v e w o

H ges .The runtime performs one-sided . s s e c o r p h c a e n o t n e m e r i u q e r a t a d e h t o t g n i d r o c c a y l l a c i t a m o t u a s e c n e t n e s n o i t a c i n u m m o c

d e r a p m o c , s u h T . y ti l a c o l a t a d e h t f o e r a w a e b d l u o h s s r e m m a r g o r p , e c n a m r o f r e p e h t g n i r e d i s n o C

ti c il p x e s e d i v o r p k r o w r u o , S A G P h t i

w distributionandcommunication. n

o i t c e S : s w o ll o f s a d e z i n a g r o s i r e p a p e h t f o t s e r e h

T I I introduces the construciton rules for l

a r e n e g e h t d n a n o i s n e m i

d copytosentence ,thenSeciton IIIcomparestheMPIandPADMcodesin a

h t w o h s o t s e s a c l l a m s l a r e v e

s tourwork providesmore concise program ,and investigatesinto a .

n o i t a c il p i tl u m x i r t a m n o y d u t s e s a

c Seciton I V introduces the implementation and opitmization . n

o it c e

S V illustratesanddiscussest hecos tmode landperformance .SecitonVIconcludesourwo rk.

e ll a r a

P l Dimen ison

. n o it a c i n u m m o c e v it c e ll o c s u o i r a v r o f e c a f r e t n i l a r e n e g a e d i v o r p o t I P M f o n o i s n e t x e n a s i M D A P

n o it p e c n o c r u o n o d e s a b s i e c a f r e t n i s i h

T dimen ison .In thisseciton ,weintroducetheconstruction s

e l u

r inPADMt oprovideani ntutiiveunderstanding . s

e l u R n o it c u r t s n o C

s w o ll o f n o i t c u r t s n o c e h

T severa lsimple rules ,as shown in Equation (1) and (2). These rules a re .

n o i t u b i r t s i d n o m m o c e h t e b i r c s e d o t h g u o n e e l b i x e l f

D= gt (a size ,step, disp) ( 1)

S= D0∗...∗Dn| array(D0∗...∗Dn)| array (Sr fe, Sb eas ) (2)

, t s r i

F in Equation (1) ,each basic dimension( D)hasonetag and threefundamenta lcomponents :

e zi

s , tsep and dsip. There are three keywords for tag ,dim(dummy) ,proc(process )and data tha t t

n e r e f f i d t n e s e r p e

r elements .Dummy is a void type withou tspecific data or process informaiton . f

o r e b m u n e h t s i e z i

S elementswith defaul tvalue 0 .Step is thedistance between two elementsin e

s quence ,which is also called stride in PGAS [14] .Its defaul tvalue is 1 ,which means the d

n A . n o it a u t i s s u o u n it n o

c disp meansthe displacemen twtih defaul tvalue0 .Elements’ offse twli l e

v o

m a tacertaindistanceaccordingtothis. Foreachindexindimension ,i∈(0, size−1) ,theoffse tis :

s a d e t a l u c l a c

t e s f f

o (i) = (i%size+disp)∗step. (3) t

e c n i

S he defaul tvalues mentioned before can beomitted .Le tus look a ta simple construciton )

2 ( n o i t a u q E n i e l u r t s r i f e h t o t g n i d r o c c

a :

) 4 ( c o r p = A m i

d *_d_a_t_a₍₄_,₁₆ ₎*_d_a_t_a₍₁₆ _.)

‘ r o t a r e p o n o i t a c i l p i tl u m e h

T ∗’isusedto construc tmutli-dimensions .ThisdimensionAdescribes n

o a t a d d e t u b i r t s i

d 4processes ,andeachprocesscontains64 conitnuousdataelements .I thasthree o

w t a o s l a s i t i t u b , s n o i s n e m i

d -dimensiona larraywtih16*16elements ,oraone-dimensiona larray h

t i

w 256 elements .Eachdimensionisconstructedasabinomia ltreeaccordingtothemutlipilcation . 1 e r u g i F n i n w o h s s a s e e r t t n e r e f f i d e c u d o r p l li w r e d r o e h t e g n a h c o t s t e k c a r b e h t g n i s U . r e d r

o

r e d r o n o i t a c i l p i t l u m t n e r e f f i d h t i w s e e r t n o i s n e m i D . 1 e r u g i

(3)

t n o c a e b i r c s e d o

T inuoussequence ,wehavetose tthecorrec tstepvalueforeach dimension ,like

) 6 1 , 4 ( a t a

d inA. Orwecanuset hesecondruleinEquaiton(2)tofillt hestepvaluesautomaitcallyfor :

w o l l o f s a n o i t a u t i s s u o u n it n o c

) 4 ( c o r p ( y a r r a = A m i

d *_d_a_t_a₍₄ ₎*_d_a_t_a₍₁₆₎_.)

e h

T l ast rule in Equaiton (2) is called reference mechansim .I tisused to describe more flexible e

h t , s n o i s n e m i d o w t s n i a t n o c e c n e r e f e r A . s n o it u b i r t s i

d re fdimension Sr fe and the base o Sne base ,

d e t c u r t s n o c s i h c i h

w asfollowexample:

.) A , ) 8 ( m i d * ) 8 ( m i d ( y a r r a = R m i d

, e s a c s i h t n

I Aisthebaseone ,and dim(8)*dim(8) isthenewdescriptionbased on A .Weusethe

m i

d type herewtihou tspecificdataor process information .SinceA isconfigured asa16*16 two -n

e h T . x i r t a m l a n o i s n e m i

d R describesan 8*8 s -ub matrix in thetoplef tcornerofA .Theoffsetsare e

n n i s e t a l u c l a

c stedway,R.offset(i)=base.offset(ref.offset(i)) .

n o it c n u f e d i v o r p e W . t e s f f o a t a d l a c o l d n a r e b m u n s s e c o r p : t e s f f o f o s d n i k o w t e r a e r e h t , y ll a u t c A

e d o

n and offse t to calculate them respecitvely , ilke A.nodei( ) and A.ofsfe(t)i. Moreover , for y

ti c i l p m i

s ,we u ‘se +’ operator to se tthe d sip value .For example, the rank 1 process can be h

c i h w , 1 + ) 1 ( c o r p s a d e b i r c s e

d isbettert hanproc(1,1,1). o

t y p o C

s i e c a f r e t n i n o it a c i n u m m o c l a r e n e g e h

T definedas:

S m i d ( o t y p o

c s ,void*s ,dimSt ,void*t,i n telem_szie;)

S e r e h

w s ist hesourcedistribution ,andStist het arge toneoft hesamesize .Thevoidpointersaret he

d a e r o n s i e r e h t f I . s e s s e r d d a a t a d t e g r a t d n a e c r u o

s -wrtieconflic,tt heprogrammerscanuset hesame e

t e m a r a p t s a l e h T . s r e t n i o p h t o b r o f s s e r d d

a rrequirest hesizeofdataelement ,suchasszieo(flfoa)t . w

o R d n a n m u l o C

it l u m y r e v e e c n i

S -dimension is constructed as a binomia ltree ,i tis also a two-dimensiona larray . t

o o r e h t f o n e r d li h c o w t e h t d n

A /parentnoderepresen tthecolumnand row dimension ,respecitvely . s

n o i t c n u f e s u n a c e

W co(lA )and row(A )to ge tthem from dimension A .The defaul torderisrow -n

o i t c n u f e h t g n i s u , r o j a

m col_major(A )wli lchange tii ntocolumn-major . s

n o it c n u F m r o f s n a r T

, r e d r o t n e r e f f i d e b i r c s e d o t e l b a s i e c n e r e f e r e h

T er -division ,portion and repeititon of thebasic l

l e w s e d i v o r p M D A P . e n

o -defined transform funcitons to simplify the reference declaration ,and n

i n w o h s e r a s n o i t c n u f d e s u y lt n e u q e r f e m o

s Table1 .Thefirs tfourfuncitonswli ltrea ttheinpu tD m

i d e n o s

a ension ,whlietheothersrequiretha tDistwo-dimensional .Thefunctionmutl iwli lreturn acollapsed dimension ,whichmeanstha tthereferencesizeislargerthanthebasesize .Weprovide

n o i t c n u

f d isze and p isze to ge t the actua l process and loca l data size . For example ,

)( e z is p .) 4 ,) 1 ( c o r p (i tl u

m returns1 .

. 1 e l b a

T TransformFunction. n

o it c n u

F Usage

) k t n i , D m i d ( w o l _ t e g m i

d ge tkunitsfromDt ha tneart heorigin )

k t n i , D m i d ( h g i h _ t e g m i

d ge tkunitsfromDt ha tfarfromt heorigin ,

D m i d ( i t l u m m i

d in tk) repea tDkt imes )

k t n i , D m i d ( c i l c y c m i

d cyclicDwithkunits )

k t n i , D m i d ( w o R c i l c y c m i

d cyclicrowdimensionofD )

k t n i , D m i d ( l o C c i l c y c m i

d cycliccolumndimensionofD )

k t n i , D m i d ( r o j a m _ k c o l b m i

(4)

Con isceCode

I P M n i e ti r w o t d r a h s i t a h t I n o it c e S n i s e s a c e m o s d e s s u c s i d e v a h e

W .In thisseciton ,we take a e

s a c e l p m i

s asexampletocomparet heMPIandPADMcodes.Thecommoncodei somittedi nboth h

t t a h t r a e l c s i t I . e c n e r e f f i d e h t w o h s o t s n o i s r e

v e PADM program is concise and easy to .

g n i d n a t s r e d n

u Thenwewil ldiscussthedistribuiton-independen talgortihmandacasestudy ,matrix .

n o it a c il p it l u m

n o it a c i n u m m o C r o b h g i e N r a l u c r i C

s s e c o r p h c a E . n o i t a c i n u m m o c r o b h g i e n r a l u c r i c e h t s i e s a c t s r i f e h

T sends100 floa telementsto tis e

r a e r e h T . e n o t f e l e h t m o r f t n u o c e m a s e h t s e v i e c e r d n a , r o b h g i e n t h g i

r pnumprocessesi nt otal .The n

i n w o h s s i n o i s r e v I P

M Figure 2(a) .Programmers should consider many detalis ,such as the c

o l b , r e b m u n s s e c o r p n o it a n i t s e

d king ornon-blocking interfaces ,avoiding deadlock (n -on blocking e

h t g n i r u s n e d n a ) t s r i

f completion (MPI_Wati) .Even using the MPI topology interface wli lno t e

d o c f o t n u o m a e h t e c u d e

r .Whliein PADMversion ,asshownin Figure2(b) ,thereisonlyonetask r

o f s n o i s n e m i d t e g r a t d n a e c r u o s e h t n g i s e d o t s i t a h

t copytofunciton .Weusethet ransformfunciton

c il c y

c heretodescribet hedesitnationprocesses .

n o i t a c i n u m m o c r o b h g i e n M D A P s v I P M . 2 e r u g i

F .

n o it u b i r t si

D -independent d

n o c e s e h

T caseisa ilbraryfuncitonofmatrixtranspose .Whendesigninga ilbraryfunctioninMPI , w o r r o k c o l b s a h c u s , x i r t a m t u p n i f o n o it u b i r t s i d l a n i g i r o e h t f o e r a w a e b t s u m s r e m m a r g o r p e h t

m t n e r e f f i d n i tl u s e r s n o it u b i r t s i d l a n i g i r o t n e r e f f i d e h T . n o it u b i r t s i d a t a d d e r e tt a c

s essagepassingpat-

a n g i s e d o t e l b a s i t i , M D A P n i , r e v e w o H . s n r e

t di ts irbuiton- independen talgortihm .I tmeansthe r

b u s n i y r a s s e c e n n u s i t u o y a l a t a d c i f i c e p

s ouitne .Ourfunciton isshown in Figure3 .I trequirestha t A

n i s n o i s n e m i d w o r d n a n m u l o c e h

t represen tthecolumnand rowofthematrix ,respectively .We b

i r c s e d o t s n o i s n e m i d w o r d n a n m u l o c e h t e g n a h c x

e e the targe tdistribution B .Also we can use

_ l o

c major(A )toge ttiatlernaitvely .

e r u g i

F 3. Distribution-independentt ransposei nPADM.

t S e s a

C udy :MatrixMul itpilca iton

A = C t a h t n o it a c il p it l u m x i r t a m f o m h ti r o g l a D 2 l a n o it i d a r t e h t n

(5)

u l o c d n a s w o r e l p i tl u m t e g t s u

m mnsfromAandB .IneachmatrixdistributionofAandB,i ft hesize P

s i e z i s n o i t a c i n u m m o c e h t , P s i d i r g s s e c o r p f

o 1/2 _it_m_e_s_o_f_t_h_e_m_a_t_r_i_x_s_i_z_e_.

e e r h t

A -dimensiona lapproach [4]–[6] to paralle lmatrix mutlipilcation has a factor P1/6 _l_e_s_s

o i t a c i n u m m o

c nthanthe2Dalgortihm .Inthisalgortihm ,theprocessesareconfigured asa3Dcube . s

s e c o r p h c a

E g etsa singlesub-matrix from A and B ,then performsa loca lmatrix mulitplicaiton . b

u s l a r e v e s , n o it a t u p m o c e h t r e t f

A -matrices of this produc t mus t be sent to their desitnation e

h T . C x i r t a m g n it l u s e r e h t h ti w r e h t e g o t d e m m u s n e h t d n a s e s s e c o r

p computation is described in g

i

F u 4re .Therearet hreecommunicationsi nt hisalgorithm .Thefirs tonedistributest hesub-matrices o

d n o c e s e h T . s e s s e c o r p l l a o t A f

o nedistributest hoseofBwtihdifferen tdesitnations .Andt het hird .

s s e c o r p h c a e n o t l u s e r h c a e r o f s t c u d o r p e h t l l a s r e h t a g e n

o Thecommunicaitons are complicated . n

i , e r o m r e h t r u

F MPI,t heprogrammersmus tbeawareoft heorigina ldistribu itonsofdata.

. 4 e r u g i

F Cubealgorithmformatrixmultiplication. e

h t e ti r w o t r e m m a r g o r p r o f e l b i s s o p s i t i , M D A P n

I di ts irbuiton-independen talgorithm from an e

v it i u t n

i way .In ourimplementaiton ,asshown inFigure5 ,theinpu tdimensionsareconsidered as o

w

t -dimensional .Each processperformsloca lsub-matricesmulitpilcaiton of size len*len .And the m

s i C d n a B , A f o r e b m u n s k c o l

b *n ,n*kandm*k ,respecitvely .Therefore ,weneedaprocesscube m

e z i s f

o *n*k .Thedeclarationofeachcubedimensioni sseparate ,asshowninLine7-9 .Beforet he b

u s e h t , n o it a t u p m o c l a c o

l -matricesofAarescatteredamongt hePm*Pnprocesses ,andarerepeated e

h t g n o l

a Pkdimension .Similarly ,thematrix BisscatteredamongthePn*Pkgrid ,and isrepeated e

h t g n o l

a Pmdirection .

M D A P n i m h t i r o g l a e b u C . 5 e r u g i

(6)

b u s l a c o l e h

T -matrix is described as Blk in Line 11 ,which is continuous in memory .Bu tthe e h t g n i r e d i s n o C . s e s s e c o r p e l p it l u m n o r o s u o u n i t n o c y l e r it n e t o n e b y a m a t a d t u p n i g n i d n o p s e r r o c

i d e h t f o n g i s e d e h t , e c n a m r o f r e

p mensionsmus tensuretha tthecommunication conitnuouslength is n

o i t p i r c s e d a t a d l a c o l e h t , e r o f e r e h T . h g u o n e g n o

l Blkisatt herightmos tsideoft het arge tdimension . e

h t e s u e w d n A . h t g n e l s u o u n it n o c t s e g n o l e h t s a h n o i s n e m i d t e g r a t e h t , t s a e l t

A block_majo r

w o r l a n i g i r o e h t t s u j d a o t n o i t c n u

f -major inpu tdimensions into block-major order .By using the

it l u

m -funciton ,theadjustedoffsetsarerepeatedsevera l itmes .FromLine14-20 ,matrix Aand Bare .

s e s s e c o r p e b u c e h t n o d e t a e p e r d n a d e t u b i r t s i d

t f

A ertheloca lcomputaiton gemm ,theproductsalongthePndireciton mus tbesummedtogether .

e h t e s u e

W CP dimension to enlarge the dataoffsetsscope ,and allocate memory to receivethese .

l a c o l n o s t c u d o r

p ThePn dimension correspondsto theCP dimension. And theproduc tblockson k

P * m

P correspondt ot heblock-majorofC ,asshowni nLine24-26 .Therefore,t heproductsaresen t . C x i r t a m g n it l u s e r e h t h ti w r e h t e g o t d e m m u s n e h t d n a s e s s e c o r p n o i t a n it s e d e v it c e p s e r r i e h t o t

e d o c r u o n

I ,only the malloc and rfee sentences are omitted .The code is concise and easy to n

I . n o it u b i r t s i d l a n i g i r o t n e r e f f i d s t p e c c a m h t i r o g l a s i h t , y ll a i c e p s E . d n a t s r e d n

u Figure6 ,wedescribe

a C . n o it u b i r t s i d a t a d d e r e t t a c s w o r s i e n o e s a C . A x i r t a m r o f s n o it u b i r t s i d t n e r e f f i d e e r h

t setwo isa

n o n r e t t a p d e r e t t a c s k c o l

b r* rprocessgrid ,andeachblockisasquaresub-matrix .Casethreeisalso n

o i t u b i r t s i d k c o l b

a wtih differen tgrid pattern ,whereblock isarectanglepar tofthematrix .These n

r e tt a

p s arealsoavailableformatrixBandC .Differen tdistribuitonsresul tin differen tperformance e z i s f o n o i t a c il p i tl u m x i r t a m e h t f I . h t g n e l s u o u n it n o c n o it a c i n u m m o c t n e r e f f i d e h t f o e s u a c e b

n i n w o h s s i e c n a m r o f r e p e h t , s e s s e c o r p 4 6 n o l e ll a r a p s i 2 9 1 8 * 2 9 1

8 Table2 .Thelas tcolumnc ntis t s e g r a l e h t f o e s u a c e b t s e b s m r o f r e p e s a c d r i h t e h T . n o it a c i n u m m o c n i a t a d s u o u n it n o c f o r e b m u n e h t

t n

c value .I tis possible to design a distribution-independen talgorithm in PADM by using the l

a r e n e g d n a m s i n a h c e m e c n e r e f e

r copyto function .Therefore ,tuning theperformancewtihdifferen t .

y s a e s i s t u p n i

. 6 e r u g i

F Variousi npu tdistributionforcubealgorithm.

n o i t a c i l p i t l u m x i r t a m n i s n o i t u b i r t s i d t n e r e f f i d h t i w e c n a m r o f r e P . 2 e l b a

T .

) s ( e m i

T CopyA CopyB CopyC Total c nt w

o r : 1 e s a

c 0.223 0.254 0.233 0.710 211

k c o l b : 2 e s a

c 0.282 0.338 0.320 0.940 210

e l g n a t c e r : 3 e s a

c 0.180 0.179 0.157 0.516 220

t n e m e l p m

I a itonandOp itmiza iton M

D A

P is implemented by using MPI ,OpenMP and C++ .We defined i tas a ilbrary tool tha t a

s i t I . I P M h ti w e l b it a p m o

c ilghtlyextension tha tallowsprogrammerstouseanabstrac tandlogic n

o i t p i r c s e

d ofdatawhich isseparatesfrom theactually physica lmemory. I tiseasyto use ,PADM e

m o s d d a o t d e e n y l n o r e m m a r g o r p e h T . d e d u l c n i e b o t e li f d a e h a s e d i v o r

p options such as ‘

-g n il i p m o c n e h w ’ p m n e p o

f .

e h

T genera lcopytofuncitonrequiressourcedimensionandthesamesizetarge tone .Byscanning e

p o c s e z i s e h t n i s e x e d n i e h t l l

a onlyonce itme ,wecancalculatetheelemen tlocationsa tbothends , t

n e m e l e s i h t y p o c n e h t d n

a fromsourcetotarge tbyusingpoint- ot -poin tmessagesentences .Thisis .

(7)

it l u M d n a g n i p p a l r e v

O -thread s

i p p a l r e v o s i d o h t e m n o i t a z i m it p o n o m m o c

A ng thecompuitngand communication .In oursoluiton , h

t o t s r e f e r g n i t u p m o c e h

t elocaiton calculaiton foreachindex ,calledscan .WechooseOpenMP to u

m h c n u a

l litplethreadson each process ,bu tonlyonethread isallowed to usethemessagepassing e

h t , r e v e w o H . s e c n e t n e

s scan work is parttiioned among severa lthreads .Each computing thread e h t n o s i n o i t a c o l t e g r a t r o e c r u o s e h t f I . s n o it a c o l e h t s e t a l u c l a c d n a n o i s n e m i d e h t f o n o i t r o p a s n a c s

e v i e c e r r o d n e s a h s u p e w n e h t , s s e c o r p t n e r r u

c request into globa lqueue .Once thequeue isno t n

o it a c i n u m m o c e h t , y t p m

e thread wli lpop up each reques tfrom the queue and turn i tinto a _

I P M e s u e w n e h T . e c n e t n e s v c e r I _ I P M r o d n e s I _ I P

M Tes tto ensure tha tthese requests are n

o it c n u f a e d i v o r p e w , r e v o e r o M . d e t e l p m o

c setTnumtosett henumberofscant hreads . M

D A

P -ledOp itmiza iton i

d M D A P e h

T mension is an analyzable binomia ltree .We can ge tinformaiton from the structure n

i g n i n n a c s t u o h t i

w dexes .Now we are going to discuss some opitmizations tha tbased on the .

s i s y l a n a

h t g n e L s u o u n it n o

C .Eachdimensionrepresentsasequence ofdataoffsets. Theseoffsetsmaybe n

i s u o u n i t n o

c address ,such as proc(4)*data(100) . In other cases , they may no t b e entirely e

k i l , s u o u n i t n o

c data(10,20)*data(10) ,tha tevery 10 elements is continuous in memory . The f

o r e b m u

n regularcontinuouselementson each processisconfiguredasthecontinuouslength .We e

s u n a

c A.cntLen()togeti .t

t s e t a e r g e h T . h t g n e l s u o u n it n o c t n e r e f f i d e v a h y a m n o it a c i n u m m o c e n o n i s n o i s n e m i d o w t e h T

h t g n e l s u o u n it n o c e h t s i s e u l a v o w t e s e h t f o r o s i v i d n o m m o

c cn tof this communication .I tmeans y

r e v e t a h

t cn tindexes refer to the same source and target process number ,and their offsets are s i t n e m g e s s u o u n it n o c h c a e n i x e d n i t s r i f e h t y l n o , e r o f e r e h T . s d n e h t o b t a y r o m e m n i s u o u n it n o c

c n i s i h t g n e l e g a s s e m e h t , e l i h w n a e M . d e n n a c s e b o t d e e

n reased to cnt .This soluiton called

m m o C n a c

s .

y ti r a l u n a r

G .To optimizethecommunication whencntissmall ,weintroduceanotherargumen t

n a r

g ht a tmeans the granulartiy .The message length is se tto this granularity rather than cnt . f

o e u l a v e h t , y l l a u s

U gran is severa ltimesthesize of cnt .Therefore ,each process packs severa l n o i t a n i t s e d e h T . t i d n e s d n a k c o l b r e f f u b e n o o t n i n o i t a n i t s e d e m a s e h t h t i w s t n e m g e s s u o u n i t n o c

s i e g a s s e m e v i e c e r a e c n O . e g a s s e m e h t e v i e c e r o t r e f f u b r e h t o n a s e d i v o r p s s e c o r

p completed ,the

.t e s f f o t h g i r e h t o t t n e m g e s s u o u n i t n o c h c a e e v o m d n a r e f f u b e h t k c a p n u l l i w s s e c o r p n o i t a n i t s e

d In

m r o f r e p o t s d a e r h t f o e z i s e m a s e s u e w , n o it a u ti s s i h

t packing and unpacking seperately. In the e

r o h p a m e s , d a e r h t n o i t a c i n u m m o

c isusedtonoticet heunpackingt hreadt ha tabufferblocki sready .

d e k c a p n u e b o

t ThissolutioncalledbufComm. a

n g i s e d e w , n o it i d d a n

I buffe rpoo ltorecyclet hebufferblocks .Thepooli sparttiionedi ntomany e

z i s f o s k c o l

b gran .Andt heseblockaddressesaremanagedbyagloba lqueue .Whenanewblocki s a e t a c o l l a l l i w e w , y t p m e s i e u e u q e h t f I . e u e u q e h t m o r f p u d e p p o p s i s s e r d d a r e f f u b a , d e r i u q e r

d n a l o o p e h t h t o b , n o i t a c i n u m m o c e h t r e t f A . e s u r e t f a d e l c y c e r o s l a s i h c i h w k c o l b y r a r o p m e t

e r a s k c o l b y r a r o p m e

t released . l li w n o it c n u f o t y p o c e h

T analyzetheinpu tdimensionsandse targumentsautomatically including

t n

c ,gran ,the size of buffer poo land etc .Also we provide funcitons for programmers to contro l s

a h c u s , m e h t f o e m o

s setGranByte sandsetBufBytes . y

ti l a c o

L .Byanalyzingthesourceand targe tdimensions ,wecan discussthelocality .Ifthedata h t o b t a s r e b m u n s s e c o r p e h t d n a , t e g r a t d n a e c r u o s n e e w t e b r e h t o h c a e o t d n o p s e r r o c s n o i s n e m i d

t n e m e v o m a t a d e h t t a h t s n a e m t i , e c n e u q e s e m a s e h t n i e r a s d n

e onlyoccursonl ocal ,suchas :

) 4 ( c o r p = S m i

d *_d_a_t_a₍₁₀_,₁₀ ₎*_d_a_t_a₍₁₀ _;)

) 4 ( c o r p = R m i

(8)

y b d e m r o f r e p s i t n e m e v o m e h t , n o it a u ti s s i h t n

I memcpyinsteadofmessagepassing .Thisbranch d

e ll a c s i n o i t u l o

s localCopy. The copyto funciton chooses the proper optimized branch soluiton s

n o i s n e m i d e h t o t g n i d r o c c

a ’i nformation .

n o it a r e p O e v it c e ll o C r o f n o it a z i m it p O d n a n o it a c if it n e d I

x e n a s a n o i t a r e p o t s a c d a o r b e k a t e

W ampleto discussthealgorithm leve lopitmizaiton .Thereisan t

p

o imizaiton forlong messagebroadcast .The message isfirs tdivided up and scattered among the e W . r e h t a g ll a n a o t r a li m i s , s e s s e c o r p l l a o t k c a b d e t c e ll o c n e h t e r a a t a d d e r e t t a c s e h T . s e s s e c o r p

e h t d e r u s a e

m u -n optimized MPI Bcas tof OpenMPI and this new algortihm on 32 processes .The w

e

n algorithm reduces up to 74.5% itme when the message length is 1GB . This exisitng d

e i l p p a e b n a c n o it a z i m it p

o forbothstandardandmorevariousscenariosinPADM .Forthesemore e

m i d t e g r a t d n a e c r u o s e h t , t s a c d a o r b l a r e n e

g nsion have some common features. I tis possible to e

h t e z y l a n

a dimensions’featurestoi dentifyt hecommunicaitonpattern . n

o it a c if it n e d n

I .Wechoosethreecasesofbroadcas tasshownin Lis t7 .Thefirs tcasedescribes t

o n h t i w s e s s e c o r p l l a n o t s a c d a o r

b enitrely continuous message ,and the data layou tis changed d n A . s e s s e c o r p r e b m u n n e v e o t a t a d t o o r e h t t s a c d a o r b y l n o e n o d n o c e s e h T . t s a c d a o r b e h t g n i r u d

p u o r g a s e b i r c s e d e n o d r i h t e h

t -wisebroadcast .Thefeaturesofbroadcas tdimensionsareconcluded t

s i B d n a n o i s n e m i d e c r u o s e h t s i A e r e h w , s w o l l o f s

a het arge tone: )

a A.psize( )<B.psize)( ; )

b A.dsize( )=B.dsize)( ; )

c onlyAcontain sacollapseddimension;� )

d A’ sdataandproces sdimensioncorrespondst ot ha to fBr especitvely;� )

e botht hedataoffset sarei nas uccessives egmen twtihou toverlaporj ump.�

n o it c n u f e h t , e r o f e b d e n o i t n e m e w s

A pszieand d iszereturn theactua lsizeofprocessand loca l c

n i S . a t a

d e thedata on roo tprocess is sen tto mutliple desitnaitons ,the actua lprocess size of the e h t s i s d n e h t o b t a a t a d l a u t c a e h t d n A . e n o t e g r a t e h t f o t a h t n a h t s s e l y l e ti n i f e d s i n o i s n e m i d e c r u o s

. e z i s e m a

s Therepettiion of roo tmessageresutls in collapsed procordata dimension. Ifthere isa w

, n o i s n e m i d a t a d d e s p a ll o

c e wil lseparate the repeititon descripiton from data and turn i tinto a n

o i s n e m i d s s e c o r p d e s p a ll o

c fori denitficaiton .

t a h t s n a e m t i , s s e c o r p s ’ e n o r e h t o n a o t s d n o p s e r r o c a t a d s ’ e n o f

I data on one process wil lbe n

r e t t a p t s a c d a o r b a t o n y l e ti n i f e d s i h c i h w , s e s s e c o r p e l p it l u m o t d e t u b i r t s i

d ,this is why we have

) d ( e r u t a e

f .Theoffsetsmayno tbeenitrely conitnuous ,bu ttheymus tbein thesuccessivememory j

d n a p a l r e v o t u o h ti w t n e m g e

s ump .

m h ti r o g l A n o it a z i m it p

O . Ift hepatterni si denitfiedasabroadcast ,weseparatedataandprocess o

b r o f s n o it p i r c s e

d th the source and targe tdimensions .Therefore ,we ge tfour new dimensions p

t e g r a t , a t a d e c r u o s , s s e c o r p e c r u o s t n e s e r p e

r rocessandtarge tdata ,respectively .Thesedimensions d

e s u e r

a to describe thescatter-allgather communication . fI themessage is long enough ,the new m

h t i r o g l

a solutionisperformed .Otherwise,i tperformst hegenera lcopyto . d

r a d n a t s e h t l l a , e r o f e r e h

T ,group- swi e ,no tenitrely continuous data ,and many other various p

o e b n a c t s a c d a o r

b itmized wtih thisnewalgortihmin PADM .Theperformanceisdiscussedinthe .

n o it c e s t x e n

d n a l e d o m t s o

C Performance

l a r e v e s m o r f e c n a m r o f r e p s s u c s i d e w , n o it c e s s i h t n

I cases .Theperformanceismeasured on a 32 -e

l o M e h t f o t e s t s e t s e d o

n -8.5Supercompuitng[3] .Andt heMPIl ibraryi st heOpenMPI2.1 version . t

s o

C Model

s s e c o r p h c a e r o f n e k a t e m i t e h t t a h t e m u s s a e W . e c n a m r o f r e p e h t e t a m i t s e o t l e d o m e l p m i s a e s u e W

t e t e l p m o c o

t he communication can be modeled as mα+ n2 γ+Nβ ,where m is the number of ,

s e g a s s e

(9)

γist hepackingorunpackingtimeperbyte ,Ni st hetota lnumberofbytest ransferred, N=m* ,n andβ

. e t y b r e p e m it r e f s n a r t e h t s

i Weassumefurthertha tthecommunicaiton timealwaysoverlappedthe e

r u g i F e k il , e m it g n i k c a p n u d n a g n i k c a

p 7 )( . a

n o i t p m u s s a s ’ l e d o m t s o C . 7 e r u g i

F andperformancet est. f

i c e p s r o f y ti r a l u n a r g g n i n u t n e h

W ic case ,the value of Nβ is constant ,and m is inversely .

n o t l a n o i t r o p o r

p Ifcntisl argeenought ha ttherei snopackingorunpacking,t hecos tmodelt urnst o

mα+Nβ ,whichmeanst hatl essmessagesi sbetter. o

t y p o C l a r e n e G

t n o c y l e r it n e t o n a e s o o h c e

W inuouscircularneighborcommunicaitoncaseasfollows:

;) m u n p ( c o r p = P m i

d �

P = A m i

d *data(2 ,1024 )*data(n ,2048 )*data(1024 ;)

) 1 , P ( c il c y c = B m i

d *data(n*2048 .) ;) )t a o lf (f o e zi s ,r , B , s , A ( o t y p o c

n o it u l o s d e z i m it p o e h t d e c u d o r t n i e v a h e

W scanCommand bufCommforcopytofunciton .In this ,

e s a

c cntis4KB .Wesett hegranulartiyas1/64oftheloca ldata .Ifn=128 ,eachprocessholds1MB n

e h t , a t a

d g nra is 16KB .The performance is shown in Figure 7 )(b .The ‘cnt8’ represents a 8-n it

a c i n u m m o c s s e c o r

p on with cn tmessage length .Since the 4KB conitnuous length is no tlong e

h t , h g u o n

e bufCommsoluitoni ncreasest hemessagel engthandperformsbetter . n

i o p r e h t o n a n i s e i ti r a l u n a r g t n e r e f f i d s s u c s i d e w , e r o m r e h t r u

F t- ot -poin tcommunicaiton casethat 0

k n a

r processsends256MBdatat ot herank1process ,andt heconitnuousl engthi salso4KB .

( y a r r a = D m i

d (data(256)*data(256 ))*data(1024));

;) D ( w o r * )) D (l o c ( r o j a m _ l o c * ) 1 ( c o r p = A m i d

D * ) 1 + ) 1 ( c o r p ( = B m i d

;) )t a o lf (f o e zi s ,r , B , s , A ( o t y p o c

s i e c n a m r o f r e p e h

(10)

, e g r a l o o t s i y t i r a l u n a r g e h t f

I the2nγcos talot .Ifthegranularityistoosmall ,wehavetomanagea e

h t s e s a e r c n i h c i h w s e g a s s e m t r o h s f o t o

l startupcostmα. s

n o it a r e p O e v it c e ll o C d r a d n a t S

e h T . e c n a m r o f r e p e h t e r u s a e m o t n o i t a c i n u m m o c e v it c e l l o c c i s a b e h t s a e c n a t s n i l l a o tl l a e h t e k a t e W

t o n s e o d I P M n e p O n i l l a o t ll A I P M r o f m h ti r o g l

a attemp ttoschedulecommunication .Instead ,each n a y b d e w o ll o f , p o o l a n i d n e s I I P M e h t l l a n e h t , p o o l a n i v c e r I I P M e h t l l a s t s o p s s e c o r p

e h t o t r a l i m i s s i s i h T . l l a t i a W _ I P

M scanCommsoluiton . g

i

F u 7 )re ( d showstheperformanceforMPIAlltoal lversusPADMcopytowtihdifferen tthreads . e

h t s a e m a s e h t s i y ti r a l u n a r g e h T . a t a d B G 1 s n i a t n o c s s e c o r p h c a

E coun tsize .InPADMcases ,the e h t s w o h s h c i h w , s d a e r h t 4 f o e c n a m r o f r e p e h t n a h t r e tt e b y lt h g il s s i s d a e r h t 8 f o e c n a m r o f r e p

e l p it l u m g n i s u f o e g a t n a v d

a threads .The overhead in PADM does no tsignificanlty degrade the r

u o , y ll a u t c A . e c n a m r o f r e

p copytofuncitonperformsbetterwhent heprocessnumberi sl esst han32 . V

d e z i m it p

O ariou sBroadcas t

2 3 n o t s a c d a o r b d r a d n a t s f o e c n a m r o f r e p e h t e r u s a e m e w , t s r i

F processes .The algorithms of MPI ,

t s a c

B copyto and the new one(scatter- da -n allgather) are different .Their performance is shown in g

i

F u 7re (e)and7 )(f .Heret hemessagel engthequalst ot het otall oca ldatasize .Weseet ha tforshor t (

s e g a s s e

m ≤4MB) ,copyto performs better than both the MPI and the new one .And for long (

s e g a s s e

m ≥8MB),t henewalgorithmperformsbest . n

i s e s a c s u o i r a v e e r h t d e r u s a e m e w , n e h

T Figure 8 on32processeswtihdifferen tmessagesize .In f

o e u l a v e h t , t s il e h

t nisse tto512 ,andtheconitnuouslengthofD tdimensionis1/1024oftheloca l l

a r e n e g d n a n o it u l o s d e z i m it p o e h t f o e c n a m r o f r e p e h T . e z i s a t a

d copyto isshownin Table3 .Each

o t y p o

c containsone synchronization .TheBca tsOp tsoluiton thus issynchronized twice .Therefore , s

e m e h t n e h

w sageislonger ,theopitmized algortihm performsbetter .When themessagelength is e

h t , B M 2 1

5 Bca tsOp treduces88.3%communicationt imecomparedwtiht hegenera lsoluitoninthe .

e s a c t s r i f

. 8 e r u g i

F Variousbroadcas tpatterns.

. 3 e l b a

T Variousbroadcas tperformance.

e g a s s e M

h t g n e l

1 e s a

C Case2 Case3

t p

o copyto o pt copyto o pt copyto B

M

2 0.2 21 0.4 26 0.2 05 0.0 52 0.261 0.310 B

M

8 0.2 25 0.5 37 0.2 39 .00 79 0.258 0.340 B

M 2

3 0.2 56 0.5 72 0.2 60 0.2 09 0.256 0.355 B

M 8 2

1 0.326 1.620 0.290 0.728 0.404 0.911 B

M 2 1

5 0.778 6.640 0.684 2.808 0.913 5.200

n o is u l c n o C

. s n r e tt a p e v it c e l l o c s u o i r a v r o f n o i t u l o s l a r e n e g a s e d i v o r p M D A

P Theabstraciton ofdistribuitonis .

e l b i x e l f d n a e v it i u t n i s i h c i h w s e l u r e l p m i s l a r e v e s y b d e w o ll o

(11)

s r o t c a f d e t a l e r e c n a m r o f r e p e h t l o r t n o c o t e l b a d n a y ti l a c o l a t a d e h t f o e r a w

a . I tprovides more

e d o c e s i c n o

c s and greater scope for opitmization. Our work is open source on GtiHub , .

0 V m d a p / 1 0 0 2 9 1 s z / m o c . b u h t i g / /: s p tt

h Thefutureworki st odiscussmorescientificcases.

t n e m g d e l w o n k c A

. o N s t n a r G r e d n u a n i h C f o m a r g o r P D & R y e K l a n o it a N y b d e t r o p p u s s i h c r a e s e r s i h T

, 8 0 2 2 7 6 1 6 . o N s t n a r G a n i h C f o n o it a d n u o F e c n e i c S l a r u t a N l a n o i t a N d n a ; 2 0 5 0 0 2 0 B F Y 6 1 0 2

. 8 1 0 2 3 4 1

6 Thanks to the Mole-8.5 Supercompuitng System developed by Insttiute of Process .

s e c n e i c S f o y m e d a c A e s e n i h C , g n i r e e n i g n E

s e c n e r e f e R

] 1

[ Barnet tM. ,ShulerL. ,DeGejin R.V. ,Interprocesso rCollecitveCommunica iton Library ,High g

n i t u p m o C e c n a m r o f r e

P ,357-364 ,1994 . ]

2

[ ThakurR. ,RabenseifnerR. ,GroppW. ,Opitmizaitono fCollecitveCom- municaitonOpera iton s H

C I P M n

i ,Internationa lJourna lofHighPerformanceComputingApplications ,19(1) :49-66 ,2005 . ]

3

[ Xiaowe i Wang , We i Ge , The Mole-8.5 Supercompu itng System , Contemporary High

g n it u p m o C e c n a m r o f r e

P , .p 5p 7 -98 .NewYork :CRCPress ,2013 . ]

4

[ FoxG. ,OttoS.W. ,HeyA.J. ,e tal ,MatrixAlgortihm sona HypercubeI :MatrixMulitpilcaiton , 7

1 : ) 1 ( 4 , g n i t u p m o C l e l l a r a

P -31 ,1987 . ]

5

[ Krishnan M. ,NieplochaJ. ,Op itmi izng Paralle lMulitpilcaiton Operaiton f o rRectangula rand s

e c i r t a M d e s o p s n a r

T ,Internationa lConferenceonparalle landDistributedSystems ,257-266 ,2004 . ]

6

[ Agarwa lR.C. ,BalleS.M. ,Gustavson F.G. ,A Three-dimensiona lApproach t o Paralle lMatrix n

o it a c il p it l u

M ,IbmJourna lofResearchandDevelop- ment ,39(5) :5 -75 582 ,1995 . ]

7

[ ChanA. ,Balaj iP. ,GroppW. ,Communicaiton Analysi so fParalle l3DFFT f o rFla tCa tresian

s m e t s y S e n e G e u l B e g r a L n o s e h s e

M ,HighPerformanceComputing ,350-364 ,2008 . ]

8

[ F .Kjolstad and M .Snir .Ghos tCel lPa ttern ,In Workshop on Paralle lProgramming Patterns , .

0 1 0 2

] 9

[ Chen Y. , Cu i X. , Me i H. , PARRAY : a unfiying array representa iton fo r heterogeneou s m

s il e ll a r a

p ,Acm Sigplan Symposium on Principles and Pracitce of Paralle lProgramming ,47(8) : 1

7

1 -180 ,2012 . ]

0 1

[ Hoare C.A., Hayes I.J., Jifeng H., Lawsofprogramming,Communicationsof TheACM ,30(8) : 2

7

6 -686 ,1987 . ]

1 1

[ DraperJ.M. ,CullerD.E. ,Yelick K. ,Introducitont oUPCandLanguageSpeciifcaiton ,Center l

a n A e s n e f e D r o f e t u t i t s n I , s e c n e i c S g n i t u p m o C r o

f - yses ,1999 . ]

2 1

[ Coarfa C. , Dotsenko Y. , Mellorcrummey J. , An Evaluaiton o f Globa l Addres s Space o

C : s e g a u g n a

L -arrayFortran andUniifedParalle lC ,Acm SigplanSymposiumonPrinciplesand .

5 0 0 2 , g n i m m a r g o r P l e l l a r a P f o e c i t c a r P

] 3 1

[ Numrich R.W. ,Reid J.K. ,C -o array sin the Nex tFortran Standard ,ACM Sigplan Fortran :

) 2 ( 4 2 , m u r o

F 4-17 ,2005 . ]

4 1

[ Nieplocha J. ,Palmer B.J. ,Tipparaju V. ,Advances ,Appilcaiton sand Performance o fthe

ti k l o o T g n i m m a r g o r P y r o m e M d e r a h S s y a r r A l a b o l

G ,Internationa lJourna lof High Performance 3

0 2 : ) 2 ( 0 2 , s n o i t a c i l p p A g n i t u p m o

C -231 ,2006 . ]

5 1

[ Gropp W. ,Thakur R. ,Lusk E .Using MPI-2 :Advanced features of the message passing e

c a f r e t n