) 8 1 0 2 S M S M C ( s c it s it a t S l a c it a m e h t a M d n a n o it a l u m i S , g n il e d o M , l a n o it a t u p m o C n o e c n e r e f n o C l a n o it a n r e t n I 8 1 0 2 8 7 9 : N B S
I -1-60595- 25 -9 6
y
a
r
r
A
-l
e
v
e
l
C
o
ll
e
c
it
v
e
C
o
m
m
u
n
i
c
a
it
o
n
s
g
n
a
u
h
S
Z
H
A
N
G
d
a i
n
Y -
f
e
n
g
C
H
E
N
*a n i h C , y ti s r e v i n U g n i k e P , e c n e i c S r e t u p m o C d n a g n i r e e n i g n E l a c i r t c e l E f o l o o h c S
*Correspondingauthor
s d r o w y e
K :M , PI Distributeddata, Collecitvecommunicaiton.
t c a r t s b
A . TheMessagePassingInterface(MPI)providessomewell-definedcollectivei nterfacesfor . s e s s e c o r p e l p i t l u m n e e w t e b s e g a s s e m f o t e s
a However ,MPI can be improved .The collective u q e r l li t s s n r e t t a p n o i t a c i n u m m o c e v i t c e l l o c y n a m t a h t , d e z i l a i c e p s o o t s i I P M n i e c a f r e t n
i ire
. ) M D A P ( n o i s n e m i D l e l l a r a P l o o t l e v o n a e b i r c s e d e w , r e p a p s i h t n I . s r e m m a r g o r p y b n o it a it n a t s n i , e r o f e r e h T . s n o i t a c i n u m m o c e v i t c e l l o c f o s s a l c l a r e n e g e r o m a r o f I P M d n e t x e o t d e n i f e d s i M D A P n u m m o c e v i t c e l l o c r e h t o d n a s e c a f r e t n i I P M e h t h t o b , M D A P n
i icaitons are performed wtih the . n o i t u l o s l a r e n e g e m a
s PADM providesarray style abstraction forboth process topology and data ) S A G P ( e c a p S s s e r d d A l a b o l G d e n o i t i t r a P y l h g i h e h t e k i l n U . w e i v l a b o l g e h t m o r f n o i t u b i r t s i d p x e o t r e m m a r g o r p s w o l l a M D A P , l e d o
m lictily contro lthe distribution and communication .And . n o i t a z i m i t p o r o f e p o c s r e t a e r g s w o l l a d n a , s e c n e t n e s n o i t a c i n u m m o c e s i c n o c e r o m s e d i v o r p M D A P n o it c u d o r t n I I P
M is themos tpopular and ubiqutiouschoicefor distributed communication .I tprovideswe ll-r e t t a c s , r e h t a g , t s a c d a o r b s a h c u s , n o it a c i n u m m o c e v it c e ll o c d e z il a i c e p s e m o s r o f s e c a f r e t n i d e n i f e d t n i o p e h t f o d a e t s n i s e c a f r e t n i e v it c e ll o c e h t g n i y l p p a , s r e m m a r g o r p r o F . c t e d n
a - ot -poin tsentences
t c e ll o c e s e h t , r e v e w o H . y ti v i t c u d o r p e h t t s o o b l l i
w iveinterfacesaretoospeciailzedtocovervarious . s n r e tt a
p For example ,the group-wise collective operaiton is undefined in MPI .To apply these it l u m n i s e c a f r e t n
i ple group ,the programmers have to use the MPI communicator to construc t u m m o c d e n o it it r a
p nication contex tand process groups .However ,communicator creaiton is an n o i t a r e p
o that exchanges informaiton between processes wtih expensive synchronization cost . e tt a p n o i t a c i n u m m o c s i h T . n o it a c i n u m m o c r o b h g i e n r a l u c r i c e h t s i e l p m a x e d e n i f e d n u r e h t o n
A rn is
h c a E . g n i s s e c o r p e g a m i d n a n o it a l u m i s c i f i t n e i c s s a h c u s , s m e l b o r p l i c n e t s n i d e s u y l n o m m o c . s r o b h g i e n s t i h t i w a t a d s e g n a h c x e s s e c o r
p The programmers have to use the point- ot -poin t , d n e S _ I P M d n a v c e r I _ I P M e k il , s e c n e t n e
s ort heMPIt opologiessuchasMPI_Cart_shifti nterfaceto s i h t m r o f r e
p kindofcommunication . s h
t r u
F ermore ,i tispossibleto design specificopitmizaiton foreach collecitveinterfaceaccording e b e v a h , H C I P M d n a I P M n e p O s a h c u s , I P M r o f s e i r a r b il d e t n e m e l p m i e h T . n o it i n i f e d s ti o
t en
e v it c e ll o c f o n o i t a z i m it p o e h t n o d e s u c o f s e i d u t s y n a m e r a e r e h T . s e c a f r e t n i l a r e v e s n i d e z i m it p o g n i d n e p e d s m h t i r o g l a e l p it l u m e s u y e h t , e c a f r e t n i h c a e r o F . e c n a m r o f r e p e h t e v o r p m i o t s n o i t a r e p o s a h c u s , e z i s e g a s s e m e h t n
o [1] ,[2 .T] heoptimization only appilesto thesespecialized patternsin . s m h ti r o g l a e s e h t r o f e l b a ti u s e r a s n r e t t a p s u o i r a v y n a m , y ll a u t c A . I P M k r o w r u o e b i r c s e d e w , r e p a p s i h t n
I PADM on providing asinglegenera lsolution forcollecitve e t n i l a r e n e g s i h T . s n o it a r e p o n o i t a c i n u m m o
c rfaceisbasedon anarray-levelalgebraicrepresentaiton
n o is n e m i
d .The construction of dimension follows severa lsimple rules which is easy to learn . n o i s n e m i
D containsboththeprocesstopologyanddatadistribution .Therefore ,eachcommunicaiton m
i s s
i plifiedasapairofdimensionsandacopytosentenceinPADM .Forprogrammers ,ourworkis . y t i v it c u d o r p e h t t s o o b o t e l b
a Inaddiiton ,thedimension i a s binomina ltreestructure .I tispossible toanalyzethestructureand identify thecommunication pattern .Then wecan design theopitmized
. n r e tt a p h c a e r o f m h ti r o g l a a f o l e d o m s i s a b e h t s i S A G P . S A G P e h t n o d e s u c o f s n o it a c i n u m m o c g n i y f i l p m i s n o k r o w y l r a E o C , C l e l l a r a P d e i f i n U s a h c u s , s e g a u g n a l f o r e b m u n e g r a
. s e l b a i r a v d e r a h s g n i d i v o r p y b y t i v it c u d o r p r e m m a r g o r p t s o o b s e g a u g n a l e s e h T . s s e c o r p h c a e o t l a c o l
a u g n a l e h t f o t s o m n i t i c i l p m i s i n o it u b i r t s i d e h t , r e v e w o
H ges .The runtime performs one-sided . s s e c o r p h c a e n o t n e m e r i u q e r a t a d e h t o t g n i d r o c c a y l l a c i t a m o t u a s e c n e t n e s n o i t a c i n u m m o c
d e r a p m o c , s u h T . y ti l a c o l a t a d e h t f o e r a w a e b d l u o h s s r e m m a r g o r p , e c n a m r o f r e p e h t g n i r e d i s n o C
ti c il p x e s e d i v o r p k r o w r u o , S A G P h t i
w distributionandcommunication. n
o i t c e S : s w o ll o f s a d e z i n a g r o s i r e p a p e h t f o t s e r e h
T I I introduces the construciton rules for l
a r e n e g e h t d n a n o i s n e m i
d copytosentence ,thenSeciton IIIcomparestheMPIandPADMcodesin a
h t w o h s o t s e s a c l l a m s l a r e v e
s tourwork providesmore concise program ,and investigatesinto a .
n o i t a c il p i tl u m x i r t a m n o y d u t s e s a
c Seciton I V introduces the implementation and opitmization . n
o it c e
S V illustratesanddiscussest hecos tmode landperformance .SecitonVIconcludesourwo rk.
e ll a r a
P l Dimen ison
. n o it a c i n u m m o c e v it c e ll o c s u o i r a v r o f e c a f r e t n i l a r e n e g a e d i v o r p o t I P M f o n o i s n e t x e n a s i M D A P
n o it p e c n o c r u o n o d e s a b s i e c a f r e t n i s i h
T dimen ison .In thisseciton ,weintroducetheconstruction s
e l u
r inPADMt oprovideani ntutiiveunderstanding . s
e l u R n o it c u r t s n o C
s w o ll o f n o i t c u r t s n o c e h
T severa lsimple rules ,as shown in Equation (1) and (2). These rules a re .
n o i t u b i r t s i d n o m m o c e h t e b i r c s e d o t h g u o n e e l b i x e l f
D= gt (a size ,step, disp) ( 1)
S= D0∗...∗Dn| array(D0∗...∗Dn)| array (Sr fe, Sb eas ) (2)
, t s r i
F in Equation (1) ,each basic dimension( D)hasonetag and threefundamenta lcomponents :
e zi
s , tsep and dsip. There are three keywords for tag ,dim(dummy) ,proc(process )and data tha t t
n e r e f f i d t n e s e r p e
r elements .Dummy is a void type withou tspecific data or process informaiton . f
o r e b m u n e h t s i e z i
S elementswith defaul tvalue 0 .Step is thedistance between two elementsin e
s quence ,which is also called stride in PGAS [14] .Its defaul tvalue is 1 ,which means the d
n A . n o it a u t i s s u o u n it n o
c disp meansthe displacemen twtih defaul tvalue0 .Elements’ offse twli l e
v o
m a tacertaindistanceaccordingtothis. Foreachindexindimension ,i∈(0, size−1) ,theoffse tis :
s a d e t a l u c l a c
t e s f f
o (i) = (i%size+disp)∗step. (3) t
e c n i
S he defaul tvalues mentioned before can beomitted .Le tus look a ta simple construciton )
2 ( n o i t a u q E n i e l u r t s r i f e h t o t g n i d r o c c
a :
) 4 ( c o r p = A m i
d *data(4 ,16 )*data(16 .)
‘ r o t a r e p o n o i t a c i l p i tl u m e h
T ∗’isusedto construc tmutli-dimensions .ThisdimensionAdescribes n
o a t a d d e t u b i r t s i
d 4processes ,andeachprocesscontains64 conitnuousdataelements .I thasthree o
w t a o s l a s i t i t u b , s n o i s n e m i
d -dimensiona larraywtih16*16elements ,oraone-dimensiona larray h
t i
w 256 elements .Eachdimensionisconstructedasabinomia ltreeaccordingtothemutlipilcation . 1 e r u g i F n i n w o h s s a s e e r t t n e r e f f i d e c u d o r p l li w r e d r o e h t e g n a h c o t s t e k c a r b e h t g n i s U . r e d r
o
r e d r o n o i t a c i l p i t l u m t n e r e f f i d h t i w s e e r t n o i s n e m i D . 1 e r u g i
t n o c a e b i r c s e d o
T inuoussequence ,wehavetose tthecorrec tstepvalueforeach dimension ,like
) 6 1 , 4 ( a t a
d inA. Orwecanuset hesecondruleinEquaiton(2)tofillt hestepvaluesautomaitcallyfor :
w o l l o f s a n o i t a u t i s s u o u n it n o c
) 4 ( c o r p ( y a r r a = A m i
d *data(4 )*data(16).)
e h
T l ast rule in Equaiton (2) is called reference mechansim .I tisused to describe more flexible e
h t , s n o i s n e m i d o w t s n i a t n o c e c n e r e f e r A . s n o it u b i r t s i
d re fdimension Sr fe and the base o Sne base ,
d e t c u r t s n o c s i h c i h
w asfollowexample:
.) A , ) 8 ( m i d * ) 8 ( m i d ( y a r r a = R m i d
, e s a c s i h t n
I Aisthebaseone ,and dim(8)*dim(8) isthenewdescriptionbased on A .Weusethe
m i
d type herewtihou tspecificdataor process information .SinceA isconfigured asa16*16 two -n
e h T . x i r t a m l a n o i s n e m i
d R describesan 8*8 s -ub matrix in thetoplef tcornerofA .Theoffsetsare e
n n i s e t a l u c l a
c stedway,R.offset(i)=base.offset(ref.offset(i)) .
n o it c n u f e d i v o r p e W . t e s f f o a t a d l a c o l d n a r e b m u n s s e c o r p : t e s f f o f o s d n i k o w t e r a e r e h t , y ll a u t c A
e d o
n and offse t to calculate them respecitvely , ilke A.nodei( ) and A.ofsfe(t)i. Moreover , for y
ti c i l p m i
s ,we u ‘se +’ operator to se tthe d sip value .For example, the rank 1 process can be h
c i h w , 1 + ) 1 ( c o r p s a d e b i r c s e
d isbettert hanproc(1,1,1). o
t y p o C
s i e c a f r e t n i n o it a c i n u m m o c l a r e n e g e h
T definedas:
S m i d ( o t y p o
c s ,void*s ,dimSt ,void*t,i n telem_szie;)
S e r e h
w s ist hesourcedistribution ,andStist het arge toneoft hesamesize .Thevoidpointersaret he
d a e r o n s i e r e h t f I . s e s s e r d d a a t a d t e g r a t d n a e c r u o
s -wrtieconflic,tt heprogrammerscanuset hesame e
t e m a r a p t s a l e h T . s r e t n i o p h t o b r o f s s e r d d
a rrequirest hesizeofdataelement ,suchasszieo(flfoa)t . w
o R d n a n m u l o C
it l u m y r e v e e c n i
S -dimension is constructed as a binomia ltree ,i tis also a two-dimensiona larray . t
o o r e h t f o n e r d li h c o w t e h t d n
A /parentnoderepresen tthecolumnand row dimension ,respecitvely . s
n o i t c n u f e s u n a c e
W co(lA )and row(A )to ge tthem from dimension A .The defaul torderisrow -n
o i t c n u f e h t g n i s u , r o j a
m col_major(A )wli lchange tii ntocolumn-major . s
n o it c n u F m r o f s n a r T
, r e d r o t n e r e f f i d e b i r c s e d o t e l b a s i e c n e r e f e r e h
T er -division ,portion and repeititon of thebasic l
l e w s e d i v o r p M D A P . e n
o -defined transform funcitons to simplify the reference declaration ,and n
i n w o h s e r a s n o i t c n u f d e s u y lt n e u q e r f e m o
s Table1 .Thefirs tfourfuncitonswli ltrea ttheinpu tD m
i d e n o s
a ension ,whlietheothersrequiretha tDistwo-dimensional .Thefunctionmutl iwli lreturn acollapsed dimension ,whichmeanstha tthereferencesizeislargerthanthebasesize .Weprovide
n o i t c n u
f d isze and p isze to ge t the actua l process and loca l data size . For example ,
)( e z is p .) 4 ,) 1 ( c o r p (i tl u
m returns1 .
. 1 e l b a
T TransformFunction. n
o it c n u
F Usage
) k t n i , D m i d ( w o l _ t e g m i
d ge tkunitsfromDt ha tneart heorigin )
k t n i , D m i d ( h g i h _ t e g m i
d ge tkunitsfromDt ha tfarfromt heorigin ,
D m i d ( i t l u m m i
d in tk) repea tDkt imes )
k t n i , D m i d ( c i l c y c m i
d cyclicDwithkunits )
k t n i , D m i d ( w o R c i l c y c m i
d cyclicrowdimensionofD )
k t n i , D m i d ( l o C c i l c y c m i
d cycliccolumndimensionofD )
k t n i , D m i d ( r o j a m _ k c o l b m i
Con isceCode
I P M n i e ti r w o t d r a h s i t a h t I n o it c e S n i s e s a c e m o s d e s s u c s i d e v a h e
W .In thisseciton ,we take a e
s a c e l p m i
s asexampletocomparet heMPIandPADMcodes.Thecommoncodei somittedi nboth h
t t a h t r a e l c s i t I . e c n e r e f f i d e h t w o h s o t s n o i s r e
v e PADM program is concise and easy to .
g n i d n a t s r e d n
u Thenwewil ldiscussthedistribuiton-independen talgortihmandacasestudy ,matrix .
n o it a c il p it l u m
n o it a c i n u m m o C r o b h g i e N r a l u c r i C
s s e c o r p h c a E . n o i t a c i n u m m o c r o b h g i e n r a l u c r i c e h t s i e s a c t s r i f e h
T sends100 floa telementsto tis e
r a e r e h T . e n o t f e l e h t m o r f t n u o c e m a s e h t s e v i e c e r d n a , r o b h g i e n t h g i
r pnumprocessesi nt otal .The n
i n w o h s s i n o i s r e v I P
M Figure 2(a) .Programmers should consider many detalis ,such as the c
o l b , r e b m u n s s e c o r p n o it a n i t s e
d king ornon-blocking interfaces ,avoiding deadlock (n -on blocking e
h t g n i r u s n e d n a ) t s r i
f completion (MPI_Wati) .Even using the MPI topology interface wli lno t e
d o c f o t n u o m a e h t e c u d e
r .Whliein PADMversion ,asshownin Figure2(b) ,thereisonlyonetask r
o f s n o i s n e m i d t e g r a t d n a e c r u o s e h t n g i s e d o t s i t a h
t copytofunciton .Weusethet ransformfunciton
c il c y
c heretodescribet hedesitnationprocesses .
n o i t a c i n u m m o c r o b h g i e n M D A P s v I P M . 2 e r u g i
F .
n o it u b i r t si
D -independent d
n o c e s e h
T caseisa ilbraryfuncitonofmatrixtranspose .Whendesigninga ilbraryfunctioninMPI , w o r r o k c o l b s a h c u s , x i r t a m t u p n i f o n o it u b i r t s i d l a n i g i r o e h t f o e r a w a e b t s u m s r e m m a r g o r p e h t
m t n e r e f f i d n i tl u s e r s n o it u b i r t s i d l a n i g i r o t n e r e f f i d e h T . n o it u b i r t s i d a t a d d e r e tt a c
s essagepassingpat-
a n g i s e d o t e l b a s i t i , M D A P n i , r e v e w o H . s n r e
t di ts irbuiton- independen talgortihm .I tmeansthe r
b u s n i y r a s s e c e n n u s i t u o y a l a t a d c i f i c e p
s ouitne .Ourfunciton isshown in Figure3 .I trequirestha t A
n i s n o i s n e m i d w o r d n a n m u l o c e h
t represen tthecolumnand rowofthematrix ,respectively .We b
i r c s e d o t s n o i s n e m i d w o r d n a n m u l o c e h t e g n a h c x
e e the targe tdistribution B .Also we can use
_ l o
c major(A )toge ttiatlernaitvely .
e r u g i
F 3. Distribution-independentt ransposei nPADM.
t S e s a
C udy :MatrixMul itpilca iton
A = C t a h t n o it a c il p it l u m x i r t a m f o m h ti r o g l a D 2 l a n o it i d a r t e h t n
u l o c d n a s w o r e l p i tl u m t e g t s u
m mnsfromAandB .IneachmatrixdistributionofAandB,i ft hesize P
s i e z i s n o i t a c i n u m m o c e h t , P s i d i r g s s e c o r p f
o 1/2 itmesoft hematrixsize .
e e r h t
A -dimensiona lapproach [4]–[6] to paralle lmatrix mutlipilcation has a factor P1/6 less
o i t a c i n u m m o
c nthanthe2Dalgortihm .Inthisalgortihm ,theprocessesareconfigured asa3Dcube . s
s e c o r p h c a
E g etsa singlesub-matrix from A and B ,then performsa loca lmatrix mulitplicaiton . b
u s l a r e v e s , n o it a t u p m o c e h t r e t f
A -matrices of this produc t mus t be sent to their desitnation e
h T . C x i r t a m g n it l u s e r e h t h ti w r e h t e g o t d e m m u s n e h t d n a s e s s e c o r
p computation is described in g
i
F u 4re .Therearet hreecommunicationsi nt hisalgorithm .Thefirs tonedistributest hesub-matrices o
d n o c e s e h T . s e s s e c o r p l l a o t A f
o nedistributest hoseofBwtihdifferen tdesitnations .Andt het hird .
s s e c o r p h c a e n o t l u s e r h c a e r o f s t c u d o r p e h t l l a s r e h t a g e n
o Thecommunicaitons are complicated . n
i , e r o m r e h t r u
F MPI,t heprogrammersmus tbeawareoft heorigina ldistribu itonsofdata.
. 4 e r u g i
F Cubealgorithmformatrixmultiplication. e
h t e ti r w o t r e m m a r g o r p r o f e l b i s s o p s i t i , M D A P n
I di ts irbuiton-independen talgorithm from an e
v it i u t n
i way .In ourimplementaiton ,asshown inFigure5 ,theinpu tdimensionsareconsidered as o
w
t -dimensional .Each processperformsloca lsub-matricesmulitpilcaiton of size len*len .And the m
s i C d n a B , A f o r e b m u n s k c o l
b *n ,n*kandm*k ,respecitvely .Therefore ,weneedaprocesscube m
e z i s f
o *n*k .Thedeclarationofeachcubedimensioni sseparate ,asshowninLine7-9 .Beforet he b
u s e h t , n o it a t u p m o c l a c o
l -matricesofAarescatteredamongt hePm*Pnprocesses ,andarerepeated e
h t g n o l
a Pkdimension .Similarly ,thematrix BisscatteredamongthePn*Pkgrid ,and isrepeated e
h t g n o l
a Pmdirection .
M D A P n i m h t i r o g l a e b u C . 5 e r u g i
b u s l a c o l e h
T -matrix is described as Blk in Line 11 ,which is continuous in memory .Bu tthe e h t g n i r e d i s n o C . s e s s e c o r p e l p it l u m n o r o s u o u n i t n o c y l e r it n e t o n e b y a m a t a d t u p n i g n i d n o p s e r r o c
i d e h t f o n g i s e d e h t , e c n a m r o f r e
p mensionsmus tensuretha tthecommunication conitnuouslength is n
o i t p i r c s e d a t a d l a c o l e h t , e r o f e r e h T . h g u o n e g n o
l Blkisatt herightmos tsideoft het arge tdimension . e
h t e s u e w d n A . h t g n e l s u o u n it n o c t s e g n o l e h t s a h n o i s n e m i d t e g r a t e h t , t s a e l t
A block_majo r
w o r l a n i g i r o e h t t s u j d a o t n o i t c n u
f -major inpu tdimensions into block-major order .By using the
it l u
m -funciton ,theadjustedoffsetsarerepeatedsevera l itmes .FromLine14-20 ,matrix Aand Bare .
s e s s e c o r p e b u c e h t n o d e t a e p e r d n a d e t u b i r t s i d
t f
A ertheloca lcomputaiton gemm ,theproductsalongthePndireciton mus tbesummedtogether .
e h t e s u e
W CP dimension to enlarge the dataoffsetsscope ,and allocate memory to receivethese .
l a c o l n o s t c u d o r
p ThePn dimension correspondsto theCP dimension. And theproduc tblockson k
P * m
P correspondt ot heblock-majorofC ,asshowni nLine24-26 .Therefore,t heproductsaresen t . C x i r t a m g n it l u s e r e h t h ti w r e h t e g o t d e m m u s n e h t d n a s e s s e c o r p n o i t a n it s e d e v it c e p s e r r i e h t o t
e d o c r u o n
I ,only the malloc and rfee sentences are omitted .The code is concise and easy to n
I . n o it u b i r t s i d l a n i g i r o t n e r e f f i d s t p e c c a m h t i r o g l a s i h t , y ll a i c e p s E . d n a t s r e d n
u Figure6 ,wedescribe
a C . n o it u b i r t s i d a t a d d e r e t t a c s w o r s i e n o e s a C . A x i r t a m r o f s n o it u b i r t s i d t n e r e f f i d e e r h
t setwo isa
n o n r e t t a p d e r e t t a c s k c o l
b r* rprocessgrid ,andeachblockisasquaresub-matrix .Casethreeisalso n
o i t u b i r t s i d k c o l b
a wtih differen tgrid pattern ,whereblock isarectanglepar tofthematrix .These n
r e tt a
p s arealsoavailableformatrixBandC .Differen tdistribuitonsresul tin differen tperformance e z i s f o n o i t a c il p i tl u m x i r t a m e h t f I . h t g n e l s u o u n it n o c n o it a c i n u m m o c t n e r e f f i d e h t f o e s u a c e b
n i n w o h s s i e c n a m r o f r e p e h t , s e s s e c o r p 4 6 n o l e ll a r a p s i 2 9 1 8 * 2 9 1
8 Table2 .Thelas tcolumnc ntis t s e g r a l e h t f o e s u a c e b t s e b s m r o f r e p e s a c d r i h t e h T . n o it a c i n u m m o c n i a t a d s u o u n it n o c f o r e b m u n e h t
t n
c value .I tis possible to design a distribution-independen talgorithm in PADM by using the l
a r e n e g d n a m s i n a h c e m e c n e r e f e
r copyto function .Therefore ,tuning theperformancewtihdifferen t .
y s a e s i s t u p n i
. 6 e r u g i
F Variousi npu tdistributionforcubealgorithm.
n o i t a c i l p i t l u m x i r t a m n i s n o i t u b i r t s i d t n e r e f f i d h t i w e c n a m r o f r e P . 2 e l b a
T .
) s ( e m i
T CopyA CopyB CopyC Total c nt w
o r : 1 e s a
c 0.223 0.254 0.233 0.710 211
k c o l b : 2 e s a
c 0.282 0.338 0.320 0.940 210
e l g n a t c e r : 3 e s a
c 0.180 0.179 0.157 0.516 220
t n e m e l p m
I a itonandOp itmiza iton M
D A
P is implemented by using MPI ,OpenMP and C++ .We defined i tas a ilbrary tool tha t a
s i t I . I P M h ti w e l b it a p m o
c ilghtlyextension tha tallowsprogrammerstouseanabstrac tandlogic n
o i t p i r c s e
d ofdatawhich isseparatesfrom theactually physica lmemory. I tiseasyto use ,PADM e
m o s d d a o t d e e n y l n o r e m m a r g o r p e h T . d e d u l c n i e b o t e li f d a e h a s e d i v o r
p options such as ‘
-g n il i p m o c n e h w ’ p m n e p o
f .
e h
T genera lcopytofuncitonrequiressourcedimensionandthesamesizetarge tone .Byscanning e
p o c s e z i s e h t n i s e x e d n i e h t l l
a onlyonce itme ,wecancalculatetheelemen tlocationsa tbothends , t
n e m e l e s i h t y p o c n e h t d n
a fromsourcetotarge tbyusingpoint- ot -poin tmessagesentences .Thisis .
it l u M d n a g n i p p a l r e v
O -thread s
i p p a l r e v o s i d o h t e m n o i t a z i m it p o n o m m o c
A ng thecompuitngand communication .In oursoluiton , h
t o t s r e f e r g n i t u p m o c e h
t elocaiton calculaiton foreachindex ,calledscan .WechooseOpenMP to u
m h c n u a
l litplethreadson each process ,bu tonlyonethread isallowed to usethemessagepassing e
h t , r e v e w o H . s e c n e t n e
s scan work is parttiioned among severa lthreads .Each computing thread e h t n o s i n o i t a c o l t e g r a t r o e c r u o s e h t f I . s n o it a c o l e h t s e t a l u c l a c d n a n o i s n e m i d e h t f o n o i t r o p a s n a c s
e v i e c e r r o d n e s a h s u p e w n e h t , s s e c o r p t n e r r u
c request into globa lqueue .Once thequeue isno t n
o it a c i n u m m o c e h t , y t p m
e thread wli lpop up each reques tfrom the queue and turn i tinto a _
I P M e s u e w n e h T . e c n e t n e s v c e r I _ I P M r o d n e s I _ I P
M Tes tto ensure tha tthese requests are n
o it c n u f a e d i v o r p e w , r e v o e r o M . d e t e l p m o
c setTnumtosett henumberofscant hreads . M
D A
P -ledOp itmiza iton i
d M D A P e h
T mension is an analyzable binomia ltree .We can ge tinformaiton from the structure n
i g n i n n a c s t u o h t i
w dexes .Now we are going to discuss some opitmizations tha tbased on the .
s i s y l a n a
h t g n e L s u o u n it n o
C .Eachdimensionrepresentsasequence ofdataoffsets. Theseoffsetsmaybe n
i s u o u n i t n o
c address ,such as proc(4)*data(100) . In other cases , they may no t b e entirely e
k i l , s u o u n i t n o
c data(10,20)*data(10) ,tha tevery 10 elements is continuous in memory . The f
o r e b m u
n regularcontinuouselementson each processisconfiguredasthecontinuouslength .We e
s u n a
c A.cntLen()togeti .t
t s e t a e r g e h T . h t g n e l s u o u n it n o c t n e r e f f i d e v a h y a m n o it a c i n u m m o c e n o n i s n o i s n e m i d o w t e h T
h t g n e l s u o u n it n o c e h t s i s e u l a v o w t e s e h t f o r o s i v i d n o m m o
c cn tof this communication .I tmeans y
r e v e t a h
t cn tindexes refer to the same source and target process number ,and their offsets are s i t n e m g e s s u o u n it n o c h c a e n i x e d n i t s r i f e h t y l n o , e r o f e r e h T . s d n e h t o b t a y r o m e m n i s u o u n it n o c
c n i s i h t g n e l e g a s s e m e h t , e l i h w n a e M . d e n n a c s e b o t d e e
n reased to cnt .This soluiton called
m m o C n a c
s .
y ti r a l u n a r
G .To optimizethecommunication whencntissmall ,weintroduceanotherargumen t
n a r
g ht a tmeans the granulartiy .The message length is se tto this granularity rather than cnt . f
o e u l a v e h t , y l l a u s
U gran is severa ltimesthesize of cnt .Therefore ,each process packs severa l n o i t a n i t s e d e h T . t i d n e s d n a k c o l b r e f f u b e n o o t n i n o i t a n i t s e d e m a s e h t h t i w s t n e m g e s s u o u n i t n o c
s i e g a s s e m e v i e c e r a e c n O . e g a s s e m e h t e v i e c e r o t r e f f u b r e h t o n a s e d i v o r p s s e c o r
p completed ,the
.t e s f f o t h g i r e h t o t t n e m g e s s u o u n i t n o c h c a e e v o m d n a r e f f u b e h t k c a p n u l l i w s s e c o r p n o i t a n i t s e
d In
m r o f r e p o t s d a e r h t f o e z i s e m a s e s u e w , n o it a u ti s s i h
t packing and unpacking seperately. In the e
r o h p a m e s , d a e r h t n o i t a c i n u m m o
c isusedtonoticet heunpackingt hreadt ha tabufferblocki sready .
d e k c a p n u e b o
t ThissolutioncalledbufComm. a
n g i s e d e w , n o it i d d a n
I buffe rpoo ltorecyclet hebufferblocks .Thepooli sparttiionedi ntomany e
z i s f o s k c o l
b gran .Andt heseblockaddressesaremanagedbyagloba lqueue .Whenanewblocki s a e t a c o l l a l l i w e w , y t p m e s i e u e u q e h t f I . e u e u q e h t m o r f p u d e p p o p s i s s e r d d a r e f f u b a , d e r i u q e r
d n a l o o p e h t h t o b , n o i t a c i n u m m o c e h t r e t f A . e s u r e t f a d e l c y c e r o s l a s i h c i h w k c o l b y r a r o p m e t
e r a s k c o l b y r a r o p m e
t released . l li w n o it c n u f o t y p o c e h
T analyzetheinpu tdimensionsandse targumentsautomatically including
t n
c ,gran ,the size of buffer poo land etc .Also we provide funcitons for programmers to contro l s
a h c u s , m e h t f o e m o
s setGranByte sandsetBufBytes . y
ti l a c o
L .Byanalyzingthesourceand targe tdimensions ,wecan discussthelocality .Ifthedata h t o b t a s r e b m u n s s e c o r p e h t d n a , t e g r a t d n a e c r u o s n e e w t e b r e h t o h c a e o t d n o p s e r r o c s n o i s n e m i d
t n e m e v o m a t a d e h t t a h t s n a e m t i , e c n e u q e s e m a s e h t n i e r a s d n
e onlyoccursonl ocal ,suchas :
) 4 ( c o r p = S m i
d *data(10 ,10 )*data(10 ;)
) 4 ( c o r p = R m i
y b d e m r o f r e p s i t n e m e v o m e h t , n o it a u ti s s i h t n
I memcpyinsteadofmessagepassing .Thisbranch d
e ll a c s i n o i t u l o
s localCopy. The copyto funciton chooses the proper optimized branch soluiton s
n o i s n e m i d e h t o t g n i d r o c c
a ’i nformation .
n o it a r e p O e v it c e ll o C r o f n o it a z i m it p O d n a n o it a c if it n e d I
x e n a s a n o i t a r e p o t s a c d a o r b e k a t e
W ampleto discussthealgorithm leve lopitmizaiton .Thereisan t
p
o imizaiton forlong messagebroadcast .The message isfirs tdivided up and scattered among the e W . r e h t a g ll a n a o t r a li m i s , s e s s e c o r p l l a o t k c a b d e t c e ll o c n e h t e r a a t a d d e r e t t a c s e h T . s e s s e c o r p
e h t d e r u s a e
m u -n optimized MPI Bcas tof OpenMPI and this new algortihm on 32 processes .The w
e
n algorithm reduces up to 74.5% itme when the message length is 1GB . This exisitng d
e i l p p a e b n a c n o it a z i m it p
o forbothstandardandmorevariousscenariosinPADM .Forthesemore e
m i d t e g r a t d n a e c r u o s e h t , t s a c d a o r b l a r e n e
g nsion have some common features. I tis possible to e
h t e z y l a n
a dimensions’featurestoi dentifyt hecommunicaitonpattern . n
o it a c if it n e d n
I .Wechoosethreecasesofbroadcas tasshownin Lis t7 .Thefirs tcasedescribes t
o n h t i w s e s s e c o r p l l a n o t s a c d a o r
b enitrely continuous message ,and the data layou tis changed d n A . s e s s e c o r p r e b m u n n e v e o t a t a d t o o r e h t t s a c d a o r b y l n o e n o d n o c e s e h T . t s a c d a o r b e h t g n i r u d
p u o r g a s e b i r c s e d e n o d r i h t e h
t -wisebroadcast .Thefeaturesofbroadcas tdimensionsareconcluded t
s i B d n a n o i s n e m i d e c r u o s e h t s i A e r e h w , s w o l l o f s
a het arge tone: )
a A.psize( )<B.psize)( ; )
b A.dsize( )=B.dsize)( ; )
c onlyAcontain sacollapseddimension;� )
d A’ sdataandproces sdimensioncorrespondst ot ha to fBr especitvely;� )
e botht hedataoffset sarei nas uccessives egmen twtihou toverlaporj ump.�
n o it c n u f e h t , e r o f e b d e n o i t n e m e w s
A pszieand d iszereturn theactua lsizeofprocessand loca l c
n i S . a t a
d e thedata on roo tprocess is sen tto mutliple desitnaitons ,the actua lprocess size of the e h t s i s d n e h t o b t a a t a d l a u t c a e h t d n A . e n o t e g r a t e h t f o t a h t n a h t s s e l y l e ti n i f e d s i n o i s n e m i d e c r u o s
. e z i s e m a
s Therepettiion of roo tmessageresutls in collapsed procordata dimension. Ifthere isa w
, n o i s n e m i d a t a d d e s p a ll o
c e wil lseparate the repeititon descripiton from data and turn i tinto a n
o i s n e m i d s s e c o r p d e s p a ll o
c fori denitficaiton .
t a h t s n a e m t i , s s e c o r p s ’ e n o r e h t o n a o t s d n o p s e r r o c a t a d s ’ e n o f
I data on one process wil lbe n
r e t t a p t s a c d a o r b a t o n y l e ti n i f e d s i h c i h w , s e s s e c o r p e l p it l u m o t d e t u b i r t s i
d ,this is why we have
) d ( e r u t a e
f .Theoffsetsmayno tbeenitrely conitnuous ,bu ttheymus tbein thesuccessivememory j
d n a p a l r e v o t u o h ti w t n e m g e
s ump .
m h ti r o g l A n o it a z i m it p
O . Ift hepatterni si denitfiedasabroadcast ,weseparatedataandprocess o
b r o f s n o it p i r c s e
d th the source and targe tdimensions .Therefore ,we ge tfour new dimensions p
t e g r a t , a t a d e c r u o s , s s e c o r p e c r u o s t n e s e r p e
r rocessandtarge tdata ,respectively .Thesedimensions d
e s u e r
a to describe thescatter-allgather communication . fI themessage is long enough ,the new m
h t i r o g l
a solutionisperformed .Otherwise,i tperformst hegenera lcopyto . d
r a d n a t s e h t l l a , e r o f e r e h
T ,group- swi e ,no tenitrely continuous data ,and many other various p
o e b n a c t s a c d a o r
b itmized wtih thisnewalgortihmin PADM .Theperformanceisdiscussedinthe .
n o it c e s t x e n
d n a l e d o m t s o
C Performance
l a r e v e s m o r f e c n a m r o f r e p s s u c s i d e w , n o it c e s s i h t n
I cases .Theperformanceismeasured on a 32 -e
l o M e h t f o t e s t s e t s e d o
n -8.5Supercompuitng[3] .Andt heMPIl ibraryi st heOpenMPI2.1 version . t
s o
C Model
s s e c o r p h c a e r o f n e k a t e m i t e h t t a h t e m u s s a e W . e c n a m r o f r e p e h t e t a m i t s e o t l e d o m e l p m i s a e s u e W
t e t e l p m o c o
t he communication can be modeled as mα+ n2 γ+Nβ ,where m is the number of ,
s e g a s s e
γist hepackingorunpackingtimeperbyte ,Ni st hetota lnumberofbytest ransferred, N=m* ,n andβ
. e t y b r e p e m it r e f s n a r t e h t s
i Weassumefurthertha tthecommunicaiton timealwaysoverlappedthe e
r u g i F e k il , e m it g n i k c a p n u d n a g n i k c a
p 7 )( . a
n o i t p m u s s a s ’ l e d o m t s o C . 7 e r u g i
F andperformancet est. f
i c e p s r o f y ti r a l u n a r g g n i n u t n e h
W ic case ,the value of Nβ is constant ,and m is inversely .
n o t l a n o i t r o p o r
p Ifcntisl argeenought ha ttherei snopackingorunpacking,t hecos tmodelt urnst o
mα+Nβ ,whichmeanst hatl essmessagesi sbetter. o
t y p o C l a r e n e G
t n o c y l e r it n e t o n a e s o o h c e
W inuouscircularneighborcommunicaitoncaseasfollows:
;) m u n p ( c o r p = P m i
d �
P = A m i
d *data(2 ,1024 )*data(n ,2048 )*data(1024 ;)
) 1 , P ( c il c y c = B m i
d *data(n*2048 .) ;) )t a o lf (f o e zi s ,r , B , s , A ( o t y p o c
n o it u l o s d e z i m it p o e h t d e c u d o r t n i e v a h e
W scanCommand bufCommforcopytofunciton .In this ,
e s a
c cntis4KB .Wesett hegranulartiyas1/64oftheloca ldata .Ifn=128 ,eachprocessholds1MB n
e h t , a t a
d g nra is 16KB .The performance is shown in Figure 7 )(b .The ‘cnt8’ represents a 8-n it
a c i n u m m o c s s e c o r
p on with cn tmessage length .Since the 4KB conitnuous length is no tlong e
h t , h g u o n
e bufCommsoluitoni ncreasest hemessagel engthandperformsbetter . n
i o p r e h t o n a n i s e i ti r a l u n a r g t n e r e f f i d s s u c s i d e w , e r o m r e h t r u
F t- ot -poin tcommunicaiton casethat 0
k n a
r processsends256MBdatat ot herank1process ,andt heconitnuousl engthi salso4KB .
( y a r r a = D m i
d (data(256)*data(256 ))*data(1024));
;) D ( w o r * )) D (l o c ( r o j a m _ l o c * ) 1 ( c o r p = A m i d
D * ) 1 + ) 1 ( c o r p ( = B m i d
;) )t a o lf (f o e zi s ,r , B , s , A ( o t y p o c
s i e c n a m r o f r e p e h
, e g r a l o o t s i y t i r a l u n a r g e h t f
I the2nγcos talot .Ifthegranularityistoosmall ,wehavetomanagea e
h t s e s a e r c n i h c i h w s e g a s s e m t r o h s f o t o
l startupcostmα. s
n o it a r e p O e v it c e ll o C d r a d n a t S
e h T . e c n a m r o f r e p e h t e r u s a e m o t n o i t a c i n u m m o c e v it c e l l o c c i s a b e h t s a e c n a t s n i l l a o tl l a e h t e k a t e W
t o n s e o d I P M n e p O n i l l a o t ll A I P M r o f m h ti r o g l
a attemp ttoschedulecommunication .Instead ,each n a y b d e w o ll o f , p o o l a n i d n e s I I P M e h t l l a n e h t , p o o l a n i v c e r I I P M e h t l l a s t s o p s s e c o r p
e h t o t r a l i m i s s i s i h T . l l a t i a W _ I P
M scanCommsoluiton . g
i
F u 7 )re ( d showstheperformanceforMPIAlltoal lversusPADMcopytowtihdifferen tthreads . e
h t s a e m a s e h t s i y ti r a l u n a r g e h T . a t a d B G 1 s n i a t n o c s s e c o r p h c a
E coun tsize .InPADMcases ,the e h t s w o h s h c i h w , s d a e r h t 4 f o e c n a m r o f r e p e h t n a h t r e tt e b y lt h g il s s i s d a e r h t 8 f o e c n a m r o f r e p
e l p it l u m g n i s u f o e g a t n a v d
a threads .The overhead in PADM does no tsignificanlty degrade the r
u o , y ll a u t c A . e c n a m r o f r e
p copytofuncitonperformsbetterwhent heprocessnumberi sl esst han32 . V
d e z i m it p
O ariou sBroadcas t
2 3 n o t s a c d a o r b d r a d n a t s f o e c n a m r o f r e p e h t e r u s a e m e w , t s r i
F processes .The algorithms of MPI ,
t s a c
B copyto and the new one(scatter- da -n allgather) are different .Their performance is shown in g
i
F u 7re (e)and7 )(f .Heret hemessagel engthequalst ot het otall oca ldatasize .Weseet ha tforshor t (
s e g a s s e
m ≤4MB) ,copyto performs better than both the MPI and the new one .And for long (
s e g a s s e
m ≥8MB),t henewalgorithmperformsbest . n
i s e s a c s u o i r a v e e r h t d e r u s a e m e w , n e h
T Figure 8 on32processeswtihdifferen tmessagesize .In f
o e u l a v e h t , t s il e h
t nisse tto512 ,andtheconitnuouslengthofD tdimensionis1/1024oftheloca l l
a r e n e g d n a n o it u l o s d e z i m it p o e h t f o e c n a m r o f r e p e h T . e z i s a t a
d copyto isshownin Table3 .Each
o t y p o
c containsone synchronization .TheBca tsOp tsoluiton thus issynchronized twice .Therefore , s
e m e h t n e h
w sageislonger ,theopitmized algortihm performsbetter .When themessagelength is e
h t , B M 2 1
5 Bca tsOp treduces88.3%communicationt imecomparedwtiht hegenera lsoluitoninthe .
e s a c t s r i f
. 8 e r u g i
F Variousbroadcas tpatterns.
. 3 e l b a
T Variousbroadcas tperformance.
e g a s s e M
h t g n e l
1 e s a
C Case2 Case3
t p
o copyto o pt copyto o pt copyto B
M
2 0.2 21 0.4 26 0.2 05 0.0 52 0.261 0.310 B
M
8 0.2 25 0.5 37 0.2 39 .00 79 0.258 0.340 B
M 2
3 0.2 56 0.5 72 0.2 60 0.2 09 0.256 0.355 B
M 8 2
1 0.326 1.620 0.290 0.728 0.404 0.911 B
M 2 1
5 0.778 6.640 0.684 2.808 0.913 5.200
n o is u l c n o C
. s n r e tt a p e v it c e l l o c s u o i r a v r o f n o i t u l o s l a r e n e g a s e d i v o r p M D A
P Theabstraciton ofdistribuitonis .
e l b i x e l f d n a e v it i u t n i s i h c i h w s e l u r e l p m i s l a r e v e s y b d e w o ll o
s r o t c a f d e t a l e r e c n a m r o f r e p e h t l o r t n o c o t e l b a d n a y ti l a c o l a t a d e h t f o e r a w
a . I tprovides more
e d o c e s i c n o
c s and greater scope for opitmization. Our work is open source on GtiHub , .
0 V m d a p / 1 0 0 2 9 1 s z / m o c . b u h t i g / /: s p tt
h Thefutureworki st odiscussmorescientificcases.
t n e m g d e l w o n k c A
. o N s t n a r G r e d n u a n i h C f o m a r g o r P D & R y e K l a n o it a N y b d e t r o p p u s s i h c r a e s e r s i h T
, 8 0 2 2 7 6 1 6 . o N s t n a r G a n i h C f o n o it a d n u o F e c n e i c S l a r u t a N l a n o i t a N d n a ; 2 0 5 0 0 2 0 B F Y 6 1 0 2
. 8 1 0 2 3 4 1
6 Thanks to the Mole-8.5 Supercompuitng System developed by Insttiute of Process .
s e c n e i c S f o y m e d a c A e s e n i h C , g n i r e e n i g n E
s e c n e r e f e R
] 1
[ Barnet tM. ,ShulerL. ,DeGejin R.V. ,Interprocesso rCollecitveCommunica iton Library ,High g
n i t u p m o C e c n a m r o f r e
P ,357-364 ,1994 . ]
2
[ ThakurR. ,RabenseifnerR. ,GroppW. ,Opitmizaitono fCollecitveCom- municaitonOpera iton s H
C I P M n
i ,Internationa lJourna lofHighPerformanceComputingApplications ,19(1) :49-66 ,2005 . ]
3
[ Xiaowe i Wang , We i Ge , The Mole-8.5 Supercompu itng System , Contemporary High
g n it u p m o C e c n a m r o f r e
P , .p 5p 7 -98 .NewYork :CRCPress ,2013 . ]
4
[ FoxG. ,OttoS.W. ,HeyA.J. ,e tal ,MatrixAlgortihm sona HypercubeI :MatrixMulitpilcaiton , 7
1 : ) 1 ( 4 , g n i t u p m o C l e l l a r a
P -31 ,1987 . ]
5
[ Krishnan M. ,NieplochaJ. ,Op itmi izng Paralle lMulitpilcaiton Operaiton f o rRectangula rand s
e c i r t a M d e s o p s n a r
T ,Internationa lConferenceonparalle landDistributedSystems ,257-266 ,2004 . ]
6
[ Agarwa lR.C. ,BalleS.M. ,Gustavson F.G. ,A Three-dimensiona lApproach t o Paralle lMatrix n
o it a c il p it l u
M ,IbmJourna lofResearchandDevelop- ment ,39(5) :5 -75 582 ,1995 . ]
7
[ ChanA. ,Balaj iP. ,GroppW. ,Communicaiton Analysi so fParalle l3DFFT f o rFla tCa tresian
s m e t s y S e n e G e u l B e g r a L n o s e h s e
M ,HighPerformanceComputing ,350-364 ,2008 . ]
8
[ F .Kjolstad and M .Snir .Ghos tCel lPa ttern ,In Workshop on Paralle lProgramming Patterns , .
0 1 0 2
] 9
[ Chen Y. , Cu i X. , Me i H. , PARRAY : a unfiying array representa iton fo r heterogeneou s m
s il e ll a r a
p ,Acm Sigplan Symposium on Principles and Pracitce of Paralle lProgramming ,47(8) : 1
7
1 -180 ,2012 . ]
0 1
[ Hoare C.A., Hayes I.J., Jifeng H., Lawsofprogramming,Communicationsof TheACM ,30(8) : 2
7
6 -686 ,1987 . ]
1 1
[ DraperJ.M. ,CullerD.E. ,Yelick K. ,Introducitont oUPCandLanguageSpeciifcaiton ,Center l
a n A e s n e f e D r o f e t u t i t s n I , s e c n e i c S g n i t u p m o C r o
f - yses ,1999 . ]
2 1
[ Coarfa C. , Dotsenko Y. , Mellorcrummey J. , An Evaluaiton o f Globa l Addres s Space o
C : s e g a u g n a
L -arrayFortran andUniifedParalle lC ,Acm SigplanSymposiumonPrinciplesand .
5 0 0 2 , g n i m m a r g o r P l e l l a r a P f o e c i t c a r P
] 3 1
[ Numrich R.W. ,Reid J.K. ,C -o array sin the Nex tFortran Standard ,ACM Sigplan Fortran :
) 2 ( 4 2 , m u r o
F 4-17 ,2005 . ]
4 1
[ Nieplocha J. ,Palmer B.J. ,Tipparaju V. ,Advances ,Appilcaiton sand Performance o fthe
ti k l o o T g n i m m a r g o r P y r o m e M d e r a h S s y a r r A l a b o l
G ,Internationa lJourna lof High Performance 3
0 2 : ) 2 ( 0 2 , s n o i t a c i l p p A g n i t u p m o
C -231 ,2006 . ]
5 1
[ Gropp W. ,Thakur R. ,Lusk E .Using MPI-2 :Advanced features of the message passing e
c a f r e t n