http://wrap.warwick.ac.uk/
Original citation: Baude, F. (1992) PRAM implementation on fine-grained MIMD multicomputers. University of Warwick. Department of Computer Science. (Department of Computer Science Research Report). (Unpublished) CS-RR-220
Permanent WRAP url:
http://wrap.warwick.ac.uk/60909
Research Report 220

PRAM implementation on fine-grained MIMD multicomputers

Françoise Baude

RR220
The more user-friendly tools for expressing data-parallelism are encompassed by the PRAM language mechanisms (mainly, global addressing space and implicit process synchronization). In this paper, we evaluate the cost of implementing such mechanisms on the only class of parallel architectures that should prove to be really scalable as well as versatile, i.e. MIMD fine-grained multicomputers. Being massively parallel, these architectures are constrained not to exceed a given wire density and also to have fine-grained processing nodes. For these reasons, recent algorithmic and hardware techniques for hiding the communication latency of PRAM emulations by parallel slackness do not apply.

We consider an abstraction of these machines by reasoning in terms of the actor model of concurrent computation. Given the architectural constraints raised by the chosen class of architectures, the idea for implementing any PRAM program is to transform it into an actor program such that it can be efficiently implemented on the target machine via mapping techniques such as, for example, graph embedding. This requires the underlying communication graph of the actor program to be predictable and sparse. This transformation is grounded on probabilistic theoretical simulations of PRAMs on parallel machines based on sparse communication networks.
Department of Computer Science
University of Warwick
Coventry CV4 7AL
United Kingdom
Preliminary Version

Françoise Baude*
Dept of Computer Science
University of Warwick
Coventry CV4 7AL, United Kingdom
e-mail: [email protected]. ac.uk
April 29, 1992
1 Introduction
The PRAM (Parallel Random Access Machine) has now won fairly widespread acceptance in the theoretical community as an extremely useful vehicle for studying the logical structure of highly parallel (i.e. data-parallel) computation [GR88], [KR90].

* This work was partially funded by EC grant under Esprit Project P440, by Alcatel-Alsthom

There are two major mechanisms that constitute the basis of this attractive way of expressing parallel programs: global synchronization and a global addressing space, materialized by a shared global memory that serves as a communication medium between the parallel processes. But they are so powerful that they cannot reasonably be considered as hardware primitives and need to be emulated. As the central key to any implementation of PRAM based languages, how to emulate shared memory has been given a lot of attention [KU88], [Ran87], [Val90b]. Even if the current tendency is to reduce the number of calls to these high-level primitives in order to decrease the overall cost of executing a parallel program [Gib89], [ACS89], [Val90a], their implementation is still required.

Here, we are considering massively parallel architectures such as those investigated by the J-machine project [DW89] or the MEGA project [GBES90], for example. To achieve the connection of a massive number of processing elements, some technological constraints arise.

The physical limit on the wiring density is less penalizing in the case of mesh or torus topologies than in the case of hypercubes or butterflies (in general, in the case of logarithmic diameter topologies) [Da89]. So the J-machine [Da89] and the MEGA machine [GBES90] are based on a torus topology. Hence using such a high diameter routing network as a palliative for the technological impediment raised by large size crossbars, as would be required for simulating PRAM programs, is not recommended. Nevertheless, some PRAM simulations apply theoretically to these high diameter routing networks [Sie89], [Val90b]. But they are not practically attractive; indeed, for the case of a square mesh routing network for example, they would use all but one of the rows of the mesh for simulating a high throughput switching network, while amassing all the PRAM processes on just one row.
More generally, the parallel slackness idea currently proposed to hide the communication latency of PRAM program simulations [Val90a], [ACS89] cannot be applied in the present fine-grained context.

The design choice of such massively parallel machines is that processing nodes have a small amount of memory. But to hide the communication latency of simulating PRAMs on such architectures, a processing element would be required to simulate $O(\sqrt{N})$ processes of the PRAM program.
For size reasons, the functionality of the routing chips is reduced to the simplest one, mainly the ability to forward a message towards its destination. Thus, it prevents providing a hardware implementation of more complex operations such as combining or duplicating, even if efficient emulation of shared memory could require it [Ran87]. Moreover, bounded size buffers in the routing chips prevent implementing in hardware the priority based routing schemes that do not yield constant size queues [KU88], [Val90b].

[...] one computation processor/routing chip pair (routing chips enabling a virtual total connection between processors) executing a very simple routing algorithm that follows shortest paths. The J-Machine and the MEGA machine illustrate exactly those design choices. By proposing the most efficient use of available chip and board area, such a machine model should be the only one that is really scalable, thus reaching the highest degree of hardware parallelism.

But the now common techniques for efficiently (i.e. work preserving) implementing the PRAM model, which basically make it possible to hide the latency and low throughput of the routing network thanks to some parallel slackness, cannot be applied given the present architectural choices. So the present aim is to propose other techniques such that it becomes possible to implement the PRAM mechanisms quite efficiently under the chosen architectural constraints.

Outline
We will estimate the cost of implementing such PRAM mechanisms on any MIMD fine-grained multicomputer, whatever the topology of the underlying communication network, to be a logarithmic multiplicative factor increase in time. The independence with respect to the underlying topology comes from the fact that we work with an abstraction of these message-passing based machines, by reasoning in terms of a paradigm of such a message-passing computation model, i.e. the actor model of concurrent computation [Agh86]. Moreover, this implementation in terms of actors will not prevent exploiting other sources of parallelism (control-parallelism for example) at the same time [Agh89].

The implementation of PRAM programs through an automatic translation into an actor program (a translation grounded on theoretical simulations of PRAMs on parallel machines based on sparse communication networks [MV84], [Ran87], [Val90b]) will make it possible to solve difficult implementation problems while not being too much constrained by the target architecture. Nevertheless, we will see (section 4) that efficient execution of some types of actor programs can also raise difficult problems. Thus the translation of PRAM programs into actor programs (section 5) will be carefully conducted, in order to obtain only efficiently executable actor programs. After some definitions, we will first estimate the minimum number of calls to actor language primitives required to simulate the PRAM mechanisms (section 3).

2 Definitions

2.1 PRAM
The PRAM [Gol82] language looks like a declarative sequential language, except that instructions can be parallel and, in this case, the number of parallel processing units (PPUs) that need to be active at the same time is also specified. The number of PPUs that can be activated is unbounded. These PPUs will execute the same instruction. Each PPU has some local memory. Moreover, an unbounded size global memory can be accessed by all PPUs. Even if PPUs execute the same instruction, the ability of indirect addressing enables different memory locations (local or global) to be accessed. But an unbounded number of PPUs may try to access the same global memory location at once. The various PRAM languages differ from one another in the way they handle these read or write conflicts. The most powerful ones (CRCW) allow concurrent read and write accesses. A protocol is nevertheless required. For example, if several PPUs $p_1, \ldots, p_k$ attempt to write values $v_1, \ldots, v_k$ into the same variable, only the value $v_1$ is written because $p_1$ (the lowest numbered PPU) has priority (CRCW-Priority); or the value $v_1 * \cdots * v_k$ gets written, for some prespecified associative operator $*$ (CRCW-Fetch-&-*).

The time complexity function of a PRAM program depends directly on the maximum number of instructions (sequential and parallel) that must be executed in order to complete the computation for any given input. We denote it $T(n)$, where $n$ is the input length. The space complexity function of a PRAM program depends on the maximum number of PPUs that need to be active at once during the computation on any given input (denoted $S(n)$).
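The two CRCW write protocols described above can be illustrated with a minimal Python sketch. The function names and the representation of pending writes as `(ppu_index, value)` pairs are assumptions made for this illustration only; they are not part of any PRAM language definition.

```python
from functools import reduce
import operator


def crcw_priority(writes):
    """CRCW-Priority: among concurrent writes to one cell, the value of the
    lowest numbered PPU is the one that gets written.

    `writes` is a list of (ppu_index, value) pairs (hypothetical encoding).
    """
    return min(writes, key=lambda w: w[0])[1]


def crcw_fetch_and_op(writes, op):
    """CRCW-Fetch-&-*: the values are combined with an associative
    operator `op`, taken here in increasing PPU order."""
    ordered = [value for _, value in sorted(writes, key=lambda w: w[0])]
    return reduce(op, ordered)


# Three PPUs write concurrently into the same global memory cell.
writes = [(3, 10), (1, 7), (2, 5)]
priority_value = crcw_priority(writes)
sum_value = crcw_fetch_and_op(writes, operator.add)
max_value = crcw_fetch_and_op(writes, max)
```

With the sample writes, PPU 1 has priority, so the Priority protocol stores 7, while Fetch-&-add stores 22 and Fetch-&-max stores 10.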
2.2 Actors
The actor language [Agh86] makes it possible to define an unbounded number of asynchronous, autonomous computing entities, called actors, which can only be activated by means of asynchronous message sending. More precisely, an actor has some bounded size local memory, a script which describes how the actor must deal with the next message it will receive, and a mailbox to store incoming messages. In response to one message, an actor can only execute a bounded number of some of the following actions:
• it can create a bounded number of new actors (in this case, their initial script and the initial values of their memories have to be specified); for each newly created actor, a unique address is automatically allocated and can be memorized by the creator actor;
• it can send a bounded number of messages to actors whose address is known; all of the actors with which an actor can communicate are called its acquaintances;
• it can modify its memory and/or change its script.
There are two types of actors: serialized actors, which can modify their state, and unserialized actors, which are unable to modify anything in their state and are therefore also described as being insensitive to the history of the ongoing computation. We consider that this last type of actor has an additional property compared with the orthodox actor language [Agh86], i.e. that unserialized actors can treat an unbounded number of messages at once.

To understand how an actor program computes, we refer to an ideal actor machine on which actor programs run. On this machine, each one of the computation entities is an actor. Computation entities can be created dynamically. Those entities are placed around a communication system, whose task is to forward messages to their destination. This communication system works in an unspecified way. Its only specified property is that every message will eventually be delivered to its destination. Thus the order in which messages arrive at their destination is non-deterministic. Each time a serialized actor has finished the treatment of one message, it checks its mailbox to see if a message is present and can be taken from it. If not, the arrival of a message will wake it up.
Each time an unserialized actor has finished the treatment of one or more messages, it checks its mailbox. If there are messages, they are all taken from the mailbox at once, and their simultaneous treatment begins.

This informal description is best represented by a partial order of the events that arise during the computation [Cli81]. One event is the treatment of one message (or of several messages in the case of unserialized actors). Essentially, events are ordered according to the trivial rule that a message cannot be received before it has been sent.
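One way to make this partial order concrete is to record, for every event, the events that must precede it, and to look at the longest chain of causally ordered events. The sketch below is only an illustration under assumed names: the event labels, the dictionary encoding and the function `event_depth` are hypothetical. It considers a serialized actor D that receives one message from B and one from C, where C first sends itself a message; the two possible arrival orders at D give two different partial orders.

```python
def event_depth(predecessors):
    """Length of the longest chain in a partial order of events.

    `predecessors` maps each event to the list of events that must have
    happened before it (the sending event of the treated message and, for
    a serialized actor, its own previous event).
    """
    depth = {}

    def visit(event):
        if event not in depth:
            before = predecessors[event]
            depth[event] = 1 + max((visit(p) for p in before), default=0)
        return depth[event]

    return max(visit(event) for event in predecessors)


# a0: A treats the initial message and sends to B and C.
# b0: B treats it and sends to D.
# c0, c1: C treats A's message, then a message it sent to itself, then sends to D.
# d0, d1: D (serialized) treats its two messages one after the other.
d_treats_b_first = {
    "a0": [], "b0": ["a0"], "c0": ["a0"], "c1": ["c0"],
    "d0": ["b0"], "d1": ["c1", "d0"],
}
d_treats_c_first = {
    "a0": [], "b0": ["a0"], "c0": ["a0"], "c1": ["c0"],
    "d0": ["c1"], "d1": ["b0", "d0"],
}

depths = [event_depth(d_treats_b_first), event_depth(d_treats_c_first)]
worst_depth = max(depths)
```

The same program on the same input thus yields chains of length 4 or 5 depending on the arrival order of messages at D.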
This abstraction of the execution of actor programs is the basis of the definition of complexity measures for actor programs [BVN91]. To do this, it is assumed that the treatment of message(s), i.e. an event, costs one unit of time, and that the reception of one message occurs at best one time unit after the message has been built. Taking into account that the non-determinism of message arrival can yield many possible partial orders of events from the execution of the same actor program on the same input, the two major complexity measures for actor programs are:

Definition 1 (Actor Time Complexity [BVN91]) The time complexity function of an actor program A is defined by the maximum of the actor time of the program A on all possible inputs of length n. The actor time is the maximum depth of all partial orders that can be generated by A on one given input.

Definition 2 (Actor Space Complexity [BVN91]) The space complexity function of an actor program A is defined by the maximum of the actor space of the program A on all possible inputs of length n. The actor space is the maximum of the required space considering all partial orders that can be generated by A on one given input. The required space of a computation is defined to be the sum of the local memory length over all actors created by this computation.

Now that the actor language is defined, it is clear that it is closer to the chosen target machine than the PRAM language is, but also that it is independent [...] localisation); unlike in the case of a specific machine, message forwarding is left unconstrained and actors can be placed anywhere around the network, whatever the topology of the underlying architecture.

3 Comparison of the expressiveness of PRAM and actor languages
As proved in [BVN91], the actor language is a parallel computation model in the sense that if a problem is known to belong to the NC complexity class (i.e. can be solved in time $\log^k(n)$ and space $\mathrm{poly}(n)$ on any given parallel computation model such as, for example, PRAM or boolean circuits) then it can also be solved in time $\log^k(n)$ and space $\mathrm{poly}(n)$ using the actor language. The proof was conducted by comparing actors with PRAM. This result implies that the expressiveness of the actor language is not too distant from that of the PRAM language. We will now evaluate exactly how much they differ. More precisely, we will establish the following:
Theorem 1 There exists at least one PRAM program having time complexity $T(n)$ and space complexity $\mathrm{poly}(n)$ that cannot be simulated by an actor program having less than $O(T(n) \cdot \log n)$ time complexity and $\mathrm{poly}(n)$ space complexity.

Proof The CPU and each PPU are simulated by a serialized actor whose behaviour corresponds to the PRAM program translated as follows: the length $S$ sequence of instructions is compiled into $S$ distinct behaviours. The transition from one behaviour to the next is done by a 'become' instruction. For example, if the $n$-th PRAM instruction is "goto M if $y_i > 0$" then the corresponding behaviour named "Instruction-n(...)" ends with the statement "if $y_i > 0$ then become Instruction-M else become Instruction-n+1".

In the PRAM, the PPUs and the CPU are synchronized. In the actor language, treating a message means executing at most a bounded number of instructions, including message sending. Thus, during the treatment of one message, an actor can broadcast the synchronization signal to only a bounded number of its acquaintances. Hence, the actor time required to activate an actor is $O(\log S(n))$ if $S(n)$ is the number of actors simulating PPUs, as established by [AFL83].

Each of the memory cells must be hosted by a serialized actor because it can then record a possible change of the cell value. According to the PRAM language definition, a PPU can have access to any cell of the global memory, simply by giving its rank from the first cell in the memory. It is then necessary to provide a means for an actor simulating a PPU to obtain the address of any actor hosting the required cell. Obviously, an actor does not have all actors of the program as acquaintances. In fact, searching for the wanted actor address can be formulated as progressing on [...] orienting the message search in the right direction. Indeed, it is the same as realizing a communication between any pair of actors that are not connected in this network. Because of the bounded degree, the lower bound of $\Omega(\log N)$ proved in [Mey86], theorem 10, applies, where $N$ is the number of actors around this network. Once the address is known, the access is simply realized by means of one message sending. Given the fact that the access request message contains the address of the actor issuing the request, the answer to the access is simply realized by one message sending from the actor hosting the cell to the actor simulating the PPU. □
This lower bound gives an indication of the minimal cost of implementing PRAM programs by their translation into actor programs, but its proof does not provide an implementation scheme. Although it clearly appears that the general mechanism of global memory access has to be implemented by actor message routing, many points remain unsolved: how many actors should be used, how the global memory is distributed, how to deal with concurrent memory address searches, how to deal with concurrent accesses to the same memory cell, etc.

Implementation with sublogarithmic time-loss
Before solving these specific points, let us observe that the lower bound does not apply to some specific PRAM programs. In fact, if one does not have to search for the address of the actor hosting a given memory cell because this actor is already an acquaintance, and if the actors simulating the PPUs can be locally synchronized, then the simulation can be achieved with only a constant time-loss.

Definition 3 (Problems with predictable communication) A parallel algorithm has predictable communication if each process, using only its local state, can infer in constant time the address of the next process from which it will receive a communication.
For a process to be able to infer the process from which it will receive the next communication, the computation must not depend on the value of the input data but rather on its length. Thus, it becomes possible to build an actor network before the beginning of the computation, such that every actor simulating a PPU is connected with the actors hosting the memory cells the PPU will access. Nevertheless, this is only reasonable if the number of acquaintances is not too high, otherwise contradicting the assumption that actors have bounded memory. So we will adopt this way of implementing PRAM programs for a number $f(n)$ of different memory cells that a PPU will access (obviously, $f(n) \le T(n)$), such that $f(n)$ is at most a slowly growing function of $n$, i.e. $\log(n)$ or less (for $f(n) = \log(n)$, the simulation in [PV81] makes it possible to implement this with only constant time-loss using a bounded degree network of processes). Moreover, such a class of problems is far from empty and contains famous examples such as bitonic sort, prefix computations, and more [...]

Lastly, synchronization between PPUs can be simply achieved by forcing actors to have a cadenced behaviour: an actor simulating a given PPU must treat messages not in the order in which it receives them (this being in fact non-deterministic in the actor language context) but according to what the algorithm requires. To describe this order more simply, a high-level construct introduced in the actor language, called "select-accept" [BVN91], can be used.
Now that we have checked that the lower bound for implementing PRAM programs by means of their translation into the actor language is affordable, we will check how actor programs can be implemented on the chosen machines.

4 Implementation of massively parallel actor programs
As shown in [DW89], the actor language should be easily implementable on the chosen class of machines, with the J-Machine as a buildable example. Indeed, even if communicating actors can be physically distant, [Da89] says that "the latency of sending a message from node i to any other node j is sufficiently low that the programmer sees the network as a complete connection". But, when asserting such attractive performances, it is assumed that the network is not at all loaded.

This unloaded network assumption does not apply here, because the programs of interest are massively parallel and fine-grained: such programs are based on a huge number of concurrently active processes and, moreover, each process tends to initiate a communication after executing only a small number of local operations. Thus, without care, contention on the underlying network can be very high, even in the case where messages would all come from distinct nodes and would all be destined to distinct nodes. This case of permutation routing of $N$ messages has a worst case contention (number of messages crossing the same router) of $\sqrt{N}$ when using an oblivious routing algorithm, whatever the diameter of the network [BH85].

But clearly, this worst case does not apply if communication tasks are restricted to those that send messages between nodes such that routing paths do not overlap. This arises if communicating processes are physically close to one another, but also, given the performances of the underlying network without contention, we can accept programs that yield a low contention on the network even without exploiting physical locality.

So, if all possible communicating pairs of actors are known before the communications themselves arise, then the initial placement of any actor can be achieved so as to guarantee the physical locality of its future communications, or even only a low contention on the network. In order to realize such mappings, many techniques are currently used, such as graph embedding [Ros81] and simulated annealing [Dah90]. Thus, it is possible to deal efficiently with communication predictable PRAM programs by first transforming them into communication predictable actor programs and then [...]
The problem that remains concerns communication unpredictable PRAM programs. One technique to deal with programs that behave dynamically is process migration, whose aim is to put a process closer to the process it has just planned to communicate with. Because it applies during execution, the migration cost is tolerable only if the amount of communication between the two processes is really high. Besides, we do not want to bear the cost of the simultaneous routing of $O(N)$ messages having any possible destination.

The proposed solution is to arrange for a PRAM program, even if it has unpredictable communications, to be simulated by a communication predictable actor program, even if the time-loss is not constant. Besides, we have seen in §3 that the simulation of communication unpredictable PRAM programs using the actor language cannot have a constant time-loss but should incur at least a logarithmic time-loss. The next part will present a way of transforming communication unpredictable PRAM programs into communication predictable actor programs that yields a time-loss very close to the lower bound. Then, given the bounded number of predictable communication tasks that those actor programs would generate, these should be efficiently implementable on the target machines using mapping techniques.
5 Simulation of PRAM programs by efficiently implementable actor programs
As suggested in [Mey86], a solution to deal with unpredictable communications is to constrain all message sending between any two processes (even if they know their respective addresses) to pass through a bounded degree network. The new communication pattern of the program can then be more easily embedded given the physical network topology. But the distance between any two communicating processes can reach the diameter of the network. With a bounded degree, we have seen that the lowest possible diameter for the case of connecting $N$ nodes is $\Omega(\log N)$.

Of course, this requires that the bounded degree network be able to route messages to their respective destinations. For scalability purposes, the routing must be distributed (i.e., any node decides where to forward a message only on the basis of local information). Until now, the best known strategies to route a set of $O(N)$ messages on a bounded degree network built with $N$ nodes achieve an $O(\log N)$ delay with high probability (taking into account the path length as well as the total time a message waits in queues) [Upf84], [Ran87], [Val90b].

The simulation, whose performances are given by the following theorem, will thus be based on such a network of actors programmed by one of those distributed routing strategies.
[...] $\log n$) with high probability using $O(S(n))$ actors.

Proof We have chosen the routing strategy developed in [Ran87] because it not only solves the routing of $N$ messages efficiently, and in a very practical way, but it also serves other purposes such as combining memory access requests as a solution for managing concurrent accesses to identical memory cells. Moreover, when used in conjunction with universal hashing as a global memory distribution strategy [MV84], the total needed amount of randomness is reduced compared with other models such as [Val90b], for example.

The network of actors is connected following a butterfly. The number of nodes in a butterfly with $n$ levels is $N = n \cdot 2^n$, and we assume that levels $0$ and $n$ are identified, so that the butterfly is wrapped around. Each node in a butterfly is assigned a unique number $\langle c, r \rangle$ where $0 \le c < n$, $0 \le r \le 2^n - 1$. Node $\langle c, r \rangle$ is connected to nodes $\langle c + 1 \bmod n, r \rangle$ and $\langle c + 1 \bmod n, r \oplus 2^c \rangle$, where $\oplus$ denotes bitwise exclusive or.

Each node $i = r \cdot n + c$ of the $S(n)$ nodes butterfly network is in fact an actor dedicated to routing messages, which, in addition to its neighbours in the network, has some other actors as acquaintances: one actor is dedicated to simulating the $i$-th PPU of the PRAM program, one is dedicated to hosting the cells of global memory allocated by the hash function, and one is dedicated to applying the hash function to any PRAM memory address. In order to host memory cells and hash function parameters while actors have fixed-size memories, the solution described in the Appendix of [Ran89] is relevant.

The amount of randomness that is introduced by allocating memory by universal hashing (picking at random one function in a class of universal hash functions) is sufficient to ensure a deterministic routing of $N$ messages in $O(\log N)$ synchronous steps with high probability.

Indeed, this strategy as well as all the other quoted routing strategies are described and evaluated by assuming that the underlying network has a synchronous behaviour. Clearly this is not the case of an actor network, because message sending between actors is assumed to be asynchronous and the arrival order of messages is non-deterministic. But the synchronous routing strategy of [Ran87] is original because it is based on a communication predictable algorithm. This is due to the fact that each node invariably forwards a message to each of its successors in the network. Because of the degree 4 underlying support for routing, the programming of this algorithm using the actor language has only a constant time-loss compared with the synchronous version (see [BVN91] or the Appendix for details). Moreover, it uses fixed-size FIFO buffers (while remaining deadlock-free) that can be managed with a fixed number of actor computation steps.

Finally, this routing strategy supports, without any further time-loss, any protocol for managing concurrent accesses to common memory cells.

Transition to the next PRAM instruction Because all actors simulating PPUs have a copy of the same code, they can proceed independently in the case of instructions that are local to the PPUs. In the case of a global memory access instruction, each actor simulating a PPU must send a message to the associated actor that routes, even if the PPU is not concerned. Or we can use a more flexible model of PRAM [Gib89] that allows PPUs not to issue the same number of memory accesses, and that introduces a barrier synchronization instruction. But such a synchronization can be easily implemented on top of the routing algorithm using the End of Stream messages (see [Bau91] for details).
After the routing of requests, the accesses, and the routing of replies, termination is automatically detected by each actor that routes, as soon as it has received an End Of Stream message from each of its predecessors on the butterfly. □
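The wiring of the wrapped butterfly used in the proof above can be sketched in a few lines. The code follows the numbering $\langle c, r \rangle$ and the index $i = r \cdot n + c$ given in the text; the function names are hypothetical, and the path selection shown is plain bit-fixing, given only to illustrate the topology. It is not the routing strategy of [Ran87], which additionally involves queues, combining and End of Stream messages.

```python
def successors(c, r, n):
    """Successors of node <c, r> in a wrapped butterfly with n levels."""
    nxt = (c + 1) % n
    return [(nxt, r), (nxt, r ^ (1 << c))]


def node_index(c, r, n):
    """Linear index i = r * n + c of node <c, r>."""
    return r * n + c


def bit_fixing_path(src_row, dst_row, n):
    """Path from <0, src_row> to <0, dst_row> going once around the levels."""
    c, r = 0, src_row
    path = [(c, r)]
    for _ in range(n):
        straight, cross = successors(c, r, n)
        # Take the cross edge when bit c of the row differs from the target.
        differs = ((r ^ dst_row) >> c) & 1
        c, r = cross if differs else straight
        path.append((c, r))
    return path


n = 3
nodes = [(c, r) for r in range(2 ** n) for c in range(n)]
indices = {node_index(c, r, n) for c, r in nodes}
in_degree = {node: 0 for node in nodes}
for c, r in nodes:
    for succ in successors(c, r, n):
        in_degree[succ] += 1
path = bit_fixing_path(5, 2, n)
```

With 3 levels there are $3 \cdot 2^3 = 24$ nodes, each with two successors and two predecessors (degree 4), and any row is reached from any other after exactly $n$ hops.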
Remarks A similar automatic transformation makes it possible to execute massively parallel and communication unpredictable actor programs, where actors are dealt with in the same way as the global memory of a PRAM program. For $S(n) > p$, $p$ being the number of processors of the machine, the resulting actor program is built around an $O(p)$-node butterfly and each node hosts $S(n)/p$ actors simulating PPUs. To simulate one memory access step of the PRAM program, the preceding simulation has to be applied $O(S(n)/p)$ times. But we can hope to take advantage of a pipeline effect, as when applying instruction lookahead (see [Ran89] for implementation details).
6 Conclusions and further work
The results
that
have been presented estab[sh a bridge between parallei program-ming inNC
on one side (where in fact the way of programming in AfC were for long amalgalatedwith
SIMD
architectures) andMIMD
concurrent multicomputers onthe other side. Indeed, we have shown
that
the execution of highly parallel PRAM-Iike programs on these architectures canin
fact be expressed as a problemthat
isvery common
in
this
domain,i.e.
a problem of mapping one set of communicatinoprocesses on a given physical topology [AthB7].
We would be interested
by
worksto
develop mapping technics adaptedto
the specific case of abutterfly
communication network on one side and different possiblephysical topologies, including meshes, torus, on the other side.
In the context of these MIMD multicomputers, we have shown that the distinction between communication predictable and communication unpredictable programs is crucial. Communication predictable programs have the property that they can be directly mapped, whereas communication unpredictable ones have to be first reorganized and suffer at least a logarithmic increase in the total number of operations for their execution.
¹ or is connected with, for the case where local memory is too small
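To illustrate the distinction, the sketch below lists, round by round, which processor reads from which for a prefix sum and for a list ranking. The recursive-doubling and pointer-jumping formulations are textbook variants chosen here for illustration, not the implementation of this report, and the function names are ours.

```python
def prefix_sum_partners(n):
    """Per-round (reader, source) pairs of a recursive-doubling prefix sum.

    The pattern depends only on n: it is known before execution.
    """
    rounds, distance = [], 1
    while distance < n:
        rounds.append([(i, i - distance) for i in range(distance, n)])
        distance *= 2
    return rounds


def list_ranking_partners(successor):
    """Per-round (reader, source) pairs of pointer-jumping list ranking.

    successor[i] is the element following i; the tail points to itself.
    The pattern depends on the contents of `successor`, i.e. on the data.
    """
    nxt = list(successor)
    indices = range(len(nxt))
    rounds = []
    while any(nxt[i] != nxt[nxt[i]] for i in indices):
        rounds.append([(i, nxt[i]) for i in indices if nxt[i] != i])
        nxt = [nxt[nxt[i]] for i in indices]
    return rounds
```

Two lists of the same length, for example `[1, 2, 3, 3]` and `[3, 0, 1, 3]`, give different read patterns for list ranking, while `prefix_sum_partners(4)` is the same whatever the values being summed.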
This once more highlights the crucial importance of works whose aim is to decrease the total cost of the communications, for example by taking advantage of the communication latency [ACS89]. But, given the present architectural constraints, which essentially prevent the use of parallel slackness, we have achieved the best possible implementation. Thus, the only way to reduce PRAM programming costs would be to use communication predictable PRAM primitives as intensively as possible. In this context, the redefinition of PRAM programs by using communication predictable primitives instead of communication unpredictable ones could be a way of achieving better performance. For example, we should replace whenever possible the list-ranking procedure with the prefix sum procedure [GMT88]. Also, using the powerful CRCW-Fetch-&-* protocol could save global memory accesses, while its implementation does not cost more than the EREW protocol.
References
[ACS89] A. Aggarwal, A.K. Chandra, and M. Snir. Communication Latency in PRAM Computations. In Proc. of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pages 11-21, 1989.
[AFL83] E. Arjomandi, M.J. Fischer, and N.A. Lynch. Efficiency of synchronous versus asynchronous distributed systems. J. ACM, 30:449-456, 1983.
[Agh86] G. Agha. ACTORS: A model of concurrent computation in distributed systems. MIT Press, Cambridge, Massachusetts, 1986.
[Agh89] G. Agha. Supporting Multiparadigm Programming on Actor Architectures. In E. Odijk, M. Rem, and J.C. Syre, editors, Proceedings of PARLE 89, pages 1-19, Eindhoven, 1989. Springer Verlag.
[Ath87] W.C. Athas. Fine grain concurrent computation. CALTECH Technical Report 5242:TR:87, 1987.
[Bau91] F. Baude. Utilisation du paradigme acteur pour le calcul parallèle. PhD thesis, Université Paris-Sud, Orsay, 1991.
[BH85] A. Borodin and J. Hopcroft. Routing, merging and sorting on parallel models of computation. Journal of Comp. and System Sciences, 30:130-145, 1985.
[BVN91] F. Baude and G. Vidal-Naquet. Actors as a parallel programming model. In Proc. of the 8th Symposium of Theoretical Aspects of Computer Science, volume 480, pages 184-195. LNCS, 1991.
[Cli81] W.D.C. Clinger. Foundation of Actor Semantics. PhD thesis, MIT, Cambridge, Mass, 1981.
[Da89] W.J. Dally et al. The J-Machine: a fine-grain concurrent computer. In IFIP Congress, 1989.
[Dah90] E. Denning Dahl. Mapping and compiled communication on the connection machine system. In Proceedings of the 5th Distributed Memory Computing Conference, 1990.
[DW89] W.J. Dally and D.S. Wills. Universal Mechanisms for Concurrency. In E. Odijk, M. Rem, and J.C. Syre, editors, Proceedings of PARLE 89, pages 19-33, Eindhoven, 1989. Springer Verlag.
[GBES90] C. Germain, J.L. Béchennec, D. Etiemble, and J.P. Sansonnet. An interconnection network and a routing scheme for a massively parallel message-passing multicomputer. In Proc. of the 3rd Symposium on Frontiers of Massively Parallel Computation, College Park, MD, 1990.
[Gib89] P.B. Gibbons. A More Practical PRAM Model. In Proc. of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pages 158-168, 1989.
[GMT88] H. Gazit, G.L. Miller, and S.H. Teng. Concurrent computations, chapter Optimal tree contraction in the EREW PRAM model, pages 139-155. Plenum Publishing Corporation, 1988.
[Gol82] L.M. Goldschlager. A Universal Interconnection Pattern for Parallel Computers. J. ACM, 29:1073-1086, 1982.
[GR88] A. Gibbons and W. Rytter. Efficient parallel algorithms. Cambridge University Press, Cambridge, 1988.
[KR90] R.M. Karp and V. Ramachandran. Handbook of theoretical computer science, volume B, chapter Parallel algorithms for shared-memory machines, pages 869-942. Elsevier, 1990.
[KU88] A.R. Karlin and E. Upfal. Parallel hashing: An efficient implementation of shared memory. J. of ACM, 35:876-892, 1988.
[Mey86] F. Meyer auf der Heide. Efficient simulations among several models of parallel computers. SIAM J. Comput., 15:106-119, 1986.
[MV84] K. Mehlhorn and U. Vishkin. Randomized and Deterministic Simulations of PRAMs by Parallel Machines with Restricted Granularity of Parallel Memories. Acta Informatica, 21:339-374, 1984.
[PV81] F.P. Preparata and J. Vuillemin. The cube-connected cycles: A versatile network for parallel computation. Comm. of the ACM, 24:300-309, 1981.
[Ran87] A.G. Ranade. How to emulate shared memory. In Proc. of the 28th IEEE Symp. on FOCS, pages 185-194, 1987.
[Ran89] A.G. Ranade. Fluent parallel computation. PhD thesis, Yale Univ., Yale, US, 1989.
[Ros81] A.L. Rosenberg. Issues in the study of graph embeddings. In Proc. of the International Workshop WG80, volume 100, pages 150-176. LNCS, 1981.
[Sie89] A. Siegel. On universal classes of fast high performance hash functions, their time-space tradeoff, and their applications. In Proc. of the 30th IEEE Symp. on FOCS, pages 20-25, 1989.
[Upf84] E. Upfal. Efficient schemes for parallel communication. J. ACM, 31:507-517, 1984.
[Val90a] L.G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33:103-111, 1990.
[Val90b] L.G. Valiant. Handbook of theoretical computer science, chapter General purpose parallel architectures, pages 943-972. Elsevier, 1990.
Appendix: Actor programming of [Ran87] routing strategy
In [Ran87], thanks to the forwarding of "ghost" messages, every node is continuously informed about messages forwarded by its predecessors (the nodes which can send it messages). In an asynchronous and non-deterministic context, it is useful that each node sends back an acknowledgement each time it has received a data message. In other words, the following rule has to be coded as the behaviour of actors dedicated to routing:
Let succ (resp. pred) be the number of neighbours to which messages are forwarded (resp. from which messages are received).
If (nb.mess.received = pred) and (nb.ack.received = succ) then
    nb.mess.received := 0 ; nb.ack.received := 0
    selection of messages from the buffers
    construction and forwarding of succ messages with timestamp = T
    construction and forwarding of pred acks with timestamp = T
    T := T + 1
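The rule above can be exercised on a small network with the following sketch. It is a simplification under stated assumptions: buffers and message selection are omitted, messages are delivered one at a time in FIFO order, and the acknowledgement counter is assumed to start at succ so that the first step can fire. The names `RouterActor` and `run` are ours.

```python
from collections import deque


class RouterActor:
    """Hypothetical routing actor applying the firing rule (no buffers)."""

    def __init__(self, preds, succs):
        self.preds, self.succs = list(preds), list(succs)
        self.mess_received = 0
        # Assumption: acks start saturated so that step 0 can fire.
        self.ack_received = len(self.succs)
        self.T = 0

    def ready(self):
        return (self.mess_received == len(self.preds)
                and self.ack_received == len(self.succs))

    def fire(self):
        """Apply the rule once; return the (dest, kind, timestamp) sends."""
        self.mess_received = self.ack_received = 0
        sends = [(s, "mess", self.T) for s in self.succs]
        sends += [(p, "ack", self.T) for p in self.preds]
        self.T += 1
        return sends


def run(actors, steps):
    """Run until no actor can fire; return the largest step gap observed
    between two neighbouring actors."""
    queue, max_skew = deque(), 0
    while True:
        ready = [a for a in actors.values() if a.T < steps and a.ready()]
        if not ready and not queue:
            return max_skew
        for actor in ready:
            queue.extend(actor.fire())
        if queue:
            dest, kind, stamp = queue.popleft()
            target = actors[dest]
            if kind == "mess":
                # A message stamped T arrives during the receiver's step T.
                assert stamp == target.T
                target.mess_received += 1
            else:
                # An ack stamped T arrives during the receiver's step T + 1.
                assert stamp == target.T - 1
                target.ack_received += 1
        for actor in actors.values():
            for s in actor.succs:
                max_skew = max(max_skew, abs(actor.T - actors[s].T))


# A four-actor diamond: a feeds b and c, which both feed d.
diamond = {
    "a": RouterActor([], ["b", "c"]),
    "b": RouterActor(["a"], ["d"]),
    "c": RouterActor(["a"], ["d"]),
    "d": RouterActor(["b", "c"], []),
}
skew = run(diamond, steps=5)
```

In this sketch two neighbouring actors are never more than one routing step apart, although no global synchronization is used.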
At minimal cost, this rule keeps the execution of the whole routing task cadenced, without requiring any global synchronization. The cadenced behaviour of the neighbours of any node is described below.
Property 1 For every actor A dedicated to routing on the network, at each step T of the routing, A forwards data messages to its successors, which will receive them during their T-th routing step, and A acknowledges the data messages received from its predecessors during this T-th routing step; these acknowledgements will be received by these predecessors during their (T+1)-th routing step.
Application to the analysis of Ranade's strategy implemented on the actors network
A classical technique used to evaluate the performance of routing strategies is based on the building of a delay sequence, by looking for messages which have caused delays, starting from the most delayed one. Here, instead of tracing back using a global time, the backward tracing is done according to the partial order C of actor events. Although the global time concept is not used in our case, the transposition into actor terms of the definitions of the two primitive delays (b-delay and m-delay) makes it possible to observe that the messages in the delay sequence are in fact the same as those that would have been extracted by the analysis of the synchronous execution of the routing for the same input. Then, we can adapt Theorem 2 of [Ran87], where d is the diameter of the butterfly network constituted by S(n) actors executing the routing:
Theorem 3 Suppose that the routing of a memory requests set of size p takes at most an actor computation time equal to (7'd + 6) .4. If each buffer is of size b (b being a constant independent of the network size), then there exists an input-output path S of length Sd and a sequence of min(6, b.d) messages which is polarized along
Theorem 3 of [Ran87], which shows why "bad" sequences of messages occur with very low probability, can be applied; thus the routing is terminated after log p routing steps