Parallel
Merge
Sortby
Richard Cole
Ultracomputer Note #115
Computer
Science TechnicalReport #278 March,
1987Ultracomputer Research Laboratory
u
00 r~
I
OS 4J
u O m
6-1TD cr>
u u
en
o
Qj -H I—
I
bH
O
lOfe
O
CMNew
YorkUniversityCourant Instituteof MathematicalSciences
Division of
Computer
Science 251MercerStreet,New
York,NY
10012Parallel
Merge
Sortby
Richard Cole
Ultracomputer Note #115
Computer
Science TechnicalReport #278 March,
1987This
work was
supported in partby
anIBM
FacultyDevelopment Award and by NSF
grantDCR-84-01633.
ABSTRACT
We
give a parallel implementation ofmerge
sorton
aCREW PRAM
that uses n processorsand O(logn)
time; the constant in the running time is small.We
alsogive a
more complex
version of the algorithm for theEREW PRAM;
it also uses n processorsand O(logn)
time.The
constant in the running time is stillmoderate, though
not as small.1.
Introduction
There
are a variety ofmodels
inwhich
parallel algorithms can be designed.For
sort- ing,two models
are usually considered: circuitsand
thePRAM;
the circuitmodel
is themore
restrictive.An
early result in this areawas
the sorting circuitdue
to Batcher [B-68];it uses time 1/2 log n.
More
recently,[AKS-83]
gave a sorting circuit that ran inO(logn)
time;however,
the constant in the running timewas
very large(we
will refer to this as theAKS network)
.The huge
size ofthe constant is due, in part, to the use ofexpander
graphsin the circuit.
The
recent result ofLubotsky
et al [LPS-86] concerningexpander
graphsmay
well reduce this constant considerably;however,
it appears that the constant is stilllarge
[CO-86,
Pa-87].The PRAM
provides an alternative,and
less restrictive,computation model. There
are three variants of thismodel
that are frequently used: theCRCW PRAM,
theCREW
PRAM, and
theEREW PRAM;
the firstmodel
allows concurrent access to amemory
loca- tion both for readingand
writing, the secondmodel
allows concurrent access only for read- ing, while the thirdmodel
does not allow concurrent access to amemory
location.A
sort- ing circuit can beimplemented
inany
of thesemodels
(without loss of efficiency).Preparata [Pr-78] gave a sorting algorithm for the
CREW PRAM
that ran inO(logn)
timeon
(nlogn) processors; the constant in the running timewas
small. (In fact, therewere some
implementation details left incomplete in this algorithm; thiswas
rectifiedby Borodin and Hopcroft
in [BH-82].) Preparata's algorithmwas based on
amerging
pro- cedure givenby
Valiant [V-75]; this proceduremerges two
sorted arrays, each of length atmost
n, in timeO(loglogn)
using a linearnumber
of processors.When
used in the obvi- ousway,
Valiant'sprocedure
leads to an implementation ofmerge
sort on n processors usingO(lognloglogn)
time.More
recently,Kruskal
[K-83]improved
this sorting algo- rithm to obtain a sorting algorithm that ran in time0(lognlog]ogn/loglog]ogn)
on n pro- cessors.(The
basic algorithmwas
Preparata's;however,
a different choice of parameterswas made.)
In part, Kruskal's algorithmdepended on
using themost
efficient versions of Valiant'smerging
algorithm; these are also described in Kruskal's paper.More
recently, Bilardiand
Nicolau [BN-86] gave an implementation of bitonic sorton
theEREW PRAM
that used n/logn processors
and 0(log
n) time.The
constant in the running timewas
small.
In the next Section,
we
describe a simpleCREW PRAM
sorting algorithm that uses n processorsand
runs in time O(logn); the algorithmperforms
just5/2nlogn
comparisons.In Section 3
we modify
the algorithm to runon
theEREW PRAM. The
algorithm stillruns in time
O(logn) on
n processors;however,
the constant in the running time issome- what
less small than for theCREW
algorithm.We
note that apartfrom
theAKS
sorting network, theknown
deterministicEREW
sorting algorithms that use about n processors allrun in time 0(iog''n) (these algorithms are implementations of the various sorting
networks
such as Batcher's sort).Our
algorithms will notmake
use ofexpander
graphs or any related constructs; this avoids thehuge
constants in the running time associated with theAKS
construction.The
contribution of thiswork
is twofold: first, it provides a secondO(logn)
time, n processor parallel sorting algorithm (the first such algorithm is impliedby
theAKS
sorting circuit); second, it considerably reduces the constant in the running time (bycomparison
with theAKS
result).Of
course,AKS
is a sorting circuit; thiswork
does not provide a sorting circuit.In Section 4,
we show how
tomodify
theCREW
algorithm to obtainCRCW
sorting algorithms that run in sublogarithmic time.We
will alsomention some open problems
con- cerning sortingon
thePRAM model
in sublogarithmic time. In Section 5,we
consider a parametric search techniquedue
toMegiddo
[M-83];we show
that the partialimprovement
of this technique in [C-84] is
enhanced by
using theEREW
sorting algorithm.2.
The CREW Algorithm
By way
of motivation, let us consider the natural tree-basedmerge
sort.Consider
an algorithm for sorting nnumbers. For
simplicity,suppose
that n is apower
of 2,and
all the items are distinct.The
algorithm will use an n-leaf complete binary tree. Initially, the inputs are distributed one per leaf.The
task, at each internalnode
u ofthe tree, is tocom-
pute the sorted order for the items initially at the leaves of the subtree rooted at u.The
computation proceeds
up
the tree, levelby
level,from
the leaves to the root, as follows.At
eachnode we compute
themerge
ofthe sorted setscomputed
at its children.Using
theO(loglogn)
time, n processormerging
algorithm of Valiant, will yield anO(lognloglogn)
time, n processor sorting algorithm. In fact,we know
there is ann(loglogn)
time lowerbound
formerging two
sorted arrays ofn items using n processors [BH-82], thuswe do
not expect thisapproach
to lead to anO(logn)
time, n processor sorting algorithm.We
will not use the fastO(loglogn)
timemerging
procedure; instead,we
base our algorithmon
anO(logn)
timemerging
procedure, similar to theone
described in the nextfew
sentences.The problem
is tomerge two
sorted arrays ofn items.We proceed
in logn stages. In the ith stage, for each array,we
take a sortedsample
of 2'~^ items, comprising every n/2'~-'th item in the array.We compute
themerge
of thesetwo
samples.Given
the results of themerge from
the i—1st stage, themerge
in the ith stage can becomputed
inconstant time (this, or rather a related result, will be justified later').
At
present, thismerging
proceduremerely
leads to an0(log
n) time sorting algo- rithm.To
obtain anO(logn)
time sorting algorithmwe need
the followingkey
observa-tion:
The merges
at the different levels ofthe tree can be pipelined.This is plausible because
merged
samplesfrom
one level of the tree provide fairlygood
samples at the next level of the tree.Making
this statement precise is thekey
to theCREW
algorithm.We now
describe our sorting algorithm.The
inputs are placed at the leaves of the tree. Let u be an internalnode
of the tree.The
task, atnode
u, is tocompute
L(u), the sorted array that contains the items initially at the leaves of the subtree rooted at u.At
intermediate steps in the computation, atnode
u,we
willhave computed UP(u),
a sorted subset of the items in L(u);UP(u)
will be stored in an array also.The
items inUP(u)
will be arough sample
of the items in L(u).As
the algorithm proceeds, the size ofUP(u)
increases,and UP(u) becomes
amore
accurate approximation of L(u).(Note
that at each stagewe
use a different array forUP(u).)
We
explain the processingperformed
inone
stage at an arbitrary internalnode
u of the tree.The
arrayUP(u)
is the array athand
at the start of the stage;NEWUP(u)
is thearray at
hand
at the start of the next stage,and OLDUP(u)
is the array athand
at the startUltracomputer Note 115 Page 3
of the previous stage, if any. Also, in each stage,
we
will create an arraySUP(u)
(short forSAMPLEUP(u))
atnode
u;NEWSUP(u), OLDSUP(u)
are the corresponding arrays inrespectively, the next,
and
previous, stage.At
each stage, for eachnode
u, thecomputa-
tion comprises the following
two
phases.(1)
Form
the arraySUP(u);
it comprises every fourth item inUP(u),
in sorted order.(2) Let V
and w
be u's children.Compute NEWUP(u) = SUP(v) U SUP(w), where
U
denotes merging.There
aresome boundary
caseswhere we need
tochange Phase
1. (Forexample,
initially, theUP
arrays each containone
or zero items.Thus
theSUP
arrayswould
all beempty and
the algorithmwould do
nothing.) Inview
of this,we
establish the following goal: ateach stage, so long as i=- |UP(u) ! -'= |L(u) |, the size of
NEWUP(u)
will be twice the size ofUP(u). At
this point,some
definitions will be helpful.A node
is external if|UP(u) I
=
|L(u) I,and
it is inside otherwise. Phases 1and
2, above, areperformed
at each inside node.At
external nodes.Phase
2 is notperformed and Phase
1 is modified as fol- lows. For the first stage inwhich
u is external.Phase
1 isunchanged. For
the second stage,SUP(u)
is defined to be every second item inUP(u),
in sorted order.And
for the third stage,SUP(u)
is defined to be every item inUP(u),
in sorted order. It shouldbe
clear that
we have
achieved our goal, namely:Lemma
1:While <
|UP(u) | <|L(u)|,|NEWUP(u)
|=
2|UP(u)|.It is also clear that 3 stages after
node
ubecomes
external,node
t, the parent of u, alsobecomes
external.We
conclude:Lemma
2:The
algorithm has3]ogn
stages.It
remains
for us toshow how
toperform
themerges needed
forPhase
2.We
willshow
that they can beperformed
in constant time using0(n)
processors. This yields theO(logn)
running time for the sorting algorithm.A few
definitions will be helpful. Let e, f, g be three items, withe<g.
fis between eand
g ife^f and f<g; we
also say that eand
g straddle f. LetL and
J be sorted arrays.Let f be an item in J,
and
let eand
g be thetwo
adjacent items inL
that straddle f (ifnecessary,
we
let e= —
0° or g=
=0); then the rank of f inL
is defined to be the rank of e inL
(ife=
-«=, f is defined tohave
rank 0).We
define the range (e,g] tobe
the intervalinduced
by
item g (including the cases e=
-ocand
g=
°°).L
is a c-cover of J if each inter- val inducedby
an item inL
contains atmost
c itemsfrom
J.We
also say that the itemsfrom
J in the range (e,g] are contained in g's interval.We
defineL and
J to be cross- ranked if for each item in J (respectively L)we know
its rank inL
(respectivelyJ).We
willneed
the following observation toshow
that themerge
can beperformed
in0(1)
time:OLDSUP(v)
is a 3-cover forSUP(v)
for eachnode
v;hence UP(u)
is a 3-cover for each ofSUP(v) and SUP(w),
sinceUP(u) = OLDSUP(v) U OLDSUP(w)).
This will
be shown
in Corollary 1, below. But first,we
describe themerge
(Phase 2).We need some
additional information in order toperform
themerge
quickly. Specifi- cally, for each insidenode
u, for each itemfrom UP(u), we
provide its rank in each ofSUP(v) and SUP(w). Using
these ranks, firstwe show how
tocompute NEWUP(u),
and secondwe show,
for each itemfrom NEWUP(u), hew
tocompute
its ranks inNEWSUP(v) and NEWSUP(w).
Step 1
" Computing NEWUP(u).
Let e be an item inSUP(v);
the rank of e inNEWUP(u) = SUP(v) U SUP(w)
is equal to thesum
of its ranks inSUP(v) and SUP(w).
So to
compute
themerge we
cross-rankSUP(v) and SUP(w)
(themethod
is given in the followingtwo
paragraphs).At
this point, for each item e inSUP(v),
besidesknowing
itsrank in
NEWUP(u), we know
thetwo
items d and f inSUP(w)
that straddle e,and we know
the ranks of dand
f inNEWUP(u)
(these will beneeded
in Step 2).For
each itemin
NEWUP(u) we
recordwhether
itcame from SUP(v)
orSUP(w) and we
record the ranks (inNEWUP(u))
of thetwo
straddling itemsfrom
the other set.Let e
be
an item inSUP(v); we show how
tocompute
its rank inSUP(w). We
proceed
intwo
substeps. First, for each item inSUP(v) we compute
its rank inUP(u).
This task is
performed by
processors associated with the items inUP(u),
as follows. Let y be an item inUP(u). Consider
the interval I(y) inUP(u)
inducedby
y,and
consider the items inSUP(v)
contained in I(y) (there are atmost
3 such itemsby
the 3-cover property).Each
of these items is given its rank inUP(u) by
the processor associated with y. This substep takes constant time ifwe
associateone
processor with each item in theUP
array ateach inside node.
Ultracomputer Note 115 Page5
In the second substep
we compute
the rank of e inSUP(w). We
determine thetwo
items dand
f inUP(u)
that straddle e, using therank computed
in the first substep. Sup- pose that dand
fhave
ranks rand
t, respectively, inSUP(w). Then
all items ofrank r or less are smaller than item e (recallwe assumed
that all the inputswere
distinct), while allitems of rank greater than t are larger than item e; thus the only items about
which
there isany doubt
as to their size relative to e are the items with rank s,r<s^t. But
there are atmost
3 such itemsby
the 3-cover property.By means
of atmost
2 comparisons, the rela- tive order of eand
these (at most) 3 items can be determined. So the second substep requires constant time ifwe
associateone
processor with each item in eachSUP
array.Step 2
"
maintaining ranks.For
each item e inNEWUP(u), we want
todetermine
itsrank
inNEWSUP(v') We
startby making
afew
observations.Given
the ranks for an itemfrom UP(u)
in bothSUP(v) and SUP(w) we
canimmediately deduce
therank
of this item inNEWUP(u) = SUP(v) U SUP(w)
(thenew
rank is just thesum
of thetwo
old ranks). Similarly,we
obtain the ranks for itemsfrom UP(v)
inNEWUP(v).
This yields the ranks of itemsSUP(v)
inNEWSUP(v)
(for each item inSUP(v) came from UP(v),
andNEWSUP(v)
comprises every fourth item inNEWUP(v)). Thus
for every item e inNEWUP(u)
thatcame from SUP(v) we have
its rank inNEWSUP(v);
itremains
tocom-
pute this rank for those items e inNEWUP(u)
thatcame from SUP(w).
Recall that for each item e
from SUP(w) we computed
the straddling items dand
ffrom SUP(v)
(in Step 1).We know
the ranks rand
t of dand
f, respectively, inNEWSUP(v)
(as asserted in the previous paragraph).Every
item of rank r or less inNEWSUP(v)
is smaller than e, while every item of rank greater than t is larger than e;thus the only items about
which
there isany doubt
concerning their size relative to e are the items with rank s,r<s<t.
But there are atmost
3 such itemsby
the 3-cover property.As
before, the relative order of eand
these (at most) 3 items can bedetermined by means
of at
most
2 comparisons.Thus
Step 2 takes constant time ifwe
associate a processor with each item in eachNEWUP
array.It
remains
toprove
the 3-cover property (Corollary 1 toLemma
3)and
to determine the complexity of the algorithm(Lemmas
4and
5).Lemma
3: Letk^
1.At
the start of a stage,any
k adjacent intervals inSUP(u)
contain atmost 2k+l
itemsfrom NEWSUP(u).
Proof:
We prove
the resultby
induction on the (implicit) stagenumber. The
claim is true initially, forwhen SUP(u)
firstbecomes non-empty,
it containsone
itemand NEWSUP(u)
contains 2 items,
and when SUP(u)
isempty, NEWSUP(u)
contains atmost
1 item.Inductive step.
We
seek toprove
that k adjacent intervals inNEWSUP(u)
contain atmost
2k+
l itemsfrom SUP(u), assuming
that the result is true for the previous stage, i.e.that for all
nodes
u', for allk'>l,
k' intervals inOLDSUP(u')
contain atmost 2k'+l
items
from
SUP(u')-We
firstsuppose
that u is not external at the start of the current stage.Consider
asequence of k adjacent intervals in
SUP(u);
they cover thesame
range assome
sequence of4k
adjacent intervals inUP(u).
Recall thatUP(u) = OLDSUP(v) U OLDSUP(w). The 4k
intervals in
UP(u)
overlapsome
h>
1 adjacent intervals inOLDSUP(v) and some j^l
adjacent intervals in
OLDSUP(w),
with h-l-j= 4k+l. The
h intervals inOLDSUP(v)
con- tain atmost 2h+l
itemsfrom SUP(v), by
the inductive hypothesis,and
likewise, the j intervals inOLDSUP(w)
contain atmost
2j+
l itemsfrom SUP(w).
Recall thatNEWUP(u) = SUP(v) U SUP(w). Thus
the4k
intervals inUP(u)
contain atmost
2h-H2j-l-2
=
8k-H4 itemsfrom NEWUP(u).
ButNEWSUP(u)
comprises every fourth itemin
NEWUP(u);
thus the k adjacent intervals inSUP(u)
contain atmost
2k-HI itemsfrom
NEWSUP(u).
It
remains
toprove
theLemma
for the firstand
second stages inwhich
u is external (there isno NEWUP(u)
array,and hence no NEWSUP(u)
array for the third stage inwhich
u is external).Here we
canmake
the following stronger claim concerning the rela- tionshipbetween SUP(u) and NEWSUP(u):
k adjacent intervals inSUP(u)
contain atmost 2k
itemsfrom NEWSUP(u)
and every item inSUP(u)
occurs inNEWSUP(u).
This isreadily seen.
Consider
the first stage inwhich
u is external.SUP(u)
comprises every fourth item inUP(u) =
L(u)and NEWSUP(u)
comprises every second item inUP(u).
Clearly the claim is true for this stage; the
argument
is similar for thesecond
stage.Taking
k=
1we
obtain:Corollary 1:
SUP(u)
is a 3-cover ofNEWSUP(u).
Remark: An
attempt toprove Lemma
3, in thesame way,
with a sampling strategy of everysecond
(rather than every fourth) item will not succeed. This explainswhy we
chosethe present
sampling
strategy. It is not the only strategy that willwork
(another possibilityUltracomputer Note 115 Pag* '
is to
sample
every eighth item, or even to use amixed
strategy, such as using samples comprising every secondand
every fourth item, respectively, at alternate levels of the tree);however,
the present strategy appears to yield the best constants.We
turn to the analysis of the algorithm.We
startby computing
the totalnumber
of items in theUP
arrays. If |UP(u) |# and
v is not external, then2|UP(u)| = |NEWUP(u)| =
|SUP(v)|+ |SUP(w)| =
l/4(|UP(v)| +
|UP(w)|)=
l/2|UP(v)|; that isObservation: |UP(u) |
=
1/4 |UP(v) |. So the total size oftheUP
arrays at u's level is 1/8 of the size of theUP
arrays at v's level, ifv is not external.The
observationneed
not be true at externalnodes
v. It is true for the first stage inwhich
V is external; but for the second stage, |UP(u)|=
1/2 |UP(v)|,and
so the total size of theUP
arrays at u's level is 1/4 of the total size ofthe arrays at v's level; likewise, for the third stage, |UP(u)i -= |UP(v;)|,and
so the total size of theUP
arrays at u's level is 1/2 of the total size oftheUP
arrays at v's level.Thus, on
the first stage inwhich
v is external, the total size of theUP
arrays isn
+
n/8+
n/64+
• • •=
n +n/7;on
the second stage, it is n+
n/4+
n/32+
• •=
n+
2n/7;on
the third stage, it is n+
n/2+ n/16+
• • •= n+4n/7.
Similarly, on the first stage, thetotal size of the
SUP
arrays (andhence
of theNEWUP
arrays) isn/4
+
n/32+
n/256+
• • •=
2n/7;on
the second stage, it isn/2
+
n/16+ n/128+
• •=
4n/7;on
the third stage, it is n+
n/8+
n/64+••
•=
8n/7.We
conclude that the algorithmneeds 0(n)
processors (so as tohave
a processor standingby
each item in theUP, SUP, and NEWUP
arrays)and
takes constant time. Let us count preciselyhow many comparisons
the algorithm performs.Lemma
4:The
algorithmperforms 15/4nlogn
comparisons.Proof:
Comparisons
areperformed
in the second substep of Step 1and
in Step 2. In Step1, at
most
2comparisons
areperformed
for each item in eachSUP
array.Over
a sequence of three successive stages this is 2-(2n/7+
4n/7+
8n/7)= 4n
comparisons. In Step 2, atmost
2comparisons
areperformed
for each item in eachNEWUP
list.Over
a sequence ofthree successive stages this is also
4n
comparisons.However, we have overcounted
here;on
the third stage, inwhich node
ubecomes
external,we do
notperform any comparisons
for items in
NEWUP(u);
this reduces the cost of Step 2 to 2n comparisons.So
we have
a total of atmost
6ncomparisons
forany
three successive stages.How-
ever,
we
are still overcounting;we have
not used the fact that during the secondand
third stages inwhich node
u is external,SUP(u)
is a 2-cover ofNEWSUP(u) and
every item inSUP(u)
occurs inNEWSUP(u)
(see theproof
ofLemma
3). This implies that in Step 1,for each item in array
SUP(u),
in the second or third stage inwhich
u is external, atmost one comparison need
bemade
(and not two). This reduces thenumber
ofcomparisons
inStep 1, over a
sequence
of three stages,by
n/2+
n=
3/2n. Likewise, in Step 2, for each item in arrayNEWUP(u),
in the first or second stages inwhich
the children of u are exter- nal nodes, atmost one comparison
is performed. This reduces thenumber
ofcomparisons
in Step 2, over a
sequence
of three stages,by
n/4+
n/2=
3/4n.Thus
the totalnumber
of comparisons, over the course of three successive stages, is 5/2n for Step 1,and
5/4n for Step 2, a total of 15/4n comparisons.In order to reduce the
number
ofcomparisons
to5/2nlogn we need
tomodify
the algorithm slightly. In Step 1,when computing
the rank of each itemfrom SUP(v)
(resp.SUP(w))
inSUP(w)
(resp.SUP(v)), we
will allow only the items inSUP(v)
toperform comparisons
(or rather, processors associated with these items).We compute
the ranks for itemsfrom SUP(v)
as before.To
obtain the ranks for itemsfrom SUP(w) we need
tochange
both steps.We change
Step 1 as follows. For each item h inSUP(w), we compute
its rank r in
UP(u)
as before (the first substep). Let k be the item of rank r inUP(u), and
let s be the rank of k in
SUP(v). We
also store the rank s with item h. s is agood
esti-mate
of the rank of h inSUP(v);
it is atmost
3 smaller than the actual rank.We change
Step 2 as follows. Item e inSUP(v),
of rank t,communicates
its rank to the following, atmost
three, receiving items inSUP(w):
those items with rank t inSUP(v) which
at present store a smaller estimate for this rank. (These items aredetermined
as follows. Let dand
fbe the items
from UP(u)
that straddle e. Let g be the successor of e inSUP(v). The
receiving items are exactly those items straddled both
by
eand
g,and by
dand
f; the second constraint implies that there are atmost
3 receiving items for e,by
the 3-cover pro- perty.)For
those items h inSUP(w)
that do not receive anew
rankfrom
an item inSUP(v),
the rank scomputed
in the modified Step 1 is the correct rank.This
new procedure
reduces thenumber
ofcomparisons
in Step 1by
a factor of2, and leaves unaltered thenumber
ofcomparisons
in Step 2.We
conclude:Ultracomputer Note 115 Page 9
Lemma
5:The
algorithmperforms 5/2nlogn
comparisons.We have shown
Theorem
1:There
is anCREW PRAM
sorting algorithm that runs inO(logn)
time on n processors,performing
atmost 5/2nlogn
comparisons.Remark: The
algorithm needs only0(n)
space.For
although each stage requires0(n)
space, the space can be reusedfrom
stage to stage.3.
The ERE W Algorithm
The
algorithmfrom
Section 2 is notEREW
at present.While
it is possible tomodify
Step 1 of a phase so that it runs in constant time on anEREW PRAM,
thesame
does not appear to be true for Step 2. Sincewe
use asomewhat
differentmerging procedure
here,we
will not explainhow
Step 1 can be modified.However, we do
explain the difficulty faced inmaking
Step 2 run in constant time on anEREW PRAM. The
explanation fol- lows.Consider NEWUP(u) and
consider eand
g,two
items adjacent inSUP(v);
suppose that inNEWUP(u), between
eand
g, there aremany
items ffrom SUP(w).
Letf
be an item inNEWSUP(v), between
eand
g.The
difficulty is that for each item fwe have
to decide the relative order of fand
f; furthermore, the decisionmust be made
in constant time, without read conflicts, for every such item f. This cannot be done.To
obtain an optimal logarithmic timeEREW
sorting algorithmwe need
tomodify
our approach.Essentially, the modification causes this difficult computation to
become
easyby precom-
putingmost
of the result.We now
describe theEREW
algorithm precisely.We
use thesame
tree as for theCREW
algorithm.At
eachnode
v of the treewe
maintain 2 arrays:UP(v)
(defined as before),and DOWN(v)
(to be defined).We
define the arraySUP(v)
as before;we
intro- duce a secondsample
array,SDOWN(v):
it comprises every fourth item inDOWN(v).
Let u,
w,
Xand
y, be respectively, the parent, sibling, left child,and
right child of v.A
stage of the algorithm comprises the following three steps,
performed
at eachnode
v.(1)
Form
the arraysSUP(v) and SDOWN(v).
(2)
Compute NEWUP(v) = SUP(x) U SUP(y).
(3)
Compute NEWDOWN(v) = SUP(w) U SDOWN(u).
We
willneed
to maintainsome
other arrays in order toperform
themerges
in constant time;namely,
the arraysUP(v) U SDOWN(v) and SUP(v) U DOWN(v).
It is useful to note thatSDOWN(v)
is a 3-cover ofNEWSDOWN(v);
theproof
of this result is identical to theproof
of the 3-cover property for theSUP
arrays (given inLemma
3and
Corollary1).
We
describe thenew merging
procedure. First,we need
to introducesome
notation.We
write Jx K
to denote that the arrays Jand K
are cross-ranked.We show how
tocom-
pute JX K
in constant time using a linearnumber
of processors (this yieldsL =
JU
K), supposing thatwe
are given the following arraysand
cross-ranks:(i)
Arrays
SJand SK
that are 3-covers for Jand K,
respectively. Also, for each itemin SJ (resp.
SK) we have
its rank in J (resp. K).(ii) J
X SK and
SJX
K.(iii) SJ
X SK and SL =
SJU
SK.The
intervalbetween two
adjacent items, eand
f,from SL =
SJU SK
contains atmost
3 itemsfrom
each of J andK.
In order to cross-rank Jand K,
it suffices, for each such interval, to determine the relative order of the (at most) 6 items it contains. Such an interval, without loss of generality, can beassumed
tohave one
of the followingtwo
forms:(a)
Both
endpoints are in SJ.(b)
The
left endpoint is in SJand
the right endpoint is inSK.
We show how
tomerge
the (at most) 6 items contained in the interval in each of these cases.There
aretwo
substeps: in Substep 1we
identify thetwo
sets of (at most) 3 items tobe
merged, and
in Substep 2we perform
themerge.
Substep
1.Case
(a).The
(at most) 3 itemsfrom
J are those straddledby
eand
f (they are readilydetermined from
(i)).The
(at most) 3 itemsfrom K,
again, are those straddledby
e and f(they are readily
determined
using SJX K from
(ii)).Case
(b).We
obtain the (at most) 3 itemsfrom
Jby examining
Jx
SK:we
use the pointerfrom
the left (SJ) endpoint into J to provide the left endpoint in Jx SK, and we
use the right(SK)
endpoint to give the right endpoint in JX
SK. All the items in TU SK between
Ultracomputer Note 115 Page 11
these
two
endpoints are items in J.The
(at most) 3 itemsfrom K
are obtained similarly,by examining
SJx K.
Substep
2. Itremains
toperform
a set ofmerge
operations, the output of eachmerge
comprising atmost
6 items.Each merge
requires atmost
5 comparisons,and
a constantnumber
of other operations.To
carry out thisprocedure we
associateone
processor with each interval in the array SL.The number
of intervals isone
larger than thenumber
of items in this array.Remark:
If SJ (resp.SK)
is a 4-cover of J (resp.K)
but SJ (resp.SK)
is contained in J (resp.K)
then thesame
algorithm can be used, since the interior ofany
interval inSL
will still contain atmost
3 itemsfrom
Jand
atmost
3 itemsfrom K.
We suppose
that the following cross-ranks are available at the start of aphase
at eachnode
v: ' •'SUP(v) X SDOWN(v), SUP(v) x DOWN(v), UP(v) x SDOWN(v), OLDSUP(v) X OLDSUP(w), OLDSUP(w) x OLDSDOWN(u).
In addition,
we
note that sinceDOWN(v) = OLDSUP(w) U OLDSDOWN(u), and
aswe have SUP(v) x DOWN(v) and OLDSUP(w) x OLDSDOWN(u), we
can immediately obtain:SUP(v) X OLDSUP(w), SUP(v) x OLDSDOWN(u).
Likewise,
from UP(u) = OLDSUP(v) U OLDSUP(w), UP(u) x SDOWN(u)
(atnode
u),and OLDSUP(v) x OLDSUP(w), we
obtainOLDSUP(w) x SDOWN(u).
The computation
proceeds in 5 steps.Step 1.
Compute SUP(v) x SUP(w)
(yieldingNEWUP(u)). We
already have:(i)
OLDSUP(v) X OLDSUP(w).
(ii)
SUP(v) X OLDSUP(w).
(iii)
OLDSUP(v) x SUP(w) (from node
w).Step 2.
Compute SUP(w) x SDOWN(u)
(yieldingNEWDOWN(v)). We
already have:(i)
OLDSUP(w) X OLDSDOWN(u).
(ii)
OLDSUP(w) X SDOWN(u).
(iii)
SUP(w) X OLDSDOWN(u) (from node
w).Step 3.
Compute NEWSUP(v) x NEWSDOWN(v). We
already have:(i)
SUP(v) X SDOWN(v).
(ii)
SUP(v) X NEWDOWN(v), and hence SUP(v) X NEWSDOWN(v)
(this is obtained from:
SUP(v) x SUP(w), from
Step 1,and SUP(v) x SDOWN(u),
from
Step 2 atnode w,
yieldingSUP(v) X [SUP(w) U SDOWN(u)] = SUP(v) X NEWDOWN(v)).
(iii)
NEWUP(v) x SDOWN(v), and hence NEWSUP(v) x SDOWN(v)
(this is obtained
from SUP(x) x SDOWN(v), from
Step 2 atnode
y,and
SUP(y) X SDOWN(v), from
Step 2 atnode
x, yielding[SUP(x) U SUP(y)] X SDOWN(v) = NEWUP(v) x SDOWN(v).
Step 4.
Compute NEWUP(v) x NEWSDOWN(v). We
already have:(i)
NEWSUP(v) X SDOWN(v) (from
Step 3(iii)).(ii)
NEWUP(v) X SDOWN(v) (from
Step 3(iii)).(iii)
NEWSUP(v) x NEWSDOWN(v) (from
Step 3).(Here NEWSUP(v)
is a 4-cover ofNEWUP(v),
contained inNEWUP(v);
as explained inthe
remark
following themerging
procedure, this leaves the complexity of themerging procedure unchanged.)
Step 5.
Compute NEWSUP(v) x NEWDOWN(v). We
already have:(i)
SUP(v) X NEWSDOWN(v) (from
Step 3(ii)).(ii)
SUP(v) X NEWDOWN(v) (from
Step 3(ii)).(iii)
NEWSUP(v) x NEWSDOWN(v) (from
Step 3).We
conclude that each stage can beperformed
in constant time, given one processor for each item in each arraynamed
in (i) of each step.It
remains
toshow
that only0(n)
processors areneeded by
the algorithm. This is aconsequence
of the following linearbound
on the total size of theDOWN
arrays.Lemma
6:|DOWN(v)
|<
16/31 |SUP(v)|.Ultracomputer Note 115 ^"8^ 13
Proof: This is readily verified
by
inductionon
the stagenumber.
Corollary 2:
The
total size of theDOWN
arrays, atany one
stage, is 0(n).This algorithm, like the
CREW
algorithm, has31ogn
stages.We
concludeTheorem
2:There
is anEREW PRAM
algorithm for sorting n items; it uses n processorsand 0(n)
space,and
runs inO(logn)
time.We do
not give the elaborations to themerging procedure
required to obtain a small totalnumber
ofcomparisons
(as regards constants).We
simplyremark
that it is not diffi- cult to reduce the totalnumber
ofcomparisons
to less thanSnlogn
comparisons;however,
as there is
no
corresponding reduction in thenumber
of other operations, this complexitybound
gives a misleading impression of efficiency (which iswhy we do
not strive to attainit).
4.
A Sublogarithmic Time CRCW Algorithm
[AAV-86]
give tightupper and
lowerbounds
for sorting in the parallelcomparison model;
usingpS;n
processors to sort n items thebound
is ©(-;—
^
,)
In addition, there logzp/nis a lower
bound
of Ci(-—
, ) for sorting n items with a
polynomial number
of proces- loglognsors in the
CRCW PRAM model
[BH-87].We
give aCRCW
algorithm that uses time0(
-1+
, loglog2p/n—
f— r— —
) fornSp^n^.
It is not clear which, if any, of theupper and
lower'
^ . ^.
FF
bounds
are tight for theCRCW PRAM model.
We
describe the algorithm; it is very similar to theCREW
algorithm. Let r=
p/n. Itis convenient to
assume
that n is apower
of r (the details of the general case are left to the reader).There
are threemajor
changes to theCREW
algorithm. First, rather than use abinary tree to guide the
merges, we
use anr-way
tree. This tree has height h=
logn/logr.Second,
we
define the arraySUP(v)
tocomprise
every r^ item inUP(v),
rather than every fourth item (with the natural modification at the external nodes). Third, the arrayNEWUP(v)
is defined to be ther-way merge
of theSUP
arrays at its r children.As
before, the algorithm will
have
3h=
31ogn/logr stages.But
here, rather than use constant time, each stage will take 0(logr/loglog2r) time.To
obtain ther-way merge we perform
l/2r(r—1) pairwisemerges
of the rSUP
arrays.
To
obtain the rank of an item in the arrayNEWUP(v) we sum
its ranks in each ofthe r
SUP
arrays.We
use r processors tocompute
thissum;
they take 0(logr/loglogr) timeon
aCRCW PRAM
using thesummation
algorithmfrom
[CV-87].Before explaining
how
toperform
a single pairwisemerge
it is useful toprove
a cover property.Lemma
7: k intervals inSUP(v)
contain atmost
rk+
1 items inNEWSUP(v).
Proof:
The proof
is very similar to that ofLemma
2; the details are left to the reader.Corollary 3:
SUP(v)
is an (r+l)-cover ofNEWSUP(v).
We perform
themerges
essentially as in theCREW
algorithm.Here,
at the start of a stage,we need
toassume
that for eachnode
u, for each item inUP(u) we have
its rank inthe
SUP
arrays at all r of u's children. Let v,w
be children of u.We proceed
intwo
steps, as in the
CREW
algorithm.In Step 1,
we
start be dividing each pairwisemerge
into a collection ofmerging
sub- problems, each of size atmost
2(r+l); eachsubproblem
is then solved using Valiant'smerging
algorithm [Va-75].To
divide amerge
intosubproblems, we
exploit the fact thatUP(u)
is an (r+
l)-cover ofSUP(v),
as follows.For
each child v of u,we
label each itemin
SUP(v)
with its rank inUP(u)
(this is carried out in constant timeby
providing each item inUP(u)
with r(r+l) processors).A subproblem
is definedby
those items labeled with thesame
rank; it comprisestwo
subarrays each containing atmost r+1
items.For
theproblem
ofmerging SUP(v) and SUP(w),
forany
item e inSUP(v)
(resp.SUP(w))
the boundaries of itssubproblem
can befound
as follows: let fand
g be the items inUP(u)
straddling e (obtained using the rank of e in
UP(u));
the ranks of fand
g inSUP(v), SUP(w)
yield the boundaries of thesubproblem. To
ensure that themerges performed
using Valiant's algorithm each take 0(loglog2r) timewe need
to provide a linearnumber
of processors for each
merge.
Since each item inSUP(v)
participates in exactlyr-1 merges
it suffices to allocater—
1 processors to each item in eachSUP
array.In Step 2 (ranking the items
from NEWUP(u) - SUP(v)
inNEWSUP(v)
for each child V of u)we proceed
as follows.Consider
an interval I inducedby
an item efrom SUP(v). For
each such interval I,we
divide each collection of itemsfrom
NEWUP(u) - SUP(v),
contained in I, into sets of r contiguous items, with possiblyone
smaller set per interval.Using
Valiant's algorithm,we merge
each such set with the atmost r+
1 items inNEWSUP(v)
contained in I. Ifwe
allocate r processors to eachmerging
UltracomputerNote 115 Pag« 15
problem
they will take0(log]og2r)
time.To
allocate the processors,we
assign2(r—
1)processors to each item in
NEWUP(u). Each
item participates inr—
1merging
problems.In a
merging problem,
if the item is part of a set of size r, the item usesone
ofits assigned processors. If the item is part of a set of size <r, the item takesone
processorfrom
the item e defining the interval I.Each
item e contributes atmost r—
1 processors to amerging problem
in the lattermanner,
thus it suffices to provide2(r—
1) processors to each item inNEWUP(u).
We
concludeTheorem
3:There
is aCRCW
sorting algorithm for theCRCW PRAM
that usesn<p<n^
J • • ->/ logn _
processors
and
runsm
time0(
, , ,—
f—
-
——
).^
1
+
loglog2p/n 5.A Parametric Search Technique
The
reader iswarned
that this section is not self contained.We
recallMegiddo's
parametric search technique [M-83]and
itsimprovement
inmany
instances in [C-87].The improvement was
an asymptoticimprovement,
butwas
not practical for itwas based on
theAKS
sorting network.As we
will explain, the role playedby
theAKS network
can be replacedby
theEREW
sorting algorithmfrom
Section 3.In a nutshell, the procedure of [C-87] can
be
described as follows.A comparison
based sorting algorithm is executed;however
each 'comparison' is an expensive operation costingC(n)
time,where
n is the size of theproblem
at hand. In addition, thecomparisons have
the property that they can be 'batched': given a set of c comparisons,one
ofthem
can be evaluated,and
the result of this evaluation resolves a further c/2 comparisons, in an additional 0(c) time.Examples
of searchproblems
(called parametric search problems), mostly geometric search problems, forwhich
thisapproach
is fruitful, include [M-83, C-87, C-85, CSS-86].Megiddo showed
that parallel sorting algorithms, executed sequentially, providegood
sorting algorithms for this type ofproblem;
the reason is that a parallel sort- ing algorithm naturally batches comparisons.In [C-87] it
was shown how
to achieve a running time of0(nlogn + lognC(n))
for the parametric search problems,when
using a depthO(logn)
sortingnetwork
as the sorting algorithm. (Briefly, the solution requiredO(logn)
'comparisons' tobe
evaluated; the over-head
for running the sorting algorithmand
selecting thecomparisons
to be evaluatedwas
O(nlogn)
time.) Itwas
alsoobserved
that to achieve this result it sufficed tohave
acom-
parison based algorithmwhich
could be represented as anO(logn)
depth,0(n)
width, directed acyclic graph, havingbounded
in degree,where
eachnode
of the graph represented acomparator and
the edges carried the outputs of the comparators.In fact, a slightly
more
general claim holds.We
startby
defining a computationgraph
corresponding to anEREW PRAM
algorithm on a given input.We
define a parallelEREW PRAM
computation to proceed in asequence
of steps of the following form. In each step, every processorperforms
atmost
b (a constant) reads, followedby
atmost
b writes; a constantnumber
of arithmetic operationsand comparisons
are intermixed (inany
order) with the readsand
writes.We
represent the computation of the algorithm on agiven input as a computation graph; the
graph
has depth2T (where T
is the running time of the algorithm)and
widthM+P (where M
is the space usedby
the algorithmand P
is thenumber
of processors usedby
the algorithm). In thecomputation graph
eachnode
represents either amemory
cell at a given step, or a processor at a given step.There
is a directededge (<m,t>, <p,t>)
if processor p readsmemory
cellm
at step t;hkewise
thereis a directed
edge (<p,t>,<m,t+ 1>)
if processor p writes tomemory
cellm
at step t. Ifno
write ismade
tomemory
cellm
at step t, there is a directededge (<m,t>, <m,t+
1>).Suppose we
restrict our attention to algorithms such that at the start of the algorithm, for eachmemory
cell (at stept+1) we know whether
the in-edge (in the computation graph)comes from
a processor (at step t) or amemory
cell (at step t).For
a sorting algo- rithm of this type, that runs in timeO(logn)
on n processors using0(n)
space,we
can stillachieve a running time of
0(nlogn + lognC(n))
for the parametric search problems. (To understand this, it is necessary to read [C-87], Sections 1-3.The
reason the result holds isthat
we
can determinewhen
amemory
cell is active, to use the terminology of [C-87],and
thus play thegame
of Section 2 of [C-87] on the computation graph. In general, ifwe
do nothave
the conditionon
the in-edges formemory
cells, it is not clearhow
to determine ifa
memory
cell is active.As
explained in Section 3 of [C-87], given a solution to thegame
of Section 2,
we
can readily obtain a solution to the parametric search problem).Remark: The computation
graphneed
not be thesame
for all inputs of size n. In addition, the graph does nothave
to be explicitlyknown
at the start of the sorting algorithm.Ultracomputer Note 115 P"g« 1'