http://wrap.warwick.ac.uk/
Original citation: Munro, J. I. and Paterson, M. S. (1978) Selection and sorting with limited storage. Coventry, UK: Department of Computer Science. (Theory of Computation Report). CS-RR-024
Permanent WRAP url: http://wrap.warwick.ac.uk/46321
Copyright and reuse:
The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable the material made available in WRAP has been checked for eligibility before being made available.
Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge. Provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page and the content is not changed in any way.
A note on versions:
The version presented in WRAP is the published version or, version of record, and may be cited as it appears here.
The University of Warwick
THEORY OF COMPUTATION REPORT NO. 24
SELECTION AND SORTING WITH LIMITED STORAGE
BY J. I. MUNRO & M. S. PATERSON
Department of Computer Science
University of Warwick
COVENTRY CV4 7AL
ENGLAND.
SELECTION AND SORTING WITH LIMITED STORAGE
by
J. I. Munro *
Dept. of Computer Science
University of Waterloo
Ontario
CANADA
and
M. S. Paterson
Dept. of Computer Science
University of Warwick
Coventry
U.K.
Abstract.
When selecting from, or sorting, a file stored on a read-only tape and the internal storage is rather limited, several passes of the input tape may be required. We study the relation between the amount of internal storage available and the number of passes required to select the Kth highest of N inputs. We show, for example, that to find the median in two passes requires at least $\Omega(N^{1/2})$ and at most $O(N^{1/2} \log N)$ internal storage. For probabilistic methods, $\Theta(N^{1/2})$ internal storage is necessary and sufficient for a single pass method which finds the median with arbitrarily high probability.
1. Introduction
As a paradigmatic study of effects of internal storage limitations on large-scale data-processing tasks, we consider problems of searching and sorting in data stored on a one-way read-only tape when the amount of random-access working space is severely constrained. We shall quantify rather closely the relation between the number of passes over the input file which are required for these tasks and the amount of storage available for a given size of the file. In several cases the upper bounds are demonstrated by new sampling algorithms of some practical interest.
* This author was partially supported by a Senior Visiting Fellowship from the Science Research Council while visiting the University of [...]
In our computational model the data is a sequence of N distinct elements stored on a one-way read-only tape. An element from the tape can be read into one of S locations of random-access storage. The elements are from some totally ordered set (for example the real numbers) and a binary comparison can be made at any time between any two elements within the random-access storage. Initially the storage is empty and the tape is placed with the reading head at the beginning. After each pass the tape is rewound to this position with no reading permitted.
Notational note. For functions of several arguments we shall write $f(X) = O(g(X))$ when $\exists c > 0$ such that $|f(X)| \le c \cdot g(X)$ for all X except those naturally or explicitly excluded. We also use $f = \Omega(g)$ for $g = O(f)$; and we use $f = \Theta(g)$ for $f = O(g)$ and $g = O(f)$.
In section 2 we present results concerning the problem of sorting the data, where, in view of the limitations imposed by our model, this must be considered as the determination of the sorted order rather than any actual rearrangement. For P-pass algorithms we show that $\Theta(N/P)$ storage locations are necessary and sufficient.
The greater part of this paper is occupied with the selection problem of retrieving, for some given K, the Kth highest among N input elements. For clarity and convenience we adopt a terminology of altitude in respect of the ordering, e.g. we use terms such as "highest", "below", "lower than". The most interesting special case of this is finding the median (i.e. when $K = \lceil N/2 \rceil$). By symmetry we may always assume that $K \le \lceil N/2 \rceil$. It is easy to show that $K + 1$ locations are necessary and sufficient for a single pass. Algorithms using this minimal storage are studied in [1], where it is shown that for the median only $\Theta(N)$ comparisons are needed, whereas for $K = \alpha N$, $\alpha$ fixed, $0 < \alpha < \frac{1}{2}$, the retrieval of the Kth highest requires $\Theta(N \log N)$ comparisons. In contrast, a two-pass probabilistic method using only $O(N^{1/2})$ storage and $3N/2 + o(N)$ comparisons is presented in [2]. Making use of an internal randomizer, it finds the Kth highest element with an arbitrarily great probability which is independent of the order of the inputs.
The principal results obtained in this paper are upper and lower bounds which show the amount of storage required by a P-pass deterministic selection algorithm to be roughly $N^{1/P}$. Other results are that under the rather strong assumption that all input orderings are equally likely, for a single-pass algorithm with a high expectation of selecting the median, $\Theta(N^{1/2})$ locations are necessary and sufficient.
2. Elementary results.
Here, as throughout the paper, S denotes the number of storage locations available. Since comparisons can only be made within these locations, we will assume always that $S \ge 2$.
A naive sorting algorithm determines in its first pass the highest $S-1$ elements of the input, and their relative order, and then in successive passes it ignores any elements ranked in previous passes in order to determine the ranks of the next $S-1$ highest elements. This algorithm requires only $\lceil (N-1)/(S-1) \rceil$ passes.
We note that to "ignore" previously ranked elements requires the retention of a large amount of information by the program. A large "program memory" is inconsistent with our storage limitation for any [...] it is being determined then an algorithm with very small program memory may be obtained at the cost of just one extra storage location. This location is to hold the lowest element ranked so far and each new data element is compared with this to determine whether or not it should be ignored. Nearly all the algorithms to be described will use this technique in order to remain within the domain of practicality.
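The naive multi-pass sort with the "lowest element ranked so far" technique can be sketched as follows. This is an illustrative sketch, not the authors' program: the tape is modelled as a Python list of distinct elements, the names `naive_multipass_sort`, `buffer` and `threshold` are hypothetical, and `ranked` stands for output already emitted rather than internal storage. The parameter `S` counts the locations of the naive algorithm ($S-1$ retained elements plus the incoming one); `threshold` is the one extra location. The sketch keeps reading until every element has been emitted, so it uses $\lceil N/(S-1) \rceil$ passes rather than the $\lceil (N-1)/(S-1) \rceil$ needed merely to determine the order.

```python
import bisect


def naive_multipass_sort(tape, S):
    """Return (elements in descending order, number of passes used)."""
    if S < 2:
        raise ValueError("S must be at least 2")
    ranked = []        # emitted output, highest first (not counted as storage)
    threshold = None   # extra location: lowest element ranked so far
    passes = 0
    while len(ranked) < len(tape):
        passes += 1
        buffer = []    # at most S-1 retained elements, kept in ascending order
        for x in tape:                       # one left-to-right pass
            if threshold is not None and x >= threshold:
                continue                     # ranked in an earlier pass: ignore
            bisect.insort(buffer, x)         # x occupies the S-th location
            if len(buffer) > S - 1:
                buffer.pop(0)                # discard the lowest candidate
        ranked.extend(reversed(buffer))      # next S-1 highest, in order
        threshold = buffer[0]
    return ranked, passes
```

For example, with seven elements and `S = 3` each pass ranks two further elements, so four passes are used.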
We give a simple lower bound argument to establish the following result.
Theorem 1. The least storage required by any P-pass sorting algorithm for N elements is $\Theta(N/P)$.
Proof. In view of the algorithm given above we require only a lower bound. Suppose that the ordering of the data is such that the 1st, 3rd, 5th, ... highest elements are in the first half of the tape, whereas the 2nd, 4th, 6th, ... are in the second half. Since a valid algorithm must at some time make a direct comparison between the $(2r-1)$st and $(2r)$th elements for $r = 1, \ldots, \lfloor N/2 \rfloor$, either the odd-ranked element must be carried in storage at some forward transition across the midpoint of the tape or the even-ranked element must be retained during some intermediate rewind. If P passes are used by an algorithm for this case, there are P forward transitions and $P-1$ intermediate rewinds, each carrying at most S elements, so we can argue that $(2P-1)S \ge \lfloor N/2 \rfloor$. Hence $S > N/4P$.
□
3. Multi-pass algorithms for selection.
When S is more than about $(\log N)^2$ an efficient algorithm may be designed as follows. At the beginning of each pass a pair of elements ("filters"), between which the required element is guaranteed to lie, is retained in the storage, though their precise ranks may so far be undetermined. At the start of the algorithm we may pretend that "ideal" elements representing $\pm\infty$ fulfil this role.
During the pass any elements not between the filters are used merely to establish the exact ranks of the filters. From the remainder a suitably constructed sample is retained from which a new pair of filters is selected. For the initial pass the number of elements between the filters is N, and for the final pass this is to be reduced to at most $S-2$ so that all such elements can be retained for a final selection. With the details of the algorithm we shall establish the following relation.
Lemma 1. If at most n elements lie between the filters at the beginning of a pass then for the following pass this number is $O(n (\log n)^2 / S)$.
A simple estimation from this lemma yields the next upper bound.
Theorem 2. A P-pass algorithm which selects the Kth highest of N elements requires storage at most $O(N^{1/P} (\log N)^2)$, and only $O(N^{1/2} \log N)$ for $P = 2$.
Outline of the algorithm. For some fixed s, a sample at level i will be a sorted set of s elements chosen from a "population" of $2^i s$ elements according to the following scheme. A sample at level 0 consists of the whole population of $2^0 s$ elements in sorted order. A sample at level $i+1$ is formed by taking two samples at level i from the first and second halves of the population, "thinning" each by retaining only the second, fourth, sixth, ... elements from the top in each, and then merging the two subsamples to form one sorted sample.
We shall show that a sample deserves its name in that it contains a reasonably well spaced selection from the total order of its population. To this end consider the jth element from the top in a sample at level i. We denote by $L_{ij}$, $M_{ij}$ respectively the least and most numbers of elements from its corresponding population which can appear strictly above it in the total order.
Lemma 2. $L_{ij} = j \cdot 2^i - 1$, $M_{ij} = (i + j - 1) \cdot 2^i$.
Proof. Clearly, for $1 \le j \le s$, $L_{0j} = M_{0j} = j - 1$. We use the convention that $L_{i0} = -1$ for all $i \ge 0$. For $i \ge 1$, $j \ge 1$, we may then verify that
$$L_{ij} = \min_{p+q=j,\; p \ge 0,\; q \ge 0} \{ L_{i-1,2p} + L_{i-1,2q} + 1 \}$$
and
$$M_{ij} = \max_{p+q=j,\; p > 0,\; q \ge 0} \{ M_{i-1,2p} + M_{i-1,2q+2} \}.$$
From these equations the result may be proved inductively.
□
For a population of size at most $2^r s$ from which we wish to select the kth highest we shall choose as new filters the uth and vth elements of the final sample at level r, where u, v are the greatest and least integers respectively such that
$$k - 1 \ge M_{ru} = (r + u - 1) 2^r$$
and
$$k - 1 \le L_{rv} = v \cdot 2^r - 1.$$
The kth element must then be one of these elements or lie between them in the order. The number of elements between them is at most
$$M_{rv} - L_{ru} - 1 = (L_{rv} + 1 + (r-1)2^r) - (M_{ru} - 1 - (r-1)2^r) - 1 = 2(r-1)2^r + (L_{rv} - M_{ru} + 1) \le (2r-1)2^r$$
since $L_{rv} + 1 \le M_{ru} + 2^r$ by the extremality of u, v.
The maximum storage required is for a sub-sample (consisting of even-positioned elements of a sample) for each level below the rth, for one "working sample" and for the pair of filters. This is at most $rs/2 + s + 2$.
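The sampling scheme, the bounds of Lemma 2 and the choice of filters can be checked numerically with a short sketch. It is illustrative only and does not respect the storage limit: it builds the sample recursively in memory rather than in one pass over a tape. It assumes that s is even, that the population has exactly $2^r s$ distinct elements, and that a missing upper filter (no valid $u \ge 1$) means the previous upper filter is kept. The names `sample`, `choose_filters` and `verify` are hypothetical.

```python
import random
from heapq import merge


def sample(population, s):
    """Level-i sample (highest first) of a population of 2**i * s elements."""
    if len(population) == s:
        return sorted(population, reverse=True)        # level 0: whole population
    half = len(population) // 2
    first = sample(population[:half], s)[1::2]         # keep 2nd, 4th, 6th, ...
    second = sample(population[half:], s)[1::2]
    return list(merge(first, second, reverse=True))    # merge the two subsamples


def choose_filters(smp, k, r):
    """Return (upper, lower) filters for the k-th highest; upper may be None."""
    w = 2 ** r
    u = (k - 1) // w - r + 1     # greatest u with (r + u - 1) * 2^r <= k - 1
    v = -(-k // w)               # least v with v * 2^r - 1 >= k - 1
    upper = smp[u - 1] if u >= 1 else None
    lower = smp[v - 1]
    return upper, lower


def verify(s=8, r=3, seed=1):
    """Check Lemma 2 and the (2r-1)*2^r bound on one random population."""
    rng = random.Random(seed)
    pop = rng.sample(range(10_000), s * 2 ** r)
    smp = sample(pop, s)
    for j, x in enumerate(smp, start=1):
        above = sum(y > x for y in pop)                # elements strictly above
        assert j * 2 ** r - 1 <= above <= (r + j - 1) * 2 ** r
    ranked = sorted(pop, reverse=True)
    worst = 0
    for k in range(1, len(pop) + 1):
        upper, lower = choose_filters(smp, k, r)
        hi = upper if upper is not None else float("inf")
        assert lower <= ranked[k - 1] <= hi            # k-th highest is bracketed
        worst = max(worst, sum(lower < y < hi for y in pop))
    assert worst <= (2 * r - 1) * 2 ** r
    return worst
```

Calling `verify()` raises an `AssertionError` if any bound fails and otherwise returns the largest number of elements found strictly between a pair of filters.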
We choose $s = \lceil S / \log n \rceil$ and $r = \lceil \log(n/s) \rceil$ so that $n \le 2^r s$ and the storage required is at most S, when S is sufficiently large. (We can assume $S \ge \Omega((\log n)^2)$.) The population size $n'$ for the next pass satisfies
$$n' \le (2r-1)2^r \le 4rn/s = O(n (\log n)^2 / S)$$
and we have justified Lemma 1.
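The "simple estimation" by which Lemma 1 yields Theorem 2 is not spelled out in the text; one way it might go is sketched here. The sequence $n_t$ (the number of elements between the filters after t passes) and the constants $c$, $c'$ are introduced only for this illustration, and rounding and the dependence of constants on P are ignored.

```latex
\begin{align*}
n_0 &= N, \qquad n_{t+1} \le c\, n_t (\log N)^2 / S
  && \text{(Lemma 1, using $\log n_t \le \log N$)} \\
n_{P-1} &\le N \left( \frac{c\,(\log N)^2}{S} \right)^{P-1} \\
n_{P-1} \le S - 2
  &\;\Longleftarrow\; S^{P} \ge c'\, N (\log N)^{2(P-1)}
  && \text{(for a suitable constant $c'$)} \\
S &= O\!\left( N^{1/P} (\log N)^{2 - 2/P} \right)
   \subseteq O\!\left( N^{1/P} (\log N)^{2} \right) \\
P = 2 &: \quad S = O\!\left( N^{1/2} \log N \right)
\end{align*}
```

The third line is the requirement that everything between the filters fits into storage for the final pass.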
The storage requirement of the algorithm can be reduced by a constant factor if samples are combined five at a time instead of two at a time.
Very small storage
It is clear that the above algorithm requires $S \ge \Omega((\log N)^2)$. For smaller values of S, one might employ the more practical of the "sorting" algorithms and terminate it after $\lceil (K-1)/(S-2) \rceil$ passes. This is the only algorithm we know for very small storage which does not require extensive program memory.
If we disregard practical limitations and allow an algorithm to remember an arbitrary amount of information about previous comparisons, we can prove the following upper bound.
Theorem 3. For $2 \le S \le O((\log N)^2)$, there is a class of selection algorithms which use at most $O((\log N)^3 / S)$ passes.
Proof. The algorithms simulate each pass of the algorithm of Theorem 2 by several passes with smaller storage. The comparisons performed in one pass of the original algorithm can be understood in correspondence with a binary tree of height r. At the leaves are $2^r$ level 0 samples; samples are thinned and merged until the final sample at level r is reached. With storage S equal to s, all the operations at one level of the tree can be carried out in one pass, whereas with $S > s$, it is possible to execute $O(S/s)$ levels at once. When $S < s$, a single level can be completed in $O(s/S)$ passes. The sorting and merging operations are done by the naive multi-pass sorting algorithm described in section 2, applied simultaneously to each sample. The memory required by the program to record the partial progress during such an operation would be intolerable in practice. However in all cases where $2 \le S \le O((\log N)^2)$ the total number of passes to simulate one pass as before is $O((\log N)^2 / S)$. The total of passes for the selection problem is therefore $O((\log N)^3 / S)$.
□
4. Lower bounds for multi-pass selection.
To show that the upper bounds derived in the previous section are close to optimal we here present corresponding lower bounds. Our main proof uses the idea of the "Adversary" who, knowing the innermost workings of our algorithm, devises an ordering of the input to confound it. He may also supply us with any extra information whatsoever, which cannot of course adversely affect the performance of the algorithm but is designed to facilitate the proof.
Theorem 4. Any P-pass algorithm to determine the median (or Kth highest for $N/2 \ge K = \Omega(N)$) of N elements requires at least $\Omega(N^{1/P})$ storage locations.
Corollary 1. The minimum storage S for a two-pass algorithm satisfies $\Omega(N^{1/2}) \le S \le O(N^{1/2} \log N)$.
Corollary 2. Provided $\log S \ge \Omega((\log N \cdot \log\log N)^{1/2})$, the maximum number of passes required is $\log N / \log S + O(1)$, while for $\log\log N = o(\log S)$ we have $P \sim \log N / \log S$.
Proof. Immediate from Theorems 2 and 4. □
The proof of Theorem 4 follows at once from the following Lemma which establishes that after one pass of any (median-finding) algorithm using S locations there remains to be done a computation at least as hard as finding the median for an input of size approximately $N/2S$.
Lemma 3. For any S-location algorithm on N input elements there is an ordering of the input tape so that after the first pass there is a set X of inputs with the following properties:
(i) no element of X remains in storage,
(ii) no orderings between elements of X are known,
(iii) the median of the original set is the median of X, and
(iv) X contains at least $\lfloor N/(2S-1) \rfloor$ elements.
Proof of Lemma. Without loss of generality the algorithm reads the first S inputs into storage and decides which one to discard as the $(S+1)$st input is read. The Adversary ensures that this $(S+1)$st element stands in the same relative ordering with respect to the remaining $S-1$ elements in storage as the one it replaces. This strategy for the Adversary is followed repeatedly, replacing each discarded element by a new element which is effectively indistinguishable. For $x = \lfloor N/(2S-1) \rfloor$, as the $(Sx+1)$st element is about to be read, at least one of the storage locations has had discarded from it a set X of at least x elements between which no comparisons have been made and no orderings can yet be deduced. It may be verified that the relative ordering of the remaining $N - Sx$ elements may be designed so that the median [...]
Whilst the asymptotic constant of 1/2 in this lemma can be raised to $\ln 2$, and even higher, by a more refined argument, an upper limit of this approach is marked by the trivial algorithm, which inputs and discards S at a time and leaves "incomparable" sets of size at most $\lceil N/S \rceil$.
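The step from Lemma 3 to Theorem 4 is stated to follow at once; one way to see it is sketched here. The sequence $n_t$ (the size of the median problem left after t passes) is introduced only for this illustration, the floor in Lemma 3 is ignored, and the bound for the last pass uses the earlier observation that a single pass selecting the Kth highest needs $K+1$ locations.

```latex
\begin{align*}
n_0 &= N, \qquad n_{t+1} \ge \frac{n_t}{2S}
  && \text{(Lemma 3 applied to each of the first $P-1$ passes)} \\
n_{P-1} &\ge \frac{N}{(2S)^{P-1}} \\
S &\ge \frac{n_{P-1}}{2}
  && \text{(the last pass must find a median in a single pass)} \\
(2S)^{P} &\ge N
  \;\Longrightarrow\; S \ge \tfrac{1}{2} N^{1/P} = \Omega\!\left( N^{1/P} \right)
\end{align*}
```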
5. Selection algorithms that "nearly always" succeed.
If we make the assumption (not required in [2]) that all input orderings are equally likely and we are willing to tolerate some small probability, say $10^{-6}$, of failure then the amount of storage required can be much reduced. For example a single-pass median algorithm finds $\Theta(N^{1/2})$ storage necessary and sufficient.
Probabilistic algorithms for selecting the median.
For a suitable choice of storage size S, the algorithm maintains in storage for as long as it can $S-1$ elements whose ranks among those read thus far are consecutive and as close to the current median as possible. To this end it keeps two counts H and L, both initially zero, of the numbers of elements which have so far been discarded above and below, respectively, the consecutive segment retained.
Under our assumption of equal likelihood, the probability that a new element read lies above all those retained is precisely $(H+1)/(H+S+L)$. In this case the element must be discarded and H incremented by one. The case where it lies below the retained segment is similar.
With probability $(S-2)/(H+S+L)$, the new element can be inserted strictly within the segment and either the highest or lowest of those retained is chosen for discarding according as $H < L$ or $H \ge L$ respectively.
At the end of the tape the median has been retained and determined provided $H + 1 \le \lceil N/2 \rceil \le N - L$.
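The single-pass procedure described above can be sketched as follows. This is an illustrative sketch, assuming distinct elements and a tape modelled as any Python iterable; the name `one_pass_median` and the convention of returning `None` on failure are hypothetical. While fewer than $S-1$ elements have been read they are simply all retained, and the median is taken to be the $\lceil N/2 \rceil$th highest.

```python
import bisect
import math


def one_pass_median(tape, S):
    """Return the ceil(N/2)-th highest element, or None if it was discarded."""
    kept = []      # retained segment, ascending; at most S-1 consecutive ranks
    H = L = 0      # numbers of elements discarded above / below the segment
    n = 0
    for x in tape:
        n += 1
        if len(kept) < S - 1:
            bisect.insort(kept, x)     # initial filling: nothing discarded yet
        elif x > kept[-1]:
            H += 1                     # above all retained: discard x
        elif x < kept[0]:
            L += 1                     # below all retained: discard x
        else:
            bisect.insort(kept, x)     # strictly within the segment
            if H < L:
                kept.pop()             # discard the highest retained
                H += 1
            else:
                kept.pop(0)            # discard the lowest retained
                L += 1
    k = math.ceil(n / 2)
    if H + 1 <= k <= n - L:            # success condition from the text
        return kept[len(kept) - (k - H)]   # kept[-1] has rank H + 1
    return None
```

On the ordering `[5, 1, 4, 2, 3]` with `S = 3` the sketch fails (the median 3 is discarded), which illustrates why S must be large enough for success to be likely on a random ordering.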
We have only to estimate the size of S required to guarantee this result with high probability.
The progress of the algorithm can be viewed as a random walk of the integer variable $D = H - L$ starting from the origin. It can be verified that for $0 < |D| < S-1$ the probability that $|D|$ is reduced at the next step is at least one half, and furthermore the condition that the median is found is equivalent to $|D| < S-1$.
For any $\epsilon > 0$, there is a constant C such that during the first $CS^2$ steps of a random walk about the origin, with equal probabilities of a step to the right or left, the probability of the random variable attaining magnitude $S-1$ is at most $\epsilon$ (see [3]). Since the random walk for our algorithm is more biased towards the origin over the region of interest the same result holds.
The algorithm described can be used as the basis of a multi-pass algorithm in the following way.
For suitably chosen constants $C_1$, $C_2$, depending on $\epsilon$, the probability that the median of the whole input set lies between the extreme elements of the segment retained after $C_1 S^2$ steps is very high. From this point on, for the remainder of the pass, the same $S-1$ elements are retained in storage and their ranks are found by comparisons with the rest of the input. If one of the retained elements is the median, the algorithm terminates; if not, the number of elements sharing the same "gap" as the median with respect to the stored elements can be shown to be at most $C_2 N / S^2$ with high probability. This set of elements satisfies the same assumption as to randomness as the initial set and so the same procedure may be used for further passes.
Hence:
Theorem 5. For any $\epsilon > 0$, $P \ge 1$ there is a P-pass median-finding algorithm with probability of failure at most $\epsilon$ which uses only $O(N^{1/2P})$ storage.
Lower bound for probabilistic algorithms.
Theorem 6. There is an $\epsilon > 0$, such that any one-pass algorithm which finds the median with probability of failure less than $\epsilon$ requires at least $\Omega(N^{1/2})$ storage.
Proof. Consider the situation after $\lceil N/2 \rceil$ elements have been read. The probability is at least half that the median is one of these, but only S of them can have been retained. The most likely candidates are towards the middle but the straightforward estimation of a hypergeometric distribution [3] shows that for a subset of size S of these elements to contain the median with probability above one quarter requires $S \ge \Omega(N^{1/2})$.
□
Corollary. For a single-pass algorithm which nearly always finds the median, $\Theta(N^{1/2})$ locations are necessary and sufficient.
6. Conclusions.
Our aim has been to determine the precise computational requirements for specific tasks of selecting from, or sorting, data presented on a read-only input tape under a regime of limited internal storage. We present new algorithms of some practical interest as well as lower bound proofs which exploit the joint constraints on internal storage and access to input data.
Our main algorithm for selection uses a novel sampling technique and can be implemented easily to require only about $N(3P/2 + \log S)$ comparisons in all.
The upper and lower bounds on storage differ only by a factor of order $(\log n)^2$ and give a clear idea of the trade-off relation between the number of passes and the amount of storage required.
The picture we have in the probabilistic case is much less complete. Theorem 6 is readily extensible to give a lower bound of $\log\log N - \log\log S - O(1)$ passes if we require that the only information retained from one pass to the next is a pair of filters and their ranks. It seems likely that the upper bound may be reduced to about this value but analysis of the algorithms we considered has so far proved intractable.
References
[1] Dobkin, D.P. and J.I. Munro, "Time and space bounds for selection problems", Proc. 5th Int. Colloq. on Automata, Languages and Programming, July 1978, Udine, Italy.
[2] Eisenstat, S.C., R.J. Lipton and J.I. Munro, "Probabilistic algorithms",