Original citation:
Perry, S. C., Harper, J. S., Kerbyson, D. J. and Nudd, G. R. (1999) Theory and operation
of the Warwick multiprocessor scheduling (MS) system. University of Warwick.
Department of Computer Science. (Department of Computer Science Research Report).
(Unpublished) CS-RR-363
Permanent WRAP url:
http://wrap.warwick.ac.uk/61090
Copyright and reuse:
The Warwick Research Archive Portal (WRAP) makes this work by researchers of the
University of Warwick available open access under the following conditions. Copyright ©
and all moral rights to the version of the paper presented here belong to the individual
author(s) and/or other copyright owners. To the extent reasonable and practicable the
material made available in WRAP has been checked for eligibility before being made
available.
Copies of full items can be used for personal research or study, educational, or
not-for-profit purposes without prior permission or charge. Provided that the authors, title and
full bibliographic details are credited, a hyperlink and/or URL is given for the original
metadata page and the content is not changed in any way.
A note on versions:
Multiproessor Sheduling (MS) System
S. C. Perry and J. S. Harper and D. J. Kerbyson and G. R. Nudd
High PerformaneSystems Group Department of Computer Siene
University of Warwik
Coventry, UK
Abstrat
Thispaperisonernedwiththeappliationofperformanepreditiontehniquesto theoptimisation ofparallelsystems, and,inpartiular,theuse ofthese tehniques on-the-yforoptimisingperformaneat run-time.Inontrasttootherperformane
tools,performanepredition resultsaremade availablevery rapidly,whihallows theiruseinreal-timeenvironments.Whenappliedtoprogramoptimisation,this al-lowsonsiderationofrun-timevariablessuhasinputdataandresoureavailability
thatarenot,ingeneral,availableduringthetraditional(ahead-of-time)performane tuningstage.
The main ontributionof thiswork is theappliation of preditive performane
data to the sheduling of a number of parallel tasks aross a large heterogeneous distributedomputingsystem. This is ahieved throughuseof just-in-time perfor-manepreditionoupledwithiterativeheuristialgorithmsforoptimisationofthe
meta-shedule.
Thepaperdesribesthemaintheoretialonsiderationsfordevelopmentofsuha shedulingsystem,andthendesribesaprototypeimplementation,theMS
shedul-ing system, together with some results obtained from this system when operated overamedium-sized(ampus-wide)distributedomputingnetwork.
1 Introdution
The priniple motivation for the use of any parallel system is the inreased
performane that an be ahieved by distributing workload among a number
of separate nodes. With omputational systems, the prodution of a
paral-lel appliation requires onsiderable extra work ompared to the equivalent
sequential program, and it is learly very important to maximise the
gorithm for an identied target arhiteture, and subsequent tuning using
performane tools suh as appliation measurement/traing. These tools are
generallyused before the appliationis run inearnest, and the `optimisation'
is neessarilyspei to the partiular target arhiteture.
However, the eÆieny with whih an appliation runs ona parallel mahine
also depends on other fators not foreseeable when the ode was originally
optimised. For example, it is possible to optimise ahe re-use on a
data-dependent basis[1,2℄, and the hoie of algorithm an be made dependent
on the input appliation parameters at run-time[3,4℄. Furthermore,
hetero-geneous data neessitates load balaning within the appliation, and, at the
system level, heterogeneous omputing resoures must alsobe sheduled and
dynamiallyload balaned. These issues beome even morehallenging when
one onsiders the trend towards large distributed omputing systems, often
built fromarange of ommodity omponents,whih provide aheterogeneous
and onstantly hanging set of omputing resoures.
Optimisationtehniques used whena programisabout to beexeuted
(here-afteralled dynami performane optimisation methods) an beonveniently
divided intotwo ategories:
Single-programissues are onernedwith the optimisationofasingle
appli-ationrunningona parallelsystem.Forexample, thesize and organisation
of loal memory and the eÆieny of the network and message passing
interfae are `hardware fators' inuening performane at this level. The
mappingof data and subtasks ontained within the parallel appliation to
the underlying network arhiteture is the main `software fator'
inuen-ing performane; this is itself dependent upon the number of proessing
nodes whihhave been alloated,and their relative network topology.The
hoie of algorithms and data-partitioning strategies is onsequently very
rihatthis stage,and dynamiperformaneoptimisationisonerned with
seletionof these parameterson a system- and data-dependent basis.
Multiprogramming issues are onernedwith optimisingtheperformaneof
the parallel system as a whole. In partiular, the problems of quantitative
partitioning(i.e.the numberofproessingnodes assignedtoeahtask)and
qualitative partitioning (i.e. the partiular subset of nodes and, for a
het-erogeneous system, their assoiated omputing power and ommuniation
osts)are ompromisesbetween minimisingthe exeution time of any
par-tiulartaskandmaximisingtheoveralleÆieny oftheparallelsystem.The
searhspae for this multipleparameter optimisationproblemisextremely
large,and is not fully dened untilrun-time.
Inontrasttothedynamioptimisationtehniquesitedabove[1{4℄,thispaper
run ona heterogeneous omputing network.
Theoretial onsiderationsappliable tothis problem willnow beoutlined.
1.1 Intratability
Considerforthemomentthe simpleraseofahomogeneousomputersystem,
onsistingof anumberofidential proessingnodesforwhihommuniation
osts are similar between all nodes, then the above problem an be stated
formallyas follows.
Problem 1 A shedulefora set of paralleltasks fT 0
;T 1
;:::;T n 1
g whihare
to be run on a network of idential proessing nodes fP 0
;P 1
;:::;P m 1
g is
dened by the alloation to eah task T j
a set of nodes j
2 fP 0
;:::;P m 1
g
and a start time j
at whih the alloated nodes all begin to exeute the task
in unison. Beause the proessing nodes are idential, the exeution time for
eah task is simply a funtion, t x
() j
, of the number of nodes alloated to it,
and it is presumed that this funtion is known in advane. The system uses
Run-To-Completion (RTC) sheduling, hene the tasks are not permitted to
overlap. The makespan, w, for a partiular shedule is dened as:
w= max 0jn 1 f j +t x (jj j jj) j
g; (1)
whih is the latest ompletion time of any task. The goal is to minimise this
funtion with respet to the shedule.
Problem1 is alledthe Multiproessor Sheduling(MS) problem.
The intratability of MS has been studied[5℄, and it is found that although
pseudo-polynomial time algorithmsexist for the ases of m = 2 and m = 3,
MS is NP-hard for the general ase m > 4. Approximate algorithms have
been proposed[6,7℄; these have a worst ase bound of w 2w ?
, where w ?
is
the makespanofthe optimalshedule.Althoughthesealgorithmsare auseful
fall-bak,forhighperformanesystemsafatoroftwoisunaeptableinmost
ases, and one would hope toahieve muh higher levels of optimisation.
In order to nd (near) optimal solutions to this ombinatorial optimisation
problem, the approah taken in this work is to nd good shedules through
useofiterativeheuristimethods.Twoheuristitehniqueshavebeenstudied:
Simulated Annealing (SA)[8,9℄ and a Geneti Algorithm (GA)[10,11℄. These
methods proeed through use of a `tness funtion' whih assigns a quality
this value. Details of these methods' implementation are given insetion 2.
1.2 Heterogeneous ombinatorial onsiderations
As mentioned above, the funtiont x
(),whihprovides aprogram's exeution
time for a given alloation of proessing nodes, (hereafter referred to as
a performane senario), is assumed to be known in advane. For an
other-wise unloaded homogeneous parallel system, where the exeution times are
not data-dependent, prole data ould oneivably be used for this purpose
(although for large one-shot appliationsthis would somewhat defeat the
ob-jet of the optimisation). The nodes are idential, so, for example, with a 60
node system there exists exatly 60 dierent performane senarios per task
(althoughifthe optimum speedupoursatalowernumberof nodesthenthe
later senarios willnot beused).
However, for the morerealistiase of aheterogeneous system, ombinatorial
explosion generally preludes prior knowledge of the exeution time. In the
extreme ase, the number of performane senarios for m entirely dierent
proessing nodes is P
m k=1
k!, whih for m = 60 is of the order 10 81
. Clearly,
for any non-trivial heterogeneous system it willnot be possible to obtainthe
funtion t x
() in advane of the sheduling stage. To overome this diÆulty,
we propose to use data from a performane predition system, alulated in
real time as the sheduling program requires it. This just-in-time approah
eetivelyprovidest x
()forthe heuristimethodsforheterogeneoussystems.
The implementationis desribed further insetion 5.2.
1.3 Performane predition
This work willonentrate entirely onperformane predition asthe method
of providing t x
(), either in advane for homogeneous situations or in real
time for heterogeneous systems. Performane predition is the tehnique of
estimating values for various aspets of how an appliation program will
ex-eute on a omputer system, given ertain well-dened desriptions of both
program and system. The performane predition toolset used here, PACE,
operates by haraterising the appliationin terms ofits prinipleoperations
(representing omputation osts) and its parallelisation strategy (whih
di-tates the ommuniation pattern). These are then ombined with models of
the system environment to produe a predition for the exeution time. The
key to this strategy is that the separation of program and system models
internal strutures have a low level of data-dependene is generally aurate
to within 10 % of the measured time. However, it is important to note that
for run-time optimisation purposes the auray of the predition is not an
overridingonern; any informationfromdetailedtrae data down toa basi
measure of program omplexity is always useful { it is better than no
infor-mation at all. A omplete desription of the PACE system an be found in
[12℄.
2 Solution of MS using Heuristi Methods
A number of standard texts desribe the operation of the SA and GA
teh-niques; the followingsetionsdetail onlyissues relatingto solutionof the MS
problemusing these methods.
2.1 Coding sheme and tness funtion
Bothtehniquesrequireaodingshemewhihanrepresentalllegitimate
so-lutionstotheoptimisationproblem.Forbothalgorithms,anypossiblesolution
isuniquelyrepresented by apartiularstring,S i
,andstrings aremanipulated
in various ways until the algorithm onverges upon an optimal solution. In
order for this manipulation to proeed in the orret diretion, a method of
presribinga qualityvalue(ortness) toeahsolutionstring isalsorequired.
The algorithmfor providingthis value is alled the tness funtion f v
(S i
).
The tness values of solutions to the MS problemare readily obtained { the
solutions whih represent the shedule with the least makespan are the most
desirable,and vie-versa.Theproessingnode set foreahtask,andthe order
in whih the tasks are exeuted, are enoded in eah solution string, and the
exeution times for eah task (given the set of nodes alloated) are obtained
from the predition system. It is therefore straightforward to alulate the
makespan of the shedule represented by any solutionstring S i
. This number
may be onverted froma ost funtion (f
) toa value funtion (f v
)by
multi-plyingallthemakespansby-1andnormalisingonthe interval0f v
(S i
)1.
The oding sheme we have developed for this problemonsists of 2 parts:
An ordering part, whih speies the order in whih the tasks are to be
exeuted. This part of the string is q-ary oded where q=n.
A mappingpart, whih speies the alloation of proessing nodes to eah
spe-task #2
map of
map of
task #4
map of
task #3
map of
task #5
map of
task #0
map of
task #1
11010 01010 11110 01000 10111 01001
2 4 1 0 5 3
task ordering
time
3
4
0
1
2
processor
task #5
task #4
task #2
task #0
task #1
task #3
Fig.1.An examplesolutionstring(LHS)and Gantthartrepresentingitsshedule (RHS).Thetwo partsofthesolutionstringareshownseparately,withtheordering
partaboveandthemappingpartbelow.Notethattheexeutiontimesofthevarious tasks are a funtion of the proessing node set alloated and are provided by the performane predition system. This data is assoiated with the task objet and
is never used diretly by the optimisation algorithm, but onlythrough the tness funtionf
v .
ifyingwhether ornot a partiular node is assignedto apartiular task.
The orderingof the task-alloation setionsinthe mappingpart of the string
isommensurate withthe taskorder.Ashort exampleofthis type ofsolution
string and its assoiated shedule are shown in Fig. 1. The ordering part of
this string is always guaranteed to belegitimate by the various manipulation
funtions used in the heuristi algorithms. However the same is not true of
the mapping part, and if the tness funtion enounters a string with a task
that has no proessing nodes alloated it will randomly assign one. Further
manipulationmay be required whenusing the system in a heterogeneous
en-vironment, asdesribed insetion 4.
From Fig. 1 it is lear that the task of parallel shedule optimisation is a
`paking problem' where the goal is to t the programs together as tightly
as possible. Examples of a simple shedule before and after appliation of a
heuristialgorithmare shown inFig. 2.
2.2 Geneti algorithm-based optimisation
AGA-based optimisationprogram,using the odingsheme desribed above,
was developed for testing with simulated (homogeneous) performane data
and the MS problem tness funtion. The ode was based upon the ideas
developed in [11℄, using a populationsize of 60 and stohasti remainder
se-letion. Speialised rossover and mutation funtions were developed for use
with the two-part oding sheme. The rossover funtion rst splies the two
ordering strings at a random loation, and then re-orders the ospring to
produe legitimatesolutions. The mappingparts are rossed over by rst
optimisation (leftpanel)and after 500 iterationsof theoptimising(GA) algorithm
(right panel). The dierent shaded retangles represent the programs indexed on theleft.
asingle-point(binary)rossover. The motivationfor there-ordering isto
pre-servethe node mappingassoiatedwithapartiulartaskfromone generation
to the next. The mutation stage is also two-part, with a swithing operator
randomly appliedto the ordering parts, and a randombit-ip applied tothe
mappingparts.
2.3 Simulated annealing-basedoptimisation
TheSA-basedoptimisationprogramusedthe sameoding shemeand tness
funtion desribed above, and wasa modiation of the travelling ode given
in Numerial Reipes[13, hap. 7℄. The program again uses a two-part
re-arrangement funtion, with path transport or reversal (performed aording
to an annealing shedule derived from the metropolis algorithm) ourring
separately for the ordering and mapping parts, and the mapping part
re-ordered in between.
3 Performane of the heuristis
In order to onveniently test the performane of the algorithms with
dier-ent problem sizes, tasks were simulated from a simple homogeneous parallel
omputationmodel, viz,
t x
(i)= C
pu
i +C
om
ti approahes for a set of 20 simulated tasks running on a 16 node system. The lines are the average of 100 runs of eah searh method; the error bars show the widthof 1 standarddeviationfrom themean at onvergene.
Hene from the derivative of Eq. 2 it was straightforward to reate task sets
with speied distributionsof sizes and optimum proessing node numbers.
Examples of the onvergene times for a simple sheduling problem are
pre-sented in Fig. 3. The gure shows the relative performane of the heuristi
algorithms and a random searh algorithm (using peak speedup node
allo-ations for eah task). The intratability of the MS problem preludes
om-parison of the results with an optimal shedule, but we assume the results
at onvergene to be near-optimal. As expeted, the heuristi methods oer
anappreiableimprovementover therandomsearh(whihdemonstratesthe
vastness of the searh spae even for a relatively simple problem). Although
the solution quality forthe two heuristialgorithmsisomparable at
onver-gene, in general GA provides higher-quality solutionsat shorter times than
SA. The reason for this is believed to be that the operation of SA is based
uponpatientmovement towards aminimum following anannealing shedule,
anddoesnotlenditselftoearlytermination.Conversely,GAisbaseduponthe
evolutionofaset(population)ofsolutionstrings,whihnotonlyndsminima
more quikly,but alsoadapts easilytohangesin the problem(e.g. the
addi-tionorompletion ofatask,orahangeinthe availablehardware resoures),
beause any urrent solution set will serve as goodquality starting points in
a slightly hanged situation. Hene hene for realisti, real-time sheduling
environments, where the poolof tasks and omputing resoures is onstantly
hanging,the ratio of the onvergene times forGA vs. SA (shown inFig. 3)
isgreatly exaggerated, with SAhaving toomplete itsannealing shedule for
3.1 Real-time extension of the MS problem
The simulations desribed above were onerned with the solutionof the MS
problemas dened in Prob. 1, with allthe tasks and their assoiated
perfor-mane data known in advane. Consider now a slightly more realisti ase,
wherenew tasksmaybeaddedtothe queue afterthe rst taskshavestarted.
HenetheMSproblembeomesdynami;tasksareaddedandremoved (when
they are omplete) from the sheduling problem in real time. As mentioned
earlier, the evolutionary basis of GA is partiularlywell suited to this
situa-tion, and this was the only heuristiused for the followingsimulations.
The simulationswork by rst allowing a \warm-up" period for the GA with
a set of tasks, and then removing the tasks whih have been sheduled for
exeution (i.e. the ones at the beginning of the Gantt hart). This tends to
leaveajagged`base'whihformsthebasisofthenextpakingproblem.When
a taskompletes, the algorithmis stopped and interrogated for its best
solu-tion, and any tasksat the bottom of the shedule begin exeution. Tasks are
ontinuouslyadded soastokeepthe numberofpre-sheduled tasksonstant.
For the ase where the queue initially ontains all the tasks to be exeuted,
minimisation of the makespan provides the most eÆient shedule. However,
for the real time ase now onsidered, one should also take into aount the
natureofthe idletimeinthe shedule.Idletimeatthe bottomofthe shedule
is partiularly undesirable, beause this is the proessing time whih will be
wasted rst, and is leastlikely tobe reovered by further iterationof the GA
orwhenmore tasksare added.Forthis reasonwepropose and use amodied
ost funtion:
f
=w (
1+ X
nodes Æ
h t
idle j
(0) i
)
(3)
Wheret idle j
(t)istheamountofidletimeonnodej startingattimet,withtime
beginningatzero.Withinthisregimesolutionswithidletimeatthebeginning
of the shedule on any of the proessors are penalised with a higher ost
funtion, in proportion to the overall makespan. Equation 3 will be referred
toas the Extended MS (XMS)ost funtion.
Shown in table 1 are the task exeution rates for the standard and extended
ost funtions used with the GA, ompared with the results from a simpler
rst-ome-rst-served (FIFO)sheduler. GA-XMSgivesa 20%improvement
fun-j
FIFO 242 110 58
GA-MS 821 357 182
GA-XMS 963 428 222
Table 1
Rates of exeution (hr 1
) for simulatedtasks of average size j
. The row marked \FIFO" shows results for a `rst-ome-rst-served' shedule. The rows prexed
\GA" show results for GA optimised shedules with dierent quality funtions, asdesribedinthe text.
tion,i.e.makespanalone).However, separatesimulationshaveshownthatthe
makespan isgenerally 10%worse forGA-XMS atonvergene. This
disrep-any demonstrates the penalty inurred for lak of knowledge of the entire
taskqueue beforesheduling begins.
In either ase, the heuristialgorithms give an inrease in the exeutionrate
of 3-4 times over the standard queueing method { this is a diret result of
knowledge of the exeution time funtion, t x
(), provided by performane
predition.
4 An Implementation: The MS Sheduling System
The ideas presented in the previous setions have been developed into a
working system for sheduling sequential and parallel tasks over a
hetero-geneous distributed omputing network. The goal of this projet, alled the
MS sheduling system, is to investigate and demonstrate the role of
perfor-mane predition in the optimisation of large distributed systems. We hope
that this work is omplementary to the many other distributed omputing
projets inoperation atthe moment[14,15℄.Our prototype system addresses
theissuesofintratability,heterogeneoussystems,dynamiandevolving
om-putingresoures,andperformanepredition.Wedonotattempttodupliate
or re-invent previous or ongoing work, hene our prototype system does not
address issues suh as seurity, network operating systems, resoure
disov-ery/identiationet.Thestate-of-the-artintheseandotherareasareovered
in[15℄.
An overview of the design and implementation of the MS system is given
TheMSshedulingsystemhasatitsheartaGAshedulingprogramtoprovide
good-quality solutions to the sheduling problem in real time, as desribed
above. However, a number of other programs (daemons) and design features
are neessary inorder to produe aworking system.
The main goal of the implementation was to make the `state' of the system
both visible and transendent. To this end, MS makes extensive use of the
lesystem (whihshould failitate itsextensionto larger distributed
omput-ing environments suh as GLOBUS), and a olletion of various diretories
and database les represent its omplete state at any partiular instant in
time.
Themostimportanttypeofleisajob-le.Afterajob 1
hasbeen submitted,
itundergoesaseriesofstatehanges,witheahstaterepresentingapartiular
stageinitslifetime.Inreality,this orrespondstoajob-le(adatabasewhih
uniquely represents the job, and ontains various job parameters and other
information)being moved through a diretory hierarhy. At most hanges of
state, additionalinformationis addedto the job-le. This hierarhy isshown
shematiallyon the LHS of Fig. 4.
Otherimportantlesarethehostsandhost-statsdatabases,whihprovideMS
with real-timeinformation onthe state of the availableomputing resoures.
These databases will be desribed further in setion 5.1.5. The various job
states (diretories)will nowbe desribed individually.
Submitted. This is the entry point to the sheduling system, where the
user presents jobs to be exeuted. At this point the job-le ontains the
basi information required to run the job, inluding the loation of the
exeutable(s), the input data, and various information pertaining to the
performane data ormodel.
Queued. When the job-le ismoved intothis diretoryit is given aunique
jobidentiationnumber(JID).Itthenawaitstheattentionofthe shedule
optimiser.
Sheduling. Onethesheduleoptimiserbeomesaware ofajobitismoved
intothis diretory for the durationof the optimisation proedure.
Runnable. At the time that the shedule optimiser deides to run a job it
movesthejob-leintothisdiretorywhereanotherdaemonisresponsiblefor
running it on the system. The shedule optimiser speies the omputing
1
A taskis alleda\job" withintheontext ofthe shedulingsystem (byanalogy
submitted
queued
scheduling
runnable
running
finished
exited
ms_stats
host stats
hosts
cache
eval
user
job states
daemons
data stores
state changes
information flow
evaluate performance model
query host
load avg,
idle, etc...
request perf.
estimate for
job/host
configuration
identify available
hosts
submit job with
add/remove hosts
performance model
+host-spec
+predicted-tx
+start-time
+finish-time
report results
by email
+job-id
Fig.4.Flowhartdiagram oftheMSshedulingsystem. Afullexplanationisgiven inthetext. Notethat notall MSdaemonsare shown inthisdiagram.
resoures on whih the job is to be run, and adds information about the
predited exeutiontime tothe job-le.
Running.One the job has begun to be exeuted on the system its job-le
ismovedtothisdiretory,wherethestart-timeisadded(thestart-timeand
preditedexeutiontimes of the urrently runningjobs are requiredby the
where it waits for the shedule optimiser to move it to the next diretory.
Inthis way, the optimiserprogramisable tokeep trak of whenjobs nish
relative totheir predited nish time.
Exited. When the job-le reahes this diretory it signies that the job is
ompleted as far as the system is onerned. The JID number is reyled,
and the user is sent notiation that the job has been ompleted.
The maintenane of these diretories and databases is the responsibility of
various daemons, as desribed in the setion 5.1.
5 The MS toolset
The various programs and libraries whih make up the MS system will now
be desribed.
5.1 Daemons
A number of bakground proesses, or daemons, are used to implement the
state mahine desribed above. All interproess ommuniation is performed
via the le-system, with a single daemon being designated the owner of eah
partiular queue diretory. Ownershipgivesthe right toedit the jobles ina
diretory.Daemons withoutownership of adiretoryare onlyallowedtoread
oradd jobsin the queue.
5.1.1 ms-init
The ms-initdaemonmonitorsthe submissionqueue. As jobles are addedto
thediretorybyauser,theinitdaemonalloatesanunused jobid,andmoves
the job intothe queueingdiretory,ready for the sheduler.
5.1.2 ms-shed
The sheduler daemonenompasses the ideas of setions1-3 of this paper. It
uses the GAheuristi tosearh foroptimalsolutionsto the sheduleproblem
at hand, and interrogates the GA when there are free resoures available, in
order tosubmit jobsfor exeution.
The daemon operates by periodially sanning the \queued" queue for new
alsosans the \running"queue, in orderto ndthe base for the optimisation
(paking)problem,asdesribed earlier.Italsomonitorsthe \nished"queue,
inordertoasertainif thepreditedexeutiontimeswere orret,andmodify
itsshedule base if this isnot the ase. When the shedulernds ajob inthe
\nished" queue itis moved to the \exited"queue, inorder toletthe system
know that the sheduler is aware of the job's ompletion, and that it is no
longer required.
When the sheduler submits a job to the \runnable" diretory it adds the
speiationforthehoststhatthejobistoberunon.Italsoaddsthepredited
exeutiontimefor thejob,whihitrequiresone thejobhas startedrunning,
asdesribed above.
5.1.3 ms-run
After the sheduler, the ms-run daemon is perhaps the most omplex. It is
responsibleforexeutingthe programassoiatedwithajob onaspeiedlist
of hosts. Currently, only MPI onforming programs are supported, but it is
envisaged that this will be extended as the system matures (for example to
supportPVM-based appliations).Variouselds ofthe jobdata strutureare
used toontrolthe exeutionof the program, theseinludeoptionstoontrol
the arguments to the proess, and how the input and output streams of the
proess are treated.
Currently, the exeutable programs must be pre-ompiledand availableinall
loal lesystems. Heterogeneity is handled by allowing the lenames to be
onstruted on a host-by-host basis, and inluding loal parameter
substitu-tions, suh as the name of the arhiteture. A frameworkis in plae to allow
exeutablestobeompiledondemandfrom\pakets"ofsoureode,butthis
is yet tobeompleted.
The ms-run daemon sans the \runnable" queue, exeuting any jobs before
moving them to the \running" queue. After they omplete their exeution,
the exit status and time of ompletion are added tothe job struture, before
being moved to the \nished" queue.
If for some reason a job is unable toberun, it is returned to the \queueing"
diretory, ready to be resheduled at a later date, unless this it has already
After the sheduler has noted job ompletions, and moved them to the
\ex-ited"queue,thems-reaperdaemonreports theresultsbaktothe user(inthe
urrent implementation this is done via email). If no destination was
spei-ed for the program's standard output stream, the output of the program is
inluded in the report. The report also inludes information onerning the
exeutiontimeoftheprogram(andhowauratetheperformanemodelwas),
and whih hosts the job was exeuted on. Finally the job le is deleted, and
itsjob id isreyled.
The reaperdaemonalsomonitorsthe \rejet" queue. This iswhere the other
daemons plae jobs that are in some way inorretly speied, or otherwise
unableto be exeuted.
5.1.5 ms-stats
The ms-stats daemon is responsible for gathering statistis onerning the
hosts on whih tasks may be sheduled. The three statistis required are the
uptime,loadaverage,and idletime ofeahhost.The uptimeisthetimesine
thesystemwasbooted,theidletimeisthetimesinetheuserofthatmahine
lastgaveanyinput(orinnityifthereisnourrentuser),andtheloadaverage
isthe numberofproessesinthesystemsheduler'srunqueue,amortisedover
areent xed period.
Every ve minutes the ms-stats daemonqueries eah system in the database
of known hosts for these threeparameters. It uses standard UNIX utilitiesto
gather the information: finger and rup. As the statistis are gathered, they
are added tothe database of host statistis.
5.2 Evaluation Library
As noted earlier, the searh spae is generally so large suh that it is not
possible to pre-alulate all required performane estimates ahead of time.
Instead a demand-driven evaluation sheme is used, oupled with a ahe of
past evaluations. The motivation for the ahe is that there is a large degree
of repetitionin the listof performane senarios that the GAwillrequire.
Although it would be straightforward to use other predition methods, this
implementationuses models reated by the PACE environment.Foreah
ap-pliation that is submitted to the MS system there must be an assoiated
performane model. PACE allows the textual performane desriptions to be
pae? list Npro 1 N 500
pae? set Npro 2 pae? set N 20000 pae? eval
2.19181e+08
pae? hrduse SunUltra1 SunUltra1 pae? eval
4.89571e+08
pae? set N 10000 pae? eval
1.22434e+08
Fig.5.ExamplePACEsession
denitionswillevaluatetheperformanemodelforthespeiedonguration.
Toavoidthestartupoverheadassoiatedwithevaluatingamodel,PACE
pro-vides an interative interfae, where multiple evaluations an be performed,
speifyingtheparametersand hardware ongurationsfor eahseparate
eval-uation. Figure 5 shows an example interative session. The MS evaluation
library reates a proess running eah performane model, then uses the
ses-sioninterfaetoinvokeevaluationsandreadtheresults,ommuniatingaross
apairofsokets.Thismethodallowsthepossibilityofdistributingevaluations
aross multiplehosts, althoughthis hasn't been neessary asyet.
When requesting evaluationsthe sheduler translates fromthe vetor of host
namesspeifyingwhihhoststhejobwould runon,toavetorofarhiteture
namesand pu load averages.Forexample ifa speiedhost isa Sun Ultra-1
workstation,theassoiatedPACEhardwaremodelisalledSunUltra1,andits
urrentload averageis0.3(this dataisolleted bythe statistisdaemon,
de-sribed inSetion 5.1.5),the string SunUltra1:hardware/CPU LOAD=0 would
be speied as the hardware model of that partiular host. Repeating this
proess for all hosts that the job would run on gives the vetor of hardware
models that the evaluation requires.
Although evaluations omplete relatively quikly (usually in the order of a
few tenths of a seond), this is still a less than ideal. For example, if the
GA has a populationsize of 50, and there are 20 jobs being sheduled, then
1000 evaluations are required eah generation. If eah evaluation takes 0.01
seonds, then this is 10 seonds per generation. However, many of the
eval-uations requested by the geneti algorithm are likely to be exatly the same
as those required by previous generations (due to the nature of the rossover
added between the sheduler and the performane model. When apartiular
evaluation result is requested, the ahe is searhed (the ahe uses a
hash-table, so lookups are fast). If the result already exists, it is returned to the
sheduler. Otherwise theperformanemodelisalled, andthe result isadded
tothe ahe beforebeingreturned tothe sheduler. Thelibrary alsosupports
aseondarylevelofahing,usingadatabasestoredinthelesystem.The
re-sultsofallpreviousevaluationsofapartiularmodelarereorded,along with
the model parameters for eah result. This has several benets to the
shed-uler: rstly itis possibleto stop and then restartsheduler proesses without
losingtheevaluationhistory,andseondlysimilarjobsmaybesheduled more
than one, but the modelis only evaluatedthe rst time.
5.3 User Interfae
Naturally, the end user of the shedulingsystem an not be expeted to
ma-nipulate the job les themselves. To this end a graphial user interfae has
been developed, allowing all of the information ontained within the system
tobedisplayed and modied.
The rst part of the interfae is the browser. This allows all of the databases
in the system to be displayed. These databases inlude the job queue
dire-tories, and the ontroldatabases, suhas those ontaining the host data and
statistis.A sreenshot of the browser is shown in Figure6.
The other part of the user interfae is the sheduler front end. This displays
theGantthart oftheurrent shedule,and allowsthe variousdaemons tobe
ontrolled.Figure 7shows atypialsreenshot of the sheduler interfae.
An alternative interfae to the system is via the World Wide Web, through
a CGI sript. This interfae allows muh the same ations as the desktop
interfae, with the exeption of the shedule Gantt hart (we are urrently
developing aJava appletto allowthis).
Aknowledgments
ThisworkisfundedinpartbyU.K.governmentE.P.S.R.C.grantGR/L13025
and by DARPA ontrat N66001-97-C-8530 awarded under the Performane
[1℄ John S. Harper, Darren J. Kerbyson, and Graham R. Nudd. EÆient analytialmodellingof multi-levelset-assoiative ahes. In Proeedings of the
International ConfereneHPCNEurope'99,volume1593ofLNCS,pages473{ 482.Springer, 1999.
[2℄ John S. Harper, Darren J. Kerbyson, and Graham R. Nudd. Analytial modelingofset-assoiativeahe behavior. ToappearinIEEETransationson
Computers.
[3℄ D. J. Kerbyson, E. Papaefstathiou, and G. R. Nudd. Appliation exeution steering using on-the-y performane predition. In High-Performane ComputingandNetworking,volume1401ofLetureNotesinComputerSiene,
pages718{727, Amsterdam,April1988. Springer.
[4℄ S. C. Perry, D. J. Kerbyson, E. Papaefstathiou, G. R. Nudd, and R. H. Grimwood. Performaneoptimization ofnanial optionalulations. Journal of Parallel Computing, speial issue on Eonomis, Finane and Deision
Making,26,2000.
[5℄ J. Du and J. Leung. Complexity of sheduling parallel task systems. SIAM Journal on Disrete Mathematis,2:473, November 1989.
[6℄ K. Belkhale and P. Banerjee. Approximate sheduling algorithms for the
partitionableindependent taskshedulingproblem. InProeedings of the 1990 InternationalConfereneofParallelProessing,volumeI,page72,August1990.
[7℄ J. Turek, J. L. Wolf, and P. S. Yu. Approximate algorithms for sheduling parallelizable tasks. ACM Performane Evaluation Review, page 225, June
1992.
[8℄ S.Kirkpatrik,D.C.Gelatt,andM.P.Vehi. Statistialmehanisalgorithm formonte arlooptimization. Physis Today, pages 17{19, 1982.
[9℄ S Kirkpatrik, C D Gelatt, and M P Vehi. Optimisation by simulated
annealing. Siene,220:671{680, 1983.
[10℄J. H. Holland. Adaption in Natural and Artiial Systems. University of Mihigan Press,AnnArbor,MI, 1975.
[11℄D. E. Goldberg. Geneti Algorithms in Searh, Optimization and Mahine Learning. Addison-WesleyPublishingCo., In., Reading,MA., 1989.
[12℄G. R.Nudd, D.J. Kerbyson,E.Papaefstathiou, S.C. Perry,J.S.Harper,and D. V. Wilox. Pae - a toolset forthe performane predition of parallel and distributedsystems. Aepted for publiation in International Journal of High
Performane andSienti Appliations, 1999.
International Journal of Superomputer Appliations, 11(2):115{128, 1997.