Theory and operation of the Warwick multiprocessor scheduling (MS) system

(1)

Original citation:

Perry, S. C., Harper, J. S., Kerbyson, D. J. and Nudd, G. R. (1999) Theory and operation

of the Warwick multiprocessor scheduling (MS) system. University of Warwick.

Department of Computer Science. (Department of Computer Science Research Report).

(Unpublished) CS-RR-363

Permanent WRAP url:

http://wrap.warwick.ac.uk/61090

Copyright and reuse:

The Warwick Research Archive Portal (WRAP) makes this work by researchers of the

University of Warwick available open access under the following conditions. Copyright ©

and all moral rights to the version of the paper presented here belong to the individual

author(s) and/or other copyright owners. To the extent reasonable and practicable the

material made available in WRAP has been checked for eligibility before being made

available.

Copies of full items can be used for personal research or study, educational, or

not-for-profit purposes without prior permission or charge. Provided that the authors, title and

full bibliographic details are credited, a hyperlink and/or URL is given for the original

metadata page and the content is not changed in any way.

A note on versions:

(2)

Multiproessor Sheduling (MS) System

S. C. Perry and J. S. Harper and D. J. Kerbyson and G. R. Nudd

High PerformaneSystems Group Department of Computer Siene

University of Warwik

Coventry, UK

Abstrat

Thispaperisonernedwiththeappliationofperformanepreditiontehniquesto theoptimisation ofparallelsystems, and,inpartiular,theuse ofthese tehniques on-the-yforoptimisingperformaneat run-time.Inontrasttootherperformane

tools,performanepredition resultsaremade availablevery rapidly,whihallows theiruseinreal-timeenvironments.Whenappliedtoprogramoptimisation,this al-lowsonsiderationofrun-timevariablessuhasinputdataandresoureavailability

thatarenot,ingeneral,availableduringthetraditional(ahead-of-time)performane tuningstage.

The main ontributionof thiswork is theappliation of preditive performane

data to the sheduling of a number of parallel tasks aross a large heterogeneous distributedomputingsystem. This is ahieved throughuseof just-in-time perfor-manepreditionoupledwithiterativeheuristialgorithmsforoptimisationofthe

meta-shedule.

Thepaperdesribesthemaintheoretialonsiderationsfordevelopmentofsuha shedulingsystem,andthendesribesaprototypeimplementation,theMS

shedul-ing system, together with some results obtained from this system when operated overamedium-sized(ampus-wide)distributedomputingnetwork.

1 Introdution

The priniple motivation for the use of any parallel system is the inreased

performane that an be ahieved by distributing workload among a number

of separate nodes. With omputational systems, the prodution of a

paral-lel appliation requires onsiderable extra work ompared to the equivalent

sequential program, and it is learly very important to maximise the

(3)

gorithm for an identied target arhiteture, and subsequent tuning using

performane tools suh as appliation measurement/traing. These tools are

generallyused before the appliationis run inearnest, and the `optimisation'

is neessarilyspei to the partiular target arhiteture.

However, the eÆieny with whih an appliation runs ona parallel mahine

also depends on other fators not foreseeable when the ode was originally

optimised. For example, it is possible to optimise ahe re-use on a

data-dependent basis[1,2℄, and the hoie of algorithm an be made dependent

on the input appliation parameters at run-time[3,4℄. Furthermore,

hetero-geneous data neessitates load balaning within the appliation, and, at the

system level, heterogeneous omputing resoures must alsobe sheduled and

dynamiallyload balaned. These issues beome even morehallenging when

one onsiders the trend towards large distributed omputing systems, often

built fromarange of ommodity omponents,whih provide aheterogeneous

and onstantly hanging set of omputing resoures.

Optimisationtehniques used whena programisabout to beexeuted

(here-afteralled dynami performane optimisation methods) an beonveniently

divided intotwo ategories:

Single-programissues are onernedwith the optimisationofasingle

appli-ationrunningona parallelsystem.Forexample, thesize and organisation

of loal memory and the eÆieny of the network and message passing

interfae are `hardware fators' inuening performane at this level. The

mappingof data and subtasks ontained within the parallel appliation to

the underlying network arhiteture is the main `software fator'

inuen-ing performane; this is itself dependent upon the number of proessing

nodes whihhave been alloated,and their relative network topology.The

hoie of algorithms and data-partitioning strategies is onsequently very

rihatthis stage,and dynamiperformaneoptimisationisonerned with

seletionof these parameterson a system- and data-dependent basis.

Multiprogramming issues are onernedwith optimisingtheperformaneof

the parallel system as a whole. In partiular, the problems of quantitative

partitioning(i.e.the numberofproessingnodes assignedtoeahtask)and

qualitative partitioning (i.e. the partiular subset of nodes and, for a

het-erogeneous system, their assoiated omputing power and ommuniation

osts)are ompromisesbetween minimisingthe exeution time of any

par-tiulartaskandmaximisingtheoveralleÆieny oftheparallelsystem.The

searhspae for this multipleparameter optimisationproblemisextremely

large,and is not fully dened untilrun-time.

Inontrasttothedynamioptimisationtehniquesitedabove[1{4℄,thispaper

(4)

run ona heterogeneous omputing network.

Theoretial onsiderationsappliable tothis problem willnow beoutlined.

1.1 Intratability

Considerforthemomentthe simpleraseofahomogeneousomputersystem,

onsistingof anumberofidential proessingnodesforwhihommuniation

osts are similar between all nodes, then the above problem an be stated

formallyas follows.

Problem 1 A shedulefora set of paralleltasks fT 0

;T 1

;:::;T n 1

g whihare

to be run on a network of idential proessing nodes fP 0

;P 1

;:::;P m 1

g is

dened by the alloation to eah task T j

a set of nodes j

2 fP 0

;:::;P m 1

g

and a start time j

at whih the alloated nodes all begin to exeute the task

in unison. Beause the proessing nodes are idential, the exeution time for

eah task is simply a funtion, t x

() j

, of the number of nodes alloated to it,

and it is presumed that this funtion is known in advane. The system uses

Run-To-Completion (RTC) sheduling, hene the tasks are not permitted to

overlap. The makespan, w, for a partiular shedule is dened as:

w= max 0jn 1 f j +t x (jj j jj) j

g; (1)

whih is the latest ompletion time of any task. The goal is to minimise this

funtion with respet to the shedule.

Problem1 is alledthe Multiproessor Sheduling(MS) problem.

The intratability of MS has been studied[5℄, and it is found that although

pseudo-polynomial time algorithmsexist for the ases of m = 2 and m = 3,

MS is NP-hard for the general ase m > 4. Approximate algorithms have

been proposed[6,7℄; these have a worst ase bound of w 2w ?

, where w ?

is

the makespanofthe optimalshedule.Althoughthesealgorithmsare auseful

fall-bak,forhighperformanesystemsafatoroftwoisunaeptableinmost

ases, and one would hope toahieve muh higher levels of optimisation.

In order to nd (near) optimal solutions to this ombinatorial optimisation

problem, the approah taken in this work is to nd good shedules through

useofiterativeheuristimethods.Twoheuristitehniqueshavebeenstudied:

Simulated Annealing (SA)[8,9℄ and a Geneti Algorithm (GA)[10,11℄. These

methods proeed through use of a `tness funtion' whih assigns a quality

(5)

this value. Details of these methods' implementation are given insetion 2.

1.2 Heterogeneous ombinatorial onsiderations

As mentioned above, the funtiont x

(),whihprovides aprogram's exeution

time for a given alloation of proessing nodes, (hereafter referred to as

a performane senario), is assumed to be known in advane. For an

other-wise unloaded homogeneous parallel system, where the exeution times are

not data-dependent, prole data ould oneivably be used for this purpose

(although for large one-shot appliationsthis would somewhat defeat the

ob-jet of the optimisation). The nodes are idential, so, for example, with a 60

node system there exists exatly 60 dierent performane senarios per task

(althoughifthe optimum speedupoursatalowernumberof nodesthenthe

later senarios willnot beused).

However, for the morerealistiase of aheterogeneous system, ombinatorial

explosion generally preludes prior knowledge of the exeution time. In the

extreme ase, the number of performane senarios for m entirely dierent

proessing nodes is P

m k=1

k!, whih for m = 60 is of the order 10 81

. Clearly,

for any non-trivial heterogeneous system it willnot be possible to obtainthe

funtion t x

() in advane of the sheduling stage. To overome this diÆulty,

we propose to use data from a performane predition system, alulated in

real time as the sheduling program requires it. This just-in-time approah

eetivelyprovidest x

()forthe heuristimethodsforheterogeneoussystems.

The implementationis desribed further insetion 5.2.

1.3 Performane predition

This work willonentrate entirely onperformane predition asthe method

of providing t x

(), either in advane for homogeneous situations or in real

time for heterogeneous systems. Performane predition is the tehnique of

estimating values for various aspets of how an appliation program will

ex-eute on a omputer system, given ertain well-dened desriptions of both

program and system. The performane predition toolset used here, PACE,

operates by haraterising the appliationin terms ofits prinipleoperations

(representing omputation osts) and its parallelisation strategy (whih

di-tates the ommuniation pattern). These are then ombined with models of

the system environment to produe a predition for the exeution time. The

key to this strategy is that the separation of program and system models

(6)

internal strutures have a low level of data-dependene is generally aurate

to within 10 % of the measured time. However, it is important to note that

for run-time optimisation purposes the auray of the predition is not an

overridingonern; any informationfromdetailedtrae data down toa basi

measure of program omplexity is always useful { it is better than no

infor-mation at all. A omplete desription of the PACE system an be found in

[12℄.

2 Solution of MS using Heuristi Methods

A number of standard texts desribe the operation of the SA and GA

teh-niques; the followingsetionsdetail onlyissues relatingto solutionof the MS

problemusing these methods.

2.1 Coding sheme and tness funtion

Bothtehniquesrequireaodingshemewhihanrepresentalllegitimate

so-lutionstotheoptimisationproblem.Forbothalgorithms,anypossiblesolution

isuniquelyrepresented by apartiularstring,S i

,andstrings aremanipulated

in various ways until the algorithm onverges upon an optimal solution. In

order for this manipulation to proeed in the orret diretion, a method of

presribinga qualityvalue(ortness) toeahsolutionstring isalsorequired.

The algorithmfor providingthis value is alled the tness funtion f v

(S i

).

The tness values of solutions to the MS problemare readily obtained { the

solutions whih represent the shedule with the least makespan are the most

desirable,and vie-versa.Theproessingnode set foreahtask,andthe order

in whih the tasks are exeuted, are enoded in eah solution string, and the

exeution times for eah task (given the set of nodes alloated) are obtained

from the predition system. It is therefore straightforward to alulate the

makespan of the shedule represented by any solutionstring S i

. This number

may be onverted froma ost funtion (f

) toa value funtion (f v

)by

multi-plyingallthemakespansby-1andnormalisingonthe interval0f v

(S i

)1.

The oding sheme we have developed for this problemonsists of 2 parts:

An ordering part, whih speies the order in whih the tasks are to be

exeuted. This part of the string is q-ary oded where q=n.

A mappingpart, whih speies the alloation of proessing nodes to eah

(7)

spe-task #2

map of

task #4

map of

task #3

map of

task #5

map of

task #0

map of

task #1

11010 01010 11110 01000 10111 01001

2 4 1 0 5 3

task ordering

time

3

4

0

1

2 processor

task #5

task #4

task #2

task #0

task #1

task #3

Fig.1.An examplesolutionstring(LHS)and Gantthartrepresentingitsshedule (RHS).Thetwo partsofthesolutionstringareshownseparately,withtheordering

partaboveandthemappingpartbelow.Notethattheexeutiontimesofthevarious tasks are a funtion of the proessing node set alloated and are provided by the performane predition system. This data is assoiated with the task objet and

is never used diretly by the optimisation algorithm, but onlythrough the tness funtionf

v .

ifyingwhether ornot a partiular node is assignedto apartiular task.

The orderingof the task-alloation setionsinthe mappingpart of the string

isommensurate withthe taskorder.Ashort exampleofthis type ofsolution

string and its assoiated shedule are shown in Fig. 1. The ordering part of

this string is always guaranteed to belegitimate by the various manipulation

funtions used in the heuristi algorithms. However the same is not true of

the mapping part, and if the tness funtion enounters a string with a task

that has no proessing nodes alloated it will randomly assign one. Further

manipulationmay be required whenusing the system in a heterogeneous

en-vironment, asdesribed insetion 4.

From Fig. 1 it is lear that the task of parallel shedule optimisation is a

`paking problem' where the goal is to t the programs together as tightly

as possible. Examples of a simple shedule before and after appliation of a

heuristialgorithmare shown inFig. 2.

2.2 Geneti algorithm-based optimisation

AGA-based optimisationprogram,using the odingsheme desribed above,

was developed for testing with simulated (homogeneous) performane data

and the MS problem tness funtion. The ode was based upon the ideas

developed in [11℄, using a populationsize of 60 and stohasti remainder

se-letion. Speialised rossover and mutation funtions were developed for use

with the two-part oding sheme. The rossover funtion rst splies the two

ordering strings at a random loation, and then re-orders the ospring to

produe legitimatesolutions. The mappingparts are rossed over by rst

(8)

optimisation (leftpanel)and after 500 iterationsof theoptimising(GA) algorithm

(right panel). The dierent shaded retangles represent the programs indexed on theleft.

asingle-point(binary)rossover. The motivationfor there-ordering isto

pre-servethe node mappingassoiatedwithapartiulartaskfromone generation

to the next. The mutation stage is also two-part, with a swithing operator

randomly appliedto the ordering parts, and a randombit-ip applied tothe

mappingparts.

2.3 Simulated annealing-basedoptimisation

TheSA-basedoptimisationprogramusedthe sameoding shemeand tness

funtion desribed above, and wasa modiation of the travelling ode given

in Numerial Reipes[13, hap. 7℄. The program again uses a two-part

re-arrangement funtion, with path transport or reversal (performed aording

to an annealing shedule derived from the metropolis algorithm) ourring

separately for the ordering and mapping parts, and the mapping part

re-ordered in between.

3 Performane of the heuristis

In order to onveniently test the performane of the algorithms with

dier-ent problem sizes, tasks were simulated from a simple homogeneous parallel

omputationmodel, viz,

t x

(i)= C

pu

i +C

om

(9)

ti approahes for a set of 20 simulated tasks running on a 16 node system. The lines are the average of 100 runs of eah searh method; the error bars show the widthof 1 standarddeviationfrom themean at onvergene.

Hene from the derivative of Eq. 2 it was straightforward to reate task sets

with speied distributionsof sizes and optimum proessing node numbers.

Examples of the onvergene times for a simple sheduling problem are

pre-sented in Fig. 3. The gure shows the relative performane of the heuristi

algorithms and a random searh algorithm (using peak speedup node

allo-ations for eah task). The intratability of the MS problem preludes

om-parison of the results with an optimal shedule, but we assume the results

at onvergene to be near-optimal. As expeted, the heuristi methods oer

anappreiableimprovementover therandomsearh(whihdemonstratesthe

vastness of the searh spae even for a relatively simple problem). Although

the solution quality forthe two heuristialgorithmsisomparable at

onver-gene, in general GA provides higher-quality solutionsat shorter times than

SA. The reason for this is believed to be that the operation of SA is based

uponpatientmovement towards aminimum following anannealing shedule,

anddoesnotlenditselftoearlytermination.Conversely,GAisbaseduponthe

evolutionofaset(population)ofsolutionstrings,whihnotonlyndsminima

more quikly,but alsoadapts easilytohangesin the problem(e.g. the

addi-tionorompletion ofatask,orahangeinthe availablehardware resoures),

beause any urrent solution set will serve as goodquality starting points in

a slightly hanged situation. Hene hene for realisti, real-time sheduling

environments, where the poolof tasks and omputing resoures is onstantly

hanging,the ratio of the onvergene times forGA vs. SA (shown inFig. 3)

isgreatly exaggerated, with SAhaving toomplete itsannealing shedule for

(10)

3.1 Real-time extension of the MS problem

The simulations desribed above were onerned with the solutionof the MS

problemas dened in Prob. 1, with allthe tasks and their assoiated

perfor-mane data known in advane. Consider now a slightly more realisti ase,

wherenew tasksmaybeaddedtothe queue afterthe rst taskshavestarted.

HenetheMSproblembeomesdynami;tasksareaddedandremoved (when

they are omplete) from the sheduling problem in real time. As mentioned

earlier, the evolutionary basis of GA is partiularlywell suited to this

situa-tion, and this was the only heuristiused for the followingsimulations.

The simulationswork by rst allowing a \warm-up" period for the GA with

a set of tasks, and then removing the tasks whih have been sheduled for

exeution (i.e. the ones at the beginning of the Gantt hart). This tends to

leaveajagged`base'whihformsthebasisofthenextpakingproblem.When

a taskompletes, the algorithmis stopped and interrogated for its best

solu-tion, and any tasksat the bottom of the shedule begin exeution. Tasks are

ontinuouslyadded soastokeepthe numberofpre-sheduled tasksonstant.

For the ase where the queue initially ontains all the tasks to be exeuted,

minimisation of the makespan provides the most eÆient shedule. However,

for the real time ase now onsidered, one should also take into aount the

natureofthe idletimeinthe shedule.Idletimeatthe bottomofthe shedule

is partiularly undesirable, beause this is the proessing time whih will be

wasted rst, and is leastlikely tobe reovered by further iterationof the GA

orwhenmore tasksare added.Forthis reasonwepropose and use amodied

ost funtion:

f

=w (

1+ X

nodes Æ

h t

idle j

(0) i

)

(3)

Wheret idle j

(t)istheamountofidletimeonnodej startingattimet,withtime

beginningatzero.Withinthisregimesolutionswithidletimeatthebeginning

of the shedule on any of the proessors are penalised with a higher ost

funtion, in proportion to the overall makespan. Equation 3 will be referred

toas the Extended MS (XMS)ost funtion.

Shown in table 1 are the task exeution rates for the standard and extended

ost funtions used with the GA, ompared with the results from a simpler

rst-ome-rst-served (FIFO)sheduler. GA-XMSgivesa 20%improvement

(11)

fun-j

FIFO 242 110 58

GA-MS 821 357 182

GA-XMS 963 428 222

Table 1

Rates of exeution (hr 1

) for simulatedtasks of average size j

. The row marked \FIFO" shows results for a `rst-ome-rst-served' shedule. The rows prexed

\GA" show results for GA optimised shedules with dierent quality funtions, asdesribedinthe text.

tion,i.e.makespanalone).However, separatesimulationshaveshownthatthe

makespan isgenerally 10%worse forGA-XMS atonvergene. This

disrep-any demonstrates the penalty inurred for lak of knowledge of the entire

taskqueue beforesheduling begins.

In either ase, the heuristialgorithms give an inrease in the exeutionrate

of 3-4 times over the standard queueing method { this is a diret result of

knowledge of the exeution time funtion, t x

(), provided by performane

predition.

4 An Implementation: The MS Sheduling System

The ideas presented in the previous setions have been developed into a

working system for sheduling sequential and parallel tasks over a

hetero-geneous distributed omputing network. The goal of this projet, alled the

MS sheduling system, is to investigate and demonstrate the role of

perfor-mane predition in the optimisation of large distributed systems. We hope

that this work is omplementary to the many other distributed omputing

projets inoperation atthe moment[14,15℄.Our prototype system addresses

theissuesofintratability,heterogeneoussystems,dynamiandevolving

om-putingresoures,andperformanepredition.Wedonotattempttodupliate

or re-invent previous or ongoing work, hene our prototype system does not

address issues suh as seurity, network operating systems, resoure

disov-ery/identiationet.Thestate-of-the-artintheseandotherareasareovered

in[15℄.

An overview of the design and implementation of the MS system is given

(12)

TheMSshedulingsystemhasatitsheartaGAshedulingprogramtoprovide

good-quality solutions to the sheduling problem in real time, as desribed

above. However, a number of other programs (daemons) and design features

are neessary inorder to produe aworking system.

The main goal of the implementation was to make the `state' of the system

both visible and transendent. To this end, MS makes extensive use of the

lesystem (whihshould failitate itsextensionto larger distributed

omput-ing environments suh as GLOBUS), and a olletion of various diretories

and database les represent its omplete state at any partiular instant in

time.

Themostimportanttypeofleisajob-le.Afterajob 1

hasbeen submitted,

itundergoesaseriesofstatehanges,witheahstaterepresentingapartiular

stageinitslifetime.Inreality,this orrespondstoajob-le(adatabasewhih

uniquely represents the job, and ontains various job parameters and other

information)being moved through a diretory hierarhy. At most hanges of

state, additionalinformationis addedto the job-le. This hierarhy isshown

shematiallyon the LHS of Fig. 4.

Otherimportantlesarethehostsandhost-statsdatabases,whihprovideMS

with real-timeinformation onthe state of the availableomputing resoures.

These databases will be desribed further in setion 5.1.5. The various job

states (diretories)will nowbe desribed individually.

Submitted. This is the entry point to the sheduling system, where the

user presents jobs to be exeuted. At this point the job-le ontains the

basi information required to run the job, inluding the loation of the

exeutable(s), the input data, and various information pertaining to the

performane data ormodel.

Queued. When the job-le ismoved intothis diretoryit is given aunique

jobidentiationnumber(JID).Itthenawaitstheattentionofthe shedule

optimiser.

Sheduling. Onethesheduleoptimiserbeomesaware ofajobitismoved

intothis diretory for the durationof the optimisation proedure.

Runnable. At the time that the shedule optimiser deides to run a job it

movesthejob-leintothisdiretorywhereanotherdaemonisresponsiblefor

running it on the system. The shedule optimiser speies the omputing

1

A taskis alleda\job" withintheontext ofthe shedulingsystem (byanalogy

(13)

submitted

queued

scheduling

runnable

running

finished

exited

ms_stats

host stats

hosts

cache

eval

user

job states

daemons

data stores

state changes

information flow

evaluate performance model

query host

load avg,

idle, etc...

request perf.

estimate for

job/host

configuration

identify available

hosts

submit job with

add/remove hosts

performance model

+host-spec

+predicted-tx

+start-time

+finish-time

report results

by email

+job-id

Fig.4.Flowhartdiagram oftheMSshedulingsystem. Afullexplanationisgiven inthetext. Notethat notall MSdaemonsare shown inthisdiagram.

resoures on whih the job is to be run, and adds information about the

predited exeutiontime tothe job-le.

Running.One the job has begun to be exeuted on the system its job-le

ismovedtothisdiretory,wherethestart-timeisadded(thestart-timeand

preditedexeutiontimes of the urrently runningjobs are requiredby the

(14)

where it waits for the shedule optimiser to move it to the next diretory.

Inthis way, the optimiserprogramisable tokeep trak of whenjobs nish

relative totheir predited nish time.

Exited. When the job-le reahes this diretory it signies that the job is

ompleted as far as the system is onerned. The JID number is reyled,

and the user is sent notiation that the job has been ompleted.

The maintenane of these diretories and databases is the responsibility of

various daemons, as desribed in the setion 5.1.

5 The MS toolset

The various programs and libraries whih make up the MS system will now

be desribed.

5.1 Daemons

A number of bakground proesses, or daemons, are used to implement the

state mahine desribed above. All interproess ommuniation is performed

via the le-system, with a single daemon being designated the owner of eah

partiular queue diretory. Ownershipgivesthe right toedit the jobles ina

diretory.Daemons withoutownership of adiretoryare onlyallowedtoread

oradd jobsin the queue.

5.1.1 ms-init

The ms-initdaemonmonitorsthe submissionqueue. As jobles are addedto

thediretorybyauser,theinitdaemonalloatesanunused jobid,andmoves

the job intothe queueingdiretory,ready for the sheduler.

5.1.2 ms-shed

The sheduler daemonenompasses the ideas of setions1-3 of this paper. It

uses the GAheuristi tosearh foroptimalsolutionsto the sheduleproblem

at hand, and interrogates the GA when there are free resoures available, in

order tosubmit jobsfor exeution.

The daemon operates by periodially sanning the \queued" queue for new

(15)

alsosans the \running"queue, in orderto ndthe base for the optimisation

(paking)problem,asdesribed earlier.Italsomonitorsthe \nished"queue,

inordertoasertainif thepreditedexeutiontimeswere orret,andmodify

itsshedule base if this isnot the ase. When the shedulernds ajob inthe

\nished" queue itis moved to the \exited"queue, inorder toletthe system

know that the sheduler is aware of the job's ompletion, and that it is no

longer required.

When the sheduler submits a job to the \runnable" diretory it adds the

speiationforthehoststhatthejobistoberunon.Italsoaddsthepredited

exeutiontimefor thejob,whihitrequiresone thejobhas startedrunning,

asdesribed above.

5.1.3 ms-run

After the sheduler, the ms-run daemon is perhaps the most omplex. It is

responsibleforexeutingthe programassoiatedwithajob onaspeiedlist

of hosts. Currently, only MPI onforming programs are supported, but it is

envisaged that this will be extended as the system matures (for example to

supportPVM-based appliations).Variouselds ofthe jobdata strutureare

used toontrolthe exeutionof the program, theseinludeoptionstoontrol

the arguments to the proess, and how the input and output streams of the

proess are treated.

Currently, the exeutable programs must be pre-ompiledand availableinall

loal lesystems. Heterogeneity is handled by allowing the lenames to be

onstruted on a host-by-host basis, and inluding loal parameter

substitu-tions, suh as the name of the arhiteture. A frameworkis in plae to allow

exeutablestobeompiledondemandfrom\pakets"ofsoureode,butthis

is yet tobeompleted.

The ms-run daemon sans the \runnable" queue, exeuting any jobs before

moving them to the \running" queue. After they omplete their exeution,

the exit status and time of ompletion are added tothe job struture, before

being moved to the \nished" queue.

If for some reason a job is unable toberun, it is returned to the \queueing"

diretory, ready to be resheduled at a later date, unless this it has already

(16)

After the sheduler has noted job ompletions, and moved them to the

\ex-ited"queue,thems-reaperdaemonreports theresultsbaktothe user(inthe

urrent implementation this is done via email). If no destination was

spei-ed for the program's standard output stream, the output of the program is

inluded in the report. The report also inludes information onerning the

exeutiontimeoftheprogram(andhowauratetheperformanemodelwas),

and whih hosts the job was exeuted on. Finally the job le is deleted, and

itsjob id isreyled.

The reaperdaemonalsomonitorsthe \rejet" queue. This iswhere the other

daemons plae jobs that are in some way inorretly speied, or otherwise

unableto be exeuted.

5.1.5 ms-stats

The ms-stats daemon is responsible for gathering statistis onerning the

hosts on whih tasks may be sheduled. The three statistis required are the

uptime,loadaverage,and idletime ofeahhost.The uptimeisthetimesine

thesystemwasbooted,theidletimeisthetimesinetheuserofthatmahine

lastgaveanyinput(orinnityifthereisnourrentuser),andtheloadaverage

isthe numberofproessesinthesystemsheduler'srunqueue,amortisedover

areent xed period.

Every ve minutes the ms-stats daemonqueries eah system in the database

of known hosts for these threeparameters. It uses standard UNIX utilitiesto

gather the information: finger and rup. As the statistis are gathered, they

are added tothe database of host statistis.

5.2 Evaluation Library

As noted earlier, the searh spae is generally so large suh that it is not

possible to pre-alulate all required performane estimates ahead of time.

Instead a demand-driven evaluation sheme is used, oupled with a ahe of

past evaluations. The motivation for the ahe is that there is a large degree

of repetitionin the listof performane senarios that the GAwillrequire.

Although it would be straightforward to use other predition methods, this

implementationuses models reated by the PACE environment.Foreah

ap-pliation that is submitted to the MS system there must be an assoiated

performane model. PACE allows the textual performane desriptions to be

(17)

pae? list Npro 1 N 500

pae? set Npro 2 pae? set N 20000 pae? eval

2.19181e+08

pae? hrduse SunUltra1 SunUltra1 pae? eval

4.89571e+08

pae? set N 10000 pae? eval

1.22434e+08

Fig.5.ExamplePACEsession

denitionswillevaluatetheperformanemodelforthespeiedonguration.

Toavoidthestartupoverheadassoiatedwithevaluatingamodel,PACE

pro-vides an interative interfae, where multiple evaluations an be performed,

speifyingtheparametersand hardware ongurationsfor eahseparate

eval-uation. Figure 5 shows an example interative session. The MS evaluation

library reates a proess running eah performane model, then uses the

ses-sioninterfaetoinvokeevaluationsandreadtheresults,ommuniatingaross

apairofsokets.Thismethodallowsthepossibilityofdistributingevaluations

aross multiplehosts, althoughthis hasn't been neessary asyet.

When requesting evaluationsthe sheduler translates fromthe vetor of host

namesspeifyingwhihhoststhejobwould runon,toavetorofarhiteture

namesand pu load averages.Forexample ifa speiedhost isa Sun Ultra-1

workstation,theassoiatedPACEhardwaremodelisalledSunUltra1,andits

urrentload averageis0.3(this dataisolleted bythe statistisdaemon,

de-sribed inSetion 5.1.5),the string SunUltra1:hardware/CPU LOAD=0 would

be speied as the hardware model of that partiular host. Repeating this

proess for all hosts that the job would run on gives the vetor of hardware

models that the evaluation requires.

Although evaluations omplete relatively quikly (usually in the order of a

few tenths of a seond), this is still a less than ideal. For example, if the

GA has a populationsize of 50, and there are 20 jobs being sheduled, then

1000 evaluations are required eah generation. If eah evaluation takes 0.01

seonds, then this is 10 seonds per generation. However, many of the

eval-uations requested by the geneti algorithm are likely to be exatly the same

as those required by previous generations (due to the nature of the rossover

(18)

added between the sheduler and the performane model. When apartiular

evaluation result is requested, the ahe is searhed (the ahe uses a

hash-table, so lookups are fast). If the result already exists, it is returned to the

sheduler. Otherwise theperformanemodelisalled, andthe result isadded

tothe ahe beforebeingreturned tothe sheduler. Thelibrary alsosupports

aseondarylevelofahing,usingadatabasestoredinthelesystem.The

re-sultsofallpreviousevaluationsofapartiularmodelarereorded,along with

the model parameters for eah result. This has several benets to the

shed-uler: rstly itis possibleto stop and then restartsheduler proesses without

losingtheevaluationhistory,andseondlysimilarjobsmaybesheduled more

than one, but the modelis only evaluatedthe rst time.

5.3 User Interfae

Naturally, the end user of the shedulingsystem an not be expeted to

ma-nipulate the job les themselves. To this end a graphial user interfae has

been developed, allowing all of the information ontained within the system

tobedisplayed and modied.

The rst part of the interfae is the browser. This allows all of the databases

in the system to be displayed. These databases inlude the job queue

dire-tories, and the ontroldatabases, suhas those ontaining the host data and

statistis.A sreenshot of the browser is shown in Figure6.

The other part of the user interfae is the sheduler front end. This displays

theGantthart oftheurrent shedule,and allowsthe variousdaemons tobe

ontrolled.Figure 7shows atypialsreenshot of the sheduler interfae.

An alternative interfae to the system is via the World Wide Web, through

a CGI sript. This interfae allows muh the same ations as the desktop

interfae, with the exeption of the shedule Gantt hart (we are urrently

developing aJava appletto allowthis).

Aknowledgments

ThisworkisfundedinpartbyU.K.governmentE.P.S.R.C.grantGR/L13025

and by DARPA ontrat N66001-97-C-8530 awarded under the Performane

(19)

(20)

[1℄ John S. Harper, Darren J. Kerbyson, and Graham R. Nudd. EÆient analytialmodellingof multi-levelset-assoiative ahes. In Proeedings of the

International ConfereneHPCNEurope'99,volume1593ofLNCS,pages473{ 482.Springer, 1999.

[2℄ John S. Harper, Darren J. Kerbyson, and Graham R. Nudd. Analytial modelingofset-assoiativeahe behavior. ToappearinIEEETransationson

Computers.

[3℄ D. J. Kerbyson, E. Papaefstathiou, and G. R. Nudd. Appliation exeution steering using on-the-y performane predition. In High-Performane ComputingandNetworking,volume1401ofLetureNotesinComputerSiene,

pages718{727, Amsterdam,April1988. Springer.

[4℄ S. C. Perry, D. J. Kerbyson, E. Papaefstathiou, G. R. Nudd, and R. H. Grimwood. Performaneoptimization ofnanial optionalulations. Journal of Parallel Computing, speial issue on Eonomis, Finane and Deision

Making,26,2000.

[5℄ J. Du and J. Leung. Complexity of sheduling parallel task systems. SIAM Journal on Disrete Mathematis,2:473, November 1989.

[6℄ K. Belkhale and P. Banerjee. Approximate sheduling algorithms for the

partitionableindependent taskshedulingproblem. InProeedings of the 1990 InternationalConfereneofParallelProessing,volumeI,page72,August1990.

[7℄ J. Turek, J. L. Wolf, and P. S. Yu. Approximate algorithms for sheduling parallelizable tasks. ACM Performane Evaluation Review, page 225, June

1992.

[8℄ S.Kirkpatrik,D.C.Gelatt,andM.P.Vehi. Statistialmehanisalgorithm formonte arlooptimization. Physis Today, pages 17{19, 1982.

[9℄ S Kirkpatrik, C D Gelatt, and M P Vehi. Optimisation by simulated

annealing. Siene,220:671{680, 1983.

[10℄J. H. Holland. Adaption in Natural and Artiial Systems. University of Mihigan Press,AnnArbor,MI, 1975.

[11℄D. E. Goldberg. Geneti Algorithms in Searh, Optimization and Mahine Learning. Addison-WesleyPublishingCo., In., Reading,MA., 1989.

[12℄G. R.Nudd, D.J. Kerbyson,E.Papaefstathiou, S.C. Perry,J.S.Harper,and D. V. Wilox. Pae - a toolset forthe performane predition of parallel and distributedsystems. Aepted for publiation in International Journal of High

Performane andSienti Appliations, 1999.

(21)

International Journal of Superomputer Appliations, 11(2):115{128, 1997.