Scheduling Algorithms and Support Tools for Parallel Systems

Abstract—High Performance Computing (HPC) is an evolving trend in the computing industry, with predicted growth of 10% annually. System sizes are also increasing, with petaflop systems becoming available. Scheduling in HPC systems is vital for harvesting the available processing power. Scheduling is inherently a hard problem, and it is made even harder in HPC environments by high task rates and a potentially unstable environment.

In this paper the main aspects of scheduler design, such as system architectures, workload types, metrics, simulation tools and benchmarks, are summarized. An overview of existing scheduling techniques is given, together with their comparison.

Index Terms—distributed systems, clusters, multiprocessors, scheduling

I. INTRODUCTION

Scheduling is the process of matching tasks to resources at specific times. In most environments this is a static problem that is solved once, and the solution can be applied many times. More dynamic environments exist where not all tasks are known in advance and only a subset of tasks can be scheduled. However, in most manufacturing and service applications the task rate is rather low, so exhaustive algorithms that schedule tasks in near real time can be designed.

Parallel computer systems are among the most unstable environments to schedule for because of the nature of the underlying technology. Very large systems are available thanks to modern interconnection networks. These systems can be used by many users who generate huge task rates as system input. Moreover, users can change parameters and even abort already submitted or scheduled jobs. This is typically not an issue in services where resources are prepaid and different money-back policies can be enforced to keep resource utilization high. In some parallel system setups, additional volatility is introduced by constant variation in the accessible resources.

Scheduling problems for parallel systems are generally known to be NP-hard [1]. However, many fast heuristics exist that can cope with the inherently unstable environment and high task rates while obtaining very efficient results. The design of these algorithms depends on multiple factors such as parallel system type, workload type and efficiency metrics. Simulation tools and representative benchmarks are needed during the scheduler development and testing phase. Parallel systems and workloads are classified in sections II and III. Metrics, benchmarks and simulation tools that are helpful in scheduler design are described in section IV. An overview of different scheduling algorithms, coupled with a summary of current vendor offerings, is given in section V.

II. PARALLEL SYSTEM ARCHITECTURE CLASSIFICATION

Symmetric Multiprocessing (SMP), Massively Parallel Processing (MPP), clusters and Non-Uniform Memory Access (NUMA) architectures are the main available parallel system types. SMP denotes a system architecture in which multiple processors are connected via a bus or crossbar to a single shared main memory. This solution scales up to a dozen processors because of the limited memory bandwidth. The programming model for SMPs is simple because all processes can access the entire system memory. Clusters are typically composed of interconnected SMP nodes in order to provide more processing power. MPPs are similar to clusters because they also consist of multiple connected nodes. The main difference is in the sophistication of the interconnection network, whose bandwidth is often designed to increase as more nodes are added to the system. The most powerful supercomputers are all MPP systems. The main drawback of cluster and MPP systems is a complex message-based programming model. A NUMA architecture consists of interconnected groups of processors, each group having its own local memory. Processors can access non-local memory with a speed penalty, which complicates software development for NUMA. Cache-coherent NUMA (ccNUMA) is designed to hide the internal memory architecture and to present an SMP-like programming model. Because of their limited performance, SMP systems alone are rarely the subject of scheduling research. NUMA architecture is no longer in development, while ccNUMA architectures are becoming more popular because new technologies from AMD and Intel provide NUMA support for modern processors and chipsets.

According to [2], 80% of the 500 fastest computers are clusters, while the rest are MPP systems. Research on scheduling methods should therefore concentrate on these two architectures.

Igor Grudenić

Fakultet elektrotehnike i računarstva, Unska 3, Zagreb

III. WORKLOAD CLASSIFICATION

Jobs submitted to parallel computer systems are either sequential or parallel. Parallel jobs can be classified as rigid, moldable, evolving or malleable [3], depending on the timing and flexibility of their resource demand. For rigid jobs the number of processors is externally given to the scheduler and remains fixed throughout the execution. An evolving job is submitted to the scheduler together with the needed number of processors, but this number can change during the execution of the job. Moldable jobs are submitted to the scheduler without CPU requirements, and it is up to the scheduler to decide their running size. The size does not change during the job runtime. Malleable jobs are the most flexible type: the job size is not given to the scheduler and is subject to change throughout the job execution.
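The four classes above can be read as a two-by-two table: who fixes the job size, and whether the size may change at run time. The following Python sketch encodes that reading; the names `JobType` and `classify` are illustrative and do not come from [3].

```python
from enum import Enum


class JobType(Enum):
    RIGID = "rigid"
    MOLDABLE = "moldable"
    EVOLVING = "evolving"
    MALLEABLE = "malleable"


def classify(size_given_by_user: bool, size_changes_at_runtime: bool) -> JobType:
    """Map the two properties discussed in the text to a job class."""
    if size_given_by_user:
        # The processor count is supplied with the job submission.
        return JobType.EVOLVING if size_changes_at_runtime else JobType.RIGID
    # The scheduler decides the processor count.
    return JobType.MALLEABLE if size_changes_at_runtime else JobType.MOLDABLE
```

For example, `classify(False, False)` returns `JobType.MOLDABLE`: the scheduler picks the size once and it stays fixed.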

Currently available parallel programming technologies such as PVM [4], MPI [5] and various thread libraries support rigid and moldable jobs, but malleable jobs are only partly supported, and only by MPI-2 [6].

Some scheduling strategies assume that a preemption mechanism is available for submitted jobs in order to pause and resume them or even move them to different machines. Pause and resume support is inherent in all multitasking operating systems, assuming processes remain memory resident. In cases when memory residency is too costly or processes must be moved to a different location, different checkpointing strategies are advised for both sequential [7] and parallel [8] jobs. Some of these integrate with the OS kernel, while others reside in user space or build on top of parallel communication libraries. Checkpoint/restart strategies provide greater flexibility in scheduling, fault tolerance, and extended debug support. In order to be classified as preemptive, some jobs must be redesigned to “behave well” according to the employed checkpoint strategy.

From the end user's point of view, jobs can be either batch or interactive, and the scheduler should prioritize them differently in order to maximize user benefit.

IV. METRICS, BENCHMARKS AND TOOLS

Unbiased evaluation of scheduling strategies is facilitated by well-defined performance metrics and freely available workload sets [9]. Discrete event simulators [10] provide a cost-effective environment for algorithm development and testing.

There are different metrics used for measurement of scheduler performance. Important criteria for metric choice are metric semantics and speed of convergence [11]. Metric semantics is related to the user experience with the computer system, while convergence assures consistent behavior of the parallel system as observed by the users.

Response time (RT) is defined as the sum of waiting time (Tw) and running time (Tr) and is often the metric of choice for system evaluation. Convergence is the main issue with this metric because the distribution of job run times typically has a very large variance. Slowdown, defined as

$$\text{Slowdown} = \frac{T_w + T_r}{T_r}$$

is proposed as a solution that is very descriptive to the users, although it evidently favors short jobs. This effect can be limited by bounding the metric with a constant τ:

$$\text{Bounded slowdown (BS)} = \max\left(\frac{T_w + T_r}{\max(T_r, \tau)},\ 1\right)$$

and should be further normalized to account for parallel jobs by dividing the bounded slowdown by the number of processors (BSPP) [12]. A problem arises when the number of processors is not given to the scheduler: the scheduler might increase response time in order to minimize bounded slowdown by starting jobs sooner on fewer processors [13]. The solution is to use RT, but with the geometric mean in order to compensate for long jobs. It is concluded [11] that metric convergence can be influenced by the workload type and the scheduling algorithm.
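As a hedged illustration of the metrics defined above, the following Python sketch computes response time, slowdown, bounded slowdown, BSPP and the geometric mean of a set of values. The function names and the default threshold of 10 time units for τ are assumptions made for this example; the text does not prescribe a value for τ.

```python
from math import exp, log


def response_time(t_wait, t_run):
    """RT = Tw + Tr."""
    return t_wait + t_run


def slowdown(t_wait, t_run):
    """Slowdown = (Tw + Tr) / Tr."""
    return (t_wait + t_run) / t_run


def bounded_slowdown(t_wait, t_run, tau=10.0):
    """BS = max((Tw + Tr) / max(Tr, tau), 1); tau=10 is an assumed default."""
    return max((t_wait + t_run) / max(t_run, tau), 1.0)


def bounded_slowdown_per_processor(t_wait, t_run, processors, tau=10.0):
    """BSPP: bounded slowdown divided by the number of processors."""
    return bounded_slowdown(t_wait, t_run, tau) / processors


def geometric_mean(values):
    """Geometric mean of positive values, computed in log space."""
    return exp(sum(log(v) for v in values) / len(values))
```

A job that waits 9 time units and runs for 1 has a slowdown of 10, but a bounded slowdown of 1 when τ = 10, which shows how the bound removes the dominance of very short jobs.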

Metrics such as RT, BS and BSPP have a clear definition and are very intuitive, but may not be expressive enough to capture user preferences regarding system response. Utility functions [14] are more precise, as they describe user satisfaction as a function of job response time, but they must be provided by the user. Convergence of this metric has not been analyzed.

A variety of standardized workload sets from several computing centers, as well as different workload models used to determine scheduler performance, are available at the Parallel Workloads Archive [15]. The advantage of using workload traces is their complete and precise reflection of the system load, including complex properties that are not available in trace analysis [16]. That kind of accuracy reduces the chance that performance results generalize to other systems.

Identification of interarrival times and job runtimes is the main issue in workload generation. The main focus of research is the statistical behavior of job runtimes, where different distributions such as exponential [17], hyper-Erlang [18], log-uniform [19] and hyper-gamma are fitted to different workload sets. Different job sizes are grouped together and modeled separately in [18], while others provide a universal model. It is concluded [11] that load generators based on the mentioned distributions are more stable than real workloads because generated job sequences very quickly reach a steady average runtime.

Interarrival times for jobs are traditionally fitted to an exponential distribution by assuming that job arrivals follow a Poisson process [11]. This does not account for the effects of burstiness (significant variance) and self-similarity [20]. Self-similarity is a statistical term that expresses burstiness at many or all time scales. Burstiness and self-similarity have not yet been modeled for parallel system workloads. The use of invalid statistical workload models is listed as one of the common simulation pitfalls [20].
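A minimal sketch of the traditional synthetic workload model described above is given below in Python: exponentially distributed interarrival times (a Poisson arrival process) combined with log-uniform runtimes. The function name and all parameters are assumptions chosen for illustration, and the sketch deliberately inherits the limitation noted in the text, namely that it does not reproduce burstiness or self-similarity.

```python
import math
import random


def generate_workload(n_jobs, mean_interarrival, min_runtime, max_runtime, seed=0):
    """Return a list of (arrival_time, runtime) pairs.

    Arrivals follow a Poisson process; runtimes are log-uniform on
    [min_runtime, max_runtime]. Both modeling choices are assumptions.
    """
    rng = random.Random(seed)  # seeded for reproducible experiments
    clock = 0.0
    jobs = []
    for _ in range(n_jobs):
        # Exponential interarrival time with the requested mean.
        clock += rng.expovariate(1.0 / mean_interarrival)
        # Log-uniform runtime: uniform in log space, then exponentiated.
        log_runtime = rng.uniform(math.log(min_runtime), math.log(max_runtime))
        jobs.append((clock, math.exp(log_runtime)))
    return jobs
```

Fixing the seed makes two simulation runs comparable, which is useful when different scheduling policies are evaluated on the same generated job sequence.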

Scheduling algorithms designed to handle moldable and malleable jobs cannot be evaluated without a parallel job speedup model [19].

Simulation of a parallel system environment is a prerequisite for scheduler development, testing and evaluation. Cluster scheduler developers such as [21] usually provide their own simulation environment with upgradeable scheduling policies. This is a useful feature for system administrators because it allows them to tune system policies and test them with recent workload traces to determine whether the suggested changes produce the desired behavior.

Simulation in grid environments is more complex and is a subject of research mainly in academia. SimGrid [22] and GridSim [23] are the only two currently available and maintained grid simulators. Both feature support for different resource types, network topologies and application modeling. SimGrid is built from scratch in C, while GridSim is implemented in Java on top of the discrete event simulation package simjava [24]. A complete comparison between the two is given in [23] and [25], and GridSim was rated best on the majority of the comparison criteria.

V. SCHEDULING ALGORITHMS

The goal of parallel system scheduler design is to maximize both system utilization and user experience while preserving the scalability of the algorithm. These conflicting goals result in tradeoffs that are necessary to make the scheduler feasible.

Traditional multiprocessor scheduling schemes are space sharing and time sharing. Time sharing or multitasking is a paradigm established with modern operating systems, where a quantum of CPU time is assigned to every job via context switching. Conversely, space sharing methods dedicate CPUs to a job until it finishes or until a predefined time limit expires, which results in job termination. Space and time sharing techniques, together with an overview of vendor implementations, are given in the following subsections.

A. Space sharing techniques

First Come First Served (FCFS) is the simplest space sharing technique. It allows resources to remain unused while they are accumulated for the next highest priority job. It guarantees fairness with respect to job ordering, but wastes resources that could be used for other jobs. Long wait times for short jobs result in a poor slowdown metric. The slowdown metric can be improved by introducing several queues with exclusively dedicated CPUs [26]. Jobs are submitted to a queue depending on their duration.

Unused “holes” in processor-time space caused by enforcing FCFS can be used by other jobs. This process is called backfilling, and multiple variants were developed that differ in the number of reservations, the order of queued jobs and the amount of lookahead into the queue [27]. In EASY backfilling [28], a reservation is made for the first job in the queue. Other jobs may leap forward in the queue as long as they do not delay the reservation for the first job. This raises the concern that resource-consuming jobs that are not at the top of the queue may often or always be overtaken by other jobs. Conservative backfilling [29] creates reservations for every queued job, and a move ahead in the queue is allowed only if it does not conflict with other reservations. A comparison of conservative and EASY backfilling [30] showed EASY to perform better on the majority of job traces, and also revealed the worst-case turnaround time in EASY to be significantly larger. It is advised to give a reservation only to the job at the top of the queue, but if the expected slowdown for some other waiting job reaches a predefined threshold, a reservation is made for that job as well. The relaxed backfilling strategy [31] allows every job to be delayed up to a predefined factor, resulting in reduced fairness but increased system utilization.
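The EASY rule described above can be sketched as a single scheduling pass in Python. This is a simplified illustration under stated assumptions, not the implementation from [28]: jobs are rigid, runtime estimates are trusted, and the names `Job` and `easy_backfill` are hypothetical. The pass first starts jobs in FCFS order, then computes the reservation ("shadow") time for the blocked head job and backfills only jobs that either finish before that time or fit into processors the head job will not need.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    size: int        # processors requested (rigid job)
    estimate: float  # user-provided runtime estimate


def easy_backfill(now, free, running, queue):
    """Return the jobs to start at time `now`.

    running: list of (end_time, size) for executing jobs.
    queue:   waiting jobs in FCFS order.
    """
    running, queue, started = list(running), list(queue), []
    # FCFS phase: start jobs from the head of the queue while they fit.
    while queue and queue[0].size <= free:
        job = queue.pop(0)
        free -= job.size
        running.append((now + job.estimate, job.size))
        started.append(job)
    if not queue:
        return started
    head = queue[0]
    # Reservation: earliest time at which enough processors free up for head.
    avail, shadow = free, None
    for end, size in sorted(running):
        avail += size
        if avail >= head.size:
            shadow = end
            break
    if shadow is None:  # head never fits on this machine
        return started
    extra = avail - head.size  # processors head will not need at shadow time
    # Backfill phase: later jobs may start only if head is not delayed.
    for job in queue[1:]:
        if job.size > free:
            continue
        ends_before_shadow = now + job.estimate <= shadow
        if ends_before_shadow or job.size <= extra:
            free -= job.size
            if not ends_before_shadow:
                extra -= job.size
            started.append(job)
    return started
```

With 4 free processors, one running job that releases 6 processors at time 10, and a queue of A (8 processors), B (4 processors, 20 time units), C (2 processors, 5 time units) and D (2 processors, 50 time units), the pass starts C and D: C finishes before the reservation for A, D uses processors A does not need, and B would delay A.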

Jobs can be preordered to reflect different priority and fairness requirements. Queue prioritization usually combines user prioritization, project or group prioritization, and scheduler-induced prioritization. In [32], different queue orderings such as FCFS, shortest job first (SJF) and largest expansion factor (LXF) are discussed with respect to their specific definitions of fairness.
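The three queue orderings named above can be illustrated with a short Python sketch. The expansion factor is assumed here to be (wait time + estimated runtime) / estimated runtime; this definition, the dictionary keys and the function name `order_queue` are assumptions for the example rather than details taken from [32].

```python
def order_queue(jobs, policy, now):
    """Return jobs sorted by the given policy.

    jobs: list of dicts with 'name', 'submit' (submission time) and
          'estimate' (estimated runtime).
    """
    if policy == "FCFS":
        key = lambda j: j["submit"]
    elif policy == "SJF":
        key = lambda j: j["estimate"]
    elif policy == "LXF":
        # Largest expansion factor first, hence the negated key.
        key = lambda j: -((now - j["submit"] + j["estimate"]) / j["estimate"])
    else:
        raise ValueError(f"unknown policy: {policy}")
    return sorted(jobs, key=key)
```

LXF promotes short jobs that have already waited long relative to their runtime, which is why its notion of fairness differs from the arrival-order fairness of FCFS.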

Backfilling algorithms traverse queued jobs sequentially, trying to schedule them for execution. It is possible to consider several jobs at once (queue lookahead) and use optimization techniques to produce a more efficient execution plan [33].

Successful backfilling depends on job runtime estimates, which are usually provided by the users. Users are encouraged to provide estimates that are as precise as possible in order to increase the chances of backfilling, which reduces their wait time. On the other hand, if the estimate is lower than the real runtime, the job will be terminated by the scheduler. This is the reason many users overestimate runtimes. It is revealed [34] that about half of the users are able to improve their estimates if termination is not performed, but there was no substantial improvement in overall average accuracy.

Inaccurate runtime estimates may lead to improved average wait time and slowdown [35], but larger overestimation leads to a less efficient schedule. It was also concluded that the same effect is achieved by using the shortest job backfilled first (SJBF) strategy.

Job runtime prediction is not necessarily a user responsibility, since usage history can be used by the system to provide more accurate runtimes [36]. It is advisable to keep the user estimate as the termination point for a job, because users are not expected to respond well to a job being killed because of an inaccurate system prediction. Such a strategy coupled with SJBF can nearly double the performance of FCFS EASY (up to 47% reduction in average slowdown).

The techniques and results mentioned above apply only to rigid jobs. FCFS with backfilling, slightly altered to benefit from moldability and malleability, reduces average response time by up to 50%, assuming 25% of jobs are non-rigid [37]. A linear speedup model for parallel jobs is assumed in these experiments.

Another approach [38] assumes the existence of AppLeS (application-level scheduler), which selects the size of a moldable job Jm on the user's behalf by simulating the future schedule for several job sizes and picking the one with the smallest turnaround time for Jm. AppLeS reduces mean response times for submitted jobs depending on the current system load (usually by about 30%). The SCOJO-P scheduler [39] takes into account all the waiting jobs, and even statistically modeled future jobs, when determining the size of a moldable job. This resulted in 70% improved response time compared to traditional scheduling and 59% compared to AppLeS. It is concluded that statistical modeling of future workload does not contribute to the performance metrics to a large extent, but this is left open for future analysis.

In contrast to the previous techniques, which target CPUs as the critical scheduling resource, the gangmatching algorithm [40] allows jobs to be matched to different resource types such as processors, memory, software licenses and others. The gangmatching algorithm defines three main roles: requestor, provider and matchmaker. The requestor (job) and the providers (resources) advertise their properties to the matchmaker. Each advertisement defines ports that should be connected to the ports of other advertisements. The matchmaker announces the match to the providers and the requestor only if all the ports are connected. Port connection is performed recursively and is identified as the bottleneck of the algorithm. The use of index trees is proposed to speed up sequential port probing at multiple recursion levels. After the match is announced to the requestor and the providers, an agreement is confirmed by the claiming protocol. If the agreement fails, new advertisements are made to the matchmaker and the entire process is repeated. The gangmatching algorithm is designed to work in heterogeneous environments such as networks of workstations, where various administrative policies may apply and fluctuations in resource availability are expected.

Genetic algorithms are usually used for static scheduling problems where computation speed is not an issue. An input-bounded genetic algorithm [41] is proposed to cope with the volatile scheduler environment. The initial population is generated randomly for a subset of jobs, while the others are mapped to the CPUs one by one, targeting minimal response time. The fitness function is defined as the divergence between the theoretical optimal processing time and the estimated processing time of the evaluated schedule. Roulette wheel selection and cycle crossover are used. Mutation is performed by randomly swapping tasks between the processors, followed by rebalancing. Rebalancing chooses the processor with the highest load and swaps one of its tasks with a shorter task from another processor, if one can be found in 5 guesses. In order to limit the execution time of the algorithm, only a subset of the queued jobs is scheduled. The size of the subset varies with the estimated time at which the first processor becomes idle. The genetic algorithm is shown to decrease makespan by up to 50% for a sequential rigid workload when compared to the round robin and lightest-load-first heuristics.
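Two of the building blocks described above, roulette wheel selection and the rebalancing step, can be sketched in Python as follows. This is an illustration under assumptions, not the code of [41]: a schedule is a list of per-processor task duration lists, the ideal processing time is taken as the total work divided by the number of processors, and fitness is expressed as the ratio of that ideal to the makespan, which is one possible way to express the divergence mentioned in the text.

```python
import random


def makespan(schedule):
    """Completion time of the most loaded processor."""
    return max(sum(tasks) for tasks in schedule)


def fitness(schedule):
    """Ratio of ideal to actual makespan; 1.0 means perfectly balanced.

    Assumes the schedule contains at least one task with positive duration.
    """
    ideal = sum(sum(tasks) for tasks in schedule) / len(schedule)
    return ideal / makespan(schedule)


def roulette_select(population, rng):
    """Pick one schedule with probability proportional to its fitness."""
    weights = [fitness(s) for s in population]
    return rng.choices(population, weights=weights, k=1)[0]


def rebalance(schedule, rng, guesses=5):
    """Swap a task of the most loaded processor with a shorter task
    from another processor, trying at most `guesses` random candidates.
    Returns True if a swap was made. Modifies `schedule` in place."""
    heavy = max(range(len(schedule)), key=lambda p: sum(schedule[p]))
    others = [p for p in range(len(schedule)) if p != heavy]
    if not schedule[heavy] or not others:
        return False
    i = rng.randrange(len(schedule[heavy]))
    for _ in range(guesses):
        other = rng.choice(others)
        if not schedule[other]:
            continue
        j = rng.randrange(len(schedule[other]))
        if schedule[other][j] < schedule[heavy][i]:
            schedule[heavy][i], schedule[other][j] = (
                schedule[other][j], schedule[heavy][i])
            return True
    return False
```

A single swap always reduces the load of the processor that was heaviest, but it can raise the load of the other processor, so it does not by itself guarantee a shorter makespan; selection over several generations is what filters out the poor results.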

The market-oriented approach to scheduling [42] proposes that users bid for resources at desired timeslots. The timeslot price in such a system reflects the ratio of supply and demand on the simulated market. This technique is shown to be successful in both space and time sharing systems.

The space sharing algorithms described above are used for both clusters and MPPs because they rarely rely on bandwidth-exhausting preemption mechanisms.

B. Time sharing techniques

Time sharing or multitasking scheduling algorithms depend on the existence of a preemption mechanism that makes pause-and-continue, and even migration, available for running jobs. This makes them more suitable for the MPP architecture because higher bandwidth reduces the overhead of preemption-based job migration. The available time of each processing unit is sliced into timeslots that are given to the jobs. The size of a timeslot is determined with regard to the cost of context switching.

Since each parallel job is composed of threads and processes, it is beneficial for performance to run them on different CPUs at the same time. Such a thread or process group is called a gang [43], and the scheduling approach that uses this kind of grouping is named gang scheduling. In order to decrease the demand for system capacity, only the subset of job threads that interact often can form a gang. Such grouping can either be derived by observing thread interactions and regrouping, or be suggested by special syntax in the distributed program.
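The idea that all members of a gang occupy CPUs during the same timeslot can be illustrated with a small Python sketch. The first-fit packing rule used here is an assumption chosen for simplicity; the text does not prescribe a packing scheme, and the function name `pack_gangs` is hypothetical.

```python
def pack_gangs(gangs, n_cpus):
    """Assign gangs to timeslots so that each gang runs entirely
    within one slot (all its threads execute at the same time).

    gangs:  dict mapping gang name -> number of threads.
    n_cpus: number of CPUs available in every timeslot.
    Returns a list of timeslots, each a dict name -> size.
    """
    slots = []
    for name, size in gangs.items():
        if size > n_cpus:
            raise ValueError(f"gang {name} needs more than {n_cpus} CPUs")
        for slot in slots:
            # First fit: reuse the earliest slot with enough idle CPUs.
            if sum(slot.values()) + size <= n_cpus:
                slot[name] = size
                break
        else:
            slots.append({name: size})
    return slots
```

On 8 CPUs, gangs A (4 threads), B (3), C (2) and D (4) end up in two timeslots, {A, B} and {C, D}, leaving one and two CPUs idle respectively; the scheduler then cycles through the slots with a coordinated context switch.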

Partitioning of resources is necessary in very large systems because scheduling for the entire cluster centrally would not be feasible at the given task rate. Distributed hierarchical control (DHC) systems are often used to cope with this complexity [44].

Gang scheduling can employ either fixed or dynamic partitioning of CPUs. Fixed partitioning divides CPUs into disjoint subsets, and jobs are dispatched to be scheduled within a subset. This can introduce fragmentation if gang sizes are not equal. Dynamic partitioning allows partitions to change size (by moving CPUs from one partition to another), but this requires complex synchronization of context switching across the entire parallel system. It is advised to perform repartitioning only when a new job arrives in the system, not with every context switch.

When the DHC system was designed, it was assumed that moving gangs across different partitions introduces overhead that cannot be justified by the expected performance gain. However, the gang migration policy proposed in [45] results in decreased response time. The decrease ranges from 14% to 67% depending on the system load.

Another issue with gang scheduling is memory consumption. In order for jobs to run efficiently, paging in the system should be kept to a minimum because it drastically affects thread synchronization. Admission control [46] can prevent jobs from being gang scheduled if the system is running too low on memory. These jobs are queued, and FCFS with backfilling can be used for queue control.

Gang scheduling prevents long resource-consuming jobs from starving shorter jobs, thus resulting in decreased slowdown. Scheduled interactive jobs appear to have near real-time performance due to the time sharing nature of the algorithm.

C. Vendor implementation overview

A job management system can be roughly decomposed into resource managers and job schedulers. The resource manager is a component distributed to all the nodes in the parallel system. It is used for state monitoring and reporting to the central managing unit. The resource manager provides all sorts of statistical and event data to the plugged-in scheduler component.

One of the first job management systems is PBS [47], which was originally developed for NASA. PBS evolved into the open source version OpenPBS [48] and the commercially available PBS Professional [49], maintained by Altair Engineering.

OpenPBS was further enhanced by Cluster Resources and named Torque [50]. All the PBS derivatives employ space sharing and a variant of the backfilling algorithm. The use of external schedulers is supported by the PBS interfaces. When such a scheduler is used, the PBS system acts as a resource manager. SGE (Sun Grid Engine) [51] is an open source job management system supervised by Sun Microsystems that supports backfilling with more detailed queuing support than PBS. It can also be extended by other scheduling software.

Maui [52] and its commercial version Moab [21], both provided by Cluster Resources, are strictly job schedulers without resource management support. Different variants of the backfilling algorithm are available in Maui, while a more complex policy manager is integrated into Moab.

IBM’s LoadLeveler [53] is a job management system that supports gang scheduling and is designed specifically for the IBM AIX operating system.

The Load Sharing Facility (LSF) system [54] by Platform Computing is designed to manage batch jobs in HPC environments. No details of its scheduling algorithm are available on the product website.

The Condor project [55], developed at the University of Wisconsin, Madison, is a job management system targeted at workstation cycle stealing. It uses a matchmaking scheduling algorithm and supports preemption techniques with job migration enabled. Xgrid [56] by Apple provides a user-friendly clustering environment similar to Condor, but the implementation details are not elaborated in the available whitepapers.

The scheduling techniques that are not mentioned in this vendor overview are either very platform specific, with implementations that are not widely available, or their results were obtained by simulation and the algorithm is not yet used in a production environment.

VI. CONCLUSION

This paper deals with different issues and factors in parallel system scheduler design and gives a summary of existing scheduling algorithms.

Every scheduling process is about mapping tasks to resources. An architectural classification of parallel systems (resources) is given together with the estimated current usage of each technology. Workload (tasks) is classified with respect to parallelism, interactivity and migration capabilities. Tools, metrics and benchmarks needed for the analysis and development of scheduling algorithms are elaborated.

In the analysis of scheduling algorithms, performance comparisons are given where the performance metrics used allowed it. The types of input workload are clearly stated in the comparisons in cases where standard benchmarks were not used. The current vendor offering is analyzed with respect to the scheduling algorithm used.

There are many open questions in parallel job scheduling. Some are of a philosophical nature, such as the definition of an effective system, while others concern utilization, scalability, job moldability, migration support, etc.

VII. REFERENCES

[1] D. Bernstein, M. Rodeh and I. Gertner, “On the Complexity of Scheduling Problems for Parallel/Pipelined Machines”, IEEE Transactions on Computers, 1998, vol. 38, pp. 1308-.

[2] Top500 Supercomputer Sites, http://www.top500.org/

[3] D. G. Feitelson, L. Rudolph, U. Schwiegelshohn and K. C. Sevcik, “Theory and Practice in Parallel Job Scheduling”, Lecture Notes in Computer Science, 1997, vol. 1291, pp. 1-34.

[4] V. S. Sunderam, "PVM: A Framework for Parallel Distributed Computing", Concurrency: Practice and Experience, December, 1990, Vol. 2, pp. 315-339.

[5] MPI Documents, http://www.mpi-forum.org/docs/

[6] MPI-2 Standard, http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html

[7] Jose C. Sancho, Fabrizio Petrini, Kei Davis, Roberto Gioiosa and Song Jiang, “Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance”, In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005, vol. 19.

[8] Oren Laadan, Dan Phung and Jason Nieh, “Transparent Checkpoint-Restart of Distributed Applications on Commodity Clusters”, In Proceedings of the IEEE International Conference on Cluster Computing, 2005, pp. 1-13.

[9] D. G. Feitelson and L. Rudolph, “Metrics and Benchmarking for Parallel Job Scheduling”, Lecture Notes in Computer Science, 1998, vol. 1459, pp. 1-24.

[10] M. Curiel, G. Alvarez and L. Flores, “Evaluating Tools for Performance Modeling of Grid Applications”, Lecture Notes in Computer Science, 2006, vol. 4331, pp. 854-863.

[11] D. G. Feitelson, “Metrics for parallel job scheduling and their convergence”, Lecture Notes in Computer Science, 2001, vol. 2221, pp. 188-205.

[12] D. Zotkin and P.J. Keleher, “Job-length Estimation and Performance in Backfilling Schedulers”, In Proceedings of the 8th International Symposium on High Performance Distributed Computing, 1999.

[13] W. Cirne and F. Berman, “Adaptive Selection of Partition Size for Supercomputer Requests”, Lecture Notes in Computer Science, 2000, vol. 1911, pp. 187-208.

[14] C.B. Lee and A.E. Snavely, “Precise and Realistic Utility Functions for User-Centric Performance Analysis of Schedulers”, In Proceedings of the 16th International Symposium on High Performance Distributed Computing, 2007, pp. 107-116.

[15] Parallel Workloads Archive, http://www.cs.huji.ac.il/labs/parallel/workload/

[16] D.G. Feitelson, “Workload modeling for performance evaluation”, Lecture Notes in Computer Science, 2002, vol. 2459, pp. 114-141.

[17] D.G. Feitelson, “Packing Schemes for Gang Scheduling”, Lecture Notes in Computer Science, 1996, vol. 1162, pp. 89-110.

[18] J. Jann, P. Pattnaik, H. Franke, F. Wang, J. Skovira and J. Riordan, “Modeling of workload in MPPs”, Lecture Notes in Computer Science, 1997, vol. 1291, pp. 95-116.

[19] A.B. Downey, “A parallel workload model and its implications”, Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, 1997, pp. 112-142.

[20] E. Frachtenberg and D.G. Feitelson, “Pitfalls in parallel job scheduling evaluation”, Lecture Notes in Computer Science, 2005, vol. 3834, pp. 257-282.

[21] Moab Cluster Software Suite, http://www.clusterresources.com/pages/products/moab-cluster-suite.php

[22] H. Casanova, “Simgrid - a Toolkit for the Simulation of Application Scheduling”, In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001, pp. 430-437.

[23] A. Sulistio, U. Cibej, S. Venugopal, B. Robic and R. Buyya, “A Toolkit for Modelling and Simulating Data Grids - An Extension to GridSim”, Concurrency and Computation: Practice & Experience, 2008, vol. 20, pp. 1591-1609.

[24] R. McNab and F.W. Howell, “Using Java for Discrete Event Simulation”, Proceedings of 12th UK Computer and Telecommunications Performance Engineering Workshop (UKPEW), 1996, pp. 219-228.

[25] M. Curiel, G. Alvarez and L. Flores, “Evaluating Tools for Performance Modeling of Grid Applications”, Lecture Notes in Computer Science, 2006, vol. 4331, pp. 854-863.

[26] M. Harchol-Balter, M. Crovella and C.D. Murta, “On Choosing a Task Assignment Policy for a Distributed Server System”, Lecture Notes in Computer Science, 1998, vol. 1469, pp. 231-242.

[27] D.G. Feitelson, L. Rudolph and U. Schwiegelshohn, “Parallel Job Scheduling - A Status Report”, Lecture Notes in Computer Science, 2005, vol. 3277, pp. 1-16.

[28] D.A. Lifka, “The ANL IBM SP Scheduling System”, Lecture Notes in Computer Science, 1995, vol. 949, pp. 295-303.

[29] A.W. Mu'alem and D.G. Feitelson, “Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling”, IEEE Transactions on Parallel and Distributed Systems, 2001, vol. 12, pp. 529-543.

[30] S. Srinivasan, R. Kettimuthu, R. Subramani and P. Sadayappan, “Characterization of Backfilling Strategies for Parallel Job Scheduling”, Proceedings of the 2002 International Conference on Parallel Processing Workshops, 2002, pp. 514-519.

[31] W.A. Ward Jr., C.L. Manhood and J.E. West, “Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy”, Lecture Notes in Computer Science, 2002, vol. 2537, pp. 88-102.

[32] G. Sabin, G. Kochhar and P. Sadayappan, “Job fairness in non-preemptive job scheduling”, Proceedings of the 2004 International Conference on Parallel Processing, 2004, pp. 186-194.

[33] E. Shmueli and D.G. Feitelson, “Backfilling with lookahead to Optimize the Performance of Parallel Job Scheduling”, Lecture Notes in Computer Science, 2003, vol. 2862, pp. 228-251.

[34] C. Bailey Lee, Y. Schwartzman, J. Hardy and A. Snavely, “Are user runtime estimates inherently inaccurate”, Lecture Notes in Computer Science, 2005, vol. 3277, pp. 253-263.

[35] D. Tsafrir and D.G. Feitelson, “The Dynamics Of Backfilling: Solving the Mystery of Why Increased Inaccuracy May Help”, Proceedings of 2006 IEEE International Symposium on Workload Characterization, 2006, pp. 131-141.

[36] D. Tsafrir, D.G. Feitelson, Y. Etsion, “Backfilling Using System-Generated Predictions Rather than User Runtime Estimates”, IEEE Transactions on Parallel and Distributed Systems, 2007, vol. 18, pp. 789-803.

[37] J. Hungershofer, “On the Combined Scheduling of Malleable and Rigid Jobs”, Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing, 2004, pp 206-213.

[38] W. Cirne, C. Grande and F. Berman, “When the Herd is Smart: Aggregate Behavior in the Selection of Job Request”, IEEE Transactions on Parallel and Distributed Systems, 2003, vol. 14, pp. 181-192.

[39] L. Barsanti and A. Sodan, “Adaptive Job Scheduling via Predictive Job Resource Allocation”, Lecture Notes in Computer Science, 2007, vol. 4376, pp. 115-140.

[40] R. Raman, M. Livny and M. Solomon, “Policy Driven Heterogeneous Resource Co-Allocation with Gangmatching”, Proceedings of International Symposium on High Performance Distributed Computing, 2003, pp. 80-89.

[41] A.J. Page and T.J. Naughton, “Dynamic Task Scheduling using Genetic Algorithms for Heterogeneous Distributed Computing”, In Proceedings of the 19th IEEE International Symposium on Parallel and Distributed Processing, 2005, p. 189.

[42] B.N. Chun, “Market-based cluster resource management”, PhD thesis, University of California, Berkeley, 2001.

[43] D.G. Feitelson, “Job Scheduling in Multiprogrammed Parallel Systems”, IBM Research Report RC 19790 (87657), 1994.

[44] D.G. Feitelson and L. Rudolph, “Evaluation of Design choices for gang Scheduling using Distributed Hierarchical Control”, Journal of Parallel and Distributed Computing, May 1996, vol. 35, pp. 18-34.

[45] S.K. Setia, “Trace Driven Analysis of Migration-based Gang Scheduling Policies for Parallel Computers”, Proceedings of International Conference on Parallel Processing, 1997, pp. 489-492.

[46] A. Batat and D.G. Feitelson, “Gang scheduling with memory considerations”, Proceedings of International Parallel and Distributed Processing Symposium, 2000, pp. 109-114.

[47] The Portable Batch System, http://www.nas.nasa.gov/Software/PBS/pbslist.html

[48] PBS GridWorks: OpenPBS, www.openpbs.org

[49] Enabling on demand computing, http://www.pbsgridworks.com/

[50] Torque Resource Manager, http://www.clusterresources.com/pages/products/torque-resource-manager.php

[51] Sun Grid Engine, http://www.sun.com/software/gridware/

[52] Maui Cluster Scheduler, http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php

[53] Tivoli Workload Scheduler LoadLeveler, http://www-3.ibm.com/systems/clusters/software/loadleveler/index.html

[54] Platform Computing, http://www.platform.com/Products/platform-lsf

[55] Condor Project Homepage, http://www.cs.wisc.edu/condor/

[56] Apple - Mac OS X Server - Technology – Xgrid,