
Chapter 2 Performance prediction and its application

2.4 Scheduling as an application

To execute efficiently, grid applications rely on middleware services that manage resources and allocate them to tasks under variable and unpredictable application workloads. Many tools exist to model application performance from performance data, and other tools gather and store those data. Used together, these tools can form part of a scheduling system that attempts to allocate applications to resources as efficiently as possible, subject to various constraints. Previous performance work at Warwick has focused on performance prediction with the PACE framework, described in Sec. 2.1.2 and Appendix B, and on an adaptable scheduling system called TITAN [SJC+03]. The PACE tools assist the user in generating analytical performance models, and the evaluation of these models allows the performance and scalability of parallel applications to be estimated on different architectures. A recent example of this can be found in [MJSN06].

TITAN is a multi-cluster scheduling system that uses a rapid genetic algorithm and application performance models to create schedules in real time [SJC+03]. In order to scale well, a complete TITAN system may consist of a number of distribution brokers in a loosely hierarchical structure, which pass tasks between each other and determine whether any workload manager within the hierarchy has the available resources and meets any Quality of Service (QoS) criteria. The individual workload managers are responsible for a local pool of resources and interact with local batch schedulers for job submission and monitoring services. The workload managers also query local performance prediction engines such as PACE for performance data. This section will ignore most of these components and focus on the workflow manager itself.

In contrast to planning-based scheduling systems such as OpenCCS [Ope], which plan the start time of each task and produce a complete schedule for the future, TITAN’s workflow manager uses a queuing system and simply tries to use available resources as efficiently as possible. The workflow manager uses a genetic algorithm to explore the task mappings required to minimise the overall runtime of all tasks, resource idle time, and average delay of a schedule. The genetic algorithm achieves this by continually generating sets of candidate schedules and replacing the current best schedule whenever it discovers an improved one.

The scheduling problem is one where a set of tasks $T = \{T_1, T_2, \ldots, T_n\}$ is mapped onto a set of hosts $H = \{H_1, H_2, \ldots, H_m\}$. The tasks are considered to be independent of each other and are submitted to run in some order $\ell \in P(T)$, where $P(T)$ is the set of all possible permutations of $T$. For each task $\ell_j$ within this ordering, there is a mapping of that task onto one or more hosts $\beta_j$, where $\beta_j \subseteq H$ and $\beta_j \neq \emptyset$.

A schedule consists of a 2-D matrix where the rows are the available hosts and the columns the ordered tasks. $S_k$, as defined below, is the $k$th schedule in a set of schedules $S = \{S_1, \ldots, S_p\}$ manipulated by the genetic algorithm. The columns of the matrix are simply another way of representing $\beta_j$ as a bitmap, where an element in the column is 1 if the task is allocated to a particular host and 0 if it is not.

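$$
M_{i,j} =
\begin{cases}
1, & \text{if } H_i \in \beta_j \\
0, & \text{if } H_i \notin \beta_j
\end{cases}
\qquad
S_k =
\begin{array}{c|cccc}
       & \ell_1  & \ell_2  & \cdots & \ell_n  \\ \hline
H_1    & M_{1,1} & M_{1,2} & \cdots & M_{1,n} \\
H_2    & M_{2,1} & M_{2,2} & \cdots & M_{2,n} \\
\vdots & \vdots  & \vdots  & \ddots & \vdots  \\
H_m    & M_{m,1} & M_{m,2} & \cdots & M_{m,n}
\end{array}
$$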
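As an illustration of this representation, the following Python sketch builds the bitmap matrix from a task order and the host set of each task. The function name `schedule_matrix`, the dictionary-based host map and the zero-based host indices are assumptions made for this example; they are not taken from TITAN.

```python
from typing import Dict, FrozenSet, List


def schedule_matrix(
    order: List[str],
    hostmap: Dict[str, FrozenSet[int]],
    num_hosts: int,
) -> List[List[int]]:
    """Return M where M[i][j] is 1 if host i is allocated to the j-th task."""
    for task in order:
        # Every task must be mapped onto at least one host (beta_j is non-empty).
        if not hostmap[task]:
            raise ValueError(f"task {task} is mapped onto no hosts")
    return [
        [1 if host in hostmap[task] else 0 for task in order]
        for host in range(num_hosts)
    ]


# Hypothetical example: three tasks submitted in the order T2, T1, T3.
order = ["T2", "T1", "T3"]
hostmap = {
    "T1": frozenset({0, 1}),
    "T2": frozenset({2}),
    "T3": frozenset({0, 2}),
}
matrix = schedule_matrix(order, hostmap, num_hosts=3)
```

Each row of `matrix` corresponds to a host and each column to a position in the task order, so reading down a column recovers the host set of the task at that position.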

One of the unintuitive aspects of this definition of a schedule is that it makes no reference to time. It also appears at first glance to be impossible to specify two or more tasks to run concurrently on different hosts. This appearance is misleading: the schedule is not a timetable, but simply defines the order in which tasks are submitted to a batch queue, together with a host mapping for each task. If, for example, a schedule has two tasks, $\ell_1 = T_1$ and $\ell_2 = T_2$, whose host mappings $\beta_1$ and $\beta_2$ do not overlap, then when $T_1$ and $T_2$ are submitted to an empty batch queue they will both start executing concurrently. If $\beta_1$ and $\beta_2$ do share a subset of hosts, then $T_2$ will not start executing until $T_1$ completes and frees up the hosts they have in common.

This representation has a major benefit in the context of a genetic algorithm. When the genetic algorithm randomly permutes or changes the schedule, it is impossible for it to invalidate the schedule by producing overlapping host requirements: tasks are simply submitted to run on a specific set of hosts, and they begin to execute as soon as they are able.

Converting the schedule to one where time is involved requires some means of knowing how long a task will take to complete and how well it scales to different numbers of hosts. This is supplied by the performance prediction engine, which returns a predicted runtime $p_{ext}(\ell_j, \beta_j)$ for every query, where $\ell_j$ is a task and $\beta_j$ its host allocation.

The end time for a task is $te_j$, and the start time $ts_j$. $tr_{ji}$ is the earliest release time possible for task $\ell_j$ on an individual host $H_i$, independent of all the other hosts. These are defined as

$$
te_j = ts_j + p_{ext}(\ell_j, \beta_j),
\qquad
ts_j = \max_{i \ \text{s.t.}\ H_i \in \beta_j} tr_{ji}
$$

$tr_{ji}$ can be defined as a recurrence relation on the release times of all the earlier tasks $\ell_1, \ldots, \ell_{j-1}$ running on the same host. The release time for $\ell_1$ is by definition the current time $t$. The maximum in the definition of $ts_j$ arises because the job can only be released when all the hosts for the task become available, and this occurs at the latest release time for an individual host.
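The timing definitions can be made concrete with a short Python sketch. It assumes that a host is released for the next task as soon as the most recent earlier task mapped onto it ends, which is one straightforward reading of the recurrence; the function `evaluate_times` and the toy predictor `toy_predict` (perfect scaling with the number of hosts) are hypothetical stand-ins, not the PACE or TITAN interfaces.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple


def evaluate_times(
    order: List[str],
    hostmap: Dict[str, FrozenSet[int]],
    predict: Callable[[str, FrozenSet[int]], float],
    num_hosts: int,
    now: float = 0.0,
) -> Dict[str, Tuple[float, float]]:
    """Return {task: (start, end)} for tasks submitted in the given order."""
    # release[i] is the time at which host i becomes free for the next task.
    release = [now] * num_hosts
    times: Dict[str, Tuple[float, float]] = {}
    for task in order:
        hosts = hostmap[task]
        # ts_j: the task starts once the last of its hosts has been released.
        start = max(release[i] for i in hosts)
        # te_j: start time plus the predicted runtime on this host set.
        end = start + predict(task, hosts)
        for i in hosts:
            release[i] = end
        times[task] = (start, end)
    return times


# Assumed toy workload: runtime is total work divided by the number of hosts.
work = {"T1": 4.0, "T2": 3.0, "T3": 2.0}


def toy_predict(task: str, hosts: FrozenSet[int]) -> float:
    return work[task] / len(hosts)


hostmap = {
    "T1": frozenset({0, 1}),
    "T2": frozenset({2}),
    "T3": frozenset({1, 2}),
}
times = evaluate_times(["T1", "T2", "T3"], hostmap, toy_predict, num_hosts=3)
```

In this example T1 and T2 use disjoint hosts and therefore both start at time 0, whereas T3 shares a host with each of them and cannot start until the later of the two has finished.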

The genetic algorithm operates in an iterative fashion by measuring candidate schedules using fitness functions. The best schedule becomes the one used by the scheduler, and the poorest schedules are removed and replaced by new ones created by crossover and mutation. Crossover takes a random pair of schedules and mixes them together producing a new schedule, whereas mutation produces a new schedule by randomising the contents of an existing schedule.
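One iteration of such a loop might be sketched as follows. This is a generic illustration rather than TITAN's algorithm: the function `ga_step`, the choice to apply mutation to every crossover result, and the number of schedules replaced per iteration are all assumptions, and integers stand in for schedules so that the example is self-contained.

```python
import random
from typing import Callable, List, TypeVar

S = TypeVar("S")


def ga_step(
    population: List[S],
    cost: Callable[[S], float],
    crossover: Callable[[S, S], S],
    mutate: Callable[[S], S],
    rng: random.Random,
    n_replace: int,
) -> List[S]:
    """Drop the n_replace worst candidates and replace them with offspring."""
    # Lower cost is better, so the best candidate ends up first.
    ranked = sorted(population, key=cost)
    survivors = ranked[: len(ranked) - n_replace]
    children = []
    for _ in range(n_replace):
        # Pick a random pair of surviving candidates as parents.
        parent_a, parent_b = rng.sample(survivors, 2)
        children.append(mutate(crossover(parent_a, parent_b)))
    return survivors + children


# Toy stand-ins: a "schedule" is an integer and its cost is its magnitude.
rng = random.Random(0)
population = [5, -3, 10, 1]
next_generation = ga_step(
    population,
    cost=abs,
    crossover=lambda a, b: (a + b) // 2,
    mutate=lambda s: s,
    rng=rng,
    n_replace=2,
)
best = next_generation[0]
```

After each step the first element of the returned list is the current best candidate, which corresponds to the schedule the scheduler would adopt.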

Fig. 2.5: Run-time schedule of three example workflows, drawn as processors (1 to 6) against time: (a) before scheduling; (b) after scheduling.

Crossover of task orders takes two task orders $\ell_A$ and $\ell_B$ and splices them together at some random point $i$, so that $\ell'_1 \ldots \ell'_i$ are from $\ell_A$ and $\ell'_{i+1} \ldots \ell'_n$ from $\ell_B$. This can produce invalid task orders with duplicated tasks (which are not permitted in a permutation of $T$); this can be resolved by repeatedly replacing the first occurrence of each duplicated task by the original task from $\ell_A$ until no duplications remain. Mutation of a task order simply swaps random tasks within $\ell$, which is always guaranteed to produce a valid task order.
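A minimal Python sketch of these two operators follows. The repair step here is a simplified stand-in and may differ from TITAN's exact procedure: it keeps the first occurrence of each task and overwrites later repeats with the tasks that went missing, taken in the order they appear in the first parent. The function names and the optional `point` argument are hypothetical, and at least two tasks are assumed.

```python
import random
from typing import List, Optional


def crossover_order(
    a: List[str],
    b: List[str],
    rng: random.Random,
    point: Optional[int] = None,
) -> List[str]:
    """Splice two task orders at a point and repair duplicated tasks."""
    i = rng.randrange(1, len(a)) if point is None else point
    child = a[:i] + b[i:]
    present = set(child)
    # Tasks lost by the splice, in the order they occur in the first parent.
    missing = [task for task in a if task not in present]
    seen = set()
    for k, task in enumerate(child):
        if task in seen:
            # Assumed repair: overwrite the repeat with a missing task.
            child[k] = missing.pop(0)
        else:
            seen.add(task)
    return child


def mutate_order(order: List[str], rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions; the result is still a permutation."""
    i, j = rng.sample(range(len(order)), 2)
    mutated = list(order)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(1)
order_a = ["T1", "T2", "T3", "T4", "T5"]
order_b = ["T3", "T5", "T1", "T2", "T4"]
child = crossover_order(order_a, order_b, rng, point=2)
mutated = mutate_order(order_a, rng)
```

With the splice point fixed at 2, the raw splice is T1, T2, T1, T2, T4; the repair overwrites the repeated T1 and T2 with the missing T3 and T5, so the child is again a permutation of the five tasks.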

Crossover of hostmaps $\beta_A$ and $\beta_B$ performs a splice operation similar to crossover of task orders. Mutation of a hostmap simply flips random bits within a hostmap $\beta$. After crossover and mutation, hostmaps are checked against ‘topology’ constraints, such as a task being allocated to no hosts or a task running on a subset of hosts that is not permitted for some reason.
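The hostmap operators can be sketched in the same way, treating a hostmap as a list of bits with one entry per host. The function names, the per-bit mutation rate and the `allowed` host set used to model a forbidden allocation are assumptions for illustration; how TITAN reacts to a failed check is not described here, so the sketch only reports whether a hostmap is acceptable.

```python
import random
from typing import FrozenSet, List, Optional


def crossover_hostmap(a: List[int], b: List[int], point: int) -> List[int]:
    """Splice two hostmap bitmaps at the given point."""
    return a[:point] + b[point:]


def mutate_hostmap(bits: List[int], rng: random.Random, rate: float) -> List[int]:
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in bits]


def satisfies_topology(
    bits: List[int],
    allowed: Optional[FrozenSet[int]] = None,
) -> bool:
    """Reject empty allocations and hosts outside an assumed permitted set."""
    hosts = frozenset(i for i, bit in enumerate(bits) if bit)
    if not hosts:
        return False
    return allowed is None or hosts <= allowed


rng = random.Random(2)
spliced = crossover_hostmap([1, 1, 0, 0], [0, 0, 0, 1], point=2)
empty = crossover_hostmap([0, 0, 1, 1], [1, 1, 0, 0], point=2)
flipped = mutate_hostmap([1, 0, 1], rng, rate=1.0)
```

The second splice shows why the check is needed: two valid parents can combine into a hostmap with no hosts at all.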

The fitness functions include minimising the makespan, minimising idle time, and minimising the time by which any $te_j$ exceeds its deadline.
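The three quantities could be computed from a timed schedule along the following lines. The precise definitions are assumptions made for this sketch: idle time is taken as unused host time up to the makespan, and over-deadline time as the sum of the overruns of all tasks that have a deadline. How TITAN weights or combines these measures is not specified here, so the function simply returns all three.

```python
from typing import Dict, FrozenSet, Tuple


def schedule_metrics(
    times: Dict[str, Tuple[float, float]],
    hostmap: Dict[str, FrozenSet[int]],
    deadlines: Dict[str, float],
    num_hosts: int,
    now: float = 0.0,
) -> Tuple[float, float, float]:
    """Return (makespan, idle time, over-deadline time) for a timed schedule."""
    makespan = max(end for _, end in times.values()) - now
    # Host time actually spent running tasks.
    busy = sum(
        (end - start) * len(hostmap[task]) for task, (start, end) in times.items()
    )
    idle = makespan * num_hosts - busy
    # Total overrun of tasks that finish after their deadline.
    overdue = sum(
        max(0.0, end - deadlines[task])
        for task, (_, end) in times.items()
        if task in deadlines
    )
    return makespan, idle, overdue


# Hypothetical timed schedule on three hosts.
times = {"T1": (0.0, 2.0), "T2": (0.0, 3.0), "T3": (3.0, 4.0)}
hostmap = {
    "T1": frozenset({0, 1}),
    "T2": frozenset({2}),
    "T3": frozenset({1, 2}),
}
deadlines = {"T1": 5.0, "T3": 3.5}
makespan, idle, overdue = schedule_metrics(times, hostmap, deadlines, num_hosts=3)
```

Here the three hosts provide 12 units of host time over a makespan of 4, of which 9 are used, and T3 overruns its deadline by half a unit.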

Because it is based on a continually iterating genetic algorithm, TITAN’s workflow manager copes well with change. Owing to the schedule representation, the addition or removal of tasks does not invalidate a schedule, and the genetic algorithm quickly adapts the schedule to remove any idle gaps that emerge in such an event. Similarly, the genetic algorithm can adapt to imprecise performance estimates, for example when a task completes earlier or later than expected or when, for some reason, the performance estimate changes.

By the same token, a schedule can adapt to news from a performance monitoring tool that hosts have been added to or removed from a cluster. Added hosts simply appear as a new empty host in each of the schedules $S_k$. Failed hosts are likewise removed from the schedule, and each $\beta_j$ has the failed hosts removed from its hostmap. Any tasks that were running on the hosts when they failed are simply resubmitted for scheduling.
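Host failure can be sketched as a small update to the host sets. The function `remove_hosts` and its return shape are hypothetical; in particular, the sketch assumes that a host set left empty by the update is dealt with by whatever later check rejects empty allocations.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def remove_hosts(
    hostmap: Dict[str, FrozenSet[int]],
    failed: FrozenSet[int],
    running: Set[str],
) -> Tuple[Dict[str, FrozenSet[int]], List[str]]:
    """Strip failed hosts from every host set and list tasks to resubmit."""
    updated = {task: hosts - failed for task, hosts in hostmap.items()}
    # Running tasks that were using a failed host must be scheduled again.
    resubmit = sorted(task for task in running if hostmap[task] & failed)
    return updated, resubmit


hostmap = {
    "T1": frozenset({0, 1}),
    "T2": frozenset({1, 2}),
    "T3": frozenset({2}),
}
updated, resubmit = remove_hosts(hostmap, failed=frozenset({1}), running={"T1", "T3"})
```

In the example host 1 fails while T1 and T3 are running: T1 was using host 1 and is resubmitted, T3 was not and carries on, and the queued task T2 simply loses host 1 from its host set.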

Subsequent work on TITAN added the ability to schedule workflows by allowing dependencies to be considered across a process flow [SCJ+04]. This can be accommodated by adding an extra restriction to $ts_j$: a task cannot start before all of the tasks it depends on have completed.
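The extra restriction on $ts_j$ can be illustrated with a variant of the start-time calculation. This sketch assumes fixed predicted runtimes, a dictionary `deps` from each task to its prerequisites, and a task order that places prerequisites first; an order that violates this is simply rejected, since the way TITAN treats such orders is not covered here. All names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def evaluate_with_dependencies(
    order: List[str],
    hostmap: Dict[str, FrozenSet[int]],
    runtime: Dict[str, float],
    deps: Dict[str, List[str]],
    num_hosts: int,
    now: float = 0.0,
) -> Dict[str, Tuple[float, float]]:
    """Return {task: (start, end)}, honouring hosts and dependencies."""
    release = [now] * num_hosts
    times: Dict[str, Tuple[float, float]] = {}
    for task in order:
        ready = now
        for parent in deps.get(task, []):
            if parent not in times:
                raise ValueError(f"{task} is ordered before its prerequisite {parent}")
            # The task cannot start before this prerequisite has ended.
            ready = max(ready, times[parent][1])
        hosts_free = max(release[i] for i in hostmap[task])
        start = max(ready, hosts_free)
        end = start + runtime[task]
        for i in hostmap[task]:
            release[i] = end
        times[task] = (start, end)
    return times


hostmap = {"A": frozenset({0}), "B": frozenset({1}), "C": frozenset({1})}
runtime = {"A": 2.0, "B": 1.0, "C": 1.5}
deps = {"B": ["A"]}
times = evaluate_with_dependencies(["A", "B", "C"], hostmap, runtime, deps, num_hosts=2)
```

Although host 1 is free from the outset, B is held back until A ends at time 2, and C then queues behind B on the same host; this is the kind of idle gap that reordering tasks can remove.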

Scheduling entire workflows differs from scheduling individual tasks in that the tasks within a workflow will usually have dependencies that introduce a certain amount of serialisation between tasks in the workflow. When multiple workflows occur in a schedule, and their descriptions request the scheduler to allocate as many CPUs as possible to each task, this can lead to large amounts of idle time in the schedule. A scheduler that is aware of the scalability properties of an application has substantial opportunities to optimise these kinds of inefficient schedules. Fig. 2.5(a) and Fig. 2.5(b) show a simple example of how the scheduler can optimise a schedule by allocating individual subtasks to fewer or more CPUs as appropriate, and by interleaving the subtasks from different workflows. Furthermore, interleaving tasks is an optimisation that mutating a task order is likely to find. Fig. 2.5(a) shows the time composition of three workflows as submitted to the system prior to scheduling. Fig. 2.5(b) shows the same three workflows after scheduling by the resource management system. It can be seen that the makespan of all the tasks has been reduced and the utilisation of resources has increased significantly, i.e., the idle white-space in the schedule diagram is reduced.