Job Dispatching in HPC: a CP Approach - Power-Aware Job Dispatching in High Performance Computi

In its current state, the PBS system works mostly as an online heuristic, and therefore risks making poor resource assignments because it lacks an overall plan. Moreover, the hard-coded mapping constraints, designed to ensure low waiting times for specific job classes (e.g. the debug queue), may easily cause resource under-utilization and long waiting times for the remaining jobs (e.g. those in the longpar queue). A proactive dispatching approach should intuitively be able to improve resource utilization and reduce waiting times without the need to devise such hard-coded restrictions. The task of obtaining a proactive dispatching plan on a supercomputer can be naturally framed as a resource allocation and scheduling problem, for which CP has a long track record of success stories. However, to our knowledge this is the first attempt to frame the HPC dispatching problem within the CP paradigm.

3.4.1 Rolling Horizon

We adopt a rolling horizon approach, in which our scheduler is awakened whenever a job 1) enters the system or 2) ends its execution. At each iteration, we build a full schedule and mapping for all the jobs in the input queues, taking into account resource capacity limitations. We consider different performance metrics, which we treat either as objective functions or as soft constraints. Then we dispatch only those jobs that are scheduled for immediate execution.

The schedule is computed based on the worst-case durations (as provided by the users), but the dispatcher is reactivated by the actual job terminations (and, of course, by job arrivals). Whenever this occurs, the jobs currently in execution cannot be migrated, but all the waiting ones can be rescheduled to take advantage of the released resources.

The scheduler has to react to runtime events, such as new job submissions and job terminations, and compute a new scheduling solution starting from the supercomputer's current state. Since the system is non-preemptive, building the problem instance requires retrieving all the jobs running on the supercomputer, fixing their actual start times and the nodes they use, and then adding the jobs waiting in the queues.
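The rolling-horizon loop described above can be sketched as follows. This is only an illustration under simplifying assumptions: the system is reduced to a single pool of identical nodes, and the hypothetical `plan` function is a greedy first-fit stand-in for the CP model, not the model itself. The names `Job`, `plan` and `simulate` are introduced here for the example; every job is assumed to fit in the pool and to finish within its declared maximum duration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    arrival: int
    nodes: int
    max_duration: int   # worst-case duration declared by the user
    real_duration: int  # only observed when the job terminates

def plan(t, running, waiting, capacity):
    """Stand-in planner (greedy, NOT the CP model): {job name: planned start}.
    Plans with worst-case durations, as the dispatcher does."""
    busy = [(s, s + j.max_duration, j.nodes) for j, s in running]
    starts = {}
    for job in waiting:
        candidates = sorted({t} | {e for _, e, _ in busy if e > t})
        for s in candidates:
            e = s + job.max_duration
            # Usage is piecewise constant: check s and every later interval start.
            points = [s] + [b for b, _, _ in busy if s < b < e]
            if all(sum(n for b, f, n in busy if b <= p < f) + job.nodes <= capacity
                   for p in points):
                starts[job.name] = s
                busy.append((s, e, job.nodes))
                break
    return starts

def simulate(jobs, capacity):
    """Wake up at every arrival or actual termination; return actual starts."""
    pending = sorted(jobs, key=lambda j: j.arrival)
    waiting, running, started = [], [], {}
    while pending or waiting or running:
        events = [j.arrival for j in pending[:1]]
        events += [s + j.real_duration for j, s in running]
        t = min(events)
        # Running jobs are never migrated; terminated ones release their nodes.
        running = [(j, s) for j, s in running if s + j.real_duration > t]
        while pending and pending[0].arrival <= t:
            waiting.append(pending.pop(0))
        starts = plan(t, running, waiting, capacity)
        for job in list(waiting):
            if starts.get(job.name) == t:  # dispatch only immediate starts
                waiting.remove(job)
                running.append((job, t))
                started[job.name] = t
    return started

jobs = [Job("a", 0, 3, 10, 4), Job("b", 1, 2, 5, 5), Job("c", 2, 1, 3, 3)]
print(simulate(jobs, capacity=4))
```

In this toy run, job `b` is first planned at time 10 (the worst-case end of `a`), but it is actually dispatched at time 4, when `a` terminates early and the plan is recomputed.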

3.4.2 Formal Problem Definition

We can now provide a precise definition of the scheduling problem solved at each activation of the dispatcher. Each job $i$ enters the system at a certain arrival time $eqt_i$, by being submitted to a specific queue (depending on the user choices and on the job characteristics). By analyzing existing execution traces coming from PBS, we have determined an estimated waiting time for each queue, which applies to each job it contains: we refer to this value as $ewt_i$.

When submitting the job, the user has to specify several pieces of information, including the maximum allowed execution time $D_i$, the maximum number of nodes to be used $rn_i$, and the required resources (cores, memory, GPUs, MICs). By convention, the PBS system considers each job as if it were divided into a set of exactly $rn_i$ identical “job units”, each to be mapped on a single node. It is therefore convenient to specify the resource requirements on a job-unit basis. Job-units belonging to the same job can (but need not) be mapped on different nodes, and they must have the same start time (it is also assumed that they share the same duration).

Formally, let $R$ be a set of indexes corresponding to the resource types (cores, memory, GPUs, MICs), and let the capacity of a node $k$ for resource $r \in R$ be denoted as $cap_{k,r}$. We recall that the system has $m = 64$ nodes, each with 16 cores and 16 GB of RAM; 32 nodes have 2 GPUs each (and 0 MICs), and the remaining 32 nodes have 2 MICs each (and 0 GPUs). Finally, let $rq_{i,r}$ be the requirement of a unit of job $i$ for resource $r$.

The dispatching problem at time $t$ consists in assigning a start time $st_i \geq t$ to each waiting job $i$ and a node to each of its units. All the resource capacity limits should be respected, taking into account the presence of jobs already in execution. Once the problem is solved, only the jobs having $st_i = t$ are actually dispatched.


Informally speaking, the overall goal is to increase the resource utilization and reduce the waiting times, but those metrics can be meaningfully evaluated only once the actual job durations become known. Hence we formulate the problem in terms of several objective functions that are intuitively correlated with the metrics we are interested in. After extensive preliminary experimentation, we settled on the following possible problem objectives:

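$$\max_{i=0..n-1} (st_i + D_i) \quad \text{(makespan)} \tag{3.1}$$

$$\sum_{i=0..n-1} \max\left(0, \frac{st_i - eqt_i - ewt_i}{ewt_i}\right) \quad \text{(weighted tardiness)} \tag{3.2}$$

$$\sum_{i=0..n-1} [[\, st_i - eqt_i > ewt_i \,]] \quad \text{(num. of late jobs)} \tag{3.3}$$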

where $n$ is the number of jobs and the notation $[[\cdot]]$ stands for the reification of the constraint between brackets. The makespan has been chosen because compressing the schedule length tends to increase the resource utilization. For the tardiness and the number of late jobs, we consider a job to be late if it stays queued for longer than $ewt_i$. The tardiness is weighted because we assume that users who already expect to wait longer (i.e. those whose jobs have a higher $ewt_i$) should adjust better to prolonged queue times. Both tardiness-based objectives are chosen to improve the perceived response time, in the first case by avoiding (proportionally) long waiting times, in the second by reducing the number of jobs in the queues.
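As a small worked example, the three objectives (3.1)-(3.3) can be evaluated on a given vector of start times as sketched below. The function names and the sample values are assumptions made for illustration; they do not come from the original traces.

```python
def makespan(st, D):
    """Eq. (3.1): latest worst-case completion time."""
    return max(s + d for s, d in zip(st, D))

def weighted_tardiness(st, eqt, ewt):
    """Eq. (3.2): queue time beyond ewt_i, normalised by ewt_i."""
    return sum(max(0.0, (s - q - w) / w) for s, q, w in zip(st, eqt, ewt))

def num_late_jobs(st, eqt, ewt):
    """Eq. (3.3): number of jobs queued for longer than ewt_i."""
    return sum(1 for s, q, w in zip(st, eqt, ewt) if s - q > w)

# Hypothetical instance with three jobs (times in seconds).
st  = [0, 600, 1800]    # start times st_i
D   = [600, 1200, 300]  # maximum durations D_i
eqt = [0, 0, 0]         # arrival times eqt_i
ewt = [300, 300, 600]   # expected waiting times ewt_i

print(makespan(st, D))                   # 2100
print(weighted_tardiness(st, eqt, ewt))  # 3.0
print(num_late_jobs(st, eqt, ewt))       # 2
```

The second job waits 600 s against an expected 300 s and contributes 1.0 to the weighted tardiness; the third waits 1800 s against an expected 600 s and contributes 2.0.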

3.4.3 Model Definition

For the dispatching problem described above, we defined a CP model based on Conditional Interval Variables (CVI, see [LR08]). A CVI $\tau$ represents an interval of time: the start of the interval is referred to as $s(\tau)$ and its end as $e(\tau)$; the duration is $d(\tau)$. The interval may or may not be present, depending on the value of its existence expression $x(\tau)$. In particular, if $x(\tau) = 0$ the interval is not present and does not affect the model: for this situation we also use the notation $\tau = \bot$.

CVIs can be subject to a number of constraints, including the classical cumulative constraint [BLLN06] to model finite capacity resources, and the more specific alternative constraint [LR08]. This last global constraint has the following signature:

$$\text{alternative}(\tau_0, [\tau_1, \ldots, \tau_{n_\tau}], m_\tau) \tag{3.4}$$

The constraint forces all the interval variables $\tau_1, \tau_2, \ldots$ to have the same start and end time as $\tau_0$. Moreover, exactly $m_\tau$ of $\tau_1, \tau_2, \ldots$ will actually be present if $\tau_0$ is present. Formally, the constraint enforces:

$$s(\tau_0) = s(\tau_i), \quad e(\tau_0) = e(\tau_i) \quad \forall i = 1..n_\tau$$

$$\sum_{i=1}^{n_\tau} x(\tau_i) = m_\tau \, x(\tau_0) \tag{3.5}$$
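The semantics of the alternative constraint can be made concrete with a small checker. The sketch below is not a solver and not the CP Optimizer API: it only tests whether a given assignment satisfies (3.5). An absent interval ($\bot$) is represented by `None`, and the start/end equalities are applied to present intervals only, which is an interpretation, since an absent interval has no start or end. The name `alternative_holds` is introduced here for illustration.

```python
from typing import Optional, Sequence, Tuple

# An interval is a (start, end) pair, or None when it is absent.
Interval = Optional[Tuple[int, int]]

def alternative_holds(tau0: Interval, options: Sequence[Interval], m: int) -> bool:
    """Check alternative(tau0, options, m) on a fixed assignment."""
    present = [o for o in options if o is not None]
    if tau0 is None:
        # x(tau0) = 0 forces the sum of x(tau_i) to be m * 0 = 0.
        return not present
    # Exactly m options are present and each is synchronised with tau0.
    return len(present) == m and all(o == tau0 for o in present)

print(alternative_holds((0, 10), [(0, 10), None, (0, 10)], 2))  # True
print(alternative_holds((0, 10), [(0, 10), None, None], 2))     # False
print(alternative_holds((0, 10), [(0, 10), (5, 15)], 2))        # False
print(alternative_holds(None, [None, None], 2))                 # True
```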

3.4.3.1 Modeling Decisions and Constraints

In our model, we use CVIs to model the scheduling decisions. In particular, we introduce an interval variable $\tau_i$ with duration $D_i$ for each job waiting in the input queues or already in execution. Then, we fix the start of all $\tau_i$ corresponding to running jobs to their real value (which is known at this point). For the waiting jobs we have $s(\tau_i) \in t..eoh$, where $t$ is the time instant for which the model is built and $eoh$ can be given, for example, by $t$ plus the sum of the maximum durations of all jobs.¹ All the $\tau_i$ variables are mandatory, i.e. $x(\tau_i) = 1$.

Mapping decisions should be taken at the level of single job-units. The modeling style we adopt for them is best explained by temporarily introducing a simplifying assumption, namely that no two units of the same job can be mapped on a single node. With this assumption, the mapping decisions can be modeled by introducing a second set of optional interval variables $\upsilon_{i,k}$ such that $x(\upsilon_{i,k}) = 1$ if a unit of job $i$ is mapped to node $k$.

However, mapping multiple units of the same job on the same node is possible and can be beneficial. To account for this possibility, we have to introduce for each job $i$ multiple sets of $\upsilon$ variables. Specifically, we add one more index and keep the same semantics, so that we have variables $\upsilon_{i,j,k}$ such that $x(\upsilon_{i,j,k}) = 1$ if a unit of job $i$ is mapped to node $k$. The $j$ index is only used to control the number of job units that can be mapped to the same node. Finding a suitable range for the index is a critical step: on the one hand, allowing $j$ to range over $0..rn_i - 1$ (i.e. one set of $\upsilon$ variables for each requested node) is a safe choice.

On the other hand, it is impossible to map multiple units of the same job on the same node if doing so would exceed the availability of some resource. Hence, a valid upper bound on the number of $\upsilon$ variable sets for a single job $i$ is given by:

$$p_i = \min\left(rn_i, \min_{r \in R} \frac{cap_{k,r}}{rq_{i,r}}\right) \tag{3.6}$$

and for each job $i$, the index $j$ can range in $0..p_i - 1$. Then we have to specify that exactly $rn_i$ job-units should be mapped, i.e. that exactly that number of $\upsilon_{i,j,k}$ intervals should be present. This can be done by using an alternative constraint:

$$\text{alternative}(\tau_i, [\upsilon_{i,j,k}], rn_i) \quad \forall i = 0..n-1 \tag{3.7}$$

Additionally, the alternative constraint forces all the job-units to start at the same time instant as $\tau_i$. Now, the resource capacity restrictions can be modeled via a set of cumulative constraints:

$$\text{cumulative}\left([\upsilon_{i,j,k}], [D_i^{(p_i)}], [rq_{i,r}^{(p_i)}], cap_{k,r}\right) \quad \forall k = 0..m-1, \ \forall r \in R \tag{3.8}$$

where $m$ is the number of nodes and the notation $D_i^{(p_i)}$ stands for a vector containing $D_0$ repeated $p_0$ times, then $D_1$ repeated $p_1$ times, and so on. We disregard all the hard-coded constraints introduced by the PBS administrator and we trust the decision-making capabilities of our optimization system to provide waiting times as low as possible.
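To see why the bound of Equation (3.6) matters, the sketch below computes $p_i$ for the five jobs of Table 3.2 and compares the resulting number of optional $\upsilon$ intervals with the safe choice $j \in 0..rn_i - 1$. Several details are assumptions made for this example and are not stated in the text: the quotient is rounded down, resources with zero requirement are skipped, node capacities are taken as the best case over the two node types (so the bound does not depend on $k$), and 16 GB is converted to KB with binary units.

```python
MEM_KB = 16 * 1024 ** 2  # assumed conversion of 16 GB to KB
# Best-case capacity per resource over all node types (assumption).
NODE_CAP = {"core": 16, "gpu": 2, "mic": 2, "mem": MEM_KB}
M = 64  # number of nodes

def max_units_per_node(rn, rq, cap=NODE_CAP):
    """p_i = min(rn_i, min_r floor(cap_r / rq_{i,r})), skipping rq_{i,r} = 0."""
    bounds = [cap[r] // need for r, need in rq.items() if need > 0]
    return min([rn] + bounds)

# Per-unit requirements taken from Table 3.2: name -> (rn_i, rq_i).
JOBS = {
    "000": (32, {"core": 4,  "gpu": 1, "mic": 0, "mem": 1000}),
    "001": (1,  {"core": 14, "gpu": 1, "mic": 0, "mem": 400}),
    "002": (2,  {"core": 4,  "gpu": 1, "mic": 0, "mem": 400}),
    "003": (32, {"core": 16, "gpu": 0, "mic": 0, "mem": 400}),
    "004": (32, {"core": 3,  "gpu": 0, "mic": 2, "mem": 800}),
}

p = {name: max_units_per_node(rn, rq) for name, (rn, rq) in JOBS.items()}
bounded = sum(p[name] * M for name in JOBS)      # j in 0..p_i - 1
naive = sum(rn * M for rn, _ in JOBS.values())   # j in 0..rn_i - 1

print(p)               # {'000': 2, '001': 1, '002': 2, '003': 1, '004': 1}
print(bounded, naive)  # 448 6336
```

Under these assumptions, job 000 needs only two sets of $\upsilon$ variables (the GPU count limits it to two units per node) instead of 32, which agrees with the two $j$ indices shown in Table 3.3.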

¹ Note that it is possible to shift all the domains by subtracting the smallest $st_i$ to all


3.4.3.2 Handling the Objective Function

We consider several variants of our dispatching problem, which differ from one another in the objective considered and in the possible presence of soft constraints. First, we have three “pure” models, obtained by adding on top of the presented formulation one of the problem objectives discussed in Section 3.4.2:

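$$\min \max_{i=0..n-1} e(\tau_i) \quad \text{(makespan)} \tag{3.9}$$

$$\min \sum_{i=0..n-1} \max\left(0, \frac{s(\tau_i) - eqt_i - ewt_i}{ewt_i}\right) \quad \text{(weighted tardiness)} \tag{3.10}$$

$$\min \sum_{i=0..n-1} [[\, s(\tau_i) - eqt_i - ewt_i > 0 \,]] \quad \text{(num. of late jobs)} \tag{3.11}$$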

Then we consider three “composite” formulations obtained by choosing as a main cost function one of Equations (3.9)-(3.11), and then by posting a constraint on the value of the remaining ones. For example, assuming the makespan is the main objective, we get:

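$$\min \max_{i=0..n-1} e(\tau_i) \tag{3.12}$$

$$\text{s.t.} \quad \sum_{i=0..n-1} \max\left(0, \frac{s(\tau_i) - eqt_i - ewt_i}{ewt_i}\right) \leq \delta_0 \theta_0 \tag{3.13}$$

$$\sum_{i=0..n-1} [[\, s(\tau_i) - eqt_i - ewt_i > 0 \,]] \leq \delta_1 \theta_1 \tag{3.14}$$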

The values $\theta_0$ and $\theta_1$ are obtained by solving the pure models corresponding to the constrained functions. The parameters $\delta_0$, $\delta_1$ make it possible to tune the tightness of the constraints. The three new composite formulations are loosely inspired by multi-objective optimization approaches and aim at obtaining good solutions according to one global metric (say, resource utilization), while keeping acceptable levels for the others (say, waiting times).
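The two-step procedure behind the composite formulation (3.12)-(3.14) can be illustrated on a toy instance. In the sketch below, exhaustive enumeration over a small time grid replaces the CP solver, so it only shows how $\theta_0$, $\theta_1$, $\delta_0$ and $\delta_1$ interact. The instance (a single resource of capacity 2, three jobs), the values of $\delta_0$ and $\delta_1$, and all function names are assumptions made for the example.

```python
from itertools import product

# Hypothetical toy instance: one resource of capacity 2, three jobs.
D   = [3, 2, 2]  # durations D_i
REQ = [2, 1, 1]  # resource demand of each job
EQT = [0, 0, 0]  # arrival times eqt_i
EWT = [1, 2, 4]  # expected waiting times ewt_i
CAP, HORIZON = 2, 8

def feasible(st):
    """Capacity check at every integer time point."""
    return all(
        sum(r for s, d, r in zip(st, D, REQ) if s <= u < s + d) <= CAP
        for u in range(HORIZON + max(D))
    )

def makespan(st):
    return max(s + d for s, d in zip(st, D))

def tardiness(st):
    return sum(max(0.0, (s - q - w) / w) for s, q, w in zip(st, EQT, EWT))

def late(st):
    return sum(1 for s, q, w in zip(st, EQT, EWT) if s - q - w > 0)

# Brute force stands in for the solver: all feasible start-time vectors.
SCHEDULES = [st for st in product(range(HORIZON), repeat=len(D)) if feasible(st)]

def solve(objective, constraints=()):
    """Minimise `objective` subject to f(st) <= bound for each (f, bound)."""
    ok = [st for st in SCHEDULES if all(f(st) <= b for f, b in constraints)]
    return min(ok, key=objective)

# Step 1: solve the pure models of the functions that will be constrained.
theta0 = tardiness(solve(tardiness))
theta1 = late(solve(late))

# Step 2: minimise the makespan with the other two metrics bounded.
delta0, delta1 = 1.5, 2.0
best = solve(makespan, [(tardiness, delta0 * theta0), (late, delta1 * theta1)])
print(best, makespan(best), theta0, theta1)  # (0, 3, 3) 5 0.5 1
```

In this instance the schedule `(2, 0, 0)` also reaches the optimal makespan of 5, but its weighted tardiness is 1.0, so constraint (3.13) with $\delta_0 = 1.5$ excludes it. Note that when a pure optimum $\theta$ is 0, the bound $\delta\theta$ is 0 whatever the value of $\delta$.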

3.4.3.3 Example of a solution

Let us suppose we have the set of waiting jobs described in Table 3.2; a feasible solution to this instance is then described in Table 3.3. As the requirements in Table 3.2 show, jobs 000, 001 and 002 can execute only on the nodes equipped with GPUs (i.e. nodes 0 to 31), while job 004 can execute only on the nodes with MICs (i.e. nodes 32 to 63). In the solution, two units of job 000 are allocated on node 1 and its other 30 units on nodes 2 to 31; node 0 is therefore completely free and can run job 001 while job 000 is executing. Job 003 executes on nodes 32 to 63. After the termination of job 001, job 002 can start its execution with two units on node 0, and after the termination of job 003, job 004 can start on nodes 32 to 63.

i     rn_i   rq_{i,core}   rq_{i,gpu}   rq_{i,mic}   rq_{i,mem} (KB)   D_i (seconds)
000   32     4             1            0            1000              14000
001   1      14            1            0            400               600
002   2      4             1            0            400               14400
003   32     16            0            0            400               800
004   32     3             0            2            800               400

Table 3.2: An example of problem instance

i     s(τ_i)   υ_{i,0,0}   υ_{i,0,1}   υ_{i,0,2..31}   υ_{i,0,32..63}   υ_{i,1,0}   υ_{i,1,1}   υ_{i,1,2..31}   υ_{i,1,32..63}
000   0        ⊥           0           0               ⊥                ⊥           0           ⊥               ⊥
001   0        1           ⊥           ⊥               ⊥                ⊥           ⊥           ⊥               ⊥
002   600      600         ⊥           ⊥               ⊥                600         ⊥           ⊥               ⊥
003   0        ⊥           ⊥           ⊥               0                ⊥           ⊥           ⊥               ⊥
004   800      ⊥           ⊥           ⊥               800              ⊥           ⊥           ⊥               ⊥

Table 3.3: A feasible solution for the instance from Table 3.2
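The feasibility of this solution can be checked mechanically. The sketch below encodes the instance of Table 3.2 and the mapping of Table 3.3, then verifies that every job has exactly $rn_i$ mapped units and that no node capacity is exceeded. Some details are assumptions for the example: start times are read from the $s(\tau_i)$ column, 16 GB is converted to KB with binary units, and the names `node_cap` and `check` are introduced here. Since resource usage only increases when a job starts, it is enough to test the capacities at the start times.

```python
MEM_KB = 16 * 1024 ** 2  # assumed conversion of 16 GB to KB
M = 64

def node_cap(k):
    """Nodes 0..31 have 2 GPUs, nodes 32..63 have 2 MICs."""
    return {"core": 16, "gpu": 2 if k < 32 else 0,
            "mic": 0 if k < 32 else 2, "mem": MEM_KB}

# Table 3.2: name -> (rn_i, per-unit requirements, D_i in seconds)
JOBS = {
    "000": (32, {"core": 4,  "gpu": 1, "mic": 0, "mem": 1000}, 14000),
    "001": (1,  {"core": 14, "gpu": 1, "mic": 0, "mem": 400},  600),
    "002": (2,  {"core": 4,  "gpu": 1, "mic": 0, "mem": 400},  14400),
    "003": (32, {"core": 16, "gpu": 0, "mic": 0, "mem": 400},  800),
    "004": (32, {"core": 3,  "gpu": 0, "mic": 2, "mem": 800},  400),
}
# Table 3.3: start times and number of units of each job on each node.
START = {"000": 0, "001": 0, "002": 600, "003": 0, "004": 800}
UNITS = {
    "000": {1: 2, **{k: 1 for k in range(2, 32)}},
    "001": {0: 1},
    "002": {0: 2},
    "003": {k: 1 for k in range(32, 64)},
    "004": {k: 1 for k in range(32, 64)},
}

def check(jobs, start, units):
    """True if all units are mapped and no node capacity is exceeded."""
    for name, (rn, _, _) in jobs.items():
        if sum(units[name].values()) != rn:
            return False
    for t in set(start.values()):  # usage can only increase at start times
        active = [n for n in jobs if start[n] <= t < start[n] + jobs[n][2]]
        for k in range(M):
            cap = node_cap(k)
            for r in cap:
                used = sum(units[n].get(k, 0) * jobs[n][1][r] for n in active)
                if used > cap[r]:
                    return False
    return True

print(check(JOBS, START, UNITS))                  # True
print(check(JOBS, {**START, "002": 0}, UNITS))    # False
print(max(START[n] + JOBS[n][2] for n in JOBS))   # 15000
```

The second call moves job 002 to time 0, where it would share node 0 with job 001 and require 22 cores and 3 GPUs, so the check fails. The last line gives the worst-case makespan of the solution, determined by job 002.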

In document Power-Aware Job Dispatching in High Performance Computing Systems (Page 77-82)