Distributed Load Balancing for Machines Fully Heterogeneous

(1)

Internship Report

2

^nd

of June - 22

^th

of August 2014

Distributed Load Balancing for Machines Fully Heterogeneous

Nathana¨el Cheriere

[email protected] ENS Rennes

Academic Year 2013-2014 Département Informatique et Télécommunications

http://perso.eleves.ens-rennes.

fr/~ncherier/

Supervisor Erik Saule [email protected] Department of Computer Science University of North Carolina Charlotte http://webpages.uncc.edu/~esaule/

(2)

Distributed Load Balancing for Machines Fully Heterogeneous

Nathana¨el Cheriere ENS Rennes, France, [email protected]

http://perso.eleves.ens-rennes.fr/~ncherier/

Abstract. With the increasing size of parallel systems, having a centralized algorithm to distribute the jobs to be executed can have a huge cost on the system; this is why many researchs are done in order to have decentralized scheduling algorithms. The objective of this work is to propose a decentralized solution to balance jobs on unrelated machines. We propose two distributed algorithms able to balance the load on unrelated machines with proven guarantees for some situations. The first one ex- ploits a low heterogeneity of the jobs while the second one is focused on the case of two clusters of identical machines.

Keywords: Decentralized Algorithms, Approximation Algorithms, Machine Heterogeneity, Unrelated Scheduling Problem, Work Stealing, Load-Balancing

1 Introduction

Scheduling the execution of a parallel program on multiple machines is one of the basic problems in parallel computing. Its solution is far from obvious as many hypothesis can be done to specify the conditions, especially on the heterogeneity of the machines used to execute the program. These problems have been studied since the ’60s, but often from a centralized point of view. However, with the current size of parallel systems, the cost of the resolution of the problem on one machine cannot be overlooked any more [9].

A solution to decrease the overload induced by the scheduling would be to have every machine participate in the distribution of the work. The work stealing strategy, used in Cilk [3] follows this idea: every machine is responsible for its own charge, and when there is no longer a job to execute, the machine tries to steal some work from another machine.

However, this strategy cannot be directly applied in systems with high heterogeneity in their machine set because the initial distribution (which specifies the jobs that must be finished before stealing) is important. The execution of a first algorithm to balance the load between the machines removes the problem- atic cases which can disrupt a work stealing strategy.

This is why this work is focused on the problem of decentralized load- balancing for heterogeneous machines sets. It could provide a suitable initial

(3)

distribution which can then be used by a work stealing algorithm or be used at regular intervals. We propose centralized and decentralized algorithms with proven properties operating under practically relevant conditions. The first one is specialized for systems with high machine heterogeneity while having a low job heterogeneity while the second one has been created to solve the problem for two different clusters each containing only identical machines, typically a GPU-accelerated cluster.

The problem is formally defined in Section 2. In Section 3 we present the results we extend in this paper and the model is detailed in Section 4. We develop an algorithm to solve the case of high machine heterogeneity and low job heterogeneity and prove an approximation ratio in Section 5. Then, in Section 6, we limit the heterogeneity of the machines to two clusters and provide an algorithm to load balance the machines and prove an approximation ratio in case of convergence, then experimentally study the convergence.

2 Problem

The problem studied in this work is a load balancing problem, the goal is to construct a partition S of a set of jobs J onto the machines in the set M in order to minimize an objective function.

Many functions can be minimized but we focus on the makespan Cmax(S) which is Cmax(S) = maxi∈MC(S, i) where C(S, i) =P

j∈S(i)pi,jis the makespan of the machine i, p_i,jis the time needed for the execution of job j on the machine i, and S(i) is the set of jobs assigned to the machine i.

In the following parts, we will denote C(S, i) as C(i) when there is no am- biguity on the solution used to compute the makespan. In this problem, the partition obtained is usually compared to the minimum possible makespan OPT which characterizes the optimal solutions, OPT = minSCmax(S).

The homogeneity of the set of machines is the main hypothesis which is usually done on the set of machines used for a parallel execution. The machines can be uniform, related, or unrelated.

– The machines can be homogeneous. In this case, they are said to be uniform, each job j can be executed on every machine i with the exact same cost (usually the same duration), ∀ i, i⁰∈ M, pi,j = pi⁰,j.

– In the related case, all the machines are different but only differ by a fixed factor. For all the jobs we have ∀ i, i⁰∈ M, ∃ α, ∀ j ∈ J, pi,j = αpi⁰,j. – The unrelated case is the most general with a lot of heterogeneity in the

machine set; Every machine has a fixed cost p_i,j for each job (which can be infinite) and there are no relations restricting this cost.

The problem developed in the following parts is the load balancing of unrelated machines with objective to minimize C_max which is denoted as R||C_max by Graham et al. in [8]. Because the problem is NP -Complete [6], we develop two approximation algorithms for the problem.

(4)

3 Related Work

Scheduling problems have interested many researchers since the 60’: in 1969 Graham [7] showed that the scheduling problem on uniform machines can be approximated with in a factor 2 (there exists an algorithm that provides a solution S such that Cmax(S) ≤ 2 ∗ OPT). The scheduling problem on unrelated machines which is studied in this work has been studied by Lawler and La- betoulle in [10]. They showed that the problem can be solved in polynomial time using a linear programming problem when there is the hypothesis of pre-emption (the possibility to pause a job on a machine and restart it on another machine).

The same problem without pre-emption has been approximated by a factor 2 by Lenstra et al. in [11] also using a linear programming problem. Moreover they showed that the problem cannot be approximated with a better approximation ratio than ³₂ unless P = NP .

In [5], Guochuan et al. propose a centralized online algorithm to distribute the new arriving jobs between two different clusters of identical machines, and proved this algorithm has an approximation ratio of 4.

Algorithm 1 Work Stealing for a particular machine Data: m machine

Data: S(m) jobs assigned to m while true do

if S(m) = ∅ then

Select randomly a target machine i if S(i) 6= ∅ then

Steal half of the jobs of i else

Start running a job j of S(m) Remove j of S(m)

In order to avoid an excessive work load on the scheduling machine, a decentralized strategy has been introduced in 1981 by Burton and Sleep in [4]:

the work stealing algorithm (Algorithm1). The goal of the algorithm is to keep every machine busy until there are no more jobs to run. In order to meet this objective, when a machine finishes its last locally available job, it contacts its neighbours and try to steal some of the non running jobs.

In 1994, Blumofe in [2,3] continued the idea of work stealing to apply it into the middleware Cilk, and proved some guarantees on the execution time of a batch of jobs using a work stealing algorithm on identical parallel machines. In particular, he proved that the expected maximum makespan is bounded by the average work per machine plus a big O of the critical path p^∞ of the problem, E(Cmax(S)) ≤P

j∈Jp_i,j/|M | + O(p^∞).

In 2002, Bender and Rabin [1] extended this result to the problem with related machines and the possibility to use pre-emption, and proved that the

(5)

completion time is bounded by E(Cmax(S)) ≤P

j∈Jp_1,j/(|M | ∗ π) + O(p^∞/π) where expected machine 1 is the fastest machine and π is the average speed of the machines, normalized by the speed of fastest machine.

4 A Priori Load Balancing

For the work stealing algorithm as developed by Blumofe in [2,3], little is known about the jobs and the machines and this is possible because all machines are the same. For the work stealing algorithm adapted by Bender and Rabin [1], the algorithm knows and uses the relative speed of both machines which is enough to characterize the differences between the machines. However, in the case of unrelated machines, machines are characterized by the cost of each job execution and this also defines each job, so we consider as known the costs pi,jof each job.

Applying a work stealing strategy on unrelated machines has a flaw. Indeed, this strategy starts stealing jobs from another machine only when the work previ- ously scheduled on the machine has been executed. But if the initial distribution is poorly done, the first steal can happen long after the optimal makespan.

Theorem 1. Applying a work stealing strategy on unrelated machines can induce an unbounded makespan.

Proof. The circled schedule presented in table1, presents a situation where the first steal can only happen after a time n, and so the execution can be finished in n + 1 units of time. However, with a good schedule the work can be finished in 2 units of time.

Table 1. Bad initial distribution Cost Machine A Machine B Machine C

Job 1 1 n n

Job 2 1 1 n

Job 3 n 1 1

Job 4 n 1 1

Job 5 n 1 1

The circled distribution is a bad initial distribution to apply a Work Stealing strategy(Algorithm1).

u t This is why a load balancing before executing the first jobs in an unrelated work stealing strategy seems necessary. However, making a centralized load balancing would not fit the work stealing algorithm as it is completely distributed.

(6)

Theorem 2. A generic algorithm balancing optimally each pair of machine optimally can induce an unbounded makespan.

Proof. The example developed in Fig.1 has a makespan of n and each pair of machine is optimally balanced. However, the optimal solution has a makespan of 1.

{2}

{3}

{1}

Optimal Optimal

Optimal A

B

C

Cost Machine A Machine B Machine C

Job 1 1 n n²

Job 2 n² 1 n

Job 3 n n² 1

– Makespan of the circled distribution: n – Optimal distribution for all pairs – Optimal makespan: 1

Fig. 1. Example of situation with 3 machines and 3 jobs with an unbounded makespan

u t There is little hope to find generic solutions, so we look at particular cases.

The following parts focus on providing bounded algorithms to solve the problem.

5 Load Balancing per Type of Job

In this section, the jobs are grouped by type. In each type, all the jobs have the same costs: ∀j, j⁰ ∈ J, j has the same type as j⁰ ⇒ ∀i ∈ M, p_i,j = p_i,j⁰. This distinction can easily be made in real systems where simple queries can represent most of the jobs of a system: even if the jobs are not exactly the same, their cost is similar and only vary depending on which machine executes them.

5.1 Balancing only one type of job

This section presents an algorithm the case where there is only one type of job, and the proof of optimality of this algorithm.

The One Job Type Balancing algorithm (Algorithm 3) is quite simple. It randomly chooses a target and balances the load of the two machines in an optimal way. Getting the optimal load balancing for two machines in this problem is ensured by Basic Greedy (Algorithm2) thanks to the fact that the jobs’ cost is defined by the machine only (Proof omitted since this is trivial).

Theorem 3. One Job Type Balancing (Algorithm 3) provides an optimal distribution S of the jobs.

C_max(S) = OPT (1)

(7)

Algorithm 2 Basic Greedy Data: machines m and i Data: S distribution of jobs A := S(i) ∪ S(m)

S(m) := ∅ S(i) := ∅ while J 6= ∅ do

let j be the first job in A if C(m) + pm,j≤ C(i) + pi,j

then

S(m) := S(m) ∪ {j}

else

S(i) := S(i) ∪ {j}

A := A\{j}

Algorithm 3 One Job Type Balancing Data: m host machine

Data: S initial distribution of the jobs

Result: Cmax(S) = OPT while true do

Select i ∈ M

Distribute optimally the load of i and m with the Basic Greedy

Proof. Let S(n) be the solution created after the n-th execution of the algorithm.

Note that Cmax(S(n + 1)) ≤ Cmax(S(n)); If not, the balancing done between the two machines at step n + 1 would not be optimal. Hence the function Cmax(S(n)) is decreasing.

Let S be the distribution created by the algorithm3after an infinite number of executions.

As all the jobs have the same costs we denote as p_i the cost of any job on the machine i.

We can now represent S(i) only by its cardinal which we denote as N (i), N (i) = |S(i)|. We also have C(i) = N (i) ∗ pi

Let S^∗ be an optimal distribution and N^∗(i) = |S^∗(i)|

By contradiction, let us assume that C_max(S) > C_max(S^∗)

There exists imax∈ M such that C(imax) = Cmax(S). In particular, N (imax) >

N^∗(imax), but this also implies that there exists i such that N (i) < N^∗(i) (be- causeP

i∈MN (i) = |J | =P

i∈MN^∗(i)).

So at least one job can be moved from imax to i to have a better local distribution (C_max would decrease).

Hence the distribution S is not optimal for the pair of machine i_max and i but as the algorithm3has been applied an infinite number of times and because Cmaxis decreasing, S should be optimal for imaxand i.

In conclusion we have Cmax(S) ≤ Cmax(S^∗) hence Cmax(S) = Cmax(S^∗).

u t

5.2 Extension to multiple types of jobs

The One Job Type Balancing algorithm can be extended into the Multiple Job Type Balancing algorithm to balance multiple types of jobs; however, the performance guarantee of the algorithm become linear with the number of types of jobs.

(8)

Algorithm 4 Multiple Job Type Balancing Data: m host machine of type Data: S initial distribution of the jobs Data: k number of types of jobs Result: Cmax(S) ≤ k ∗ OPT while true do

Select i ∈ M

foreach l: type of job do

Distribute the jobs of type l between i and m using Basic Greedy

Theorem 4. Multiple Job Type Balancing (Algorithm4), applied to k types of jobs, provides distribution S of the jobs which is a k-approximation.

Cmax(S) ≤ k ∗ OPT (2)

Proof. The theorem3is applied for each type of job. ut The approximation ratio is at most linear with the number of types of jobs k but we can also show that the minimal lower bound is at least ln(k) (Fig.2).

Cost Machine

1 2 ... m − 1 m

Job jm 1/m 1/m ... 1/m 1/m

jm−1 ∞ 1/(m − 1) ... 1/(m − 1) 1/(m − 1) ..

. ...

j2 ∞ ∞ ... 1/2 1/2

j1 ∞ ∞ ... ∞ 1

J is constructed with the rule

∀i ∈ [1, m], add i job ji

Number of job Machine per machine 1 2 ... m − 1 m Job jm 1 1 ... 1 1 jm−1 0 1 ... 1 1

..

. ...

j2 0 0 ... 1 1 j1 0 0 ... 0 1

If the order used by the algorithm4is [jm, jm−1, ..., j2, j1], the distribution is stable for all pairs of machines, and

Cmax=

m

X

k=1

1/k ≥ ln(k)

Fig. 2. Example of ln(k) schedule produced by the algorithm4

6 Load balancing with two types of machines

In this section, we limit the problem to two different clusters of identical machines M¹ and M² (∀i, i⁰ ∈ M¹(resp. M²), ∀j ∈ J, pi,j = pi⁰,j). This is meaningful in practice since the advent of GPU-accelerated clusters. We develop a new algorithm with a proven approximation ratio of 2 under the assumption that the

(9)

cost of any task is smaller than the optimal makespan for the problem, which is formally expressed as ∀i ∈ M, ∀j ∈ J, p_i,j ≤ OPT. This hypothesis is realistic in the sense that any machine is able to do almost everything another machine does but with a different speed, we suppose here that there is not a job which can only run on one cluster (and if this was the case, they could be assigned beforehand).

Moreover, this hypothesis also suggests that there is a large amount of jobs, which is the starting point for creating a decentralized algorithm.

To simplify the notations, in this section we will denote as pi,j the cost of job j on cluster i.

6.1 A centralized algorithm

We first focus on the centralized version of the problem and then use this centralized algorithm as a stepping stone to create a decentralized one that solves the problem.

This problem is a sub-problem of the scheduling problem of m unrelated machines which has already been studied by Lenstra et al. in 1990 [11]. They provide an algorithm with an approximation ratio of 2. However, the solution is provided by solving a linear programming problem first and this method seems difficult to decentralize. This is why we developed a new greedy algorithm to balance the load between two clusters.

The idea behind this new algorithm, called Centralized Load Balancing for two clusters of machines (Algorithm 5), is quite simple: we first sort the jobs according to the ratio p1,j/p2,j so that the jobs which are executed faster on the first cluster are at the beginning of the list and the one executed faster on the second cluster are at the end of the list.

Then we evaluate the decision of placing the first job of the list on the machine of the first cluster with minimal makespan and placing the last job of the list on the second cluster on the machine with minimal makespan. The job placed is the one that minimize the makespan of those two machines. That way, if a job is not placed on the cluster where it can be executed at its minimal cost, we know this choice does not have a significant impact on the overall makespan.

Theorem 5. The job partition S given by the Centralized Load Balancing for two clusters of machines algorithm (Algorithm 5) has a makespan at most two times larger than the optimal solution.

Cmax(S) ≤ 2 ∗ OPT (3)

Proof. Let imaxbe a machine of M = M¹∪ M² such that Cmax(S) = C(imax) We can suppose, without loss of generality, that imax∈ M¹.

Let jmax be the last job placed on imax.

Let S⁰ be the incomplete partition of J just before the choice is made by Algorithm5to place jmax on imax.

Let C¹be C(S⁰, imax) and C²be such that C²= min_i∈M2C(S⁰, i). Note that C¹ has also the property C¹= min_i∈M1C(S⁰, i).

(10)

Algorithm 5 Centralized Load Balancing for two clusters of machines Data: M¹ Cluster of identical machines of type 1

Data: M² Cluster of identical machines of type 2

Data: J Set of n jobs such that ∀j ∈ {1, ..., n}, p1,J (j)≤ OPT and p2,J (j)≤ OPT

Result: Partition S of jobs for each machine, such that Cmax(S) ≤ 2 ∗ OPT Sort jobs in J in increasing order of p1,j/p2,j

Initialize S with one empty set per machine j¹:= 1

j²:= n

while j¹≤ j² do

Select i¹∈ M¹ such that C(i¹) = min_i∈M1C(i) Select i²∈ M² such that C(i²) = min_i∈M2C(i) if C(i¹) + p_{1,J (j}1)≤ C(i²) + p_{2,J (j}2)then

S(i¹) := S(i¹) ∪ {J (j¹)}

j¹:= j¹+ 1 else

S(i²) := S(i²) ∪ {J (j²)}

j²:= j²− 1 return S

Let us compare min(C¹, C²) to OPT

Let S⁰¹ be the jobs placed on cluster 1 in S⁰, and S⁰² the ones placed on cluster 2, S^0k=S

i∈M^kS⁰(i).

To compare C¹ and C², we will use the notion of work, the work is defined as the total cost of the jobs assigned to the machines.

The work done on the cluster 1, W¹, has the property W¹ ≥ |M¹| ∗ C¹, similarly, the work W²done on cluster 2 is such that W²≥ |M²| ∗ C².

Let S^∗ be an optimal solution with S^∗1 the set of jobs placed on cluster 1 and S^∗2 the set of jobs placed on cluster 2. The work done on cluster 1 with an optimal solution is denoted as W^∗1, the work done on cluster 2 is denoted as W^∗2

To create an optimal solution from S⁰, the jobs J¹ = S⁰¹∩ S^∗2 should be moved from the cluster 1 to the cluster 2, and the jobs J² = S⁰²∩ S^∗1 from cluster 2 to cluster 1.

We have the equations W^∗1= W¹− X

j∈J¹

p1,j+ X

j∈J²

p1,j and W^∗2= W²− X

j∈J²

p2,j+ X

j∈J¹

p2,j (4)

By contradiction, let us assume that W^∗1< W¹ and W^∗2< W², from4 we deduce

X

j∈J²

p1,j < X

j∈J¹

p1,j and X

j∈J¹

p2,j < X

j∈J²

p2,j (5)

Let α = p_1,j_max/p_2,j_max

Because the algorithm sorts the jobs, we have ∀j ∈ J¹, p1,j/p2,j ≤ α and

∀j ∈ J², p1,j/p2,j ≥ α and with 5we have

(11)

X

j∈J²

p1,j < X

j∈J¹

p1,j ≤ αX

j∈J¹

p2,j< α X

j∈J²

p2,j (6)

But

X

j∈J²

p1,j ≥ αX

j∈J²

p2,j (7)

With the contradiction given by6and7, we deduce that W^∗1≥ W¹or W^∗2≥ W², in particular, OPT ≥ max(W^∗1/|M¹|, W^∗2/|M²|) ≥ min(W¹/|M¹|, W²/|M²|) ≥ min(C¹, C²). In conclusion,

min(C¹, C²) ≤ OPT (8)

Let j be the job compared to jmaxwhen jmaxis placed. Because jmaxhas been placed on imaxfrom the first cluster, we have

C¹+ p1,j_max≤ C²+ p2,j⁰ (9) If C¹≤ C², C¹≤ OPT and p_1,j_max ≤ OPT so C_max(S) ≤ 2 ∗ OPT.

If C²< C¹, we have C²≤ OPT, p2,j⁰≤ OPT and C¹+ p1,j_max≤ C²+ p2,j⁰

so Cmax(S) ≤ 2 ∗ OPT.

In both cases,

Cmax(S) ≤ 2 ∗ OPT (10)

u t

6.2 Distributed algorithm

The Centralized Load Balancing for two clusters of machines algorithm is the base for a decentralized algorithm to balance the load of machines on two clusters of uniform machines.

Indeed, like the peer to peer approach from the Work Stealing algorithm the Decentralized Load Balancing for two clusters of machines algorithm is executed on each machine, and each machine randomly selects a target (a machine), and if both machines are from the same cluster, a Greedy Load Balancing is applied, and if both machines are from different clusters, the Centralized Load Balanc- ing for two clusters of machines algorithm is used to balance both machines (considering two sub-clusters of one machine each).

This algorithm uses Greedy Load Balancing to balance two machines from the same cluster (Algorithm6), this algorithm is in particular a³₂-approximation [7]. Decentralized Load Balancing for two clusters of machines also ensures an interesting property: when the situation is stable, the makespan of the job distribution is bounded by twice the optimal makespan.

(12)

Algorithm 6 Greedy Load Balancing

Data: m¹, m² machines of the same cluster to balance Data: S distribution of jobs

A = S(m¹) ∪ S(m²) if m¹ is in cluster 1 then

Sort jobs in A in increasing order of p1,j/p2,j

else

Sort jobs in A in increasing order of p2,j/p1,j

Initialize S with one empty set for m¹ and m² while A 6= ∅ do

j first job in A

if C(m¹) ≤ C(m²) then S(m¹) := S(m¹) ∪ {j}

else

S(m²) := S(m²) ∪ {j}

A := A\{j}

return S

Algorithm 7 Decentralized Load Balancing for two clusters of machines Data: m host machine of type k

Data: J (m) Set of jobs initially on m Result: C(J (m)) ≤ 2 ∗ OPT

while true do Select i ∈ M

if i is in the same cluster as m then Apply Greedy Load Balancing to i and m else

M1:= {m}

M2:= {i}

Apply Centralized Load Balancing for two clusters of machines to M1

and M2

Theorem 6. If the distribution S provided by the execution of Algorithm7 be- comes stable (for every pair of machine, the algorithm does not move any job), S is such that C_max(S) ≤ 2 ∗ OPT.

Proof. Let S¹ be the set of jobs assigned to cluster M¹ (S¹ = S

m∈M¹S(m)) and S² the set of jobs assigned to cluster M².

We first show that the jobs in S² have a ratio p1,j/p2,j larger the jobs in S¹. Let j¹ be such that p1,j¹/p2,j¹ = maxj∈S¹p1,j/p2,j and j² be p1,j²/p2,j² = min_j∈S2p_1,j/p_2,j.

By contradiction let us assume that p_1,j2/p_2,j2 < p_1,j1/p_2,j1. Let us denote by m¹ the machine on which j¹ is scheduled, and m² the machine on which j² is scheduled. Because Decentralized Load Balancing for two clusters of machines have been executed for all pairs of machine and does not change the solution, it has been executed to balance m¹ and m² in particular.

As we saw in proof of Theorem5, there exists α such that ∀j ∈ S(m¹), ∀j⁰∈ S(m²), p1,j/p2,j ≤ α and p1,j⁰/p2,j⁰ ≥ α.

(13)

From these we conclude that

∀j¹∈ S¹, ∀j²∈ S², p_1,j1/p_2,j1 ≤ p_1,j2/p_2,j2 (11) Let imax be a machine such that Cmax(S) = C(imax). Without any loss of generality, we assume that imax∈ M¹. Let jmaxbe a job assigned to imaxsuch that p1,j_max/p2,j_max = maxj∈S_imaxp1,j/p2,j. Let C¹= C(imax) − p1,j_max.

Since Greedy Load Balancing has been applied between the machines of the same cluster, we have the property

∀i ∈ M¹, C¹≤ C(i) (12)

Let C² and i² be such that C²= C(i²) = mini∈M²C(i).

Using the same reasoning by contradiction as in the proof of theorem5with α = p_1,j1/p_2,j1, we get

min(C¹, C²) ≤ OPT (13)

Let us consider an execution of Centralized Load Balancing for two clusters of machines between i_max and i².

– if jmax is not the last job placed by the algorithm, then jmax has been compared to j which has been placed afterward, so C¹+ p1,j_max ≤ C² but C¹+ p1,j_max ≥ C² (because C¹+ p1,j_max= Cmax(S)).

Hence C¹≤ C² so we deduce C¹≤ OPT and with p1,j_max ≤ OPT we have Cmax(S) ≤ 2 ∗ OPT

– if jmax is the last job placed by the algorithm, we have C¹+ p1,j_max ≤ C²+ p2,j_max

Hence Cmax(S) ≤ 2 ∗ OPT In both cases we have

C_max(S) ≤ 2 ∗ OPT (14)

u t This result is valid when the situation is stable, and the same upperbound can be obtained for any situation where imax can not exchange any job with any other machine. However, the convergence is not always possible, there exists possible cycles of exchanges (Fig.3).

6.3 Experiments

Theorem6ensures an upperbound when the partition is stable but, as Figure3 demonstrates, the convergence may not happen. In this part, we experimentally study the convergence of the solution generated by Algorithm7. We focus on two clusters of similar size (often there is one or two CPU and one GPU per machine)

(14)

1 4

2 3

5 A

B

C

(a) A + balancing(B,C)

1 4

2

5 3 A

B

C

(b) C + balancing(A,B)

1 3 4

2

5 A

B

C

(c) B + balancing(A,C)

Job Cluster 1 = {A,B} Cluster 2 = {C}

1 4 1

2 4 1

3 2

4 1 − 8 1

5 1 2

(d) Cost of each job

Fig. 3. The cycle [(a),(b),(c)] is an example of cycle in the execution of the Decentralized Load Balancing for two clusters of machines (Algorithm7). For each step of this cycle, there is only one non-trivial balancing possible and it leads to the next step.

and use randomly generated jobs which are initially distributed randomly among the machines.

We developed a simulator that implements Centralized Load Balancing for two clusters of machines and Decentralized Load Balancing for two clusters of machines (Algorithms 5 and 7). For each iteration of the simulation, a pair of machine is randomly chosen and then balanced using the decentralized algorithm. As the partition can cycle, the process is limited to 1,000,000 iterations at most.

We computed the makespan of the distribution obtained as well as the stability of the partition for clusters composed of 64 and 32 machines. The results show that the stability is not often obtained, only 18% of the simulations are stable when the average number of jobs per machine is smaller than 4, and it quickly drops as the average number of jobs per machine increases (Fig. 4) and eventually reaches 0%. However, the same simulation shows that the makespan of the solution created by the decentralized algorithm gets quickly under 150%

of the makespan obtained with the centralized algorithm (we denote this value as 1.5cent). This threshold is hopefully much smaller than 3 times the optimal makespan as the centralized algorithm is a proven 2-approximation. The experiments show that 90% of the partitions created have reached this threshold in less than 350 balancing which is less than 4 balancings per machine in this case.

(15)

96 192 384 768 1536 3072 0

0.1 0.2 0.3 0.4

Number of jobs

Fractionofstabilityattained

0 100 200 300 400 500

0 0.2 0.4 0.6 0.8 1

Iterations

FractionofexperimentswhereCmax≤1.5cent

96 jobs 192 jobs 384 jobs 768 jobs 1536 jobs 3072 jobs

Fig. 4. Experimentation on Decentralized Load Balancing for two clusters of machines onto a cluster of 64 machines and a cluster of 32 machines. (Left) Percentage of experiments that produced a stable partition. (Right) Percentage of partitions with a makespan smaller than 1.5cent after a number of iterations.

7 Conclusion

In this work we propose two decentralized algorithms designed to balance the work on heterogeneous machines in two different cases. The Multiple Job Type Balancing algorithm uses of the limited number of classes of jobs and has an approximation ratio with an upper bound equal to the number of types of jobs considered. The Decentralized Load Balancing for two clusters of machines has a proven 2-approximation ratio if convergence is reached. However, experiments showed that the distribution is often unstable but also that the centralized algorithm quickly provides a distribution close to its centralized counter-part.

Providing an upperbound for all the distributions created by the Decentral- ized Load Balancing for two clusters of machines algorithm and its extension to more than two clusters of machines are possible future works. Finally, it could be interesting to study implementations of this algorithm and in particular the convergence time and the quality of the approximation.

Acknowledgements Many thanks to Erik Saule for the opportunity and help provided during the training course.

References

1. Michael A. Bender and Michael O. Rabin. Online scheduling of parallel programs on heterogeneous systems with applications to cilk. Theory of Computing Systems, Special Issue on SPAA, 35(3):289–304, 2002.

2. Robert D. Blumofe. Scheduling multithreaded computations by work stealing. In FOCS, pages 356–368, 1994.

(16)

3. Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leis- erson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), page 207–216, 1995.

4. F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, FPCA ’81, pages 187–194, 1981.

5. Lin Chen, Deshi Ye, and Guochuan Zhang. Online scheduling on a cpu-gpu cluster. In T-H.Hubert Chan, LapChi Lau, and Luca Trevisan, editors, Theory and Applications of Models of Computation, volume 7876 of Lecture Notes in Computer Science, pages 1–9. Springer Berlin Heidelberg, 2013.

6. Michael R. Garey and David S. Johnson. “Strong” NP-completeness results: Mo- tivation, examples, and implications. J. ACM, 25(3):499–508, 1978.

7. Ronald L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal of Applied Mathematics, 17(2):416–429, 1969.

8. Ronald L. Graham, Eugene L. Lawler, Jan Karel Lenstra, and Alexander. H. G.

Rinnooy Kan. Optimization and approximation in deterministic sequencing and scheduling: a survey. Annals of discrete mathematics, 5(2):287–326, 1979.

9. Ralf Hoffmann, Matthias Korch, and Thomas Rauber. Performance evaluation of task pools based on hardware synchronization. In SC, page 44, 2004.

10. Eugene L. Lawler and Jacques Labetoulle. On preemptive scheduling of unrelated parallel processors by linear programming. J. ACM, 25(4):612–619, 1978.

11. Jan Karel Lenstra, David B. Shmoys, and ´Eva Tardos. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 46:259–

271, 1990.