Load Balancing in Delay-Limited Distributed Systems

(1)

Load Balancing in Delay-Limited

Distributed Systems

by

Sagar Dhakal

B.E. Electrical and Electronics Engineering,

Birla Institute of Technology, May 2001

THESIS

Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

Electrical Engineering

The University of New Mexico

Albuquerque, New Mexico

(2)

c

(3)

Dedication

(4)

Acknowledgments

I would like to express sincere gratitude towards my advisor, Professor Majeed M. Hayat, for his guidance, encouragement and support throughout this thesis work.1

His enthusiasm in research and teaching has been a perennial source of inspiration to me. Working with him provided me an excellent learning opportunity.

I would like to thank Professor Chaouki T. Abdallah for sharing his expertise in the field of time-delay systems and for motivating me. I would also like to thank my other thesis committee member, Professor Gregory L. Heileman, for his support and helpful comments.

I take this opportunity to thank my colleagues, Jean Ghanem and Biliana Paskaleva, for their help and contribution to the successful completion of this work. My heartfelt gratitude to all those who in some way helped me achieve my objective.

Finally, I thank my family for their immense love, patience and support.

1_{This work was supported by the National Science Foundation under Information}

(5)

Load Balancing in Delay-Limited

Distributed Systems

by

Sagar Dhakal

ABSTRACT OF THESIS

Submitted in Partial Fulfillment of the

Requirements for the Degree of

Master of Science

Electrical Engineering

The University of New Mexico

Albuquerque, New Mexico

(6)

Load Balancing in Delay-Limited

Distributed Systems

by

Sagar Dhakal

B.E. Electrical and Electronics Engineering,

Birla Institute of Technology, May 2001

M.S., Electrical Engineering, University of New Mexico, 2003

Abstract

Load balancing is the allocation of the workload among a set of co-operating compu-tational elements (CEs). In large-scale distributed computing systems, in which the CEs are physically or virtually distant from each other, there are communication-related delays that can significantly alter the expected performance of the load-balancing policies that do not account for such delays. This is a particularly promi-nent problem in systems for which the individual units are connected by means of a shared broadband communication medium (e.g., the Internet, ATM, ad hoc net-works, wireless LANs or the wireless Internet). In such cases, the delays, in addition to being large, fluctuate randomly, making their one-time accurate prediction im-possible. Therefore, the performance of such distributed systems under any load balancing policy is stochastic in nature and must be assessed in a statistical sense. Moreover, the design of load-balancing policies that best suits such delay-infested distributed systems must also be carried out in a statistical framework.

(7)

In this work we study the effect of random delays (small and large) on the perfor-mance of a dynamic load-balancing algorithm. The study shows that the presence of random delay leads to a significant degradation in the performance of a load-balancing policy. Therefore, we exploit the stochastic dynamics, using a queuing framework, to model the load-balancing algorithm and optimize its performance. We find that weakening the load-balancing mechanism, or so-called gain, appropri-ately leads to an improved performance of the distributed system. Motivated by this fact, we consider the optimization problem for a policy that has a fixed (one or two) number of balancing instants while optimizing the policy over the strength of load balancing and the times when the schedulings are executed. We discuss the performance of a single-scheduling policy on a distributed physical system consisting of a wireless LAN.

To look into the interplay between delay and load-balancing gain, we develop a novel analytical model to characterize the mean of the total completion time for a distributed system when a single scheduling is performed. We then use our optimal single-time load-balancing strategy to propose an autonomous on-demand (sender initiated) load-balancing scheme.

(8)

Chapter 1

Introduction

1.1

Problem Description and Motivation

The demand for high performance computing continues to increase everyday. The computational need in areas like cosmology, molecular biology, nanomaterials, etc., cannot be met even by a small group of fastest computers available [16, 17, 18, 19]. But with the availability of high speed networks, a large number of geographically dis-tributed computational elements (CEs) can be interconnected and effectively utilized in order to achieve a performance which is not ordinarily attainable on a single CE. The distributed nature of this type of computing environment calls for consideration of heterogeneities in computational and communication resources. A common archi-tecture is the cluster of otherwise independent CEs communicating through a shared network. Incoming workload has to be efficiently allocated to these CEs so that no single CE is overburdened where one or more other CEs remain idle. Further, tasks migration from high to low traffic area in a network alleviates the network-traffic congestion problem up to some extent.

(13)

Chapter 1. Introduction

to as load balancing in the literature. Effective load balancing of a cluster of CEs in a distributed computing system relies on accurate knowledge of the state of the individual CEs. This knowledge is used to judiciously assign incoming computational tasks to appropriate CEs, according to some load-balancing policy [1, 23]. In large-scale distributed computing systems in which the CEs are physically or virtually distant from each other, there are a number of inherent time-delay factors that can seriously alter the expected performance of the load-balancing policies that do not account for such delays. One manifestation of such time delay is attributable to the computational limitations of individual CEs. A more significant manifestation of such delay arises from the communication limitations between the CEs. These include delays in transferring loads between CEs and delays in the communication between them. Moreover, these delay elements not only fluctuate within each CE, as the amounts of the loads to be transferred vary, but also vary as a result of the uncertainty in the condition of the communication medium that connects the units. This kind of delay-uncertainty is frequently observed in systems for which the individual units are connected by means of a shared broadband communication medium (e.g., the Internet, ATM, ad hoc networks, wireless LANs or the wireless Internet).

There has been extensive research in the development of the appropriate dynamic load balancing policies (some of which will be discussed in Chapter 2 of this thesis). Some of these existing approaches consider constant performance of the network while others consider deterministic communication and transfer delay. The load balancing schemes designed under this conviction undermine the randomness in delay. But it is observed through Fig1.1 that this randomness in delay leads to an unnecessary exchange of tasks between CEs which results in an oscillatory behavior of the queues. In this thesis, we will propose and investigate a dynamic load balancing scheme for distributed systems which incorporates the stochastic nature of the delay in both communication and load transfer.

(14)

Chapter 1. Introduction 0 10 20 30 40 50 60 0 5 10 15 20 25 30 35 40 TIME, ms

Variance in the Queue length

Queue 3

Figure 1.1: Uncertainty in the queue length is observed towards the tail. The random delay for transferring the small amount of tasks back and forth causes this oscillation.

To adequately model load balancing problems, several features of the parallel com-putation environment should be captured: These include (1) The workload awaiting processing at each CE (i.e., queue size); (2) The relative performances of the CEs; (3) The computational requirements of each workload component; (4) The delays and bandwidth constraints of CEs and network components involved in the exchange of workloads, and (5) The delays imposed by CEs and the network on the exchange of measurements and information. The effect of delay is expected to be a key factor as searching large databases moves toward distributed architectures with potentially geographically distant units.

1.2

General Framework for Load Balancing

We now present the general framework for load balancing for the distributed envi-ronment. A typical distributed system will have a number of processors working

(15)

independently with each other. Some of them are linked by communication channel and while some are not. Each processor possesses an initial load, which represents an amount of work to be performed, and each may have a different processing ca-pacity. To minimize the time needed to perform all tasks, the workload has to be evenly distributed over all processors based on their processing speed. This is why load balancing is needed. If all communication links are of infinite bandwidth and instantaneous, the load distribution would suffer from no delay, and this does not represent the distributed environments considered in this thesis. But in any practi-cal distributed systems, the channels are of finite bandwidth and and the units may be physically distant; therefore, we would encounter information-flow bottlenecks. Obviously, we do not want to send packets in a noisy channel with sufficiently large delays and prone to packet loss. Therefore, load balancing is also a decision making process of whether to allow tasks migration or not. The situation is aggravated by the fact that the delay involved is random in nature.

Another issue related to load balancing is that a job is not arbitrarily divisible leading to certain constraints in dividing tasks. Each job consists of several smaller tasks and each of those tasks can have different execution time. Also, the load on each processor as well as on the network can vary from time to time based on the workload brought about by the users. The processor capacity may be different from each other in architecture, operation system, CPU speed, memory size, and available disk space. The load balancing problem also needs to consider fault-tolerance and fault-recovery. With all these factors taken into account, load balancing can be generalized into four basic steps: (1) Monitoring processor load, (2) Exchanging load information between processors , (3) Calculating the new work distribution, and (4) Actual data movement. Numerous load balancing schemes have been proposed and implemented and we will look into some of them in Chapter 2. Broadly speaking, the goal of a load balancing algorithm is to redistribute the load to minimize the overall execution time. Clearly, the search has to be directed to find the algorithm

(16)

which gives an optimal solution. However, most of the available literature in load balancing consider the problem as NP-complete and attempts solving the problem heuristically or suboptimally.

1.3

Objective of this Thesis

The main goal of this thesis is to investigate the effect of stochastic delay on the performance of a load balancing policy in a distributed environment and find a remedy to upgrade the performance. The purpose is oriented to come up with a better decision making policy to tackle the randomness in delay. This thesis work does not address issues like divisibility of a job, network architecture, operating system, memory size, fault-tolerance and fault-recovery. In the existing literature [13, 14, 15, 2, 20, 21, 25, 23, 24], the balancing policies have been developed where delay has been considered as deterministic and predictable. However, our view is this kind of policy will not perform as expected in real situations where network is shared (as described earlier in Section 1.1), channels have high bit-error rates and the level of traffic fluctuates every now and then. This uncertainty in delay will have a further destabilizing effect. Therefore, there is a need to come up with an improved balancing policy which takes into account the random nature of the delay.

For a given workload distribution among a group of heterogeneous processors,

we recognize the overall completion time of the group as the performance metric, and the objective is to develop a balancing strategy which minimizes this. First, we identify the feasibility of this kind of optimization by undertaking a Monte-Carlo (MC) simulation approach. We then verify the validity of the assumptions used in the MC approach and further apply our load-balancing scheme to a physical system consisting of a wireless LAN. We then launch a novel, analytical stochastic approach, based on renewal principles, that characterizes the average completion time. We

(17)

present the results for the case of two nodes (n = 2); however, the approach can be extended to the multi-CE case in a straightforward fashion. Notably, the n= 2 case maintains the gist of the multi-CE problem and conveys the underlying principles of our analytical solution while keeping the algebra at a minimum. Therefore, our aim is to analytically model this 2-processor system and define a way to apply the model for dynamic load balancing. This thesis work may also have the potential for being useful in other fields such as networked control systems (NCS) and teleautonomy. In a NCS the sensor and the controller are connected over a shared network and therefore, there is a delay in closing the feedback loop. A special application of teleautonomy [35, 36] is: robots distributed geographically and working autonomously but at the same time being monitored by a distant controller. Clearly, the randomness in communication delay degrades the performance of such systems.

1.4

Overview of Thesis

In Chapter 2, we present an overview of existing balancing strategies. We start by briefly discussing different schemes. We then look into special types of load balancing schemes available in the literature. Chapter 3 begins with the brief introduction to load balancing scheme developed by the authors of [23, 24, 25, 26] for modelling deter-ministic time-delay systems. We utilize some features from this model to develop our balancing strategy. Next, we present a discrete-time stochastic dynamical-equation model describing the evolution of the random queue size of each node. We generate a MC simulation algorithm and use it to demonstrate the extent of the role played by the magnitude and uncertainty of the various time-delay elements in altering the performance of load balancing. Chapter 4 presents the drawback in the implemen-tation of a load balancing policy on a continuous basis in a delay-limited distributed computing environment. We present the single and the double load balancing

(18)

strate-Chapter 1. Introduction

gies. The performance of the single load balancing strategy on a distributed physical system is discussed and is compared to our simulation results. Based on the concept of regeneration, in Chapter 5 we present a mathematical model for the distributed system with two nodes where one-shot balancing is done. We obtain a system of four difference-differential equations characterizing the mean of the overall comple-tion time. Finally, in Chapter 6 we propose a dynamic load balancing scheme which utilizes the analytical model developed in Chapter 5.

(19)

Chapter 2

Taxonomy of Load Balancing

Policies

There has been an extensive research in the development of the appropriate load balancing policy. The policies can broadly be categorized as static, dynamic, local, global, centralized, distributed, sender-initiated, receiver initiated, symmetrically-initiated, deterministic and non-deterministic.

2.1

Brief Overview of Balancing Policies

2.1.1

Static versus Dynamic Load Balancing

Static load distribution assigns jobs to nodes probabilistically or deterministically, without consideration of runtime events. For example, using a simple static strategy, tasks can be assigned to processors in a round-robin fashion so that each processor executes approximately the same number of tasks. This approach works better when the workload can be accurately characterized and the system dynamics do not

(20)

fluc-Chapter 2. Taxonomy of Load Balancing Policies

tuate. The runtime overhead involved is very small since processors know exactly which tasks they are to execute based on their processor numbers and the task iden-tifiers. It is generally impossible to predict or collect task characteristics like arrival time, execution costs, interdependencies, etc., and therefore, static balancing scheme has a very limited application.

Dynamic load distribution is designed to overcome the problems of unknown or uncharacterizable workloads and the non-deterministic run-time performance vari-ation of the nodes. In this unpredictable environment, it is better to perform the load balancing more than once or periodically during run-time such that the prob-lem’s variable behavior more closely matches available computational resources. For example, in areas like molecular dynamics, fluid dynamics, etc., the computational requirement associated with different parts of a problem domain may change with time as the computation progresses. In dynamic scheduling, the overhead associ-ated with the task of scheduling can directly affect the performance of the systems. Therefore, it is vital to look into issues related to where and when scheduling is per-formed, where the information required for scheduling is stored, and how complex the scheduling algorithm can be. In this thesis, we focus on the dynamic balancing domain.

2.1.2

Local versus Global Load Balancing

In a local load balancing scheme, each processor polls other processors in its small neighborhood and uses this local information to decide upon a load transfer. At every step a processor communicates with its nearest neighbors in order to achieve a local balance. The primary objective is to minimize remote communication as well as to efficiently balance the load on the processors. However, in a global balancing scheme, a certain amount of global information is used to initiate the load balancing.

(21)

Chapter 2. Taxonomy of Load Balancing Policies

The DASUD (Diffusion Algorithm Searching Unbalanced Domains) [3] algorithm belongs to the nearest neighbors classes. The authors evaluate the performance of the DASUD algorithm across ring, torus and hypercube topologies and observe via simulations that this balancing scheme outperforms the strategies for global balance degree in these cases. In [2], the authors divide the load balancing process into global load balancing phase and local load balancing phase so as to capture the heterogeneity of the network. The redistribution cost and the computational gain has to be compared before invoking any global distribution.

2.1.3

Centralized versus Distributed Load Balancing

Centralized schemes [4, 5] store global information at a centralized location and use this information to make more comprehensive scheduling decisions using the com-puting and storage resources of one or more dedicated processors. In some strategies, the sending or receiving processors contact a specific scheduling processor to identify another processor to which tasks are sent or from which tasks are received. There is always a contention to access the shared information and request tasks for execu-tion, which may cause the designated processor to become bottleneck. Further, the scheme fails if the designated processor crashes.

In distributed scheduling [25, 23, 26, 24, 6, 7, 8, 1], the scheduling task and the scheduling information are distributed among the processors and their memories. In some cases [6, 7, 8], the scheme allows idle processor to assign tasks to themselves at runtime by accessing a shared global queue. The time required to access this shared queue to remove one or more tasks from the common pool of waiting tasks might introduce runtime overhead.

(22)

2.1.4

Sender/Receiver/Symmetrically Initiated Balancing

Techniques of scheduling tasks in distributed systems have been divided mainly as sender-initiated, receiver-initiated, and symmetrically-initiated. In sender-initiated algorithms [9, 10, 1, 23, 26], the overloaded nodes transfer one or more of their tasks to more under-loaded nodes. In receiver-initiated schemes [4, 11, 10], under-loaded nodes request tasks to be sent to them from nodes with higher loads. In symmetric approach [10, 12], both the under-loaded as well as the loaded nodes can initiate load transfers.

2.1.5

Deterministic versus Non-deterministic Load

Balanc-ing

In deterministic load balancing, the information about tasks to be scheduled and their relation to one another is entirely known prior to their execution time. In non-deterministic some information may not be known prior to execution. Both deterministic and non-deterministic scheduling can be implemented using all the above discussed balancing methodologies.

2.2

Related Work

In this section we present some load balancing models and approaches available in the literature [13, 14, 15, 2, 20, 21].

(23)

2.2.1

Graph Partitioning Method

In [13], the authors presents a heuristic method for partitioning arbitrary graphs and show that it is both effective in finding optimal partitions, and fast enough to be practical in solving large problems like load balancing in distributed environment. We give a brief exposition to the method used by the authors to partition the graph. The authors consider a graph G of n nodes with cost on its edges and the objective is to partition the nodes into subsets of given sizes so as to minimize the sum of the costs on all edge cuts. The nodes are assigned sizes(weights) wi, i = 1, ..., n such that for all i,0 < wi ≤ p for some p > 0. They define a connectivity matrix

C = (cij), i, j = 1, ..., n which describe the edges of G. Now, for any k ∈ N, a

k−way partition ofG is a set of nonempty, pairwise disjoint subsets of G, given by

v1, ..., vk such that Sk_i₌₁vi =G. The k−way partition is allowed if |vi| ≤pfor all i, where the symbol |vi| = P

wi. Finally, the cost of a partition is the summation of

cij over all possible i and j such that i and j are in different subsets, i.e. the cost is the sum of all external costs in the partition. The performance metric is thus to find a minimal-cost admissible partition of G.

The authors show that finding an optimal solution using a strictly exhaustive procedure requires an inordinate amount of computation, and therefore, solving the problem heuristically is a quick approach to produce good solutions. First, they find the minimal-cost partition of a given graph into two-subsets (k = 2). They start with 2n points in the original graph and partition it arbitrarily into two sets A and

B, each with n points. The goal here is to try to decrease the initial cost T by a series of interchanges of subsets of A and B. Every time an interchange is made, the cumulative gain associated with it and with all prior interchanges is calculated according to their algorithm. Finally, when there is no more room for the reduction in initial cost, the partition is called locally optimum partition. Now, the local optimum partition is perturbed so that an iteration of the process on the perturbed solution

(24)

will yield a further reduction in the total cost. If it leads to an improvement, the new solution thus obtained is considered to be the optimal partition. The authors call it a global optimal solution. Now, the authors relax the requirement for the nodes of the graph to be of the same size. They achieve this by converting any node of size

s >1 to a cluster ofs nodes of size 1, bound together by edges of appropriately high cost. Finally, the idea of 2-way partition is extended to perform k-way partition. They start with any arbitrary k sets each with n nodes and by repeated application of the 2-way partitioning procedure to pair of subsets, they make the partition as close as possible to being pairwise optimal. The authors say that this may not lead to a globally optimal k-way partition, there may be situations where interchanges involving three or more items from three or more subsets is required. Also, the choice of the starting partitions will determine how fast the solution converges to being pairwise optimal. This concept is utilized in load balancing by modelling the cost on the nodes as the number of tasks and the edge cost as the amount of data transfer between the nodes. Partitioning is done to make the cost on each processing node equal while minimizing the respective edge costs. This model takes into account the computation and the communication costs but considers them deterministic.

2.2.2

Balancing scheme for SAMR Applications

In [2] the authors propose a dynamic load balancing algorithm for Structured Adap-tive Mesh Refinement (SAMR) applications on distributed systems. The focus is on the heterogeneity and dynamic load of the networks which are essentially prevalent in a distributed regime. SAMR is an algorithm used in multidimensional numerical simulations to achieve high resolution in localized regions and the authors mention that the algorithm has already been applied to model computational fluid dynamics, computational astrophysics, meteorological simulations, structural dynamics, mag-netic, and thermal dynamics. ENZO is a parallel implementation of this algorithm

(25)

for astrophysical and cosmological applications. Obviously, SAMR requires a large amount of computation, and therefore, the authors have appropriately chosen to ex-ecute SAMR applications on distributed systems by dynamically assigning the work-load among the systems at runtime. The authors execute ENZO on a distributed system (WAN), and compare the performance with its parallel implementation. The load balancing scheme is designed to reduce the overhead introduced by the WAN in the distributed system. The available processors are divided into groups: a group is defined as a homogeneous system and all the processors assigned to it have the same performance and share an intra-connected network. The load balancing within a group is referred as a local load balancing and the balancing among the groups as global balancing. The authors define their distributed systems to contain two or more groups. The objective is to minimize remote communication as well as to efficiently balance the load on the processors.

The balancing scheme is divided into two phases: 1. Global load balancing phase and 2. Local load balancing phase. The global balancing phase occurs after each time-step but only at level 0 of SAMR algorithm. The evaluation of workload redistri-bution cost among groups is made and this includes communication and computation overhead. The authors heuristically come up with an expression for redistribution cost which is: Cost = (α +β ×W) +δ, where α is the communication latency,

β is the communication transfer rate, W is the amount of workload in bytes to be redistributed andδis the computational overhead calculated using past information. Similarly, the estimated computational gain for global load balancing at that par-ticular time is also evaluated. For each group, the total workload (including all the levels) is calculated for one time-step at level 0 using the past data and then the dif-ference of total workload between groups is estimated. Finally, computational gain is estimated by using the difference of total workload and the recorded execution time of one iteration at the top level. Now, the global load balancing is invoked only if the computational gain is some factor times the redistribution cost. The factor is

(26)

a user-defined parameter to control the strength of global load balancing. While re-distribution among the groups, the authors come up with simple scheme to consider the heterogeneity of processors. Each processor has a performance weight, and the workload assigned to a group is weighted by the ratio of sum of performance weights of the processors belonging to this group to the sum of the performance weights of the entire processors in the system. Within each time-step for global balancing, balancing is performed at the local level number of times. The local balancing is done within a group and hence remote communication is avoided. ENZO is invoked whenever a local balancing is performed at the local level. Therefore, the dynamic load balancing proposed by the authors apply distributed scheme at the global level and ENZO at the local level. The experiments performed by the authors according to this scheme show that the total execution time can be reduced by 9% to 46% and get an improvement of 26% over the case where only ENZO is applied to the whole distributed system. We think that the performance of this policy can be further im-proved if we have a clear picture of how the stochastic delay affects the redistribution cost in the global balancing step, which the authors do not look at.

2.2.3

Hydrodynamic Algorithm

In the approach given in [14], each processor is viewed as a liquid cylinder where the cross-sectional area corresponds to the capacity of the processor, the commu-nication links are modelled as liquid channels between the cylinders, the workload is represented by liquid, and the load balancing algorithm manages the flow of the liquid. The objective is to reach the state where the heights of the liquid columns are the same in all the cylinders. The computing system is modelled as an undirected graph G = (N, E) where N is the set of processors and E represents the network topology. The authors propose a general hydrodynamic framework to redistribute the workload among the processors such that each processor obtains its share of the

(27)

workload proportional to its capacity. They define a potential energy function for

G such that its minimum value corresponds to the state of equilibrium where the heights of the liquid columns are the same in all the cylinders. The nearest-neighbor approach is used to migrate tasks among the processors.

Each processor ni has its processing capacity ci > 0 and load li which it is currently running. Further, each ni is associated with a liquid cylinder whose cross-sectional corresponds to ci. ∀ni ∈N the potential energy of the liquid column in ni is defined as P E(ni) =

cih2i

2 , wherehi is the height of the liquid column in ni. Now,

the global potential energy ofG is defined as the sum of potential energies of all the nodes. The authors consider an infinitely thin liquid channel joining the bottom of two liquid cylinders if there is a connection between the two corresponding processors. The global fairness is said to be achieved when the heights of the liquid columns in the cylinders are equal. They show that this state of equilibrium corresponds to the minimum global potential energy. The amount of workload that is transferred from node ni to node nj is given by γ

cicj

ci+cj(hi−hj) where γ is defined as the balancing

factor which is in the range (0,1) and is used to control the amount of workload flow. It is assumed that the communication channels have fixed delay times such that load balancing activity is completed within a finite interval B. Every load balancing step has two phases: information exchange and migration. The authors show that with this approach, the global potential energy converges geometrically to the optimal state. They applied the balancing scheme on eight network topologies: binary tree, complete, hypercube, linear, mesh, ring, star and torus, and found that the hypercube and torus generate the lowest load balancing time. The authors have not addressed the issue of stochastic delay in this work.

(28)

2.2.4

Gang-scheduling, Backfilling, and Migration

The authors of [15] discuss three techniques: backfilling, gang-scheduling and migra-tion to improve response times, throughput and utilizamigra-tion in large super computing environments. Backfilling is a purely space-sharing approach while gang-scheduling is time-sharing strategy and migration corresponds to moving a job from one virtual machine to the other.

Backfilling attempts to assign unutilized nodes to jobs that are behind in the priority queue. The users need to provide an estimate of job execution time and the number of nodes required by each job. If a job exceeds its estimated execution time, then it is killed. Therefore, users have to overestimate the execution time of their jobs. The ratio between the estimated and actual execution time is referred as an overestimation factor. The average job behavior is shown to be insensitive to the average degree of overestimation. A scheduling event takes place whenever a new job arrives or an executing job terminates. The authors define the performance metrics as the average job slow down and average job wait time. Job slowdown measures how much slower than a dedicated machine the system appears to the users, and the job wait time measures how long a job takes to start execution after its arrival at that machine. The authors measure quality of service from the system’s perspective with two parameters: utilization and capacity loss. Utilization is defined as the fraction of total system resources that are actually used during the execution of a workload and capacity loss accounts for the case when there are jobs waiting in the queue to execute and some nodes are idle.

In gang-scheduling, the available number of processors are shared in time. The time axis is partitioned into multiple slices according to some algorithm, and each slice will have all the processors work in parallel on all tasks of a parallel job. The authors presents a Ousterhout matrix whose columns are equal to the number of

(29)

available processors and the rows correspond to the time slices. The matrix is cyclic in that time-slicen−1 is followed by time slice 0 if there arenmultiprogramming levels. One cycle through all the rows of the matrix defines a scheduling cycle. Each element of the matrix represents a task of a job being processed in a particular processor during a particular time slice. The authors introduce two types of cost associated with this time-sharing approach: 1) the cost of the context-switches themselves, 2) additional memory pressure created by multiple jobs sharing nodes, and 3) additional swap space pressure caused by more jobs executing concurrently. They show that by controlling the multiprogramming level, the costs can be taken care of. Every job arrival or departure constitutes a scheduling event in the system and for each scheduling event a new Ousterhout matrix is computed. Computing this matrix involves four steps: 1. removing every instance of a job that does not stay in its assigned home row, 2. moving jobs from less populated rows to more populated rows, 3.scheduling new jobs into the matrix and 4. filling gaps in the matrix by replacing jobs from their home rows into a set of replicated rows.

The authors analyze the following strategies in their work: 1. Gang-scheduling (GS), 2. Backfilling Gang-scheduling (BGS), 3. Migration Gang-scheduling (MGS) and 4. Migration Backfilling Gang-scheduling (MBGS) strategies. In backfilling gang-scheduling, each of the virtual machines created by gang-scheduling is treated as a target for backfilling. Thus, this is a combined effort of time and space sharing scheduling strategy. The migration inflicts some additional costs which have been appropriately taken care of by the authors. The authors show that MBGS gives the best results by driving utilization higher than MGS and having better slow down and wait times than BGS. Also, they emphasize that at all combinations of context switch overhead and utilization, BGS outperforms GS with the same multiprogramming level. The authors have developed this policy for its implementation on a parallel processing domain where all the nodes are in a small neighborhood and hence, can be connected to each other according to some fixed topology. But, in a distributed

(30)

environment, it is not possible to have all the processors connected in a particular fashion. Further, the delay grows larger and unpredictable in the latter environment.

2.2.5

Load Balancing using Queuing Theory

Dynamic load balancing inside groups using queuing theory approach is discussed in [21]. The authors model the balancing scheme named optimal algorithm. Ac-cording to this scheme, a process is migrated when the load difference in processors is more than 1. In this policy, the load difference is 0 or 1. When a process is created, the local load is compared to that of all the other nodes and the process is assigned, before the beginning of its execution, to the node with lowest load. But if the communication cost is too high, the migration is avoided even if the imbalance exists. The authors developed their analytical model for two groups each with two processors. The load balancing is done in two phases: intra-group and inter-group. The intra-group communication rate(c) is reasonably taken to be greater than the inter-group communication rate (c0_{). Job arrival rate at every processor is} _λ _and

departure rate isµ. The four-tuple (i, j;k, l) defines the state of the system with two groups each with 2 processors where i, j are the number of jobs at processors of the first group and (k, l) for the second one. From the state (0,0; 0,2) load balancing gives the state (0,0; 1,1) rather than (1,0; 0,1) since priority is first given to the intra-group imbalance. According to this model, the authors come up with a transi-tion graph and lump their Markov Chain to reduce it. They finally come up with a complex expression giving an estimate of number of processes on a processor which depends on λ, µ, c and c0_{. Obviously, when there is no communication cost, load}

balancing is found to be beneficial. With communication cost taken into account, they found a threshold for the communication rate under which the performance is better of without load balancing. They also found a slight improvement with the application of grouping approach. In their work, the authors consider the task arrival

(31)

rate and the service rate to be the same for all the nodes. They also consider the communication rate to be constant between any two nodes within a group. Clearly, this work is not applicable to the distributed delay-limited systems.

(32)

Chapter 3

Dynamic Load Balancing: A

Stochastic Approach

The randomness in delay is a problem in systems for which the individual units are connected by means of a shared broadband communication medium (e.g., the Internet, ATM, wireless LAN or wireless Internet). In such cases, the delays, in ad-dition to being large, fluctuate randomly, making their one-time accurate prediction impossible. The performance of any load balancing policy designed for dedicated communication links and systems (where the delay is deterministic) is significantly altered when the delays encountered are stochastic. In this chapter, the stochastic dynamics of a load-balancing algorithm in a cluster of computer nodes are mod-eled and used to predict the effects of the random time delay on the algorithm’s performance. The contents of this chapter have been accepted for publication [26].

This chapter is organized as follows. We begin with an introduction to the contin-uous time models developed and studied in [23, 24]. The authors developed a linear model whose stability can be characterized in terms of the delays in the transfer of information between nodes and the gain in the load balancing algorithm. In

(33)

Sec-Chapter 3. Dynamic Load Balancing: A Stochastic Approach

tion 3.2 we identify the stochastic elements of the load-balancing problem at hand and describe its time dynamics. In Section 3.3, we present a discrete-time queuing model describing the evolution of the random queue size of each node in the presence of delay for a typical load balancing algorithm. In Section 3.4 we present the results of Monte-Carlo simulations which demonstrate the extent of the role played by the uncertainty of the various time-delay elements in altering the performance of load balancing from that predicted by deterministic models, which assume fixed delays. Conclusions are given in Section 3.5.

3.1

Load Balancing in Deterministic Delay

Sys-tems

In this section, a continuous time sender-initiated dynamic load balancing model in the form of a nonlinear delay-differential system of equations developed by the au-thors of [23, 24] is introduced. The model considers the deterministic communication and transfer delay.

The authors consider a computing network consisting ofn nodes all of which can communicate with each other. Initially, the nodes are assigned an equal number of tasks. However, when a node executes a particular task it can generate more tasks so that the overall load distribution becomes non-uniform. To balance the loads, each computer in the network sends its queue size qj(t) to all other computers in the network. A node i receives this information from node j delayed by a finite amount of time τij, that is, it receives qj(t−τij). Each node i then uses this information to compute its local estimate of the average number of tasks per node in the network using the simple estimator Pn

j=1qj(t−τij)

/n(τii= 0) which is based on the most recent observations. Node i then compares its queue size qi(t) with its estimate of

(34)

Chapter 3. Dynamic Load Balancing: A Stochastic Approach

the network average as qi(t)−Pn

j=1qj(t−τij)

/n and, if this is greater than zero, the node sends some of its tasks to the other nodes while if it is less than zero, no tasks are sent. Further, the tasks sent by nodei are received by node j with a delay

hij. The authors present a mathematical model of a given computing node for load balancing, which is given as:

dxi(t) dt =λi−µi+ui(t)− n X j=1 pij tpi tpj uj(t−hij) yi(t) =xi(t)− Pn j=1xj(t−τij) n (3.1) ui(t) =−Kisat (yi(t)) pij >0, pjj = 0, n X i=1 pij = 1 where sat (y) =y if y>0 = 0 if y <0. In this model:

• xi(t) is the expected waiting time experienced by a task inserted into the queue of the ith _{node and} _u_i(_t_{) is the rate of removal (transfer) of the tasks as} deter-mined by the balancing algorithm.

• λi is the rate of increase in xi • µi is the service rate at theith node

(35)

The authors use the local information of the waiting timesxi(t), i= 1, .., nto set the values of the pij such that nodej can send tasks to nodei in proportion to the amounts by which node i is below the local average as seen by node j.

3.2

Description of the Stochastic Dynamics

The load balancing problem in the presence of delay can be generically described as follows. Consider n nodes in a network of geographically-distributed CEs. Com-putational tasks arrive at each node randomly and tasks are completed according to an exponential service-time model. In a typical load-balancing algorithm, each node routinely checks its queue size against other nodes and decides whether or not to allocate a portion of its load to less busy nodes according to a predefined policy. Now due to the physical (or virtual) distance between nodes in large-scale distributed computing systems, communication and load transfer activity among them cannot be assumed instantaneous. Thus, the information that a particular node has about other nodes at any time is dated and may not accurately represent the current state of the other nodes. For the same reason, a load sent to a recipient node arrives at a delayed instant. In the mean time, the load state of the recipient node may have considerably changed from what was known to the transmitting node at the time of load transfer. Furthermore, what makes matters more complex is that these delays are random. For example, the communication delay is random since the state of the shared communication network is unpredictable, depending on the level of traffic, congestion, and quality of service (QoS) attributes of the network. Clearly, the char-acteristics of the delay depend on the network configuration and architecture, the type of communication medium and protocol, and on the overall load of the system.

Other factors that contribute to the stochastic nature of the distributed comput-ing problem include: 1) Randomness and possible burst-like nature of the arrival of

(36)

new job requests at each node from external sources (i.e., from users); 2) Random-ness of the load-transfer process itself, as it depends on some deterministic law that may use a sliding-window history of all other nodes (which are also random); and 3) Randomness in the task completion process at each node. In the next section, we lay out a queuing model that characterizes the dynamics of the load-balancing problem described so far.

3.3

A Discrete-time Queuing Model with Delays

Considernnodes (CEs), and letQi(t) denote the number of tasks awaiting processing at the ith node at time t. Suppose that the ith node completes tasks at a rate µi, and new job requests are assigned to it from external sources (i.e., external users) at a rate λi. Note that these incoming tasks come from sources external to the network of nodes and do not include the jobs transferred to a node from other nodes as a result of load balancing. Let the counting process Ji(t1, t2) denote the number of

such external tasks arriving at node iin the interval [t1, t2]. To capture any possible

burst-like characteristics in the external-task arrivals (as each job request may involve a large number of computational tasks), we will assume that the process Ji(·,·) is a compound Poisson process [29]. That is, Ji(t1, t2) = Pk:t1<τk≤t2Hk, whereτk are the arrival times of job requests (which arrive according to a Poisson process with rate

λi) andHk (k= 1,2. . .) is an integer-valued random variable describing the number of tasks associated with the kth job request. We next address the load transfer between nodes which will allow us to describe the dynamics of the evolution of the queues.

For the ith node and at its specific load-balancing instants Ti

l, l = 1,2, . . . , the node looks at its own load Qi(Ti

l) and the loads of other nodes at randomly delayed instants (due to communication delays), and decides whether it should allocate some

(37)

of its load to other nodes, according to a deterministic (or randomized, if so desired) load-balancing policy. Moreover, at times when it is not balancing its load, it may receive loads from other nodes that were transmitted at a randomly delayed instant, governed by the characteristics of the load-transfer delay. With the above descrip-tion of task assignments between nodes, and with our earlier descripdescrip-tion of task completion and external-task arrivals, we can write the dynamics of theith queue in differential form as Qi(t+∆t) =Qi(t)−Ci(t, t+∆t)−X j6=i Lji(t)+X j6=i Lij(t−τij(t))+Ji(t, t+∆t), (3.2) where

• Ci(t, t+ ∆t) is a Poisson process with rate µi describing the random number of tasks completed in the interval [t, t+ ∆t]

• Ji(t, t+∆t) is the random number of new (from external sources) tasks arriving in the same interval, as discussed above

• τij(t) is the delay in transferring the load arriving to node i in the interval [t, t+ ∆t] from node j, and finally

• Lij(t) is the load transferred from node j to node i at the timet.

For any k 6= l, the random load Lkl diverted from node l to node k is gov-erned by the mutual load-balancing policy a-priorily agreed upon between the two nodes, which utilizes knowledge of the state of the lth node and the delayed knowl-edge of the kth node and all the other nodes. More precisely, we assume Lkl(t) =4

gkl(Ql(t), Qk(t−ηlk(t)), . . . , Qj(t−ηlj(t)), . . .), where for any j 6= k, ηkj(t) = ηjk(t) is the communication delay between the kth and jth nodes at time t. The function

(38)

Chapter 3. Dynamic Load Balancing: A Stochastic Approach example is gkl(Ql(t), Qk(t−ηlk(t)), . . . , Qj(t−ηlj(t)), . . .) =Kkpkl· Ql(t)−n−1 n X j=1 Qj(t−ηlj(t)) · u Ql(t)−n−1 n X j=1 Qj(t−ηlj(t)) , (3.3)

where u(·) is the unit step function with the obvious convention ηii(t) = 0, and Kk is a parameter that controls the “strength” or “gain” of load balancing at the kth (load distributing) node. We will refer to it henceforth as the gain coefficient. In this example, the lth node simply compares its load to the average over all nodes and sends out a fractionpkl of its excess load,Ql(t)−n−1Pn_j₌₁Qj(t−ηlj), to the lth node. (Of course we require that P

k6=lpkl = 1.) This form of policy has been used at the University of Tennessee for the FBI project [1, 23]. Finally, the fractions pkl can be defined in a variety of ways. Here, they are defined as follows:

pkl = 1 n−2 1− PQk(t−ηlk) i6=lQi(t−ηli) , (3.4)

where n≥3. In this definition, a node sends a larger fraction of its excess load to a node with a small load relative to all other candidate recipient nodes. For the special case when n = 2, pkl= 1, where k 6=l.

3.4

Simulation Results

We have developed a custom-made Monte-Carlo simulation software according to our queuing model. We utilized actual data from load-balancing experiments (conducted at the University of Tennessee) pertaining to the number of tasks awaiting processing, average communication delay, average load-transfer delay, and actual load-balancing

(39)

instants [23]. In the actual experiment, the communication and load-transfer delays were minimal (due the fact that the PCs were all in a local proximity and benefited from a dedicated fast Ethernet). Thus, to better reflect cases when the nodes are ge-ographically distant we synthesized larger delays in communication and load transfer in our simulations. 0 10 20 30 40 50 60 0 0.2 0.4 0.6 0.8 TIME, ms QUEUE LENGTH Zero−Delay Case Queue 1 Queue 2 Queue 3 Tasks Completed 0 10 20 30 40 50 60 −0.5 0 0.5 1 TIME, ms EXCESS LOAD Queue 1 Queue 2 Queue 3

Figure 3.1: Top: Queue size in the ideal case when delays are nonexis-tent. The queues are normalized by the total number of submitted tasks (12000 in this case). The dashed curves represent the tasks completed cumulatively in time by each node.

Bottom: Excess queue length for

each node computed as the difference between each nodes normalized queue size and the normalized queue size of the overall system. Note that the three nodes are balanced at approxi-mately 15 ms and that all tasks are completed in approximately 39 ms. 0 10 20 30 40 50 60 0 0.2 0.4 0.6 0.8 TIME, ms QUEUE LENGTH Deterministic−Delay Case Queue 1 Queue 2 Queue 3 Tasks Completed 0 10 20 30 40 50 60 −0.5 0 0.5 1 TIME, ms EXCESS LOAD Queue 1 Queue 2 Queue 3

Figure 3.2: Similar to Fig. 3.1 but with a deterministic communication and load-transfer delays of 8 ms and 16 ms, respectively. In contrast to the zero-delay case, the three nodes are balanced at approximately 60 ms and all tasks are completed shortly after-wards. Also note that nodes 2 and 3 each execute approximately 40% of the total tasks, where node 3 executes only 20% of the total tasks submitted to the system.

(40)

3.4.1

Effect of Delay

Three CEs were used in the simulations and a standard load-balancing policy [as de-scribed by (3.3)] was implemented. The PCs were assumed to have equal computing power (the average task completion time was 10 µs per task), but the initial load was distributed unevenly among the three nodes as 7000, 4500, and 500 tasks, with no additional external arrival of tasks (e.g., J1(t1, t2) = 7000 if t1 = 0,0 < t2 and

zero otherwise). Figure 3.1 corresponds to the case where no communication nor load-transfer delays are assumed. This case approximates the actual experiment, where all the computers were within the proximity of each other benefiting from a dedicated fast Ethernet. Note that the system is balanced at approximately 15 ms and remains balanced thereafter until all tasks are executed in approximately 39 ms. We next considered the presence of deterministic communication delay of 8 ms and a load transfer-delay of 16 ms. The behavior is seen in Fig. 3.2, where it is observed that the delay prevents load balancing to occur. For example, nodes 2 and 3 each eventually execute approximately 40% of the total tasks, whereas node 3 executes only 20% of the total tasks submitted to the system (as seen from the dashed curves in the top figure in Fig. 3.2). The conclusion drawn here is that the presence of delay in communication and load transfer seriously disturbs the performance of the load balancing policy, as each node utilizes “dated” information about the state of the other nodes as it decides what fraction of its load must be transferred to each of the other nodes.

To see the effect of the delay randomness on the load balancing performance, two representative realizations of the performance were generated and are shown in Figs. 3.3 and 3.4. The average delays were taken as in the deterministic case (i.e., 8 ms for the communication delay and 16 ms for the load-transfer delay). For the example considered, it turns out that the performance is sensitive to the realizations of the delays in the early phase of the load-balancing procedure. For example, it

(41)

Chapter 3. Dynamic Load Balancing: A Stochastic Approach 0 10 20 30 40 50 60 70 80 0 0.2 0.4 0.6 0.8 TIME, ms QUEUE LENGTH Random−Delay Case Queue 1 Queue 2 Queue 3 Tasks Completed 0 10 20 30 40 50 60 70 80 −0.5 0 0.5 1 TIME, ms EXCESS LOAD Queue 1 Queue 2 Queue 3

Figure 3.3: In this example, the com-munication and load-transfer delays are assumed random with average val-ues of 8 ms and 16 ms, respectively. Note that the performance is some-what superior to the deterministic-delay case shown in Fig. 3.2.

0 10 20 30 40 50 60 70 80 0 0.2 0.4 0.6 0.8 TIME, ms QUEUE LENGTH Random−Delay Case Queue 1 Queue 2 Queue 3 Tasks Completed 0 10 20 30 40 50 60 70 80 −0.5 0 0.5 1 TIME, ms EXCESS LOAD Queue 1 Queue 2 Queue 3

Figure 3.4: Another realization of the case described in Fig. 3.3 showing the variability in the performance from one realization to another. Load-balancing characteristics here are in-ferior to those in Fig. 3.3.

is seen from the simulation results that a deterministic (fixed) delay can lead to a more severe performance degradation than the case when the delays are assumed random (with the same mean as the deterministic case). To see the average effect of the random delay, we calculated the mean queue size and the normalized variance (normalized by the mean square) over 100 realizations of the queue sample functions, each with a different set of randomly generated delays. The results are shown in Figs. 3.5 and 3.6. It is seen from the mean behavior that the randomness in the delay actually leads, on average, to balancing characteristics (as far as the excess-load is concerned) that are superior to the case when the delays are deterministic! However, there is a high level of uncertainty in the queue size, and hence in the load balancing. It is seen from Fig. 3.5 (dashed curves) that the average total number of tasks completed by each node continues to increase well beyond 60 ms, which is inferred from the positive slope of the dashed curves. This indicates that in

(42)

comparison to the deterministic-delay case, the system requires 1) almost twice as long as the zero-delay case to complete all the tasks and 2) a longer time to complete all the tasks than the deterministic-delay case.

0 10 20 30 40 50 60 0 1000 2000 3000 4000 5000 6000 7000 TIME,ms Queue Length Queue 1 Queue 2 Queue 3

mean tasks completed

Figure 3.5: The empirical average queue length using 100 realizations of the queues for each node (solid curves). The dashed curves are the empirical average of the number of tasks performed by each node cumula-tively in time normalized by the total number of tasks submitted to the sys-tem. Only 87% of the total tasks are completed within 60 ms. 0 10 20 30 40 50 60 0 5 10 15 20 25 30 35 40 TIME, ms

Variance in the Queue length

Queue 3

Figure 3.6: The empirical variance of the queue length normalized by the mean-square values. Observe the high-degree of uncertainty in the low-est queue as well as the variability at large times, which is indicative of the fact that nodes continue to exchange tasks back and forth, perhaps unnec-essarily.

3.4.2

Interplay Between Delay and the Gain Coefficient

K

We finally consider the effect of varying the gain parameter K on the performance of load balancing (assume that K1 = K2 = K3 ≡ K). Figures 3.7, 3.8 show the

(43)

K = 0.8, andK = 0.2, respectively. It is seen that whenK = 0.8, the queue lengths fluctuate more than the case when K = 0.2, resulting in a longer overall time to total task completion. This example shows that a “weak” load-balancing algorithm can outperform a “strong” policy in the presence of random delay. We will revisit this interesting observation in more detail in the next section.

0 20 40 60 80 100 120 0 1000 2000 3000 4000 5000 6000 7000 TIME, ms QUEUE LENGTH Queue 1 Queue 2 Queue 3

Figure 3.7: Actual queue length when the gain parameter is K = 0.8. No-tice the abundance of fluctuations in the tail of the queue in comparison to Fig. 3.8. 0 20 40 60 80 100 120 0 1000 2000 3000 4000 5000 6000 7000 TIME, ms QUEUE LENGTH Queue 1 Queue 2 Queue 3

Figure 3.8: Same as Fig. 3.7 but with

K = 0.2 corresponding to “weak” load-balancing.

3.4.3

Load Dependent Delay

Clearly, the nature of the transfer delay depends on the amount of load to be trans-ferred; a sizable load will entail, on average, a longer transfer delay than a small load. As a consequence, the load balancing policy is directly affected by the load-dependent nature of transfer delay. For example, if there is a high degree of load imbalance present at any time, it might seem tempting to redistribute big packets of data up front so as to get rid of the imbalance quickly. However, the tradeoff

(44)

here is that the sizable load takes much longer to reach the destination node, and hence, the overall computation time will inevitably increase. Thus,we would expect the gain coefficient K to play an important role in cases when transfer delay is load dependent. Since the balancing is done frequently, it is intuitively obvious that we would be better of if we were to selectK conservatively. To address this issue quanti-tatively, we will need to develop a model for the load-dependent transfer delay. This is done next.

We propose to capture the load-dependent nature of the random transfer delay

τij by requiring that its average valueθij(t), assumes the following form

θij(t) = dmin−

1 + exp([(Lij(t)dβ)]−1₎

1−exp([(Lij(t)dβ)]−1)

, (3.5)

where dmin is the minimum possible transfer delay (its value is estimated as 9 ms in this paper), dis a constant (equal to 0.082618),β is a parameter which characterizes the transfer delay (selected as 0.04955). Moreover, we will assume that conditional on the size of the load to be transferred, the random delayτij is uniformly-distributed in the interval [0,2θij(t)]. This model assumes that up to some threshold, the de-lay is constant (independent of the load size) that is dependent on the capacity of the communication medium. Beyond this threshold, however, the average delay is expected to increase monotonically with the load size. The parameters d and b are selected so that the above model is consistent with the overall average delay for all the actual transfers that occurred in the previous simulations.

The load-dependent transfer delay versus the load is shown in the Fig. 3.9. The transfer delay for the loads sent from node 1 to node 3 (top) and from node 2 to node 3 (bottom) over the period of execution time is shown in Fig. 3.10. With the average communication delay being equal to 8 ms (as before) and the transfer delay made load dependent, according to the model described in (3.5), one realization of the performance for K = 0.5 was generated and it is shown in Figure 3.11. As expected, the performance deteriorates beyond the case corresponding to a fixed transfer delay.

(45)

Chapter 3. Dynamic Load Balancing: A Stochastic Approach 0 100 200 300 400 500 600 700 800 900 1000 10 10.2 10.4 10.6 10.8 11 11.2 No. of tasks

Average Transfer Delay, ms

Figure 3.9: Transfer Delay changes significantly for big loads.

0 5 10 15 20 25 30 35 40 0 20 40 60 80 TIME, ms Delay, ms 0 5 10 15 20 25 30 35 40 0 50 100 150 TIME, ms Delay, ms

Figure 3.10: Transfer Delay Varia-tion for a particular realizaVaria-tion of the queues

For example, we see from the figure that a load sent by node 1 at around 5ms arrives at node 3 approximately 50 ms later, thereby bringing more fluctuations to the tail to the queues. The average effect (over 50 realizations) of this delay model for two different gain parameters (K = 0.1 and K = 0.9) can be seen in Figs. 3.12 and 3.13. When K = 0.9, the queue is fluctuating beyond t = 80ms while when K = 0.1, all the tasks are completed at approximately 60ms. The optimal value of K for this delay model was found to be equal to K = 0.06 and the overall completion time in this case was 54.85 ms. The variation of the overall completion time with respect to the gain coefficient is shown in Table 3.1.

It is clearly seen that the required time for completion of all tasks (in the sys-tem) is significantly larger than the time required to execute 95% of the assigned tasks. The difference increases with higher values of K. This is due to the fact that even when all the queues arealmost depleted of tasks, they continue to execute the balancing policy. As a result, small amount of tasks (e.g., one or two) are sent from one node to other nodes and vice versa. This unnecessary task-swapping

(46)

sig-Chapter 3. Dynamic Load Balancing: A Stochastic Approach 0 10 20 30 40 50 60 0 1000 2000 3000 4000 5000 6000 TIME, ms QUEUE LENGTH

Realization of the Load Dependent Random−Delay Case

Queue 1 Queue 2 Queue 3 Tasks completed

Figure 3.11: Queue is more unsta-ble than in the case of the load inde-pendent delay case for the same gain

K = 0.5 0 10 20 30 40 50 60 70 80 0 1000 2000 3000 4000 5000 6000 7000 TIME, ms QUEUE LENGTH

Mean Realization of the Load Dependent Random−Delay Case

Figure 3.12: With K = 0.1 execution time is approximately 60 ms 0 10 20 30 40 50 60 70 80 0 1000 2000 3000 4000 5000 6000 7000 TIME, ms QUEUE LENGTH

Mean Realization of the Load Dependent Random−Delay Case

Figure 3.13: With K = 0.9, the queues are changing even at 80ms

nificantly increases the transfer delay, therefore increasing the overall computational time. Further, the tiny amount of tasks keep moving back and forth. This

(47)

phe-Chapter 3. Dynamic Load Balancing: A Stochastic Approach

Table 3.1: Dependence of the load-balancing performance on the gain coefficeint K. Gain(K) Task Completion Time(in

ms)

Time to Execute 95 percent of tasks(in ms) 0.01 62.53 41.80 0.02 61.44 42.86 0.03 59.68 42.59 0.04 57.27 41.98 0.05 56.79 41.35 0.06 54.85 41.99 0.07 56.04 42.49 0.08 59.68 41.56 0.09 62.53 41.81 0.1 61.10 42.18 0.2 65 43.38 0.3 63.40 46.2 0.4 78.313 53.33 0.5 >80 55.21

nomenon is clearly depicted in Figs. 3.6 where the minute fluctuations are evident near the tail of the queues.

3.5

Summary and Conclusions

Whenever there are tangible communication limitations between nodes in a dis-tributed system, possibly with geographically distant CEs, we must take a totally new look at the problem of load balancing. In such cases, the presence of

(48)

non-Chapter 3. Dynamic Load Balancing: A Stochastic Approach

negligible random delays in inter-node communication and load transfer can signif-icantly alter the expected performance of existing load-balancing strategies. The load-balancing problem must be viewed as a stochastic system, whose performance must be evaluated statistically. More importantly, the policy itself must be devel-oped with appropriate statistical performance criteria in mind. Thus, if we design a load-balancing policy under the no-delay or fixed-delay assumptions, the policy will not perform as expected in a real situation when the delays are non-zero or random. A load-balancing policy must be designed with the stochastic nature of the delay in mind.

Monte-Carlo simulation indicates that the presence of delay (deterministic or ran-dom) can lead to a significant degradation in the performance of a load-balancing policy. Moreover, when the delay is stochastic, this degradation is worsened, lead-ing to extended cycles of unnecessary exchange of tasks (or loads), back and forth between nodes, leading to extended overall delays and prolonged task-completion times. One way to remedy such a problem is to weaken the load-balancing mecha-nism (or discourage) appropriately. this action makes the load balancing policy in the presence of random delays “less reactionary” to changes in the load distribution within the system. This, in turn, reduces the sensitivity of the load-balancing process to inaccuracies in the state-of-knowledge of each node about the load distribution in the remainder of the system caused by communication limitations. We look into these interesting issues in the following chapter.

(49)

Chapter 4

Discrete-Time Load Balancing

In a distributed computing environment with a high communication cost, limiting the number of balancing instants results in a better performance than the case where load balancing is executed continuously. Therefore, finding the optimal number of balancing instants and optimizing the performance over the inter-balancing time and over the load-balancing gain becomes an important problem. In this chapter we show that the choice of the balancing strategy is an optimization problem with respect to the choice of the gain parameter. We discuss the performance of a single load-balancing strategy on a real distributed physical system and the performance is compared to our simulation predictions.

The contents of this chapter have been taken from [27, 28]. This chapter is or-ganized as follows. In Section 4.1 we present the motivation behind limiting the balancing instants. In Section 4.2, the results of Monte-Carlo simulations for single and double load-balancing strategies are presented and analyzed. Section 4.3 dis-cusses the performance of our single-load balancing strategy on a physical wireless 3-node network, while simulation results for this case is presented in Section 4.4. Finally, we conclude the chapter in Section 4.5.

(50)

Chapter 4. Discrete-Time Load Balancing

4.1

Motivation

In chapter 4, we looked at the problem of dynamic load balancing using a dynamical model that captures the stochastic delays of the distributed system. In the ideal case where the communication and load-transfer delays are small (as in a fast Ethernet environment) and the time required to implement the load-balancing policy is also negligible, the best performance (minimizing the waiting times associated with all CEs) is obtained when the load balancing is executed almost continuously without any reservation. Namely, at almost every instant, each CE compares its queue size to the average queue size of the network and distributes all its excess load to other nodes. Every other node also follows a similar policy. However, in a practical setting such a strategy has two main disadvantages: 1) the implementation of the load balancing policy on a continuous basis can drain the computational resources of each CE; and 2) excessive load balancing, both in frequency and strength, can lead to timely and possibly unnecessary exchange of loads between CEs. This means that valuable time may be unduly wasted exchanging loads back and forth between nodes (as the system is diligently attempting to balance the queues) while this time could have been used to actually execute the tasks submitted! In particular, we showed that the strength of the load-balancing policy must be reduced in a delayed environment to avoid any “over-reaction” consequences that may arise due to such delay factors.

In a more practical setting, the continuous implementation of load balancing, as we stated earlier, can be very costly (wasteful of computational resources) and more importantly, it can inflict an additional delay, namely, the time needed to implement the load balancing policy. Thus, there is an inherent tradeoff between the strength and frequency of load balancing on one hand, and the need to conserve computational resources used in implementing any load-balancing policy. Motivated by such a fundamental tradeoff, in this chapter we investigate whether limiting the number of load balanci

Load Balancing in Delay-Limited Distributed Systems

Load Balancing in Delay-Limited

Load Balancing in Delay-Limited

M.S., Electrical Engineering, University of New Mexico, 2003

Contents

Chapter 1

General Framework for Load Balancing

Objective of this Thesis

Overview of Thesis

Taxonomy of Load Balancing

Static versus Dynamic Load Balancing

Local versus Global Load Balancing

Centralized versus Distributed Load Balancing

Sender/Receiver/Symmetrically Initiated Balancing

Related Work

Balancing scheme for SAMR Applications

Hydrodynamic Algorithm

Gang-scheduling, Backfilling, and Migration

Load Balancing using Queuing Theory

Chapter 3

Load Balancing in Deterministic Delay

Description of the Stochastic Dynamics

A Discrete-time Queuing Model with Delays

Simulation Results

Effect of Delay

Interplay Between Delay and the Gain Coefficient

Load Dependent Delay

Summary and Conclusions

Chapter 4

Motivation