Parallel Computing(Load Balancing): Rajeev Wankar 1
Load Balancing Techniques
Lecture Outline
Following Topics will be discussed
Static Load Balancing
Dynamic Load Balancing
Mapping for load balancing
Minimizing Interaction
Parallel Computing(Load Balancing): Rajeev Wankar 3
Load Balancing Techniques
Objectives:
The amount of computation assigned to each processor is balanced so that some processors do not sit idlewhile others are executingtasks.
The interactionamong the different processors is
minimized, so that the processors spend most of the time in doing work.
Many times to balance the load among processors, it may be necessary to assign tasks, that interact heavily, to different processors.
General classes of problems
The firstclass consists of problems in which all the tasks are available at the beginning of the computation but the amount of timerequired by each task is different
The second class consists of problems in which tasks are available at the beginning but as the computation progresses, the amount of timerequired by each task changes
The third class consists of problems in which tasksare not available at the beginning but are generated
dynamically
Parallel Computing(Load Balancing): Rajeev Wankar 5
Load Balancing Techniques
Static load-balancingDistribute the work among processors prior to the execution of the algorithm
Example: Matrix-Matrix Computation
Easy to design and implement
Dynamic load-balancing
Distribute the work among processors during the execution of the algorithm
Algorithms that require dynamic load-balancing are somewhat more complicated (Examples: Parallel Graph Partitioning and Adaptive Finite Element Computations)
Static load-balancing
Before the execution of any process. Some potential static load balancing techniques:
• Round robin algorithm — passes out tasks in sequential order of processes, coming back to the first when all processes have been given a task
• Randomized algorithms — selects processes at random to take tasks
• Recursive bisection — recursively divides the problem into sub problems of equal computational effort while minimizing message passing
• Simulated annealing — an optimization technique
• Genetic algorithm— another optimization technique
Parallel Computing(Load Balancing): Rajeev Wankar 7
Static load-balancing
Since load is balanced prior to the execution, several fundamental flaws with static load balancing even if a mathematical solution exists:
• Very difficult to estimate accurately the execution times of various parts of a program without actually executing the parts.
• Communication delays that vary under different circumstances
• Some problems have an indeterminate number of steps to reach their solution.
General Strategy for Load Balancing
Parallel Computing(Load Balancing): Rajeev Wankar 9
Static Load Balancing Ex.
Array distribution schemes:
One dimensional (strip)
Block distributions of matrix (checkerboard)
Block cyclic distributions
Randomized block distributions
Using Stripped Data Decomposition
When stripped data decomposition is used to derive
concurrency, a suitable decomposition of data can itself be used to balance the load and minimize interactions.
Parallel Computing(Load Balancing): Rajeev Wankar 11
P0
P1
P2
P3
P4
P5
P6
P7
A B C
P0
P1
P2
P3
P4
P5
P6
P7
P0
P1
P2
P3
P4
P5
P6
P7
Matrix-Matrix addition
Block distribution (Dense matrix-matrix multiplication)
Parallel Computing(Load Balancing): Rajeev Wankar 13
(a) will lead to load imbalances sparse matrix in Gaussian elimination.
Using the block-cyclicdistribution shown in (b) to distribute the computations to processors
P0 P1 P2 P3 P0 P1 P2 P3 P4 P5 P6 P7 P4 P5 P6 P7 P8 P9 P10 P11 P8 P9 P10 P11 P12 P13 P14 P15 P12 P13 P14 P15 P0 P1 P2 P3 P0 P1 P2 P3
P4 P5 P6 P7 P4 P5 P6 P7
P8 P9 P10 P11 P8 P9 P10 P11
P12 P13 P14 P15 P12 P13 P14 P15
Schemes for Dynamic Load Balancing
Dynamic partitionThere are problems in which we cannot statically partition the work among the processors
In these problems, a static work partitioning is either
impossible (e.g. first class) or can potentially lead to serious load imbalance problems (e.g., second and third classes)
The only way to develop efficient message passing programs for these classes of problem is if we allow dynamic load balancing
Thus during the execution of the program, work is dynamically transferredamong the processors that have a lot of work to one that has little or no work
Parallel Computing(Load Balancing): Rajeev Wankar 15
Dynamic Load Balancing
Can be classified as:
• Centralized
• Decentralized
Centralized dynamic load balancing
Tasks handed out from a centralized location. Master-slave structure.
Decentralized dynamic load balancing Tasks are passed between arbitrary processes.
A collection of worker processes operate upon the problem and interact among themselves, finally reporting to a single process.
A worker process may receive tasks from other worker processes and may send tasks to other worker processes (to complete or pass on at their discretion).
Parallel Computing(Load Balancing): Rajeev Wankar 17
Centralized Dynamic Load Balancing
Master process(or) holds the collection of tasks to be performed.
Tasks are sent to the slave processes. When a slave process completes one task, it requests another task from the master process.
Terms used : work pool, replicated worker, processor farm.
Centralized work pool
Parallel Computing(Load Balancing): Rajeev Wankar 19
Termination
Computation terminates when:
• The task queue is empty and
• Every process has made a request for another task without any new tasks being generated Not sufficient to terminate when task queue empty
Reason:if one or more processes are still running if a running process may provide new tasks for task queue.
Decentralized Dynamic Load Balancing
Distributed Work PoolParallel Computing(Load Balancing): Rajeev Wankar 21
Process Selection
Algorithms for selecting a process:Round robin algorithm – process Pi requests tasks from process Px, where x is given by a counter that is
incremented after each request, using modulo n arithmetic (n processes), excluding x = i.
Random polling algorithm – process Pi requests tasks from process Px,where x is a number that is selected randomly between 0 and n - 1 (excluding i).
Load Balancing Using a Line Structure
Parallel Computing(Load Balancing): Rajeev Wankar 23
Master process (P0) feeds queue with tasks at one end, and tasks are shifted down queue.
When a process, Pi (1 ≤ i < n), detects a task at its input from queue and process is idle, it takes task from queue.
Then tasks shuffle down to the right in queue so that space held by task is filled. A new task is inserted into the left side end of queue.
Eventually, all processes have a task and queue filled with new tasks. High- priority or larger tasks could be placed in queue first.
Load Balancing Using a Line Structure
Code Using Time Sharing Between Communication and Computation
Master process (P0)
for (i = 0; i < no_tasks; i++) {
recv(P1, request_tag); /*request for task*/
send(&task, P1, task_tag);/*send tasks into queue*/
}
recv(P1, request_tag); /*request for task*/
send(&empty, P1, task_tag); /*in case end of tasks*/
Parallel Computing(Load Balancing): Rajeev Wankar 25
Process Pi(1 < i < n) if (buffer == empty) {
send(Pi-1, request_tag); /* request new task */
recv(&buffer, Pi-1, task_tag); /* task from left proc */
}
if ((buffer == full) && (!busy)) { /* get next task */
task = buffer; /* get task*/
buffer = empty; /* set buffer empty */
busy = TRUE; /* set process busy */
}
nrecv(Pi+1, request_tag, request); /* check msg from right */
if (request && (buffer == full)) {
send(&buffer, Pi+1); /* shift task forward */
buffer = empty;
}
if (busy) { /* continue on current task */
Do some work on task.
If task finished, set busy to false.
}
Nonblocking nrecv() necessary to check for a request being received from right.
Load balancing using a tree
Tasks passed from node into one of the two nodes below it when node buffer empty.
Parallel Computing(Load Balancing): Rajeev Wankar 27
General Techniques for Choosing Right Parallel Algorithm
» Maximize data locality
» Minimize volume of data
» Minimize frequency of Interactions
» Overlapping computations with interactions.
» Decision on Data replication
» Minimize construction
» Use highly optimized collective interaction operations
» Collective data transfers and computations
» Maximize Concurrency
Decision tree to choose a mapping strategy
Parallel Computing(Load Balancing): Rajeev Wankar 29 Static number of
tasks
Structured communication
pattern
Unstructured communication
pattern
Roughly constant computation time
per task
Join tasks to minimize communication.
Create one task per processor
Cyclically map tasks to processors for computational load balancing
Computation time per task
varies
Use static load balancing techniques
Dynamic number of tasks
Many short-lived tasks. No inter task
communication Frequent
communications between tasks
Use dynamic load balancing techniques
Use run time task scheduling algorithms