Load balancing; Termination detection

(1)

Load balancing; Termination detection

Parallel and Distributed Computing

Department of Computer Science and Engineering (DEI) Instituto Superior T´ecnico

(2)

Outline

Load Balancing static dynamic centralized decentralized Termination Detection Debugging

(3)

Foster’s Design Methodology

Development of scalable parallel algorithms by delaying machine-dependent decisions to later stages.

Four steps:

partitioning communication agglomeration mapping

(4)

Mapping

Examples discussed in previous classes: circuit satisfiability

sieve of Eratosthenes all pairs shortest paths matrix-vector multiplication

So far, all primitive tasks require same amount of computation.

Aggregation + Mapping strategy:

create one task per processor and distribute primitive tasks evenly.

What if we can not predict beforehand the amount of computation required per primitive task?

(5)

Mapping

Examples discussed in previous classes: circuit satisfiability

sieve of Eratosthenes all pairs shortest paths matrix-vector multiplication

So far, all primitive tasks require same amount of computation.

Aggregation + Mapping strategy:

create one task per processor and distribute primitive tasks evenly. What if we can not predict beforehand the amount of computation

(6)

Load Balancing

P₄ P₃ P₂ P₁ P₀ t

Perfect load balancing

P₄ P₃ P₂ P₁ P₀ t Imperfect load balancing

Parallel execution time defined by last task to finish. Parallel time minimized when load distributedevenly.

(7)

Load Balancing

P₄ P₃ P₂ P₁ P₀ t Perfect load balancing

P₄ P₃ P₂ P₁ P₀ t’ Imperfect load balancing

Parallel execution time defined by last task to finish.

(8)

Load Balancing

P₄ P₃ P₂ P₁ P₀ t Perfect load balancing

P₄ P₃ P₂ P₁ P₀ t’ Imperfect load balancing

Parallel execution time defined by last task to finish. Parallel time minimized when load distributedevenly.

(9)

Load Balancing

Static load balancing: assignment of tasks to processors decided during program development

⇒ mapping problem.

Dynamic load balancing: assignment of tasks to processors decided during runtime

(10)

Load Balancing

Static load balancing: assignment of tasks to processors decided during program development

⇒ mapping problem.

Dynamic load balancing: assignment of tasks to processors decided during runtime

(11)

Static Load Balancing

Substantial amount of research has been dedicated to static load balancing.

Problem Statement (Partition Problem)

Given a number of tasks, each with a given computational effort, and a number of partitions (processors), assign each to a partition such that the maximum computational effort over all partitions is minimized.

NP-Hard combinatorial problem.

(12)

Static Load Balancing

(13)

Static Load Balancing

(14)

Limitations of Static Load Balancing

Limitations of static load balancing:

computational effort of each task may not be known a-priori

⇒ some problems have an indeterminate number of steps to complete

programs are subject to variable communication delays

performance of each processor may vary, and may not be known beforehand

Dynamic load balancing overcomes these issues by making the division of the load dependent on the actual runtimes.

(15)

Limitations of Static Load Balancing

(16)

Limitations of Static Load Balancing

(17)

Limitations of Static Load Balancing

(18)

Limitations of Static Load Balancing

(19)

Dynamic Load Balancing

Dynamic load balancing can be accomplished through:

centralized management

one process (master) is responsible for assigning tasks toslave

processes

decentralized management

(20)

Centralized Dynamic Load Balancing

Work Poolor Processor Farm model:

master holds all tasks of the application

new tasks may be generated during execution

when idle, slave process requests a task to the master

master selects tasks among those ready to run and sends to slave

tasks of larger size or importance are executed first

if tasks are all the same, a FIFO queue may be used

specialized slaves can be considered master can also hold global data

(21)

Centralized Dynamic Load Balancing

master holds all tasks of the application new tasks may be generated during execution

(22)

Centralized Dynamic Load Balancing

(23)

Centralized Dynamic Load Balancing

(24)

Centralized Dynamic Load Balancing

(25)

Centralized Dynamic Load Balancing

(26)

Centralized Dynamic Load Balancing

specialized slaves can be considered

(27)

Centralized Dynamic Load Balancing

(28)

Centralized Dynamic Load Balancing

T₁ T₄ T₅ T₆ T_n T₃ T₂ T₇ Master Process Work pool P₀ Send Task Request Task (possibly, submit new task) P₁ P₂ P_p Slave Processes

(29)

Centralized Dynamic Load Balancing

Limitations of centralized dynamic load balancing:

Master can become a bottleneck as it can only issue one task at a time.

not a significant problem if few slaves and/or computationally intensive tasks

(30)

Centralized Dynamic Load Balancing

Limitations of centralized dynamic load balancing:

Master can become a bottleneck as it can only issue one task at a time.

not a significant problem if few slaves and/or computationally intensive tasks

(31)

Decentralized Dynamic Load Balancing

Initial approach can be simply to evolve the centralized system into a tree-like task management scheme:

main master distribute tasks to be managed by second-level masters

each second-level master behaves as in the centralized method

(32)

Decentralized Dynamic Load Balancing

Fully distributed work pool: each process starts with a given number of tasks, but may send or receive new tasks to handle.

Receiver-initiatedmethod:

process request new tasks when it has few or no tasks to do

⇒ better in high load systems

Sender-initiated method:

process under heavy load send tasks to processes willing to accept them

A mixture of both approaches is possible.

Which process to send/request? round-robin

(33)

Decentralized Dynamic Load Balancing

A mixture of both approaches is possible. Which process to send/request?

round-robin random

(34)

Decentralized Dynamic Load Balancing

A mixture of both approaches is possible. Which process to send/request?

round-robin random

(35)

Mapping Strategy

Decision tree for mapping strategy:

Static number of tasks Dynamic number of tasks

Structured communication pattern Unstructured communication pattern Roughly constant computation time per task

Computation per task varies

Create one task

per processor Cyclically map tasks

Frequent communication between tasks

Many tasks with little communication

(36)

Termination Detection

When can we be sure that the computation has finished?

Static load balancing

Typically termination is easy to detect has layout of application is fixed and well controlled.

Centralized dynamic load balancing

Also easy for master to recognize termination: task queue is empty

every slave is idle, having requested another task, and without generating any new task

Alternatively, if a slave indicates that a solution has been found, master can terminate all slaves.

Decentralized dynamic load balancing

(37)

Termination Detection

(38)

Termination Detection

(39)

Termination Detection

(40)

Termination Detection

(41)

Termination Detection

(42)

Termination Detection

(43)

Termination Detection

(44)

Termination Detection

In general, termination at time t requires the following conditions:

1 local termination conditions exist for all processes at time t

2 there are no messages in transit at time t

Second condition is what makes this problem difficult.

(45)

Termination Detection

In general, termination at time t requires the following conditions:

1 local termination conditions exist for all processes at time t

2 there are no messages in transit at time t

Second condition is what makes this problem difficult.

(46)

Termination Detection

Using Acknowledgment Messages

each process has two states: active andinactive.

processes start inactive

they become active when they receive first task

sender of tasks becomesparentof process

process may receive many other tasks while active (computation itself need not be a tree)

every time a process receives a task sends an acknowledgment, with the exception of the parent

acknowledgment to parent is only sent when process is ready to become inactive

process is ready to become inactive when

local termination conditions exist (all tasks finished) it has transmitted all acknowledgments for tasks it received it has received all acknowledgments for tasks it has sent out

(47)

Termination Detection

each process has two states: active andinactive. processes start inactive

(48)

Termination Detection

(49)

Termination Detection

(50)

Termination Detection

(51)

Termination Detection

(52)

Termination Detection

(53)

Termination Detection

local termination conditions exist (all tasks finished)

it has transmitted all acknowledgments for tasks it received it has received all acknowledgments for tasks it has sent out

(54)

Termination Detection

local termination conditions exist (all tasks finished) it has transmitted all acknowledgments for tasks it received

it has received all acknowledgments for tasks it has sent out

(55)

Termination Detection

local termination conditions exist (all tasks finished) it has transmitted all acknowledgments for tasks it received

(56)

Termination Detection

Inactive Active Process Parent First Task Final Ack Ack Task Other processes

(57)

Termination Detection

Ring Termination Algorithm

processes become white when they terminate

when P0 becomes white, it sends a white token to P1

the token is passed to the next process in the ring when the process finishes, and:

if the color of the process is black, the token becomes black otherwise, the token keeps the same color

a process Pi becomes black if it sends a message to process Pj and

j < i

a black process becomes white when it passes the token

when P0 receives a black token, it passes a white token

(58)

Termination Detection

j < i

(59)

Termination Detection

j < i

(60)

Termination Detection

j < i

(61)

Termination Detection

j < i

(62)

Termination Detection

j < i

(63)

Termination Detection

j < i

(64)

Debugging

Most common tool for debugging of MPI programs:

printf! Program runs slower. Difficulties for parallel debugging?

redirection of stdio: MPI takes care of this

by delaying some tasks, behavior of program may change significantly

Don’t forget fflush!

Debuggers can be used to exam execution of individual processes. However:

they change program execution times significantly

cannot capture relevant aspects of parallel computations such as timing of events

(65)

Debugging

Most common tool for debugging of MPI programs: printf!

Program runs slower. Difficulties for parallel debugging? redirection of stdio: MPI takes care of this

(66)

Debugging

Program runs slower. Difficulties for parallel debugging?

redirection of stdio: MPI takes care of this

(67)

Debugging

(68)

Debugging

(69)

Debugging

(70)

Review

Load Balancing static dynamic centralized decentralized Termination Detection Debugging

(71)

Load balancing; Termination detection