LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

(1)

Hermann Härtig

LOAD BALANCING

(2)

Hermann Härtig, TU Dresden, Distributed OS, Load Balancing

ISSUES

■

_{starting points}

■

_{independent Unix processes and}

■

_{block synchronous execution}

■

_{who does it}

■

_{load migration mechanism (MosiX)}

■

_{management algorithms (MosiX)}

■

_{information dissemination}

■

_{decision algorithms}

(3)

EXTREME STARTING POINTS

■

_{independent OS processes}

■

_{block synchronous execution (HPC)}

■

_{sequence: compute - communicate}

■

_{all processes wait for all other processes}

■

_{often: message passing}

for example Message Passing Library (MPI)

3

(4)

SIMPLE PROGRAM MULTIPLE DATA

■

_{all processes execute same program}

■

_{while (true)}

{ work; exchange data (barrier)}

■

_{common in}

High Performance Computing: 

Message Passing Interface (MPI) 

library

(5)

DIVIDE AND CONQUER

5

node 1

CPU #2

CPU #1

node 2

CPU #2

CPU #1

part 1

part 2

part 3

part 4 result 1

result 2

result 3

result 4

result

problem

(6)

MPI BRIEF OVERVIEW

■

_{Library for message-oriented parallel}

programming

■

_{Programming model:}

■

_{Multiple instances of same program}

■

_{Independent calculation}

■

_{Communication, synchronization}

(7)

MPI STARTUP & TEARDOWN

■

_{MPI program is started on all processors}

■

MPI_Init(), MPI_Finalize()

■

_{Communicators (e.g., MPI_COMM_WORLD)}

■

MPI_Comm_size()

■

_{MPI_Comm_rank()}

_{: “Rank” of process within}

this set

■

_{Typed messages}

■

_{Dynamically create and spread processes}

using MPI_Spawn() (since MPI-2)

(8)

MPI EXECUTION

■

_{Communication}

■

_{Point-to-point}

■

_Collectives

■

_{Synchronization}

■

_Test

■

_Wait

■

_Barrier

8

MPI_Send(

void* buf,

int count,

MPI_Datatype,

int dest,

int tag,

MPI_Comm comm

)

MPI_Recv(

void* buf,

int count,

MPI_Datatype,

int source,

int tag,

MPI_Comm comm,

MPI_Status *status

)

MPI_Bcast(

void* buffer,

int count,

MPI_Datatype,

int root,

MPI_Comm comm

)

MPI_Reduce(

void* sendbuf,

void *recvbuf,

int count

MPI_Datatype,

MPI_Op op,

int root,

MPI_Comm comm

)

MPI_Barrier(

MPI_Comm comm

)

MPI_Test(

MPI_Request* request,

int *flag,

MPI_Status *status

)

MPI_Wait(

MPI_Request* request,

MPI_Status *status

)

(9)

BLOCK AND SYNC

9

blocking call

non-blocking call

synchronous

communication

asynchronous

communication

returns when message

has been delivered

returns immediately,

following test/wait

checks for delivery

returns when send

buffer can be reused

returns immediately,

following test/wait

checks for send buffer

(10)

EXAMPLE

10

int rank, total;

MPI_Init();

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Comm_size(MPI_COMM_WORLD, &total);

MPI_Bcast(...);

/* work on own part, determined by rank */

if (id == 0) {

for (int rr = 1; rr < total; ++rr)

MPI_Recv(...);

/* Generate final result */

} else {

MPI_Send(...);

}

(11)

AMDAHLS’ LAW

interpretation for parallel systems:

■

_{P: section that can be parallelized}

■

_{1-P: serial section}

■

_{N: number of CPUs}

 

(12)

AMDAHL’S LAW

Serial section:  

communicate, longest sequential section 

Parallel, Serial, possible speedup:

■

_{1ms, 100 μs: 1/0.1 → 10}

■

_{1ms, 1 μs: 1/0.001 → 1000}

■

_{10 μs, 1 μs: 0.01/0.001 → 10}

■

_...

(13)

WEAK VS. STRONG SCALING

Strong:

■

_{accelerate same problem size}

Weak:

■

_{extend to larger problem size}

(14)

WHO DOES IT

■

_application

■

_{run-time library}

■

_{operating system}

14

(15)

SCHEDULER: GLOBAL RUN QUEUE

15

immediate approach: global run queue

all ready processes

(16)

SCHEDULER: GLOBAL RUN QUEUE

■

_{… does not scale}

■

_{shared memory only}

■

_{contended critical section}

■

_{cache affinity}

■

_…

■

_{separate run queues with}

explicit movement of processes

(17)

OS/HW & APPLICATION

■

_{Operating System / Hardware:}

“All” participating CPUs: active / inactive

■

_{Partitioning (HW)}

■

_{Gang Scheduling (OS)}

■

_{Within Gang/Partition:}

Applications balance !!!

(18)

HW PARTITIONS & ENTRY QUEUE

18

Application

request

queue

(19)

HW PARTITIONS

■

_{optimizes usage of network}

■

_{takes OS off critical path}

■

_{best for strong scaling}

■

_{burdens application/library with balancing}

■

_{potentially wastes resources}

■

_{current state of the art in High}

Performance Computing (HPC)

(20)

TOWARDS OS/RT BALANCING

20

work item

time

(21)

EXECUTION TIME JITTER

21

work item

time

(22)

SOURCES FOR EXECUTION JITTER

■

_{Hardware ???}

■

_Application

■

_{Operating system “noise”}

(23)

OPERATING SYSTEM “NOISE”

Use common sense to avoid:

■

_{OS usually not directly on the critical path,}

BUT OS controls: interference via interrupts, caches,

network, memory bus, (RTS techniques)

■

_{avoid or encapsulate side activities}

■

_{small critical sections (if any)}

■

_{partition networks to isolate traffic of different}

applications (HW: Blue Gene)

■

_{do not run Python scripts or printer daemons in parallel}

(24)

SPLITTING BIG JOBS

24

work item

time

Barrier

(25)

SMALL JOBS (NO DEPS)

25

work item

time

Barrier

(26)

BALANCING AT LIBRARY LEVEL

Programming Model

■

_{many (small) decoupled work items}

■

_{overdecompose}

create more work items than active units

■

_{run some balancing algorithm}

Example: CHARM ++

(27)

BALANCING AT SYSTEM LEVEL

■

_{create many more processes}

■

_{use OS information on run-time and}

system state to balance load

■

_examples:

■

_{create more MPI processes than nodes}

■

_{run multiple applications}

(28)

CAVEATS

added overhead

■

_{additional communication between}

smaller work items (memory & cycles)

■

_{more context switches}

■

_{OS on critical path}

(for example communication)

(29)

BALANCING ALGORITHMS

required:

■

_{mechanism for migrating load}

■

_{information gathering}

■

_decisions

MosiX system as an example

-> Barak’s slides now

(30)

FORK IN MOSIX

30

Deputy

Remote

fork() syscall request

Deputy

(child)

Remote

(child) fork()

Establish a new link

fork()

Reply from fork()

(31)

PROCESS MIGRATION IN MOSIX

31

Process

migdaemon

migration request

Deputy

Remote

_fork()

Send state, memory maps, dirty pages Transition ack Finalize migration Migration completed Ack