• No results found

Presentation Outline

N/A
N/A
Protected

Academic year: 2021

Share "Presentation Outline"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

Advanced Issues on Grid Technologies, Aristotle Univ. of Thessaloniki, 2008/04/07

The Message Passing Interface (MPI) The Message Passing Interface (MPI) and its integration with the EGEE Grid and its integration with the EGEE Grid

Vangelis Koukis Vangelis Koukis HG

HG- -01 01- -GRNET and HG GRNET and HG- - 06- 06 -EKT admin team EKT admin team [email protected]

[email protected]

Presentation Outline Presentation Outline

 Parallel Programming

 Parallel architectures

 Parallel programming models and MPI

 Introduction to basic MPI services

 MPI demonstration on a dedicated cluster

 Integration of MPI jobs on the EGEE Grid

 MPI job submission to HG-01-GRNET

 Discussion / Q&A Session

(2)

The lifetime of a serial job on the Grid The lifetime of a serial job on the Grid

RB CE

UI

LRMS (Torque / PBS)

allocation of a single processor

Worker Node

The need for MPI apps on the Grid The need for MPI apps on the Grid

 The Grid offers very large processing capacity:

How can we best exploit it?

 Thousands of processing elements / cores

 The easy way: The EP way

 Submit a large number of independent (serial) jobs, to process distinct parts of the input workload concurrently

 What about dependencies?

 What if the problem to be solved is not

“Embarassingly Parallel”;

(3)

Presentation Outline Presentation Outline

 Parallel Programming

Parallel architectures

Parallel programming models and MPI

 Introduction to basic MPI services

 MPI demonstration on a dedicated cluster

 Integration of MPI jobs on the EGEE Grid

 MPI job submission to HG-01-GRNET

 Discussion / Q&A Session

Parallel Architectures (1) Parallel Architectures (1)

 Distributed Memory Systems

(e.g., Clusters of Uniprocessor Systems)

(4)

Parallel Architectures Parallel Architectures (2) (2)

 Shared Memory Architectures (e.g., Symmetric Multiprocessors)

Parallel Architectures Parallel Architectures (3) (3)

 Hybrid – Multilevel Hierarchies

(e.g., Clusters of SMPs, Multicore/SMT Systems)

(5)

One model: The Message

One model: The Message- -Passing Paradigm Passing Paradigm

MPI_Recv MPI_Send

Presentation Outline Presentation Outline

 Parallel Programming

 Parallel architectures

 Parallel programming models and MPI

 Introduction to basic MPI services

 MPI demonstration on a dedicated cluster

 Integration of MPI jobs on the EGEE Grid

 MPI job submission to HG-01-GRNET

 Discussion / Q&A Session

(6)

What Is MPI?

What Is MPI?

 A standard, not an implementation

 An app library for message-passing

 Following a layered approach

 Offering standard language bindings at the highest level

 Managing the interconnect at the lowest level

 Offers C, C++, Fortran 77 and F90 bindings

Lots of MPI implementations Lots of MPI implementations



MPICH

http://www-unix.mcs.anl.gov/mpi/mpich



MPICH2

http://www-unix.mcs.anl.gov/mpi/mpich2



MPICH-GM

http://www.myri.com/scs



LAM/MPI

http://www.lam-mpi.org



LA-MPI

http://public.lanl.gov/lampi



Open MPI

http://www.open-mpi.org



SCI-MPICH

http://www.lfbs.rwth-aachen.de/users/joachim/SCI-MPICH



MPI/Pro

http://www.mpi-softtech.com



MPICH-G2

http://www3.niu.edu/mpi

(7)

 Multiple peer processes executing the same program image

 A number, called rank is used to tell each of the processes apart

 Each process undertakes a specific subset of the input workload for processing

 Execution flow changes based on the value of rank

 The basic rules of parallel programming

 Effort to maximize parallelism

 Efficient resource management (e.g., memory)

 Minimization of communication volume

 Minimization of communication frequency

 Minimization of synchronization

Single

Single Program, Program , Multiple Multiple Data Data (SPMD) (SPMD)

Processes and Communicators Processes and Communicators

 Peer processes are organized in groups, called communicators. At program start, there is MPI_COMM_WORLD

 Each process is assigned a single rank in the range of 0...P-1, where P is the number of processes in a communicator

 We’re referring to processes, not

processors (what about time-sharing?)

(8)

Typical MPI code structure Typical MPI code structure

#include <mpi.h>

int main(int argc, char *argv[]) {

...

/* Initialization of MPI support */

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Comm_size(MPI_COMM_WORLD, &size);

...

/* MPI Finalization, cleanup */

MPI_Finalize();

}

Basic MPI services (1) Basic MPI services (1)

 • MPI_Init(argc,argv)

 Library Initialization

 MPI_Comm_rank(comm,rank)

 Returns the rank of a process in communication comm

 MPI_Comm_size(comm,size)

 Returns the size (the number of processes) in comm

 MPI_Send(sndbuf,count,datatype,dest,tag,comm)

 Sends a message to process with rank dest

 MPI_Recv(rcvbuf,count,datatype,source,tag, comm,status)

 Receives a message from process with rank source

 MPI_Finalize()

 Library Finalization

(9)

Basic MPI Services (2) Basic MPI Services (2)

int MPI_Init(int* argc, char*** argv)

 Initializes the MPI environment

 Usage example:

int main(int argc,char *argv[]) {

MPI_Init(&argc,&argv);

… }

Basic MPI Services (3) Basic MPI Services (3)

int MPI_Comm_rank (MPI_Comm comm, int* rank)

 Returns the rank of the calling process in communicator comm

 Usage example:

int rank;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

(10)

Basic MPI Services (4) Basic MPI Services (4)

int MPI_Comm_size (MPI_Comm comm, int* size)

 Returns the size (number of processes) in communicator comm

 Usage example:

int size;

MPI_Comm_size(MPI_COMM_WORLD,&size);

Basic MPI Services (5) Basic MPI Services (5)

int MPI_Send(void *buf, int count, int dest, int tag, MPI_Datatype datatype, MPI_Comm

comm)

 The calling process sends a message from buf to the process with rank dest

 Array buf should contain count elements of type datatype

 Usage example:

int message[20],dest=1,tag=55;

MPI_Send(message, 20, dest, tag, MPI_INT,

MPI_COMM_WORLD);

(11)

Basic MPI Services (6) Basic MPI Services (6)

int MPI_Recv(void *buf, int count, int source, int tag, MPI_Datatype datatype, MPI_Comm comm, MPI_Status *status)

 Receives a message from process with rank source and saves it in buf

 At most count elements of type datatype are to be received (MPI_Get_count used to get the precise count)

 Wildcards

 MPI_ANY_SOURCE, MPI_ANY_TAG

 Usage example:

int message[50],source=0,tag=55;

MPI_Status status;

MPI_Recv(message, 50, source, tag,

MPI_INT, MPI_COMM_WORLD, &status);

Basic MPI Services (7)

Basic MPI Services (7)

(12)

Basic

Basic MPI Services (8) MPI Services (8)

int MPI_Finalize()

 Finalizes MPI support

 Should be the final MPI call made by the program

/* Computes f(0)+f(1) in parallel */

#include <mpi.h>

int main(int argc,char** argv){

int v0,v1,sum,rank;

MPI_Status stat;

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD,&rank);

if(rank==1) { v1=f(1);

MPI_Send(&v1,1,0,50,MPI_INT,MPI_COMM_WORLD);

else if(rank==0){

v0=f(0);

MPI_Recv(&v1,1,1,50,MPI_INT,MPI_COMM_WORLD,&stat);

sum=v0+v1;

}

MPI_Finalize();

}

Process 1

Process 0

A simple example

A simple example

(13)

Different Communication Semantics Different Communication Semantics

 Point-to-point / Collective Communication

 Synchronous, buffered or ready

 With different buffering and synchronization semantics

 Blocking or non-blocking calls

 Depending on when MPI returns control to the calling process

if (rank == 0)

for (dest = 1; dest < size; dest++)

MPI_Send(msg,count,dest,tag,MPI_FLOAT,MPI_COMM_WORLD);

Example: Process 0 needs to send msg to processes 1-7

In general: p – 1 communication steps needed for p processes

Collective Communication (1)

Collective Communication (1)

(14)

In general: communication steps needed for p processes

MPI_Bcast(msg,count,MPI_FLOAT,0,MPI_COMM_WORLD);

Example: Process 0 needs to send msg to processes 1-7

log

2

p

Collective Communication(2) Collective Communication(2)

int MPI_Bcast(void* message, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

Collective Communication Collective Communication (3) (3)

 Message in message is broadcast from process root to all processes in

communicator comm

 Memory at message should contain count elements of type datatype

 Called by all processes in comm

(15)

int MPI_Reduce(void* operand, void*

result, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

Collective Communication Collective Communication (4) (4)

 All data in operand pointers contributed to reduction operation op, and the result is retrieved by root in result

 Needs to be called by all processes in comm

 MPI_Op: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, etc.

 An MPI_Allreduce variant is also available

Collective Communication Collective Communication (5) (5)

/* Compute f(0)+f(1) + … + f(n) in parallel */

#include <mpi.h>

int main(int argc,char *argv[]){

int sum,rank;

MPI_Status stat;

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD,&rank);

/* Assumes values have been computed in f[] */

MPI_Reduce(&f[rank],&sum,1,MPI_INT,MPI_SUM,0, MPI_COMM_WORLD);

MPI_Finalize();

}

(16)

Collective Communication Collective Communication (6) (6)

int MPI_Barrier(MPI_Comm comm)

 Synchronizes execution of processes in communicator comm

 Each process blocks until all participating processes reach the barrier

 Reduces the degree of attainable parallelism

Collective Communication Collective Communication (7) (7)

int MPI_Gather(void* sendbuf, int sendcnt, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

 Data in sendbuf are gathered in memory belonging to process with rank root (in increasing rank)

 Results stored in recvbuf, which contains meaningful data only for root

 Also available as an MPI_Allgather variant

 The reverse project: MPI_Scatter

(17)

Synchronous

Synchronous – – Buffered - Buffered - Ready Ready

 Different completion semantics for send and receive operations

 Available in blocking as well as non-blocking variants

 A simple MPI_Send can be synchronous or buffered, depending on implementation

Synchronous

Synchronous – – Buffered – Buffered – Ready (2) Ready (2)

 int MPI_Ssend(void *buf, int count,

MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

 Returns successfully only when operation has completed on the receiver side - safe

 int MPI_Bsend(void *buf, int count,

MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

 Returns as soon as possible, performs intermediate buffering and schedules sending over the network – may fail later on

 int MPI_Rsend(void *buf, int count,

MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

 Returns as soon as possible, but requires guarantee that a receive

(18)

Synchronous

Synchronous – – Buffered – Buffered – Ready (3) Ready (3)

Returns only if send successful Returns only if send

successful May fail later due to

resource constraints

Will fail if no receive is outstanding on

the remote No need for

outstanding receive No need for

outstanding receive

1 memory copy 1 memory copy

2 memory copies

Completes locally Syncs with remote

Completes locally

MPI_Rsend MPI_Ssend

MPI_Bsend

Non – Non – Blocking Communication Blocking Communication

 MPI returns control immediately to the calling process, but

 It is not safe to reuse provided buffers before the posted operations have completed

 Two ways to check for operation completion:

 int MPI_Test (MPI_Request* request,int*

flag, MPI_Status* status)

 int MPI_Wait (MPI_Request*

request,MPI_Status* status)

(19)

 Each blocking function has a non-blocking counterpart:

 MPI_Isend (corresponds to MPI_Send)

 MPI_Issend (corresponds to MPI_Ssend)

 MPI_Ibsend (corresponds MPI_Bsend)

 MPI_Irsend (corresponds MPI_Rsend)

 MPI_Irecv (corresponds MPI_Recv)

Non – Non – Blocking Communication (2) Blocking Communication (2)

 Why use non-blocking operations?

 Enables overlapping computation with communication for efficiency:

Non – Non – Blocking Communication (3) Blocking Communication (3)

Non-blocking Blocking

MPI_Irecv();

MPI_Isend();

Compute();

Waitall();

MPI_Recv();

MPI_Send();

Compute();

(20)

MPI Datatypes MPI Datatypes

MPI_CHAR: 8-bit character

MPI_DOUBLE: 64-bit floating point value MPI_FLOAT: 32-bit floating point value MPI_INT: 32-bit integer

MPI_LONG: 32-bit integer

MPI_LONG_DOUBLE: 64-bit floating point value MPI_LONG_LONG: 64-bit integer

MPI_LONG_LONG_INT: 64-bit integer MPI_SHORT: 16-bit integer

MPI_SIGNED_CHAR: 8-bit signed character MPI_UNSIGNED: 32-bit unsigned character MPI_UNSIGNED_CHAR: 8-bit unsigned character MPI_UNSIGNED_LONG: 32-bit unsigned integer MPI_UNSIGNED_LONG_LONG: 64-bit unsigned integer MPI_UNSIGNED_SHORT: 16-bit unsigned integer MPI_WCHAR: 16-bit unsigned integer

MPI Datatypes MPI Datatypes (2) (2)

 MPI data packing for communication needed for complex datatypes

 count parameter (for homogeneous data in consecutive memory locations)

 MPI_Type_struct (derived datatype)

 MPI_Pack(), MPI_Unpack() (for

heterogeneous data)

(21)

 Support for Parallel I/O

 Dynamic process management, runtime process spawning and destruction

 Support for remote memory access operations

 One-sided RDMA operations

The MPI

The MPI- -2 Standard 2 Standard

The MPICH implementation The MPICH implementation

MPID συσκευή ADI-2

MPI API

Interaction with OS (system libraries, inteconnection network,

MPIR run-time library

MPIP profiling interface

Library interface Interconnect MPI API

MPID

ADI-2 device

(22)

The MPICH Implementation (2) The MPICH Implementation (2)

 1 send message queue, 2 receive queues per process

 posted + unexpected

 Underlying device selection based on the destination rank

 p4, shmem

 Protocol selection based on message size

 Short < 1024 bytes, rendezvous > 128000 bytes, eager protocol for sizes in-between

 Flow control

 1MB buffer space for the eager protocol per pair of processes

Presentation Outline Presentation Outline

 Parallel Programming

 Parallel architectures

 Parallel programming models and MPI

 Introduction to basic MPI services

 MPI demonstration on a dedicated cluster

 Integration of MPI jobs on the EGEE Grid

 MPI job submission to HG-01-GRNET

 Discussion / Q&A Session

(23)

MPI program execution (1) MPI program execution (1)

 The traditional, HPC way: running directly on a dedicated PC Cluster

 Linux cluster of 16 multicore nodes (clone1…clone16)

 Program compilation and execution

 Appropriate PATH for a specific MPI implementation

• export PATH=/usr/local/bin/mpich-intel:…:$PATH

 Compile and link with the relevant MPI-specific libraries

• mpicc test.c –o test –O3

 Program execution

• mpirun –np 16 test

Demo time Demo time! !

 Run a simple “Hello World” 16-process

MPICH job on dedicated cluster (clones)

(24)

MPI program execution (2) MPI program execution (2)

 Which machines do the peer processes run on?

 Machine file

$ cat <<EOF >machines clone4

clone7 clone8 clone10 EOF

$ mpiCC test.cc –o test –O3 –static –Wall

$ mpirun –np 4 –machinefile machines test

MPI program execution (3) MPI program execution (3)

 Implementation details

 How are the needed processes created? An implementation- and OS-specific issue

• passwordless rsh / ssh, cluster nodes trust one another and share a common userbase

• Using daemons, (“lamboot” for LAM/MPI)

 What about file I/O;

 Shared storage among all cluster nodes

• NFS in the most common [and slowest] case

• Deployment of a parallel fs, e.g., PVFS, GFS, GPFS

(25)

Presentation Outline Presentation Outline

 Parallel Programming

 Parallel architectures

 Parallel programming models and MPI

 Introduction to basic MPI services

 MPI demonstration on a dedicated cluster

 Integration of MPI jobs on the EGEE Grid

 MPI job submission to HG-01-GRNET

 Discussion / Q&A Session

MPI jobs in the Grid environment MPI jobs in the Grid environment

 Submission of MPICH-type parallel jobs

Type = "job";

JobType = "MPICH";

NodeNumber = 64;

Executable = “mpihello";

StdOutput = "hello.out";

StdError = "hello.err";

InputSandbox = {“mpihello"};

OutputSandbox = {"hello.out","hello.err"};

#RetryCount = 7;

#Requirements = other.GlueCEUniqueID ==

"ce01.isabella.grnet.gr:2119/jobmanager-pbs-short"

(26)

The lifetime of an MPI job on the Grid The lifetime of an MPI job on the Grid

RB CE

UI

LRMS (Torque / PBS)

Node selection ($PBS_NODEFILE)

and mpirun

Worker Nodes

Presentation Outline Presentation Outline

 Parallel Programming

 Parallel architectures

 Parallel programming models and MPI

 Introduction to basic MPI services

 MPI demonstration on a dedicated cluster

 Integration of MPI jobs on the EGEE Grid

 MPI job submission to HG-01-GRNET

 Discussion / Q&A Session

(27)

Demo time Demo time! !

 Submission of a “Hello World” 4-process MPICH job to HG-01-GRNET

Questions

Questions – – Issues - Issues - Details Details

 Who is responsible for calling mpirun;

 On which nodes? How are they selected?

 Shared homes / common storage?

 Process spawing and destruction? Accounting?

 MPICH-specific solutions, based on rsh / ssh

 mpiexec to integrate process creation with Torque

 CPU Accounting for multiple processes per job

 Support for different Interconnects and/or MPI

implementations?

(28)

Now and in the future Now and in the future ... ...

 Grid support for MPI jobs is a Work In Progress

 Support for MPICH over TCP/IP (P4 device)

 Possible problems with other devices, since P4-specific hacks are used

 Need for pre/post-processing scripts

 Compilation of the executable on the remote Worker Nodes?

EGEE MPI Working Group EGEE MPI Working Group

 Aims to provide standardized, generic support for different MPI

implementations

 http://egee-docs.web.cern.ch/egee-

docs/uig/development/uc-mpi-jobs_2.html

 Proposes implementation guidelines for

the compilation and execution of parallel

jobs

(29)

Other Issues Other Issues

 Processor selection and allocation to processes, packing of processes to nodes

 What about message latency?

 Per-node memory bandwidth

 Available memory per node

 Support for hybrid architectures

 Combine MPI with pthreads / OpenMP to better adapt to the underlying architecture

Bibliography

Bibliography – – Online sources Online sources

 Writing Message-Passing Parallel Programs with MPI (Course Notes – Edinburgh Parallel

Computing Center)

 Using MPI-2: Advanced Features of the Message- Passing Interface (Gropp, Lusk, Thakur)

 http://www.mpi-forum.org (Definition of the MPI 1.1 and 2.0 standards)

 http://www.mcs.anl.gov/mpi (home of the MPICH implementation)

 comp.parallel.mpi (newsgroup)

(30)

Presentation Outline Presentation Outline

 Parallel Programming

 Parallel architectures

 Parallel programming models and MPI

 Introduction to basic MPI services

 MPI demonstration on a dedicated cluster

 Integration of MPI jobs on the EGEE Grid

 MPI job submission to HG-01-GRNET

 Discussion / Q&A Session

References

Related documents

The goals of this study, undertaken on Bouvetøya, were (1) to determine if the first trip to sea after instrumentation is representative of subsequent trips in lactating Antarctic

During the management of the Covid-19 in Turkey, officials of the Ministry of Health, members of the Covid-19 scientific board (summoned from the leading physicians) and

Additional requirement in present document The requirements in the original table for Band VII apply to Bands XV and XVI as well... 8.2.6.2 UE Rx-Tx time difference

Bicolored or uniformly dark, with microsculpture mesh pattern slightly transverse, sculpticells slightly wider than long (Fig. 5C-D) in dorsal aspect slender, shaft subsinuate,

The OpenStage 30T/40/40G/40T always requires a local power supply (SMPS mains adapter, must be ordered separately) if an OpenStage Busy Lamp Field 40 is connected.. The PNOTE

[r]

int recvfrom(int sockfd, void *buf, int len, unsigned int flags, struct sockaddr *from, int *fromlen);. Polly Huang, NTU

int bind (int socket, struct sockaddr *address, int addr_len) int listen (int socket, int backlog). int accept (int socket, struct sockaddr *address,