• No results found

Introduction to Hybrid Programming

N/A
N/A
Protected

Academic year: 2021

Share "Introduction to Hybrid Programming"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

Introduction to

Hybrid Programming

Hristo Iliev

Rechen- und Kommunikationszentrum

aiXcelerate 2012 / Aachen

10. Oktober 2012

Version: 1.1

(2)

Motivation for hybrid programming

Clusters of small supercomputers

 Increasingly complex nodes – many cores, GPUs, Intel MIC, etc.

 Green computing will make things even more complex

(3)

MPI on a single node

MPI is sufficiently abstract to run on a single node

 It doesn’t care where its processes are located

 Message passing implemented using shared memory and IPC

The MPI library takes care

Usually faster than sending messages over the network

 but…

… it is far from optimal

 MPI processes are implemented as separate OS processes

 Portable data sharing is hard to achieve

 Lots of control / problem data has to be duplicated

(4)

Hybrid programming

(Hierarchical) mixing of different programming paradigms

MPI

OpenMP 0 5 6 7 8 9 1 2 3 4 Shared memory OpenCL GPGPU OpenMP 0 5 6 7 8 9 1 2 3 4 Shared memory OpenCL GPGPU
(5)

Best of all worlds

OpenMP / pthreads

 Cache reusage, data sharing

 Simple programming model (not so simple with POSIX threads)

 Threaded libraries (e.g. MKL)

OpenCL

 Massively parallel GPGPU accelerators and co-processors

MPI

 Fast transparent networking

 Scalability

Downsides

(6)

MPI – threads interaction

Most MPI implementation are threaded (e.g. for non-blocking

requests) but

not thread-safe

Four levels of threading support

All implementations support

MPI_THREAD_SINGLE

, but some does not

support

MPI_THREAD_MULTIPLE

Level identifier Description

MPI_THREAD_SINGLE Only one thread may execute

MPI_THREAD_FUNNELED Only the main thread may make MPI calls

MPI_THREAD_SERIALIZED Only one thread may make MPI calls at a time

MPI_THREAD_MULTIPLE Multiple threads may call MPI at once with no restrictions

Hi gh e r le ve ls

(7)

MPI and threads – initialisation

Initialise MPI with thread support

required tells MPI what level of thread support is required

provided tells us what level of threading MPI actually supports

could be less than the required level

MPI_INIT – same as a call with required = MPI_THREAD_SINGLE

 The thread that has called MPI_INIT_THREAD becomes the main thread

 The level of thread support is fixed by this call and cannot be changed later

C/C++: int MPI_Init_thread (int *argc, char ***argv,

int required, int *provided) Fortran: SUBROUTINE MPI_INIT_THREAD (required, provided, ierr)

(8)

MPI and threads – initialisation

Obtain current level of thread support

provided is filled with the current level of thread support

 If MPI_INIT_THREAD was called, provided will equal the value returned by the MPI initialisation call

 If MPI_INIT was called, provided will equal an implementation specific value

Find out if current thread is the main one

MPI_Query_thread (provided)

(9)

MPI and OpenMP

The most widely adopted hybrid programming model in engineering

and scientific codes

no MPI MPI

no OpenMP serial code MPI code

(10)

MPI_THREAD_FUNNELED

Only make MPI calls from outside the OpenMP parallel regions

double data[], localData[];

MPI_Scatter(data, count, MPI_DOUBLE,

localData, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

#pragma omp parallel for

for (int i = 0; i < count; i++)

localData[i] = exp(localData[i]); MPI_Gather(localData, count, MPI_FLOAT, data, count, MPI_DOUBLE,

0, MPI_COMM_WORLD);

No MPI calls from inside the parallel region

(11)

MPI_THREAD_FUNNELED

Only make MPI calls from outside the OpenMP parallel regions or

inside an OpenMP MASTER construct

float data[];

#pragma omp parallel {

#pragma omp master

{

MPI_Bcast(data, count, MPI_FLOAT, 0, MPI_COMM_WORLD); }

#pragma omp barrier

#pragma omp for

for (int i = 0; i < count; i++) data[i] = exp(data[i]);

(12)

MPI_THREAD_SERIALIZED

MPI calls should be serialised within (named) OpenMP CRITICAL

constructs or using some other synchronisation primitive

#pragma omp parallel {

MPI_Status status;

#pragma omp critical(mpicomm)

MPI_Recv(&data, 1, MPI_FLOAT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);

result = exp(data);

#pragma omp critical(mpicomm)

MPI_Send(&result, 1, MPI_FLOAT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);

(13)

MPI_THREAD_MULTIPLE

No need to synchronise between MPI calls

Watch out for possible data races and MPI race conditions

#pragma omp parallel {

int count; float *buf; MPI_Status status;

MPI_Probe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);

MPI_Get_count(&status, MPI_FLOAT, &count); buf = new float[count];

MPI_Recv(buf, count, MPI_FLOAT, status.MPI_SOURCE,

0, MPI_COMM_WORLD);

(14)

Addressing in Hybrid Codes

MPI and OpenMP have orthogonal address spaces:

 MPI rank  [0, #procs-1] MPI_Comm_rank()

 OpenMP thread ID  [0, #threads-1] omp_get_thread_num()

 Hybrid rank:thread  [0, #procs-1] × [0, #threads-1]

MPI does not provide a direct way to address threads in a process

Field Value source Remark

source rank Sender process rank Fixed, taken from the rank of the sending

process

destination rank MPI_Send argument Only one rank per process

tag MPI_Send argument Free to choose

(15)

Addressing in Hybrid Codes

Tags for threads addressing

 MPI standard requires that tags provide at least 15 bits of user defined

information (query for the MPI_TAG_UB attribute of MPI_COMM_WORLD to obtain the actual upper limit)

Use tag value to address destination thread only

 (+) straightforward to implement – threads simply supply their ID to MPI_Recv

 (+) very large number of threads could be addressed

 (-) not possible to further differentiate messages

(16)

Addressing in Hybrid Codes

Tags for threads addressing

 MPI standard requires that tags provide at least 15 bits of user defined

information (query for the MPI_TAG_UB attribute of MPI_COMM_WORLD to obtain the actual upper limit)

Multiplex destination thread ID with tag value

 e.g. 7 bits for tag value (0..127), 8 bits for thread ID (up to 256 threads)

 (+) possible to differentiate messages

 (-) receive operation must be split into MPI_Probe / MPI_Recv pair if we would like to emulate MPI_TAG_ANY

(17)

Addressing in Hybrid Codes

Tags for threads addressing

 MPI standard requires that tags provide at least 15 bits of user defined

information (query for the MPI_TAG_UB attribute of MPI_COMM_WORLD to obtain the actual upper limit)

Multiplex sender and destination thread IDs with tag value

 For communication between two BCS nodes @ RWTH 14 bits are required for thread IDs (up to 128 threads – 7 bits)

 Suitable for MPI implementations that provide larger tag space

both Open MPI and Intel MPI have MPI_TAG_UB of 231-1

(18)

Addressing in Hybrid Codes

Sample implementation – sender thread

#define MAKE_TAG (tag,stid,dtid) \

(((tag) << 16) | ((stid) << 8) | (dtid))

// Send data to drank:dtid with tag

MPI_Send(data, count, MPI_FLOAT, drank,

MAKE_TAG(tag, omp_get_thread_num(), dtid), MPI_COMM_WORLD);

(19)

Addressing in Hybrid Codes

Sample implementation – receiver thread

#define MAKE_TAG (tag,stid,dtid) \

(((tag) << 16) | ((stid) << 8) | (dtid))

// Receive data from srank:stid with a specific tag MPI_Recv(data, count, MPI_FLOAT, srank,

MAKE_TAG(tag, stid, omp_get_thread_num()), MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(20)

Addressing in Hybrid Codes

Sample implementation – receiver thread

#define GET_TAG(val) \ ((val) >> 16) #define GET_SRC_TID(val) \ (((val) >> 8) & 0xff) #define GET_DST_TID(val) \ ((val) & 0xff)

// Receive data from srank:stid with any tag

MPI_Probe(srank, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

if (GET_SRC_TID(status.MPI_TAG) == stid &&

GET_DST_TID(status.MPI_TAG) == omp_get_thread_num())

{

MPI_Recv(data, count, MPI_FLOAT, srank, status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

(21)

Multiple Communicators

In some cases it makes more sense to use multiple communicators

comms[0]

comms[1]

comms[2]

(22)

Multiple Communicators

Sample implementation

MPI_Comm comms[nthreads], tcomm;

#pragma omp parallel private(tcomm) num_threads(nthreads) {

MPI_Comm_dup(MPI_COMM_WORLD, &comms[omp_get_thread_num()]); tcomm = comms[omp_get_thread_num()];

. . .

// Sender side

MPI_Send(data, count, MPI_FLOAT, omp_get_thread_num(), drank, comms[dtid]);

. . .

// Receiver side

MPI_Recv(data, count, MPI_FLOAT, stid, srank, tcomm, &status);

. . . }

(23)

MPI thread support @ BULL cluster

Thread support in various MPI libraries

Open MPI

 Use module versions ending with mt (e.g. openmpi/1.6.1mt

 *InfiniBand BTL disables itself when thread level is MPI_THREAD_MULTIPLE!

Intel MPI

requested Open MPI Open MPI mt Intel MPI w/o -mt_mpi

Intel MPI w/ -mt_mpi

SINGLE SINGLE SINGLE SINGLE SINGLE

FUNNELED SINGLE FUNNELED SINGLE FUNNELED

SERIALIZED SINGLE SERIALIZED SINGLE SERIALIZED

(24)

Interoperation with CUDA (experimental!)

Modern MPI libraries provide some degree of CUDA interoperation

 Device pointers as buffer arguments in many MPI calls

 Direct device to device RDMA transfers over InfiniBand

 Implemented by Open MPI and MVAPICH2

Experimental CUDA build of Open MPI 1.7 in the BETA modules

category

 You would need first to obtain access to the GPU cluster (ask RZ ServiceDesk)

gpu-cluster$ module load BETA

(25)

Running hybrid code with LSF

Sample hybrid job script (Open MPI)

#!/usr/bin/env zsh

# 16 MPI procs x 6 threads = 96 cores

#BSUB -x

#BSUB -a openmpi #BSUB -n 16

# 12 cores/node / 6 threads = 2 processes per node

#BSUB -R “span[ptile=2]”

module switch openmpi openmpi/1.6.1mt

# 6 threads per process

export OMP_NUM_THREADS=6

# Pass OMP_NUM_THREADS on to all MPI processes

$MPIEXEC $FLAGS_MPI_BATCH –x OMP_NUM_THREADS \ program.exe <args>

(26)

Running hybrid code with LSF

Sample hybrid job script (Intel MPI)

For the most up to date information refer to the HPC Primer

#!/usr/bin/env zsh

# 16 MPI procs x 6 threads = 96 cores

#BSUB -x

#BSUB -a intelmpi #BSUB -n 16

# 12 cores/node / 6 threads = 2 processes per node

#BSUB -R “span[ptile=2]”

module switch openmpi intelmpi

# 6 threads per process

# Pass OMP_NUM_THREADS on to all MPI processes

$MPIEXEC $FLAGS_MPI_BATCH –genv OMP_NUM_THREADS 6 \ program.exe <args>

(27)

Q & A

session

the

References

Related documents

Using the ne w keywo rd is no t limited to reference data types, and can be used with value data types; ho wever, as we've learned, value data types do no t require an

We further interrogate the impact of estrous cycle and introduction of the human vaginal pathobiont, group B Streptococcus (GBS) on community state type and stability, and

Franklin and Associates Ltd.’s study, Comparative Energy Evaluation of Plastic Products and Their Alternatives for the Building and Construction and Transportation Industries,

Planning for the 1999 Iowa Oral Health Survey began in the spring of 1999 and included personnel from the Dental Health Bureau of the Iowa Department of Public Health,

Cast iron motors in sizes 71-225 are greased for life and motors in sizes 250-355 have a regreasing device as a standard.. Strong

Both use stored energy sources to produce a large magnetic field and a high electric current through a driving armature.. The interaction of the current with the

• Speed of weaning: induction requires care, but is relatively quick; subsequent taper is slow • Monitoring: Urinary drug screen, pain behaviors, drug use and seeking,

(B) the adsorption at a single site on the surface may involve multiple molecules at the same time.. (C) the mass of gas striking a given area of surface is proportional to the