• No results found

McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC

N/A
N/A
Protected

Academic year: 2021

Share "McMPI. Managed-code MPI library in Pure C# Dr D Holmes, EPCC"

Copied!
24
0
0

Loading.... (view fulltext now)

Full text

(1)

McMPI

Managed-code MPI library in Pure C#

Dr D Holmes, EPCC

(2)

Outline

Yet another MPI library?

Managed-code, C#, Windows

McMPI, design and implementation details

Object-orientation, design patterns,

communication performance results

Threads and the MPI Standard

(3)

Why Implement MPI Again?

Parallel program, distributed memory => MPI library

• Most (all?) MPI libraries written in C

MPI Standard provides C and FORTRAN bindings

• C++ can use the C functions

Other languages can follow the C++ model

• Use the C functions

Alternatively, MPI can be implemented in that language

• Removes inter-language function call overheads but … • May not be possible to achieve comparable performance

(4)

Why Did I Choose C#?

Experience and knowledge I gained from my career in

software development

My impression of the popularity of C# in commercial

software development

My desire to bridge the gap between high-performance

programming and high-productivity programming

One of the UK research councils offered me funding for a

(5)

C# Myths

C# only runs on Windows

• Not such a bad thing – 3 of the Top500 machines use Windows • Not actually true – Mono works on multiple operating systems

C# is a Microsoft language

• Not such a bad thing – resources, commitment, support, training • Not actually true – C# follows ECMA and ISO standards

C# is slow like Java

• Not such a bad thing – expressivity, readability, re-usability • Not actually true – no easy way to prove this conclusively

C# and its ilk are not things we need to care about

• Not such a bad thing – they will survive/thrive, or not, without us • Not actually true – popularity trumps utility

(6)

McMPI Design & Implementation

Desirable features of code

• Isolation of concerns -> easier to understand

• Human readability -> easier to maintain

• Compiler readability -> easier to get good performance

Object-orientation can help with isolation of concerns

• So can modularisation and judiciously reducing LOC per code file

Design patterns can help with human readability

• So can documentation and useful in-code comments

Choice of language & compiler can help with performance

• So can coding style and detailed examination of compiler output

(7)

Communication Layer

Abstract class factory design pattern

• Similar to plug-ins

• Enables addition of new functionality without re-compilation of the rest of the library

All communication modules:

• Implement the same Abstract Device Interface (ADI)

• Isolate the details of their implementation from other layers

• Provide the same semantics and capabilities • Reliable delivery

• Ordering of delivery

• Preservation of message boundaries

(8)
(9)

Protocol Layer

Bridge design pattern

• Enables addition of new functionality without re-compilation of the rest of the library

All protocol messages:

• Implement inherit from the same base class

• Isolate the details of their implementation from other layers

• Modify state of internal shared data structures independently

Shared data structures (message ‘queues’)

• Unexpected queue – message envelope at receiver before receive • Request queue – receive called before message envelope arrival • Matched queue – at receiver waiting for message data to arrive • Pending queue – message data waiting at sender

(10)
(11)

Interface Layer

Simple façade design pattern

• Translates MPI Standard-like syntax into protocol layer syntax

Will become adapter design pattern

• For example, when custom data-types are implemented

Current version of McMPI covers parts of MPI 1 only

• Initialisation and finalisation

• Administration functions, e.g. to get rank and size of communicator

• Point-to-point communication functions • ready, synchronous, standard (not buffered)

• blocking, non-blocking, persistent

• Previous version had collectives • Implemented on top of point-to-point

(12)
(13)

Performance Results

– Introduction 1

Shared-memory results – hardware details

• Number of Nodes: 1 Armari Magnetar server

• CPUs per Node: 2 Intel Xeon E5420

• Threads per CPU: 4 Quad-core, no hyper-threading

• Core Clock Speed: 2.5GHz Front-side bus 1333MHz

• Level 1 Cache: 4x2x32KB Data & instruction per core

• Level 2 Cache: 2x6MB One per pair of cores

• Memory per Node: 16GB DDR2 667MHz

• Network Hardware: 2xNIC Intel 82575EB Gigabit Ethernet

(14)

Performance Results

– Introduction 2

Distributed-memory results – hardware details

• Number of Nodes: 18 Dell PowerEdge 2900

• CPUs per Node: 2 Intel Xeon 5130 Fam 6 mod 15 step 6

• Threads per CPU: 2 Dual-core, no hyper-threading

• Core Clock Speed: 2.0GHz Front-side bus 1333MHz

• Level 1 Cache: 2x2x32KB Data & instruction per core

• Level 2 Cache: 1x4MB One per CPU

• Memory per Node: 4GB DDR2 533MHz

• Network Hardware: 2xNIC BCM5708C NetXtreme II GigE

(15)

Shared-memory – Latency

0 1 2 3 4 5 6 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 Late n cy ( µs )

Message Size (bytes)

MPICH2 Shared Memory MS-MPI Shared Memory McMPI thread-to-thread

(16)

Shared-memory – Bandwidth

0 10,000 20,000 30,000 40,000 50,000 60,000 70,000 4,096 8,192 16,384 32,768 65,536 131,072 262,144 524,288 1,048,576 B an d wi d th (M b it/ s)

Message Size (bytes)

McMPI thread-to-thread MPICH2 shared-memory MS-MPI shared-memory

(17)

Distributed-memory

– Latency

0 50 100 150 200 250 300 350 400 450 500 550 600 1 2 4 8 16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 Late n cy ( µs )

Message Size (bytes)

McMPI Eager MS-MPI

(18)

Distributed-memory

– Bandwidth

0 250 500 750 1,000 4,096 8,192 16,384 32,768 65,536 131,072 262,144 524,288 1,048,576 B an d wi d th (M b it/ s)

Message Size (bytes)

McMPI Rendezvous McMPI Eager MS-MPI

(19)

Thread-as-rank – Threading Level

McMPI allows MPI_THREAD_AS_RANK as input for the

MPI_INIT_THREAD function

McMPI creates new threads during initialisation

• Not needed – MPI_INIT_THREAD must be called enough times

McMPI uses thread-local storage to store ‘rank’

• Not needed – each communicator handle can encode ‘rank’

Thread-to-thread message delivery is zero-copy

• Direct copy from user send buffer to user receive buffer

(20)

Thread-as-rank – MPI Process

Diagram created by Gaurav Saxena MSc, 2013

(21)

Thread-as-rank – MPI Standard

Is thread-as-rank compliant with the MPI Standard?

Does the MPI Standard allow/support thread-as-rank?

• Ambiguous/debatable at best

The MPI Standard assumes MPI process = OS process

• Call MPI_INIT or MPI_INIT_THREAD twice in one OS process • Erroneous by definition or results in two MPI processes?

MPI Standard “thread compliant” prohibits thread-as-rank

• To maintain a POSIX-process-like interface for MPI process • End-points proposal violates this principle in exactly the same way

(22)

Thread-as-rank – End-points

Similarities

• Multiple threads can communicate reliably without using tags

• Thread ‘rank’ can be stored in thread-local storage or handles

• Most common use-case likely requires MPI_THREAD_MULTIPLE

Differences

• Thread-as-rank part of initialisation and active until finalisation

• End-points created after initialisation and can be destroyed

• Thread-as-rank has all possible ranks in MPI_COMM_WORLD

• End-points only has some ranks in MPI_COMM_WORLD

• Thread-as-rank cannot create ranks but may need to merge ranks

(23)

Thread-as-rank

– MPI Forum Proposal?

Short answer: no

Long answer: not yet, it’s complicated

• More likely to be suggested amendments to end-points proposal

Thread-as-rank is a special case of end-points

• Standard MPI_COMM_WORLD replaced with an end-points communicator during MPI_INIT_THREAD

• Thread-safety implications are similar (possibly identical?)

• Advantages/opportunities similar

• Thread-to-thread delivery rather than process-to-process delivery

(24)

Figure

Diagram created  by Gaurav Saxena  MSc, 2013

References

Related documents

DharmaFECT Duo transfection reagent achieves optimal co-transfection results - effective plasmid delivery (high grey bars) and target mRNA knockdown (low white bars) when

CSD will review complaints from homeowners who have had rehabilitation work performed under the Harris County Home Repair Program. A determination will be made by HCIS whether

executives. On March 5, 2003, Commissioner Larson denied the petition by CareFirst to convert and to be acquired by Wellpoint. Both the District of Columbia and Delaware

Based on the analysis security situation on smart mobile terminals, the key security issues of smart mobile terminals are analyzed detailed: program security analysis, data

Although various mechanisms exist for protecting private keys, most of these require interaction with the user and so are not suitable for use in the smart grid environment due to

Open Source software projects invite computer programmers from around the world to view software code and make changes and improvements to it. Through such collaboration,

However, they vary a lot from the perspectives of grouping factors (e.g. personalities and team roles), grouping criteria (e.g. ability heterogeneous groups and

REVIEW AND DOWNLOAD THIS COMPLETE USER GUIDE OR TROUBLESHOOTING SECTION ANATOMY PHYSIOLOGY ANSWER MEDICAL TERMINOLOGY, TO PROVIDES THE ANSWER AND THEN.. ANY