Message-passing over shared memory for the DECK programming environment

(1)

Message-passing over shared memory for the DECK programming environment

Rafael B. ´

Avila

Caciano Machado

Philippe O. A. Navaux

Parallel and Distributed Processing Group

Instituto de Inform´atica — UFRGS

Caixa Postal 15064

91501-970 Porto Alegre — Brazil

Phone: +55 51 3316-6165 Fax: +55 51 3316-7308

E-mail:

avila,caciano,navaux

@inf.ufrgs.br

Abstract

Message-passing is a representative communication model in today’s parallel and distributed programming, and should be efficiently supported even for multithreaded-only parallel programs. This papers describes the design and implementation of a communication mechanism which em-ulates message passing on top of shared memory for mul-tithreaded applications. The mechanism is implemented in the DECK parallel programming environment and a perfor-mance analysis is presented.

Keywords: cluster computing, multithreading, parallel programming environment, message passing, shared mem-ory.

1. Introduction and context

Message passing has been established, in the last years, as a de facto standard in parallel and distributed program-ming, mainly due to the popularization of workstation net-works, or clusters, as parallel machines, programmed via libraries such as PVM [5] or MPI [11]. Though more “diffi-cult” to use than shared memory, people have become ac-costumed to employing it. On the other hand, consider-ing the availability of the not less popular SMP machines, featuring 2, 4 or even 8 processors in a single PC, multi-threaded programming [10] is also significant and has be-come a key point to achieve top performance when combin-ing computation and communication.

This work was partially supported by project FINEP/CTPetro no. 65.00.0363.00 and by grants from CNPq and CAPES.

PhD student

Undergraduate student, research assistant

Professor, Dr. (Institut National Polytechnique de Grenoble, France, 1979)

Our work is motivated by the fact that, even if shared memory is available for communication betwen threads, message passing might still be attractive, since program-mers are used to it. Also, existing applications with dis-tributed nature could be easily ported to a multithreaded en-vironment.

But if the message passing subsystem is not aware of the possibility of multithreading, the resulting performance may be a disaster. Since communication must take place locally, it is desirable that shared memory be used for com-munication, instead of going the normal way through the operating system down to the network card and bouncing back to the application.

For these reasons, we present in this paper a mecha-nism for message passing emulated in the local memory of a node, for communication between local threads. More-over, to the difference of existing implementations of such functionality, our implementation is based on chained lists of messages, which presents interesting results. This mech-anism has been incorporated in the DECK parallel program-ming library, and a performance analysis has been made.

The paper is organized as follows: Section 2 presents an overview of DECK and its main features; Section 3 de-scribes the implementation of MP on top of SHMEM and Section 4 presents an evaluation of it; finally, Section 5 poses some comparisons to similar implementations and Section 6 concludes the paper.

2. The DECK environment

DECK—Distributed Execution and Communication Kernel [2]— is a programming environment aimed at clus-ter computing. It features the most common resources for parallel programming and supports a variety of communi-cation technologies such as Myrinet [3] and SCI [9] and respective protocols.

(2)

thread semaph msg mbox shmem RCD naming FT group sched

uDECK services

Figure 1. Internal structure of DECK.

The API of DECK provides 5 abstractions that we consider basic for any parallel application: threads,

mu-texes, semaphores, mail boxes and messages.

Addition-ally, DECK still provides some features which depend ex-clusively on the basic abstractions, and thus are platform-independent. Currently, such features include collective communication, multithreaded servicing (pool of threads) and condition variables. These features are internally or-ganized in a two-layer structure as shown in Figure 1. The basic abstractions form the lower layer, which we call

DECK. The other features form the upper layer and are selected at compile time, possibly being empty.

Threads and synchronization are based on the stan-dard POSIX Threads implementation [10], and make di-rect use of its functions. Traditional primitives such as

thread create() and mutex lock() are provided. The idea

in DECK, however, is to minimize the complexity of POSIX Threads’ functions. For example, threads in DECK only exist in the attached state.

Inter-node communication in DECK is based on a

mail box abstraction. In order to communicate with another

node, a thread must own a reference to a remote mail box and post a message in it. Such message is later retrieved on the remote node by the thread owning the mail box. Mes-sage posting has always asynchronous (non-blocking) se-mantics (except for flow-control constraints), and message retrieving is always synchronous (i.e. it blocks the calling thread until the message arrives.)

Communication is the most active development part of DECK. Since its first implementation for UDP sockets, in 1999, DECK has been ported to TCP, BIP [13] for Myrinet, and SISCI [7] for SCI. All of these implementations are freely available for download from the project’s homepage at gppd.inf.ufrgs.br/projects/mcluster. Results on the per-formance of DECK for these systems can be found in [1] and [4].

3. Message passing over shared memory in

DECK

As stated before, the goal of our work is to use mem-ory as a channel for the exchange of messages between threads. This situation is illustrated on Figure 2. In order

1 − 2 −

Posting a message to the memory Retrieving a message from the memory Message Thread A Posting Message Memory Shared Message Thread B Retrieving 2 1

Figure 2. Exchanging messages by means of shared memory

to implement this mechanism on DECK, we have desinged new post() and retrieve() functions that manipulate mes-sages exclusively through shared memory. These new rou-tines are coupled in the original post and retrieve primi-tives, which have been adapted to recognize when a thread sends/receives a message to/from another thread in the same node and, in this case, activate the local-delivery semantics. To the difference of existing systems, the proposed shared-memory message passing mechanism was imple-mented using FIFO queues. The implementation uses a mechanism that dynamically allocates memory for mes-sages in the post operations, and another mechanism to take advantage of the already allocated message structures and buffers to minimize memory allocation operations. This is useful because of the potentially heavy dynamic mem-ory allocation process. As a consequence, it is best suited for applications that present a “pattern” in the posting and retrieving of messages and their sizes. If the application does not vary the size of messages too often, the realloca-tion operarealloca-tions will be minimized and performance will be improved.

The implemented mechanism dynamically manages two message queues, for posted and retrieved messages respec-tively, for each mail box referenced at least once. This means that when the application is started the mail box list of the node is empty, and only when a message is firstly posted to a mail box a descriptor for it will be created.

The mail box structure is a descriptor which contains in-formation about the message FIFOs, along with the name of the mail box and a link to the next mail box in the list. The two queues are used for posted and retrieved messages. When a message is posted, it is added to the posted queue, being later removed when the corresponding retrieve occurs (Figure 3). The retrieved queue is used to take advantage of the already allocated message buffers. When a message is retrieved its message structure and buffer is moved to the end of the retrieved queue in order to be used by subsequent post operations.

All the mail boxes and message structures are accessed by a global reference to the mail box list. Concurrency for

(3)

!! !! "" ## $$%% & ' ( ) * * + + , , -Message queue in Mailbox Message Mailbox list Message queue

Message queue out

Figure 3. FIFO queues used for message han-dling

the structures access is controlled by DECK mutexes (with spinning implementation).

The representation of messages in memory consists of a dynamic allocated structure which contains control infor-mation and a buffer which contains the data to be exchanged (payload). Furthermore, the message structure contains a link to the next message on the FIFO or a NULL value if it is the last in the queue.

Here is a simple description of the routines work:

.

Post

1. First the deck mbox post() function tests if the destination of the message is in a thread on the local node and, if true, the shared memory rou-tines are called

2. The local post function tests if the target mail box exists and if not a new mail box is created in the mail box list

3. The function tests if the mail box has message structures in the retrieved queue, and uses the first of the queue if it exists, otherwise a dynamic allocation is needed

4. The data and control information of the message are copied to the shared memory message object 5. The message object is linked in the message

queue.

.

Retrieve

1. As in the case of the post function,

deck mbox retrv() tests if the sender of the

message is in the local node

2. The function tests if the mail box exists and if not it waits for a post operation that will create the mail box and post the message

3. If the mail box has message structures in the mes-sage queue, the function recovers the data associ-ated with the first message of the queue if it exists 4. The data and control information are copied to

the user message

5. The retrieved message object is linked in the re-trieved message queue to be used by other post operations.

4. Evaluation

In order to evaluate the performance of the DECK shared-memory message passing functions, we have com-pared the exchange of messages of different sizes in three situations:

.

between two DECK threads using the implementation just described

.

between two DECK threads using our TCP implemen-tation

.

between two MPI processes on the same node

In the case of MPI we had to make use of two processes since the library does not natively support threads. On all experiments we have used a Dual Pentium III 500MHz with 256M of RAM. The operating system is GNU/Linux with kernel 2.4.12 and glibc 2.2.3, and the MPI implementation is MPICH 1.2.2. Additionally, every result presented has been taken as a mean of at least 1000 repetitions.

We have measured the raw performance of the three im-plementations for the exchange of messages varying from 1 byte up to 8 Mbytes. Figure 4 presents the obtained perfomance in terms of latency (measured as half of the round-trip time). These results show a significant overall improvement in our implementation in relation to TCP or MPI. DECK has been able to reach only 5

s latency, against 50

s of TCP and 70

s of MPI. We understand that the gain in performance comes from the use of threads, not possible in MPI, and the reuse of buffers in the message queues, to our knowledge not used in neither MPI or the TCP implementation of the Linux kernel. It is also possible to notice that local TCP message exchanges in the kernel are greatly optimized in the 2.4 series. We have made ex-periments with 2.2 kernels and obtained results that are at least two orders of magnitude higher. In that case, the per-formance obtained by the local post and retrieve functions is even more significant.

(4)

20 40 60 80 100 120 1 2 4 8 16 32 64 128 256 512 1K 2K 4K

Latency (in microseconds)

Message size (in bytes, log2 scale)

DECK−ShMem Benchmarking − Latency (Biprocessed machine) ShMem

TCP−IP MPI

Figure 4. Results for message exchange la-tency. 10M 20M 30M 40M 50M 60M 70M 80M 90M 100M 110M 120M 130M 140M 150M 2 8 32 128 512 2K 8K 32K 128K 512K 2M 8M

Bandwidth (in bytes per second)

Message size (in bytes, log2 scale)

DECK−ShMem Benchmarking − Bandwidth (Biprocessed machine) ShMem

TCP−IP MPI

Figure 5. Results for bandwidth.

The benefits from the shared-memory message passing functions are more evident when we analyze the achiev-able bandwidth, shown in Figure 5. The gain in perfor-mance is specially evident for message sizes between 32 and 64 Kbytes. The peak bandwidth is achieved for mes-sages of 32 Kbytes, delivering about 125 Mbytes/s. The peak values for TCP and MPI are respectively 112 and 52 Mbytes/s. We believe the same reasons for bad perfor-mance apply here, specially in the case of MPI.

5. Related work

Two available implementations of similar mechanisms for memory-based communication have influenced our work more directly, which we present in this Section.

The first one is the ch shmem implementation of MPICH [12]. This is a device layer for shared memory mes-sage exchange between MPI processes running on the same node. The implementation uses three different protocols for

message passing, depending on the message size. The first protocol, dedicated to small messages, uses a pre-allocated area in memory to the send and receive operations. The sec-ond protocol is responsible for medium-size messages and uses a dynamic allocation mechanism for data manipula-tion. The last protocol, used for large messages, imposes a

rendez-vous semantics for the send and receive operations,

copying the message directly to the destination buffer. Our implementation is similar to the second protocol of MPICH, but the difference lies on the fact that message buffers are reused in DECK, and in MPI they are freed af-ter the corresponding receive is performed. This behaviour imposes a higher overhead, but may be more balanced for applications with irregular communication pattern.

The second implementation is BIP-SMP [6], of the Uni-versity of Lyon, France. BIP-SMP is a protocol for intra-node communication in clusters of SMPs. The interesting characteristic of this implementation is the protocol used for the exchange of large messages. The exchange of small messages is similar to that used in DECK and the MPICH ch shmem. The data exchange of large messages is done by direct memory copy, i.e., the message is copied from the user space of the sender process to the user space of the receiver process. This is possible because the researchers have implemented a kernel module for Linux that allows data exchange between the processes of a node.

In their article the authors show many performance re-sults for BIP, MPI and other message passing libraries like GM [8] employing the mentioned protocol and specific im-plementations. A mostly interesting result, in relation to the work presented here, is mentioned for the ch lfshmem de-vice, designed for MPICH, where latency and bandwidth reach values of 2.4

s and 100 MBytes/s respectively. Pure BIP-SMP is capable of reaching 1.8

s and 160 Mbytes/s between two processes on a SMP node.

6. Conclusions and future directions

The performance obtained by the message passing emu-lation mechanism presented in this paper has been consid-ered very good, being able to present results which are com-parable and even outperform those presented by traditional programming interfaces such as TCP and MPI. For exam-ple, in the case of latency, the results obtained for DECK represent 10% of TCP and only 7% of that of a traditional MPI implementation. Similarly, we have shown that other research projects have been able to achieve even better re-sults when using dedicated features. As a general conclu-sion, and for the reasons exposed in the introduction, pos-ing the importance of message passpos-ing primitives in parallel and distributed programming, we have considered this work a significant contribution to the DECK environment.

(5)

memory based communication mechanism for DECK pro-cesses running on the same SMP node, similarly to that of MPICH’s ch shmem. Also, we plan to run some experi-ments comparing the performance of DECK with that of local communication when using specific communication libraries such as BIP and SISCI.

7. Acknowledgements

The authors would like to thank Maur´ıcio Pilla for his valuable observations on process and thread scheduling.

References

[1] M. Barreto, R. ´Avila, R. Cassali, A. Carissimi, and P. Navaux. Implementation of the DECK environment with BIP. In Proc. of the 1st Myrinet User Group Conference, pages 82–88, Lyon, France, 2000. Lyon, INRIA Rocquen-court.

[2] M. E. Barreto. DECK: Um ambiente para programação par-alela em agregados de multiprocessadores. Master’s the-sis, Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, 2000.

[3] N. Boden et al. Myrinet: a gigabit-per-second local-area network. IEEE Micro, 15(1):29–36, Feb. 1995.

[4] F. A. D. de Oliveira, R. B. ´Avila, M. E. Barreto, P. O. A. Navaux, and C. De Rose. DECK-SCI: High-performance communication and multithreading for SCI clusters. In D. S. Katz, T. Sterling, M. Baker, L. Bergman, M. Paprzycki, and R. Buyya, editors, Proc. of the 3rd IEEE International Conference on Cluster Computing, pages 372–379, Newport Beach, CA, 2001. Los Alamitos, CA, IEEE Computer Soci-ety.

[5] A. Geist et al. PVM: Parallel Virtual Machine. MIT Press, Cambridge, 1994.

[6] P. Geoffray, L. Prylli, and B. Tourancheau. BIP-SMP: High performance message passing over a cluster of commodity SMPs. In Proc. of SuperComputing’99, 1999.

[7] F. Giacomini, T. Amundsen, A. Bogaerts, R. Hauser, B. D. Johnsen, H. Kohmann, R. Nordstrøm, and P. Werner. Low-level SCI software requirements, analysis and predesign. Technical report, ESPRIT Project 23174 — Software Infras-tructure for SCI (SISCI), May 1998.

[8] GM. Available at http://www.myri.com/GM, Dec. 1999. [9] H. Hellwagner and A. Reinefeld, editors. SCI: Scalable

Coherent Interface: Architecture and Software for High-Performance Compute Clusters, volume 1734 of Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1999. [10] IEEE. Information technology—portable operating system

interface (POSIX), threads extension [C language]. IEEE 1003.1c-1995, 1995.

[11] MPI Forum. The MPI message passing interface standard. Technical report, University of Tennessee, Knoxville, Apr. 1994.

[12] B. V. Protopopov. Comparison of designs of shared memory devices for mpich, 2002. Available at http://www.cs.msstate.edu// boris/papers/ShmemDev.ps.

[13] L. Prylli and B. Tourancheau. BIP: a new protocol designed for high performance networking on Myrinet. In J. Rolim, editor, Parallel and Distributed Processing — Workshop on Personal Computer Based Networks of Workstations, vol-ume 1388 of Lecture Notes in Computer Science, pages 472– 485. Berlin, Springer-Verlag, 1998.