2.4.5.2 Place Virtualization
Because the PlaceManager preserves the order of the active places throughout the execution, the index of a place in the activePlaces group can be used in the program as a virtual identifier for the place, in place of the physical place id. This simple coding strategy keeps the changes needed to add resilience to existing source code minimal. We follow this strategy in the implementation of all resilient applications presented in the thesis.
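The idea can be illustrated with a small sketch. This is a toy model in Python, not the X10 PlaceManager API: the class, its methods, and the assumption that a spare place takes over the slot of a failed place are all hypothetical and chosen only to show why a stable index works as a virtual identifier.

```python
class ToyPlaceManager:
    """Toy stand-in for a place manager (hypothetical, not the X10 API)."""

    def __init__(self, active, spares):
        # Physical place ids; the ORDER of this list is what must be preserved.
        self.active_places = list(active)
        self.spares = list(spares)

    def physical(self, virtual_id):
        """Translate a virtual id (an index) into the current physical place id."""
        return self.active_places[virtual_id]

    def replace_dead(self, dead_physical_id):
        """Assumption: a spare takes over the dead place's slot, so the index is unchanged."""
        idx = self.active_places.index(dead_physical_id)
        self.active_places[idx] = self.spares.pop(0)
        return idx


pm = ToyPlaceManager(active=[0, 1, 2, 3], spares=[4])
owner_of_block = 2                  # application code stores the virtual id only
before = pm.physical(owner_of_block)
pm.replace_dead(2)                  # physical place 2 fails
after = pm.physical(owner_of_block)
```

Application code that stores only `owner_of_block` needs no change after the failure; only the translation to a physical place differs.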
2.5 Summary
This chapter presented a taxonomy of resilient programming models and a broad overview of resilience features in different programming models. It described the rationale behind the choice of the APGAS model for supporting multi-resolution resilience and described the details of the X10 programming model. In the following chapter, we describe how we improved the performance scalability of RX10 using MPI-ULFM.
Chapter 3
Improving Resilient X10 Portability and Scalability Using MPI-ULFM
This chapter describes how we enabled RX10 to scale to thousands of cores in supercomputer environments using MPI-ULFM. It details the matching semantics between MPI-ULFM and the transport layer of RX10. It also describes the methods used to deliver global failure detection for resilient finish, and to speed up the performance of X10’s fault-tolerant collectives. We evaluate our implementation using microbenchmarks that demonstrate the efficiency and scalability advantages that RX10 applications can achieve by switching from the socket transport to the MPI-ULFM transport. The experiments focus on failure-free scenarios only. We evaluate the performance of applications in failure scenarios in Chapter 5.
After the introduction in Section 3.1, we describe how X10 uses MPI in non-resilient mode in Section 3.2. Then we describe the MPI-ULFM specification in Section 3.3. After that we explain how we integrated MPI-ULFM with the X10 runtime system in Section 3.4. We conclude with the performance evaluation in Section 3.5.
This chapter is based on work published in the paper “Resilient X10 over MPI user level failure mitigation” [Hamouda et al., 2016].
3.1 Introduction
MPI is the standard communication API on supercomputers. Compared to other communication APIs, such as TCP sockets or Infiniband verbs, MPI is simpler to use for writing distributed programs. For example, MPI transparently establishes the connections between processes, assigns a rank to each process, and maps each rank to its IP address and port on behalf of the programmer. In contrast, a socket API would require the programmer to handle these details manually. MPI implementations are portable over different networking hardware, and they often strive to make the best use of the available hardware capabilities to provide optimized point-to-point and collective operations. While supercomputers are dominated by high-performance computing applications following MPI’s SPMD model, cloud environments have more diverse applications with different networking requirements. Cloud applications are more likely to use lower-level communication APIs than MPI to directly establish, monitor, and destroy the connections between processes. TCP socket APIs are commonly used for inter-process communication in cloud environments. When a process fails, connections to that process are invalidated and errors are raised at the other end of the connection. Based on this failure notification capability, fault-tolerant applications and runtime systems can be built on top of sockets.
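The failure notification that sockets provide can be sketched in a few lines. The example below is an illustration only, not RX10 code: `peer_failed` is a hypothetical helper, a local `socket.socketpair()` stands in for a connection between two processes, and closing one end stands in for the remote process dying. In practice the surviving end may see either an end-of-stream or a connection error, so the sketch treats both as a failure signal.

```python
import socket


def peer_failed(conn, timeout=0.1):
    """Return True if the other end of `conn` appears to be gone.

    Peeks at the stream without consuming data. An empty read means the
    peer closed the connection; a ConnectionError means it was reset.
    """
    conn.settimeout(timeout)
    try:
        return conn.recv(1, socket.MSG_PEEK) == b""
    except socket.timeout:
        return False          # peer is alive, it just has nothing to say
    except ConnectionError:
        return True


# Two connected endpoints standing in for two processes.
local_end, remote_end = socket.socketpair()
alive_before = not peer_failed(local_end)
remote_end.close()            # simulate the remote process failing
dead_after = peer_failed(local_end)
local_end.close()
```

A resilient runtime built on sockets can run a check of this kind per connection and translate a positive result into a place-failure event.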
X10 can be used for developing high-performance computing applications as well as cloud-based services. It supports both MPI and TCP sockets for inter-process communication. Figure 3.1 shows the layered architecture of the X10 runtime. The middle layer, X10 Runtime Transport (X10RT), is an abstract transport layer that delegates communication requests from the runtime to one of the concrete transport implementations. The latest version of X10 at the time of writing (v2.6.1) provides three X10RT implementations: a TCP sockets implementation, an MPI implementation, and an implementation for the Parallel Active Message Interface (PAMI), IBM’s proprietary API for inter-process communication on supercomputers.
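The delegation pattern described here can be sketched as follows. This is a simplified illustration under stated assumptions, not the X10RT interface: every class and method name is hypothetical, and `send` returns a trace string in place of performing real communication.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Interface every concrete transport must provide (hypothetical)."""

    @abstractmethod
    def send(self, dest, payload):
        """Deliver `payload` to `dest`; here it only returns a trace string."""


class SocketsTransport(Transport):
    def send(self, dest, payload):
        return f"sockets -> place {dest}: {payload}"


class MpiTransport(Transport):
    def send(self, dest, payload):
        return f"mpi -> rank {dest}: {payload}"


class AbstractRuntimeTransport:
    """Middle layer: the runtime calls this, never a concrete transport."""

    def __init__(self, impl):
        self.impl = impl

    def send(self, dest, payload):
        # Pure delegation: the runtime code is identical for every transport.
        return self.impl.send(dest, payload)


trace = AbstractRuntimeTransport(MpiTransport()).send(3, "ping")
```

Because the runtime depends only on the abstract layer, adding or changing a transport does not require changes to the layers above it.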
The X10 runtime executes the tasks using a pool of worker threads. The communication activities of the tasks are performed by the same worker threads. To avoid blocking the worker threads, which may lead to deadlocks, the X10 runtime requires non-blocking communication support from the underlying transport layer.
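One way to picture this requirement is a worker that polls for messages between tasks and never waits on the network. The sketch below is a single-threaded toy, not the X10 scheduler: `worker_loop`, the task deque, and the use of `queue.Queue` as a stand-in for the transport are all assumptions made for illustration.

```python
import queue
from collections import deque


def worker_loop(tasks, inbox, max_steps=100):
    """Run local tasks, polling `inbox` without blocking between them."""
    log = []
    for _ in range(max_steps):
        try:
            # Non-blocking receive: returns a message or raises immediately.
            log.append(("recv", inbox.get_nowait()))
        except queue.Empty:
            pass              # nothing arrived; keep the worker busy instead
        if tasks:
            log.append(("run", tasks.popleft()()))
        elif inbox.empty():
            break             # no tasks and no messages left
    return log


inbox = queue.Queue()
inbox.put("hello")
log = worker_loop(deque([lambda: 1, lambda: 2]), inbox)
```

With a blocking receive in place of `get_nowait`, a worker with no incoming message would stall and the remaining tasks would never run, which is the kind of situation that can lead to deadlock when those tasks are the ones that would produce the message.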
Figure 3.1: X10 layered architecture. Layers shown: X10 Runtime (task and memory management, local and remote); X10RT (abstract runtime transport); X10RT-Sockets, X10RT-MPI, and X10RT-PAMI (concrete transports).
The initial development of RX10 was primarily concerned with bringing X10 to cloud computing environments. Therefore, it made resilient only the portions of the X10 runtime stack that were appropriate for a cloud computing environment, and only X10RT-Sockets was modified to survive place failure. X10RT-Sockets opens Secure Shell (SSH) connections to launch processes at remote nodes, a mechanism that is often not permitted by supercomputers for security reasons. On the other hand, the majority of supercomputers permit launching processes using MPI. Because neither MPI nor PAMI is fault-tolerant, RX10 was not portable to supercomputers. Experimental evaluations ran on small clusters and scaled to only a few hundred X10 places.
While RX10 was being implemented, the MPI FTWG and members of the Innovative Computing Laboratory at the University of Tennessee were actively developing MPI-ULFM [Bland et al., 2012a,b, 2013], a fault tolerance specification for MPI. The proposed specification is planned to be part of the upcoming MPI-4 [2018]