2.3 Resilience Support in Distributed Programming Models
2.3.3 The Asynchronous Partitioned Global Address Space Model
The coarse-grain parallelism model provided by MPI and PGAS, although suited for many regular HPC applications, is highly limited in exploiting the available fine-grain parallelism in mainstream architectures. The Asynchronous Partitioned Global Address Space (APGAS) model addresses this limitation by extending the
PGASmodel with general task parallelism. Each task has an affinity to a particular partition; however, it can spawn tasks at all other partitions. In the following, we describe two widely-known APGAS languages: Chapel and X10. Both languages
initially emerged as part of DARPA’s High Productivity Computing Systems (HPCS) project, which aimed to advance the performance, programmability, portability and robustness of high-end computing systems.
2.3.3.1 Chapel
Chapel [Chamberlain et al., 2007] is an object-orientedAPGASlanguage developed by Cray Inc. It provides a uniform programming model for programming shared- memory systems and distributed-memory systems. For the latter, it supports a num- ber of communication libraries, such asGASNetand Cray’s user Generic Network Interface (uGNI), for performing data and active-message transfer operations.
Chapel’s execution model is organized around a group of multithreadedLocales, on which tasks and global data structures can be created. It provides thedomaindata type for describing the index set of an array, which can be dense, sparse, or in any other user-defined format. Adomain mapdescribes how the domain is mapped to a group of locales. Assignment statements involving remote data translate implicitly to one-sided get/put operations. For locality control, the on construct is provided to determine the locale on which a particular active message should execute. In addition to locality control and global-view data-parallelism, Chapel provides simple abstractions for expressing task-parallelism.
Task parallelism is supported by the following constructs, which can be composed flexibility to express nested parallelism:
• begin: spawns a new task that can execute in parallel with the current task. • cobegin: spawns a group of tasks, one for each statement in the cobegin scope,
and waits for their termination.
• (co)forall: spawns a group of parallel tasks for processing the iterations of a loop, and waits for their termination.
• syncstatement: waits for all dynamically created begins within its scope. In addition to thesyncstatement described above, Chapel provides asyncvariable type that can also be used for synchronization. Async variable has a value and a state that can be either full orempty. Reading or writing a syncvariable can cause the enclosing task to block depending on the state of the variable [Hayashi et al., 2017]. For example, reading an empty variable or writing to a full variable will block execution until the state changes to the opposite state. Coordinating the different tasks can therefore be achieved by orchestrating their access to sharedsyncvariables. Currently, Chapel is not resilient to process failures. A recent research effort byPanagiotopoulou and Loidl[2015] targeted the challenges of transparently recov- ering the control flow of a Chapel program when locales fail. A copy of each remote task is replicated at the parent locale that spawned the task. When a locale fails, other locales identify the lost tasks and re-execute them locally. The proposed design is limited to tasks that have no side effects. Their future work plans include handling
§2.3 Resilience Support in Distributed Programming Models 27
general tasks that alter the data. GASNetwas used as the underlying communication layer for Chapel. BecauseGASNetis not resilient, this study simulated the existence of failures rather than actually killing locales.
2.3.3.2 X10
X10 [Charles et al.,2005] is an object-orientedAPGAS language developed by IBM. The language syntax is based on the Java language, with the addition of new con- structs for expressing task parallelism and global data. Two runtime implementations are available for X10: native X10 (executes X10 programs translated into C++ code) andmanagedX10 (executes X10 programs translated into Java code).
Aplaceis the locality unit in X10. The program starts from a root task at the first place and evolves by dynamically creating more tasks at the different places. The programmer specifies the place where a certain task should execute using the at construct. X10 supports nested task parallelism using the async-finish model. The asyncconstruct is similar to Chapel’sbeginconstruct; it creates an asynchronous task at the current place. Thefinish construct is similar to Chapel’ssync statement; it waits for the termination of all asyncs spawned dynamically within its scope. Using async, finish, and at, the programmer can express not only simple fork-join task graphs but also more complex task graphs with arbitrary synchronization patterns.
For handling global data, X10 provides aglobal referencetype that carries a globally unique address for an object; it is similar to the shared pointer primitive of UPC. However, unlike allPGASmodels we described above in which a global reference can be used for transferring data implicitly between processes, X10’s global references do not permit any implicit communication. To access an object using its global reference, a task must be created at the home place of the object where the object will be accessed locally. Otherwise, bulk-transfer functions are provided in the standard library to transfer arrays using their global references. The runtime system can implement these functions via one-sided put/get operations or normal two-sided communications (possibly at a higher performance cost). In all cases, data transfer is explicit, and the cost of communication is obvious in the program.
Recently, the X10 team at IBM has focused on improving the language’s resilience to process fail-stop failures [Cunningham et al.,2014]. They adopted a flexible user- level fault tolerance approach that enables the development of generic as well as algorithmic-based fault tolerance techniques at the application level. Native X10, which is designed forHPC, is currently limited to fixed resource allocation. However, spare places can be allocated in advance for supporting non-shrinking recovery.
Hao et al. [2014b] describe X10-FT, an extension of X10 that supports transparent disk-based checkpointing. The compiler is modified to insert checkpointing instruc- tions at program synchronization points. The places are organized hierarchically; when a place fails, its parent place creates a replacement place and initializes it using the latest disk checkpoint. The PAXOS protocol is used for failure detection (via heartbeating) and for reaching consensus on the status of places.