2.4 Libraries Used by our Implementation
2.4.3 Libfabric
OFI (OpenFabrics Interfaces) is a framework focused on exporting fabric⁵ communication services to applications. Libfabric [97] is a core component of OFI that defines the user API, enabling a tight semantic link between applications and underlying fabric services. More specifically, libfabric software interfaces have been co-designed with hardware providers (the bottom layer of the stack in Fig. 2.4) with the goal of giving HPC users and applications access to different hardware.
⁵ Fabric is an industry term to denote a network of interconnected devices in a tightly coupled environment.
FIGURE 2.4: OFI interfaces overview [97].
A distinguishing feature of libfabric is that it is agnostic with respect to the underlying hardware provider, thus allowing programmers to write applications that can exploit any supported hardware. Based on this principle, we use the libfabric API for implementing the communication among GAM executors. In this setting, considering Fig. 2.4, the GAM runtime sits among the “libfabric-enabled middlewares”, at the same level as MPI or UPC.
Libfabric provides two different APIs for transferring data among network nodes (the “Data Transfer Services” block on the right of Fig. 2.4): Message Queues and RMA. According to the former API, usually referred to as two-sided communication, nodes communicate via intermediate queues by means of send and receive primitives, as in any message-passing environment. With the latter API, usually referred to as one-sided communication, nodes exchange data by accessing memory locations from some shared space. For both APIs, libfabric enforces asynchronism by means of user-level notifications (the “Completion Services” block in the middle of Fig. 2.4), through which the user can query the runtime about the completion of issued data transfers, for instance, to safely reuse memory involved in transfers.
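The completion-based pattern described above can be illustrated with a minimal, self-contained C++ sketch. It is a toy model under stated assumptions, not the libfabric API: the names ToyEndpoint, post_send, progress, and poll_completion are hypothetical, and the "network" is simulated in memory. The sketch only shows the discipline that matters here: a posted transfer returns immediately, and the buffer is reused only after the matching completion has been observed.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical toy model (NOT the libfabric API): transfers are posted
// asynchronously and their completion is reported later on a queue.
struct Completion {
    std::uint64_t context;  // user-chosen tag identifying the transfer
};

class ToyEndpoint {
public:
    // Post a send; returns immediately. The buffer must not be reused yet.
    void post_send(const std::vector<char>* buf, std::uint64_t context) {
        pending_.push_back({buf, context});
    }

    // Simulate the network making progress on one pending transfer.
    void progress() {
        if (pending_.empty()) return;
        Pending p = pending_.front();
        pending_.pop_front();
        delivered_.emplace_back(*p.buf);      // data is read from the buffer here
        completions_.push_back({p.context});  // notify the user
    }

    // Non-blocking query of the completion queue; true if one was retrieved.
    bool poll_completion(Completion& out) {
        if (completions_.empty()) return false;
        out = completions_.front();
        completions_.pop_front();
        return true;
    }

    const std::vector<std::vector<char>>& delivered() const { return delivered_; }

private:
    struct Pending {
        const std::vector<char>* buf;
        std::uint64_t context;
    };
    std::deque<Pending> pending_;
    std::deque<Completion> completions_;
    std::vector<std::vector<char>> delivered_;
};

// Send a buffer, wait for its completion, then overwrite the buffer.
inline std::vector<char> send_then_reuse() {
    ToyEndpoint ep;
    std::vector<char> buf{'a', 'b'};
    ep.post_send(&buf, 42);

    Completion c{};
    // Nothing has completed yet, so buf must not be touched.
    assert(!ep.poll_completion(c));

    while (!ep.poll_completion(c)) ep.progress();
    assert(c.context == 42);

    buf = {'x', 'y'};  // safe: the transfer no longer references buf
    return ep.delivered().front();
}
```

In this sketch, overwriting buf before the completion is polled would change the data that the simulated transfer delivers, which is the hazard that completion notifications let the user avoid.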
In addition to asynchronous operations, libfabric focuses its support on HPC environments through a number of design choices, described in detail in the “High Performance Network Programming with OFI” guide [114]. Among these, we based the implementation of GAM topologies on connectionless communication (“Address Vectors” within the “Communication Services” block in Fig. 2.4), which targets large-scale environments by reducing the amount of memory required to maintain large address look-up tables and by eliminating expensive address resolution.
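To give an intuition of why an address table suits connectionless communication, the following C++ sketch models a table that stores each peer address once and hands back a compact index used for every later transfer. This is a simplified illustration under our own assumptions: ToyAddressVector, insert, lookup, and make_topology are hypothetical names, the "node-i" strings are placeholder addresses, and the sketch neither calls libfabric nor reproduces how the GAM runtime builds its topology.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model (NOT the libfabric API).
using PeerIndex = std::size_t;

class ToyAddressVector {
public:
    // Store a peer address once and return its compact index.
    PeerIndex insert(const std::string& raw_address) {
        table_.push_back(raw_address);
        return table_.size() - 1;
    }

    // Constant-time lookup at transfer time; no per-peer connection
    // state is created or consulted.
    const std::string& lookup(PeerIndex peer) const { return table_.at(peer); }

    std::size_t size() const { return table_.size(); }

private:
    std::vector<std::string> table_;
};

// Build the table for n executors, as a runtime might do once at startup.
inline ToyAddressVector make_topology(std::size_t n) {
    ToyAddressVector av;
    for (std::size_t i = 0; i < n; ++i) {
        PeerIndex idx = av.insert("node-" + std::to_string(i));  // placeholder
        assert(idx == i);  // indices are dense and assigned in insertion order
        (void)idx;
    }
    return av;
}
```

After startup, a sender only needs the small integer index of its peer; the per-peer cost is one table entry rather than a connection object.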
Summary
In this chapter, we provided a review of the most common parallel computing platforms and programming models for such platforms, with a focus on HPC environments. We also provided a brief review of parallel memory models, in particular SC, that we used to characterize the memory model proposed in this thesis. Finally, we described the libraries that we exploited in developing the contributions of this thesis, namely C++ smart pointers, the FastFlow framework for structured parallel programming, and the libfabric library for large-scale, high-performance networking.
Chapter 3
Global Asynchronous Memory
In this chapter, we present the first novel contribution in this thesis.
We introduce the GAM programming model, based on a memory space shared among a set of executors (i.e., a GAS). A GAM memory location is either public or private. Public memory is accessed in a single-assignment fashion, whereas private memory is accessed exclusively by the respective owner. Therefore, GAM programs are Data Race Free (DRF) by construction. By proposing GAM, we advocate trading off some expressiveness (GAM memory is more limited than an arbitrary load/store memory) in exchange for an efficient yet user-friendly memory consistency model (i.e., SC).
With respect to the categorization in Sect. 2.2, GAM is a shared-memory model, thus based on a shared address space. Moreover, GAM provides message-passing communication along with shared-memory primitives, by which executors exchange capabilities over memory locations, thus overcoming the traditional dichotomy between shared-memory and message-passing paradigms.
We materialize the proposed GAM model in a C++ library, implemented on top of libfabric (cf. Sect. 2.4.3) to target multiple networking hardware in the context of large-scale HPC environments.
This chapter proceeds as follows. In Sect. 3.1, we introduce GAM as an abstract model; in Sect. 3.2, we give an operational semantics for GAM programs. In Sect. 3.3, we discuss some aspects related to the parallel execution of GAM systems, including the GAM parallel memory model. Finally, in Sect. 3.4, we present the C++ library that we implemented based on the GAM abstract model.
3.1 System Model
A GAM system consists of a set $e_1, \ldots, e_n$ of executors issuing memory operations over a global address space. If a global address is mapped, it points to a memory slot of arbitrary size.
Moreover, each slot is either public or private, according to the associated access capability. A public slot can be accessed by any executor via load or store operations, although it cannot be updated once a value has been stored into it (i.e., GAM public slots are single-assignment). Conversely, a private slot can be accessed via load and store operations, but only by its owner, that is, the executor owning exclusive access capability over the slot.
A capability represents the way in which a given memory slot can be accessed by a given executor. For a public slot, a load-only capability is associated with some executors, whereas no executor has store capability on the slot. Conversely, for a private slot, a load-store capability is associated with exactly one executor, which owns exclusive access to the slot.

Operation   Meaning
map         Allocate a slot, either public or private
unmap       Free a slot
load        Retrieve the value stored in the slot
store       Store a value into the slot
pass        Transfer the slot capability to another executor
publish     Make the (private) slot public
TABLE 3.1: GAM memory operations.
In addition to memory access operations, executors may issue operations for managing capabilities, namely, pass and publish. When a slot is passed from an executor $e_i$ to another executor $e_j$, the associated capability is transferred to $e_j$. In the case of a public slot, $e_i$ also retains the read-only capability, whereas in the case of a private slot, the read-write capability is lost by $e_i$. Finally, a private slot may be published to make it public, whereas the converse operation is not possible. Table 3.1 summarizes the operations that may be issued by GAM executors.
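The capability rules above can be made concrete with a small C++ sketch. It is an illustrative toy, not the GAM library presented in Sect. 3.4: the class ToyGam, its method signatures, the integer-valued slots, and the sequential in-memory bookkeeping are all assumptions introduced here. In particular, we assume that a public slot keeps an explicit set of executors holding the load-only capability, and that unmap simply frees the slot without any capability check.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <stdexcept>

using Executor = int;
using Address = std::size_t;

// Hypothetical toy model of GAM slots and capabilities.
class ToyGam {
public:
    // map: allocate a slot holding `value`, either public or private.
    Address map(Executor e, int value, bool is_public) {
        Slot s{value, is_public, e, {}};
        if (is_public) s.readers.insert(e);
        Address a = next_++;
        slots_.emplace(a, s);
        return a;
    }

    // unmap: free a slot (assumption: no capability check).
    void unmap(Address a) { slots_.erase(a); }

    // load: needs load-only (public) or load-store (private) capability.
    int load(Executor e, Address a) const {
        const Slot& s = slots_.at(a);
        bool allowed = s.is_public ? s.readers.count(e) > 0 : s.owner == e;
        if (!allowed) throw std::logic_error("no load capability");
        return s.value;
    }

    // store: only the owner of a private slot may store.
    void store(Executor e, Address a, int value) {
        Slot& s = slots_.at(a);
        if (s.is_public || s.owner != e)
            throw std::logic_error("no store capability");
        s.value = value;
    }

    // pass: transfer the capability from one executor to another.
    void pass(Executor from, Executor to, Address a) {
        Slot& s = slots_.at(a);
        if (s.is_public) {
            if (s.readers.count(from) == 0)
                throw std::logic_error("no capability to pass");
            s.readers.insert(to);  // sender retains its read-only capability
        } else {
            if (s.owner != from)
                throw std::logic_error("no capability to pass");
            s.owner = to;          // sender loses its read-write capability
        }
    }

    // publish: make a private slot public; there is no converse operation.
    void publish(Executor e, Address a) {
        Slot& s = slots_.at(a);
        if (s.is_public || s.owner != e)
            throw std::logic_error("cannot publish");
        s.is_public = true;
        s.readers.insert(e);
    }

private:
    struct Slot {
        int value;
        bool is_public;
        Executor owner;              // meaningful only while private
        std::set<Executor> readers;  // load-only holders once public
    };
    std::map<Address, Slot> slots_;
    Address next_ = 0;
};

// Walk one slot through store, pass, publish, and a second pass.
inline int demo() {
    ToyGam gam;
    Address a = gam.map(0, 1, false);  // private slot owned by executor 0
    gam.store(0, a, 7);
    gam.pass(0, 1, a);     // exclusive capability moves to executor 1
    gam.publish(1, a);     // slot becomes public and immutable
    gam.pass(1, 2, a);     // executor 2 gains load-only capability
    int sum = gam.load(1, a) + gam.load(2, a);
    assert(sum == 14);
    return sum;
}
```

In this model no executor can ever store into a slot that another executor may load concurrently, which mirrors the informal argument that GAM programs are data-race free by construction.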
We proceed by describing step by step a simple execution of a GAM system (Sect. 3.1.1) and by informally comparing GAM systems with those based on cache coherence (Sect. 3.1.2).