A Study of Page-Based Memory Allocation Policies for the Argo Distributed Shared Memory System


IT 21 002

Degree Project, 30 credits

January 2021

A Study of Page-Based Memory Allocation Policies for the Argo Distributed Shared Memory System

Ioannis Anevlavis


Faculty of Science and Technology, UTH Unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Abstract

A Study of Page-Based Memory Allocation Policies for the Argo Distributed Shared Memory System

Ioannis Anevlavis

Software distributed shared memory (DSM) systems have been one of the main areas of research in the high-performance computing community. One of the many implementations of such systems is Argo, a page-based, user-space DSM built on top of MPI. Researchers have dedicated considerable effort to making Argo easier to use and to alleviating some of the shortcomings that hurt its performance and scaling. However, several issues remain to be addressed, one of them being the simplistic distribution of pages across the nodes of a cluster. Since Argo works at page granularity, the page-based memory allocation, or placement of pages, in a distributed system is of significant importance to performance, as it determines the extent of remote memory accesses. To ensure high performance, it is essential to employ memory allocation policies that place data in distributed memory modules intelligently, thus reducing latencies and increasing memory bandwidth. In this thesis, we incorporate several page placement policies into Argo and evaluate their impact on performance with a set of benchmarks ported to that programming model.


Examiner: Philipp Rümmer
Supervisor: Stefanos Kaxiras


Contents

1 Introduction

2 Background
2.1 Argo Distributed Shared Memory
2.1.1 System Design
2.1.2 Memory Management
2.1.3 Signal Handler
2.2 MPI One-Sided Communication
2.2.1 Memory Windows
2.2.2 Basic Operations
2.2.3 Atomic Operations
2.2.4 Passive Target Synchronization
2.2.5 Memory Models

3 Contributions
3.1 Default Memory Allocation Policy: Drawbacks
3.1.1 Performance and Scalability
3.1.2 Ease of Programmability
3.2 Page-Based Memory Allocation Policies
3.2.1 Cyclic Group
3.2.2 First-Touch
3.3 Implementation Details
3.3.1 MPI Backend
3.3.2 Data Distribution

4 Benchmarks
4.1 Stream Benchmark
4.2 Himeno Benchmark
4.3 Matrix Multiplication
4.4 NAS Parallel Benchmarks
4.4.1 Fast Fourier Transform
4.4.2 Conjugate Gradient
4.5 Bayesian Probabilistic Matrix Factorization

5 Evaluation
5.1 Performance Characteristics
5.1.1 Synthetic Benchmarks
5.1.2 Application Benchmarks

6 Conclusion

Bibliography

A Additional Listings


1 Introduction

Nowadays, business and government organizations create large amounts of both structured and unstructured information which needs to be processed, analyzed, and linked. Applications that address this need can generally be classified as compute-intensive, data-intensive, or both. The most important reason for developing such applications in parallel is the potential performance improvement, which can be obtained by expanding either the memory or the compute capabilities of the device they run on. Due to these characteristics, a typical hardware computing infrastructure for large-scale applications is a group of multicore nodes connected via a high-bandwidth commodity network, each one having its own private shared memory and disk storage (i.e., a cluster).

For programming distributed memory multiprocessor systems, such as clusters of workstations, message passing is usually used. However, message passing systems require explicit coding of inter-process communications, which makes parallel programming difficult. This has led to a diverse ecosystem of programming models that enable programming at a much larger scale than a single multicore or a single symmetric multiprocessor (SMP) and ease development by specializing to algorithmic structure and dynamic behavior; however, applications that do not fit well into one particular model suffer in performance. Software distributed shared memory (DSM) systems improve the programmability of message passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global address space) to programmers. One of the many implementations of such systems is Argo [Kax+15], a page-based, user-space DSM built on top of the message passing interface (MPI). Argo provides a transparent shared address space with scalable performance on a cluster with fast network interfaces.

Although this design preserves the abstraction of a single shared memory to the programmer, it comes at the cost of load-balancing issues and remote accesses. In order to guarantee high performance on these architectures, an efficient data placement strategy becomes crucial. Due to this need, memory allocation policies for hierarchical shared memory architectures [IWB02; Rib+09; Ser+12a; Ser+12b] have attracted considerable research effort and have shown significant network and memory performance benefits when benchmarking a variety of scientific applications.

In this thesis, we investigate incorporating seven page-based memory allocation policies into Argo. We begin by giving an overview of the Argo system, in terms of how its global memory is laid out and managed, as well as an overview of MPI one-sided communication, which composes a significant part of Argo's backend (Section 2). We then explain why Argo's default way of managing memory is inefficient, and propose data placement strategies to address these deficiencies (Section 3). Next, we present the benchmarks ported to Argo for the sake of this thesis, which are used to evaluate the impact of the implemented policies on performance (Section 4). Finally, we present and elaborate on the execution results (Section 5).

We deploy and evaluate the correctness of the policies on an 8-node RDMA-enabled cluster. Their performance, however, is evaluated on a larger distributed cluster.


2 Background

2.1 Argo Distributed Shared Memory

Argo [Kax+15] is a page-based, user-space DSM whose prototype implementation is built on top of MPI. It ensures coherence in the global address space of a distributed system, thus enabling shared memory programming at a much larger scale than a single multicore or a single SMP [Kax+15]. Coherence can be accomplished in either hardware or software, but since there is no dedicated hardware support at this scale, interest is focused on software solutions. Among the existing plethora of software solutions that create a shared virtual address space among all nodes in a distributed system, Argo introduces three innovative techniques for coherence and critical-section handling: a novel coherence protocol (Carina) based on passive classification directories (Pyxis), and a new locking system (Vela) [Kax+15].

2.1.1 System Design

Similar to other DSM systems [Li88;BZS93;RH01;Kel+94], Argo implements shared memory using the facilities provided by the virtual memory system. It is implemented entirely in user space and uses MPI as the underlying messaging library as well as for process setup and tear down, for portability and flexibility reasons.

Figure 2.1 shows an overview of Argo's DSM system. In the Argo system, each node contributes an equal share of memory to the globally shared memory of the system. The size of the shared memory space is user specified at the application level and has to be large enough to fit the desired workload. For example, if the application code includes the collective initializer call argo::init(10GB) and four nodes are being used, then every node will contribute 2.5GB of its physical memory in order for the global address space of the system to be constructed.

2.1.2 Memory Management

Since a memory page is the smallest unit of data at which virtual memory can be mapped, Argo works with a page granularity of 4KB. The API sets up a shared virtual address space spanning all nodes using POSIX shared memory; each node first initializes it by allocating the same range of virtual addresses using the mmap system call. These addresses are then available for allocation at the page level using Argo's own allocators.

As Argo is a home-based DSM, each virtual page is assigned a home node. The term 'home node' refers to the actual node in the distributed system to whose physical memory the virtual page will be mapped. Argo's default memory management scheme falls into the category of bind memory policies in general and bind all in particular [Rib+09]. That is, this policy will use all available physical memory contributed to the global address space by the first node before using the next node's memory.

Figure 2.2 depicts the bind all memory allocation policy in a cluster machine using four nodes. The globally allocated application data is composed of M memory pages which are divided into four groups (each group is represented by a color). In that setting, using the bind all policy, the first group of virtual pages (starting from the left) will begin to be mapped to the physical memory of node0, and when that node runs out of physical page frames, the mapping will continue to the next node by id, which is node1. As aforementioned, the size of the shared memory space should be sufficient to host all memory pages of the application data.

Figure 2.2: Argo’s Default Memory Management Scheme

2.1.3 Signal Handler

Even though the default page placement scheme follows a static approach to associate virtual memory ranges with nodes, the actual binding is not done at the initialization phase, but at runtime. The mapping between virtual and physical memory is taken care of by Argo's signal handler, which is a user SIGSEGV signal handler implemented in the MPI backend of the system and is invoked when a memory region is accessed without valid permissions.

In an application, by the time all operations issued by argo::init have finished and execution resumes, all virtual memory addresses available for allocation have no access permissions. Even after encountering Argo’s allocation calls, the memory addresses continue to have no access permissions, since no physical allocation takes place at the allocation-point, pretty similar to how memory allocation works on Linux.


From that point onwards, any first access to a memory page from the globally allocated data structures results in a page fault (considered a read miss by default), which is passed to the handler function via a SIGSEGV signal. The execution path of the function is divided into two main branches, and which one is taken is decided by whether the home node of the faulting page is the current process¹ executing the function or not. If the memory page belongs to the node based on the memory allocation policy, it is mapped to the backing memory of the local machine; otherwise the page data is fetched from the remote node and mapped to the local page cache.

2.2 MPI One-Sided Communication

Argo does not have its own custom fine-tuned network layer to perform its underlying communications; rather, it uses the passive one-sided communication of MPI [Kax+15]. One-sided communication, also known as remote direct memory access (RDMA) or remote memory access (RMA), was introduced in the MPI-2 standard [Gro+98]. This form of communication, unlike two-sided communication, decouples data movement from process synchronization — hence its name. In essence, it allows a process to have direct access to the memory address space of a remote process through the use of non-blocking operations, without the intervention of that remote process.

2.2.1 Memory Windows

The fact that the target process does not perform an action that is the counterpart of the action on the origin does not mean that the origin process can access and modify arbitrary data on the target at arbitrary times. In order to allow processes to have access to each other's memory, processes have to explicitly expose their own memory to others. That said, one-sided communication in MPI is limited to accessing only a specifically declared memory area on the target, known as a window.

In the one-sided communication model, each process can make an area of its memory, called a window, available to one-sided transfers. The variable type for declaring a window is MPI_Win. The window is defined on a communicator, and thus a process in that communicator can put arbitrary data from its own memory into the window of another process, or get something from the other process' window into its own memory, as seen in Figure 2.3.

The memory for a window is at first sight ordinary data in user space. There are multiple ways to associate data with a window, one of them being to pass a user buffer to MPI_Win_create, along with its size in bytes, the displacement unit (the size of its elements, in bytes), and the relevant communicator.

Figure 2.3: Remote Put & Get between Processes in a Communicator

¹MPI processes are considered as nodes in the ArgoDSM system.


2.2.2 Basic Operations

There are multiple routines for performing one-sided operations, but three of the most basic ones are Put, Get, and Accumulate. These calls roughly correspond to the Send, Receive, and Reduce of the two-sided communication model, except that of course only one process makes the call.

We shall denote by origin the process that performs the call, and by target the process whose memory is accessed. Thus, in a put operation, source=origin and destination=target; in a get operation, source=target and destination=origin.

2.2.2.1 Put and Get

The MPI_Put call can be considered as a one-sided send and, as such, it must specify:

• the target rank,

• the data to be sent from the origin, and

• the location where it is to be written on the target.

The description of the data on the origin supplied to the call is the usual trio of the pointer to the buffer, the number of its elements to be considered, and the type of each individual element. The description of the data on the target is similar, as the number of elements and the datatype also need to be specified, but instead of an address to a buffer, a displacement with respect to the start of the window on the target needs to be supplied. This displacement can be given in bytes, but essentially it is a multiple of the displacement unit (the datatype size in bytes) that was specified in the window definition.

As an example, consider a window created with a displacement unit of four bytes (sizeof(int)). Any access to that window with an MPI_INT datatype and a target displacement of three provided to the call would read or write, depending on the operation, the element at index three of the window memory based on the calculation:

window_base + target_disp × disp_unit.

Figure 2.4: Offset Calculation for an MPI Window

The MPI_Get call has exactly the same parameters as MPI_Put; however, they take on a different meaning in the operation, since now the origin buffer hosts the data coming from the remote window.

2.2.2.2 Accumulate

The third of the basic one-sided routines is MPI_Accumulate, which performs a reduction operation on the data being put to the remote window, thus introducing only one additional parameter to its call with respect to the put operation.

Accumulate is a reduction with a remote result. As with MPI_Reduce, the same predefined operators are available, but no user-defined ones. There is one extra operator, MPI_REPLACE, which has the effect that only the last result to arrive is retained.

2.2.3 Atomic Operations

One-sided calls are said to emulate shared memory in MPI, but the put and get calls are not enough for certain scenarios with shared data. The problem is that reading and updating shared data structures is not an atomic operation, thus leading to inconsistent views of the data (race conditions).

In the MPI-3 standard [For12] some atomic routines have been added. These routines comprise remote read-and-update operations and remote atomic swap operations, collectively called "accumulate" operations. To the former group belong the routines MPI_Get_accumulate and MPI_Fetch_and_op, which atomically retrieve data from the indicated window, apply an operator, and combine the data on the target with the data on the origin. To the latter belongs the MPI_Compare_and_swap routine, in which the origin data is swapped with the target data only if the target data is equal to a user-specified value.

All of the previously mentioned routines perform the same steps: they return the data as it was before the operation and then atomically update the data on the target. Among them, the most flexible in datatype handling is MPI_Get_accumulate. The routines MPI_Fetch_and_op and MPI_Compare_and_swap, which operate on only a single element, allow for faster implementations, in particular through hardware support.

2.2.4 Passive Target Synchronization

Within one-sided communication, MPI has two modes: active RMA and passive RMA. In active RMA, or active target synchronization, the target sets boundaries on the time period (the 'epoch') during which its window can be accessed. This type of synchronization acts much like an asynchronous transfer with a concluding MPI_Waitall in the two-sided communication model.

In passive RMA, or passive target synchronization, the target puts no limitation on when its window can be accessed. In this model, only the origin is actively involved, allowing it to read from and write to a target at an arbitrary time without requiring the target to make any calls whatsoever. This means that the origin process remotely locks the window on the target, performs a one-sided transfer, and releases the window by unlocking it again.

During an access epoch, also called a passive target epoch, a process can initiate and finish a one-sided transfer by locking the window with the MPI_Win_lock call and unlocking it with MPI_Win_unlock. The two lock types are:

• MPI_LOCK_SHARED, which should be used for Get calls: since multiple processes are allowed to read from a window in the same epoch, the lock can be shared.

• MPI_LOCK_EXCLUSIVE, which should be used for Put and Accumulate calls: since only one process is allowed to write to a window during one epoch, the lock should be exclusive.

These routines make MPI behave like a shared memory system; the instructions between locking and unlocking the window effectively become atomic operations.

Completion and Consistency

In one-sided communication one should be aware of the multiple instances of the data, and of the various completions that affect their consistency.

• The user data. This is the buffer that is passed to a Put or Get call. For instance, after a Put call, but still in an access epoch, this buffer is not safe to reuse. Making sure the buffer has been transferred is called local completion.

• The window data. While this may be publicly accessible, it is not necessarily always consistent with internal copies.

• The remote data. Even a successful Put does not guarantee that the other process has received the data. A successful transfer is a remote completion. We can force remote completion, that is, an update on the target, with MPI_Win_unlock or some variant of it, concluding the epoch.


2.2.5 Memory Models

The window memory is not the same as the buffer that is passed to MPI_Win_create. The memory semantics of one-sided communication are best understood by using the concept of public and private window copies. The former refers to the memory region that a system has exposed, so it is addressable by all processes, while the latter refers to the fast private buffers (e.g., transparent caches or explicit communication buffers) local to each process where copies of the data elements from the main memory can be stored for faster access. The coherence between these two distinct memory regions is determined by the memory model. One-sided communication of MPI defines two distinct memory models, the separate and the unified memory model.

In the separate memory model, the private buffers local to each process are not kept coherent with all the updates to main memory. Thus, conflicting accesses to main memory need to be synchronized and updated in all private copies explicitly. This is achieved by explicitly calling one-sided functions in order to reflect updates to the public window in the private memory.

In the unified memory model, the public and private windows are identical. This means that updates to the public window via put or accumulate calls will eventually be observed by load operations in the private window. Conversely, local store accesses are eventually visible to remote get or accumulate calls without additional one-sided calls. These stronger semantics of the unified model allow some synchronization calls to be omitted, potentially improving performance.


3 Contributions

In this chapter, we highlight the drawbacks of the default memory management approach used in Argo, present the contributions of our work to address these deficiencies by introducing several page-based memory allocation policies, and then outline the code modifications applied to incorporate these policies into Argo's backend.

In the context of this thesis, as far as the page placement policies are concerned, we are particularly interested in static rather than dynamic techniques for managing memory allocation. The reason for this selection is that no prior work has been done on Argo's memory management; as this is the first step in that direction, we judged it a good approach to implement simple allocation techniques and see whether they favor performance, before moving on to more complicated ways of managing memory, such as preallocation, data migration mechanisms, etc.

3.1 Default Memory Allocation Policy: Drawbacks

The default memory management scheme presented in Section 2.1.2 is a double-edged sword: despite its simple implementation, it can be detrimental to the performance and scaling of an application, especially if not exploited efficiently.

3.1.1 Performance and Scalability

One of the drawbacks of the simplistic data placement is that it hurts performance and scaling. The issue is caused by the relation between the specified size of the global address space and the size of the allocated data structures in an application. As an example, consider a case where the size of the global address space is specified to be 10GB, while the globally allocated data structures in the application are of size 1GB. This will result in all memory pages fitting in the physical memory of node0, since each node has reserved a space of 2.5GB from its physical memory for the DSM to use. In that setting, since the workload will be distributed across all nodes, there would be essentially no locality for all except the zero process, and a lot of network traffic will be generated for the purpose of fetching data from node0.

3.1.2 Ease of Programmability

Another drawback of the simplistic data placement is that it does not contribute to one of the key concepts of Argo, which is ease of programmability. Argo's creators have dedicated considerable effort to making this programming model easier to use by keeping the level of abstraction as high as possible, thereby making it possible to scale a shared memory application to the distributed system level with just a few modifications. However, Argo will not offer its optimal performance, especially in data-intensive applications, if the user is not aware of the default page placement scheme. To acquire such information, the user has to look into the system's research literature or source code, ask one of the system's creators, or figure it out by conducting their own performance tests, since that information is not readily available through the small programming tutorial¹ on Argo's web page. If the user realizes that the page placement policy affects performance and comes to understand its functionality, then in each application, to strive for optimal performance, they have to take care of the relation between the size of the global address space and the size of the globally allocated data, by providing the exact size of the globally allocated data structures (plus some padding) to the initializer call argo::init.

3.2 Page-Based Memory Allocation Policies

In order to tackle the side effects of the default memory management scheme, and keeping in mind that memory in Argo is handled at page granularity, we look into page-based memory allocation policies. Considering the diversity of data parallel application characteristics, such as different memory access patterns, a single memory policy might not be appropriate to enhance performance in all cases, so we incorporate seven static page placement policies into Argo. We propose memory policies that handle both bandwidth and latency issues, and that also address granularities other than the default page granularity.

3.2.1 Cyclic Group

As mentioned above, a downside of the default memory policy is that it binds all data to the physical memory of node0, thus causing network contention when the workload is distributed. The cyclic group of memory policies addresses both the performance and programmability issues that the default memory policy elicits. Specifically, it improves performance by spreading data across all physical memory modules in the distributed system, thus balancing memory module usage and improving network bandwidth, and it eases programmability, since whatever size is provided to the initializer call argo::init does not affect the placement of data. The cyclic group consists of six memory policies, namely cyclic, cyclic block, skew mapp, skew mapp block, prime mapp, and prime mapp block.

Figure 3.1 depicts the cyclic and the cyclic block memory policies, on the left and right side of the figure respectively, in a cluster machine using four nodes. Each node of the machine has a physical memory which will host the physical page frames of the application data. The application data allocated in global memory is composed of M memory pages, which are divided into four contiguous groups (each color represents a group). In general, the cyclic group of memory policies spreads memory pages over a number of memory modules of the machine following a type of round-robin distribution and, in particular, the cyclic and cyclic block policies do so in a linear way. The cyclic policy uses one memory page per round: a page i is placed in the memory module i mod N, where N is the number of nodes being used to run the application. In the cyclic block policy, on the other hand, pages are distributed in blocks of b pages (user specified), so that a page i is placed in the memory module ⌊i/b⌋ mod N.

Figure 3.1: Cyclic & Cyclic Block Policies

Cyclic and cyclic block memory policies can be used in applications with regular or irregular behavior that have a high level of sharing, since the distribution of pages is extremely uniform, thus smoothing out the traffic generated in the network and providing more bandwidth and better memory module usage. However, the fact that these data placement techniques distribute memory pages linearly over a power-of-two number of nodes can still lead to contention problems in some scientific applications. For example, in the field of numerical scientific applications, the data structure sizes used are also powers of two, and thus using the cyclic memory policy may lead to memory pages used by different processes residing in the same memory modules [IWB02].

Figure 3.2: Skew Mapp & Skew Mapp Block Policies

To overcome this phenomenon, Iyer et al. [IWB02] introduced two non-linear round-robin allocation techniques, the skew mapp and prime mapp memory policies. The basic idea behind these allocation schemes is to perform a non-linear page placement over the machine's memory modules, in order to reduce concurrent accesses directed to the same memory modules in parallel applications. The skew mapp memory policy is a modification of the cyclic policy that has a linear page skew. In this policy, a page i is allocated on the node (i + ⌊i/N⌋ + 1) mod N, where N is the number of nodes used to run the application. In this way, the skew mapp policy skips a node for every N pages allocated, resulting in a non-uniform distribution of pages across the memory modules of the distributed system. Figure 3.2 depicts the skew mapp memory policy as well as its corresponding block implementation in a cluster machine using four nodes. Notice the red arrows pointing to the node skipped in the first round of the data distribution.

The prime mapp memory policy uses a two-phase round-robin strategy to better distribute memory pages over a cluster machine. In the first phase, the policy places data using the cyclic policy over P nodes, where P is a prime number greater than or equal to N (the number of nodes used). Due to the condition that the prime number has to satisfy, and also for ease of programmability, it is calculated at runtime and is equal to 3N/2. Aside from the reasons specified, the expression used to calculate this number also preserves a good ratio between the real and virtual nodes. In the second phase, the memory pages previously placed on the virtual nodes are re-placed into the memory modules of the real nodes, also using the cyclic policy. In this way, the memory modules of the real nodes are not used in a uniform way to place memory pages. Figure 3.3 depicts the prime mapp memory policy as well as its corresponding block implementation in a cluster machine using four nodes. Notice the red arrows pointing to the re-placement of pages from the virtual nodes to the memory modules of the real nodes. The red arrows are two in the page-level allocation case and four in the block-level case because, since four nodes are used, the prime number is equal to six, which makes up two virtual nodes.

Figure 3.3: Prime Mapp & Prime Mapp Block Policies

3.2.2 First-Touch

First-touch is the default policy in Linux operating systems to manage memory al-location on NUMA. This policy places data in the memory module of the node that first accesses it. Due to this characteristic, data initialization must be done with care so that data is first accessed by the process that is later on going to use it. Two of the most common strategies to initialize data in parallel programming is initial-ization only by the master thread or having each worker thread to initialize its own data chunk. Figure3.4shows the difference between these two strategies in a cluster machine using four nodes. Since we parallelize on the distributed system level, we talk about master process and team process initialization, presented on the left and right side of the figure respectively. In this example, global memory is composed of two arrays, which are operated in the computation part of the program with an even workload distribution across the processes. Using the master process to initialize the global arrays, the outcome is no different from the default memory policy of Argo (if not handled correctly), where all data reside in node0. On the contrary, using

team process initialization, the memory pages are spread over the four memory modules of the cluster, with each node hosting only the data that it will need during the computation, thus exploiting locality and dramatically reducing remote accesses.
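The team initialization strategy can be sketched as follows. This is a minimal illustration, not Argo code: the function name and the rank/nranks parameters stand in for whatever mechanism the runtime provides for identifying each process.

```cpp
#include <cstddef>
#include <vector>

// Sketch of "team process" initialization under first-touch: every process
// touches exactly the chunk it will later compute on, so the pages of that
// chunk end up homed on its own node.
void team_init(std::vector<double>& a, int rank, int nranks) {
    const std::size_t chunk = a.size() / nranks;
    const std::size_t begin = rank * chunk;
    const std::size_t end = (rank == nranks - 1) ? a.size() : begin + chunk;
    for (std::size_t i = begin; i < end; ++i)
        a[i] = 0.0;   // first touch: this page is mapped to this node
}
```

Master-process initialization would instead run the whole loop on one rank, placing every page on that rank's node.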

So, if initialization is handled correctly in applications with a regular access pattern, first-touch yields performance gains deriving from the short access latencies when fetching data. However, in applications with irregular access patterns, this allocation scheme may result in a high number of remote accesses, since processes will not access the data that is bound to their memory modules. Regarding the drawbacks that the default memory management scheme brings to the surface, first-touch certainly addresses both the performance and programmability issues and becomes the best choice among the presented policies for regular applications. Admittedly, memory layouts such as the one in Figure 3.4, where the allocated arrays are of equal size, also make cyclic block a favorable choice as a policy. However, that choice comes at a programmability cost, since the user has to pre-calculate the optimal page block size to set up Argo before proceeding with the execution of the application.

3.3 Implementation Details

To incorporate the seven page placement policies mentioned in the previous section into Argo, we attained a satisfactory knowledge of the programming model's backend, in order to deliver an optimal solution in terms of code modifications and design quality. With that said, we apply modifications and introduce new code in the source directories backend/mpi and data_distribution of Argo's main repository2.

3.3.1 MPI Backend

As a starting point, we look into the file swdsm.cpp under the source directory backend/mpi. This source file contains most of Argo's MPI backend implementation, including the SIGSEGV signal handler.

As mentioned in Section 2.1.3, the handler function is invoked on a cache miss, which in Argo also occurs on an access to an unmapped memory page or an access to a memory page without the right access permissions. First-time accesses to memory pages are especially important, since they determine whether memory pages have to be mapped to the backing memory of the local machine or be cached from another node and mapped to the local page cache. The location to which a memory page should be mapped is determined by the chosen memory policy.

The functions that make up the functionality of the page placement policy are getHomenode and getOffset, both defined in swdsm.cpp and called at the beginning of the handler function. The former returns the home node of the memory page, while the latter returns the relevant offset in the backing memory in case the page is mapped to the local machine. A prerequisite for the calculation of the home node and offset is the page-aligned offset of the faulting address from the starting point of the global address space. This offset is calculated at the very beginning of the handler function and is passed as an argument to getHomenode and getOffset, among other functions (Listing 3.1).
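The alignment step itself is simple rounding arithmetic. A sketch of what align_backwards plausibly does, assuming it rounds an offset down to a multiple of the coherence unit (CACHELINE * pagesize in the listings):

```cpp
#include <cstddef>

// Sketch of the page alignment applied to the faulting offset before it is
// handed to getHomenode and getOffset: round the offset down to the nearest
// multiple of the coherence unit.
constexpr std::size_t align_backwards(std::size_t offset, std::size_t unit) {
    return offset - (offset % unit);
}
```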

In the vanilla version of Argo, the functionality of the bind all page placement policy is directly defined in the getHomenode and getOffset functions (Listing 3.2). This is not a bad design approach in the original case, since it is the only policy available to handle the placement of data across the nodes of a distributed system. However, since we incorporate seven other policies, we make use of the global_ptr template class

318 const std::size_t access_offset = static_cast<char*>(si->si_addr) - static_cast<char*>(startAddr);
321 const std::size_t aligned_access_offset = align_backwards(access_offset, CACHELINE*pagesize);
327 unsigned long homenode = getHomenode(aligned_access_offset);
328 unsigned long offset = getOffset(aligned_access_offset);

Listing 3.1: Argo: Function Invocation of getHomenode & getOffset (swdsm.cpp) (Original Version)


498 unsigned long getHomenode(unsigned long addr) {
499     unsigned long homenode = addr / size_of_chunk;
500     if(homenode >= (unsigned long)numtasks) {
501         exit(EXIT_FAILURE);
502     }
503     return homenode;
504 }
505
506 unsigned long getOffset(unsigned long addr) {
508     unsigned long offset = addr - (getHomenode(addr)) * size_of_chunk;
509     if(offset >= size_of_chunk) {
510         exit(EXIT_FAILURE);
511     }
512     return offset;
513 }

Listing 3.2: Argo: Function Definition of getHomenode & getOffset (swdsm.cpp) (Original Version)

defined in data_distribution.hpp under the source directory data_distribution, to improve readability by hiding the implementation details (lines 518-9 and 534-5 of Listing 3.4).

Other than hiding the internals of the page placement policies behind the global_ptr class, we introduce an if-else construct: if its condition is satisfied, the code that retrieves the home node and offset is protected by a semaphore and a mutex lock, while in the opposite case it remains unprotected. Which branch of the if-else statement is taken is decided by the cloc function parameter, whose purpose is to identify whether the program is at a specific location running under a specific memory policy. If the branch is taken, it means that we are at that specific location (analyzed later) in the program, running under the first-touch memory policy, which corresponds to the function parameter cloc being seven.

In the backend of the Argo system, almost all the functional code is enclosed by different mutex locks and a semaphore. The pthread mutex locks protect the local and global data structures operated on by multiple threads from concurrent accesses; these structures serve the purpose of ensuring data coherency but also of mitigating some performance bottlenecks. Besides allowing only one thread at a time to perform operations on these data structures, global operations that involve the InfiniBand network also need to be serialized. This is due to the fact that either the settings or the hardware itself of a cluster machine might not support concurrent one-sided operations coming from the same node; if such operations are nevertheless issued, execution might fail or, if not, unpredictable delays will be observed because the network will have downgraded from InfiniBand to Ethernet. The abstract data type used in the backend of Argo to ensure atomicity over the InfiniBand network is a semaphore. The semaphore ibsem is shared between the threads of a process and

334 const std::size_t access_offset = static_cast<char*>(si->si_addr) - static_cast<char*>(startAddr);
337 const std::size_t aligned_access_offset = align_backwards(access_offset, CACHELINE*pagesize);
343 unsigned long homenode = getHomenode(aligned_access_offset, MEM_POLICY);
344 unsigned long offset = getOffset(aligned_access_offset, MEM_POLICY);

Listing 3.3: Argo: Function Invocation of getHomenode & getOffset (swdsm.cpp) (Modified Version)


514 unsigned long getHomenode(unsigned long addr, int cloc) {
515     if (cloc == 7) {
516         pthread_mutex_lock(&spinmutex);
517         sem_wait(&ibsem);
518         dm::global_ptr<char> gptr(reinterpret_cast<char*>(addr + reinterpret_cast<unsigned long>(startAddr)), 0);
519         addr = gptr.node();
520         sem_post(&ibsem);
521         pthread_mutex_unlock(&spinmutex);
522     } else {
523         dm::global_ptr<char> gptr(reinterpret_cast<char*>(addr + reinterpret_cast<unsigned long>(startAddr)), 0);
524         addr = gptr.node();
525     }
526
527     return addr;
528 }
529
530 unsigned long getOffset(unsigned long addr, int cloc) {
531     if (cloc == 7) {
532         pthread_mutex_lock(&spinmutex);
533         sem_wait(&ibsem);
534         dm::global_ptr<char> gptr(reinterpret_cast<char*>(addr + reinterpret_cast<unsigned long>(startAddr)), 1);
535         addr = gptr.offset();
536         sem_post(&ibsem);
537         pthread_mutex_unlock(&spinmutex);
538     } else {
539         dm::global_ptr<char> gptr(reinterpret_cast<char*>(addr + reinterpret_cast<unsigned long>(startAddr)), 1);
540         addr = gptr.offset();
541     }
542
543     return addr;
544 }

Listing 3.4: Argo: Function Definition of getHomenode & getOffset (swdsm.cpp) (Modified Version)

ensures that even though threads might be willing to concurrently process different global data structures, only one of them proceeds at a time.

The cloc parameter that determines which branch of the if-else statement is taken in the getHomenode and getOffset functions is introduced because of the internals of the first-touch memory policy, in conjunction with where the two functions are called inside the handler. First-touch, standing apart from the other memory policies, is the only one that makes use of a directory to keep track of the owner of every page. That said, its implementation involves one-sided operations, which is why it is protected by a directory-dedicated lock and the relevant semaphore, as seen in lines 516-7, 520-1 and 532-3, 536-7 of Listing 3.4. Note that the mutex lock spinmutex used in that particular case serves solely to avoid the overhead of sleeping imposed by the semaphore.

A stripped-down version of the handler function is shown in Listing 3.5. Despite the plethora of operations performed in the actual code, the operation workflow of the function is rather simple. Initially, the faulting address is aligned at a 4KB page granularity and passed to the getHomenode and getOffset functions. Once the home node and offset of the faulting address are retrieved, it is checked whether the page belongs to the node and, in that case, it is mapped to the backing memory of the local machine (globalData), as seen in line 384 of Listing 3.5. Otherwise, it is fetched from its relevant home node and mapped to the local page cache (cacheData). Observe that operations that involve the two aforementioned data structures as well as the Pyxis directory (globalSharers) are enclosed by the lock cachemutex and


326 void handler(int sig, siginfo_t* si, void* unused) {
334     const std::size_t access_offset = static_cast<char*>(si->si_addr) - static_cast<char*>(startAddr);
337     const std::size_t aligned_access_offset = align_backwards(access_offset, CACHELINE*pagesize);
341     char* const aligned_access_ptr = static_cast<char*>(startAddr) + aligned_access_offset;
343     unsigned long homenode = getHomenode(aligned_access_offset, MEM_POLICY);
344     unsigned long offset = getOffset(aligned_access_offset, MEM_POLICY);
        // Protects globalData, cacheData and globalSharers.
348     pthread_mutex_lock(&cachemutex);
350     // If the page is local...
351     if(homenode == (getID())) {
353         sem_wait(&ibsem);
            // update the Pyxis directory (globalSharers) and...
382         // map the page to the backing memory of the local machine.
384         vm::map_memory(aligned_access_ptr, pagesize*CACHELINE, cacheoffset+offset, PROT_READ);
423         sem_post(&ibsem);
424         pthread_mutex_unlock(&cachemutex);
425         return;
426     }
        // If the page does not belong to the node,
        // fetch it from the relevant home node and
        // map it to the local page cache. ...
        // Update the Pyxis directory and perform further operations. ...
507     pthread_mutex_unlock(&cachemutex);
510     return;
511 }

Listing 3.5: Argo: Function Definition of the Signal Handler: handler (swdsm.cpp) (Modified Version)

the semaphore ibsem when one-sided operations are about to take place. Clearly, if we moved both of these locking structures to just before the invocation of the getHomenode and getOffset functions, the introduced branch and the locking structures inside these functions would be unnecessary; but the motivation for not changing their original location is precisely the reason this code is injected.

Since the first-touch memory policy requires a globally accessible directory to keep track of the owner as well as the offset of every page, further code is added to the initialization function argo_initialize in order to set up this data structure; it is executed only when the relevant policy is selected (Listing 3.6). We allocate and initialize the first-touch directory in the same way as the rest of the global data structures. In the beginning, the size of the directory is calculated and set to twice the size of the total distributed shared memory (in pages), since the format for every page is [home node, offset], identical to the globalSharers directory, which is [readers, writers]. Then, the implementation-specific directory globalOwners is allocated at a 4KB alignment, mapped to Argo's virtual address space and associated with the newly created window ownerWindow. Lastly, the buffer


935 void argo_initialize(std::size_t argo_size, std::size_t cache_size) {
977 #if MEM_POLICY == 7
978     ownerOffset = 0;
979 #endif
1019 #if MEM_POLICY == 7
1020     ownerSize = argo_size;
1021     ownerSize += pagesize;
1022     ownerSize /= pagesize;
1023     ownerSize *= 2;
1024     unsigned long ownerSizeBytes = ownerSize * sizeof(unsigned long);
1025
1026     ownerSizeBytes /= pagesize;
1027     ownerSizeBytes += 1;
1028     ownerSizeBytes *= pagesize;
1029 #endif
1056 #if MEM_POLICY == 7
1057     globalOwners = static_cast<unsigned long*>(vm::allocate_mappable(pagesize, ownerSizeBytes));
1058 #endif
1086 #if MEM_POLICY == 7
1087     current_offset += pagesize;
1088     tmpcache = globalOwners;
1089     vm::map_memory(tmpcache, ownerSizeBytes, current_offset, PROT_READ|PROT_WRITE);
1090 #endif
1107 #if MEM_POLICY == 7
1108     MPI_Win_create(globalOwners, ownerSizeBytes, sizeof(unsigned long), MPI_INFO_NULL, MPI_COMM_WORLD, &ownerWindow);
1109 #endif
1119 #if MEM_POLICY == 7
1120     memset(globalOwners, 0, ownerSizeBytes);
1121 #endif
1130 }

Listing 3.6: Argo: Function Definition of argo_initialize (swdsm.cpp) (Modified Version)

in the process space is initialized to zero.
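The directory sizing arithmetic scattered through Listing 3.6 can be condensed into a single function. This is a sketch for clarity, assuming the same variable meanings (argo_size, pagesize) as in the listing:

```cpp
#include <cstddef>

// The directory sizing math of Listing 3.6 in one place: two unsigned longs
// ([home node, offset]) per page of shared memory, with the resulting byte
// count rounded up to whole pages before mapping.
std::size_t owner_size_bytes(std::size_t argo_size, std::size_t pagesize) {
    std::size_t entries = ((argo_size + pagesize) / pagesize) * 2;
    std::size_t bytes = entries * sizeof(unsigned long);
    return (bytes / pagesize + 1) * pagesize;   // page-align upwards
}
```

For a 1MB global memory with 4KB pages this yields 514 directory entries, rounded up to two 4KB pages.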

The memory region of ownerWindow is initialized in the argo_reset_coherence function (Listing 3.7), which is invoked at the very end of the initialization function. Notice that the initialization is done in the local memory region, but since we are under the unified memory model, the local and public copies are kept coherent.

3.3.2 Data Distribution

For the implementation of the rest of the memory policies, we apply modifications to the files under the source directory data_distribution. These files contain the two predefined template classes which we modify and later use in order to hide the computational part of the memory policies.

In the official unmodified version of Argo, the source directory data_distribution contains only one file, named data_distribution.hpp. Aside from the template class definitions of global_ptr and naive_data_distribution, this file also contains the definitions of their member functions. The code blocks that we are particularly interested in are the constructor of global_ptr and the member functions homenode and local_offset of naive_data_distribution.


1240 void argo_reset_coherence(int n) {
1253 #if MEM_POLICY == 7
1254     MPI_Win_lock(MPI_LOCK_EXCLUSIVE, workrank, 0, ownerWindow);
1255     globalOwners[0] = 0x1;
1256     globalOwners[1] = 0x0;
1257     for(j = 2; j < ownerSize; j++)
1258         globalOwners[j] = 0;
1259     MPI_Win_unlock(workrank, ownerWindow);
1260     ownerOffset = (workrank == 0) ? pagesize : 0;
1261 #endif
1268 }

Listing 3.7: Argo: Function Definition of argo_reset_coherence (swdsm.cpp) (Modified Version)

The interplay between the constructor of global_ptr (Listing 3.8) and the member functions of naive_data_distribution (Listing 3.9) is rather apparent. Once a global_ptr object is created with the faulting address passed as an argument, as previously seen in lines 518 and 534 of Listing 3.4, the constructor is invoked, which in turn invokes the member functions homenode and local_offset of the naive_data_distribution class to do the policy computation (lines 43-4 of Listing 3.8). After the computation finishes, the private members homenode and local_offset of the global_ptr class are retrieved with the public member functions node and offset, respectively, as seen in lines 519 and 535 of Listing 3.4.

As can be seen in Listing 3.9, the bodies of the class member functions homenode and local_offset are defined inside the class. The choice of not separating the definition from the declaration is acceptable in that particular case and does not damage readability or design quality, since only the bind all memory policy is implemented, expressed as a one-liner in each of the functions.

However, with the introduction of the other seven memory policies, we increase the abstraction further by introducing a new implementation file to host the computational part of the policies. Thus, under the source directory data_distribution we introduce the implementation file data_distribution.cpp3 to host the bodies of

the functions homenode and local_offset of the naive_data_distribution class. The computational part of each policy is selected through the preprocessor directive MEM_POLICY, defined in data_distribution.hpp as a number from zero to seven, starting from bind all (Listings A.1 and A.2) and continuing with the rest of the policies in the order presented in Section 3.2.

24 template<typename T, class Dist = naive_data_distribution<0>>
25 class global_ptr {
26 private:
27     /** @brief The node this pointer is pointing to. */
28     node_id_t homenode;
30     /** @brief The offset in the node's backing memory. */
31     std::size_t local_offset;
32
33 public:
42     global_ptr(T* ptr)
43         : homenode(Dist::homenode(reinterpret_cast<char*>(ptr))),
44           local_offset(Dist::local_offset(reinterpret_cast<char*>(ptr)))
45     {}
88 };

Listing 3.8: Argo: Class Constructor of global_ptr (data_distribution.hpp) (Original Version)


99 template<int instance>
100 class naive_data_distribution {
101 private:
102     /** @brief Number of ArgoDSM nodes. */
103     static int nodes;
105     /** @brief Starting address of the memory space. */
106     static char* start_address;
108     /** @brief Size of the memory space. */
109     static long total_size;
111     /** @brief One node's share of the memory space. */
112     static long size_per_node;
113
114 public:
133     static node_id_t homenode(char* const ptr) {
134         return (ptr - start_address) / size_per_node;
135     }
142     static std::size_t local_offset(char* const ptr) {
143         return (ptr - start_address) - homenode(ptr) * size_per_node;
144     }
155 };

Listing 3.9: Argo: Class Member Functions homenode & local_offset of naive_data_distribution (data_distribution.hpp)

(Original Version)

Along with MEM_POLICY, we also introduce the PAGE_BLOCK preprocessor directive to set the block size for the policies working on varying granularities.

3.3.2.1 Cyclic Memory Policies

The software implementation of the cyclic group of memory policies consists of a series of simple mathematical expressions, in conjunction with conditions and loops (only in the case of the prime mapp policies), executed at runtime for the global address space address provided to the relevant functions. Listing 3.10 presents the calculation of the home node for the cyclic and cyclic block policies. We don't present the rest of the policies in this category, since the method of calculation is similar.

In particular, for all the memory policies, in both the homenode and local_offset functions, the starting address of the memory space is subtracted from the faulting shared virtual address passed as an argument, resulting in the actual offset being held in the addr variable. Notice that in the cyclic group of memory policies we don't use addr in the calculation of the homenode and offset variables, but only in some conditions. The variable we use in the calculation of the aforementioned variables is lessaddr, which is addr minus granularity (the size of a page). That said, the cyclic policies work by taking the second page of the global address space as the first page and starting the distribution from there. We do this because the allocation of any global data structure in an application starts from the second page (offset 0x1000) onwards, since the very first page of globalData (offset 0x0000) is reserved by the system to hold the amount of memory pool currently allocated, as well as the data structure for the TAS lock used to update this variable. The first page of the global address space is assigned to the master process (proc0) as its home node, since execution stalls if the ownership is placed anywhere else.

In the bind all memory policy, the first page of the global memory space will always be assigned to the master process due to its distribution pattern, which is why the lessaddr variable is not introduced in that case.

The difference between the scalar and block implementations of the cyclic group of memory policies is less apparent in the homenode function and more so in local_offset.


67 template<>
68 node_id_t naive_data_distribution<0>::homenode(char* const ptr) {
72 #elif MEM_POLICY == 1
73     static constexpr std::size_t zero = 0;
74     const std::size_t addr = ptr - start_address;
75     const std::size_t lessaddr = (addr >= granularity) ? addr - granularity : zero;
76     const std::size_t pagenum = lessaddr / granularity;
77     const node_id_t homenode = pagenum % nodes;
78 #elif MEM_POLICY == 2
79     static constexpr std::size_t zero = 0;
80     static const std::size_t pageblock = PAGE_BLOCK * granularity;
81     const std::size_t addr = ptr - start_address;
82     const std::size_t lessaddr = (addr >= granularity) ? addr - granularity : zero;
83     const std::size_t pagenum = lessaddr / pageblock;
84     const node_id_t homenode = pagenum % nodes;
137 }

Listing 3.10: Argo: Class Member Function homenode of naive_data_distribution (data_distribution.cpp) for the cyclic & cyclic_block memory policies

(Modified Version)

More specifically, the only thing done differently in the block implementation compared to the scalar one is the use of the implementation-specific variable pageblock instead of granularity in the calculation of pagenum, which happens in lines 76 and 83 for the cyclic and cyclic block implementations, respectively. In the local_offset function, by contrast, not only does the calculation of the pagenum variable differ, but so does the calculation of the offset variable, which happens in lines 150 and 160 for cyclic and cyclic block, respectively.
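The offset math that sets the two variants apart can be sketched as follows. This is a hedged illustration, not the actual Argo code: lessaddr is addr minus granularity, as introduced earlier, and the special handling of the reserved first page is left out for simplicity.

```cpp
#include <cstddef>

// Sketch of the local_offset computation for the scalar cyclic policy:
// every full round over all nodes adds one page to each node's backing
// memory, so only the number of completed rounds matters.
std::size_t cyclic_offset(std::size_t lessaddr, std::size_t granularity,
                          std::size_t nodes) {
    const std::size_t pagenum = lessaddr / granularity;
    return (pagenum / nodes) * granularity;    // full rounds already stored
}

// Sketch for cyclic block: full block rounds advance the offset by a whole
// block, and the position inside the current block is added on top.
std::size_t cyclic_block_offset(std::size_t lessaddr, std::size_t granularity,
                                std::size_t nodes, std::size_t page_block) {
    const std::size_t blocksize = page_block * granularity;
    const std::size_t blocknum = lessaddr / blocksize;
    return (blocknum / nodes) * blocksize      // full block rounds
         + (lessaddr % blocksize);             // position inside the block
}
```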

Aside from the differences between the scalar and block implementations, prime mapp and prime mapp block generally work a bit differently from the other policies in their way of calculating the offset. The peculiarity of their implementation is the use of loops for calculating the offset of a page in certain address ranges. These address ranges are the ones corresponding to the real nodes in the system after the first cyclic distribution of pages. For these address ranges, no closed-form expression based on a 3N/2 prime number was found that correctly calculates the offset of the pages in the backing memory of the nodes. However, the offset of the pages in the first cyclic distribution, as well as of those corresponding to the virtual nodes in the system, is correctly calculated with the same statement as the one used in the cyclic memory policies (lines 190 and 217 for prime mapp and prime mapp block, respectively, Listing A.8). That said, we can take advantage of these offsets to correctly calculate the ones corresponding to real nodes. We accomplish this by iterating backwards one page or one block of pages at a time, depending on the implementation, until we hit a page of the same owner within the correctly calculated offset address ranges, counting all the pages of the same home node along the way (lines 195-9 and 222-6 for prime mapp and prime mapp block, respectively, Listing A.8). Once such a page is hit, we calculate its offset and add to it the number of counted pages multiplied by granularity, resulting in the correct offset of a page corresponding to a real node (lines 202-3 and 229-30 for prime mapp and prime mapp block, respectively, Listing A.8).
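The backward search can be modeled abstractly as below. All names here are invented for this sketch: owners[p] stands for the home node of page p, known[p] marks pages whose offset is directly computable (first cyclic round and virtual-node ranges), and direct_offset[p] holds that offset where it is known.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of the backward search: walk back page by page,
// counting pages with the same home node, until a page of that owner with
// a directly computable offset is hit; the target's offset is then that
// offset plus the count times the page size.
std::size_t search_offset(std::size_t page,
                          const std::vector<int>& owners,
                          const std::vector<bool>& known,
                          const std::vector<std::size_t>& direct_offset,
                          std::size_t granularity) {
    const int owner = owners[page];
    std::size_t count = 1;                 // the faulting page itself
    for (std::size_t p = page; p > 0; ) {
        --p;
        if (owners[p] != owner) continue;  // different home node: skip
        if (known[p])                      // hit a directly computable page
            return direct_offset[p] + count * granularity;
        ++count;                           // same owner, offset still unknown
    }
    return direct_offset[page];            // page itself was already known
}
```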

3.3.2.2 First-Touch Memory Policy

The software implementation of the first-touch memory policy uses a directory, accessed with a very simple index function, to fetch the home node and offset of every page.

The calculation of the index for fetching the home node and offset of an address happens in lines 119 and 238 of Listing 3.11, respectively, with the corresponding accesses to the directory in lines 122 and 241. Notice that in the local_offset function, one is added to the calculated index when accessing the local copy of the globalOwners directory, since the info for every page is in the format [home node, offset]. Also, observe that accesses to the window segment ownerWindow are enclosed with the ownermutex lock which, in contrast to the spinmutex lock, is the directory-dedicated lock used to protect the data structure from concurrent thread accesses.

As far as the home node values in the directory are concerned, they don't correspond to the actual id (workrank) of each process, but to the number one left-shifted by the process id. This is why the additional code block in lines 127-130 of Listing 3.11 is introduced: to extract the actual id from this value. The reason we deposit the home node of a page in the globalOwners directory in this format is to distinguish between a page with no owner and a page owned by proc0. Since the globalOwners directory is initialized to zero (except

its first index), a page with no owner will hold zero at its home node index, and in that case the process acquiring that information will try to claim ownership through the firstTouch function, as seen in line 124 of Listing 3.11.
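The encoding and its inverse can be sketched in isolation. The function names are invented for this illustration; the decode loop mirrors lines 127-130 of Listing 3.11.

```cpp
// Sketch of the owner encoding used by first-touch: an empty directory slot
// holds zero, so rank r is stored as (1 << r); this makes even proc0
// distinguishable from "no owner yet".
unsigned long encode_owner(int rank) {
    return 1ul << rank;
}

int decode_owner(unsigned long value, int nodes) {
    for (int n = 0; n < nodes; ++n)     // mirrors the extraction loop
        if ((1ul << n) == value)
            return n;
    return -1;                          // zero (or unknown): no owner yet
}
```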

The firstTouch function, like homenode and local_offset, is a public member function of the naive_data_distribution class, introduced to hold the one-sided MPI operations for claiming ownership of pages. At the beginning of the function, the id based on the rank of the process, as well as the index at which it will be deposited in the directory, are calculated (lines 29-30 of Listing 3.12). The deposition of the id in the directory is issued just after these calculations, with a one-sided compare-and-swap atomic at line 35. Essentially, this operation compares the value of the compare variable, which is zero, with the value at the specified index in the directory, and if they are equal, the precalculated id is deposited at that location. The old value at that index is returned in the result variable either way.

Regardless of the operation chosen, what is very important for ensuring the atomicity of all the accesses made is the choice of the lock type argument provided to the window enclosing that operation, as well as the target window on which the operation is performed. As seen in line 33 of Listing 3.12, the lock type provided to the locking

67 template<>
68 node_id_t naive_data_distribution<0>::homenode(char* const ptr) {
117 #elif MEM_POLICY == 7
118     const std::size_t addr = ptr - start_address;
119     const std::size_t index = 2 * (addr / granularity);
120     pthread_mutex_lock(&ownermutex);
121     MPI_Win_lock(MPI_LOCK_SHARED, workrank, 0, ownerWindow);
122     node_id_t homenode = globalOwners[index];
123     MPI_Win_unlock(workrank, ownerWindow);
124     if (!homenode) homenode = firstTouch(addr);
125     pthread_mutex_unlock(&ownermutex);
126
127     int n;
128     for(n = 0; n < nodes; n++)
129         if((1 << n) == homenode)
130             homenode = n;
131 #endif
132
133     if(homenode >= nodes) {
134         exit(EXIT_FAILURE);
135     }
136     return homenode;
137 }

Listing 3.11: Argo: Class Member Function homenode of naive_data_distribution (data_distribution.cpp) for the first_touch memory policy.


23 template<>
24 std::size_t naive_data_distribution<0>::firstTouch(const std::size_t& addr) {
25     // Variables for CAS.
26     node_id_t homenode;
27     std::size_t result;
28     constexpr std::size_t compare = 0;
29     const std::size_t id = 1 << workrank;
30     const std::size_t index = 2 * (addr / granularity);
31
32     // Check/try to acquire ownership of the page.
33     MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, ownerWindow);
34     // CAS to process' 0 index.
35     MPI_Compare_and_swap(&id, &compare, &result, MPI_LONG, 0, index, ownerWindow);
36     // Force local and remote completion with MPI_Win_unlock().
37     MPI_Win_unlock(0, ownerWindow);
38
39     // This process was the first one to deposit the id.
40     if (result == 0) {
41         homenode = id;
42
43         // Mark the page in the local window.
44         MPI_Win_lock(MPI_LOCK_EXCLUSIVE, workrank, 0, ownerWindow);
45         globalOwners[index] = id;
46         globalOwners[index+1] = ownerOffset;
47         MPI_Win_unlock(workrank, ownerWindow);
48
49         // Mark the page in the public windows.
50         int n;
51         for(n = 0; n < nodes; n++)
52             if (n != workrank) {
53                 MPI_Win_lock(MPI_LOCK_EXCLUSIVE, n, 0, ownerWindow);
54                 MPI_Accumulate(&id, 1, MPI_LONG, n, index, 1, MPI_LONG, MPI_REPLACE, ownerWindow);
55                 MPI_Accumulate(&ownerOffset, 1, MPI_LONG, n, index+1, 1, MPI_LONG, MPI_REPLACE, ownerWindow);
56                 MPI_Win_unlock(n, ownerWindow);
57             }
58
59         // Since a new page was acquired increase the homenode offset.
60         ownerOffset += granularity;
61     } else
62         homenode = result;
63
64     return homenode;
65 }

Listing 3.12: Argo: Class Member Function first_touch of naive_data_distribution (data_distribution.cpp)

window call is MPI_LOCK_EXCLUSIVE and the target window is that of proc0. With these selections, the atomic compare-and-swap operation is issued by all processes to the public window of the zero process, but only one process is allowed to write to that window during any given epoch. Concluding the epoch with the MPI_Win_unlock call forces remote completion on the target window, so the next processes to perform the operation are guaranteed to view the latest changes.

In case a process manages to deposit its id variable to the directory copy of proc0, its next job is to update its own local copy as well as the rest of the remote copies. This same approach is used in certain cases of updating the Pyxis directory and is adopted here for performance reasons. By keeping the directory copies coherent across the processes in this way, we exploit locality: the homenode and local_offset functions retrieve the home node and offset of a page from local copies. We also take advantage of the fact that eventually all the pages of the working set of an application will be touched by the processes, so no remote accesses to the directory will be needed thereafter.

A process knows that it has successfully deposited the id variable through the atomic operation to the directory of proc0 if the result variable holds the value zero; this check is done in line 40 of Listing 3.12. Otherwise, this variable will hold the id variable of the remote owner of the page, and that is what is returned from the function. So, once a process has managed to acquire ownership of a page, it first updates its local window segment (lines 44–47 of Listing 3.12) and then the remote public window segments, using the one-sided routine MPI_Accumulate along with the MPI_REPLACE operator in a for loop (lines 50–57 of Listing 3.12). Both the id variable of the node and the offset of the page in its backing memory are deposited in the aforementioned code blocks.
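The decision logic around the CAS result can be modelled without MPI. The sketch below is a simplified single-process analogue, not Argo's actual code: it replaces the MPI_Compare_and_swap on proc0's window with a C++ atomic, so the first caller deposits its id bit and wins, while later callers read back the owner's id.

```cpp
#include <atomic>
#include <cstdint>

using node_id_t = std::int64_t;

// Try to claim ownership of one directory slot on behalf of 'rank'.
// Returns the id bit of the owning process (our own id bit if we won).
node_id_t claim(std::atomic<node_id_t>& slot, int rank) {
    node_id_t expected = 0;                     // "compare" value: unowned
    const node_id_t id = node_id_t{1} << rank;  // our id bit, as in Listing 3.12
    if (slot.compare_exchange_strong(expected, id))
        return id;        // result was 0: we deposited our id first
    return expected;      // otherwise: the previous owner's id bit
}
```

Just as in the listing, a return value equal to our own id bit corresponds to the result == 0 branch, and any other value is the remote owner's id.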

Although the calculation of id is static for each process, ownerOffset is increased by the page size after each page acquisition. This is because ownerOffset, an implementation-specific variable, holds the offset for the next page to be acquired. Note that, as in the implementation of the cyclic policies, the first page of the global address space is assigned to the zero process as its home node; in this case, however, this is done through the initialization of the globalOwners directory in the argo_reset_coherence function (lines 1255–1256 of Listing 3.7). For that reason, ownerOffset is initialized to one page for the zero process, while it is initialized to zero for the rest of the processes.
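A minimal sketch of this ownerOffset bookkeeping (the variable names are borrowed from the listings, but the struct itself is hypothetical): the zero process starts one page ahead because it already owns the first page of the global address space, and every successful acquisition advances the offset by one page.

```cpp
#include <cstddef>

// Hypothetical model of the per-process offset bookkeeping described above.
struct OffsetTracker {
    std::size_t granularity;   // page size
    std::size_t ownerOffset;   // offset of the next page to be acquired

    // The zero process already owns the first page of the global address
    // space, so its offset starts one page ahead of everyone else's.
    OffsetTracker(int workrank, std::size_t gran)
        : granularity(gran), ownerOffset(workrank == 0 ? gran : 0) {}

    // Called after this process wins ownership of a page: returns where
    // the page lives in the local backing memory and advances the offset.
    std::size_t acquirePage() {
        const std::size_t offset = ownerOffset;
        ownerOffset += granularity;
        return offset;
    }
};
```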

The one code block that would have been left unmodified, had it not been for the introduction of the first-touch memory policy, is the constructor of the global_ptr class. The original code block, presented in Listing 3.8, contains a software inefficiency that does not particularly affect the performance of the cyclic group of policies, and especially of the bind-all policy, but does hurt the performance of the first-touch memory policy. In the original version of the constructor, when a global_ptr is created, both the homenode and local_offset member functions of the naive_data_distribution class are invoked, regardless of which result is actually needed. Specifically, a global_ptr object created in the getHomenode function needs only the homenode function to be invoked and not local_offset, since that information is not used, and vice versa. To get around this issue with minimal code modifications and without changing the original code structure, we simply introduce an additional input parameter to the constructor, which selects the member function to be invoked. The constructor is given the number zero as an argument in the getHomenode function and one in the getOffset function.
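The idea can be sketched as follows; the class name, the selector convention, and the cyclic address arithmetic here are illustrative assumptions, not Argo's actual constructor.

```cpp
#include <cstddef>

// Hypothetical sketch of a constructor with a selector parameter: only the
// computation the caller actually needs is performed.
struct global_ptr_sketch {
    int homenode = -1;
    std::size_t offset = 0;

    // selector == 0: caller (getHomenode) needs only the home node.
    // selector == 1: caller (getOffset) needs only the local offset.
    global_ptr_sketch(std::size_t addr, std::size_t page,
                      std::size_t nodes, int selector) {
        if (selector == 0)
            // Illustrative cyclic distribution: pages dealt round-robin.
            homenode = static_cast<int>((addr / page) % nodes);
        else
            // Offset of the page within its home node's backing memory.
            offset = addr % page + (addr / (page * nodes)) * page;
    }
};
```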


4 Benchmarks

In this chapter, we present the programs ported to Argo in order to evaluate the performance of the implemented policies. These programs consist of synthetic and numerical scientific benchmarks and one real recommender-system application. We specifically selected numerical scientific benchmarks and applications because they exhibit significant memory and processing-power usage, data sharing, and various memory access patterns throughout their execution.

In each of the following sections, we start by describing the benchmark or application concerned, then briefly point out the modifications applied to its original code in order to deliver a version suitable for running on top of Argo, and lastly state the input problem size on which its measurements are based.

Note that all the benchmarks and applications used here were originally parallelized at the shared memory level with OpenMP; for the purposes of this project, they were then parallelized at the distributed system level with Argo. In addition, further changes were applied to the shared memory parallelism so that a thread pool is implemented. Regarding the porting process, we have taken care not to needlessly move data structures from local to global memory and thus generate unnecessary network traffic. Instead, we keep a balance between local and global memory allocations and move to global memory only the required data structures, with the purpose of gaining performance while still being able to tell the difference in performance between the memory allocation policies.

4.1 Stream Benchmark

Stream [McC95] is a synthetic benchmark that is widely used to evaluate the memory bandwidth performance of parallel machines. It consists of multiple kernel operations, that is, sequential accesses over array data with simple arithmetic, as shown in Table 4.1. After iterative measurements, Stream outputs the sustainable memory bandwidth and the average execution time at the application level for each kernel.

All operations shown in Table 4.1 are performed on vectors of doubles. The copy operation measures transfer rates between the processing unit and the memory bank. The scale kernel adds a multiplication by a scalar to the copy operation. The add kernel verifies memory system performance when multiple loads and stores are performed. The triad kernel merges the other kernel operations (copy, scale, and add). All of these operations are computed in separate parallel loops, one for each kernel.

Kernel   Operation
Copy     a(i) = b(i)
Scale    a(i) = q*b(i)
Add      a(i) = b(i) + c(i)
Triad    a(i) = b(i) + q*c(i)

Table 4.1: Stream Benchmark
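The four kernels of Table 4.1 can be written as separate parallel loops, as described above. The following is a minimal sketch, not the actual Stream source; the OpenMP pragmas are ignored when compiled without OpenMP support, in which case the loops simply run serially.

```cpp
#include <vector>
#include <cstddef>

// Run the four Stream kernels of Table 4.1 over vectors of doubles,
// each in its own parallel loop. 'a' ends up holding the triad result.
void stream_kernels(std::vector<double>& a, const std::vector<double>& b,
                    const std::vector<double>& c, double q) {
    const std::size_t n = a.size();
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i];             // Copy
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) a[i] = q * b[i];         // Scale
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + c[i];      // Add
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + q * c[i];  // Triad
}
```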
