DSM Case Studies - The SMG DSM system: enabling shared memory for the grid

Systems composed of stricter consistency models tend to impose less burden on the application developer, as the programming semantics are less convoluted. The trade-off,

DSM CASE STUDIES 52

however, is that there is a significant increase in the volume of control messages and a greater amount of operating system resources are consumed. Stricter consistency models also tend to reduce potential concurrency as only single writer protocols have been implemented. Multiple writer protocols can be utilised to offset this [56, 61].

In this section some previous DSM implementations will be briefly explored with regard to the various approaches taken to the topics discussed in the previous sections, particularly in relation to weak consistency models. A summary of the features of other DSM implementations is given in Appendix B. The salient DSM design decisions are given. A description of the APIs for some of the DSMs are also given in Appendix C. Although there are some DSM implementation that are more recent, it can be argued that those outlined below are still the seminal works in the field.

4.8.1 Munin

The primary goal of the Munin DSM project was to allow applications written for a mul- tiprocessor system to execute efficiently in a distributed memory system with minimal modifications to the source code [34]. Additionally the programming interface should have similar semantics to that used when using true shared memory multiprocessors, while at the same time not restricting the programming language.

Munin was one of the first DSMs to implement Release Consistency [91] (discussed in Section 4.5.5). The developers of Munin implemented eight tunable parameters that govern the nature of the consistency protocol. (i)Invalidate or update, (ii)Replicas allowed, (iii)Delayed operations allowed, (iv) fixed ownership, (v) writable, (vi) multiple concurrent writers, (vii) stable sharing patterns, and (viii) changes flushed to owner. The main Munin DSM components consist of: the run-time engine which handles synchronisation and consistency events such as page faults, the object directory which man- aged the shared data that is in use locally, and the Delayed Update Queue (DUQ) which is responsible for maintaining consistency, in this case release consistency.

Munin was written in C, and although one of the primary design goals was to avoid using language, hardware or OS features not readily available on a wide variety of systems [92], in addition to a custom compiler, an OS with custom Munin extensions is also a require- ment. The extensions made to the default development System V platform included the ability to handle segmentation (SEGV) faults and other protection violations in user rather than kernel space, the ability for a user process to manipulate arbitrary virtual memory mappings, and the ability to create and destroy processes on remote nodes so that a cloned image of the calling process (initialised data etc.) can be replicated in the new process. Additionally System V’s IPC mechanisms are heavily relied upon. All communication in the prototype system occurred over Ethernet.

The main modification necessary in order to port an application to use the Munin DSM is the annotation of the declaration of shared variables using the sharedkeyword, and additionally the type of coherence required should be specified (one of the typical shared memory access patterns discussed in Section 4.1: write-shared, conventional, migratory and read-only). Preprocessor support is required to process these user annotations to produce an auxiliary file that lists all shared memory regions in an application. At link

DSM CASE STUDIES 53

time the variables can be identified from this auxiliary file and are placed on individual system pages, or on pages with the same anticipated access pattern. During run-time when a variable is accessed the system can examine the variable type and determine the coherence action to take. For a write-shared item a delayed update protocol is employed, while for the conventional and migratory sharing schemes an invalidate approach is used. Only statically allocated variables are supported. No routines similar to a globalmalloc exists as with many other DSM implementations. Only variables of the same type may be placed on the same page as decisions taken by the fault handler mechanism are closely related to the consistency model, which is checked at each memory fault. In a similar manner to hardware DSM implementations Munin used directory tables to locate the variable meta-data when an access fault at a particular location occurs. This meta-data comprises information such as variable size, type, locally present, and probable owner. The programming model that is presented to the programmer is similar to that when writing multi-threaded applications for shared memory machines. API calls exist for the creation and deletion of these extra-process threads. Synchronisation accesses are dis- tinguished from variable accesses through the use of library routines, e.g. global barrier and lock/unlock functions are available to access these special primitives.

4.8.2 Midway

The Midway DSM tried to achieve a similar goal to Munin, to ease porting a multipro- cessor application to a distributed memory system with minimal changes [74].

The Midway DSM was the first to implement Entry Consistency [81]. Midway included support for multiple consistency models to be employed concurrently (basic support for release and processor consistency models is also mentioned). Parallel applications were developed using a threads model, whereby all threads shared the same address space. The duties of the run-time system included the maintenance of consistency. The Mid- way API is given in Appendix C. EC requires that all shared data must be declared and explicitly associated with at least one synchronisation object [42]. Binding functions were provided to achieve this (midway bind synch object), and also functions to rebind shared memory areas to other synchronisation primitives (midway rebind synch object). For a lock synchronisation primitive, Midway distinguishes between two modes in which it could be held: non-exclusive and exclusive. The former allows multiple threads of execution to concurrently hold the lock, while in the latter case only one is permitted. A lock held in non-exclusive access can only be promoted to exclusive access if only the re- questing thread holds the lock. However, a lock can always be demoted to non-exclusive mode. Midway allowed a significant optimisation whereby once a thread holds a lock in non-exclusive mode, it can then delegate non-exclusive access to other threads without being the owner of the lock.

Midway could use one of two approaches to detect access to shared regions: the first was to use a compiler that identified a modification to the shared memory region and recorded such an event in an auxiliary table that had an entry for all shared variables. If this first approach was not available, say the compiler didn’t support the data-type, or the special Midway compiler was itself not available/supported, then the second ap-

DSM CASE STUDIES 54

proach was to use the MMU to detect accesses to shared data. The former approach could result in increased performance by reducing the DSM run-time overhead as access to shared memory was detected without the need of a page fault, however, each shared memory type needed to be explicitly supported [93]. The Midway compiler only supported EC, so the other stated consistency models supported (PC and RC) used the MMU method. When the time arose to enforce consistency the auxiliary table was checked for the presence of ’dirty’ data. The update coherence protocol was employed as it has a natural affinity with the default Entry Consistency model. One serious deficit with the Midway DSM was its lack of any real support for multiple concurrent writers, thus Midway suffered tremendously from the effects of false sharing.

Midway used a special optimisation to reduce the amount of coherence data transferred at consistency points. Only the modifications performed on a shared variable in the time interval{ta, tb}were sent to the acquiring process (which had a copy with a modification

date ofta) from the process with the most recent copy (mod datetb). Midway employed

a form of garbage collection, whereby once a shared variable was no longer being accessed locally (indicated by the lack of ownership of the associated synchronisation primitive), it could be discarded. Upon being referenced again the shared data could be refreshed from a remote process.

4.8.3 CRL

The C Region Library (CRL) [94] was an attempt to construct a DSM system that would be highly portable, language and system independent, while at the same time being efficient though minimising the depth of the DSM layer.

The programming model allows the programmer to define an arbitrarily sized contiguous memory region. Following this the region must be mapped into the local process address space using a rgn map routine. CRL will make no guarantee where in the process’ address space a shared region will be allocated, so a region may be located in different locations in different processes. The granularity of writes is whatever size of region that is created. The CRL API is given in Appendix C. It has functions to clearly delineate the beginning (rgn start op) and end (rgn end op) actions for shared memory regions. No explicit synchronisation operations are used by the programmer to enforce consistency, but in essence the Entry Consistency model is employed.

The coherence granularity is that of the allocated region’s size. CRL borrows many of the approaches that a hardware DSM would employ (for a comparison see Table B.1), such as fixed-home invalidate protocol, resulting in high communication costs (poten- tially three sets of messages per access). A typical hardware DSM directory mechanism is used to locate shared memory home. Support is included for explicit flushing of the local cached copy back to the home process using the rgn flush call.

CRL requires an underlying messaging system implementation for communication. The CM-5 [95] and MIT Alewife machines were supported by making use of Active Messages implementations and in these implementations proprietary features were also used. A message passing implementation, PVM (described in Section D) is used for start-up of system, with CRL’s custom messaging API used thereafter.

DSM CASE STUDIES 55

The programming model offered by CRL is very arduous to develop with, particularly when the applications tend to have multiple shared regions accessed concurrently. In addition concurrency is reduced as there is a lack of support for concurrent writes. It is efficient in regard to OS overhead in write detection as the programmer has to do this. CRL was the first DSM to run on a relatively large number of processes (128). Conversely, it demonstrated the limitations of DSM scalability for some types of application. However, the actual implementation cannot be said to be a software-only DSM as reliance is made upon specialised coherent interconnects.

4.8.4 Treadmarks

The Treadmarks DSM [96] was the first DSM to implement Lazy-Release consistency. Its designers chose this model for their implementation as they considered that this model resided in the ’sweet-spot’ of the consistency-model/control-message trade-off, and would still offer an intuitive programming model with good performance.

The Treadmarks system requires no special kernel or compilers, and as such has been ported to a myriad of platforms, and is itself still in limited use. The Treadmarks API is simple and efficient and has been the basis for many other projects. The API is given in Appendix C(page 221). Shared memory is declared using the Tmk alloc call, which allocates memory from a reserved pool of memory. This memory pool is allocated at run-time and is located at the same memory location in the address space across all Treadmarks processes, which is difficult to achieve in a hetrogenous system. A pointer to a unstructured byte array is returned. The Tmk distribute is subsequently used to broadcast the appropriate data to the other processes in the application.

In Treadmarks synchronisation is implemented using barriers, locks, and condition variables. The barrier primitive is global in nature, and the actual algorithm implemented is the central barrier algorithm. Barrier managers are assigned in a round-robin fash- ion. As stated above Treadmarks employs an invalidate approach to coherence so the workload is much reduced. The locking routines only provide exclusive access modes. Condition variables are implemented on top of the lock variable. An acquire operation on a synchronisation variable signals the start of an ’interval’. Modifications to shared data for the duration of this interval, the end of which is designated by a release operation, are valid.

A write notice, if required by the coherence mechanism, is sent to other processes upon release consisting of a list of shared data items modified in the interval, and a diff is generated (a write collection method, see Section 4.7) that only encodes the modifications to a shared unit by comparing the run-length of the original and modified versions of a coherence unit (a page in Treadmarks’ case). Diffs are only merged upon demand, and it is necessary that this is performed in the proper causal order (by interval vector timestamp).

Treadmarks LRC implementation supports Multiple-Writer protocols [97], allowing multiple threads of execution to concurrently modify data in a shared unit, thereby solving the problem of false sharing. This is supported so long as the modifications are distin- guishable from one another, i.e. each memory location (4 bytes in Treadmarks) within

DSM CASE STUDIES 56

the sharing unit can have only one writer (writes are non-conflicting). As can be seen in Figure 4.8, where multiple writers are in use, modifications to variables within the same shared page are identified at the synchronisation point using write notices. Po- tential conflicts are determined when the diffs from all writers are merged following a subsequent acquire. Any conflicts result in an access violation.

Treadmarks uses a number of novel features to optimise the DSM run-time. Use of adaptive coherence protocols [98] is one such approach, whereby coherence protocols can dynamically adapt between single and multiple writers, thus allowing for a reduction in coherence information whenever possible. Treadmarks also allows for a lazy-diff creation scheme, which allows the process to be delayed until the time the diff is required at a remote node. This has the effect that superfluous diffs are not generated, reducing overhead on the DSM run-time [62].

Communication between Treadmarks processes is achieved using sockets. At start-up of a N process application, the root process will first spawn the remaining N-1 child tasks via a remote shell on remote nodes specified in a machine file. Each Treadmarks DSM process creates N sockets. The maximum number of processes that is allowed in a Treadmarks is fixed at 32, and the total shared memory is also limited to 16384 virtual memory pages (this limit can be modified by recompiling the DSM distribution) that must be located at the same virtual address in all processes. Treadmarks can allocate multiple shared variables on a single page, while offering potential memory space savings the consistency model can be broken unbeknown to the user. The Intel C/C++ compiler has been coupled with the Treadmarks DSM system to provide distributed execution of OpenMP applications [99], see Section 5.3.2.

DSM CASE STUDIES 57

4.8.5 CVM

With previous DSM implementations the choice of coherence protocol/consistency model was built into the DSM with no variation possible without modifications to the DSM core. The Coherent Virtual Machine or CVM was developed with the intention of being a platform for protocol development and experimentation [100]. CVM was developed by Pete Keleher, one of those responsible for Treadmarks, and is written as a user- level library that supports most UNIX style systems. It is written in C++ and defines interfaces for which different protocols and consistency models can be implemented. With CVM no further modifications to the DSM core is required for the extension and implementation of new protocols, allowing for rapid development and deployment of new protocol implementations. This is achieved by deriving new classes from the Page and Protocolbase classes. Implementations exist for sequential and lazy release consistencies, and the home and homeless variants of the invalidate coherence protocol.

CVM employs many of the same approaches that were taken in the more recent projects that utilised Treadmarks, such as the underlying end-to-end communication protocol built upon UDP sockets, start-up of remote processes by a remote shell process, and the same limit of 32 processes. However, new communication layers can be implemented by implementing theMsginterface. The programmer’s API is given in Appendix C, and is also practically identical to that of Treadmarks.

Dynamic CVM (D-CVM) [101] is an extension that allows for the dynamic migration of threads from one process to another, allowing for better load balancing of applications.

4.8.6 Brazos

The Brazos DSM system [102] implemented Scope consistency. Although not described in this thesis, this is a hybrid of release and entry consistencies [103]. This consistency model aims to provide the latter’s potential performance advantages with the superior usability of the former, allowing for the reduction in false sharing.

The Brazos DSM was implemented to take advantage of the proliferation of cost effective networks of multiprocessors. The DSM was developed for the Windows NT operating system, and sought to leverage some of its features. The DSM had native support for multi-threading, which is facilitated in the OS with support for preemptive multi- threading, something that had been lacking at that time in the Linux OS. The DSM allowed dynamic run-time performance tuning by reducing the shared memory pages’ copy-set sizes by selecting the most appropriate coherence protocol.

Communication between processes was enabled through the use of the WinSock API, and the Brazos DSM made use of novel features such as selective multi-cast sockets for communication, resulting in a reduction of the number of consistency-related messages. The Brazos DSM was used as the target for an OpenMP compiler, which enabled applications developed with the increasingly popular OpenMP standards to execute on a group of distributed shared memory machines [104]. This project also offered a combined ’common’ programming model composed of OpenMP and MPI.

REVIEW 58

4.9 Review

Chapter 2 developed the rationale (orwhy) of DSM. This chapter dealt with issues sur- rounding the trade-off in thewhen and how it is actually achieved. The SMG DSM is a fresh approach at implementing a DSM, by learning from the inherent defects of previous DSM implementations that have prevented their use in a grid environment (which is reasonable considering most were implemented before the Grid paradigm). Many of the seminal works in the field mentioned in the case studies suffer from trivial design deficits such as support for a (relatively) small global memory space, small process pool

In document The SMG DSM system: enabling shared memory for the grid (Page 71-86)