Exploration - The SMG DSM system: enabling shared memory for the grid

The question this thesis explores is whether there is a methodology whereby applications can efficiently share data, transparently to the programmer, across a number of distributed sites in a grid environment that exhibits (relatively) large latencies. It also investigates whether there are applications that can employ latency-hiding techniques, such as relaxing memory consistency, overlapping communication and computation, and user-driven multi-threading. To this end, a DSM system, SMG (Shared Memory for Grids), is implemented as a tool for the investigation.

Traditional DSM implementations have to date focused on the cluster environment. Although they attracted considerable research attention in the 1990s, they were not a success, as the embedded overhead was difficult to amortise over the duration of the application. Coarser-grained applications that process large amounts of data relative to their communication demands may be more latency-tolerant and should prove a good fit for execution in a grid environment.

In addition, it is asked whether grid-enabled applications can benefit from a hybrid approach that combines two programming models, i.e. message passing and shared memory (provided by DSM), within a program. The inherent drawbacks of one model might be offset by the benefits of the other, thus lowering the programming burden and costs. Mixed-mode programming models have already been employed successfully in systems consisting of a large number of multiprocessors [19], where a shared memory approach is used locally for each shared-memory multiprocessing node, while message passing is used for the communication between nodes.

Such a hybrid approach could promote the use of a shared memory paradigm on a grid where otherwise the inefficiencies would prohibit it. Applications written in a shared memory style are easier to write, and subsequently they might be ported to the grid by identifying the shared objects and/or code locations responsible for communication hot-spots and converting these to explicit message passing. Potentially, the process of changing from one model to another need only be done in a localised fashion, i.e. the only sections of a program to be converted to message passing would be those that would exhibit a significant performance gain. This might even be done in an incremental fashion dictated by the law of diminishing returns.

Chapter 4 DSM

To recap, Distributed Shared Memory (DSM) is a concept that attempts to combine the advantages of shared memory and distributed memory systems into one parallel programming paradigm. It combines the single address space and simplified programming environment of the shared memory model with the scalability and cost-effectiveness of multicomputer systems. In essence, the goal of any DSM is to extend the shared memory paradigm to distributed-memory platforms so as to provide the illusion that a shared variable is physically shared by all threads of execution in a multi-process, multi-threaded application. This illusion comes at the cost of reduced performance compared with a message-passing implementation, due to the computation and communication overhead associated with the DSM system. This will be discussed in detail in Chapter 6. There are, however, some applications for which a DSM version is more suitable than an equivalent message-passing version, as the algorithm may lend itself to a more efficient implementation.

Different approaches to solving the problems associated with constructing DSMs involve a trade-off between the hardware and software facilities available. At one extreme, hardware-only DSMs (H-DSM) employ specialised hardware interconnects for latency hiding and/or data consistency, while at the other, software-only DSMs (S-DSM) use only software techniques. There are also hybrid schemes that attempt to mix both. Appendix B lists previous DSM implementations according to these classifications. As this thesis relates to grid environments, only software-only approaches are considered.

The implementation of a software-only DSM (S-DSM)¹ system involves many, and potentially conflicting, considerations [42]. This chapter examines such implementation issues. There are four primary questions to be answered in order to implement DSM. The first is what underlying protocol will be used to allow access to shared state. Second, how to ensure that shared state is kept consistent across all nodes when there are competing accesses to a shared variable. Third, how the location of a shared data item is to be found when it is required but not available in the local cache. Last, how communication between nodes can best be minimised when addressing the previous issues.

¹ From here on the terms S-DSM and DSM will be used interchangeably.

4.1 Shared Memory Access Patterns

DSM allows a variable to be accessed by many threads of execution that may not reside on the same physical machine, while at the same time these accesses should be transparent to the application developer. When a read or write occurs, memory should be consistent according to the consistency model (rules) of the system (see Section 4.5). Providing this functionality is the job of the DSM system. With some applications this can place a severe burden on resources (operating system traps, inter-node communication, DSM system handling). It is therefore important for a developer to consider data access patterns and related issues such as locality of reference. To ease the task, shared memory is often allocated in chunks or regions.
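To make the cost of a poor access pattern concrete, the following minimal sketch counts how many region fetches a single node would incur when traversing a shared matrix in two different orders. It assumes a deliberately simplified model that is not taken from SMG: the matrix is stored in row-major order, it is split into fixed-size regions, and the node can cache only one region at a time, so every access outside the cached region costs one simulated fetch (standing in for a trap plus inter-node communication). The names `count_fetches`, `row_major`, `column_major` and the chosen sizes are hypothetical.

```python
from typing import Iterable, List, Tuple


def count_fetches(accesses: Iterable[Tuple[int, int]], cols: int, region_size: int) -> int:
    """Count simulated region fetches for a sequence of (row, col) accesses.

    Assumed model (illustrative only): row-major storage, fixed-size regions,
    and a local cache that holds exactly one region at a time.
    """
    cached_region = None
    fetches = 0
    for row, col in accesses:
        region = (row * cols + col) // region_size  # region holding this element
        if region != cached_region:
            fetches += 1  # stands in for a trap plus a remote fetch
            cached_region = region
    return fetches


def row_major(rows: int, cols: int) -> List[Tuple[int, int]]:
    """Visit the matrix one row at a time (matches the storage order)."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def column_major(rows: int, cols: int) -> List[Tuple[int, int]]:
    """Visit the matrix one column at a time (strides across regions)."""
    return [(r, c) for c in range(cols) for r in range(rows)]


ROWS, COLS, REGION = 4, 4, 4  # hypothetical sizes: one region per matrix row
good = count_fetches(row_major(ROWS, COLS), COLS, REGION)
bad = count_fetches(column_major(ROWS, COLS), COLS, REGION)
print(good, bad)
```

Under these assumptions the row-major traversal incurs 4 fetches and the column-major traversal 16, since every column-wise step lands in a different region. A real DSM caches more than one region, but the same effect appears whenever the working set exceeds what is held locally.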

Access patterns to shared regions are an important consideration as better performance will be achieved through optimal use of the available resources. The implementers of the Munin DSM project found that there are characteristic types of shared data that occur in a shared memory application [34]:

• Read-only: this type of variable is first initialised, and subsequently only read accesses occur. The incident matrices in the matrix multiply application of Chapter 2 are such an example.

• Migratory: this type of variable is accessed by one thread, modified, and then accessed by a different one. This pattern is repeated for other threads in the system. Caching or replicating copies results in little benefit. This type of variable is found in the classical travelling salesman problem (TSP) that is examined in Section 10.1.1.

• Write-shared: multiple writers access the shared area concurrently between synchronisation points, but the individual threads modify different sections. Successive Over Relaxation (SOR) (Section 10.1.1) or Jacobi Iteration are classes of problems where a variable of this type may occur.

• Conventional: there is no discernible access pattern associated with this type of shared variable. Access is irregular, so techniques such as prefetching² of shared data are not beneficial. This type of variable occurs in numerical simulation applications such as the N-Body Water benchmark (Section 10.1.1).

• Synchronisation: these variables relate to the synchronisation actions that are required to synchronise threads. Often synchronisation accesses are viewed in isolation from normal shared memory accesses.

The varied nature of these access types indicates the memory-access flexibility required by the DSM system. The flexibility must be supported by the shared memory modes that govern the access to the shared data.
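The distinctions between the first four categories can be illustrated with a small classifier over an access trace. This is only a sketch under stated assumptions, and it is not how Munin or SMG identifies access types: the trace format, the function name `classify` and the rules themselves are hypothetical. It assumes the trace is recorded after initialisation, covers a single shared region, and lists accesses as (thread id, operation, offset) tuples. Synchronisation variables are not covered, since they are normally handled separately.

```python
from collections import defaultdict
from typing import List, Tuple

Access = Tuple[int, str, int]  # (thread id, "r" or "w", offset within region)


def classify(trace: List[Access]) -> str:
    """Label one shared region from a trace recorded after initialisation."""
    written = defaultdict(set)  # thread id -> offsets that thread wrote
    for thread, op, offset in trace:
        if op == "w":
            written[thread].add(offset)

    # Read-only: no writes occur once the region has been initialised.
    if not written:
        return "read-only"

    # Write-shared: several writers, each modifying a different section.
    writers = list(written.values())
    disjoint = all(
        a.isdisjoint(b) for i, a in enumerate(writers) for b in writers[i + 1:]
    )
    if len(writers) > 1 and disjoint:
        return "write-shared"

    # Split the trace into runs of consecutive accesses by the same thread.
    runs = []
    for thread, op, _ in trace:
        if runs and runs[-1][0] == thread:
            runs[-1][1].add(op)
        else:
            runs.append((thread, {op}))

    # Migratory: the data passes from thread to thread, each one modifying it.
    if len(runs) > 1 and all("w" in ops for _, ops in runs):
        return "migratory"

    # Conventional: no discernible pattern.
    return "conventional"


read_only = [(0, "r", 0), (1, "r", 0), (2, "r", 4)]
migratory = [(0, "r", 0), (0, "w", 0), (1, "r", 0), (1, "w", 0), (2, "r", 0), (2, "w", 0)]
write_shared = [(0, "w", 0), (1, "w", 8), (0, "w", 1), (1, "w", 9)]
conventional = [(0, "w", 0), (1, "r", 0), (1, "w", 0), (0, "r", 3)]

for name, trace in [("read_only", read_only), ("migratory", migratory),
                    ("write_shared", write_shared), ("conventional", conventional)]:
    print(name, "->", classify(trace))
```

Each of the four sample traces is labelled with the category it is named after. A linear trace cannot show true concurrency, so the sketch uses disjoint written offsets as a stand-in for "different sections between synchronisation points".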

² Prefetching is a consumer-initiated technique, as opposed to remote write, which is producer-initiated, that moves data close to the process before it is actually required.
