Parallel Synchronisation - The SMG DSM system: enabling shared memory for the grid

In parallel programming, situations occur in which the result of a computation depends on the order in which program statements are executed by different threads, i.e. there is a data dependence between them. Ensuring that such situations (often termed data races) are absent is vital, and synchronisation primitives are required to achieve this. In addition to these explicit primitives, implicit means are also available: in the message-passing model discussed above, synchronisation can also be implemented using synchronous messaging routines. However, only those tasks that participate in the communication operation are synchronised.

With any shared-memory style of programming paradigm (or variant) there is a need for mutually exclusive access to shared memory and for the synchronisation of the threads of execution in the application. The most commonly occurring synchronisation variables are locks and barriers.

2.4.1 Lock Synchronisation

A lock, or MUTual EXclusion (mutex) device, protects shared data structures from conflicting modifications by multiple processes by guarding the sections of code that actually modify them. These code areas are termed critical sections. Locks are also used to implement higher-level abstractions such as monitors. A system that provides critical sections through the use of locks must satisfy four requirements. The first three are the responsibility of the system, while the developer/algorithm is responsible for ensuring the last:

i. Mutual exclusion: at most one thread of execution may execute the critical section at any one time

ii. Eventual Entry: if there are multiple threads trying to access a critical section at the same time, then one of them must succeed in acquiring the lock

iii. Absence of Delay: a thread should get access to the critical section if no other thread is already doing so

iv. Absence of Starvation: lack of deadlock/livelock; a thread that is attempting to access a critical section will eventually be allowed to enter it

In traditional multi-processor settings these primitives consist of a shared memory location that is set to a value indicating its state (locked/unlocked). This structure is often termed a spin-lock and is equivalent to the code fragment below. Mutual exclusion is supported by all modern CPUs through the provision of an atomic test-and-set (or equivalent) instruction that can test the lock and, depending on its value, set it or do nothing, without being preempted by another processor. Without such an instruction, two separate threads of execution concurrently running the (crude) test-and-spin fragment below could both observe the lock as free and both set it. It must be noted that more efficient solutions exist for the implementation of critical sections on shared memory machines, such as Peterson’s Algorithm, that result in reduced memory contention [29].

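while (lock == SET);   /* spin while another thread holds the lock */
lock = SET;            /* take the lock; the test and this write are not atomic */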
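
The atomic test-and-set approach described above can be sketched with the C11 atomic_flag type, whose atomic_flag_test_and_set operation sets the flag and returns its previous value in one indivisible step. This is a minimal illustration only: the names lock_flag, spin_acquire, spin_release and spin_demo are hypothetical, POSIX threads are assumed to be available, and the sketch makes no attempt to reduce memory contention.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 16

/* Hypothetical names throughout; an illustrative sketch only. */
static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long shared_counter;            /* protected by lock_flag */

static void spin_acquire(void)
{
    /* Atomically set the flag and fetch its previous value. A previous
       value of true means another thread already holds the lock. */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;                              /* spin */
}

static void spin_release(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

static void *worker(void *arg)
{
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        spin_acquire();
        shared_counter++;              /* critical section */
        spin_release();
    }
    return NULL;
}

/* Runs nthreads workers and returns the final value of the counter. */
long spin_demo(int nthreads, long iters)
{
    pthread_t tid[MAX_THREADS];
    assert(nthreads >= 0 && nthreads <= MAX_THREADS);
    shared_counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return shared_counter;
}
```

Calling spin_demo(4, 10000) should return 40000, since every increment happens inside the critical section. On most toolchains the sketch is compiled with the -pthread option.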

The implementation of distributed locks has a number of associated problems. Most importantly, there is no distributed atomic read-and-modify instruction, and memory with the consistency required by the above code fragment is not available; in particular, failures to satisfy requirements (ii) and (iii) above are highly exacerbated. Additional challenges are that every lock primitive must be uniquely identifiable system-wide, and that the asynchronous techniques used in shared memory locks are not suitable for distributed memory machines because inter-machine communication resources are scarce.

In general, a lock can be in one of three main states: exclusive, non-exclusive and free. The modes in which a lock can be held fall into two categories, exclusive and non-exclusive (also termed write and read modes, respectively). The number of threads that can simultaneously possess a lock in each mode is governed by the shared memory access modes supported by the system (discussed in Section 4.2).

A lock held in non-exclusive mode can be acquired in that mode by multiple nodes simultaneously (i.e. rule (i) above does not apply). Once in this state it cannot be promoted to exclusive mode without first being released and reacquired. This mode is also referred to as read access. In order for a thread of execution to acquire a lock variable in non-exclusive mode it must obtain permission from either the owner or another thread that has already been granted non-exclusive access. This other thread may reside within the same process as the requesting thread, or in a process on a remote node.

Exclusive lock access differs from non-exclusive access in that only one thread, the owner, can possess the lock and therefore perform write accesses inside the critical section. For a lock to be granted in exclusive mode, no other process may be in possession of the lock in either exclusive or non-exclusive mode.
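
The granting rules for the two modes can be summarised as a small state machine. The sketch below is illustrative only: the dlock type and its functions are hypothetical names, and it models just the compatibility rules described above (shared holders, no promotion, exclusivity), leaving out ownership and the messages a distributed implementation would need.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration only. */
typedef enum { ST_FREE, ST_NON_EXCLUSIVE, ST_EXCLUSIVE } lock_state;
typedef enum { MODE_NON_EXCLUSIVE, MODE_EXCLUSIVE } lock_mode;

typedef struct {
    lock_state state;
    int holders;      /* current holders; never more than 1 when exclusive */
} dlock;

/* Returns true if the request can be granted in the lock's current state. */
bool dlock_try_acquire(dlock *l, lock_mode mode)
{
    if (mode == MODE_EXCLUSIVE) {
        /* Exclusive mode requires that nobody holds the lock in any mode. */
        if (l->state != ST_FREE)
            return false;
        l->state = ST_EXCLUSIVE;
        l->holders = 1;
        return true;
    }
    /* Non-exclusive mode is compatible with free and non-exclusive states. */
    if (l->state == ST_EXCLUSIVE)
        return false;
    l->state = ST_NON_EXCLUSIVE;
    l->holders++;
    return true;
}

/* Releases one hold; the lock becomes free when the last holder leaves. */
void dlock_release(dlock *l)
{
    assert(l->holders > 0);
    if (--l->holders == 0)
        l->state = ST_FREE;
}
```

Because an exclusive request is refused whenever the lock is not free, a non-exclusive holder that wants write access has to call dlock_release and then request exclusive mode, which mirrors the no-promotion rule.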

2.4.2 Barrier Synchronisation

A barrier is a mechanism that provides for the synchronisation of a number of processes in a parallel application. It requires all threads participating in the operation to call a barrier routine and wait until all other threads have also done so. Once this has occurred all threads may proceed. The simplest method to implement a barrier in a shared memory system is to use a memory location as a shared counter. This counter is incremented (atomically) when a thread arrives at the barrier. The thread subsequently waits until the required count (quorum) is reached. Such an algorithm results in high memory contention [30]. A combining tree barrier algorithm reduces contention by introducing sub-counters that record the arrivals of a subset of the threads (Figure 2.6(a)). When a sub-count is reached the parent count is incremented. When the parent count reaches its quorum the threads may proceed. Other (symmetric) algorithms exist that attempt to balance the load so that all threads wait for the same amount of time.
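
The shared-counter barrier can be sketched with C11 atomics. This is a single-use illustration under stated assumptions: the names counter_barrier and barrier_demo are hypothetical, POSIX threads are assumed, and no attempt is made to make the barrier reusable or to reduce contention. Every waiting thread polls the same memory location, which is the source of the contention noted above.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS 16

static atomic_int arrived;            /* the shared counter */
static int quorum;                    /* number of participating threads */
static int slot[MAX_THREADS];         /* one value published per thread */
static int seen_total[MAX_THREADS];   /* sum each thread sees after the barrier */

/* Single-use barrier: increment atomically, then wait for the quorum. */
static void counter_barrier(void)
{
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < quorum)
        sched_yield();                /* all waiters poll the same location */
}

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    slot[id] = id + 1;                /* work done before the barrier */
    counter_barrier();
    int sum = 0;
    for (int i = 0; i < quorum; i++)
        sum += slot[i];               /* every writer has arrived by now */
    seen_total[id] = sum;
    return NULL;
}

/* Returns how many threads observed the complete sum 1 + 2 + ... + n. */
int barrier_demo(int n)
{
    pthread_t tid[MAX_THREADS];
    assert(n > 0 && n <= MAX_THREADS);
    atomic_store(&arrived, 0);
    quorum = n;
    for (intptr_t i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    int ok = 0;
    for (int i = 0; i < n; i++) {
        pthread_join(tid[i], NULL);
        if (seen_total[i] == n * (n + 1) / 2)
            ok++;
    }
    return ok;
}
```

If the barrier works, barrier_demo(n) returns n, because no thread reads the slot array until every thread has written its own entry.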

Barrier primitive implementations have always been inefficient [31], so their use in distributed-memory applications should be minimised. However, their presence is often mandatory, particularly in iterative applications. Whenever this occurs the situation can be exploited, as shared state information can be distributed globally by piggybacking it on the barrier messages. For distributed systems some notable barrier implementations have included [32, 30].

Figure 2.6: Barrier primitives. (a) Central server waiting for all threads to arrive. (b) Tree-based barrier, all threads proceeding.

The simplest distributed barrier implementation, the central server, is similar in methodology to the shared counter. It consists of a central barrier administrator that maintains a count and accepts arrival notices from the processes partaking in the barrier, incrementing the count with each new arrival. Once all arrival notices have been received, including the local notice, proceed notices are issued. Such an implementation suffers from a lack of scalability as the barrier administrator becomes the bottleneck, with N-1 nodes contacting the master at arrival, and it having to reciprocate N-1 times.
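
The bookkeeping performed by the central administrator can be sketched as follows. The barrier_admin structure and its functions are hypothetical names used for illustration; the sketch only counts notices and does not perform any real network communication.

```c
#include <assert.h>

/* Hypothetical bookkeeping for a central barrier administrator. */
typedef struct {
    int nodes;           /* N: processes taking part, administrator included */
    int arrivals;        /* arrival notices counted so far (remote + local) */
    int msgs_received;   /* remote arrival messages handled */
    int msgs_sent;       /* proceed messages issued */
} barrier_admin;

void admin_init(barrier_admin *a, int nodes)
{
    assert(nodes >= 1);
    a->nodes = nodes;
    a->arrivals = a->msgs_received = a->msgs_sent = 0;
}

/* Records one arrival. is_remote is nonzero for a notice that came from
   another node and zero for the administrator's own local notice.
   Returns the number of proceed notices issued as a result: 0 until the
   last arrival, then N-1. */
int admin_arrive(barrier_admin *a, int is_remote)
{
    assert(a->arrivals < a->nodes);
    if (is_remote)
        a->msgs_received++;
    if (++a->arrivals < a->nodes)
        return 0;
    a->msgs_sent += a->nodes - 1;    /* one proceed notice per remote node */
    a->arrivals = 0;                 /* ready for the next barrier episode */
    return a->nodes - 1;
}
```

For N = 4 the administrator handles 3 incoming arrival messages and sends 3 proceed messages, i.e. 2(N-1) messages pass through a single node per barrier, which is where the scalability problem arises.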

A modification of the previous scheme involves the adoption of a tree algorithm in which barrier sub-administrators relieve some of the burden from the administrator. A node partaking in a barrier operation may have antecedents and a consequent. The node can only issue a barrier arrival notification to its consequent once its local quorum has been satisfied. This quorum is composed of the local requirements (have all local threads arrived?) and those of the antecedents. Once the requirements have been met the process can issue an arrival notice to its consequent. At the top of the tree is the barrier administrator. Once the quorum of the administrator has been reached it can issue proceed notices. Based on the LogP model [33], the minimum wait latency for a barrier using such an implementation is 2 log₂ N × average message latency.
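
The quorum rule can be illustrated with a small simulation. The layout used here is an assumption and does not come from the source: nodes are arranged as a binary tree in which node i has antecedents 2i+1 and 2i+2 and consequent (i-1)/2, node 0 is the barrier administrator, and each node hosts a single local thread. The names tree_stats and tree_barrier_simulate are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 1024

/* Result of one simulated barrier episode. */
typedef struct {
    int busiest_node_notices;  /* most arrival notices any one node received */
    int hops_to_root;          /* arrival hops from the deepest node to node 0 */
} tree_stats;

/* Assumed binary-tree layout (not from the source); see the text above. */
tree_stats tree_barrier_simulate(int n)
{
    static int pending[MAX_NODES];    /* notices each node still needs */
    static int received[MAX_NODES];   /* notices received from antecedents */
    tree_stats s = { 0, 0 };
    assert(n >= 1 && n <= MAX_NODES);

    for (int i = 0; i < n; i++) {
        int antecedents = (2 * i + 1 < n) + (2 * i + 2 < n);
        pending[i] = 1 + antecedents; /* local thread plus antecedents */
        received[i] = 0;
    }
    /* Visit nodes from the deepest upwards. Antecedents have higher indices,
       so their notices are already counted when a node is visited. */
    for (int i = n - 1; i >= 0; i--) {
        pending[i]--;                 /* the local thread arrives */
        assert(pending[i] == 0);      /* local quorum is now satisfied */
        if (i > 0) {
            int consequent = (i - 1) / 2;
            pending[consequent]--;    /* arrival notice sent upwards */
            received[consequent]++;
            if (received[consequent] > s.busiest_node_notices)
                s.busiest_node_notices = received[consequent];
        }
    }
    /* Count the hops on the longest path up to the administrator. */
    for (int i = n - 1; i > 0; i = (i - 1) / 2)
        s.hops_to_root++;
    return s;
}
```

With N = 8 under this layout, no node receives more than 2 arrival notices (compared with N-1 = 7 at a central administrator), and the deepest node is 3 hops from the administrator, so arrival followed by proceed notices takes 6 message latencies, which matches 2 log₂ 8.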
