The algorithm: a three-party asymmetry - Asymmetric distributed lock algorithm

4.5 Asymmetric distributed lock algorithm

4.5.2 The algorithm: a three-party asymmetry

– The al go rithm: a thr ee-p ar ty as ymme tr y

4.5.2 The algorithm: a three-party asymmetry

The ring can be kept low cost, because of three reasons. 1) The required bandwidth for synchronization purposes is low. 2) The routing in a ring is trivial, and thus cheap. 3) The latency of one clock cycle per tile is not an issue, because most of the latency is introduced by the software of the message-handling daemons—even when a hundred cores are added, the increased latency by the ring is much less than the costs of a single context switch. FPGA synthesis shows that the ring uses about 1.4 % of the total amount of logic for the PLB slaves and the ring itself. As the algorithm only relies on message exchange, which does not require time to behave correctly, predictable timing of the ring is not required.

The main idea of our algorithm is that when a process wants to enter a critical section—and tries to lock a mutex—it sends a request to a server. This server either responds with “You got the lock and you own it”, or “Process p owns the lock, ask there again”. In the latter case, the locking process sends a message to p, which can reply with “The lock is free, now you own it”, or “It has been locked, I’ll signal you when it is unlocked”. When a process unlocks a lock, it will migrate it to the process that asked for it, or flag it is being free in the local administration otherwise. In this way, processes build a fair, distributed, FCFS waiting queue. In great contrast to a token-based solution, which also migrates ownership of locks, processes only give up the ownership when they are asked to do so.

Algorithms 1 to 3 show the implementation. In more detail:

» lock server (algorithm 1): a process that registers the owner process of a lock (or the last one waiting) using the map G. When the server gets a request message, it either responds with that the lock is available (line 7) or already owned (line 11), depending on whether the lock has already been registered on the server. When a process owning a lock does not need it anymore, it can give it up. A lock is (statically) assigned to a single server, and a server can service many locks.

» message handler (algorithm 2): every MicroBlaze runs a single message- handling daemon, which handles incoming messages (see section 2.7.2). When another process asks for an owned lock, this daemon inspects the lock administration (denoted map Lp) of the owning local process. Then, it either marks the lock for migration on unlock (line 9) or steals the lock (line 12). In case of the race condition that the give up and the ask message are sent concurrently, the daemon replies that the lock is free (line 7). » locking process (algorithm 3): the process that wants to enter a critical sec-

tion. When locking, it checks its own administration. When the lock is already owned by the process, it will lock it immediately, without commu- nicating with other cores. Otherwise, it will request the server (line 9) and ask the owner (line 10) of the lock when appropriate. Asking for a locked lock implicitly enqueues the asking process (line 11). On unlock, an (un- cached) read from main memory is performed (line 15), which enforces

66 Ch ap ter 4 – Efficient Har dw ar e Infra str uctur e

Algorithm 1:Lock server

1 Global: administration of the owner (or last waiting in queue) of all locks that are currently given out by this server:

G ∶ lock → process, G ← ∅ 2 Procedure:request(l,r)

3 Input: process r requesting lock l 4 begin

5 ifG(l) is unassociated then

6 G(l) ← r

7 returngot lock l 8 else

9 p ← G(l)

10 G(l) ← r

11 returnask p for state of l 12 Procedure:giveup(l,r)

13 Input: process r giving up lock l 14 begin

15 ifG(l) = r then 16 unassociate G(l)

Algorithm 2:Message handling daemon

1 Global: administration of owned locks of a local process p ∈ P:

Lp∶lock → {free, locked, migrate, stolen}, ∀p ∈ Plocal∶Lp← ∅ 2 Procedure:ask(l,p,r)

3 Input: lock l, owned by process p, requested by r 4 begin

5 atomic

6 ifLp(l) is unassociated then 7 returnfree

8 else ifLp(l) = locked then

9 L_p(l) ← migrate on unlock to r 10 returnlocked

11 else

12 Lp(l) ← stolen

67 4.5.2 – The al go rithm: a thr ee-p ar ty as ymme tr y

Algorithm 3:Locking process

1 Global: Lself, which is one of L of algorithm 2 2 Procedure:lock(l) 3 Input: lock l 4 begin 5 atomic 6 s ← Lself(l) 7 Lself(l) ← locked 8 ifs ≠ free then

9 ifrequest(l,self) = ask p then 10 ifask(l,p,self) = locked then

11 wait for signal (signal counterpart at line 21)

12 Procedure:unlock(l) 13 Input: locked lock l 14 begin

15 dummy read SDRAM 16 atomic

17 s ← Lself(l) 18 Lself(l) ← free 19 ifs = migrate to r then 20 unassociate Lself(l)

21 signal r (wait counterpart at line 11) 22 else iftoo many free locks in Lselfthen 23 l′←oldest free lock in Lself 24 unassociate Lself(l′) 25 giveup(l′,self)

an ordering between communication via the ring and operations on the background memory, and ensures that all outstanding memory operations will be completed before the lock is unlocked—the system implements the Release Consistency memory model. This is guaranteed, as the interconnect arbitrates in FCFS manner and the memory controller processes all requests in-order. Then, only locally is the state updated (line 18), unless there is a process already waiting, which will be signaled in that case (line 21). When the process exits, all owned locks must be given up, which is left out of algorithm 3 for simplicity.

This algorithm works only when assuming that messages cannot get lost, and the message handler cannot be interrupted to handle another message. Then, mutual

68 Ch ap ter 4 – Efficient Har dw ar e Infra str uctur e

exclusion is guaranteed, as the first process in the queue is well known, which owns (and might lock) the mutex. When a mutex is not owned, a process requesting it only needs to send one message. If it is owned, but not locked, two messages are required. In any case, progress to enter the critical section cannot be stalled, as long as messages are handled. Moreover, the waiting time for processes that are locking a mutex, is bounded by means of this queue—assuming that the process that holds the lock, will release it eventually.

Unfortunately, testing the algorithm with a model checker like Spin [56] is not feasi- ble. To test queuing properly, the Spin model requires many concurrent processes, which all have multiple ways of interleaving messages and state changes. This leads to excessive verification run times. On the other hand, when the model is simpli- fied in order to reduce verification time, it is unclear whether it still matches the algorithm.

Next, the performance of the proposed ring and new lock algorithm will be com- pared to the bakery lock.

In document Programming models for many-core architectures: a co-design approach (Page 81-84)