Wait Free Synchronization Lecture 2

CS380D Distributed Computing I


Linearizability

Each operation appears to take effect instantaneously at some point between its invocation and its response.

Linearizability is a local property

• A concurrent system is linearizable if and only if each individual object is linearizable.

Linearizability is a non-blocking property

• A total operation (one defined for every object state) is never required to block.

Wait-Free Data Structures

A wait-free data structure guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of other processes.

A lock-free data structure guarantees that *some* process will complete an operation in a finite number of steps, regardless of the execution speeds of other processes.

Compare-and-Swap

    // The entire body executes atomically.
    boolean CAS( val *addr, val old, val new ) {
        if ( *addr == old ) {
            *addr = new;
            return true;
        } else {
            return false;
        }
    }

• CMPXCHG (with the “lock” prefix): Intel x86

• Load-Linked / Store-Conditional: MIPS, PowerPC
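
The CAS contract above can be sketched with C11 atomics. This is a minimal illustration, assuming a C11 compiler with `<stdatomic.h>`; the wrapper name `cas_int` is hypothetical and not part of the slides. The standard call `atomic_compare_exchange_strong` takes the expected value by pointer and overwrites it on failure, so the wrapper works on a local copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wrapper (not from the slides): the CAS contract above,
 * specialized to int and built on C11 atomic_compare_exchange_strong. */
static bool cas_int(atomic_int *addr, int old, int new_val)
{
    assert(addr != NULL);
    /* The C11 call stores the observed value into `expected` when it fails,
     * so a local copy keeps the caller's `old` untouched. */
    int expected = old;
    return atomic_compare_exchange_strong(addr, &expected, new_val);
}
```

A call such as `cas_int(&x, 5, 7)` returns true and installs 7 only if `x` still holds 5; otherwise it returns false and leaves `x` unchanged.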


Wait-Free Synchronization and Consensus

In a system with n processes, a primitive can be used to construct wait-free implementations of arbitrary objects if and only if the primitive can be used to solve the consensus problem for n processes.


Consensus Objects

A consensus object is a concurrent object that implements a consensus protocol:

    // A consensus object
    class cobj {
        // decide - always returns the same value, which is a value
        // previously passed as input
        value_t decide( value_t input );
    }

The consensus number of a concurrent object is the maximum number of processes for which the object can solve a simple consensus problem.
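
One way to realize `decide` is with compare-and-swap. The sketch below is an illustration rather than the construction used in the lecture: it assumes values are non-NULL pointers and reserves NULL to mean "not yet decided". The names `cobj_init` and `cobj_decide` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Sketch of a consensus object whose values are pointers.
 * Assumption: NULL is reserved to mean "not yet decided". */
typedef struct {
    _Atomic(void *) decided;
} cobj;

static void cobj_init(cobj *c)
{
    atomic_init(&c->decided, NULL);
}

/* The first caller's input wins the CAS; every caller returns that input. */
static void *cobj_decide(cobj *c, void *input)
{
    assert(input != NULL);          /* NULL is the "undecided" marker */
    void *expected = NULL;
    if (atomic_compare_exchange_strong(&c->decided, &expected, input))
        return input;               /* our input was chosen */
    return expected;                /* CAS failed: holds the earlier winner */
}
```

Each call performs a single CAS and no loop, so it finishes in a bounded number of steps no matter how many processes call it.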

Wait-Free Hierarchy

Object                               Consensus Number

compare&swap, FIFO queue w/ peek     ∞

n-register assignment                2n - 2

test&set, fetch&add                  2

atomic read/write registers          1
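
The consensus number 2 entry for test&set can be illustrated with a two-process protocol. This is a sketch, assuming process ids 0 and 1 and C11 atomics; the type and function names are hypothetical. Each process publishes its input in its own register and then competes on a single test&set bit.

```c
#include <assert.h>
#include <stdatomic.h>

/* Two-process consensus from test&set plus read/write registers.
 * Process ids are 0 and 1; the names here are hypothetical. */
typedef struct {
    atomic_flag taken;         /* the test&set bit */
    atomic_int  proposal[2];   /* proposal[i] is written only by process i */
} consensus2;

static void consensus2_init(consensus2 *c)
{
    atomic_flag_clear(&c->taken);
    atomic_init(&c->proposal[0], 0);
    atomic_init(&c->proposal[1], 0);
}

static int consensus2_decide(consensus2 *c, int me, int input)
{
    assert(me == 0 || me == 1);
    /* Publish the input before competing, so the loser can read it. */
    atomic_store(&c->proposal[me], input);
    if (!atomic_flag_test_and_set(&c->taken))
        return input;                           /* bit was clear: we won */
    return atomic_load(&c->proposal[1 - me]);   /* we lost: adopt the winner's input */
}
```

The loser knows the winner must be the other process, which is why this works for two processes. With three or more, a loser cannot tell which proposal won.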

Universal Objects

An object is universal if it can be used to construct a wait-free implementation of any object.

In a system of n processes, an object is universal if and only if the object has consensus number at least n.


Wait-Free vs Lock-Free

In lock-free data structures, it’s okay to interfere with another process as long as you make progress – this maintains the guarantee that some process makes progress.

In wait-free data structures, interfering with another process is not okay, because this could prevent that process from making progress.

So in wait-free data structures, concurrent operations must cooperate to make sure everyone makes progress.

This cooperation is called helping.


Helping

The general approach for helping is:

• Each process “announces” its intent to perform an operation before starting.

• Once an operation is announced, any process can perform the operation.

• “Eventually” some process performs the operation.

    ◦ Even if the original process crashed.

    ◦ More than one process could attempt to perform the operation. The protocol must ensure that only one succeeds.

Universal Construction - 1

The object is represented as a linked list of cells, each of which represents an operation on the object.

• The order of cells in the list determines the order of operations.

    struct cell {
        cobj     after;    // consensus object with value <cell *after>;
                           // null indicates end of the list
        cell    *before;   // ptr to previous cell
        seqnum_t seq;      // sequence number; 0 means not threaded;
                           // monotonically increasing by 1
        invoc_t  inv;      // invocation (operation name and argument values)
        cobj     new;      // consensus object with value <new.state, new.result>
    }

Universal Construction - 2

    class object {
        cell anchor;

        // Shared variables - all processes can read, but only process P can write element P
        cell* announce[1:N];       // Pth element is the cell P is trying to thread
        cell* head[1:N];           // Pth element is last cell P has observed

        // Auxiliary variables - "write only" variables used only for proof purposes
        set of cell concur[1:N];   // the set of cells whose addresses have been stored into
                                   // the head array since P's last announcement (stmt 2)
        seqnum_t start[1:N];       // value of max(head[Q].seq) at P's last announcement

        object() {   // constructor
            anchor = { after = new cobj(null), before = null, seq = 1,
                       inv = init, new = cobj(<init.state, 0>) };
            for all Q {
                announce[Q] = anchor;
                head[Q] = anchor;
                concur[Q] = {};
                start[Q] = anchor.seq;
            }
        }
    }


Universal Construction - 3

    universal(invoc_t what) returns(RESULT)
       // Allocate a cell to represent an operation
     1 cell mine = { after = new cobj, before = null, seq = 0, inv = what, new = new cobj }
       // Announce intent to thread cell
     2 <announce[P] = mine; start[P] = max(head[1].seq,...,head[N].seq); concur[P] = {};>
       // Locate a cell near the end of the list
     3 for (Q = 1; Q <= N; Q++) do
           if (head[P].seq < head[Q].seq) then head[P] = head[Q];
       end for
       // Execute until the cell for this process has been threaded onto the object.
     4 while announce[P].seq == 0 do
           // while body on next slide
       end while
    13 <head[P] = announce[P]; (for all Q) concur[Q] = concur[Q] U {announce[P]};>
    14 return (announce[P].new.result)
    end universal


Universal Construction - 4

       // Execute until the cell for this process has been threaded onto the object.
     4 while announce[P].seq == 0 do
     5     cell *c = head[P]                          // c is our view of the last cell on the list.
     6     cell *help = announce[(c.seq mod N) + 1]   // Choose a process to help
           // If the process needs help, try to thread its cell for it. Else thread mine.
     7     if help.seq == 0 then prefer = help else prefer = announce[P]
           end if
     8     d = c.after->decide(prefer)                // Attempt to thread a cell (either mine or help)
           // operation could be nondeterministic, so compute result using consensus obj
     9     d.new->decide(apply (d.inv, c.new.state))
    10     d.before = c
    11     d.seq = c.seq + 1
    12     <head[P] = d; (for all Q) concur[Q] = concur[Q] U {d}>
       end while

Universal Construction - Example

[Figure sequence: threading one operation. Each cell is drawn with the fields seq, inv, new, before, after, alongside the announce and head arrays. The list starts at the anchor cell (seq 1, inv init, new <init_state, 0>, before NULL, after <x>), followed by cell x (seq 2, push (A), new <|A|, 0>, after <y>) and cell y (seq 3, push (B), new <|A|B|, 0>), whose after field is still undecided. A new cell z (seq 0, pop (), new undecided) is allocated and announced. The process then picks c, help, and prefer, and consensus on y's after field decides z. Cell z is threaded with seq 4 and new <|B|, A>.]
Helping

[Figure sequence: helping. The list again holds the anchor, x (seq 2, push (A)) and y (seq 3, push (B)). Two pop () operations are announced in cells z and w, both with seq 0. The process shown picks c and help, and its prefer selects w. Consensus on y's after field decides w, so w is threaded with seq 4 and new <|B|, A>. Consensus on w's after field then decides z, which is threaded with seq 5 and new <||, B>.]

Auxiliary variables

“Write only” variables used only for proofs

• concur[P]: the set of cells whose addresses have been stored into the head array since P's last announcement (stmt 2)

• start[P]: the value of max(head[Q].seq) at P's last announcement


Construction is Wait-Free

Lemma 1: The following assertion is invariant:

|concur[P]| > n ==> announce[P] ∈ head

Lemma 2: The following assertion is invariant:

max(head) >= start[P]

Construction is Wait-Free

Lemma 3: The following is the loop invariant for stmt 3:

max(head[P].seq, head[Q].seq, ...,head[N].seq ) >= start[P]

where Q is the loop index.


Construction is Wait-Free

Lemma 4: Just before stmt 4:

head[P].seq >= start[P]

Lemma 5: The following is invariant:

|concur[P]| >= head[P].seq - start[P] >= 0.


Construction is Wait-Free

Theorem 14: Construction is linearizable and wait-free.

Proof:

• Linearizable because the order of operations is determined by the order of cells in the list.

• Wait-free because the main loop executes at most N+1 times.


Universal Construction: Summary

Given an object with a sequential specification, we can use a consensus object with consensus number n to create a linearizable, wait-free concurrent object for a system of n processes.

This construction is theoretically important.

This construction is not practically useful.

• Each operation requires two consensus protocols.

• The object requires O(N²) space.

Practical Lock-Free Synchronization

Desiderata:

• Ease of reasoning: programmers should be able to construct a correct lock-free data structure, and be able to prove or rigorously argue its correctness, without ending up with a publishable result.

• Performance: programmers should be able to construct lock-free data structures with acceptable performance, and be able to understand and influence the performance of the implementation.

Let’s Start Small

Suppose our object is small enough to fit in a single word ...

    lockfree_op( obj_type *obj, args )
        obj_type new_obj;
        do
            new_obj = Load_Linked( obj );
            ret = op( &new_obj, args );
            cc = Store_Conditional( obj, new_obj );
        until ( cc );
        return ret;
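
The same retry loop can be sketched in portable C11, which has no Load_Linked / Store_Conditional. As a named substitution, the sketch below uses compare-and-swap instead. The operation `add_op` and the wrapper `lockfree_add` are hypothetical examples, not code from the lecture.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A hypothetical sequential operation on a one-word object: add `arg`
 * in place and return the value the object held before the update. */
static long add_op(long *obj, long arg)
{
    long before = *obj;
    *obj += arg;
    return before;
}

/* Lock-free wrapper: snapshot the word, apply the operation to a private
 * copy, then try to install the copy. CAS stands in for LL/SC here. */
static long lockfree_add(atomic_long *obj, long arg)
{
    assert(obj != NULL);
    long old_obj = atomic_load(obj);
    for (;;) {
        long new_obj = old_obj;
        long ret = add_op(&new_obj, arg);
        /* On failure the call reloads old_obj with the current value,
         * so the next iteration starts from a fresh snapshot. */
        if (atomic_compare_exchange_weak(obj, &old_obj, new_obj))
            return ret;
    }
}
```

Unlike Store_Conditional, CAS succeeds whenever the word holds the expected value, even if it changed and changed back in between. For a plain numeric word like this one that is harmless, but it matters once the word is a pointer.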


For bigger (but still small) objects

Add a level of indirection

• Object must be small enough to be copied efficiently.

• Object storage must be in a single, fixed-size contiguous block.

    lockfree_op ( obj_type **obj, args )
        obj_type *old_obj, *new_obj;
        do
            old_obj = Load_Linked( obj );
            new_obj = new( obj_type );
            memcpy( new_obj, old_obj, sizeof( obj_type ) );
            ret = op( new_obj, args );
            cc = Store_Conditional( obj, new_obj );
        until ( cc );
        free( old_obj );
        return ret;


Two problems

    lockfree_op ( obj_type **obj, args )
        obj_type *old_obj, *new_obj;
        do
            ...
            new_obj = new( obj_type );
            ...
        until ( cc );
        free( old_obj );
        return ret;

Standard memory management routines are typically not lock-free.

Freeing the storage for the old object could cause another process to crash.

Solution Attempt 1

• No calls to standard memory management routines

    static obj_type *new_obj;   // One per process; points to an empty obj
    lockfree_op ( obj_type **obj, args )
        obj_type *old_obj;
        do
            ...
        until ( cc );
        new_obj = old_obj;      // Save old object for use on next op
        return ret;

New Problem

    static obj_type *new_obj;   // One per process; points to an empty obj
    lockfree_op ( obj_type **obj, args )
        obj_type *old_obj;
        do
            ...
        until ( cc );

memcpy is not atomic, so the contents of new_obj could be inconsistent.


Inconsistent Data

May seem harmless

• Store_Conditional will fail, so it can’t corrupt the obj.

But it could cause the process to crash

• Null ptr dereference, divide by zero, etc.

Hardware solution: validate instruction

Software solution: version numbers


The “Small Object” Protocol - 1

    typedef struct {
        obj_type obj;
        unsigned check[2];
    } Obj_type;

    static Obj_type *new_obj;   // One per process; points to an empty obj
    lockfree_op ( Obj_type **Obj, args )
        Obj_type *old_obj;
        unsigned first, last;
        while ( TRUE )
            // While loop body on next slide
        end while;

The “Small Object” Protocol - 2

    while ( TRUE )
        old_obj = Load_Linked( Obj );
        new_obj->check[0] = new_obj->check[1]+1;   // Mark inconsistent
        first = old_obj->check[1];
        memcpy( &new_obj->obj, &old_obj->obj, sizeof( obj_type ) );
        last = old_obj->check[0];
        if ( first != last ) continue;
        ret = op( &new_obj->obj, args );
        new_obj->check[1]++;                       // Mark consistent
        cc = Store_Conditional( Obj, new_obj );
        if ( cc ) break;
    end while;

• Readers access version numbers in opposite order from writers
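
The version-number check can be illustrated in isolation. The sketch below is sequential and only shows the ordering of the reads and writes. A real concurrent version would also need atomic accesses or memory barriers, which are omitted here. The payload type and the helper names `begin_update`, `end_update`, and `try_copy` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct { int a, b; } obj_type;   /* hypothetical payload */

typedef struct {
    obj_type obj;
    unsigned check[2];
} Obj_type;

/* Writer side: bump check[0] before touching obj, check[1] after. */
static void begin_update(Obj_type *o) { o->check[0] = o->check[1] + 1; }
static void end_update(Obj_type *o)   { o->check[1]++; }

/* Reader side: read check[1], copy, then read check[0] (the opposite
 * order from the writer). Returns true only if the two values match,
 * i.e. no update was in progress around the copy. */
static bool try_copy(obj_type *dst, const Obj_type *src)
{
    assert(dst != NULL && src != NULL);
    unsigned first = src->check[1];
    memcpy(dst, &src->obj, sizeof(obj_type));
    unsigned last = src->check[0];
    return first == last;
}
```

A copy taken between `begin_update` and `end_update` is rejected because the two counters differ, so the reader retries instead of running the operation on a half-written object.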

So far ...

Ease of Reasoning

• For small objects, we can (almost mechanically) construct a lock-free concurrent implementation of an object given a sequential implementation.

Performance

• Should be pretty good, if the cost of memcpy is small

• But ...


Performance of “Naive” Approach

Encore Multimax

• 18 NS32532 processors

Compare to spin locks

• Using test&test&set

Benchmark

• Each process performs 2²⁰/n queue operations on a single queue

• All runs perform the same amount of work


Performance Problems

Useless Parallelism

• When one process successfully performs an operation, all other processes that have started an operation will fail, but continue to consume resources (and generate contention for memory and bus bandwidth).

Starvation

• Operations that take longer have a much greater chance of being aborted by shorter operations.

    ◦ Even when the relative difference in running time is small

Solution to Performance Problems

Exponential backoff

• When contention is detected

    ◦ suspend for a random time interval between 0 and t

    ◦ also double t (up to some maximum)

• On successful operation

    ◦ reduce t by half (down to some minimum)
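
The backoff rule above can be sketched as a small per-process state machine. This is an illustration under stated assumptions: the bounds `BACKOFF_MIN` and `BACKOFF_MAX` are arbitrary example values, the names are hypothetical, and the caller is expected to do the actual suspending for the returned delay. `rand()` is used for brevity, and a real multi-threaded program would want a per-thread generator.

```c
#include <assert.h>
#include <stdlib.h>

/* Backoff window limits; the values are illustrative assumptions. */
#define BACKOFF_MIN 1u
#define BACKOFF_MAX 1024u

typedef struct { unsigned t; } backoff;   /* one instance per process */

static void backoff_init(backoff *b) { b->t = BACKOFF_MIN; }

/* Contention detected: pick a random delay in [0, t], then double t
 * (capped at BACKOFF_MAX). The caller suspends for the returned delay. */
static unsigned backoff_on_contention(backoff *b)
{
    assert(b->t >= BACKOFF_MIN && b->t <= BACKOFF_MAX);
    unsigned delay = (unsigned)rand() % (b->t + 1u);
    b->t = (b->t * 2u > BACKOFF_MAX) ? BACKOFF_MAX : b->t * 2u;
    return delay;
}

/* Successful operation: halve t (but not below BACKOFF_MIN). */
static void backoff_on_success(backoff *b)
{
    b->t = (b->t / 2u < BACKOFF_MIN) ? BACKOFF_MIN : b->t / 2u;
}
```

In a retry loop, a failed Store_Conditional would call `backoff_on_contention` and wait before retrying, and a successful one would call `backoff_on_success`.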

Performance with Backoff

Performance is better than standard spin lock for 8 or more processes

Performance is within a factor of two of a “sophisticated” spin lock implementation (using exponential backoff)


Now where are we?

Lock-free implementation of small objects

• Ease of reasoning

• Performance

What’s left to do:

• Wait-free small objects

• Lock-free and wait-free large objects

Read the paper!
