Design - On the many faces of atomic multicast

In this section, we detail the GeoPaxos protocol (§4.4.1), present some extensions and improvements to the basic protocol (§4.4.2), and discuss practical aspects (§4.4.3).

4.4.1 The ordering protocol

In GeoPaxos, clients can send operations to any replica. An operation op received by a replica has two main attributes: (a) the destination op.dst, which contains the preferred sites of the objects accessed in op; and (b) the timestamp op.ts, which defines the order in which op must be executed by the replicas. A preferred site corresponds to a Paxos group. Note that only the groups representing the preferred sites of the objects accessed in op coordinate to order op, but all replicas execute op.

Algorithm 4 presents GeoPaxos, divided in atomic tasks. Whenever a client submits an operation op for execution to a replica r, the latter invokes psi t e(op), which defines op.dst as the set of preferred sites from every object accessed by

op, and proposes op to each group in op.dst in a STEP1 message, as detailed

in Task 0. The proposing of an operation triggers an execution of Paxos in each group in op.dst.

In Task 1, when a replica in op.dst learns about a new STEP1 message ad-

dressed to a group h (line 13), it increments h’s local clock, updates op.ts, and appends op to the set of unordered messages. When r learns STEP1 messages

from each group in op.dst, it proposes op final timestamp to its local group g (the group it belongs to) within a STEP2 message (line 18).

When r learns aSTEP2 message from a group h (the decide event in line 19),

it updates h’s clock to the maximum between the previous value and the one just decided. As soon as replicas learnSTEP2 messages from all groups in op.dst, they consider the operation ordered and move it from the ToOrder set to the Ordered set.

An operation in the Ordered set can be executed by the replica in Task 2 when it has a timestamp smaller than any other operations with common destination in both ToOrder and Ordered sets.

70 4.4 Design

Algorithm 4 replica r in group g.

1: Initialization

2: cl ock[N] ← {〈0, 1〉, · · · , 〈0, g〉, · · · , 〈0, N〉}

3: ToOrder_{← ;; Ordered ← ;}

4: To submit operation op: {Task 0}

5: op.dst_{← psite(op)}

6: op.ts_{← 〈0, 0〉}

7: for all h∈ op.dst do

8: propose_h(STEP1, op,₋₎

9: when decide_h(S, op, cl ock0) {Task 1}

10: ifS=STEP1 then 11: if|op.dst| = 1 then 12: execute operation op 13: else 14: inc(clock[h]) 15: op.ts_{← max(op.ts, clock[h])}

16: ToOrder← ToOrder ∪ {op}

17: if decided_j(STEP1, op, _) from all j ∈ op.dst then

18: propose_g(STEP2, op, op.ts)

19: else

20: cl ock[h] ← max(clock[h], clock0)

21: if decided_j(STEP2, op, _) from all j ∈ op.dst then

22: Ordered← Ordered ∪ {op}

23: ToOrder← ToOrder \ {op}

24: while∃op ∈ Ordered : {Task 2}

(∀op0_{∈ Ordered ∪ ToOrder :}

op6= op0∧ op.dst ∩ op0.dst_{6= ; ⇒ op.ts < op}0.ts) do

25: execute operation op

26: Ordered_{← Ordered \ {op}}

Operations ordered by a single Paxos group (single-group operations) do not receive a timestamp. Instead, they are immediately forwarded to all replicas to be executed, bypassing any ongoing multi-group operations (line 11). We detail this optimization in §4.4.2.

4.4.2 Extensions and optimizations

In this section, we describe a few optimizations that improve the performance of the basic ordering protocol.

Speeding up single-group operations

In Algorithm 4, replicas deliver single-group operations without assigning them timestamps. This optimization ensures that operations involving a single Paxos group do not have to wait for slower multi-group messages. To understand why, assume that both single- and multi-group messages are assigned timestamps. Let

op_s and op_m be two operations that access a common object x (i.e., op_s and op_m do not commute), where op_s is single-group and op_m is multi-group. Thus, op_s and op_m must be executed in the same order by all replicas.

Consider an execution in which op_s’s communication round happens between

op_m’s first and second communication rounds at the replicas in the preferred site for object x. Thus, op_s receives a timestamp greater than op_m’s. This means that even if op_s is ready to be executed, it must wait for op_m to be ordered. We call the phenomenon by which the execution of an ordered operation is delayed by the ordering of another operation the “convoy effect”. Since there is a significant difference between the response times of single-group and multi-group operations, even a small percentage of multi-group operations in the workload can add substantial delays in the execution of single-group operations. To cope with the convoy effect, we let single-group operations be executed as soon as they are ordered inSTEP1 (line 11 of Algorithm 4). Consequently,STEP2 of a multi-group

operation does not delay the execution of a single-group operation. Intuitively, this does not violate correctness because op_s and op_m are handled in the same total order within a group, and so, all replicas agree that op_s should be executed before op_mis ordered.

Setting an object’s preferred site

Since the preferred site of objects may be unknown until the first accesses to the objects and it may also change over time (e.g., as users change their physical location), reassigning an object’s preferred site is inevitable. A practical way to maintain preferred sites is to monitor access to objects, client location, and replica load, and periodically evaluate and reassign groups to object accordingly. Differently from Spanner, where data needs to be copied in the background to the destination zone by the placement driver, in GeoPaxos there is no bulk data transfer involved in reassigning an object preferred site. The task is accom- plished by a set_psi t e(ob ject_id, new_pre f erred_site) operation that involves the Paxos groups at the object’s current and new preferred sites. After ordered, the operation updates the object’s preferred site in all replicas. We exercise this feature of GeoPaxos in §4.7.2 using simple heuristics to assign preferred sites to

72 4.5 Proof of correctness

In document On the many faces of atomic multicast (Page 91-94)