

4.4.3 Practical considerations

Paxos groups are available as long as there is a majority of non-faulty nodes (acceptors) in the group. Clients connected to a replica that fails can reconnect to any operational replica, possibly in the client’s closest region. We experimentally evaluate the effects on performance when clients reconnect to a remote replica in §4.7.2.

To recover from failures, the in-memory state of acceptors must be saved on stable storage (i.e., disk). In GeoPaxos, acceptors can persist their state in either asynchronous or synchronous mode. These modes represent a trade-off between performance and reliability: the asynchronous mode is more efficient but can lose information if an acceptor crashes before flushing its state to disk. We use asynchronous mode in our evaluation.
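The difference between the two persistence modes can be illustrated with a minimal sketch. The class AcceptorLog, its methods, and the record format below are hypothetical and are not taken from the GeoPaxos implementation; the sketch only assumes that synchronous mode forces each record to disk before returning, while asynchronous mode leaves the flush to a later point.

```python
import os


class AcceptorLog:
    """Hypothetical append-only log for acceptor state (illustration only)."""

    def __init__(self, path, synchronous):
        self.synchronous = synchronous
        self.file = open(path, "ab")

    def persist(self, record: bytes) -> None:
        self.file.write(record + b"\n")
        if self.synchronous:
            # Synchronous mode: push the record through the user-space
            # buffer and ask the OS to write it to stable storage now.
            self.file.flush()
            os.fsync(self.file.fileno())
        # Asynchronous mode: the record may still sit in a buffer here,
        # so a crash at this point can lose it.

    def close(self) -> None:
        self.file.flush()
        self.file.close()
```

In this sketch the synchronous path pays one fsync per record, which is the cost that the asynchronous mode avoids at the risk of losing unflushed records.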

As an optimization, multi-group operations (with associated parameters) do not need to be sent to all groups involved in the operation. It is sufficient that one group receives the full operation while the other groups receive only the unique id of the operation, so that the operation can be ordered in all involved groups.
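A minimal sketch of this optimization follows. The names GroupMessage and build_messages are hypothetical, and the rule for choosing which group receives the full operation (here, the smallest group identifier) is an assumption made for the example; the text does not specify how that group is selected.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class GroupMessage:
    op_id: str                 # unique id of the operation, sent to every group
    dst: Tuple[str, ...]       # all groups involved in the operation
    payload: Optional[bytes]   # full operation and parameters, or None (id only)


def build_messages(op_id: str, payload: bytes,
                   groups: Iterable[str]) -> Dict[str, GroupMessage]:
    dst = tuple(sorted(groups))
    carrier = dst[0]  # assumption: the smallest group id carries the payload
    return {
        g: GroupMessage(op_id, dst, payload if g == carrier else None)
        for g in dst
    }


messages = build_messages("op-1", b"transfer(x, y, 10)", ["g2", "g1"])
```

Here every group still learns the operation id and can order it, but only one group receives the parameters.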

4.5 Proof of correctness

Proposition 18 If operations op_i and op_j do not commute, then replicas execute them in the same order.

PROOF: Since op_i and op_j do not commute, they access at least one common object. Let PS_i and PS_j be the preferred sites for op_i and op_j (i.e., the Paxos groups that will order each operation). Then either (i) PS_i ∩ PS_j ≠ ∅, that is, the operations are ordered by at least one group in common; or (ii) PS_i ∩ PS_j = ∅. Case (ii) is only possible if the objects accessed by op_i and op_j changed their preferred site after the first operation is executed and before the second operation is executed. Without loss of generality, let x be an object accessed by the two operations, and let PS(x)_i and PS(x)_j be the preferred sites for x when op_i and op_j are executed, respectively. From the mechanism to reassign the preferred site of an object, both the current and the next preferred sites must be involved in the reassignment operation. Hence op_i and op_j are ordered in at least one group, directly, as in case (i), or indirectly, as in case (ii).
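The situation in which the two operations share no group can be made concrete with a small sketch. The helper names and the group labels g1 and g2 are hypothetical; the only property assumed from the text is that reassigning an object's preferred site involves both the current and the next preferred sites.

```python
def ordering_groups(objects, preferred_site):
    """Groups that order an operation: the preferred sites of its objects."""
    return {preferred_site[obj] for obj in objects}


def reassign(obj, preferred_site, new_site):
    """Move obj to new_site; both the old and the new site are involved."""
    involved = {preferred_site[obj], new_site}
    preferred_site[obj] = new_site
    return involved


preferred_site = {"x": "g1"}
ps_i = ordering_groups({"x"}, preferred_site)   # first operation: ordered in g1
move = reassign("x", preferred_site, "g2")      # reassignment: ordered in g1 and g2
ps_j = ordering_groups({"x"}, preferred_site)   # second operation: ordered in g2
```

In this example ps_i and ps_j are disjoint, yet the reassignment is ordered in both g1 and g2, so it links the two operations indirectly.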

The claim follows from two facts: (a) for ordered operations (operations in the Ordered set) op_i and op_j, either op_i.tp < op_j.tp or op_j.tp < op_i.tp; and (b) replicas execute ordered operations in timestamp order. Fact (a) holds since timestamp values are unique and the timestamp of an ordered operation op is the maximum among the timestamp values proposed by each of the destination groups in op.dst (line 20 of Algorithm 4). Fact (b) holds because when an operation op is executed by a replica, there is no operation op′ at the replica with a smaller timestamp (Task 2 of Algorithm 4). Moreover, no future operation can have a smaller timestamp than op’s timestamp since timestamps are monotonically increasing and the timestamp of each group in op.dst is at least equal to op’s (line 19 of Algorithm 4). □
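Facts (a) and (b) can be illustrated with a short sketch. It is not Algorithm 4: the function names, the integer timestamps, and the group labels are assumptions made for the example, and the check that no pending operation has a smaller timestamp is omitted.

```python
def final_timestamp(proposals):
    """Final timestamp of an operation: the maximum proposed by its groups."""
    return max(proposals.values())


def execution_order(final_timestamps):
    """Operation ids sorted by their (unique) final timestamps."""
    return sorted(final_timestamps, key=final_timestamps.get)


# Timestamps proposed by each destination group (hypothetical values).
proposals = {
    "op_a": {"g1": 3, "g2": 7},
    "op_b": {"g2": 8},
}
finals = {op: final_timestamp(p) for op, p in proposals.items()}

# Two replicas that learn the operations in different arrival orders
# still derive the same execution order from the timestamps.
replica_1 = execution_order(finals)
replica_2 = execution_order(dict(reversed(list(finals.items()))))
```

Because the final timestamps are unique, sorting by timestamp gives a single order that does not depend on arrival order.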

Proposition 19 GeoPaxos is linearizable.

PROOF: From the definition of linearizability [40], there must exist a permutation π of the operations in any execution of GeoPaxos that respects (i) the real-time ordering of operations as seen by the clients, and (ii) the semantics of the operations. Let op_i and op_j be two operations submitted by clients C_i and C_j, respectively.

There are two cases to consider.

Case (a): op_i and op_j are commutative. Thus, op_i and op_j access different objects and the sets of groups representing preferred sites of the objects involved in each operation are disjoint. Consequently, the execution of one operation does not affect the execution of the other and they can be placed in any relative order in π. Operations op_i and op_j are arranged in π so that their relative order respects their real-time dependencies, if any.

Case (b): op_i and op_j do not commute. It follows from the GeoPaxos order property above (Proposition 18) that replicas execute the operations in the same order. Since the two operations execute in sequence, the execution of the operations satisfies their semantics. It is now shown that the execution order satisfies any real-time constraints between op_i and op_j. Without loss of generality, assume op_i finishes before op_j starts (i.e., op_i precedes op_j in real time). Thus, before op_j is submitted by C_j, op_i has completed (i.e., C_i has received op_i’s response). Since op_j is ordered and then executed, the conclusion is that op_i is ordered before op_j.

From the claims above, op_i and op_j can be arranged in π according to their delivery order so that the execution of each operation satisfies its semantics.

The last consideration regards the earlier execution of single-group operations as presented in lines 10 to 12 of Algorithm 4. Because replicas receive both
