3. Replication

Replication

Goal:

– Avoid a single point of failure by replicating the server

– Increase scalability by sharing the load among replicas

Problems:

– Partial failures of replicas and messages

– No global ordering of messages

– Some replicas might execute operations that others did not know about

– The states of the replicas diverge

Replicated State Machine

Also known as Active Replication

The idea:

– Every replica sees exactly the same set of messages in the same order and will execute them in that order

Benefit:

– Immediate fail-over

Limitations:

– Waste of resources, since all replicas do the same work

– Requires determinism, which is not trivial to ensure
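The idea and the determinism requirement can be illustrated with a minimal sketch. The `Replica` class, its `set`/`add` operations, and the `run` helper are hypothetical names chosen for illustration; they are not part of the slides. Replicas that apply the same log in the same order reach the same state, while a different order may produce a different state.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic key-value state machine (hypothetical example)."""
    state: dict = field(default_factory=dict)

    def apply(self, op):
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        elif kind == "add":
            self.state[key] = self.state.get(key, 0) + value
        else:
            raise ValueError(f"unknown operation: {kind}")


def run(ops):
    """Apply a sequence of operations to a fresh replica and return its state."""
    replica = Replica()
    for op in ops:
        replica.apply(op)
    return replica.state


log = [("set", "x", 1), ("add", "x", 5), ("set", "x", 10), ("add", "x", 2)]

# Three replicas that see the same messages in the same order agree.
same_order_states = [run(log) for _ in range(3)]

# A replica that sees the same messages in a different order may diverge.
reordered_state = run([log[0], log[2], log[1], log[3]])
```

Here every replica in `same_order_states` ends with `x = 12`, whereas the reordered replica ends with `x = 17`.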

An important issue:

– At what level?

Option 1:

– Machine instruction level

– Virtual Machine Level

Machine Instruction Level Replication

Benefits:

– Fast consistency resolution

– Transparent

Problems:

– Requires special hardware

– Usually behind technology curve

– Resource wasteful

– Requires physical proximity

– Does not overcome software bugs / no multi-versioning

N-Modular Redundancy

[Figure: Node 1, Node 2, …, Node N; Voter; Output; Monitoring System]
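A minimal sketch of what a voter might do, assuming it outputs the value reported by a strict majority of the nodes. The `vote` function is a hypothetical illustration, not taken from the slides, and it assumes node outputs are hashable values.

```python
from collections import Counter


def vote(outputs):
    """Return the value reported by a strict majority of nodes, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of all reported outputs.
    return value if count > len(outputs) // 2 else None


agreed = vote([42, 42, 7])      # one faulty node out of three is masked
undecided = vote([1, 2, 3])     # no majority, so no output is produced
```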

Logical State Replication

Idea:

– The important thing is the logical/semantic state of the application

Benefits:

– Avoids the shortcomings of machine-level replication

Problems:

– Not transparent

– Changes programming model

Although one tries to minimize this

– Slower consistency resolution

– May result in lower throughput and higher latency

Total Ordering Based Replication

Simple replication protocol

– Clients can send requests to any replica

– All replicas utilize a black-box total ordering mechanism to apply the updates in the same order

[Figure: Client1 and Client2 send Request1 and Request2 through Total Order to the Replicated Servers and receive Reply1 and Reply2]
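The slides treat total ordering as a black box. One possible realization, shown here only as an assumption for illustration, is a fixed sequencer that stamps each request with a sequence number, while each replica buffers out-of-order arrivals and applies requests strictly by number. `Sequencer` and `Replica` are hypothetical names.

```python
import itertools


class Sequencer:
    """Assigns consecutive sequence numbers to requests (assumed mechanism)."""

    def __init__(self):
        self._next = itertools.count()

    def order(self, request):
        return (next(self._next), request)


class Replica:
    """Applies requests in sequence-number order, buffering any gaps."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}
        self.log = []

    def receive(self, seq, request):
        self.pending[seq] = request
        # Deliver every request whose predecessors have all been delivered.
        while self.next_seq in self.pending:
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


sequencer = Sequencer()
m1 = sequencer.order("Request1")
m2 = sequencer.order("Request2")

a, b = Replica(), Replica()
a.receive(*m1)
a.receive(*m2)
b.receive(*m2)  # arrives first at b, but is held back until Request1 arrives
b.receive(*m1)
```

Both replicas end with the same log even though the network delivered the messages to them in different orders.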

Total Ordering Based Replication – Fine Details

How do clients find a live server?

– Name servers

– Local directors

– Virtual IP that migrates between live servers

What happens if the server fails?

– Before servicing the request

Resubmit

– After servicing the request

We need to avoid re-execution

– Sequence numbers
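A minimal sketch of how sequence numbers might prevent re-execution, assuming each client has at most one outstanding request and tags it with an increasing number. `CounterServer` and its fields are hypothetical names; in a replicated setting the table of last replies would itself have to be part of the replicated state.

```python
class CounterServer:
    """Executes each (client_id, seq) request at most once."""

    def __init__(self):
        self.total = 0
        self.last_reply = {}  # client_id -> (seq, cached reply)

    def handle(self, client_id, seq, amount):
        last = self.last_reply.get(client_id)
        if last is not None and seq <= last[0]:
            # Already executed: resend the cached reply for an exact retry,
            # and ignore anything older than the last executed request.
            return last[1] if seq == last[0] else None
        self.total += amount
        self.last_reply[client_id] = (seq, self.total)
        return self.total


server = CounterServer()
first = server.handle("c1", 1, 5)
retry = server.handle("c1", 1, 5)   # resubmitted after a suspected failure
second = server.handle("c1", 2, 3)
```

The retry returns the cached reply and leaves the counter unchanged, so resubmitting is safe whether the failure happened before or after the request was serviced.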

Primary-Backup Replication

Cold backup

– Only the primary is active

– Periodically checkpoints its state to storage that is available to the backup

Stable storage or shared storage (SCSI, SAN)

– When the primary fails, the backup is initiated, loads the state from storage, and continues from there

– Slow recovery

– The backup needs to be started (run applications, obtain resources, etc.)

– Either the backup needs to replay the last actions from a log file, or it may miss the last updates since the most recent checkpoint

+ Resource efficient

+ Invocations need not be deterministic

– It is possible to have several backups to survive multiple failures

Requires consistent failure detection, e.g., by a group membership service

– It is possible to have several nodes, each primary for some services and backup for others
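The cold-backup recovery trade-off can be sketched as follows. This is a minimal illustration in which a plain dictionary stands in for the shared or stable storage; `Primary`, `recover`, and the storage layout are hypothetical and not taken from the slides.

```python
import copy


class Primary:
    """Primary that logs every update and periodically checkpoints its state."""

    def __init__(self, storage):
        self.state = {}
        self.storage = storage  # stands in for stable/shared storage

    def update(self, key, value):
        self.storage["log"].append((key, value))  # log before applying
        self.state[key] = value

    def checkpoint(self):
        self.storage["checkpoint"] = copy.deepcopy(self.state)
        self.storage["log"].clear()  # the log only covers post-checkpoint updates


def recover(storage, replay_log):
    """Backup start-up: load the checkpoint, optionally replay the log."""
    state = copy.deepcopy(storage["checkpoint"])
    if replay_log:
        for key, value in storage["log"]:
            state[key] = value
    return state


storage = {"checkpoint": {}, "log": []}
primary = Primary(storage)
primary.update("a", 1)
primary.checkpoint()
primary.update("b", 2)  # the primary fails after this update

with_replay = recover(storage, replay_log=True)
without_replay = recover(storage, replay_log=False)
```

With log replay the backup recovers both updates; without it, the update made after the last checkpoint is lost.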

Primary-Backup Replication

Warm backup

– In this case, the backup is (at least) partially alive, so the recovery phase is faster

But typically still requires some replaying of last transactions, or losing the last few updates

– Typically, updates are also frequent

Hot standby (leader/follower)

– The backup is also up, and is constantly updated about the state of the primary

+ Fast and up-to-date recovery

Special protocols are required to ensure true up-to-date recovery

We will talk about such protocols later

Challenges in Primary/Backup

How to consistently agree on the primary?

How to detect that the primary has failed?

How to ensure that if we suspect the primary, it is indeed no longer operating on behalf of the system?

How to enable additional backups to join the system without manual intervention and reset?

We will discuss these issues when we talk about Consensus, Failure Detection, and Membership

Quorum Replication

Atomic Register

– Intuitively, operations should appear as if executed on a single server

This is a special case of linearizability

Well suited for distributed storage and distributed shared memory

– More specifically,

A read always returns a value written either by the last write that terminated before the read started, or by a write that is concurrent with the read

If several writes are concurrent, then a following read can return a value written by any one of them

A read should not return a value that is “older” than the value returned by a previous read

Quorums

A quorum system consists of a set of subsets such that the intersection of every two subsets is not empty

– Each of these subsets is called a quorum

Example

– U = {1,2,3,4}, S = {{1,2,3}, {2,3,4}, {1,4}}

The simplest generic quorum system is majority

– Every majority subset intersects with every other majority subset

Another common type of generic quorum system is “any row + any column” of a lattice
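The intersection property can be checked mechanically. The sketch below verifies it for the example S above and for majority quorums; `is_quorum_system` and `majorities` are hypothetical helper names introduced for illustration.

```python
from itertools import combinations


def is_quorum_system(subsets):
    """True if every two subsets have a non-empty intersection."""
    return all(a & b for a, b in combinations(subsets, 2))


def majorities(universe):
    """All subsets of the smallest size that is a strict majority."""
    k = len(universe) // 2 + 1
    return [set(c) for c in combinations(sorted(universe), k)]


S = [{1, 2, 3}, {2, 3, 4}, {1, 4}]
example_ok = is_quorum_system(S)
majority_ok = is_quorum_system(majorities({1, 2, 3, 4, 5}))
disjoint_ok = is_quorum_system([{1, 2}, {3, 4}])  # two disjoint subsets
```

`example_ok` and `majority_ok` are both true, while the two disjoint subsets fail the check.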

Bi-Quorums

A bi-quorum system over the universe U is composed of two sets of subsets, A and B

– Every subset from A must intersect every subset from B

Clearly, each quorum system is also a bi-quorum system

– Majority is a generic bi-quorum system

Scalable generic bi-quorums

– Idea

Servers are arranged in a logical matrix

A write quorum consists of any one row in the matrix

A read quorum consists of any one column in the matrix

– Drawback

Read quorums
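The matrix construction can be sketched as below, assuming servers are numbered row by row in a grid with `n_rows` rows and `n_cols` columns. The helper names are hypothetical. Every row shares exactly one server with every column, so rows and columns form a bi-quorum system, even though two different rows never intersect each other.

```python
def rows(n_rows, n_cols):
    """Write quorums: each row of the logical matrix."""
    return [{r * n_cols + c for c in range(n_cols)} for r in range(n_rows)]


def columns(n_rows, n_cols):
    """Read quorums: each column of the logical matrix."""
    return [{r * n_cols + c for r in range(n_rows)} for c in range(n_cols)]


def is_bi_quorum(a, b):
    """True if every subset in a intersects every subset in b."""
    return all(x & y for x in a for y in b)


write_quorums = rows(3, 4)      # 12 servers, write quorums of size 4
read_quorums = columns(3, 4)    # read quorums of size 3
grid_ok = is_bi_quorum(write_quorums, read_quorums)
rows_intersect = bool(write_quorums[0] & write_quorums[1])
```

With 12 servers a majority quorum needs 7 servers, while the grid quorums here need only 3 or 4, which is where the scalability comes from.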

Implementing Read/Write Registers with Quorums

Quorum replication

– Clients read and write directly to quorums of servers

Essentially an adaptation of the Attiya, Bar-Noy, Dolev protocol for robustly emulating distributed shared memory, but using any given bi-quorum system rather than majority

Tradeoff between availability of the system and its scalability (size of quorum)

– Cannot implement Read-Modify-Write semantics

Implementing Quorum Replication

Each server maintains a logical timestamp for each register

Servers’ protocol:

– Upon receiving an r-request(x) message do

o := values[x]

reply with an r-reply(x,o.val,<o.ts,o.id>) message

– Upon receiving a w-request(x,v,<ts,id>) message do

o := values[x]

if <ts,id> is lexicographically larger than <o.ts,o.id> then

values[x] := (v,<ts,id>)

reply with a w-reply(x) message (sent whether or not the value was replaced)
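The servers' protocol can be sketched in Python as follows. The handling of r-request follows the pseudocode; the body of the timestamp comparison (storing the new value) and the w-reply acknowledgement are assumptions based on the standard protocol and on the w-reply(x) messages the clients wait for. `Entry` and `Server` are hypothetical names, and method calls stand in for messages.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    val: object = None
    ts: int = 0
    id: int = 0


class Server:
    def __init__(self):
        self.values = {}  # register name -> Entry

    def on_r_request(self, x):
        o = self.values.get(x, Entry())
        return ("r-reply", x, o.val, (o.ts, o.id))

    def on_w_request(self, x, v, stamp):
        o = self.values.get(x, Entry())
        # Python compares tuples lexicographically: first ts, then id.
        if stamp > (o.ts, o.id):
            ts, writer_id = stamp
            self.values[x] = Entry(v, ts, writer_id)  # assumed update step
        return ("w-reply", x)  # assumed: acknowledge in either case


server = Server()
server.on_w_request("x", "a", (1, 2))
server.on_w_request("x", "b", (1, 1))   # older timestamp, so it is ignored
after_stale = server.on_r_request("x")
server.on_w_request("x", "c", (2, 0))   # larger ts wins regardless of id
after_newer = server.on_r_request("x")
```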

Implementing Quorum Replication – continued

A read operation R(x)

1. Send a read request r-request(x) to all servers

2. Wait for r-reply(x,v,<ts,id>) messages from a READ quorum of servers

– let v be the value with the largest logical timestamp <ts,id>

3. Send a write request w-request(x,v,<ts,id>) to all servers

4. Wait for w-reply(x) messages from a WRITE quorum of servers

5. Return (v) // the value that was selected in line 2

– Why are lines 3&4 important?

– It is possible to first send requests only to the corresponding quorums; only if there is no reply, send to more servers

Implementing Quorum Replication – continued

A write operation W(x,v)

1. Send a read request r-request(x) to all servers

2. Wait for r-reply(x,-,<ts,id>) messages from a READ quorum of servers

– let <ts,id> be the largest returned logical timestamp

3. Send a write request w-request(x,v,<ts+1,my_id>) to all servers

4. Wait for w-reply(x) messages from a WRITE quorum of servers

5. Return
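The read and write operations can be sketched end to end in a single process. This is a simplified illustration under several assumptions: method calls stand in for messages, each client contacts exactly one fixed read quorum and one fixed write quorum instead of all servers, and no failures occur. `Server`, `Client`, and the quorum index lists are hypothetical names.

```python
class Server:
    def __init__(self):
        self.values = {}  # register name -> (value, (ts, id))

    def r_request(self, x):
        return self.values.get(x, (None, (0, 0)))

    def w_request(self, x, v, stamp):
        if stamp > self.r_request(x)[1]:  # lexicographic tuple comparison
            self.values[x] = (v, stamp)
        return "w-reply"


class Client:
    def __init__(self, my_id, servers, read_quorum, write_quorum):
        self.my_id = my_id
        self.servers = servers
        self.read_quorum = read_quorum    # indices of servers in a READ quorum
        self.write_quorum = write_quorum  # indices of servers in a WRITE quorum

    def _query(self, x):
        return [self.servers[i].r_request(x) for i in self.read_quorum]

    def _store(self, x, v, stamp):
        for i in self.write_quorum:
            self.servers[i].w_request(x, v, stamp)

    def read(self, x):
        v, stamp = max(self._query(x), key=lambda reply: reply[1])
        self._store(x, v, stamp)  # write back before returning (steps 3 and 4)
        return v

    def write(self, x, v):
        ts, _ = max(stamp for _, stamp in self._query(x))
        self._store(x, v, (ts + 1, self.my_id))


servers = [Server() for _ in range(5)]
c1 = Client(1, servers, read_quorum=[0, 1, 2], write_quorum=[0, 1, 2])
c2 = Client(2, servers, read_quorum=[2, 3, 4], write_quorum=[2, 3, 4])

c1.write("x", "a")
seen_by_c2 = c2.read("x")   # quorums intersect at server 2
c2.write("x", "b")
seen_by_c1 = c1.read("x")
```

After `c2.read`, the write-back has copied the value to servers 3 and 4, so a later reader using any quorum cannot observe an older value than the one `c2` returned.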