3. Replication
Replication
Goal:
– Avoid a single point of failure by replicating the server
– Increase scalability by sharing the load among replicas
Problems:
– Partial failures of replicas and messages
– No global ordering of messages
– Some replicas might execute operations that others did not know about
– The states of the replicas diverge
Replicated State Machine
Known also as Active Replication
The idea:
– Every replica sees exactly the same set of messages in the same order and will execute them in that order
Benefit:
– Immediate fail-over
Limitations:
– Wastes resources, since all replicas do the same work
– Requires determinism, which is not trivial to ensure
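The idea can be illustrated with a minimal sketch, assuming a toy key/value state machine (the `CounterReplica` class and its `add`/`set` commands are hypothetical, chosen only for illustration). Replicas that apply the same commands in the same order end in the same state; a replica that sees the same set of commands in a different order may not.

```python
from dataclasses import dataclass, field


@dataclass
class CounterReplica:
    """A deterministic state machine: its state depends only on the applied commands."""
    state: dict = field(default_factory=dict)

    def apply(self, command):
        op, key, amount = command
        if op == "add":
            self.state[key] = self.state.get(key, 0) + amount
        elif op == "set":
            self.state[key] = amount
        else:
            raise ValueError(f"unknown operation: {op}")


# The same commands, in the same order, are delivered to every replica.
log = [("set", "x", 5), ("add", "x", 2), ("add", "y", 1), ("set", "x", 0), ("add", "x", 3)]

replicas = [CounterReplica() for _ in range(3)]
for replica in replicas:
    for command in log:
        replica.apply(command)
states = [replica.state for replica in replicas]

# Same set of commands, different order: the resulting state differs.
reordered = CounterReplica()
for command in reversed(log):
    reordered.apply(command)
```

Here all three entries of `states` are equal, while `reordered.state` holds a different value for `x`, which is why the ordering requirement matters and not just the set of messages.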
An important issue:
– At what level?
Option 1:
– Machine instruction level
– Virtual Machine Level
Machine Instruction Level Replication
Benefits:
– Fast consistency resolution
– Transparent
Problems:
– Requires special hardware
– Usually behind technology curve
– Resource wasteful
– Requires physical proximity
– Does not overcome software bugs / no multi-versioning
N-Modular Redundancy
[Figure: N-modular redundancy with Node 1, Node 2, …, Node N, a Voter producing the Output, and a Monitoring System]
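A minimal sketch of the voter, assuming it picks the output produced by a strict majority of the nodes (the function names `vote` and `disagreeing_nodes` are hypothetical; real N-modular redundancy is typically implemented in hardware). The second function shows the kind of information a monitoring system might use to flag suspect nodes.

```python
from collections import Counter


def vote(outputs):
    """Return the output produced by a strict majority of nodes, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def disagreeing_nodes(outputs):
    """Indices of nodes whose output differs from the voted output."""
    winner = vote(outputs)
    if winner is None:
        # No majority: every node is suspect.
        return list(range(len(outputs)))
    return [i for i, out in enumerate(outputs) if out != winner]


# Three nodes, one of which produced a faulty result.
result = vote([42, 42, 7])
suspects = disagreeing_nodes([42, 42, 7])
```

With three nodes the voter masks one faulty output; with no majority it cannot decide, so the sketch returns `None` rather than guessing.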
Logical State Replication
Idea:
– The important thing is the logical/semantic state of the application
Benefits:
– Avoids the shortcomings of machine-level replication
Problems:
– Not transparent
– Changes programming model
Although one tries to minimize this
– Slower consistency resolution
– May result in lower throughput and higher latency
Total Ordering Based Replication
Simple replication protocol
– Clients can send requests to any replica
– All replicas utilize a black-box total ordering mechanism to apply the updates in the same order
[Figure: Client1 and Client2 send Request1 and Request2 to the Replicated Servers, which apply them through Total Order and return Reply1 and Reply2]
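The protocol can be sketched as follows, assuming the black-box ordering mechanism is a single sequencer that stamps each request with a global sequence number (the `Sequencer` and `Replica` classes are hypothetical stand-ins; a single sequencer is itself a single point of failure, so a real mechanism would be replicated as well). Replicas buffer requests that arrive early and apply them strictly in sequence-number order.

```python
import itertools


class Sequencer:
    """Hypothetical stand-in for the black-box total ordering mechanism."""

    def __init__(self):
        self._next = itertools.count()

    def order(self, request):
        return (next(self._next), request)


class Replica:
    def __init__(self):
        self.applied = []    # requests applied so far, in order
        self._pending = {}   # requests that arrived ahead of their turn
        self._expected = 0   # next sequence number to apply

    def deliver(self, seq, request):
        self._pending[seq] = request
        # Apply every request whose predecessors have all been applied.
        while self._expected in self._pending:
            self.applied.append(self._pending.pop(self._expected))
            self._expected += 1


sequencer = Sequencer()
ordered = [sequencer.order(r) for r in ["Request1", "Request2", "Request3"]]

a, b = Replica(), Replica()
for seq, request in ordered:
    a.deliver(seq, request)
for seq, request in reversed(ordered):   # the network delivers to b in reverse
    b.deliver(seq, request)
```

Both replicas end with the same `applied` list even though the messages reached them in different orders.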
Total Ordering Based Replication – Fine Details
How do clients find an alive server?
– Name servers
– Local directors
– Virtual IP being migrated between alive servers
What happens if the server fails?
– Before servicing the request
Resubmit
– After servicing the request
We need to avoid re-execution
– Sequence numbers
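A minimal sketch of how sequence numbers avoid re-execution, assuming each client tags its requests with its own identifier and a per-client sequence number (the `DedupServer` class and its toy "add to balance" operation are hypothetical). A resubmitted request is recognised and answered from a cache of replies instead of being executed again.

```python
class DedupServer:
    def __init__(self):
        self.balance = 0
        # (client_id, seq) -> reply already sent for that request.
        # Kept forever here for simplicity; a real server would prune old entries.
        self._replies = {}

    def handle(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self._replies:
            # Duplicate of a request that was already serviced: do not re-execute.
            return self._replies[key]
        self.balance += amount
        reply = self.balance
        self._replies[key] = reply
        return reply


server = DedupServer()
first = server.handle("client1", 1, 10)
retry = server.handle("client1", 1, 10)   # client resubmits after a lost reply
second = server.handle("client1", 2, 5)
```

For this to survive a server failure, the reply table would presumably have to be part of the replicated state, so that whichever replica receives the resubmission can recognise it.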
Primary-Backup Replication
Cold backup
– Only the primary is active
– Periodically checkpoints its state to storage that is available to the backup
Stable storage or shared storage (SCSI, SAN)
– When the primary fails, the backup is initiated, loads the state from storage, and continues from there
– Slow recovery
– The backup needs to be started (run applications, obtain resources, etc.)
– Either the backup needs to replay the last actions from a log file, or it may miss the last updates since the most recent checkpoint
+ Resource efficient
+ Invocations need not be deterministic
– It is possible to have several backups to survive multiple failures
Requires consistent failure detection, e.g., by a group membership service
– It is possible to have several nodes, each primary for some services and backup for others
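The cold-backup recovery path can be sketched as follows, assuming an in-memory object plays the role of the shared storage and the primary checkpoints after a fixed number of updates (`StableStorage`, `Primary`, `recover`, and the `checkpoint_every` parameter are all hypothetical names for illustration). The sketch contrasts recovery with log replay against recovery from the checkpoint alone.

```python
import copy


class StableStorage:
    """Hypothetical stand-in for storage reachable by both primary and backup."""

    def __init__(self):
        self.checkpoint = {}
        self.log = []


class Primary:
    def __init__(self, storage, checkpoint_every=3):
        self.state = {}
        self.storage = storage
        self.checkpoint_every = checkpoint_every
        self._since_checkpoint = 0

    def update(self, key, value):
        self.storage.log.append((key, value))   # log the action before applying it
        self.state[key] = value
        self._since_checkpoint += 1
        if self._since_checkpoint == self.checkpoint_every:
            # Periodic checkpoint: save the state and truncate the log.
            self.storage.checkpoint = copy.deepcopy(self.state)
            self.storage.log.clear()
            self._since_checkpoint = 0


def recover(storage, replay_log=True):
    """What the backup does when it is started after the primary fails."""
    state = copy.deepcopy(storage.checkpoint)
    if replay_log:
        for key, value in storage.log:
            state[key] = value
    return state


storage = StableStorage()
primary = Primary(storage)
for i in range(5):
    primary.update(f"k{i}", i)

with_replay = recover(storage)
without_replay = recover(storage, replay_log=False)
```

With replay the backup reaches the primary's last state; without it, the two updates made after the most recent checkpoint are lost.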
Primary-Backup Replication
Warm backup
– In this case, the backup is (at least) partially alive, so the recovery phase is faster
But typically still requires some replaying of last transactions, or losing the last few updates
– Typically, state updates to the backup are also more frequent than in cold backup
Hot standby (leader/follower)
– The backup is also up, and is constantly updated about the state of the primary
+ Fast and up-to-date recovery
Special protocols are required to ensure true up-to-date recovery
We will talk about such protocols later
Challenges in Primary/Backup
How to consistently agree on the primary?
How to detect that the primary has failed?
How to ensure that if we suspect the primary, it is indeed no longer operating on behalf of the system?
How to enable additional backups to join the system without manual intervention and reset?
We will discuss these issues when we talk about Consensus, Failure Detection, and Membership
Quorum Replication
Atomic Register
– Intuitively, operations should appear as if executed on a single server
This is a special case of linearizability
Well suited for distributed storage and distributed shared memory
– More specifically,
A read always returns a value written either by the last write that terminated before the read started, or by a write that is concurrent with the read
If several writes are concurrent, then a following read can return a value written by any one of them
A read should not return a value that is “older” than the value returned by a previous read
Quorums
A quorum system consists of a set of subsets such that the intersection of every two subsets is not empty
– Each of these subsets is called a quorum
Example
– U = {1,2,3,4}, S = {{1,2,3}, {2,3,4}, {1,4}}
The simplest generic quorum system is majority
– Every majority subset intersects with every other majority subset
Another common type of generic quorum system is “any row + any column” of a lattice
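The definition can be checked mechanically. A minimal sketch, assuming quorums are represented as Python sets (the helper names `is_quorum_system` and `majority_quorums` are hypothetical); it verifies the example above and the smallest majority quorums over five servers.

```python
from itertools import combinations


def is_quorum_system(subsets):
    """True if every two subsets have a non-empty intersection."""
    return all(set(a) & set(b) for a, b in combinations(subsets, 2))


def majority_quorums(universe):
    """All subsets of the smallest size that is more than half of the universe."""
    size = len(universe) // 2 + 1
    return [set(q) for q in combinations(sorted(universe), size)]


# The example from the text: U = {1,2,3,4}.
S = [{1, 2, 3}, {2, 3, 4}, {1, 4}]
example_ok = is_quorum_system(S)

# Two disjoint subsets do not form a quorum system.
disjoint_ok = is_quorum_system([{1, 2}, {3, 4}])

majorities = majority_quorums({1, 2, 3, 4, 5})
majority_ok = is_quorum_system(majorities)
```

Two majorities always intersect because their sizes sum to more than the size of the universe.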
Bi-Quorums
A bi-quorum system over the universe U is composed of two sets of subsets, A and B
– Every subset from A must intersect every subset from B
Clearly, each quorum system is also a bi-quorum system
– Majority is a generic bi-quorum system
Scalable generic bi-quorums
– Idea
Servers are arranged in a logical matrix
A write quorum consists of any one row in the matrix
A read quorum consists of any one column in the matrix
– Drawback
Read quorums
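The matrix construction can be sketched as follows, assuming servers are numbered row by row (`grid_bi_quorums` and `is_bi_quorum_system` are hypothetical helper names). Every row shares exactly one server with every column, so each write quorum intersects each read quorum, even though two rows never intersect each other.

```python
def grid_bi_quorums(rows, cols):
    """Arrange rows*cols servers in a logical matrix; rows are write quorums, columns read quorums."""
    grid = [[r * cols + c for c in range(cols)] for r in range(rows)]
    write_quorums = [set(row) for row in grid]
    read_quorums = [{grid[r][c] for r in range(rows)} for c in range(cols)]
    return write_quorums, read_quorums


def is_bi_quorum_system(A, B):
    """True if every subset in A intersects every subset in B."""
    return all(a & b for a in A for b in B)


W, R = grid_bi_quorums(3, 4)   # 12 servers
valid = is_bi_quorum_system(W, R)
```

For a square matrix of n servers each quorum has about √n members, compared with more than n/2 for majority, which is what makes the construction scalable.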
Implementing Read/Write Registers with Quorums
Quorum replication
– Clients read and write directly to quorums of servers
Essentially an adaptation of the Attiya, Bar-Noy, Dolev protocol for emulating distributed shared memory robustly, but using any given bi-quorum rather than majority
Tradeoff between availability of the system and its scalability (size of quorum)
– Cannot implement Read-Modify-Write semantics
Implementing Quorum Replication
Each server maintains a logical timestamp for each register
Servers’ protocol:
– Upon receiving an r-request(x) message do
o := values[x]
reply with a r-reply(x,o.val,<o.ts,o.id>) message
– Upon receiving a w-request(x,v,<ts,id>) message do
o := values[x]
if <ts,id> is lexicographically larger than <o.ts,o.id> then
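The servers' protocol can be sketched in Python. The text of the write handler stops at the timestamp comparison, so the rest is an assumption: this sketch assumes the server stores the new value and timestamp when the comparison succeeds, and acknowledges with a w-reply(x) in either case (the clients' protocol waits for such messages). The `Stored` and `Server` classes and the initial timestamp <0,0> are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class Stored:
    val: object = None
    ts: int = 0   # assumed initial timestamp
    id: int = 0   # assumed initial writer id


class Server:
    def __init__(self):
        self.values = {}   # register name -> Stored

    def on_r_request(self, x):
        o = self.values.setdefault(x, Stored())
        return ("r-reply", x, o.val, (o.ts, o.id))

    def on_w_request(self, x, v, stamp):
        o = self.values.setdefault(x, Stored())
        ts, writer_id = stamp
        # Python compares tuples lexicographically: first ts, then id.
        if (ts, writer_id) > (o.ts, o.id):
            # Assumed continuation: adopt the newer value.
            self.values[x] = Stored(v, ts, writer_id)
        # Assumed: acknowledge whether or not the value was adopted.
        return ("w-reply", x)


server = Server()
server.on_w_request("x", "a", (1, 2))
server.on_w_request("x", "b", (1, 1))   # older timestamp: ignored
after_stale = server.on_r_request("x")
server.on_w_request("x", "c", (2, 0))   # newer timestamp: adopted
after_newer = server.on_r_request("x")
```

Ignoring stale writes is what lets servers receive write requests in any order and still converge on the value with the largest timestamp.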
Implementing Quorum Replication – continued
A read operation R(x)
1. Send a read request r-request(x) to all servers
2. Wait for r-reply(x,v,<ts,id>) messages from a READ quorum of servers
– let v be the value with largest logical timestamp <ts,id>
3. Send a write request w-request(x,v,<ts,id>) to all servers
4. Wait for w-reply(x) messages from a WRITE quorum of servers
5. Return (v) // the value that was selected in line 2
– Why are lines 3&4 important?
– It is possible to first send requests only to the corresponding quorums; only if there is no reply, send to more servers
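One way to see what lines 3 and 4 contribute is a small simulation. This is a sketch under several assumptions: a single register, majority used for both READ and WRITE quorums, synchronous calls instead of messages, and a hypothetical `reachable` list naming the servers that answer a given client. A write has reached only one server; the first reader sees it and writes it back, so a later reader contacting a different quorum cannot return the older value.

```python
class Server:
    """Single-register server; stamp is the pair (ts, id), compared lexicographically."""

    def __init__(self, val=None, stamp=(0, 0)):
        self.val, self.stamp = val, stamp

    def r_request(self):
        return self.val, self.stamp

    def w_request(self, v, stamp):
        if stamp > self.stamp:
            self.val, self.stamp = v, stamp
        return "w-reply"


def read(servers, reachable):
    majority = len(servers) // 2 + 1
    if len(reachable) < majority:
        raise TimeoutError("no quorum of servers replied")
    # Lines 1-2: collect replies and pick the value with the largest timestamp.
    replies = [servers[i].r_request() for i in reachable]
    v, stamp = max(replies, key=lambda reply: reply[1])
    # Lines 3-4: write the chosen value back before returning it.
    acks = [servers[i].w_request(v, stamp) for i in reachable]
    assert len(acks) >= majority
    # Line 5.
    return v


servers = [Server() for _ in range(5)]
servers[0].w_request("new", (1, 7))      # an ongoing write has reached server 0 only

first_read = read(servers, [0, 1, 2])    # sees "new" at server 0, writes it back
second_read = read(servers, [2, 3, 4])   # finds "new" at server 2
```

Without the write-back, the second read would find only the initial value on servers 2, 3, and 4 and would return something older than what the first read returned.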
Implementing Quorum Replication – continued
A write operation W(x,v)
1. Send a read request r-request(x) to all servers
2. Wait for r-reply(x,-,<ts,id>) messages from a READ quorum of servers
– let <ts,id> be the largest returned logical timestamp
3. Send a write request w-request(x,v,<ts+1,my_id>) to all servers
4. Wait for w-reply(x) messages from a WRITE quorum of servers
5. Return
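The write operation can be sketched the same way, again assuming a single register, majority for both quorum types, synchronous calls, and a hypothetical `reachable` list of servers that answer the client. The second writer contacts a different majority, still learns the first writer's timestamp through the server the two quorums share, and therefore picks a larger one.

```python
class Server:
    """Single-register server; stamp is the pair (ts, id), compared lexicographically."""

    def __init__(self, val=None, stamp=(0, 0)):
        self.val, self.stamp = val, stamp

    def r_request(self):
        return self.val, self.stamp

    def w_request(self, v, stamp):
        if stamp > self.stamp:
            self.val, self.stamp = v, stamp
        return "w-reply"


def write(servers, reachable, v, my_id):
    majority = len(servers) // 2 + 1
    if len(reachable) < majority:
        raise TimeoutError("no quorum of servers replied")
    # Steps 1-2: learn the largest timestamp known to a READ quorum.
    stamps = [servers[i].r_request()[1] for i in reachable]
    ts, _ = max(stamps)
    # Steps 3-4: store the value under a strictly larger timestamp.
    acks = [servers[i].w_request(v, (ts + 1, my_id)) for i in reachable]
    assert len(acks) >= majority
    # Step 5: return.


servers = [Server() for _ in range(5)]
write(servers, [0, 1, 2], "a", my_id=1)   # stored with timestamp (1, 1)
write(servers, [2, 3, 4], "b", my_id=2)   # sees (1, 1) at server 2, uses (2, 2)
```

If two writers happen to pick the same `ts`, the client identifier in the second position of the pair breaks the tie, so timestamps stay totally ordered.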