In this section we will present Dynamo [DHJ+07], a Key-Value store from Amazon. In the Ama- zon’s business it is very important do deliver fast response and consistent view of users updates:
when shopping on-line, one wants to see every update to the shopping cart quickly and correctly. If an user update is lost the user experience is decreased and the trust on the service is compromised. In this section, we will present the main design decisions of Dynamo and how it addresses the availability and consistency requirements.
2.2.5.1 Partitioning algorithm
Nodes in the network are connected in a ring infrastructure. To store data, items keys are hashed yielding their position in the ring. Every node is responsible for a range of values that fall be- tween their position and their predecessor’s position. Hashing of keys is done through consistent hashing [KLL+97]. This technique is better than traditional hashing because it will only require to redistribute a small set of keys when a new node is added or removed. However, the original consistent hashing technique provides less effective load distribution. To allow better distribution of data, nodes are assigned to multiple regions of the ring [SMLN+03] by allowing them to run multiple instances of virtual nodes.
To route requests in the ring, nodes have a list with the other nodes and their hash-ranges. This is an extension of Chord routing protocol to reduce the number of hops to reach a node to one.
2.2.5.2 Replication
Data distribution in the ring is not enough to increase overall system’s availability. For instance, if some object is very popular and it is being heavily accessed, the node hosting that content might not be able to handle all requests. Replication is necessary to make the system responsive in those situations. Dynamo’s approach to deal with replication is to elect a coordinator who is responsible for the object, and replicate it at N of its successors. The identification of nodes that store that object are stored in a preference list and the algorithm that computes that list excludes multiple virtual nodes hosted by the same physical node to forbid redundant replication in the same node.
2.2.5.3 Data Versioning
Dynamo associates a vector clocks to every object to capture causality relations and detect con- flicting versions. If there are conflicts, the system stores the conflicting values to deliver them to the application, which in turn will solve the conflict (application level reconciliation).
Vector clocks present a way to deal with concurrency but this technique has scalability issues, as vector size grows with the number of replicas that update the same content. In Dynamo, it is unlikely that many different nodes update the same values, since most of the updates will be exe- cuted by the preferred nodes to do the operation. To avoid that vector clocks grow immeasurably, for every pair (nodeId,counter), if the difference between the counters surpasses a threshold then the pair with the lower counter is discarded. This is unsafe, but the authors report that, in practice, it poses no problems.
2.2.5.4 Operation execution
If the client requests operations to the network through a load balancer, he may contact any node in the system, either for get or put operations. When a request is sent, if the receiver node is not one of the top N preferred nodes to handle the query it will forward the operation to one of those nodes. The node that is handling the query is called the coordinator. It is necessary to specify how many replicas are necessary to handle a read (R) or write (W) request (Read/Write Quorum). The minimum number of replicas that participate in a read(R) or write(W) can be parametrized. If those values are defined such that R + W > N and W > N/2 , then there cannot be concurrent read and write operations on the same value and only one write persists. This behaviour is described in [Gif79] and it assures one copy serializability inside the cluster. Under this configuration it is still possible to generate conflicts.
When doing a write, the coordinator generates a new vector clock and sends the operation to N-1 nodes in the preference list. If W-1 nodes reply then the operation is considered committed. If there is already one object stored with a version vector that is not older then the vector that we are writing, then both values are kept. On get operations, the protocol is similar but if there are different vector clocks in the answers among the R replicas, then the content must be reconciled and written back. The availability of the system can be parametrized through R and W values. When W=1 the system require the commitment of the operation on only one replica, and hence make it more available.
To maintain durability and consistency among replicas, the system implements an anti-entropy protocol to keep replicas synchronized. The system uses a Merkle tree [JLMS03] to detect incon- sistencies between replicas. One advantage of using a Merkle tree is that it is not required to download the whole tree to check consistency, on the downside, whenever a node enter or leaves the system the tree must be recomputed.