Part II Scalable and Consistent Distributed Storage – CATS
9.6 Comparison with Cassandra
Cassandra [125] and other distributed key-value stores [67, 30, 80] that use consistent hashing with successor-list replication have an architecture very similar to that of CATS. Since Cassandra was freely available, we compared the performance of CATS with that of Cassandra.
We should note that we are comparing our research system prototype with a system that leverages half a decade of implementation optimizations and fine tuning by a community of open-source contributors. Our goal is to give the reader an idea about the relative performance difference between the two systems. Extrapolating our previous evaluation of the overhead of atomic consistency using consistent quorums, this may give an insight into the cost of atomic consistency if implemented in Cassandra. We leave the actual implementation of consistent quorums in Cassandra to future work.
Both CATS and Cassandra are implemented in Java. We used Cassandra version 1.1.0, the latest version available at the time, and we used the QUORUM consistency level for a fair comparison with CATS. We chose the initial data size such that the working set would fit in main memory. Since CATS was storing data in main memory while Cassandra used disk, we set commitlog_sync: periodic in Cassandra to minimize the effects of disk activity on operation latencies and make for a fair comparison. Figures 9.8 and 9.9 show mean operation latencies, where each data point represents measurements averaged over five runs. Using the same workloads, we compared Cassandra and CATS with eventual consistency. The trend of higher latencies for large value sizes remains the same for both systems and workloads, as the network starts to become a bottleneck. For CATS, read and update latencies are the same since both operations have the same message complexity and same-size messages. On the other hand, Cassandra updates are faster than reads, which was expected since in Cassandra updates are committed to an append-only log and require no disk reads or seeks, while read operations may need to consult multiple uncompacted SSTables in search of the requested data. The results show that the operation latencies in CATS are approximately three times higher than in Cassandra, except for reads under an update-intensive workload, where SSTable compactions occur too seldom relative to the high update rate, so that each read operation must consult multiple SSTables, which degrades Cassandra’s read performance.
Figure 9.9. Latency comparison between Cassandra and an implementation of eventual consistency in CATS, under an update-intensive workload. [Plot: latency in ms versus value size in bytes (log scale, 16 to 4096); series: reads and updates for CATS with eventual consistency, reads and updates for Cassandra.]
Given our comparison between Cassandra and CATS with eventual consistency, as well as the relatively small decrease in throughput when providing atomic consistency – using consistent quorums and two-phase write operations – instead of only eventually consistent single-phase operations (see Section 9.5), we believe that an implementation of consistent quorums in Cassandra can provide linearizable consistency without a considerable drop in performance, e.g., less than 5% overhead for a read-intensive workload, and about 25% overhead for update-intensive workloads.
Chapter 10
CATS Discussion and Comparison to Related Work
In this chapter we discuss alternative consistency models that can be implemented in a simple manner on top of the foundation of scalable reconfigurable group membership provided by CATS. We also discuss possible efficient implementations of these models and we compare CATS with related work in the areas of scalable key-value stores and consistent meta-data storage systems.
CATS brings together the scalability and self-organization of DHTs with the linearizable consistency and partition tolerance of atomic registers.
10.1 Alternatives to Majority Quorums
For some applications majority quorums may be too strict. To accommodate specific read-intensive or update-intensive workloads, such applications might want flexible quorum sizes for put and get operations, like read-any-update-all or read-all-update-any, despite the fault-tolerance caveats entailed. Interestingly, our ABD-based two-phase algorithm depends on majority quorums for linearizability; however, with more flexible yet overlapping quorums, the algorithm still satisfies sequential consistency [25], which is slightly weaker but still a very useful level of consistency, as we discussed in Section 6.3. This means that system designers are free to decide the size of quorums for read and write operations to suit their workloads, as long as the read and write quorums overlap. For instance, in a stable environment, like a data center, motivated by the need to handle read-intensive workloads, a system designer may choose the size of write quorums to be larger than the size of read quorums, in order to enable lower read latencies at the expense of more costly and less fault-tolerant writes – meaning that write operations need to send more messages and wait for acknowledgements from more replicas, and thus they can tolerate fewer crashed replicas. Consider a read-intensive workload and a replication degree of three. The write quorum size can be chosen as three and the read quorum size as one. Such a configuration makes writes more expensive and less fault-tolerant, yet read latency drops considerably, since only one node – any node in the replication group – is involved in the read operation.
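The overlap requirement above can be made concrete with a little arithmetic: with N replicas, a read quorum of size R and a write quorum of size W are guaranteed to intersect exactly when R + W > N. The following is a minimal sketch of that check, not code from CATS; the class `QuorumConfig` and its method names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    """Hypothetical helper describing quorum sizes for one replication group."""

    replicas: int      # replication degree N
    read_quorum: int   # number of replicas a read must contact (R)
    write_quorum: int  # number of replicas a write must contact (W)

    def quorums_overlap(self) -> bool:
        # Any read quorum shares at least one replica with any write quorum
        # exactly when R + W > N.
        return self.read_quorum + self.write_quorum > self.replicas

    def read_crash_tolerance(self) -> int:
        # Reads still complete as long as R replicas remain reachable.
        return self.replicas - self.read_quorum

    def write_crash_tolerance(self) -> int:
        # Writes still complete as long as W replicas remain reachable.
        return self.replicas - self.write_quorum


# The example from the text: replication degree 3, write quorum 3, read quorum 1.
read_any_update_all = QuorumConfig(replicas=3, read_quorum=1, write_quorum=3)
# Majority quorums for the same replication degree.
majority = QuorumConfig(replicas=3, read_quorum=2, write_quorum=2)

print(read_any_update_all.quorums_overlap())        # True
print(read_any_update_all.write_crash_tolerance())  # 0
print(majority.write_crash_tolerance())             # 1
```

In this sketch the read-any-update-all configuration tolerates two crashed replicas for reads but none for writes, whereas majority quorums tolerate one crashed replica for both operations, which is the trade-off described above.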
On a related note, the idea of primary-backup replication could be applied to the consistent replication groups of CATS, to enable efficient primary reads. For instance, the node with the lowest identifier in each group could be considered the primary for that group, thus enabling primary-based replication in CATS. With a primary-backup scheme there are two possible designs: lease-based and non-lease-based.
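As an illustration of the lowest-identifier rule, the sketch below picks the primary of a replication group. It assumes that node identifiers are comparable values (here, integers on the identifier ring) and that every member holds the same view of the group; the function name `primary_of` is hypothetical.

```python
def primary_of(group):
    """Return the primary of a replication group: the member with the lowest identifier.

    `group` is any non-empty iterable of comparable node identifiers
    (an assumption made for this sketch).
    """
    members = list(group)
    if not members:
        raise ValueError("replication group must not be empty")
    return min(members)


# Members that share the same group view compute the same primary locally,
# regardless of the order in which they list the members.
print(primary_of([42, 17, 88]))  # 17
print(primary_of([88, 42, 17]))  # 17
```

Because the rule is a deterministic function of the group view, no extra messages are needed to agree on the primary as long as the views agree.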
The lease-based design [133] assumes a timed-asynchronous model and relies on this assumption to guarantee that, at all times, at most one node considers itself to be the primary. In this design, read operations can always be answered directly by the primary, without contacting other nodes, since the unique primary must have seen the latest write. Write operations can be sequenced by the primary but cannot be acknowledged to clients until the primary has committed them at a majority of replicas, in order to avoid lost updates in case of primary failure.
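The read and write rules of the lease-based design can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not the protocol of [133]: time is passed in explicitly as `now`, the number of replicas that committed a write is passed in as `acks` instead of being gathered over the network, and the class `LeasePrimary` and its members are hypothetical.

```python
class LeasePrimary:
    """Hypothetical lease-holding primary; clock and replica acks are simulated."""

    def __init__(self, replicas, lease_expiry):
        self.replicas = replicas          # group size, including the primary
        self.lease_expiry = lease_expiry  # time at which the lease runs out
        self.value = None                 # latest value acknowledged to a client
        self.next_seq = 0                 # sequence number assigned to writes

    def holds_lease(self, now):
        return now < self.lease_expiry

    def read(self, now):
        # A valid lease means no other node considers itself primary,
        # so the read is answered locally without contacting other replicas.
        if not self.holds_lease(now):
            raise RuntimeError("lease expired: cannot serve a local read")
        return self.value

    def write(self, value, now, acks):
        """Return True if the write may be acknowledged to the client.

        `acks` is the number of replicas (primary included) that have
        committed the write; in a real system it would be collected
        from replica responses.
        """
        if not self.holds_lease(now):
            raise RuntimeError("lease expired: cannot sequence writes")
        self.next_seq += 1                 # the primary sequences the write
        majority = self.replicas // 2 + 1
        if acks < majority:
            return False                   # not yet committed at a majority
        self.value = value
        return True


primary = LeasePrimary(replicas=3, lease_expiry=10.0)
print(primary.write("a", now=1.0, acks=2))  # True
print(primary.write("b", now=3.0, acks=1))  # False
print(primary.read(now=4.0))                # a
```

The sketch makes the asymmetry visible: reads need only a valid lease, while writes additionally need a majority of commits before the client hears back.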
In the non-lease-based design, all operations are directed at the primary and for both read and write operations, the primary must contact a majority of replicas before acknowledging the operation to the client. Because all operations involve a majority of nodes, there is no safety violation when more than one node considers itself to be a primary. This can be achieved if