Scalable BFT - On the many faces of atomic multicast

2.6 Conclusion

3.5.2 Scalable BFT

Despite the large amount of work on BFT replication in the last two decades (e.g., [1, 5, 12, 21, 36, 45, 53, 58, 81, 85]), the scalability of BFT protocols is still a relatively unexplored topic, which is discussed in this section.

A common observation of BFT protocols is that their performance degrades significantly as the number of faults tolerated increase [1]. This lack of fault-

scalability comes mostly from the all-to-all communication used in these protocols, which implies in a quadratic amount of messages. This limitation can be mitigated either by using protocols with linear message pattern [1, 36, 45], by using protocols with a smaller ratio between n and f [53, 85], or by exploring erasure codes and large message batches[58]. Independently on the trade-offs explored by these protocols, all of them lose performance as the number of replicas increase, contrary to ByzCast.

There are few BFT protocols that target wide-area networks[5, 81]. These protocols tend to use more replicas to decrease the relative quorum size or the distance between replicas in the quorums. Similarly to the scalable protocols described before, the performance of these protocols tends to decrease with the number of replicas.

The natural way of scaling replicated systems is sharding the state in multiple replica groups and running ordering protocols only in these groups. To the best of our knowledge, there are only three works that consider partitionable replication for BFT systems. Augustus[63] and Callinicos [62] introduces protocols for executing transactions in multiple shards of a key-value store implemented on top of multiple BFT groups. A recent work by Nogueira et al.[60] introduces protocols for splitting and merging replica groups in BFT-SMaRt, without discussing ways to disseminate messages to more than one of these groups with Byzantine failures. ByzCast complements these works by providing a protocol for dissem- inating requests on multiple partitions, enabling thus the efficient support for services that require multi-partition operations.

3.6 Conclusion

Atomic multicast is a fundamental communication abstraction in the design of scalable and highly available strongly consistent distributed systems. This chapter presents ByzCast, the first Byzantine Fault-Tolerant atomic multicast, de- signed to build on top of existing BFT abstractions. ByzCast is partially genuine, i.e., it scales linearly with the number of groups, for messages addressed to a

62 3.6 Conclusion single group. In addition to introducing a novel atomic multicast algorithm, its performance is also assessed in two different environments. The results show that ByzCast outperforms BFT-SMaRt in most cases, as well as a non-genuine BFT atomic multicast protocol.

Speeding up state machine

replication in wide-area networks

4.1 Introduction

Many current online services must serve clients distributed across geographic areas. In order to improve service availability and performance, servers are typically replicated and deployed over geographically distributed sites (i.e., datacenters). By replicating the servers, the service can be configured to tolerate the crash of nodes within a single datacenter and the disruption of an entire datacenter. Geographic replication can improve performance by placing the data close to the clients, which reduces service latency.

Designing systems that coordinate geographically distributed replicas is chal- lenging. Some replicated systems resort to weak consistency to avoid the over- head of wide-area communication. Strong consistency provides more intuitive service behavior than weak consistency, at the cost of increased latency. Due to the importance of providing services that clients can intuitively understand, several approaches have been proposed to improve the performance of geo- distributed strongly consistent systems (e.g., [44, 59, 78, 83]). This chapter presents GeoPaxos, a protocol that combines three insights to implement efficient state machine replication in geographically distributed environments.

First, GeoPaxos decouples ordering from execution[87]. Although Paxos introduces different roles for the ordering and execution of operations (acceptors and learners, respectively[48]), Paxos-based systems typically combine the two roles in a replica (e.g.,[44, 59, 67]). Coupling order and execution in a geographically distributed setting, however, leads to a performance dilemma. On the one hand, replicas must be deployed near clients to reduce latency (e.g., clients can quickly

64 4.1 Introduction read from a nearby replica). On the other hand, distributing replicas across geographic areas to serve remote clients slows down ordering, since replicas must coordinate to order operations. By decoupling order from execution, GeoPaxos can quickly order operations using servers in different datacenters within the same region[44] and deploy geographically distributed replicas without penal- izing the ordering of operations.

Second, instead of totally ordering operations before executing them, as tra- ditionally done in state machine replication [47], GeoPaxos partially orders op-

erations. It is well-known that state machine replication does not need a to- tal order of operations [77] and a few systems have exploited this fact (e.g., [46, 59]). GeoPaxos differs from existing systems in the way it implements partial order. GeoPaxos uses multiple independent instances of Multi-Paxos [23] to order operations—hereafter, an instance of Multi-Paxos is called a Paxos group or simply a group. Operations are ordered by one or more groups, depending on the objects they access. Operations ordered by a single group are the most efficient ones since they involve servers in datacenters in the same region. Op- erations that involve multiple groups require coordination among servers in datacenters that may be far apart and thus perform worse than single-group operations. GeoPaxos’ approach to partial order can take advantage of public cloud computing infrastructures such as Amazon EC2 [4]: fault tolerance is provided by nodes in datacenters in different availability zones, within the same region; performance is provided by replicas in different regions. Although intra-region redundancy does not tolerate catastrophic failures in which all datacenters of a region are wiped out, most applications do not require this level of reliabil- ity[44].

Third, to maximize the number of single-group operations, GeoPaxos exploits

geographic locality. Geographic locality presumes that objects have a preferred site, that is, a site where objects are most likely accessed. Geographic locality is common in many online services. For example, operations on a user’s data often originate in the region where the user is. Some distributed systems exploit locality by sharding the data and placing shards near the users of the data (e.g., [44, 83]). GeoPaxos does not shard the service state; instead, it distributes the responsibility for ordering operations to Paxos groups deployed in different regions. Operations are ordered by the groups in the preferred sites of the objects accessed by the operation.

The rest of the chapter is structured as follows. Section 4.2 details the system model and recalls fundamental notions. Section 4.3 overviews the main contri- butions. Section 4.4 details GeoPaxos. Section 4.5 discuss the correctness of the provided algorithm. Section 4.6 describes the prototype. Section 4.7 presents

performance evaluation. Section 4.8 reviews related work and Section 4.9 con- cludes the chapter.

In document On the many faces of atomic multicast (Page 83-87)