types
Marek Zawirski
To cite this version:
Marek Zawirski. Dependable eventual consistency with replicated data types. Data Structures
and Algorithms [cs.DS]. Universit´
e Pierre et Marie Curie - Paris VI, 2015. English.
<
NNT :
2015PA066638
>
.
<
tel-01248051v2
>
HAL Id: tel-01248051
https://tel.archives-ouvertes.fr/tel-01248051v2
Submitted on 11 Jul 2016
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est destin´ee au d´epˆot et `a la diffusion de documents scientifiques de niveau recherche, publi´es ou non, ´emanant des ´etablissements d’enseignement et de recherche fran¸cais ou ´etrangers, des laboratoires publics ou priv´es.
THÈSE DE DOCTORAT DE
l’UNIVERSITÉ PIERRE ET MARIE CURIE Spécialité
Informatique
École doctorale Informatique, Télécommunications et Électronique (Paris)
Présentée par
Marek ZAWIRSKI
Pour obtenir le grade de
DOCTEUR de l’UNIVERSITÉ PIERRE ET MARIE CURIE
Sujet de la thèse :
Cohérence à terme fiable avec des types de données répliquées
soutenue le 14 janvier 2015 devant le jury composé de :
M. Marc SHAPIRO Directeur de thèse M. Pascal MOLLI Rapporteur
M. Luís RODRIGUES Rapporteur M. Carlos BAQUERO Examinateur M. Jerzy BRZEZI ´NSKI Examinateur M. Sebastian BURCKHARDT Examinateur M. Peter DICKMAN Examinateur M. Pierre SENS Examinateur
Abstract
Web applications rely on replicated databases to place data close to their users, and to tolerate failures. Many replication systems opt for eventual consistency, which offers excellent responsive-ness and availability, but may expose applications to the complexity of concurrency and failures. To alleviate this problem, recent replication systems have begun to strengthen their interface with additional guarantees, such as causal consistency, which protects the application from ordering anomalies, andReplicated Data Types(RDTs), which encapsulate concurrent updates via high level object interface and ensure convergent semantics. However, dependable algorithms for these abstractions come at a high cost in metadata size. This thesis studies three related topics: (i)design of RDT implementations with minimized metadata;(ii)design of consistency algorithms with minimized metadata; and(iii)the limits of the design space.
Our first contribution is a study of metadata complexity of RDTs. RDTs need metadata to provide rich semantics under concurrency and failures. Several existing implementations incur high overhead in storage space, in the number of updates, or polynomial in the number of replicas. Minimizing metadata is nontrivial without impacting their semantics or fault-tolerance. We designoptimizedset and register RDTs with metadata overhead reduced to the number of replicas. We also demonstrate metadatalower boundsfor six RDTs, thereby proving optimality of four implementations. As a result, RDT designers and users can better navigate the design space.
Our second contribution is the design of SwiftCloud, areplicated causally-consistent RDT object database for client-side applications, e.g., mobile or in-browser apps. We devise algorithms to support high numbers of client-side partial replicas backed by the cloud, in a fault-tolerant manner, with small and bounded metadata. We demonstrate how to support available and consistent reads and updates, at the expense of some slight data staleness; i.e., our approach trades freshness for scalability (small and bounded metadata, parallelism), and availability (ability to switch between data centers on failure). Our experiments with thousands of client replicas confirm the design goals were achieved at a low staleness cost.
Acknowledgement
This work would not be possible without many inspiring and friendly people that I had a chance to meet and work with.
I am thankful to my advisor, Marc Shapiro, for his insights and constant encouragement that he offered me throughout our countless meetings. He taught me the role of communication and gave me the opportunity to collaborate with other people. I joined Marc’s group on the grounds of my BSc/MSc experience at PUT. For that period, I owe my gratitude to Jerzy Brzezi ´nski.
I am grateful to Nuno Preguiça, who always had time to review “unsolvable” problems and questionable solutions with me. I thank Nuno, as well as my other co-authors, Annette Bieniusa, Sérgio Duarte, Carlos Baquero, and Valter Balegas for their exceptionally amicable attitude. They all significantly contributed to Chapter 3 and Part III of this thesis.
I appreciate the guidance of Sebastian Burckhardt during my internship at Microsoft Re-search. I thank Sebastian for the three inspiring and enjoyable months, his openness and willingness to share his experience. His ideas considerably influenced Part II of this work.
I am thankful to all my committee members, to the reviewers in particular, for accepting to take part in this process without hesitation, offering me their precious time and feedback.
Among many others, I would like to particularly thank Masoud Saeida Ardekani, Alexey Gotsman, Pierre Sutra, and Hongseok Yang for our inspiring meetings, and all the ideas they contributed with. The feedback I got from Peter Bailis, João Leitão, Allen Clement, Vivien Quéma, Tyler Crain, and Basho’s team shaped the presentation along with my own view of this work.
I am grateful to Google (Googlers) for generously supporting me with a PhD fellowship, as well as for matching me with an excellent team in Zürich for my superb internship experience.
The time I spent with my fellow grad-student friends is unforgettable. Thanks Masoud Saeida Ardekani, Pierpaolo Cincilla, Lisong Guo, Florian David, Alejandro Tomsic, Valter Balegas, and many others for sharing a round of beer with me, or a lap in Mario Kart, whenever (in)appropriate. Thank you Pierpaolo, Florian and Kasia Kapusta for your nearly 24h/7j translation services.
I am indebted to my family, especially to my parents, my sister, and my brother, for their support in all plans that I decided to carry out. They helped me take my first steps in this world, speak first English and French words, and even write my first lines of code. The presence of my dear friends all around the world has also been invaluable. Finally, I especially thank my girlfriend Marta for all her patience, efforts and smile.
Table of Contents
List of Tables xiii
List of Figures xv
I Preliminaries 1
1 Introduction 3
1.1 Contributions . . . 4
1.1.1 Optimality of Replicated Data Types . . . 4
1.1.2 Causally-Consistent Object Database for Client-Side Applications . . . 6
1.2 Organization . . . 8
1.3 Authorship and Published Results . . . 8
2 Replicated Data Types 9 2.1 Motivation . . . 10
2.2 System Model and Object Implementation . . . 14
2.2.1 Object Implementation Model . . . 15
2.2.2 Replication . . . 16
2.2.3 Examples . . . 17
2.2.3.1 Counter Implementations . . . 18
2.2.3.2 Register Implementations . . . 21
2.3 Specification of Intended Behavior . . . 24
2.3.1 Specification Model . . . 24
2.3.2 Examples . . . 28
2.3.2.1 Counter Specification . . . 28
2.3.2.2 Register Specifications . . . 28
2.3.2.3 Set Specifications . . . 29
2.4 Execution Model and Correctness . . . 31
2.4.1 Execution Model . . . 31
2.4.3 Implementation Categories . . . 34
2.4.3.1 Network Specifications . . . 34
2.4.3.2 Visibility Witnesses . . . 35
2.4.3.3 Main Categories . . . 35
II Optimality of Replicated Data Types 37 3 Metadata Space Complexity Problem 39 3.1 Problem Statement . . . 40
3.2 Optimizing Implementations . . . 42
3.2.1 Successful Optimizations . . . 42
3.2.1.1 Add-Wins Set . . . 42
3.2.1.2 Multi-Value Register . . . 48
3.2.2 Prior Implementations and Unsuccessful Optimizations . . . 51
3.2.2.1 Last-Writer-Wins Register . . . 51
3.2.2.2 Remove-Wins Set . . . 52
3.2.2.3 Last-Writer-Wins Set . . . 53
3.2.2.4 Counter . . . 54
3.3 Summary . . . 54
4 Lower Bounds on Complexity and Implementation Optimality 55 4.1 Proof Technique . . . 56 4.1.1 Experiment Family . . . 56 4.1.2 Driver Programs . . . 58 4.2 Lower Bounds . . . 58 4.2.1 Counter . . . 58 4.2.2 Add-Wins Set . . . 61 4.2.3 Remove-Wins Set . . . 64 4.2.4 Last-Writer-Wins Set . . . 66 4.2.5 Multi-Value Register . . . 66 4.2.6 Last-Writer-Wins Register . . . 68 4.3 Summary . . . 68
5 Related Work and Discussion 71 5.1 Other Data Types . . . 72
5.2 State-Based Optimizations Beyond Our Metric . . . 72
5.2.1 Background Compaction of Stable Metadata . . . 73
5.2.2 Finer-Grained Optimizations . . . 73
TABLE OF CONTENTS
5.3 Other Implementation Categories . . . 75
5.3.1 State-Based Implementations With Smaller Messages . . . 75
5.3.2 Optimizations Based on Topology Restrictions and Delayed Visibility . . . 76
5.3.3 Replicated File Systems . . . 77
5.4 Lower Bound Proofs in Distributed Computing . . . 78
III Causally-Consistent Object Database for Client-Side Applications 79 6 Problem Overview 81 6.1 System Model and Basic Requirements . . . 82
6.2 Consistency with Convergence . . . 83
6.2.1 Causal Consistency . . . 84
6.2.2 Convergence with Replicated Data Types . . . 84
6.3 Application Programming Interface . . . 85
6.4 Challenge . . . 85
6.4.1 Metadata Design . . . 85
6.4.2 Causal Consistency with Partial Replication is Hard . . . 86
7 The SwiftCloud Approach 87 7.1 Design . . . 88
7.1.1 Causal Consistency at Full Data Center Replicas . . . 88
7.1.2 Causal Consistency at Partial Client Replicas . . . 90
7.1.3 Failing Over: The Issue with Transitive Causal Dependency . . . 91
7.1.3.1 Conservative Read: Possibly Stale, But Safe . . . 91
7.1.3.2 Discussion . . . 92
7.2 Implementation . . . 93
7.2.1 Timestamps, Vectors and Log Merge . . . 93
7.2.2 Protocols . . . 94
7.2.2.1 State . . . 94
7.2.2.2 Client-Side Execution . . . 94
7.2.2.3 Transfer Protocol: Client to Data Center . . . 95
7.2.2.4 Geo-replication Protocol: Data Center to Data Center . . . 96
7.2.2.5 Notification Protocol: Data Center to Client . . . 96
7.2.3 Object Checkpoints and Log Pruning . . . 97
7.2.3.1 Log Pruning in the Data Center . . . 97
7.2.3.2 Pruning the Client’s Log . . . 98
8 Experimental Evaluation 99 8.1 Prototype and Applications . . . 100
8.2 Experimental Setup . . . 101
8.3 Experimental Results . . . 102
8.3.1 Response Time and Throughput . . . 102
8.3.2 Scalability . . . 105
8.3.3 Tolerating Client Churn . . . 107
8.3.4 Tolerating Data Center Failures . . . 107
8.3.5 Staleness Cost . . . 108
9 Related Work 109 9.1 Consistency Models for High Availability . . . 110
9.2 Relevant Systems . . . 110
9.2.1 Replicated Databases for Client-Side Applications . . . 111
9.2.1.1 Systems that Support Inter-Object Consistency . . . 111
9.2.1.2 Systems that Support Intra-Object Consistency Only . . . 113
9.2.1.3 Session Guarantees . . . 115
9.2.2 Geo-replicated Databases for Server-Side Applications . . . 115
9.2.2.1 Approaches . . . 116
9.2.2.2 Comparison and Applicability to Client-Side Replication . . . 117
10 Conclusion 119 10.1 Summary . . . 119
10.2 Limitations and Perspectives . . . 120
IV Appendix 123 A Additional Material on Replicated Data Types 125 A.1 Formal Network Layer Specifications . . . 125
A.2 Optimized Op-Based Implementations . . . 125
B Metadata Overhead Proofs 127 B.1 Standard Encoding . . . 127
B.2 Metadata Overhead of Specific Implementations . . . 127
B.3 Lower Bound Proofs . . . 135
B.3.1 Add-Wins Set . . . 136 B.3.2 Remove-Wins Set . . . 136 B.3.3 Last-Writer-Wins Set . . . 137 B.3.4 Multi-Value Register . . . 139 B.3.5 Last-Writer-Wins Register . . . 140 C Résumé de la thèse 143
TABLE OF CONTENTS
C.1 L’optimalité des types les données répliquées . . . 145 C.2 Une base de données causalement cohérente pour les applications coté client . . . 147 C.2.1 Présentation du problème . . . 148 C.2.1.1 La cohérence et la convergence . . . 150 C.2.1.2 La conception des métadonnées . . . 151 C.2.1.3 La cohérence causale avec une réplication partielle est dur . . . . 151 C.2.2 L’approche SwiftCloud . . . 152
C.2.2.1 Cohèrence causale dans les répliques complètes des Centre de Données . . . 152 C.2.2.2 Cohèrence causale dans les répliques client partielles . . . 154 C.2.2.3 Le basculement sur erreur: Le problème avec la dépendance
causale transitive . . . 155 C.2.2.4 Protocoles avec les métadonnées découplées et délimitées . . . . 157 C.2.3 La mise en œuvre et l’évaluation . . . 157
List of Tables
1.1 Summary of metadata overhead results for different data types. . . 5
2.1 Comparison of op-based and state-based implementation categories. . . 36
4.1 Experiment family used in the lower bound proof for counter (Ctr). . . 59
4.2 Experiment family used in the lower bound proof for add-wins set (AWSet). . . 62
4.3 Experiment family used in the lower bound proof for remove-wins set (RWSet). . . 65
4.4 Experiment family used in the lower bound proof for last-writer-wins set (LWWSet). . 67
4.5 Experiment family used in the lower bound proof for multi-value register (MVReg). . . 69
4.6 Summary of metadata overhead results for different data types. . . 70
8.1 Characteristics of applications/workloads. . . 100
9.1 Qualitative comparison of metadata used by causally-consistent partial replication systems. . . 112
A.1 Catalog of popular network specifications . . . 126
B.1 Standard encoding schemes for components of replica state and return values. . . 128
B.2 Experiment family used in the lower bound proof for last-writer-wins register (LWWReg).140 C.1 Sommaire des résultats de surcharge pour différents types de données. . . 146
List of Figures
2.1 Comparison of different approaches to conflicts on an example execution. . . 12
2.2 Components of the system model. . . 14
2.3 Time-space diagram of an execution of the op-based counter with three replicas. . . . 18
2.4 An example execution of the state-based counter with three replicas. . . 19
2.5 Semilattice of states (vectors) of the state-based counter implementation. . . 20
2.6 An example execution of the basic state-based multi-value register. . . 22
2.7 Components of implementation and specification models and their relation. . . 24
2.8 Examples of abstract executions forCtrandMVRegdata types. . . 26
2.9 Set semantics alternatives on an example operation context. . . 31
2.10 Definitions of the set of configurations and the transition relation. . . 32
3.1 Concrete executions of two implementations of add-wins set. . . 44
3.2 Semantics of a version vector in the optimized add-wins set. . . 46
3.3 Operation context of an add-wins set illustrating opportunity for coalescing adds. . . 47
3.4 Example execution of three different implementations of multi-value register (MVReg). 50 4.1 The structure of experiment family and implications on object state size. . . 57
4.2 Example experiment and test for counter (Ctr). . . 59
4.3 Example experiment and test for add-wins set (AWSet). . . 62
4.4 Example experiment and test for remove-wins set (RWSet). . . 65
4.5 Example experiment and test for last-writer-wins set (LWWSet). . . 67
4.6 Example experiment and test for multi-value register (MVReg). . . 69
6.1 System components, and their interfaces. . . 83
7.1 Example evolution of configurations for two DCs, and a client. . . 89
8.1 Response time for YCSB operations. . . 102
8.2 Maximum system throughput for different workloads and protocols. . . 103
8.3 Throughput vs. response time for different system configurations. . . 104
8.5 Size of metadata for a variable number of replicas. . . 105
8.6 Storage occupation at a single DC in reaction to client churn. . . 107
8.7 Response time for a client that hands over between DCs. . . 108
8.8 K-stability staleness overhead. . . 108
C.1 Les composantes du système et leurs interfaces. . . 149
List of Algorithms
2.1 Template for data type implementation. . . 17
2.2 Op-based implementation of counter (Ctr). . . 18
2.3 State-based implementation of counter (Ctr). . . 19
2.4 State-based implementation of last-writer-wins integer register (LWWReg). . . 21
2.5 Basic state-based implementation of multi-value integer register (MVReg). . . 22
3.1 Naive state-based implementation of add-wins set (AWSet). . . 43
3.2 Optimized state-based implementation of add-wins set (AWSet). . . 45
3.3 Incorrect state-based implementation of multi-value integer register (MVReg). . . . 49
3.4 Optimized state-based implementation of multi-value integer register (MVReg). . . 51
3.5 State-based implementation of remove-wins set (RWSet). . . 52
3.6 State-based implementation of last-writer-wins set (LWWSet). . . 54
Part I
Chapter 1
Introduction
High availability and responsiveness are essential for many interactive web applications. Such applications often share mutable data between users in geographically distributed locations. To ensure availability and responsiveness, they rely ongeo-replicated databases or even on
client-side data replicas. Geo-replication provides users with access to a local replica in a nearby data center, whereas client-side replication provides access to a local replica at a user’s device. This local data access is neither affected by the cost of network latency between replicas nor by failures (e.g., replica disconnection) [30, 37, 43, 44]. When, furthermore, performance and availability of update operations are crucial for an application, then these updates need to be also performed locally, without coordination with remote replicas. This entails thatconcurrent updatesmust be accepted at different replicas and replicated asynchronously.
Unfortunately, asynchronous replication is at odds with strong consistency models, such as linearizability or serializability [20, 53]. Strong consistency offers applications a single common view of the distributed database, but requires to execute operations in a global total order at all replicas (synchronous replication). The incompatibility of availability and fault-tolerance with strong consistency, known as the CAP theorem, forces a choice of weaker consistency models, i.e.,
eventual consistency[44, 50, 79].
Under eventual consistency, replicas are allowed to diverge transiently due to concurrent updates, but they are expected to eventually converge to a common state that incorporates all updates [79, 100]. Intermediate states expose applications to consistency anomalies. This is undesirable for users, and poses a challenge to the implementation of the database and the application [43]. Difficulties include detection of concurrent conflicting updates, and their convergent resolution, or protection from asynchronous delivery of updates and from failures.
Two complementary abstractions were proposed to alleviate the issues of eventually consistent replication. These abstractions encapsulate the complexity of asynchronous replication and failures behind an interface with a well-defined behavior. Adependableimplementation of such an interface guarantees the behavior regardless of the underlying infrastructure (mis)behavior. First,Replicated Data Types(RDTs) expose data in the database as typed objects with
high-level methods. RDTs rely on type and method semantics for convergence [34, 92]. With RDTs, the database consists of objects such as counter, set, or register, that offer read and update methods, such asincrementfor a counter, oraddandremovefor a set. A data type encapsulates convergent replication, and resolves concurrent updates according to the type’s defined semantics.
Second,causal consistencyoffers partial ordering of updates across objects [4, 66, 67]. In-formally, under causal consistency, every application process observes a monotonically non-decreasing set of updates, which includes its own updates, in an order that respects the causality between operations. Applications are protected from causality violations, e.g., where read could observe updatebbut not updateathatbis based upon.Transactionalcausal consistency further simplifies programming with atomicity. It guarantees that all the reads of a transaction come from a same database snapshot, and either all its updates are visible, or none is [14, 66, 67].
These abstractions facilitate programming against a replicated database, but their dependable implementation comes at the cost of storage and network metadata. Metadata is required, for instance, to relate asynchronously replicated updates, i.e., to determine which one caused another, or to compare replica states after a transient failure. Some metadata cost is inherent to concurrency, and to uncertainty of the status of an unresponsive replica, which could be unavailable either permanently or transiently (and could possibly perform concurrent updates).
This thesis studies the design of dependable RDT and causal consistency algorithms with minimized metadata, and the limits and the trade-offs of their design space.
1.1 Contributions
This dissertation makes two main contributions in the area of dependable algorithms for eventual consistency: a study of space optimality of RDTs, including impossibility results and optimized im-plementations, and the design of a causally-consistent RDT database for client-side applications. We discuss both contributions in more detail in the remainder of this section.
1.1.1 Optimality of Replicated Data Types
In the first part of the thesis, we consider the problem of minimizing metadata incurred by RDT implementations. In addition to client-observable data, RDTs use metadata to ensure correctness in the presence of concurrent updates and failures. The metadata manifests either in the implementation of RDTs or in a middleware for updates dissemination. A category of
state-basedRDT implementations, i.e., objects that communicate by exchanging their complete state [92], uses metadata directly in the RDT state. Such state-based implementations are used, for example, by server-side geo-replicated object databases [3]. Some state-based RDTs incur substantial metadata overhead that impacts storage and bandwidth cost, or even the viability of an implementation. On the other hand, the design of RDT implementations that reduce the
1.1. CONTRIBUTIONS
Data type Existing implementation Optimized impl. Any impl. Source Bounds bounds lower bounds
counter [92] £b(n) — ≠b(n)
add-wins set [93] £b(mlgm) new:£b(nlgm) ≠b(nlgm) remove-wins set [22] ≠b(mlgm) — ≠b(m) last-writer-wins set [22, 56] ≠b(mlgm) — ≠b(nlgm) multi-value register [77, 93] £b(n2lgm) new:£b(nlgm) ≠b(nlgm) last-writer-wins register [56, 92] £b(lgm) — ≠b(lgm)
Table 1.1: Summary of metadata overhead results for different data types. Underlined implemen-tations are optimal;nandmare the numbers of, respectively, replicas and updates;≠b,Ob,£b are lower, upper and tight bounds on metadata, respectively.
size of metadata is challenging, as it creates a tension between correctness w.r.t. type semantics, efficiency, and fault-tolerance.
We formulate and study the metadata complexity problem for state-based RDTs. We define a metadata overhead metric, and we perform a worst-case analysis in order to express asymptotic metadata complexity as a function of the number of replicas or of updates. An analysis of six existing data type implementations w.r.t. this metric indicates that most incur substantial metadata overhead, higher than required, linear in the number of updates, or square in the number of replicas. We also observe that the concurrent semantics of a data type, i.e., its behavior under concurrent updates, has a critical impact on the exact extent of the metadata complexity. The two main contributions of this part of the thesis arepositive results, i.e., optimized data type implementations, andimpossibility results, i.e., lower bound proofs on the relative metadata complexity for some data type. It is natural to search for positive results first. However, it is a laborious process that requires developing nontrivial solutions, potentially incorrect ones. A lower bound sets the limits of possible optimizations and can show that an implementation is asymptotically optimal. Our lower bound proofs use a common structure that we apply to each data type using semantics-specific argument. This marks the end of the design process, saves designers from seeking the unachievable, and may lead them towards new design assumptions. Together, both kinds of results offer a comprehensive view of the metadata optimization problem. Table 1.1 summarizes our positive and impossibility results for all studied data types. From left to right, the table lists prior implementations and their relative complexity, expressed as metadata overhead, as well as the complexity of our optimizations (if any), and our lower bounds on complexity of any implementation. We underline asymptotically optimal implementations.
We show that the existing state-based implementation of counter data type is optimal. Counter requires vector of integers to handle concurrent increments and duplicated or reordered messages.
Set data type interface offers a range of different semantics choices w.r.t. how to treat concurrent operations on the same element to converge. For instance, a priority can be given to
some type of operation (addorremove) or operations can be arbitrated by timestamps (last-writer-wins [56]). These alternatives are uneven in terms of the metadata cost. Prior implementations all incur overhead in the order of the number of updates, as they keep track of the logical time of remove operations. However, this is not necessary for all set variants. Our optimized implementation of add-wins set reduces this cost to the optimum, i.e., the order of the number of replicas, by adapting a variant of version vectors to efficiently summarize observed information [77]. The lower bound on remove-wins set shows that such an optimization cannot be applied on this semantics. It is an open problem whether it is possible for last-writer-wins set.
Register type also comes in different variants. Similarly, we show they are uneven in metadata complexity. A last-writer-wins register uses timestamps to arbitrate concurrent assignments, whereas multi-value register identifies values of all conflicting writes and presents them to the application. The existing implementation of last-writer-wins register has negligible overhead and is the optimal one. On the contrary, we find that the existing implementation of multi-value register has substantial overhead, square in the number of replicas, due to inefficient treatment of concurrent writes of the same value. Our optimization alleviates the square component, and reaches the asymptotically optimal overhead, using nontrivial merge rules for version vectors. 1.1.2 Causally-Consistent Object Database for Client-Side Applications
In the second part of the thesis, we study the problem of providing object database with extended, causal consistency guarantees across RDT objects, at the client-side.
Client-side applications, such as in-browser and mobile apps, are poorly supported by the current technology for sharing mutable data over the wide-area. App developers resort to im-plementing their own ad-hoc application-level cache and buffers, in order to avoid slow, costly and sometimes unavailable round-trips to a data center, but they cannot solve system issues such as fault tolerance, or consistency/session guarantees [34, 96]. Recent client-side systems ensure only some of the desired properties, i.e., either make only limited consistency guarantees (at the granularity of a single object or of a small database only), do not tolerate failures, and/or do not scale to large numbers of client devices [18, 30, 37, 40, 69]. Standard algorithms for geo-replication [8, 12, 46, 66, 67] are not easily adaptable, because they were not designed to support high numbers of client replicas located outside of the server-side infrastructure.
Our thesis is that the system should be ensuring correct and scalable database access to client-side applications, addressing the (somewhat conflicting) requirements of consistency, availability, and convergence [68], at least as well as server-side systems. Under these requirements, the strongest consistency model istransactional causal consistency with RDT objects[66, 68].
Supporting thousands or millions of client-side replicas, under causal consistency with RDTs, challenges standard assumptions. To track causality precisely, per client, would create unac-ceptably fat metadata; but the more compact server-side metadata management approach has fault-tolerance issues. Additionally, full replication at high numbers of resource-poor devices
1.1. CONTRIBUTIONS
would be unacceptable [18]; but partial replication of data and metadata could cause anomalous message delivery or unavailability. Furthermore, it is not possible to assume, like many previous systems [8, 46, 66, 67], that the application is located inside the data center (DC), or has a sticky session to a single DC, to solve fault tolerance or consistency problems [13, 96].
In the second part of the thesis, we address these challenges. We present the algorithms, design, and evaluation of SwiftCloud, the first distributed object database designed for a high number of replicas. It efficiently ensures consistent, available, and convergent access to client nodes, tolerating failures. To achieve this, SwiftCloud uses a flexible client-server topology, and decouples reads from writes. The clientwrites fastinto the local cache, andreads in the past(also fast) data that is consistent, but occasionally stale. Our approach includes two major techniques: 1. Cloud-backed support for partial replicas. To simplify consistent partial replication at the client side and at the scale of client-side devices, we leverage the DC-side full replicas to provide a consistent view of the database to the client. The client merges this view with his own updates to achieve causal consistency. In some failure situations, a client may connect to a DC that happens to be inconsistent with its previous DC. Because the client does not have a full replica, it cannot fix the issue on its own. We leverage “reading in the past” to avoid this situation in the common case, and provide control over the inherent trade-off between staleness and unavailability: namely, a client observes a remote update only if it is stored in some numberK∏1 of DCs [69]. The higher the value ofK is, the more likely that an update is in both DCs, but the higher is the staleness.
2. Protocols with decoupled, bounded metadata.Our design funnels all communication through DCs. Thanks to this, and to “reading in the past,” SwiftCloud can use metadata that decouples two aspects [61]: ittracks causalityto enforce consistency, using small vectors assigned in the background by DCs, anduniquely identifiesupdates to protect from duplicates, using client-assigned scalar timestamps. This ensures that the metadata remains small and bounded. Furthermore, a DC can prune its log independently of clients, replacing it with a summary of delivered updates.
We implement SwiftCloud and demonstrate experimentally that our design reaches its objective, at a modest staleness cost. We evaluate SwiftCloud in Amazon EC2, against a port of WaltSocial [95] and against YCSB [42]. When data is cached, response time is two orders of magnitude lower than for server-based protocols with similar availability guarantees. With three DCs (servers), the system can accommodate thousands of client replicas. Metadata size does not depend on the number of clients, the number of failures, or the size of the database, and increases only slightly with the number of DCs: on average, 15 bytes of metadata per update, with 3 DCs, compared to kilobytes for previous algorithms with similar safety guarantees. Throughput is comparable to server-side replication for low locality workloads, and improved for high locality
ones. When a DC fails, its clients switch to a new DC in under 1000 ms, and remain consistent. Under evaluated configurations, 2-stability causes fewer than 1% stale reads.
1.2 Organization
The thesis is divided into three parts. The first part contains this introduction and Chapter 2, which introduces the common background of our work: RDT model with examples.
The second part focuses on the problem of RDT metadata complexity. In Chapter 3 we formulate the problem, we evaluate existing implementations w.r.t. a common metadata metric, we explore opportunities for improvement, and we propose improved implementations of add-wins set and multi-value register RDTs. In Chapter 4, we formally demonstrate a lower bound for metadata overhead of six data types, thereby proving optimality of four implementations. We discuss the scope of our results and compare them to related work in Chapter 5.
The third part of the thesis presents the design of SwiftCloud object database. We overview and formulate the client-side replication problem in Chapter 6. In Chapter 7, we present the SwiftCloud approach, and demonstrate an implementation of this approach using small and safe metadata. Chapter 8 presents our experimental evaluation. We discuss related work in Chapter 9, where we categorize existing approaches and compare them to SwiftCloud.
Chapter 10 concludes the thesis.
1.3 Authorship and Published Results
Part of the presented material appeared in earlier publications with our co-authors.
Although we were involved in the formulation of the RDT model [34, 91–93], presented in Chapter 2, it is not our main contribution, and it is not the focus of this thesis.
The semantics and optimizations of set RDTs were co-authored with Annette Bieniusa, Nuno Preguiça, Marc Shapiro, Carlos Baquero, Sérgio Duarte, and Valter Balegas, and published as a brief announcement at DISC’12 [23] and a technical report [21]. Some of the lower bounds and large part of the theory behind it are a result of an internship with Sebastian Burckhardt at Microsoft Research in Redmond. Both the optimized register RDT and four lower bounds were published at POPL’14 [34], co-authored with Sebastian Burckhardt, Alexey Gotsman and Hongseok Yang.
SwiftCloud was designed with Nuno Preguiça, Annette Bieniusa, Sérgio Duarte, Valter Balegas, and Marc Shapiro. The first four contributed to its implementation, which drew from an earlier prototype developed by Valter Balegas. This work is under submission, and partially covered by an earlier technical report [104]. Some of the ideas related to the staleness vs. metadata trade-off exploited in SwiftCloud were seeded in a short paper at EuroSys’12 HotCDP workshop [86], co-authored with Masoud Saeida Ardekani, Pierre Sutra and Marc Shapiro.
Chapter 2
Replicated Data Types
Any problem in computer science can be solved with another level of indirection. David Wheeler Contents
2.1 Motivation . . . 10 2.2 System Model and Object Implementation . . . 14 2.2.1 Object Implementation Model . . . 15 2.2.2 Replication . . . 16 2.2.3 Examples . . . 17 2.2.3.1 Counter Implementations . . . 18 2.2.3.2 Register Implementations . . . 21 2.3 Specification of Intended Behavior . . . 24 2.3.1 Specification Model . . . 24 2.3.2 Examples . . . 28 2.3.2.1 Counter Specification . . . 28 2.3.2.2 Register Specifications . . . 28 2.3.2.3 Set Specifications . . . 29 2.4 Execution Model and Correctness . . . 31 2.4.1 Execution Model . . . 31 2.4.2 Implementation Correctness . . . 33 2.4.3 Implementation Categories . . . 34 2.4.3.1 Network Specifications . . . 34 2.4.3.2 Visibility Witnesses . . . 35 2.4.3.3 Main Categories . . . 35
This chapter presents the background on the concept ofReplicated Data Type(RDT), pre-viously formalized in similar ways by Baquero and Moura [16], Roh et al. [85], and Shapiro et al. [92].1 An RDT is a type of replicated object accessed via highly available methods, i.e.,
methods that provide immediate response, with well-definedconcurrent semantics[34, 92]. RDT objects encapsulate the complexity of convergent replication over distributed, unreliable infras-tructure behind a relatively simple interface. RDTs include basic types such as counters, sets, and registers.
In Section 2.1, we recall the motivation for RDTs. We outline what problems they address, and what is specific to the RDT approach. In particular, we highlight what makes RDTs a popular abstraction for managing highly available replicated data [3, 25, 95].
In Section 2.2, we introduce a formal model of adata typeand itsimplementation. We illustrate the model with examples from differentcategoriesof implementations, in order to demonstrate the spectrum of RDT implementations, semantics, and some of the design challenges.
Not all RDT implementations provide useful behavior. We examine and specify what can characterize the intended behavior in Section 2.3: from a primary convergence requirement, to a precisespecificationof concurrent semantics of a data type. In Section 2.4, we define an execution model that describes object implementation and allows us to formally relate an implementation with a specification by a correctness condition. The correctness definition is parameterized, which leads us to formal categorization of implementations.
The material in this chapter is a prerequisite to Part II of the thesis, and is recommended reading before Part III. We highlight the relevant problems as they appear, in particular the challenges related to metadata.
The formalism and presentation used throughout this chapter follows closely a more general theory of Burckhardt et al. [33, 34], with some simplifications in both notation and model.
2.1 Motivation
High availability and responsiveness are important requirements for many interactive web applications. These applications often share mutable data between users in geographically distributed locations, i.e., data can be accessed and modified by users in different locations. To address the requirements, applications often rely ongeo-replicated databasesthat replicate data across data centers spread across the world, or even onclient-side data replicasat user devices [37, 43, 66, 95]. Replication systems provide users with fast access to local replica. This local access is unaffected by high network latency between replicas or by failures [43, 44, 66, 92, 95].2 1We use the term Replicated Data Types after a common model of Burckhardt et al. [34]. The same or specialized
concept appears in the literature under a variety of names: Conflict-free/Convergent/Commutative Replicated Data Types (CRDT) [92, 93], Replicated Abstract Data Types (RADT) [85], or Cloud Types [32]. We highlight the differences between these models where relevant.
2Inter-continental network round-trip times reach hundreds of milliseconds, which exceeds the threshold of
2.1. MOTIVATION
Failures are prevalent in large-scale systems, and range from individual server failures, variety of data center failures, to network failures, including partitions [11, 57]. When client-side replication is involved, extended replica disconnections are granted [79].
When high performance and availability of update operations are important to the application, they need to be performed at a local replica, without coordination with remote replicas [15]. This entails that updates must be acceptedconcurrentlyat different replicas and that they must be replicated asynchronously, which causes replicas to diverge and to execute updates in different orders. Unfortunately, this is inherently incompatible with strong consistency models, such as lin-earizability [53] or serializability [20], which rely on a global total order of updates (synchronous replication). This incompatibility, known as the CAP theorem [50], forces asynchronous replica-tion to offer weaker consistency models only; namely, variants ofeventual consistency[44, 79]. Under eventual consistency, replicas are allowed to transiently diverge under the expectation of eventual convergence towards a common value that incorporates all updates.
Let us examine more closely the differences between strongly and eventually consistent data replication. We discuss some challenges of implementing and using the latter that motivate our interest in the RDT abstraction.
Any replication protocol needs to addressconflicts, which we discuss next. Consider some replicated data item. For example, assume that the item represents a set of elements, such as the members of a group in a social network application, with the initial value {a,b}. In a replicated system, two clients connecting to different replicas might concurrently overwrite it with values {a} and {a,b,c}. Concurrent writes to the same item areconflicting. There are four main approaches to address conflicts. One is to accept only some updates to prevent conflicts (strong consistency). The other threeoptimistically accept all concurrent updates and resolve conflicts eventually (eventual consistency). We review these approaches and illustrate them with the help of our {a,b} example in Figure 2.1.
A. Conflict avoidance. Before accepting an update, the database executes a synchronous protocol to agree on a common order of updates execution on all replicas. Such a protocol consistently rejects all but one of the conflicting updates. This forces other writers to abort and retry rejected operations [43]. As this solution is strongly consistent, it is easy to program against, because replication is transparent to the application. However, it is not highly available [50], so it is out of the scope of this work.
B. Conflict arbitration using timestamps.The database accepts all updates and replicates them, but of two concurrent updates, one dominates the other, the “last” one according to timestamps of write operations. This heuristic is known aslast-writer-wins(LWW) [56]. Arbi-trary updates may belost, i.e., overwritten without being observed, even if they were reported as accepted to the client [58]. In our example in Figure 2.1B, assuming that timestamp of {a} is larger than that of {a,b,c}, the outcome ofwrite({a,b,c}) is lost. This is not a satisfying solution.
read:{a,b} write({a,b,c}) read:{a,b} write({a}) read:{a} read:{a} {a,b} {a,b} client A replica r1 replica r2 client B
sync. replication protocol: consensus or equivalent
ACCEPTED
REJECTED {a} {a}
(A) Conflict avoidance approach with synchronous protocol offering strong consistency.
read:{a,b} write({a,b,c}) read:{a,b} write({a})
read:{a} read:{a} {a,b} {a,b} client A replica r1 replica r2 client B
async. replication protocol: arbitration / LWW ACCEPTED ACCEPTED {a} {a} read:{a} read:{a,b,c}
(B) Conflict arbitration approach with asynchronous protocol causing lost updates.
read:{a,b} write({a,b,c}) read:{a,b} write({a})
read:{a}||{a,b,c} read:{a}||{a,b,c} {a,b} {a,b} client A replica r1 replica r2 client B
async. replication protocol: conflict detection + app-level resolution
ACCEPTED ACCEPTED {a} || {a,b,c} {a} || {a,b,c} read:{a} read:{a,b,c}
(C) Conflict detection approach with asynchronous protocol relying on application-level conflict resolution.
read:{a,b} add(c)
read:{a,b} remove(b)
read:{a,c} read:{a,c} {a,b} {a,b} client A replica r1 replica r2 client B
async. replication protocol: automatic DB-level conflict resolution
ACCEPTED ACCEPTED {a,c} {a,c} read:{a} read:{a,b,c}
(D) Automatic conflict resolution approach with asynchronous protocol embedding conflict resolution. Figure 2.1: Comparison of different approaches to conflicts on an example execution with two clients issuing conflicting updates on two different replicas,r1andr2. Boxes indicate operations
issued by clients; arrows indicate operation invocation and response.
C. Conflict detection and application-level resolution.To avoid lost updates, the database detects concurrent writes after they were accepted, and presents the set of concurrently-written values to the application. It is up to the application to resolve the conflict [44, 59, 80, 97]. Figure 2.1C illustrates this approach. Unfortunately, resolving conflicts is difficult ad-hoc at the application layer, and may not converge. For example, it is not obvious how to combine
2.1. MOTIVATION
values {a}and {a,b,c}from our example without violating intentions of the clients that wrote them. It may require the application to add more metadata to the values it stores (e.g., to record the fact that b wasremoved and c wasadded), which may be costly and complex, especially in the presence of replica failures [44]. As a last resort, the application may present the conflict to the user and let the user decide or confirm its choice. In either case, however, the conflict resolution may need to be run concurrently with other updates and across different application processes or users, which may generate even more conflicts, and prevent or delay database convergence [38].3 For example, one user may resolve the conflict to {a,c}, while
another resolves it to {a,b,c}, creating a new conflict. This approach can quickly become
unstableandcomplexat the application layer.
D. Automatic database-level conflict resolution.As an alternative, thedatabaseitself may automatically resolve conflicting updates. This is different from the previous approach, be-cause it assumes the database has enough knowledge to embed convergent conflict resolution in the replication protocol. For example, as illustrated in Figure 2.1D, if the database is aware that the read of {a,b}followed by the write of {a}is meant toremoveelementbfrom aset, and the other pair is meant to addc, it can integrate both updates. This can be also more efficient than the application-level solution, since the metadata can be managed by the database, i.e., lower in the abstraction stack. The challenge is then to design a database protocol that will tolerate concurrency and failures, still converging to a sensible value that integrates all updates. For some updates the right integration is simple (e.g.,addandremoveof different elements), for others it is harder (e.g., operations on the same element).
AReplicated Data Type(RDT) abstraction can express all three optimistic conflict resolution approaches (B)–(D), and in particular, addresses the challenges of automated database-level conflict resolution [34, 92]. The insight is to usetyping to indicate semantics of a data item (naming the role of object) and to mutate it viahigh-level object interface(naming the role of operations), rather than low-level read-write operations. The database implementation, and in particular a data type implementation class, can leverage the data type semantics to automate and to optimize conflict resolution. The data type semantics can also serve as a contract describing the interface to an application developer, thus separating concerns and responsibilities.
With RDTs, an application organizes its shared state as a collection of typed objects. For instance counter objects haveincrement,decrement andreadmethods, whereas set objects haveadd(a),remove(a) and readmethods. It is an object implementation duty to handle the burden of replication, and to resolve concurrent updates in a sensible manner, thereby releasing application programmers from this duty. Moreover, some categories of RDT implementations also address fault-tolerance of convergent replication, and tolerate messages reordering, message loss, disconnected replicas etc.
3An alternative would be to coordinate conflict resolution. However, this results in an unavailable protocol, similar
replicated database object impl. instance network layer m = send () d el iv er ( m ) replica r1 object impl. instance v = do(…, t) m = send () d el iv er ( m ) replica rn client Z (session) method calls
…
…
v = do(…, t) timestamp source timestamp source client A (session) method callsFigure 2.2: Components of the system model.
An RDT operates at the limited scope of an object. It does not support conflict-resolution involving complete database [24, 79], and does not address the other problems of eventual consistency at a larger scale, across different objects. Specifically, it does not address the somewhat orthogonal problem of update ordering, nor of atomicity guarantees. Examples of such desirable mechanisms might include cross-object causal consistency and its transactional variants [4, 18, 66]. We do not consider such ordering and atomicity guarantees for simplicity here, although the model we adopt can be extended to express them [34]. We return to intra-object ordering in Section 2.4.3, and to inter-object ordering in Part III.
2.2 System Model and Object Implementation
In this section, we formalize an implementation model of Replicated Data Type. We use the model of Burckhardt et al. [34] that can express existing implementations [16, 32, 85, 92].
Figure 2.2 gives an informal overview of the system components involved in our model. We consider areplicated databasethat hosts one or more namedobjectson a set of replicas. We assume that the database implements each objectindependently. In other words, a database consists of a single object, without loss of generality.4Thetype signatureof an object defines
its interface. Anobject implementationdefines its behavior, including its replication protocol. An application has multiple processes that runclient sessions. Each session issues a se-quence of operations to a replica by callingmethods; for simplicity, we assume that every session matches exactly one logical replica. A database replica delegates method calls to the corresponding object implementation. It responds by performing a local computation step. Local processing ensures thatalloperations are highly available and responsive. Replicas
communi-4This assumption matches many models from the literature [16, 85, 92, 102], and recent industrial
implementa-tions of RDTs [3, 25]. It also allows us to express, in Part III, a system with independent object implementaimplementa-tions yet offering cross-object consistency.
2.2. SYSTEM MODEL AND OBJECT IMPLEMENTATION
cate in the background via a message passing network layer to exchange updates. An object implementation includes functions that produce and apply messages.
2.2.1 Object Implementation Model
Thetypeøof an object comprises atype signaturetuple (Opø,Valø), which defines the client interface as a set of methods Opø (ranged over by o) and a set of return values Valø (ranged over byv). For simplicity, all methods share the same return value domain, with a special value ? 2Valøfor methods that do not return any value. Without loss of generality, we assume that every type has a single side-effect freequery methodread2Opø, which returns the value of an object. The remaining methodsOpø\ {read}are calledupdate methodsand are assumed to have side effects.
For example, a counter type, notedCtr, has a signature (OpCtr,ValCtr). MethodsOpCtr= {inc,read}, respectively, increment and read the counter value.readreturns a natural number,
ValCtr=N[{?}. Another example is an integer register type, noted LWWReg, with methods
OpLWWReg={write(a)|a2Z}[{read}. The latter returns integer values, ValCtr=Z[{?}. Note that we model arguments of a method as part of a method name for simplicity.
An object is instantiated from an implementation class of some type, to be defined shortly. An object is hosted on a set of databasereplicasReplicaID(ranged over byr). We assume that the database can provide the object implementation withunique timestampsfrom a domain Timestamp(ranged overt), totally ordered by<. Scalar clocks, such as Lamport’s logical clock [64], or real-time clock concatenated with a unique replica ID, can serve this purpose. For simplicity, we often assume integer timestamps.
Definition 2.1(Replicated Data Type Implementation). Animplementation of replicated data typeøis a tupleDø=(ß,M,initialize,do,send,deliver), where:
• Setß(ranged over byæ) is the domain of replica states.
• SetM(ranged over bym) is the domain of messages exchanged between replicas. • Functioninitialize:ReplicaID!ßdefines the initial state at every replica.
• Functiondo:ߣOpø£Timestamp!ߣValødefines methods.
• Functionssend:ß!ߣM anddeliver:ߣM!ßdefine update replication primitives. We callDøan implementation class of typeø. We refer to the components of a tupleDøusing dot notation, e.g.,Dø.ßfor its states.
We now discuss the role of implementation components and the intuition behind an execution model, to be formalized later. The database instantiates the object at every replicar2ReplicaID byinitialize(r). From that point on, it manages the object at each replica using the three imple-mentation functions explained next.
The client can perform a method o2Opøon a selected replica atanytime. Method invocation causes the replica to performdo(æ,o,t) on the current object state æ2ß, with timestamp t2
Timestamp. The timestamp can be used by the implementation to identify an update or to arbitrate updates. The outcome is a tuple (æ0,v)=do(æ,o,t), whereæ02ßbecomes the new object state at the replica invoking the operation, and the operation return valuev2Valøis presented to the client. Note that for queriesæ0=æ. Since method invocation is a local step, the client is never blocked by an external network or replica failure.
Replication functions are applied by a database replica non-deterministically, when the network receives a message or permits a transmission, and the replica decides to perform a replication step. To send a message, a replica appliessend(æ) with the local state of the objectæ. This produces a tuple (æ0,m)=send(æ), whereæ0is the new state of the object (send may have a side-effect), andmis a message that the replica sends, orbroadcasts, to all other replicas.5When a replica receives a messagem, it can applydeliver(æ,m) on the object in stateæ, which produces the new stateæ0that incorporates messagem.
2.2.2 Replication
Data type implementations can be categorized by how they replicate updates, and what re-quirements they impose on the network layer. In our prior work, Shapiro et al. [92] propose two mainimplementation categories:operation-based (op-based)andstate-based. In op-based implementation, each message carries information about thelatestupdates that have been performedat the senderreplica. Such message usually must not be lost, and sometimes requires further ordering or no-duplication guarantees. In contrast, in state-based implementation, each message carries information aboutall updates that are known to the sender; such messages can be duplicated, lost or reordered. We will formalize these differences after we familiarize the reader with their intuition. Both categories have the same expressive power, in the sense that every state-based object can be also expressed as an op-based object and vice versa [92].
One of the main challenges are the two expectations behind an object implementation: that the replicasconvergetowards a state that integrates all updates, and that both final and intermediate results provide meaningfulsemantics. This is complicated, because the convergence process driven by asynchronous replication is inherently concurrent itself, and with continuous stream of new updates. Moreover, the network layer may drop, reorder or duplicate messages. To achieve convergence and reasonable semantics, an implementation must often encode additional
metadatain its state when it performs updates (withdofunction), so that the replication protocol (sendanddeliverfunctions) have enough information to interpret (reconcile) concurrent operations, or to handle network failures.
We formulated sufficient convergence conditions specific to each of the two main implementa-tion categories in a prior work [92]. For an op-based implementaimplementa-tion to be convergent, it suffices that all concurrently generated messages commute, i.e., they are insensitive to the order of 5A broadcast message produced bysendmay reach more than one replica, but the implementation can emulate a
unicast primitive on top of it, by encoding information about the intended recipient inside a message (at the expense of unnecessary message complexity). None of our results is affected by this simplification.
2.2. SYSTEM MODEL AND OBJECT IMPLEMENTATION
Algorithm 2.1Template for data type implementation.
1: ß=≠definition of state domainÆ M=≠definition of message domainÆ
2: initialize(ri) :æ .method that returns stateæ2ß
3: letæ=≠definition of the initial state for replica rÆ
4: do(read, to) :v .timestamptocan be optionally used by a method
5: leta=≠definition of return valueÆ
6: do(update(arg), to) .update have no return value (v=?)
7: æ√≠definition of state mutation by methodupdate(arg)Æ 8: send() :m
9: letm=≠definition of generated message m2MÆ
10: æ√≠definition of mutation to the local stateÆ
11: deliver(m)
12: æ√≠definition of the state integrating message mÆ
delivery. For a state-based implementation to be convergent, it suffices that the domain of states ßform a (partially ordered)join semilattice, such that thedeliverfunction is a join operator for that lattice, and update methods can only cause the state to advance in the partial order of the lattice. This typically requires that the object state represents some summary of its history, which allows any two states to be integrated. We will illustrate these conditions alongside the following example implementations, and return formally to the problem when we define a category-agnostic convergence and semantics specification.
2.2.3 Examples
We illustrate the model with simple example implementations from both op-based and state-based categories. They help us to highlight some of the design alternatives and challenges, as well as primitive mechanisms used by implementations, such as clocks. More complex examples will appear later. The selected examples are a small part of our catalog of implementations [93].
Template. Algorithm 2.1 serves as a template for presentation of all data type implementations. The template presents, in pseudo-code, Definition 2.1 in a more readable manner. Text≠inside bracketsÆindicates parts that vary between data type implementations. We organize the code in blocks by indentation. Types of arguments and return values are omitted, but can be easily inferred from the type signature, or the implementation class. We assume all implementation functions have access to the current state. Function can perform in-place mutations that define the state after a function is applied. Components of the state are named in the return value of
Algorithm 2.2Op-based implementation of counter (Ctr). 1: ß=N0£N0 M=N0
2: initialize(ri) : (a,b)
3: let(a, b) = (0, 0) .a: current value;b: buffer of increments
4: do(read, to) :v
5: letv=a
6: do(inc, to)
7: a√a+1 .record increment in the current value
8: b√b+1 .record increment in the buffer
9: send() :d
10: letd=b .flush the buffer
11: b√0 12: deliver(d)
13: a√a+d .add to the local value
r1 inc (1, 1) (0, 0) send:1 (1, 0) r2 (0, 0) r3 (0, 0) inc (1, 1) deliver (1, 0) deliver (2, 1) deliver (2, 0) send:1 (2, 0) deliver (2, 0) read:1 read:2
Figure 2.3: Time-space diagram of an execution of the op-based counter (Algorithm 2.2) with three replicas. Every event is labelled with the execution step, the return value of the function performing that step (above line; except for unlabelledinitializeevents), and replica state after the step if it changed (below line). We indicate messages exchanged between replicas with arrows, thus we omit repeating their content indeliver. Timestamps indoare specified only if used. 2.2.3.1 Counter Implementations
Op-based. Algorithm 2.2 presents an op-based implementation of a counter type (Ctr). The state of a replica contains the current counter value, noteda, and an integer buffer that records the number of local increments since the last message was sent, noted b. Each increment operation,inc, simply increments both the current value and the buffer. When a replica sends a message, the implementation flushes the buffer, i.e., the local buffer becomes the content of the message and the buffer is zeroed. A replica that delivers a message increments its local counter by the value provided in the message without modifying its local buffer.
Figure 2.3 illustrates both an example execution of the op-based counter, and our presentation format of executions. After all replicas are initialized with theinitializefunction, clients at replica
r1 andr3perform oneinceach, which is implemented by the database replica applying thedo
function. The read operation that follows at replicar1 yields value 1, which includes the outcome
of the earlier local increment. The message with update fromr1 is generated usingsend, and
2.2. SYSTEM MODEL AND OBJECT IMPLEMENTATION
Algorithm 2.3State-based implementation of counter (Ctr). 1: ß=ReplicaID£(ReplicaID7!N0) M=ReplicaID7!N0 2: initialize(ri) : (r,vv)
3: letr=ri .replica ID
4: letvv=∏s.0 .vector/map of number of increments per replica; initially zeros
5: do(read, to) :v
6: letv=Pvv(s) .sum up all increments
7: do(inc, to)
8: vv√vv[r7!vv(r)+1] .increment own entry
9: send() :vvm
10: letvvm=vv
11: deliver(vvm)
12: vv√∏s.max{vv(s),vvm(s)} .compute entry-wise maximum of vectors
r1 inc (r1, [1 0 0]) (r1, [0 0 0]) r2 (r2, [0 0 0]) r3 (r3, [0 0 0]) inc (r3, [0 0 1]) read:1 read:2 send:[1 0 0] deliver (r2, [1 0 0]) deliver (r3, [1 0 1]) deliver (r1, [1 0 1]) deliver (r2, [1 0 1]) send:[1 0 0] deliver (r3, [1 0 1]) send:[0 0 1] read:2
Figure 2.4: An example execution of the state-based counter (Algorithm 2.3) with three replicas. For brevity, we note vector [r17!a,r27!b,r37!c] as [a b c].
adds its content to the counter valuea, but does not touch his bufferb. Eventually,r3 sends a
message, flushing his bufferb, which reaches all remaining replicas. Replicas converge towards valuea=2. At the end, a read at replicar1 yields return value 2.
Intuitively, the op-based counter converge if every replica eventually applies thesendfunction on every non-zero buffer and the network layer delivers every message to every replicaexactly once[34]; we will formalize these conditions later. The state converges, because the addition operation employed indeliveriscommutative, i.e., different increments can be delivered at replicas in different order and produce the same effect. Furthermore, the value that the counter replicas converge to is the total number of incrementsincissued at all replicas [34].
State-based. Let us now consider a state-based implementation of counter, illustrated in Algorithm 2.3. Recall that in state-based implementation a message can be lost, reordered or duplicated, and should contain the complete set of operations known by the sender. A scalar integer isinsufficientin this case. To see why, consider two replicas that concurrently increment their counter; if the current value of a counter was a scalar, the receiver’s replica would have no means of distinguishing whether the received increment is already (partially) known locally or if it is a fresh one. We will demonstrate this impossibility formally, in Part II of the thesis.
[0 0 0] [1 0 0] [0 1 0] [0 0 1] [2 0 0] [0 2 0] [0 0 2] [1 1 0] [1 1 1] [0 1 1] ... ... ... [1 0 1]
Figure 2.5: Hasse diagram of a fragment of the join-semilattice of states (vectors) of the state-based counter implementation. Solid arrows between states indicate a transitive reduction of< order; dotted blue paths indicate replicate state transitions during execution from Figure 2.4. We omit replica’s own ID in a state, which is irrelevant for the order, and show only vectors.
it knows about. These can be represented efficiently as avector, formally a map from replica identity to the number of increments, notedvv:ReplicaID!N0, such that each replica increments its own entry only. A value of a counter is simply the sum of all entries in the vectorvv. Theinc
operation increments the replica’s own entry in the vector (the identity of this entry is recorded asrduring the initialization). To replicate, the state-based counter sends its complete vector. The receiving replica computes entry-wise maximum with its own vector.
Figure 2.4 illustrates an execution of the state-based counter. Note that the first increment at replicar1 reaches replicar3 bothindirectlyfrom replicar2, and later, directly in a delayed
message from r1. Nevertheless, all replicas converge towards a vector [r17!1,r27!1,r37!0],
which corresponds to counter value 2.
Baquero and Moura [16], and our later work [92], show that the maximum operator on a vector used bydeliver, together with the way the vector is incremented, guarantee convergence of the counter state towards a maximum vector. Moreover, the maximum operator is resilient to message duplication, loss, and out-of-order delivery. This is thanks to the fact that the domain of states (vectors) is a semilattice under the partial order defined as:
vvvvv0 ()dom(vv)µdom(vv0)^ 8r2dom(vv0).vv(r)∑vv0(r). (2.1)
(We note by<a strict variant of this relation.)
Each inc update advances the vector in the v order, whereas the entry-wise maximum operator used bydeliveris in fact theleast upper boundof the semilattice:
(2.2) vvtvv0=∏r.max{vv(r),vv0(r)},
i.e., it generates the minimum vector that dominates both input vectors. We illustrate a fragment of the lattice on Figure 2.5. Throughout an execution of the implementation, replica states advance
2.2. SYSTEM MODEL AND OBJECT IMPLEMENTATION
Algorithm 2.4State-based implementation of last-writer-wins integer register (LWWReg). 1: ß=Z£Timestamp M=Z£Timestamp
2: initialize(ri) : (a,t)
3: leta=0 .register value
4: lett=t? .timestamp; initiallyt?=minTimestamp
5: do(read, to) :v
6: letv=a
7: do(write(ao), to)
8: ifto>tthen .sanity check for new timestamp
9: (a,t)√(ao,to) .overwrite
10: send() : (am,tm)
11: let(am,tm)=(a,t)
12: deliver((am,tm))
13: iftm>tthen
14: (a,t)√(am,tm).overwrite the existing value if the existing timestamp is dominated
in this lattice (indicated with dotted paths on the figure), adding each time more information about the operations performed. Replicas traverse the lattice in parallel, while merging states with least upper bound ensures that they converge.
Compared to the op-based implementation, the state of a replica has no designated outgoing buffer, but the whole state is transferred instead. Therefore, state-based implementation transi-tively deliversupdates, i.e., a replica may serve as an active relay point (asr2does forr3, in our
example). The implementation, however, is more complex.
It is possible to extend this solution to build a counter with an additional decrement method [93], by using a separate vector for decrements. Ensuring strong invariants over a counter, e.g., ensuring that the value remains positive, is more difficult and requires stronger consistency model in the general case.
2.2.3.2 Register Implementations
State-based LWW register. Another RDT example is an object that resolves concurrent updates byarbitration, using approach (B) from Section 2.1. Algorithm 2.4 presents a state-based register (LWWReg). Without loss of generality, we consider a register that stores integer values.
The register stores both a value, and a unique timestamp generated when the value was written; the latter is not visible to thereadoperation. Concurrent assignments to the register are resolved using thelast-writer-wins(LWW) policy applied at all replicas: the write with a higher timestampwinsand remains visible [56]. LWW guarantees that all replicas eventually select the same value, since the order of timestamp is interpreted in the same way everywhere.
The implementation may use any kind of totally ordered timestamps. However, ideally, every
writeshould be provided with a timestamp higher than that of the current value, so it takes effect. A popular implementation is a logical clock, or Lamport clock[64]. Lamport clock
Algorithm 2.5Basic state-based implementation of multi-value integer register (MVReg). 1: ß=ReplicaID£P(Z£(ReplicaID7!N0)) M=P(Z£(ReplicaID7!N0))
2: initialize(ri) : (r,A)
3: letr=ri .replica ID
4: letA=; .non-overwritten entries: set of pairs (a,vv) with values and version vectors
5: do(read, to) :V
6: letV={a|9vv: (a,vv)2A} .all stored values are the latest concurrent writes
7: do(write(a), to)
8: letvv=F{vv0|(_,vv0)2A} .compute entry-wise maximum of known vectors
9: letvv0=vv[r7!(vv(r)+1)] .increment own entry to dominate other vectors
10: A√{(a,vv0)} .replace the current value with the new entry
11: send() :Am
12: letAm=A
13: deliver(Am)
14: A√{(a,vv)2A[Am|6 9(a0,vv0)2A[Am:vv<vv0} .keep non-overwritten entries r1 write(13) (r1, {(13, [1 0 0])}) (r1, ∅) r2 (r2, ∅) r3 (r3, ∅) (rwrite(39) 3, {(39, [0 0 1])}) send:{(13, [1 0 0])}
read:13 read:{13,39} write(26)
(r2, {(26, [1 1 1])}) deliver (r1, {(26, [1 1 1])}) deliver (r2, {(13, [1 0 0])}) deliver (r2