INFO5011 Advanced Topics in IT: Cloud Computing Week 10: Consistency and Cloud Computing

(1)

Dr. Uwe Röhm School of Information Technologies

INFO5011 – Advanced Topics in IT:

Cloud Computing

Week 10: Consistency and Cloud Computing

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 10-2

Outline

!  Notions of Consistency ! CAP Theorem ! CALM Conjuncture ! Eventual Consistency ! Properties

!  Dynamic Data Placement

!  Data Consistency Properties and the Trade-offs in Commercial Cloud Storages

(2)

Revisit: The CAP Theorem

!  Theorem:

You can have at most two of these properties for any shared-data system.

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 10v-3

Consistency

Availability Partitioning Tolerance

[Brewer, PODC2000]

Notions of Consistency

!  Strong Consistency (aka 1-copy-Serializability)

! behavior as if there is only one copy of each data item, and ! only serializable accesses permitted

!  Weak Consistency

! No guarantee that a subsequent (read) access from this or any other

client will return a just updated value

! Might mean several versions of data (i.e. not 1-copy), might mean not serializable

! Updates will be propagated ‘eventually’ to all replicas

after some delay (‘inconsistency interval’) => Eventual Consistency

(3)

Eventual Consistency

!  A model originally proposed for disconnected operation (e.g., mobile computing)

!  Different nodes keep replicas and each update is “eventually” propagated to each replica

! And eventually, there is agreement on which update is the latest

! DNS is the most well-known system implementing eventual consistency

!  Usual definition is counterfactual:

“once updating ceases, and the system stabilizes, then after a long enough period, all replicas will have the same value”

10 - 5

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou)

Why Eventual Consistency?

!  Argument 1: Availability is King, Network Partitioning is Fact

! Cloud infrastructure, especially the lowest level (data storage) has a

crucial ‘always on’ requirement

! But networking can fail (temporarily) even within one rack or between different racks of same data center

! Algorithms for strong consistency would block till network is up again => seen as unacceptable

!  Argument 2: Latency Penalty and Costs too high

! In some applications, there’s replication between different data centers

needed. Having this synchronous (strong consistent) would impose huge performance penalty

! Typical latency within one data center: 2-3 ms

! Typical latency between continents: >100ms…

! Also: Replication needs bandwidth which translates into costs

(4)

Example from Yahoo!

[VLDB2011]

Figure 1: Globally replicated database that asyn-chronously propagates updates to remote datacen-ters.

east coast of the U.S., and in France. For clarity in our dis-cussion, we will focus on a single table containing records, but our techniques generalize directly to multiple tables or other data models. Each replica location stores a full or partial copy of the table.

Because of the high latency for communicating between datacenters, replication is typically done asynchronously. Usually, writes are persisted at one or more local servers and acknowledged to the applications (e.g., made 1-safe). Later, updates are sent to other replica locations. An example of this architecture is shown in Figure 1. As the figure shows, we can think of the system as having two distinct compo-nents: a database system, which manages reads and writes of data records, and a replication system, which manages repli-cation of updates between replica lorepli-cations. In real systems these components might be on the same server (as in MySQL replication [3]) or diﬀerent servers (as in PNUTS [9]). The replication system must ensure reliable delivery of updates to remote datacenters despite failures. Individual servers might fail (and even lose data), but local or remote copies can be used for recovery.

In each location, a given record exists either as a full replica or as a stub. A full replica is a normal copy of the record, possibly enhanced with metadata to support selec-tive replication, such as a list of other full replicas. A stub contains only the record’s primary key and metadata, but no data values. Note that we do not consider selective repli-cation at the field or column level in this paper.

3.1.1 Handling reads and writes

We assume that there is a master copy of each record where updates are applied before being propagated to repli-cas. In PNUTS, the master copy for diﬀerent records might be in diﬀerent datacenters: Alice’s master copy might be in France while Bob’s master copy might be in India. The re-sult is a per-record consistency model, where replicas might lag the master by one or more versions, but will always even-tually receive all updates (and apply them in the same or-der). No locking or commit protocol is needed since trans-actions are per-record; for more details see [9]. Our tech-niques extend to systems that do not have a master and allow updates to be applied anywhere (e.g. [12, 18], and as we discuss in Section A.7, a mode of PNUTS that supports eventual consistency).

When a record is inserted, the master copy decides where the full replicas of the record are to exist, and sends full replicas and stubs to the appropriate locations. When a record is updated, the master applies the update and then sends the updated data only to the locations that contain a full replica. This is where the resource savings of selec-tive replication come from, since bandwidth and disk I/Os

are only necessary for full replica locations. If a record is updated in a non-master region, the update has to be for-warded to the master, but this is because of the mastership scheme, not specifically selective replication. When a record is deleted, a message is sent to all replicas (full and stub) to notify them to delete the data.

A record may be read from any location. If the local database contains a full replica, it serves the request. Other-wise, the database reads the list of full replica locations from the stub and forwards the request to one of them (prefer-ably the one with lowest network delay). This is the main penalty for selective replication: some reads that would have been served locally if all data were replicated everywhere now need to be forwarded, with an attendant increase in re-sponse latency (and some cross-datacenter bandwidth cost). As we increase the number of full replicas, there are fewer forwarded reads but also more cost to propagate updates.

It may be necessary to change the set of full replicas if, for example, the access pattern changes. In this case we might promote some stubs to full replicas and demote some full replicas to stubs. In our mechanism, each location requests promotion or demotion for records based on local access pat-terns, but the master decides whether to grant the request. This allows the master to enforce constraints like a minimum number of copies (see Section 4.1). If the master decides to convert a replica, it notifies all regions of the new list of full replicas for this record to ensure that reads can be properly forwarded. Additionally, if promoting a stub, the record data must be sent to the location with the new full replica.

3.2 Optimization problem

Inter-datacenter bandwidth can be extremely expensive, especially for datacenters with limited backbone connectiv-ity. Therefore, we optimize system cost by minimizing band-width used. Other costs, such as server cost, are also impor-tant. However, minimizing bandwidth usage means avoid-ing sendavoid-ing traﬃc to some datacenters, which will also re-duce the number of servers needed in that datacenter. Thus, bandwidth is a useful proxy for total system cost.

Inter-datacenter bandwidth for replication consists of:

• Replication bandwidth: The bandwidth required to send updates between datacenters.

• Forwarding bandwidth: The bandwidth required to for-ward read requests to remote datacenters because the local replica contains a stub.

We want to minimize the sum of replication bandwidth and forwarding bandwidth.

Additionally, two types of constraints must be enforced. First, policy constraints specify where data must or cannot be replicated, often for legal reasons. Policy constraints may also specify a minimum number of full replicas to ensure data availability. Second, latency constraints might specify that the majority of users experience good response time. It is convenient to express this constraint by specifying the fraction of total global reads (e.g., 95%) that must be served by a local, full replica. Satisfying these constraints may mean making more full replicas, or making full replicas in diﬀerent locations, than would result from simply trying to minimize bandwidth cost.

Then, we can define our optimization problem as follows:

Definition 1. Constrained selective replication

problem - Given the following constraints:

1042

!  Scenario: a social networking application that uses this distributed database

!  distributed database, with replicas kept in-sync via an asynchronous replication mechanism

! inter data center communication needed

!  Users typically show some locality which should direct the replication mechanism to avoid costs and improve perf.

Sidenote: Partial Replication to save

Bandwidth while increasing Latency

!  Goal: replication algorithm with low bandwidth and low latency

!  Side-Constraints:

! Latency SLAs (at least local access must be read fast)

! Legal Constraints (not all data is allowed to be stored everywhere)

!  Approach:

! Asynchronous, primary-copy replication

!  Any write is guaranteed to be done at the master first ! Partial replication

!  full-copies vs. stubs

!  Read-everywhere, but if stub, it is actually a remote read (latency penalty) ! Policy Constraints that define

!  the allowable and mandatory locations for full replicas of each record, and

!  the minimum number of full replicas for each record

! Dynamic placement algorithm that takes local reads/writes into account

for promoting stubs to full copies and vice versa

(5)

Dynamic Placement

!  Dynamic demotion/promotion of stubs

! Reading a stub: stub promoted to full replica

! Update on replica: if after retention interval -> stub

Extra Properties for

Eventual Consistency

!  Programming an application is much harder if storage supports only eventual consistency

! What to do until everything settles down?? ! Handling inconsistency in the sequence of reads

! Cf Hellerstein et al PODS’10/CIDR’11: eventual consistency data model

supports monotonic programs (a very limited class)

!  A range of extra properties, which (if the storage provides this) can make programming not quite so hard

! e.g., read-your-own-writes, monotonic reads, …

(6)

Properties of Eventual Consistency

!  Causal Consistency

! If client A has communicated with client B that an item has been updated by A, B will see the updated value of A

!  Read-your-writes Consistency

! Special form of Causal Consistency for A=B

!  Session Consistency

! Practical version of Read-your-writes Consistency that guarantees

this property just within one ‘session’, but not between separate sessions

!  Monotonic Read Consistency

! A subsequent read will never return an older version of a data item

than a previous read

!  Monotonic Write Consistency

! Serializability just for writes

The CALM Conjuncture:

Consistency and Logical Monotonicity

!  Observation 1: monotonic ⇒ eventually consistent

!  Observation 2: coordination at every nonmonotonic operation ⇒ eventual consistent

!  Conjecture:

non-monotonic and uncoordinated ⇒ inconsistent

[Hellerstein, PODS 2010 Keynote]

(7)

State-of-the-Art with Current Offerings

for Cloud Storage

!  CIDR 2011 Paper:

Data Consistency Properties and the Trade-offs in Commercial Cloud Storages: the Consumers’

Perspective

!  The following slides are from this talk, available in the

original from

www.cidrdb.org/cidr2011/Talks/CIDR11_Wada.ppt

CIDR 2011: Consistency from the

Consumer’s Perspective

!  Paper investigates consistency models provided by commercial cloud storages

! If weak consistency, which extra properties supported?

! How often and in what circumstances is inconsistency (stale values)

observed?

! Any differences between what is observed and what is announced

from the vendor?

!  Investigation of the benefits for consumer of accepting weaker consistency models

! Are the benefits significant to justify consumers’ effort?

! When vendor offers choice of consistency model, how do they

compare in practice?

(8)

Observed Platforms

!  A variety of commercial cloud NoSQLs that are offered as storage service

! Amazon S3

! Two options: Regular and Reduced redundancy (durability)

! Amazon SimpleDB

! Two options: Consistent Reads and Eventual Consistent Reads ! Google App Engine datastore

! Two options: Strong and Eventual Consistent Reads ! Windows Azure Table and Blob

! No option available in data consistency

Frequency of Observing Stale Data

!  Experimental Setup

! A writer updates data once each 3 secs, for 5 mins

!  On GAE, it runs for 27 seconds

! A reader(s) reads it 100 times/sec

! Check if the data is stale by comparing value seen to the most

recent value written

! Plot against time since most recent write occurred

–  Execute the above once

every hour

•  On GAE, every 10 min

•  For at least 10 days

•  Done in Oct and Nov,

2010

(9)

SimpleDB: Read and Write from a

Single Thread

!  With eventual consistent read, 33% of chance to read freshest values within 500ms

! Perhaps one master and two

other replicas. Read takes value randomly from one of these?

•  First time for eventual

consistent read to reach 99% “fresh” is stable 500ms

•  Outlier cases of stale read

after 500ms, but no regular daily or weekly variation observed

Stale Data in Other Cloud Data Stores

Cloud NoSQL and

Accessing Source

What Observed SimpleDB (access from

one thread, two threads, two processes, two VMs or two regions)

Accessing source has no affect on the observable consistency. Eventual consistent reads have 33% chance to see stale value, till 500ms after write. S3 (with five access

configurations)

No stale data was observed in ~4M reads/config. Providing better consistency than SLA describes. GAE datastore (access

from a single app or two apps)

Eventual consistent reads from different apps have very low (3.3E-4 _{%) chance to observe values older}

than previous reads. Other reads never saw stale data. Azure Storages (with five

access configurations)

No stale data observed. Matches SLA described by the vendor (all reads are consistent).

(10)

Additional Properties:

Read-Your-Writes?

!  Read-your-writes: a read always sees a value at least as fresh as the latest write from the same thread/session

!  Our experiment: When reader and writer share 1 thread, all reads should be fresh

!  SimpleDB with eventual consistent read: does not have this property

!  GAE with eventual consistent read: may have this property

! No counterexample observed in ~3.7M reads over two weeks

Additional Properties:

Monotonic Reads?

!  Monotonic Reads: Each read sees a value at least as fresh as that seen in any previous read from the same thread/session

!  Our experiment: check for a fresh read followed by a stale one

!  SimpleDB: Monotonic read consistency is not supported

! In staleness, two successive

eventual consistent reads are almost independent

! The correlation between staleness

in two successive reads (up to 450ms after write) is 0.0218, which is very low

!  GAE with eventual consistent read: not supported

! 3.3E-4 % chance to read values older than previous reads 2nd_{Stale 2}nd_Fresh 1st_Stale _39.94% (~1.9M) 21.08% (~1.0M) 1st_{Fresh 23.36%} (~1.1M) 15.63% (~0.7M)

(11)

Additional Properties:

Monotonic Writes?

!  Monotonic Writes: Each write is completed in a replica after previous writes have been completed

!  Programming is “notoriously hard” if monotonic write consistency is missing

! W. Vogels. Eventually consistent. Commun. ACM, 52(1), 2009.

!  This is an implementation property, not directly visible to consumer. But we explore what happens when we do successive writes, and try to read the data

SimpleDB’s Eventual Consistent Read:

Monotonic Write

!  A data has value v0 before each run. Writing value v1 and then v2 there, then read it repeatedly

•  When v1 != v2, writing v2 “pushes” v1 to replicas immediately (previous value v0 is not observed)

•  Very different from the “only writing one value” case

•  When v1 = v2, second write does not push (v0 is observed) •  Same as the “only writing one value” case

v1 != v2 v1 = v2

(12)

SimpleDB’s Eventual Consistent Read:

Further exploration -- Inter-Element Consistency

!  Consistency between two values when writing and reading them through various combinations of APIs

Domain

Attribute Value

Item

•  Write a value

•  Write multiple values in an item

•  Write multiple values across items in a domain •  Read a value

•  Read multiple values in an item

•  Read multiple values across items in a domain SimpleDB’s Data Model SimpleDB’s API

10-23

Reading two values independently

Each read has 33% chance of freshness. Each read operation is independent

Writing two at once and reading two at once

Both are stale or both are fresh. Seems “batch write” and “batch read” access to one replica Writing two in the same

domain independently

The second write “pushes” the value of the first write (but only if two values are different)

INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou)

Trade-Off Analysis of SimpleDB: A Benefit for

Consumer from Weak Consistency?

•  No significant difference was observed in RTT, throughput, failure rate under various read-write ratios

•  If anything, it favors consistent read!

•  Financial cost is exactly same * Each client sends 1 rps. All obtained under 99:1 read-write ratio INFO5011 "Cloud Computing" - 2011 (U. Röhm and Y. Zhou) 10-24

(13)

What Consumers Can Observe

(as of the state of these experiments)

!  SimpleDB platform showed frequent inconsistency

!  It offers option for consistent read. No extra costs for the consumer were observed from our experiments

! At least under the scale of our experiments (few KB stored in a

domain and ~2,500 rps)

?? Maybe the consumer should always program SimpleDB with consistent reads?

What Consumers Can Observe (contd)

(as of the state of these experiments)

!  Some platforms gave (mostly) better consistency than they promise

! Consistency almost always (5-nines or better)

! Perhaps consistency violated only when network or node failure

occurs during execution of the operation

?? Maybe the chance of seeing stale data is so rare on these platforms that it need not be considered in programming?

! There are other, more frequent, sources of data corruption such as

data entry errors

! The manual processes that fix these may also be used for rare errors

from stale data

(14)

Implications of these Experiments for

Consumers?

!  Can a consumer rely on our findings in decision-making? NO!

! Vendors might change any aspect of implementation (including

presence of properties) without notice to consumers.

! e.g., Shift from replication in a data center to geographical

distribution

! Vendors might find a way to pass on to consumers the savings from

eventual consistent reads (compared to consistent ones)

!  The lesson is that clear SLAs are needed, that clarify the properties that consumers can expect

Summary

!  Strong Consistency

! is most desirable for consumers, but seen as not achievable with

network partitioning tolerance

! Cf. CAP Theorem

!  Eventual Consistency:

! Allows nodes to disagree on current data value

! But current algorithms/systems differ widely in how this is achieved

!  Current Cloud Storage Implementations

! Provide different variants of ‘eventual consistency’ without disclosing

the implementation details or clear SLAs

! Currently missing SLAs (observable, but no guarantees):

!  Rate of inconsistent reads, time to convergence, performance under variety of workloads, availability, costs

(15)

References

!  Werner Vogels: Eventually Consistent. Communications of the ACM,

Volume 52, No. 1, Jan 2009.

!  H. Wada, A. Fekete, L. Zhao, K.. Lee and A. Liu: Data Consistency Properties and the Trade-offs in Commercial Cloud Storages: the

Consumers’ Perspective. In CIDR 2011.

!  Sudarshan Kadambi, Jianjun Chen, Brian F. Cooper, David Lomax, Raghu Ramakrishnan, Adam Silberstein, Erwin Tam, Hector Garcia-Molinn: Where in the World is My Data? In: VLDB 2011.

!  Joseph M. Hellerstein: The Declarative Imperative – Experiences and