• No results found

Distributed Data, Replication and NoSQL

N/A
N/A
Protected

Academic year: 2020

Share "Distributed Data, Replication and NoSQL"

Copied!
11
0
0

Loading.... (view fulltext now)

Full text

(1)

Distributed Data,

Replication and

NoSQL

Administrivia

•  Final Exam

– Monday, May 11, 11:30-2:30 – RSF Fieldhouse

– Cumulative, stress end-of-semester – 3 cribsheets

•  Review Sessions

– At usual classtime, will be webcasted

– TAs and Prof H holding OH and sections next week

• TA Hours on Piazza

Outline

•  Basics

– Traditional DB replication – Replica consistency – Where do updates happen? •  Considerations from Massive Scale

– Performance requirements – NoSQL & Eventual Consistency •  Programming with Inconsistent data

– Approaches – Expected costs – CALM •  NoSQL today

Why Replicate Data?

Increase availability

Survive catastrophic failures

• Avoid correlation: different rack, different datacenter

Bypass transient failures

Reduce latency

Choose a “nearby” server

• Particularly for geo-distributed DBs

Ask many servers and take the 1

st

response

Load balancing

Other reasons

facilitate sharing, security, autonomy, system

management, etc.

A Definition: Replica Consistency

Replica Consistency

Typically linearizability of single writes

Illusion: all copies of data item X updated

atomically

Really: readers see values of X that they might

see in a single-node setting

Not to be confused with the C in ACID!

This has caused endless headaches

Not to be confused with serializablity

Actions only; not transactions. Single-object writes

You can layer serializability on top of linearizable

stores

Traditional DB Replication

Mechanisms

•  Small number of big databases (say one per continent) – Replicated for failure recovery, mostly

•  Data-shipping

– Interruptive at both ends

– Difficult to get transactional guarantees

• Even single-site transactions would require 2PC!

– Relatively easy interoperability across different systems •  Log-shipping

– Cheap/free at the source node – Modest bandwidth: diffs

– Single-node transactional replication comes built-in

• in a snapshot sense at least

– Tricky to do across different systems

(2)

Single- vs. Multi-Master Writes

•  Single-master

– Every data item has one appointed Master node – Writes are first performed at the Master – Replication propagates the writes to others – 2PL/2PC can be used for transactions

– Easy to understand, but defeats many benefits for writers

• Esp. reducing latency, also load balancing

•  Multi-master

– Writes can happen anywhere – Replication among all copies

– Allows even writers to enjoy lower latency, load balancing – Fast, but hard to make sense of things

• Same item can be updated in two places; which update “wins”?

• 2PL/2PC defeat the positive effects of multi-master

A Compromise: Quorums

•  Suppose you have N nodes replicating each item •  Send each write to W<N nodes

– sender timestamps the write •  Send each read to R<N nodes

– reader uses timestamps to choose result •  How to ensure readers see latest data:

– ensure W+R >= N (“strict” quorum) – E.g. W = R > N/2 (majority quorum) – E.g. W=N, R=1 (writer broadcasts) •  Advantages

– Not at the mercy of a single slow “master” – Relatively easy to recover from a failed node •  But...

– Transactional semantics still require 2PC

Is 2PC really so expensive?

How expensive is it to hold a unanimous vote?

Raw latencies depend on your network

• Geo-replication and speed of light • Vs. modern datacenter switches

But best-case behavior is not a good metric!

Much depends on your machines’ delay

distribution

Hardware: NW switches, mag disk vs. Flash, etc.

Software: GC in the JVM, carefully-crafted event

handlers...

Google does it; it must be good?!

Google Spanner uses 2PC and a host of other

stuff

But the latencies! Speed of light.

7 times round the world per second

latency (ms)

operation mean std dev count

all reads 8.7 376.4 21.5B

single-site commit 72.3 112.8 31.2M

[image:2.612.67.550.78.712.2]

multi-site commit 103.0 52.2 32.1M

Table 6:

F1-perceived operation latencies measured over the course of 24 hours.

[image:2.612.322.548.90.258.2]

of such tables are extremely uncommon. The F1 team

has only seen such behavior when they do untuned bulk

data loads as transactions.

Table 6 presents Spanner operation latencies as

mea-sured from F1 servers. Replicas in the east-coast data

centers are given higher priority in choosing Paxos

lead-ers. The data in the table is measured from F1 servers

in those data centers. The large standard deviation in

write latencies is caused by a pretty fat tail due to lock

conflicts. The even larger standard deviation in read

la-tencies is partially due to the fact that Paxos leaders are

spread across two data centers, only one of which has

machines with SSDs. In addition, the measurement

in-cludes every read in the system from two datacenters:

the mean and standard deviation of the bytes read were

roughly 1.6KB and 119KB, respectively.

6 Related Work

Consistent replication across datacenters as a storage

service has been provided by Megastore [5] and

Dy-namoDB [3]. DyDy-namoDB presents a key-value interface,

and only replicates within a region. Spanner follows

Megastore in providing a semi-relational data model,

and even a similar schema language. Megastore does

not achieve high performance. It is layered on top of

Bigtable, which imposes high communication costs. It

also does not support long-lived leaders: multiple

cas may initiate writes. All writes from different

repli-cas necessarily conflict in the Paxos protocol, even if

they do not logically conflict: throughput collapses on

a Paxos group at several writes per second. Spanner

pro-vides higher performance, general-purpose transactions,

and external consistency.

Pavlo et al. [31] have compared the performance of

databases and MapReduce [12]. They point to several

other efforts that have been made to explore database

functionality layered on distributed key-value stores [1,

4, 7, 41] as evidence that the two worlds are converging.

We agree with the conclusion, but demonstrate that

in-tegrating multiple layers has its advantages: inin-tegrating

concurrency control with replication reduces the cost of

commit wait in Spanner, for example.

The notion of layering transactions on top of a

repli-cated store dates at least as far back as Gifford’s

disser-tation [16]. Scatter [17] is a recent DHT-based key-value

store that layers transactions on top of consistent

repli-cation. Spanner focuses on providing a higher-level

in-terface than Scatter does. Gray and Lamport [18]

de-scribe a non-blocking commit protocol based on Paxos.

Their protocol incurs more messaging costs than

two-phase commit, which would aggravate the cost of

com-mit over widely distributed groups. Walter [36] provides

a variant of snapshot isolation that works within, but not

across datacenters. In contrast, our read-only

transac-tions provide a more natural semantics, because we

sup-port external consistency over all operations.

There has been a spate of recent work on reducing

or eliminating locking overheads. Calvin [40]

elimi-nates concurrency control: it pre-assigns timestamps and

then executes the transactions in timestamp order.

H-Store [39] and Granola [11] each supported their own

classification of transaction types, some of which could

avoid locking. None of these systems provides external

consistency. Spanner addresses the contention issue by

providing support for snapshot isolation.

VoltDB [42] is a sharded in-memory database that

supports master-slave replication over the wide area for

disaster recovery, but not more general replication

con-figurations. It is an example of what has been called

NewSQL, which is a marketplace push to support

scal-able SQL [38]. A number of commercial databases

im-plement reads in the past, such as MarkLogic [26] and

Oracle’s Total Recall [30]. Lomet and Li [24] describe an

implementation strategy for such a temporal database.

Farsite derived bounds on clock uncertainty (much

looser than TrueTime’s) relative to a trusted clock

refer-ence [13]: server leases in Farsite were maintained in the

same way that Spanner maintains Paxos leases. Loosely

synchronized clocks have been used for

concurrency-control purposes in prior work [2, 23]. We have shown

that TrueTime lets one reason about global time across

sets of Paxos state machines.

7 Future Work

We have spent most of the last year working with the

F1 team to transition Google’s advertising backend from

MySQL to Spanner. We are actively improving its

mon-itoring and support tools, as well as tuning its

perfor-mance. In addition, we have been working on improving

the functionality and performance of our backup/restore

system. We are currently implementing the Spanner

schema language, automatic maintenance of secondary

indices, and automatic load-based resharding. Longer

term, there are a couple of features that we plan to

in-Published in the Proceedings of OSDI 2012

12

10

TPS!

Hamilton Quote on Coordination

The first principle of successful

scalability is to batter the consistency

mechanisms down to a minimum, move

them off the critical path, hide them in a

rarely visited corner of the system, and

then make it as hard as possible for

application developers to get permission

to use them”

—James Hamilton (IBM, MS, Amazon)

But only relevant to massive scalability!

Outline

•  Basics

– Traditional DB replication – Replica consistency – Where do updates happen?

•  Considerations from Massive Scale – Performance requirements – NoSQL & Eventual Consistency •  Programming with Inconsistent data

(3)

Update-heavy Global-Scale Services

E.g. Amazon, Facebook, LinkedIn, Twitter

Shopping, Posting and Connecting

Latency and Availability both paramount

Favors multi-master solutions, or loose quora

Replica consistency becomes frustrating

• See Hamilton

Becomes quite natural to ditch Consistency!

The rise of NoSQL

NoSQL

•  Born in Amazon Dynamo •  Typical NoSQL approach

– Simple key/value data model. • put(key, val). get(key, val) – Replicated data (think 3+ replicas) – Multi-master: write anywhere, with timestamp – Update is acked by a single node

• No 2PC!

– Update “gossiped” lazily among the replicas in background

• A.k.a “anti-entropy”, “hinted handoff” and other fancy names

– Reads can consult 1 or more replicas

• Quorums can be used to achieve linearizability

NoSQL ambiguities

Atypical to bother with linearizability

Per-object ambiguities

You may not read the latest version

Writers can conflict

• Write the same key in different places • ‘Conflict Resolution’ or ‘Merge’ rules apply:

– E.g. Last/First writer wins • But what clock do you use? – E.g. Semantic merge (e.g. Increment)

Across objects

Typically no guarantees

No notion of “trans”-actional semantics

“Eventual Consistency”

“if no new updates are made to the object, eventually all

accesses will return the last updated value”

-- Werner Vogels, “Eventually Consistent”, CACM 2009

Which in practice means...??

Problems with E.C.

Two kinds of properties people discuss in

distributed systems:

1. 

Safety: nothing bad ever happens

False! At any given time, the state of the DB may

be “bad”

2. 

Liveness: a good thing eventually happens

Sort of! Only in a “quiescent” eventuality.

Outline

•  Basics

– Traditional DB replication – Replica consistency – Where do updates happen? •  Considerations from Massive Scale

– Performance requirements – NoSQL & Eventual Consistency

•  Programming with Inconsistent data – Approaches

(4)

How do programmers deal with EC?

1. 

What, me worry?

2. 

Application-level coordination

3. 

Provably consistent code: monotonicity

Case Study: The Shopping Cart

Based on Amazon Dynamo

With the global 2PC commit order “clock”

Key Value Key Value Key Value Key Value

-1

-1

Key Value Key Value

1

1

1

1

Key Value Key Value

1

1

(5)

Key Value

1

1

Key Value

1

1

1

1

Key Value Key Value

1

1

1

1

Key Value Key Value

2

2

1

1

Eventually Consistent Version

What could go wrong?

Item Count

1

1

2

Item Count

1

1

-1

-1

1

1

0

What, me worry?

How often does inconsistency occur?

E.g. What are the odds of a “stale read”?

P.B.S. Study at Berkeley (pbs.cs.berkeley.edu)

• Based on LinkedIn and Yammer traces • Stale reads are pretty rare

– Esp in a single datacenter, using flash disks • But they happen!

How bad are the effects of inconsistency?

In terms of application semantics

In terms of exception detection/handling

(6)

Application-level Reasoning

The most interesting part of the Dynamo paper

is its application-level smarts!

Basic idea:

Don’t mutate values in a key-value store

Accumulate logs at each node!

• Union up a set of items • Union is commutative/associative!

Only coordinate for checkout

Monotonic Code

Merge things “upward”

sets grow bigger (merge: Union)

counters go up (merge: MAX)

booleans go from false to true (merge: OR)

Intuition from the Integers

VON

NEUMANN

int ctr;

operator:= (x) {

// assign

ctr = x;

}

MONOTONIC

int ctr;

operator<= (x) {

// merge

ctr = MAX(ctr, x);

}

DISORDERLY INPUT STREAMS:

2, 5, 6, 7, 11, 22, 44, 91

5, 7, 2, 11, 44, 6, 22, 91, 5

VON

NEUMANN

MONOTONIC

DISORDERLY INPUT STREAMS:

2, 5, 6, 7, 11, 22, 44, 91

5, 7, 2, 11, 44, 6, 22, 91, 5

0 10 20 30 40 50 60 70 80 90 100

1 2 3 4 5 6 7 8 9 0

10 20 30 40 50 60 70 80 90 100

1 2 3 4 5 6 7 8 9

Intuition from the Integers

VON

NEUMANN

MONOTONIC

DISORDERLY INPUT STREAMS:

2, 5, 6, 7, 11, 22, 44, 91

5, 7, 2, 11, 44, 6, 22, 91, 5

0 10 20 30 40 50 60 70 80 90 100

1 2 3 4 5 6 7 8 9 0

10 20 30 40 50 60 70 80 90 100

1 2 3 4 5 6 7 8 9

(7)

General Monotonicity

Merge things “upward”

sets grow bigger (merge: Union)

counters go up (merge: MAX)

booleans go from false to true (merge: OR)

Can be partially ordered

E.g. sets growing

Note why this works!

Associative, Commutative, Idempotent

Core mathematical objects

Lattices & Monotonic logic

• E.g. Select/project/join/union. But not set-difference, negation

CALM Theorem

•  When can you have consistency without coordination? – Really.

– I mean really. Like “complexity theory” really.

– I.e. given application X, is there any way to code it up to run correctly without coordination?

•  CALM: Consistency As Logical Monotonicity

– Programs are eventually consistent (without coordination) iff they are monotonic.

• Monotonicity => coordination can be avoided (somehow)

• Non-monotonicity => coordination required

Hellerstein: The Declarative Imperative, 2010

Example

The fully monotone shopping cart

A “seal” or “manifest”

The Burden on App Developers

•  Given the tools you have – Java/Eclipse, C++/gdb, etc. •  Convince yourself your app is correct

– No race conditions? Fault tolerant?

•  Convince yourself your app will remain correct – Even after you are replaced

•  Convince yourself the app plays well with others – Does your eventual consistency taint somebody else’s

serializable data?

•  Wow. OK. Do you miss transactions yet? – Amazon reportedly replaced Dynamo – Transactions could be practical in a datacenter

• MS Azure makes use of this

– But what about global, high-performance systems?

Shameless plug: BOOM

•  Really good programmers avoid coordination – Maybe

•  Everyone else needs a better programming model – One that encourages monotonicity, and analyzes for it. – E.g. Bloom (http://bloom-lang.net)

• + Confluence analysis (Blazes)

• + Fault Tolerance analysis (Molly) • Etc.

(8)

Outline

•  Basics

– Traditional DB replication – Replica consistency – Where do updates happen? •  Considerations from Massive Scale

– Performance requirements – CAP, NoSQL, Eventual Consistency •  Programming with Inconsistent data

– Approaches – Expected costs – CALM

•  NoSQL today

Data Models

Key/Value Stores

Opaque values

“Document” Stores

Transparent values: JSON or XML

Amenable to indexing by tags

Query support still usually pretty simple

Think about these in terms of schema design

(e.g. ER diagrams)!

Popular NoSQL Systems

•  Hbase, Cassandra, Voldemort, Riak, Couchbase, CouchDB, MongoDB, etc. etc.

•  Why so many? – Not all that hard to build

– Lack of standards, design alternatives •  Interesting to compare to Hadoop

– Why no NoSQL core? Cause or effect?

Should you be using NoSQL?

•  Data model and query support

– Do you want/need the power of something like SQL?

• And are your queries canned, or ad-hoc?

– Do you want/need fixed or flexible schemas

• Note that you can put flexible data into a SQL database – See for example PostgreSQL’s JSON support

•  Scale

– Do you want/need massive scalability and high availability?

• What’s your data volume? Update/query workload?

• Will you need geo-replication

– Are you willing to sacrifice replica consistency? •  Agility and growth

– Are you building a service that could grow exponentially? – Optimizing for quick, simple coding?

• Or maintainability?

Data & Lessons to Live By

Topics

Everything in the slides

except:

FDs: What does 3NF achieve, Minimal Covers,

Decomp into 3NF

• But you are responsible for the definition of 3NF

Text Search: Crawlers

Data Vis: Vega

Guest Lectures

• Big Data • Data Science

(9)

Topics, cont.

In the Advanced Concurrency Control Lecture

:

You are not responsible for:

• Index concurrency (B-link trees, phantoms) • Distributed concurrency (2PC)

• Weak Isolation

You are responsible for:

• Multigranularity locking • MVCC

As you study…

"Reading maketh a full man; conference a ready

man; and writing an exact man."

-Francis Bacon

"If you want truly to understand something, try to

change it."

-Kurt Lewin

"I hear and I forget. I see and I remember. I do and

I understand."

-Chinese Proverb.

"Knowledge is a process of piling up facts; wisdom

lies in their simplification."

-Martin H. Fischer

Data Systems are More Similar than

Different

Relational DBs

Big Data Analytics

NoSQL Key-Value Stores

Search engines

File Systems

Don’t learn a system — learn the principles!

– examine how tricks vary across these use cases – draw connections

– Innovate!

Things are Changing

•  Cloud-scale •  Multicore •  Flash

•  Memory-resident databases •  Opportunity to revisit, refactor

– Not just the monolithic relational DBMS – Maybe disk seek is free

– Maybe on-chip cache locality is important

– Maybe there’s lots of room for multiple versions of data – Etc. Etc.

•  You’ve been trained to roll with—and drive—the change! – Well-grounded and open-minded.

Note: All of CS Moves This Way

Data grows exponentially.

The number of cores grows exponentially.

Programming

is

(parallel) data management.

Data-centric methods unlock parallelism.

declarative, disorderly

Note: All of human endeavour too

Data-centric, measurement-driven thinking

across society

Government, policy

Economics

Medicine

Science

Business

Etc.

(10)

You Have Enormous Value

Job market

McKinsey Big Data Report, 2011:

• “The US alone faces a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts to analyze big data and make decisions based on their findings.”

Choose your own adventure!

• engineering, sales, marketing, R&D, mgmt, etc • systems, applications, data science, … • big SW firms, web services, startups, application

domains… it’s all Data!

You Have Enormous Potential

•  Graduate school

– DBs & Data Science now at many top PhD programs – Exciting crossover to other areas

•  Save the world! – DataKind.org – CodeForAmerica.org – Data in the First Mile

• http://db.cs.berkeley.edu/papers/cidr11-firstmile.pdf

“More, more I’m still not satisfied”

Grad classes @ Berkeley

CS262A: grad level intro to DBMS and OS

CS286A: grad DB class (not offered next yr?)

DB, SysML, AMPLab seminars

Watch for 194’s and MOOCs

Data Science

Programming the Cloud

Parting Thoughts

•  Education is the ability to listen to almost anything without losing your temper or your self-confidence.

-Robert Frost

•  It is a miracle that curiosity survives formal education.

-Albert Einstein

•  Humility...yet pride and scorn; Instinct and study; love and hate; Audacity...reverence. These must mate

-Herman Melville, Art

•  The only thing one can do with good advice is to pass it on. It is never of any use to oneself.

-Oscar Wilde

Bye!

(11)

Brewer’s CAP Theorem

1. 

(replica) Consistency

2. 

Availability

3. 

(tolerance of network) Partitions

Basic formulation

A system can only provide 2 of {C, A, P}

More practical formulation

Partition ~= high latency (the CAL theorem?)

If you want High Availability and Low Latency:

• Give up on Consistency.

If you want Consistency:

• Give up on High Availability and Low Latency

Additional concern: finding replicas

•  Centralized directory

– A single node that maps keys to machines •  Hierarchical directory

– Namespace is partitioned and sub-partitioned – Ask “up the tree”

– DNS works this way •  Hashing

– Static, well-known hash function is not very flexible.

• More adaptive schemes based on “consistent hashing”

– Truly distributed hashing: e.g. Chord

•  All of these should be performant and fault-tolerant too! – Caching of locations is important throughout

– Variety of trickiness here

Fault Tolerance and Load Balancing

•  Nodes may fail

– Can be recovered from replicas

– In the interim, the replica nodes may be overloaded

• Interesting questions on how to get graceful degradation

– Some process needs to oversee this

• Various solutions here, both centralized and distributed

•  Load balancing

– Even without failures, keys may have skewed access – Consistent hashing helps adapt placement while still

Figure

Table 6: F1-perceived operation latencies measured over thecourse of 24 hours.

References

Related documents

We therefore turned to youth, whose perspectives on education policy are often misrepresented or go unheard altogether.Our chosen methodology led us to conduct a longitudinal study

The key maintained assumption in this analysis is that municipal green procurement policies in far-away cities increase the supply of LEED Accredited Professionals in nearby

Diuretics are certainly a good treatment option for those patents presenting with high or even normal blood pressure, however they should not be used as a blanket treatment for

4 At the request of an employee who is entitled under the FPU scheme at the age of 62 years and three months and who does not avail of the scheme referred to in the fi rst

Remark- ably, even after many changes in PES schemes over time, the same governance constraints and problems that have long contributed to inequities in forest access persist,

Therefore, heterogeneous yield curves can be derived by assuming different short rates and risk premia between agents though they may not be guranteed to be consistent with the

In order to minimise the reduction in the tensile strength and modulus caused by introducing uniform fibre waviness into the UD composites and to produce