Sherpa:
Cloud Computing of the Third Kind
Raghu Ramakrishnan
What’s in a Name?
• Data Intensive Super Scalable Computing
• Grid Computing
• Super Computing
• Cloud Computing
• Parallel Database Management Systems
• Distributed Database Management Systems
Vary across:
• Workload
• Programming model
• Ownership model
• Architectural trade-offs
Cloud Computing: Computing as a Service
[Quadrant diagram: workloads split along CPU-intensive vs. data-intensive and analytic vs. "transactional" axes]
• Data-intensive, analytic: e.g., SSDS, Hadoop
• CPU-intensive, high-throughput (packaged software): e.g., Condor
• Data-intensive, "transactional": storage & serving
Trivia Question
• What’s the world’s most widely used parallel programming language?
Why Not Use an RDBMS for “Analytics”?
• RDBMS provides too much
– ACID transactions
– Complex query language
– Lots and lots of knobs to turn
• RDBMS provides too little
– Lots of optimization and tuning possible for analytics
• E.g., Column stores, bit-map indexes
– Flexible programming model
• E.g., Group By vs. Map-Reduce; multi-dimensional OLAP
• But many good ideas to borrow!
– Declarative language; parallelization and optimization techniques; value of data consistency …
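The Group By vs. Map-Reduce comparison above can be made concrete: a SQL `GROUP BY` aggregate and a MapReduce job express the same computation. Below is a minimal, in-memory sketch of the MapReduce programming model (not Hadoop's actual API) applied to a word count, the MapReduce analogue of `SELECT word, COUNT(*) FROM docs GROUP BY word`.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory MapReduce: group mapper output by key (the
    "shuffle"), then apply the reducer to each key's value list."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: mapper emits (word, 1); reducer sums the 1s per word.
docs = ["cloud data", "cloud computing", "data data"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'cloud': 2, 'data': 3, 'computing': 1}
```

The mapper plays the role of the grouping expression, the reducer the role of the aggregate function; MapReduce simply exposes both as arbitrary user code rather than a fixed set of SQL aggregates.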
Why Not Use an RDBMS for “OLTP”?
• RDBMS provides too much
– ACID transactions
– Complex query language
– Lots and lots of knobs to turn
• RDBMS provides too little
– Lack of (cost-effective) scalability, availability
– Not enough schema/data type flexibility
• RDBMS and Sherpa aim for different parts of the space
– RDBMS: Heavyweight, strongly consistent OLTP
– Sherpa: Lightweight but massive scale, relaxed consistency
“I want a big, virtual database”
“What I want is a robust, high performance virtual
relational database that runs transparently over a
cluster, nodes dropping in and out of service at
will, read-write replication and data migration all
done automatically.
I want to be able to install a database on a server
cloud and use it like it was all running on one
machine.”
-- Greg Linden’s blog, 2006
An Example Web App
[Diagram: a photo-sharing app — Sonja uploads photos; Brandon tags a photo as "flower"; pages show friend activity, your photos, and photos tagged "flower"; updates flow into the store and queries flow out]
Why Hosted?
[Diagram: applications share one hosted store through a simple API]
• Rapid application development
• On-demand scaling
• DBA functions amortized across applications
Rapid Application Development
• What does it take to get the Next Great Thing off the ground?
• Now:
– Set up multiple replicas of a clustered data store
– Set up a system for indexing
– Set up a system for caching
– Set up auxiliary DBMS instances for reporting, etc.
– Set up the feeds and messaging between them
– Write the application logic
– The result: a fairly complex system before the first line of new code is written
• Our vision:
– Write the application logic
– Use a hosted infrastructure to store and query your data
– Or, as Joshua Shachter puts it: “The next cool thing shouldn’t take a team
of 30, it should be three guys, PHP and a long weekend”
Implications
• Data management as a service
– Scientists and others who’ve resisted (installing, maintaining, and) using DBMSs will find it much easier to reap the benefits
– “Data centers” and “computing centers” will come into vogue again
• The Web is becoming open
– E.g., OpenSocial, OpenID
• Hosted back-ends and RAD tools will make Web application development accessible to all
– Ideas will be the most valuable currency, not the wherewithal to build complex systems
• Paradigm shifts possible for how we do research in many fields:
– Build applications that embed your algorithms and test them directly in the field—Computer Scientists can interact directly with users (ironically, this would still be a breakthrough of sorts after four decades!)
– Many other disciplines (e.g., sociology, microeconomics) can design and conduct online experiments involving unprecedented numbers of participants
PNUTS: DB in the Cloud
[Diagram: one logical table replicated and partitioned across regions]
CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)
• Parallel database
• Geographic replication
• Indexes and views
• Structured, flexible schema
• Hosted, managed infrastructure
Sherpa Data Services
PNUTS Services
• Query planning and execution
• Index maintenance
Distributed infrastructure for tabular data:
• Data partitioning
• Update consistency
• Replication
[Stack diagram: Applications on top; PNUTS over YDOT FS (ordered tables) and YDHT FS (hash tables); YMB (pub/sub messaging); YCA (authorization); Zookeeper (consistency service)]
Guiding Principles for PNUTS
Reliable and robust storage
– Replication for fault tolerance
– Predictable consistency guarantees
Simple to use
– Simple operations set
– Minimal client configuration
– Service-level authentication
– Flexible schemas
Highly scalable / performant
– Partitioning data over many machines
– Horizontal scaling at every level
– Data is local to its usage
– Predictable performance via quality-of-service levels
– Predicates evaluated on back end
– Cheaper consistency guarantees than full ACID
Multiple rich access methods
– Hash and ordered table types
– System-maintained secondary indexes
– Optimization for complex access patterns
Rapid provisioning of new storage
– Simple, automated cluster growth
– Cheap table creation
– Pay as you grow, grow big as you need
Operationally cheap
– Automated failover
– Automated load balancing
– No single points of failure
Data Model and Retrieval
YDOT/YDHT
• Data model: key → value dictionary
– Value can be packed with multiple attributes
• YDHT operations: hash table calls
– Get
– Set (insert and update)
– Remove
– Scan
• YDOT: YDHT + ordered ranges
PNUTS
• Data model: relational tables with flexible schema
– Typed, declared attributes
– Fast addition of new attributes
• Operations: PNUTS query language
– Point lookup
– Range queries
– Insert/Update/Remove
– Complex predicates
– Ordering
– Top-K
• Primary API is web services (JSON over HTTP)
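The YDHT call set above is small enough to sketch end to end. The class below is a toy in-memory stand-in (a hypothetical interface, not the real YDHT client) showing the Get / Set / Remove / Scan surface, with values packed as multi-attribute dictionaries as the slide describes.

```python
class HashTableClient:
    """Toy in-memory sketch of the YDHT operation set (hypothetical API)."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        # Set covers both insert and update.
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

    def remove(self, key):
        self._store.pop(key, None)

    def scan(self):
        # A hash table guarantees no key order on scan.
        return iter(self._store.items())

t = HashTableClient()
t.set("Apple", {"Description": "Apple is wisdom", "Price": 12})
t.set("Lime", {"Description": "Limes are green", "Price": 1})
print(t.get("Apple")["Price"])  # 12
```

A YDOT table would expose the same calls but keep keys ordered, so `scan` could return a contiguous key range.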
[Diagram: clients talk to routers, which direct requests to storage servers; a tablet controller manages tablet placement]
YDHT
• Scalable distributed record store
– Optimized for small reads and writes
– Focus on ease of operations, multi-region redundancy, organic scalability
Ways to use YDHT
• As a primary store
• As a materialized view/cache
• As part of PNUTS!
[Diagram: YDHT used directly as a primary store by an app, as a materialized view/cache, and as a layer inside PNUTS]
Data Concepts—YDHT
[Diagram: a YDHT table of records, each with a primary key (Apple, Lemon, Grape, Orange, Lime, Strawberry, Kiwi, Avocado, Tomato, Banana) and a value ("Apple is wisdom", "Limes are green", "Grapes are good to eat", …); records are grouped into tablets, with no ordering on the key]
Data Concepts—YDOT
[Diagram: the same records ordered by primary key (Apple, Avocado, Banana, …); tablets contain clustered key ranges and are assigned to storage units 1–3]
YDOT—Ordered Table Store
• YDOT provides clustered, ordered retrieval of records
[Diagram: a router maps key intervals to storage units; a range query such as "Grapefruit…Pear" is split at tablet boundaries into "Grapefruit…Lime" and "Lime…Pear" and sent to the units holding those tablets]
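The router's job in the diagram above is an interval lookup: given a key, find which tablet's range covers it and hence which storage unit to contact. A minimal sketch follows; the split keys and unit assignments are hypothetical, since the slide's exact mapping is not fully recoverable.

```python
import bisect

# Hypothetical router table: tablet split keys and the storage unit that
# owns each resulting interval (intervals are closed on the left).
split_keys = ["Grapefruit", "Lime", "Pear"]
units = [1, 3, 2, 1]  # (-inf, Grapefruit), [Grapefruit, Lime), [Lime, Pear), [Pear, +inf)

def route(key):
    """Return the storage unit whose tablet's key range covers `key`."""
    return units[bisect.bisect_right(split_keys, key)]

print(route("Apple"), route("Kiwi"), route("Mango"), route("Strawberry"))  # 1 3 2 1
```

A range query is served the same way: route its two endpoints, then fan out to every unit whose interval overlaps the range.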
Data Concepts—PNUTS
[Diagram: the same ordered table of records, now with a per-record Price value ($1, $3, $2, $12, $8, …)]
Schema: declared, typed fields (Name, Description, Price)
Retains tablet structure of YDHT/YDOT
Flexible Schema
Primary table:

Posted date   Listing id   Item    Price
6/1/07        424252       Couch   $570
6/1/07        763245       Bike    $86
6/3/07        211242       Car     $1123
6/5/07        421133       Lamp    $15

Individual records can also carry extra attributes, e.g., Color (Red) and Condition (Good, Fair), without all records having them.
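One way to picture the flexible-schema model above is records as dictionaries: every record has the declared fields, and any record may carry extra attributes the others lack. The sketch below is illustrative; which listings carried which extra attributes is not recoverable from the slide, so the pairings are invented.

```python
# Flexible-schema records as dicts: declared fields on every record,
# optional per-record extras (Color, Condition) on some.
listings = [
    {"Posted": "6/1/07", "ListingId": 424252, "Item": "Couch", "Price": 570,
     "Condition": "Good"},                      # extra attribute (illustrative)
    {"Posted": "6/1/07", "ListingId": 763245, "Item": "Bike", "Price": 86,
     "Color": "Red"},                           # different extra attribute
    {"Posted": "6/5/07", "ListingId": 421133, "Item": "Lamp", "Price": 15},
]

# Queries tolerate missing attributes rather than requiring ALTER TABLE:
red_items = [r["Item"] for r in listings if r.get("Color") == "Red"]
print(red_items)  # ['Bike']
```

This is what "fast addition of new attributes" buys: a new attribute appears on the records that set it, with no schema migration touching the rest of the table.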
Mastering
[Diagram: each record (A–F) is assigned a master region (E, W, or C); the tablet master is shown alongside the per-record masters]
Basic Consistency Model
Goal:
• Make it easier for applications to reason about updates and cope with asynchrony—an alternative to “transactions” in an asynchronous world
• What happens to a record with primary key “Brian”?
Guarantees:
• Every reader will always see some consistent, but possibly stale, version
• Readers can request a more up-to-date version, but may pay extra latency
– Special case: critical read (writers/readers see their own writes)
• Writers can verify that the record is still at the version they expect
[Timeline: a record is inserted (v. 1), updated (v. 2, v. 3), then deleted, forming Generation 1; re-inserting after a delete starts a new generation (Generation 2, Generation 3), with versions increasing within each generation]
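The last guarantee above, a writer verifying the record is still at the expected version, amounts to a per-record test-and-set. The sketch below is a single-record toy (a hypothetical interface, not the real PNUTS client API) showing how version numbers let writers detect intervening updates without full ACID transactions.

```python
class Record:
    """Toy sketch of per-record versioning for test-and-set writes."""

    def __init__(self, value):
        self.value, self.version = value, 1

    def read_any(self):
        # In the real system this may come from a stale replica, but it
        # is always some consistent version of the record.
        return self.value, self.version

    def test_and_set_write(self, new_value, expected_version):
        # The write applies only if no one else wrote in between.
        if self.version != expected_version:
            return False  # conflict: re-read and retry
        self.value, self.version = new_value, self.version + 1
        return True

rec = Record("v1 data")
_, ver = rec.read_any()
assert rec.test_and_set_write("v2 data", ver)      # succeeds, version -> 2
assert not rec.test_and_set_write("stale", ver)    # rejected: version moved on
```

This gives writers lost-update protection on a single record, which is the "cheaper consistency guarantee than full ACID" the design aims for.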
Distribution
[Diagram: listing records (Couch, Car, Bike, Lamp, Scooter, Hammer, Chair, …) partitioned across Servers 1–4]
• Distribution for parallelism
• Data shuffling for load balancing
Tablet Splitting and Balancing
• Each storage unit has many tablets
• Tablets may grow over time
• Overfull tablets split
• A storage unit may become a hotspot
• Shed load by moving tablets to other servers
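The split step on this slide can be sketched in a few lines: when a sorted tablet exceeds a size threshold, cut it at the median key so each half stays a contiguous key range. The threshold and in-memory representation below are illustrative assumptions, not the system's actual mechanism.

```python
# Sketch: splitting an overfull tablet (a sorted list of keys) in two.
def split_tablet(tablet, max_records=4):
    """If the tablet exceeds max_records, split it at the median key so
    both halves remain contiguous key ranges; otherwise leave it alone."""
    if len(tablet) <= max_records:
        return [tablet]
    mid = len(tablet) // 2
    return [tablet[:mid], tablet[mid:]]

tablet = ["Apple", "Banana", "Grape", "Kiwi", "Lime", "Mango"]
halves = split_tablet(tablet)
print(halves)  # [['Apple', 'Banana', 'Grape'], ['Kiwi', 'Lime', 'Mango']]
```

After a split, either half can be migrated to a lightly loaded storage unit, and the router's interval map is updated with the new boundary key.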
Architecture
• Data-path components; each can be scaled horizontally
[Diagram, per cluster: clients, WS API, routers, query processing, tablet controller with its tablet map, storage units (SU API), load balancer, server monitor, YMB; two such clusters shown]
Yahoo! Message Broker (YMB)
• Pub/sub based on reliable logging
– Topic-based
– Persistent subscriptions
– Multi-region presence
• Guarantees
– In the presence of at most one YMB machine failure:
• Published messages will be delivered on live subscriptions system-wide
• Messages published in one region will be delivered to all subscribers in the order they were published (partial order)
• Published messages remain available for re-delivery until the subscriber calls consume()
– If there are two machine failures:
• Subscribers will be notified of a “broken subscription”, since messages may have been lost
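The topic-based delivery and consume() semantics above can be illustrated with a toy in-process broker (a sketch, not YMB's real implementation or API): each subscription keeps its own queue, messages are delivered in publish order, and a message stays available for re-delivery until the subscriber explicitly consumes it.

```python
from collections import defaultdict, deque

class Broker:
    """Toy topic-based pub/sub: per-subscriber queues, publish-order
    delivery, messages held until consume() is called."""

    def __init__(self):
        self.subs = defaultdict(dict)  # topic -> {subscriber: queue}

    def subscribe(self, topic, subscriber):
        self.subs[topic][subscriber] = deque()

    def publish(self, topic, message):
        # Fan out to every subscription, preserving publish order.
        for queue in self.subs[topic].values():
            queue.append(message)

    def peek(self, topic, subscriber):
        queue = self.subs[topic][subscriber]
        return queue[0] if queue else None

    def consume(self, topic, subscriber):
        # Only now does the message stop being re-deliverable.
        self.subs[topic][subscriber].popleft()

b = Broker()
b.subscribe("updates", "replica-west")
b.publish("updates", "set Apple v2")
print(b.peek("updates", "replica-west"))  # set Apple v2
b.consume("updates", "replica-west")
print(b.peek("updates", "replica-west"))  # None
```

In PNUTS this reliable, ordered log is what carries updates between regions: a write is committed by publishing it to YMB, and replicas apply messages from their subscriptions.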