Big Data Technology
CS 236620, Technion, Spring 2014
Edward Bortnikov & Ronny Lempel
Yahoo Labs, Haifa
Data = Systems
How to Get the Big Systems Right?
• A multidisciplinary science in its own right
  • Distributed Computing, Networking
  • Hardware and Software Architecture
  • Operations Research, Measurement, Performance Evaluation
  • Power Management
  • … and even Civil Engineering
• In this course: aspects related to Computer Science
• We’ll start with some principles …
An Ideal System Should …
Architect’s Dream - Throughput
How many requests can be served in a unit of time?
Architect’s Dream - Latency
How long does a single request take?
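Throughput and latency are separate metrics: adding parallel workers can raise throughput without making any single request faster. A minimal sketch, assuming a hypothetical `handle_request` whose work is simulated by a fixed sleep (the 20 ms service time and the worker counts are illustrative, not measurements):

```python
import time
from concurrent.futures import ThreadPoolExecutor

SERVICE_TIME = 0.02  # assumed per-request service time, in seconds


def handle_request(_):
    """Hypothetical handler: sleeps to simulate work and returns its own latency."""
    start = time.perf_counter()
    time.sleep(SERVICE_TIME)
    return time.perf_counter() - start


def measure(num_requests, workers):
    """Return (throughput in requests/s, mean latency in s) for a pool of workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(handle_request, range(num_requests)))
    elapsed = time.perf_counter() - start
    return num_requests / elapsed, sum(latencies) / len(latencies)


if __name__ == "__main__":
    for workers in (1, 4):
        throughput, latency = measure(num_requests=20, workers=workers)
        print(f"workers={workers}: {throughput:.0f} req/s, "
              f"mean latency {latency * 1000:.1f} ms")
```

With more workers the requests-per-second figure should grow, while the mean latency stays close to the service time.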
Scaling Up? Scaling Out?
Scale up
Example: Network Filesystems
[Figure: two architectures]
Monolithic (e.g., historical NFS): client sends R/W request for “server:/a/b/z.txt” to the NFS server
Distributed: metadata service (namenode) maps “/users/bob/courses/CS101.txt” to “<server_123, block 20>”; client sends R/W request to the data service (datanode)
Scale-Out Philosophy
• Scalability through Decoupling
  • Whatever is split can be scaled independently
  • HDFS: Metadata and Data accesses decoupled
• Minimize centralized processing
  • Metadata accesses coordinated but lean
• Maximize I/O parallelism
  • Clients access the data nodes concurrently
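The decoupling of metadata from data can be sketched in a few lines. This is a toy in-memory model, not the HDFS API: `MetadataService`, `DataNode`, `write_file`, `read_file`, the 4-byte block size and the round-robin placement are all hypothetical names and choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class DataNode:
    """Hypothetical data service: stores block contents by block id."""

    def __init__(self):
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class MetadataService:
    """Hypothetical metadata service: maps a path to block locations only."""

    def __init__(self):
        self.locations = {}  # path -> ordered list of (node_id, block_id)

    def add_file(self, path, locations):
        self.locations[path] = list(locations)

    def lookup(self, path):
        return self.locations[path]


def write_file(path, data, metadata, nodes, block_size=4):
    """Split data into blocks, place them round-robin, record the locations."""
    node_ids = sorted(nodes)
    locations = []
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        node_id = node_ids[index % len(node_ids)]
        block_id = (path, index)
        nodes[node_id].write_block(block_id, data[offset:offset + block_size])
        locations.append((node_id, block_id))
    metadata.add_file(path, locations)


def read_file(path, metadata, nodes):
    """One lean metadata lookup, then concurrent block reads from the data nodes."""
    locations = metadata.lookup(path)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        blocks = pool.map(lambda loc: nodes[loc[0]].read_block(loc[1]), locations)
        return b"".join(blocks)


if __name__ == "__main__":
    nodes = {f"server_{i}": DataNode() for i in range(3)}
    metadata = MetadataService()
    write_file("/users/bob/courses/CS101.txt", b"hello big data", metadata, nodes)
    print(read_file("/users/bob/courses/CS101.txt", metadata, nodes))
```

The metadata service never touches file contents, so it stays lean, and the number of data nodes can grow independently of it.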
The Peer-to-Peer Approach
• Completely server-less
  • All nodes and functions are fully symmetric
  • E.g., in a distributed data store every node has a serving function and a management function
• Less favored in managed DC environments
  • Very hard to maintain consistency guarantees
  • Very hard to optimize globally
The Tail at Scale
• Problems are aggravated in large systems
  • Component-level variability amplified by scale
  • Failures and slow components are part of normal life, not an exception
• Two ways of addressing service variability
  • Prevent bad things from happening by detecting and isolating the slow/flawed components
  • Contain bad things through redundancy
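A back-of-the-envelope calculation shows how scale amplifies component-level variability and how redundancy contains it. The sketch below assumes components are slow independently of each other, and that a request fans out to all components and waits for the slowest one; both function names and the sample numbers are illustrative assumptions.

```python
def p_request_slow(p_component_slow, fanout):
    """Probability that at least one of `fanout` independent components is slow."""
    return 1.0 - (1.0 - p_component_slow) ** fanout


def p_slow_with_replicas(p_component_slow, replicas):
    """Probability that every one of `replicas` independent copies is slow,
    i.e. that asking all replicas and taking the first answer is still slow."""
    return p_component_slow ** replicas


if __name__ == "__main__":
    p = 0.01  # assumed: each component is slow 1% of the time
    for fanout in (1, 10, 100):
        plain = p_request_slow(p, fanout)
        redundant = p_request_slow(p_slow_with_replicas(p, 2), fanout)
        print(f"fanout={fanout}: slow {plain:.1%} of the time, "
              f"{redundant:.2%} with 2 replicas per component")
```

With a 1% per-component slowness and a fan-out of 100, roughly 63% of requests hit at least one slow component, which is why rare slowness at the component level becomes the common case at scale.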
Expected Workload Matters
• Latency-oriented
  • Interactive, user-facing systems
  • Example: Web search serving
• Throughput-oriented
  • Back-end heavyweights
Data Accessibility Matters
Stream
Warehouse
Access Patterns Matter
• Data Analytics
  • Throughput-oriented applications
  • Write-once (typically, append)
  • Read-many (typically, large sequential reads)
• Online Transaction Processing (OLTP)
  • Latency-oriented applications
  • Write-intensive
  • Typically, many small direct accesses
Hardware Constraints Matter
Compute- or Data-Intensive?
Storage
Compute
Locality Matters
• Can computation and storage be aligned?
  • Optimization?
• How repetitive is the workload?
  • Optimization?
Pr(X > x) ~ x^{-α}
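When item popularity follows a power law (tail probability Pr(X > x) proportional to x^{-α}), a few dominant items account for a large share of requests. One plausible answer to the "Optimization?" question for repetitive workloads is caching; that reading is an assumption, as the slide leaves the question open. The sketch below uses Zipf-like rank weights (popularity of rank r proportional to 1/r^s) as one common concrete form of a power law; the catalog size, cache size and exponent are illustrative.

```python
def zipf_weights(num_items, exponent=1.0):
    """Normalized request probabilities, with rank r weighted by 1 / r**exponent."""
    raw = [1.0 / rank ** exponent for rank in range(1, num_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def cache_hit_ratio(num_items, cache_size, exponent=1.0):
    """Fraction of requests served by a cache holding the most popular items."""
    weights = zipf_weights(num_items, exponent)
    return sum(weights[:cache_size])


if __name__ == "__main__":
    for cache_size in (10, 100, 1000):
        ratio = cache_hit_ratio(num_items=10_000, cache_size=cache_size)
        print(f"cache of {cache_size} out of 10000 items: hit ratio {ratio:.0%}")
```

Under a uniform workload (exponent 0) the same cache would serve only `cache_size / num_items` of the requests, so the benefit comes entirely from the skew.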
[Figure: power-law distribution, with dominant items and a long tail]
Consistency Matters
• Stricter properties = stronger consistency
• Are you prepared to handle weird stuff?
• Fancy stock alerts
  • Is it okay to lose an event once in a while?
• Fancy a social network
  • Bob deletes photos with his ex-date Alice
  • Bob befriends Carol
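The social-network example turns on ordering. A minimal sketch, assuming the anomaly in question is a replica that applies Bob's two updates in the wrong order (the slide does not spell this out, and the operation names and photo name are hypothetical):

```python
def photos_visible_to_carol(updates):
    """Apply Bob's updates in the given order and collect every photo
    that Carol could see at any point along the way."""
    photos = {"bob_and_alice.jpg"}
    friends = set()
    seen_by_carol = set()
    for op in updates:
        if op == "delete_photos":
            photos = set()
        elif op == "befriend_carol":
            friends.add("carol")
        # Friends can see whatever photos exist right now.
        if "carol" in friends:
            seen_by_carol |= photos
    return seen_by_carol


if __name__ == "__main__":
    issued = ["delete_photos", "befriend_carol"]     # order Bob intended
    reordered = ["befriend_carol", "delete_photos"]  # order a weak replica might apply
    print("in order: ", photos_visible_to_carol(issued))
    print("reordered:", photos_visible_to_carol(reordered))
```

Both orders end in the same final state, so the replica is eventually consistent, yet only the reordered run exposes the photo to Carol.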
A Dialogue in the Wild
Engineer: we are afraid of any kind of synchronization
Scientist: what kind of guarantee do you want to get?
Engineer: let’s build something simple
Relax your consistency models
We want the systems to be eventually consistent
Scientist: this is an interesting problem
Example: Amazon’s Outage
Weak consistency models can lead
Elasticity Matters
• Resource demands often unknown in advance
  • Driven by application popularity
• Goal: enablement of organic growth
  • Add- (and pay-) as-you-grow
• Economies of scale
  • Pool multiple datasets and services in huge DCs
  • Better use of shared resources (personnel, real
Cloud Computing
• Computing resources delivered over a network
• Infrastructure issues abstracted away
• ***-as-a-Service
Designing the Air Flows
Power Efficiency - Surprising Facts
• “At Facebook's Prineville, OR, facility, ambient air flows into the building, passing first through a series of filters to remove bugs, dust, and other contaminants.”
• “Previous estimates suggested that electricity consumption in massive server farms would double between 2005 and 2010. Instead, the number rose by 56% worldwide, and merely 36% in the US.”
• “The most efficient data centers now hover at temperatures closer to 80 degrees Fahrenheit, and instead of sweaters, the technicians walk around in shorts.”
Summary
• Design for scale
• Design for fault-tolerance
• Know what you design for
Further Reading