An Application-Aware Approach to
Systems Support for Big Data
Hong Jiang
National Science Foundation
&
Department of Computer Science & Engineering
University of Nebraska – Lincoln
Data Deluge
Social Network
Business Intelligence
Scientific Simulation
Mobile Apps
3,900
tweets per
second
275 EB data
flowing per
day in 2020
The Vs of Big Data
Volume
Velocity
Big Data Characteristics:
Volume
•
From 130 EB of stored data in
2005 to 40,000 EB in 2020,
as estimated by IDC
Big Data Characteristics:
Velocity
•
YouTube: 100 hours of video
uploaded per minute
•
Twitter: 3,900 tweets per second
Big Data Characteristics:
Variety
•
Structured, semi-structured,
unstructured data
(transactions, sensor, text,
audio, image, video, log,
etc.)
•
80% of an organization’s data
Big Data Characteristics:
Veracity
•
Trustworthiness of data
•
1 in 3 business leaders don’t
trust the information they
use to make decisions
Big Data, Big Challenges
•
Challenges to Computer Systems Research
–
Scalability!
→
In-memory computing
•
Capturing, Delivering, Storing
•
Real, or Near-Real Time Processing
•
Indexing and Searching
–
Data protection: reliability, availability, security
•
Against hardware/software faults
•
Against malicious attacks
–
Volume reduction & “Sanitizing”
•
Flash SSDs/SCM
–
OS storage stack overhaul!
Application-aware approach to
systems support for big data
In-memory computing support:
application-aware data filtering, indexing
& search, caching
Application-aware data protection:
Data backup & restore, error tolerant
data management
Volume reduction & sanitizing:
Application-aware data deduplication
and provenance-aware cleansing
Research on Cross-Cutting Interfaces
Honey, I shrunk the Data!
•
Data deduplication
–
Using a hash signature to uniquely identify a data chunk
–
Secure hash Signature: MD5, SHA-1, SHA-256, Tiger…
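A minimal sketch of the idea above, assuming fixed-size chunking and SHA-256 as the hash signature. The function names `dedupe` and `restore` and the 4 KB default chunk size are illustrative choices, not taken from the slides.

```python
import hashlib


def dedupe(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy per fingerprint.

    Returns (store, recipe): store maps fingerprint -> chunk bytes, and
    recipe is the ordered list of fingerprints needed to rebuild the data.
    Both names are hypothetical, chosen for this sketch.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # The secure hash acts as the chunk's (practically) unique identifier.
        fingerprint = hashlib.sha256(chunk).hexdigest()
        # Only the first occurrence of a fingerprint stores the chunk bytes.
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte stream from the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

Real systems add an on-disk index and container storage; this sketch keeps everything in a Python dict to show only the fingerprint lookup.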
Benefits
•
Reduces the storage space
requirement for big data
•
Minimizes the network
Deduplication Dilemmas
•
Challenge:
–
High dedup
ratio &
throughput
–
@ low RAM &
CPU cost
–
Reliability
State-of-the-art Approaches
•
Locality-based approaches:
–
DDFS, Sparse Indexing, ChunkStash
•
Similarity-based approaches:
–
Extreme Binning
Index
Minimize the accesses
to on-disk index
Index
Only one on-disk
access per file
•
These approaches fail when data streams lack
either or both locality and similarity!
•
Our solution is SiLo: A Similarity-Locality
based Near-Exact Deduplication Scheme with
Low RAM Overhead and High Throughput
Motivation and Observation
•
Redundancy observation of small and large files
                   Small files (≤ 64 KB)   Large files (≥ 2 MB)
% of file count    ≥ 80%                   ≤ 20%
% of space         ≤ 20%                   ≥ 80%
Grouping many highly
correlated small files into
a segment to minimize
dedupe overheads
Dividing the large files
into many small segments
to expose more similarity
Intuition
•
Combining and complementing
similarity
and
locality
[Figure: (a) Similarity approach, with locality enhancement. Labels: existing data stream, input data stream, similar segments, potential duplicate.]
SiLo Architecture
[Figure: system architecture. Labels: User Interface, Chunking, File Agent, Job Agent, Metadata Agent, Storage Agent, Container Store, Job Metadata, Cache, Hash Table, Block Store, File Daemon, Backup Server, Deduplication Server (MDS), Storage Server, Disks, Network.]
Deduplication Server stores and looks up
all fingerprints of files and chunks.
Backup Server manages the
backup system and directs all File
Agents and Storage Servers.
Storage Server stores backed-up
data.
File Daemon provides a
functional interface in users'
Deduplication Server
[Figure: Deduplication Server data structures. Labels: RAM (SHTable, Read Cache, Write Buffer), DISK (Blocks), with fields Seg Key, RepChunk ID, Block ID, Chunk ID, LHTable, Segment. Callouts: Similarity Hash Table, Locality Cache.]
The similarity
unit, (sequence
of chunks)
The locality unit,
(sequence of
Duplicate Elimination
SiLo achieves near-exact duplicate elimination for all workloads
SiLo with segment size of 4MB
The similarity
approach:
Extreme Binning
Locality approach: ChunkStash-HDD
RAM Usage for Indexing
SiLo consumes a RAM capacity that is only 1/41 ∼ 1/60 and 1/3 ∼ 1/90
Extreme Binning
performs poorly
on the Linux-set.
Deduplication Throughput
SiLo outperforms ChunkStash by a factor of about 3 and
Extreme Binning by a factor of about 1.5
Summary of SiLo
•
SiLo, a near-exact deduplication system
✓
Addresses the scalability of deduplication indexing in
big data environments
•
Combination of similarity and locality
✓
Mines the similarity and locality characteristics in
deduplication-based storage systems
•
We are working on applying deduplication to
Cluster Deduplication
•
To scale data deduplication to PB
or EB level datasets
•
Cluster deduplication can satisfy scalable capacity and
performance requirements in Big Data storage
–
Data routing for assigning data to
appropriate deduplication nodes
–
Intra-node independent
Challenges of Cluster Deduplication
•
Chunk-index lookup disk bottleneck
–
The chunk index of a large dataset is too big to fit into
the limited RAM of the deduplication server
–
Parallel lookup performance across multiple streams degrades
significantly due to frequent and random disk I/Os
•
Deduplication node information island
–
Deduplication is only performed within individual
servers due to overhead considerations, which leaves
cross-node redundancy untouched
The State of the Art
•
Locality-based optimization mechanisms
–
NEC: HYDRAstor (large chunk, DHT-based stateless routing)
–
EMC: Data Domain Global Deduplication Array (super-chunk,
stateless routing & stateful routing)
•
Similarity-based optimization strategies
–
HP: Extreme Binning (file similarity, stateless routing)
–
Symantec: file routing middleware (file similarity, stateless routing)
–
EMC: content-aware load balancing (client similarity, stateless routing)
•
Stateful vs. Stateless Routing
•
Challenges
–
The fingerprint-based routing schemes have failed to achieve a good
tradeoff among capacity saving, throughput, and scalability in large
clusters
•
Our solution is Σ-Dedupe, a scheme that optimizes
cluster deduplication by exploiting data similarity and
locality in backup data streams
•
Novel data routing for assigning data to nodes
–
Coarse-grained super-chunk (i.e., a consecutive chunk set)
–
Similarity-based stateful data routing algorithm using
handprints (i.e., signature of fingerprint set)
•
Handprint-based intra-node redundancy suppression
–
Fine-grained chunk-level
–
Similarity index structure
Handprinting
•
The Generalization of Broder’s Theorem:
–
If the similarity of two chunk sets $S_1$ and $S_2$ is $R$, then
the probability of their sharing at least one fingerprint
in their $k$ smallest fingerprints is $1-(1-R)^k$
•
Handprinting: $k$ smallest fingerprints in chunk set
•
Handprint vs. Fingerprint
•
More features to support local stateful data routing
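The handprint definition and the probability stated above can be sketched as follows. This is an illustration under assumptions: SHA-1 is used as the chunk fingerprint, fingerprints are ordered as hex strings, and the names `fingerprints`, `handprint`, and `share_probability` are hypothetical.

```python
import hashlib


def fingerprints(chunks):
    """Set of SHA-1 fingerprints (hex strings) for a chunk set."""
    return {hashlib.sha1(chunk).hexdigest() for chunk in chunks}


def handprint(chunks, k=8):
    """The k smallest fingerprints of the chunk set, in ascending order."""
    return sorted(fingerprints(chunks))[:k]


def share_probability(similarity, k):
    """Probability, as stated on the slide, that two chunk sets with the
    given similarity R share at least one fingerprint among their k
    smallest fingerprints: 1 - (1 - R)^k."""
    return 1 - (1 - similarity) ** k
```

For example, with R = 0.5 the expression gives 0.5 for k = 1 and about 0.996 for k = 8, which is why a small handprint can already detect resemblance well.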
The Strong Ability of Handprinting in
Resemblance Detection
•
4 ~ 32 representative fingerprints approaches the
System Architecture
[Figure: system architecture. Labels: Backup Clients, Director, Data Partitioning, Chunk Fingerprinting, Similarity-aware Data Routing, Similarity Index Lookup, Chunk Fingerprint Caching, Parallel Container Management, Backup Session Management, File Recipe Management. Arrows: fingerprint lookup, chunk transfer, chunk metadata update, file metadata read & write.]
Similarity-based Stateful Data Routing
[Figure: a superchunk whose chunks have fingerprints FP1, FP2, FP4, FP7, FP8, FP10, FP15, with handprint FP2, FP7, FP15. Steps: (1) handprint extraction; (2) node mapping (FP2/N, FP7/N, FP15/N); (3) resemblance lookup (req/ack); (4) resemblance discount; (5) superchunk routing.]
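The five routing steps in the figure can be sketched roughly as below. Several details are assumptions made for illustration only: fingerprints are integers, node mapping is fingerprint modulo the number of nodes, resemblance is the count of handprint fingerprints found in a node's similarity index, and the discount divides resemblance by the node's storage usage relative to the cluster average. The `Node` class and `route_superchunk` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    usage: float = 0.0  # bytes stored on this node (assumed field)
    similarity_index: set = field(default_factory=set)  # fingerprints seen


def route_superchunk(handprint, nodes):
    """Return the id of the node that should receive the superchunk.

    handprint: the superchunk's representative fingerprints (step 1 is
    assumed to have produced them already).
    """
    n = len(nodes)
    # (2) Node mapping: each handprint fingerprint names one candidate node.
    candidates = sorted({fp % n for fp in handprint})
    total = sum(node.usage for node in nodes)
    average = total / n if total > 0 else 1.0

    def discounted_resemblance(node_id):
        node = nodes[node_id]
        # (3) Resemblance lookup: how many handprint entries the node knows.
        resemblance = len(node.similarity_index.intersection(handprint))
        # (4) Resemblance discount: penalize nodes that are fuller than average.
        relative_usage = max(node.usage / average, 1e-9)
        return resemblance / relative_usage

    # (5) Superchunk routing: best discounted score, ties go to the emptier node.
    return max(candidates,
               key=lambda nid: (discounted_resemblance(nid), -nodes[nid].usage))
```

Only the candidate nodes are queried, so the number of lookup messages per superchunk is bounded by the handprint size rather than by the cluster size.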
Key Data Structures
[Figure: key data structures. RAM: Similarity Index with columns RFP and CID (rows a1cb / 359, ..., ef2d / 764); Chunk Fingerprint Cache with columns CID and Fingerprints (802, ..., 513; 3c5e, f76a, ...; e43b, 9fd1, ...). Disk Array: containers, each holding container metadata and data chunks.]
Evaluation
•
Evaluation goals:
–
Parallel deduplication efficiency in a single node
–
Cluster deduplication efficiency
•
Experiment platform
–
Quad-core 8-thread Intel X3440 2.53 GHz CPU, 16 GB
RAM
•
Workload
Datasets   Size (GB)   Dedupe Ratio
Linux      160         8.23 (CDC) / 7.96 (SC)
Evaluation Metrics
•
Dedupe efficiency (DE): “bytes saved per sec”
•
Normalized effective dedupe ratio (NEDR):
$\mathrm{NEDR} = \dfrac{\text{clusterDedupeRatio}}{\text{singleNodeDedupeRatio}} \times \dfrac{\alpha}{\alpha + \sigma}$
–
α: average storage usage in dedupe nodes
–
σ: standard deviation of storage usage in nodes
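A small sketch of the NEDR metric defined above. The function name `nedr` and its parameters are hypothetical, and using the population standard deviation for σ is an assumption, since the slide does not say which form is meant.

```python
from statistics import mean, pstdev


def nedr(cluster_dedupe_ratio, single_node_dedupe_ratio, node_usage):
    """Normalized effective dedupe ratio.

    node_usage: storage usage of each dedupe node (any consistent unit).
    alpha is the mean usage; sigma is taken here as the population
    standard deviation (an assumption of this sketch).
    """
    alpha = mean(node_usage)
    sigma = pstdev(node_usage)
    return (cluster_dedupe_ratio / single_node_dedupe_ratio) * alpha / (alpha + sigma)
```

A perfectly balanced cluster that matches the single-node dedupe ratio scores 1.0; both lost cross-node redundancy and storage skew pull the value below 1.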
The Performance of Similarity Index
Parallel Lookup
The number of locks affects the
performance of
parallel index lookup
Dedupe Efficiency in Single-node
Deduplication Server
We choose Fixed-size
Chunking and 4KB
Effective Deduplication Ratio
Our Σ-Dedupe can
achieve over 90%
space saving of costly
Number of Fingerprint Index Lookup
Messages
Σ-Dedupe has
almost the same
low overhead as
scalable schemes
Summary
Cluster Deduplication   Dedupe Ratio   Throughput   Data Skew   Overhead
Extreme Binning         Medium         High         Medium      Low
EMC Stateless           Medium         High         Medium      Low
EMC Stateful            High           Low          Low         High
Conclusions
•
Cluster deduplication can be improved by
exploiting both similarity and locality in data
streams.
•
The handprinting technique has a strong ability to
detect resemblance.
•
Σ-Dedupe nearly achieves the space efficiency of
the costly Stateful routing based scheme but only
at an overhead like the highly scalable Stateless
Computer and Information
Science and Engineering (CISE)
CISE Core Research Programs
CISE Office of the Assistant Director
•
Advanced Cyberinfrastructure (ACI)
–
Data
–
High Performance Computing
–
Networking/Cybersecurity
–
Software
•
Computing and Communications Foundations (CCF)
–
Algorithmic Foundations
–
Communication and Information Foundations
–
Software and Hardware Foundations
•
Computer and Network Systems (CNS)
–
Computer Systems Research
–
Networking Technology and Systems
•
Information and Intelligent Systems (IIS)
–
Human-Centered Computing
–
Information Integration and Informatics
–
Robust Intelligence
Who is the CISE community?
PI and Co-PI Departments for FY 2011 Awards Funded by NSF CISE
Computer Science & Information Science & Computer Engineering (CISE), 61%
Engineering (excluding
Interdisciplinary Centers, 4.5%
Sciences & Humanities, 24%
Snapshot of CISE FY 2012 Activities
Research Budget                 $865M
Number of Proposals             7,695
Number of Awards                1,741
Success Rate                    ~22%
Average Annualized Award Size   $200K
Number of Panels Held           316
Number of People Supported      18,460
  Senior Researchers            8,417
  Other Professionals           943
  Postdoctoral Associates       371
  Graduate Students             6,131
  Undergraduate Students
Applying to Core Programs
•
Program Solicitations:
–
CCF: NSF 12-581
–
CNS: NSF 12-582
–
IIS: NSF 12-580
•
Project Types:
–
Large: $1,200,001 to $3,000,000; up to 5 years, collaborative teams
–
Medium: $500,001 to $1,200,000; up to 4 years, multi-investigator teams
–
Small: up to $500,000; up to 3 years, one or two investigator projects
•
CISE-wide Submission Windows (to be adjusted for 2013 and beyond):
–
Large: November 1 – 30, annually
–
Medium: September 15 – 30, annually
–
Small: December 3 – 17, annually
•
PI Limit:
–
Participate in no more than 2 “core” proposals/year
Coordinated
Solicitations
Selected CISE Cross-Cutting Programs
•
Cross-Directorate
–
Secure and Trustworthy Cyberspace (SaTC)
Securing our Nation’s cyberspace from malicious behavior, while preserving privacy and promoting
usability.
–
Cyber-Physical Systems (CPS)
Integrating computation, communication, and control into physical systems.
–
Cyber-Enabled Sustainability and Science (CyberSEES); Hazard SEES
Two programs under the Science, Engineering, and Education for Sustainability (SEES) umbrella
–
Exploiting Parallelism and Scalability (XPS)
Groundbreaking research leading to a new era of parallel computing
–
Enhancing Access to the Radio Spectrum (EARS)
Enhancing access to wireless service and/or efficiency with which radio spectrum is used.
–
Computing Education for the 21st Century (CE21)
Increasing number and diversity of students and educators in computing education and learning.
–
Cyberlearning: Transforming Education (CTE)
Designing and implementing technologies to aid and understand learning.
For a comprehensive list of CISE funding opportunities, visit:
http://www.nsf.gov/funding/pgm_list.jsp?org=CISE
Selected CISE Cross-Cutting Programs
•
Cross-Division
–
Expeditions in Computing
Exploring new frontiers in computing and information science.
•
Cross-Agency
–
Core Techniques and Technologies for Advancing Big Data Science &
Engineering (BIG DATA)
Developing tools to manage and analyze data in order to extract knowledge from data.
–
National Robotics Initiative (NRI)
Developing and using robots that work alongside, or cooperatively with, people.
–
Smart and Connected Health (SCH)
Transforming healthcare knowledge and delivery, and improving quality of life through
IT.
Can we continue the exponential growth in computational
power (Moore’s Law) in the coming decades?
Research to Expand the Limits of Computation
Happening now
• Architectural innovations with multi-core and many-core
• Domain-specific integrated circuits
• Energy-efficient computing and new processor architectures
Mid-term solutions
• Need to fully exploit broadly available concurrency and parallelism
• Algorithmic innovations exploiting parallelism
• Software systems leading to improved performance
Long-term solutions
• New materials (e.g., carbon nanotubes, graphene-based devices)
• Non-charge transfer devices (e.g., electron spin)
Exploiting Parallelism and Scalability (XPS)
Support groundbreaking research that will lead to a new era of parallel computing.
Foundational Principles
• New models guiding parallel algorithm design on diverse platforms
• Optimization for resources (energy, bandwidth, memory hierarchy)
Cross-layer Approaches
• Re-thinking/re-designing the hardware and software stack
• Coordination across all layers
Scalable Distributed Architectures
• Highly scalable and parallel architectures for people and things connected everywhere
• Runtime platforms and virtualization tools
Domain-specific Design
• Exploiting domain knowledge to improve programmability and performance
•
Goal is to establish new
collaborations combining
expertise cutting across
abstraction, software, and
hardware layers.
•
Each proposal must have
two, or more, PIs providing
different and distinct
From Data to Knowledge to
Action
Data represent a transformative new
currency for science, engineering,
Federal Big Data R&D Initiative
(White House launch on March 29, 2012)
•
Cross-agency “Big Data” Senior Steering
Group, chartered in spring 2011 by the White
House OSTP:
•
Co-chaired by NSF and NIH
•
Significant research community input
•
Major Announcements: NSF, NIH, USGS,
DoD, DARPA, DOE
•
NEW PROGRAM: Core Techniques and
Technologies for Advancing Big Data Science
& Engineering (BIG DATA)
•
All NSF Directorates and 8 NIH Institutes
•
Research thrusts: Collection, Storage,
and Management; Data Analytics;
Research in Data Sharing and
Collaboration
Foundational research to develop new techniques and
technologies to derive knowledge from data
New cyberinfrastructure to manage, curate,
and serve data to research communities
New approaches for education and
workforce development
New types of inter-disciplinary collaborations,
grand challenges, and competitions
Core Techniques and Technologies for Advancing Big
Data Science & Engineering (BIG DATA)
•
Foundational research for managing, analyzing, visualizing, and
extracting knowledge from large, diverse, distributed, and heterogeneous
data sets.
•
New solicitation to be issued for FY 2013
Collection, Storage, and Management of “Big Data”
• Data representation, storage, and retrieval
• New parallel data architectures, including clouds
• Data management policies, including privacy and access
• Communication and storage devices with extreme capacities
• Sustainable economic models for access and preservation
Data Analytics
• Computational, mathematical, statistical, and algorithmic techniques for modeling high dimensional data
• Learning, inference, prediction, and knowledge discovery for large volumes of dynamic data sets
• Data mining to enable automated hypothesis generation, event correlation, and anomaly detection
• Information infusion of multiple data
Data Sharing and Collaboration
• Tools for distant data sharing, real-time visualization, and software reuse of complex data sets
• Cross-disciplinary information and knowledge sharing
• Remote operation and real-time access to distant data sources and instruments