An Application-Aware Approach to Systems Support for Big Data

(1)

Hong Jiang

National Science Foundation

&

Department of Computer Science & Engineering

University of Nebraska – Lincoln

(2)

Data Deluge

Social Network

Business Intelligence

Scientific Simulation

Mobile Apps

3,900 tweets per second
275 EB data flowing per day in 2020

(3)

(4)

The Vs of Big Data

Volume

Velocity

(5)

Big Data Characteristics:

Volume

• From 130 EB of stored data in 2005 to 40,000 EB in 2020, as estimated by IDC

(6)

Big Data Characteristics:

Velocity

• YouTube: 100 hours of video uploaded per minute
• Twitter: 3,900 tweets per second

(7)

Big Data Characteristics:

Variety

• Structured, semi-structured, and unstructured data (transactions, sensor, text, audio, image, video, log, etc.)

• 80% of an organization’s data

(8)

Big Data Characteristics:

Veracity

• Trustworthiness of data

• 1 in 3 business leaders don’t trust the information they use to make decisions

(9)

Big Data, Big Challenges

• Challenges to Computer Systems Research
– Scalability! → In-memory computing
  • Capturing, Delivering, Storing
  • Real, or Near-Real Time Processing
  • Indexing and Searching
– Data protection: reliability, availability, security
  • Against hardware/software faults
  • Against malicious attacks
– Volume reduction & “Sanitizing”
  • Flash SSDs/SCM
– OS storage stack overhaul!

(10)

(11)

Application-aware approach to systems support for big data
In-memory computing support: application-aware data filtering, indexing & search, caching
Application-aware data protection: data backup & restore, error-tolerant data management
Volume reduction & sanitizing: application-aware data deduplication and provenance-aware cleansing

Research on Cross-Cutting Interfaces

(12)

Honey, I shrunk the Data!

• Data deduplication

– Using a hash signature to uniquely identify a data chunk
– Secure hash signatures: MD5, SHA-1, SHA-256, Tiger…
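The idea of identifying chunks by hash signature can be sketched as follows. This is a minimal illustration, not the system described in the talk: it assumes fixed-size 4 KB chunks and SHA-1 fingerprints, and the names `deduplicate` and `restore` are hypothetical.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each unique chunk once.

    Assumption: fixed-size chunking and SHA-1 are illustrative choices only.
    """
    store = {}    # fingerprint -> chunk bytes (unique chunks only)
    recipe = []   # ordered fingerprints needed to rebuild the input
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # store only the first copy
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte stream from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

With three identical 4 KB chunks followed by one different chunk, the store holds two chunks while the recipe still lists four fingerprints, which is where the space saving comes from.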

Benefits
• Reduces the storage space requirement for big data
• Minimizes the network

(13)

Deduplication Dilemmas

• Challenge:
– High dedup ratio & throughput
– @ low RAM & CPU cost
– Reliability

(14)

State-of-the-art Approaches

• Locality-based approaches:
– DDFS, Sparse Indexing, ChunkStash
– Minimize the accesses to the on-disk index
• Similarity-based approaches:
– Extreme Binning
– Only one on-disk index access per file

(15)

• These approaches fail when data streams lack either or both locality and similarity!
• Our solution is SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput

(16)

Motivation and Observation

• Redundancy observation of small and large files

                 Small files (≤ 64 KB)   Large files (≥ 2 MB)
% of file #      ≥ 80%                   ≤ 20%
% of space       ≤ 20%                   ≥ 80%

Small files: grouping many highly correlated small files into a segment to minimize dedupe overheads
Large files: dividing the large files into many small segments to expose more similarity
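A rough way to picture both observations at once is to pack the backup stream into fixed-size segments, so that adjacent small files share a segment and a large file spans several. The sketch below is an assumption-laden illustration, not SiLo's actual segmenting algorithm: `build_segments` is a hypothetical name, and the 4 MB default only echoes the segment size quoted later in the deck.

```python
def build_segments(files, segment_size=4 * 1024 * 1024):
    """Pack a file stream into segments of at most segment_size bytes.

    files: list of (name, size_in_bytes) in backup-stream order.
    Returns a list of segments; each segment is a list of
    (name, offset, length) extents. Hypothetical helper for illustration.
    """
    segments, current, used = [], [], 0
    for name, size in files:
        offset = 0
        while offset < size:
            # Take as much of this file as still fits in the open segment.
            take = min(size - offset, segment_size - used)
            current.append((name, offset, take))
            used += take
            offset += take
            if used == segment_size:      # segment is full, start a new one
                segments.append(current)
                current, used = [], 0
    if current:                           # flush the last partial segment
        segments.append(current)
    return segments
```

Small neighbouring files end up grouped in one segment, while a file larger than `segment_size` is divided across several segments.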

(17)

Intuition

• Combining and complementing similarity and locality
[Figure: (a) Similarity approach. Labels: Existing data stream, Input data stream, Locality Enhancement, Potential duplicate, Similar]

(18)

SiLo Architecture

[Figure: SiLo architecture, showing the File Daemon, Backup Server, Deduplication Server, and Storage Servers connected over a network]
Deduplication Server stores and looks up all fingerprints of files and chunks.
Backup Server manages the backup system and directs all File Agents and Storage Servers.
Storage Server stores backed-up data.
File Daemon provides a functional interface in users’

(19)

Deduplication Server

[Figure: Deduplication Server data structures. RAM: SHTable (Similarity Hash Table), LHTable, Read Cache, Write Buffer, Locality Cache. DISK: Blocks.]
The similarity unit (sequence of chunks)
The locality unit (sequence of

(20)

Duplicate Elimination

SiLo achieves near-exact duplicate elimination for all workloads
[Figure legend: SiLo with segment size of 4 MB; the similarity approach: Extreme Binning; the locality approach: ChunkStash-HDD]

(21)

RAM Usage for Indexing

SiLo consumes a RAM capacity that is only 1/41∼1/60 and 1/3∼1/90
Extreme Binning performs poorly on the Linux-set.

(22)

Deduplication Throughput

SiLo outperforms ChunkStash by a factor of about 3 and Extreme Binning by a factor of about 1.5

(23)

Summary of SiLo

• SiLo, a near-exact deduplication system
✓ Addresses the scalability of deduplication indexing in big data environments
• Combination of similarity and locality
✓ Mines the similarity and locality characteristics in deduplication-based storage systems
• We are working on applying deduplication to

(24)

Cluster Deduplication
• To scale data deduplication to PB or EB level datasets
• Cluster deduplication can satisfy scalable capacity and performance requirements in Big Data storage
– Data routing for assigning data to appropriate deduplication nodes
– Intra-node independent

(25)

Challenges of Cluster Deduplication
• Chunk-index lookup disk bottleneck
– The chunk index of a large dataset is too big to fit into the limited RAM of the deduplication server
– Parallel lookup performance of multi-stream degrades significantly due to frequent and random disk I/Os
• Deduplication node information island
– Deduplication is only performed within individual servers due to overhead considerations, and leaves cross-node redundancy untouched

(26)

The State of the Art

• Locality-based optimization mechanisms
– NEC: HYDRAstor (large chunk, DHT-based stateless routing)
– EMC: Data Domain Global Deduplication Array (super-chunk, stateless routing & stateful routing)
• Similarity-based optimization strategies
– HP: Extreme Binning (file similarity, stateless routing)
– Symantec: file routing middleware (file similarity, stateless routing)
– EMC: content-aware load balancing (client similarity, stateless routing)
• Stateful vs. Stateless Routing
• Challenges
– The fingerprint-based routing schemes have failed to achieve a good tradeoff among capacity saving, throughput, and scalability in large clusters

(27)

• Our solution is Σ-Dedupe, a scheme that optimizes cluster deduplication by exploiting data similarity and locality in backup data streams
• Novel data routing for assigning data to nodes
– Coarse-grained super-chunk (i.e., a consecutive chunk set)
– Similarity-based stateful data routing algorithm using handprints (i.e., signature of fingerprint set)
• Handprint-based intra-node redundancy suppression
– Fine-grained chunk-level
– Similarity index structure

(28)

Handprinting
• The Generalization of Broder’s Theorem:
– If the similarity of two chunk sets $S_1$ and $S_2$ is $R$, then the probability of their sharing at least one fingerprint in their $k$ smallest fingerprints is $1 - (1 - R)^k$
• Handprinting: $k$ smallest fingerprints in chunk set
• Handprint vs. Fingerprint
• More features to support local stateful data routing
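The handprint definition and the probability expression above can be sketched in a few lines. This is an illustration under stated assumptions: SHA-1 hex digests stand in for chunk fingerprints, "smallest" is taken as lexicographic order of those digests, and the function names are hypothetical.

```python
import hashlib


def fingerprints(chunks):
    """Set of chunk fingerprints (SHA-1 hex digests, an illustrative choice)."""
    return {hashlib.sha1(chunk).hexdigest() for chunk in chunks}


def handprint(chunks, k=8):
    """The k smallest fingerprints of a chunk set, as defined on the slide."""
    return set(sorted(fingerprints(chunks))[:k])


def share_probability(r, k):
    """Expression from the slide: 1 - (1 - R)^k for similarity R and handprint size k."""
    return 1 - (1 - r) ** k
```

For example, with similarity R = 0.5 the expression gives 0.5 for k = 1, 0.75 for k = 2, and about 0.996 for k = 8, which illustrates why a handful of representative fingerprints detects resemblance far more reliably than a single one.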

(29)

The Strong Ability of Handprinting in Resemblance Detection
• 4 ~ 32 representative fingerprints approaches the

(30)

System Architecture

[Figure: system architecture. Components: Backup Clients, Director. Functions: Data Partitioning, Chunk Fingerprinting, Similarity-aware Data Routing, Similarity Index Lookup, Chunk Fingerprint Caching, Parallel Container Management, Backup Session Management, File Recipe Management. Flows: fingerprint lookup, chunk transfer, chunk metadata update, file metadata read & write.]

(31)

Similarity based Stateful Data Routing
[Figure: a superchunk whose chunks have fingerprints FP1, FP2, FP4, FP7, FP8, FP10, FP15; its handprint is FP2, FP7, FP15, which maps to candidate nodes FP2/N, FP7/N, FP15/N]
(1) handprint extraction
(2) node mapping
(3) resemblance lookup (req/ack)
(4) resemblance discount
(5) superchunk routing
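Steps (2) to (5) of the routing figure can be sketched as below. Several details are assumptions rather than facts from the slides: fingerprints are plain integers, the "FP/N" labels are read as fingerprint modulo the node count, the discount divides resemblance by a node's storage usage relative to the cluster average, and `route_superchunk` and the node dictionary layout are hypothetical.

```python
def route_superchunk(handprint, nodes):
    """Pick a target node for a superchunk from its (non-empty) handprint.

    handprint: iterable of integer fingerprints (step 1 is done by the caller).
    nodes: list of dicts {"index": set of fingerprints, "usage": float}.
    Returns the chosen node id and records the handprint in that node's index.
    """
    n = len(nodes)
    wanted = set(handprint)
    # (2) node mapping: assumed to be fingerprint modulo the number of nodes
    candidates = sorted({fp % n for fp in wanted})
    average = sum(node["usage"] for node in nodes) / n

    def discounted_resemblance(node_id):
        node = nodes[node_id]
        # (3) resemblance lookup: handprint fingerprints the node already holds
        resemblance = len(wanted & node["index"])
        # (4) resemblance discount: assumed division by relative storage usage
        relative_usage = node["usage"] / average if average > 0 else 1.0
        return resemblance / max(relative_usage, 1e-9)

    # (5) superchunk routing: best discounted score, ties go to the emptier node
    target = max(candidates,
                 key=lambda i: (discounted_resemblance(i), -nodes[i]["usage"]))
    nodes[target]["index"].update(wanted)
    return target
```

Because only the handful of candidate nodes named by the handprint are consulted, the lookup cost stays small regardless of cluster size, while the discount keeps heavily loaded nodes from attracting every similar superchunk.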

(32)

Key Data Structures

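[Figure: key data structures. Similarity Index with RFP and CID columns (entries such as a1cb → 359 and ef2d → 764); Chunk Fingerprint Cache with CID and Fingerprints columns (CIDs 802 and 513; fingerprints 3c5e, f76a, ... and e43b, 9fd1, ...); containers holding Metadata and Data Chunks; labels RAM and Disk Array.]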

(33)

Evaluation
• Evaluation goals:
– Parallel deduplication efficiency in a single node
– Cluster deduplication efficiency
• Experiment platform
– Quad-core 8-thread Intel X3440 2.53 GHz CPU, 16 GB RAM
• Workload

Datasets   Size (GB)   Dedupe Ratio
Linux      160         8.23 (CDC) / 7.96 (SC)

(34)

Evaluation Metrics
• Dedupe efficiency (DE): “bytes saved per sec”
• Normalized effective dedupe ratio (NEDR): $\text{NEDR} = \frac{\text{clusterDedupeRatio}}{\text{singleNodeDedupeRatio}} \times \frac{\alpha}{\alpha + \sigma}$
– $\alpha$: average storage usage in dedupe nodes
– $\sigma$: standard deviation of storage usage in nodes
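The NEDR formula above translates directly into code. In this sketch the function name `nedr` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the slide does not say which variant is meant.

```python
from statistics import mean, pstdev


def nedr(cluster_dedupe_ratio, single_node_dedupe_ratio, node_usages):
    """Normalized effective dedupe ratio.

    node_usages: storage usage of each dedupe node (same unit for all nodes).
    Assumption: sigma is the population standard deviation.
    """
    alpha = mean(node_usages)      # average storage usage
    sigma = pstdev(node_usages)    # spread of storage usage across nodes
    return (cluster_dedupe_ratio / single_node_dedupe_ratio) * alpha / (alpha + sigma)
```

A perfectly balanced cluster that matches the single-node dedupe ratio scores 1.0; either lost cross-node redundancy or skewed storage usage pulls the value below 1.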

(35)

The Performance of Similarity Index Parallel Lookup
#Lock affects the performance of parallel index lookup

(36)

Dedupe Efficiency in Single-node Deduplication Server
We choose fixed-size chunking and 4KB

(37)

Effective Deduplication Ratio
Our Σ-Dedupe can achieve over 90% space saving of costly

(38)

Number of Fingerprint Index Lookup Messages
Σ-Dedupe has almost the same low overhead as scalable schemes

(39)

Summary

Cluster Deduplication   Dedupe Ratio   Throughput   Data Skew   Overhead
Extreme Binning         Medium         High         Medium      Low
EMC Stateless           Medium         High         Medium      Low
EMC Stateful            High           Low          High

(40)

Conclusions

• Cluster deduplication can be improved by exploiting both similarity and locality in data streams.
• The handprinting technique has a strong ability to detect resemblance.
• Σ-Dedupe nearly achieves the space efficiency of the costly Stateful routing based scheme, but only at an overhead like that of the highly scalable Stateless

(41)

Computer and Information Science and Engineering (CISE)

(42)

CISE Core Research Programs

CISE Office of the Assistant Director
• Advanced Cyberinfrastructure (ACI)
– Data
– High Performance Computing
– Networking/Cybersecurity
– Software
• Computing and Communications Foundations (CCF)
– Algorithmic Foundations
– Communication and Information Foundations
– Software and Hardware Foundations
• Computer and Network Systems (CNS)
– Computer Systems Research
– Networking Technology and Systems
• Information and Intelligent Systems (IIS)
– Human-Centered Computing
– Information Integration and Informatics
– Robust Intelligence

(43)

Who is the CISE community?

Computer Science & Information Science & Computer Engineering (CISE), 61%
Engineering (excluding
Interdisciplinary Centers, 4.5%
Sciences & Humanities, 24%
PI and Co-PI Departments for FY 2011 Awards Funded by NSF CISE

(44)

Snapshot of CISE FY 2012 Activities
CISE Research Budget: $865M
Number of Proposals: 7,695
Number of Awards: 1,741
Success Rate: ~22%
Average Annualized Award Size: $200K
Number of Panels Held: 316
Number of People Supported: 18,460
Senior Researchers: 8,417
Other Professionals: 943
Postdoctoral Associates: 371
Graduate Students: 6,131
Undergraduate Students

(45)

Applying to Core Programs

• Program Solicitations:
– CCF: NSF 12-581
– CNS: NSF 12-582
– IIS: NSF 12-580
• Project Types:
– Large: $1,200,001 to $3,000,000; up to 5 years, collaborative teams
– Medium: $500,001 to $1,200,000; up to 4 years, multi-investigator teams
– Small: up to $500,000; up to 3 years, one or two investigator projects
• CISE-wide Submission Windows (to be adjusted for 2013 and beyond):
– Large: November 1 – 30, annually
– Medium: September 15 – 30, annually
– Small: December 3 – 17, annually
• PI Limit:
– Participate in no more than 2 “core” proposals/year
Coordinated Solicitations

(46)

Selected CISE Cross-Cutting Programs
• Cross-Directorate
– Secure and Trustworthy Cyberspace (SaTC)
  Securing our Nation’s cyberspace from malicious behavior, while preserving privacy and promoting usability.
– Cyber-Physical Systems (CPS)
  Integrating computation, communication, and control into physical systems.
– Cyber-Enabled Sustainability and Science (CyberSEES); Hazard SEES
  Two programs under the Science, Engineering, and Education for Sustainability (SEES) umbrella
– Exploiting Parallelism and Scalability (XPS)
  Groundbreaking research leading to a new era of parallel computing
– Enhancing Access to the Radio Spectrum (EARS)
  Enhancing access to wireless service and/or efficiency with which radio spectrum is used.
– Computing Education for the 21st Century (CE21)
  Increasing number and diversity of students and educators in computing education and learning.
– Cyberlearning: Transforming Education (CTE)
  Designing and implementing technologies to aid and understand learning.
For a comprehensive list of CISE funding opportunities, visit:
http://www.nsf.gov/funding/pgm_list.jsp?org=CISE

(47)

Selected CISE Cross-Cutting Programs
• Cross-Division
– Expeditions in Computing
  Exploring new frontiers in computing and information science.
• Cross-Agency
– Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIG DATA)
  Developing tools to manage and analyze data in order to extract knowledge from data.
– National Robotics Initiative (NRI)
  Developing and using robots that work alongside, or cooperatively with, people.
– Smart and Connected Health (SCH)
  Transforming healthcare knowledge and delivery, and improving quality of life through IT.


(48)

Can we continue the exponential growth in computational power (Moore’s Law) in the coming decades?

(49)

Research to Expand the Limits of Computation
Happening now
• Architectural innovations with multi-core and many-core
• Domain-specific integrated circuits
• Energy-efficient computing and new processor architectures
Mid-term solutions
• Need to fully exploit broadly available concurrency and parallelism
• Algorithmic innovations exploiting parallelism
• Software systems leading to improved performance
Long-term solutions
• New materials (e.g., carbon nanotubes, graphene-based devices)
• Non-charge transfer devices (e.g., electron spin)

(50)

Exploiting Parallelism and Scalability (XPS)
Support groundbreaking research that will lead to a new era of parallel computing.
Foundational Principles
• New models guiding parallel algorithm design on diverse platforms
• Optimization for resources (energy, bandwidth, memory hierarchy)
Cross-layer Approaches
• Re-thinking/re-designing the hardware and software stack
• Coordination across all layers
Scalable Distributed Architectures
• Highly scalable and parallel architectures for people and things connected everywhere
• Runtime platforms and virtualization tools
Domain-specific Design
• Exploiting domain knowledge to improve programmability and performance
• Goal is to establish new collaborations combining expertise cutting across abstraction, software, hardware layers.
• Each proposal must have two, or more, PIs providing different and distinct

(51)

From Data to Knowledge to Action
Data represent a transformative new currency for science, engineering,

(52)

Federal Big Data R&D Initiative (WH Launch on March 29, 2012)
• Cross-agency “Big Data” Senior Steering Group, chartered in spring 2011 by the White House OSTP:
• Co-chaired by NSF and NIH
• Significant research community input
• Major Announcements: NSF, NIH, USGS, DoD, DARPA, DOE
• NEW PROGRAM: Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIG DATA)
• All NSF Directorates and 8 NIH Institutes
• Research thrusts: Collection, Storage, and Management; Data Analytics; Research in Data Sharing and Collaboration
Foundational research to develop new techniques and technologies to derive knowledge from data
New cyberinfrastructure to manage, curate, and serve data to research communities
New approaches for education and workforce development
New types of inter-disciplinary collaborations, grand challenges, and competitions

(53)

Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIG DATA)
• Foundational research for managing, analyzing, visualizing, and extracting knowledge from large, diverse, distributed, and heterogeneous data sets.
• New solicitation to be issued for FY 2013
Collection, Storage, and Management of “Big Data”
• Data representation, storage, and retrieval
• New parallel data architectures, including clouds
• Data management policies, including privacy and access
• Communication and storage devices with extreme capacities
• Sustainable economic models for access and preservation
Data Analytics
• Computational, mathematical, statistical, and algorithmic techniques for modeling high dimensional data
• Learning, inference, prediction, and knowledge discovery for large volumes of dynamic data sets
• Data mining to enable automated hypothesis generation, event correlation, and anomaly detection
• Information infusion of multiple data
Data Sharing and Collaboration
• Tools for distant data sharing, real-time visualization, and software reuse of complex data sets
• Cross-disciplinary information and knowledge sharing
• Remote operation and real-time access to distant data sources and instruments