(1)

An Application-Aware Approach to Systems Support for Big Data

Hong Jiang

National Science Foundation

&

Department of Computer Science & Engineering, University of Nebraska – Lincoln

(2)

Data Deluge

Social Network

Business Intelligence

Scientific Simulation

Mobile Apps

3,900 tweets per second

275 EB of data flowing per day in 2020

(3)
(4)

The Vs of Big Data

Volume

Velocity

(5)

Big Data Characteristics:

Volume

From 130 EB of stored data in 2005 to 40,000 EB in 2020, as estimated by IDC

(6)

Big Data Characteristics:

Velocity

YouTube: 100 hours of video uploaded per minute

Twitter: 3,900 tweets per second

(7)

Big Data Characteristics:

Variety

Structured, semi-structured, and unstructured data (transactions, sensor, text, audio, image, video, log, etc.)

80% of an organization’s data

(8)

Big Data Characteristics:

Veracity

Trustworthiness of data

1 in 3 business leaders don’t trust the information they use to make decisions

(9)

Big Data, Big Challenges

Challenges to Computer Systems Research

Scalability! → In-memory computing

Capturing, Delivering, Storing

Real, or Near-Real Time Processing

Indexing and Searching

Data protection: reliability, availability, security

Against hardware/software faults

Against malicious attacks

Volume reduction & “Sanitizing”

Flash SSDs/SCM

OS storage stack overhaul!

(10)
(11)

Application-aware approach to systems support for big data

In-memory computing support: application-aware data filtering, indexing & search, caching

Application-aware data protection: data backup & restore, error-tolerant data management

Volume reduction & sanitizing: application-aware data deduplication and provenance-aware cleansing

Research on Cross-Cutting Interfaces

(12)

Honey, I shrunk the Data!

Data deduplication

Using a hash signature to uniquely identify a data chunk

Secure hash Signature: MD5, SHA-1, SHA-256, Tiger…
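As a minimal sketch of the idea on this slide (not the implementation of any system named in the deck), the Python below splits a byte stream into fixed-size chunks, fingerprints each chunk with SHA-1 from the standard `hashlib` module, and stores a chunk only the first time its fingerprint is seen. The names `dedupe` and `restore`, the 4096-byte chunk size, and the in-memory dictionary standing in for the chunk index are all illustrative assumptions.

```python
import hashlib


def dedupe(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy per fingerprint.

    Returns (store, recipe): store maps fingerprint -> chunk bytes,
    recipe is the ordered fingerprint list needed to rebuild the data.
    """
    store = {}   # fingerprint -> unique chunk (stand-in for the chunk index)
    recipe = []  # ordered fingerprints describing the original stream
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()  # hash signature of the chunk
        if fp not in store:                   # first occurrence: store it once
            store[fp] = chunk
        recipe.append(fp)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte stream from the recipe."""
    return b"".join(store[fp] for fp in recipe)


if __name__ == "__main__":
    data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
    store, recipe = dedupe(data)
    print(len(recipe), "chunks,", len(store), "unique")
```

Only the unique chunks and the recipe need to be stored or transferred, which is where the space saving comes from.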

Benefits

• Reduces the storage space requirement for big data

• Minimizes the network

(13)

Deduplication Dilemmas

Challenge: high dedup ratio & throughput at low RAM & CPU cost

Reliability

(14)

State-of-the-art Approaches

• Locality based approaches: DDFS, Sparse Indexing, ChunkStash
  Index: minimize the accesses to the on-disk index

• Similarity based approaches: Extreme Binning
  Index: only one on-disk access per file

(15)

These approaches fail when data streams lack locality, similarity, or both!

Our solution is SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput

(16)

Motivation and Observation

Redundancy observation of small and large files

              Small files (≤ 64 KB)   Large files (≥ 2 MB)
% of file #   80%                     20%
% of space    20%                     80%

Small files: grouping many highly correlated small files into a segment to minimize dedupe overheads

Large files: dividing the large files into many small segments to expose more similarity
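The two strategies on this slide can be sketched with a single stream-packing routine. This is a simplified illustration, not SiLo's actual segmenting algorithm: it assumes that files adjacent in the backup stream are the "highly correlated" ones, uses a hypothetical function name `build_segments`, and takes the segment size as a parameter (the default of 4 MB is only an example value).

```python
def build_segments(files, segment_size=4 * 1024 * 1024):
    """Pack small files together and split large files into segments.

    files: iterable of (name, size_in_bytes) in backup-stream order.
    Returns a list of segments; each segment is a list of
    (name, offset, length) extents totalling at most segment_size bytes.
    Zero-length files contribute no extent and are skipped.
    """
    segments = []
    current, used = [], 0
    for name, size in files:
        offset = 0
        while offset < size:
            # Take as much of this file as still fits in the open segment.
            take = min(size - offset, segment_size - used)
            current.append((name, offset, take))
            used += take
            offset += take
            if used == segment_size:  # segment is full: close it
                segments.append(current)
                current, used = [], 0
    if current:                       # flush the last, partially filled segment
        segments.append(current)
    return segments


if __name__ == "__main__":
    demo = [("a.txt", 1), ("b.txt", 2), ("big.iso", 10)]
    for seg in build_segments(demo, segment_size=4):
        print(seg)
```

Small neighbouring files end up sharing one segment, while a file larger than the segment size is spread across several segments.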

(17)

Intuition

Combining and complementing similarity and locality

[Figure: (a) Similarity approach. Labels: existing data stream, input data stream, locality enhancement, potential duplicate, similar.]

(18)

SiLo Architecture

[Figure: SiLo architecture. File Daemons, a Backup Server, a Deduplication Server, and Storage Servers (each with a Storage Agent, a Container Store, and disks) connected over a network.]

Deduplication Server stores and looks up all fingerprints of files and chunks.

Backup Server manages the backup system and directs all File Agents and Storage Servers.

Storage Server stores backed-up data.

File Daemon provides a functional interface in users’

(19)

Deduplication Server

[Figure: Deduplication Server layout. RAM holds the SHTable, the LHTable, a read cache, and a write buffer; the disk holds the blocks.]

SHTable: Similarity Hash Table

Read cache: Locality Cache

The similarity unit (sequence of chunks)

The locality unit (sequence of

(20)

Duplicate Elimination

SiLo achieves near-exact duplicate elimination for all workloads

SiLo with segment size of 4 MB

The similarity approach: Extreme Binning

Locality approach: ChunkStash-HDD

(21)

RAM Usage for Indexing

SiLo consumes a RAM capacity that is only 1/41 to 1/60 and 1/3 to 1/90

Extreme Binning performs poorly on the Linux-set.

(22)

Deduplication Throughput

SiLo outperforms ChunkStash by a factor of about 3 and Extreme Binning by a factor of about 1.5

(23)

Summary of SiLo

SiLo, a near-exact deduplication system

✓ Addresses the scalability of deduplication indexing in big data environments

Combination of similarity and locality

✓ Mining the similarity and locality characteristics in deduplication-based storage systems

We are working on applying deduplication to

(24)

Cluster Deduplication

To scale data deduplication to PB or EB level datasets

Cluster deduplication can satisfy scalable capacity and performance requirements in Big Data storage

Data routing for assigning data to appropriate deduplication nodes

Intra-node independent

(25)

Challenges of Cluster Deduplication

Chunk-index lookup disk bottleneck
• The chunk index of a large dataset is too big to fit into the limited RAM of the deduplication server
• Parallel lookup performance of multi-stream degrades significantly due to frequent and random disk I/Os

Deduplication node information island
• Deduplication is only performed within individual servers due to overhead considerations, and leaves cross-node redundancy untouched

(26)

The State of the Art

Locality based optimization mechanisms
• NEC: HYDRAstor (large chunk, DHT based stateless routing)
• EMC: Data Domain Global Deduplication Array (super-chunk, stateless routing & stateful routing)

Similarity based optimization strategies
• HP: Extreme Binning (file similarity, stateless routing)
• Symantec: file routing middleware (file similarity, stateless routing)
• EMC: content-aware load balancing (client similarity, stateless routing)

Stateful vs. Stateless Routing

Challenges

The fingerprint-based routing schemes have failed to achieve a good tradeoff among capacity saving, throughput, and scalability in large clusters

(27)

Our solution is Σ-Dedupe, a scheme that optimizes cluster deduplication by exploiting data similarity and locality in backup data streams

Novel data routing for assigning data to nodes
• Coarse-grained super-chunk (i.e., a consecutive chunk set)
• Similarity based stateful data routing algorithm using handprints (i.e., signature of fingerprint set)

Handprint based intra-node redundancy suppression
• Fine-grained chunk-level
• Similarity index structure

(28)

Handprinting

The Generalization of Broder’s Theorem: If the similarity of two chunk sets $S_1$ and $S_2$ is $R$, then the probability of their sharing at least one fingerprint in their $k$ smallest fingerprints is $1-(1-R)^k$

Handprinting: $k$ smallest fingerprints in chunk set

Handprint vs. Fingerprint

More features to support local stateful data routing
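To make the definition concrete, here is a small sketch that extracts a handprint (the k smallest fingerprints of a chunk set) and evaluates the expression 1 − (1 − R)^k quoted above. The function names, the use of SHA-1 hex digests as fingerprints, and the default k = 8 are assumptions made for illustration; the slides do not specify them.

```python
import hashlib


def fingerprints(chunks):
    """SHA-1 fingerprint (hex string) of every chunk in a chunk set."""
    return {hashlib.sha1(c).hexdigest() for c in chunks}


def handprint(chunks, k=8):
    """The k smallest fingerprints of the chunk set.

    Equal-length hex digests sort lexically in the same order as their
    numeric values, so a plain string sort is sufficient here.
    """
    return set(sorted(fingerprints(chunks))[:k])


def share_probability(r, k):
    """Evaluate 1 - (1 - R)^k for similarity r and handprint size k."""
    return 1.0 - (1.0 - r) ** k


if __name__ == "__main__":
    s1 = [bytes([i]) * 16 for i in range(100)]
    s2 = [bytes([i]) * 16 for i in range(50, 150)]
    common = handprint(s1) & handprint(s2)
    print(len(common), "shared representative fingerprints")
    for k in (1, 4, 8, 16, 32):
        print(k, round(share_probability(0.5, k), 4))
```

The loop shows how quickly the expression approaches 1 as k grows, which is the reason a handful of representative fingerprints can stand in for the full fingerprint set.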

(29)

The Strong Ability of Handprinting in Resemblance Detection

4 ~ 32 representative fingerprints approaches the

(30)

System Architecture

[Figure: system architecture. Component labels:]

Backup Clients

Director

Data Partitioning

Chunk Fingerprinting

Similarity-aware Data Routing

Similarity Index Lookup

Chunk Fingerprint Caching

Parallel Container Management

Backup Session Management

File Recipe Management

[Arrow labels: fingerprint lookup; chunk transfer; chunk metadata update; file metadata read & write]

(31)

Similarity based Stateful Data Routing

[Figure: a superchunk whose chunks have fingerprints FP1, FP2, FP4, FP7, FP8, FP10, FP15; its handprint is FP2, FP7, FP15; candidate nodes are labeled FP2/N, FP7/N, FP15/N and exchange req/ack messages.]

(1) handprint extraction

(2) node mapping

(3) resemblance lookup

(4) resemblance discount

(5) superchunk routing
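The five numbered steps can be sketched as follows. This is an illustrative reading of the figure rather than the Σ-Dedupe implementation: it assumes the FP/N labels mean "fingerprint modulo the number of nodes", that resemblance is the count of handprint fingerprints a candidate node already holds, and that the discount divides this count by the node's storage usage relative to the cluster average. The function name `route_superchunk` and its parameters are hypothetical.

```python
def route_superchunk(chunk_fps, nodes, usage, k=3):
    """Pick a target node for one superchunk.

    chunk_fps: integer fingerprints of the chunks in the superchunk.
    nodes:     list of sets; nodes[i] holds the representative
               fingerprints already indexed on node i.
    usage:     list with the amount of data stored on each node.
    """
    n = len(nodes)
    # (1) handprint extraction: the k smallest fingerprints
    handprint = sorted(set(chunk_fps))[:k]
    # (2) node mapping: each representative fingerprint selects one candidate
    candidates = sorted({fp % n for fp in handprint})
    # (3) resemblance lookup: count handprint hits on each candidate
    resemblance = {c: sum(fp in nodes[c] for fp in handprint) for c in candidates}
    # (4) resemblance discount: penalise nodes storing more than the average
    average = sum(usage) / n or 1.0
    score = {c: resemblance[c] / max(usage[c] / average, 1e-9) for c in candidates}
    # (5) superchunk routing: best discounted resemblance wins;
    #     ties go to the emptier node, then to the lower node id
    target = max(candidates, key=lambda c: (score[c], -usage[c], -c))
    nodes[target].update(handprint)  # remember the handprint on the target
    return target


if __name__ == "__main__":
    nodes = [set() for _ in range(4)]
    usage = [0, 0, 0, 0]
    print(route_superchunk([15, 10, 8, 7, 4, 2, 1], nodes, usage))
```

The caller is expected to update `usage` after transferring the superchunk; only the candidate nodes are contacted, which keeps the lookup traffic far below a broadcast to every node.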

(32)

Key Data Structures

[Figure: key data structures, spanning RAM and a disk array of containers.]

Similarity Index: RFP → CID (e.g., a1cb → 359, ..., ef2d → 764)

Chunk Fingerprint Cache: Fingerprints → CID (e.g., 3c5e, f76a, ... → 802; ...; e43b, 9fd1, ... → 513)

Disk Array: containers

Container: Metadata, Data Chunks

(33)

Evaluation

Evaluation goals:
• Parallel deduplication efficiency in a single node
• Cluster deduplication efficiency

Experiment platform
• Quad-core 8-thread Intel X3440 2.53 GHz CPU, 16 GB RAM

Workload

Datasets   Size (GB)   Dedupe Ratio
Linux      160         8.23 (CDC) / 7.96 (SC)

(34)

Evaluation Metrics

Dedupe efficiency (DE): “bytes saved per sec”

Normalized effective dedupe ratio (NEDR):

$$\mathrm{NEDR} = \frac{\text{clusterDedupeRatio}}{\text{singleNodeDedupeRatio}} \times \frac{\alpha}{\alpha + \sigma}$$

α: average storage usage in dedupe nodes

σ: standard deviation of storage usage in nodes
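A short sketch of how NEDR could be computed from per-node storage usage. The function name `nedr` is hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) is an assumption, since the slide does not say which estimator is used.

```python
from statistics import mean, pstdev


def nedr(cluster_dedupe_ratio, single_node_dedupe_ratio, node_usage):
    """Normalized effective dedupe ratio.

    node_usage: storage used on each dedupe node (any consistent unit).
    """
    alpha = mean(node_usage)    # average storage usage across nodes
    sigma = pstdev(node_usage)  # spread of storage usage (assumed: population std dev)
    return (cluster_dedupe_ratio / single_node_dedupe_ratio) * alpha / (alpha + sigma)


if __name__ == "__main__":
    # Perfectly balanced cluster that matches the single-node ratio.
    print(nedr(8.0, 8.0, [100, 100, 100, 100]))
    # Same dedupe ratio, but skewed storage usage lowers the score.
    print(nedr(8.0, 8.0, [50, 150]))
```

The second factor equals 1 only when every node stores the same amount, so the metric rewards both high cluster-wide deduplication and low data skew.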

(35)

The Performance of Similarity Index Parallel Lookup

#Lock affects the performance of parallel index lookup

(36)

Dedupe Efficiency in Single-node Deduplication Server

We choose Fixed-sized Chunking and 4KB

(37)

Effective Deduplication Ratio

Our Σ-Dedupe can achieve over 90% space saving of costly

(38)

Number of Fingerprint Index Lookup Messages

Σ-Dedupe has almost the same low overhead as scalable schemes

(39)

Summary

Cluster Deduplication   Dedupe Ratio   Throughput   Data Skew   Overhead
Extreme Binning         Medium         High         Medium      Low
EMC Stateless           Medium         High         Medium      Low
EMC Stateful            High           Low          Low         High

(40)

Conclusions

Cluster deduplication can be improved by exploiting both similarity and locality in data streams.

The handprinting technique has a strong ability to detect resemblance.

Σ-Dedupe nearly achieves the space efficiency of the costly Stateful routing based scheme but only at an overhead like the highly scalable Stateless

(41)

Computer and Information Science and Engineering (CISE)

(42)

CISE Core Research Programs

CISE Office of the Assistant Director

Advanced Cyberinfrastructure (ACI)
• Data
• High Performance Computing
• Networking/Cybersecurity
• Software

Computing and Communications Foundations (CCF)
• Algorithmic Foundations
• Communication and Information Foundations
• Software and Hardware Foundations

Computer and Network Systems (CNS)
• Computer Systems Research
• Networking Technology and Systems

Information and Intelligent Systems (IIS)
• Human-Centered Computing
• Information Integration and Informatics
• Robust Intelligence

(43)

Who is the CISE community?

• Computer Science & Information Science & Computer Engineering (CISE), 61%
• Engineering (excluding
• Interdisciplinary Centers, 4.5%
• Sciences & Humanities, 24%

PI and Co-PI Departments for FY 2011 Awards Funded by NSF CISE

(44)

Snapshot of CISE FY 2012 Activities

CISE
Research Budget: $865M
Number of Proposals: 7,695
Number of Awards: 1,741
Success Rate: ~22%
Average Annualized Award Size: $200K
Number of Panels Held: 316
Number of People Supported: 18,460

CISE
Senior Researchers: 8,417
Other Professionals: 943
Postdoctoral Associates: 371
Graduate Students: 6,131
Undergraduate Students

(45)

Applying to Core Programs

Program Solicitations:
CCF: NSF 12-581
CNS: NSF 12-582
IIS: NSF 12-580

Project Types:
Large: $1,200,001 to $3,000,000; up to 5 years, collaborative teams
Medium: $500,001 to $1,200,000; up to 4 years, multi-investigator teams
Small: up to $500,000; up to 3 years, one or two investigator projects

CISE-wide Submission Windows (to be adjusted for 2013 and beyond):
Large: November 1 – 30, annually
Medium: September 15 – 30, annually
Small: December 3 – 17, annually

PI Limit: Participate in no more than 2 “core” proposals/year

Coordinated Solicitations

(46)

Selected CISE Cross-Cutting Programs

Cross-Directorate

Secure and Trustworthy Cyberspace (SaTC)
Securing our Nation’s cyberspace from malicious behavior, while preserving privacy and promoting usability.

Cyber-Physical Systems (CPS)
Integrating computation, communication, and control into physical systems.

Cyber-Enabled Sustainability and Science (CyberSEES); Hazard SEES
Two programs under the Science, Engineering, and Education for Sustainability (SEES) umbrella

Exploiting Parallelism and Scalability (XPS)
Groundbreaking research leading to a new era of parallel computing

Enhancing Access to the Radio Spectrum (EARS)
Enhancing access to wireless service and/or efficiency with which radio spectrum is used.

Computing Education for the 21st Century (CE21)
Increasing number and diversity of students and educators in computing education and learning.

Cyberlearning: Transforming Education (CTE)
Designing and implementing technologies to aid and understand learning.

For a comprehensive list of CISE funding opportunities, visit:
http://www.nsf.gov/funding/pgm_list.jsp?org=CISE

(47)

Selected CISE Cross-Cutting Programs

Cross-Division

Expeditions in Computing
Exploring new frontiers in computing and information science.

Cross-Agency

Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIG DATA)
Developing tools to manage and analyze data in order to extract knowledge from data.

National Robotics Initiative (NRI)
Developing and using robots that work alongside, or cooperatively with, people.

Smart and Connected Health (SCH)
Transforming healthcare knowledge and delivery, and improving quality of life through IT.

For a comprehensive list of CISE funding opportunities, visit:
http://www.nsf.gov/funding/pgm_list.jsp?org=CISE

(48)

Can we continue the exponential growth in computational

power (Moore’s Law) in the coming decades?

(49)

Research to Expand the Limits of Computation

Happening now
• Architectural innovations with multi-core and many-core
• Domain-specific integrated circuits
• Energy-efficient computing and new processor architectures

Mid-term solutions
• Need to fully exploit broadly available concurrency and parallelism
• Algorithmic innovations exploiting parallelism
• Software systems leading to improved performance

Long-term solutions
• New materials (e.g., carbon nanotubes, graphene based devices)
• Non-charge transfer devices (e.g., electron spin)

(50)

Exploiting Parallelism and Scalability (XPS)

Support groundbreaking research that will lead to a new era of parallel computing.

Foundational Principles
• New models guiding parallel algorithm design on diverse platforms
• Optimization for resources (energy, bandwidth, memory hierarchy)

Cross-layer Approaches
• Re-thinking/re-designing the hardware and software stack
• Coordination across all layers

Scalable Distributed Architectures
• Highly scalable and parallel architectures for people and things connected everywhere
• Runtime platforms and virtualization tools

Domain-specific Design
• Exploiting domain knowledge to improve programmability and performance

Goal is to establish new collaborations combining expertise cutting across abstraction, software, hardware layers.

Each proposal must have two, or more, PIs providing different and distinct

(51)

From Data to Knowledge to Action

Data represent a transformative new currency for science, engineering,

(52)

Federal Big Data R&D Initiative (WH Launch on March 29, 2012)

Cross-agency “Big Data” Senior Steering Group, chartered in spring 2011 by the White House OSTP:
• Co-chaired by NSF and NIH
• Significant research community input

Major Announcements: NSF, NIH, USGS, DoD, DARPA, DOE

NEW PROGRAM: Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIG DATA)
• All NSF Directorates and 8 NIH Institutes
• Research thrusts: Collection, Storage, and Management; Data Analytics; Research in Data Sharing and Collaboration

Foundational research to develop new techniques and technologies to derive knowledge from data

New cyberinfrastructure to manage, curate, and serve data to research communities

New approaches for education and workforce development

New types of inter-disciplinary collaborations, grand challenges, and competitions

(53)

Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIG DATA)

Foundational research for managing, analyzing, visualizing, and extracting knowledge from large, diverse, distributed, and heterogeneous data sets.

New solicitation to be issued for FY 2013

Collection, Storage, and Management of “Big Data”
• Data representation, storage, and retrieval
• New parallel data architectures, including clouds
• Data management policies, including privacy and access
• Communication and storage devices with extreme capacities
• Sustainable economic models for access and preservation

Data Analytics
• Computational, mathematical, statistical, and algorithmic techniques for modeling high dimensional data
• Learning, inference, prediction, and knowledge discovery for large volumes of dynamic data sets
• Data mining to enable automated hypothesis generation, event correlation, and anomaly detection
• Information infusion of multiple data

Data Sharing and Collaboration
• Tools for distant data sharing, real time visualization, and software reuse of complex data sets
• Cross disciplinary information and knowledge sharing
• Remote operation and real time access to distant data sources and instruments
