Setting the Direction for Big Data Benchmark Standards

(1)

Chaitan Baru, Center for Large-scale Data Systems Research (CLDS), San Diego Supercomputer Center, UC San Diego

Milind Bhandarkar, Greenplum

Raghunath Nambiar, Cisco

Meikel Poess, Oracle

(2)

Characterizing the application environment

• Enormous data sizes (volume), high data rates (velocity), variety of data genres

• Datacenter issues:

▫ Large-scale and evolving system configurations, shifting loads, heterogeneous technologies

• Data genres:

▫ Structured, unstructured, graphs, streams, images, scientific data, etc.

• Software options:

▫ SQL, NoSQL, Hadoop ecosystem, …

• Hardware options:

▫ HDD vs SSD (NVM); different types of HDD, NVM, and main memory; large-memory systems; etc.

(3)

• Industry’s first workshop on big data benchmarking

• Acknowledgements

▫ National Science Foundation (Grant# IIS-1241838)

▫ SDSC industry sponsors: Seagate, Greenplum, NetApp, Mellanox, Brocade (event host)

• WBDB2012 steering committee

▫ Chaitan Baru, SDSC/UC San Diego

▫ Milind Bhandarkar, Greenplum/EMC

▫ Raghunath Nambiar, Cisco

▫ Meikel Poess, Oracle

(4)

Invited Attendee Organizations

• Actian

• AMD

• BMMsoft

• Brocade

• CA Labs

• Cisco

• Cloudera

• Convey Computer

• CWI/Monet

• Dell

• EPFL

• Facebook

• Google

• Greenplum

• San Diego Supercomputer Center

• SAS

• Scripps Research Institute

• Seagate

• Shell

• SNIA

• Teradata Corporation

• Twitter

• UC Irvine

• Univ. of Minnesota

• Univ. of Toronto

• Univ. of Washington

• VMware

• WhamCloud

• Yahoo!



• Hewlett-Packard

• Hortonworks

• Indiana Univ / HathiTrust Research Foundation

• InfoSizing

• Intel

• LinkedIn

• MapR/Mahout

• Mellanox

• Microsoft

• NSF

• NetApp

• NetApp/OpenSFS

• Oracle

• Red Hat

(5)

Topics of discussion

• Audience: Who is the audience for this benchmark?

• Application: What application should we model?

• Single benchmark spec: Is it possible to develop a single benchmark to capture characteristics of multiple applications?

• Component vs. end-to-end benchmark: Is it possible to factor out a set of benchmark “components”, which can be isolated and plugged into an end-to-end benchmark(s)?

• Paper and pencil vs. implementation-based: Should the benchmark be specification-driven or implementation-driven?

• Reuse: Can we reuse existing benchmarks?

• Benchmark data: Where do we get the data from?

• Innovation or competition? Should the benchmark be for innovation or competition?

(6)

Audience: Who is the primary audience for a big data benchmark?

• Customers

▫ Workload should preferably be expressed in

o English

o Or, a declarative language (unsophisticated user)

o But not a procedural language (sophisticated user)

▫ Want to compare among different vendors

• Vendors

▫ Would like to sell machines/systems based on benchmarks

• Computer science/hardware research is also an audience

▫ Niche players and technologies will emerge out of academia

(7)

Applications: What application should we model?

• An application that somebody could donate?

• An application based on empirical data?

▫ Examples from scientific applications

• Multi-channel retailer-based application, like the amended TPC-DS for Big Data?

▫ Mature schema, large-scale data generator, execution rules, and audit process exist.

• “Abstraction” of an Internet-scale application, e.g. data management at the Facebook site, with synthetic data

(8)

Single Benchmark vs Multiple

• Is it possible to develop a single benchmark to represent multiple applications?

• Yes, but not desired if there is no synergy between the applications, e.g. at the data model level

▫ Synthetic Facebook application might provide context for a single benchmark

▫ Click streams, data sorting/indexing, weblog

(9)

Component benchmark vs. end-to-end benchmark

• Are there components that can be isolated and plugged into an end-to-end benchmark?

• The benchmark should consist of individual components that ultimately make up an end-to-end benchmark

• The benchmark should include a component that extracts large data

▫ Many data science applications extract large data and then visualize the output

▫ Opportunity for “pushing down” viz into the data

(10)

Paper and Pencil / Specification driven versus Implementation driven

• Start with an implementation and develop the specification at the same time

• Some post-workshop activity has begun in this area

(11)

Where Do We Get the Data From?

• Downloading data is not an option

• Data needs to be generated (quickly)

• Examples of actual datasets from scientific applications

▫ Observational data (e.g. LSST), simulation outputs

• Using existing data generators (TPC-DS, TPC-H)

• Data that is generic enough with good characteristics is
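One way to meet the "generated (quickly)" requirement above is to make every row a pure function of a seed and its row number, so that independent workers can produce disjoint partitions in parallel and still obtain the same dataset. The following is a minimal Python sketch under that assumption; the column names, the SHA-256-based `cell` helper, and the partitioning scheme are hypothetical illustrations and are not taken from the TPC-DS or TPC-H generators.

```python
import hashlib


def cell(seed: int, row: int, column: str) -> int:
    """Deterministic pseudo-random integer for one (row, column) pair."""
    digest = hashlib.sha256(f"{seed}:{row}:{column}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def generate_rows(seed: int, start: int, stop: int):
    """Yield rows start..stop-1; no state is shared between rows."""
    for row in range(start, stop):
        yield {
            "order_id": row,  # hypothetical columns, for illustration only
            "customer_id": cell(seed, row, "customer") % 1_000_000,
            "amount_cents": cell(seed, row, "amount") % 100_000,
        }


def partition(total_rows: int, workers: int, worker: int) -> tuple[int, int]:
    """Half-open row range assigned to one worker (ceiling division)."""
    chunk = -(-total_rows // workers)
    start = min(worker * chunk, total_rows)
    return start, min(start + chunk, total_rows)


# Four hypothetical workers each generate their own slice; concatenated in
# worker order, the slices are compared with a single sequential pass.
TOTAL_ROWS, WORKERS, SEED = 10, 4, 42
parallel = [
    r
    for w in range(WORKERS)
    for r in generate_rows(SEED, *partition(TOTAL_ROWS, WORKERS, w))
]
sequential = list(generate_rows(SEED, 0, TOTAL_ROWS))
print(parallel == sequential)
```

Because no generator state carries over from one row to the next, the scale factor only changes the row range, and the number of workers can be chosen independently of the data produced.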

(12)

Should the benchmark be for innovation or competition?

• Innovation and competition are not mutually exclusive

▫ Should be used for both

▫ The benchmark should be designed for competition; such a benchmark will then also be used internally for innovation

• TPC-H is a prime example of a benchmark model that could drive competition and innovation (if combined correctly)

(13)

TPC-DS

▫ Decision support benchmark from Transaction Processing Performance Council

▫ http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf

Why build on top of TPC-DS?

• Volume:

▫ No theoretical limit

▫ Tested up to 100 TB

• Velocity: rolling updates

• Variety

▫ Rich relational model and data tables, fact tables, …

▫ Easy to add the other two sources

Can we reuse existing benchmarks?

• Yes, we could, but we need to discuss:

▫ How much augmentation is necessary?

▫ Can the benchmark data be scaled?

▫ If the benchmark uses SQL, we should not require it

• Examples (none of the following could be used unmodified):

▫ Statistical Workload Injector for Map Reduce (SWIM)

▫ GridMix3 (lots of shortcomings)

o Open source

▫ TPC-DS

▫ YCSB++ (lots of shortcomings)

▫ Terasort: strong sentiment for using this, perhaps as part of an “end-to-end” scenario

(14)

Keep in mind principles for good benchmark design

• Self-scaling, e.g. TPC-C

• Comparability between scale factors

▫ Results should be comparable at different scales

• Technology agnostic (if meaningful to the application)

• Simple to run

TPC

+ Longevity: TPC-C has carried the load for 20 years

+ Comparability: Audit requirements and strict, detailed run rules mean one can compare results published by two different entities

+ Scaling: Results are just as meaningful at the high end of the market as at the low end; as relevant on clusters as on single servers

(15)

Extrapolating Results

• TPC benchmarks typically run on “over-specified” systems

▫ i.e., customer installations may have less hardware than the benchmark installation (SUT)

• Big Data Benchmarking may be the opposite

▫ May need to run benchmark on systems that are smaller than customer installations

o Can we extrapolate?

▫ Scaling may be “piecewise linear”. Need to find those inflexion points
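To make the "piecewise linear" point concrete, here is a minimal Python sketch of one possible way to locate a single inflexion point from measured runs: try every split of the measurements into two segments, fit a least-squares line to each, and keep the split with the lowest total squared error. The measurements, the function names `fit_line` and `find_inflection`, and the restriction to one breakpoint are assumptions made for illustration; they do not come from the workshop.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept, sum of squared residuals)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)  # assumes xs are not all identical
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def find_inflection(xs, ys, min_points=2):
    """Return (last x of the first segment, left slope, right slope)."""
    if len(xs) < 2 * min_points:
        raise ValueError("need at least 2 * min_points measurements")
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, xs[k - 1], left[0], right[0])
    return best[1:]


# Hypothetical measurements: data size in TB vs. runtime in minutes.
sizes = [1, 2, 4, 8, 16, 32]
runtimes = [10, 20, 40, 80, 200, 520]
print(find_inflection(sizes, runtimes))
```

Extrapolating beyond the largest measured size would then use the slope of the last segment, which is only trustworthy if no further inflexion point lies between the benchmark system and the customer installation.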

(16)

Interest in results from the Workshop

• June 9, 2012:

▫ Summary of Workshop on Big Data Benchmarking, C. Baru, Workshop on Architectures and Systems for Big Data, Portland, OR.

• June 22-23, 2012:

▫ Towards Industry Standard Benchmarks for Big Data, M. Bhandarkar, The Extremely Large Databases Conference at Asia, June 22-23, 2012.

• August 27-31, 2012:

▫ Setting the Direction for Big Data Benchmark Standards, Baru, Bhandarkar, Nambiar, Poess, Rabl, 4th Int’l TPC Technology Conference (TPCTC 2012), with VLDB2012, Istanbul, Turkey

• September 10-13, 2012:

▫ Introducing the Big Data Benchmarking Community, Lightning Talk and Poster at the 6th Extremely Large Database Conference (XLDB), Stanford Univ, Palo Alto.

(17)

Big Data Benchmarking Community (BDBC)

• Biweekly phone conferences

▫ Group communications and coordination are facilitated via phone calls every two weeks.

▫ See http://clds.sdsc.edu/bdbc/community for call-in information

▫ Contact [email protected] or [email protected] to join the BDBC mailing list

(18)

Big Data Benchmarking Community Participants

o Actian

o AMD

o Argonne National Labs

o BMMsoft

o Brocade

o CA Labs

o Cisco

o Cloudera

o CMU

o Convey Computer

o CWI/Monet

o DataStax

o Facebook

o GaTech

o Google

o Greenplum

o Hortonworks

o Hewlett-Packard

o IBM

o IndianaU/HTRF

o InfoSizing

o Intel

o Johns Hopkins U.

o LinkedIn

o MapR/Mahout

o SLAC

o SNIA

o Teradata

o Twitter

o Univ. of Minnesota

o UC Berkeley

o UC Irvine

o UC San Diego

o Univ. of Passau

o Univ. of Toronto

o Univ. of Washington

o VMware

o WhamCloud

o NetApp

o Netflix

o NIST

o NSF

o OpenSFS

o Oracle

o Ohio State U.

o PayPal

o PNNL

o Red Hat

o San Diego Supercomputer Center

o SAS

(19)

2nd Workshop on Big Data Benchmarking: India

• December 17-18, 2012, Pune, India

▫ Hosted by Persistent Systems, India

• See http://clds.ucsd.edu/wbdb2012.in

• Themes:

▫ Application Scenarios/Use Cases

▫ Benchmark Process

▫ Benchmark Metrics

▫ Data Generation

• Format: Invited speakers; Lightning talks; Demos

• We welcome sponsorship

▫ Current Sponsors: NSF, Persistent Systems, Seagate, Greenplum, Brocade, Mellanox

(20)

3rd Workshop on Big Data Benchmarking: China

• June 2013, Xi'an, China

▫ Hosted by Shanxi Supercomputing Center

• See http://clds.ucsd.edu/wbdb2013.cn

▫ Current Sponsors: IBM, Seagate, Greenplum, Mellanox, Brocade?

(21)

The Center for Large-scale Data Systems Research: CLDS

• At the San Diego Supercomputer Center, UC San Diego

• R&D center and forum for discussions on big data-related topics

▫ Benchmarking; Project on Data Dynamics; Information Value; Focus on industry segments, e.g. healthcare

• Center Structure:

▫ Workshops; Program-level activities; Project-level activities

• For sponsorship opportunities:

▫ Contact: Chaitan Baru, SDSC, [email protected]

• Thank you!