
4th Workshop on Big Data Benchmarking

(1)
(2)

4th WBDB: Welcome and Introduction

Chaitan Baru

Associate Director, Data Initiatives

San Diego Supercomputer Center

Director, Center for Large-scale Data Systems Research

University of California San Diego

(3)

Thanks!

Brocade: Providing the venue+catering

Sheri Mukai; Michele Limbocker; Suresh Vobillisetty

CLDS sponsors:

Pivotal, Intel, NetApp, Seagate

CLDS Organizing Committee

Speakers/attendees

(4)

CLDS: Center for Large-scale Data Systems Research

R&D activity within San Diego Supercomputer Center

Current projects/activities

Big Data Benchmarking

Opportunity to work with CS graduate students

Data Value

How Much Information

CSE Master of Advanced Studies (MAS) in Big Data Science

SDSC Data Science Institute

Initiative focused on onsite education and training in Data Science for industry

(5)

SDSC

A national and UC-based center for high-performance computing and data-intensive computing (big data)

Established >25 years ago

Engaged in Research + Development + Production (RDP)

Offers datacenter services to UC, also non-UC

(6)

Comet: System Characteristics

• Planned for Jan 2015
• Total flops ~1.8-2.0 PF
• Dell primary integrator
  ▫ Intel processors
  ▫ Mellanox InfiniBand
  ▫ Aeon storage vendor
• Standard compute nodes
  ▫ Intel next-gen processors
  ▫ 128 GB DRAM
  ▫ 320 GB SSD
• Large-memory nodes
  ▫ 1.5 TB DRAM
• GPU nodes
• Hybrid fat-tree topology
  ▫ FDR InfiniBand
  ▫ Rack-level full bisection bandwidth (72 nodes)
  ▫ 4:1 oversubscription cross-rack
• Performance Storage
  ▫ 7 PB, 200 GB/s
  ▫ Scratch & Persistent Storage
• Durable Storage (reliability)
  ▫ 6 PB disk
• Gateway hosting nodes and VM image repository

(7)

WBDB Background

• Genesis of this effort
  ▫ NSF Cluster Exploratory (CluE) research project
  ▫ On “Performance Evaluation of On-Demand Provisioning of Data Intensive Applications” (2009-2012)
  ▫ Led to a study of benchmarks to compare Hadoop and relational DBMS
• Launched Workshops on Big Data Benchmarking
  ▫ Funded by NSF and industry sponsorships
  ▫ 1st WBDB: May 2012, San Jose. Hosted by Brocade
  ▫ 2nd WBDB: December 2012, Pune, India. Hosted by Persistent Systems / Infosys
  ▫ 3rd WBDB: July 2013, Xi’an, China. Hosted by Xi’an University

(8)

1st WBDB Attendee Organizations

• Actian
• AMD
• BMMsoft
• Brocade
• CA Labs
• Cisco
• Cloudera
• Convey Computer
• CWI/Monet
• Dell
• EPFL
• Facebook
• Google
• Greenplum
• Hewlett-Packard
• Hortonworks
• Indiana Univ / Hathitrust Research Foundation
• InfoSizing
• Intel
• LinkedIn
• MapR/Mahout
• Mellanox
• Microsoft
• NSF
• NetApp
• NetApp/OpenSFS
• Oracle
• Red Hat
• San Diego Supercomputer Center
• SAS
• Scripps Research Institute
• Seagate
• Shell
• SNIA
• Teradata Corporation
• Twitter
• UC Irvine
• Univ. of Minnesota
• Univ. of Toronto
• Univ. of Washington
• VMware
• WhamCloud
• Yahoo!

(9)
(10)

WBDB Outcomes

• Big Data Benchmarking Community (BDBC) mailing list (~160 members from ~75 organizations)
  ▫ (Remote) Talks every other Thursday at 9AM US Pacific time
• Selected papers to be published in Springer-Verlag LNCS: 2012 and 2013 issues
• Paper from First Workshop
  ▫ Setting the Direction for Big Data Benchmark Standards by C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, and T. Rabl, published in Selected Topics in Performance Evaluation and Benchmarking, Springer-Verlag
• Article in inaugural issue of Big Data Journal
  ▫ Big Data Benchmarking and the Big Data Top100 List by Baru, Bhandarkar, Nambiar, Poess, Rabl, Big Data Journal, Vol. 1, No. 1, 60-64, Mary Ann Liebert, Inc.
• Formation of the TPC-BD Subcommittee on Big Data benchmarking

(11)

Current Status: Issues Discussed at the Workshops

• Different types of benchmarks, for different aspects of a system
• Micro-benchmarks. Specific lower-level system operations
  ▫ I/O operations, e.g. A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al., OSU
• Functional benchmarks
  ▫ Terasort
  ▫ Basic SQL: Individual SQL operations, e.g. Select, Project, Join, Order-By, …
• Genre-specific benchmarks
  ▫ E.g. Graph500
• Application-level benchmarks
  ▫ Measure system-level performance of hardware and software, for a given dataset and workload (a given application scenario)
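To make the "functional benchmark" category concrete, here is a minimal sketch of a sort benchmark in the spirit of Terasort: generate keyed records, time the sort, and validate the output. It is not the Terasort implementation; the record layout (10-byte key, 90-byte payload), the function names, and the in-memory single-process setting are all assumptions made for illustration.

```python
import os
import time


def generate_records(n, key_size=10, payload_size=90):
    """Build n records: a random key followed by a fixed-size payload.

    The sizes are illustrative assumptions, not taken from the slides.
    """
    return [os.urandom(key_size) + b"x" * payload_size for _ in range(n)]


def is_sorted(records, key_size=10):
    """Check that record keys are in non-decreasing order."""
    return all(
        records[i][:key_size] <= records[i + 1][:key_size]
        for i in range(len(records) - 1)
    )


def run_sort_benchmark(n, key_size=10):
    """Time an in-memory sort by key and validate the result."""
    records = generate_records(n, key_size)
    start = time.perf_counter()
    result = sorted(records, key=lambda r: r[:key_size])
    elapsed = time.perf_counter() - start
    return {
        "records": n,
        "seconds": elapsed,
        "valid": is_sorted(result, key_size),
    }


if __name__ == "__main__":
    print(run_sort_benchmark(100_000))
```

Data generation is kept outside the timed region so that only the operation under test is measured, and the validation step reflects the verifiability concern that any reported result should be checkable.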

(12)

Benchmark Design Issues

Audience: Who is the audience for such a benchmark?

▫  Marketing (Customers / End users), Internal Use (Engineering), Academic Use

Application: What is the application that should be modeled?

▫  Abstractions of a data pipeline, e.g. Internet-scale business

Should the benchmark be for innovation or competition?

(13)

Design Issues - 2

Single benchmark specification: Is it possible to develop a single benchmark to capture characteristics of multiple applications?

▫  Single, multi-step benchmark, with plausible end-to-end scenario

Component vs. end-to-end benchmark. Is it possible to factor out a set of benchmark “components”, which can be isolated and plugged into an end-to-end benchmark?

▫  The benchmark should consist of individual components that ultimately make up an end-to-end benchmark

(14)

Design Issues - 3

Paper and Pencil vs Implementation-based. Should the implementation be specification-driven or implementation-driven?

▫  Start with an implementation and develop specification at the same time

Reuse. Can we reuse existing benchmarks?

▫  Leverage existing work and built-up knowledgebase

Benchmark Data. Where do we get the data from?

▫  Synthetic data generation: structured, non-structured data

Verifiability. Should there be a process for verification of results?

(15)

Abstractions of the Big Data World from WBDB

Enterprise Warehouse + Agglomeration of other data

Structured enterprise data warehouse

Extended to incorporate data from other non-fully structured data sources (e.g. weblogs, text, streams)

Pool of data with sequence of processing

Enterprise data processing as a pipeline from data ingestion to transformation, extraction, subsetting, machine learning, predictive analytics

Data from multiple structured and non-structured

(16)

Proposal 1: BigBench

Ghazal et al: Teradata, Oracle, U. of Toronto, InfoSizing

Derived from TPC-Decision Support (TPC-DS)

Multiple snowflake schemas with shared dimensions

24 tables with an average of 18 columns

99 distinct SQL-99 queries with random substitutions

More representative skewed database content

Sub-linear scaling of non-fact tables

Ad-hoc, reporting, iterative and extraction queries

ETL-like data maintenance
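The idea of queries "with random substitutions" can be illustrated with a small sketch: a parameterized query template whose placeholders are filled from fixed value domains by a seeded random generator, so that a query stream is varied but reproducible. The template, table and column names, and value domains below are hypothetical and are not taken from TPC-DS or BigBench.

```python
import random
from string import Template

# Hypothetical query template; schema names are illustrative only.
QUERY_TEMPLATE = Template(
    "SELECT item_id, SUM(quantity) AS total "
    "FROM sales WHERE year = $year AND category = '$category' "
    "GROUP BY item_id ORDER BY total DESC LIMIT $limit"
)

# Hypothetical substitution domains for each placeholder.
SUBSTITUTION_DOMAINS = {
    "year": [1999, 2000, 2001, 2002],
    "category": ["Books", "Music", "Sports"],
    "limit": [10, 100],
}


def instantiate(template, domains, rng):
    """Fill every placeholder with a value drawn from its domain."""
    values = {name: rng.choice(choices) for name, choices in domains.items()}
    return template.substitute(values)


def query_stream(template, domains, count, seed):
    """Produce a reproducible stream of concrete queries for one seed."""
    rng = random.Random(seed)
    return [instantiate(template, domains, rng) for _ in range(count)]


if __name__ == "__main__":
    for query in query_stream(QUERY_TEMPLATE, SUBSTITUTION_DOMAINS, 3, seed=42):
        print(query)
```

Seeding the generator means two runs with the same seed issue the same queries, which keeps results comparable across systems while still avoiding a single hard-coded query text.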

(17)

BigBench Data Model

•  Workload = Set of queries

▫  On structured, semistructured, unstructured data

▫  Data mining, ML

•  Paper published in ACM SIGMOD 2013. Full specification to appear

(18)

Proposal 2: Deep Analytics Pipeline

An end-to-end data processing pipeline:

Data from multiple sources

Loose, flexible schema

Data requires structuring

ELT rather than ETL

Application characteristics

Processing pipelines

Running models with data

Pipeline stages: Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation
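The five stages named above can be read as components chained into one end-to-end run. The sketch below assumes toy stand-ins for each stage (all function names and the sample data are hypothetical) and records a per-stage time alongside the end-to-end total, which is one way to report both component and end-to-end results from the same run.

```python
import time


def acquire(_):
    """Acquisition/Recording: return raw, messy input lines (toy data)."""
    return ["  Alice,click ", "bob,view", "", "alice,view"]


def extract_clean(raw):
    """Extraction/Cleaning/Annotation: drop blanks, normalize, split fields."""
    return [line.strip().lower().split(",") for line in raw if line.strip()]


def integrate(rows):
    """Integration/Aggregation/Representation: group actions by user."""
    grouped = {}
    for user, action in rows:
        grouped.setdefault(user, []).append(action)
    return grouped


def analyze(grouped):
    """Analysis/Modeling: a trivial 'model' that counts actions per user."""
    return {user: len(actions) for user, actions in grouped.items()}


def interpret(counts):
    """Interpretation: report the most active user."""
    return max(counts, key=counts.get)


STAGES = [
    ("acquisition", acquire),
    ("extraction", extract_clean),
    ("integration", integrate),
    ("analysis", analyze),
    ("interpretation", interpret),
]


def run_pipeline(stages, data=None):
    """Run stages in order, timing each one and the whole pipeline."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    timings["end_to_end"] = sum(timings.values())
    return data, timings


if __name__ == "__main__":
    result, timings = run_pipeline(STAGES)
    print(result, timings)
```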

(19)

Example of an Application: User Modeling

Objective: Determine user interests by mining user activities

Large dimensionality of possible user activities

Typical user has sparse activity vector
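A sparse activity vector can be sketched as a mapping that stores only the activities a user actually performed, leaving every other dimension as an implicit zero. The activity labels, the dimensionality figure, and the helper names below are assumptions for illustration.

```python
from collections import Counter


def activity_vector(events):
    """Count occurrences per activity; unseen activities are implicit zeros."""
    return Counter(events)


def sparsity(vector, dimensionality):
    """Fraction of the possible activity dimensions that are zero."""
    return 1.0 - len(vector) / dimensionality


def dot(u, v):
    """Sparse dot product: iterate over the smaller vector only."""
    if len(u) > len(v):
        u, v = v, u
    return sum(count * v.get(activity, 0) for activity, count in u.items())


if __name__ == "__main__":
    user = activity_vector(["search:shoes", "view:item42", "search:shoes"])
    print(user, sparsity(user, dimensionality=1_000_000))
```

Storing only the non-zero entries keeps memory proportional to what a user did rather than to the full space of possible activities.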

(20)

User Modeling Pipeline

Data Acquisition

Sessionization

Feature and Target Generation

Model Training

Offline Scoring & Evaluation

Batch Scoring & Upload to serving
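As an illustration of the sessionization step, the sketch below splits one user's event timestamps into sessions whenever the gap between consecutive events exceeds a threshold. The 30-minute threshold is a common convention assumed here, not a value given in the slides, and the function name is hypothetical.

```python
def sessionize(timestamps, gap_seconds=1800):
    """Group one user's event timestamps (in seconds) into sessions.

    A new session starts when the gap since the previous event exceeds
    gap_seconds. The 1800-second default is an assumed convention.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


if __name__ == "__main__":
    print(sessionize([0, 100, 4000, 4100, 10000]))
```

The resulting sessions would feed the feature and target generation step that follows in the pipeline.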

(21)

4th WBDB, October 9-10, San Jose, 2013

TPC-BD subcommittee

Join TPC if you want to influence that process

BigData Top100 List

An open, community effort to rank systems by performance (with price/performance) on Big Data workloads

“HPC meets enterprise”: Combine ideas from TPC and Top500

TPC has influenced design and efficiency of DBMSs over 25 years

“Borrow” ranking concept from Top500
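A ranked list that reports both performance and price/performance could be sketched as follows. The metric itself is left open in the slides, so the `throughput` field, the price figures, and the system names here are purely hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    system: str
    throughput: float  # hypothetical performance metric, higher is better
    price_usd: float   # hypothetical total system price

    @property
    def price_performance(self):
        """Price per unit of performance; lower is better."""
        return self.price_usd / self.throughput


def rank_by_performance(entries):
    """Order entries from highest to lowest performance."""
    return sorted(entries, key=lambda e: e.throughput, reverse=True)


def rank_by_price_performance(entries):
    """Order entries from best (lowest) to worst price/performance."""
    return sorted(entries, key=lambda e: e.price_performance)


if __name__ == "__main__":
    entries = [
        Entry("system-a", throughput=1000.0, price_usd=500_000.0),
        Entry("system-b", throughput=800.0, price_usd=200_000.0),
        Entry("system-c", throughput=1200.0, price_usd=900_000.0),
    ]
    print([e.system for e in rank_by_performance(entries)])
    print([e.system for e in rank_by_price_performance(entries)])
```

The two orderings can differ, which is why a list in this style would publish both columns rather than a single rank.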

(22)

Next Steps: BigData Community Challenges

Challenges related to the Deep Analytics Pipeline

Definition of each step

Ideas for machine learning and predictive analytics steps

Ideas for metrics: performance and price/performance

Announce competitions via Kaggle and other venues

(23)

5th WBDB

Would like to host it in Europe (Germany?) around Summer 2014

Looking for interested hosts, sponsors, local
