4th Workshop on Big Data Benchmarking

(1)

(2)

4th WBDB: Welcome and Introduction

Chaitan Baru

Associate Director, Data Initiatives

San Diego Supercomputer Center

Director, Center for Large-scale Data Systems Research

University of California San Diego

(3)

Thanks!

• Brocade: Providing the venue+catering

▫  Sheri Mukai; Michele Limbocker; Suresh Vobillisetty

• CLDS sponsors:

▫  Pivotal, Intel, NetApp, Seagate

• CLDS Organizing Committee

• Speakers/attendees

(4)

CLDS: Center for Large-scale Data Systems Research

• R&D activity within San Diego Supercomputer Center

• Current projects/activities

▫  Big Data Benchmarking

    Opportunity to work with CS graduate students

▫  Data Value

▫  How Much Information

▫  CSE Master of Advanced Studies (MAS) in Big Data Science

▫  SDSC Data Science Institute

    Initiative focused on onsite education and training in Data Science for industry

(5)

SDSC

• A national and UC-based center for high-performance computing and data-intensive computing (big data)

• Established >25 years ago

• Engaged in Research + Development + Production (RDP)

• Offers datacenter services to UC, also non-UC

(6)

Comet: System Characteristics

• Planned for Jan 2015

• Total flops ~1.8-2.0 PF

• Dell primary integrator

▫  Intel processors

▫  Mellanox InfiniBand

▫  Aeon storage vendor

• Standard compute nodes

▫  Intel next-gen processors

▫  128 GB DRAM

▫  320GB SSD

• Large-memory nodes

▫  1.5TB DRAM

• GPU nodes

• Hybrid fat-tree topology

▫  FDR InfiniBand

▫  Rack-level full bisection bandwidth (72 nodes)

▫  4:1 oversubscription cross-rack

• Performance Storage

▫  7 PB, 200 GB/s

▫  Scratch & Persistent Storage

• Durable Storage (reliability)

▫  6 PB disk

• Gateway hosting nodes and VM image repository
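The network figures on this slide allow a quick back-of-the-envelope estimate of cross-rack bandwidth. The sketch below is illustrative only: the 56 Gb/s per-node link rate is an assumed nominal FDR InfiniBand 4x rate and does not appear on the slide, and the function names are hypothetical. The rack size (72 nodes) and the 4:1 oversubscription ratio are taken from the slide.

```python
# Back-of-the-envelope sketch of the Comet fat-tree numbers.
# ASSUMPTION: nominal FDR 4x link rate; this value is not stated on the slide.
FDR_LINK_GBPS = 56.0
NODES_PER_RACK = 72       # from the slide: full bisection within a rack
OVERSUBSCRIPTION = 4.0    # from the slide: 4:1 cross-rack


def rack_injection_gbps(nodes=NODES_PER_RACK, link=FDR_LINK_GBPS):
    """Aggregate bandwidth the nodes of one rack can inject into the fabric."""
    return nodes * link


def cross_rack_uplink_gbps(nodes=NODES_PER_RACK, link=FDR_LINK_GBPS,
                           oversub=OVERSUBSCRIPTION):
    """Aggregate bandwidth leaving the rack under the given oversubscription."""
    return rack_injection_gbps(nodes, link) / oversub


def per_node_cross_rack_gbps(link=FDR_LINK_GBPS, oversub=OVERSUBSCRIPTION):
    """Worst-case per-node share when every node in the rack talks off-rack."""
    return link / oversub


if __name__ == "__main__":
    print("rack injection (Gb/s):", rack_injection_gbps())
    print("cross-rack uplink (Gb/s):", cross_rack_uplink_gbps())
    print("per-node cross-rack share (Gb/s):", per_node_cross_rack_gbps())
```

Under these assumptions, a job that fits within one rack sees full link bandwidth between any pair of nodes, while a job spanning racks may see as little as a quarter of it when the uplinks are saturated.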

(7)

WBDB Background

• Genesis of this effort

▫  NSF Cluster Exploratory (CluE) research project

▫  On “Performance Evaluation of On-Demand Provisioning of Data Intensive Applications” (2009-2012)

▫  Led to a study of benchmarks to compare Hadoop and relational DBMS

• Launched Workshops on Big Data Benchmarking

▫  Funded by NSF and industry sponsorships

  1st WBDB: May 2012, San Jose. Hosted by Brocade

  2nd WBDB: December 2012, Pune, India. Hosted by Persistent Systems / Infosys

  3rd WBDB: July 2013, Xi’an, China. Hosted by Xi’an University

(8)

1st WBDB Attendee Organizations

•  Actian •  AMD •  BMMsoft •  Brocade •  CA Labs •  Cisco •  Cloudera •  Convey Computer •  CWI/Monet •  Dell •  EPFL •  Facebook •  Google •  Greenplum

•  Hewlett-Packard •  Hortonworks •  Indiana Univ / HathiTrust Research Foundation •  InfoSizing •  Intel •  LinkedIn •  MapR/Mahout •  Mellanox •  Microsoft •  NSF •  NetApp •  NetApp/OpenSFS •  Oracle •  Red Hat

•  San Diego Supercomputer Center •  SAS •  Scripps Research Institute •  Seagate •  Shell •  SNIA •  Teradata Corporation •  Twitter •  UC Irvine •  Univ. of Minnesota •  Univ. of Toronto •  Univ. of Washington •  VMware •  WhamCloud •  Yahoo!

(9)

(10)

WBDB Outcomes

• Big Data Benchmarking Community (BDBC) mailing list (~160 members from ~75 organizations)

▫  (Remote) Talks every other Thursday at 9AM US Pacific time

• Selected papers to be published in Springer-Verlag LNCS: 2012 and 2013 Issues

• Paper from First Workshop

▫  Setting the Direction for Big Data Benchmark Standards by C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, and T. Rabl, published in Selected Topics in Performance Evaluation and Benchmarking, Springer-Verlag

• Article in inaugural issue of Big Data Journal

▫  Big Data Benchmarking and the Big Data Top100 List by Baru, Bhandarkar, Nambiar, Poess, Rabl, Big Data Journal, Vol.1, No.1, 60-64, Mary Ann Liebert, Inc.

• Formation of the TPC-BD Subcommittee on Big Data benchmarking

(11)

Current Status: Issues Discussed at the Workshops

• Different types of benchmarks, for different aspects of a system

• Micro-benchmarks. Specific lower-level, system operations

▫  I/O operations, e.g. A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al, OSU

• Functional benchmarks

▫  Terasort

▫  Basic SQL: Individual SQL operations, e.g. Select, Project, Join, Order-By, …

• Genre-specific benchmarks

▫  E.g. Graph500

• Application-level benchmarks

▫  Measure system-level performance of hardware and software, for a given dataset and workload (a given application scenario)

(12)

Benchmark Design Issues

• Audience: Who is the audience for such a benchmark?

▫  Marketing (Customers / End users), Internal Use (Engineering), Academic Use

• Application: What is the application that should be modeled?

▫  Abstractions of a data pipeline, e.g. Internet-scale business

• Should the benchmark be for innovation or competition?

(13)

Design Issues - 2

• Single benchmark specification: Is it possible to develop a single benchmark to capture characteristics of multiple applications?

▫  Single, multi-step benchmark, with plausible end-to-end scenario

• Component vs. end-to-end benchmark. Is it possible to factor out a set of benchmark “components”, which can be isolated and plugged into an end-to-end benchmark?

▫  The benchmark should consist of individual components that ultimately make up an end-to-end benchmark
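One way to picture the component vs. end-to-end idea is a harness in which each component can be timed on its own and the same components are chained to form the end-to-end run. The following is a minimal sketch under that assumption, not a design proposed at the workshops; the component names (`ingest`, `transform`, `aggregate`) and the harness functions are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

# A component is a named function from a list of records to a list of records.
Component = Tuple[str, Callable[[list], list]]


def run_component(fn: Callable[[list], list], data: list) -> Tuple[list, float]:
    """Run one component in isolation and return (output, elapsed seconds)."""
    start = time.perf_counter()
    out = fn(data)
    return out, time.perf_counter() - start


def run_end_to_end(components: List[Component], data: list) -> Tuple[list, Dict[str, float]]:
    """Chain the components; report per-component and total elapsed time."""
    timings: Dict[str, float] = {}
    for name, fn in components:
        data, elapsed = run_component(fn, data)
        timings[name] = elapsed
    timings["end_to_end"] = sum(timings.values())
    return data, timings


# Hypothetical components, chosen only to make the sketch runnable.
def ingest(records: list) -> list:
    return [r.strip() for r in records]


def transform(records: list) -> list:
    return [r.lower() for r in records if r]


def aggregate(records: list) -> list:
    counts: Dict[str, int] = {}
    for r in records:
        counts[r] = counts.get(r, 0) + 1
    return sorted(counts.items())


PIPELINE: List[Component] = [
    ("ingest", ingest),
    ("transform", transform),
    ("aggregate", aggregate),
]

if __name__ == "__main__":
    result, timings = run_end_to_end(PIPELINE, [" A ", "b", "a", ""])
    print(result)
    print(timings)
```

Because every component has the same signature, any one of them can be benchmarked alone with `run_component` or swapped for another implementation without touching the end-to-end driver.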

(14)

Design Issues - 3

• Paper and Pencil vs Implementation-based. Should the benchmark be specification-driven or implementation-driven?

▫  Start with an implementation and develop specification at the same time

• Reuse. Can we reuse existing benchmarks?

▫  Leverage existing work and built-up knowledge base

• Benchmark Data. Where do we get the data from?

▫  Synthetic data generation: structured, non-structured data

• Verifiability. Should there be a process for verification of results?
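The benchmark data and verifiability questions are related: if synthetic data is generated deterministically from a seed, an auditor can regenerate the same dataset and compare checksums. The sketch below illustrates that idea only; the record layout, the word list, and the function names are hypothetical and are not taken from any benchmark specification.

```python
import hashlib
import random

# Hypothetical vocabulary for the free-text (non-structured) field.
WORDS = ["fast", "slow", "great", "poor", "shipping", "price", "quality", "service"]


def generate_records(seed: int, n: int) -> list:
    """Generate n synthetic records: structured fields plus a free-text comment."""
    rng = random.Random(seed)  # private generator, so the output depends only on the seed
    records = []
    for i in range(n):
        user_id = rng.randint(1, 1000)
        amount_cents = rng.randint(1, 50000)
        comment = " ".join(rng.choice(WORDS) for _ in range(rng.randint(3, 8)))
        records.append((i, user_id, amount_cents, comment))
    return records


def checksum(records: list) -> str:
    """SHA-256 over the records, so two parties can compare generated datasets."""
    digest = hashlib.sha256()
    for rec in records:
        digest.update(repr(rec).encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    data = generate_records(seed=42, n=5)
    for rec in data:
        print(rec)
    print("checksum:", checksum(data))
```

Regenerating with the same seed and record count should reproduce the checksum, which gives a simple check that a reported result was obtained on the intended dataset.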

(15)

Abstractions of the Big Data World from WBDB

• Enterprise Warehouse + Agglomeration of other data

▫  Structured enterprise data warehouse

▫  Extended to incorporate data from other non-fully structured data sources (e.g. weblogs, text, streams)

• Pool of data with sequence of processing

▫  Enterprise data processing as a pipeline from data ingestion to transformation, extraction, subsetting, machine learning, predictive analytics

▫  Data from multiple structured and non-structured

(16)

Proposal 1: BigBench

• Ghazal et al: Teradata, Oracle, U. of Toronto, InfoSizing

• Derived from TPC-Decision Support (TPC-DS)

▫  Multiple snowflake schemas with shared dimensions

▫  24 tables with an average of 18 columns

▫  99 distinct SQL 99 queries with random substitutions

▫  Sub-linear scaling of non-fact tables

▫  Ad-hoc, reporting, iterative and extraction queries

▫  ETL-like data maintenance
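Two of the TPC-DS properties listed above, sub-linear scaling of non-fact tables and queries with random substitutions, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the square-root scaling law, the table and column names in the template, and the parameter ranges are hypothetical and are not the rules defined by TPC-DS or BigBench.

```python
import math
import random


def fact_rows(base_rows: int, scale_factor: int) -> int:
    """Fact tables grow linearly with the scale factor."""
    return base_rows * scale_factor


def dimension_rows(base_rows: int, scale_factor: int) -> int:
    """Non-fact tables grow sub-linearly.

    ASSUMPTION: a square-root law, used only to show the idea; the real
    benchmark defines its own row counts per scale factor.
    """
    return int(base_rows * math.sqrt(scale_factor))


# Hypothetical query template; table and column names are illustrative.
QUERY_TEMPLATE = (
    "SELECT COUNT(*) FROM sales "
    "WHERE sold_year = {year} AND quantity > {qty}"
)


def instantiate(template: str, seed: int) -> str:
    """Fill the template with substitution values drawn from a seeded generator."""
    rng = random.Random(seed)
    return template.format(year=rng.randint(1998, 2003), qty=rng.randint(1, 100))


if __name__ == "__main__":
    for sf in (1, 100, 10000):
        print(sf, fact_rows(1000, sf), dimension_rows(1000, sf))
    print(instantiate(QUERY_TEMPLATE, seed=7))
```

With this kind of scaling, the fact-to-dimension row ratio widens as the scale factor grows, and seeding the substitutions keeps each generated query stream reproducible.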

(17)

BigBench Data Model

•  Workload = Set of queries

▫  On structured, semistructured, unstructured data

▫  Data mining, ML

•  Paper published in ACM SIGMOD 2013. Full specification to appear

(18)

Proposal 2: Deep Analytics Pipeline

• An end-to-end data processing pipeline:

▫  Data from multiple sources

▫  Loose, flexible schema

▫  Data requires structuring

▫  ELT rather than ETL

• Application characteristics

▫  Processing pipelines

▫  Running models with data

[Figure: pipeline stages: Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation]
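The "ELT rather than ETL" point can be sketched as follows: raw records are loaded unchanged, and structure is imposed only later, when an analysis needs it. This is a minimal illustration assuming JSON-lines input with a loose schema; the field names (`user`, `bytes`) and the function names are hypothetical and are not part of the proposal.

```python
import json


def load(raw_lines: list) -> list:
    """Extract + Load: keep every record verbatim; no schema is enforced here."""
    return list(raw_lines)


def structure(raw_store: list) -> list:
    """Transform (after load): impose the schema one particular analysis needs.

    Unparseable lines are skipped, missing fields get defaults, and unknown
    fields are ignored, reflecting a loose, flexible schema.
    """
    rows = []
    for line in raw_store:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(obj, dict):
            continue
        rows.append({
            "user": obj.get("user", "unknown"),
            "bytes": int(obj.get("bytes", 0)),
        })
    return rows


def analyze(rows: list) -> dict:
    """Analysis step: total bytes per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["bytes"]
    return totals


if __name__ == "__main__":
    lines = [
        '{"user": "a", "bytes": 10}',
        "not json",
        '{"user": "a", "bytes": "5", "extra": 1}',
        '{"bytes": 2}',
    ]
    store = load(lines)
    print(analyze(structure(store)))
```

Because the raw store is kept intact, a different analysis can later apply a different `structure` step to the same data, which is the practical difference from transforming before load.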