4th WBDB: Welcome and
Introduction
Chaitan Baru
Associate Director, Data Initiatives
San Diego Supercomputer Center
Director, Center for Large-scale Data Systems Research
University of California San Diego
Thanks!
•
Brocade: Providing the venue+catering
▫
Sheri Mukai; Michele Limbocker; Suresh
Vobillisetty
•
CLDS sponsors:
▫
Pivotal, Intel, NetApp, Seagate
•
CLDS Organizing Committee
•
Speakers/attendees
CLDS: Center for Large-scale Data
Systems Research
•
R&D activity within San Diego Supercomputer
Center
•
Current projects/activities
▫
Big Data Benchmarking
Opportunity to work with CS graduate students
▫
Data Value
▫
How Much Information
▫
CSE Master of Advanced Studies (MAS) in Big Data
Science
▫
SDSC Data Science Institute
Initiative focused on onsite education and training in
Data Science for industry
SDSC
•
A national and UC-based center for
high-performance computing and data-intensive
computing (big data)
•
Established >25 years ago
•
Engaged in Research + Development +
Production (RDP)
•
Offers datacenter services to UC, also non-UC
Comet: System Characteristics
•
Planned for Jan 2015
•
Total flops ~1.8-2.0 PF
•
Dell primary integrator
▫ Intel processors
▫ Mellanox InfiniBand
▫ Aeon storage vendor
•
Standard compute nodes
▫ Intel next-gen processors
▫ 128 GB DRAM
▫ 320GB SSD
•
Large-memory nodes
▫ 1.5TB DRAM
•
GPU nodes
•
Hybrid fat-tree topology
▫ FDR InfiniBand
▫ Rack-level full bisection bandwidth
(72 nodes)
▫ 4:1 oversubscription cross-rack
•
Performance Storage
▫ 7 PB, 200 GB/s
▫ Scratch & Persistent Storage
•
Durable Storage (reliability)
▫ 6 PB disk
•
Gateway hosting nodes and VM
image repository
WBDB Background
•
Genesis of this effort
▫
NSF Cluster Exploratory (CluE) research project
▫
On “Performance Evaluation of On-Demand Provisioning of
Data Intensive Applications” (2009-2012)
▫
Led to a study of benchmarks to compare Hadoop and
relational DBMS
•
Launched Workshops on Big Data Benchmarking
▫
Funded by NSF and industry sponsorships
1st WBDB: May 2012, San Jose. Hosted by Brocade
2nd WBDB: December 2012, Pune, India. Hosted by Persistent Systems / Infosys
3rd WBDB: July 2013, Xi’an, China. Hosted by Xi’an University
1st WBDB Attendee Organizations
• Actian • AMD • BMMsoft • Brocade • CA Labs • Cisco • Cloudera • Convey Computer • CWI/Monet • Dell • EPFL • Facebook • Google • Greenplum
• Hewlett-Packard • Hortonworks • Indiana Univ / Hathitrust Research Foundation • InfoSizing • Intel • LinkedIn • MapR/Mahout • Mellanox • Microsoft • NSF • NetApp • NetApp/OpenSFS • Oracle • Red Hat
• San Diego Supercomputer Center • SAS • Scripps Research Institute • Seagate • Shell • SNIA • Teradata Corporation • Twitter • UC Irvine • Univ. of Minnesota • Univ. of Toronto • Univ. of Washington • VMware • WhamCloud • Yahoo!
WBDB Outcomes
•
Big Data Benchmarking Community (BDBC) mailing list
(~160 members from ~75 organizations)
▫ (Remote) Talks every other Thursday at 9AM US Pacific time
•
Selected papers to be published in Springer-Verlag LNCS:
2012 and 2013 issues
•
Paper from First Workshop
▫ Setting the Direction for Big Data Benchmark Standards by C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, and T. Rabl,
published in Selected Topics in Performance Evaluation and
Benchmarking, Springer-Verlag
•
Article in inaugural issue of Big Data Journal
▫ Big Data Benchmarking and the Big Data Top100 List by Baru,
Bhandarkar, Nambiar, Poess, Rabl, Big Data Journal, Vol.1, No.1, 60-64, Mary Ann Liebert, Inc.
•
Formation of the TPC-BD Subcommittee on Big Data
benchmarking
Current Status: Issues Discussed at the
Workshops
•
Different types of benchmarks—for different aspects of a
system
•
Micro-benchmarks. Specific lower-level system operations
▫ I/O operations, e.g. A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al, OSU
•
Functional benchmarks
▫ Terasort
▫ Basic SQL: Individual SQL operations, e.g. Select, Project, Join, Order-By, …
•
Genre-specific benchmarks
▫ E.g. Graph500
•
Application-level benchmarks
▫ Measure system-level performance of hardware and software, for a given dataset and workload (a given application scenario)
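As an illustration of the micro and functional benchmark categories above, here is a minimal timing harness in Python. It is only a sketch: the function names (`time_operation`, `make_records`), the input size, and the choice of an in-memory sort as the timed operation are assumptions made for this example and are not taken from Terasort or any other benchmark named on the slide.

```python
import random
import statistics
import time


def time_operation(operation, make_input, repetitions=5):
    """Run `operation` on fresh input several times; return elapsed seconds per run."""
    timings = []
    for _ in range(repetitions):
        data = make_input()              # input creation is excluded from the timing
        start = time.perf_counter()
        operation(data)
        timings.append(time.perf_counter() - start)
    return timings


def make_records(n=10_000, seed=42):
    """Hypothetical workload: n random integer keys drawn from a fixed seed."""
    rng = random.Random(seed)
    return [rng.randrange(1_000_000) for _ in range(n)]


# Functional benchmark in the spirit of a sort test: time sorting the keys.
sort_timings = time_operation(sorted, make_records)
median_seconds = statistics.median(sort_timings)
print(f"sort of 10,000 keys: median {median_seconds:.6f} s over {len(sort_timings)} runs")
```

Reporting the median over several runs, rather than a single run, reduces the effect of one-off noise. An application-level benchmark would time a whole scenario instead of a single operation.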
Benchmark Design Issues
•
Audience: Who is the audience for such a benchmark?
▫ Marketing (Customers / End users), Internal Use (Engineering), Academic Use
•
Application: What is the application that should be
modeled?
▫ Abstractions of a data pipeline, e.g. Internet-scale business
•
Should the benchmark be for innovation or competition?
Design Issues - 2
•
Single benchmark specification: Is it possible to develop
a single benchmark to capture characteristics of multiple
applications?
▫ Single, multi-step benchmark, with plausible end-to-end scenario
•
Component vs. end-to-end benchmark. Is it possible to
factor out a set of benchmark “components”, which can
be isolated and plugged into an end-to-end benchmark?
▫ The benchmark should consist of individual components that ultimately make up an end-to-end benchmark
Design Issues - 3
•
Paper and Pencil vs Implementation-based. Should the
benchmark be specification-driven or
implementation-driven?
▫ Start with an implementation and develop specification at the same time
•
Reuse. Can we reuse existing benchmarks?
▫ Leverage existing work and built-up knowledgebase
•
Benchmark Data. Where do we get the data from?
▫ Synthetic data generation: structured, non-structured data
•
Verifiability. Should there be a process for verification of
results?
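The synthetic data and verifiability points above can be illustrated together. The sketch below assumes a seeded generator, so that anyone running the benchmark can regenerate the same data and compare a checksum. The vocabulary, field layout, and function names are hypothetical and do not come from any WBDB proposal.

```python
import hashlib
import random

WORDS = ["data", "query", "node", "storage", "stream", "model"]  # hypothetical vocabulary


def generate_rows(count, seed):
    """Yield (id, amount, comment) rows: two structured fields and one free-text field."""
    rng = random.Random(seed)
    for row_id in range(count):
        amount = rng.randint(1, 1000)                 # structured numeric field
        length = rng.randint(3, 8)                    # number of words in the text field
        comment = " ".join(rng.choice(WORDS) for _ in range(length))
        yield row_id, amount, comment


def checksum(rows):
    """Digest over all rows, so two parties can confirm they generated identical data."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


first = checksum(generate_rows(1000, seed=7))
second = checksum(generate_rows(1000, seed=7))
print(first == second)  # the same seed and row count reproduce the same data set
```

A published seed plus a published checksum gives a simple verification step: a result is only comparable if the data it ran on matches the reference digest.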
Abstractions of the Big Data World
from WBDB
•
Enterprise Warehouse + Agglomeration of other
data
▫
Structured enterprise data warehouse
▫
Extended to incorporate data from other non-fully
structured data sources (e.g. weblogs, text, streams)
•
Pool of data with sequence of processing
▫
Enterprise data processing as a pipeline from data
ingestion to transformation, extraction, subsetting,
machine learning, predictive analytics
▫
Data from multiple structured and non-structured sources
Proposal 1: BigBench
•
Ghazal et al.: Teradata, Oracle, U. of Toronto,
InfoSizing
•
Derived from TPC-Decision Support (TPC-DS)
▫
Multiple snowflake schemas with shared dimensions
▫
24 tables with an average of 18 columns
▫
99 distinct SQL-99 queries with random substitutions
▫
More representative skewed database content
▫
Sub-linear scaling of non-fact tables
▫
Ad-hoc, reporting, iterative and extraction queries
▫
ETL-like data maintenance
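Two of the TPC-DS properties listed above, queries with random substitutions and sub-linear scaling of non-fact tables, can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the query text, the table and column names, the value ranges, and the square-root growth rule are hypothetical and are not the actual TPC-DS or BigBench definitions.

```python
import math
import random
from string import Template

# Hypothetical query template; the table and column names are illustrative only.
QUERY_TEMPLATE = Template(
    "SELECT item_id, SUM(quantity) FROM sales "
    "WHERE sale_year = $year AND category = '$category' GROUP BY item_id"
)
CATEGORIES = ["Books", "Music", "Sports"]


def instantiate(template, rng):
    """Fill the template's placeholders with randomly chosen values."""
    return template.substitute(
        year=rng.randint(1998, 2002),
        category=rng.choice(CATEGORIES),
    )


def table_rows(base_rows, scale_factor, is_fact_table):
    """Fact tables grow linearly; other tables grow with the square root (assumed rule)."""
    growth = scale_factor if is_fact_table else math.sqrt(scale_factor)
    return int(base_rows * growth)


rng = random.Random(2013)
query = instantiate(QUERY_TEMPLATE, rng)
print(query)
print(table_rows(1_000_000, 100, is_fact_table=True))   # 100000000
print(table_rows(10_000, 100, is_fact_table=False))     # 100000
```

Random substitution keeps each run from executing byte-identical queries, while a fixed seed keeps a given run repeatable.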
BigBench Data Model
• Workload = Set of queries
▫ On structured, semistructured, unstructured data
▫ Data mining, ML
• Paper published in ACM SIGMOD 2013. Full specification to appear
Proposal 2: Deep Analytics Pipeline
•
An end-to-end data processing pipeline:
▫
Data from multiple sources
▫
Loose, flexible schema
▫
Data requires structuring
▫
ELT rather than ETL
•
Application characteristics
▫
Processing pipelines
▫
Running models with data
[Figure: pipeline stages. Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation]
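The "ELT rather than ETL" point can be made concrete with a small Python sketch: raw records are loaded untouched first, and structure is imposed afterwards. The sample records, field names, and the `load` and `transform` functions are hypothetical and stand in for whatever storage and processing systems a real pipeline would use.

```python
import json

# Hypothetical raw events from two sources with different, loosely defined fields.
RAW_LINES = [
    '{"user": "u1", "ts": 100, "page": "/home"}',
    '{"uid": "u2", "time": 105, "query": "big data"}',
    'not valid json',
]


def load(lines):
    """Load step: store every line untouched, so no input is lost before structuring."""
    return list(lines)


def transform(stored_lines):
    """Transform step, run after loading: parse each line and map it to one schema."""
    structured, rejected = [], []
    for line in stored_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)        # keep unparseable input for later inspection
            continue
        structured.append({
            "user": record.get("user") or record.get("uid"),
            "timestamp": record.get("ts") or record.get("time"),
            "detail": record.get("page") or record.get("query"),
        })
    return structured, rejected


stored = load(RAW_LINES)
structured, rejected = transform(stored)
print(len(stored), len(structured), len(rejected))  # 3 2 1
```

Because the raw lines are kept, the transform step can be revised and rerun later without going back to the original sources.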
Example of an Application: User
Modeling
•
Objective: Determine user interests by
mining user activities
•
Large dimensionality of possible user
activities
•
Typical user has sparse activity vector
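A sparse activity vector can be represented by storing only the activities a user actually performed. The Python sketch below assumes a dictionary-based representation. The activity names and the weights are hypothetical and serve only to show the idea.

```python
from collections import Counter

# Hypothetical activity log for one user: one activity identifier per event.
events = ["view:sports", "search:tickets", "view:sports", "click:ad_17"]

# Sparse vector: only activities the user actually performed are stored.
activity_vector = Counter(events)


def dot(sparse_a, sparse_b):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(sparse_a) > len(sparse_b):
        sparse_a, sparse_b = sparse_b, sparse_a
    return sum(count * sparse_b.get(key, 0) for key, count in sparse_a.items())


interest_weights = {"view:sports": 0.5, "view:finance": 2.0}  # hypothetical model weights
print(dict(activity_vector))
print(dot(activity_vector, interest_weights))  # 1.0
```

Storage and computation scale with the number of observed activities, not with the full dimensionality of possible activities.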
User Modeling Pipeline
•
Data Acquisition
•
Sessionization
•
Feature and Target Generation
•
Model Training
•
Offline Scoring & Evaluation
•
Batch Scoring & Upload to serving
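The sessionization stage of the pipeline above can be sketched as follows. This is one common approach, not necessarily the one used in the proposal: events are grouped per user and a new session starts after a period of inactivity. The 30-minute threshold and the event format are assumptions for this example.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold; not from the slides


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp) events into per-user sessions split at long gaps."""
    by_user = defaultdict(list)
    for user, timestamp in events:
        by_user[user].append(timestamp)

    sessions = {}
    for user, timestamps in by_user.items():
        timestamps.sort()
        user_sessions = [[timestamps[0]]]
        for previous, current in zip(timestamps, timestamps[1:]):
            if current - previous > gap:
                user_sessions.append([current])      # gap exceeded: start a new session
            else:
                user_sessions[-1].append(current)    # still within the same session
        sessions[user] = user_sessions
    return sessions


# Hypothetical events: (user id, timestamp in seconds).
events = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 50), ("u1", 4100)]
result = sessionize(events)
print(result)
```

The resulting sessions are the natural input to the feature and target generation stage that follows.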
4th WBDB, October 9-10, San Jose, 2013