SNW Panel Big Data and Cloud Benchmarking

(1)

SNW Panel

Big Data and Cloud Benchmarking

Panelists:

•  Chaitan Baru, Center for Large-scale Data Systems research (CLDS), San Diego Supercomputer Center, UC San Diego

•  Raghu Nambiar, Strategist, Performance and Solution Engineering, Data Center Group, Cisco

(2)

Big Data and Cloud Benchmarking

Chaitan Baru

Director, Center for Large-scale Data Systems research (CLDS) San Diego Supercomputer Center, UC San Diego

(3)

CLDS: Center for Large-scale Data Systems research

•  A center dedicated to the study of technical, management, and economic issues related to large-scale data systems

▫  Architectures and systems for large-scale data

▫  Benchmarks for big data applications and systems

▫  Data growth and Information value

▫  A forum for exchange of ideas

▫  Management and professional education

  Exec Ed program with Rady School of Management

(4)

CLDS: Current Program Focus

•  Big Data Benchmarking

▫  Promote development of standards

•  The Growth of Data in Enterprises, Science, and Society

▫  Develop industry and science case studies

•  Professional Education

▫  Cloud Computing: Technical and Business aspects (i.e. “provider” versus “consumer/user” view)

▫  By verticals: e.g. healthcare

(5)

Data Growth

•  A taxonomy of big data

▫  Identify sources of data growth

•  Realizing value from big data

▫  Costs of big data

▫  Productivity benefits of big data analytics and decision making

•  Data Growth Index and Data Growth Forecasts

▫  Earlier “How Much Information?” (HMI) report, quoted in the McKinsey report on data growth

▫  Lead researcher: Dr. Jim Short (ex-MIT Sloan School)

(6)

Data…the Changing Context

•  Rapid growth in data

▫  → Data-driven science and business decisions

▫  IT acquisition decisions being made by lines of business, not a CIO (CIO → office systems)

•  Scientific workloads as a predictor of future business workloads

▫  Sensor-based systems, remote sensing, genome sequencing…

•  A point of inflexion for technology

▫  Changing software: from RDBMS, to NoSQL, to Hadoop ecosystem, …

▫  Changing hardware: multi-core, solid-state disk, large memory, new types of memory

▫  Changing platforms: dedicated systems vs clouds

▫  Changing business costs / models: ultra-high productivity, energy efficiency, rent vs own, (“first bulb to first light”)

(7)

TPC Benchmarks

•  The Transaction Processing Performance Council (TPC)

▫  Industry benchmark standards group

▫  Releases audited benchmark results for database systems (transactions and queries)

•  TPC-C

▫  First result, September, 1992: 54 tpmC

  $188,562/tpmC (TPC transactions per minute)

▫  Recent result, December 2010: 30,249,688 tpmC

  $1.01/tpmC

  $30M system, 27 SPARC server nodes, 4 processors, 16 cores, 512GB, 3x300 10K drives, 4x8Gbps HBA

•  TPC-D: Decision support benchmark

▫  First result, December 1995: 100 GB, 84 QphD and $52,170/QphD, ~$4M

•  TPC-H: Follow-on to TPC-D

▫  Recent result, October 2011: 1,112,401 QphH, $0.12/QphH, 100GB database

▫  $132,676 system, 8x2 processors, x6 cores; 24GB RAM/node


Transactions: ~600,000x performance improvement, ~200,000x price/performance improvement

Queries: ~100,000x performance improvement, ~450,000x price/performance improvement
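As a rough check, the improvement factors can be recomputed from the first and recent results quoted on this slide. This is a minimal sketch: the dictionary layout and the `improvement` helper are illustrative names, not part of any TPC tooling. TPC-D and TPC-H are different benchmarks with different metrics (QphD vs QphH), so the query comparison is only indicative, and the raw ratio of the quoted query throughput figures comes out lower than the slide's rounded multiplier, which may rest on figures or normalization not shown here.

```python
# Figures as quoted on the slide (performance metric, dollars per unit of metric).
tpcc_first = {"perf": 54, "price_per_perf": 188_562.00}      # Sept 1992: tpmC, $/tpmC
tpcc_recent = {"perf": 30_249_688, "price_per_perf": 1.01}   # Dec 2010
tpcd_first = {"perf": 84, "price_per_perf": 52_170.00}       # Dec 1995: QphD, $/QphD
tpch_recent = {"perf": 1_112_401, "price_per_perf": 0.12}    # Oct 2011: QphH, $/QphH


def improvement(first, recent):
    """Return (performance ratio, price/performance ratio) between two results."""
    perf_ratio = recent["perf"] / first["perf"]
    # Price/performance improves as the dollars-per-unit figure falls.
    price_ratio = first["price_per_perf"] / recent["price_per_perf"]
    return perf_ratio, price_ratio


tpcc = improvement(tpcc_first, tpcc_recent)
tpcdh = improvement(tpcd_first, tpch_recent)
print(f"TPC-C:   {tpcc[0]:,.0f}x performance, {tpcc[1]:,.0f}x price/performance")
print(f"TPC-D/H: {tpcdh[0]:,.0f}x performance, {tpcdh[1]:,.0f}x price/performance")
```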

(8)

Big Data in 1995!


(9)

Benchmarking Issues

•  “Reference benchmarks” for big data

▫  Define modalities of big data

▫  Define end-to-end flows of big data

▫  Identify key real-world characteristics

▫  Identify which existing benchmarks can be reused

  E.g. Terasort, Graph500, YCSB, etc.

•  “Probe benchmarks” for clouds

▫  E.g. Azurescope, plus many ad hoc efforts

▫  Propose: “Cloud Weather Service”

  Focus on application-level metrics, not system metrics

  Need a simple but systematic approach
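One way to picture a simple, systematic probe that reports application-level metrics is the sketch below. It is an assumption-laden illustration, not the proposed “Cloud Weather Service” itself: `probe` and `sample_request` are hypothetical names, and `sample_request` is a local stand-in for whatever application operation (a query, an object fetch, a job submission) would actually be probed against a cloud.

```python
import math
import statistics
import time


def probe(operation, runs=20):
    """Time `operation` repeatedly and summarize latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def sample_request():
    """Hypothetical stand-in for an application-level operation."""
    sum(i * i for i in range(10_000))


report = probe(sample_request, runs=20)
print(report)
```

Running the same probe on a schedule and from several locations would give the time series that a weather-style report needs; that scheduling is left out of this sketch.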

(10)

Difference between TPC and Big Data Benchmarking

•  The need to address more of the “lifecycle” of data

▫  From generation to reporting, and data growth

•  Dealing with different genres of data

▫  Should you buy different hardware / software for different types of data?

•  Data management software options

▫  SQL, noSQL, Hadoop ecosystem

•  Hardware configuration options

▫  SSD, large memory, new types of memory

•  Evolving / Heterogeneous hardware platforms

▫  Big data systems grow over time → heterogeneous hardware within a single system

•  From applications POV:

▫  Ability to integrate real-time data into decision support. E.g. Facebook takes 48 hours to integrate click-stream data into its business intelligence systems and wants to make that real-time.

(11)


•  NSF-supported Workshop on Big Data Benchmarking, WBDB2012, http://clds.sdsc.edu/wbdb2012, May 8-9, at Brocade Exec Briefing Center, San Jose, CA.

▫  Participants: CLDS/SDSC, Amazon, Brocade, Cisco, Dell, EMC, Facebook, Google, HP, IndianaU, Intel, JHU, LinkedIn, Mellanox, Microsoft, Netflix, Oracle, PayPal, SAS, Seagate, Shell, TSRI, UCI, U.Toronto, U.Wash, WhamCloud

▫  Results will be presented at

  Workshop on Architectures and Systems for Big Data, June 9, Portland, OR

  TPC Technical Committee meeting, VLDB2012, Aug

(12)


Towards an Industry Standard for Performance Evaluation and Benchmarking Big Data Workloads

Raghunath Nambiar

Strategist, Performance and Solution Engineering

Data Center Group, Cisco Systems, Inc

(13)

There are 15 billion devices connected to the Internet

That’s 2.2 devices for every man, woman, and child on planet Earth

If Facebook were a country, it would be the 3rd largest in the world

1.  China (1.339 billion)

2.  India (1.218 billion)

3.  Facebook (900 million)

4.  United States (311 million)

5.  Indonesia (237 million)

6.  Brazil (190 million)

7.  Pakistan (175 million)

8.  Nigeria (158 million)

9.  Bangladesh (150 million)

10.  Russia (142 million)

Data growth: 0.5 zettabytes (2008), 2.5 zettabytes (2011), 35 zettabytes (2020)
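The figures on this slide imply a world population and annual growth rates that are easy to derive. The sketch below only restates the slide's own numbers; the variable names and the `cagr` helper are illustrative, and the growth rates assume smooth compound growth between the quoted years, which the slide does not claim.

```python
devices = 15e9            # connected devices, as quoted
devices_per_person = 2.2  # devices per person, as quoted
implied_population = devices / devices_per_person

# Data volume in zettabytes by year, as quoted on the slide.
volume_zb = {2008: 0.5, 2011: 2.5, 2020: 35.0}


def cagr(start_year, end_year):
    """Compound annual growth rate between two of the quoted years."""
    years = end_year - start_year
    return (volume_zb[end_year] / volume_zb[start_year]) ** (1 / years) - 1


print(f"Implied population: {implied_population / 1e9:.2f} billion")
print(f"2008-2011 growth: {cagr(2008, 2011):.0%} per year")
print(f"2011-2020 growth: {cagr(2011, 2020):.0%} per year")
```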

(14)

Almost all business is conducted over the Internet

Businesses generate more data, store more data, and store it for longer periods, often because compliance requires it

More data improves predictive analytics

[Figure: examples of business data. Sales, products, process, inventory, finance, payroll, shipping, tracking, authorization, customers, profile; machine logs, sensor data, call data records, web click-stream data, satellite feeds, GPS data; sales data, blogs, emails, pictures, video. Categories shown: structured, semi-structured, unstructured.]

(15)

•  Industry standard benchmarks

Transaction Processing Performance Council (TPC)

Standard Performance Evaluation Corporation (SPEC)

Storage Performance Council (SPC)

•  Application benchmarks

VMware VMmark

SAP Standard Application Benchmarks

Oracle Applications Benchmark

(16)

•  Vendor point of view

Define the playing field (measurable, repeatable)

Enable competitive analysis

Monitor release to release progress

Result understood by engineering, sales and customers

Accelerate focused technology development

•  Customer point of view

Cross-vendor comparisons (performance, cost, energy)

Evaluate new technologies

Eliminate costly in-house characterization

Industry Standard Benchmarks

Broad industry representation (all decisions taken by the board)

Verifiable (audit process)

Domain-specific standard tests

Resolution of disputes and challenges

(17)

• Relevant – A reader of the result believes the benchmark reflects something important

• Repeatable – There is confidence that the benchmark can be run a second time with the same result

• Fair – All systems and/or software being compared can participate equally

• Verifiable – There is confidence that the documented result is real

• Economical – The test sponsors can afford to run the benchmark

Huppler, K.: The Art of Building a Good Benchmark. In: Nambiar, R.O., Poess, M. (eds.) TPCTC 2009. LNCS, vol. 5895, pp. 167-182. Springer, Heidelberg (2009)

• Performance

• Cost of Ownership

• Energy Efficiency

• Floor Space Efficiency

• Manageability

• In-House vs Hosted

(18)

(19)

Big Data Benchmarking

Milind Bhandarkar

(20)

Applications Drive Systems

•  Data Science

•  Machine Learning

•  Analytics & Reporting

(21)

Data Science Workload

(Courtesy: Hilary Mason, Chief Scientist, Bit.ly)

•  Obtain

•  Scrub

•  Explore

•  Model

(22)

Obtain

•  Corpus needs to be usable & sufficient

•  Possibly from multiple independent sources

•  Needs to be automated for streams

•  Needs to have efficient ingestion for one-time data

(23)

Scrub

•  Raw data is always messy

•  Missing data, inconsistent data, charsets

•  NY, New York, NYC, Big Apple, etc.

•  Growing Dictionaries

•  Join with Crowdsourcing
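The scrubbing points above (inconsistent spellings, a dictionary that grows, unresolved values handed to people) can be sketched in a few lines. Everything here is illustrative: the `aliases` dictionary, the `scrub` helper, and the sample values are assumptions made for the example, and the "crowdsourced" resolution is simulated by a manual dictionary update.

```python
# Hypothetical alias dictionary; in practice it grows as new variants appear.
aliases = {
    "ny": "New York",
    "nyc": "New York",
    "new york": "New York",
    "big apple": "New York",
}


def scrub(value, aliases, unresolved):
    """Map a raw value to its canonical form, or queue it for review."""
    # Normalize case and collapse runs of whitespace before the lookup.
    key = " ".join(value.strip().lower().split())
    if not key:
        return None  # missing data stays missing
    if key in aliases:
        return aliases[key]
    unresolved.append(value)  # candidate for a reviewer or crowdsourced task
    return None


raw = ["NY", " New  York", "NYC", "Big Apple", "", "N.Y.C."]
unresolved = []
cleaned = [scrub(value, aliases, unresolved) for value in raw]

# Growing dictionary: once an unknown variant has been resolved, record it
# so that later records are handled automatically.
aliases["n.y.c."] = "New York"  # simulated resolution
retry = scrub("N.Y.C.", aliases, [])
print(cleaned, unresolved, retry)
```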

(24)

Explore

•  Visualize, Clustering, Dimensionality reduction

•  Feature correlations (scatter plots)
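Before drawing scatter plots, pairwise feature correlations can be computed directly. The sketch below uses the Pearson correlation coefficient, which is one common choice and an assumption here; the slide does not name a specific measure. The feature names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up feature columns, for illustration only.
features = {
    "page_views": [10, 20, 30, 40, 50],
    "clicks": [1, 2, 3, 4, 5],
    "latency_ms": [52, 41, 33, 18, 11],
}

names = list(features)
matrix = {(a, b): pearson(features[a], features[b]) for a in names for b in names}
for (a, b), r in matrix.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: {r:+.3f}")
```

Pairs with a coefficient near +1 or -1 are the ones worth inspecting in a scatter plot first.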

(25)

Model

•  Find correlation of past data and known outcomes

•  Find good training set

•  Label the training set

•  Derive model parameters

•  Apply model, and validate
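The modeling steps above can be illustrated with the simplest possible model: a single threshold on one feature. This is a toy sketch under stated assumptions, not a recommended method. The labelled data is made up, the 8/2 split is arbitrary, and `fit_threshold` and `accuracy` are hypothetical helper names.

```python
# Made-up labelled examples: (feature value, known outcome).
labelled = [
    (0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
    (2.5, 1), (3.0, 1), (3.5, 1), (4.0, 1),
    (1.2, 0), (3.2, 1),
]

# Training set for deriving parameters, held-out set for validation.
train, holdout = labelled[:8], labelled[8:]


def accuracy(threshold, rows):
    """Fraction of rows where 'value > threshold' matches the known outcome."""
    hits = sum(1 for value, outcome in rows if int(value > threshold) == outcome)
    return hits / len(rows)


def fit_threshold(rows):
    """Derive the single model parameter: the best cut point on the training set."""
    values = sorted(value for value, _ in rows)
    # Candidate cut points are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, rows))


threshold = fit_threshold(train)
train_accuracy = accuracy(threshold, train)
holdout_accuracy = accuracy(threshold, holdout)
print(threshold, train_accuracy, holdout_accuracy)
```

Validating on rows that were not used to derive the threshold is the point of the last step: a model that only scores well on its own training set has not been validated.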

(26)

Interpret

•  Models are built for prediction and interpretation

•  Check that there are no surprises

•  Reason about models

(27)

Data Science Data Flow

•  Raw Data (Timed, Partitioned, Crowdsourced, De-duped, etc.)

•  Derived data (simple aggregates, other statistics)

•  Models (Feature weights, decision trees)

(28)

Data Diversity/Genres

•  Natural Language Text, and Annotations

•  (Bags of words) : Concept

•  Graphs (sparse matrices)

•  Dense Matrices

(29)

Tools at Hand

•  MPP Databases

•  Big Data (NoSQL, Hadoop, etc.)

•  Low latency message-passing

•  Variety of Compute Frameworks

•  Parallel SQL, MapReduce, MPI, BSP, and layered frameworks

(30)

Benchmarks

•  Need to emulate real data science workloads at various scales

•  TeraSort, Grep, and WordCount are not enough
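To make the contrast with single-kernel tests concrete, the sketch below times a small multi-stage pipeline at several scale factors and reports each stage separately. It is a toy illustration, not a proposed benchmark: the stages, the synthetic data generator, and the scale factors are all assumptions chosen to keep the example short and runnable on one machine.

```python
import random
import time


def obtain(n, seed=0):
    """Generate n synthetic (city, amount) records; a stand-in for ingestion."""
    rng = random.Random(seed)
    variants = ["NY", "New York", "NYC", "SF", "San Francisco"]
    return [(rng.choice(variants), rng.uniform(1, 100)) for _ in range(n)]


def scrub(records):
    """Replace known variants with a canonical city name."""
    canonical = {"NY": "New York", "NYC": "New York", "SF": "San Francisco"}
    return [(canonical.get(city, city), amount) for city, amount in records]


def explore(records):
    """Derive a simple aggregate: total amount per city."""
    totals = {}
    for city, amount in records:
        totals[city] = totals.get(city, 0.0) + amount
    return totals


def run(scale):
    """Run all stages at one scale factor and time each stage in seconds."""
    stages = (("obtain", lambda _: obtain(scale)), ("scrub", scrub), ("explore", explore))
    timings = {}
    data = None
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


results = {scale: run(scale) for scale in (1_000, 10_000, 100_000)}
for scale, (totals, timings) in results.items():
    print(scale, {name: round(seconds * 1000, 2) for name, seconds in timings.items()})
```

Reporting per-stage timings across scales shows which stage dominates as data grows, which a single sort or word-count kernel cannot reveal.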