
(1)

SNW Panel

Big Data and Cloud Benchmarking

Panelists:

•  Chaitan Baru, Center for Large-scale Data Systems research (CLDS), San Diego Supercomputer Center, UC San Diego

•  Raghu Nambiar, Strategist, Performance and Solution Engineering, Data Center Group, Cisco

(2)

Big Data and Cloud Benchmarking

Chaitan Baru

Director, Center for Large-scale Data Systems research (CLDS) San Diego Supercomputer Center, UC San Diego

(3)

CLDS: Center for Large-scale Data

Systems research

•  A center dedicated to the study of technical, management, and economic issues related to large-scale data systems

▫  Architectures and systems for large-scale data

▫  Benchmarks for big data applications and systems

▫  Data growth and Information value

▫  A forum for exchange of ideas

▫  Management and professional education

  Exec Ed program with Rady School of Management

(4)

CLDS: Current Program Focus

•  Big Data Benchmarking

▫  Promote development of standards

•  The Growth of Data in Enterprises, Science, and Society

▫  Develop industry and science case studies

•  Professional Education

▫  Cloud Computing: technical and business aspects (i.e., “provider” versus “consumer/user” view)

▫  By verticals: e.g., healthcare

(5)

Data Growth

•  A taxonomy of big data

▫  Identify sources of data growth

•  Realizing value from big data

▫  Costs of big data

▫  Productivity benefits of big data analytics and decision making

•  Data Growth Index and Data Growth Forecasts

▫  Earlier HMI (How Much Information?) report, quoted in the McKinsey report on data growth

▫  Lead researcher: Dr. Jim Short (ex-MIT Sloan School)

(6)

Data…the Changing Context

•  Rapid growth in data

▫  Leads to data-driven science and business decisions

▫  IT acquisition decisions being made by lines of business, not a CIO (CIO  office systems)

•  Scientific workloads as a predictor of future business workloads

▫  Sensor-based systems, remote sensing, genome sequencing…

•  A point of inflexion for technology

▫  Changing software: from RDBMS, to noSQL, to Hadoop ecosystem, …

▫  Changing hardware: multi-core, solid-state disk, large memory, new types of memory

▫  Changing platforms: dedicated systems vs clouds

▫  Changing business costs / models: ultra-high productivity, energy efficiency, rent vs own, (“first bulb to first light”)

(7)

TPC Benchmarks

•  The Transaction Processing Performance Council (TPC)

▫  Industry benchmark standards group

▫  Releases audited benchmark results for database systems (transactions and queries)

•  TPC-C

▫  First result, September, 1992: 54 tpmC

  $188,562/tpmC (TPC transactions per minute)

▫  Recent result, December 2010: 30,249,688 tpmC

  $1.01/tpmC

  $30M system, 27 SPARC server nodes, 4 processors, 16 cores, 512GB, 3x300 10K drives, 4x8Gbps HBA

•  TPC-D: Decision support benchmark

▫  First result, December 1995: 100 GB, 84 QphD and $52,170/QphD, ~$4M

•  TPC-H: Follow-on to TPC-D

▫  Recent result, October 2011: 1,112,401 QphH, $0.12/QphH, 100GB database

▫  $132,676 system, 8x2 processors, x6 cores; 24GB RAM/node

TPC-C: ~600,000x transaction performance improvement, ~200,000x price/performance improvement

TPC-D to TPC-H: ~100,000x query performance improvement, ~450,000x price/performance improvement
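The improvement factors above can be checked against the first and recent results quoted on this slide. The following is a minimal sketch; the dictionary names and the `improvement_factors` helper are illustrative and not part of any TPC tooling. QphD and QphH are different metrics, so the query comparison is only a rough one. Computed from the two quoted endpoints alone, the query throughput ratio comes out near 13,000x, which is lower than the slide's rounded figure; the other three factors agree with the slide after rounding.

```python
# Endpoint figures as quoted on the slide.
TPCC_FIRST = {"throughput": 54.0, "price": 188_562.0}        # tpmC, $/tpmC (1992)
TPCC_RECENT = {"throughput": 30_249_688.0, "price": 1.01}    # tpmC, $/tpmC (2010)
TPCD_FIRST = {"throughput": 84.0, "price": 52_170.0}         # QphD, $/QphD (1995)
TPCH_RECENT = {"throughput": 1_112_401.0, "price": 0.12}     # QphH, $/QphH (2011)


def improvement_factors(first, recent):
    """Return (throughput gain, price/performance gain) between two results."""
    throughput_gain = recent["throughput"] / first["throughput"]
    # Price/performance improves as the cost per unit of throughput falls.
    price_perf_gain = first["price"] / recent["price"]
    return throughput_gain, price_perf_gain


tpcc_gain = improvement_factors(TPCC_FIRST, TPCC_RECENT)
query_gain = improvement_factors(TPCD_FIRST, TPCH_RECENT)

print(f"TPC-C: {tpcc_gain[0]:,.0f}x throughput, {tpcc_gain[1]:,.0f}x price/performance")
print(f"TPC-D to TPC-H: {query_gain[0]:,.0f}x throughput, {query_gain[1]:,.0f}x price/performance")
```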

(8)

Big Data in 1995!


(9)

Benchmarking Issues

•  “Reference benchmarks” for big data

▫  Define modalities of big data

▫  Define end-to-end flows of big data

▫  Identify key real-world characteristics

▫  Identify which existing benchmarks can be reused

  E.g. Terasort, Graph500, YCSB, etc.

•  “Probe benchmarks” for clouds

▫  E.g. Azurescope, plus many ad hoc efforts

▫  Propose: “Cloud Weather Service”

  Focus on application-level metrics, not system metrics

  Need a simple but systematic approach

(10)

Difference between TPC and Big Data

Benchmarking

•  The need to address more of the “lifecycle” of data

▫  From generation to reporting, and data growth

•  Dealing with different genres of data

▫  Should you buy different hardware / software for different types of data?

•  Data management software options

▫  SQL, noSQL, Hadoop ecosystem

•  Hardware configuration options

▫  SSD, large memory, new types of memory

•  Evolving / Heterogeneous hardware platforms

▫  Big data systems grow over time, leading to heterogeneous hardware within a single system

•  From applications POV:

▫  Ability to integrate realtime data into decision support. E.g., Facebook takes 48 hours to integrate click stream data into its business intelligence systems and wants to make that realtime.

(11)


•  NSF-supported Workshop on Big Data Benchmarking, WBDB2012, http://clds.sdsc.edu/wbdb2012, May 8-9, at Brocade Exec Briefing Center, San Jose, CA.

▫  Participants: CLDS/SDSC, Amazon, Brocade, Cisco, Dell, EMC, Facebook, Google, HP, IndianaU, Intel, JHU, LinkedIn, Mellanox, Microsoft, Netflix, Oracle, PayPal, SAS, Seagate, Shell, TSRI, UCI, U.Toronto, U.Wash, WhamCloud

▫  Results will be presented at

  Workshop on Architectures and Systems for Big Data, June 9, Portland, OR

  TPC Technical Committee meeting, VLDB2012, Aug

(12)


Towards an Industry Standard for

Performance Evaluation and

Benchmarking Big Data Workloads

Raghunath Nambiar

Strategist, Performance and Solution Engineering

Data Center Group, Cisco Systems, Inc

(13)

© 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential 13

There are 15 billion devices connected to the Internet

That’s 2.2 devices for every man, woman, and child on planet Earth

If Facebook were a country, it would be the 3rd largest in the world

1.  China (1.339 billion)
2.  India (1.218 billion)
3.  Facebook (900 million)
4.  United States (311 million)
5.  Indonesia (237 million)
6.  Brazil (190 million)
7.  Pakistan (175 million)
8.  Nigeria (158 million)
9.  Bangladesh (150 million)
10.  Russia (142 million)

2008: 0.5 zettabytes; 2011: 2.5 zettabytes; 2020: 35 zettabytes

(14)


Almost every business is conducted over the Internet

Businesses generate more data, store more data, and store it for longer periods, often because compliance requires it

More data improves predictive analytics

Structured: Sales, Products, Process, Inventory, Finance, Payroll, Shipping, Tracking, Authorization, Customers, Profile

Semi-structured and unstructured: Machine logs, Sensor data, Call data records, Web click stream data, Satellite feeds, GPS data, Sales data, Blogs, Emails, Pictures, Video

(15)


•  Industry standard benchmarks

Transaction Processing Performance Council (TPC)

Standard Performance Evaluation Corporation (SPEC)

Storage Performance Council (SPC)

•  Application benchmarks

VMware VMmark

SAP Standard Application Benchmarks

Oracle Applications Benchmark

(16)


•  Vendor point of view

Define the playing field (measurable, repeatable)

Enable competitive analysis

Monitor release to release progress

Result understood by engineering, sales and customers

Accelerate focused technology development

•  Customer point of view

Cross-vendor comparisons (performance, cost, energy)

Evaluate new technologies

Eliminate costly in-house characterization

Industry Standard Benchmarks

Broad industry representation (all decisions taken by the board)

Verifiable (audit process)

Domain-specific standard tests

Resolution of disputes and challenges

(17)

Relevant, Repeatable, Fair, Verifiable, Economical

• Relevant – A reader of the result believes the benchmark reflects something important

• Repeatable – There is confidence that the benchmark can be run a second time with the same result

• Fair – All systems and/or software being compared can participate equally

• Verifiable – There is confidence that the documented result is real

• Economical – The test sponsors can afford to run the benchmark
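The "Repeatable" property can be made concrete by comparing several runs of the same benchmark. Below is a minimal sketch that uses the coefficient of variation across runs. The 2% threshold, the sample throughput values, and the function names are assumptions chosen for illustration; they are not taken from the cited paper or from any TPC specification.

```python
from statistics import mean, stdev


def coefficient_of_variation(results):
    """Sample standard deviation divided by the mean of the run results."""
    return stdev(results) / mean(results)


def is_repeatable(results, max_cv=0.02):
    """Treat a benchmark as repeatable when run-to-run variation is small.

    max_cv is a hypothetical tolerance, not a published standard.
    """
    return len(results) >= 2 and coefficient_of_variation(results) <= max_cv


# Hypothetical throughput measurements from three runs of one configuration.
stable_runs = [1010.0, 1000.0, 990.0]
noisy_runs = [1000.0, 1200.0]

print(is_repeatable(stable_runs))
print(is_repeatable(noisy_runs))
```

A relative measure is used so that the same tolerance applies to systems with very different absolute throughput.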

Huppler, K.: The Art of Building a Good Benchmark. In: Nambiar, R.O., Poess, M. (eds.) TPCTC 2009. LNCS, vol. 5895, pp. 167-182. Springer, Heidelberg (2009)

• Performance

• Cost of Ownership

• Energy Efficiency

• Floor Space Efficiency

• Manageability

• In-House vs Hosted

(18)
(19)

Big Data Benchmarking

Milind Bhandarkar

(20)

Applications Drive Systems

Data Science

Machine Learning

Analytics & Reporting
(21)

Data Science Workload

(Courtesy: Hilary Mason, Chief Scientist, Bit.ly)

Obtain

Scrub

Explore

Model
(22)

Obtain

Corpus needs to be usable & sufficient

Possibly from multiple independent sources

Needs to be automated for streams

Needs to have efficient ingestion for one-time data
(23)

Scrub

Raw data is always messy

Missing data, inconsistent data, charsets

NY, New York, NYC, Big Apple etc

Growing Dictionaries

Join with Crowdsourcing
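The dictionary-based scrubbing described on this slide can be sketched as follows. This is a minimal illustration, assuming a small hand-built alias table; `CITY_ALIASES`, `normalize_city`, and `scrub` are hypothetical names, and the `Gotham` record is made-up sample data. Values that the dictionary cannot resolve are collected separately so they can be reviewed (for example by crowdsourcing) and added to the dictionary later.

```python
# Hypothetical alias dictionary; in practice it grows as new variants appear.
CITY_ALIASES = {
    "ny": "New York",
    "new york": "New York",
    "nyc": "New York",
    "big apple": "New York",
}


def normalize_city(raw, aliases=CITY_ALIASES):
    """Return (canonical_name, known) for a raw city string."""
    key = " ".join(raw.strip().lower().split())  # trim and collapse whitespace
    if key in aliases:
        return aliases[key], True
    return raw.strip(), False


def scrub(values, aliases=CITY_ALIASES):
    """Normalize every value and collect the ones the dictionary cannot resolve."""
    cleaned, unresolved = [], []
    for value in values:
        name, known = normalize_city(value, aliases)
        cleaned.append(name)
        if not known:
            unresolved.append(value)
    return cleaned, unresolved


cleaned, unresolved = scrub(["NY", " nyc ", "Big  Apple", "Gotham"])
print(cleaned)
print(unresolved)
```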
(24)

Explore

Visualize, Clustering, Dimensionality reduction

Feature correlations (scatter plots)
(25)

Model

Find correlation of past data and known outcomes

Find good training set

Label the training set

Derive model parameters

Apply model, and validate
(26)

Interpret

Models are built for prediction and interpretation

Check that there are no surprises

Reason about models
(27)

Data Science Data Flow

Raw Data (Timed, Partitioned, Crowdsourced, De-duped etc)

Derived data (simple aggregates, other statistics)

Models (Feature weights, decision trees)
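The first two stages of this data flow, raw data to derived data, can be sketched in a few lines. The example below is an assumption-laden illustration: the click events, field layout, and function names are hypothetical. It de-duplicates timed raw events and derives simple hourly aggregates; the model stage is left out.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw click events: (ISO timestamp, user id, page).
raw_events = [
    ("2012-05-08T10:15:00", "u1", "/home"),
    ("2012-05-08T10:15:00", "u1", "/home"),  # exact duplicate
    ("2012-05-08T10:40:00", "u2", "/home"),
    ("2012-05-08T11:05:00", "u1", "/pricing"),
]


def dedupe(events):
    """Drop exact duplicate events while preserving arrival order."""
    seen = set()
    unique = []
    for event in events:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique


def hourly_counts(events):
    """Derived data: number of events in each hourly partition."""
    counts = Counter()
    for timestamp, _user, _page in events:
        hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
        counts[hour] += 1
    return counts


clean_events = dedupe(raw_events)
derived = hourly_counts(clean_events)
print(dict(derived))
```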
(28)

Data Diversity/Genres

Natural Language Text, and Annotations

(Bags of words) : Concept

Graphs (sparse matrices)

Dense Matrices
(29)

Tools at Hand

MPP Data Bases

Big Data (NoSQL, Hadoop etc)

Low latency message-passing

Variety of Compute Frameworks

Parallel SQL, MapReduce, MPI, BSP, and layered frameworks
(30)

Benchmarks

Need to emulate real data science workloads at various scales

TeraSort, Grep, and WordCount are not enough
