SNW Panel
Big Data and Cloud Benchmarking
Panelists:
• Chaitan Baru, Center for Large-scale Data Systems research (CLDS), San Diego Supercomputer Center, UC San Diego
• Raghu Nambiar, Strategist, Performance and Solution Engineering, Data Center Group, Cisco
Big Data and Cloud Benchmarking
Chaitan Baru
Director, Center for Large-scale Data Systems research (CLDS), San Diego Supercomputer Center, UC San Diego
CLDS: Center for Large-scale Data Systems research
• A center dedicated to the study of technical,
management, and economic issues related to large-scale data systems
▫ Architectures and systems for large-scale data
▫ Benchmarks for big data applications and systems
▫ Data growth and Information value
▫ A forum for exchange of ideas
▫ Management and professional education
Exec Ed program with Rady School of Management
CLDS: Current Program Focus
• Big Data Benchmarking
▫ Promote development of standards
• The Growth of Data in Enterprises, Science, and
Society
▫ Develop industry and science case studies
• Professional Education
▫ Cloud Computing: Technical and Business aspects (i.e. “provider” versus “consumer/user” view)
▫ By verticals: e.g. healthcare
Data Growth
• A taxonomy of big data
▫ Identify sources of data growth
• Realizing value from big data
▫ Costs of big data
▫ Productivity benefits of big data
analytics and decision making
• Data Growth Index and Data
Growth Forecasts
▫ Earlier HMI (How Much Information?) report quoted in McKinsey report on data growth
▫ Lead researcher: Dr. Jim Short
(ex-MIT Sloan School)
Data…the Changing Context
• Rapid growth in data
▫ data-driven science and business decisions
▫ IT acquisition decisions being made by lines of business, not a CIO (CIO office systems)
• Scientific workloads as a predictor of future business
workloads
▫ Sensor-based systems, remote sensing, genome sequencing…
• A point of inflexion for technology
▫ Changing software: from RDBMS, to noSQL, to Hadoop
ecosystem, …
▫ Changing hardware: multi-core, solid-state disk, large memory, new types of memory
▫ Changing platforms: dedicated systems vs clouds
▫ Changing business costs / models: ultra-high productivity, energy efficiency, rent vs own, (“first bulb to first light”)
TPC Benchmarks
• The Transaction Processing Performance Council (TPC)
▫ Industry benchmark standards group
▫ Releases audited benchmark results for database systems (transactions and queries)
• TPC-C
▫ First result, September, 1992: 54 tpmC
$188,562/tpmC (TPC transactions per minute)
▫ Recent result, December 2010: 30,249,688 tpmC
$1.01/tpmC
$30M system, 27 SPARC server nodes, 4 processors, 16 cores, 512GB, 3x300 10K
drives, 4x8Gbps HBA
• TPC-D: Decision support benchmark
▫ First result, December 1995: 100 GB, 84 QphD and $52,170/QphD, ~$4M
• TPC-H: Follow-on to TPC-D
▫ Recent result, October 2011: 1,112,401 QphH, $0.12/QphH, 100GB database
▫ $132,676 system, 8x2 processors, x6 cores; 24GB RAM/node
~600,000x transaction performance improvement
~200,000x price/performance improvement
~100,000x query performance improvement
~450,000x price/performance improvement
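The TPC-C improvement factors can be checked against the first and recent results quoted on this slide. This is a minimal sketch, assuming each factor is a simple quotient of the quoted figures; the helper name `improvement` is illustrative and not part of any TPC tooling. Only TPC-C is computed, because QphD and QphH are different metrics and are not directly comparable.

```python
def improvement(first: float, recent: float, lower_is_better: bool = False) -> float:
    """Return how many times better `recent` is than `first`."""
    return first / recent if lower_is_better else recent / first


# TPC-C figures as quoted on the slide.
tpmc_gain = improvement(54, 30_249_688)                         # throughput in tpmC
price_gain = improvement(188_562, 1.01, lower_is_better=True)   # price in $/tpmC

print(f"TPC-C throughput:        ~{tpmc_gain:,.0f}x")
print(f"TPC-C price/performance: ~{price_gain:,.0f}x")
```

Both quotients land in the same range as the rounded factors on the slide (roughly 560,000x and 187,000x).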
Big Data in 1995!
Benchmarking Issues
• “Reference benchmarks” for big data
▫ Define modalities of big data
▫ Define end-to-end flows of big data
▫ Identify key real-world characteristics
▫ Identify which existing benchmarks can be reused
E.g. Terasort, Graph500, YCSB, etc.
• “Probe benchmarks” for clouds
▫ E.g. Azurescope, plus many ad hoc efforts
▫ Propose: “Cloud Weather Service”
Focus on application-level metrics, not system metrics
Need a simple but systematic approach
Difference between TPC and Big Data Benchmarking
• The need to address more of the “lifecycle” of data
▫ From generation to reporting, and data growth
• Dealing with different genres of data
▫ Should you buy different hardware / software for different types of data?
• Data management software options
▫ SQL, noSQL, Hadoop ecosystem
• Hardware configuration options
▫ SSD, large memory, new types of memory
• Evolving / Heterogeneous hardware platforms
▫ Big data systems grow over time, leading to heterogeneous hardware within a single system
• From applications POV:
▫ Ability to integrate realtime data into decision support. E.g. Facebook:
takes 48 hours to integrate click stream into business intelligence systems. Want to make that realtime.
• NSF-supported workshop on Big Data
Benchmarking, WBDB2012,
http://clds.sdsc.edu/wbdb2012, May 8-9, at Brocade Exec Briefing Center, San Jose, CA.
▫ Participants: CLDS/SDSC, Amazon, Brocade, Cisco, Dell, EMC, Facebook, Google, HP, IndianaU, Intel, JHU, LinkedIn,
Mellanox, Microsoft, Netflix, Oracle, PayPal, SAS, Seagate, Shell, TSRI, UCI, U.Toronto, U.Wash, WhamCloud
▫ Results will be presented at
Workshop on Architectures and Systems for Big Data,
June 9, Portland, OR
TPC Technical Committee meeting, VLDB2012, Aug
Towards an Industry Standard for
Performance Evaluation and
Benchmarking Big Data Workloads
Raghunath Nambiar
Strategist, Performance and Solution Engineering
Data Center Group, Cisco Systems, Inc
© 2012 Cisco and/or its affiliates. All rights reserved. Cisco Confidential 13
There are 15 billion devices connected to the Internet
That’s 2.2 devices for every man, woman, and child on planet Earth
If Facebook were a country, it would be the 3rd largest in the world
1. China (1.339 billion)
2. India (1.218 billion)
3. Facebook (900 million)
4. United States (311 million)
5. Indonesia (237 million)
6. Brazil (190 million)
7. Pakistan (175 million)
8. Nigeria (158 million)
9. Bangladesh (150 million)
10. Russia (142 million)
2008: 0.5 zettabytes
2011: 2.5 zettabytes
2020: 35 zettabytes
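The three zettabyte figures imply a compound annual growth rate. This is a minimal sketch, assuming the figures describe total data volume in each year (the slide does not label them) and that growth compounds annually; the function name `cagr` is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two data points."""
    return (end_value / start_value) ** (1 / years) - 1


# Zettabyte figures as quoted on the slide.
growth_2008_2011 = cagr(0.5, 2.5, 2011 - 2008)
growth_2011_2020 = cagr(2.5, 35, 2020 - 2011)

print(f"2008-2011: {growth_2008_2011:.0%} per year")
print(f"2011-2020: {growth_2011_2020:.0%} per year")
```

Under these assumptions the quoted figures correspond to roughly 71% annual growth for 2008-2011 and roughly 34% for the 2011-2020 forecast.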
Almost every business is conducted over the Internet
Businesses generate more data, store more data, and store it for longer periods, often because compliance requires it
More data improves predictive analytics
Data sources: Sales, Products, Process, Inventory, Finance, Payroll, Shipping, Tracking, Authorization, Customers, Profile
Machine logs, Sensor data, Call data records, Web click stream data, Satellite feeds, GPS data
Sales data, Blogs, Emails, Pictures, Video
Data types: Structured, Semi-structured, Unstructured
• Industry standard benchmarks
Transaction Processing Performance Council (TPC)
Standard Performance Evaluation Corporation (SPEC)
Storage Performance Council (SPC)
• Application benchmarks
VMware VMmark
SAP Standard Application Benchmarks
Oracle Applications Benchmark
• Vendor point of view
Define the playing field (measurable, repeatable)
Enable competitive analysis
Monitor release to release progress
Result understood by engineering, sales and customers
Accelerate focused technology development
• Customer point of view
Cross-vendor comparisons (performance, cost, energy)
Evaluate new technologies
Eliminate costly in-house characterization
• Industry Standard Benchmarks
Broad industry representation (all decisions taken by the board)
Verifiable (audit process)
Domain-specific standard tests
Resolution of disputes and challenges
Relevant, Repeatable, Fair, Verifiable, Economical
• Relevant – A reader of the result believes the benchmark reflects something important
• Repeatable – There is confidence that the benchmark can be run a second time with the same result
• Fair – All systems and/or software being compared can participate equally
• Verifiable – There is confidence that the documented result is real
• Economical – The test sponsors can afford to run the benchmark
Huppler, K.: The Art of Building a Good Benchmark. In: Nambiar, R.O., Poess, M. (eds.) TPCTC 2009. LNCS, vol. 5895, pp. 167-182. Springer, Heidelberg (2009)
• Performance
• Cost of Ownership
• Energy Efficiency
• Floor Space Efficiency
• Manageability
• In-House vs Hosted
Big Data Benchmarking
Milind Bhandarkar
Applications Drive Systems
• Data Science
• Machine Learning
• Analytics & Reporting
Data Science Workload
(Courtesy: Hilary Mason, Chief Scientist, Bit.ly)
• Obtain
• Scrub
• Explore
• Model
Obtain
• Corpus needs to be usable & sufficient
• Possibly from multiple independent sources
• Needs to be automated for streams
• Needs to have efficient ingestion for one-time data
Scrub
• Raw data is always messy
• Missing data, inconsistent data, charsets
• NY, New York, NYC, Big Apple, etc.
• Growing Dictionaries
• Join with Crowdsourcing
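The "NY, New York, NYC, Big Apple" case can be handled with a lookup dictionary that grows as new variants are resolved. This is a minimal sketch, not a description of any particular tool: the class name `GrowingDictionary`, its methods, and the normalization rule are all assumptions made for illustration.

```python
from typing import Optional


class GrowingDictionary:
    """Maps messy spellings to one canonical value and grows as variants are added."""

    def __init__(self) -> None:
        self._canonical = {}
        self.unresolved = []  # unknown values, queued for manual or crowdsourced review

    @staticmethod
    def _key(raw: str) -> str:
        # Drop periods, lowercase, and collapse whitespace before lookup.
        return " ".join(raw.replace(".", "").lower().split())

    def add(self, variant: str, canonical: str) -> None:
        self._canonical[self._key(variant)] = canonical

    def scrub(self, raw: str) -> Optional[str]:
        value = self._canonical.get(self._key(raw))
        if value is None:
            self.unresolved.append(raw)
        return value


cities = GrowingDictionary()
for variant in ("NY", "New York", "NYC", "Big Apple"):
    cities.add(variant, "New York")

cleaned = [cities.scrub(v) for v in ("nyc", " N.Y. ", "big  apple", "Gotham")]

# "Gotham" is unknown, so it lands in cities.unresolved. Once a reviewer
# resolves it, the dictionary grows and later records scrub automatically.
cities.add("Gotham", "New York")
```

Keeping unresolved values in a separate queue is one way to connect the dictionary to the crowdsourcing step: reviewers only see what the dictionary could not already handle.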
Explore
• Visualize, Clustering, Dimensionality reduction
• Feature correlations (scatter plots)
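Feature correlations are often summarized numerically before plotting. This is a minimal sketch of a pairwise Pearson correlation in plain Python; the feature names and values are made up for illustration, and the slide does not prescribe this particular measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


# Invented feature columns, for illustration only.
features = {
    "clicks":   [1, 2, 3, 4, 5],
    "sessions": [2, 4, 6, 8, 10],  # rises in step with clicks
    "latency":  [5, 4, 3, 2, 1],   # falls as clicks rise
}
names = list(features)
matrix = {(a, b): pearson(features[a], features[b]) for a in names for b in names}
```

Pairs with a coefficient near +1 or -1 are the ones worth inspecting in a scatter plot first.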
Model
• Find correlation of past data and known outcomes
• Find good training set
• Label the training set
• Derive model parameters
• Apply model, and validate
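The modeling steps above (choose a labeled training set, derive parameters, apply and validate) can be sketched with a one-feature least-squares fit. This is a minimal sketch on invented data; the slide does not name a model type, and the holdout rule and function names here are assumptions.

```python
def fit_line(xs, ys):
    """Derive slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def mean_abs_error(params, xs, ys):
    """Average absolute gap between predicted and known outcomes."""
    slope, intercept = params
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Past data with known outcomes (feature, label); values are invented.
history = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0),
           (5, 10.8), (6, 13.1), (7, 15.0), (8, 16.9)]

# Hold out every fourth record for validation and train on the rest.
train = [rec for i, rec in enumerate(history) if i % 4 != 3]
valid = [rec for i, rec in enumerate(history) if i % 4 == 3]

params = fit_line(*zip(*train))               # derive model parameters
error = mean_abs_error(params, *zip(*valid))  # apply the model and validate

print(f"slope={params[0]:.2f} intercept={params[1]:.2f} validation MAE={error:.3f}")
```

Validating on records the fit never saw is what makes the error a check on prediction rather than on memorization.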
Interpret
• Models are built for prediction and interpretation
• Check that there are no surprises
• Reason about models
Data Science Data Flow
• Raw Data (Timed, Partitioned, Crowdsourced, De-duped etc)
• Derived data (simple aggregates, other statistics)
• Models (Feature weights, decision trees)
Data Diversity/Genres
• Natural Language Text, and Annotations
• (Bags of words) : Concept
• Graphs (sparse matrices)
• Dense Matrices
Tools at Hand
• MPP Data Bases
• Big Data (NoSQL, Hadoop etc)
• Low latency message-passing
• Variety of Compute Frameworks
• Parallel SQL, MapReduce, MPI, BSP, and layered frameworks
Benchmarks
• Need to emulate real data science workloads at various scales
• TeraSort, Grep and Wordcount not enough
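For context, WordCount has roughly the following shape: one map step and one reduce step over text. This is a sketch in plain Python that mimics the MapReduce pattern, not actual Hadoop code; the function names and sample lines are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(pairs):
    """Sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


lines = ["big data and cloud", "Big Data benchmarking"]
counts = reduce_phase(chain.from_iterable(map_phase(line) for line in lines))
```

A workload like this exercises scanning, shuffling, and aggregation only. It touches none of the scrub, explore, or model stages described earlier, which is one way to read the claim that such micro-benchmarks are not enough.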