INSTITUTE OF COMPUTING TECHNOLOGY
Benchmarking and Ranking
Big Data Systems
Xinhui Tian
ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences
Outline
• BigDataBench
• BigDataBench Dwarfs
• BigData100
BPOE 2016 3
Why Big Data Benchmarking?
Measuring big data systems and architectures quantitatively
What is BigDataBench?
• An open source big data benchmarking project
  • http://prof.ict.ac.cn/BigDataBench
BigDataBench Users
• http://prof.ict.ac.cn/BigDataBench/users/
• Industry users
  • Accenture, Broadcom, Samsung, Huawei, IBM
• About 20 academic groups have published papers
BigDataBench 3.2 Overview
• Application domains: Search Engine, Social Network, E-commerce, Multimedia, Bioinformatics
• 15 Real-world Data Sets
  • Facebook Social Network, ImageNet, Wikipedia Entries, E-commerce Transaction, English broadcasting audio, Amazon Movie Reviews, ProfSearch Resumes, DVD Input Streams, Google Web Graph, SoGou Data, Image scene, MNIST, Genome sequence data, Assembly of the human genome
• BDGS (Big Data Generator Suite) for scalable data
• Software stacks: MPI, Shark, Impala, NoSql, Hadoop RDMA
What’s New in BigDataBench 3.2
BigDataBench support for Flink
• WordCount, Grep, Naïve Bayes, PageRank, K-means
Streaming
• JStorm, Spark Streaming
Graph processing
• GraphX, GraphLab, Flink Gelly
BigDataBench evolution
• BigDataBench 2.0 (2013.12)
  • Typical Internet service domains
  • An architectural perspective
  • 19 workloads & data generation tools
• BigDataBench 3.0 (2014.4)
  • Multidisciplinary effort
  • 32 workloads: diverse implementations
• BigDataBench 3.1 (2014.12)
  • 5 application domains: 14 data sets and 33 workloads
  • Same specifications: diverse implementations
  • Multi-tenancy version
  • BigDataBench subset and simulator version
• BigDataBench 3.2
  • New software stack: Flink, JStorm, GraphX, GraphLab
  • New workload type: Streaming, Graph processing
  • New dataset and workloads
BigDataBench Publications
• BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014).
• Characterizing data analysis workloads in data centers. 2013 IEEE International Symposium on Workload Characterization (IISWC 2013) (Best paper award)
• BigOP: generating comprehensive big data workloads as a benchmarking framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014)
• BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The Fourth workshop on big data benchmarking (WBDB 2014)
• Identifying Dwarfs Workloads in Big Data Analytics. arXiv preprint arXiv:1505.06872
• BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data
Outline
• BigDataBench
• BigDataBench Dwarfs
• BigData100
BigDataBench Methodology
[Figure: methodology flow, from top to bottom]
• Application Domain 1, Application Domain …, Application Domain N
• Data models of different types & semantics
• Data operations & workloads patterns
• Benchmark specification 1, Benchmark specification …, Benchmark specification N
• Real-world data sets
• Data generation tools
• Workloads with diverse implementations
• Multi-tenancy version (mix with different percentages)
• BigDataBench subset (reduce)
BigDataBench Methodology with Dwarfs
[Figure: methodology flow with dwarfs, from top to bottom]
• Application Domain 1, Application Domain …, Application Domain N
• Data models of different types & semantics
• Dwarfs Workloads Identification
• Benchmark specification 1, Benchmark specification …, Benchmark specification N
• Real-world data sets
• Data generation tools
• Workloads with diverse implementations
• Multi-tenancy version (mix with different percentages)
• BigDataBench subset (reduce)
Dwarfs Workloads
• Finding the common atomic workloads in big data analytics workloads
• Representing maximum patterns of big data
Inspiration
Successful Compute Abstractions
• Relational algebra
  – 5 primitive operations
  – Select, Project, Product, Union, Difference
• Parallel computing
  – Computational & communication patterns
  – 13 dwarfs
Successful Benchmarks
• TPC-C
  – OLTP domain
  – Functions of abstraction
• HPCC
  – High performance computing
  – Seven basic tests
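To make the relational algebra analogy concrete, the following is a minimal sketch of the five primitive operations over relations modeled as Python sets of tuples. The function names, the positional column indexing and the toy `users` and `orders` relations are assumptions for illustration only; they are not taken from BigDataBench.

```python
from itertools import product as cartesian

# Relations are modeled as sets of tuples; columns are addressed by position.

def select(rows, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return {r for r in rows if predicate(r)}

def project(rows, indices):
    """Keep only the columns at the given positions."""
    return {tuple(r[i] for i in indices) for r in rows}

def product(left, right):
    """Cartesian product: concatenate every left tuple with every right tuple."""
    return {l + r for l, r in cartesian(left, right)}

def union(a, b):
    return a | b

def difference(a, b):
    return a - b

# Hypothetical toy relations: (user_id, name) and (user_id, item).
users = {(1, "ann"), (2, "bob")}
orders = {(1, "book"), (2, "pen"), (1, "ink")}

# A join is not primitive: it is product, then select, then project.
pairs = product(users, orders)
joined = select(pairs, lambda r: r[0] == r[2])
result = project(joined, (1, 3))
```

The point of the sketch is that a richer operation (here a join) is a combination of a few primitives, which is the same idea the dwarfs approach applies to big data analytics workloads.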
Fundamental Issues
What are the challenges for big data dwarfs?
How to find the dwarfs workloads in big data?
Challenges
• Massive application domains make it hard to know where to start or how to achieve a wide range of coverage
  • Processing techniques: data mining, machine learning, deep learning, computer vision, signal processing, natural language processing
  • Data mining tasks: classification, dimension reduction, association rule mining, regression, clustering
• Many techniques for processing big data exist, which makes identifying dwarfs workloads more complex
• A single machine learning library, scikit-learn, implements many algorithms, yet still far fewer than the total number of algorithms
• 80% of data growth is unstructured data
Methodology of Dwarfs
• Application Domain 1, Application Domain 2, …, Application Domain N
• Processing Techniques
  • Machine learning
  • Data mining
  • Deep learning
  • Computer vision
  • Natural language processing ...
• Libraries
  • MLlib, Mahout
• Frameworks
  • Spark, Hadoop, GraphLab
• Benchmarks
  • BigBench, LinkBench ...
• Big Data Analytics → Representative Algorithms → Frequently appearing operations → Different combinations → Dwarfs 1, Dwarfs 2, …, Dwarfs M
Algorithms Chosen to Investigate
Data Mining & Machine Learning
• C4.5/CART/ID3, Logistic regression, SVM, KNN, HMM, Maximum-entropy Markov model, Conditional random field, PageRank, HITS, Apriori, FP-growth, Principal component analysis, Back propagation, AdaBoost, MCMC, Connected component, Random forest
Natural Language Processing
• Latent semantic indexing, pLSI, Latent Dirichlet allocation, Index, Porter stemming, Sphinx speech recognition
Recommendation
• Demographic/Content based recommendation, Collaborative filtering
Computer Vision
• MPEG-2, Scale-invariant feature transform, Image segmentation, Ray tracing
Database Software
• Needleman-Wunsch, Smith-Waterman, BLAST
Deep Learning
• CNN, DBN
Frequently-‐appearing Operations
Linear Algebra (Matrix/Vector Calculation)
• Similarity Analysis: KNN, K-means, Recommendation, Feature matching, Image segmentation
• Neural Network: Back propagation, CNN, DBN, Neural network
• Multimedia Representation: Sphinx speech recognition, MPEG-2
Set Operations
• Similarity Analysis: Jaccard similarity, Locality sensitive hashing …
• Association Rules Mining: Apriori, FP-Growth …
• Database: Set union, Set difference
Sampling
• MCMC, LDA, Random sampling, Downsampling …
Transform Operation
• FFT, Convolution computation, DCT, Speech recognition, MPEG-2 …
Graph Operation
• BFS, DFS, Decision tree, Connected Component
Logic Operation
• Encryption, index, Fingerprint, SimHash, MinHash
Statistic Operation
• Probability calculation
• LSI, pLSI, Latent Dirichlet allocation …
Sort
• Partial sort, quick sort, Top k sort …
• K-means, Decision tree …
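As one illustration of how an analytics algorithm reduces to a frequently appearing operation, the sketch below computes Jaccard similarity purely with set intersection and set union. The `jaccard` helper and the two toy word lists are assumptions made for this example, not part of BigDataBench.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|, built only from set operations."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical toy documents, tokenized by whitespace.
doc1 = "big data benchmark suite".split()
doc2 = "big data generator suite".split()

score = jaccard(doc1, doc2)  # 3 shared words out of 5 distinct words
```

The similarity measure itself contains no logic beyond intersection, union and a division, which is why it is listed under the set operation category.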
Dwarfs in Big Data Offline Analytics
Sampling, Transform operation, Graph operation, Logic operation, Set operation, Statistic operation, Sort, Linear Algebra
Outline
• BigDataBench
• BigDataBench Dwarfs
• BigData100
BigData100 Ranking
• An open source project for benchmarking and ranking big data systems
• Using benchmarks from BigDataBench
• Big data systems including Hadoop, Spark, Flink and many SQL-interface systems
Why Do Big Data Systems Ranking
• First, a quick glance at current big data systems
Big Data Systems
[Figure: layered view of big data systems]
• Programming Interface: SQL-like, DataFlow APIs, R Language
• Computation Engines: DataFlow Engines (DB layer, Graph layer, ML layer, …), Graph, MPP, Streaming, …
• Data Storage: Distributed File Systems, Column-oriented Format, Key-Value Storage, NoSQL, Relational DB
• Data Sources: Structured Data, Unstructured Data, Semi-structured Data
• Resource Management
Computation Engine
Programming Model, Parallel Computation, Fault Tolerance, Task Scheduling, Local Execution, Network Transfer, Intermediate Data Management
The Dataflow Model
• A computation job is represented as a directed acyclic graph (DAG) consisting of data sources, sinks and operators
• First used in the Microsoft Dryad system
• Google MapReduce can be seen as a special case of dataflow
[Figure: dataflow graphs of MapReduce (Input, Map, Sort, Spill, Merge, Reduce, Output), Dryad (inputs, operators, channels, output) and Spark (inputs, RDDs, transformations, cached RDD, action, output); a further graph shows inputs, PACT1 to PACT3 and channels]
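The dataflow idea can be sketched in a few lines: a job is a DAG whose nodes are operators, and an operator runs once all of its upstream results exist. In the sketch below, WordCount is expressed as the fixed chain map, shuffle, reduce, which is how MapReduce appears as a special case. The `run_dataflow` function and all operator names are hypothetical and single-process; they do not reproduce the API of Dryad, Hadoop, Spark or Flink.

```python
from collections import defaultdict

def run_dataflow(operators, edges, sources):
    """Execute a DAG of operators in dependency order.

    operators: name -> function taking one argument per upstream node
    edges:     list of (upstream, downstream) pairs
    sources:   name -> input data
    """
    inputs = defaultdict(list)
    for up, down in edges:
        inputs[down].append(up)

    results = dict(sources)
    pending = list(operators)
    while pending:
        progressed = False
        for name in list(pending):
            # An operator is ready when every upstream result is available.
            if all(up in results for up in inputs[name]):
                args = [results[up] for up in inputs[name]]
                results[name] = operators[name](*args)
                pending.remove(name)
                progressed = True
        if not progressed:
            raise ValueError("graph has a cycle or a missing input")
    return results

def map_words(lines):
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_counts(groups):
    return {key: sum(values) for key, values in groups.items()}

# WordCount as a linear dataflow: source -> map -> shuffle -> reduce (sink).
operators = {"map": map_words, "shuffle": shuffle, "reduce": reduce_counts}
edges = [("lines", "map"), ("map", "shuffle"), ("shuffle", "reduce")]
out = run_dataflow(operators, edges, {"lines": ["big data", "big systems"]})
counts = out["reduce"]
```

A general dataflow engine accepts arbitrary `edges`, including operators with several inputs, whereas the MapReduce shape always has this same three-stage chain.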
Other Specific Models
Impala, Pregel
Kornacker M, Behm A, Bittorf V, et al. Impala: A Modern, Open-Source SQL Engine for Hadoop. CIDR, 2015.
Malewicz G, et al. Pregel: a system for large-scale graph processing. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010: 135-146.
Why Do Big Data Systems Ranking?
• Understanding the performance difference between general-purpose and specific systems
  • Performance comparison
• Figuring out how framework design impacts the performance of specific types of workloads
Benchmark
• Current Benchmarks
  • Micro workloads: WordCount, Grep
  • Iteration workloads: PageRank, Kmeans
  • Interactive queries: Select, Aggregation, Join
  • Others
Benchmark Behaviors: Spark Case
• Spark
  • Resilient distributed datasets (RDDs)
    • Immutable, partitioned collections of objects
    • Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
    • Can be cached for efficient reuse
[Figure: HDFS File → Filtered RDD → Mapped RDD]
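The three RDD properties listed above (immutable partitions, creation through transformations, optional caching) can be illustrated with a toy single-process class. `MiniRDD` is a hypothetical stand-in written for this sketch; it is not Spark's API and it ignores distribution and fault tolerance.

```python
class MiniRDD:
    """Toy stand-in for an RDD: lazy, partitioned, optionally cached."""

    def __init__(self, compute):
        self._compute = compute   # function returning a list of partitions
        self._cached = False
        self._cache = None
        self.evaluations = 0      # how many times this dataset was computed

    @staticmethod
    def from_partitions(partitions):
        frozen = [tuple(p) for p in partitions]  # tuples keep partitions immutable
        return MiniRDD(lambda: frozen)

    def map(self, f):
        # A transformation returns a new dataset; nothing is computed yet.
        return MiniRDD(lambda: [tuple(f(x) for x in p) for p in self.partitions()])

    def filter(self, pred):
        return MiniRDD(lambda: [tuple(x for x in p if pred(x)) for p in self.partitions()])

    def cache(self):
        self._cached = True
        return self

    def partitions(self):
        if self._cache is not None:
            return self._cache
        self.evaluations += 1
        result = self._compute()
        if self._cached:
            self._cache = result
        return result

    def collect(self):
        # An action: forces evaluation and flattens the partitions.
        return [x for p in self.partitions() for x in p]

# Mirrors the figure: file -> filtered dataset -> mapped dataset.
lines = MiniRDD.from_partitions([["error a", "info b"], ["error c"]])
errors = lines.filter(lambda s: s.startswith("error")).cache()
codes = errors.map(lambda s: s.split()[1])

first = codes.collect()
second = codes.collect()
```

Because `errors` is cached, the second `collect()` recomputes only the `map` step; `errors.evaluations` stays at 1 while `codes.evaluations` reaches 2.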
Benchmark Behaviors
• Wordcount, Grep, NaiveBayes, Select
  • IO-intensive
  • Computation can be easily parallelized
  • No or little data transfer
Benchmark Behaviors
• PageRank, Kmeans
  • Iterative computation
  • Data sets are joined and aggregated in each iteration
  • A lot of shuffle operations
[Figure: DataSet1 passes through Iteration 1, Iteration 2 and Iteration 3; each iteration applies a GroupBy and a Join with DataSet2]
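The join-then-aggregate pattern of an iterative workload can be seen in a small in-memory PageRank. This is a sketch under simplifying assumptions: the three-page graph is made up, every page has at least one outgoing link, and the update rule is the common unnormalized form (ranks sum to the number of pages). It is not the BigDataBench implementation.

```python
from collections import defaultdict

def pagerank(links, iterations=10, damping=0.85):
    """links: page -> list of pages it points to (every page needs out-links)."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Join: pair each page's current rank with its outgoing links.
        contributions = []
        for page, neighbors in links.items():
            share = ranks[page] / len(neighbors)
            for dest in neighbors:
                contributions.append((dest, share))
        # GroupBy and aggregate: sum contributions per destination page.
        # In a distributed engine this step is the shuffle.
        totals = defaultdict(float)
        for dest, value in contributions:
            totals[dest] += value
        ranks = {page: (1 - damping) + damping * totals[page] for page in links}
    return ranks

# Hypothetical toy graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Every iteration repeats the same join and group-by over the full rank table, which is why such workloads generate many shuffle operations.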
Benchmark Behaviors
• Database queries
  • One DAG job that consists of many different transformations
    • Including complex aggregation and join algorithms
[Figure: DataSet1 and DataSet2 each pass through a Map operator]
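A query workload chains several different transformations inside one job. The sketch below runs a select, a hash join and an aggregation over two tiny in-memory tables. The table contents, column names and helper functions are hypothetical; they only echo the Select, Aggregation and Join query types named in this talk.

```python
from collections import defaultdict

# Hypothetical toy tables.
items = [
    {"item_id": 1, "category": "books"},
    {"item_id": 2, "category": "music"},
]
sales = [
    {"item_id": 1, "price": 12.0},
    {"item_id": 2, "price": 7.5},
    {"item_id": 1, "price": 30.0},
]

def select(rows, predicate):
    """Filter rows; needs no data exchange between partitions."""
    return [r for r in rows if predicate(r)]

def hash_join(left, right, key):
    """Build a hash index on the right table, then probe it with the left."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index[l[key]]]

def aggregate(rows, key, value):
    """Group rows by key and sum the value column."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)

# One job, three different transformations chained as a small DAG.
expensive = select(sales, lambda r: r["price"] > 10.0)
joined = hash_join(expensive, items, "item_id")
revenue = aggregate(joined, "category", "price")
```

Unlike the iterative case, each transformation here runs once, but the join and the aggregation both need rows with equal keys to meet, which is where the cost concentrates.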
Workload Behaviors on CPU and Memory (JVM)
• Simple Batch Workloads
  • Wordcount, Grep, NaiveBayes, Select
• Iterative Workloads
  • PageRank, Kmeans
• Interactive Queries
CPU - Simple Batch Workloads
[Figure panels: Grep, WordCount, NaiveBayes]
CPU - Iterative Workloads
CPU - Query Processing
Memory - JVM Memory Model
JVM Memory Usage in Spark
• EC: Eden Capacity
• EU: Eden Usage
• OC: Old Memory Capacity
Simple Batch Workloads
[Figure panels: Grep, WordCount, NaiveBayes]
Iterative Workloads
Query Processing
Revisit the dataflow
[Figure: the iterative dataflow (Iterations 1 to 3, each with Map, GroupBy and a Join with DataSet2) next to the query dataflow (DataSet1 and DataSet2 each followed by a Map)]
Some Preliminary Results of BigData100
• Cluster Configuration
  Computation nodes:  16 nodes
  CPU per node:       2 x Intel Xeon E5645, 12 cores
  Memory per node:    32 GB
  Disk per node:      1 TB x 2 SATA disks
  Network:            Gigabit Ethernet
Systems Selected
• Simple Batch Workloads and Iterative Workloads
  • Hadoop 2.7.1, Spark 1.5.1, Flink 0.9.1
• Interactive Query Processing
  • Spark 1.5.1, Hive 1.2.1 (on Hadoop, on Tez and on
Size of data sets
• WordCount, Grep
  • 1.600,000,000 English words from Wikipedia
  • About 300 GB
• Naïve Bayes
  • 5,700,000 reviews from Amazon Movie Reviews
  • 300 GB
Size of data sets
• PageRank
  • Google Web Graph, 16 million vertices, 99 million edges (unstructured graph)
• KMeans
  • Facebook Social Network, 460 million vectors
• Queries: subset of TPC-DS
  • item: 4 GB
  • customer: 13 GB
Results – Simple Batch Workloads
Run time (s)
  Workload    Hadoop  Spark  Flink
  Grep           410    330    548
  WordCount      680    345   1105
Results
Run time (s)
  Workload     Hadoop  Spark  Flink
  Naive Bayes     490    351    563
Observation
• Spark has the best performance on this kind of workload
  • Good data locality
  • One task for each partition
• One reason for the poor performance of Flink
  • High level data locality
Results – Iterative Workloads
Run time (s)
  Workload    Hadoop  Spark  Flink  Flink-delta
  PageRank      7753    780    598          469
  KMeans        6722    672    805
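To read the iterative results as relative speedups, the run times from the two charts can be divided by the Hadoop baseline. The numbers below are the ones reported on this slide; the `speedup_over` helper is a name introduced only for this sketch.

```python
# Run times in seconds, as reported for the iterative workloads.
run_times = {
    "PageRank": {"Hadoop": 7753, "Spark": 780, "Flink": 598, "Flink-delta": 469},
    "KMeans": {"Hadoop": 6722, "Spark": 672, "Flink": 805},
}

def speedup_over(baseline, times):
    """Speedup of each system relative to the baseline system's run time."""
    base = times[baseline]
    return {system: base / t for system, t in times.items()}

speedups = {w: speedup_over("Hadoop", t) for w, t in run_times.items()}

for workload, values in speedups.items():
    for system, s in values.items():
        print(f"{workload:9s} {system:12s} {s:5.1f}x")
```

On these figures Spark is roughly 10x faster than Hadoop on both workloads, Flink is about 13x faster on PageRank and about 8x faster on KMeans, and Flink with delta iteration is about 16.5x faster on PageRank.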
Observation
• Spark and Flink can be much faster than Hadoop
  • Advantage of the DAG model
• The delta iteration from Flink can be more
Results – Queries
Run time (s), Select
  SparkSQL  HiveOnSpark  HiveOnTez  Hive  Impala
        38           58         56    65       9
Results – Queries
Run time (s), Aggregation
  SparkSQL  HiveOnSpark  HiveOnTez  Hive  Impala
       155          174        248   490     123
Results – Queries
Run time (s), Join
  SparkSQL  HiveOnSpark  HiveOnTez  Hive  Impala
       214          342        318   593     791
Observations
• Impala has the best performance on select and aggregation
  • Specific data structure designed for database queries
  • Good disk performance due to I/O buffer management and short-circuit local reads
• Impala performs poorly on join