Benchmarking and Ranking Big Data Systems

(1)

Institute of Computing Technology

Xinhui Tian

ICT, Chinese Academy of Sciences and University of Chinese Academy of Sciences

(2)

Outline

n BigDataBench

n BigDataBench Dwarfs

n BigData100

(3)

BPOE 2016 3

Why Big Data Benchmarking?

Measuring big data systems and architectures quantitatively

(4)

What is BigDataBench?

n An open source big data benchmarking project

• http://prof.ict.ac.cn/BigDataBench

(5)

BigDataBench Users

n http://prof.ict.ac.cn/BigDataBench/users/

n Industry users

n Accenture, BROADCOM, SAMSUMG, Huawei, IBM

n About 20 academia groups published papers

(6)

BigDataBench 3.2 Overview

[Figure: BigDataBench 3.2 overview. Recoverable labels:]

• Application domains: Search Engine, Social Network, E-commerce, Multimedia, Bioinformatics

• 15 real-world data sets; labels shown: Wikipedia Entries, Amazon Movie Reviews, Google Web Graph, Facebook Social Network, E-commerce Transaction, ProfSearch Resumes, ImageNet, English broadcasting audio, DVD Input Streams, Image scene, SoGou Data, MNIST, Genome sequence data, Assembly of the human genome

• BDGS (Big Data Generator Suite) for scalable data

• Software stacks shown: Hadoop, RDMA, MPI, Shark, Impala, NoSQL

(7)

What’s New in BigDataBench 3.2

BigDataBench support for Flink

• WordCount, Grep, Naïve Bayes, PageRank, K-means

Streaming

• JStorm, Spark Streaming

Graph processing

• GraphX, GraphLab, Flink Gelly

(8)

BigDataBench evolution

• BigDataBench 2.0 (2013.12): typical Internet service domains; an architectural perspective; 19 workloads & data generation tools

• BigDataBench 3.0 (2014.4): multidisciplinary effort; 32 workloads with diverse implementations

• BigDataBench 3.1 (2014.12): 5 application domains, 14 data sets and 33 workloads; same specifications, diverse implementations; multi-tenancy version; BigDataBench subset and simulator version

• BigDataBench 3.2: new software stacks (Flink, JStorm, GraphX, GraphLab); new workload types (streaming, graph processing); new data sets and workloads

(9)

BigDataBench Publications

n BigDataBench: a Big Data Benchmark Suite from Internet Services. 20th IEEE

International Symposium On High Performance Computer Architecture (HPCA-‐2014).

n Characterizing data analysis workloads in data centers. 2013 IEEE

International Symposium on Workload Characterization (IISWC 2013)（Best paper award）

n BigOP: generating comprehensive big data workloads as a benchmarking

framework. 19th International Conference on Database Systems for Advanced Applications (DASFAA 2014)

n BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. The

Fourth workshop on big data benchmarking (WBDB 2014)

n Identifying Dwarfs Workloads in Big Data Analytics arXiv preprint

arXiv:1505.06872

n BigDataBench-‐MT: A Benchmark Tool for Generating Realistic Mixed Data

(10)

Outline

n BigData100

(11)

BigDataBench Methodology

[Figure: BigDataBench methodology, top to bottom]

• Application Domain 1, Application Domain …, Application Domain N

• Data models of different types & semantics; data operations & workload patterns

• Benchmark specification 1, Benchmark specification …, Benchmark specification N

• Real-world data sets; data generation tools; workloads with diverse implementations

• Multi-tenancy version (mix with different percentages); BigDataBench subset (reduce)

(12)

BigDataBench Methodology with Dwarfs

[Figure: BigDataBench methodology with dwarfs, top to bottom]

• Application Domain 1, Application Domain …, Application Domain N

• Data models of different types & semantics; dwarfs workloads identification

• Benchmark specification 1, Benchmark specification …, Benchmark specification N

• Real-world data sets; data generation tools; workloads with diverse implementations

• Multi-tenancy version (mix with different percentages); BigDataBench subset (reduce)

(13)

Dwarfs Workloads

n Finding the common atomic workloads in big

data analytics workloads

n Representing maximum patterns of big data

(14)

Inspiration

Successful Compute Abstractions

• Relational algebra

  – 5 primitive operations: Select, Project, Product, Union, Difference

• Parallel computing

  – Computational & communication patterns

  – 13 dwarfs

Successful Benchmarks

• TPC-C

  – OLTP domain

  – Functions of abstraction

• HPCC

  – High performance computing

  – Seven basic tests
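As an illustration of why a small set of primitives counts as a successful abstraction, the five relational algebra operations named above can be sketched in a few lines of Python. This is a minimal sketch, not taken from the talk: the relation encoding (a set of frozensets of attribute/value pairs), the helper `row`, and the sample `users` and `orders` relations are assumptions made for the example.

```python
from itertools import product as cartesian

def row(**attrs):
    """Hypothetical helper: a row is a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def select(rel, pred):
    """Keep the rows for which pred(row_as_dict) is true."""
    return {r for r in rel if pred(dict(r))}

def project(rel, attrs):
    """Keep only the named attributes; duplicates collapse because rel is a set."""
    return {frozenset((k, v) for k, v in r if k in attrs) for r in rel}

def product(rel_a, rel_b):
    """Cartesian product; assumes the two relations use disjoint attribute names."""
    return {a | b for a, b in cartesian(rel_a, rel_b)}

def union(rel_a, rel_b):
    return rel_a | rel_b

def difference(rel_a, rel_b):
    return rel_a - rel_b

# Sample data (invented for the example).
users = {row(uid=1, name="ann"), row(uid=2, name="bob")}
orders = {row(oid=10, buyer=1), row(oid=11, buyer=2), row(oid=12, buyer=1)}

# A join is not a primitive: it is a product followed by a select.
joined = select(product(users, orders), lambda r: r["uid"] == r["buyer"])
ann_orders = project(select(joined, lambda r: r["name"] == "ann"), {"oid"})
```

The join being expressible as product plus select is the point of the analogy: dwarfs aim to play the same role for big data analytics workloads.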

(15)

Fundamental Issues

What are the challenges for big data dwarfs?

How to find the dwarfs workloads in big data

(16)

Challenges

• Massive application domains make it hard to know where to start or how to achieve a wide range of coverage

  – Fields shown: data mining, machine learning, deep learning, computer vision, signal processing, natural language processing

  – Tasks shown: classification, dimension reduction, association rule mining, regression, clustering

• Many techniques for processing big data exist, which makes identifying dwarfs workloads more complex

• A single machine learning library, scikit-learn, implements a large number of algorithms, yet still far fewer than the total number of algorithms

• 80% of data growth is unstructured data

(17)

Methodology of Dwarfs

[Figure: methodology for identifying dwarfs]

• Application domains: Application Domain 1, Application Domain 2, …, Application Domain N

• Processing techniques: machine learning, data mining, deep learning, computer vision, natural language processing, ...

• Libraries: MLlib, Mahout

• Frameworks: Spark, Hadoop, GraphLab

• Benchmarks: BigBench, LinkBench, ...

• Flow: big data analytics, then representative algorithms, then frequently appearing operations, then different combinations, yielding Dwarfs 1, Dwarfs 2, …, Dwarfs M

(21)

Algorithms Chosen to Investigate

• Data Mining & Machine Learning: C4.5/CART/ID3, Logistic regression, SVM, KNN, HMM, Maximum-entropy Markov model, Conditional random field, PageRank, HITS, Apriori, FP-growth, Principal component analysis, Back propagation, AdaBoost, MCMC, Connected component, Random forest

• Natural Language Processing: Latent semantic indexing, pLSI, Latent Dirichlet allocation, Index, Porter stemming, Sphinx speech recognition

• Recommendation: Demographic/content-based recommendation, Collaborative filtering

• Computer Vision: MPEG-2, Scale-invariant feature transform, Image segmentation, Ray tracing

• Database Software: Needleman-Wunsch, Smith-Waterman, BLAST

• Deep Learning: CNN, DBN

(22)

Frequently-appearing Operations

• Linear Algebra (matrix/vector calculation)

  – Similarity analysis: KNN, K-means, recommendation, feature matching, image segmentation

  – Neural network: back propagation, CNN, DBN, neural network

  – Multimedia representation: Sphinx speech recognition, MPEG-2

• Set Operations

  – Similarity analysis: Jaccard similarity, locality sensitive hashing, …

  – Association rules mining: Apriori, FP-Growth, …

  – Database: set union, set difference

• Sampling: MCMC, LDA, random sampling, downsampling, …

• Transform Operation: FFT, convolution computation, DCT, speech recognition, MPEG-2, …

• Graph Operation: BFS, DFS, decision tree, connected component

• Logic Operation: encryption, index, fingerprint, SimHash, MinHash

• Statistic Operation

  – Probability calculation

  – LSI, pLSI, latent Dirichlet allocation, …

• Sort

  – Partial sort, quick sort, top-k sort, …

  – K-means, decision tree, …
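To make the idea of an operation shared by several algorithms concrete, the sketch below combines two of the listed operations: Jaccard similarity (a set operation) and top-k selection (a sort operation). It is an illustrative assumption, not code from BigDataBench; the function names and the tiny `corpus` are invented for the example.

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def top_k_similar(query, corpus, k):
    """Return the k (score, name) pairs with the highest Jaccard score."""
    scored = ((jaccard(query, doc), name) for name, doc in corpus.items())
    # heapq.nlargest is a partial sort: it avoids ordering the whole corpus.
    return heapq.nlargest(k, scored)

# Invented sample data: each document is a set of terms.
corpus = {
    "d1": {"big", "data", "benchmark"},
    "d2": {"big", "data", "systems", "ranking"},
    "d3": {"graph", "processing"},
}
result = top_k_similar({"big", "data", "ranking"}, corpus, k=2)
```

The same two building blocks reappear in recommendation, near-duplicate detection and clustering, which is the kind of reuse that motivates treating them as dwarfs.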

(23)

Dwarfs in Big Data Offline Analytics

Linear algebra, sampling, transform operation, graph operation, logic operation, set operation, statistic operation, sort

(24)

Outline

n BigData100

(25)

BigData100 Ranking

n An open source project for benchmarking and

ranking big data systems

n Using benchmarks from BigDataBench

n Big data systems including Hadoop, Spark, Flink

and many SQL-‐interface systems

(26)

Why Rank Big Data Systems?

• First, a quick glance at current big data

(27)

Big Data Systems

[Figure: layered stack of big data systems. Recoverable labels:]

• Programming interface: SQL-like, DataFlow APIs, R language, DB layer, graph layer, ML layer, …

• Computation engines: DataFlow engines, graph, MPP, streaming, …

• Data storage: distributed file systems, column-oriented format, key-value storage, NoSQL, relational DB

• Resource management

• Data sources: structured data, unstructured data, semi-structured data

(28)

Computation Engine

Programming Model, Parallel Computation, Fault Tolerance, Task Scheduling, Local Execution, Network Transfer, Intermediate Data Management

(29)

The Dataflow Model

n A computation job is represented as a

directed acyclic graph (DAG) consisting of

data sources, sinks and operators

n First used in the Microsoft Dryad system

n Google MapReduce can be seen as a special

(30)

DataFlow

[Figure: dataflow graphs of MapReduce, Dryad and Spark. Recoverable labels:]

• MapReduce: Input, Map, Sort, Spill, Merge, Reduce, Output

• Dryad: Input1, Input2, Operator1 to Operator5, Channel1 to Channel4, intermediate data, Output

• Spark: Input1, Input2, RDD1 to RDD6, Transformation1 to Transformation4, user function, Action, cached RDDs, Output

• PACT1 to PACT3, Input1, Input2, Channel1, Channel2

(31)

Other Specific Models

Impala, Pregel

Kornacker M, Behm A, Bittorf V, et al. Impala: A Modern, Open-Source SQL Engine for Hadoop. CIDR, 2015.

Malewicz G, et al. Pregel: A System for Large-Scale Graph Processing. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010: 135-146.

(32)

Why Rank Big Data Systems?

• Understanding the performance difference between general-purpose and specific systems

• Performance comparison

• Figuring out how the design of a framework impacts the performance of specific types of workloads

(33)

Benchmark

n Current Benchmarks n Micro workloads: • _{WordCount,
Grep} n Iteration workloads: • PageRank, Kmeans n Interactive Queries

• _{Select,
Aggregation,
Join}

n Others

(34)

Benchmark Behaviors: Spark Case

n Spark

n Resilient distributed datasets (RDDs)

• _{Immutable,
partitioned
collections
of
objects}

• Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage

• _{Can
be cached for
efficient
reuse}

HDFS File Filtered RDD Mapped RDD
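The three RDD properties listed above (immutable and partitioned, built by transformations, cacheable) can be imitated with a toy class. This is a pure-Python stand-in written for illustration; `MiniRDD` and its methods are hypothetical and are not Spark's API, and everything runs in one process.

```python
class MiniRDD:
    """Toy stand-in for an RDD: partitioned, lazily evaluated, optionally cached."""

    def __init__(self, compute, num_partitions):
        self._compute = compute            # partition index -> list of records
        self.num_partitions = num_partitions
        self._cached = False
        self._cache = {}                   # partition index -> materialized records

    @classmethod
    def from_partitions(cls, partitions):
        frozen = [tuple(p) for p in partitions]   # source data is never mutated
        return cls(lambda i: list(frozen[i]), len(frozen))

    def map(self, fn):
        # Transformations return a new MiniRDD; nothing is computed yet.
        return MiniRDD(lambda i: [fn(x) for x in self._partition(i)],
                       self.num_partitions)

    def filter(self, pred):
        return MiniRDD(lambda i: [x for x in self._partition(i) if pred(x)],
                       self.num_partitions)

    def cache(self):
        self._cached = True
        return self

    def _partition(self, i):
        if not self._cached:
            return self._compute(i)        # recompute from lineage every time
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def collect(self):
        # The action that triggers evaluation of every partition.
        return [x for i in range(self.num_partitions) for x in self._partition(i)]


calls = []                                 # records how often the map function runs

def parse(line):
    calls.append(line)
    return line.upper()

lines = MiniRDD.from_partitions([["error a", "info b"], ["error c"]])
errors = lines.filter(lambda l: l.startswith("error")).map(parse).cache()
first = errors.collect()
lengths = errors.map(len).collect()        # reuses the cached partitions
```

Because `errors` is cached, the second job does not call `parse` again; without `cache()` the lineage would be recomputed for each action.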

(35)

Benchmark Behaviors

n Wordcount, Grep, NaiveBayes, Select

n IO-‐intensive

n Computation can be easy paralleled n No or little data transfer

(36)

Benchmark Behaviors

n PageRank, Kmeans

n Iterative Computation

n Data sets join and aggregation in each iteration n A lot of shuffle operations

Iteration 1 GroupBy DataSet2 Join Iteration 2 GroupBy DataSet2 Join Iteration 3 GroupBy DataSet2 Join DataSet1
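The join-then-aggregate pattern of each iteration can be seen in a compact single-machine PageRank. This is a minimal sketch, assuming the common unnormalized update rank = (1 - d) + d * sum(contributions) and a graph in which every page has at least one outgoing link; it is not the BigDataBench implementation, and `links` is invented for the example.

```python
from collections import defaultdict

def pagerank(links, iterations=10, damping=0.85):
    """links: page -> list of outgoing neighbours (every page is a key)."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Join: pair each page's current rank with its adjacency list
        # and emit one contribution per outgoing edge.
        contribs = [
            (dst, ranks[page] / len(neighbours))
            for page, neighbours in links.items()
            for dst in neighbours
        ]
        # GroupBy + aggregate: sum contributions per destination page.
        # In a distributed engine this step is the shuffle.
        sums = defaultdict(float)
        for dst, value in contribs:
            sums[dst] += value
        ranks = {page: (1 - damping) + damping * sums[page] for page in links}
    return ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links, iterations=50)
```

Every iteration repeats the same join and aggregation over the full data set, which is why these workloads are dominated by shuffle cost and benefit from keeping the link structure in memory between iterations.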

(37)

Benchmark Behaviors

n Database queries

n One DAG job that consists of many different

transformations

• _{Including
complex
aggregation
and
join
algorithms}

Map DataSet1

DataSet2 Map

(38)

Workload Behaviors on CPU and Memory (JVM)

• Simple batch workloads

  – WordCount, Grep, NaiveBayes, Select

• Iterative workloads

  – PageRank, Kmeans

• Interactive queries

(39)

CPU - Simple Batch Workloads

[Figure panels: Grep, WordCount, NaiveBayes]

(40)

CPU - Iterative Workloads

(41)

CPU - Query Processing

(42)

Memory - JVM Memory Model

(43)

JVM Memory Usage in Spark

n EC: Eden Capacity

n EU: Eden Usage

n OC: Old Memory Capacity

(44)

Simple Batch Workloads

[Figure panels: Grep, WordCount, NaiveBayes]

(45)

Iterative Workloads

(46)

Query Processing

(47)

Revisit the dataflow

[Figure: the iterative dataflow (DataSet1 through Iteration 1, Iteration 2 and Iteration 3, each with Map, GroupBy and a Join with DataSet2) shown next to the query dataflow (DataSet1 and DataSet2 each feeding a Map)]

(48)

Some Preliminary Results of BigData100

n Cluster Configuration

Computation Nodes 16 nodes

CPU per node 2 x Intel Xeon E5645, 12 cores Memory per node 32 GB

Disk per node 1TB x 2 SATA disks Network Gigabit Ethernet

(49)

Systems Selected

• Simple batch workloads and iterative workloads

  – Hadoop 2.7.1, Spark 1.5.1, Flink 0.9.1

• Interactive query processing

  – Spark 1.5.1, Hive 1.2.1 (on Hadoop, on Tez and on

(50)

Size of data sets

n WordCount, Grep

n 1.600,000,000 English words from Wikipedia n About 300GB

n Naïve Bayes

n 5,700,000 reviews from Amazon Movie Reviews n 300GB

(51)

Size of data sets

n PageRank

n Google Web Graph, 16 million vertexes, 99 million

edges (unstructured graph)

n KMeans

n Facebook Social Network, 460 million vectors

n Queries – subset of TPC-‐DS

n item: 4 GB

n customer: 13 GB

(52)

Results – Simple Batch Workloads

Run time (s)

  Workload     Hadoop   Spark   Flink
  Grep         410      330     548
  WordCount    680      345     1105

(53)

Results

Run time (s)

  Workload      Hadoop   Spark   Flink
  Naive Bayes   490      351     563

(54)

Observation

n Spark has the best performance on that kind

of workloads

n Good data locality

n One task for each partition

n One reason for the bad performance in Flink

n High level data locality

(55)

Results – Iterative Workloads

Run time (s)

  Workload    Hadoop   Spark   Flink   Flink-delta
  PageRank    7753     780     598     469
  KMeans      6722     672     805     (not shown)
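To put the iterative results on a common scale, the run times can be converted to speedups over Hadoop. The snippet below is plain arithmetic on the chart values; the mapping of each bar to a system follows the order of the chart labels and is an assumption of this sketch, as are the variable names.

```python
# Run times in seconds, read from the two charts (bar order assumed to match labels).
run_time_s = {
    "PageRank": {"Hadoop": 7753, "Spark": 780, "Flink": 598, "Flink-delta": 469},
    "KMeans": {"Hadoop": 6722, "Spark": 672, "Flink": 805},
}

def speedup_over(baseline, times):
    """Speedup = baseline run time / system run time, rounded to one decimal."""
    base = times[baseline]
    return {system: round(base / t, 1) for system, t in times.items()
            if system != baseline}

speedup = {workload: speedup_over("Hadoop", times)
           for workload, times in run_time_s.items()}

for workload, values in speedup.items():
    print(workload, values)
```

Under that reading, both DAG-based engines are roughly an order of magnitude faster than Hadoop on these two workloads.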

(56)

Observation

n Spark, Flink can be much faster than Hadoop

n Advantage of the DAG model

n The delta iteration from Flink can be more

(57)

Results – Queries

Select run time (s)

  SparkSQL   HiveOnSpark   HiveOnTez   Hive   Impala
  38         58            56          65     9

(58)

Results – Queries

Aggregation run time (s), bars left to right: 155, 174, 248, 490, 123

(59)

Results – Queries

Join run time (s), bars left to right: 214, 342, 318, 593, 791
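One way to turn the three query charts into a ranking is sketched below. Two things are assumptions of this sketch rather than statements from the talk: that the Aggregation and Join bars appear in the same system order as the labelled Select chart, and that a geometric mean of run times is used to combine queries (the slides do not say how BigData100 aggregates results).

```python
from math import prod

# System order taken from the Select chart; assumed identical for the other two.
systems = ["SparkSQL", "HiveOnSpark", "HiveOnTez", "Hive", "Impala"]
run_time_s = {
    "Select": [38, 58, 56, 65, 9],
    "Aggregation": [155, 174, 248, 490, 123],
    "Join": [214, 342, 318, 593, 791],
}

def fastest_per_query(times):
    """Name of the system with the lowest run time for each query."""
    return {query: systems[values.index(min(values))]
            for query, values in times.items()}

def geometric_mean_ranking(times):
    """Rank systems by the geometric mean of their run times (lower is better)."""
    n = len(times)
    geo = {
        system: prod(values[i] for values in times.values()) ** (1 / n)
        for i, system in enumerate(systems)
    }
    return sorted(geo, key=geo.get)

fastest = fastest_per_query(run_time_s)
ranking = geometric_mean_ranking(run_time_s)
```

A geometric mean keeps one slow query from dominating the combined score, but it also hides per-workload differences such as a system that is fastest on two queries and slowest on the third, so per-query results are worth reporting alongside any single ranking.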

(60)

Observations

n Impala has the best performance on select

and aggregation

n Specific data structure designed for database

query

n Good disk performance due to I/O buffer

managemen and short-‐circuit local reads

n Impala gets bad performance for join

(61)
