• No results found

Benchmarking and Ranking Big Data Systems

N/A
N/A
Protected

Academic year: 2021

Share "Benchmarking and Ranking Big Data Systems"

Copied!
61
0
0

Loading.... (view fulltext now)

Full text

(1)

IN S TIT U TE O F C O M P U TIN G T E C H N O LO G Y

Benchmarking  and  Ranking  

Big  Data  Systems

Xinhui Tian

ICT,  Chinese    Academy  of  Sciences  and     University  of  Chinese  Academy  of  Sciences

(2)

Outline

n BigDataBench

n BigDataBench Dwarfs

n BigData100

(3)

BPOE  2016 3

Why  Big  Data  Benchmarking?

Measuring  big  data  systems  and  architectures   quantitatively

(4)

What  is  BigDataBench?  

n An  open  source  big  data  benchmarking  project  

• http://prof.ict.ac.cn/BigDataBench

(5)

BPOE  2016 5

BigDataBench Users

n http://prof.ict.ac.cn/BigDataBench/users/

n Industry  users

n Accenture,  BROADCOM,  SAMSUMG,  Huawei,  IBM

n About  20  academia  groups  published  papers  

(6)

MPI Shark Impala NoSql

BigDataBench 3.2  Overview

Search Engine

15    Real-­‐world  Data  Sets

BDGS(Big  Data  Generator  Suite)  for  scalable  data

Hadoop RDMA

Facebook  Social  Network ImageNet

Wikipedia   Entries

E-­‐commerce    Transaction

English  broadcasting  audio

Amazon  Movie  Reviews

ProfSearch Resumes DVD  Input  Streams Google  Web  Graph

SoGou Data Image  scene

MNIST

Genome  sequence  data Assembly  of  the  human  genome

Social

Network E-commerce

Multimedia Bioinformatics

(7)

BPOE  2016 7

What’s  New  in  BigDataBench 3.2

• WordCount,  Grep,  Naïve  Bayes,  PageRank,  K-­‐means

BigDataBench support  for  Flink

• JStorm, Spark Streaming

Streaming

• GraphX,  GraphLab,  Flink Gelly

(8)

BigDataBench 3.2

BigDataBench  evolution  

BigDataBench 3.1

BigDataBench 3.0

BigDataBench 2.0

2013.12 Typical  Internet  service  domainsAn  architectural  perspective

19  workloads  &  data  generation  tools

2014.4 Multidisciplinary  effort32  workloads:  diverse  implementations  

5  application  domains:  14  data  sets  and  33  workloads Same  specifications:  diverse  implementations

Multi-­‐tenancy  version

BigDataBench subset  and  simulator  version    

2014.12

New  software stack:  Flink,  JStorm,  GraphX,  GraphLab New  workload type:  Streaming,  Graph  processing New  dataset and  workloads

(9)

BPOE  2016 9

BigDataBench Publications

n BigDataBench:  a  Big  Data  Benchmark  Suite  from  Internet  Services.    20th  IEEE  

International  Symposium  On  High  Performance  Computer  Architecture   (HPCA-­‐2014).

n Characterizing  data  analysis  workloads  in  data  centers.    2013  IEEE  

International  Symposium  on  Workload  Characterization  (IISWC  2013)(Best   paper  award)

n BigOP:  generating  comprehensive  big  data  workloads  as  a  benchmarking  

framework.    19th  International  Conference  on  Database Systems  for   Advanced  Applications  (DASFAA  2014)

n BDGS:  A  Scalable  Big  Data  Generator  Suite  in  Big  Data  Benchmarking.  The  

Fourth  workshop  on  big  data  benchmarking  (WBDB  2014)

n Identifying  Dwarfs  Workloads  in  Big  Data  Analytics arXiv preprint  

arXiv:1505.06872

n BigDataBench-­‐MT:   A  Benchmark  Tool  for  Generating  Realistic   Mixed  Data  

(10)

Outline

n BigDataBench

n BigDataBench Dwarfs

n BigData100

(11)

BPOE  2016 11

BigDataBench Methodology

Application   Domain  1 Application   Domain  … Application   Domain  N

Data  models  of  different   types  &  semantics Data  operations & workloads patterns Benchmark   specification   1 Benchmark   specification   … Benchmark   specification   N Real-­‐world  data   sets   Data  generation   tools Workloads   with   diverse   implementations   Multi-­‐tenancy   version BigDataBench subset Mix  with  different  

percentages

Reduce  

(12)

BigDataBench Methodology  with  Dwarfs

Application   Domain  1 Application   Domain  … Application   Domain  N

Data  models  of  different   types  &  semantics

Dwarfs  Workloads Identification Benchmark   specification   1 Benchmark   specification   … Benchmark   specification   N Real-­‐world  data   sets   Data  generation   tools Workloads   with   diverse   implementations   Multi-­‐tenancy   version BigDataBench subset Mix  with  different  

percentages

Reduce  

(13)

BPOE  2016 13

Dwarfs  Workloads

n Finding  the  common  atomic  workloads  in  big  

data  analytics  workloads

n Representing  maximum  patterns  of  big  data  

(14)

Inspiration

Successful  Compute  Abstractions

• Relational  algebra

– 5  primitive  operations – Select,  Project,  Product,  

Union,  Difference • Parallel  computing – Computational  &   communication  patterns – 13  dwarfs Successful  Benchmarks • TPC-­‐C   – OLTP  domain

– Functions  of  abstraction

• HPCC

– High  performance  computing – Seven  basically  tests  

(15)

BPOE  2016 15

Fundamental  Issues  

What are  the  challenges  for  big  data  dwarfs?

How to  find  the  dwarfs  workloads  in  big  data  

(16)

Challenges

Massive  application  domains  make  us  wonder  where   to  start  or  how  to  achieve  a  wide  range  of  coverage

Classification Dimension reduction Association   rule  mining Regression Clustering Data  Mining Data   mining Deep   learning Machine  

learning Computer  vision

Signal processi ng Natural   language   processing

Many  techniques  for  processing  big  data  exist,  which  bring   greater  complexity  for  identifying  dwarfs  workloads

A  machine  learning  library  – scikit learn,  implements   so  many  algorithms,  which  is  still  much  less  than  the  

total  number  of  algorithms

80%  

data  growth   are  unstructured   data

(17)

BPOE  2016 17

Methodology  of  Dwarfs  

Application   Domain  1 Application   Domain  2 Application   Domain  N Processing  Techniques • Machine  learning • Data  mining • Deep  learning • Computer  vision

• Natural  language  processing  ...

Libraries

• Mllib,Mahout

Frameworks

• Spark,  Hadoop,  GraphLab

Benchmarks

• BigBench,  LinkBench ...

Big  Data  Analytics Representative   Algorithms Frequently   appearing   operations Different   combinations Dwarfs  1 Dwarfs  2 Dwarfs  M … …

(18)

Methodology  of  Dwarfs  

Application   Domain  1 Application   Domain  2 Application   Processing  Techniques • Machine  learning • Data  mining • Deep  learning • Computer  vision

• Natural  language  processing  ...

Libraries

• Mllib,Mahout

Frameworks

• Spark,  Hadoop,  GraphLab

Benchmarks

Big  Data  Analytics Representative   Algorithms Frequently   appearing   operations Different   combinations Dwarfs  1 Dwarfs  2 Dwarfs  M … …

(19)

BPOE  2016 19

Methodology  of  Dwarfs  

Application   Domain  1 Application   Domain  2 Application   Domain  N Processing  Techniques • Machine  learning • Data  mining • Deep  learning • Computer  vision

• Natural  language  processing  ...

Libraries

• Mllib,Mahout

Frameworks

• Spark,  Hadoop,  GraphLab

Benchmarks

• BigBench,  LinkBench ...

Big  Data  Analytics

Dwarfs  1 Dwarfs  2 Dwarfs  M … … Frequently   appearing   operations Different   combinations Representative   Algorithms

(20)

Methodology  of  Dwarfs  

Application   Domain  1 Application   Domain  2 Application   Processing  Techniques • Machine  learning • Data  mining • Deep  learning • Computer  vision

• Natural  language  processing  ...

Libraries

• Mllib,Mahout

Frameworks

• Spark,  Hadoop,  GraphLab

Benchmarks

Big  Data  Analytics Representative   Algorithms Frequently   appearing   operations Different   combinations Dwarfs  1 Dwarfs  2 Dwarfs  M … …

(21)

BPOE  2016 21

Algorithms  Chosen  to  Investigate

C4.5/CART/ID3,  Logistic  regression,  SVM,   KNN,  HMM,  Maximum-­‐entropy  markov model,  Conditional  random  field,   PageRank,  HITS,  Aporiori,  FP-­‐growth,   Principal  component  analysis,  Back   Propagation,  Adaboost,  MCMC,  

Connected  component,  Random  forest

Data  Mining  &   Machine  Learning

Latent  semantic  indexing,  pLSI,   Latent  dirichlet allocation,  Index,  

Porter  Stemming,  Sphinx  speech   recognition

Natural  Language  Processing

Demographic/Content  based   recommendation,  Collaborative  

Filtering

Recommendation

MPEG-­‐2,  Scale-­‐invariant  feature   transform,  Image  segmentation,  

Ray  Tracing Computer  Vision Needleman-­‐Wunsch,  Smith-­‐ Waterman,  BLAST Database  Software CNN, DBN Deep  Learning

(22)

Frequently-­‐appearing  Operations

Similarity  Analysis

KNN,  K-­‐means,  Recommendation,   Feature  matching,  Image  

segmentation

Neural  Network

Back  propagation,  CNN,  DBN,  Neural   network  

Multimedia  Representation

Sphinx  speech  recognition,  MPEG-­‐2,  

Matrix/Vector  Calculation Linear  Algebra Similarity   Analysis • Jaccard similarity • Locality  sensitive   hashing  … Association   Rules   Mining • Aporiori • FP-­‐Growth • … Database • Set  union • Set  difference Set  Operations Sampling MCMC,  LDA,   Random  sampling,   Downsampling … Transform   Operation FFT,  Convolution   computation,  DCT,   Speech  recognition,   MPEG-­‐2 … Graph   Operation BFS,  DFS,  Decision   tree,  Connected   Component Logic   Operation Encrpytion,   index,   Fingerprint,  SimHash,   MinHash Statistic  Operation • Probability  calculation

• LSI,  pLSI,  Latent  dirichlet allocation  …

Sort

• Partial  sort,  quick  sort,  Top  k  sort… • K-­‐means,  Decision  tree  …

(23)

BPOE  2016 23

Dwarfs  in  Big  Data  Offline  Analytics

Sampling Transform  operation Graph operation Logic  operation Set  operation Statistic  operation Sort Linear  Algebra

(24)

Outline

n BigDataBench

n BigDataBench Dwarfs

n BigData100

(25)

BPOE  2016 25

BigData100  Ranking

n An  open  source  project  for  benchmarking  and  

ranking  big  data  systems

n Using  benchmarks  from  BigDataBench

n Big  data  systems  including  Hadoop,  Spark,  Flink

and  many  SQL-­‐interface  systems

(26)

Why  Do  Big  Data  Systems  Ranking

n First  a  quick  glance  of  current  big  data  

(27)

BPOE  2016 27

Big  Data  Systems

Programming Interface Computation Engines Data Storage Data Sources

SQL-like DataFlow APIs

DataFlow Engines

DB layer Graph Layer ML Layer …

Graph MPP Streaming …

Distributed File Systems

Column-oriented Format Key-Value Storage Resource Management

NOSQL Relational DB

Structured Data Unstructured Data Semi-structured Data R Language

(28)

Computation  Engine

Programming   Model Pa ra lle lCo m pu ta tio n Fa ult To le ra nc e Ta sk  Sc he du lin g Lo ca l  E xe cu tio n Ne tw ork  Tr an sfe r In te rm ed ia te  D ata   Ma na ge m en t

(29)

BPOE  2016 29

The  Dataflow  Model

n A  computation  job  is  represented  as  a  

directed  acyclic  graph  (DAG)  consisting  of  

data  sources,  sinks  and  operators

n First  used  in  the  Microsoft  Dryad  system

n Google  MapReduce can  be  seen  as  a  special  

(30)

DataFlow

Sort Map Spill Merge Reduce Input Output Input1 Operator1 Operator2 Operator3 Operator4 Operator5 Input2 Output Channel1 Channel2 Channel3 Channel4 Intermediate Data T ransformation User function Input1 RDD1 RDD2 RDD3 RDD4 RDD5 Input2 Output Transformation1 Transformation2 Transformation3 Transformation4 RDD6 Action Cached RDD Cached Input1 PACT1 PACT2 PACT3 Input2 Channel1 Channel2

MapReduce Dryad Spark

(31)

BPOE  2016 31

Other  Specific  Models

Impala Pregel

Kornacker M,  Behm A,  Bittorf V,  et  al.  Impala:  A  Modern,  Open-­‐Source  SQL  Engine  for  Hadoop[C]//CIDR.  2015.

Malewicz G,  et  al.  Pregel:  a  system  for  large-­‐scale  graph  processing[C] Proceedings  of  the  2010  ACM  SIGMOD  International  Conference  on   Management  of  data.  ACM,  2010:  135-­‐146.

(32)

Why  Do  Big  Data  Systems  Ranking?

n Understanding  the  performance  difference  

between  general-­‐purpose  and  specific   systems  

n performance  comparison  

n Figuring  out  how  the  design  of  framework  

impact  the  performance  of  specific  types  of   workloads

(33)

BPOE  2016 33

Benchmark

n Current  Benchmarks n Micro  workloads:   • WordCount,  Grep n Iteration  workloads: • PageRank,  Kmeans n Interactive  Queries

Select,  Aggregation,  Join

n Others

(34)

Benchmark  Behaviors:  Spark  Case

n Spark

n Resilient  distributed  datasets  (RDDs)

Immutable,  partitioned  collections  of  objects

Created  through  parallel  transformations (map,  filter,   groupBy,  join,  …)  on  data  in  stable  storage

Can  be cached for  efficient  reuse

HDFS  File Filtered  RDD Mapped  RDD

(35)

BPOE  2016 35

Benchmark  Behaviors

n Wordcount,  Grep,  NaiveBayes,  Select

n IO-­‐intensive  

n Computation  can  be  easy  paralleled n No  or  little  data  transfer

(36)

Benchmark  Behaviors

n PageRank,  Kmeans

n Iterative  Computation

n Data  sets  join  and  aggregation  in  each  iteration n A  lot  of  shuffle  operations

Iteration  1 GroupBy DataSet2 Join Iteration  2 GroupBy DataSet2 Join Iteration  3 GroupBy DataSet2 Join DataSet1

(37)

BPOE  2016 37

Benchmark  Behaviors

n Database  queries

n One  DAG  job that  consists  of  many  different  

transformations

Including  complex  aggregation  and  join  algorithms

Map DataSet1

DataSet2 Map

(38)

Workload  Behaviors  on  CPU  and  

Memory  (JVM)

n Simple  Batch  Workloads

n Wordcount,  Grep,  NaiveBayes,  Select

n Iterative  Workloads

n PageRank,  Kmeans

n Interactive  Queries

(39)

BPOE  2016 39

CPU  -­‐ Simple  Batch  Workloads

Grep WordCount NaiveBayes

(40)

CPU  -­‐ Iterative  Workloads

(41)

BPOE  2016 41

CPU  -­‐ Query  Processing

(42)

Memory  -­‐ JVM  Memory  Model

(43)

BPOE  2016 43

JVM  Memory  Usage  in  Spark

n EC:  Eden  Capacity

n EU:  Eden  Usage

n OC:  Old  Memory  Capacity

(44)

Simple  Batch  Workloads

Grep

WordCount NaiveBayes

(45)

BPOE  2016 45

Iterative  Workloads

(46)

Query  Processing

(47)

BPOE  2016 47

Revisit  the  dataflow

Iteration  1 GroupBy DataSet2 Join Map Iteration  2 GroupBy DataSet2 Join Map Iteration  3 GroupBy DataSet2 Join Map Map DataSet1 DataSet2 Map

(48)

Some  Primary  Results  of  

BigData100

n Cluster  Configuration

Computation  Nodes 16  nodes

CPU  per node 2 x  Intel  Xeon  E5645,  12  cores Memory  per  node 32 GB

Disk  per  node 1TB  x  2  SATA  disks Network Gigabit Ethernet

(49)

BPOE  2016 49

System  Selected

n Simple  Batch  Workloads  and  Iterative  

Workloads

n Hadoop  2.7.1,  Spark  1.5.1,  Flink 0.9.1

n Interactive  Query  Processing

n Spark  1.5.1,  Hive  1.2.1  (on  Hadoop,  on  Tez and  on  

(50)

Size  of  data  sets

n WordCount,  Grep

n 1.600,000,000  English  words  from  Wikipedia n About  300GB

n Naïve  Bayes

n 5,700,000  reviews  from  Amazon  Movie  Reviews n 300GB

(51)

BPOE  2016 51

Size  of  data  sets

n PageRank

n Google  Web  Graph,  16  million  vertexes,  99  million  

edges  (unstructured  graph)

n KMeans

n Facebook  Social  Network,  460  million  vectors

n Queries  – subset  of  TPC-­‐DS

n item: 4  GB

n customer: 13  GB

(52)

Results  – Simple  Batch  Workloads

410 330 548 0 100 200 300 400 500 600

Hadoop Spark Flink

Ru n   Ti m e   (s ) Grep 680 345 1105 0 200 400 600 800 1000 1200

Hadoop Spark Flink

Ru n Ti m e   (s ) WordCount

(53)

BPOE  2016 53

Results

490 351 563 0 100 200 300 400 500 600

Hadoop Spark Flink

Ru n   Ti m e   (s ) Naive  Bayes

(54)

Observation

n Spark  has  the  best  performance  on  that  kind  

of  workloads

n Good  data  locality

n One  task  for  each  partition

n One  reason  for  the  bad  performance  in  Flink

n High  level  data  locality

(55)

BPOE  2016 55

Results  – Iterative  Workloads

7753 780 598 469 0 1000 2000 3000 4000 5000 6000 7000 8000 9000

Hadoop Spark Flink Flink-­‐delta

Ru n   Ti m e   (s ) PageRank 6722 672 805 0 1000 2000 3000 4000 5000 6000 7000 8000

Hadoop Spark Flink

Ru n   Ti m e( s) KMeans

(56)

Observation

n Spark,  Flink can  be  much  faster  than  Hadoop

n Advantage  of  the  DAG  model

n The  delta  iteration  from  Flink can  be  more  

(57)

BPOE  2016 57

Results  – Queries

38 58 56 65 9 0 10 20 30 40 50 60 70

SparkSQL HiveOnSpark HiveOnTez Hive Impala

Ru n   Ti m e( s) Select

(58)

Results  -­‐ Queries

155 174 248 490 123 0 100 200 300 400 500 600

SparkSQL HiveOnSpark HiveOnTez Hive Impala

Ru n   Ti m e( s) Aggregation

(59)

BPOE  2016 59

Results  – Queries

214 342 318 593 791 0 100 200 300 400 500 600 700 800 900

SparkSQL HiveOnSpark HiveOnTez Hive Impala

Ru n   Ti m e( s) Join

(60)

Observations

n Impala  has  the  best  performance  on  select  

and  aggregation

n Specific  data  structure  designed  for  database  

query

n Good  disk  performance  due  to  I/O  buffer  

managemen and  short-­‐circuit  local  reads  

n Impala  gets  bad  performance  for  join  

(61)

BPOE  2016 61

References

Related documents

The voluntary activity of refugees thus helped diffuse conflicts arising from the contradictions between refugees and the state, by using social capital between refugees as

In 1912, while discussing the place of the Chinese collection at the American Museum of Natural History in New York, anthropology curator Clark Wissler emphatically announced:

methodologies are therefore needed to access this “invisible” group. Since much is written about migrant sex workers and little is heard directly from them, approaches that

reason. 4) The societies shall produce an attested copy of the resolution duly approved by the Co-Operative department alongwith technical bids. 5) The tender without

The Hotel: 3 Stars Self Catering Studios for 2/3 Apartments for 2/4 Air Conditioning (seasonal) Swimming Pool Free Wifi (public areas).. Located at the southern edge of tiny

These will be on the Political lobby work (in order to strengthen Floorball’s position in Worldwide Sport) in the International Sports community, strengthening the Marketing of

Recently published results from the ENZAMET trial indicate that patients with mHSPC benefit from the addition of enzalutamide to standard first-line treatment in terms of

Source Data: US Bureau of Economic Analysis, Trucost Environmental Register (validated environmental performance data of 4,800 companies and suppliers), natural capital valuation