Architecture Support for Big Data Analytics

(1)

1

Architecture Support for

Big Data Analytics

Ahsan Javed Awan EMJD-DC (KTH-UPC)

(http://uk.linkedin.com/in/ahsanjavedawan/)

(2)

2

Motivation

Why should we care?

(3)

Motivation

Cont...

● A mismatch between the characteristics of emerging workloads and the underlying hardware.

– M. Ferdman et-al, “Clearing the clouds: A study of emerging scale-out workloads on modern

hardware,” in ASPLOS 2012.

– Z. Jia, et-al “Characterizing data analysis workloads in data centers,” in IISWC 2013.

– C. Zheng et-al, “Characterizing os behavior of scale-out data center workloads.” in WIVOSCA

2013.

– M. Dimitrov et al, “Memory system characterization of big data workloads,” in BigData

Conference, 2013.

– Z. Jia et-al, “Characterizing and subsetting big data workloads,” in IISWC 2014

– A. Yasin et-al, “Deep-dive analysis of the data analytics workload in cloudsuite,” in IISWC 2014. – L. Wang et-al, “Bigdatabench: A big data benchmark suite from internet services,” in HPCA,

2014.

(4)

4

Motivation

Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server

(5)

5

Motivation

Cont...

Our Focus

Our Focus Improve the single node performance

in scale-out configuration Phoenix ++, Metis, Ostrich, etc.. Hadoop, Spark, Flink, etc..

(6)

6

Progress Meeting 12-12-14

Which Scale-out Framework ?

(7)

Our Approach

● A three fold analysis method at Application, Thread and

Micro-architectural level

– Tuning of Spark internal Parameters

– Tuning of JVM Parameters (Heap size etc..) – Concurrency Analysis

– General Architectural Exploration

(8)

8

Our Approach

Benchmarks

3GB of Wikipedia raw datasets, Amazon Movies Reviews and numerical records have been used

(9)

Our Hardware Configuration

Machine Details

Hyper Threading and Turbo Boost is disabled

(10)

10

Our Approach

(11)

Multicore Scalability of Spark

Application Level Performance

(12)

12

Multicore Scalability of Spark

Stage Level Performance

● Shuffle Map Stages don't scale beyond 12 threads

across different workloads

● No of concurrent files open in Map-side shuffling is

C*R where C is no of threads in executor pool and R is no of reduce tasks

(13)

Multicore Scalability of Spark

Task Level Performance

(14)

14

(15)

CPU Utilization is not

scaling with performance

(16)

16

Is there any Work Time

Inflation ??

(17)

How does Micro-architecture contribute to

Work time inflation ??

(18)

18

(19)

(20)

20

(21)

Key Findings

● More than 12 threads in an executor pool does not yield significant

performance

● Spark runtime system need to be improved to provide better load

balancing and avoid work-time inflation.

● Work time inflation and load imbalance on the threads are the

scalability bottlenecks.

● Removing the bottlenecks in the front-end of the processor would

not remove more than 20% of stalls.

● Effort should be focused on removing the memory bound stalls

since they account for up to 72% of stalls in the pipeline slots.

● Memory bandwidth of current processors is sufficient for

(22)

22

Motivation

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

(23)

Motivation

Do Spark based data analytics benefit from using larger scale-up servers

(24)

24

Motivation

(25)

Motivation

(26)

26

Motivation

(27)

Motivation

(28)

28

Motivation

(29)

Motivation

How does data size affects micro-architectural performance ?

(30)

30

Motivation

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

(31)

(32)

32

(33)

Key Findings

● Spark workloads do not benefit significantly from executors with

more than 12 cores.

● The performance of Spark workloads degrades with large volumes

of data due to substantial increase in garbage collection and file I/O time.

● With out any tuning, Parallel Scavenge garbage collection scheme

outperforms Concurrent Mark Sweep and G1 garbage collectors for Spark workloads.

● Spark workloads exhibit improved instruction retirement due to

lower L1 cache misses and better utilization of functional units inside cores at large volumes of data.

● Memory bandwidth utilization of Spark benchmarks decreases

with large volumes of data and is 3x lower than the available off-chip bandwidth on our test machine

(34)

34

Our Approach

● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade,

“Performance charaterization of in-memory data analytics on a mordern cloud server,” in 5th International IEEE Conference on

Big Data and Cloud Computing, 2015.

● A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade, “How

Data Volume Affects Spark Based Data Analytics on a Scale-up Server”, in 6th International Workshop on Big Data

Benchmarking, Performance Optimization and Emerging Hardware (BpoE) held in conjunction with VLDB, 2015.

(35)

Motivation

Future Directions

NUMA Aware Task Scheduling

Cache Aware Transformations

Exploiting Processing In Memory Architectures

HW/SW Data Prefectching