Big Data Visualization on the MIC

(1)

Tim Dykes

School of Creative Technologies

University of Portsmouth

(2)

Splotch Team

● Tim Dykes, University of Portsmouth

● Claudio Gheller, Swiss National Supercomputing Centre

● Marzia Rivi, University of Oxford

● Mel Krokos, University of Portsmouth

● Klaus Dolag, University Observatory Munich

(3)

● Splotch Overview

● MIC-Splotch Implementation & Optimization

● Performance Measurements

(4)

Splotch

● Ray-casting algorithm for large datasets

● Primarily for astrophysical N-body simulations

● Applicable to any data representable as point-like elements with attributes

● Particle contribution to image determined using radiative transfer equation and a Gaussian distribution function

[Figures 1 to 3]

[1] 3D Modelling and Visualization of Galaxies, by B. Koribalski, C. Gheller and K. Dolag (ATNF, Australia; ETH-CSCS, Switzerland; University Observatory Munich, Germany)
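
The Gaussian footprint of a single particle can be illustrated with a short C++ sketch. This is not Splotch's code: it assumes an emission-only simplification (the absorption term of the radiative transfer equation is left out), a footprint truncated at three sigma, and the hypothetical names `Image` and `splat`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical RGB image: three floats per pixel, row-major.
struct Image {
    int width, height;
    std::vector<float> rgb;
    Image(int w, int h)
        : width(w), height(h), rgb(static_cast<std::size_t>(w) * h * 3, 0.0f) {}
    float& at(int x, int y, int c) {
        return rgb[(static_cast<std::size_t>(y) * width + x) * 3 + c];
    }
};

// Additively splat one particle centred at (px, py) with a Gaussian weight.
// Assumption: emission only, footprint clipped to the image and to 3 sigma.
void splat(Image& img, float px, float py, float sigma, const float color[3]) {
    assert(sigma > 0.0f);
    const int radius = static_cast<int>(std::ceil(3.0f * sigma));
    const int cx = static_cast<int>(std::floor(px));
    const int cy = static_cast<int>(std::floor(py));
    const int x0 = std::max(0, cx - radius);
    const int x1 = std::min(img.width - 1, cx + radius);
    const int y0 = std::max(0, cy - radius);
    const int y1 = std::min(img.height - 1, cy + radius);
    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            const float dx = x - px;
            const float dy = y - py;
            const float w = std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
            for (int c = 0; c < 3; ++c) {
                img.at(x, y, c) += w * color[c];  // additive combination
            }
        }
    }
}
```

Calling `splat` once per particle accumulates overlapping footprints in the same pixels, which is why many particles can end up writing to a single pixel.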

(5)

(6)

Notable Challenges for Parallel Implementation

● Load balancing for data spread unevenly throughout the image

– High concentration of particles in small area

– Low concentration of particles spread across large area

● Potential race conditions due to single pixels affected by many elements

(7)

Motivations for MIC-Splotch

● Accelerator/coprocessor usage in HPC

● Exploitation of all available hardware

● New architecture

(8)

MIC Architecture

● PCIe SMP-on-a-chip

● Up to 61 cores

● Up to 244 HW threads

● 512-bit wide SIMD

● Up to 8GB GDDR5

(9)

MIC Architecture cont.

Parallel Programming Models

– OpenMP

– Intel Cilk Plus

– MPI

– Pthreads

Processing Models Available

– Native

● Cross compile source to run directly on device

– Offload

● LEO (Language Extensions for Offload)

– Symmetric

● Use each coprocessor as a node in an MPI cluster, or subdivide the device to contain a series of MPI nodes
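
As an illustration of the offload model, the sketch below marks a loop with a LEO `#pragma offload` directive. It assumes an Intel compiler with offload support; other compilers ignore the unknown pragma and run the block on the host. The function `sum_squares` is hypothetical and not taken from Splotch.

```cpp
#include <cassert>

// Sum of squares computed inside an offload region.
// With an offload-capable Intel compiler the block runs on coprocessor 0:
//   in(data : length(n)) copies n floats to the device,
//   inout(sum) copies the accumulator there and back.
// Compilers without LEO support skip the pragma and execute on the host.
float sum_squares(const float* data, int n) {
    assert(n >= 0);
    float sum = 0.0f;
#pragma offload target(mic:0) in(data : length(n)) inout(sum)
    {
        for (int i = 0; i < n; ++i) {
            sum += data[i] * data[i];
        }
    }
    return sum;
}
```

The same source therefore builds for host-only and offload targets, which is convenient while porting.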

(10)

(11)

Optimization Methods

● Address memory allocation and transfer issues

● Thread management

● Automatic vectorization

(12)

Memory Allocation

Problem:

Dynamic allocation is slow

Mitigation Advice:

– Pre-allocating large buffers

– Allocating with large pages

– Avoid dynamic allocations

Allocate memory at the start of the program and reuse throughout rather than deleting and reallocating

MIC_USE_2MB_BUFFERS=64K enables use of 2MB page sizes for any allocations over 64K

This can also reduce page faults and translation look-aside buffer (TLB) misses
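
A minimal sketch of the pre-allocate-and-reuse advice, written as plain host-side C++. The class `ScratchBuffer` and its allocation counter are hypothetical and exist only to make the reuse visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch buffer: allocated once up front, then handed out
// again for every chunk instead of being deleted and reallocated.
class ScratchBuffer {
public:
    explicit ScratchBuffer(std::size_t capacity)
        : storage_(capacity), allocations_(1) {
        assert(capacity > 0);
    }

    // Returns storage for `count` floats. Grows only when a request
    // exceeds the capacity reserved at start-up.
    float* acquire(std::size_t count) {
        if (count > storage_.size()) {
            storage_.resize(count);
            ++allocations_;
        }
        return storage_.data();
    }

    std::size_t capacity() const { return storage_.size(); }
    int allocations() const { return allocations_; }

private:
    std::vector<float> storage_;
    int allocations_;  // counts how often memory was actually (re)allocated
};
```

Sizing the buffer for the largest expected chunk keeps the allocation count at one for the whole run; the `MIC_USE_2MB_BUFFERS=64K` setting above is an environment variable and needs no code change.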

(13)

Double buffered computation

● Particles processed in chunks

● Computation overlapped with transfer

● Reduced transfer times
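
The double-buffering pattern can be sketched on the host alone. In this hedged example a `std::async` copy stands in for the host-to-device transfer and a sum stands in for rendering; `transfer_chunk`, `process_chunk` and `process_double_buffered` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for a transfer: copies particles [begin, end) into dst.
static void transfer_chunk(const std::vector<float>& src, std::size_t begin,
                           std::size_t end, std::vector<float>& dst) {
    dst.assign(src.begin() + static_cast<std::ptrdiff_t>(begin),
               src.begin() + static_cast<std::ptrdiff_t>(end));
}

// Stand-in for rendering a chunk: here simply a sum.
static double process_chunk(const std::vector<float>& chunk) {
    return std::accumulate(chunk.begin(), chunk.end(), 0.0);
}

double process_double_buffered(const std::vector<float>& particles,
                               std::size_t chunk_size) {
    assert(chunk_size > 0);
    const std::size_t n = particles.size();
    double total = 0.0;
    if (n == 0) return total;

    std::vector<float> buffers[2];
    transfer_chunk(particles, 0, std::min(chunk_size, n), buffers[0]);
    int current = 0;

    for (std::size_t begin = 0; begin < n; begin += chunk_size) {
        const std::size_t next_begin = begin + chunk_size;
        std::future<void> pending;
        if (next_begin < n) {
            // Start filling the idle buffer with the next chunk...
            const std::size_t next_end = std::min(next_begin + chunk_size, n);
            const int idle = 1 - current;
            pending = std::async(std::launch::async, [&, next_begin, next_end, idle] {
                transfer_chunk(particles, next_begin, next_end, buffers[idle]);
            });
        }
        // ...while the current buffer is being processed.
        total += process_chunk(buffers[current]);
        if (pending.valid()) pending.get();  // wait before swapping buffers
        current = 1 - current;
    }
    return total;
}
```

Only the first transfer is exposed; every later one is hidden behind the processing of the previous chunk, provided processing takes at least as long as a transfer.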

(14)

Multithreaded Rendering

Split threads into groups

Create full image buffer for each group

Split images into tiles T0 → TN

Step 1 - Allocate

Each thread N is allocated a subset of the particle data (Particle Subset N)

Step 2 - Prerender

Each thread generates a list of particle indices per tile (Tile_list_N: T0, T1, ... TN) for the subset of particle data allocated

Step 3 - Render

Each thread renders all particles from all lists (Tile_list_0 → N) for one particular tile

Step 4

Image buffers accumulated and transferred back to host
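
The prerender and render steps can be sketched as follows. This is a simplified illustration, not the MIC-Splotch implementation: it assumes a single thread group, a single colour channel, and particles that each touch exactly one pixel (so no particle spans two tiles). `Particle` and `render_tiled` are hypothetical names; the OpenMP pragmas are ignored when OpenMP is not enabled, and the result is the same either way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle { int x, y; float value; };  // hypothetical one-pixel particle

std::vector<float> render_tiled(const std::vector<Particle>& particles,
                                int width, int height,
                                int tile_size, int num_threads) {
    assert(width > 0 && height > 0 && tile_size > 0 && num_threads > 0);
    const int tiles_x = (width + tile_size - 1) / tile_size;
    const int tiles_y = (height + tile_size - 1) / tile_size;
    const int num_tiles = tiles_x * tiles_y;
    const int n = static_cast<int>(particles.size());

    // lists[t][tile]: indices from thread t's particle subset that hit `tile`.
    std::vector<std::vector<std::vector<int>>> lists(
        num_threads, std::vector<std::vector<int>>(num_tiles));

    // Prerender: each thread bins only its own subset, so no list is shared.
    #pragma omp parallel for
    for (int t = 0; t < num_threads; ++t) {
        const int begin = static_cast<int>(static_cast<long long>(n) * t / num_threads);
        const int end = static_cast<int>(static_cast<long long>(n) * (t + 1) / num_threads);
        for (int i = begin; i < end; ++i) {
            const Particle& p = particles[i];
            if (p.x < 0 || p.x >= width || p.y < 0 || p.y >= height) continue;
            const int tile = (p.y / tile_size) * tiles_x + p.x / tile_size;
            lists[t][tile].push_back(i);
        }
    }

    // Render: each tile is handled by exactly one iteration, so two threads
    // never write to the same pixel and no locking is needed.
    std::vector<float> image(static_cast<std::size_t>(width) * height, 0.0f);
    #pragma omp parallel for
    for (int tile = 0; tile < num_tiles; ++tile) {
        for (int t = 0; t < num_threads; ++t) {
            for (int i : lists[t][tile]) {
                const Particle& p = particles[i];
                image[static_cast<std::size_t>(p.y) * width + p.x] += p.value;
            }
        }
    }
    return image;
}
```

Giving each tile a single owner removes the race on shared pixels; with several thread groups, each group would fill its own image buffer and the buffers would be summed at the end.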

(15)

Vectorization

Aiding Automatic Vectorization

– Data structure organisation

Converting from array of structures to structure of arrays provided a 10% performance boost to the rasterization phase

– Data alignment

Data should be aligned to 64 byte boundaries on host using _mm_malloc() to ensure offload allocation & transfers are also aligned

– Compiler directives

Use of the “__assume_aligned(ptr,64)” directive informs the compiler that the array being worked on is aligned correctly.

#pragma ivdep instructs the compiler to ignore assumed vector dependencies, for example where arrays do not overlap each other.

Use of the compiler option -vec-reportX (where X = 0-6) provides detailed information on what has and has not been vectorized, along with suggestions as to why

The guide to auto-vectorization with Intel C++ compilers is useful at this stage
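
The data layout and alignment points can be sketched in portable C++. The example assumes a fixed batch size `kCount` and uses `alignas(64)` in place of `_mm_malloc()`; the struct and function names are hypothetical. The Intel-specific hints appear only as comments because they are compiler extensions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCount = 1024;  // hypothetical batch size

// Array of structures: a loop over one field strides through memory.
struct ParticleAoS { float x, y, z, r, g, b; };

// Structure of arrays: each field is contiguous and 64-byte aligned,
// matching the 512-bit vector width.
struct ParticlesSoA {
    alignas(64) float x[kCount];
    alignas(64) float y[kCount];
    alignas(64) float z[kCount];
    alignas(64) float r[kCount];
    alignas(64) float g[kCount];
    alignas(64) float b[kCount];
};

// Convert a batch of kCount particles from AoS to SoA layout.
void to_soa(const ParticleAoS* in, ParticlesSoA& out) {
    assert(in != nullptr);
    for (std::size_t i = 0; i < kCount; ++i) {
        out.x[i] = in[i].x; out.y[i] = in[i].y; out.z[i] = in[i].z;
        out.r[i] = in[i].r; out.g[i] = in[i].g; out.b[i] = in[i].b;
    }
}

// Unit-stride loop over aligned, non-overlapping arrays: a good candidate
// for automatic vectorization. With the Intel compiler one could add
// __assume_aligned(p.r, 64) and "#pragma ivdep" before the loop.
void scale_brightness(ParticlesSoA& p, float factor) {
    for (std::size_t i = 0; i < kCount; ++i) {
        p.r[i] *= factor;
        p.g[i] *= factor;
        p.b[i] *= factor;
    }
}
```

In the SoA form every iteration reads and writes consecutive floats, so one 512-bit load covers sixteen particles for a given field.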

(16)

Vectorization Cont.

Manual Vectorization

– Difficult to automatically vectorize complex areas of code.

– Intrinsics, mapping directly to the Intel Initial Many Core Instructions (IMCI) set, can be used to manually vectorize code.

– The rendering phase of Splotch is not amenable to automatic vectorization, due to fairly unpredictable unaligned memory access patterns.

For each particle the spread of affected pixels is calculated, then each column of pixels is rendered.

A pixel color is calculated by multiplying the color of the particle by a contributive factor. This value is then additively combined with the previous color of the affected pixel.

(17)

Manual Vectorization Method

Step 1: Pack particle color x5 into __m512 vector container V1

R G B R G B R G B R G B R G B

_mm512_setr_ps(r,g,b,r,g,b...)

Step 2: Pack contribution value x3 per pixel into __m512 vector container V2

C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5

_mm512_setr_ps(c1,c1,c1,c2,c2...)

Step 3: Pack affected pixel colors (up to 5) into __m512 vector container V3

R G B R G B R G B R G B R G B

_mm512_setr_ps(p1.r, p1.g, p1.b, p2.r...)

Step 4: Fused Multiply-Add vectors where V3 = (V1*V2) + V3

_mm512_fmadd_ps(V1,V2,V3)

Step 5: Masked Unaligned store V3 back to memory

_mm512_mask_packstorelo_ps((void*)&dest[idx], _mask, V3)

_mm512_mask_packstorehi_ps(((void*)&dest[idx])+64, _mask, V3)

The _mask ensures the unused 16th float in the vector containers is not written to the image

int mask = 0b0111111111111111; __mmask16 _mask =
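
The five steps can be checked with a scalar emulation that runs on any host. The sketch below does not use IMCI intrinsics: a 16-element `std::array` stands in for one 512-bit register and plain loops stand in for the fused multiply-add and the masked store. `Vec16`, `Rgb` and `blend5` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

using Vec16 = std::array<float, 16>;  // stands in for one 512-bit register

struct Rgb { float r, g, b; };

// Blend one particle colour into five consecutive RGB pixels at dest.
// contrib[k] is the contribution factor for pixel k.
void blend5(float* dest, const Rgb& color, const float contrib[5]) {
    assert(dest != nullptr);
    Vec16 v1{}, v2{}, v3{};  // lane 15 stays zero and is never stored

    for (int k = 0; k < 5; ++k) {
        // Step 1: particle colour repeated five times.
        v1[3 * k] = color.r; v1[3 * k + 1] = color.g; v1[3 * k + 2] = color.b;
        // Step 2: each contribution repeated three times (once per channel).
        v2[3 * k] = v2[3 * k + 1] = v2[3 * k + 2] = contrib[k];
        // Step 3: current colours of the affected pixels.
        for (int c = 0; c < 3; ++c) v3[3 * k + c] = dest[3 * k + c];
    }

    // Step 4: fused multiply-add per lane, V3 = (V1 * V2) + V3.
    for (int lane = 0; lane < 16; ++lane) {
        v3[lane] = std::fma(v1[lane], v2[lane], v3[lane]);
    }

    // Step 5: masked store; bit 15 is clear, so the 16th float is skipped.
    const std::uint16_t mask = 0x7FFF;
    for (int lane = 0; lane < 16; ++lane) {
        if ((mask >> lane) & 1u) dest[lane] = v3[lane];
    }
}
```

Because lane 15 is masked out, the float that follows the fifth pixel in the image is left untouched, which is the role of `_mask` in the intrinsic version.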

(18)

Offloading with MPI

● Data heavy algorithms can benefit from MPI based offloading.

● Multiple MPI processes run on the host, sharing a single or multiple devices.

● Chunks of data can be allocated, transferred and processed in parallel, providing a significant performance boost.

The script to subdivide multiple devices amongst 8 tasks can be unwieldy
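
The arithmetic behind such a subdivision can be sketched in a few lines. This is a hypothetical helper, not the script referred to above: it assumes ranks are assigned to devices in contiguous blocks and that each device's hardware threads are split as evenly as possible between the tasks sharing it.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result: which device a task uses and which hardware
// threads on that device it may occupy.
struct Assignment {
    int device;
    int first_thread;
    int thread_count;
};

Assignment assign_task(int rank, int tasks_per_device, int threads_per_device) {
    assert(rank >= 0);
    assert(tasks_per_device > 0);
    assert(threads_per_device >= tasks_per_device);

    const int device = rank / tasks_per_device;  // contiguous blocks of ranks
    const int local = rank % tasks_per_device;   // position within the device

    // Split threads evenly; the first `extra` tasks receive one more thread.
    const int base = threads_per_device / tasks_per_device;
    const int extra = threads_per_device % tasks_per_device;
    const int first = local * base + std::min(local, extra);
    const int count = base + (local < extra ? 1 : 0);

    return Assignment{device, first, count};
}
```

Each task would then pass its range to whatever affinity mechanism the runtime provides, so that tasks sharing a device do not compete for the same cores.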

(19)

Performance Testing

Test System Specification

– 'Dommic' Facility at the Swiss National Supercomputing Centre

– 7 Nodes

– Each node based on dual socket eight-core Intel Xeon 2670, running at 2.6GHz with 32 GB main system memory and two Intel Xeon Phi 5110 coprocessors available

Test Scenario

– ~21 Million particle N-Body simulation produced using the Gadget code.

– 100 frame animation orbiting the dataset

– 8 host MPI processes per device, 2 thread groups of 15 threads each

– 4 OpenMP threads per available core (~236)

(20)

Results

Per-Frame time for all phases: host OpenMP 1-16 cores vs single and dual Xeon Phi devices.

Per-Frame time for rasterization: host OpenMP 1-16 cores vs single and dual Xeon Phi devices.

(21)

Results Cont.

Per-Frame processing time comparing MPI, OpenMP and MPI offloading to single and dual Xeon Phi devices.

(22)

Results Notes

● Best performance boost is seen in the rasterization phase, with a single device outperforming 16 OpenMP threads by ~2.5x.

● Use of a second device provides, as expected, a 2x performance boost in comparison to a single device

● Per-frame processing time for the current implementation on a single device is comparable to 4 host OpenMP threads, while two devices outperform 16 OpenMP threads due to non-linear scaling of the host OpenMP implementation

● In comparison with the linearly scaling MPI implementation, each

(23)

Further work

● Further optimisation and tuning through use of Intel VTune

● Dynamic thread grouping system

● Comparison against GPU model

● Exploration of MPI running on both host and device

(24)

References

Splotch Publications

1. Dolag, K., Reinecke, M., Gheller, C., Imboden, S.: Splotch: Visualizing Cosmological Simulations. New Journal of Physics, 10(12) id. 125006 (2008)

2. Jin, Z., Krokos, M., Rivi, M., Gheller, C., Dolag, K., Reinecke, M.: High-Performance Astrophysical Visualization using Splotch. Procedia Computer Science, 1(1) 1775–1784 (2010)

3. Rivi, M., Gheller, C., Dykes, T., Krokos, M., Dolag, K.: GPU Accelerated Particle Visualisation with Splotch. To appear in Astronomy and Computing (2014)

4. Dykes, T., Gheller, C., Rivi, M., Krokos, M.: Big Data Visualization on the Xeon Phi. Submitted to International Supercomputing Conference (2014)
