Big Data Visualization on the MIC

Tim Dykes

School of Creative Technologies

University of Portsmouth

(2)

Splotch Team

Tim Dykes, University of Portsmouth

Claudio Gheller, Swiss National Supercomputing Centre

Marzia Rivi, University of Oxford

Mel Krokos, University of Portsmouth

Klaus Dolag, University Observatory Munich

(3)

Contents

Splotch Overview

MIC-Splotch Implementation & Optimization

Performance Measurements

(4)

Splotch

Ray-casting algorithm for large datasets

Primarily for astrophysical N-body simulations

Applicable to any data representable as point-like elements with attributes

Particle contribution to image determined using radiative transfer equation and a Gaussian distribution function

[1] 3D Modelling and Visualization of Galaxies, by B. Koribalsky, C. Gheller and K. Dolag (ATNF, Australia; ETH-CSCS, Switzerland; Univ. Observatory Munich, Germany)

(5)
(6)

Notable Challenges for Parallel Implementation

Load balancing for data spread unevenly throughout the image

High concentration of particles in small area

Low concentration of particles spread across large area

Potential race conditions due to single pixels affected by many elements

(7)

Motivations for MIC-Splotch

Accelerator/coprocessor usage in HPC

Exploitation of all available hardware

New architecture

(8)

MIC Architecture

PCIe SMP-on-a-chip

Up to 61 cores

Up to 244 HW threads

512-bit wide SIMD

Up to 8GB GDDR5

(9)

MIC Architecture cont.

Parallel Programming Models

OpenMP

Intel Cilk Plus

MPI

Pthreads

Processing Models Available

Native

Cross compile source to run directly on device

Offload

LEO (Language Extensions for Offload)

Symmetric

Use each coprocessor as a node in an MPI cluster, or subdivide the device to contain a series of MPI nodes

(10)
(11)

Optimization Methods

Address memory allocation and transfer issues

Thread management

Automatic vectorization

(12)

Memory Allocation

Problem:

Dynamic allocation is slow

Mitigation Advice:

Pre-allocating large buffers

Allocate memory at the start of the program and reuse it throughout, rather than deleting and reallocating

Allocating with large pages

MIC_USE_2MB_BUFFERS=64K enables use of 2MB page sizes for any allocations over 64K

This can also reduce page faults and translation look-aside buffer (TLB) misses

Avoid dynamic allocations
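The pre-allocation advice can be illustrated with a minimal sketch in portable C++. The `ScratchBuffer` class and `process_frame` function are hypothetical names introduced here for illustration and are not taken from the Splotch source. The buffer is sized once for the largest expected chunk and then handed out again for every frame.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: scratch storage allocated once and reused for every
// frame, instead of a new/delete pair per frame.
class ScratchBuffer {
public:
    explicit ScratchBuffer(std::size_t max_elements)
        : storage_(max_elements), allocations_(1) {}

    // Returns scratch space for `count` floats. The storage only grows if a
    // request exceeds the capacity chosen up front.
    float* acquire(std::size_t count) {
        if (count > storage_.size()) {
            storage_.resize(count);
            ++allocations_;
        }
        return storage_.data();
    }

    std::size_t capacity() const { return storage_.size(); }
    std::size_t allocations() const { return allocations_; }

private:
    std::vector<float> storage_;
    std::size_t allocations_;
};

// Example per-frame work: fill the scratch area and reduce it.
inline float process_frame(ScratchBuffer& scratch, std::size_t count, float value) {
    float* data = scratch.acquire(count);
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        data[i] = value;
        sum += data[i];
    }
    return sum;
}
```

Calling `process_frame` repeatedly with a `ScratchBuffer` created before the frame loop keeps the allocation count at one, whatever the number of frames.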

(13)

Double buffered computation

Particles processed in chunks

Computation overlapped with transfer

Reduced transfer times
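A minimal sketch of the double-buffering idea, in portable C++ rather than offload code. Here `load_chunk` is a hypothetical stand-in for the host-to-device transfer, and `std::async` stands in for an asynchronous offload; both are assumptions for illustration only. While one buffer is being processed, the next chunk is copied into the other buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for a transfer: copies up to chunk_size values starting at `begin`
// into `staging` and returns how many were copied.
inline std::size_t load_chunk(const std::vector<float>& source, std::size_t begin,
                              std::size_t chunk_size, std::vector<float>& staging) {
    if (begin >= source.size()) return 0;
    const std::size_t count = std::min(chunk_size, source.size() - begin);
    std::copy(source.data() + begin, source.data() + begin + count, staging.data());
    return count;
}

// Sums all values chunk by chunk, overlapping the load of chunk i+1 with the
// computation on chunk i.
inline double process_double_buffered(const std::vector<float>& particles,
                                      std::size_t chunk_size) {
    if (particles.empty() || chunk_size == 0) return 0.0;
    std::vector<float> buffers[2] = {std::vector<float>(chunk_size),
                                     std::vector<float>(chunk_size)};
    int current = 0;
    std::size_t loaded = load_chunk(particles, 0, chunk_size, buffers[current]);
    std::size_t next_begin = loaded;
    double total = 0.0;

    while (loaded > 0) {
        const int next = 1 - current;
        // Start filling the idle buffer in the background.
        std::future<std::size_t> pending = std::async(
            std::launch::async, [&particles, &buffers, next, next_begin, chunk_size] {
                return load_chunk(particles, next_begin, chunk_size, buffers[next]);
            });
        // Meanwhile, compute on the buffer that is already loaded.
        total += std::accumulate(buffers[current].data(),
                                 buffers[current].data() + loaded, 0.0);
        loaded = pending.get();  // wait for the background load
        next_begin += loaded;
        current = next;          // swap buffer roles
    }
    return total;
}
```

The two buffers are never touched by both sides at once: the background task writes only to `buffers[next]`, and the compute step reads only from `buffers[current]`.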

(14)

Multithreaded Rendering

Split threads into groups

Create full image buffer for each group

Split images into tiles T0 → TN

Step 1 - Allocate

[Diagram: Thread N is allocated Particle Subset N and a set of per-tile lists, Tile_list_N, with one list for each tile T0 ... TN]

Step 2 - Prerender

Each thread generates a list of particle indices per tile for the subset of particle data allocated to it

Step 3 - Render

Each thread renders all particles from all lists for one particular tile

Step 4

Image buffers accumulated and transferred back to host
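The tile-based scheme above can be sketched as follows. This is a simplified illustration, not the Splotch implementation: it assumes a single thread group, a single-channel image, and particles that each touch exactly one pixel (in practice a particle overlapping several tiles would be added to each tile's list). The names `Particle`, `run_on_threads` and `render_tiled` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Particle { int x; int y; float value; };  // hypothetical minimal particle

// Runs fn(t) on num_threads threads, t = 0 .. num_threads-1, and waits for all.
template <typename Fn>
void run_on_threads(int num_threads, Fn fn) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) workers.emplace_back(fn, t);
    for (std::thread& w : workers) w.join();
}

inline std::vector<float> render_tiled(const std::vector<Particle>& particles,
                                       int width, int height, int tile_size,
                                       int num_threads) {
    const int tiles_x = (width + tile_size - 1) / tile_size;
    const int tiles_y = (height + tile_size - 1) / tile_size;
    const int num_tiles = tiles_x * tiles_y;

    // Step 1: each thread gets its own set of per-tile index lists.
    std::vector<std::vector<std::vector<std::size_t>>> lists(
        static_cast<std::size_t>(num_threads),
        std::vector<std::vector<std::size_t>>(static_cast<std::size_t>(num_tiles)));

    // Step 2 (prerender): each thread bins its own contiguous particle subset.
    run_on_threads(num_threads, [&](int t) {
        const std::size_t n = particles.size();
        const std::size_t begin = n * static_cast<std::size_t>(t) / static_cast<std::size_t>(num_threads);
        const std::size_t end = n * static_cast<std::size_t>(t + 1) / static_cast<std::size_t>(num_threads);
        for (std::size_t i = begin; i < end; ++i) {
            const Particle& p = particles[i];
            if (p.x < 0 || p.x >= width || p.y < 0 || p.y >= height) continue;
            const int tile = (p.y / tile_size) * tiles_x + p.x / tile_size;
            lists[static_cast<std::size_t>(t)][static_cast<std::size_t>(tile)].push_back(i);
        }
    });

    // Step 3 (render): thread t owns tiles t, t + num_threads, ... so no two
    // threads ever write to the same pixel.
    std::vector<float> image(static_cast<std::size_t>(width) * static_cast<std::size_t>(height), 0.0f);
    run_on_threads(num_threads, [&](int t) {
        for (int tile = t; tile < num_tiles; tile += num_threads)
            for (int owner = 0; owner < num_threads; ++owner)
                for (std::size_t i : lists[static_cast<std::size_t>(owner)][static_cast<std::size_t>(tile)]) {
                    const Particle& p = particles[i];
                    image[static_cast<std::size_t>(p.y * width + p.x)] += p.value;
                }
    });
    return image;  // Step 4 would accumulate group buffers and transfer to host
}
```

Because every tile is rendered by exactly one thread, the race condition on shared pixels is avoided without atomics or locks, at the cost of the extra prerender pass.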

(15)

Vectorization

Aiding Automatic Vectorization

Data structure organisation

Data alignment

Compiler directives

Converting from array of structures to structure of arrays provided a 10% performance boost to the rasterization phase

Data should be aligned to 64 byte boundaries on the host using _mm_malloc() to ensure offload allocation & transfers are also aligned

Use of "__assume_aligned(ptr,64)" informs the compiler that the array being worked on is aligned correctly

#pragma ivdep tells the compiler to ignore assumed vector dependencies, e.g. where it cannot prove that arrays do not overlap

Use of the compiler option -vec-reportX (where X = 0-6) provides detailed information on what has and has not been vectorized, along with suggestions as to why

The guide to auto-vectorization with Intel C++ compilers is useful at this stage
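A minimal sketch of the array-of-structures to structure-of-arrays conversion with 64-byte alignment. It uses standard C++ `alignas` on a fixed-size block instead of `_mm_malloc()` so that it compiles with any C++11 compiler. The field names and the block size `kBlockSize` are assumptions for illustration and do not reflect the actual Splotch particle layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Array-of-structures layout: the fields of one particle are adjacent.
struct ParticleAoS { float r, g, b, intensity; };

constexpr std::size_t kBlockSize = 1024;  // hypothetical block size

// Structure-of-arrays layout: each field is a contiguous array. Every array
// is 4096 bytes, so all four start on a 64-byte boundary when the struct does.
struct alignas(64) ParticleBlockSoA {
    float r[kBlockSize];
    float g[kBlockSize];
    float b[kBlockSize];
    float intensity[kBlockSize];
};

// Copies up to kBlockSize particles from AoS into SoA form.
inline std::size_t to_soa(const ParticleAoS* in, std::size_t count, ParticleBlockSoA& out) {
    const std::size_t n = count < kBlockSize ? count : kBlockSize;
    for (std::size_t i = 0; i < n; ++i) {
        out.r[i] = in[i].r;
        out.g[i] = in[i].g;
        out.b[i] = in[i].b;
        out.intensity[i] = in[i].intensity;
    }
    return n;
}

// Unit-stride loop over separate arrays: the access pattern that
// auto-vectorizers handle best.
inline void apply_intensity(ParticleBlockSoA& p, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        p.r[i] *= p.intensity[i];
        p.g[i] *= p.intensity[i];
        p.b[i] *= p.intensity[i];
    }
}

inline bool is_aligned64(const void* ptr) {
    return reinterpret_cast<std::uintptr_t>(ptr) % 64 == 0;
}
```

With the Intel compiler, the loop in `apply_intensity` is where an alignment hint such as `__assume_aligned` would be placed. It is omitted here to keep the sketch portable.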

(16)

Vectorization Cont.

Manual Vectorization

Difficult to automatically vectorize complex areas of code.

Intrinsics, mapping directly to the Intel Initial Many Core Instructions (IMCI) set, can be used to manually vectorize code.

– The rendering phase of Splotch is not amenable to automatic vectorization, due to fairly unpredictable unaligned memory access patterns.

For each particle the spread of affected pixels is calculated, then each column of pixels is rendered.

A pixel color is calculated by multiplying the color of the particle by a contributive factor. This value is then additively combined with the previous color of the affected pixel.
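A scalar sketch of the per-particle rendering loop described above. The Gaussian form of the contribution factor, `exp(-d^2 / sigma^2)`, and the `Splat` and `Color` structures are assumptions for illustration; the exact factor used by Splotch may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Color { float r, g, b; };

// Hypothetical particle description: centre, cut-off radius, Gaussian width.
struct Splat { float x, y, radius, sigma; Color color; };

inline void render_particle(const Splat& p, std::vector<Color>& image,
                            int width, int height) {
    // Spread of affected pixels, clamped to the image bounds.
    const int x_min = std::max(0, static_cast<int>(std::floor(p.x - p.radius)));
    const int x_max = std::min(width - 1, static_cast<int>(std::ceil(p.x + p.radius)));
    const int y_min = std::max(0, static_cast<int>(std::floor(p.y - p.radius)));
    const int y_max = std::min(height - 1, static_cast<int>(std::ceil(p.y + p.radius)));

    for (int x = x_min; x <= x_max; ++x) {        // one column of pixels at a time
        for (int y = y_min; y <= y_max; ++y) {
            const float dx = static_cast<float>(x) - p.x;
            const float dy = static_cast<float>(y) - p.y;
            const float d2 = dx * dx + dy * dy;
            if (d2 > p.radius * p.radius) continue;
            // Assumed Gaussian contribution factor.
            const float contribution = std::exp(-d2 / (p.sigma * p.sigma));
            Color& pixel = image[static_cast<std::size_t>(y * width + x)];
            // Additively combine with the previous pixel colour.
            pixel.r += p.color.r * contribution;
            pixel.g += p.color.g * contribution;
            pixel.b += p.color.b * contribution;
        }
    }
}
```

The inner statement is a multiply followed by an add on three colour channels, which is the operation the manual vectorization maps onto a fused multiply-add.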

(17)

Manual Vectorization Method

Step 1: Pack particle color x5 into __m512 vector container V1

V1 = R G B R G B R G B R G B R G B

_mm512_setr_ps(r,g,b,r,g,b...)

Step 2: Pack contribution value x3 per pixel into __m512 vector container V2

V2 = C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5

_mm512_setr_ps(c1,c1,c1,c2,c2...)

Step 3: Pack affected pixel colors (up to 5) into __m512 vector container V3

V3 = R G B R G B R G B R G B R G B

_mm512_setr_ps(p1.r, p1.g, p1.b, p2.r...)

Step 4: Fused Multiply-Add vectors where V3 = (V1*V2) + V3

_mm512_fmadd_ps(V1,V2,V3)

Step 5: Masked Unaligned store V3 back to memory

_mm512_mask_packstorelo_ps((void*)&dest[idx], _mask, V3)

_mm512_mask_packstorehi_ps(((void*)&dest[idx])+64, _mask, V3)

The _mask ensures the unused 16th float in the vector containers is not written to the image

int mask = 0b0111111111111111; __mmask16 _mask =
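The five steps can be emulated in portable C++ to check the lane layout and the mask without Xeon Phi hardware. This is a sketch only: `Vec16`, `fmadd` and `masked_store` are hypothetical scalar stand-ins for the 512-bit register and the IMCI intrinsics. `masked_store` writes each selected lane at its own position, which matches a pack-store only because the mask selects a contiguous run of low lanes.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<float, 16>;  // stands in for one 512-bit register

constexpr std::uint16_t kMask15 = 0x7FFF;  // same bits as 0b0111111111111111

// Step 1: particle colour repeated five times (lane 15 stays zero).
inline Vec16 pack_rgb_x5(float r, float g, float b) {
    Vec16 v{};
    for (std::size_t i = 0; i < 5; ++i) { v[3 * i] = r; v[3 * i + 1] = g; v[3 * i + 2] = b; }
    return v;
}

// Step 2: each pixel's contribution repeated three times.
inline Vec16 pack_contributions(const std::array<float, 5>& c) {
    Vec16 v{};
    for (std::size_t i = 0; i < 5; ++i) { v[3 * i] = v[3 * i + 1] = v[3 * i + 2] = c[i]; }
    return v;
}

// Step 4: lane-wise fused multiply-add, a * b + c.
inline Vec16 fmadd(const Vec16& a, const Vec16& b, const Vec16& c) {
    Vec16 out{};
    for (std::size_t i = 0; i < 16; ++i) out[i] = std::fma(a[i], b[i], c[i]);
    return out;
}

// Step 5: store only the lanes whose mask bit is set.
inline void masked_store(float* dest, std::uint16_t mask, const Vec16& v) {
    for (std::size_t lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane)) dest[lane] = v[lane];
}

// Blends one particle colour into five consecutive RGB pixels at dest.
inline void blend_five_pixels(float* dest, float r, float g, float b,
                              const std::array<float, 5>& contributions) {
    const Vec16 v1 = pack_rgb_x5(r, g, b);
    const Vec16 v2 = pack_contributions(contributions);
    Vec16 v3{};
    for (std::size_t i = 0; i < 15; ++i) v3[i] = dest[i];  // Step 3
    v3 = fmadd(v1, v2, v3);
    masked_store(dest, kMask15, v3);
}
```

Five RGB pixels occupy 15 of the 16 float lanes, so the mask clears only the top bit, and the float after the fifth pixel is left untouched.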

(18)

Offloading with MPI

Data heavy algorithms can benefit from MPI based offloading.

Multiple MPI processes run on the host, sharing a single or multiple devices.

This allows chunks of data to be allocated, transferred and processed in parallel, providing a significant performance boost.

The script to subdivide multiple devices amongst 8 tasks can be unwieldy
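The arithmetic behind such a subdivision can be sketched as below. The `Placement` structure and `place_rank` function are hypothetical, and the sketch assumes that consecutive ranks are assigned to the same device. It only computes which device and which contiguous range of hardware threads a given MPI rank would use, and does not set any runtime environment variables.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result: which device a rank offloads to and which contiguous
// range of that device's hardware threads it may use.
struct Placement { int device; int first_thread; int thread_count; };

inline Placement place_rank(int rank, int ranks_per_device, int threads_per_device) {
    const int device = rank / ranks_per_device;
    const int local = rank % ranks_per_device;   // index of the rank on its device
    const int base = threads_per_device / ranks_per_device;
    const int extra = threads_per_device % ranks_per_device;
    // The first `extra` ranks on a device receive one additional thread so
    // that every thread is assigned exactly once.
    const int count = base + (local < extra ? 1 : 0);
    const int first = local * base + std::min(local, extra);
    return Placement{device, first, count};
}
```

For example, with 8 ranks per device and 236 usable threads, the first four ranks on each device receive 30 threads and the remaining four receive 29.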

(19)

Performance Testing

Test System Specification

'Dommic' Facility at the Swiss National Supercomputing Centre

7 Nodes

Each node based on dual socket eight-core Intel Xeon 2670, running at 2.6GHz with 32 GB main system memory and two Intel Xeon Phi 5110 coprocessors available

Test Scenario

~21 million particle N-body simulation produced using the Gadget code

– 100 frame animation orbiting the dataset

– 8 host MPI processes per device, 2 thread groups of 15 threads each

– 4 OpenMP threads per available core (~236)

(20)

Results

Per-frame time for all phases: host OpenMP 1-16 cores vs single and dual Xeon Phi devices.

Per-frame time for rasterization: host OpenMP 1-16 cores vs single and dual Xeon Phi devices.

(21)

Results Cont.

Per-Frame processing time comparing MPI, OpenMP and MPI offloading to single and dual Xeon Phi devices.

(22)

Results Notes

Best performance boost is seen in the rasterization phase, with a single device outperforming 16 OpenMP threads by ~2.5x.

Use of a second device provides, as expected, a 2x performance boost in comparison to a single device.

Per-frame processing time for the current implementation with a single device is comparable to 4 host OpenMP threads, while two devices outperform 16 OpenMP threads due to non-linear scaling of the host OpenMP implementation.

In comparison with the linearly scaling MPI implementation, each

(23)

Further work

Further optimisation and tuning through use of Intel VTune

Dynamic thread grouping system

Comparison against GPU model

Exploration of MPI running on both host and device

(24)

References

Splotch Publications

1. Dolag, K., Reinecke, M., Gheller, C., Imboden, S.: Splotch: Visualizing Cosmological Simulations. New Journal of Physics, 10(12) id. 125006 (2008)

2. Jin, Z., Krokos, M., Rivi, M., Gheller, C., Dolag, K., Reinecke, M.: High-Performance Astrophysical Visualization using Splotch. Procedia Computer Science, 1(1) 1775–1784 (2010)

3. Rivi, M., Gheller, C., Dykes, T., Krokos, M., Dolag, K.: GPU Accelerated Particle Visualisation with Splotch. To appear in Astronomy and Computing (2014)

4. Dykes, T., Gheller, C., Rivi, M., Krokos, M.: Big Data Visualization on the Xeon Phi. Submitted to International Supercomputing Conference (2014)

The Splotch Code
