Big Data Visualization on the MIC
Tim Dykes
School of Creative Technologies
University of Portsmouth
Splotch Team
● Tim Dykes, University of Portsmouth
● Claudio Gheller, Swiss National Supercomputing Centre
● Marzia Rivi, University of Oxford
● Mel Krokos, University of Portsmouth
● Klaus Dolag, University Observatory Munich
Contents
● Splotch Overview
● MIC-Splotch Implementation & Optimization
● Performance Measurements
Splotch
● Ray-casting algorithm for large datasets
● Primarily for astrophysical N-body simulations
● Applicable to any data representable as point-like elements with attributes
● Particle contribution to image determined using radiative transfer equation and a Gaussian distribution function
[1] [2] [3]
[1] 3D Modelling and Visualization of Galaxies By B.Koribalsky, C.Gheller and K.Dolag (ATNF, Australia, ETH-CSCS, Switzerland, Univ. Observatory Munich, Germany)
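The Gaussian contribution of a particle to the image can be sketched as follows. This is a minimal illustration, not the Splotch source: it assumes an emission-only form of the radiative transfer equation (contributions are simply summed), a footprint cut off at a 3-sigma bounding box, and hypothetical names (`Rgb`, `Particle`, `splat`).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Rgb { float r, g, b; };

// Hypothetical particle record; field names are illustrative only.
struct Particle {
    float x, y;       // centre in pixel coordinates
    float sigma;      // Gaussian width in pixels
    Rgb color;
    float intensity;
};

// Adds one particle to a width*height image. Every pixel inside the
// 3-sigma bounding box receives color * intensity * exp(-d^2 / (2 sigma^2)).
void splat(const Particle& p, std::vector<Rgb>& image, int width, int height) {
    assert(p.sigma > 0.0f);
    assert(image.size() ==
           static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    const float radius = 3.0f * p.sigma;
    const int x0 = std::max(0, static_cast<int>(std::floor(p.x - radius)));
    const int x1 = std::min(width - 1, static_cast<int>(std::ceil(p.x + radius)));
    const int y0 = std::max(0, static_cast<int>(std::floor(p.y - radius)));
    const int y1 = std::min(height - 1, static_cast<int>(std::ceil(p.y + radius)));
    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            const float dx = static_cast<float>(x) - p.x;
            const float dy = static_cast<float>(y) - p.y;
            const float w = p.intensity *
                std::exp(-(dx * dx + dy * dy) / (2.0f * p.sigma * p.sigma));
            Rgb& px = image[static_cast<std::size_t>(y) * width + x];
            px.r += w * p.color.r;   // additive combination with the previous colour
            px.g += w * p.color.g;
            px.b += w * p.color.b;
        }
    }
}
```

Because many particles can add to the same pixel, a parallel version of this loop has to prevent concurrent writes to one pixel.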
Notable Challenges for Parallel Implementation
● Load balancing for data spread unevenly throughout the image
– High concentration of particles in small area
– Low concentration of particles spread across large area
● Potential race conditions due to single pixels affected by many elements
Motivations for MIC-Splotch
● Accelerator/coprocessor usage in HPC
● Exploitation of all available hardware
● New architecture
MIC Architecture
● PCIe SMP-on-a-chip
● Up to 61 cores
● Up to 244 HW threads
● 512-bit wide SIMD
● Up to 8GB GDDR5
MIC Architecture cont.
Parallel Programming Models
– OpenMP
– Intel Cilk Plus
– MPI
– Pthreads
Processing Models Available
– Native
● Cross compile source to run directly on device
– Offload
● LEO (Language Extensions for Offload)
– Symmetric
● Use each coprocessor as a node in an MPI cluster, or subdivide the device to contain a series of MPI nodes
Optimization Methods
● Address memory allocation and transfer issues
● Thread management
● Automatic vectorization
Memory Allocation
Problem:
Dynamic allocation is slow
Mitigation Advice:
– Avoid dynamic allocations: allocate memory at the start of the program and reuse it throughout, rather than deleting and reallocating
– Pre-allocate large buffers
– Allocate with large pages: MIC_USE_2MB_BUFFERS=64K enables use of 2MB page sizes for any allocations over 64K. This can also reduce page faults and translation look-aside buffer (TLB) misses
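The "allocate once, reuse throughout" advice can be illustrated with a small host-side sketch. The `ScratchBuffer` class is hypothetical and only demonstrates the principle in standard C++; it does not show the offload allocation itself or the effect of MIC_USE_2MB_BUFFERS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch buffer: storage is reserved once, sized for the
// largest chunk, and then handed out again for every chunk.
class ScratchBuffer {
public:
    explicit ScratchBuffer(std::size_t max_elements) {
        storage_.reserve(max_elements);   // the only allocation
    }

    // Returns n usable floats. Requests never exceed the reserved capacity,
    // so resize() cannot reallocate and the pointer stays stable.
    float* acquire(std::size_t n) {
        assert(n <= storage_.capacity());
        storage_.resize(n);
        return storage_.data();
    }

    std::size_t capacity() const { return storage_.capacity(); }

private:
    std::vector<float> storage_;
};
```

A caller would construct one `ScratchBuffer` at start-up and call `acquire()` per chunk, instead of pairing `new[]` and `delete[]` inside the processing loop.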
Double buffered computation
● Particles processed in chunks
● Computation overlapped with transfer
● Reduced transfer times
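The double buffering pattern can be sketched as below. This is an illustration under stated assumptions, not the offload code used in MIC-Splotch: `transfer_chunk` and `process_chunk` are hypothetical stand-ins for the host-to-device copy and the rendering work, and the overlap is expressed with OpenMP sections. When compiled without OpenMP support the pragmas are ignored and the two steps run one after the other, with the same result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for a transfer: copies data[begin, end) into a staging buffer.
void transfer_chunk(const std::vector<float>& data, std::size_t begin,
                    std::size_t end, std::vector<float>& staging) {
    staging.assign(data.begin() + begin, data.begin() + end);
}

// Stand-in for the computation on one chunk.
double process_chunk(const std::vector<float>& staging) {
    return std::accumulate(staging.begin(), staging.end(), 0.0);
}

double process_double_buffered(const std::vector<float>& data, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<float> buffers[2];
    double total = 0.0;
    int current = 0;
    std::size_t offset = std::min(chunk, data.size());
    transfer_chunk(data, 0, offset, buffers[current]);   // prime the first buffer

    while (!buffers[current].empty()) {
        const std::size_t next_end = std::min(offset + chunk, data.size());
        double partial = 0.0;
        // One section fills the idle buffer while the other processes the
        // buffer that is already full.
        #pragma omp parallel sections
        {
            #pragma omp section
            transfer_chunk(data, offset, next_end, buffers[1 - current]);
            #pragma omp section
            partial = process_chunk(buffers[current]);
        }
        total += partial;
        offset = next_end;
        current = 1 - current;   // swap the roles of the two buffers
    }
    return total;
}
```

Only two staging buffers are ever allocated, so the memory footprint stays fixed no matter how many chunks are processed.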
Multithreaded Rendering
Split threads into groups. Create a full image buffer for each group. Split images into tiles T0 → TN.
Step 1 - Allocate: each thread (ThreadN) is allocated a subset of the particle data (Particle Subset N) and a tile list (Tile_list_N) with one entry per tile T0 ... TN
Step 2 - Prerender: each thread generates a list of particle indices per tile for the subset of particle data allocated to it
Step 3 - Render: each thread renders all particles from all lists (Tile_list_0 → N) for one particular tile
Step 4: Image buffers accumulated and transferred back to host
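The prerender and render steps above can be sketched as follows. This is a simplified illustration with hypothetical names (`Point`, `render_tiled`): it assumes a single thread group, so there is one image buffer and no accumulation step, and each particle touches exactly one pixel, whereas a real particle can span several pixels and tiles. The loops are marked with OpenMP pragmas; without OpenMP support they run serially and produce the same image.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { int x, y; float value; };   // hypothetical one-pixel particle

std::vector<float> render_tiled(const std::vector<Point>& particles, int width,
                                int height, int tile_size, int num_threads) {
    assert(width > 0 && height > 0 && tile_size > 0 && num_threads > 0);
    const int tiles_x = (width + tile_size - 1) / tile_size;
    const int tiles_y = (height + tile_size - 1) / tile_size;
    const int num_tiles = tiles_x * tiles_y;

    // Step 1: one list of particle indices per (thread, tile) pair.
    std::vector<std::vector<std::vector<std::size_t>>> lists(
        num_threads, std::vector<std::vector<std::size_t>>(num_tiles));

    // Step 2 (prerender): thread t bins its own contiguous particle subset.
    // Each thread writes only to lists[t], so no locking is needed.
    #pragma omp parallel for
    for (int t = 0; t < num_threads; ++t) {
        const std::size_t begin = particles.size() * static_cast<std::size_t>(t) / num_threads;
        const std::size_t end = particles.size() * static_cast<std::size_t>(t + 1) / num_threads;
        for (std::size_t i = begin; i < end; ++i) {
            const Point& p = particles[i];
            if (p.x < 0 || p.x >= width || p.y < 0 || p.y >= height) continue;
            lists[t][(p.y / tile_size) * tiles_x + p.x / tile_size].push_back(i);
        }
    }

    // Step 3 (render): each tile is handled by exactly one loop iteration,
    // which walks the lists of every thread for that tile. No two
    // iterations write to the same pixel.
    std::vector<float> image(static_cast<std::size_t>(width) * height, 0.0f);
    #pragma omp parallel for
    for (int tile = 0; tile < num_tiles; ++tile) {
        for (int t = 0; t < num_threads; ++t) {
            for (std::size_t i : lists[t][tile]) {
                const Point& p = particles[i];
                image[static_cast<std::size_t>(p.y) * width + p.x] += p.value;
            }
        }
    }
    return image;
}
```

Giving each tile a single owner during rendering removes the race on pixels that are hit by many particles. The cost is the extra binning pass in step 2.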
Vectorization
Aiding Automatic Vectorization
– Data structure organisation
Converting from array of structures to structure of arrays provided a 10% performance boost to the rasterization phase
– Data alignment
Data should be aligned to 64 byte boundaries on host using _mm_malloc() to ensure offload allocation & transfers are also aligned
– Compiler directives
Use of the “__assume_aligned(ptr,64)” directive informs the compiler that the array being worked on is aligned correctly. #pragma ivdep tells the compiler to ignore assumed data dependencies in a loop, e.g. where the arrays involved do not overlap each other
Use of the compiler option -vec-reportX (where X = 0-6) provides detailed information on what has and has not been vectorized, along with suggestions as to why
The guide to auto-vectorization with Intel C++ compilers is useful at this stage
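A structure-of-arrays layout with 64-byte aligned storage might look like the sketch below. It is an assumption-laden illustration: `std::aligned_alloc` (C++17) stands in for `_mm_malloc()` so that the code stays portable, the Intel-specific `__assume_aligned` and `#pragma ivdep` hints are left out, and `ParticlesSoA` with its fields is a hypothetical layout rather than the Splotch data structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

struct FreeDeleter {
    void operator()(float* p) const { std::free(p); }
};
using AlignedFloats = std::unique_ptr<float[], FreeDeleter>;

// Allocates n floats on a 64-byte boundary. std::aligned_alloc requires the
// size to be a multiple of the alignment, so the byte count is rounded up.
AlignedFloats make_aligned(std::size_t n) {
    std::size_t bytes = ((n * sizeof(float) + 63) / 64) * 64;
    if (bytes == 0) bytes = 64;
    float* p = static_cast<float*>(std::aligned_alloc(64, bytes));
    assert(p != nullptr);
    return AlignedFloats(p);
}

// Structure of arrays: each attribute is contiguous in memory, so a loop
// over one attribute reads unit-stride data.
struct ParticlesSoA {
    std::size_t count;
    AlignedFloats x, y, intensity;
    explicit ParticlesSoA(std::size_t n)
        : count(n), x(make_aligned(n)), y(make_aligned(n)),
          intensity(make_aligned(n)) {}
};

// A simple unit-stride loop of the kind a compiler can auto-vectorize.
void scale_intensity(ParticlesSoA& p, float factor) {
    float* intensity = p.intensity.get();
    for (std::size_t i = 0; i < p.count; ++i) {
        intensity[i] *= factor;
    }
}
```

In the array-of-structures form the same loop would step over `x` and `y` to reach each `intensity`, which is the strided access that the conversion removes.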
Vectorization Cont.
Manual Vectorization
– Complex areas of code are difficult to vectorize automatically.
– Intrinsics, mapping directly to the Intel Initial Many Core Instructions (IMCI) set, can be used to manually vectorize code.
– The rendering phase of Splotch is not amenable to automatic vectorization, due to fairly unpredictable unaligned memory access patterns.
For each particle the spread of affected pixels is calculated, then each column of pixels is rendered.
A pixel color is calculated by multiplying the color of the particle by a contributive factor. This value is then additively combined with the previous color of the affected pixel.
Manual Vectorization Method
Step 1: Pack particle color x5 into __m512 vector container V1
V1: R G B R G B R G B R G B R G B
_mm512_setr_ps(r,g,b,r,g,b...)
Step 2: Pack contribution value x3 per pixel into __m512 vector container V2
V2: C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5
_mm512_setr_ps(c1,c1,c1,c2,c2...)
Step 3: Pack affected pixel colors (up to 5) into __m512 vector container V3
V3: R G B R G B R G B R G B R G B
_mm512_setr_ps(p1.r, p1.g, p1.b, p2.r...)
Step 4: Fused Multiply-Add vectors where V3 = (V1*V2) + V3
_mm512_fmadd_ps(V1,V2,V3)
Step 5: Masked unaligned store of V3 back to memory
int mask = 0b0111111111111111; __mmask16 _mask =
_mm512_mask_packstorelo_ps((void*)&dest[idx], _mask, V3)
_mm512_mask_packstorehi_ps(((void*)&dest[idx])+64, _mask, V3)
The _mask ensures the unused 16th float in the vector containers is not written to the image
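The lane layout of the five steps can be checked with a portable scalar model. The sketch below does not use IMCI intrinsics and is not the Splotch kernel: plain arrays of 16 floats stand in for the vector containers, and `blend_five_pixels` is a hypothetical name. It assumes `dest` points at five consecutive pixels stored as interleaved R, G, B floats.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the 16-lane method: lanes 0-14 hold five RGB pixels,
// lane 15 is unused and must never be written back.
void blend_five_pixels(float r, float g, float b, const float contrib[5],
                       float* dest) {
    assert(dest != nullptr && contrib != nullptr);
    const std::uint16_t mask = 0x7FFF;   // 0b0111111111111111: lane 15 off
    const float rgb[3] = {r, g, b};
    float v1[16], v2[16], v3[16];

    for (int lane = 0; lane < 15; ++lane) {
        v1[lane] = rgb[lane % 3];        // Step 1: particle colour x5
        v2[lane] = contrib[lane / 3];    // Step 2: contribution x3 per pixel
        v3[lane] = dest[lane];           // Step 3: current pixel colours
    }
    v1[15] = v2[15] = v3[15] = 0.0f;     // padding lane

    for (int lane = 0; lane < 16; ++lane) {
        v3[lane] = v1[lane] * v2[lane] + v3[lane];   // Step 4: multiply-add
    }
    for (int lane = 0; lane < 16; ++lane) {
        if (mask & (1u << lane)) {
            dest[lane] = v3[lane];       // Step 5: masked store
        }
    }
}
```

With the mask in place, the float that follows the fifth pixel in memory (the first component of a sixth pixel) keeps its old value.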
Offloading with MPI
● Data heavy algorithms can benefit from MPI based offloading.
● Multiple MPI processes run on the host, sharing a single or multiple devices.
● This allows chunks of data to be allocated, transferred and processed in parallel, providing a significant performance boost.
The script to subdivide multiple devices amongst 8 tasks can be unwieldy
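The arithmetic behind subdividing devices amongst host MPI tasks can be sketched as a small function. Everything here is an assumption made for illustration: `Placement` and `place_rank` are hypothetical names, ranks are assigned to devices in contiguous blocks, and each device's hardware threads are split evenly between the ranks that share it. No MPI or offload calls are shown.

```cpp
#include <cassert>

struct Placement {
    int device;         // index of the coprocessor this rank offloads to
    int first_thread;   // first hardware thread of the rank's slice
    int thread_count;   // number of hardware threads in the slice
};

// Maps a host MPI rank to a device and a slice of that device's threads.
// Assumes the number of ranks is a multiple of the number of devices.
Placement place_rank(int rank, int num_ranks, int num_devices,
                     int threads_per_device) {
    assert(num_devices > 0 && num_ranks >= num_devices);
    assert(num_ranks % num_devices == 0);
    assert(rank >= 0 && rank < num_ranks);
    const int ranks_per_device = num_ranks / num_devices;
    const int slice = threads_per_device / ranks_per_device;
    assert(slice > 0);
    const int local = rank % ranks_per_device;   // position within the device
    return Placement{rank / ranks_per_device, local * slice, slice};
}
```

In practice each rank would turn its `Placement` into the device number and thread affinity settings for its offload region.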
Performance Testing
Test System Specification
– 'Dommic' Facility at the Swiss National Supercomputing Centre
– 7 Nodes
– Each node based on dual socket eight-core Intel Xeon 2670, running at 2.6GHz with 32 GB main system memory and two Intel Xeon Phi 5110 coprocessors available
Test Scenario
– ~21 Million particle N-Body simulation produced using the Gadget code.
– 100 frame animation orbiting the dataset
– 8 host MPI processes per device, 2 thread groups of 15 threads each
– 4 OpenMP threads per available core (~236)
Results
Per-Frame time for all phases: host OpenMP 1-16 cores vs single and dual Xeon Phi devices.
Per-Frame time for rasterization: host OpenMP 1-16 cores vs single and dual Xeon Phi devices.
Results Cont.
Per-Frame processing time comparing MPI, OpenMP and MPI offloading to single and dual Xeon Phi devices.
Results Notes
● Best performance boost is seen in the rasterization phase, with a single device outperforming 16 OpenMP threads by ~2.5x.
● Use of a second device provides, as expected, a 2x performance boost in comparison to a single device.
● Per-frame processing time for the current implementation on a single device is comparable to 4 host OpenMP threads, while two devices outperform 16 OpenMP threads due to non-linear scaling of the host OpenMP implementation.
● In comparison with the linearly scaling MPI implementation, each
Further work
● Further optimisation and tuning through use of Intel VTune
● Dynamic thread grouping system
● Comparison against GPU model
● Exploration of MPI running on both host and device
References
Splotch Publications
1. Dolag, K., Reinecke, M., Gheller, C., Imboden, S.: Splotch: Visualizing Cosmological Simulations. New Journal of Physics, 10(12) id. 125006 (2008)
2. Jin, Z., Krokos, M., Rivi, M., Gheller, C., Dolag, K., Reinecke, M.: High-Performance Astrophysical Visualization using Splotch. Procedia Computer Science, 1(1) 1775–1784 (2010)
3. Rivi, M., Gheller, C., Dykes, T., Krokos, M., Dolag, K.: GPU Accelerated Particle Visualisation with Splotch. To appear in Astronomy and Computing (2014)
4. Dykes, T., Gheller, C., Rivi, M., Krokos, M.: Big Data Visualization on the Xeon Phi. Submitted to International Supercomputing Conference (2014)