Porting a Legacy Global Lagrangian PIC Code
on Many-Core and GPU-Accelerated Architectures
PASC Conference, Basel
July 3rd, 2018
Noé Ohana (1), Andreas Jocksch (2), Emmanuel Lanti (1), Aaron Scheinberg (1), Stephan Brunner (1), Claudio Gheller (2), Laurent Villard (1)
(1) SPC, Lausanne (2) CSCS, Lugano
Acknowledgments: Alberto Bottino, Ben McMillan, Alexey Mishchenko, Thomas Hayward
This work has been carried out within the framework of the EUROfusion Consortium and has received funding
Outline
1. Introduction
2. Refactoring
3. Numerical performance
   Single node
   Multinode
4. Conclusion
Motivation
Problem:
Development timescale of legacy code ORB5 [Tran, 1999, Jolliet, 2009] exceeds HPC architecture evolution timescale
Adopted strategy:
Disentangle numerical kernels and physical modules ⇒ modularization
More and more cores per compute node ⇒ trade multitasking for multithreading
Keep portability ⇒ directive-based approach (OpenMP and OpenACC)
Develop testbed code with fundamental kernels (PASC project)
Recipe
Timeline:
[Figure: timeline from 2014 to 2018 showing the usual ORB5 development (new physics), the PASC project, the ORB5 refactoring, and the "big merge"]
Tools: Profiler
ORB5, a multi-physics code...
Global gyrokinetic PIC code:
Equations derived from variational formulation with consistent ordering [Tronko, 2017]
Ad-hoc or MHD equilibrium
Electromagnetic perturbations
Different field solver options (long wavelength approximation, Padé approximation, or all orders)
Different gyro-averaging options (⟨∇φ⟩ or ∇⟨φ⟩)
Multiple gyrokinetic species
Fixed or adaptive number of Larmor points
Drift-kinetic, adiabatic or hybrid electrons
Inter- and intra-species collisions
Heat sources
Strong flows
... with various numerical features
Numerical features:
Variational formulation of field equations with finite element B-splines up to third order
Control variate schemes
Electromagnetic cancellation problem in Ampère's law solved with enhanced control variate or pullback scheme [Mishchenko, 2014]
Runge-Kutta time integrator of fourth order
Magnetic coordinates, straight-field-line
Field-aligned Fourier filter
Noise control (Krook operator, coarse graining, quadtree)
Original parallelization: 2 levels of MPI (domain decomposition and cloning)
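As an illustration of the third-order (cubic) B-splines mentioned above, here is a minimal sketch of the four spline weights seen by a point on a uniform 1D grid. The function name, the uniform spacing `dx`, and the 1D setting are assumptions made for this example; it is not taken from ORB5.

```python
import math

def cubic_bspline_weights(x, dx=1.0):
    """Return (i, w) for a point x on a uniform grid of spacing dx.

    i is the index of the cell containing x, and w holds the weights of the
    four cubic B-splines centred on nodes i-1, i, i+1, i+2 (assumed layout).
    """
    s = x / dx
    i = math.floor(s)
    t = s - i  # fractional position inside the cell, 0 <= t < 1
    w = (
        (1.0 - t) ** 3 / 6.0,
        (3.0 * t**3 - 6.0 * t**2 + 4.0) / 6.0,
        (-3.0 * t**3 + 3.0 * t**2 + 3.0 * t + 1.0) / 6.0,
        t**3 / 6.0,
    )
    return i, w

i, w = cubic_bspline_weights(2.5)
print(i, sum(w))  # the four weights always sum to 1 (partition of unity)
```

The same weights can serve both for depositing a particle onto the grid and for interpolating a field back to the particle position.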
MPI parallelization scheme
1D domain decomposition plus domain cloning:
Reduce/broadcast operations between clones
All-to-all (parallel data transpose and particle move) and neighbor (guard cells) communications between subdomains
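The two MPI levels above can be pictured as a 2D layout of ranks. The sketch below is only an illustration: the rank ordering (clones of one subdomain on consecutive ranks), the periodic neighbors, and all function names are assumptions, and no MPI library is used.

```python
def rank_layout(rank, n_subdomains, n_clones):
    """Map a flat rank to (subdomain, clone), assuming clones are contiguous."""
    if not 0 <= rank < n_subdomains * n_clones:
        raise ValueError("rank out of range")
    return rank // n_clones, rank % n_clones

def clone_group(subdomain, n_clones):
    """Ranks holding copies of the same subdomain: reduce/broadcast partners."""
    return [subdomain * n_clones + c for c in range(n_clones)]

def subdomain_group(clone, n_subdomains, n_clones):
    """Ranks with the same clone index: all-to-all and guard-cell partners."""
    return [s * n_clones + clone for s in range(n_subdomains)]

def neighbors(subdomain, n_subdomains):
    """Left and right subdomains, assuming a periodic 1D decomposition."""
    return (subdomain - 1) % n_subdomains, (subdomain + 1) % n_subdomains

# Example with 4 subdomains and 3 clones (12 ranks in total).
print(rank_layout(7, 4, 3))      # (2, 1)
print(clone_group(2, 3))         # [6, 7, 8]
print(subdomain_group(1, 4, 3))  # [1, 4, 7, 10]
print(neighbors(0, 4))           # (3, 1)
```

In a real MPI code each group would typically become its own communicator, so that reductions between clones and exchanges between subdomains do not interfere.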
Key modifications prior to multithreading
Data structures
Structure of arrays instead of arrays of structures
Pack variables for host-to-device memory transfers
Modularization
Stages of a time step (diagram linking guiding center data and field data):
deposit
solve
get fields
push
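The data-structure change listed above (structure of arrays, packed variables) can be sketched as follows. The attribute names `r`, `z`, `w` and the helper functions are hypothetical, and a plain Python `array` stands in for the contiguous buffer that would be sent to a device in one transfer.

```python
from array import array

def aos_to_soa(particles, fields):
    """Turn a list of per-particle records into one contiguous array per attribute."""
    return {name: array("d", (p[name] for p in particles)) for name in fields}

def pack(soa, fields):
    """Concatenate the selected attribute arrays into a single buffer.

    Returns the buffer and the start offset of each attribute, so that one
    transfer can replace several small ones.
    """
    buf = array("d")
    offsets = {}
    for name in fields:
        offsets[name] = len(buf)
        buf.extend(soa[name])
    return buf, offsets

def unpack(buf, offsets, n, name):
    """Recover one attribute array of length n from the packed buffer."""
    start = offsets[name]
    return buf[start:start + n]

# Array of structures: one record per particle (hypothetical attributes).
aos = [
    {"r": 0.1, "z": 0.0, "w": 1.0},
    {"r": 0.2, "z": 0.5, "w": 0.5},
    {"r": 0.3, "z": 1.0, "w": 2.0},
]
soa = aos_to_soa(aos, ("r", "z", "w"))
buf, offsets = pack(soa, ("r", "w"))
print(list(unpack(buf, offsets, len(aos), "w")))  # [1.0, 0.5, 2.0]
```

With the structure-of-arrays layout, a loop over one attribute touches contiguous memory, which is what vector units and GPU threads favor.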
Stages of a time step with an intermediate Larmor ring data layer (diagram linking guiding center data, Larmor ring data, and field data):
build Larmor
deposit
solve
get fields
gyroaverage
push
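The modular stages above can be sketched as one function per stage, each mapping one data layer to the next. This is a 1D periodic toy, not ORB5: two Larmor points per particle, linear weights instead of B-splines, a placeholder local "solve", and a simple explicit push are all assumptions chosen to keep the example short.

```python
import math

def build_larmor(x, rho, length):
    """Guiding centers -> Larmor points (two per particle in this 1D toy)."""
    return [((xi - ri) % length, (xi + ri) % length) for xi, ri in zip(x, rho)]

def _linear_weights(pos, dx, n):
    """Left node, right node, and weight on the right node (periodic grid)."""
    s = pos / dx
    j = math.floor(s)
    return j % n, (j + 1) % n, s - j

def deposit(larmor, weights, dx, n):
    """Larmor points -> grid density, each point carrying an equal share."""
    dens = [0.0] * n
    for pts, w in zip(larmor, weights):
        share = w / len(pts)
        for p in pts:
            j, k, t = _linear_weights(p, dx, n)
            dens[j] += share * (1.0 - t)
            dens[k] += share * t
    return dens

def solve(dens):
    """Grid density -> potential; placeholder relation phi = n - mean(n)."""
    mean = sum(dens) / len(dens)
    return [d - mean for d in dens]

def get_fields(phi, larmor, dx):
    """Grid potential -> field E = -dphi/dx interpolated at each Larmor point."""
    n = len(phi)
    e_grid = [-(phi[(j + 1) % n] - phi[(j - 1) % n]) / (2.0 * dx) for j in range(n)]
    out = []
    for pts in larmor:
        vals = []
        for p in pts:
            j, k, t = _linear_weights(p, dx, n)
            vals.append((1.0 - t) * e_grid[j] + t * e_grid[k])
        out.append(vals)
    return out

def gyroaverage(e_larmor):
    """Larmor point fields -> one averaged field per guiding center."""
    return [sum(v) / len(v) for v in e_larmor]

def push(x, e_gc, dt, length):
    """Advance guiding centers with a toy explicit step x += dt * E."""
    return [(xi + dt * ei) % length for xi, ei in zip(x, e_gc)]

def time_step(x, rho, weights, n, length, dt):
    dx = length / n
    larmor = build_larmor(x, rho, length)
    dens = deposit(larmor, weights, dx, n)
    phi = solve(dens)
    e_gc = gyroaverage(get_fields(phi, larmor, dx))
    return push(x, e_gc, dt, length), dens

new_x, dens = time_step([1.0, 5.5], [0.25, 0.5], [1.0, 2.0], n=8, length=8.0, dt=0.1)
print(sum(dens))  # total deposited weight equals the sum of particle weights
```

Because each stage only reads one layer and writes another, a stage can be threaded or offloaded on its own without touching the others.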
MPI+OpenMP
Single node simulations with 10^6 ions (4 Larmor points each), 10^6 electrons,
and 128 × 32 × 4 cubic splines, on Broadwell CPU (2 × 18 cores, 2.1 GHz)
[Figure: stacked bars of wall clock time per step (s), split into Build Larmor, Deposit, Field solve, Get field, Gyro-average, Push, Diagnostics, and Other modules, for configurations ranging from full MPI to full OpenMP. Totals per configuration:]
Configuration                            Time per step (s)   Relative to full MPI
36 clones, 1 thread (full MPI)           0.72                100%
18 clones, 2 threads                     0.70                97%
12 clones, 3 threads                     0.69                96%
9 clones, 4 threads                      0.78                108%
6 clones, 6 threads                      0.65                90%
4 clones, 9 threads                      0.63                87%
3 clones, 12 threads                     0.78                108%
2 clones, 18 threads                     0.64                88%
1 clone, 36 threads                      0.87                121%
2 clones, 36 threads (hyperthreading)    0.57                79%
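The clone and thread counts scanned above are the factor pairs of the 36 physical cores of the node. A minimal sketch that enumerates them and recomputes a percentage relative to full MPI follows; the function names are hypothetical, and since the totals shown on the slide are rounded, a recomputed percentage can differ from the printed one by about a point.

```python
def hybrid_configs(cores):
    """All (clones, threads) pairs with clones * threads == cores."""
    return [(c, cores // c) for c in range(1, cores + 1) if cores % c == 0]

def relative_percent(t, t_ref):
    """Time per step t as a rounded percentage of the reference time t_ref."""
    return round(100.0 * t / t_ref)

for clones, threads in hybrid_configs(36):
    print(f"{clones} clones x {threads} threads")

# Hyperthreaded run (0.57 s) against full MPI (0.72 s), both from the slide.
print(relative_percent(0.57, 0.72))  # 79
```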
MPI+OpenACC
Single node simulations with 10^6 ions (4 Larmor points each), 10^6 electrons,
and 128 × 32 × 4 cubic splines, on Haswell CPU (12 cores, 2.6 GHz)
[Figure: stacked bars of wall clock time per step (s), split into Build Larmor, Deposit, Field solve, Get field, Gyro-average, Push, Diagnostics, and Other modules. Totals per configuration:]
Configuration                               Time per step (s)   Relative to full MPI (Intel)
Full MPI, Intel compiler (12 clones)        1.4                 100%
Full OpenMP, Intel compiler (24 threads)    1.2                 80%
Full MPI, PGI compiler (12 clones)          1.7                 117%
MPI+OpenACC, PGI compiler (1 GPU)           0.54                37%
PGI compiler ∼20% slower than Intel
MPI+OpenACC version ∼2.7 times faster than MPI only, and ∼2.2 times faster than MPI+OpenMP
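The speed-up factors quoted above follow from the percentages printed next to the bars. A short worked check is given below; the variable names are hypothetical, and the reading that "MPI only" refers to the Intel full-MPI run is an inference from the numbers.

```python
def speedup(percent_ref, percent_new):
    """How many times faster the new run is than the reference run."""
    return percent_ref / percent_new

# Percentages of the full-MPI Intel time, as shown on the slide.
full_mpi_intel = 100.0
full_openmp_intel = 80.0
full_mpi_pgi = 117.0
mpi_openacc_pgi = 37.0

print(round(speedup(full_mpi_intel, mpi_openacc_pgi), 1))     # 2.7
print(round(speedup(full_openmp_intel, mpi_openacc_pgi), 1))  # 2.2
print(round(speedup(full_mpi_pgi, mpi_openacc_pgi), 1))       # 3.2
```

The last line compares the GPU run with full MPI built with the same PGI compiler, which gives a larger factor than the comparison against the Intel build.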
Architecture comparisons
Different physics (electro-static or -magnetic, adiabatic or kinetic electrons, adaptive number of Larmor points, ...) with similar particle and cell resolutions
[Figure: stacked bars of wall clock time per step (s) for the ITG, SAW, and TEM cases, split into Build Larmor, Deposit, Field solve, Get field, Gyro-average, Push, Diagnostics, and Other modules. Totals per case and architecture:]
Case   Architecture                              Time per step (s)   Relative to full MPI on Haswell (∗)
ITG    Full MPI on Haswell (reference)           0.95                100%
ITG    Best CPU Haswell                          0.71                75%
ITG    Best CPU Broadwell                        0.4                 42%
ITG    Best KNL                                  1.1                 113%
ITG    CPU Haswell + GPU                         0.31                33%
SAW    Full MPI on Haswell (reference)           2.7                 100%
SAW    Best CPU Haswell                          2.2                 84%
SAW    Best CPU Broadwell                        1                   38%
SAW    Best KNL                                  2.3                 86%
SAW    CPU Haswell + GPU                         1.3                 48%
TEM    Full MPI on Haswell (reference)           1.4                 100%
TEM    Best CPU Haswell                          1.2                 80%
TEM    Best CPU Broadwell                        0.6                 41%
TEM    Best KNL                                  1.3                 89%
TEM    CPU Haswell + GPU                         0.54                37%
(∗ percentages are relative to full MPI on Haswell, whose bar is not shown)
KNL (64 cores, 1.3 GHz) performance similar to Haswell CPU
Weaker GPU performance for SAW due to control variate iterations on CPU
GPU performance similar to Broadwell CPU
Application to production run
ITG production run on 256 nodes with 256 · 10^6 ions (4 Larmor points each)
and 256 × 512 × 256 cubic splines
[Figure: stacked bars of wall clock time per step (s), split into Build Larmor, Deposit, Field solve, Get field, Gyro-average, Push, Parallel move, Diagnostics, and Other modules. Totals per configuration:]
Configuration                                            Time per step (s)   Relative to full MPI
Full MPI, BEFORE REFACTORING (12 clones per node)        4.4                 230%
Full MPI (12 clones per node)                            1.9                 100%
MPI+OpenMP (24 threads per node)                         1.8                 96%
MPI+OpenACC (1 GPU per node)                             1.3                 68%
Timings of communication-bound stages (field solver and parallel move) become non-negligible.
Multithreaded versions bring less speed-up than on a single node because they do not improve inter-node communication.
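The last remark can be illustrated with an Amdahl-style estimate: if only the non-communication part of a step is accelerated, the communication time sets a floor on the time per step. This is a crude sketch under stated assumptions, not a model used by the authors; the function names are hypothetical, and the communication times tried in the loop are made-up values, not measurements. Only the 1.9 s total and the single-node factor of 2.7 come from the slides.

```python
def accelerated_time(t_total, t_comm, kernel_speedup):
    """Time per step when only the non-communication part is sped up."""
    if not 0.0 <= t_comm <= t_total:
        raise ValueError("t_comm must lie between 0 and t_total")
    return t_comm + (t_total - t_comm) / kernel_speedup

def overall_speedup(t_total, t_comm, kernel_speedup):
    """Resulting speed-up of the whole step."""
    return t_total / accelerated_time(t_total, t_comm, kernel_speedup)

t_full_mpi = 1.9      # s per step, full MPI on 256 nodes (from the slide)
kernel_speedup = 2.7  # single-node MPI+OpenACC factor (from the slide)

# Hypothetical communication times per step, for illustration only.
for t_comm in (0.0, 0.5, 1.0):
    s = overall_speedup(t_full_mpi, t_comm, kernel_speedup)
    print(f"t_comm = {t_comm:.1f} s -> overall speed-up {s:.2f}")
```

The larger the share of a step spent in communication, the further the overall speed-up falls below the kernel speed-up, whatever the accelerator.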
Conclusion
Achievements:
Single code making efficient use of different architectures
Good compromise between efficiency, ergonomics, and "future-proofness" (flexibility to accommodate new physics and new architectures)
Future work:
Turn on sorting when particle-to-field operations are significant
Reduce amount of communications in solver
Bibliography
T.M. Tran, K. Appert, M. Fivaz, G. Jost, J. Vaclavik and L. Villard
Global gyrokinetic simulation of ion-temperature-gradient-driven instabilities using particles
Theory of Fusion Plasmas, Int. Workshop (Editrice Compositori, Bologna, Societa Italiana di Fisica), 45, 1999
S. Jolliet
Gyrokinetic particle-in-cell global simulations of ion-temperature-gradient and collisionless-trapped-electron-mode turbulence in tokamaks
PhD thesis, EPFL, 2009
A. Mishchenko, A. Könies, R. Kleiber, and M. Cole
Pullback transformation in gyrokinetic electromagnetic simulations
Physics of Plasmas, 21, 2014
N. Tronko, A. Bottino, C. Chandre and E. Sonnendruecker
Hierarchy of second order gyrokinetic Hamiltonian models for particle-in-cell codes