Porting a Legacy Global Lagrangian PIC Code on Many-Core and GPU-Accelerated Architectures

(1)

Porting a Legacy Global Lagrangian PIC Code

on Many-Core and GPU-Accelerated Architectures

PASC Conference, Basel

July 3rd_{, 2018}

July 3rd_{, 2018} Noé Ohana

Noé Ohana1, Andreas JockschAndreas Jocksch2,Emmanuel LantiEmmanuel Lanti 1, Aaron ScheinbergAaron Scheinberg1, Stephan Brunner

Stephan Brunner1,Claudio GhellerClaudio Gheller2,Laurent VillardLaurent Villard1

1_{SPC, Lausanne}_{SPC, Lausanne} 2_{CSCS, Lugano}_{CSCS, Lugano}

Acknowledgments: Alberto Bottino, Ben McMillan, Alexey Mishchenko, Thomas Hayward

Acknowledgments: Alberto Bottino, Ben McMillan, Alexey Mishchenko, Thomas Hayward This work has been carried out within the framework of the EUROfusion Consortium and has received funding

(2)

Outline

1. Introduction 2. Refactoring 3. Results 4. Conclusion

1. Introduction 2. Refactoring 3. Numerical performance Single node Multinode 4. Conclusion

(3)

Motivation

Problem:

Development timescale of legacy code ORB5 [Tran, 1999, Jolliet, 2009] exceeds HPC architecture evolution timescale

Adopted strategy:

Disentangle numerical kernels and physical modules ⇒ modularization More and more cores per compute node ⇒ trade multitasking for multithreading

Keep portability ⇒ directive-based approach (OpenMP and OpenACC) Develop testbed code with fondamental kernels (PASC project)

(4)

Motivation

Problem:

Development timescale of legacy code ORB5 [Tran, 1999, Jolliet, 2009] exceeds HPC architecture evolution timescale

Adopted strategy:

Disentangle numerical kernels and physical modules ⇒ modularization More and more cores per compute node ⇒ trade multitasking for multithreading

Keep portability ⇒ directive-based approach (OpenMP and OpenACC) Develop testbed code with fondamental kernels (PASC project)

(5)

Recipe

Timeline:

2014 2015 2016 2017 2018

usual ORB5 development (new physics) ”big merge” ORB5 refactoring

PASC project

Tools: Profiler

(6)

ORB5, a multi-physics code...

Global gyrokinetic PIC code:

Equations derived from variational formulation with consistent ordering [Tronko, 2017]

Ad-hoc or MHD equilibrium Electromagnetic perturbations

Different field solver options (long wavelength approximation, Padé approximation, or all orders)

Different gyro-averaging options (h∇φi or ∇hφi) Multiple gyrokinetic species

Fixed or adaptive number of Larmor points Drift-kinetic, adiabatic or hybrid electrons Inter- and intra- species collisions

Heat sources Strong flows

(7)

... with various numerical features

Numerical features:

Variational formulation of field equations with finite element B-splines up to third order

Control variate schemes

Electromagnetic cancellation problem in Ampère’s law solved with enhanced control variate or pullback scheme [Mishchenko, 2014] Runge-Kutta of fourth order time integrator

Magnetic coordinates, straight-field-line Field-aligned Fourier filter

Noise control (Krook operator, coarse graining, quadtree)

Original parallelization: 2 levels of MPI (domain decomposition and cloning)

(8)

MPI parallelization scheme

1D domain decomposition plus domain cloning:

Reduce/broadcast operations between clones

All-to-all (parallel data transpose and particle move) and neighboor (guard cells) communications between subdomains

(9)

Key modifications prior to multithreading

Data structures

Structure of arrays instead of arrays of structure Pack variables for host-to-device memory transfers

Modularization

Stages of a time step:

Guiding center data Field data

deposit

solve

get fields push

(10)

Key modifications prior to multithreading

Data structures

Structure of arrays instead of arrays of structure Pack variables for host-to-device memory transfers Modularization

Guiding center data Field data

deposit

solve

get fields push

(11)

Key modifications prior to multithreading

Data structures

Structure of arrays instead of arrays of structure Pack variables for host-to-device memory transfers Modularization

Guiding center data Larmor ring data Field data

build Larmor deposit

solve

get fields gyroaverage

(12)

MPI+OpenMP

1. Introduction 2. Refactoring 3. Results 3.1 Single node 3.2 Multinode 4. Conclusion

Single node simulations with 106_{ions (4 Larmor points each), 10}6 _electrons,

and 128 × 32 × 4 cubic splines, on Broadwell CPU (2 × 18 cores, 2.1GHz)

0 0.2 0.4 0.6 0.8 1

36 clones, 1 thread 0.14 0.053 0.27 0.14 0.72

Full MPI

Wall clock time per step (s)

Build Larmor Deposit Field solve Get field

Gyro-average Push Diagnostics

Other modules

(13)

MPI+OpenMP

0 0.2 0.4 0.6 0.8 1 1 clone, 36 threads 2 clones, 18 threads 3 clones, 12 threads 4 clones, 9 threads 6 clones, 6 threads 9 clones, 4 threads 12 clones, 3 threads 18 clones, 2 threads 36 clones, 1 thread 0.18 0.13 0.15 0.12 0.14 0.14 0.14 0.14 0.14 0.054 0.061 0.054 0.054 0.055 0.054 0.055 0.054 0.054 0.053 0.35 0.26 0.31 0.25 0.26 0.3 0.26 0.27 0.27 0.16 0.12 0.17 0.12 0.12 0.17 0.14 0.14 0.14 0.72 0.7 (97%) 0.69 (96%) 0.78 (108%) 0.65 (90%) 0.63 (87%) 0.78 (108%) 0.64 (88%) 0.87 (121%) Full MPI Full OpenMP

Other modules

(14)

MPI+OpenMP

0 0.2 0.4 0.6 0.8 1 2 clones, 18 threads 4 clones, 9 threads 6 clones, 6 threads 12 clones, 3 threads 18 clones, 2 threads 36 clones, 1 thread 0.18 0.13 0.15 0.12 0.14 0.14 0.14 0.14 0.14 0.054 0.054 0.055 0.054 0.054 0.054 0.053 0.26 0.25 0.26 0.26 0.27 0.27 0.12 0.12 0.12 0.14 0.14 0.14 0.72 0.7 (97%) 0.69 (96%) 0.65 (90%) 0.63 (87%) 0.64 (88%) Full MPI Full OpenMP

Other modules

(15)

MPI+OpenMP

0 0.2 0.4 0.6 0.8 1 2 clones, 36 threads 2 clones, 18 threads 4 clones, 9 threads 6 clones, 6 threads 12 clones, 3 threads 18 clones, 2 threads 36 clones, 1 thread 0.099 0.18 0.13 0.15 0.12 0.14 0.14 0.14 0.14 0.14 0.054 0.054 0.055 0.054 0.054 0.054 0.053 0.23 0.26 0.25 0.26 0.26 0.27 0.27 0.11 0.12 0.12 0.12 0.14 0.14 0.14 0.72 0.7 (97%) 0.69 (96%) 0.65 (90%) 0.63 (87%) 0.64 (88%) 0.57 (79%) Full MPI Full OpenMP Hyperthreading

Other modules

(16)

MPI+OpenMP

0 0.2 0.4 0.6 0.8 1 2 clones, 36 threads 2 clones, 18 threads 4 clones, 9 threads 6 clones, 6 threads 12 clones, 3 threads 18 clones, 2 threads 36 clones, 1 thread 0.099 0.18 0.13 0.15 0.12 0.14 0.14 0.14 0.14 0.14 0.054 0.054 0.055 0.054 0.054 0.054 0.053 0.23 0.26 0.25 0.26 0.26 0.27 0.27 0.11 0.12 0.12 0.12 0.14 0.14 0.14 0.72 0.7 (97%) 0.69 (96%) 0.65 (90%) 0.63 (87%) 0.64 (88%) 0.57 (79%) Full MPI Full OpenMP Hyperthreading

Other modules

(17)

MPI+OpenACC

and 128 × 32 × 4 cubic splines, on Haswell CPU (12 cores, 2.6GHz)

0 0.5 1 1.5

Full OpenMP, Intel compiler (24 threads) Full MPI, Intel compiler (12 clones) 0.21 0.28 0.12 0.15 0.1 0.49 0.59 0.17 0.24 1.4 1.2 (80%)

Other modules

PGI compiler ∼20% slower than Intel

MPI+OpenACC version ∼2.7 times faster than MPI only, and ∼2.2 times faster than MPI+OpenMP

(18)

MPI+OpenACC

0 0.5 1 1.5

Full MPI, PGI compiler (12 clones) Full OpenMP, Intel compiler (24 threads) Full MPI, Intel compiler (12 clones) 0.3 0.21 0.28 0.12 0.2 0.12 0.15 0.1 0.1 0.67 0.49 0.59 0.27 0.17 0.24 1.4 1.2 (80%) 1.7 (117%)

Other modules

(19)

MPI+OpenACC

0 0.5 1 1.5

Full MPI, PGI compiler (12 clones) Full OpenMP, Intel compiler (24 threads) Full MPI, Intel compiler (12 clones) 0.3 0.21 0.28 0.12 0.2 0.12 0.15 0.1 0.1 0.67 0.49 0.59 0.27 0.17 0.24 1.4 1.2 (80%) 1.7 (117%)

Other modules

(20)

MPI+OpenACC

0 0.5 1 1.5

MPI+OpenACC, PGI compiler (1 GPU) Full MPI, PGI compiler (12 clones) Full OpenMP, Intel compiler (24 threads) Full MPI, Intel compiler (12 clones) 0.12 0.3 0.21 0.28 0.12 0.2 0.12 0.15 0.1 0.1 0.21 0.67 0.49 0.59 0.27 0.17 0.24 1.4 1.2 (80%) 1.7 (117%) 0.54 (37%)

Other modules

(21)

MPI+OpenACC

0 0.5 1 1.5

MPI+OpenACC, PGI compiler (1 GPU) Full MPI, PGI compiler (12 clones) Full OpenMP, Intel compiler (24 threads) Full MPI, Intel compiler (12 clones) 0.12 0.3 0.21 0.28 0.12 0.2 0.12 0.15 0.1 0.1 0.21 0.67 0.49 0.59 0.27 0.17 0.24 1.4 1.2 (80%) 1.7 (117%) 0.54 (37%)

Other modules

(22)

Architecture comparisons

Different physics (electro-static or -magnetic, adiabatic or kinetic electrons, adaptive number of Larmor points, ...) with similar particle and cell resolutions

0 0.5 1 1.5 2 2.5

TEM

CPU Haswell + GPU Best KNL Best CPU Broadwell Best CPU Haswell

0.12 0.19 0.21 0.12 0.21 0.55 0.23 0.49 0.27 0.12 0.17 1.2 (80%∗ ) 0.6 (41%) 1.3 (89%) 0.54 (37%)

(∗ percentages are relative to full MPI on Haswell, not shown)

Other modules

KNL (64 cores, 1.3GHz) performance similar to Haswell CPU

Weaker GPU performance for SAW due to control variate iterations on CPU GPU performance similar to Broadwell CPU

(23)

Architecture comparisons

0 0.5 1 1.5 2 2.5

TEM

CPU Haswell + GPU Best KNL Best CPU Broadwell Best CPU Haswell

0.12 0.19 0.21 0.12 0.21 0.55 0.23 0.49 0.27 0.12 0.17 1.2 (80%∗ ) 0.6 (41%) 1.3 (89%) 0.54 (37%)

(∗ percentages are relative to full MPI on Haswell, not shown)

Other modules

(24)

Architecture comparisons

0 0.5 1 1.5 2 2.5

ITG SAW TEM

CPU Haswell + GPU Best KNL Best CPU Broadwell Best CPU Haswell CPU Haswell + GPU Best KNL Best CPU Broadwell Best CPU Haswell CPU Haswell + GPU Best KNL Best CPU Broadwell Best CPU Haswell

0.16 0.16 0.21 0.16 0.18 0.13 0.23 0.26 0.12 0.19 0.21 0.28 0.73 0.8 0.34 0.78 1 0.24 0.15 0.34 0.18 0.49 0.59 0.12 0.15 0.26 0.23 0.26 0.24 0.67 0.26 0.59 0.64 0.21 0.55 0.23 0.49 0.59 0.24 0.15 0.12 0.27 0.12 0.17 0.24 1.4 1.2 (80%∗ ) 0.6 (41%) 1.3 (89%) 0.54 (37%) 2.7 2.2 (84%) 1 (38%) 2.3 (86%) 1.3 (48%) 0.95 0.71 (75%) 0.4 (42%) 1.1 (113%)

0.31 (33%) (∗ percentages are relative to full MPI on Haswell, not shown)

Other modules

(25)

Architecture comparisons

0 0.5 1 1.5 2 2.5

ITG SAW TEM

CPU Haswell + GPU Best KNL Best CPU Broadwell Best CPU Haswell CPU Haswell + GPU Best KNL Best CPU Broadwell Best CPU Haswell CPU Haswell + GPU Best KNL Best CPU Broadwell Best CPU Haswell

0.16 0.16 0.21 0.16 0.18 0.13 0.23 0.26 0.12 0.19 0.21 0.28 0.73 0.8 0.34 0.78 1 0.24 0.15 0.34 0.18 0.49 0.59 0.12 0.15 0.26 0.23 0.26 0.24 0.67 0.26 0.59 0.64 0.21 0.55 0.23 0.49 0.59 0.24 0.15 0.12 0.27 0.12 0.17 0.24 1.4 1.2 (80%∗ ) 0.6 (41%) 1.3 (89%) 0.54 (37%) 2.7 2.2 (84%) 1 (38%) 2.3 (86%) 1.3 (48%) 0.95 0.71 (75%) 0.4 (42%) 1.1 (113%)

0.31 (33%) (∗ percentages are relative to full MPI on Haswell, not shown)

Other modules

(26)

Application to production run

ITG production run on 256 nodes with 256 · 106 _{ions (4 Larmor points each)}

and 256 × 512 × 256 cubic splines

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5

MPI+OpenACC (1 GPU per node) MPI+OpenMP (24 threads per node) Full MPI (12 clones per node) Full MPI, BEFORE REFACTORING (12 clones per node)

0.17 0.24 0.61 0.49 0.59 0.26 0.31 0.22 0.37 0.17 0.2 0.19 4.4 4.4 (230%) 1.9 (100%) 1.8 (96%) 1.3 (68%)

Build Larmor Deposit Field solve Get field Gyro-average Push Parallel move Diagnostics Other modules

Timings of communication-bound stages (field solver and parallel move) become non-negligible.

Multithreaded versions bring less speed-up than on sinle node because they do not improve inter-node communication.

(27)

Application to production run

ITG production run on 256 nodes with 256 · 106 _{ions (4 Larmor points each)}

and 256 × 512 × 256 cubic splines

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5

MPI+OpenACC (1 GPU per node) MPI+OpenMP (24 threads per node) Full MPI (12 clones per node) Full MPI, BEFORE REFACTORING (12 clones per node)

0.17 0.24 0.61 0.49 0.59 0.26 0.31 0.22 0.37 0.17 0.2 0.19 4.4 4.4 (230%) 1.9 (100%) 1.8 (96%) 1.3 (68%)

Build Larmor Deposit Field solve Get field Gyro-average Push Parallel move Diagnostics Other modules

Timings of communication-bound stages (field solver and parallel move) become non-negligible.

Multithreaded versions bring less speed-up than on sinle node because they do not improve inter-node communication.

(28)

Conclusion

Achievements:

Single code making efficient use of different architectures

Good compromise between efficiency, ergonomy, and ”future-proofness” (flexibility to accomodate for new physics and new architectures) Future work:

Turn on sorting when particle-to-field operations are significant Reduce amount of communications in solver

(29)

Bibliography

T.M. Tran, K. Appert,M. Fivaz, G. Jost, J. Vaclavik and L. Villard

Global gyrokinetic simulation of ion-temperature-gradient-driven instabilities using particles

Theory of Fusion Plasmas, Int. Workshop (Editrice Compositori, Bologna, Societa Italiana di Fisica), 45, 1999

S. Jolliet

Gyrokinetic particle-in-cell global simulations of ion-temperature-gradient and collisionless-trapped-electron-mode turbulence in tokamaks

PhD thesis, EPFL, 2009

A. Mishchenko, A. Könies, R. Kleiber, and M. Cole

Pullback transformation in gyrokinetic electromagnetic simulations Physics of Plasmas, 21, 2014

N. Tronko, A. Bottino, C. Chandre and E. Sonnendruecker

Hierarchy of second order gyrokinetic Hamiltonian models for particle-in-cell codes