Parallel computing and benchmarking


4.2.4 Parallel computing and benchmarking

It is often useful to separate a Vegas integration into two phases: warmup and production. During the warmup phase we construct a grid that adapts to the shape of the integrand without storing any results, and during the production phase we freeze the grid and generate the desired output. Splitting the integration into these two phases enables a number of optimisations and keeps statistical distortions from unoptimised grids out of the final result.

During the production phase, we are only interested in generating final output from as many statistically independent iterations as possible, using the same grid. The number of points (N) required to produce statistically sound results, however, might be too large for a computer to handle in one go. Thanks to the techniques introduced in Section 4.1.2 we can break an N-point iteration into k sub-iterations of N/k points, which are afterwards fused into one pseudorun.
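The fusion of sub-iterations into a pseudorun can be illustrated with a minimal Python sketch. It assumes that each sub-iteration stores only its number of points, its sum of weights and its sum of squared weights. The function names are hypothetical, plain uniform sampling stands in for the frozen grid, and the actual procedure of Section 4.1.2 may differ in its details.

```python
import math
import random

def run_subiteration(f, n_points, rng):
    """Sample n_points uniformly on [0, 1) and return the raw sums."""
    s1 = s2 = 0.0
    for _ in range(n_points):
        w = f(rng.random())
        s1 += w
        s2 += w * w
    return n_points, s1, s2

def merge_subiterations(parts):
    """Fuse sub-iterations into one pseudorun: (estimate, standard error)."""
    n = sum(p[0] for p in parts)
    s1 = sum(p[1] for p in parts)
    s2 = sum(p[2] for p in parts)
    mean = s1 / n
    var_of_mean = (s2 / n - mean * mean) / (n - 1)
    return mean, math.sqrt(max(var_of_mean, 0.0))

rng = random.Random(1)
f = lambda x: 3.0 * x * x                  # toy integrand, exact integral is 1
parts = [run_subiteration(f, 2500, rng) for _ in range(4)]   # k = 4, N = 10000
estimate, error = merge_subiterations(parts)
```

Because only sums are stored, the merged result agrees with a single N-point iteration on the same random stream up to floating-point rounding, while each sub-iteration needs only N/k points in memory at a time.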

Since no information needs to be shared between different production iterations, we can run multiple replicas of NNLOjet on different CPUs, machines or clusters and combine the results afterwards. In that respect, the production phase is a solved issue: obtaining more precise results and histograms simply requires consuming more resources. The technicalities of running production in a grid system are further detailed in the documentation of pyHepGrid in Appendix D.

Figure 4.3: Study of the performance evolution of Vegas as a function of the number of threads for a VBF Real integration with minimal cuts. Default corresponds to the typical implementation of Vegas based on OpenMP. Experimental removes some restrictions on the capacity of OpenMP for parallelisation. Tested on an Intel(R) Xeon(R) Gold G6130, 64 physical cores.
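Combining independent production runs, as described above, is commonly done in Vegas-style integrators with an inverse-variance weighted average. The sketch below assumes that each replica reports an estimate and a standard error. Whether NNLOjet combines its runs in exactly this way is an assumption, and the input numbers are purely illustrative.

```python
import math

def combine_runs(results):
    """Inverse-variance weighted average of (estimate, error) pairs."""
    weights = [1.0 / (err * err) for _, err in results]
    total_w = sum(weights)
    mean = sum(w * est for w, (est, _) in zip(weights, results)) / total_w
    error = math.sqrt(1.0 / total_w)
    # chi^2 per degree of freedom flags mutually inconsistent runs
    dof = len(results) - 1
    chi2 = sum(w * (est - mean) ** 2 for w, (est, _) in zip(weights, results))
    return mean, error, (chi2 / dof if dof > 0 else 0.0)

runs = [(1.02, 0.02), (0.99, 0.01), (1.00, 0.02)]   # illustrative numbers only
mean, error, chi2_dof = combine_runs(runs)
```

The combined error is never larger than the smallest individual error, which is why adding further replicas always improves the precision of the result.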

The warmup phase, on the other hand, is more restrictive than the production phase: most notably, each iteration needs information from the previous one in order to adapt the grid, so the workers need to share memory. Several solutions to the problem of multithreaded programming exist. As a first step we implement the OpenMP standard so that the main task of the event generator is shared between a given number of threads. The number of threads can be selected by the user through the environment variable OMP_NUM_THREADS, with each thread reserving OMP_STACKSIZE MB of memory. At the end of each warmup iteration all threads are synchronised and the grid is adapted using the combined information from all of them.
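The structure of such a warmup iteration (independent sampling, a synchronisation point, then a single grid adaptation) can be sketched in Python for a one-dimensional grid. This is an illustration under simplifying assumptions and not the NNLOjet code. Python threads stand in for OpenMP threads and show only the control flow, not the speed-up. The adaptation uses a simple power-law damping (exponent 0.5) in place of the damping function of Vegas, and all function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sample_chunk(f, edges, n_points, seed):
    """One thread's share of an iteration: per-bin sums of squared weights."""
    rng = random.Random(seed)
    nbins = len(edges) - 1
    sums = [0.0] * nbins
    for _ in range(n_points):
        b = rng.randrange(nbins)              # every bin is equally likely
        width = edges[b + 1] - edges[b]
        x = edges[b] + rng.random() * width
        w = f(x) * width * nbins              # integrand / sampling density
        sums[b] += w * w
    return sums

def adapt(edges, sums, alpha=0.5):
    """Move the edges so each new bin holds an equal share of the damped sums."""
    nbins = len(sums)
    damped = [(s + 1e-30) ** alpha for s in sums]   # offset guards empty bins
    target = sum(damped) / nbins
    new_edges, acc, b = [edges[0]], 0.0, 0
    for i in range(1, nbins):
        goal = i * target
        while b < nbins - 1 and acc + damped[b] < goal:
            acc += damped[b]
            b += 1
        frac = min(max((goal - acc) / damped[b], 0.0), 1.0)
        new_edges.append(edges[b] + frac * (edges[b + 1] - edges[b]))
    new_edges.append(edges[-1])
    return new_edges

def warmup(f, nbins=20, iterations=5, n_threads=4, points_per_thread=5000):
    edges = [i / nbins for i in range(nbins + 1)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for it in range(iterations):
            seeds = [1000 * it + t for t in range(n_threads)]
            chunks = list(pool.map(
                lambda s: sample_chunk(f, edges, points_per_thread, s), seeds))
            # Synchronisation point: combine all threads, then adapt once.
            total = [sum(col) for col in zip(*chunks)]
            edges = adapt(edges, total)
    return edges

peak = lambda x: 1.0 / ((x - 0.5) ** 2 + 1e-3)   # toy integrand peaked at 0.5
grid = warmup(peak)
```

After a few iterations the bins cluster around the peak at x = 0.5. The only data exchanged between threads is the per-bin array combined at the synchronisation point.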

In Fig. 4.3 we study the performance gain when using OpenMP for a warmup run. We compare the real time (that is, the wall-clock time elapsed between the start and the end of the process) of a naive implementation of the OpenMP standard with a more aggressive, still experimental, implementation which requires some changes to the NNLOjet code. The naive implementation is akin to the one found in widely used programs such as MCFM [115].

It can be seen that for the default (naive) implementation the performance gain saturates: beyond 16 threads almost no gain is observed, and the performance actually degrades once more than ∼ 25 cores are used. For the experimental implementation, on the other hand, we observe gains for any number of CPUs and find a penalty only when using almost twice as many threads as the machine has physical cores. Once we enable as many threads as the machine has physical cores, we enter the hyperthreading region. In this region the performance gain is much more modest (and even negative when too many threads are active and the program competes with the operating system for resources).

The difference between the default and experimental implementations of Vegas is mainly due to the use of “critical” blocks, regions of code that are forced to run sequentially. The “experimental” implementation bypasses all these blocks of sequential code, so that performance scales better with the number of threads. The only trade-off of the experimental implementation with respect to the default one is a memory usage that is roughly 10% higher in the benchmarks. NNLOjet can be compiled with these experimental features by using the compile flag critical=off.
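The saturation caused by sequential regions is qualitatively what Amdahl's law predicts. As an idealised illustration (not a fit to Fig. 4.3), if a fraction s of the work runs sequentially, for instance inside critical blocks, the speed-up S(n) obtained with n threads is bounded by

```latex
S(n) = \frac{T(1)}{T(n)} = \frac{1}{s + \dfrac{1 - s}{n}} \le \frac{1}{s}
```

where T(n) is the wall-clock time with n threads. For example, s = 0.05 gives S(16) ≈ 9.1 and can never exceed 20, whatever the number of threads. This model only describes saturation. An actual loss of performance at high thread counts additionally requires an overhead that grows with n, such as contention for the critical blocks.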

Another drawback of OpenMP is that parallelisation is limited to one single memory-sharing node or CPU. For processes with many particles in the final state, this is often insufficient to warm up a grid to stability in a reasonable amount of time. As in the production phase, we would ideally be able to run our warmup across different independent nodes, synchronising the results at the end of every iteration before the grid is adapted.

Since the adaptation process only requires knowledge of the value of the integral in each subvolume after the iteration finishes, it follows that we only need a way to share this information (an array of numbers) at the end of every iteration between different NNLOjet instances in order to use multiple nodes and speed up the runtime.

In the Vegas implementation of NNLOjet, we share the information between the independent instances using TCP sockets. At the end of every iteration all separate instances of NNLOjet pause and synchronise their information with a central server through these sockets. We use standard Unix libraries, so the only requirement is that the target system has a network connection able to communicate with the central server. This solution allows us to parallelise NNLOjet within a single node (via OpenMP) and across independent resources at the same time, and relying only on standard Unix libraries means that it should work on any target system that provides them.
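The synchronisation pattern described above can be sketched in Python with the standard socket module. This is a minimal illustration and not the NNLOjet implementation: the wire format (an array of doubles in network byte order), the function names and the use of threads on localhost to mimic separate instances are all assumptions made for the example.

```python
import socket
import struct
import threading

def recv_exact(conn, n_bytes):
    """Read exactly n_bytes from a TCP connection."""
    buf = b""
    while len(buf) < n_bytes:
        chunk = conn.recv(n_bytes - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def serve_one_iteration(server_sock, n_instances, n_bins):
    """Central server: collect one array per instance, sum, send the total back."""
    fmt = f"!{n_bins}d"                    # n_bins doubles, network byte order
    size = struct.calcsize(fmt)
    conns, total = [], [0.0] * n_bins
    for _ in range(n_instances):
        conn, _addr = server_sock.accept()
        part = struct.unpack(fmt, recv_exact(conn, size))
        total = [t + p for t, p in zip(total, part)]
        conns.append(conn)
    payload = struct.pack(fmt, *total)
    for conn in conns:                     # replies go out only after all reported
        conn.sendall(payload)
        conn.close()

def synchronise(host, port, local_sums):
    """Instance side: send local per-bin sums, block until the total arrives."""
    fmt = f"!{len(local_sums)}d"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack(fmt, *local_sums))
        return list(struct.unpack(fmt, recv_exact(sock, struct.calcsize(fmt))))

# Demo on localhost: three "instances" run as threads.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))              # port 0: let the OS pick a free port
server.listen()
port = server.getsockname()[1]
srv_thread = threading.Thread(target=serve_one_iteration, args=(server, 3, 4))
srv_thread.start()

local = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], [2.0, 0.0, 1.0, 0.0]]
combined = [None] * 3

def instance(i):
    combined[i] = synchronise("127.0.0.1", port, local[i])

workers = [threading.Thread(target=instance, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()
srv_thread.join()
server.close()
```

Every instance blocks in synchronise until the server has heard from all of them, so all instances resume with the same combined array and therefore adapt to the same grid.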

NNLOjet is compiled with socket support with the use of the compile flag sockets=true.

In document Next-to-Next-to-Leading Order QCD Corrections to Higgs Boson Production in Association with two Jets in Vector Boson Fusion (Page 117-120)