Likwid Interface - Towards a discipline of performance engineering : lessons learned from stenc

eters, and a string serving as a description. A project is a container for the methods used (and implemented) to solve a specific problem.

6.2.2 Methods

A method is defined as the algorithmic way to solve a problem. In computational sciences, at the lowest level, it is a code, written in any programming language, that, while running on a system, tries to find a solution for the problem/project it refers to. Since it must run on a computer system, it needs to go through several stages, from being written and tested to being compiled and then executed.

The complexity related to its execution can vary from the basic need of a compiler (e.g., GCC or the Intel compiler) to the need of a complete chain of software that must be installed and present in the environment of the system while the code is run. For instance, in computational molecular dynamics, scientists often use a software package called GROMACS to run their experiments. In such a scenario, GROMACS must be available when running a simulation. All the software needed by a certain method is registered and stored in a method descriptor, together with the method name and a short description. The software entries in such a descriptor represent software modules, by default Lmod modules. For each software package used, PROVA! stores the building and installing recipes, in the form of easyconfig files, used by EasyBuild [48,70,69].

Between the abstraction of Method, namely the code one wants to run, and the software needed for it to run, namely an Lmod module, there is the abstraction of methodType. It binds the method, the required software counterpart, and the way the user code must be compiled and executed (in the form of compilation and run scripts).
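The relationship between a method, its methodType, and the required software modules can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, and file names below are hypothetical and are not taken from the PROVA! code base; only the `module load` command is standard Lmod usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareModule:
    """One software dependency, made available as an Lmod module."""
    name: str        # module name, e.g. "GCC"
    easyconfig: str  # EasyBuild recipe file (hypothetical file name)


@dataclass(frozen=True)
class MethodType:
    """Binds the required software to the compile and run scripts."""
    name: str
    modules: tuple
    compile_script: str  # hypothetical script name
    run_script: str      # hypothetical script name


@dataclass(frozen=True)
class Method:
    """The user code, with its name, description, and methodType."""
    name: str
    description: str
    method_type: MethodType

    def module_load_lines(self):
        # One "module load" line per registered software module.
        return [f"module load {m.name}" for m in self.method_type.modules]


# Illustrative values only; they do not describe a real PROVA! project.
openmp_type = MethodType(
    name="OpenMP",
    modules=(SoftwareModule("GCC", "GCC.eb"),),
    compile_script="compile.sh",
    run_script="run.sh",
)
blur = Method("blur_openmp", "Naive OpenMP Gaussian blur", openmp_type)
print(blur.module_load_lines())
```

Keeping the software list on the methodType rather than on the method mirrors the text above: several methods can share one way of being compiled and executed.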

Acting this way, the whole research from the creation of a project until the execution of an experiment is self-documented. All the configuration files are stored and could be used by an independent researcher to recreate the state of the system at each step, even without using PROVA!, to reproduce an experiment.

6.3 Likwid Interface

To analyze and interpret the results, it is crucial to trust and to be able to reproduce them. PROVA! manages the software stack, but other parameters can affect the measurements, such as the way the threads are assigned to the available resources.

[Figure: bar chart "Performance Comparison of Project: blur_FGCS17", Parameters (WIDTH HEIGHT): 1024 1024. x-axis: Number of Threads (1, 2, 4, 6, 8, 12, 16, 24, 32); y-axis: GFlop/s (0 to 50). Implemented Methods: OpenMP_none, OpenMP_node, OpenMP_spread, OpenMP_fill.]

Figure 6.6: Performance graph of a 2D Gaussian blur, with naive OpenMP implementation both with and without explicit pinning on Mint, taken from [59]. The histogram shows the average value out of 5 executions, and the error bars the standard deviation.

As stated in [130], and described in Section 2.3.1, thread/process affinity is vital for performance. Correct pinning is even more important on processors supporting SMT, where hardware threads share resources on a single core. Likwid [130] requires no code changes and offers a portable approach to the pinning problem. Thus, Likwid [112] has been integrated into PROVA!, which delegates to it the explicit pinning of the threads to the cores. The pinning is performed through three possible strategies, namely ByNode, ByFilling, and BySpreading. These strategies exploit the abstractions of logical nodes and locality domains (in particular the socket) offered by Likwid, when present at the hardware level.

Section 9.1.3 and Section 9.1.4 show the output of likwid-topology, which graphically presents the architectural details of a compute node.


The cores available on a socket have a physical and a logical id that do not always match. The pinning strategy ByNode binds a thread to each of the logical nodes, by increasing id. In the computer systems available at the time of writing, a node is usually composed of two or more sockets, each hosting a microprocessor. Even inside a single microprocessor, as shown in Section 2.3.1, several locality domains can be present. The pinning strategy ByFilling binds the threads to the processors socket by socket: it first assigns threads to the processors of a single socket and, once each processor on that socket has received its thread, moves to the next available socket. The strategy BySpreading follows a complementary approach: it balances the threads over the available sockets, distributing them equally. Listing 6.1 shows an example of how each of the three strategies translates into a likwid command when executing mycode with 4 threads.

    # byNode
    likwid-pin -c N:0-3 ./mycode
    # byFilling
    likwid-pin -c S0:0-3 ./mycode
    # bySpreading
    likwid-pin -c S0:0-1@S1:0-1 ./mycode

Listing 6.1: Example of how the pinning strategies defined by PROVA! translate into a likwid command.
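The mapping from a strategy and a thread count to a likwid-pin expression can be sketched as follows. This is an illustrative reconstruction, not the PROVA! implementation: the function names are hypothetical, the defaults (2 sockets, 16 cores per socket) are taken from the Mint node description, and the handling of thread counts that do not divide evenly across sockets is an assumption.

```python
def _core_range(count):
    # Logical ids 0..count-1 within a domain, written as a likwid range.
    return "0" if count == 1 else f"0-{count - 1}"


def pin_expression(strategy, threads, sockets=2, cores_per_socket=16):
    """Build the expression passed to likwid-pin -c (sketch)."""
    if threads < 1 or threads > sockets * cores_per_socket:
        raise ValueError("thread count does not fit on the node")

    if strategy == "byNode":
        # Logical ids of the whole node, by increasing id.
        return f"N:{_core_range(threads)}"

    if strategy == "byFilling":
        # Fill one socket completely before moving to the next.
        counts, remaining = [], threads
        for _ in range(sockets):
            take = min(remaining, cores_per_socket)
            counts.append(take)
            remaining -= take
    elif strategy == "bySpreading":
        # Distribute threads as equally as possible over the sockets;
        # leftover threads go to the lowest-numbered sockets (assumption).
        base, extra = divmod(threads, sockets)
        counts = [base + (1 if s < extra else 0) for s in range(sockets)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Socket domains are chained with '@', as in Listing 6.1.
    return "@".join(
        f"S{s}:{_core_range(n)}" for s, n in enumerate(counts) if n > 0
    )


def pin_command(strategy, threads, binary="./mycode"):
    return f"likwid-pin -c {pin_expression(strategy, threads)} {binary}"


for name in ("byNode", "byFilling", "bySpreading"):
    print(pin_command(name, 4))
```

For 4 threads, the three calls reproduce the commands of Listing 6.1.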

Figure 6.6, taken from [59], presents a test case of a 2-dimensional Gaussian blur applied to a 1024x1024 grid of single-precision (float) values. All the histograms represent the performance of the same code; only the pinning strategy varies, from no explicit pinning (suffix none) to pinning with each of the previously described strategies. The testbed is the Mint cluster at the University of Basel, whose nodes are dual-socket AMD Opteron 6274 “Bulldozer” systems with a nominal clock speed of 2.2 GHz and 16 cores per chip.
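Each bar and error bar in Figure 6.6 summarizes 5 executions by their average and standard deviation. A minimal sketch of that aggregation is shown below. The sample values are made up for illustration and are not measurements from [59], and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize(samples):
    """Return (average, sample standard deviation) of repeated runs."""
    return mean(samples), stdev(samples)


# Made-up GFlop/s values for 5 runs of one configuration.
runs = [10.0, 12.0, 11.0, 9.0, 13.0]
average, deviation = summarize(runs)
print(f"{average:.2f} +/- {deviation:.2f} GFlop/s")
```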

When the goal is to characterize different systems, one cannot rely on the OS for managing the threads but must define the thread/core affinity explicitly.

[Figure: roofline plot for "AMD Opteron(TM) Processor 6274". x-axis: actual flop:byte ratio (0.125 to 64); y-axis: attainable GFlop/sec (8 to 512). Plotted kernels: OpenMP_node, PATUS_none, PLUTO-pet_spread.]

Figure 6.7: Roofline for the Mint cluster with the three kernels implementing a 3D wave equation. The description of the experiment and the discussion of the results have been published in [59].
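The axes of Figure 6.7 follow the standard roofline model, in which the attainable performance is the minimum of the peak floating-point rate and the product of memory bandwidth and flop:byte ratio. The sketch below illustrates that bound. The peak and bandwidth numbers are placeholders chosen for easy arithmetic; they are not the values measured on Mint.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flop_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flop_per_byte)


# Placeholder machine parameters (NOT the Mint values).
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = 20.0

# Below the ridge point (peak / bandwidth) a kernel is memory-bound.
ridge = PEAK_GFLOPS / BANDWIDTH_GBS

for ratio in (0.125, 0.5, 2.0, 8.0, 32.0):
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ratio)
    regime = "memory-bound" if ratio < ridge else "compute-bound"
    print(f"flop:byte {ratio:>6}: {bound:6.1f} GFlop/s ({regime})")
```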

In document Towards a discipline of performance engineering : lessons learned from stencil kernel benchmarks (Page 77-80)