Network Subsystem - Towards a discipline of performance engineering : lessons learned from sten

[Plots for Figure 2.12, two panels. x-axis: number of threads (5 to 40); y-axis: performance in GFLOP/s; series: 200, 500 and 1000 timesteps.]

Figure 2.12: Performance of a 2-dimensional isotropic box stencil with constant coefficients and radius 1, a grid size of $10000^2$ and NUMA-unaware initialization. On the left, the performance for a varying number of iterations on an Intel Xeon E5-2695 with Cluster on Die mode enabled and NUMA balancing disabled. On the right, the performance of the same code on the same machine, but with NUMA balancing enabled.

results on the right (most recent execution) show how the performance of said stencil varies while increasing the number of iterations. In Figure 2.12, on the left, the performance stops scaling at six threads (cores), due to the saturation of the socket. PROVA!, as described in Chapter 5, stores contextual information about the experiment: not only the software used but also details of the machine on which the code executes. Inspecting these files revealed a difference in a single parameter, namely the value of the NUMA balancing setting. An application generally performs best when the threads of its processes access memory on the same NUMA node on which they are scheduled. Automatic NUMA balancing moves tasks (which can be threads or processes) closer to the memory they are accessing. It also moves the application data to memory closer to the tasks that reference it. Further investigation showed that the machine had been restarted and the setting had reverted to its default value of 1.
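As a minimal sketch of how such a setting could be captured alongside an experiment, the snippet below reads the Linux sysctl file `/proc/sys/kernel/numa_balancing` and compares the recorded value between two runs. The function names and the dictionary layout are hypothetical and chosen only for illustration; this is not how PROVA! stores its contextual information.

```python
from pathlib import Path

# Linux exposes the automatic NUMA balancing switch (kernel.numa_balancing) here.
NUMA_BALANCING_PATH = Path("/proc/sys/kernel/numa_balancing")


def read_numa_balancing(path=NUMA_BALANCING_PATH):
    """Return the integer NUMA balancing setting, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux system, kernel built without NUMA) or unparsable.
        return None


def settings_differ(run_a, run_b, key="numa_balancing"):
    """Compare one recorded parameter between two experiment contexts.

    run_a and run_b are hypothetical dictionaries of recorded machine settings.
    """
    return run_a.get(key) != run_b.get(key)


if __name__ == "__main__":
    current = {"numa_balancing": read_numa_balancing()}
    previous = {"numa_balancing": 0}  # assumed value recorded by an earlier run
    print("current setting:", current["numa_balancing"])
    print("differs from previous run:", settings_differ(previous, current))
```

Recording the value with every run makes a silent change, such as a reboot restoring the default, visible when two sets of results are compared.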

2.4 Network Subsystem

Together with hardware and software, the third fundamental component of high performance computing is the network interconnect. A large variety of solutions is available, the cheapest being Gigabit Ethernet. The trend is toward optical interconnects, which tend to provide higher bandwidth while being more efficient. At the time of writing, the #1 machine in the TOP500 [9] list, named Summit, is composed of 4356 nodes: at this scale central switches can no longer be used, and a hierarchical structure is preferred. Summit uses a dual-rail Mellanox EDR InfiniBand interconnect with a non-blocking fat-tree topology for both storage and inter-process communication traffic. It interconnects thousands of compute nodes containing both IBM POWER CPUs and NVIDIA GPUs, delivering 200 Gb/s of network bandwidth to each of the compute platforms. Advances in InfiniBand technology allow applications to communicate latency-sensitive data effectively.

MPI [42] is one of the programming models that allow applications to communicate over a network interconnect by exchanging messages. Part of the network subsystem carries out the message exchange, which demands a good mapping between the hardware and the communication requirements of the applications. This problem is becoming increasingly relevant because supercomputers currently scale by increasing the number of nodes and their heterogeneity: this inevitably affects how the systems are programmed and must be addressed on the software side.

Chapter 3

Reproducibility Challenges: Software Complexity

The previous chapter described how hardware architectures have evolved, showing the complexity of the machines available at the time of writing. The manycore paradigm is the rule, and application developers must exploit the large amount of parallelism offered, which directly affects the software. Both inter-node and fine-grained intra-node parallelism must be addressed. On the programming side, one must consider data locality as well as synchronization, trying to reduce communication.

3.1 Amdahl’s and Gustafson’s Laws

Since computer manufacturers are providing architectures with an increasing number of computing cores, applications must be parallelized to properly exploit all the available processors. Ideally, when moving from running a program on a single compute core to running it on $N$ cores, one may expect the execution time $t$ to drop to $t/N$. Let us denote with $f_{seq}$ and $f_{par}$ the sequential and the parallel fractions of a program, respectively. If the parallel part can be made $N$ times faster by using $N$ processors, then the time to solution is:

$$T_N = f_{seq} \cdot T_1 + \frac{f_{par} \cdot T_1}{N}$$

[Plot for Figure 3.1. Title: Amdahl's law, parallel speedup vs sequential fraction. x-axis: number of processors (2 to 10); y-axis: speedup (2 to 10); one curve per parallel fraction: 0.5, 0.75, 0.9, 0.95, 0.99, 0.999, 1.]

Figure 3.1: Amdahl’s law [10]: parallel speedup vs sequential fraction, for values of the parallel fraction between 0.5 and 1.

Amdahl’s law [10] states that, given a fixed problem, the speedup (i.e., the ratio between the original and the new execution time) of a parallel machine with $N$ processors is:

$$S_{strong}(N, f_{par}) = \frac{1}{\frac{f_{par}}{N} + (1 - f_{par})}$$

where $f_{par} = 1 - f_{seq}$ represents the fraction of the program that can be parallelized. Figure 3.1 shows how the speedup behaves when $f_{par}$ varies between 0.5 (half of the program can be parallelized) and 1 (the whole program can be parallelized). The consequences of Amdahl’s law are dramatic: namely, a non-parallelizable fraction of the original code $f_{seq} = 0.25$ limits the speedup, however many processors are used, to:

$$\lim_{N \to \infty} S_{strong}(N, f_{par}) = \frac{1}{0.25} = 4.$$

This forecast is the reason why Amdahl’s law is usually regarded as pessimistic.
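The bound can be checked numerically. The following is a small sketch, not taken from the original work, that evaluates the time to solution and the strong-scaling speedup defined in this section; the function names and the single-core time of 100 (arbitrary units) are assumptions made for illustration.

```python
def time_to_solution(t1, n, f_par):
    """Time on n processors: the sequential part is unchanged, the parallel part is divided by n."""
    f_seq = 1.0 - f_par
    return f_seq * t1 + (f_par * t1) / n


def amdahl_speedup(n, f_par):
    """Strong-scaling speedup 1 / (f_par / n + (1 - f_par))."""
    return 1.0 / (f_par / n + (1.0 - f_par))


def amdahl_limit(f_par):
    """Upper bound on the speedup as n grows without bound: 1 / f_seq."""
    f_seq = 1.0 - f_par
    return float("inf") if f_seq == 0.0 else 1.0 / f_seq


if __name__ == "__main__":
    t1 = 100.0  # hypothetical single-core time, arbitrary units
    for n in (1, 2, 4, 8, 16, 1024):
        ratio = t1 / time_to_solution(t1, n, 0.75)
        print(f"N={n:5d}  T1/TN={ratio:.3f}  S_strong={amdahl_speedup(n, 0.75):.3f}")
    print("limit for f_seq = 0.25:", amdahl_limit(0.75))
```

The ratio of the single-core time to the time on $N$ processors coincides with the closed-form speedup, and for a sequential fraction of 0.25 both approach, without ever reaching, the limit of 4.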


[Plot for Figure 3.2. Title: Gustafson's law, parallel speedup vs sequential fraction. x-axis: number of processors (2 to 10); y-axis: speedup (2 to 10); one curve per parallel fraction: 0.5, 0.75, 0.9, 0.95, 0.99, 0.999, 1.]

Figure 3.2: Speedup as a function of the number of cores for values of the parallel fraction from 0.5 to 1, assuming Gustafson’s law [60].

In 1988 Gustafson approached the problem from a different perspective: rather than fixing the problem size, as in Amdahl’s assumption, one may fix the time to solution. Under this condition, the problem size is scaled up, assuming that the size that can be solved grows together with the available parallelism. This approach to scaling is called weak scaling, as opposed to the one where the size is fixed, called strong scaling. In Gustafson’s scenario, the speedup is calculated as:

$$S_{weak}(N, f_{par}) = N \cdot f_{par} + (1 - f_{par})$$

which represents a much more optimistic view than Amdahl’s, as can be seen in Figure 3.2.
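To make the contrast concrete, the sketch below evaluates both speedup formulas side by side for the same parallel fraction. It is an illustration only; the function names and the chosen fraction of 0.75 are assumptions, not part of the original text.

```python
def amdahl_speedup(n, f_par):
    """Strong scaling (fixed problem size): 1 / (f_par / n + (1 - f_par))."""
    return 1.0 / (f_par / n + (1.0 - f_par))


def gustafson_speedup(n, f_par):
    """Weak scaling (fixed time to solution): n * f_par + (1 - f_par)."""
    return n * f_par + (1.0 - f_par)


if __name__ == "__main__":
    f_par = 0.75  # assumed parallel fraction, used only for illustration
    print(f"{'N':>4} {'strong':>8} {'weak':>8}")
    for n in (1, 2, 4, 6, 8, 10):
        print(f"{n:4d} {amdahl_speedup(n, f_par):8.3f} {gustafson_speedup(n, f_par):8.3f}")
```

With one processor the two formulas agree; as $N$ grows, the weak-scaling speedup increases linearly while the strong-scaling speedup flattens towards $1 / f_{seq}$.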

In document Towards a discipline of performance engineering : lessons learned from stencil kernel benchmarks (Page 43-48)