Memory Bandwidth and Cache Sharing - Hybrid Code Performance

4.3 Hybrid Code Performance

4.3.1 Memory Bandwidth and Cache Sharing

The use of multicore processors in modern HPC systems has created issues [42] that must be considered when looking at application performance. Among these are memory bandwidth (the ability to get data from memory to the processing cores) and the effects of cache sharing (where multiple cores share one or more levels of cache). Experiments have been carried out with the hybrid and pure MPI codes to assess the effects of these on code performance. Memory bandwidth and cache sharing issues are easily exposed by comparing the performance of the application on fully populated and underpopulated nodes or processors. To carry out these tests it is necessary to control which processes are assigned to which cores on which physical processor. A combination of standard Linux command line tools allows this to be done. Examining /proc/cpuinfo on each node of a system shows the layout of the processing cores of the node in terms of processor ID as recognised by the system, physical ID and core ID. So, for example, a node of Merlin may report core/processor IDs as in Table 4.3. In this example, to ensure four processes are running on the same processor they must be assigned to system processors 0, 2, 4 and 6 or 1, 3, 5 and 7. An example avoiding

[Figure 4.1 plot: Time (Seconds) against Processor Cores (64, 128, 256, 512). Series: Small Test Case PPN=4, Small Test Case PPN=8, Large Test Case PPN=4, Large Test Case PPN=8.]

Figure 4.1: Merlin memory bandwidth tests. Error bars of 5% shown. Very little performance difference is observed between the codes running on a fully populated node (ppn=8) and an underpopulated node (ppn=4). Overlapping error bars indicate no statistically significant difference between timings on underpopulated and fully populated nodes.

cache sharing would require processes to be assigned to system processors 0, 1, 4 and 5 (since cores 0 and 1 share L2 cache, as do cores 2 and 3). This process assignment can be done using the command line tool taskset, which sets the CPU affinity of a process. Knowing how the system numbers the processors, and using taskset to act on that numbering, allows fine control of process placement to test for performance issues related to memory bandwidth and cache sharing.
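The grouping of system processor numbers by physical processor can be sketched in Python. This is a minimal illustration, not the procedure used for the experiments: the sample text below is a hypothetical /proc/cpuinfo excerpt, constructed only to be consistent with the numbering described above (it is not a copy of Table 4.3), and the function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical excerpt: system processor p is assumed to sit on physical
# processor p % 2 with core ID p // 2 (an assumption, not Table 4.3 itself).
SAMPLE_CPUINFO = "\n".join(
    f"processor\t: {p}\nphysical id\t: {p % 2}\ncore id\t\t: {p // 2}\n"
    for p in range(8)
)

WANTED = ("processor", "physical id", "core id")


def parse_cpuinfo(text):
    """Return one dict per logical CPU with the three fields of interest."""
    entries, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the record for one logical CPU.
            if current:
                entries.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in WANTED:
            current[key] = int(value)
    if current:
        entries.append(current)
    return entries


def group_by_physical_id(entries):
    """Map each physical processor to its sorted system processor numbers."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["physical id"]].append(entry["processor"])
    return {socket: sorted(cpus) for socket, cpus in groups.items()}


if __name__ == "__main__":
    print(group_by_physical_id(parse_cpuinfo(SAMPLE_CPUINFO)))
```

On a real Linux node the same functions could be applied to the contents of /proc/cpuinfo, although the fields present there vary between architectures.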

Combined effects of Memory Bandwidth and Cache Sharing

By running the MPI code on a number of fully populated nodes on Merlin (using 8 processes per node, ppn=8), then again on twice the number of nodes using half the cores (ppn=4), with processes placed so that cache is not shared, any performance differences due to either the


| Simulation | Timing | Pure MPI: Shared | Pure MPI: Exclusive | Pure MPI: Diff. (%) | Hybrid (1 MPI): Shared | Hybrid (1 MPI): Exclusive | Hybrid (1 MPI): Diff. (%) |
|---|---|---|---|---|---|---|---|
| Small | Total | 2436.699 | 2328.641 | 4.43 | 3622.872 | 3577.680 | 1.25 |
| Small | Forces | 2211.631 | 2168.351 | 1.96 | 3305.688 | 3275.290 | 0.92 |
| Large | Total | 8410.663 | 8103.421 | 3.65 | 16935.252 | 16751.750 | 1.08 |
| Large | Forces | 7791.354 | 7691.7058 | 1.28 | 16061.832 | 15921.399 | 0.87 |

Table 4.4: Timing effects of cache sharing. Times in seconds. MPI code affected more by cache sharing than hybrid code, but effects on overall performance are small.

reduced memory bandwidth or the cache sharing on a fully populated node can be seen. The results of this test are shown in Figure 4.1, which shows the total elapsed wall time for both the small and large test sizes running on underpopulated and fully populated nodes. They clearly show that there is little performance difference between a fully populated and an underpopulated node, demonstrating that memory bandwidth and cache sharing are not a large issue with this MD code, and so are not a significant factor when examining performance results.
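The overlap criterion used to read Figure 4.1 can be written down as a short check. The sketch below assumes the 5% error bars are symmetric about each timing; the function names and the example timings in the usage note are hypothetical and are not measurements from the figure.

```python
def error_bar(value, rel=0.05):
    """Return the (low, high) ends of a symmetric relative error bar."""
    return value * (1.0 - rel), value * (1.0 + rel)


def bars_overlap(time_a, time_b, rel=0.05):
    """True if the error bars of two timings share at least one point."""
    low_a, high_a = error_bar(time_a, rel)
    low_b, high_b = error_bar(time_b, rel)
    # Two closed intervals overlap when each starts before the other ends.
    return low_a <= high_b and low_b <= high_a
```

For example, hypothetical timings of 100 s and 104 s give bars of 95 to 105 s and 98.8 to 109.2 s, which overlap, whereas 100 s and 120 s do not.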

Cache Sharing

The effect on performance due solely to cache sharing can be examined separately by running the code on an underpopulated single node of the system using either one or both processors. Each node in the Merlin cluster contains two quad-core processors and each physical processor has two pairs of cores on a chip sharing an L2 cache. Therefore, running 4 processes or threads on one processor (using all 4 cores, so the L2 cache is shared) and then on two processors (using one of each pair of cores, so each core has exclusive access to the cache) will expose any performance difference caused by cache sharing. As both processors share the connection to main memory, the memory bandwidth contention when running the same number


of threads/processes on a node should be the same whether the processes are concentrated on one processor or spread over both. The timing results for the two codes, and the difference between the exclusive and shared cache timings, are presented in Table 4.4.
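The two placements compared in this test can be derived from the node layout. The following is a sketch under stated assumptions: the topology mapping is hypothetical, chosen to match the processor numbering described earlier in this section, and the helper names are invented for illustration. The pinning helper uses Python's os.sched_setaffinity, which is available on Linux only; the experiments themselves used taskset.

```python
import os

# Assumed layout: system processor -> (physical ID, core ID).
TOPOLOGY = {cpu: (cpu % 2, cpu // 2) for cpu in range(8)}

# Core IDs that share an L2 cache on each physical processor (from the text).
L2_PAIRS = [(0, 1), (2, 3)]


def shared_placement(topology, physical_id=0):
    """All cores of one physical processor, so each L2 cache is shared."""
    return sorted(cpu for cpu, (pid, _) in topology.items() if pid == physical_id)


def exclusive_placement(topology, l2_pairs):
    """One core from each L2 pair on every processor, so no L2 is shared."""
    chosen_cores = {pair[0] for pair in l2_pairs}
    return sorted(cpu for cpu, (_, core) in topology.items() if core in chosen_cores)


def pin_current_process(cpus):
    """Restrict the calling process to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, set(cpus))


if __name__ == "__main__":
    print("shared:   ", shared_placement(TOPOLOGY))
    print("exclusive:", exclusive_placement(TOPOLOGY, L2_PAIRS))
```

Under the assumed layout the shared placement uses system processors 0, 2, 4 and 6, and the exclusive placement uses 0, 1, 4 and 5, matching the examples given earlier. To place one process per core, each process would be pinned to a single entry of the chosen list.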

The differences between shared and exclusive cache use are very small, never larger than 4.5%, with the largest occurring in the pure MPI code. Overall the pure MPI code is affected more by cache sharing than the hybrid code, as the difference between exclusive and shared cache timings is larger for the total time of the MPI code in both the small and large test cases. This is to be expected: the MPI code runs with four processes continually, whereas the hybrid code runs only one process until the threads are spawned in the forces routine. The hybrid code therefore shares cache between cores only during the execution of the OpenMP parallelised routines; at most other points only one MPI process is running, which has exclusive access to the cache. It is therefore reasonable to expect that the hybrid code will be affected less by cache sharing over the entire run of the application. Examining the portion of the total difference that can be attributed to the forces routine (where both hybrid and pure MPI codes are running in parallel) shows that it makes up a far greater proportion of the total difference in the hybrid code than in the MPI code, which is in line with these expectations. Examining only the forces routine timing does not show a significant difference between the two codes.
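The arithmetic behind this comparison can be reproduced from the timings in Table 4.4. The sketch below assumes that Diff.(%) is the shared-minus-exclusive gap expressed relative to the shared timing, which matches the tabulated percentages to two decimal places; the dictionary layout and function names are illustrative only.

```python
# Timings in seconds from Table 4.4, stored as (shared, exclusive).
TABLE_4_4 = {
    ("small", "mpi"): {"total": (2436.699, 2328.641), "forces": (2211.631, 2168.351)},
    ("small", "hybrid"): {"total": (3622.872, 3577.680), "forces": (3305.688, 3275.290)},
    ("large", "mpi"): {"total": (8410.663, 8103.421), "forces": (7791.354, 7691.7058)},
    ("large", "hybrid"): {"total": (16935.252, 16751.750), "forces": (16061.832, 15921.399)},
}


def diff_percent(shared, exclusive):
    """Slowdown of the shared-cache run, as a percentage of the shared time."""
    return 100.0 * (shared - exclusive) / shared


def forces_share(entry):
    """Fraction of the total shared/exclusive gap that arises in the forces routine."""
    total_gap = entry["total"][0] - entry["total"][1]
    forces_gap = entry["forces"][0] - entry["forces"][1]
    return forces_gap / total_gap


if __name__ == "__main__":
    for (size, code), entry in TABLE_4_4.items():
        print(
            f"{size:5s} {code:6s} "
            f"total diff = {diff_percent(*entry['total']):.2f}%  "
            f"forces share of gap = {forces_share(entry):.0%}"
        )
```

With these figures the forces routine accounts for roughly 40% (small) and 32% (large) of the total gap in the pure MPI code, against roughly 67% (small) and 77% (large) in the hybrid code.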

The results of the cache sharing analysis show that cache sharing has a slight negative impact on the performance of the code, but that the difference between shared and exclusive cache use is very small. Cache sharing is also not a significant cause of the performance difference between the pure MPI and hybrid codes, as both are affected.

In document Performance engineering of hybrid message passing + shared memory programming on multi-core clusters (Page 89-92)