Application Profile - THESIS DEPOSIT - Addressing parallel application memory consumption


6.3 Application Profile

In this section we analyse the Orthrus benchmark to understand its communication and memory consumption behaviour. By examining the communication patterns of the application we can understand how each technique should reduce memory consumption and affect performance. By measuring the number of sources of communication for each process we can determine the ideal number of receive queues, and by measuring the size of these communications we can determine the required size of these queues. Further, we can speculate about the performance of a queue configuration based on the balance of these two attributes.

As receive queue count and size are the primary causes of MPI memory consumption, this information can be used to design low-memory configurations with minimal performance implications.
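To make the balance between queue count and queue size concrete, the following is a minimal sketch of a toy footprint model. It assumes that receive queue memory is simply the product of the number of queues, the number of buffers per queue and the buffer size; the function name and all parameter values are hypothetical, and a real MPI library will carry additional overheads that this model ignores.

```python
def receive_queue_footprint(num_queues, buffers_per_queue, buffer_bytes):
    """Toy estimate (an assumption, not a measured model) of receive queue memory in bytes."""
    return num_queues * buffers_per_queue * buffer_bytes


# Two hypothetical configurations, chosen only to illustrate the trade-off.
many_small = receive_queue_footprint(num_queues=32, buffers_per_queue=64, buffer_bytes=4 * 1024)
few_large = receive_queue_footprint(num_queues=4, buffers_per_queue=16, buffer_bytes=128 * 1024)

print(f"many small queues: {many_small / 2**20:.1f} MiB")
print(f"few large queues:  {few_large / 2**20:.1f} MiB")
```

Under this toy model both configurations consume the same memory, so the choice between them depends on how many sources each process receives from and how large its messages are.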

6.3.1 Application Communication Classification

To gain understanding of Orthrus we first perform an analysis of its communication profile. We evaluate the ratio of point-to-point to collective communications, and measure the source and destination of point-to-point communications to understand the communication structure. Lastly, we analyse the message sizes used in the point-to-point communications to help understand the effect of receive queue size.

Point-to-Point and Collectives Analysis

Table 6.1 gives the breakdown of total MPI communication calls made during execution on 64, 128 and 256 cores. The table shows that the dominant communication type, in terms of frequency, is point-to-point. We also note that the growth factor between 128 and 256 cores suggests that the frequency of point-to-point messages (3.6×) is increasing at a faster rate than that of collectives (2.5×). These results suggest that optimisations targeted at point-to-point communications are likely to be of considerable benefit.
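The growth factors quoted above can be reproduced from the call counts in Table 6.1. The sketch below is only an illustration of that arithmetic; the dictionary layout and function names are our own, and the counts are copied from the table.

```python
# Call counts copied from Table 6.1, keyed by core count.
calls = {
    "MPI_Isend": ("Point-to-Point", {64: 118_416, 128: 446_576, 256: 1_618_632}),
    "MPI_Irecv": ("Point-to-Point", {64: 118_416, 128: 446_576, 256: 1_618_632}),
    "MPI_Gather": ("Collective", {64: 704, 128: 1_408, 256: 2_816}),
    "MPI_Gatherv": ("Collective", {64: 512, 128: 1_024, 256: 2_048}),
    "MPI_Allgather": ("Collective", {64: 3_328, 128: 6_656, 256: 13_312}),
    "MPI_Allreduce": ("Collective", {64: 105_088, 128: 263_168, 256: 652_288}),
}


def total_calls(call_type, cores):
    """Sum the calls of one type (point-to-point or collective) at a core count."""
    return sum(counts[cores] for kind, counts in calls.values() if kind == call_type)


def growth_factor(call_type, low, high):
    """Ratio of total calls at the higher core count to the lower one."""
    return total_calls(call_type, high) / total_calls(call_type, low)


p2p_growth = growth_factor("Point-to-Point", 128, 256)
coll_growth = growth_factor("Collective", 128, 256)
print(f"point-to-point: {p2p_growth:.1f}x, collective: {coll_growth:.1f}x")
# prints "point-to-point: 3.6x, collective: 2.5x"
```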

Library Call     Type              64 Cores    128 Cores    256 Cores
MPI_Isend        Point-to-Point     118,416      446,576    1,618,632
MPI_Irecv        Point-to-Point     118,416      446,576    1,618,632
MPI_Gather       Collective             704        1,408        2,816
MPI_Gatherv      Collective             512        1,024        2,048
MPI_Allgather    Collective           3,328        6,656       13,312
MPI_Allreduce    Collective         105,088      263,168      652,288

Table 6.1: Different communications at 64, 128 and 256 cores

Source - Destination Analysis

In Figure 6.1 we illustrate the communication pattern of our benchmark application run on 128 cores. During an instrumented execution, we measure the source and destination of every point-to-point communication. The colour of a cell indicates the density of communication between the source and destination processes, ranging from black for the maximum observed communication, through a grey-scale, to white for no communication. From this data we can see the complexity of the problem decomposition and associated processor layout; in this case each processor sends data to ≈ 1/4 of the other processors and receives data from another ≈ 1/4 of the processors. When visualised as a 3D processor decomposition, each process sends data to all processes in a plane of the x-dimension and receives data from all processes in a plane of the z-dimension. This behaviour is not representative of the communication profile of the underlying seven-point stencil algorithm, and may indicate issues with the communication structure.

The fact that each processor also receives from a large number of other processors means that this application will require a large number of QPs. This suggests that the application is a prime example to demonstrate MPI memory scalability problems, and to evaluate potential solutions.
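As an illustration of how the number of communication partners per process could be derived from such a trace, the following minimal sketch counts the distinct sources and destinations of each rank. It assumes the instrumented run yields a list of (source, destination) pairs; the function name and the four-rank trace are hypothetical and are not Orthrus data.

```python
from collections import defaultdict


def fan_in_out(messages):
    """Return ({rank: distinct sources}, {rank: distinct destinations})."""
    sources = defaultdict(set)
    destinations = defaultdict(set)
    for src, dst in messages:
        sources[dst].add(src)        # dst receives from src
        destinations[src].add(dst)   # src sends to dst
    fan_in = {rank: len(peers) for rank, peers in sources.items()}
    fan_out = {rank: len(peers) for rank, peers in destinations.items()}
    return fan_in, fan_out


# Hypothetical four-rank trace of (source, destination) pairs.
trace = [(0, 1), (0, 2), (0, 1), (1, 2), (3, 2), (2, 0)]
fan_in, fan_out = fan_in_out(trace)
print("distinct sources per rank:     ", fan_in)
print("distinct destinations per rank:", fan_out)
```

A rank with a high count of distinct sources is one that would need a correspondingly large number of receive queues.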

Point-to-Point Message Size

Figure 6.2 augments Figure 6.1 by looking at the size of these point-to-point messages. From this figure we can see three very distinct clusters: the first in the 0 B to 4 B range, the second spanning 16 KB to 64 KB, and the last spanning 128 KB to 512 KB.

Figure 6.1: Source-destination distribution for point-to-point messages on 128 cores (axes: Source Rank, Destination Rank)

What we ascertain from Figure 6.2 is that, to optimise the receive queues for Orthrus, we must support a small percentage (≈ 20%) of very small messages and a large percentage (≈ 80%) of larger messages. There is little to be gained from a large number of medium-sized receive queue buffers, and so large buffers should be prioritised.
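A frequency distribution of this kind can be sketched by placing each message size into a power-of-two bin and reporting the share of messages per bin. The example below is a minimal illustration only: labelling each bin by its lower edge is our assumption, and the sample sizes are hypothetical rather than measured Orthrus values.

```python
from collections import Counter


def size_bin(size_bytes):
    """Lower edge of the power-of-two bin containing size_bytes (0 stays 0)."""
    if size_bytes <= 0:
        return 0
    return 1 << (size_bytes.bit_length() - 1)


def size_distribution(sizes):
    """Map each occupied bin to the percentage of messages that fall in it."""
    counts = Counter(size_bin(s) for s in sizes)
    total = len(sizes)
    return {edge: 100.0 * n / total for edge, n in sorted(counts.items())}


# Hypothetical message sizes in bytes.
sizes = [0, 4, 4, 20_000, 40_000, 40_000, 200_000, 300_000, 300_000, 500_000]
dist = size_distribution(sizes)
for edge, percent in dist.items():
    print(f"{edge:>7} B: {percent:5.1f}%")
```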

6.3.2 Application Memory Profile

We now perform some memory analysis to establish a memory consumption profile, which will help us understand the importance of memory reductions. By understanding how much memory MPI consumes, and where in the application the consumption occurs, we are able to gauge the potential savings from our different solutions.

Temporal Trace

Figure 6.2: Frequency distribution of message sizes on 128 cores (axes: Message Size (B), Frequency Percentage (%))

Figure 6.3: Temporal memory usage for Orthrus on 128 cores (axes: Time (%), Memory Consumption (MB))

Figure 6.3 shows the temporal memory profile of Orthrus on 128 cores. We can see that Orthrus has a very regular profile, with a repeating pattern covering the 20 time-steps. We note that there is a ‘ramp-up’ phase during the first time-step where allocations are made.

We can see that the height of the spikes is consistent and non-increasing, which indicates well-managed memory and suggests that there are no memory leaks.
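The observation about spike heights can be expressed as a simple check on a memory trace. The sketch below is an assumption-laden illustration rather than the analysis used to produce Figure 6.3: it treats a spike as a local maximum in a list of memory samples, and both traces are hypothetical.

```python
def local_peaks(samples):
    """Samples that are higher than the previous one and at least as high as the next."""
    return [
        samples[i]
        for i in range(1, len(samples) - 1)
        if samples[i - 1] < samples[i] >= samples[i + 1]
    ]


def peaks_non_increasing(samples, tolerance=0.0):
    """True if no spike is taller than the spike before it (within tolerance)."""
    peaks = local_peaks(samples)
    return all(later <= earlier + tolerance for earlier, later in zip(peaks, peaks[1:]))


# Hypothetical memory traces in MB: one regular, one with steadily growing spikes.
steady = [10, 150, 90, 150, 90, 150, 90]
leaky = [10, 150, 90, 155, 95, 160, 100]

print(peaks_non_increasing(steady))  # True
print(peaks_non_increasing(leaky))   # False
```

Spikes that grow from one time-step to the next, as in the second trace, would be one indication of memory that is allocated and never released.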
