6.3 Application Profile
In this section we perform an application analysis of the Orthrus benchmark to gain an understanding of its communication and memory consumption behaviour. By looking at the communication patterns of the application we can understand how each technique should reduce memory consumption and impact performance. By measuring the number of sources of communication for each process we can understand the ideal number of receive queues, and by measuring the size of these communications we can understand the required size of these queues. Further, we can speculate about the performance of a queue configuration based on the balance of these two attributes.
As receive queue count and size are the primary causes of MPI memory consumption, this information can be used to design low-memory configurations with minimal performance implications.
6.3.1 Application Communication Classification
To gain an understanding of Orthrus we first perform an analysis of its communication profile. We evaluate the ratio of point-to-point to collective communications, and measure the source and destination of point-to-point communications to understand the communication structure. Lastly, we analyse the message sizes used in the point-to-point communications to help understand the effect of receive queue size.
Point-to-Point and Collectives Analysis
Table 6.1 presents the breakdown of total MPI communication calls made during execution on 64, 128 and 256 cores. From this table it is clear that the dominant communication type, in terms of frequency, is point-to-point. We also note that the growth factor between 128 and 256 cores suggests that the frequency of point-to-point messages (3.6×) is increasing at a faster rate than that of collectives (2.5×). These results suggest that optimisations targeted towards point-to-point communications are likely to be of considerable benefit.
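The growth factors quoted above can be reproduced from the call counts in Table 6.1. The following Python sketch sums the point-to-point and collective calls at each core count and takes the ratio between 128 and 256 cores. The dictionary layout and the function names are illustrative choices and are not part of the original instrumentation.

```python
# Call counts transcribed from Table 6.1, ordered as (64, 128, 256) cores.
CORES = (64, 128, 256)
CALLS = {
    "MPI_Isend":     ("p2p",        (118_416, 446_576, 1_618_632)),
    "MPI_Irecv":     ("p2p",        (118_416, 446_576, 1_618_632)),
    "MPI_Gather":    ("collective", (704, 1_408, 2_816)),
    "MPI_Gatherv":   ("collective", (512, 1_024, 2_048)),
    "MPI_Allgather": ("collective", (3_328, 6_656, 13_312)),
    "MPI_Allreduce": ("collective", (105_088, 263_168, 652_288)),
}


def totals(kind):
    """Sum the calls of one type ('p2p' or 'collective') per core count."""
    sums = [0] * len(CORES)
    for call_kind, counts in CALLS.values():
        if call_kind == kind:
            for i, count in enumerate(counts):
                sums[i] += count
    return dict(zip(CORES, sums))


def growth(kind, low, high):
    """Ratio of total calls of one type between two core counts."""
    t = totals(kind)
    return t[high] / t[low]


if __name__ == "__main__":
    for kind in ("p2p", "collective"):
        print(kind, round(growth(kind, 128, 256), 1))
```

Rounded to one decimal place, the two ratios agree with the 3.6× and 2.5× figures given in the text.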
Source-Destination Analysis
In Figure 6.1 we illustrate the communication pattern of our benchmark application run on 128 cores. During an instrumented execution, we measure the source and destination of every point-to-point communication. The colour of a cell indicates the density of communication between the source and destination processes, on a grey-scale from black, representing the maximum observed communication, to white, representing no communication. From this data we can see the complexity of the problem decomposition and associated processor layout; in this case each processor sends data to ≈ 1/4 of the other processors and receives data from another ≈ 1/4 of the processors. When visualised as a 3D processor decomposition, each process sends data to all processes in a plane of the x-dimension and receives data from all processes in a plane of the z-dimension. This behaviour is not representative of the communication profile of the underlying seven-point stencil algorithm, and may indicate issues with the communication structure.

Library Call     Type              64 Cores   128 Cores   256 Cores
MPI_Isend        Point-to-Point     118,416     446,576   1,618,632
MPI_Irecv        Point-to-Point     118,416     446,576   1,618,632
MPI_Gather       Collective             704       1,408       2,816
MPI_Gatherv      Collective             512       1,024       2,048
MPI_Allgather    Collective           3,328       6,656      13,312
MPI_Allreduce    Collective         105,088     263,168     652,288
Table 6.1: Different communications at 64, 128 and 256 cores
The fact that each processor also receives from a large number of other processors means that this application will require a large number of QPs. This suggests that the application is a prime example to demonstrate MPI memory scalability problems, and to evaluate potential solutions.
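The measurement described above amounts to building a source-destination count matrix and then counting, for each rank, how many distinct peers send to it. The Python sketch below illustrates this on a small hypothetical trace. The trace format (a list of (source, destination) pairs) and the function names are assumptions, as the instrumentation used for Orthrus is not shown here.

```python
def comm_matrix(trace, nranks):
    """Count messages per (source, destination) pair.

    `trace` is assumed to be an iterable of (source, destination) rank pairs,
    one entry per point-to-point message.
    """
    matrix = [[0] * nranks for _ in range(nranks)]
    for src, dst in trace:
        matrix[src][dst] += 1
    return matrix


def distinct_sources(matrix):
    """For each destination rank, count the ranks that sent it any message."""
    n = len(matrix)
    return [sum(1 for src in range(n) if matrix[src][dst] > 0)
            for dst in range(n)]


if __name__ == "__main__":
    # Hypothetical four-rank trace, for illustration only.
    trace = [(0, 1), (0, 1), (2, 1), (3, 1), (1, 0), (2, 3)]
    matrix = comm_matrix(trace, nranks=4)
    print(distinct_sources(matrix))
```

In this toy trace rank 1 receives from three distinct peers and rank 2 from none. Applied to the 128-core run, where each rank receives from ≈ 1/4 of the other ranks, the same count would be in the region of 32 sources per rank.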
Point-to-Point Message Size
Figure 6.2 augments Figure 6.1 by looking at the size of these point-to-point messages. From this figure we can see three very distinct clusterings: the first in the 0 B to 4 B range, the second spanning from 16 KB to 64 KB, the last group spanning from 128 KB to 512 KB.
What we ascertain from Figure 6.2 is that, to optimise the receive queues for Orthrus, we are required to support a small percentage (≈20%) of very small messages and a large percentage (≈80%) of larger messages. There is little to be gained from a large number of medium-sized receive queue buffers, and so large buffers should be prioritised.

Figure 6.1: Source-destination distribution for point-to-point messages on 128 cores (axes: Source Rank and Destination Rank, ranks 0 to 127)
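A distribution of the kind read from Figure 6.2 can be obtained by rounding each message size up to a power-of-two bin and reporting the share of messages in each bin. The following Python sketch shows one way to do this. The binning rule, the function names and the sample sizes are assumptions made for illustration and are not the measured Orthrus data.

```python
from collections import Counter


def bucket(size_bytes):
    """Round a message size up to the next power of two (0 stays 0)."""
    if size_bytes <= 0:
        return 0
    return 1 << (size_bytes - 1).bit_length()


def size_distribution(sizes):
    """Return {bin size in bytes: percentage of messages}, sorted by bin."""
    counts = Counter(bucket(s) for s in sizes)
    total = len(sizes)
    return {b: 100.0 * c / total for b, c in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical message sizes in bytes: a few tiny messages and many large ones.
    sizes = [4, 4] + [20_000] * 3 + [300_000] * 5
    for bin_size, share in size_distribution(sizes).items():
        print(f"{bin_size:>7} B: {share:.0f}%")
```

With bins of this form, the share of messages falling in the small bins versus the large bins gives the kind of split (≈20% small, ≈80% large) used above to argue for prioritising large buffers.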
6.3.2 Application Memory Profile
We now perform some memory analysis to establish a memory consumption profile, which will help us understand the importance of memory reductions. By understanding how much memory MPI consumes, and where in the application the consumption occurs, we are able to gauge the potential savings from our different solutions.
Temporal Trace
Figure 6.3 shows the temporal memory profile of Orthrus on 128 cores. We can see that Orthrus has a very regular profile, with a repeating pattern covering the 20 time-steps. We note that there is a ‘ramp-up’ phase during the first time-step where allocations are made.

Figure 6.2: Frequency distribution of message sizes on 128 cores (x-axis: Message Size (B), 0 to 512k; y-axis: Frequency Percentage (%))

Figure 6.3: Temporal memory usage for Orthrus on 128 cores (x-axis: Time (%); y-axis: Memory Consumption (MB))
We can see that the height of the spikes is consistent and non-increasing, which indicates well-managed memory and suggests that there are no memory leaks.
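The visual check described above (spike heights that do not grow from one time-step to the next) can also be expressed programmatically. The Python sketch below is one possible formulation, assuming equally spaced memory samples and time-steps of equal length. The function names and the sample values are hypothetical and do not come from the Orthrus trace.

```python
def peaks_per_step(samples, nsteps):
    """Split equally spaced memory samples into nsteps chunks and return each peak.

    Assumes len(samples) >= nsteps; trailing samples that do not fill a whole
    chunk are ignored.
    """
    chunk = len(samples) // nsteps
    return [max(samples[i * chunk:(i + 1) * chunk]) for i in range(nsteps)]


def looks_leak_free(peaks, skip=1):
    """True if the peaks after the first `skip` ramp-up steps never increase."""
    steady = peaks[skip:]
    return all(later <= earlier for earlier, later in zip(steady, steady[1:]))


if __name__ == "__main__":
    # Hypothetical memory samples in MB: four time-steps, three samples each.
    samples = [10, 60, 90, 40, 160, 40, 40, 160, 40, 40, 160, 40]
    peaks = peaks_per_step(samples, nsteps=4)
    print(peaks, looks_leak_free(peaks))
```

Skipping the first time-step reflects the ramp-up phase noted earlier, during which the initial allocations are made. A steadily rising sequence of peaks after that point would be the signature of a leak.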