Performance of SMG versus MPI - The SMG DSM system: enabling shared memory for the grid

For every application, an MPI version was executed alongside the SMG variants: the update protocol, the subscription protocol, and multi-threading. For a given application, all tests were executed on the same cluster. The nearest-neighbour class of applications is not considered in this section: these applications are beset by excessive coherence communication overheads, and SMG is simply not competitive for applications with low computation-to-communication ratios.

EP has a very large computation-to-communication ratio, which enables SMG to be very competitive with MPI; see Figures 10.3 to 10.5 and Table 10.5 (SMG-sub refers to the use of the subscription protocol). This is hardly surprising, as little work is required of the DSM. It does demonstrate, however, that the DSM engine imposes little if any overhead, and it shows the effectiveness of running the communication module in a separate thread. Note that in this application the performance is independent of the subscription protocol (as indicated by the similar plots in the graphs), because the DSM engine is employed very little throughout the execution.

The EP application itself generates little communication; for the SMG variants, the bulk of the traffic consists of control messages for the initialisation and finalisation of the system.


Figure 10.4: EP Total Messages sent (SMG-sub obscures SMG)

No. Procs         2      4      8      16     32     64     128
MPI               2      4.02   8.04   14.5   28     53.9   86.19
SMG               2      3.99   7.99   15.87  31.15  58.56  87.51
SMG-sub           2      3.98   7.49   15.8   29.05  54.69  80.15
SMG (threaded)    -      3.77   7.54   15.91  28.25  55.77  87.02

Table 10.5: Speed-up for EP execution on IITAC
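The closeness of SMG to MPI for EP is easier to judge as parallel efficiency (speed-up divided by processor count). The following is a minimal sketch using the values of Table 10.5; the variable and function names are illustrative only, and the threaded row is left out because it has no entry for every processor count.

```python
# Speed-up values copied from Table 10.5 (EP on IITAC).
procs = [2, 4, 8, 16, 32, 64, 128]
speedup = {
    "MPI":     [2, 4.02, 8.04, 14.5, 28, 53.9, 86.19],
    "SMG":     [2, 3.99, 7.99, 15.87, 31.15, 58.56, 87.51],
    "SMG-sub": [2, 3.98, 7.49, 15.8, 29.05, 54.69, 80.15],
}


def efficiency(speedups, counts):
    """Parallel efficiency = speed-up / number of processors."""
    return [s / p for s, p in zip(speedups, counts)]


eff = {name: efficiency(vals, procs) for name, vals in speedup.items()}

# One row per variant, one column per processor count.
for name, vals in eff.items():
    print(name.ljust(8), " ".join(f"{e:5.2f}" for e in vals))
```

Read this way, all three variants stay close to one another across the range, which is consistent with the observation that the DSM engine does little work for EP.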


Matrix

The communication behaviour of the Matrix application (Figures 10.6 to 10.8) highlights that SMG imposes no major overhead for applications with high computation-to-communication ratios, so good speed-up is obtainable, as shown in Table 10.6. The volume of data transferred is greater than for MPI because, once the resultant matrix has been gathered at the coordinator through the release phase of the barrier process, it is distributed to all other processes during the acquire phase. Coherence traffic could be minimised at the acquire phase of the initial barrier if the barrier coordinator performed the initialisation of the incident matrices, i.e. no coherence data would be seen during the arrival phase. The addition of a simple function to the SMG API can easily provide this:

SMG_barrier_iam_coordinator(int id).

The subscription protocol variant of the SMG application can offer no advantage as the sharing pattern cannot be established until after the first barrier. The protocol introduces small amounts of additional coherence overhead to initialise the subscription lists.
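The extra data volume that the barrier introduces for Matrix can be illustrated with a rough back-of-the-envelope model. It is only a sketch under assumptions that are not taken from the thesis: an N × N result matrix of 8-byte elements, rows split evenly over P processes, the coordinator being one of those processes, and every other process receiving, in the acquire phase, each block it did not compute itself. The function names are hypothetical.

```python
def mpi_gather_bytes(n, p, elem=8):
    """Bytes moved when p-1 workers send their row blocks to one root."""
    block = n * n * elem // p              # bytes computed by each process
    return (p - 1) * block


def smg_barrier_bytes(n, p, elem=8):
    """Release phase (gather at coordinator) plus acquire phase (redistribution)."""
    block = n * n * elem // p
    release = (p - 1) * block              # workers -> coordinator
    acquire = (p - 1) * (p - 1) * block    # coordinator -> each worker, minus its own block
    return release + acquire


for p in (2, 4, 8, 16):
    mpi = mpi_gather_bytes(1024, p)
    smg = smg_barrier_bytes(1024, p)
    print(p, mpi, smg, smg / mpi)
```

Under these assumptions the barrier moves P times the data of a plain gather. The measured gap need not match this figure; the model is only meant to show why redistribution at the acquire phase inflates the payload.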

Figure 10.6: Matrix Total Data sent

The number of messages generated by the MPI version is also lower than for SMG for the Matrix application (see Figure 10.7), but the SMG figures include the initialisation messages for the DSM engine. In general, SMG will be competitive in terms of the number of messages generated, but not in the cumulative payload of those messages.


Figure 10.7: Matrix Total Messages sent

There is no discernible difference in execution times between the SMG variants, except that the update protocol variant suffers a degradation relative to the others at a processor count of 128; this may be due to intermittent effects. The seemingly unusual speed-up figures given in Table 10.6 for the MPI variant with 2, 4, and 8 processors (2.01, 4.17, and 8.06) would seem to stem from each processor being able to hold a larger share of the application's working set in its cache. This results in a higher cache hit rate and thus super-linear performance with respect to a single-processor run.

No. Procs         2      4      8      16     32     64     128
MPI               2.01   4.17   8.06   15.76  30.96  66.96  114.54
SMG               2      3.99   7.97   15.24  29.76  56.09  104.17
SMG-sub           1.98   3.98   7.89   15.66  30.11  56.03  107.76
SMG (threaded)    -      4      8.44   16.5   32.2   58.54  107.93

Table 10.6: Speed-up for Matrix execution on IITAC
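Super-linear entries are simply those where the speed-up exceeds the processor count. A short check over the MPI row of Table 10.6 (the variable names are illustrative only):

```python
# MPI speed-up values copied from Table 10.6 (Matrix on IITAC).
procs = [2, 4, 8, 16, 32, 64, 128]
mpi_speedup = [2.01, 4.17, 8.06, 15.76, 30.96, 66.96, 114.54]

# A run is super-linear when its speed-up is greater than its processor count.
superlinear = [p for p, s in zip(procs, mpi_speedup) if s > p]
print(superlinear)
```

Besides the 2, 4, and 8 processor runs discussed in the text, this filter also flags the 64-processor entry (66.96), which the text does not comment on.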

At this point, it is important to highlight the increase in the number of DSM page-faults, as illustrated in Figure 10.9, when the subscription protocol is used rather than the update protocol. This 'feature' was expected; see Section 7.4.2.


Figure 10.8: Matrix Speedup


Laplace

The major deficit of the update protocol is evident in the message payload figures for the Laplace application; see Figure 10.10. The 'avalanche' effect associated with the update protocol becomes apparent when the number of processes increases beyond 16: the parallel application is communication bound for a significant amount of time as the interconnect becomes saturated with traffic. See Table 10.7 for the difference in total data transferred between SMG and MPI with 128 processes (this is the scenario described in Section 7.4.1). The SMG message payloads are orders of magnitude greater than those of the message-passing version. The numbers of messages are relatively similar in all cases; see Figure 10.11. The benefits of the subscription protocol are further evidenced by the reduction in the data volume transferred, but there is still significant overhead associated with the protocol implementation.

The subscription protocol does offer some respite in terms of speed-up at low processor numbers (Figure 10.12), but as the number of processors is scaled up, the overhead associated with the protocol implementation becomes detrimental to the performance of the application. The figure also demonstrates the terrible performance of SMG using the update coherence protocol, i.e. no speed-up occurs for this application because it is communication bound.

It must be noted that the MPI version of the application also performs poorly when the number of processors is scaled into double digits, and performs very badly thereafter. The computation-to-communication ratio is not high enough to deliver good scalability. Larger application dimensions are possible, say 8192 × 8192, but for the purpose of this thesis, i.e. evaluating the performance of the SMG DSM, they are inconsequential.

As indicated by Figure 10.13, the subscription variant of the application does incur extra DSM overhead. The most visible manifestation of this is the number of extra page-faults that are generated.

No. Procs    4         8          16         32         64         128
MPI          9.36E+8   1.32E+9    2.55E+9    4.97E+9    9.81E+09   3.96E+10
SMG          9.65E+9   1.71E+10   3.99E+10   7E+10      2.98E+12   5.88E+12
SMG (Sub)    9.46E+8   2.28E+9    5.17E+9    1.19E+10   2.87E+10   7.65E+10

Table 10.7: Laplace - Message payload (in bytes)
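The size of the gap is clearer when each SMG payload is expressed as a multiple of the MPI payload at the same processor count. A minimal sketch over the Table 10.7 values follows; the helper name `ratio_to_mpi` is illustrative only.

```python
# Message payloads in bytes, copied from Table 10.7 (Laplace).
procs = [4, 8, 16, 32, 64, 128]
payload = {
    "MPI":       [9.36e8, 1.32e9, 2.55e9, 4.97e9, 9.81e9, 3.96e10],
    "SMG":       [9.65e9, 1.71e10, 3.99e10, 7e10, 2.98e12, 5.88e12],
    "SMG (Sub)": [9.46e8, 2.28e9, 5.17e9, 1.19e10, 2.87e10, 7.65e10],
}


def ratio_to_mpi(name):
    """Payload of a variant divided by the MPI payload, per processor count."""
    return [v / m for v, m in zip(payload[name], payload["MPI"])]


for name in ("SMG", "SMG (Sub)"):
    print(name.ljust(10), " ".join(f"{r:7.1f}" for r in ratio_to_mpi(name)))
```

With these figures the update protocol sends roughly 10 times the MPI payload at 4 processes and over 100 times at 64 and 128, whereas the subscription variant stays below 3 times the MPI payload throughout.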


Figure 10.10: Laplace Total Data sent


Figure 10.12: Laplace Speedup


SOR

Again, the major deficit of the update protocol is evident in the message payload figures, in this case in Table 10.8 or, in graphical form, in Figure 10.14. The SMG message payloads are significantly greater than those of the message-passing version. The numbers of messages are relatively similar in all cases (Figure 10.15). The benefits of the subscription protocol can be seen in the dramatic reduction in the data volume transferred compared to the SMG update protocol version, but the volume is still significantly higher than for MPI. As with the Laplace application, there is little speed-up from either SMG version, but as shown in Figure 10.16, for larger numbers of processors the performance of all variants deteriorates because the computation-to-communication ratio is too small to maintain scalability.

Figure 10.17 illustrates the DSM overhead associated with the use of the subscription protocol relative to the update protocol. This difference may be reduced if the rate at which the subscription protocol flushes the local process's own subscription list (see Section 7.4.2) is increased.

No. Procs    4          8          16         32          64          128
MPI          3.607E+8   4.391E+8   5.960E+8   9.098E+8    1.537E+9    2.793E+9
SMG          1.894E+9   4.056E+9   8.035E+9   1.562E+10   3.038E+10   5.949E+10
SMG (Sub)    5.398E+8   6.396E+8   1.179E+9   3.336E+9    7.708E+9    1.628E+10

Table 10.8: SOR - Message payload (in bytes)
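One way to read Table 10.8 is to ask how much each variant's payload grows between 4 and 128 processes, and how much the subscription protocol saves over the update protocol at each size. The sketch below uses only the table values; the names `growth` and `saving` are illustrative.

```python
# Message payloads in bytes, copied from Table 10.8 (SOR).
procs = [4, 8, 16, 32, 64, 128]
payload = {
    "MPI":       [3.607e8, 4.391e8, 5.960e8, 9.098e8, 1.537e9, 2.793e9],
    "SMG":       [1.894e9, 4.056e9, 8.035e9, 1.562e10, 3.038e10, 5.949e10],
    "SMG (Sub)": [5.398e8, 6.396e8, 1.179e9, 3.336e9, 7.708e9, 1.628e10],
}

# Growth factor of the payload from 4 to 128 processes, per variant.
growth = {name: vals[-1] / vals[0] for name, vals in payload.items()}

# Update-protocol payload divided by subscription-protocol payload.
saving = [u / s for u, s in zip(payload["SMG"], payload["SMG (Sub)"])]

for name, g in growth.items():
    print(f"{name:10s} grows {g:5.1f}x from 4 to 128 processes")
print("update / subscription:", " ".join(f"{r:4.1f}" for r in saving))
```

On these figures the MPI payload grows by a factor of under 8 over the range, while both SMG variants grow by a factor of about 30; the subscription protocol nonetheless sends several times less data than the update protocol at every processor count.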


Figure 10.15: SOR - Total Messages sent


Figure 10.17: SOR - Pagefault for different protocols
