Communication Buffers - Simulating Effects of System Memory Loss

5.4.4 Communication Buffers

The topics covered so far in this section have aimed at improving scalability in cases where strong scaling results in monotonically decreasing memory consumption. This is not always the case: as we saw at the start of this chapter in Figure 5.1(b), memory consumption can sometimes increase at large scale. One of the factors behind this increase is the use of communication buffers, either explicitly within the application or within the MPI implementation. Reserving a block of memory whose size is proportional to the number of ranks in the job will clearly increase memory consumption at large scale.
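The trade-off described above can be sketched with a toy model in which the per-rank share of the problem shrinks with the rank count while the communication buffers grow with it. The function name, the parameters and all of the sizes below are illustrative assumptions, not measurements from this work.

```python
def per_rank_memory(ranks, problem_bytes, fixed_bytes, buffer_bytes_per_peer):
    """Toy model of per-rank memory under strong scaling (all inputs hypothetical)."""
    domain = problem_bytes / ranks                  # shrinks as ranks are added
    buffers = buffer_bytes_per_peer * (ranks - 1)   # grows with the number of peers
    return domain + fixed_bytes + buffers


if __name__ == "__main__":
    MIB = 2**20
    problem = 64 * 1024 * MIB   # assumed total problem size: 64 GiB
    fixed = 50 * MIB            # assumed rank-independent overhead
    per_peer = MIB // 4         # assumed buffer reservation per peer: 256 KiB

    for ranks in (64, 256, 1024, 4096, 16384):
        total = per_rank_memory(ranks, problem, fixed, per_peer)
        print(f"{ranks:6d} ranks: {total / MIB:10.2f} MiB per rank")
```

Under these assumed values the per-rank total first falls and then rises again once the buffer term dominates, which is the non-monotonic behaviour the paragraph describes.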

In Chapter 6 we present a study on these effects, identifying and analysing the memory reserved by MPI for receiving communications from other processes.

5.5 Summary

In this chapter we have demonstrated how gaining knowledge about an application’s HWM, from tools such as WMTools and memP, allows us to analyse the effect of strong scaling on memory consumption.

By using a selection of scientific applications we illustrated how strong scaling can be used to accommodate reductions in memory-per-core. Utilising higher core counts to save memory has a negative impact on the application’s parallel efficiency, and we used this information to model the overall runtime increase incurred by memory-per-core reductions. Using the Maui scheduler we simulated the execution of three different workloads, composed of different job mixes, to study these effects.

We illustrated how a reduction in memory-per-core from 1.5 GB to 682 MB results in a workload runtime increase of 13.8%. We demonstrated the benefits of improving memory scalability by constructing a specific workload of memory scalable applications, for which the runtime increase was only 10.2% when memory-per-core was reduced from 1.5 GB to 256 MB.

We continued this chapter with a look at some of the causes of poor memory scalability, with specific focus on ghost cells and the way processor decompositions can affect their dominance. Using WMTools we were able to analyse the variation in HWM afforded by different processor decompositions (1D and 2D) on the SNAP benchmark, observing a 6% reduction in HWM at 512 cores. Similarly, we analysed different parallelism modes within SNAP, presenting a comparison of flat MPI with hybrid OpenMP and MPI. From this we were able to demonstrate a saving of nearly 43% with an accompanying 3% decrease in runtime, with SNAP at 1024 cores running one MPI task per NUMA region.
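As a rough illustration of why the processor decomposition affects the dominance of ghost cells, the sketch below compares the ghost-to-owned cell ratio of an interior rank for a 1D (slab) and a 2D (pencil) decomposition of a cubic grid. It is a generic geometric model under stated assumptions (a cubic grid, a halo one cell wide, exact divisibility, edge and corner cells ignored); it is not SNAP's actual decomposition or data layout, and the function names are hypothetical.

```python
def ghost_fraction_1d(n, ranks, halo=1):
    """Ghost-to-owned cell ratio for an n^3 grid cut into slabs along one axis."""
    owned = (n // ranks) * n * n
    ghost = 2 * halo * n * n            # two faces, each a full n x n plane
    return ghost / owned


def ghost_fraction_2d(n, px, py, halo=1):
    """Ghost-to-owned cell ratio for an n^3 grid cut into px x py pencils."""
    lx, ly = n // px, n // py           # local extent in the two split dimensions
    owned = lx * ly * n
    ghost = 2 * halo * (lx + ly) * n    # four faces; edge and corner cells ignored
    return ghost / owned


if __name__ == "__main__":
    n = 512                             # assumed grid edge length
    print("1D, 64 ranks :", ghost_fraction_1d(n, 64))
    print("2D, 8x8 ranks:", ghost_fraction_2d(n, 8, 8))
```

In this model the slab decomposition keeps full-plane ghost faces however thin the slab becomes, whereas the faces of a pencil shrink along with the subdomain.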

Using WMTools to analyse the memory consumption at node-level, rather than the per-core level available in other tools, enabled a fair comparison of these parallelism techniques.

6 MPI Memory Consumption

In this chapter we address the memory consumption of the MPI library at increased scale. Specifically we investigate a known problem of poor memory utilisation on InfiniBand network hardware [66, 67, 75, 115].

The problem stems from the necessity, within current implementations, to store certain information for each communicating pair of nodes in a job. Thus, as the core count of the job increases, the memory requirements of these communication buffers scale accordingly.

Firstly, we present an analysis of the problem, as exhibited by the OpenMPI MPI implementation on two different InfiniBand implementations: QLogic and Mellanox. This analysis is performed using WMTools, to identify the contribution to memory of the MPI library at the time of HWM.

Additionally, we present an investigation into available solutions, including runtime configurations and vendor-provided libraries, and evaluate their impact on both memory and application runtime.

We use WMTools in the context of Orthrus, a generic 3D implicit linear solver benchmark developed at AWE, to analyse this MPI memory consumption on specific InfiniBand implementations. Our analysis utilises two machines with InfiniBand from different vendors, QLogic on Cab and Mellanox on Kay, to understand the fundamental differences in MPI memory consumption. The similarities in the platforms, as expressed in the system diagrams in Appendix A.1, allow us to compare results with a high degree of confidence.

Using Orthrus to drive the PETSc solver library [8] using the Block Jacobi preconditioner and GMRES solver, we solve a 50³ per-core weak scaled problem. On both platforms we utilised the Intel 12 compiler and OpenMPI 1.6.3 to build and run Orthrus.

Our reason for using the Orthrus benchmark in this chapter is its internal communication structure. The dependence on point-to-point communications, and the associated scaling of this communication pattern, makes it a well-suited candidate for examining MPI memory consumption. Due to these characteristics Orthrus had previously exhibited memory scalability issues, and thus represented a good candidate for memory analysis. Whilst the artefacts presented in this chapter are exposed through Orthrus, they exist in all codes, with varying magnitudes. As such, any memory savings presented here will be proportional to the initial artefact and will be code dependent.

We use this benchmark application to demonstrate the memory scaling issues of MPI under certain conditions, allowing us to identify hardware and software configurations of specific interest. We examine the effectiveness of runtime configurations, where communication buffer sizes are constrained, in reducing MPI memory consumption. Lastly, we investigate the use of vendor-specific communication libraries, allowing the optimisation of communication protocols for specific hardware.

The MPI consumption analysis research presented in this chapter was published in [102].

6.1 InfiniBand Communication

InfiniBand supports five modes of transport: Reliable Connection (RC), Reliable Datagram (RD), Unreliable Connection (UC), Unreliable Datagram (UD) and Raw Datagram. RC is the most common strategy amongst MPI implementations, due to its support for Remote Direct Memory Access (RDMA) and the enhanced performance this brings. RC and UC require a connection to be made between every queue pair (QP), with memory allocated in the event of communication, an inherently non-scalable method. RD is similar to RC but is designed to be inherently more scalable, as only one QP is used to communicate with other RD QPs.
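The difference in scalability between the connected and datagram modes can be sketched by counting the QPs a single process needs. This is a worst-case model that assumes every process communicates with every other process; because memory is only allocated in the event of communication, real jobs may establish fewer connections. The function names are hypothetical.

```python
def rc_queue_pairs_per_process(processes):
    """Connected transport: one QP for each remote peer (worst case, all-to-all)."""
    return max(processes - 1, 0)


def rd_queue_pairs_per_process(processes):
    """Datagram-style transport: a single QP can reach every other peer."""
    return 1 if processes > 1 else 0


if __name__ == "__main__":
    for processes in (16, 256, 4096):
        rc = rc_queue_pairs_per_process(processes)
        rd = rd_queue_pairs_per_process(processes)
        print(f"{processes:5d} processes: RC needs {rc:5d} QPs per process, RD needs {rd}")
```

The linear growth of the connected case per process is what makes it non-scalable: summed over the whole job, the number of connection endpoints grows quadratically with the process count.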

UC and UD differ from RC and RD as they do not provide acknowledgements for messages, and therefore are often impractical for MPI network connections. Raw Datagram provides the facility to communicate messages which are not interpreted.

Messages sent between QPs are each tracked by a send Work Queue Entry (WQE); the number of WQEs allotted per QP therefore defines the maximum number of outstanding send-receive operations.
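The relationship between queue depth, outstanding operations and memory can be sketched as follows. The class name, field names and the example values are hypothetical assumptions chosen for illustration; real WQE sizes and queue depths depend on the hardware and the MPI configuration.

```python
from dataclasses import dataclass


@dataclass
class QueuePairConfig:
    """Hypothetical description of one QP's send queue."""
    send_wqes: int   # number of WQEs allotted to the send queue
    wqe_bytes: int   # assumed size of one WQE in bytes

    def max_outstanding_sends(self):
        # Each in-flight message occupies one WQE, so the queue depth is the limit.
        return self.send_wqes

    def send_queue_bytes(self):
        return self.send_wqes * self.wqe_bytes


def send_queue_bytes_per_process(config, connected_peers):
    """Send-queue memory for one process holding one QP per connected peer."""
    return connected_peers * config.send_queue_bytes()


if __name__ == "__main__":
    config = QueuePairConfig(send_wqes=128, wqe_bytes=64)   # assumed values
    for peers in (15, 255, 1023):
        total = send_queue_bytes_per_process(config, peers)
        print(f"{peers:5d} peers: {total / 1024:8.1f} KiB of send queues")
```

A deeper queue permits more outstanding operations per peer, but its cost is multiplied by the number of connected peers, which is why constraining the queue depth is one route to reducing memory.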
