Implementation Issues - The SMG DSM system: enabling shared memory for the grid

R-GMA was chosen for the information & monitoring system as it offers greater flexibility in the accessing of information, however, many issues arose from features of this system.

• The run-time monitoring system (enabled by R-GMA) can only handle a limited amount of events (˜80,000) events at any given time. To enable more events within a given period then an offline monitoring system is required, with logging to files during the execution of the application.

• Additionally, R-GMA imposes a limit of 256 characters on the length of a string variable. While this is not a problem in the majority of cases it does restrict the length of the file paths and the developer’s ability in the ’tagging’ of applications.

• All processes must register with the information system. The master process will poll the system waiting for this to occur. Start-up messages are sent to all processes once notification has been received and the topology map has been generated. This results in very poor start-up times.

• The use of the R-GMA information system required the use on non-GLUE- compliant tables.

• Currently the R-GMA C APIs are in a state of flux so the most recent releases have not been used.

• The impact of logging system information necessitated the use of a thread to perform this task (it takes a relatively long time to insert data into R-GMA). As a separate thread is required for the DSM system an additional thread will steal resources, so it must be run at a lower priority.

• The front-end ’hybridisation’ tool is limited in its functionality. It can only perform set queries on the monitoring data it obtains.

• The hybridisation tool is also restricted in its ability to identify potential areas, or the interval, of code for hybridisation to within one file. It should be noted that this is a typical case of OpenMP applications where parallel regions must be self-contained in a single lexical context.

• The hybridisation tool will have problems resolving the correct code locations to the developer where a translator has converted OpenMP code to SMG DSM code.

CHAPTER

10

Evaluation

The primary goal for this thesis was to explore the use of the shared memory program- ming model with grids. To achieve this successfully, communications must be efficiently and sparingly used. The mechanisms developed in the course of this thesis are primar- ily concerned with achieving this while contributing little additional overhead. Below some analysis and evaluation is performed. Three main contributions are evaluated: the subscription protocol, hybridisation, and providing topology information to enable efficient synchronisation operations while operating in a grid environment. The evaluation objectives can be categorised under four main headings:

1. Evaluation of the SMG DSM as an execution model compared with MPI.

2. Evaluation of the hybridisation approach, where more efficient message-passing mechanisms can incrementally replace DSM actions when circumstances dictate that performance improvements can be obtained. The process of incremental hybridisation is also explored.

3. Evaluation of the SMG DSM with and without a grid information system.

4. The evaluation of SMG on a grid with emulation of the overhead of inter-site latencies versus SMG on a cluster of identical compute resources.

A number of applications have been used. While some have been unsuitable for one feature they have proven suitable to evaluate another. The metrics employed are overall performance and scalability (through efficient use of communication resources).

Note: All graphs that feature the number of processors in an application are presented in a logarithmic scale, Log2(No. Processes), see Figure 10.1.

EXPERIMENTAL METHODOLOGY 148

10.1 Experimental Methodology

The DSM components are represented in Figure 10.2. The DSM communication uses the same message-passing libraries as the MPI version of the test applications. An assortment of applications has been taken and versions constructed for all the features that needed to be tested. In this section the applications used are described and why they were chosen. The execution environment is described. As cross-site multi-threaded MPI is not available at present, the grid is simulated. The interesting aspects like page- faulting, page protection, write trapping/ write collection strategies, etc., are all validly tested by the simulations.

10.1.1 Test applications

Test applications were chosen that range from those that have no real basis in parallel computing to those that are embarrassingly parallel. Marinov et al [157] list characteristics of DSM applications. The test applications are routines of real value themselves but are examples that exhibit the temporal/spatial memory sharing patterns of ’real’ applications [158].

While this chapter adopts a light touch regarding presentation of results, these are the distillation from approximately 8700 (cumulative) wall-clock hours of high performance computing experiments. The applications used to evaluate SMG are: Embarrassingly Parallel (EP), Conjugate Gradient (CG), Fast Fourier Transform (FFT), and Integer Sort (IS) from the NAS benchmarks [159, 160]; Barnes-Hut from the Splash Bench- marks [161]; and the common Matrix, SOR, Laplace, TSP, and Gauss benchmarks that implement well known routines. EP and Matrix were used to benchmark the overhead of the SMG DSM engine, while nearest neighbour applications like SOR and Laplace were used to evaluate the sharing performance and communication efficiency of the DSM. IS, Gauss, and Barnes-Hut demonstrate the use of different synchronisation primitives. FFT demonstrates the ability to construct an application with irregular access patterns that SMG finds difficult to handle. Other applications have been ported to SMG and MPI but were eventually deemed unsuitable for further evaluation for a number of reasons, mostly that their results were too similar to the chosen applications to be relevant, e.g. Jacobi is in the class of nearest neighbour applications.

A number of versions of each application were required for evaluating different compar- isons. These versions included MPI, SMG using the update protocol, SMG using the Subscription protocol, SMG using multiple user threads, SMG-MPI hybrid, and SMG for a Grid with/without access to an information system.

The various characteristic types of shared memory accesses that occur, described by [34], are discussed in Section 4.1. The SPLASH benchmarks have some technical considera- tions [161] when ported to non-shared memory platforms. Lu’s Thesis [162] identifies the problems with comparing a message passing version (PVM) of some of the implementation of the SPLASH benchmarks against a shared-memory style (Treadmarks DSM), i.e. some of them are just not suitable to DSM because of the excessive communication to computation ratio. Other work has demonstrated the difficulties in running NPB over

EXPERIMENTAL METHODOLOGY 149

DSM (Scash, OdinMP)[163]. Unfortunately there are no credible benchmarks that have been explicitly developed for benchmarking the use of computational grids. There are grid applications that portend to do this [164], but really don’t (e.g. work-flow using NAS benchmarks). Where the MPI programs were developed for this thesis, they were constrained to use the same algorithm employed by the DSM versions, so they might not be the most efficient implementation for the available resources.

Only the results for EP, Matrix, Laplace and SOR are presented here, as the others simply confirm the same salient results for these four applications. For these four applications, the term P used below refers to the number of processes, and N refers to the dimension of the application. Where MPI collective operations are required the effective message count is taken to be equal to 2 × (P−1) in the calculation of the total amount of messages generated.

Embarrassingly Parallel

The Embarrassingly Parallel (EP) benchmark is part of the NAS suite and generates pairs of Gaussian random deviates. This benchmark is representative of many Monte- Carlo style simulation applications in that it is heavily computation-intensive with little communication/synchronisation. This application was chosen as it is useful in establish- ing a base reference for the computational capacity of the system.

It can be readily seen that the parallel processes can be independently generated. The program generates nπ/4 Gaussian pairs per process. The only communication is the gathering of an array of results from the processes that is done at the end by the root process. This only point of communication is between the root process and all other processes and only involves (P-1) messages in total.

The SMG implementation allocates a small shared array where the resultant data is deposited by each process. The MPI implementation uses a gather operation to sum the partial results of all processes. The total data communicated with both is approximately (P -1) * N. For all test scenarios it was assumed the value N = 36 (the maximum for the NAS application).

Matrix

It is only natural that the matrix multiplication example that was used in Section 2.1 to demonstrate the benefits of parallel computing be used as one of the evaluation applications. The application multiplies two dense matrices A and B, with dimensions

N × M andM × P, to produce a resultant matrix C = AB with dimensionsN × P. As previously discussed this is an easily parallelised application, where each thread of execution computes a subset of the resultant matrix.

Like the EP benchmark, matrix multiplication is of the embarrassingly parallel class. It has been seen already that a naive implementation of Equation 10.1, matrix multiplication of two square matrices with dimension N, has computation that is O(N3_{), while} communication isO(N2), so for relatively large values of N this is an ideal application for parallelisation.

EXPERIMENTAL METHODOLOGY 150

In the DSM version all of the matrices are allocated as a single shared region. There are two communication regions where both MPI and DSM versions have a root process that initialises the incident matrices A and B, and distributes them to other processes for computation. When computation is finished all computed portions of C are gathered by the root process. Square matrices are assumed with dimension 6144. For the integer matrix multiply version the storage requirement for the three matrices is 432MiB, although a floating-point version is also available (given the same dimensions the storage requirement becomes 860MiB).

Cij = m X n=1 ainbnj (10.1) Laplace

Laplace is a simple iterative stencil application that implements an algorithm for a stripped-down version of the Jacobi transformation method of matrix diagonalization for approximating the solution to a linear system of equations. During each iteration each element is updated based on the values of its nearest neighbours (usually the av- erage, i.e. Equation 10.2), with the boundary values remaining fixed. The Laplace application, although fundamentally solving the same problem as SOR, uses a different algorithm resulting in half the number global barrier calls and a commensurate number of messages. As Laplace converges more slowly than SOR, with each grid point changing slowly with each iteration, extremely large volumes of data are transmitted. Iterative schemes of this type require time to achieve sufficient accuracy and are reserved for large systems of equations where there are a majority of zero elements in the matrix. The implementation of the algorithm assumes the contrary, with many of the elements being initialised to non-zero values.

The computation complexity isO(N2) with communication O(N) for an efficient implementation (such as the MPI implementation). However the DSM versions can generate traffic of O(N2). [58] derives a metric from the equations expressed in Section 3.3 for obtaining the potential speedup for this class of stencil operations in a Grid environment. For computation involving processors distributed across grid sites the dimensions would need to be very substantial; this magnitude is not supported in SMG at present due to the limitations with the virtual address space size.

The DSM application is implemented using barrier synchronisation primitives with a bound shared memory region for the grid. Like the SOR application below, the dimensions for the problem size are 6144 X 6144, with each element being a double-precision floating point value, giving a shared region requirement of 288MiB when the application is implemented using a naive approach with a single shared memory region for the whole application grid. This involves all modifications to the whole shared array being transferred among among all processes at the end of each iteration. For the MPI version only the region to be processed and the neighbouring strips need to be transferred between the processes.

EXPERIMENTAL METHODOLOGY 151

N ewA[i][j] = (A[i−1][j] +A[i+ 1][j] +A[i][j−1] +A[i][j+ 1]) /4 (10.2)

SOR

Successive-Over-Relaxation (SOR) is a simple iterative relaxation algorithm. It is one of the numerical methods for solving partial differential equations. The equations are represented discretely using a two dimensional array. During each iteration each element of the array is updated as a function of its nearest neighbours and a given relaxation parameter, omega, typically as defined by Equation 10.3. The implementation for this thesis uses the standard red-black approach to prevent a node overwriting a value before it is accessed by a neighbour. Each iteration is divided in two, with alternate elements updated in each half, i.e. there are two synchronisations per iteration. The number of iterations can be fixed or variable, dependent upon a stopping condition that is usually specified by the user.

Like the Laplace application (which is from the same preconditioner class of applications) the complexity of the algorithm for a system of sizeN is a fairly modest O(N2), so it makes a poor candidate for parallelisation as there is considerable overhead (no matter what the implementation) in distributing the initial matrix and communicating partial results. The communication requirement is O(N2), as with every iteration all neighbouring values must be exchanged between processes.

In the DSM versions barriers are used to synchronise the processes, while the MPI versions use blocking send/receive pairs. The input data for the application is a square matrix with dimensions 6144 X 6144, giving a shared memory region of 288MiB (larger sizes are possible, but the value of N = 6144 was chosen to be consistent with other applications). For the SMG version all processes allocate the shared region locally. For the MPI version only the region to be processed and the neighbouring strips need to be transferred between the processes. SMG allocates a single shared memory region of size (X * Y * sizeof(value)).

N ewA[i][j] = ((A[i−1][j] + A[i+ 1][j] + A[i][j−1] + A[i][j+ 1]) (10.3) ∗ omega ∗ 0.25) + (1−omega) ∗ A[i][j]

10.1.2 Testbed Description

The basic characteristics of the machines used are given in Table 10.1. Basic machine metrics were obtained using lmbench [165]. The systems were the Moloch and IITAC clusters from the Trinity Centre for High-Performance Computing (TCHPC) [1], and the Walton cluster from the Irish Centre for High-End Computing (ICHEC) [2]. As of writing, the latter two occupy positions No.345 and No.367 respectively of the Top 500 List [166]. Walton is not relevant to the the four applications discussed here, so will not be mentioned further. All machines run a version of Linux that is compatible with

EXPERIMENTAL METHODOLOGY 152

the needs of SMG, i.e. kernel version 2.4.5 or later. The salient attributes regarding OS operations (i.e. system calls used extensively in SMG like memory mapping/protection) and pthread functions for each system are given in Appendix A, while the physical specifications are given in Table 10.1.

The SMG applications use the exact same message passing library as that used for the MPI applications, i.e. MPICH2-1.0.4, using thech3:sock communication device. It must be noted that while the more efficient Infiniband Interconnect is available on the IITAC cluster, it was not used as the MPI distribution MVAPICH2 [167] (which is based on the MPICH2 distribution) has until recently not supported multi-threaded MPI applications; the eventual support came too late for the use of this interconnect to be considered. As the Gigabit Ethernet is only intended for management of the cluster, the bandwidth and latency figures (see Appendix A, page 191) are much lower that what one would expect from this interconnect.

Attribute Moloch IITAC

Num. Nodes 65 346

Node Type IBM x335 IBM e326

Num. CPU 2 2

CPU Model Intel Xeon 3.06 Ghz AMD Opteron 250 2.4 Ghz

L1 Cache 16Kb I, 16Kb D 64Kb I, 64K D

L2 Cache 512Kb 1024Kb

Memory 2GB 400Mhz DDR 4GB DDR PC3200

Interconnect 2 X Gbit Ethernet 2 X Gbit Ethernet +

10Gbit InfiniBand (IB) Table 10.1: Infrastructure attributes

The latency and bandwidth metrics for communications between nodes in the systems are given in Appendix A. These messaging costs are inclusive of the overhead associated with the MPI implementation.

Grid Simulation

There is no grid-enabled MPI implementation that currently supports multi-threaded applications (a requirement of SMG). All the testing therefore employed a non-grid flavour of MPI. The SMG DSM was tested in a virtual grid environment, with the number of sites configurable at run-time through file-based information (the systems used do not have the required software for use with the R-GMA information system).

• Single-Site: for single site MPI configuration, where there are no benefits from the information component, jobs could nonetheless be submitted that access the information & monitoring system. Applications were able to make use of the file based monitoring component for logging information.

EXPERIMENTAL METHODOLOGY 153

• Multi-Site: as the available grid infrastructure did not support cross-site multi- threaded MPI, it was simulated using the same systems outlined above. This was achieved through the simulation of the characteristics (i.e. latency and bandwidth) of the inter-site communication links for a hypothetical grid consisting of four sites. It would have been better to have obtained valid data from the grid information system, but alas there was no grid information system in the simulation, so the latencies between sites were simulated using data obtained using the tools at www.hea.net. The figures for bandwidth (Table 10.2) were taken from the MRTG tool, while latencies (given in Table 10.3) were obtained using theLooking glass tool. The actual figure for estimated bandwidth was derived from the differ- ences between the stated figures for the maximum bandwidth obtainable and the maximum bandwidth utilised in a month. The bandwidth between two sites was determined to be the minimum of the estimated bandwidths of the two sites. These values were fed into the information system implementation, and this in turn supplied these values to a modified version of the MPI implementation of the SMG communications API (see Listing 6.1) that provided the simulation of delays between grid sites.

For multi-site grid simulations the file-backed approach was used to provide user- specified information such as site topology (this allowed logical partitioning of one of the systems described above into different sites), site information (memory, CPU power) and the site interconnect network information.

The information was accessed in any of these scenarios using the same defined SMG API (see Listing 9.1).

The system performance measurements are not representative, since the DSM communication was unable to avail of blocking receive MPI calls. However, the significant memory overhead associated with the DSM (update coherence) persists, as well as the

In document The SMG DSM system: enabling shared memory for the grid (Page 165-177)