Shared memory and loop-parallel programming are the favoured methods of implement- ing parallel applications due to the lower burden placed on the programmer. Explicit data movement does not occur with shared memory models, whereas with message pass- ing it must.
Message-passing is used extensively on distributed memory machines, allowing the exe- cution of applications written in this model to execute on almost all current platforms, including those with shared memory. Often a recompile is all that’s required for this to be achieved. So why not just use message passing exclusively? It always comes down to the extra burden placed on the programmer.
The future trends of parallel computer architecture will dictate the need for hybrid programming models that fully exploit the potential offered by constellations of SMPs (particularly those with multi-core CPUs). Already research exists in exploring the use of mixed-mode programming models with combinations of message passing and loop-parallel/shared memory [38, 39, 40]. Although considered by many to employ a loop-parallelism model, OpenMP can be viewed very much as a hybrid. Development involving both OpenMP and message passing applications can be found in the most powerful supercomputers (such as the Earth Simulator [19]), which are in effect a con- stellation of SMPs. Message passing is used to communicate between nodes, while shared memory can be used internally in the SMP.
There are new problems that arise when dealing with a hybrid programming model: the developer needs to be knowledgeable in two programming models. The readability of code is significantly reduced, especially for third-parties (those who did not write the program); this will be an issue with maintenance of such code, and importantly how easy will it be to debug?
CHAPTER
3
Wide-Area Parallel Computing
In the previous chapter the motivations for parallel computing were explored. To recap, it essentially allows for an application that was compute bound to complete in a timely manner when executed on a system with increased processing capacity (Amdahl’ Law), and/or allows for increasing the accuracy of the application results by increasing the resolution at which the solution is computed (essentially Gustafson’s Observation). In this chapter we explore what will take parallel computing to a wide-area environment. Previous parallel computing research focused on the benefits of federating resources from the point of view of the computational instruction execution rate. The increase of ag- gregate memory resources was a lesser issue. There are anticipated applications [24], that will benefit enormously from the increased parallelisation offered by such an in- creased memory: image rendering, parallel search, speech and visual recognition, ge- netic sequencing, data synthesis, and data-mining. These applications will not only have increased computational requirements, but will also have a commensurate memory resource requirement. The challenge will be to harness both increases in the memory and computational resources.
To support this new class of parallel application we must be able to effectively pool these resources into a unified resource. Consider an application where the required memory resource is greater than the available resources, as the application accesses virtual mem- ory pages the corresponding physical pages may need to be migrated to the physical resource. Pages that are deemed less necessary will begin to be swapped out to a swap space that is usually located on a dedicated partition on the local secondary storage system. This storage is slow enough [41], that it is feasible to buffer the page in the memory of an adjacent (in network terms) machine. Increasing the total number of such machines will increase the total memory pool available for application use. If all the available machines have similar demands, then this approach is pointless, although machines might be dedicated to this purpose.
Increased parallelism can most easily be achieved by hiding latency, and the the most promising way to achieve this will be a weakening in memory consistency [42]. However, in the context of the claims made in the above paragraph, a latent effect of doing this will
WIDE AREA PARALLEL COMPUTING PLATFORMS 28
be a commensurate increase in the memory resource usage. To access the benefits of the parallelism with grid computing, two principal metrics must be favourable: scalability and efficiency.
3.1
Wide Area Parallel Computing Platforms
The natural extension of the cluster paradigm described in Section 2.2.2 is to combine resources at multiple sites into a coupled system as depicted in Figure 3.1, referred to as a meta-computer, or when many sites are involved in a formal manner, a grid. Making such disparate entities appear a unified resource to a user requires a middleware support ’glue’ layer to solve problems involving areas such as administration, security, etc. The grid middleware aims to abstract the resources to provide standard methods and interfaces to access the resources at each site. Multiple grid middlewares exist, each developed by different projects: Legion [43], Unicore [44], and more recently EGEE (Enabling Grids for E-sciencE) [45].
Figure 3.1: Grid: wide-area distributed computing
Attempting to use such collections of resources will result in the effects of physical laws governing communication such as bandwidth and latency (see Section 5.1 for more infor- mation) becoming more apparent. A number of grid projects are examining wide area computing theory by providing high-bandwidth, low-latency and reasonably determin- istic connections between sites, e.g. Naregi [46] in Japan and DAS-3 in Holland, both linking high performance centres to create a giant supercomputer [47]. Naregi is designed for a maximum latency of 25ms (800 km). DAS-3 reduces this further to 3ms (the la- tency across Holland, and coincidentally the latency for an access to a hard-disk [41]).
PROGRAMMING MODELS FOR WIDE AREA PARALLEL COMPUTING 29
As in smaller-scale systems, the available parallel computing architectures for such wide- area environments are still:
• wide-area shared memory: such an architecture is barely feasible unless the memory consistency is sufficiently weak to hide high latencies. It is conceivable, but highly impractical to construct a wide area interconnect with hardware support for shared memory. Shared memory could only be realistically supported by virtually sharing memory resources while hiding latencies, effectively via a DSM on a grid. Latencies as low as those for DAS-3 do improve the situation.
• wide-area distributed memory: essentially this a Grid. MPI software imple- mentations exist, e.g. PAC-X [48], MPICH-G2 [49], and GridMPI [50], that allow multiple compute elements to appear as a single unified resource, i.e. ’a cluster of clusters’, by adding a transport mechanism between the native MPI implementa- tions used on each cluster. For PAC-X there is a gateway process at each site that proxies all messages for a non-local process to the relevant gateway process in the remote cluster. Such software is discussed further in Appendix D.
The function of the middleware is to provide the functionality to authenticate the user and to perform job and data management functions. Using a Grid requires interaction with this middleware that manages the differences of the underlying architecture and platforms. For example, if a user possesses appropriate grid credentials they can sign onto the EGEE gLite grid at a gateway machine and then utilise any EGEE resources (compute/storage) that are available to them, and perform other operations such as using the information systems (see Appendix D, page 226). Compute jobs are submitted using a commandgLite-job-submit. Each submitted job will be assigned a unique job identifier that can be used to check status of the job. Once the job has completed the output may be retrieved using the commandgLite-job-get-output. Similar abstractions are going to be needed to run parallel jobs. Other commands exist for data management operations.