Very recently Gomperts et al. [98] reported the scalability of the Gaussian code for frequency calculations on the SGI Altix 450 cc-NUMA machine [240], for a coupled perturbed Hartree- Fock (CPHF) [74, 203, 215] calculation to obtain the infra-red and vibrational circular dichro- ism (VCD) spectra [68] for theα-Pinene7 molecule. They observed a performance anomaly where the first four threads’ execution time was far less than those of the other twenty-eight threads, within link l1002. The cause was shown to be the near simulataneous data-loads of the Density matrix by all the worker threads. By using code modifications, which involve creating thread-local datastructures at runtime, they were able to improve performance by a factor of 2. Performance analysis was done using an SGI developed profiling tool called ‘Histx’, which works by sampling the hardware performance counters for the Itanium2 processor during ap- plication execution. Profiling revealed the default first-touch policy led to all memory being
Figure 5.9: Speedup plot for C60 using a cc-PVTZ basis set for the HF, BLYP, B3LYP meth- ods.
(a) HF (Base time: 1275 seconds)
(b) BLYP (Base time: 500 seconds)
allocated on one NUMA node. They re-coded Gaussian to create node local copies through the creation of temporary arrays in parallel regions and relying on the first-touch policy to ensure pages were local to the thread of execution. (An approach similar to this was used in 5.4 of this chapter). They go on to propose modifications to theFirstPrivateOpenMP clause to affect node local placement. There is general agreement with the conclusions reached in [98] and work performed in this chapter, wherein for different code paths within Gaussian (the CPHF code) and different test molecule, the need for locality aware access to data im- pacts performance. In our work we were able to affect application driven thread and memory placement to ensure node local placement. We found the use of dynamic migration of the Fock matrix and node interleaving of the Density matrix beneficial in improving scalability for the HF, BLYP and B3LYP methods. The overheads for migration were outweighed by the benefits of node local data placement.
The NUMA and multi-thread extended LPM is a simple model which uses an additive cost model of data items and scaling factors to predict performance. In the following section, we review previous work done in the area of performance modelling of NUMA platforms.
Yang [309] develops a queuing model for a cache-based multiprocessor system that uses hierarchical buses, and which resembles modern multi-core CPUs. The model captures system bus contention as well as cache and memory interference for a general memory reference pattern. The modelling was validated against simulation and it was found that bus traffic for enforcing coherency was significant and hence Yang proposes an adaptive cache coherence protocol which allows for multiple copies of shared data within a set of processors that share the same interconnect link to main memory.
Torrellas et. al. [281] study the performance of a hierarchical shared-memory multipro- cessor. They develop an analytic model of traffic in a machine which mimics the Stanford DASH [158] shared-memory multiprocessor system. By using traces collected for a 16 pro- cessor configuration they predict performance for a 256 processor system. Three major factors are identified in their study as influencing performance (a) locality of data access, (b) the amount of data sharing between threads, and (c) the available bandwidth in a cluster. A key recommendation for reducing bus contention is to facilitate direct access to memory without involving the bus used by the attached processors. The SGI Altix system implements this rec- ommendation by virtue of its hardware SHUB [166] that responds to remote memory requests for a group of Itanium2 processors that share the same bus.
Zhang and Qin [316] present several analytic models to predict the overhead of opera- tions (scheduling, synchronization, data layout and access patterns) which impact the perfor- mance of a NUMA multiprocessor system. Their models incorporate memory and network contention. It assumes that memory requests are uniformly spread across all the memory mod- ules, this is done to facilitate quick numerical solutions for their models. A key finding of
their modelling is that the rate of remote-memory requests could be used to predict the remote- memory access delay.
LaRowe et. al. [143] implement parametrized dynamic page migration strategies and fol- low on to measure the performance of parallel programs using these strategies as well as de- veloping an analytical model of the memory system performance for a NUMA system using mean-value analysis. Their measurements and modelling shows that replication of commonly used pages is beneficial and avoids pages bouncing between NUMA memory nodes.
Ring based interconnects are used in multi-core microprocessors notably the Cell BE [8], the Core i7 [152] and Larabee [238]. Holliday and Stumm [117] investigate the performance of hierarchical, ring-based shared memory multiprocessors using simulation. The simulator was validated against the Hector shared-memory system [293]. Instead of using trace or execution driven simulation, a synthetic workload model was used to enable fast turn-around for a 1024 processor system. Their key findings were that maximizing locality in applications reduces memory contention; multiple memory banks are required to facilitate multiple outstanding requests to memory; an adaptive maximum number of outstanding memory transactions is needed to adjust for and aid computation or communication efficiency subject to changes in communication locality; and processor and memory subsystem design needs to be balanced to avoid creating system hot-spots.
Bhuyan et. al. [25] use simulation to quantify the effects of memory management policies for scientific applications on NUMA platforms which use a multistage switching network. A key finding was that the degree of performance improvement of an application is both depen- dent on the memory management technique and the switch architecture.
Kaeli et. al. [144] present performance analysis of a CC-NUMA prototype machine devel- oped at the IBM T. J. Watson research center. By using hardware instrumentation, traces were obtained for transaction processing benchmarks to identify which elements in the OS or user code were responsible for inter-node references. An analytic model of the prototype system was created to evaluate the effect of architectural changes. Some key results are (a) replication of read-only pages is critical in reducing the number of non-local references; (b) process pages should be placed local to the owing process during initial page allocation; (c) there needs to be an API to provide the OS with semantic hints about page contents and its placement.
Schmollinger and Kaufmann [236] present an extension to the BSP model [288] of parallel computation to aid in mapping algorithms onto clusters of SMP machines (NUMA machines with a hierarchical interconnect).
Nord´en [196] presents an analytic model describing OpenMP PDE solvers which take into account the NUMA ratio; a locality and optimal locality factor. Using this model Nord´en shows that ordered local PDE methods are insensitive to high NUMA ratios and allows these algorithms to scale well on any NUMA system. The modelling also indicates that there are
great gains to be had from using an optimal data distribution.