A linear performance model (LPM) was proposed to model the measured execution time for ERI evaluation. PPCoeffs (α, γ) were obtained for a set of benchmark systems. The α PPCoeff reflects how well the code uses the superscalar resources of the processor, while γ corresponds to the average cost in cycles of an L2 miss.
In this chapter, the LPM was shown to produce good fits for measured cycle counts obtained using hardware performance counters. Using a set of test molecular systems, it was found that the optimal blocking factor is both platform and computation specific.
Evaluation of the LPM in conjunction with functional cache simulation shows that it is able to reproduce the trends and cycle counts as a function of the cache blocking factor. A parametric cache variation study of the PRISM algorithm was performed, and this showed that L1 line size and total L2 size affect the algorithm’s performance. Detailed breakdowns of read and write misses show that PRISM’s cache performance is limited by read misses generated during ERI computation.
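As an illustration of how such coefficients can be obtained, the following is a minimal sketch that fits α and γ by ordinary least squares. It assumes a two-term form, cycles ≈ α · instructions + γ · L2 misses, which is inferred here from the two coefficients named above and may omit terms of the full model. The function name `fit_lpm` and the sample counter values are hypothetical and synthetic, not measured data.

```python
def fit_lpm(instructions, l2_misses, cycles):
    """Least-squares fit of cycles ~= alpha * instructions + gamma * l2_misses.

    Assumed two-term model form; solves the 2x2 normal equations directly.
    """
    s_ii = sum(i * i for i in instructions)
    s_im = sum(i * m for i, m in zip(instructions, l2_misses))
    s_mm = sum(m * m for m in l2_misses)
    s_ic = sum(i * c for i, c in zip(instructions, cycles))
    s_mc = sum(m * c for m, c in zip(l2_misses, cycles))

    det = s_ii * s_mm - s_im * s_im
    if det == 0:
        raise ValueError("samples are collinear; alpha and gamma are not identifiable")

    # Cramer's rule on the normal equations.
    alpha = (s_ic * s_mm - s_im * s_mc) / det
    gamma = (s_ii * s_mc - s_im * s_ic) / det
    return alpha, gamma


# Synthetic counter samples (one per benchmark run), generated from
# alpha = 0.5 cycles/instruction and gamma = 200 cycles/L2 miss.
instructions = [1.0e9, 2.0e9, 3.0e9, 4.0e9]
l2_misses = [1.0e6, 5.0e6, 2.0e6, 8.0e6]
cycles = [0.5 * i + 200.0 * m for i, m in zip(instructions, l2_misses)]

alpha, gamma = fit_lpm(instructions, l2_misses, cycles)
```

With noise-free synthetic input the fit recovers the generating coefficients. With real counter data the residuals indicate how well the linear form describes the measurements.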
4.8.1 Future Work
For future work, it would be useful to test the use of the LPM at runtime to aid in searching for optimal blocking factors. Data could be gathered for the first couple of SCF cycles to obtain PPCoeffs. The blocking factor could then be varied, starting from the default and taking measurements for an increased and a decreased blocking factor. Once the blocking factor with the lowest cycle count has been identified, it can be used for the remaining SCF cycles.
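The search described above can be sketched as a simple local search around the default blocking factor. This is only one possible reading of the proposal: the function `search_blocking_factor`, the callback `measure_cycles`, the step size, and the bounds are all hypothetical names and parameters introduced for illustration. The quadratic cost function below is a stand-in for cycle counts that would come from hardware performance counters during an SCF cycle.

```python
def search_blocking_factor(measure_cycles, default, step, lower, upper):
    """Local search: try one step below and above the current best
    blocking factor, move to whichever is cheaper, and stop when
    neither neighbour improves on the current best."""
    best = default
    best_cycles = measure_cycles(best)
    tried = {best: best_cycles}  # avoid re-measuring a blocking factor

    while True:
        improved = False
        for candidate in (best - step, best + step):
            if candidate < lower or candidate > upper:
                continue
            if candidate not in tried:
                tried[candidate] = measure_cycles(candidate)
            if tried[candidate] < best_cycles:
                best, best_cycles = candidate, tried[candidate]
                improved = True
        if not improved:
            return best, best_cycles


def toy_cycles(blocking_factor):
    # Stand-in cost with a minimum at 384; a real run would read
    # cycle counts measured over one SCF cycle.
    return (blocking_factor - 384) ** 2 + 1000


best, cycles = search_blocking_factor(
    toy_cycles, default=256, step=64, lower=64, upper=1024
)
```

Each call to `measure_cycles` corresponds to one SCF cycle run with a trial blocking factor, so the number of trial cycles is bounded by the number of distinct candidates visited.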
Using the LPM, a training run could be performed to determine the optimal cache blocking factor for any given hardware architecture, using a range of molecular systems. The results could be weighted to determine if one cache blocking factor would be universally suitable on this hardware architecture. This training would be carried out prior to deploying Gaussian for production use.
Wallin et al. [294] point out in their paper that increased cacheline size aids scientific applications. The design of modern microprocessors is largely driven by commercial workloads, most of which have poor data locality and suffer from false sharing, and thus the cacheline sizes of microprocessors are usually 64 bytes. Wallin et al. advocate the judicious use of prefetching to mimic the effect of larger cachelines.
Our cache variation experiments indicate that there is scope to reduce L1 and L2 read and write misses. This can be achieved by incorporating a series of prefetch techniques [43, 278]. One possible way of achieving this is through the use of a prefetch queue [52, 88] in sections of code which exhibit a large number of misses. A prefetch queue is a FIFO queue which holds n addresses that need to be prefetched. The queue is effective in handling memory references which are scattered throughout memory. The depth of the queue, which is determined experimentally, must be large enough to ensure that by the time an address is popped off the queue, the cacheline of interest is cache resident.
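The relationship between queue depth and cache residency can be illustrated with a toy timing model. This is a sketch under simplifying assumptions, not an implementation of the cited techniques: it assumes one memory reference is processed per time step and that every prefetch completes after a fixed `latency` in time steps. The function name `run_prefetch_queue` and its parameters are hypothetical. A real prefetch queue would issue hardware prefetch instructions from compiled code.

```python
from collections import deque


def run_prefetch_queue(addresses, depth, latency):
    """Simulate a FIFO prefetch queue of the given depth.

    Each address is prefetched when it is pushed. Once the queue holds
    more than `depth` entries, the oldest entry is popped and used.
    Returns (resident, late): how many popped addresses had their
    prefetch completed, and how many were used too early.
    """
    queue = deque()
    resident = 0
    late = 0
    clock = 0  # one time step per processed reference (assumption)

    def pop_and_use():
        nonlocal resident, late
        _addr, issued_at = queue.popleft()
        if clock - issued_at >= latency:
            resident += 1
        else:
            late += 1

    for addr in addresses:
        queue.append((addr, clock))  # prefetch issued at push time
        if len(queue) > depth:
            pop_and_use()
        clock += 1

    # Drain the remaining entries, one per time step.
    while queue:
        pop_and_use()
        clock += 1

    return resident, late


deep_enough = run_prefetch_queue(range(100), depth=8, latency=8)
too_shallow = run_prefetch_queue(range(100), depth=4, latency=8)
```

In this model a queue whose depth matches the prefetch latency finds every line resident when it is popped, while a shallower queue pops every address before its prefetch has completed. This is why the depth has to be tuned experimentally on real hardware, where latency is neither fixed nor known in advance.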
Routines and specific lines in the source code which cause read and write misses have been identified using the KCachegrind tool [140]. KCachegrind uses output from a Callgrind run to attribute cycle count costs to specific points in the source. For future work, prefetch queues could be inserted into specific routines, located using the KCachegrind GUI.
A Study of Thread and Memory Placement Effects using the Gaussian Electronic Structure Code
5.1 Introduction
In this chapter, we study the effects of thread and memory placement and extend the LPM to account for NUMA effects, using selected application kernels from the Gaussian computational chemistry code [86].
Shared-memory parallel platforms have increased in complexity, evolving from UMA (Uniform Memory Access) to NUMA (Non-Uniform Memory Access) machines. As seen in Chapter 3, on NUMA architectures it is faster for a processor to access memory which is local to it than to access remote memory. Moreover, individual processor chips have now evolved towards multi-core designs [17], accelerating the shift towards asymmetry in memory latency and bandwidth. Multi-core processors from various vendors also have subtle differences in their cache hierarchy, e.g. where some have a shared on-chip L2, others have a dedicated L2 and a shared on-chip L3 [51, 287].
Work reported in this chapter has been carried out in collaboration with Dr. Rui Yang (School of Computer Science, ANU). Material from this chapter was published in: (a) Proceedings of I-SPAN 2008, "Memory and Thread Placement Effects as a Function of Cache Usage: A Study of the Gaussian Chemistry Code on the SunFire X4600 M2", http://doi.ieeecomputersociety.org/10.1109/I-SPAN.2008.13; (b) (Accepted) Proceedings of HPCC 2009, "A Simple Performance Model for Multi-threaded Applications Executing on Non-Uniform Memory Access Computers".
Widely used shared-memory programming models like Pthreads [64] and OpenMP [228], as mentioned in Chapter 3, do not explicitly expose or handle underlying hardware asymmetries¹. This poses challenges in obtaining good performance for scientific codes on NUMA platforms. The goal of this chapter is to study the effects of thread and memory placement on the observed performance of the Gaussian code on a contemporary multi-core NUMA platform, the SunFire X4600 M2 [265]. To facilitate this, the following questions are addressed:
(a) What are the performance characteristics (in terms of latency and bandwidth) of the SunFire X4600 M2?
(b) Cache blocking dramatically affects the performance of the Gaussian code. What are the combined effects of cache blocking and placement on the overall runtime of the Gaussian code? How does cache blocking affect scaling as the number of processors is increased?
(c) Can the Linear Performance Model (LPM) be extended to handle NUMA systems? If so, how accurate is it?
(d) Can page migration improve the performance of Gaussian calculations run on NUMA systems?
This chapter is organized as follows: Section 5.2 discusses the architecture and performance characteristics of the X4600 M2, in terms of latency and memory bandwidth for specific thread and memory placements. Section 5.3 reviews the software environment, test molecular systems and modifications made to Gaussian. Section 5.4 demonstrates how the Placement Distribution Model (PDM) from Chapter 3 can be used to study the effects of thread and memory placement in Gaussian. Section 5.5 extends the LPM to account for NUMA effects and evaluates the extended model. Section 5.6 uses page migration to affect data locality in Gaussian. Section 5.7 reviews previous work in the area of performance modelling of NUMA systems. Section 5.8 concludes the chapter and discusses future work.