
The support for thread binding and memory placement provided by Solaris and Linux has been outlined and contrasted. For Linux, the kernel was modified in order to provide a user API that could be used to verify binding and determine physical memory placement from a user-supplied virtual address. Using the various thread and memory placement APIs, a framework was outlined for performing NUMA performance experiments.

Detailed measurements of the latency, bandwidth and BLAS performance characteristics of two different hardware platforms were undertaken. These showed the Opteron system to be "more NUMA" than the Sun V1280 system, despite the fact that it had only 4 processors. To assist with the analysis of performance data, a simple placement distribution model (PDM) for both platforms was outlined. The PDM uses directed graphs to represent processor, memory and interconnect layout.

It was found that when multiple level 1 or level 2 BLAS operations are run in parallel on the Opteron system, performance differs by up to a factor of two depending on memory and thread placement. For level 3 BLAS, the differences are much smaller because data in the level 2 cache is re-used far more effectively.

The use of the PDM and subsequent experiments show that memory placement is important in achieving good performance on NUMA platforms. The PDM categorizes performance results in terms of contention classes and, for the lowest contention class, is able to do so with standard deviations for Stream Copy and Scale of 12% to 15% (Opteron/Solaris), 12% to 16% (Opteron/Linux) and 10% (V1280/Solaris). The PDM results for Stream Triad, L2 BLAS and L3 BLAS indicate that cache blocked computations are affected less by memory and thread placement than Triad and L2 BLAS, which require data to be streamed from memory into the processor. The PDM errors ranged from 0.6% to 18% for Solaris/Opteron, 0.6% to 37% for Linux/Opteron and 10% to 20% for Solaris/V1280.

It would be beneficial for an application to be able to discover the processor and memory topology at runtime, and then use this information to effect thread and memory placement specific to its needs.

Results obtained in this Chapter show the importance of both thread and memory placement. Both the Solaris and Linux operating systems utilize NUMA specific information to affect thread scheduling and memory management decisions. It would be beneficial for a user space application to be able to discover, at runtime, both the processor and memory topology and subsequently use this information to effect thread and memory placement.

Use of a Simple Linear Performance Model for Electron Repulsion Integral Evaluation

4.1 Introduction

All modern microprocessors utilize a cache memory hierarchy to ameliorate latencies associated with accessing main memory. In recognition of this, much effort has been devoted to designing algorithms that carefully orchestrate computation in synchrony with data movement [91, 100, 153, 306, 307]. Almost always these algorithms involve a variety of trade-offs, such as the size of a cache blocking factor, or whether to recompute an intermediate quantity on the fly or pay the penalty of storing and retrieving the data from some distant memory location. In this respect, developing models that can be used to describe performance at various levels in the cache hierarchy as algorithmic or system hardware parameters are changed is important [78, 290].

As discussed in Chapter 2, traditional cache performance models are either analytical or simulation based. Analytic models parametrize various aspects of the system to give an empirical performance estimate, while simulation based techniques predict performance based on the sequence of executable instructions. Simulation based techniques can be functional or cycle accurate, using inputs that are either execution driven (i.e. generated by interpreting instructions from the binary being simulated) or trace driven (i.e. the streams of loads and stores for the simulation are intercepted and saved to disk for offline use). Cache behaviour is then simulated by supplying the instruction sequences to the cache simulator, which in turn models the cache hierarchy. Although trace and execution driven methods are 100 to 1000 times slower than execution on native hardware, they capture dynamic aspects of code execution which occur at run-time (i.e. side-effects arising from interactions between the application, operating system and hardware) that analytical cache models are unable to capture.

Material from this Chapter was published in: (a) Proceedings of HPSC Vietnam 2006, MODELLING THE PERFORMANCE OF GAUSSIAN CHEMISTRY CODE ON X86 ARCHITECTURES, in Modeling, Simulation and Optimization of Complex Processes, Editor: Hans Georg Bock. Springer 2008, (b) BUILDING FAST, RELIABLE, AND ADAPTIVE SOFTWARE FOR COMPUTATIONAL SCIENCE, Journal of Physics: Conference Series, 2008, http://stacks.iop.org/1742-6596/125/012015.
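The trace driven, functional style of simulation described above can be illustrated with a minimal sketch. It assumes a two-level hierarchy of direct-mapped caches and a synthetic address trace; the class name `DirectMappedCache`, the function `simulate`, the cache sizes and the trace are all hypothetical choices made for illustration and are not taken from the simulators used in this work.

```python
from dataclasses import dataclass, field


@dataclass
class DirectMappedCache:
    """Functional (not cycle-accurate) model of a direct-mapped cache."""
    num_lines: int                            # number of cache lines (hypothetical)
    line_bytes: int                           # bytes per cache line (hypothetical)
    tags: dict = field(default_factory=dict)  # line index -> resident tag
    hits: int = 0
    misses: int = 0

    def access(self, address: int) -> bool:
        """Record one load or store; return True on a hit."""
        block = address // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags.get(index) == tag:
            self.hits += 1
            return True
        self.tags[index] = tag                # evict whatever was resident
        self.misses += 1
        return False


def simulate(trace, l1, l2):
    """Feed an address trace through L1, forwarding only L1 misses to L2."""
    for address in trace:
        if not l1.access(address):
            l2.access(address)
    return l1.misses, l2.misses


# Hypothetical trace: two sequential sweeps over a 4 KiB array of 8-byte words.
trace = [addr for _ in range(2) for addr in range(0, 4096, 8)]
l1 = DirectMappedCache(num_lines=16, line_bytes=64)    # 1 KiB L1
l2 = DirectMappedCache(num_lines=128, line_bytes=64)   # 8 KiB L2
l1_misses, l2_misses = simulate(trace, l1, l2)
print(l1_misses, l2_misses)
```

Because the array is larger than the L1 cache but fits in L2, the second sweep misses in L1 again while hitting in L2. A simulator of this kind yields only miss counts; it says nothing about timing, which is why a separate cost model is needed to turn counts into cycles.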

In this Chapter the utility of a simple Linear Performance Model (LPM) is investigated to determine if it can provide sufficiently accurate predictive information (that can be used to guide algorithmic decisions or model the effects of cache blocking changes) for quantum chemistry calculations. In this model [191] the overall performance is given as a simple linear combination of instructions issued and cache misses,

Cycles = α(ICount) + β(L1Misses) + γ(L2Misses)    (4.1)

where ICount is the instruction count, L1Misses the total number of Level 1 cache misses, L2Misses the total number of Level 2 cache misses, and the coefficients α, β and γ are penalty factors. The value of α reflects the ability of the code to exploit the underlying super-scalar architecture, β is the average cost of an L1 cache miss, and γ is the average cost of an L2 cache miss. We will refer to the coefficients α, β and γ as the Processor and Platform specific Coefficients (PPCoeffs).
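As a purely illustrative evaluation of Equation 4.1, suppose α = 0.5 cycles per instruction, β = 10 cycles per L1 miss and γ = 100 cycles per L2 miss, with counts of 2 × 10^9 instructions, 5 × 10^7 L1 misses and 4 × 10^6 L2 misses. These values are assumptions chosen for round arithmetic and are not measurements from this work.

```latex
\begin{align*}
\mathrm{Cycles}
  &= \alpha\,(\mathrm{ICount}) + \beta\,(\mathrm{L1Misses}) + \gamma\,(\mathrm{L2Misses}) \\
  &= 0.5\,(2 \times 10^{9}) + 10\,(5 \times 10^{7}) + 100\,(4 \times 10^{6}) \\
  &= 1.0 \times 10^{9} + 5.0 \times 10^{8} + 4.0 \times 10^{8} \\
  &= 1.9 \times 10^{9}
\end{align*}
```

With these assumed values, roughly half of the predicted cycles come from instruction issue and the remainder is split between the two levels of cache misses.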

The LPM differs from other cache performance models in that it ignores the intricacies of program execution and assumes that platform and processor specific factors that influence application performance can be averaged out and captured by the values of the PPCoeffs. For a candidate algorithm, the PPCoeffs could be obtained either by running the code on revisions of a processor family that have different cache sizes, or by varying a fundamental cache blocking factor within the algorithm. Either method will yield different instruction, L1 and L2 cache miss counts, allowing the PPCoeffs to be obtained using a least squares fit of the observed counts to Equation 4.1.
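The least squares fit described above can be sketched as follows, assuming NumPy is available. The function name `fit_ppcoeffs` and all counts are hypothetical; the cycle counts are generated from known coefficients so that the fit can be checked, and they are not measurements from this work.

```python
import numpy as np


def fit_ppcoeffs(icount, l1_misses, l2_misses, cycles):
    """Least squares fit of Equation 4.1; returns (alpha, beta, gamma)."""
    # One row per run (e.g. per blocking factor), one column per counter.
    counts = np.column_stack([icount, l1_misses, l2_misses]).astype(float)
    coeffs, _, rank, _ = np.linalg.lstsq(
        counts, np.asarray(cycles, dtype=float), rcond=None
    )
    if rank < 3:
        raise ValueError("counts are linearly dependent; vary the runs more")
    return coeffs


# Hypothetical counter values for five runs with different blocking factors.
icount = np.array([1.00e9, 1.02e9, 1.05e9, 1.10e9, 1.20e9])
l1_misses = np.array([9.0e7, 6.0e7, 4.0e7, 3.0e7, 2.5e7])
l2_misses = np.array([4.0e6, 3.5e6, 2.0e6, 1.8e6, 1.7e6])

# Synthetic "observed" cycles built from assumed coefficients (no noise).
true_coeffs = np.array([0.6, 8.0, 120.0])
cycles = (true_coeffs[0] * icount
          + true_coeffs[1] * l1_misses
          + true_coeffs[2] * l2_misses)

alpha, beta, gamma = fit_ppcoeffs(icount, l1_misses, l2_misses, cycles)
print(alpha, beta, gamma)
```

At least three runs with linearly independent counts are needed to determine the three coefficients; additional runs over-determine the system and let the residual indicate how well a linear model describes the data.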

The aims of this chapter are threefold: (i) to study the effects of cache blocking on the PRISM electron repulsion integral (ERI) algorithm within the Gaussian code; (ii) to assess the ability of the LPM to describe the execution time of quantum chemistry calculations that use PRISM across a variety of hardware platforms; (iii) to combine parameters obtained for the LPM on existing hardware with functional cache simulation to predict the effect of architectural changes on the total runtime.

The structure and layout of this chapter is as follows. Section 4.2 gives an overview of ERI evaluation, focusing in particular on the PRISM algorithm and its use of cache blocking. Section 4.3 covers the methodology, the software and hardware platforms, and the benchmark systems used in this chapter. Sections 4.4, 4.5 and 4.6 consider aims (i), (ii) and (iii) respectively. A review of related work is given in Section 4.7, while Section 4.8 concludes the chapter.