4.6 LPM and Functional Simulation
4.6.4 Summary: Functional Cache Simulation and the LPM
In this section the LPM has been used in conjunction with functional cache simulation for six microprocessor platforms (cf. Chapter 2, Table 2.1). The k300a-04/6-31G* system was used along with a variety of cache blocking factors for PRISM. Simulation results were compared with hardware performance counter results, and a parametric cache variation study was performed.
The comparison of results obtained from simulation and hardware performance counters found that instruction counts were in good agreement for all platforms. L1 misses for the Opteron, Athlon64 and Pentium M showed deviations for PRISM blocking factors less than 32Kw but were in agreement thereafter. For the Pentium 4 and EM64T configurations the general trend was followed, although there was a constant offset between simulation and hardware performance counters. L1 misses for the PowerPC were not in agreement. Simulation results for L2 misses from the Opteron, Pentium 4 and Pentium M configurations followed the same trends as hardware performance counters. Although the EM64T simulation did follow the general trend of values from hardware performance counters, there was a large offset between the two curves. L2 misses for the PowerPC were not in agreement. On comparing cycle counts, it was found that the LPM results were in good agreement with hardware performance counter results.
Results for a parametric cache variation study using the Opteron, Pentium 4 and Pentium M cache configurations were presented. In-depth results for the Opteron were discussed first, followed by contour plots of the cache variation study for the other two configurations. The in-depth results showed that increasing the L1 linesize and the total L2 size reduced the total cycle count. This was attributed to reductions in conflict misses in the L1 data cache and to a greater reduction of read misses than write misses for the L2 cache. Contour plots for the three processors showed that Gaussian’s default blocking factor works for k300a-04/6-31G* in all but two cases, namely the 4MB and 8MB L2 caches.
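The idea of varying a cache parameter in a functional simulation can be illustrated with a minimal sketch. It is not the simulator used in this chapter: the class name, the default cache size and associativity, and the sequential trace are all hypothetical, and the sketch only counts hits and misses for a set-associative LRU cache. On a streaming trace it shows the spatial-locality effect of a longer line, not the conflict-miss effect discussed above.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Functional (hit/miss only) model of a set-associative LRU cache."""

    def __init__(self, size_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One LRU-ordered dictionary of resident tags per set.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Record one access and return True on a hit."""
        self.accesses += 1
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        resident = self.sets[index]
        if tag in resident:
            resident.move_to_end(tag)  # mark as most recently used
            return True
        self.misses += 1
        if len(resident) >= self.ways:
            resident.popitem(last=False)  # evict the least recently used tag
        resident[tag] = True
        return False


def count_misses(line_bytes, addresses, size_bytes=64 * 1024, ways=2):
    """Run a trace through a fresh cache (illustrative defaults) and count misses."""
    cache = SetAssociativeCache(size_bytes, line_bytes, ways)
    for address in addresses:
        cache.access(address)
    return cache.misses


# Hypothetical trace: one sequential sweep over 4096 8-byte words.
trace = [8 * i for i in range(4096)]
misses_64 = count_misses(64, trace)
misses_128 = count_misses(128, trace)
print(misses_64, misses_128)
```

A parametric study in this style would repeat `count_misses` over a grid of line sizes and cache sizes and feed the resulting miss counts to a performance model.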
4.7 Related Work
Ramdas et al. [220, 221] perform qualitative analysis and assess the prospects of mapping an implementation of the Rys ERI method [73, 161] onto FPGAs. They present a quantitative analysis of the ‘bootstrap’ phase of Rys ERI evaluation, which corresponds to ‘Generate Significant Shell-Pair List’ for the PRISM algorithm (cf. Algorithm 2, page 63). A discrete event simulation is used to determine the impact of arithmetic units (adders and multipliers) in the FPGA. In comparison, this chapter has considered the PRISM ERI algorithm, its ERI batching behaviour and the effects of cache blocking on observed performance.
The LPM is lightweight in obtaining application specific performance characteristics. PPCoeffs are obtained using hardware counter data which can then be used by either trace based or execution based simulators. The following is a review of related work covering the use of analytic modelling, synthetic benchmarks and simulation to aid in modelling application performance.
Very recently, Björn [84] developed a cycle-approximate instruction set simulation methodology which uses prior training and regression based performance prediction for a series of embedded application benchmarks. The model requires instruction and memory access counter information, which are fitted to observed cycle counts obtained from an ARM v5 cycle accurate simulator. The prediction phase uses functional simulation to obtain instruction and memory access counters, which are then combined with the regression coefficients obtained from prior training runs to predict cycle counts. Cycle counts are found to be in error by 5%. Björn’s general approach is very similar to that taken here in obtaining least-squares fits for the LPM. Unlike Björn, we have focused on ERI evaluation and obtained fits across a range of hardware platforms and benchmark molecular systems. The LPM’s errors vary from 3.3% to 7.9%.
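The training and prediction phases described above can be sketched as an ordinary least-squares fit. The linear form (cycles as a weighted sum of instruction count, L1 misses and L2 misses), the function names and all numbers below are assumptions made for illustration; they are neither the LPM's actual definition nor measured data. The cycle counts are synthesised from known weights so that the fit can be checked.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_coefficients(counters, cycles):
    """Least-squares fit of cycles[i] ~ sum_j coeff[j] * counters[i][j]."""
    width = len(counters[0])
    # Normal equations: (X^T X) coeff = X^T y.
    xtx = [[sum(row[a] * row[b] for row in counters) for b in range(width)]
           for a in range(width)]
    xty = [sum(row[a] * y for row, y in zip(counters, cycles))
           for a in range(width)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, counter_row):
    """Predicted cycle count for one set of counter values."""
    return sum(c * x for c, x in zip(coefficients, counter_row))


# Made-up training runs: (instructions, L1 misses, L2 misses).
training_counters = [
    [1000.0, 50.0, 5.0],
    [2000.0, 80.0, 12.0],
    [1500.0, 120.0, 7.0],
    [3000.0, 100.0, 30.0],
]
# Synthetic cycle counts built from assumed weights (0.5, 10, 200).
training_cycles = [0.5 * i + 10.0 * l1 + 200.0 * l2
                   for i, l1, l2 in training_counters]

coefficients = fit_coefficients(training_counters, training_cycles)
# Prediction phase: counters for a new run, e.g. from functional simulation.
predicted = predict(coefficients, [2500.0, 90.0, 20.0])
relative_error = abs(predicted - 6150.0) / 6150.0
print(coefficients, predicted, relative_error)
```

With real data the cycle counts are not an exact linear function of the counters, so the residual of the fit gives the kind of percentage error quoted above.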
Other related approaches to application modelling are covered below. Most of these approaches are heavy-weight in comparison to the LPM, with respect to the time taken to obtain results. The following review is biased towards related work which considered performance modelling of scientific application codes.
Using a sparse set of trace based cache simulations, Gluhovsky and O’Krafka [122] build a multivariate model of multiple cache miss rate components. This can then be used to extrapolate for other hypothetical system configurations. This is used in work carried out by Sharapov et al. [242]. They provide a methodology for characterising performance on very large parallel systems. They combine queuing theory models and cycle accurate simulation for estimating parallel performance using trace driven simulation. Traces are collected from a full machine simulator and bus traces are obtained from real hardware. These are then used to drive a trace-driven simulation from which parameters for an analytic model are created to project performance estimates.
Cheveresan et al. [53] perform detailed characterisation of scientific and commercial applications. For their study, traces are generated using an ISA simulator which allows for the capture of architectural traps, direct memory accesses and MMU activity. In their analysis they show that complete scientific codes (rather than kernels) show similar characteristics to commercial applications.
Song et al. [252] create an analytic model to quantitatively predict L2 cache misses on a multi-core chip. They use stack processing and circular sequence profiles to analyze a trace of the L2 cache accesses. The model can predict L2 misses for various multi-core architectures using previously obtained traces.
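Stack processing of a trace can be sketched as follows. This is the classical single-trace LRU stack distance idea only, not the multi-core model or the circular sequence profiles of Song et al.; the function names and the toy trace are hypothetical. One pass over the trace yields a distance per access, from which the miss count of a fully associative LRU cache of any capacity follows without re-simulating.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = []  # the most recently used line is at the end
    distances = []
    for line in trace:
        if line in stack:
            position = stack.index(line)
            # Distinct lines touched since the previous access to `line`.
            distances.append(len(stack) - 1 - position)
            stack.pop(position)
        else:
            distances.append(None)
        stack.append(line)
    return distances


def misses_for_capacity(distances, capacity_lines):
    """Misses of a fully associative LRU cache holding `capacity_lines` lines."""
    return sum(1 for d in distances if d is None or d >= capacity_lines)


# Toy trace of cache line identifiers.
trace = ["A", "B", "C", "A", "B", "D", "A", "C"]
distances = stack_distances(trace)
miss_curve = {c: misses_for_capacity(distances, c) for c in (1, 2, 3, 4)}
print(distances, miss_curve)
```

The list-based stack keeps the sketch short; a production tool would use a tree or similar structure to avoid the linear search per access.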
Marathe et al. [171] create a framework for extracting partial memory access traces using dynamic binary re-writing. These traces are compressed using various algorithms tailored for the lossless capture of instruction stream traces. These are then used for offline memory hierarchy simulations, which allow them to correlate reference statistics for cache eviction information and streaming behaviour with the locations in code that cause them.
Strohmaier and Shan [260] create a synthetic performance probe called APEX-Map, which measures the performance of global data movement. This is characterised as three parameters – the global datasize ‘M’, temporal locality ‘α’ and spatial locality ‘L’. APEX-Map generates a generic address stream based on non-uniform, random access to global data. From the results obtained, it is possible to generate a multi-dimensional performance surface allowing for the study of spatial and temporal effects.
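A synthetic address stream with these three parameters can be sketched as below. This is one plausible realisation rather than the APEX-Map code: the power-law mapping from a uniform random number to a block start index, the function names and the parameter values are all assumptions of this sketch. Small `alpha` concentrates block starts near the beginning of the data (higher temporal reuse), `alpha = 1` gives uniformly random starts, and `length` sets how many contiguous words each block touches (spatial locality).

```python
import random


def synthetic_stream(m, alpha, length, num_blocks, seed=0):
    """Generate word indices in [0, m) as `num_blocks` contiguous blocks.

    Assumed mapping: start = (m - length) * r ** (1 / alpha), r uniform in [0, 1).
    """
    rng = random.Random(seed)
    addresses = []
    for _ in range(num_blocks):
        r = rng.random()
        start = int((m - length) * r ** (1.0 / alpha))
        addresses.extend(range(start, start + length))
    return addresses


def fraction_below(addresses, limit):
    """Fraction of accesses that fall below word index `limit`."""
    return sum(1 for a in addresses if a < limit) / len(addresses)


M = 1 << 20  # hypothetical global data size in words
high_reuse = synthetic_stream(M, alpha=0.1, length=16, num_blocks=1000)
uniform = synthetic_stream(M, alpha=1.0, length=16, num_blocks=1000)
print(fraction_below(high_reuse, M // 10), fraction_below(uniform, M // 10))
```

Timing a kernel that walks such streams over a grid of `alpha` and `length` values would give the kind of performance surface described above.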
Marin and Mellor-Crummey [172] create and evaluate a toolkit for semi-automatic modelling of static and dynamic components of an application’s characteristics by capturing memory access traces. This stream is analyzed to generate platform independent characteristics of the scientific code being studied. This can then be used to extrapolate performance to other platforms of interest.
Grobelny et al. [104] create a framework for performance prediction based on discrete event simulation. They model systems of interest as individual components, i.e. O/S, network device, processor, etc. In a two-phased approach, they characterise applications of interest by capturing memory access streams and then replay these using the discrete event simulation framework to derive performance estimates.
Snavely et al. [1] use profile convolving, a trace based method which involves the creation of a machine profile and an application profile. Machine profiles describe the behavior of loads and stores for the given processor, while the application profile is produced by a runtime utility which captures and statistically records all memory references. Convolving involves creating a mapping of the machine signature and application profile; this is then fed to an interconnect simulator to create traces that aid in predicting performance.
Ahn and Vetter [7] describe multivariate statistical techniques to analyse hardware performance data from scientific codes using clustering, factor analysis and principal components analysis.
Epshteyn et al. [78] use active learning models along with empirical models to guide generation of efficient BLAS libraries.
Vera et al. [290] use cache miss equations to obtain an analytical description of cache memory behavior of loop based codes. These are used at compile time to determine near optimal cache layouts for data and code.
Andrade et al. [11] extend probabilistic miss equations using analytical modelling to model cache behavior of indirections in memory access streams. This approach works if accesses are uniformly distributed in the array of interest. The goal of this work was to create better analytical models to aid compiler optimizations.
Mathis and Kerbyson [175] perform analytical modelling of an unstructured mesh application. They parametrise system performance and application specific input parameters in terms of latency, bandwidth and processing rate. Using a detailed analytic model they then make performance and scalability estimates of the unstructured mesh code.