Automated instrumentation techniques are a natural solution to instrumenting large code bases and have been extensively covered in the literature [19, 33, 44, 132, 149, 166]. As mentioned in Section 2.4.1, there are three main approaches to instrumentation, these are binary, byte code and source code instrumentation; however, the use of source code based instrumentation is focused upon in this thesis.
The TAU tool set is a mature performance toolkit which along with imple- mentations of the various instrumentation techniques, includes an automatic source instrumentation tool [111, 149]. While this tool is effective for the most part, it is unable to handle FORTRAN77 and nested loops, both of which are supported by the tool developed in this thesis.
Byfl is another tool which is capable of instrumenting code bases, with what the authors term “software counters” which can be used to record data from executing code [132]. However this tool is tied to LLVM and so limits the use of high performance compilers such as those produced by Intel, Cray and PGI. However, the tool developed in this thesis is compiler agnostic.
The work which is most similar to that presented in this thesis is that devel- oped by Wuet al; their tool was developed so that instrumentation written by performance engineers could be applied to multiple codes by application engi- neers. In the case of HYDRA this quality is highly desirable as it allows one set of instrumentation to be applied to multiple versions of the code and potentially applied by a third party. This separation of instrumentation from code also has the additional benefit that instrumentation can be transported non-securely to the location of the HYDRA source. The tool developed in this thesis differs by focusing on the needs of a performance engineer rather than end users [166] by focusing on the flexibility of instrumentation rather than usability.
The use of analytical and simulation-based performance models has previ- ously been demonstrated in a wide range of scientific and engineering appli-
cation domains. The construction of such models can augment many aspects of performance engineering, including: comparing the expected performance of multiple candidate machines during procurement [73], improving the scheduling of jobs on a shared machine via walltime estimates [151], identifying bottlenecks and potential optimisations to evaluate their effect upon performance ahead-of- implementation [125] and post-installation machine validation [101].
One body of work similar to that presented in this thesis is described in [59, 60, 61], in which the authors apply an analytical performance model to an al- gebraic multigrid application executing on a range of architectures available at the Lawrence Livermore National Laboratory (including a BlueGene/P and a BlueGene/Q). The focus of these papers is on understanding the scalability of algebraic multigrid applications and the utility of hybrid OpenMP/MPI pro- gramming. This work presents a model of a geometric multigrid application. This differs from previous work produced at the University of Warwick (in which several models of MPI-based codes [13, 14, 40, 123, 124] were developed) in that the focus is on using the performance model to asses the suitability of different partitioning algorithms/libraries at varying scale.
Gileset al have published several papers on the design and performance of the Oxford Parallel Library for Unstructured Solvers (OPlus) and its successor, OP2 [26, 35, 65, 66, 67, 126]. One of these papers details the construction of an analytical performance model of a simple airfoil benchmark (consisting of ≈2 K lines of code), executing on commodity clusters containing CPU and GPU hardware [126]. The performance model achieves high levels of accuracy, but does not support multigrid execution. This thesis details the construction of a performance model for a significantly more complex production application (consisting of≈45 K lines of code), and presents model validations for datasets with multiple grid levels. This work additionally differs by using data provided by a mini-application in the performance modelling process.
Numerous mini-applications have been developed, some of which have been released as part of projects such as the Mantevo Project [77] and the UK Mini-
App Consortium [158]). Mini-applications from these repositories and other standalone mini-applications have been used in a variety of contexts: i) explor- ing new architectures, as demonstrated by Pennycook et al who use miniMD, a mini-application version of LAMMPS, to explore the performance of a molec- ular dynamics code on Intel Xeon Phi [134]; ii) examining the scalability of a neutron transport application [156]; iii) to aid the design and development of a domain specific active library [140] and, iv) exploring different programming models [110]. This work aims to provide further evidence as to the success of the mini-application approach by detailing its usage in examining new hardware features.
Benchmarks and mini-applications representing Computational Fluid Dy- namics (CFD) codes already exist, but are missing some or all the applica- tion characteristics that need to be captured in order to represent unstructured mesh, geometric multigrid applications. LULESH is a hydrodynamics mini- application representative of ALE3D, however it does not utilise a geometric multigrid solver [96]. Likewise, both the Rodinia CFD [34] and AIRFOIL [65] codes developed by Corrigan et al and Giles et al respectively, utilise an un- structured mesh but are also lacking a geometric multigrid solver. Furthermore, the Rodinia CFD code does not support datasets which represent geometries us- ing nodes with a variable number of neighbours. The HPGMG-FV benchmark, while not a CFD code, does utilise a geometric multigrid solver, however; it operates on a structured mesh [1]. The mini-application developed in this work supports datasets with the following characteristics: i) geometric multigrid, ii) a variable number of neighbours per node and iii) unstructured, all of which contribute to the spacial and temporal memory access behaviour of HYDRA.
Another body of work which is similar to that in this thesis and which it builds upon, deals with the validation of a mini-applications performance characteristics against those of the parent code. The technique employed by Tramm et al involves comparing the correlation of parallel efficiency loss to performance counters for both the mini-application and the original code [156].
Previously this technique has been applied to mini-applications of a neutron transport code [156] and a Lagrangian hydrodynamics code [12]; in this thesis it is applied to a different class of application. Messer et al develop three mini-applications and use a comparison between the scalability of the mini- application and the original code as evidence of their similarity [117]. However, the authors focus on distributed memory scalability whereas in this work the focus is on intranode scalability (OpenMP and SIMD).