3.3 Machines
3.3.5 Intel X3430 workstation
This machine, a 2.4GHz Intel X3430 workstation, is not used in the vast majority of benchmarks, nor for any timing runs. However, because select PAPI counters are unavailable on the other machines, primarily the L1 Data Cache Hit and Access measurements, it is used to obtain these values and so provide an approximation of an application’s behaviour for these metrics. It is presumed that measurements such as hit rates are not directly translatable between machines due to differences in cache sizes and/or behaviours. However, metrics such as total accesses may be tied more closely to an application than to a machine: the access count is the sum of the hit and miss counts and thus, as long as a cache miss does not also register as an additional hit, the ratio between hits and misses does not matter. The access count from one machine could then potentially be used to estimate the cache hits on another by subtracting the miss count measured on the second machine.
However, while these derived numbers can potentially be useful for identifying trends/behaviours, the translatable nature of the L1 Data Cache Access count is an assumption and cannot necessarily be considered accurate for the target machine. While all other counters are measured directly on the target machine (such as L1 Data Miss Rate, L2 Hit Rate etc.), the L1 Data Accesses/Hits are always provided by the X3430 workstation. Because such counters do not exist on other architectures such as Sandy Bridge or Haswell, it is difficult to verify how well such numbers translate. Appendix B.3 attempts to address some of the issues surrounding the use of PAPI counters across different architectures, including the accuracy/validity of Floating-Point Operations per Second (FLOP/s) and the use of L1 Data Cache Access rates. Despite this, the method is adopted since no hardware-based alternative is available when the relevant counters are missing from a chipset, and such counters are still sufficient to identify trends of interest when contrasting between kernels (since such readings are all from the same machine).
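The derivation described above can be illustrated with a minimal sketch. The function names and the numbers are hypothetical and are not measurements from this work; the sketch simply assumes, as the text does, that the access count transfers between machines and that a miss never also registers as a hit.

```python
def estimate_l1d_hits(reference_accesses: int, target_misses: int) -> int:
    """Estimate L1 data cache hits on a target machine.

    reference_accesses: L1D access count measured on the reference machine
                        (the X3430 workstation in the text).
    target_misses:      L1D miss count measured directly on the target machine.
    Assumption: accesses = hits + misses, and the access count is a property
    of the application rather than of the machine.
    """
    if reference_accesses < 0 or target_misses < 0:
        raise ValueError("counter values must be non-negative")
    if target_misses > reference_accesses:
        # The translation assumption is visibly violated in this case.
        raise ValueError("miss count exceeds access count")
    return reference_accesses - target_misses


def hit_rate(hits: int, accesses: int) -> float:
    """Fraction of accesses that hit; 0.0 when there were no accesses."""
    return hits / accesses if accesses else 0.0


# Illustrative values only.
accesses_ref = 1_000_000
misses_target = 50_000
hits_est = estimate_l1d_hits(accesses_ref, misses_target)
rate_est = hit_rate(hits_est, accesses_ref)
```

The explicit check on the miss count gives a cheap indication of when the assumption fails outright, although passing the check does not show that the estimate is accurate.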
3.4 Summary
This section has introduced a variety of tools and machines used within this work for the purposes of empirical investigation. It highlights some of the core benchmark characteristics, and tools necessary to reveal insightful performance details about the behaviour of a code. In the remainder of this work, these tools are applied to the task of performance analysis and optimisation, exploring how the applications introduced in this chapter behave in parallel environments and how this knowledge can be applied in a predictive capacity.
Performance Scaling of a Near-Neighbour Hydrodynamics Application
Hydrodynamics is a domain of science belonging to the field of Computational Fluid Dynamics (CFD), specifically addressing the behaviour of fluid or fluid-like substances in motion across a passage of time within a spatial domain. These behaviours can be modelled computationally through the use of physical laws/equations that represent fluid behaviour.
Predicting the dynamic behaviour of materials as they flow under the influence of high pressure and stress is of considerable importance to understanding weapons. Without recourse to underground testing, access to experimental hydrodynamics facilities and supporting high-performance simulations has an important role in providing data to assess weapon safety and performance. Hydra is a benchmark 3D Eulerian structured mesh hydrocode implemented in Fortran, with which the explosive compression of materials, shock waves, and the behaviour of materials at the interface between components can be investigated.
The ultimate goal of any High Performance Computing (HPC) application is to provide accurate results, yet it is implicitly acknowledged that it is desirable for these results to be obtained as quickly as possible. Given the possible variance in machine configuration, both software and hardware, understanding the behaviour of the application in question is crucial both to executing it quickly and to knowing how its performance might be impacted by future modifications. This can be further enhanced by the use of performance models, mathematical or simulation-based systems that are capable of capturing the core behaviours of an application of interest and predicting its runtime. However, in order to do so, a greater understanding is required of the application itself. This chapter introduces Hydra, investigating its current strong and weak-scaling performance with respect to both its code structure and its use of the differing machine components such as compute resources, point-to-point communications, collectives etc. It also highlights any unusual behaviours that may be of interest in the model construction or optimisation process.
Specifically, this chapter sets out to achieve the following goals:
• Introduce Hydra, a hydrodynamics benchmark provided by the Atomic Weapons Establishment (AWE), describing its structure, critical path and communication patterns;
• Investigate the parallel scaling performance of Hydra, including serial compute, strong-scaling and weak-scaling performance on a large-scale machine (part of this work was previously published in 2011 [44]);
• Identify performance-influencing factors that can guide modelling and optimisation efforts, including any unusual discrepancies that warrant further investigation.
4.1 Hydra
The Hydra benchmark code simulates a cube of mixed materials under stress by discretising the data onto a 3D grid of cells given by Nx × Ny × Nz and using message passing for parallelisation. The 3D cube of data is decomposed onto a number of processing elements (PEs) in a typical Single Program Multiple Data (SPMD) fashion during execution. By representing the spatial volume as a collection of cells, the physical properties of materials at different Cartesian locations within the grid can be quantified. The benchmark can then reflect changes in the value of these properties as time progresses throughout the course of a simulation.
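As an illustration of how an Nx × Ny × Nz grid might be divided amongst PEs, the following minimal sketch splits each axis into contiguous ranges. The block layout, the PE arrangement and the function names are assumptions made for illustration; the text above does not state which decomposition scheme Hydra uses.

```python
def split_axis(n_cells: int, n_parts: int) -> list[tuple[int, int]]:
    """Split n_cells into n_parts contiguous [start, stop) ranges,
    distributing any remainder over the first ranges."""
    if n_parts < 1 or n_cells < n_parts:
        raise ValueError("need at least one cell per part")
    base, extra = divmod(n_cells, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def decompose(grid: tuple[int, int, int], pes: tuple[int, int, int]) -> dict:
    """Map each PE coordinate (px, py, pz) to the cell index ranges it owns.

    grid: (Nx, Ny, Nz) cell counts.
    pes:  hypothetical arrangement of PEs along each axis.
    """
    xs, ys, zs = (split_axis(n, p) for n, p in zip(grid, pes))
    return {
        (px, py, pz): (xs[px], ys[py], zs[pz])
        for px in range(pes[0])
        for py in range(pes[1])
        for pz in range(pes[2])
    }


# An 8x8x8 grid shared between 2 x 2 x 1 = 4 PEs.
blocks = decompose((8, 8, 8), (2, 2, 1))
```

Each PE then runs the same program over its own block, which is the SPMD pattern referred to above.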
To achieve this goal the simulation executes a series of functions that are each responsible for updating different simulated properties. The rate of progress is delineated by ∆t, the amount of simulated time that has passed since the last update. A single pass of this collection of functions is known as an iteration. Repeated iterations of this series of functions progress the simulated time, with the benchmark terminating once the sum of ∆t values across all iterations reaches a preconfigured amount. Large ∆t values progress the simulation faster but lead to a loss of detail, potentially becoming too coarse-grained to be an accurate simulation. Small ∆t values avoid this loss of detail, but increase the overall runtime and may offer little benefit to accuracy if the grid is not sufficiently refined/discretised to a point where any differences are appreciable. To mitigate this, ∆t can change from iteration to iteration and is determined by the current state of the simulation; a suitable value is computed at the beginning of every iteration. The total number of iterations executed is therefore the number required for the sum of ∆t values to reach the preconfigured total.
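The iteration structure described above can be sketched as a simple loop. This is an illustrative outline only: `compute_dt` and `advance` are hypothetical callables standing in for Hydra's timestep calculation and its series of update functions, neither of which is specified here.

```python
def run_simulation(end_time, compute_dt, advance, state):
    """Iterate until the sum of dt values reaches end_time.

    compute_dt(state) -> dt   : picks a timestep from the current state
    advance(state, dt) -> state : one pass over all update functions
    Returns (final_state, iterations, elapsed_simulated_time).
    """
    elapsed = 0.0
    iterations = 0
    while elapsed < end_time:
        dt = compute_dt(state)  # recomputed at the start of every iteration
        if dt <= 0.0:
            raise ValueError("timestep must be positive")
        state = advance(state, dt)
        elapsed += dt
        iterations += 1
    return state, iterations, elapsed


# Toy usage: a fixed timestep and a state that just accumulates time.
step = lambda state, dt: state + dt
_, iters_small, _ = run_simulation(1.0, lambda s: 0.25, step, 0.0)
_, iters_large, _ = run_simulation(1.0, lambda s: 0.5, step, 0.0)
```

In this toy case halving ∆t doubles the number of iterations needed to reach the same simulated end time, which is the runtime cost of small timesteps noted above.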
From this it can be determined that two properties dictate the overall runtime of the benchmark: the time taken to run a single iteration, and the number of iterations required to run to completion. Given its repetitive nature, identifying the critical path across the course of an iteration becomes key to understanding the performance of Hydra. As a parallel program, the functional components of Hydra can each be placed into one of a number of categories tied to the use of various machine components (e.g. memory or network interconnect); a breakdown of the various sub-functions called during the course of an iteration is therefore required. The five categories identified within this work are as follows:
• Memory Management: Functions responsible for the dynamic allocation of large temporary arrays (Section 4.2.3).
• Compute: Kernels that perform computational operations (Sections
• Update Boundary: Specialised kernels used to update the problem boundary cells of the grid (Section 4.2.6).
• Point-to-Point Communications (Exchange): MPI point-to-point communications, such as Send and Recv, used to communicate data between two MPI processes (Section 4.3.2).
• Collective Communications: MPI functions, such as MPI_Allgather or MPI_Allreduce, that provide gather/scatter operations to communicate data across a set (potentially all) of the available MPI processes (Section 4.3.3).
The significant details of each category are provided in more depth in Sections 4.2 and 4.3. Distinguishing between them allows each function to be separated into its constituent components, which is important when differentiating between performance behaviours, especially in a parallel environment.
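To illustrate how such a categorisation might be used, the sketch below accumulates hypothetical per-iteration timings into the five categories and sums them into an overall runtime. The category keys and the timing values are invented for illustration and do not correspond to measurements in this work.

```python
from collections import defaultdict

# Short keys standing in for the five categories listed above.
CATEGORIES = ("memory", "compute", "boundary", "exchange", "collective")


def summarise(iteration_timings):
    """Sum per-category times over all iterations.

    iteration_timings: list of dicts, one per iteration, mapping a
                       category key to the seconds spent in it.
    Returns (totals_by_category, overall_runtime).
    """
    totals = defaultdict(float)
    for timings in iteration_timings:
        for category, seconds in timings.items():
            if category not in CATEGORIES:
                raise KeyError(f"unknown category: {category}")
            totals[category] += seconds
    return dict(totals), sum(totals.values())


# Two identical iterations with made-up timings (seconds).
one_iteration = {"compute": 2.0, "exchange": 0.5, "collective": 0.25}
totals, runtime = summarise([one_iteration, one_iteration])
```

With identical iterations the overall runtime reduces to the iteration count multiplied by the time per iteration, matching the two properties identified above.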
4.2 Serial Behaviour
This section introduces the serial behaviour of Hydra, focusing on the application behaviours that influence performance in the absence of parallel considerations. Doing so will reveal how Hydra’s walltime can be tied to the problem configuration and the structure of its compute kernels.
4.2.1 Structured Mesh
To simulate a hydrodynamic system, the problem space is discretised into cells. The segmentation of the problem space influences both the accuracy and speed to solution; the greater the number of cells, the more refined the solution becomes. Since computation must occur for each cell position within the grid, a greater cell count increases accuracy but requires a more significant amount of computation. These cell decompositions are known as meshes and can be structured, unstructured or hybrid in nature.
Figure 4.1: An 8×8×8 Cell Structured Mesh (Nx = 8, Ny = 8, Nz = 8). Panels: (a) Hydra Grid, (b) Cell-Centered, (c) Nodal, (d) Faced.
Structured meshes consist of a regular pattern, with a well-defined neighbour relationship between cells. The cells are typically quadrilateral (2D) or cuboid (3D) in shape. Such meshes implicitly store information regarding cell neighbours as part of their data structure, with the indexing of a 2D or 3D data array acting as a Cartesian co-ordinate system from which lookups can be performed, making them relatively memory efficient.
Unstructured meshes are irregular in nature, with variable cell shapes, so the neighbours of an arbitrary cell cannot be inferred from its position alone. As such, they must also store neighbour relationship data, making them less memory efficient.
Hybrid meshes incorporate both structured and unstructured components: individual regions follow one approach or the other, so the overall decomposition consists of both variants.
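The difference in neighbour lookup between structured and unstructured meshes can be sketched as follows. This is an illustrative example rather than Hydra's implementation; the function names and the small adjacency table are hypothetical.

```python
def structured_neighbours(i, j, k, nx, ny, nz):
    """Face neighbours of cell (i, j, k) in an nx x ny x nz structured grid.

    No neighbour data is stored: the neighbours follow directly from the
    cell's own indices, which is what makes the layout memory efficient.
    """
    offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    result = []
    for di, dj, dk in offsets:
        ni, nj, nk = i + di, j + dj, k + dk
        # Skip positions that fall outside the grid.
        if 0 <= ni < nx and 0 <= nj < ny and 0 <= nk < nz:
            result.append((ni, nj, nk))
    return result


# An unstructured mesh must store its connectivity explicitly.
# Hypothetical four-cell mesh: cell id -> ids of adjacent cells.
unstructured_adjacency = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}


def unstructured_neighbours(cell_id, adjacency):
    """Neighbours come from a stored table rather than from arithmetic."""
    return adjacency[cell_id]


corner = structured_neighbours(0, 0, 0, 8, 8, 8)
interior = structured_neighbours(4, 4, 4, 8, 8, 8)
```

The adjacency table is the additional storage cost of the unstructured approach; the structured lookup needs only the grid dimensions.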
Hydra’s regular, spatially discretised grid is an example of a structured mesh. Its problem size is determined by two components, the spatial size and the cell count. The spatial size defines the simulated physical size of the problem. The cell count determines the number of cells into which this physical space is decomposed; e.g. with 100 cells, each cell represents 1/100th of the simulated physical space. The majority of compute kernels within Hydra are tasked with operating upon every cell within the grid; consequently, the greater the number of cells, the more significant the impact upon compute/memory performance.
During the course of the simulation an iterative solve refreshes a collection of simulated physical properties, termed quantities, each of which has a distinct value stored per cell. These quantities fall into one of three different classifications (cell-centered, nodal or faced):
• Cell-Centered (Figure 4.1(b)): Oriented at the centre of a cell.
• Nodal (Figure 4.1(c)): Oriented at the vertex of a cell.
• Faced (Figure 4.1(d)): Oriented at the centre of a cell face. This is the equivalent of a nodal quantity in one dimension, and of a cell-centered quantity in the remaining two dimensions.
These classifications influence the storage requirements and the amount of work required to process them. Each quantity has its own grid of data, and multiple quantities are updated per cell at different stages during the course of a Hydra iteration. The data for each quantity is stored in a 3D Structure-of-Arrays (SoA) format.
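The effect of these classifications on storage can be illustrated with a minimal sketch. It assumes one value per location and no additional boundary or halo cells, which may not match Hydra's actual array extents; the quantity names and the row-major indexing are likewise hypothetical (a Fortran code would normally be column-major).

```python
def quantity_shape(kind, nx, ny, nz, axis=0):
    """Array extents for one quantity on an nx x ny x nz cell grid.

    kind: "cell"  -> one value per cell
          "nodal" -> one value per vertex (one extra point along every axis)
          "faced" -> nodal along `axis`, cell-centered along the other two
    """
    if kind == "cell":
        return (nx, ny, nz)
    if kind == "nodal":
        return (nx + 1, ny + 1, nz + 1)
    if kind == "faced":
        shape = [nx, ny, nz]
        shape[axis] += 1
        return tuple(shape)
    raise ValueError(f"unknown quantity kind: {kind}")


def count(shape):
    """Number of stored values for a given shape."""
    return shape[0] * shape[1] * shape[2]


def flat_index(i, j, k, shape):
    """Row-major offset of element (i, j, k) within one quantity's array."""
    return (i * shape[1] + j) * shape[2] + k


# Structure-of-Arrays: each quantity owns a separate contiguous array.
# The quantity names below are placeholders, not Hydra's.
kinds = {"density": ("cell", 0), "position": ("nodal", 0), "flux_x": ("faced", 0)}
shapes = {name: quantity_shape(kind, 8, 8, 8, axis) for name, (kind, axis) in kinds.items()}
fields = {name: [0.0] * count(shape) for name, shape in shapes.items()}
```

For the 8×8×8 grid of Figure 4.1 this gives 512 cell-centered, 729 nodal and 576 faced values per quantity, showing how the classification changes both storage and the amount of work per quantity.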