
Chapter 2 Fluid Theory and Modelling

2.2 Numerical Modelling

The process of transforming a set of partial differential equations describing a physical model into a numerical algorithm suitable for simulating a given scenario depends on a number of factors. Computation is built up from a set of basic integer and floating point operations. These have a fixed level of precision and accuracy, and so can introduce numerical errors and instabilities into a code that, if care is not taken, can dominate other factors. Typically scientific modelling is done using the standard IEEE 754 double-precision format. In this format numbers are stored as a 64-bit binary number comprising 1 sign bit, an 11-bit exponent and a 52-bit mantissa (with an implicit leading bit), giving a maximum precision of log10(2^53) ≈ 16 decimal digits (for single precision it is only about 7 digits). Given that the relative size of terms in the equations can be separated by many orders of magnitude, it is important to express the calculation of variables in such a way as to maintain the precision as far as possible. For many quantities where we are dealing with small perturbations on a background value it can make sense to store the perturbation in a separate variable [Higham, 2002].
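The loss of a small perturbation against a large background can be illustrated with a short Python sketch. Python floats are IEEE 754 doubles on essentially all platforms; the background and perturbation values below are arbitrary illustrative choices, not quantities taken from the model.

```python
import sys

# Machine epsilon for IEEE 754 double precision: the gap between 1.0 and
# the next representable number, 2**-52.
eps = sys.float_info.epsilon

background = 1.0e8      # illustrative background value (assumption)
perturbation = 1.0e-9   # illustrative small perturbation (assumption)

# Storing the sum: the perturbation is smaller than half the spacing of
# doubles near 1e8 (the spacing is about 1.5e-8), so it is rounded away.
total = background + perturbation
recovered = total - background

# Storing the perturbation in its own variable keeps all of its digits.
stored_separately = perturbation

print(f"machine epsilon        = {eps:.3e}")
print(f"recovered perturbation = {recovered:.3e}")
print(f"stored perturbation    = {stored_separately:.3e}")
```

The recovered perturbation is exactly zero, whereas the separately stored value retains its full precision.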

For modelling a 3D system the number of operations per time-step quickly becomes unmanageable for any reasonable system size on a single workstation; therefore a parallel algorithm is needed to divide the model across many processors. This is achieved by using a technique known as domain decomposition, whereby each processor handles a small local cell of the domain and communicates the changes across its boundaries with neighbouring cells. Thus each processor's grid has an extra layer of grid points around the edge, which are copied from the neighbouring cell after each time-step. Another guiding principle is to reuse data where possible: all calculations requiring a variable (or array element) should be performed in close proximity to ensure the data is still in the local CPU cache, as fetching from main memory is very slow by comparison. Indeed, if the quantity needed can be (re)calculated from values known to be in the cache then this can be faster than loading a stored value from main memory.
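The role of the extra layer of grid points can be sketched in one dimension. The following is a serial Python emulation only, assuming a periodic domain, a simple three-point diffusion update and one ghost point per side; the function names are hypothetical and no real message passing takes place.

```python
def diffuse(u, alpha=0.25):
    """One explicit three-point update of the interior points of u.
    u[0] and u[-1] are ghost points and are left unchanged."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def exchange_ghosts(subdomains):
    """Copy each sub-grid's edge values into its neighbours' ghost points.
    The neighbours wrap around, i.e. the global domain is periodic."""
    n = len(subdomains)
    for r, u in enumerate(subdomains):
        left = subdomains[(r - 1) % n]
        right = subdomains[(r + 1) % n]
        u[0] = left[-2]    # last interior point of the left neighbour
        u[-1] = right[1]   # first interior point of the right neighbour


def run_decomposed(values, nranks, nsteps):
    """Split values over nranks sub-grids and advance nsteps time-steps."""
    size = len(values) // nranks   # assumes len(values) divides evenly
    subs = [[0.0] + values[r * size:(r + 1) * size] + [0.0]
            for r in range(nranks)]
    for _ in range(nsteps):
        exchange_ghosts(subs)              # refresh the ghost points
        subs = [diffuse(u) for u in subs]  # purely local update
    return [x for u in subs for x in u[1:-1]]


initial = [float((i * i) % 7) for i in range(16)]
single = run_decomposed(initial, 1, 10)
split = run_decomposed(initial, 4, 10)
print(max(abs(a - b) for a, b in zip(single, split)))
```

Because every interior point sees the same neighbouring values in both cases, the decomposed result matches the single-domain result exactly.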

In order to maintain stability of the numerical algorithm, the speed at which variations in parameters propagate (i.e. the fastest wave speed v in that direction) must be less than a constant C, typically of order unity for explicit schemes, times the grid spacing ∆x divided by the time-step ∆t. This is known as the Courant-Friedrichs-Lewy (CFL) condition [Courant et al., 1928].

\[
\frac{v\,\Delta t}{\Delta x} < C \tag{2.34}
\]
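In practice the condition is used to choose the time-step from the grid spacing and the fastest wave speed. A minimal sketch follows; the function names and the default Courant number of 0.5 are illustrative assumptions rather than values used in this work.

```python
def cfl_timestep(dx, v_max, courant=0.5):
    """Time-step for which v_max * dt / dx equals the chosen Courant number.

    dx      : grid spacing
    v_max   : fastest wave speed on the grid (must be positive)
    courant : target Courant number C (0.5 is an assumed safety value)
    """
    if v_max <= 0.0:
        raise ValueError("v_max must be positive")
    return courant * dx / v_max


def courant_number(v, dt, dx):
    """Left-hand side of the CFL condition, v * dt / dx."""
    return v * dt / dx


dt = cfl_timestep(dx=0.1, v_max=2.0)
print(dt, courant_number(2.0, dt, 0.1))
```

Halving the grid spacing therefore halves the allowed time-step, so doubling the resolution of a 3D simulation multiplies the total work by roughly sixteen.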

2.2.1 Central Difference

The derivative of a numerical quantity on a regularly spaced grid can be simply derived from the difference in value between one grid point and the next, in either forward or backward differencing; however, this is only accurate to first order. The central difference is derived by taking the difference between the Taylor expansions of f(x±h), f(x±2h) etc.

\[
f(x \pm h) = f(x) \pm h f'(x) + \frac{h^2}{2} f''(x) \pm \frac{h^3}{6} f^{(3)}(x) + O(h^4) \tag{2.35}
\]

so that the even-order terms cancel and

\[
f(x+h) - f(x-h) = 2h f'(x) + \frac{h^3}{3} f^{(3)}(x) + O(h^5) \tag{2.36}
\]

For a function evaluated on a uniform grid, the derivative accurate to higher order requires a stencil of N points around the central point. Using the notation f(x + kh) = f_k,

\[
N = 2: \quad f'(x) \approx \frac{f_1 - f_{-1}}{2h} \tag{2.37}
\]

\[
N = 4: \quad f'(x) \approx \frac{f_{-2} - 8f_{-1} + 8f_1 - f_2}{12h} \tag{2.38}
\]

Similarly, by combining the Taylor series in such a way as to leave just the second derivative, the numerical approximations are

\[
N = 3: \quad f''(x) \approx \frac{f_{-1} - 2f_0 + f_1}{h^2} \tag{2.39}
\]

\[
N = 5: \quad f''(x) \approx \frac{-f_{-2} + 16f_{-1} - 30f_0 + 16f_1 - f_2}{12h^2} \tag{2.40}
\]
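The order of accuracy of these stencils can be checked numerically. The sketch below applies them to sin(x) at x = 1, a test function chosen purely for illustration, and compares the error at step h with that at h/2; the four-point first derivative uses the standard fourth-order normalisation of 12h. The function names are hypothetical.

```python
import math


def d1_n2(f, x, h):
    """Two-point central first derivative, second-order accurate."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def d1_n4(f, x, h):
    """Four-point central first derivative, fourth-order accurate."""
    return (f(x - 2 * h) - 8 * f(x - h) + 8 * f(x + h) - f(x + 2 * h)) / (12.0 * h)


def d2_n3(f, x, h):
    """Three-point second derivative, second-order accurate."""
    return (f(x - h) - 2 * f(x) + f(x + h)) / h ** 2


def d2_n5(f, x, h):
    """Five-point second derivative, fourth-order accurate."""
    return (-f(x - 2 * h) + 16 * f(x - h) - 30 * f(x)
            + 16 * f(x + h) - f(x + 2 * h)) / (12.0 * h ** 2)


def error_ratio(stencil, exact, x=1.0, h=0.1):
    """Error at step h divided by the error at step h/2, for f = sin."""
    coarse = abs(stencil(math.sin, x, h) - exact)
    fine = abs(stencil(math.sin, x, h / 2.0) - exact)
    return coarse / fine


first, second = math.cos(1.0), -math.sin(1.0)
for name, stencil, exact in [("N=2 f'", d1_n2, first), ("N=4 f'", d1_n4, first),
                             ("N=3 f''", d2_n3, second), ("N=5 f''", d2_n5, second)]:
    print(f"{name:8s} error ratio = {error_ratio(stencil, exact):.2f}")
```

A ratio close to 4 indicates second-order accuracy and a ratio close to 16 indicates fourth-order accuracy.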

The higher accuracy of a larger stencil comes at a cost of more computation, and relies on the assumption that the function is sufficiently smooth for the higher order terms of the Taylor expansion to converge. This can fail if, for instance, there are discontinuities (shocks) in the simulation. There also comes a point where the round-off error due to the limited floating point precision exceeds the truncation error from the higher order terms in the Taylor series expansion. The cost/benefit therefore needs careful evaluation, as adding more terms may not increase the overall accuracy of the model.
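The competition between truncation and round-off error can be seen by shrinking the step of a central difference. The sketch below again uses sin(x) at x = 1 as an assumed test function; the names are illustrative.

```python
import math


def central_difference(f, x, h):
    """Two-point central first derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def derivative_error(h, x=1.0):
    """Absolute error of the central difference of sin at x."""
    return abs(central_difference(math.sin, x, h) - math.cos(x))


# The truncation error falls as h**2 while the round-off error grows
# roughly as (machine epsilon) / h, so the total error has a minimum.
for exponent in range(1, 14, 2):
    h = 10.0 ** -exponent
    print(f"h = 1e-{exponent:02d}   error = {derivative_error(h):.3e}")
```

The error initially falls as h is reduced and then rises again once round-off dominates, so reducing the truncation error beyond that point gains nothing.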

In more than one dimension, the simplest approach is to consider each dimension independently, ignoring the diagonal terms. This gives a simple algorithm but leads to inaccuracies at the scale of the stencil size, where it takes several time-steps for changes to propagate in off-axis directions. This can lead to what is known as the chequerboard instability, whereby the decoupled alternate grid cells oscillate around the smooth value. Coupling the diagonal terms can be achieved via a small artificial diffusion slightly above the numerical diffusion term required to cancel the next order error introduced via the Taylor approximation. The alternative often used in magnetohydrodynamics is to use a staggered grid such as the Yee scheme [Yee, 1966], where the electric field is defined on the edges of the grid cells and the magnetic field at the centre of the faces, with the velocity and scalar quantities such as density and temperature at the cell centre. For non-orthogonal curvilinear grids the form of the metric tensor and Jacobian also become important choices for the numerical scheme.
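The decoupling of alternate grid points can be seen even in one dimension. In the following sketch (periodic grid, unit spacing and function names are all illustrative assumptions) the central first difference gives zero for a grid-scale oscillation, whereas a three-point diffusion term responds to it and can therefore damp it.

```python
def central_diff_periodic(u, h):
    """Two-point central first difference on a periodic grid."""
    n = len(u)
    return [(u[(i + 1) % n] - u[(i - 1) % n]) / (2.0 * h) for i in range(n)]


def second_diff_periodic(u, h):
    """Three-point second difference (diffusion operator) on a periodic grid."""
    n = len(u)
    return [(u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n]) / h ** 2
            for i in range(n)]


# Grid-scale oscillation: +1, -1, +1, -1, ...
chequerboard = [(-1.0) ** i for i in range(8)]

gradient = central_diff_periodic(chequerboard, 1.0)
diffusion = second_diff_periodic(chequerboard, 1.0)
print(gradient)    # zero everywhere: the stencil skips the alternate points
print(diffusion)   # -4 times the oscillation: diffusion acts against it
```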

2.2.2 Runge-Kutta

Similar arguments to those for the spatial derivatives apply when considering the evolution of the quantities in time by the numerical integration of the PDEs. The goal is that, given the state of the system at time t, the evaluation of each PDE leads to the new values at t + ∆t. The Euler method is simply

\[
f(x, t+\Delta t) = f(x, t) + \Delta t\, \frac{\partial f(x, t)}{\partial t} \tag{2.41}
\]

However, this method, which only uses the derivative at the start of the interval, has an error per step that is only one power of the step-size smaller than the correction itself [Press et al., 2007, p907]. There are various other methods that take a trial step to the midpoint of the interval and re-evaluate the time derivative there. A popular algorithm that is sufficiently accurate for our purposes is the fourth-order Runge-Kutta. Here, in each time-step the derivative is evaluated four times: once at the initial point, twice at trial midpoints and once at a trial endpoint. These values are then combined with an appropriate weighting to find the new function value to fourth order accuracy. The formula is expressed as follows, where the subscript on each derivative gives the trial state and the time at which it is evaluated:

\[
\begin{aligned}
k_1 &= \Delta t \left.\frac{\partial f}{\partial t}\right|_{f,\; t} \\
k_2 &= \Delta t \left.\frac{\partial f}{\partial t}\right|_{f + k_1/2,\; t + \Delta t/2} \\
k_3 &= \Delta t \left.\frac{\partial f}{\partial t}\right|_{f + k_2/2,\; t + \Delta t/2} \\
k_4 &= \Delta t \left.\frac{\partial f}{\partial t}\right|_{f + k_3,\; t + \Delta t} \\
f(x, t+\Delta t) &= f(x, t) + \tfrac{1}{6}k_1 + \tfrac{1}{3}k_2 + \tfrac{1}{3}k_3 + \tfrac{1}{6}k_4 + O(\Delta t^5)
\end{aligned} \tag{2.42}
\]
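The difference in accuracy between the Euler and fourth-order Runge-Kutta steps can be sketched for a single scalar quantity. Here rhs(f, t) is a hypothetical function returning the time derivative for a given state and time, and the decay equation df/dt = -f is an assumed test problem chosen only because its exact solution is known.

```python
import math


def euler_step(rhs, f, t, dt):
    """Advance f by one Euler step using the derivative at the start only."""
    return f + dt * rhs(f, t)


def rk4_step(rhs, f, t, dt):
    """Advance f by one classical fourth-order Runge-Kutta step."""
    k1 = dt * rhs(f, t)
    k2 = dt * rhs(f + k1 / 2.0, t + dt / 2.0)   # first trial midpoint
    k3 = dt * rhs(f + k2 / 2.0, t + dt / 2.0)   # second trial midpoint
    k4 = dt * rhs(f + k3, t + dt)               # trial endpoint
    return f + k1 / 6.0 + k2 / 3.0 + k3 / 3.0 + k4 / 6.0


def integrate(step, rhs, f0, t0, t1, nsteps):
    """Apply a fixed-step integrator nsteps times from t0 to t1."""
    dt = (t1 - t0) / nsteps
    f = f0
    for i in range(nsteps):
        f = step(rhs, f, t0 + i * dt, dt)
    return f


def decay(f, t):
    """Test problem df/dt = -f, with exact solution exp(-t)."""
    return -f


exact = math.exp(-1.0)
euler_error = abs(integrate(euler_step, decay, 1.0, 0.0, 1.0, 10) - exact)
rk4_error = abs(integrate(rk4_step, decay, 1.0, 0.0, 1.0, 10) - exact)
print(f"Euler error = {euler_error:.2e}, RK4 error = {rk4_error:.2e}")
```

For the same ten steps the Runge-Kutta result is several orders of magnitude more accurate, at the cost of four derivative evaluations per step instead of one.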

For systems whose time scales change as they evolve, an adaptive method of determining the appropriate time-step for a given error tolerance can be used. One such method is the Dormand-Prince [Dormand and Prince, 1980] fifth order Runge-Kutta, whereby the next term in the Taylor expansion is calculated to give an estimate of the truncation error; the step-size can then be adjusted accordingly to ensure accuracy is maintained. Other more elaborate algorithms include the Bulirsch-Stoer and Predictor-Corrector methods. However, for the scenarios expected to be studied in this work regarding instabilities in turbulent plasmas, the basic Runge-Kutta should be sufficient since the gradients involved will not be high enough to constitute shocks.
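The idea of adjusting the step-size from an error estimate can be sketched without the full Dormand-Prince coefficient table. The example below is not the Dormand-Prince scheme: it substitutes the simpler step-doubling estimate, in which one full Runge-Kutta step is compared with two half steps. The tolerance, safety factor of 0.9, growth limits and test problem are all illustrative assumptions.

```python
import math


def rk4_step(rhs, f, t, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = dt * rhs(f, t)
    k2 = dt * rhs(f + k1 / 2.0, t + dt / 2.0)
    k3 = dt * rhs(f + k2 / 2.0, t + dt / 2.0)
    k4 = dt * rhs(f + k3, t + dt)
    return f + k1 / 6.0 + k2 / 3.0 + k3 / 3.0 + k4 / 6.0


def adaptive_rk4(rhs, f0, t0, t1, tol=1e-8, dt0=0.1):
    """Integrate from t0 to t1, adapting dt by step doubling.
    Returns the final value and the number of accepted steps."""
    t, f, dt = t0, f0, dt0
    accepted = 0
    while t < t1:
        last = dt >= t1 - t
        if last:
            dt = t1 - t                      # do not overshoot the end time
        big = rk4_step(rhs, f, t, dt)        # one full step
        half = rk4_step(rhs, f, t, dt / 2.0)
        small = rk4_step(rhs, half, t + dt / 2.0, dt / 2.0)  # two half steps
        err = abs(small - big)               # truncation error estimate
        if err <= tol:                       # accept the more accurate result
            f = small
            t = t1 if last else t + dt
            accepted += 1
        # The local error scales as dt**5, hence the exponent of 1/5.
        factor = 0.9 * (tol / err) ** 0.2 if err > 0.0 else 2.0
        dt *= min(2.0, max(0.2, factor))
    return f, accepted


def decay(f, t):
    """Test problem df/dt = -f, with exact solution exp(-t)."""
    return -f


value, nsteps = adaptive_rk4(decay, 1.0, 0.0, 1.0)
print(nsteps, abs(value - math.exp(-1.0)))
```

A rejected step costs three extra derivative sweeps, which is one reason embedded schemes such as Dormand-Prince, which reuse their stage evaluations for the error estimate, are usually preferred in practice.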

2.2.3 MPI & OpenMP

In order to make any code as portable as possible (and not re-invent the wheel), the use of established, preferably open-source, standard libraries and application programming interfaces (APIs) is required. In the case of large scale parallel programs this means MPI and OpenMP. OpenMP is a simple scheme, built into most modern compilers, suited to multi-core workstations and moderate size shared memory systems, in which, by adding a few simple directives to standard C code around elements such as loops, the task can be split into multiple threads and shared over several processor cores. To scale up to much larger systems where each node has its own local memory, the Message Passing Interface (MPI) is used. Here multiple copies of the program are started across the nodes. Each copy then works out where it is in the whole domain and works on its part of the problem, passing ‘messages’ of the changes in the boundary layer around the edge of the grid to the corresponding neighbours via a high bandwidth and, more importantly, low-latency interconnect such as InfiniBand.

domain decomposition problems, and as such make the handling of boundary conditions relatively trivial. The scaling efficiency of the parallel program is then a combination of the calculation time for each sub-grid with the communication overhead of passing all the new state in the boundary layer or ghost zones.

The hybrid OpenMP/MPI approach allows efficient use of multi-core/multi-chip hardware; in our case each node has two Xeon X5650 CPUs, each with 6 cores. Scheduling the simulation with 6 OpenMP threads per MPI task then enables efficient use of the level 1 and 2 CPU cache on each chip. This also avoids the issue of the 12 cores per node not dividing evenly when setting up jobs to divide grids typically defined using powers of two grid cells in each direction. The other aspect is that as the number of MPI tasks rises for a given resolution of simulation grid, the surface area to volume ratio of each MPI process's arrays rises. Reducing the number of MPI processes relative to CPU cores reduces the communications overhead, which is governed by the size of the surface of each domain.
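The growth of the surface to volume ratio with the number of MPI tasks can be estimated with a short calculation. The sketch below assumes a cubic grid of 256 cells per side and a ghost layer one cell wide, both illustrative values; a wider stencil would need a wider layer. The function name is hypothetical.

```python
def ghost_fraction(n, px, py, pz, width=1):
    """Ratio of ghost cells to interior cells for one sub-grid when an
    n**3 grid is split into px * py * pz equal blocks.

    width is the thickness of the ghost layer in cells (assumed 1 here).
    Assumes n divides evenly by px, py and pz.
    """
    lx, ly, lz = n // px, n // py, n // pz
    interior = lx * ly * lz
    with_ghosts = (lx + 2 * width) * (ly + 2 * width) * (lz + 2 * width)
    return (with_ghosts - interior) / interior


for p in (2, 4, 8):
    tasks = p ** 3
    print(f"{tasks:4d} MPI tasks: ghost/interior = "
          f"{ghost_fraction(256, p, p, p):.3f}")
```

Each eightfold increase in the number of tasks roughly doubles the fraction of each sub-grid that must be communicated, which is why running fewer MPI tasks with several OpenMP threads each reduces the relative overhead.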
