In this chapter we examined the QUI software package in considerable detail. QUI provides
a flexible, modular tool for the simulation of cardiac tissue. QUI’s scripting language affords
users with little computing experience the ability to compose simulations from modular devices
and to parameterise those devices to their needs. QUI provides a broad selection of devices for
computation and output, and a library of RHS modules allowing the user to choose their desired model of kinetics.
For users who require additional functionality, QUI’s device and RHS APIs make it relatively
simple to extend; most devices require the definition of only a single struct and three functions.
Third-party developers are assisted in conforming to device conventions by a range of template
and shortcut macros. We have also described a set of ‘laws’ that QUI devices should obey to
ensure consistent, predictable behaviour.
The description of QUI provided by this chapter is the first known documentation of the code
since its development around twenty years ago. This chapter forms the basis of our understanding
CHAPTER THREE
BEATBOX: PARALLELISATION OF QUI USING MPI
3.1 Introduction
As discussed in Section 1.10, the complexity of cardiac simulations is steadily increasing. A single run of a typical simulation may take months on a high-specification desktop computer. When many runs are required, or when experimenting with different parameters, this becomes impractical. In order to achieve useful results in a reasonable amount of time, researchers are increasingly reliant on parallel computers.
The dominant form of parallel machine today is the distributed memory cluster. While a number of codes are now adopting hybrid shared/distributed memory models, we do not pursue that approach at this stage. QUI does not presently have especially large memory requirements, and most of the memory is given over to New, which is the domain to be discretised. Given current RAM capacities of nodes in state-of-the-art machines such as HECToR (Phase 2a: 2GB per node, Phase 2b: 1.33GB per node) and the rate of increase in cores per processor, it is likely to be some time before the reduction in RAM per node necessitates the use of mixed-mode programming. Shared memory programming usually requires more involved parallel design to avoid race conditions and deadlocks. To facilitate an extensible parallel architecture, in which third-party devices can benefit from parallelism implicitly, it is necessary to centralise parallelisation as much as possible. This is far more easily achieved using a distributed memory architecture.
3.1.1 Design Aims
The core aim of parallelisation is to greatly reduce the runtimes of large simulations by allowing users to harness the power of High Performance Computing (HPC) facilities. The code’s ability to scale, i.e. its efficiency in using parallel processes, will be the main measure of its success. To achieve this, the parallelisation will have to resolve the conflicting goals of balancing load and minimising communication.
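One common way to quantify scaling, assumed here for illustration rather than taken from the text, is parallel efficiency: the speedup over the single-process runtime, divided by the number of processes. The function names below are hypothetical.

```c
#include <assert.h>

/* Speedup: single-process runtime divided by the runtime on p processes.
 * Both runtimes must be in the same unit. */
double speedup(double t_one, double t_p)
{
    assert(t_one > 0.0 && t_p > 0.0);
    return t_one / t_p;
}

/* Parallel efficiency: speedup per process.
 * A value of 1.0 corresponds to ideal (linear) scaling. */
double parallel_efficiency(double t_one, double t_p, int p)
{
    assert(p >= 1);
    return speedup(t_one, t_p) / (double)p;
}
```

For example, a run that takes 120 hours on one process and 20 hours on 8 processes has a speedup of 6 and an efficiency of 0.75. Communication cost and load imbalance are the usual reasons this figure falls below 1.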
As HPC is a specialist field in its own right, it would demand a great deal of Beatbox users to
learn and understand the details of its parallel implementation in order to use it. For this reason,
Beatbox’s parallel implementation should remain opaque to the user. User scripts should contain
no data specific to their use in parallel, and the same script should be able to run safely and efficiently in both sequential and parallel modes. Such safety should not come at the expense of
QUI’s modular structure.
To a lesser extent, third-party developers should also be sheltered from the complexity of the parallel implementation. Third-party RHS modules and the majority of third-party devices should be able to benefit from parallelism implicitly, or with only minor modification to the code. Clear and concise guidelines should inform the user of any need to modify the code, and detail how to proceed.
To ease the burden of ongoing maintenance, the process of parallelisation should make minimal changes to the code, with as much code as possible shared between implementations. On systems where the parallel library is unavailable, the user should be able to compile a sequential-only executable.
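One way to meet this last requirement is conditional compilation. The sketch below assumes a hypothetical build flag, USE_MPI, and hypothetical names (run_layout, layout_init, layout_finish, is_reporting_process); it is not taken from the QUI or Beatbox sources. The only real library calls are MPI_Init, MPI_Comm_rank, MPI_Comm_size and MPI_Finalize.

```c
#include <assert.h>

#ifdef USE_MPI              /* hypothetical flag, e.g. passed as -DUSE_MPI */
#include <mpi.h>
#endif

/* Hypothetical record of the process layout, visible to shared code. */
typedef struct {
    int rank;   /* identity of this process, 0 <= rank < size */
    int size;   /* total number of processes, size >= 1 */
} run_layout;

/* Start up and report the layout. A sequential build behaves as a
 * single process with rank 0, so the code that follows is shared. */
run_layout layout_init(int *argc, char ***argv)
{
    run_layout l = {0, 1};
#ifdef USE_MPI
    MPI_Init(argc, argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &l.rank);
    MPI_Comm_size(MPI_COMM_WORLD, &l.size);
#else
    (void)argc;             /* unused in the sequential build */
    (void)argv;
#endif
    assert(l.size >= 1 && l.rank >= 0 && l.rank < l.size);
    return l;
}

/* Shut down; does nothing in the sequential build. */
void layout_finish(void)
{
#ifdef USE_MPI
    MPI_Finalize();
#endif
}

/* Example of shared logic: only one process prints progress messages. */
int is_reporting_process(run_layout l)
{
    return l.rank == 0;
}
```

Compiled without the flag, the file needs no MPI headers or libraries. Confining the preprocessor guards to a few start-up and shut-down functions keeps the two builds on almost identical code.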
3.1.2 MPI: The Message Passing Interface
MPI, the Message Passing Interface [Message Passing Interface Forum, 2003], is a library specification with FORTRAN and C/C++ bindings, providing tools for running programs on distributed memory machines. MPI provides an abstraction layer between the executable and the hardware implementation of the parallel machine.
Under MPI, an executable runs in one or more processes, where each process has its own, ring-fenced area of memory in which to operate. The MPI runtime environment declares the total
number of processes, size ∈ N, size ≥ 1, and assigns an identity, rank ∈ N, 0 ≤ rank < size,
to each process. Within an MPI program, processes may call MPI library functions to read or write files on disk, or to communicate with other processes. Functions in the MPI library can be either independent or collective. An independent function behaves in the same way as a normal C function. A collective function — such as one involving communication — requires the involvement of other processes, and will not return until the operation has completed on all of the processes involved. It is the blocking nature of collective operations that can cause delays in parallel programs, as processes wait for their slowest neighbour to complete its part of the task. The processes involved in a collective operation are defined by a communicator, supplied as an
argument to the function. All processes in an MPI program are members of the MPI_COMM_WORLD
communicator. Additional communicators can be defined at runtime. This allows subsets of processes to take part in collective operations, which can improve efficiency and affords additional flexibility. Within a communicator, processes are numbered from 0, meaning that a process may have a different rank in each of the communicators of which it is a member.
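The renumbering of ranks within a communicator can be illustrated without running MPI. The sketch below is plain C and makes no MPI calls; it mimics the numbering that MPI_Comm_split produces when every process passes the same key, so that processes sharing a colour are ranked in order of their rank in the parent communicator. The function names and the colour array are hypothetical.

```c
#include <assert.h>

/* Rank of process world_rank within the sub-communicator formed by all
 * processes that share its colour. colour[] has one entry per process
 * of the parent communicator, indexed by parent rank. */
int sub_rank(const int *colour, int world_size, int world_rank)
{
    assert(world_size >= 1 && world_rank >= 0 && world_rank < world_size);
    int r = 0;
    /* Count the lower-ranked processes that have the same colour. */
    for (int p = 0; p < world_rank; ++p)
        if (colour[p] == colour[world_rank])
            ++r;
    return r;
}

/* Number of processes in the sub-communicator containing world_rank. */
int sub_size(const int *colour, int world_size, int world_rank)
{
    assert(world_size >= 1 && world_rank >= 0 && world_rank < world_size);
    int n = 0;
    for (int p = 0; p < world_size; ++p)
        if (colour[p] == colour[world_rank])
            ++n;
    return n;
}
```

With six processes coloured {0, 1, 0, 1, 1, 0}, the process with rank 4 in the parent communicator has rank 2 in its sub-communicator of three processes, while the process with parent rank 1 has rank 0 in the same sub-communicator.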