Modelling Parallel Computation - Performance Engineering

2.4 Performance Engineering

2.4.3 Modelling Parallel Computation

The formulation of Amdahl’s law in 1967 was just a starting point for the development of further abstract and concrete models governing and describing the performance of both parallel hardware and applications. While Amdahl’s law contains just a few parameters to represent both the software (fraction of parallel code) and the hardware (number of processors), newer models have potentially hundreds, characterising compute, communication and synchronisation behaviour. The increase in parameters allows more complex hardware and software behaviours to be represented, but brings with it a decrease in tractability. Nevertheless, performance models still provide a useful platform from which to perform analyses.
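As a point of reference for the two parameters mentioned above, Amdahl’s law is commonly written as speedup = 1 / ((1 - f) + f / n), where f is the fraction of parallel code and n is the number of processors. The following is a minimal sketch of that formula; the function and argument names are illustrative choices and do not come from the original text.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law.

    parallel_fraction: fraction of the runtime that can be parallelised (0..1)
    processors: number of processors the parallel part is spread over
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Illustrative values only: a code that is 95% parallel.
for n in (2, 16, 1024):
    print(n, amdahl_speedup(0.95, n))
```

However many processors are used, the speedup in this example stays below 1 / (1 - 0.95) = 20, which is the limit imposed by the serial fraction.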

Parallel Random Access Machine (PRAM)

In 1978, PRAM was proposed as one of the first models of parallel computing. It was developed to meet the need for a framework in which to develop parallel algorithms while avoiding the nuances of real hardware [56]. This model has the following constituent parts: an unbounded set of processors P = p0, p1, ..., which execute instructions; an unbounded global memory and processor-local memory, both capable of storing non-negative integers; a set of input registers I0, I1, ..., In; a program counter; a per-processor accumulator, also capable of storing non-negative integers; and a finite program A comprising the aforementioned instructions. Computation then proceeds as follows:

1. The input is placed in the input registers: one bit per Ik.

2. All memory is cleared and the length of the program (A) is placed into the accumulator of P0.

3. Instructions are then executed simultaneously by each p ∈ P.

4. The FORK instruction can be used to spawn computation on an idle processor.

5. Execution stops when either a HALT instruction is executed by P0 or a write instruction is executed on the same global memory location simultaneously by multiple processors.
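The lock-step execution and the two stopping conditions listed above can be illustrated with a toy simulator. This is a heavily simplified sketch and not the formal PRAM definition: it omits FORK, accumulators and input registers, and the action tuples, the function names and the `max_steps` guard are all assumptions made for illustration.

```python
from collections import Counter


def run_pram(programs, max_steps=100):
    """Run hypothetical per-processor step functions in lock-step.

    programs[i](step, memory) returns one action for processor i:
    ("write", address, value), ("halt",) or ("noop",).
    """
    memory = {}  # global memory, initially cleared
    for step in range(max_steps):
        # Every processor sees the same memory snapshot for this step.
        snapshot = dict(memory)
        actions = [program(step, snapshot) for program in programs]

        # Stop if two or more processors write the same location at once.
        writes = [action for action in actions if action[0] == "write"]
        targets = Counter(address for _, address, _ in writes)
        if any(count > 1 for count in targets.values()):
            return memory, "write conflict"

        for _, address, value in writes:
            memory[address] = value

        if actions[0][0] == "halt":  # only P0 can end the run with HALT
            return memory, "halt on P0"
    return memory, "step limit"


def p0(step, memory):
    return ("write", 0, 1) if step == 0 else ("halt",)


def p1(step, memory):
    return ("write", 1, 2) if step == 0 else ("noop",)


def p1_clashing(step, memory):
    return ("write", 0, 2) if step == 0 else ("noop",)


ok_memory, ok_reason = run_pram([p0, p1])
clash_memory, clash_reason = run_pram([p0, p1_clashing])
print(ok_memory, ok_reason)
print(clash_memory, clash_reason)
```

In the first run the two processors write to different locations and P0 then halts. In the second run both write to location 0 in the same step, so execution stops before either write takes effect.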

There are several deficiencies with PRAM, the first of which is the lack of any cost associated with communication. This prevents the model from accurately representing NUMA machines, which have multiple communication layers (core-to-core, socket-to-socket and node-to-node). It also prevents the scaling behaviour of an application from being modelled at large scale, where communication costs can dominate.

Bulk Synchronous Parallel (BSP) Computation Model

BSP was developed in 1990 by Leslie Valiant with the same underlying goal as PRAM [150, 161]: to provide a model which allows researchers to develop parallel algorithms and hardware independently of one another. BSP overcomes the primary flaw of PRAM, namely the absence of a cost parameter for communication events, which limited the accuracy of PRAM when costing various algorithmic and hardware choices. BSP execution proceeds in supersteps (S), each of which consists of three sub-steps, described below along with the relevant modelling parameters.

1. Simultaneous computation on local processors (w_{s,i}), where w_{s,i} is the cost of computation for processor i in superstep s, in instruction rate per processor.

2. Data transfer between local processors (h_{s,i} g), where g is a measure of network permeability under continuous traffic to uniformly random address locations and h_{s,i} is the number of messages, or some function of this and the size of the messages, per processor [150].

3. Barrier synchronisation (l).

With values for the aforementioned parameters, Equation 2.6 can be used to compute the cost of an algorithm (C) expressed using BSP.

\[
C = \sum_{s \in S} \left( \max_{i \in P} (w_{s,i}) + \max_{i \in P} (h_{s,i}\, g + l) \right) \tag{2.6}
\]
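Equation 2.6 can be evaluated directly once per-superstep, per-processor values are available. The sketch below assumes that the costs are supplied as nested lists indexed by superstep and then by processor. The function name and all numeric values are hypothetical and serve only to exercise the formula.

```python
def bsp_cost(w, h, g, l):
    """Cost of a BSP program following Equation 2.6.

    w[s][i]: computation cost of processor i in superstep s
    h[s][i]: communication volume of processor i in superstep s
    g: network permeability parameter
    l: barrier synchronisation cost
    """
    total = 0
    for w_s, h_s in zip(w, h):
        compute = max(w_s)                              # slowest processor
        communicate = max(h_si * g + l for h_si in h_s)  # busiest processor
        total += compute + communicate
    return total


# Two supersteps on four processors (illustrative numbers only).
h = [[10, 10, 10, 10], [10, 10, 10, 10]]
w_balanced = [[100, 100, 100, 100], [100, 100, 100, 100]]
w_skewed = [[250, 50, 50, 50], [100, 100, 100, 100]]  # same total work

balanced = bsp_cost(w_balanced, h, g=4, l=50)
skewed = bsp_cost(w_skewed, h, g=4, l=50)
print(balanced, skewed)
```

Both workloads perform the same total amount of computation, but the skewed one is more expensive because the computation term takes the maximum over processors in each superstep.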

Two observations about the performance of BSP applications can be drawn from Equation 2.6. First, load balance is important for both computation and global communication, since both terms are maxima over all processors. Second, minimising the number of supersteps has a positive effect on performance, as it reduces the amount of global synchronisation [150]. As reported by Skillicorn et al., the BSP model has been used to improve the performance of several applications. Most notable is the work by Hill et al., in which the authors develop a BSP model of a multigrid application [78]. This model is then used to predictively assess the impact of different network choices on the runtime of the code.
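The effect of the superstep count can be made explicit with a short worked case of Equation 2.6. As a simplifying assumption (not taken from the original text), suppose the program is perfectly balanced, with each processor performing total work W and total communication H, split evenly across the |S| supersteps:

```latex
\begin{align*}
C &= \sum_{s \in S} \left( \frac{W}{|S|} + \frac{H}{|S|}\, g + l \right) \\
  &= W + H g + |S|\, l
\end{align*}
```

Under these assumptions the computation and communication terms do not depend on how the work is divided, whereas the synchronisation term grows linearly with the number of supersteps.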

One of the main weaknesses of BSP is its disregard for data locality. This presents an issue for HPC applications, as data locality is important for making use of hardware features such as caches and vector units; failing to use either of these can lead to significant slowdowns. However, the issue of locality was addressed in an extension to BSP by Tiskin [155].

The LogP Model

The PRAM and BSP models are high-level abstractions of parallel computation and oversimplify the detail of applications and hardware, specifically in the area of communication, which permits the development of algorithms that would perform poorly on real hardware. Culler et al. address this issue with their LogP model [37], which (as its name suggests) has the following parameters:

L An upper bound on the latency or delay incurred in sending a point-to-point communication containing a small number of words.

o The length of time a processor is engaged in message sending activities (overhead).

g The minimum interval between consecutive messages (gap). The reciprocal of g is the per-processor communication bandwidth.

P The number of processors. For simplicity, all local operations complete in unit time.

Given values for these parameters in cycles, a communication graph can be constructed indicating the cost of a particular communication pattern. The authors go to great lengths to differentiate LogP from PRAM and BSP by pointing out the flaws in these previous models. Most notably, they point out that PRAM does not penalise algorithms which use an excessive amount of communication, whereas LogP discourages this. While BSP also discourages a large amount of interprocess communication, it mandates that only h messages can be sent and received by a processor in any superstep, whereas LogP allows more fine-grained control of messages and hence the ability to represent more complex schedules of messages [37].
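As a small illustration of costing a communication pattern with these parameters, the sketch below uses two expressions commonly derived from LogP: a single small message costs o + L + o (send overhead, latency, receive overhead), and n back-to-back small messages from one sender to one receiver complete after (n - 1) * max(g, o) + o + L + o. The function names and the parameter values are assumptions chosen for illustration, and g is treated as the spacing enforced between consecutive sends.

```python
def point_to_point(L, o):
    """Cycles for one small message: send overhead + latency + receive overhead."""
    return o + L + o


def pipelined_messages(n, L, o, g):
    """Cycles until the last of n back-to-back small messages is received.

    Consecutive sends are spaced by max(g, o); the final message then
    incurs the usual point-to-point cost.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) * max(g, o) + point_to_point(L, o)


# Hypothetical machine parameters, in cycles.
L, o, g = 6, 2, 4
single = point_to_point(L, o)
burst = pipelined_messages(3, L, o, g)
print(single, burst)
```

Because each additional message adds only max(g, o) cycles rather than the full point-to-point cost, the model rewards schedules that overlap messages in flight, which is the kind of fine-grained behaviour that a per-superstep bound cannot express.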

The LogP model has been used successfully by its authors, and by many other researchers, to investigate the performance of algorithms [28, 50, 82, 86, 95]. Building on the success of this model, further extensions have been made: LogGP, which incorporates terms for long messages [5]; LoPC, which introduces terms for contention [57]; and LogfP, which handles small InfiniBand messages [83].
