2.4 Algorithmic Skeleton Frameworks
2.4.2 Cost Models for Algorithmic Skeletons
The usage of cost models for statically predicting the performance of parallel programs is very powerful. This allows programmers topredict the run-time performance of their programs on different inputs without having to run
them. Since profiling a parallel program can be time consuming and hard, a static prediction of the run-time behaviour of a program can ease the task of developing parallel software, since the least efficient parallelisations can be discarded without needing to run them. A number of models of parallel programming have been developed, aimed at predicting the run- time performance of the parallel programs defined within their particular model. A brief overview of representative cost models, both for structured and unstructured parallelism is discussed below.
Parallel Abstract Machines and other Models of Parallelism PRAM The Parallel Random Access Machine is an abstract machine designed to model the performance of parallel algorithms [FW78]. The PRAM model focused mainly on MIMD machines, but its application was studied to other architectures, such as SIMD. The PRAM model makes a number of simplifications and assumptions that have an important effect in the real run-time of parallel applications, such as an unbounded number of processors in the machine, or not considering resource contention.
NESL Blelloch and Greiner have demonstrated provable time and space bounds for nested data parallel computations in NESL [Ble95]. NESL is a parallel programming language that heavily influenced other more recent approaches to data parallelism, such as Data Parallel Haskell [CLPJ+07].
Despite providing provable time and space cost models, NESL, is specifically designed for nested data parallelism, and is not directly applicable to task parallelism. Moreover, nested data parallelism is handled by applying a flattening transformation that converts it to flat data parallelism. This may lead to an increase of the space complexity of algorithms, and is a potential source of inefficiency [SBHG08].
BSP The Bulk Synchronous Parallel abstract computer serves a similar purpose to the PRAM model [Val90, Val89]. Contrary to the PRAM model, in the BSP model the communication and synchronisation between compo- nents of a parallel application is taken into account. In the BSP model, computation takes place as a sequence of supersteps. A superstep consists
2.4. ALGORITHMIC SKELETON FRAMEWORKS 65
on a sequence of three different phases. First, a number of independent parallel processes perform some local computation in parallel, by using lo- cal data. Then, these processes communicate to each other, sharing their local data with the other processes. Finally, the processes wait in a syn- chronisation point to ensure that all communication has taken place, before continuing. BSP algorithms have, therefore, a very regular computation and communication structure that can be exploited to predict the run-time per- formance of parallel programs. The use of the BSP model for determining worst-case execution times of parallel programs has been explored [GGL12]. Although the BSP model has been explored in the context of algorithmic skeletons [JL11a], this work is limited to data-parallel skeletons. Extending it to task-parallel skeletons is not straightforward. For example, defining what is a superstep in a stream-based parallel program implemented in terms of farms and pipelines is not clear, since local computation and com- munication happens in an interleaved way.
Synchronous Data Flow Languages Synchronous data flow languages provide constructs that are highly related to streaming algorithmic skele- tons. Examples of this are Lucid [AW77] and LUSTRE [CPHP87]. These languages define a set of constructs that operate on streams, and have a natural notion of clocks [CP96, CDE+06], that has been formalised as
a clock calculus [CP95]. Most recently, two different teams have used a specialisation of the clock calculus to infer rates in data flow programs: Rate Types [BL14] to infer maximum throughput; and Data Flow Fu- sion [LCKR13], infer costs to guide a stream fusion procedure.
Cost Models for Algorithmic Skeletons
Algorithmic skeletons have a clear computation and communication struc- ture. This clear structure makes them predictable, when compared to un- structured parallel programming approaches.
A notable example of cost models for algorithmic skeletons was devel- oped by Skillicorn et al. [SC95]. They use a cost calculus for optimising par- allel implementations developed in Bird’s theory of lists [Ski93b]. The most
important advantage of this work is that it is applicable to any skeleton- based approach that is derived from Bird’s theory of lists.
Many alternative cost models for algorithmic skeleton frameworks have been studied [Ham99]. The usefulness of cost modelling has been con- vincingly demonstrated more recently by using both static analysis [HC02] and dynamic profiling [LL10] More recently, Matsuzaki [Mat17] has demon- strated this by not only defining efficient tree skeletons on distributed mem- ory systems, but also deriving accurate cost models for them.
Armih et al. [AMT11] develop cost models for algorithmic skeletons on heterogeneous architectures that take into account important low-level information about the underlying architecture, such as cache sizes.
Cost models have been used for deriving efficient parallel programs from specifications [Ski93a], or guiding software refactorings that improve the program’s efficiency [BDH+13].
All the current approaches in cost models for algorithmic skeletons are either derived manually, obtained through approximation and measurement, or ignore important low-level information about synchronisation and com- munication. None of the state-of-the-art approaches derive accurate cost models from an operational semantics, while also taking into account low- level communication and synchronisation details. Outside the algorithmic skeleton community, the BSP model is the closest to achieving this. How- ever, the BSP model requires programmers to focus on low-level implemen- tation details. The operational semantics of algorithmic skeletons devel- oped in the context of the StA framework tackle this problem by providing a mechanism for both: (a) deriving systematically cost models from the op- erational semantics of parallel structures; and (b) taking low-level commu- nication and synchronisation into account for the cost models of the derived algorithmic skeletons. However, some of the work on cost modelling for al- gorithmic skeletons is orthogonal to the one defined in this thesis, and can be used to further refine the approach described in this thesis, e.g. taking cache sizes into account [AMT11].
2.4. ALGORITHMIC SKELETON FRAMEWORKS 67