Cost Models for Algorithmic Skeletons - Algorithmic Skeleton Frameworks

2.4 Algorithmic Skeleton Frameworks

2.4.2 Cost Models for Algorithmic Skeletons

The usage of cost models for statically predicting the performance of parallel programs is very powerful. This allows programmers topredict the run-time performance of their programs on different inputs without having to run

them. Since profiling a parallel program can be time consuming and hard, a static prediction of the run-time behaviour of a program can ease the task of developing parallel software, since the least efficient parallelisations can be discarded without needing to run them. A number of models of parallel programming have been developed, aimed at predicting the run- time performance of the parallel programs defined within their particular model. A brief overview of representative cost models, both for structured and unstructured parallelism is discussed below.

Parallel Abstract Machines and other Models of Parallelism PRAM The Parallel Random Access Machine is an abstract machine designed to model the performance of parallel algorithms [FW78]. The PRAM model focused mainly on MIMD machines, but its application was studied to other architectures, such as SIMD. The PRAM model makes a number of simplifications and assumptions that have an important effect in the real run-time of parallel applications, such as an unbounded number of processors in the machine, or not considering resource contention.

NESL Blelloch and Greiner have demonstrated provable time and space bounds for nested data parallel computations in NESL [Ble95]. NESL is a parallel programming language that heavily influenced other more recent approaches to data parallelism, such as Data Parallel Haskell [CLPJ+_07].

Despite providing provable time and space cost models, NESL, is specifically designed for nested data parallelism, and is not directly applicable to task parallelism. Moreover, nested data parallelism is handled by applying a flattening transformation that converts it to flat data parallelism. This may lead to an increase of the space complexity of algorithms, and is a potential source of inefficiency [SBHG08].

BSP The Bulk Synchronous Parallel abstract computer serves a similar purpose to the PRAM model [Val90, Val89]. Contrary to the PRAM model, in the BSP model the communication and synchronisation between compo- nents of a parallel application is taken into account. In the BSP model, computation takes place as a sequence of supersteps. A superstep consists

2.4. ALGORITHMIC SKELETON FRAMEWORKS 65

on a sequence of three different phases. First, a number of independent parallel processes perform some local computation in parallel, by using local data. Then, these processes communicate to each other, sharing their local data with the other processes. Finally, the processes wait in a synchronisation point to ensure that all communication has taken place, before continuing. BSP algorithms have, therefore, a very regular computation and communication structure that can be exploited to predict the run-time performance of parallel programs. The use of the BSP model for determining worst-case execution times of parallel programs has been explored [GGL12]. Although the BSP model has been explored in the context of algorithmic skeletons [JL11a], this work is limited to data-parallel skeletons. Extending it to task-parallel skeletons is not straightforward. For example, defining what is a superstep in a stream-based parallel program implemented in terms of farms and pipelines is not clear, since local computation and communication happens in an interleaved way.

Synchronous Data Flow Languages Synchronous data flow languages provide constructs that are highly related to streaming algorithmic skeletons. Examples of this are Lucid [AW77] and LUSTRE [CPHP87]. These languages define a set of constructs that operate on streams, and have a natural notion of clocks [CP96, CDE+_{06], that has been formalised as}

a clock calculus [CP95]. Most recently, two different teams have used a specialisation of the clock calculus to infer rates in data flow programs: Rate Types [BL14] to infer maximum throughput; and Data Flow Fu- sion [LCKR13], infer costs to guide a stream fusion procedure.

Cost Models for Algorithmic Skeletons

Algorithmic skeletons have a clear computation and communication structure. This clear structure makes them predictable, when compared to unstructured parallel programming approaches.

A notable example of cost models for algorithmic skeletons was developed by Skillicorn et al. [SC95]. They use a cost calculus for optimising parallel implementations developed in Bird’s theory of lists [Ski93b]. The most

important advantage of this work is that it is applicable to any skeleton- based approach that is derived from Bird’s theory of lists.

Many alternative cost models for algorithmic skeleton frameworks have been studied [Ham99]. The usefulness of cost modelling has been con- vincingly demonstrated more recently by using both static analysis [HC02] and dynamic profiling [LL10] More recently, Matsuzaki [Mat17] has demonstrated this by not only defining efficient tree skeletons on distributed mem- ory systems, but also deriving accurate cost models for them.

Armih et al. [AMT11] develop cost models for algorithmic skeletons on heterogeneous architectures that take into account important low-level information about the underlying architecture, such as cache sizes.

Cost models have been used for deriving efficient parallel programs from specifications [Ski93a], or guiding software refactorings that improve the program’s efficiency [BDH+13].

All the current approaches in cost models for algorithmic skeletons are either derived manually, obtained through approximation and measurement, or ignore important low-level information about synchronisation and communication. None of the state-of-the-art approaches derive accurate cost models from an operational semantics, while also taking into account low- level communication and synchronisation details. Outside the algorithmic skeleton community, the BSP model is the closest to achieving this. How- ever, the BSP model requires programmers to focus on low-level implemen- tation details. The operational semantics of algorithmic skeletons developed in the context of the StA framework tackle this problem by providing a mechanism for both: (a) deriving systematically cost models from the operational semantics of parallel structures; and (b) taking low-level communication and synchronisation into account for the cost models of the derived algorithmic skeletons. However, some of the work on cost modelling for algorithmic skeletons is orthogonal to the one defined in this thesis, and can be used to further refine the approach described in this thesis, e.g. taking cache sizes into account [AMT11].

2.4. ALGORITHMIC SKELETON FRAMEWORKS 67

In document Structured arrows : a type based framework for structured parallelism (Page 77-81)