Computing Science
UNIVERSITY of GLASGOW
Granularity in Large-Scale Parallel Functional Programming
Hans Wolfgang Loidl
A thesis submitted for a Doctor of Philosophy Degree in Computing Science at the University of Glasgow
March 1998
ProQuest Number: 13815404
Published by ProQuest LLC (2018). Copyright of the Dissertation is held by the Author. All rights reserved.
This work is protected against unauthorized copying under Title 17, United States Code Microform Edition © ProQuest LLC.
Abstract
This thesis demonstrates how to reduce the runtime of large non-strict functional programs using parallel evaluation. The parallelisation of several programs shows the importance of granularity, i.e. the computation costs of program expressions. The aspect of granularity is studied both on a practical level, by presenting and measuring runtime granularity improvement mechanisms, and at a more formal level, by devising a static granularity analysis.
By parallelising several large functional programs this thesis demonstrates for the first time the advantages of combining lazy and parallel evaluation on a large scale: laziness aids modularity, while parallelism reduces runtime. One of the parallel programs is the Lolita system which, with more than 47,000 lines of code, is the largest existing parallel non-strict functional program. A new mechanism for parallel programming, evaluation strategies, to which this thesis contributes, is shown to be useful in this parallelisation. Evaluation strategies simplify parallel programming by separating algorithmic code from code specifying dynamic behaviour. For large programs the abstraction provided by functions is maintained by using a data-oriented style of parallelism, which defines parallelism over intermediate data structures rather than inside the functions.
A highly parameterised simulator, GranSim, has been constructed collaboratively and is discussed in detail in this thesis. GranSim is a tool for architecture-independent parallelisation and a testbed for implementing runtime-system features of the parallel graph reduction model. By providing an idealised as well as an accurate model of the underlying parallel machine, GranSim has proven to be an essential part of an integrated parallel software engineering environment. Several parallel runtime-system features, such as granularity improvement mechanisms, have been tested via GranSim. It is publicly available and in active use at several universities worldwide.
In order to provide granularity information this thesis presents an inference-based static granularity analysis. This analysis combines two existing analyses, one for cost and one for size information. It determines an upper bound for the computation costs of evaluating an expression in a simple strict higher-order language. By exposing recurrences during cost reconstruction and using a library of recurrences and their closed forms, it is possible to infer the costs for some recursive functions. The possible performance improvements are assessed by measuring the parallel performance of a hand-analysed and annotated program.
Contents
Abstract
1 Introduction
1.1 Parallel Lazy Functional Programming
1.1.1 Parallel Programming
1.1.2 Functional Programming
1.1.3 Lazy Programming
1.1.4 Relationship to Other Approaches for Parallel Programming
1.2 The Dynamic Behaviour of Parallel Programs
1.3 Static Information about Dynamic Behaviour
1.4 Contributions
1.5 Thesis Structure
2 The Parallel Implementation of Functional Languages
2.1 Introduction
2.2 Principles of Parallel Functional Languages
2.2.1 Why are Functional Languages Good for Parallelism?
2.2.2 The Role of Strictness
2.2.3 Language Support for Parallel Programming
2.3 Implementation of Functional Languages
2.3.1 The Graph Reduction Model
2.3.2 The Dataflow Model
2.3.3 Other Models
2.4 Runtime-System Issues
2.4.1 Evaluation Models
2.4.2 Storage Management Models
2.4.3 Communication Models
2.4.4 Load Distribution
2.5 Our Model
2.6 Summary
3 GranSim - A Simulator for Parallel Haskell
3.1 Introduction
3.2 Structure of GranSim
3.3 Characteristics of GranSim
3.3.1 Flexibility
3.3.2 Accuracy
3.3.3 Visualisation
3.3.4 Efficiency
3.3.5 Integration into GHC
3.3.6 Robustness
3.4 GranSim-Light
3.5 Shortcomings of GranSim
3.6 Validation of Simulation Results
3.6.1 GranSim versus HBCPP
3.6.2 GranSim versus GRIP
3.6.3 GranSim versus GUM
3.7 Summary
4 Large-Scale Parallel Functional Programming
4.1 Introduction
4.2 Problems with Parallel Programming in-the-large
4.2.1 A Simple Example
4.2.2 Data-Oriented Parallelism
4.2.3 Dynamic Behaviour
4.3 Evaluation Strategies
4.3.1 Evaluation Strategies
4.3.2 Strategies Controlling Evaluation Degree
4.3.3 Combining Strategies
4.3.4 Data-Oriented Parallelism
4.3.5 Control-Oriented Parallelism
4.3.6 Additional Dynamic Behaviour
4.3.7 Strategic Function Application
4.4 Alpha-Beta Search
4.4.1 Simple Algorithm
4.4.2 Pruning Algorithm
4.5 Lolita
4.5.1 Algorithm
4.5.2 Sequential Profiling
4.5.3 Top Level Pipeline
4.5.4 Parallel Parsing
4.5.5 Parallel Semantic Analysis
4.5.6 Overall Parallel Structure
4.5.7 Sun SPARCserver Implementation
4.6 LinSolv
4.6.1 The Sequential Algorithm
4.6.3 Improved Version
4.6.4 Parallelism over the Homomorphic Images
4.6.5 Summary
4.7 Comparison with Parallel Imperative Programming
4.7.1 LinSolv
4.7.2 Parallel Resultant Computation
4.7.3 Parallel P-Adic Computation on Rational Numbers
4.8 A Methodology for Parallel Non-Strict Functional Programming
4.9 Related Work
4.9.1 Evaluation Strategies
4.9.2 Large-Scale Parallel Functional Programming
4.10 Discussion
5 Granularity in Parallel Functional Programs
5.1 Introduction
5.2 Dynamic Control of Parallelism
5.3 Importance of Granularity
5.4 The Relationship between Granularity and the Evaluation Model
5.4.1 Granularity with eager-thread-creation
5.4.2 Granularity with evaluate-and-die
5.5 Granularity Improvement Mechanisms
5.5.1 Explicit Threshold
5.5.2 Priority Sparking
5.5.3 Priority Scheduling
5.6 Using Granularity Improvement Mechanisms
5.6.1 Divide-and-Conquer Programs
5.6.2 Larger Parallel Programs
5.7.1 Runtime Methods
5.7.2 Programmer Annotation Approaches
5.7.3 Profiling Methods
5.8 Discussion
6 Granularity Analysis
6.1 Introduction
6.2 Design Philosophy
6.3 Syntax of ℒ
6.4 A Static Cost Semantics for ℒ
6.4.1 A Sized Time System for ℒ
6.4.2 From Cost-Expressions to Cost-Functions
6.5 Cost Inference
6.5.1 Structure of the Inference
6.5.2 A Size and Cost Reconstruction Algorithm
6.5.3 Simplifying Constraints
6.5.4 Solving Recurrence Relations
6.5.5 Correctness Issues
6.6 Example
6.6.1 Cost and Size Analysis
6.6.2 Annotations
6.6.3 Measurements
6.7 Comparison with Other Work
6.7.1 Complexity Analysis
6.7.2 Cost Analysis for Strict Languages
6.7.3 Demand Analysis
6.7.4 Cost Analysis of Lazy Languages
6.7.5 Logic Languages
7 Conclusions
7.1 Summary
7.2 Contributions
7.3 Further work
List of tables
3.1 Simulation times (in seconds) of GranSim and HBCPP
4.1 Measurements of the simple and the pruning Alpha-Beta search algorithm
4.2 Measurements of all versions of LinSolv
List of figures
1.1 Possible structure of a parallelising compiler
2.1 The principle of parallel graph reduction
2.2 The principle of the dataflow model
2.3 Locking of closures and generation of waiting lists
3.1 Global structure of GranSim
3.2 The bulk fetching mechanism (with 3 thunks per packet)
3.3 A comparison of packing and rescheduling schemes
3.4 Overall activity profile (original in colour)
3.5 Per-processor activity profile (original in colour)
3.6 Per-thread activity profile
3.7 Bucket statistics of thread runtime and heap allocations
3.8 Global structure of GranSim-Light
3.9 Activity profiles from GranSim and HBCPP
3.10 GranSim and GRIP activity profiles of LinSolv
3.11 GranSim (top) and GRIP (bottom) granularity profiles of a ray tracer
3.12 GranSim and GUM activity profiles of a determinant computation
4.1 Structure of sum-of-squares
4.2 Top level structure of choosing the best next move
4.3 Data parallel combination function in the simple Alpha-Beta search algorithm
4.4 Pruning subtrees in the optimised Alpha-Beta search algorithm
4.5 Pruning version of the Alpha-Beta search
4.6 Strategy for a parallel pruning version with a static force length
4.7 Speedup with varying force length (GranSim)
4.8 Data parallel versions with static force lengths of 0 and 4
4.9 Overall pipeline structure of Lolita
4.10 The top level function of Lolita
4.11 A granularity control strategy used in the parsing stage
4.12 Activity profiles of Lolita with span thresholds of 50% and 90%
4.13 Granularity profiles of Lolita with span thresholds of 50% and 90%
4.14 Detailed structure of Lolita
4.15 Activity profile of Lolita run under GUM with 2 processors
4.16 Structure of the LinSolv algorithm
4.17 Top level code of the sequential LinSolv algorithm
4.18 Naive parallel pre-strategy code
4.19 Strategy version of a naive parallel LinSolv algorithm
4.20 Activity profile of pre-strategy and strategic naive LinSolv
4.21 Strategy version of an improved parallel LinSolv algorithm
4.22 Activity profiles of pre-strategy and strategic improved LinSolv
4.23 Strategy of the final parallel LinSolv algorithm
4.24 Activity profiles of pre-strategy and strategic final LinSolv
4.25 A tree CRA used in the pre-strategy version
4.26 Activity profile of LinSolv in a 3 processor GUM setup
4.27 PACLIB code of generating and synchronising processes in LinSolv
4.28 Per-thread activity profiles for imperative LinSolv and parallel p-adic computation
5.1 Runtime and parallelism overhead with varying thread granularity
5.2 Speedups and number of threads of parfact with eager-thread-creation
5.3 Speedups and number of threads of parfact with evaluate-and-die
5.4 Speedup of parfact (under GUM) on a workstation network and a shared-memory machine
5.5 Unbalanced divide-and-conquer tree generated by unbal
5.6 Speedup of unbal with varying cut-off values
5.7 Relative runtimes and speedups of unbal with priority sparking and scheduling
5.8 Relative runtimes and speedups of queens with priority sparking and scheduling
5.9 Relative runtimes with variants of priority sparking and scheduling
6.1 A sized time system for ℒ
6.2 Subtyping relation for ℒ
6.3 Overall structure of the analysis
6.4 An algebraic unification algorithm on sized types
6.5 A size and cost reconstruction algorithm for ℒ
6.6 A size and cost reconstruction algorithm for ℒ (continued)
6.7 Definition of size stripping
6.8 Inference for length
6.9 Matching of cost expressions
6.10 ℒ code for coins
6.11 A part of the inference of del
6.12 Recurrences and their closed forms
6.13 Annotated ℒ code for coins
6.14 Granularity with varying cut-off values (eager and lazy thread creation)
Acknowledgements
First and foremost I would like to thank my supervisors, Kevin Hammond and Phil Trinder, for guiding my research over the course of my PhD and for helping me avoid the pitfalls on the way to obtaining a PhD degree. In particular, I am very grateful for initially getting the opportunity of doing research in Glasgow. And I am indebted to both of my supervisors for the hand-holding done in the notoriously difficult writing-up stage.
I would like to thank the members of my viva panel, Greg Michaelson, John O'Donnell and David Watt, for the very constructive comments on how to improve my thesis.
I am indebted to members at the RISC-Linz institute, in particular to Bruno Buchberger, Hoon Hong and Wolfgang Schreiner, for giving me a great start into the academic world. And I am grateful to the Ministry of Science and Research of the Republic of Austria for funding part of my PhD.
Finally my thanks to the whole functional programming group in Glasgow for providing such an active and stimulating environment.
Declaration
I hereby declare that this thesis has been composed by myself, that the work herein is my own except where otherwise stated, and that the work presented has not been presented for any university degree before.
Sections 4.2 and 4.3 are revised versions of material published in (Trinder et al. 1998). Sections 4.4, 4.5 and 4.9.1 cover material to be published in (Loidl & Trinder 1997) and (Loidl et al. 1997), respectively. Section 4.6 is a revised version of material submitted for publication in (Loidl 1997). An earlier version of the material in Chapter 6 was published in (Loidl & Hammond 1996a).
• Trinder, P., Hammond, K., Loidl, H.-W. & Peyton Jones, S. (1998), Algorithm + Strategy = Parallelism, Journal of Functional Programming 8(1).
• Loidl, H.-W. & Trinder, P. (1997), Engineering Large Parallel Functional Programs, in IFL'97 - International Workshop on the Implementation of Functional Languages, University of St. Andrews, Scotland, UK, September 10-12. To appear in LNCS.
• Loidl, H.-W., Morgan, R., Trinder, P., Poria, S., Cooper, C., Peyton Jones, S. & Garigliano, R. (1997), Parallelising a Large Functional Program; Or: Keeping LOLITA Busy, in IFL'97 - International Workshop on the Implementation of Functional Languages, University of St. Andrews, Scotland, UK, September 10-12. To appear in LNCS.
• Loidl, H.-W. (1997), LinSolv: A Case Study in Strategic Parallelism, in Glasgow Workshop on Functional Programming, Ullapool, Scotland, UK, September 15-17. Submitted for publication.
• Loidl, H.-W. & Hammond, K. (1996a), A Sized Time System for a Parallel Functional Language, in Glasgow Workshop on Functional Programming, Ullapool, Scotland, UK, July 8-10.
Introduction
After decades of claiming that functional programming languages are well suited for implicitly-parallel execution, only a few systems have demonstrated this on a large scale. The research towards efficient implementations has revealed many problems in designing a parallel runtime-system that efficiently manages the generated parallelism without overwhelming the machine with bookkeeping overhead. The limited information provided by the programmer about the parallel execution of the program necessitates very sophisticated, and very general, runtime-system techniques.
One of the major strengths of functional languages is their clear and simple declarative semantics. From a compiler-design point of view this makes it possible to put theory to some practical use. For example, static analyses are easily developed, which provide, at compile time, information about some runtime properties of the program. In the maturing sequential compiler technology for functional languages these analyses provide crucial information for program transformation steps, which represent the backbone of compiler optimisations. For the parallel execution of functional languages they can provide information to enable the runtime-system to manage the parallelism more efficiently.
This thesis investigates how to statically extract information about the granularity of potential parallel threads, i.e. the computation costs of these threads, and how to use this information in the runtime-system. In evaluating the importance of granularity for the efficiency of parallel program execution a set of large functional programs is studied. It transpires that a combinator-oriented approach towards exposing potential parallelism in the program leads to rather obfuscated code with intertwined behavioural and algorithmic code. To remedy this shortcoming this thesis contributes to a programming technique for separating these two kinds of code. This technique is used in the parallelisation of several programs, the largest of which consists of more than 47,000 lines of Haskell, making it the largest existing parallel non-strict functional program.
1.1 Parallel Lazy Functional Programming
Parallel computation offers an enticing picture of the performance that can be achieved by the next generation of computers: no longer is the program required to run on only one processor but it becomes possible to execute parts of the program on different processors. This enables the programmer to reduce the runtime of a program further by decomposing it into parallel components, either automatically or by hand. Potentially, it offers scalability in the performance of multiprocessors: in order to speed up a machine it is only necessary to add new processors.
However, with most existing parallel programming models it is necessary to specify explicitly the decomposition of the program into parallel threads, the order of thread creation, the synchronisation, the communication between threads etc. In practice this often requires significant restructuring or even recoding of a sequential program. The root of this complication is the specification of an algorithm as a sequence of operations performed on a global store in an imperative programming model. In contrast, a declarative program does not specify such a sequence of operations. The compiler and the runtime-system are free to choose different orders of operations, or evaluation orders, provided the semantics of the language is preserved. This opens up the possibility for an implicitly parallel execution of a declarative program where the programmer does not have to specify anything more than is needed for the sequential execution anyway.
Our programming model is therefore a combination of three models:
• parallel programming to reduce runtime by executing a program on several processors,
• functional programming to achieve a higher level of programming by abstracting over operational aspects,
• non-strict programming to increase modularity by decoupling control and definition.
The implementation model used in this thesis is parallel graph reduction. Section 2.3.1 discusses this model in more detail.
1.1.1 Parallel Programming
A parallel program reduces runtime by sharing the work to be done amongst many processors. To achieve such a reduction in runtime several threads, independent units of computation, are executed on different processors¹. Introducing the concept of threads means that mechanisms for generating threads, synchronising threads, communicating data between threads, and terminating threads have to be established. We term these aspects of the program execution the dynamic behaviour of a parallel program. Clearly, the dynamic behaviour of a parallel program is significantly more complex than that of a sequential program.
Many existing parallel programming languages require the programmer to explicitly specify these aspects of parallel program execution. Objects specific to parallel execution, like semaphores and monitors, are used to describe synchronisation between threads. Managing these new objects, however, adds a new dimension of complexity to program development: for example, the results of the parallel program might become non-deterministic, and the design of robust large-scale parallel systems in particular becomes a daunting challenge.
The approach towards parallel computation advocated in this thesis is to let most of these resources be managed by the runtime-system, so that the programmer is spared the additional complexity of handling them explicitly. All the programmer has to do is to expose parallelism, i.e. to identify parts of the program that may be usefully evaluated in parallel. This model is therefore termed one of semi-explicit parallelism. Ideally a compiler should automatically partition the program into parallel threads. If accurate strictness information is present this could be done by generating a parallel thread for every strict argument of an expression. However, the effects of different decompositions, or partitions, of the program into sequential components are of special importance for the work presented in this thesis. Therefore the programmer is required to expose the potential parallelism in the program.
In summary, our model offers the possibility of reducing the runtime by only exposing potential parallelism and without explicitly managing the parallel threads.
¹ We do not distinguish between complete heavy-weight threads, sometimes called tasks, and light-weight threads that can only exist within a task.
1.1.2 Functional Programming
Functional languages, as well as other declarative languages, describe what to compute without specifying the order in which to compute it. The exact evaluation order is only loosely defined by the data dependencies between expressions in the program. The compiler can choose any evaluation order of independent expressions. In particular, they can be evaluated in parallel. The semantic property that allows such flexibility in the evaluation order is referential transparency, stating that the result of an expression does not change if a subexpression is replaced by another expression with the same result. For formal reasoning this permits the technique of replacing equals by equals. In the context of parallel computation this allows the compiler, or the runtime-system, to choose various orders of evaluation and to change them dynamically.
Based on this property of functional languages it is easy to implement a naive automatically parallelising compiler. For example, all strict arguments of a function call as well as the function body itself can be evaluated in parallel. However, the problem with this approach is the management overhead related to the vast amount of parallelism generated. Often the generated threads are too short to warrant an execution by a parallel thread altogether. Therefore, much effort has been put into increasing the length of these threads, which increases their granularity because each thread performs more computation.
This thesis studies how to increase the granularity of the generated threads and thereby improve the performance of the parallel program. A compile-time approach is taken, in which information about the granularity of potential parallel tasks is inferred at compile-time and forwarded, via automatically inserted annotations, to the runtime-system, which then uses this information in order to decide whether a parallel thread should be generated. This design naturally splits into one static component for inferring computation costs, a granularity analysis, and one dynamic component for using this information, granularity improvement mechanisms. It should be noted that the use of compile-time information from a static analysis does not amount to a static partitioning of the program. In our model the runtime-system is free to ignore parallelism. Thus, it is possible that different pieces of code that have been marked for parallel execution are actually merged into one thread by the runtime-system.
In summary, we focus on functional languages because the lack of an explicit evaluation order specified in a program gives the compiler and the runtime-system a high degree of freedom in choosing a specific evaluation order. Although the use of implicit parallelism is not the immediate goal, this work makes some progress towards this long-term goal.
1.1.3 Lazy Programming
An algorithm in a declarative language describes a property rather than a procedure. Executing the algorithm amounts to finding a solution for the property specified. This approach can be taken further to the level where values are bound to variables. The operational meaning of such a binding is to evaluate the expression. The declarative meaning, however, only identifies a variable with a value.
The idea of lazy evaluation, or more precisely of non-strict languages, is to decouple denotational definition from operational control. Defining the value of a variable does not mean that the definition has to be evaluated immediately. The definition only describes a property between a variable and a value in the program. The evaluation degree and the evaluation order are defined by the data dependencies in the program. This enables the reuse of the same variable in many different contexts, which examine different parts of the value. Thus, abstracting this control aspect out of the algorithm increases the modularity of programs.
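The decoupling of definition from control can be illustrated with a small Haskell fragment. This is only a sketch: the names `squares`, `firstThree` and `firstAbove` are hypothetical and do not come from the programs studied in this thesis. One definition is reused in two contexts, each of which demands a different part of the value.

```haskell
-- Hypothetical illustration: one definition, several consumers.

-- An infinite list of squares; defining it evaluates nothing.
squares :: [Integer]
squares = map (\x -> x * x) [1 ..]

-- This context demands only the first three elements.
firstThree :: [Integer]
firstThree = take 3 squares

-- This context demands elements until one exceeds the bound.
firstAbove :: Integer -> Integer
firstAbove bound =
  case dropWhile (<= bound) squares of
    (x : _) -> x
    []      -> error "unreachable: squares is infinite"
```

Neither consumer has to know how far the other evaluates `squares`; the evaluation degree is determined by the data dependencies of each context.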
There is an obvious tension between the goal of lazy evaluation, to abstract over control aspects of the code, and parallel computation, to enforce a parallel control structure of the code. Lazy evaluation tries to evaluate as small a portion of the result as possible, whereas parallel computation aims at generating independent threads of some minimal size. In order to achieve good parallel performance this means that at some places it may be necessary to specify how far a data structure should be evaluated, i.e. to specify its evaluation degree. Still, lazy evaluation is valuable for modular program design because this evaluation degree can be specified separately from the definition of the data structure itself. This encourages a data-oriented style of parallel programming, i.e. a style where the parallelism is specified over intermediate data structures rather than in the modules that generate these data structures. In evaluation strategies, the programming technique developed by the parallel programming group at Glasgow, this style of programming has proven to be extremely useful for large parallel programs.
The high degree of modularity provided by lazy languages is particularly important for the design of large programs. Furthermore, extremely time-consuming programs, which would profit most from a reduction in runtime provided by parallel computation, are typically very large. Therefore, it is important that the language for parallelising the program supports modularity. Otherwise the gain in performance would have been bought with a loss in maintainability.
In summary, the use of lazy evaluation decouples definition from control. This aids modularity and code re-use in a sequential model of computation. In a parallel model it also aids top-down parallelisation of big programs by using data-oriented parallelism over intermediate data structures.
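A data-oriented specification of parallelism might look as follows. This is a minimal sketch in the spirit of evaluation strategies, not the definitions used later in this thesis: the names `Strategy`, `rwhnf`, `using`, `parList` and `parMapSketch` are assumptions made for this illustration, and `par` is taken from `GHC.Conc` in a current GHC.

```haskell
import GHC.Conc (par)

-- A strategy describes how far to evaluate a value; it returns ()
-- once the required evaluation has been triggered.
type Strategy a = a -> ()

-- Evaluate a value to weak head normal form.
rwhnf :: Strategy a
rwhnf x = x `seq` ()

-- Attach a strategy to a value without changing the value itself.
using :: a -> Strategy a -> a
using x strat = strat x `seq` x

-- Spark the evaluation of every list element under the given strategy.
parList :: Strategy a -> Strategy [a]
parList _ [] = ()
parList strat (x : xs) = strat x `par` parList strat xs

-- Algorithmic code: map f xs.
-- Behavioural code: parList rwhnf, applied to the intermediate list.
parMapSketch :: (a -> b) -> [a] -> [b]
parMapSketch f xs = map f xs `using` parList rwhnf
```

The function `f` and the list it produces are left untouched; the parallelism is defined purely over the intermediate data structure, so the behavioural part can be changed or removed without editing the algorithm.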
1.1.4 Relationship to Other Approaches for Parallel Programming
The approach towards parallelism taken by functional languages is in stark contrast to that taken by High Performance Fortran (HPF) (Rice 1993) and other parallel extensions of imperative languages. In parallel functional programming the programming language itself is unchanged. However, at certain points additional information is added to the program and used by the parallel runtime-system. This additional information only represents hints to the runtime-system that may be ignored rather than directives that must be obeyed. Therefore, the annotations do not change the semantics of the program. These annotations are in some sense analogous to register declarations in imperative languages that allow the programmer to add valuable operational information to the program but can be ignored by the compiler. It is interesting to note that many of these annotations, like register declarations, are nowadays rarely used and that most of the time automatic register allocation performed by the compiler is perfectly satisfactory for the programmer. Clearly, this state has not yet been achieved with parallelism annotations for functional languages. But the distinction between functional language features and operational annotations for parallelism enables a similar approach.
In contrast, parallel programs written in HPF-like languages aim at a near optimal usage of parallel machine resources. In doing so, they reveal low-level machine details and allow the program to specify details of the program execution, leading to highly machine-specific programs. As a result abstractions over primitive low-level constructs are evolving in the same way as high-level programming language constructs evolved out of common patterns of low-level instructions.
Based on these differences in the language design we consider parallel functional languages to be most useful for achieving moderate speed-up with only minimal changes in the code. Hopefully the necessary changes in the code that are still needed today can be reduced to zero with further progress towards implicit parallelism. HPF-like languages are more appropriate for applications in the supercomputing area where it is feasible to spend large programmer effort in restructuring code in order to get near optimal performance. However, we believe that the programming techniques used in our model, like data-oriented parallelism via non-strict data structures, can also be applied to this kind of language in order to build high-level abstractions for certain kinds of parallelism.
1.2
The D ynam ic B ehaviour o f Parallel Program s
The main reason for the complexity of writing parallel programs is the complex dy namic behaviour generated by a set of cooperating threads. In addition to the cor rectness of the sequential pieces of com putation the tim ing of communication has to be considered in order to avoid deadlock situations and to guarantee both correct ness and termination of the parallel program. Furthermore, the performance tuning of a parallel program requires a fine balance between several competing goals like creating many threads to use idle time of processors during the com putation and limiting the number of generated threads to limit the bookkeeping overhead for the runtime-system.
Many parallel languages allow the programmer to control all these aspects of the dynamic behaviour. In our model, however, almost all of these details are hidden by the runtime-system. This design decision is based on the observation that the programmer is often overwhelmed with the complexity of writing a parallel program and explicitly managing the dynamic behaviour. In order to make such a semi-explicit approach feasible, the runtime-system has to make sophisticated decisions on how to manage the parallelism. For example, in our model the creation of parallel threads is based, to some extent, on the current load of a processor. The communication between threads is implicitly performed via reading and writing shared structures. The only extension necessary for specifying the parallelism in the program is a combinator that exposes parallelism, called par. However, in order to get more detailed control over the partitioning of the program into parallel threads it is often necessary to specify the evaluation order in an expression. This is done by adding seq combinators. Ideally, both kinds of combinators could be inserted into the program by an automatically parallelising compiler. However, efficient runtime-system techniques to manage the parallelism have to be devised first. The long term goal of this work is to automate this process of adding annotations describing the parallelism in the program.
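The role of the two combinators can be illustrated with a minimal Haskell sketch. It is not code from this thesis: it assumes a present-day GHC, where `par` and `pseq` are exported by `GHC.Conc`, and it uses `pseq` in the place of the seq combinator described above because `pseq` is the variant that guarantees the evaluation order. The names `nfib` and `parSum` are hypothetical and chosen only for illustration.

```haskell
import GHC.Conc (par, pseq)

-- Naive Fibonacci-style function, used only as a source of work.
nfib :: Int -> Integer
nfib n
  | n < 2     = 1
  | otherwise = nfib (n - 1) + nfib (n - 2)

-- `par` marks x as a candidate for parallel evaluation (a spark);
-- `pseq` makes the current thread evaluate y before combining the results.
parSum :: Int -> Int -> Integer
parSum a b = x `par` (y `pseq` (x + y))
  where
    x = nfib a   -- may be evaluated by another processor
    y = nfib b   -- evaluated by the current thread
```

Whether the spark for `x` actually becomes a thread is left to the runtime-system; without a parallel runtime the program simply computes the same result sequentially.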
One of the aspects of the dynamic behaviour is the granularity of a computation. By the granularity of a program expression we mean the computation costs of evaluating this expression. The inefficiency of fine-grained threads lies in the fact that they spend most of their computation on parallelism overhead like generating the thread or communicating with other threads. Historically, this has proven to be a severe problem for machines like ALICE (Darlington & Reeve 1981) and runtime-systems based on both graph-reduction (Hammond & Peyton Jones 1992, Hammond et al. 1994) and dataflow (Arvind & Nikhil 1990, Shaw et al. 1996). In order to mitigate this problem the programmer often tries to increase the granularity of the generated threads in the performance tuning stage of parallel program development. One goal of this thesis is to investigate how this process can be automated using statically-extracted information about the granularity of the generated threads. This information is used in the runtime-system to improve the performance of the parallel program without further information provided by the user.
This thesis studies granularity as one of the most important aspects of the dynamic behaviour of parallel program execution. However, it is, of course, not the sole important aspect of the dynamic behaviour. For example, the communication behaviour of the runtime-system determines the size of the graph structures that are sent within one unit of communication, determining the granularity of the communication. We have previously studied different fetching schemes in order to reduce the total communication overhead (Loidl & Hammond 1996b). Similarly, the scheduling mechanism is important to hide latency in a system involving a lot of communication. Data locality is an important property, too, which deserves further study.
1.3
Static Information about Dynamic Behaviour
One of the attractive features of functional languages for compiler optimisations is the fact that, due to their clear semantic properties, a lot of information about the program's dynamic behaviour can be inferred statically. The most important example of such a static analysis is strictness analysis, which detects expressions in a non-strict program that can be evaluated eagerly, and therefore more cheaply, without violating the semantics of the program. State-of-the-art compilers for non-strict functional languages like the Glasgow Haskell Compiler (GHC) (Peyton Jones et al. 1993, Peyton Jones 1996) rely heavily on the information provided by these analyses to perform a sequence of meaning preserving program transformations that improve the efficiency of the program.
Such statically-inferred information can also be exploited for parallel computation. However, because of the different dynamic behaviour of a parallel program additional information about the program execution is required. This thesis focuses on the aspect of granularity and presents a static granularity analysis, which is able to give an estimation of the computation costs of evaluating program expressions. Providing this additional information to the parallel runtime-system is an important step towards truly implicit parallelism for functional languages.
One important difference from classical analyses like strictness analysis, however, is the fact that granularity analysis has to infer information about an intensional property of the program execution. It can therefore only be correct with respect to an instrumented semantics, which itself models the property of interest. In this case computation costs are modelled as computation steps and inferred as an estimate for an upper bound. This indirect way of extracting information affects the quality of the result. However, in contrast to strictness analysis, wrong granularity information will not affect the semantics of the generated program but only its performance. Therefore it is possible to design an analysis that sometimes makes guesses about computation costs.
[Figure 1.1: Possible structure of a parallelising compiler. Labels in the figure: Compiler Front End, Program Transformations, Granularity Analysis, Other Analyses, Code Generation, Parallel executable, Runtime System, Spark Pool, Thread Pool, Granularity Improvement.]
Figure 1.1 summarises a possible overall structure of a parallelising compiler. The front end of the compiler translates the input program into an intermediate language, called L. This language is designed to be simple in order to ease the later analysis and program transformation stages that operate on it. The program transformation stages, which represent the main part of the compiler, perform program optimisations and make use of the information provided by various static analyses, such as granularity analysis, which provides information about the evaluation costs of program expressions. In the programming model used in this thesis parallelism annotations have to be present in the input program already. The program transformations can then add further information to the existing annotations. However, at this stage enough information is available to automatically insert parallelism annotations, if the goal is implicit parallelism. Finally, the code generation stage of the compiler produces a parallel executable. In the setup used in this thesis the parallel executable will be machine independent by using a runtime-system that hides details of the parallel architecture. As a further optimisation it would be possible to generate specialised code for particular parallel machines. The granularity improvement mechanisms that are developed in this thesis then make use of the granularity information attached to sparks and threads to make better scheduling decisions.
In summary, this thesis focuses on the parallel execution of non-strict functional programs that are annotated in order to expose potential parallelism. A parallel graph reduction model is used to implement the parallel execution of the program. In particular, this thesis tackles two parts in the structure shown in Figure 1.1: the granularity analysis and the granularity improvement mechanisms.
1.4
Contributions
This section gives a list of research contributions made in this thesis. A more detailed discussion of the contents of the contributions with a separation of the authorship of parts in the contributions is given at the end of the thesis in Section 7.2.
1. Parallelisation of large lazy functional programs (Loidl & Trinder 1997): This thesis demonstrates how to combine the advantages of lazy evaluation, in particular modularity, and of parallel evaluation, namely reduced runtime, on a large scale. In the parallelisation of a set of large algorithms the modularity provided by lazy evaluation helps to minimise the code changes required to improve the parallel performance of the program. The implementation includes both the design of parallel functional algorithms, such as LinSolv, as well as parallelising existing code, such as Lolita. With more than 47,000 lines of Haskell code Lolita is the largest existing parallel non-strict functional program. The programs demonstrate a crucially important aspect of strategic programming in the large, namely the separation of behavioural from algorithmic code.
2. Highly parameterised, accurate simulator (GranSim) (Hammond et al. 1995): The collaboratively developed GranSim simulator is of use for architecture-independent parallelisation as well as a testbed for the implementation of specific runtime-system features. Its robustness has been tested with large parallel applications. By being highly parameterised it is very flexible in the parallelisation and tuning of functional programs. By being accurate and closely related to the parallel GUM runtime-system it encourages prototype implementations of specific runtime-system features. GranSim has been integrated into an engineering environment for parallel program development in order to facilitate the development and performance tuning of large programs. A set of visualisation tools has proven crucial for understanding the dynamic behaviour of GranSim and GUM programs. Primary contributions to GranSim made in this thesis include the design of the communication system, the implementation of an idealised simulation, and the integration of GranSim into GHC.
3. Use and refinement of evaluation strategies (Trinder et al. 1998): This thesis contributes to evaluation strategies by adding strategic function application and by providing some of the first uses of strategies. The latter in part drove the design of the current version of strategies. Strategic function application has proven very useful in large parallel applications such as Lolita. In particular, it supports data-oriented parallelisation, which achieves high modularity by decoupling the definition of a function from the specification of its parallelism.
4. A static granularity analysis (Loidl & Hammond 1996a): A granularity analysis for inferring upper bounds of computation costs in a simple strict higher-order language, based on existing analyses (Hughes et al. 1996, Reistad & Gifford 1994), is presented. The analysis is formulated as a subtype inference system. A detailed outline of an implementation is given and an extended cost reconstruction algorithm is developed. The analysis has not been implemented, but measurements with a hand-analysed program allow some assessment of the importance of the inferred information.
5. Implementation and measurement of runtime-system features to improve parallel performance (Loidl & Hammond 1995): This thesis discusses several granularity improvement mechanisms the author has implemented in GranSim. Measurements studying their impact on the parallel performance of a set of test programs are provided. As a result moderate improvements in performance have been achieved for programs that are annotated with granularity information.
In addition to the major contributions above this thesis also makes less significant contributions towards a comparison of imperative and functional parallel programming by presenting results from parallel imperative implementations of three computer algebra algorithms in Section 4.7. Chapter 2 gives a detailed survey of several techniques for the parallel implementation of functional languages, going beyond the issues addressed in the main part of the thesis, and Sections 5.7 and 6.7 survey alternative approaches for improving granularity and for designing analyses extracting granularity information, respectively. In the examination of large programs other runtime-system aspects of the parallel execution of lazy functional programs have proven important. Different packing and rescheduling schemes have been implemented in GranSim, addressing the issue of efficient communication in a parallel graph reduction system (see Section 3.3.1). Details of the implementation and various measurements are presented elsewhere (Loidl & Hammond 1996b).
1.5
Thesis Structure
The structure of this thesis is as follows.
Chapter 2 gives a survey of various approaches towards a parallel implementation of functional languages. In particular, this chapter describes details of the parallel graph reduction model that is used in this thesis and its relationship to other execution models. The discussion distinguishes key runtime-system issues for parallel program execution: the evaluation model, the storage management model, the communication model, and the load distribution mechanism.
Chapter 3 gives a detailed description of the GranSim simulator that is developed in this thesis. GranSim is a flexible and accurate simulator for the parallel execution of Haskell programs. It supports both an idealised simulation and an accurate simulation modelling the characteristics of a particular architecture. In parallelising a set of large Haskell programs GranSim has been extensively used for developing and tuning the parallel code. In later chapters GranSim will be used as the platform for measurements on granularity.
Chapter 4 discusses the parallelisation of several large lazy functional programs. This chapter first presents evaluation strategies, which have been developed in a group effort. Then three programs are discussed in detail: a parallel Alpha-Beta search algorithm, highlighting the interplay between lazy and parallel evaluation; LinSolv, a symbolic computation algorithm using infinite intermediate data structures; and Lolita, a large natural language engineering system.
Chapter 5 focuses on the aspect of granularity for the dynamics of parallel program execution. For a set of programs the granularity of the generated threads is measured. It is shown that by increasing the granularity the performance of the programs can be improved. Three different granularity improvement mechanisms are discussed and measured: explicit thresholding, priority sparking, and priority scheduling.
Chapter 6 presents a static granularity analysis for a simple strict functional language. This analysis infers an upper bound for the number of computation steps needed to evaluate a program expression. The analysis is developed as an inference system together with an analysis for the size of program values. A detailed outline of a possible implementation is given, combining two existing analyses. Finally, a small test program is hand-analysed and the resulting annotated program is measured, showing some performance improvements.
Chapter 7 draws conclusions from the presented approach towards improving the performance of parallel lazy functional programs. It evaluates the importance of a structured approach towards program parallelisation, in particular for the performance tuning stage of parallel program development. It also identifies areas of future work, in particular for achieving the long term goal of truly implicitly parallel execution of functional programs.
The Parallel Implementation of
Functional Languages
Capsule
This chapter discusses several approaches towards a parallel implementation of functional languages. It starts with motivating the use of functional languages for parallel programming. Then it presents the basic ideas of popular models for the implementation of functional languages and evaluates how easily parallel evaluation can be expressed in these models. The main part of this chapter focuses on critical runtime-system issues and outlines several efficient implementation techniques. The following runtime-system issues are examined:
• the evaluation model,
• the storage management,
• the communication model, and
• load distribution.
In this thesis a parallel graph reduction model is used. The mechanisms for implementing the above runtime-system issues in this model are compared with possible alternatives. The overall discussion is based on an implementation on stock hardware rather than specialised hardware for functional programming.
2.1
Introduction
In assessing the quality of various kinds of programming languages the requirement of parallel execution usually complicates the language and therefore diminishes its value for large-scale program design. Not so with functional languages! The higher level of abstraction, compared to imperative languages, decouples the semantics of the language from operational considerations such as sequential or parallel evaluation. In particular the referentially transparent nature of functional languages allows various different ways of evaluating an expression. However, implementing an efficient system for parallel functional programming, consisting of an optimising compiler and a flexible runtime-system, has proven to be quite difficult.
Functional languages and their implementation have a rather long history. Whereas early models for implementing functional languages were defined on a rather low level, e.g. the SECD machine (Landin 1964), more recent models such as the graph reduction and the dataflow models present a far higher level of abstraction, allowing parallelism to be expressed naturally in this framework. However, when implementing such a model many runtime-system issues have to be tackled. The core of this chapter deals with the efficient implementation of these runtime-system issues on stock hardware. We do not consider special purpose hardware since the development of parallel hardware in recent years has shown a clear focus on general purpose machines.
The structure of this chapter is as follows. Section 2.2 discusses how functional languages can express parallelism in general, and which kind of model is used in this thesis. Section 2.3 outlines several models for implementing functional languages and evaluates how easily parallel evaluation can be expressed in these models. Section 2.4 focuses on key issues of the runtime-system for the efficient parallel implementation. Section 2.5 puts our model into the context developed thus far. Finally, Section 2.6 summarises aspects of our implementation model that have to be addressed in order to construct an efficient parallel evaluation of functional languages.
2.2
Principles of Parallel Functional Languages
This section discusses why functional languages are a good vehicle for writing parallel programs. It discusses some semantic issues that have an important impact on the parallel behaviour of the program, and connects them with runtime-system issues discussed in more detail in subsequent sections.
2.2.1
Why are Functional Languages Good for Parallelism?
With the advent of parallel machine architectures and their promise of far higher performance than is possible with conventional architectures, the design of languages for parallel computation has become an important research topic. A key aspect in the design of parallel languages is the way that the parallel execution is described. Imperative languages traditionally extend the sequential model with explicitly handled threads to describe independent pieces of computation and messages to communicate data between these threads. If these notions remain visible to the programmer he has to cope with issues like possible deadlocks in the computation, the partitioning of the computation into components, and the placement of these computations onto the processors of the parallel machine. This adds a new dimension of complexity to the design of a parallel algorithm and distracts from the mathematical properties of the algorithm, like its correctness.
Another approach, which restricts the generality of this message passing style of computation, has recently become extremely popular: synchronous parallel computing. The two best known models in this class are BSP (McColl 1996) and SPMD (Smirni et al. 1995). The idea in these models is to synchronise all communication in the system by either alternating between supersteps of computation and communication, or by using an implicit barrier for finishing all communication. This restriction enforces a certain structure of the parallel program. However, it also facilitates the performance evaluation of the program. Furthermore, the basic communication operation in these models, namely broadcast, can be implemented very efficiently on the latest parallel hardware. Here hardware realisation and programming model go hand in hand, similar to the success of RISC machines for sequential computation. However, usually the programmer still has to handle explicit threads and messages, which complicates the parallel program significantly compared to the sequential model. This thesis focuses on a higher-level approach of parallel programming, hiding most of these aspects in the runtime-system. It is, however, still possible to re-use existing lower-level code for specialised tasks.
[…] specifying what to compute without specifying a sequence of instructions describing how to compute the result. As a result functional languages are referentially transparent, which implies that independent parts of the program can be evaluated in parallel. Thus, the language does not necessarily need to be extended to deal with parallel evaluation. In principle, the problem of parallelising an algorithm can be reduced to the problem of reducing data dependencies in the program, something that can be done via source-to-source program transformations in much the same way as program optimisations in sequential compilers. Reasoning about the correctness of such transformations is no more difficult than for standard transformations used in sequential optimising compilers. Furthermore, parallelism based on functional languages yields a deterministic result, and it is guaranteed to be the same result as in the sequential execution. There is no danger of deadlock in such a model, unless a program runs out of resources.
Of course, the higher level of abstraction also imposes some overhead on the execution. Therefore, an optimised parallel algorithm using lower level features like an imperative computation model and message passing for communication will usually result in a better performance of the algorithm. However, especially for large programs it is extremely difficult to work at such a low level of abstraction.
2.2.2
The Role of Strictness
This section discusses fundamental semantic properties of functional programming languages and their impact on the sequential and parallel evaluation of such languages. It focuses on strictness as the most important of these properties.
Definition of Strictness
One important semantic property of a programming language is the strictness of user defined functions. A function is strict if its result is undefined whenever its argument is undefined. A non-strict language is a language that permits the definition of non-strict functions. More formally, a function $f$ is strict if and only if
$$f\ \bot = \bot$$
where $\bot$ represents an undefined result (e.g. caused by a failing or non-terminating computation). A discussion of strictness is given for example in (Field & Harrison 1988) [Chapter 4].
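The definition can be made concrete with a small Haskell sketch. The functions `inc`, `constOne` and the value `bottom` are hypothetical names introduced only for this illustration; `undefined` stands in for an undefined result, and a non-terminating computation would serve equally well.

```haskell
-- A representative of an undefined result.
bottom :: Int
bottom = undefined

-- Strict: the result needs the value of x, so applying inc to an
-- undefined argument yields an undefined result.
inc :: Int -> Int
inc x = x + 1

-- Non-strict: the argument is never inspected, so the result is
-- defined even when the argument is undefined.
constOne :: Int -> Int
constOne _ = 1
```

In a non-strict language `constOne bottom` evaluates to 1, whereas evaluating `inc bottom` fails. A strict language would evaluate the argument first and fail in both cases.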
One important advantage of non-strict over strict parallel languages is the ease of expressing producer/consumer parallelism in the former. In particular the coroutine nature of lazy evaluation avoids a barrier synchronisation between the producer and the consumer process. Following the terminology of Goldberg (1988a) this means that it is easy to express vertical parallelism, i.e. parallelism between a function and its argument, in a non-strict language. In contrast, strict languages tend to rely more on horizontal parallelism, i.e. parallelism between different arguments, which evaluates the arguments of a function in parallel. It should be noted that this form of parallelism can also be used in non-strict languages, namely for those argument positions in which the function is strict. A separate strictness analysis is needed to determine which arguments can be safely evaluated before the function itself is called.
In order to use a parallel function application, strictness information on user defined functions is needed, which ensures that creating parallel threads for each argument satisfies the non-strict semantics of the program. The resulting parallelism is called conservative parallelism, i.e. the values of all parallel threads are known to be needed in the computation. If non-strict arguments are evaluated in parallel, too, speculative parallelism is generated. Dealing with this kind of parallelism complicates the underlying evaluation model because it must be ensured that no process consumes all available resources and it should be possible to terminate processes. However, if this is guaranteed on runtime-system level then the parallel evaluation of all arguments in a function call satisfies the non-strict semantics, too. Although speculative parallelism is an important issue for parallel functional languages, it is not directly related to the main runtime-system aspect this thesis is investigating: granularity. Therefore, this thesis does not give an exhaustive survey of this particular branch of the field.
Evaluation Mechanisms
This section briefly discusses possible evaluation mechanisms for functional languages. These definitions build on top of the notion of reduction in the lambda-calculus (Church 1941) and delta-reduction for built-in rules like basic arithmetic. The terminology of this chapter follows (Field & Harrison 1988) [Chapter 6].
Definition 1 (redex) A redex (reducible expression) is an expression that can be reduced according to the rules of lambda-calculus or delta-reduction.
Intuitively, a redex is an expression that can be immediately evaluated. To be more precise about the degree of evaluation several different normal forms can be distinguished.
Definition 2 (weak head normal form) An expression is in weak head normal form (WHNF) if, and only if, it is a constant or if it is of the form
$f\ e_1 \ldots e_n$, for some $0 < n <$ arity of $f$
where f is either a data constructor or function (primitive or user defined).
Intuitively, evaluating an expression to weak head normal form means evaluating only the top level constructor. The expressions $e_1 \ldots e_n$ may still contain redexes.
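As an illustration of this intuition, the following Haskell sketch uses the standard `seq` function, which evaluates its first argument to weak head normal form only. The names `partial`, `whnfOnly` and `spineLength` are hypothetical and introduced just for this example.

```haskell
-- A cons cell whose head and tail are both undefined. The top level
-- constructor (:) is known, so the expression is in WHNF although its
-- components could not be evaluated any further.
partial :: [Int]
partial = undefined : undefined

-- seq forces partial to WHNF only, so the undefined components are
-- never touched and the string is returned.
whnfOnly :: String
whnfOnly = partial `seq` "forced to WHNF only"

-- length evaluates the spine of the list but none of its elements.
spineLength :: Int
spineLength = length [undefined, undefined, undefined :: Int]
```

Evaluating `partial` to normal form, for example by printing it, would fail because its components are undefined.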
Definition 3 (normal form) An expression is in normal form if it does not contain any redexes.
An expression in normal form matches the intuitive notion of a value in the language. In an expression that is not in normal form, the leftmost redex is the redex textually to the left of all other redexes and an outermost redex is a redex not contained in another redex. Based on these definitions and the two normal forms above it is possible to specify the reduction order yielding the two main evaluation mechanisms used in this thesis.
Definition 4 (eager evaluation, call-by-value) An eager evaluation mechanism chooses in every reduction step the leftmost innermost redex and reduces it to weak head normal form.
Definition 5 (lazy evaluation, call-by-need) A lazy evaluation mechanism chooses in every reduction step the leftmost outermost redex and reduces it to weak head normal form. When substituting expressions for arguments no expression is duplicated, but they are shared in the reduced expression.
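The difference between the two mechanisms can be observed in Haskell, which is lazy by default. The sketch below is only an illustration: the names `first`, `lazyResult`, `eagerArg` and `double` are hypothetical, and the standard operator `$!` is used as an approximation of call-by-value for a single argument, since it forces the argument to weak head normal form before the call.

```haskell
-- first never inspects its second argument.
first :: a -> b -> a
first x _ = x

-- Call-by-need: the second argument is not needed, so it is never
-- evaluated and the call returns normally.
lazyResult :: Int
lazyResult = first 42 (error "never evaluated")

-- ($!) evaluates the argument to WHNF before applying the function,
-- imitating eager evaluation of this one argument. Using it on the
-- failing argument above would make the whole call fail.
eagerArg :: Int
eagerArg = first 42 $! length [1 .. 1000 :: Int]

-- Sharing: both occurrences of x refer to the same argument
-- expression, which is evaluated at most once.
double :: Integer -> Integer
double x = x + x
```

Here `eagerArg` does some unneeded work before returning 42, which is the price eager evaluation pays for arguments that turn out not to be used.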
The evaluation transformer (Burn 1987, Burn 1991b) approach for automatic parallelisation defines a whole set of such evaluation mechanisms, which are tuned to the strictness of the result that should be computed in the given context. It uses detailed strictness information obtained by a sophisticated strictness analysis to determine, given the demand on an expression, how far the components of the expression have to be evaluated. Thus, all components can be safely evaluated in parallel to the degree determined by the evaluation transformer. However, this requires the generation of several variants of the code for each function, specialised to the particular context in which it is used. This approach has been used by Burn (1991a), in the distributed-memory HDG machine (Kingdon et al. 1991), in the PAM machine (Loogen et al. 1989), in Rushall's parallel implementation of the Spineless G-machine on top of a virtual shared-memory KSR1 machine (Rushall 1995), and in the shared-memory EQUALS system (Kaser et al. 1997).
Beyond Strictness
In order to preserve the semantics of the program, strictness information is needed for implicit parallelisation to decide which arguments can be safely evaluated in parallel. However, more information about dynamic properties of the program is useful in order to extract efficient parallelism. In particular, granularity information, i.e. information about the size of a computation, is needed in order to decide whether it is worth paying thread creation and synchronisation overhead for computing an expression in parallel. This question is discussed in detail in later chapters. Chapter 5 shows that too fine a granularity can degrade parallel performance and develops runtime-system mechanisms to increase granularity. Chapter 6 presents a granularity analysis for a simple strict, higher-order language to estimate the costs of an evaluation.
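A simple way of acting on granularity information is a cut-off below which no parallelism is created. The following Haskell sketch illustrates the idea only; it is not one of the mechanisms developed in this thesis. It assumes that the input size `n` is an acceptable proxy for the computation cost, and the names `nfib`, `parFib` and the parameter `threshold` are hypothetical. `par` and `pseq` are taken from `GHC.Conc` in a present-day GHC.

```haskell
import GHC.Conc (par, pseq)

-- Sequential version, used below the cut-off.
nfib :: Int -> Integer
nfib n
  | n < 2     = 1
  | otherwise = nfib (n - 1) + nfib (n - 2)

-- Parallel version: sparks are only created for calls whose input
-- exceeds the threshold, so that every spark represents enough work
-- to outweigh the thread creation and synchronisation overhead.
parFib :: Int -> Int -> Integer
parFib threshold n
  | n < 2          = 1
  | n <= threshold = nfib n                      -- too small: stay sequential
  | otherwise      = l `par` (r `pseq` (l + r))  -- large enough: expose parallelism
  where
    l = parFib threshold (n - 1)
    r = parFib threshold (n - 2)
```

A larger threshold yields fewer but coarser-grained threads. The result is independent of the threshold; only the performance changes.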
2.2.3
Language Support for Parallel Programming
The previous section has shown that it is possible to automatically parallelise a functional program by executing all strict arguments of a function call in parallel. Sharing and granularity information, if available, can be used to determine whether it is worthwhile creating a thread for a computation.