1·1·1 The Problem
There is a fundamental desire to obtain computationally derived results as fast as possible. There are two clear approaches to satisfying this desire: build a bigger, faster computer that is capable of solving the task at hand, or combine existing computers so that, through cooperation, they form a larger virtual computer capable of solving the problem more quickly. Given the plethora of computers already in existence in homes, offices, and laboratories, not to mention the relatively low cost of a ‘typical’ computer, there is little doubt that the financial cost of combining a number of existing machines is far outweighed by the expense of purchasing a single super-computer of significant complexity.
If the utilisation of a virtual computer with multiple processing elements¹ is to be explored, schemes must be developed to identify program components that can be executed by independent processing elements. These components must be delivered to the processing elements in a timely fashion, and the intermediate results must be combined to provide superior throughput. Each aspect of this process is examined in this document, with the goal of delivering a system that enables the parallel² evaluation of programs in a manner that minimises processing overhead and hence execution time.
¹ The term processing element is used to represent a processor capable of general computation implemented in hardware.
1·1·2 The Solution
1·1·2·1 Philosophy
Networked computing potentially provides very high computational power. Today, the typical desktop machine can perform computation at significant speed, and many computers are networked. Linking these computers more closely to provide a larger, virtual computer is both desirable and possible. This has been attempted at a number of levels, including Wisdom [Murray 1990], which simulates the virtual computer at the operating system level, and GUM [Hammond, Loidl, Mattson, Partridge, Peyton Jones, and Trinder unpub. a; Trinder, Hammond, Mattson, Partridge, and Peyton Jones 1996; Hammond, Mattson, Partridge, Peyton Jones, and Trinder 1995; Dermoudy 1996b], which simulates the virtual computer at the function call and arithmetic expression level. Efficiently utilising the power of all available processing elements, however, is complicated and fraught with problems. The parallelism available in an algorithm must be identified, extracted, and exploited for maximum benefit. Unfortunately, this may give rise to delays due to communication latency, and processing elements will sit idle if the amount of parallelism extant in the problem is less than the number of available processing elements. A major problem is identifying the computation that a processing element may undertake during periods of inactivity.
1·1·2·2 Language
For maximum application, general-purpose computers should be used for general-purpose computing. It has long been argued [Backus 1978] that imperative programming is an inappropriate paradigm for parallel computation. The bottleneck of centralised state in conjunction with the presence of side-effect laden statements seriously constrains the identification and isolation of the potential parallelism inherent in many algorithms. The functional programming paradigm, conversely, is free of side-effects, is not restricted by the bottleneck of centralised state, and hence supports the flexible evaluation of sub-expressions in any order; indeed the Church-Rosser property, that evaluation of program sub-parts in any order yields a unique answer, holds [Barendregt 1984].
² The terms concurrent and parallel will be used consistently with the definitions given in [Hammond and Michaelson 1999]. Concurrent evaluation involves multiple independent processes working on different activities simultaneously. Parallel evaluation involves cooperating processes working on a single activity.
Functional programs inherently possess many opportunities for parallel evaluation and are ideal tools for programming multicomputers. The programmer is relieved of the burden of specifying parallelism, as this can be inferred by the compiler [Peyton Jones 1989]. Opportunities for parallel evaluation include the elements of an arithmetic expression, the parameters of functions, the alternative branches of conditionals, and the elements of finite or infinite lists.
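The order-independence that makes these opportunities available can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the system described in this document; the functions `square`, `cube`, and `add` and the three evaluation strategies are hypothetical names introduced here. Because the functions are free of side-effects, evaluating the two arguments left to right, right to left, or simultaneously yields the same answer. Python threads are used only to show that the arguments may be evaluated independently; they are not intended to demonstrate a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    """Pure function: the result depends only on the argument."""
    return x * x


def cube(x: int) -> int:
    """Pure function: the result depends only on the argument."""
    return x * x * x


def add(a: int, b: int) -> int:
    """Combine the two independently evaluated arguments."""
    return a + b


def eval_left_to_right(x: int, y: int) -> int:
    left = square(x)   # first argument evaluated first
    right = cube(y)
    return add(left, right)


def eval_right_to_left(x: int, y: int) -> int:
    right = cube(y)    # second argument evaluated first
    left = square(x)
    return add(left, right)


def eval_in_parallel(x: int, y: int) -> int:
    # Each argument is handed to its own worker; neither depends on the other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(square, x)
        right = pool.submit(cube, y)
        return add(left.result(), right.result())
```

All three strategies return the same value for the same inputs, which is the property a compiler or runtime system relies upon when it chooses to evaluate function parameters in parallel.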
1·1·2·3 Architecture
Given a parallel machine, the concurrent threads making up the program can be distributed to all the available processing elements. In an environment in which autonomous processing elements possessing local memories are connected by a network, two requirements follow. First, values and program fragments must be communicated between the processing elements in order to execute the program. Second, there must be a runtime system that allocates these fragments to the processing elements to ensure the evaluation of the program and the provision of the result; the algorithm that performs this allocation is termed a load distribution algorithm. In a shared-memory environment (one in which all processing elements have access to a global memory) such complexities are not present, but it has been shown [Goldberg 1988] that such architectures do not scale up adequately due to the increased contention for the memory bus.
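As a rough illustration of what a load distribution algorithm does, the following sketch assigns program fragments to processing elements using a simple greedy rule: each fragment goes to the element with the least work so far. This is an assumption made for the example, not the algorithm developed in this document; the function `distribute`, the fragment names, and the per-fragment cost estimates are all hypothetical, and communication latency is ignored.

```python
import heapq
from typing import Dict, List


def distribute(fragments: Dict[str, int], num_elements: int) -> Dict[int, List[str]]:
    """Greedily assign each fragment to the least-loaded processing element.

    `fragments` maps a (hypothetical) fragment name to an estimated cost.
    """
    if num_elements < 1:
        raise ValueError("at least one processing element is required")

    # Min-heap of (current load, processing element id).
    loads = [(0, pe) for pe in range(num_elements)]
    heapq.heapify(loads)
    allocation: Dict[int, List[str]] = {pe: [] for pe in range(num_elements)}

    # Place the most expensive fragments first; ties are broken by name.
    ordered = sorted(fragments.items(), key=lambda item: (-item[1], item[0]))
    for name, cost in ordered:
        load, pe = heapq.heappop(loads)
        allocation[pe].append(name)
        heapq.heappush(loads, (load + cost, pe))
    return allocation


example = distribute({"f1": 5, "f2": 3, "f3": 2, "f4": 2}, num_elements=2)
```

A static rule of this kind assumes the costs are known in advance; a runtime system of the sort discussed here would instead have to make such decisions as fragments are created.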
There is no need to pursue a single thread of execution in a parallel system. In fact, if the number of processing elements exceeds the number of threads requiring execution, it is advantageous to employ the excess processing elements on other work. Hence, the runtime system could speculate that certain fragments of the program may contribute to the overall outcome and migrate these speculative fragments to idle processing elements. Further, a priority scheme could be included which assigns a measure of ‘importance’ or ‘potential usefulness’ to each thread, indicating how closely it relates to the final program result. This scheme could also indicate a migration desirability that considers the amount of work contained within a thread and, additionally, an estimate of the thread’s data dependence.
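One way such a priority scheme might be organised is sketched below. Everything in it is an assumption made for illustration: the `Thread` record, its field names, the formula used for migration desirability (more work and less data dependence make a thread a better candidate for migration), and the rule of ranking by importance first are hypothetical and are not taken from the system described in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thread:
    name: str
    importance: float      # 1.0 = certainly needed, lower = more speculative
    work: int              # estimated amount of work in the thread
    data_dependence: int   # estimated number of remote values required


def migration_desirability(thread: Thread) -> float:
    """Hypothetical measure: favour large threads with few data dependencies."""
    return thread.work / (1 + thread.data_dependence)


def pick_for_idle_elements(threads: List[Thread], idle_count: int) -> List[Thread]:
    """Choose up to `idle_count` threads to migrate to idle processing elements."""
    ranked = sorted(
        threads,
        key=lambda t: (-t.importance, -migration_desirability(t), t.name),
    )
    return ranked[:idle_count]


candidates = [
    Thread("main", 1.0, 10, 0),
    Thread("spec_a", 0.6, 8, 1),
    Thread("spec_b", 0.6, 8, 7),
    Thread("spec_c", 0.2, 50, 0),
]
chosen = [t.name for t in pick_for_idle_elements(candidates, 3)]
```

In this sketch the thread known to be needed is chosen first. The two equally important speculative threads are then separated by migration desirability, so the one with fewer data dependencies is preferred.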
Flynn [Flynn 1972] identified four categories of computational hardware architecture: uni-processor machines, pipeline machines (where a single data stream flows synchronously through an array of processors), massively parallel architectures (where multiple processors synchronously execute the same instruction on different data values), and flexible parallel architectures (where multiple processors can execute differing instructions on independent data values). More will be said about each architecture in Chapter Two; suffice it to say that the last category is the only one capable of general-purpose parallel programming.
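The four categories can be summarised by whether a machine has one or many instruction streams and one or many data streams. The sketch below encodes that reading using the customary abbreviations SISD, MISD, SIMD, and MIMD. Pairing each abbreviation with the descriptive names used above is an assumption of this example (in particular, treating pipeline machines as MISD), and the `classify` function is a hypothetical helper.

```python
from enum import Enum


class FlynnClass(Enum):
    SISD = "uni-processor machine"
    MISD = "pipeline machine"                 # assumed pairing
    SIMD = "massively parallel architecture"
    MIMD = "flexible parallel architecture"


def classify(instruction_streams: int, data_streams: int) -> FlynnClass:
    """Classify a machine by its number of instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one stream of each kind")
    many_instructions = instruction_streams > 1
    many_data = data_streams > 1
    if many_instructions and many_data:
        return FlynnClass.MIMD
    if many_instructions:
        return FlynnClass.MISD
    if many_data:
        return FlynnClass.SIMD
    return FlynnClass.SISD
```

For instance, a network of independent workstations, each running its own instructions on its own data, falls into the MIMD category under this classification.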