2. Parallel Programs and Tuning
2.2 Expressing Parallel Programs
This section describes an abstract and simplified programming environment for programs which execute on DM MIMD systems. The features described are a subset of the features present in many real programming environments, so programs described in terms of this environment can be implemented within real environments. However, real programming environments are often much richer in features, so the converse does not necessarily hold. This abstract environment is presented in order to avoid confiision as to terminology, since definitions and terms vary between proprietaiy environments, and to avoid the loss of generality which might arise from adopting a particular environment for this research.
Processors each have exclusive access to a local memory and are connected via links, which are dedicated point to point, bi-directional connections between processors, by which data can be transferred from the local memory of one processor to the local memory of another. A program consists of a set of tasks, each of which supports a single instruction stream. Each processor supports one or more tasks. Tasks communicate by sending messages. The effect is to copy data fi’om the memory buffer of one task to a memory buffer of the same size on another task. Messages are transmitted over channels, which are directed, dedicated, point-to-point links between tasks. A channel may be implemented using no links, if the tasks reside on the same processor, or using one or more links, if tasks are on different processors. Multiple channels may contend for the use of a given link.
A task graph diagram depicts labelled tasks and their channel connections. A processor topology diagram represents the labelled processors and their connections via bi-directional links. Tasks will be labelled alphabetically and processors will be labelled numerically. It is assumed for the purposes of this dissertation that channels and links are unique; they will be referred to in terms o f their source and destination - e.g. AtoB for a channel and lto2 for a link. A task mapping file defines how tasks are mapped onto processors. A channel mapping file defines how channels are mapped onto links. These
concepts are illustrated in Figure 1. In practice, the channel mapping is often defined in terms of tables in the router, rather than as a separate file.
task graph
processor
topology
diagram
task mapping file
task A on processor 1
task B on processor 2
task C on processor 3
task D on processor 4
channel mapping file
AtoD via lto2 & 2to4
BtoA via 2tol
DtoB via 4to2
DtoC via 4to2, 2tol & ItoS
Key
processor
links
task
channel
Figure 1 : A Parallel System
It is now necessary to explain some of the mechanism by which channels are implemented Messages are communicated as a result of a number of packets being exchanged by the tasks involved. The details of this depend on the system software in question. Typically, the sending task first receives a packet fi’om the sender to indicate that the receiver is ready and then transmits a series of packets of data to the receiver. In sending the first packet, the task which is to receive the message indicates to the sending task that there is a block of memory ready to receive the data. Data is sent as a stream of packets in order to ensure that the communication bandwidth is shared effectively between the channels, even when large messages are sent. Where messages are routed across the network of processors, activity may be generated on intermediate processors and links. This processor activity is required to look up in a table which link the packet must be transmitted upon and to manage the transmission. This is illustrated in Figure 2.
packet
to show
that the
buffer is
available
receiver
processor
sender
sender bufferlink
packets
of data
receiver bufferlocation of
activity caused
by a single
data packet
Figure 2 : Messages, Routing and Packets
A task is defined to have a single instruction stream. It is assumed that its progress must only be delayed by message passing or by contention for processor time with other tasks on the same processor. In particular, it must not be delayed by accessing devices such as semaphores or critical sections directly. Where it is required to access data or operating system resources which are shared, it is assumed that a single task acts as a server task, accessing the data or resource directly and communicating with other tasks through messages. It is assumed that all tasks terminate and that the program as a whole always terminates.
The experimental parallel programs and tools described here have mostly been developed using INMOS’s™ ANSI C Toolset™. This is a C compiler and development environment with libraries to support parallel programming and facilities for configuring software to run on a defined network of processors. The compiled and configured software was executed on a Parsys Supemode™ [Parsys89], which can connect its collection of transputers™ [Transputer88] in a user-definable network. The reason for this choice was the availability of the Supemode™ machine and of the INMOS™ ANSI C Toolset™. The performance of the T800 transputers™ used in this research has now been surpassed by that of more recent processors. However, this does not diminish their usefulness as a low cost vehicle for research into parallelism, since they represent, for experimental purposes, platforms offering considerably higher performance. These include the transputer’s successor, the T9000™, the Texas Instruments™ TMS320C40™ series of processors and the Analogue Devices ADSP21060™ series, also known as the SHARC™. Similar system software environments include 3L Parallel C™, PAROS [Lill92] and Helios™ [Helios91].